Search Engine War Blog

Google data suggests 95% of the web is junk

Monday, 14 January 2008

William spotted a very interesting post about the amount of data Google processes each day. The input data is clearly mass-produced, which suggests it comes mostly from crawler feeds and uploaded feeds (Google Base); the output, we guess, is what actually ends up in their search results.

We can deduce (rather crudely, I admit) that Google considers around 95% of internet content not worth including in its index: spam, erroneous, duplicate, or simply not relevant. I don't know the ins and outs of Google's processes, so it's also possible that much of the input data is HTML markup and other non-essential content. Even so, given the volume of data going in, a great deal gets dumped.

Based on William's estimates:
20,000 terabytes a day at an average item size of 40 KB ≈ 547 billion web pages/content items processed per day

2,000,000,000* images indexed
+ 30,000,000,000* web pages indexed
= 32,000,000,000* indexed content items
* Estimates based on 2006 figures

That would mean they ignore/filter out:
547,000,000,000 - 32,000,000,000
= 515,000,000,000 content items

Amount kept: 5.850%
Amount filtered or dumped: 94.150%

Compared with the MapReduce input and output data volumes:
map input data: 403,152 TB
- reduce output data: 14,018 TB
= 389,134 TB

Amount kept: 3.477%
Amount filtered or dumped: 96.523%

Mean: (94.150% + 96.523%) / 2 ≈ 95.34%
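The back-of-envelope arithmetic above can be reproduced in a few lines. A minimal sketch, with the caveat that every input figure is an estimate (William's daily-items guess and 2006 index sizes), not an official Google number:

```python
# Rough reproduction of the estimates above.
# All input figures are estimates, not official Google numbers.

ITEMS_PROCESSED_PER_DAY = 547e9   # ~20,000 TB/day at ~40 KB per item
IMAGES_INDEXED = 2e9              # 2006 estimate
PAGES_INDEXED = 30e9              # 2006 estimate

indexed_items = IMAGES_INDEXED + PAGES_INDEXED          # 32 billion
kept_by_count = indexed_items / ITEMS_PROCESSED_PER_DAY  # ~5.85%

MAP_INPUT_TB = 403_152            # MapReduce input data volume
REDUCE_OUTPUT_TB = 14_018         # MapReduce output data volume
kept_by_volume = REDUCE_OUTPUT_TB / MAP_INPUT_TB         # ~3.48%

filtered_by_count = 1 - kept_by_count    # ~94.15%
filtered_by_volume = 1 - kept_by_volume  # ~96.52%
mean_filtered = (filtered_by_count + filtered_by_volume) / 2

print(f"kept by count:  {kept_by_count:.3%}")
print(f"kept by volume: {kept_by_volume:.3%}")
print(f"mean filtered:  {mean_filtered:.3%}")
```

Either way you slice it, the two independent ratios land within a couple of percentage points of each other, which is what makes the ~95% figure at least plausible.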



How very sad. I knew it was bad, just not this bad.
