Search Engine War Blog

Google data suggests 95% of the web is junk

Monday, 14 January 2008

William spotted a very interesting post about the amount of data Google processes each day. The input data is clearly produced in bulk, which suggests it probably comes from crawler feeds and uploaded feeds (Base); the output, we guess, is what they actually use for their search results.

http://www.techcrunch.com/2008/01/09/google-processing-20000-terabytes-a-day-and-growing/

We can deduce (rather crudely, I admit) that Google thinks around 95% of internet content is not worth including in its index: spam, erroneous, duplicate or plain irrelevant. I don't know the ins and outs of Google's processes, so it's also possible that much of the input is HTML markup and other non-essential content. Even so, given the volume of data they take in, a lot gets dumped.

Based on William's estimates:
20,000 terabytes a day at an average item size of 40 KB = roughly 547 billion web pages/content items processed per day
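
As a rough check on that item count, here is a minimal Python sketch of the division. The post does not say whether the terabytes and kilobytes are decimal (powers of 1000) or binary (powers of 1024), so the sketch tries both. The function name and the `base` parameter are illustrative assumptions, not anything from the TechCrunch post. With these inputs it gives 500 billion items a day in decimal units and about 537 billion in binary units, the same ballpark as the figure used here.

```python
# Inputs taken from the post: 20,000 TB processed per day, 40 KB average item.
TB_PER_DAY = 20_000
AVG_ITEM_KB = 40


def items_per_day(tb_per_day, avg_item_kb, base):
    """Estimate items per day.

    base is an assumption about the units:
    1000 for decimal (TB = 1000**4 bytes), 1024 for binary (TB = 1024**4 bytes).
    """
    total_bytes = tb_per_day * base**4
    item_bytes = avg_item_kb * base
    return total_bytes / item_bytes


decimal_items = items_per_day(TB_PER_DAY, AVG_ITEM_KB, 1000)
binary_items = items_per_day(TB_PER_DAY, AVG_ITEM_KB, 1024)

print(f"decimal units: {decimal_items / 1e9:.1f} billion items/day")
print(f"binary units:  {binary_items / 1e9:.1f} billion items/day")
```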

2,000,000,000* images indexed
+ 30,000,000,000* web pages indexed
= 32,000,000,000* indexed content items
* Estimate based on 2006 figures http://en.wikipedia.org/wiki/Google_search

That would mean they ignored or filtered out
547,000,000,000 - 32,000,000,000
= 515,000,000,000 content items

Amount kept: 5.850%
Amount filtered or dumped: 94.150%

Compared with the MapReduce input and output data volume:
map input data (TB) 403,152
- reduce output data (TB) 14,018
= (TB) 389,134

Amount kept: 3.477%
Amount filtered or dumped: 96.523%

Mean: (94.150% + 96.523%)/2  =  95.34% (approx.)
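
For anyone who wants to re-run the sums with better estimates, here is a minimal Python sketch that recomputes both ratios from the figures quoted in this post: 547 billion items processed, 2 billion images plus 30 billion pages indexed, and the MapReduce input and output volumes. The function and variable names are illustrative only, and the inputs are the same rough estimates used above, so treat the result as a ballpark figure.

```python
def kept_and_dropped(total, kept):
    """Return (kept %, dropped %) for a given total and kept amount."""
    kept_pct = 100 * kept / total
    return kept_pct, 100 - kept_pct


# Estimate 1: content items processed per day vs. items in the index.
items_processed = 547e9           # William's estimate from the post
items_indexed = 2e9 + 30e9        # images + web pages (2006 figures)
items_filtered = items_processed - items_indexed

# Estimate 2: MapReduce data volumes quoted in the post, in TB.
map_input_tb = 403_152
reduce_output_tb = 14_018

kept1, dropped1 = kept_and_dropped(items_processed, items_indexed)
kept2, dropped2 = kept_and_dropped(map_input_tb, reduce_output_tb)
mean_dropped = (dropped1 + dropped2) / 2

print(f"items filtered out: {items_filtered:,.0f}")
print(f"estimate 1: kept {kept1:.3f}%, dropped {dropped1:.3f}%")
print(f"estimate 2: kept {kept2:.3f}%, dropped {dropped2:.3f}%")
print(f"mean dropped: {mean_dropped:.2f}%")
```

Averaging the two percentages gives each estimate equal weight, even though they measure different things (item counts versus data volume), so the mean is only a rough indicator.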

Comments

boris

How very sad. I knew it was bad just not this bad.
