Search Engine War Blog

Google indexes its own duplicate content

Wednesday, 22 November 2006

Google has a massive flaw in its own robots.txt and domain set-up: the Google directory and other content can be crawled at a variety of other Google subdomains and domains, creating a vast array of permutations of duplicate content.

For example:

http://groups.google.com/Top/
http://news.google.com/Top/
http://images.google.com/Top/

At first we thought it was only Yahoo! picking this up:
http://siteexplorer.search.yahoo.com/search?ei=UTF-8&p=news.google.co.uk%2FTop

But Google themselves seem to be indexing their own erroneous pages:
http://www.google.co.uk/search?hl=en&q=news.google.co.uk%2Ftop

Ooh, that's scary, and hopefully not intentional:
http://news.google.fr/googlebooks/scarystories/

This is bad because Google sensibly has a very strong policy against duplicate content, yet is unwittingly allowing its own duplicates to be fully accessible and indexable. According to the last modified date on the robots.txt, it has been this way since at least 09 November 2006 22:49:53. Whether it was different before then I am not sure, but I suspect not.

The further you dig, the more you find that the same robots.txt file, with the same last modified date, is being used on almost all Google domains and subdomains except the advertising network parts. That suggests a common platform for load balancing and query handling throughout the entire global search platform, which must make it a nightmare to organise which parts of the Google network can be indexed and which cannot, hence this mess with duplicate content.
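To make the shared-robots.txt problem concrete, here is a minimal Python sketch. The robots.txt rules and the `crawlable_hosts` helper are hypothetical illustrations, not a copy of Google's actual file; the point is only that when every host serves the same rules, any path the file does not disallow is crawlable on every host.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules standing in for a robots.txt shared by every host.
# Note that /Top/ is not listed, so nothing blocks it anywhere.
SHARED_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /groups
"""


def crawlable_hosts(robots_txt, hosts, path, agent="*"):
    """Return the hosts on which `path` may be fetched under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # The same rules apply to every host, so the answer depends on the path alone.
    return [h for h in hosts if parser.can_fetch(agent, f"http://{h}{path}")]


hosts = ["groups.google.com", "news.google.com", "images.google.com"]
print(crawlable_hosts(SHARED_ROBOTS_TXT, hosts, "/Top/"))
print(crawlable_hosts(SHARED_ROBOTS_TXT, hosts, "/search"))
```

Under these assumed rules, /Top/ comes back as crawlable on all three hosts, while /search is blocked on all of them. One file cannot say "index this here but not there".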

One of the other sites that Google seems to want fully indexed is finance.google.com, so it doesn't make an appearance in the robots.txt file, which is fine. However, because this is a recent-ish launch, it seems someone thought about the problem and either ran around the other teams to get them to change their code or tweaked the Netscaler rules to put in redirects to finance.google.com when the /finance directory is used straight after the domain. They didn't get all the subdomains, though, and /finance can still be fully indexed at either www. or finance., and maybe some others.

Matt, perhaps you want to get someone to draw up the matrix of permutations of subdomains vs directories, showing what should be indexed and from where, and then plug the holes with redirects on the Netscalers (it is Citrix Netscalers you use, right?). I know it's what you'd ask of us.
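As a rough sketch of what such a matrix might look like, the Python below crosses a list of hosts with a list of directories and marks each combination as either served in place or redirected. The host list, the `CANONICAL` mapping, and the `build_matrix` helper are all assumptions made up for illustration; which host should really own each directory is Google's call, not mine.

```python
from itertools import product

# Hypothetical inputs: a handful of the hosts and directories mentioned above.
HOSTS = [
    "www.google.com",
    "news.google.com",
    "groups.google.com",
    "images.google.com",
    "finance.google.com",
]
PATHS = ["/Top/", "/finance"]

# Assumed canonical home for each directory (illustrative only).
CANONICAL = {"/Top/": "www.google.com", "/finance": "finance.google.com"}


def build_matrix(hosts, paths, canonical):
    """Map every (host, path) pair to 'serve' or to a redirect target."""
    matrix = {}
    for host, path in product(hosts, paths):
        home = canonical[path]
        if host == home:
            matrix[(host, path)] = "serve"
        else:
            matrix[(host, path)] = f"redirect -> http://{home}{path}"
    return matrix


for (host, path), action in build_matrix(HOSTS, PATHS, CANONICAL).items():
    print(f"{host}{path}: {action}")
```

Every cell that is not "serve" is a hole to plug with a redirect rule, and any new subdomain or directory just adds a row or column to check.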
