The Redundancy Explosion

How Spam Is Burying Web Search and Hurting Small-Business Sites.

Search Engine Honesty


Porn sites and other sleazy operations did not have to employ rocket scientists to figure out that they could get better search engine exposure by hosting the same information on 150 different domain names, thereby potentially appearing 150 times as often in search results.  Domain names are very cheap, and multiple domain names can be pointed at the same server, so no additional server resources are required to have 149 “virtual” sites all drawing pages from one real site.
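The mechanics can be sketched in a few lines (the domain names and page content below are invented for illustration): one page store answers for every “virtual” hostname, so 150 domains cost nothing beyond registration.

```python
# Sketch: 150 "virtual" sites drawing pages from one real site.
# Domain names and page content are hypothetical.
DOMAINS = [f"hot-chicks-{i}.example" for i in range(150)]

PAGES = {"/index.html": "<html>same content everywhere</html>"}

def serve(host: str, path: str) -> str:
    """DNS points every domain at this one server; the response is
    identical bytes regardless of which hostname was requested."""
    if host not in DOMAINS:
        raise LookupError(host)
    return PAGES[path]
```

To a crawler, each domain looks like a distinct site with distinct URLs, so the same page can end up indexed 150 times.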


Because keywords in domain names count toward ranking, a page on “” ranks higher than an identical page on “” if the user was searching for “hot chicks.”  This spamming technique, gross duplication of data, particularly offended the search engines, since they could be tricked into indexing exactly the same data many times over.  The resources (disk space, bandwidth, etc.) required at the search engine to index a page are at least comparable to the resources required to host it.  Comparing every page to every other page (with roughly 10 billion pages, about 5 × 10^19 pairwise comparisons) is beyond even a search engine’s capability, although engines have developed extensive methods for detecting duplication.  Yet search engines censor (ban) small-business sites for duplication or for "not having original content."
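The engines' actual duplicate-detection pipelines are not public, but one standard family of techniques is shingling: hash every overlapping window of words and compare the resulting sets, so near-duplicates can be found without comparing full pages character by character.  A minimal sketch, not any engine's real implementation:

```python
import re

def shingles(text: str, k: int = 4) -> set[int]:
    """Hash every overlapping k-word window ("shingle") of the text."""
    words = re.findall(r"\w+", text.lower())
    return {hash(" ".join(words[i:i + k])) for i in range(len(words) - k + 1)}

def resemblance(a: str, b: str) -> float:
    """Jaccard similarity of the two shingle sets.
    1.0 means identical; values near 1.0 flag near-duplicates."""
    sa, sb = shingles(a), shingles(b)
    if not (sa or sb):
        return 0.0
    return len(sa & sb) / len(sa | sb)
```

Because each page reduces to a small set of hashes, an engine can index the hashes and look up likely duplicates directly instead of performing all-pairs comparisons.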


Major companies participate in the wholesale duplication of data and other anti-search-engine activities.  For example, R. R. Bowker produces a database containing information on 14.8 million book, audio, and video titles.  The data contains titles, authors, descriptions, photos, etc.  Amazon duplicates large portions of this data on its web site.  (So do legions of other online book stores.)  Amazon then “buys links” to these pages by paying “affiliate” site owners a commission to further duplicate book data and link to pages on Amazon.  (Amazon’s was one of the earliest and most successful affiliate programs.)  Google reports indexing 144 million pages at Amazon.com, not counting all those affiliate pages.  (No search engine has acted against, or even publicly criticized, Amazon or any other large company for practices such as duplication and buying links that their content guidelines condemn.)  However, search engines do ban small-business sites for these activities.


The recent surge in keyword-targeted advertising feeds the redundancy explosion.  Targeted ads, now the mainstay of search engine cash flow, actually work better on poor-quality sites.  Imagine the world’s best web site on fly fishing: there are fishing reports, photos, film clips, equipment reviews, message boards, links to other fly fishing sites, etc.  The site runs ads targeted to fly fishing, but few people click because there is so much good data on the site.  Those who do click are genuinely interested in the advertised products.  This is great for the advertiser but not so good for the web site owner and the targeted-ad network, which are paid by the click.


Now imagine the world’s worst fly fishing site: it consists only of pages containing targeted ads and blocks of keyword-rich, search-engine-engineered gibberish text (and perhaps no outgoing links, since those reduce rank).  There is literally nothing to do on this site except click on the targeted ads.  This is great for the site owner and the advertising network but maybe not so great for the advertiser.  Imagine the difference in development and maintenance costs between these two sites.  Picture the disgust of legitimate web site developers (and search engine users) as they discover how badly they have been had.


Where do spammers get good, keyword-rich, engine-optimized text for their pages?  Easy: they steal it.  A spammer can write a program that takes a keyword, robotically runs a search on a major engine, and gets back a list of top-ranked sites.  The robot then visits those sites and steals “top-ranking” text to incorporate into spammer pages.  Such a robot, chomping away for a few days, can easily generate 100,000 web pages about 100,000 different keywords, using the search engines as instruments of their own destruction.  No manual labor or costly content required.  These “scraper” sites are everywhere.  Legitimate site owners can find bits and pieces of their hard work on thousands of other people’s web pages.  Some spammers skip the last step and merely build a text block from the search engine results themselves; the result lines are excerpts that include the keywords and are therefore already keyword-rich text.
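The shortcut variant can be sketched as follows (the snippets here are canned stand-ins; a real scraper would fetch live result pages, which this sketch deliberately does not do):

```python
# Sketch of a "scraper" page generator. scrape_snippets is a hypothetical
# stand-in for querying a search engine and collecting result excerpts,
# which already contain the keyword surrounded by rank-tested text.
def scrape_snippets(keyword: str) -> list[str]:
    canned = {
        "fly fishing": [
            "fly fishing reports, photos and film clips from top rivers",
            "expert fly fishing equipment reviews and message boards",
        ],
    }
    return canned.get(keyword, [])

def build_spam_page(keyword: str) -> str:
    """Glue the stolen excerpts into one keyword-rich block,
    ready to be wrapped in targeted ads."""
    return " ".join(scrape_snippets(keyword))
```

Run once per keyword over a 100,000-word list and the "site" writes itself, which is exactly why the output is worthless to a human visitor.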


A legitimate store site owner can wake up at the beginning of the Christmas season and find that he has been wiped out because his rank has dropped from 7 to 300.  The search engine changed its ranking algorithm in an “effort to control spam” and the store owner has been caught in the crossfire between the search engine and the spammers.  For the spammer this is no problem.  He just reruns his program and builds 100,000 new pages based on the current top-ranked text. If the site is banned, he just starts over with a new domain name. Perhaps he uses a "previously-owned" domain name that has residual traffic and status from its previous life.


Search engines, deluged with a continuous flood of new spam sites using progressively more sophisticated technology, are caught in a cost squeeze.  They can't afford more than minimal per-site effort to review and otherwise handle small-business web sites when deciding which ones to ban outright.  Google is more affected by this problem than the other major search engines because it bans several times as many sites as they do.  Google has recently changed its policy for re-reviewing small-business sites that complain about banning (another per-site expense): it now accepts applications only from sites that stipulate in advance that they were guilty of deceptive practices and have corrected the site.  Small-business sites deleted by mistake or for editorial reasons become collateral damage in the spam wars.


Copyright © 2006 - 2009 Azinet LLC