Search Engine Mechanics

This section describes how search engines work with emphasis on the aspects that are important to the problems of search engine censoring and bias. 

To understand the looming freedom-of-information problems, we need to review how search engines work.

 

First, some definitions:  A web page is all the information displayed in a single browser window, including text and graphics.  A site is all the pages hosted under a single host name (www.somedomain.com).  All the pages on a site can normally be found by following links starting at the home page.  Pages typically link to and from other pages on the same site, as well as to and from pages on other sites.

 

Search engines have spiders, robot programs that follow links from page to page and site to site collecting page information.  This data is then incorporated into a search index, a rapidly searchable representation of the page text.  Search engines typically store the entire text portion of the page in a cache to allow users to access the page if the original site is temporarily unavailable. Robots revisit pages periodically to update the index and add any new pages or new sites linked from previously indexed pages.  Anyone can now buy software enabling them to run their own robot on a home computer.  Robot traffic at a web site is now often comparable to human traffic.
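
To make page text rapidly searchable, the collected pages are typically folded into an "inverted index" that maps each word to the pages containing it.  The following is only a minimal sketch of the idea (the pages and URLs are made-up examples, not how any particular engine actually stores its index):

from collections import defaultdict

def build_index(pages):
    """pages: dict mapping URL -> page text collected by the robot."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)      # word -> set of pages containing it
    return index

pages = {
    "http://www.example.com/": "welcome to our example home page",
    "http://www.example.com/about.html": "about the example company",
}
index = build_index(pages)
print(index["example"])               # both pages contain the word "example"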

 

In order to be able to claim to have indexed more pages than brand “X”, some engines are now "indexing" just the URL of some pages.  A robot visits some other site, finds a link to a page on your site, and indexes just the link.  The robot hasn’t visited your site or indexed any of the text on the page.  This sort of “pseudo indexing” is essentially worthless.  There is very little chance that a search will find a page just from the URL or that a user would click from seeing only the URL without any descriptive information in a search result.  Some search engines refer to these "link only" entries as "supplemental results."

 

A search engine user enters words or phrases (the keywords). The engine then returns a search engine results page containing links to and brief descriptions of the pages containing the keywords. An extremely specific or obscure search might return only a few results, but typical searches return thousands of hits.  The search engine determines the order (rank) in which results are displayed using a ranking algorithm.  Rank is extremely important to webmasters since most searchers don’t look further than the first or second page of results (the top 20 results).

 

The ranking system combines information about relevancy and merit to determine where to rank a particular page when responding to a search for particular keywords.

 

Relevancy refers to all of the calculations and analysis performed by the engine to evaluate the relevance of each of the billions of web pages to the particular search terms.  Since these calculations must be performed for essentially every search in a fraction of a second, there are limits to how complex they can be.  The following are typical relevancy tests:

 

-Presence and number of keywords in the domain name; in the page title; in the page text

-Ratio of keywords to non-keywords in the text; title; page headings

-Proximity of keywords to each other

-Proximity of keywords to the top of the page

-Etc.

 

Each of the parameters has a certain optimum value and a certain emphasis or “weight” in the overall ranking scheme used by a given search engine.  Keywords in the domain name or page URL are typically more important than keywords in the title, which in turn are more important than keywords in the body text, and so forth.
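
As an illustration only, a weighted relevancy calculation of the kind described above might look like the following sketch.  The factors, weights, and density cap are invented for the example; no search engine publishes its actual values.

def relevancy_score(keyword, url, title, body):
    keyword = keyword.lower()
    score = 0.0
    if keyword in url.lower():
        score += 3.0                          # keyword in the domain/URL weighs most
    if keyword in title.lower():
        score += 2.0                          # then the page title
    words = body.lower().split()
    if words:
        density = words.count(keyword) / len(words)
        score += 20.0 * min(density, 0.05)    # keyword density in the body text, capped
    return score

print(relevancy_score("widgets",
                      "http://www.widgets-example.com/widgets.html",
                      "Example Widgets",
                      "we sell widgets and widget accessories"))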

 

Merit consists of all the analysis the engine does concerning the “quality” of a page without regard to relevance to any particular keywords.  Since this assessment need only be done occasionally, a search engine can afford to spend thousands or millions of times more computational effort on merit analysis.  The following types of site and page analysis could be used in determining merit:

 

-Total number of words in the page text; in the page title (very small or large is bad)

-Presence of typical low-quality signatures (black background, blinking text, unrequested music)

-Use of valid HTML (good)

-Presence of dead links (bad)

-Days since page last modified (zero or very large is bad)

-Presence of “default” title, no title, or “under construction” messages (bad)

-Depth of page relative to root (/index.html is better than /foo/bar/baa/goo/page77.html)

-Days the site (domainname) has been operating (age-of-site  -- more is better)
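
As a toy illustration of keyword-independent merit checks like those listed above, consider the following sketch.  The particular thresholds and point values are invented; the point is only that such checks are cheap, mechanical tests on the page and its URL.

from urllib.parse import urlparse

def merit_score(url, title, word_count, days_since_modified):
    score = 0.0
    score -= urlparse(url).path.count("/")     # deeper pages score worse
    if 3 <= len(title.split()) <= 12:          # very short or very long titles are bad
        score += 1
    if 100 <= word_count <= 5000:              # very small or huge pages are bad
        score += 1
    if 1 <= days_since_modified <= 365:        # never modified, or stale for years, is bad
        score += 1
    if "under construction" in title.lower():
        score -= 2
    return score

print(merit_score("http://www.somedomain.com/index.html",
                  "Somedomain Home Page", 400, 30))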

 

Search engine folks like to tell webmasters that a well-designed, well-organized, readable, and aesthetically pleasing page will (automatically) rank well.  Unfortunately, that is not actually true, and legitimate site developers have to continuously make tradeoffs between accommodating search engine ranking and accommodating the different needs of their human visitors.  Search engines are machines.  It is seldom, if ever, the case that a search engine’s relevancy tests coincide with good writing and pleasing layout for a given site – thus the compromise.

 

Unfortunately, this leads to a situation where the better ranking site is generally going to be of lower value from a human standpoint.  In a worst case scenario, we could imagine a page full of gibberish text that was computer generated to have exactly the right parameters and therefore ranked very highly even though it had no human value whatsoever.

 

Search engines therefore needed some merit factors that were not under the direct control of webmasters. These currently include Link Popularity and Site Popularity described below.

 

Static and Dynamic Web Pages

 

Web sites can have static or dynamic pages (or a combination).  Static pages, such as those created with an HTML editor like FrontPage, each exist on the web server as a file (or files).  The web server simply transmits the file to a user’s browser upon receipt of a request (click).  The page doesn’t change unless manually edited by the writer.  There are a finite number of static web pages.

 

Dynamic web pages are “custom” generated by a computer in real time in response to each request.  There are effectively an infinite number of dynamic pages.  For example, since you could search for a practically infinite number of search phrases, there are an infinite number of different possible returned search result pages.  Most large sites use dynamic pages.  Many dynamic pages contain essentially static information but are customized for individual users and are therefore slightly different each time they are accessed: “Hello Fred, today’s date is …, the weather in Peoria is …”.  Some sites use a “session ID” to track individual visitors as they move through the site.  This makes an essentially static page look dynamic because the URL looks something like www.somedomain.com/Foo/SessionID12976366378489/somepage.html and is different every time.  This could make a single page look like thousands or millions of different pages to a search engine.  Some dynamic page URLs are obvious (they contain a question mark (?)); others “masquerade” as static.

 

Needless to say, trying to determine which dynamic pages to index (or even which pages are dynamic) is a major problem for search engines.  A search engine could ignore a clearly identified session ID but might have problems with many other types of dynamic pages.
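
A search engine that wants to avoid indexing one page thousands of times has to normalize such URLs.  The sketch below strips an obvious session-ID path segment and the query string; the “SessionID” pattern is just the hypothetical format used in the example above, not a rule any engine is known to apply.

import re
from urllib.parse import urlsplit, urlunsplit

SESSION_SEGMENT = re.compile(r"/SessionID\w+", re.IGNORECASE)

def canonicalize(url):
    scheme, netloc, path, query, fragment = urlsplit(url)
    path = SESSION_SEGMENT.sub("", path)                # drop the session-ID segment
    return urlunsplit((scheme, netloc, path, "", ""))   # drop query string and fragment

print(canonicalize(
    "http://www.somedomain.com/Foo/SessionID12976366378489/somepage.html"))
# -> http://www.somedomain.com/Foo/somepage.html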

 

Link Popularity

 

Search engines fastened on link popularity as a solution for some of the problems described above.  If another site (another domain) had a link to a particular page, that was taken to mean that somebody other than the webmaster liked the page.  More links from different sites would be even better.  Link popularity therefore became a merit parameter in the ranking algorithms. 

 

This idea was attractive ten years ago but now appears somewhat naïve because it is ridiculously easy to cheat.  Sites known as “Free-For-Alls” (FFAs) appeared.  Anyone could list their site on an FFA, thus creating another incoming link.  To search engines, FFAs served no purpose except to “game” link popularity.  Worse, FFAs would accept automated submittal of links.  “Site submission” services appeared that would automatically submit your site “to 1,000 FFAs.”  Services appeared to manage, coordinate, or encourage the swapping of reciprocal links between sites.  Other services would automatically spider a web site to verify that it had actually installed and kept a reciprocal link.  Buying of links became endemic.  Many companies (including Fortune 500 companies) bought links through “affiliate programs” in which website owners would be paid commissions on sales that occurred via their link.  Services appeared to manage and consolidate smaller affiliate programs.

 

There are legitimate features of web sites that allow other webmasters to post a link to their site.  “Guestbooks” allow visitors to a site to post a short message about how they liked the site and include a link to the guest’s site.  There are thousands, maybe millions, of these guestbooks out there.  (Google reports 2.9 million pages containing the phrase “sign my guestbook.”)  Most allow automated submission.  There are millions of messages that read essentially “Nice site.  Here is a link to my site.”

 

There are also thousands of “forums”, “blogs”, and “message boards” that allow visitors to post a message including a link to a web site or even HTML content (allows multiple links).  Many allow automated submission.  Spammers could build robots that would search the web for such sites and submit messages containing links. A number of companies including most internet service providers also provide the ability for users to create their own free or no-additional-cost web pages which can contain links to other sites.  Directories provide subject oriented lists of links.  (Yahoo has the largest and best known directory (11 million pages indexed by Yahoo Search)).  Of course, anyone who controls two or more sites can arrange for “cross linking” to gain link popularity.  Some newer and more sophisticated forums, blogs, and message boards (and all the major search engines) now prevent automated submission by requiring submitters to visually interpret a picture of a “check code.”

 

Search engines consider “link farms” to be a form of “search engine spam” whose whole purpose is to “abuse” search engines.  They carefully avoid actually defining in any detail what constitutes a “link farm.”  FFAs are obviously “link farms.”  However, guestbooks, forums, directories, links pages, blogs, and message boards were provided by their webmasters as useful features for their users and are not seen by their webmasters as efforts to trick search engines.  Our studies show that search engines frequently penalize small-business sites containing these features.  Yahoo provides a directory, message boards and free web sites for its users so this is an apparent case of “do as I say, not as I do.” 

 

Google has a more sophisticated link popularity algorithm that includes some automated methods for assessing the quality of the site generating an incoming link.  Google warns webmasters that outgoing links to “bad neighborhoods” could result in decreased rank.  (They studiously avoid defining “bad neighborhoods.”)  Google’s scheme also penalizes a page for having outgoing links.  A page having many incoming links and few outgoing links must be “popular.”  A page having many outgoing links might be Google’s idea of a link farm.

 

Site Popularity Factor

 

Our data show that all the majors use site popularity (total site human visitors per day) as a merit factor in their ranking algorithms.  Mediametrix and Nielsen//NetRatings measure site traffic and provide the data to clients.  Google has a free browser tool bar that (depending on settings) reports to Google on every page visited by the user.  A million or so users running these tool bars allow Google to internally measure site popularity and even popularity of individual pages (as well as lots of information about those users and the visited sites and pages).  Alexa has a similar tool bar and provides traffic information on any web site. 

 

The site popularity factor results in a situation in which “the rich get richer and the poor get poorer.”  Sites with high traffic get higher ranking which leads to yet more traffic and yet better ranking.  Because of this circular situation, too much emphasis on site popularity results in the engine’s creating popularity, an obviously artificial situation.  Search engines must therefore limit the weight given to site popularity.  (This is a bigger problem for Google than for the other engines since Google has such a large search share.)  To the extent that engines use site popularity (or link popularity) in their ranking, they must necessarily decrease their emphasis on relevance.  Searches will therefore return more “popular” but less relevant results. 

 

If a site has been banned or affected by site-unique bias at one of the major search engines, its traffic due to referrals from that search engine will be drastically affected. Eventually traffic from other search engines will also drop. Site owners tend to feel that the reduction in traffic is due to something they did "wrong" since multiple engines have been affected. In fact a reduction in the other search engine referrals may be due to the original traffic reduction compounded by the effect of the site popularity factor.

 

PageRank

 

PageRank™, named after Google cofounder Larry Page, is Google's merit system and is used in the ranking process for search results.  Unlike the other search engines, Google also displays the PageRank (as a number between 0 and 10) of any page displayed in the browser to users of the Google toolbar (if the PageRank feature is enabled).  In order to implement this, the toolbar sends a message to Google, containing the page URL, every time a new page is loaded into a user's browser.  Google's server then sends back the PageRank for the displayed page.  (Users are advised that there are privacy implications.)  Tracking data from the toolbar has many uses.  Google's version of the AOL/Netscape Open Directory displays sites in order of decreasing site PageRank.  Google also artificially (editorially) manipulates the PageRank of some sites.  There are legal issues associated with publishing false PageRank data, so Google may be decreasing its emphasis on publishing PageRank.
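
The published form of PageRank (Brin and Page, 1998) is a link-based score computed roughly as in the sketch below; Google's production version certainly differs in many details.  The tiny three-page link graph is a made-up example.

def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            for target in outgoing:                 # all link targets are in the graph here
                new[target] += damping * rank[page] / len(outgoing)
        rank = new
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(links))     # C, with two incoming links, collects the most rank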

 

It should be clear that search engines need to produce and maintain databases on sites as well as pages in order to be able to develop site-oriented merit factors such as age-of-site, site popularity, and site PageRank.

 

Robots and Spiders

 

Search engines have software “robots” that access web data for inclusion in their index.  A robot that follows links on a page to find other pages and sites is known as a “spider.”  The process of following links is known as “crawling” the web.  All the pages on a public web site should have links which, if followed, would eventually lead to all the other pages on the site.  If any page indexed by a search engine, anywhere, has a link to a page on your site, then eventually the spider could find and index all of your pages.
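
The crawling process itself can be sketched as a simple breadth-first traversal of links, as below.  This uses only the Python standard library and a crude regular expression for links; a real spider adds politeness delays, robots.txt handling, duplicate detection, and much more.

import re
from collections import deque
from urllib.parse import urljoin, urlsplit
from urllib.request import urlopen

HREF = re.compile(r'href="([^"#]+)"', re.IGNORECASE)

def crawl_site(seed, limit=50):
    """Follow links from the seed page, staying on the same host."""
    host = urlsplit(seed).netloc
    seen, queue = {seed}, deque([seed])
    while queue and len(seen) < limit:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue                     # unreachable page; move on
        for link in HREF.findall(html):
            link = urljoin(url, link)
            if urlsplit(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen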

 

The spider system needs a significant amount of artificial intelligence because there are many ways for spiders to chase their tails. All the majors operate many robots on different hosts (e.g. crawl-66-249-67-13.googlebot.com ) that operate simultaneously and at least somewhat independently. Merely coordinating these robots is a complex task. Google's system seems to be substantially smarter than the others. Even though some of our sites are visited less frequently by Google's spider Googlebot, Google results covering the sites are generally fresher and more comprehensive. The other spiders frequently visit the same page multiple times while ignoring other, equally important pages or otherwise perform inefficiently. Yahoo's spider Slurp sometimes visits the same file hundreds of times per day.

 

A search engine will not necessarily index all of the pages on a site.  The “depth-of-crawl” is determined by merit considerations such as site popularity or link popularity.  Likewise the frequency at which an engine’s robot returns to your pages can be determined by merit considerations as well as the robot’s determination of whether a page is likely to change based on past experience.

 

Webmasters can also submit a page or pages directly to any of the major search engines for indexing and starting the crawling process.  There have been persistent but unsubstantiated rumors that doing so can result in reduced rank or reduced depth-of-crawl compared to letting the search engine robot find your site “naturally” by following a link from some other site.  There are also reports that search engines may not index pages on a site that does not have any incoming links from other sites, even if the site is manually submitted.  Indexing of an individual page and rate at which a page is re-spidered may also depend on whether that particular page has an incoming link.  This is why even very large and popular sites (e.g. Amazon) solicit and need incoming links to internal pages.

 

Google and Yahoo also provide a service whereby a file containing a list of page URLs (even thousands of them), resident on the web server, can be registered with the search engine.  This might allow the engine to index pages more rapidly than the natural crawl process.  Google’s scheme also provides a way for the webmaster to define the relative importance and update frequency of pages.  Could sites be penalized for using this system relative to the “natural” approach?  Some webmasters claim that they were banned shortly after submitting a sitemap.
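
For reference, such a file uses the sitemap XML format (sitemaps.org), and generating one is straightforward.  The sketch below builds a minimal sitemap; the URLs, change frequencies, and priorities are only examples.

def make_sitemap(entries):
    """entries: list of (url, changefreq, priority) tuples."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for url, changefreq, priority in entries:
        lines += ["  <url>",
                  "    <loc>%s</loc>" % url,
                  "    <changefreq>%s</changefreq>" % changefreq,
                  "    <priority>%s</priority>" % priority,
                  "  </url>"]
    lines.append("</urlset>")
    return "\n".join(lines)

print(make_sitemap([("http://www.somedomain.com/", "daily", "1.0"),
                    ("http://www.somedomain.com/about.html", "monthly", "0.5")]))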

 

A “robots.txt” file in the root directory can be used to request that search engines not index particular directories in your site.  A “search engine friendly” site could do this to exclude material that changes too rapidly (news, weather) for indexing to be useful and to exclude material with no plausible public value (log reports, administrative pages, etc.).  Reducing the number of such pages indexed might conceivably increase the number of other, more useful pages indexed from your site.  
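
Well-behaved robots interpret robots.txt in a standard way; Python even ships a parser for it.  The sketch below checks two example URLs against a hypothetical robots.txt that excludes log, administrative, and weather directories.

from urllib import robotparser

robots_txt = """\
User-agent: *
Disallow: /logs/
Disallow: /admin/
Disallow: /weather/
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())
print(rp.can_fetch("Googlebot", "http://www.somedomain.com/products.html"))    # True
print(rp.can_fetch("Googlebot", "http://www.somedomain.com/admin/stats.html")) # False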

 

The robots.txt file can also theoretically be used to exclude material from indexing by a specific major search engine or engines while allowing indexing by other engines.  There is no clear legitimate reason for doing this and there are obvious “black hat” reasons, so doing it probably raises a “red flag” for search engines.  Be careful: errors in robots.txt files have caused many sites to be accidentally de-listed.

 

A “meta robots” tag in the header section of a page can be used to exclude indexing by all robots on a page by page basis.
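
On the indexing side, honoring that tag amounts to a simple check of the page header.  The following sketch looks for a noindex directive with a regular expression; a real indexer would use a proper HTML parser.

import re

META_NOINDEX = re.compile(
    r'<meta\s+name=["\']robots["\']\s+content=["\'][^"\']*noindex',
    re.IGNORECASE)

html = '<html><head><meta name="robots" content="noindex,nofollow"></head></html>'
print(bool(META_NOINDEX.search(html)))    # True -> this page should not be indexed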

 

Sleazy robots such as email address harvesters operated by email spammers may ignore robots.txt and meta-robots tags.  Major search engines might run occasional robot inquiries masquerading as browsers to check for sites that are cloaking.  Some also have robots occasionally request a file with a random name and check that the proper response (404 file not found) is returned.
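
The “random file name” check is easy to picture: request a path that cannot exist and confirm that the server answers 404 rather than serving some page anyway.  A sketch (the probe URL is generated; everything else is standard library):

import uuid
from urllib.error import HTTPError
from urllib.request import urlopen

def returns_proper_404(base_url):
    probe = base_url.rstrip("/") + "/" + uuid.uuid4().hex + ".html"
    try:
        urlopen(probe, timeout=10)
        return False                  # the server served a page for a nonsense URL
    except HTTPError as err:
        return err.code == 404        # proper "file not found" response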

 

Server Configuration Issues

 

Many web servers that answer to “www.somedomain.com” also answer to “somedomain.com”; that is, either form of the URL works and delivers the same page.  Many also answer to “XXX.somedomain.com”, where “XXX” can be anything.  If even one site links to you as “somedomain.com” instead of “www.somedomain.com”, that could potentially result in a search engine indexing the same pages multiple times.  This can apparently lead to censoring for “duplication” and could also result in diluted link popularity, reduced depth-of-crawl, or other problems.  Some experts therefore suggest that the server be set to redirect (response code 301, moved permanently) requests for “somedomain.com” or “XXX.somedomain.com” to “www.somedomain.com”.  However, many large business sites are reachable by either method (e.g. http://www.dmoz.org/ and http://dmoz.org/ ) and do not redirect.
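
The suggested redirect can be sketched as a tiny WSGI application (on a real site this would normally be done in the web server configuration instead; the host names are the same placeholders used above).

CANONICAL_HOST = "www.somedomain.com"

def application(environ, start_response):
    """Redirect any non-canonical host (somedomain.com, XXX.somedomain.com) to www."""
    host = environ.get("HTTP_HOST", "")
    if host != CANONICAL_HOST:
        location = "http://%s%s" % (CANONICAL_HOST, environ.get("PATH_INFO", "/"))
        start_response("301 Moved Permanently", [("Location", location)])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"canonical host"]

# e.g. wsgiref.simple_server.make_server("", 8000, application).serve_forever()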

 

There are reports that the same issue can arise if any pages on your site can be accessed as "http://www.somedomain.com/..." and also as "https://www.somedomain.com/..." (secure http).

Search Engine Optimization (SEO)

In any other industry, an interface as complicated and technical as the one between a web site and a search engine would be defined in a 300-page "Interface Control Document (ICD)" or "Application Programming Interface (API)" manual.  (API is probably the most suitable metaphor here.)  It would be in everybody's interest to make sure that people on both sides of the interface (in this case web sites and search engines) understood the "rules" and details of the interface to the greatest extent possible in order to make sure the overall system worked as well as possible.  There would be means whereby updates to the API would be distributed to all the participants.  There would be a "knowledge base" that provided answers to questions and documented issues with the interface.  The API that defines the interface between Microsoft Windows and thousands of application programs written by thousands of different programming organizations is a good example of how technical interfaces are normally managed. 

 

A well documented interface provides a "level playing field".  All of the competing organizations writing applications for Windows are starting out from the same position.  The API is freely available. (It might cost $15 plus S&H.)

 

In the case of search engines and web sites an almost surreal situation exists.  Even though search engines are entirely dependent on web sites for their very existence and purpose, and web sites are equally dependent on search engines, both parties to a significant extent consider the other to be "the enemy".  Search engines to a very large degree refuse to deal directly with webmasters, including refusing to provide an "API" defining the web site/search engine interface.  This creates a need for an intermediary to work between webmasters and search engines. 

 

It also creates a sort of "black market" situation.  Not only do web based companies have to compete with each other in the usual ways (price, service, quality, variety of products, etc.) but they also have to compete in their ability to determine, by any means possible, the nature of the undisclosed search engine interface.  Their ability to do so will determine the extent of their search engine exposure, which is essential to the success of their business and their ability to compete.  The interface information is only available unofficially, and to some extent "under the table." 

 

(Search engines do publish "webmaster guidelines" on their sites that very briefly describe interfaces and good and bad practices in site design.  (See Webmaster Guidelines.) These guidelines are very vague and incomplete and typically consist of only a few pages, sometimes only a few sentences.)

 

As a result, an entire industry called Search Engine Optimization (SEO) has developed to help webmasters with the problems associated with designing their sites to rank well and otherwise deal with the interface between the site and the search engines.  SEOs obtain interface information by trial and error, industrial espionage, or leaks (possibly purchased leaks) from search engine people.  (It is possible to externally “reverse engineer” a ranking algorithm by doing a lot of automated searches and examining the characteristics of the resulting top-ranked sites.)  Webmasters have to pay in order to get information that would, in any other industry, be freely available.  The level playing field does not exist.  Because of the black market, it is difficult to tell which SEO is "reputable" or really has the best information, or which optimization techniques are "legitimate" (white-hat, although undisclosed by search engines) as opposed to "illegitimate" (black-hat, and also undisclosed).  Black-hat SEO scams abound because it typically takes a search engine a long time to ban a site using black-hat techniques.
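
The “reverse engineering” idea is mechanical enough to sketch: gather the top-ranked pages for a query and tabulate some feature of them, such as whether the keyword appears in the title.  The sample results below are invented; in practice the data would come from many automated searches.

def keyword_in_title_rate(keyword, top_pages):
    """top_pages: list of (url, title) pairs for the top-ranked results."""
    hits = sum(1 for _, title in top_pages if keyword.lower() in title.lower())
    return hits / len(top_pages)

sample = [("http://a.example/", "Cheap Widgets Online"),
          ("http://b.example/", "Widget Superstore"),
          ("http://c.example/", "Garden Tools")]
print(keyword_in_title_rate("widget", sample))   # 2 of the 3 sample titles match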

 

At least part of the reason that search engines don't provide defined rules is that they specifically want to have different rules for different people. (See extensive documentation in case studies.)  The double standard negatively affects small businesses.  (See Impact of Search Engine Editorial Policies on Small Business.) 

 

The antitrust suit against Microsoft was partly concerned with the allegation that Microsoft provided Windows interface information to its own applications programmers and other favored people that was not available in the distributed API and thereby conveyed an unfair advantage.  (By any account the amount of undisclosed API information was tiny compared to the disclosed portion.)  It seems that a similar but more grossly unfair situation exists here.  By not disclosing the API, search engines can give some favored people information and thereby an advantage not provided to other people. Could Microsoft claim that free speech rights and the right to trade secrets allow them to keep some portion of the API secret? Can search engines (even semi-monopoly search engines) make the same claim? Maybe. Maybe not. Notice that the companies writing software to run on Windows are not customers of Microsoft, are competing with Microsoft's own applications programmers, and are dependent on Microsoft for API information.  A similar situation exists between websites and search engines. 

 

While some search engine people continue to take the position that there is no such thing as a good SEO or any need for SEO, most now recognize that some SEO activities (“white hat” SEO) are acceptable or maybe even, just possibly, useful.  All agree that some (“black hat”) SEO activities are unacceptable attempts to “trick” their search engine and therefore take punitive action (censoring) against sites using these techniques.  Such black hat techniques include many methods designed to present the search spider with different information (especially keywords) from that likely to be read by a human visitor, such as use of tiny keyword rich text at the bottom of a page.  Since the major spiders (named “GoogleBot”, “MSNBot”, and “Slurp”) identify themselves to web servers, it is possible to design a web site to return entirely different information to each spider, and yet different data to human visitors, a definite black hat technique.

 

Search engines are also reticent regarding what, in their minds, constitutes white hat or black hat SEO.  (Such a definition would necessarily involve disclosing the same kind of information that would be found in an API, and exposing the double standard.)  It is therefore easy for black hat SEOs to trick otherwise legitimate webmasters into believing that their techniques will result in an honest and permanent advantage.  Most webmasters get several emails each day: “We can make your site rank number one!”

 

SEO people are dependent on search engine people for information and also very dependent on the status quo.  If Google published a 300 page API, about half the need for SEO would disappear in a flash.  You will find very few SEO people criticizing a search engine or suggesting that any change in search engine practices is a good idea.

Search Engine Resources

Here are some sources for additional help on search engines, especially regarding censoring.

 

Search Engine Watch (SEW) at http://www.searchenginewatch.com provides excellent and more basic tutorials on how search engines work, statistics about search engines, and pointers on “white hat” techniques for optimizing your site for search engines. 

 

Webmaster Forums  -- There are many forums for webmasters.  Webmaster World at http://www.webmasterworld.com/ or Webmaster Pro ( http://www.webmasterpro.com/ ) are possibly the best and most popular.  The problem with the forums is that only fairly recent posts are easily accessible.  The same questions are asked and answered over and over.

 

Search Engine People – Finding a valid email address or telephone number for a genuine search engine employee knowledgeable about censoring is approximately like getting a permit to enter Fort Knox with a bag.  Any such contact point would be immediately clogged by thousands of unhappy webmasters.  However, there are search engine employees making (mostly) helpful contributions to the webmaster community.  Because of the delicate nature of site censoring, these people represent different levels of “deniability” for search engines and can therefore be more forthright than the search engine’s official guidelines.

 

Writing to search engines, especially Google, is likely to be futile. Some webmasters report getting no response whatever (not even a "thanks for your comment" postcard) from Google.  Some other search engines have been known to respond by email.

 

Some search engine employees posting under their (apparently) real names (such as Google’s excellent Matt Cutts) provide answers to webmaster questions by means of forums and blogs.

 

Other folks who say they are employees post anonymously (e.g. “GoogleGuy”).  If they feel it necessary, Google could deny that “GoogleGuy” was actually an employee and repudiate any statements made by him.

 

Finally, there are “flacks” who do not admit to being search engine employees but anonymously post cloyingly positive but uninformative “spin control” messages.  An unhappy webmaster posts a censoring complaint about search engine “X”.  A flack immediately posts a follow-up message saying he has never had a single bad experience with “X” and that, if the webmaster’s site has been censored, he obviously must have “done something wrong.”

 

Books on Search

 

The Search by John Battelle (ISBN 1-59184-088-0, 2005) is an excellent description of the search industry, including the rise of Google as the major player.

 

Papers on Search

Search Engine Bias and the Demise of Search Engine Utopianism, Eric Goldman (March 2006)

http://papers.ssrn.com/sol3/papers.cfm?abstract_id=893892

 

This paper puts forth the view that although search engines are biased, such bias is an acceptable and inevitable practice.

 

Search Engine Honesty   (http://www.searchenginehonesty.com/ )

 

Copyright © 2006 - 2007 Azinet LLC