Search Result Censorship

One of the most valued freedoms in America is the freedom of speech. It’s part of the First Amendment to the US Constitution in the Bill of Rights. We defend it rigorously in everything we do.

When we read, listen to, or watch the news, each of us can choose media that reflect our own political views. For instance, those who watch FOX News or listen to KSFO560 may be looking for a conservative view. For a more liberal viewpoint, we might choose ABC News or Bloomberg. Regardless of your political slant – America gives you the freedom to choose.

Recently Google’s search engine has been criticized for being overly algorithmic and perhaps too inhuman. Certainly we all prefer interacting with people to interacting purely with a machine. But if search results were not algorithmic, then they would be editorialized. And if they were editorialized, who should do the editing? Certainly some governments would like to be involved in that process! But generally, how do we avoid biases? How do we avoid censorship?

Of course, even if search results are algorithmic, humans created that algorithm, so it is not impervious to biases either. So far, none of the big search engines has been accused of massive political censorship within its search results. But maybe we just haven’t noticed yet?

By the way, I’m not talking about Google News, Yahoo News, or Bing News. When using these sites, I’m well aware that biases are introduced simply by the choice of which news content is included in the site’s news index. Web search is different. Web search is like going to the library – I expect that there is no implicit content filter and I further expect all results will be at my disposal. If there is editorializing going on at this library, I need to know so that I can either use a different library or modify my expectations.

Let’s take an example. Search for “Martin Luther King” on Google, Yahoo, and Bing. On Google, the top result is from Wikipedia. Wikipedia is an independent, collaborative encyclopedia on the web. It is also the #4 result on Yahoo and the #2 result on Bing. I find this to be a fairly credible source. Yahoo’s top result is from NobelPrize.org, the official site of the Nobel Prize organization. This site ranks as the #2 result on Google and the #5 result on Bing and also seems like a reputable and credible source. Finally, on Bing, the top result is from MSN Encarta, an online encyclopedia owned and operated by Microsoft. This web page does not rank in the top 10 on either Yahoo or Google. Wait a minute – MSN Encarta is owned and operated by Microsoft?

Since neither Yahoo nor Google ranks the Encarta page very high, it is unclear what editorial process Microsoft uses to decide that Encarta deserves the #1 search result spot on Bing. If Microsoft succeeds in its mission to become the top search provider, does this mean that Microsoft hand-picks the content we see? Sometimes editorializing is good, but sometimes it is not. And how can the user know which is which?

My example may seem trivial, because in reality, the Encarta page seems pretty fair. But what if Bing had editorialized its #4 result (www.martinlutherking.org, also #3 on Google) to be the top search result? This page looks like an official Martin Luther King history page, but it is actually written by white supremacists. Interestingly, while this page shows up on both Google and Bing, it does not appear in the top-100 results on Yahoo at all. Yahoo appears to have editorialized this result out of its index. While most of us disagree with Stormfront.org, should our search engines be using their own political beliefs to censor our search results?

Of course, the advertising displayed on each of the search engines is also editorialized – or at least it is displayed at the discretion of the search engine in question. This can confuse the issue, but at least all three search engines label advertisements distinctly from search results.

To wrap up, we should all be aware that search engines today are biased in some way. As long as those biases are based on algorithms designed to return content most people want and avoid content most people don’t want (spam, malware, etc.) without outright censorship, that is okay. But when biases start reflecting political opinions via exclusion or preference of self-created content, search engines have a real problem. Because I don’t believe humans are capable of editorializing a world-wide-web index without introducing accidental or intentional biases, I’ll stick to search engines which use cold, calculating algorithms.

NOTE: These opinions are my own and do not reflect opinions of my employer.

Adrian

July 12, 2009 at 11:31 am

All search engines editorialize. To assert that any of them is driven solely by “cold, calculating algorithms” is self-delusional.

There are two key algorithms involved: determining what to put in the index(es) and determining how to rank the search results.

No search engine can crawl more than a small fraction of the Web on a regular basis, so they all have to make hard decisions about where to start, how wide, and how deep to crawl. And with an increasing emphasis on freshness, you often end up with two lists: a short list crawled frequently, and a longer list crawled less often. You have to decide whether it’s worth your bandwidth, storage, and computing capacity to go deeply into adult content, extremist political sites, SEO spam sites, -1 Troll Slashdot comments, and e-commerce sites. Is it fair to index almost everything on Amazon, but not on Amazon’s competitors? ALL search engines have to make these editorial decisions. Hopefully these decisions are data-driven by a feedback loop, but even that approach reinforces the bias of the majority of users.

Ranking search results depends, in part, on the famous PageRank algorithm, which I’ll grant seems a pretty objective way to gauge the quality and popularity of a site. But it can be (and is) gamed, so editors have to step in and make corrections. This becomes hugely subjective and imperfect.
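
To make the "gamed" point concrete, here is a minimal sketch of the textbook PageRank power iteration, not any search engine's production ranking. The page names, the 0.85 damping factor, and the 20-page link farm are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power iteration. `links` maps a page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share regardless of links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page (no outgoing links): spread its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "honest" earns links, "spam" has none.
web = {"a": ["honest", "b"], "b": ["honest"], "honest": ["a"], "spam": ["a"]}
before = pagerank(web)

# The same web after 20 hypothetical link-farm pages all point at "spam".
farmed = dict(web)
for i in range(20):
    farmed[f"farm{i}"] = ["spam"]
after = pagerank(farmed)

print(f"spam before: {before['spam']:.4f}, after: {after['spam']:.4f}")
```

In this toy graph the spam page's score rises substantially once the farm exists, even though nothing about the page itself changed. Deciding which inbound links to discount is exactly where human judgment re-enters.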

In reality, all the search engines use PageRank as just one input into a neural network that’s trained to rank the pages. You need a lot of data to train neural networks. Guess what? This training data is largely based on subjective human evaluations of whether a given page would make a good search result for a given query. That involves inferring the intent of the query, judging the quality of the page, and assessing the relevance. If these decisions are averaged over a large number of diverse editors, then you *might* get a somewhat unbiased set of training data.

Training data comes not only from in-house editors, but also from observing usage. But those observations incorporate heuristic assumptions about how people behave when they get a good or bad search result. Those assumptions can also lead to bias, as can the current incarnation of the results, the ranking, the summary generation, and the UI. And you can only observe the behavior of people who are already users. If your system starts with bias, you could attract similarly biased users, and the whole process reinforces itself.
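
A toy simulation of that self-reinforcing loop, under loudly stated assumptions: four equally good pages, users who look at higher positions more often (the `examine_prob` numbers are invented), and a ranker that simply re-sorts by accumulated clicks. It uses expected click counts rather than random draws so the result is deterministic. No real engine is claimed to work this way.

```python
def simulate_click_feedback(initial_order, rounds=10, users_per_round=1000):
    """Re-rank four equally relevant pages by clicks that depend only on position."""
    # Assumed position bias: chance a user looks at (and clicks) each rank.
    examine_prob = [0.6, 0.25, 0.1, 0.05]
    if len(initial_order) != len(examine_prob):
        raise ValueError("this sketch expects exactly four pages")
    clicks = {page: 0.0 for page in initial_order}
    order = list(initial_order)
    for _ in range(rounds):
        for position, page in enumerate(order):
            # Pages are assumed equally good; only their position differs.
            clicks[page] += users_per_round * examine_prob[position]
        # "Learn" from usage: sort by accumulated clicks, most clicked first.
        order = sorted(order, key=lambda p: -clicks[p])
    return order, clicks


final_a, clicks_a = simulate_click_feedback(["w", "x", "y", "z"])
final_b, clicks_b = simulate_click_feedback(["z", "y", "x", "w"])
print(final_a, final_b)
```

Whatever order the system starts with is the order it ends with, and the click data appears to confirm it. The starting bias is never corrected, because the observations were shaped by the ranking that produced them.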

Lastly, when you train a neural network, you can never really be sure what lessons it’s learning. There’s an apocryphal (but plausible) story about defense department researchers trying to train a neural network to distinguish photographs that had army tanks in them from ones that didn’t. Training seemed to go very well, but when new pictures were introduced, the network failed miserably. According to the story, the two sets of training photos were taken on different days. As a result, the network had actually learned to distinguish pictures taken on sunny days from those taken on overcast days.
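
The failure mode in that story can be reproduced in miniature. The sketch below swaps the neural network for a much simpler learner (a one-rule "decision stump") and uses made-up numbers: each photo is reduced to two hypothetical features, tank visibility and brightness. It illustrates the idea only and is not a reconstruction of the original experiment.

```python
def predict(rule, features):
    """Apply a (feature_index, threshold, label_if_below) rule to one photo."""
    feature_index, threshold, label_if_below = rule
    if features[feature_index] < threshold:
        return label_if_below
    return 1 - label_if_below


def accuracy(rule, samples):
    hits = sum(predict(rule, x) == y for x, y in samples)
    return hits / len(samples)


def train_stump(samples):
    """Pick the single feature/threshold rule with the best training accuracy."""
    best_rule, best_acc = None, -1.0
    for f in range(len(samples[0][0])):
        values = sorted({x[f] for x, _ in samples})
        # Candidate thresholds are midpoints between neighbouring values.
        for lo, hi in zip(values, values[1:]):
            for label_if_below in (0, 1):
                rule = (f, (lo + hi) / 2, label_if_below)
                acc = accuracy(rule, samples)
                if acc > best_acc:
                    best_rule, best_acc = rule, acc
    return best_rule, best_acc


# Each sample: ((tank_visibility, brightness), label); label 1 means "tank".
# Invented training set: every tank photo happens to be dark (overcast).
train = [
    ((0.9, 0.20), 1), ((0.4, 0.30), 1), ((0.8, 0.25), 1),
    ((0.1, 0.80), 0), ((0.5, 0.90), 0), ((0.2, 0.85), 0),
]
# Invented new photos: the weather is reversed.
new_photos = [
    ((0.9, 0.85), 1), ((0.8, 0.90), 1),
    ((0.1, 0.20), 0), ((0.2, 0.30), 0),
]

rule, train_acc = train_stump(train)
print("chosen feature index:", rule[0])  # 0 = tank visibility, 1 = brightness
print("training accuracy:", train_acc)
print("accuracy on new photos:", accuracy(rule, new_photos))
```

Brightness separates the training photos perfectly while tank visibility does not, so the learner keys on the weather, scores perfectly in training, and then gets the new photos wrong.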

Thus the “cold, calculating” neural network just might be factoring (possibly unintended) editorial bias from the training data into your search results.

For widely popular topics and queries with clear intent, the vastness of the data sets might work to our advantage by averaging out blips of bias. It would be pretty hard for one of the major engines to swing too far toward outright censorship or otherwise biased ranking on a popular topic. (Then again, all the search engines do tap dance around topics like Tiananmen Square and Tibet when returning results in China.) But what about the long tail of the Web? That content–if it’s indexed at all–is at the mercy of the whims of the majority, not the interests and needs of those actually seeking those results.

All search engines editorialize. Just because the process is largely automated doesn’t mean the results are objective. We need vibrant competition in this market for the same reasons we need more competition in the news media.

Mike Belshe
