Death of an RSS Reader

DISCLAIMER: I am a Microsoft employee and I work for Bing at the present moment (27/03/2013). This article reflects my PERSONAL opinion and does not necessarily represent the position of my employer: it is a personal interpretation of the facts as they stand at the moment, based on my experience and expertise in the search engine field.


Abstract

Everyone is talking about this: Google will shut down Google Reader on July 1st, 2013. In this short document I will try to understand what went wrong with the product and why it was probably important.

Why am I Writing This?

As soon as Google announced the death of Google Reader, people started filling the web with their opinions and complaints about the decision. I was involved in some of these discussions with friends, in person, by email, or on Facebook, and some of those discussions made their way to the web (as in the case of Nicola Carmignani's blog – in Italian).

Aside from being a user of Google Reader, I had the chance to work for more than 5 years in the search engine field at two different companies, and I believe that there were, and still are, very good reasons for Google to have an RSS reader. So I was trying to understand what reasons could have led Google to the decision to shut it down.

I took my time to think about it, and what follows are my – sparse – thoughts about why things are as they are, "if I were Google".

On the death of Google Reader

The structure of the following "thoughts" is: in bold, a title for the argument; after the separator, the reason why I would have done that "if I had been Google"; and, in the indented paragraphs, what could have gone wrong.

Profiling the users — Google never made a mystery of the fact that they profile users in order to provide a "personalized" experience and, practically, to be able to target users with hopefully better/more relevant advertisements (for instance, here and here).

Profiling does not have to be limited to the single user that Google is observing. If two users have common interests and follow the same blogs and newspapers, they probably have an overlap in the queries they issue against web search as well. In this case, since the users are similar, if one result was clicked by one of the two, the exact same result could be boosted for the same search by the other user. Google Reader looks like a natural place to collect pieces of information such as my favorite newspapers, blogs, etc.

Even in the case in which only one of the two users is a Google Reader user, if the two users' queries overlap, results from the feeds present in the Google Reader of one can surface among the results of the one who does not use Google Reader. Even better, the results that the similar user liked (or "starred") in the reader could be proposed as results.
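To make the idea concrete, here is a minimal sketch of how such a similarity-based boost could work. The data structures, names, and threshold are all invented for illustration; none of this is Google's actual code:

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of feed URLs."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical subscription data: user -> set of subscribed feeds.
subscriptions = {
    "alice": {"nytimes.com/rss", "arstechnica.com/feed", "xkcd.com/rss"},
    "bob":   {"nytimes.com/rss", "arstechnica.com/feed", "espn.com/rss"},
}

def boost_from_similar_users(user, query_results, starred, threshold=0.4):
    """Float results starred by similar users to the top of the ranking.

    `starred` maps user -> set of result URLs they starred in the reader.
    The threshold and the Jaccard proxy are placeholders for this sketch.
    """
    mine = subscriptions[user]
    boosted = set()
    for other, feeds in subscriptions.items():
        if other != user and jaccard(mine, feeds) >= threshold:
            boosted |= starred.get(other, set())
    # Stable sort: boosted results first, original order otherwise preserved.
    return sorted(query_results, key=lambda r: r not in boosted)
```

For "alice" and "bob" above, the Jaccard similarity is 2/4 = 0.5, so anything bob starred would float to the top of alice's results. A real system would combine many more signals than shared subscriptions, but the shape of the computation would be similar.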

There are quite a few things that could have gone wrong with the profiling of the users.

The very first thing that could have gone wrong is that the majority of users like the same things. Probably, the fact that a huge quantity of users read newspaper X or blog Y does not help profiling them and, even worse, the things in which they differ make them differ too much for serving the right results from one user to the other to be effective.

Another point could be that not all the information that could potentially make us recognizable and enable such a personalization has an RSS feed available, or the feed drives the user to a page where he needs to pay to consume the information. At this point the user could remove the feed from the list of feeds he wants to follow, not because it is not interesting but simply because it cannot be used properly (e.g., the "New York Times" requires a subscription to read the news on its website: knowing that there is an article about X but being unable to read it on the website does not make following the RSS feed appealing).

Profiling the feeds/sources — The single most difficult task that a search engine must face is being able to decide which documents are the most relevant for a query.

PageRank works extremely well for the web. Guess where it doesn't work? In news. News, by its specific nature, has a short life, and in such a short time frame it is almost impossible for an article to be linked on the web by other pages or news articles. Moreover, this second case would drive traffic away from one news site to another, and it would not be remunerative for the website.

The best way to know if an article is important is to know whether people are actually reading it and maybe sharing it.

Before Facebook, Google+ and Twitter, the only way to know whether an article was interesting was to observe the behavior of the users "in the wild", i.e., while they read such articles on the websites that host them. Using something like the data coming from Google Analytics is not guaranteed to be sufficient (not all newspapers or news websites have it).

Since no Facebook, Google+ or Twitter existed at the time, or they were not Google properties, what could be better than creating a platform on which people say "I am interested in following X, Y and Z" and, even more, "Of the feed from website X, I read only articles A and B"? This can be taken to the extreme. Since the platform is yours, you can place analytics everywhere: how many seconds did the user spend reading this article? Was the article opened/closed multiple times? Did the user scroll fast?
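A sketch of how such engagement events could be reduced to a per-article interest score; the event fields and weights below are illustrative placeholders of mine, not real telemetry:

```python
from collections import defaultdict

def interest_scores(events):
    """Reduce raw reader events to a crude per-article interest score.

    Each event is a dict like:
      {"article": "http://...", "seconds": 42, "reopened": True, "fast_scroll": False}
    """
    scores = defaultdict(float)
    for e in events:
        score = min(e["seconds"], 300) / 300.0   # cap dwell time at 5 minutes
        if e.get("reopened"):
            score += 0.5                          # coming back suggests interest
        if e.get("fast_scroll"):
            score -= 0.3                          # skimming suggests the opposite
        scores[e["article"]] += score
    return dict(scores)
```

The exact weights do not matter for the argument; the point is that owning the platform turns every question in the previous paragraph into a measurable feature.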

Moreover, if a reasonable number of users read the same feed, that feed could be "interesting". If two feeds are placed inside the same "folder", they could be similar, and this could be a useful signal when clustering news together, as in the sketch below. The name of the "folder" itself can be used as a searchable term or category, if it is not too vague or offensive (confession: I have a "folder" called "stuff" containing all the low-interest feeds).
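One way to turn the "same folder" observation into a similarity signal, sketched here with made-up data:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical data: for each user, folder name -> feeds filed in that folder.
folders = [
    {"tech": {"arstechnica.com", "wired.com"}, "stuff": {"xkcd.com"}},
    {"news": {"arstechnica.com", "wired.com", "bbc.co.uk"}},
]

def cofolder_counts(all_folders):
    """Count how often two feeds are filed in the same folder across users."""
    counts = defaultdict(int)
    for user_folders in all_folders:
        for feeds in user_folders.values():
            for pair in combinations(sorted(feeds), 2):
                counts[pair] += 1
    return counts

# Feeds that many users file together are candidates for the same cluster,
# and frequent folder names ("tech", "news") can serve as category labels.
print(cofolder_counts(folders)[("arstechnica.com", "wired.com")])  # -> 2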

There is a huge number of things that could have gone wrong with the profiling of the feeds/sources.

The very first reason why things did not work out as expected could have been that all these signals indicating the quality of a page given the RSS feed, or the quality of an article with respect to another, are not "clean" signals: a user may leave an article open in a tab without reading it, scroll fast for reasons that have nothing to do with interest, or file feeds under folder names as uninformative as my "stuff".

Selling ads — It is not a secret that the main purpose of Google is selling ads. Every Google application, to be freely available, needs to pay for itself by displaying ads that users can click. Even Google Mail displays ads to its users.

In the case of Google Reader, the ads could be related to what the user is reading: reading about music? What about some nice tickets to X's concert?

Moreover, since the user is logged in, the information coming from Google Reader can be leveraged to sell ads during web search or in other Google products. If the user reads about a specific sports champion in the RSS feeds and then goes to web search looking for a pair of shoes, one of the options is to show ads from that sports(wo)man's sponsor. In this way, the user would be more prone to clicking the ad of a brand that is related to something he already knows.

This last point is not sci-fi, and it can be achieved by creating an ontology relating people to products/brands, like: Torres ⇒ Chelsea FC ⇒ Samsung. All these pieces of information are already available on the web; they just need to be composed and cleaned, maybe using cleaner datasets such as Wikipedia as a base. Even simpler associations could be made using more direct knowledge of the things already seen: if the user is reading some news about Apple, why not propose a MacBook when he looks for a laptop online? Such an association can be made easily with a rule of the type "if the ad contains one of the keywords of the articles that the user reads most frequently, boost it", and obtaining a list of such keywords is something that Google already knows how to do very well in web search.
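A toy version of that rule, assuming the user's frequent keywords have already been extracted from his reading history (all names, scores, and the boost factor below are invented):

```python
def rank_ads(ads, user_keywords, boost=2.0):
    """Boost ads mentioning any keyword from the user's frequent reading topics.

    `ads` is a list of (base_score, text) pairs; `user_keywords` is a set of
    lowercase strings mined from the articles the user reads most.
    """
    def score(ad):
        base, text = ad
        return base * (boost if set(text.lower().split()) & user_keywords else 1.0)
    return sorted(ads, key=score, reverse=True)

# A user who reads a lot about Apple sees MacBook ads boosted:
ads = [(1.0, "Dell XPS laptop on sale"), (0.8, "New Apple MacBook Air")]
print(rank_ads(ads, {"apple", "macbook"}))
# [(0.8, 'New Apple MacBook Air'), (1.0, 'Dell XPS laptop on sale')]
```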

In this case, the problems are more "political" than technical.

The largest part of the information consumed on Google Reader is news, and news is damn difficult to monetize. There are mainly two reasons why:

  1. You risk a real tragedy from the PR point of view.
  2. You risk selling ads on top of someone else's business.

For the first one, let's imagine that the user is reading an article about the massacre in Aurora, Colorado, at the theater where the latest Batman movie was being shown. Let's imagine that the ads in Google Reader are related to "Buy a house in Aurora" or "Buy Batman movies" or "Visit Colorado". You probably get it by now, and you can try to imagine all the possible inappropriate advertisements related to the latest news that you read online or in a newspaper today.

For the second one, if you go to any newspaper online, you will find that they have ads on their pages. It makes sense: they produce the content, they have to monetize it, and they can cherry-pick what not to monetize in order not to fall into the case above. The problem is that every time a user uses Google Reader and reads an article there, there is a certain probability that he will not go to the newspaper's page. This is upsetting enough for a newspaper, but it would be even more so if Google were making money off it as well.

This is the reason why, if I were Google, I would not place ads on Google News either.

The motivation behind point two also explains why newspapers usually provide only a short description of the article in the feed and (almost) never the full text. If they offered everything in the feed, people would no longer need to go to the newspaper's page, and so the papers would reduce their own chance of having their ads clicked. Google Reader could potentially expand the information in the feed with the full page in order to provide the complete content to the users, but Google already has some issues with the newspapers and probably does not want to make them worse (for instance, see here).

That's all for now. If something else comes to mind, I will share it.