This past week has seen a ton of hype that real-time search is different: that the content creation and behaviors
expressed and implied are fundamentally new and exclusive to places like Twitter,
and that mining this data has incredible potential. Having spent a decade in search
and the last four years working with dynamic content, APIs and what is now
being called the real-time web, I want to add some much needed perspective.
Getting Up to Speed
Real-time web did not start with social updates.
Representational State Transfer, or REST, has been a key component of social media
from the start of blogging to sites like Flickr and now to the cloud with
services like Amazon S3. Cross-platform synchronization (e.g. mobile, XMPP) is
also not new. Many of these web services have had this ability for some time. Certainly
there is a slow boil with the opening of the structured web of data, but technically
there is nothing new here.
The “real-time” threat to Google has also been mentioned more than once. As far as search goes, nothing going on surpasses Google’s
ability to index it in real time (milliseconds). I have personally seen my blog
posts indexed immediately and presented on SERPs. The idea that Twitter has
some magical real-time stack that Google should be concerned about is quite
While I’m on the subject of GOOG, an important real-time
aspect of Google that has gone overlooked in all the chatter is their
advertising platform. AdWords is a dynamic marketplace with numerous
synchronous and asynchronous real-time rules. Last fall Google moved to
real-time quality score calculations, meaning these computations are taking
place at the time of the query, based on the query. Let’s be clear: Google has a real-time advertising
optimization platform for content that works insanely well. It is to date the
ultimate achievement in real-time web and
search. Nothing comes close. Positioning them as a laggard is naive. In fact, if
you want to know the real web of sentiment, you might be best off looking at
keywords and bid history.
Now that we have put aside the technical and competitive aspects,
the question remains: what is the value of real-time search?
The answer must be looked for not from the perspective of content creation (new
content is always being added to the web) but in how and why people search. Does “real-time” data on sentiment and
opinion provide increased relevancy, new relevancy or a change in behavior
that’s different or more beneficial? Keep in mind, these numbers are not
about how many people will search for real-time data; they are about how many
queries conceivably could benefit.
My guide through searcher behavior is the landmark 2004 research
“Understanding User Goals in Web Search” by Daniel Rose and Danny Levinson, which built on the seminal work “A Taxonomy of Web Search” by Andrei Broder. Broder came up with the original
trichotomy of web search “types” (navigational, informational, and
transactional) that Rose and Levinson expanded upon. To this day their work remains
search’s de facto query classification system.
To get an idea of the representative percentage of queries that
would benefit from real-time data within each category, I
incorporated my own research in both AdWords and Google Trends on query
volumes, looking at about 100 queries with real-time relevance (e.g. “what’s on
tv right now”) vs. those without. Lastly, I added my years of experience
in search data and behavior to extrapolate the results. Also, the benefit number assumes
result sets that simply don’t exist yet and as such these percentages are
Was this scientific? No. Do I think the numbers are pretty
Key: Query Type (Overall query %)* Real-Time Data Benefit %

Informational Queries: My goal is to learn something by reading or viewing (61%) 14%
  Directed: I want to learn something in particular about my topic (7%) 2%
  Undirected: I want to learn anything/everything about my topic (22%) 5%
  Advice: I want to get advice, ideas, suggestions, or instructions (5%) 2%
  Locate: My goal is to find out whether/where some real world service or product can be obtained (24%) 5%
  List: My goal is to get a list of plausible suggested web sites each of which might be candidates for helping me achieve some underlying, unspecified goal (2%) <0.5%

Resource Queries: My goal is to obtain a resource (not information) available on the web (25%) 5%
  Download: My goal is to download a resource that must be on my computer or other device to be useful (5%) <0.5%
  Entertain: My goal is to be entertained simply by viewing items available on the result page (6%) 2%
  Interact: My goal is to interact with a resource using another service I find on the web (6%) 2%
  Obtain: My goal is to obtain a resource that does not require a computer to use. I'm not obtaining it to learn some information, but because I want to use the resource itself (8%) <0.5%

Navigational Queries: My goal is to go to specific known website that I already have in mind. The only reason I'm searching is that it's more convenient than typing the URL, or perhaps I don't know the URL. (14%) <0.5%
Overall, I feel the query numbers for real-time benefit (about 19% of all queries) are optimistic. I have made some very large assumptions, both about the ability
to index and query real-time data in a manner that is useful and about the
changes in people wanting or needing to query this data once they know it is
available. Also, I did not want to discount any category as being useless, even
though it is hard to see at the present time how navigational or
resource>obtain queries stand to benefit from real-time data. In every instance I gave the benefit of the doubt to real-time.
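For anyone who wants to check the arithmetic behind the "about 19%" figure, here is a minimal sketch that tallies the sub-category rows from the listing above. The numbers are copied from that listing. The names (ROWS, SMALL, benefit_bounds, share_total) are made up for illustration, and treating each "<0.5%" entry as a range from 0 to 0.5 is my assumption, not something stated in the original estimates.

```python
# Sub-category rows: (category, subcategory, share of all queries %, real-time benefit %).
# A benefit of None stands for "<0.5%" in the listing.
ROWS = [
    ("Informational", "Directed", 7, 2),
    ("Informational", "Undirected", 22, 5),
    ("Informational", "Advice", 5, 2),
    ("Informational", "Locate", 24, 5),
    ("Informational", "List", 2, None),
    ("Resource", "Download", 5, None),
    ("Resource", "Entertain", 6, 2),
    ("Resource", "Interact", 6, 2),
    ("Resource", "Obtain", 8, None),
    ("Navigational", "Navigational", 14, None),
]

SMALL = 0.5  # assumed upper bound for every "<0.5%" entry


def benefit_bounds(rows):
    """Return (low, high) bounds on the total real-time benefit percentage."""
    # Low bound: count every "<0.5%" entry as zero.
    low = sum(benefit for _cat, _sub, _share, benefit in rows if benefit is not None)
    # High bound: count every "<0.5%" entry as the full 0.5.
    unknown = sum(1 for _cat, _sub, _share, benefit in rows if benefit is None)
    return low, low + SMALL * unknown


def share_total(rows):
    """Return the summed share of all queries covered by the rows."""
    return sum(share for _cat, _sub, share, _benefit in rows)


low, high = benefit_bounds(ROWS)
print(f"query share covered: {share_total(ROWS)}%")
print(f"real-time benefit: between {low}% and {high}% of all queries")
```

Under those assumptions the sub-category benefits land between 18% and 20%, which brackets the headline figure of about 19%. The sub-category shares add up to 99% rather than 100%, presumably because the informational rows were rounded before being summed to 61%.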
The technology to present real-time data as helpful to
queries has not yet emerged. Even so, while real-time updates might be
helpful for a small percentage of queries, they are not even close to being more
helpful in any one category. That’s the biggest problem for real-time search.
The largest percentage is not surprisingly informational
queries. If Twitter search can become anything it would be more a discovery
engine than a search engine — more Craigslist, Wikipedia or Yelp than Google.
It is after all a publishing and communications platform. Put another way,
people search on the NY Times for opinions, sentiment and news of the day but
that does not make the NY Times a search engine.
The underlying value of temporal content correlates to the
benefit it provides to the searcher at that moment of attention. Those ‘real-time’
moments are fleeting, and once they are gone the value of the content disappears
with them. Thus to have substantive value you need millions of fleeting moments,
all the time, that can best be helped by understanding what is happening right
now. That’s an interesting idea but it is simply not the way people search.
Given the way people search, the opposite actually holds true. The greatest
value rests in content that retains usefulness or importance the longest. In
fact, that’s one idea that transcends search. Though with the vigor with which Google is
scanning books, maybe not for long.
Also, for search to work properly there must be a level of
authority associated with the result set. I just don’t see how to filter
through this noise in real time. Even trying to do so begins to destroy the value of an open real-time system where the benefit of "right now" matters more than "who."
It’s great to imagine what can be possible with the web but
we’re not going to build anything that changes human nature, only stuff that
amplifies it. Certainly Twitter does that incredibly well, just not in a way that benefits most searchers.