Search at the Guardian Newspaper
I had the privilege of attending a Search at the Guardian event a while back, organised by Tyler Tate of the Enterprise Search London meetup group and Martin Belam of the Guardian newspaper. I say “privilege” because all 60 places were snapped up within a matter of days, so I consider myself fortunate to have grabbed one. Seems like search has gone viral this last week … Anyway, it was well worth the trip: the Guardian put on a great show, consisting of talks from their technology development team about the challenges of providing a searchable archive of all their content.
To add an extra note of personal interest, they are actually in the process of migrating from Endeca (for whom I currently work) to Apache Solr, and with it embracing the wider opportunities of providing open access to their data and search services. One of their key goals in doing this is to enable the development community at large to create value-adding apps and services on top of their data and API, thus transforming the Guardian’s role from publisher to content platform.
By their own admission, they haven’t got the best out of their Endeca investment, and have allowed their installation to become wildly out of date and unsupported. So what they have on their live site is hardly representative of a typical Endeca deployment. That said, I think there are some basic user experience issues they could improve, regardless of platform. In particular, I think there are significant issues around their implementation of faceted search and the overall design of their results pages. In addition, I think there are some missed opportunities regarding the extent to which the current site supports a serendipitous discovery experience (something which a site like this, if designed appropriately, should really excel at). If I get the chance I’ll provide a fuller review, but for now it is probably instructive to refer to the Endeca UI Design Pattern Library, in particular the entries for Faceted Navigation: Vertical Stack, Search Box, and Search Results: Related Content. These patterns provide much of the background necessary for addressing the immediate issues. (NB: although these patterns are published by my colleagues at Endeca, the guidance is essentially platform-agnostic and applies to search and discovery experiences in general.)
But let’s get back to talking about the event itself. All of the half dozen or so presentations were valuable and instructive, but as a UX specialist I particularly enjoyed Martin Belam’s talk, in which he discussed “Why news search fails…and what you can do about it”. I have a lot of sympathy with Martin’s observations about the Guardian’s site users and their expectations that the search engine should be able to “read minds”. In particular, he cited the classic problems caused by underspecified or incomplete queries (e.g. should a search for a single word such as “chile” return stories of mining accidents or football reports?). Interestingly, this is exactly the sort of problem that should be reduced by features such as Google Instant – if you can see the mix of results your query will return before you hit enter, you are more likely to provide the context needed for adequate disambiguation.
Martin also talked about the “long tail” of search queries, i.e. the hapax legomena (terms that occur only once) found in any search log. Search logs, like most natural (language) phenomena, display a Zipfian distribution, i.e. term frequency is inversely related to rank by a power law. In the Guardian’s case, this means a typical day can produce some 17,000 unique queries, most consisting of idiosyncratic edge cases. However, a few common patterns do recur, including:
- People’s names (which are often incomplete, as alluded to above)
- Dates (which Martin argued were highly generative and therefore not easily matched by regular expressions, but based on my experiences with named entity recognition at Reuters, I’d be more optimistic about the prospects for this)
- Misspellings and typographic errors (which in many cases I’d argue are addressable through auto-correct and “did you mean” techniques, i.e. string-edit distance against a cumulative list of known terms)
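To make the last item in the list concrete, here is a minimal sketch of a “did you mean” suggester based on string-edit (Levenshtein) distance against a list of known terms. It is purely illustrative and not a description of how the Guardian, Endeca or Solr implement spelling correction: the function names `edit_distance` and `did_you_mean`, the `max_distance` threshold of 2 and the sample term list are all assumptions made for the example.

```python
from typing import Iterable, Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def did_you_mean(query: str,
                 known_terms: Iterable[str],
                 max_distance: int = 2) -> Optional[str]:
    """Return the closest known term within max_distance edits, or None
    if the query is already a known term or nothing is close enough.
    (max_distance is an assumed threshold, not a recommended value.)"""
    query = query.lower()
    best_term, best_distance = None, max_distance + 1
    for term in known_terms:
        distance = edit_distance(query, term.lower())
        if distance < best_distance:
            best_term, best_distance = term, distance
    if best_distance == 0:
        return None  # exact match, nothing to suggest
    return best_term


# Hypothetical list of known terms accumulated from past queries or tags.
KNOWN_TERMS = ["chile", "chess", "boxing"]
suggestion = did_you_mean("chilli", KNOWN_TERMS)
```

In practice the known-terms list would be far larger, so a linear scan like this would probably need to be replaced by an index (for example over character n-grams), and candidates would typically be weighted by how often each term occurs in the log.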
Also intriguing was the observation that only 1% of their current page views are search-driven – I wonder how this will change as consumption of their content increasingly occurs in a mobile context, with users engaging in highly goal-driven, spontaneous or impulsive tasks, for which search is the obvious entry point? He also outlined some of the ways in which their site search exploits context and metadata to deliver a richer experience than web search, and uses manually assigned tags to dynamically generate topical landing pages for arbitrary query combinations (e.g. “chess” and “boxing”). Martin also alluded to a vision of using “multiple search boxes” to infer the user’s intent based on local context (but I’d prefer to think of this as multiple instances of a single search box).
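The tag-driven landing pages mentioned above can be pictured as a simple set intersection over tagged articles. The sketch below is only an assumption about the general idea, not the Guardian’s actual implementation: the `ARTICLE_TAGS` index, the article identifiers and the `landing_page` function are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical index: article identifier -> manually assigned tags.
ARTICLE_TAGS: Dict[str, Set[str]] = {
    "chessboxing-world-title": {"chess", "boxing", "sport"},
    "kasparov-interview": {"chess"},
    "heavyweight-preview": {"boxing", "sport"},
}


def landing_page(tags: Set[str],
                 article_tags: Dict[str, Set[str]] = ARTICLE_TAGS) -> List[str]:
    """Return the articles that carry every requested tag, sorted by identifier."""
    wanted = {tag.lower() for tag in tags}
    return sorted(
        article
        for article, assigned in article_tags.items()
        if wanted <= assigned  # subset test: all requested tags are present
    )


# A page for the combination "chess" and "boxing".
chess_and_boxing = landing_page({"chess", "boxing"})
```

Because any combination of tags is just another subset test, no page has to be curated in advance for a given combination; the quality of the result depends entirely on how consistently the tags were assigned.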
One final point – surely all that manual tagging is insanely time-consuming and non-scalable? I understand of course the need to apply human editorial quality control, but at Reuters even back in 2002 we were using semi-automated text categorization solutions to successfully tag over 11,000 stories a day (and had been doing so for many years previously). I’m a bit surprised the Guardian appears to be so reliant on manual methods, and am curious to know how they view the trade-off between efficiency, accuracy and throughput.
So all in all, a very productive and enjoyable evening – thanks again to Tyler and Martin for making this happen.
Published at DZone with permission of Tony Russell-Rose, DZone MVB. See the original article here.