John Ferrara is an expert in online search systems. His recent article in A List Apart is one of the first I know of outside the academic literature to take a systematic look at relevance. See: Testing Search for Relevancy and Precision.
“Relevance” in information retrieval is an old concept. Each year since 1992, information scientists have gathered at the Text REtrieval Conference (TREC) to test and evaluate retrieval methods on large collections of documents. The conference is considered an extension of the Cranfield experiments, indexing studies begun in the 1960s that are regarded as the beginning of modern information retrieval.
The primary measures used then were recall and precision. Briefly, recall measures whether a search system returns all of the possible relevant documents: completeness. Precision measures whether only the most appropriate documents are returned: exactness. Virtually all measures of search system relevance since then have relied on these two. But neither is easy to measure in practice, and there are few practical guides for doing so.
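For the curious, the two measures reduce to simple set arithmetic. Here is a minimal sketch for a single query; the document IDs and judgment data are hypothetical, and real evaluations average these numbers over many queries.

```python
# Sketch: precision and recall for one query, assuming we know the full
# set of documents judged relevant (which is the hard part in practice).

def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for a single query.

    retrieved: list of document IDs the system returned
    relevant:  set of all document IDs judged relevant to the query
    """
    retrieved_set = set(retrieved)
    hits = retrieved_set & relevant          # relevant docs actually returned
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical data: the system returns 4 docs, 3 of them relevant,
# but 5 documents in the whole collection are actually relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"],
                        {"d1", "d2", "d3", "d5", "d6"})
# precision = 3/4 = 0.75, recall = 3/5 = 0.6
```

The asymmetry is worth noting: precision only needs judgments on what came back, while recall needs judgments on the entire collection, which is why recall is the harder number to obtain outside a controlled test set.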
The problem is that recall and precision are system-centered measures. Tests really just look at whether the system is functioning sufficiently from a technical perspective. But as you start to unravel relevancy in a broader context, things quickly move away from simple binary measures: it’s not a yes-no question. For instance, you get grey areas like partial relevance, when only a part of a document is relevant to a person’s information need.
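One way past the yes-no limitation is to replace binary judgments with graded ones. The sketch below is illustrative only, not a standard metric: the grade values (0, 0.5, 1) and document IDs are assumptions, and it simply averages the grades of whatever was retrieved.

```python
# Sketch of handling partial relevance with graded judgments:
# 0.0 = not relevant, 0.5 = partially relevant, 1.0 = fully relevant.
# The grading scale itself is a hypothetical choice.

def graded_precision(retrieved, grades):
    """Average relevance grade over the retrieved documents."""
    if not retrieved:
        return 0.0
    return sum(grades.get(doc, 0.0) for doc in retrieved) / len(retrieved)

grades = {"d1": 1.0, "d2": 0.5, "d3": 0.0}
score = graded_precision(["d1", "d2", "d3"], grades)
# (1.0 + 0.5 + 0.0) / 3 = 0.5
```

A binary precision measure would have to force "d2" into one bucket or the other; the graded version lets the partially relevant document count for exactly what it is worth.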
In a previous post, I recount some of Tefko Saracevic’s extended concepts of relevance that include more human-centered aspects. In addition to recall and precision measures, which we can call Technical Relevance, he sees four other types of relevance. To summarize again briefly:
- “Topical or subject relevance: relation between the subject or topic expressed in a query, and topic or subject covered by retrieved texts, or more broadly, by texts in the systems file, or even in existence. It is assumed that both queries and texts can be identified as being about a topic or subject. Aboutness is the criterion by which topicality is inferred.
- Cognitive relevance or pertinence: relation between the state of knowledge and cognitive information need of a user, and texts retrieved, or in the file of a system, or even in existence. Cognitive correspondence, informativeness, novelty, information quality, and the like are criteria by which cognitive relevance is inferred.
- Situational relevance or utility: relation between the situation, task, or problem at hand, and texts retrieved by a systems or in the file of a system, or even in existence. Usefulness in decision making, appropriateness of information in resolution of a problem, reduction of uncertainty, and the like are criteria by which situational relevance is inferred.
- Motivational or affective relevance: relation between the intents, goals, and motivations of a user, and texts retrieved by a system or in the file of a system, or even in existence. Satisfaction, success, accomplishment, and the like are criteria for inferring motivational relevance.”
But, as John wisely points out in his opening, user experience designers most often have little to no control over search relevancy. Part of the problem, I think, is that relevance is invisible–to stakeholders, to engineers, to designers, and to users. It’s not something you can easily put your finger on. And so, discussions around it are limited to what you get out of the box from search providers or, worse, don’t happen at all.
I suspect most information retrieval experts will cringe after reading John’s article in A List Apart. It has none of the academic hygiene established in the field over the past four decades, it makes unsupported claims, and it’s far too simple a model.
That’s probably why I like it.
Measuring relevance is far too important to be locked up in the ivory towers of academia. Search is now everywhere, and it’s fundamental to our information seeking experiences online. John gives step-by-step details on how to go about getting some measure of system relevancy. It’s hands-on and practical. His approach relies on recall and precision, for sure, but doesn’t carry any of the baggage academic measures require.
I’d personally like to see this framework extended to include qualitative feedback from users, perhaps addressing aspects like situational or affective relevance as well as partial relevance. But John has put a stake in the ground with this article, filling a gaping hole in user experience design, I believe. Hopefully this will open more discussion on the topic. Bravo.