Sunday, January 19, 2014

3 ways libraries try to help improve search results in discovery services

Library web scale discovery systems are great. They break down the silos between books, articles and other content types. They provide the "one-search" box experience that our users claim to want.

But problems exist (see my overview, 8 things we know about web scale discovery systems in 2013 and outstanding issues). In my experience, one of the stickiest issues is the question of getting relevant results.

A typical academic library catalogue serves up somewhere in the range of 1-10 million possible entries (most of which are books) in the index. But once you add articles, conference proceedings and even newspaper articles to the index, you easily get 300-500 million entries (depending on how aggressively you include free content, newspaper articles, non-full text material etc), at least a 50-100 fold increase in many cases (see this old 2012 post surveying the index size of some ARL libraries on Summon).

Does this increase in content make relevancy ranking easier (because there are now more possible "good" targets to surface) or harder (because the "good" targets are buried by the noise of other irrelevant items)? I suspect it may depend on librarians prudently adding sources with high relevance rather than adding the kitchen sink of sources just because they can (more to say about this in future posts).

Leaving that aside, the issue is that of the "big 4" discovery services, few of them provide a way for the library to tune the relevancy system directly even if the library is unhappy with the results. The same relevancy ranking applies for all customers of Summon, Ebsco discovery service (EDS) etc. I believe of the 4, Primo Central is the only one that allows tweaking of the ranking, and only for the locally hosted version.

In any case, tweaking relevancy ranking even if allowed is not trivial.

So what can librarians do if they are not happy with the relevancy ranking?

Here are some of the implementation choices made by libraries, which seem to me to be trying to address perceived flaws in the relevancy ranking of discovery service for particular use cases without directly touching the relevancy ranking.

These changes tend to address issues in known item searching (including finding catalogue items) as well as subject searches.

1. Change of default settings to exclude format types (newspaper articles/reviews) and other settings

While one cannot directly adjust the relevancy ranking in most web scale discovery services, libraries can adjust other settings which do affect the results being shown.

For example, early adopter libraries on Summon noticed early on that the results were often flooded by newspaper articles and book reviews. This prompted the implementation of an "Exclude Newspaper article" switch, positioned prominently on the interface, on top of presumably adjusting the weights for those format types downwards.

The exclude newspaper articles switch

The switch of course allows easy removal of unwanted newspaper articles by the user.

Still, during my survey of Summon libraries in How are libraries designing their search boxes? (I), I found quite a few libraries sporting a design similar to the next picture.

Some libraries have 2 separate check-boxes for excluding newspaper articles and book reviews instead of the combined one used at James Cook University Library, but the key point is that these libraries have decided to exclude these items by default.

This is of course a serious trade-off, because occasionally users are indeed searching for newspaper articles, and they may not notice that the defaults have them turned off and hence fail to find the item. In my institution, I see many daily searches for items in our local daily newspapers and keyword terms that obviously refer to the hottest news topics.

In my institution we struggled with this decision as well.

In the end, we decided to exclude newspapers and book reviews (the latter was less controversial), because we found that in many cases, without filtering of newspaper articles and book reviews by default, the results for known item searches for books, databases etc and to some extent subject searches would be much poorer due to too many newspaper articles and book reviews. In particular, we suspected users would get frustrated because they couldn't find a known book in the top 10 results thanks to the numerous book reviews and newspaper articles.

A full title search in Summon generally was fine, but *book title* + *author* for popular titles/ generic titles  or *partial book title* + *author* sometimes had issues.

An example would be the search gladwell outliers (the full title is Outliers: the story of success by Gladwell). With newspapers and book reviews filtered, the book appears as of writing 6th/7th (in most catalogues it would be 1st), but in a full unrestricted search it drops off the top 10.

Arguably, libraries that default to filtering out newspaper articles are essentially saying that in most cases the relevancy ranking isn't good enough to know when to display the right format types.

Interestingly enough, Summon 2.0 seems to try to address this with spotlighting/grouping of newspaper articles.

This grouping of newspaper articles has a "mini-bento" effect (see later), and it will be interesting to see if libraries start removing the exclude newspaper articles checkbox from their default searches with this feature in place.

While Ebsco discovery service, like Summon, does not allow tuning of the relevancy ranking, it generally allows libraries more options in terms of default settings, including

  • Apply related words
  • Also search within the full text of the articles
  • Available in library collection (if Off - items not available to the library directly would be shown)

While most libraries like MIT Library and Georgia State University library by default turn on "Also search within the full text of the articles", I notice some libraries have chosen not to do so.

Georgia State University library EDS by default searches in full-text

For example, this library does not seem to turn on full text indexing

It's somewhat interesting that in EDS, for historical reasons, you actually have to turn on full text matching rather than it being the default, so it's possible some libraries have left it off by accident.

Then again, I know at least one library has explicitly decided not to turn on full text matching in EDS, because they claimed to find the results are generally worse.

If true, again, I find this choice fascinating, since one of the key points of web scale discovery services is being able to match in the full-text.

I would add that Summon does not have the option to search within metadata only (though some librarians on the Summon mailing list have asked for it). I have personally simulated "metadata only" searches by matching keywords in title OR subject OR abstract (this matches the default option in Scopus but of course is not a complete metadata only search). I find the results can often be reliably superior for certain limited classes of searches, so there may be something in leaving out full-text matching occasionally.
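A minimal sketch of how such a simulated "metadata only" query could be assembled. The field names (Title, SubjectTerms, Abstract) and the field:(terms) syntax are assumptions for illustration only; check the fielded search syntax your own discovery service actually supports.

```python
def metadata_only_query(keywords, fields=("Title", "SubjectTerms", "Abstract")):
    """Build a fielded OR query that avoids full-text matching.

    The default field names are hypothetical placeholders, not a
    documented list for any particular discovery service.
    """
    # Collapse repeated whitespace so the same terms go into every clause.
    terms = " ".join(keywords.split())
    if not terms:
        raise ValueError("keywords must not be empty")
    clauses = [f"{field}:({terms})" for field in fields]
    return " OR ".join(clauses)


print(metadata_only_query("urban geography"))
```

The resulting string can be pasted into the search box, which makes it easy to compare the same keywords with and without full-text matching.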

2. Best bets / Placards / Known item call-outs

When we launched our version of Summon, one of the complaints we received was about difficulty with known item searches, even though we had expected this and tried to adjust for it by excluding newspaper articles and book reviews by default (see above).

For example, we had a complaint that someone was unable to find the link to the journal Urban Geography because Summon was showing all books with that title in the top 10 and not a link to the journal record.

Our tests prior to launch did show that the vast majority of our 100 most searched journal titles (drawn from Encore, our old discovery catalogue) did indeed yield a link to the journal record in the top 10. Side note: we have a one record approach for journals, cataloguing each journal title in a single record that combines all print and electronic holdings regardless of vendor.
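This kind of pre-launch check can be sketched as a small script. Everything here is hypothetical: `search` stands in for whatever function returns the ranked record identifiers from your discovery service for a query, and the record identifiers are made up.

```python
def top_k_hit_rate(expected, search, k=10):
    """Check whether each known item appears in the top k results.

    expected: dict mapping a query string to the record id we hope to see.
    search:   callable taking a query and returning a ranked list of ids
              (a stand-in for a real discovery service call).
    Returns (hit rate, list of queries that missed).
    """
    if not expected:
        return 0.0, []
    misses = [q for q, record_id in expected.items()
              if record_id not in search(q)[:k]]
    hit_rate = (len(expected) - len(misses)) / len(expected)
    return hit_rate, misses


# Toy stand-in for a discovery service, for demonstration only.
fake_results = {
    "nature": ["j-nature", "b-101"],
    "urban geography": ["b-201", "b-202", "b-203"],
}
rate, missed = top_k_hit_rate(
    {"nature": "j-nature", "urban geography": "j-urbgeo"},
    lambda q: fake_results.get(q, []),
)
print(rate, missed)
```

The list of missed queries is the useful part: those are the titles that need some other fix.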

But Urban Geography was one of those journal titles that failed this test.

At the time, we couldn't do much about it. Then, a few days later in late December 2012, Serials Solutions launched the "best bets" feature.

This allowed us to create messages and links to appear when a certain keyword was matched. So naturally I did one for Urban geography.
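Conceptually, a best bet is just a lookup from a trigger keyword to a message and a link. The sketch below is only an illustration of that idea, not how Summon implements it; the rule table, message text and URL are invented.

```python
# Hypothetical rule table: normalized trigger -> (message, link).
BEST_BETS = {
    "urban geography": (
        "Looking for the journal Urban Geography?",
        "https://library.example.edu/journals/urban-geography",
    ),
}


def normalize(query):
    """Lowercase and collapse whitespace so trivial variants still match."""
    return " ".join(query.lower().split())


def best_bet_for(query, rules=BEST_BETS):
    """Return the (message, link) pair for a query, or None if no rule fires."""
    return rules.get(normalize(query))


print(best_bet_for("  Urban   Geography "))
print(best_bet_for("urban planning"))
```

Because the match is on the whole normalized query, each variant a user might type needs its own rule, which is why this stays a manual process.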

You can see how it appears below.

I've noticed that while Summon is pretty good at displaying the catalogue record or the 360 Core record for journals, usually at the top, it sometimes fails for journals with generic one- or two-word titles.

Obviously, it correctly handles "Nature" and "Science", but there are journal titles like Oncology (0030-2414), for example, that it fails to bring to the top (for our Summon instance at least). Sometimes the user types only part of the title, e.g. the record is Clinical oncology : a journal of the Royal College of Radiologists, but the user expects "Clinical oncology" to pull it up.

That's where best bets come in.

For database searches, you can also use the database recommender in Summon, though currently not every database you subscribe to can be added, so if a search doesn't surface our database catalogue record as a top 10 result, I create a best bet for it.

Typically the catalogue record to the database fails to appear when the user types some slight variant of the database title in our record.

You may be wondering why I don't just modify the catalogue record for journals or databases to include variant names etc. I've actually tried that, but it often seems to have very little impact on the ranking.

An example for us is arXiv.

It may be possible to compare with other Summon instances to see whether their record ranks higher and, if so, whether something is different about their MARC record. But so many factors are involved (what other material is turned on etc) that it's easier just to add a best bet.

As mentioned earlier, users also have problems finding specific books. Generally these searches fall into two patterns:
  • user types *book title* + *author*
  • user types *partial book title* + *author*
The first usually works (assuming you already remove newspaper articles and book reviews by default), unless the title is really generic and/or the author or book is extremely well known (so that the many book reviews and articles mentioning it appear instead).

A good example of one is 

The irony is in Summon (other web scale discovery systems may or may not differ), often typing just the book title what is history gets you what you want fine, but "helping" by adding the author makes it worse.

Incidentally, it seems EDS also has a feature similar to Summon's best bets; see below for an example from MIT Library.

I am not too sure about the details of EDS's features, though I suspect currently it might be like in Summon - a manual process, where the librarian identifies a needed match, then sets up the message and link.

Obviously, an automated algorithm that suggests known item search matches would be better. For instance, the search would notice a match or partial match with a journal or database title (say 245$a) and suggest a link to it.
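A rough sketch of what such automatic matching might look like, under simplifying assumptions: the title proper (roughly 245$a) is taken to be the text before the first colon, and a "partial match" means the query is a leading run of whole words in that title. This is an illustration only, not a description of any vendor's or library's actual algorithm.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", text.lower()).split())


def title_proper(full_title):
    """Approximate 245$a as everything before the first colon (an assumption)."""
    return full_title.split(":", 1)[0]


def suggest_known_items(query, titles):
    """Return titles matching the query, exact matches first."""
    q = normalize(query)
    if not q:
        return []
    exact, partial = [], []
    for title in titles:
        proper = normalize(title_proper(title))
        if q == proper or q == normalize(title):
            exact.append(title)
        elif proper.startswith(q + " "):
            # Query is a leading whole-word prefix of the title proper.
            partial.append(title)
    return exact + partial


journals = [
    "Clinical oncology : a journal of the Royal College of Radiologists",
    "Oncology",
    "Urban Geography",
]
print(suggest_known_items("clinical oncology", journals))
```

Matching on the title proper rather than the full title is what lets "Clinical oncology" find the long record without also firing for a plain "Oncology" search.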

I believe UIUC's Suggestion system has that level of smarts and is integrated into their Primo Central version.

3. Bento style systems

So far the first 2 implementations don't require much in-house capacity, but the third way, the bento style approach, involves a commitment of some resources.

Out of the box, discovery services provide a standard, one result list approach with all item types interfiled together, which leads to the problems already mentioned, e.g. books, databases and other catalogue items "lost" among newspaper articles and journal articles.

Hence the idea of a bento style system, where you have multiple boxes of different content (sometimes by format) displayed on the same page.

Today this is a common idea; libraries such as Princeton, Dartmouth and Columbia all provide this style of display.

To me, the innovators in this space were NCSU and Villanova University's VuFind implementation.

It seems to me that right now we are split into 2 different types of implementations from the functional point of view. (See different degrees of technical implementations.)

There is a 2 column display approach, first implemented by Villanova University, with one column for the catalogue results, often dubbed books & more (data might be drawn from the catalogue, or from the discovery service properly filtered), and another for the article level results (typically drawn from an article index).

Then there is what Lorcan Dempsey dubs Full library discovery, which typically has a number of result lists, including not just books and articles but also result lists drawn from silos like

  • Database & journal title lists
  • Library webpages
  • LibGuides
  • FAQs
  • Librarian profiles
  • Institutional repositories
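At its core, a bento page sends the same query to several independent sources and shows the top few hits from each in its own box. The sketch below illustrates that idea only; the box names and the stand-in search functions are hypothetical, and a real implementation would call the catalogue, the article index, LibGuides and so on through their own interfaces.

```python
def bento_search(query, sources, per_box=3):
    """Run one query against every source and keep the top hits per box.

    sources: dict mapping a box label to a callable that takes the query
             and returns a list of results ranked by that source.
    Each box keeps its own source's ranking; nothing is interfiled.
    """
    return {label: search(query)[:per_box] for label, search in sources.items()}


# Stand-in sources for demonstration only.
demo_sources = {
    "Books & more": lambda q: [f"book about {q} #{i}" for i in range(1, 6)],
    "Articles": lambda q: [f"article on {q} #{i}" for i in range(1, 6)],
    "LibGuides": lambda q: [],
}

for box, hits in bento_search("urban geography", demo_sources, per_box=2).items():
    print(box, hits)
```

Since each box relies on its own source's ranking, no system has to decide whether a book outranks a newspaper article for a given query.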

One way to explain this trend is that, according to a Bibliographic Wilderness blog post, there is mixed evidence on whether the blended/one result list style is what users want.

If we take the Full library discovery view, we can say we are evolving towards an approach where the search combines our typical content with library services and expertise.

While such benefits are real, both approaches also seem to mistrust the ability of the relevancy system to appropriately rank contents of varying and disparate types. Depending on the approach, one can just plug in the article index from the discovery service and rely on other systems such as the ILS for ranking of catalogue results.


I don't want to give the impression that the relevancy ranking of web scale discovery services is horrible. I think it is usually serviceable but can be much improved (though perceptions and expectations of librarians vary).

Rather, the challenge facing discovery vendors is a big one: they need to rank across huge stores of data and content formats (each with different amounts of metadata and full-text) and across all subject domains (it's easier to rank results for subject specific databases since there is no ambiguity about what a term means), so holes do exist.

Add the challenge of telling when the user is likely to be doing a known item search or a subject search and the challenge mounts.

And by virtue of the stated aims of discovery services, they compete directly with Google and Google Scholar, which have world-class relevancy (to put it mildly), so users have very high expectations.

It is no wonder there is dissatisfaction with relevancy ranking and librarians do what they can to help out.

BTW, if you want to keep up with articles, blog posts, videos etc on web scale discovery, do consider subscribing to my custom magazine on Flipboard or looking at the bibliography on web scale discovery services.


Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.