Saturday, September 19, 2015

[Research question] What percentage of citations made by our researchers is to freely available content?

I recently signed up for a "research methods" class whose aim is to help practitioners like me produce high quality LIS papers. Inspired slightly by Open Science methods, I will blog my thoughts on the research question I am working on. Writing this helps me clarify my thoughts, and of course I am hoping for comments from you if my thoughts have piqued your interest. 

The initial motivation

The idea began as a piece of work I was asked to do. Basically, I was doing a citation analysis of citations made by our researchers to aid collection development. The idea here was to see if there was a good fit between our collection and what users are using, and to gauge potential demand for Document Delivery/Inter-Library Loan (DDS/ILL).

It's a somewhat old-school type of study, but generally the procedure goes as follows:
  • Sample citations made by your researchers to other items
  • Record what was cited - typically you record age of item, item type cited, impact factor of journal title etc.
  • Check if the cited item is in your collection
Papers like Hoffmann & Doucette (2012), "A review of citation analysis methodologies for collection management", give you a taste of this literature if you are unfamiliar with it.
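To make the procedure concrete, here is a minimal sketch of the kind of record-keeping involved. All field names and the helper function are my own, purely illustrative; actual studies record more (impact factor, item type breakdowns, etc.):

```python
from dataclasses import dataclass

@dataclass
class CitedItem:
    title: str
    cited_item_year: int    # publication year of the cited item
    citing_paper_year: int  # publication year of the citing paper
    item_type: str          # "journal article", "book", "report", ...
    in_collection: bool     # held in print or via subscription

def collection_coverage(items):
    """Percentage of cited items held in the library's collection."""
    if not items:
        return 0.0
    held = sum(1 for i in items if i.in_collection)
    return 100.0 * held / len(items)

sample = [
    CitedItem("A", 2000, 2003, "journal article", True),
    CitedItem("B", 2010, 2014, "journal article", False),
    CitedItem("C", 2012, 2015, "book", True),
]
print(round(collection_coverage(sample), 1))  # 66.7
```

Recording the citing paper's year alongside the cited item's year also lets you compute the age of each cited item at the time of citing, which matters later.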

The impact of free

But what does "in your collection" mean? This of course would include things you physically hold and subscribed journal articles etc.

But it occurred to me that these days our users can also often obtain what they want by searching for free copies, and as the open access movement starts to take hold, this is becoming more and more effective.

In fact, I did it myself all the time when looking for papers, so I needed to take this into account.

In short, whatever couldn't be obtained through our collection and was not free would arguably be the potential demand for DDS/ILL.

(In theory there are other ways, legal and illegal, to gain access, such as writing to the author, access via co-authors/secondary affiliations or, for the trendy ones, #icanhazpdf requests on Twitter.)

How do you define free?

As a librarian with some interest in open access, I am aware that much ink has been spilled over definitions.

There's green/gold/diamond/platinum/hybrid/libre/gratis/delayed etc. open access. But from the point of view of a researcher doing the research, I simply don't care. All I want to know is whether the full text of the article is available for viewing at the time I need it. It could be "delayed open access" (often argued to be a paradoxical term), but if it's accessible when I need it, it's as good as any.

What would an average researcher do to check for any free full text?

Based on various surveys and anecdotes from talking to faculty in both my current and former places of work, I know Google Scholar is very popular with users.

It also just happens that Google Scholar is an excellent tool for finding free full text, and we have a recent Nature survey showing that when there is no free full text, more users will search Google or Google Scholar for it while a smaller number will use DDS/ILL.

As such it's not a leap to expect the average researcher would probably use Google Scholar or Google to check for free full-text. 

So one would have to factor in the free availability of the item, which could be determined by simply checking Google Scholar, and add that to what is in our "collection" (defined as physical copies and subscribed material).

Whatever remained that couldn't be explained by these two sources was the potential demand for DDS/ILL.
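The attribution logic just described can be sketched as a simple waterfall, where each cited item is credited to the first source, in priority order, that can supply it. This is a hypothetical sketch (names and data are illustrative); note that swapping the priority order changes which source gets credit for items available both ways:

```python
def attribute(items, priority=("collection", "free")):
    """Credit each cited item to the first source, in priority order,
    that can supply it; anything left over is potential DDS/ILL demand."""
    counts = {"collection": 0, "free": 0, "ill": 0}
    for item in items:
        for source in priority:
            if item.get(source):  # e.g. item["free"] = free copy found in Google Scholar
                counts[source] += 1
                break
        else:  # no source could supply it
            counts["ill"] += 1
    return counts

cites = [
    {"collection": True,  "free": True},   # available both ways
    {"collection": True,  "free": False},
    {"collection": False, "free": True},
    {"collection": False, "free": False},  # potential ILL demand
]
# Library-first attribution credits overlap items to the collection...
print(attribute(cites))                                   # {'collection': 2, 'free': 1, 'ill': 1}
# ...while free-first attribution credits them to free sources instead.
print(attribute(cites, priority=("free", "collection")))  # {'collection': 1, 'free': 2, 'ill': 1}
```

The ILL remainder is the same either way; only the split between collection and free changes with the assumed priority.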

Preliminary results 

I'll talk about the sampling method later, but essentially, in my first exploratory data collection (with the help of colleagues), I found that of the citations made, 79.7% were to items in our collection (either print or subscribed), and another 13.4% of the remaining cited items were freely available by searching Google Scholar.

But the figures above presume a library-centric viewpoint and assume that users check our collection first and turn to free sources only if an item is unavailable there. Is this a valid assumption?

As one faculty member I discussed the results with said, "correlation does not imply causation": just because they cited something that could be found in our collection didn't mean they used it. In fact, given the popularity of Google Scholar and the convenience of using it, it might be just as likely they accessed the free copy, especially if they were off campus and using Google.

Librarians who are familiar with open access will immediately say: wait a minute, not all free full text is equal, especially self-archived copies. Some are preprints, some postprints, some final published versions (differing, for example, in whether they carry page numbers).

There could in theory be very big differences between preprints and the final published versions, and if you only had the postprint version you should cite it differently from the final published version.

According to a survey in Morris & Thorn (2009), researchers claim that when they don't have access to the published version, 14.50% would rarely, and 52.70% would never, access the self-archived versions.

This implies researchers usually don't try to access self archived versions that aren't final published version.

Still, this is self-reported usage, and one suspects that in the service of convenience, many researchers would be happy with any free version of the text and just cite it as if they had read the final published version...

For example, in the Ithaka S+R US Faculty Survey 2012, over 80% say they will search for a freely available version online, more than the proportion using ILL/DDS. Are these 80% of faculty only looking for freely available final published versions? That seems unlikely to me.

 Ithaka S+R US Faculty Survey 2012

Let's flip it around for the sake of argument: how do things look if we assume users access free items (whether preprint/postprint/final version) as a priority and consult the library collection only when forced to?

As seen above, for the same sample, a whopping 80.4% of cited items can be found for free in Google Scholar, supplemented by another 12.7% from the collection.

As we will see later, this figure is probably a big overestimate and I don't want to get hung up on it. Still, it is very suggestive (if we can trust it), because it tells you that even if our user did not have access to our library collection, he could still find and read the full text of 80% of the items he wanted!

It then dawned on me that this figure is actually of great importance to academic libraries. Why?

Why the amount of cited material that is free is a harbinger of change for academic libraries

One of the areas I've been mulling over in the past year is the impact of open access on academic libraries. It was clear to me, based on the Ithaka S+R US Faculty Surveys, that faculty currently highly value the role of the library as a "wallet", and that this was going to change drastically when (if?) open access becomes more and more dominant.

Still, timing is everything: you don't want to run too far ahead of where your clients are. So there is a need to tread carefully when shifting resources.

I wrote, "how fast will the transition occur? Will it be gradual, allowing academic libraries to slowly transition operations and competencies, or will it be a dramatic shift catching us off-guard?

What would be some signals or signs that open access is gaining ground and that it might be time to scale back on traditional activities? Downloads per FTE for subscribed journals starting to trend downwards? Decreasing library homepage hits? At what percentage of annual output being open access do you start scaling back?"

It came to me that the figure calculated above, the % of cited items that could be found free in Google Scholar , could serve as a benchmark for determining when the academic libraries' role as a purchaser would be in danger.

In the above example, if indeed 80% of what researchers want to cite is free at the time of citing, the academic library's role as a purchaser would be greatly reduced, such that users would need you only 2 out of 10 times! Is that really true?

Combining citation analysis with open access studies

My research question can be seen as a combination of two different strands of research in LIS.

First there is the classic citation analysis studies for collection development uses that was already mentioned.

Second there is a series of studies in the open access field that focused on determining the amount of open access available throughout the years. 

The latter area, has accumulated a pretty daunting set of literature trying to estimate the amount of open access material available. 

Of these studies, there's a subset that typically samples from either Scopus or Web of Science and checks whether free full text is available in Google Scholar/Google or some combination; these bear the closest resemblance to my proposed idea.

For each study: percentage of free full text found, sample, where the search was done, and coverage of articles checked.

  • Björk et al. (2010) - 20.4% free. Sample drawn from Scopus; searched in Google; 2008 articles, searched in Oct 2009.
  • Gargouri et al. (2012) - 23.8% free. Sample drawn from Web of Science; a "software robot then trawled the web"; 1998-2006 articles searched in 2009, and 2005-2010 articles searched in 2011.
  • Archambault et al. (2013) - 44% free (for 2011 articles). Sample drawn from Scopus; searched in Google and Google Scholar; 2004-2011 articles, searched in April 2013. A "ground truth" of 500 hand-checked articles published in 2008 found 48% freely available as at Dec 2012.
  • Martín-Martín et al. (2014) - 40% free. 64 queries in Google Scholar, collecting 1,000 results; searched in Google Scholar; 1950-2013 articles, searched in May and June 2014.
  • Khabsa & Giles (2014) - 24% free. Randomly sampled 100 documents from MAS in each field to check for free copies, multiplied by the estimated size of each field (determined by a capture-recapture method); searched in Google Scholar; all years(?), search date not stated.
  • Pitol & De Groote (2014) - 58% free. Drawn randomly from Web of Science; for institution C, 50 items not already in the IR, checked in Google Scholar; 2006-2011 articles. The abstract reports 70% free full text across institutions A, B and C, where for A and B the random sample drawn from WoS also included copies already in the IR.
  • Jamali & Nabavi (2015) - 61% free. 3 queries in Google Scholar for each Scopus third-level subcategory, checking the top 10 results for free full text; 2004-2014 articles, searched in April 2014.

I am still mulling over the exact details and methodological differences of each paper, but the overall percentages are suggestive, ranging from 20% to 61%, with the later studies generally showing a higher percentage.

Martín-Martín et al. (2014) and Archambault et al. (2013) in particular strike me as very rigorous studies, and both show that around 40%+ of full text is available.

But obviously our 80% figure is far above the expected upper bound. Why?

Big issues with methodology

Here is where I mention the big problem I have. 

First, the sample I drew from consisted of citations made in papers published 2000-2015. The field was Economics, and the search for free items was done in September 2015.

The first obvious issue is that when I check if something is free in Google Scholar, I am only checking what is free now.

This is okay if all I care about is knowing what percentage is free now.

But from my point of view, I want to know how much was free, at the time the researcher was citing it rather than many years later.

So for example take a paper A written in 2003 that cites a paper B written in 2000.

Today (September 2015 as I write this), I determine that paper B is free and findable via Google Scholar. The obvious issue, of course, is that while it is free now, it might not have been free in 2003 when the author was doing his research!

Whether an article was free at the time the author was writing the paper depends on 

a) When the writer was writing up the paper
b) The age of the article he was citing at the time

The interaction of these two factors makes it very confusing, as there is a host of factors affecting whether something is free at a given time. A short list includes a) journal policies on embargoes for self-archiving, b) uptake of possibly-illegal options like ResearchGate, and c) the general momentum towards both green and gold open access at the time.

Is there a solution?

Honestly, I am not sure. I can think of many ideas to try to fix it, but they may not work.

First off, I could just forget about longitudinal studies and focus on citations made in papers published within a short window, say within 6 months of the searches done today in 2015, to reduce such timing effects. But even this isn't perfect: one can quibble that publication dates tell us little about when the writing was actually done, as publishing can have long lead times.

Another way is to carefully examine the source where the full text was found, and hope that the source has metadata on when the full text was uploaded.

For example, some institutional repositories or subject repositories might have indications of when the full text was uploaded (e.g. "Number of downloads since DD/MM/YY").

Full text uploaded to DSpace, with a "Downloads since" indicator

Based on studies like Jamali & Nabavi (2015) and Martín-Martín et al. (2014), we know that, surprisingly, one of the major sources of free full text is items uploaded to ResearchGate (ranked 1st in the former and 2nd in the latter), so this could be a big sticking point.

That said, looking around ResearchGate, I noticed that, surprisingly, it does list when something was uploaded.

Was this article uploaded to Researchgate on Jun 6, 2016?

Edit: @varnum suggested a great idea of checking via the Internet Archive's Wayback Machine. It works for some domains, like .edu domains, which helps a little when someone puts a PDF up on university web space.

A PDF on the Duke domain existed in 2011 according to the Wayback Machine.
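This check could be automated with the Internet Archive's availability API (archive.org/wayback/available), which returns the archived snapshot closest to a given timestamp. A sketch under the assumption that the JSON response has the documented "archived_snapshots"/"closest" shape; the helper names are my own:

```python
import json
import urllib.parse
import urllib.request

def was_archived_by(api_response: dict, year: int) -> bool:
    """True if the closest snapshot reported by the availability API
    is dated in or before the given year."""
    snap = api_response.get("archived_snapshots", {}).get("closest")
    if not snap or not snap.get("available"):
        return False
    return int(snap["timestamp"][:4]) <= year  # timestamp is YYYYMMDDhhmmss

def check_url(url: str, year: int) -> bool:
    """Query the availability API for the snapshot closest to 1 Jan of `year`."""
    query = ("https://archive.org/wayback/available?url=%s&timestamp=%d0101"
             % (urllib.parse.quote(url, safe=""), year))
    with urllib.request.urlopen(query) as resp:
        return was_archived_by(json.load(resp), year)

# A canned response in the API's shape (illustrative values):
sample = {"archived_snapshots": {"closest": {
    "available": True, "url": "http://web.archive.org/web/...", "timestamp": "20110305120000"}}}
print(was_archived_by(sample, 2011))  # True
print(was_archived_by(sample, 2010))  # False
```

The obvious caveat: the Wayback Machine's coverage of repository and publisher pages is patchy, so a missing snapshot doesn't prove the item wasn't free at the time.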

Another idea was a blanket rule: if the paper cites something less than 2 years old at the time, and a free full text is found now (and it was not published in a gold OA journal), assume it wasn't free then, as many journals allow published versions or postprints to be posted only 2 years after publication.

This will undercount for various reasons, of course, not least of which is illegal copies.
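The blanket rule could be coded as a simple heuristic. The 2-year cutoff here is just the rule of thumb above, not a verified embargo length for any particular journal, and the function names are illustrative:

```python
def assume_free_at_citing(cited_year: int, citing_year: int,
                          free_now: bool, gold_oa: bool,
                          embargo_years: int = 2) -> bool:
    """Heuristic: treat an item that is free today as NOT free at citing time
    if it was under `embargo_years` old when cited and not in a gold OA journal."""
    if not free_now:
        return False  # not free even today
    if gold_oa:
        return True   # gold OA journals are free from publication
    age_when_cited = citing_year - cited_year
    return age_when_cited >= embargo_years

# A 2003 paper citing a 2002 article that is free today, not gold OA:
print(assume_free_at_citing(2002, 2003, free_now=True, gold_oa=False))  # False
# The same article cited in 2006, after the assumed embargo has passed:
print(assume_free_at_citing(2002, 2006, free_now=True, gold_oa=False))  # True
```

A per-journal embargo length (e.g. looked up from SHERPA/RoMEO) could be passed in via `embargo_years` instead of the blanket value.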

A more nuanced approach would be to take into account the policies listed in SHERPA/RoMEO. But journal publishers' policies change over time too.


The more I lay out my thoughts, the more I wonder if my idea is fatally flawed. It would be great to be able to get the figure (the percentage of items cited that could be freely obtained at the time of citing), but it may just be that such a figure is impossible to get accurately enough, even if one limited one's sample to a short window around the time of searching.

What do you think? 

Edit: This post is written from the librarian's point of view of reacting to changes in research behavior, and is neutral on whether academic librarians should be revolutionaries or soldiers in the open access movement.

Sunday, August 23, 2015

Things I learnt at ALA Annual Conference 2015 - Or data is rising

I had the privilege to attend ALA annual conference 2015 in San Francisco this summer. This was my 2nd visit to this conference (see my post in 2011) and as usual I had lots of fun.

Presenting at  "Library Guides in an Era of Discovery Layers" Session

My ex-colleague and I were kindly invited to present on our work on a bento-style search we implemented for our LibGuides search.

For technical details, please refer to our joint paper Implementing a Bento-Style Search in LibGuides v2 in the July issue of Code4Lib.

See the Storify of event at

Data is rising 

Before I attended ALA 2015, I was of course aware that  research data management was increasingly an important service academic librarians are or should be supporting.

To be perfectly frank though, it was a hazy kind of "aware".

I knew that grant-giving organizations like NIH and other funders were increasingly requiring researchers to submit data sharing plans, so that was an area where academic librarians could provide support, particularly if open access takes hold, since that would make many traditional tasks obsolete.

Also, I knew there was all this talk about supporting Digital Humanities and GIS (geographic information systems) services, such that my former institution began appointing Digital Humanities and GIS librarians just before I left.

Perhaps closer to my wheel-house given my interest in library discovery, there was talk about Linked data and BIBFRAME which isn't research data management per se.

All three areas were emerging ones that I knew, or strongly suspected, would be important, but I was unsure about the timing or even the nature (see later).

Add to that the "stewardship duty of libraries" towards the "evolving scholarly record" (what counts as the scholarly record is now much expanded beyond just the final published article, and libraries need to collect and preserve it), and you can see why data is a word librarians are saying a lot more.

Still, attending ALA Annual 2015 made me wonder if a tipping point has finally been reached and I should start looking at it more deeply.

Is Linked data finally on the horizon?

While attending Marshall Breeding's session, "The Future of Library Resource Discovery: Creating new worlds for users (and librarians)", I heard him ask this very question.

Breeding's observation was indeed apt, though one's choice of sessions obviously has an impact; for example, this blogger wonders if the overdose of linked data talk is simply due to her interests.

Still, this year there seemed to be quite a lot of talk on linked data and Bibframe. Perhaps a tipping point has been reached?

I think part of it is due to the fact that ILS/LMS/LSP vendors have begun to support linked data.
This breaks the chicken-and-egg problem: people say there is no interest in using linked data because there are no tools for it, and no tools are made because no one is interested.

The biggest announcement was on Intota v2 - ProQuest's cloud-based library services platform

"Intota v2 will also deliver a next generation version of ProQuest's renowned Knowledgebase. Powered by a linked data metadata engine, Intota will allow libraries to participate in the revolutionary move from MARC records to linked data that can be discovered on the web, increasing the visibility of the library." - Press release

I was actually in attendance during the session but left before it was demoed (I'm kicking myself for that). The tweet below is interesting as well.

Of course, we can also expect Summon to start taking advantage of linked data to enhance discovery via Intota.

Besides ProQuest, SirsiDynix announced it would "produce BIBFRAME product in Q4 2015", while Innovative had pledged support for the Libhub Initiative a few months earlier.

OCLC, of course, has always been an early pioneer in linked data.

"Nobody comes to librarians for literature review?"

As part of my attempt to balance sessions in areas I was really interested in (and hence likely to be well versed in most of the things shown) against sessions I was totally unfamiliar with (and hence likely to have most things go over my head), I decided to go to some GIS sessions.

I accompanied my ex-colleague and co-presenter to a couple of sessions on GIS (Geographic Information Systems), which he has an interest and passion in and is currently tasked with trying to start something up for in the library.

I attended various sessions, including a round-table session which focused more on what libraries were doing, as opposed to the more technical sessions. It was clear from the start that some academic libraries in the US were far more advanced than others, such as Princeton, where I believe a librarian stated that libraries have been managing data for over 50 years and it's not a new thing to them.

Much nodding of heads occurred when someone warned about jumping on the bandwagon simply because their University Librarian thought it was a shiny new thing.

Many talked about staffing models and how to fit liaison librarian vs specialist roles into these new areas, which is a perennial issue whenever a new area emerges (e.g. it was promoting open access the last time around for many academic libraries).

One librarian stated that helping faculty handling research data is important because "nobody comes to us anymore for literature searches".

Of course, this immediately drew a response from, I believe, a social science (or was it medical?) librarian, who said that faculty do come to them for both literature reviews and data sets! :)

Why searching for data is the next challenge

Ex Libris has been sharing the following diagram at various conferences recently, listing five things users expect to be able to do.

Of the five tasks above, I would say the greatest challenge right now is to "obtain data for a research project", which can be seen as a different class of problem compared to the other four tasks, which broadly speaking involve finding text-based material.

I would think this is because, over the years, improvements in search technology (from the "physical only" days, to the early days of online, to Google Scholar and web-scale discovery now), coupled with easily over a century of effort and thinking on how to organize and handle text, have made searching for text, in particular scholarly texts (peer-reviewed articles especially), if not a completely solved problem, at least one that isn't so daunting that most academics would recoil in terror and ask for help.

Yet the difficulty of searching for data sets/statistics is, I would say, about the same as that of searching for articles in the 1980s to 1990s. While the latter has improved by leaps and bounds, the former hasn't moved much.

Lack of competition from Google? 

Having worked in a business/management-oriented university for 5 months, I am starting to appreciate how much more difficult it is to get datasets in, say, finance, and I know many librarians, including myself, feel a sinking feeling in our stomachs when asked to find them.

Firstly, the interfaces to get the data out of them are horrendous. Even the better ones are roughly at the level of the worst article searching interfaces.

This is partly, I suspect, because without Google to put pressure on these databases, there is no incentive to improve. Competition from Google, I believe, has driven the likes of EBSCO, ProQuest etc. to converge on pretty much the same usable design, or at least a Google-like design that takes little adjusting to.

Today, the UI you see in Summon, Web of Science, Scopus, Ebsco platforms etc is pretty much the same, and you practically can use it without any familiarity. (See my post on how library databases have evolved most in terms of functionality and interface to fit into the google world).

Google's relentless drive to improve user experience has pushed libraries to try to keep up. You could say the EBSCOs of the world were practically forced to improve or die from irrelevance as students flocked to Google.

Of the databases that libraries subscribe to, the worst ones typically belong either to the smallest outfits or to ones that primarily serve other, non-library sectors.

So the likes of Bloomberg, Capital IQ, T1 and even many law databases such as LexisNexis have comparatively harder-to-use designs.

They can get away with this because of the lack of competition from Google, and also because these are primarily work tools: professionals are proud of the hard-earned Bloomberg skills, say, that give them a competitive advantage.

When it comes to non-financial data, it becomes even more challenging, since there aren't many well-known repositories of data (at least, known to a typical librarian not immersed in data librarianship) that one should look at. Google is of limited help here, surfacing the usual well-known open data sources like the World Bank and the UN.

How researchers search for public data to use

A recent Nature survey asked researchers how they find data to use.

The article noted that no method predominated, with checking references in articles as common a method as searching databases. Arguably this points to the fact that

a) databases of data are not so well known
b) databases of data are hard to use (due to lack of comprehensiveness of data or poor interfaces).

Of course, this survey question asks about "public data" to reuse.

Researchers often approach me about using data (for content analysis) from databases we license, such as newspaper and article databases. This seems yet another area that academic libraries can work on; leading libraries like NCSU Libraries have taken on the task of negotiating access to data from the likes of Adam Matthew and Gale.

Confusion over what libraries can or should do with data

Like any new area academic libraries are trying to get involved in (thanks to reports like NMC's Horizon Report - Library Edition listing this area as an increased focus), there is a lot of confusion over the skill sets, roles and responsibilities needed.

What a "data librarian" should do is not a simple question, as this can span many areas.

In Hiring and Being Hired. Or, what to know about the everything data librarian, a librarian talked about how his responsibilities ballooned and how "everything data librarians don't actually exist".

He points out that many job ads for data librarians actually comprise 5 separate roles:
  •  Instruction and Liaison Librarian
  •  Data Reference and Outreach Librarian
  •  Campus Data Services Librarian - (this job is most associated with Scholarly communication)
  •  Data Viz Librarian (Learning Technologist)
  • The Quantitative Data Librarian (Methods Prof)

I can smell the beginnings of what the Library Loon dubs "new-hire messianism", where a new hire is expected to possess an impossible number of skill sets, work in an indifferent or even hostile environment, and almost single-handedly push for change with limited or no resources or authority.

Obviously no one staff member should be "responsible for data". I've been reading about the concept of "tiers of data reference" and thinking about how to improve in this area.


Like most academic librarians, I am watching developments closely and trying to learn more about these areas.

Thursday, July 16, 2015

5 things Google Scholar does better than your library discovery service

I have had experience implementing Summon in my previous institution and currently have some experience with EDS and Primo (Primo Central).

The main thing that struck me is that while they have differences (e.g. the default Primo interface is extremely customizable, though it requires lots of work to get into shape, while Summon is pretty much excellent UI-wise out of the box but less customizable, and EDS is basically Summon but with tons of features already included in the UI), they pretty much have the same strengths and weaknesses vis-à-vis Google Scholar.

So far, my experience with faculty here in my new institution is similar to that at my former one: more and more of them are shifting towards Google Scholar and even Google.

Though web-scale discovery is our library's closest attempt at mimicking Google technology, it is still different, and it is in the differences that Google Scholar shines.

Why is Google Scholar a darling of faculty?

To anticipate the whole argument, Google Scholar serves one particular use case very well - the need to locate recent articles and to provide a comprehensive search.

Library discovery services, meanwhile, are hampered not just by technological issues but also by the need to balance support for various use cases, including known-item searching for book titles, journal titles and database titles.

It is no surprise a jack of all trades tool comes out behind.

Here are some things Google Scholar does better.

1. Google Scholar updates much quicker

One piece of feedback I tend to get is from faculty asking me why their paper (often hot off the press) isn't appearing in the discovery service.

In the early days of library discovery service, often the journal title simply wasn't covered in the index, so that was that.

These days more often than not the journal title would be listed as covered in the index particularly if it was a well known mainstream journal. So why wasn't the particular article in the discovery service?

Unfortunately, I would typically discover that the issue lay with the recency of the article: it was so new it hadn't appeared in the discovery service index yet.

Yet I would notice, time and time again, that whenever an article appeared on, say, Springer, within a day or two it would appear in Google Scholar, while it would take a month or more to appear in our discovery service index.

Google Scholar simply updates very quickly using its crawlers, compared to library discovery services, which may use other, slower methods to update.

Also, I have found that library discovery services often do not index "early access" versions, while Google Scholar, whose harvesters seem to happily grab anything on an allowed publisher domain, has fewer issues.

The discovery service providers might argue that Google Scholar employs almost zero human oversight and quality control, and as such provides less accurate results.

This may be so, but it's unclear if the trade-off is worth it, in today's fast paced world where anxious faculty just want to see the article with their name appear.

2. Covers scholarly material not on the usual "scholarly" sources

Besides speed of updates, Google Scholar shines at identifying and linking to scholarly material even when it is not found on the usual publisher domains.

Take the 2014 experience of a library director who was trying to access a hot new paper on "Evaluating big deal journal bundles".

The library director was smart enough to know it wouldn't appear in the discovery service, and so put in an ILL request for the article. It turns out she could have just used Google Scholar to find a free PDF that the author had linked off his homepage.

Here we see the great ability of Google Scholar's harvester to spot "scholarly" papers (famously with some false positives) even when they reside on non-traditional sites. For instance, it can link to PDFs that authors have posted on their personal homepages (which may or may not be on university domains).

This is something none of our library discovery services even attempt to do. In general, our discovery services build their indexes at a higher level of aggregation, typically journal or database level, so there is no way they would spot individual papers sitting on some unusual domain.

3. Greater and more reliable coverage of Open Access and free sources

It's an irony that discovery services generally have much poorer coverage of open access than Google Scholar.

Let's not even start with hybrid journals, whose articles are often in top journals yet impossible to correctly identify and find in discovery services (I notice the example tested in the article on the difficulty of finding hybrid articles works in Google Scholar).

How about gold OA journals? Most discovery services have indexed DOAJ (Directory of Open Access Journals), but many libraries experience such a bad linking experience (linking may not be at article level and/or leads to broken links) that they just turn it off. (Discovery indexes that cover OAIster might have better luck?)

How about institutional repositories, something created and managed by libraries? On most discovery services, you can typically add only the contents of your own institutional repository, plus a very limited selection of other institutions' repositories (always on the same discovery service).

Usually you can add only the libraries that have volunteered to open their institutional repositories to other customers on the same discovery service and this is a very short list (probably a dozen or so).

The list is even shorter when you realise some of these institutions are not wholly full text and the discovery service makes it difficult to offer only full text items from these Institutional repositories when you activate them, so you are eventually forced to turn them off.

I am not well versed enough in institutional repositories and OAI-PMH to understand why it is so difficult to figure out which items listed in them are full text or not, but all I can say is that Google Scholar's harvesters have no such issues identifying free full text and making it available. I would add that some of it is not quite legal (e.g. look at the PDFs surfacing in Google Scholar from ResearchGate and similar sites).
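To see why this is genuinely hard, consider what a harvester actually has to work with. Unqualified Dublin Core records exposed over OAI-PMH have no reliable "this is full text" flag, so a harvester is reduced to heuristics. The sketch below is a naive illustration of such a heuristic (the record structure and URLs are made up for illustration, not any real repository's API):

```python
def looks_like_full_text(record):
    """Naive guess at whether an OAI-PMH Dublin Core record points to full text.

    `record` is a dict with 'identifiers' and 'formats' lists, as harvested
    from <dc:identifier> and <dc:format> elements. Unqualified Dublin Core
    gives no reliable full-text flag, which is exactly the problem:
    we are reduced to guessing from file extensions and MIME types.
    """
    for ident in record.get("identifiers", []):
        if ident.lower().endswith((".pdf", ".doc", ".docx", ".ps")):
            return True
    for fmt in record.get("formats", []):
        if fmt.lower() in ("application/pdf", "application/postscript"):
            return True
    return False  # metadata-only record, or full text hidden behind a splash page

# A metadata-only record vs one with a direct PDF link
print(looks_like_full_text({"identifiers": ["http://hdl.handle.net/123/456"]}))        # False
print(looks_like_full_text({"identifiers": ["http://repo.example.edu/1/paper.pdf"]}))  # True
```

A heuristic like this fails whenever the repository links only to a splash page, which is precisely why Google Scholar's approach of crawling and inspecting the actual pages works so much better.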

Reasons #2 and #3 above are the main reasons why Google Scholar is by far the most efficient way to find free full text, and why apps like the Google Scholar Chrome button and Lazy Scholar are so useful.

4. Better Relevancy due to technology and the need to just support article searching

Going through the few head to head comparisons between Google Scholar and discovery services in the literature (refer to the excellent Discovery Tools, a Bibliography), it's hard to say which one is superior in terms of relevancy, though Google Scholar does come out on top a few times.

My own personal experience is that Google Scholar does indeed have some "secret sauce" that produces better ranking. There are many reasons to suspect it is better: it can personalize, it uses many more signals (particularly the network of links and link text), and it has the sheer technical know-how that made Google the world's premier search company.

A less often expressed reason why Google Scholar seems to do so well is that, unlike library discovery services, it is designed for one primary use case: allowing users to find journal literature.

A library discovery service, on the other hand, has five possible use cases according to Ex Libris.

I would argue library discovery services are handicapped because they need to handle at the very least "Access to known book or journal" + "Find materials for a course assignment" + "Locate latest articles in the field".

Trying to balance all these cases simultaneously (which includes ranking totally different material types such as books, articles, DVDs, microforms, etc.) results in a relevancy ranking that can be mediocre compared to one optimised just for finding relevant journal articles, aka Google Scholar.

During the early days of library web scale discovery, libraries and discovery service vendors learnt a costly lesson: despite the name "discovery", a large proportion of searches (around 50% in most studies I have seen) were for known items. These included known book titles, journal titles and database titles.

Not catering for such users would typically lead to great unhappiness, so you started seeing many discovery service vendors working on their relevancy to support known item searching and adding features like featured boxes and recommenders to help with this.

All this meant that library web scale discovery services would always be at a disadvantage compared to Google Scholar, which focuses on one main goal, discovery of articles; nobody goes to Google Scholar to look for known book titles, journal titles or database titles.

They do go to Google Scholar for known article title searches, but "ranking" such queries is easy given how unique and long article titles tend to be. In any case, doing well for article known item search is less a matter of ranking and more a matter of ensuring the needed article is in the index, and as we have seen above, Google Scholar is superior in coverage thanks to broader sources and faster updates.

5. Nice consistent features

Google Scholar has a small but nice set of features. It has a "related articles" function that you won't find in most web scale discovery services unless you subscribe to bX Recommender.

Many users like the "Cited by" function. Your library discovery service doesn't come with that natively, though mutual customers of Scopus or Web of Science can get citation networks from those two databases.

Because Google Scholar creates its own citation network, it can not only rank better but also provide the very popular Google Scholar Citations service. Preliminary results from this survey seem to indicate Google Scholar Citations profiles are more popular than profiles on ResearchGate and similar services.

But more important than all this is the fact that it is worthwhile to invest in mastering Google Scholar. All major academic libraries support Google Scholar via library links/OpenURL, so you can carry this skill with you no matter which institution you are at.

On the other hand, if you invest in learning the library discovery service interface at your current institution, there's no guarantee you will have access to the same system at your next institution given that there are four major discovery services on the market (not counting libraries that use discovery service apis to create their own interfaces).


Does this mean library web scale discovery services are useless? Not really.

I would argue that web scale discovery tools are designed to be versatile.

While they may come up second best in the following cases

  • In-depth literature review (both Google Scholar and Subject indexes are superior to web scale discovery in different ways)
  • Known item search for books/journal titles/database titles (Catalogues and A-Z journals and database lists are superior)

There is no other tool that can be "pretty good" at all these tasks, hence their popularity with undergraduates who want an all-in-one tool.

Can we solve this issue of being jack of all trades but master of none?

One interesting idea I have heard and read about at various conferences, including EBSCOhost's webinars, is a popup appearing after the user enters keywords and clicks search, asking whether they are trying to find a known item, do a subject search, or some other scenario; based on the answer, the search would execute differently.

Somehow though I suspect it might get annoying fast.

Sunday, May 31, 2015

Rethinking Citation linkers & A-Z lists (I)

I am right now involved in helping my current institution shift towards a new Library Service Platform and discovery service (Alma and Primo) and this has given me an opportunity once again to rethink traditional library tools like citation linkers, A-Z journal and databases lists.

It's pretty obvious such tools need a refresh as they were created

  • before Google/Google Scholar and web scale discovery.
  • in an era where electronic was not yet hugely dominant.

For this post, I will discuss citation linkers and how some vendors and libraries have attempted to update them for the current environment of discovery, followed by a further post on ideas to update the A-Z database and journal lists.

Citation linkers - an outdated tool?

The idea of the citation linker (sometimes known as a citation finder or article finder) was meant to be straightforward: you entered a reference and the library would hopefully link you to the full text of the article via the library's OpenURL resolver.

Most link resolvers, such as Ex Libris's SFX, Innovative's WebBridge and EBSCO's LinkSource, offer a variant of such a tool.
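Under the hood, these tools all do the same thing: turn the fields you type into an OpenURL query against the library's resolver. A minimal sketch of that step is below; the resolver base URL and the citation values are illustrative, but the key/value pairs follow the OpenURL 1.0 KEV convention (ANSI/NISO Z39.88-2004):

```python
from urllib.parse import urlencode

# Hypothetical resolver base URL; a real one looks something like
# http://sfx.myuni.edu/sfx_local - each library has its own.
RESOLVER_BASE = "http://resolver.example.edu/openurl"

def build_openurl(atitle, jtitle, volume, issue, spage, issn, date):
    """Build an OpenURL 1.0 (KEV format) query for a journal article."""
    params = {
        "ctx_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
        "rft.genre": "article",
        "rft.atitle": atitle,
        "rft.jtitle": jtitle,
        "rft.volume": volume,
        "rft.issue": issue,
        "rft.spage": spage,
        "rft.issn": issn,
        "rft.date": date,
    }
    return RESOLVER_BASE + "?" + urlencode(params)

url = build_openurl("Evaluating big deal journal bundles",
                    "PNAS", "111", "26", "9425", "0027-8424", "2014")
```

Notice how many fields have to be supplied (and supplied correctly) for the resolver to stand a chance - which is exactly the usability problem discussed below.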

Below we see some typical citation linkers across different vendors.

Typical citation linker from Proquest's 360 link

Typical SFX citation linker

Typical EBSCO LinkSource Citation finder

Typical Alma Uresolver Citation linker

I first encountered this tool myself pretty late, in 2012, when implementing the suite of then Serials Solutions (now ProQuest) services, including Summon and 360 Link, at my former institution.

Initially, I was totally confused by the fact that simply entering the article title alone would not work! You had to painstakingly enter various pieces of information which even then would often fail to work, depending on the accuracy of the citation fields you entered.

My confusion is understandable because I came upon this tool after the rise of web scale discovery where entering an article title was usually sufficient to get to the full text.

Even after I grasped how it worked, I realized how unlikely it was that a user would be willing to use it, much less use it successfully, since it was much easier to just enter the article title into Google Scholar or a library discovery service.

Sure, as I discussed in Different ways of finding a known article - Which is best? way back in 2012, searching by article title via a discovery index has drawbacks (e.g. it can't find non-indexed items), but it is far easier and more convenient for the user, and if there is anything I have learnt in my years working in libraries, it is that convenience tends to trump everything else.

Can we improve on it? Autocomplete to the rescue

How would I create a citation linker 2.0?

An obvious improvement would be to work on the UX.

One study on the usability of the SFX citation linker noted that while users who tried finding articles via the journal A-Z list had issues, they fared even worse with the citation linker.

They suggested improving the usability of the tool by removing unnecessary fields such as author and article title fields which were usually not used for openurl resolution.

Georgia Tech Library seems to have followed this recommendation: unlike the default SFX citation linker, they hid the various author fields (first name, last name, initial, etc.).

A more interesting proposal to improve the tool was made by Peter Murray way back in 2006 entitled A Known Citation Discovery Tool in a Library2.0 World

"The page also has an HTML form with fields for citation elements. As the user keys information into the form fields, AJAX calls update the results area of the web page with relevant hits. For instance, if a user types the first few letters of the author’s last name, the results area of the web page shows articles by that author in the journal. (We could also help the user with form-field completion based on name authority records and other author tables so that even as the user types the first few letters of the last name he or she could then pick the full name out of a list.) With luck, the user might find the desired article without any additional data entry!"

Essentially he is suggesting that each of the fields in the citation linker would have autocomplete via AJAX to assist the user, along with a "results area" that displays likely articles the user is searching for. He goes on to suggest similar ideas for other fields, such as volume and issue.

"Another path into the citation results via the link resolver: if a user types the volume into the form field, the AJAX calls cause links to appear to issues of that volume in addition to updating the results to a reverse chronological listing of articles. If a user then types the issue into the HTML form field or clicks the issue link, the results area displays articles from that issue in page number order. Selecting the link of an article would show the list of sources where the article can be found (as our OpenURL resolvers do now), and off the user goes."

At the time of the proposal, such a feature was not possible because it would require a large article index to draw results from. Today we of course have web scale discovery systems.
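The server-side logic Murray describes is essentially prefix matching against an article index. A toy sketch (the in-memory list and sample records are made up; a real implementation would query the discovery service's index via AJAX):

```python
# Tiny stand-in for an article index; in practice this would be the
# web scale discovery index, queried over HTTP as the user types.
ARTICLES = [
    {"author": "Bergstrom", "title": "Evaluating big deal journal bundles", "volume": "111"},
    {"author": "Bergman",   "title": "The deep web: surfacing hidden value", "volume": "7"},
    {"author": "Tenopir",   "title": "Use and users of electronic library resources", "volume": "3"},
]

def suggest(field, prefix):
    """Return articles whose `field` starts with `prefix` (case-insensitive).
    Each keystroke in the form would trigger one such call and refresh
    the results area with the matches."""
    prefix = prefix.lower()
    return [a for a in ARTICLES if a[field].lower().startswith(prefix)]

# Typing the first few letters of the author's last name
# narrows the results area, as Murray describes.
matches = suggest("author", "berg")   # Bergstrom and Bergman both match
```

With luck, as Murray says, the user finds the desired article after only a few keystrokes, without filling in the whole form.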

Auto parsing of citations 

One of the weaknesses of citation linkers is that they require the user to parse the citation and enter each piece of information, one by one, into various fields. Not all users are capable of that, or even patient enough to do it.

Why not simply allow users to cut and paste the citation and let the software figure it all out?

Brown University's FreeCite tool allows you to toss in a citation, and it will try to parse out each citation field. I believe there are a few other similar tools out there. The logical idea, of course, is to use this parsed output to fill in the citation linker fields.

This is exactly what UIUC Journal and Article Linker tries to do.

An interesting variant of this is done by EBSCO.

EBSCO has an app called EBSCO Citation Resolver via its new Orbit platform, an online catalog of EBSCO Discovery Service™ apps.

This uses the above-mentioned FreeCite from Brown University to parse references, but instead of passing the data to a traditional citation linker to reach the full text via OpenURL as UIUC does, it passes the data to EDS itself.

As you can see above, the parsed information is sent to EDS for advanced searching using field searching.

We will get back to this example later.

Finding full text by text and voice recognition

Also, why restrict ourselves to cutting and pasting citations? What about other input methods? There used to be an iOS app, I believe from Thomson Reuters' Web of Science, that allowed you to take a photo of a reference and, via the magic of OCR and text recognition combined with a citation parser, link you to the full text.

Unfortunately, I lost track of that app, but I recall it didn't work very well because it was limited to linking you to article entries in Web of Science, and the text recognition combined with the citation parser wasn't that good.

Still as technology advances I think the idea has legs. I have no doubt if Google desires, they can easily set this up to work with Google Scholar.

Now imagine combining this with voice commands such as Google Now, "Ok Google, find me such and such article by so and so in journal of abc".

Output accuracy should improve too.

Making it easy to input the citation is just one part of the equation, making sure full text can be reached is the other.

Coming back to the EBSCO Citation Resolver, an interesting point to note is that after parsing the reference, instead of passing it to a citation linker such as their own LinkSource citation finder (see below), it dumps the information into the discovery service, EBSCO Discovery Service.

Parsed citation did not get passed to LinkSource's article finder

Why would one send the information to the discovery service and not the citation linker tool?

Part of the reason is that linking via OpenURL is often hit and miss in terms of linking to full text.

Some studies put full text linking success at around 80%, due to well known OpenURL issues that initiatives like IOTA and KBART are trying to solve.

Summon and EDS provide more stable forms of linking (often called direct linking, which can work up to 95% of the time) that can be used whenever possible on top of OpenURL. (Note: 360 Link v2.0 provides the same type of direct linking as Summon.)

Add the fact that automatic citation parsers are going to be somewhat inaccurate at text recognition, and it might be easier to employ strategies that extract just the author and article title to work with the discovery service, rather than trying to identify every citation field (e.g. volume, issue, page) to feed the full OpenURL resolver; the latter method is very error prone, requiring a large number of fields to be recognised correctly to work well.

For a third method that uses the CrossRef metadata search API, see "Resolving Citations (we don’t need no stinkin’ parser)".

That said, as more citation styles require DOIs to be included, the work of parsing citations becomes easier, as often the DOI alone is sufficient to get to the full text. I also suspect the increased use of citations generated by reference managers (e.g. Mendeley, Zotero) and the growing support for Citation Style Language (CSL) across styles may eventually make things more consistent and easier for citation parsers.
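Pulling a DOI out of a pasted citation is, in fact, the one part of citation parsing that is nearly trivial. A sketch using a pattern close to the regex Crossref recommends for modern DOIs:

```python
import re

# Pattern close to the one Crossref recommends for matching modern DOIs.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[-._;()/:a-zA-Z0-9]+")

def extract_doi(citation):
    """Pull the first DOI out of a pasted citation, if any."""
    m = DOI_RE.search(citation)
    # A trailing period is usually sentence punctuation, not part of the DOI.
    return m.group(0).rstrip(".") if m else None

def doi_link(doi):
    """Dereferencing this URL redirects to the publisher's page for the item."""
    return "https://doi.org/" + doi

doi = extract_doi("Bergstrom, T. C. et al. (2014). Evaluating big deal journal "
                  "bundles. PNAS, 111(26). doi:10.1073/pnas.1403006111.")
# doi == "10.1073/pnas.1403006111"
```

Once the DOI is in hand, no field-by-field OpenURL resolution is needed at all, which is why DOI-bearing citation styles make the parser's job so much easier.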

I can go further and imagine a hybrid system for output that would even work with Google Scholar for free pdfs + Web Scale Discovery direct linking + Openurl linking to give the best chance of reaching the full text.

You can see this hybrid multiple approach system somewhat in play in the Lazy Scholar extension (supports Chrome and Firefox) that checks Google Scholar for free full text and also offers openurl resolution.

This could work either the way link resolver menus work now, displaying various options, or with some intelligent system in the background deciding whether to use the discovery service or Google Scholar to find the full text (how likely is the first Summon result from a title-only phrase search, say, to be the right hit?) or to fall back on traditional OpenURL resolution.


All in all though, I don't see much of a future for a stand-alone citation linker sitting on your website.

Few people have the patience to use it.

Ideally, a web scale discovery service - basically the big 4: Summon, EDS, Primo and WorldCat - should be built to handle cases where users copy and paste the whole citation. (I understand Primo has enhancements that handle this.)

As it is, I notice the rise of such user behaviour in search logs of discovery services under my care. It's a small but significant amount, something noted in other studies that analyse discovery search logs.

Can Summon handle cutting and pasting full references?

Discovery services should definitely be trained to identify such cases and automatically call the citation linker function.

Perhaps the system would then try to

a) Recognise the likely type of material sought (book, book chapter, article etc)
b) Depending on material type, focus on identifying with high likelihood the title, doi, author etc.
c) Use either the discovery index, doi resolution or traditional openurl methods depending on a) and b)

I expect that usually the system would try a phrase search for the article title in the article index, perhaps further narrowed by author (the top match is usually highly likely to be the right one); sometimes it would resolve the DOI, and at other times it would try the traditional citation finder method.
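The decision logic in steps a) to c) can be sketched as a simple strategy selector. The thresholds and ordering below are assumptions for illustration, not measured values; as noted, a production system would tune them against logged success rates:

```python
def choose_strategy(parsed):
    """Pick a lookup strategy for a parsed citation, in rough order of reliability.

    `parsed` is a dict of whatever fields the citation parser managed to
    recognise. Order and thresholds here are illustrative assumptions.
    """
    if parsed.get("doi"):
        return "doi_resolution"            # most reliable: one field, globally unique
    if parsed.get("title") and len(parsed["title"].split()) >= 4:
        # Long titles are near-unique, so a phrase search in the discovery
        # index (optionally narrowed by author) usually returns the right
        # record as the top hit.
        return "discovery_phrase_search"
    if all(parsed.get(f) for f in ("journal", "volume", "issue", "pages")):
        return "openurl_resolver"          # fall back to the traditional citation linker
    return "manual_form"                   # give up and show the user the form

choose_strategy({"title": "Evaluating big deal journal bundles"})
# -> "discovery_phrase_search"
```

The statistics mentioned below would feed back into exactly these branch conditions.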

With tons of statistics on success rates, it might be possible to get a reasonably accurate system.

Depending on how confident you are in the model you are using, it could show all the options (similar to how link resolver menus work now; Umlaut in particular is worth looking at), or it could show just the highest probability match.

Next up, do we really need A-Z database and A-Z Journal lists?

Friday, April 17, 2015

Making electronic resources accessible from my home or office - some improvements

I've recently been involved in analysing the LibQUAL+ survey at my new institution, and one of the things recommended nowadays when doing LibQUAL analysis is to plot the performance of various items against how important those items are to users.

Above we see sample data from Library Assessment and LibQUAL+®: Data Analysis

We proxy the importance of an item by its mean desired score on the vertical axis, and how well an item is performing by its mean adequacy score on the horizontal axis, so the higher the dot, the more important the item.

In the above sample data IC 1 or "Making electronic resources accessible from my home or office" is the 2nd to 3rd most important factor, and I suspect this is typical for most libraries.

Also do note that the analysis above is for *all* users. Undergraduates traditionally have a high desire for space; if we included only faculty, this item would probably rank even higher.
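For those running their own analysis, the underlying gap scores behind such a plot are simple to compute. LibQUAL items are scored on minimum, desired and perceived service levels; the numbers below are made up purely for illustration, not LibQUAL data:

```python
# Standard LibQUAL derived scores:
#   adequacy gap    = perceived - minimum   (positive: meeting minimum expectations)
#   superiority gap = perceived - desired   (usually negative)
# Sample numbers are invented for illustration.

items = {
    "IC-1 Making electronic resources accessible from home or office":
        {"minimum": 6.5, "desired": 8.2, "perceived": 7.0},
    "LP-1 Library space that inspires study":
        {"minimum": 5.8, "desired": 7.1, "perceived": 6.9},
}

for name, s in items.items():
    adequacy = s["perceived"] - s["minimum"]
    superiority = s["perceived"] - s["desired"]
    # Plot each item at (adequacy, desired): rightward = performing better,
    # upward = more important to users.
    print(f"{name}: adequacy={adequacy:+.1f}, superiority={superiority:+.1f}, "
          f"importance(desired)={s['desired']:.1f}")
```

An item sitting high (large desired mean) but far left (small adequacy gap) is exactly the kind of high-importance, under-performing item the quadrant plot is meant to flag.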

LibQUAL questions can be hard to pin down in terms of what they mean, though in this case I suspect it is accessibility from home that is the issue. Currently, most forward-looking academic libraries try to make access as seamless as possible via IP authentication on campus, so users don't need to use proxy methods while on campus. (Expecting users to start from the library homepage to access resources is a futile goal.)

Off campus access is trickier, since not all users will be informed enough, or bothered enough, to use a VPN even if that is an option.

Meeting Researchers Where They Start: Streamlining Access to Scholarly Resources 

Is seamless access to library resources, particularly off-campus, really that difficult? Roger C. Schonfeld, in the recent Meeting Researchers Where They Start: Streamlining Access to Scholarly Resources, believes so. He wrote, "Instead of the rich and seamless digital library for scholarship that they need researchers today encounter archipelagos of content bridged by infrastructure that is insufficient and often outdated."

He makes the following points
  • The library is not the starting point   
  • The campus is not the work location
  • The proxy is not the answer
  • The index is not current  (discovery services often have lag time compared to Google/Scholar)
  • The PC is not the device (despite the mobile push in the last 5 years, publisher interfaces are still not 100% polished) 
  • User accounts are not well implemented
Most of these points are not really new to many of us in academic libraries, though it is still worth a read as a roundup of issues researchers face.

Still, the listing above misses one very important issue: the classic "appropriate copy problem" that the OpenURL resolver was invented to fix. The key problem is that OpenURL still isn't widely implemented; while Google Scholar supports it, Google itself doesn't, and it is extremely easy to end up on an article abstract page without any opportunity to use OpenURL to get to the appropriate copy. More on that later.

BTW Bibliographic Wilderness responds to Roger Schonfeld from the library side of things, pointing out among other things the appropriate copy issue and difficulties of getting vendors to improve their UX (aka we can't cancel stuff based on UX!).

Shibboleth and vendor login pages

So what should the ideal experience be when a user lands on an article abstract page and needs to authenticate because he is off campus and/or not using the proxy?

One way is Shibboleth, but that is not something I have experience with, and it seems to be poorly supported and to have poor usability. Without Shibboleth, is there a way for vendors to make sign-ins easier when users are off campus and land directly on the article page without the proxy?

The way JSTOR has done it (for the last 1-2 years?) has always impressed me.

JSTOR intelligently looks at your IP and suggests institutions to log in from. As far as I know, you don't have to have Athens/Shibboleth or do anything special for this to work.

Recently Stephen Francoeur brought to my attention the following announcement from Proquest

Essentially, the ProQuest login screen has been redesigned to make it simple for users to enter their institution, after which the system will attempt to authenticate them using the usual method.

"Today we are debuting a simplified login experience for institutions that use a remote login method such as Proxy, Shibboleth, OpenAthens, or barcode to authenticate users into ProQuest ("

"To reduce this confusion, we've redesigned the login page ( as shown below to make it easier for remote users to authenticate into ProQuest by adding the "Connect through your library or institution" form above the ProQuest account form. Further, remote users can select their institution on the login page, instead of having to click through to another page as they had to do previously. After users select their institution, they will be re-directed to the remote authentication method their institution set up with us."

Though it doesn't seem to suggest institutions, it's still fairly easy to use: just type in your institution and you will be asked to log in (via EZproxy in my case).

EBSCO is another one that seems to make it possible to select your institution and log in for full text, but like the ProQuest page above, I could never get it to work at either my old institution or my new one. This could be due to some configuration setting that is needed.

It's really amazing how few publishers follow the lead of JSTOR and ProQuest. If the Elseviers and SAGEs of the world followed a similar format, I am sure there would be much less friction in accessing paywalled articles. Let's hope ProQuest's move leads others to converge on a similar login page, the way many article databases now look pretty much alike.

Appropriate copy problem revisited

Say most publishers start to wise up to UX matters and implement a login page like JSTOR's, so our users can select an institution and quickly get access. Will that solve every problem? Arguably, no.

At my old institution, we had great success promoting the proxy bookmarklet, LibX, etc. to overcome proxy issues (partly because ALL access was through the proxy, whether on campus or off, so the proxy bookmarklet was essential whenever you did not start from the library homepage).

But even if a user was smart enough to add the proxy string, that still led to a common problem.
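For readers unfamiliar with it, "adding the proxy string" just means rewriting the publisher URL through the library's EZproxy login, which is all the bookmarklet does. A sketch of that rewrite (the EZproxy hostname and article URL are hypothetical; each library has its own prefix):

```python
from urllib.parse import quote

# Hypothetical EZproxy host; substitute your library's own login prefix.
EZPROXY_LOGIN = "https://ezproxy.example.edu/login?url="

def proxify(url):
    """Rewrite a publisher URL through the library's EZproxy login,
    mimicking what the proxy bookmarklet does in the browser."""
    return EZPROXY_LOGIN + quote(url, safe="")

proxify("http://www.jstor.org/stable/1234")
# -> "https://ezproxy.example.edu/login?url=http%3A%2F%2Fwww.jstor.org%2Fstable%2F1234"
```

The rewrite itself is trivial; the problem described next is that even a correctly proxied URL only helps if the library actually has access on that particular platform.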

Often, even after proxying, full access would still not be granted. The reason, of course, is that we may not have access to full text on that particular page but may have access somewhere else on another platform.

A classic example would be APA journals, where access is available only via PsycARTICLES (which can be on the Ovid or EBSCO platform). Google results tend to favor publisher rather than aggregator sites, so one often ends up on a page where one has access only via another site.

The more an academic library relies on aggregators like EBSCO or ProQuest, as opposed to publishers, to supply full text, the more the appropriate copy issue arises.

As mentioned before, this issue can be solved if the user starts off searching at a source that supports OpenURL, such as Google Scholar (via the library links programme) or even a reference manager like Mendeley. But with multiple ways of "discovery", you can't always guarantee this.

In fact, I am noticing a rise in the number of people who tell me they don't even use Google Scholar but plain Google to find known articles. Interestingly enough, the recent ACRL 2015 proceedings paper Measuring Our Relevancy: Comparing Results in a Web-Scale Discovery Tool, Google & Google Scholar finds that Google is even better than Google Scholar for known item searching: Google scored 91% relevancy on known item queries, while Google Scholar and Summon both scored 74%!

If so, we will have an ever increasing number of users who land on article abstract pages without the opportunity to use link resolvers to find the appropriate copy.

Another example: I find many interesting articles, including paywalled ones, via Twitter. From the point of view of someone sharing, what is the right way to link an article so that others, who will have different options for access, can get to it?

There doesn't seem to be an obvious way (link via DOI? link to a Google Scholar search page?), and even if there were, it would be troublesome for the sharer, so most of the time we end up with a link to the publisher version of the article, which others may not be able to access.

 Lazy Scholar and the new Google Scholar Chrome extension

So what should we do, if we end up on a article page and we want to check access via our institution?

I've written about LibX before, but my current favourite Chrome extension is Lazy Scholar, which I reviewed here.

It exploits the excellent ability of Google Scholar to find free full text and also scrapes the link displayed by Google Scholar for the library link programme.

With more and more providers cooperating with Google Scholar (see the latest announcement by ProQuest on Google Scholar indexing full text from ProQuest), Google Scholar is by far the largest store of scholarly articles and every scholar's first stop to check for the existence of an article.

Lazy Scholar automatically attempts to figure out if you are on an article page, searches Google Scholar for the article, and scrapes what is available. In this case there is no free full text, so it says so. But you can click on "EZproxy" to proxy the page, or click on "Instit", which triggers the link resolver link found in Google Scholar (if any).

There are many other functions that the author has added to try to make the extension useful , I encourage you to try it yourself.

Interestingly in the last few days, Google themselves had a similar idea to help known item searches by exploiting the power of Google Scholar. They created the following Google Scholar Button extension.

It is very similar to Lazy Scholar but in the famous Google style a lot simpler.

On any page with an article, you can click the button and it will attempt to figure out which article you are looking at, search for the title in Google Scholar, and display the first result. This brings in all the usual goodies you find in Google Scholar.

If the title detection isn't working or if you want to check for other articles say in the reference, you can highlight the title and click on the button.

It will be interesting to watch the future of both extensions; see here for a comparison of the features of Lazy Scholar vs the Google Scholar Button.


"Making electronic resources accessible from my home or office" isn't as easy as it seems. An approach that combines

  • improved usability of publishers' login pages
  • plugins that support link resolvers and can find free full text via Google Scholar

is probably the way to go for now, though even that doesn't address issues like seamless support for mobile.


Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.