Saturday, September 19, 2015

[Research question] What percentage of citations made by our researchers is to freely available content?

I recently signed up for a "research methods" class aimed at helping practitioners like me produce high-quality LIS papers. Inspired slightly by Open Science methods, I will blog my thoughts on the research question I am working on. Writing this helps me clarify my thinking, and of course I am hoping for comments from you if my thoughts have piqued your interest. 

The initial motivation

The idea began as a piece of work I was asked to do: a citation analysis of citations made by our researchers, to aid collection development. The aim was to see how well our collection fit what users were citing, and to gauge potential demand for Document Delivery/Inter-Library Loan (DDS/ILL).

It's a slightly old-school type of study, but the procedure generally runs as follows:
  • Sample citations made by your researchers to other items
  • Record what was cited - typically you record age of item, item type cited, impact factor of journal title etc.
  • Check if the cited item is in your collection
Papers like Hoffmann & Doucette (2012), "A review of citation analysis methodologies for collection management", will give you a taste of such studies if you are unfamiliar with them.
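To make the bookkeeping in the steps above concrete, here is a minimal sketch of the tallying step in Python. The field names ("item_type", "in_collection") and the sample records are my own illustration, not from any real dataset.

```python
from collections import Counter

def summarise_citations(citations):
    """Tally the attributes typically recorded in such studies:
    item types cited and the share found in the collection."""
    by_type = Counter(c["item_type"] for c in citations)
    in_collection = sum(1 for c in citations if c["in_collection"])
    coverage = in_collection / len(citations) if citations else 0.0
    return {"by_type": dict(by_type), "coverage": coverage}

# Illustrative records only; a real study would also record item age,
# journal impact factor, etc.
sample = [
    {"item_type": "journal article", "year_cited": 2010, "in_collection": True},
    {"item_type": "book", "year_cited": 1998, "in_collection": False},
    {"item_type": "journal article", "year_cited": 2012, "in_collection": True},
]
print(summarise_citations(sample))
```

In a real study the records would be coded by hand (or semi-automatically) from reference lists before being tallied like this.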

The impact of free

But what does "in your collection" mean? This would of course include things you physically hold, subscribed journal articles, and so on.

But it occurred to me that these days our users can often obtain what they want by searching for free copies, and as the open access movement takes hold, this is becoming more and more effective.

In fact, I did it myself all the time when looking for papers, so I needed to take this into account.

In short, whatever couldn't be obtained through our collection and was not free would be arguably the potential demand for DDS/ILL.

(In theory there are other ways, legal and illegal, to gain access, such as writing to the author, access via co-authors/secondary affiliations or, for the trendy ones, #icanhazpdf requests on Twitter.)

How do you define free?

As a librarian with some interest in open access, I am aware that much ink has been spilled over definitions.

There's green/gold/diamond/platinum/hybrid/libre/gratis/delayed etc. open access. But from the point of view of a researcher doing the research, I simply don't care. All I want to know is whether the full text of the article is available for viewing at the time I need it. It could be "delayed open access" (often argued to be a paradoxical term), but if it's accessible when I need it, it's as good as any.

What would an average researcher do to check for any free full text?

Based on various surveys and anecdotes from talking to faculty at both my current and former places of work, I know Google Scholar is very popular with users.

It also happens that Google Scholar is an excellent tool for finding free full text, and a recent Nature survey shows that when there is no free full text, more users will search Google or Google Scholar for it, and a smaller number will use DDS/ILL.

As such, it's not a leap to expect that the average researcher would probably use Google Scholar or Google to check for free full text. 

So one would have to factor in the availability of the item for free (which can be checked simply by searching Google Scholar) and add that to what is in our "collection" (defined to be physical copies and subscribed material). 

Whatever remained that couldn't be explained by these two sources would be the potential demand for DDS/ILL.

Preliminary results 

I'll talk about the sampling method later, but essentially, in my first exploratory data collection (with the help of colleagues of mine), I found that of the citations made, 79.7% were to items in our collection (either print or subscribed), and of the remainder, another 13.4% were freely available by searching Google Scholar.

But the figures above presume a library-centric viewpoint and assume that users check our collection first and turn to free sources only if an item is unavailable there. Is this a valid assumption?

As one faculty member I discussed the results with said, "correlation does not imply causation": just because they cited something that could be found in our collection didn't mean they used it from there. In fact, given the popularity and convenience of Google Scholar, it might be just as likely they accessed the free copy, especially if they were off campus and using Google.

Librarians who are familiar with open access will immediately say: wait a minute, not all free full texts are equal, especially those that are self-archived. Some are preprints, some postprints, some final published versions, and they differ in things like page numbers.

There could in theory be very big differences between preprints and the final published versions, and if you only had the postprint version you should, strictly speaking, cite it differently from the final published version.

According to a survey by Morris & Thorn (2009), when researchers don't have access to the published version, 14.5% say they would rarely access the self-archived version, and 52.7% say they never would. 

This implies researchers usually don't try to access self-archived versions that aren't the final published version.

Still, this is self-reported usage, and one suspects that in the service of convenience, many researchers would just be happy with any free version of the text and cite it as if they had read the final published version...

For example, in the Ithaka S+R US Faculty Survey 2012, over 80% say they will search for a freely available version online, more than the share using ILL/DDS. Are these 80% of faculty only looking for freely available final published versions? That seems unlikely to me.

 Ithaka S+R US Faculty Survey 2012

Let's flip it around for the sake of argument: how do things look if we assume users access free items (whether preprint/postprint/final version) as a priority, and consult the library collection only when forced to?

As seen above, for the same sample, a whopping 80.4% of cited items could be found for free in Google Scholar, supplemented by a further 12.7% from the collection.
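The library-first and free-first viewpoints are really the same computation with the priority of sources swapped: each cited item is credited to the first source checked that can supply it. A sketch (the flags "in_collection" and "free_online" are hypothetical, not my actual data):

```python
def attribute(citations, priority):
    """Credit each item to the FIRST source in `priority` that can
    supply it; anything neither source covers is potential DDS/ILL demand."""
    counts = {src: 0 for src in priority}
    counts["neither"] = 0
    for c in citations:
        for src in priority:
            if c[src]:
                counts[src] += 1
                break
        else:
            counts["neither"] += 1
    return counts

# Toy data: one item available from both sources, one from each alone,
# one from neither.
items = [
    {"in_collection": True,  "free_online": True},
    {"in_collection": True,  "free_online": False},
    {"in_collection": False, "free_online": True},
    {"in_collection": False, "free_online": False},
]
print(attribute(items, ["in_collection", "free_online"]))  # library-first
print(attribute(items, ["free_online", "in_collection"]))  # free-first
```

Items available from both sources get credited to whichever source is checked first, which is exactly why the two orderings yield such different headline percentages from the same data.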

As we will see later, this figure is probably a big overestimate and I don't want to get hung up on it. Still, it is very suggestive (if we can trust it), because it tells you that if our users did not have access to our library collection, they could still find and read the full text of 80% of the items they wanted! 

It then dawned on me that this figure is actually of great importance to academic libraries. Why?

Why the amount of cited material that is free is a harbinger of change for academic libraries

One of the areas I've been mulling over in the past year is the impact of open access on academic libraries. It was clear to me, based on the Ithaka S+R US Faculty Surveys, that faculty currently highly value the role of the library as a "wallet", and that this is going to change drastically when (if?) open access becomes more and more dominant.

Still, timing is everything: you don't want to run too far ahead of where your clients are. So there is a need to tread carefully when shifting resources.

I wrote, "How fast will the transition occur? Will it be gradual, allowing academic libraries to slowly transition operations and competencies, or will it be a dramatic shift catching us off-guard?

What would be some signals or signs that open access is gaining ground and it might be time to scale back on traditional activities? Downloads per FTE for subscribed journals starting to trend downwards? Decreasing library homepage hits? At what percentage of annual output that is open access do you start scaling back?"

It came to me that the figure calculated above, the percentage of cited items that could be found free in Google Scholar, could serve as a benchmark for determining when the academic library's role as a purchaser would be in danger.

In the above example, if indeed 80% of what researchers want to cite is free at the time of citing, the academic library's role as a purchaser would be greatly reduced, such that users would only need you 2 out of 10 times! Is that really true?

Combining citation analysis with open access studies

My research question can be seen as a combination of two different strands of research in LIS.

First, there are the classic citation analysis studies for collection development uses, already mentioned.

Second, there is a series of studies in the open access field that have focused on estimating the amount of open access available over the years. 

The latter area has accumulated a pretty daunting set of literature trying to estimate the amount of open access material available. 

Of these studies, there is a subset that typically samples from either Scopus or Web of Science and checks whether free full text is available in Google Scholar, Google, or some combination; these bear the closest resemblance to my proposed idea.

For each study: percentage of free full text found, sample, where it was searched, coverage of articles checked, and comments:
  • Björk et al. (2010): 20.4% free. Sample drawn from Scopus; searched in Google; 2008 articles, searched in Oct 2009.
  • Gargouri et al. (2012): 23.8% free. Sample drawn from Web of Science; a "software robot then trawled the web"; 1998-2006 articles searched in 2009, and 2005-2010 articles searched in 2011.
  • Archambault et al. (2013): 44% free (for 2011 articles). Sample drawn from Scopus; searched in Google and Google Scholar; 2004-2011 articles, searched in April 2013. A "ground truth" of 500 hand-checked articles published in 2008 found 48% freely available as at Dec 2012.
  • Martín-Martín et al. (2014): 40% free. 64 queries in Google Scholar, collecting 1,000 results; searched in Google Scholar; 1950-2013 articles, searched in May & June 2014.
  • Khabsa & Giles (2014): 24% free. Randomly sampled 100 documents from MAS per field to check for free copies, multiplied by the estimated size of each field (determined by a capture-recapture method); searched in Google Scholar; coverage: all? Searched in ??
  • Pitol & De Groote (2014): 58% free. Drawn randomly from Web of Science; for Institution C, 50 items not already in the IR were drawn and checked in Google Scholar; 2006-2011 articles. The abstract reports 70% free full text, which covers institutions A, B and C; for A and B, the random sample drawn from WoS included copies already in the IR as well.
  • Jamali & Nabavi (2015): 61% free. 3 queries in Google Scholar for each Scopus third-level subcategory, checking the top 10 results for free full text; 2004-2014 articles, searched in April 2014.

I am still mulling over the exact details and methodological differences of each paper, but the overall percentages are suggestive, ranging from 20% to 61%, with the later studies generally showing a higher percentage.

Martín-Martín et al. (2014) and Archambault et al. (2013) in particular strike me as very rigorous studies, and both show around 40%+ of full text available.

But we can see that our 80% figure is obviously far above the expected upper bound. Why?

Big issues with methodology

Here is where I mention the big problem I have. 

First, the sample I drew was of citations made in papers published in 2000-2015. The field was Economics, and the search for free items was done in September 2015.

The first obvious issue is that when I check if something is free in Google Scholar, I am only checking what is free now.

This is okay if all I care about is knowing what percentage is free now. 

But from my point of view, I want to know how much was free at the time the researcher was citing it, rather than many years later.

So for example take a paper A written in 2003 that cites a paper B written in 2000.

Today (September 2015, as I write this), I determine that paper B is free and findable via Google Scholar. The obvious problem, of course, is that while it is free now, it might not have been free in 2003 when the author was doing his research!

Whether an article was free at the time the author was writing the paper depends on 

a) When the writer was writing up the paper
b) The age of the article he was citing at the time

The interaction of these two factors makes it very confusing, as there is a host of factors affecting whether something is free at a certain time. A short list includes a) journal policies with embargoes on self-archiving, b) uptake of options of dubious legality like ResearchGate, and c) the general momentum towards open access, both green and gold, at the time.

Is there a solution?

Honestly, I am not sure. I can think of many ideas to try to fix it, but they may not work.

First off, I could just forget about longitudinal studies and focus on citations made in papers published within a short window, say within 6 months of the searching done today in 2015, to reduce such timing effects. But even this isn't perfect: one can quibble that publication dates tell us little about when the writing was actually done, as publishing can have long lead times.

Another way is to carefully examine the source where the full text was found, and hope that the source has metadata on when the full text was uploaded.

For example, some institutional or subject repositories might indicate when the full text was uploaded (e.g. "Number of downloads since DD/MM/YY").

Full text uploaded to DSpace, with a "Downloads since" indicator

Based on studies like Jamali & Nabavi (2015) and Martín-Martín et al. (2014), we know that, surprisingly, one of the major sources of free full text is items uploaded to ResearchGate (ranked 1st in the former and 2nd in the latter), so this could be a big sticking point.

That said, looking around ResearchGate, I noticed that it does in fact list when something was uploaded.

Was this article uploaded to ResearchGate on Jun 6, 2016?

Edit: @varnum suggested a great idea: checking using the Internet Archive's Wayback Machine. It works for some domains, like .edu domains, which helps a little when someone puts up a PDF on university web space.

A PDF on the duke.edu domain existed in 2011, according to the Wayback Machine.
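The Internet Archive exposes a public "availability" API for exactly this kind of check: given a URL and a date, it returns the closest archived snapshot. A rough sketch (the example URL is hypothetical, and error handling is minimal):

```python
import json
import urllib.parse
import urllib.request

API = "https://archive.org/wayback/available"

def availability_query(url, timestamp):
    """Build the API query URL for the snapshot closest to
    `timestamp` (YYYYMMDD format)."""
    return API + "?" + urllib.parse.urlencode({"url": url, "timestamp": timestamp})

def closest_snapshot(url, timestamp):
    """Return (snapshot_url, snapshot_timestamp), or None if the page was
    never archived. Makes a live network call to archive.org."""
    with urllib.request.urlopen(availability_query(url, timestamp)) as resp:
        data = json.load(resp)
    snap = data.get("archived_snapshots", {}).get("closest")
    if snap and snap.get("available"):
        return snap["url"], snap["timestamp"]
    return None

if __name__ == "__main__":
    # e.g. was this (hypothetical) PDF already on the web around early 2011?
    print(closest_snapshot("example.edu/paper.pdf", "20110101"))
```

Note the caveat from the post applies in code too: a snapshot near the citing date is evidence the full text was up then, but the absence of a snapshot proves nothing, since the Wayback Machine's crawl coverage is patchy.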

Another idea: I could apply a blanket rule. If a paper is citing something that was less than 2 years old at the time, and a free full text is found (and it was not published in a gold OA journal), we assume it wasn't free then, as many journals only allow published versions or postprints to be posted 2 years after publication. 

This will undercount for various reasons, of course, not least of which is illegal copies.
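The blanket rule is simple enough to encode directly; a sketch, assuming a uniform 2-year embargo and illustrative parameter names:

```python
EMBARGO_YEARS = 2  # assumption: a typical self-archiving embargo length

def likely_free_when_cited(citing_year, cited_year, free_now, gold_oa):
    """Apply the blanket rule: a free copy found today counts as free at
    citing time only if the item was gold OA (free from publication) or
    was already past the assumed embargo when cited. Errs on the side
    of 'not free then'."""
    if not free_now:
        return False
    if gold_oa:
        return True
    return (citing_year - cited_year) >= EMBARGO_YEARS

# The paper-A-cites-paper-B example from earlier: cited 2003, published 2000,
# free today, not gold OA -> 3 years old when cited, so assumed free then.
print(likely_free_when_cited(2003, 2000, free_now=True, gold_oa=False))
```

The conservatism is deliberate: as noted above, this undercounts (preprints posted immediately, illegal copies, shorter embargoes), so it gives something closer to a lower bound.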

A more nuanced approach would be to take into account the policies listed in SHERPA/RoMEO. But journal publishers' policies change over time too. 


The more I lay out my thoughts, the more I wonder if my idea is fatally flawed. It would be great to be able to get the figure (the percentage of items cited that could be freely obtained at the time of citing), but it may just be that such a figure is impossible to obtain accurately enough, even if one limits one's sample to a short window around the time of searching.

What do you think? 

Edit: This post is written from the librarian's point of view of reacting to changes in research behavior, and is neutral on whether academic librarians should be revolutionaries or soldiers in the open access movement.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.