Thursday, January 12, 2017

The open access aggregators challenge — how well do they identify free full text?

Bielefeld Academic Search Engine (BASE), created by Bielefeld University Library in Bielefeld, Germany, is probably one of the largest and most advanced aggregators of open access articles (over 100 million records). Others on roughly the same level are CORE (around 60 million records) and OAIster (owned by OCLC).

One way of seeing this class of open access aggregators is as similar to web scale discovery search engines like Summon, EDS, Primo and WorldCat Discovery Service, but focused mainly on the open access context.
How well do web scale discovery engines cover open access?
It seems natural to think that index-based solutions like Summon, Primo and EDS should cover both paywalled content and open access content, particularly since they can typically use OAI-PMH to harvest the institution's own institutional repository. In reality, their coverage of open access material can be spotty. The best ones have indexed OAIster or BASE. But even when open access sources are available in the index, many institutions choose not to turn them on, for various reasons. These include unstable links, the inability to correctly show only open access material, and the flooding of results with inappropriate records (e.g. foreign language material or irrelevant subjects).

A unique challenge for open access aggregators

One area where BASE and CORE may differ from Summon and Primo is that open access aggregators need to be able to tell whether an article they harvest from a subject or institutional repository has free full text, and this isn't easy.

This seems odd if you do not know the history of open access repositories, but suffice to say that when OAI-PMH (the standard way of harvesting open access repositories) was established, it was designed for harvesting metadata only, not full text. It was envisioned that most if not all items in such open access repositories would be open access (following the example of arXiv), so no provision was made for a standard or mandatory field to indicate whether an item is free to access.
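To make the metadata-only point concrete, here is a minimal sketch (in Python, with an invented sample record) of the kind of Dublin Core payload an OAI-PMH harvester receives. Note that none of the standard dc fields below is required to say whether free full text exists.

```python
# A minimal OAI-PMH Dublin Core record, of the sort an aggregator like
# BASE or CORE harvests. The record itself is invented for illustration.
import xml.etree.ElementTree as ET

SAMPLE_RECORD = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An example article</dc:title>
  <dc:creator>Doe, Jane</dc:creator>
  <dc:identifier>http://repository.example.edu/handle/123/456</dc:identifier>
  <dc:type>Article</dc:type>
</record>"""

DC = "{http://purl.org/dc/elements/1.1/}"
root = ET.fromstring(SAMPLE_RECORD)
fields = {el.tag.replace(DC, "dc:"): el.text for el in root}
print(fields)
# Nothing in these mandatory fields tells the harvester whether following
# dc:identifier leads to a free PDF or to a metadata-only landing page.
```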

In today's world, of course, subject and in particular institutional repositories are a mix of free full text and metadata-only records. This happens in particular for institutional repositories because they have multiple goals beyond just supporting open access.
What are the multiple purposes of Institutional repositories?
While most librarians are familiar with institutional repositories' mission to support open access, they may not be aware that it is not their only purpose (I would also argue that even advocates who support self-archiving in the open access agenda can have different ultimate aims). Other purposes include:
a) "to serve as tangible indicators of a university's quality and to demonstrate the scientific, societal, and economic relevance of its research activities, thus increasing the institution's visibility, status, and public value" (Crow 2002)
b) "Nurture new forms of scholarly communication beyond traditional publishing (e.g. ETDs, grey literature, data archiving)" (Lynch 2003)
It is purpose (a), tracking the institution's output, that results in institutional repositories hosting more than just full text items. Many institutional repositories in fact have more metadata-only items than full text ones. It's a rare institutional repository where more than a third of the records have full text.

Truth be told, most open access aggregators I have seen simply give up on this problem and just aggregate the contents of whole institutional repositories, giving users the mistaken idea that everything is free.

This leads to users wondering if something is wrong when they click through and are led to a metadata-only record in the repository. This, by the way, is the reason why I (and, I suspect, many librarians) tend not to turn on the open access repositories available via Summon/Primo: they don't really show only open access items. It's a rare few that are, say, 99% free items (typically ETD, or electronic theses and dissertations, collections, though even those have the occasional embargoed item), while many have in fact more metadata-only records than full text records, particularly if they blindly pull in metadata via their institution's research publication systems and/or Scopus/Web of Science.

There are of course ways to identify full text in repositories, and Google Scholar seems to do it beautifully at the item level (via intelligent spidering to detect PDFs?), but that doesn't seem common for non-Google systems. As it stands, Google Scholar is currently my #1 choice whenever I need to check if free articles exist.
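For illustration, a very naive version of that spidering heuristic might just scan a repository landing page for links that point at PDFs. The snippet below is a toy sketch with invented HTML, not a description of how Google Scholar actually works.

```python
# Toy heuristic: treat a record as "has full text" if its landing page
# links to a PDF. Real spiders also follow links, check Content-Type,
# handle bitstream URLs without .pdf extensions, etc.
import re

def find_pdf_links(html: str) -> list[str]:
    """Return href values that appear to point at a PDF."""
    hrefs = re.findall(r'href="([^"]+)"', html)
    return [h for h in hrefs if h.lower().split("?")[0].endswith(".pdf")]

# Invented landing-page fragment for illustration:
page = '<a href="/handle/123/456">Record</a> <a href="/bitstream/123/456/paper.pdf">Download</a>'
print(find_pdf_links(page))  # ['/bitstream/123/456/paper.pdf']
```

The obvious catch, and probably why aggregators avoid this, is that spidering every record of every harvested repository is far more expensive than harvesting metadata via OAI-PMH.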

One possibility is for institutional repositories to create "collections" that are 100% or near-100% full text, so that aggregators can pull in such items by collection. This is usually what happens for ETDs.

The other way, of course, is to set a metadata tag for each item that has full text, but I'm not sure there is a 100% universal standard for this. A good start might be OpenAIRE's standard.

BASE indeed suggests you support this for optimal indexing. I am not sure how widespread this is outside the EU.

I'm not a repository manager, so I'm not sure how this works, but I get the distinct impression that Digital Commons repositories can reliably identify full text records, given that they can offer full-text PDF RSS feeds. I'm just not sure how a third party aggregator can exploit that to identify full text, or whether it can be generalised to all Digital Commons repositories.

In any case, I think one can probably "hack" together workarounds to reliably detect full text for any one repository; the trick is to do it without much work for most of them.
In a sense, centralised subject repositories have the advantage over institutional ones here: by virtue of their mass, there is great incentive for aggregators to tweak compatibility with them, compared to any individual institutional repository.

In any case, both BASE and CORE are capable of identifying full text records in their results; the question is how accurate they are.

How well do BASE and CORE do at identifying full text?

The nice thing about BASE is that it allows you to run a "blank search" which gives you everything that meets your criteria (similar to Summon). So one can easily segment the index based on the criteria you desire, without crude workarounds like searching for common words that all records would have.

BASE results restricted to Source: Singapore

The above shows that when restricted to Singapore sources, BASE knows of:

66,934 records from National University of Singapore’s IR — dubbed ScholarBank@NUS (using Dspace)

records from Nanyang Technological University’s IR — dubbed DR-NTU (using Dspace)

records from Singapore Management University's IR — dubbed INK (using Digital Commons). [Disclosure: I am staff at this institution]

Based on my colleague's recent Singapore update on open access figures for the total records in each of these repositories, this shows rough coverage of 67%, 89% and 98% respectively in BASE.

Take these figures with a pinch of salt, because the totals I am using are from different times: e.g. the NUS total is as of 30 Sept 2016, and the NTU total is as of 18 October 2016. NUS also has a fairly substantial number of non-traditional records, e.g. patents and music recordings, which might affect the result. Lastly, I did the search in BASE in early Jan 2017 while the totals are from a quarter earlier, so the actual coverage is probably a bit lower.
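As a rough sanity check on the NUS coverage figure, using the roughly 99,000 total record count mentioned later in the post (an approximation of the 3Q 2016 figure):

```python
# Rough check of the NUS coverage figure. The BASE count is from the
# screenshot above; the total is the approximate ~99k figure from the
# Singapore open access update.
base_records = 66_934
total_records = 99_000  # approximate 3Q 2016 total for ScholarBank@NUS
coverage = base_records / total_records
print(f"{coverage:.1%}")  # 67.6% -- consistent with the rough 67% above
```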

Overall, the coverage shown isn't too bad, but the more important point is how well BASE identifies full text. Let us filter to Access: Open Access.

Full text identified by BASE

Not very well it seems.

It is only able to identify 75 free records in National University of Singapore's IR, 654 free records in Nanyang Technological University's IR, and 143 free records in Singapore Management University's IR.

I did not check for false positives in BASE's identification of full text, but even in the best case scenario where those identifications are 100% correct, we see full text identification ratios of only 0.6%, 3.8% and 2.7% respectively!

If you consider the case of Singapore Management University (disclosure again: I am staff there), BASE is able to index practically every record in our repository and yet identifies free full text for only 2.7% of them. It's in the same ballpark for the other Singapore repositories.

Let’s do the same for CORE. How many records does it index for the 3 Singapore repositories?

Here are the results :

National University of Singapore’s Scholarbank.

Records (100,657) + Full text (12)

Keyword : repository: (“Scholarbank@NUS”)

Singapore Management University — INK

Records (18,312) + Full text (166)

Keyword : repository: (“Institutional Knowledge at Singapore Management University”)

Interestingly enough, I was unable to find any articles indexed in CORE from Nanyang Technological University's IR; it's possible I might have missed them somehow.

In any case, I won't calculate the percentages for the other two IRs; they are broadly similar to the case in BASE, except that CORE seems to show substantially more records (including metadata-only records) indexed than BASE does.

In fact, CORE shows more records indexed for both universities than the total records listed in the Singapore update on open access figures (e.g. 100k vs 99k for NUS and 18k vs 16k for SMU). This is possible because the totals from the Singapore update generally refer to 3Q 2016 figures, and the number of records would have grown since then.

Still, I suspect that's not the full reason; there could be duplicates archived in CORE inflating the result.

More importantly, in terms of records identified as free full text, the results for CORE are as dismal as those for BASE.


Both BASE and CORE are extremely sophisticated open access aggregators. For example, they offer APIs (BASE, CORE), are indexed by some web scale discovery services, and are doing various interesting things with ORCID, creating recommendation systems, and working with oaDOI to help surface green open access articles hiding in repositories.

A difference is that BASE currently doesn't search the full text, while I believe CORE does.

However, identifying which of the articles they have harvested have free full text is still problematic. BASE claims to be able to reliably identify 40% of its index as full text, though the status of the other 60% is still unknown due to lack of metadata. My own quick tests show that its accuracy is quite bad for certain repositories. My hunch is that BASE either works very well with some repositories or not at all with others.

So this is a major challenge for the open access community, and in particular institutional repositories, to answer. The alternative is to shrug one's shoulders and let Google Scholar be the default open access aggregator.


This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.