Friday, December 30, 2016

Library Discovery and the Open Access challenge - Take 2

Earlier this year, over at Medium, I blogged about the Library Discovery and the Open Access challenge and asked librarians to consider how library discovery should react to the increasing pool of free material resulting from the inevitable rise of open access.

At the limit, when nearly everything is freely available, it is worth asking whether libraries will still have a place in the discovery business. After all, if all researchers have access to the same pool of journal articles, does it really make sense for each institutional library to provide a separate discovery solution? Even today, many researchers prefer Google Scholar and other non-institutional discovery solutions that operate at web scale, and some (mostly students) grudgingly use our discovery systems only to restrict discovery to things they have immediate access to.

This, of course, is the "library discovery is dead" scenario when (almost) everything is free, and not everyone agrees. Some argue that libraries can add value by providing superior, customised and personalised discovery experiences because we know our users better (e.g. what courses they are taking or teaching, their demographics, etc.). Then there are plans to leverage linked data and the like, but I regretfully know little of that.

But the day when open access is dominant is still not here. We live in a world with a mix of toll-based access and rising but uneven free access, Sci-Hub notwithstanding. I opined that for now, "if we really want to stay in the discovery business we need to be able to efficiently and effectively cover the increasing pool of open access resources".

So how do you ensure the library discovery system covers as much free open access content as possible?

The idea of an open content discovery matrix by Pascal Calarco, Christine Stohn and John Dove comes to mind.

For most academic libraries that subscribe to commercial discovery indexes (WorldCat Discovery, EBSCO Discovery Service, Summon and Primo, with the latter two having merged indexes), there isn't much to do beyond hoping that discovery vendors include such content in their indexes.

Well, I recently came across services like 1science's oaFindr, which claims to offer a high-quality database of 20 million open access papers that perhaps could help. There's also an oaFindr+ product that can identify Green and Gold OA articles for your institution specifically.

Even if you can find open access metadata for content that is available for indexing, delivery issues can still occur in index-based discovery services, as link resolvers are infamously bad at linking to hybrid journals and practically ignore Green open access articles.

An alternative to such "pull" approaches is a "push" approach. The new oadoi service (and an earlier service, DOAI) is one of the more interesting things to emerge from this year's Open Access Week, and it can be used together with discovery services.

The idea is simple. One of the challenges of discovering open access content, in particular Green open access articles archived in subject and institutional repositories, is that in general there is no systematic, easy way to find them.

You feed the service a DOI and it will attempt to locate a free version of the paper, whether made free via the Green or Gold road.

Here's an example: say you land on the article page for Grandchild care, intergenerational transfers, and grandparents’ labor supply on Springer and you have no access.

Quick as a flash, you grab the DOI 10.1007/s11150-013-9221-x and look it up like this. And you get auto-redirected to the preprint full text on our institutional repository.

Looks like magic! How does it work? The oadoi service uses a variety of means to try to detect whether an open access version of an article is available (see below), but it looks to me that the main source for detecting articles in institutional repositories in particular is the aggregator BASE, so make sure your institutional repository is indexed in BASE.

My own limited testing with oadoi was initially pretty disappointing, as it failed to find most of the articles hosted on my institutional repository (hosted on bepress Digital Commons). It's possible that the way our repository exposes the DOI was not correctly picked up by BASE, but this seems to have been somewhat resolved since. More testing required.


Savvy readers of this blog might already be screaming: why bother? Just use Google Scholar or plugins like the Google Scholar Button or Lazy Scholar (which uses Google Scholar in the background) and all your problems are solved.

It's true that Google Scholar is pretty much unbeatable for finding free articles, but the value of oadoi is that it offers an API.

Already many have been quick to use it to provide all kinds of services. For example, Zotero uses it as a lookup engine, librarians have created widgets, and so on.
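To give a flavour of how simple the API is to build on, here is a minimal sketch of an oadoi lookup helper. The exact endpoint shape and the free_fulltext_url field are my assumptions based on the service's documentation at the time; is_free_to_read is the field mentioned in the Primo feature request below. The canned response here is illustrative only, so the parsing can be shown without a live network call.

```python
import json

# Assumed oadoi API base URL; check the service's own docs before relying on it.
OADOI_BASE = "https://api.oadoi.org"

def lookup_url(doi):
    """Build the oadoi lookup URL for a given DOI."""
    return "%s/%s" % (OADOI_BASE, doi)

def free_fulltext_url(response_json):
    """Extract a free full-text URL from an oadoi-style response, if any."""
    for result in response_json.get("results", []):
        if result.get("is_free_to_read") and result.get("free_fulltext_url"):
            return result["free_fulltext_url"]
    return None

# A canned, illustrative response (the repository URL is a placeholder,
# not a real location for this paper).
sample = json.loads("""{"results": [{"doi": "10.1007/s11150-013-9221-x",
    "is_free_to_read": true,
    "free_fulltext_url": "https://example-repository.org/paper.pdf"}]}""")

print(lookup_url("10.1007/s11150-013-9221-x"))
print(free_fulltext_url(sample))
```

In a real integration, a link resolver or widget would fetch `lookup_url(doi)` and only show an "open access copy" option when `free_fulltext_url` returns something.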

But its greatest value lies in the fact that it can be embedded into discovery services and link resolvers.

Here's work done on the SFX DOI service, and Alma libraries like Lincoln University have not been slow to include it either.

These are fairly basic uses of oadoi that help direct users to open access content. Still, such implementations are currently a "last resort, try it and see if it works" kind of deal, and there is no guarantee clicking on the link will work. If you are an Ex Libris customer on Primo, do consider supporting the feature request "Add as an option in uresolver", which proposes that oadoi "displays as an option if the API's value of is_free_to_read is true".

To DOI or not to DOI?

A lot of the problems with discovery and delivery of open access content lie in the fact that there are different variants of the same content.

In the old days it was pretty straightforward: the only thing we tracked and provided access to was the article as it appeared in the journal.

Today, we make accessible a wide variety of content (data, blog posts, conference papers, working papers) and, even worse, different versions of the same content at different stages of the research lifecycle (preprint/postprint/final published version).

This leads to a great challenge for discovery.

It doesn't help that there is a terminology muddle (despite NISO's best efforts at standardising journal article version names and license and access indicators), with some people using terms like preprint/postprint/final published version while others use author submitted manuscript, author accepted manuscript and version of record.

But even beyond that, the question I always wonder about is: how do we identify and address each version? These days that means assigning DOIs. The final version of record will have a DOI, of course, but what about the rest?

As such, I've always been confused about the practice of assigning DOIs to non-peer-reviewed papers. For example, should one assign DOIs to preprints? Postprints? Working papers? Should they have a different DOI from the final published version? It doesn't help that when you upload items to ResearchGate, it offers to create a DOI.

I could be wrong, but until recently I don't think there was a clear guide. In recent months, however, there have been two developments that seemingly clarify this.

First, Crossref announced they are allowing members to register preprints. The intention here seems to be that the final version of record gets a different DOI, though there are ways to crosslink both papers. There's even a way to show a relationship between preprints and later versions, as explained in the Crossref webinar.

The oadoi service mentioned earlier seems to be pushing in the other direction, encouraging postprints listed in repositories to be deposited with the same DOI as the final version of record to make the postprint findable (but does that mean the preprint isn't, since it will have a different DOI?). This lets you find the postprint via the oadoi service, because postprint and final version of record share the same DOI.

I'm not quite sure this is a good idea. While studies show most postprints are not that different from the final published version, it does seem useful to be able to track the two versions separately. Still mulling over this.


This will probably be my last post for 2016. Towards the end of this year I was particularly inspired, with many ideas but not the time to craft them, so expect a flood in the coming year.

I would also like to thank all my loyal readers for following this blog and reading my long-winded posts. Next year this blog will be celebrating its 8th anniversary and my 10th year in the library industry, and I might do something special.

Till then, stay happy and healthy and have a great new year's day!

Saturday, December 10, 2016

Aggregating institutional repositories - A rethink

In recent months, I've become increasingly concerned about the competition faced by individual siloed institutional repositories from bigger, more centralised repositories like subject repositories and commercial competitors like ResearchGate.

In a way the answer seems simple: just get someone to aggregate all the institutional repositories in one site and start building services on top of that to compete. Given that all institutional repositories already support OAI-PMH, this seems a trivial thing to do. Yet I'm coming to believe that in most cases, creating such an aggregator is pointless. Or rather, if your idea of an aggregator is simply taking an OAI-PMH harvester, pointing it at the OAI-PMH endpoints of your members' repositories, dumping everything into a search interface like VuFind (or even something commercial like Summon or EDS) without any other attempt to standardise metadata, and calling it a day, you might want to back off a bit and rethink. For the aggregator to add value, you will need to do more work.....

A simplistic history of aggregation in libraries

Let me tell you a story...

In the 90s, libraries began to offer online catalogues to let users find out for themselves what was available in their (mostly print) collections. These sources of information were siloed, and while they were on the web, they were mostly invisible to web crawlers. The only way to find out what libraries had in their collections was to go to each of their catalogues and search.

So someone said "why not aggregate them all together?", and union catalogues (including virtual union catalogues based on federated searching) were built, e.g. Copac. People could now search across various silos in one place, and all was well.

Librarians and scholars used such union catalogues to decide what to request via ILL and from whom, and to make collection decisions. Many were still invisible to Google and web search engines (except for a few innovators like OCLC), but it was still better than nothing.

By the late 90s and early 2000s, libraries began to create "digital libraries" (e.g. using the Greenstone digital library software). It was the wild west, and digital libraries at the time built up collections of practically anything of interest, such as digitised images of music scores, maps and photographs: anything except peer reviewed material. Most material in digital libraries was difficult to find or invisible via web search engines for various reasons (e.g. the non-text nature of the content, lack of support for web standards), so it made sense to aggregate at various levels, such as national or regional.

Today larger collections like Europeana exist and all was well.

Then came the rise of institutional repositories, and by the 2010s most universities had one.

Unlike its predecessors, the main distinguishing point of the institutional repository was that for many it was designed around distributing scholarly peer reviewed (or likely to be peer reviewed) content.

While it's true many institutional repositories do contain a healthy electronic thesis collection, and some even inherited the mission of what would earlier have been called digital libraries, carrying grey literature and other digital objects such as data, the main focus was always on textual journal articles.

The other major difference is that by then all institutional repositories worth the name supported the OAI-PMH standard, which makes harvesting and aggregating the metadata of their content easy....

And of course, the same logic seems to suggest itself again: why not aggregate all the content together? And today we have global aggregators like CORE (not the other CORE, the Common Open Repository Exchange), BASE and OAIster, as well as regional aggregators built around associations and organisations, both national and regional.

In my region, for example, there's the AUNILO (ASEAN University Network Inter-Library Online) institutional repository discovery service, which aggregates content from over 20 institutional repositories in ASEAN. Most university libraries in Singapore are also part of PRRLA (Pacific Rim Research Library Alliance), formerly PRDLA, which also has a Pacific Rim Library (PRL) project built around OAI-PMH harvesting.

I'm sure similar projects exist all around the world, based on harvesting data via OAI-PMH harvesters. And yet, I'm coming to believe that in most cases, creating such an aggregator is pointless unless additional work is done.

Or rather, if your idea of an aggregator is simply taking an OAI-PMH harvester, pointing it at the OAI-PMH endpoints of your members' repositories, dumping everything into a search interface like VuFind (or even something commercial like Summon or EDS), and calling it a day, you might want to back off a bit and rethink.

I argue that unlike union catalogues or aggregations of digital libraries (by which I mean not the traditional institutional repository of text-based scholarly articles), aggregation of institutional repositories is likely to be pointless unless you bring more to the table.

Here's why.

1. Items in your institutional repository are already easily discoverable

Unlike items in most library catalogues, items in your institutional repository are already easily findable in Google and Google Scholar. There is little value in creating an aggregator when such excellent and popular ones as Google and Google Scholar exist.

101 Innovations in Scholarly Communication - 89% use Google Scholar to search for literature/data

Given the immense popularity of Google Scholar, what would your simple aggregator based around OAI-PMH offer that Google Scholar does not that would make people come to your site to search?

2. Most simple repository aggregators don't link reliably to full text or even index full text

Union catalogues existed at a time when it was acceptable for users to find items that had no full text online. You used them to find which libraries had the print holdings, and either went down there to view the item or used interlibrary loan to get it.

In today's world, direct-to-full-text is the expected paradigm, and you get undergraduates wondering why libraries bother to subscribe to online subject indexes that show items the library may not have access to.

Now how much worse do you think they feel when they search one of your repository aggregators and realise they can't tell which items have full text until they click on them? This is where a glaring weakness of OAI-PMH rears its head.

I first encountered this problem when setting up my web-scale discovery service, Summon, a few years back. I was surprised to find that while I could easily harvest entries from my institutional repository (DSpace) into Summon via OAI-PMH, I couldn't easily get Summon to recognise whether an item from the DSpace repository had full text or not.

I remember being stunned to be told that there was no field among the default DSpace fields that indicated full text or not.

This sounds crazy by today's standards, but a little understanding of the context of the time (1999) when OAI-PMH came about helps. It's a long story, and correct me if I'm wrong, but it was conceived at a time when the preprint server arXiv was the model, and it was envisioned that repositories would be 100% full-text items, so there was no need for such a standard field. Today, this is of course not what happened; due to varying goals for what an institutional repository should be, and the reluctance of researchers to self-deposit, we have a mix of both full-text and metadata-only items.

Another quirk of OAI-PMH that might surprise many is that it only allows harvesting of metadata, not full text. In today's world, where full text is king and people are accustomed to web search engines (and the library full-text databases that have followed their lead) matching within the whole document, users find aggregators based around OAI-PMH, which contain only metadata, odd to use. This is the same problem many students have with traditional catalogues.
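To make this concrete, here is a minimal sketch of what an OAI-PMH harvester actually receives: a ListRecords response carrying unqualified Dublin Core records. Notice there is no document body in the payload and, by default, no field saying whether full text exists. The sample XML is a simplified, hand-made response (real ones include headers, datestamps and resumption tokens, which production harvesters like Sickle handle for you).

```python
import xml.etree.ElementTree as ET

# Simplified, illustrative ListRecords response in unqualified Dublin Core.
SAMPLE_RESPONSE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Grandchild care, intergenerational transfers, and grandparents' labor supply</dc:title>
          <dc:identifier>https://example-repository.org/handle/123</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

NS = {"oai": "http://www.openarchives.org/OAI/2.0/",
      "dc": "http://purl.org/dc/elements/1.1/"}

def harvest(xml_text):
    """Yield (title, identifiers) for each record in a ListRecords response."""
    root = ET.fromstring(xml_text)
    for record in root.findall(".//oai:record", NS):
        title = record.findtext(".//dc:title", default="", namespaces=NS)
        ids = [e.text for e in record.findall(".//dc:identifier", NS)]
        yield title, ids

records = list(harvest(SAMPLE_RESPONSE))
print(records)
```

Everything the aggregator will ever know about the item is in those few generic elements, which is exactly why "does this record have full text?" is so hard to answer downstream.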

I understand there can be algorithmic workarounds to try to determine whether full text exists, and some aggregators try to do so with varying results, but many don't and just display everything they grab via OAI-PMH.

To top it all off, Google Scholar has none of these problems. It can pretty reliably identify whether full text exists and where, and combined with the Library Links program, you can easily tell if you have access to the item.

It crawls and indexes the full text, can find items by matching in the full text, and can often provide helpful search snippets before you even click into a result.

A vanity search of myself lets me see where my name appears in context in the full text, not just in abstracts.

3. Aggregation doesn't have much point due to lack of consistency in standards

Think back to union catalogues of traditional catalogues, back then called OPACs. The nice thing about them was that most were created using the same consistent standards.

There was MARC, call number schemes like LCC/DDC/UDC, and subject heading standards like LCSH/MeSH that you could crosswalk, so you could browse by subject headings, call numbers and so on.

I'm probably painting a too positive view of how consistent standards are, but I think it's fair to say that in comparison institutional repositories are in an even worse state.

Under the heading "Minimal Repository Implementation" in the "Implementation Guidelines for the Open Archives Initiative Protocol for Metadata Harvesting", it advises that "It is important to stress that there are many optional concepts in the OAI-PMH. The aim is to allow for high fidelity communication between repositories and harvesters when available and desirable."

Also, under the section on Dublin Core, which today is pretty much the default, we see: "Dublin Core (DC) is the resource discovery lingua franca for metadata. Since all DC fields are optional and repeatable, most repositories should have no trouble creating at least a minimal mapping of their native metadata to unqualified DC."

Clearly, the original framers of OAI-PMH decided to give repositories a lot of flexibility on what was mandatory, and specified only a minimum set.

In addition, the "lingua franca for metadata", unqualified Dublin Core, was perhaps in hindsight not the best option, not when most of your content is journal articles.

Even Google Scholar recommends against the use of Dublin Core in favour of other metadata schemes like Highwire Press tags, EPrints tags or bepress tags.

In the section on getting indexed in Google Scholar, they advise repository owners to "use Dublin Core tags (e.g., DC.title) as a last resort - they work poorly for journal papers because Dublin Core doesn't have unambiguous fields for journal title, volume, issue, and page numbers."
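A small illustration of why that advice makes sense. The Highwire-style tag names below are the kind Google Scholar's inclusion guidelines describe, with a dedicated field for each bibliographic element; the journal details are invented placeholder values. In unqualified Dublin Core, the same information collapses into a couple of generic, free-text elements that a machine cannot unambiguously parse.

```python
# Highwire Press style: one unambiguous field per bibliographic element.
highwire = {
    "citation_title": "Grandchild care, intergenerational transfers, and grandparents' labor supply",
    "citation_journal_title": "Journal of Example Studies",  # placeholder value
    "citation_volume": "14",
    "citation_issue": "2",
    "citation_firstpage": "345",
}

# The nearest unqualified Dublin Core rendering: journal title, volume,
# issue and pages all squeezed into one free-text dc.source string.
dublin_core = {
    "DC.title": highwire["citation_title"],
    "DC.source": "Journal of Example Studies, 14(2), 345",
}

# Five precise fields versus two generic ones.
print(len(highwire), len(dublin_core))
```

A harvester reading the Highwire tags can cite the article exactly; one reading the DC version has to guess at parsing "Journal of Example Studies, 14(2), 345", and every repository formats that string differently.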

Even something as fundamental today as the DOI (and, in the future, ORCID) isn't mandated.

I recently found that out when the very useful oadoi service, which allows you to input a DOI and find a copy in repositories (among other methods, it searches items indexed in BASE), failed for our institutional repository because the DOI identifier wasn't in the unqualified Dublin Core feed that BASE picked up. The lack of standards is holding repositories back.

Leaving that aside, I'm not sure why this happened (I have a feeling that until recently the people working on institutional repositories were not the same people working on cataloguing), but most institutional repository content does not use controlled vocabularies for subject headings or subject classification, though it easily could.

As a result, you can't do what you can with catalogues, where once you have aggregated all the content you can easily slice it by discipline (e.g. LC call range) or by subject heading (e.g. LCSH).

With aggregators of repositories, you get a mass of inconsistent data. Your subjects are the equivalent of author-supplied keywords, and there is no standardised way to filter to specific disciplines like economics or physics.

The more I think about it, the more I believe this lack of standardisation is hurting repositories.

For example, I love the Digital Commons Network, which allows me to compare and benchmark performance across all papers posted to Digital Commons repositories in the same discipline. This is possible only because Digital Commons, as a hosted service, has a standardised set of disciplines.

What should your aggregator of repositories do?

So if you have read all this and are undeterred, and still want to create an aggregator of institutional repositories, what should you do?

Here are some of the things I think you should shoot for, beyond just aggregating everything and dumping it into one search box.

a) Try to detect reliably if an entry you harvested has full text

b) Try to index full text not just metadata

CORE seems to match full text in my search?

One way to reliably detect whether full text exists is to agree on a metadata field indicating full text across all the repositories you harvest from. But that currently won't scale at a global level. Another way is to crawl the repositories and extract the PDF full text.
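The first approach above can be sketched in a few lines. This is a hedged illustration, not a real aggregator's logic: the local:fulltext field name is a made-up example of a field your member repositories might agree on, and the PDF-suffix fallback is a crude heuristic whose unreliability is exactly the problem described earlier.

```python
def looks_full_text(record):
    """Guess whether a harvested record points at full text.

    record: a dict of harvested Dublin Core-ish metadata.
    """
    # 1. An agreed-upon field, if your member repositories provide one
    #    ("local:fulltext" is a hypothetical name, not a standard).
    if record.get("local:fulltext") == "true":
        return True
    # 2. Crude fallback: any identifier that looks like a direct PDF link.
    return any(i.lower().endswith(".pdf") for i in record.get("identifiers", []))

records = [
    {"title": "Paper A", "identifiers": ["https://repo.example.org/1.pdf"]},
    {"title": "Paper B", "identifiers": ["https://repo.example.org/handle/2"]},
    {"title": "Paper C", "identifiers": [], "local:fulltext": "true"},
]
print([looks_full_text(r) for r in records])  # [True, False, True]
```

Note how Paper B, which may well have full text behind its handle page, is invisible to the heuristic: this is why agreeing on an explicit field (or crawling the actual PDFs) beats guessing.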

Ideally, the world should be moving away from OAI-PMH and start exposing content using newer methods like ResourceSync, so that not just metadata is synced. I understand that PRRLA is working on a next generation repository among its members that will use ResourceSync.

c) Create consistent standards among repositories you are going to harvest

If you are going to aggregate repositories from, say, a small set of member institutions, it is important to focus not just on the tech but also on metadata standards. It's going to be tough, but if all member institutions can agree on mapping to a standard (hint: look at this), perhaps even something as simple as a mapping to disciplines, the value of your aggregator increases a lot.

d) Value added services and infrastructure beyond user driven keyword discovery

Frankly, aggregating content just for discovery isn't going to be a game changer, even if you provide the best possible experience, with consistent metadata allowing browsing, full-text indexing and so on, as services like Google Scholar are already good enough.

So what else should you do when you aggregate a big bunch of institutional repositories? This is where it gets vague, but the ambitions of SHARE, while big, show that aggregators should go beyond just supporting keyword-based discovery.

See, for example, this description of SHARE:

"For these reasons, a large focus of SHARE’s current grant award is on metadata enhancement at scale, through statistical and computational interventions, such as machine learning and natural language processing, and human interventions, such as LIS professionals participating in SHARE’s Curation Associates Program. The SHARE Curation Associates Program increases technical, curation confidence among a cohort of library professionals from a diverse range of backgrounds. Through the year-long program, associates are working to enhance their local metadata and institutional curatorial practices or working directly on the SHARE curation platform to link related assets (such as articles and data) to improve machine-learning algorithms."

SHARE isn't alone; other "repository networks" include OpenAIRE (Europe), LA Referencia (Latin America) and NII (Japan), which work along similar lines, trying to standardise metadata and so on.

Others have talked about layering a social layer over the aggregated data, similar to ResearchGate, or providing an infrastructure for new forms of scholarly review and evaluation.

Towards a next generation repository?

In past posts on institutional repositories, I've been trying to work out my thinking on them, and it's a complicated subject, particularly with competition from larger, more centralised subject and social repositories like ResearchGate.

I'm coming to think that to counter this, individual smaller repositories need to link up, yet this currently cannot be done effectively.

This is where "next generation repositories" come in; you may have heard about this most prominently under the umbrella of COAR (Confederation of Open Access Repositories).

What I have described above is in fact my layman's understanding of what next generation repositories must achieve (for a more official definition, see this) and why.

Officially, the next generation repositories work focuses on repository interoperability (see The Case for Interoperability and The Current State of Repository Interoperability), which includes working groups on controlled vocabularies, open metrics and even linked data.

All this is necessary for institutional repositories to take their place as necessary and equal partners in the scholarly communication network.


I had the opportunity to attend the Asian Open Access Summit in November in Kuala Lumpur and learned a lot; in particular, the talk by Kathleen Shearer of COAR, the Confederation of Open Access Repositories, on repository networks helped clarify my thinking on the subject.


Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.