Saturday, October 29, 2016

5 thoughts on open access, Institutional and Subject repositories

Despite writing a bit more on open access and repositories in the last few years, I find the issues incredibly deep and nuanced and I am always thinking and learning about them. As this is open access week, here are 5 new thoughts that occurred to me recently.

They probably seem obvious to many open access specialists but I set them out here anyway in case they are not obvious to others.

1.   There are multiple goals for institutional repositories and supporting open access by accumulating full text of published output is just one goal. 

Like many librarians, I suspect, I first heard of institutional repositories in the context of open access. In particular, we were told to aim to support Green OA by getting copies of published output by faculty (final published version if possible, if not postprint or preprint). But in fact, looking back at the beginning of IRs and Open Access, things were not so straightforward.

There seem to have been two seminal papers released at the beginning of the history of IRs: first Crow's The Case for Institutional Repositories: A SPARC Position Paper in 2002, then Lynch's Institutional Repositories: Essential Infrastructure for Scholarship in the Digital Age in 2003. (See also the great talk "Dialectic: The Aims of Institutional Repositories" for a breakdown.)

Between them, several goals were identified, two of which were:

a) “to serve as tangible indicators of a university's quality and to demonstrate the scientific, societal, and economic relevance of its research activities, thus increasing the institution's visibility, status, and public value” (Crow 2002)

b) "Nurture new forms of scholarly communication beyond traditional publishing (e.g. ETDs, grey literature, data archiving)" (Lynch 2003)

None of these goals is mutually exclusive with the mission of supporting open access by accumulating published scholarly output, but they are not necessarily complementary either.

For example, one can showcase the university's output by merely depositing metadata without free full text. This is what has happened in many institutional repositories today, which are filled with metadata about the scholarly output of their researchers but hold precious little full text.

Similarly, systems like Converis, Pure or Vivo that showcase institutional and research expertise do not necessarily need to support open access.

It also seems that at the time Lynch envisioned an alternative route for IRs: focusing on collecting non-traditional scholarly outputs, including grey literature, instead of collecting published scholarly output. Following that vision, today most university IRs collect electronic theses and dissertations at the very least, others collect learning objects and open educational resources, and many are beginning to collect datasets.

2. Self archiving can differ in terms of timing and purpose, and there are multiple views on how high rates of self archiving will eventually impact the scholarly communication system

Even if you agree the goal of IRs is to collect deposits of published scholarly output, there are still more nuances as to why you are doing so and what your ultimate aims are.

At what stage are papers deposited?

As a librarian with few disciplinary connections, I never gave much thought to subject repositories and focused more on institutional ones.

Reading Richard Poynder's somewhat disputed recounting of the history of what was to be the first OAI conference at Santa Fe, New Mexico in 1999, I finally realized that while subject repositories and institutional repositories could both collect preprints/postprints, the two were very different in terms of timing of deposit and reason for deposit.

Most researchers who submit to subject repositories do so primarily with the goal of getting feedback, and this also speeds up scientific communication. While many papers in subject repositories are deposited and immediately submitted to journals for consideration, many are put up in rawer form and replaced by new versions many times before finally being submitted for publication, and many never end up being submitted to any journal at all, which makes the term "preprint server" a bit misleading. All this is discipline specific of course.

Contrast this with IRs, where researchers rarely put up copies of their papers until the paper is accepted for publication or, more likely, already published. The goal here is to provide the scholarly poor with access to published or near-published scholarly output, and the carrot for researchers is the citation advantage of open access papers.

However as the papers in the IR are placed much later in the research cycle, they generally are already in finalised form and nothing much happens to them.

As Dorothea Salo's memorable paper Innkeeper at the Roach Motel states, “[The institutional repository] is like a roach motel. Data goes in, but it doesn’t come out.” This line might also apply to point #4 below.

I am told that there really isn't any obstacle functionally for IRs to accept preprints (in the sense of papers that are going through peer review but haven't been accepted yet or haven't even yet been submitted for consideration for publication), but in actual fact this seldom occurs (though I'm sure there are examples perhaps with say CRIS systems).

Two views of Green Open access

The motivation and final end game for self archiving in IRs also differ among people.

Even if one agrees IRs should only collect post prints (or the final published version if allowed) and that the main aim is to provide access to published scholarly material, what is the ultimate goal or vision here?

Some envision green open access thriving alongside the traditional publishing system today and for all time. In this view, green open access is not a threat to traditional publishing, and a status quo would result in which there is green open access self archiving in IRs while libraries continue to subscribe to journals as usual. Proponents point to high energy physics, where high rates of self archiving have had little apparent effect on subscriptions.

Another view doesn't see self archiving as being just for the sake of access; its holders actually aim to eventually disrupt the current scholarly system. They believe that when "universal green OA" is achieved, we can leverage a favorable transition (in terms of costs/prices) to Gold open access, because the post-print then offers an alternative to the final published version. Without universal green OA, flipping to Gold OA leads to "fool's Gold": even if open access is achieved, it comes at very high cost.

This is of course the Stevan Harnad view. It's usually paired with the idea of an immediate deposit/optional access mandate, where all researchers will need to deposit their paper at the moment of acceptance. Critics argue that publishers will not sit back and allow Green OA to prevail if it really catches on, and will start imposing embargoes; Harnad suggests countering that with a "Request a copy" button on such embargoed items.

I'm not qualified to assess the merits of these arguments, but it does seem to me that these two camps are essentially in conflict: one camp is telling publishers that they are under no threat from green open access and that no disruption is likely in the future, while the Harnad camp is trumpeting loudly what it intends to do once Green OA becomes dominant.

Some have suggested supporting Green OA is hypocritical (if for example one tells publishers that they are under no threat, yet secretly hopes for a Harnad disruption eventually), and yet others claim Green OA is flawed and will never succeed because it is essentially "parasitic" on the existing system and survives only because it relies on the current traditional publishing system.

A more radical form of Green open access (if it is considered one)

There is an even more radical purpose to collecting papers in repositories. If you read Crow's The Case for Institutional Repositories: A SPARC Position Paper, he actually suggests a far more radical idea than just collecting post-prints that have been published by publishers and being happy with the status quo, or even the Harnad idea of eventually flipping to Gold OA on favourable terms.

The future he suggests actually involves competing with traditional publishers. In such a model, researchers would submit papers into IRs and reviewers would review them as usual, but the key thing is that everything would be done through the repository, and universities and researchers could "take back" the scholarly publication system from traditional publishers.

This sounds a lot like the overlay journals we see built on arXiv. For an institutional repository version, we have the journals hosted on the Digital Commons system.

3. Many of the disadvantages of local institutional repositories vs more centralised subject repositories or academic social networks like ResearchGate hinge on the lack of network effects due to poor interoperability

In Are institutional repositories a dead end?, one way to summarise the many strengths of centralised subject repositories vs institutional repositories is that "size matters".

As I noted in a talk recently, academic social networks like ResearchGate are not new, and there was a flood of them in 2007-2009, including now defunct attempts by Elsevier and Nature.

Yet it is only in recent years it seems ResearchGate and seem to become dominant.

The major reason why this is happening only in the last 2 years or so is that the field of competition has now narrowed to two major systems left standing, ResearchGate and (if you count Mendeley that's a third), and network effects are starting to dominate.

While it is true that if you consider the "denominator" of subject repositories (all scholarly output from a specific subject) or of say ResearchGate (all scholarly output?), they aren't necessarily doing better than institutional repositories (all scholarly output of that institution), in absolute terms the material centralised repositories have dwarfs that of most individual Institutional repositories.

As more papers appear in ResearchGate or a subject repository, network effects kick in. More people will visit the site to search, any social aspects and functionality (which ResearchGate has a ton of) will become even more useful, and even statistics become more useful.

How so? Put your paper in an IR running software like DSpace, and even if you have the most innovative developer working on it, with the most interesting statistics, you are still limited to benchmarking your papers against the pitiful number of papers (by the standards of centralised repositories) in your isolated institutional repository.

Put it on SSRN, or ResearchGate and you can compare yourself easily with tons more researchers, papers or institutions.

Figure: ranking of university departments in the field of Accounting.

In this way, the hosted network of repositories on Bepress Digital Commons actually seems the way to go compared to isolated DSpace repositories, because one can do the same types of comparison on the Digital Commons Network, which aggregates data across the various repositories using Digital Commons.

My institution is currently on Bepress Digital Commons and faculty put their papers on it.

So in the above example, I can see how well faculty from the School of Accountancy here are doing versus various peers in the same field who also put their papers on their IR. Happily, I can report that the dean of the accountancy school here is one of September's most popular authors in terms of downloads.

4. Interoperability among repositories is the only way to make network effects matter less

My meagre understanding of OAI-PMH was that it was indeed designed to ensure all repositories could work together. The idea was that individual repositories could host papers while others could build services that sat on top of them all, harvesting and aggregating all the output into one service.
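To make the harvesting idea a little more concrete, here is a minimal sketch in Python of how an aggregator might build the requests it sends to each repository. The repository URLs and the function name are made up for illustration; only the ListRecords verb and the metadataPrefix and resumptionToken arguments come from the protocol itself. Treat it as a sketch rather than a working harvester.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL for one repository.

    The function name and defaults are illustrative assumptions.
    """
    if resumption_token:
        # A resumptionToken continues a previous, partially delivered list.
        # It is sent with the verb only, without metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        # First request: ask for records in simple Dublin Core (oai_dc).
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


# Hypothetical repository endpoints, for illustration only.
repositories = [
    "https://repository.example.edu/oai",
    "https://ir.example.org/oai/request",
]

for base in repositories:
    print(list_records_url(base))
```

An aggregator would fetch each URL, store the returned records, and keep requesting with the resumptionToken from each response until none is returned. Note that everything that comes back is metadata.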

I know it's fashionable to bash OAI-PMH these days and I would not like to jump on the bandwagon.

Still, it strikes me that a protocol that works only on metadata was, in hindsight, a mistake. Perhaps it was understandable to assume that all records in IRs would have full text, as the model back then was arXiv, which was full text. But as mentioned above, there were in fact multiple goals and objectives for IRs, and many became filled with metadata-only records because of this.

This made it really painful for aggregators when they tried to pull all the records together from various IRs using OAI-PMH, as they couldn't tell for sure whether there was full text or not. This is the main reason why systems like BASE can't tell with certainty whether a record they harvested has full text (I understand there are rough algorithmic methods that try to guess if there is full text attached), and it's the same reason why many libraries running web scale discovery services can't tell if a record from their own IR has full text or not. (For the same reason, they also don't turn on other IRs that are available in their discovery index.)
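I don't know what methods BASE or anyone else actually uses, but a rough guess of this kind might look something like the sketch below. It assumes records arrive as simple Dublin Core (oai_dc) and uses two heuristics of my own choosing: an identifier that ends in ".pdf", or a declared format of "application/pdf". The sample records and URLs are invented.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core elements used inside oai_dc records.
DC = "{http://purl.org/dc/elements/1.1/}"

# Invented sample records, for illustration only.
SAMPLE_WITH_FILE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>A paper with an attached file</dc:title>
  <dc:identifier>https://repository.example.edu/handle/123/456</dc:identifier>
  <dc:identifier>https://repository.example.edu/bitstream/123/456/paper.pdf</dc:identifier>
  <dc:format>application/pdf</dc:format>
</oai_dc:dc>"""

SAMPLE_METADATA_ONLY = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>A metadata-only record</dc:title>
  <dc:identifier>https://repository.example.edu/handle/123/789</dc:identifier>
</oai_dc:dc>"""


def probably_has_full_text(record_xml):
    """Guess whether a harvested oai_dc record has full text attached.

    Heuristic only (an assumption, not any aggregator's real method).
    """
    root = ET.fromstring(record_xml)
    identifiers = [(e.text or "").strip().lower() for e in root.iter(DC + "identifier")]
    formats = [(e.text or "").strip().lower() for e in root.iter(DC + "format")]
    links_to_pdf = any(i.endswith(".pdf") for i in identifiers)
    declares_pdf = any(f == "application/pdf" for f in formats)
    return links_to_pdf or declares_pdf


print(probably_has_full_text(SAMPLE_WITH_FILE))
print(probably_has_full_text(SAMPLE_METADATA_ONLY))
```

A guess like this will be wrong in both directions: a record can declare a PDF that is embargoed or restricted, and a record with full text can link only to a landing page. That is exactly why metadata alone makes the aggregator's job so hard.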

In truth, making repositories work together involves a host of issues, from having standardized metadata (including subject, content type etc.) so aggregators like BASE or CORE can offer better searching, browsing and slicing features, to ensuring that full text can easily "flow" from one repository to another, to ensuring usage statistics are standardized (or can be combined?).

In fact, there are protocols like OAI-ORE and SWORD (Simple Web-service Offering Repository Deposit) that try to solve some of these problems. For example, SWORD allows one to deposit to multiple repositories at the same time and to do a repository-to-repository deposit, but I am unsure how well supported these protocols are in practice.

Fortunately, this is indeed what COAR (Confederation of Open Access Repositories) is working on, and they have several working groups tackling these issues now.

If individual repositories are to thrive, these issues need to be solved, allowing easy flow and aggregation of metadata, full text and perhaps usage statistics, allowing them to counter the network and size effects of big centralised repositories.

5. There seems to be a move towards integration across the full research cycle and/or into author workflows.

The pitch we have always made to researchers is this: give us your full text, we will put it online and you will gain the benefits (e.g. more visibility, the satisfaction of knowing you are helping science progress, or that you are pushing back against commercial publishers). Sadly, that doesn't seem to be enough to motivate most of them.

So what can we do?

Integration with University Research management systems from and to repositories

Firstly, we can tell them we are going to reuse all the data they are already giving us. Among other things, we can use their data to populate CV/resume systems like Vivo. Since all the data is already there, we can also use it for performance assessment at the individual, department and university levels by combining the data with citation metrics.

We can make it easier on the other end too. Instead of getting researchers to enter metadata manually, we can pull it into our systems using Scopus, Web of Science, ORCID or other systems that allow us to pull in researchers by institution.

What I describe above is indeed the idea of a class of software currently known as CRIS (Current Research Information System) or RIMS (Research Information Management System). It is basically a faculty/research management workflow that can track the whole life cycle of research, often including things typically done by other systems such as grants management, and that integrates with other institutional systems like HR or Finance systems.

The three main systems out there are Pure, Converis and Symplectic Elements. The point to notice is that these systems are not mainly about supporting open access, but it can be one of their functions.

For example, while Converis's publication module accepts publication full text, this full text isn't necessarily available online publicly if you do not get the Research portal module (this isn't mandatory). In the case of Symplectic, I understand it doesn't even have a public facing component, but there are integrations with IRs like DSpace available.

But we can have more integrations than this.

Integration with Publisher systems to repositories

How about considering an integration between a publisher and an IR system? Sounds impossible?

The University of Florida has a tie-up with Elsevier where, using the ScienceDirect API, metadata from ScienceDirect will automatically populate their IR with articles from their institution. Unfortunately, the links on the IR will point to articles on the ScienceDirect platform. While a few will be open access, most will not be.

I can hear librarians and open access activists screaming already, this isn't open access. What is interesting is, there is a phase II / pilot project listed where the aim is for Elsevier to provide "author accepted manuscripts" to IRs!

If you have ever tried to get a researcher to find the right version of the paper for depositing into IRs, you know how much of a game changer this will be.

Logically it makes so much sense: the publishers have the postprints already in their publication/manuscript submission systems, so why not give them to IRs? Well, the obvious reason is that we don't believe publishers would want to do that, as it's not in their best interest. Yet...

Integration with Publisher systems from repositories

Besides an integration from post-print to IR, the logical counterpart would be an integration from pre-print to publisher submission systems, and pre-prints often sit in subject repositories.

Indeed this is happening, as PLOS has announced a link between their submission system and bioRxiv.

In the same vein, the earlier mentioned overlay journals can be said to embody the same idea.

Integration with reference managers?

What other types of integration could occur? Another obvious one would be from Reference managers.

Elsevier happens to own Mendeley, so an obvious route would be people collaborating via Mendeley groups and then, with a click, pushing the paper to a journal submission system.

ProQuest, which now owns a pretty big part of the author workflow including various discovery services and reference managers like Flow and RefWorks, could do something similar. For example, I remember some vague talk about integrating Flow, their reference manager, with depositing theses as ETDs, say.

Will a day come when I can push my paper from my reference manager or preprint server to the journal submission system and, when it is accepted, the post-print seamlessly goes into my IR of choice, and from my IR the data flows further into other systems to populate my CV profile and/or expert system?

I doubt it, but I can dream.

A 100% friction-less world?


This post has been a whirlwind of different ideas and thoughts, reflecting my still evolving thoughts on open access and repositories. I welcome any discussions, corrections of any misconceptions or errors in my post.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.