Harvesting usage data?
I was talking with a researcher the other day who said that, despite his institution mandating deposit of research papers in his institutional repository, he didn’t comply - preferring to deposit in an international subject repository. Naturally, I asked him ‘why?’. He said that it was because he wanted each of his papers to be in one, and only one, place on the web, so that he could get accurate download statistics for it. Obviously, we’re aware in the JISC IE team of the various arguments on this topic, and we’ve funded a piece of work to look at the practical ways in which subject and institutional repositories might work together, which could address this issue among others. We’ve also funded various projects on repository statistics, such as ‘Interoperable Repository Statistics’ (which has developed a tool that repository managers can use to analyse and share statistics) and an ongoing small piece of work on harmonising article-level usage data formats. There is also MESUR and other projects in this space.
However, in the real world, copies of some research papers are likely to sit at various places on the web, and we wondered whether a tool could be built that used fuzzy matching to identify copies that were probably the same paper, some means of querying the servers on which they sat to get download data, and a reliable way of then aggregating that data into some acceptable statistics. Is that an important use case? Is it feasible to build something that addresses it?
What’s the relationship (if any) with name authority services (see the JISC pilot Names project) or persistent identifiers (see the JISC Resourcing Identifier Interoperability for Repositories - RIDIR demonstrator)?
Comments
6 Responses to “Harvesting usage data?”
Neil, The obvious comment to make here is that this researcher makes his work open access, and that’s fine. The follow-up question should not be about him, however, but what about all the other researchers, where should they deposit their papers? Given the finding that researchers as users of the literature appear to value open access more than as producers (Harley et al. Sept. 2006 http://cshe.berkeley.edu/publications/docs/ROP.Harley.AcademicValues.13.06.pdf), then this OA author has a vested interest in the answer. The answer should be, where we have the best chance of getting open access to all those other papers. Whatever that answer is, it is also where he should deposit his papers. Clearly, those responsible for introducing the mandate at this researcher’s institution have given their answer. For this researcher, substitute Wellcome, SCOAP, and other OA pioneers, and ask if they can similarly do more to help promote the wider case of OA in terms of OA location.
Thanks Steve, but I’m not sure my question was about where the researcher *should* deposit, but more about what services can be offered to her *wherever* she deposits. Perhaps we need to build technologies that support these services (eg statistics services) across the web landscape, rather than expecting particular behaviour from researchers?
On the subject of publication usage data, aside from a technical solution or implementation, a more fundamental question arises on versioning.
Aggregating usage statistics from different locations through persistent identifiers would be a relatively straightforward task. That said, in the current landscape an author is rewarded based on the impact factor of and citations from the publisher’s version, excluding usage data of a given post-print. Finding a solution for aggregating usage data for different copies on different locations might also include a discussion on aggregating usage data for different versions.
When we want to determine impact factors et al. from an aggregated set of versions, the question arises of what this set would then comprise: only the publisher’s version and the post-print, or should the statistics include any drafts and pre-prints of the given article as well?
In this perspective, using ORE or a standard with similar capabilities to describe these aggregations, in combination with persistent identifiers and possibly including the findings of the Version Identification Framework project, resulting in statistics compatible with COUNTER, could prove to be both an interesting and valuable path to explore. A placement for such an aggregating service may then be sought in the direction of resource identifier resolvers, though this might not necessarily be the most preferred location. That said, a solution that can be implemented within the existing infrastructure and standards (for instance harvesting the data through OAI-PMH) could increase the rate of acceptance and usage of such a solution.
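To make the versioning question concrete, here is a minimal sketch of aggregation by persistent identifier. Everything in it is an assumption for illustration: the record layout, the identifiers, the version labels and the download counts are hypothetical, and it says nothing about how the data would actually be harvested or whether the result would be COUNTER-compatible.

```python
from collections import defaultdict

# Hypothetical usage records, one per copy of a paper:
# (persistent identifier, location, version label, downloads)
records = [
    ("hdl:1234/abc", "institutional-repo", "post-print", 120),
    ("hdl:1234/abc", "subject-repo", "pre-print", 300),
    ("hdl:1234/abc", "publisher-site", "publisher", 950),
    ("hdl:1234/xyz", "institutional-repo", "post-print", 40),
]

def aggregate_downloads(records, versions=None):
    """Sum downloads per identifier; if `versions` is given, count only those versions."""
    totals = defaultdict(int)
    for pid, _location, version, downloads in records:
        if versions is None or version in versions:
            totals[pid] += downloads
    return dict(totals)

# Every copy counted, whatever its version.
all_versions = aggregate_downloads(records)

# Only the publisher's version and the post-print counted.
published_only = aggregate_downloads(records, versions={"publisher", "post-print"})
```

The two calls give different totals for the same paper, which is exactly the policy choice raised above: the summing is trivial, deciding which versions belong in the set is not.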
To continue with the reply from Magchiel, I would like to add a small part.
When we live in the fantastic and ideal world where researchers cite their material and all use persistent identifiers, we can make life easy and count the number of hits at the location of each link resolver (such as the handle system).
An author claiming a collection of Persistent Identifiers can get an overview of ‘his’ download statistics.
You could institutionalise the Persistent Identifier - Author link by introducing a Digital Author Identifier (DAI).
Thanks Maurice and Magchiel, this does look like the right road to go down (PIDs, ORE…). But is it worth doing something now, before the “fantastic and ideal world”, when authors may not even know all the URLs of their papers?
Perhaps the approaches cited in the current thread on the jisc-repositories list “Dealing with duplicate papers” are relevant?
“we wondered whether a tool could be built that used fuzzy matching to identify copies that were probably the same paper,…”
The OA-Networking project http://www.dini.de/fileadmin/oa-netzwerk/PM_OA-Netzwerk_Projektstart_en_080116.pdf
is currently working on exactly this topic! Our prototype implementation uses word shingles and shingle statistics to compare documents stored on different repositories, independent of their file format and of some added parts (abstracts, comments etc.). This is not all this project is busy with, but one of its key features.
A public version will be available in October 2008.
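For readers unfamiliar with word shingles, the general idea can be sketched as follows. This is not the OA-Netzwerk prototype, whose internals are not described here: the function names, the shingle size of three words and the use of Jaccard similarity as the comparison measure are all assumptions made for illustration, and the sketch presumes plain text has already been extracted from each file.

```python
import re

def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word sequences) in a text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Text shorter than one shingle: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets: shared shingles over all shingles."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Two hypothetical copies of the same text; the second has one extra word.
copy_a = "the quick brown fox jumps"
copy_b = "The quick brown fox jumps over"
similarity = jaccard(shingles(copy_a), shingles(copy_b))
```

A score near 1 suggests two copies are probably the same paper, while added abstracts or comments lower the score only in proportion to their length. Where to set the threshold for "probably the same" would have to be tuned on real repository data.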