Harvesting usage data?

I was talking with a researcher the other day who said that, despite his institution mandating deposit of research papers in his institutional repository, he didn’t comply, preferring to deposit in an international subject repository. Naturally, I asked him ‘why?’. He said that it was because he wanted each of his papers to be in one, and only one, place on the web, so that he could get accurate download statistics for it. Obviously, we’re aware in the JISC IE team of the various arguments on this topic, and we’ve funded a piece of work to look at the practical ways in which subject and institutional repositories might work together, which could address this issue among others. We’ve also funded various projects on repository statistics, such as ‘Interoperable Repository Statistics’ (which has developed a tool that repository managers can use to analyse and share statistics) and an ongoing small piece of work on harmonising article-level usage data formats. There is also MESUR and other projects in this space.

However, in the real world, copies of some research papers are likely to be at various places on the web, and we wondered whether a tool could be built that combined fuzzy matching to identify copies that were probably the same paper, some means of querying the servers on which they sat to get download data, and a reliable way of then aggregating that data into some acceptable statistics. Is that an important use case? Is it feasible to build something that addresses it?
What’s the relationship (if any) with name authority services (see the JISC pilot Names project) or persistent identifiers (see the JISC Resourcing Identifier Interoperability for Repositories - RIDIR demonstrator)?
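To make the question a bit more concrete, here is a minimal sketch in Python of the matching and aggregating steps only. It is an illustration under stated assumptions, not a description of any existing tool: the record fields (`title`, `repository`, `downloads`), the function names and the 0.9 similarity threshold are all invented for the example, and the download counts are supplied by hand rather than obtained by querying any server. The fuzzy matching uses `difflib.SequenceMatcher` from the standard library on normalised titles, which is only one of many possible approaches.

```python
from difflib import SequenceMatcher


def normalise(title):
    """Lower-case a title and replace punctuation with spaces."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in title)
    return " ".join(cleaned.split())


def same_paper(title_a, title_b, threshold=0.9):
    """Guess whether two titles refer to the same paper (threshold is an assumption)."""
    ratio = SequenceMatcher(None, normalise(title_a), normalise(title_b)).ratio()
    return ratio >= threshold


def aggregate_downloads(copies, threshold=0.9):
    """Group copies that are probably the same paper and sum their downloads."""
    groups = []
    for copy in copies:
        for group in groups:
            if same_paper(group["title"], copy["title"], threshold):
                group["downloads"] += copy["downloads"]
                group["locations"].append(copy["repository"])
                break
        else:
            # No existing group matched, so this copy starts a new one.
            groups.append({
                "title": copy["title"],
                "downloads": copy["downloads"],
                "locations": [copy["repository"]],
            })
    return groups


# Hypothetical records: in practice the counts would come from each server.
copies = [
    {"title": "Harvesting Usage Data from Repositories",
     "repository": "institutional", "downloads": 120},
    {"title": "Harvesting usage data from repositories.",
     "repository": "subject", "downloads": 340},
    {"title": "A Different Paper Entirely",
     "repository": "subject", "downloads": 15},
]

for group in aggregate_downloads(copies):
    print(group["title"], group["downloads"], group["locations"])
```

Matching on titles alone would obviously produce false positives and negatives, which is part of why the relationship with name authority services and persistent identifiers matters.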

ORE@JISC

With the release of the beta OAI-ORE specification this week, I thought it was worth highlighting some of the JISC work in the UK that is contributing to this initiative. Two short projects are looking to experiment with ORE and feed back into its development. The FORESITE project at Liverpool, run by Rob Sanderson, has produced ORE resource map descriptions of the JSTOR collection (1.8 million full text articles), and will also ORE-enable the DSpace repository platform, depositing the JSTOR-ORE collection into DSpace using the SWORD protocol. The Theorem project, based at Cambridge and run by Jim Downing, is looking at etheses, both representing ‘ideal’ born-digital theses as ORE resource maps, and looking at workflows around these. This project is working closely with the Integrated Content Environment (ICE) developed by Peter Sefton at the University of Southern Queensland, Australia, to create an authoring and management environment that produces and handles chemistry theses as born-digital objects, with live links to data, and so on. This work complements an international project led in the UK by Chris Awre, and involving partners from the UK, Netherlands, Germany and Denmark, which is looking to get some international agreement on a complex object format for theses, drawing from the ORE specifications, but building on specifications currently used, such as x-metadiss in Germany. Given the relative simplicity of doctoral theses – they have limited versioning issues for example – and the pressing need in many countries to automate the thesis workflow, it may be that theses become an early ORE adopter.

Click streams - Library Management Systems

I’ve been meaning to do a short post about the recent library systems study that JISC commissioned with SCONUL so people know about it. So here it is. I’ve been reminded of it as I’m at the Eduserv Symposium today and Ken Chad, who worked on the study, asked a question related to it.

The Eduserv Symposium is focusing on disruptive technologies and what the impact might be on the organisation: in our case universities and colleges, though as Andy Powell pointed out in his introduction there is also disruption for related service providers such as Eduserv (and for that matter JISC). So one question is how the academic/education sector should respond to the ‘disruptive’ technologies (for that read web 2.0/service provision on the network, e.g. Google and Amazon services). Ken Chad mentioned the opportunity that the sector has in terms of the data known about users; for example click streams. The library management systems study (that Ken worked on with Sero Consulting) sees this as an opportunity for academic libraries to make their services more relevant to users. Of course there are delicate issues surrounding the use of click streams, not least privacy, as Larry Johnston, NMC, pointed out in response to Ken’s question at the Eduserv Symposium.

The report covers far more ground than click streams; it is a horizon scan of what is happening in the UK academic sector in terms of LMS provision and what the requirements might be in the changing context in which libraries now find themselves.

http://www.jisc.ac.uk/whatwedo/programmes/resourcediscovery/libraryms.aspx

Metadata for stuff in repositories …

I just wanted to highlight some metadata application profile work that is underway as part of the information environment repository programme. Having attended the birds of a feather session (coordinated by Rosemary Russell, UKOLN and Julie Allinson, University of York) about this at OR08 I finally got to see what JISC had funded. Today at the JISC Repositories and Preservation Advisory Group we discussed some of the work and I guess it made me think it was worth making a few more people aware of it. JISC has funded the development of:

metadata application profiles based on Dublin Core for:
Scholarly works
Geo-spatial data/information
Images
Multi-media

And we’ve also funded some work to assess what might be done in terms of application profiles for the following:
Learning objects
Scientific data

A little bit of context…
After using OAI-PMH across repositories in the JISC Focus on Access to Institutional Repositories (FAIR) programme, the experience was that Dublin Core was often not rich enough to be very useful to end user applications. The requirement for both metadata and full text indexing was a specific recommendation of the FAIR ePrints UK harvest and search project. After other work also confirmed this, the response was to seek to add to basic DC by developing an application profile. The scholarly works application profile (SWAP) was developed by Julie Allinson (at the time UKOLN, now University of York) and Andy Powell (Eduserv Foundation). SWAP aims to help support richer search functions and also to support full text indexing, and as I understand it another benefit is navigation between different versions. It is based on the Functional Requirements for Bibliographic Records (FRBR) model, which uses the following entities: work, expression, manifestation and item.
You can read more about SWAP here:
http://www.ariadne.ac.uk/issue50/allinson-et-al/
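For anyone who finds the four FRBR entities easier to picture as a data structure, here is a rough sketch in Python. It is not taken from the SWAP documentation: the class layout follows only the entity names mentioned above (work, expression, manifestation, item), and every field name, the `locations` helper and the example values are assumptions made up for illustration. The point is simply that hanging several expressions (versions) off one work is what makes navigation between versions possible.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    """A single copy, e.g. a file at a particular URL (field name assumed)."""
    url: str


@dataclass
class Manifestation:
    """One format of an expression, e.g. PDF or HTML (field names assumed)."""
    media_type: str
    items: List[Item] = field(default_factory=list)


@dataclass
class Expression:
    """One version of a work, e.g. a preprint or the published version."""
    version: str
    manifestations: List[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    """The abstract scholarly work that all versions belong to."""
    title: str
    expressions: List[Expression] = field(default_factory=list)


def locations(work, version):
    """Return the URLs of every item belonging to one version of a work."""
    return [
        item.url
        for expression in work.expressions
        if expression.version == version
        for manifestation in expression.manifestations
        for item in manifestation.items
    ]


# Hypothetical example: one work with two versions held in different places.
paper = Work(
    title="An example paper",
    expressions=[
        Expression("preprint", [
            Manifestation("application/pdf",
                          [Item("http://repository.example.org/1/preprint.pdf")]),
        ]),
        Expression("published", [
            Manifestation("application/pdf",
                          [Item("http://publisher.example.org/paper.pdf")]),
            Manifestation("text/html",
                          [Item("http://publisher.example.org/paper.html")]),
        ]),
    ],
)

print([expression.version for expression in paper.expressions])
print(locations(paper, "published"))
```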

SWAP, although based on a FRBR-type model, was kept quite simple. It seems that when creating SWAP some hard lines were drawn to avoid too much complexity, and from the feedback I have heard it seems to have addressed requirements. It was certainly good to hear from one of the attendees at the OR08 meeting that SWAP was “exactly what they required”. Mick Eadie (Visual Arts Data Service, University College for the Creative Arts) also described the images AP at the OR08 meeting, and it seems to have tried to keep a simple approach too. A draft of the images AP is now out for comment. See:

http://www.ukoln.ac.uk/repositories/digirep/index/Images_Application_Profile

Of course to get the real benefit of these application profiles the implementation of them has to be made as easy as possible and we need to encourage take-up. Working with repository software providers to support the APs is one thing that might be possible and the teams supporting the work intend to do this. SWAP has been implemented at Warwick University as they customised EPrints software to support it.

If you really want to help or know someone that can :-) a job advert is currently out for a related metadata advocacy post at UKOLN: http://www.ukoln.ac.uk/vacancies/08H127A/job-ad/

Note that SWAP is the most mature of the APs; the other areas are in initial draft and are still being developed.

Here are some related links:

SWAP:

http://www.ukoln.ac.uk/repositories/digirep/index/Eprints_Application_Profile

The geo spatial work that James Reid, EDINA (University of Edinburgh) is leading on is currently out for comment:

http://www.ukoln.ac.uk/repositories/digirep/images/e/ef/Geospatial_Application_Profile.doc

Work done by Phil Barker, CETIS, (Heriot-Watt University) on the learning material application profile is here:

http://www.icbl.hw.ac.uk/lmap/domainModel.draft1.html

It is probably worth mentioning that previously some work has been done for learning materials/objects. See information on RLLOMAP: http://www.intute.ac.uk/publications/rdn-ltsn/ap/
and: http://standards-catalogue.ukoln.ac.uk/index/UK_LOM_Core
RLLOMAP seems to have a similar aim to the current work in that it was to support the exchange of metadata using OAI-PMH and UK LOM did build on this.

Not surprisingly the multi-media application profile is a tough one and drafts are not yet available as far as I know. But I do know via Pete Johnston (Eduserv Foundation) that there are some early results being reviewed! Gayle Calverley is leading the work in this area.

There is also the DCMI Scholarly Communications Community where discussion should take place about the application profiles once the work picks up as a whole (coordination and outreach is currently being planned):
http://dublincore.org/groups/scholar/

Open Repositories 2008

Before arrival at the recent Open Repositories 2008 conference, I was telling myself that this would be a dynamic, busy and vibrant conference, attended by a technically ambitious and knowledgeable community, and that it would obviously be a great opportunity for me to engage in constant blog activity (reading and writing). As it turned out, the preconceptions I had about the conference were exactly right. The aspirations I had about my own activities in the blogosphere, however, turned out to be more a case of ‘amplified expectations’ rather than the ‘amplified conference’ that Lorcan Dempsey has referred to (http://orweblog.oclc.org/archives/001404.html).

From the more comfortable perspective of two weeks after the energetic and meeting-packed week down in Southampton (that made it impossible to get near a blog!) it’s possible to look back and consider a few of the more prominent features of the conference.

One principal item was the role that OAI-ORE (Open Archives Initiative Object Reuse and Exchange, http://www.openarchives.org/ore/) may have in describing the structure and semantics of aggregations of web objects, thereby making those objects available to a variety of applications. Though still in beta (or perhaps even alpha) by the time of the conference, this data model was used in the development of the winning prototype of the ‘Repository Challenge’ competition (http://or08.ecs.soton.ac.uk/developers.html) - a JISC/CRIG sponsored event that was an important and characteristic feature of the conference.

Tim Brody (University of Southampton) along with fellow team members, Ben O’Steen (University of Oxford) and Dave Tarrant (University of Southampton) developed the winning application which was called ‘Mining the ORE’. Tim Brody describes it as …
 
‘A practical approach to copying complex objects between repositories. Every eprint in a repository is exposed as an ORE aggregation (Object Reuse and Exchange). Each ORE aggregation of an eprint links together all the files and associated metadata. This aggregation of files had one resource that was marked as conforming to simple Dublin Core and this was used as the basis of the metadata interoperability. When ingested into a new repository each resource in the ORE aggregation is retrieved and stored. The simple Dublin Core is used to index the new eprint for the purposes of search and discovery, otherwise all of the component resources are simply shown to the user. We implemented exemplar ORE interfaces for both EPrints and Fedora, enabling the transfer of complex objects between the two system implementations.’
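As a very rough illustration of the ingest step Tim describes, the Python sketch below walks the resources of an aggregation, stores each one, and picks out the resource marked as simple Dublin Core for indexing. It is emphatically not the ‘Mining the ORE’ code: the dictionary layout, the `ingest` function, the `fetch` callable and the use of the OAI Dublin Core namespace URI as the ‘conforms to’ marker are all assumptions made for the example, and the aggregation is an in-memory structure rather than a real ORE resource map retrieved over the web.

```python
# Assumed marker for "this resource is simple Dublin Core" (illustrative choice).
DC_FORMAT = "http://www.openarchives.org/OAI/2.0/oai_dc/"


def ingest(aggregation, fetch):
    """Copy every resource in an aggregation and pick out the DC record.

    `aggregation` is a dict with a "resources" list (hypothetical layout);
    `fetch` is any callable that returns the content for a resource URI.
    """
    stored = {}
    index_metadata = None
    for resource in aggregation["resources"]:
        content = fetch(resource["uri"])
        stored[resource["uri"]] = content  # every component is kept
        if resource.get("conforms_to") == DC_FORMAT:
            index_metadata = content  # used for search and discovery
    return {"files": stored, "index": index_metadata}


# Hypothetical source repository, standing in for HTTP retrieval.
source_repository = {
    "http://source.example.org/eprint/1/paper.pdf": b"%PDF example bytes",
    "http://source.example.org/eprint/1/data.csv": b"a,b\n1,2\n",
    "http://source.example.org/eprint/1/dc.xml": b"<dc>An example title</dc>",
}

aggregation = {
    "uri": "http://source.example.org/eprint/1",
    "resources": [
        {"uri": "http://source.example.org/eprint/1/paper.pdf"},
        {"uri": "http://source.example.org/eprint/1/data.csv"},
        {"uri": "http://source.example.org/eprint/1/dc.xml",
         "conforms_to": DC_FORMAT},
    ],
}

new_eprint = ingest(aggregation, source_repository.get)
print(len(new_eprint["files"]), new_eprint["index"])
```

Passing the retrieval step in as a callable keeps the sketch free of any network code; a real implementation would need to parse an actual resource map and fetch each resource from the source repository.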

19 teams entered the ‘Repository Challenge’ and in total over 40 developers were involved in creating the rapid prototypes. Five prototypes were shortlisted by an international panel of judges and the winner was then selected by a ballot of the conference delegates at the OR08 awards dinner. This type of developmental process is a new departure in terms of JISC-funded initiatives but has proved to be potentially of great benefit in terms of providing candidate service-usage models (SUMs) for submission to the e-Framework, and other forms of documentation including training materials and case studies. It would be very interesting to hear views and opinions about the value of this form of rapid prototyping exercise. Anyone interested should contact David Flanders at the Common Repositories Interface Group (CRIG) http://www.ukoln.ac.uk/repositories/digirep/index/CRIG. David was the driving force behind the Repository Challenge at OR08 and its success was entirely to do with his energy and determination.

Returning to the mainstream sessions of the conference, Peter Murray-Rust gave the first keynote speech and urged delegates to be wary of the ubiquitous use of the PDF format to capture the complexity of scientific information. This reluctance to accept what has become the de facto deposit standard clearly rang bells with some delegates (http://scilib.typepad.com/science_library_pad/2008/04/or08—the-pres.html).

One of the challenges tackled by many presenters was how to ease the burden of deposit and how to incorporate web 2.0 interfaces and techniques into repository design and workflow. The automation of metadata tagging and the design of batch ingest procedures were also variously discussed.

All the papers are being made available in the OR08 repository (http://pubs.or08.ecs.soton.ac.uk/) and this will give some idea of the complexity of the main part of the conference. What it won’t describe is the amount of peripheral but important activity that happened around these presentations, encompassing: Fedora, EPrints and DSpace group meetings; a repository manager forum; a developer barcamp; an international meeting about Global Registries; a EurOpen Scholar day addressing issues about Open Access … not to mention gatherings and briefings put together by commercial participants such as Microsoft, who introduced the research data repository platform that they have been developing.

Perhaps the very busiest part of the conference was the one that I almost completely missed. If Owen Stephens’ experience of the conference was anything to go by (and this was someone who wasn’t even at the conference), then all the ‘amplification’ that was going on was perhaps a bit too much! http://ukwebfocus.wordpress.com/2008/04/08/micro-blogging-at-events/#comment-64627.

The ‘chattering classes’ is obviously a thing of the past. Now we have the ‘twittering’ classes.

The Repository Challenge at OR08

The judging of the repository challenge is in full swing and our august panel of judges are being treated to fast and furious demonstrations from eighteen different teams made up of delegates at the Open Repositories 2008 conference. This competition has been organised by the Common Repositories Interfaces Group and $5000 prize money is at stake for the best demonstration of some capacity to create a new and useful item of functionality that will work across at least two different repository platforms. Entrants have 5 minutes to put their idea over to our panel of 5 judges and then have to face a further five minutes of questioning. This is tough …! but all the participants are doing a great job of communicating their hard work, some of which has been created in hotel rooms and at various refectory tables around the campus of Southampton University over the last two days. The prize will be awarded at the conference dinner this evening.

Posted by: Neil Grindley