Research data curation
Last year, following the Digital Curation Conference in Washington DC, JISC and the Andrew W. Mellon Foundation hosted an international workshop to discuss and suggest international priorities for research and development work supporting academic research data curation. It’s taken a while for the notes to become available, for which I apologise, but here they are:
Priorities for research data curation workshop 2007
(I realise this is a PDF file, which won’t please everyone, but converting it from MS Word shrank the file size by over an order of magnitude.)
The starting point for the workshop was a recognition that, while research data is organised largely by (sub)discipline, the way in which infrastructure is developed and funded is often oriented nationally, or even around institutions. Some way is needed to reconcile the two. I have to confess that, on the day, I wasn’t sure we’d made a lot of progress, but in drafting the notes I changed my mind somewhat. Certainly, Peter Murray-Rust seemed to identify the academic department infrastructure as a key point where intervention could serve both that department and the wider goal of data curation and sharing. The photos of flip chart diagrams are perhaps not easy to read or understand, but they suggest a distinctive place for libraries and repositories.
Greg Crane’s Perseus project anticipated some of the topics that were covered later - notably how to design an infrastructure that is sustainable and yet adaptive - and there are a few ideas on this in the notes. There are also a few ideas about how the problem space might be broken down so that an international approach can be taken, though this remains difficult. With luck and effort, JISC’s and other UK ‘data’ work will join up with that in the US (eg the NSF Datanet programme), Australia (Australian National Data Service), etc, and these notes will help us do that.
Many thanks to the workshop participants, listed at the end of the notes.
The costs of preserving research data
There’s a new report on the JISC website, authored by Neil Beagrie, Julia Chruszcz and Brian Lavoie. It looks at how much it costs to preserve research data and, perhaps as important, how institutions and others could calculate this. There are lots of reasons why this report is likely to have an impact - looking after research data is potentially costly, and yet it is important that - as a community - we make reasonable decisions about what should be preserved and how. Perhaps unsurprisingly (at least for those who already do this for a living), it seems that ingesting the data is the largest cost in the curation lifecycle. At the same time, the evidence shows that correcting badly ingested data later is even more costly, so the figures probably suggest that the cost/benefit calculation for careful ingest is positive. There is potential for developing the methodology into a tool, and there could also be potential for some join-up with the Data Audit Framework.
Research data and the JISC IE
We’re hoping to present some themed web pages on the innovation work being funded under the JISC Information Environment area, including one on research data. I thought I’d use this blog to offer a preview / pilot of that page. I’m not sure if that’s an acceptable use of a blog, but I’m sure I’ll find out.
The aim of the JISC IE work on data is to promote and enable new ways of finding, using and sharing research data. Because there are huge variations in what ‘data’ is, and in disciplinary cultures and practices around it, there is likely to be a ‘mixed economy’ of infrastructure and services to support its management.
There has been a large number of reports on data recently, some of which are helpfully listed in a recent presentation by Michael Jubb of the Research Information Network. Three key documents are the report from the then Office of Science and Innovation on ‘e-infrastructure’, which set out a high-level vision; a set of principles for data stewardship developed by the Research Information Network; and the ‘Dealing with Data’ report from JISC/UKOLN, which made practical recommendations.
In terms of current practice, two projects promise to paint a clear picture from different perspectives. A study of ‘data publication’ practice among researchers has been funded by JISC, the Research Information Network and the Natural Environment Research Council. A different project, SCARP, is exploring disciplinary attitudes and approaches to data deposit, sharing and re-use, curation and preservation.
JISC and the Engineering and Physical Sciences Research Council have jointly funded the Digital Curation Centre (DCC), which is a centre both of innovation and of guidance. Members of the DCC are developing a Data Audit Framework, which will enable universities to assess what data is being held on their computer systems, and who is responsible for it. The Data Audit Framework will be piloted in a number of universities in 2008.
There is a suspicion that the sector lacks sufficient skilled people to manage research data effectively. A report is due shortly that will review the position and make recommendations on how this might be addressed. The DCC will run a summer school this year to begin to address this issue. Of course, investment will only follow if a business case can be made, and a part of making that case is assessing the costs of preserving data. A methodology is being developed that will enable estimates to be made, though of course without an assessment of the benefits of keeping data, the costs are only half the story.
The UK is fortunate to have both the UK Data Archive (co-funded by JISC and the Economic and Social Research Council) and the data centres supported by the Natural Environment Research Council. These services offer expert advice and infrastructure for data management. A feasibility study is underway into the possibility of a UK Research Data Service as a collaboration between some UK universities, to fill in some of the gaps between such data centres. In addition, the DISC-UK Datashare project is looking at how UK higher education can increase its capacity to curate and share research data.
Finally, it is worth noting that JISC also funds work under the heading of ‘e-Research’, which is also focused on research data, including grid and semantic enabling of datasets.