Event Report: Managing Research Data Hack Day

kpitkin — Fri, 11 May 2012 09:00:42 +0000

DevCSI worked with the JISC Managing Research Data Programme and the JISC Orbital Project to organise a Managing Research Data hack event in Manchester from 3rd-4th May, 2012. The event was designed to bring together software developers, project managers, data librarians and experts with an interest in the area of managing research data to share, talk, collaborate and create useful solutions.

Participants were encouraged to develop ideas, paper prototypes or even working code to address some of the issues raised by delegates from a range of different projects. A prize was available for the best idea, with the winners receiving their expenses paid to get together and develop their idea further. There were also opportunities to share skills throughout the event.

The event followed a relaxed hack event format, opening with a series of lightning talks from participants describing their projects and areas of interest, followed by a period of brainstorming, development into the evening, and reporting back to the group to gather feedback.

Lightning Talks

History Data Management Plan (HDMP) Project

John Nicholls, University of Hull

Nicholls described himself as an example of “the reason we are all here.” He represents history researchers at the University of Hull, where he works as the data manager on the JISC-funded HDMP project. This involved working with the university’s library services to create useable data sets from the information collected by ordinary historians, and has resulted in the formulation of a history data management plan, which they are now using to help inform new projects so the researchers can put their data into a useable format from the outset. He was able to offer examples of the historical data for developers to experiment with during the event, but appealed for information about the tools available for managing data that an ordinary researcher could use, and asked how he might engage in the documenting process.

More information about this project is available here.

MongoDB

Nick Jackson, University of Lincoln

Jackson offered to run a crash course for those who were interested in storing and querying data in MongoDB, a NoSQL database. He provided a brief overview of the benefits of MongoDB, which he argued was massively scalable, agile and flexible, and explained why it is useful for handling research data.

This video is available on Vimeo.

Nick’s slides from this presentation are available here.

PIMMS (Portable Infrastructure for the Metafor Metadata System)

Gerard Devine, National Centre for Atmospheric Science (NCAS), University of Reading

PIMMS provides institutions with tools to capture information about the workflow of running simulations, from the design of experiments to their implementation via simulation models.

Devine explained how this works within his own research area of climate modelling, where the outputs are so large and complex that past strategies for understanding the models and the limited available metadata are no longer sufficient. The PIMMS project has created a system to help document this climate data, including a web form to help describe all the aspects of the experiments and models according to a set vocabulary. These can then be used in portals which can understand the schema and expose the information in different ways.

He noted that mapping data to metadata has been a particular problem, so he was interested in working with people who have experienced similar issues or found solutions.
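The idea of describing experiments and models according to a set vocabulary can be pictured with a small Python sketch. This is only an illustration: the field names and permitted terms below are invented and are not taken from the PIMMS or Metafor schemas.

```python
# Hypothetical controlled vocabulary: field name -> permitted terms.
# These terms are illustrative only, not the real Metafor vocabulary.
VOCABULARY = {
    "component": {"atmosphere", "ocean", "land surface", "sea ice"},
    "calendar": {"gregorian", "360-day", "noleap"},
}


def validate_record(record, vocabulary=VOCABULARY):
    """Return a list of human-readable problems found in the record."""
    problems = []
    for field, allowed in vocabulary.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif record[field] not in allowed:
            problems.append(f"{field}: '{record[field]}' is not in the vocabulary")
    return problems


if __name__ == "__main__":
    # A record that uses only permitted terms produces no problems.
    print(validate_record({"component": "ocean", "calendar": "360-day"}))
    # A misspelt term and a missing field are both reported.
    print(validate_record({"component": "oceans"}))
```

A portal that understands the same vocabulary could rely on records that pass such a check, which is one reason a fixed vocabulary makes the metadata easier to expose in different ways.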

Database as a Service implemented in Oxford University

Asif Akram, Oxford University

Akram outlined the Virtual Infrastructure with Database as a Service (VIDaaS) project, which allows users to create a project and upload, for example, Microsoft Access databases or Excel spreadsheets. The system then converts these into an online database in the cloud, which can be modified and shared using a simple user interface. He described the three tools they have created to make this process simple: a database migration tool, a Microsoft Access database converter, and an SQL Designer to help researchers create a working SQL database using drag-and-drop tools.
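To give a feel for the kind of conversion described here, the following Python sketch loads spreadsheet-style CSV text into a SQLite table. It is not the VIDaaS implementation: the function name, the choice of SQLite, the decision to store every column as text, and the sample rows are all assumptions made for illustration.

```python
import csv
import io
import sqlite3


def load_csv_into_sqlite(csv_text, table, conn):
    """Create `table` from CSV text (first row = column names); return the row count."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    # Quote identifiers so column names containing spaces still work.
    quoted_table = '"{}"'.format(table.replace('"', '""'))
    columns = ", ".join('"{}" TEXT'.format(name.replace('"', '""')) for name in header)
    conn.execute("CREATE TABLE {} ({})".format(quoted_table, columns))
    # Use placeholders so cell values are never interpolated into the SQL.
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(
        "INSERT INTO {} VALUES ({})".format(quoted_table, placeholders), data
    )
    conn.commit()
    return len(data)


if __name__ == "__main__":
    # Made-up sample rows standing in for an exported spreadsheet.
    sample = "site,year,finds\nHull,1850,12\nYork,1851,7\n"
    conn = sqlite3.connect(":memory:")
    print(load_csv_into_sqlite(sample, "survey", conn))
    print(conn.execute("SELECT site FROM survey WHERE year = ?", ("1851",)).fetchall())
```

A real migration tool would also need to infer column types and carry over relationships between tables, which this sketch deliberately leaves out.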

Further information is available here.

ORCID and DataCite Interoperability Network (ODIN)

John Kay, The British Library

Kay described his role as a social sciences curator at the British Library and his work with DataCite, which creates persistent identifiers for datasets so they can be cited. They have just received funding for a project to take DataCite forward, which will include working on interoperability between other systems, such as ORCID.

He was able to offer some APIs for developers to play with at the event, and the opportunity to mint DataCite DOIs.

REWARD

Brian Hole, Ubiquity Press

Brian Hole described the JISC-funded REWARD project, which aims to incentivise researchers to deposit their data without introducing any new steps into their everyday procedures. He provided an overview of how this worked, including submission to the Journal of Open Archaeological Data and ePrints. He outlined some of the recommendations resulting from the project, including the need for more training of library staff to customise ePrints to accept data more neatly.

Hole also provided an overview of the Journal of Open Archaeological Data and the benefits this gives to researchers as a peer reviewed journal which guides researchers through the process of finding an acceptable repository and issuing a data paper that can be cited in traditional papers.

He was particularly interested in collaborating with others to consider some of the issues identified by the project, including minting their own identifiers.

Further information is available here.

YouShare

Aaron Turner, University of York

YouShare is a HEFCE-funded project to provide an environment for researchers to share programs and data and apply programs to their data.

Turner described their current efforts to link the front end that researchers see to an archival system that is standards compliant. This will move the data set into various tiers of an archival system when it is not being used, then bring it back into a live system when people want to access it to carry out further experiments. He provided a demonstration of the interface, showing how to create workflows using YouShare and publish DOIs to facilitate citations.

He was looking to form collaborations to discuss issues associated with data ingest to archival systems during the hack event.

Further information is available here.

Data.bris

Damian Steer, University of Bristol

Steer described the cluster system at the University of Bristol, where they have high performance computing and persistent storage for researchers. Research projects can apply for storage, nominate a data steward, and receive 5 TB of free storage. He observed that quite a few people are already using it, including arts and humanities researchers.

The data.bris project aims to create an interface to help researchers use some of this infrastructure and make deposits with metadata. They are looking to add submission via SWORD, despite internal policy questions, and intend to start looking at packaging data sets and integrating with the PURE system by Atira.

DataStage

Sander van der Waal, OSS

Van der Waal outlined the two software components of the DataFlow project: Data Stage and Data Bank. Data Bank is an institutional repository system, and Data Stage is a step before that. As a researcher, before you are ready to publish your data set, you are working with data which needs to be stored and managed. Data Stage helps researchers on a local level to manage departmental data, in a similar way to Dropbox, by providing external backup and version control. Van der Waal explained that he would like Data Stage to push data to other SWORD compliant repositories, and appealed for people interested in connecting repositories using SWORD to collaborate.

Using dSpace

Ian Wellaway, University of Exeter

Wellaway described work at the University of Exeter, where they are using dSpace with Oracle. He observed that the submission process is a bit clunky, so they have been looking at easy deposit. The problem they have encountered is that a lot of researchers have big data sets, which they are struggling to get into the repository over HTTP. This causes frustration and inevitably puts people off submitting. He appealed for help from people who have solved or have an interest in solving a similar problem.

DMPOnline

Monica Duke, DCC

Duke introduced the DMPOnline tool developed by the Digital Curation Centre to help researchers create the data management plans, which are now requested by many funders. The DCC are looking to create an API for this, and she has been involved with some thinking about how people might interact with DMPOnline via this API. She was interested in talking further with any people who are interested in getting data in or out of DMPOnline, or think their system should be interacting with it. She also promoted a forthcoming workshop at Open Repositories 2012 which will be exploring this further.

Biomedical Research Infrastructure Software Service kit (BRISSkit)

Malcolm Newbury, Guildfoss

Newbury outlined the BRISSkit, a suite of applications to support the entire clinical study process, including CiviCRM to recruit participants, CA Tissue which tracks assets (blood, samples etc) and Informatics for Integrating Biology and the Bedside (I2B2) which decomposes information about each patient, adds an ontology and allows researchers to query the data based on that ontology. These tools have all been integrated so information can travel between the applications, and the full application set can now be provisioned in the cloud.

Newbury observed that they still have some challenges, including generating the unique numbers that are attached to samples, and integrating the applications in a way that does not slow them down, so they are currently looking into open source ways of orchestrating that integration.

Ideas

There were a number of ideas shared before the event, which were summarised briefly before the group began to brainstorm new ideas on the ideas wall. A complete list of all the ideas shared before and at the event can be found on the MRD Hack Days Ideas page.

Teams

Several broad teams formed to discuss the ideas further and suggest potential projects to work on throughout the rest of the event.

Data Activity Stream

A group worked on a proof-of-concept for a centralised service for tracking activity data around research projects and individual datasets. This would allow researchers to see what others have been doing with particular data objects, together with a stream of information about activity within the project as a whole.
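The concept can be sketched in a few lines of Python. This is not the group's API: the class names, the fields and the in-memory storage are assumptions chosen to show how one stream of events can answer both "what happened to this data object?" and "what happened in this project?".

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Activity:
    actor: str      # who did it
    verb: str       # what they did, e.g. "deposited"
    object_id: str  # identifier of the dataset acted upon
    project: str    # project the dataset belongs to
    when: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class ActivityStream:
    """In-memory stand-in for a centralised activity service."""

    def __init__(self):
        self._events = []

    def record(self, activity):
        """Append one activity to the stream."""
        self._events.append(activity)

    def for_object(self, object_id):
        """All activity around a single data object, oldest first."""
        return [a for a in self._events if a.object_id == object_id]

    def for_project(self, project):
        """All activity within a project as a whole, oldest first."""
        return [a for a in self._events if a.project == project]


if __name__ == "__main__":
    stream = ActivityStream()
    stream.record(Activity("alice", "deposited", "dataset-1", "project-a"))
    stream.record(Activity("bob", "downloaded", "dataset-1", "project-a"))
    stream.record(Activity("alice", "deposited", "dataset-2", "project-a"))
    print([a.verb for a in stream.for_object("dataset-1")])
    print(len(stream.for_project("project-a")))
```

A centralised service would replace the list with persistent storage and expose `record` and the two queries over HTTP, but the shape of the data would be much the same.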

In this video interview, Nick Jackson explains the concept in more detail and outlines their progress during the event, which included building a working API.

This video is available on Vimeo.

SWORD 2

This group decided that the problem with SWORD 2 and big data is the resumption problem that is fundamental to HTTP. They discussed how they might send a SWORD request asking the server to get content via some other mechanism, such as a BitTorrent client, FTP or Dropbox. Discussion with the wider group generated positive feedback about BitTorrent as a good route to handle big data. The group experimented with this during the event to test their reasoning.

In this video interview, Damian Steer outlines the progress made by the group and the issues

This video is available on Vimeo.

Damian’s presentation can be found here: http://www.slideshare.net/shellac/sword2-and-bittorrent

The issue of how to handle big data was discussed by several overlapping groups. In this interview, Jon Besson and Dan Small reflect on their connected discussions in this area…

This video is available on Vimeo.

Academic Dropbox

Also connected with the issue of big data, a separate group discussed the potential of an academic Dropbox, using a client rather than a server-based pool approach. They explored a number of tools, including SparkleShare, and documented their survey of the issues in a series of blog posts.

In this video interview Joss Winn and Jez Cope reflect on some of these issues in more detail…

This video is available on Vimeo.

Metadata for Datasets

This group chose to explore existing metadata schemas to identify the minimum number of elements needed in a schema to accompany data transferred between repositories. They highlighted a potential use case involving a researcher who makes a deposit into a subject repository. From an institutional perspective it will be useful to know about all research outputs, so a basic common schema would allow information about the deposit to be shared between the subject repository, the institutional repository, and any other interested repository, such as the British Library. They also noted that this may be useful if the data is held in more than one place, helping to make it clear where the citable data is held and which versions are copies.
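A minimal record of this kind might look like the Python sketch below. The group's actual schema is not given in this report, so the field names here are assumptions, and the `is_citable_copy` flag is simply one hypothetical way of marking which of several copies holds the citable data.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class DatasetRecord:
    identifier: str        # persistent identifier shared by every copy
    title: str
    creator: str
    date: str              # date of deposit, as an ISO 8601 string
    location: str          # repository holding this particular copy
    is_citable_copy: bool  # True for the copy that should be cited


def to_json(record):
    """Serialise a record so it can be passed between repositories."""
    return json.dumps(asdict(record), sort_keys=True)


def citable_location(records):
    """Return the location of the citable copy, or None if none is marked."""
    for record in records:
        if record.is_citable_copy:
            return record.location
    return None


if __name__ == "__main__":
    # Placeholder values: one deposit in a subject repository, one copy elsewhere.
    original = DatasetRecord("example-id", "Example dataset", "A. Researcher",
                             "2012-05-04", "subject repository", True)
    copy = DatasetRecord("example-id", "Example dataset", "A. Researcher",
                         "2012-05-04", "institutional repository", False)
    print(to_json(copy))
    print(citable_location([copy, original]))
```

Keeping the record this small is the point of the exercise: each repository can hold richer metadata of its own, while only the shared minimum travels with the data.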

The group speculated that an extension of this work could allow people to “follow” a particular researcher in a social media style.

In this video interview, Brian Hole describes the progress they made during the hack event and how they see this developing in the future…

This video is available on Vimeo.

Other Work

During the event there were a number of discussions about issues associated with identifiers. Whilst these did not lead to a working project group, they covered useful ground and led to solutions to some of the specific problems participants brought to the hack event with them.

In this video interview, Gerard Devine from the PIMMS project describes one such outcome…

This video is available on Vimeo.

Alex Wolton from the University of Essex also discusses the progress he made on some complex issues associated with his project during the event…

This video is available on Vimeo.

Final Outcomes

Activity Data Stream Group (Rainbow Beam)

Nick Jackson, Julian Cheal, Harry Newton, Nick Syrotiuk

This video is available on Vimeo.

Bit Torrent Group

Sander van der Waal, Damian Steer, Tim Brody, Steve Wellburn

This video is available on Vimeo.

Metadata Group (URMe)

Brian Hole, Carlos Silva, Alex Ball, Thomas Parsons, John Kaye, John Bottomley, John Nicholls, Lindsay Wood, Asif Akram

This video is available on Vimeo.

The Prezi to accompany this presentation is available here.

Academic Dropbox

Joss Winn and Jez Cope

Joss and Jez produced several blog posts documenting the issues they researched during the event:

MRD Hack Days: File backup, sync and versioning, or “The Academic Dropbox” by Jez Cope
Shared, versioned network drives by Joss Winn

Conclusions

One of the key outcomes from the event was a consensus about the need for a different paradigm to deal with moving and managing big data, compared to smaller data sets or multiple small data sets. Exploring these issues and identifying where projects and institutions are encountering similar issues proved to be one of the most useful outcomes for all participants.

Participant Responses

A number of participants blogged about this event from their own perspectives:

What do you call a group of data managers? by Jon Besson
Managing Research Data Hackday: 3/4 May by Malcolm Newbury

Hacking Research Metadata

kpitkin — Tue, 24 Apr 2012 09:11:02 +0000

In this guest post, Alex Ball from UKOLN gives a preview of some of the issues he hopes will be explored at our forthcoming Managing Research Data Hack Days and describes his own work in the field, which could form the basis for the development of some useful MRD tools.
________________

With research funders putting increasing pressure on institutions to manage and expose their research data better, I guess more and more will be thinking about the schemas and systems they use for handling information about their research data. Ideally, institutions should be able to import such metadata from the data centres used by their researchers, collect metadata for locally held data, use the metadata for local discovery and reporting purposes, and export metadata to DataCite when they (the institutions) come around to minting DOIs. At the forthcoming MRD Hack Day it would be great to see some tools developed to help with those tasks, and I would be particularly pleased if some work I did on research metadata could provide some kind of basis for them.

A few years ago, JISC asked me to look into the feasibility of writing an application profile for describing scientific data. This was back when the Scholarly Works Application Profile (SWAP) was near the top of its hype curve, and application profiles based around generic resource types seemed very attractive (these days, they don’t). One problem was that the label ‘scientific data’ doesn’t pick out a coherent set of resources. In other words, a thing which is ‘scientific data’ can have more in common with something that isn’t (e.g. humanities data), than with something else that is. The other, more pressing problem was the ‘application’ bit, always a good thing to consider when designing an application profile. The application evoked by ‘scientific data’ is, naturally, doing science, and the things a crystallographer needs to know in order to use diffraction data are quite different to the things an astrophysicist needs in order to use spectral data.

In order to get anywhere, I had to think more generically on both counts. So what I ended up scoping was an application profile for a hypothetical research data catalogue, one that might be used by an institutional data repository or a national cross-search service. To see if this was in any way feasible, I looked at the metadata already used by (UK) data centres in their catalogues and compared them. The results were actually quite encouraging: I found 33 metadata elements that occurred in at least 3 of the 15 schemes to which I had access. Of these elements, one third occurred in 12 or more schemes: these were things like dataset name, date and identifier, agent, rights/restrictions, summary/description, dataset type and location. For the full list, see the scoping study report or the summary presentation.
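The comparison described above amounts to counting, for each metadata element, how many schemes include it. A small Python sketch of that counting step follows; the three toy schemes and their element names are invented for illustration and are not the 15 schemes examined in the scoping study.

```python
from collections import Counter


def common_elements(schemes, min_schemes):
    """Return, sorted, the elements that occur in at least `min_schemes` schemes."""
    counts = Counter()
    for elements in schemes.values():
        # Use a set so an element repeated within one scheme is counted once.
        counts.update(set(elements))
    return sorted(name for name, n in counts.items() if n >= min_schemes)


if __name__ == "__main__":
    # Toy data only: scheme name -> elements it defines.
    schemes = {
        "scheme A": ["title", "date", "identifier"],
        "scheme B": ["title", "spatial extent"],
        "scheme C": ["title", "date", "rights"],
    }
    print(common_elements(schemes, 2))
    print(common_elements(schemes, 3))
```

In practice the harder part is deciding when two differently named elements mean the same thing; the counting itself only becomes possible once the elements have been normalised to common names.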

In the end, JISC decided not to go ahead with creating the application profile formally, but the work did feed into the discussions that would eventually result in the DataCite Metadata Schema. This schema underlies a search service across all datasets that have been given a DOI, and are in that sense published. Indeed, I found I could map the whole of version 2.2 of that schema (bar one element of internal administrative metadata) to 15 of the elements I’d identified. The most notable elements from my list that the DataCite schema is missing are those relating to spatiotemporal extent, which is important for environmental data for example.

Even though the scoping study did not produce a deployment-ready application profile for research metadata, I hope there’s enough in the report to indicate what one should look like and how it might be used in interoperation with data centres, DataCite and other repositories. If the MRD Hack Day can get some of that interoperation working, I know a lot of people will be made very happy, myself included.

This may look messy, but it shows there's hope

____________

There is still time to sign up to attend the MRD Hack Day. Full details about the event and the booking form are available here. We are also looking for ideas that developers could tackle during the event. If you have an idea you would like to see worked on during the event, please post it here.
