Patrick McSweeney from the University of Southampton presented his idea for repositories as data engines at the DevCSI Open Repositories 2012 Developer Challenge.
In his original entry, Patrick gave the following description of his idea:

My idea is a set of tools which turn a repository into a data management and visualisation suite with a simple provenance model.

The suite provides conversion tools to convert scientific data from lab equipment into CSV. CSV can be loaded into a temporary database, where queries can be run through a web front end to create derived CSVs from existing data. The repository catalogs how files were derived from source data and what queries were used to do this. CSV can be visualised in the suite using a range of visualisation tools, including D3.js (example visualisations).
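The load, query and catalog steps described above can be sketched roughly as follows. This is a minimal illustration, not Patrick's implementation: it assumes an in-memory SQLite database stands in for the temporary database, and the sample data, the `readings` table name, the helper functions and the shape of the provenance record are all hypothetical.

```python
import csv
import hashlib
import io
import sqlite3


def load_csv(conn, table, csv_text):
    """Load CSV text into a new table; the first row gives the column names."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    cols = ", ".join('"{}"'.format(name) for name in header)
    marks = ", ".join("?" for _ in header)
    conn.execute('CREATE TABLE "{}" ({})'.format(table, cols))
    conn.executemany('INSERT INTO "{}" VALUES ({})'.format(table, marks), data)


def derive_csv(conn, query):
    """Run a query and return the result as CSV text, header row first."""
    cur = conn.execute(query)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([d[0] for d in cur.description])
    writer.writerows(cur.fetchall())
    return out.getvalue()


def sha256(text):
    """Fingerprint a file's contents so the record can identify it later."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hypothetical source data, as it might look after conversion from lab equipment.
source = "sample,load_n,extension_mm\nA,10,0.5\nA,20,1.1\nB,10,0.4\n"
query = (
    "SELECT sample, COUNT(*) AS readings "
    "FROM readings GROUP BY sample ORDER BY sample"
)

conn = sqlite3.connect(":memory:")  # temporary database, discarded on exit
load_csv(conn, "readings", source)
derived = derive_csv(conn, query)

# Assumed provenance record: which source, which query, which derived file.
provenance = {
    "source_sha256": sha256(source),
    "query": query,
    "derived_sha256": sha256(derived),
}

print(derived)
print(provenance)
```

Storing the query text alongside hashes of the source and derived files is one simple way to record how a derived CSV came about; a real repository would presumably link to deposited items rather than to raw hashes.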

I chose CSV because researchers are comfortable with it as a format. It can be imported by the key data analysis tools: Excel, R, Matlab and Python. It is an open, text-based format which can easily be compressed for long-term storage and has low preservation risk. If it is later decided that the data is to be converted into a semantic format, CSV can easily be converted into RDF.

I am addressing this problem because of difficulties faced by my friend David Mills over the course of his PhD. It echoes Jim Gray's words about the fourth paradigm, quoted below (licence CC-BY-SA):

Researchers are using many different methods to collect or generate data—from sensors and CCDs to supercomputers and particle colliders. When the data finally shows up in your computer, what do you do with all this information that is now in your digital shoebox? People are continually seeking me out and saying, “Help! I’ve got all this data. What am I supposed to do with it? My Excel spreadsheets are getting out of hand!”

The suggestion that I have been making is that we now have terrible data management tools for most of the science disciplines. Commercial organisations like Walmart can afford to build their own data management software, but in science we do not have that luxury. At present, we have hardly any data visualisation and analysis tools. Some research communities use MATLAB, for example, but the funding agencies in the U.S. and elsewhere need to do a lot more to foster the building of tools to make scientists more productive. When you go and look at what scientists are doing, day in and day out, in terms of data analysis, it is truly dreadful. And I suspect that many of you are in the same state that I am in, where essentially the only tools I have at my disposal are MATLAB and Excel!


Developer Interview


Patrick gave us a quick interview about his pitch and how he believes the idea could change the world.

Judges' Comments


Patrick McSweeney’s entry tackles research data management by bringing useful tools for data wrangling and visualisation into the repository.

Patrick picked the challenges facing one PhD student in Engineering to illustrate a prevalent problem: a lack of generic tools for managing tabular data.

We particularly like the way the system allows a user to set up re-usable workflows. This is a further development of recent work on extending the repository ‘upstream’ into the research lifecycle.

We agreed with the audience that this is showing us something cool and new in repositories. This idea is clearly tractable, as some implementation work has been done towards embedding tools into the EPrints platform. In further developing this work, it would be great to see Patrick collaborating with others working on similar approaches on other platforms, so that workflows and visualisations can be exchanged between researchers working on different platforms.


Victory Interview

Patrick reflected on winning the Developer Challenge and appealed for collaborators to help take the project forward in this short interview…

Further Development

Are you interested in collaborating with Patrick or discussing how this idea could be taken further?

Please leave a comment on this page.

  1. What better way to advance this tool than via more collaboration, especially internationally?! As most in the community know, Australia is spending significant funding on changing the ‘data conversation’ down under so that universities will begin to realise the value they can achieve by working with researchers' data. Of course, this is an international issue, and so Pat needs to come down here so we can get more consensus and ideas internationally. Just as world-class researchers are working internationally, so must developers work internationally to push forward the data solutions! Well done Pat, looking forward to having you down for the #eRes2012 conference :)

