DevCSI Open Repositories Developer Challenge 2012
Enter Your Idea Here
To enter the DevCSI Open Repositories Developer Challenge, please add your initial idea as a comment on this page by Tuesday 10th July. Please include the following information:
- The name of the team/individual
- The names of the individuals and where they work
- A description of the idea and how you will implement it technically. Remember, it is not a requirement to produce working code (unless you really want to!)
- How long it would take to code, if you haven’t already written the code.
- A URI to some more information about the entry if possible
Please note: Anyone can post ideas here, but entries will only count if the entrant is able to present the idea live at Open Repositories 2012. The deadline for initial entries is: Tuesday 10th July 2012.
This is a pitch from a repository administrator rather than a developer: an idea for a widget/dashboard for an EPrints repository [which may be part of MePrints or a wholly separate plugin] which could pull together information for an author/depositor about their publication profile in the repository, ideally visually (a graph?): for instance, the total number of papers, the percentage of particular types of paper, a breakdown by year, and an indication of the other parts of the institution they co-author with most. It could also pull together available information such as access and full-text downloads (or identify where there are full-text gaps) as well as citation data. Much of this data already exists, but being able to unify it and display it visually would be very useful and interesting for authors.
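For developers picking this up, a hypothetical sketch of the aggregation behind such a widget, assuming records exported from EPrints as a list of dicts (the field names here are illustrative, not EPrints' actual export schema):

```python
# Hypothetical sketch: aggregate an author's records for a profile widget.
# Assumes records exported from the repository as dicts; field names are
# illustrative assumptions, not the real EPrints schema.
from collections import Counter

def profile_stats(records):
    """Summarise a depositor's publication profile."""
    by_year = Counter(r.get("date", "unknown")[:4] for r in records)
    by_type = Counter(r.get("type", "other") for r in records)
    full_text_gaps = [r["eprintid"] for r in records if not r.get("documents")]
    total = len(records)
    pct_by_type = {t: 100.0 * n / total for t, n in by_type.items()} if total else {}
    return {"total": total, "by_year": dict(by_year),
            "pct_by_type": pct_by_type, "full_text_gaps": full_text_gaps}
```

The returned dictionary is exactly the shape a charting library would want for the "breakdown by year" and "percentage by type" graphs.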
What about a widget to give users the code to more easily embed an object (image, pdf, audio or video — would have to be a recognized file format or mime type) in their own web page? Give the user options for including limited metadata in that embedded view?
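For illustration, a minimal (hypothetical) generator for such embed code, keyed on MIME type; the URL handling and the optional metadata footer are assumptions:

```python
# Illustrative only: build an embed snippet for a repository object based on
# its MIME type. URLs and the metadata-footer option are assumptions.
def embed_snippet(object_url, mime_type, title=None, width=480, height=360):
    if mime_type.startswith("image/"):
        code = f'<img src="{object_url}" alt="{title or ""}" width="{width}">'
    elif mime_type.startswith("audio/"):
        code = f'<audio controls src="{object_url}"></audio>'
    elif mime_type.startswith("video/"):
        code = (f'<video controls width="{width}" height="{height}" '
                f'src="{object_url}"></video>')
    elif mime_type == "application/pdf":
        code = f'<iframe src="{object_url}" width="{width}" height="{height}"></iframe>'
    else:
        raise ValueError(f"unsupported format: {mime_type}")
    if title:  # optional limited metadata under the embedded view
        code += f'<p class="repo-embed-meta">{title}</p>'
    return code
```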
What might be possible to support a ‘drag and drop’ interface from a researcher’s desktop to our repositories? (This idea may not be completely new, but possibly worth re-visiting with current tools?)
Look at the DepositMO and DepositMORE projects
Two ideas, for others to play with (and code as they feel they can):
- In the light of the Finch Report, a mechanism whereby deposit into a repository triggers a workflow for submitting and paying for Gold OA. In other words, bringing the repository into the Gold OA process and adding value to this.
- Picking up on William’s dashboard idea, but applying this to data, such that a display presents information on what data is stored, where it is stored (local, off-site, cloud, etc.) and relevant tasks/workflow steps required to guide ongoing management.
We would like to pitch the idea of facilitating the recording of research output by building a bridge between a Discovery Interface (DI) and a Current Research Information System (CRIS). Assume for the following that (1) researchers consider recording research output tedious and unnecessarily complicated, and (2) the bulk of the metadata required to fill in a record in the CRIS will at some point be accessible in a DI (at least if we restrict ourselves to academic papers and books). We would like to exploit the second assumption in order to invalidate the first, by enabling either the researcher herself or an assistant to create a new record in the CRIS based on metadata retrieved from the DI. An example might clarify the idea:
– A researcher publishes a paper in a journal
– Metadata about the paper appears in the index for a DI
– An automated query (e.g. for department or author names) alerts a librarian of this newly published, but yet unrecorded, paper
– The librarian injects metadata about the paper into the CRIS and completes the record
This idea fulfills two goals. It
– relieves the researcher of the (perceived) tedious process of recording research output
– facilitates more complete and more uniform metadata in the CRIS, lessening the burden of post-validation and quality assurance of CRIS records
Implementation-wise this idea requires a few supporting tools:
– the DI should expose metadata in a standard format
– the CRIS should allow record creation based on said format
– the CRIS should expose recorded records in said format
So, in short, the implementation is (a sketch follows the list):
– an interface for managing automated queries against the DI
– an interface for managing found records, deciding which to inject into the CRIS
– a piece of code that brings metadata from the DI to the CRIS, possibly enriching it with local data, possibly interacting with external web services
– an (optional) interface for researchers to submit un-discoverable research output, review recorded records, and collaborate with librarians during the recording process
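A minimal sketch of the bridge piece, assuming the DI exposes a JSON search API and the CRIS accepts record creation over HTTP; both endpoints, the token scheme, and all field names are placeholders:

```python
# A minimal sketch of the DI-to-CRIS bridge. The DI search endpoint, the CRIS
# records endpoint, and all field names are hypothetical.
import requests

DI_SEARCH = "https://di.example.org/api/search"        # assumed DI endpoint
CRIS_RECORDS = "https://cris.example.org/api/records"  # assumed CRIS endpoint

def find_unrecorded(query, known_ids):
    """Run an automated query against the DI and return unseen records."""
    hits = requests.get(DI_SEARCH, params={"q": query}).json()["results"]
    return [h for h in hits if h["id"] not in known_ids]

def inject_into_cris(record, token):
    """Create a CRIS record from DI metadata, possibly enriched locally."""
    payload = {
        "title": record["title"],
        "authors": record["authors"],
        "source": {"di_id": record["id"]},  # provenance back to the DI
        "status": "pending-review",         # a librarian completes the record
    }
    r = requests.post(CRIS_RECORDS, json=payload,
                      headers={"Authorization": f"Bearer {token}"})
    r.raise_for_status()
    return r.json()["record_id"]
```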
A functional survey of access to the published literature
Cameron Neylon, Ben O’Steen, anyone else who is interested.
Outline: We spend a lot of time arguing over whether people have access, should have access, or would have access if they knew how to get it. Why don’t we actually just find out whether people really do have access to the published literature from where they sit when they’re doing their work? By carrying out a survey in which we functionally check whether a human being thinks they have access to a given work, we can look at how access, and its lack, affects the daily work of people interested in research. This will provide a dataset on access that can be used to support policy development and further technical work.
Design: Crossref has recently released a beta API that allows the generation of a random DOI within a given date and journal range. We will build a corpus of around ten thousand random DOIs obtained by taking samples from a randomly distributed set of small date ranges over the past five years. We will record the DOI, date of release, and other bibliographic metadata. This is our test set.
We will build a website that allows a survey participant to enter and provide their location or affiliation. The IP range from which the user originates will also be recorded (to test whether they are within an institutional IP range that corresponds to the claimed location/affiliation). The participant will then be presented with an embedded frame in which the site will attempt to resolve the DOI. The user will be asked whether they see a null result or 404, an abstract, a request for payment, or the full text of the paper. Optionally, when a participant indicates that they cannot access the full text, we might attempt to identify an archived version of the paper in an institutional or disciplinary repository.
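A minimal sketch of the survey back end, assuming Flask and SQLite; the schema and the four outcome categories mirror the design above, but all names are illustrative:

```python
# A minimal sketch of the survey back end, assuming Flask and SQLite.
import sqlite3
from flask import Flask, request, jsonify

app = Flask(__name__)
DB = "access_survey.db"

def init_db():
    with sqlite3.connect(DB) as c:
        c.execute("""CREATE TABLE IF NOT EXISTS responses (
                       doi TEXT, affiliation TEXT, ip TEXT,
                       outcome TEXT CHECK (outcome IN
                         ('not-found', 'abstract', 'paywall', 'full-text')))""")

@app.route("/response", methods=["POST"])
def record_response():
    data = request.get_json()
    with sqlite3.connect(DB) as c:
        c.execute("INSERT INTO responses VALUES (?, ?, ?, ?)",
                  (data["doi"], data["affiliation"],
                   request.remote_addr,  # compare IP range to claimed affiliation
                   data["outcome"]))
    return jsonify(status="recorded")

if __name__ == "__main__":
    init_db()
    app.run()
```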
Results: An ongoing survey that can monitor degrees of and changes in access to the published literature. A data corpus (to be released under ccZero) that will enable detailed analysis of access by demographics, location, and institution and thus provide a coherent and valuable evidence base for the development of policy and technical development.
My idea is a set of tools which turn a repository into a data management and visualisation suite with a simple provenance model.
The suite provides conversion tools to convert scientific data from lab equipment into CSV. CSV can be loaded into a temporary database where queries can be run through a web front end to create derived CSVs from existing data. The repository catalogues how files were derived from source data and what queries were used to do this. CSV can be visualised in the suite using a range of visualisation tools including D3.js (example visualisations).
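A sketch of the derive-and-record step, using SQLite as the temporary database and a JSON-lines file standing in for the repository's provenance catalogue; all names are illustrative:

```python
# Sketch: load a source CSV into a temporary SQLite table, run the user's
# query, write the derived CSV, and record the provenance of the derivation.
import csv, json, sqlite3, datetime

def derive_csv(source_csv, query, table, derived_csv, prov_log="provenance.jsonl"):
    con = sqlite3.connect(":memory:")
    with open(source_csv) as f:
        rows = list(csv.reader(f))
    header, data = rows[0], rows[1:]
    cols = ", ".join(f'"{h}"' for h in header)
    con.execute(f'CREATE TABLE "{table}" ({cols})')
    con.executemany(f'INSERT INTO "{table}" VALUES ({",".join("?" * len(header))})',
                    data)

    cur = con.execute(query)  # the query submitted through the web front end
    with open(derived_csv, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow([d[0] for d in cur.description])
        w.writerows(cur)

    # Record how the derived file was produced, for the repository catalogue.
    with open(prov_log, "a") as f:
        f.write(json.dumps({"derived": derived_csv, "source": source_csv,
                            "query": query,
                            "when": datetime.datetime.utcnow().isoformat()}) + "\n")
```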
I chose CSV because researchers are comfortable with it as a format. It can be imported by the key data analysis tools: Excel, R, MATLAB and Python. It is an open, text-based format which can easily be compressed for long-term storage and has low preservation risk. If it is later decided that the data should be converted into a semantic format, CSV can easily be converted into RDF.
I am addressing this problem because of difficulties faced by my friend David Mills over the course of his PhD. It echoes Jim Gray’s words about the fourth paradigm, quoted below (licence CC-BY-SA):
I often despair when looking at the backend code of most repositories. Versioning is a central concept in most preservation services. The IT world has made great leaps in source code versioning systems since the nineties, but the repositories (well, Fedora) have not made use of these improvements. It is an illusion to think that we can, with our limited budgets, build better versioning systems than SVN and Git.
My idea, which I hope will be a prototype for the presentation, is a repository with a Fedora-like interface, based on an SVN server. This will have the advantage that you will be able to check out and commit a number of objects as one transaction, something that is not currently possible in Fedora.
For mass processing, you could check out the entire repository and put it in the mass-processing working storage. This way, your mass-processing jobs could access and modify this content without having to go through your repository. You would then be able to ingest the entire mass-processing result as one step.
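A sketch of what the multi-object transaction could look like from the client side, assuming a working copy checked out from the SVN server that backs the repository (the helper and paths are illustrative; the svn commands themselves are standard):

```python
# Sketch: commit changes to several repository objects as one SVN revision.
import subprocess

def commit_objects(working_copy, object_paths, message):
    """Update, then commit several object files as a single atomic revision."""
    subprocess.run(["svn", "update", working_copy], check=True)
    # ... the caller edits the object files under working_copy here ...
    subprocess.run(["svn", "commit", "-m", message, *object_paths],
                   cwd=working_copy, check=True)  # one atomic transaction

# Usage: commit_objects("/data/wc", ["obj1/foxml.xml", "obj2/foxml.xml"],
#                       "Ingest batch as a single transaction")
```

Because an SVN commit of multiple paths produces exactly one revision, the "many objects, one transaction" property falls out of the version-control layer for free.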
And for the required fields:
- Name: Asger Askov Blekinge
- Organisation: State and University Library, Denmark
- Fedora Committer
- I do not yet have the working code, just the unworking code. I expect 2-3 days of work would produce something that other people could use and improve.
Collaborators: Vojtech Robotka (KMi, Open University)
We would like to pitch the idea of a mobile application that would allow people to search and explore full-text resources stored across Open Access repositories. The full text of each resource can be downloaded to the mobile device (phone or tablet) for off-line reading; for example, it can be added to the iBooks application on iOS devices or viewed in Acrobat Reader on Android devices. Since all resources come from Open Access repositories, the service doesn’t require any subscription and is completely free of charge. The service benefits both consumers of content (researchers, students, etc.) and the repositories themselves. After setting up a repository and registering it with this service, the repository content is immediately available from mobile devices, with no need to install any code on the repository side.
The application should be available for the two main mobile operating systems (Android and iOS) and should work on both tablets and smartphones. Because iOS applications are developed in Objective-C and Android applications in Java, it is necessary to develop two clients and register them in the App Store and Google Play (formerly the Android Market). These clients need to be supported by a server application that provides a seamless layer over Open Access repository content and exposes an API for the mobile clients.
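As a sketch of that shared server layer (the in-memory list is a stand-in for the real harvested aggregation, and all field names are assumptions):

```python
# Sketch of the single API consumed by both the iOS and Android clients.
from flask import Flask, request, jsonify

app = Flask(__name__)

# Stand-in for the harvested Open Access aggregation.
RECORDS = [
    {"title": "An example preprint", "repository": "repo.example.org",
     "pdf_url": "https://repo.example.org/123/1/paper.pdf"},
]

@app.route("/api/search")
def search():
    q = request.args.get("q", "").lower()
    hits = [r for r in RECORDS if q in r["title"].lower()]
    return jsonify(hits)  # clients download pdf_url for offline reading

if __name__ == "__main__":
    app.run()
```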
I was originally imagining an electronic device to be used by historians, archivists, and others who work with audio transcripts. The device records an audio file, and then deposits it into a repository, where it is automatically transcribed.
The early thoughts on the idea are here:
http://maroonedlibrarian.wordpress.com/2012/07/10/improving-access-and-findability-by-integrating-repository-curation-tasks-with-cloud-based-audio-and-video-transcription/
We are in the lounge trying to refine the idea. The latest variation is a system including a smartphone app to record audio. The audio is deposited, along with a transcript, into a private workspace in a user-configured repository.
Repository Analytics Dashboard
I would like to propose an idea for a Repository Analytics dashboard. The dashboard would list Open Access repositories and would provide useful statistical information as well as information about potential issues with repository OAI-PMH endpoints, such as unavailability of the endpoint, metadata interoperability issues, or even full-text content harvesting issues. The primary users would be repository managers. The service would help them ensure that their repository is harvestable. It would also allow them to see how they compare to other repositories in terms of available metadata, content, and compliance with metadata standards, and could potentially even provide data usage statistics. The second user group that can make use of the Repository Analytics dashboard is business intelligence: the tool would provide an overall picture of the content stored across OA repositories, with content growth statistics, usage trends, citation statistics at a repository level, etc.
The tool would be developed as a web-based dashboard using data from a large OA repository content aggregation system, such as CORE.
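A sketch of the endpoint health check component, using the standard OAI-PMH Identify verb; the checks shown are only a small subset of what the dashboard would report:

```python
# Sketch: probe a repository's OAI-PMH endpoint and report basic health.
import requests
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def check_endpoint(base_url):
    report = {"base_url": base_url}
    try:
        r = requests.get(base_url, params={"verb": "Identify"}, timeout=30)
        r.raise_for_status()
        root = ET.fromstring(r.content)
        report["reachable"] = True
        # Protocol-level errors, e.g. badVerb, come back as <error> elements.
        report["oai_errors"] = [e.get("code") for e in root.iter(f"{OAI_NS}error")]
        ident = root.find(f"{OAI_NS}Identify")
        if ident is not None:
            report["repository_name"] = ident.findtext(f"{OAI_NS}repositoryName")
    except (requests.RequestException, ET.ParseError) as exc:
        report["reachable"] = False
        report["problem"] = str(exc)
    return report
```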
My two ideas (for others to use and profit from, attribution required) are:
1) A way that a digital repository can easily ingest a copy of a particular version of a piece of software that is stored in a public code repository (e.g. GitHub, SourceForge), in such a way that common metadata like copyright owner, license, contributors, and dependencies are correctly recorded (a sketch follows below).
2) A way that a digital repository that has ingested a copy of a piece of software can run provided tests to assess the point when the software becomes obsolescent.
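For idea (1), a hedged sketch against the GitHub REST API (the endpoints shown are real; dependency extraction, which would need language-specific manifests such as setup.py or pom.xml, is left out):

```python
# Sketch: collect common metadata for ingesting one version of a GitHub repo.
import requests

API = "https://api.github.com/repos"

def software_metadata(owner, repo, ref):
    info = requests.get(f"{API}/{owner}/{repo}").json()
    contributors = requests.get(f"{API}/{owner}/{repo}/contributors").json()
    return {
        "name": info["name"],
        "description": info.get("description"),
        "license": (info.get("license") or {}).get("spdx_id"),
        "contributors": [c["login"] for c in contributors],
        "version": ref,
        # Archive exactly this version alongside the metadata record.
        "package_url": f"{API}/{owner}/{repo}/tarball/{ref}",
    }
```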
If you’d like to discuss further, reach me @npch, sadly I can’t physically attend the developers lounge very much due to other presentations…
Idea: adding FEC (forward error correction) codes to the persistence layer in digital preservation, archival, and storage systems (to add a redundant array of inexpensive network devices to systems such as iRODS, Fedora Commons, Hydra?). The idea is to have RAID at the network level for the back end of a repository system.
Background: Storing data securely by replication at the network level is effective, but it can be expensive as the system scales. Unless the operator can buy enough disk in bulk to get a bulk discount, costs grow quickly.
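As a sketch of the idea, using zfec (the erasure-coding library behind Tahoe-LAFS): any k of the m shares, each stored on a different cheap network device, suffice to reconstruct the object. Exact API details should be checked against the zfec documentation:

```python
# Sketch: erasure-code an archival object into m shares, k needed to recover.
from zfec import easyfec

k, m = 3, 5                       # any 3 of the 5 shares recover the object
enc = easyfec.Encoder(k, m)
dec = easyfec.Decoder(k, m)

data = b"bytes of an archival object"
shares = enc.encode(data)         # m shares to spread over m network devices

# Lose up to m-k shares; recover from any k of them (here shares 0, 2, 4).
padlen = (-len(data)) % k         # easyfec pads data to a multiple of k
recovered = dec.decode([shares[0], shares[2], shares[4]], [0, 2, 4], padlen)
assert recovered == data
```

Compared with full replication, this stores roughly m/k times the data instead of m times it, which is where the cost saving comes from.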
Please note, I did not spend much time writing up the above idea and have no estimate of the time involved in retrofitting such functionality into existing systems.
As Cameron Neylon said in his opening keynote, annotation enhances the network for research. Annotations create links between resources, and at the same time can provide post-publication peer review to enable demand-side filtering. The difficulty is for this to be implemented in a distributed and interoperable fashion. Enter the W3C community group on Open Annotation: http://www.w3.org/community/openannotation/
The challenge idea is to implement the Open Annotation model specifically for post-publication peer review of research outputs on an appropriate collection, such as the PLoS journals. Come find me if you’d like to contribute, some code already exists as a head start!
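For a flavour of the model, here is one post-publication review expressed as an Open Annotation, built as JSON-LD in Python; all URIs other than the oa: and dc: namespaces are placeholders:

```python
# Sketch: a post-publication review as an Open Annotation (JSON-LD).
import json

annotation = {
    "@context": {"oa": "http://www.w3.org/ns/oa#",
                 "dc": "http://purl.org/dc/elements/1.1/"},
    "@id": "http://example.org/annotations/1",          # placeholder
    "@type": "oa:Annotation",
    "oa:motivatedBy": "oa:commenting",                  # post-publication review
    "oa:hasBody": {
        "@id": "http://example.org/reviews/1",          # placeholder review text
        "dc:format": "text/plain",
    },
    # The reviewed article, e.g. a PLoS paper; the DOI below is a placeholder.
    "oa:hasTarget": {"@id": "http://dx.doi.org/10.1371/journal.pone.0000000"},
}

print(json.dumps(annotation, indent=2))
```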
Sounds interesting. Would you consider a ranking system (for reputation) of reviewers (of different roles, backgrounds, etc.)? I think that is something that is lacking right now in this idea.
A full reputation system would be very useful in this area, but it could be as simple (given the time frame of the challenge) as just a +1/like tag with associated comment, where each user can only “vote” once. Happy to discuss either way
The name of the team:
ICM CEON developers
The names of the individuals and where they work
Tomasz Rosiek
Łukasz Wasilewski
Wojciech Sylwestrzak
Jakub Jurkiewicz
Idea: To define a common architecture for digital content storage applications and provide an open software platform (environment) that facilitates the creation of such applications. It consists of a service integration framework and a set of versatile building blocks providing common functionality required in repository applications, among others: storage, relational and full-text indexing, storage of user-created data, acquisition of data from external systems, and enrichment of the data.
Its loosely-coupled service-oriented architecture enables deployment of highly scalable, distributed systems including digital libraries, multimedia delivery services, content management systems, and document management systems.
The platform consists of the following components (an illustrative sketch follows the list):
- SOA-oriented service integration infrastructure based on Java and the Spring Framework
- A set of APIs and implementations of generic core services: Content Storage, Annotation Storage, Relational Index, Full-text Index, Process Manager
- ProcessManager: a framework for defining and starting indexing and enrichment processes
- Data Acquisition Module: a framework for creating data acquisition processes that import data from legacy systems
- Reference applications based on the platform, including the Infona Portal and a standard OAI-PMH server
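This is not the platform's actual API (which is Java/Spring); purely as a toy illustration of the loosely-coupled building-block idea, where components are registered and looked up by interface rather than wired together directly:

```python
# Toy illustration only: building blocks registered behind interfaces.
class ServiceRegistry:
    def __init__(self):
        self._services = {}

    def register(self, interface, implementation):
        self._services[interface] = implementation

    def lookup(self, interface):
        return self._services[interface]

registry = ServiceRegistry()
registry.register("ContentStorage", object())  # e.g. a filesystem-backed store
registry.register("FulltextIndex", object())   # e.g. a search-engine index
storage = registry.lookup("ContentStorage")    # callers never name the impl
```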
How long it would take to code
Most of the essential functionality of the platform is already implemented; however, documentation, code cleanup, and optimisation are still outstanding. The first production-ready deployment of the platform is expected in the first half of 2013.
Demo:
http://yadda2-demo.vls.icm.edu.pl/demo-portal
Collaborators:
Joonas Kesäniemi, University of Helsinki
Kevin Van de Velde, @mire
A service API that provides ontology-optimized query expansion for URIs.
The idea is demonstrated using a temporal ontology that models the evolution of a university’s faculties over time. Faculties are established, merged, split, and renamed over the course of history. All this data is captured in an ontology.
The service imports data from a triple store. How the triple store is populated is out of scope for the service; one possible solution is to use something like Semantic MediaWiki for collaborative ontology creation.
In our example, the service takes in information about the changes in faculties and transforms that into an efficient index/storage for query expansion. The service provides a plugin architecture that allows one to configure different handler implementations for different ontologies (for example, SKOS vocabularies).
Usage scenario:
Faculty1 was established in 1880. Faculty2 was established in 1890. In 1910 these two faculties were merged into Faculty3. 30 years later, Faculty3 changed its name back to Faculty1. In 2012, Faculty1 was renamed to “Sponsored by Apple”. The query expansion service allows a user to query relevant historical items without having to dig into the university’s history.
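A worked (and heavily simplified) version of that scenario; in the real service this lineage index would be built from the ontology in the triple store rather than hard-coded:

```python
# Sketch: expand a faculty URI to every name/URI in its historical lineage,
# so a query for the current name also finds historical items.
SAME_LINEAGE = {
    "ex:Faculty1": {"ex:Faculty1", "ex:Faculty2", "ex:Faculty3",
                    "ex:SponsoredByApple"},
}

def expand(uri):
    """Return all URIs to OR together in the expanded query."""
    return SAME_LINEAGE.get(uri, {uri})

# A search for ex:Faculty1 also matches items deposited under Faculty2
# (merged in 1910), Faculty3 (1910-1940), and the 2012 "Sponsored by Apple".
print(expand("ex:Faculty1"))
```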
Implementation:
Indexing is done using Solr and Jena RDF tools. It is possible to implement
Background:
This service is part of a bigger picture of using and providing Linked Open Data. In addition to the systems for maintaining ontologies, as mentioned above, repositories should start creating links to resources instead of plain strings.
I believe the keynote speaker mentioned something about networks….
Repository Operation “Ram-Raid”
Replicate the contents of a repository to a remote VM to do processor- and I/O-intensive work – text-mining, image/audio/video analysis, alternate indexing – and push the new, augmented content and reports back to the host repository (via SWORD2)
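A sketch of the push-back step, assuming the Python sword2 client library; the URLs, credentials, and collection are placeholders:

```python
# Sketch: deposit the VM's augmented output back into the host repository
# over SWORD2, using the python sword2 client.
from sword2 import Connection

conn = Connection("https://repo.example.org/sword2/servicedocument",
                  user_name="vm-worker", user_pass="secret")
conn.get_service_document()

with open("augmented-content.zip", "rb") as pkg:
    receipt = conn.create(
        col_iri="https://repo.example.org/sword2/collection/enriched",
        payload=pkg.read(),
        mimetype="application/zip",
        filename="augmented-content.zip",
        packaging="http://purl.org/net/sword/package/SimpleZip")
print(receipt.location)  # where the host repository filed the new content
```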
I like this idea, we could use this at York.
Or push the data to the VM/service with SWORD2 to do online processing as part of a workflow?
The key problem this is trying to address is the lack of developers working in institutions who are capable of organising or running a workflow.
I’m anticipating a worst-case scenario of a stock repository, potentially hosted by a third party, with no real way to change anything apart from the look'n'feel.
I have a developer who writes workflow!
“Who spoke at a conference and why should I care?”
Use face detection and recognition on recordings or live video, cross-reference against lists of faces from, for example, IEEE or Ariadne, and overlay each speaker’s latest papers and research beside them on the video and in the text below.
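The detection step could start from something like the following OpenCV sketch; recognising *who* each face is would need a trained model and a labelled face set (e.g. from IEEE or Ariadne author pages), which is only hinted at here:

```python
# Sketch: detect faces in one frame of a conference recording with OpenCV's
# bundled Haar cascade; recognition and paper overlay are left as stubs.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

frame = cv2.imread("conference_frame.png")      # one frame from the recording
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    # A recognition step would match this region against known speakers and
    # overlay their latest papers and research here.
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("annotated_frame.png", frame)
```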
My entry is an idea for how to extend the scenarios in which SWORD (v2) can be used.
Using ChronoZoom (http://www.chronozoomproject.org/) for visualising repository content, especially image collections.
The ChronoZoom source is all available here: http://chronozoom.codeplex.com/
As a researcher, I know or care very little about the local repository – whether through my own lack of care or that of my institution is not relevant at the moment – let’s just accept for now that what I care about is my research, the output of my research group, and how we can better present ourselves to our peers, our funders, the public, blah blah. How can I do that? Well, I would like to be able to show – on MY web page or on MY research group web page – lots of cool information about me and my group; NOT about the local repository, but about OUR work. If it were possible to easily embed information about my publications on my web page just by virtue of having inserted them into the local repo, then I would have a reason to bother doing so. One issue with this is proof of value – of course repos want to show value by how many hits they get, but really, I want to show value by how many hits I GET (if I know anything about web analytics and alt-metrics at all, that is…). BUT, in both cases, having more links to repository content in more places would benefit both me and the repo – my page would look better and have more useful information, including links to accessible copies of my work, and the repo could measure higher click-through.
My proposed solution to this is a javascript widget that can be easily embedded by an academic on their own web pages that automatically tracks their submissions to their repo, and provides useful stats whilst also linking out to other cool sources of information. I can already demonstrate the basics of this, and would love to develop it further.
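One hypothetical way to back such a widget: a tiny JSON endpoint that the embedded JavaScript fetches and renders. Sketched here in Python/Flask, with a stub dictionary standing in for a live query against the repository:

```python
# Sketch: the server side of the embeddable publications widget.
from flask import Flask, jsonify

app = Flask(__name__)

# Stand-in for a live query against the local repository's API.
PUBLICATIONS = {
    "jsmith": [{"title": "A cool paper", "year": 2012,
                "url": "https://repo.example.ac.uk/id/eprint/42"}],
}

@app.route("/widget/<author_id>/publications.json")
def publications(author_id):
    resp = jsonify(PUBLICATIONS.get(author_id, []))
    resp.headers["Access-Control-Allow-Origin"] = "*"  # embeddable anywhere
    return resp
```

The academic's page then needs only a one-line script tag that fetches this JSON and writes the list into the page, which also gives both sides the click-through numbers mentioned above.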
(with Julie Allinson)
Use a .NET Gadgeteer to show the physical activity of the repository:
- An always-on LCD readout of the current health of the repository
- show the hits per hour on a real physical gauge, like http://www.oomlout.co.uk/analog-dial-meter-05v-72mm-p-271.html
- ring an old-fashioned bell when the ‘deposits-per-day’ target has been reached!
- blow bubbles for each deposit! (like http://bubblino.com/)
After speaking with other developers, the idea above for automated transcription has been revised. The most efficient way to develop it would be to extend last year’s SWORD mobile phone app, described here:
http://blog.stuartlewis.com/2011/06/20/android-sword-deposit-mobile-app/
Users can choose between Microsoft Research MAVIS and Amazon Mechanical Turk Transcription to automatically transcribe the audio recording.
For the general high level idea, see the image located here:
http://maroonedlibrarian.files.wordpress.com/2012/07/or2012.png
The big challenge for researchers at the moment is all about demonstrating impact and showing off research outcomes. The problem is, there are no tools to support researchers in collecting and storing evidence of impact. The activities that happen because of the research (invitations to speak, events, input to important committees, changes in policy, input to standards, new courses, business outcomes) go unrecorded. Researchers say things like “we must remember this marvellous thing for the final report”, but they actually spend all their effort producing the event or activity and none recording it for later dissemination or publicity.
I propose a smartphone app, something like Path, that allows a researcher to connect together a diary event, a place, a set of known people and a project with some descriptions (and categories from ROS). Basing this on a ubiquitous smartphone app allows the user to conveniently annotate any of their activities as “Impact Fodder” as they happen. Using a native app interface allows us to escape the standard form-filling metadata paradigm of the 1990s, and opens up the possibility of a beautiful and enjoyable user experience. Collecting impact and outcome evidence should be a constant preoccupation for a researcher – not something that happens at the end of a project.
The collected information can be easily dumped via SWORD into a repository and from thence submitted to ROS.
Path app description: http://itunes.apple.com/us/app/path/id403639508
“Repos for Kids”
Engaging and publicising repos by using a simple Netduino kit to create a repository hit counter – a meter gauge which tracks each repository deposit or hit. Also, LEDs which flash or display graphics/text when particular milestones are hit, e.g. every 1,000 hits in an hour (yes, I’m being ambitious here!) or maybe the xth deposit.
“Repos for Kids”
Using a simple Netduino kit to visualise repository deposits or hits, e.g. a simple meter gauge to show hits/deposits, or LED displays which flash up text for the nth deposit or a new hourly hit record. That kind of thing. Lots of possibilities with simple and cheap kit.
Apologies for the spamming, I blame the internet!
Splinter Repositories.
Eliminate the overhead associated with managing the high volume of deposits surrounding a conference or event. Instead of creating accounts for every single new user and educating them in how to use the software – any user wishing to contribute can simply visit the event site and click to create a splinter repository for their own personal use. This will provide them with a lightweight version of the repository that they can host themselves with zero configuration. This allows them to very quickly upload and annotate their resources at their own convenience.
At the end of the event, the main repository will automatically absorb all the splinter repositories back into itself to collect all the resources into one location.
Linked data is one strongly supported way of exposing data. The problem is that a lot of the data being exposed is not static: it changes! Over the last couple of years many suggestions have been made on how to handle non-static linked data for single URIs. In this presentation we show how, by using one such method, we can enable a “time-machine” interface for browsing linked data through time. That’s all very pretty, but it also means you can use Memento to request the data at a URI from a specific time. The next stage is to support Memento headers for SPARQL queries, enabling full historical query of data that will require interconnection and full inference. Can you say scale!
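A sketch of the single-URI case using the Memento Accept-Datetime header with the requests library; the TimeGate URL and resource URI are placeholders:

```python
# Sketch: ask a Memento TimeGate for the state of a linked-data URI at a
# given moment via datetime content negotiation.
import requests

r = requests.get(
    "https://timegate.example.org/http://data.example.org/resource/42",
    headers={"Accept-Datetime": "Thu, 01 Jul 2010 00:00:00 GMT",
             "Accept": "application/rdf+xml"})
print(r.headers.get("Memento-Datetime"))  # when this version was captured
print(r.text)                             # the RDF as it stood at that time
```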
This idea was born last Monday, during dinner. Azhar Hussein, from SHERPA/RoMEO, explained how they are trying to take the interaction between repositories and the SHERPA/RoMEO policies API to a new stage.
They have in mind a nightly-run process that would match the “under review” repository records against the latest policy information and update the records accordingly. This would save API load (they are getting over 250,000 requests per day) and valuable staff time, since this task is usually run manually by repository managers.
So that’s what I’ve come up with: an EPrints script that will update and display the RoMEO colour information for every matched record pending review.
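A sketch of the nightly matching step, shown in Python for brevity (an EPrints script would be Perl), assuming the RoMEO API of the time (api29.php) and its romeocolour element; check details against the SHERPA/RoMEO API documentation:

```python
# Sketch: look up the RoMEO colour for each pending record's journal ISSN
# and update the record with the latest policy information.
import requests
import xml.etree.ElementTree as ET

def romeo_colour(issn):
    r = requests.get("http://www.sherpa.ac.uk/romeo/api29.php",
                     params={"issn": issn})
    root = ET.fromstring(r.content)
    colour = root.find(".//romeocolour")
    return colour.text if colour is not None else None

def nightly_update(pending_records, update_record):
    """Match 'under review' records against the latest policy data."""
    for rec in pending_records:
        colour = romeo_colour(rec["issn"])
        if colour:
            update_record(rec["eprintid"], romeo_colour=colour)
```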
Fedora Object Locking
The Fedora community has long wished to be able to run multiple web apps on the same object store. The first problem in doing this is synchronizing writes to the objects. This is an implementation that allows multiple Fedora instances to lock objects and thus synchronize writes to them.
It uses Hazelcast to synchronise state between the Fedora instances.
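The implementation itself is Java, but the idea can be sketched with the Hazelcast Python client: every instance asks the cluster for the same named lock before writing an object, so writes are serialised cluster-wide. Lock naming here is an illustrative convention:

```python
# Sketch of the locking idea: one cluster-wide named lock per object PID.
import hazelcast

client = hazelcast.HazelcastClient()  # joins the cluster the instances share
lock = client.cp_subsystem.get_lock("object:demo:1").blocking()

lock.lock()
try:
    pass  # write the object safely; no other instance holds this lock
finally:
    lock.unlock()

client.shutdown()
```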