Event Report: Bibliohack

Jun 25, 2012 by kpitkin

DevCSI teamed up with the Open Knowledge Foundation for two days of hacking and sharing ideas about open bibliographic metadata at Queen Mary University of London.

The event provided an opportunity to hack with open bibliographic datasets, to experiment with new tools, and to help improve existing systems to provide new ways for institutions to benefit from bibliographic data. It attracted a range of developers from the DevCSI community, bibliographic data specialists and librarians with a keen interest in making better use of open bibliographic data.

Workshops

Day one of the event saw a parallel stream of workshop sessions which provided a context for the practical work in the main hack room.

These sessions addressed the technical aspects of opening up cultural heritage data, examined some of the best open source tools available for doing that, and explored the best standards for preparing and exposing data for reuse.

Introduction to APIs and Linked Data

http://www.vimeo.com/44721577

Preparing your Data for a Hackathon

http://www.vimeo.com/44721578

Diverse Metadata Standards and the Europeana Solution

http://www.vimeo.com/44721580

Case Study: Cambridge University Library

http://www.vimeo.com/44721871

Reuse of Open Cultural Heritage Data

http://www.vimeo.com/44721872

Case Study: The British Library

Lightning Talks

Participants in the hack element of the event shared their ideas and provided some background to their own work in the field in a series of lightning talks. These explored:

Content and Data Mining and PDF extractor by Peter Murray-Rust and Ross Mounce
m-biblio Project by Mike Jones
ORI/RJB by Ian Stuart
Making a BibServer Parser by Etienne Posthumus
IDFind: Identifying Identifiers by Emanuil Tolev
BibServer: What we have been doing recently, how that ties into the open access index idea by Mark MacGillivray
TEXTUS by Tom Oinn
Pundit: Collaborative semantic annotations of texts by Simone Fonda
Linked Data by Ian Stuart

Themes

Further discussion of these issues and the ideas shared via the event etherpad led to the realisation that many of the tools and solutions people wanted to work on throughout the event could be broadly brought under the umbrella of a Bibliographic Toolkit. Under this umbrella, a number of distinct groups formed to investigate specific issues.

These included…

BibServer Group

This group began exploring how to unite several tools, including PubCrawler, BibServer and BibSoup, to create a tool for abstracting and displaying bibliographic metadata. They planned to connect PubCrawler, which has collected 12 million bibliographic metadata records and will allow you to find out whether papers are Open Access, with BibServer, which collects and displays the data, and BibSoup, which allows the creation of local community groups of references for specific interest areas. The group also planned to investigate how easy it was to deploy BibServer on a variety of operating systems and to document that process. They hoped to test the tool using data from the German National Bibliography, with a subgroup working on how to parse this data ready for use.

In this short video interview, Mike Jones from the University of Bristol describes his side project associated with this group, which involved getting the m.biblio app he is currently developing to connect to BibServer…

This video is also available on Vimeo.

TEXTUS Group

This group focussed on discussing potential roles and developments for TEXTUS, an open source platform for working with collections of texts and associated metadata, including annotations. The goal was to create useful documentation about how the tool can be developed and extended, rather than working code.

In this short video interview, Simone Fonda from Net7 outlines the progress the group made as a result of these discussions and explains how he sees TEXTUS developing in the future…

This video is also available on Vimeo.

Open Access Index Group

This group set out to build a list of all the journals in the world, including their access policies, where available. Their intention was to create a search facility to query this data, a form to help crowdsource further data and updates, and an API to access this data. The resulting tool should give them a better idea of what is available to support other work.

In this short video interview, Ian Stuart from EDINA describes the work of the group in more detail, and explains the value he feels developers can bring to the process of creating open access services…

This video is also available on Vimeo.

Useful Tools

Throughout the event, participants shared links and information about a number of useful tools. Here is a quick reference guide to some of the tools mentioned:

BibServer

Github: https://github.com/okfn/bibserver

BibSoup

Link: http://bibsoup.net

TEXTUS

URL: http://textusproject.org

Github: https://github.com/okfn/textus

IDFind

Live here: http://idfind.cottagelabs.com/

Code here: https://github.com/CottageLabs/idfind/

Microsoft Academic Search

API documentation: http://academic.research.microsoft.com/

PubCrawler

Link: https://bitbucket.org/wwmm/pub-crawler

wget

Link: http://www.gnu.org/software/wget/manual/wget.html

PDFBox

Link: http://pdfbox.apache.org/

Final Outcomes

As the event drew to a close, each of the groups provided a short presentation to describe their progress:

Open Access Index Group

The group created an academic catalogue of journal titles, which they called ACat. This was the first step of an ambitious plan to create an index that will allow you to browse open access resources. During the course of the event, they collected 55,000 journal titles in an elastic database, then created a front end based on Facetview to provide a searchable interface. The journal title data was difficult to get, but now that they have it the group can query various APIs to determine licensing information and develop the project further.
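To give a feel for what a faceted, searchable catalogue of journal titles involves, here is a minimal in-memory sketch in Python. It is not the group's code and does not use their database or Facetview. The record fields (`title`, `publisher`, `licence`), the function names and the three sample records are all invented for illustration.

```python
from collections import Counter

# Hypothetical records: titles, publishers and licences are invented for illustration.
JOURNALS = [
    {"title": "Example Journal of Botany", "publisher": "Publisher A", "licence": "CC-BY"},
    {"title": "Example Journal of Physics", "publisher": "Publisher B", "licence": "unknown"},
    {"title": "Annals of Example Botany", "publisher": "Publisher A", "licence": "unknown"},
]


def search(records, text, filters=None):
    """Return records whose title contains `text` and that match every filter."""
    filters = filters or {}
    needle = text.casefold()
    return [
        record
        for record in records
        if needle in record["title"].casefold()
        and all(record.get(field) == value for field, value in filters.items())
    ]


def facet_counts(records, field):
    """Count how many records carry each value of `field`."""
    return Counter(record.get(field, "unknown") for record in records)


# Free-text search first, then show the facet values a user could narrow by.
hits = search(JOURNALS, "botany")
print(len(hits), "matching titles")
print(dict(facet_counts(hits, "licence")))
```

A real deployment would push both the text match and the facet counting down to the search index rather than looping over records in memory, but the shape of the interaction (search, then narrow by facet) is the same.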

Annotation Tools

Tom Oinn from TEXTUS explained how the event allowed him to gain a much better idea of where to take the TEXTUS project next and how to get it to play with other projects, including BibServer.

At the moment you can annotate texts using comments, but Oinn hopes to move towards any annotation being able to contain references. Notably, he hopes to move towards creating personal reading lists, which begin to look like BibSoup instances, and to start using TEXTUS as an annotation tool in its own right.

BibServer Group

The group spent a lot of time getting BibServers up and running on different architectures, noting that getting a high quality, distributable BibServer out is a high priority. Their work to identify the issues with different setups will help with this. The group also explored possible connections with other tools. The big unexpected outcome for this group was realising the value of linking BibServer with TEXTUS.

There were also several splinter projects from this core group, including work to connect BibServer and m.biblio, and efforts to add national bibliographic data from the UK, Germany, Spain and Sweden. The latter helped to identify the problem of character encoding, which is likely to crop up as BibServer gets used more outside the UK. The developers will be insisting on UTF-8 for ingest from now on, as a result of this work.
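As an illustration of what insisting on UTF-8 at ingest can look like, the following Python sketch decodes each incoming record strictly and rejects anything that is not valid UTF-8. It is an assumption about one reasonable approach, not BibServer's actual ingest code, and the function names `decode_utf8_strict` and `ingest` are hypothetical.

```python
def decode_utf8_strict(raw: bytes) -> str:
    """Decode bytes as UTF-8, raising ValueError with the byte offset on failure."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as err:
        raise ValueError(f"record is not valid UTF-8 at byte {err.start}") from err


def ingest(records):
    """Split raw records into decoded text and rejected (index, reason) pairs."""
    accepted, rejected = [], []
    for index, raw in enumerate(records):
        try:
            accepted.append(decode_utf8_strict(raw))
        except ValueError as err:
            rejected.append((index, str(err)))
    return accepted, rejected


# The same place name encoded two ways: only the UTF-8 bytes pass.
samples = ["Köln".encode("utf-8"), "Köln".encode("latin-1")]
accepted, rejected = ingest(samples)
print(accepted)
print(rejected)
```

Rejecting bad input outright, rather than guessing the source encoding, keeps mis-decoded characters out of the index and hands the problem back to the data provider, who is best placed to fix it.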

You can watch the final group presentations in full: