DevCSI | Developer Community Supporting Innovation » JISC

Citation Data Hack: Event Report

kpitkin — Mon, 05 Nov 2012 18:26:48 +0000

“What use is citation data without context?”

Max Hammond opened the Future Citation Data Hack by emphasising that, rightly or wrongly, citations are the fundamental currency for measuring the impact of research. He approached DevCSI to run a hack event exploring the practical applications of citation data to help inform the JISC Citation Data Directions project he is conducting, which will examine the whole lifecycle of citation data.
Hammond observed that whilst he is particularly interested in the citation data creation process, understanding the end use of that data is crucial to inform more strategic decisions.

The event was designed to support this work by bringing together a group of domain experts, users and developers to explore ideas related to potential real world uses of citation data and to prototype potential solutions. In particular, Hammond challenged the group to consider: Are there new ideas for the intelligent use of citation data? Are there things we can do with this data beyond the obvious?

With these questions in mind, the participants were invited to map out some of the issues and identify the most useful developments they could work on together during the hack event.

Initial Discussions

The group began with a round table brainstorm of the issues that exist in this space, and identified some of the specific questions members of the group were keen to explore in more detail.

Points of note included:

Acquiring complete citation data is an issue. Publishers see citation data as their added value, which makes them reluctant to give it away.
Grey literature and questions over what constitutes a citation mean that, in practice, it is impossible to have all citation data, so you will only ever have a sparse network. Under these circumstances, what can you do with only part of the data?
People cite in different ways in different fields. For example, applied research tends to attract few citations. The completeness of the record depends on the field.

One of the key areas of interest was the potential practical uses of citation data, and identifying who the users might be. Within this initial discussion, members of the group who considered themselves to be users of citation data shared their ideas for potential use scenarios, including:

Some form of rating for different types of citations to help judge impact. There is work going on in this area currently, but the group noted that this would require researchers to apply it rigorously in practice.
A Klout-style system for citations: This would need to assess positive and negative citations to provide a useful single-number score per article. However, the group discussion highlighted the issue of betweenness, where really valuable citations can sit in between traditional boundaries in less typically valued places. This would be hard to address in a single score.
A geographical visualisation of citations as a broad way to explore citation data. The group imagined a visualisation of citations across a world map, connecting global citations with funder information to allow users to explore impact. Further discussion of this idea highlighted that the current incompleteness of the data – including a lack of funder information and data about non-English citations – would be a major stumbling block to gaining an accurate picture in such a visualisation. The group discussed possible routes to remedy this situation, including how to encourage authors to include funding information, which is often omitted.

The discussion then moved towards identifying practical, achievable outcomes for this event, such as:

A comparison of several available citation datasets to identify similarities and establish whether researchers working in this area may end up with different results when using graphs from different citation datasets. The published material suggests citation datasets from different sources may be very different, which is why academics claim citations are not a good way to judge them. Could this be verified or explored more deeply using the datasets provided by event participants?
A mind map of the wider picture of citation, including how information and money flow around the citation system. A collaborative page was made available throughout for individual contributions to this big picture.
Experiments with visualisation tools to create citation timelines and measure the decay of citations over time.

There was also interest in taking the opportunity to discuss a number of higher level issues, which could inform the practical use of citation data. Some of the issues touched upon included:

Identifying the barriers to data mining to establish the wider picture around citations.
Questions surrounding the fundamental nature of citations, such as: What is a citation? What would we do differently if we could define this? If we decided that citations are not the best measure of quality, what is?

Which is most valuable: the most-read but least cited paper, or the least-read but most cited paper?

After these initial discussions, the participants split into several groups to pursue these ideas further, with several participants floating between groups to offer ideas and expertise where required.

Discussion Group

Throughout the first day of the event, a wide ranging discussion between various participants explored the more abstract issues associated with citation data – including questions relating to the nature of citation – specifically focusing on data citation and the intrinsic link between the nature of citation and the nature of the material being cited.

The key argument made was that data citation is still very much domain specific, with disciplines having different approaches to making the underlying data available (and therefore citable), or even preserving the data at all, depending on a number of practical and cultural considerations. Members of the group discussed the citation of less conventional materials, such as plant materials or animals in Life Sciences, which rely on a single physical “master” specimen. This discussion highlighted the fact that data isn’t just digits, and citation of data will vary hugely across different disciplines depending on the form of the data being cited. The group also explored examples from archaeology, where convention dictates that a site report is cited in place of an object itself, and questioned whether metadata could be cited in place of the data in other contexts too.

In considering the issue of how you cite differing materials, the question of persistence arose, which led to an extended discussion about data preservation and the need for persistent citations, but again, it was argued that this is largely domain specific. One of the major problems they identified was that people think of a URL as an identifier, even though URLs offer no guarantee of persistence.

The discussion wrapped up with reflections about the reason for the current differentiation between data and documents, questioning why the two things are treated differently. They concluded that what constitutes a citation in relation to a publication is not clearly understood and may be influenced by a number of cultural and social factors, making the concept difficult to apply to data. To be truly understood and assigned a value, a citation needs to be considered in context.

Comparing Large Citation Datasets

Karl Ward, CrossRef
Petr Knoth, Open University
Emma Tonkin, UKOLN
David King, Open University
Sheng Li, University of Birmingham

Two members of this group brought significant citation datasets with them to the event, and there was considerable interest in comparing them to identify similarities. The datasets differed in size and composition, so the group surmised that if they resulted in similar but differently sized graphs, this would allow people looking at smaller graphs to make general conclusions about their own data with more confidence. To test this, the group created a series of histograms of citations over time to identify general shapes. These could perhaps be compared against specific resources to identify features such as courtesy citations compared to the general shape of citations over time for generally relied-upon resources.

The two datasets came from the CORE project and CrossRef. The two datasets use different identifiers, so to begin with the group had to manually choose papers that appear in both datasets to enable them to make comparisons. The group also used data from Microsoft Academic Search to support their work. Through further discussion both within the group and with other event participants, they identified a list of statistical techniques and graph-based approaches that they could apply to this data, including calculating the half-life of a paper. However, they stressed the need for a statistician to become involved with the project to assist with many of these analyses.
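The report does not say how the group intended to define a paper's half-life. As a rough sketch only, the following Python assumes one common reading: the number of years after publication by which half of the citations recorded so far had been received. The function name and the input shape (a list of citing years) are hypothetical, chosen for illustration.

```python
from collections import Counter


def citation_half_life(publication_year, citing_years):
    """Years after publication by which half of the recorded citations had arrived.

    Assumption: 'half-life' means the point at which the cumulative citation
    count first reaches half of the total observed so far. Returns None when
    the paper has no recorded citations.
    """
    if not citing_years:
        return None
    per_year = Counter(citing_years)   # histogram of citations by year
    total = len(citing_years)
    running = 0
    for year in sorted(per_year):
        running += per_year[year]
        if running * 2 >= total:       # reached at least half of all citations
            return year - publication_year
    return None


# Example: a paper published in 2000 with six recorded citations.
print(citation_half_life(2000, [2001, 2001, 2002, 2005, 2010, 2011]))
```

Because the result depends on the citations a dataset happens to record, the same paper could have a different half-life in each dataset, which is one reason the comparison between sources matters.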

Karl Ward described their progress and ultimate aims in this short video interview:

Click here to view the embedded video.

You can also watch this video on Vimeo.

By the end of the event, the group had explored some overall metrics for assessing the similarities between the datasets. The aim was to help researchers better understand the consequences of choosing one citation dataset over another without fully understanding the different methodologies that may have been used to compile each. These metrics included the absolute maximum number of citations that a single node might have received, the absolute minimum, the average and so on. However, they noted that whilst this tells you a bit about your dataset as a whole, it does not tell you how accurate any given node might be.
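Whole-dataset metrics of this kind can be sketched in a few lines of Python. This is an illustration rather than the group's code: it assumes the citation graph is available as a list of (citing, cited) pairs, and the function name is hypothetical.

```python
from collections import Counter
from statistics import mean


def citation_summary(edges):
    """Summarise citations received per node in a citation graph.

    `edges` is a list of (citing, cited) pairs (an assumed input format).
    Nodes that only ever cite others are counted as having zero citations.
    """
    if not edges:
        raise ValueError("need at least one citation edge")
    received = Counter(cited for _, cited in edges)
    nodes = {node for edge in edges for node in edge}
    counts = [received.get(node, 0) for node in nodes]
    return {
        "nodes": len(nodes),
        "max": max(counts),
        "min": min(counts),
        "mean": mean(counts),
    }


# Three papers: b is cited twice, c once, a never.
print(citation_summary([("a", "b"), ("c", "b"), ("a", "c")]))
```

As the group observed, two datasets could agree closely on these numbers while still disagreeing about the citations recorded for any individual paper.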

To pursue this further, the group picked some specific nodes to study in greater detail, creating histograms of citations over time and considering some of the odd features of some of the data – including a bias towards the present in the CrossRef data. They speculated that this might be due to publishers only recently adding DOIs to papers, enabling them to confidently describe a paper containing a citation, but conceded that this issue needs to be examined in more detail than permitted in the time available at the hack event.

With these anomalous results set aside until they can be fully explained, the group used the Earth Mover’s Distance algorithm to examine the differences between normalised results for one node, comparing the same paper in the data provided by CrossRef and Microsoft Academic Search. They noted that they will need to repeat this process over a larger group of papers, for which they will need to be able to identify the same papers within both datasets using DOIs.
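The report does not describe the group's implementation. As a minimal sketch, assume each dataset yields a histogram of citations per year for the same paper, with both histograms covering the same run of years. In that one-dimensional case the Earth Mover's Distance between the normalised histograms is the sum of the absolute differences of their cumulative distributions. The function names below are hypothetical.

```python
from itertools import accumulate


def normalise(histogram):
    """Scale a histogram of counts so that its bins sum to 1."""
    total = sum(histogram)
    if total <= 0:
        raise ValueError("histogram must contain at least one citation")
    return [count / total for count in histogram]


def earth_movers_distance(hist_a, hist_b):
    """1D Earth Mover's Distance between two histograms over the same bins.

    Both inputs are citation counts per year for the same consecutive years.
    The result is measured in bins (years): the average distance a unit of
    'citation mass' must move to turn one distribution into the other.
    """
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must cover the same years")
    diffs = (a - b for a, b in zip(normalise(hist_a), normalise(hist_b)))
    return sum(abs(carried) for carried in accumulate(diffs))


# Same paper, citations per year as seen in two (made-up) datasets.
print(earth_movers_distance([4, 3, 2, 1], [2, 3, 3, 2]))
```

Normalising first means the comparison reflects the shape of the citation history rather than the overall size of each dataset, which matters when one source records far more citations than the other.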

Going forward, the group would like to look at the general shape of more histograms and how different papers are used over time, and to identify papers that are in the space where they were almost influential, but might not appear in a rank of frequencies.

Watch the final report by Emma Tonkin, who summarises the outcomes of the group’s efforts in full in this presentation:

Click here to view the embedded video.

You can also watch this video on Vimeo.

Information Flow

Throughout the event, Max Hammond asked participants to contribute to a collaborative project to map out the flow of information and money within the citation data ecosystem. Various participants chipped in to add sections of the citation data lifecycle, and the influences and issues that impact on that lifecycle.

Watch the map evolve in this short video:

Click here to view the embedded video.

You can also watch this video on Vimeo.

Visualisation Beyond the Citation Border

Edward Minnett, Faculty of 1000
Tanya Gray, University of Oxford
Sheng Li, University of Birmingham
Tim Brody, University of Southampton
Paul Stokes, JISC

This group were particularly interested in creating visualisations of citation data, including visualisations over time and over geographical space. Their initial approach involved attempting to make modifications to an existing library in order to create tree graphs, but they quickly found this too complicated for the time constraints of the event. As a result, they decided to take a stock visualisation and change some of the parameters to make it more readable for citations, then plug in some of the data from opencitations.net to see what patterns emerged.

Several members of the group took the opportunity to explore new or unfamiliar technologies, including Node.js, which Tim Brody used for the first time at the event to develop a proxy around opencitations.net to act as middleware, which could then be applied to other datasets. Other members of the group practised using SPARQL queries to create a dynamic graph around the opencitations.net data.

Edward Minnett described the group’s progress and driving forces from his perspective in this short video interview:

Click here to view the embedded video.

You can also watch this video on Vimeo.

By the end of the event, the group had developed a middleware layer that abstracts the APIs of a variety of citation data sources to allow that data to be used with any visualisation tool. In the future, this could be coupled with an identifier system such as DOI to allow you to migrate seamlessly across citation databases within a visualisation.

The approach the group took to build this was to construct a Node.js-based server, which sends SPARQL queries to opencitations.net to retrieve metadata about a particular article and all the items that cite it. The results are converted into a standard format in the middleware, which in turn is passed to the chosen visualisation system.
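The group's server was written in Node.js and its code is not reproduced in this report. The Python sketch below only illustrates the shape of the idea: build a SPARQL query for the items citing an article, then reduce the response to a source-independent structure. It makes several assumptions: that citations are expressed with the CiTO `cito:cites` property, that the endpoint returns the standard SPARQL JSON results format, and that the "standard format" is a simple dictionary. The function names and example IRIs are hypothetical, and no network request is made.

```python
CITO = "http://purl.org/spar/cito/"  # CiTO ontology namespace (assumed vocabulary)


def cited_by_query(article_iri):
    """Build a SPARQL query for everything that cites the given article."""
    return (
        f"PREFIX cito: <{CITO}>\n"
        "SELECT ?citing WHERE {\n"
        f"  ?citing cito:cites <{article_iri}> .\n"
        "}"
    )


def to_standard_format(article_iri, sparql_json):
    """Reduce a SPARQL JSON result set to a simple, source-independent dict.

    The output shape is a hypothetical 'standard format' that a
    visualisation could consume regardless of which dataset supplied it.
    """
    bindings = sparql_json["results"]["bindings"]
    citing = sorted({row["citing"]["value"] for row in bindings})
    return {"article": article_iri, "cited_by": citing}


# A hand-written response in the SPARQL JSON results layout.
response = {
    "head": {"vars": ["citing"]},
    "results": {"bindings": [
        {"citing": {"type": "uri", "value": "http://example.org/paper/2"}},
        {"citing": {"type": "uri", "value": "http://example.org/paper/3"}},
    ]},
}
print(cited_by_query("http://example.org/paper/1"))
print(to_standard_format("http://example.org/paper/1", response))
```

Keeping the query builder separate from the normalising step is what would let another data source be added later: only the first function needs to know how a particular source is queried.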

As part of this work, the group wanted to come up with a new visualisation that would help people in this space to think about citations on a timeline. They created a visualisation which connects articles cited by a particular node, and the articles that in turn cited that node, arranged on a timeline. They used the JavaScript InfoVis toolkit to create and demonstrate this visualisation in real time.

Watch the final report by Tim Brody, who summarises the outcomes of the group’s efforts in full in this presentation:

Click here to view the embedded video.

You can also watch this video on Vimeo.

Remote Participation

There was interest in the event from a number of people who were unable to attend in person. When Jimme Jardine realised he could not attend, he asked us to show a pre-prepared video to the group describing Qiqqa, his own project that was highly relevant to the discussions taking place at the event.

You can watch this video in full below:

Click here to view the embedded video.

You can also watch this video on YouTube.

Conclusions

The event was designed to directly feed into a JISC-funded project examining the life cycle of citation data by connecting the researchers directly to users and developers who can build things with citation data. The practical outcomes of the event as described in this report helped to provide insight into how citation data could be used and to identify some of the difficulties that exist with the current citation data infrastructure that prevent innovation.

Max Hammond summed up how useful the event proved for his project in this short video interview, in which he reflects on the differences between the needs of high level stakeholders and the developers on the ground who are looking to implement solutions based on citation data:

Click here to view the embedded video.

You can also watch this video on Vimeo.

Citation data matters, so we need to get this right.

code.ac.uk: A Bounty Hunt

kirsty-pitkin — Mon, 26 Mar 2012 09:45:44 +0000

In this guest post, Martin Hamilton from Loughborough University describes his JISC Elevator Pitch idea: code.ac.uk – a bounty hunt.
_____________________________________

Let’s stop reinventing the wheel and share the code we develop to make institutional systems talk to each other.

If you work for a University as a developer you’re probably very familiar with the scenario that “we have just bought product X, which will need to talk to products Y and Z”. There are lots of well established institutional systems such as Library management systems, Finance systems, HR systems, Student Records systems, Virtual Learning Environments, and so on. Heck – there are even email and calendar / online collaboration suites too.

As a developer, you’re accustomed to Googling around to figure out how to do things and solve tricky problems. Sites like Stack Overflow help enormously here, but there’s a whole class of stuff that is quite hard to search for – enterprise packages. Aside from a few enlightened cases (e.g. Google Apps API forum posts), there either isn’t a body of sample code to draw upon, or it’s hidden away behind corporate Extranets.

So, I’d like to snarf a chunk of the JISC Elevator Pitch funding to try a little experiment to open source some of this systems integration code. Here’s a short video that explains how I envisage this working:

Click here to view the embedded video.

Watch this video on YouTube.

If successful, I think this little project would help to get institutions thinking about sharing code more generally, and perhaps even move us a little bit closer to a “University API” that exposes say “Finance system functions” rather than “Agresso Finance System functions”, and would permit institutions to move between systems whilst retaining a common API layer. Much of the prior work in this area has been top down, but I suspect a bottom up approach would be more likely to succeed.

I see this as a natural DevCSI project, since participants in DevCSI already “get it” and understand the benefits that accrue from sharing code – particularly around rapid development, pooling expertise, and avoiding unnecessary duplication of effort. As part of the project we would organise a workshop under the DevCSI banner for all those interested in opening up their institutional systems integration code. This would provide an opportunity to agree a common approach to code sharing (e.g. choice of license), and also give people an opportunity to share hints and tips for successful promotion at each other’s institutions.

If you like the sound of what you’re hearing – vote for me! (Note: ac.uk email address required for this)

Library Data And Doing Interesting Things With It

devcsi-team — Thu, 22 Oct 2009 14:15:58 +0000

We would like to bring to your attention the very high quality competition winners and completed demonstrators for the JISC MOSAIC Project.

Anyone who is interested in developing applications that use data from libraries should find real inspiration in these examples, which were based on four years of data from the University of Huddersfield library, namely library circulation records and information relating to courses.

The applications created covered three areas: Improving Resource Discovery, Supporting Learning Choices and Supporting Decision Making. Five of the entries were received from the UK and one from the USA. All are proof of concept or demonstration prototypes.

First Prize

Alex Parker, a Computer Science undergraduate from the University of Southampton, won the first prize of £1000 with ‘Book Galaxy’. This allows users to browse and/or keyword search for books and courses using a constellation-type visual interface rather than a list of books. This tool requires the installation of Java and was tested on Firefox 3.0.14 and in Internet Explorer 7.

Book Galaxy

Screen shot of Book Galaxy: the search term ‘physics’ was used and it produced a dynamic ‘constellation’ or ‘galaxy’ of points, each of which, when hovered over, provides information about a book found for this topic.

Second Place

Second place went to Andrew Isherwood from Aberystwyth University. The application returns library lending data and the monetary value of those loans related to a specific course. It has been tested on Firefox 3.0.14 and in Internet Explorer 7.

Andrew Isherwood Aberystwyth University Entry for MOSAIC Project (Second)

Screen shot of second place in the MOSAIC competition (Andrew Isherwood).

Third Place

Third place went to Alistair Young from the University of the Highlands and Islands, with ‘iLib, the Course Book Finder’. This tool uses keyword searches to find relevant books relating to specific courses. It has been tested on Firefox 3.0.14 and in Internet Explorer 7.

iLib Course Book Finder

Screen shot of the third-prize entry, ‘iLib, the Course Book Finder’, by Alistair Young.

Honourable mentions

Tony Hirst – Open University

This demonstrator shows how library book loan information could be used to help potential new students get an idea and feel for a prospective course by looking at the reading materials that existing students are taking out for it. The user drags a ‘bookmarklet’ to their toolbar and, when they are on a page that refers to a particular UCAS code of interest, presses the button to reveal suggested reading list items. It has been tested on Firefox 3.0.14.

Tony Hirst MOSAIC Entry

Screen shot of Tony Hirst’s UCAS code course reading materials tool.

Owen Stephens – Open University

This prototype is called ‘Read to Learn’ and is a tool that suggests courses that you could study based on an uploaded list of ISBNs (for example, it could comprise the books that you have read). The tool has been tested on Firefox 3.0.14 and in Internet Explorer 7.

Read to Learn 1

Read to Learn: Screen shot 1 (before upload of file containing lists of ISBNs).

Read to Learn 2

Read to Learn: Screen shot 2 (after upload of file containing lists of ISBNs).

Collection Development Dashboard

Sean Hannan of Johns Hopkins University submitted a prototype web application that visualises (through a series of bar graphs) circulation data relating to courses of study and publishers across the past four years. This has been tested on Firefox 3.0.14 and in Internet Explorer 7. The application requires Adobe Flash Player 10 or above. Note: to go back up to the top of the data series, use either ‘ctrl + click’ (PC) or ‘cmd + click’ (Apple).

Collection Development Dashboard 1

Collection Development Dashboard: Screen shot 1 – Data by year

Collection Development Dashboard 2

Collection Development Dashboard: Screen shot 2 – 2008 selected and Subjects displayed

Collection Development Dashboard 3

Collection Development Dashboard: Screen shot 3 – BSc Psychology selected and results displayed.

The MOSAIC team will be seeking feedback from Higher Education library and learning practitioners on all six applications at a series of workshops over the next month at the Universities of Edinburgh, Sheffield, Sussex and the Open University.

The applications will also be featured at the concluding MOSAIC event at the University of Wolverhampton on Wednesday 18 November. For more information please visit this page.

YODL-ING – Video: Nigel.V.Thomas – Pitch 53 – Day 2

devcsi-team — Wed, 16 Sep 2009 00:23:21 +0000

YODL-ING into web 3.0, York University are developing access control systems & UIs for deposits into hybrid repositories.

The project aims to offer re-usable solutions and recommendations for the wider HE, JISC and Fedora communities.

Working with project partners, YODL-ING will build two significant services. One will utilise the SWORD protocol to expedite deposit into multiple repositories from a single deposit interface. The other will offer a simple, scalable solution to control access, defining machine-readable policies and implementing Shibboleth for access control to hybrid repositories used in various HE environments.

We are talking to:

For more information, please visit:

Click here to view the embedded video.

OneVRE – Video: Thomas Schiebeck – Pitch 51 – Day 2

devcsi-team — Wed, 16 Sep 2009 00:13:31 +0000

One VRE to Join them All.

Access Grid Technologies enabling collaboration across multiple portal based VREs.

Integrated collaboration across multiple VREs.

We are talking to:

For more information, please visit:

Click here to view the embedded video.

Easihe – Video: Bart Nagel – Pitch 49 – Day 2

devcsi-team — Tue, 15 Sep 2009 23:53:30 +0000

A JISC-funded project to produce an e-assessment repository with a delivery mechanism and peer assessment features.

A single environment to store, retrieve, deliver and mark e-assessments.

We have started talking to Peer Pigeon, Edshare and QTItools.

For more information, please visit:

Click here to view the embedded video.