Citation Data Hack: Event Report

Nov 5, 2012 by kpitkin

“What use is citation data without context?”

Max Hammond opened the Future Citation Data Hack by emphasising that, rightly or wrongly, citations are the fundamental currency for measuring the impact of research. He approached DevCSI to run a hack event exploring the practical applications of citation data to help inform the JISC Citation Data Directions project he is conducting, which will examine the whole lifecycle of citation data.
Hammond observed that whilst he is particularly interested in the citation data creation process, understanding the end use of that data is crucial to inform more strategic decisions.

The event was designed to support this work by bringing together a group of domain experts, users and developers to explore ideas related to potential real world uses of citation data and to prototype potential solutions. In particular, Hammond challenged the group to consider: Are there new ideas for the intelligent use of citation data? Are there things we can do with this data beyond the obvious?

With these questions in mind, the participants were invited to map out some of the issues and identify the most useful developments they could work on together during the hack event.

Initial Discussions

The group began with a round table brainstorm of the issues that exist in this space, and identified some of the specific questions members of the group were keen to explore in more detail.

Points of note included:

Acquiring complete citation data is an issue. Publishers see citation data as their added value, which makes them reluctant to give it away.
Grey literature and questions over what constitutes a citation mean that, in practice, it is impossible to have all citation data, so you will only ever have a sparse network. Under these circumstances, what can you do with only part of the data?
In different fields people cite in different ways. For example, applied research results in low citation counts. The completeness of the record depends on the field.

One of the key areas of interest was the potential practical uses of citation data, and identifying who the users might be. Within this initial discussion, members of the group who considered themselves to be users of citation data shared their ideas for potential use scenarios, including:

Some form of rating for different types of citations to help judge impact. There is work going on in this area currently, but the group noted that this would require researchers to apply it rigorously in practice.
A Klout-style system for citations: This would need to assess positive and negative citations to provide a useful single-number score per article. However, the group discussion highlighted the issue of betweenness, where really valuable citations can sit between traditional boundaries in less typically valued places. This would be hard to address in a single score.
A geographical visualisation of citations as a broad way to explore citation data. The group imagined a visualisation of citations across a world map, connecting global citations with funder information to allow users to explore impact. Further discussion of this idea highlighted that the current incompleteness of the data – including a lack of funder information and data about non-English citations – would be a major stumbling block to gaining an accurate picture in such a visualisation. The group discussed possible routes to remedy this situation, including how to encourage authors to include funding information, which is often omitted.

The discussion then moved towards identifying practical, achievable outcomes for this event, such as:

A comparison of several available citation datasets to identify similarities and establish whether researchers working in this area may end up with different results when using graphs from different citation datasets. The published material suggests citation datasets from different sources may be very different, which is why academics claim citations are not a good way to judge them. Could this be verified or explored more deeply using the datasets provided by event participants?
A mind map of the wider picture of citation, including how information and money flow around the citation system. A collaborative page was made available throughout for individual contributions to this big picture.
Experiments with visualisation tools to create citation timelines and measure the decay of citations over time.

There was also interest in taking the opportunity to discuss a number of higher level issues, which could inform the practical use of citation data. Some of the issues touched upon included:

Identifying the barriers to data mining to establish the wider picture around citations.
Questions surrounding the fundamental nature of citations, such as: What is a citation? What would we do differently if we could define this? If we decided that citations are not the best measure of quality, what is?

Which is most valuable: the most-read but least cited paper, or the least-read but most cited paper?

After these initial discussions, the participants split into several groups to pursue these ideas further, with several participants floating between groups to offer ideas and expertise where required.

Discussion Group

Throughout the first day of the event, a wide ranging discussion between various participants explored the more abstract issues associated with citation data – including questions relating to the nature of citation – specifically focusing on data citation and the intrinsic link between the nature of citation and the nature of the material being cited.

The key argument made was that data citation is still very much domain specific, with disciplines having different approaches to making the underlying data available (and therefore citable), or even preserving the data at all, depending on a number of practical and cultural considerations. Members of the group discussed the citation of less conventional materials, such as plant materials or animals in Life Sciences, which rely on a single physical “master” specimen. This discussion highlighted the fact that data isn’t just digits, and citation of data will vary hugely across different disciplines depending on the form of the data being cited. The group also explored examples from archaeology, where convention dictates that a site report is cited in place of an object itself, and questioned whether metadata could be cited in place of the data in other contexts too.

In considering the issue of how you cite differing materials, the question of persistence arose, which led to an extended discussion about data preservation and the need for persistent citations, but again, it was argued that this is largely domain specific. One of the major problems they identified was that people think of a URL as an identifier, even though URLs offer no guarantee of persistence.

The discussion wrapped up with reflections about the reason for the current differentiation between data and documents, questioning why the two things are treated differently. They concluded that what constitutes a citation in relation to a publication is not clearly understood and may be influenced by a number of cultural and social factors, making it difficult to apply to data. To be truly understood and assigned a value, a citation needs to be considered in context.

Comparing Large Citation Datasets

Karl Ward, CrossRef
Petr Knoth, Open University
Emma Tonkin, UKOLN
David King, Open University
Sheng Li, University of Birmingham

Two members of this group brought significant citation datasets with them to the event, and there was considerable interest in comparing them to identify similarities. The datasets differed in size and composition, so the group surmised that if they resulted in similar but differently sized graphs, this would allow people looking at smaller graphs to draw general conclusions about their own data with more confidence. To test this, the group created a series of histograms of citations over time to identify general shapes. These could perhaps be compared against specific resources to identify features such as courtesy citations compared to the general shape of citations over time for generally relied-upon resources.

The two datasets came from the CORE project and CrossRef. The two datasets use different identifiers, so the group had to start by manually choosing papers that appear in both datasets to enable them to make comparisons. The group also used data from Microsoft Academic Research to support their work. Through further discussion both within the group and with other event participants, they identified a list of statistical techniques and graph-based approaches that they could apply to this data, including calculating the half-life of a paper. However, they stressed the need for a statistician to become involved with the project to assist with many of these analyses.
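To make the half-life idea concrete, here is a minimal Python sketch. It is not the group's code, and the definition used is an assumption: the half-life is taken to be the number of years after publication by which half of a paper's recorded citations had arrived. Other definitions exist. The function names and the example years are hypothetical.

```python
from collections import Counter


def citations_per_year(citing_years):
    """Build a histogram mapping year -> citations received in that year."""
    return dict(sorted(Counter(citing_years).items()))


def citation_half_life(publication_year, citing_years):
    """Years after publication by which half of all recorded citations arrived.

    This is one possible definition of half-life (an assumption, see above).
    Returns None when the paper has no recorded citations.
    """
    histogram = citations_per_year(citing_years)
    total = sum(histogram.values())
    if total == 0:
        return None
    running = 0
    for year, count in histogram.items():
        running += count
        # Stop at the first year where the cumulative count reaches half.
        if 2 * running >= total:
            return year - publication_year
    return None


# Hypothetical paper published in 2000, with the years of its citing papers.
example_years = [2001, 2001, 2002, 2002, 2002, 2003, 2005, 2008]
print(citations_per_year(example_years))
print(citation_half_life(2000, example_years))
```

Because the result depends entirely on which citations a dataset has recorded, the same paper may well have a different half-life in each dataset, which is part of what the group wanted to compare.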

Karl Ward described their progress and ultimate aims in this short video interview:

You can also watch this video on Vimeo.

By the end of the event, the group had explored some overall metrics to help assess the similarities between the datasets. The aim was to help researchers better understand the decision they make when choosing one citation dataset over another, often without fully understanding the different methodologies that may have been used to compile each one. These metrics included the absolute maximum number of citations that a single node might have received, the absolute minimum, the average and so on. However, they noted that whilst this tells you a bit about your dataset as a whole, it does not tell you how accurate any given node might be.
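As an illustration of these whole-dataset metrics, the following Python sketch summarises the number of citations received per node for two datasets. The data layout (a mapping from paper identifier to citation count), the function name and the example figures are all assumptions made for the sake of the example, not the group's actual data or code.

```python
from statistics import mean, median


def summarise_citation_counts(citations_received):
    """Summarise a mapping of paper id -> number of citations received."""
    counts = list(citations_received.values())
    if not counts:
        raise ValueError("dataset contains no papers")
    return {
        "papers": len(counts),
        "max": max(counts),
        "min": min(counts),
        "mean": mean(counts),
        "median": median(counts),
    }


# Hypothetical citation counts for overlapping papers in two datasets.
dataset_a = {"paper-1": 12, "paper-2": 0, "paper-3": 3}
dataset_b = {"paper-1": 9, "paper-2": 1, "paper-3": 2, "paper-4": 40}

for name, dataset in (("A", dataset_a), ("B", dataset_b)):
    print(name, summarise_citation_counts(dataset))
```

As the group observed, figures like these describe the dataset as a whole. Two datasets could produce similar summaries while still disagreeing about individual papers.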

To pursue this further, the group picked some specific nodes to study in greater detail, creating histograms of citations over time and considering some of the odd features of some of the data – including a bias towards the present in the CrossRef data. They speculated that this might be due to publishers only recently adding DOIs to papers, enabling them to confidently describe a paper containing a citation, but conceded that this issue needs to be examined in more detail than permitted in the time available at the hack event.

With these anomalous results set aside until they can be fully explained, the group used the Earth Mover’s Distance algorithm to examine the differences between normalised results for one node, comparing the same paper in the data provided by CrossRef and Microsoft Academic Research. They noted that they will need to carry out this same process over a larger group of papers, for which they will need to be able to identify the same papers within both datasets using DOIs.
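For readers unfamiliar with the Earth Mover’s Distance, the sketch below shows one way to compute it in Python for two citations-per-year histograms of the same paper. It assumes yearly bins and relies on the fact that, in one dimension, the distance between two normalised histograms equals the sum of absolute differences between their cumulative distributions. The function names and example figures are hypothetical, and this is not necessarily how the group implemented it.

```python
def normalise(histogram, years):
    """Turn a year -> count histogram into proportions over the given years."""
    total = sum(histogram.get(year, 0) for year in years)
    if total == 0:
        raise ValueError("histogram has no citations in the given years")
    return [histogram.get(year, 0) / total for year in years]


def earth_movers_distance(hist_a, hist_b):
    """1D Earth Mover's Distance between two year -> count histograms.

    Both histograms are normalised first, so the result is measured in
    years: how far, on average, citation 'mass' must move along the
    timeline to turn one distribution into the other.
    """
    if not hist_a or not hist_b:
        raise ValueError("both histograms must contain at least one year")
    first = min(min(hist_a), min(hist_b))
    last = max(max(hist_a), max(hist_b))
    years = range(first, last + 1)
    p = normalise(hist_a, years)
    q = normalise(hist_b, years)
    distance = 0.0
    carried = 0.0  # running difference between the two cumulative sums
    for share_a, share_b in zip(p, q):
        carried += share_a - share_b
        distance += abs(carried)
    return distance


# Hypothetical histograms for the same paper in two datasets.
source_one = {2001: 2, 2002: 2}
source_two = {2002: 2, 2003: 2}
print(earth_movers_distance(source_one, source_two))
```

In this example the second histogram is the first shifted one year later, so the distance works out as one year. Normalising first means datasets of different sizes can be compared by shape rather than by raw counts.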

Going forward, the group would like to look at the general shape of more histograms and how different papers are used over time, and to identify papers that are in the space where they were almost influential, but might not appear in a rank of frequencies.

Watch the final report by Emma Tonkin, who summarises the outcomes of the group’s efforts in full in this presentation:

You can also watch this video on Vimeo.

Information Flow

Throughout the event, Max Hammond asked participants to contribute to a collaborative project to map out the flow of information and money within the citation data ecosystem. Various participants chipped in to add sections of the citation data lifecycle, and the influences and issues that impact on that lifecycle.

Watch the map evolve in this short video:

You can also watch this video on Vimeo.

Visualisation Beyond the Citation Border

Edward Minnett, Faculty of 1000
Tanya Gray, University of Oxford
Sheng Li, University of Birmingham
Tim Brody, University of Southampton
Paul Stokes, JISC

This group were particularly interested in creating visualisations of citation data, including visualisations over time and over geographical space. Their initial approach involved attempting to make modifications to an existing library in order to create tree graphs, but they quickly found this too complicated for the time constraints of the event. As a result, they decided to take a stock visualisation and change some of the parameters to make it more readable for citations, then plug in some of the data from opencitations.net to see what patterns emerged.

Several members of the group took the opportunity to explore new or unfamiliar technologies, including Node.js, which Tim Brody used for the first time at the event to develop a proxy around opencitations.net to act as middleware, which could then be applied to other datasets. Other members of the group practised using SPARQL queries to create a dynamic graph around the opencitations.net data.

Edward Minnett described the group’s progress and driving forces from his perspective in this short video interview:

You can also watch this video on Vimeo.

By the end of the event, the group had developed a middleware layer that can abstract the APIs of a variety of citation data sources to allow that data to be used with any visualisation tool. In the future, this could be coupled with a system like DOI to allow you to migrate seamlessly across citation databases within a visualisation.

The approach the group took to build this was to construct a Node.js-based server, which sends off SPARQL queries to opencitations.net to retrieve metadata about a particular article and all the items that article has been cited by. The results are converted into a standard format in the middleware, which in turn is passed to the chosen visualisation system.
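The shape of that middleware step can be sketched as follows. This is written in Python rather than Node.js, makes no network calls, and is not the group's code. It assumes the endpoint records citations with the CiTO property cito:cites and titles with dcterms:title, and that it returns the standard SPARQL JSON results format. The function names, the output layout and the example.org identifiers are hypothetical.

```python
# Vocabulary terms assumed to be used by the endpoint (an assumption).
CITO_CITES = "http://purl.org/spar/cito/cites"
DCTERMS_TITLE = "http://purl.org/dc/terms/title"


def build_cited_by_query(article_uri):
    """Build a SPARQL query for everything that cites the given article."""
    if any(ch in article_uri for ch in "<> \n\t"):
        raise ValueError("article_uri is not a plain URI")
    return (
        "SELECT ?citing ?title WHERE {\n"
        f"  ?citing <{CITO_CITES}> <{article_uri}> .\n"
        "  OPTIONAL { ?citing <" + DCTERMS_TITLE + "> ?title }\n"
        "}"
    )


def normalise_results(article_uri, sparql_json):
    """Convert SPARQL JSON results into a simple, source-neutral format."""
    cited_by = []
    for row in sparql_json["results"]["bindings"]:
        cited_by.append({
            "id": row["citing"]["value"],
            # The title is optional, so it may be missing from a row.
            "title": row.get("title", {}).get("value"),
        })
    return {"id": article_uri, "cited_by": cited_by}


# Hypothetical response in the standard SPARQL JSON results layout.
sample_response = {
    "head": {"vars": ["citing", "title"]},
    "results": {"bindings": [
        {"citing": {"type": "uri", "value": "http://example.org/paper/2"},
         "title": {"type": "literal", "value": "A citing paper"}},
        {"citing": {"type": "uri", "value": "http://example.org/paper/3"}},
    ]},
}

article = "http://example.org/paper/1"
print(build_cited_by_query(article))
print(normalise_results(article, sample_response))
```

Keeping the output format independent of the source is what would allow the same visualisation to be pointed at a different citation dataset: only the query-building and normalising functions would need to change.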

As part of this work, the group wanted to come up with a new visualisation that would help people in this space to think about citations on a timeline. They created a visualisation which connects the articles cited by a particular node, and the articles that in turn cited that node, arranged on a timeline. They used the JavaScript InfoVis Toolkit to create and demonstrate this visualisation in real time.

Watch the final report by Tim Brody, who summarises the outcomes of the group’s efforts in full in this presentation:

You can also watch this video on Vimeo.

Remote Participation

There was interest in the event from a number of people who were unable to attend in person. When Jimme Jardine realised he could not attend, he asked us to show a pre-prepared video to the group describing Qiqqa, his own project that was highly relevant to the discussions taking place at the event.

You can watch this video in full below:

You can also watch this video on YouTube.

Conclusions

The event was designed to directly feed into a JISC-funded project examining the life cycle of citation data by connecting the researchers directly to users and developers who can build things with citation data. The practical outcomes of the event as described in this report helped to provide insight into how citation data could be used and to identify some of the difficulties that exist with the current citation data infrastructure that prevent innovation.

Max Hammond summed up how useful the event proved for his project in this short video interview, in which he reflects on the differences between the needs of high level stakeholders and the developers on the ground who are looking to implement solutions based on citation data: