Guru Session: Text Analysis

Feb 17, 2011 by kirsty-pitkin

Emma Tonkin, whose main interests include formal metadata extraction and Dublin Core metadata extraction, led this interactive session, in which the group shared their perspectives and questions.

Tonkin explained how she extracts metadata by pulling data out of a paper and using Apache tools to identify and classify it. She observed that keywords have nothing to do with the subject but everything to do with the funding. There is therefore often a lot of information, but not necessarily information that would be useful to a user.

She moved on to discuss sentiment analysis, including using Twitter to identify, for example, the sentiment around the names of politicians. She noted that this is hard to do formally, but then human beings don’t do it very well either.
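
As a rough illustration of the idea, and of why it is hard to do formally, here is a minimal lexicon-based sketch in Python. It is not the method discussed in the session: the word lists, the sample messages and the names "Smith" and "Jones" are all invented for this example, and it reads no real Twitter data.

```python
import re
from collections import defaultdict

# Hypothetical, hand-picked word lists; a real lexicon would be far larger.
POSITIVE = {"good", "great", "honest", "love", "strong"}
NEGATIVE = {"bad", "awful", "liar", "hate", "weak"}


def score_message(text):
    """Return (positive word count - negative word count) for one message."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def sentiment_by_name(messages, names):
    """Sum the message scores for each name that appears in a message."""
    totals = defaultdict(int)
    for text in messages:
        lowered = text.lower()
        for name in names:
            if name.lower() in lowered:
                totals[name] += score_message(text)
    return dict(totals)


# Invented sample messages and placeholder names, for illustration only.
SAMPLE = [
    "Smith gave a great, honest speech",
    "Jones is a liar and a weak leader",
    "Oh great, Jones again",  # sarcasm: counted as positive, which is wrong
]
RESULT = sentiment_by_name(SAMPLE, ["Smith", "Jones"])
print(RESULT)
```

The third message shows the problem: simple word counting treats the sarcastic "great" as praise, so the total for "Jones" comes out less negative than a human reader would judge it.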

The group discussed OCR tools, whose success is limited by the quality of the source. One participant was dealing with 18th-century papers from the Royal Society and struggling with the OCR technologies available to extract the data from these papers. Tonkin was able to direct them to the IMPACT project, which is an EU-funded project involving OCR work.

There were also discussions about social tagging and socially shared cognition research, and the general “mish mash of stuff” out there.