As of June 2017 I’m a Research Assistant at the Leeds Institute of Medical Education, University of Leeds. My role involves carrying out text analytics of the feedback and comments relating to students’ work placements (myPAL@work), in order to develop innovative learning analytics approaches. I use methods from a variety of domains, including natural language processing and machine learning. I am also currently involved in a project that uses learning analytics to understand engagement in active video watching for soft skill learning.
A significant proportion of clinical data is stored as unstructured free-text reports, such as discharge summaries or radiology reports, which are difficult to process and analyse at scale. Text analytics methods such as document retrieval and information extraction can address this challenge. I conducted a three-month pilot study on using IBM Watson Content Analytics to identify relevant documents in large-scale collections of clinical reports (~6.5 million documents in total). My task was to retrieve documents containing positive instances of certain conditions (e.g. “mild hydronephrosis is noted” is a positive instance, whereas “no evidence of hydronephrosis” is a negative one). The custom rule-based models built with IBM Watson Content Analytics achieved very good results on this task.
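The Watson Content Analytics rules themselves are proprietary, but the core idea of the task, matching a condition while excluding negated mentions, can be illustrated with a minimal, hypothetical sketch in plain Python. The negation cue list and the example report sentences here are illustrative assumptions, not the actual rules or data; production clinical NLP systems use far richer trigger lists and scope handling.

```python
import re

# Hypothetical negation cues; real systems use much larger trigger
# lists and proper scope rules rather than a fixed character window.
NEGATION_CUES = re.compile(
    r"\b(no evidence of|no|without|negative for|free of)\b", re.IGNORECASE
)

def contains_positive_mention(text: str, condition: str) -> bool:
    """Return True if `condition` occurs and is not preceded by a negation cue."""
    for match in re.finditer(re.escape(condition), text, re.IGNORECASE):
        # Check a short window before the mention for a negation trigger.
        window = text[max(0, match.start() - 40):match.start()]
        if not NEGATION_CUES.search(window):
            return True
    return False

# Toy examples mirroring the positive/negative instances above.
reports = [
    "Mild hydronephrosis is noted in the left kidney.",  # positive instance
    "No evidence of hydronephrosis.",                    # negated mention
]
hits = [r for r in reports if contains_positive_mention(r, "hydronephrosis")]
```

Here `hits` retains only the first report, since the second mention falls inside the scope of a negation cue.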
I started my PhD at the School of Computing (University of Leeds) in November 2013, working within the Artificial Intelligence research theme. My primary supervisor is Dr Vania Dimitrova. I’m also advised by Dr Katja Markert and Dr Justin Washtell. My project is funded by EPSRC and 365media.
Predicting the popularity of textual content has been of wide interest to computational researchers in NLP and to Internet companies alike. Types of online content that have been the subject of prediction models include blog posts, social media brand pages, tweets, hashtags, and news articles. Predicting the popularity of news articles is the problem my PhD tackles.
Online news outlets are increasingly trying to engage more effectively with social media audiences. Being able to predict the popularity of a news article on social media would allow outlets to tailor their content to those audiences. There is currently little computational research into how specific aspects of a news article’s text influence its popularity on social media. Headlines play a particularly important role, as they are often the only part of the article a social media user sees, yet they have not previously been analysed separately for this task.
This project takes an interdisciplinary approach, combining state-of-the-art computational methods with insights from the journalism community. The overall aim is to identify aspects of headlines which have the greatest impact on social media popularity, using large datasets of news and social media data. The focus is placed on the newsworthiness, style, genre and topic of headlines, and how the impact of these varies depending on different demographics of social media users (e.g. region).
A key part of this project is introducing and evaluating new ways of approximating the online prominence of entities and concepts. This will have practical applications in other NLP tasks and can be integrated with existing tools that link entities and concepts to ontologies (e.g. DBpedia).
The outcomes of this project will enable journalists and editors to create headlines which more effectively target social media audiences. While this project focuses on newspaper headlines, it will be possible to extend the prediction model to work with titles of other types of online content, such as videos or blog posts. The model will benefit any organisation that wants to more effectively promote their content on social media.
The detailed analysis of various linguistic and stylistic aspects of headlines will also make it possible to create a training tool for journalists, helping them to write more effective headlines.
This project combines techniques from several domains: natural language processing, ontologies, and machine learning. The datasets consist of large numbers of online newspaper headlines and their social media mentions (tweets, shares). Headlines are annotated using various NLP methods (e.g. syntactic parsing, topic classification, sentiment annotation), covering four aspects: newsworthiness, style, topic, and genre. The newsworthiness aspects, long present in the journalism studies literature, are operationalised and investigated for the first time in a large-scale computational model. A novel contribution of this project is the development of new methods for calculating the online prominence of entities and concepts, using ontologies and time series analysis of large datasets. Demographic-specific models of social media popularity require novel methods for aggregating and classifying social media users, such as geolocating users or clustering them by language based on the metadata and textual content of their social media streams.
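To make the prominence idea concrete, one very simple proxy is a per-day mention count for each entity across a timestamped headline stream, yielding a time series per entity. The sketch below is a minimal illustration under assumed data, not the project’s actual method: the headlines, dates, and entity labels are hypothetical, and in practice an entity-linking tool (e.g. one targeting DBpedia) would supply the entity mentions.

```python
from collections import Counter
from datetime import date

def daily_prominence(stream):
    """Count entity mentions per day: a crude time-series proxy for prominence.

    `stream` is an iterable of (date, [entity, ...]) pairs, one per headline.
    Returns a dict mapping each date to a Counter of entity mention counts.
    """
    series = {}
    for day, entities in stream:
        series.setdefault(day, Counter()).update(entities)
    return series

# Hypothetical timestamped headlines with pre-linked entity mentions.
headlines = [
    (date(2016, 1, 1), ["Brexit"]),
    (date(2016, 1, 1), ["Brexit", "NASA"]),
    (date(2016, 1, 2), ["NASA"]),
]
series = daily_prominence(headlines)
```

From such a series one could then compute trends, spikes, or smoothed averages as features for a popularity model; those steps are omitted here.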