I started my PhD at the School of Computing (University of Leeds) in November 2013, working within the Artificial Intelligence research theme. My primary supervisor is Dr Vania Dimitrova. I’m also advised by Dr Katja Markert and Dr Justin Washtell. My project is funded by EPSRC and 365media.
Predicting the popularity of textual content is of wide interest to NLP researchers and Internet companies alike. Types of online content that have been the subject of prediction models include blog posts, social media brand pages, tweets, hashtags, and news articles. My project tackles the last of these: predicting the popularity of news articles.
Online news outlets are increasingly trying to engage more effectively with social media audiences. Being able to predict the popularity of news articles on social media would allow news outlets to tailor their content to those audiences. There is currently little computational research into how specific aspects of news article text influence its popularity on social media. Headlines in particular play an important role, as they are often the only part of a news article that a social media user sees, yet until now they have not been analysed separately for this task.
This project takes an interdisciplinary approach, combining state-of-the-art computational methods with insights from the journalism community. The overall aim is to identify aspects of headlines which have the greatest impact on social media popularity, using large datasets of news and social media data. The focus is placed on the newsworthiness, style, genre and topic of headlines, and how the impact of these varies depending on different demographics of social media users (e.g. region).
A key part of this project is introducing and evaluating new ways of approximating the online prominence of entities and concepts. These methods will have practical applications in other NLP tasks and can be combined with existing tools that link entities and concepts to ontologies (e.g. DBpedia).
The outcomes of this project will enable journalists and editors to create headlines that target social media audiences more effectively. While this project focuses on newspaper headlines, it will be possible to extend the prediction model to work with titles of other types of online content, such as videos or blog posts. The model will benefit any organisation that wants to promote its content more effectively on social media.
The detailed analysis of various linguistic and stylistic aspects of headlines will also make it possible to create a training tool for journalists, helping them to write better, more effective headlines.
This project combines techniques from several domains: natural language processing, ontologies, and machine learning. The datasets consist of large numbers of online newspaper headlines and their social media mentions (tweets, shares). Headlines are annotated using various NLP methods (e.g. syntactic parsing, topic classification, sentiment annotation), covering four aspects: newsworthiness, style, topic, and genre. The newsworthiness aspects, long present in the journalism studies literature, are operationalised and investigated for the first time in a large-scale computational model. A novel contribution of this project is the development of new methods for calculating the online prominence of entities and concepts, using ontologies and time series analysis of large datasets. Demographic-specific models of social media popularity require novel methods for aggregating and classifying social media users, such as geolocating users or classifying them into language clusters based on the metadata and textual content of their social media streams.
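To make the idea of time-series-based prominence more concrete, here is a minimal illustrative sketch. It is not the method developed in this project: it simply assumes that an entity's prominence on a given day can be approximated by an exponentially decayed sum of its past daily mention counts. The function name `prominence`, the `half_life_days` parameter, and the example counts are all hypothetical.

```python
from datetime import date


def prominence(daily_mentions, as_of, half_life_days=7.0):
    """Approximate an entity's prominence on `as_of`.

    `daily_mentions` maps a date to the number of times the entity was
    mentioned on that date. Each count is down-weighted by half for every
    `half_life_days` that separate it from `as_of` (an assumed decay rule).
    """
    score = 0.0
    for day, count in daily_mentions.items():
        age_in_days = (as_of - day).days
        if age_in_days < 0:
            continue  # skip mentions dated after the day of interest
        score += count * 0.5 ** (age_in_days / half_life_days)
    return score


# Hypothetical mention counts for a single entity.
mentions = {
    date(2014, 1, 1): 8,  # one half-life before the headline: weight 0.5
    date(2014, 1, 8): 4,  # same day as the headline: weight 1.0
}
print(prominence(mentions, as_of=date(2014, 1, 8)))  # 8.0
```

A score of this kind could in principle be computed for each entity linked in a headline and used as one feature among many; the choice of decay rule and half-life here is arbitrary and would need to be evaluated against real data.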