I am generally interested in applying natural language processing and statistical machine learning techniques to analyze textual data, particularly user-generated content from social media, to assist knowledge discovery and prediction tasks. Below are summaries of the major lines of my work.
Twitter has become one of the most popular platforms for people to socialize and obtain information. It produces a huge amount of text containing useful information, such as the public's reactions to major events. Mining textual data from Twitter also faces many new challenges because the text is informal, noisy, and constantly changing. To analyze Twitter content, I started by studying the differences between Twitter and traditional media in terms of topic coverage. I then studied the extraction of trendy topics from Twitter. More recently, my focus has been temporal topic modeling on Twitter using principled probabilistic latent variable models.
In [ECIR'11] we used topic modeling to conduct a systematic comparison of the topic distributions and other characteristics of Twitter and the New York Times. The differences we discovered suggest that Twitter search should have a different focus than traditional Web search; e.g., Twitter could be a good information source for topics such as celebrities. Subsequently, in [ACL'11], we studied how to identify keyphrases that indicate trendy topics on Twitter. While keyphrase extraction has been well studied for traditional text collections such as news articles, the special properties of Twitter pose many new challenges. We considered topic-based contexts and retweets to improve the quality of the extracted keyphrases.
In [ACL'12] we studied the problem of detecting bursty topics and identifying events on Twitter. Although several topic models for text streams had been proposed before, none had been applied to microblog data, which is contributed by many users, each with his or her own personal interests. We found these existing models unsuitable for bursty topic detection on Twitter and proposed our own probabilistic topic model, which captures both individual users' personal interests and global topic trends. Following this work, in [EMNLP'13] and [SDM'14], we used non-parametric models to cluster tweets into bursty events while simultaneously modeling long-standing topics and users' personal interests. The model in [EMNLP'13] also allows us to reveal the relation between long-standing topics and short-term events, which can help categorize events into different topics.
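The models in these papers are fully probabilistic, but the underlying intuition for "burstiness" can be illustrated without them: a term is bursty when its frequency in the current time window far exceeds its long-run background rate. The sketch below is a simplified frequency-ratio heuristic for illustration only (the threshold names and values are my own assumptions), not the models proposed in the papers.

```python
from collections import Counter

def bursty_terms(current_tweets, background_tweets, min_count=5, ratio=3.0):
    """Flag terms whose relative frequency in the current window far exceeds
    their background rate -- a crude stand-in for model-based burst detection."""
    cur = Counter(w for t in current_tweets for w in t.lower().split())
    bg = Counter(w for t in background_tweets for w in t.lower().split())
    cur_total = sum(cur.values()) or 1
    bg_total = sum(bg.values()) or 1
    bursty = []
    for term, count in cur.items():
        if count < min_count:
            continue  # ignore rare terms; their ratios are unreliable
        cur_rate = count / cur_total
        # Add-one smoothing so terms unseen in the background don't divide by zero.
        bg_rate = (bg.get(term, 0) + 1) / (bg_total + len(bg))
        if cur_rate / bg_rate >= ratio:
            bursty.append(term)
    return bursty
```

A real system would additionally model each user's personal topic distribution, as the [ACL'12] model does, so that one prolific user's pet topic is not mistaken for a global burst.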
We are also interested in understanding the relation between Twitter's textual content and Twitter users' behaviors and attributes. In [SDM'13], we proposed a model that considers not only the textual content of tweets but also users' reply and retweet behaviors. Our goal was to characterize users based on both their topical interests and their behavior patterns. In [AAAI'14], we developed a joint model of tweets and users' age information, which allows us to discover interesting age-specific topics.
Online forums provide a platform for people to exchange ideas and often debate controversial topics. An important observation is that user interactions, as reflected in their textual exchanges in forum threads, can often help us better understand the different viewpoints held by different users. In [NAACL'13a] and [CIKM'13], we used sentiment analysis techniques to identify the polarity of user relations as implied by their interaction expressions. We further developed probabilistic models that consider both content topics and user interactions in order to identify different viewpoints and user groups. In [NAACL'13b], we used collaborative filtering techniques to predict relations between users who have no direct interactions.
Opinion mining from social media has mainly focused on product reviews. Little work has been done to extract, analyze, and summarize opinions on sociopolitical issues from platforms such as online forums and the comment sections of news articles. In [COLING'12a] and [SocialCom'12], we looked at how to identify high-quality online comments. This work is motivated by the observation that not all online comments can help policy makers better understand and address a social issue; thoughtful comments are usually more useful. Based on previous findings from computational linguistics, we defined several linguistic features to characterize comments and applied supervised learning to predict a comment's quality. Even after filtering out low-quality comments, we would still like to compress and summarize the high-quality ones. In [COLING'12b], we took a first step toward extracting actionable comments, which contain suggestions that can be acted upon. We formulated this as an information extraction problem and applied a standard sequence labeling method from NLP to solve it. We further applied clustering to normalize the extracted entity-action pairs.
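To make the feature-based approach concrete, the sketch below computes a few shallow linguistic features of the kind commonly used in comment-quality work (length, sentence complexity, lexical diversity). The specific features and their names here are illustrative assumptions; the actual feature set is defined in the papers.

```python
import re

def comment_features(text):
    """Shallow linguistic features for gauging comment quality.
    Illustrative only; a classifier would be trained on labeled comments."""
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_tok = len(tokens) or 1
    words = text.split()
    return {
        "length": len(tokens),                                  # longer comments tend to be more substantive
        "avg_sentence_len": len(tokens) / (len(sentences) or 1),
        "type_token_ratio": len(set(tokens)) / n_tok,           # lexical diversity
        "all_caps_ratio": sum(w.isupper() for w in words) / (len(words) or 1),  # shouting is a negative signal
    }
```

Feature dictionaries like this one would then be fed to any standard supervised learner to predict a quality label.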
To discover knowledge from unstructured text, an important first step is to extract information snippets, particularly entities and their relations. For example, the sentence "Facebook was launched by Mark Zuckerberg" can be converted into the binary relation founder_of(mark_zuckerberg, facebook) and subsequently stored in a relational database for further inference and mining. Relation extraction is therefore a fundamental problem in information extraction. In [NAACL'07], we proposed a unified graphical representation of the feature space for relation extraction, which enabled a systematic exploration and comparison of different feature configurations to optimize extraction performance. Besides feature engineering, another bottleneck in supervised approaches to information extraction is the need for sufficient labeled training data drawn from the same domain as the test data. In real applications, such labeled data is often unavailable, either because the available labeled data comes from a different domain or because obtaining labeled data is too expensive altogether. For example, an entity recognizer trained on one domain often performs worse on a new domain because of domain differences. In [NAACL'06], we designed a domain-aware feature ranking and selection strategy to address this problem. Our method automatically ranks features by their domain generalizability and can substantially outperform the standard method that ignores domain differences. Later, in [ACL'09], I studied a similar setting for relation extraction, where only a few labeled instances are available for a target relation type but plenty of training data can be borrowed from other relation types. I applied a feature selection-based domain adaptation framework I had developed earlier and showed that the method can successfully transfer common syntactic patterns across relation types, resulting in substantially improved performance.
Continuing this effort, we further studied how to extract specific relation descriptors given training data for general relation types [IJCNLP'11], and how to induce information extraction templates without any labeled data [EMNLP'11].
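The founder_of example above can be made concrete with a toy pattern-based extractor. This is a deliberately simplistic sketch: the pattern, function name, and normalization scheme are my own illustrative assumptions, whereas the systems described above learn such extraction patterns from (possibly out-of-domain) labeled data rather than hard-coding them.

```python
import re

# One hand-written surface pattern for the founder_of relation.
# Real extractors generalize over syntactic features instead of literal strings.
PATTERN = re.compile(
    r"(?P<org>[A-Z]\w+) was (?:founded|launched|started) by "
    r"(?P<person>[A-Z]\w+(?: [A-Z]\w+)*)"
)

def extract_founder_of(sentence):
    """Return a (relation, arg1, arg2) triple if the pattern matches, else None."""
    m = PATTERN.search(sentence)
    if m is None:
        return None
    # Normalize surface strings into database-friendly identifiers.
    norm = lambda s: s.lower().replace(" ", "_")
    return ("founder_of", norm(m.group("person")), norm(m.group("org")))
```

The resulting triples could then be loaded into a relational table for the downstream inference and mining mentioned above.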