I work on two research themes: software engineering and data mining. I am particularly interested in how techniques from these two areas could benefit and complement each other.

In the software engineering area, my research interests include software specification mining/protocol inference, mining software repositories, program analysis, software testing, and automated debugging. In general, I'm interested in how software can be better developed, maintained, tested, and debugged through an analysis of the wealth of software data currently available. Technique-wise, I'm particularly interested in using a composition of techniques including static analysis, dynamic analysis, data mining, information retrieval, and natural language processing. For a couple of papers illustrating how these techniques could work together, please refer to my publication page.

In the data mining area, my research interests include frequent pattern mining, discriminative pattern mining, and social network mining. I'm particularly interested in datasets expressed as a bag/multi-set of sequences and graphs.

My work has been published in top/major conferences and journals in software engineering, programming languages, data mining, and text mining, including: ICSE, FSE, ASE, ISSTA, ICSM, PLDI, KDD, VLDB, ICDE, ACL, TKDE, etc.

Software Engineering

I analyze and mine software engineering artifacts in various forms: code, execution traces, text, and collaboration networks. The following are some of the research threads that I'm currently working on.

Specification Mining/Protocol Inference. To reduce software maintenance cost and detect bugs, machine learning and data mining techniques have been employed to infer or reverse engineer high-level specifications from existing programs, either statically (i.e., from code) or dynamically (i.e., from execution traces). This is termed specification mining and is a recent, promising topic in software engineering. The mined specifications can be used for understanding legacy systems, reducing software maintenance cost, re-engineering legacy systems, improving regression tests, aiding program verification, and detecting bugs.
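
To make the dynamic case concrete, here is a minimal Python sketch of mining simple two-event temporal rules of the form "whenever a occurs, b eventually occurs later in the same trace". It is an illustration only, not the algorithm from any of the papers listed on this page. The function name, the trace format (lists of event names), and the support and confidence definitions are all assumptions made for this example.

```python
from itertools import permutations


def mine_followed_by_rules(traces, min_support=2, min_confidence=1.0):
    """Mine rules "a is always eventually followed by b" from traces.

    Hypothetical helper for illustration. A trace satisfies (a, b) when it
    contains a and every occurrence of a has some b after it, which holds
    exactly when the last a is followed by a b.
    Returns a list of (a, b, support, confidence) tuples, where support is
    the number of satisfying traces and confidence is that number divided
    by the number of traces containing a.
    """
    events = sorted({event for trace in traces for event in trace})
    rules = []
    for a, b in permutations(events, 2):
        containing = 0   # traces in which the premise a occurs
        satisfying = 0   # traces in which every a is later followed by b
        for trace in traces:
            if a not in trace:
                continue
            containing += 1
            last_a = len(trace) - 1 - trace[::-1].index(a)
            if b in trace[last_a + 1:]:
                satisfying += 1
        if containing == 0 or satisfying < min_support:
            continue
        confidence = satisfying / containing
        if confidence >= min_confidence:
            rules.append((a, b, satisfying, confidence))
    return rules


# Toy execution traces of a file API (made-up data).
traces = [
    ["open", "read", "close"],
    ["open", "read", "read", "close"],
    ["open", "close"],
    ["open", "write", "close"],
]
strict_rules = mine_followed_by_rules(traces, min_support=3)
loose_rules = mine_followed_by_rules(traces, min_support=2)
```

On this toy input, the stricter threshold keeps only the rule that open is eventually followed by close, while the looser one also admits the rule that read is eventually followed by close. Real miners must additionally cope with long traces, longer rule premises, and redundant rules, which this sketch ignores.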

My current focus is on extending the frontiers of research in specification mining and applying the mining techniques to solve software maintenance and dependability issues. So far, four different families of specification mining techniques have been developed (details in the publication page):

  1. Mining Finite-State Machines (FSM): [WCRE'06], [FSE'06], [FSE'09]

  2. Mining Repetitive Patterns: [KDD'07], [SDM'08], [ICDE'09]

  3. Mining Temporal Properties and Rules: [DASFAA'08], [JSME'08], [WODA'08], [ICDE'11]

  4. Mining Live Sequence Charts (LSC) and Message Sequence Charts (MSC): [ASE'07], [ASE'08], [PASTE'08], [ASE'09], [ASE'10], [ICECCS'11], [ICSE'11]

Over the next few years, I hope to develop more powerful specification mining techniques that address the limitations of existing ones, and to conduct further case studies applying these techniques in the open source community and in industry.

Code Search. I'm interested in building a "Google" for code. This search engine needs to take into account the complex structural and semantic relations in code and its affiliated artifacts (comments, requirement documents, etc.).

  1. Searching via a Query Language: [ASE'10], [WCRE'11]

  2. Searching using Free-Form Text: [WCRE'11]

  3. Anomaly Detection: [ASE'10]

Debugging and Testing. Testing and debugging are time-consuming activities, yet they are essential to ensuring the quality of software systems. I'm interested in the following topics in debugging and testing:

  1. Localizing Faults Given a Set of Failures: [ISSTA'09], [ICSM'10], [ASE'11]

  2. Finding Anomalies in Code: [ASE'10]

  3. Privacy-preserving Software Testing: [PLDI'11]
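
As an illustration of the first topic, the sketch below ranks program statements by suspiciousness using the Ochiai formula, a well-known spectrum-based fault localization metric. It is not the technique proposed in the cited papers. The function names and the input format (one set of covered statement ids per test, plus a pass/fail flag per test) are assumptions made for this example.

```python
import math


def ochiai_scores(coverage, outcomes):
    """Compute an Ochiai suspiciousness score for each covered statement.

    coverage: list of sets, the statement ids executed by each test.
    outcomes: list of bools aligned with coverage, True if the test passed.
    Score of s = failed(s) / sqrt(total_failed * (failed(s) + passed(s))).
    """
    total_failed = sum(1 for ok in outcomes if not ok)
    statements = set().union(*coverage)
    scores = {}
    for s in statements:
        failed = sum(1 for cov, ok in zip(coverage, outcomes)
                     if s in cov and not ok)
        passed = sum(1 for cov, ok in zip(coverage, outcomes)
                     if s in cov and ok)
        denom = math.sqrt(total_failed * (failed + passed))
        scores[s] = failed / denom if denom > 0 else 0.0
    return scores


def rank_statements(coverage, outcomes):
    """Return statement ids, most suspicious first (ties broken by id)."""
    scores = ochiai_scores(coverage, outcomes)
    return sorted(scores, key=lambda s: (-scores[s], s))


# Made-up coverage data: three tests over statements 1..4.
coverage = [{1, 2, 3}, {1, 2, 4}, {1, 3, 4}]
outcomes = [True, False, False]   # the second and third tests fail
scores = ochiai_scores(coverage, outcomes)
ranking = rank_statements(coverage, outcomes)
```

Here statement 4 is executed only by the failing tests, so it is ranked first. A developer would inspect statements in ranking order until the fault is found.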

Analyzing Software Text. Software artifacts are not only code but also include many other documents expressed in natural language. I'm interested in analyzing this wealth of textual data to aid software developers in performing their various tasks. My work includes the following:

  1. Mining Bug Reports: [ICSE'11], [ASE'11]

  2. Mining Software Forums: [ASE'11]

Analyzing Developers' Collaboration Network. Aside from software artifacts, an important part of any software engineering effort is the developers themselves. I'm also interested in analyzing developers' collaboration networks and utilizing the information embedded in these networks to aid software developers and managers. My work includes the following:

  1. Analyzing Developers' Collaboration Pattern: [WCRE'10]

  2. Developer Recommendation: [WCRE'11]

Data Mining

I analyze sequences and graphs and am mainly interested in mining frequent, significant, and discriminative patterns and rules. I'm also interested in social network mining and text mining. My work includes the following:

  1. Mining Sequence Database

    1. Compact Patterns and Rules: [SDM'08], [Inf. Syst'09]

    2. Repetitive Patterns: [KDD'07], [ICDE'09]

    3. Repetitive Rules: [DASFAA'08], [ICDE'09], [TKDE'11]

    4. Dyadic Patterns: [EDBT'11]

  2. Mining Graphs: [ISSTA'09], [WCRE'10], [CIKM'10], [CIKM'11], [VLDB'11]

  3. Mining Social & Collaboration Networks: [PAKDD'10], [CIKM'10], [WCRE'10], [CIKM'11], [WCRE'11]

  4. Text Mining: [ACL'09]


Aside from colleagues at SMU, this research has greatly benefited from collaborations with members of the School of Computing, NUS; Microsoft Research (Redmond and India); the Faculty of Mathematics and Computer Science, Weizmann Institute of Science, Israel; the Database and Information System Laboratory, University of Illinois at Urbana-Champaign; the Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong; the School of Computer Engineering, NTU; the Laboratory of Test and Analysis, University of Milano-Bicocca; the TOPPS group, DIKU, Denmark; and the Institute of Software, Peking University, China.

Software Releases

David Lo, Hong Cheng, Jiawei Han, Siau-Cheng Khoo, Chengnian Sun: Classification of software behaviors for failure detection: a discriminative pattern mining approach. KDD 2009. [Implementation and Dataset]

Chengnian Sun, David Lo, Siau-Cheng Khoo, and Jing Jiang: Towards more accurate retrieval of duplicate bug reports. ASE 2011. [Implementation and Dataset]