Our Software Analytics Research (SOAR) group primarily works at the intersection of software engineering, cybersecurity, and data science, encompassing socio-technical aspects and the analysis of different kinds of software artefacts (e.g., code, execution traces, bug reports, Q&A posts, and developer networks) and the interplay among them. We are particularly interested in transforming passive software engineering data into automated tools that can improve system reliability, security, and performance, increase developer productivity, and generate new insights for decision makers. We are also interested in connecting with practitioners to distil insights and discover pain points that can help direct future research effort.
Our group loves to collaborate. Aside from colleagues in SMU, our group has collaborated with Microsoft Research (Redmond and India), Adobe (USA), SAP (Germany), University of Illinois at Urbana-Champaign (USA), CMU (USA), NUS, NTU, Zhejiang University (China), Peking University (China), Chinese University of Hong Kong (China), IIITD (India), Weizmann Institute of Science (Israel), Tel Aviv University (Israel), University of Milano-Bicocca (Italy), DIKU (Denmark), Inria (France), Monash University (Australia), Australian National University (Australia), Stellenbosch University (South Africa), and many more.
Our work has been published in top/major conferences and journals in the areas of software engineering (ICSE, FSE, ASE, ISSTA, ICSME, PLDI, TSE, TOSEM), artificial intelligence and data science (IJCAI, AAAI, KDD, VLDB, ICDE, ACL), and cybersecurity (ESORICS, TIFS).
The following describes a few areas of software analytics work that we have pursued:
Bug and Vulnerability Management
We examine the entire process by which developers manage bugs and vulnerabilities, and how data-driven approaches can help. Our work in this area includes:
We have designed a comprehensive array of solutions that can identify vulnerabilities and repair them. They include solutions that identify vulnerable third-party libraries by analyzing National Vulnerability Database entries and commit logs, and solutions that learn from version history to automatically patch vulnerabilities. Examples of these studies include: (Chen et al. 2020a), (Chen et al. 2020b), (Ma et al. 2017), (Ma et al. 2016).
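To give a flavour of the simplest form such matching can take, the sketch below flags project dependencies mentioned in a vulnerability description. The CVE text, dependency names, and function name are illustrative only; real matchers use CPE identifiers, version ranges, and commit analysis rather than plain substring matching.

```python
def flag_vulnerable_libraries(cve_description, dependencies):
    """Naively flag dependencies whose names appear in a CVE description.

    This is a toy sketch: production tools match structured CPE entries
    and affected version ranges, not raw text.
    """
    text = cve_description.lower()
    return sorted(dep for dep in dependencies if dep.lower() in text)

# Illustrative (paraphrased) vulnerability description and dependency list.
cve = ("A deserialization flaw in Apache Commons Collections allows "
       "remote code execution via crafted serialized objects.")
deps = ["commons collections", "log4j", "spring-core"]
print(flag_vulnerable_libraries(cve, deps))
```

Even this naive matcher illustrates why the problem is hard: library names are ambiguous in free text, which is precisely what motivates learning-based approaches.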
We have designed a comprehensive array of novel automated solutions to help developers manage large numbers of bug reports. The solutions include techniques to: identify duplicate bug reports, prioritize bug reports, assign bug reports to developers, locate buggy files given a bug report, and many more. Examples of these studies include: (Zhou et al. 2012), (Sun et al. 2010), (Hoang et al. 2019).
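As a minimal illustration of duplicate bug report detection, the sketch below scores textual similarity between a new report and existing ones using cosine similarity over bag-of-words counts. The reports, threshold, and function names are made up for illustration; real duplicate detectors use much richer features (stack traces, structured fields, learned embeddings).

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two texts over bag-of-words token counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def find_duplicates(new_report, existing, threshold=0.4):
    """Return indices of existing reports similar enough to be duplicates."""
    return [i for i, r in enumerate(existing)
            if cosine_similarity(new_report, r) >= threshold]

reports = ["app crashes on startup after update",
           "crash on startup following latest update",
           "button colour is wrong on settings page"]
print(find_duplicates("application crashes when starting after the update", reports))
```

Note how lexical variation ("crash" vs "crashes") already defeats plain token matching, which is one reason this remains an active research problem.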
We have proposed a set of automated debugging solutions that target various settings. These include solutions that: identify likely defective files/commits based on version history data (aka. defect prediction), identify likely buggy program elements given test case failures (aka. spectrum-based fault localization), construct likely fixes given test case failures (aka. automatic program repair), and many more. Examples of these studies include: (Lo et al. 2009), (Le et al. 2016), (Xia et al. 2016).
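For instance, spectrum-based fault localization ranks program elements by how strongly their coverage correlates with failing tests. The sketch below applies the well-known Ochiai suspiciousness formula to toy coverage data; the statement names and counts are invented for illustration and are not taken from any of the cited studies.

```python
import math

def ochiai(ef, ep, total_failed):
    """Ochiai suspiciousness: ef / sqrt(total_failed * (ef + ep)),
    where ef/ep count failing/passing tests that cover the element."""
    denom = math.sqrt(total_failed * (ef + ep))
    return ef / denom if denom else 0.0

# coverage[stmt] = (covered by #failing tests, covered by #passing tests)
coverage = {"line_10": (2, 0), "line_11": (2, 3), "line_12": (0, 3)}
total_failed = 2

scores = {s: ochiai(ef, ep, total_failed) for s, (ef, ep) in coverage.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # most suspicious statement first
```

Here `line_10` ranks first because it is covered only by failing tests, which is exactly the intuition behind the spectrum-based family of techniques.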
Code and Documentation Management
Given a large code base, a large set of code repositories, or a large set of libraries, it is often hard to find code snippets, methods or libraries of interest, given a particular need. Additionally, due to the fast pace of software development, documentation is often unavailable or outdated. Our work has addressed these pain points in the following ways:
We have designed a number of code search and recommendation engines. These include solutions that: identify code snippets that match certain structural constraints (aka. structured code search), identify code snippets that match a natural language query (aka. text-based code search), identify APIs that should be used given a particular need expressed in natural language (aka. API recommendation), and many more. Examples of our prior studies include: (Sirres et al. 2018), (Huang et al. 2018), (Rahman et al. 2019).
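A minimal flavour of text-based code search: rank snippets by lexical overlap between the query and the code. The corpus, snippet names, and tokenizer below are illustrative; our actual engines use structural information and far more sophisticated retrieval models.

```python
import re

def tokenize(text):
    """Split text/code into lowercase word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

# A toy corpus of snippets (names and bodies are illustrative).
snippets = {
    "read_file": "def read_file(path): return open(path).read()",
    "write_file": "def write_file(path, data): open(path, 'w').write(data)",
    "parse_json": "import json\ndef parse_json(text): return json.loads(text)",
}

def search(query, corpus, top_k=2):
    """Rank snippet names by lexical similarity to a natural language query."""
    q = tokenize(query)
    scored = {name: jaccard(q, tokenize(code)) for name, code in corpus.items()}
    return sorted(scored, key=scored.get, reverse=True)[:top_k]

print(search("read file contents", snippets))
```

The gap between what this retrieves and what a developer actually means by a query is precisely what structured and semantics-aware code search tries to close.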
We have designed a number of specification mining engines that infer specifications of how a piece of code or library works by analysing execution traces of its clients. These include solutions that infer specifications in the form of: finite state models, temporal logic constraints, modal sequence diagrams, and many more. The specifications have been used for various purposes, ranging from program comprehension to the construction of fine-grained Android sandboxes that prevent malicious behaviours. Examples of our prior studies include: (Lo and Khoo 2006), (Lo et al. 2007), (Le and Lo 2018).
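To illustrate the core idea of trace-based mining in its simplest form, the sketch below infers the set of observed event-to-event transitions from API-call traces (a one-history automaton, far cruder than the finite state models in the cited work) and then checks whether a new trace conforms. The traces are invented for illustration.

```python
def mine_fsa(traces):
    """Mine allowed (previous_event, event) transitions from call traces.
    Each state is simply the previous event -- a deliberately crude model."""
    transitions = set()
    for trace in traces:
        prev = "START"
        for event in trace:
            transitions.add((prev, event))
            prev = event
        transitions.add((prev, "END"))
    return transitions

def conforms(trace, transitions):
    """Check whether every step of a trace uses a mined transition."""
    prev = "START"
    for event in trace:
        if (prev, event) not in transitions:
            return False
        prev = event
    return (prev, "END") in transitions

traces = [["open", "read", "close"], ["open", "write", "close"]]
spec = mine_fsa(traces)
print(conforms(["open", "close"], spec))  # unseen transition -> violation
```

A mined model like this can already flag anomalous client behaviour (e.g., closing a resource that was never opened), which is the same intuition that powers sandboxing with mined specifications.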
We have designed a number of documentation search and generation engines. The documentation includes textual documents and rich media that can help developers in their tasks. Our prior work has produced solutions that: answer natural language queries from software documentation (aka. question and answer bots), output tags for software question and answer posts (aka. tag recommendation), produce natural language comments for code artefacts (aka. comment generation), support serendipitous information discovery on Twitter (aka. software tweet analytics), generate workflows for programming videos (aka. interactive video workflow generation), and many more. Examples of our prior studies include: (Xu et al. 2017), (Liu et al. 2018), (Sharma et al. 2018).
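As one concrete example, tag recommendation can be sketched as scoring candidate tags by how much a new post's wording overlaps with previously tagged posts. The posts, tags, and scoring rule below are illustrative; real recommenders use learned multi-label classifiers rather than raw word overlap.

```python
def recommend_tags(post, tagged_posts, top_k=2):
    """Score each candidate tag by word overlap between the new post and
    the combined vocabulary of past posts carrying that tag."""
    post_tokens = set(post.lower().split())
    vocab = {}
    for text, tags in tagged_posts:
        for tag in tags:
            vocab.setdefault(tag, set()).update(text.lower().split())
    scores = {tag: len(post_tokens & words) for tag, words in vocab.items()}
    # Break ties alphabetically for a deterministic ranking.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]

# Toy history of tagged Q&A posts (illustrative).
tagged_posts = [
    ("how to sort a list in python", ["python", "sorting"]),
    ("java stream sort example", ["java", "sorting"]),
    ("python dict iteration", ["python"]),
]
print(recommend_tags("sort a python list quickly", tagged_posts))
```
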
We are also interested in bridging the gap between research and practice through empirical studies. They are important to ensure that the technologies we design are relevant to practitioners, address their pain points, and are not evaluated in a biased way. Our work in this area includes:
We have mined software repositories to distil insights about how developers perform certain activities and identify inefficiencies, pain points, and directions for future research. These include studies on: bugs and issues affecting various kinds of systems, activities on and growth of GitHub, factors influencing open source and industrial project quality, developer turnover and retention, and many more. Examples of these studies include: (Li et al. 2017), (Thung et al. 2013), (Bao et al. 2019).
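To give a concrete sense of one such analysis, the sketch below computes a crude turnover proxy from commit logs: developers whose last commit is older than a chosen inactivity window. The commit records, window, and function name are invented for illustration; actual turnover studies use much more careful definitions.

```python
from datetime import date, timedelta

def inactive_developers(commits, as_of, window_days=180):
    """Flag developers whose latest commit predates an inactivity window.

    commits: iterable of (author, commit_date) pairs.
    """
    last_seen = {}
    for author, day in commits:
        if author not in last_seen or day > last_seen[author]:
            last_seen[author] = day
    cutoff = as_of - timedelta(days=window_days)
    return sorted(a for a, d in last_seen.items() if d < cutoff)

# Toy commit log (illustrative data).
commits = [("alice", date(2019, 1, 10)), ("bob", date(2018, 3, 2)),
           ("alice", date(2019, 6, 1)), ("carol", date(2019, 5, 20))]
print(inactive_developers(commits, as_of=date(2019, 7, 1)))
```

Even this toy metric shows how repository data can surface retention signals; the studies cited above refine such signals and relate them to project outcomes.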
We have also analysed bias and limitations that affect existing automated tools. These include studies that analyse: bias in bug localization data, overfitting in program repair, bias in evaluation of defect prediction studies, and many more. Examples of these studies include: (Kochhar et al. 2014), (Le et al. 2018), (Huang et al. 2017).