Research Areas

Research

SIS Research Area - Data Management & Analytics

Research Theme
Constraint-based Frequent Pattern Mining

Central Concerns and Questions

The objective of the research is to develop effective and efficient algorithms to discover frequent patterns with user-defined constraints from large-scale data sets. The challenge is to push these constraints deep into the mining process such that patterns of interest could be mined directly, which is critical to handle today's large-scale data. The research tasks and questions include

To understand the combinatorial structure of the target frequent pattern set defined by the user-specified constraints.
To develop efficient algorithms to identify the pattern set of interest with minimal examination of irrelevant ones.
To develop efficient algorithms to compute a pattern set which is a good approximation to the target set in cases when the target set itself is prohibitively large.
To understand the intrinsic computational complexity of the problem of constraint-based pattern mining.

Emerging Ideas and Initiatives

In many real-life applications, structural patterns which are not just frequent but are of certain constraints are of keen interest to users. However, it used to be the case that these patterns can only be identified in a post-processing stage after all the frequent patterns are mined, which is far from efficient or even infeasible in many cases. What is in urgent need to handle today's huge data are algorithms that can discover patterns of interest directly. We started with the problem of finding patterns of large sizes. Large patterns in item-sets, sequences and graphs are of critical importance for many bioinformatics, system engineering, information network and web applications. We then worked out efficient algorithms for mining patterns of significant sizes from large data sets for item-sets, sequences and graph settings. The work on item-sets proposed a novel core-pattern fusion framework which is based on the discovered robustness of the large patterns and used a randomized fusion model to efficiently discover colossal patterns by leaping in the pattern lattice. The idea of leaping has later been developed to the settings of sequences and graphs. We have designed efficient algorithms to locate approximate sequential patterns of large sizes from long gene sequences. In the more difficult case of graphs, we proposed a general constraint-pushing framework for graph pattern mining offering a comprehensive categorization of graph constraints and leveraging both pattern and data space pruning in one unified.

Selected Publications

[1] Feida Zhu, Xifeng Yan, Philip S. Yu, Jiawei Han.Mining Approximate Sequential Patterns. Next Generation Data Mining, H. Kargupta, et al., (eds.) Chapman & Hall/CRC Press, pp. 69-90, 2009.

[2] Feida Zhu, Xifeng Yan, Jiawei Han, Philip S. Yu. Efficient Discovery of Approximate Sequential Patterns. Proc. 2007 International Conference on Data Mining (ICDM '07), Omaha, USA, October, 2007.

[3] Chen Chen, Xifeng Yan, Feida Zhu, and Jiawei Han. gApprox: Mining Frequent Approximate Patterns from a Massive Network. Proc. 2007 International Conference on Data Mining (ICDM'07), Omaha, USA, October, 2007.

[4] Feida Zhu, Xifeng Yan, Jiawei Han, Philip S. Yu, Hong Cheng. Mining Colossal Frequent Patterns by Core Pattern Fusion. Proc. 2007 International Conference on Data Engineering (ICDE '07), Istanbul, Turkey, April, 2007. [Best Student Paper Award]

[5] Feida Zhu, Xifeng Yan, Jiawei Han, Philip S. Yu. gPrune: A Constraint Pushing Framework for Graph Pattern Mining. Proc. 2007 Pacific-Asia Conference on Knowledge Discovery and Data Mining, (PAKDD '07), Nanjing, China, May 2007. [Best Student Paper Award]

Projects, Presentations and Posters

Feida ZHU, Efficient Mining for Large Frequent Patterns (SMU research poster 2009)

External Collaborations

Xifeng Yan, University of California at Santa-Barbara
Philip S. Yu, University of Illinois at Chicago
Jiawei Han, University of Illinois at Urbana-Champaign

Last updated on 9 November, 2009 by School of Information Systems.