Code & Data
Code
BioTokenizer.pl: Tokenization, a first step in many information retrieval and natural language processing tasks, is the process of separating text into individual tokens that each convey some semantic meaning. For English, tokens are in most cases equivalent to words. Biomedical text, however, often contains names and symbols of various types of biomedical entities, such as genes, proteins, and chemicals. The special characters in these names and symbols make meaningful tokens harder to identify than in ordinary English text. This Perl script implements a number of tokenization heuristics we studied in the following paper:
Jing Jiang and ChengXiang Zhai. An empirical study of tokenization strategies for biomedical information retrieval. Information Retrieval, 10(4-5):341-363, October 2007.
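To illustrate why special characters complicate tokenization, here is a minimal Python sketch of three generic strategies. It is not the BioTokenizer.pl code and does not claim to reproduce the heuristics evaluated in the paper. The function names and the example string are hypothetical and chosen only for illustration.

```python
import re


def tokenize_whitespace(text):
    """Baseline: split on whitespace only, so punctuation stays attached."""
    return text.split()


def tokenize_break_special(text):
    """Assumed heuristic: treat every non-alphanumeric character as a break point."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]


def tokenize_break_alnum(text):
    """Assumed heuristic: also split where letters and digits meet."""
    tokens = []
    for t in tokenize_break_special(text):
        tokens.extend(re.findall(r"[A-Za-z]+|[0-9]+", t))
    return tokens


# Hypothetical example string containing entity-style names.
example = "MIP-1-alpha binds (CCR5)"

print(tokenize_whitespace(example))     # ['MIP-1-alpha', 'binds', '(CCR5)']
print(tokenize_break_special(example))  # ['MIP', '1', 'alpha', 'binds', 'CCR', ...] see note below
print(tokenize_break_alnum(example))    # ['MIP', '1', 'alpha', 'binds', 'CCR', '5']
```

Note that the second strategy keeps "CCR5" as a single token, while the third splits it into "CCR" and "5". Each strategy makes a different trade-off: aggressive splitting lets a query such as "MIP 1 alpha" match "MIP-1-alpha", but it can also break an entity name into fragments that match unrelated text.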
Domain Adaptive Logistic Regression: This toolkit implements a number of domain adaptation techniques that I explored in my PhD thesis.
Data