Yale University
Department of Statistics
Seminar

Thursday, February 6, 2003

Computational genefinding:  probabilistic models and statistical methods

Jing Wu
Center for Biomolecular Science and Engineering
University of California, Santa Cruz

Computational methodology for finding genes and other functional sites in genomic DNA has
evolved significantly over the last 20 years. One type of functional site that researchers
have sought to recognize is the binding site. Finding IHF binding sites in E. coli DNA is one
popular problem of this kind. In our approach, a positional weight matrix is derived from a set
of known IHF binding sites, and a hidden semi-Markov model based on this matrix is developed
both to simulate IHF binding sites in E. coli DNA and to detect putative binding sites there.
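
As a rough illustration of the first step only, the sketch below builds a log-odds positional
weight matrix from aligned sites and scans a sequence with it. It does not attempt the hidden
semi-Markov model. The function names, the uniform background, the pseudocount, and the toy
sequences are all assumptions made for this example; they are not real IHF sites and not the
speaker's implementation.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def build_pwm(sites, pseudocount=1.0):
    """Return one dict per position mapping base -> log2-odds weight.

    Assumes equal-length aligned sites and a uniform background (0.25 per base).
    """
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(ALPHABET)
    pwm = []
    for pos in range(length):
        counts = Counter(s[pos] for s in sites)
        pwm.append({
            b: math.log2(((counts[b] + pseudocount) / total) / 0.25)
            for b in ALPHABET
        })
    return pwm

def score(pwm, window):
    """Sum the per-position weights of a window as long as the matrix."""
    return sum(col[base] for col, base in zip(pwm, window))

def best_hit(pwm, seq):
    """Return (score, start) of the highest-scoring window in seq."""
    w = len(pwm)
    return max((score(pwm, seq[i:i + w]), i) for i in range(len(seq) - w + 1))

# Hypothetical toy sites, invented for illustration.
toy_sites = ["TATCAA", "TATCAA", "AATCAA", "TATCAG"]
pwm = build_pwm(toy_sites)
hit_score, hit_pos = best_hit(pwm, "GGGGTATCAAGGGG")
```

Here the best-scoring window starts where the toy consensus occurs in the scanned sequence. A
real scan would also need a score threshold, which this sketch leaves out.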

    A recently reported class of gene-prediction algorithms has shown the power of comparative
genomics. Existing genefinding algorithms focus on locating exons in genomic sequence; they
place limitations on the input sequences and lack a measure of statistical confidence.
Another algorithm, designed to detect conserved structural RNAs as well as coding regions, is
computationally heavy and focuses on structural RNAs. We use sequence similarity between human
and mouse to classify alignments into coding regions and non-coding regions.
Based on the aligned sequences of human and mouse, we propose a log-odds ratio score that is
based on conservation measurements, and we use the distribution of log-odds ratio scores over
fixed-size windows of a gapless alignment to separate alignments that contain coding regions
from alignments that do not. The confidence level of our predictions of new coding regions is
given by a multiple hypothesis testing procedure that controls the false discovery rate. The
correctness of our predictions is validated on 1M alignments of ancient repeats, 90,000
exons from RefSeq mRNA, and 932 pseudogenes produced by the Sanger Institute.
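
The two ingredients named above, a windowed log-odds score and false discovery rate control, can
be sketched as follows. This is a simplified illustration under stated assumptions, not the
speaker's method: the per-column score uses only match versus mismatch with two hypothetical
match probabilities, and the Benjamini-Hochberg step-up procedure is swapped in as one standard
way to control the false discovery rate. The abstract does not say how p-values are obtained
from the score distribution, so they are taken as given inputs here.

```python
import math

def window_scores(human, mouse, window, p_match_coding, p_match_noncoding):
    """Log-odds score (coding vs. non-coding) for each window of a gapless alignment.

    Assumption: each column contributes the log ratio of the probability of the
    observed match or mismatch under the two models.
    """
    if len(human) != len(mouse):
        raise ValueError("a gapless alignment needs equal-length sequences")
    match = math.log(p_match_coding / p_match_noncoding)
    mismatch = math.log((1 - p_match_coding) / (1 - p_match_noncoding))
    cols = [match if h == m else mismatch for h, m in zip(human, mouse)]
    return [sum(cols[i:i + window]) for i in range(len(cols) - window + 1)]

def benjamini_hochberg(pvalues, q):
    """Return booleans marking hypotheses rejected at false discovery rate q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    # Find the largest rank k whose sorted p-value is at most k * q / m.
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / m:
            cutoff = rank
    rejected = [False] * m
    for i in order[:cutoff]:
        rejected[i] = True
    return rejected

# Hypothetical toy inputs, invented for illustration.
scores = window_scores("ACGTACGT", "ACGTTCGA", window=4,
                       p_match_coding=0.9, p_match_noncoding=0.6)
calls = benjamini_hochberg([0.03, 0.001, 0.2, 0.012], q=0.05)
```

In this toy run the first window, which is fully conserved, receives the highest score, and
three of the four example p-values pass the threshold at q = 0.05.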

            Seminar to be held in Room 107, 24 Hillhouse Avenue at 12:00 pm