mGene: Accurate Computational Gene Finding
mGene is a computational tool for the genome-wide prediction of protein coding genes from eukaryotic DNA sequences. It is based on recent advances in machine learning and uses discriminative training techniques, such as support vector machines (SVMs) and hidden semi-Markov support vector machines (HSMSVMs). Its excellent performance was proved in an objective competition based on the genome of the nematode Caenorhabditis elegans. The evaluated developmental version of mGene exhibited the best prediction performance (in terms of the average between sensitivity and specificity) for the multiple-genome prediction tasks on all four evaluation levels (considering, nucleotides, exons, transcripts and genes). The ab-initio version was best on nucleotide, exon and transcript level, and only slightly worse than Augustus on the gene level. The fully developed version shows the best overall performance compared to the submitted gene finders' predictions, including the ones of Fgenesh and Augustus.
|July 9, 2010:||Talk on "Next Generation Genome Annotation with mGene.ngs" at ISCB Student Council Symposium|
|June 23, 2009:||Paper describing mGene accepted for publication in Genome Research |
|June 3, 2009:||Paper describing mGene.web published in Nucleic Acids Research .|
|June 1, 2009:||Software release mgene-0.1.0-beta is available for download at http://mgene.org/download|
|March 1, 2009:||mGene.web goes live, see http://mgene.org/web|
- Genome-wide prediction of protein-coding genes for eukaryotes (currently limited to genome sizes of several hundred Mbp). For this task we provide several pre-trained models (see below for the current list of organisms).
- Training the gene finder on your own data to generate a predictive model for newly sequenced genomes
- Evaluation / comparison of gene annotations
- Training and prediction of individual sensors, including splice sites, transcription start sites, etc.
Recently we have participated in the nGASP competition launched by the Wormbase Consortium . Compared to state-of-the-art gene finding systems mGene achieved excellent performance values. The results for the ab-initio category are shown below.
Table 1: Shown are sensitivity (Sn), specificity (Sp) and their average (Avg), each in percent, on nucleotide, exon, transcript and gene levels. The predictions of mGene.init were prepared after the deadline but strictly adhering to the rules and conditions of the nGASP challenge. The evaluation is based on the submitted sets of the participants and performed with our own routine. The numbers slightly deviate from the official nGASP evaluation on the transcript and gene level due to minor differences in the evaluation criteria. These differences, however, do not change the ranking.
More details on the mGene versions applied in nGASP, which also include a conservation based version, as well as a version that uses EST alignments, see mgene at nGASP competition
We have used the mGene model trained on C. elegans for the genome (re-)annotation of five nematode genomes including: C. elegans, C. brenneri, C. japonica, C. remanei, and C. briggsae. In an assessment of the quality of several available annotations for these genomes, we find that mGene's predictions are most accurate. The gene predictions have been included in the wormbase annotation available at http://www.wormbase.org and http://mgene.org/predictions (also http://ftp.raetschlab.org/resources/mgene/genome_research_2009). They allow us to compare the resulting proteomes among these organisms and to the known protein universe, thereby identifying many species/lineage-specific gene inventions .
Figure 1: Wormbase screenshot, showing aligned ESTs and three genes predicted by mGene.
We are currently training mGene on several other organisms. Here is a list of organisms, for which we can provide initial models:
Preliminary prediction performance measurements can be found here
We tackle the gene prediction problem taking a two-layered approach. In a first step, state-of-the-art kernel machines are employed to detect signal sequences in genomic DNA (like splice sites or transcription start sites) and to discriminate the content of different DNA sequences (like coding exons, introns, etc.). In a second step their outputs are combined to predict whole gene structures. In this step we use a discriminative training approach based on Hidden semi-Markov support vector machines   . A more detailed description is also provided here.
In case of comments, problems, questions etc. feel free to contact
|||(1, 2) Schweikert, G, Behr, J, Zien, A, Zeller, G, Ong, CS, Sonnenburg, S, and Rätsch, G (2009). mGene.web: a web service for accurate computational gene finding. Nucleic Acids Research, Web Server Issue.|
|||Coghlan et al. nGASP: the nematode genome annotation assessment project. BMC Bioinformatics, 2008.|
|||(1, 2, 3) Schweikert, G, Zien, A, Zeller, G, Behr, J, Dieterich, C, Ong, CS, Philips, P, De Bona, F, Hartmann, L, Bohlen, A, Krüger, N, Sonnenburg, S, and Rätsch, G. mGene: Accurate SVM-based Gene Finding with an Application to Nematode Genomes. Genome Research, Advance Access June 29, 2009.|
|||Raetsch and Sonnenburg. Large Scale Semi Hidden Markov SVMs. In Advances in Neural Information Processing Systems 19, Cambridge, MA, 2006. MIT Press.|
|||Raetsch et al. Improving the Caenorhabditis elegans Genome Annotation Using Machine Learning, PLoS Computational Biology, 3(2):e20, 2007|