Annotating a draft de novo genome assembly of a previously uncharacterized species with gene function information is a challenging problem. A genome sequence is not biologically informative until you know what it encodes, that is, which regions are transcribed and translated into proteins. Evidence at each level (genome, transcriptome, protein sequence) is collected experimentally and must be integrated. Some pipelines also use annotated genomes, transcriptomes, and proteomes from closely related species. There is no single agreed-upon set of methods, and multiple pipeline tools are available. Which is the best to use? For now, we should try multiple methods and see which combination of results makes the most sense.
1. CEGMA (Core Eukaryotic Genes Mapping Approach), which maps a set of 458 proteins found to be highly conserved among eukaryotes onto the assembly to locate the corresponding genes.
2. EuGene, a gene prediction program that promises integration with RNA-seq or EST data. It is not straightforward to learn: the parameters are not easy to follow, and it requires integration with plugins.
3. The PASA pipeline is another method to try: