MAC5719: Review and Presentation Proposal

Review proposal: Genome sequencing is a field that has developed enormously over the years. With the introduction of different massively parallel sequencing technologies, the field will change further and new challenges will arise. One of the main challenges is assembling the sequenced genome, and several bioinformatics tools are available for that task. In this review I intend to briefly cover two approaches to genome assembly, Overlap Layout Consensus (OLC) and de Bruijn Graphs (DBG), and to discuss the output files produced by the new sequencing platforms and which assemblers can handle them, showing the main advantages and disadvantages of each heuristic.

References: these two references are, respectively, the first paper describing DNA sequence assembly based on an Eulerian path (the DBG approach) and one of the most widely used programs for short-read assembly today, Velvet.

An Eulerian path approach to DNA fragment assembly:
Pavel A. Pevzner, Haixu Tang and Michael S. Waterman

For the last 20 years, fragment assembly in DNA sequencing followed the “overlap–layout–consensus” paradigm that is used in all currently available assembly tools. Although this approach proved useful in assembling clones, it faces difficulties in genomic shotgun assembly. We abandon the classical “overlap–layout–consensus” approach in favor of a new EULER algorithm that, for the first time, resolves the 20-year-old “repeat problem” in fragment assembly. Our main result is the reduction of the fragment assembly to a variation of the classical Eulerian path problem that allows one to generate accurate solutions of large-scale sequencing problems. EULER, in contrast to the Celera assembler, does not mask such repeats but uses them instead as a powerful fragment assembly tool.
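
As a rough illustration of the reduction mentioned in this abstract, the sketch below builds a de Bruijn graph from reads and walks an Eulerian path with Hierholzer's algorithm. It is a textbook simplification, not the EULER program itself: it assumes error-free reads and that an Eulerian path exists, and the names de_bruijn_edges, eulerian_path and spell are made up for this example.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Each distinct k-mer becomes one edge: (k-1)-prefix -> (k-1)-suffix."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm. Assumption: an Eulerian path exists."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    nodes = set(graph) | set(in_deg)
    if not nodes:
        return []
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (n for n in sorted(nodes) if len(graph.get(n, [])) - in_deg[n] == 1),
        min(nodes),
    )
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())          # dead end: emit node
    return path[::-1]


def spell(path):
    """Turn a node path back into a sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["ATGGCGT", "GGCGTGC", "CGTGCA"]  # toy, error-free reads
print(spell(eulerian_path(de_bruijn_edges(reads, 4))))
```

With k = 4 every (k-1)-mer in this toy example is unique, so the path is unambiguous. With k = 3 the same reads give two valid Eulerian paths, which is a small-scale version of the repeat problem the abstract refers to.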

Velvet: algorithms for de novo short reads assembly
Zerbino DR, Birney E.

We have developed a new set of algorithms, collectively called "Velvet," to manipulate de Bruijn graphs for genomic sequence assembly. A de Bruijn graph is a compact representation based on short words (k-mers) that is ideal for high coverage, very short read (25-50 bp) data sets. Applying Velvet to very short reads and paired-ends information only, one can produce contigs of significant length, up to 50-kb N50 length in simulations of prokaryotic data and 3-kb N50 on simulated mammalian BACs. When applied to real Solexa data sets without read pairs, Velvet generated contigs of approximately 8 kb in a prokaryote and 2 kb in a mammalian BAC, in close agreement with our simulated results without read-pair information. Velvet represents a new approach to assembly that can leverage very short reads in combination with read pairs to produce useful assemblies.
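
The N50 figures quoted in this abstract can be computed from a list of contig lengths. A minimal sketch using the usual definition (the length of the contig at which the running total, taken from longest to shortest, first reaches half of the assembly size); the function name n50 is just for this example, and Velvet's own reporting may differ in details.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L cover at least half the total."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```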

Presentation proposal: for the presentation I would like to cover a chapter of the book "Next-Generation Genome Sequencing". The chapter is titled "Next Generation Sequence Data Analysis" and explains how to analyze and interpret the data generated by the new sequencing platforms.

Best regards, André.

Re: Review and Presentation Proposal

by Luiz Thibério Rangel - Tuesday, 2 Nov. 2010, 23:31

Summary
With the start of the genomic era, the number of sequenced genomes grew enormously. This made manual annotation of every new genome practically impossible and led to several methodologies and tools for automatic annotation; one of these methodologies is the clustering of orthologous proteins and their generalized classification. There are several databases of orthologous proteins, such as COG/KOG, eggNOG, OrthoMCL, Inparanoid and KEGG. In this review I will discuss the methods used to create groups and add new proteins to them, such as bidirectional best hit (BBH) and single directional best hit (SBH), and the use of BLAST cutoff parameters, since all the databases to be discussed group proteins by pairwise alignments.
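
To make the two criteria concrete, here is a minimal sketch of SBH and BBH over pairwise hits. It assumes each hit is a (query, subject, e-value, bit score) tuple, as could be extracted from BLAST tabular output; the 1e-5 e-value cutoff, the function names and the toy identifiers are illustrative assumptions, not the parameters used by any of the databases above.

```python
def best_hits(hits, max_evalue=1e-5):
    """Single-directional best hit (SBH): top-scoring subject per query."""
    best = {}
    for query, subject, evalue, bitscore in hits:
        if evalue > max_evalue:
            continue  # drop hits that fail the cutoff
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a, max_evalue=1e-5):
    """BBH: pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hits(a_vs_b, max_evalue)
    best_ba = best_hits(b_vs_a, max_evalue)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy data: proteins a1..a3 from genome A, b1..b2 from genome B.
a_vs_b = [
    ("a1", "b1", 1e-50, 200.0),
    ("a1", "b2", 1e-20, 90.0),
    ("a2", "b2", 1e-30, 120.0),
    ("a3", "b1", 1e-3, 40.0),   # fails the e-value cutoff
]
b_vs_a = [
    ("b1", "a1", 1e-48, 195.0),
    ("b2", "a1", 1e-20, 90.0),
    ("b2", "a2", 1e-30, 120.0),
]
print(bidirectional_best_hits(a_vs_b, b_vs_a))
```

SBH alone would assign every query that passes the cutoff to some group, whereas BBH keeps only the reciprocal pairs, which is stricter but can miss orthologs when paralogs are present.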

References

1 - Tatusov RL, Koonin EV, Lipman DJ. A genomic perspective on protein families. Science. 1997 Oct 24;278(5338):631-7
Abstract: In order to extract the maximum amount of information from the rapidly accumulating genome sequences, all conserved genes need to be classified according to their homologous relationships. Comparison of proteins encoded in seven complete genomes from five major phylogenetic lineages and elucidation of consistent patterns of sequence similarities allowed the delineation of 720 clusters of orthologous groups (COGs). Each COG consists of individual orthologous proteins or orthologous sets of paralogs from at least three lineages. Orthologs typically have the same function, allowing transfer of functional information from one member to an entire COG. This relation automatically yields a number of functional predictions for poorly characterized genomes. The COGs comprise a framework for functional and evolutionary genome analysis.

2 - Remm M, Storm CE, Sonnhammer EL. Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J Mol Biol. 2001 Dec 14;314(5):1041-52.
Abstract: Orthologs are genes in different species that originate from a single gene in the last common ancestor of these species. Such genes have often retained identical biological roles in the present-day organisms. It is hence important to identify orthologs for transferring functional information between genes in different organisms with a high degree of reliability. For example, orthologs of human proteins are often functionally characterized in model organisms. Unfortunately, orthology analysis between human and e.g. invertebrates is often complex because of large numbers of paralogs within protein families. Paralogs that predate the species split, which we call out-paralogs, can easily be confused with true orthologs. Paralogs that arose after the species split, which we call in-paralogs, however, are bona fide orthologs by definition. Orthologs and in-paralogs are typically detected with phylogenetic methods, but these are slow and difficult to automate. Automatic clustering methods based on two-way best genome-wide matches on the other hand, have so far not separated in-paralogs from out-paralogs effectively. We present a fully automatic method for finding orthologs and in-paralogs from two species. Ortholog clusters are seeded with a two-way best pairwise match, after which an algorithm for adding in-paralogs is applied. The method bypasses multiple alignments and phylogenetic trees, which can be slow and error-prone steps in classical ortholog detection. Still, it robustly detects complex orthologous relationships and assigns confidence values for both orthologs and in-paralogs. The program, called INPARANOID, was tested on all completely sequenced eukaryotic genomes. To assess the quality of INPARANOID results, ortholog clusters were generated from a dataset of worm and mammalian transmembrane proteins, and were compared to clusters derived by manual tree-based ortholog detection methods. This study led to the identification with a high degree of confidence of over a dozen novel worm-mammalian ortholog assignments that were previously undetected because of shortcomings of phylogenetic methods. A WWW server that allows searching for orthologs between human and several fully sequenced genomes is installed at http://www.cgb.ki.se/inparanoid/. This is the first comprehensive resource with orthologs of all fully sequenced eukaryotic genomes. Programs and tables of orthology assignments are available from the same location.

3 - Wu J, Mao X, Cai T, Luo J, Wei L. KOBAS server: a web-based platform for automated annotation and pathway identification. Nucleic Acids Res. 2006 Jul 1;34(Web Server issue):W720-4.
Abstract: There is an increasing need to automatically annotate a set of genes or proteins (from genome sequencing, DNA microarray analysis or protein 2D gel experiments) using controlled vocabularies and identify the pathways involved, especially the statistically enriched pathways. We have previously demonstrated the KEGG Orthology (KO) as an effective alternative controlled vocabulary and developed a standalone KO-Based Annotation System (KOBAS). Here we report a KOBAS server with a friendly web-based user interface and enhanced functionalities. The server can support input by nucleotide or amino acid sequences or by sequence identifiers in popular databases and can annotate the input with KO terms and KEGG pathways by BLAST sequence similarity or directly ID mapping to genes with known annotations. The server can then identify both frequent and statistically enriched pathways, offering the choices of four statistical tests and the option of multiple testing correction. The server also has a 'User Space' in which frequent users may store and manage their data and results online. We demonstrate the usability of the server by finding statistically enriched pathways in a set of upregulated genes in Alzheimer's Disease (AD) hippocampal cornu ammonis 1 (CA1). KOBAS server can be accessed at http://kobas.cbi.pku.edu.cn.
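
The abstract does not name its four statistical tests, so the following is only an illustration of what "statistically enriched pathway" can mean. It uses a one-sided hypergeometric test, a common choice for this kind of analysis; the function name and the numbers are made up for the example and are not taken from KOBAS.

```python
from math import comb


def hypergeom_enrichment_pvalue(background, in_pathway, sample, sample_in_pathway):
    """P(X >= sample_in_pathway) when drawing `sample` genes without
    replacement from `background` genes, `in_pathway` of which belong
    to the pathway of interest."""
    upper = min(in_pathway, sample)
    favourable = sum(
        comb(in_pathway, i) * comb(background - in_pathway, sample - i)
        for i in range(sample_in_pathway, upper + 1)
    )
    return favourable / comb(background, sample)


# Toy numbers: 10 annotated genes, 4 in the pathway, 3 upregulated genes,
# 2 of which fall in the pathway.
print(hypergeom_enrichment_pvalue(10, 4, 3, 2))
```

When many pathways are tested at once, the resulting p-values would normally be adjusted, which is what the multiple testing correction option in the abstract refers to.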