Bioinformatics analysis - PowerPoint Presentation

Uploaded On 2024-02-09

Presentation Transcript

1. Bioinformatics analysis pipeline for viral metagenomics. Davit Bzhalava, PhD, Dept. of Laboratory Medicine, Karolinska Institutet, Sweden. 16-12-05

2. Human Microbiota. We are born 100% human and we die 90% microbial. The term human microbiome, or microbiota, refers to the collection of microorganisms that reside in the human body. The viral fraction of the human microbiome is referred to as the human virome. Viruses constitute only a small part of the human microbiota, but their proportion and composition seem to change in diseased individuals.

3. Tumor Viruses. An estimated 2 million (16%) of new cancer cases worldwide in 2008 were attributable to infections. 1,300,000 (65%) of these cancers were attributable to viral infections. There are epidemiological indications that additional cancer-associated viruses may exist: increased incidence of some cancer types among immunosuppressed individuals; space and time clustering of childhood leukemias.

4. Purpose of viral metagenomics. Who is there? What are they doing? How are they doing it?

5. Needle in a haystack. Viruses usually constitute <0.1% of a whole metagenomic dataset. Small changes in the data analysis pipeline can drastically alter results.

6. Bioinformatics Pipeline. Library preparation; sequencing; data analysis; filter out human, bacterial, phage and vector sequences; normalize k-mer frequencies; genome assembly; assembly validation & number of reads estimation; taxonomic classification; final characterization of virus related sequences; case-control comparison of virus related & “unknown” sequences / OR estimation.

7. Bioinformatics Pipeline (overview repeated from slide 6)

8. de novo assembly. NGS technologies produce billions of short reads from random locations in the genome by oversampling it. Assembly algorithms, in a process called de novo assembly, reconstruct the original genomes present in the sample by merging short genomic fragments into longer contiguous sequences (“contigs”). There are two main types of de novo assembly programs: Overlap/Layout/Consensus (OLC) assemblers and de Bruijn graph assemblers.

9. OLC assembly. Overlap: find potentially overlapping reads. Layout: merge reads into contigs and contigs into supercontigs. Consensus: derive the DNA sequence and correct read errors (..ACGATTACAATAGGTT..).
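The overlap and layout steps can be illustrated with a minimal Python sketch. It is a toy greedy merger, not how any particular OLC assembler works: the three reads are hypothetical fragments cut from the sequence shown on the slide, the names `overlap` and `greedy_merge` are made up for this example, and the consensus step (error correction) is left out because the toy reads contain no errors.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 when no overlap of at least min_len characters exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_merge(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        # Overlap step: compare every ordered pair of reads.
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing left to join
        # Layout step: join the best pair into one longer contig.
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Hypothetical reads sampled from ACGATTACAATAGGTT
toy_reads = ["ACGATTACA", "TTACAATAGG", "ATAGGTT"]
contigs = greedy_merge(toy_reads)
print(contigs)
```

With these toy reads the merger ends with a single contig equal to the original sequence. Comparing every pair of reads is quadratic in the number of reads, which is one reason real assemblers use indexing rather than a loop like this.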

10. de Bruijn graph assembly. de Bruijn graph assemblers model the relationship between exact substrings of length k extracted from the input reads. In a de Bruijn graph the reads themselves are not directly modelled; they are implicitly represented as paths through the graph. Most de Bruijn graph assemblers use the read information to refine the graph structure and to remove graph patterns that are not consistent with the reads. The de Bruijn graph approach is based on exact matches, so error correction (used both before and during assembly) is crucial for achieving high-quality assemblies.

11. Challenges in assembly
Suppose we have 2 sequences:
the_quick_brown_fox_jumps
jumps_over_the_lazy_dog
Each is decomposed into k-mers. With k = 5, put both sentences into the same graph and follow the links in the graph:
the_q -> he_qu -> e_qui -> _quic -> quick -> uick_ -> ick_b -> ck_br -> ...
to spell out the 'assembled' sentence:
the_quick_brown_fox_jumps_over_the_lazy_dog
If k = 6, there is no 6-mer in common between the two sentence fragments, so they cannot be joined. If k = 4, the graph becomes complicated: the word the_ appears twice (once in each fragment).
Example taken from: http://ivory.idyll.org/blog/the-k-parameter.html
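The effect of k in this example can be reproduced with a short Python sketch. It is a simplification under stated assumptions: the nodes are the k-mers themselves and a link joins two k-mers that are adjacent in a read (the textbook de Bruijn graph uses (k-1)-mers as nodes), and the helper names `kmers`, `build_graph` and `walk` are invented for illustration.

```python
def kmers(seq, k):
    """All overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_graph(reads, k):
    """Map every k-mer to the set of k-mers that directly follow it in some read."""
    graph = {}
    for read in reads:
        words = kmers(read, k)
        for word in words:
            graph.setdefault(word, set())
        for left, right in zip(words, words[1:]):
            graph[left].add(right)
    return graph


def walk(graph, start):
    """Spell a sequence by following links while the current k-mer has exactly one successor.

    Stops at a dead end, at a branch, or when a k-mer would be revisited.
    """
    spelled, node, seen = start, start, {start}
    while len(graph[node]) == 1:
        (nxt,) = graph[node]
        if nxt in seen:
            break
        seen.add(nxt)
        spelled += nxt[-1]
        node = nxt
    return spelled


READS = ["the_quick_brown_fox_jumps", "jumps_over_the_lazy_dog"]

for k in (4, 5, 6):
    start = READS[0][:k]
    print(k, walk(build_graph(READS, k), start))
```

With k = 5 the walk spells the whole sentence, because the shared 5-mer `jumps` links the two fragments. With k = 6 it stops at the end of the first fragment, and with k = 4 it stops immediately at `the_`, which has two possible successors.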

12. Challenges in assembly. The solution is to try as many assemblers, with as many parameters, as possible. Resources, including time, are limited. Assemblies are RAM thirsty: NextSeq, 300m reads ≈ 250GB RAM. k-mer based assemblers scale poorly.

13. Bioinformatics Pipeline (overview repeated from slide 6)

14. K-mer normalization. Number of reads before normalization: 1’642’160’122 paired reads.

15. Number of reads after normalization: 282’961’022 paired reads (17% of initial reads).
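The slides do not say which tool or algorithm performed the normalization. As an illustration only, the following Python sketch shows one common approach, digital normalization: a read is kept only if the median count of its k-mers, among the reads kept so far, is still below a cutoff. The function name `normalize_reads`, the values of `k` and `cutoff`, and the toy reads are all assumptions; real tools use much larger k and memory-efficient approximate counters instead of an exact `Counter`.

```python
from collections import Counter
from statistics import median


def normalize_reads(reads, k=5, cutoff=2):
    """Keep a read only while its median k-mer count is below the cutoff."""
    counts = Counter()  # k-mer counts over the reads kept so far
    kept = []
    for read in reads:
        words = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not words:
            continue  # read shorter than k
        if median(counts[w] for w in words) < cutoff:
            kept.append(read)
            counts.update(words)
    return kept


# Toy input: one sequence covered five times, another covered once.
toy_reads = ["ACGTACGGTA"] * 5 + ["TTGACCAGTT"]
kept = normalize_reads(toy_reads)
print(len(toy_reads), "->", len(kept))
```

In the toy input the heavily covered sequence is cut down to two copies while the rare one survives, which is the behaviour that lets normalization shrink a dataset without discarding low-abundance sequences.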

16. Human genome coverage before normalization

17. Human genome coverage after normalization

18. Number of reads after HG clean up: 6’745’443 paired reads (2.4% of the normalized data and 0.4% of the initial reads).
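The proportions can be recomputed directly from the three read counts reported on slides 14, 15 and 18. The short Python check below uses only those counts; the variable names are chosen for this example.

```python
initial_reads = 1_642_160_122     # paired reads before normalization
normalized_reads = 282_961_022    # paired reads after normalization
cleaned_reads = 6_745_443         # paired reads after HG clean up


def percent(part, whole):
    """Share of `part` in `whole`, expressed in percent."""
    return 100 * part / whole


print(f"normalized / initial: {percent(normalized_reads, initial_reads):.1f}%")
print(f"cleaned / normalized: {percent(cleaned_reads, normalized_reads):.1f}%")
print(f"cleaned / initial:    {percent(cleaned_reads, initial_reads):.1f}%")
```

The counts give roughly 17%, 2.4% and 0.4%, which as plain fractions are about 0.17, 0.02 and 0.004.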

19. Bioinformatics Pipeline (overview repeated from slide 6)

20. Taxonomic classification. NCBI BLAST is one of the most widely known tools for similarity-based taxonomic classification. It compares sequences to known genomes.
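As a sketch of what the similarity-based step produces, the Python snippet below picks the best hit per query from BLAST tabular output (`-outfmt 6`, whose twelve default columns are listed in `COLUMNS`). The hit lines, contig names and subject names are hypothetical, and the slides do not describe how hits were actually filtered; turning a subject identifier into a taxon needs a separate accession-to-taxonomy lookup that is not shown.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(tabular_text):
    """Return, for every query sequence, the hit with the highest bit score."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if len(row) != len(COLUMNS):
            continue  # skip blank or malformed lines
        hit = dict(zip(COLUMNS, row))
        query = hit["qseqid"]
        if query not in best or float(hit["bitscore"]) > float(best[query]["bitscore"]):
            best[query] = hit
    return best


# Hypothetical hits for two assembled contigs.
example = (
    "contig_1\tsubject_A\t98.5\t200\t3\t0\t1\t200\t11\t210\t1e-100\t360\n"
    "contig_1\tsubject_B\t91.0\t180\t16\t0\t5\t184\t40\t219\t2e-60\t230\n"
    "contig_2\tsubject_C\t88.2\t150\t17\t1\t1\t150\t300\t449\t5e-40\t160\n"
)

for query, hit in best_hits(example).items():
    print(query, hit["sseqid"], hit["pident"], hit["evalue"])
```

Contigs with no line in the output have no detectable similarity to the database at the chosen thresholds.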

21. Challenges in taxonomic classification. Genome sequencing has led to massive data generation, requiring a significant increase in the execution speed of these algorithms. New and ever-expanding databases need to be searched. Source: http://www.ncbi.nlm.nih.gov/genbank/statistics (accessed on Nov 08, 2015).

22. Challenges in taxonomic classification. NCBI BLAST-based search tools are extremely time consuming: they may take days or even weeks to complete when large metagenomic datasets need to be compared against nucleotide or protein databases. Paracel BLAST, a commercial software package, achieved the same results on the same file on the same machine 10 times faster. Scalable open source NCBI BLAST solutions are needed.
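The slides do not describe a specific scalable solution. One simple idea that is often used, shown here purely as an assumption, is to split the query FASTA into chunks so that independent BLAST jobs can search them in parallel. The Python sketch below covers only the splitting; the function names and the example records are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, parts = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(parts)))
            header, parts = line[1:], []
        else:
            parts.append(line)
    if header is not None:
        records.append((header, "".join(parts)))
    return records


def split_records(records, n_chunks):
    """Distribute records round-robin over at most n_chunks non-empty chunks."""
    chunks = [[] for _ in range(n_chunks)]
    for index, record in enumerate(records):
        chunks[index % n_chunks].append(record)
    return [chunk for chunk in chunks if chunk]


# Hypothetical query file with three contigs.
fasta_text = ">contig_1\nACGATTACA\nATAGGTT\n>contig_2\nTTGACCAGTT\n>contig_3\nGGCATCA\n"
chunks = split_records(read_fasta(fasta_text), n_chunks=2)
for number, chunk in enumerate(chunks, start=1):
    print(number, [header for header, _ in chunk])
```

Each chunk could then be written to its own file and searched as a separate job, with the per-chunk results concatenated afterwards.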

23. Thank you!