Uploaded On 2017-04-06


Slide1

Introduction To Next Generation Sequencing (NGS) Data Analysis

Jenny Wu

Slide2

Outline

Goals: Practical guide to NGS data processing

Bioinformatics in NGS data analysis

Basics: terminology, data formats, general workflow, etc.

Data Analysis Pipeline

Sequence QC and preprocessing

Downloading reference sequences

Sequence mapping

Downstream analysis workflow and software

RNA-Seq data analysis

Concepts: spliced alignment, normalization, coverage, differential expression.

Popular RNA-Seq pipeline: Tuxedo suite vs. Tophat-HTSeq

Data visualization with Genome Browsers and R packages.

Downstream Pathway analysis

ChIP-Seq data analysis workflow and software

NGS bioinformatics resources

Summary

Slide3

Why Next Generation Sequencing

One can generate hundreds of millions of short sequences (up to 250 bp) in a single run, in a short period of time and at a low per-base cost.

Illumina/Solexa GA II, HiSeq 2500, 3000, X

Roche/454 FLX, Titanium

Life Technologies/Applied Biosystems SOLiD

Reviews:

Michael Metzker (2010) Nature Reviews Genetics 11:31

Quail et al. (2012) BMC Genomics 13:341

Slide4

Why Bioinformatics

(wall.hms.harvard.edu)

Informatics

Slide5

Bioinformatics Challenges in NGS Data Analysis

“Big Data” (files thousands of millions of lines long)

Can’t do ‘business as usual’ with familiar tools

Impossible memory usage and execution time

Need to manage, analyze, store, transfer and archive huge files

Need for powerful computers and expertise

Informatics groups must manage compute clusters

New algorithms and software are required; they are often open source and Unix/Linux based.

Requires collaboration among IT experts, bioinformaticians and biologists

Slide6

Basic NGS Workflow

Olson et al.

Slide7

NGS Data Analysis Overview

Olson et al.

Slide8

Outline

Slide9

Terminology

Experimental Design:

Coverage (sequencing depth): The number of nucleotides from reads that are mapped to a given position.

average coverage = read length * number of reads / genome size

Paired-End Sequencing: Both ends of the DNA fragment are sequenced, allowing highly precise alignment.

Multiplexing/Demultiplexing: "Barcode" sequences are added to each sample so that the samples can be distinguished, which allows a large number of samples to be sequenced on one lane.
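As a worked example of the average coverage formula above, here is a minimal Python sketch. The function name and the example numbers (100 million reads of 100 bp against a 3.1 Gb genome) are illustrative assumptions, not values taken from the slides.

```python
def average_coverage(read_length: int, num_reads: int, genome_size: int) -> float:
    """Average coverage = read length * number of reads / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return read_length * num_reads / genome_size


# Hypothetical run: 100 million reads of 100 bp on a 3.1 Gb genome.
coverage = average_coverage(read_length=100,
                            num_reads=100_000_000,
                            genome_size=3_100_000_000)
print(f"average coverage: {coverage:.1f}x")
```

The same function can be turned around when planning an experiment: fix the desired coverage and solve for the number of reads.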

Data analysis:

Quality Score: Each called base comes with a quality score that measures the probability of a base-call error.

Mapping: Aligning reads to a reference to identify their origin.

Assembly: Merging fragments of DNA in order to reconstruct the original sequence.

Duplicate reads: Reads that are identical. They can be identified after mapping.

Multi-reads: Reads that can be mapped to multiple locations equally well.

Slide10

What does the data look like?

Common NGS Data Formats

For a full list, go to http://genome.ucsc.edu/FAQ/FAQformat.html

Slide11

File Formats

Reference sequences, reads:

FASTA

FASTQ (FASTA with quality scores)

Alignments:

SAM (Sequence Alignment/Map)

BAM (Binary version of SAM)

Features, annotation, scores:

GFF3/GTF (General Feature Format)

BED/BigBed

WIG/BigWig

http://genome.ucsc.edu/FAQ/FAQformat.html

Slide12

FASTA Format (Reference Seq)

Slide13

FASTQ Format (Illumina Example)

@DJG84KN1:272:D17DBACXX:2:1101:12432:5554 1:N:0:AGTCAA
CAGGAGTCTTCGTACTGCTTCTCGGCCTCAGCCTGATCAGTCACACCGTT
+
BCCFFFDFHHHHHIJJIJJJJJJJIJJJJJJJJJJIJJJJJJJJJIJJJJ
@DJG84KN1:272:D17DBACXX:2:1101:12454:5610 1:N:0:AG
AAAACTCTTACTACATCAGTATGGCTTTTAAAACCTCTGTTTGGAGCCAG
+
@@@DD?DDHFDFHEHIIIHIIIIIBBGEBHIEDH=EEHI>FDABHHFGH2
@DJG84KN1:272:D17DBACXX:2:1101:12438:5704 1:N:0:AG
CCTCCTGCTTAAAACCCAAAAGGTCAGAAGGATCGTGAGGCCCCGCTTTC
+
CCCFFFFFHHGHHJIJJJJJJJI@HGIJJJJIIIJGIGIHIJJJIIIIJJ
@DJG84KN1:272:D17DBACXX:2:1101:12340:5711 1:N:0:AG
GAAGATTTATAGGTAGAGGCGACAAACCTACCGAGCCTGGTGATAGCTGG
+
CCCFFFFFHHHHHGGIJJJIJJJJJJIJJIJJJJJGIJJJHIIJJJIJJJ

Read Record:

Header (includes Flow Cell ID, Lane, Tile, Tile Coordinates, Barcode)

Read Bases

Separator (with optional repeated header)

Read Quality Scores

NOTE: for paired-end runs, there is a second file with one-to-one corresponding headers and reads.
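To make the four-line record layout concrete, the following Python sketch parses the first record shown above, splits its header into fields, and converts quality characters into error probabilities. It assumes Phred+33 quality encoding and the colon-separated Illumina header layout seen in the example; the function and field names are hypothetical, chosen only for illustration.

```python
from typing import NamedTuple


class FastqRecord(NamedTuple):
    header: str      # header line without the leading "@"
    bases: str       # called bases
    qualities: list  # one Phred score per base


def parse_fastq(text: str) -> list:
    """Parse four-line FASTQ records: header, bases, separator, qualities."""
    lines = [line.strip() for line in text.strip().splitlines()]
    records = []
    for i in range(0, len(lines) - 3, 4):
        header, bases, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        # Assumption: Phred+33 encoding, so score = ASCII code - 33.
        scores = [ord(ch) - 33 for ch in quality]
        records.append(FastqRecord(header[1:], bases, scores))
    return records


def parse_illumina_header(header: str) -> dict:
    """Split 'instrument:run:flowcell:lane:tile:x:y read:filtered:control:barcode'."""
    location, _, read_info = header.partition(" ")
    instrument, run, flowcell, lane, tile, x, y = location.split(":")
    read, is_filtered, control, barcode = read_info.split(":")
    return {
        "instrument": instrument, "run": run, "flowcell": flowcell,
        "lane": lane, "tile": tile, "x": x, "y": y,
        "read": read, "is_filtered": is_filtered,
        "control": control, "barcode": barcode,
    }


def error_probability(phred_score: int) -> float:
    """Phred definition: Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-phred_score / 10)


EXAMPLE = """\
@DJG84KN1:272:D17DBACXX:2:1101:12432:5554 1:N:0:AGTCAA
CAGGAGTCTTCGTACTGCTTCTCGGCCTCAGCCTGATCAGTCACACCGTT
+
BCCFFFDFHHHHHIJJIJJJJJJJIJJJJJJJJJJIJJJJJJJJJIJJJJ
"""

records = parse_fastq(EXAMPLE)
fields = parse_illumina_header(records[0].header)
first_q = records[0].qualities[0]
print(fields["flowcell"], fields["lane"], fields["tile"], fields["barcode"])
print(first_q, error_probability(first_q))
```

For real data a dedicated parser is usually the better choice; this sketch only shows how the header, bases and quality line relate to each other.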

(Passarelli, 2012)

Slide14

GFF3 and GTF format

Khetani RS et al.

GTF format:

GFF3 format:

Slide15

Outline

Slide16

General Data Pipeline

Slide17

Why QC?

Sequencing runs cost money

Consequences of not assessing the data:

Sequencing a poor library on multiple runs: throwing money away!

Data analysis costs money and time

Cost of analyzing data, CPU time $$

Cost of storing raw sequence data $$$

Hours of analysis could be wasted $$$$

Downstream analysis can be incorrect.

Slide18

How to QC?

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ (available on HPC)

Tutorial: http://www.youtube.com/watch?v=bz93ReOv87Y

$ module load fastqc
$ fastqc s_1_1.fastq

Slide19

FastQC: Example

Slide20

Outline

Slide21

Premade Genome Sequence Indexes and Annotation

http://ccb.jhu.edu/software/tophat/igenomes.shtml

Slide22

The UCSC Genome Browser Homepage

Get genome annotation here!

General information

Specific information: new features, current status, etc.

Get reference sequences here!

Slide23

Downloading Reference Sequences

Slide24

Downloading Reference Annotation

Slide25

Outline

Slide26

Sequence Mapping Challenges

Alignment (mapping) is often the first step once analysis-ready reads are obtained.

The task: to align sequencing reads against a known reference.

Difficulties: high volume of data, size of the reference genome, computation time, read length constraints, and ambiguity caused by repeats and sequencing errors.

Slide27

How to choose an aligner?

There are many short-read aligners, and they vary a lot in performance (accuracy, memory usage, speed, flexibility, etc.).

Factors to consider: application, platform, read length, downstream analysis, etc.

There is a constant trade-off between speed and sensitivity (e.g. MAQ vs. Bowtie2). Guaranteed high accuracy will take longer.

Popular choices: Bowtie2, BWA, Tophat2, STAR.

Slide28

Outline

Slide29

Application Specific Software

Slide30

Outline

Slide31
Slide32

Two Major Approaches

1. Gene- or exon-level differential expression (DE): DESeq2, edgeR, DEXSeq…

2. Transcript assembly: Trinity, Velvet-Oasis, TransABySS, Cufflinks, Scripture…

Slide33

RNA-Seq Pipeline for DE

Slide34

RNA-Seq: Spliced Alignment

http://en.wikipedia.org/wiki/File:RNA-Seq-alignment.png

“Systematic evaluation of spliced alignment programs for RNA-seq data”, Nature Methods, 2013

Some reads will span two different exons

Need long enough reads to be able to reliably map both sides

Use a splice-aware aligner!

Slide35

How much sequence do I need?

Oversimplified answer: 20-50M PE/sample (Human/Mouse)

Depends on:

Size and complexity of your transcriptome.

Goal of experiment: DE, transcript discovery.

Tissue type, library type, RNA quality, read length, single-end…

Slide36

RNA-Seq: Normalization

Gene-length bias

Differential expression of longer genes is more readily detected as significant because long genes yield more reads

RNA-Seq normalization methods:

Scaling factor based: total count, upper quartile, median, DESeq, TMM in edgeR

Quantile, RPKM (Cufflinks)

ERCC

Normalize by gene length and by number of reads mapped, e.g. RPKM/FPKM (reads/fragments per kilobase per million mapped reads)

Slide37

Definition of Expression levels

RPKM: Reads Per Kilobase per Million mapped reads

FPKM: Fragments Per Kilobase per Million mapped reads (for paired-end reads)
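Written out, the definition above amounts to RPKM = 10^9 * C / (N * L), where C is the number of reads mapped to the gene, N is the total number of mapped reads, and L is the gene length in base pairs. The Python sketch below is a minimal illustration of that arithmetic; the function name and the two example genes are hypothetical.

```python
def rpkm(read_count: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of gene per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    kilobases = gene_length_bp / 1_000
    millions_mapped = total_mapped_reads / 1_000_000
    return read_count / kilobases / millions_mapped


# Two hypothetical genes in a library of 20 million mapped reads.
# The longer gene collects four times as many reads, yet both have the
# same RPKM, which is the point of normalizing by gene length.
short_gene = rpkm(read_count=500, gene_length_bp=1_000, total_mapped_reads=20_000_000)
long_gene = rpkm(read_count=2_000, gene_length_bp=4_000, total_mapped_reads=20_000_000)
print(short_gene, long_gene)
```

FPKM uses the same calculation with fragment counts (read pairs) in place of read counts.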

Mortazavi et al. 2008

Slide38

RNA-Seq: Differential Expression

Discrete vs. continuous data:

Microarray fluorescence intensity data: continuous. Modeled using a normal distribution.

RNA-Seq read count data: discrete. Modeled using a negative binomial distribution.

Microarray software can NOT be directly used to analyze RNA-Seq data!

Slide39

Outline

Slide40

Popular RNA-Seq DE Pipeline

Pipeline 1 (The Tuxedo Protocol)

Pipeline 2 (The Alternative Protocol)

Slide41

http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html

Classic RNA-Seq (Tuxedo Protocol)

1. Spliced read mapping

2. Transcript assembly and quantification

3. Merge assembled transcripts from multiple samples

4. Differential Expression analysis

SAM/BAM

GTF/GFF

Slide42

Classic vs. Advanced RNA-Seq workflow

Slide43

1. Spliced Alignment: Tophat

Tophat: a spliced short read aligner for RNA-seq.

$ tophat -p 8 -G genes.gtf -o C1_R1_thout genome C1_R1_1.fq C1_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C1_R2_thout genome C1_R2_1.fq C1_R2_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R1_thout genome C2_R1_1.fq C2_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R2_thout genome C2_R2_1.fq C2_R2_2.fq
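The four Tophat commands above differ only in the sample name, so they can be generated in a loop. The Python sketch below builds and prints the same command lines without running anything; the `SAMPLES` list and the `tophat_command` helper are illustrative assumptions, while the flags and file names are copied from the commands above.

```python
import shlex

# Two conditions (C1, C2) with two replicates each, as in the commands above.
SAMPLES = ["C1_R1", "C1_R2", "C2_R1", "C2_R2"]


def tophat_command(sample: str, threads: int = 8) -> list:
    """Build the Tophat argument list for one paired-end sample."""
    return [
        "tophat", "-p", str(threads),
        "-G", "genes.gtf",
        "-o", f"{sample}_thout",
        "genome",
        f"{sample}_1.fq", f"{sample}_2.fq",
    ]


commands = [tophat_command(sample) for sample in SAMPLES]
for command in commands:
    print(shlex.join(command))
```

On a machine where Tophat is installed, each argument list could be passed to `subprocess.run(command, check=True)` instead of being printed.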

http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html

Slide44

2. Transcript assembly and abundance quantification: Cufflinks

Cufflinks: a program that assembles aligned RNA-Seq reads into transcripts, estimates their abundances, and tests for differential expression and regulation transcriptome-wide.

http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html

$ cufflinks -p 8 -o C1_R1_clout C1_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C1_R2_clout C1_R2_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R1_clout C2_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R2_clout C2_R2_thout/accepted_hits.bam

Slide45

3. Final Transcriptome assembly: Cuffmerge

$ cuffmerge -g genes.gtf -s genome.fa -p 8 assemblies.txt

$ more assemblies.txt
./C1_R1_clout/transcripts.gtf
./C1_R2_clout/transcripts.gtf
./C2_R1_clout/transcripts.gtf
./C2_R2_clout/transcripts.gtf
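The `assemblies.txt` manifest is just a list of the per-sample Cufflinks output files, one per line. A small Python sketch, assuming the same four sample names and `_clout` output directories used above (the helper name is hypothetical), can produce that text:

```python
SAMPLES = ["C1_R1", "C1_R2", "C2_R1", "C2_R2"]


def assemblies_manifest(samples: list) -> str:
    """Return one transcripts.gtf path per sample, one path per line."""
    return "".join(f"./{sample}_clout/transcripts.gtf\n" for sample in samples)


manifest = assemblies_manifest(SAMPLES)
print(manifest, end="")
```

Writing the result to disk, for example with `pathlib.Path("assemblies.txt").write_text(manifest)`, gives the file that is passed to cuffmerge.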

http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html

Slide46

4. Differential Expression: Cuffdiff

Cuffdiff: a program that compares transcript abundance between samples.

http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html

$ cuffdiff -o diff_out -b genome.fa -p 8 -L C1,C2 -u merged_asm/merged.gtf \
    ./C1_R1_thout/accepted_hits.bam,./C1_R2_thout/accepted_hits.bam \
    ./C2_R1_thout/accepted_hits.bam,./C2_R2_thout/accepted_hits.bam

Slide47

Alternative Pipeline with HTSeq

Tophat2, then HTSeq, then DESeq2/edgeR

$ htseq-count -f bam -s no C1_R1_thout/sorted.bam genes.gtf > hsc/C1_R1.counts
$ htseq-count -f bam -s no C1_R2_thout/sorted.bam genes.gtf > hsc/C1_R2.counts
$ htseq-count -f bam -s no C2_R1_thout/sorted.bam genes.gtf > hsc/C2_R1.counts
$ htseq-count -f bam -s no C2_R2_thout/sorted.bam genes.gtf > hsc/C2_R2.counts
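The per-sample count files written by the htseq-count commands above are usually combined into a single gene-by-sample table before differential expression analysis. The Python sketch below shows one way to do that on in-memory text. It assumes each file has two tab-separated columns (gene ID, count) and that summary rows start with `__` (for example `__no_feature`); the gene names, counts and helper names are made up for illustration.

```python
def parse_counts(text: str) -> dict:
    """Parse 'gene_id<TAB>count' lines into a dict, skipping summary rows."""
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        gene_id, value = line.split("\t")
        # Assumption: summary rows such as "__no_feature" start with "__".
        if gene_id.startswith("__"):
            continue
        counts[gene_id] = int(value)
    return counts


def merge_counts(per_sample: dict) -> dict:
    """Return {gene_id: [count in each sample]}, with samples in insertion order."""
    genes = sorted(set().union(*(counts.keys() for counts in per_sample.values())))
    return {gene: [per_sample[s].get(gene, 0) for s in per_sample] for gene in genes}


# Hypothetical contents of two count files.
c1_r1_text = "geneA\t10\ngeneB\t0\n__no_feature\t5\n"
c1_r2_text = "geneA\t12\ngeneC\t3\n__no_feature\t7\n"

table = merge_counts({
    "C1_R1": parse_counts(c1_r1_text),
    "C1_R2": parse_counts(c1_r2_text),
})
for gene, row in table.items():
    print(gene, *row, sep="\t")
```

Genes missing from a sample are filled with zero here; whether that is appropriate depends on how the count files were produced.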

Slide48

HTSeq Output: Gene Count Table

…

Slide49

DESeq2

http://www.bioconductor.org/packages/2.12/bioc/html/DESeq2.html

April 21st workshop!

Slide50

Downstream Analysis

Pathway and functional analysis:

Gene Ontology over-representation

Gene Set Enrichment Analysis (GSEA)

Signaling Pathway Impact Analysis

Software:

DAVID, GSEA, WGCNA, Blast2go, topGO, BinGO...

IPA, GeneGO MetaCore, iPathway Guide

Slide51

Outline

Slide52

Integrative Genomics Viewer (IGV)

http://www.broadinstitute.org/igv

Available on HPC. Use ‘module load igv’ and ‘igv’

Slide53

Visualizing RNA-Seq mapping with IGV

http://www.broadinstitute.org/igv/UserGuide

Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Thorvaldsdóttir H et al. Brief Bioinform. 2013

Slide54

Genomic Data Visualization

R packages for plots:

ggplot2

ggbio

GenomeGraphs

Slide55

Outline

Slide56

Galaxy: Web based platform for analysis of large datasets

http://hpc-galaxy.oit.uci.edu/root

https://main.g2.bx.psu.edu/

Galaxy: A platform for interactive large-scale genome analysis. Genome Res. 2005. 15: 1451-1455

Slide57

Outline

Slide58

What is ChIP-Seq?

Chromatin Immunoprecipitation (ChIP) Sequencing

ChIP: A technique of precipitating a protein antigen out of solution using an antibody that specifically binds to the protein.

Sequencing: A technique to determine the order of nucleotide bases in a molecule of DNA.

Used in combination to study the interactions between protein and DNA.

Slide59

ChIP-Seq Applications

Enables the accurate profiling of:

Transcription factor binding sites

Polymerases

Histone modification sites

DNA methylation

Slide60

A View of ChIP-Seq Data

Typically reads (35-55bp) are quite sparsely distributed over the genome.

Controls (i.e. no pull-down by antibody) often show smaller peaks at the same locations

Rozowsky et al. Nature Biotech, 2009

Slide61

ChIP-Seq Analysis Pipeline

Sequencing

Base Calling

Short read Sequences

Read QC

Short read Alignment

Peak Calling

Enriched Regions

Motif Discovery

Differential peaks

Combine with gene expression

Visualization with genome browser

Slide62

ChIP-Seq: Identification of Peaks

There are several methods to identify peaks, but they mainly fall into two categories: tag density and directional scoring.

In the tag density method, the program searches for large clusters of overlapping sequence tags within a fixed-width sliding window across the genome.

In directional scoring methods, the bimodal pattern in the strand-specific tag densities is used to identify protein binding sites.
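The tag density idea can be sketched in a few lines of Python: slide a fixed-width window along one chromosome and report windows that contain at least a minimum number of read start positions. This is a toy illustration only, not how SISSRs or MACS work internally (real peak callers also model background, controls and strand information); the function name, parameters and example positions are all hypothetical.

```python
from bisect import bisect_left


def tag_density_peaks(read_starts: list, window: int, step: int, min_tags: int) -> list:
    """Return (window_start, tag_count) for windows with at least min_tags reads."""
    if not read_starts:
        return []
    starts = sorted(read_starts)
    peaks = []
    for window_start in range(0, starts[-1] + 1, step):
        window_end = window_start + window
        # Number of read starts in the half-open interval [window_start, window_end).
        count = bisect_left(starts, window_end) - bisect_left(starts, window_start)
        if count >= min_tags:
            peaks.append((window_start, count))
    return peaks


# Hypothetical read start positions: a cluster near 100-130 plus scattered reads.
read_starts = [105, 110, 118, 121, 130, 560, 900, 905]
peaks = tag_density_peaks(read_starts, window=50, step=25, min_tags=4)
print(peaks)
```

Overlapping windows that pass the threshold would normally be merged into a single enriched region before reporting.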

Determining the exact binding sites from short reads generated from ChIP-Seq experiments:

SISSRs (Site Identification from Short Sequence Reads) (Jothi et al., 2008)

MACS (Model-based Analysis of ChIP-Seq) (Zhang et al., 2008)

Slide63

ChIP-Seq: Output

A list of enriched locations

Can be used:

In combination with RNA-Seq, to determine the biological function of transcription factors

To identify genes co-regulated by a common transcription factor

To identify common transcription factor binding motifs

Slide64

Resources in NGS data analysis

Stackoverflow.com

Slide65

Languages in Bioinformatics

Slide66

Summary

NGS technologies are transforming molecular biology.

Bioinformatics analysis is a crucial part of NGS applications

Data formats, terminology, general workflow

Analysis pipeline

Software for various NGS applications

RNA-Seq and ChIP-Seq data analysis

Pathway Analysis

Data visualization

Bioinformatics resources

Thank you!

Slide67