Slide1
Introduction To Next Generation Sequencing (NGS) Data Analysis
Jenny Wu
Slide2
Outline
Goals: Practical guide to NGS data processing
Bioinformatics in NGS data analysis
Basics: terminology, data formats, general workflow, etc.
Data Analysis Pipeline
Sequence QC and preprocessing
Downloading reference sequences
Sequence mapping
Downstream analysis workflow and software
RNA-Seq data analysis
Concepts: spliced alignment, normalization, coverage, differential expression.
Popular RNA-Seq pipelines: Tuxedo suite vs. Tophat-HTSeq
Data visualization with Genome Browsers and R packages.
Downstream Pathway analysis
ChIP-Seq data analysis workflow and software
NGS bioinformatics resources
Summary
Slide3
Why Next Generation Sequencing
One can generate hundreds of millions of short sequences (up to 250 bp) in a single run, in a short period of time, at a low per-base cost.
Illumina/Solexa GA II, HiSeq 2500, 3000, X
Roche/454 FLX, Titanium
Life Technologies/Applied Biosystems SOLiD
Reviews:
Michael Metzker (2010) Nature Reviews Genetics 11:31
Quail et al. (2012) BMC Genomics 13:341
Slide4
Why Bioinformatics
(wall.hms.harvard.edu)
Informatics
Slide5
Bioinformatics Challenges in NGS Data Analysis
"Big Data" (thousands of millions of lines long)
Can't do 'business as usual' with familiar tools
Impossible memory usage and execution time
Manage, analyze, store, transfer and archive huge files
Need for powerful computers and expertise
Informatics groups must manage compute clusters
New algorithms and software are required; they are often open source and Unix/Linux based.
Collaboration of IT experts, bioinformaticians and biologists
Slide6
Basic NGS Workflow
Olson et al.
Slide7
NGS Data Analysis Overview
Olson et al.
Slide8
Slide9
Terminology
Experimental Design:
Coverage (sequencing depth): The number of read nucleotides mapped to a given position, i.e. how many reads cover that position.
average coverage = read length * number of reads / genome size
Paired-End Sequencing: Both ends of the DNA fragment are sequenced, allowing highly precise alignment.
Multiplexing/Demultiplexing: "Barcode" sequences are added to each sample so that samples can be distinguished, which allows a large number of samples to be sequenced on one lane.
Data analysis:
Quality Score: Each called base comes with a quality score that measures the probability of a base-call error.
Mapping: Aligning reads to a reference to identify their origin.
Assembly: Merging fragments of DNA in order to reconstruct the original sequence.
Duplicate reads: Reads that are identical. Can be identified after mapping.
Multi-reads: Reads that can be mapped to multiple locations equally well.
Slide10
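The average-coverage formula from the Terminology slide can be sketched in Python as follows. This is only an illustration: the function names and the example numbers (read length, read count, genome size) are assumptions chosen for the sketch, not values from the slides.

```python
import math


def average_coverage(read_length: int, num_reads: int, genome_size: int) -> float:
    """average coverage = read length * number of reads / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return read_length * num_reads / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Invert the formula: reads required to reach a target average coverage."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return math.ceil(target_coverage * genome_size / read_length)


if __name__ == "__main__":
    # Hypothetical run: 600 million reads of 100 bp on a 3 Gb genome.
    print(average_coverage(100, 600_000_000, 3_000_000_000))
    # Hypothetical target: 30x average coverage with the same read length.
    print(reads_needed(30, 100, 3_000_000_000))
```

The same relation is handy in the other direction when planning a run, which is what the second helper illustrates.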
What does the data look like?
Common NGS Data Formats
For a full list, go to http://genome.ucsc.edu/FAQ/FAQformat.html
Slide11
File Formats
Reference sequences, reads:
FASTA
FASTQ (FASTA with quality scores)
Alignments:
SAM (Sequence Alignment/Map)
BAM (Binary version of SAM)
Features, annotation, scores:
GFF3/GTF (General Feature Format)
BED/BigBed
WIG/BigWig
http://genome.ucsc.edu/FAQ/FAQformat.html
Slide12
FASTA Format (Reference Seq)
Slide13
FASTQ Format (Illumina Example)
@DJG84KN1:272:D17DBACXX:2:1101:12432:5554 1:N:0:AGTCAA
CAGGAGTCTTCGTACTGCTTCTCGGCCTCAGCCTGATCAGTCACACCGTT
+
BCCFFFDFHHHHHIJJIJJJJJJJIJJJJJJJJJJIJJJJJJJJJIJJJJ
@DJG84KN1:272:D17DBACXX:2:1101:12454:5610 1:N:0:AG
AAAACTCTTACTACATCAGTATGGCTTTTAAAACCTCTGTTTGGAGCCAG
+
@@@DD?DDHFDFHEHIIIHIIIIIBBGEBHIEDH=EEHI>FDABHHFGH2
@DJG84KN1:272:D17DBACXX:2:1101:12438:5704 1:N:0:AG
CCTCCTGCTTAAAACCCAAAAGGTCAGAAGGATCGTGAGGCCCCGCTTTC
+
CCCFFFFFHHGHHJIJJJJJJJI@HGIJJJJIIIJGIGIHIJJJIIIIJJ
@DJG84KN1:272:D17DBACXX:2:1101:12340:5711 1:N:0:AG
GAAGATTTATAGGTAGAGGCGACAAACCTACCGAGCCTGGTGATAGCTGG
+
CCCFFFFFHHHHHGGIJJJIJJJJJJIJJIJJJJJGIJJJHIIJJJIJJJ
Read Record (four lines):
Header
Read Bases
Separator (with optional repeated header)
Read Quality Scores
Header fields:
Flow Cell ID
Lane
Tile
Tile Coordinates
Barcode
NOTE: for paired-end runs, there is a second file with one-to-one corresponding headers and reads.
(Passarelli, 2012)
Slide14
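The four-line FASTQ record layout and the header fields labelled on the FASTQ slide can be read with a few lines of Python. This is a minimal sketch under stated assumptions: quality characters are taken to be Phred+33 encoded (the usual convention for recent Illumina output, but not stated on the slide), the header field names other than flow cell, lane, tile, tile coordinates and barcode follow the common Illumina layout, and all function names are made up for this example.

```python
import io
from typing import Dict, Iterator, List, Tuple


def read_fastq(handle) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, bases, quality) for each four-line FASTQ record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line) stops the iteration
            return
        bases = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(bases) != len(quality):
            raise ValueError(f"bases/quality length mismatch in {header!r}")
        yield header[1:], bases, quality


def parse_illumina_header(header: str) -> Dict[str, str]:
    """Split a header into named fields (field names are assumptions)."""
    location, _, read_info = header.partition(" ")
    instrument, run, flowcell, lane, tile, x, y = location.split(":")
    fields = {
        "instrument": instrument, "run": run, "flowcell": flowcell,
        "lane": lane, "tile": tile, "x": x, "y": y,
    }
    if read_info:
        read, filtered, control, barcode = read_info.split(":")
        fields.update(read=read, filtered=filtered, control=control, barcode=barcode)
    return fields


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode quality characters to Phred scores (offset 33 is assumed)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


# First record from the slide.
EXAMPLE = """@DJG84KN1:272:D17DBACXX:2:1101:12432:5554 1:N:0:AGTCAA
CAGGAGTCTTCGTACTGCTTCTCGGCCTCAGCCTGATCAGTCACACCGTT
+
BCCFFFDFHHHHHIJJIJJJJJJJIJJJJJJJJJJIJJJJJJJJJIJJJJ
"""

if __name__ == "__main__":
    for header, bases, quality in read_fastq(io.StringIO(EXAMPLE)):
        fields = parse_illumina_header(header)
        scores = phred_scores(quality)
        print(fields["flowcell"], fields["lane"], fields["tile"], fields["barcode"])
        print(len(bases), min(scores), max(scores))
```

For real data a maintained parser is the safer choice; the sketch only shows how the record and header pieces fit together.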
GFF3 and GTF format
Khetani RS et al.
GTF format:
GFF3 format:
Slide15
Slide16
General Data Pipeline
Slide17
Why QC?
Sequencing runs cost money
Consequences of not assessing the data:
Sequencing a poor library on multiple runs: throwing money away!
Data analysis costs money and time
Cost of analyzing data, CPU time $$
Cost of storing raw sequence data $$$
Hours of analysis could be wasted $$$$
Downstream analysis can be incorrect.
Slide18
How to QC?
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ (available on HPC)
Tutorial: http://www.youtube.com/watch?v=bz93ReOv87Y
$ module load fastqc
$ fastqc s_1_1.fastq
Slide19
FastQC: Example
Slide20
Slide21
Premade Genome Sequence Indexes and Annotation
http://ccb.jhu.edu/software/tophat/igenomes.shtml
Slide22
The UCSC Genome Browser Homepage
Get genome annotation here!
General information
Specific information: new features, current status, etc.
Get reference sequences here!
Slide23
Downloading Reference Sequences
Slide24
Downloading Reference Annotation
Slide25
Slide26
Sequence Mapping Challenges
Alignment (mapping) is often the first step once analysis-ready reads are obtained.
The task: to align sequencing reads against a known reference.
Difficulties: high volume of data, size of reference genome, computation time, read length constraints, ambiguity caused by repeats and sequencing errors.
Slide27
How to choose an aligner?
There are many short read aligners, and they vary a lot in performance (accuracy, memory usage, speed, flexibility, etc.).
Factors to consider: application, platform, read length, downstream analysis, etc.
Constant trade-off between speed and sensitivity (e.g. MAQ vs. Bowtie2).
Guaranteed high accuracy will take longer.
Popular choices: Bowtie2, BWA, Tophat2, STAR.
Slide28
Slide29
Application Specific Software
Slide30
Slide31
Slide32
Two Major Approaches
1. Gene or exon level differential expression (DE): DESeq2, edgeR, DEXSeq…
2. Transcript assembly: Trinity, Velvet-Oases, Trans-ABySS, Cufflinks, Scripture…
Slide33
RNA-Seq Pipeline for DE
Slide34
RNA-Seq: Spliced Alignment
http://en.wikipedia.org/wiki/File:RNA-Seq-alignment.png
"Systematic evaluation of spliced alignment programs for RNA-seq data", Nature Methods, 2013
Some reads will span two different exons
Need long enough reads to be able to reliably map both sides
Use a splice-aware aligner!
Slide35
How much sequence do I need?
Oversimplified answer: 20-50M paired-end (PE) reads per sample (Human/Mouse)
Depends on:
Size and complexity of your transcriptome.
Goal of experiment: DE, transcript discovery.
Tissue type, library type, RNA quality, read length, single-end…
Slide36
RNA-Seq: Normalization
Gene-length bias
Differential expression of longer genes is more likely to be called significant, because longer genes yield more reads
RNA-Seq normalization methods:
Scaling factor based: Total count, upper quartile, median, DESeq, TMM in edgeR
Quantile, RPKM (cufflinks)
ERCC
Normalize by gene length and by number of reads mapped, e.g. RPKM/FPKM (reads/fragments per kilobase per million mapped reads)
Slide37
Definition of Expression levels
RPKM: Reads Per Kilobase per Million mapped reads
FPKM: Fragments Per Kilobase per Million mapped reads (for paired-end reads)
Mortazavi et al., 2008
Slide38
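The RPKM definition above (reads per kilobase of gene per million mapped reads) translates directly into a one-line calculation. The sketch below assumes the usual form RPKM = 10^9 * C / (N * L), with C the reads mapped to the gene, N the total mapped reads and L the gene length in bp. The function name and the example counts are hypothetical.

```python
def rpkm(read_count: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of gene per million mapped reads.

    For paired-end data, pass fragment counts instead of read counts
    to obtain FPKM with the same arithmetic.
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


if __name__ == "__main__":
    total = 20_000_000  # hypothetical library: 20 million mapped reads
    # Two hypothetical genes with the same expression level per base:
    # the longer gene collects twice the reads but gets the same RPKM.
    print(rpkm(500, 2_000, total))
    print(rpkm(1_000, 4_000, total))
```

The second pair of numbers illustrates the gene-length bias from the normalization slide: raw counts differ by a factor of two while the length-normalized values agree.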
RNA-Seq: Differential Expression
Discrete vs. continuous data:
Microarray fluorescence intensity data: continuous
Modeled using a normal distribution
RNA-Seq read count data: discrete
Modeled using a negative binomial distribution
Microarray software can NOT be directly used to analyze RNA-Seq data!
Slide39
Slide40
Popular RNA-Seq DE Pipeline
Pipeline 1 (The Tuxedo Protocol)
Pipeline 2 (The Alternative Protocol)
Slide41
http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html
Classic RNA-Seq (Tuxedo Protocol)
1. Spliced read mapping
2. Transcript assembly and quantification
3. Merge assembled transcripts from multiple samples
4. Differential expression analysis
SAM/BAM
GTF/GFF
Slide42
Classic vs. Advanced RNA-Seq workflow
Slide43
1. Spliced Alignment: Tophat
Tophat: a spliced short read aligner for RNA-seq.
$ tophat -p 8 -G genes.gtf -o C1_R1_thout genome C1_R1_1.fq C1_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C1_R2_thout genome C1_R2_1.fq C1_R2_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R1_thout genome C2_R1_1.fq C2_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R2_thout genome C2_R2_1.fq C2_R2_2.fq
http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html
Slide44
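The four Tophat commands on the spliced-alignment slide differ only in the sample name, so they can be generated rather than typed by hand. The sketch below only builds and prints the command lines; it uses exactly the flags and file-naming pattern shown on the slide, while the helper name and the sample list are assumptions for illustration.

```python
import shlex
from typing import List


def tophat_command(sample: str, threads: int = 8,
                   annotation: str = "genes.gtf", index: str = "genome") -> List[str]:
    """Build one paired-end Tophat command following the slide's pattern."""
    return [
        "tophat", "-p", str(threads), "-G", annotation,
        "-o", f"{sample}_thout", index,
        f"{sample}_1.fq", f"{sample}_2.fq",
    ]


# Two conditions (C1, C2) with two replicates each (R1, R2), as on the slide.
SAMPLES = [f"C{cond}_R{rep}" for cond in (1, 2) for rep in (1, 2)]

if __name__ == "__main__":
    for sample in SAMPLES:
        # Printing keeps the sketch side-effect free; the argument list could
        # instead be handed to subprocess.run on a machine with Tophat installed.
        print(shlex.join(tophat_command(sample)))
```

Building the argument list once also keeps the thread count and annotation file consistent across samples.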
2. Transcript assembly and abundance quantification: Cufflinks
Cufflinks: a program that assembles aligned RNA-Seq reads into transcripts, estimates their abundances, and tests for differential expression and regulation transcriptome-wide.
http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html
$ cufflinks -p 8 -o C1_R1_clout C1_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C1_R2_clout C1_R2_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R1_clout C2_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R2_clout C2_R2_thout/accepted_hits.bam
Slide45
3. Final Transcriptome assembly: Cuffmerge
$ cuffmerge -g genes.gtf -s genome.fa -p 8 assemblies.txt
$ more assemblies.txt
./C1_R1_clout/transcripts.gtf
./C1_R2_clout/transcripts.gtf
./C2_R1_clout/transcripts.gtf
./C2_R2_clout/transcripts.gtf
http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html
Slide46
4. Differential Expression: Cuffdiff
Cuffdiff: a program that compares transcript abundance between samples.
http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html
$ cuffdiff -o diff_out -b genome.fa -p 8 -L C1,C2 -u merged_asm/merged.gtf ./C1_R1_thout/accepted_hits.bam,./C1_R2_thout/accepted_hits.bam ./C2_R1_thout/accepted_hits.bam,./C2_R2_thout/accepted_hits.bam
Slide47
Alternative Pipeline with HTSeq
Tophat2, HTSeq, DESeq2/edgeR
$ htseq-count -f bam -s no C1_R1_thout/sorted.bam genes.gtf > hsc/C1_R1.counts
$ htseq-count -f bam -s no C1_R2_thout/sorted.bam genes.gtf > hsc/C1_R2.counts
$ htseq-count -f bam -s no C2_R1_thout/sorted.bam genes.gtf > hsc/C2_R1.counts
$ htseq-count -f bam -s no C2_R2_thout/sorted.bam genes.gtf > hsc/C2_R2.counts
Slide48
HTSeq Output: Gene Count Table
Slide49
DESeq2
http://www.bioconductor.org/packages/2.12/bioc/html/DESeq2.html
April 21st workshop!
Slide50
Downstream Analysis
Pathway and functional analysis:
Gene Ontology over-representation
Gene Set Enrichment Analysis (GSEA)
Signaling Pathway Impact Analysis
Software:
DAVID, GSEA, WGCNA, Blast2GO, topGO, BiNGO...
IPA, GeneGO MetaCore, iPathway Guide
Slide51
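Gene Ontology over-representation, listed on the Downstream Analysis slide, is commonly tested with a hypergeometric (one-sided Fisher's exact) test. The slides do not say which statistic the listed tools use, so the sketch below is an assumption about the general technique, with made-up gene counts and a hypothetical function name.

```python
from math import comb


def overrepresentation_pvalue(total_genes: int, genes_in_term: int,
                              de_genes: int, de_in_term: int) -> float:
    """P(X >= de_in_term) when X is hypergeometric.

    total_genes   : size of the background gene set
    genes_in_term : background genes annotated with the GO term
    de_genes      : number of differentially expressed (DE) genes
    de_in_term    : DE genes annotated with the GO term
    """
    upper = min(genes_in_term, de_genes)
    denominator = comb(total_genes, de_genes)
    numerator = sum(
        comb(genes_in_term, k) * comb(total_genes - genes_in_term, de_genes - k)
        for k in range(de_in_term, upper + 1)
    )
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical numbers: 20,000 background genes, 200 in the term,
    # 500 DE genes, 15 of which carry the term (5 expected by chance).
    print(overrepresentation_pvalue(20_000, 200, 500, 15))
```

In practice the test is repeated for every term, so the resulting p-values need a multiple-testing correction before interpretation.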
Slide52
Integrative Genomics Viewer (IGV)
http://www.broadinstitute.org/igv
Available on HPC. Use 'module load igv' and 'igv'
Slide53
Visualizing RNA-Seq mapping with IGV
http://www.broadinstitute.org/igv/UserGuide
Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Thorvaldsdóttir H et al. Brief Bioinform. 2013
Slide54
Genomic Data Visualization
R packages for plots:
ggplot2
ggbio
GenomeGraphs
Slide55
Slide56
Galaxy: Web based platform for analysis of large datasets
http://hpc-galaxy.oit.uci.edu/root
https://main.g2.bx.psu.edu/
Galaxy: A platform for interactive large-scale genome analysis. Genome Res. 2005. 15: 1451-1455
Slide57
Slide58
What is ChIP-Seq?
Chromatin Immunoprecipitation (ChIP) Sequencing
ChIP: A technique for precipitating a protein antigen out of solution using an antibody that specifically binds to the protein.
Sequencing: A technique to determine the order of nucleotide bases in a molecule of DNA.
Used in combination to study the interactions between proteins and DNA.
Slide59
ChIP-Seq Applications
Enables the accurate profiling of:
Transcription factor binding sites
Polymerases
Histone modification sites
DNA methylation
Slide60
A View of ChIP-Seq Data
Typically, reads (35-55 bp) are quite sparsely distributed over the genome.
Controls (i.e. no pull-down by antibody) often show smaller peaks at the same locations.
Rozowsky et al., Nature Biotech, 2009
Slide61
ChIP-Seq Analysis Pipeline
Sequencing
Short read sequences
Base calling
Read QC
Short read alignment
Enriched regions
Peak calling
Combine with gene expression
Motif discovery
Visualization with genome browser
Differential peaks
Slide62
ChIP-Seq: Identification of Peaks
Several methods exist to identify peaks, but they mainly fall into two categories:
Tag density
Directional scoring
In the tag density method, the program searches for large clusters of overlapping sequence tags within a fixed-width sliding window across the genome.
In directional scoring methods, the bimodal pattern in the strand-specific tag densities is used to identify protein binding sites.
Determining the exact binding sites from short reads generated from ChIP-Seq experiments:
SISSRs (Site Identification from Short Sequence Reads) (Jothi et al., 2008)
MACS (Model-based Analysis of ChIP-Seq) (Zhang et al., 2008)
Slide63
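The tag density idea from the peak identification slide, counting tags inside a fixed-width window that slides along the genome, can be sketched as below. This is a toy illustration rather than how SISSRs or MACS work internally: the window width, step size, tag threshold and function name are all assumptions, and no background model or control sample is used.

```python
from bisect import bisect_left
from typing import List, Sequence, Tuple


def tag_density_windows(tag_positions: Sequence[int], window: int = 200,
                        step: int = 50, min_tags: int = 5) -> List[Tuple[int, int, int]]:
    """Return (start, end, count) for each window holding at least min_tags tags.

    tag_positions are tag (read) start coordinates on a single chromosome;
    each window covers the half-open interval [start, start + window).
    """
    positions = sorted(tag_positions)
    hits: List[Tuple[int, int, int]] = []
    if not positions:
        return hits
    start = 0
    while start <= positions[-1]:
        first = bisect_left(positions, start)           # first tag >= start
        last = bisect_left(positions, start + window)   # first tag >= window end
        count = last - first
        if count >= min_tags:
            hits.append((start, start + window, count))
        start += step
    return hits


if __name__ == "__main__":
    # Hypothetical tags: a cluster of five near position 100 plus one stray tag.
    tags = [100, 110, 120, 130, 140, 1000]
    for start, end, count in tag_density_windows(tags, window=100, step=50, min_tags=5):
        print(start, end, count)
```

Overlapping windows that pass the threshold would normally be merged into a single enriched region before reporting.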
ChIP-Seq: Output
A list of enriched locations
Can be used:
In combination with RNA-Seq, to determine the biological function of transcription factors
To identify genes co-regulated by a common transcription factor
To identify common transcription factor binding motifs
Slide64
Resources in NGS data analysis
Stackoverflow.com
Slide65
Languages in Bioinformatics
Slide66
Summary
NGS technologies are transforming molecular biology.
Bioinformatics analysis is a crucial part of NGS applications
Data formats, terminology, general workflow
Analysis pipeline
Software for various NGS applications
RNA-Seq and ChIP-Seq data analysis
Pathway analysis
Data visualization
Bioinformatics resources
Thank you!
Slide67