Slide1
PHANG LAB TALK
Tzu L Phang Ph.D.
Assistant Professor
Department of Medicine
Division of Pulmonary Sciences & Critical Care Medicine
Slide2 What I do:
Perform high-throughput data analysis for the scientific community; microarray and Next Generation Sequencing datasets
Provide analysis solutions for experts and novice users alike
Develop multi-media approaches to disseminate translational science education
Studying the role of long non-coding RNA; second talk
Establishing the Bioinformatics Consultation and Analysis Core to help researchers and scientists design, analyze and interpret their experiments.
Slide3 Today’s Talk Layout
The center of my universe: R and Bioconductor
Collaboration with Biologists
5x5; simple way to teach and contribute
Next Generation Sequencing (NGS)
Slide5 R
r-project.org
Slide6 R is hot
http://blog.revolutionanalytics.com/r-is-hot/
Slide7 R in the media
Slide8 Bioconductor
www.bioconductor.org
Statistical tools in R for high-throughput data analysis
6-month update cycle. Release 2.10 with 554 software packages (45 new)
Analysis workflow
Oligonucleotide Arrays
Sequence Analysis
Variants
Accessing Annotation Data
High-throughput Assays
Slide9 The Website
www.bioconductor.org
Slide10 Categories
Slide11 Categories cont…
Slide12 Typical Analysis Routine
Slide13 R is easy
Slide14 Result output
Slide15 Other Resources
http://www.rseek.org/
http://www.statmethods.net/
http://crantastic.org/
http://stackoverflow.com/
Slide16 Today’s Talk Layout
The center of my universe: R and Bioconductor
Collaboration with Biologists
5x5; simple way to teach and contribute
Next Generation Sequencing (NGS)
Slide17 Collaboration
>1000 microarray chips / year
Affymetrix & Illumina platforms
Next Generation Sequencing: 25 free pilot projects.
Serve the Rocky Mountain region scientific community
Slide18 Collaboration - tips
Don’t be a data analyst – be a co-investigator
Suggest analysis approaches that are not obvious
Focus on the result, not the method
Always look for grant-writing opportunities
Understand the technical & biological system as thoroughly as possible – you will be surprised by what biologists miss informatically
Slide19 Example 1: Classification of Pituitary Tumors
Pituitary tumors are the most common type of brain tumor, found in 20% of people at autopsy and in 1/10,000 persons clinically. Based upon 2010 figures of a veteran population of 22.7 million, this translates into >225,000 veterans with pituitary tumors.
Currently no medical therapies exist for these tumors and surgical resection is the treatment of choice.
Recurrence rates approach 40%.
Understanding of the pathways to tumorigenesis and markers of aggressiveness and risk of recurrence would alter the intensity and cost of clinical care and may provide novel candidates and pathways to explore for new treatment options for these patients.
Principal Component Analysis
Slide22 Potential markers
Slide23 Outputs
Slide24 Example 2: Explore the artistic side!
Slide25
Slide26 Example 3: Unconventional Usage
Slide27
Slide28 Introduction
Crohn’s Disease (CD) is an Inflammatory Bowel Disease (IBD) that affects up to one million Americans (ages 15 to 30).
Discordance between monozygotic twins affected by CD provides evidence for an epigenetic role in the etiology of the disease.
We combined 2 microarray technologies to study these roles
CHARM array (Comprehensive High-throughput Array for Relative Methylation)
Gene Expression (Affymetrix Gene 1.0 ST)
Slide29
Slide30
Slide31 Research Informatics Integrated Core (RIIC)
Michael G. Kahn MD, PhD
CCTSI Co-Director & RIIC Core Director
Michael.Kahn@ucdenver.edu
Slide32 RIIC Organizational Model
Slide33 http://cctsi.ucdenver.edu/RIIC/Pages/ConsultationDataAnalysis.aspx
Slide34
Slide35
Slide36 5X5
http://cctsi.ucdenver.edu/5x5
Slide37 Demonstration
http://gcrc.ucdenver.edu/Videos/Informatics/5x5/SocialNetworking5x5.wmv
Slide38 Tools
Slide39 Podcast
Slide40
Slide41 TIES – Translational Informatics Education Support
Bridging the gap in translational research through education
Training biologists in informatics
Enhance collaboration through education and knowledge exchange
Bring awareness of the latest technical advances
Disseminate knowledge through innovation
Slide42 Next Generation Sequencing
The future is here…
Slide43 High Throughput Parallel Sequencing
http://www.youtube.com/watch?v=77r5p8IBwJk
Slide44
Slide45 Paradigm Shift
Standard “Sanger” sequencing
96 samples/day
Read length ~650 bp
Total = 450,000 bases of sequence data
454 – the game changer!
~400,000 different templates (reads)/day
Read length ~ 250 (at that time)
Total = 100,000,000 bases of sequence data
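The totals above are simply reads per day multiplied by read length. A minimal sketch of that arithmetic, using only the figures quoted on the slide; the function name is illustrative. Note that 96 reads of ~650 bp gives 62,400 bases, so the slide's Sanger total of 450,000 presumably counts more than one 96-sample run per day; it is taken as stated here rather than recomputed.

```python
def bases_per_day(reads_per_day: int, read_length_bp: int) -> int:
    """Total sequenced bases per day = number of reads x read length."""
    return reads_per_day * read_length_bp


# Figures quoted on the slide.
sanger_single_run = bases_per_day(96, 650)     # one 96-sample run at ~650 bp
sanger_total_on_slide = 450_000                # slide's stated daily total, taken as given
roche_454_total = bases_per_day(400_000, 250)  # ~400,000 reads at ~250 bp

fold_increase = roche_454_total / sanger_total_on_slide

print(f"Sanger, one 96-sample run: {sanger_single_run:,} bases")
print(f"454 total: {roche_454_total:,} bases/day")
print(f"Fold increase over the slide's Sanger total: {fold_increase:.0f}x")
```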
Slide46 The second generation
Roche (454)
http://454.com/
First on the market
Emulsion PCR and pyrosequencing
Illumina (Solexa)
http://www.illumina.com/
Second on the market
Bridge PCR and polymerase-based SBS
ABI (SOLiD)
http://solid.appliedbiosystems.com/
Third on the market
Emulsion PCR and ligase-based sequencing
Slide47 Single molecule sequencing
Helicos Biosciences
http://helicosbio.com
True Single Molecule Sequencing technology
Pacific Biosciences
http://www.pacificbiosciences.com
Single Molecule Real Time sequencing
Slide48 Portable Sequencer
Ion Torrent
Slide49 Others
Polonator
http://www.polonator.org
Emulsion PCR and ligase based sequencing
Used in the Personal Genome Project
Open platform, open source
Cheap/affordable
Complete Genomics
http://www.completegenomics.com
Specializing in human genome sequencing
Slide50 Type of read data
Base Space or Color Space
Paired end or single end
Stranded or Unstranded
Slide51 Short Reads
Short reads from NGS are challenging (Solexa ~36 bp, now HiSeq 100 bp single pass)
Very hard to assemble whole genome
Especially on repeat regions
Requires many fold coverage
New and faster algorithms are needed for many traditional bioinformatics operations
Reads are getting longer – another moving target. (2x250)
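"Many fold coverage" can be made concrete with the usual estimate: coverage = number of reads x read length / genome size. A minimal sketch, assuming a round 3 billion bp genome and a 30x target; both numbers are illustrative assumptions, not figures from the talk.

```python
import math


def fold_coverage(num_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Average number of times each base of the genome is sequenced."""
    return num_reads * read_length_bp / genome_size_bp


def reads_needed(target_coverage: float, read_length_bp: int, genome_size_bp: int) -> int:
    """Number of reads required to reach a target average coverage."""
    return math.ceil(target_coverage * genome_size_bp / read_length_bp)


GENOME_SIZE_BP = 3_000_000_000  # assumption: round figure for a human-sized genome
TARGET_COVERAGE = 30            # assumption: illustrative target, not from the talk

for read_length in (36, 100, 250):  # read lengths mentioned on the slide
    n = reads_needed(TARGET_COVERAGE, read_length, GENOME_SIZE_BP)
    print(f"{read_length} bp reads: {n:,} reads for {TARGET_COVERAGE}x coverage")
```

Shorter reads need proportionally more reads for the same coverage, which is part of why read length is a moving target for analysis tools.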
Slide52 Applications
An explosion of scientific innovation!!
New usages not directly foreseen by the original developers of the technology
Some envision the beginning of the next revolution, as with PCR: an NGS machine in every lab!!
Cheap high-volume sequencing – revisiting data collection and management systems
Slide53 RNA Sequencing
“Digital Gene Expression” or “RNA-Seq”
Truly accurate gene expression measurements
Can replace gene expression microarrays
25% more sensitive
Does not rely on hybridization (no %GC bias, no cross-hybridization between related genes)
Discover novel genes (and other kinds of RNA molecules)
One experiment found that 34% of human transcripts were not from known genes
Sultan et al, Science. 2008 Aug 15;321(5891):956-60.
Why is RNA-Seq better than microarray?
No predefined gene annotation, which makes discovering novel transcripts possible
Low, if any, background
Large dynamic range of expression levels, no upper limit for quantification
Reveals sequence variation, such as SNPs, in the transcript region
In Helicos single molecule sequencing: no PCR step, which removes amplification bias
Slide55 More information from RNA
Can capture true alternative splicing information
Sequence of splice-junctions
One study found 4,096 previously unknown splice junctions in 3,106 human genes
Different transcription start and end points for RNA molecules
Allelic variation (SNPs)
Small RNAs
Slide56
Slide57 Bottleneck: Data Analysis
Slide58 Informatics is the Bottleneck
Scientists are currently able to generate sequence data much faster/more easily than they are able to analyze it
Customized analysis / Bioinformatics consulting is needed for every project
Slide59 Bioinformatics Challenges
Need for a large amount of CPU power
Informatics groups must manage compute clusters
Challenges in parallelizing existing software or redesign of algorithms to work in a parallel environment
Another level of software complexity and challenges to interoperability
VERY large text files (millions of lines long)
Can’t do ‘business as usual’ with familiar tools such as Microsoft Excel.
Impossible memory usage and execution time
Sequence Quality filtering
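One way around the memory problem is to stream the file record by record instead of loading it whole. The sketch below reads four-line FASTQ records lazily and keeps only reads above a mean quality threshold. It assumes Phred+33 quality encoding and a threshold of 20; both are illustrative assumptions (older Illumina pipelines used a +64 offset), and the function names are hypothetical.

```python
import io
from typing import Iterator, TextIO, Tuple

PHRED_OFFSET = 33  # assumption: Phred+33 encoding; older Illumina files used +64


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) one record at a time (4 lines per record)."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual  # drop the leading '@'


def mean_quality(qual: str) -> float:
    """Mean Phred score of a quality string."""
    return sum(ord(c) - PHRED_OFFSET for c in qual) / len(qual)


def filter_reads(handle: TextIO, min_mean_quality: float) -> Iterator[Tuple[str, str, str]]:
    """Lazily keep only reads whose mean quality meets the threshold."""
    for name, seq, qual in read_fastq(handle):
        if qual and mean_quality(qual) >= min_mean_quality:
            yield name, seq, qual


# Tiny in-memory stand-in for a multi-gigabyte file (made-up reads).
demo = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!!!\n")
kept = [name for name, _, _ in filter_reads(demo, min_mean_quality=20)]
print(kept)
```

Because every step is a generator, memory use stays roughly constant regardless of file size; passing an open file handle instead of the `StringIO` works the same way.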
Slide60 Auer P. Statistical design and analysis of RNA sequencing data. Genetics. 2010.
Slide61 Data formats
Images
“Raw” basecalls with quality scores
Sequence reads aligned to reference genomes
Assemblies (contigs)
Variants (SNPs, indels, copy number variants)
Slide62 Hexadecimal mode
Decimal mode
Slide63 FASTQ
Raw
Slide64 SAM format
Example
Pileup format
Slide65
QNAME
FLAG
RNAME
POS
MAPQ
CIGAR
MRNM
MPOS
ISIZE
SEQ
QUAL
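The eleven mandatory fields above are tab-separated columns of each SAM alignment line. A minimal sketch of splitting one line into named fields follows; the example line is made up for illustration, and the field names follow the slide (MRNM, MPOS, ISIZE). The two flag bits checked (0x4 unmapped, 0x10 reverse strand) are the standard SAM FLAG bits.

```python
from collections import namedtuple

# Field names as listed on the slide, lower-cased.
SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar mrnm mpos isize seq qual",
)
INT_FIELDS = {"flag", "pos", "mapq", "mpos", "isize"}


def parse_sam_line(line: str) -> SamRecord:
    """Split one alignment line into its 11 mandatory fields (optional tags ignored)."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    values = [
        int(value) if name in INT_FIELDS else value
        for name, value in zip(SamRecord._fields, columns[:11])
    ]
    return SamRecord(*values)


def is_unmapped(record: SamRecord) -> bool:
    return bool(record.flag & 0x4)


def is_reverse_strand(record: SamRecord) -> bool:
    return bool(record.flag & 0x10)


# Hypothetical alignment line, for illustration only.
example_line = "read1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII"
record = parse_sam_line(example_line)
print(record.qname, record.rname, record.pos, record.cigar)
print("unmapped:", is_unmapped(record), "reverse strand:", is_reverse_strand(record))
```

Header lines (starting with `@`) are not handled here; for real work, SAMtools or a Bioconductor package is the better route.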
Slide66 CIGAR
M : match/mismatch
I : Insertion compared with reference
D : Deletion compared with reference
N : Skipped bases on reference
S : soft clipping (unaligned)
H : hard clipping
P : padding
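A CIGAR string is a run of count-plus-operation pairs using the letters above. The sketch below splits one into operations and works out how many reference bases and read bases it covers; the example string is made up, and the function names are illustrative.

```python
import re

CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP])")

# Operations that advance along the reference and along the read, respectively.
CONSUMES_REFERENCE = {"M", "D", "N"}
CONSUMES_READ = {"M", "I", "S"}


def parse_cigar(cigar: str):
    """Turn '10M2I5M' into [(10, 'M'), (2, 'I'), (5, 'M')]."""
    tokens = CIGAR_TOKEN.findall(cigar)
    # Reject strings containing anything other than count+operation pairs.
    if "".join(count + op for count, op in tokens) != cigar:
        raise ValueError(f"unrecognized CIGAR string: {cigar!r}")
    return [(int(count), op) for count, op in tokens]


def reference_span(cigar: str) -> int:
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


def read_length(cigar: str) -> int:
    """Number of read bases described (soft-clipped bases included)."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_READ)


example = "3S10M2I5M1D20M"  # hypothetical CIGAR, for illustration
print(parse_cigar(example))
print("reference span:", reference_span(example))
print("read length:", read_length(example))
```

Hard clipping (H) and padding (P) consume neither the reference nor the stored read sequence, so they drop out of both sums.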
Slide67
Slide68
Slide69
Slide70
Slide71
Slide72
Slide73 File Size
s_1_ILS4_sequence.txt [5.2 GB]
s_1_ILS4_sequence.fastq [3.3 GB]
s_1_ILS4_sequence.sam [4.5 GB]
s_1_ILS4_sequence.bam [995 MB]
s_1_ILS4_sequence.sorted.bam [696 MB]
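To put the sizes above side by side, the sketch below computes how much smaller each BAM file is than the SAM file. The sizes are the ones on the slide; treating 1 GB as 1000 MB is an assumption made for the arithmetic.

```python
# Sizes from the slide, in MB (assumption: 1 GB = 1000 MB).
sizes_mb = {
    "s_1_ILS4_sequence.txt": 5200,
    "s_1_ILS4_sequence.fastq": 3300,
    "s_1_ILS4_sequence.sam": 4500,
    "s_1_ILS4_sequence.bam": 995,
    "s_1_ILS4_sequence.sorted.bam": 696,
}

sam_size = sizes_mb["s_1_ILS4_sequence.sam"]
bam_ratio = sam_size / sizes_mb["s_1_ILS4_sequence.bam"]
sorted_bam_ratio = sam_size / sizes_mb["s_1_ILS4_sequence.sorted.bam"]

print(f"SAM -> BAM:        {bam_ratio:.1f}x smaller")
print(f"SAM -> sorted BAM: {sorted_bam_ratio:.1f}x smaller")
```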
Slide74 The Bible
Slide75 Utility Tools
SAMtools
Picard
USeq
Etc…
Slide76 Bioconductor Solution
Slide77
Slide78 A demonstration
Slide79
Slide80 Secondary Tools
Laboratory Management
Data mining and visualization
Project management for genome assembly
Pathway mapping (functional analysis of groups of genes)
Motif finding (for ChIP-Seq)
Slide81 Integration
Integrate information from different technologies on a single genome map
Genetic variation
Gene expression (mRNA levels)
Alternative splicing
Transcription factor binding
Methylation/histone status
Small RNA levels (gene regulatory molecules)
Non-coding RNA levels!
Slide82 Speed/Efficiency
New emphasis on efficient data structures and algorithms
Use of “old style” tools such as grep/sed/awk
Machine language programming
Currently a huge burst of programming creativity in an “anything goes” environment
A desperate scramble for tools that work
Huge duplication of effort in programming, but also in evaluating new software
Slide83
Slide84
Slide85 Amazon Web Services
http://aws.amazon.com/education/
Slide86 Future Directions
Sequencing will continue to get much faster and cheaper, by 4-10x per year for several more years.
Affordable complete human genome sequencing will be available as a clinical diagnostic tool within 2-3 years.
Data storage and analysis bottleneck
Data security/privacy issues
Slide87 Move to 1:52
Slide88 Field Trip
Slide89
Slide90
Slide91