PHANG LAB TALK Tzu L Phang Ph.D.

Uploaded On 2022-08-03

Presentation Transcript

Slide1

PHANG LAB TALK

Tzu L Phang Ph.D.

Assistant Professor

Department of Medicine

Division of Pulmonary Sciences & Critical Care Medicine

Slide2

What I do:

Perform high-throughput data analysis for the scientific community: microarray and Next Generation Sequencing datasets

Provide analysis solutions for experts and novice users alike

Develop multi-media approaches to disseminate translational science education

Study the role of long non-coding RNA (second talk)

Establish the Bioinformatics Consultation and Analysis Core to help researchers and scientists design, analyze and interpret their experiments

Slide3

Today’s Talk Layout

The center of my universe:

R and Bioconductor

Collaboration with Biologists

5x5; simple way to teach and contribute

Next Generation Sequencing (NGS)

Slide4

Today’s Talk Layout

The center of my universe:

R and Bioconductor

Collaboration with Biologists

5x5; simple way to teach and contribute

Next Generation Sequencing (NGS)

Slide5

R

r-project.org

Slide6

R is hot

http://blog.revolutionanalytics.com/r-is-hot/

Slide7

R in the media

Slide8

Bioconductor

www.bioconductor.org

Statistical tools in R for high-throughput data analysis

6-month release cycle. Release 2.10 has 554 software packages (45 new)

Analysis workflow

Oligonucleotide Arrays

Sequence Analysis

Variants

Accessing Annotation Data

High-throughput Assays

Slide9

The Website

www.bioconductor.org

Slide10

Categories

Slide11

Categories (cont.)

Slide12

Typical Analysis Routine

Slide13

R is easy

Slide14

Result output

Slide15

Other Resources

http://www.rseek.org/

http://www.statmethods.net/

http://crantastic.org/

http://stackoverflow.com/

Slide16

Today’s Talk Layout

The center of my universe:

R and Bioconductor

Collaboration with Biologists

5x5; simple way to teach and contribute

Next Generation Sequencing (NGS)

Slide17

Collaboration

>1000 microarray chips / year

Affymetrix & Illumina platforms

Next Generation Sequencing: 25 free pilot projects

Serve the Rocky Mountain region scientific community

Slide18

Collaboration - tips

Don’t be a data analyst – be a co-investigator

Suggest analysis approaches that are not obvious

Focus on the result, not method

Always look for grant-writing opportunities

Understand the technical & biological system as thoroughly as possible; you will be surprised by what biologists miss informatically

Slide19

Example 1: Classification of Pituitary Tumors

Pituitary tumors are the most common type of brain tumor, found in 20% of people at autopsy and in 1/10,000 persons clinically. Based upon 2010 figures of a veteran population of 22.7 million, this translates into >225,000 veterans with pituitary tumors.

Currently no medical therapies exist for these tumors, and surgical resection is the treatment of choice.

Recurrence rates approach 40%.

Understanding the pathways to tumorigenesis and the markers of aggressiveness and risk of recurrence would alter the intensity and cost of clinical care and may provide novel candidates and pathways to explore for new treatment options for these patients.

Slide20

Slide21

Principal Component Analysis

Slide22

Potential markers

Slide23

Outputs

Slide24

Example 2: Explore the artistic side!

Slide25

Slide26

Example 3: Unconventional Usage

Slide27

Slide28

Introduction

Crohn’s Disease (CD) is an Inflammatory Bowel Disease (IBD) that affects up to one million Americans (ages 15 to 30).

Discordance between monozygotic twins affected by CD provides evidence for an epigenetic role in the etiology of the disease.

We combined 2 microarray technologies to study these roles

CHARM array (Comprehensive High-throughput Array for Relative Methylation)

Gene Expression (Affymetrix Gene 1.0 ST)

Slide29

Slide30

Slide31

Research Informatics Integrated Core (RIIC)

Michael G. Kahn MD, PhD

CCTSI Co-Director & RIIC Core Director

Michael.Kahn@ucdenver.edu

Slide32

RIIC Organizational Model

Slide33

http://cctsi.ucdenver.edu/RIIC/Pages/ConsultationDataAnalysis.aspx

Slide34

Slide35

Slide36

5X5

http://cctsi.ucdenver.edu/5x5

Slide37

Demonstration

http://gcrc.ucdenver.edu/Videos/Informatics/5x5/SocialNetworking5x5.wmv

Slide38

Tools

Slide39

Podcast

Slide40

Slide41

TIES: Translational Informatics Education Support

Bridging the gap in translational research through education

Training biologists in informatics

Enhance collaboration through education and knowledge exchange

Bring awareness of the latest technical advances

Disseminate knowledge through innovation

Slide42

Next Generation Sequencing

The future is here ….

Slide43

High Throughput Parallel Sequencing

http://www.youtube.com/watch?v=77r5p8IBwJk

Slide44

Slide45

Paradigm Shift

Standard “Sanger” sequencing

96 samples/day

Read length ~650 bp

Total = 450,000 bases of sequence data

454: the game changer!

~400,000 different templates (reads)/day

Read length ~250 bp (at that time)

Total = 100,000,000 bases of sequence data
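
The totals on this slide can be checked with a few lines of arithmetic. The sketch below is illustrative only and the names in it are ours. Note that a single set of 96 reads at ~650 bp gives about 62,400 bases, so the 450,000-base Sanger total presumably reflects several runs per day (an assumption; the slide does not say).

```python
# Back-of-the-envelope throughput comparison using the figures on this slide.
# Variable and function names are illustrative; the numbers are the slide's approximations.

def bases_per_day(reads_per_day: int, read_length_bp: int) -> int:
    """Total bases produced per day = number of reads x read length."""
    return reads_per_day * read_length_bp

sanger_plate = bases_per_day(96, 650)     # one set of 96 samples
sanger_total = 450_000                    # daily total quoted on the slide
pyro_454 = bases_per_day(400_000, 250)    # 454 figures from the slide

print(sanger_plate)                       # 62400
print(pyro_454)                           # 100000000
print(round(pyro_454 / sanger_total))     # 222, i.e. roughly 200x more data per day
```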

Slide46

The second generation

Roche (454)

http://454.com/

First on the market

Emulsion PCR and pyrosequencing

Illumina (Solexa)

http://www.illumina.com/

Second on the market

Bridge PCR and polymerase-based SBS

ABI (SOLiD)

http://solid.appliedbiosystems.com/

Third on the market

Emulsion PCR and ligase-based sequencing

Slide47

Single molecule sequencing

Helicos Biosciences

http://helicosbio.com

True Single Molecule Sequencing technology

Pacific Biosciences

http://www.pacificbiosciences.com

Single Molecule Real Time sequencing

Slide48

Portable Sequencer

Ion Torrent

Slide49

Others

Polonator

http://www.polonator.org

Emulsion PCR and ligase based sequencing

Used in the Personal Genome Project

Open platform, open source

Cheap/affordable

Complete Genomics

http://www.completegenomics.com

Specializing in human genome sequencing

Slide50

Type of read data

Base Space or Color Space

Paired end or single end

Stranded or unstranded
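
As a rough illustration of base space versus color space: SOLiD-style color space records each pair of adjacent bases as one of four colors. Here is a minimal sketch, assuming the usual two-bit dinucleotide scheme (A=0, C=1, G=2, T=3, color = XOR of the two neighbors). The function names are hypothetical.

```python
# Two-bit codes for the bases; the color of a dinucleotide is the XOR of the codes.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = {v: k for k, v in BASE_CODE.items()}

def to_color_space(seq: str) -> str:
    """Encode a base-space read as its first base plus one color per adjacent pair."""
    colors = [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]
    return seq[0] + "".join(str(c) for c in colors)

def to_base_space(cs: str) -> str:
    """Decode a color-space read; each base depends on the previous one."""
    bases = [cs[0]]
    for color in cs[1:]:
        prev = BASE_CODE[bases[-1]]
        bases.append(CODE_BASE[prev ^ int(color)])
    return "".join(bases)

read = "ATGGC"
encoded = to_color_space(read)
print(encoded)                 # A3103
print(to_base_space(encoded))  # ATGGC
```

Because every decoded base depends on the one before it, a single wrong color changes all downstream bases, which is one reason color-space reads are usually aligned in color space rather than converted first.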

Slide51

Short Reads

Short reads from NGS are challenging (Solexa ~36 bp, now HiSeq 100 bp single pass)

Very hard to assemble whole genome

Especially on repeat regions

Requires many fold coverage

Requires new and faster algorithms for many traditional bioinformatics operations

Reads are getting longer (2x250): another moving target
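
The "many fold coverage" point can be made concrete with the usual average-depth estimate, coverage = (number of reads x read length) / genome size. A sketch with illustrative inputs follows; the 3.1 Gb genome size and the 30x target are assumptions, not figures from the talk.

```python
import math

def mean_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth = total sequenced bases / genome size."""
    return n_reads * read_length / genome_size

def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Number of reads required to reach a target average depth."""
    return math.ceil(target_coverage * genome_size / read_length)

GENOME = 3_100_000_000  # assumed human-sized genome, in bases

# Same 30x target with the two read lengths mentioned on the slide.
print(reads_needed(30, 36, GENOME))             # 2583333334
print(reads_needed(30, 100, GENOME))            # 930000000
print(mean_coverage(930_000_000, 100, GENOME))  # 30.0
```

This is an average only; repeat regions and uneven sampling mean that real projects need more reads than the estimate suggests.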

Slide52

Applications

An explosion of scientific innovation!!

New usages not directly foreseen by the original developers of the technology

Some envision the beginning of the next revolution, as with PCR: an NGS machine in every lab!!

Cheap high-volume sequencing means revisiting data collection and management systems

Slide53

RNA Sequencing

Digital Gene Expression or RNA-Seq

Truly accurate gene expression measurements

Can replace gene expression microarrays

25% more sensitive

Does not rely on hybridization (no %GC bias, no cross-hybridization between related genes)

Discover novel genes (and other kinds of RNA molecules)

One experiment found that 34% of human transcripts were not from known genes

Sultan et al, Science. 2008 Aug 15;321(5891):956-60.

Slide54

Why is RNA-Seq better than microarray?

Does not depend on predefined gene annotation, which makes discovering novel transcripts possible

Low, if any, background

Large dynamic range of expression levels, no upper limit for quantification

Reveals sequence variation, such as SNPs, in the transcribed region

With Helicos single molecule sequencing there is no PCR step, which removes amplification bias

Slide55

More information from RNA

Can capture true alternative splicing information

Sequence of splice-junctions

One study found 4,096 previously unknown splice junctions in 3,106 human genes

Different transcription start and end points for RNA molecules

Allelic variation (SNPs)

Small RNAs

Slide56

Slide57

Bottleneck: Data Analysis

Slide58

Informatics is the Bottleneck

Scientists are currently able to generate sequence data much faster/more easily than they are able to analyze it

Customized analysis / Bioinformatics consulting is needed for every project

Slide59

Bioinformatics Challenges

Need for large amount of CPU power

Informatics groups must manage compute clusters

Challenges in parallelizing existing software or redesign of algorithms to work in a parallel environment

Another level of software complexity and challenges to interoperability

VERY large text files (millions of lines long)

Can’t do business as usual with familiar tools such as Microsoft Excel.

Impossible memory usage and execution time

Sequence Quality filtering
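
Files with millions of lines are best processed as a stream rather than loaded whole. The following is a minimal sketch of sequence quality filtering that reads one four-line FASTQ record at a time. The Phred+33 offset, the mean-quality threshold of 20, and the function names are assumptions for illustration (older Illumina pipelines used a +64 offset).

```python
import io
from typing import Iterator, TextIO, Tuple

Record = Tuple[str, str, str]  # (header, sequence, quality string)

def read_fastq(handle: TextIO) -> Iterator[Record]:
    """Yield FASTQ records one at a time so memory use stays constant."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual

def mean_quality(qual: str, offset: int = 33) -> float:
    """Average Phred score, assuming Phred+33 ASCII encoding by default."""
    return sum(ord(c) - offset for c in qual) / len(qual)

def filter_reads(handle: TextIO, min_mean_q: float = 20.0) -> Iterator[Record]:
    """Keep only reads whose mean quality meets the threshold."""
    for rec in read_fastq(handle):
        if mean_quality(rec[2]) >= min_mean_q:
            yield rec

# Tiny in-memory example; a real run would pass an open file handle instead.
demo = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"   # 'I' = Phred 40
    "@read2\nACGT\n+\n####\n"   # '#' = Phred 2
)
kept = [header for header, _, _ in filter_reads(demo)]
print(kept)  # ['@read1']
```

Because the functions are generators, only one record is held in memory at a time, whatever the size of the input file.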

Slide60

Auer P. Statistical design and analysis of RNA sequencing data. Genetics. 2010.

Slide61

Data formats

Images

Raw basecalls with quality scores

Sequence reads aligned to reference genomes

Assemblies (contigs)

Variants (SNPs, indels, copy number variants)

Slide62

Hexadecimal mode

Decimal mode

Slide63

FASTQ

Raw

Slide64

SAM format

Example

Pileup format

Slide65

QNAME

FLAG

RNAME

POS

MAPQ

CIGAR

MRNM

MPOS

ISIZE

SEQ

QUAL
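
The eleven mandatory fields above appear tab-separated, in that order, on every SAM alignment line. Below is a minimal sketch of splitting one line into named fields. The example record and the `parse_sam_line` helper are made up for illustration, and any optional tags after QUAL are simply kept as a list.

```python
from typing import Dict, List, Union

SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "MRNM", "MPOS", "ISIZE", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "MPOS", "ISIZE"}

def parse_sam_line(line: str) -> Dict[str, Union[str, int, List[str]]]:
    """Split one SAM alignment line into its 11 mandatory fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < len(SAM_FIELDS):
        raise ValueError("expected at least 11 tab-separated fields")
    record: Dict[str, Union[str, int, List[str]]] = {}
    for name, value in zip(SAM_FIELDS, cols):
        record[name] = int(value) if name in INT_FIELDS else value
    record["TAGS"] = cols[len(SAM_FIELDS):]  # optional fields, if any
    return record

# Hypothetical record: a 10 bp read aligned to chr1 at position 100.
line = "read1\t16\tchr1\t100\t37\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNM:i:0"
rec = parse_sam_line(line)
print(rec["RNAME"], rec["POS"], rec["CIGAR"])  # chr1 100 10M
print(bool(rec["FLAG"] & 0x10))                # True: read maps to the reverse strand
```

MRNM, MPOS and ISIZE are the older names for the fields that the current SAM specification calls RNEXT, PNEXT and TLEN.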

Slide66

CIGAR

M : match/mismatch

I : Insertion compared with reference

D : Deletion compared with reference

N : Skipped bases on reference

S : soft clipping (unaligned)

H : hard clipping

P : padding
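
A CIGAR string is a run of length/operation pairs built from the codes above. The sketch below parses one and works out how many reference bases and how many read bases it covers. The helper names and the example string are hypothetical, and only the seven operations listed on this slide are handled.

```python
import re
from typing import List, Tuple

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP])")
CONSUMES_REF = set("MDN")    # operations that advance along the reference
CONSUMES_READ = set("MIS")   # operations that use up bases of SEQ

def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Turn '5S20M2I' into [(5, 'S'), (20, 'M'), (2, 'I')]."""
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    # Rebuild the string to catch characters the pattern skipped over.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"unrecognized CIGAR string: {cigar!r}")
    return ops

def reference_span(cigar: str) -> int:
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REF)

def read_length(cigar: str) -> int:
    """Number of read bases (length of SEQ) implied by the CIGAR."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_READ)

cigar = "5S20M2I10M100N15M3D5M"
print(reference_span(cigar))  # 153
print(read_length(cigar))     # 57
```

The 100N in the example is the kind of operation a spliced RNA-Seq read produces when it skips an intron on the reference.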

Slide67

Slide68

Slide69

Slide70

Slide71

Slide72

Slide73

File Size

s_1_ILS4_sequence.txt [5.2 GB]

s_1_ILS4_sequence.fastq [3.3 GB]

s_1_ILS4_sequence.sam [4.5 GB]

s_1_ILS4_sequence.bam [995 MB]

s_1_ILS4_sequence.sorted.bam [696 MB]

Slide74

The Bible

Slide75

Utility Tools

SAMtools

Picard

USeq

etc.

Slide76

Bioconductor Solution

Slide77

Slide78

A demonstration

Slide79

Slide80

Secondary Tools

Laboratory Management

Data mining and visualization

Project management for genome assembly

Pathway mapping (functional analysis of groups of genes)

Motif finding (for ChIP-Seq)

Slide81

Integration

Integrate information from different technologies on a single genome map

Genetic variation

Gene expression (mRNA levels)

Alternative splicing

Transcription factor binding

Methylation/histone status

Small RNA levels (gene regulatory molecules)

Non-coding RNA levels!

Slide82

Speed/Efficiency

New emphasis on efficient data structures and algorithms

Use of "old style" tools such as grep/sed/awk

Machine language programming

Currently a huge burst of programming creativity in an "anything goes" environment

A desperate scramble for tools that work

Huge duplication of effort in programming, but also in evaluating new software

Slide83

Slide84

Slide85

Amazon Web Services

http://aws.amazon.com/education/

Slide86

Future Directions

Sequencing will continue to get much faster and cheaper, by 4-10x per year for several more years.

Affordable complete human genome sequencing will be available as a clinical diagnostic tool within 2-3 years.

Data storage and analysis bottleneck

Data security/privacy issues

Slide87

Move to 1:52

Slide88

Field Trip

Slide89

Slide90

Slide91