/
Algorithms for Multisample Algorithms for Multisample

Algorithms for Multisample - PowerPoint Presentation

olivia-moreira
olivia-moreira . @olivia-moreira
Follow
375 views
Uploaded On 2018-03-12

Algorithms for Multisample - PPT Presentation

Read Binning Student Gabriel Ilie Major advisor Ion Măndoiu Associate Advisor Sanguthevar Rajasekaran Associate Advisor Yufeng Wu University of Connecticut November 2013 ID: 647803

reads mers data graph mers reads graph data error abundance mer samples bruijn remove species tips exons sample pseudo

Share:

Link:

Embed:

Download Presentation from below link

Download Presentation The PPT/PDF document "Algorithms for Multisample" is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.


Presentation Transcript

Slide1

Algorithms for MultisampleRead Binning

Student: Gabriel Ilie Major advisor: Ion Măndoiu Associate Advisor: Sanguthevar Rajasekaran Associate Advisor: Yufeng Wu

University of Connecticut

November 2013Slide2

Outline

MotivationPrevious approachesAlgorithmsResultsOngoing workSlide3

Human microbiome

The microorganisms inhabiting our bodies comprise the microbiome.Microbes outnumber our own cells by 10 to 1 [Wooley et al, 2010].Diabetes, obesity, cancer and even attractiveness to mosquitoes seem to correlate with changes in our microbiome [Turnbaugh et al, 2010].To better understand our own condition we also need to understand the composition of the microbial communities inhabiting our bodies and how they interact not just between themselves but also with their habitat (e.g. the host organism).Slide4

Single organism studies

they rely on clonal culturesmost microorganism cannot be cultivatedsuffer from amplification biasmicrobes do not live in single species communitiesmembers of these communities interact not just with each other but also with their habitats, which includes the host organismData obtained from clonal cultures is highly biased and does not capture a true picture of microbial life.Slide5

Transcriptomics

Genomes only provide information about the potential function of organisms.Having a gene does not mean that this gene is also expressed inside the host or during a particular condition.The transcriptome is the set of all RNA molecules produced in one or a population of cells.In order to understand the physiology of the microorganisms we need to know their transcriptome.Slide6

Metatranscriptomics

The transcriptome of a community is the union over the transcriptomes of each of its members.In metatranscriptomic studies:bulk RNA is extracted directly from environmental samplesthe RNA is reverse-transcribedthe resulting RNA-Seq libraries are sequencedMetatranscriptomic studies are essential in order to understand the physiology of the microbiome.Slide7

Challenges of working with metatranscriptomic data

The volume of sequencing data is several orders of magnitude larger than single organisms.Reads can come from hundreds of different species, each with a different abundance level.In addition to having a range in the abundance levels of the microorganisms, genes are expressed at drastically different levels.Genes usually have multiple isoforms.Metatranscriptomics has to deal with all of the challenges of metagenomics (1 and 2) plus some extra challenges (3 and 4), therefore algorithms devised for metagenomic data can also be applied to metatranscriptomic data.Slide8

(Meta)Transcriptome assembly types

Genome independent reconstruction (de novo):de Bruijn k-mer graphVelvet (2008)Trinity (2011)Genome guided reconstruction:spliced read mappingexon identification

splice graph

Cufflinks (2010)

TRIP (2012)Slide9

(Meta)Transcriptome assembly types

Genome independent reconstruction (de novo):de Bruijn k-mer graphVelvet (2008)Trinity (2011)Genome guided reconstruction:spliced read mappingexon identificationsplice graph

Cufflinks (2010)TRIP (2012)Slide10

Outline

MotivationPrevious approachesAlgorithmsResultsOngoing workSlide11

Clustering reads into bins

Analysis of environmental samples is difficult.To simplify the assembly process, many metagenomic tools have been developed to cluster the reads into bins (i.e. species).Algorithms developed for binning metagenomic reads can also be applied to (meta)transcriptomic reads (bins represent transcripts instead of species).Slide12

Types of reads binning algorithms

Genome dependentCompostBin(2008)Metacluster(2012)DNA composition patterns.G+C content, dinucleotide frequencies vary amongst species.Drawbacks:achieve reasonable performance only for long reads (800~1000 bp, [Wu et al, 2011])NGS technologies produce short reads

Genome independent

AbundanceBin(2011)

MultiBin(2011)

K-mer

frequencies are usually linearly proportional to a genome’s abundance.

Sufficiently long

k-mers

are usually unique.

Works with short sequencing reads.

Drawbacks

group together reads from different species if they have close abundance levels

do not perform well on species with low abundancesSlide13

Metacluster (2012)

Two round unsupervised binning algorithm:the first round clusters high-abundance readsfilter reads with 16-mer that appear less than T timesthe reads are grouped based on shared 36-mersthe groups are clustered using 5-mer distributionsthe second round clusters the remaining reads (low-abundance)those with unique 16-mers are discarded

the reads are grouped based on shared 22-mers

the groups are clustered using 4-mer distributions

Advantages:

better binning of reads from low abundance species

uses both techniques: k-mer abundances and DNA composition patternsSlide14

MultiBin (2011)

Processes multiple samples (N > 1) of the same microbial community.Clusters the reads into b bins (b is the number of species).Binning algorithm:all reads are pooled togethera graph G=(V, E) is generatedV is the set of reads

edges connect reads with substantial overlap (50 bp)greedily partition the vertices into a set, the tags, s.t. each read is either a tag or affiliated with one which substantially overlaps it

cluster the set of tags using V

t

=(c

t1

,c

t2

, …, c

tN

), c

ti

=number of reads from sample i which substantially overlap tag t

each non-tag read is assigned to the same bin as its affiliated tag

b

needs to be known in advance or estimated somehow.

Uses abundance differences from any of the samples to tell low abundance species apart.

The algorithm is quadratic in the total number of readsSlide15

Outline

MotivationPrevious approachesAlgorithmsResultsOngoing workSlide16

Our approach

We propose a novel method for unsupervised abundance-based multiple samples reads binning algorithmSlide17

Pseudo-exons

Pseudo-exons are defined as substrings whose k-mers and (k+1)-mers appear in the same transcripts with the same multiplicity.AB

C

D

A

B

C

A

A

B

C

E

F

E

F

D

T1

T2

T3

T4

Exon signatures:

A

: T1x1 T2x2 T3x1

B

:

T1x1 T2x1 T3x1

C

:

T1x1 T2x1 T3x1

D

:

T1x1 T3x1

E

: T3x1 T4x1

F

:

T3x1 T4x1

Pseudo-exons:

A

BC

D

E

F

Notice that exons

E

and

F

form different pseudo-exons because in

T4

they are in reverse order compared to

T3

.Slide18

K-mer counting

we count k-mers and (k+1)-mers using Jellyfish (2011)Jellyfish was designed for shared memory parallel computers with more than one core, it uses several lock-free data structures and multi-threading to count k-mers much faster than other toolsformally, for a given value of k, we count the number of occurrences of all k-mers in each of the N samples

our algorithms assume we have strand nonspecific data, therefore the counts of complementary k-mers are summed together as they are indistinguishable

we store

k-mers

in

canonical form

(the smaller value lexicographically between a

k-mer

and its reverse complement)

we combine the counts over all the samples into a list of

N-dimensional vectors

the maximum value for

k

supported by Jellyfish is 31Slide19

De Bruijn graph

We construct the de Bruijn graph; vertices are k-mers and edges are (k+1)-mers.Because we store vertices in canonical form, each vertex represents two k-mers: itself and its reverse-complement.We add an edge between any two k-mers if and only if there is a (k+1)-mer in the reads such that its prefix of length

k in canonical form matches one of the vertices, while its suffix of length

k

in canonical form matches the other one.

We will sometimes use

vertices to refer to k-mers

and

edges to refer to (k+1)-mers

.Slide20

De Bruijn graph

ACGCGTCGCGCGCGATCG

ATC

GAT

Edges

ACGC

CGCG

ACGA

ATCG

Vertices

ACG

CGC

CGA

ATC

The lists of

k-mers

and

(k+1)-mers

define an implicit representation of the

de Bruijn

graph, therefore we don’t need to construct it explicitly.

Relative to the canonical form of a vertex, for each edge we can say whether it is

incoming

or

outgoing

:

if the the

k-mer

matches the 5’ end of either forms of an edge then that is an outgoing edge

else it is an incoming edgeSlide21

Error removal

A common approach is to remove k-mers which have counts lower than t >= 1.We found that even for t = 1, we lose too much information, because removing unique k-mers compromises the results for ultra-low abundance transcripts.We found that “tip removal” and “bubble removal” give much better results [Zerbino et al, Velvet, 2008].These methods use the structure of the de Bruijn graph instead of coverage information to remove k-mers affected by sequencing errors.Slide22

Tip and bubble error removal

When a read contains a sequencing errorthe first few k-mers may be correct, until they start to overlap the position where the error occurredthis creates a branch going out of the “correct” paththis new branch will either end in a leaf (creating a “tip”), or if the read is long enough, the k-mers will stop overlapping the error and the branch will merge back into the path (creating a “bubble”)A “tip” is a chain of nodes that is disconnected on one end

we expect the majority of tips to have a maximum length of 2k

removing tips is straightforward; we remove all tips which have a length up to some threshold

removing a tip does not disrupt the connectivity of the graph

Implementing “bubble” removal is still an ongoing work.Slide23

Partitioning the de Bruijn graph

From the de Bruijn graph we want to extract putative pseudo-exons.These putative pseudo-exons, if we ignore self-edges, correspond to paths or chordless cycles in the de Bruijn graph.We use the structure of the graph and the vectors with the counts to do the partitioning.Slide24

Partitioning the de Bruijn graphSlide25

Partitioning the de Bruijn graph

We have the following cases:If a vertex has indegree and outdegree equal to at most 1 (it is on a path), we do nothing.If a vertex has outdegree (indegree) greater than 1, then we remove all outgoing (incoming) edges from that vertexhowever, we keep the most abundant edge if the ratio between its abundance and the sum of the abundances of the other edges is higher than a threshold 0 < e < 1

The value of e should be close to 1 (e.g. 0.97)Slide26

Our approach

We propose a novel method for unsupervised abundance-based multiple samples reads binning algorithmSlide27

Outline

MotivationPrevious approachesAlgorithmsResultsOngoing workSlide28

Test data - error free

GNF Atlas [Su et al, 2004] is a dataset which contains information about the expression levels of a set of genes in several human tissuesFrom this dataset we extracted the expression levels of 19,371 genes in 10 human tissuesWe used only one isoform per geneWe simulated 30 million error free RNA-Seq paired-reads of length 50 from this dataset using a tool called Grinder (2012)Slide29

Test data - with sequencing errors

Grinder was very useful for simulating the error free data, however when we wanted to introduce errors its long running time became an issue.Instead, we simulated sequencing errors by using the error free data.We simulated only one type of errors, substitutions, because these are the most common type found in Illumina datasets.We introduced substitutions into the error free reads with a probability of 0.1%,

0.5% and 1% per base.Slide30

K-mer counts in the simulated data

transcriptsreads#30-mers#31-mers

error

k

#correct

k-mers

#incorrect

k-mers

#missing k-mers

percentage of incorect k-mers

49,572,543

49,611,691

0%

30

49,512,839

0

59,704

0%

31

49,546,279

0

65,412

0%

0.1%

30

49,508,364

109,380,690

64,179

68.84%

31

49,540,750

108,528,925

70,941

68,66%

0.5%

30

49,483,820

404,415,622

88,723

89.1%

31

49,510,299

402,714,823

101,392

89.05%

1%

30

49,433,000

708,460,569

139,543

93.48%

31

49,446,451

706,531,643

165,240

93.46%Slide31

Efficiency of different error removal techniques

Error removal methoderror#correct 31-mers#incorrect 31-mers#missing 31-mersnone

0%

49,546,279

0

65,412

non-unique in at

least 1 sample

0%

46,520,486

0

3,091,205

non-unique in at

least 1 sample

0.1%

46,257,210

6,906,069

3,354,481

remove tips <= 21 over the union of the samples

0.1%

49,502,470

21,772,808

109,221

remove tips <= 60 over the union of the samples

0.1%

49,236,255

12,120,069

375,436

remove tips <= 21 sample-by-sample and over the union of the samples

0.1%

49,127,456

10,998,956

484,235

remove tips <= 60 sample-by-sample and over the union of the samples

0.1%

48,707,567

5,304,983

904,124Slide32

Results for the graph partitioning

errorerror removaltechniqueedge removalthreshold#pseudo-exons>= 50bp

#pseudo-exons >= 50bp and

do not have wrong 31-mers

#30-mers

%transcriptome covered by

col 5

0%

none

1

60,285

60,285

49,299,665

99.5%

0%

none

1

65,051

65,051

49,239,335

99.3%

0.1%

remove tips <= 60 over the union of the samples

0.97

416,853

60,335

37,815,256

76.3%

0.1%

remove tips <= 60 sample-by-sample and over the union of the samples

0.97

228,841

77,010

46,699,963

94.2%

The first row (green) uses all k-mers, the counts are computed from the transcripts.Slide33

Outline

MotivationPrevious approachesAlgorithmsResultsOngoing workSlide34

Ongoing work

We believe that bubble removal will help us get rid of most of the erroneous k-mers which are still present in the data after tip removal.We want to incorporate an error correction algorithm, SEECER (2013), to correct the erroneous reads before doing starting the k-mer counting.Currently our algorithms do not take into account strand strand-specific RNA-Seq data. Optimizing the algorithms to take advantage of this information, when available, represents another opportunity to improve the results of this approach.Slide35

Q&A