A FASTA forces assemblers to make mistakes

Uploaded On 2015-12-05


Presentation Transcript

Slide1

A

FASTA forces assemblers to make mistakes

The strictly linear nature of FASTA forces assemblers to introduce errors. These simple events are difficult to represent in the FASTA format. The assembler is forced to choose, resulting in a loss of information and errors. Quality scores, annotation, and IUPAC codes provide only a partial solution.

FASTG encodes all ambiguities

FASTG natively encodes ambiguities that are lost in FASTA
Optional properties (probability, copy-number, etc.) can be associated with these events, e.g.

The Solution: FASTG

Enhanced FASTA to capture known complexity of the genome
Superset of FASTA
Faithfully represents genome assemblies without error or loss of information
Hybrid approach:
- preserves underlying linearity of the genome, but:
- captures nonlinear complexity
Easily converted to FASTA

The Problem: FASTA is flat

The FASTA format is the standard way to represent an assembly:
Linear representation of the genome
Easy to parse and human readable
Provides a simple co-ordinate system that allows easy annotation
Supported by many tools
Has been the warhorse for ~20 years

FASTG: Representing the true information content of a genome assembly

Iain MacCallum, David B. Jaffe
Broad Institute of MIT and Harvard, Cambridge, MA
In collaboration with: Michael Schatz, Daniel Rokhsar, and Assemblathon Format Group

FASTG captures important gap information

Gaps in FASTA often hide additional information, e.g. frame shifts
FASTG retains this information

Coming soon: FASTG support in ALLPATHS-LG

Our genome assembler ALLPATHS-LG will soon produce FASTG assemblies
For the latest ALLPATHS-LG news visit our blog: http://www.broadinstitute.org/software/allpaths-lg/blog/

FASTG is FASTA compatible

FASTG looks like FASTA
FASTG can be easily converted to FASTA
Any existing tool that works on FASTA can use converted FASTG

FASTG to FASTA conversion

FASTG can easily be converted to FASTA by removing the FASTG extensions “[…]”
Conversion can be done with a simple shell or Perl script
The resulting FASTA can be processed by existing tools
FASTG and derived FASTA files share the same base co-ordinate system

FASTG extensions plus start location = Markup

No additional markup language required
FASTA + Markup will produce the original FASTG
Can convert markup to existing annotation formats, but only for a subset of FASTG features
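The conversion described above, removing the “[…]” extensions, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `fastg_record_to_fasta` is hypothetical, the input is a single record, extensions are assumed never to nest, and the FASTA name is taken to be the header text before any ':' or ';'.

```python
import re

# Assumption: an extension is a bracketed span with no nested brackets,
# which holds for every example on this poster.
EXTENSION = re.compile(r"\[[^\]]*\]")


def fastg_record_to_fasta(record: str) -> str:
    """Convert one FASTG record (header line plus sequence) to FASTA."""
    header, _, body = record.strip().partition("\n")
    # Assumption: the FASTA name is the header text before any ':' or ';'.
    name = header[1:].split(";")[0].split(":")[0].strip()
    # Remove the extensions, then any whitespace left in the sequence.
    sequence = "".join(EXTENSION.sub("", body).split())
    return f">{name}\n{sequence}"


# The ">assembly;" example record from this poster.
RECORD = (
    ">assembly;\n"
    "ATCGGCNNNN[4:gap:size=(4,3..9)]ATTACCTGGCTTATAC[1:alt|C,G]"
    "TACCCGATACGTTTACGGTATACGAAAAA[5:tandem:size=(5,4..11)|A]TCT"
)
print(fastg_record_to_fasta(RECORD))
```

Because only the bracketed text is removed, every base keeps its position, which is why the FASTG and the derived FASTA can share one co-ordinate system.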

Assembler forced to choose a haplotype (AAGCC, between ACATT and TACTG)

Assembler forced to choose the repeat length (7 Cs, between ACATT and TACTG)

Assembler forced to pick A or T (between ACATT and TACTG)

ACATT CCCCCC[6:tandem:size=(6,5..7)|C] TACTG

ACATT CGAGG[5:digraph:path=(a)| >a;CGAGG >b;AAGCC ] TACTG

ACATT A[1:alt|A,T] TACTG
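To make the alt and tandem notation concrete, the sketch below expands one such event into the linear sequences it stands for. It is not a FASTG parser. The function name `expand` is hypothetical, and the reading of the syntax is an assumption drawn only from the examples on this poster: the leading number counts the bases covered just before the bracket, and the tandem `size=(n,low..high)` range counts copies of the repeat unit.

```python
import re

# [covered:type(:properties)?(|payload)?], as seen in the poster examples.
EXT = re.compile(r"\[(\d+):(\w+)(?::([^|\]]*))?(?:\|([^\]]*))?\]")
SIZE = re.compile(r"size=\((\d+),(\d+)\.\.(\d+)\)")


def expand(fastg: str) -> list[str]:
    """Expand a sequence holding one alt or tandem event into candidates."""
    text = "".join(fastg.split())  # drop the spaces used for readability
    m = EXT.search(text)
    if m is None:
        return [text]
    covered, kind = int(m.group(1)), m.group(2)
    props, payload = m.group(3) or "", m.group(4) or ""
    left = text[: m.start() - covered]  # bases before the covered span
    right = text[m.end():]
    if kind == "alt":
        options = payload.split(",")
    elif kind == "tandem":
        size = SIZE.search(props)
        if size is None:
            raise ValueError("tandem extension without a size range")
        # Assumption: the range gives the allowed number of unit copies.
        low, high = int(size.group(2)), int(size.group(3))
        options = [payload * n for n in range(low, high + 1)]
    else:
        raise ValueError(f"unsupported extension type: {kind}")
    return [left + option + right for option in options]


print(expand("ACATT A[1:alt|A,T] TACTG"))
print(expand("ACATT CCCCCC[6:tandem:size=(6,5..7)|C] TACTG"))
```

A FASTA file would hold exactly one of these candidates. The FASTG line keeps all of them while still showing the assembler's preferred choice in the linear sequence.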

>scaffold1;
CATAT NNNNNNNNNN[10:gap:size=(10,5..20),start=(a,e),end=(d,g)| >a:b,c,d;CA >b:c,d;T >c:f,g;GAC >d;AA >e:f,g;TT >f:g;A >g;AGT ] GATGT

>scaffold
TACTAGGCNNNNNNNATTAGGCCGTGCNNNNNNNNNNNGCGCCGTTACCATTCNNNNNNACTGCCGTTGACT

>assembly;
ATCGGCNNNN[4:gap:size=(4,3..9)]ATTACCTGGCTTATAC[1:alt|C,G]TACCCGATACGTTTACGGTATACGAAAAA[5:tandem:size=(5,4..11)|A]TCT

Figure panels (Graph, FASTA, FASTG): a long repeat with adjacent sequences A, B, C, and D. Jumping libraries too short to disambiguate the repeat.

Putative gap sequences derived from FASTG, e.g.
1) CATGACAAGT
2) CATTTAA
3) TTAAAAGT
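As an illustration of how putative gap sequences can be derived, the sketch below walks the small graph embedded in the `>scaffold1;` gap example. The helper names `parse_nodes` and `gap_sequences` are hypothetical, and the reading of the syntax is an assumption: each `>name:successors;sequence` entry is a node, and `start=(a,e)` and `end=(d,g)` name the nodes where a path may begin and end. With the adjacency exactly as transcribed here, the first listed sequence, CATGACAAGT, corresponds to the path a, b, c, f, g.

```python
# Node list copied from the gap extension in the >scaffold1; example.
GAP_NODES = ">a:b,c,d;CA >b:c,d;T >c:f,g;GAC >d;AA >e:f,g;TT >f:g;A >g;AGT"


def parse_nodes(text: str) -> dict[str, tuple[str, list[str]]]:
    """Map each node name to (sequence, successor names)."""
    nodes = {}
    for record in text.split(">"):
        record = record.strip()
        if not record:
            continue
        head, _, seq = record.partition(";")
        name, _, successors = head.partition(":")
        nodes[name] = (seq.strip(), [s for s in successors.split(",") if s])
    return nodes


def gap_sequences(nodes, starts, ends) -> list[str]:
    """Spell out every path from a start node to an end node.

    Assumption: the graph is acyclic, as in the poster example, so a
    plain depth-first walk terminates without cycle detection.
    """
    results = []

    def walk(name: str, prefix: str) -> None:
        seq, successors = nodes[name]
        prefix += seq
        if name in ends:
            results.append(prefix)
        for nxt in successors:
            walk(nxt, prefix)

    for start in starts:
        walk(start, "")
    return results


for candidate in gap_sequences(parse_nodes(GAP_NODES), ["a", "e"], {"d", "g"}):
    print(candidate)
```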

FASTG

>contig1;
TACCGCNNNN[4:gap:size=(4,3..5)]AGCCTGCCGTTATAC[1:alt:allele|C,G]TCCCTGGATACGTTTAGGATATAT[6:tandem:size=(3,2..5)|AT]CC

FASTA

>contig1
TACCGCNNNNAGCCTGCCGTTATACCTCCCTGGATACGTTTAGGATATATCC

Markup

>contig1;
6 [4:gap:size=(4,3..5)]
26 [1:alt:allele|C,G]
52 [6:tandem:size=(3,2..5)|AT]
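The relation "FASTG extensions plus start location = Markup" can be sketched as a round trip: split a FASTG sequence into plain FASTA and a list of (start, extension) pairs, then merge them back. The function names `split_markup` and `apply_markup` are hypothetical, and the start location is assumed here to be the 0-based position of the first base covered by the extension, which may not match the numbering convention used on the poster.

```python
import re

# Assumption: extensions never nest and the sequence has no whitespace.
EXTENSION = re.compile(r"\[[^\]]*\]")


def covered_bases(extension: str) -> int:
    """Read the leading count from an extension such as '[4:gap:...]'."""
    return int(extension[1:].split(":")[0])


def split_markup(fastg_seq: str) -> tuple[str, list[tuple[int, str]]]:
    """Split a FASTG sequence into plain FASTA and positional markup."""
    plain, markup = [], []
    length = pos = 0
    for m in EXTENSION.finditer(fastg_seq):
        chunk = fastg_seq[pos:m.start()]
        plain.append(chunk)
        length += len(chunk)
        # The extension describes the bases immediately before it.
        markup.append((length - covered_bases(m.group()), m.group()))
        pos = m.end()
    plain.append(fastg_seq[pos:])
    return "".join(plain), markup


def apply_markup(fasta_seq: str, markup: list[tuple[int, str]]) -> str:
    """Re-insert each extension right after the bases it covers."""
    out, pos = [], 0
    for start, extension in markup:
        end = start + covered_bases(extension)
        out.append(fasta_seq[pos:end])
        out.append(extension)
        pos = end
    out.append(fasta_seq[pos:])
    return "".join(out)


FASTG_SEQ = "TACCGCNNNN[4:gap:size=(4,3..5)]AGCCTGCCGTTATAC[1:alt:allele|C,G]TCC"
fasta_seq, markup = split_markup(FASTG_SEQ)
print(fasta_seq)
print(markup)
print(apply_markup(fasta_seq, markup) == FASTG_SEQ)
```

Keeping the markup as positions against the FASTA co-ordinates means existing FASTA tools never see it, yet nothing is lost when the two are recombined.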

Uncertain base or SNP: A or T, between ACATT and TACTG

Uncertain tandem repeat: 5 Cs, 6 Cs, or 7 Cs, between ACATT and TACTG

Haplotype separation

Figure labels: TAC…, …ACT, …CT, AT…

FASTG captures graphs using a hierarchic approach

FASTG is FASTA-like, preserves linearity and keeps local complexity local

FASTG is easy to use

Assembly gap with complex contents (between CATAT and GATGT)

FASTA

>scaffold1
CATAT NNNNNNNNNN GATGT

FASTA cannot represent graph assemblies

Not all assemblies can be reduced to a linear form, due to:

Polymorphism that cannot be linearized

Long repeats that cannot be bridged with jumping data

Inversions that cannot be disambiguated

Assemblies must be broken into linear sections, losing information

Any information about the gap content is lost

FASTG captures possible content of the gap

Genome

Assembly graph

Uncertain tandem repeat

Long imperfect repeat

Single base difference

>ContigA:ContigC;
TCGA…[7:tandem:size=(7,6..9)|T]…CATG

>ContigB:ContigC;
ATAGCG…ATCCAT

>ContigC:ContigA,ContigB;
CGTA…[1:alt|C,G]…AATC
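The three record headers above carry the connections between contigs. A minimal sketch of reading them into an adjacency map follows. The names `parse_header` and `build_graph` are hypothetical, and it is an assumption, based only on these examples, that the names after the colon are the contigs that may follow the named one.

```python
def parse_header(header: str) -> tuple[str, list[str]]:
    """Parse '>name:next1,next2;' into (name, [next1, next2])."""
    body = header.strip().lstrip(">").rstrip(";")
    name, _, rest = body.partition(":")
    return name, [n for n in rest.split(",") if n]


def build_graph(headers: list[str]) -> dict[str, list[str]]:
    """Collect the headers into a contig -> successors map."""
    return dict(parse_header(h) for h in headers)


# Headers copied from the ContigA / ContigB / ContigC example.
HEADERS = [
    ">ContigA:ContigC;",
    ">ContigB:ContigC;",
    ">ContigC:ContigA,ContigB;",
]

for contig, successors in build_graph(HEADERS).items():
    print(contig, "->", ", ".join(successors))
```

Local events such as the tandem repeat and the single-base difference stay inside each record's sequence, so the header-level graph remains small.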

FASTG figure labels: ACATT, TACTG, AAGCC, CGAGG, A[1:alt:allele|A,T]

Global graph structure encoded here

Gap graph figure labels: CATAT, GATGT, AGT, AA, A, T, CA, TT, GAC

But FASTA has a number of limitations:

Figure labels: ContigA, ContigB, ContigC