FASTA forces assemblers to make mistakes
FASTA's strictly linear nature forces assemblers to introduce errors:
- Simple events (an uncertain base or SNP, an uncertain tandem repeat, a haplotype separation) are difficult to represent in the FASTA format
- The assembler is forced to choose, resulting in a loss of information and errors
- Quality scores, annotation, and IUPAC codes provide only a partial solution
FASTG encodes all ambiguities
- FASTG natively encodes ambiguities that are lost in FASTA
- Optional properties (probability, copy number, etc.) can be associated with these events
The Solution: FASTG
Enhanced FASTA to capture known complexity of the genome:
- Superset of FASTA
- Faithfully represents genome assemblies without error or loss of information
- Hybrid approach:
  - preserves underlying linearity of the genome, but
  - captures nonlinear complexity
- Easily converted to FASTA
The Problem: FASTA is flat
The FASTA format is the standard way to represent an assembly:
- Linear representation of the genome
- Easy to parse and human readable
- Provides a simple co-ordinate system that allows easy annotation
- Supported by many tools
- Has been the workhorse for ~20 years
FASTG: Representing the true information content of a genome assembly
Iain MacCallum, David B. Jaffe
Broad Institute of MIT and Harvard, Cambridge, MA
In collaboration with: Michael Schatz, Daniel Rokhsar, and Assemblathon Format Group
http://www.broadinstitute.org/software/allpaths-lg/blog/
FASTG captures important gap information
- Gaps in FASTA often hide additional information, e.g. frame shifts
- FASTG retains this information
Coming soon: FASTG support in ALLPATHS-LG
Our genome assembler ALLPATHS-LG will soon produce FASTG assemblies. For the latest ALLPATHS-LG news, visit our blog.
FASTG is FASTA compatible
- FASTG looks like FASTA
- FASTG can be easily converted to FASTA
- Any existing tool that works on FASTA can use converted FASTG

FASTG to FASTA conversion
- FASTG can easily be converted to FASTA by removing the FASTG extensions "[…]"
- Conversion can be done with a simple shell or Perl script
- The resulting FASTA can be processed by existing tools
- FASTG and derived FASTA files share the same base co-ordinate system

FASTG extensions plus start location = Markup
- No additional markup language required
- FASTA + Markup will produce the original FASTG
- Can convert markup to existing annotation formats, but only for a subset of FASTG features
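The conversion described above can be sketched in a few lines of Python. This is a minimal illustration, not the ALLPATHS-LG converter. It assumes that extensions are un-nested "[…]" spans, that each header sits on its own line, and that the FASTA header is the FASTG name with everything from the first ":" or ";" dropped. The function name `fastg_to_fasta` is hypothetical.

```python
import re

# Assumption: FASTG extensions are un-nested "[...]" spans (they may span lines).
EXTENSION = re.compile(r"\[[^\]]*\]")
# Assumption: a header line is ">name", optionally ":neighbours", ending in ";".
HEADER = re.compile(r"^>([^:;\s]+)[^\n]*$", re.MULTILINE)


def fastg_to_fasta(fastg_text: str) -> str:
    """Strip FASTG extensions and header decorations, leaving plain FASTA."""
    # Remove extensions first so that ">node;" lines inside them disappear too.
    without_extensions = EXTENSION.sub("", fastg_text)
    # Keep only the record name on each remaining header line.
    return HEADER.sub(r">\1", without_extensions)


if __name__ == "__main__":
    example = (
        ">contig1;\n"
        "TACCGCNNNN[4:gap:size=(4,3..5)]AGCCTGCC\n"
        "GTTATAC[1:alt:allele|C,G]TCCCTGGATACGTT\n"
        "TAGGATATAT[6:tandem:size=(3,2..5)|AT]CC\n"
    )
    print(fastg_to_fasta(example), end="")
```

Because the bases themselves are never touched, the output keeps the same base co-ordinate system as the input, which is the property the poster relies on.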
(Figure, FASTA column)
Assembler forced to choose a haplotype: ACATT AAGCC TACTG
Assembler forced to choose the repeat length: ACATT 7Cs TACTG
Assembler forced to pick A or T: ACATT TACTG
(Figure, FASTG column)
ACATT CCCCCC[6:tandem:size=(6,5..7)|C] TACTG
ACATT CGAGG[5:digraph:path=(a)|>a;CGAGG >b;AAGCC] TACTG
ACATT A[1:alt|A,T] TACTG
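To make the alt and tandem notations concrete, the sketch below expands one extension into the sequences it allows. It is an illustration under stated assumptions, not a reference parser: the extension shape `[length:type:properties|content]` is read off the examples in this poster, `size=(n,lo..hi)` on a tandem is assumed to count copies of the repeat unit, and the function name `variants` is hypothetical.

```python
import re

# Assumed shape, read off the poster's examples: [length:type:properties|content]
EXT = re.compile(r"\[(\d+):(\w+)(?::([^|\]]*))?(?:\|([^\]]*))?\]")
SIZE = re.compile(r"size=\((\d+),(\d+)\.\.(\d+)\)")


def variants(extension: str) -> list[str]:
    """List the alternative sequences an alt or tandem extension stands for."""
    match = EXT.fullmatch(extension)
    if match is None:
        raise ValueError(f"not a recognised extension: {extension!r}")
    _length, kind, props, content = match.groups()
    if kind == "alt":
        # Alternatives are comma-separated after the "|".
        return (content or "").split(",")
    if kind == "tandem":
        size = SIZE.search(props or "")
        if size is None or not content:
            raise ValueError("tandem extension needs size=(n,lo..hi) and a unit")
        low, high = int(size.group(2)), int(size.group(3))
        # Assumption: lo..hi is the allowed number of copies of the unit.
        return [content * copies for copies in range(low, high + 1)]
    raise ValueError(f"unsupported extension type: {kind}")
```

Flanking bases can then be attached to each variant, for example `["ACATT" + v + "TACTG" for v in variants("[6:tandem:size=(6,5..7)|C]")]`.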
>scaffold1;
CATAT
NNNNNNNNNN[10:gap:size=(10,5..20),start=(a,e),end=(d,g)|
>a:b,c,d;CA
>b:c,d;T
>c:f,g;GAC
>d;AA
>e:f,g;TT
>f:g;A
>g;AGT
]
GATGT
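One way to read the gap extension is as a small directed graph whose paths spell candidate gap contents. The sketch below enumerates those paths. It assumes that `>x:y,z;SEQ` means node x carries SEQ and links to y and z, that `start=` and `end=` list the permitted first and last nodes, and that the graph has no cycles. The function name `gap_paths` and the table `SCAFFOLD1_GAP` are hypothetical. Under this reading, the path a, b, c, f, g spells CATGACAAGT, the first putative gap sequence listed on the poster.

```python
def gap_paths(nodes, starts, ends):
    """Return every sequence spelled by a path from a start node to an end node.

    nodes maps a node name to (list of successor names, bases on that node).
    Assumption: the graph is acyclic, so no cycle guard is needed.
    """
    results = []

    def walk(name, prefix):
        successors, bases = nodes[name]
        spelled = prefix + bases
        if name in ends:
            results.append(spelled)
        for successor in successors:
            walk(successor, spelled)

    for start in starts:
        walk(start, "")
    return results


# Node table transcribed from the scaffold1 gap example.
SCAFFOLD1_GAP = {
    "a": (["b", "c", "d"], "CA"),
    "b": (["c", "d"], "T"),
    "c": (["f", "g"], "GAC"),
    "d": ([], "AA"),
    "e": (["f", "g"], "TT"),
    "f": (["g"], "A"),
    "g": ([], "AGT"),
}

if __name__ == "__main__":
    for sequence in gap_paths(SCAFFOLD1_GAP, starts=["a", "e"], ends={"d", "g"}):
        print(sequence)
```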
>scaffold
TACTAGGCNNNNNNNATTAGGCCGTGCNNNNNNNNNNNGCGCCGTTACCATTCNNNNNNACTGCCGTTGACT
>assembly;
ATCGGCNNNN[4:gap:size=(4,3..9)]ATTACCTGGCTTATAC[1:alt|C,G]TACCCGATACGTTTACGGTATACGAAAAA[5:tandem:size=(5,4..11)|A]TCT
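The events in a FASTG record like the one above can be listed with a short scan. The following is a sketch under assumptions rather than a reference parser: it takes the leading integer of each extension to be the number of immediately preceding bases that the extension annotates (which matches every example on this poster), and the names `Event` and `list_events` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Assumed shape, read off the poster's examples: [length:type:properties|content]
EXT = re.compile(r"\[(\d+):(\w+)(?::([^|\]]*))?(?:\|([^\]]*))?\]")


class Event(NamedTuple):
    kind: str                  # e.g. "gap", "alt", "tandem"
    bases: str                 # preceding bases the extension is assumed to annotate
    properties: Optional[str]  # raw property text, e.g. "size=(4,3..9)"
    content: Optional[str]     # raw text after "|", if any


def list_events(fastg_sequence: str) -> list[Event]:
    """Scan one FASTG sequence (no header, no newlines) and list its events."""
    events = []
    plain = ""   # bases seen so far, with extensions left out
    cursor = 0
    for match in EXT.finditer(fastg_sequence):
        plain += fastg_sequence[cursor:match.start()]
        cursor = match.end()
        length = int(match.group(1))
        annotated = plain[max(0, len(plain) - length):]
        events.append(Event(match.group(2), annotated, match.group(3), match.group(4)))
    return events
```

Keeping the properties as raw text is deliberate: the poster only says that optional properties such as probability or copy number can be attached, so the sketch does not guess at their full grammar.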
(Figure: three diagrams, each labelled Long Repeat, A, B, C, D)
Jumping libraries too short to disambiguate the repeat
(Figure columns: Graph, FASTA, FASTG)
Putative gap sequences derived from FASTG, e.g.
1) CATGACAAGT
2) CATTTAA
3) TTAAAAGT
FASTG:
>contig1;
TACCGCNNNN[4:gap:size=(4,3..5)]AGCCTGCC
GTTATAC[1:alt:allele|C,G]TCCCTGGATACGTT
TAGGATATAT[6:tandem:size=(3,2..5)|AT]CC

FASTA:
>contig1
TACCGCNNNNAGCCTGCC
GTTATACCTCCCTGGATA
CGTTTAGGATATATCC

Markup:
>contig1;
6 [4:gap:size=(4,3..5)]
26 [1:alt:allele|C,G]
52 [6:tandem:size=(3,2..5)|AT]
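The "FASTA + Markup = FASTG" idea can be sketched as a pair of functions that split a FASTG sequence into plain bases plus a markup list, and merge them back. This is an illustration under assumptions: positions are zero-based offsets of the first annotated base, and the leading integer of an extension is taken as the number of bases it covers. The poster's own position numbers may follow a different convention, although the gap in its example also comes out at 6 here. The names `split_markup` and `merge_markup` are hypothetical.

```python
import re

# Assumption: an extension is "[n:...]" where n counts the bases it covers.
EXT = re.compile(r"\[(\d+):[^\]]*\]")


def split_markup(fastg_sequence):
    """Return (plain_sequence, markup) with markup = [(start, extension), ...]."""
    plain_parts, markup = [], []
    plain_length = 0
    cursor = 0
    for match in EXT.finditer(fastg_sequence):
        chunk = fastg_sequence[cursor:match.start()]
        plain_parts.append(chunk)
        plain_length += len(chunk)
        cursor = match.end()
        covered = int(match.group(1))
        # Start location = offset of the first base the extension covers.
        markup.append((plain_length - covered, match.group(0)))
    plain_parts.append(fastg_sequence[cursor:])
    return "".join(plain_parts), markup


def merge_markup(plain_sequence, markup):
    """Rebuild the FASTG sequence; markup must be sorted by start location."""
    pieces, cursor = [], 0
    for start, extension in markup:
        covered = int(EXT.fullmatch(extension).group(1))
        end = start + covered  # the extension goes right after the bases it covers
        pieces.append(plain_sequence[cursor:end])
        pieces.append(extension)
        cursor = end
    pieces.append(plain_sequence[cursor:])
    return "".join(pieces)
```

Both functions work on a single sequence with line breaks removed; headers would be handled separately.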
(Figure, Graph column)
Uncertain base or SNP: ACATT, A or T, TACTG
Uncertain tandem repeat: ACATT, 5 Cs, 6 Cs or 7 Cs, TACTG
Haplotype separation
(Figure fragments: + TAC… …ACT …CT AT…)
FASTG captures graphs using a hierarchic approach
FASTG is FASTA-like, preserves linearity, and keeps local complexity local
FASTG is easy to use
Assembly gap with complex contents
(Figure: gap between CATAT and GATGT)

FASTA:
>scaffold1
CATAT
NNNNNNNNNN
GATGT
FASTA cannot represent graph assemblies
Not all assemblies can be reduced to a linear form, due to:
- Polymorphism that cannot be linearized
- Long repeats that cannot be bridged with jumping data
- Inversions that cannot be disambiguated
Assemblies must be broken into linear sections, losing information
Any information about the gap content is lost
FASTG captures possible content of the gap
(Figure: Genome and Assembly graph, annotated with: Uncertain tandem repeat, Long imperfect repeat, Single base difference)
FASTG:
>ContigA:ContigC;
TCGA…[7:tandem:size=(7,6..9)|T]…CATG
>ContigB:ContigC;
ATAGCG…ATCCAT
>ContigC:ContigA,ContigB;
CGTA…[1:alt|C,G]…AATC
(Figure labels: ACATT, TACTG, AAGCC, CGAGG, A[1:alt:allele|A,T])
Global graph structure encoded here
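In the ContigA/ContigB/ContigC example, the header lines appear to carry the global graph: each header names a record and, after a colon, the records it connects to. The sketch below collects those headers into an adjacency table. It assumes that reading of `>name:neighbour,neighbour;`, assumes un-nested "[…]" extensions, and uses a hypothetical function name, `adjacency`.

```python
import re

# Assumption: extensions are un-nested "[...]" spans and may hide ">node;" lines.
EXTENSION = re.compile(r"\[[^\]]*\]")
# Assumption: header = ">name", optional ":neighbour,neighbour", then ";".
HEADER = re.compile(r"^>([^:;\s]+)(?::([^;\s]*))?;", re.MULTILINE)


def adjacency(fastg_text):
    """Map each top-level record name to the list of records its header links to."""
    # Drop extensions first so node headers inside gaps are not mistaken for records.
    top_level = EXTENSION.sub("", fastg_text)
    graph = {}
    for name, neighbours in HEADER.findall(top_level):
        graph[name] = neighbours.split(",") if neighbours else []
    return graph


if __name__ == "__main__":
    example = (
        ">ContigA:ContigC;\nTCGA\n"
        ">ContigB:ContigC;\nATAGCG\n"
        ">ContigC:ContigA,ContigB;\nCGTA\n"
    )
    print(adjacency(example))
```

Local events stay inside the sequence lines, so the table built here reflects only the links between records.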
(Figure, gap graph labels: CATAT, GATGT, AGT, AA, A, T, CA, TT, GAC)
But FASTA has a number of limitations:
(Figure labels: ContigA, ContigB, ContigC)