Slide 1
Tuning Tophat2
Belinda Giardine
Slide 2
Tophat2
Aligns reads from RNA to the genome
Ribonucleic acid (RNA) is a ubiquitous family of large biological molecules that perform multiple vital roles in the coding, decoding, regulation, and expression of genes.
Handles gaps in the alignments by breaking the reads into small pieces (~20 bases) and reassembling the reads after mapping (sketched below).
Though the new version is more parallel, it is still slow (more than 4 days for recent runs).
It uses Bowtie to do the actual mapping.
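Below is a minimal sketch of the segmenting idea just described; it is my own illustration, not the actual Tophat2 implementation, and split_read is a made-up helper name.

#include <string>
#include <vector>

// Cut a read into ~20-base segments so each piece can be mapped
// independently by Bowtie and the pieces stitched back together later.
std::vector<std::string> split_read(const std::string& read, size_t seg_len = 20)
{
    std::vector<std::string> segments;
    for (size_t pos = 0; pos < read.size(); pos += seg_len)
        segments.push_back(read.substr(pos, seg_len));
    return segments;
}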
Slide 3
RNA-seq (image from Wikipedia)
fastq file, a single read:
@DGM97JN1:330:C3EW0ACXX:1:1101:2723:1993 1:N:0:NAAGGC
GAATGCCCCCGGCCGTCCCTCTTAATCATGGCCTCAGTTCCGAAAACCANCAAAATAGAACCGCGGTCCTATTNN
+
CCCFFFFFHHHHGIIFGIIIIJJIIJIFGIJEHIIJIGHIJHAGHHFEE#,;;BACEEDDDDDD@B>BBDCDC##
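To make the four-line layout above explicit, here is a small reader sketch (my own, assuming an uncompressed fastq stream; FastqRecord and read_record are hypothetical names, not part of Tophat2):

#include <istream>
#include <string>

struct FastqRecord { std::string header, seq, qual; };

// One fastq record is four lines: header, sequence, "+" separator, qualities.
bool read_record(std::istream& in, FastqRecord& rec)
{
    std::string plus;
    return std::getline(in, rec.header) &&   // @DGM97JN1:330:...
           std::getline(in, rec.seq)    &&   // GAATGCCCCC...
           std::getline(in, plus)       &&   // +
           std::getline(in, rec.qual);       // CCCFFFFFHHHH...
}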
Slide 4
Tophat2
Pipeline written in C++ (34,351 lines of code in 63 files)
Wrapper written in Python
3 of the programs use Boost pthreads:
long_spanning_reads.cpp
segment_juncs.cpp
tophat_reports.cpp
Programs are compiled as one unit under autoconf and automake; communication between programs is through temporary files (see the toy sketch below).
Many prerequisites: zlib, Boost, samtools, Bowtie; this and the amount of file I/O make running only on the MIC infeasible.
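As a toy illustration of that hand-off style (my own sketch, not Tophat2 code; stage_one, stage_two, and the path are hypothetical), every stage boundary costs a full write and re-read of the intermediate data:

#include <fstream>
#include <string>

// Stage 1 writes its intermediate results to a temporary file ...
void stage_one(const std::string& tmp_path)
{
    std::ofstream out(tmp_path.c_str());
    out << "intermediate alignment records\n";
}

// ... and stage 2 reads the whole file back before it can start working.
void stage_two(const std::string& tmp_path)
{
    std::ifstream in(tmp_path.c_str());
    std::string line;
    while (std::getline(in, line))
    {
        // process one record
    }
}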
Slide 5
Data files
Reads in fastq format, 20–200 million reads (2 x 20 GB for my test)
Reference sequence and indexes used for mapping: 6 GB for mouse
Final output: 14 GB for my test
Slide 6
Work from last time
Compiling: start with gcc, then icc, then add -mmic (this failed while trying to get all the prerequisites built).
Test run on host, using Tophat's log of the run for timing.
Run on biostar (Xeon) using 8 threads took 26 hours.
Run on stampede (host) using 16 threads took 19 hours, 40 mins.
Run on stampede (host) using 32 threads took 24 hours.
Slide 7
New work
The Python wrapper and long run times make the code difficult to profile with gprof and VTune.
Going from my experience on biostar, I am starting with the segment_juncs executable.
Keeping the temporary files that are used for passing data between programs, I ran just segment_juncs.
Time for segment_juncs run alone:
8 threads: 2 hours 13 minutes
16 threads: 1 hour 15 minutes (2 ½ out of 19 ½ hours total); of this, 76% is spent in the parallel section
32 threads: 2 hours 12 minutes (no faster than 16; see the Amdahl's law check below)
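As a quick sanity check on those numbers (my own back-of-the-envelope, not Tophat2 code), Amdahl's law with the measured 76% parallel fraction caps the achievable speedup of segment_juncs at about 1 / (1 - 0.76) ≈ 4.2x, which helps explain why 32 threads show no improvement over 16:

#include <cstdio>

int main()
{
    const double p = 0.76;                    // parallel fraction from the 16-thread run
    const int thread_counts[] = {8, 16, 32};
    for (int t = 0; t < 3; ++t)
    {
        int n = thread_counts[t];
        std::printf("%2d threads: predicted speedup %.2fx\n",
                    n, 1.0 / ((1.0 - p) + p / n));
    }
    return 0;
}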
Slide 8
Failed attempts
Run vtune on segment_juncs
times out on full data
license errors
Check loops in par_report that are assumed dependencies
lines of code indicated are not loops or in loops?
contradictory lines
Offloading the threaded section of code in segment_juncs.cpp (sketch below). Will it actually improve speed, or is there too much file I/O?
Lots of variables to copy
File I/O
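For the offloading idea, here is a hedged sketch of what Intel's offload pragmas would look like around the threaded search (not Tophat2 code; search_segment_junctions, reads, n_reads, and juncs are placeholder names standing in for the real worker state):

// Function must be marked for the MIC target to be callable inside an offload region.
__attribute__((target(mic)))
void search_segment_junctions(const char* reads, long n_reads, int* juncs)
{
    // placeholder for the real pthread/Boost worker loop
    for (long i = 0; i < n_reads; ++i)
        juncs[i] = 0;
}

void offload_search(const char* reads, long n_reads, int* juncs)
{
    // Every input buffer must be shipped to the coprocessor and every result
    // shipped back; this is where "lots of variables to copy" bites, and the
    // read data itself comes from large temporary files on the host.
    #pragma offload target(mic) in(reads : length(n_reads)) out(juncs : length(n_reads))
    {
        search_segment_junctions(reads, n_reads, juncs);
    }
}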
Slide 9
Hardison Lab
Slide 10
vec_report3
segment_juncs.cpp(135): (col. 32) remark: loop was not vectorized: existence of vector dependence.
segment_juncs.cpp(135): (col. 32) remark: vector dependence: assumed ANTI dependence between r.92068 line 135 and r.92068 line 135.
segment_juncs.cpp(135): (col. 32) remark: vector dependence: assumed FLOW dependence between r.92068 line 135 and r.92068 line 135.
Line 135:
left_seg.left = max(0, T.right() - 2);
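If the assumed ANTI/FLOW dependence on line 135 is actually a false positive (i.e. each iteration writes a distinct element), icc can be told to ignore it with #pragma ivdep. A hedged sketch of that approach, with Seg, left_segs, and right_ends as stand-in names rather than the real Tophat2 data structures:

#include <algorithm>
#include <vector>

struct Seg { int left; };

// Assert to the compiler that iterations are independent so the loop can be
// vectorized; this is only safe if the assumed dependence really is false.
void set_lefts(std::vector<Seg>& left_segs, const std::vector<int>& right_ends)
{
    #pragma ivdep
    for (size_t i = 0; i < left_segs.size(); ++i)
        left_segs[i].left = std::max(0, right_ends[i] - 2);
}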
Slide 11
opt_report
REMOVED VAR left_mismatches.201433.0_V$78b
REMOVED PACK left_mismatches.201433.0
REMOVED VAR right_mismatches.201433.0_V$78d
REMOVED PACK right_mismatches.201433.0
Slide 12
gprof output for segment_juncs
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
100.01      0.01     0.01                             extend_from_seeds(std::vector<SeedExtension, std::allocator<SeedExtension> >&, PackedSplice const&, std::vector<std::vector<ReadHit, std::allocator<ReadHit> >, std::allocator<std::vector<ReadHit, std::allocator<ReadHit> > > > const&, std::string const&, std::string const&, unsigned long, unsigned long, int)
  0.00      0.01     0.00    89528     0.00     0.00  pack_splice(std::string const&, int, int, unsigned int)
  0.00      0.01     0.00        3     0.00     0.00  __do_global_dtors_aux
  0.00      0.01     0.00        2     0.00     0.00  pack_right_splice_half(std::string const&, unsigned int, unsigned int)
Slide 13
Parallel section of code:

vector<boost::thread*> threads;
for (int i = 0; i < num_threads; ++i)
{
  SegmentSearchWorker worker;
  worker.rt = &rt;
  worker.reads_fname = left_reads_fname;
  worker.segmap_fnames = &left_segmap_fnames;
  worker.partner_reads_map_fname = right_reads_map_fname;
  worker.seg_partner_reads_map_fname = right_seg_fname_for_segment_search;
  worker.juncs = &vseg_juncs[i];
  worker.deletions = &vdeletions[i];
  worker.insertions = &vinsertions[i];
  worker.fusions = &vfusions[i];
  worker.read = READ_LEFT;
  worker.partner_hit_offset = 0;
  worker.seg_partner_hit_offset = 0;
  if (i == 0)
  {
    worker.begin_id = 0;
    worker.seg_offsets = vector<int64_t>(left_segmap_fnames.size(), 0);
    worker.read_offset = 0;
  }
  else
  {
    worker.begin_id = read_ids[i-1];
    worker.seg_offsets.insert(worker.seg_offsets.end(), offsets[i-1].begin()+1, offsets[i-1].end());
    worker.read_offset = offsets[i-1][0];
    if (partner_offsets.size() > 0)
      worker.partner_hit_offset = partner_offsets[i-1];
    if (seg_partner_offsets.size() > 0)
      worker.seg_partner_hit_offset = seg_partner_offsets[i-1];
  }
  worker.end_id = (i+1 < num_threads) ? read_ids[i] : std::numeric_limits<uint64_t>::max();
  //Geo debug:
  //fprintf(stderr, "Worker %d: begin_id=%lu, end_id=%lu\n", i, worker.begin_id, worker.end_id);
  if (num_threads > 1 && i + 1 < num_threads)
    threads.push_back(new boost::thread(worker));
  else
    worker();
}
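Note that the last worker runs on the main thread itself (the else branch), while the others run in spawned Boost threads. The lines after this excerpt are not shown on the slide, but presumably a standard join/cleanup pass follows so the main thread waits for the spawned workers before merging results; a minimal sketch of what that would look like:

// Wait for the num_threads-1 spawned workers, then release them.
for (size_t t = 0; t < threads.size(); ++t)
{
  threads[t]->join();
  delete threads[t];
}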