Slide1
The St. Jude Children’s Research Hospital/Washington University Pediatric Cancer Genome Project: A CIO’s Perspective
Clayton W. Naeve, Ph.D.
Endowed Chair in Bioinformatics
SVP & CIO
St. Jude Children’s Research Hospital
Slide2
The Data Deluge
St. Jude Data: The First 50 Years
48 Years (800 TB)
2 1/2 Years (1000 TB)
PCGP Data: 917 TB, 148 million files
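A quick back-of-the-envelope check on these figures, as a minimal sketch: it compares the average yearly data growth of the two periods and estimates the mean PCGP file size. Decimal units (1 TB = 10^12 bytes) are an assumption; the slide does not say which convention it uses.

```python
TB = 10**12  # assumption: decimal terabytes


def tb_per_year(total_tb, years):
    """Average data growth in TB per year over a period."""
    return total_tb / years


def mean_file_size_mb(total_tb, n_files):
    """Mean file size in decimal megabytes."""
    return total_tb * TB / n_files / 10**6


early_rate = tb_per_year(800, 48)      # first 48 years
recent_rate = tb_per_year(1000, 2.5)   # most recent 2 1/2 years
speedup = recent_rate / early_rate     # how much faster data now accumulates

pcgp_mean_mb = mean_file_size_mb(917, 148_000_000)

print(f"Early growth:  {early_rate:.1f} TB/year")
print(f"Recent growth: {recent_rate:.1f} TB/year ({speedup:.0f}x faster)")
print(f"Mean PCGP file size: {pcgp_mean_mb:.1f} MB")
```

Under these assumptions the recent growth rate is about 24 times the historical average, and the mean PCGP file is only a few megabytes, so file count matters as much as raw capacity.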
Slide3
The PCGP Project
St. Jude/WashU Pediatric Cancer Genome Project
Launched Feb. 2010
St. Jude/WashU collaboration
WGS on 600 patients (leukemia, brain tumors, solid tumors)
Matched germline and tumor samples
1200 genomes (~90 billion bp/genome) in 36 months
~2 Petabytes of data
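The project-scale numbers can be related to each other with a little arithmetic. The sketch below assumes that "~90 billion bp/genome" means sequenced bases per genome, that the human genome is roughly 3.1 billion bp, and that 1 PB = 1000 TB; none of these are stated on the slide.

```python
HUMAN_GENOME_BP = 3.1e9  # assumption: approximate haploid human genome size


def fold_coverage(sequenced_bp, genome_bp=HUMAN_GENOME_BP):
    """Approximate sequencing depth: sequenced bases divided by genome size."""
    return sequenced_bp / genome_bp


def tb_per_genome(total_pb, n_genomes):
    """Average storage per genome in TB, assuming 1 PB = 1000 TB."""
    return total_pb * 1000 / n_genomes


def genomes_per_month(n_genomes, months):
    """Average sequencing throughput needed to finish on schedule."""
    return n_genomes / months


coverage = fold_coverage(90e9)
storage = tb_per_genome(2, 1200)
pace = genomes_per_month(1200, 36)

print(f"Approximate coverage: {coverage:.0f}x")
print(f"Storage per genome:   {storage:.2f} TB")
print(f"Required pace:        {pace:.1f} genomes/month")
```

Under these assumptions the targets work out to roughly 30x coverage, about 1.7 TB per genome, and a little over 33 genomes per month.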
Slide4
PCGP Challenges
Challenges to Information Sciences
Moving data
Data workflow
Data analysis
Computational horsepower
Data storage
Data sharing
Slide5
Moving Data
Multi-terabyte data transit across networks is not trivial
DNA sequence raw data (reads, contig assembly, alignment to reference, variants, etc.) shipped to SJCRH as binary BAM files: ~100 GB
24 hrs to infinity to send via commodity internet
Internet2 connectivity (10 Gbps via MRC) to transfer files from WashU to SJCRH
Evaluated 5 different fast data transfer algorithms; selected FDT (developed at Caltech to transfer LHC data at CERN)
Developed a pipeline to facilitate transfer
Today: ~5 hour transit time/file
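To put the transfer times in perspective, the sketch below converts between file size, link rate, and transit time. It assumes decimal units (1 GB = 8 x 10^9 bits) and ignores protocol overhead, disk speed, and pipeline steps, so the results are rough bounds rather than measurements.

```python
BITS_PER_GB = 8e9  # assumption: decimal gigabytes


def transfer_seconds(size_gb, rate_mbps):
    """Ideal transfer time in seconds for a file at a sustained link rate."""
    return size_gb * BITS_PER_GB / (rate_mbps * 1e6)


def effective_mbps(size_gb, hours):
    """Sustained rate in Mbps implied by moving a file in a given time."""
    return size_gb * BITS_PER_GB / (hours * 3600) / 1e6


# A ~100 GB BAM file at the full 10 Gbps line rate (theoretical best case).
ideal = transfer_seconds(100, 10_000)

# Rates implied by the slide's timings for a ~100 GB file.
commodity = effective_mbps(100, 24)  # "24 hrs" best case on commodity internet
today = effective_mbps(100, 5)       # "~5 hour transit time/file"

print(f"Ideal at 10 Gbps:     {ideal:.0f} s")
print(f"24 h transfer implies {commodity:.1f} Mbps sustained")
print(f"5 h transfer implies  {today:.1f} Mbps sustained")
```

The gap between the ideal line-rate figure and the observed transit time suggests that end-to-end time is governed by more than raw link bandwidth.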
Slide6
Moving Data
Slide7
Moving Data
Slide8
Data Workflow
Began work on PCGP 9 months prior to launch
Developed a LIMS system for Validation Lab
Developed a PCGP SharePoint site to facilitate collaboration internally
Developed a bioinformatics workflow engine: PALLAS
Security management
Data provenance management
Intermediate and final result tracking
Flexible workflow design
Rapid configuration of new analytical algorithms/tools
Web-based LSF job submission and monitoring
Support for a range of protocols to connect to other web application systems, databases, file systems, etc.
Integrated with applications such as SRM, Genome Browser, etc.
Data integration with tissue sample, clinical, and research data
Vision: parse each algorithm to the appropriate computing environment
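The "vision" of routing each algorithm to a suitable computing environment could look something like the sketch below. This is not how PALLAS works internally (the slides do not describe that); the `Job` class, the memory threshold, and the example resource figures are all hypothetical. Only the three environment names (cluster, GPU, large memory) come from the workflow diagram later in the deck.

```python
from dataclasses import dataclass

# Hypothetical cutoff above which a job is sent to the large-memory machine.
LARGE_MEMORY_THRESHOLD_GB = 256


@dataclass
class Job:
    """A unit of analysis work with illustrative resource requirements."""
    name: str
    mem_gb: float
    uses_gpu: bool = False


def choose_environment(job: Job) -> str:
    """Pick one of the three environments based on the job's needs."""
    if job.uses_gpu:
        return "GPU"
    if job.mem_gb > LARGE_MEMORY_THRESHOLD_GB:
        return "large memory"
    return "cluster"


# Example jobs; the resource numbers are made up for illustration.
jobs = [
    Job("contig assembly", mem_gb=1024),
    Job("SNV calling", mem_gb=16),
    Job("accelerated alignment", mem_gb=32, uses_gpu=True),
]

for job in jobs:
    print(f"{job.name:>22} -> {choose_environment(job)}")
```

A rule-based dispatcher like this keeps the routing policy in one place, so adding a new algorithm only means describing its resource needs.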
Slide9
Data Analyses
Jinghui Zhang and CompBio Team
BAM Quality Assurance:
Tumor Purity Algorithm (SJCRH)
Not Disease/Genomic Swap (SNP checks)
Xenograft Filter (Remove Contaminating Mouse Reads)
Gene Exon and Genome Coverage algorithms (Gang Wu)
BAM file work:
Bam file extraction and visualization
Samtools and C++/bioperl APIs
Bambino
IGV
Single Nucleotide Variation:
Freebayes
In-house PCGP
Copy Number Variation:
Stan’s Copy Number Algorithm
Regression Tree Algorithm
Structural Variation:
One End Anchored Inference: CREST
ViralTopology
Fusion Detection:
In-house (Michael Rusch)
RNAseq:
RNAseq mysql/Cufflinks
ChipSeq:
ChiPseq mysql/in house (John Obenauer)
viralScan in-house (McGoldrick)
Integration:
GFF intersect
Gff2fasta
gffBuilders
Cancer warehouse
Visualization:
Circos maker
BED GFF Tracks maker
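As an illustration of what a coverage check involves, here is a minimal sketch that computes per-base read depth over a region from aligned read intervals. It is a generic textbook approach, not the SJCRH coverage algorithm named on the slide; the function names and the toy read coordinates are hypothetical, and real pipelines would read alignments from BAM files rather than from a list of tuples.

```python
def coverage_depth(reads, region_start, region_end):
    """Per-base depth over the half-open region [region_start, region_end).

    `reads` is an iterable of half-open (start, end) alignment intervals.
    Uses a difference array so each read costs O(1) to record.
    """
    length = region_end - region_start
    diff = [0] * (length + 1)
    for start, end in reads:
        lo = max(start, region_start)
        hi = min(end, region_end)
        if lo < hi:  # read overlaps the region
            diff[lo - region_start] += 1
            diff[hi - region_start] -= 1
    depth, running = [], 0
    for delta in diff[:length]:
        running += delta
        depth.append(running)
    return depth


def summarize(depth):
    """Return (mean depth, fraction of bases covered by at least one read)."""
    mean = sum(depth) / len(depth)
    covered = sum(1 for d in depth if d > 0) / len(depth)
    return mean, covered


# Toy example: three reads against a hypothetical exon at positions 2..10.
depth = coverage_depth([(0, 5), (3, 8), (10, 12)], 2, 10)
mean_depth, fraction_covered = summarize(depth)
print(depth, mean_depth, fraction_covered)
```

Reporting both mean depth and the fraction of covered bases helps distinguish an evenly sequenced exon from one with a deep pile-up next to a gap.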
Slide10
Computational Horsepower (HPCF)
IBM BladeCenter (810 cores/3TB RAM)
IBM iDataplex (1,008 cores/4TB RAM) – April 2010
SGI Altix UV1000 (640 cores/5TB RAM/60TB storage using Lustre v2.2) – December 2011
IBM SoNAS (780 TB) – March 2011
Data Transfer Node (10 Gbps I2 connection) – April 2011
Internal Data Transfer Node (10 Gbps x2) – June 2011
QDR Infiniband (40 Gbps for all HPC equipment) – January 2012
Software (Platform LSF, Intel Parallel Studio)
Total: 2,366 cores, 13TB RAM (estimated 11.6 Tflops)
2010: 365,000 cpu hours
2011: 712,000 cpu hours
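The CPU-hour totals are easier to interpret as a growth factor and as an equivalent number of continuously busy cores. The sketch below does that arithmetic; it assumes a 365-day year and says nothing about how the hours were actually distributed across machines or months.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def growth_factor(previous, current):
    """Ratio of this year's usage to last year's."""
    return current / previous


def equivalent_busy_cores(cpu_hours, hours_in_period=HOURS_PER_YEAR):
    """Number of cores that would be fully busy all year to use these hours."""
    return cpu_hours / hours_in_period


growth = growth_factor(365_000, 712_000)
busy_2010 = equivalent_busy_cores(365_000)
busy_2011 = equivalent_busy_cores(712_000)

print(f"2010 -> 2011 growth: {growth:.2f}x")
print(f"Equivalent busy cores: {busy_2010:.0f} (2010), {busy_2011:.0f} (2011)")
```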
Slide11
Data Storage
IBM SoNAS (780 TB) – March 2011
Scales to 21PB; 1 billion files/filesystem; 7,200 drives
Current total on campus: 3.8 Petabytes (3,800,000 GB)
PCGP uses 917 TB (+500TB on tape), 148 million data files
IBM TSM systems for backup/archive (Tiered)
240 SAS (15k) drives
480 SAS-NL (7.2k) drives
Current 7,900 tape capacity, up to 1.6TB/tape; 12.6+ PB total
734 TB usable under one file system
High speed/low latency backend interconnect (QDR InfiniBand 20Gb per port and 100ns latency)
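Two of the storage figures can be cross-checked with simple arithmetic: the total tape capacity and PCGP's share of on-campus storage. This sketch assumes decimal units (1 PB = 1000 TB) and that every tape holds the stated maximum of 1.6 TB.

```python
def tape_capacity_pb(n_tapes, tb_per_tape):
    """Total tape capacity in PB, assuming 1 PB = 1000 TB."""
    return n_tapes * tb_per_tape / 1000


def share_percent(part_tb, total_tb):
    """Percentage of total storage used by one project."""
    return 100 * part_tb / total_tb


tape_pb = tape_capacity_pb(7900, 1.6)   # 7,900 tapes at up to 1.6 TB each
pcgp_share = share_percent(917, 3800)   # PCGP disk use vs. 3.8 PB on campus

print(f"Tape capacity: {tape_pb:.2f} PB")
print(f"PCGP share of campus storage: {pcgp_share:.1f}%")
```

The tape figure agrees with the "12.6+ PB total" on the slide, and PCGP alone accounts for roughly a quarter of the disk storage on campus.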
Slide12
Progress
>356 Patients/712 Complete Genomes
Gene sequencing project identifies potential drug targets in common childhood brain tumor
Nature
June 20, 2012
Researchers studying the genetic roots of the most common malignant childhood brain tumor have discovered missteps in three of the four subtypes of the cancer that involve genes already targeted for drug development. The most significant gene alterations are linked to subtypes of medulloblastoma that currently have the best and worst prognosis. They were among 41 genes associated for the first time to medulloblastoma by the St. Jude Children's Research Hospital – Washington University Pediatric Cancer Genome Project.
World's largest release of comprehensive human cancer genome data helps researchers everywhere speed discoveries
Nature Genetics
May 29, 2012
To speed progress against cancer and other diseases, the St. Jude Children's Research Hospital – Washington University Pediatric Cancer Genome Project today announced the largest-ever release of comprehensive human cancer genome data for free access by the global scientific community. The amount of information released more than doubles the volume of high-coverage, whole genome data currently available from all human genome sources combined. This information is valuable not just to cancer researchers, but also to scientists studying almost any disease.
Genome sequencing initiative links altered gene to age-related neuroblastoma risk
Journal of the American Medical Association
March 13, 2012
St. Jude Children’s Research Hospital – Washington University Pediatric Cancer Genome Project and Memorial Sloan-Kettering Cancer Center discover the first gene alteration associated with patient age and neuroblastoma outcome. Researchers have identified the first gene mutation associated with a chronic and often fatal form of neuroblastoma that typically strikes adolescents and young adults. The finding provides the first clue about the genetic basis of the long-recognized but poorly understood link between treatment outcome and age at diagnosis.
Cancer sequencing initiative discovers mutations tied to aggressive childhood brain tumors
Nature Genetics
January 29, 2012
Findings from the St. Jude Children's Research Hospital – Washington University Pediatric Cancer Genome Project (PCGP) offer important insight into a poorly understood tumor that kills more than 90 percent of patients within two years. The tumor, diffuse intrinsic pontine glioma (DIPG), is found almost exclusively in children and accounts for 10 to 15 percent of pediatric tumors of the brain and central nervous system.
Cancer sequencing project identifies potential approaches to combat aggressive leukemia
Nature
January 11, 2012
Researchers with the St. Jude Children's Research Hospital – Washington University Pediatric Cancer Genome Project (PCGP) have discovered that a subtype of leukemia characterized by a poor prognosis is fueled by mutations in pathways distinctly different from a seemingly similar leukemia associated with a much better outcome. The work provides the first details of the genetic alterations fueling a subtype of acute lymphoblastic leukemia (ALL) known as early T-cell precursor ALL (ETP-ALL). The results suggest ETP-ALL has more in common with acute myeloid leukemia (AML) than with other subtypes of ALL.
Gene identified as a new target for treatment of aggressive childhood eye tumor
Nature
January 11, 2012
New findings from the St. Jude Children's Research Hospital – Washington University Pediatric Cancer Genome Project (PCGP) have helped identify the mechanism that makes the childhood eye tumor retinoblastoma so aggressive. The discovery explains why the tumor develops so rapidly while other cancers can take years or even decades to form. The finding also led investigators to a new treatment target and possible therapy for the rare childhood tumor of the retina, the light-sensing tissue at the back of the eye.
Slide13
Data Sharing
http://www.pediatriccancergenomeproject.org
http://explore.pediatriccancergenomeproject.org
Slide14
Data Sharing
Data Integration is critical: platform data (expression, WGS, methylation, etc.) and processed data (“genomics” data with phenotype data (clinical care, clinical research))
Slide15
Total = >150 FTEs with “research informatics” skills
Key: Staff
19 Academic Departments
2 PhD, 2 Support
Information Sciences
PCGP: 5 PhD, 1 Dev.
8-10 Faculty
50-60 Support Staff
10 PhD
Bioinformatics
2 developers
Enterprise Informatics
Clinical Informatics
127 FTEs
81 FTEs
Research Informatics
56 FTEs
Offshore Developers
15 FTEs
HPC
Shared Resources
Computational Biology
Slide16
$ummary
Project total cost: $65M (11 Illuminas @ WashU and 4 @ SJCRH, sequencing costs, staffing, IT, etc.)
New “IT” staff @ SJCRH: 10 FTEs in CompBiol, 0 FTEs in IS
Capital IT investment: ~$7.2M at SJCRH, $9M at WashU
IT is ~25% of overall project costs (doesn’t include costs of other participating SJ FTEs)
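The "~25%" figure can be reproduced from the other numbers on this slide. The sketch below assumes the percentage is the combined capital IT investment at both sites divided by the total project cost; the slide does not state how it was derived.

```python
def it_share_percent(it_capital_musd, project_total_musd):
    """Combined IT capital spend as a percentage of total project cost.

    `it_capital_musd` is a list of per-site amounts in millions of USD.
    """
    return 100 * sum(it_capital_musd) / project_total_musd


# ~$7.2M at SJCRH plus $9M at WashU, against a $65M project total.
share = it_share_percent([7.2, 9.0], 65.0)
print(f"IT share of project cost: {share:.1f}%")
```

Under that assumption the result is just under 25%, consistent with the slide.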
Slide17
Information Sciences PCGP Team
Key: Staff
Ashish Pagare
David Zhao
Dan Alford
Stephen Espy
Kiran Chand Bobba
Scott Malone
Dr. Antonio Ferreira
Bill Pappas
James McMurry
Dr. Jianmin Wang
Dr. John Obenauer
Jared Becksfort
Pankaj Gupta
Dr. Suraj Mukatira
Simon Hagstrom
Sundeep Shakya
Asmita Vaidya
Swetha Mandava
Bhagavathy Krishna
Manohar Gorthi
Sandhya Rani Kolli
Sivaram Chintalapudi
Roshan Shrestha
Irina McGuire
PJ Stevens
Thanh Le
John Penrod
Pat Eddy
Dr. Dan McGoldrick
Slide18
Questions?
Slide19
Data Workflow
cluster
GPU
Contig assembly
SV
CNV
INDELS
SNV
CIRCOS
PALLAS
large memory