Biology Data and EBI’s Infrastructure

Uploaded On 2024-02-09

Presentation Transcript

1. Biology Data and EBI’s Infrastructure: Surfing the Tsunami
Andy Jenkinson, Manuela Menchi

2. Overview
- Data in Molecular Biology
- Bioinformatics and the EBI
- What are the challenges for bio data?
- Sequencing: a disruptive technology
- IT Architecture perspective

3. Introduction
- Bioinformatics
  - The application of IT to biology
  - Interdisciplinary science
  - Pervasive in modern molecular biology
- The European Bioinformatics Institute
  - Part of the European Molecular Biology Laboratory
  - International, non-profit research institute
  - Europe’s hub for biological data services and research

4. Who are our users?
- Bench laboratory researchers
- Bioinformaticians & computer scientists
- Roles related to research (trainers, teachers, outreach)
- Non-scientific roles (admin, funders)

5. Data resources at EMBL-EBI
- Genomes & variation: Ensembl, Ensembl Genomes, Genome-phenome archive, Metagenomics
- Nucleotide sequences: European Nucleotide Archive (ENA)
- Expression: ArrayExpress, Expression Atlas, PRIDE, R-Workbench
- Proteins: The Universal Protein Resource (UniProt), InterPro
- Chemical biology: ChEMBL, ChEBI
- Literature & ontology: Europe PubMed Central, Gene Ontology
- Molecular structures: Protein Data Bank in Europe, PDBsum, ProFunc
- Pathways: IntAct, Reactome, MetaboLights
- Systems: BioModels, Enzyme Portal, BioSamples
- Patent sequences: Non-redundant patent sequence dbs, Patent compounds

6. Molecular Biology Data
Huge variety of methods, each with its own output:
- Protein structure
  - Atomic coordinates
  - Volume maps
- Small molecule structures
- DNA, RNA sequences
- Hybridisation (which proteins interact with each other?)
- Microarrays (which genes are expressed in this cell?)
- Mass spectrometry (which proteins are in this cell?)
- Images

7. Global sequence archive
- International Nucleotide Sequence Database Collaboration (INSDC)
  - NCBI (US), DDBJ (Japan), EMBL-EBI (Europe)
- 1982: EMBL Data Library - 500,000 base pairs
- 1990: Human Genome Project starts
- 1995: EMBL Nucleotide Sequence Database relocated to Hinxton, Human Genome Project
- 2007 or so: second-generation sequencing

8. Data-driven research
- Faster: human genome 13 years -> 1 day
- Cheaper: $3 billion -> $1,500
- Genome centres can sequence more
- Any bio lab can sequence
- Sequencing is now a generic and pervasive method
  - Do something to your sample, sequence it
  - DNA methylation (bind with a blocking agent, sequence)
  - Gene expression (sequence the RNA)
- Data-driven research

9. EMBL-Bank (assembled sequences)

10. Sequence Read Archive (2nd gen raw data)

11. Web requests
Web requests received per day by EMBL-EBI and the jointly run Ensembl service (January 2003 to December 2011)

12. A Cancer Study in 2017
- 186 terabases of DNA
- 423 tebibytes of raw data
- 2000 cancer genomes?

13. UK clinical sequencing
- ~9 petabases of DNA
- 21 pebibytes of raw data
- 100,000 cancer/rare disease genomes?
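The totals on slides 12 and 13 can be cross-checked with a little arithmetic. The sketch below back-calculates the average number of bases sequenced per genome and the raw storage cost per base from the slide figures. The function names are illustrative, and the per-genome and per-base values are derived here rather than stated on the slides. It assumes decimal prefixes for bases and binary prefixes for bytes, as the slide wording suggests.

```python
# Back-of-envelope check of the slide figures (decimal bases, binary bytes).
TERA, PETA = 10**12, 10**15   # terabases, petabases
TEBI, PEBI = 2**40, 2**50     # tebibytes, pebibytes


def bases_per_genome(total_bases, genomes):
    """Average number of sequenced bases per genome."""
    return total_bases / genomes


def bytes_per_base(total_bytes, total_bases):
    """Average raw storage cost of one sequenced base, in bytes."""
    return total_bytes / total_bases


# (total bases, total raw bytes, number of genomes), taken from the slides.
studies = {
    "cancer study (slide 12)": (186 * TERA, 423 * TEBI, 2_000),
    "UK clinical sequencing (slide 13)": (9 * PETA, 21 * PEBI, 100_000),
}

for name, (bases, raw_bytes, genomes) in studies.items():
    print(f"{name}: "
          f"{bases_per_genome(bases, genomes) / 1e9:.0f} gigabases per genome, "
          f"{bytes_per_base(raw_bytes, bases):.2f} bytes per base")
```

Both projects come out at roughly 90 gigabases per genome and about 2.5 to 2.6 bytes of raw data per base, so the two slides appear to rest on the same underlying assumptions.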

14. Sequencing every child born in the EU?
- 3 pebibytes of raw data every day
- 9 petabases of DNA every week
- 5 million births per year?
- Storing only variants: much more feasible
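The daily and weekly rates above can be approximated from the birth count. This is a rough sketch, not the authors' calculation. It assumes about 93 gigabases sequenced per genome (186 terabases divided by 2000 genomes, from the cancer study slide) and about 2.5 bytes of raw data per base (423 tebibytes divided by 186 terabases). Both constants are back-calculated assumptions, and the function names are illustrative.

```python
BIRTHS_PER_YEAR = 5_000_000      # from the slide
BASES_PER_GENOME = 93 * 10**9    # assumption: 186 terabases / 2000 genomes
BYTES_PER_BASE = 2.5             # assumption: 423 TiB / 186 terabases, rounded
PETA, PEBI = 10**15, 2**50       # petabases (decimal), pebibytes (binary)


def weekly_petabases(births_per_year=BIRTHS_PER_YEAR):
    """Petabases of DNA sequenced per week if every newborn is sequenced."""
    return births_per_year / 52 * BASES_PER_GENOME / PETA


def daily_pebibytes(births_per_year=BIRTHS_PER_YEAR):
    """Pebibytes of raw data produced per day under the same assumptions."""
    return births_per_year / 365 * BASES_PER_GENOME * BYTES_PER_BASE / PEBI


print(f"{weekly_petabases():.1f} petabases per week")
print(f"{daily_pebibytes():.1f} pebibytes per day")
```

Under these assumptions the result lands close to the slide's 9 petabases per week and 3 pebibytes per day.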

15. IT Architecture Perspective
This part of the contribution comes from the IT Systems team of EMBL-EBI and provides an overview of the data IT architecture and the relevant challenges at EMBL-EBI.

16. Introduction
- Introduction of the IT Systems team function
- Disk growth at EBI
- Lifecycle of Data Management within IT Systems
- Optimal IT solution for data: elements for decision-making
- Example: the ERA, EGA, 1000 genomes archive
- Relevant EBI IT infrastructure
- Conclusion

17. The EBI Systems Group
- The EBI Systems team plans, procures, implements and maintains IT infrastructure and services, covering core and desktop support: 25 people in total, 20 core
- Core hardware infrastructure: datacentre, disaster recovery site, computing servers, storage, network (LAN/WAN)
- Core services infrastructure: network (ftp, aspera, rsync, email), LSF, database administration, backups
- Procurement of central infrastructure (equipment/datacentres)
- Core user support of EMBL-EBI service groups in their daily activities
- The group works closely with all project groups, maintaining and planning their specific infrastructures

18. How has EBI disk grown
[Chart: raw disk (TB) by year]

19. Relevant EMBL-EBI facilities
- Disaster Recovery Datacentre (backups, standbys)
- Data Production/Processing Datacentre (data received, data processing)
- Public Facing Datacentre (data published for public, web external services, ftp.ebi.ac.uk)
- Public Facing Datacentre (data published for public, web external services, ftp.ebi.ac.uk)
- Data IN (ftp-private.ebi.ac.uk)
- 10 Gbps links
- ~25 PB raw disk
- 30k CPU cores (4 GB memory per core)

20. Lifecycle of Data Management in IT Systems
- Data Auditing
- Any Data Retirement
- Backup/Recovery strategy
- Monitoring of data growth
- Monitoring of change of scope
- Notice of new Project (Data) commencing
- Decision on optimal IT architecture
- Implementation of optimal IT architecture

21. Lifecycle of Data Management in IT Systems
What is data? Data is “bytes” with “information” associated.
Information typically asked of, or provided by, any new service group project:
- Initial dataset size
- Data lifecycle / scope of the data within the project (archiving, backend of public facing application, HPC, database, temporary, data transfer, etc.)
- Backup or no backup
- Growth forecast
- Ownership and access list
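The information gathered for a new project maps naturally onto a small record type. The sketch below is purely illustrative: `DatasetRequest`, `Scope` and all field names are hypothetical and do not describe any actual EMBL-EBI system. The compound annual growth model is also an assumption, since the slide only says that a growth forecast is requested.

```python
from dataclasses import dataclass, field
from enum import Enum


class Scope(Enum):
    """Data scopes listed on the slide."""
    ARCHIVING = "archiving"
    PUBLIC_BACKEND = "backend of public facing application"
    HPC = "HPC"
    DATABASE = "database"
    TEMPORARY = "temporary"
    DATA_TRANSFER = "data transfer"


@dataclass
class DatasetRequest:
    """Hypothetical record of what a new project tells the systems team."""
    project: str
    initial_size_tb: float
    scope: Scope
    backup: bool
    annual_growth: float  # assumed model: 0.5 means +50% per year, compounded
    owner: str
    access_list: list[str] = field(default_factory=list)

    def projected_size_tb(self, years: int) -> float:
        """Forecast size after `years`, assuming compound annual growth."""
        return self.initial_size_tb * (1 + self.annual_growth) ** years


# Made-up example values, for illustration only.
request = DatasetRequest(
    project="example-archive",
    initial_size_tb=100.0,
    scope=Scope.ARCHIVING,
    backup=True,
    annual_growth=0.5,
    owner="example-group",
    access_list=["example-group"],
)
print(request.projected_size_tb(3))
```

Keeping the forecast next to the scope and backup decision makes it straightforward to revisit the storage choice when monitoring shows that growth or scope has changed.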

22. Optimal IT solution for data/project
- Decision-making process: balance the simplicity of the central IT data infrastructure as a whole against the optimal solution for each individual dataset/project
- Continuous market research and evaluation of solutions
- Consolidation into a reduced number of storage solutions: traditional NAS, scale-out NAS, SAN, flash memory arrays, parallel storage
- Scalability: size and IOPS for general data and concurrent access from HPC farms; latency for sensitive applications (databases)
- Storage archiving: scalability to sizes in the order of PBs (SRA archive ca. 6 PB, up from ~1.6 PB in 2011) and data integrity (checksums and hardware strategy where possible)
- Data migrations/mirroring: disaster recovery, external facing facilities, standby sites, technology refresh, geographical distance
- Floor space, electricity and cooling
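The data integrity point can be illustrated with a checksum comparison between an archived file and its replica. This is a minimal sketch, not a description of how EMBL-EBI verifies its archives. The choice of SHA-256, the helper names `file_checksum` and `verify_copy`, and the demo file names are all assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so large files never sit fully in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copy(original, replica, algorithm="sha256"):
    """Return True when both files have the same checksum."""
    return file_checksum(original, algorithm) == file_checksum(replica, algorithm)


# Small demonstration with temporary files standing in for archive objects.
with tempfile.TemporaryDirectory() as workdir:
    original = Path(workdir) / "reads.original"
    replica = Path(workdir) / "reads.replica"
    damaged = Path(workdir) / "reads.damaged"
    original.write_bytes(b"ACGT" * 1000)
    replica.write_bytes(b"ACGT" * 1000)
    damaged.write_bytes(b"ACGT" * 999 + b"ACGA")  # last base differs

    print("replica intact:", verify_copy(original, replica))
    print("damaged copy intact:", verify_copy(original, damaged))
```

In practice the checksum would normally be computed once at submission and stored alongside the file, so later migrations and mirrors can be checked without rereading the original.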

23. FIRE: File Replication system (inc SRA archive)