Slide1
Standardized Metadata for Human Pathogen/Vector Genomic Sequences
Richard H. Scheuermann, Ph.D.
Director of Informatics
J. Craig Venter Institute
On behalf of the GSC-BRC Metadata Working Group
Slide2
Genome Sequencing Centers for Infectious Disease (GSCID)
Bioinformatics Resource Centers (BRC)
www.viprbrc.org
www.fludb.org
Slide3
High Throughput Sequencing
Enabling technology
  Epidemiology of outbreaks
  Pathogen evolution
  Host range restriction
  Genetic determinants of virulence and pathogenicity
Metadata requirements
  Temporal-spatial information about isolates
  Selective pressures
  Host species of specimen source
  Disease severity and clinical manifestations
Slide4
Metadata Submission Spreadsheets
[Figure: metadata submission spreadsheets, with callouts numbered 1 to 4]
Slide5
Complex Query Interface
Slide6
Metadata Inconsistencies
Each project was providing different types of metadata
No consistent nomenclature being used
Impossible to perform reliable comparative genomics analysis
Required extensive custom bioinformatics system development
Slide7
GSC-BRC Metadata Standards Working Group
NIAID assembled a group of representatives from their three Genome Sequencing Centers for Infectious Diseases (Broad, JCVI, UMD) and five Bioinformatics Resource Centers (EuPathDB, IRD, PATRIC, VectorBase, ViPR) programs
Develop an approach for capturing standardized metadata for pathogen isolate sequencing projects
Bottom up approach to capture data considered to be important by users
Compatible with data standards and submission requirements
Slide8
Metadata Standardization Process
Collect example metadata sets from sequencing project white papers and other project sources (e.g. CEIRS)
Identify data fields that appear to be common across projects and samples (core) and data fields that appear to be pathogen or project specific
For each data field, provide a common set of attributes, including preferred term, definition, synonyms, allowed value sets (preferably using controlled vocabularies), expected syntax, etc.
Assemble all metadata fields into a semantic network based on the Ontology for Biomedical Investigations (OBI)
Compare, map, and harmonize to other relevant initiatives, including Genomic Standards Consortium MIxS and NCBI BioProjects/BioSamples
Draft data submission spreadsheets
Beta test version 1.0 standard with new GSCID white paper projects, collecting feedback
Adopt version 1.1 metadata standard and data submission spreadsheets for all GSCID white paper and BRC-associated projects
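The per-field attributes named in the process above (preferred term, definition, synonyms, allowed value set, expected syntax) can be pictured as a small record with a validation check. This is only a sketch: the class name `MetadataField`, the definition wording, the value set for gender, and the date pattern are all assumptions made for illustration, not part of the working group's standard.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MetadataField:
    """One metadata field carrying the attributes listed in the process above."""
    field_id: str                           # e.g. "CS4"
    preferred_term: str
    definition: str                         # illustrative wording, not official text
    synonyms: list = field(default_factory=list)
    allowed_values: Optional[set] = None    # controlled vocabulary, if any
    syntax: Optional[str] = None            # regular expression, if any

    def is_valid(self, value: str) -> bool:
        """Check a submitted value against the value set and the syntax."""
        if self.allowed_values is not None and value not in self.allowed_values:
            return False
        if self.syntax is not None and re.fullmatch(self.syntax, value) is None:
            return False
        return True


gender = MetadataField(
    field_id="CS4",
    preferred_term="Specimen Source Gender",
    definition="Sex of the specimen source (illustrative wording).",
    synonyms=["host-sex", "sex"],
    allowed_values={"male", "female", "unknown"},  # assumed value set
)

collection_date = MetadataField(
    field_id="CS8",
    preferred_term="Specimen Collection Date",
    definition="Date the specimen was collected (illustrative wording).",
    synonyms=["collection_date"],
    syntax=r"\d{4}-\d{2}-\d{2}",  # assumed YYYY-MM-DD syntax
)

print(gender.is_valid("female"), collection_date.is_valid("14 March 2012"))
```

A submission spreadsheet could be checked row by row with `is_valid`, which is where a controlled vocabulary pays off: free-text variants such as "F" are rejected instead of silently entering the database.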
Slide9
Core Project
Metadata Field ID | Metadata Field Descriptor | OBO Foundry ID | BioProject/BioSample | MIxS
CP1 | Project Title | http://purl.obolibrary.org/obo/OBI_0001622 | Title | project name
CP2 | Project ID | http://purl.obolibrary.org/obo/OBI_0001628 | |
CP3 | Project Description | http://purl.obolibrary.org/obo/OBI_0001615 | Description |
CP4 | Supporting Grants/Contract ID | http://purl.obolibrary.org/obo/OBI_0001629 | Grant Agency |
CP5 | Publication Citation | http://purl.obolibrary.org/obo/OBI_0001617 | PubMed ID | ref_biomaterial
CP6 | Sample Provider Principal Investigator (PI) Name | | |
CP7 | Sample Provider PI's Institution | | |
CP8 | Sample Provider PI's email | | |
CP9 | Sequencing Facility | | |
CP10 | Sequencing Facility Contact Name | | |
CP11 | Sequencing Facility Contact's Institution | | |
CP12 | Sequencing Facility Contact's email | | |
CP13 | Bioinformatics Resource Center | http://purl.obolibrary.org/obo/OBI_0001626 | |
CP14 | Bioinformatics Resource Center Contact Name | | |
CP15 | Bioinformatics Resource Center Contact's Institution | | |
CP16 | Bioinformatics Resource Center Contact's email | | |
CP17 | Target Material | | Material |
CP18 | Project Method | | Methodology |
CP19 | Project Objectives | | Objective |
CP20 | Sample Scope | | Sample Scope |
CP21 | Target Capture | | Capture |
Slide10
Core Sample
Metadata Field ID | Metadata Field Descriptor | OBO Foundry ID | NCBI BioSample | MIxS
CS1 | Specimen Source ID | http://purl.obolibrary.org/obo/OBI_0001141 | host-subject-id | host_subject_id
CS2 | Specimen Source Species | http://purl.obolibrary.org/obo/OBI_0100026 | specific_host | host_taxid
CS3 | Species Source Common Name | | host-common-name | host_common_name
CS4 | Specimen Source Gender | http://purl.obolibrary.org/obo/PATO_0000047 | host-sex | sex
CS5 | Specimen Source Age - Value | http://purl.obolibrary.org/obo/OBI_0001167 | host-age | age
CS6 | Specimen Source Age - Unit | http://purl.obolibrary.org/obo/UO_0000003 | host-age |
CS7 | Specimen Source Health Status | http://purl.obolibrary.org/obo/OGMS_0000022 | host-health-state | disease status
CS8 | Specimen Collection Date | http://purl.obolibrary.org/obo/OBI_0001619 | collection_date | collection date
CS9 | Specimen Collection Location - Latitude | http://purl.obolibrary.org/obo/OBI_0001620 | lat_lon | geographic location (lat and long)
CS10 | Specimen Collection Location - Longitude | http://purl.obolibrary.org/obo/OBI_0001621 | lat_lon | geographic location (lat and long)
CS11 | Specimen Collection Location - Location | http://purl.obolibrary.org/obo/GAZ_00000448 | geo_loc_name |
CS12 | Specimen Collection Location - Country | http://purl.obolibrary.org/obo/OBI_0001627 | geo_loc_name | geographic location (country and/or sea)
CS13 | Specimen ID | http://purl.obolibrary.org/obo/OBI_0001616 | sample name |
CS14 | Specimen Type | http://purl.obolibrary.org/obo/OBI_0001479 | host-tissue-sampled | body habitat, body site, body product
CS15 | Suspected Organism(s) in Specimen - Species | http://purl.obolibrary.org/obo/OBI_0000925 | organism |
CS16 | Suspected Organism(s) in Specimen - Subclass | | strain | subspecific genetic lineage
CS17 | Human Pathogenicity of Suspected Organism(s) in Specimen | http://purl.obolibrary.org/obo/OBI_0000925 | phenotype |
CS18 | Environmental Material | http://purl.obolibrary.org/obo/ENVO_00010483 | isolation-source | environment (material)
CS19 | Organism Detection Method | http://purl.obolibrary.org/obo/OBI_0001624 | | sample collection device or method
CS20 | Specimen Repository | | culture-collection | source material identifiers
CS21 | Specimen Repository Sample ID | | culture-collection | source material identifiers
CS22 | Sample ID - Sequencing Facility | | |
CS23 | Nucleic Acid Extraction Method | http://purl.obolibrary.org/obo/OBI_0666667 | samp_mat_process | sample material processing
CS24 | Nucleic Acid Preparation Method | | samp_mat_process | sample material processing
CS25 | Sequencing Method | http://purl.obolibrary.org/obo/OBI_0600047 | | sequencing method
CS26 | Assembly Algorithm | http://purl.obolibrary.org/obo/OBI_0001522 | | assembly
CS27 | Depth of Coverage - Average | http://purl.obolibrary.org/obo/OBI_0001618 | | finishing strategy
CS28 | Annotation Algorithm | http://purl.obolibrary.org/obo/OBI_0001625 | |
CS29 | GenBank Record ID | http://purl.obolibrary.org/obo/OBI_0001614 | |
CS30 | Comments | http://purl.obolibrary.org/obo/IAO_0000300 | host-description |
CS31 | Specimen Collector Name | | collected-by |
CS32 | Specimen Collector's Institution | | |
CS33 | Specimen Collector's email | | |
CS34 | Sample Category | | attribute_package |
CS35 | Host Disease | | host-disease |
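The Core Sample listing pairs each field ID with an NCBI BioSample attribute name, which suggests a simple lookup when preparing a submission. The sketch below assumes a record keyed by field ID and uses only a handful of the pairings listed above; the function name `to_biosample`, the example values, and the idea of reporting unmapped fields separately are illustrative assumptions, and the attribute names are taken from this listing rather than checked against NCBI's current attribute list.

```python
# Subset of the Core Sample field IDs and the BioSample names listed above.
CORE_SAMPLE_TO_BIOSAMPLE = {
    "CS1": "host-subject-id",
    "CS2": "specific_host",
    "CS4": "host-sex",
    "CS8": "collection_date",
    "CS12": "geo_loc_name",
    "CS35": "host-disease",
}


def to_biosample(record):
    """Rename field IDs to BioSample attribute names.

    Returns (mapped, unmapped): a dict keyed by BioSample name and a list of
    field IDs that have no entry in the lookup table.
    """
    mapped, unmapped = {}, []
    for field_id, value in record.items():
        name = CORE_SAMPLE_TO_BIOSAMPLE.get(field_id)
        if name is None:
            unmapped.append(field_id)
        else:
            mapped[name] = value
    return mapped, unmapped


# Hypothetical record; the values are made up for illustration.
record = {
    "CS1": "subject-001",
    "CS2": "Homo sapiens",
    "CS8": "2012-03-14",
    "CS22": "SEQ-42",  # Sample ID - Sequencing Facility has no BioSample name listed
}
mapped, unmapped = to_biosample(record)
print(mapped)
print(unmapped)
```

Keeping unmapped fields in a separate list, rather than dropping them, makes it visible which core fields would be lost in a BioSample submission.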
Slide11
Metadata Processes
[Figure: semantic network of the metadata-generating processes, grouped as Investigation, Host Characterization, Specimen Isolation, Material Processing, Sequencing Assay, and Data Processing.
Process nodes: specimen isolation process, sample processing, sequencing assay, quality assessment assay, data transformations (image processing, assembly), data transformations (variant detection, serotype marker detection, gene detection), data archiving process.
Entity nodes: specimen source (organism or environmental), specimen collector, specimen, input sample, reagents, technician, equipment, isolation protocol, microorganism, enriched NA sample, microorganism genomic NA, primary data, sequence data, genotype/serotype/gene data, sequence data record, GenBank ID, temporal-spatial region; attributes type, ID, qualities.
Relations: has_input, has_output, has_part, has_specification, has_quality, is_about, denotes, located_in, instance_of.]
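The Metadata Processes diagram links processes to their inputs and outputs, which can be written down as subject-relation-object triples. In the sketch below the node and relation names come from the diagram, but which nodes each relation connects is an assumption (the layout is not recoverable here), and `trace_back` is a hypothetical helper that follows only the first input of each process.

```python
# Assumed edges: node and relation names are from the diagram, the pairing is a guess.
TRIPLES = [
    ("specimen isolation process", "has_input", "specimen source"),
    ("specimen isolation process", "has_output", "specimen"),
    ("sample processing", "has_input", "specimen"),
    ("sample processing", "has_output", "enriched NA sample"),
    ("sequencing assay", "has_input", "enriched NA sample"),
    ("sequencing assay", "has_output", "primary data"),
    ("data transformation", "has_input", "primary data"),
    ("data transformation", "has_output", "sequence data"),
]


def trace_back(entity, triples):
    """Walk from an entity back to its origin, newest item first."""
    produced_by = {o: s for s, p, o in triples if p == "has_output"}
    inputs_of = {}
    for s, p, o in triples:
        if p == "has_input":
            inputs_of.setdefault(s, []).append(o)

    chain = [entity]
    while chain[-1] in produced_by:
        process = produced_by[chain[-1]]
        sources = inputs_of.get(process, [])
        if not sources or sources[0] in chain:  # stop on dead ends and cycles
            break
        chain.append(sources[0])  # simplification: follow the first input only
    return chain


print(trace_back("sequence data", TRIPLES))
```

With these assumed edges, the trace from sequence data leads back through primary data, the enriched NA sample, and the specimen to the specimen source, which is the kind of provenance question the semantic network is meant to answer.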
Slide12
Specimen Isolation
[Figure: semantic representation of specimen isolation.
Nodes: organism, organism part, environment, environmental material, equipment, person, specimen source role, specimen capture role, specimen collector role, specimen X, specimen type, specimen isolation procedure X, specimen isolation procedure type, isolation protocol, hypothesis, IRB/IACUC approval, pathogenic disposition, temporal-spatial region, spatial region, temporal interval, GPS location, geographic location, date/time, name, affiliation, ID; qualities gender, age, health status.
Relations: has_input, has_output, has_part, has_specification, has_quality, has_disposition, has_affiliation, has_authorization, plays, is_about, denotes, located_in, instance_of.
Core Sample fields annotated on the figure: CS1, CS2/3, CS4, CS5/6, CS7, CS8, CS9/10, CS11/12, CS13, CS14, CS15/16, CS18.]
Slide13
Core Project Semantics
Slide14
Outcome of Metadata Standards WG
Consistent metadata captured across GSCID
Bottom up approach focuses standard on important features
Support more standardized BRC interface development
Harmonization with related stakeholders – Genomic Standards Consortium MIxS, OBO Foundry OBI and NCBI BioProject/BioSample
Represented in the context of an extensible semantic framework
Slide15
Utility of semantic representation
Identified gaps in data field list (e.g. temporal components)
Includes logical structure for other, project-specific, data fields - extensible
Identified gaps in ontology data standards (use case-driven standard development)
Identified commonalities in data structures (reusable)
Support for semantic queries and inferential analysis in future
Ontology-based framework is extensible
  Sequencing => "omics"
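One way to picture the "semantic queries and inferential analysis" mentioned on this slide is a transitive reading of the located_in relation used in the semantic network: if a specimen is located in a city and the city in a country, the specimen can be inferred to be in that country. The sketch below is an assumption-laden illustration; the place names are hypothetical, and treating located_in as transitive is a modeling choice made here, not something the slides state.

```python
# Hypothetical located_in assertions; each key is located in its value.
LOCATED_IN = {
    "specimen X": "San Diego",
    "San Diego": "California",
    "California": "United States",
}


def all_locations(entity, located_in):
    """Return every region the entity is in, nearest first, by chaining located_in."""
    found, seen = [], {entity}
    current = entity
    while current in located_in:
        current = located_in[current]
        if current in seen:  # guard against circular assertions
            break
        seen.add(current)
        found.append(current)
    return found


def is_located_in(entity, region, located_in):
    """Answer a query that was never asserted directly."""
    return region in all_locations(entity, located_in)


print(all_locations("specimen X", LOCATED_IN))
print(is_located_in("specimen X", "United States", LOCATED_IN))
```

A query for all specimens from a given country could then match records that were annotated only at city level, without each project having to repeat the country explicitly.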
Slide16
Acknowledgements
Bruce Birren (2,b), Lauren Brinkac (1,a), Vincent Bruno (3,c), Elizabeth Caler (1,a), Ishwar Chandramouliswaran (1,a), Sinéad Chapman (2,b), Frank Collins (8,h), Christina Cuomo (2,b), Joana Carneiro Da Silva (3,c), Valentina Di Francesco (4), Vivien Dugan (1,a), Scott Emrich (8,h), Mark Eppinger (3,c), Michael Feldgarden (2,b), Claire Fraser (3,c), W. Florian Fricke (3,c), Maria Giovanni (4), Gloria Giraldo-Calderon (8,h), Omar S. Harb (5,g), Matt Henn (2,b), Erin Hine (3,c), Julie Dunning Hotopp (3,c), Jessica C. Kissinger (6,g), Eun Mi Lee (4), Punam Mathur (4), Garry Myers (3,c), Emmanuel Mongodin (3,c), Cheryl Murphy (2,b), Dan Neafsey (2,b), Karen Nelson (1,a), Ruchi Newman (2,b), William Nierman (1,a), Brett E. Pickett (1,d,e), Julia Puzak (4), David Rasko (3,c), David S. Roos (5,g), Lisa Sadzewicz (3,c), Richard H. Scheuermann (1,d,e), Lynn M. Schriml (3,c), Bruno Sobral (7,f), Tim Stockwell (1,a), Chris Stoeckert (5,g), Dan Sullivan (7,f), Luke Tallon (3,c), Herve Tettelin (3,c), Doyle V. Ward (2,b), David Wentworth (1,a), Owen White (3,c), Rebecca Will (7,f), Jennifer Wortman (2,b), Alison Yao (4), Jie Zheng (5,g)
1 J. Craig Venter Institute, Rockville, MD and San Diego, CA
2 Broad Institute, Cambridge, MA
3 Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD
4 National Institute of Allergy and Infectious Diseases, Rockville, MD
5 University of Pennsylvania, Philadelphia, PA
6 University of Georgia, Athens, GA
7 Cyberinfrastructure Division, Virginia Bioinformatics Institute, Blacksburg, VA
8 University of Notre Dame, South Bend, IN
a J. Craig Venter Institute Genome Sequencing Center for Infectious Diseases
b Broad Institute Genome Sequencing Center for Infectious Diseases
c Institute for Genome Sciences Genome Sequencing Center for Infectious Diseases
d Influenza Research Database Bioinformatics Resource Center
e Virus Pathogen Resource Bioinformatics Resource Center
f PATRIC Bioinformatics Resource Center
g EuPathDB Bioinformatics Resource Center
h VectorBase Bioinformatics Resource Center
Tanya Barrett – NCBI
Pelin Yilmaz – Genomic Standards Consortium
N01AI2008038 /N01AI40041