Slide1
DATA and WORKFLOWS
Slide2
Data Life Cycle
Publication: Data Commons Repository (DCR), NCBI-SRA
Sharing: Community Data folders, Data Commons, quick share links
Analysis: Discovery Environment, Atmosphere, Agave API, BisQue, DNA Subway
Metadata: add, delete, copy; metadata templates; bulk metadata
Upload: Discovery Environment, iCommands, Cyberduck
Discovery: Data Commons Repository (DCR), Elasticsearch
Slide3
Access the Data Store
Tools, from EASE to POWER: Discovery Environment, Cyberduck, iRODS iCommands
Discovery Environment: upload & download files <2GB; share, organize, edit data; add metadata and templates; publish data
Cyberduck: upload & download files of any size; share & organize data; add metadata; sync data with local drive
iRODS iCommands: upload & download files of any size; transfer data between CyVerse and other locations; sync data with local drive
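The size rule on this slide can be sketched as a small helper. This is only an illustration: the function name is hypothetical, and reading "2GB" as 2 * 1024^3 bytes is an assumption, since the slide does not say which convention applies.

```python
GIB = 1024 ** 3
# Assumption: the slide's "<2GB" limit is read as 2 * 1024**3 bytes.
DE_MAX_BYTES = 2 * GIB


def transfer_options(size_bytes):
    """List Data Store tools that can move a file of this size, easiest first."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    options = []
    # The Discovery Environment is listed only for files under the limit.
    if size_bytes < DE_MAX_BYTES:
        options.append("Discovery Environment")
    # Cyberduck and iCommands are listed for files of any size.
    options += ["Cyberduck", "iCommands"]
    return options


if __name__ == "__main__":
    print(transfer_options(500 * 1024 ** 2))
    print(transfer_options(5 * GIB))
```

The ordering of the returned list follows the EASE to POWER axis on the slide, so the first entry is the simplest tool that fits.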
Slide4
Data Dissemination: NCBI SRA Submission Pipeline
Discovery Environment
FASTQ file import and compression
Create SRA Submission Packages: from DE File menu
Create / Update NCBI BioProjects: 2 metadata templates
Create NCBI BioSamples: 10 metadata templates
Enter SRA Library metadata: 1 metadata template
Validate package and transfer to SRA: submission apps (Aspera Connect)
User receives notification from SRA
Can retrieve SRA report, correct, resubmit: report retrieval app
Slide5
Data Publication: Data Commons
Discovery Environment
Organize dataset
Add DataCite metadata and Readme file
Request DOI
Validation by Data Commons curators
Data available to public: datacommons.cyverse.org
Slide6
Overview of Genomics Workflows
Sequence Read Processing
Genome Annotation
Assembly Analysis
Genome
Transcriptome
RNA-Seq
Methylation
HTProcess
Association Pipeline
Validate Pipeline
SRA Submission
Data Commons
Discovery Environment
Agave API
Atmosphere
Assembly
Association
Data Publication
Variation Analysis
Slide7
RNA-Seq 1: Differential Expression
Discovery Environment
Atmosphere images
Aligners: TopHat2, STAR*, HISAT2*
Assembly: Cufflinks2, StringTie*
Differential expression: Cuffdiff2, DESeq2*, Ballgown, edgeR, GFold
Quantification: RSEM, HTSeq, Bam_to_counts, kallisto, Salmon, Sailfish, eXpress
Annotation: Trinotate*
Legend: Discovery Environment, Atmosphere, Both (DE & Atmo); * New additions
Slide8
RNA-Seq 2: High Throughput Process Apps
For handling large groups of data and easier workflow management. Files are managed as a group or library contained in a single directory.
Discovery Environment
HTProcess Read Cleanup Workflow: FastQC, Trimmomatic, FastQC
HTProcess Tuxedo Workflow: HTProcess TopHat2, HTProcess TopHat-2 (HPC-Agave), HTProcess Cufflinks, HTProcess CuffMerge, HTProcess CuffDiff
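The read cleanup order (FastQC, Trimmomatic, FastQC) applied to every file in a library can be sketched as below. This is not the HTProcess implementation. It assumes single-end reads, a `trimmomatic` wrapper on the PATH (some installs use `java -jar trimmomatic.jar` instead), and illustrative trimming settings; the function and directory names are hypothetical.

```python
from pathlib import PurePosixPath


def cleanup_commands(fastq_files, out_dir="cleanup"):
    """Build, without running, the QC/trim/QC commands for a library of reads."""
    out = PurePosixPath(out_dir)
    commands = []
    for name in sorted(fastq_files):
        trimmed = out / "trimmed" / PurePosixPath(name).name
        # 1. Quality report on the raw reads (FastQC needs the -o directory to exist).
        commands.append(["fastqc", name, "-o", str(out / "qc_before")])
        # 2. Trim in single-end mode; the thresholds here are example values only.
        commands.append(["trimmomatic", "SE", name, str(trimmed),
                         "SLIDINGWINDOW:4:15", "MINLEN:36"])
        # 3. Quality report on the trimmed reads for a before/after comparison.
        commands.append(["fastqc", str(trimmed), "-o", str(out / "qc_after")])
    return commands


if __name__ == "__main__":
    for cmd in cleanup_commands(["sampleB.fastq.gz", "sampleA.fastq.gz"]):
        print(" ".join(cmd))
```

Each command is returned as an argument list, so it could be passed to `subprocess.run` once the output directories have been created.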
Slide9
RNA-Seq 3: Multifactorial pairwise comparisons
For doing multiple pairwise comparisons of RNA-Seq data for differential gene expression analysis
Discovery Environment
Read cleanup: Trim-galore
Aligner: BWA
Read counting: HTSeq
Differential gene expression: DESeq2/edgeR (multifactorial pairwise comparisons)
Inputs: raw counts, target file
Outputs: DESeq2 output, edgeR output
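To make "multiple pairwise comparisons" concrete, the sketch below enumerates every pair of conditions found in a target file. The tab-separated layout with `sample` and `condition` columns is an assumption for illustration; the actual target file format used by the DE apps is not shown on the slide.

```python
import csv
import io
from itertools import combinations


def pairwise_comparisons(target_text):
    """Return every unordered pair of conditions listed in a target file."""
    reader = csv.DictReader(io.StringIO(target_text), delimiter="\t")
    conditions = []
    for row in reader:
        # Keep first-seen order so the comparison list is reproducible.
        if row["condition"] not in conditions:
            conditions.append(row["condition"])
    return list(combinations(conditions, 2))


if __name__ == "__main__":
    # Hypothetical target file: two control samples, one drought, one heat.
    example = (
        "sample\tcondition\n"
        "s1\tcontrol\n"
        "s2\tcontrol\n"
        "s3\tdrought\n"
        "s4\theat\n"
    )
    for a, b in pairwise_comparisons(example):
        print(f"{a} vs {b}")
```

With n conditions this yields n(n-1)/2 comparisons, each of which would be one DESeq2 or edgeR contrast.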
Slide10
Evolinc: Identification and Evolutionary analysis of Long Intergenic Non-coding RNA (lincRNA)
Discovery Environment
Atmosphere image
Aligners: TopHat-2.1.1
Assemblers: Cufflinks-2.2.1
Merge: Cuffmerge2
LincRNA: Evolinc-I-2.0
Evolinc-I output
LincRNA: Evolinc-II-2.0
Slide11
Genome Assembly Analysis
Discovery Environment
HTProcess Read Cleanup Workflow: FastQC, Trimmomatic, FastQC
Jellyfish
Whole Genome Assembly: AllPathsLG, SOAPdenovo-2, Velvet, Ray, Newbler, SPAdes
Assess assembly: get correctness via GAGE; compute contig stats; SNAP gene prediction; QUAST
Slide12
Genome Annotation: WQ-MAKER in Jetstream
Master instance with multiple Worker instances
WQ-MAKER Jetstream Image: MAKER, Augustus, SNAP, Exonerate, BLAST, RepeatMasker, cctools, iCommands
MAKER output
In collaboration with the Douglas Thain lab (http://www3.nd.edu/~dthain/)
Slide13
Expression Data Analysis Pipeline by CyVerseUK
Discovery Environment
Differential Expression: GP2S, Gradient Tool
Clustering / Biclustering: BHC, TCAP, Wigwams
Network Inference: CSI, hCSI, oCSI
Transcription Factor Motif Enrichment: HMT, MEME-LaB
Slide14
Metagenomics Using QIIME 1
Discovery Environment
Atmosphere images
Check mapping file + metadata: validate_mapping_file
Split libraries + quality filtering (Illumina): split_libraries_fastq
Pick Operational Taxonomic Units (OTUs): pick_open_reference_otus
Analyze diversity: core_diversity_analyses
Make Emperor - 3D PCoA plots: make_emperor
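The five QIIME 1 steps above can be chained as command lines, sketched here without running anything. The script names come from the slide; the flags and the intermediate file names follow QIIME 1 defaults as best recalled and should be checked against the installed version. The function name, directory layout, and sampling depth are hypothetical.

```python
def qiime1_commands(mapping, reads, barcodes, out="qiime_out", depth=1000):
    """Build the QIIME 1 command sequence in the order shown on the slide."""
    otus = f"{out}/otus"
    return [
        # Check mapping file + metadata.
        ["validate_mapping_file.py", "-m", mapping, "-o", f"{out}/mapping_check"],
        # Split libraries + quality filtering (Illumina).
        ["split_libraries_fastq.py", "-i", reads, "-b", barcodes,
         "-m", mapping, "-o", f"{out}/slout"],
        # Pick OTUs; assumes the previous step wrote seqs.fna.
        ["pick_open_reference_otus.py", "-i", f"{out}/slout/seqs.fna", "-o", otus],
        # Analyze diversity at an even sampling depth (example value).
        ["core_diversity_analyses.py",
         "-i", f"{otus}/otu_table_mc2_w_tax_no_pynast_failures.biom",
         "-m", mapping, "-t", f"{otus}/rep_set.tre",
         "-e", str(depth), "-o", f"{out}/cdout"],
        # Make Emperor 3D PCoA plots from one of the PCoA coordinate files.
        ["make_emperor.py",
         "-i", f"{out}/cdout/bdiv_even{depth}/unweighted_unifrac_pc.txt",
         "-m", mapping, "-o", f"{out}/emperor"],
    ]


if __name__ == "__main__":
    for cmd in qiime1_commands("map.txt", "reads.fastq.gz", "barcodes.fastq.gz"):
        print(" ".join(cmd))
```

Building the list first makes it easy to review the whole plan before handing each command to `subprocess.run` on an Atmosphere image that has QIIME 1 installed.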
Slide15
Transcriptome Assembly Analysis
Discovery Environment
HTProcess Read Cleanup Workflow: FastQC, Trimmomatic, FastQC
Jellyfish
Transcriptome Assembly: Trinity, SOAPdenovo-Trans, Oases, Bayesembler, Cufflinks
Assembly analysis: GAGE to get correctness; compute contig stats; CD-HIT-EST; annotate transcripts; decode transcript; rnaQUAST
Slide16
Non-model organism de novo transcriptomics
Discovery Environment
Atmosphere images
Raw read assessment, adapter removal, trimming: FastQC + Scythe + Sickle
De novo assembly of reads: Trinity
Assessment of assembly: rnaQUAST
Differential gene expression: DESeq2 or edgeR
Annotation of transcripts: Trinotate
Slide17
DNA Methylation Analysis
Discovery Environment
Genome Indexer: Bismark Genome Preparation
Bisulfite Sequence Aligner and Methylation Caller: Bismark, BSMAP*, GSNAP*
Methylation Reporter: Bismark Methylation Extractor, ZED-align*
Differential Methylation Caller: BisuKit*
Slide18
Association Analysis
Discovery Environment
Agave API
Variant calling: SAMtools, GBS, EMS
Imputing: NPUTE
Population structure: Structure, fastStructure, PCA
Association: GLM, MLM, EMMAX, MLMM, PLINK, FaST-LMM, GEMMA*, BayesR*
Legend: Discovery Environment, Agave API, Both (DE & Agave); * New additions
Slide19
Variant Caller Workflow
Discovery Environment
Agave API
Reference prep: Picard, SAMtools
Aligners: BWA-MEM, BWA-ALN
Variant callers: GATK UnifiedGenotyper, Platypus.py callVariants, SAMtools mpileup
Slide20
Validate Workflow
Extensible, scalable testing of tool accuracy and precision
Discovery Environment
Agave API
Simulate Apps: industrial-scale GWAS simulation files from Syngenta, AlphaSim
Associate and Predict Apps: FaST-LMM, GEMMA, PLINK, QxPak, BayesR, GenSel, RidgeRegression
Winnow App (calculate signal/noise): Python code, stats libraries, D. Hand's metrics
Demonstrate Apps (human-readable graphics and tables): Python and R code, R ggplot2
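The idea of scoring a tool against simulated data, where the true causal variants are known, can be illustrated with basic confusion-matrix metrics. This is not Winnow's code and it does not implement D. Hand's metrics; the function name and SNP identifiers are hypothetical.

```python
def score_calls(truth, reported):
    """Compare SNPs reported by a tool with the known causal SNPs of a simulation."""
    truth, reported = set(truth), set(reported)
    tp = len(truth & reported)      # causal SNPs the tool found
    fp = len(reported - truth)      # reported SNPs that are not causal
    fn = len(truth - reported)      # causal SNPs the tool missed
    # Guard against division by zero when either set is empty.
    precision = tp / (tp + fp) if reported else 0.0
    sensitivity = tp / (tp + fn) if truth else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "sensitivity": sensitivity}


if __name__ == "__main__":
    known = {"snp1", "snp2", "snp3", "snp4"}
    called = {"snp1", "snp2", "snp9"}
    print(score_calls(known, called))
```

Running the same scoring over the output of each Associate and Predict app on the same simulated dataset gives a like-for-like comparison of tools.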
Slide21
Comparative Genomics
Legend: Discovery Environment, Atmosphere, CoGe
Raw reads assessment, trimming: FastQC, Scythe, Sickle
Genome assembly: AllPaths, Velvet, etc.
Genome annotation: Maker
Genome in NCBI or FTP site
Genome integration into CoGe: LoadGenome
Integration of functional and diversity data with genome: FlowGe
Syntenic analysis, visualize, compare to other sequences: SynMap, GenomeView, GEvo
Data publication: Data Commons
Slide22
Species Distribution Modeling
Discovery Environment
Atmosphere images
Surface Models: Geographic Resources Analysis Support System (GRASS)*, System for Automated Geoscientific Analyses (SAGA)*
Georeferencing and Alignment: Geospatial Data Abstraction Library (GDAL)*
Species Distribution Models (SDM): RStudio-Server (R)*
Zonal Statistics: Quantum Geographic Information System (QGIS)*
Slide23
In the Works
Phenomics workflow support
Image data management and analysis
Geographic information systems
To discuss your phenomics workflow needs, contact info@cyverse.org.
Slide24
Workflows: Quick Reference Guide
Workflow | Platform | Limits | Documentation
DATA
Data Management Life Cycle | All | -- | Data Commons Repository
GENOMICS
SRA Submission | DE | -- | Documentation and tutorial
RNA-Seq 1: Differential Expression | DE, Atmo | -- | Tutorial
RNA-Seq 2: HTProcess | DE, Agave | 150 GB input data max | Documentation
RNA-Seq 3: Multifactorial Pairwise Comparisons | DE | -- | DESeq2 tutorial, edgeR tutorial
Expression Data Analysis Pipeline (UK) | DE | -- | Documentation, Tutorial
Evolinc | DE, Atmo | -- | DE tutorial, Atmo tutorial
Genome Assembly | DE, Agave | -- | Cleanup tutorial, Assembly tutorial
Genome Annotation: WQ-MAKER | Jetstream | -- | Tutorial
Metagenomics: QIIME | DE, Atmo | -- | Atmo tutorial
DE - Discovery Environment
Atmo - Atmosphere
Agave - Agave API
Slide25
Workflows: Quick Reference Guide (cont.)
Workflow | Platform | Limits | Documentation
GENOMICS (cont.)
Transcriptome Assembly | DE, Agave | 48 hrs run time max | Tutorial
Non-model Organism Transcriptomics | DE, Atmo | -- |
Methylation Analysis | DE, Agave | -- | Bismark Genome Preparation, Bismark, Bismark Methylation Extractor
Association Analysis | DE, Agave | -- | Documentation and tutorials
Variant Caller | DE, Agave | 48 hrs run time max | Documentation
Validate | Atmo, Agave | -- | Documentation
GEOSPATIAL DATA
Species Distribution Modeling | DE, Atmo | -- |
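The limits listed in the quick reference guide can be checked before submitting a job. The sketch below encodes only the three limits stated in the guide (150 GB input for RNA-Seq 2 HTProcess; 48 hrs run time for Transcriptome Assembly and Variant Caller). The dictionary keys and function name are hypothetical, and workflows marked "--" are treated as having no stated limit.

```python
# Limits as stated in the quick reference guide; other workflows list "--".
LIMITS = {
    "RNA-Seq 2: HTProcess": {"max_input_gb": 150},
    "Transcriptome Assembly": {"max_runtime_hrs": 48},
    "Variant Caller": {"max_runtime_hrs": 48},
}


def limit_violations(workflow, input_gb=0.0, runtime_hrs=0.0):
    """Return a list of human-readable problems; empty means within stated limits."""
    limits = LIMITS.get(workflow, {})
    problems = []
    max_gb = limits.get("max_input_gb", float("inf"))
    max_hrs = limits.get("max_runtime_hrs", float("inf"))
    if input_gb > max_gb:
        problems.append(f"input {input_gb} GB exceeds {max_gb} GB max")
    if runtime_hrs > max_hrs:
        problems.append(f"run time {runtime_hrs} hrs exceeds {max_hrs} hrs max")
    return problems


if __name__ == "__main__":
    print(limit_violations("RNA-Seq 2: HTProcess", input_gb=200))
    print(limit_violations("Variant Caller", runtime_hrs=24))
```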