DATA and WORKFLOWS Data Life Cycle - PowerPoint Presentation

mrsimon
Uploaded On 2020-10-06

Presentation Transcript

Slide1

DATA and WORKFLOWS

Slide2

Data Life Cycle

Publication: Data Commons Repository (DCR), NCBI-SRA

Sharing: Community Data folders, Data Commons, quick share links

Analysis: Discovery Environment, Atmosphere, Agave API, BisQue, DNA Subway

Metadata: add, delete, copy; metadata templates; bulk metadata

Upload: Discovery Environment, iCommands, Cyberduck

Discovery: Data Commons Repository (DCR), Elasticsearch

Slide3

Access the Data Store

Tools, from EASE to POWER: Discovery Environment, Cyberduck, iCommands (iRODS)

Upload & download files <2GB; share, organize, edit data; add metadata and templates; publish data

Upload & download files of any size; transfer data between CyVerse and other locations; sync data with local drive

Upload & download files of any size; share & organize data; add metadata; sync data with local drive
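
The size limits on this slide can be sketched as a small lookup. This is only an illustration: the function name is hypothetical, and treating "2GB" as 2 x 1024^3 bytes is an assumption, since the slide does not say whether binary or decimal units are meant.

```python
GB = 1024 ** 3  # assumption: binary gigabytes

# Size ceiling for Discovery Environment transfers, from the slide ("<2GB").
DE_MAX_BYTES = 2 * GB


def transfer_options(size_bytes):
    """Return the Data Store access tools that can move a file of this size."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    options = ["Cyberduck", "iCommands"]  # both handle files of any size
    if size_bytes < DE_MAX_BYTES:
        options.insert(0, "Discovery Environment")
    return options


print(transfer_options(500 * 1024 ** 2))  # ['Discovery Environment', 'Cyberduck', 'iCommands']
print(transfer_options(40 * GB))          # ['Cyberduck', 'iCommands']
```

A file of exactly 2 GB is treated as too large for the Discovery Environment, because the slide states the limit as strictly less than 2GB.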

Slide4

Data Dissemination: NCBI SRA Submission Pipeline

Discovery Environment

FASTQ file import and compression

Create SRA Submission Packages (from DE File menu)

Create / Update NCBI BioProjects (2 metadata templates)

Create NCBI BioSamples (10 metadata templates)

Enter SRA Library metadata (1 metadata template)

Validate package and transfer to SRA (submission apps, Aspera Connect)

User receives notification from SRA

Can retrieve SRA report, correct, resubmit (report retrieval app)

Slide5

Data Publication: Data Commons

Data Commons

Discovery Environment

Organize dataset

Add DataCite metadata and Readme file

Request DOI

Validation by Data Commons curators

Data available to public

datacommons.cyverse.org
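
Before requesting a DOI, it can help to check that the DataCite metadata is complete. The sketch below tests a record against the mandatory DataCite properties other than the identifier (creator, title, publisher, publication year, resource type). The key spellings, the function name, and the example record are illustrative assumptions; the actual Data Commons template may name or require different fields.

```python
# Mandatory DataCite properties other than the identifier (the DOI itself is
# what is being requested). Key spellings here are illustrative assumptions.
REQUIRED_FIELDS = ("creator", "title", "publisher", "publicationYear", "resourceType")


def missing_fields(metadata):
    """List required fields that are absent or blank in a metadata record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = metadata.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# Hypothetical example record with one blank and one absent field.
record = {
    "creator": "Doe, Jane",
    "title": "Example RNA-Seq dataset",
    "publisher": "CyVerse Data Commons",
    "publicationYear": "",
}
print(missing_fields(record))  # ['publicationYear', 'resourceType']
```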

Slide6

Overview of Genomics Workflows

Sequence Read Processing

Genome Annotation

Assembly Analysis

Genome

Transcriptome

RNA-Seq

Methylation

HTProcess

Association Pipeline

Validate Pipeline

SRA Submission

Data Commons

Discovery Environment

Agave API

Atmosphere

Assembly

Association

Data Publication

Variation Analysis

Slide7

RNA-Seq 1: Differential Expression

Discovery Environment

Atmosphere images

Aligners: TopHat2, STAR*, HISAT2*

Assembly: Cufflinks2, StringTie*

Differential expression: Cuffdiff2, DESeq2*, Ballgown, edgeR, GFold

Quantification: RSEM, HTSeq, Bam_to_counts, kallisto, Salmon, Sailfish, eXpress

Annotation: Trinotate*

Legend: Discovery Environment; Atmosphere; Both (DE & Atmo); * New additions

Slide8

RNA-Seq 2: High Throughput Process Apps

For handling large groups of data and easier workflow management. Files are managed as a group or library contained in a single directory.

Discovery Environment

HTProcess Read Cleanup Workflow: FastQC, Trimmomatic, FastQC

HTProcess Tuxedo Workflow: HTProcess TopHat2, HTProcess CuffMerge, HTProcess CuffDiff, HTProcess TopHat-2 (HPC-Agave), HTProcess Cufflinks

Legend: Discovery Environment; Atmosphere; Both (DE & Atmo); * New additions
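
The idea of treating all read files in one directory as a single library can be sketched as follows. This is not HTProcess code; the function name, the file-name suffixes, and the example file names are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

# Assumed FASTQ naming conventions; the slide does not list accepted suffixes.
FASTQ_SUFFIXES = (".fastq", ".fq", ".fastq.gz", ".fq.gz")


def collect_library(directory):
    """Return the sorted FASTQ file names found directly inside one directory."""
    return sorted(
        p.name
        for p in Path(directory).iterdir()
        if p.is_file() and p.name.endswith(FASTQ_SUFFIXES)
    )


# Build a throwaway directory with hypothetical files to show the grouping.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("ctrl_1.fq.gz", "ctrl_2.fq.gz", "treat_1.fastq", "notes.txt"):
        (Path(tmp) / name).touch()
    library = collect_library(tmp)

print(library)  # ['ctrl_1.fq.gz', 'ctrl_2.fq.gz', 'treat_1.fastq']
```

Keeping the library as one directory means a whole batch can be handed to the next step by passing a single path rather than a list of files.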

Slide9

RNA-Seq 3: Multifactorial pairwise comparisons

For doing multiple pairwise comparisons of RNA-Seq data for differential gene expression analysis

Discovery Environment

Aligner: BWA

Read counting: HTSeq

Read cleanup: Trim-galore

Differential gene expression: DESeq2/edgeR (multifactorial pairwise comparisons)

DESeq2 output

edgeR output

Raw counts

Target file

Legend: Discovery Environment; Atmosphere; Both (DE & Atmo); * New additions
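
To make "multiple pairwise comparisons" concrete: with n conditions there are n(n-1)/2 unordered pairs to test. The sketch below enumerates them with the standard library. The condition names are hypothetical, and this only lists the contrasts; it does not run DESeq2 or edgeR.

```python
from itertools import combinations


def pairwise_contrasts(conditions):
    """Return every unordered pair of distinct conditions, in input order."""
    unique = list(dict.fromkeys(conditions))  # drop repeats, keep first-seen order
    return list(combinations(unique, 2))


# Hypothetical experimental conditions.
contrasts = pairwise_contrasts(["control", "drought", "heat", "drought+heat"])
for a, b in contrasts:
    print(f"{a} vs {b}")
print(len(contrasts))  # 6
```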

Slide10

Evolinc: Identification and Evolutionary analysis of Long Non-coding RNA (LincRNA)

Discovery Environment

Atmosphere image

Aligners: TopHat-2.1.1

Assemblers: Cufflinks-2.2.1

Merge: Cuffmerge2

LincRNA: Evolinc-I-2.0

LincRNA: Evolinc-II-2.0

Evolinc-I output

Evolinc-I output

Legend: Discovery Environment; Atmosphere; Both (DE & Atmo); * New additions

Slide11

Genome Assembly Analysis

Discovery Environment

HTProcess Read Cleanup Workflow: FastQC, Trimmomatic, FastQC

Jellyfish

Genome Assembly (Whole Genome Assembly): AllPathsLG, SOAPdenovo-2, Velvet, Ray, Newbler, Spades

Assess assembly: get correctness via GAGE; compute contig stats; Snap-gene prediction; QUAST

Legend: Discovery Environment; Atmosphere; Both (DE & Atmo); * New additions
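
Contig statistics usually include the contig count, total length, longest contig, and N50 (the length of the contig at which the running total, taken from longest to shortest, first reaches half of the assembly length). The sketch below uses that standard definition; it is not the code of any tool on this slide, and the function names and example lengths are illustrative.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0


def contig_stats(contig_lengths):
    """Summarize an assembly from its contig lengths."""
    return {
        "contigs": len(contig_lengths),
        "total_bp": sum(contig_lengths),
        "longest": max(contig_lengths, default=0),
        "n50": n50(contig_lengths),
    }


# Hypothetical contig lengths: total 300, so N50 is reached at 80 + 70 = 150.
print(contig_stats([80, 70, 50, 40, 30, 20, 10]))
# {'contigs': 7, 'total_bp': 300, 'longest': 80, 'n50': 70}
```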

Slide12

Genome Annotation: WQ-MAKER in Jetstream

WQ-MAKER Jetstream Image

Master instance

Worker (7 instances)

Augustus, SNAP, Exonerate, BLAST, RepeatMasker

cctools, iCommands, MAKER

MAKER output

In collaboration with the Douglas Thain lab (http://www3.nd.edu/~dthain/)
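
The master/worker layout on this slide can be illustrated with a toy example: one master hands independent pieces of the genome to a pool of workers and gathers the results. This sketch uses Python's standard thread pool on a single machine as a stand-in; it is not the cctools Work Queue API, and the `annotate` task is a placeholder rather than a real MAKER run.

```python
from concurrent.futures import ThreadPoolExecutor


def annotate(contig):
    """Placeholder task: a real worker would run MAKER on the contig."""
    name, sequence = contig
    return name, len(sequence)


def run_master(contigs, n_workers=7):
    """Hand contigs to a pool of workers and gather results by contig name."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return dict(pool.map(annotate, contigs))


# Hypothetical input: (name, sequence) pairs.
contigs = [("contig1", "ACGT" * 5), ("contig2", "GGCC"), ("contig3", "A" * 9)]
print(run_master(contigs))  # {'contig1': 20, 'contig2': 4, 'contig3': 9}
```

The default of seven workers mirrors the seven worker boxes drawn on the slide; in practice the number would depend on the instances available.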

Slide13

Expression Data Analysis Pipeline by CyVerseUK

Discovery Environment

Differential Expression: GP2S, Gradient Tool

Clustering / Biclustering: BHC, TCAP, Wigwams

Network Inference: CSI, hCSI, oCSI

Transcription Factor Motif Enrichment: HMT, MEME-LaB

Legend: Discovery Environment; Atmosphere; Both (DE & Atmo); * New additions

Slide14

Metagenomics Using QIIME 1

Discovery Environment

Atmosphere images

Check mapping file + metadata: validate_mapping_file

Split libraries + quality filtering (Illumina): split_libraries_fastq

Pick Operational Taxonomic Units (OTUs): pick_open_reference_otus

Analyze diversity: core_diversity_analyses

Make Emperor - 3D PCoA plots: make_emperor

Legend: Discovery Environment; Atmosphere; Both (DE & Atmo); * New additions

Slide15

Transcriptome Assembly Analysis

Discovery Environment

HTProcess Read Cleanup Workflow: FastQC, Trimmomatic, FastQC

Jellyfish

Transcriptome Assembly: Trinity, SOAPdenovo-Trans, Oases, Bayesembler, Cufflinks

Analysis: GAGE to get correctness; compute contig stats; CD-HIT-EST; annotate transcripts; decode transcript; RNA-QUAST

Legend: Discovery Environment; Atmosphere; Both (DE & Atmo); * New additions

Slide16

Non-model organism de novo transcriptomics

Discovery Environment

Atmosphere images

Raw FASTA assessment, adapter removal, trimming: FastQC + Scythe + Sickle

De novo assembly of reads: Trinity

Assessment of assembly: rnaQUAST

Differential gene expression: DESeq2 or edgeR

Annotation of transcripts: Trinotate

Legend: Discovery Environment; Atmosphere; Both (DE & Atmo); * New additions

Slide17

DNA Methylation Analysis

Discovery Environment

Genome Indexer: Bismark Genome Preparation

Bisulfite Sequence Aligner and Methylation Caller: Bismark, BSMAP*, GSNAP*

Methylation Reporter: Bismark Methylation Extractor

ZED-align*

Differential Methylation Caller: BisuKit*

Legend: Discovery Environment; Atmosphere; Both (DE & Atmo); * New additions

Slide18

Association Analysis

Discovery Environment

Agave API

Variant calling: SAMtools, GBS, EMS

Imputing: NPUTE

Population structure: Structure, fastStructure, PCA

Association: GLM, MLM, EMMAX, MLMM, PLINK, FastLMM, GEMMA*, BayesR*

Legend: Discovery Environment; Agave API; Both (DE & Agave); * New additions

Slide19

Variant Caller Workflow

Discovery Environment

Agave API

Aligners: BWA-MEM, BWA-ALN

Reference prep: Picard, SAMtools

Variant callers: GATK UnifiedGenotyper, Platypus.py callVariants, SAMtools mpileup

Legend: Discovery Environment; Agave API; Both (DE & Agave); * New additions

Slide20

Validate Workflow

Extensible, scalable testing of tool accuracy and precision

Discovery Environment

Agave API

Simulate Apps: Industrial scale GWAS simulation files from Syngenta, AlphaSim

Associate and Predict Apps: FaST-LMM, GEMMA, Plink, QxPak, BayesR, GenSel, RidgeRegression

Winnow App (calculate signal/noise): Python code, stats libraries, D. Hand’s metrics

Demonstrate Apps (human-readable graphics and tables): Python and R code, R ggplot2

Legend: Discovery Environment; Agave API; Both (DE & Agave); * New additions
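
Testing a tool against simulated data comes down to comparing what it reports with what was planted. The sketch below scores a set of reported SNPs against the known causal SNPs and derives precision and sensitivity. It is not the Winnow implementation and does not reproduce its metrics; the function name and SNP identifiers are hypothetical.

```python
def score_calls(true_snps, called_snps):
    """Compare reported SNPs with the simulated truth set."""
    truth, calls = set(true_snps), set(called_snps)
    tp = len(truth & calls)   # causal SNPs that were reported
    fp = len(calls - truth)   # reported SNPs that are not causal
    fn = len(truth - calls)   # causal SNPs that were missed
    precision = tp / (tp + fp) if calls else 0.0
    sensitivity = tp / (tp + fn) if truth else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "sensitivity": sensitivity}


# Hypothetical truth set and tool output.
result = score_calls(["snp1", "snp2", "snp3", "snp4"], ["snp2", "snp3", "snp9"])
print(result["tp"], result["fp"], result["fn"])  # 2 1 2
print(round(result["precision"], 3), result["sensitivity"])  # 0.667 0.5
```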

Slide21

Comparative Genomics

Discovery Environment

Atmosphere

CoGe

Genome assembly: AllPaths, Velvet, etc.

Genome annotation: Maker

Genome integration into CoGe: LoadGenome

Raw reads assessment, trimming: FastQC, Scythe, Sickle

Genome in NCBI or FTP site

Integration of functional and diversity data with genome: FlowGe

Syntenic analysis, visualize, compare to other sequences: SynMap, GenomeView, GEvo

Data publication: Data Commons

Slide22

Species Distribution Modeling

Discovery Environment

Atmosphere images

Surface Models: Geographic Resources Analysis Support System (GRASS)*, System for Automated Geoscientific Analyses (SAGA)*

Georeferencing and Alignment: Geospatial Data Abstraction Library (GDAL)*

Species Distribution Models (SDM): RStudio-Server (R)*

Zonal Statistics: Quantum Geographic Information System (QGIS)*

Legend: Discovery Environment; Atmosphere; Both (DE & Atmo); * New additions

Slide23

In the Works

Phenomics workflow support

Image data management and analysis

Geographic information systems

To discuss your phenomics workflow needs, contact info@cyverse.org.

Slide24

Workflows: Quick Reference Guide

Workflow | Platform | Limits | Documentation

DATA

Data Management Life Cycle | All | -- | Data Commons Repository

GENOMICS

SRA Submission | DE | -- | Documentation and tutorial

RNA Seq 1 Differential Expression | DE, Atmo | -- | Tutorial

RNA Seq 2 HT Process | DE, Agave | 150 GB input data max | Documentation

RNA Seq 3 Multifactorial Pairwise Comparisons | DE | -- | DESeq2 tutorial, edgeR tutorial

Expression Data Analysis Pipeline (UK) | DE | -- | Documentation, Tutorial

Evolinc | DE, Atmo | -- | DE tutorial, Atmo tutorial

Genome Assembly | DE, Agave | -- | Cleanup tutorial, Assembly tutorial

Genome Annotation WQ-MAKER | Jetstream | -- | Tutorial

Metagenomics QIIME | DE, Atmo | -- | Atmo tutorial

DE - Discovery Environment; Atmo - Atmosphere; Agave - Agave API

Slide25

Workflows: Quick Reference Guide (cont.)

Workflow | Platform | Limits | Documentation

GENOMICS (cont.)

Transcriptome Assembly | DE, Agave | 48 hrs run time max | Tutorial

Non-model Organism Transcriptomics | DE, Atmo | -- |

Methylation Analysis | DE, Agave | -- | Bismark Genome Preparation, Bismark, Bismark Methylation Extractor

Association Analysis | DE, Agave | -- | Documentation and tutorials

Variant Caller | DE, Agave | 48 hrs run time max | Documentation

Validate | Atmo, Agave | -- | Documentation

GEOSPATIAL DATA

Species Distribution Modeling | DE, Atmo | -- |

DE - Discovery Environment; Atmo - Atmosphere; Agave - Agave API