Slide1
Galaxy for Bioinformatics Analysis
An Introduction
TCD Bioinformatics Support Team
Fiona Roche, PhD
Date: 31/08/15
Slide2
Overview
What is Galaxy?
Why is it useful?
Command-line vs Galaxy
A Basic Analysis with Galaxy
Resources for Learning
Slide3
What is Galaxy?
A web-based genome analysis platform designed for experimental biologists
www.galaxyproject.org
Slide4
Why is it useful to a biologist?
Easy to use!
Allows data import from popular resources
Provides access to best practice bioinformatics tools
Allows you to build analysis pipelines and share them
Provides multiple ways to visualise your data
Slide5
Case Study: ChIP-seq Analysis Pipeline
Sequencing
Pre-processing of raw reads
Quality control
Map reads to reference genome
Peak calling
Enriched regions
Slide6
Case Study: ChIP-seq Analysis Pipeline
Sequencing
Pre-processing of raw reads
Quality control
Map reads to reference genome
Peak calling
Enriched regions
Visualisation with genome browser
Motif discovery
Relationship with gene structure
Gene set analysis
Differential profile analysis
Slide7
Question
Which gene promoter regions do these enriched regions map to?
Slide8
Command-line approach
1. Extract gene coordinates from UCSC
2. Extract 1kb upstream coordinates from UCSC
3. Merge upstream coordinates and gene annotation
4. Clean files
5. Join the input files
6. Create user track for UCSC
7. Import to UCSC
8. Run a wrapper script to enable a re-run of this pipeline with different parameters.
Slide9
Slide10
Galaxy Approach
Slide11
The Galaxy Interface
Data sources and Tools
Main Analysis window
History of commands
Main Menus
Slide12
Overview of Analysis
Question: Which gene promoter regions do these enriched regions map to?
Analysis steps:
1. Import two datasets into Galaxy: genomic coordinates of enriched peaks and genomic coordinates of genes
2. Extract upstream regions of genes
3. Data cleaning
4. Identify overlap between promoter regions and enriched regions
5. Visualise on a genome browser
Slide13
Let’s begin!
Register an account
http://bioinf.gen.tcd.ie/workshops/Galaxy/
Slide14
Let’s begin!
Step 1: Get Data into Galaxy
Slide15
Step 1: Get data #1
TAF1 peaks
Get Data -> Upload File -> Paste/Fetch -> Enter URL -> Start
1. Click Upload File
2. Click Paste/Fetch to display the URL box above
3. Paste in the URL containing your data:
http://bioinf.gen.tcd.ie/workshops/Galaxy/TAF1_peaks.txt
4. Select ‘tabular’ file type
5. Type hg19 and specify Human Feb. 2009 (GRCh37/hg19) (hg19)
6. Click Start to upload the data to your history!
7. Click Close
Slide16
Data uploaded to your history!
The file was sent to your history and given a number
The history keeps track of all steps in your analysis
Slide17
Step 2: Rename your History
1. Click the history name to rename your history
You can have multiple histories with different names
2. Click the cog wheel if you want to create a new history or see a list of your saved histories
Slide18
Step 3: Review your dataset
1. Click on the dataset name to expand/collapse the metadata and mini view of the file content
2. Click the eye icon to see the file contents in the main analysis window
3. Click the pencil icon to edit the file attributes
4. Click the x to delete the file
Slide19
Step 4a: Edit dataset
Change File name to a shorter name
1. Click the pencil icon to edit the file attributes
2. There are four tabs in edit mode. To change the file name, click Attributes
3. First rename the file
4. Copy and paste the old name into the info to keep a record of it
5. Click save
Slide20
Step 4b: Edit dataset
Change File format so Galaxy knows where to find chr, start, end
1. Click Datatype to change the file format
2. Select interval from drop down and then click save
3. Define which columns of your TAF1 file are “chrom”, “start” and “end”. Look at the mini view image to see your TAF1 file
4. Click save
5. Format changed to interval. Galaxy now knows where chr, start and end are.
Slide21
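Setting the datatype to interval in Step 4b amounts to telling Galaxy which columns hold the chromosome, start and end. The idea can be sketched in Python as below. This is only an illustration, assuming a tab-separated file: the function name, the default column positions and the sample rows are hypothetical and are not taken from TAF1_peaks.txt.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int
    end: int


def parse_intervals(lines, chrom_col=0, start_col=1, end_col=2):
    """Read tab-separated rows using 0-based column indices (assumed defaults)."""
    intervals = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split("\t")
        intervals.append(
            Interval(fields[chrom_col], int(fields[start_col]), int(fields[end_col]))
        )
    return intervals


# Hypothetical rows, for illustration only
sample = ["#chrom\tstart\tend\tscore", "chr1\t100\t500\t7.5", "chr2\t2000\t2600\t3.1"]
peaks = parse_intervals(sample)
print(peaks[0])
```

If the chromosome, start and end sat in other columns, only the three `*_col` arguments would change, which is roughly what choosing the columns in the Datatype tab does.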
Step 5: Get data #2 -> Genes
Get Data -> UCSC Main Table Browser
Slide22
Step 5: Get data #2 -> Genes
Ensure all drop downs as shown below are selected
1. Select all fields from drop downs as shown above, then click get output
2. Click Send query to Galaxy
Slide23
Step 6: Edit dataset
Change File name to a shorter name
Click the pencil icon to edit the file name
File name changed
File format = bed
Galaxy already knows where chr, start and end are
Slide24
Step 7: Get Promoter Regions
Tool: Operate on Genomic Intervals -> Get Flanks
1. Select Genes dataset
2. Select upstream
3. Select 1000bp upstream
4. Click Execute
5. Output sent to history! Same file content as ‘Genes’ but start and end coordinates are replaced with promoter regions
6. Rename file to ‘Promoters’
Slide25
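The Get Flanks step in Step 7 can be pictured with a short Python sketch. It assumes BED-style 0-based, half-open coordinates and that "upstream" is taken relative to the gene's strand. The function name and the sample genes are hypothetical, and the slides do not confirm that Galaxy's tool behaves exactly this way in every setting.

```python
def upstream_flank(chrom, start, end, strand, length=1000):
    """Return (chrom, flank_start, flank_end, strand) for the region upstream of a gene.

    Assumes 0-based, half-open coordinates (as in BED).
    """
    if strand == "-":
        # Minus-strand genes are read right to left, so upstream lies past the end
        return (chrom, end, end + length, strand)
    # Plus-strand genes: upstream lies before the start; clamp at position 0
    return (chrom, max(0, start - length), start, strand)


# Hypothetical genes, for illustration only
genes = [("chr1", 5000, 9000, "+"), ("chr1", 20000, 26000, "-"), ("chr2", 300, 900, "+")]
promoters = [upstream_flank(*g) for g in genes]
for p in promoters:
    print(*p, sep="\t")
```

Passing `length=3000` instead gives the 3kb variant that the workflow section later re-runs.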
Step 8: Clean dataset
Tool: Text Manipulation -> Cut
1. Cut out the specific columns we want from the ‘Promoters’ file
2. Click Execute
3. Rename the output file to ‘Clean Promoters’
Slide26
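The Cut step in Step 8 keeps a chosen subset of columns from each row. A minimal Python sketch of the same operation follows. The slides do not say which columns are kept, so the column list and the sample row here are assumptions for illustration, and `cut_columns` is a hypothetical helper.

```python
def cut_columns(line, columns):
    """Keep only the given 1-based columns of a tab-separated line, in the order listed."""
    fields = line.rstrip("\n").split("\t")
    return "\t".join(fields[c - 1] for c in columns)


# Hypothetical 'Promoters' row: chrom, start, end, name, score, strand
row = "chr1\t4000\t5000\tgeneA\t0\t+"
cleaned = cut_columns(row, [1, 2, 3, 4, 6])
print(cleaned)
```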
Datasets ready for analysis!
Both files are associated with human hg19
Galaxy knows for each file where chr, start and end are.
Now, we are ready to join these files and see which promoters have TAF1 peaks!
Dataset #1
Dataset #2
Slide27
How do we Join Genomic Intervals?
Example
Interval file #1
Chr   Start  End   Name    Strand
Chr1  100    500   int1    +
Chr1  1000   1200  int2    +
Interval file #2
Chr   Start  End   Name    Strand
Chr1  200    400   cloneA  +
Chr1  900    1000  cloneB  +
Intervals that overlap!
Chr1  100    500   int1    +    Chr1  200    400   cloneA  +
Diagram: #1 spans 100-500 and 1000-1200; #2 spans 200-400 and 900-1000
Slide28
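The join example can be reproduced with a few lines of Python. This is a sketch under one stated assumption: coordinates are half-open, so 900-1000 and 1000-1200 touch but do not overlap, which matches the single overlapping pair shown on the slide. The function names are illustrative and the brute-force comparison is not how Galaxy implements the tool.

```python
def overlaps(a, b):
    """True if two (chrom, start, end, ...) intervals share at least one base.

    Assumes half-open coordinates, so 900-1000 and 1000-1200 do not overlap.
    """
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def inner_join(first, second):
    """Return every pair (a, b), a from `first` and b from `second`, that overlap."""
    return [(a, b) for a in first for b in second if overlaps(a, b)]


file1 = [("Chr1", 100, 500, "int1", "+"), ("Chr1", 1000, 1200, "int2", "+")]
file2 = [("Chr1", 200, 400, "cloneA", "+"), ("Chr1", 900, 1000, "cloneB", "+")]

joined = inner_join(file1, file2)
for a, b in joined:
    print(*a, *b, sep="\t")
```

Comparing every pair is fine for a handful of rows; for tens of thousands of promoters a sorted or indexed approach would be needed.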
Step 9: Join on Genomic Intervals
Tool: Operate on Genomic Intervals -> Join
The first dataset is the one we want to filter (i.e. the large dataset containing all of the promoter regions)
The second dataset is the one we use for the filter (i.e. we want to filter the promoter dataset for just those regions that contain the TAF1 peaks)
Click Execute
Inner join returns only the genomic regions that overlap in both files
Slide29
Step 9: Join on Genomic Intervals
Output
We have reduced the promoters from >54,000 to 154!
All of these promoter regions contain a TAF1 peak region.
Rename the output file to ‘Overlap’
Slide30
Step 10: Build Custom Tracks for UCSC
Tool: Graph/Display Data -> Build custom track
Click ‘Insert Track’ to open the track information.
We will add three tracks to UCSC:
1. TAF1 peaks
2. Promoter regions
3. TAF1 peaks in promoter regions
Slide31
Step 10: Build Custom Tracks for UCSC
Track 1: TAF1 peaks
Select dataset
Label the track
Describe the track
Select the colour of the track
Click ‘Insert Track’ to open another track
Slide32
Step 10: Build Custom Tracks for UCSC
Tracks 2 and 3:
Click Execute when all three tracks are filled in
Slide33
Step 10: Build Custom Tracks for UCSC
Output
This single output file contains the information to visualise three tracks on UCSC Genome Browser
Click here to visualise your three tracks on UCSC Genome Browser
Slide34
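The single output file from Step 10 can be thought of as three blocks, each a `track` line (name, description, colour) followed by the intervals. The Python sketch below follows the UCSC custom track convention of a `track` line with `name`, `description` and `color` attributes. The exact file Galaxy writes is not shown in the slides, and the helper name, coordinates and colours are hypothetical.

```python
def build_custom_track(name, description, color, intervals):
    """Return text for one UCSC custom track: a 'track' line followed by BED rows.

    `color` is an (R, G, B) tuple; `intervals` are (chrom, start, end) tuples.
    """
    r, g, b = color
    header = f'track name="{name}" description="{description}" color={r},{g},{b}'
    rows = [f"{chrom}\t{start}\t{end}" for chrom, start, end in intervals]
    return "\n".join([header] + rows)


# Hypothetical coordinates and colours, for illustration only
tracks = [
    build_custom_track("TAF1 peaks", "TAF1 enriched regions", (255, 0, 0),
                       [("chr1", 4200, 4600)]),
    build_custom_track("Promoter regions", "1000bp upstream of genes", (0, 0, 255),
                       [("chr1", 4000, 5000)]),
    build_custom_track("Overlap", "TAF1 peaks in promoter regions", (0, 128, 0),
                       [("chr1", 4000, 5000)]),
]
custom_track_file = "\n".join(tracks) + "\n"
print(custom_track_file)
```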
Visualisation on UCSC Genome Browser
The three tracks
Zoom out to see a larger genomic context
Slide35
Extract Workflow from History
Want to rerun your analysis but extract 3kb upstream?
Click the cog wheel and select ‘Extract Workflow’ from the drop down menu
Slide36
Extract Workflow from History
Create a workflow name
Lists all the tools used to create your history
Click Create workflow
Slide37
Extract Workflow from History
Click edit workflow
Or access your workflows from the top menu
Slide38
Editing Workflows
Each box is a step of the analysis
Noodles connect the steps
Click on a box and you can edit the variables of that step in the Details section on the right (in orange)
Use the blue window to move around the workflow
Slide39
Editing Workflows
This input dataset is the transcription factor dataset. Label this dataset in the details box on the right
Slide40
Editing Workflows
This input dataset is the Gene dataset. Label this dataset in the details box on the right
Slide41
Editing Workflows
1. Click on the Get Flanks tool to edit the upstream promoter region
2. Change the upstream promoter region to 3000
3. Click the cog wheel to save the workflow. Then click the cog wheel again to Run the workflow
Slide42
Running Workflows
1. Select Transcription factor file (e.g. TAF1_peaks)
2. Select Genes file (e.g. Genes)
3. Send output to a new history
4. Run workflow and go for a coffee!!
Slide43
Your new History!
Slide44
Summary
What you learned today
Getting data into Galaxy
How to review and edit datasets
Running Common Galaxy Tools
How to visualise your data in UCSC genome browser
How to extract workflows from a history
Slide45
Large Tool Repository
Slide46
Data Visualisations
UCSC Genome Browser
Clustered Heatmaps
Visualisation of ChIP-seq data
Charts
Circster – structural variation
Slide47
Galaxy Learning Resources
Slide48
Thank You
Please fill in the online survey at bioinf.gen.tcd.ie/surveys/Galaxy