Slide1
Data first vs Hypothesis first
Alan Ward
Slide2
Hypothesis driven approach
Look at the data we have
Formulate a hypothesis about ...
Do experiments to test the hypothesis
As a byproduct, collect more data
Weinberg R (2010) Point: Hypotheses first. Nature 464, 678
Slide3
Data driven approach
Identify a system of interest
Identify an approach to measure/describe attributes of the system
Collect and organise the data
Golub T (2010) Counterpoint: Data first. Nature 464, 679
Slide4
“Reports that say that something hasn't happened are always interesting to me, because as we know, there are known knowns; there are things we know that we know. There are known unknowns; that is to say, there are things that we now know we don't know. But there are also unknown unknowns – there are things we do not know we don't know.”
United States Secretary of Defense, Donald Rumsfeld
Slide5
The Black Swan: The Impact of the Highly Improbable. Nassim Taleb
Slide6
known
Hypothesis driven research
unknown
Enzyme activity
Feedback inhibition
Allosteric regulation
Transcriptional regulation - Inducers and repressors
Non-coding short RNAs
Slide7
Breadth first vs Depth first
A slice up and down
A slice across
Slide8
Observation has always been part of biology, as in the imatinib example (Golub, 2010), but DNA sequencing technology has revolutionized observational data collection. Weinberg (2010) is arguing that ‘cheap sequencing’ on a massive scale means too much funding goes to data collection.
He doesn’t argue it, but you might also spend all your time managing the data [1].
[1] Marx V (2013) Biology: The big challenges of big data. Nature 498, 255–260
Slide9
Depth first or breadth first
Two different strategies for computer search algorithms
Which is best?
That depends heavily on the structure of the search tree and on the number and location of solutions. If you know a solution is not far from the root of the tree, a breadth first search (BFS) might be better. If the tree is very deep and solutions are rare, depth first search (DFS) might rootle around forever, whereas BFS could be faster.
If the tree is very wide, a BFS might need too much memory, so it might be completely impractical. If solutions are frequent but located deep in the tree, BFS could also be impractical.
If the search tree is very deep, you will need to restrict the search depth for depth first search (DFS) anyway.
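The trade-off between the two strategies can be sketched in a few lines of Python. This is only an illustration: the toy tree, the node names and the `max_depth` parameter are assumptions made for the example, not part of the original slides.

```python
from collections import deque

# Hypothetical toy search tree: each node maps to a list of its children.
TREE = {
    "root": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "a1": [],
    "a2": [],
    "b1": ["goal"],
    "goal": [],
}


def bfs(tree, start, is_goal):
    """Breadth first: explore level by level ("a slice across")."""
    queue = deque([[start]])  # queue of paths; grows with the tree's width
    while queue:
        path = queue.popleft()
        node = path[-1]
        if is_goal(node):
            return path
        for child in tree[node]:
            queue.append(path + [child])
    return None


def dfs(tree, start, is_goal, max_depth):
    """Depth first with a depth limit ("a slice up and down")."""
    stack = [[start]]  # stack of paths; the newest path is explored first
    while stack:
        path = stack.pop()
        node = path[-1]
        if is_goal(node):
            return path
        if len(path) - 1 < max_depth:  # do not descend below the limit
            for child in reversed(tree[node]):
                stack.append(path + [child])
    return None


found = lambda node: node == "goal"
print(bfs(TREE, "root", found))
print(dfs(TREE, "root", found, max_depth=3))
print(dfs(TREE, "root", found, max_depth=2))  # limit too shallow to reach the goal
```

BFS keeps every path at the current level in its queue, which is why a very wide tree is costly in memory. The depth limit in `dfs` is the restriction on search depth mentioned above.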
Slide10
EST database
dbEST release 130101
Summary by Organism - 01 January 2013
Number of public entries: 74,186,692

Homo sapiens (human)                                 8,704,790
Mus musculus + domesticus (mouse)                    4,853,570
Zea mays (maize)                                     2,019,137
Sus scrofa (pig)                                     1,669,337
Bos taurus (cattle)                                  1,559,495
Arabidopsis thaliana (thale cress)                   1,529,700
Danio rerio (zebrafish)                              1,488,275
Glycine max (soybean)                                1,461,722
Triticum aestivum (wheat)                            1,286,372
Xenopus (Silurana) tropicalis (western clawed frog)  1,271,480
Oryza sativa (rice)                                  1,253,557
Ciona intestinalis                                   1,205,674
Rattus norvegicus + sp. (rat)                        1,162,136
Drosophila melanogaster (fruit fly)                  821,005
…..
Salmonella enterica subsp. enterica serovar Typhi    217
Mycobacterium smegmatis str. MC2 155                 30
Mycobacterium tuberculosis                           30
Slide11
dbEST references
Boguski MS, Lowe TMJ, Tolstoshev CM (1993) dbEST - Database For Expressed Sequence Tags. Nature Genetics 4, 332-333
Boguski MSS (1994) Gene discovery in dbEST. Science 265, 1993-4
Boguski MSS (1995) The turning point in genome research. Trends in Biochemical Sciences 20, 295-6
Nagaraj S (2007) A hitchhiker's guide to expressed sequence tag (EST) analysis. Briefings in Bioinformatics 8, 6-21
Slide12
Why DNA?
An example: Species and strain identification in prokaryotes
DNA:DNA similarity
MLEE (MultiLocus Enzyme Electrophoresis)
MLST (MultiLocus Sequence Typing)
ANI (Average Nucleotide Identity)
Slide13
Defining species
The modern concept of species dates back to:
Mayr E (1942) Systematics and the Origin of Species (Columbia Univ. Press, New York)
Biological species concept: Species are groups of actually or potentially interbreeding natural populations, which are reproductively isolated from other such groups
de Queiroz K (2005) Ernst Mayr and the modern concept of species. Proc Natl Acad Sci U S A 102 Suppl 1: 6600-7
Slide14
Bacterial species
Bacteria do not interbreed in the same way, so defining species in bacteria remained an exercise in clustering organisms with similar, initially phenotypic, characters
Stanier RY (1953) Adaptation, evolutionary and physiological: Or Darwinism among the microorganisms. In: Davies R, Gale EF, editors. Adaptation in Microorganisms, Third Symposium of the Society for General Microbiology. Cambridge: Cambridge University Press
Goldner M (2007) The genius of Roger Stanier. Can J Infect Dis Med Microbiol 18, 193–194
Slide15
DNA:DNA similarity
From the 1960s there was a consensus that all taxonomic information about a bacterium is incorporated in the complete nucleotide sequence of its genome
Wayne et al. (1987) correlated measurements of the DNA similarity of two strains with the species as then defined and concluded that:
A DNA:DNA similarity of 70% or greater and a ΔTm of 5°C or less (both are important) mark the boundary of a group of strains which belong to the same species
Wayne LG, Brenner DJ, Colwell RR, Grimont PAD, Kandler O, Krichevsky MI, Moore LH, Moore WEC, Murray RGE & other authors (1987) Report of the ad hoc committee on reconciliation of approaches to bacterial systematics. Int J Syst Bacteriol 37, 463–464
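As a minimal sketch, the two-part criterion can be written as a single check. It assumes the Wayne et al. (1987) recommendation is read as roughly 70% or greater DNA:DNA relatedness together with a ΔTm of 5°C or less. The function name and parameter names are hypothetical, chosen only for this illustration.

```python
def same_species(ddh_percent, delta_tm_c, min_ddh=70.0, max_delta_tm=5.0):
    """Return True when both criteria are met (both values must be considered).

    ddh_percent: DNA:DNA similarity in percent.
    delta_tm_c:  ΔTm in °C between homologous and hybrid DNA.
    The default thresholds are the assumed reading of Wayne et al. (1987).
    """
    return ddh_percent >= min_ddh and delta_tm_c <= max_delta_tm


print(same_species(82.0, 2.1))  # high similarity, small ΔTm
print(same_species(82.0, 7.5))  # fails on ΔTm alone
print(same_species(55.0, 2.1))  # fails on similarity alone
```

Keeping the thresholds as parameters reflects that they are conventions rather than sharp biological boundaries.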
Slide16
DNA-DNA similarity
Measuring DNA similarity by hybridisation is not the same as measuring DNA sequence similarity, and it is done using a number of different techniques
% Similarity
De Ley – rate of renaturation
Ezaki – microplate binding
ΔTm
DNA melting
Elution from hydroxyapatite
The methods are not robust and few labs can do them:
Stackebrandt et al. (2002) Report of the Ad Hoc Committee for the re-evaluation of the species definition in bacteriology. Intl J Systematic Evol Microbiol 52, 1043-1047
Slide17
Melting Temperature analysis
Slide18
DNA Melting
Slide19
Using RT-PCR and SYBR Green for DNA melt curve analysis
Gonzalez JM & Saiz-Jimenez C (2005) A simple fluorimetric method for the estimation of DNA–DNA relatedness between closely related microorganisms by thermal denaturation temperatures. Extremophiles 9, 75–79
Slide20
ΔTm determination
Exactly the same melting program, but this time the DNA from Organism 1 and Organism 2 has been mixed, denatured and then renatured at the optimum temperature for renaturation, T_OR, calculated from the %GC (T_OR = 0.51(%GC) + 47.0), before adding SYBR Green and melting
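The renaturation temperature formula is simple enough to sketch as code. The example below assumes temperatures are in °C and that ΔTm is taken as the melting temperature of the homologous DNA minus that of the mixed (hybrid) DNA. The function names and the example values are hypothetical, not measurements from the slides.

```python
def optimal_renaturation_temp(gc_percent):
    """T_OR = 0.51 * (%GC) + 47.0, the formula quoted on the slide (°C assumed)."""
    return 0.51 * gc_percent + 47.0


def delta_tm(tm_homologous, tm_hybrid):
    """ΔTm: how much lower the hybrid DNA melts than the homologous DNA."""
    return tm_homologous - tm_hybrid


# Illustrative values only.
print(round(optimal_renaturation_temp(50.0), 2))  # renaturation temperature for 50% GC
print(round(delta_tm(85.0, 81.5), 2))             # ΔTm for two made-up melting temperatures
```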
Slide21
Disadvantages of DNA-DNA similarity
Because DNA:DNA hybridisation compares the whole genome it has remained the “Gold standard” for species delineation, but it has several disadvantages:
It requires large amounts of high quality DNA
The methods are difficult to do
Different methods can give different results
Reciprocal measurements can be very different (amount of A binding to B is different from amount of B binding to A)
The experimental measurement has to be made between 2 strains, so obtaining DNA-DNA similarity for 5 strains requires 10 pairwise determinations (20 if reciprocal measurements are included), and if a 6th strain needs to be compared another 5 experiments (10 with reciprocals) are needed
You can’t build an incremental database
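How the experimental workload grows with the number of strains can be checked with a short sketch. The strain labels are hypothetical. `reciprocal=True` counts A-against-B and B-against-A as separate experiments, since reciprocal measurements can differ.

```python
from itertools import combinations, permutations


def pairwise_experiments(strains, reciprocal=False):
    """List every strain pair that needs its own hybridisation experiment."""
    pairs = permutations(strains, 2) if reciprocal else combinations(strains, 2)
    return list(pairs)


five = ["S1", "S2", "S3", "S4", "S5"]  # hypothetical strain labels
six = five + ["S6"]

print(len(pairwise_experiments(five)))                   # unordered pairs for 5 strains
print(len(pairwise_experiments(five, reciprocal=True)))  # ordered pairs for 5 strains
print(len(pairwise_experiments(six)) - len(pairwise_experiments(five)))  # extra for a 6th
```

Every new strain has to be hybridised against all existing strains, so the work grows quadratically. Nothing measured earlier can stand in for the new comparisons.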
Slide22
Disadvantages of DNA-DNA similarity
Slide23
Multilocus Enzyme Electrophoresis
MLEE
Selander RK, Caugant DA, Ochman H, Musser JM, Gilmour MN and Whittam TS (1986) Methods of multilocus enzyme electrophoresis for bacterial population genetics and systematics. Appl. Environ. Microbiol. 51, 873-884
Slide24
Multilocus sequence typing
MLST
Maiden MCJ, Bygraves JA, Feil E, Morelli G, Russell JE, Urwin R, Zhang Q, Zhou J, Zurth K, Caugant DA, Feavers IM, Achtman M and Spratt BG (1998) Multilocus sequence typing: A portable approach to the identification of clones within populations of pathogenic microorganisms. Proc. Natl. Acad. Sci. USA 95, 3140–3145
Staphylococcus aureus
Slide25
Multilocus sequence typing
MLST
Portable
Unambiguous
Reproducible
Cumulative
Scalable
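Why sequence-based typing is cumulative can be illustrated with a toy database. In MLST each distinct sequence at a locus is given an allele number, and each distinct combination of allele numbers (the allelic profile) is given a sequence type. The class below is a minimal sketch of that bookkeeping only. The class name, locus names and sequences are hypothetical, and real MLST schemes rely on curated databases with fixed sets of housekeeping-gene fragments.

```python
class MlstDatabase:
    """Toy cumulative typing database (illustrative, not a real MLST scheme)."""

    def __init__(self, loci):
        self.loci = list(loci)
        # For each locus: sequence -> allele number, in order of discovery.
        self.alleles = {locus: {} for locus in self.loci}
        # Allelic profile (tuple of allele numbers) -> sequence type.
        self.profiles = {}

    def type_isolate(self, sequences):
        """Return (sequence type, allelic profile), adding new entries as needed."""
        profile = []
        for locus in self.loci:
            known = self.alleles[locus]
            seq = sequences[locus].upper()
            if seq not in known:
                known[seq] = len(known) + 1  # next unused allele number
            profile.append(known[seq])
        profile = tuple(profile)
        if profile not in self.profiles:
            self.profiles[profile] = len(self.profiles) + 1  # next sequence type
        return self.profiles[profile], profile


db = MlstDatabase(["locusA", "locusB"])  # hypothetical loci
print(db.type_isolate({"locusA": "ACGT", "locusB": "GGCC"}))
print(db.type_isolate({"locusA": "ACGA", "locusB": "GGCC"}))
print(db.type_isolate({"locusA": "ACGT", "locusB": "GGCC"}))  # matches the first isolate
```

A new isolate is compared against the stored sequences rather than against every other strain at the bench, so the database grows incrementally.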
Slide26
The traditional method of data reduction is publication: results are summarized in peer-reviewed journals. Publications include only the most important results, from experiments that may have been performed over many years.
The published paper is a concise compilation of the data, an interpretation of the results, and a comparison with results obtained by others.
A significant fraction of experiments from academic laboratories cannot be repeated in industry [1], reflecting inadequate description of experiments performed on different equipment and on biological samples that were produced with disparate methods.
[1] Begley CG & Ellis LM (2012) Drug development: Raise standards for preclinical cancer research. Nature 483, 531–3
Slide27
In 1991 the GenBank On-line Service utilized a Solbourne 5/800 running OS/MP 4.0C. The database work was done on a Sun network 4/490 server and workstations running SunOS UNIX version 4.1. The GenBank database was maintained on the Sybase relational database management system (RDBMS). Software was developed in the C language.
In the 1990s NCBI scanned the literature for sequences and manually typed them into the database.
Slide28
Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J and Sayers EW (2013) GenBank. Nucleic Acids Research 41, D36–D42