Slide1
The Phylogeny of a Dataset
Andrea K. Thomer & Nicholas M. Weber
Center for Informatics Research in Science and Scholarship
Graduate School of Library and Information Science
University of Illinois at Urbana-Champaign
Slide2
Time
Slide3
How do we understand the evolution of digital objects?
Time
Slide4
How do we understand the evolution of digital objects when they are complexly interrelated?
c/o Steve Worley, NCAR
Slide5
Evolution as a tree
From http://tolweb.org/tree/home.pages/aboutoverview.html
Slide6
tl;dr
Slide7
tl;dr
Biologists construct evolutionary trees by comparing animals’ traits and inferring how they may have evolved
Slide8
tl;dr
Biologists construct evolutionary trees by comparing animals’ traits and inferring how they may have evolved
And there’s lots of free, open-source software available for this work.
Slide9
Why not datasets? (which, like organisms, also often lack explicit documentation…)
Cornets (Tëmkin & Eldredge, 2007)
“Little Red Riding Hood” (Tehrani, 2013)
Non-biological evolution
Slide10
A phylogenetic approach helps us:
Study evolution of digital objects more rigorously
Model how digital objects are reworked into new “species”
Understand what properties of a digital object must be preserved or expressed to facilitate modeling
We ask:
In a digital object, what properties lead to evolutionary fitness?
Slide11
Slide12
Dataset of datasets: COADS, ICOADS and its derivatives
(I)COADS = (International) Comprehensive Ocean-Atmosphere Data Set
Community project bringing together 1000s of marine surface measurements from buoys, ships’ logs, and more
First release: 1987
New releases as new datasets are added; now at Release 2.5
Extensively modified & reused by others in climate science
Slide13
Towards a more rigorous view of the evolutionary process: anagenesis and phylogenesis
ICOADS documentation largely describes anagenesis (versioning)
GCMD* (Global Change Master Directory) = 1 of many potential sources of data on phylogenesis (branching)
Found 99 metadata records for versions/derivatives of ICOADS (“specimens”) through keyword search
Metadata includes scientific parameters, geographic scope, instruments used, and more
*Known problems in metadata quality, but the value of GCMD is breadth rather than depth
Slide14
Workflow
Download <XML> records
Create character matrix
Create a NEXUS file
Assess the tree!
Slide15
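As a rough illustration of the "Create a NEXUS file" step in the workflow, the sketch below writes an already-coded character matrix as a NEXUS DATA block, the general layout read by PAUP* and similar programs. The function name `to_nexus`, the specimen labels, and the state strings are invented for illustration; this is not the authors' script.

```python
def to_nexus(matrix, symbols="01", missing="?"):
    """Render {taxon_label: state_string} as the text of a NEXUS DATA block."""
    if not matrix:
        raise ValueError("matrix is empty")
    lengths = {len(states) for states in matrix.values()}
    if len(lengths) != 1:
        raise ValueError("every taxon needs the same number of characters")
    nchar = lengths.pop()
    allowed = set(symbols) | {missing}
    for label, states in matrix.items():
        if not set(states) <= allowed:
            raise ValueError(f"unexpected state symbol in {label}")
    width = max(len(label) for label in matrix)
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(matrix)} NCHAR={nchar};",
        f'  FORMAT DATATYPE=STANDARD MISSING={missing} SYMBOLS="{symbols}";',
        "  MATRIX",
    ]
    # One row per specimen: padded label, then its character states.
    for label, states in matrix.items():
        lines.append(f"    {label.ljust(width)}  {states}")
    lines += ["  ;", "END;"]
    return "\n".join(lines) + "\n"


# Hypothetical specimens and codings, for illustration only.
example = {"COADS_R1": "1100", "ICOADS_R2_5": "1111", "Flux_product": "01?1"}
print(to_nexus(example))
```

Labels are kept free of spaces and punctuation here so they need no quoting in the NEXUS file.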
Slide16
Identifying “characters”
In phylogenetics: characters are morphological features, DNA, other measurable qualities
In ICOADS datasets: we treated each metadata field as a character, and each term as a character state
Slide17
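One way to picture this coding, as a minimal sketch: each metadata field becomes a column (a character), and each distinct term found in that field is assigned a state symbol. The field names (`platform`, `region`), the specimen names, and the records are invented for illustration and are not taken from GCMD.

```python
def code_characters(records, missing="?"):
    """Turn {specimen: {field: term}} into {specimen: state string}.

    Returns (fields, state_maps, coded). Within each field, terms get the
    symbols 0, 1, 2, ... in sorted order; absent fields are coded as missing.
    """
    symbols = "0123456789"
    fields = sorted({f for rec in records.values() for f in rec})
    state_maps = {}
    for field in fields:
        terms = sorted({rec[field] for rec in records.values() if field in rec})
        if len(terms) > len(symbols):
            raise ValueError(f"too many states for field {field!r}")
        state_maps[field] = {term: symbols[i] for i, term in enumerate(terms)}
    coded = {
        name: "".join(
            state_maps[f][rec[f]] if f in rec else missing for f in fields
        )
        for name, rec in records.items()
    }
    return fields, state_maps, coded


# Hypothetical metadata records.
records = {
    "specimen_A": {"platform": "ship", "region": "global"},
    "specimen_B": {"platform": "buoy", "region": "global"},
    "specimen_C": {"platform": "ship"},
}
fields, state_maps, coded = code_characters(records)
print(fields)
print(coded)
```

Sorting the fields and terms keeps the coding reproducible from run to run, which matters when the matrix is regenerated after new records are added.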
Dates, times, resolution are “binned” into categories
Parameters are split into individual categories, and presence/absence is noted in binary
Slide18
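A minimal sketch of both codings follows. The bin edges, the parameter vocabulary, and the example values are placeholders chosen for illustration; the slides do not state which cut-offs or term lists were actually used.

```python
from bisect import bisect_right


def bin_value(value, edges):
    """Return the index of the bin that holds value.

    edges must be ascending. Bin 0 is everything below edges[0]; bin i
    covers edges[i-1] <= value < edges[i]; the last bin is open-ended.
    """
    return bisect_right(edges, value)


def presence_absence(parameters, vocabulary):
    """Binary string: '1' where the vocabulary term is present, else '0'."""
    present = set(parameters)
    return "".join("1" if term in present else "0" for term in vocabulary)


# Assumed cut points for a dataset's start year.
start_year_edges = [1900, 1950, 2000]
# Assumed controlled vocabulary of measured parameters.
vocabulary = ["air temperature", "sea surface temperature", "wind speed"]

print(bin_value(1987, start_year_edges))
print(presence_absence(["wind speed", "air temperature"], vocabulary))
```

Binning turns a continuous value into a small set of discrete states, and the presence/absence string contributes one binary character per vocabulary term.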
https://github.com/akthom/phylomemetics
Slide19
Method: *
Software: PAUP* (Phylogenetic Analysis Using Parsimony *and other methods)
Maximum Likelihood algorithm (we can talk about that more if people are interested).
Result:
Slide20
Slide21
Phylogeny of ICOADS datasets
Each fork = a “speciation event”
Each group joined at a node = a “clade”
We annotated the primary clades
Slide22
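To make "fork" and "clade" concrete, the sketch below reads a small tree written in Newick notation and lists the clade (the set of tips) under each internal node. The parser is deliberately simplified: it assumes plain labels with no branch lengths or quoting, and the tree is a toy example rather than the ICOADS result.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested tuples (forks) and str (tips)."""
    text = "".join(text.split()).rstrip(";")  # drop whitespace and final ';'
    pos = 0

    def peek():
        return text[pos] if pos < len(text) else ""

    def node():
        nonlocal pos
        if peek() == "(":
            pos += 1  # consume "("
            children = [node()]
            while peek() == ",":
                pos += 1
                children.append(node())
            if peek() != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ")"
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        if start == pos:
            raise ValueError(f"missing tip label at position {pos}")
        return text[start:pos]

    tree = node()
    if pos != len(text):
        raise ValueError("trailing characters after tree")
    return tree


def clades(tree):
    """Return one frozenset of tip labels per internal node (fork)."""
    found = []

    def tips(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(tips(child) for child in node))
        found.append(members)
        return members

    tips(tree)
    return found


toy_tree = parse_newick("((A,B),(C,(D,E)));")
for clade in clades(toy_tree):
    print(sorted(clade))
```

Each printed list corresponds to one node of the tree; the last one is the root, which contains every tip.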
Related datasets cluster; some clades show up as derived from “ancestral” forms
Clade 1 – original COADS datasets
Clade 2 – ICOADS input datasets
Clade 3 – Sea surface flux calculations
Clade 4 – later COADS data products
Clade 5 – COADS derivatives
Slide23
Slide24
Why does it matter that digital objects evolve? Or how?
Digital preservation implications
A way to understand the history and contents of a collection
Could be used to browse repositories?
Could be used to complement citation analysis?
Offers a lens into cooperative processes that create objects
A way to “read” interplay of different scientific cultures
Slide25
Challenges and areas for future work
What existing statistical models of evolution are most appropriate for this? Or do we need to develop a new one?
How can existing software be modified for this work?
How do we show reticulating relationships?
Slide26
Future work: Phylogenies showing hybridization & ‘spontaneous generation’
Slide27
Future work: what makes a dataset “fit”?
Part of ICOADS’s success and proliferation is surely due to low levels of “competition”
But is some of it due to its open availability?
How do we test the effects of openness on a dataset’s fitness-for-purpose?
Slide28
Acknowledgements
Thanks to Julie Allen, Peter Fox and Steve Worley for feedback, and our reviewers for excellent comments.
Thanks to CIRSS and the DCERC program for funding
Slide29
References & Additional Reading
Datasets mentioned in this talk:
https://github.com/akthom/phylomemetics
Howe, C. J., & Windram, H. F. (2011). Phylomemetics--evolutionary analysis beyond the gene. PLoS Biology, 9(5), e1001069. doi:10.1371/journal.pbio.1001069
O’Brien, M. J., Darwent, J., & Lyman, R. L. (2001). Cladistics Is Useful for Reconstructing Archaeological Phylogenies: Palaeoindian Points from the Southeastern United States. Journal of Archaeological Science, 28(10), 1115–1136. doi:10.1006/jasc.2001.0681
Tehrani, J. J. (2013). The Phylogeny of Little Red Riding Hood. PLoS ONE, 8(11), e78871. doi:10.1371/journal.pone.0078871
Tëmkin, I., & Eldredge, N. (2007). Phylogenetics and Material Cultural Evolution. Current Anthropology, 48(1), 146–154.
Slide30
Homology
Slide31
Slide32
Slide33