The Phylogeny of a Dataset

Uploaded On 2016-03-10

Presentation Transcript

Slide1

The Phylogeny of a Dataset

Andrea K. Thomer & Nicholas M. Weber

Center for Informatics Research in Science and Scholarship

Graduate School of Library and Information Science

University of Illinois at Urbana-Champaign

Slide2

Time

Slide3

How do we understand the evolution of digital objects?

Time

Slide4

How do we understand the evolution of digital objects when they are complexly interrelated?

c/o Steve Worley, NCAR

Slide5

Evolution as a tree

From http://tolweb.org/tree/home.pages/aboutoverview.html

Slide6

tl;dr

Slide7

tl;dr

Biologists construct evolutionary trees by comparing animals’ traits and inferring how they may have evolved

Slide8

tl;dr

Biologists construct evolutionary trees by comparing animals’ traits and inferring how they may have evolved

And there’s lots of free, open-source software available for this work.

Slide9

Why not datasets? (which, like organisms, also often lack explicit documentation…)

Cornets (Tëmkin & Eldredge, 2007)

“Little Red Riding Hood” (Tehrani, 2013)

Non-biological evolution

Slide10

A phylogenetic approach helps us:

Study evolution of digital objects more rigorously

Model how digital objects are reworked into new “species”

Understand what properties of a digital object must be preserved or expressed to facilitate modeling

We ask:

In a digital object, what properties lead to evolutionary fitness?

Slide11

Slide12

Dataset of datasets: COADS, ICOADS and its derivatives

(I)COADS = (International) Comprehensive Ocean-Atmosphere Data Set

Community project bringing together 1000s of marine surface measurements from buoys, ships’ logs, and more

First release: 1987

New releases as new datasets are added; now at Release 2.5

Extensively modified & reused by others in climate science

Slide13

Towards a more rigorous view of the evolutionary process: anagenesis and phylogenesis

ICOADS documentation largely describes anagenesis (versioning)

GCMD* (Global Change Master Directory) = 1 of many potential sources of data on phylogenesis (branching)

Found 99 metadata records for versions/derivatives of ICOADS (“specimens”) through keyword search

Metadata includes scientific parameters, geographic scope, instruments used, and more

*known problems in metadata quality, but the value of GCMD is breadth rather than depth

Slide14

Workflow

Download <XML> records

Create character matrix

Create a NEXUS file

Assess the tree!

Slide15

Workflow

Download <XML> records

Create character matrix

Create a NEXUS file

Assess the tree!

Slide16
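As a rough illustration of the first workflow step (not the authors' code), the sketch below parses one XML metadata record with Python's standard library. The sample record, its element names (`Entry_ID`, `Entry_Title`, `Parameters`, `Term`, `Start_Date`), and the `parse_record` helper are all assumptions made for this example; real GCMD records carry many more fields and may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified record. The element names are
# assumptions for illustration, not a faithful copy of a GCMD record.
SAMPLE_RECORD = """
<DIF>
  <Entry_ID>EXAMPLE_COADS_01</Entry_ID>
  <Entry_Title>Example COADS-derived product</Entry_Title>
  <Parameters><Term>OCEAN TEMPERATURE</Term></Parameters>
  <Parameters><Term>OCEAN WINDS</Term></Parameters>
  <Temporal_Coverage><Start_Date>1854-01-01</Start_Date></Temporal_Coverage>
</DIF>
"""


def parse_record(xml_text):
    """Return the fields of one metadata record as a plain dictionary."""
    root = ET.fromstring(xml_text.strip())
    return {
        "id": root.findtext("Entry_ID"),
        "title": root.findtext("Entry_Title"),
        # One <Parameters> element per measured parameter in this sample.
        "parameters": sorted(p.findtext("Term") for p in root.findall("Parameters")),
        # Keep only the year; finer detail gets binned later anyway.
        "start_year": int(root.findtext("Temporal_Coverage/Start_Date")[:4]),
    }


record = parse_record(SAMPLE_RECORD)
print(record)
```

Each parsed record then plays the role of one “specimen” in the later steps.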

Identifying “characters”

In phylogenetics: characters are morphological features, DNA, and other measurable qualities

In ICOADS datasets: we treated each metadata field as a character, and each term as a character state

Slide17

Dates, times, and resolution are “binned” into categories

Parameters are split into individual categories, and the presence or absence of each is recorded in binary

Slide18
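The coding scheme from the previous two slides (one character per metadata field, binned dates, binary presence/absence for parameters) can be sketched in Python. This is an illustrative sketch, not the authors' code: the three records, the bin edges in `bin_start_year`, and the helper names are all hypothetical.

```python
# Hypothetical "specimens": three made-up metadata records.
records = [
    {"id": "dataset_A", "start_year": 1854, "parameters": {"SST", "WIND"}},
    {"id": "dataset_B", "start_year": 1950, "parameters": {"SST"}},
    {"id": "dataset_C", "start_year": 1985, "parameters": {"WIND", "HEAT FLUX"}},
]


def bin_start_year(year):
    """Bin a start year into a coarse category (bin edges are arbitrary)."""
    if year < 1900:
        return "0"
    if year < 1980:
        return "1"
    return "2"


# One binary character per parameter term seen anywhere in the collection.
all_terms = sorted({term for r in records for term in r["parameters"]})


def code_record(record):
    """Return the character-state string for one record."""
    states = [bin_start_year(record["start_year"])]
    states += ["1" if term in record["parameters"] else "0" for term in all_terms]
    return "".join(states)


# Character matrix: one row per dataset, one column per character.
matrix = {r["id"]: code_record(r) for r in records}
for name, row in matrix.items():
    print(name, row)
```

The first column is the binned start year; the remaining columns mark the presence (1) or absence (0) of each parameter term, in alphabetical order.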

https://github.com/akthom/phylomemetics

Slide19
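PAUP* reads its input from a NEXUS file, so the character matrix has to be serialized in that format before analysis. A minimal sketch of that step, assuming a matrix stored as a dictionary of equal-length state strings; the `to_nexus` helper and the example matrix are hypothetical and not taken from the authors' repository.

```python
def to_nexus(matrix, symbols="012"):
    """Serialize {taxon name: state string} as a NEXUS DATA block.

    Taxon names are written unquoted, so they must not contain spaces.
    """
    lengths = {len(row) for row in matrix.values()}
    if len(lengths) != 1:
        raise ValueError("all rows must have the same number of characters")
    nchar = lengths.pop()
    width = max(len(name) for name in matrix)
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(matrix)} NCHAR={nchar};",
        f'  FORMAT DATATYPE=STANDARD SYMBOLS="{symbols}" MISSING=? GAP=-;',
        "  MATRIX",
    ]
    # Pad names so the state strings line up in a column.
    lines += [f"    {name.ljust(width)}  {row}" for name, row in matrix.items()]
    lines += ["  ;", "END;"]
    return "\n".join(lines) + "\n"


# Hypothetical matrix: one binned character followed by three binary ones.
example_matrix = {
    "dataset_A": "0011",
    "dataset_B": "1010",
    "dataset_C": "2101",
}

nexus_text = to_nexus(example_matrix)
print(nexus_text)
```

Writing `nexus_text` to a file gives something a phylogenetics package can load; the choice of analysis settings is a separate step.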

Method:

Software: PAUP* (Phylogenetic Analysis Using Parsimony *and other methods)

Maximum Likelihood algorithm (we can talk about that more if people are interested).

Result:

Slide20

Slide21

Phylogeny of ICOADS datasets

Each fork = a “speciation event”

Each group joined at a node = a “clade”

We annotated primary clades

Slide22

Related datasets cluster; some clades show up as derived from “ancestral” forms

Clade 1 – original COADS datasets

Clade 2 – ICOADS input datasets

Clade 3 – Sea surface flux calculations

Clade 4 – later COADS data products

Clade 5 – COADS derivatives

Slide23

Slide24

Why does it matter that digital objects evolve? Or how?

Digital preservation implications

A way to understand the history and contents of a collection

Could be used to browse repositories?

Could be used to complement citation analysis?

Offers a lens into cooperative processes that create objects

A way to “read” the interplay of different scientific cultures

Slide25

Challenges and areas for future work

What existing statistical models of evolution are most appropriate for this? Or do we need to develop a new one?

How can existing software be modified for this work?

How do we show reticulating relationships?

Slide26

Future work: Phylogenies showing hybridization & ‘spontaneous generation’

Slide27

Future work: what makes a dataset “fit”?

Part of ICOADS’s success and proliferation is surely due to low levels of “competition”

But is some of it due to its open availability?

How do we test the effects of openness on a dataset’s fitness-for-purpose?

Slide28

Acknowledgements

Thanks to Julie Allen, Peter Fox and Steve Worley for feedback, and our reviewers for excellent comments.

Thanks to CIRSS and the DCERC program for funding

Slide29

References & Additional Reading

Datasets mentioned in this talk:

https://github.com/akthom/phylomemetics

Howe, C. J., & Windram, H. F. (2011). Phylomemetics--evolutionary analysis beyond the gene. PLoS Biology, 9(5), e1001069. doi:10.1371/journal.pbio.1001069

O’Brien, M. J., Darwent, J., & Lyman, R. L. (2001). Cladistics Is Useful for Reconstructing Archaeological Phylogenies: Palaeoindian Points from the Southeastern United States. Journal of Archaeological Science, 28(10), 1115–1136. doi:10.1006/jasc.2001.0681

Tehrani, J. J. (2013). The Phylogeny of Little Red Riding Hood. PLoS ONE, 8(11), e78871. doi:10.1371/journal.pone.0078871

Tëmkin, I., & Eldredge, N. (2007). Phylogenetics and Material Cultural Evolution. Current Anthropology, 48(1), 146–154.

Slide30

Homology

Slide31

Slide32

Slide33

Future work: Phylogenies showing hybridization & ‘spontaneous generation’