
BiographyNed - PowerPoint Presentation

Uploaded On 2017-11-30



Presentation Transcript

Slide1

BiographyNed

eScience Center

21 March 2013

Slide2
Slide3

Why a good case for eScience?

Involves big data with high complexity

Rich meta data joining diverse textual sources and selections of data

Incomplete and noisy

Potential to investigate difficult questions, e.g.: How did the current Dutch elite develop from the colonial past?

Biographies may represent different views and realities and thus answers to questions: hero or villain

2.8 textual sources per person

Slide4

What will we do?

Develop generic text mining technology that converts textual data to structured data

Taking into account nature of historical text

Enrich and externally link data repository of Dutch biographies

Develop visualizations and interactions on the data set to support historical research

Develop a range of cases that demonstrate the possibilities and impossibilities of the data set and technology

Slide5

Patterns in data

Value

Interpretation

Line composition in paintings

Twitter patterns during elections

Cubism

Democratic participation

Nature of eHumanities

Slide6

Patterns in data

Value

Interpretation

Narratives

Cases: persons/objects/events

Line composition in paintings

Twitter patterns during elections

Cubism

Democratic participation

19th-century Japanese prints

Biographical descriptions of Prince Bernhard

The rise of the Japanese middle class

German nobles in the Interbellum

Slide7

Methodological Issues

How telling is the output of our tools?

Selection made by (editors of) dictionaries

Reliability of automated text analysis

Introduction of biases in the methodology

Careful evaluation and detailed communication are required…

Slide8

Statistics on available information

Slide9

Textual Information per person

Information | Numbers
Average XML-files per individual | 2.79
Texts | 78.75%
Words (total/person) | 288.83
Words (longest text/person) | 229.04
Words (total/text) | 366.76
Words (longest text)/texts | 290.83

Slide10

Availability of Information in the portal

Slide11

Presence of information for governors of Dutch Indies (% on 71 individuals)

Slide12

Biography Portal of the Netherlands. The Sources

Katwijkseweg 33

Slide13

The Historical Perspective

History and Biography

Where do eScience and History meet?

Use Cases

Slide14

Historical Research

The Art and Science of History: Drawing up a narrative from primary and secondary sources which approximates historical reality as well as possible.

Slide15

Building Blocks and Concrete

Building blocks: facts derived mainly from archival findings and existing literature

Concrete: the methods historians use to put them together into a narrative/synthesis.

The Narrative: a historical synthesis which cannot be scientifically proven (only made likely), based on facts which can be proven or falsified. There is necessarily a creative element in drawing up a narrative

Slide16

Example: Grand Pensionary Johan de Witt (1625-1672)

Building blocks: born in 1625; son of Jacob and Anna van den Corput; appointed grand pensionary in 1653; murdered in The Hague in 1672; enemy of William (III) of Orange; William of Orange rewarded one of the instigators of the murder

Concrete: (logic) Based on these last data it is likely that William ordered the death of Johan

Narrative: William probably ordered the death of Johan <= proposition based on facts and reasoning

Slide17

The House of History

Slide18

The Importance of Provenance

The only way to falsify presented historical facts is by going back to the original source(s) and looking at those sources critically.

It is highly important to be able to know exactly what information comes from where.

Slide19

Our Sources Here

The Metadata: building blocks

The entries in biographical dictionaries themselves: short historical narratives

Slide20

Status of Biography in Academia and Society

Despite improved efforts this century to embed biography in academic theories and methods, some (e.g. some social historians) still do not consider it a worthy academic discipline, regarding it as too anecdotal and limited.

Biography is the most popular non-fiction genre in bookstores (from both academic and lay authors)

Slide21

Where do eScience and History meet? (I)

“And when the capsule biography of an individual is combined with 50,000 others, many of them relatively obscure, […] and when they are all powerfully searchable online, the social historian’s grumbles about biography’s limitations as an approach to historical study dissolves into nothingness.”

(Brian Harrison, 2004, former editor of the Oxford Dictionary of National Biography)

Slide22

Where do eScience and History meet? (II)

A. Quantitative analyses of a larger group of people (prosopography). Surpassing the anecdotal.

B. Finding relations/networks between people which are otherwise hard to detect

Slide23

Where do eScience and History meet? (III)

C. Insight into historiography and historical selectivity. Who was described/included and why?

“Undoubtedly I have deprived many interesting women by not including them. The only thing I can say to defend myself is this: history writing is also a process of ruthless selection.” (Els Kloek, Head Biography portal and main author 1001 vrouwen)

D. Thematic research. E.g.: When did the discovery of America start to influence people’s lives?

Slide24

BiographyNed Use Cases

In the initial stages of the research a list of possible historical questions within one of those four themes was drawn up (subject to change), which the demonstrator should be able to answer, or for which it should at least point in a direction/trend.

Slide25

Case I: Making life easier: Group portrait of the Governors-General

Highest Official in the Dutch Indies 1610-1949

71 men (still a relatively small group)

What can we say about these men as a group?

Who was appointed and what qualities did he have to have? Etc. …

Slide26

Case I: data mining

Family connections (parents/wife/children, other relevant connections <= patronage)

Place of Birth

Education

Religion

Career (patterns)

Age at appointment

Duration of holding the office

Reason for leaving the office

Place of Death

Slide27
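Several fields in the "Case I: data mining" list, such as age at appointment and duration of holding the office, reduce to simple arithmetic once the dates are available as structured data. The sketch below reuses the Johan de Witt dates from the earlier example (born 1625, appointed 1653, in office until 1672) purely as stand-in values; he was a grand pensionary, not a Governor-General, and the `OfficeHolder` record with its field names is a hypothetical layout, not the project's schema.

```python
from dataclasses import dataclass


@dataclass
class OfficeHolder:
    # Hypothetical record layout, for illustration only.
    name: str
    birth_year: int
    appointed_year: int
    left_office_year: int

    @property
    def age_at_appointment(self) -> int:
        # Year-level precision only; exact dates would be needed for exact ages.
        return self.appointed_year - self.birth_year

    @property
    def years_in_office(self) -> int:
        return self.left_office_year - self.appointed_year


de_witt = OfficeHolder("Johan de Witt", 1625, 1653, 1672)
print(de_witt.age_at_appointment, de_witt.years_in_office)
```

With one such record per Governor-General, group-level questions (typical age at appointment, typical term length) become a matter of aggregating these derived values.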

Case I: Time and Effort

More than 1 full week to manually mine this information from the Biography Portal.

Can a historian do this with (almost) the same results in under one hour if helped by the demonstrator?

Slide28

Case II: Making things possible: The Dutch Nation & Identity

Who were selected to be included in National Biographical Dictionaries and why? (what was their claim to fame?)

Are there different perspectives on the same person over time and how can this be explained?

Who was deemed most important? (based on the length of the entries)

What time periods are most represented?

Is there a difference in claim to fame for people from different periods in history, or between men and women?

Which words are used most often and can we link them to national identities?

Slide29

Case II: More Questions …

What events are mentioned most often and what does that say about the status quaestionis of how the Dutch see/saw themselves?

What are the differences in the answers to these questions between several national biographical dictionaries?

Are people and events described or appreciated differently over time? Does the perspective change?

How does this relate to biographical dictionaries, nations and identities elsewhere in Europe?

Slide30

Conversion to Linked Data

Slide31

Online machine-readable data with links

Simple facts called ‘RDF Triples’

Thorbecke > hasBirthPlace > Zwolle
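As a minimal sketch of the triple idea, the fact above can be held as a (subject, predicate, object) tuple and looked up by pattern matching. This is plain Python rather than a real RDF store or SPARQL engine; the `match` helper and the `hasBirthYear` predicate are illustrative assumptions, with the year 1798 taken from the Thorbecke example elsewhere in this talk.

```python
# Each fact is one (subject, predicate, object) triple.
triples = {
    ("Thorbecke", "hasBirthPlace", "Zwolle"),
    ("Thorbecke", "hasBirthYear", "1798"),  # hypothetical predicate name
}


def match(store, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


birth_places = [o for _, _, o in match(triples, s="Thorbecke", p="hasBirthPlace")]
print(birth_places)
```

Leaving a position as a wildcard plays roughly the role a variable plays in a query language: asking for all triples with subject Thorbecke returns every recorded fact about him.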

Some technology concepts:

Schemas: To structure LD

RDF Stores: To store LD

SPARQL: To access LD

Huge growth in the past years:

More than 300 data sources

More than 30 billion triples

A crash course on Linked Data

Slide32

Purely syntactic conversion

Preserve the original structure of the data

Prevent loss of information

Allow for reinterpretation of the original data in the future
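A purely syntactic conversion can be pictured as a walk over the XML tree that turns every element into triples without interpreting it. The sketch below uses Python's standard `xml.etree.ElementTree` and is not the ClioPatria tooling; the XML snippet, its element names and the `_:n…` node identifiers are invented for illustration and do not reflect the Biography Portal's actual format.

```python
import xml.etree.ElementTree as ET
from itertools import count

# Hypothetical input; the real Biography Portal XML looks different.
XML = """
<biography source="NNBW">
  <person><name>Thorbecke</name></person>
  <event type="birth"><place>Zwolle</place></event>
</biography>
"""


def xml_to_triples(root):
    """Mirror the XML structure as triples without interpreting tag names."""
    ids = count(1)
    triples = []

    def visit(elem, node_id):
        for attr, value in elem.attrib.items():
            triples.append((node_id, attr, value))
        for child in elem:
            text = (child.text or "").strip()
            if len(child) == 0 and not child.attrib:
                # Leaf element: keep its text as a literal value.
                triples.append((node_id, child.tag, text))
            else:
                # Nested element: give it its own node and recurse.
                child_id = f"_:n{next(ids)}"
                triples.append((node_id, child.tag, child_id))
                visit(child, child_id)

    visit(root, "_:n0")
    return triples


crude_rdf = xml_to_triples(ET.fromstring(XML))
for triple in crude_rdf:
    print(triple)
```

Because tag and attribute names are carried over unchanged, the original structure can still be read back from the triples, which is what leaves room for reinterpretation later.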

The conversion process

Data Preservation

Slide33

Conversion steps:

Retrieval of XML dump of the Biography Portal

Initial conversion to ‘crude’ RDF

Using ClioPatria and the XMLRDF tool for ClioPatria

RDF restructuring

Linking to other sources

Essential step in the ‘Linked Data’ philosophy

The conversion process

Slide34

Data schema:

Based on the structure of the original XML files

Needs to facilitate the coupling of different biographies of the same person, without compromising the original data

Needs to facilitate the incorporation of several enrichments, following from NLP, Entity Reconciliation, etc.

Compatible with existing schemas such as the Europeana Data Model, PROV, RDAgr2, FOAF, DC terms

The conversion process

Slide35

[Diagram: BiographyNed schema, illustrated with Thorbecke]

Source text: “Johan Rudolph Thorbecke werd in 1798 geboren op 14 januari in Zwolle en komt uit een half-Duitse…” (“Johan Rudolph Thorbecke was born in 1798 on 14 January in Zwolle and comes from a half-German…”)

Biographical Description, Provenance Meta Data: NNBW. Person Meta Data: “Thorbecke”. BiographyParts. Event: Birth, 1798

Biographical Description, Enrichment: NLP Tool. Person Meta Data. Event: Birth, Zwolle, 1798-01-14

Slide36
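The enrichment in the Thorbecke example, going from the Dutch sentence to a birth event with place Zwolle and date 1798-01-14, can be imitated for this one sentence pattern with a regular expression. This is only a toy sketch: the pattern, the `extract_birth` function and the month table are assumptions made here, and the project's NLP tools are not described at this level of detail.

```python
import re

MONTHS = {"januari": 1, "februari": 2, "maart": 3, "april": 4, "mei": 5,
          "juni": 6, "juli": 7, "augustus": 8, "september": 9,
          "oktober": 10, "november": 11, "december": 12}

# Matches only the word order of the example:
# "werd in YEAR geboren op DAY MONTH in PLACE".
BIRTH = re.compile(
    r"werd in (?P<year>\d{4}) geboren op (?P<day>\d{1,2}) "
    r"(?P<month>[a-z]+) in (?P<place>[A-Z]\w+)"
)


def extract_birth(sentence):
    """Return (ISO date, place) for the example pattern, or None if it does not match."""
    m = BIRTH.search(sentence)
    if m is None or m["month"] not in MONTHS:
        return None
    date = f"{int(m['year']):04d}-{MONTHS[m['month']]:02d}-{int(m['day']):02d}"
    return date, m["place"]


text = ("Johan Rudolph Thorbecke werd in 1798 geboren op 14 januari "
        "in Zwolle en komt uit een half-Duitse…")
print(extract_birth(text))
```

A single hand-written pattern like this breaks as soon as the wording varies, which is one reason the talk turns to learned extraction and NLP tools.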

Retrieving Information from Text

Slide37

The texts in the Biography Portal

Collection of biographical dictionaries

Dutch, including from the 19th and early 20th century and even older quotes

Sources (different dictionaries/collections) have their own style

Metadata available (though large differences in completeness)

Slide38

Challenges and Advantages

Challenges:

Little work on NLP and biographies

Performance of Dutch NLP tools on variations of Dutch

Advantages:

High quality metadata covering several categories of information (supervised machine learning)

Within sources, clear and similar structure of texts

Slide39

General Approach

Start by using advantages:

Use metadata to label information

A basic IR system can be built using sentence number and lemmas as features

Enhance performance with NLP tools

Build upon information retrieved in the first steps to tackle more challenging tasks

Slide40

A Basic System

Supervised Machine Learning

Two-step identification process (Wu and Weld 2007; 2010, Fader et al. 2011)

Identify sentence that contains information

Sequence tagging to identify information within the sentence

Slide41
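The first step of the basic system, finding the sentence that carries a piece of information, can be set up as supervised learning in which the metadata supplies the labels. The sketch below only builds the training examples: a sentence is labeled positive when it contains the metadata value, and its features are the sentence number plus lowercased tokens standing in for lemmas. The function name, the naive sentence splitter and the sample text are assumptions for illustration; no real lemmatizer, classifier or sequence tagger is involved.

```python
import re


def training_examples(text, metadata_value):
    """Label each sentence by whether it mentions the metadata value."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    examples = []
    for number, sentence in enumerate(sentences):
        tokens = re.findall(r"\w+", sentence.lower())
        features = {"sentence_number": number}
        # Lowercased tokens are a crude stand-in for lemmas.
        features.update({f"lemma={tok}": 1 for tok in tokens})
        label = metadata_value.lower() in tokens
        examples.append((features, label))
    return examples


# Invented two-sentence biography; the birth place comes from the metadata.
bio = "Thorbecke werd geboren in Zwolle. Hij studeerde in Leiden."
examples = training_examples(bio, metadata_value="Zwolle")
for features, label in examples:
    print(features["sentence_number"], label)
```

Examples of this shape could be fed to any standard classifier; the second step, tagging the value inside the selected sentence, would need token-level labels instead.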

Adding NLP

Location & Date recognition (GeoNames)

(other) Named Entities (VIAF enhanced with names from metadata)

Depending on performance of the system, we’ll work on:

Chunking, multiword recognition

Parsing

Word Sense Disambiguation

Slide42

Metadata & Project Goals

Duplicate detection (metadata and text)

Events/Network discovery

Education (begin, end, location)

Occupation (begin, end, location)

Relations (parents, partners)

Temporal relations between events

Slide43

Output first system

Better coverage of categories mentioned above

A timeline for a person’s life (birth, education, occupation, locations, death)

Named Entities in text (dates, locations, persons)

Slide44

Beyond the first system

The information provided by the first system can be used to:

Identify alternative descriptions of events (same time, location and/or participants)

Identify relations between events (same locations & time, consequent events, same participants, etc.)

Build initial networks of people

Slide45

Methodological issues and text interpretation

Results should be reproducible

Code release (including scripts, configurations, …)

Documentation

Open source data

The setup should be modular

Combine output of different tools

Flexible choice of methods used

Slide46

Evaluation Challenges (1/2)

How to evaluate the extraction tools?

Partial evaluation using metadata (10-fold cross-validation), but:

No precise indication of precision or recall (incomplete metadata…)

Biographies with rich metadata are not necessarily representative

Manually annotated data needed!

Slide47
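The caveat about incomplete metadata in "Evaluation Challenges (1/2)" can be made concrete with a small calculation. In the sketch below, `precision_recall` scores extracted values against a reference set, and `k_folds` splits item indices into folds as in k-fold cross-validation; both helpers and all sample values are invented for illustration. When the reference metadata lacks a value that the system extracted correctly, that value is counted as an error, so the measured precision understates the real one.

```python
def precision_recall(predicted, reference):
    """Precision and recall of a predicted set against a reference set."""
    true_pos = len(predicted & reference)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    return precision, recall


def k_folds(n_items, k=10):
    """Yield (train, test) index lists; every item is tested exactly once."""
    indices = list(range(n_items))
    for fold in range(k):
        test = indices[fold::k]
        train = [i for i in indices if i % k != fold]
        yield train, test


# Invented example: the system finds two correct facts, but the
# metadata used as reference records only one of them.
extracted = {("birth_place", "Zwolle"), ("education", "Leiden")}
complete = {("birth_place", "Zwolle"), ("education", "Leiden")}
incomplete_metadata = {("birth_place", "Zwolle")}

print(precision_recall(extracted, complete))
print(precision_recall(extracted, incomplete_metadata))
```

Scores computed against metadata are therefore best read as rough indications, which fits the call for manually annotated data.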

Evaluation Challenges (2/2)

How to compare performance of NLP tools?

Little work on biographies, little or none on Dutch ones…

How hard are older texts? Can we quantify?

Systematic comparison:

English biographies (Wikipedia)

Dutch biographies (Wikipedia)

Biographies from the portal

Slide48

Reproducibility/Replication

What do results mean if they cannot be reproduced?

What variation in results can be expected based on details not mentioned in papers?

Which information is needed to replicate results or find the origin of differences?

Paper submitted to ACL 2013 (joint work with Marieke van Erp and others)

Slide49

Representations (tools)

How to represent and combine output of different tools?

Compatibility (easy to convert output of external NLP tools)

Flexibility (be able to contain alternative representations and interpretations)

Integrate representations in NIF (joint work with Jesper Hoeksema and Willem van Hage)

Slide50

Representation (events)

How to combine knowledge from the NLP community and Linked Data community?

Combination of textual information with external resources

Complete representation of information from text (location, retrieval method)

Paper submitted to workshop on Events: Definition, detection, coreference and representation (joint work with Marieke van Erp, Willem van Hage, Sara Tonelli, and others)

Slide51

Current state of affairs

Basic system using sentence number and lemmas for main categories of metadata (evaluation ongoing)

Module for labeling locations and dates in text (adaptations to be made for modularity)

Annotation effort started for evaluation (selection of approximately 700 texts)

Slide52

Demonstrator

Slide53

The interface should be easy to use

The demonstrator should inspire historians to undertake new research and give direction, rather than being the ‘closing factor’ in their research

The interface should allow users to ‘fine-tune’ results returned upon an initial action

Interface: Focus

Slide54

Query composition

Faceted browsing

A combination

Interface: Options

Slide55

Drop down boxes to select ‘Verbs’, data elements and relations

Interface: Query composition

Slide56

No explicit querying, but convergence of the data through browsing and selecting

Provides better feedback to the user

Allows for more direct and easier adjustment of the selected data

Interface: Faceted browsing

Slide57

Interface: Faceted browsing

Slide58

Query composition combined with faceted browsing

Create new facets by defining a query

The result of the query is available as a subset of the data by selecting the defined facet

As such, combinable with other facets

Method to integrate ‘open’ querying of the data into a general interface and visualization
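One way to picture the combination: each facet selects a subset of the records, and a user-defined query is wrapped up as one more facet so that it can be intersected with the others. The records, field names and the `facet`, `query_facet` and `select` helpers below are hypothetical and say nothing about how the demonstrator is actually built.

```python
# Invented records for illustration.
people = [
    {"name": "A", "birth_place": "Zwolle", "birth_year": 1798, "occupation": "politician"},
    {"name": "B", "birth_place": "Leiden", "birth_year": 1650, "occupation": "governor"},
    {"name": "C", "birth_place": "Zwolle", "birth_year": 1620, "occupation": "governor"},
]


def facet(field, value):
    """A standard facet: records whose field equals the chosen value."""
    return lambda record: record[field] == value


def query_facet(predicate):
    """A facet defined by an arbitrary query over a record."""
    return predicate


def select(records, *facets):
    """Keep the records that satisfy every selected facet."""
    return [r for r in records if all(f(r) for f in facets)]


born_before_1700 = query_facet(lambda r: r["birth_year"] < 1700)
result = select(people, facet("occupation", "governor"), born_before_1700)
print([r["name"] for r in result])
```

Treating the query result as just another filter is what lets it be switched on and off alongside the ordinary facets.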

Interface: A combination

Slide59

Interface: A combination

Question

Analysis

Selection

Process

Results

Data

Facets

Slide60

Time and place are primary elements

Interface: Demonstrator

Results?

Slide61
Slide62

Questions
