Slide1
BiographyNed
eScience Center
21 March 2013
Slide2
Slide3
Why a good case for eScience?
Involves big data with high complexity
Rich metadata joining diverse textual sources and selections of data
Incomplete and noisy
Potential to investigate difficult questions, e.g.:
How did the current Dutch elite develop from the colonial past?
Biographies may represent different views and realities and thus answers to questions: hero or villain
2.8 textual sources per person
Slide4
What will we do?
Develop generic text mining technology that converts textual data to structured data
Taking into account nature of historical text
Enrich and externally link data repository of Dutch biographies
Develop visualizations and interactions on the data set to support historical research
Develop a range of cases that demonstrate the possibilities and impossibilities of the data set and technology
Slide5
Patterns in data
Value
Interpretation
Line composition in paintings
Twitter patterns during elections
Cubism
Democratic participation
Nature of eHumanities
Slide6
Patterns in data
Value
Interpretation
Narratives
Cases: persons/objects/events
Line composition in paintings
Twitter patterns during elections
Cubism
Democratic participation
19th-century Japanese prints
Biographical descriptions of Prince Bernhard
The rise of the Japanese middle class
German nobles in the Interbellum
Slide7
Methodological Issues
How telling is the output of our tools?
Selection made by (editors of) dictionaries
Reliability of automated text analysis
Introduction of biases in the methodology
Careful evaluation and detailed communication are required…
Slide8
Statistics on available information
Slide9
Textual Information per person
Information                        Numbers
Average XML-files per individual   2.79
Texts                              78.75%
Words (total/person)               288.83
Words (longest text/person)        229.04
Words (total/text)                 366.76
Words (longest text)/texts         290.83
Slide10
Availability of Information in the portal
Slide11
Presence of information for governors of Dutch Indies (% on 71 individuals)
Slide12
Biography Portal of the Netherlands. The Sources
Katwijkseweg 33
Slide13
The Historical Perspective
History and Biography
Where do eScience and History meet?
Use Cases
Slide14
Historical Research
The Art and Science of History: Drawing up a narrative from primary and secondary sources which approximates historical reality as well as possible.
Slide15
Building Blocks and Concrete
Building blocks: facts derived mainly from archival findings and existing literature
Concrete: the methods historians use to put them together into a narrative/synthesis.
The Narrative: a historical synthesis which cannot be scientifically proven (only made likely), based on facts which can be proven or falsified. There is necessarily a creative element in drawing up a narrative.
Slide16
Example: Grand Pensionary Johan de Witt (1625-1672)
Building blocks: born in 1625; son of Jacob and Anna van den Corput; appointed grand pensionary in 1653; murdered in The Hague in 1672; enemy of William (III) of Orange; William of Orange rewarded one of the instigators of the murder
Concrete: (logic) Based on these last data it is likely that William ordered the death of Johan
Narrative: William probably ordered the death of Johan <= proposition based on facts and reasoning
Slide17
The House of History
Slide18
The Importance of Provenance
The only way to falsify presented historical facts is by going back to the original source(s) and looking at those sources critically.
It is highly important to be able to know exactly what information comes from where.
Slide19
Our Sources Here
The Metadata: building blocks
The entries in biographical dictionaries themselves: short historical narratives
Slide20
Status of Biography in Academia and Society
Despite improved efforts this century to embed biography in academic theories and methods, some (e.g. some social historians) still do not consider it a worthy academic discipline, regarding it as too anecdotal and limited.
Biography is the most popular non-fiction genre in bookstores (from both academic and lay authors)
Slide21
Where do eScience and History meet? (I)
“And when the capsule biography of an individual is combined with 50,000 others, many of them relatively obscure, […] and when they are all powerfully searchable online, the social historian’s grumbles about biography’s limitations as an approach to historical study dissolves into nothingness.”
(Brian Harrison, 2004, former editor of the Oxford Dictionary of National Biography)
Slide22
Where do eScience and History meet? (II)
A. Quantitative analyses of a larger group of people (prosopography). Surpassing the anecdotal.
B. Finding relations/networks between people which are otherwise hard to detect
Slide23
Where do eScience and History meet? (III)
C. Insight in Historiography and historical selectivity. Who was described/included and why?
“Undoubtedly I have deprived many interesting women by not including them. The only thing I can say to defend myself is this: history writing is also a process of ruthless selection.” (Els Kloek, Head Biography portal and main author 1001 vrouwen)
D. Thematic research. E.g.: When did the discovery of America start to influence people’s lives?
Slide24
BiographyNed Use Cases
In the initial stages of the research a list of possible historical questions within one of those four themes was drawn up (subject to change). The demonstrator should be able to answer these questions, or at least point to a direction/trend.
Slide25
Case I: Making life easier: Group portrait of the Governors-General
Highest official in the Dutch Indies 1610-1949
71 men (still a relatively small group)
What can we say about these men as a group?
Who was appointed and what qualities did he have to have? Etc. …
Slide26
Case I: data mining
Family connections (parents/wife/children, other relevant connections <= patronage)
Place of Birth
Education
Religion
Career (patterns)
Age at appointment
Duration of holding the office
Reason for leaving the office
Place of Death
Slide27
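Two of the Case I fields listed on the data mining slide, age at appointment and duration of holding the office, are simple derived values once birth and appointment years are available as structured data. A minimal Python sketch follows, assuming a hypothetical `OfficeHolder` record whose field names are invented here. No Governor-General data appears in these slides, so the sample input borrows the dates from the Johan de Witt example, who was a grand pensionary and not a Governor-General.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class OfficeHolder:
    """One person's building blocks; field names are illustrative only."""
    name: str
    birth_year: int
    appointed_year: int
    left_office_year: int
    place_of_birth: Optional[str] = None  # metadata is often incomplete


def age_at_appointment(p: OfficeHolder) -> int:
    # Year-level arithmetic only; exact ages would need day and month too.
    return p.appointed_year - p.birth_year


def years_in_office(p: OfficeHolder) -> int:
    return p.left_office_year - p.appointed_year


def group_portrait(people: list[OfficeHolder]) -> dict[str, float]:
    """Aggregate the per-person values into group-level averages."""
    return {
        "mean_age_at_appointment": mean(age_at_appointment(p) for p in people),
        "mean_years_in_office": mean(years_in_office(p) for p in people),
    }


# Sample input from the De Witt slide: born 1625, appointed 1653,
# murdered (and so out of office) in 1672.
de_witt = OfficeHolder("Johan de Witt", 1625, 1653, 1672)
portrait = group_portrait([de_witt])
print(portrait)
```

With all 71 Governors-General loaded as such records, the same `group_portrait` call would give the group-level figures that currently take a historian more than a week to compile by hand.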
Case I: Time and Effort
More than 1 full week to manually mine this information from the Biography Portal.
Can a historian do this with (almost) the same results in under one hour if helped by the demonstrator?
Slide28
Case II: Making things possible: The Dutch Nation & Identity
Who was selected to be included in National Biographical Dictionaries and why? (what was their claim to fame?)
Are there different perspectives on the same person over time and how can this be explained?
Who was deemed most important? (based on the length of the entries)
What time periods are most represented?
Is there a difference in claim to fame for people from different periods in history, or between men and women?
Which words are used most often and can we link them to national identities?
Slide29
Case II: More Questions …
What events are mentioned most often and what does that say about the status quaestionis of how the Dutch see/saw themselves?
What are the differences in the answers to these questions between several national biographical dictionaries?
Are people and events described or appreciated differently over time? Does the perspective change?
How does this relate to biographical dictionaries, nations and identities elsewhere in Europe?
Slide30
Conversion to Linked Data
Slide31
A crash course on Linked Data
Online machine-readable data with links
Simple facts called ‘RDF Triples’
Thorbecke > hasBirthPlace > Zwolle
Some technology concepts:
Schemas: To structure LD
RDF Stores: To store LD
SPARQL: To access LD
Huge growth in the past years:
More than 300 data sources
More than 30 billion triples
Slide32
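To make the triple idea from the Linked Data crash course concrete, here is a small Python sketch that stores the example triple `Thorbecke > hasBirthPlace > Zwolle` as a plain tuple and looks it up by pattern. This is only an illustration using the standard library: it is not an RDF store and not SPARQL. The `match` helper is a name invented here, and real Linked Data would use full URIs rather than bare strings.

```python
from typing import Optional

Triple = tuple[str, str, str]

# The one triple given on the slide; identifiers are plain strings here,
# whereas real Linked Data would use full URIs.
triples: list[Triple] = [
    ("Thorbecke", "hasBirthPlace", "Zwolle"),
]


def match(store: list[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> list[Triple]:
    """Return triples matching the pattern; None acts as a wildcard,
    loosely like a variable in a SPARQL triple pattern."""
    return [
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# "Where was Thorbecke born?"
birth_places = [o for (_, _, o) in match(triples, s="Thorbecke", p="hasBirthPlace")]
print(birth_places)
```

Leaving the object position open plays the role of the query variable; an actual RDF store would answer the same question through a SPARQL query.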
The conversion process
Data Preservation
Purely syntactic conversion
Preserve the original structure of the data
Prevent loss of information
Allow for reinterpretation of the original data in the future
Slide33
The conversion process
Conversion steps:
Retrieval of XML dump of the Biography Portal
Initial conversion to ‘crude’ RDF (using ClioPatria and the XMLRDF tool for ClioPatria)
RDF restructuring
Linking to other sources (essential step in the ‘Linked Data’ philosophy)
Slide34
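The "purely syntactic" first conversion step can be pictured as mirroring the XML tree one-to-one as triples, without interpreting any of it. The Python sketch below shows that idea using only the standard library. It is not the XMLRDF tool for ClioPatria, and the XML element names in `SAMPLE_XML` are invented for illustration, not taken from the Biography Portal dump.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the element names below are invented for illustration
# and are not the Biography Portal's actual XML schema.
SAMPLE_XML = """
<biography id="bio1">
  <person>
    <name>Thorbecke</name>
    <birth>
      <place>Zwolle</place>
      <year>1798</year>
    </birth>
  </person>
</biography>
"""


def xml_to_crude_triples(xml_text: str) -> list[tuple[str, str, str]]:
    """Mirror the XML tree as triples without interpreting it.

    Each element becomes a node named after its path; attributes become
    (node, "@name", value), text becomes (node, "value", text) and child
    elements become (parent, tag, child) triples.
    """
    root = ET.fromstring(xml_text)
    triples: list[tuple[str, str, str]] = []

    def walk(elem: ET.Element, node_id: str) -> None:
        for key, value in elem.attrib.items():
            triples.append((node_id, "@" + key, value))
        text = (elem.text or "").strip()
        if text:
            triples.append((node_id, "value", text))
        for index, child in enumerate(elem):
            child_id = f"{node_id}/{child.tag}[{index}]"
            triples.append((node_id, child.tag, child_id))
            walk(child, child_id)

    walk(root, root.tag)
    return triples


for triple in xml_to_crude_triples(SAMPLE_XML):
    print(triple)
```

Because nothing is dropped or renamed, the original structure can still be reinterpreted later. Restructuring and linking to other sources would then operate on these crude triples.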
The conversion process
Data schema:
Based on the structure of the original XML files
Needs to facilitate the coupling of different biographies of the same person, without compromising the original data
Needs to facilitate the incorporation of several enrichments, following from NLP, Entity Reconciliation, etc.
Compatible with existing schemas such as the Europeana Data Model, PROV, RDAgr2, FOAF, DC terms
Slide35
BiographyNed schema
[Diagram: the schema applied to the NNBW biography of Thorbecke. Node labels: Person, Biographical Description, Provenance (NNBW), Meta Data (“Thorbecke”), BiographyParts, Event (Birth, 1798). An Enrichment produced by an NLP Tool adds a further Biographical Description with Person, Meta Data and Event (Birth, Zwolle, 1798-01-14).]
Example text: “Johan Rudolph Thorbecke werd in 1798 geboren op 14 januari in Zwolle en komt uit een half-Duitse…” (“Johan Rudolph Thorbecke was born in Zwolle on 14 January 1798 and comes from a half-German…”)
Slide36
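The Thorbecke example shows an enrichment turning running text into a birth event with place `Zwolle` and date `1798-01-14`, while recording which source and tool it came from. A rough Python sketch of that step follows. A single hand-written regular expression stands in for the NLP tool, the dictionary layout is invented here, and the tool label `regex-sketch` is a placeholder.

```python
import re
from datetime import date
from typing import Optional

DUTCH_MONTHS = {
    "januari": 1, "februari": 2, "maart": 3, "april": 4, "mei": 5, "juni": 6,
    "juli": 7, "augustus": 8, "september": 9, "oktober": 10, "november": 11,
    "december": 12,
}

# Covers this one phrasing only:
# "werd in YEAR geboren op DAY MONTH in PLACE".
BIRTH_PATTERN = re.compile(
    r"werd in (?P<year>\d{4}) geboren op (?P<day>\d{1,2}) (?P<month>[a-z]+) "
    r"in (?P<place>[A-Z][\w-]*)"
)


def extract_birth(text: str, source: str, tool: str) -> Optional[dict]:
    """Return a birth event with provenance, or None if nothing matches."""
    m = BIRTH_PATTERN.search(text)
    if m is None or m.group("month") not in DUTCH_MONTHS:
        return None
    when = date(int(m.group("year")),
                DUTCH_MONTHS[m.group("month")],
                int(m.group("day")))
    return {
        "event": "Birth",
        "date": when.isoformat(),
        "place": m.group("place"),
        # Provenance: where the text came from and what produced the value.
        "provenance": {"source": source, "tool": tool},
    }


sentence = ("Johan Rudolph Thorbecke werd in 1798 geboren op 14 januari "
            "in Zwolle en komt uit een half-Duitse…")
event = extract_birth(sentence, source="NNBW", tool="regex-sketch")
print(event)
```

Keeping the provenance next to each extracted value is what allows a historian to go back to the original source and judge the extraction critically.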
Retrieving Information from Text
Slide37
The texts in the Biography Portal
Collection of biographical dictionaries
Dutch, including from the 19th and early 20th century and even older quotes
Sources (different dictionaries/collections) have their own style
Metadata available (though large differences in completeness)
Slide38
Challenges and Advantages
Challenges:
Little work on NLP and biographies
Performance of Dutch NLP tools on variations of Dutch
Advantages:
High-quality metadata covering several categories of information (supervised machine learning)
Within sources, clear and similar structure of texts
Slide39
General Approach
Start by using advantages:
Use metadata to label information
A basic IR system can be built using sentence number and lemmas as features
Enhance performance with NLP tools
Build upon information retrieved in the first steps to tackle more challenging tasks
Slide40
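The first two points of the general approach (use metadata to label information, then use sentence number and lemmas as features) can be sketched as follows. This Python sketch makes several assumptions: lowercased tokens stand in for lemmas, the sentence splitter is naive, a sentence counts as a positive example when it literally contains the metadata value, and the English sample sentences are invented around facts from the De Witt slide. No learner is included, only the construction of labeled examples.

```python
import re


def split_sentences(text: str) -> list[str]:
    # Naive splitter on sentence-final punctuation; a real system would use
    # a proper sentence segmenter.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def features(sentence: str, number: int) -> dict[str, object]:
    """Sentence number plus a bag of lowercased tokens.

    Lowercased tokens stand in for lemmas here; no lemmatizer is applied.
    """
    feats: dict[str, object] = {"sentence_number": number}
    for token in re.findall(r"\w+", sentence.lower()):
        feats["lemma=" + token] = True
    return feats


def label_with_metadata(text: str, metadata_value: str) -> list[tuple[dict, bool]]:
    """Label each sentence True if it mentions the metadata value."""
    examples = []
    for number, sentence in enumerate(split_sentences(text)):
        label = metadata_value.lower() in sentence.lower()
        examples.append((features(sentence, number), label))
    return examples


# Facts from the De Witt slide, phrased as two invented English sentences.
text = "Johan de Witt was born in 1625. He was murdered in The Hague in 1672."
training = label_with_metadata(text, metadata_value="1625")  # birth year
for feats, label in training:
    print(feats["sentence_number"], label)
```

The resulting (features, label) pairs are the kind of input a supervised classifier could be trained on. Biographies with incomplete metadata simply yield fewer positive examples.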
A Basic System
Supervised Machine Learning
Two-step identification process (Wu and Weld 2007; 2010, Fader et al. 2011)
Identify sentence that contains information
Sequence tagging to identify information within the sentence
Slide41
Adding NLP
Location & Date recognition (GeoNames)
(other) Named Entities (VIAF enhanced with names from metadata)
Depending on performance of the system, we’ll work on:
Chunking, multiword recognition
Parsing
Word Sense Disambiguation
Slide42
Metadata & Project Goals
Duplicate detection (metadata and text)
Events / Network discovery
Education (begin, end, location)
Occupation (begin, end, location)
Relations (parents, partners)
Temporal relations between events
Slide43
Output first system
Better coverage of categories mentioned above
A timeline for a person’s life (birth, education, occupation, locations, death)
Named Entities in text (dates, locations, persons)
Slide44
Beyond the first system
The information provided by the first system can be used to:
Identify alternative descriptions of events (same time, location and/or participants)
Identify relations between events (same locations & time, consequent events, same participants, etc.)
Initial networks of people
Slide45
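Identifying alternative descriptions of the same event, as proposed for the stage beyond the first system, can be approximated by grouping extracted events on shared time, location and participants. The Python sketch below uses exact matching on all three, which is stricter than the "and/or" on the slide and would need relaxing for real data. The tuple layout and the source labels `source-A` and `source-B` are placeholders, and the sample records restate facts from the De Witt slide.

```python
from collections import defaultdict

# Each event is (source, event type, year, place, participants).
# The first two records restate one fact from the De Witt slide (murdered
# in The Hague in 1672); the source labels are placeholders.
events = [
    ("source-A", "death", 1672, "The Hague", frozenset({"Johan de Witt"})),
    ("source-B", "murder", 1672, "The Hague", frozenset({"Johan de Witt"})),
    ("source-A", "appointment", 1653, None, frozenset({"Johan de Witt"})),
]


def alternative_descriptions(records):
    """Group records that share year, place and participants.

    Only groups with more than one record are returned, since a single
    record has no alternative description to compare against.
    """
    groups = defaultdict(list)
    for record in records:
        _source, _kind, year, place, participants = record
        groups[(year, place, participants)].append(record)
    return {key: group for key, group in groups.items() if len(group) > 1}


candidates = alternative_descriptions(events)
for key, group in candidates.items():
    print(key, [(source, kind) for source, kind, *_ in group])
```

Here the "death" and "murder" records end up in one group, which is exactly the kind of pair where two dictionaries may frame the same event differently.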
Methodological issues and text interpretation
Results should be reproducible
Code release (including scripts, configurations, …)
Documentation
Open source data
The setup should be modular
Combine output of different tools
Flexible choice of methods used
Slide46
Evaluation Challenges (1/2)
How to evaluate the extraction tools?
Partial evaluation using metadata (10-fold cross-validation), but:
No precise indication of precision or recall (incomplete metadata…)
Biographies with rich metadata are not necessarily representative
Manually annotated data needed!
Slide47
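Why incomplete metadata gives no precise indication of precision can be shown with a small calculation. Precision is the share of extracted items that appear in the gold standard and recall is the share of gold items that were extracted, so a correct extraction that is missing from the metadata counts as an error. The Python sketch below also includes a simple way to form 10 folds. The function names and the fact tuples are illustrative, with values taken from the De Witt slide.

```python
def precision_recall(predicted: set, gold: set) -> tuple[float, float]:
    """precision = |predicted ∩ gold| / |predicted|,
    recall    = |predicted ∩ gold| / |gold|."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def k_fold_indices(n_items: int, k: int = 10):
    """Yield (train, test) index lists; item i is tested in fold i % k."""
    for fold in range(k):
        test = [i for i in range(n_items) if i % k == fold]
        train = [i for i in range(n_items) if i % k != fold]
        yield train, test


# The extractor finds three facts that are all stated in the text...
predicted = {
    ("birth_year", "1625"),
    ("death_year", "1672"),
    ("death_place", "The Hague"),
}
# ...but the metadata used as gold standard happens to lack the death place.
gold_metadata = {("birth_year", "1625"), ("death_year", "1672")}

p, r = precision_recall(predicted, gold_metadata)
print(f"precision={p:.2f} recall={r:.2f}")
```

Measured against the metadata, precision comes out at two thirds although all three extractions are correct. This is why manually annotated data is needed for a trustworthy figure.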
Evaluation Challenges (2/2)
How to compare performance NLP tools?
Little work on biographies, little or none on Dutch ones…
How hard are older texts? Can we quantify?
Systematic comparison:
English biographies (Wikipedia)
Dutch biographies (Wikipedia)
Biographies from the portal
Slide48
Reproducibility/Replication
What do results mean if they cannot be reproduced?
What variation in results can be expected based on details not mentioned in papers?
Which information is needed to replicate results or find the origin of differences?
Paper submitted ACL 2013 (joint work with Marieke van Erp and others)
Slide49
Representations (tools)
How to represent and combine output of different tools?
Compatibility (easy to convert output of external NLP tools)
Flexibility (be able to contain alternative representations and interpretations)
Integrate representations in NIF (joint work with Jesper Hoeksema and Willem van Hage)
Slide50
Representation (events)
How to combine knowledge from the NLP community and Linked Data community?
Combination of textual information with external resources
Complete representation of information from text (location, retrieval method)
Paper submitted to workshop on Events: Definition, detection, coreference and representation (joint work with Marieke van Erp, Willem van Hage, Sara Tonelli, and others)
Slide51
Current state of affairs
Basic system using sentence number and lemmas for main categories metadata (evaluation ongoing)
Module for labeling locations and dates in text (adaptations to be made for modularity)
Annotation effort started for evaluation (selection of approximately 700 texts)
Slide52
Demonstrator
Slide53
Interface: Focus
The interface should be easy to use
The demonstrator should inspire historians to undertake new research and give direction, rather than being the ‘closing factor’ in their research
The interface should allow users to ‘fine tune’ the results returned upon an initial action
Slide54
Interface: Options
Query composition
Faceted browsing
A combination
Slide55
Interface: Query composition
Drop-down boxes to select ‘Verbs’, data elements and relations
Slide56
Interface: Faceted browsing
No explicit querying, but convergence of the data through browsing and selecting
Provides better feedback to the user
Allows for more direct and easier adjustment of the selected data
Slide57
Interface: Faceted browsing
Slide58
Interface: A combination
Query composition combined with faceted browsing
Create new facets by defining a query
The result of the query is available as a subset of the data by selecting the defined facet
As such, combinable with other facets
Method to integrate ‘open’ querying of the data into a general interface and visualization
Slide59
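The combination of query composition and faceted browsing can be modeled by treating every facet, whether a fixed value filter or a user-defined query, as a predicate over records. Selecting several facets then intersects their subsets. The Python sketch below is one possible reading of that idea and not the demonstrator's design. The record fields and the helper names `value_facet`, `query_facet` and `select` are invented, and the sample records reuse facts about De Witt and Thorbecke from earlier slides.

```python
from typing import Callable

Record = dict[str, object]
Facet = Callable[[Record], bool]

# Sample records built from facts on earlier slides; field names are invented.
records: list[Record] = [
    {"name": "Johan de Witt", "birth_year": 1625, "death_place": "The Hague"},
    {"name": "Johan Rudolph Thorbecke", "birth_year": 1798, "birth_place": "Zwolle"},
]


def value_facet(field: str, value: object) -> Facet:
    """Ordinary facet: keep records whose field equals the chosen value."""
    return lambda record: record.get(field) == value


def query_facet(query: Facet) -> Facet:
    """Facet defined by a composed query; here a query is just a predicate."""
    return query


def select(data: list[Record], facets: list[Facet]) -> list[Record]:
    """Apply all active facets; each one narrows the current subset."""
    return [r for r in data if all(facet(r) for facet in facets)]


# A user-defined query turned into a facet, then combined with a value facet.
born_before_1700 = query_facet(
    lambda r: isinstance(r.get("birth_year"), int) and r["birth_year"] < 1700
)
subset = select(records, [born_before_1700, value_facet("death_place", "The Hague")])
print([r["name"] for r in subset])
```

Because a query-defined facet has the same shape as any other facet, it can be switched on and off in the interface like the built-in ones. This is what makes the ‘open’ querying combinable with browsing.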
Interface: A combination
[Diagram labels: Question, Analysis, Selection, Process, Results, Data, Facets]
Slide60
Interface: Demonstrator
Time and place are primary elements
Results?
Slide61
Slide62
Questions