Slide1
BABY ElEPHãT – Building an Analytical BibliographY for a Prosopography in Early English Imprint Data
Nushrat Khan
Oxford-Illinois Digital Libraries Placement Programme

Slide2
About EEBO-TCP
Collaboration between the universities of Oxford and Michigan from 1999 to 2015
Early English texts from 1473 to 1700
25,000 texts made available online
Full-text searching available through the EEBO-TCP database

Slide3
Why historic texts are interesting
Historic Datasets
Accessibility
Reveal Historical Information
Semantic Web
Technical Interoperability
Future Research

Slide4
Workset constructor
Enables workset creation from Person, Place, Subject, Genre and Dates parameters (http://eeboo.oerc.ox.ac.uk/)

Slide5
How does it work?
Workflow of Publishing Structured Metadata
Available Metadata Fields
Title
Author Name
Date (Precise Birth, Precise Death, precise-floruit-from, precise-floruit-to)
Raw Publication Place
Raw Publication Date
Publisher

Slide6
Sample publisher data
Publisher
By Rycharde Iugge, printer to the Quenes Maiestie,
Printed by I[ohn] C[harlewood] for Iohn Hinde, dwelling in Paules Church-yarde, at the signe of the golden Hinde,
Printed by Benjamin Took and John Crook, and are to be sold by Mary Crook & Andrew Crook ...,
Printed by Peter Smith, and at Saint-Omer at the English College Press],
s.n.],
[By J. Charlewood] for Edward White, dwelling at the little North doore of Paules Church, at the signe of the Gunne,
Imprinted by Richard Field, and are to be sold by Richard Garbrand [, Oxford],
[By I. Jaggard?] for M. S[parke.,
Imprinted by E: G[riffin]: for Iohn Budge, and Ralph Mab,
By [J. King for?] Iohn waley dwellyng in Foster lane,

Slide7
Inside the data
Work: Printed By, Sold At, Printed For, Sold By, Printed At
Punctuation: ? : . [ ] , … [ ]? “” ,.

Slide8
Workflow

Slide9
Entity recognition approaches
NLTK Entity Extractor
Regular Expression

Slide10
ReVerb
For automatically identifying and extracting binary relationships from English sentences
Input: Raw text
Output: Argument1, Relation Phrase, Argument2
Example input: Bananas are an excellent source of potassium
Example output: (bananas, be source of, potassium)

Slide11
Open Calais
Not as effective on short texts, e.g. Printed by A. Bells
Input text too short
Example sentence: Printed by Melchisedech Bradwood for William Aspley
Not detected as a person

Slide12
NLTK entity recognizer
Step 1: Extracted all the entities labeled as PERSON for each sentence
work_000001|Rycharde Iugge
work_000003|Pauls
work_000004|Iohn Charlewood
work_000004|Iohn Hinde
work_000005|Ioan Danter
work_000006|Francis Grove
work_000007|Henry Goddus
work_000008|Arthur Iohnson
work_000012|Leonard Lichfield
work_000013|Langly Curtis
work_000014|Benjamin Took
work_000014|John Crook
work_000014|Mary Crook
work_000014|Andrew Crook
work_000015|William Keblewhite
All the entities NLTK can extract for each record (with some limitations)

Slide13
Limitations of NLTK
NLTK does not identify initials as names, e.g. A. B.
Extracts only the surname in expressions like A. Bells, Edw: Allde
Wrongly identifies the word “Printer” in sentences where it is capitalized after ‘by’, e.g. Printed by John Bill, Printer to the King's most Excellent Majesty
In complex sentences containing multiple names, it cannot detect and extract all the names reliably

Slide14
Finding relationships within sentences
("'Printed", 'JJ')
('by', 'IN')
(PERSON Benjamin/NNP Took/NNP)
('and', 'CC')
(PERSON John/NNP Crook/NNP)
('and', 'CC')
('are', 'VBP')
('to', 'TO')
('be', 'VB')
('sold', 'VBN')
('by', 'IN')
(PERSON Mary/NNP Crook/NNP)
('&', 'CC')
(PERSON Andrew/NNP Crook/NNP)
Look for preceding preposition
Separate the entities based on ‘by’ or ‘for’
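The two steps above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the `Person` class is a hypothetical stand-in for a PERSON chunk from the chunker, `group_by_role` is an invented helper name, and tracking the preceding verb (so that 'Printed by' and 'sold by' stay apart) is an assumption layered on top of the preposition rule.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple, Union


@dataclass(frozen=True)
class Person:
    """Hypothetical stand-in for a PERSON chunk returned by the chunker."""
    name: str


Chunk = Union[Person, Tuple[str, str]]  # a Person, or a (word, POS tag) pair

# Assumption: these verbs and prepositions are the only role markers handled.
VERBS = {"printed": "Printed", "imprinted": "Printed", "sold": "Sold"}
PREPOSITIONS = {"by": "By", "for": "For"}


def group_by_role(chunks: List[Chunk]) -> Dict[str, List[str]]:
    """Assign each person to the role named by the preceding verb and preposition."""
    roles: Dict[str, List[str]] = {}
    verb = "Printed"  # assumption: an imprint with no verb describes printing
    prep: Optional[str] = None
    for chunk in chunks:
        if isinstance(chunk, Person):
            if prep is not None:  # skip names with no preceding 'by' or 'for'
                roles.setdefault(f"{verb} {prep}", []).append(chunk.name)
            continue
        word = chunk[0].strip("'\"[]").lower()
        if word in VERBS:
            verb, prep = VERBS[word], None  # a new verb waits for its own preposition
        elif word in PREPOSITIONS:
            prep = PREPOSITIONS[word]
    return roles


# The tagged sentence shown on this slide.
EXAMPLE: List[Chunk] = [
    ("'Printed", "JJ"), ("by", "IN"), Person("Benjamin Took"), ("and", "CC"),
    Person("John Crook"), ("and", "CC"), ("are", "VBP"), ("to", "TO"),
    ("be", "VB"), ("sold", "VBN"), ("by", "IN"), Person("Mary Crook"),
    ("&", "CC"), Person("Andrew Crook"),
]

print(group_by_role(EXAMPLE))
```

For the sentence on this slide, the sketch groups Benjamin Took and John Crook under 'Printed By' and Mary Crook and Andrew Crook under 'Sold By'.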
“You're having a hard time because it's hard. This is really not an easy task to approach.” – jonrsharpe, Jul 31 '14

Slide15
Data refining
Printed For
De-duplicate the ‘Sold by’
Put back the ones in ‘Printed and Sold by’
Extracted separately using ‘Regex’

Slide16
Generating unique URI
Ideal case: Assign one unique URI per person
Exception in this case:
Few authoritative sources to refer to
Time-consuming validation
Very limited information about each person available
Assigned a unique URI to every instance
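Minting a fresh URI for every instance can be sketched as below. The base namespace `http://example.org/eeboo/person/`, the helper name `mint_instance_uri` and the sample mentions are placeholders for illustration; only the use of `uuid4()` comes from the slides.

```python
import uuid

# Placeholder namespace for illustration; the project's real base URI is not given here.
BASE_URI = "http://example.org/eeboo/person/"


def mint_instance_uri(base: str = BASE_URI) -> str:
    """Return a new URI ending in a random (version 4) UUID."""
    return f"{base}{uuid.uuid4()}"


# Illustrative mentions: every mention gets its own URI, even when two mentions
# share a name, because no attempt is made to decide whether they are the same person.
mentions = ["John Crook", "John Crook", "Mary Crook"]
uris = [(name, mint_instance_uri()) for name in mentions]

for name, uri in uris:
    print(name, uri)
```

The trade-off is that identical people end up with several URIs, which leaves reconciliation against authoritative sources as a later step.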
Python uuid module – uuid4() function

Slide17
Working with ontology
Checked existing ontologies for ‘Printed by’ and ‘Printed for’ relationships: MODS, MADS, BIBFRAME etc.
EEBOO Ontology
Modify the existing ontology to define the new relationships
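As a rough sketch of what statements using the new relationships might look like, the snippet below writes N-Triples lines by hand. The namespaces, the property names `printedBy`, `printedFor` and `soldBy`, and the person URIs are all hypothetical; the actual EEBOO ontology terms may differ.

```python
# Hypothetical namespaces; the real EEBOO ontology IRIs are not given in the slides.
ONTOLOGY = "http://example.org/eeboo/ontology#"
DATA = "http://example.org/eeboo/"


def triple(subject: str, predicate: str, obj: str) -> str:
    """Format one statement as an N-Triples line with IRI subject, predicate and object."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def work_statements(work_id: str, printed_by=(), printed_for=(), sold_by=()) -> list:
    """Build one triple per (work, relationship, person URI) combination."""
    groups = (("printedBy", printed_by), ("printedFor", printed_for), ("soldBy", sold_by))
    work = DATA + work_id
    return [triple(work, ONTOLOGY + prop, person) for prop, people in groups for person in people]


lines = work_statements(
    "work_000014",
    printed_by=[DATA + "person/p1", DATA + "person/p2"],  # placeholder person URIs
    sold_by=[DATA + "person/p3"],
)
print("\n".join(lines))
```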
Work
Author
Printed By
Printed For
Sold By

Slide18
Storing triples and generating RDF

Slide19
Querying on the data 1
Top 20 Publishers
Top 20 Printed for
Top 20 Sold By

Slide20
Querying on the data 2
The sellers for the works published by Henri Hills
Both Printed and Sold by Henri Hills
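The two questions above can be illustrated with a small in-memory stand-in for the triple store. This is plain Python rather than the query language actually used, and the work IDs, relationship names and person labels are invented placeholders, not records from the dataset.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (work, relationship, person)

# Invented placeholder statements, not real EEBO-TCP records.
TRIPLES: List[Triple] = [
    ("work_a", "printedBy", "Printer X"),
    ("work_a", "soldBy", "Seller Y"),
    ("work_b", "printedBy", "Printer X"),
    ("work_b", "soldBy", "Printer X"),
    ("work_c", "printedBy", "Printer Z"),
    ("work_c", "soldBy", "Seller Y"),
]


def works_with(triples: List[Triple], relation: str, person: str) -> Set[str]:
    """Works linked to `person` through `relation`."""
    return {work for work, rel, who in triples if rel == relation and who == person}


def sellers_for_printer(triples: List[Triple], printer: str) -> Set[str]:
    """Everyone who sold a work that `printer` printed."""
    printed = works_with(triples, "printedBy", printer)
    return {who for work, rel, who in triples if rel == "soldBy" and work in printed}


def printed_and_sold_by(triples: List[Triple], person: str) -> Set[str]:
    """Works that `person` both printed and sold."""
    return works_with(triples, "printedBy", person) & works_with(triples, "soldBy", person)


print(sorted(sellers_for_printer(TRIPLES, "Printer X")))
print(sorted(printed_and_sold_by(TRIPLES, "Printer X")))
```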
Sellers who worked with Henri Hills: Will Larner, Jane Underhill, Francis Smith

Slide21
Future direction
Train NLTK to capture the names properly
Extract specific place names from the publisher field, e.g. sold at Golden Hinde
For initials, work out how to identify the full names, e.g. whether R. Charles is Robert Charles or Ruth Charles; possibly request help from a domain expert
Analyze how name expressions have changed over time
Identify the authors using authoritative sources and domain-specific knowledge, e.g. London Book Trades Index, British Book Trade Index
Analyze and visualize the data by mapping

Slide22
Gratitude
Terhi Nurmikko-Fuller
David M. Weigl
Professor David De Roure
Kevin Page
Pip Willcox
And everybody else at OeRC!