
BABY - PowerPoint Presentation

min-jolicoeur
Uploaded On 2016-09-06



Presentation Transcript

Slide1

BABY ElEPHãT – Building an Analytical BibliographY for a Prosopography in Early English Imprint Data

Nushrat Khan

Oxford-Illinois Digital Libraries Placement Programme

Slide2

About EEBO-TCP

Collaboration between the universities of Oxford and Michigan from 1999-2015

Early English texts between 1473-1700

25000 texts made available online

Full text searching available through EEBO-TCP Database

Slide3

Why historic texts are interesting

Historic Datasets

Accessibility

Reveal Historical Information

Semantic Web

Technical Interoperability

Future Research

Slide4

Workset constructor

Enables workset creation from Person, Place, Subject, Genre and Date parameters (http://eeboo.oerc.ox.ac.uk/)

Slide5

How does it work?


Workflow of Publishing Structured Metadata

Available Metadata Fields

Title

Author Name

Date (Precise Birth, Precise Death, precise-floruit-from, precise-floruit-to)

Raw Publication Place

Raw Publication Date

Publisher

Slide6

Sample publisher data

Publisher

By Rycharde Iugge, printer to the Quenes Maiestie,

Printed by I[ohn] C[harlewood] for Iohn Hinde, dwelling in Paules Church-yarde, at the signe of the golden Hinde,

Printed by Benjamin Took and John Crook, and are to be sold by Mary Crook & Andrew Crook ...,

Printed by Peter Smith, and at Saint-Omer at the English College Press],

s.n.],

[By J. Charlewood] for Edward White, dwelling at the little North doore of Paules Church, at the signe of the Gunne,

Imprinted by Richard Field, and are to be sold by Richard Garbrand [, Oxford],

[By I. Jaggard?] for M. S[parke.,

Imprinted by E: G[riffin]: for Iohn Budge, and Ralph Mab,

By [J. King for?] Iohn waley dwellyng in Foster lane,

Slide7

Inside the data

Work

Printed By

Sold At

Printed For

Sold By

Printed At

? : . [ ] , [ ]? “” ,.

Slide8

Workflow

Slide9

Entity recognition approaches

NLTK Entity Extractor

Regular Expression

Slide10

ReVerb

For automatically identifying and extracting binary relationships from English sentences

Input: Raw text

Bananas are an excellent source of potassium

Output: Argument1, Relation Phrase, Argument2

(bananas, be source of, potassium)

Slide11

Open Calais

Less effective on short texts, e.g. Printed by A. Bells

Input text too short

Example Sentence:

Printed by Melchisedech Bradwood for William Aspley

Cannot detect either name as a person

Slide12

NLTK entity recognizer

Step 1: Extracted all the entities labeled as PERSON for each sentence

work_000001|Rycharde Iugge
work_000003|Pauls
work_000004|Iohn Charlewood
work_000004|Iohn Hinde
work_000005|Ioan Danter
work_000006|Francis Grove
work_000007|Henry Goddus
work_000008|Arthur Iohnson
work_000012|Leonard Lichfield
work_000013|Langly Curtis
work_000014|Benjamin Took
work_000014|John Crook
work_000014|Mary Crook
work_000014|Andrew Crook
work_000015|William Keblewhite

All the entities NLTK can extract for each record (with some limitations)

Slide13

Limitations of NLTK

NLTK does not identify initials as names, e.g. A. B.

Extracts only the surname in expressions like A. Bells, Edw: Allde

Identifies the word “Printer” as a name in sentences where it is capitalized and follows ‘by’, e.g. Printed by John Bill, Printer to the King's most Excellent Majesty

In complex sentences containing multiple names, it cannot reliably detect and extract all the names

Slide14

Finding relationships within sentences

("'Printed", 'JJ')
('by', 'IN')
(PERSON Benjamin/NNP Took/NNP)
('and', 'CC')
(PERSON John/NNP Crook/NNP)
('and', 'CC')
('are', 'VBP')
('to', 'TO')
('be', 'VB')
('sold', 'VBN')
('by', 'IN')
(PERSON Mary/NNP Crook/NNP)
('&', 'CC')
(PERSON Andrew/NNP Crook/NNP)

Look for preceding preposition

Separate the entities based on ‘by’ or ‘for’
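The two rules above can be sketched in plain Python. This is only an illustration, not the code used in the project: the `Person` class is a hypothetical stand-in for an NLTK PERSON chunk, the role labels are borrowed from the "Inside the data" slide, and treating "imprinted" like "printed" is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    """Hypothetical stand-in for an NLTK PERSON chunk."""
    name: str


# Verbs that open a relationship; mapping "imprinted" to "Printed" is an assumption.
VERBS = {"printed": "Printed", "imprinted": "Printed", "sold": "Sold"}
PREPOSITIONS = {"by", "for"}


def assign_roles(chunks):
    """Label each person with the verb + preposition that precedes it."""
    relations = []
    verb = None
    role = None
    for chunk in chunks:
        if isinstance(chunk, Person):
            if role is not None:
                relations.append((role, chunk.name))
            continue
        word, tag = chunk
        word = word.strip("'\"[]").lower()  # drop stray quotes and brackets
        if word in VERBS:
            verb = VERBS[word]
            role = None  # wait for the next 'by' or 'for'
        elif tag == "IN" and word in PREPOSITIONS and verb is not None:
            role = f"{verb} {word.title()}"
    return relations


# The tagged sentence from this slide, rewritten with the stand-in class.
chunks = [
    ("'Printed", "JJ"), ("by", "IN"),
    Person("Benjamin Took"), ("and", "CC"),
    Person("John Crook"), ("and", "CC"),
    ("are", "VBP"), ("to", "TO"), ("be", "VB"),
    ("sold", "VBN"), ("by", "IN"),
    Person("Mary Crook"), ("&", "CC"),
    Person("Andrew Crook"),
]

for role, name in assign_roles(chunks):
    print(role, "|", name)
```

Imprints that start with a bare "By ..." and no verb yield nothing in this sketch; they would need a separate rule.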

“You're having a hard time because it's hard. This is really not an easy task to approach.”

jonrsharpe, Jul 31 '14

Slide15

Data refining

Printed For

De-duplicate the ‘Sold by’

Put back the ones in ‘Printed and Sold by’
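One possible reading of these two refining steps, sketched with Python's `re` module. The pattern, the `refine` helper and the sample imprint string are assumptions made for illustration; the slides do not show the expression that was actually used.

```python
import re

# Rough heuristic: the name is everything after "printed and sold by"
# up to the next comma or semicolon.
PRINTED_AND_SOLD = re.compile(r"printed\s+and\s+sold\s+by\s+([^,;]+)", re.IGNORECASE)


def refine(imprint, printed_by, sold_by):
    """Return de-duplicated (printed_by, sold_by) name lists for one imprint."""
    match = PRINTED_AND_SOLD.search(imprint)
    if match:
        name = match.group(1).strip()
        # Put the name back into both relationships.
        printed_by = printed_by + [name]
        sold_by = sold_by + [name]
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(printed_by)), list(dict.fromkeys(sold_by))


# Made-up imprint string, used only to exercise the function.
printed, sold = refine("Printed and sold by Henri Hills, London", [], ["Henri Hills"])
print(printed, sold)
```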

Extracted separately using ‘Regex’

Slide16

Generating unique URI

Ideal case: Assign the same unique URI to every mention of the same person

Exception in this case:

Few authoritative sources to refer to

Time-consuming validation

Very limited information about each person available

Assigned unique URI to every instance
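A minimal sketch of minting one URI per instance with `uuid.uuid4()`. The base URI and the `mint_person_uri` helper are hypothetical; the slides do not give the namespace the project used.

```python
import uuid

# Hypothetical namespace; the project's real base URI is not given in the slides.
BASE_URI = "http://example.org/person/"


def mint_person_uri():
    """Return a new random URI; each call yields a different identifier."""
    return BASE_URI + str(uuid.uuid4())


# Every mention gets its own URI, even when the name string repeats.
mentions = ["John Crook", "John Crook", "Mary Crook"]
uris = [(name, mint_person_uri()) for name in mentions]

for name, uri in uris:
    print(name, uri)
```

Because the identifiers are random, two mentions of the same person are not linked automatically; reconciling them would still depend on the validation work described above.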

Python uuid module – uuid4() function

Slide17

Working with ontology

Checked existing ontologies for ‘Printed by’ and ‘Printed for’ relationships: MODS, MADS, BIBFRAME, etc.

EEBOO Ontology

Modify the existing ontology to define the new relationships

Work

Author

Printed By

Printed For

Sold By

Slide18

Storing triples and generating RDF

Slide19

Querying on the data 1

Top 20 Publishers

Top 20 Printed for

Top 20 Sold By

Slide20

Querying on the data 2

The sellers for the works published by Henri Hills

Both Printed and Sold by Henri Hills
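The slides do not show the queries themselves. As a rough stand-in, the two lookups can be expressed over plain (work, relation, person) tuples in Python. The relation names `printedBy` and `soldBy`, the work IDs and the placeholder names are all invented for this sketch and are not taken from the dataset.

```python
# Toy triples with placeholder names; none of these come from the real data.
triples = [
    ("work_a", "printedBy", "Printer A"),
    ("work_a", "soldBy", "Seller X"),
    ("work_b", "printedBy", "Printer A"),
    ("work_b", "soldBy", "Printer A"),
    ("work_c", "printedBy", "Printer B"),
    ("work_c", "soldBy", "Seller Y"),
]


def sellers_for_printer(triples, printer):
    """Sellers attached to any work printed by the given person."""
    works = {w for w, rel, p in triples if rel == "printedBy" and p == printer}
    return sorted({p for w, rel, p in triples if rel == "soldBy" and w in works})


def printed_and_sold_by(triples, person):
    """Works that the same person both printed and sold."""
    printed = {w for w, rel, p in triples if rel == "printedBy" and p == person}
    sold = {w for w, rel, p in triples if rel == "soldBy" and p == person}
    return sorted(printed & sold)


print(sellers_for_printer(triples, "Printer A"))
print(printed_and_sold_by(triples, "Printer A"))
```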

Sellers who worked with Henri Hills: Will Larner, Jane Underhill, Francis Smith

Slide21

Future direction

Train NLTK to capture the names properly

Extract specific place names from the publisher field, e.g. sold at Golden Hinde

In case of initials, figure out how to identify the names, e.g. whether R. Charles is Robert Charles or Ruth Charles; this may require help from a domain expert

Analyze how name expressions have changed over time

Identify the authors using authoritative sources and domain-specific knowledge, e.g. London Book Trades Index, British Book Trade Index

Analyze and visualize the data by mapping

Slide22

Gratitude

Terhi Nurmikko-Fuller

David M. Weigl

Professor David De Roure

Kevin Page

Pip Willcox

And everybody else at OeRC!