Welsh Natural Language Toolkit

Uploaded On 2020-08-27

Presentation Transcript

Slide1

Welsh Natural Language Toolkit

Daniel Williams

Slide2

Overview

Quick overview of WNLT (version 1)
WNLT accessibility
CymrIE for Twitter
Improvements to WNLT
Experiment using WNLT

Slide3

Welsh Natural Language Toolkit

Open source software for Welsh NLP

CymrIE
Tokenisation
Lemmatisation
Part of Speech
Named Entity Recognition (NER)
Person, Location, Organization, Money, Percent, Date and Address annotations

Slide4

GATE Developer (GUI)

*Annotations produced by WNLT version 2.3.2

Slide5

WNLT GUI

Slide6

WNLT CLI (Command Line Interface)

Slide7

WNLT API (Application Programming Interface)

Slide8

WNLT Accessibility

GATE Developer

GUI

Easier for non-technical users
Quick results
No configuration / setup needed

CLI

Scripts

Helps automation

API

Software developers

GATE Embedded developers

Integration with existing web, desktop and mobile applications

Slide9

Twitter

Facilitates sentiment analysis and NLP of Welsh tweets

Noisy
Twitter-specific metadata
Twitter URLs, usernames and hashtags

Slide10

TwitIE Processing Resources

Slide11

Annotation Set Transfer

{
"text": "Os chi ddim yn rhy siwr am rywbeth yn ymwneud a'ch canlyniadau #LefelA mae tim @ucas_online ar gal i'ch helpu http://t.co/pZqVSlcJrn…",
"truncated": true,
"in_reply_to_user_id": null,
"in_reply_to_status_id": null,
"favorited": false,
"source": "<a href=\"http://twitter.com/\" rel=\"nofollow\">Twitter for iPhone</a>",
"in_reply_to_screen_name": null,
"in_reply_to_status_id_str": null,
"id_str": "54691802283900928",
"entities": { "user_mentions": [ { "indices": [ 3, 19 ], "screen_name": "PostGradProblem", "id_str": "271572434", "name": "PostGradProblems", "id": 271572434 …

Os chi ddim yn rhy siwr am rywbeth yn ymwneud a'ch canlyniadau #LefelA mae tim @ucas_online ar gal i'ch helpu http://t.co/pZqVSlcJrn…
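
The slide appears to show a raw tweet payload being reduced to the plain text that the pipeline then annotates. A minimal sketch of that reduction in plain Python follows. It is not GATE code: the function name `tweet_to_document` and the choice of which metadata to keep are assumptions made for illustration, and the payload is a trimmed copy of the fields visible on the slide.

```python
import json

# Trimmed copy of fields visible on the slide; a real payload has many more keys.
RAW_TWEET = """
{
  "text": "Os chi ddim yn rhy siwr am rywbeth yn ymwneud a'ch canlyniadau #LefelA mae tim @ucas_online ar gal i'ch helpu http://t.co/pZqVSlcJrn",
  "truncated": true,
  "favorited": false,
  "id_str": "54691802283900928"
}
"""


def tweet_to_document(raw_json: str) -> dict:
    """Hypothetical helper: keep the tweet text and a little metadata."""
    tweet = json.loads(raw_json)
    return {
        "text": tweet["text"],
        "features": {
            "id_str": tweet.get("id_str"),
            "truncated": tweet.get("truncated", False),
        },
    }


doc = tweet_to_document(RAW_TWEET)
print(doc["text"])
```

Keeping the identifier and the `truncated` flag as features means later steps can still tell which tweets were cut short.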

Slide12

Tweet Language Identification

             English  French   German   Dutch    Spanish  Welsh    Average
Accuracy     95.20%   96.36%   95.69%   97.34%   97.02%   99.07%   96.78%
Precision    83.00%   93.59%   80.99%   89.59%   93.97%   99.38%   90.09%
Recall       85.46%   84.25%   92.84%   91.32%   88.40%   96.40%   89.78%
NPV          97.43%   96.86%   98.72%   98.60%   97.62%   98.98%   98.03%
Specificity  96.92%   98.82%   96.19%   98.30%   98.82%   99.83%   98.15%
F1 Score     84.21%   88.67%   86.51%   90.45%   91.10%   97.87%   89.80%

~5,000 tweets
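
For readers unfamiliar with the six measures in the table, the sketch below computes them from one-vs-rest confusion counts for a single language. The counts are made up for illustration and are not from the WNLT evaluation; the names `Confusion` and `language_id_metrics` are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int  # tweets of the language, labelled as that language
    fp: int  # tweets of other languages, labelled as that language
    fn: int  # tweets of the language, labelled as something else
    tn: int  # tweets of other languages, labelled as something else


def language_id_metrics(c: Confusion) -> dict:
    """Standard binary-classification measures for one language vs the rest."""
    precision = c.tp / (c.tp + c.fp)
    recall = c.tp / (c.tp + c.fn)
    return {
        "accuracy": (c.tp + c.tn) / (c.tp + c.fp + c.fn + c.tn),
        "precision": precision,
        "recall": recall,
        "npv": c.tn / (c.tn + c.fn),          # negative predictive value
        "specificity": c.tn / (c.tn + c.fp),  # true negative rate
        "f1": 2 * precision * recall / (precision + recall),
    }


# Illustrative counts only.
scores = language_id_metrics(Confusion(tp=90, fp=10, fn=30, tn=870))
for name, value in scores.items():
    print(f"{name}: {value:.2%}")
```

The F1 formula can be checked against the table itself: with the English precision of 83.00% and recall of 85.46%, 2PR / (P + R) gives 84.21%, the value reported in the F1 row.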

Slide13

Emoticons Gazetteer & Hashtag Tokenizer

=) 8) … -> :)

#ColegauCymru
Camel case
Eurfa Dictionary
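
The two ideas on this slide can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the WNLT implementation: the emoticon table contains only the two variants shown on the slide, the function names are hypothetical, and the dictionary-based splitting that the Eurfa reference presumably supports (for hashtags without camel case) is not covered.

```python
# Variants shown on the slide, mapped to one canonical form.
EMOTICON_CANONICAL = {"=)": ":)", "8)": ":)"}


def normalise_emoticon(token: str) -> str:
    """Map a known emoticon variant to its canonical form, else return it unchanged."""
    return EMOTICON_CANONICAL.get(token, token)


def split_camel_case_hashtag(hashtag: str) -> list:
    """Split a hashtag at each lower-to-upper case boundary."""
    body = hashtag.lstrip("#")
    parts, current = [], ""
    for ch in body:
        # Start a new word when an uppercase letter follows a non-uppercase one.
        if ch.isupper() and current and not current[-1].isupper():
            parts.append(current)
            current = ch
        else:
            current += ch
    if current:
        parts.append(current)
    return parts


print(normalise_emoticon("=)"))
print(split_camel_case_hashtag("#ColegauCymru"))
```

Using `str.isupper` rather than an ASCII character class keeps the split working for accented Welsh letters such as ŵ and ŷ.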

Slide14

UserID, Hashtag and URL
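
As a rough illustration of what recognising these three Twitter-specific token types involves, the sketch below uses simple regular expressions on the example tweet from the earlier slide. The patterns are deliberately loose assumptions made for illustration; they are not the rules used by TwitIE or WNLT.

```python
import re

# Loose illustrative patterns, not the toolkit's own rules.
PATTERNS = {
    "UserID": re.compile(r"@\w+"),
    "Hashtag": re.compile(r"#\w+"),
    "URL": re.compile(r"https?://\S+"),
}


def find_twitter_entities(text: str) -> list:
    """Return (start, end, type, surface) tuples sorted by position."""
    found = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.end(), kind, match.group()))
    return sorted(found)


tweet = ("Os chi ddim yn rhy siwr am rywbeth yn ymwneud a'ch canlyniadau "
         "#LefelA mae tim @ucas_online ar gal i'ch helpu http://t.co/pZqVSlcJrn")
entities = find_twitter_entities(tweet)
for start, end, kind, surface in entities:
    print(kind, start, end, surface)
```

Returning character offsets alongside the type mirrors how annotations are usually stored: as spans over the original text rather than as extracted strings.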

Slide15

Performance evaluation

Annotation      Precision  Recall  F-Measure
Token           87.7%      99.9%   93.4%
Token.category  71.4%      81.3%   76.0%
Token.lemma     69.7%      79.4%   74.2%

VBD, VBDP, VBDI, VDI, VBF -> VB
NNS, NNP, NNPS, NNM, NNF -> NN
JJR, JJS -> JJ

~2,200 Tokens
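
The three mappings above read as a collapse of fine-grained part-of-speech tags into coarse ones, presumably applied before tags are compared. A minimal sketch of that collapse, using exactly the tags listed on the slide, might look like this. The names `collapse_tag` and `categories_match` are hypothetical, and how WNLT applies the mapping during scoring is an assumption.

```python
# Coarse tag -> fine-grained tags, copied from the slide.
COARSE_TO_FINE = {
    "VB": ("VBD", "VBDP", "VBDI", "VDI", "VBF"),
    "NN": ("NNS", "NNP", "NNPS", "NNM", "NNF"),
    "JJ": ("JJR", "JJS"),
}

# Inverted lookup: fine-grained tag -> coarse tag.
FINE_TO_COARSE = {
    fine: coarse
    for coarse, fines in COARSE_TO_FINE.items()
    for fine in fines
}


def collapse_tag(tag: str) -> str:
    """Return the coarse tag for a listed fine-grained tag, else the tag unchanged."""
    return FINE_TO_COARSE.get(tag, tag)


def categories_match(gold_tag: str, predicted_tag: str) -> bool:
    """Compare two tags after collapsing both to their coarse form."""
    return collapse_tag(gold_tag) == collapse_tag(predicted_tag)


print(collapse_tag("NNF"))
print(categories_match("VBD", "VBF"))
```

Tags that do not appear in the mapping pass through unchanged, so the collapse only relaxes the comparison for the verb, noun and adjective subtypes listed.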

Slide16

Twitter API

Slide17

Twitter CymrIE in GUI

Slide18

Improvements for NER

Added more gazetteers
Obtained from Y Lolfa Cyf
Special events
Welsh chapels
Choirs
Papurau-bro
Welsh magazines
Person names

Slide19

Improvements

Improved JAPE rules for identifying Date, Person and Location annotations

Interfaces in Welsh
WNLT GUI
WNLT CLI
Welsh user guide

Slide20

GUI in Welsh

Slide21

CLI in Welsh

Slide22

CymrIE NER Experiment

~7 heuristics for Person role

~18 heuristics for all annotation types

Papurau-bro newsletters

Find:

Person role

Location kind

Date kind

Slide23

CymrIE NER Experiment

~7 heuristics

Papurau-bro newsletters

Labels on the annotated example: BRIDE, GROOM, SERVICE, SERVICE, MOTHER & FATHER OF BRIDE

Slide24

Wedding NER

Annotation        Precision  Recall
Bride             100%       67.7%
BrideFather       91.7%      52.4%
BrideMother       100%       54.2%
Groom             95.2%      62.5%
GroomFather       91.7%      57.9%
GroomMother       90%        47.4%
Service Location  66.7%      58.3%
Service Date      96.3%      89.7%

* 32 news articles were used in the gold standard

Slide25

Future work

Add co-referencing processing resource to CymrIE

Collaborate with users to further develop CymrIE’s NER capabilities

Further develop the NER pipeline for Wedding announcements:

More heuristics for identifying Person role, Location kind and Date kind

Machine learning

Ensemble methods

Slide26

Acknowledgements

Welsh Language Unit, Welsh Government
Gareth Morlais for further funding of the WNLT project

Department of Information Studies – University College London
Andreas Vlachidis for his help creating the Gold Standards and evaluation

School of Welsh – Cardiff University
Benjamin Screen for his help creating the Twitter Gold Standard

Y Lolfa
Y Lolfa for supplying many of the gazetteers

http://techiaith.cymru – Bangor University
Resource for Welsh tweets

Slide27

Welsh Natural Language Toolkit

Daniel Williams
Dr. Daniel Cunliffe
Prof. Douglas Tudhope

Hypermedia research unit
http://hypermedia.research.southwales.ac.uk/kos/wnlt/

Sourceforge
https://sourceforge.net/projects/wnlt-project/

Slide28

Extra slides

Slide29

WNLT 1 – CymrIE Performance Evaluation

Gold Standard
2221 Tokens
230 Entities (Date, Location, Organization, Percent, Person)

Results
Tokenizer: Recall 99%, Precision 98%, F1 99%
POS: Recall 82%, Precision 81%, F1 81%
Lemma: Recall 80%, Precision 79%, F1 80%
NER: Recall 89%, Precision 86%, F1 87%

*Partial matches are weighted as "half-matches" (average mode)
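
The footnote on half-matches can be made concrete with a short sketch. It assumes the usual formulation in which a partially overlapping annotation counts as half a correct match in both precision and recall; the function name and the counts are hypothetical and are not taken from the WNLT evaluation.

```python
def average_mode_scores(correct: int, partial: int, missing: int, spurious: int):
    """Precision, recall and F1 with each partial match weighted as 0.5."""
    matched = correct + 0.5 * partial
    response_total = correct + partial + spurious  # annotations the system produced
    key_total = correct + partial + missing        # annotations in the gold standard
    precision = matched / response_total if response_total else 0.0
    recall = matched / key_total if key_total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts only.
p, r, f1 = average_mode_scores(correct=8, partial=2, missing=2, spurious=0)
print(f"precision={p:.2%} recall={r:.2%} f1={f1:.2%}")
```

Under this weighting the score sits midway between a strict evaluation, where partial matches count as errors, and a lenient one, where they count as fully correct.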