Slide1
Veni, vidi, CLARIN!
Darja Fišer
DARIAH Day @ UZH
Zurich, 18 December 2017
CC-BY 4.0
Slide2
Overview
Intro to CLARIN
CLARIN data architecture
CLARIN for data science
Slide3
Intro to CLARIN
Slide4
CLARIN in seven bullets
CLARIN is the Common Language Resources and Technology Infrastructure
ESFRI ERIC status since 2012, Landmark since 2016
that provides easy and sustainable access for scholars in the humanities and social sciences and beyond
to digital language data (in written, spoken, video or multimodal form)
and advanced tools to discover, explore, exploit, annotate, analyse or combine them, wherever they are located
through a single sign-on environment
and that serves as an ecosystem for knowledge sharing.
Slide5
CLARIN ERIC in members and centres
A consortium of:
19 members: AT, BG, CZ, DE, DK, DLU, EE, FI, GR, HU, IT, LT, LV, NL, NO, PL, PT, SE, SI
2 observers: FR, UK
>40 centres
Slide6
What CLARIN Centres offer
Repository
  library of linguistic data and tools
  search for data and tools and easily use them online or download them
  deposit your data and be sure it is safely stored, everyone can find it, and correctly cite it
Federated single sign-on
  log in once with your existing institutional credentials
  get access to protected resources
Metadata
  describe content, provenance and formats of linguistic data and tools
  facilitate preservation and dissemination of linguistic data and tools
Persistent Identifier (PID or handle)
  a special permanent URL that provides a permanent link to linguistic data and tools
  will resolve correctly even if in some distant future the data is moved
  should be used as URL in citations
Licensing
  Public
  Academic
  Restricted
Preservation (Data Seal of Approval)
  committed to long-term care of items in the repository
  ensure the archived data can be found, understood and used in the future
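As an illustration of the persistent identifier item on this slide, the sketch below turns a handle into the URL form that would be used in a citation. It assumes the identifier is a Handle System handle resolved through the public proxy at hdl.handle.net; the function name `handle_to_url` and the handle `12345/example-corpus` are made up for this example and do not refer to a real CLARIN resource.

```python
# Public Handle.Net proxy; a handle PREFIX/SUFFIX is resolved by appending it.
HANDLE_PROXY = "https://hdl.handle.net/"


def handle_to_url(pid: str) -> str:
    """Return the citable URL for a handle (hypothetical helper).

    Accepts 'PREFIX/SUFFIX', 'hdl:PREFIX/SUFFIX', or an existing
    hdl.handle.net URL, and normalises all of them to one form.
    """
    pid = pid.strip()
    # Strip any known scheme so only PREFIX/SUFFIX remains.
    for scheme in ("https://hdl.handle.net/", "http://hdl.handle.net/", "hdl:"):
        if pid.lower().startswith(scheme):
            pid = pid[len(scheme):]
            break
    prefix, sep, suffix = pid.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError(f"not a handle of the form PREFIX/SUFFIX: {pid!r}")
    return f"{HANDLE_PROXY}{prefix}/{suffix}"


# Placeholder handle, not a real resource.
citation_url = handle_to_url("hdl:12345/example-corpus")
```

The point of citing the proxy URL rather than the repository's current address is the one made on the slide: the link keeps resolving if the data is moved, because only the handle record needs updating.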
Slide7
CLARIN data types and user communities
Data types:
  Newspaper archives
  Literary texts
  Parliamentary records
  Historical letters
  Broadcast archives
  Oral History data
  Social Media data
  …
User communities:
  Digital humanities
  Linguistics and Philology
  Translation and Lexicography
  Literary Studies
  History
  Political and Social Sciences
  Media Studies
  Culture, Folklore, Anthropology
  Speech therapy
  Teachers
  General Public
Slide8
CLARIN data architecture
Slide9
Repositories
* slides by Dieter Van Uytvanck
Slide10
Harvesting
Slide11
Processing
Slide12
Content search
Slide13
Workflows
Slide14
CLARIN for data science
Slide15
CLARIN and data science (1)
Text and speech as social and cultural data
Contribution to the development of new methodological frameworks for the integrated processing of multiple data types, and multidisciplinary research agendas
Europe’s multilinguality as a basis for comparative research of societal and cultural phenomena that are reflected in language use:
  Migration patterns
  Intellectual history
  Language variation across period and region
  Dynamics in mental health conditions
  Parliamentary discourse
Slide16
Parliamentary records
great potential for reuse and re-purposing within many fields of study in the humanities and social sciences (and beyond):
  suited for both close reading and distant reading
  Humanities: history, language change, discourse analysis …
  Social sciences: social and cultural dynamics, political sciences, economics ...
considered a rich data type
  apart from linguistic content, rich in metadata (speaker, party affiliation, age, sex, education, origin, duration of speech)
  apart from linguistic content, rich in extralinguistic clues (interruptions, voting results)
made easily available under the Freedom of Information acts in over 100 countries all around the world to enable informed participation by the public and improve effective functioning of democratic systems
but also
  often presenting itself as messy or noisy data calling for links with data in other modalities than text and speech
  created under specific circumstances that need to be well understood before strong conclusions can be drawn
Slide17
Corpora of parliamentary records
Coverage
  exist for 18 countries
Size (in tokens)
  largest: UK (1.6 billion)
  smallest: Portuguese (1 million)
Periods covered by the corpus
  mostly 2nd half of 20th century and 21st century, Dutch and British corpora from early 19th century
Availability
  For download (7)
    at, cz [CPM], dk, de [sample only], no [ToN], pt, lv
  For on-line searching (7)
    Finnish (KORP)
    CzechParl (SketchEngine)
    Latvian (noSketchEngine)
    Bulgarian (CLaRK)
    Hungarian (HNC, registration required)
    Proceedings of Norwegian Parliamentary Debates (Corpuscle)
  Both for download and on-line searching (5)
    Dutch (Political Mashup)
    Estonian (Keeleveeb)
    Swedish (KORP)
    Slovenian (noSketchEngine)
    Polish (NKJP)
Full overview available here
Slide18
CLARIN’s Parliamentary data for many disciplines
Perspective of curators and researchers:
  Historical perspective: the specifics of the diachronic perspective; time dynamics per topic, etc.
  Political science perspective: political activity of parties and politicians; the role of the various public political bodies; policy comparison; language differences as indicators of differing political views, etc.
  Sociological perspective: conflicts in parliament; attitudes of politicians to critical issues; trending topics; patterns of language use reflecting societal dynamics, models of parliamentary communication, control, commissions, etc.
  Psychological and language perspective: language portraits of politicians; semantic differences of political terms; gestures; behavior in parliament, etc.
Developers' perspective:
  Design of parliamentary speech corpora: annotations, visualization, etc.
  Text analytics, semantic processing and linking of parliamentary data
  Searches and information extraction from parliamentary corpora
  Multilinguality issues in parliamentary data
Slide19
Slide20
ParlaCLARIN @ LREC 2018
Background
  Need for better harmonization, interoperability and comparability of the resources and tools relevant for the study of parliamentary discussions and decisions, not only in Europe but worldwide
Aim
  Bring together researchers interested in compiling, annotating, structuring, linking and visualising parliamentary records that are suitable for research in a wide range of disciplines in the Humanities and Social Sciences
Paper submission deadline
  10 January 2018
More info
  https://www.clarin.eu/ParlaCLARIN
Slide21
Veni, vidi, CLARIN!
darja.fiser@ff.uni-lj.si