Slide1
N. Calzolari
1
2nd KYOTO Workshop, Gifu, Japan, January 2011
Nicoletta Calzolari
Istituto di Linguistica Computazionale – CNR – Pisa
glottolo@ilc.cnr.it
The Future of KYOTO
… with some historical notes to show a path along an evolving vision
in today's EU context: META-SHARE, ...
Slide2
Why are such needed LRs lacking after 30 years of R&D in the field?
1) Because the main trend until the mid-’80s was to privilege the processing of so-called “critical” phenomena, studied by the dominating linguistic theories, rather than focusing on the deep analysis of the real uses of a language
As a result, CL was focusing on:
• few examples, often artificially built
• lexicons made of few entries (toy lexicons)
• grammars with poor coverage
2) Because large-scale LRs are costly & their production requires a big organizing effort
Old slide with Antonio Zampolli (’80s/early ’90s)
Slide3
… back from the early ’80s
Historical notes
It became evident that:
Part of the results of meaning extraction, e.g. many meaning distinctions, which could be generalised over lexicographic definitions and automatically captured, were unmanageable at the formal representation level, and had to be blurred into unique features and values
Unfortunately, it is still today difficult to constrain word-meanings within a rigorously defined organization: by their very nature they tend to evade any strict boundaries
Automatic acquisition of lexical information from MRDs
Was my first research & became central in the Pisa group (ACQUILEX)
And also Amsler, Briscoe, Boguraev, Wilks’ group, IBM, then Japanese groups, …
The trend was: “large-scale computational methods for the transformation of machine readable dictionaries into machine tractable dictionaries”
Instead of relying on linguists’ introspection
Pioneering Research
Slide4
… back from the late ’80s
Historical notes
After acquisition from MRDs, automatic acquisition of info from texts:
This trend has today become a consolidated & pervasive fact
From acquisition of “linguistic information”
To acquisition of “general knowledge”, with more data-intensive, robust, reliable methods
Slide5
Looking into the past
All started with the situation we had in the late ’80s – early ’90s
With all the Xxx-LEX projects: MultiLex, GeneLex, AcquiLex, Xxx-Lex
A. Zampolli: Let’s be coherent: Xxx-Lex
After the “Grosseto Workshop” (1985): a turning point
EAGLES, ISLE: Standards, Best Practices, ...
Slide6
ISO LMF: Lexical Markup Framework
Structural skeleton, with the basic hierarchy of information in a lexical entry + various extensions
Modular framework
LMF specs comply with UML modelling principles
An XML DTD allows implementation
Builds on EAGLES/ISLE
The field is mature
Diagram labels: NEDO Asian Languages; NICT Language-Grid Service Ontology; ICT KYOTO; LIRICS; LexInfo; New initiatives …
Slide7
KYOTO
A search environment using semantic technologies
A “compass” for the Web 2.0
Interdisciplinarity: scientific community (LRT, web technologies, knowledge engineers), companies, domain experts
Multilingualism: 7 languages (2 Asian languages)
Kyoto Core System is open & free
Slide8
KYOTO Annotation Format (KAF)
Multi-level Annotation Format
• stand-off annotation
• uniform representation for 7 languages
Shared through the languages
• Text: tokenisation, sentences, paragraphs with reference to the sources
• Terms: words & multi-words, parts-of-speech, etc.
• Chunks: constituents & syntagmatic realization
• Dependencies: grammatical functions
L1 – Semantic modules: Multiword tagging, Sense Tagging, Named Entity Recognition, OntoTagging
L2 – Semantic module: event/fact extraction
from Piek Vossen
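The stand-off principle behind the layers listed on this slide can be illustrated with a minimal Python sketch. It is not the actual KAF schema: the class names, field names and identifiers (`WordForm`, `Term`, `w1`, `t1`, the example sentence) are hypothetical and chosen only to show how a higher layer refers to a lower one by identifier instead of copying the text.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class WordForm:
    """One token of the text layer, anchored to the source by character offset."""
    wid: str
    text: str
    offset: int


@dataclass
class Term:
    """One entry of the term layer; it points at word forms instead of copying them."""
    tid: str
    lemma: str
    pos: str
    span: List[str]  # ids of the word forms this term covers


def build_text_layer(raw: str) -> List[WordForm]:
    """Whitespace tokenisation that records where each token starts in the source."""
    forms = []
    cursor = 0
    for i, token in enumerate(raw.split(), start=1):
        start = raw.index(token, cursor)
        forms.append(WordForm(f"w{i}", token, start))
        cursor = start + len(token)
    return forms


def surface(term: Term, forms_by_id: Dict[str, WordForm]) -> str:
    """Recover the surface string of a term by following its references."""
    return " ".join(forms_by_id[wid].text for wid in term.span)


# Hypothetical example sentence and annotations.
raw = "Water pollution threatens wetlands"
forms = build_text_layer(raw)
by_id = {form.wid: form for form in forms}
terms = [
    Term("t1", "water pollution", "N", ["w1", "w2"]),  # a multi-word term
    Term("t2", "threaten", "V", ["w3"]),
]
for term in terms:
    print(term.tid, term.lemma, "->", surface(term, by_id))
```

Because each layer only stores references, further layers (chunks, dependencies, semantic tags) could be added in the same way without touching the text layer, which is the property that makes a uniform representation across languages practical.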
Slide9
KYOTO System & Adoption of Standards
Architecture diagram, from Piek Vossen. Labels: Source Documents; Linear MAF/SYNAF; Linear SEMAF; Generic TMF; Generic FACTAF; Term extraction (Tybot); Semantic annotation; Fact extraction (Kybot); Domain editing (Wikyoto); Wordnet, Domain Wordnet (LMF API); Ontology, Domain ontology (OWL API); Concept User; Fact User
Could be at the basis of a new standard?
Slide10
A common representation format for WordNets
Seven WordNets: similar but not identical, which hampered interoperability
Endow WordNet with a representation format allowing easy access, integration & interoperability among resources
• to be accessed both intra- and inter-linguistically
• to support easier integration
Diagram labels: WnIT, WnEN, WnEU, WnNL, WnJP, WnCH, WnES
Slide11
A common representation format: WordNet-LMF
Class diagram, from Monica Monachini. Classes: LexicalResource, GlobalInformation, Lexicon, LexicalEntry, Lemma, Sense, Synset, Definition, Statement, SynsetRelations, SynsetRelation, MonolingualExternalRefs, MonolingualExternalRef, SenseAxes, SenseAxis, InterlingualExternalRefs, InterlingualExternalRef, Meta. The associations carry cardinalities (1..1, 1..*, 0..1, 0..*) whose placement cannot be recovered from the extracted text.
Data Categories
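A rough Python sketch may help to picture how the classes named in the WordNet-LMF diagram nest. Only the class names come from the slide; the attribute names, the reduced set of classes and the containment shown here are assumptions made for illustration, and the cardinalities of the real model are not reproduced. The Italian synset identifier and lemmas are taken from the cross-lingual example later in the deck.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Sense:
    id: str
    synset: str  # id of the Synset this sense belongs to (assumed attribute name)


@dataclass
class LexicalEntry:
    id: str
    lemma: str
    senses: List[Sense] = field(default_factory=list)


@dataclass
class SynsetRelation:
    target: str    # id of the related synset
    rel_type: str  # name of the semantic relation


@dataclass
class Synset:
    id: str
    definition: Optional[str] = None
    relations: List[SynsetRelation] = field(default_factory=list)


@dataclass
class Lexicon:
    language: str
    entries: List[LexicalEntry] = field(default_factory=list)
    synsets: List[Synset] = field(default_factory=list)


@dataclass
class LexicalResource:
    lexicons: List[Lexicon] = field(default_factory=list)


def lemmas_of(lexicon: Lexicon, synset_id: str) -> List[str]:
    """Collect the lemmas whose senses point at the given synset."""
    return sorted(
        entry.lemma
        for entry in lexicon.entries
        for sense in entry.senses
        if sense.synset == synset_id
    )


# Entry and sense ids below are hypothetical.
iwn = Lexicon(
    language="it",
    entries=[
        LexicalEntry("e1", "fuoco", [Sense("fuoco_1", "00001251-n")]),
        LexicalEntry("e2", "fiamma", [Sense("fiamma_1", "00001251-n")]),
    ],
    synsets=[Synset("00001251-n")],
)
resource = LexicalResource(lexicons=[iwn])
print(lemmas_of(iwn, "00001251-n"))
```

Keeping senses as references to synset ids, rather than nesting synsets inside entries, mirrors the separation between LexicalEntry and Synset in the diagram and lets several lemmas share one synset.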
Slide12
Towards a Centralized WordNet DC Registry
A list of 85 semantic relations as a result of a mapping of the KYOTO WordNet grid
Inter-WN
Intra-WN
Slide13
WordNet-LMF Multilingual level - Cross-lingual Relations
SWN <fuego_3, llama_1> 09686541-n
IWN <fuoco_1, fiamma_1> 00001251-n
WN3.0 <fire_1 flame_1 flaming_1> 13480848-n
<!ELEMENT SenseAxes (SenseAxis+)>
<!ELEMENT SenseAxis (Meta?, Target+, InterlingualExternalRefs?)>
<!ATTLIST SenseAxis
id ID #REQUIRED
relType CDATA #REQUIRED>
<!ELEMENT Target EMPTY>
<!ATTLIST Target
ID CDATA #REQUIRED>
<!ELEMENT InterlingualExternalRefs (InterlingualExternalRef+)>
<!ELEMENT InterlingualExternalRef (Meta?)>
<!ATTLIST InterlingualExternalRef
externalSystem CDATA #REQUIRED
externalReference CDATA #REQUIRED
relType (at|plus|equal) #IMPLIED>
Callouts on the slide:
• groups monolingual synsets corresponding to each other and sharing the same relations to English
• link to ontology/(ies)
• specifies the type of correspondence
from Monica Monachini
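The DTD fragment on this slide can be exercised with a short Python sketch that builds one SenseAxis element using the standard `xml.etree.ElementTree` module. Element and attribute names follow the DTD as shown; everything else is an assumption for illustration: the helper `make_sense_axis`, the axis id `sa_1`, the `relType` value `eq_synonym`, the prefixes on the target ids and the ontology name `ExampleOntology` are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Values the DTD allows for relType on InterlingualExternalRef.
ALLOWED_REF_TYPES = {"at", "plus", "equal"}


def make_sense_axis(axis_id, rel_type, target_ids, external_refs=()):
    """Build a SenseAxis element: Target+ followed by optional external references."""
    if not target_ids:
        raise ValueError("the DTD requires at least one Target")
    axis = ET.Element("SenseAxis", {"id": axis_id, "relType": rel_type})
    for target_id in target_ids:
        ET.SubElement(axis, "Target", {"ID": target_id})
    if external_refs:
        refs = ET.SubElement(axis, "InterlingualExternalRefs")
        for system, reference, ref_type in external_refs:
            attrs = {"externalSystem": system, "externalReference": reference}
            if ref_type is not None:  # relType is #IMPLIED, so it may be omitted
                if ref_type not in ALLOWED_REF_TYPES:
                    raise ValueError(f"relType must be one of {sorted(ALLOWED_REF_TYPES)}")
                attrs["relType"] = ref_type
            ET.SubElement(refs, "InterlingualExternalRef", attrs)
    return axis


axes = ET.Element("SenseAxes")
axes.append(
    make_sense_axis(
        "sa_1",        # hypothetical axis id
        "eq_synonym",  # hypothetical relation type
        # Synset numbers from the slide; the prefixes are hypothetical.
        ["SWN-09686541-n", "IWN-00001251-n", "WN30-13480848-n"],
        [("ExampleOntology", "Fire", "equal")],  # hypothetical ontology link
    )
)
print(ET.tostring(axes, encoding="unicode"))
```

The sketch only enforces the two constraints that are easy to read off the DTD (at least one Target, and the closed value list for `relType` on external references); full validation would need a DTD-aware parser.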
Slide14
Complex picture!
Is there anything we need to do for Interoperability?
Work within ISO:
LMF: abstract meta-model for lexical representation
Ontology Group or more Groups?
Language Resource Ontologies: ontology of data categories
Real life:
Lexicons (e.g. WordNets) that are called Ontologies
Lexicons linked to Ontologies: to be used in applications, in multilingual systems, domains, …
Work on “ontologising” Lexicons: to allow exploiting various relations, to make inferences, …
Semantic Lexicons, with many types of relations among semantic units: these are often of “conceptual/world-knowledge” nature. Do we want DCs for these?
ISO SC 4/WG 4 – Lexicon-Ontology relations
New work item: PWI 24622
KYOTO can contribute
Slide15
To explore the need of doing something within ISO about the relations between Lexicon and Ontology
Do we/ISO need to address another (lexical) layer?
How lexicons and ontologies are linked and information mapped from one to the other
The ontological layer in a/connected to a lexicon
Possible issues/questions:
Is LMF enough to represent Ontological links? How to connect work being done in ISO Lexical group and ISO Ontology groups?
Lexicon and Ontologies: separation? or lexicalised ontologies? or ontologies lexicons?
Lexicon, Ontologies and Domains
On a very different dimension: Ontology of lexical/semantic/conceptual categories? Standardised semantic categories, ontology labels?
Relation to multilinguality ...
KYOTO can contribute
Slide16
Input to Multilingual Web
http://www.multilingualweb.eu/
The MultilingualWeb project is exploring standards and best practices that support the creation, localization and use of multilingual web-based information
It aims to raise the visibility of existing best practices and standards and identify gaps
The core vehicle for this is a series of four workshops, for networking across communities that span the various aspects involved
Next workshop on best practices aimed at development of Content for the Web, including creation of content ranging from personal authoring for blogs and social networking sites to development of large corporate or organizational enterprises: “Content on the Multilingual Web”
4-5 April 2011, Pisa, Italy
KYOTO can contribute
Slide17
A new paradigm of R&D in LRs & LT
For a few years now:
Adopting the paradigm of accumulation of knowledge, so successful in more mature disciplines, based on sharing LRs, tools & results
Ability to build on each other's achievements, allowing controlled & effective cooperation of many groups on common tasks (see Human Genome Project)
e.g. initiatives to achieve international consensus on annotation guidelines
Slide18
Some steps for a “new generation” of LRs
Slide19
Lexical WEB
Diagram labels: ComLex; SIMPLE; WordNets; FrameNet; NomLex; BioLexicon; Lex_x; Lex_y; SIMPLE-WEB; Global WordNet GRID
LMF with intelligent agents
Standards for Content Interoperability
Standards: Enough??
Slide20
(Distributed) Language Services
A scenario implying:
Enabling:
Can KYOTO contribute?
Slide21
Today
Many LRs & LTs exist, but a global vision, policy & strategy is needed
Which Communities? Language Resources; Language Technologies; Standardisation; Content/Ontologies; System developers; Integrators; SSH
With: EC; National funding agencies; Industry
For many applications/domains: MT, CLIR, …, e-government, content industry, intelligence, e-culture, e-health, domotics, …
Diagram labels: core EU Forum; CLARIN for SSH; FLaReNet Network; META-NET NoE
Need to consider together technical, organisational, strategic, economic, social, cultural, legal, political issues wrt LRs & LTs
Slide22
Fostering Language Resources Network
FLaReNet at a glance
http://www.flarenet.eu
An international Forum to facilitate interaction, to
• Overcome the fragmentation in LR & LT & recreate a community
• Anticipate the needs of new types of LR & LT & Language Infrastructures
• Create a shared policy for the next years
• Foster a European strategy for consolidating the sector
Essential Community mobilisation (also to prepare the ground for a RI)
A “roadmap”: a plan of actions as input to policy development
A (EU) model for the LRs/LTs area of the next years
Ambitious!
Slide23
Results from Vienna & Barcelona Forum:
Shaping the Future of the Multilingual Digital Europe
Access to LRs is critical & should involve all the community
Need to create the means to plug together different LR & LT, in a web-based resource and technology “grid”
Create a shared repository of data formats, annotations, etc. as a major help to achieve standardisation
Common repositories for tools & language data should be established that are universally and easily accessible by everyone
Coordinate input to ISO/W3C standardisation work
Slide24
2nd Blueprint
Result of a permanent and cyclical consultation
Inside the community it represents
Outside it, through connections with neighbouring projects, associations, initiatives, funding agencies
Organised along three main “directions”:
• Infrastructural Aspects
• Research and Development
• Political and Strategic Issues
These reflect three major development factors that can boost or hinder the growth of the field of LRT
http://www.flarenet.eu/sites/default/files/D8.2b.pdf
Slide25
Sources: many meetings
Slide26
3rd FLaReNet Forum
The European Language Resources and Technologies Forum:
Important role in defining recommendations
In Barcelona: 120 Participants from 22 Countries
Define final recommendations
Previous Proceedings & Reports on the web
Blueprint will be discussed
Also for adoption & endorsement by FLaReNet Institutional Members
Slide27
Infrastructural Aspects
Table columns: Issue; Challenge; Recommended Actions
Metadata
Interoperability of Metadata sets
Set up a global infrastructure of common and uniform and/or interoperable metadata sets
Metadata usable both by humans and by machines
Create machine-understandable metadata with formal syntax and clear semantics
Automate the process of metadata creation
Develop structured metadata
Documentation
Reliable documentation of LRs according to common best practices
Collect all possible and existing LR documentation
Devise and adopt a widely agreed standard documentation template for all types of resources
Slide28
Political and Strategic dimensions
Table columns: Issue; Challenge; Recommended Actions
Funding Agencies policies
Devise models to allow different types of players easy access to resources
Ensure that publicly funded resources are publicly available either free of charge or at a small distribution cost
Encourage/enforce use of best practices or standards in LR production projects
Make sustainability and sharing/distribution plans mandatory in projects concerning LR production
LR citation
Appropriate citation of Language Resources like traditional publications
Develop a standard protocol for citing language resources
KYOTO can be an example
Slide29
LRE Map: Why??
Problem: Lack of information & documentation about resources is, in the e-science paradigm, a very critical issue
Non-documented resources don’t exist!!
The Map as an answer to start to fill this gap, but also:
To encourage the needed “change in culture”
A collective enterprise:
Each researcher must become aware of the importance of his/her personal engagement in documenting resources
A task as important as creating new resources and not an accessory to be disregarded
As the necessary service to the whole community
Will become an essential instrument to monitor the field
www.resourcebook.eu
Slide30
How many LRs & Types at LREC?
How many LRs & Types at COLING?
Languages: 170!
Slide31
Languages: 170!!
But obviously …
image courtesy of Wordle (http://www.wordle.net)
Slide32
Availability
Freely available!
Chart labels: LREC, COLING; values shown: 54%, 3%, 15%, 25%, 57%
Slide33
The Project META-NET
META-NET is a Network of Excellence (coord. Hans Uszkoreit) dedicated to fostering the technological foundations of the European multilingual information society
Objectives:
• Prepare the ground for a large-scale concerted effort by building a strategic alliance of national and international research programmes, corporate users and commercial technology providers and language communities
• Strengthen the European research community through research networking and by creating new schemes and structures for sharing resources and efforts
• Build bridges by approaching open problems in collaboration with other research fields such as machine learning, social computing, cognitive systems, knowledge technologies and multimedia content
Final goal: META – The Multilingual Europe Technology Alliance
Slide34
The META Alliance
Diagram labels: language communities; policy makers and funding bodies; user industries; provider industries; language technology community; machine learning community; semantic technologies community; cognitive systems community; multimedia content technologies
Slide35
Founding Members
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH, Germany
Barcelona Media – Centre d'Innovació, Spain
Consiglio Nazionale delle Ricerche – Istituto di Linguistica Computazionale “Antonio Zampolli”, Italy
Institute for Language and Speech Processing, R.C. “Athena”, Greece
Charles University in Prague, Czech Republic
Centre National de la Recherche Scientifique – Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur, France
Universiteit Utrecht, The Netherlands
Aalto University, Finland
Fondazione Bruno Kessler, Italy
Dublin City University, Ireland
Rheinisch-Westfälische Technische Hochschule Aachen, Germany
Jožef Stefan Institute, Slovenia
Evaluations and Language Resources Distribution Agency, France
Slide36
Three Lines of Action
The META-NET objectives translate into three lines of action:
Slide37
The Process
Visions; Strategic Research Agenda; Roadmap
Timeline: 2010, 2011, 2012
communication within META-NET (META-VISION)
communication in the wider LT community and among other stakeholders
communication to policy makers, funding bodies, public
Slide38
Data has become a key factor in LT R&D
A few indicators:
• Increasing size & importance of LREC conference, corpora mailing list, etc.
• Citation ranks of publications on language resources
Language research and language technology belong to the Data Intensive Sciences
Expensive data become valuable through sharing
However, the long demanded and well-contemplated instruments for managing and sharing this data are still missing
Slide39
META-SHARE: Key Features
META-SHARE is an open, integrated, secure, interoperable exchange infrastructure (resp. Stelios Piperidis) for language data & tools for the Human Language Technologies domain
• ever-evolving, scalable, including free and for-a-fee LRs/LTs and services
• including legacy, contemporary and emerging datasets, tools and technologies
• A marketplace where language data & tools are documented, uploaded and stored in repositories, catalogued and announced, downloaded, exchanged, aiming to support a data economy (includes free and for-a-fee LRs/LTs and also services)
• Standards-compliant, overcoming format, terminological and semantic differences
• Based on distributed networked repositories accessible through common interfaces
Slide40
What we’re offering
• A channel to share and distribute language data and tools
• Technical solutions for building your own repositories
• Protocols and mechanisms for making the descriptions of your resources (and the actual resources) harvestable
• Guidelines and recommendations on standards used in the LR production and documentation processes
• Recommendations on data and tools licensing issues
• Access to large catalogues of documented, high-quality resources, as well as the actual data and tools
KYOTO can be among the first
Slide41
Features
• Single Sign-On
• Easy Administration
• Metadata Harvesting
• Persistent Identifiers (PIDs)
• Intuitive Search
• Open Source
• Service-Oriented
• Distributed
• Replication/Backup
• Reporting & Statistics
Slide42
v0 architecture
Slide43
On the communication/mobilisation side
Challenges
A change of culture
Convincing arguments that data assets and their value do not necessarily grow if locked in the drawer
Incentives and models that can convince data holders that there is life after the announcement of data existence and/or sharing (share does not necessarily mean for free, nor for unbridled use)
Interoperability, common metadata, formats, etc.
In other words we need to create/reinforce a data economy based on widely agreed principles and rules, mutual understanding, sustainable and adaptive models, simplified copyright rules and licensing models
The present time window seems appropriate
KYOTO can be a “model” for other projects to follow
N. Calzolari, Multilingual Web, Madrid, 2010
as collaborative “common shared task”
New methodology of work
Assemble a comprehensive “map of language data and mechanisms” for the planet’s languages ( LRE Map
)
Interoperability acquires even more value
Needs consensual planning of common strategies
towards shared objectives
Not just the sum of many individual effortsBut an organised, well-structured, collective enterpriseSimilar to more mature sciences: Physicists/Astronomers’s experiments … of X,000 people working on the same big enterprise
N. Calzolari
44
2nd KYOTO Workshop, Gifu, Japan, January 2011
META-SHARE
is a
big step that
needs a real Paradigm shif
t
Slide45
Preamble
We wanted more & more data ...
We experience today a sort of statistical “intoxication”!
It started as a new strategy, a revolution maybe? But it has turned to tactics. Stuck with it? In a narrow loop of small advances, not linked to each other
Can we add also a new strategy? and hopefully a vision?
Main Statement
We tend to forget about “language” & the need to understand its properties & complexities
Where do we (try to) encode what we know about language properties? In annotations
BUT
Is there any theoretical knowledge of, or any serious methodology of studying and exploiting, the interactions among the various annotation layers?
Vision
Like the big Genome project, ... a large Language initiative
Slide46
Strategy
MANY (parallel) texts for MANY languages
With ALL possible annotation layers
Similar to more mature sciences, e.g. physics, … of thousands of people working together on the same big experiment
Create a sustainable infrastructure for a large Language repository plan,
Where we accumulate all the knowledge we have about language &
Encourage analysis of linguistic interrelations
Means a change of mentality: going beyond “individual” research interests
From “my approach” to some “compromise” allowing to go for big amounts/integration/building on each other/…
Slide47
From no infrastructure ... To many infrastructures/networks
We were complaining there was no infrastructure ...
Have we been too successful??
Now many infrastructural/networking initiatives
Very good opportunity
But only if we are able to act in a coordinated & coherent way
Otherwise we spoil & confuse the field