Slide1
CLARIN web services and workflow
Marc Kemps-Snijders
Slide2
Expected practices and interface descriptions
SOAP: WSDL
XML-RPC: WSDL
REST: WADL, WSDL
Currently web services from a number of organizations:
RACAI: tokenizing, lemmatizing, chunking, language identification, ...
UPF: statistical services, concordance, querying, ...
Leipzig Linguistic Services: sentence boundary detection, co-occurrence statistics, ...
…
Web services
Currently available services are listed in the CLARIN inventory
Slide3
Web services will be registered using the CMD Infrastructure.
All services are registered using CLARIN metadata
Metadata serves as the basis for profile matching
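Profile matching on metadata can be sketched as a comparison between the characteristics a resource description carries and those a tool requires for its input. The following is a minimal sketch, not CLARIN's implementation: the feature keys (`mimetype`, `language`, `annotation`) and the names `Profile`, `Tool`, `matches` and `run_step` are all hypothetical, and real CMD profiles define their own components.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    """Functional characteristics of a resource or of a tool input.

    The feature keys used below are hypothetical placeholders.
    """
    features: dict = field(default_factory=dict)


@dataclass
class Tool:
    name: str
    requires: Profile   # characteristics the input must have
    produces: dict      # characteristics added or changed in the output


def matches(resource: Profile, required: Profile) -> bool:
    """True if every required characteristic is present with the required value."""
    return all(resource.features.get(k) == v for k, v in required.features.items())


def run_step(tool: Tool, resource: Profile) -> Profile:
    """Simulate one processing step: output metadata extends the input metadata."""
    if not matches(resource, tool.requires):
        raise ValueError(f"{tool.name} cannot consume this resource")
    return Profile({**resource.features, **tool.produces})


text = Profile({"mimetype": "text/plain", "language": "nl"})
tokenizer = Tool("tokenizer", Profile({"mimetype": "text/plain"}), {"annotation": "tokens"})
lemmatizer = Tool("lemmatizer", Profile({"annotation": "tokens"}), {"annotation": "lemmas"})

tokens = run_step(tokenizer, text)       # plain text satisfies the tokenizer input
lemmas = run_step(lemmatizer, tokens)    # tokenizer output satisfies the lemmatizer input
```

Because each step writes new metadata, the same `matches` check decides whether the next tool in a chain can consume the result.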
Web service registration
This figure indicates the principle of profile matching. A resource can be consumed by a succeeding processing step if the functional characteristics in the resource description match those specified for the input of the tool or web service. The tool or web service creates additional metadata, so the same argument holds for the next processing step.
Slide4
Currently a number of WFMS are in use:
GATE
UIMA
Taverna
JBPM-based systems
Workflow
CLARIN claims no preference for any of these.
Human task support
Some tasks require human interaction, e.g. manual annotation
Slide5
Web service interactions are governed by 2 guiding CLARIN principles:
Each resource is associated with standoff XML metadata (CMD)
Each resource must provide provenance data
Principles
The data that results from web service invocations must follow these principles and provide proper metadata and provenance data
Slide6
[Diagram labels: Service with Metadata component and Provenance component; input resource: CLARIN metadata description (CMD) with Resource proxy and JournalFile proxy, Resource Data, Provenance data; output resource: CLARIN metadata description (CMD) with Resource proxy and JournalFile proxy, Resource Data', Provenance data; Standard parameters: Metadata PID, Input parameters]
1. Pass PID
2. Load metadata
3. Supply resource data
4. Pass configuration parameters
5. Create metadata
6. Record parameters
7. Generate provenance data
8. Record result data
Slide7
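The numbered steps of the invocation diagram (1. Pass PID through 8. Record result data) can be sketched as a single function. This is a rough illustration under stated assumptions: the in-memory `REGISTRY`, the `Resource` class, the `invoke` function, the toy `tokenize` service and the PID strings are all hypothetical, and the exact ordering of service execution relative to the metadata steps is not specified by the slide.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Resource:
    pid: str          # persistent identifier of the metadata record
    metadata: dict    # stands in for the CMD description
    data: str         # stands in for the resource data
    provenance: list = field(default_factory=list)


# Hypothetical in-memory registry; a real deployment would resolve PIDs remotely.
REGISTRY = {}


def invoke(service, service_name, pid, params, new_pid):
    source = REGISTRY[pid]                          # 1. pass PID, 2. load metadata
    result_data = service(source.data, **params)    # 3. supply resource data, 4. pass configuration parameters
    metadata = {**source.metadata,                  # 5. create metadata for the result
                "derived_from": pid,
                "created_by": service_name}
    record = {                                      # 6. record parameters
        "service": service_name,
        "input_pid": pid,
        "parameters": dict(params),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    provenance = source.provenance + [record]       # 7. generate provenance data
    result = Resource(new_pid, metadata, result_data, provenance)
    REGISTRY[new_pid] = result                      # 8. record result data
    return result


def tokenize(text, lowercase=False):
    """Toy service: whitespace tokenization, one token per line."""
    tokens = text.lower().split() if lowercase else text.split()
    return "\n".join(tokens)


REGISTRY["pid:1"] = Resource("pid:1", {"language": "en"}, "Hello CLARIN world")
out = invoke(tokenize, "tokenizer", "pid:1", {"lowercase": True}, "pid:2")
```

The result carries both its own metadata and the accumulated provenance, so it can serve as input to a further invocation in the same way.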
Architecture (Wrapper)
[Diagram labels: Client; Wrapper 1, Wrapper 2, Wrapper 3, each containing a Service (Service 1, Service 2, Service 3), a Metadata component and a Provenance component]
Client invokes wrapper interface
Each wrapper will contain a metadata and a provenance component
Slide8
Architecture (CLARIN Service Bus)
[Diagram labels: Client; Request; Result; CSB Service; Metadata component; Service; Provenance component; Web service; CSB messaging; In memory messaging]
WFMS may be integrated into the CLARIN Service Bus
Calling workflow processes from CSB
Calling CSB services from workflow processes
Middleware solution (CLARIN Service Bus) may provide a more generic approach
Slide9
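The two integration directions named for the CLARIN Service Bus, calling workflow processes from the CSB and calling CSB services from workflow processes, can be illustrated with a toy in-memory bus. Everything here is an assumption made for illustration: the `ServiceBus` and `Message` classes, the service names and the payload layout are hypothetical and do not describe an actual CSB interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Message:
    service: str    # name of the target service on the bus
    payload: dict   # request or result content


class ServiceBus:
    """Toy in-memory bus: routes a request message to a registered handler."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[dict], dict]] = {}

    def register(self, name: str, handler: Callable[[dict], dict]) -> None:
        self._handlers[name] = handler

    def request(self, message: Message) -> Message:
        # Look up the handler by service name and wrap its return value as a result message.
        handler = self._handlers[message.service]
        return Message(message.service, handler(message.payload))


bus = ServiceBus()
bus.register("tokenizer", lambda p: {"tokens": p["text"].split()})


def workflow(payload: dict) -> dict:
    """A workflow process that calls a bus service (workflow -> CSB)."""
    tokens = bus.request(Message("tokenizer", payload)).payload["tokens"]
    return {"token_count": len(tokens)}


# Registering the workflow makes it callable like any other service (CSB -> workflow).
bus.register("count-tokens", workflow)
result = bus.request(Message("count-tokens", {"text": "a b c"}))
```

In this sketch the client only sees request and result messages, so whether a name resolves to a single service or to a whole workflow stays hidden behind the bus.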
Questions?
Slide10
Formats, interoperability and standards
Marc Kemps-Snijders
Slide11
Format interoperability
Interoperability is only relevant if
Resources are to be exchanged
Resources are to be combined in collections
Tools and services need to operate on resources
Results are to be compared
Standardization attempts to solve these cross-resource and technology issues by
Looking at existing practices
Providing abstractions
Addressing sustainability aspects
Seeking international consensus
Providing solid grounding through well accepted standards bodies.
Increasingly the linguistic community presents itself not only from a research perspective, but also from a service provider perspective
Slide12
Basic standards
Unicode – ISO 10646
Widely supported, some glyphs are still missing
Country codes – ISO 3166
Widely supported
Language codes – ISO 639-1/2/3
Many languages not covered, politically sensitive
XML
Widely supported, lack of generic linguistic resource models and semantic grounding
Feature Structures Part 1 – ISO 24610-1:2006
Reference XML vocabulary for FS representation
TEI
CLARIN should identify the extent to which competing formats are being used (DocBook, NLM DTD, …)
Standardization
Slide13
Ongoing standardization projects
Morpho-syntactic Annotation Framework (MAF) – ISO/DIS 24611
Token-word form, does not specify tag sets
Syntactic Annotation Framework (SynAF) – ISO/CD 24615
Draft stage and not usable at this stage
Lexical Markup Framework (LMF) – ISO 24613:2008
Flexible lexicon framework, further concrete testing needed
Data Category Registry (DCR) – ISO 12620:2009 (forthcoming)
Restricted model, no relations, limited constraints specification
TEI/ODD
Combines documentation and schema
Persistent Identification – ISO/CD 24619
Linguistic Annotation Framework (LAF) – ISO/DIS 24612
Annotated resources as graphs, very abstract level
Standardization
Slide14
Pivot formats
Pivot
Use of accepted pivot model(s) reduces the number of transformers needed
Without a pivot, a transformer is needed for each combination of processes
Slide15
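The pivot argument can be made concrete by counting transformers. This is a back-of-the-envelope sketch under two assumptions that the slides do not state: transformers are one-directional, and the pivot is an additional model rather than one of the existing formats. The function names are hypothetical.

```python
def transformers_without_pivot(n_formats: int) -> int:
    """One directed transformer for every ordered pair of distinct formats."""
    return n_formats * (n_formats - 1)


def transformers_with_pivot(n_formats: int) -> int:
    """One transformer into the pivot and one out of it, per format."""
    return 2 * n_formats


# Print format count, pairwise transformer count, and pivot transformer count.
for n in (3, 5, 8):
    print(n, transformers_without_pivot(n), transformers_with_pivot(n))
```

Under these assumptions the pairwise count grows quadratically and the pivot count linearly, so the pivot pays off once more than three formats are involved.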
Formats
CHAT
Shoebox/Toolbox
EAF
EXMARaLDA
XCES
PAULA
TIGER
Penn Treebank
…
Community practices
Tag sets
GOLD
TDS
STTS
EUROTYP
…
CLARIN will need to make statements on how to deal with these formats (inclusion versus curation)
Slide16
Thank you for your attention
Slide17
ISO process
CD = Committee Draft
DIS = Draft International Standard
DPAS = Draft Publicly Available Specification
DTR = Draft Technical Report
DTS = Draft Technical Specification
FDIS = Final Draft International Standard
IS = International Standard
NP = New Work Item Proposal
PAS = Publicly Available Specification
TR = Technical Report
TS = Technical Specification
WD = Working Draft