Uploaded On 2016-10-25



Presentation Transcript

Slide1

Generating Automatic Semantic Annotations for Research Datasets

Ayush Singhal and Jaideep Srivastava

CS Dept., University of Minnesota, MN, USA

Slide2

Contents

Motivation
Problem statement
Proposed approach
Data type labelling
Experiments and results
Application concept
Experiments and results
Similar dataset identification
Experiments and results
Conclusions and future work

Slide3

Motivation

Annotation is the act of adding a note by way of comment or explanation.
Unlike documents, images and videos are searchable only when they have tags or annotations (i.e., content).
Recently, genomic and archaeological databases have been annotated for indexing.

Slide4

Annotating research datasets

Research datasets carry no context, so they are hard for popular search engines to find.
Make the dataset visible and informative.

Slide5

Example of structured annotation

Slide6

Problem statement

Given a dataset name “D”, a string of English characters, the research task is to generate semantic annotations for the dataset denoted by “D” in the following categories:

Characteristic data type
Application domain
List of similar datasets

Slide7

Proposed approach

Research challenges

There is no universal schema for describing the content of a dataset. The only common attribute is the dataset name.
There is no well-known structure for semantic annotation of research datasets.
The proposed structure should positively impact users' search for datasets.

Slide8

Context generation

Critical step: how to generate useful context for a dataset.
The context should reflect the usage of the dataset in research, as found in research articles and journals.
A proxy is obtained from web knowledge: the Google Scholar search engine.
The top-50 results are used to build the context for the dataset, called its “global context”.

Slide9
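The context-generation step described on Slide8 can be sketched as follows. This is a minimal illustration, assuming the search results have already been fetched as a list of text snippets; the function name `build_global_context` and the example snippets are hypothetical and not taken from the presentation.

```python
from typing import Iterable


def build_global_context(result_texts: Iterable[str], top_n: int = 50) -> str:
    """Concatenate the text of the first `top_n` search results into one context string."""
    kept = []
    for i, text in enumerate(result_texts):
        if i >= top_n:
            break
        cleaned = " ".join(text.split())  # collapse runs of whitespace
        if cleaned:
            kept.append(cleaned)
    return " ".join(kept)


# Hypothetical, pre-fetched result snippets for a dataset-name query.
results = [f"paper {i} uses the dataset for graph mining" for i in range(60)]
context = build_global_context(results)
```

Only the first 50 snippets contribute to `context`; how the snippets are retrieved is left outside the sketch.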

Identifying data type labels

For a dataset ‘D’:
Given: the global context of ‘D’ and a list of data types.
Required: the data type of ‘D’.

Approach: supervised multi-label classification.

Feature construction:
0. Preprocessing of the global context (stop-word removal, etc.).
1. BOW and TF-IDF representation of the global context of ‘D’.
2. Dimensionality reduction by PCA, retaining 98% of the variance.

Slide10
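Feature-construction steps 0 and 1 from Slide9 (preprocessing, BOW, TF-IDF) can be sketched in plain Python as below. The stop-word list, the tokenizer, and the idf formula log(N / df) are assumptions chosen for illustration; the presentation does not state which variants were used. The PCA step is not shown.

```python
import math
import re
from collections import Counter

# Toy stop-word list (assumption; a real list would be much longer).
STOP_WORDS = {"the", "a", "an", "of", "for", "and", "in", "is", "on", "to"}


def tokenize(text):
    """Lowercase, split on non-letters, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def bag_of_words(contexts):
    """Return the sorted vocabulary and one term-count vector per context."""
    counts = [Counter(tokenize(c)) for c in contexts]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[t] for t in vocab] for c in counts]


def tfidf(bow):
    """Weight raw counts by idf = log(N / df), one common TF-IDF variant."""
    n_docs = len(bow)
    n_terms = len(bow[0])
    df = [sum(1 for row in bow if row[j] > 0) for j in range(n_terms)]
    return [[row[j] * math.log(n_docs / df[j]) for j in range(n_terms)] for row in bow]


# Hypothetical global contexts for three datasets.
contexts = [
    "the graph of social network edges",
    "image classification of handwritten digits",
    "social network graph clustering",
]
vocab, bow = bag_of_words(contexts)
weights = tfidf(bow)
```

Terms that occur in fewer contexts (such as "edges") receive a higher weight than terms shared across contexts (such as "graph").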

Experiments and results

Dataset   Instances   Label count   Label density   Label cardinality
SNAP      42          5             0.34            1.69
UCI       110         4             0.275           1.1
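Label cardinality and label density in the table follow the usual multi-label definitions: cardinality is the average number of labels per instance, and density is cardinality divided by the number of distinct labels. A minimal sketch, using hypothetical toy label sets rather than the SNAP or UCI labels:

```python
def label_cardinality(label_sets):
    """Average number of labels per instance."""
    return sum(len(s) for s in label_sets) / len(label_sets)


def label_density(label_sets, n_labels):
    """Label cardinality divided by the number of distinct labels."""
    return label_cardinality(label_sets) / n_labels


# Toy label assignments (hypothetical, not taken from SNAP or UCI).
toy = [{"graph"}, {"graph", "temporal"}, {"text"}, {"graph", "text"}]
card = label_cardinality(toy)          # 6 labels over 4 instances
dens = label_density(toy, n_labels=3)  # cardinality divided by 3 labels
```

The reported figures are consistent with these definitions: 1.69 / 5 ≈ 0.34 for SNAP and 1.1 / 4 = 0.275 for UCI.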

Ground truth: author-provided data type labels.
Baseline: ZeroR classifier.
Evaluation metrics: typical multi-label classification metrics (Tsoumakas et al., 2010).

SNAP dataset

Measure               ZeroR   AdaBoostMH (TF-IDF)
F-measure ↑           0.025   0.172
Average Precision ↑   0.657   0.663
Macro AUC ↑           0.5     0.555

UCI dataset

Measure               ZeroR   AdaBoostMH (BOW)
F-measure ↑           0.854   0.873
Average Precision ↑   0.908   0.924
Macro AUC ↑           0.5     0.54

Slide11

Concept generation

Given a dataset ‘D’, find k descriptors (n-gram words) for the application of the dataset.
Approach: concept extraction from world knowledge (Wikipedia, DBpedia).
Input feature: the global context of ‘D’.
Preprocessing of the global context.
Text analytics tools (AlchemyAPI) are used for concept generation.
Pruning of input query terms.

Slide12

Experiments and results

Baseline: context generated from the short description provided by the dataset owner. Text preprocessing was applied.
Evaluation metric: user rating.

Figure: comparison of average user rating on the UCI and SNAP datasets (panels: UCI dataset, SNAP dataset).

Slide13

Identifying similar datasets

Given a dataset ‘D’, find the k most similar datasets from a list of datasets.
Approach: cosine similarity between the TF-IDF vectors of the global context of ‘D’ and the global context of each d_i in the list of datasets.
Top-k selection from the list ranked in descending order of similarity.

Slide14
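The similar-dataset ranking from Slide13 (cosine similarity between TF-IDF vectors, then top-k) can be sketched as below. The dataset names and vectors are hypothetical, and the vectors are assumed to share one vocabulary.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_similar(query_name, vectors, k):
    """Rank all other datasets by similarity to `query_name`, descending, and keep k."""
    query = vectors[query_name]
    scored = [
        (cosine(query, vec), name)
        for name, vec in vectors.items()
        if name != query_name
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]


# Hypothetical TF-IDF vectors for four datasets.
vectors = {
    "D": [1.0, 2.0, 0.0],
    "d1": [2.0, 4.0, 0.0],
    "d2": [0.0, 0.0, 3.0],
    "d3": [1.0, 0.0, 0.0],
}
nearest = top_k_similar("D", vectors, k=2)
```

Here `d1` points in the same direction as `D` and ranks first, while `d2` shares no terms with `D` and ranks last.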

Experiments and results

Ground truth: dataset categorization provided by the dataset repository owners. SNAP and UCI use different categorizations.
Baseline: context generated from the owner's description.
Evaluation metric: precision@k.
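A minimal sketch of precision@k follows. It assumes that a retrieved dataset counts as relevant when it falls in the same repository category as the query dataset; that reading of the ground truth, and the example names, are assumptions.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the first k ranked items that appear in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


# Hypothetical ranking and relevant set for one query dataset.
ranked = ["d1", "d3", "d2", "d4"]
relevant = {"d1", "d2"}
p_at_1 = precision_at_k(ranked, relevant, k=1)
p_at_2 = precision_at_k(ranked, relevant, k=2)
```

Averaging this value over all query datasets gives one score per value of k.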

Figure panels: SNAP dataset, UCI dataset.

Slide15

Use case: synthetic querying

Synthetic querying on the annotated database of research datasets.
50 queries on the SNAP database and 50 queries on the UCI database.
Query structure: find a <data type> dataset used for <concept> like <similar to>.
The <fields> are randomly generated from their respective lists.
Evaluation metric: overlap between the context of the retrieved results and the input query.
Baseline: querying Google and extracting dataset names from the retrieved results.

Slide16

Quantitative and qualitative evaluation

Comparison of Google results with the annotated DB for a few samples.

Slide17

Conclusions and future work

Real-world datasets play an important role in research, serving testing and validation purposes.
General-purpose search engines cannot find datasets due to the lack of annotation.
A novel concept of structured semantic annotation of datasets: data type labels, application concepts, similar datasets.
Annotations are generated using the global context from the web corpus.
Data type labels are identified using a multi-label classifier; using web context helps to improve accuracy for both the SNAP and UCI test datasets.

Slide18

Conclusions and future work

Concept generation using web context performs better than the baseline, based on user ratings.
Web context is not significantly helpful in identifying similar datasets for the UCI and SNAP datasets.
18% improvement in accuracy over normal dataset search using Google (for synthetic queries).
Future work: finding an overall encompassing structure of annotation; extending the analysis across different domains.

Slide19

Thank you