On Incentive-Based Tagging
Xuan S. Yang, Reynold Cheng, Luyi Mo, Ben Kao, David W. Cheung
{xyang2, ckcheng, lymo, kao, dcheung}@cs.hku.hk
The University of Hong Kong
Outline
- Introduction
- Problem Definition & Solution
- Experiments
- Conclusions & Future Work
Collaborative Tagging Systems
- Examples: Delicious, Flickr
- Users / taggers
- Resources: webpages, photos
- Tags: descriptive keywords
- Post: a non-empty set of tags
Applications with Tag Data
- Search [1][2]
- Recommendation [3]
- Clustering [4]
- Concept space learning [5]

[1] Optimizing web search using social annotations. S. Bao et al. WWW’07
[2] Can social bookmarking improve web search? P. Heymann et al. WSDM’08
[3] Structured approach to query recommendation with social annotation data. J. Guo. CIKM’10
[4] Clustering the tagged web. D. Ramage et al. WSDM’09
[5] Exploring the value of folksonomies for creating semantic metadata. H. S. Al-Khalifa. IJWSIS’07
Problem of Collaborative Tagging
- Most posts are given to a small number of highly popular resources
- Dataset from Delicious [6]: 30m URLs in all
- Under-tagging: over 10m URLs are tagged just once
- Over-tagging: 39% of the posts go to 1% of the URLs

[6] Analyzing Social Bookmarking Systems: A del.icio.us Cookbook. ECAI Mining Social Data Workshop. 2008
Under-Tagging
- Resources with very few posts have low-quality tag data
- A single post is of low quality:
  - It may be irrelevant to the resource, e.g. {3dmax}
  - It may not cover all aspects of the resource, e.g. {geography, education}
  - It does not tell which tag is more important, e.g. {maps, education}
- Improve tag data quality for an under-tagged resource by giving it a sufficient number of posts
Having a sufficient No. of Posts
- All aspects of the resource will be covered
- The relative occurrence frequency of tag t can reflect its importance
  - Irrelevant tags rarely appear
  - Important tags occur frequently
- Can we always improve tag data quality by giving more posts to a resource?
Over-Tagging
- [Figure: relative frequency vs. no. of posts]
- With >= 250 posts, the relative frequencies are stable
- Tagging efforts are wasted!
Incentive-Based Tagging
- Guide users’ tagging effort
- Reward users for annotating under-tagged resources
- Reduce the number of under-tagged resources
- Save the tagging efforts wasted on over-tagged resources
Incentive-Based Tagging (cont’d)
- Limited budget
- Incentive allocation: selected resources
- Objective: maximize quality improvement
- Quality metric for tag data
Effect of Incentive-Based Tagging
- Top-10 most similar query over 5,000 tagged resources
- Query resource: www.myphysicslab.com (simulation for physics experiments, implemented in Java)

Tag data                                               Top-10 result
Base case: 150k posts from Delicious                   10 Java
150k + 10k more posts from Delicious                   4 Physics, 6 Java
150k + 10k more posts from incentive-based tagging     9 Physics, 1 Simulation
Ideal case: 2m posts from Delicious                    10 Physics
Related Work
- Tag recommendation [7][8][9]: automatically assign tags to resources
- Differences: machine-learning-based methods vs. human labor

[7] Social Tag Prediction. P. Heymann. SIGIR’08
[8] Latent Dirichlet Allocation for Tag Recommendation. R. Krestel. RecSys’09
[9] Learning Optimal Ranking with Tensor Factorization for Tag Recommendation. S. Rendle. KDD’09
Related Work (Cont’d)
- Data cleaning under limited budget [10]
- Similarity: improve data quality with human labor
- Opposite directions:
  - “-” Remove uncertainty
  - “+” Enrich information

[10] Explore or Exploit? Effective Strategies for Disambiguating Large Databases. R. Cheng. VLDB’10
Outline
- Introduction
- Problem Definition & Solution
- Experiments
- Conclusions & Future Work
Data Model
- Set of resources
- For a specific resource ri:
  - Post: a set of tags
  - Post sequence {pi(k)}
  - Relative frequency distribution (rfd) after ri has k posts

Example posts: {maps, education}, {geography, education}, {3dmax}

Tag          Frequency   Relative Frequency
Maps         1           0.2
Geography    1           0.2
Education    2           0.4
3dmax        1           0.2
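The rfd table above can be reproduced with a few lines of Python. This is a minimal sketch; the function name `relative_frequency_distribution` is a hypothetical helper, not something defined in the slides. It counts every tag occurrence across the posts and divides by the total number of occurrences.

```python
from collections import Counter


def relative_frequency_distribution(posts):
    """Return {tag: relative frequency} over all tag occurrences in `posts`.

    `posts` is a sequence of posts, each post being a set of tags.
    (Hypothetical helper name; the slides only define the rfd informally.)
    """
    counts = Counter(tag for post in posts for tag in post)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in counts.items()}


# The three example posts from the Data Model slide.
posts = [{"maps", "education"}, {"geography", "education"}, {"3dmax"}]
rfd = relative_frequency_distribution(posts)
```

With the three example posts there are five tag occurrences in total, so `education` (two occurrences) gets 0.4 and every other tag gets 0.2.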
Quality Model: Tagging Stability
- Stability of the rfd: the average similarity between the ω most recent rfds, i.e., the (k-ω+1)-th, ..., k-th rfds
- [Figure labels: stable point, threshold, stable rfd]
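A rough sketch of the stability measure follows. The slides do not say which similarity function is used or how the ω rfds are paired, so two assumptions are made here and should be read as such: similarity is cosine similarity, and the average is taken over adjacent rfds inside the window. The `stable_point` helper additionally assumes that the stable point is the first k whose stability reaches the threshold. All function names are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rfds_after_each_post(posts):
    """Return the list [rfd after 1 post, rfd after 2 posts, ...]."""
    counts, out = Counter(), []
    for post in posts:  # each post is a non-empty set of tags
        counts.update(post)
        total = sum(counts.values())
        out.append({t: n / total for t, n in counts.items()})
    return out


def stability(posts, k, omega):
    """Average similarity of adjacent rfds among the (k-omega+1)-th..k-th rfds.

    Assumption: adjacent-pair cosine similarity; the slides only say
    "average similarity between omega rfds".
    """
    if omega < 2 or k < omega or k > len(posts):
        raise ValueError("need 2 <= omega <= k <= len(posts)")
    window = rfds_after_each_post(posts[:k])[k - omega:k]
    sims = [cosine(window[j], window[j + 1]) for j in range(omega - 1)]
    return sum(sims) / len(sims)


def stable_point(posts, omega, threshold):
    """First k whose stability reaches the threshold, or None (assumed definition)."""
    for k in range(omega, len(posts) + 1):
        if stability(posts, k, omega) >= threshold:
            return k
    return None
```

Recomputing the rfds for every k keeps the sketch short; a real implementation would update the tag counts incrementally.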
Quality
- For one resource ri with k posts: the similarity between its current rfd and its stable rfd
- For a set of resources R: the average quality of all the resources
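The two quality definitions can be sketched as below. Cosine similarity is an assumption, since the slides do not name the similarity function, and the stable rfd used in the example is invented purely for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors given as dicts (assumed metric)."""
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def quality(current_rfd, stable_rfd):
    """Quality of one resource: similarity of its current rfd to its stable rfd."""
    return cosine(current_rfd, stable_rfd)


def average_quality(rfd_pairs):
    """Quality of a set of resources: mean quality over (current, stable) pairs."""
    pairs = list(rfd_pairs)
    return sum(quality(c, s) for c, s in pairs) / len(pairs)


# Current rfd from the Data Model example; the stable rfd is hypothetical.
current = {"maps": 0.2, "geography": 0.2, "education": 0.4, "3dmax": 0.2}
stable = {"maps": 0.5, "education": 0.5}
q = quality(current, stable)
```

A resource whose current rfd already equals its stable rfd has quality 1, so posts given to it beyond that point cannot raise the objective.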
Incentive-Based Tagging
- Input
  - A set of resources
  - Initial posts
  - Budget
- Output
  - Incentive assignment: how many new posts each ri should get
- Objective
  - Maximize quality
- [Figure: post timelines of r1, r2, r3, with the current time marked]
Incentive-Based Tagging (cont’d)
- Optimal solution
  - Dynamic programming
  - Best quality improvement
  - Assumption: the stable rfd and the future posts are known
- [Figure: post timelines of r1, r2, r3, with the current time marked]
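The slides do not spell out the dynamic program, so the following is only one plausible formulation, written as a sketch. It assumes an oracle table `gain[i][x]` (hypothetical name) holding the quality of resource i after x additional posts, which is available only when the stable rfd and the future posts are known. The program then maximizes total quality subject to the budget, in the style of a resource-allocation knapsack.

```python
def allocate_budget(gain, budget):
    """Return (best total quality, posts assigned to each resource).

    gain[i][x] is the (assumed known) quality of resource i after x extra
    posts; each row must contain at least the entry for x = 0.
    """
    n = len(gain)
    neg_inf = float("-inf")
    # best[i][b]: best total quality of the first i resources using at most b posts.
    best = [[neg_inf] * (budget + 1) for _ in range(n + 1)]
    choice = [[0] * (budget + 1) for _ in range(n + 1)]
    for b in range(budget + 1):
        best[0][b] = 0.0
    for i in range(1, n + 1):
        row = gain[i - 1]
        for b in range(budget + 1):
            for x in range(min(b, len(row) - 1) + 1):
                candidate = best[i - 1][b - x] + row[x]
                if candidate > best[i][b]:
                    best[i][b] = candidate
                    choice[i][b] = x
    # Walk back through the recorded choices to recover the assignment.
    allocation, b = [0] * n, budget
    for i in range(n, 0, -1):
        allocation[i - 1] = choice[i][b]
        b -= allocation[i - 1]
    return best[n][budget], allocation


# Hypothetical quality tables for two resources and a budget of 2 posts.
total, allocation = allocate_budget([[0.2, 0.6, 0.7], [0.9, 0.95, 0.96]], 2)
```

In this toy instance both posts go to the first resource, which starts at low quality, while the second resource is already near its stable rfd and gains little from extra posts.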
Strategy Framework
Implementing CHOOSE()
- Free Choice (FC): users freely decide which resource they want to tag
- Round Robin (RR): the resources have an even chance to get posts
Implementing CHOOSE()
- Fewest Post First (FP)
  - Prioritize under-tagged resources
- Most Unstable First (MU)
  - Resources with unstable rfds need more posts
  - Window size
- Hybrid (FP-MU)
- [Figure: post timelines of r1, r2, r3]
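The strategy framework can be sketched as a loop that spends the budget one post at a time and delegates the selection to a pluggable CHOOSE() function. The loop shape, the function names, and the tie-breaking rule are assumptions made for illustration; only the strategy descriptions come from the slides. Fewest Post First and Round Robin are shown because they need nothing beyond post counts.

```python
import itertools


def run_allocation(post_counts, budget, choose):
    """Spend `budget` posts one by one; return {resource: new posts assigned}.

    post_counts maps each resource to its current number of posts.
    choose(counts) plays the role of CHOOSE() and returns one resource.
    """
    counts = dict(post_counts)
    assigned = {r: 0 for r in counts}
    for _ in range(budget):
        r = choose(counts)
        counts[r] += 1  # stands in for a rewarded user submitting one post
        assigned[r] += 1
    return assigned


def fewest_post_first(counts):
    """FP: pick the resource with the fewest posts (ties broken by name)."""
    return min(counts, key=lambda r: (counts[r], r))


def make_round_robin(resources):
    """RR: cycle through the resources in a fixed order."""
    cycle = itertools.cycle(sorted(resources))
    return lambda counts: next(cycle)


initial = {"r1": 3, "r2": 0, "r3": 10}
fp_result = run_allocation(initial, 4, fewest_post_first)
rr_result = run_allocation(initial, 4, make_round_robin(initial))
```

With a budget of 4, FP gives three posts to `r2` until it catches up with `r1` and then one to `r1`, while RR spreads the posts regardless of how many each resource already has. A Most Unstable First variant would need the post contents as well, since it ranks resources by rfd stability rather than by count.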
Outline
- Introduction
- Problem Definition & Solution
- Experiments
- Conclusions & Future Work
Setup
- Delicious dataset during the year 2007
- 5,000 resources
  - Passed their stable point
  - The entire post sequence is known
- Simulation from Feb. 1, 2007
  - 148,471 posts in total
  - 7% passed their stable point
  - 25% under-tagged (# of posts < 10)
- [Figure: post timelines of r1, r2, r3, with the simulation start marked]
Quality vs. Budget
- FP & FP-MU are close to optimal
- FC does NOT increase the quality
- Budget = 1,000
  - 0.7% more posts compared with the initial number
  - 6.7% quality improvement
- Make all resources reach their stable point
  - FC: over 2 million more posts
  - FP & FP-MU: 90% saved
Over-Tagging
- Free Choice: 50% of the posts are over-tagging, i.e. wasted
- FP, MU and FP-MU: 0%
Top-10 Similar Sites (Cont’d)
- On Feb. 1, 2007
  - www.myphysicslab.com has 3 posts
  - Top-10 are all Java related
- 10,000 more posts by FC
  - The site gets 4 more posts
  - 4/10 are physics related
Top-10 Similar Sites (Cont’d)
- On Dec. 31, 2007
  - 270 posts
  - Top-10 are all physics related
  - Perfect result
- 10,000 more posts by FP
  - The site gets 11 more posts
  - Top 9 are physics related
  - 9 are included in the perfect result
  - Top 6 are in the same order as in the perfect result
Conclusion
- Define tag data quality
- Problem of incentive-based tagging
- Effective solutions
  - Improve data quality
  - Improve the quality of application results, e.g. top-k search
Future Work
- Different costs of tagging operations
- User preference in the allocation process
- System development
References
[1] Optimizing web search using social annotations. S. Bao et al. WWW’07
[2] Can social bookmarking improve web search? P. Heymann et al. WSDM’08
[3] Structured approach to query recommendation with social annotation data. J. Guo. CIKM’10
[4] Clustering the tagged web. D. Ramage et al. WSDM’09
[5] Exploring the value of folksonomies for creating semantic metadata. H. S. Al-Khalifa. IJWSIS’07
[6] Analyzing Social Bookmarking Systems: A del.icio.us Cookbook. ECAI Mining Social Data Workshop. 2008
[7] Social Tag Prediction. P. Heymann. SIGIR’08
[8] Latent Dirichlet Allocation for Tag Recommendation. R. Krestel. RecSys’09
[9] Learning Optimal Ranking with Tensor Factorization for Tag Recommendation. S. Rendle. KDD’09
[10] Explore or Exploit? Effective Strategies for Disambiguating Large Databases. R. Cheng. VLDB’10
Thank you!
Contact Info:
Xuan Shawn Yang
University of Hong Kong
xyang2@cs.hku.hk
http://www.cs.hku.hk/~xyang2
Effectiveness of Quality Metric (Backup)
- All-pair similarity
  - Represent each resource by its tags
  - Calculate the similarity between all pairs of resources
  - Compare the similarity result with a gold standard
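The all-pair similarity step can be sketched as follows, assuming each resource is represented by its rfd and that similarity is cosine similarity; neither choice is stated on the slide. The comparison against the gold standard is left out because the slide does not describe how it is done.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse tag vectors given as dicts (assumed metric)."""
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def all_pair_similarity(tag_vectors):
    """Return {(a, b): similarity} for every unordered pair of resources.

    tag_vectors maps a resource name to its {tag: weight} representation.
    """
    names = sorted(tag_vectors)
    return {
        (a, b): cosine(tag_vectors[a], tag_vectors[b])
        for a, b in combinations(names, 2)
    }


# Hypothetical tag vectors for three resources.
vectors = {
    "site_a": {"physics": 0.5, "simulation": 0.5},
    "site_b": {"physics": 0.5, "simulation": 0.5},
    "site_c": {"java": 1.0},
}
sims = all_pair_similarity(vectors)
```

The number of pairs grows quadratically with the number of resources, which is worth keeping in mind for a set of 5,000 resources.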
Under-Tagged Resources (Backup)
Other Top-10 Similar Sites (Backup)
Problem of Collaborative Tagging (Backup)
- Most posts are given to a small number of highly popular resources
- Dataset from delicious.com
  - 30m URLs in all
  - 39% of the posts vs. the top 1% of the URLs
  - Over 10m URLs are tagged just once
- Selected 5,000 resources
  - High-quality resources
  - 7% passed their stable points
  - 50% over-tagging posts
  - 25% under-tagged (< 10 posts)
Tagging Stability (Backup)
- Example
  - Window size
  - Threshold
  - Stable point: 100
  - Stable rfd: