Slide 1: Large Scale Multi-Label Classification via MetaLabeler
Lei Tang (Arizona State University)
Suju Rajan and Vijay K. Narayanan (Yahoo! Data Mining & Research)
Slide 2: Large Scale Multi-Label Classification
- Huge number of instances and categories
- Common for online content:
  - Web page classification
  - Query categorization
  - Video annotation/organization
  - Social bookmark/tag recommendation
Slide 3: Challenges
- Multi-class: thousands of categories
- Multi-label: each instance has more than one label
- Large scale: huge numbers of instances and categories
  - Our query categorization problem: 1.5M queries, 7K categories
  - Yahoo! Directory: 792K docs, 246K categories (Liu et al. 2005)
- Most existing multi-label methods do not scale: structural SVM, mixture models, collective inference, maximum-entropy models, etc.
- The simplest method, one-vs-rest SVM, is still widely used
Slide 4: One-vs-Rest SVM
[Figure: a 4-instance, 4-category example. Label sets: x1 -> {C1, C3}; x2 -> {C1, C2, C4}; x3 -> {C2}; x4 -> {C2, C4}. One binary SVM is trained per category, with instances carrying that label as positives and all others as negatives:
  SVM1 (C1): x1+, x2+, x3-, x4-
  SVM2 (C2): x1-, x2+, x3+, x4+
  SVM3 (C3): x1+, x2-, x3-, x4-
  SVM4 (C4): x1-, x2+, x3-, x4+
Prediction: apply all four SVMs to a new instance.]
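The binary decomposition above can be sketched in a few lines of Python (a minimal illustration, not the authors' code; the instance and label names come from the slide's example):

```python
# One-vs-rest decomposition: one binary problem per category.
label_sets = {
    "x1": {"C1", "C3"},
    "x2": {"C1", "C2", "C4"},
    "x3": {"C2"},
    "x4": {"C2", "C4"},
}

def binary_targets(category, label_sets):
    """+1 for instances that carry the label, -1 for all others."""
    return {x: (+1 if category in labels else -1)
            for x, labels in label_sets.items()}

# SVM1 trains on the C1 problem: x1+, x2+, x3-, x4-
print(binary_targets("C1", label_sets))
```

Each of the four binary problems is independent, which is what makes the scheme easy to parallelize.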
Slide 5: One-vs-Rest SVM
Pros:
- Simple, fast, scalable
- Each label is trained independently, so training is easy to parallelize
Cons:
- Highly skewed class distribution (few positives, many negatives)
- Biased prediction scores
Still, the output gives a reasonably good ranking (Rifkin and Klautau 2004). For example, with 4 categories C1-C4 and true labels C1, C3 for instance x1, the prediction scores satisfy {s1, s3} > {s2, s4}. But how do we predict the number of labels?
Slide 6: MetaLabeler Algorithm
1. Obtain a ranking of class membership for each instance
   - Any generic ranking algorithm can be applied; here, one-vs-rest SVM
2. Build a meta model to predict the number of top classes
   - Construct meta labels
   - Construct meta features
   - Build the meta model
Slide7Meta Model – Training
Q
2
= cotton children jeans
Labels:
Children clothing
Q
3 = leather fashion in 1990sLabels:
FashionWomen Clothing Leather Clothing
Q
1 = affordable cocktail dressLabels:
Formal wearWomen Clothing
Q1: 2
Q2: 1
Q3: 3
Meta data
Query: #labels
Meta-Model
One-vs-RestSVMClothingWomenClothingFormalwearFashionChildrenClothing
RegressionLeather clothingHow to handle predictions like 2.5 labels?
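A minimal sketch of the meta-label construction, using the slide's three queries; the round-half-up rule for fractional regression outputs is one plausible answer to the 2.5-labels question, not necessarily the authors' choice:

```python
# Meta labels are simply the label counts of the training instances.
training_labels = {
    "Q1": ["Formal Wear", "Women Clothing"],
    "Q2": ["Children Clothing"],
    "Q3": ["Fashion", "Women Clothing", "Leather Clothing"],
}
meta_labels = {q: len(labels) for q, labels in training_labels.items()}

def to_label_count(regression_output, max_labels):
    """Map a real-valued prediction like 2.5 to a usable label count:
    round half up, then clip to [1, max_labels]. (Illustrative rule.)"""
    k = int(regression_output + 0.5)
    return min(max(k, 1), max_labels)

print(meta_labels)               # {"Q1": 2, "Q2": 1, "Q3": 3}
print(to_label_count(2.5, 17))   # 3
```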
Slide 8: Meta Feature Construction
Three options for the meta feature:
- Content-based: use the raw data (the raw data contains all the information)
- Score-based: use the prediction scores (the bias in the scores might be learned)
- Rank-based: use the sorted prediction scores
[Figure: with scores (0.9, -0.2, 0.7, -0.6) for C1-C4, the score-based meta feature is (0.9, -0.2, 0.7, -0.6) and the rank-based meta feature is (0.9, 0.7, -0.2, -0.6).]
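The score- and rank-based variants can be sketched with the slide's example scores (the content-based variant is simply the raw input vector, so it needs no computation):

```python
# One prediction score per category, in category order C1..C4.
scores = [0.9, -0.2, 0.7, -0.6]

score_based = list(scores)                 # keep category order
rank_based = sorted(scores, reverse=True)  # sort scores descending

print(score_based)  # [0.9, -0.2, 0.7, -0.6]
print(rank_based)   # [0.9, 0.7, -0.2, -0.6]
```

Sorting discards the category identities, which is exactly the point: the rank-based feature only describes the shape of the score curve.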
Slide 9: MetaLabeler Prediction
Given one instance:
1. Obtain the ranking of all labels
2. Use the meta model to predict the number of labels
3. Pick that many top-ranking labels
MetaLabeler is easy to implement:
- It uses existing SVM packages/software directly
- It can be combined with a hierarchical structure easily: simply build a meta model at each internal node
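The three prediction steps can be sketched as follows; `predicted_count` stands in for the meta model's output:

```python
def metalabeler_predict(scores, predicted_count):
    """scores: dict label -> one-vs-rest score.
    Rank labels by score, then keep as many as the meta model says."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:predicted_count]

scores = {"C1": 0.9, "C2": -0.2, "C3": 0.7, "C4": -0.6}
print(metalabeler_predict(scores, 2))  # ['C1', 'C3']
```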
Slide 10: Baseline Methods
Existing thresholding methods (Yang 2001):
- Rank-based Cut (RCut): output a fixed number of top-ranking labels for each instance
- Proportion-based Cut (PCut): for each label, choose a proportion of the test instances as positive; not applicable for online prediction
- Score-based Cut (SCut, a.k.a. threshold tuning): for each label, determine a threshold based on cross-validation; tends to overfit and is not very stable
MetaLabeler can be viewed as a local RCut method: it customizes the number of labels for each instance.
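A sketch contrasting the two cut baselines on one instance's scores (the threshold values here are illustrative, not tuned):

```python
scores = {"C1": 0.9, "C2": -0.2, "C3": 0.7, "C4": -0.6}

def rcut(scores, k):
    # RCut: the same fixed number of top labels for every instance.
    return sorted(scores, key=scores.get, reverse=True)[:k]

def scut(scores, thresholds):
    # SCut: one threshold per label (tuned by cross-validation in practice).
    return [c for c, s in scores.items() if s >= thresholds[c]]

print(rcut(scores, 1))                                              # ['C1']
print(scut(scores, {"C1": 0.5, "C2": 0.0, "C3": 0.5, "C4": 0.0}))   # ['C1', 'C3']
```

RCut with k=1 misses C3 here; MetaLabeler would instead predict a per-instance k, recovering the "local RCut" behavior the slide describes.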
Slide 11: Publicly Available Benchmark Data
Yahoo! Web Page Classification:
- 11 data sets, each constructed from a top-level category; the second-level topics are the categories
- 16-32K instances, 6-15K features, 14-23 categories
- 1.2-1.6 labels per instance, maximum 17 labels
- Each label has at least 100 instances
RCV1, a large-scale text corpus:
- 101 categories, 3.2 labels per instance
- For evaluation purposes, 3000 instances are used for training and 3000 for testing
- Highly skewed distribution (some labels have only 3-4 instances)
Slide 12: MetaLabeler with Different Meta Features
Which type of meta feature is more predictive?
The content-based MetaLabeler outperforms the other meta features.
Slide 13: Performance Comparison
MetaLabeler tends to outperform other methods
Slide 14: Bias with MetaLabeler
The distribution of the number of labels is imbalanced:
- Most instances have a small number of labels
- A small portion of instances have many more labels
This imbalance leads to bias in MetaLabeler:
- It prefers to predict fewer labels
- It predicts many labels only with strong confidence
Slide 15: Scalability Study
Threshold tuning requires cross-validation, otherwise it overfits. MetaLabeler simply adds some meta labels and learns one-vs-rest SVMs.
Slide 16: Scalability Study (cont.)
Threshold tuning: cost increases linearly with the number of categories in the data (e.g., 6000 categories -> 6000 thresholds to be tuned).
MetaLabeler: cost is upper-bounded by the maximum number of labels on any one instance (e.g., with 6000 categories but at most 15 labels per instance, only 15 additional binary SVMs need to be learned). The meta model is "independent" of the number of categories.
Slide 17: Application to Large-Scale Query Categorization
Query categorization problem:
- 1.5 million unique queries: 1M for training, 0.5M for testing
- 120K features
- An 8-level taxonomy of 6433 categories
- Multiple labels per query; e.g., "0% interest credit card no transfer fee" maps to:
  - Financial Services/Credit, Loans and Debt/Credit/Credit Card/Balance Transfer
  - Financial Services/Credit, Loans and Debt/Credit/Credit Card/Low Interest Card
  - Financial Services/Credit, Loans and Debt/Credit/Credit Card/Low-No-Fee Card
- 1.23 labels on average, at most 26 labels
Slide 18: Flat Model
The flat model does not leverage the hierarchical structure. Threshold tuning on the training data alone takes 40 hours, while MetaLabeler takes 2 hours.
Slide 19: Hierarchical Model - Training
[Figure: a taxonomy tree rooted at "Root"; training proceeds at each internal node.]
Step 1: Generate training data for the node
Step 2: Roll up labels to the node's children
Step 3: Create an "Other" category
Step 4: Train a one-vs-rest SVM
Slide 20: Hierarchical Model - Prediction
[Figure: a query q is classified top-down, starting with the SVMs trained at the root level and descending into the predicted child categories. Prediction stops upon reaching a leaf node or an "Other" category.]
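The top-down walk can be sketched as follows; the tiny taxonomy and the stand-in classifier are hypothetical, not the deck's actual taxonomy:

```python
# Hypothetical 2-level taxonomy: internal nodes map to their children,
# leaves map to an empty list.
taxonomy = {
    "Root": ["m1", "m2", "Other"],
    "m1":   ["c1", "c2", "Other"],
    "m2":   [],
}

def predict_topdown(query, taxonomy, classify):
    """classify(node, query) -> list of predicted children at that node.
    Descend from the root; stop at a leaf or at an 'Other' category."""
    results, frontier = [], ["Root"]
    while frontier:
        node = frontier.pop()
        if node == "Other":
            continue                      # stop: "Other" emits no label
        children = taxonomy.get(node, [])
        if not children:
            results.append(node)          # stop: reached a leaf
            continue
        frontier.extend(classify(node, query))
    return results

# Stand-in classifier that always routes into the first child:
first_child = lambda node, q: [taxonomy[node][0]]
print(predict_topdown("query q", taxonomy, first_child))  # ['c1']
```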
Slide 21: Hierarchical Model + MetaLabeler
Precision decreases by 1-2%, but recall improves by 10% at deeper levels.
Slide 22: Features in MetaLabeler
Feature -> Related Categories:
- Overstock.com: Mass Merchants/…/Discount Department Stores; Apparel & Jewelry; Electronics & Appliances; Home & Garden; Books-Movies-Music-Tickets
- Blizard: Toys & Hobbies/…/Video Game; Computing/…/Computer Game Software; Entertainment & Social Event/…/Fast Food Restaurant; Reference/News/Weather Information
- Threading: Books-Movies-Music-Tickets/…/Computing Books; Computing/…/Programming; Health and Beauty/…/Unwanted Hair; Toys and Hobbies/…/Sewing
Slide 23: Conclusions & Future Work
MetaLabeler is promising for large-scale multi-label classification:
- Core idea: learn a meta model to predict the number of labels
- Simple, efficient, and scalable
- Uses existing SVM software directly
- Easy for practical deployment
Future work:
- How to optimize MetaLabeler for a desired operating point (e.g., >95% precision)?
- Application to social-networking-related tasks
Slide 24: Questions?
Slide 25: References
Liu, T., Yang, Y., Wan, H., Zeng, H., Chen, Z., and Ma, W. 2005. Support vector machines classification with a very large-scale taxonomy. SIGKDD Explor. Newsl. 7, 1 (Jun. 2005), 36-43.
Rifkin, R. and Klautau, A. 2004. In Defense of One-Vs-All Classification. J. Mach. Learn. Res. 5 (Dec. 2004), 101-141.
Yang, Y. 2001. A study of thresholding strategies for text categorization. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (New Orleans, Louisiana, United States). SIGIR '01. ACM, New York, NY, 137-145.
Slide 26: Hierarchical vs. Flat Model
Flat model:
- Build a one-vs-rest SVM over all the labels
- No taxonomy information is used during training
The hierarchical model has about 5% higher recall at deeper levels.