
Chong Sun, Narasimhan - PowerPoint Presentation

Uploaded On 2024-01-29






Presentation Transcript

1. Chimera: Large-Scale Classification using Machine Learning, Rules, and Crowdsourcing
   Chong Sun, Narasimhan Rampalli, Frank Yang, AnHai Doan (@WalmartLabs & UW-Madison)
   Presenter: Jun Xie (@WalmartLabs)

2. Problem Definition
   - Classify tens of millions of products into 5000+ types
   - Each product: a record of attribute-value pairs
     - title: Gerber folding knife 0 KN-Knives
     - description: most versatile knife in its category ...
     - manufacturer, color, etc.
     - many products have just title attribute
   - Product types
     - laptop computers, area rugs, laptop bags & cases, dining chairs, etc.

   ID | Title | SCC | PT
   EASW1876 | Eastern Weavers Rugs EYEBALLWH-8x10 Shag Eyeball White 8x10 Rug | Shag Rugs | Area Rugs
   EMLCO655 | Royce Leather 643-RED-4 Ladies Laptop Brief - Red | Notebook Cases | Laptop Bags and Cases
   14968347 | International Concepts Stacking Dining Arm Chair (Set of 2) | | Dining Chairs
   12490924 | South Carolina Gamecocks Rectangle Toothfairy Pillow | | Decorative Pillows

3. Challenges
   - Very large # of product types (5000+)
     - started out having very little training data
     - creating training data for 5000+ types is very difficult
   - Very limited human resources
     - 1 developer and 1-2 analysts (who can’t write code)
   - Products often arrive in bursts
     - e.g., a batch of 300K items arrives and must be classified fast
     - makes it hard to provision for analysts and outsourcing
   - Need very high precision (>92%)
     - can tolerate lower recall, but want to increase recall over time
   - Current approaches can’t handle these scales/challenges

4. Manually Classifying the Items
   - Using analysts
     - can accurately classify about 100 items per day
     - must understand the item, navigate through a large space of possible types, decide on the most appropriate one
     - e.g., Misses’ Jacket, Pants and Blouse – 14 -16-18-20 Pattern → sewing patterns
     - e.g., Gerber folding knife 0 KN-Knives → utility knives? pocket knives? tactical knives? multitools?
     - would take 5 analysts 200 days to classify 100K items
   - Using outsourcing
     - very expensive: $770K for 1M items
     - outsourcing is not “elastic”
   - Using crowdsourcing
     - crowd workers can’t navigate a complex and large taxonomy of types
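The manual-classification figures on this slide follow from simple arithmetic, which the sketch below reproduces. The throughput (100 items per analyst per day) and the outsourcing price ($770K per 1M items) come from the slide; the function names are illustrative, and scaling the outsourcing price linearly to other batch sizes is an assumption, not something the slides state.

```python
import math

# Figures taken from the slide.
ITEMS_PER_ANALYST_PER_DAY = 100
OUTSOURCING_COST_PER_MILLION = 770_000  # US dollars per 1M items


def analyst_days(items: int, analysts: int) -> int:
    """Calendar days a team of analysts needs to classify `items` items."""
    return math.ceil(items / (analysts * ITEMS_PER_ANALYST_PER_DAY))


def outsourcing_cost(items: int) -> float:
    """Outsourcing cost in dollars, ASSUMING the quoted price scales linearly."""
    return items * OUTSOURCING_COST_PER_MILLION / 1_000_000


if __name__ == "__main__":
    print(analyst_days(100_000, analysts=5))  # 200, matching the slide
    print(analyst_days(300_000, analysts=2))  # 1500 days for one 300K burst
    print(outsourcing_cost(300_000))          # 231000.0 under the linear assumption
```

With the two analysts mentioned under Challenges, a single 300K-item burst would take about 1,500 working days by hand, which is why manual classification alone cannot keep up.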

5. Learning-Based Solutions
   - Difficult to generate training data
     - too many prod types (5000+)
     - to label just 200 items per prod type, must label 1M items
   - Difficult to generate representative samples
     - random sampling would severely under-sample certain types
     - analysts and outsourced workers don’t know how to obtain a random sample, e.g., for handbags, computer cables
     - new product types appear all the time → the universe of items keeps changing
   - Difficult to handle “corner” cases
     - items coming from special sources, need to be handled specially
     - hard to “go the last mile”, e.g., increasing precision from 90 to 95%
   - Concept drift and changing distribution
     - e.g., smart phone
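A short calculation illustrates both the labeling budget and the under-sampling problem named on this slide. The 200-items-per-type target and the 5000 types come from the slide; the two type frequencies ("1 in 100" and "1 in 100,000") are hypothetical values chosen only for illustration.

```python
TARGET_LABELS_PER_TYPE = 200  # from the slide
NUM_TYPES = 5000              # from the slide ("5000+")

# Total labels needed if every type gets its target: 200 * 5000.
label_budget = TARGET_LABELS_PER_TYPE * NUM_TYPES


def expected_labels(sample_size: int, one_in_n: int) -> float:
    """Expected number of items of a type that occurs once per `one_in_n`
    items, when `sample_size` items are drawn uniformly at random."""
    return sample_size / one_in_n


if __name__ == "__main__":
    print(label_budget)  # 1000000
    # HYPOTHETICAL frequencies, for illustration only.
    print(expected_labels(label_budget, one_in_n=100))      # 10000.0
    print(expected_labels(label_budget, one_in_n=100_000))  # 10.0
```

Under these assumed frequencies, a uniformly random sample of 1M items over-serves the common type by a factor of 50 and leaves the rare type with about 10 labels, far short of the 200 it needs.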

6. Rule-Based Solutions
   - Analysts & outsourcing workers write rules to classify items
   - Writing rules to cover 5000+ product types is very slow and doesn’t scale
   - Our Chimera solution
     - combines the above approaches
     - uses learning & hand-crafted rules
     - uses developers, analysts, and crowdsourcing
     - continuously improves over time
     - keeps precision high while trying to improve recall

7. Our Chimera Solution
   [Architecture diagram. Labels: Gatekeeper; Rules; Whitelist Rules; Items to Classify; Result; Blacklist Rules; K-NN; Naïve Bayes; Perceptron; Regression; SVM; Voting Master; Attribute Based; Filter; Crowd Evaluation; Sample; Analysis; Unclassified; Classified; Training Data; Classification Rules; Reports]
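The diagram names a Voting Master that sits after several classifiers, but the slides do not say how it combines their predictions. The sketch below assumes a plain majority vote with a minimum-agreement threshold, so that an item stays unclassified when the classifiers disagree. The threshold value, the function names, and the three stub classifiers are all hypothetical stand-ins, not the deployed system.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

# A classifier maps a product title to a type, or None if it abstains.
Classifier = Callable[[str], Optional[str]]


def voting_master(
    title: str,
    classifiers: Sequence[Classifier],
    min_agreement: float = 0.6,  # ASSUMED threshold, not from the slides
) -> Optional[str]:
    """Return the majority type if enough classifiers agree, else None."""
    votes = [p for p in (clf(title) for clf in classifiers) if p is not None]
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    # Abstentions count against agreement: divide by all classifiers.
    return label if count / len(classifiers) >= min_agreement else None


# HYPOTHETICAL keyword stubs standing in for trained models.
def knn_stub(title: str) -> Optional[str]:
    return "area rugs" if "rug" in title.lower() else None


def naive_bayes_stub(title: str) -> Optional[str]:
    return "area rugs" if "rug" in title.lower() else None


def perceptron_stub(title: str) -> Optional[str]:
    return "shag rugs" if "shag" in title.lower() else None


if __name__ == "__main__":
    stubs = [knn_stub, naive_bayes_stub, perceptron_stub]
    print(voting_master("Shag Eyeball White 8x10 Rug", stubs))  # area rugs
    print(voting_master("Shag Pillow", stubs))                  # None
```

Returning None instead of a weak guess is one way to trade recall for precision, in line with the stated goal of keeping precision high; unclassified items can then go to analysis.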

8. Examples
   - Rules
     - rings? → rings
     - wedding bands? → rings
     - diamond.*trio sets? → rings
     - macbook → ! Fruit (a blacklist rule)
   - Classification evaluation using crowdsourcing
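The example rules on this slide read as regular expressions over product titles. The sketch below applies them with Python's `re` module. The four patterns and their targets are taken from the slide; the matching semantics (case-insensitive search with word boundaries added around each pattern) and all function names are assumptions made for illustration.

```python
import re
from typing import List, Pattern, Set, Tuple


def compile_rule(pattern: str) -> Pattern[str]:
    # ASSUMPTION: case-insensitive, whole-word matching, so that
    # "rings?" does not fire on "earrings" or "springs".
    return re.compile(r"\b(?:" + pattern + r")\b", re.IGNORECASE)


# Whitelist rules from the slide: a match suggests the type.
WHITELIST_RULES: List[Tuple[Pattern[str], str]] = [
    (compile_rule(r"rings?"), "rings"),
    (compile_rule(r"wedding bands?"), "rings"),
    (compile_rule(r"diamond.*trio sets?"), "rings"),
]

# Blacklist rule from the slide: a match rules the type out.
BLACKLIST_RULES: List[Tuple[Pattern[str], str]] = [
    (compile_rule(r"macbook"), "Fruit"),
]


def whitelisted_types(title: str) -> Set[str]:
    """Types suggested by at least one whitelist rule."""
    return {t for rx, t in WHITELIST_RULES if rx.search(title)}


def blacklisted_types(title: str) -> Set[str]:
    """Types forbidden by at least one blacklist rule."""
    return {t for rx, t in BLACKLIST_RULES if rx.search(title)}


def filter_candidates(title: str, candidates: List[str]) -> List[str]:
    """Drop candidate types that a blacklist rule forbids for this title."""
    banned = blacklisted_types(title)
    return [c for c in candidates if c not in banned]


if __name__ == "__main__":
    print(whitelisted_types("Sterling Silver Wedding Band"))
    print(filter_candidates("Apple MacBook Pro 13", ["Fruit", "laptop computers"]))
```

Here a blacklist rule only vetoes a candidate type and never proposes one, which is how a rule such as the `macbook` example could keep a learned classifier from filing a laptop under Fruit.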

9. Key Novelties of Our Solution
   - Use both learning and rules extensively
     - rules are not “nice to have”, they are critical for high accuracy
   - Use both crowd and analysts for evaluation/analysis
     - using both in-house analysts and crowdsourcing is critical at our scale to achieve an accurate, continuously improving, and cost-effective solution
   - Scalable in terms of human resources
     - taps into crowdsourcing (very elastic) and analysts
   - Treat humans and machines as first-class citizens
     - solution carefully spells out what techniques are used where, who is doing what, and how to coordinate among them

10. Evaluation
   - Chimera has been developed and deployed for 2 years
   - Applied to 2.5M items from marketplace vendors
     - classified more than 90% with 92% precision
   - Applied to 14M items from walmart.com
     - classified 93% with 93% precision
   - As of March 2014
     - has 852K items in training data for 3,663 types
     - 20,459 rules for 4,930 types
   - Crowdsourcing
     - evaluating 1,000 items takes 1 hour with 15-25 workers
   - Staffing
     - 1 developer + 1 dedicated analyst + 1 more analyst when needed

11. Conclusion & Lessons Learned
   - Chimera: classifying millions of items into 5000+ types
   - At this scale, existing approaches do not work well
   - We have developed a highly scalable, accurate solution
     - using learning, rules, crowdsourcing, analysts
   - Lessons learned
     - both learning + rules are critical
     - crowdsourcing is critical but must be closely monitored
     - crowdsourcing must be coupled with in-house analysts and developers
     - outsourcing does not work at a very large scale
     - hybrid human-machine systems are here to stay
   - More details in our paper