Publish-Subscribe Approach to Social Annotation of News

Uploaded On 2023-12-30





Presentation Transcript

1. Top-k Publish-Subscribe for Social Annotation of News
Alex Shraer
Joint work with: Maxim Gurevich (RelateIQ), Marcus Fontoura and Vanja Josifovski (Google)
Work done while the authors were at Yahoo! Research

2. News & Social Updates

3. News Annotation
Goal: annotate each story with the k most related tweets.
Challenges:
- Automatic matching, based on the content of the story and the tweet
- Real time: continuously update annotations
- Serving latency: avoid delay in serving the news page
- High scale: billions of page views per day, hundreds of millions of tweets per day, tens of thousands of stories per day

4. Real-time Index Approach
- Maintain a tweet index in real time
- For every page view on the media site, query this index with the content of the story as the query
Problems:
- Long queries, so serving time is affected
- The index is queried and updated very frequently
- Caching techniques are almost unusable
- Not scalable!
[Diagram: new tweets (hundreds of millions per day) update the Tweet Index; each page view (billions per day) sends the story to the index and receives the top-k tweets.]

5. Our solution: Top-k Publish-Subscribe
- Treat stories as subscriptions and tweets as published items
- A new item triggers a subscription only if it is among the top-k matching items published so far
[Diagram: a new story updates the Story Index; a new tweet queries the Story Index and updates the story to top-k tweets map; a page view looks up the story in that map and receives its top-k tweets.]
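The "story to top-k tweets map" can be pictured as one small min-heap per story. The following is a minimal sketch under that assumption; the class name `TopKMap` and its methods `offer` and `top` are hypothetical and do not come from the slides.

```python
import heapq
from collections import defaultdict


class TopKMap:
    """Hypothetical story -> top-k tweets map, one min-heap per story."""

    def __init__(self, k):
        self.k = k
        # story_id -> min-heap of (score, tweet_id); heap[0] is the worst kept tweet
        self._heaps = defaultdict(list)

    def offer(self, story_id, tweet_id, score):
        """Update path: try to add a tweet. Returns True if the annotation set changed."""
        heap = self._heaps[story_id]
        if len(heap) < self.k:
            heapq.heappush(heap, (score, tweet_id))
            return True
        if score > heap[0][0]:
            # The new tweet beats the current worst one, so it replaces it.
            heapq.heapreplace(heap, (score, tweet_id))
            return True
        return False

    def top(self, story_id):
        """Serving path: a plain lookup, best tweet first, with no index query."""
        return [tweet for _, tweet in sorted(self._heaps.get(story_id, []), reverse=True)]


annotations = TopKMap(k=2)
annotations.offer("s1", "a", 0.5)
annotations.offer("s1", "b", 0.9)
annotations.offer("s1", "c", 0.1)  # rejected: not among the top 2 so far
annotations.offer("s1", "d", 0.7)  # replaces "a"
print(annotations.top("s1"))
```

With this layout a page view costs one dictionary lookup, while the matching work happens when a tweet arrives.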

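6. Real-Time Indexing vs. Top-k Pub-Sub

| | Real-time indexing | Publish-Subscribe | Improvement |
|---|---|---|---|
| Computation | 1B × 50 ms = 50B ms | 100M × 10 ms + 1B × 1 ms = 2B ms | 25× |
| Serving time | 50 ms | 1 ms | 50× |
| #cores | 600 | 12 + 12 = 24 | 25× |

How the core counts are derived:
- Real-time indexing: 1B pageviews/day => ~600 pageviews per 50 ms
- Pub-sub, Story Index: 100M tweets/day => ~12 tweets per 10 ms
- Pub-sub, top-k map: 1B pageviews/day => ~12 pageviews per 1 ms
[Diagram: the Story Index and the story to top-k tweets map, labeled with 10K, 100M, and 1B pageviews and latencies of 50 ms, 10 ms, and 1 ms.]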
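The figures above can be re-derived with a few lines of arithmetic. The sketch below assumes that "#cores" means the average number of cores kept busy when one core handles one request at a time; that reading is an interpretation of the slide, and the variable names are illustrative.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86.4M milliseconds in a day

pageviews_per_day = 1_000_000_000
tweets_per_day = 100_000_000

# Real-time indexing: every page view runs a 50 ms query on the tweet index.
realtime_compute_ms = pageviews_per_day * 50

# Pub-sub: every tweet costs 10 ms against the story index,
# and every page view costs a 1 ms lookup in the top-k map.
pubsub_compute_ms = tweets_per_day * 10 + pageviews_per_day * 1

# Average number of busy cores = total compute time per day / length of a day.
realtime_cores = realtime_compute_ms / MS_PER_DAY  # ≈ 579, rounded to 600 on the slide
pubsub_cores = pubsub_compute_ms / MS_PER_DAY      # ≈ 23, shown as 12 + 12 = 24 on the slide

print(realtime_compute_ms // pubsub_compute_ms)  # computation ratio
print(round(realtime_cores), round(pubsub_cores))
```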

7. Standard IR Index and Algorithms
- Posting list for term t: a list of partial scores, one for each document containing the term t
- Query q = <t1, t3, t4>
- Go over the posting lists for t1, t3, t4
- Collect partial scores; when done, we have fully scored documents w.r.t. the query q
- Return the k documents with maximal score
[Figure: inverted index with terms t1 to t5, each pointing to its posting list of documents.]
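The procedure on this slide corresponds to term-at-a-time (TAAT) scoring. Here is a minimal sketch, assuming the index is a dictionary from term to a list of `(doc_id, partial_score)` pairs; the function name `taat_top_k` and the toy index are illustrative.

```python
import heapq
from collections import defaultdict


def taat_top_k(index, query_terms, k):
    """Term-at-a-time scoring: walk one posting list at a time, accumulate partial scores."""
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, partial in index.get(term, []):
            scores[doc_id] += partial
    # All query terms processed: every accumulator now holds a full score.
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Toy index: term -> posting list of (document, partial score).
index = {
    "t1": [("d1", 0.5), ("d2", 0.25)],
    "t3": [("d2", 0.5), ("d3", 0.125)],
    "t4": [("d1", 0.25)],
}
print(taat_top_k(index, ["t1", "t3", "t4"], k=2))
```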

8. Story Index and Top-k Pub-Sub Algorithms
- Posting list for term t: a list of partial scores, one for each story containing the term t
- tweet = <t1, t3, t4>
- Go over the posting lists for t1, t3, t4
- Collect partial scores; when done, we have fully scored stories w.r.t. the tweet
- For every story s with score(s, tweet) > 0, attempt to insert the tweet into the annotation set of s
  - Compare score(s, tweet) to the scores of the k tweets currently annotating s
[Figure: inverted index with terms t1 to t5, each pointing to its posting list of stories.]
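The steps above can be sketched as a single function that scores all matching stories for an incoming tweet and then tries to insert the tweet into each annotation set. This is an illustrative sketch without any skipping; the name `process_tweet` and the data layout (annotation sets kept as ascending lists of `(score, tweet_id)`) are assumptions.

```python
import bisect
from collections import defaultdict


def process_tweet(story_index, annotations, k, tweet_id, tweet_terms):
    """Score every matching story against the tweet, then update annotation sets.

    story_index: term -> list of (story_id, partial_score)
    annotations: story_id -> list of (score, tweet_id), sorted ascending by score
    Returns the stories whose annotation set changed.
    """
    scores = defaultdict(float)
    for term in tweet_terms:
        for story_id, partial in story_index.get(term, []):
            scores[story_id] += partial

    updated = []
    for story_id, score in scores.items():
        if score <= 0:
            continue
        current = annotations.setdefault(story_id, [])
        if len(current) == k and score <= current[0][0]:
            continue  # not better than the worst tweet currently annotating the story
        bisect.insort(current, (score, tweet_id))
        if len(current) > k:
            current.pop(0)  # evict the previous worst tweet
        updated.append(story_id)
    return updated


story_index = {"t1": [("s1", 0.5), ("s2", 0.25)], "t3": [("s2", 0.5)]}
annotations = {"s1": [(0.9, "old")]}
print(process_tweet(story_index, annotations, 1, "tw", ["t1", "t3"]))
```

Note that this naive version touches every story that shares a term with the tweet, even when the tweet cannot enter that story's top-k.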

9. Our contribution
- A method to convert efficient IR algorithms into efficient top-k pub-sub algorithms
- Demonstrated on 4 standard IR algorithms: TAAT, Buckley & Lewit, DAAT, WAND

10. Key for Efficiency: Skipping
- IR algorithms skip most of the posting lists
  - Compute an upper bound on the score gain in all remaining posting lists
  - If the upper bound is not enough to change the result set, the remaining lists can be skipped
- This can't be used as-is for pub-sub: instead of 1 result set we have to update many
- μ_s: score of the worst tweet annotating a story s
- Skipping condition when processing a tweet: s can be skipped only if the upper bound on score(tweet, s) ≤ μ_s
- Use a segment tree per posting list to skip segments of the list that satisfy the skipping condition
- Overhead: ~1.6% of index size
[Figure: posting list of term t4 over stories s1 to s5, annotated with the score of the worst tweet annotating story s1.]
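The slides do not give the details of the segment tree, so the following is one plausible reading rather than the authors' implementation: each node stores the minimum μ_s over a range of posting-list positions, and a whole subtree is skipped when even its smallest μ_s is at least the tweet's score upper bound. The class name `MinSegmentTree` and its methods are hypothetical.

```python
class MinSegmentTree:
    """Min segment tree over the mu values of one posting list (illustrative sketch)."""

    def __init__(self, mu_values):
        values = list(mu_values)
        self.n = len(values)
        self.size = 1
        while self.size < self.n:
            self.size *= 2
        # Padding leaves hold +inf so they can never look like candidates.
        self.tree = [float("inf")] * (2 * self.size)
        self.tree[self.size:self.size + self.n] = values
        for i in range(self.size - 1, 0, -1):
            self.tree[i] = min(self.tree[2 * i], self.tree[2 * i + 1])

    def update(self, pos, mu):
        """Call when the worst annotating tweet of the story at `pos` changes."""
        i = self.size + pos
        self.tree[i] = mu
        i //= 2
        while i >= 1:
            self.tree[i] = min(self.tree[2 * i], self.tree[2 * i + 1])
            i //= 2

    def candidates(self, upper_bound):
        """Positions with mu < upper_bound, skipping segments where upper_bound <= min mu."""
        found = []
        stack = [1]
        while stack:
            node = stack.pop()
            if self.tree[node] >= upper_bound:
                continue  # skipping condition holds for every story in this segment
            if node >= self.size:
                found.append(node - self.size)
            else:
                stack.append(2 * node + 1)
                stack.append(2 * node)  # left child is popped first: ascending order
        return found


posting_list = ["s1", "s2", "s3", "s4", "s5"]
# Assumption: a story with fewer than k tweets would store mu = 0 so it is never skipped.
tree = MinSegmentTree([0.9, 0.2, 0.8, 0.7, 0.1])
print([posting_list[p] for p in tree.candidates(0.5)])
```

Only the stories returned by `candidates` need to be fully scored against the tweet; after an insertion raises a story's μ_s, `update` keeps the tree consistent.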

11. Score(story, tweet)
- Content-based matching (cosine similarity, BM25)
- Time-based decay factor: every fixed time interval, the score is divided by 2
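A minimal sketch of such a score, assuming cosine similarity over sparse term-weight vectors and an exponential decay with a configurable half-life. The function names, the vector representation, and the `half_life_seconds` parameter are illustrative assumptions; the slides do not state the actual interval or term weighting.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors given as term -> weight dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def decayed_score(story_vec, tweet_vec, tweet_age_seconds, half_life_seconds):
    """Content score multiplied by a decay that halves every `half_life_seconds`."""
    decay = 0.5 ** (tweet_age_seconds / half_life_seconds)
    return cosine(story_vec, tweet_vec) * decay


story = {"election": 1.0, "vote": 1.0}
tweet = {"vote": 1.0}
# Assumed half-life of one hour, purely for illustration.
print(decayed_score(story, tweet, tweet_age_seconds=3600, half_life_seconds=3600))
```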

12. Test Collection
- 100K articles from a single day
  - Each article has a title, an abstract, and a main body
- 35M tweets from the same day, containing only ASCII characters
  - About 24K tweets/minute

13. Fraction of related tweets that actually matter
- We measured: 38 new tweets related to the average story per minute
- For 100K stories: 38 × 100K = 3.8M related tweets per minute, counted per story
  - This would be the number of invalidations in real-time indexing with caching
  - Many (expensive) queries of the Tweet Index or, alternatively, stale annotations
- Fraction of related tweets that actually become annotations: 5 orders of magnitude less!
- Important to efficiently identify the stories a tweet will actually annotate

14. Skipping: 10x reduction in processing time
[Chart: processing time of our algorithm with skipping vs. our algorithm without skipping.]

15. Summary
- Annotating news stories with social updates in real time
- Top-k pub-sub: stories are indexed as subscriptions, tweets are events
  - Scalable, fast annotation serving
  - Low-latency tweet processing, off the critical serving path!
- Method to convert top-k retrieval algorithms to top-k pub-sub
  - Demonstrated using 4 popular algorithms
  - Skipping works: up to 10x latency reduction
- Can use top-k pub-sub for 'top' stories, caching for others
- Many potential applications
  - Examples: alerts, personalized news feed, etc.

16. Thank you! Alex Shraer shralex@google.com