Efficient Document Analytics on Compressed Data: Method, Challenges, Algorithms, Insights




Presentation Transcript

1. Efficient Document Analytics on Compressed Data: Method, Challenges, Algorithms, Insights
Feng Zhang †⋄, Jidong Zhai ⋄, Xipeng Shen #, Onur Mutlu ⋆, Wenguang Chen ⋄
† Renmin University of China, ⋄ Tsinghua University, # North Carolina State University, ⋆ ETH Zürich

2. Motivation
Every day, 2.5 quintillion bytes of data are created; 90% of the data in the world today has been created in the last two years alone [1].
How can we perform efficient document analytics when data are extremely large?
[1] What is Big Data? https://www-01.ibm.com/software/data/bigdata/what-is-big-data.html

3. Outline
Introduction
  - Motivation & Example
  - Compression-Based Direct Processing
  - Challenges
Guidelines and Techniques
  - Solution Overview
  - Guidelines
Evaluation
  - Benchmarks
  - Results
Conclusion

4. Motivation
How can we perform efficient document analytics when data are extremely large?
Challenge 1 (SPACE): large space requirement.
Challenge 2 (TIME): long processing time.

5. Motivation
Observation: using a hash table to check for redundant content in the Wikipedia dataset shows that much of the content repeats.
Can we perform analytics directly on compressed data? The key opportunity is REUSE.

6. Our Idea
Compression-based direct processing: the Sequitur algorithm meets our requirements.
(a) Original data: a b c a b d a b c a b d a b a
(b) Sequitur compressed data (rules):
    R0 → R1 R1 R2 a
    R1 → R2 c R2 d
    R2 → a b
(c) DAG representation: R0 is the root, with edges to R1 (twice), R2, and a; R1 points to R2 (twice), c, and d; R2 points to a and b.
(d) Numerical representation: a: 0, b: 1, c: 2, d: 3, R0: 4, R1: 5, R2: 6
(e) Compressed data in numerical IDs:
    4 → 5 5 6 0
    5 → 6 2 6 3
    6 → 0 1
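As a minimal illustration (assumed C++, not the paper's CompressDirect code), the grammar above can be stored in its numerical form and expanded back to the original data by a depth-first walk:

    // A minimal sketch: the Sequitur grammar of slide 6, stored in the
    // numerical form of (e). IDs 0..3 are the terminals a..d; IDs 4..6
    // are the rules R0..R2.
    #include <cstdio>
    #include <vector>

    const int NUM_TERMINALS = 4;              // a=0, b=1, c=2, d=3
    const std::vector<std::vector<int>> rules = {
        {5, 5, 6, 0},                         // R0 (id 4) -> R1 R1 R2 a
        {6, 2, 6, 3},                         // R1 (id 5) -> R2 c R2 d
        {0, 1}                                // R2 (id 6) -> a b
    };

    bool isRule(int id) { return id >= NUM_TERMINALS; }

    // Decompress by depth-first expansion, printing terminals in order.
    void expand(int id) {
        if (!isRule(id)) { std::printf("%c ", 'a' + id); return; }
        for (int sym : rules[id - NUM_TERMINALS]) expand(sym);
    }

    // Prints the original data: a b c a b d a b c a b d a b a
    int main() { expand(4); std::printf("\n"); }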

7. Double Benefits
In the DAG, R1 and R2 appear more than once but are stored only once, which addresses Challenge 1 (space).
They also appear more than once but are computed only once, which addresses Challenge 2 (time).

8. Optimization
We can make the representation more compact. Some applications do not need to keep the sequence of symbols; for them, each rule can store only its distinct elements with their weights (occurrence counts):
    R0: R1 (x2), R2 (x1), a (x1)
    R1: R2 (x2), c (x1), d (x1)
    R2: a (x1), b (x1)
A weight table records how often each rule itself appears. Within each rule, the sequence information can thus be removed too, which further saves storage space and computation time. A small sketch of this order-free form follows.
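As a minimal illustration (assumed C++ structures, not the paper's code), the order-free form of the grammar can be stored as per-rule maps from symbol ID to local weight:

    // Order-free rule representation: symbol id -> count within the rule.
    #include <cstdio>
    #include <map>
    #include <vector>

    using Rule = std::map<int, int>;          // symbol id -> local weight

    const std::vector<Rule> orderFreeRules = {
        {{5, 2}, {6, 1}, {0, 1}},             // R0: R1 x2, R2 x1, a x1
        {{6, 2}, {2, 1}, {3, 1}},             // R1: R2 x2, c x1, d x1
        {{0, 1}, {1, 1}}                      // R2: a x1, b x1
    };

    int main() {                              // print each rule's weighted elements
        for (size_t r = 0; r < orderFreeRules.size(); ++r) {
            std::printf("R%zu:", r);
            for (auto& [sym, w] : orderFreeRules[r]) std::printf(" (%d x%d)", sym, w);
            std::printf("\n");
        }
    }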

9. Example: Word Count
Word counts propagate bottom-up through the CFG relation; at each step, a rule's word table holds pairs <w, i> (word, count):
Step 1. R2 → a b yields <a,1>, <b,1>.
Step 2. R1 → R2 c R2 d uses R2's table twice: <a, 1×2> = <a,2>, <b, 1×2> = <b,2>, plus <c,1>, <d,1>.
Step 3. R0 → R1 R1 R2 a combines R1's table twice, R2's table once, and one direct a:
    <a, 2×2 + 1 + 1> = <a, 6>
    <b, 2×2 + 1> = <b, 5>
    <c, 1×2> = <c, 2>
    <d, 1×2> = <d, 2>
Final word table: <a,6>, <b,5>, <c,2>, <d,2>.
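A minimal sketch of this propagation (assumed C++; rule IDs as in slide 6): each rule's word table is computed once, memoized, and reused wherever the rule appears, which is exactly the source of the time savings:

    // Word count directly on the Sequitur grammar: each rule's word table
    // is computed once (memoized) and reused at every appearance.
    #include <cstdio>
    #include <map>
    #include <vector>

    const int NUM_TERMINALS = 4;              // a=0, b=1, c=2, d=3
    const std::vector<std::vector<int>> rules = {
        {5, 5, 6, 0},                         // R0 -> R1 R1 R2 a
        {6, 2, 6, 3},                         // R1 -> R2 c R2 d
        {0, 1}                                // R2 -> a b
    };
    std::map<int, std::map<int, long>> memo;  // rule id -> word table

    const std::map<int, long>& wordTable(int id) {
        auto it = memo.find(id);
        if (it != memo.end()) return it->second;   // already computed once
        std::map<int, long> table;
        for (int sym : rules[id - NUM_TERMINALS]) {
            if (sym < NUM_TERMINALS)
                table[sym] += 1;                   // direct terminal occurrence
            else
                for (auto& [w, n] : wordTable(sym)) table[w] += n;
        }
        return memo[id] = table;
    }

    int main() {
        for (auto& [w, n] : wordTable(4))          // start from root rule R0
            std::printf("<%c, %ld>\n", 'a' + w, n); // <a,6> <b,5> <c,2> <d,2>
    }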

10. Challenges
- Unit sensitivity: how to organize data.
- Parallelism barriers: how to achieve parallelism on large datasets.
- Order sensitivity: how to accommodate order for applications that are sensitive to it.
- Data attributes: how to utilize the attributes of datasets.
- Reuse of results across nodes.
- Overhead in saving and propagating results.

11. Outline
Introduction
  - Motivation & Example
  - Compression-Based Direct Processing
  - Challenges
Guidelines and Techniques
  - Solution Overview
  - Guidelines
Evaluation
  - Benchmarks
  - Results
Conclusion

12. Solution Overview
Solution techniques:
- Adaptive traversal order and information to propagate
- Compression-time indexing
- Double compression
- Load-time coarsening
- Two-level table with depth-first traversal
- Coarse-grained parallel algorithm and automatic data partition
- Double-layered bit vector for footprint minimization
These address the challenges above: unit sensitivity, parallelism barriers, order sensitivity, data attributes, reuse of results across nodes, and overhead in saving and propagating.

13. Data Attributes
Problem: how to utilize the attributes of datasets.
Guideline I: minimize the footprint size.
Guideline II: traversal order is essential for efficiency; the best traversal order may depend on the data attributes of the input.
[Figure: a decision tree that picks the traversal order from two attributes, average file size (threshold 2860) and number of files (threshold 800), choosing among postorder, preorder with a regular bitmap, and preorder with a double-layered (2lev) bitmap.]
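A hypothetical sketch of such a selection heuristic (assumed C++): the thresholds 2860 and 800 come from the slide's decision tree, but which attribute pairs with which branch is an illustrative assumption, not the paper's exact rule:

    #include <cstdio>

    enum class Strategy { Postorder, PreorderRegularBitmap, Preorder2LevelBitmap };

    // Illustrative only: branch assignments are assumptions made to show
    // how data attributes can drive the traversal-order choice.
    Strategy chooseTraversal(double avgFileSize, long numFiles) {
        if (avgFileSize > 2860)               // assumption: large files -> postorder
            return Strategy::Postorder;
        return (numFiles <= 800) ? Strategy::PreorderRegularBitmap
                                 : Strategy::Preorder2LevelBitmap;
    }

    int main() {
        std::printf("%d\n", static_cast<int>(chooseTraversal(5000, 100)));
    }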

14. Parallelism Barriers
Problem: how to achieve parallelism on large datasets.
Guideline III: a coarse-grained distributed implementation is preferred.
[Figure: input files f1, f2, ..., f6 are automatically partitioned across Spark RDDs (RDD 1, RDD 2, RDD 3, ...); large files such as f5 are split across partitions (f5:part1, f5:part2, f5:part3). Each partition is streamed through a C++ program via pipe(), and SparkContext gathers the final results.]
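A hypothetical per-partition worker for Spark's pipe() (assumed C++, not the paper's CompressDirect code): Spark streams a partition's records to the worker's stdin, one per line, and merges what the worker writes to stdout (e.g., with reduceByKey). For brevity this worker counts plain words; in the paper's design, each worker instead runs the compression-based processing over its partition:

    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>

    int main() {
        std::map<std::string, long> counts;
        std::string line, word;
        while (std::getline(std::cin, line)) { // one record per line from Spark
            std::istringstream in(line);
            while (in >> word) counts[word]++;
        }
        for (auto& [w, n] : counts)            // partial results back to Spark
            std::cout << w << '\t' << n << '\n';
    }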

15. Order Sensitivity (detailed in paper)
Problem: some applications are sensitive to the order of the input.
Guideline IV: use depth-first traversal and a two-level table design.
[Figure: the root node spans file0, file1, file2, ...; each rule (R1, R2, R3, ...) keeps a local sequence table of its words, and a global sequence table links the local tables so the original order can be recovered across files.]
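A minimal sketch of the depth-first idea (assumed C++, reusing the grammar of slide 6): a depth-first expansion assigns every terminal its global position, recovering the original word order that order-sensitive tasks need. The paper's two-level sequence tables are a more elaborate, per-file version of this idea:

    #include <cstdio>
    #include <vector>

    const int NUM_TERMINALS = 4;                  // a=0, b=1, c=2, d=3
    const std::vector<std::vector<int>> rules = {
        {5, 5, 6, 0}, {6, 2, 6, 3}, {0, 1}        // R0, R1, R2 from slide 6
    };

    // Depth-first traversal: terminals are visited in their original order.
    void dfs(int id, long& pos) {
        if (id < NUM_TERMINALS) {
            std::printf("pos %ld: %c\n", pos++, 'a' + id);
            return;
        }
        for (int sym : rules[id - NUM_TERMINALS]) dfs(sym, pos);
    }

    int main() { long pos = 0; dfs(4, pos); }     // positions 0..14 of a..d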

16. Unit Sensitivity (detailed in paper)
Problem: how to organize data.
Guideline V: use a double-layered bitmap if unit information needs to be passed across the CFG.
[Figure: level 1 is a bit array paired with a pointer array (P0, P1, P3, ...; unused entries are NULL); each set level-1 bit points to a level-2 array of N bits, so regions with no set bits cost only one bit and a null pointer.]
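A minimal sketch of such a double-layered bitmap (assumed C++ layout, not the paper's code): level 1 holds one bit and one pointer per chunk of N units (e.g., files); a level-2 block of N bits is allocated only when some unit in that chunk is present, which keeps the footprint small when membership is sparse:

    #include <cstdio>
    #include <memory>
    #include <vector>

    class TwoLevelBitmap {
        static const size_t N = 64;                    // units per level-2 block
        std::vector<bool> level1;                      // one bit per chunk
        std::vector<std::unique_ptr<std::vector<bool>>> blocks; // lazily allocated
    public:
        explicit TwoLevelBitmap(size_t numUnits)
            : level1((numUnits + N - 1) / N, false),
              blocks((numUnits + N - 1) / N) {}

        void set(size_t unit) {
            size_t chunk = unit / N;
            if (!blocks[chunk]) {                      // allocate on first use
                blocks[chunk] = std::make_unique<std::vector<bool>>(N, false);
                level1[chunk] = true;
            }
            (*blocks[chunk])[unit % N] = true;
        }

        bool test(size_t unit) const {
            size_t chunk = unit / N;
            return level1[chunk] && (*blocks[chunk])[unit % N];
        }
    };

    int main() {
        TwoLevelBitmap bm(1000);
        bm.set(3); bm.set(700);
        std::printf("%d %d %d\n", bm.test(3), bm.test(4), bm.test(700)); // 1 0 1
    }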

17. Short Summary of the Six Guidelines
- Data attributes challenge: Guideline II
- Parallelism barriers: Guideline III
- Order sensitivity: Guideline IV
- Unit sensitivity: Guideline V
- General insights and common techniques: Guideline I and Guideline VI

18. Outline
Introduction
  - Motivation & Example
  - Compression-Based Direct Processing
  - Challenges
Guidelines and Techniques
  - Solution Overview
  - Guidelines
Evaluation
  - Benchmarks
  - Results
Conclusion

19. Benchmarks
Six benchmarks: Word Count, Inverted Index, Sequence Count, Ranked Inverted Index, Sort, Term Vector.
Five datasets: 580 MB to 300 GB.
Two platforms: a single node and a Spark cluster (10 nodes on Amazon EC2).

20. Time Savings
CompressDirect yields a 2X speedup, on average, over processing the original, uncompressed data.

21. Space Savings
CompressDirect achieves an 11.8X compression ratio, even higher than gzip's (compression ratio = original size / compressed data size).

22. Conclusion
- Our method: compression-based direct processing.
- How the concept can be materialized on Sequitur.
- The major challenges, and guidelines for addressing them.
- Our library, CompressDirect, which helps further ease the required development effort.

23. Thanks! Any questions?
Feng Zhang †⋄, Jidong Zhai ⋄, Xipeng Shen #, Onur Mutlu ⋆, Wenguang Chen ⋄
† Renmin University of China, ⋄ Tsinghua University, # North Carolina State University, ⋆ ETH Zürich