Slide1
Authors:
Wei Zhang, Texas Tech University
Suren Byna, Lawrence Berkeley National Laboratory
Houjun Tang, Lawrence Berkeley National Laboratory
Brody Williams, Texas Tech University
Yong Chen, Texas Tech University
Topic:
MIQS: Metadata Indexing and Querying Service for Self-Describing File Formats
The 31st International Conference for High Performance Computing, Networking, Storage, and Analysis (SC ’19, Denver, CO)
Date:
November 19th, 2019
Slide2
Data Management in Scientific Applications
Scientific Applications
  Experiments
  Observations
Ever-increasing Data
  Size of the files
  Number of the files
  Variety of the files
Slide3
Self-describing Data Formats
Metadata is stored alongside the data objects
Examples: HDF5, netCDF, ADIOS-BP, ASDF, etc.
Slide4
Metadata Search over HDF5 File Collection
File System Hierarchy (Directory, HDF5 Files) + Object Hierarchy (Groups, Datasets) = HDF5 File Collection Hierarchy
Find data objects with the metadata attribute “brightness = 60”
Find data files containing any data object with the attribute “temperature = 50”
Slide5
Metadata Search by Scanning Data Files
Drawbacks
  Time-consuming file scanning process, which grows with:
    Size of the data objects
    Number of the data files
Slide6
Metadata Search with Database
Violating self-contained data management principle
  Deployment Effort
  Data Model Adaption
  Maintenance Demand
  Storage Redundancy
  Performance Issues
  Poor Portability & Mobility

Name         Type         Data Source       DB System  DB Type
SPOT Suite   Tomography   HDF5              MongoDB    NoSQL
JAMO         Genomics     HDF5              MongoDB    NoSQL
BIMM         Biomedical   Biomedical Image  MySQL      RDBMS
EMPRESS 2.0  General      HDF5, netCDF      SQLite     RDBMS
Slide7
A New Era of Metadata Search
Self-contained Metadata Index
Minimal Complexity:
  Metadata schema
  Metadata search process
Portability & Mobility
  Minimal Storage Requirement
Performance Gains
  Direct Access to Metadata Index
Slide8
MIQS – Metadata Indexing and Querying Service
Overview
  In-memory Index
  Index Persistence
  Index Loading
  Metadata Search Service
Highlights
  Minimal Complexity
  Portability & Mobility
  Minimal Storage Requirement
  Performance Gains
Slide9
In-memory Index
Path Lists
  IDs instead of repetitive path strings
ART – Adaptive Radix Tree
  Saves space for strings with a common prefix
  Exact query
  Affix-based query
SBST – Self-balancing Search Tree
  Efficient exact query
  Possible range query
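The roles of the three structures on this slide can be illustrated with a minimal Python sketch. It is not the MIQS implementation: a sorted key list with binary search stands in for both the ART and the SBST, and the class names `PathList` and `SortedKeyIndex`, as well as the sample paths and attribute names, are hypothetical.

```python
import bisect


class PathList:
    """Stores each distinct path string once and hands out integer IDs."""

    def __init__(self):
        self._ids = {}
        self._paths = []

    def intern(self, path):
        # Reuse the existing ID if the path was seen before.
        if path not in self._ids:
            self._ids[path] = len(self._paths)
            self._paths.append(path)
        return self._ids[path]

    def lookup(self, path_id):
        return self._paths[path_id]


class SortedKeyIndex:
    """Stand-in for ART/SBST (assumption): sorted keys plus a postings dict."""

    def __init__(self):
        self._keys = []
        self._postings = {}

    def add(self, key, posting):
        if key not in self._postings:
            bisect.insort(self._keys, key)
            self._postings[key] = []
        self._postings[key].append(posting)

    def exact(self, key):
        return list(self._postings.get(key, []))

    def prefix(self, prefix):
        # Keys sharing a prefix are contiguous in sorted order.
        start = bisect.bisect_left(self._keys, prefix)
        matches = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            matches.append(key)
        return matches

    def range(self, low, high):
        # Inclusive range over the sorted keys.
        lo = bisect.bisect_left(self._keys, low)
        hi = bisect.bisect_right(self._keys, high)
        return self._keys[lo:hi]


paths = PathList()
a = paths.intern("/data/run1/a.h5")
b = paths.intern("/data/run1/b.h5")

names = SortedKeyIndex()
names.add("brightness", a)
names.add("bright_ratio", b)
names.add("author", a)

temps = SortedKeyIndex()
for value, path_id in [(50, a), (60, b), (75, a)]:
    temps.add(value, path_id)

print(names.prefix("bright"))
print(temps.range(55, 80))
```

A real radix tree would additionally share storage for common prefixes, which the sorted list above does not do; the sketch only shows which query types each structure is meant to answer.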
Slide10
Index Construction – Initial Indexing
Recursive scan of the file collection
Process r indexes a file only when file_counter % n == r
  file_counter: number of files encountered
  n: number of processes
  r: process rank
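The round-robin rule `file_counter % n == r` can be sketched in a few lines of Python. The function names `scan` and `select_for_rank`, the `.h5` suffix filter, and the sample file names are assumptions made for illustration; the only part taken from the slide is the modulo test.

```python
import os


def scan(root, suffix=".h5"):
    """Recursively yield data files under root in a deterministic order."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # every process must visit directories in the same order
        for name in sorted(filenames):
            if name.endswith(suffix):
                yield os.path.join(dirpath, name)


def select_for_rank(paths, rank, num_procs):
    """Keep only the files for which file_counter % num_procs == rank."""
    selected = []
    for file_counter, path in enumerate(paths):
        if file_counter % num_procs == rank:
            selected.append(path)
    return selected


# Hypothetical file names standing in for the output of scan(root).
all_files = ["f0.h5", "f1.h5", "f2.h5", "f3.h5", "f4.h5"]
shares = [select_for_rank(all_files, r, 3) for r in range(3)]
print(shares)
```

Because every process walks the collection in the same order, no communication is needed to agree on the assignment, and each file is indexed by exactly one process.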
Slide11
Index Construction – Compact Index File
Making mobility possible
  Small index files
Keeping compact layout
  Grouping value blocks in attribute block
  Grouping FOIPs in value block
  Use ID instead of path strings
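The nesting described here (an attribute block containing value blocks, each value block containing FOIPs stored as IDs) can be illustrated with a small serialization sketch. The byte layout below is invented for illustration and is not the MIQS index file format; `encode_attribute_block` and `decode_attribute_block` are hypothetical names, and only string values are handled.

```python
import struct


def encode_attribute_block(name, values):
    """Pack one attribute block.

    values maps a string value to a list of (file_id, object_id) pairs.
    Assumed layout: name, value count, then one value block per value,
    each holding its pairs as two unsigned 32-bit integers.
    """
    name_bytes = name.encode("utf-8")
    parts = [struct.pack("<I", len(name_bytes)), name_bytes,
             struct.pack("<I", len(values))]
    for value, pairs in values.items():
        value_bytes = value.encode("utf-8")
        parts.append(struct.pack("<I", len(value_bytes)))
        parts.append(value_bytes)
        parts.append(struct.pack("<I", len(pairs)))
        for file_id, object_id in pairs:
            parts.append(struct.pack("<II", file_id, object_id))
    return b"".join(parts)


def decode_attribute_block(buf):
    """Inverse of encode_attribute_block."""
    offset = 0

    def read_u32():
        nonlocal offset
        (number,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        return number

    def read_bytes(length):
        nonlocal offset
        data = buf[offset:offset + length]
        offset += length
        return data

    name = read_bytes(read_u32()).decode("utf-8")
    values = {}
    for _ in range(read_u32()):
        value = read_bytes(read_u32()).decode("utf-8")
        count = read_u32()
        values[value] = [(read_u32(), read_u32()) for _ in range(count)]
    return name, values


block = encode_attribute_block("author", {"Oscar": [(0, 1), (2, 5)]})
decoded = decode_attribute_block(block)
print(len(block), decoded)
```

Storing two integer IDs per pair instead of two path strings is what keeps each value block small, regardless of how long the underlying file and object paths are.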
Slide12
Index Construction – Index File Read/Write
Index Persistence
  Process r writes index file r
Retrospective loading
  Process r loads index file (n+r-s-1)%n (s < n-1)
Index Recovery
  Process r loads index file (n+r-s-1)%n (s < n)
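The two index-file formulas can be checked with a short Python sketch. It assumes that `s` is a step counter starting at 0 and that the formula names the index file a process reads at step `s`; the slide does not spell this out, and the function names are hypothetical.

```python
def retrospective_load_order(rank, num_procs):
    """Index files read by process `rank` for s = 0 .. n-2 (assumed steps)."""
    return [(num_procs + rank - s - 1) % num_procs
            for s in range(num_procs - 1)]


def recovery_load_order(rank, num_procs):
    """Index files read by process `rank` for s = 0 .. n-1 (assumed steps)."""
    return [(num_procs + rank - s - 1) % num_procs
            for s in range(num_procs)]


retro = retrospective_load_order(rank=1, num_procs=4)
recover = recovery_load_order(rank=1, num_procs=4)
print(retro, recover)
```

Under this reading, retrospective loading visits every index file except the process's own (which it already holds in memory), while recovery visits all n files and reaches the process's own file last. At any given step, different ranks read different files, so no two processes contend for the same index file.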
Slide13
Serving Queries
Lookup steps
  Attribute name: ART
  Numeric value: SBST
  String value: ART
  Retrieve list of FOIPs
  File paths: File Path List
  Object paths: Object Path List
Example
  Queries: Brightness=50, Author=“Oscar”
  Brightness, 50: [(0,1)]
  Author, Oscar: [(0,1)]
  Paths: ”/home/Oscar/data/test.hdf5”, ”/2019/05/2/B/pixel.fit”
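The lookup chain on this slide (attribute name, then value, then FOIPs, then path lists) can be sketched with plain Python dictionaries and lists. Dictionaries stand in for the ART and SBST here, and the paths, IDs, and the `query` function are hypothetical illustration values rather than the slide's own example.

```python
# Path lists: each path is stored once and referenced by its position.
file_paths = ["/data/run1/a.h5", "/data/run1/b.h5"]
object_paths = ["/group1/image", "/group2/spectrum"]

# Two-level index: attribute name -> value -> list of (file_id, object_id).
# Plain dicts replace the ART (names, string values) and SBST (numbers).
index = {
    "Brightness": {50: [(0, 1)], 60: [(1, 0)]},
    "Author": {"Oscar": [(0, 1), (1, 1)]},
}


def query(attribute, value):
    """Resolve an exact-match query to (file path, object path) pairs."""
    pairs = index.get(attribute, {}).get(value, [])
    return [(file_paths[f], object_paths[o]) for f, o in pairs]


bright = query("Brightness", 50)
oscar = query("Author", "Oscar")
print(bright)
print(oscar)
```

Path strings are only materialized in the last step, so the index itself never repeats them.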
Slide14
Evaluation – Platform & Control Experiment
MIQS Evaluation Platform
  Supercomputer: Edison
  CPU: 12 * Ivy Bridge @2.4GHz
  Memory: 64GB DDR3 1866
  Network: 23.7TB/s global bandwidth
  Lustre: 30PB @ 700GB peak I/O
Control experiment: MIQS vs. MongoDB
  NoSQL
  Flexible Data Schema
  State-of-the-art
MongoDB Evaluation Platform
  Host machine: Different from Edison
  CPU: 16 * Haswell @2.3GHz
  Memory: 128GB DDR4 2133
  Network: 56Gb/s bandwidth
  HDD: 6TB 7200rpm 6Gb/s SAS
  MongoDB Storage Engine: WiredTiger with data compression
Slide15
Evaluation – Dataset
Baryon Oscillation Spectroscopic Survey (BOSS)
  100 HDF5 files
  145 GB
  144 million attributes
  1.5 million data objects
Slide16
Evaluation – Indexing Time
16 attributes in MongoDB
  5-9 min
16 attributes in MIQS
  50% Indexing Time Reduction (initial indexing)
  99% Indexing Time Reduction (index recovery)
You can also:
  Index all attributes in MIQS in 8-14 min
  Recover the index of all attributes within 2 min
Slide17
Evaluation – Indexing Time (Break-down)
Scanning Time: roughly equal (MIQS vs. MongoDB)
MongoDB
  Inserting BSON: 3 - 6 min
MIQS, 16 attributes
  In-memory index: 0.5 - 1 min
  Persistent index: negligible
MIQS, all attributes
  In-memory index: 5 - 8 min
  Persistent index: negligible
MIQS index recovery
  16 attributes: < 40 s
  All attributes: < 2 min
Figures: MongoDB Indexing Time (16 attributes); MIQS Indexing Time (16 attributes); MIQS Indexing Time (all attributes); MIQS Index Recovery Time
Slide18
Evaluation – Query Performance
Latency:
  MongoDB: 5 min at maximum scale
  MIQS: 0.29 ms at maximum scale
Throughput:
  MongoDB: 319 kQPS at maximum scale
  MIQS: 363 billion QPS at maximum scale
Figures: Query Latency Comparison (16 attributes); Query Throughput Comparison (16 attributes)
Slide19
Evaluation – Memory Consumption
MongoDB:
  Up to 4.2GB
MIQS:
  16 attributes: up to 600 MB
  All attributes: up to 7.8GB
You can:
  Save space
  Index more attributes
Figures: MongoDB Memory Consumption (16 attributes); MIQS Memory Consumption (16 attributes); MIQS Memory Consumption (all attributes)
Slide20
Conclusion
Problems: existing approaches either have no metadata indexing or are not self-contained.
MIQS – Self-contained Metadata Indexing and Querying Service
Benefits:
  Minimal Complexity
  Portability & Mobility
  Minimal Storage Requirements
  Performance Gains
Future Work
  Integrating compact index file layout into HDF5
  More types of queries
  Performance improvement
Embracing a new era of metadata search
Slide21
Follow Up
Paper: http://bit.ly/SC19-MIQS
ACM Digital Library:
Contact Us:
DISCL @ TTU: https://discl.cs.ttu.edu
  X-Spirit.zhang@ttu.edu
  Brody.Williams@ttu.edu
  Yong.Chen@ttu.edu
SDM Group @ LBNL: https://sdm.lbl.gov
  htang4@lbl.gov
  sbyna@lbl.gov
Acknowledgement:
Many thanks to the audience and to the paper reviewers who provided valuable comments.
This research is supported in part by the National Science Foundation under grants CNS-1338078, CNS-1362134, CCF-1409946, CCF-1718336, OAC-1835892, and CNS-1817094. This work is supported in part by the Director, Office of Science, Office of Advanced Scientific Computing Research, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231 (Project: EOD-HDF5: Experimental and Observational Data enhancements to HDF5; Program managers: Dr. Laura Biven and Dr. Lucy Nowell). This research used resources of the National Energy Research Scientific Computing Center (NERSC), a DOE Office of Science User Facility.