Presentation Transcript

1. Scaling Metadata of Distributed File System (shared by Yiduo Wang)

2. What do we talk about when we talk about file systems?

3. File System
What do we talk about when we talk about file systems? A system for storing files.
Why is a filesystem different from other stores (KV, database, object...)?

4. File System Standard
Since the 1980s: the POSIX[1] filesystem standard
- A set of system calls with defined behaviors: open(), read(), write()...
- File types: directory, file, hard link, symbolic link
- Permissions: ACLs, e.g., 755 (rwxr-xr-x)...
- Attributes: e.g., atime, mtime, ctime...
- Atomicity and consistency: e.g., written content should be visible after flush()
[1] The Open Group Base Specifications Issue 7, 2018 edition. https://pubs.opengroup.org/onlinepubs/9699919799/

5. File System Standard (cont.)
All of the above, plus:
- Provides a global namespace
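To make the listed calls and attributes concrete, here is a minimal sketch using Python's os module; the file name demo.txt and the mode bits are illustrative only, not part of the slides.

```python
import os

fd = os.open("demo.txt", os.O_CREAT | os.O_WRONLY, 0o644)  # create with rw-r--r--
os.write(fd, b"hello")  # content written...
os.fsync(fd)            # ...should be visible and durable after a flush
os.close(fd)

st = os.stat("demo.txt")        # attributes live in the inode
print(oct(st.st_mode & 0o777))  # permission bits
print(st.st_atime, st.st_mtime, st.st_ctime)
```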

6. File System Namespace
Namespace: a directory tree from the root, or a DAG with the unique root "/".
[Diagram: a tree rooted at "/" containing Dir1, Dir2, File1 (data), File2 (data), and a valid symbolic link (✓)]

7. File System Namespace (cont.)
[Diagrams: a DAG with a symbolic link under the unique root "/" is valid (✓); a second graph including Dir3 without a unique root is not (✗)]

8. What is File System Metadata?

9. File System Metadata
What is metadata?
Meta-X: the X of X
- Metaphysics: the physics of physics
- Metamodel: the model of models

10. File System Metadata (cont.)
Metadata: the data of data.
In a filesystem: names, parent, access time, ACLs, blocks...
[Diagram: the directory tree (Dir1, Dir2, symlink) is metadata; the contents of File1 and File2 are data]
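As a toy illustration of the fields just listed (names, parent, times, ACLs, blocks), here is a hypothetical in-memory record; the field names are illustrative and do not follow any real on-disk format.

```python
from dataclasses import dataclass, field

@dataclass
class Inode:
    name: str                 # entry name, e.g. "File1"
    parent: int               # inode number of the parent directory
    mode: int                 # permission bits / ACL summary, e.g. 0o644
    atime: float              # last access time
    mtime: float              # last modification time
    blocks: list = field(default_factory=list)  # addresses of the data blocks

# Everything in Inode is metadata; the bytes inside the blocks are the data.
file1 = Inode(name="File1", parent=2, mode=0o644, atime=0.0, mtime=0.0)
```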

11. How does a file system access metadata?

12-15. File System Hierarchy (diagram, built up across four slides)
- User space: applications issue POSIX system calls: read(), write(), open(), chmod()...
- Kernel space: the VFS (virtual file system) exposes the VFS interface: read(), write(), create(), setattr()...
- Below the VFS:
  - Block filesystems: ext2, ext3, ext4, XFS, Btrfs...
  - Network filesystems: NFS, CephFS, Lustre...
  - FUSE: ceph-fuse, lustre-fuse...
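A toy model of this layering: the VFS routes each path to the filesystem mounted at the longest matching prefix, then forwards the call through a common interface. The Ext4/NFS classes and the mount table are hypothetical stand-ins for real kernel filesystems.

```python
class Ext4:
    def read(self, path: str) -> str:
        return f"ext4 read {path}"

class NFS:
    def read(self, path: str) -> str:
        return f"nfs read {path}"

MOUNTS = {"/": Ext4(), "/mnt/nfs": NFS()}  # hypothetical mount table

def vfs_read(path: str) -> str:
    # Dispatch to the filesystem mounted at the longest matching prefix.
    mount = max((m for m in MOUNTS if path.startswith(m)), key=len)
    return MOUNTS[mount].read(path)

print(vfs_read("/home/a.txt"))     # handled by Ext4
print(vfs_read("/mnt/nfs/b.txt"))  # handled by NFS
```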

16. Why are file systems always evolving?

17. Evolution: Hardware
Storage and network keep getting faster:
- Floppy disk: ~10 KB/s
- HDD: ~100 MB/s
- SSD: ~X00 MB/s
- NVMe SSD: ~X GB/s

18. Evolution: Scenario
Scenario: multi-user computer → multi-tenant datacenter → cloud
Workload trends:
- Massive numbers of small files: 95% of files are smaller than 96 KB[1]
- High metadata-operation ratio: meta-ops > 50% in 2000[1], > 85% in 2020
[1] A comparison of file system workloads. ATC '00

19. Evolution: Scalability
Capacity (single FS):
- Facebook: billions of files[1]
- Alibaba: tens of billions of files[2]
- LinkedIn: billions of objects[3]
Scale (single cluster):
- LinkedIn: tens of thousands of nodes[3]
- Summit: thousands of nodes[4]
[1] Facebook's Tectonic filesystem: Efficiency from exascale. FAST '21
[2] InfiniFS: An efficient metadata service for large-scale distributed filesystems. FAST '22
[3] The exabyte club: LinkedIn's journey of scaling the Hadoop Distributed File System. 2021
[4] Exascale deep learning for climate analytics. SC '18

20. Evolution: Scalability (cont.)
We have to scale the DFS, and especially its metadata!

21. How to scale metadata?

22. Scale Metadata
How can one client utilize resources across multiple nodes?

23. Scale Metadata: Coupled
Network filesystems, since 198X: NFS, CIFS, Sprite, Coda...
[Diagram: a client's NFS client mounts exports from two NFS servers (Server-1, Server-2), each backed by its local disk]

24. Scale Metadata: Coupled (cont.)
Limited (✗) in: scalability, flexibility, resource utilization, concurrency...

25. Scale FS: Decouple Metadata

26. Scale FS: Decouple Metadata
Motivation: a global and scalable filesystem service.
Approach: decouple metadata from the filesystem: HDFS, Lustre, MooseFS, GFS...

27. Scale FS: Decouple Metadata (cont.)
[Diagram: a metadata server (cluster maintenance, request resolution, metadata processing, metadata storage) in front of a data-server cluster holding file1-data, file2-data...]

28. Scale FS: Decouple Metadata (cont.)
Read path for file1: the client ① gets metadata from the metadata server, then ② gets the data directly from the data servers.

29. Scale FS: Decouple Metadata (cont.)
Benefits: independent metadata/data scaling; fast in-memory metadata processing.
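A minimal sketch of this two-step read path; MetadataServer, DataServer, and their methods are hypothetical stand-ins, not any real DFS API.

```python
class MetadataServer:
    """Hypothetical MDS: maps a file name to the data servers holding it."""
    def __init__(self):
        self.locations = {"file1": ["ds1", "ds2"]}
    def lookup(self, name: str):
        return self.locations[name]

class DataServer:
    """Hypothetical data server: returns its piece of the file."""
    def __init__(self, pieces):
        self.pieces = pieces
    def get(self, name: str) -> bytes:
        return self.pieces[name]

MDS = MetadataServer()
CLUSTER = {"ds1": DataServer({"file1": b"hel"}), "ds2": DataServer({"file1": b"lo"})}

def read_file(name: str) -> bytes:
    servers = MDS.lookup(name)                              # step 1: get metadata
    return b"".join(CLUSTER[s].get(name) for s in servers)  # step 2: get data

assert read_file("file1") == b"hello"
```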

30. New FS Semantics
HDFS proposes relaxed filesystem semantics for the DFS.
- POSIX semantics: local filesystems (ext4, XFS, Btrfs...), network filesystems (NFS, CIFS...), some distributed filesystems (Lustre, MooseFS...)
- HDFS semantics: distributed filesystems (HDFS, HopsFS...)

31. New FS Semantics (cont.)
HDFS proposes new semantics for the distributed scenario.

                   HDFS                               POSIX
FS model           directory, file, symlink           directory, file, symlink, hard link
R/W model          read & append-only write           read & random write
Path resolution    full path                          recursive (parent, name)
Scenario           big files; write once, read many   small files; write many, read many
Average file size  ~500 MB (500 PB : 1 billion)[1]    95% < 96 KB[2]
MDS : DS ratio     1 : 10000[1]                       1 : 4[3] / 1 : 10[4] / 1 : 40[5]

[1] The exabyte club: LinkedIn's journey of scaling the Hadoop Distributed File System. 2021
[2] A comparison of file system workloads. ATC '00
[3] Large-scale stability and performance of the Ceph file system. Linux Vault 2017
[4] The feature development and architecture evolvement of Lustre under new challenges. 2019
[5] Lustre scalability. LUG 2009

32. Decouple Metadata: Limitation
A single metadata server limits:
- Scalability: throughput and capacity
- Availability
To scale the metadata service: multi-MDS, from ~200X.

33. Scale FS: Partition Metadata

34. Challenge: How to Partition?
[Diagram: a metadata-server cluster in front of multiple data-server clusters]
Can we still just hash?

35. No Partition, Only Backup
Examples: HDFS, Farsite, ShardFS...
[Diagram: a cluster of replicated metadata servers]
Gains availability (✓), but remains limited in scalability, consistency... (✗)

36. Partition: Hash
Hash by full path / file name / directory name / attributes.
[Diagram: the entries /, Dir1, Dir2, File1, File2 are hashed across MDS1-MDS4]

37. Partition: Hash (cont.)
Multi-MDS systems with hash-table metadata management:

Name         Partition                  Conference    Open source
Lustre-DNE2  hash by full path                        ✓
BeeGFS       hash by entryID + fileID                 ✓
HBA & G-HBA  Bloom filter               TPDS '07
SoMeta       hash by file attribute     Cluster '17
GlusterFS    hash by file name
Tectonic     hash by dir name           FAST '21
InfiniFS     hash by dir name           FAST '22

38-40. Limitation: Hash
Does this method give the best scalability?
Advantage: very good load balance (✓)
Disadvantage: destroys locality (✗)
E.g., open(/Dir2/File2) must query MDS1 (for /), MDS2 (for Dir2), then MDS4 (for File2):
- More RPCs, higher latency[1]
- Expensive recursive operations[2]
[1] InfiniFS: An efficient metadata service for large-scale distributed filesystems. FAST '22
[2] Efficient metadata management in large distributed storage systems. MSST '03
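A sketch of why hashing destroys locality: resolving a path requires one lookup per component, each on whichever MDS its hash selects. The four-server layout mirrors the slide; the CRC32-based placement is an illustrative assumption, not any system's actual scheme.

```python
import zlib

MDS_COUNT = 4  # MDS1..MDS4, as in the diagram

def owner(name: str) -> int:
    # Illustrative placement: hash each entry name to one of the servers.
    return zlib.crc32(name.encode()) % MDS_COUNT

def resolve(path: str):
    # Path resolution must look up every component, each on its own owner.
    components = ["/"] + [p for p in path.split("/") if p]
    return [(c, f"MDS{owner(c) + 1}") for c in components]

# open(/Dir2/File2) can touch three different metadata servers, i.e. three RPCs:
print(resolve("/Dir2/File2"))  # e.g. [('/', 'MDS1'), ('Dir2', 'MDS2'), ('File2', 'MDS4')]
```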

41. Partition: Static Subtree
Bind each subtree to a specific MDS; a sketch follows below.
[Diagram: MDS1 holds /, Dir1, File1; MDS3 holds Dir2, File2]

Name         Partition        Conference   Open source
Lustre-DNE1  static subtree                ✓
CephFS-pin   static subtree   OSDI '06     ✓
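A sketch of static subtree partitioning under the pinning shown in the diagram: a path is served by the MDS pinned to its longest matching subtree, so a whole subtree resolves on one server. The pin table is illustrative.

```python
# Administrator-assigned pins, mirroring the diagram above.
PINS = {"/": "MDS1", "/Dir1": "MDS1", "/Dir2": "MDS3"}

def owner(path: str) -> str:
    # Serve the path from the longest pinned prefix that contains it.
    best = max((p for p in PINS
                if path == p or path.startswith(p.rstrip("/") + "/")),
               key=len)
    return PINS[best]

print(owner("/Dir1/File1"))  # MDS1: the whole subtree resolves on one server
print(owner("/Dir2/File2"))  # MDS3
```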

42-45. Partition: Dynamic Subtree
[Animation across four slides: as requests concentrate on one subtree, subtrees are migrated between metadata servers to rebalance the load]

46. Partition: Dynamic Subtree (cont.)
[Diagram: MDS1 holds /, Dir1, File1; MDS2 holds Dir2, File2]

Name     Partition                    Conference    Open source
CephFS   dynamic subtree              OSDI '06      ✓
IndexFS  dynamic subtree              SC '14        ✓
ADLS     range split by dir-inode     SIGMOD '17
CubeFS   range split by inode         SIGMOD '19    ✓
HopsFS   dynamic database partition   FAST '19      ✓

47. Limitation: Subtree
[Diagram: MDS1 holds /, Dir1, File1; MDS2 holds Dir2, File2]
Advantage: better locality (✓)
Disadvantages (✗): load imbalance[1, 2]; big directories[3, 4]
[1] Mantle: A programmable metadata load balancer for the Ceph file system. SC '15
[2] Lunule: An agile and judicious metadata load balancer for CephFS. SC '21
[3] Scale and concurrency of GIGA+: File system directories with millions of files. FAST '11
[4] IndexFS: Scaling file system metadata performance with stateless caching and bulk insertion. SC '14

48. Partition Metadata
- Trade-off between locality and balance
- A stateful metadata service is hard to scale
- Consistency is complex

49. Partition Metadata (cont.)
What if we could store metadata in a storage system with:
- Good scalability
- Strong consistency
- Dynamic partitioning
- High availability
- Fault tolerance
...

50. Scale FS: Disaggregate Metadata

51. Store Metadata in an Object Store
CephFS: store metadata and data in the object store (RADOS).
[Diagram: the client sends meta-ops to the MDS and data-ops to RADOS; the MDS keeps entries in memory, while metadata, a metadata WAL, and data live on disk in RADOS]

52. Store Metadata in an Object Store (cont.)
Simplifies system design; high reliability (✓)
But (✗): the MDS is stateful, consistency is complex, and the MDS remains very heavy!

53. Store Metadata in a DDBS
Since 2015, some DFSs store metadata in a distributed database (DDBS).
[Diagram: the client sends meta-ops to the MDS, which runs distributed transactions against database tables sharded across disks; data-ops go to the data cluster]

54. Store Metadata in a DDBS (cont.)

Name      Partition         Meta-operation                                 Storage system   Conference    Open source
CalvinFS  hash inode        distributed transaction (order-based)          CalvinDB         FAST '15
HopsFS    dynamic subtree   distributed transaction (lock-based)           MySQL-NDB        FAST '17      ✓
ADLS      dynamic subtree   distributed transaction (optimistic locking)   RSL-HK           SIGMOD '17

55. Store Metadata in a DDBS (cont.)
The disaggregated design scales the metadata service.
But: heavy distributed transactions make meta-ops slow.
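A sketch of why create() becomes a distributed transaction here: the new inode, the directory entry, and the parent's attribute update must commit together, typically across several shards. The tiny Txn class below is an illustrative in-memory stand-in; a real DDBS would run a lock- or order-based distributed commit across every shard touched.

```python
import time

# One in-memory "table"; keys are (kind, ...) tuples standing in for DB rows.
STORE = {("inode", 1): {"name": "/", "parent": 0, "mtime": 0.0}}

class Txn:
    """Illustrative all-or-nothing transaction; a real DDBS would 2PC here."""
    def __init__(self):
        self.writes = {}
    def get(self, key):
        return dict(self.writes.get(key, STORE[key]))  # read own writes first
    def put(self, key, value):
        self.writes[key] = value                       # buffer until commit
    def commit(self):
        STORE.update(self.writes)                      # apply atomically

def create(parent_id: int, name: str, inode_id: int) -> None:
    txn = Txn()
    parent = txn.get(("inode", parent_id))
    txn.put(("inode", inode_id),
            {"name": name, "parent": parent_id, "mtime": time.time()})
    txn.put(("dentry", parent_id, name), inode_id)     # new directory entry
    parent["mtime"] = time.time()                      # parent attribute update
    txn.put(("inode", parent_id), parent)
    txn.commit()  # every row above commits together, or none do

create(1, "File1", 2)
```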

56. Store Metadata in a KV-store
Since ~2020, some DFSs store metadata in a KV-store.
Benefit: higher scalability, lower latency.

Name      Partition        Meta-operation      Storage system   Conference   Open source
Tectonic  hash dir-inode   local transaction   ZippyDB          FAST '21

Example: create /Dir1/File1
- Step 1: create File1 (on shard KV2)
- Step 2: modify Dir1's attributes: atime, ctime, children... (on shard KV1)

57. Store Metadata in a KV-store (cont.)
Problem: no cross-shard transaction.

58. Store Metadata in a KV-store (cont.)
A crash or concurrent overwrite between the two steps may break atomicity and isolation.
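A sketch of the hazard: the two steps hit two independent KV shards, and a crash between them leaves the namespace inconsistent. The shard layout and key names are illustrative.

```python
# Two independent KV shards with no cross-shard transaction between them.
KV1, KV2 = {}, {}

def create_unsafe(crash_between_steps: bool = False) -> None:
    KV2["File1"] = {"parent": "Dir1"}          # step 1: create File1 on KV2
    if crash_between_steps:
        raise RuntimeError("server crashed")   # step 2 never runs
    KV1["Dir1"] = {"children": ["File1"]}      # step 2: update Dir1 on KV1

try:
    create_unsafe(crash_between_steps=True)
except RuntimeError:
    pass

# File1 exists but Dir1 was never updated: atomicity is broken.
print("File1" in KV2, "Dir1" in KV1)  # True False
```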

59. Store Metadata in a KV-store (cont.)
InfiniFS: redesign the metadata partition.
Key idea: place a file's entry and its parent directory's attributes in the same shard, so create /Dir1/File1 (step 1: create File1; step 2: modify Dir1's attributes) needs only a local transaction.
[Diagram: Dir1-inode on one shard; Dir1-attr and File1 co-located on shard KV1]

Name      Partition                               Meta-operation                        Storage system   Conference   Open source
Tectonic  hash dir-inode                          local transaction                     ZippyDB          FAST '21
InfiniFS  disaggregate dir-attr; hash dir-inode   local transaction + 2PC transaction   RocksDB          FAST '22
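A sketch of the co-location idea, assuming keys are placed by hashing the parent directory's inode so the new entry and the parent's attributes always share a shard; the keying scheme below is illustrative, not InfiniFS's actual format.

```python
# Four shards; placement is by the parent directory's inode number.
SHARDS = [dict() for _ in range(4)]

def shard_of(parent_inode: int) -> dict:
    return SHARDS[parent_inode % len(SHARDS)]

def create(parent_inode: int, name: str) -> None:
    shard = shard_of(parent_inode)
    # Both writes hit one shard, so a single *local* transaction suffices.
    shard[("dentry", parent_inode, name)] = {"type": "file"}
    attrs = shard.setdefault(("attr", parent_inode), {"children": 0})
    attrs["children"] += 1

create(1, "File1")
print(shard_of(1))  # the dentry and the parent's attributes live together
```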

60. Next Reading Group
Rethinking DFS architecture with new hardware.
Paper sharing: InfiniFS, ...