Advanced File Systems (CS 140)

Presentation Transcript

1. Advanced File Systems
   CS 140 – Nov 4, 2016
   Ali Jose Mashtizadeh

2. Outline
   - FFS Review and Details
   - Crash Recoverability
   - Soft Updates
   - Journaling
   - Copy-on-Write: LFS/WAFL/ZFS

3. Review: Improvements to UNIX FS
   Problems with the original UNIX FS:
   - 512 B blocks
   - Free blocks kept in a linked list
   - All inodes at the beginning of the disk
   UNIX performance issues:
   - Transfers only 512 B per disk I/O
   - Fragmentation leads to roughly 512 B transferred per seek
   - Inodes far from directory and file data
   - Files within a directory scattered all over the disk
   Usability issues:
   - 14-character file names
   - Not crash proof

4. Review: Fast File System [McKusick]
   - Variable block size (at least 4 KiB)
   - Fragment technique reduces wasted space
   - Cylinder groups spread inodes around the disk
   - Bitmap for fast allocation
   - FS reserves space to improve allocation
     - Tunable parameter, default 10%
     - Reduces fragmentation
   Usability improvements:
   - 255-character file names
   - Atomic rename system call
   - Symbolic links

5. Review: FFS Disk Layout
   Each cylinder group has its own:
   - Superblock
   - Cylinder block: bookkeeping information
   - Inodes, data/directory blocks

6. Cylinders, tracks, and sectors

7. Superblock
   Contains file system parameters:
   - Disk characteristics, block size, cylinder group info
   - Information to locate inodes, free bitmap, and the root directory
   Replicated once per cylinder group
   - At different offsets so the replicas span multiple platters
   - Contains the magic number 0x00011954 to find replicas (McKusick's birthday)
   Contains non-replicated information (FS summary):
   - Number of blocks, fragments, inodes, directories
   - Flag stating whether the file system was cleanly unmounted

8. Bookkeeping information
   Block map:
   - Bitmap of available fragments
   - Used for allocating new blocks/fragments (see the sketch below)
   Summary info within the cylinder group (cylinder group block):
   - Number of free inodes, blocks/fragments, files, directories
   - Used when picking which cylinder group to allocate into
   Number of free blocks by rotational position (8 positions):
   - Reasonable when disks were accessed by CHS (cylinder/head/sector)
   - The OS could use this to minimize rotational delay
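
A minimal sketch of the allocation idea above, using an in-memory bitmap in which a set bit means "fragment free" (the helper name and the bit convention are illustrative, not the kernel's actual code):

#include <stdint.h>
#include <stddef.h>

/* Find and claim the first free fragment in a cylinder group's bitmap.
 * Returns the fragment index, or -1 if the group has no free fragments. */
static long frag_alloc(uint8_t *bitmap, size_t nfrags)
{
    for (size_t i = 0; i < nfrags; i++) {
        if (bitmap[i / 8] & (1u << (i % 8))) {   /* set bit: fragment is free */
            bitmap[i / 8] &= ~(1u << (i % 8));   /* clear it to claim the fragment */
            return (long)i;
        }
    }
    return -1;
}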

9. Inodes and Data blocks
   - Each cylinder group has a fixed number of inodes
   - Each inode maps file offsets to disk blocks for one file
   - The inode also contains the file's metadata

10. On-Disk Inode
struct ufs1_dinode {
        u_int16_t       di_mode;        /*   0: IFMT, permissions; see below. */
        int16_t         di_nlink;       /*   2: File link count. */
        uint32_t        di_freelink;    /*   4: SUJ: Next unlinked inode. */
        u_int64_t       di_size;        /*   8: File byte count. */
        int32_t         di_atime;       /*  16: Last access time. */
        int32_t         di_atimensec;   /*  20: Last access time. */
        int32_t         di_mtime;       /*  24: Last modified time. */
        int32_t         di_mtimensec;   /*  28: Last modified time. */
        int32_t         di_ctime;       /*  32: Last inode change time. */
        int32_t         di_ctimensec;   /*  36: Last inode change time. */
        ufs1_daddr_t    di_db[NDADDR];  /*  40: Direct disk blocks. */
        ufs1_daddr_t    di_ib[NIADDR];  /*  88: Indirect disk blocks. */
        u_int32_t       di_flags;       /* 100: Status flags (chflags). */
        u_int32_t       di_blocks;      /* 104: Blocks actually held. */
        u_int32_t       di_gen;         /* 108: Generation number. */
        u_int32_t       di_uid;         /* 112: File owner. */
        u_int32_t       di_gid;         /* 116: File group. */
        u_int64_t       di_modrev;      /* 120: i_modrev for NFSv4 */
};

11. On-Disk Inode: POSIX Permissions
    The same struct ufs1_dinode as above, highlighting di_mode (file type and permission bits) and di_uid/di_gid (file owner and group).

12. On-Disk Inode: Hard Link Count
    The same struct as above, highlighting di_nlink, the number of directory entries that reference this inode.

13. On-Disk Inode: Block Pointers
    The same struct as above, highlighting di_db[NDADDR] (direct disk block pointers) and di_ib[NIADDR] (indirect disk block pointers).

14. Inode Allocation
   - Every file/directory requires an inode
   - New file: place the inode and data in the same cylinder group as its directory
   - New directory: use a different cylinder group from the parent
     - Select a CG with a greater-than-average number of free inodes
     - Among those, choose the CG with the smallest number of directories (selection sketch below)
   - Within a CG, inodes are allocated randomly
     - All inodes are close together, since a CG doesn't have many
     - All of a CG's inodes can be read and cached with a few I/Os
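
The directory-placement policy above can be sketched as follows; the summary structure and field names are invented stand-ins for the per-cylinder-group counters FFS keeps:

#include <stddef.h>

struct cg_summary {
    int free_inodes;   /* free inodes in this cylinder group */
    int ndirs;         /* directories already allocated here */
};

/* Pick a cylinder group for a new directory: among groups with at least
 * the average number of free inodes, take the one with fewest directories. */
static int pick_cg_for_dir(const struct cg_summary *cgs, size_t ncg)
{
    if (ncg == 0)
        return -1;

    long total_free = 0;
    for (size_t i = 0; i < ncg; i++)
        total_free += cgs[i].free_inodes;
    long avg_free = total_free / (long)ncg;

    int best = -1;
    for (size_t i = 0; i < ncg; i++) {
        if (cgs[i].free_inodes < avg_free)
            continue;                            /* below-average groups are skipped */
        if (best < 0 || cgs[i].ndirs < cgs[best].ndirs)
            best = (int)i;
    }
    return best;                                 /* index of the chosen cylinder group */
}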

15. Fragment allocation
   - Allocate space when the user grows a file
   - The last block should be a fragment if the data doesn't fill a full block
   - If the growth is larger than a fragment, move to a full block
   - If no free fragments exist, break up a full block
   Problem: slow for many small writes
   - May have to keep moving the end of the file around
   Solution: new stat struct field st_blksize
   - Tells applications and the stdio library to buffer writes at this size (usage sketch below)
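
Usage sketch of st_blksize from an application's point of view ("data.bin" is just an example path): buffering writes at this size avoids the repeated fragment reallocation described above.

#include <sys/stat.h>
#include <stdio.h>

int main(void)
{
    struct stat st;

    if (stat("data.bin", &st) == 0)
        printf("preferred I/O size: %ld bytes\n", (long)st.st_blksize);
    return 0;
}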

16. Block allocation
   Optimize for sequential access:
   #1: Use the block right after the end of the file
   #2: Use a block in the same cylinder group
   #3: Use quadratic hashing to choose the next CG to try
   #4: Choose any CG
   Problem: we don't want one file to fill up a whole CG
   - Otherwise other inodes will not be near their data
   Solution: break big files over many CGs
   - Large extents in each CG, so seeks are amortized
   - Extent transfer time >> seek time
   (A sketch of the allocation cascade follows.)
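
A toy sketch of the preference cascade (steps 2-4); it only tracks a free-block count per cylinder group and reports which group supplied the block, so step 1 (the block right after the end of file) is omitted. All names here are invented:

#define NCG 32
static int cg_free[NCG];           /* toy model: free blocks per cylinder group */

static int alloc_in_cg(int cg)
{
    if (cg_free[cg] > 0) {
        cg_free[cg]--;
        return cg;                 /* report which group supplied the block */
    }
    return -1;
}

static int alloc_block(int file_cg)
{
    int cg;

    /* #2: a block in the same cylinder group as the file. */
    if ((cg = alloc_in_cg(file_cg)) >= 0)
        return cg;
    /* #3: quadratic hashing to pick the next cylinder groups to probe. */
    for (int i = 1; i < NCG; i++)
        if ((cg = alloc_in_cg((file_cg + i * i) % NCG)) >= 0)
            return cg;
    /* #4: exhaustive scan of every cylinder group. */
    for (cg = 0; cg < NCG; cg++)
        if (alloc_in_cg(cg) >= 0)
            return cg;
    return -1;                     /* file system is full */
}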

17. Directories
   - Same as a file, but the inode is marked as a directory
   - Contents treated as 512-byte chunks
     - Disks only guarantee atomic sector updates
   - Each entry is a "struct direct" with (laid out below):
     - 32-bit inode number
     - 16-bit size of the directory entry
     - 8-bit file type
     - 8-bit length of the file name
   - Coalesce free space when deleting entries
   - Periodically compact directory chunks, but never move entries across chunks
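
Written out as C, the entry layout above is essentially the classic BSD struct direct (exact field names vary slightly across UFS versions):

#include <stdint.h>

#define MAXNAMLEN 255

struct direct {
    uint32_t d_ino;                  /* 32-bit inode number (0 means unused entry) */
    uint16_t d_reclen;               /* 16-bit size of this directory entry */
    uint8_t  d_type;                 /* 8-bit file type (regular, directory, symlink, ...) */
    uint8_t  d_namlen;               /* 8-bit length of the name that follows */
    char     d_name[MAXNAMLEN + 1];  /* on disk only d_namlen bytes (padded to a
                                        4-byte boundary) are stored; d_reclen gives
                                        the entry's true size */
};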

18. Updating FFS for the 90s
   - No longer want to assume rotational delay
     - Disk caches usually hide rotational delay effects
     - Instead, allocate data contiguously
   Solution: cluster writes
   - The file system delays writes
   - Accumulates data into 64 KiB clusters, written at once
   - Cluster allocation is similar to fragment/block allocation
   Online defragmentation
   - Portions of files can be copied to reduce future read cost
   Better crash recovery
   - FreeBSD: normal fsck, GJournal, SU (soft updates), SU+J (soft updates + journaling)

19. FFS Implementation
   Separates:
   - File/directory abstraction (UFS)
   - Disk layout (FFS)
   Log-structured File System (LFS) [Mendel]
   - A log-structured disk layout for BSD under the same UFS layer
   The disk-layout layer:
   - Maps i-number to inode
   - Manages free space
   - Provides an I/O interface given an inode
   (Diagram: UFS sits above FFS or LFS, which sit above the volume manager and disk driver.)

20. Outline
   - Overview
   - FFS Review and Details
   - Crash Recoverability
   - Soft Updates
   - Journaling
   - LFS/WAFL

21. Fixing Corruptions
   File System Check (fsck) runs after a crash
   - The hard disk guarantees per-sector atomic updates
   - The file system operates on blocks (~8 sectors)
   Summary info is usually bad
   - Recount inodes, blocks, fragments, directories
   The system may have corrupt inodes
   - Inodes may have bad fields or corrupt extended attributes
   - Inodes < 512 B
   Allocated blocks may be missing references
   Directories may be corrupt
   - Holes in the directory
   - File names corrupt / not unique / corrupt inode number
   - All directories must be reachable

22. Ensure Recoverability
   Goal: ensure fsck can recover the file system
   Example: suppose we write data asynchronously
   Appending:
   - The inode points to a corrupt (not yet written) indirect block
   - The file has grown, but the new data is not present
   Delete/truncate, then append to another file:
   - The new file may reuse an old block
   - The old inode has not yet been written to disk
   - Cross allocation: two files claim the same block!

23. Performance & Consistency
   We need to guarantee some ordering of updates:
   - Write the new inode to disk before the directory entry
   - Remove the directory name before deallocating the inode
   - Write the cleared inode to disk before updating the free bitmap
   Requires many metadata writes to be synchronous
   - Ensures easy recovery, but hurts performance
   - Cannot always batch updates
   Performance impact:
   - Extracting tar files is easily 10-20x slower
   - fsck is very slow, leading to more downtime

24. Outline
   - Overview
   - FFS Review and Details
   - Crash Recoverability
   - Soft Updates
   - Journaling
   - LFS/WAFL

25. Ordered Updates
   Follow 3 rules for ordering updates [Ganger]:
   1. Never write a pointer before initializing the structure it points to
   2. Never reuse a resource before nullifying all pointers to it
   3. Never clear the last pointer to a live resource before setting the new one
   The file system will then be recoverable
   - It might leak disk space, but it stays correct
   - Goal: scavenge for leaked resources in the background

26. Dependencies
   Example: creating file A
   - Block X contains file A's inode
   - Block Y contains the directory block that references A
   We say "Y depends on X"
   - Y cannot be written before X is written
   - X is the dependee, Y is the depender
   We can hold up writes, but must preserve the order

27. Cyclic Dependencies
   Suppose you create A and unlink B, both in the same directory block and the same inode block
   Cannot write the directory until A's inode is initialized
   - Otherwise the directory would point to a bogus inode
   - So A might be associated with the wrong data
   Cannot write the inode block until B's directory entry is cleared
   - Otherwise B could end up with a too-small link count
   - The file could be deleted while links to it still exist
   Otherwise, fsck has to be slower:
   - Check every directory entry and all inode link counts
   - Requires more memory

28. Cyclic Dependencies Illustrated

29. Soft Updates [Ganger]
   - Write blocks in any order
   - Keep track of dependencies
   - When writing a block, roll back any changes you can't yet commit to disk (sketch below)
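
A minimal sketch of that roll-back/roll-forward trick; the dependency structure and callbacks are invented, but the shape matches the description above: undo anything whose dependee is not yet on disk, write the always-consistent image, then reapply the changes in memory.

#include <stdbool.h>
#include <stddef.h>

struct dep {                        /* one tracked dependency on this buffer */
    bool dependee_on_disk;          /* has the block we point to been written yet? */
    void (*undo)(void *buf);        /* remove the dependent change from buf */
    void (*redo)(void *buf);        /* reapply it to the in-memory copy */
};

static void write_buffer(void *buf, struct dep *deps, size_t ndeps,
                         void (*disk_write)(const void *buf))
{
    /* Roll back every change whose dependee is not yet on disk. */
    for (size_t i = 0; i < ndeps; i++)
        if (!deps[i].dependee_on_disk)
            deps[i].undo(buf);

    disk_write(buf);                /* the on-disk image is always consistent */

    /* Roll the changes forward again; the buffer stays dirty and will be
     * rewritten once its remaining dependencies are satisfied. */
    for (size_t i = 0; i < ndeps; i++)
        if (!deps[i].dependee_on_disk)
            deps[i].redo(buf);
}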

30. Breaking Dependencies
   - We created file A and deleted file B
   - Now say we decide to write the directory block
   - We can't write file name A to disk: it still has a dependee (A's inode block)

31. Breaking Dependencies
   - Undo file A's entry before writing the directory block to disk
   - But now the inode block has no dependees
   - We can safely write the inode block to disk as-is

32. Breaking Dependencies
   - Now the inode block is clean (identical in memory and on disk)
   - But we have to write the directory block a second time

33. Breaking Dependencies
   - All data is now stably on disk
   - A crash at any point would have been safe

34. Soft Update Issues
   - fsync: may have to flush directory entries, etc.
   - unmount: some disk buffers are flushed multiple times
   - Deleting a directory tree is fast!
     - unlink doesn't need to read file inodes synchronously
     - Careful with memory!
   - Useless write-backs
     - The syncer flushes dirty buffers to disk every 30 seconds
     - Writing everything at once causes many dependency issues
     - Fix the syncer to write blocks one at a time
   - LRU buffer eviction needs to know about dependencies

35. Soft Updates: fsck
   Split into foreground and background parts
   Foreground part must be done before remounting
   - Ensure the per-cylinder-group summary info makes sense
   - Recompute free block/inode counts from the bitmaps (fast)
   - Leaves the FS consistent, but it may leak disk space
   Background part does the traditional fsck
   - Runs on a snapshot while the system is running
   - A syscall allows fsck to have FFS patch the live file system
   - The snapshot is deleted once fsck completes
   Much shorter downtime than traditional fsck

36. Outline
   - Overview
   - FFS Review and Details
   - Crash Recoverability
   - Soft Updates
   - Journaling
   - LFS/WAFL

37. Journaling
   Idea: use a write-ahead log to journal metadata
   - Reserve a portion of the disk for the log
   - Write the operation to the log, then apply it to the disk
   - After a crash/reboot, replay the log (efficient)
     - May redo already-committed changes, but never misses one
   Physical journal: both data and metadata are written to the log
   - Induces extra I/O overhead, but simple to implement and safe for applications
   Logical journal: only metadata is written to the journal
   - Metadata journaling; more complex implementation
   (A minimal write-ahead sketch follows.)
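
A minimal write-ahead sketch using plain stdio files as stand-ins for the log area and the main disk; the record format is invented, but the ordering is the point: the log record reaches stable storage before the in-place block is touched, so replay can redo the change after a crash.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct journal_rec {
    uint64_t blkno;                 /* which metadata block this record updates */
    uint8_t  data[512];             /* the new contents of that block */
};

static void journaled_update(FILE *log, FILE *disk,
                             uint64_t blkno, const void *newblk)
{
    struct journal_rec rec = { .blkno = blkno };
    memcpy(rec.data, newblk, sizeof(rec.data));

    /* 1. Append the record to the log and force it out (a real file system
     *    would issue an fsync/cache flush here, not just fflush). */
    fwrite(&rec, sizeof(rec), 1, log);
    fflush(log);

    /* 2. Only now update the block in place; this write may be deferred and
     *    batched, since log replay covers a crash in between. */
    fseek(disk, (long)(blkno * sizeof(rec.data)), SEEK_SET);
    fwrite(rec.data, sizeof(rec.data), 1, disk);
}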

38. Journaling (continued)
   Performance advantage:
   - The log is a linear region of the disk
   - Multiple operations can be logged together
   - The final in-place writes happen asynchronously
   Journals are typically large (~1 GB)
   - Copies of disk blocks are written, which requires high throughput
   Example: deleting a directory tree
   - Journal all the freed blocks, changed directory blocks, etc., then return to the user
   - In the background, the in-place changes can be written out in any order

39. SGI XFS [Sweeney]
   Main idea:
   - Big disks, big files, and a large number of files; 64-bit everything
   - Maintain very good performance
   Break the disk up into Allocation Groups (AGs)
   - 0.5-4 GB regions of the disk
   - Similar to cylinder groups, but for a different purpose:
     - AGs are too large to minimize seek times
     - AGs have no fixed number of inodes
   Advantages:
   - Parallelizes allocation and data structures for multicore
   - Used on supercomputers with many cores
   - 32-bit pointers within an AG keep structures small

40. XFS B+ Trees
   XFS makes extensive use of B+ trees
   - An indexed data structure storing ordered <key, value> pairs
   - Keys have a defined ordering
   Three main operations, each O(log n):
   - Insert a new <key, value> pair
   - Delete a <key, value> pair
   - Retrieve the closest <key, value> to a target key k
   (See any algorithms book for the details.)
   Used for:
   - Free space management
   - Extended attributes
   - Extent maps for files: file offset to <start, length> (lookup sketch below)
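
The "closest key" operation is what makes extent maps work: given a file offset, find the extent whose starting offset is the largest one not past it. A sketch over a sorted array standing in for the B+ tree (same O(log n) search, none of the node management):

#include <stdint.h>
#include <stddef.h>

struct extent {
    uint64_t file_off;    /* key: starting file offset of this extent */
    uint64_t disk_start;  /* value: first disk block of the extent */
    uint64_t length;      /* value: extent length in blocks */
};

/* Return the extent with the largest key <= off, or NULL if none exists. */
static const struct extent *
extent_lookup(const struct extent *map, size_t n, uint64_t off)
{
    const struct extent *best = NULL;
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (map[mid].file_off <= off) {
            best = &map[mid];       /* closest candidate so far */
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return best;
}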

41. B+ Trees (continued)
   Space allocation:
   - Two B+ trees: one sorted by extent length, one sorted by address
   - Easily find nearby blocks (locality)
   - Easily find large extents for large files (best fit)
   - Easily coalesce adjacent free regions
   Journaling enables complex atomic operations
   - First write the metadata changes to the on-disk log
   - Then apply the changes to the disk
   On crash:
   - If a log record is incomplete, it is discarded
   - Otherwise, replay the log

42. Journaling vs. Soft Updates
   Limitations of soft updates:
   - High complexity, and very tied to the FFS data format
   - Metadata updates may proceed out of order
     - create A, create B, crash: maybe only B exists after reboot
   - Still needs a slow background fsck
   Limitations of journaling:
   - A disk write is required for every metadata operation
     - create-then-delete may require no I/O at all with soft updates
   - Possible contention for the end of the log on multiprocessors
   - fsync must sync other operations' metadata to the log
   Can we get the best of both worlds?

43. Soft Updates + Journaling [McKusick]
   An example of minimalist metadata journaling
   - Journal only resource allocations and future references
   - fsck can recover the file system state from the journal
   - 16 MB journal (instead of GBs for data journals)
   - 32-byte journal entries, 27 types of entries (illustrative layout below)
   - Fast fsck takes ~2 seconds
   - Less bandwidth and cost than normal journaling
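
The exact SU+J record layouts live in the FreeBSD headers; the struct below is only an illustration of how a fixed 32-byte entry can describe an operation such as "inode N now references block B":

#include <stdint.h>

struct suj_rec {                    /* illustrative only, not the real format */
    uint32_t op;                    /* one of the ~27 operation types */
    uint32_t ino;                   /* inode the operation applies to */
    uint64_t blkno;                 /* block being allocated, freed, or referenced */
    uint64_t lbn;                   /* logical block number within the file */
    uint32_t frags;                 /* number of fragments involved */
    uint32_t pad;                   /* padding up to the 32-byte record size */
};

_Static_assert(sizeof(struct suj_rec) == 32, "journal records are 32 bytes");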

44. Outline
   - Overview
   - FFS Review and Details
   - Crash Recoverability
   - Soft Updates
   - Journaling
   - LFS/WAFL

45. Log-structured File System (LFS) [Mendel]
   Main idea: the log is the only on-disk structure
   - Fast writing
   - Fragmentation from long-lived data may be a problem
   - Slow reading (lots of seeks); finding the head of the file system may take time
   Log cleaning:
   - A background process must periodically free old blocks from the log
   SSDs reduce seek overhead, making LFS practical again

46. Write Anywhere File Layout (WAFL)
   - A copy-on-write file system; inspired ZFS, HAMMER, and btrfs
   - Core idea: write whole snapshots to disk
   - Snapshots are virtually free!
   - Snapshots are accessible from the .snap directory in the root

47. WAFL: Detailed View
   - Only the root inode is at a fixed location
   - All other FS structures are accessed through inodes:
     block allocation maps, the inode table, files, directories

48. WAFL: Example
   (Diagram: a copy-on-write update copies the root inode, the inode file, an indirect block, and file block 0; unchanged blocks such as file block 1 are shared between the old root inode (1) and the new root inode (2).)
   Consistency:
   - On crash recovery, find a snapshot that has been fully committed to disk
   - Reclaim space only after a whole snapshot has been written to disk
   Persistent snapshots:
   - Save the root inode and do not reclaim data older than the latest snapshot
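
A toy sketch of the copy-on-write path the example illustrates: updating a leaf allocates a fresh copy of it and of every block on the path up to the root, while the old tree (the snapshot) is left untouched until the new root is installed. The single-child "block" type is invented for brevity.

#include <stdlib.h>
#include <string.h>

struct block {
    struct block *child;            /* one child per level is enough for the sketch */
    char data[64];
};

/* Return a new root for a tree in which the node 'depth' levels down has new
 * data; assumes the path exists and ignores allocation failure for brevity. */
static struct block *
cow_update(const struct block *root, int depth, const char *newdata)
{
    struct block *copy = malloc(sizeof(*copy));

    *copy = *root;                  /* start from an identical copy */
    if (depth == 0) {
        strncpy(copy->data, newdata, sizeof(copy->data) - 1);
        copy->data[sizeof(copy->data) - 1] = '\0';
    } else {
        copy->child = cow_update(root->child, depth - 1, newdata);
    }
    return copy;                    /* caller eventually installs the new root */
}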

49. ZFS
   - Copy-on-write, functions similarly to WAFL
   - Integrates the volume manager and the file system
     - Software RAID without the write hole
   - Integrates the file system and buffer management
     - Advanced prefetching: strided patterns, etc.
     - Uses the Adaptive Replacement Cache (ARC) instead of LRU
   - File system reliability
     - Checksumming of all data and metadata (verify-on-read sketch below)
     - Redundant metadata
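
A sketch of the verify-on-read idea behind that checksumming; the block-pointer layout and the toy checksum are invented stand-ins (ZFS itself uses Fletcher or SHA-256 checksums kept in the parent's block pointer):

#include <stdint.h>
#include <stddef.h>

struct blkptr {
    uint64_t blkno;                 /* where the child block lives */
    uint64_t checksum;              /* checksum of the child block's contents */
};

static uint64_t toy_checksum(const uint8_t *p, size_t n)
{
    uint64_t a = 0, b = 0;          /* Fletcher-style running sums */
    for (size_t i = 0; i < n; i++) { a += p[i]; b += a; }
    return (b << 32) | (a & 0xffffffffu);
}

/* Return 0 if the data read from disk matches its checksum, -1 otherwise
 * (the file system would then try another copy or reconstruct from RAID-Z). */
static int read_verified(const struct blkptr *bp, const uint8_t *data, size_t n)
{
    return toy_checksum(data, n) == bp->checksum ? 0 : -1;
}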

50. RAID Write Hole
   (Diagram: a stripe of File A, File B, File C and a parity block, with parity = A ⊕ B ⊕ C, spread across disks 1-4. A crash between writing the data and writing the parity leaves the stripe inconsistent.)
   Result: no way to distinguish this from silent data corruption
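
A worked example of the hole with a toy 3+1 stripe: parity is the XOR of the data blocks, and a crash between writing new data and writing new parity leaves a mismatch that looks exactly like silent corruption of some other block.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint8_t a = 0x11, b = 0x22, c = 0x33;
    uint8_t parity = a ^ b ^ c;          /* parity written while the stripe was consistent */

    a = 0x99;                            /* new data block written to disk ... */
    /* ... crash here, before the parity block is rewritten ... */

    printf("parity on disk:       0x%02x\n", parity);
    printf("parity of stripe now: 0x%02x\n", a ^ b ^ c);
    /* The mismatch does not say which block is stale, so it is
     * indistinguishable from silent corruption of b or c. */
    return 0;
}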

51. ZFS Volume Management
   - Volume management and RAID are part of ZFS
   - Checksums can identify:
     - Silent data corruption
     - RAID-striped data that was not written in its entirety
   - The file system verifies recent snapshots on mount to fix write-hole issues
   - If only one drive's data is corrupt, we don't lose data

52. Redundant Metadata [zfs(8)]
   (Diagram: the more important a block, the more copies ZFS keeps: data blocks get the fewest copies, then znodes and indirect blocks, up to the uberblock with the most.)
   - Copies are ideally placed on different disks

53. Summary
   Performance vs. recoverability trade-offs:
   - fsck: metadata write overhead, and fsck itself is very slow
   - Soft updates: background recovery, low overhead
   - Journaling: fast recovery, low overhead
   - Soft updates + journaling: fast recovery, low overhead
   - LFS/WAFL: fast writes, higher read costs, almost no recovery time!

54. Cluster/Distributed File Systems
   Shared-disk cluster file systems:
   - VxFS (JFS in HP-UX): clusters with network-based locking
   - VMFS: VMware's clustering FS; uses SCSI reservations to coordinate between nodes
   Local-disk cluster file systems:
   - GFS, Ceph, Lustre, Windows DFS
   Both:
   - IBM GPFS