File Systems 3: Buffering, Transactions, and Reliability




Presentation Transcript

1. File Systems 3: Buffering, Transactions, and Reliability
Sam Kumar
CS 162: Operating Systems and System Programming, Lecture 21
https://inst.eecs.berkeley.edu/~cs162/su20
7/29/2020. Kumar, CS 162 at UC Berkeley, Summer 2020.
Read: A&D Ch. 14

2. Recall: Components of a File System
File path → directory structure → file number ("inumber") → file index structure ("inode") → data blocks
One block = multiple sectors (e.g., 512 B sectors, 4 KB blocks)

3. Abstract Representation of a Process
An open file description is better described as remembering the inumber (file number) of the file, not its name.
(Diagram: a process with its address space, thread registers, and file descriptor table in user space; fd 3 maps to an open file description in kernel space holding foo.txt's inumber and position 100. Not shown: fds 0, 1, and 2 for stdin, stdout, and stderr.)

4. Abstract Representation of a Process
The open file description actually remembers an in-memory inode in the system-wide open-file table.
(Diagram: as before, but fd 3's open file description now holds a pointer to the in-memory inode along with the inumber and position 100.)

5. In-Memory File System Structures
- Open syscall: find the inode on disk from the pathname (traversing directories)
- Create an "in-memory inode" in the system-wide open file table
- One entry in this table no matter how many instances of the file are open
- Read/write syscalls look up the in-memory inode using the file handle (fd)

6. Recall: File Allocation Table
- Where is the FAT stored? On disk.
- How to format a disk? Zero the blocks, mark all FAT entries "free".
- How to quick-format a disk? Just mark the FAT entries "free".
- Simple: can be implemented in device firmware.
(Diagram: disk blocks for file 31, blocks 0-3, chained through FAT entries 0..N-1; other entries belong to file 2, are free, or are in memory.)

7. Recall: Inode Structure
(Diagram only.)

8. Berkeley FFS: Locality
- The file system volume is divided into a set of block groups
- A block group is a close set of tracks (a generalized term for cylinder groups)
- Data blocks, metadata, and free space are interleaved within a block group
- Avoids huge seeks between user data and system structures
- Put a directory and its files in a common block group, so they are stored near each other

9. Berkeley FFS: First-Fit Block Allocation
(Diagram only.)

10. Recall: Directory Abstraction
- Hard link: a mapping from a name to a file in the directory structure
- The first hard link to a file is made when the file is initially created
- Create extra hard links to a file with the link syscall
- Remove links with unlink (rm)
- When can the file contents be deleted? When there are no more hard links to the file
- The inode maintains a reference count for this purpose
(Diagram: a directory tree containing /usr, /usr/lib, /usr/lib4.3, and /usr/lib4.3/foo, with foo also linked as /usr/lib/foo.)
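The reference-count behavior above can be sketched with Python's `os.link`/`os.unlink` on a POSIX file system (the temporary file names here are made up for illustration; `st_nlink` exposes the inode's link count):

```python
import os
import tempfile

d = tempfile.mkdtemp()
a = os.path.join(d, "a.txt")
b = os.path.join(d, "b.txt")

with open(a, "w") as f:
    f.write("hello")                   # first hard link made at creation

os.link(a, b)                          # second hard link to the same inode
assert os.stat(a).st_nlink == 2        # inode reference count is now 2
assert os.stat(a).st_ino == os.stat(b).st_ino  # both names share one inode

os.unlink(a)                           # remove one name (like rm)
with open(b) as f:
    assert f.read() == "hello"         # contents survive: one link remains
```

The data is only reclaimed once the last link is removed and the reference count reaches zero.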

11. Windows NTFS

12. New Technology File System (NTFS)
- Default on modern Windows systems
- Instead of a FAT or inode array: the Master File Table (MFT)
- Max 1 KB size for each table entry
- Each entry in the MFT contains metadata and:
  - The file's data directly (for small files)
  - A list of extents (start block, size) for the file's data
  - For big files: pointers to other MFT entries with more extent lists

13. NTFS Small File: Data in MFT Record
(Diagram only.)

14. NTFS Medium File: Extents for File Data
(Diagram only.)

15. NTFS Large File: Pointers to Other MFT Records
(Diagram only.)

16. NTFS Huge, Fragmented File: Many MFT Records
(Diagram only.)

17. NTFS Directories
- Directories are implemented as B-trees
- A file's number identifies its entry in the MFT
- An MFT entry always has a file name attribute: the human-readable name and the file number of the parent directory
- Hard link? Multiple file name attributes in the MFT entry

18. Memory-Mapped Files

19. Memory-Mapped Files
- Traditional I/O involves explicit transfers between buffers in the process address space and regions of a file
  - Typically via read/write system calls
- Alternative: "map" the file directly into an empty region of our address space
  - Implicitly "page it in" from the file when we read an address
  - Write to the address, and "eventually" page it out to the file

20. Same Machinery as Demand Paging…
- On a page fault to the mapped region: read from the correct offset of the file (not from the typical swap space)
- If a page in the mapped region is chosen for eviction: write the dirty page back to the backing file (not to the typical swap space)
- Executable files are mapped into the code region this way!

21. The mmap System Call
- API provided by the OS for a process to alter its memory map (memory regions)
- Also supports anonymous mappings: memory not backed by a file
- Memory regions can be:
  - shared (inherited by the child on fork)
  - private (not inherited by the child on fork)
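A minimal sketch of file mapping, using Python's `mmap` module (which wraps the mmap syscall); the temporary file and its contents are made up for illustration. Reading the mapping pages data in from the file, and flushing pages dirty data back out:

```python
import mmap
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"hello world")           # create some file contents

with mmap.mmap(fd, 0, access=mmap.ACCESS_WRITE) as m:
    assert m[0:5] == b"hello"          # reading memory "pages in" the file
    m[0:5] = b"HELLO"                  # store to memory, not write()...
    m.flush()                          # ...then explicitly page it out

os.lseek(fd, 0, os.SEEK_SET)
assert os.read(fd, 11) == b"HELLO world"  # the store reached the file
os.close(fd)
```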

22. Buffering

23. Recall: Translation from User to System View
What happens if the user says "give me bytes 2-12"?
- Fetch the block corresponding to those bytes
- Return just the correct portion of the block
What about writing bytes 2-12?
- Fetch the block, modify the relevant portion, write out the block
Everything inside the file system is in terms of whole-size blocks
- Actual disk I/O happens in blocks
- Reads/writes smaller than a block must translate and buffer
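The byte-to-block translation above is just integer arithmetic; a small sketch (the 4 KB block size is an assumption matching the earlier sector/block example):

```python
BLOCK_SIZE = 4096  # assumed 4 KB blocks

def byte_range_to_blocks(start, end):
    """Map the byte range [start, end) to (first block, offset in it,
    last block, offset in it). The FS must fetch whole blocks."""
    first, last = start // BLOCK_SIZE, (end - 1) // BLOCK_SIZE
    return first, start % BLOCK_SIZE, last, (end - 1) % BLOCK_SIZE

# "Give me bytes 2-12": both endpoints fall inside block 0, so the file
# system fetches one whole block and returns bytes 2..12 of it.
assert byte_range_to_blocks(2, 13) == (0, 2, 0, 12)

# A request spanning a block boundary needs two whole blocks:
assert byte_range_to_blocks(4090, 4100) == (0, 4090, 1, 3)
```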

24. Buffer Cache
The kernel must copy disk blocks to main memory to access their contents, and write them back if modified
- Could be data blocks, inodes, directory contents, etc.
- Possibly dirty (modified and not yet written back)
Key idea: exploit locality by caching disk data in memory
- Name translations: mapping from paths → inodes
- Disk blocks: mapping from block address → disk contents
Buffer cache: memory used to cache kernel resources, including disk blocks and name translations
- Can contain "dirty" blocks (with modifications not yet on disk)

25. File System Buffer CacheOS implements a cache of disk blocks7/29/2020Kumar CS 162 at UC Berkeley, Summer 202025MemoryDiskData blocksDir Data blocksiNodesFree bitmapfile descPCBReadingWritingBlocksStatefreefree

26. File System Buffer Cache: open
- Load a block of the directory
- Search it for the mapping
(Diagram: a directory data block is read into the cache.)

27. File System Buffer Cache: open
- Load a block of the directory
- Search it for the mapping (<name>: inumber)
- Load the inode
(Diagram: the file's inode is read into the cache.)

28. File System Buffer Cache: open
- Load a block of the directory
- Search it for the mapping
- Load the inode
- Create a reference via the open file description

29. File System Buffer Cache: read
- From the inode, traverse the index structure to find the data block
- Load the data block
- Copy all or part of it to the user's data buffer

30. File System Buffer Cache: write
- May allocate new blocks
- Dirty blocks must be written back to disk

31. File System Buffer Cache: Eviction
Blocks being written back to disk go through a transient state.

32. Buffer Cache Discussion
- Implemented entirely in OS software, unlike hardware memory caches and the TLB
- Blocks go through transitional states between free and in-use
  - Being read from disk, being written to disk
  - Other processes can run in the meantime, etc.
- Blocks are used for a variety of purposes: inodes, data for directories and files, the freemap
  - The OS maintains pointers into them
- Termination (e.g., process exit) interacts with open, read, and write
- Replacement: what to do when the cache fills up?

33. File System Caching
LRU replacement policy?
Advantages:
- Works very well for name translation
- Works well in general if memory is big enough to accommodate the host's working set of file blocks
Challenges:
- Some applications scan through the file system, flushing the cache with data used only once (e.g., find . -exec grep foo {} \;)
Other replacement policies?
- Some systems allow applications to request other policies
- Example, "use once": the file system can discard blocks as soon as they are used
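A minimal sketch of an LRU block cache, using an `OrderedDict` to track recency (the capacity and block contents here are made up):

```python
from collections import OrderedDict

class LRUBlockCache:
    """Toy buffer cache: block number -> data, evicting least recently used."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()        # ordered oldest -> newest use

    def get(self, blockno):
        if blockno in self.blocks:
            self.blocks.move_to_end(blockno)   # mark most recently used
            return self.blocks[blockno]
        return None                            # miss: would read from disk

    def put(self, blockno, data):
        self.blocks[blockno] = data
        self.blocks.move_to_end(blockno)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)    # evict least recently used

cache = LRUBlockCache(2)
cache.put(0, b"a")
cache.put(1, b"b")
cache.get(0)                 # touch block 0, so block 1 becomes the victim
cache.put(2, b"c")           # capacity exceeded: evicts block 1
assert cache.get(1) is None and cache.get(0) == b"a"
```

The scan problem from the slide is visible here too: touching capacity-many new blocks in sequence evicts the entire working set.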

34. File System Caching
Cache size: how much memory should the OS allocate to the buffer cache vs. virtual memory?
- Too much memory to the file system cache → can't run many applications at once
- Too little memory to the file system cache → many applications may run slowly (disk caching not effective)
Solution: adjust the boundary dynamically so that the disk access rates for paging and file access are balanced

35. File System Caching
Read-ahead prefetching: fetch sequential blocks early
- Sequential blocks are fast to access; the file system tries to obtain a sequential layout; applications tend to do sequential reads and writes
How much to prefetch?
- Too much delays requests by other applications
- Too little causes many seeks (and rotational delays) among concurrent file requests

36. Delayed Writes
The buffer cache is a writeback cache (a write-behind cache)
- write copies data from user space into the kernel's buffer cache
- reads are fulfilled by the cache, so reads see the results of writes, even if the data has not reached disk
When does data from a write syscall finally reach disk?
- When the buffer cache is full (e.g., we need to evict something)
- When the buffer cache is flushed periodically (in case we crash)
Short-lived files might never make it to disk!
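The delayed-write behavior can be sketched with `os.fsync`, which an application uses to force dirty buffers to the device rather than waiting for the periodic flush (the temporary file and its contents are made up):

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"important data")    # returns once the data is in the
                                   # kernel's buffer cache, not on disk

# A read through the cache sees the write even before it reaches disk:
rfd = os.open(path, os.O_RDONLY)
assert os.read(rfd, 14) == b"important data"
os.close(rfd)

os.fsync(fd)   # force the dirty buffers out now, instead of "eventually"
os.close(fd)
```

Without the `fsync`, a crash between the `write` and the next periodic flush could lose the data, which is exactly the risk the slide describes.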

37. Buffer Caching vs. Demand Paging
Replacement policy?
- Demand paging: LRU is infeasible; use an approximation (like NRU/Clock)
- Buffer cache: LRU is OK
Eviction policy?
- Demand paging: evict not-recently-used pages when memory is close to full
- Buffer cache: write back dirty blocks periodically, even if recently used
  - Why? To minimize data loss in case of a crash

38. Announcements
- Project 2 final report and scheduling lab are due tonight: submit each to Gradescope (separate assignment for each)
- Project 2 peer reviews are out, due tomorrow
- Homework 5 is out! Get started on it early; it has two parts, 5A and 5B, each worth one homework
- Quiz 3 is on Monday; it covers material up to this slide

39. Dealing with Persistent State
Buffer cache: write back dirty blocks periodically, even if recently used
- Why? To minimize data loss in case of a crash
- Not foolproof! We can still crash with dirty blocks in the cache
What if the dirty block was for a directory?
- We lose the pointer to the file's inode (leaking space)
- The file system is now in an inconsistent state
Takeaway: file systems need recovery mechanisms

40. Important Terminology
Availability: the probability that the system can accept and process requests
- Often measured in "nines" of probability (e.g., 99.9% is "3 nines of availability")
- The key idea here is independence of failures
Durability: the ability of a system to recover data despite faults
- This is fault tolerance applied to data
- Doesn't necessarily imply availability: the information on the pyramids was very durable, but could not be accessed until the discovery of the Rosetta Stone
Reliability: the ability of a system or component to perform its required functions under stated conditions for a specified period of time (IEEE definition)
- Usually stronger than availability: means the system is not only "up", but also working correctly
- Includes availability, security, and fault tolerance/durability
- Must make sure data survives system crashes, disk crashes, and other problems

41. How to Make File Systems More Durable?

42. How to Make File Systems More Durable?
Disk blocks contain Reed-Solomon error-correcting codes (ECC) to deal with small defects in the disk drive
- Can allow recovery of data from small media defects
Make sure writes survive in the short term:
- Either abandon delayed writes, or
- Use special battery-backed RAM (non-volatile RAM, or NVRAM) for dirty blocks in the buffer cache
Make sure data survives in the long term:
- Need to replicate! More than one copy of the data!
- Important element: independence of failures
  - Could put copies on one disk, but if the disk head fails…
  - Could put copies on different disks, but if the server fails…
  - Could put copies on different servers, but if the building is struck by lightning…
  - Could put copies on servers on different continents…

43. RAID 1: Disk Mirroring/Shadowing
- Each disk is fully replicated onto its "shadow" (100% capacity overhead)
- Bandwidth is sacrificed on writes: one logical write = two physical writes
- Reads may be optimized: can serve two independent reads of the same data
(Diagram: a mirrored pair of disks forming a recovery group.)

44. RAID 5: High I/O Rate Parity
Data is striped across multiple disks
- Successive blocks are stored on successive (non-parity) disks
- Increased bandwidth over a single disk
The parity block in each stripe is constructed by XORing the data blocks in the stripe:
- P0 = D0 ⊕ D1 ⊕ D2 ⊕ D3
Any one disk can be destroyed and the data can still be reconstructed. Suppose disk 3 fails; then we can reconstruct:
- D2 = D0 ⊕ D1 ⊕ D3 ⊕ P0
Can spread information more widely for durability: RAID algorithms (and generalizations) within a data center or across the cloud
Stripe layout across disks 1-5 (parity blocks rotate; logical disk addresses increase down the stripes):
  D0   D1   D2   D3   P0
  D4   D5   D6   P1   D7
  D8   D9   P2   D10  D11
  D12  P3   D13  D14  D15
  P4   D16  D17  D18  D19
  D20  D21  D22  D23  P5
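The parity equations above can be checked directly; a small sketch with made-up block contents, showing that XOR parity recovers any single lost block:

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-sized blocks."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

# One stripe with 4 data blocks and one parity block:
D = [b"\x01\x02", b"\x10\x20", b"\x0a\x0b", b"\x55\xaa"]
P = xor_blocks(D)                        # P0 = D0 ^ D1 ^ D2 ^ D3

# "Disk 3 fails": rebuild D[2] from the surviving blocks plus parity.
recovered = xor_blocks([D[0], D[1], D[3], P])
assert recovered == D[2]                 # D2 = D0 ^ D1 ^ D3 ^ P0
```

This works because XOR is its own inverse: XORing the parity with all surviving data blocks cancels them out, leaving the missing block.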

45. RAID 6 and Erasure Codes
In general, "RAID X" is an erasure code (treat a missing/failed disk as an erasure)
Today, disks are so big that RAID 5 is not sufficient!
- The time to repair a disk is so long that another disk might fail in the process!
"RAID 6": allow 2 disks in a replication stripe to fail
- Requires a more complex erasure code, such as the EVENODD code (see readings)
A more general option: Reed-Solomon codes
- Based on polynomials in GF(2^k) (i.e., k-bit symbols): m data points define a degree-(m-1) polynomial; the encoding is n points on that polynomial
- Any m points can be used to recover the polynomial, so n - m failures are tolerated
Erasure codes are not just for disk arrays; for example, geographic replication
- E.g., split data into 4 chunks, generate additional fragments, and distribute them across the Internet
- Any 4 fragments can be used to recover the original data: very durable!

46. How to Make File Systems More Reliable?

47. Threats to Reliability
A single logical file operation can involve updates to multiple physical disk blocks
- inode, indirect block, data block, bitmap, …
- With sector remapping, a single update to a physical disk block can require multiple (even lower-level) updates to sectors
Interrupted operation:
- A crash or power failure in the middle of a series of related updates may leave stored data in an inconsistent state
Loss of stored data:
- Failure of the non-volatile storage media may cause previously stored data to disappear or be corrupted

48. Two Reliability Approaches
Careful ordering and recovery (FAT & FFS + fsck)
- Each step builds structure: data block, then inode, then free map, then directory
- The last step links it into the rest of the FS
- Recovery scans the structure looking for incomplete actions
Versioning and copy-on-write (ZFS, …)
- Version files at some granularity
- Create a new structure linking back to the unchanged parts of the old one
- The last step is to declare that the new version is ready

49. Berkeley FFS: Create a File
Normal operation:
- Allocate a data block
- Write the data block
- Allocate an inode
- Write the inode block
- Update the bitmap of free blocks and inodes
- Update the directory with the file name → inode number mapping
- Update the modify time for the directory
Recovery:
- Scan the inode table
- If any files are unlinked (not in any directory), delete them or put them in the lost & found directory
- Compare the free block bitmap against the inode trees
- Scan directories for missing update/access times
Recovery time is proportional to disk size

50. From Indexing to Versioning
Recall: the multi-level index structure lets us find the data blocks of a file
Instead of overwriting existing data blocks and updating the index structure:
- Create a new version of the file with the updated data
- Reuse the blocks that don't change; much of what is already in place stays put
This is called copy-on-write (COW)
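A toy sketch of the versioning idea: a "version" is just an index (a tuple of block references), and an update builds a new index that shares every unchanged block with the old version. Block size and contents are made up for illustration:

```python
BLOCK = 4  # tiny blocks, for illustration only

def make_version(data):
    """Split data into blocks and return the index for version 1."""
    return tuple(data[i:i + BLOCK] for i in range(0, len(data), BLOCK))

def cow_write(version, blockno, new_block):
    """Copy-on-write update: a new index whose entries, except blockno,
    point at the old version's blocks (no data is copied)."""
    return tuple(new_block if i == blockno else blk
                 for i, blk in enumerate(version))

v1 = make_version(b"aaaabbbbcccc")
v2 = cow_write(v1, 1, b"BBBB")

assert b"".join(v1) == b"aaaabbbbcccc"    # old version intact: easy recovery
assert b"".join(v2) == b"aaaaBBBBcccc"    # new version sees the update
assert v2[0] is v1[0] and v2[2] is v1[2]  # unchanged blocks shared, not copied
```

Because the old index is never modified, a crash mid-update leaves a fully consistent old version, which is the recovery benefit the next slides build on.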

51. Copy-on-Write Example
(Diagram: a write creates a new version of the index structure; the old version remains intact.)

52. More Systematic Approach to Reliability
Use transactions for atomic updates
- Ensure that multiple related operations are performed atomically
- If a crash occurs in the middle, the state of the system should reflect all or none of the operations
Most modern file systems use transactions to safely update their internals

53. Key Systems Concept: Transaction
A transaction is an atomic sequence of reads and writes that takes the system from one consistent state to another.
- Recall: code in a critical section appears atomic to other threads
- Transactions extend the concept of atomic updates from memory to persistent storage
(Diagram: consistent state 1 → transaction → consistent state 2.)

54. Typical Transaction Structure
- Begin a transaction: get a transaction id
- Do a bunch of updates
  - If any fail along the way, roll back
  - Or, if any conflict with other transactions, roll back
- Commit the transaction

55. Key Systems Concept: Log
- Writing/appending a single item to a log is atomic, like a memory load/store on most architectures
- Key idea: appending a single (atomic) item seals the commitment to a whole sequence of actions
Example log for transaction N:
  Start Tran N
  Get $10 from account A
  Get $7 from account B
  Get $13 from account C
  Put $15 into account X
  Put $15 into account Y
  Commit Tran N

56. Journaling File Systems
- Don't modify data structures on disk directly
- Write each update as a transaction recorded in a log
  - Commonly called a journal or intention list
  - Also maintained on disk (blocks are allocated for it when formatting)
- Once changes are in the log, they can be safely applied, e.g., modifying inode pointers and directory mappings
- Garbage collection: once a change is applied, remove its entry from the log
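A toy sketch of the journal's commit protocol and the recovery behavior the following slides walk through: updates are buffered between a start record and a commit record, recovery applies only sealed transactions, and an uncommitted tail is discarded. The record formats and names here are made up for illustration:

```python
def recover(log):
    """Replay a journal: apply committed transactions, discard partial ones.
    Returns the resulting on-disk state as a dict of block -> contents."""
    disk, pending, in_txn = {}, [], False
    for rec in log:
        if rec == ("start",):
            pending, in_txn = [], True      # new transaction begins
        elif rec == ("commit",):
            for block, data in pending:     # sealed: safe to apply
                disk[block] = data
            in_txn = False
        elif in_txn:
            pending.append(rec)             # buffered until commit
    return disk                             # uncommitted tail never applied

log = [
    ("start",), ("inode", "file1-meta"), ("dirent", "file1"), ("commit",),
    ("start",), ("inode", "file2-meta"),   # crash before commit!
]
disk = recover(log)
assert disk == {"inode": "file1-meta", "dirent": "file1"}
```

The single atomic append of the commit record is what makes the whole multi-block update all-or-nothing: without it, none of the buffered writes take effect.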

57. Creating a File (No Journaling Yet)
- Find free data block(s)
- Find a free inode entry
- Find the dirent insertion point
Then:
- Write the free space map (i.e., mark blocks used)
- Write the inode entry to point to the block(s)
- Write the dirent to point to the inode
(Diagram: data blocks, free space map, inode table, and directory entries on disk.)

58. Creating a File (With Journaling)
- Find free data block(s)
- Find a free inode entry
- Find the dirent insertion point
Then, instead of updating the disk structures directly:
- [log] Write the free space map (i.e., mark blocks used)
- [log] Write the inode entry to point to the block(s)
- [log] Write the dirent to point to the inode
(Diagram: the log lives in non-volatile storage (flash or disk); head and tail pointers separate pending from done entries, bracketed by start and commit records.)

59. After Commit, Eventually Replay Transaction
- All accesses to the file system first look in the log
  - The actual on-disk data structures might be stale
- Eventually, copy the changes to disk and discard the transaction from the log
(Diagram: as each logged update is applied, the log's tail pointer advances past it.)

60. Crash Recovery: Discard Partial Transactions
- Upon recovery, scan the log
- Detect a transaction start with no matching commit
- Discard those log entries
- The disk remains unchanged
(Diagram: the log contains a start record and pending updates, but no commit.)

61. Crash Recovery: Keep Complete Transactions
- Scan the log, find a start record
- Find the matching commit
- Redo the transaction as usual, or just let it happen later
(Diagram: the log contains a complete transaction from start to commit.)

62. Journaling Summary
Why go through all this trouble?
- Updates are atomic, even if we crash: each update either gets fully applied or discarded
- All the physical operations are treated as one logical unit
Isn't this expensive?
- Yes! We're now writing all data twice (once to the log, once to the actual data blocks in the target file)
- Modern file systems journal metadata updates only: record modifications to file system data structures, but apply updates to a file's contents directly

63. Going Further: Log-Structured File Systems
The log IS what is recorded on disk
- File system operations logically replay the log to get their result
- Create data structures to make this fast
- On recovery, replay the log
Index (inodes) and directories are written into the log too
- A large, important portion of the log is cached in memory
Do everything in bulk: the log is a collection of large segments
- Each segment contains a summary of all the operations within it
- Fast to determine whether a segment is relevant or not
Free space is reclaimed by a continual cleaning process over segments
- Detect what is live or not within a segment
- Copy the live portion to a new segment being formed (replay)
- Then garbage-collect the entire segment

64. LFS Paper (see Readings)
To create "/dir1/file1", LFS writes file1's block, writes the inode for file1, writes the directory page mapping "file1" in "dir1" to its inode, and writes the inode for that directory page. It does the same for "/dir2/file2", then writes a summary of the new inodes created in the segment.
- Reads are the same for LFS and FFS
- The buffer cache is likely to hold the information in both cases
- But the disk I/Os are very different

65. Conclusion
- Buffer cache: memory used to cache kernel resources, including disk blocks and name translations
  - Can contain "dirty" blocks (blocks not yet on disk)
- File system operations involve multiple distinct updates to blocks on disk
  - We need all-or-nothing semantics, since a crash may occur during the sequence
- Traditional file systems perform check and recovery on boot
  - Along with careful ordering, so that partial operations result in loose fragments rather than data loss
- Copy-on-write provides richer functionality (versions) with much simpler recovery
  - Little performance impact, since sequential writes to a storage device are nearly free
- Transactions over a log provide a general solution
  - Commit the sequence to a durable log, then update the disk
  - The log takes precedence over the disk: replay committed transactions, discard partial ones