Lecture 23: Application-Specific File Systems, Deep Archival Storage, Security and Protection. April 29th, 2013. Prof. John Kubiatowicz. http://inst.eecs.berkeley.edu/~cs194-24
Slide1
CS194-24
Advanced Operating Systems Structures and Implementation
Lecture 23
Application-Specific File Systems
Deep Archival Storage
Security and Protection
April 29th, 2013
Prof. John Kubiatowicz
http://inst.eecs.berkeley.edu/~cs194-24
Slide2
Goals for Today
Application-specific File Systems
Dynamo, Haystack
Deep Archival Storage
OceanStore
Security and Protection
Interactive is important! Ask Questions!
Note: Some slides and/or pictures in the following are adapted from Bovet, “Understanding the Linux Kernel”, 3rd edition, 2005
Slide3
Recall: VFS Common File Model
Four primary object types for VFS:
superblock object: represents a specific mounted filesystem
inode object: represents a specific file
dentry object: represents a directory entry
file object: represents an open file associated with a process
There is no specific directory object (VFS treats directories as files)
May need to fit the model by faking it
Example: make it look like directories are files
Example: make it look like the filesystem has inodes, superblocks, etc.
Slide4
Recall: Data-based Caching (Data “De-Duplication”)
Use a sliding-window hash function to break files into chunks
Rabin Fingerprint: randomized function of data window
Pick sensitivity: e.g. 48 bytes at a time, lower 13 bits = 0
2^-13 probability of happening, expected chunk size 8192
Need minimum and maximum chunk sizes
Now – if data stays same, chunk stays the same
Blocks named by cryptographic hashes such as SHA-256
Slide5
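The chunking scheme recalled above can be sketched in a few lines. This is a minimal illustration, not a real Rabin fingerprint: a simple polynomial rolling hash stands in for it, and BASE, MOD, MIN_CHUNK, and MAX_CHUNK are assumed values that do not come from the slides. Only the 48-byte window and the 13-bit test are taken from the slide.

```python
import hashlib

WINDOW = 48              # bytes in the sliding window (from the slide)
MASK = (1 << 13) - 1     # cut when the low 13 bits are zero (from the slide)
MIN_CHUNK = 2048         # assumed minimum chunk size
MAX_CHUNK = 65536        # assumed maximum chunk size
BASE = 257               # assumed base of the stand-in rolling hash
MOD = (1 << 61) - 1      # assumed modulus of the stand-in rolling hash
TOP = pow(BASE, WINDOW - 1, MOD)   # weight of the byte leaving the window


def chunk(data):
    """Split data at content-defined boundaries."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        if i - start >= WINDOW:
            # Remove the byte that just slid out of the window.
            h = (h - data[i - WINDOW] * TOP) % MOD
        h = (h * BASE + b) % MOD
        size = i - start + 1
        if (size >= MIN_CHUNK and (h & MASK) == 0) or size >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def chunk_names(data):
    """Name each chunk by its SHA-256 hash, as on the slide."""
    return [hashlib.sha256(c).hexdigest() for c in chunk(data)]
```

Because a boundary depends only on the last 48 bytes, inserting data near the front of a file usually changes only the first chunk or two; the later chunks keep their names and need not be stored again.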
Recall: Peer-to-Peer: Fully equivalent components
Peer-to-Peer has many interacting components
View system as a set of equivalent nodes
“All nodes are created equal”
Any structure on system must be self-organizing
Not based on physical characteristics, location, or ownership
Slide6
Recall: Lookup with Leaf Set (Chord)
[Figure: a lookup for an ID travels from Source through nodes with prefixes 0…, 10…, 110…, 111…, and a Response is returned]
Assign IDs to nodes
Map hash values to node with closest ID
Leaf set is successors and predecessors
All that’s needed for correctness
Routing table matches successively longer prefixes
Allows efficient lookups
Data Replication:
On leaf set
Slide7
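A minimal sketch of the lookup idea recalled above, assuming Chord-style "closest ID" means the first node clockwise from the key's hash. The class name Ring, the helper ring_id, and the use of SHA-1 for IDs are illustrative choices, not taken from the slides. Routing tables are omitted; only the ownership and replication rule is shown.

```python
import bisect
import hashlib


def ring_id(name):
    """Hash a node name or key onto the ring (SHA-1 is an assumed choice)."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big")


class Ring:
    def __init__(self, nodes):
        # Sorted (id, name) pairs form the ring.
        self.ids = sorted((ring_id(n), n) for n in nodes)

    def successors(self, key, count=1):
        """Return the `count` nodes that follow hash(key) clockwise.

        The first is the owner; the rest model replicas on the leaf set.
        """
        start = bisect.bisect_left(self.ids, (ring_id(key), ""))
        count = min(count, len(self.ids))
        return [self.ids[(start + k) % len(self.ids)][1] for k in range(count)]
```

With this rule, removing a node moves only the keys that node owned; every other key keeps its owner, which is the "automatically adapts" property of consistent hashing.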
Advantages/Disadvantages of Consistent Hashing
Advantages:
Automatically adapts data partitioning as node membership changes
Node given random key value automatically “knows” how to participate in routing and data management
Random key assignment gives approximation to load balance
Disadvantages:
Uneven distribution of key storage natural consequence of random node names
Leads to uneven query load
Key management can be expensive when nodes transiently fail
Assuming that we immediately respond to node failure, must transfer state to new node set
Then when node returns, must transfer state back
Can be a significant cost if transient failure common
Disadvantages of “Scalable” routing algorithms
More than one hop to find data: O(log N) or worse
Number of hops unpredictable and almost always > 1
Node failure, randomness, etc.
Slide8
Dynamo Assumptions
Query Model – Simple interface exposed to application level
Get(), Put()
No Delete()
No transactions, no complex queries
Atomicity, Consistency, Isolation, Durability
Operations either succeed or fail, no middle ground
System will be eventually consistent, no sacrifice of availability to assure consistency
Conflicts can occur while updates propagate through system
System can still function while entire sections of network are down
Efficiency – Measure system by the 99.9th percentile
Important with millions of users, 0.1% can be in the 10,000s
Non-Hostile Environment
No need to authenticate query, no malicious queries
Behind web services, not in front of them
Slide9
Service Level Agreements (SLA)
Application can deliver its functionality in a bounded time: every dependency in the platform needs to deliver its functionality with even tighter bounds.
Example: service guaranteeing that it will provide a response within 300ms for 99.9% of its requests for a peak client load of 500 requests per second
Contrast to services which focus on mean response time
[Figure: Service-oriented architecture of Amazon’s platform]
Slide10
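To see why the SLA above is stated at the 99.9th percentile rather than the mean, here is a small sketch. The function names percentile and meets_sla are hypothetical, and the nearest-rank definition of a percentile is one assumed convention among several.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def meets_sla(latencies_ms, bound_ms=300, p=99.9):
    """True if the p-th percentile latency is within the bound."""
    return percentile(latencies_ms, p) <= bound_ms
```

For 10,000 requests where 9,980 take 20 ms and 20 take 2,000 ms, the mean is about 24 ms, yet the 99.9th percentile is 2,000 ms, so the 300 ms SLA is violated even though the average looks excellent.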
Replication
Each data item is replicated at N hosts
“preference list”: the list of nodes responsible for storing a particular key
Successive nodes not guaranteed to be on different physical nodes
Thus preference list includes physically distinct nodes
Sloppy Quorum
R (or W) is the minimum number of nodes that must participate in a successful read (or write) operation.
Setting R + W > N yields a quorum-like system.
Latency of a get (or put) is dictated by the slowest of the R (or W) replicas. For this reason, R and W are usually configured to be less than N, to provide better latency.
Replicas synchronized via anti-entropy protocol
Use of Merkle tree for each unique range
Nodes exchange root of trees for shared key range
Slide11
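The Merkle-tree exchange mentioned above can be sketched as follows. This is an illustration under stated assumptions, not Dynamo's actual layout: the fixed number of leaf buckets (LEAVES), the bucket assignment by key hash, and the function names are all hypothetical. The point is that two replicas with equal roots need no further traffic, and otherwise only differing subtrees are explored.

```python
import hashlib

LEAVES = 8   # assumed number of leaf buckets per key range (power of two)


def h(data):
    return hashlib.sha256(data).digest()


def build_tree(store):
    """Return a heap-style array: node i has children 2i+1 and 2i+2."""
    buckets = [[] for _ in range(LEAVES)]
    for key in sorted(store):
        buckets[h(key.encode())[0] % LEAVES].append("%s=%s" % (key, store[key]))
    leaves = [h("\n".join(b).encode()) for b in buckets]
    tree = [b""] * (LEAVES - 1) + leaves
    for i in range(LEAVES - 2, -1, -1):
        tree[i] = h(tree[2 * i + 1] + tree[2 * i + 2])
    return tree


def differing_buckets(a, b, i=0):
    """Descend only into subtrees whose hashes differ; return leaf bucket numbers."""
    if a[i] == b[i]:
        return []
    if i >= LEAVES - 1:               # reached a leaf
        return [i - (LEAVES - 1)]
    return differing_buckets(a, b, 2 * i + 1) + differing_buckets(a, b, 2 * i + 2)
```

Two replicas would first compare `tree[0]`; only when the roots differ do they walk down, so the data transferred is proportional to the amount of divergence rather than to the size of the key range.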
Administrivia
Get moving on Lab 4
Will require you to read a bunch of code to digest the VFS layer
Design due this Thursday!
So that Palmer can have design reviews on Friday
Focus on behavioral aspects
Mounting, File operations, etc.
Don’t forget final Lecture during RRR
Monday 5/6
Send me final topics
Slide12
Data Versioning
A put() call may return to its caller before the update has been applied at all the replicas
A get() call may return many versions of the same object.
Challenge: an object having distinct version sub-histories, which the system will need to reconcile in the future.
Solution: uses vector clocks in order to capture causality between different versions of the same object
A vector clock is a list of (node, counter) pairs
Every version of every object is associated with one vector clock
If every counter in the first object’s clock is less than or equal to the corresponding node’s counter in the second clock, then the first is an ancestor of the second and can be forgotten.
Slide13
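The ancestor test described above can be written down directly. This is a minimal sketch: clocks are modeled as dictionaries from node name to counter, a missing node counts as 0, and the function names (increment, descends, reconcile) and the node names Sx, Sy, Sz used in the usage note are illustrative.

```python
def increment(clock, node):
    """Return a copy of the clock with this node's counter advanced by one."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new


def descends(a, b):
    """True if every counter in a is <= the matching counter in b."""
    return all(count <= b.get(node, 0) for node, count in a.items())


def reconcile(versions):
    """Drop versions that are strict ancestors of another version.

    `versions` is a list of (clock, value) pairs. Whatever remains is a
    set of concurrent siblings that the client must merge.
    """
    keep = []
    for i, (clock, value) in enumerate(versions):
        dominated = any(
            j != i and clock != other and descends(clock, other)
            for j, (other, _) in enumerate(versions)
        )
        if not dominated:
            keep.append((clock, value))
    return keep
```

For example, versions with clocks {Sx: 2, Sy: 1} and {Sx: 2, Sz: 1} are concurrent, so both survive reconcile; a later write with clock {Sx: 3, Sy: 1, Sz: 1} descends from both and replaces them.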
Vector clock example
Slide14
Conflicts (multiversion data)
Client must resolve conflicts
Only resolve conflicts on reads
Different resolution options:
Use vector clocks to decide based on history
Use timestamps to pick latest version
Examples given in paper:
For shopping cart, simply merge different versions
For customer’s session information, use latest version
Stale versions returned on reads are updated (“read repair”)
Vary N, R, W to match requirements of applications
High performance reads: R=1, W=N
Fast writes with possible inconsistency: W=1
Common configuration: N=3, R=2, W=2
When do branches occur?
Branches uncommon: 0.06% of requests saw > 1 version over 24 hours
Divergence occurs because of high write rate (more coordinators), not necessarily because of failure
Slide15
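The N, R, W settings listed above can be checked by brute force. The sketch below assumes a strict quorum over a fixed set of N replicas (Dynamo's sloppy quorum relaxes this during failures); the function name overlaps_always is hypothetical. It enumerates every possible read set and write set and tests whether they always share a replica.

```python
from itertools import combinations


def overlaps_always(n, r, w):
    """True if every read set of size r intersects every write set of size w."""
    nodes = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(nodes, r)
        for write_set in combinations(nodes, w)
    )
```

The enumeration agrees with the rule R + W > N: the common configuration N=3, R=2, W=2 and the read-optimized N=3, R=1, W=3 always overlap, so a read sees at least one replica holding the latest write, while N=3, R=1, W=1 does not.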
Haystack File System
Does it ever make sense to adapt a file system to a particular usage pattern?
Perhaps
Good example: Facebook’s “Haystack” filesystem
Specific application (Photo Sharing)
Large files! Many files!
260 billion images, 20 petabytes (10^15 bytes!)
One billion new photos a week (60 terabytes)
Presence of Content Delivery Network (CDN)
Distributed caching and distribution network
Facebook web servers return special URLs that encode requests to CDN
Pay for service by bandwidth
Specific usage patterns:
New photos accessed a lot (caching well)
Old photos accessed little, but likely to be requested at any time: NEEDLES
[Figure: number of photos requested in a day]
Slide16
Old Solution: NFS
Issues with this design?
Long tail: caching does not work for most photos
Every access to back-end storage must be fast without benefit of caching!
Linear directory scheme works badly for many photos/directory
Many disk operations to find even a single photo
Directory’s block map too big to cache in memory
“Fixed” by reducing directory size, however still not great
Meta-Data (FFS) requires ≥ 3 disk accesses per lookup
Caching all inodes in memory might help, but inodes are big
Fundamentally, photo storage is different from other storage:
Normal file systems fine for developers, databases, etc.
Slide17
New Solution: Haystack
Finding a needle (old photo) in Haystack
Differentiate between old and new photos
How? By looking at “Writeable” vs “Read-only” volumes
New photos go to Writeable volumes
Directory: help locate photos
Name (URL) of photo has embedded volume and photo ID
Let CDN or Haystack Cache serve new photos rather than forwarding them to Writeable volumes
Haystack Store: multiple “Physical Volumes”
Physical volume is large file (100 GB) which stores millions of photos
Data accessed by Volume ID with offset into file
Since Physical Volumes are large files, use XFS which is optimized for large files
Slide18
Haystack Details
Each physical volume is stored as single file in XFS
Superblock: general information about the volume
Each photo (a “needle”) stored by appending to file
Needles stored sequentially in file
Naming: [Volume ID, Key, Alternate Key, Cookie]
Cookie: random value to avoid guessing attacks
Key: unique 64-bit photo ID
Alternate Key: four different sizes, ‘n’, ‘a’, ‘s’, ‘t’
Deleted needle: simply marked as “deleted”
Overwritten needle: new version appended at end
Slide19
Haystack Details (Con’t)
Replication for reliability and performance:
Multiple physical volumes combined into logical volume
Factor of 3
Four different sizes
Thumbnails, Small, Medium, Large
Lookup
User requests webpage
Webserver returns URL of form:
http://<CDN>/<Cache>/<Machine id>/<Logical volume,photo>
Possibly reference cache only if old image
CDN will strip off CDN reference if missing, forward to cache
Cache will strip off cache reference and forward to Store
In-memory index on Store for each volume maps: [Key, Alternate Key] → Offset
Slide20
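A toy version of one Haystack Store volume, combining the append-only needle file with the in-memory index described above. Everything concrete here is an assumption made for illustration: the header layout in HEADER, the class name Volume, and the use of an in-memory BytesIO in place of a large XFS file. The real needle format carries more fields.

```python
import io
import struct

# Hypothetical needle header: key, alternate key, cookie, deleted flag, data size.
HEADER = struct.Struct("<QcIBI")
FLAG_OFFSET = 13   # byte position of the deleted flag inside the header


class Volume:
    def __init__(self):
        self.file = io.BytesIO()   # stands in for one large file on XFS
        self.index = {}            # (key, alt_key) -> offset of newest needle

    def put(self, key, alt_key, cookie, data):
        """Append a needle; an overwrite simply appends a newer needle."""
        offset = self.file.seek(0, io.SEEK_END)
        self.file.write(HEADER.pack(key, alt_key, cookie, 0, len(data)) + data)
        self.index[(key, alt_key)] = offset

    def get(self, key, alt_key, cookie):
        """One index lookup, then one read at a known offset."""
        offset = self.index.get((key, alt_key))
        if offset is None:
            return None
        self.file.seek(offset)
        _, _, stored, deleted, size = HEADER.unpack(self.file.read(HEADER.size))
        if deleted or stored != cookie:   # wrong cookie: treat as a guessed URL
            return None
        return self.file.read(size)

    def delete(self, key, alt_key):
        """Mark the needle as deleted in place; space is not reclaimed here."""
        offset = self.index.pop((key, alt_key), None)
        if offset is not None:
            self.file.seek(offset + FLAG_OFFSET)
            self.file.write(b"\x01")
```

The design point this illustrates is that a read costs a dictionary lookup plus a single seek into the volume file, instead of the several metadata disk accesses a general-purpose filesystem needs per photo.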
What about Protection?
Start by asking some high-level questions…
What do we expect of our systems?
Won’t leak our information
Won’t lose our information
Will always work when we need them
Won’t launch attacks against other people
How can we prevent systems from misbehaving?
Never connect them to the network?
Always authenticate users?
Never use them?
Protection: use of one or more mechanisms for controlling the access of programs, processes, or users to resources
Page Table Mechanism
File Access Mechanism
On-disk encryption
Can use lots of Protection but still have an insecure system!
Bugs, back doors, viruses, poorly defined policy, inside man
Denial of service, …
Slide21
Protection vs Security
Security is a very complex topic: see, e.g., CS161
Security is about Policy, i.e. what human-centered properties do we want from our system
Usually with reference to an attack model
Security is achieved through a series of Mechanisms, i.e. individual elements of the system combined together to achieve a security policy
Security: use of protection mechanisms to prevent misuse of resources
Misuse defined with respect to policy
E.g.: prevent exposure of certain sensitive information
E.g.: prevent unauthorized modification/deletion of data
Requires consideration of the external environment within which the system operates
Even the most well-constructed system cannot protect information if user accidentally reveals password
Slide22
Preventing Misuse
Types of Misuse:
Accidental:
If I delete shell, can’t log in to fix it!
Could make it more difficult by asking: “do you really want to delete the shell?”
Intentional:
Some high school brat who can’t get a date, so instead he transfers $3 billion from B to A.
Doesn’t help to ask if they want to do it (of course!)
Three Pieces to Security
Authentication: who the user actually is
Authorization: who is allowed to do what
Enforcement: make sure people do only what they are supposed to do
Loopholes in any carefully constructed system:
Log in as superuser and you’ve circumvented authentication
Log in as self and can do anything with your resources; for instance: run program that erases all of your files
Can you trust software to correctly enforce Authentication and Authorization?
Slide23
Authentication: Identifying Users
How to identify users to the system?
Passwords
Shared secret between two parties
Since only the user knows the password, whoever types the correct password must be the user
Very common technique
Smart Cards
Electronics embedded in card capable of providing long passwords or satisfying challenge-response queries
May have display to allow reading of password
Or can be plugged in directly; several credit cards now in this category
Biometrics
Use of one or more intrinsic physical or behavioral traits to identify someone
Examples: fingerprint reader, palm reader, retinal scan
Becoming quite a bit more common
What else?
Consider the “Swarm” and “Un-pad” views
Slide24
Timing Attacks: Tenex Password Checking
Tenex – early 70’s, BBN
Most popular system at universities before UNIX
Thought to be very secure, gave “red team” all the source code and documentation (want code to be publicly available, as in UNIX)
In 48 hours, they figured out how to get every password in the system
Here’s the code for the password check:
for (i = 0; i < 8; i++)
    if (userPasswd[i] != realPasswd[i])
        goto error;
How many combinations of passwords?
256^8? Wrong!
Slide25
Defeating Password Checking
Tenex used VM, and it interacts badly with the above code
Key idea: force page faults at inopportune times to break passwords quickly
Arrange 1st char in string to be last char in page, rest on next page
Then arrange for page with 1st char to be in memory, and rest to be on disk (e.g., reference lots of other pages, then reference 1st page)
a|aaaaaa
 |
page in memory | page on disk
Time password check to determine if first character is correct!
If fast, 1st char is wrong
If slow, 1st char is right, page fault, one of the others wrong
So try all first characters, until one is slow
Repeat with first two characters in memory, rest on disk
Only 256 * 8 attempts to crack passwords
Fix is easy: don’t stop until you look at all the characters
Slide26
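The attack above can be simulated without real page faults. In this sketch the early-exit check also reports how many bytes it examined; that count is an assumed stand-in for the fast/slow timing signal the attacker observed on Tenex. The function names are hypothetical.

```python
def leaky_check(guess, real):
    """Early-exit comparison, as in the slide's loop.

    Also returns the number of bytes examined, which models the
    timing (page fault) side channel.
    """
    for i in range(len(real)):
        if guess[i] != real[i]:
            return False, i + 1
    return True, len(real)


def crack(real):
    """Recover the password one position at a time using the leak."""
    known = bytearray(len(real))
    attempts = 0
    for pos in range(len(real)):
        for candidate in range(256):
            known[pos] = candidate
            attempts += 1
            ok, examined = leaky_check(bytes(known), real)
            if ok or examined > pos + 1:   # check moved past pos: byte is right
                break
    return bytes(known), attempts


def full_check(guess, real):
    """The slide's fix: always look at every character before deciding."""
    if len(guess) != len(real):
        return False
    diff = 0
    for g, r in zip(guess, real):
        diff |= g ^ r
    return diff == 0
```

For an 8-byte password, crack needs at most 256 * 8 = 2048 guesses instead of 256^8. In Python code that compares secrets, the standard library's hmac.compare_digest serves the same purpose as full_check.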
Recall: Authorization: Who Can Do What?
How do we decide who is authorized to do actions in the system?
Access Control Matrix: contains all permissions in the system
Resources across top
Files, Devices, etc…
Domains in rows
A domain might be a user or a group of permissions
[Figure: access control matrix]
E.g. above: User D3 can read F2 or execute F3
In practice, table would be huge and sparse!
Two approaches to implementation
Access Control Lists: store permissions with each object
Still might be lots of users!
UNIX limits each file to: r,w,x for owner, group, world
More recent systems allow definition of groups of users and permissions for each group
Capability List: each process tracks objects it has permission to touch
Popular in the past, idea out of favor today
Consider page table: each process has list of pages it has access to, not each page has list of processes …
Slide27
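The two implementation approaches above are just two ways of slicing the same sparse matrix. In this sketch only the D3 entries come from the slide; the D1 entry and all function names are illustrative assumptions.

```python
from collections import defaultdict

# Sparse access matrix: only non-empty cells are stored.
# The D3 cells are from the slide; the D1 cell is made up for illustration.
MATRIX = {
    ("D1", "F1"): {"read", "write"},
    ("D3", "F2"): {"read"},
    ("D3", "F3"): {"execute"},
}


def access_control_lists(matrix):
    """Slice by column: each object stores who may do what to it."""
    acls = defaultdict(dict)
    for (domain, obj), rights in matrix.items():
        acls[obj][domain] = rights
    return dict(acls)


def capability_lists(matrix):
    """Slice by row: each domain carries the objects it may touch."""
    caps = defaultdict(dict)
    for (domain, obj), rights in matrix.items():
        caps[domain][obj] = rights
    return dict(caps)


def allowed(matrix, domain, obj, right):
    return right in matrix.get((domain, obj), set())
```

An ACL answers "who can touch this file?" cheaply, while a capability list answers "what can this process touch?" cheaply; a page table is the capability-style slice.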
Authorization Continued
Principle of least privilege: programs, users, and systems should get only enough privileges to perform their tasks
Very hard to do in practice
How do you figure out what the minimum set of privileges is needed to run your programs?
People often run at higher privilege than necessary
Such as the “administrator” privilege under Windows
One solution: Signed Software
Only use software from sources that you trust, thereby dealing with the problem by means of authentication
Fine for big, established firms such as Microsoft, since they can make their signing keys well known and people trust them
Actually, not always fine: recently, one of Microsoft’s signing keys was compromised, leading to malicious software that looked valid
What about new startups?
Who “validates” them?
How easy is it to fool them?
Slide28
Mandatory Access Control (MAC)
“A type of access control by which the operating system constrains the ability of a subject or initiator to access or generally perform some sort of operation on an object or target.” (from Wikipedia)
Subject: a process or thread
Object: files, directories, TCP/UDP ports, etc.
Security policy is centrally controlled by a security policy administrator: users not allowed to operate outside the policy
Examples: SELinux, HiStar, etc.
Contrast: Discretionary Access Control (DAC)
Access restricted based on the identity of subjects and/or groups to which they belong
Controls are discretionary: a subject with a certain access permission is capable of passing that permission on to any other subject
Standard UNIX model
Slide29
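A minimal sketch of the standard UNIX discretionary check named above, assuming the usual owner/group/world mode bits and ignoring the superuser and extended ACLs. The function names may_access and chmod_by_owner are hypothetical; the bit constants come from Python's stat module.

```python
import stat

# Mode bits for (owner, group, world), per requested access type.
BITS = {
    "r": (stat.S_IRUSR, stat.S_IRGRP, stat.S_IROTH),
    "w": (stat.S_IWUSR, stat.S_IWGRP, stat.S_IWOTH),
    "x": (stat.S_IXUSR, stat.S_IXGRP, stat.S_IXOTH),
}


def may_access(mode, file_uid, file_gid, uid, gids, want):
    """Pick exactly one class (owner, else group, else world) and test its bit."""
    owner_bit, group_bit, world_bit = BITS[want]
    if uid == file_uid:
        return bool(mode & owner_bit)
    if file_gid in gids:
        return bool(mode & group_bit)
    return bool(mode & world_bit)


def chmod_by_owner(mode, file_uid, uid, new_mode):
    """Discretionary part: the owner may change the mode and pass access on."""
    return new_mode if uid == file_uid else mode
```

With mode 0o640 the owner can read and write, group members can only read, and everyone else is denied; the owner can then widen the mode to 0o644 at their own discretion, which is exactly what a mandatory policy would forbid.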
Data Centric Access Control (DCAC?)
Problem with many current models:
If you break into OS, data is compromised
In reality, it is the data that matters – hardware is somewhat irrelevant (and ubiquitous)
Data-Centric Access Control (DCAC)
I just made this term up, but you get the idea
Protect data at all costs, assume that software might be compromised
Requires encryption and sandboxing techniques
If hardware (or virtual machine) has the right cryptographic keys, then data is released
All of the previous authorization and enforcement mechanisms reduce to key distribution and protection
Never let decrypted data or keys outside sandbox
Examples: Use of TPM, virtual machine mechanisms
Slide30
Enforcement
Enforcer checks passwords, ACLs, etc.
Makes sure that only authorized actions take place
Bugs in enforcer mean things for malicious users to exploit
Normally, in UNIX, superuser can do anything
Because of coarse-grained access control, lots of stuff has to run as superuser in order to work
If there is a bug in any one of these programs, you lose!
Paradox
Bullet-proof enforcer
Only known way is to make enforcer as small as possible
Easier to make correct, but simple-minded protection model
Fancy protection
Tries to adhere to principle of least privilege
Really hard to get right
Same argument for Java or C++: What do you make private vs public?
Hard to make sure that code is usable but only necessary modules are public
Pick something in middle? Get bugs and weak protection!
Slide31
Summary
Peer-to-Peer: Use of 100s or 1000s of nodes to keep higher performance or greater availability
May need to relax consistency for better performance
Application-Specific File Systems (e.g. Haystack):
Optimize system for particular usage pattern
Security: use of protection mechanisms to prevent misuse of resources
Represents Human-Centered Policy as opposed to mechanism
Three Pieces to Security
Authentication: who the user actually is
Authorization: who is allowed to do what
Enforcement: make sure people do only what they are supposed to do
Principle of least privilege: programs, users, and systems should get only enough privileges to perform their tasks