Badri Nath, Rutgers University

Presentation Transcript

Services deployed over a cluster
- Thousands of machines over 100s of racks, deployed over cheap hardware.
- Need to be highly available; replication.
- The state of the system needs to be consistent: every client needs a consistent view of what is where and who can do what.

Example: Bigtable tablets
- 1 TB of data with 100 MB tablets means 10,000 tablets, spread over 100s of machines.
- The tablet map (tablet id -> server id) needs to be consistent.
- The map changes because tablets can migrate to a different server, new tablets can be created, and servers can be added and deleted.

Bigtable and Chubby: perfect together
- Bigtable uses Chubby to ensure there is exactly one master, to store the map of blocks and server ids, to discover tablet servers, and to store the schema and ACLs for each table.

Why a consensus service?
- Solving consensus case by case (Paxos, 2-phase commit, 3-phase commit, etc.) means multiple ad-hoc implementations that must be incorporated into application code.
- Instead, implement a common service that all cluster applications needing consensus can use: distributed consensus as a service, with a uniform interface, highly available and scalable.

Typical locking
- Client to lock manager: lock(resource, type); grant()/refuse(); operation(resource); status; unlock(). Type = shared or exclusive.
- Difficult to get right:
- Blocking: the client holding the lock dies or goes away; how to recover locks?
- Availability: the lock manager is a critical resource and needs to be highly available; how to fail over lock manager state?
- Granularity: fine grain vs coarse grain, what to offer? Load and overhead depend on the rate of lock operations.

Chubby
- Consensus is achieved by a centralized locking service: Chubby is a lock service.
- The Chubby lock manager is replicated to provide high availability; still, at any time there is one master, with failover.
- Load on the Chubby master is minimized by a number of design decisions: coarse-grained locking; locks can be held for long periods of time (over large extents of data); Chubby is not bothered for row-level locks.

Coarse grained vs fine grained
- Less interaction with the lock manager: locks are held for longer periods of time, so there is less load on the lock server.
- Once locks are acquired to make a consistent change, the client is less dependent on the lock server; this reduces delays, and brief unavailability of the lock manager does not impede clients.
- A low lock-acquisition rate means many more clients can be serviced.
- Fine-grained locks: load grows with the transaction rate, so servers must be added and performance can suffer.
- If required, a coarse-grain lock holder can issue fine-grain locks itself, e.g. a Chubby client that holds a tablet-level lock covering the rows.

Implementing consensus as a lock service
- Chubby exports a file system interface.
- Clients interact with Chubby to create the namespace and its contents, together with synchronization (using the lock service).
- Chubby maintains a consistent view in spite of failures and concurrent accesses.

Client-Chubby interaction. Example: create a master
    hndlmstr = Open(/TBSS/MSTR, RW)
    acquire(hndlmstr, Xclusive)    # will fail if there is already a lock: a live master exists
    # else:
    setcontents(hndlmstr, masterIPaddress)
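To make this pattern concrete, here is a minimal runnable Python sketch. ChubbyCell, try_acquire_exclusive and elect_master are illustrative stand-ins for the Open/acquire/setcontents pseudo-calls above, not the real Chubby API: whoever wins the exclusive lock on the master node writes its address into that node, and a second contender fails to acquire and learns that a live master already exists.

    # A tiny in-process stand-in for a Chubby namespace with exclusive locks,
    # used only to illustrate the master-election pattern (hypothetical API).
    import threading

    class ChubbyCell:
        def __init__(self):
            self._contents = {}            # node path -> contents
            self._locks = {}               # node path -> current lock holder
            self._mutex = threading.Lock()

        def open(self, path):
            with self._mutex:
                self._contents.setdefault(path, None)
            return path                    # the "handle" is just the path here

        def try_acquire_exclusive(self, handle, holder):
            with self._mutex:
                if self._locks.get(handle) is None:
                    self._locks[handle] = holder
                    return True
                return False               # someone else already holds the lock

        def set_contents(self, handle, value):
            with self._mutex:
                self._contents[handle] = value

        def get_contents(self, handle):
            with self._mutex:
                return self._contents[handle]

    def elect_master(cell, my_address):
        """Return True if we became master, False if a live master already exists."""
        h = cell.open("/TBSS/MSTR")
        if cell.try_acquire_exclusive(h, my_address):
            cell.set_contents(h, my_address)   # advertise ourselves as master
            return True
        return False

    if __name__ == "__main__":
        cell = ChubbyCell()
        print(elect_master(cell, "10.0.0.1:9001"))   # True  -> became master
        print(elect_master(cell, "10.0.0.2:9001"))   # False -> live master exists
        print(cell.get_contents("/TBSS/MSTR"))       # 10.0.0.1:9001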
Namespace (diagram): /TBSS with children MSTR and SRVRS; SRVRS with children SRVR1, SRVR2, ... SRVRX.

Example: register a server
    hndladd = Open(/TBSS/SRVRS, RW)
    acquire(hndladd, Xclusive)
    hndlnew = Open(hndladd, SRVRX)
    acquire(hndlnew, Xclusive)
    setcontents(hndlnew, myIPaddress)
    release(hndladd)

Example: read a user table through Chubby
    hndl = Open(/crwltbl/root/mt4/usrtb1)
    utabledict = Getcontents(hndl)
    utabledict.get(key)

Chubby in a nutshell
- The consensus/consistency problem is reduced to keeping a namespace and its contents consistent using the lock service. Once you understand this, you understand Chubby.
- Next problem: keeping the namespace and its contents at Chubby resilient to failure. Replicate the state: have a set of servers capable of handling the state.

System structure
- Two main components that communicate via RPC: a server, and a library that applications link against.
- Designed for fault tolerance: a master and 4 replicas. The master is chosen using Paxos (but only among the replicas).
- All requests are handled by the master; a replica takes over in case of failure.

Locating the master
- The client finds the master by sending a master-location request to all the replicas listed in DNS.
- If a non-master replica receives the request, it responds with the master's identity.
- The client sends requests to that master until it stops responding or replies that it is no longer the master.

Reads and writes
- Write requests are propagated to all replicas and acknowledged when the write reaches a majority of the replicas.
- Read requests are satisfied by the master alone. This is safe provided the master lease has not expired, as no other master can exist in the protocol.
- The Chubby master is the arbiter for all client requests.

Master election among replicas
- Replicas use Paxos to elect a master.
- The master must obtain votes from a majority of replicas, plus a promise that they will not elect a new master during the master lease (a few seconds).
- Replicas check whether the node is still master; the lease is renewed periodically.

File interface
- Chubby exports a file interface and maintains a namespace.
- The namespace contains files and directories, called nodes: a tree of files with names separated by slashes, e.g. /ls/singer/beyonce/singlelady.
- Clients create nodes, add contents to nodes, and can acquire locks on nodes.

Ephemeral nodes
- Files and directories (nodes) can be permanent or ephemeral (cool idea): ephemeral nodes are deleted if no client has them open.
- Use ephemeral nodes in Chubby to advertise that a client is alive:
    hndl = Open(…/myname, ephemeral)
    setcontents(hndl, myipaddress)
- Any client can do a getcontents(…/myname) to infer that the machine whose IP is myipaddress is alive. If the client is dead, the node will not exist.

Access control
- RWC (read, write, change-ACL) modes.
- ACLs are files located in a separate ACL directory; each mode has a file consisting of a simple list of names of principals.
- Users are authenticated by a mechanism built into the RPC system.

Sequencers
- Locks, handles, and cached data become invalid when the lock is lost.
- What happens if there is a pending operation delayed in the network? The master has released the lock on the item, and the operation may arrive later at the master. If the item is not protected, the delayed operation may operate on inconsistent data.
- Use sequencers: Getsequencer(handle) gives you a sequence number for the lock held on that handle.

Use of a sequencer
    Open(node); lock(node); GetSequencer()             # node: sequencer
    Setcontents(node, value, sequencer); Checksequencer()
- A client passes the sequencer to the server when it wants a protected operation.
- The server checks that the sequencer is still valid (against the Chubby cache or against the most recent sequencer observed) and is in the right mode to perform the request.

Sequencers for fine-grain locking
- A Chubby client obtains a coarse-grain lock; other applications can get a fine-grain lock, along with a sequencer, from that Chubby client, e.g. for a range of rows.
- When the application writes to the server, it provides the sequencer. If the sequencer has changed (the Chubby client has lost the lock), the operation is rejected.
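A minimal runnable sketch of that check follows, assuming a hypothetical Sequencer(lock_name, mode, generation) token and a DataServer.write method (illustrative names, not Chubby's actual CheckSequencer interface): the server remembers the newest generation it has seen for each lock and rejects delayed requests carrying an older one.

    # Sequencer validation sketch: stale sequencers (lock has changed hands) are rejected.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Sequencer:
        lock_name: str       # which lock this sequencer describes
        mode: str            # "exclusive" or "shared"
        generation: int      # incremented every time the lock is re-acquired

    class DataServer:
        def __init__(self):
            self.store = {}                      # key -> value
            self.latest_generation = {}          # lock_name -> newest generation observed

        def write(self, key, value, seq: Sequencer) -> bool:
            newest = self.latest_generation.get(seq.lock_name, -1)
            if seq.generation < newest:
                return False                     # stale sequencer: the lock was lost
            if seq.mode != "exclusive":
                return False                     # wrong mode for a write
            self.latest_generation[seq.lock_name] = seq.generation
            self.store[key] = value
            return True

    if __name__ == "__main__":
        server = DataServer()
        old = Sequencer("/TBSS/tablet42", "exclusive", generation=7)
        new = Sequencer("/TBSS/tablet42", "exclusive", generation=8)
        print(server.write("row1", "a", new))   # True  -> accepted
        print(server.write("row2", "b", old))   # False -> delayed/stale request rejected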
Cassandra data model
- The entire system is a giant table with lots of rows. Each row has a key and a set of column families.
- Column family = columns or super columns. Column = (name, value, timestamp). Super column = name plus a list of columns.

Writes
- Write to a log and update in memory; periodically, writes are flushed to disk.
- Write optimizations: redundant writes are eliminated, and writes are sorted to eliminate random seeks.
- With hundreds of gigabytes of data, minimizing seek time matters.

Cassandra architecture (diagram): Cassandra API and tools on top of the data model; a storage layer with partitioner and replicator; cluster membership and failure detection; a messaging layer.

Data model example (diagram): one row key with three column families. ColumnFamily1 "MailList" (type: simple, sorted by name) holds columns tid1..tid4 with binary values and timestamps t1..t4. ColumnFamily2 "WordList" (type: super, sorted by time) holds super columns such as "aloha" with columns such as "dude". ColumnFamily3 "System" (type: super, sorted by name) holds hint super columns hint1..hint4, each with a column list.
- Column families are declared up front; super columns and columns are added and modified dynamically.

Read path (diagram): a query goes to the Cassandra cluster; one replica returns the full result while the other replicas are sent digest queries and return digest responses; read repair is performed if the digests differ.

Partitioning and replication (diagram).

Cluster membership and failure detection
- A gossip protocol is used for cluster membership: send changed state to peers (a runnable sketch appears at the end of this transcript).
- Node i keeps KeepAlive[i] = a counter/timestamp. Every T seconds each member increments its KeepAlive counter and selects one random member to send its state to.
- On receiving a KeepAlive: if it is old (already seen), forward it to a random peer with probability 1/K; else, forward it to a random peer.

Performance benchmark
- Random and sequential writes: limited by network bandwidth.
- Read performance for Inbox Search in production:

                Search Interactions    Term Search
    Min         7.69 ms                7.78 ms
    Median      15.69 ms               18.27 ms
    Average     26.13 ms               44.41 ms

Lessons learned
- Add fancy features only when absolutely required.
- Many types of failures are possible.
- Big systems need proper systems-level monitoring.
- Value simple designs.

Future work
- Atomicity guarantees across multiple keys.
- Distributed transactions.
- Compression support.
- Granular security via ACLs.
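To make the KeepAlive gossip loop in the cluster-membership section above concrete, here is a small runnable Python simulation. Node, gossip_round and the suspicion threshold are illustrative assumptions, not Cassandra's actual membership or failure-detection code: live members increment a counter every round and gossip their tables to a random peer, and a member whose counter stops advancing is eventually suspected dead.

    # Gossip-style membership sketch: counters propagate peer to peer; a stalled
    # counter marks a suspected-dead node.
    import random

    class Node:
        """One cluster member: remembers the newest KeepAlive counter seen per node."""
        def __init__(self, name):
            self.name = name
            self.keepalive = {name: 0}               # node name -> newest counter seen

        def tick(self):
            # Every round (T seconds) each member increments its own counter...
            self.keepalive[self.name] += 1

        def gossip_to(self, other):
            # ...and sends its table to one randomly chosen member; fresher counters win.
            for name, counter in self.keepalive.items():
                if counter > other.keepalive.get(name, -1):
                    other.keepalive[name] = counter

    def gossip_round(nodes, alive):
        for node in alive:
            node.tick()
            peer = random.choice([n for n in nodes if n is not node])
            node.gossip_to(peer)

    def suspected_dead(node, current_round, threshold=8):
        # A member whose counter has fallen far behind is suspected to have failed.
        return [n for n, c in node.keepalive.items() if current_round - c > threshold]

    if __name__ == "__main__":
        nodes = [Node(n) for n in ("A", "B", "C", "D")]
        for rnd in range(1, 21):
            alive = nodes if rnd <= 3 else nodes[:3]   # node D stops gossiping after round 3
            gossip_round(nodes, alive)
        print(sorted(nodes[0].keepalive.items()))      # D's counter stops at 3
        print(suspected_dead(nodes[0], current_round=20))  # typically ['D']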