/
HewlettPackard Laboratories Palo Alto CA danteohpl HewlettPackard Laboratories Palo Alto CA danteohpl

HewlettPackard Laboratories Palo Alto CA danteohpl - PDF document

cheryl-pisano
cheryl-pisano . @cheryl-pisano
Follow
429 views
Uploaded On 2015-06-01

HewlettPackard Laboratories Palo Alto CA danteohpl - PPT Presentation

hpcom Computer Systems Laboratory Stanford University kinshuk yhuang mendelcsstanfordedu Abstract Despite the fact that lar gescale shar edmemory multipr o cessors have been commer cially available for several years system softwar e that fully utiliz ID: 77917

hpcom Computer Systems Laboratory Stanford

Share:

Link:

Embed:

Download Presentation from below link

Download Pdf The PPT/PDF document "HewlettPackard Laboratories Palo Alto CA..." is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.


Presentation Transcript

154*Hewlett-Packard LaboratoriesPalo Alto, CAdanteo@hpl.hp.comComputer Systems LaboratoryStanford University{kinshuk, yhuang, mendel}@cs.stanford.eduAbstractDespite the fact that large-scale shared-memory multipro-cessors have been commercially available for several years,system software that fully utilizes all their features is stillnot available, mostly due to the complexity and cost of mak-ing the required changes to the operating system. A recentlyproposed approach, called Disco, substantially reduces thisdevelopment cost by using a virtual machine monitor thatleverages the existing operating system technology.In this paper we present a system called Cellular Discothat extends the Disco work to provide all the advantages ofthe hardware partitioning and scalable operating systemapproaches. We argue that Cellular Disco can achieve thesebeneÞts at only a small fraction of the development cost ofmodifying the operating system. Cellular Disco effectivelyturns a large-scale shared-memory multiprocessor into avirtual cluster that supports fault containment and hetero-geneity, while avoiding operating system scalability bottle-necks. Yet at the same time, Cellular Disco preserves thebeneÞts of a shared-memory multiprocessor by implement-ing dynamic, Þne-grained resource sharing, and by allow-ing users to overcommit resources such as processors andmemory. This hybrid approach requires a scalable resourcemanager that makes local decisions with limited informa-tion while still providing good global performance and faultcontainment.In this paper we describe our experience with a CellularDisco prototype on a 32-processor SGI Origin 2000 system.We show that the execution time penalty for this approach islow, typically within 10% of the best available commercialoperating system for most workloads, and that it can man-age the CPU and memory resources of the machine signiÞ-cantly better than the hardware partitioning approach.1IntroductionShared-memory multiprocessor systems with up to a fewhundred processors have been commercially available forthe past several years. Unfortunately, due to the developmentcost and the complexity of the required changes, most oper-ating systems are unable to effectively utilize these largemachines. Poor scalability restricts the size of machines thatcan be supported by most current commercial operating sys-tems to at most a few dozen processors. Memory allocationalgorithms that are not aware of the large difference in localversus remote memory access latencies on NUMA (Non-Uniform Memory Access time) systems lead to suboptimalapplication performance. Resource management policies notdesigned to handle a large number of resources can lead tocontention and inefficient usage. Finally, the inability of theoperating system to survive any hardware or system softwarefailure results in the loss of all the applications running onthe system, requiring the entire machine to be rebooted.The solutions that have been proposed to date are eitherbased on hardware partitioning [4][21][25][28], or requiredeveloping new operating systems with improved scalabilityand fault containment characteristics [3][8][10][22]. Unfor-tunately, both of these approaches suffer from serious draw-backs. 
Hardware partitioning limits the flexibility withwhich allocation and sharing of resources in a large systemcan be adapted to dynamically changing load requirements.Since partitioning effectively turns the system into a clusterof smaller machines, applications requiring a large numberof resources will not perform well. New operating systemdesigns can provide excellent performance, but require aconsiderable investment in development effort and timebefore reaching commercial maturity.A recently proposed alternative approach, calledDisco[2], uses a virtual machine monitor to run unmodifiedcommodity operating systems on scalable multiprocessors.With a low implementation cost and a small run-time virtu-alization overhead, the Disco work shows that a virtualmachine monitor can be used to address scalability andNUMA-awareness issues. By running multiple copies of anoff-the-shelf operating system, the Disco approach is able toleverage existing operating system technology to form thesystem software for scalable machines.Although Disco demonstrated the feasibility of this newapproach, it left many unanswered questions. In particular,the Disco prototype lacked several major features that madeit difficult to compare Disco to other approaches. For exam-ple, while other approaches such as hardware partitioningsupport hardware fault containment, the Disco prototypelacked such support. In addition, the Disco prototype lackedthe resource management mechanisms and policies requiredCellular Disco: resource management using virtualclusters on shared-memory multiprocessorsKinshuk Govil, Dan Teodosiu*, Yongqiang Huang, and Mendel RosenblumPermission to make digital or hard copies of all or part of this workfor personal or classroom use is granted without fee provided thatcopies are not made or distributed for profit or commercial advan-tage, and that copies bear this notice and the full citation on the firstpage. To copy otherwise, to republish, to post on servers or to redis-tribute to lists, requires prior specific permission and/or a fee.SOSP-17 12/1999 Kiawah Island, SC© 1999 ACM 1-58113-140-2/99/0012É$5.0017th ACM Symposium on Operating Systems Principles (SOSPÕ99)Published asOperating Systems Review34(5):154-169, Dec. 1999 155to make it competitive compared to a customized operatingsystem approach.In this work we present a system called Cellular Discothat extends the basic Disco approach by supporting hard-ware fault containment and aggressive global resource man-agement, and by running on actual scalable hardware. Oursystem effectively turns a large-scale shared-memorymachine into avirtual cluster by combining the scalabilityand fault containment benefits of clusters with the resourceallocation flexibility of shared-memory systems. Our experi-ence with Cellular Disco shows that:1.Hardware fault containment can be added to a virtualmachine monitor with very low run-time overheads andimplementation costs. With a negligible performance pen-alty over the existing virtualization overheads, fault contain-ment can be provided in the monitor at only a very smallfraction of the development effort that would be needed foradding this support to the operating system.2.The virtual cluster approach can quickly and efÞ-ciently correct resource allocation imbalances in scalablesystems. This capability allows Cellular Disco to managethe resources of a scalable multiprocessor signiÞcantly bet-ter than a hardware partitioning scheme and almost as wellas a highly-tuned operating system-centric approach. 
Vir-tual clusters do not suffer from the resource allocation con-straints of actual hardware clusters, since large applicationscan be allowed to use all the resources of the system,instead of being conÞned to a single partition.3.The small-scale, simulation-based results of Discoappear to match the experience of running workloads onreal scalable hardware. We have built a Cellular Disco pro-totype that runs on a 32-processor SGI Origin 2000[14] andis able to host multiple instances of SGIÕs IRIX6.2 operat-ing system running complex workloads. Using this system,we have shown that Cellular Disco provides all the featuresmentioned above while keeping the run-time overhead ofvirtualization below 10% for most workloads.This paper focuses on our experience with the mecha-nisms and policies implemented in Cellular Disco for deal-ing with the interrelated challenges of hardware faultcontainment and global resource management:Fault containment:Although a virtual machine monitorautomatically provides software fault containment in that afailure of one operating system instance is unlikely to harmsoftware running in other virtual machines, the large poten-tial size of scalable shared-memory multiprocessors alsorequires the ability to contain hardware faults. Cellular Discois internally structured into a number of semi-independentcells, or fault-containment units. This design allows theimpact of most hardware failures to be confined to a singlecell, a behavior very similar to that of clusters, where mostfailures remain limited to a single node.While Cellular Disco is organized in a cellular structuresimilar to the one in the Hive operating system [3], providingfault containment in Cellular Disco required only a fractionof the development effort needed for Hive, and it does notimpact performance once the virtualization cost has beenfactored out. A key design decision that reduced cost com-pared to Hive was to assume that the code of Cellular Discoitself is correct. This assumption is warranted by the fact thatthe size of the virtual machine monitor (50K lines of C andassembly) is small enough to be thoroughly tested.Resource management:In order to support betterresource management than hardware clusters, Cellular Discoallows virtual machines to overcommit the actual physicalresources present in the system. This offers an increaseddegree of flexibility by allowing Cellular Disco to dynami-cally adjust the fraction of the system resources assigned toeach virtual machine. This approach can lead to a signifi-cantly better utilization of the system, assuming thatresource requirement peaks do not occur simultaneously.Cellular Disco multiplexes physical processors amongseveral virtual machines, and supports memory paging inaddition to any such mechanism that may be provided by thehosted operating system. These features have been carefullyimplemented to avoid the inefficiencies that have plaguedvirtual machine monitors in the past[20]. For example, Cel-lular Disco tracks operating system memory usage and pag-ing disk I/O to eliminate double paging overheads.Cellular Disco must manage the physical resources in thesystem while satisfying the often conflicting constraints ofproviding good fault-containment and scalable resource loadbalancing. Since a virtual machine becomes vulnerable tofaults in a cell once it starts using any resources from thatcell, fault containment will only be effective if all of theresources for a given virtual machine are allocated from asmall number of cells. 
However, a naive policy may subop-timally use the resources due to load imbalance. Resourceload balancing is required to achieve efficient resource utili-zation in large systems. The Cellular Disco implementationof both CPU and memory load balancing was designed topreserve fault containment, avoid contention, and scale tohundreds of nodes.In the process of virtualizing the hardware, CellularDisco can also make many of the NUMA-specific resourcemanagement decisions for the operating system. The physi-cal memory manager of our virtual machine monitor imple-ments first-touch allocation and dynamic migration orreplication of ÒhotÓ memory pages [29]. These features arecoupled with a physical CPU scheduler that is aware ofmemory locality issues.By virtualizing the underlying hardware, Cellular Discoprovides an additional level of indirection that offers an eas-ier and more effective alternative to changing the operatingsystem. For instance, we have added support that allowslarge applications running across multiple virtual machinesto interact directly through shared memory by registeringtheir shared memory regions directly with the virtualmachine monitor. This support allows a much more efÞcientinteraction than through standard distributed-system proto-cols and can be provided transparently to the hosted operat-ing system.This paper is structured as follows. We start by describ-ing the Cellular Disco architecture in Section2. Section3 156describes the prototype implementation and the basic virtu-alization and fault-containment overheads. Next, we discussour resource management mechanisms and policies: CPUmanagement in Section4 and memory management inSection5. Section6 discusses hardware fault recovery. Weconclude after comparing our work to hardware- and oper-ating system-centric approaches and discussing relatedwork.2The Cellular Disco architectureCompared to previous work on virtual machine monitors,Cellular Disco introduces a number of novel features: sup-port for hardware fault containment, scalable resource man-agement mechanisms and policies that are aware of faultcontainment constraints, and support for large, memory-intensive applications. For completeness, we first present ahigh-level overview of hardware virtualization that parallelsthe descriptions given in [2] and [5]. We then discuss each ofthe distinguishing new features of Cellular Disco in turn.2.1 Overview of hardware virtualizationCellular Disco is a virtual machine monitor [5] that can exe-cute multiple instances of an operating system by runningeach instance inside its own virtual machine (see Figure1).Since the virtual machines export an interface that is similarto the underlying hardware, the operating system instancesneed not be aware that they are actually running on top ofCellular Disco.For each newly created virtual machine, the user speci-fies the amount of resources that will be visible to that virtualmachine by indicating the number of virtual CPUs (VCPUs),the amount of memory, and the number and type of I/Odevices. The resources visible to a virtual machine are calledphysical resources. Cellular Disco allocates the actualmachine resources to each virtual machine as required by theNodeCCNodeCCNodeCCNodeCCNodeCCNodeCCNodeCCNodeCCInterconnectOSOSOperating SystemApplicationAppAppAppFigure 1.Cellular Disco architecture. Multipleinstances of an off-the-shelf operating system runinside virtual machines on top of a virtual machinemonitor; each instance is only booted with as manyresources as it can handle well. 
In the Origin 2000each node contains two CPUs and a portion of thesystem memory (not shown in the Þgure).Cellular Disco (Virtual Machine Monitor)VMVMVirtual Machinedynamic needs and the priority of the virtual machine, simi-lar to the way an operating system schedules physicalresources based on the needs and the priority of user applica-tions.To be able to virtualize the hardware, the virtualmachine monitor needs to intercept all privileged opera-tions performed by a virtual machine. This can be imple-mented efficiently by using the privilege levels of theprocessor. Although the complexity of a virtual machinemonitor depends on the underlying hardware, even com-plex architectures such as the Intel x86 have been success-fully virtualized [30]. The MIPS processor architecture[11] that is supported by Cellular Disco has three privilegelevels:user mode (least privileged, all memory accesses aremapped),supervisor mode (semi-privileged, allowsmapped accesses to supervisor and user space), andkernelmode (most privileged, allows use of both mapped andunmapped accesses to any location, and allows executionof privileged instructions). Without virtualization, theoperating system runs at kernel level and applications exe-cute in user mode; supervisor mode is not used. Under Cel-lular Disco, only the virtual machine monitor is allowed torun at kernel level, and thus to have direct access to allmachine resources in the system. An operating systeminstance running inside a virtual machine is only permittedto use the supervisor and user levels. Whenever a virtual-ized operating system kernel executes a privileged instruc-tion, the processor will trap into Cellular Disco where thatinstruction is emulated. Since in supervisor mode all mem-ory accesses are mapped, an additional level of indirectionthus becomes available to map physical resources to actualmachine resources.The operating system executing inside a virtual machinedoes not have enough access privilege to perform I/O opera-tions. When attempting to access an I/O device, a CPU willtrap into the virtual machine monitor, which checks thevalidity of the I/O request and either forwards it to the realI/O device or performs the necessary actions itself in the caseof devices such as the virtual paging disk (see Section5.3).Memory is managed in a similar way. While the operatingsystem inside a virtual machine allocates physical memoryto satisfy the needs of applications, Cellular Disco allocatesmachine memory as needed to back the physical memoryrequirements of each virtual machine. Apmap data structuresimilar to the one in Mach [18] is used by the virtual machinemonitor to map physical addresses to actual machineaddresses. In addition to the pmap, Cellular Disco needs tomaintain amemmap structure that allows it to translate backfrom machine to physical pages; this structure is used fordynamic page migration and replication, and for fault recov-ery (see Section6).Performing the physical-to-machine translation using thepmap at every software reload of the MIPS TLB can lead tovery high overheads. Cellular Disco reduces this overheadby maintaining for every VCPU a 1024-entry translationcache called thesecond level software TLB (L2TLB). 
Theentries in the L2TLB correspond to complete virtual-to-machine translations, and servicing a TLB miss from the 157L2TLB is much faster than generating a virtual exception tobe handled by the operating system inside the virtualmachine.2.2 Support for hardware fault containmentAs the size of shared-memory machines increases, reliabilitybecomes a key concern for two reasons. First, one can expectto see an increase in the failure rate of large systems: a tech-nology that fails once a year for a small workstation corre-sponds to a failure rate of once every three days when usedin a 128-processor system. Second, since a failure will usu-ally bring down the entire system, it can cause substantiallymore state loss than on a small machine. Fault tolerance doesnot necessarily offer a satisfactory answer for most users,due to the system cost increase and to the fact that it does notprevent operating system crashes from bringing down theentire machine.Support for software fault containment (of faults occur-ring in the operating systems running inside the virtualmachines) is a straightforward benefit of any virtual machinemonitor, since the monitor can easily restrict the resourcesthat are visible to each virtual machine. If the operating sys-tem running inside a virtual machine crashes, this will notimpact any other virtual machines.To address the reliability concerns for large machines,we designed Cellular Disco to supporthardware fault con-tainment, a technique that can limit the impact of faults toonly a small portion of the system. After a fault, only a smallfraction of the machine will be lost, together with any appli-cations running on that part of the system, while the rest ofthe system can continue executing unaffected. This behavioris similar to the one exhibited by a traditional cluster, wherehardware and system software failures tend to stay localizedto the node on which they occurred.To support hardware fault containment, Cellular Disco isinternally structured as a set of semi-independentcells, asshown in Figure2. Each cell contains a complete copy of themonitor text and manages all the machine memory pagesbelonging to its nodes. A failure in one cell will only bringdown the virtual machines that were using resources fromthat cell, while virtual machines executing elsewhere will beCellular DiscoNodeNodeNodeNodeNodeNodeNodeNodeVMVMVirtual MachineFigure 2.The cellular structure of Cellular Disco allowsthe impact of a hardware fault to be contained withinthe boundary of the cell where the fault occurred.Cell boundariesInterconnectable to continue unaffected. We designed the system to favora smaller overhead during normal execution but a higher costwhen a component fails, hopefully an infrequent occurrence.The details of the fault recovery algorithm are covered inSection6.One of our basic assumptions when designing CellularDisco was that the monitor can be kept small enough to bethoroughly tested so that its probability of failure isextremely low. Cellular Disco is thus considered to be atrusted system software layer. 
This assumption is warrantedby the fact that with a size of less than 50K lines, the monitoris about as complex as other trusted layers in the shared-memory machine (e.g., the cache coherence protocol imple-mentation), and it is about two orders of magnitude simplerthan modern operating systems, which may contain up toseveral million lines of code.The trusted layer decision can lead to substantiallysmaller overheads compared to a design in which the systemsoftware layer cannot be trusted due to its complexity, suchas in the case of the Hive operating system [3]. If cells do nottrust each other, they have to use expensive distributed pro-tocols to communicate and to update their data structures.This is substantially less efficient than directly using sharedmemory. The overheads become evident when one considersthe case of a single virtual machine straddling multiple cells,all of which need to update the monitor data structures cor-responding to the virtual machine. An example of a structurerequiring frequent updates is the pmap address translationtable.Although Cellular Disco cells can use shared memoryfor updating virtual machine-specific data structures, theyare not allowed to directly touch data structures in othercells that are essential for the survival of those cells. Forthose cases, as well as when the monitor needs to requestthat operations be executed on a given node or VCPU, acarefully designed communication mechanism is providedin Cellular Disco that offers low latency and exactly-oncesemantics.The basic communication primitive is a fast inter-proces-sor RPC (Remote Procedure Call). For our prototype Origin2000 implementation, we measured the round-trip time foran RPC carrying a cache line-sized argument and reply (128bytes) at 16ms. Simulation results indicate that this time canbe reduced to under 7ms if appropriate support is providedin the node controller, such as in the case of the FLASH mul-tiprocessor [13].A second communication primitive, called amessage, isprovided for executing an action on the machine CPU thatcurrently owns a virtual CPU. This obviates most of the needfor locking, since per-VCPU operations are serialized on theowner. The cost of sending a message is on average the sameas that of an RPC. Messages are based on a fault tolerant, dis-tributed registry that is used for locating the current owner ofa VCPU given the ID of that VCPU. Since the registry iscompletely rebuilt after a failure, VCPUs can change owners(that is, migrate around the system) without having todepend on a fixed home. Our implementation guarantees 158exactly-once message semantics in the presence of conten-tion, VCPU migration, and hardware faults.2.3 Resource management under constraintsCompared to traditional resource management issues, anadditional requirement that increases complexity in CellularDisco is fault containment. The mechanisms and policiesused in our system must carefully balance the often conflict-ing requirements of efficiently scheduling resources andmaintaining good fault containment. While efficientresource usage requires that every available resource in thesystem be used when needed, good fault containment canonly be provided if the set of resources used by any givenvirtual machine is confined to a small number of cells. Addi-tionally, our algorithms had to be designed to scale to systemsizes of up to a few hundred nodes. 
The above requirementshad numerous implications for both CPU and memory man-agement.CPU management: Operating systems for shared-mem-ory machines normally use a global run queue to performload sharing; each idle CPU looking for work examines therun queue to attempt to find a runnable task. Such anapproach is inappropriate for Cellular Disco because it vio-lates fault-containment requirements and because it is asource of contention in large systems. In Cellular Disco,each machine processor maintainsits own run queue ofVCPUs. However, even with proper initial load placement,separate run queues can lead to an imbalance among the pro-cessors due to variability in processor usage over the lifetimeof the VCPUs. A load balancing scheme is used to avoid thesituation in which one portion of the machine is heavilyloaded while another portion is idle. The basic load balanc-ing mechanism implemented in Cellular Disco isVCPUmigration; our system supports intra-node, intra-cell, andinter-cell migration of VCPUs. VCPU migration is used bya balancing policy module that decides when and whichVCPU to migrate, based on the current load of the systemand on fault containment restrictions.An additional feature provided by the Cellular Discoscheduler is that all non-idle VCPUs belonging to the samevirtual machine aregang-scheduled. Since the operating sys-tems running inside the virtual machines use spinlocks fortheir internal synchronization, gang-scheduling is necessaryto avoid wasting precious cycles spinning for a lock held bya descheduled VCPU.Memory management: Fault-containment requires thateach Cellular Disco cell manage its own memory allocation.However, this can lead to a case in which a cell running amemory-intensive virtual machine may run out of memory,while other cells have free memory reserves. In a static par-titioning scheme there would be no choice but to start pagingdata out to disk. To avoid an inefficient use of the shared-memory system, Cellular Disco implements amemory bor-rowing mechanism through which a cell may temporarilyobtain memory from other cells. Since memory borrowingmay be limited by fault containment requirements, we alsosupport paging as a fall-back mechanism.An important aspect of our memory balancing policies isthat they carefully weigh the performance gains obtained byallocating borrowed memory versus the implications forfault containment, since using memory from a remote cellcan make a virtual machine vulnerable to failures on thatcell.2.4 Support for large applicationsIn order to avoid operating system scalability bottlenecks,each operating system instance is given only as manyresources as it can handle well. Applications that need fewerresources than those allocated to a virtual machine run asthey normally would in a traditional system. However, largeapplications are forced to run across multiple virtualmachines.The solution proposed in Disco was to split large appli-cations and have the instances on the different virtualmachines communicate using distributed systems protocolsthat run over a fast shared-memory based virtual ethernetprovided by the virtual machine monitor. This approach issimilar to the way such applications are run on a cluster or ahardware partitioning environment. 
Unfortunately, thisapproach requires that shared-memory applications berewritten, and incurs significant overhead introduced bycommunication protocols such as TCP/IP.Cellular DiscoÕs virtual cluster environment provides amuch more efficient sharing mechanism that allows largeapplications to bypass the operating system and registershared-memory regions directly with the virtual machinemonitor. Since every system call is intercepted first by themonitor before being reflected back to the operating system,it is easy to add in the monitor additional system call func-tionality for mapping global shared-memory regions. Appli-cations running on different virtual machines cancommunicate through these shared-memory regions withoutany extra overhead because they simply use the cache-coher-ence mechanisms built into the hardware. The only draw-back of this mechanism is that it requires relinking theapplication with a different shared-memory library, and pos-sibly a few small modifications to the operating system forhandling misbehaving applications.Since the operating system instances are not aware ofapplication-level memory sharing, the virtual machine mon-itor needs to provide the appropriate paging mechanisms andpolicies to cope with memory overload conditions. Whenpaging out to disk, Cellular Disco needs to preserve the shar-ing information for pages belonging to a shared-memoryregion. In addition to the actual page contents, the CellularDisco pager writes out a list of virtual machines using thatpage, so that sharing can be properly restored when the pageis faulted back in.3The Cellular Disco prototypeIn this section we start by discussing our Cellular Disco pro-totype implementation that runs on actual scalable hardware.After describing the experimental setup, we provide evalua-tions of our virtualization and fault containment overheads. 1593.1 Prototype implementationThe Cellular Disco virtual machine monitor was designed tosupport shared-memory systems based on the MIPS R10000processor architecture [11]. Our prototype implementationconsists of about 50K lines of C and assembly and runs on a32-processor SGI Origin 2000 [14].One of the main hurdles we had to overcome in the pro-totype was the handling of I/O devices. Since coping with allthe details of the Origin I/O hardware was beyond our avail-able resources, we decided to leverage the device driverfunctionality already present in the SGI IRIX 6.4 operatingsystem for our prototype. Our Cellular Disco implementa-tion thus runspiggybacked on top of IRIX 6.4.To run our Cellular Disco prototype, we first boot theIRIX 6.4 operating system with a minimal amount of mem-ory. Cellular Disco is implemented as a multi-threaded ker-nel process that spawns a thread on each CPU. The threadsare pinned to their designated processors to prevent the IRIXscheduler from interfering with the control of the virtualmachine monitor over the machineÕs CPUs. Subsequentactions performed by the monitor violate the IRIX processabstraction, effectively taking over the control of themachine from the operating system. After saving the kernelregisters of the host operating system, the monitor installs itsown exception handlers and takes over all remaining systemmemory. The host IRIX 6.4 operating system remains dor-mant but can be reactivated any time Cellular Disco needs touse a device driver.Whenever one of the virtual machines created on top ofCellular Disco requests an I/O operation, the request is han-dled by the procedure illustrated in Figure3. 
The I/O requestcauses a trap into Cellular Disco (1), which checks accesspermissions and simply forwards the request to the hostIRIX (2) by restoring the saved kernel registers and excep-tion vectors, and requesting the host kernel to issue theappropriate I/O request (3). From the perspective of the hostoperating system, it looks as if Cellular Disco had been run-ning all the time just like any other well-behaved kernel pro-cess. After IRIX initiates the I/O request, control returns toCellular Disco, which puts the host kernel back into the dor-mant state. Upon I/O completion the hardware raises aninterrupt (4), which is handled by Cellular Disco because theFigure 3.I/O requests made by a virtual machine arehandled using host IRIX device drivers. This is a sixstep process that is fully described in the text.VirtualMachineCellular DiscoHostIRIX6.4Hardware123456forwardactualI/Oreqcompletion int(devicedrivers)I/O reqexception vectors have been overwritten. To allow the hostdrivers to properly handle I/O completion the monitor reac-tivates the dormant IRIX, making it look as if the I/O inter-rupt had just been posted (5). Finally, Cellular Disco posts avirtual interrupt to the virtual machine to notify it of the com-pletion of its I/O request (6). Since some drivers require thatthe kernel be aware of time, Cellular Disco forwards alltimer interrupts in addition to device interrupts to the hostIRIX.Our piggybacking technique allowed us to bring up oursystem on real hardware quickly, and enabled Cellular Discoto handle any hardware device IRIX supports. By measuringthe time spent in the host IRIX kernel, we found the over-head of the piggybacking approach to be small, less than 2%of the total running time for all the benchmarks we ran. Themain drawback of our current piggybacking scheme is that itdoes not support hardware fault containment, given themonolithic design of the host operating system. While thefault containment experiments described in Section6 do notuse the piggybacking scheme, a solution running one copy ofthe host operating system per Cellular Disco cell would bepossible with appropriate support in the host operating sys-tem.3.2 Experimental setupWe evaluated Cellular Disco by executing workloads on a32-processor SGI Origin 2000 system configured as shownin Table4. The running times for our benchmarks rangefrom 4 to 6 minutes, and the noise is within 2%.On this machine we ran the following four workloads:Database, Pmake, Raytrace, and Web server. These work-loads, described in detail in Table5, were chosen becausethey stress different parts of the system and because they area representative set of applications that commercial users runon large machines.3.3 Virtualization overheadsThe performance penalty that must be paid for virtualizationlargely depends on the processor architecture of the virtual-ized system. The dominant portion of this overhead is thecost of handling the traps generated by the processor for eachprivileged instruction executed by the kernel.To measure the impact of virtualization we compared theperformance of the workloads executing under two differentsetups. First, we ran the workloads on IRIX6.4 executingComponentCharacteristicsProcessors32 x MIPS R10000 @ 195 MHzNode controllers16 x SGI Hub @100 MHzMemory3.5 GBL2 cache size4 MBDisks5 (total capacity: 40GB)Table 4.SGI Origin 2000 conÞguration that was usedfor running most of the experiments in this paper. 
160directly on top of the bare hardware.Then, we ran the sameworkloads on IRIX6.2 executing on top of the Cellular Discovirtual machine monitor. We used two different versions ofIRIX to demonstrate that Cellular Disco can leverage an off-the-shelf operating system that has only limited scalability toprovide essentially the same functionality and performanceas an operating system specifically designed for large-scalemachines. IRIX6.2 was designed for small-scale Challengebus-based multiprocessors [7], while IRIX6.4 was the latestoperating system available for the Origin 2000 when westarted our experimental work. Another reason for using twodifferent versions of IRIX is that IRIX6.2 does not run on theOrigin 2000. Except for scalability fixes in IRIX6.4, the twoversions are fairly similar; therefore, the uniprocessor num-bers presented in this section provide a good estimate of thevirtualization cost. However, multiprocessor numbers mayalso be distorted by the scalability limitations of IRIX6.2.The Cellular Disco virtualization overheads are shown inFigure6. As shown in the figure, the worst-case uniproces-sor virtualization penalty is only 9%. For each workload, thebar on the left shows the time (normalized to 100) needed tocomplete the run on IRIX 6.4, while the bar on the rightshows the relative time to complete the same run on IRIX 6.2running on top of the monitor. The execution time is brokenWorkloadDescriptionDatabaseDecision support workload based on the TPC-D [27] query suite on Informix Relational Database version 7.1.2 using a200MB and a 1GB database. We measure the sum of the run times of the 17 non-update queries.PmakeI/O intensive parallel compilation of the SGI IRIX 5.3 operating system (about 500K lines of C and assembly code).RaytraceCPU intensive ray tracer from the SPLASH-2 [31] parallel benchmark suite. We used the balls4 data set with varyingamounts of anti-aliasing so that it runs four to six minutes for single- and multi-process conÞgurations.WebKernel intensive web server workload. SpecWEB96 [23] running on an Apache web server. Although the workload alwaysruns for 5 minutes, we scaled the execution times so that each run performs the same number of requests.Table 5.Workloads. The execution times reported in this paper are the average of two stable runs after an initialwarm-up run. The running times range from 4 to 6 minutes, with a noise of 2%.100IRIX105CDDatabase100IRIX109CDPmake100IRIX102CDRaytrace100IRIX107CDWeb020406080100120Normalized Execution TimeIdleCellular DiscoIrix KernelUser 100\nIRIX109\nCDDatabase100\nIRIX112 CDPmake 100\nIRIX103\nCDRaytrace\r100\nIRIX106\nCDWeb020406080100120Normalized Execution TimeIdleCellular DiscoIrix KernelUserFigure 6.Virtualization overheads. For each workload, the left bar shows the execution time separated into variousmodes for the benchmark running on IRIX6.4 on top of the bare hardware. The right bar shows the same bench-mark running on IRIX6.2 on top of Cellular Disco. The time spent in IRIX6.4 device drivers is included in theCellular Disco portion of each right bar. For multiprocessor runs, the idle time under Cellular Disco increases due tothe virtualization overheads in the serial parts of the workload. The reduction in user time for some workloads isdue to better memory placement. 
Note that for most workloads, the overheads are within 10%.100IRIX110CDDatabase100IRIX120CDPmake100IRIX101CDRaytrace100IRIX104CDWebLoaded.100IRIX80CDWeb..Unloaded020406080100120Normalized Execution TimeIdleCellular DiscoIrix KernelUserUniprocessor8 processors32 processorsdown into time spent in idle mode, in the virtual machinemonitor (this portion also includes the time spent in the hostkernelÕs device drivers), in the operating system kernel, andin user mode. This breakdown was measured by using thehardware counters of the MIPS R10000 processors.Figure6 also shows the virtualization overheads for 8and 32 processor systems executing a single virtual machinethat spans all the processors. We have included two cases(loaded and unloaded) for the Web workload because thetwo systems perform very differently depending on the load.The unloaded case limits the number of server and client pro-cesses to 16 each (half the number of processors), while theloaded case starts 32 clients and does not limit the number ofserver processes (the exact value is determined by the webserver). IRIX6.4 uses blocking locks in the networking code,which results in better performance under heavy load, whileIRIX6.2 uses spin locks, which increases kernel time butperforms better under light load. The Database, Pmake, andWeb benchmarks have a large amount of idle time due totheir inability to fully exploit the available parallelism; a sig-nificant fraction of those workloads is serialized on a singleprocessor. Note that on a multiprocessor virtual machine,any virtualization overheads occurring in the serial part of aworkload aremagnified since they increase the idle time of 161the unused VCPUs. Even under such circumstances, CellularDisco introduces only 20% overhead in the worst case.3.4 Fault-containment overheadsIn order to gauge the overheads introduced by the cellularstructure of Cellular Disco, we ran our benchmarks on top ofthe virtual machine monitor using two configurations. First,the monitor was run as a single cell spanning all 32 proces-sors in the machine, corresponding to a setup that does notprovide any fault containment. Second, we booted CellularDisco in an 8-cell configuration, with 4 processors per cell.We ran our workloads inside a 32-processor virtual machinethat was completely contained in the single cell in the firstcase, and that spanned all 8 cells in the second one.Figure7 shows that the running time for virtualmachines spanning cell boundaries is practically the same aswhen executing in a single cell (except for some small differ-ences due to scheduling artifacts). This result shows that inCellular Disco, hardware fault containment can be providedat practically no loss in performance once the virtualizationoverheads have been factored out. This result stands in sharpcontrast to earlier fault containment work [3].4CPU managementIn this section, we first describe the processor load balancingmechanisms provided in Cellular Disco. We then discuss thepolicies we use to actually balance the system. Next we dis-cuss our implementation of gang scheduling. We concludewith an evaluation of the performance of the system and withcomments on some interesting issues regarding inter-cellmigration.4.1CPUbalancing mechanismsCellular Disco supports three different types of VCPU100 1c101 8c!Database"100 1c!98#8c#Pmake$100 1c101 8c!Raytrace%100 1c!101 8c#Web&020406080100Normalized Execution TimeIdleCellular Disco'Irix KernelUserFigure 7.Overhead of fault-containment. 
The left bar,normalized to 100, shows the execution breakdown ina single cell conÞguration. The right bar shows theexecution proÞle on an 8 cell system. In both cases,we ran a single 32-processor virtual machine spanningthe entire system.migration, each providing a different tradeoff between per-formance and cost.The simplest VCPU migration case occurs when aVCPU is moved to a different processor on the same node(the Origin 2000 has two CPUs per node). Although the timerequired to update the internal monitor data structures is only37ms, the real cost is paid gradually over time due to the lossof CPU cache affinity. To get a rough estimate of this cost,let us assume that half of the 128-byte lines in the 4 MB sec-ond-level cache are in use, with half of the active lines localand the other half remote. Refilling this amount of cachedinformation on the destination CPU requires about 8ms.The second type of migration occurs when a VCPU ismoved to a processor on a different node within the samecell. Compared to the cost of intra-node migration, this caseincurs the added cost of copying the second level softwareTLB (described in Section2.1) which is always kept on thesame node as the VCPU since it is accessed very frequently.At 520ms, the cost for copying the entire L2TLB (32 KB) isstill much smaller than the gradual cost of refilling the CPUcache. However, inter-node migration has a higher long-term cost because the migrated VCPU is likely to accessmachine memory pages allocated on the previous node.Unlike the cost of cache affinity loss which is only paid once,accessing remote memory is a continuous penalty that isincurred every time the processor misses on a remote cacheline. Cellular Disco alleviates this penalty by dynamicallymigrating or replicating frequently accessed pages to thenode generating the cache misses [29].The third type of VCPU migration occurs when a VCPUis moved across a cell boundary; this migration costs1520ms including the time to copy the L2TLB. Besides los-ing cache and node affinity, this type of migration may alsoincrease the fault vulnerability of the VCPU. If the latter hasnever before run on the destination cell and has not beenusing any resources from it, migrating it to the new cell willmake it vulnerable to faults in that cell. However, CellularDisco provides a mechanism through which dependencies tothe old cell can be entirely removed by moving all the dataused by the virtual machine over to the new cell; this processis covered in detail in Section4.5.4.2 CPU balancing policiesCellular Disco employs two separate CPU load balancingpolicies: the idle balancer and the periodic balancer. The idlebalancer runs whenever a processor becomes idle, and per-forms most of the balancing work. The periodic balancerredistributes those VCPUs that are not handled well by theidle balancer.When a processor becomes idle, the idle balancer runs onthat processor to search for VCPUs that can be ÒstolenÓ fromthe run queues of neighboring processors in the same cell,starting with the closest neighbor. However, the idle bal-ancer cannot arbitrarily select any VCPU on the remotequeues due to gang scheduling constraints. Cellular Discowill schedule a VCPU only when all the non-idle VCPUs ofthat virtual machine are runnable. Annotations on the idle 162loop of the kernel inform Cellular Disco when a VCPUbecomes idle. The idle balancer checks the remote queuesfor VCPUs that, if moved, would allow that virtual machineto run. 
For example, consider the case shown inFigure8.VCPUs in the top row are currently executing on the actualmachine CPUs; CPU 0 is idle due to gang scheduling con-straints. After checking the remote queues, the idle balancerrunning on CPU 1 will migrate VCPU B1 because the migra-tion will allow VCPUs B0 and B1 to run on CPUs 0 and 1,respectively. Although migrating VCPU B1 would allow toit start executing right away, it may have enough cache andnode affinity on CPU 2 to cancel out the gains. CellularDisco tries to match the benefits with the cost of migrationby delaying migration until a VCPU has been descheduledfor some time depending on the migration distance: 4ms forintra-node, and 6ms for inter-node. These were the optimalvalues after testing a range from 1ms to 10ms; however, theoverall performance only varies by 1-2% in this range.The idle balancer performs well even in a fairly loadedsystem because there are usually still a few idle cycles avail-able for balancing decisions due to the fragmentation causedby gang scheduling. However, by using only local loadinformation to reduce contention, the idle balancer is notalways able to take globally optimal decisions. For this rea-son, we included in our system a periodic balancer that usesglobal load information to balance load in heavily loadedsystems and across different cells. Querying each processorindividually is impractical for systems with hundreds of pro-cessors. Instead, each processor periodically updates theload tree, a low-contention distributed data structure thattracks the load of the entire system.The load tree, shown in Figure8, is a binary tree encom-passing the entire machine. Each leaf of the tree represents aprocessor, and stores the load on that processor. Each innernode in the tree contains the sum of the loads of its children.Periodic balancerIdle balancerFigure 8.CPU balancing scenario. The numbers insidethe nodes of the tree represent the CPU load on thecorresponding portion of the machine. The letter in theVCPU name speciÞes the virtual machine, while thenumber designates the virtual processor. VCPUs inthe top row are currently scheduled on the processors.0VC A1VC A0121134VC B0VC B1CPU0CPU1CPU2CPU3Load treeTo reduce memory contention the tree nodes are physicallyspread across the machine. Starting from its correspondingleaf, each processor updates the tree on every 10ms timerinterrupt. Cellular Disco reduces the contention on higherlevel nodes by reducing the number of processors that canupdate a level by half at every level greater than three.The periodic balancer traverses this tree depth first,checking the load disparity between the two children. If thedisparity is larger than one VCPU, The balancer will try tofind a VCPU from the loaded side that is a good candidate formigration. Gang scheduling requires that two VCPUs of thesame VM not be scheduled on the same processor; therefore,one of the requirements for a good candidate is that the lessloaded side must have a processor that does not already haveanother VCPU of the same virtual machine. If the two sidesbelong to different cells, then migrating a VCPU will make itvulnerable to faults in the new cell. 
To prevent VCPUs frombeing vulnerable to faults in many cells, Cellular Disco keepstrack of the list of cells each VCPU is vulnerable to, and theperiodic balancer prefers migrating VCPUs that are alreadyvulnerable to faults on the less-loaded cell.Executing the periodic balancer across the entire systemcan be expensive for large machines; therefore we left this asa tunable parameter, currently set at 80ms. However,heavily loaded systems can have local load imbalances thatare not be handled by idle balancer due to the lack of idlecycles. Cellular Disco addresses this problem by also addinga local periodic load balancer that runs on each 8 CPU regionevery 20ms. The combination of these schemes results in anefficient adaptive system.4.3 Scheduling policyBoth of the balancing schemes described in the previous sec-tion would be ineffective without a scalable gang scheduler.Most gang schedulers use either space or time partitioning,but these schemes require a centralized manager thatbecomes a scalability bottleneck. Cellular DiscoÕs scheduleruses a distributed algorithm similar to the IRIX gang sched-uler [1].When selecting the next VCPU to run on a processor, ourscheduler always picks the highest-priority gang-runnableVCPU that has been waiting the longest. A VCPU becomesgang-runnable when all the non-idle VCPUs of that virtualmachine are either running or waiting on run queues of pro-cessors executing lower priority virtual machines. Afterselecting a VCPU, the scheduler sends RPCs to all the pro-cessors that have VCPUs belonging to this virtual machinewaiting on the run queue. On receiving this RPC, those pro-cessors deschedule the VCPU they were running, follow thesame scheduling algorithm, and converge on the desired vir-tual machine. Each processor makes its own decisions, butends up converging on the correct choice without employinga central global manager.4.4 CPU management resultsWe tested the effectiveness of the complete CPU manage-ment system by running the following three-part experiment. 163First, we ran a single virtual machine with 8VCPUs execut-ing an 8-process raytrace, leaving 24 processors idle. Next,we ran four such virtual machines, each one running an 8-process raytrace. Finally, we ran eight virtual machines con-figured the same way, a total of 64VCPUs running raytraceprocesses. An ideal system would run the first two configu-rations in the same time, while the third case should taketwice as long. We measured only a 0.3% increase in the sec-ond case, and the final configuration took 2.17 times as long.The extra time can be attributed to migration overheads,cache affinity loss due to scheduling, and some load imbal-ance. To get a baseline number for the third case, we ran thesame experiment on IRIX6.4 and found that IRIX actuallyexhibits a higher overhead of 2.25.4.5 Inter-cell migration issuesMigrating VCPUs across cell boundaries raises a number ofinteresting issues. One of these is when to migrate the datastructure associated with the entire virtual machine, not justa single VCPU. The size of this data structure is dominatedby the pmap, which is proportional to the amount of physicalmemory the virtual machine is allowed to use. Although theL2TLB reduces the number of accesses to the pmap, it is stilldesirable to place the pmap close to the VCPUs so that soft-ware reloaded TLB misses can be satisfied quickly. Also, ifall the VCPUs have migrated out of a cell, keeping the pmapin the old cell leaves the virtual machine vulnerable to faultsin the old cell. 
We could migrate the virtual machine-widedata structures when most of the VCPUs have migrated to anew cell, but the pmap is big enough that we do not want tomove it that frequently. Therefore, we migrate it only whenall the VCPUs have migrated to a different cell. We havecarefully designed this mechanism to avoid blocking theVCPUs, which can run concurrently with this migration.This operation takes 80ms to copy I/O-related data struc-tures other than the pmap, and copying the pmap takes161ms per MB of physical memory the virtual machine isallowed to use.Although Cellular Disco migrates the virtual machinedata structures when all the VCPUs have moved away froma cell, this is not sufficient to remove vulnerability to faultsoccurring in the old cell. To become completely independentfrom the old cell, any data pages being used by a virtualmachine must be migrated as well. This operation takes25ms per MB of memory being used by the virtual machineand can be executed without blocking any of the VCPUs.5Memory managementIn this section, we focus on the problem of managingmachine memory across cells. We will present the mecha-nisms to address this problem, policies that uses those mech-anisms, and an evaluation of the performance of thecomplete system. The section concludes by looking at issuesrelated to paging.5.1 Memory balancing mechanismBefore describing the Cellular Disco memory balancingmechanism, it is important to discuss the memory allocationmodule. Each cell maintains its ownfreelist (list of freepages) indexed by the home node of each memory page. Ini-tially, the freelist entries for nodes not belonging to this cellare empty, as the cell has not yet borrowed any memory.Every page allocation request is tagged with a list of nodesthat can supply the memory (this list is initialized when a vir-tual machine is created). When satisfying a request, a higherpreference is given to memory from the local node, in orderto reduce the memory access latency on NUMA systems(first-touch allocation strategy).The memory balancing mechanism is fairly straightfor-ward. A cell wishing to borrow memory issues a fast RPC toa cell which has available memory. The loaner cell allocatesmemory from its freelist and returns a list of machine pagesas the result of the RPC. The borrower adds those pages toits freelist, indexed by their home node. This operation takes758ms to borrow 4MB of memory.5.2 Memory balancing policiesA cell starts borrowing memory when its number of freepages reaches a low threshold, but before completely run-ning out of pages. This policy seeks to avoid forcing smallvirtual machines that fit into a single cell to have to useremote memory. For example, consider the case of a cellwith two virtual machines: one with a large memory foot-print, and one that entirely fits into the cell. The large virtualmachine will have to use remote memory to avoid paging,but the smaller one can achieve good performance with justlocal memory, without becoming vulnerable to faults inother cells. The cell must carefully decide when to allocateremote memory so that enough local memory is available tosatisfy the requirements of the smaller virtual machine.Depending on their fault containment requirements,users can restrict the set of cells from which a virtualmachine can use borrowed memory. Paging must be used asa last recourse if free memory is not available from any of thecells in this list. 
To avoid paging as much as possible, a cellshould borrow memory from cells that are listed in the allo-cation preferences of the virtual machines it is executing.Therefore, every cell keeps track of the combined allocationpreferences of all the virtual machines it is executing, andadjusts that list whenever a virtual machine migrates into orout of the cell.A policy we have found to be effective is the following:when the local free memory of a cell drops below 16MB, thecell tries to maintain at least 4MB of free memory from eachcell in its allocation preferences list; the cell borrows 4MBfrom each cell in the list from which it has less than 4MBavailable. This heuristic biases the borrowing policy tosolicit memory from cells that actively supply pages to atleast one virtual machine. Cells will agree to loan memory aslong as they have more than 32MB available. The abovethresholds are all tunable parameters. These default valueswere selected to provide hysteresis for stability, and they arebased on the number of pages that can be allocated duringthe interval between consecutive executions of the policy, 164every 10ms. In this duration, each CPU can allocate at most732KB, which means that a typical cell with 8CPUs canonly allocate 6MB in 10ms if all the CPUs allocate memoryas fast as possible, a very unlikely scenario; therefore, wedecided to borrow 4MB at a time. Cells start borrowingwhen only 16MB are left because we expect the resident sizeof small virtual machines to be in 10-15MB range.We measured the effectiveness of this policy by runninga 4-processor Database workload. First, we ran the bench-mark with the monitor configured as a single cell, in whichcase there is no need for balancing. Next, we ran in an 8-cellconfiguration, with 4 CPUs per cell. In the second configu-ration, the cell executing the Database virtual machine didnot have enough memory to satisfy the workload and endedup borrowing 596MB of memory from the other cells. Bor-rowing this amount of memory had a negligible impact onthe overall execution time (less than 1% increase).5.3 Issues related to pagingIf all the cells are running low on memory, there is no choicebut to page data out to disk. In addition to providing the basicpaging functionality, our algorithms had to solve three addi-tional challenges: identifying actively used pages, handlingmemory pages shared by different virtual machines, andavoiding redundant paging.Cellular Disco implements a second-chance FIFO queueto approximate LRU page replacement, similar to VMS[15]. Each virtual machine is assigned a resident set size thatis dynamically trimmed when the system is running low onmemory. Although any LRU approximation algorithm canfind frequently used pages, it cannot separate the infre-quently used pages into pages that contain active data andunallocated pages that contain garbage. Cellular Discoavoids having to write unallocated pages out to disk by non-intrusively monitoring the physical pages actually beingused by the operating system. Annotations on the operatingsystemÕs memory allocation and deallocation routines pro-vide the required information to the virtual machine moni-tor.A machine page can be shared by multiple virtualmachines if the page is used in a shared memory region asdescribed in Section2.4, or as a result of a COW (Copy-On-Write) optimization. The sharing information is usually keptin memory in the control data structures for the actualmachine page. 
A machine page can be shared by multiple virtual machines if the page is used in a shared memory region as described in Section 2.4, or as a result of a COW (Copy-On-Write) optimization. The sharing information is usually kept in memory in the control data structures for the actual machine page. However, this information cannot remain there once the page has been written out, if the machine page is to be reused. In order to preserve the sharing, Cellular Disco writes the sharing information out to disk along with the data. The sharing information is stored on a contiguous sector following the paged data so that it can be written out using the same disk I/O request; this avoids the penalty of an additional disk seek.

Redundant paging is a problem that has plagued early virtual machine implementations [20]. This problem can occur since there are two separate paging schemes in the system: one in Cellular Disco, the other in the operating systems running in the virtual machines. With these schemes making independent decisions, some pages may have to be written out to disk twice, or read in just to be paged back out. Cellular Disco avoids this inefficiency by trapping every read and write to the kernel's paging disk, identified by designating for every virtual machine a special disk that acts as the virtual paging disk. Figure 9 illustrates the problem and the way Cellular Disco avoids it. In both cases shown, the virtual machine kernel wishes to write a page to its paging disk that Cellular Disco has already paged out to its own paging disk. Without the virtual paging disk, as shown in Case A, the kernel's pageout request appears to the monitor as a regular disk write of a page that has been paged out to Cellular Disco's paging disk. Therefore, Cellular Disco will first fault that page in from its paging disk, and then issue the write for the kernel's paging disk. Case B shows the optimized version with the virtual paging disk. When the operating system issues a write to this disk, the monitor notices that it has already paged out the data, so it simply updates an internal data structure to make the sectors of the virtual paging disk point to the real sectors on Cellular Disco's paging disk. Any subsequent operating system read from the paging disk is satisfied by looking up the actual sectors in the indirection table and reading them from Cellular Disco's paging disk.

Figure 9. Redundant paging. Disk activity is shown in bold. Case A illustrates the problem, which results in 3 disk accesses, while Case B shows the way Cellular Disco avoids it, requiring just one disk access.
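A minimal sketch of the virtual paging disk indirection is shown below, assuming a direct-mapped remap table and stub disk helpers. The names and the table organization are illustrative only; collision handling and persistence of the mapping are omitted, and the real monitor ties this logic into its memory and I/O subsystems.

    /* Sketch of the virtual paging disk indirection. */
    #include <stddef.h>
    #include <stdint.h>

    #define REMAP_SLOTS 4096

    typedef uint64_t sector_t;

    struct remap_entry {
        sector_t guest_sector;    /* sector on the VM's virtual paging disk      */
        sector_t monitor_sector;  /* real sector on Cellular Disco's paging disk */
        int      valid;
    };

    static struct remap_entry remap[REMAP_SLOTS];

    /* Stand-ins for the monitor's real disk I/O path. */
    static void disk_write(sector_t s, const void *buf, size_t len)
    { (void)s; (void)buf; (void)len; }
    static void disk_read(sector_t s, void *buf, size_t len)
    { (void)s; (void)buf; (void)len; }

    /* Guest pageout trap: if the monitor has already paged this data out,
     * record an indirection instead of faulting the page in and rewriting it. */
    static void guest_pageout(sector_t guest_sector, int already_paged_out,
                              sector_t monitor_sector,
                              const void *data, size_t len)
    {
        if (already_paged_out) {
            struct remap_entry *e = &remap[guest_sector % REMAP_SLOTS];
            e->guest_sector   = guest_sector;
            e->monitor_sector = monitor_sector;
            e->valid          = 1;            /* Case B: no disk I/O at all     */
            return;
        }
        disk_write(guest_sector, data, len);  /* ordinary pageout               */
    }

    /* Guest page-in trap: follow the indirection if one exists. */
    static void guest_pagein(sector_t guest_sector, void *buf, size_t len)
    {
        struct remap_entry *e = &remap[guest_sector % REMAP_SLOTS];
        if (e->valid && e->guest_sector == guest_sector)
            disk_read(e->monitor_sector, buf, len);
        else
            disk_read(guest_sector, buf, len);
    }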
We measured the impact of the paging optimization by running the following micro-benchmark, called stressMem. After allocating a very large chunk of memory, stressMem writes a unique integer on each page; it then loops through all the pages again, verifying that the value it reads is the same as what it wrote out originally. StressMem ran for 258 seconds when executing without the virtual paging disk optimization, but it took only 117 seconds with the optimization (a 55% improvement).

6 Hardware fault recovery

Due to the tight coupling provided by shared-memory hardware, the effects of any single hardware fault in a multiprocessor can very quickly ripple through the entire system. Current commercial shared-memory multiprocessors are thus extremely likely to crash after the occurrence of any hardware fault. To resume operation on the remaining good resources after a fault, these machines require a hardware reset and a reboot of the operating system.

As shown in [26], it is possible to design multiprocessors that limit the impact of most faults to a small portion of the machine, called a hardware fault containment unit. Cellular Disco requires that the underlying hardware be able to recover itself with such a recovery mechanism. After detecting a hardware fault, the fault recovery support described in [26] diagnoses the system to determine which resources are still operational and reconfigures the machine in order to allow the resumption of normal operation on the remaining good resources. An important step in the reconfiguration process is to determine which cache lines have been lost as a result of the failure. Following a failure, cache lines can be either coherent (lines that were not affected by the fault) or incoherent (lines that have been lost because of the fault). Since the shared-memory system is unable to supply valid data for incoherent cache lines, any cache miss to these lines must be terminated by raising an exception.

After completing hardware recovery, the hardware informs Cellular Disco that recovery has taken place by posting an interrupt on all the good nodes. This interrupt will cause Cellular Disco to execute its own recovery sequence to determine the set of still-functioning cells and to decide which virtual machines can continue execution after the fault. This recovery process is similar to that done in Hive [3], but our design is much simpler for two reasons: we did not have to deal with operating system data structures, and we can use shared-memory operations because cells can trust each other. Our simpler design results in a much faster recovery time.

In the first step of the Cellular Disco recovery sequence, all cells agree on a liveset (the set of still-functioning nodes) that forms the basis of all subsequent recovery actions. While each cell can independently obtain the current liveset by reading hardware registers [26], the possibility of multiple hardware recovery rounds resulting from back-to-back hardware faults requires the use of a standard n-round agreement protocol [16] to guarantee that all cells operate on a common liveset.

The agreed-upon liveset information is used in the second recovery step to "unwedge" the communication system, which needs to be functional for subsequent recovery actions. In this step, any pending RPCs or messages to failed cells are aborted; subsequent attempts to communicate with a failed cell will immediately return an error.

The final recovery step determines which virtual machines had essential dependencies on the failed cells and terminates those virtual machines. Memory dependencies are determined by scanning all machine memory pages and checking for incoherent cache lines; the hardware provides a mechanism to perform this check. Using the memmap data structure, bad machine memory pages are translated back to the physical memory pages that map to them, and then to the virtual machines owning those physical pages. A tunable recovery policy parameter determines whether a virtual machine that uses a bad memory page will be immediately terminated or will be allowed to continue running until it tries to access an incoherent cache line. I/O device dependencies are treated similarly to memory dependencies.
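To make this last step concrete, the sketch below shows in C how a post-fault scan might walk machine memory and use the memmap to condemn the affected virtual machines. The memmap layout, the firmware query, and the policy flag are assumptions made for illustration; the real monitor translates bad machine pages through the owning physical pages as described above.

    /* Illustrative post-fault scan for incoherent cache lines. */
    #include <stdbool.h>

    #define NUM_MPAGES (1u << 18)        /* assumed number of machine pages */
    #define MAX_VMS    64

    struct memmap_entry {
        int owner_vm;                    /* -1 if the machine page is free  */
    };

    static struct memmap_entry memmap[NUM_MPAGES];
    static bool vm_condemned[MAX_VMS];

    /* Stand-in for the firmware check for incoherent cache lines. */
    static bool page_has_incoherent_lines(unsigned long mpage)
    { (void)mpage; return false; }

    /* eager_policy selects between terminating an affected virtual machine
     * immediately and letting it run until it touches an incoherent line.  */
    static void scan_memory_after_fault(bool eager_policy)
    {
        for (unsigned long mpage = 0; mpage < NUM_MPAGES; mpage++) {
            if (!page_has_incoherent_lines(mpage))
                continue;
            int vm = memmap[mpage].owner_vm;
            if (vm < 0)
                continue;                /* free page: nothing depends on it */
            if (eager_policy)
                vm_condemned[vm] = true; /* terminate the VM right away      */
            /* Otherwise the page is simply left marked bad; the VM is only
             * terminated if it ever faults on one of its incoherent lines.  */
        }
    }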
The experimental setup used throughout the rest of this paper could not be used for testing the Cellular Disco fault recovery support, since the necessary hardware fault containment support required by Cellular Disco is not implemented in the Origin 2000 multiprocessor, and since in the piggybacking solution of Section 3.1 the host operating system represents a single point of failure. Fortunately, Cellular Disco was originally designed to run on the FLASH multiprocessor [13], for which the hardware fault containment support described in [26] was designed. When running on FLASH, Cellular Disco can fully exploit the machine's hardware fault containment capabilities. The main difference between FLASH and the Origin 2000 is the use in FLASH of a programmable node controller called MAGIC. Most of the hardware fault containment support in FLASH is implemented using MAGIC firmware.

We tested the hardware fault recovery support in Cellular Disco by using a simulation setup that allowed us to perform a large number of fault injection experiments. We did not use the FLASH hardware because the current FLASH prototype only has four nodes and because injecting multiple controlled faults is extremely difficult and time consuming on real hardware. The SimOS [19] and FlashLite [13] simulators provide enough detail to accurately observe the behavior of the hardware fault containment support and of the system software after injecting any of a number of common hardware faults into the simulated FLASH system.

Figure 10 shows the setup used in our fault injection experiments. We simulated an 8-node FLASH system running Cellular Disco. The size of the Cellular Disco cells was chosen to be one node, the same as that of the FLASH hardware fault containment units. We ran 8 virtual machines, each with essential dependencies on two different cells. Each virtual machine executed a parallel compile of a subset of the GnuChess source files.

Figure 10. Experimental setup used for the fault-containment experiments shown in Table 11. Each virtual machine has essential dependencies on two Cellular Disco cells. The fault injection experiments were performed on a detailed simulation of the FLASH multiprocessor [13].

On the configuration shown in Figure 10 we performed the fault injection experiments described in Table 11. After injecting a hardware fault, we allowed the FLASH hardware recovery and the Cellular Disco recovery to execute, and ran the surviving virtual machines until their workloads completed. We then checked the results of the workloads by comparing the checksums of the generated object files with the ones obtained from a reference run. An experiment was deemed successful if exactly one Cellular Disco cell and the two virtual machines with dependencies on that cell were lost after the fault, and if the surviving six virtual machines produced the correct results. Table 11 shows that the Cellular Disco hardware fault recovery support was 100% effective in 1000 experiments that covered router, interconnect link, node, and MAGIC firmware failures.

Simulated hardware fault          Number of experiments   Success Rate
Node power supply failure         250                     100%
Router power supply failure       250                     100%
Link cable or connector failure   250                     100%
MAGIC firmware failure            250                     100%

Table 11. For all the fault injection experiments shown, the simulated system recovered and produced correct results.

In order to evaluate the performance impact of a fault on the surviving virtual machines, we measured the recovery times in a number of additional experiments.
Figure 12 shows how the recovery time varies with the number of nodes in the system and the amount of memory per node. The figure shows that the total recovery time is small (less than half a second) for all tested hardware configurations. While the recovery time only shows a modest increase with the number of nodes in the system, there is a steep increase with the amount of memory per node. For large memory configurations, most of the time is spent in two places. First, to determine the status of cache lines after a failure, the hardware fault containment support must scan all node coherence directories. Second, Cellular Disco uses MAGIC firmware support to determine which machine memory pages contain inaccessible or incoherent cache lines. Both of these operations involve expensive directory scanning operations that are implemented using MAGIC firmware. The cost of these operations could be substantially reduced in a machine with a hardwired node controller.

Figure 12. Fault-recovery times shown as a function of the number of nodes in the system and the amount of memory per node. The total time includes both hardware and Cellular Disco recovery. (The node-count curves are measured at 16 MB/node; the memory-per-node curves are measured with 8 nodes.)

7 Comparison to other approaches

In the previous sections we have shown that Cellular Disco combines the features of both hardware partitioning and traditional shared-memory multiprocessors. In this section we compare the performance of our system against both hardware partitioning and traditional operating system-centric approaches. The hardware partitioning approach divides a large-scale machine into a set of small-scale machines and a separate operating system is booted on each one, similar to a cluster of small machines with a fast interconnect. This approach is also similar to Cellular Disco without inter-cell resource sharing. In fact, because IRIX 6.2 does not run on the SGI Origin, we evaluated the performance of this approach using Cellular Disco without inter-cell sharing. We used IRIX 6.4 as the representative of operating system-centric approaches.

Small applications that fit inside a single hardware partition run equally well on all three systems, except for the small virtualization overheads of Cellular Disco. Large resource-intensive applications that don't fit inside a single partition, however, can experience significant slowdown when running on a partitioned system due to the lack of resource sharing. In this section we evaluate all three systems using such a resource-intensive workload to demonstrate the need for resource sharing.

For our comparison, we use a workload consisting of a mix of applications that resembles the way large-scale machines are used in practice: we combine an 8-process Database workload with a 16-process Raytrace run. By dividing the 32-processor Origin system into 4 cells (each with 8 processors), we obtain a configuration in which there is neither enough memory in any single cell to satisfy Database, nor enough CPUs in any cell to satisfy Raytrace. Because the hardware partitioning approach cannot automatically balance the load, we explicitly placed the two applications on different partitions. In all three cases, we started both applications at the same time, and measured the time it took them to finish, along with the overall CPU utilization. Table 13 summarizes the results of our experimental comparison.

Approach                 Raytrace   Database   CPU util.
Operating system         216 s      231 s      55%
Virtual cluster          221 s      229 s      58%
Hardware partitioning    434 s      325 s      31%

Table 13. Comparison of our virtual cluster approach to operating system- and hardware-centric approaches using a combination of Raytrace and Database applications. We measured the wall clock time for each application and the overall CPU utilization.
As expected, the performance of our virtual cluster solution is very close to that of the operating system-centric approach, as both applications are able to access as many resources as they need. Also as expected, the hardware partitioning approach suffers serious performance degradation due to the lack of resource sharing.

The hardware partitioning and cluster approaches typically avoid such serious problems by allocating enough resources in each partition to meet the expected peak demand; for example, the database partition would have been allocated more memory and the raytrace partition more processors. However, during normal operation this configuration wastes resources, and prevents efficient resource utilization because a raytrace workload will not perform well on the partition configured for databases and, similarly, a database workload will not perform well on the partition configured for raytrace.

8 Related work

In this section we compare Cellular Disco to other projects that have some similarities to our work: virtual machines, hardware partitioning, operating system based approaches, fault containment, and resource load balancing.

8.1 Virtual machines

Virtual machines are not a new idea: numerous research projects in the 1970s [9], as well as commercial product offerings [5][20], attest to the popularity of this concept in its heyday. The VAX VMM Security Kernel [12] used virtual machines to build a compatible secure system at a low development cost. While Cellular Disco shares some of the fundamental framework and techniques of these virtual machine monitors, it is quite different in that it adapts the virtual machine concept to address new challenges posed by modern scalable shared-memory servers.

Disco [2] first proposed using virtual machines to provide scalability and to hide some of the characteristics of the underlying hardware from NUMA-unaware operating systems. Compared to Disco, Cellular Disco provides a complete solution for large-scale machines by extending the Disco approach with the following novel aspects: the use of a virtual machine monitor for supporting hardware fault containment; the development of both NUMA- and fault containment-aware scalable resource balancing and overcommitment policies; and the development of mechanisms to support those policies. We have also evaluated our approach on real hardware using long-running realistic workloads that more closely resemble the way large machines are currently used.

8.2 Hardware-centric approaches

Hardware partitioning has been proposed as a way to solve the system software issues for large-scale shared-memory machines. Some of the systems that support partitioning are Sequent's Application Region Manager [21], Sun Microsystems' Dynamic System Domains [25], and Unisys' Cellular MultiProcessing (CMP) architecture [28].
The benefits of this approach are that it requires only very small operating system changes, and that it provides limited fault isolation between partitions [25][28]. The major drawback of partitioning is that it lacks resource sharing, effectively turning a large and expensive machine into a cluster of smaller systems that happen to share a fast network. As shown in Section 7, the lack of resource sharing can lead to serious performance degradation.

To alleviate the resource sharing problems of static partitioning, dynamic partitioning schemes have been proposed that allow a limited redistribution of resources (CPUs and memory) across partitions [4][25][28]. Unfortunately, repartitioning is usually a very heavyweight operation requiring extensive hardware and operating system support. An additional drawback is that even though whole nodes can be dynamically reassigned to a different partition, the resources within a node cannot be multiplexed at a fine granularity between two partitions.

8.3 Software-centric approaches

Attempts to provide support for large-scale multiprocessors in the operating system can be divided into two strategies: tuning an existing SMP operating system to make it scale to tens or hundreds of processors, and developing new operating systems with better scalability characteristics.

The advantage of adapting an existing operating system is backwards compatibility and the benefit of an existing sizable code base, as illustrated by SGI's IRIX 6.4 and IRIX 6.5 operating systems. Unfortunately, such an overhaul usually requires a significant software development effort. Furthermore, adding support for fault containment is a daunting task in practice, since the base operating system is inherently vulnerable to faults.

New operating system developments have been proposed to address the requirements of scalability (Tornado [8] and K42 [10]) and fault containment (Hive [3]). While these approaches tackle the problem at the basic level, they require a very significant development time and cost before reaching commercial maturity. Compared to these approaches, Cellular Disco is about two orders of magnitude simpler, while providing almost the same performance.

8.4 Fault containment

While a considerable amount of work has been done on fault tolerance, this technique does not seem to be very attractive for large-scale shared-memory machines, due to the increase in cost and to the fact that it does not defend well against operating system failures. An alternative approach that has been proposed is fault containment, a design technique that can limit the impact of a fault to a small fraction of the system. Fault containment support in the operating system has been explored in the Hive project [3], while the necessary hardware and firmware support has been implemented in the FLASH multiprocessor [13]. Cellular Disco requires the presence of hardware fault containment support such as that described in [26], and is thus complementary. Hive and Cellular Disco are two attempts to provide fault containment support in the system software; the main advantage of Cellular Disco is its extreme simplicity when compared to Hive. Our approach is the first practical demonstration that end-to-end hardware fault containment can be provided at a realistic cost in terms of implementation effort.
Cellular Disco also shows that if the basic system software layer can be trusted, fault containment does not add any performance overhead.

8.5 Load balancing

CPU and memory load balancing have been studied extensively in the context of networks of workstations, but not on single shared-memory systems. Traditional approaches to process migration [17] that require support in the operating system are too complex and fragile, and very few have made it into the commercial world so far. Cellular Disco provides a much simpler approach to migration that does not require any support in the operating system, while offering the flexibility of migrating at the granularity of individual CPUs or memory pages.

Research projects such as GMS [6] have investigated using remote memory in the context of clusters of machines, where remote memory is used as a fast cache for VM pages and file system buffers. Cellular Disco can directly use the hardware support for shared memory, thus allowing substantially more flexibility.

9 Conclusions

With a size often exceeding a few million lines of code, current commercial operating systems have grown too large to adapt quickly to the new features that have been introduced in hardware. Off-the-shelf operating systems currently suffer from poor scalability, lack of fault containment, and poor resource management for large systems. This lack of good support for large-scale shared-memory multiprocessors stems from the tremendous difficulty of adapting the system software to the new hardware requirements.

Instead of modifying the operating system, our approach inserts a software layer between the hardware and the operating system. By applying an old idea in a new context, we show that our virtual machine monitor (called Cellular Disco) is able to supplement the functionality provided by the operating system and to provide new features. In this paper, we argue that Cellular Disco is a viable approach for providing scalability, scalable resource management, and fault containment for large-scale shared-memory systems at only a small fraction of the development cost required for changing the operating system. Cellular Disco effectively turns those large machines into "virtual clusters" by combining the benefits of clusters and those of shared-memory systems.

Our prototype implementation of Cellular Disco on a 32-processor SGI Origin 2000 system shows that the virtualization overhead can be kept below 10% for many practical workloads, while providing effective resource management and fault containment. Cellular Disco is the first demonstration that end-to-end fault containment can be achieved in practice with a reasonable implementation effort. Although the results presented in this paper are based on virtualizing the MIPS processor architecture and on running the IRIX operating system, our approach can be extended to other processor architectures and operating systems. A straightforward extension of Cellular Disco could support the simultaneous execution on a scalable machine of several operating systems, such as a combination of Windows NT, Linux, and UNIX.

Some of the remaining problems that have been left open by our work so far include efficient virtualization of low-latency I/O devices (such as fast network interfaces), system management issues, and checkpointing and cloning of whole virtual machines.

Acknowledgments

We would like to thank SGI for kindly providing us access to a 32-processor Origin 2000 machine for our experiments, and to the IRIX 5.3, IRIX 6.2 and IRIX 6.4 source code.
The experiments in this paper would not have been possible without the invaluable help we received from John Keen and Simon Patience.

The FLASH and Hive teams built most of the infrastructure needed for this paper, and provided an incredibly stimulating environment for this work. Our special thanks go to the Disco, SimOS, and FlashLite developers whose work has enabled the development of Cellular Disco and the fault injection experiments presented in the paper.

This study is part of the Stanford FLASH project, funded by DARPA grant DABT63-94-C-0054.

References

[1] James M. Barton and Nawaf Bitar. A Scalable Multi-Discipline, Multiple-Processor Scheduling Framework for IRIX. Lecture Notes in Computer Science, 949, pp. 45-69. 1995.
[2] Edouard Bugnion, Scott Devine, Kinshuk Govil, and Mendel Rosenblum. Disco: Running Commodity Operating Systems on Scalable Multiprocessors. ACM Transactions on Computer Systems (TOCS), 15(4), pp. 412-447. November 1997.
[3] John Chapin, Mendel Rosenblum, Scott Devine, Tirthankar Lahiri, Dan Teodosiu, and Anoop Gupta. Hive: Fault Containment for Shared-Memory Multiprocessors. In Proceedings of the 15th Symposium on Operating Systems Principles (SOSP), pp. 12-25. December 1995.
[4] Compaq Computer Corporation. OpenVMS Galaxy. http://www.openvms.digital.com/availability/galaxy.html. Accessed October 1999.
[5] R. J. Creasy. The Origin of the VM/370 Time-Sharing System. IBM J. Res. Develop., 25(5), pp. 483-490. 1981.
[6] Michael Feeley, William Morgan, Frederic Pighin, Anna Karlin, Henry Levy, and Chandramohan Thekkath. Implementing Global Memory Management in a Workstation Cluster. In Proceedings of the 15th Symposium on Operating Systems Principles (SOSP), pp. 201-212. December 1995.
[7] Mike Galles and Eric Williams. Performance Optimizations, Implementation, and Verification of the SGI Challenge Multiprocessor. In Proceedings of the 27th Hawaii International Conference on System Sciences, Volume 1: Architecture, pp. 134-143. January 1994.
[8] Ben Gamsa, Orran Krieger, Jonathan Appavoo, and Michael Stumm. Tornado: Maximizing Locality and Concurrency in a Shared Memory Multiprocessor Operating System. In Proceedings of the 3rd Symposium on Operating Systems Design and Implementation (OSDI), pp. 87-100. February 1999.
[9] Robert P. Goldberg. Survey of Virtual Machine Research. IEEE Computer Magazine, 7(6), pp. 34-45. June 1974.
[10] IBM Corporation. The K42 Project. http://www.research.ibm.com/K42/index.html. Accessed October 1999.
[11] Gerry Kane and Joe Heinrich. MIPS RISC Architecture. Prentice Hall, Englewood Cliffs, NJ. 1992.
[12] Paul Karger, Mary Zurko, Douglas Bonin, Andrew Mason, and Clifford Kahn. A Retrospective on the VAX VMM Security Kernel. IEEE Transactions on Software Engineering, 17(11), pp. 1147-1165. November 1991.
[13] Jeffrey Kuskin, David Ofelt, Mark Heinrich, John Heinlein, Richard Simoni, Kourosh Gharachorloo, John Chapin, David Nakahira, Joel Baxter, Mark Horowitz, Anoop Gupta, Mendel Rosenblum, and John Hennessy. The Stanford FLASH Multiprocessor. In Proceedings of the 21st International Symposium on Computer Architecture (ISCA), pp. 302-313. April 1994.
[14] Jim Laudon and Daniel Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. In Proceedings of the 24th International Symposium on Computer Architecture (ISCA), pp. 241-251. June 1997.
[15] H. M. Levy and P. H. Lipman. Virtual Memory Management in the VAX/VMS Operating System. IEEE Computer, 15(3), pp. 35-41. March 1982.
[16] Nancy Lynch. Distributed Algorithms. Morgan Kaufmann Publishers, San Francisco, CA. 1996.
[17] Dejan S. Milojicic, Fred Douglis, Yves Paindaveine, Richard Wheeler, and Songnian Zhou. Process Migration. TOG Research Institute Technical Report. December 1996.
[18] Rashid, R.F., et al. Machine-Independent Virtual Memory Management for Paged Uniprocessor and Multiprocessor Architectures. IEEE Transactions on Computers, 37(8), pp. 896-908. August 1988.
[19] Mendel Rosenblum, Edouard Bugnion, Scott Devine, and Steve Herrod. Using the SimOS Machine Simulator to Study Complex Computer Systems. ACM Transactions on Modeling and Computer Simulation (TOMACS), 7(1), pp. 78-103. January 1997.
[20] Seawright, L.H., and MacKinnon, R.A. VM/370: A Study of Multiplicity and Usefulness. IBM Systems Journal, 18(1), pp. 4-17. 1979.
[21] Sequent Computer Systems, Inc. Sequent's Application Region Manager. http://www.sequent.com/dcsolutions/agile_wp1.html. Accessed October 1999.
[22] SGI Inc. IRIX 6.5. http://www.sgi.com/software/irix6.5. Accessed October 1999.
[23] Standard Performance Evaluation Corporation. SPECweb96 Benchmark. http://www.spec.org/osg/web96. Accessed October 1999.
[24] Vijayaraghavan Soundararajan, Mark Heinrich, Ben Verghese, Kourosh Gharachorloo, Anoop Gupta, and John Hennessy. Flexible Use of Memory for Replication/Migration in Cache-Coherent DSM Multiprocessors. In Proceedings of the 25th International Symposium on Computer Architecture (ISCA), pp. 342-355. June 1998.
[25] Sun Microsystems, Inc. Sun Enterprise 10000 Server: Dynamic System Domains. http://www.sun.com/servers/highend/10000/Tour/domains.html. Accessed October 1999.
[26] Dan Teodosiu, Joel Baxter, Kinshuk Govil, John Chapin, Mendel Rosenblum, and Mark Horowitz. Hardware Fault Containment in Scalable Shared-Memory Multiprocessors. In Proceedings of the 24th International Symposium on Computer Architecture (ISCA), pp. 73-84. June 1997.
[27] Transaction Processing Performance Council. TPC Benchmark D (Decision Support) Standard Specification. TPC, San Jose, CA. June 1997.
[28] Unisys Corporation. Cellular MultiProcessing: Breakthrough Architecture for an Open Mainframe. http://www.marketplace.unisys.com/ent/cmp.html. Accessed October 1999.
[29] Ben Verghese, Scott Devine, Anoop Gupta, and Mendel Rosenblum. Operating System Support for Improving Data Locality on CC-NUMA Compute Servers. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 279-289. October 1996.
[30] VMWare. Virtual Platform. http://www.vmware.com/products/virtualplatform.html. Accessed October 1999.
[31] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. In Proceedings of the 22nd Annual International Symposium on Computer Architecture (ISCA), pp. 24-36. May 1995.