Slide1
CS194-24: Advanced Operating Systems Structures and Implementation
Lecture 17: Device Drivers
April 8th, 2013
Prof. John Kubiatowicz
http://inst.eecs.berkeley.edu/~cs194-24
Slide2
Goals for Today
  SLAB allocator
  Devices and Device Drivers
Interactive is important! Ask Questions!
Note: Some slides and/or pictures in the following are adapted from slides © 2013
Slide3
Review: Clock Algorithm (Not Recently Used)
[Figure: set of all pages in memory, swept by a single clock hand]
Single Clock Hand:
  Advances only on page fault!
  Check for pages not used recently
  Mark pages as not used recently
What if hand moving slowly?
  Good sign or bad sign?
  Not many page faults and/or find page quickly
What if hand is moving quickly?
  Lots of page faults and/or lots of reference bits set
One way to view clock algorithm:
  Crude partitioning of pages into two groups: young and old
  Why not partition into more than 2 groups?
Slide4
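As a rough illustration of the clock algorithm reviewed on the previous slide, the following user-space sketch keeps one "use" bit per frame and advances the hand only on a page fault. The names `clock_state`, `clock_access`, and the frame count `NFRAMES` are hypothetical, and the use bit is set in software here, whereas real hardware would set it on each reference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NFRAMES 4            /* hypothetical number of physical frames */

struct frame {
    int  page;               /* virtual page held here, -1 if empty */
    bool used;               /* software copy of the hardware "use" bit */
};

struct clock_state {
    struct frame frames[NFRAMES];
    size_t hand;             /* index the clock hand currently points at */
};

void clock_init(struct clock_state *c)
{
    for (size_t i = 0; i < NFRAMES; i++) {
        c->frames[i].page = -1;
        c->frames[i].used = false;
    }
    c->hand = 0;
}

/* Reference a page. Returns true on a page fault, false on a hit. */
bool clock_access(struct clock_state *c, int page)
{
    for (size_t i = 0; i < NFRAMES; i++) {
        if (c->frames[i].page == page) {
            c->frames[i].used = true;       /* hit: mark as recently used */
            return false;
        }
    }
    /* Fault: sweep forward, clearing use bits, until an unused frame is found. */
    while (c->frames[c->hand].used) {
        c->frames[c->hand].used = false;    /* mark as not used recently */
        c->hand = (c->hand + 1) % NFRAMES;
    }
    c->frames[c->hand].page = page;         /* replace the victim */
    c->frames[c->hand].used = true;
    c->hand = (c->hand + 1) % NFRAMES;
    return true;
}
```

When every frame has its use bit set, the hand makes a full sweep before it finds a victim, which is the "hand moving quickly" case from the slide.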
Review: Second-Chance List Algorithm (VAX/VMS)
Split memory in two: Active list (RW), SC list (Invalid)
Access pages in Active list at full speed
Otherwise, Page Fault
  Always move overflow page from end of Active list to front of Second-chance list (SC) and mark invalid
  Desired Page On SC List: move to front of Active list, mark RW
  Not on SC list: page in to front of Active list, mark RW; page out LRU victim at end of SC list
[Figure: Directly Mapped Pages (marked RW, FIFO list) and Second Chance List (marked Invalid, LRU list); arrows labeled Page-in from disk, New Active Pages, Access, New SC Victims, Overflow, and LRU victim]
Slide5
Free List
Keep set of free pages ready for use in demand paging
  Freelist filled in background by Clock algorithm or other technique (“Pageout demon”)
  Dirty pages start copying back to disk when they enter the list
Like VAX second-chance list
  If page needed before reused, just return to active set
Advantage: Faster for page fault
  Can always use page (or pages) immediately on fault
Single Clock Hand: Advances as needed to keep freelist full (“background”)
[Figure: set of all pages in memory swept by a single clock hand, feeding a list of free pages for processes; some entries on the list are marked D]
Slide6
Reverse Page Mapping (Sometimes called “Coremap”)
Physical page frames often shared by many different address spaces/page tables
  All children forked from given process
  Shared memory pages between processes
Whatever reverse mapping mechanism that is in place must be very fast
  Must hunt down all page tables pointing at given page frame when freeing a page
  Must hunt down all PTEs when seeing if pages “active”
Implementation options:
  For every page descriptor, keep linked list of page table entries that point to it
    Management nightmare, expensive
  Linux 2.6: Object-based reverse mapping
    Link together memory region descriptors instead (much coarser granularity)
Slide7
What Actually Happens in Linux?
Memory management in Linux considerably more complex than the previous indications
Memory Zones: physical memory categories
  ZONE_DMA: < 16MB memory, DMAable on ISA bus
  ZONE_NORMAL: 16MB to 896MB (mapped at 0xC0000000)
  ZONE_HIGHMEM: Everything else (> 896MB)
Each zone has 1 freelist, 2 LRU lists (Active/Inactive)
Many different types of allocation
  SLAB allocators, per-page allocators, mapped/unmapped
Many different types of allocated memory:
  Anonymous memory (not backed by a file, heap/stack)
  Mapped memory (backed by a file)
Allocation priorities
  Is blocking allowed/etc
Slide8
Linux Virtual memory map
32-Bit Virtual Address Space:
  0x00000000 to 0xC0000000: User Addresses (3GB Total)
  0xC0000000 to 0xFFFFFFFF: Kernel Addresses (1GB; 896MB Physical)
64-Bit Virtual Address Space:
  0x0000000000000000 to 0x00007FFFFFFFFFFF: User Addresses (128TiB)
  Empty Space (“Canonical Hole”)
  0xFFFF800000000000 to 0xFFFFFFFFFFFFFFFF: Kernel Addresses (128TiB; 64TiB Physical)
Slide9
Virtual Map (Details)
Kernel memory not generally visible to user
  Exception: special VDSO facility that maps kernel code into user space to aid in system calls (and to provide certain actual system calls such as gettimeofday())
Every physical page described by a “page” structure
  Collected together in lower physical memory
  Can be accessed in kernel virtual space
  Linked together in various “LRU” lists
For 32-bit virtual memory architectures:
  When physical memory < 896MB
    All physical memory mapped at 0xC0000000
  When physical memory >= 896MB
    Not all physical memory mapped in kernel space all the time
    Can be temporarily mapped with addresses > 0xCC000000
For 64-bit virtual memory architectures:
  All physical memory mapped above 0xFFFF800000000000
Slide10
Allocating Memory
One mechanism for requesting pages; everything else is built on top of this mechanism:
  Allocate contiguous group of 2^order pages given the specified mask:
    struct page *alloc_pages(gfp_t gfp_mask, unsigned int order)
  Allocate one page:
    struct page *alloc_page(gfp_t gfp_mask)
  Convert page to logical address (assuming mapped):
    void *page_address(struct page *page)
Also routines for freeing pages
Zone allocator uses “buddy” allocator that tries to keep memory unfragmented
Allocation routines pick from proper zone, given flags
Slide11
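To make the `order` argument and the "buddy" idea from the previous slide concrete, here is a small user-space sketch. It assumes a 4096-byte page, and the helper names `order_for_size` and `buddy_of` are hypothetical illustrations rather than kernel functions.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_BYTES 4096u   /* assumed page size for this sketch */

/* Smallest order such that 2^order pages can hold `bytes` bytes. */
unsigned int order_for_size(size_t bytes)
{
    size_t pages = (bytes + PAGE_SIZE_BYTES - 1) / PAGE_SIZE_BYTES;
    unsigned int order = 0;

    while (((size_t)1 << order) < pages)
        order++;
    return order;
}

/*
 * Page-frame number of the buddy of the 2^order-page block that starts
 * at `pfn`. Assumes `pfn` is aligned to 2^order pages.
 */
size_t buddy_of(size_t pfn, unsigned int order)
{
    return pfn ^ ((size_t)1 << order);
}
```

Because a block and its buddy differ in exactly one address bit, a freed block can find its neighbor cheaply and merge with it into a block of the next order, which is how a buddy scheme tries to limit fragmentation.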
Allocation flags
Possible allocation type flags:
  GFP_ATOMIC: Allocation is high-priority and must never sleep. Use in interrupt handlers, top halves, while holding locks, or at other times when the caller cannot sleep
  GFP_NOWAIT: Like GFP_ATOMIC, except call will not fall back on emergency memory pools. Increases likelihood of failure
  GFP_NOIO: Allocation can block but must not initiate disk I/O
  GFP_NOFS: Can block, and can initiate disk I/O, but will not initiate filesystem ops
  GFP_KERNEL: Normal allocation, might block. Use in process context when safe to sleep. This should be default choice
  GFP_USER: Normal allocation for processes
  GFP_HIGHMEM: Allocation from ZONE_HIGHMEM
  GFP_DMA: Allocation from ZONE_DMA. Use in combination with a previous flag
Slide12
Page Frame Reclaiming Algorithm (PFRA)
Several entrypoints:
  Low on Memory Reclaiming: The kernel detects a “low on memory” condition
  Hibernation reclaiming: The kernel must free memory because it is entering the suspend-to-disk state
  Periodic reclaiming: A kernel thread is activated periodically to perform memory reclaiming, if necessary
Low on Memory reclaiming:
  Start flushing out dirty pages to disk
  Start looping over all memory nodes in the system
    try_to_free_pages()
    shrink_slab()
    pdflush kernel thread writing out dirty pages
Periodic reclaiming:
  Kswapd kernel threads: check if number of free page frames in some zone has fallen below pages_high watermark
  Each zone keeps two LRU lists: Active and Inactive
    Each page has a last-chance algorithm with 2 count
    Pages on the active list are moved to the inactive list when they have been idle for two cycles through the list
    Pages reclaimed from Inactive list
Slide13
SLAB Allocator
Replacement for free-lists that are hand-coded by users
  Consolidation of all of this code under kernel control
  Efficient when objects allocated and freed frequently
Objects segregated into “caches”
  Each cache stores different type of object
  Data inside cache divided into “slabs”, which are contiguous groups of pages (often only 1 page)
  Key idea: avoid memory fragmentation
[Figure: a cache containing two slabs that hold objects Obj 1 through Obj 5]
Slide14
SLAB Allocator Details
Based on algorithm first introduced for SunOS
  Observation: amount of time required to initialize a regular object in the kernel exceeds the amount of time required to allocate and deallocate it
  Revolves around object caching
    Allocate once, keep reusing objects
Avoids memory fragmentation:
  Caching of similarly sized objects, avoid fragmentation
  Similar to custom freelist per object
Reuse of allocation
  When new object first allocated, constructor runs
  On subsequent free/reallocation, constructor does not need to be reexecuted
Slide15
SLAB Allocator: Cache Construction
Creation of new Caches:
  struct kmem_cache *kmem_cache_create(const char *name, size_t size, size_t align, unsigned long flags, void (*ctor)(void *));
    name: name of cache
    size: size of each element in the cache
    align: alignment for each object (often 0)
    flags: possible flags about allocation
      SLAB_HWCACHE_ALIGN: Align objects to cache lines
      SLAB_POISON: Fill slabs to known value (0xa5a5a5a5) in order to catch use of uninitialized memory
      SLAB_RED_ZONE: Insert empty zones around objects to help detect buffer overruns
      SLAB_PANIC: Allocation layer panics if allocation fails
      SLAB_CACHE_DMA: Allocations from DMA-able memory
      SLAB_NOTRACK: don’t track uninitialized memory
    ctor: called whenever new pages are added to the cache
Slide16
SLAB Allocator: Cache Use
Example:
    task_struct_cachep = kmem_cache_create("task_struct", sizeof(struct task_struct), ARCH_MIN_TASKALIGN, SLAB_PANIC | SLAB_NOTRACK, NULL);
Use of example:
    struct task_struct *tsk;
    tsk = kmem_cache_alloc(task_struct_cachep, GFP_KERNEL);
    if (!tsk)
        return NULL;
    kmem_cache_free(task_struct_cachep, tsk);
Slide17
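The object-caching idea behind the SLAB slides (construct once, then keep reusing) can be sketched in user space as follows. This is not the kernel implementation: it assumes a single slab of `OBJS_PER_SLAB` objects obtained with `malloc`, and every name (`obj_cache`, `obj_cache_alloc`, `counting_ctor`, and so on) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define OBJS_PER_SLAB 8                  /* hypothetical slab capacity */

struct obj_cache {
    size_t obj_size;                     /* size of each object in bytes */
    unsigned char *slab;                 /* one contiguous slab of objects */
    void *free_stack[OBJS_PER_SLAB];     /* constructed objects ready for reuse */
    size_t nfree;
};

/* Create a cache with one slab; the constructor runs once per object, here only. */
struct obj_cache *obj_cache_create(size_t obj_size, void (*ctor)(void *))
{
    struct obj_cache *c;

    if (obj_size == 0)
        return NULL;
    c = malloc(sizeof(*c));
    if (!c)
        return NULL;
    c->slab = malloc(obj_size * OBJS_PER_SLAB);
    if (!c->slab) {
        free(c);
        return NULL;
    }
    c->obj_size = obj_size;
    c->nfree = 0;
    /* obj_size is assumed to be a sizeof() value, so each slot stays aligned. */
    for (size_t i = 0; i < OBJS_PER_SLAB; i++) {
        void *obj = c->slab + i * obj_size;
        if (ctor)
            ctor(obj);
        c->free_stack[c->nfree++] = obj;
    }
    return c;
}

/* Hand out an already-constructed object, or NULL if the slab is exhausted. */
void *obj_cache_alloc(struct obj_cache *c)
{
    return c->nfree ? c->free_stack[--c->nfree] : NULL;
}

/* Return an object to the cache; it stays constructed for the next user. */
void obj_cache_free(struct obj_cache *c, void *obj)
{
    assert(c->nfree < OBJS_PER_SLAB);
    c->free_stack[c->nfree++] = obj;
}

void obj_cache_destroy(struct obj_cache *c)
{
    free(c->slab);
    free(c);
}

/* Example constructor for experiments: counts how many times it has run. */
int ctor_calls;
void counting_ctor(void *obj)
{
    *(int *)obj = 0;
    ctor_calls++;
}
```

Creating a cache with `obj_cache_create(sizeof(int), counting_ctor)` and then repeatedly allocating and freeing leaves `ctor_calls` unchanged after creation, which is the saving the slides describe. The free objects are tracked in a side array rather than linked through the objects themselves so that their constructed state is not overwritten.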
SLAB Allocator Details (Con’t)
Caches can be later destroyed with:
    int kmem_cache_destroy(struct kmem_cache *cachep);
  Assuming that all objects freed
  No one ever tries to use cache again
All caches kept in global list
  Including global caches set up with objects of powers of 2 from 2^5 to 2^17
  General kernel allocation (kmalloc/kfree) uses least-fit for requested cache size
Reclamation of memory
  Caches keep sorted list of empty, partial, and full slabs
    Easy to manage: slab metadata contains reference count
    Objects within slabs linked together
  Ask individual caches for full slabs for reclamation
Slide18
Alternatives for allocation
A number of options in the kernel for object allocation:
  SLAB: original allocator based on Bonwick’s paper from SunOS
  SLUB: Newer allocator with same interface but better use of metadata
    Keeps SLAB metadata in the page data structure (for pages that happen to be in kernel caches)
    Debugging options compiled in by default, just need to be enabled
  SLOB: low-memory footprint allocator for embedded systems
Slide19
Kernel Device Structure
[Figure: kernel subsystems beneath The System Call Interface]
  Process Management: concurrency, multitasking; architecture dependent code
  Memory Management: virtual memory; memory manager
  Filesystems: files and dirs (the VFS); file system types, block devices
  Device Control: TTYs and device access; device control
  Networking: connectivity; network subsystem, IF drivers
Slide20
Modern I/O Systems
Slide21
Virtual Bus Architecture
[Figure: bus hierarchy with CPU, RAM, Memory Bus, PCI #0, PCI Bridge, PCI #1, PCI Slots, USB Controller (Root Hub, Hub, Webcam, Mouse, Keyboard), and SCSI Controller (Scanner, Hard Disk, CD ROM)]
Slide22
SandyBridge I/O: PCH
Platform Controller Hub
  Used to be “SouthBridge,” but no “NorthBridge” now
  Connected to processor with proprietary bus
    Direct Media Interface
  Code name “Cougar Point” for SandyBridge processors
Types of I/O on PCH:
  USB
  Ethernet
  Audio
  BIOS support
  More PCI Express (lower speed than on Processor)
  SATA (for Disks)
[Figure: SandyBridge System Configuration]
Slide23
Why Device Drivers?
Many different devices, many different properties
  Devices different even within class (e.g. network card)
    DMA vs Programmed I/O
    Processing every packet through device driver vs setup of packet filters in hardware to sort packets automatically
    Interrupts vs Polling
    On-device buffer management, framing options (e.g. jumbo frames), error management, …
    Authentication mechanism, etc
Provide standardized interfaces to computer users
  Socket interface with TCP/IP
Factor portion of codebase specific to a given device
  Device manufacturer can hide complexities of their device behind standard kernel interfaces
  Also: Device manufacturer can fix quirks/bugs in their device by providing new driver
Slide24
The Goal of the I/O Subsystem
Provide Uniform Interfaces, Despite Wide Range of Different Devices
This code works on many different devices:
    FILE *fd = fopen("/dev/something", "r+");
    for (int i = 0; i < 10; i++) {
        fprintf(fd, "Count %d\n", i);
    }
    fclose(fd);
Why? Because code that controls devices (“device driver”) implements standard interface.
We will try to get a flavor for what is involved in actually controlling devices in rest of lecture
  Can only scratch surface!
Slide25
Want Standard Interfaces to Devices
Block Devices: e.g. disk drives, tape drives, DVD-ROM
  Access blocks of data
  Commands include open(), read(), write(), seek()
  Raw I/O or file-system access
  Memory-mapped file access possible
Character Devices: e.g. keyboards, mice, serial ports, some USB devices
  Single characters at a time
  Commands include get(), put()
  Libraries layered on top allow line editing
Network Devices: e.g. Ethernet, Wireless, Bluetooth
  Different enough from block/character to have own interface
  Unix and Windows include socket interface
    Separates network protocol from network operation
    Includes select() functionality
  Usage: pipes, FIFOs, streams, queues, mailboxes
Slide26
How Does User Deal with Timing?
Blocking Interface: “Wait”
  When request data (e.g. read() system call), put process to sleep until data is ready
  When write data (e.g. write() system call), put process to sleep until device is ready for data
Non-blocking Interface: “Don’t Wait”
  Returns quickly from read or write request with count of bytes successfully transferred
  Read may return nothing, write may write nothing
Asynchronous Interface: “Tell Me Later”
  When request data, take pointer to user’s buffer, return immediately; later kernel fills buffer and notifies user
  When send data, take pointer to user’s buffer, return immediately; later kernel takes data and notifies user
Slide27
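The "Don't Wait" interface from the previous slide can be tried from user space with the POSIX `O_NONBLOCK` flag. The sketch below is one possible illustration; the helper names `set_nonblocking` and `try_read` are hypothetical wrappers around the standard `fcntl()` and `read()` calls.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Put a descriptor into non-blocking mode. Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/*
 * Try to read without waiting. Returns the byte count (0 means end of file),
 * or -1; in the -1 case *would_block is 1 when the only problem was that
 * no data is ready yet.
 */
ssize_t try_read(int fd, void *buf, size_t len, int *would_block)
{
    ssize_t n = read(fd, buf, len);

    *would_block = (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK));
    return n;
}
```

On an empty pipe whose read end has been passed to `set_nonblocking()`, `try_read()` returns immediately with the would-block indication instead of putting the process to sleep; once the writer has supplied data, the same call returns the number of bytes transferred.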
How does the processor actually talk to the device?
[Figure: CPU, Regular Memory, and Interrupt Controller on the Processor Memory Bus (Address + Data, Interrupt Request); Bus Adaptors lead to Other Devices or Buses and to the Device Controller, which contains a Bus Interface, Hardware Controller, Registers (read, write, control, status; port 0x20), and Addressable Memory and/or Queues (Memory Mapped Region: 0x8f008020)]
CPU interacts with a Controller
  Contains a set of registers that can be read and written
  May contain memory for request queues or bit-mapped images
Regardless of the complexity of the connections and buses, processor accesses registers in two ways:
  I/O instructions: in/out instructions
    Example from the Intel architecture: out 0x21,AL
  Memory mapped I/O: load/store instructions
    Registers/memory appear in physical address space
    I/O accomplished with load and store instructions
Slide28
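Memory-mapped I/O, as described on the previous slide, amounts to ordinary loads and stores through `volatile` pointers. The sketch below assumes a made-up register layout (`struct dev_regs` with `status` and `command`) and a made-up ready bit; a real driver would point the struct at the device's mapped physical address, whereas here an ordinary variable can stand in for the hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register layout; a real device defines its own. */
struct dev_regs {
    volatile uint32_t status;     /* read by the CPU */
    volatile uint32_t command;    /* written by the CPU */
};

#define STATUS_READY 0x1u         /* assumed "device ready" bit */

/* A register read is a plain load; volatile keeps the compiler from caching it. */
static inline uint32_t reg_read(const volatile uint32_t *reg)
{
    return *reg;
}

/* A register write is a plain store that the compiler may not elide. */
static inline void reg_write(volatile uint32_t *reg, uint32_t value)
{
    *reg = value;
}

/* Issue a command if the device reports ready. Returns 0 on success, -1 if busy. */
int dev_issue(struct dev_regs *regs, uint32_t cmd)
{
    if (!(reg_read(&regs->status) & STATUS_READY))
        return -1;
    reg_write(&regs->command, cmd);
    return 0;
}
```

With port-mapped I/O the same two helpers would instead wrap the processor's in/out instructions; the calling code in `dev_issue()` would not need to change.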
Example: Memory-Mapped Display Controller
Memory-Mapped:
  Hardware maps control registers and display memory into physical address space
    Addresses set by hardware jumpers or programming at boot time
  Simply writing to display memory (also called the “frame buffer”) changes image on screen
    Addr: 0x8000F000-0x8000FFFF
  Writing graphics description to command-queue area
    Say enter a set of triangles that describe some scene
    Addr: 0x80010000-0x8001FFFF
  Writing to the command register may cause on-board graphics hardware to do something
    Say render the above scene
    Addr: 0x0007F004
Can protect with page tables
[Figure: Physical Address Space showing Status at 0x0007F000, Command at 0x0007F004, Display Memory from 0x8000F000, and Graphics Command Queue from 0x80010000 to 0x80020000]
Slide29
Transferring Data To/From Controller
Programmed I/O:
  Each byte transferred via processor in/out or load/store
  Pro: Simple hardware, easy to program
  Con: Consumes processor cycles proportional to data size
Direct Memory Access:
  Give controller access to memory bus
  Ask it to transfer data to/from memory directly
Sample interaction with DMA controller (from book): [figure not reproduced]
Slide30
I/O Device Notifying the OS
The OS needs to know when:
  The I/O device has completed an operation
  The I/O operation has encountered an error
I/O Interrupt:
  Device generates an interrupt whenever it needs service
  Handled in top half of device driver
    Often run on special kernel-level stack
  Pro: handles unpredictable events well
  Con: interrupts relatively high overhead
Polling:
  OS periodically checks a device-specific status register
    I/O device puts completion information in status register
    Could use timer to invoke lower half of drivers occasionally
  Pro: low overhead
  Con: may waste many cycles on polling if infrequent or unpredictable I/O operations
Actual devices combine both polling and interrupts
  For instance: High-bandwidth network device:
    Interrupt for first incoming packet
    Poll for following packets until hardware empty
Slide31
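The combined scheme from the previous slide (interrupt for the first packet, then poll until the hardware is empty) can be mimicked with a small simulation. Everything here is hypothetical: `struct sim_nic` stands in for a device, `nic_packet_arrives()` plays the role of the hardware, and no real interrupt machinery is involved.

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated network device; all fields are hypothetical. */
struct sim_nic {
    int  rx_pending;     /* packets waiting in the hardware queue */
    bool irq_enabled;    /* whether the device may raise an interrupt */
    int  irq_count;      /* interrupts actually delivered */
    int  handled;        /* packets processed by the driver */
};

/* Interrupt handler: mask further interrupts and leave the rest to polling. */
static void nic_irq(struct sim_nic *nic)
{
    nic->irq_count++;
    nic->irq_enabled = false;
}

/* Hardware side: a packet arrives and interrupts only if interrupts are enabled. */
void nic_packet_arrives(struct sim_nic *nic)
{
    nic->rx_pending++;
    if (nic->irq_enabled)
        nic_irq(nic);
}

/* Poll loop: drain the queue, then switch back to interrupt mode. */
int nic_poll(struct sim_nic *nic)
{
    int n = 0;

    while (nic->rx_pending > 0) {
        nic->rx_pending--;
        nic->handled++;
        n++;
    }
    nic->irq_enabled = true;
    return n;
}
```

If three packets arrive back to back before `nic_poll()` runs, only the first one costs an interrupt; the other two are picked up by the poll loop, which is where the overhead saving comes from under heavy load.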
Device Drivers
Device Driver: Device-specific code in the kernel that interacts directly with the device hardware
  Supports a standard, internal interface
  Same kernel I/O system can interact easily with different device drivers
  Special device-specific configuration supported with the ioctl() system call
Linux Device drivers often installed via a Module
  Interface for dynamically loading code into kernel space
  Modules loaded with the “insmod” command and can contain parameters
Driver-specific structure
  One per driver
  Contains a set of standard kernel interface routines
    Open: perform device-specific initialization
    Read: perform read
    Write: perform write
    Release: perform device-specific shutdown
    Etc.
  These routines registered at time device registered
Slide32
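The "driver-specific structure" from the previous slide is essentially a table of function pointers. Linux uses `struct file_operations` in this role; the sketch below is a simplified user-space stand-in, not that API, and the names `dev_ops`, `mem_dev`, and `generic_copy_out` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified table of driver entry points. */
struct dev_ops {
    int    (*open)(void *dev);
    size_t (*read)(void *dev, char *buf, size_t len);
    size_t (*write)(void *dev, const char *buf, size_t len);
    int    (*release)(void *dev);
};

/* A toy "device": a small in-memory buffer. */
struct mem_dev {
    char   data[32];
    size_t len;
    int    is_open;
};

static int mem_open(void *dev)
{
    ((struct mem_dev *)dev)->is_open = 1;    /* device-specific initialization */
    return 0;
}

static int mem_release(void *dev)
{
    ((struct mem_dev *)dev)->is_open = 0;    /* device-specific shutdown */
    return 0;
}

static size_t mem_write(void *dev, const char *buf, size_t len)
{
    struct mem_dev *d = dev;

    if (len > sizeof(d->data))
        len = sizeof(d->data);               /* truncate to device capacity */
    memcpy(d->data, buf, len);
    d->len = len;
    return len;
}

static size_t mem_read(void *dev, char *buf, size_t len)
{
    struct mem_dev *d = dev;

    if (len > d->len)
        len = d->len;                        /* return only what is stored */
    memcpy(buf, d->data, len);
    return len;
}

/* The table a driver would hand over when its device is registered. */
const struct dev_ops mem_dev_ops = {
    .open = mem_open,
    .read = mem_read,
    .write = mem_write,
    .release = mem_release,
};

/* Generic I/O code: it knows only the table, not the device behind it. */
size_t generic_copy_out(const struct dev_ops *ops, void *dev, char *buf, size_t len)
{
    size_t n;

    if (ops->open(dev) != 0)
        return 0;
    n = ops->read(dev, buf, len);
    ops->release(dev);
    return n;
}
```

A second driver would supply its own four functions and its own table, and `generic_copy_out()` would work with it unchanged, which is the point of the standard internal interface.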
Interrupt handling
Interrupt routines typically divided into two pieces:
  Top half: run as interrupt routine
    Gets input or transfers next block of output
    Handles any direct access to hardware
    Handles any time-sensitive aspects of handling interrupts
    Runs in the ATOMIC Context (cannot sleep)
  Bottom half: accessed later to finish processing
    Perform any interrupt-related work not performed by the interrupt handler itself
    Scheduled “later” with interrupts re-enabled
    Some options for bottom halves can sleep
Since you typically have two halves of code, must remember to synchronize shared data
  Since interrupt handler is running in interrupt (ATOMIC) context, cannot sleep!
  Good choice: spin lock to synchronize data structures
  Must be careful never to hold spinlock for too long
  When non-interrupt code holds a spinlock, must make sure to disable interrupts!
    Consider “spin_lock_irqsave()” or “spin_lock_bh()” variants
  Consider lock free queue variants as well
Slide33
Options for Bottom Half
Bottom Half used for handling work after interrupt is re-enabled (i.e. deferred work):
  Perform any interrupt-related work not performed by the interrupt handler
  Ideally most of the work
  Want to minimize amount of work done in an interrupt handler because they run with interrupts disabled
Many different mechanisms for handling bottom halves
  Original “Bottom Half” (deprecated)
  Task Queues
    Put work on a task queue for later execution
  Softirqs: statically defined bottom halves that can run simultaneously on any processor
  Tasklets: dynamically created bottom halves built on top of softirq mechanism
    Only one of each type of tasklet can run at given time
    Simplifies synchronization
Slide34
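The top-half/bottom-half split from the previous slides can be pictured as a queue of deferred work: the top half only records what needs doing, and the bottom half runs it later. The sketch below is a single-threaded user-space illustration with hypothetical names (`work_queue`, `defer_work`, `run_deferred`); it deliberately omits the spinlock or interrupt disabling that real kernel code would need around the shared queue.

```c
#include <assert.h>
#include <stddef.h>

#define QLEN 16                        /* hypothetical queue capacity */

typedef void (*work_fn)(void *arg);

struct work_item {
    work_fn fn;
    void   *arg;
};

struct work_queue {
    struct work_item items[QLEN];
    size_t head, tail;                 /* ring indices; head == tail means empty */
};

/*
 * Called from the "top half": record the work and return quickly.
 * Returns 0 on success, -1 if the queue is full.
 */
int defer_work(struct work_queue *q, work_fn fn, void *arg)
{
    size_t next = (q->tail + 1) % QLEN;

    if (next == q->head)
        return -1;
    q->items[q->tail].fn = fn;
    q->items[q->tail].arg = arg;
    q->tail = next;
    return 0;
}

/* Called "later" as the bottom half: run everything queued. Returns the count. */
int run_deferred(struct work_queue *q)
{
    int n = 0;

    while (q->head != q->tail) {
        struct work_item it = q->items[q->head];

        q->head = (q->head + 1) % QLEN;
        it.fn(it.arg);
        n++;
    }
    return n;
}

/* Example work function: adds 1 to the int it is given. */
void bump(void *arg)
{
    (*(int *)arg)++;
}
```

Calling `defer_work(&q, bump, &counter)` twice leaves `counter` untouched until `run_deferred(&q)` is invoked, mirroring how the expensive part of interrupt handling is postponed until interrupts are enabled again.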
Life Cycle of An I/O Request
[Figure: an I/O request passing through User Program, Kernel I/O Subsystem, Device Driver Top Half, Device Driver Bottom Half, and Device Hardware]
Slide35
Summary (1/2)
Clock Algorithm: Approximation to LRU
  Arrange all pages in circular list
  Sweep through them, marking as not “in use”
  If page not “in use” for one pass, then can replace
Nth-chance clock algorithm: Another approx LRU
  Give pages multiple passes of clock hand before replacing
Second-Chance List algorithm: Yet another approx LRU
  Divide pages into two groups, one of which is truly LRU and managed on page faults
Reverse Page Mapping
  Efficient way to hunt down all PTEs associated with given page frame
SLAB Allocator: Kernel mechanism for handling efficient allocation of objects while minimizing initialization
Slide36
Summary (2/2)
I/O Devices Types:
  Many different speeds (0.1 bytes/sec to GBytes/sec)
  Different Access Patterns:
    Block Devices, Character Devices, Network Devices
  Different Access Timing:
    Blocking, Non-blocking, Asynchronous
I/O Controllers: Hardware that controls actual device
  Processor Accesses through I/O instructions, load/store to special physical memory
  Report their results through either interrupts or a status register that processor looks at occasionally (polling)
Notification mechanisms
  Interrupts
  Polling: Report results through status register that processor looks at periodically
Device Driver: Code specific to device which handles unique aspects of device