

CS194-24
Advanced Operating Systems Structures and Implementation
Lecture 17: Device Drivers

April 8th, 2013
Prof. John Kubiatowicz
http://inst.eecs.berkeley.edu/~cs194-24

Goals for Today
SLAB allocator
Devices and Device Drivers
Interactive is important! Ask Questions!

Note: Some slides and/or pictures in the following are adapted from slides © 2013

Review: Clock Algorithm (Not Recently Used)
(Figure: the set of all pages in memory arranged in a circle, swept by a single clock hand.)
Single clock hand: advances only on page fault!
Check for pages not used recently
Mark pages as not used recently
What if the hand is moving slowly? Good sign or bad sign?
Not many page faults and/or find page quickly
What if the hand is moving quickly?
Lots of page faults and/or lots of reference bits set
One way to view clock algorithm: crude partitioning of pages into two groups: young and old
Why not partition into more than 2 groups?
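To make the sweep concrete, here is a minimal user-space sketch in C of victim selection with a single clock hand. The `struct frame` type and `clock_pick_victim()` are hypothetical names, and the use bit lives in an array rather than in real page-table entries. Setting `n_chances` to 1 gives the basic clock algorithm; larger values answer the "more than 2 groups" question by giving a page several sweeps before replacement (the Nth-chance variant from the summary).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-frame record; real hardware keeps the use
 * (reference) bit in the page-table entry. */
struct frame {
    bool use;          /* set on every access to the page */
    unsigned skipped;  /* sweeps survived without being used */
};

/* Advance the clock hand until a victim is found; return its index.
 * n_chances == 1 is the basic clock algorithm; larger values give a
 * page more sweeps before it is replaced (Nth-chance). */
size_t clock_pick_victim(struct frame *frames, size_t nframes,
                         size_t *hand, unsigned n_chances)
{
    assert(nframes > 0 && n_chances > 0);
    for (;;) {
        size_t idx = *hand;
        struct frame *f = &frames[idx];
        *hand = (idx + 1) % nframes;   /* hand moves one frame per step */

        if (f->use) {                  /* used since last sweep: young */
            f->use = false;
            f->skipped = 0;
        } else if (++f->skipped >= n_chances) {
            f->skipped = 0;            /* old enough: replace this page */
            return idx;
        }
    }
}
```

Each call stops at the first page that has gone unused for `n_chances` sweeps, so a hand that finds victims quickly makes few steps per fault.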

Review: Second-Chance List Algorithm (VAX/VMS)
Split memory in two: Active list (RW), SC list (Invalid)

Access pages in Active list at full speed

Otherwise, Page Fault

Always move overflow page from end of Active list to front of Second-chance list (SC) and mark invalid

Desired Page On SC List: move to front of Active list, mark RW

Not on SC list: page in to front of Active list, mark RW; page out LRU victim at end of SC list
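The Second-Chance List policy can be modeled in a few dozen lines of C. This is a simplified, hypothetical simulation: `struct sc_mem`, `sc_access()`, and the two-frame list sizes are made up for illustration. Index 0 is the front of each list, the Active list is never reordered on a hit (FIFO), and only a fault changes the lists.

```c
#include <assert.h>
#include <string.h>

#define ACTIVE_CAP 2   /* assumed: frames mapped RW, FIFO order */
#define SC_CAP     2   /* assumed: frames marked invalid, LRU order */

/* Hypothetical model: index 0 is the front of each list. */
struct sc_mem {
    int active[ACTIVE_CAP]; int n_active;
    int sc[SC_CAP];         int n_sc;
    int last_victim;        /* page paged out to disk, or -1 */
};

enum sc_result { HIT_ACTIVE, SOFT_FAULT, HARD_FAULT };

static int find(const int *list, int n, int page)
{
    for (int i = 0; i < n; i++)
        if (list[i] == page) return i;
    return -1;
}

/* Remove list[i], shifting later entries toward the front. */
static void remove_at(int *list, int *n, int i)
{
    memmove(&list[i], &list[i + 1], (size_t)(*n - i - 1) * sizeof list[0]);
    (*n)--;
}

/* Insert at the front; callers guarantee there is room. */
static void push_front(int *list, int *n, int page)
{
    memmove(&list[1], &list[0], (size_t)*n * sizeof list[0]);
    list[0] = page;
    (*n)++;
}

enum sc_result sc_access(struct sc_mem *m, int page)
{
    if (find(m->active, m->n_active, page) >= 0)
        return HIT_ACTIVE;            /* mapped RW: full speed, no fault */

    enum sc_result r = HARD_FAULT;
    m->last_victim = -1;
    int i = find(m->sc, m->n_sc, page);
    if (i >= 0) {                     /* on SC list: no disk I/O needed */
        remove_at(m->sc, &m->n_sc, i);
        r = SOFT_FAULT;
    }
    if (m->n_active == ACTIVE_CAP) {  /* Active full: overflow its tail */
        int overflow = m->active[--m->n_active];
        if (m->n_sc == SC_CAP)        /* SC full: page out its LRU tail */
            m->last_victim = m->sc[--m->n_sc];
        push_front(m->sc, &m->n_sc, overflow);   /* now marked invalid */
    }
    push_front(m->active, &m->n_active, page);   /* now mapped RW */
    assert(m->n_active <= ACTIVE_CAP && m->n_sc <= SC_CAP);
    return r;
}
```

A soft fault is the cheap case the policy is designed around: the page is still in memory, so bringing it back only costs a trap and a list move.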

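(Figure: directly mapped pages on the Active list, marked RW, managed FIFO; the Second-Chance list, marked Invalid, managed LRU. New active pages are paged in from disk; Active-list overflow moves to the SC list; accessing an SC page brings it back as a new active page; the LRU victim leaves the end of the SC list.)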

Free List
Keep set of free pages ready for use in demand paging
Freelist filled in background by Clock algorithm or other technique (“Pageout demon”)
Dirty pages start copying back to disk when they enter the list
Like VAX second-chance list
If page needed before reused, just return to active set
Advantage: faster for page fault
Can always use page (or pages) immediately on fault
Single clock hand: advances as needed to keep freelist full (“background”)
(Figure: the set of all pages in memory, the clock hand, and a freelist containing dirty (D) pages, supplying free pages for processes.)

Reverse Page Mapping (Sometimes called “Coremap”)
Physical page frames often shared by many different address spaces/page tables
All children forked from given process
Shared memory pages between processes
Whatever reverse-mapping mechanism is in place must be very fast
Must hunt down all page tables pointing at given page frame when freeing a page
Must hunt down all PTEs when seeing if pages “active”
Implementation options:
For every page descriptor, keep linked list of page table entries that point to it
Management nightmare – expensive
Linux 2.6: Object-based reverse mapping
Link together memory region descriptors instead (much coarser granularity)
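The first implementation option (a per-frame list of the PTEs that map it) might look like the following user-space C sketch. The PTE format, `PTE_PRESENT`, `struct pte_chain`, and the `rmap_*` functions are hypothetical; a real kernel would also flush TLB entries and take locks. The allocation per mapping hints at why this option becomes expensive when many processes share pages.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t pte_t;       /* hypothetical PTE: frame number << 12 | flags */
#define PTE_PRESENT 0x1u

/* One node per PTE that maps a frame. */
struct pte_chain {
    pte_t *ptep;
    struct pte_chain *next;
};

struct frame_desc {
    struct pte_chain *mappers;   /* every PTE pointing at this frame */
};

/* Record that *ptep now maps this frame. Returns 0, or -1 if out of memory. */
int rmap_add(struct frame_desc *fd, pte_t *ptep)
{
    assert(ptep != NULL);
    struct pte_chain *c = malloc(sizeof *c);
    if (!c) return -1;
    c->ptep = ptep;
    c->next = fd->mappers;
    fd->mappers = c;
    return 0;
}

/* Invalidate every PTE that maps the frame so it can be freed;
 * returns how many mappings were torn down. */
int rmap_unmap_all(struct frame_desc *fd)
{
    int n = 0;
    while (fd->mappers) {
        struct pte_chain *c = fd->mappers;
        *c->ptep &= ~(pte_t)PTE_PRESENT;   /* real kernels also flush the TLB */
        fd->mappers = c->next;
        free(c);
        n++;
    }
    return n;
}
```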

What Actually Happens in Linux?
Memory management in Linux is considerably more complex than the previous slides indicate
Memory zones: physical memory categories
ZONE_DMA: < 16MB memory, DMA-able on ISA bus
ZONE_NORMAL: 16MB – 896MB (mapped at 0xC0000000)
ZONE_HIGHMEM: everything else (> 896MB)
Each zone has 1 freelist, 2 LRU lists (Active/Inactive)
Many different types of allocation
SLAB allocators, per-page allocators, mapped/unmapped
Many different types of allocated memory:
Anonymous memory (not backed by a file, heap/stack)
Mapped memory (backed by a file)
Allocation priorities
Is blocking allowed, etc.
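As a small illustration of the zone boundaries, this C function classifies a physical address. It assumes the 32-bit x86 layout from the slide (16MB and 896MB cut-offs); `phys_to_zone()` and the `MB` macro are hypothetical helpers, and other architectures place the boundaries differently.

```c
#include <assert.h>
#include <stdint.h>

#define MB (1024ULL * 1024ULL)

enum zone { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM };

/* Classify a physical address into the 32-bit x86 zones described above. */
enum zone phys_to_zone(uint64_t paddr)
{
    if (paddr < 16 * MB)  return ZONE_DMA;      /* reachable by ISA DMA */
    if (paddr < 896 * MB) return ZONE_NORMAL;   /* permanently mapped at 0xC0000000 + paddr */
    return ZONE_HIGHMEM;                        /* needs a temporary kernel mapping */
}
```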

Linux Virtual Memory Map
32-bit virtual address space: user addresses from 0x00000000 up to 0xC0000000 (3GB total); kernel addresses from 0xC0000000 to 0xFFFFFFFF (1GB), of which the first 896MB map physical memory.
64-bit virtual address space: user addresses from 0x0000000000000000 to 0x00007FFFFFFFFFFF (128TiB); then empty space (the “Canonical Hole”); kernel addresses from 0xFFFF800000000000 to 0xFFFFFFFFFFFFFFFF (128TiB), of which 64TiB map physical memory.

Virtual Map (Details)
Kernel memory not generally visible to user
Exception: special VDSO facility that maps kernel code into user space to aid in system calls (and to provide certain actual system calls such as gettimeofday())
Every physical page described by a “page” structure
Collected together in lower physical memory
Can be accessed in kernel virtual space
Linked together in various “LRU” lists
For 32-bit virtual memory architectures:
When physical memory < 896MB
All physical memory mapped at 0xC0000000
When physical memory >= 896MB
Not all physical memory mapped in kernel space all the time
Can be temporarily mapped with addresses above the direct map (> 0xF8000000, i.e. 0xC0000000 + 896MB)
For 64-bit virtual memory architectures:
All physical memory mapped above 0xFFFF800000000000
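The direct mapping is a constant offset, so translating between a low physical address and its kernel virtual address is a single addition or subtraction (the idea behind the kernel's `__va()`/`__pa()`). The sketch below uses the offsets from the slide; the helper names are hypothetical, and newer 64-bit kernels use a different direct-map base that may also be randomized.

```c
#include <assert.h>
#include <stdint.h>

/* Direct-map offsets as given on the slide (assumed, non-randomized layout). */
#define PAGE_OFFSET_32  0xC0000000u
#define LOWMEM_LIMIT_32 (896u * 1024u * 1024u)      /* 0x38000000 */
#define PAGE_OFFSET_64  0xFFFF800000000000ull

/* 32-bit: only physical addresses below 896MB have a permanent mapping.
 * Returns 0 for high memory, which must be temporarily mapped instead. */
uint32_t va32(uint32_t paddr)
{
    if (paddr >= LOWMEM_LIMIT_32) return 0;
    return PAGE_OFFSET_32 + paddr;
}

/* Inverse translation; only valid for addresses inside the direct map. */
uint32_t pa32(uint32_t vaddr)
{
    assert(vaddr >= PAGE_OFFSET_32 && vaddr - PAGE_OFFSET_32 < LOWMEM_LIMIT_32);
    return vaddr - PAGE_OFFSET_32;
}

/* 64-bit: all physical memory is in the direct map. */
uint64_t va64(uint64_t paddr) { return PAGE_OFFSET_64 + paddr; }
```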

Allocating Memory
One mechanism for requesting pages: everything else on top of this mechanism:
Allocate contiguous group of 2^order pages given the specified mask:
struct page *alloc_pages(gfp_t gfp_mask, unsigned int order)
Allocate one page:
struct page *alloc_page(gfp_t gfp_mask)
Convert page to logical address (assuming mapped):
void *page_address(struct page *page)
Also routines for freeing pages
Zone allocator uses “buddy” allocator that tries to keep memory unfragmented
Allocation routines pick from proper zone, given flags
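The buddy allocator relies on two bit tricks: rounding a request up to a power-of-two number of pages (the `order`) and finding a block's buddy by flipping one bit of its page frame number. Here is a C sketch, assuming 4KB pages and orders 0 through 10; `size_to_order()`, `buddy_pfn()`, and `merged_pfn()` are illustrative names, not the kernel's code.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096u   /* assumed page size */
#define MAX_ORDER 11      /* assumed: orders 0..10 */

/* Smallest order such that (PAGE_SIZE << order) >= size. */
unsigned size_to_order(size_t size)
{
    unsigned order = 0;
    while (((size_t)PAGE_SIZE << order) < size)
        order++;
    assert(order < MAX_ORDER);
    return order;
}

/* A free block of 2^order pages starting at page frame number pfn
 * (aligned to 2^order) has exactly one buddy: flip bit 'order'. */
unsigned long buddy_pfn(unsigned long pfn, unsigned order)
{
    assert((pfn & ((1ul << order) - 1)) == 0);   /* block must be aligned */
    return pfn ^ (1ul << order);
}

/* When both buddies are free they merge into one block of order+1,
 * which starts at the lower of the two: clear bit 'order'. */
unsigned long merged_pfn(unsigned long pfn, unsigned order)
{
    return pfn & ~(1ul << order);
}
```

For example, a 4097-byte request needs order 1 (two pages), and the order-2 block at frame 8 has its buddy at frame 12.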

Allocation flags
Possible allocation type flags:
GFP_ATOMIC: Allocation high-priority and must never sleep. Use in interrupt handlers, top halves, while holding locks, or at other times when you cannot sleep
GFP_NOWAIT: Like GFP_ATOMIC, except call will not fall back on emergency memory pools. Increases likelihood of failure
GFP_NOIO: Allocation can block but must not initiate disk I/O.
GFP_NOFS: Can block, and can initiate disk I/O, but will not initiate filesystem ops.
GFP_KERNEL: Normal allocation, might block. Use in process context when safe to sleep. This should be default choice
GFP_USER: Normal allocation for processes
GFP_HIGHMEM: Allocation from ZONE_HIGHMEM
GFP_DMA: Allocation from ZONE_DMA. Use in combination with a previous flag

Page Frame Reclaiming Algorithm (PFRA)
Several entry points:
Low on memory reclaiming: the kernel detects a “low on memory” condition
Hibernation reclaiming: the kernel must free memory because it is entering the suspend-to-disk state
Periodic reclaiming: a kernel thread is activated periodically to perform memory reclaiming, if necessary
Low on memory reclaiming:
Start flushing out dirty pages to disk
Start looping over all memory nodes in the system
try_to_free_pages()
shrink_slab()
pdflush kernel thread writing out dirty pages
Periodic reclaiming:
kswapd kernel threads: check whether the number of free page frames in some zone has fallen below the pages_high watermark
Each zone keeps two LRU lists: Active and Inactive
Each page has a last-chance algorithm with a count of 2
Active pages moved to the inactive list when they have been idle for two cycles through the list
Pages reclaimed from Inactive list
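The Active/Inactive aging can be approximated with two flags per page. This is a hypothetical, simplified C model: the `lru_page` type and both functions are made up, loosely inspired by the kernel's active and referenced page flags. An inactive page needs a second touch to be promoted, an active page must stay idle for two scans before it is demoted, and only idle inactive pages are reclaimed.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of a page on the Active/Inactive LRU lists. */
struct lru_page {
    bool active;       /* on the Active list? */
    bool referenced;   /* touched since it was last scanned? */
};

/* Called on each access: a touch on an inactive page that was already
 * referenced promotes it, so a single streaming access is not enough. */
void page_accessed(struct lru_page *p)
{
    if (!p->active && p->referenced)
        p->active = true;          /* second touch: promote */
    p->referenced = true;          /* every touch marks recent use */
}

/* One pass of the reclaim scanner over a page. Returns true if the
 * page may be reclaimed. */
bool scan_page(struct lru_page *p)
{
    if (p->referenced) {           /* used since last scan: spare it */
        p->referenced = false;
        return false;
    }
    if (p->active) {               /* idle for a full pass: demote */
        p->active = false;
        return false;
    }
    return true;                   /* idle inactive page: reclaim */
}
```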

SLAB Allocator
Replacement for free-lists that are hand-coded by users
Consolidation of all of this code under kernel control
Efficient when objects allocated and freed frequently
Objects segregated into “caches”
Each cache stores different type of object
Data inside cache divided into “slabs”, which are contiguous groups of pages (often only 1 page)
Key idea: avoid memory fragmentation

(Figure: a cache made up of slabs, each slab holding several objects (Obj 1 to Obj 5).)

SLAB Allocator Details
Based on algorithm first introduced for SunOS
Observation: amount of time required to initialize a regular object in the kernel exceeds the amount of time required to allocate and deallocate it
Revolves around object caching
Allocate once, keep reusing objects
Avoids memory fragmentation:
Caching of similarly sized objects avoids fragmentation
Similar to custom freelist per object
Reuse of allocation
When new object first allocated, constructor runs
On subsequent free/reallocation, constructor does not need to be re-executed
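The object-caching idea (construct once, then recycle) can be shown with a tiny user-space cache. In this hypothetical C sketch, `struct obj_cache`, the `cache_*` functions, and the example `struct conn` are illustrative names. There is a single fixed-size slab and no locking, and the free list is kept outside the objects so a freed object keeps its constructed state. As with `kmem_cache_create()`, the constructor runs when the slab's memory is added to the cache, not on every allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct obj_cache {
    size_t obj_size, nobjs;   /* obj_size should respect object alignment */
    void (*ctor)(void *);
    char *slab;               /* one contiguous block holding nobjs objects */
    size_t *free_idx;         /* stack of free object indices */
    size_t nfree;
    size_t ctor_calls;        /* for illustration: constructor runs */
};

int cache_init(struct obj_cache *c, size_t obj_size, size_t nobjs,
               void (*ctor)(void *))
{
    c->obj_size = obj_size;
    c->nobjs = nobjs;
    c->ctor = ctor;
    c->ctor_calls = 0;
    c->slab = malloc(obj_size * nobjs);
    c->free_idx = malloc(nobjs * sizeof *c->free_idx);
    if (!c->slab || !c->free_idx) { free(c->slab); free(c->free_idx); return -1; }
    c->nfree = nobjs;
    for (size_t i = 0; i < nobjs; i++) {
        c->free_idx[i] = nobjs - 1 - i;               /* hand out object 0 first */
        if (ctor) { ctor(c->slab + i * obj_size); c->ctor_calls++; }
    }
    return 0;
}

void *cache_alloc(struct obj_cache *c)
{
    if (c->nfree == 0) return NULL;   /* a real allocator would add a slab */
    return c->slab + c->free_idx[--c->nfree] * c->obj_size;
}

void cache_free(struct obj_cache *c, void *obj)
{
    size_t off = (size_t)((char *)obj - c->slab);
    assert(off % c->obj_size == 0 && off / c->obj_size < c->nobjs);
    c->free_idx[c->nfree++] = off / c->obj_size;   /* object stays constructed */
}

void cache_destroy(struct obj_cache *c) { free(c->slab); free(c->free_idx); }

/* Example object whose initialization we want to do only once. */
struct conn { int state; char buf[60]; };
static void conn_ctor(void *p) { ((struct conn *)p)->state = 1; }
```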

SLAB Allocator: Cache Construction
Creation of new caches:
struct kmem_cache *kmem_cache_create(const char *name, size_t size, size_t align, unsigned long flags, void (*ctor)(void *));
name: name of cache
size: size of each element in the cache
align: alignment for each object (often 0)
flags: possible flags about allocation
SLAB_HWCACHE_ALIGN: Align objects to cache lines
SLAB_POISON: Fill slabs with a known value (0xa5a5a5a5) in order to catch use of uninitialized memory
SLAB_RED_ZONE: Insert empty zones around objects to help detect buffer overruns
SLAB_PANIC: Allocation layer panics if allocation fails
SLAB_CACHE_DMA: Allocations from DMA-able memory
SLAB_NOTRACK: Don't track uninitialized memory
ctor: called whenever new pages are added to the cache

SLAB Allocator: Cache Use
Example:
task_struct_cachep = kmem_cache_create("task_struct", sizeof(struct task_struct),
        ARCH_MIN_TASKALIGN, SLAB_PANIC | SLAB_NOTRACK, NULL);
Use of example:
struct task_struct *tsk;
tsk = kmem_cache_alloc(task_struct_cachep, GFP_KERNEL);
if (!tsk)
    return NULL;
kmem_cache_free(task_struct_cachep, tsk);

SLAB Allocator Details (Con’t)
Caches can later be destroyed with:
int kmem_cache_destroy(struct kmem_cache *cachep);
Assuming that all objects freed
No one ever tries to use cache again
All caches kept in global list
Including global caches set up with objects of powers of 2 from 2^5 to 2^17
General kernel allocation (kmalloc/kfree) uses the smallest cache that fits the requested size
Reclamation of memory
Caches keep sorted list of empty, partial, and full slabs
Easy to manage – slab metadata contains reference count
Objects within slabs linked together
Ask individual caches for full slabs for reclamation
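Choosing a general-purpose cache for `kmalloc()` amounts to rounding the request up to a size class. The sketch below assumes only the power-of-2 classes from 2^5 to 2^17; `kmalloc_class()` is a hypothetical helper, and real kernels also provide a few non-power-of-2 classes (such as 96 and 192 bytes).

```c
#include <assert.h>
#include <stddef.h>

#define KMALLOC_MIN_SHIFT 5    /* 2^5  = 32 bytes */
#define KMALLOC_MAX_SHIFT 17   /* 2^17 = 128 KiB */

/* Return the smallest power-of-two class that can hold 'size' bytes,
 * or 0 if the request is larger than the biggest general cache. */
size_t kmalloc_class(size_t size)
{
    for (unsigned shift = KMALLOC_MIN_SHIFT; shift <= KMALLOC_MAX_SHIFT; shift++) {
        size_t cls = (size_t)1 << shift;
        if (size <= cls)
            return cls;
    }
    return 0;
}
```

Rounding up trades some internal fragmentation (a 33-byte request occupies a 64-byte object) for a small, fixed set of caches.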

Alternatives for allocation
A number of options in the kernel for object allocation:
SLAB: original allocator based on Bonwick’s paper from SunOS
SLUB: newer allocator with same interface but better use of metadata
Keeps SLAB metadata in the page data structure (for pages that happen to be in kernel caches)
Debugging options compiled in by default, just need to be enabled
SLOB: low-memory-footprint allocator for embedded systems

Kernel Device Structure
(Figure: below the system call interface, the kernel is split into process management (concurrency, multitasking; architecture-dependent code), memory management (virtual memory; memory manager), filesystems (files and dirs: the VFS; file system types; block devices), device control (TTYs and device access; device control), and networking (connectivity; network subsystem; IF drivers).)

Modern I/O Systems

Virtual Bus Architecture
(Figure: CPU and RAM on the memory bus; PCI buses #0 and #1 joined by a PCI bridge, with PCI slots; a USB controller (root hub, hub) and a SCSI controller; attached devices include a scanner, hard disk, CD-ROM, webcam, mouse, and keyboard.)

SandyBridge I/O: PCH
Platform Controller Hub
Used to be “SouthBridge,” but no “NorthBridge” now
Connected to processor with proprietary bus: Direct Media Interface
Code name “Cougar Point” for SandyBridge processors
Types of I/O on PCH:
USB
Ethernet
Audio
BIOS support
More PCI Express (lower speed than on processor)
SATA (for disks)
(Figure: SandyBridge system configuration)

Why Device Drivers?
Many different devices, many different properties
Devices different even within class (e.g., network cards)
DMA vs. programmed I/O
Processing every packet through device driver vs. setup of packet filters in hardware to sort packets automatically
Interrupts vs. polling
On-device buffer management, framing options (e.g., jumbo frames), error management, …
Authentication mechanism, etc.
Provide standardized interfaces to computer users
Socket interface with TCP/IP
Factor portion of codebase specific to a given device
Device manufacturer can hide complexities of their device behind standard kernel interfaces
Also: device manufacturer can fix quirks/bugs in their device by providing new driver

The Goal of the I/O Subsystem
Provide uniform interfaces, despite wide range of different devices
This code works on many different devices:
#include <stdio.h>

FILE *fd = fopen("/dev/something", "w");
if (fd != NULL) {
    for (int i = 0; i < 10; i++) {
        fprintf(fd, "Count %d\n", i);
    }
    fclose(fd);
}
Why? Because code that controls devices (“device driver”) implements standard interface.
We will try to get a flavor for what is involved in actually controlling devices in rest of lecture
Can only scratch surface!

Want Standard Interfaces to Devices
Block Devices: e.g. disk drives, tape drives, DVD-ROM
Access blocks of data
Commands include open(), read(), write(), seek()
Raw I/O or file-system access
Memory-mapped file access possible
Character Devices: e.g. keyboards, mice, serial ports, some USB devices
Single characters at a time
Commands include get(), put()
Libraries layered on top allow line editing
Network Devices: e.g. Ethernet, Wireless, Bluetooth
Different enough from block/character to have own interface
Unix and Windows include socket interface
Separates network protocol from network operation
Includes select() functionality
Usage: pipes, FIFOs, streams, queues, mailboxes

How Does User Deal with Timing?
Blocking Interface: “Wait”
When request data (e.g. read() system call), put process to sleep until data is ready
When write data (e.g. write() system call), put process to sleep until device is ready for data
Non-blocking Interface: “Don’t Wait”
Returns quickly from read or write request with count of bytes successfully transferred
Read may return nothing, write may write nothing
Asynchronous Interface: “Tell Me Later”
When request data, take pointer to user’s buffer, return immediately; later kernel fills buffer and notifies user
When send data, take pointer to user’s buffer, return immediately; later kernel takes data and notifies user
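The non-blocking interface maps directly onto POSIX: with `O_NONBLOCK` set, `read()` on an empty pipe returns -1 with `errno` set to `EAGAIN` (or `EWOULDBLOCK`) instead of putting the process to sleep. A minimal C sketch; `try_read()` is a hypothetical helper name.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Switch fd to non-blocking mode (this persists on the fd) and read.
 * Returns bytes read, 0 at end-of-file, or -1 with errno == EAGAIN or
 * EWOULDBLOCK when no data is ready yet. */
ssize_t try_read(int fd, void *buf, size_t len)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        return -1;
    return read(fd, buf, len);
}
```

For example, after `pipe(p)`, `try_read(p[0], buf, n)` fails with `EAGAIN` until something is written to `p[1]`, where a blocking `read()` would have slept.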

How does the processor actually talk to the device?
(Figure: the CPU, regular memory, and an interrupt controller on the processor-memory bus, with bus adaptors leading to other devices or buses. A device controller attaches through its bus interface and contains registers (read, write, control, status; e.g. port 0x20) and addressable memory and/or queues (e.g. memory-mapped region 0x8f008020). Address + data travel over the bus; the controller raises interrupt requests.)
CPU interacts with a controller
Contains a set of registers that can be read and written
May contain memory for request queues or bit-mapped images
Regardless of the complexity of the connections and buses, processor accesses registers in two ways:
I/O instructions: in/out instructions
Example from the Intel architecture: out 0x21,AL
Memory-mapped I/O: load/store instructions
Registers/memory appear in physical address space
I/O accomplished with load and store instructions
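Memory-mapped I/O is just loads and stores to special addresses. The C sketch below uses a hypothetical register layout (`struct dev_regs`, `STATUS_READY`, `CONTROL_START`) and points at ordinary memory so it can run in user space. Real Linux drivers map the region with `ioremap()` and access it through helpers such as `readl()`/`writel()` rather than raw pointers, and port I/O like `out 0x21,AL` would use `inb()`/`outb()`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register block of an imaginary controller. */
struct dev_regs {
    volatile uint32_t status;    /* bit 0: device ready */
    volatile uint32_t control;   /* bit 0: start operation */
    volatile uint32_t data;
};

#define STATUS_READY  0x1u
#define CONTROL_START 0x1u

/* 'volatile' keeps the compiler from caching or dropping these accesses.
 * Returns 0 on success, -1 if the device is not ready. */
int dev_start(struct dev_regs *regs, uint32_t value)
{
    if (!(regs->status & STATUS_READY))   /* load from status register */
        return -1;
    regs->data = value;                    /* store to data register */
    regs->control = CONTROL_START;         /* store that starts the device */
    return 0;
}
```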

Example: Memory-Mapped Display Controller
Memory-mapped:
Hardware maps control registers and display memory into physical address space
Addresses set by hardware jumpers or programming at boot time
Simply writing to display memory (also called the “frame buffer”) changes image on screen
Addr: 0x8000F000–0x8000FFFF
Writing graphics description to command-queue area
Say enter a set of triangles that describe some scene
Addr: 0x80010000–0x8001FFFF
Writing to the command register may cause on-board graphics hardware to do something
Say render the above scene
Addr: 0x0007F004
Can protect with page tables
(Figure: physical address space showing display memory at 0x8000F000, the graphics command queue from 0x80010000 to 0x80020000, the status register at 0x0007F000, and the command register at 0x0007F004.)

Transferring Data To/From Controller
Programmed I/O:
Each byte transferred via processor in/out or load/store
Pro: simple hardware, easy to program
Con: consumes processor cycles proportional to data size
Direct Memory Access:
Give controller access to memory bus
Ask it to transfer data to/from memory directly
Sample interaction with DMA controller (from book):
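Programmed I/O means the CPU moves every byte itself. In this hypothetical C sketch (`struct pio_dev`, `PIO_READY`, and the simulated `device_store()` are illustrative), the driver loop waits on a status bit and stores one byte per iteration, so its cost grows linearly with the transfer size. With DMA the CPU would instead give the controller a buffer address and length and wait for a completion interrupt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PIO_READY 0x1u   /* hypothetical "ready for next byte" status bit */

/* Hypothetical byte-wide device; 'received' exists only so the simulated
 * hardware can record what it was sent. */
struct pio_dev {
    volatile uint8_t status;
    volatile uint8_t data;
    uint8_t received[64];
    size_t nreceived;
};

/* Simulation of the hardware side of a store to the data register.
 * On real hardware this is a single store; the device logs nothing and
 * updates 'status' by itself. */
static void device_store(struct pio_dev *d, uint8_t byte)
{
    assert(d->nreceived < sizeof d->received);
    d->data = byte;
    d->received[d->nreceived++] = byte;
    d->status = PIO_READY;
}

/* Programmed I/O: poll the status register, then move one byte. */
void pio_write(struct pio_dev *d, const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        while (!(d->status & PIO_READY))
            ;                      /* busy-wait until the device can accept a byte */
        device_store(d, buf[i]);
    }
}
```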

I/O Device Notifying the OS
The OS needs to know when:
The I/O device has completed an operation
The I/O operation has encountered an error
I/O Interrupt:
Device generates an interrupt whenever it needs service
Handled in top half of device driver
Often run on special kernel-level stack
Pro: handles unpredictable events well
Con: interrupts relatively high overhead
Polling:
OS periodically checks a device-specific status register
I/O device puts completion information in status register
Could use timer to invoke lower half of drivers occasionally
Pro: low overhead
Con: may waste many cycles on polling if infrequent or unpredictable I/O operations
Actual devices combine both polling and interrupts
For instance, high-bandwidth network device:
Interrupt for first incoming packet
Poll for following packets until hardware empty
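The hybrid scheme (interrupt for the first packet, then poll until the hardware is empty) is the idea behind Linux's NAPI. The following user-space C model is hypothetical: `struct nic`, `nic_receive()`, and `nic_poll()` are made-up names, and `nic_receive()` stands in for the hardware. The first packet raises an interrupt that masks further interrupts, and polling re-enables them only once the ring is drained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 8   /* assumed ring size */

/* Hypothetical receive ring filled by a simulated NIC. */
struct nic {
    int ring[RING_SIZE];
    size_t head, tail;     /* device writes at head, driver reads at tail */
    bool irq_enabled;
    size_t interrupts;     /* how many interrupts were raised */
};

/* Hardware stand-in: queue a packet and interrupt only if enabled.
 * Returns false if the ring is full (packet dropped). */
bool nic_receive(struct nic *n, int pkt)
{
    if (n->head - n->tail == RING_SIZE) return false;
    n->ring[n->head++ % RING_SIZE] = pkt;
    if (n->irq_enabled) {
        n->irq_enabled = false;   /* top half masks further interrupts... */
        n->interrupts++;          /* ...and schedules polling */
    }
    return true;
}

/* Polling phase: drain up to 'budget' packets; re-enable interrupts only
 * once the ring is empty. Returns the number of packets processed. */
size_t nic_poll(struct nic *n, size_t budget, int *out)
{
    size_t done = 0;
    while (done < budget && n->tail != n->head)
        out[done++] = n->ring[n->tail++ % RING_SIZE];
    if (n->tail == n->head)
        n->irq_enabled = true;    /* hardware empty: back to interrupt mode */
    return done;
}
```

Under heavy load the driver stays in polling mode and takes one interrupt per burst rather than one per packet.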

Device Drivers
Device Driver: device-specific code in the kernel that interacts directly with the device hardware
Supports a standard, internal interface
Same kernel I/O system can interact easily with different device drivers
Special device-specific configuration supported with the ioctl() system call
Linux device drivers often installed via a Module
Interface for dynamically loading code into kernel space
Modules loaded with the “insmod” command and can contain parameters
Driver-specific structure
One per driver
Contains a set of standard kernel interface routines
Open: perform device-specific initialization
Read: perform read
Write: perform write
Release: perform device-specific shutdown
Etc.
These routines registered at time device registered
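The per-driver table of standard routines is essentially a struct of function pointers that the generic I/O layer calls without knowing the device. Here is a user-space C sketch; `struct dev_ops` and the `scratch_*` driver are hypothetical, though Linux character drivers register a comparable `struct file_operations`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table of standard driver entry points. */
struct dev_ops {
    int  (*open)(void *dev);
    long (*read)(void *dev, char *buf, size_t len);
    long (*write)(void *dev, const char *buf, size_t len);
    int  (*release)(void *dev);
};

/* Example driver: a 16-byte "scratch" device that remembers the last write. */
struct scratch { char mem[16]; size_t len; int opens; };

static int scratch_open(void *d)    { ((struct scratch *)d)->opens++; return 0; }
static int scratch_release(void *d) { ((struct scratch *)d)->opens--; return 0; }

static long scratch_write(void *d, const char *buf, size_t len)
{
    struct scratch *s = d;
    if (len > sizeof s->mem) len = sizeof s->mem;   /* short write when full */
    memcpy(s->mem, buf, len);
    s->len = len;
    return (long)len;
}

static long scratch_read(void *d, char *buf, size_t len)
{
    struct scratch *s = d;
    if (len > s->len) len = s->len;
    memcpy(buf, s->mem, len);
    return (long)len;
}

/* Registered once; the generic I/O layer only calls through the table. */
const struct dev_ops scratch_ops = {
    .open = scratch_open, .read = scratch_read,
    .write = scratch_write, .release = scratch_release,
};
```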

Interrupt handling
Interrupt routines typically divided into two pieces:
Top half: run as interrupt routine
Gets input or transfers next block of output
Handles any direct access to hardware
Handles any time-sensitive aspects of handling interrupts
Runs in the ATOMIC context (cannot sleep)
Bottom half: accessed later to finish processing
Perform any interrupt-related work not performed by the interrupt handler itself
Scheduled “later” with interrupts re-enabled
Some options for bottom halves can sleep
Since you typically have two halves of code, must remember to synchronize shared data
Since interrupt handler is running in interrupt (ATOMIC) context, cannot sleep!
Good choice: spin lock to synchronize data structures
Must be careful never to hold spinlock for too long
When non-interrupt code holds a spinlock, must make sure to disable interrupts!
Consider “spin_lock_irqsave()” (data shared with interrupt handlers) or “spin_lock_bh()” (data shared with bottom halves) variants
Consider lock-free queue variants as well
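One lock-free option for passing data from the top half to the bottom half is a single-producer/single-consumer ring buffer. This C11 sketch is a user-space illustration (`struct spsc_queue` and its functions are hypothetical names); it is safe only with exactly one producer and one consumer, and kernel code would normally use an existing primitive such as `kfifo` instead.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QSIZE 16   /* assumed capacity */

/* The producer (top half) only writes 'head', the consumer (bottom half)
 * only writes 'tail', so neither side ever needs a lock or sleeps. */
struct spsc_queue {
    int buf[QSIZE];
    _Atomic size_t head;   /* next slot to fill */
    _Atomic size_t tail;   /* next slot to drain */
};

/* Top half: returns false instead of blocking when the queue is full. */
bool spsc_push(struct spsc_queue *q, int item)
{
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head - tail == QSIZE)
        return false;
    q->buf[head % QSIZE] = item;
    atomic_store_explicit(&q->head, head + 1, memory_order_release);  /* publish */
    return true;
}

/* Bottom half: returns false when the queue is empty. */
bool spsc_pop(struct spsc_queue *q, int *item)
{
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail == head)
        return false;
    *item = q->buf[tail % QSIZE];
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);  /* free slot */
    return true;
}
```

The release store on `head` pairs with the consumer's acquire load, so the bottom half never sees an index before the item it refers to.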

Options for Bottom Half
Bottom half used for handling work after interrupts are re-enabled (i.e., deferred work):
Perform any interrupt-related work not performed by the interrupt handler
Ideally most of the work
Want to minimize amount of work done in an interrupt handler because it runs with interrupts disabled
Many different mechanisms for handling bottom halves
Original “Bottom Half” (deprecated)
Task Queues
Put work on a task queue for later execution
Softirqs are statically defined bottom halves that can run simultaneously on any processor
Tasklets: dynamically created bottom halves built on top of softirq mechanism
Only one of each type of tasklet can run at a given time
Simplifies synchronization
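A key property of tasklets is that scheduling one that is already pending does nothing, so several interrupts can be coalesced into a single bottom-half run. The C model below is hypothetical: `struct work`, `work_schedule()`, and `work_run_pending()` are not the kernel API (which uses `tasklet_schedule()`). It is single-threaded and ignores run order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct work {
    void (*fn)(void *arg);
    void *arg;
    bool pending;          /* already queued? */
    struct work *next;
};

struct work_list { struct work *first; };

/* From the top half: queue the work unless it is already queued. */
void work_schedule(struct work_list *l, struct work *w)
{
    if (w->pending)
        return;            /* coalesce: one run covers both requests */
    w->pending = true;
    w->next = l->first;
    l->first = w;
}

/* Later, with interrupts enabled: dequeue and run everything pending.
 * Returns how many work items ran. */
size_t work_run_pending(struct work_list *l)
{
    size_t ran = 0;
    while (l->first) {
        struct work *w = l->first;
        l->first = w->next;
        w->pending = false;   /* cleared before running, as tasklets do */
        w->fn(w->arg);
        ran++;
    }
    return ran;
}

/* Example deferred function: counts its invocations. */
static void count_calls(void *arg) { (*(int *)arg)++; }
```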

Life Cycle of An I/O Request
(Figure: the stages an I/O request passes through: user program, kernel I/O subsystem, device driver top half, device driver bottom half, and device hardware.)

Summary (1/2)
Clock Algorithm: approximation to LRU
Arrange all pages in circular list
Sweep through them, marking as not “in use”
If page not “in use” for one pass, then can replace
Nth-chance clock algorithm: another approx LRU
Give pages multiple passes of clock hand before replacing
Second-Chance List algorithm: yet another approx LRU
Divide pages into two groups, one of which is truly LRU and managed on page faults
Reverse Page Mapping
Efficient way to hunt down all PTEs associated with given page frame
SLAB Allocator: kernel mechanism for handling efficient allocation of objects while minimizing initialization

Summary (2/2)
I/O Device Types:
Many different speeds (0.1 bytes/sec to GBytes/sec)
Different access patterns: Block Devices, Character Devices, Network Devices
Different access timing: Blocking, Non-blocking, Asynchronous
I/O Controllers: hardware that controls actual device
Processor accesses through I/O instructions, load/store to special physical memory
Report their results through either interrupts or a status register that processor looks at occasionally (polling)
Notification mechanisms
Interrupts
Polling: report results through status register that processor looks at periodically
Device Driver: code specific to device which handles unique aspects of device