Data Structures and Algorithms: Huffman Compression

An Application of Binary Trees and Priority Queues (CS 102)
Presentation Transcript

Slide 1

Data Structures and Algorithms

Huffman Compression

www.mif.vu.lt/~algis

Slide 2

Encoding and Compression

Fax Machines

ASCII

Variations on ASCII

min number of bits needed

cost of savings

patterns

Modifications

Proposed by Dr. David A. Huffman in 1952 in the paper "A Method for the Construction of Minimum-Redundancy Codes"

Applicable to many forms of data transmission

Our example: text files

Slide 3

The Basic Algorithm

Huffman coding is a form of statistical coding

Not all characters occur with the same frequency!

In standard encodings, all characters are allocated the same amount of space: 1 char = 1 byte, be it e or x

Are there any savings in tailoring codes to the frequency of characters?

Code word lengths are no longer fixed, as they are in ASCII.

Code word lengths vary and will be shorter for the more frequently used characters.

Slide 4

The Basic Algorithm

Scan text to be compressed and tally occurrence of all characters.

Sort or prioritize characters based on number of occurrences in text.

Build Huffman code tree based on prioritized list.

Perform a traversal of tree to determine all code words.

Scan text again and create new file using the Huffman codes.

Slide 5

Huffman Compression

Background:

Huffman works with arbitrary bytes, but the ideas are most easily explained using character data

Consider extended ASCII character set:

8 bits per character

a BLOCK code, since all codewords are the same length

8 bits yield 256 characters

In general, block codes give:

For K bits, 2^K characters

For N characters, ⌈log2 N⌉ bits are required

Easy to encode and decode
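A quick check of this block-code arithmetic in Python (a sketch, not part of the slides):

    import math

    def block_code_bits(n_symbols: int) -> int:
        # Bits per symbol for a fixed-length (block) code over n_symbols
        return math.ceil(math.log2(n_symbols))

    print(block_code_bits(256))  # extended ASCII: 8 bits per character
    print(block_code_bits(12))   # 12 distinct symbols: 4 bits per character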

Slide 6

Huffman Compression

What if we could use variable-length codewords? Could we do better than ASCII?

The idea is that different characters would use different numbers of bits

If all characters occur with the same frequency, we cannot improve over ASCII

What if characters had different frequencies of occurrence?

Ex: In English text, letters like E, A, I, S appear much more frequently than letters like Q, Z, X

Can we somehow take advantage of these differences in our encoding?

Slide 7

Huffman Compression

First we need to make sure that variable length coding is feasible

Decoding a block code is easy – take the next 8 bits

Decoding a variable length code is not so obvious

In order to decode unambiguously, variable length codes must meet the prefix property:

No codeword is a prefix of any other

Example of the ambiguity when the prefix property is not met: with codes a = 0, b = 1, c = 01, the bit string 01 could decode as either "ab" or "c"

Ok, so now how do we compress?

Let's use fewer bits for our more common characters, and more bits for our less common characters

Slide 8

Huffman Compression

[diagram-only slide; no text was extracted]

Slide 9

Huffman Compression

[diagram-only slide; no text was extracted]

Slide 10

Huffman Compression

Huffman Algorithm:

Assume we have K characters and that each uncompressed character has some weight associated with it (i.e., its frequency)

Initialize a forest, F, to have K single-node trees in it, one tree per character, each storing the character's weight

while (|F| > 1)
    Find the two trees, T1 and T2, in F with the smallest weights
    Create a new tree, T, whose weight is the sum of the weights of T1 and T2
    Remove T1 and T2 from F, and make them the left and right children of T
    Add T to F
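A minimal sketch of this loop in Python, using heapq as the forest F so the two smallest-weight trees are always popped first (the HuffmanNode class and function names are illustrative, not from the slides):

    import heapq
    from collections import Counter
    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass(order=True)
    class HuffmanNode:
        weight: int  # only the weight participates in heap comparisons
        char: Optional[str] = field(compare=False, default=None)  # None for internal nodes
        left: Optional["HuffmanNode"] = field(compare=False, default=None)
        right: Optional["HuffmanNode"] = field(compare=False, default=None)

    def build_huffman_tree(text: str) -> HuffmanNode:
        # Forest F: one single-node tree per distinct character, weighted by frequency
        forest = [HuffmanNode(w, c) for c, w in Counter(text).items()]
        heapq.heapify(forest)
        while len(forest) > 1:          # while |F| > 1
            t1 = heapq.heappop(forest)  # smallest weight
            t2 = heapq.heappop(forest)  # second-smallest weight
            t = HuffmanNode(t1.weight + t2.weight, left=t1, right=t2)
            heapq.heappush(forest, t)   # add T back to F
        return forest[0]                # the finished Huffman tree

Slides 19 through 45 trace this process by hand for the text "Eerie eyes seen near lake."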

Slide 11

Huffman Compression

Huffman Issues:

Is the code correct?

Does it satisfy the prefix property?

Does it give good compression?

How to decode?

How to encode?

How to determine weights/frequencies?

Slide 12

Huffman Compression

Is the code correct?

Based on the way the tree is formed, it is clear that the codewords are valid

Prefix Property is assured, since each codeword ends at a leaf

all original nodes corresponding to the characters end up as leaves

Does it give good compression?

For a block code of N different characters, ⌈log2 N⌉ bits are needed per character

Thus, for a file containing M ASCII characters, 8M bits are needed

Slide 13

Huffman Compression

Given Huffman codes {C0, C1, ..., CN-1} for the N characters in the alphabet, each of length |Ci|

Given frequencies {F0, F1, ..., FN-1} in the file, where the sum of all frequencies = M

The total bits required for the file is:

Total bits = sum from i = 0 to N-1 of (|Ci| * Fi)

Overall total bits depends on differences in frequencies

The more extreme the differences, the better the compression

If frequencies are all the same, no compression

See example at the end
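Transcribed directly into Python (a sketch, not from the slides), using the frequencies and codeword lengths of the example developed later in the deck:

    def total_bits(code_lengths: dict, freqs: dict) -> int:
        # Total file size in bits: sum over i of |Ci| * Fi
        return sum(code_lengths[c] * freqs[c] for c in freqs)

    # "Eerie eyes seen near lake." (Slides 19-47)
    freqs   = {'E': 1, 'i': 1, 'y': 1, 'l': 1, 'k': 1, '.': 1,
               'r': 2, 's': 2, 'n': 2, 'a': 2, ' ': 4, 'e': 8}
    lengths = {'E': 4, 'i': 4, 'y': 4, 'l': 4, 'k': 4, '.': 4,
               'r': 4, 's': 4, 'n': 4, 'a': 4, ' ': 3, 'e': 2}
    print(total_bits(lengths, freqs))  # 84 bits, vs. 8 * 26 = 208 bits in ASCII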

Slide 14

Huffman Compression

How to decode?

This is fairly straightforward, given that we have the Huffman tree available

start at the root of the tree and the first bit of the file

while not at end of file
    if the current bit is a 0, go left in the tree
    else go right in the tree          // the bit is a 1
    if we are at a leaf
        output the character
        go back to the root
    read the next bit of the file

Each character is a path from the root to a leaf

If we are not at the root when end of file is reached, there was an error in the file
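Continuing the earlier sketch (this assumes the illustrative HuffmanNode from Slide 10; bits is a string of '0'/'1' characters for readability):

    def decode(root: "HuffmanNode", bits: str) -> str:
        # Walk the tree: 0 = go left, 1 = go right; emit a character at each leaf
        out = []
        node = root
        for bit in bits:
            node = node.left if bit == '0' else node.right
            if node.char is not None:  # reached a leaf
                out.append(node.char)
                node = root            # go back to the root
        if node is not root:           # ran out of bits mid-codeword
            raise ValueError("truncated or corrupt bit stream")
        return ''.join(out)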

Slide 15

Huffman Compression

How to encode?

This is trickier, since we are starting with characters and outputting codewords

Using the tree we would have to start at a leaf (first finding the correct leaf), then move up to the root, finally reversing the resulting bit pattern

Instead, let's process the tree once (using a traversal) to build an encoding TABLE

Demonstrate inorder traversal on board
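The slides demonstrate the traversal on the board; below is a depth-first sketch that records the root-to-leaf path for each character (function names are illustrative):

    def build_code_table(root: "HuffmanNode") -> dict:
        # Map each character to its codeword: left edge = 0, right edge = 1
        table = {}
        def walk(node, path: str) -> None:
            if node.char is not None:           # leaf: the path so far is the codeword
                table[node.char] = path or '0'  # 'or' handles a one-character file (root is a leaf)
                return
            walk(node.left, path + '0')
            walk(node.right, path + '1')
        walk(root, '')
        return table

    def encode(text: str, table: dict) -> str:
        return ''.join(table[c] for c in text)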

Slide 16

Huffman Compression

How to determine weights/frequencies?

2-pass algorithm

Process the original file once to count the frequencies, then build the tree/code and process the file again, this time compressing

Ensures that each Huffman tree will be optimal for each file

However, to decode, the tree/frequency information must be stored in the file

Likely at the front of the file, so the decompressor first reads the tree info, then uses that to decompress the rest of the file

Adds extra space to the file, reducing overall compression quality
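Putting the two passes together (a sketch reusing the earlier build_huffman_tree, build_code_table, and encode; storing the frequency table via repr is an illustrative header format, not one prescribed by the slides):

    from collections import Counter

    def compress(text: str) -> str:
        freqs = Counter(text)                       # pass 1: count frequencies
        root = build_huffman_tree(text)             # build the tree/code
        table = build_code_table(root)
        header = repr(dict(freqs))                  # tree/freq info kept at the front of the file
        return header + "\n" + encode(text, table)  # pass 2: compress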

Slide 17

Huffman Compression

Overhead especially reduces quality for smaller files, since the tree/frequency info may add a significant percentage to the file size

Thus larger files have a higher potential for compression with Huffman than do smaller ones

However, just because a file is large does NOT mean it will compress well

The most important factor in the compression remains the relative frequencies of the characters

Using a static Huffman tree:

Process a lot of "sample" files, and build a single tree that will be used for all files

Saves the overhead of tree information, but generally is NOT a very good approach

Slide 18

Huffman Compression

There are many different file types that have very different frequency characteristics

Ex: a .cpp file vs. a .txt file containing an English essay

A .cpp file will have many ;, {, }, (, )

A .txt file will have many a, e, i, o, u, ., etc.

A tree that works well for one file may work poorly for another (perhaps even expanding it)

Adaptive single-pass algorithm:

Builds the tree as it is encoding the file, thereby not requiring tree information to be separately stored

Processes the file only one time

We will not look at the details of this algorithm, but the LZW algorithm and the self-organizing search algorithm are also adaptive

Slide 19

Building a Tree

Consider the following short text:

Eerie eyes seen near lake.

Count up the occurrences of all characters in the text "Eerie eyes seen near lake."

What characters are present?

Create binary tree nodes with the character and frequency of each character

Place the nodes in a priority queue

The lower the occurrence, the higher the priority in the queue

E e r i space y s n a l k .
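Tallying the occurrences in Python (a sketch; Counter is the standard-library frequency counter):

    from collections import Counter

    freqs = Counter("Eerie eyes seen near lake.")
    print(sorted(freqs.items(), key=lambda kv: kv[1]))
    # [('E', 1), ('i', 1), ('y', 1), ('l', 1), ('k', 1), ('.', 1),
    #  ('r', 2), ('s', 2), ('n', 2), ('a', 2), (' ', 4), ('e', 8)]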

Slide 20

Building a Tree

The queue after inserting all nodes:

Null Pointers are not shown

E:1   i:1   y:1   l:1   k:1   .:1   r:2   s:2   n:2   a:2   sp:4   e:8

Slide 21

Building a Tree

While priority queue contains two or more nodes

Create new node

Dequeue node and make it left subtree

Dequeue next node and make it right subtree

Frequency of new node equals sum of frequency of left and right children

Enqueue new node back into queue

Slide 22

Building a Tree

[Diagram: the priority queue as above, before any nodes are combined]

Slide 23

Building a Tree

[Diagram: E:1 and i:1 are dequeued and joined under a new node of weight 2]

Slide 24

Building a Tree

[Diagram: the new node of weight 2 is enqueued back into the priority queue]

Slide 25

Building a Tree

[Diagram: y:1 and l:1 are dequeued and joined under a new node of weight 2]

Slide 26

Building a Tree

[Diagram: the new node of weight 2 is enqueued]

Slide 27

Building a Tree

[Diagram: k:1 and .:1 are dequeued and joined under a new node of weight 2]

Slide 28

Building a Tree

[Diagram: the new node of weight 2 is enqueued]

Slide 29

Building a Tree

[Diagram: r:2 and s:2 are dequeued and joined under a new node of weight 4]

Slide 30

Building a Tree

[Diagram: the new node of weight 4 is enqueued]

Slide 31

Building a Tree

[Diagram: n:2 and a:2 are dequeued and joined under a new node of weight 4]

Slide 32

Building a Tree

[Diagram: the new node of weight 4 is enqueued]

Slide 33

Building a Tree

[Diagram: the E/i and y/l subtrees, each of weight 2, are dequeued and joined under a new node of weight 4]

Slide 34

Building a Tree

[Diagram: the new node of weight 4 is enqueued]

Slide 35

Building a Tree

[Diagram: the k/. subtree of weight 2 and sp:4 are dequeued and joined under a new node of weight 6]

Slide 36

Building a Tree

[Diagram: the new node of weight 6 is enqueued]

What is happening to the characters with a low number of occurrences?

Slide 37

Building a Tree

[Diagram: the r/s and n/a subtrees, each of weight 4, are dequeued and joined under a new node of weight 8]

Slide 38

Building a Tree

[Diagram: the new node of weight 8 is enqueued]

Slide 39

Building a Tree

[Diagram: the subtrees of weight 4 (E/i/y/l) and weight 6 (k/., sp) are dequeued and joined under a new node of weight 10]

Slide 40

Building a Tree

[Diagram: the new node of weight 10 is enqueued]

Slide 41

Building a Tree

[Diagram: e:8 and the subtree of weight 8 are dequeued and joined under a new node of weight 16]

Slide 42

Building a Tree

[Diagram: the new node of weight 16 is enqueued]

Slide 43

Building a Tree

[Diagram: the subtrees of weight 10 and weight 16 are dequeued and joined under the root node of weight 26]

Slide 44

Building a Tree

[Diagram: the node of weight 26 is enqueued]

After enqueueing this node, there is only one node left in the priority queue.

Slide 45

Building a Tree

Dequeue the single node left in the queue.

This tree contains the new code words for each character.

Frequency of root node should equal number of characters in text.

[Diagram: the completed Huffman tree, with root weight 26]

Eerie eyes seen near lake.

26 characters
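As a quick check, running the Slide 10 sketch on this text reproduces the root weight:

    root = build_huffman_tree("Eerie eyes seen near lake.")
    print(root.weight)  # 26: the root frequency equals the number of characters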

Slide 46

Encoding the File - Traverse Tree for Codes

Perform a traversal of the tree to obtain new code words

Going left is a 0; going right is a 1

A code word is only completed when a leaf node is reached

[Diagram: the completed Huffman tree, with 0/1 edge labels]

Slide 47

Encoding the File - Traverse Tree for Codes

Char    Code
E       0000
i       0001
y       0010
l       0011
k       0100
.       0101
space   011
e       10
r       1100
s       1101
n       1110
a       1111

[Diagram: the completed Huffman tree alongside the table]

Slide 48

Encoding the File

Rescan text and encode file using new code words

“Eerie eyes seen near lake.”

Char    Code
E       0000
i       0001
y       0010
l       0011
k       0100
.       0101
space   011
e       10
r       1100
s       1101
n       1110
a       1111

000010110000011001110001010110101111011010111001111101011111100011001111110100100101

Why is there no need for a separator character?
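As a check (a sketch, not from the slides), encoding the text with this table reproduces the bit string above, 84 bits in all:

    table = {'E': '0000', 'i': '0001', 'y': '0010', 'l': '0011',
             'k': '0100', '.': '0101', ' ': '011',  'e': '10',
             'r': '1100', 's': '1101', 'n': '1110', 'a': '1111'}

    bits = ''.join(table[c] for c in "Eerie eyes seen near lake.")
    print(len(bits))  # 84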

Slide 49

Encoding the File - Results

Have we made things any better?

84 bits to encode the text

ASCII would take 8 * 26 = 208 bits

000010110000011001110001010110101111011010111001111101011111100011001111110100100101

If a modified block code with 4 bits per character were used instead (the 12 distinct characters fit in 4 bits), the total would be 4 * 26 = 104 bits.

The savings are not as great as with Huffman.

Slide 50

Decoding the File

How does the receiver know what the codes are?

Tree constructed for each text file

Considers the frequencies for each file

Big hit on compression, especially for smaller files

Tree predetermined, based on statistical analysis of text files or file types

Data transmission is bit-based versus byte-based

Slide 51

Decoding the File

Once the receiver has the tree, it scans the incoming bit stream:

0 → go left

1 → go right

[Diagram: the completed Huffman tree]

10100011011110111101111110000110101
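An equivalent table-driven decoder (a sketch; the greedy match is safe precisely because no codeword is a prefix of another):

    code_to_char = {'0000': 'E', '0001': 'i', '0010': 'y', '0011': 'l',
                    '0100': 'k', '0101': '.', '011': ' ', '10': 'e',
                    '1100': 'r', '1101': 's', '1110': 'n', '1111': 'a'}

    def decode_bits(bits: str) -> str:
        out, word = [], ''
        for b in bits:
            word += b
            if word in code_to_char:  # first complete codeword match is the only one
                out.append(code_to_char[word])
                word = ''
        return ''.join(out)

    print(decode_bits("10100011011110111101111110000110101"))  # eel snarl.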

Slide 52

Summary

Huffman coding is a technique used to compress files for transmission

Uses statistical coding

more frequently used symbols have shorter code words

Works well for text and fax transmissions

An application that uses several data structures

Slide 53

Huffman Shortcomings

What is Huffman missing?

Although OPTIMAL for single-character (symbol-by-symbol) compression, Huffman does not take into account patterns / repeated sequences in a file

Ex: A file with 1000 As followed by 1000 Bs, etc., for every ASCII character will not compress AT ALL with Huffman

Yet it seems like this file should be compressible

We can use run-length encoding in this case, as sketched below

However, run-length encoding is very specific, and is not generally effective for most files (since they do not typically have long runs of each character)
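For contrast, a minimal run-length encoding sketch (not part of the slides) that handles the pathological case above:

    from itertools import groupby

    def run_length_encode(text: str) -> list:
        # Collapse each run of a repeated character into a (char, run_length) pair
        return [(ch, len(list(run))) for ch, run in groupby(text)]

    print(run_length_encode("A" * 1000 + "B" * 1000))  # [('A', 1000), ('B', 1000)]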