Data Structures and Algorithms: Huffman compression. An Application of Binary Trees and Priority Queues. CS 102.
Data Structures and Algorithms
Huffman compression
www.mif.vu.lt/~algis

Encoding and Compression
Fax machines
ASCII
Variations on ASCII
  min number of bits needed
  cost vs. savings
  patterns
  modifications
Proposed by Dr. David A. Huffman in 1952:
"A Method for the Construction of Minimum-Redundancy Codes"
Applicable to many forms of data transmission
Our example: text files

The Basic Algorithm
Huffman coding is a form of statistical coding
Not all characters occur with the same frequency!
In a standard encoding, all characters are allocated the same amount of space
  1 char = 1 byte, be it "e" or "x"
Any savings in tailoring codes to the frequency of each character?
Code word lengths are no longer fixed, as they are in ASCII.
Code word lengths vary and will be shorter for the more frequently used characters.

The Basic Algorithm
1. Scan the text to be compressed and tally the occurrences of each character.
2. Sort or prioritize the characters based on their number of occurrences in the text.
3. Build the Huffman code tree based on the prioritized list.
4. Perform a traversal of the tree to determine all code words.
5. Scan the text again and create a new file using the Huffman codes.

Huffman Compression
Background:
Huffman works with arbitrary bytes, but the ideas are most easily explained using character data
Consider the extended ASCII character set:
  8 bits per character
  A BLOCK code, since all codewords are the same length
  8 bits yield 256 characters
In general, block codes give:
  For K bits, 2^K characters
  For N characters, log2(N) bits are required
Easy to encode and decode

Huffman Compression
What if we could use variable-length codewords? Could we do better than ASCII?
The idea is that different characters would use different numbers of bits
If all characters occur with the same frequency, we cannot improve over ASCII
What if characters had different frequencies of occurrence?
  Ex: In English text, letters like E, A, I, S appear much more frequently than letters like Q, Z, X
Can we somehow take advantage of these differences in our encoding?

Huffman Compression
First we need to make sure that variable-length coding is feasible
Decoding a block code is easy: take the next 8 bits
Decoding a variable-length code is not so obvious
In order to decode unambiguously, variable-length codes must satisfy the prefix property:
  No codeword is a prefix of any other
Ex: with codes a = 0, b = 01, c = 10, the prefix property fails (0 is a prefix of 01), and the string 010 decodes as either "a c" or "b a"
Ok, so now how do we compress?
Use fewer bits for our more common characters, and more bits for our less common characters

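The prefix property can be checked mechanically. A minimal sketch (the helper name is ours, not the lecture's); after sorting, any prefix relationship must appear between lexicographic neighbors:

```python
def is_prefix_free(codes):
    """Return True if no codeword is a prefix of any other."""
    words = sorted(codes.values())
    # If x is a prefix of y, x also prefixes everything sorted between
    # them, so checking adjacent pairs is sufficient.
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

print(is_prefix_free({"a": "0", "b": "10", "c": "11"}))  # True: decodable
print(is_prefix_free({"a": "0", "b": "01", "c": "10"}))  # False: "0" prefixes "01"
```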
Huffman Compression
(Figure slides omitted.)

Huffman Compression
Huffman Algorithm:
Assume we have K characters and that each uncompressed character has some weight associated with it (i.e., its frequency)
Initialize a forest, F, to have K single-node trees in it, one tree per character, each storing the character's weight
while (|F| > 1)
  Find the two trees, T1 and T2, in F with the smallest weights
  Create a new tree, T, whose weight is the sum of the weights of T1 and T2
  Remove T1 and T2 from F, and add them as the left and right children of T
  Add T to F

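The loop above can be sketched with Python's heapq module as the priority queue (function and variable names are ours; a counter breaks weight ties so the heap never compares trees). Trees are nested pairs, and leaves are characters:

```python
import heapq
from itertools import count

def build_huffman(freqs):
    """Huffman's algorithm: repeatedly merge the two lightest trees."""
    tie = count()  # tie-breaker so equal weights never compare the trees themselves
    forest = [(weight, next(tie), char) for char, weight in freqs.items()]
    heapq.heapify(forest)
    while len(forest) > 1:
        w1, _, t1 = heapq.heappop(forest)  # T1: smallest weight
        w2, _, t2 = heapq.heappop(forest)  # T2: next smallest
        # New tree T, with T1 and T2 as children and the summed weight
        heapq.heappush(forest, (w1 + w2, next(tie), (t1, t2)))
    root_weight, _, tree = forest[0]
    return root_weight, tree

# The lecture's example frequencies; the root weight is the character count
weight, tree = build_huffman({'E': 1, 'i': 1, 'y': 1, 'l': 1, 'k': 1, '.': 1,
                              'r': 2, 's': 2, 'n': 2, 'a': 2, ' ': 4, 'e': 8})
print(weight)  # 26
```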
Huffman Compression
Huffman Issues:
Is the code correct?
Does it satisfy the prefix property?
Does it give good compression?
How to decode?
How to encode?
How to determine weights/frequencies?

Huffman Compression
Is the code correct?
  Based on the way the tree is formed, it is clear that the codewords are valid
  The prefix property is assured, since each codeword ends at a leaf
    (all original nodes corresponding to the characters end up as leaves)
Does it give good compression?
  For a block code of N different characters, log2(N) bits are needed per character
  Thus, for a file containing M ASCII characters, 8M bits are needed

Huffman Compression
Given Huffman codes {C0, C1, ..., C(N-1)} for the N characters in the alphabet, each of length |Ci|
Given frequencies {F0, F1, ..., F(N-1)} in the file
  (where the sum of all frequencies = M)
The total bits required for the file is:
  sum over i = 0 to N-1 of |Ci| * Fi
The overall total depends on the differences in frequencies
  The more extreme the differences, the better the compression
  If frequencies are all the same, no compression
See example at the end

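The sum can be evaluated directly. A small illustration with made-up codes and frequencies (not the lecture's example):

```python
# Hypothetical prefix-free codes C_i and frequencies F_i, 3-character alphabet
codes = {'a': '0', 'b': '10', 'c': '11'}
freqs = {'a': 5, 'b': 2, 'c': 1}  # M = 8 characters in the file
total_bits = sum(len(codes[ch]) * freqs[ch] for ch in codes)
print(total_bits)  # 11, i.e. 5*1 + 2*2 + 1*2, versus 8*8 = 64 bits in ASCII
```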
Huffman Compression
How to decode?
This is fairly straightforward, given that we have the Huffman tree available
  start at the root of the tree and the first bit of the file
  while not at end of file
    if the current bit is a 0, go left in the tree; else (a 1) go right
    if we are at a leaf, output its character and return to the root
    read the next bit of the file
Each character is a path from the root to a leaf
If we are not at the root when end of file is reached, there was an error in the file

Huffman Compression
How to encode?
This is trickier, since we are starting with characters and outputting codewords
  Using the tree, we would have to start at a leaf (first finding the correct leaf), then move up to the root, finally reversing the resulting bit pattern
  Instead, let's process the tree once (using a traversal) to build an encoding TABLE
Demonstrate inorder traversal on board

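Building the table with one traversal might look like this (representation and names are ours; the sketch uses a recursive preorder walk, though any traversal that reaches every leaf works):

```python
def code_table(tree, prefix=""):
    """Walk the tree once; going left appends '0', going right appends '1'.
    A codeword is recorded only when a leaf is reached."""
    if not isinstance(tree, tuple):     # leaf: a single character
        return {tree: prefix or "0"}    # "0" covers a one-character alphabet
    left, right = tree
    table = code_table(left, prefix + "0")
    table.update(code_table(right, prefix + "1"))
    return table

# Tiny example tree: 'a' on the left, ('b', 'c') on the right
print(code_table(('a', ('b', 'c'))))  # {'a': '0', 'b': '10', 'c': '11'}
```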
Huffman Compression
How to determine weights/frequencies?
2-pass algorithm
  Process the original file once to count the frequencies, then build the tree/code and process the file again, this time compressing
  Ensures that each Huffman tree will be optimal for each file
  However, to decode, the tree/frequency information must be stored in the file
    Likely at the front of the file, so the decompressor first reads the tree info, then uses that to decompress the rest of the file
    Adds extra space to the file, reducing overall compression quality

Huffman Compression
Overhead especially reduces quality for smaller files, since the tree/frequency info may add a significant percentage to the file size
  Thus larger files have a higher potential for compression with Huffman than smaller ones do
  However, just because a file is large does NOT mean it will compress well
  The most important factor in compression remains the relative frequencies of the characters
Using a static Huffman tree
  Process a lot of "sample" files, and build a single tree that will be used for all files
  Saves the overhead of tree information, but generally is NOT a very good approach

Huffman Compression
There are many different file types that have very different frequency characteristics
  Ex: a .cpp file vs. a .txt file containing an English essay
    The .cpp file will have many ;, {, }, (, ) characters
    The .txt file will have many a, e, i, o, u, ., etc.
  A tree that works well for one file may work poorly for another (perhaps even expanding it)
Adaptive single-pass algorithm
  Builds the tree as it encodes the file, so the tree information does not need to be stored separately
  Processes the file only one time
We will not look at the details of this algorithm, but the LZW algorithm and the self-organizing search algorithm are also adaptive

Building a Tree
Consider the following short text:
  Eerie eyes seen near lake.
Count up the occurrences of all characters in the text "Eerie eyes seen near lake."
What characters are present?
  E e r i space y s n a l k .
Create a binary tree node with the character and frequency of each character
Place the nodes in a priority queue
  The lower the occurrence count, the higher the priority in the queue

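Tallying the example sentence can be done with Python's collections.Counter:

```python
from collections import Counter

text = "Eerie eyes seen near lake."
freqs = Counter(text)

# The most common character is 'e'; spaces come next
print(freqs['e'], freqs[' '], sum(freqs.values()))  # 8 4 26
```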
Building a Tree
The queue after inserting all nodes (null pointers are not shown):
  E:1  i:1  y:1  l:1  k:1  .:1  r:2  s:2  n:2  a:2  sp:4  e:8

Building a Tree
While the priority queue contains two or more nodes:
  Create a new node
  Dequeue the node with the smallest frequency and make it the left subtree
  Dequeue the next node and make it the right subtree
  The frequency of the new node equals the sum of the frequencies of its left and right children
  Enqueue the new node back into the queue

Building a Tree
(The following slides in the original deck step through this loop with tree diagrams; the sequence of merges is summarized here, with the new parent's weight after the arrow.)
  E + i -> 2, then y + l -> 2, then k + . -> 2
  r + s -> 4, then n + a -> 4, then (E,i) + (y,l) -> 4
  (k,.) + sp -> 6, then (r,s) + (n,a) -> 8
  4 + 6 -> 10, then 8 + 8 -> 16, then 10 + 16 -> 26
What is happening to the characters with a low number of occurrences?
  They sink to the bottom of the tree, so they end up with the longest code words.
After enqueueing the final node there is only one node left in the priority queue.

Building a Tree
Dequeue the single node left in the queue.
This tree contains the new code words for each character.
The frequency of the root node should equal the number of characters in the text.
(Final tree diagram; root weight 26.)
"Eerie eyes seen near lake." = 26 characters

Encoding the File - Traverse Tree for Codes
Perform a traversal of the tree to obtain the new code words
Going left is a 0; going right is a 1
A code word is completed only when a leaf node is reached
(Tree diagram omitted.)

Encoding the File - Traverse Tree for Codes
Char    Code
E       0000
i       0001
y       0010
l       0011
k       0100
.       0101
space   011
e       10
r       1100
s       1101
n       1110
a       1111
(Tree diagram omitted.)

Encoding the File
Rescan the text and encode the file using the new code words:
"Eerie eyes seen near lake."
Char    Code
E       0000
i       0001
y       0010
l       0011
k       0100
.       0101
space   011
e       10
r       1100
s       1101
n       1110
a       1111
000010110000011001110001010110101111011010111001111101011111100011001111110100100101
Why is there no need for a separator character?

Encoding the File - Results
Have we made things any better?
  84 bits to encode the text
  ASCII would take 8 * 26 = 208 bits
000010110000011001110001010110101111011010111001111101011111100011001111110100100101
If a modified block code with 4 bits per character were used instead, the total would be 4 * 26 = 104 bits.
The savings over such a code are not as great.

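The three totals can be checked against the code table above (the table-lookup encoder is ours; the tree itself is not needed to encode once the table exists):

```python
codes = {'E': '0000', 'i': '0001', 'y': '0010', 'l': '0011',
         'k': '0100', '.': '0101', ' ': '011', 'e': '10',
         'r': '1100', 's': '1101', 'n': '1110', 'a': '1111'}
text = "Eerie eyes seen near lake."

# Concatenate the codeword of every character; no separators are needed
encoded = "".join(codes[ch] for ch in text)
print(len(encoded))   # 84  (Huffman)
print(8 * len(text))  # 208 (8-bit ASCII)
print(4 * len(text))  # 104 (4-bit block code)
```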
Decoding the File
How does the receiver know what the codes are?
Tree constructed for each text file
  Considers the frequencies of each file
  Big hit on compression, especially for smaller files
Tree predetermined
  Based on statistical analysis of text files or file types
Data transmission is bit-based versus byte-based

Decoding the File
Once the receiver has the tree, it scans the incoming bit stream:
  0: go left
  1: go right
(Tree diagram omitted.)
10100011011110111101111110000110101

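Since the code is prefix-free, the receiver can also decode straight from the table, accumulating bits until they match a codeword (a sketch; the helper name is ours):

```python
codes = {'E': '0000', 'i': '0001', 'y': '0010', 'l': '0011',
         'k': '0100', '.': '0101', ' ': '011', 'e': '10',
         'r': '1100', 's': '1101', 'n': '1110', 'a': '1111'}
decode_map = {bits: ch for ch, bits in codes.items()}

def decode(stream):
    """Accumulate bits until they match a codeword, emit it, and restart."""
    out, word = [], ""
    for bit in stream:
        word += bit
        if word in decode_map:  # prefix property: the first match is the match
            out.append(decode_map[word])
            word = ""
    if word:
        raise ValueError("stream ended mid-codeword")
    return "".join(out)

print(decode("10100011011110111101111110000110101"))  # 'eel snarl.'
```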
Summary
Huffman coding is a technique used to compress files for transmission
Uses statistical coding
more frequently used symbols have shorter code words
Works well for text and fax transmissions
An application that uses several data structures

Huffman Shortcomings
What is Huffman missing?
Although OPTIMAL for single-character (word) compression, Huffman does not take into account patterns / repeated sequences in a file
  Ex: A file with 1000 As followed by 1000 Bs, etc. for every ASCII character will not compress AT ALL with Huffman
  Yet it seems like this file should be compressible
We can use run-length encoding in this case
  However, run-length encoding is very specific and not generally effective for most files (since they do not typically have long runs of each character)
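A minimal run-length encoder (a sketch; the output format is ours) shows why the all-runs file above collapses under RLE even though Huffman cannot touch it:

```python
from itertools import groupby

def rle(text):
    """Collapse each run of a repeated character into (char, run_length)."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]

# 2000 characters become just two (char, count) pairs
data = "A" * 1000 + "B" * 1000
print(rle(data))  # [('A', 1000), ('B', 1000)]
```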