Huffmann Codes

05.06.2018

Greedy algorithms are

A kind of Divide and Conquer algorithms.
Typically solve optimizing problems more effective than DP solutions.

In order to apply a greedy solution the problem must comply to the following two properties

Optimal substructure: an optimal solution to the problem contains optimal solutions to subproblems.
Greedy-choice property: the global optimal solution can be assembled by making local greedy choices.

One problem that can't be solved greedy is as follows (the 0/1 knapsack problem): We are out stealing with a knapsack that can hold 50 pounds. We steal with a greedy attitude from the list with the following items (10,60),(20,100),(30,120) where the first item of the tuple is the weight and the second is the prize. Now being greedy we take (10,60),(20,100) which amount to 30 pounds and 160 worth of ill-gotten gains. But the optimal items to put in the knapsack is of course the second and the third.

Huffmann codes

Example. Huffmann codes are a way to way to compress data. Say we have a string of characters. This string can be compressed with a greedy approach on the frequency of each character. To make this work we see the characters as a binary string.

Now the binary string representation can have fixed length. Or variable length where no two strings have same prefixes. The latter results in a complete binary tree (optimal code). The lower frequency, the lower placement in the tree. For the frequencies a:20,b:13,c:9,d:5 we have

Where $z_3 = 47, z_2 = 27, z_1 = 14$.

For a tree, $T$, we have that the total cost in bits is given as $$ B(T) = \sum_{c \in C} c.freq \cdot d_{T}(c) $$

The code is

Huffman(C) n = |C| Q = C for i = 1 to n - 1 allocate new node z z.left = x = Extact-Min(Q) z.right = y = Extract-Min(Q) z.freq = x.freq + y.freq Insert(Q,z) return Extract-Min(Q) #root of tree

Correctness of Huffmann codes

We first show the greedy-choice property. We need the following lemma

Lemma 16.2

Let $C$ be alphabet with $c$ as characters and $c.freq$ as frequencies of those. Let $x$ and $y$ be the two characters with lowest frequencies. Then there exist an optimal prefix code where $x$ and $y$ are siblings.

Proof

Let $T$ be a tree of an optimal prefix code. Let $a$ and $b$ be siblings in $T$ with maximum depth. And, as stated in the Lemma, let $x$ and $y$ be the two characters with lowest frequencies. Let $a.freq \leq b.freq$ and $x.freq \leq y.freq$ from which follows that $x.freq \leq a.freq$ and $y.freq \leq b.freq$. Note that if $x.freq = b.freq$, then $a.freq = b.freq = x.freq = y.freq$, and the lemma is trivially true.

Now we exchange the position of $a$ and $x$ in $T$ to obtain $T'$. And we exchange the positions of $b$ and $y$ in $T'$ to obtain $T''$ in which $x$ and $y$ now have maximum depth. The difference in cost between $T$ and $T'$ is $$ B(T) - B(T') = \\ \sum_{c \in C} c.freq \cdot d_{T} (c) - \sum_{c \in C} c.freq \cdot d_{T'} (c) = \\ x.freq \cdot d_{T}(x) + a.freq \cdot d_{T}(a) - x.freq \cdot d_{T'}(x) - a.freq \cdot d_{T'}(a) = \\ x.freq \cdot d_{T}(x) + a.freq \cdot d_{T}(a) - x.freq \cdot d_{T}(a) - a.freq \cdot d_{T}(x) = \\ (a.freq - x.freq)(d_{T}(a) - d_{T}(x)) \geq 0 $$ *remember that $a.freq \geq x.freq$ and $d_{T}(a) \geq d_{T}(x)$. In the same way we get that $B(T') - B(T'') \geq 0$ - exchanging $y$ and $b$ does not increase the cost. Now we have that $B(T'') \leq B(T)$. But since $T$ is optimal, we also have that $B(T) \leq B(T'')$, and thus $B(T) = B(T'')$. In conclusion: $T''$ is an optimal tree in which $x$ and $y$ appears as sibling leaves.

By lemma 16.2 we have that an optimal tree can be build starting with merging the two characters with lowest frequencies.

Next let's show optimal substructure.

Lemma 16.3

Let $C$ be alphabet, let $x$ and $y$ have minimum frequencies. Let $C'$ be the alphabet $C$ with $x$ and $y$ removed and $z$ added, that is $C' = C - \{x,y\} \cup {z}$. In $C'$ $z.freq = x.freq + y.freq$. Let $T'$ be a tree representing an optimal prefix code for $C'$. Then $T$ obtained by replacing $z$ with $x$ and $y$ in $T'$ is an optimal prefix code tree for $C$.

Proof

For each $c \in C - \{x,y\}$ we have that $d_{T}(c) = d_{T'}(c)$ and hence $c.freq \cdot d_{T}(c) = c.freq \cdot d_{T'}(c)$. We have that $d_{T}(x) = d_{T}(y) = d_{T'}(z) + 1$. And we have that $$ x.freq \cdot d_{T}(x) + y.freq \cdot d_{T}(y) = \\ (x.freq + y.freq)(d_{T'}(z) + 1) = \\ z.freq \cdot d_{T'}(z) + x.freq + y.freq $$ from which follows that $$B(T) = B(T') + x.freq + y.freq$$ or equivalently $$B(T') = B(T) - x.freq - y.freq$$ We finally proof the lemma by contradiction. Let $T$ not represent an optimal prefix code for $C$. Let $T''$ represent an optimal prefix code for $C$, and we have $B(T'') < B(T)$. Let $T'''$ be $T''$ with the common parrent of $x$ and $y$ replaced by $z$ with $z.freq = x.freq + y.freq$. Now $$ B(T''') = B(T'') - x.freq - y.freq < \\ B(T) - x.freq - y.freq = \\ B(T') $$ in contradiction with the assumption that $T'$ represent an optimal prefix code. Thus $T$ represents an optimal prefix code for $C$.

By lemma 16.2 and lemma 16.3 we have that a solution to a problem in the algorithm contains solutions to subproblems.

Running Time of Huffmann

We have the following factors of concern

Build-Min-Heap that is $O(n)$.
A for-loop that executes $n - 1$ times. Each of these involves a heap operation of $O(lg n)$.
In total we get $O(n \cdot lg n)$.

...