I’m trying to write a program in C for Huffman coding, but I am stuck. The input looks like this:
Sample input:
4 // number of letters
A 00 // each letter, followed by the code it has in the bit string below
B 10
C 01
D 11
001010010101001011010101010110011000 // the message, encoded with a suboptimal code
First I have to decode this string and count how many times each letter appears. I have already done that. Now I have to find out how many bits each letter gets in a Huffman tree, and print the average number of bits per symbol.
The output for this example has to be:
Sample output:
1.722
So how do I find out how many bits each letter gets with Huffman coding?
To solve this you need to build the Huffman tree and compute the number of bits needed to represent each symbol. Then you can compute the total bits needed for the original string under the Huffman encoding and divide by the number of characters.
First, decode the input string using the original character encoding:
00 A
10 B
10 B
01 C
01 C
01 C
00 A
10 B
11 D
01 C
01 C
01 C
01 C
01 C
10 B
01 C
10 B
00 A
Next, count the number of occurrences of each character:
3 00,A
9 01,C
5 10,B
1 11,D
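The decode-and-count step above can be sketched in C roughly as follows. The function name `count_symbols` and its parameters are my own choices, not something from the question, and the sketch assumes the code table is prefix-free (no code is a prefix of another), which holds for the sample input.

```c
#include <assert.h>
#include <string.h>

/* Decode `bits` with the given code table and tally each symbol.
 * codes[i] is the bit string of symbol i; counts[i] receives its frequency.
 * Assumes a prefix-free table. Returns the number of symbols decoded,
 * or -1 if some position in `bits` matches no code. */
int count_symbols(const char *bits, int n, const char *codes[], int counts[])
{
    assert(n > 0);
    for (int i = 0; i < n; i++)
        counts[i] = 0;

    int total = 0;
    size_t pos = 0, len = strlen(bits);
    while (pos < len) {
        int matched = -1;
        for (int i = 0; i < n; i++) {
            size_t clen = strlen(codes[i]);
            /* strncmp stops at the terminating '\0', so a code that is
             * longer than the remaining input simply fails to match. */
            if (clen > 0 && strncmp(bits + pos, codes[i], clen) == 0) {
                matched = i;
                pos += clen;
                break;
            }
        }
        if (matched < 0)
            return -1;
        counts[matched]++;
        total++;
    }
    return total;
}
```

With the table in input order (A = 00, B = 10, C = 01, D = 11) and the sample string, this returns 18 and fills `counts` with 3, 5, 9, 1.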
Now build a min-priority queue keyed on the occurrence count. It looks like this:
[(1, D), (3, A), (5, B), (9, C)]
Keep applying the Huffman process (http://en.wikipedia.org/wiki/Huffman_coding): repeatedly remove the two nodes with the smallest keys, merge them, and put the merged node back. First, combine D and A into a new node ‘DA’ with key 1 + 3 = 4, and put it back in the priority queue:
[(4, DA), (5, B), (9, C)]
Now DA and B combine to give DAB, with key 4 + 5 = 9:
[(9, DAB), (9, C)]
Finally, DAB and C combine to give the root node ‘DABC’:
[(18, DABC)]
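Here is a minimal sketch of that merge loop in C. It only tracks the keys, not the tree. The name `huffman_merge_keys` and the 64-entry limit are assumptions of mine, and a linear scan stands in for a real priority queue to keep the example short.

```c
#include <assert.h>

/* Simulate the Huffman merge loop on at most 64 frequencies.
 * keys[] receives the key of each merged node, in creation order
 * (n - 1 entries). Returns the number of merges performed. */
int huffman_merge_keys(const int freq[], int n, int keys[])
{
    assert(n >= 1 && n <= 64);
    int queue[64];
    int size = n, merges = 0;
    for (int i = 0; i < n; i++)
        queue[i] = freq[i];

    while (size > 1) {
        /* a = index of the smallest key, b = index of the second smallest. */
        int a = 0, b = 1;
        if (queue[b] < queue[a]) { a = 1; b = 0; }
        for (int i = 2; i < size; i++) {
            if (queue[i] < queue[a]) { b = a; a = i; }
            else if (queue[i] < queue[b]) { b = i; }
        }
        int merged = queue[a] + queue[b];
        keys[merges++] = merged;

        /* Put the merged node in slot a, move the last entry into slot b. */
        queue[a] = merged;
        queue[b] = queue[size - 1];
        size--;
    }
    return merges;
}
```

For the counts 3, 5, 9, 1 this records the keys 4, 9, 18, matching the queue states above. Each merge adds one bit to the code of every symbol under the merged node, so the sum of these keys (4 + 9 + 18 = 31) is also the total length of the Huffman-encoded message.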
Now the process stops, and each character gets a new code whose length is its distance from the root node. ‘C’ was combined last, so it gets only one bit. Let’s say I always use ‘0’ for the second element of the two that were picked from the priority queue. The bits inherited from the parent node are shown in parentheses:
C = 0, DAB = 1
B = (1) 0, DA = (1) 1
A = (11) 0, D = (11) 1
So you get the encoding:
C = 0
B = 10
A = 110
D = 111
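If you only need the code lengths (which is all the average requires), you don't have to store the bit patterns. The sketch below records a parent index for every merged node and then counts the steps from each leaf up to the root. `huffman_code_lengths` and `MAX_SYMBOLS` are names I made up for illustration, and the linear scan again replaces a proper priority queue.

```c
#include <assert.h>

#define MAX_SYMBOLS 64

/* Compute the Huffman code length (in bits) of each of n symbols.
 * freq[i] is the occurrence count of symbol i; lengths[i] receives its
 * depth in the tree. Nodes 0..n-1 are leaves, n..2n-2 are merged nodes.
 * Note: with a single symbol (n == 1) the length comes out as 0. */
void huffman_code_lengths(const int freq[], int n, int lengths[])
{
    assert(n >= 1 && n <= MAX_SYMBOLS);
    long weight[2 * MAX_SYMBOLS];
    int parent[2 * MAX_SYMBOLS];
    int alive[2 * MAX_SYMBOLS];   /* 1 while the node is still in the queue */
    int nodes = n;

    for (int i = 0; i < n; i++) {
        weight[i] = freq[i];
        parent[i] = -1;
        alive[i] = 1;
    }

    for (int step = 0; step < n - 1; step++) {
        /* Pick the two live nodes with the smallest weights. */
        int a = -1, b = -1;
        for (int i = 0; i < nodes; i++) {
            if (!alive[i]) continue;
            if (a < 0 || weight[i] < weight[a]) { b = a; a = i; }
            else if (b < 0 || weight[i] < weight[b]) { b = i; }
        }
        /* Merge them into a new node and put that node in the queue. */
        weight[nodes] = weight[a] + weight[b];
        parent[nodes] = -1;
        alive[nodes] = 1;
        parent[a] = parent[b] = nodes;
        alive[a] = alive[b] = 0;
        nodes++;
    }

    /* Code length of a leaf = number of edges between it and the root. */
    for (int i = 0; i < n; i++) {
        int depth = 0;
        for (int p = parent[i]; p >= 0; p = parent[p])
            depth++;
        lengths[i] = depth;
    }
}
```

For the counts A = 3, B = 5, C = 9, D = 1 this gives lengths 3, 2, 1, 3. When two keys tie, a different tie-break can change individual lengths, but the total number of encoded bits stays the same.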
Encoding the original message (occurrences * code length, for C, B, A, D in that order):
Total bits needed = 9 * 1 + 5 * 2 + 3 * 3 + 1 * 3
= 9 + 10 + 9 + 3
= 31
Number of Characters = 18
Average bits = 31 / 18 = 1.722222
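The same arithmetic as a small C function, as a sketch. The name `average_bits` and the parallel-array layout are my assumptions about how you store the counts and code lengths.

```c
#include <assert.h>

/* Weighted average code length: sum(count * length) / sum(count). */
double average_bits(const int counts[], const int lengths[], int n)
{
    long total_bits = 0, total_symbols = 0;
    for (int i = 0; i < n; i++) {
        total_bits += (long)counts[i] * lengths[i];
        total_symbols += counts[i];
    }
    assert(total_symbols > 0);   /* avoid dividing by zero on empty input */
    return (double)total_bits / (double)total_symbols;
}
```

With counts {9, 5, 3, 1} and lengths {1, 2, 3, 3}, `printf("%.3f\n", average_bits(counts, lengths, 4));` prints 1.722.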
Once you have the Huffman coding tree, the optimum code for each symbol is given by the path to the symbol in the tree.
For instance, let’s take this tree and say that left is 0 and right is 1 (this is arbitrary):
      *
     / \
    A   *
       / \
      B   *
         / \
        C   D
The path to A is left, so its code is 0, which is 1 bit long.
The path to B is right, left, so its code is 10, 2 bits.
C is right, right, left: code 110, 3 bits. D is right, right, right: code 111, 3 bits.
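If you keep the tree as linked nodes, a short recursive walk gives every leaf its code length. This is only a sketch: the `struct Node` layout and the function name `assign_code_lengths` are assumptions, since the question doesn't show how the tree is stored.

```c
#include <assert.h>
#include <stddef.h>

/* A leaf has both children NULL and carries a symbol;
 * an inner node has two children and no meaningful symbol. */
struct Node {
    char symbol;
    struct Node *left, *right;
    int code_len;   /* filled in for leaves by assign_code_lengths */
};

/* Walk the tree from the root. Every step down (left or right)
 * adds one bit to the code, so a leaf's code length is its depth. */
void assign_code_lengths(struct Node *node, int depth)
{
    assert(depth >= 0);
    if (node == NULL)
        return;
    if (node->left == NULL && node->right == NULL) {
        node->code_len = depth;
        return;
    }
    assign_code_lengths(node->left, depth + 1);
    assign_code_lengths(node->right, depth + 1);
}
```

Call it as `assign_code_lengths(root, 0)`. Only the depth matters for the length, so the choice of which side is 0 and which is 1 has no effect on the result.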
Now you have the length of each code, and you have already computed the frequency of each symbol.
The average number of bits per symbol is the mean of these code lengths, weighted by the frequency of each symbol.