Публикация Школы траблшутеров

To avoid a data overload, do apply a Huffman code

Время чтения: 6 мин 20 сек
5 июля 2023 г. Просмотров: 365

Алгоритмы, Программы | Максим Голубь, Олег Брагинский

Which mechanics lies beneath the archiving software? Whether it is a simple decrease of similar characters or magic of numbers available only to techno-wizards? The founder of School of Troubleshooters Oleg Braginsky and student Maksim Golub made a deep dive into one of the ways of the data compression.

The algorithm of packing the data was offered by the student at Massachusetts Technology Institute David Albert Hoffman (August 9,1925 – October 7, 1999). It owes its appearance to the choice that Professor Robert Fano put before the group: pass the exam or work on a term paper with a pre-defined topic.

Huffman had chosen the later one and his teacher assigned him to solve a task that he was working on himself: the optimal coding of the information. When he found a solution, David did not only pass the exam, but was able to overstep his teacher, offering the concept that became a basis for the data compression.

Let’s review the algorithm step by step. Suppose that there is an initial string «ABBCCCDDDEEEE». To code the sequence originally, there is a need to have a vocabulary of five elements each of a size of eight bites. In total, there is 5*8 =40 bites to store the dataset in the memory.

The goal is to represent a vocabulary as a tree view, where each branch contains the value of the element. All the nodes are placed according to their frequency of appearance: from most recent to rare elements. The construction of the scheme starts from the branch all way down to the root.

For the original string (1), create a full list of unique characters and calculate the frequency of appearance for each value. As the result, you can observe «A» once, «B» and «C» can be spotted twice, «D» comes out three times, while letter «E» like a quartet of musicians plays an etude for four (2).

Next, sort the table in the ascending order by frequency of appearance characters in the text. After this, unite two first items from the list into one node (3). New element gets a name «X» with two children «A» and «B» (4). The total for X equals the sum of its sub-elements which is three (4).

As «X» has a higher value than «B», which is equivalent to two, it is moved to the right to keep the order of elements according to the sorting rules. Apply the same rule of grouping for «C» and «D» (5) to get new node equivalent to five (6). Assign new node a name «Y» (6) and move it to the end of the row.

Up next, unite nodes «X» and «E» (7). As «X» already has two child elements, they form a new node with «E». Assign letter «Z» to this new element (8). Last step is a conjunction of «Y» and «Z». Name new node «T» and make it a root of the tree as there are no other elements left.

Code each element by using a route that is needed to go through to reach it. The rule is: if move from the top of the node to the right side – the value equals to one, if path lies through the left side, then the value equals to zero. After passing each node the value should be added to the resulting code (11).

Run through each node and write down the output for all elements. «X», «Y», «Z», «T» to be skipped – they only unite nodes. Pay attention: the elements that appear more frequently in the text are reached out faster, while the rare ones require few more steps to get to and hence, take longer travel time.

After building all of the routes and writing down the code value, compare it with the original vocabulary. With initial forty bites (12) get a new lightweight list with the set of twelve bites (13), which spares twenty-seven characters, saving seventy percentage of space in the memory comparing to the initial size.

David Huffman invented his algorithm in the middle of fifties in the last century. Because of the simplicity and consistency of generating optimal output, it has been being used as the foundation for solutions that require data compression while coding and transferring information on variety of software and hardware.