DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Related

  • Understanding AVL Trees in C#: A Guide to Self-Balancing Binary Search Trees
  • When To Use Decision Trees vs. Random Forests in Machine Learning
  • Discover Hidden Patterns with Intelligent K-Means Clustering
  • Mastering Approximate Top K: Choosing Optimal Count-Min Sketch Parameters

Trending

  • Lambda-Driven API Design: Building Composable Node.js Endpoints With Functional Primitives
  • Key Takeaways From Integrating a RAG Application With LangSmith
  • Why We Chose Iceberg Over Delta After Evaluating Both at Scale
  • How to Test a PATCH API Request With REST-Assured Java
  1. DZone
  2. Data Engineering
  3. Data
  4. Two-Pass Huffman in Blocks of 2 Symbols: Golang Implementation

Two-Pass Huffman in Blocks of 2 Symbols: Golang Implementation

One of the most famous compression algorithms is Huffman coding. Here, see an advanced version: a block-based, 2-symbol, two-pass Huffman algorithm in Golang.

By 
Daniil Koshelev user avatar
Daniil Koshelev
·
Nov. 05, 24 · Tutorial
Likes (5)
Comment
Save
Tweet
Share
10.2K Views

Join the DZone community and get the full member experience.

Join For Free

Data compression is perhaps the most important feature of modern computation, enabling efficient storage and transmission of information. One of the most famous compression algorithms is Huffman coding. In this post, we are going to introduce an advanced version: a block-based, 2-symbol, two-pass Huffman algorithm in Golang. It can bring further enhancements regarding the increase of compression efficiency in specific types of data, as it will take into consideration pairs of symbols instead of individual ones.

Algorithm Overview

The two-pass Huffman algorithm in blocks of 2 symbols is an extension of the classic Huffman coding. It processes input data in pairs of bytes, potentially offering better compression ratios for certain types of data. Let’s break down the encoding process step by step:

1. First Pass: Frequency Counting

  • Read the input file in blocks of 2 bytes (16 bits).
  • Store these blocks in a map structure, where:
    • The key is the 2-byte block (represented as an int16).
    • The value is a structure containing frequency information and other metadata.

2. Build the Huffman Tree

  • Create a list of elements from the map.
  • Sort the list based on the frequency of each block.
  • Iteratively combine the two least frequent elements:
    • Create a new node that becomes the parent of these two elements.
    • The frequency of the new node is the sum of its children's frequencies.
    • Store references to the original two elements within the new node.
  • Repeat the combination process until only one element (the root) remains.

3. Generate Huffman Codes

  • Starting from the root of the Huffman tree, traverse the tree:
    • Assign 0 for left branches and 1 for right branches.
    • When a leaf node is reached, the path from root to leaf becomes the code for that block.
  • Store the resulting codes for each original 2-byte block.

4. Second Pass: Encoding

  • Read the input file again, this time replacing each 2-byte block with its corresponding Huffman code.
  • Write the encoded data to the output file, bit by bit.

5. File Structure

The encoded file has a specific structure to allow for proper decoding:

  • First 4 bytes: Number of nodes in the Huffman tree
  • Next byte: Number of useful bits in the last byte of encoded data
  • Next byte: Flag indicating if a zero byte was added to the end of the input file
  • Tree structure: Encoded representation of the Huffman tree
  • Encoded data: The compressed content

Golang Implementation

Let's dive into the key components of the Golang implementation:

Go
 
// Block represents a node in the Huffman tree
type Block struct {
    Value     uint16  // The 2-byte block value
    Frequency int     // Frequency of the block in the input
    Left      *Block  // Left child in the Huffman tree
    Right     *Block  // Right child in the Huffman tree
    Code      string  // Huffman code for this block
}

// EncodeFile orchestrates the entire encoding process
func EncodeFile(inputPath, outputPath string) error {
    // Count block frequencies
    blocks, err := countBlockFrequencies(inputPath)
    if err != nil {
        return err
    }

    // Build the Huffman tree
    root := buildHuffmanTree(blocks)

// Generate Huffman codes for each block
    generateCodes(root, "")

    // Encode the file using the generated codes
    return writeEncodedFile(inputPath, outputPath, root, blocks)
}

// countBlockFrequencies reads the input file and counts the frequency of each 2-byte block
func countBlockFrequencies(inputPath string) (map[uint16]*Block, error) {
    blocks := make(map[uint16]*Block)
    file, err := os.Open(inputPath)
    if err != nil {
        return nil, err
    }
    defer file.Close()

    reader := bufio.NewReader(file)
    for {
        block, err := reader.ReadUint16(binary.BigEndian)
        if err != nil {
            if err == io.EOF {
                break
            }
            return nil, err
        }
        if _, exists := blocks[block]; !exists {
            blocks[block] = &Block{Value: block, Frequency: 1}
        } else {
            blocks[block].Frequency++
        }
    }
    return blocks, nil
}

// buildHuffmanTree constructs the Huffman tree from the frequency map
func buildHuffmanTree(blocks map[uint16]*Block) *Block {
    pq := make(PriorityQueue, 0, len(blocks))
    for _, block := range blocks {
        heap.Push(&pq, block)
    }

    for pq.Len() > 1 {
        left := heap.Pop(&pq).(*Block)
        right := heap.Pop(&pq).(*Block)
        parent := &Block{
            Frequency: left.Frequency + right.Frequency,
            Left:      left,
            Right:     right,
        }
        heap.Push(&pq, parent)
    }

    return heap.Pop(&pq).(*Block)
}

// generateCodes assigns Huffman codes to each block
func generateCodes(node *Block, code string) {
    if node.Left == nil && node.Right == nil {
        node.Code = code
        return
    }
    generateCodes(node.Left, code+"0")
    generateCodes(node.Right, code+"1")
}

// writeEncodedFile writes the compressed data to the output file
func writeEncodedFile(inputPath, outputPath string, root *Block, blocks map[uint16]*Block) error {
    inputFile, err := os.Open(inputPath)
    if err != nil {
        return err
    }
    defer inputFile.Close()

    outputFile, err := os.Create(outputPath)
    if err != nil {
        return err
    }
    defer outputFile.Close()

    // Write file header
    binary.Write(outputFile, binary.BigEndian, uint32(len(blocks)))
    // ... (write other header information)

    // Write tree structure
    writeTreeStructure(outputFile, root)

    // Encode and write data
    reader := bufio.NewReader(inputFile)
    writer := bitio.NewWriter(outputFile)
    for {
        block, err := reader.ReadUint16(binary.BigEndian)
        if err != nil {
            if err == io.EOF {
                break
            }
            return err
        }
        code := blocks[block].Code
        for _, bit := range code {
            if bit == '0' {
                writer.WriteBit(bitio.Zero)
            } else {
                writer.WriteBit(bitio.One)
            }
        }
    }
    writer.Close()

    return nil
}


This implementation provides a solid foundation for the two-pass Huffman algorithm in Golang. The EncodeFile function orchestrates the entire encoding process, from frequency counting to writing the encoded file.

Decoding Process

The decoding process essentially reverses the encoding steps:

  1. Read the file header to obtain:
    • Number of nodes in the Huffman tree
    • Number of useful bits in the last byte of encoded data
    • The presence of a "zero byte" at the end of the original file
  2. Reconstruct the Huffman tree:
    • Read the encoded tree structure bit by bit.
    • When a 1 is encountered, read the next 16 bits to get the original block value.
    • When a 0 is encountered, create an internal node.
  3. Read the encoded data:
    • Process the data bit by bit, traversing the Huffman tree.
    • When a leaf node is reached, output the corresponding 2-byte block.
    • Continue until all data is processed.
  4. Handle the last byte:
    • Use the information about useful bits to avoid outputting extra data.
  5. If a "zero byte" was added during encoding, remove it from the decoded output.

Here's a skeleton of the decoding function in Golang:

Go
 
func DecodeFile(inputPath, outputPath string) error {
    inputFile, err := os.Open(inputPath)
    if err != nil {
        return err
    }
    defer inputFile.Close()

    outputFile, err := os.Create(outputPath)
    if err != nil {
        return err
    }
    defer outputFile.Close()

    // Read file header
    var nodeCount uint32
    binary.Read(inputFile, binary.BigEndian, &nodeCount)
    // ... (read other header information)

    // Reconstruct Huffman tree
    root := reconstructTree(inputFile)

    // Decode data
    reader := bitio.NewReader(inputFile)
    writer := bufio.NewWriter(outputFile)
    currentNode := root
    for {
        bit, err := reader.ReadBit()
        if err != nil {
            if err == io.EOF {
                break
            }
            return err
        }
        if bit == bitio.Zero {
            currentNode = currentNode.Left
        } else {
            currentNode = currentNode.Right
        }
        if currentNode.Left == nil && currentNode.Right == nil {
            binary.Write(writer, binary.BigEndian, currentNode.Value)
            currentNode = root
        }
    }
    writer.Flush()

    // Handle last byte and zero byte if necessary
    // ...

    return nil
}


Performance Analysis

The provided test results show the algorithm's performance on various file types. Let's analyze some of these results:

  1. Text files (books, papers, source code):
    • Achieved compression ratios of around 8-9 bits per word
    • Example: book.txt compressed to 8.14 bits per word
    • Entropy values:
      • H(X) = 4.53 (first-order entropy)
      • H(X|X) = 3.58 (second-order entropy)
  2. Image file (picture):
    • Exceptional compression: 2.39 bits per word
    • Entropy values much lower than text files:
      • H(X) = 1.21
      • H(X|X) = 0.82
  3. Source code files (program1, program2, program3):
    • Compression ratios between 8.00 and 8.80 bits per word
    • Generally lower entropy values compared to natural language text
  4. Geometric data (geometric):
    • Higher compression ratio: 9.22 bits per word
    • Higher entropy values:
      • H(X) = 5.65
      • H(X|X) = 4.26

These results demonstrate that the two-pass Huffman algorithm in blocks of 2 symbols performs differently across various types of data:

  • It's particularly effective for image data, achieving significant compression.
  • For text and source code, it provides moderate compression, with performance varying based on the content's complexity and redundancy.
  • Geometric data seems to be more challenging to compress with this method.

The entropy values provide insights into the theoretical limits of compression for each file type. The difference between the achieved compression (bits per word) and the first-order entropy H(X) indicates the potential for further optimization.

For further reference and verification, all files used in the performance analysis, including text files, image files, source code, and geometric data, can be accessed via the following Google Drive link. This link provides direct access to the files mentioned in this report, allowing for a more detailed review of the compression results and entropy values across various file types.

Conclusion

Huffman's two-pass algorithm in blocks of 2 symbols is an interesting approach for conducting data compression. It may yield better compression ratios for certain types of data compared to classic single-symbol Huffman coding. The Golang implementation here has shown effectiveness for a wide range of file types and sizes.

Key Takeaways

  1. The best performance of the algorithm is on image data, which justifies using it in the pipelines of image compression.
  2. With regards to text and source code, it provides moderate compression, which is a good balance between efficiency and complexity. Block-based allows catching some context-meaningful pairs of symbols-what can be better in some scenarios. Easy to implement and integrate into existing Golang projects.
  3. The block-based approach allows for capturing some context (pairs of symbols), which can lead to better compression in certain scenarios.
  4. The implementation is flexible and can be easily integrated into existing Golang projects.

As with any compression method, the effectiveness can vary depending on the input data characteristics. Further optimizations and adaptations could potentially enhance its performance for specific use cases:

  • Experimenting with different block sizes (e.g., 3 or 4 bytes) might yield better results for certain data types.
  • Implementing adaptive block sizes based on data characteristics could optimize compression ratios.
  • Parallel processing of blocks could improve encoding and decoding speed for large files.

In conclusion, this two-pass Huffman implementation in Golang provides a solid foundation for exploring advanced data compression techniques. Its block-based approach offers a unique balance between compression efficiency and algorithmic complexity, making it a valuable tool in the data compression toolkit.

References

  • Kudryashov, B.D. Information Theory. Higher School of Economics, 2009.
Data compression Golang Algorithm Data (computing) Tree (data structure)

Opinions expressed by DZone contributors are their own.

Related

  • Understanding AVL Trees in C#: A Guide to Self-Balancing Binary Search Trees
  • When To Use Decision Trees vs. Random Forests in Machine Learning
  • Discover Hidden Patterns with Intelligent K-Means Clustering
  • Mastering Approximate Top K: Choosing Optimal Count-Min Sketch Parameters

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook