DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Related

  • The LLM Selection War Story: Part 4 - Your Production Failure Testing Suite
  • Intelligent Load Management for LLM Calls: From Static Rate Limits to Priority-Aware "Agent QoS"
  • Stop Writing Excel Specs: A Markdown-First Approach to Enterprise Java
  • Anthropic’s Model Context Protocol (MCP): A Developer’s Guide to Long-Context LLM Integration

Trending

  • How to Save Money Using Custom LLMs for Specific Tasks
  • Build a GitHub Slack Bot With AWS Bedrock and MCP, Part 1
  • When One MVP Is Really Four Systems: A Better Way to Plan Multi-Role Apps
  • Securing the AI Host: Spring AI MCP Server Communication With API Keys
  1. DZone
  2. Data Engineering
  3. AI/ML
  4. Bye Tokens, Hello Patches

Bye Tokens, Hello Patches

Meta's BLT architecture is a better way to scale LLMs that may lead us to replace tokenization with a patches-based approach.

By 
mike labs user avatar
mike labs
·
Jan. 15, 25 · Analysis
Likes (2)
Comment
Save
Tweet
Share
3.8K Views

Join the DZone community and get the full member experience.

Join For Free

Do we really need to break text into tokens, or could we work directly with raw bytes?

First, let’s think about how do LLMs currently handle text. They first chop it up into chunks called tokens using rules about common word pieces. This tokenization step has always been a bit of an odd one out. While the rest of the model learns and adapts during training, tokenization stays fixed, based on those initial rules. This can cause problems, especially for languages that aren’t well-represented in the training data or when handling unusual text formats.

Meta’s new Byte-Level Tokenization (BLT) architecture (paper, code) takes a different approach. Instead of pre-defining tokens, it looks at the raw bytes of text and dynamically groups them based on how predictable they are. When the next byte is very predictable (like finishing a common word), it groups more bytes together. When the next byte is unpredictable (like starting a new sentence), it processes bytes in smaller groups.

Scaling trends for models trained with fixed inference budgets

Scaling trends for models trained with fixed inference budgets


Traditional token-based models, like Llama 2 and 3, scale model size based on the inference budget. In contrast, the BLT architecture allows scaling both model size and patch size (ps) together while maintaining the same budget. BLT models with patch sizes 6 and 8 quickly surpass Llama 2 and 3. Larger patch sizes, like 8, become more effective earlier when using higher inference budgets. Vertical lines show key points for compute efficiency and performance crossover.

This dynamic approach leads to three key benefits:

First, it can match the performance of state-of-the-art tokenizer-based models like Llama 3 while offering the option to trade minor performance losses for up to 50% reduction in inference flops. The model saves resources by processing predictable sections more efficiently.

Second, it handles edge cases much better. Consider tasks that require character-level understanding, like correcting misspellings or working with noisy text. BLT significantly outperforms token-based models on these tasks because it can directly access and manipulate individual characters.

Third, it introduces a new way to scale language models. With traditional tokenizer-based models, you’re somewhat constrained in how you can grow them. But BLT lets you simultaneously increase both the model size and the average size of byte groups while keeping the same compute budget. This opens up new possibilities for building more efficient models.

Main Components of BLT

To understand how BLT works in practice, let’s look at its three main components:

  1. A lightweight local encoder that processes raw bytes and groups them based on predictability
  2. A large transformer that processes these groups (called “patches”)
  3. A lightweight local decoder that converts patch representations back into bytes
The BLT architecture has three main modules: a lightweight Local Encoder to convert input bytes into patch representations, a Latent Transformer for processing these patches, and a lightweight Local Decoder to generate the next patch of bytes. BLT uses byte n-gram embeddings and cross-attention to enhance information flow between the Latent Transformer and byte-level modules. Unlike fixed-vocabulary tokenization, BLT dynamically groups bytes into patches, maintaining access to detailed byte-level information.


The entropy-based grouping is particularly clever. BLT uses a small language model to predict how surprising each next byte will be. When it encounters a highly unpredictable byte (like the start of a new word), it creates a boundary and begins a new patch. This way, it dedicates more computational resources to the challenging parts of the text while efficiently handling the easier parts.

I like the results. On standard benchmarks, BLT matches or exceeds Llama 3’s performance. But where it really shines is on tasks requiring character-level understanding. For instance, on the CUTE benchmark testing character manipulation, BLT outperforms token-based models by more than 25 points — and this is despite being trained on 16x less data than the latest Llama model.

The 8B BLT model is compared to the 8B BPE Llama 3, both trained on 1T tokens, using tasks that test robustness to noise and language structure awareness. Best results are in bold, and the overall best results (including Llama 3.1) are underlined. BLT significantly outperforms Llama 3 and even surpasses Llama 3.1 on many tasks, demonstrating that byte-level awareness offers unique advantages not easily achieved with more data.


This points to a future where language models might no longer need the crutch of fixed tokenization. By working directly with bytes in a dynamic way, we could build models that are both more efficient and more capable of handling the full complexity of human language.

What do you think about this approach? Does removing the tokenization step seem like the right direction for language models to evolve? Let me know in the comments or on the AImodels.fyi community Discord. I’d love to hear what you have to say.


Patch (computing) Data Types large language model

Published at DZone with permission of mike labs. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • The LLM Selection War Story: Part 4 - Your Production Failure Testing Suite
  • Intelligent Load Management for LLM Calls: From Static Rate Limits to Priority-Aware "Agent QoS"
  • Stop Writing Excel Specs: A Markdown-First Approach to Enterprise Java
  • Anthropic’s Model Context Protocol (MCP): A Developer’s Guide to Long-Context LLM Integration

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook