Secure Log Tokenization Using Aho–Corasick and Spring

This article shows how to use the Aho–Corasick algorithm and deterministic tokenization in Spring Boot to intercept logs in real time and replace sensitive values with tokens.

By Balakumaran Sugumar · Jan. 08, 26 · Analysis

Modern microservices, payment engines, and event-driven systems generate massive volumes of logs every second. These logs are critical for debugging, monitoring, observability, and compliance audits.

But there is a growing and dangerous problem: sensitive data (credit card numbers, email addresses, phone numbers, SSNs, API keys, and session tokens) often ends up in logs by accident. Once stored in log aggregators such as ELK, Splunk, CloudWatch, Datadog, or S3, this data becomes a high-risk liability.

Organizations must comply with PCI-DSS, GDPR, HIPAA, SOX, and internal security policies that restrict or prohibit storing raw PII/PCI data. Regex-based log scrubbing often falls short at this scale: it can be slow when many patterns are applied, it is brittle, and it misses edge cases.

This article presents a high-performance log interception mechanism built using:

  1. Aho–Corasick multi-pattern matching.
  2. Deterministic tokenization implemented in Java + Spring Boot.

This approach provides real-time scrubbing for large-scale systems with predictable performance and strong security properties.

Why Aho–Corasick Is Great for Log Interception

The Aho–Corasick algorithm is designed for fast, simultaneous searching of many fixed strings. It's used everywhere that requires high-speed, multi-pattern detection, such as in network security systems (IDS) and spam filters.

It works by turning all your patterns (your "dictionary" of secrets) into a single, highly efficient structure called a Finite-State Automaton.

Key Benefits for Log Interception

  1. Lightning-fast, single-pass search:
    • It scans the incoming log text only once, character by character.
    • Matching time is linear in the length of the log line being read (plus the number of matches found). It does not grow with the number of patterns in the dictionary.
  2. Searches thousands of patterns at once (multi-pattern):
    • It can detect hundreds or even thousands of sensitive patterns (SSNs, tokens, card formats) simultaneously in that one single pass.
  3. No costly backtracking:
    • AC is built on the structure of a trie (a tree of all your patterns) connected by failure links.
    • If it encounters a mismatch, it follows a precomputed failure link to the longest suffix that could still lead to a match instead of restarting the search. This avoids the backtracking that can slow down regex engines.
  4. Predictable, consistent performance (streaming-friendly):
    • It has deterministic and predictable performance, even in the worst-case scenario. This makes it perfect for "hot paths" such as logging interceptors that handle continuous, high-volume log streams (e.g., a filter or appender).
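
To make the trie and failure links described above concrete, here is a minimal, self-contained sketch of the matching structure in plain Java. It is illustrative only: the class name MiniAhoCorasick and its Match type are hypothetical and not part of any library, and a production system would more likely rely on a tested implementation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Minimal Aho–Corasick automaton over literal keywords (illustrative sketch). */
public class MiniAhoCorasick {

    /** One match: keyword found at [start, end) in the scanned text. */
    public static final class Match {
        public final int start;      // inclusive offset
        public final int end;        // exclusive offset
        public final String keyword;

        Match(int start, int end, String keyword) {
            this.start = start;
            this.end = end;
            this.keyword = keyword;
        }
    }

    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        final List<String> outputs = new ArrayList<>(); // keywords ending at this state
        Node fail;                                      // failure link
    }

    private final Node root = new Node();

    public MiniAhoCorasick(List<String> keywords) {
        // Step 1: insert every keyword into the trie.
        for (String kw : keywords) {
            if (kw.isEmpty()) continue;
            Node node = root;
            for (char c : kw.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.outputs.add(kw);
        }
        // Step 2: breadth-first traversal to compute failure links.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node current = queue.remove();
            for (Map.Entry<Character, Node> edge : current.next.entrySet()) {
                Node child = edge.getValue();
                Node f = current.fail;
                while (f != root && !f.next.containsKey(edge.getKey())) {
                    f = f.fail;
                }
                Node target = f.next.get(edge.getKey());
                child.fail = (target != null) ? target : root;
                // Inherit keywords that end at the failure target (shorter suffix matches).
                child.outputs.addAll(child.fail.outputs);
                queue.add(child);
            }
        }
    }

    /** Scans the text once, left to right, and returns every keyword occurrence. */
    public List<Match> scan(String text) {
        List<Match> matches = new ArrayList<>();
        Node state = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // On a mismatch, follow failure links instead of restarting the scan.
            while (state != root && !state.next.containsKey(c)) {
                state = state.fail;
            }
            Node nextState = state.next.get(c);
            state = (nextState != null) ? nextState : root;
            for (String kw : state.outputs) {
                matches.add(new Match(i + 1 - kw.length(), i + 1, kw));
            }
        }
        return matches;
    }
}
```

Scanning the text "ushers" with the keywords "he", "she", "his", and "hers" reports "she" and "he" when the scan reaches the first "e", and "hers" at the end, all in a single pass. The text pointer never moves backward; only the automaton state changes.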

Aho–Corasick Implementation vs. Regex


The bottom line:

Using Aho–Corasick in your log interceptor is a win-win: it gives you both strong security (ensuring PII/PCI data is found and redacted) and excellent performance (minimal CPU overhead and high throughput).

Architecture Overview

[Figure: Architecture overview, runtime flow]


This diagram focuses on the data (the Log Message) transforming as it passes through different processes (the steps).

  • External entity (source): The Incoming Log Message originates outside the system boundary.
  • Process 1: Initial filter/intercept: The log is processed by the Logback TurboFilter/Spring Interceptor.
  • Process 2: Pattern scanning: The filtered log is scanned using the Aho–Corasick Trie Scan to identify sensitive data.
  • Process 3: Tokenization: The sensitive data is transformed by the Deterministic Tokenizer, resulting in the Sanitized Log Message.
  • Data stores (sinks): The final message is written to the various persistence targets, including ELK, Splunk, CloudWatch, S3, and Kafka.

Java + Spring Boot Implementation

At application start-up, build an Aho–Corasick (AC) automaton (trie + failure links + output links) from a dictionary of sensitive patterns.

Configure a custom log interceptor, filter, or appender in the logging framework in use (Logback, Log4j, or a Spring logging filter) so that every log message passes through the AC-based scan before it is emitted.

When matches are found, replace, mask, or tokenize the sensitive substrings and keep the rest of the log intact. Optionally, maintain a mapping store (for example, a hash map) so that repeated occurrences of the same sensitive value map to the same token, which preserves traceability while protecting PII.

Emit sanitized logs downstream: console, file, central aggregators like Splunk, ELK, S3, etc.

The automaton is compiled once at startup, in time proportional to the total length of all patterns. After that, matching each log message costs O(n + m), where n is the message length and m is the number of matches. The runtime overhead therefore remains low, which makes the solution viable even in high-throughput microservices.

For the Aho–Corasick implementation, you can use the org.ahocorasick library (version 0.6.3), available as a Maven dependency.

1. Add Aho–Corasick Dependency

XML
 
<dependency>
  <groupId>org.ahocorasick</groupId>
  <artifactId>ahocorasick</artifactId>
  <version>0.6.3</version>
</dependency>


2. Define Sensitive Patterns

Java
 
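// Aho–Corasick matches literal strings, not regular expressions, so the
// dictionary must hold fixed values. The entries below are illustrative
// placeholders; in practice, load known secrets from a secure store at startup.
// Variable-format data (arbitrary card numbers, emails) needs a separate detector.
List<String> sensitivePatterns = List.of(
    "4532123412341234",                   // Known card number (e.g., a test PAN)
    "123-45-6789",                        // Known SSN
    "jane.doe@example.com",               // Known email address
    "sk_live_EXAMPLE_KEY",                // Known API key
    "0123456789abcdef0123456789abcdef"    // Known session ID
);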


3. Build Aho–Corasick Trie

Java
 
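import org.ahocorasick.trie.Trie;

// Built once at startup; the resulting Trie is reused for every log message.
Trie trie = Trie.builder()
        .onlyWholeWords()      // skip matches embedded in longer words
        .ignoreCase()          // case-insensitive matching; keep this before addKeywords
        .ignoreOverlaps()      // one match per span, so replacements do not collide
        .addKeywords(sensitivePatterns)
        .build();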


4. Deterministic Tokenizer

Java
 
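import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.springframework.stereotype.Component;

@Component
public class Tokenizer {

    // Unbounded cache: the same value always maps to the same token.
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    public String tokenize(String value) {
        return cache.computeIfAbsent(value, v -> {
            // Caution: Base64 is an encoding, not a hash; this prefix can be decoded
            // to recover leading characters. Prefer a keyed hash (HMAC) in production.
            String encoded = Base64.getEncoder().encodeToString(v.getBytes(StandardCharsets.UTF_8));
            return "TOKENIZED_" + encoded.substring(0, Math.min(10, encoded.length()));
        });
    }
}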
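
For tokens that cannot be decoded back to the original value, one option is a keyed hash. The sketch below uses HMAC-SHA256 from the JDK; the class name HmacTokenizer, the 16-character token length, and the way the secret is supplied are assumptions for illustration, not part of the original design. The same input and key always yield the same token, so correlation across log lines still works.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Keyed deterministic tokenizer (sketch): same input + same key gives the same token. */
public class HmacTokenizer {

    private static final String ALGORITHM = "HmacSHA256";
    private final SecretKeySpec key;

    /** @param secret key material; assumed to be loaded from a vault or secret manager */
    public HmacTokenizer(byte[] secret) {
        this.key = new SecretKeySpec(secret, ALGORITHM);
    }

    public String tokenize(String value) {
        try {
            // Mac instances are not thread-safe, so create one per call.
            Mac mac = Mac.getInstance(ALGORITHM);
            mac.init(key);
            byte[] digest = mac.doFinal(value.getBytes(StandardCharsets.UTF_8));
            String encoded = Base64.getUrlEncoder().withoutPadding().encodeToString(digest);
            // Truncate for readability; a longer prefix lowers the collision risk.
            return "TOKENIZED_" + encoded.substring(0, 16);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HMAC-SHA256 is not available", e);
        }
    }
}
```

Because the token depends on a secret key, someone who only sees the logs cannot confirm a guessed value by hashing it themselves. Rotating the key changes every token, so correlation only holds within one key period.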


5. Logback TurboFilter Integration

Java
 
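import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;
import ch.qos.logback.classic.turbo.TurboFilter;
import ch.qos.logback.core.spi.FilterReply;
import org.ahocorasick.trie.Emit;
import org.ahocorasick.trie.Trie;
import org.slf4j.Marker;
import org.slf4j.helpers.MessageFormatter;

public class SensitiveDataFilter extends TurboFilter {

    // Guards against re-entering this filter when the sanitized message is re-logged.
    private static final ThreadLocal<Boolean> RELOGGING = ThreadLocal.withInitial(() -> Boolean.FALSE);

    private final Trie trie;
    private final Tokenizer tokenizer;

    // Assumes the filter is created in code and added to the LoggerContext programmatically.
    public SensitiveDataFilter(Trie trie, Tokenizer tokenizer) {
        this.trie = trie;
        this.tokenizer = tokenizer;
    }

    @Override
    public FilterReply decide(Marker marker, Logger logger,
                              Level level, String format, Object[] params, Throwable t) {

        if (format == null || RELOGGING.get()
                || !level.isGreaterOrEqual(logger.getEffectiveLevel())) {
            return FilterReply.NEUTRAL;
        }

        // Expand {} placeholders first: sensitive values usually arrive as arguments.
        String raw = MessageFormatter.arrayFormat(format, params).getMessage();
        String sanitized = sanitize(raw);
        if (sanitized.equals(raw)) return FilterReply.NEUTRAL;

        RELOGGING.set(Boolean.TRUE);
        try {
            logger.log(marker, Logger.FQCN, Level.toLocationAwareLoggerInteger(level), sanitized, null, t);
        } finally {
            RELOGGING.set(Boolean.FALSE);
        }
        return FilterReply.DENY; // Prevent raw log from being written
    }

    private String sanitize(String msg) {
        // Rebuild from the original offsets; replacing in place would shift later matches.
        StringBuilder out = new StringBuilder(msg.length());
        int last = 0;
        for (Emit e : trie.parseText(msg)) {
            if (e.getStart() < last) continue; // skip a match overlapping one already replaced
            out.append(msg, last, e.getStart());
            out.append(tokenizer.tokenize(msg.substring(e.getStart(), e.getEnd() + 1)));
            last = e.getEnd() + 1;
        }
        return out.append(msg, last, msg.length()).toString();
    }
}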


Output Example

Before sanitization:

Plain Text
 
Processing payment for card 4532123412341234 from [email protected]


After sanitization:

Plain Text
 
Processing payment for card TOKENIZED_EssTYUIIOO from TOKENIZED_llo3asd456


This keeps PCI and PII values out of the log output while still allowing observability teams to track user journeys and correlate events, because the same value always yields the same token.

Performance Advantages

Aho–Corasick Complexity

  • O(n + m) matching for a message of length n with m matches, regardless of the number of patterns
  • No backtracking
  • Suited to log pipelines handling 50k–200k log lines/minute and above
  • Typically outperforms running many regexes in sequence for multi-pattern workloads

Tokenization

  • O(1) average lookup (via ConcurrentHashMap)
  • Can hold millions of tokens, at the cost of heap memory; consider bounding or evicting the cache in long-running services

Why This Approach Is Secure

  • Sensitive data never leaves the application boundary.
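  • Tokens should be non-reversible: derive them with a keyed hash (for example, HMAC) rather than a plain encoding such as Base64, and allow reversal only through a secure vault-based mapping.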
  • Prevents accidental or malicious logging of users' data.
  • Strong alignment with PCI DSS 4.0, GDPR Article 32, and SOC2 logging controls.
  • Supports privacy-by-design principles.

Conclusion

In summary, integrating Aho–Corasick-based log interception with deterministic tokenization offers a practical approach to secure log management. The combination provides high-speed multi-pattern detection and generates deterministic correlation tokens, which supports compliance efforts and greatly reduces the risk of sensitive data (PII, payment info) leaking into system logs. Because it integrates with Spring Boot and Logback, the technique is a good fit for enterprise-scale Java microservices that handle regulated data.
