Secure Log Tokenization Using Aho–Corasick and Spring
This article shows how to use the Aho–Corasick algorithm and deterministic tokenization in Spring Boot to intercept logs in real time and remove sensitive values.
Modern microservices, payment engines, and event-driven systems generate massive volumes of logs every second. These logs are critical for debugging, monitoring, observability, and compliance audits.
But there is a growing and dangerous problem: sensitive data, such as credit card numbers, email addresses, phone numbers, SSNs, API keys, and session tokens, often ends up in logs by accident. Once it is stored in log aggregators such as ELK, Splunk, CloudWatch, Datadog, or S3, this sensitive data becomes a high-risk liability.
Organizations must comply with PCI-DSS, GDPR, HIPAA, SOX, and internal security policies that strictly prohibit storing raw PII/PCI. Regex-based log scrubbing alone is often inadequate: it can be slow on hot paths, it is brittle, and it tends to miss edge cases.
This article presents a high-performance, look-ahead log interception mechanism that is built using:
- Aho–Corasick multi-pattern matching.
- Deterministic tokenization implemented in Java + Spring Boot.
This approach provides real-time scrubbing for large-scale systems with predictable performance and consistent redaction of the values it is configured to detect.
Why Aho–Corasick Is Great for Log Interception
The Aho–Corasick algorithm is designed for fast, simultaneous searching of many fixed strings. It's used everywhere that requires high-speed, multi-pattern detection, such as in network security systems (IDS) and spam filters.
It works by turning all your patterns (your "dictionary" of secrets) into a single, highly efficient structure called a Finite-State Automaton.
Key Benefits for Log Interception
- Lightning-fast, single-pass search:
  - It scans the incoming log text only once, character by character.
  - It has linear time complexity: matching time is proportional to the length of the log line you're reading (plus the number of matches reported), and it does not grow as you add more patterns.
- Searches thousands of patterns at once (multi-pattern):
  - It can detect hundreds or even thousands of sensitive patterns (specific SSNs, tokens, card numbers, and other fixed strings) simultaneously in that one single pass.
- No costly backtracking:
  - AC is built on the structure of a trie (a tree of all your patterns) connected by failure links.
  - If it encounters a mismatch, it simply follows a precomputed failure link to the longest suffix that can still lead to a match instead of starting the whole search over. This avoids the costly backtracking that can slow down backtracking regex engines.
- Predictable, consistent performance (streaming-friendly):
  - It has deterministic and predictable performance, even in the worst-case scenario. This makes it a good fit for "hot paths" such as logging interceptors that handle continuous, high-volume log streams (e.g., a filter or appender).
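To make the trie and failure-link mechanics above concrete, here is a minimal sketch of an Aho–Corasick automaton in plain Java. It is not the implementation used by any particular library; the names MiniAhoCorasick and Match are hypothetical, and the sketch assumes case-sensitive, literal keywords.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Minimal Aho–Corasick automaton (illustrative sketch): a trie plus failure links. */
final class MiniAhoCorasick {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        final List<String> outputs = new ArrayList<>(); // keywords that end at this node
        Node fail;                                      // longest proper suffix that is also a trie path
    }

    /** One keyword occurrence; end is exclusive. */
    static final class Match {
        final int start;
        final int end;
        final String keyword;
        Match(int start, int end, String keyword) {
            this.start = start;
            this.end = end;
            this.keyword = keyword;
        }
    }

    private final Node root = new Node();

    MiniAhoCorasick(List<String> keywords) {
        // Step 1: insert every keyword into the trie.
        for (String k : keywords) {
            Node n = root;
            for (char c : k.toCharArray()) {
                n = n.next.computeIfAbsent(c, x -> new Node());
            }
            n.outputs.add(k);
        }
        // Step 2: a breadth-first pass assigns failure links, shallow nodes first.
        Queue<Node> queue = new ArrayDeque<>();
        root.fail = root;
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node cur = queue.remove();
            for (Map.Entry<Character, Node> e : cur.next.entrySet()) {
                Node child = e.getValue();
                Node f = cur.fail;
                while (f != root && !f.next.containsKey(e.getKey())) {
                    f = f.fail;
                }
                Node target = f.next.get(e.getKey());
                child.fail = (target != null) ? target : root;
                // Inherit keywords that end here as suffixes of the current path.
                child.outputs.addAll(child.fail.outputs);
                queue.add(child);
            }
        }
    }

    /** Scans the text once, left to right, and returns every keyword occurrence. */
    List<Match> search(String text) {
        List<Match> matches = new ArrayList<>();
        Node state = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (state != root && !state.next.containsKey(c)) {
                state = state.fail; // follow failure links instead of rescanning the text
            }
            state = state.next.getOrDefault(c, root);
            for (String k : state.outputs) {
                matches.add(new Match(i + 1 - k.length(), i + 1, k));
            }
        }
        return matches;
    }
}
```

For example, building the automaton from the keywords "he", "she", "his", and "hers" and searching the text "ushers" reports "she", "he", and "hers" in a single pass over the six characters.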
Aho–Corasick Implementation vs. Regex


The bottom line:
Using Aho–Corasick in your log interceptor is a win-win: it gives you both strong security (ensuring PII/PCI data is found and redacted) and excellent performance (minimal CPU overhead and high throughput).
Architecture Overview


This diagram focuses on the data (the Log Message) transforming as it passes through different processes (the steps).
- External entity (source): The Incoming Log Message originates outside the system boundary.
- Process 1: Initial filter/intercept: The log is processed by the Logback TurboFilter/Spring Interceptor.
- Process 2: Pattern scanning: The filtered log is scanned using the Aho–Corasick Trie Scan to identify sensitive data.
- Process 3: Tokenization: The sensitive data is transformed by the Deterministic Tokenizer, resulting in the Sanitized Log Message.
- Data stores (sinks): The final message is written to the various persistence targets, including ELK, Splunk, CloudWatch, S3, and Kafka.
Java + Spring Boot Implementation
At application start-up, build an AC (Aho–Corasick) automaton (trie + failure links + output links) from a dictionary of "sensitive patterns."
Register a custom log interceptor, filter, or appender in the logging framework in use (Logback, Log4j, or a Spring logging filter) so that every log message passes through this AC-based scan before it is emitted.
When matches are found, replace, mask, or tokenize the sensitive substrings and keep the rest of the log intact. Optionally, maintain a mapping store (e.g., a hash map) so that repeated occurrences of the same sensitive value map to the same token, which preserves traceability while protecting PII.
Emit the sanitized logs downstream: console, file, or central aggregators such as Splunk, ELK, and S3.
Because AC matching runs in O(n + m) time per log message (n = message length, m = number of matches) and the automaton is built once at startup in time proportional to the total length of all patterns, the runtime overhead stays low, which makes the solution viable even in high-throughput microservices.
Rather than implementing the algorithm yourself, you can use an existing library. The example below uses the Aho–Corasick implementation from org.ahocorasick (version 0.6.3), added as a Maven dependency.
1. Add Aho–Corasick Dependency
<dependency>
    <groupId>org.ahocorasick</groupId>
    <artifactId>ahocorasick</artifactId>
    <version>0.6.3</version>
</dependency>
2. Define Sensitive Patterns
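// Aho–Corasick matches literal strings, not regular expressions, so every entry
// must be a fixed value (regex syntax such as \d{16} would be searched for verbatim).
// The values below are illustrative placeholders; a real dictionary would be loaded
// from configuration or a secrets store at startup.
List<String> sensitivePatterns = List.of(
    "4532123412341234",                 // Known credit card number
    "123-45-6789",                      // Known SSN
    "jane.doe@example.com",             // Known email address
    "sk_test_placeholderApiKey",        // Known API token
    "0123456789abcdef0123456789abcdef", // Known session ID
    "5550100123"                        // Known phone number
);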
3. Build Aho–Corasick Trie
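import org.ahocorasick.trie.Trie;

// Built once at startup and reused for every log message.
Trie trie = Trie.builder()
    .onlyWholeWords()               // ignore hits embedded inside longer words
    .ignoreCase()                   // match regardless of letter case
    .addKeywords(sensitivePatterns)
    .build();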
4. Deterministic Tokenizer
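import java.nio.charset.StandardCharsets;
import java.security.*;
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;
import org.springframework.stereotype.Component;
@Component
public class Tokenizer {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    public String tokenize(String value) {
        return cache.computeIfAbsent(value, v -> "TOKENIZED_" + digest(v).substring(0, 10));
    }
    private static String digest(String v) { // hash first: Base64 of the raw value is reversible
        try {
            byte[] hash = MessageDigest.getInstance("SHA-256").digest(v.getBytes(StandardCharsets.UTF_8));
            return Base64.getUrlEncoder().withoutPadding().encodeToString(hash);
        } catch (NoSuchAlgorithmException e) { throw new IllegalStateException(e); }
    }
}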
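Low-entropy values such as card or phone numbers can be recovered from an unkeyed encoding or hash by guessing candidates, so one common hardening step is to derive tokens with a keyed HMAC. The following is a minimal sketch under that assumption, not part of the original tokenizer: the class name HmacTokenizer, the 16-character token length, and the way the secret reaches the constructor are all hypothetical choices.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Base64;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Deterministic tokenizer sketch: same key + same value always yields the same token. */
final class HmacTokenizer {
    private static final String ALGORITHM = "HmacSHA256";
    private final SecretKeySpec key;
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    /** The secret is assumed to come from a vault or environment, never from source code. */
    HmacTokenizer(byte[] secret) {
        this.key = new SecretKeySpec(secret, ALGORITHM);
    }

    String tokenize(String value) {
        return cache.computeIfAbsent(value, this::compute);
    }

    private String compute(String value) {
        try {
            // Mac instances are not thread-safe, so a fresh one is created per computation.
            Mac mac = Mac.getInstance(ALGORITHM);
            mac.init(key);
            byte[] tag = mac.doFinal(value.getBytes(StandardCharsets.UTF_8));
            String encoded = Base64.getUrlEncoder().withoutPadding().encodeToString(tag);
            // Truncated for readability in log lines; a longer prefix lowers collision risk.
            return "TOKENIZED_" + encoded.substring(0, 16);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(ALGORITHM + " is not available", e);
        }
    }
}
```

Because the token depends only on the key and the value, two service instances sharing the same key produce the same token for the same input, which keeps cross-service correlation possible without a shared mapping store.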
5. Logback TurboFilter Integration
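import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;
import ch.qos.logback.classic.turbo.TurboFilter;
import ch.qos.logback.core.spi.FilterReply;
import java.util.*;
import org.ahocorasick.trie.Emit;
import org.ahocorasick.trie.Trie;
import org.slf4j.Marker;
import org.slf4j.helpers.MessageFormatter;
public class SensitiveDataFilter extends TurboFilter {
    private final Trie trie;
    private final Tokenizer tokenizer;
    private final ThreadLocal<Boolean> relogging = ThreadLocal.withInitial(() -> false); // re-entry guard
    public SensitiveDataFilter(Trie trie, Tokenizer tokenizer) {
        this.trie = trie;
        this.tokenizer = tokenizer;
    }
    @Override
    public FilterReply decide(Marker marker, Logger logger, Level level, String format, Object[] params, Throwable t) {
        // The re-logged message passes through this filter again; let it through unchanged.
        if (format == null || relogging.get()) return FilterReply.NEUTRAL;
        String sanitized = sanitize(MessageFormatter.arrayFormat(format, params).getMessage()); // scan arguments too
        relogging.set(true);
        try {
            logger.log(marker, Logger.FQCN, Level.toLocationAwareLoggerInteger(level), sanitized, null, t);
        } finally {
            relogging.set(false);
        }
        return FilterReply.DENY; // Prevent raw log from being written
    }
    private String sanitize(String msg) {
        List<Emit> emits = new ArrayList<>(trie.parseText(msg));
        emits.sort(Comparator.comparingInt(Emit::getStart));
        StringBuilder out = new StringBuilder(msg.length());
        int pos = 0; // offsets always refer to the original message, never to the partly replaced one
        for (Emit e : emits) {
            if (e.getStart() < pos) continue; // skip a match overlapping one already replaced
            out.append(msg, pos, e.getStart()).append(tokenizer.tokenize(msg.substring(e.getStart(), e.getEnd() + 1)));
            pos = e.getEnd() + 1;
        }
        return out.append(msg, pos, msg.length()).toString();
    }
}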
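Replacing matches by their character offsets, rather than by searching for the matched text again, keeps the positions reported by the scanner valid while the output is being built. Below is a minimal, framework-independent sketch of that step. The names SpanRedactor and Span are hypothetical, spans are assumed to be half-open [start, end) ranges into the original message, and overlapping spans are merged so that no part of an overlapping match is left in clear text.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.UnaryOperator;

/** Offset-based redaction sketch: rebuilds the message from the original text and match spans. */
final class SpanRedactor {
    /** Half-open character range [start, end) in the original message. */
    static final class Span {
        final int start;
        final int end;
        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    static String redact(String msg, List<Span> spans, UnaryOperator<String> tokenizer) {
        List<Span> sorted = new ArrayList<>(spans);
        sorted.sort(Comparator.comparingInt((Span s) -> s.start));

        StringBuilder out = new StringBuilder(msg.length());
        int pos = 0; // how much of the original message has been copied so far
        int i = 0;
        while (i < sorted.size()) {
            int start = sorted.get(i).start;
            int end = sorted.get(i).end;
            // Merge every following span that overlaps the current one.
            while (++i < sorted.size() && sorted.get(i).start < end) {
                end = Math.max(end, sorted.get(i).end);
            }
            out.append(msg, pos, start)                           // untouched text before the match
               .append(tokenizer.apply(msg.substring(start, end))); // token in place of the match
            pos = end;
        }
        return out.append(msg, pos, msg.length()).toString();     // untouched tail
    }
}
```

The tokenizer is passed in as a function so the same routine works with any token scheme; merging overlaps trades a slightly coarser token for the guarantee that no fragment of a sensitive value survives.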
Output Example
Before sanitization:
Processing payment for card 4532123412341234 from [email protected]
After sanitization:
Processing payment for card TOKENIZED_EssTYUIIOO from TOKENIZED_llo3asd456
This keeps PCI and PII data out of the log output while still allowing observability teams to track user journeys and correlate events, because the same value always maps to the same token.
Performance Advantages
Aho–Corasick Complexity
- O(n + m) matching (n = input length, m = number of matches), regardless of the number of patterns
- No backtracking
- Well suited to log pipelines exceeding 50k–200k log lines/minute
- Typically outperforms running many separate regexes for multi-pattern workloads
Tokenization
- O(1) average lookup (via ConcurrentHashMap)
- Supports millions of tokens, limited by available heap because the cache is unbounded
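Because deterministic tokens can always be recomputed from the value, the cache is only an optimization and can be capped without affecting correctness. A minimal sketch of a size-bounded, least-recently-used cache follows; the class name BoundedTokenCache and the tokenFn parameter are hypothetical, and the token function is assumed to be deterministic.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** LRU-bounded token cache sketch: evicted entries are simply recomputed on the next lookup. */
final class BoundedTokenCache {
    private final Map<String, String> cache;
    private final Function<String, String> tokenFn;

    BoundedTokenCache(int maxEntries, Function<String, String> tokenFn) {
        this.tokenFn = tokenFn;
        // accessOrder=true makes the LinkedHashMap track recency; removeEldestEntry caps its size.
        Map<String, String> lru = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries;
            }
        };
        // A single lock is the simplest way to make the LRU map safe for concurrent loggers.
        this.cache = Collections.synchronizedMap(lru);
    }

    String tokenize(String value) {
        return cache.computeIfAbsent(value, tokenFn);
    }

    int size() {
        return cache.size();
    }
}
```

The single lock trades some throughput for simplicity; under heavy contention a dedicated caching library or a sharded map would be the usual next step.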
Why This Approach Is Secure
- Sensitive data never leaves the application boundary.
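- Tokens should be non-reversible (derived with a one-way hash or keyed HMAC rather than an encoding such as Base64), unless reversibility is provided deliberately through a secure vault-based scheme.
- Prevents accidental or malicious logging of users' data.
- Aligns with PCI DSS 4.0, GDPR Article 32, and SOC 2 logging controls.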
- Supports privacy-by-design principles.
Conclusion
In summary, integrating Aho–Corasick-based log interception with deterministic tokenization gives you a practical building block for secure log management. The combination provides high-speed multi-pattern detection and generates deterministic correlation tokens, which helps keep sensitive data (PII, payment info) out of system logs and supports compliance efforts. Because it plugs into Spring Boot and Logback, the technique is a good fit for enterprise-scale Java microservices that handle regulated data.
Opinions expressed by DZone contributors are their own.