DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Please enter at least three characters to search
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Because the DevOps movement has redefined engineering responsibilities, SREs now have to become stewards of observability strategy.

Apache Cassandra combines the benefits of major NoSQL databases to support data management needs not covered by traditional RDBMS vendors.

The software you build is only as secure as the code that powers it. Learn how malicious code creeps into your software supply chain.

Generative AI has transformed nearly every industry. How can you leverage GenAI to improve your productivity and efficiency?

Related

  • Cryptography Module in Mule 4
  • DataWeave: Play With Dates (Part 1)
  • Tired of Messy Code? Master the Art of Writing Clean Codebases
  • The Long Road to Java Virtual Threads

Trending

  • Introducing Graph Concepts in Java With Eclipse JNoSQL
  • Enforcing Architecture With ArchUnit in Java
  • The Evolution of Scalable and Resilient Container Infrastructure
  • Supervised Fine-Tuning (SFT) on VLMs: From Pre-trained Checkpoints To Tuned Models
  1. DZone
  2. Data Engineering
  3. AI/ML
  4. Algorithm of the Week: Aho-Corasick String Matching Algorithm in Haskell

Algorithm of the Week: Aho-Corasick String Matching Algorithm in Haskell

By 
Swizec Teller user avatar
Swizec Teller
·
Mar. 19, 13 · Interview
Likes (0)
Comment
Save
Tweet
Share
21.5K Views

Join the DZone community and get the full member experience.

Join For Free
let’s say you have a large piece of text and a dictionary of keywords. how do you quickly locate all the keywords?

aho-corasick algorithm diagram

aho-corasick algorithm diagram

well, there are many ways really, you could even iterate through the whole thing and compare words to keywords. but it turns out that’s going to be very slow. at least o(n_keywords * n_words) complexity. essentially you’re making as many passes over the text as your dictionary is big.

in 1975 a couple of ibm researchers – alfred aho and margaret corasick – discovered an algorithm that can do this in a single pass. the aho-corasick string matching algorithm .

i implemented it in haskell and it takes 0.005s to find 8 different keywords in oscar wilde’s the nightingale and the rose – a 12kb text.

a quick naive keyword search implemented in python takes 0.023s . not a big difference practically speaking, but imagine  a situation with megabytes of text and thousands of words in the dictionary. the authors mention printing out the result as a major bottleneck in their assessment of the algorithm.

yep, printing .

the aho-corasick algorithm

at the core of this algorithm are three functions:

the three functions of aho-corasick algorithm

the three functions of aho-corasick algorithm

  • a parser based on a state machine, which maps (state, char) pairs to states and occasionally emits an output. this is called the goto function
  • a failure function, which tells the goto function which state to jump into when the character it just read doesn’t match anything
  • an output function, which maps states to outputs – potentially more than one per state

the algorithm works in two stages. it will first construct the goto, failure and output functions. the complexity of this operation hinges solely on the size of our dictionary. then it iterates over the input text to produce all the matches.

using state machines for parsing text is a well known trick – the real genius of this algorithm rests in that failure function if you ask me. it makes lateral transitions between states when the algorithm climbs itself into a wall.

say you have she and hers in the dictionary.

the goto machine eats your input string one character at the time. let’s say it’s already read s h . the next input is an e so it outputs she and reaches a final state. next it reads an r , but the state didn’t expect any more inputs, so the failure function puts us on the path towards hers .

this is a bit tricky to explain in text, i suggest you look at the picture from the original article and look at what’s happening.

my haskell implementation

the first implementation i tried, relied on manully mapping inputs to outputs for the goto, failure and output functions by using pattern recognition. not very pretty, extremely hardcoded, but it worked and was easy to make.

building the functions dynamically proved a bit trickier.

type goto = map (int, char) int
type failure = map int int
type output = map int [string]

first off, we build the goto function.

-- builds the goto function
build_goto::goto -> string -> (goto, string)
build_goto m s = (add_one 0 m s, s)
 
-- adds one string to goto function
add_one::int -> goto -> [char] -> goto
add_one _  m [] = m
add_one state m (c:rest)
  | member key m = add_one (frommaybe 0 $ map.lookup key m) m rest
  | otherwise = add_one max (map.insert key max m) rest
  where key = (state, c)
        max = (size m)+1

essentially this builds a flattened prefix tree in a hashmap of (state, char) pairs mapping to the next state. it makes sure to avoid adding new edges to the three as much as possible.

the reason it’s not simply a prefix tree are those lateral transitions; doing them in a tree would require backtracking and repeating of steps, so we haven’t achieved anything.

once we have the goto function, building the output is trivial.

-- builds the output function
build_output::(?m::goto) => [string] -> output
build_output [] = empty
build_output (s:rest) = map.insert (fin 0 s)
                          (list.filter (\x -> elem x dictionary) $ list.tails s) $
                          build_output rest
 
-- returns the state in which an input string ends without using failures
fin::(?m::goto) => int -> [char] -> int
fin state [] = state
fin state (c:rest) = fin next rest
  where next = frommaybe 0 $ map.lookup (state, c) ?m

we are essentially going over the dictionary, finding the final state for each word and building a hash table mapping final states to their outputs.

building the failure function was trickiest, because we need a way to iterate over the depths at which nodes are position in the goto state machine. but we threw that info away by using a hashmap.

-- tells us which nodes in the goto state machine are at which traversal depth
nodes_at_depths::(?m::goto) => [[int]]
nodes_at_depths =
  list.map (\i ->
                  list.filter (>0) $
                  list.map (\l -> if i < length l then l!!i else -1)
                  paths)
           [0..(maximum $ list.map length paths)-1]
  where paths = list.map (path 0) dictionary

we now have a list of lists, that tells us at which depth certain nodes are.

-- builds the failure function
build_fail::(?m::goto) => [[int]] -> int -> failure
build_fail nodes 0 = fst $
                  mapaccuml (\f state ->
                              (map.insert state 0 f, state))
                  empty (nodes!!0)
build_fail nodes d = fst $
                  mapaccuml (\f state ->
                              (map.insert state (decide_fail state lower) f, state))
                  lower (nodes!!d)
  where lower = build_fail nodes (d-1)
 
-- inner step of building the failure function
decide_fail::(?m::goto) => int -> failure -> int
decide_fail state lower = findwithdefault 0 (s, c) ?m
  where (s', c) = key' state $ assocs ?m
        s = findwithdefault 0 s' lower
 
-- gives us the key associated with a certain state (how to get there)
key'::int -> [((int, char), int)] -> (int, char)
key' _ [] = (-1, '_') -- this is ugly, being of maybe type would be better
key' state ((k, v):rest)
  | state == v = k
  | otherwise = key' state rest

here we are going over the list of nodes at depths and deciding what the failure should be for each depth based on the failures of depth-1. at depth zero, all failures go to the zeroth state.

an important part of this process was inverting the goto hashmap so values point to keys, which is essentially what the key’ function does.

finally, we can use the whole algorithm like this:

main = do
  let ?m = fst $ mapaccuml build_goto empty dictionary
  let ?f = build_fail nodes_at_depths $ (length $ nodes_at_depths)-1
      ?out = build_output dictionary
 
  print $ ahocorasick text

a bit more involved than the usual example of haskell found online, it’s still pretty cool :)

you can see the whole code on github here .






Algorithm Haskell (programming language) Strings Data Types

Published at DZone with permission of Swizec Teller, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • Cryptography Module in Mule 4
  • DataWeave: Play With Dates (Part 1)
  • Tired of Messy Code? Master the Art of Writing Clean Codebases
  • The Long Road to Java Virtual Threads

Partner Resources

×

Comments
Oops! Something Went Wrong

The likes didn't load as expected. Please refresh the page and try again.

ABOUT US

  • About DZone
  • Support and feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends:

Likes
There are no likes...yet! 👀
Be the first to like this post!
It looks like you're not logged in.
Sign in to see who liked this post!