Over a million developers have joined DZone.

Haskell: An Impressively Non-Performant Union Find

· Performance Zone

See Gartner’s latest research on the application performance monitoring landscape and how APM suites are becoming more and more critical to the business, brought to you in partnership with AppDynamics.


I’ve spent the best part of the last day debugging a clustering algorithm I wrote as part of the Algorithms 2 course, eventually coming to the conclusion that the union find data structure I was using wasn’t working as expected.

In our algorithm we’re trying to group together points which are ‘close’ to each other and the data structure is particular useful for doing that.

To paraphrase from my previous post about how we use the union find data structure:

We start out with n connected components i.e. every point is in its own connected component.

We then merge these components together as calculate the neighbours of each point until we’ve iterated through all the points and have grouped all the points into the appropriate components.

I came across 3 libraries which implement this data structure – union-find, equivalence and persistent-equivalence.

union-find seemed like it had the easiest API to understand so I plugged it into my program only to eventually realise that it wasn’t putting the points into components as I expected.

I eventually narrowed the problem down to the following example:

> let uf = emptyEquivalence (0,9)
> components $ equate 0 1 uf
> components $ equate 8 9 $ equate 0 1 $ uf
> components $ equate 0 8 $ equate 8 9 $ equate 0 1 $ uf

We start out with a union-find where every point is in its own component. The next line puts points ’0′ and ’1′ into the same component which it does by making indexes ’0′ and ’1′ of the array both have the same value, in this case 0, which is known as the component’s leader.

We continue doing that for points ’8′ and ’9′ which works fine and our union find now consists of 8 components – the ones with leaders 8 & 0 which have two elements and then ones with leaders 2,3,4,5,6 & 7 which only contain themselves.

Things go wrong on our next step where we try to join nodes ’0′ and ’8′. As I understand it what should happen here is that all the points connected to either ’0′ or ’8′ should end up in the same component so we should have a component containing points ’0′, ’1′, ’8′ and ’9′ but ’9′ has been missed off in this case.

The implementation is deliberately written to work like this so I thought I’d try writing my own version based on the following Ruby version:

class UnionFind
  def initialize(n)
    @leaders = 1.upto(n).inject([]) { |leaders, i| leaders[i] = i; leaders }
  def connected?(id1,id2)
    @leaders[id1] == @leaders[id2]
  def union(id1,id2)
    leader_1, leader_2 = @leaders[id1], @leaders[id2]
    @leaders.map! {|i| (i == leader_1) ? leader_2 : i }

This is my Haskell equivalent which I adapted from the union-find one that I mentioned above:

module Leaders (UnionSet, create, components, numberOfComponents, indexes, inSameComponent, union) where
import Control.Concurrent.MVar
import Control.Monad
import Data.Array.Diff as ArrayDiff
import Data.IORef
import qualified Data.List
import Data.Maybe
import System.IO.Unsafe
import qualified Data.Set as Set
arrayFrom :: (IArray a e, Ix i) => (i,i) -> (i -> e) -> a i e
arrayFrom rng f = array rng [ (x, f x) | x <- range rng ]
ref :: a -> IORef a
ref x = unsafePerformIO (newIORef x)
data UnionSet i = UnionSet { leaders :: IORef (DiffArray i i) }
create :: Ix i => (i, i) -> UnionSet i
create is = UnionSet (ref (arrayFrom is id))
extractComponents :: Ix i => DiffArray i i -> [(i, i)]    
extractComponents  = Set.toList . Set.fromList . ArrayDiff.assocs
components :: Ix i => UnionSet i -> [(i,i)]
components (UnionSet leaders) = unsafePerformIO $ do
    l <- readIORef leaders
    return (extractComponents l)
numberOfComponents :: Ix i => UnionSet i -> Int
numberOfComponents (UnionSet leaders) = unsafePerformIO $ do
    l <- readIORef leaders
    return (length $ extractComponents l) 
indexes :: Ix i => UnionSet i -> [(i,i)]
indexes (UnionSet leaders) = unsafePerformIO $ do
    l <- readIORef leaders
    return (ArrayDiff.assocs l)       
inSameComponent :: Ix i => UnionSet i -> i -> i -> Bool
inSameComponent (UnionSet leaders) x y = unsafePerformIO $ do
    l <- readIORef leaders
    return (l ! x == l ! y)
union x y (UnionSet leaders)  = unsafePerformIO $ do
    ls <- readIORef leaders
    let leader1 = ls ! x 
        leader2 = ls ! y
        newLeaders = map (\(index, value) -> (index, leader2)) . filter (\(index, value) -> value == leader1) $ assocs ls        
    modifyIORef leaders (\l -> l // newLeaders)
    return $ UnionSet leaders

We can recreate the above example like so:

> indexes $ Leaders.union 0 8 $ Leaders.union 8 9 $ Leaders.union 0 1 $ create (0,9)

Unfortunately it takes 44 seconds to execute which is mainly down to the call to assocs on line 46. assocs gives us a list of all the indexes and their corresponding values which we use to work out which indexes need to be updated with a new leader.

The rest of the code is mostly boiler plate around getting the array out of the IORef. The IORef allows us to have a mutable array in this instance. There is a page on the c2 wiki which explains how to use IORef in more detail.

Although using the DiffArray allows us to provide a pure external interface around its use, it is known to be 10-100x slower than an MArray.

I’ve been playing around with a version of the union-find data structure which makes use of an MArray instead and has decreased the execution time to 34 seconds.

Unless anyone has any ideas for how I can get this to perform more quickly I’m thinking that perhaps an array isn’t a good choice of underlying data structure at least when using Haskell.

The Performance Zone is brought to you in partnership with AppDynamics.  See Gartner’s latest research on the application performance monitoring landscape and how APM suites are becoming more and more critical to the business.


Published at DZone with permission of Mark Needham, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

The best of DZone straight to your inbox.

Please provide a valid email address.

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}