Yes, it’s this time of the year again! I started a little tradition a year ago with Stalking a Hylomorphism in the Wild. This year I was reminded of the Advent of Code by a tweet with this succint C++ program:

This piece of code is probably unreadable to a regular C++ programmer, but makes perfect sense to a Haskell programmer.

Here’s the description of the problem: You are given a list of equal-length strings. Every string is different, but two of these strings differ only by one character. Find these two strings and return their matching part. For instance, if the two strings were “abcd” and “abxd”, you would return “abd”.

What makes this problem particularly interesting is its potential application to a much more practical task of matching strands of DNA while looking for mutations. I decided to explore the problem a little beyond the brute force approach. And, of course, I had a hunch that I might encounter my favorite wild beast–the hylomorphism.

Brute force approach

First things first. Let’s do the boring stuff: read the file and split it into lines, which are the strings we are supposed to process. So here it is:

main = do
  txt <- readFile "day2.txt"
  let cs = lines txt
  print $ findMatch cs

The real work is done by the function findMatch, which takes a list of strings and produces the answer, which is a single string.

findMatch :: [String] -> String

First, let’s define a function that calculates the distance between any two strings.

distance :: (String, String) -> Int

We’ll define the distance as the count of mismatched characters.

Here’s the idea: We have to compare strings (which, let me remind you, are of equal length) character by character. Strings are lists of characters. The first step is to take two strings and zip them together, producing a list of pairs of characters. In fact we can combine the zipping with the next operation–in this case, comparison for inequality, (/=)–using the library function zipWith. However, zipWith is defined to act on two lists, and we will want it to act on a pair of lists–a subtle distinction, which can be easily overcome by applying uncurry:

uncurry :: (a -> b -> c) -> ((a, b) -> c)

which turns a function of two arguments into a function that takes a pair. Here’s how we use it:

uncurry (zipWith (/=))

The comparison operator (/=) produces a Boolean result, True or False. We want to count the number of differences, so we’ll covert True to one, and False to zero:

fromBool :: Num a => Bool -> a
fromBool False = 0
fromBool True  = 1

(Notice that such subtleties as the difference between Bool and Int are blisfully ignored in C++.)

Finally, we’ll sum all the ones using sum. Altogether we have:

distance = sum . fmap fromBool . uncurry (zipWith (/=))

Now that we know how to find the distance between any two strings, we’ll just apply it to all possible pairs of strings. To generate all pairs, we’ll use list comprehension:

let ps = [(s1, s2) | s1 <- ss, s2 <- ss]

(In C++ code, this was done by cartesian_product.)

Our goal is to find the pair whose distance is exactly one. To this end, we’ll apply the appropriate filter:

filter ((== 1) . distance) ps

For our purposes, we’ll assume that there is exactly one such pair (if there isn’t one, we are willing to let the program fail with a fatal exception).

(s, s') = head $ filter ((== 1) . distance) ps

The final step is to remove the mismatched character:

filter (uncurry (==)) $ zip s s'

We use our friend uncurry again, because the equality operator (==) expects two arguments, and we are calling it with a pair of arguments. The result of filtering is a list of identical pairs. We’ll fmap fst to pick the first components.

findMatch :: [String] -> String
findMatch ss = 
  let ps = [(s1, s2) | s1 <- ss, s2 <- ss]
      (s, s') = head $ filter ((== 1) . distance) ps
  in fmap fst $ filter (uncurry (==)) $ zip s s'

This program produces the correct result and we could stop right here. But that wouldn’t be much fun, would it? Besides, it’s possible that other algorithms could perform better, or be more flexible when applied to a more general problem.

Data-driven approach

The main problem with our brute-force approach is that we are comparing everything with everything. As we increase the number of input strings, the number of comparisons grows like a factorial. There is a standard way of cutting down on the number of comparison: organizing the input into a neat data structure.

We are comparing strings, which are lists of characters, and list comparison is done recursively. Assume that you know that two strings share a prefix. Compare the next character. If it’s equal in both strings, recurse. If it’s not, we have a single character fault. The rest of the two strings must now match perfectly to be considered a solution. So the best data structure for this kind of algorithm should batch together strings with equal prefixes. Such a data structure is called a prefix tree, or a trie (pronounced try).

At every level of our prefix tree we’ll branch based on the current character (so the maximum branching factor is, in our case, 26). We’ll record the character, the count of strings that share the prefix that led us there, and the child trie storing all the suffixes.

data Trie = Trie [(Char, Int, Trie)]
  deriving (Show, Eq)

Here’s an example of a trie that stores just two strings, “abcd” and “abxd”. It branches after b.

   a 2
   b 2
c 1    x 1
d 1    d 1

When inserting a string into a trie, we recurse both on the characters of the string and the list of branches. When we find a branch with the matching character, we increment its count and insert the rest of the string into its child trie. If we run out of branches, we create a new one based on the current character, give it the count one, and the child trie with the rest of the string:

insertS :: Trie -> String -> Trie
insertS t "" = t
insertS (Trie bs) s = Trie (inS bs s)
  where
    inS ((x, n, t) : bs) (c : cs) =
      if c == x 
      then (c, n + 1, insertS t cs) : bs
      else (x, n, t) : inS bs (c : cs)
    inS [] (c : cs) = [(c, 1, insertS (Trie []) cs)]

We convert our input to a trie by inserting all the strings into an (initially empty) trie:

mkTrie :: [String] -> Trie
mkTrie = foldl insertS (Trie [])

Of course, there are many optimizations we could use, if we were to run this algorithm on big data. For instance, we could compress the branches as is done in radix trees, or we could sort the branches alphabetically. I won’t do it here.

I won’t pretend that this implementation is simple and elegant. And it will get even worse before it gets better. The problem is that we are dealing explicitly with recursion in multiple dimensions. We recurse over the input string, the list of branches at each node, as well as the child trie. That’s a lot of recursion to keep track of–all at once.

Now brace yourself: We have to traverse the trie starting from the root. At every branch we check the prefix count: if it’s greater than one, we have more than one string going down, so we recurse into the child trie. But there is also another possibility: we can allow to have a mismatch at the current level. The current characters may be different but, since we allow only one mismatch, the rest of the strings have to match exactly. That’s what the function exact does. Notice that exact t is used inside foldMap, which is a version of fold that works on monoids–here, on strings.

match1 :: Trie -> [String]
match1 (Trie bs) = go bs
  where
    go :: [(Char, Int, Trie)] -> [String]
    go ((x, n, t) : bs) = 
      let a1s = if n > 1 
                then fmap (x:) $ match1 t
                else []
          a2s = foldMap (exact t) bs
          a3s = go bs -- recurse over list
      in a1s ++ a2s ++ a3s
    go [] = []
    exact t (_, _, t') = matchAll t t'

Here’s the function that finds all exact matches between two tries. It does it by generating all pairs of branches in which top characters match, and then recursing down.

matchAll :: Trie -> Trie -> [String]
matchAll (Trie bs) (Trie bs') = mAll bs bs'
  where
    mAll :: [(Char, Int, Trie)] -> [(Char, Int, Trie)] -> [String]
    mAll [] [] = [""]
    mAll bs bs' = 
      let ps = [ (c, t, t') 
               | (c,  _,  t)  <- bs
               , (c', _', t') <- bs'
               , c == c']
      in foldMap go ps
    go (c, t, t') = fmap (c:) (matchAll t t')

When mAll reaches the leaves of the trie, it returns a singleton list containing an empty string. Subsequent actions of fmap (c:) will prepend characters to this string.

Since we are expecting exactly one solution to the problem, we’ll extract it using head:

findMatch1 :: [String] -> String
findMatch1 cs = head $ match1 (mkTrie cs)

Recursion schemes

As you hone your functional programming skills, you realize that explicit recursion is to be avoided at all cost. There is a small number of recursive patterns that have been codified, and they can be used to solve the majority of recursion problems (for some categorical background, see F-Algebras). Recursion itself can be expressed in Haskell as a data structure: a fixed point of a functor:

newtype Fix f = In { out :: f (Fix f) }

In particular, our trie can be generated from the following functor:

data TrieF a = TrieF [(Char, a)]
  deriving (Show, Functor)

Notice how I have replaced the recursive call to the Trie type constructor with the free type variable a. The functor in question defines the structure of a single node, leaving holes marked by the occurrences of a for the recursion. When these holes are filled with full blown tries, as in the definition of the fixed point, we recover the complete trie.

I have also made one more simplification by getting rid of the Int in every node. This is because, in the recursion scheme I’m going to use, the folding of the trie proceeds bottom-up, rather than top-down, so the multiplicity information can be passed upwards.

The main advantage of recursion schemes is that they let us use simpler, non-recursive building blocks such as algebras and coalgebras. Let’s start with a simple coalgebra that lets us build a trie from a list of strings. A coalgebra is a fancy name for a particular type of function:

type Coalgebra f x = x -> f x

Think of x as a type for a seed from which one can grow a tree. A colagebra tells us how to use this seed to create a single node described by the functor f and populate it with (presumably smaller) seeds. We can then pass this coalgebra to a simple algorithm, which will recursively expand the seeds. This algorithm is called the anamorphism:

ana :: Functor f => Coalgebra f a -> a -> Fix f
ana coa = In . fmap (ana coa) . coa

Let’s see how we can apply it to the task of building a trie. The seed in our case is a list of strings (as per the definition of our problem, we’ll assume they are all equal length). We start by grouping these strings into bunches of strings that start with the same character. There is a library function called groupWith that does exactly that. We have to import the right library:

import GHC.Exts (groupWith)

This is the signature of the function:

groupWith :: Ord b => (a -> b) -> [a] -> [[a]]

It takes a function a -> b that converts each list element to a type that supports comparison (as per the typeclass Ord), and partitions the input into lists that compare equal under this particular ordering. In our case, we are going to extract the first character from a string using head and bunch together all strings that share that first character.

let sss = groupWith head ss

The tails of those strings will serve as seeds for the next tier of the trie. Eventually the strings will be shortened to nothing, triggering the end of recursion.

fromList :: Coalgebra TrieF [String]
fromList ss =
  -- are strings empty? (checking one is enough)
  if null (head ss) 
  then TrieF [] -- leaf
  else
    let sss = groupWith head ss
    in TrieF $ fmap mkBranch sss

The function mkBranch takes a bunch of strings sharing the same first character and creates a branch seeded with the suffixes of those strings.

mkBranch :: [String] -> (Char, [String])
mkBranch sss =
  let c = head (head sss) -- they're all the same
  in (c, fmap tail sss)

Notice that we have completely avoided explicit recursion.

The next step is a little harder. We have to fold the trie. Again, all we have to define is a step that folds a single node whose children have already been folded. This step is defined by an algebra:

type Algebra f x = f x -> x

Just as the type x described the seed in a coalgebra, here it describes the accumulator–the result of the folding of a recursive data structure.

We pass this algebra to a special algorithm called a catamorphism that takes care of the recursion:

cata :: Functor f => Algebra f a -> Fix f -> a
cata alg = alg . fmap (cata alg) . out

Notice that the folding proceeds from the bottom up: the algebra assumes that all the children have already been folded.

The hardest part of designing an algebra is figuring out what information needs to be passed up in the accumulator. We obviously need to return the final result which, in our case, is the list of strings with one mismatched character. But when we are in the middle of a trie, we have to keep in mind that the mismatch may still happen above us. So we also need a list of strings that may serve as suffixes when the mismatch occurs. We have to keep them all, because they might be matched later with strings from other branches.

In other words, we need to be accumulating two lists of strings. The first list accumulates all suffixes for future matching, the second accumulates the results: strings with one mismatch (after the mismatch has been removed). We therefore should implement the following algebra:

Algebra TrieF ([String], [String])

To understand the implementation of this algebra, consider a single node in a trie. It’s a list of branches, or pairs, whose first component is the current character, and the second a pair of lists of strings–the result of folding a child trie. The first list contains all the suffixes gathered from lower levels of the trie. The second list contains partial results: strings that were matched modulo single-character defect.

As an example, suppose that you have a node with two branches:

[ ('a', (["bcd", "efg"], ["pq"]))
, ('x', (["bcd"],        []))]

First we prepend the current character to strings in both lists using the function prep with the following signature:

prep :: (Char, ([String], [String])) -> ([String], [String])

This way we convert each branch to a pair of lists.

[ (["abcd", "aefg"], ["apq"])
, (["xbcd"],         [])]

We then merge all the lists of suffixes and, separately, all the lists of partial results, across all branches. In the example above, we concatenate the lists in the two columns.

(["abcd", "aefg", "xbcd"], ["apq"])

Now we have to construct new partial results. To do this, we create another list of accumulated strings from all branches (this time without prefixing them):

ss = concat $ fmap (fst . snd) bs

In our case, this would be the list:

["bcd", "efg", "bcd"]

To detect duplicate strings, we’ll insert them into a multiset, which we’ll implement as a map. We need to import the appropriate library:

import qualified Data.Map as M

and define a multiset Counts as:

type Counts a = M.Map a Int

Every time we add a new item, we increment the count:

add :: Ord a => Counts a -> a -> Counts a
add cs c = M.insertWith (+) c 1 cs

To insert all strings from a list, we use a fold:

mset = foldl add M.empty ss

We are only interested in items that have multiplicity greater than one. We can filter them and extract their keys:

dups = M.keys $ M.filter (> 1) mset

Here’s the complete algebra:

accum :: Algebra TrieF ([String], [String])
accum (TrieF []) = ([""], [])
accum (TrieF bs) = -- b :: (Char, ([String], [String]))
    let -- prepend chars to string in both lists
        pss = unzip $ fmap prep bs
        (ss1, ss2) = both concat pss
        -- find duplicates
        ss = concat $ fmap (fst . snd) bs
        mset = foldl add M.empty ss
        dups = M.keys $ M.filter (> 1) mset
     in (ss1, dups ++ ss2)
  where
      prep :: (Char, ([String], [String])) -> ([String], [String])
      prep (c, pss) = both (fmap (c:)) pss

I used a handy helper function that applies a function to both components of a pair:

both :: (a -> b) -> (a, a) -> (b, b)
both f (x, y) = (f x, f y)

And now for the grand finale: Since we create the trie using an anamorphism only to immediately fold it using a catamorphism, why don’t we cut the middle person? Indeed, there is an algorithm called the hylomorphism that does just that. It takes the algebra, the coalgebra, and the seed, and returns the fully charged accumulator.

hylo :: Functor f => Algebra f a -> Coalgebra f b -> b -> a
hylo alg coa = alg . fmap (hylo alg coa) . coa

And this is how we extract and print the final result:

print $ head $ snd $ hylo accum fromList cs

Conclusion

The advantage of using the hylomorphism is that, because of Haskell’s laziness, the trie is never wholly constructed, and therefore doesn’t require large amounts of memory. At every step enough of the data structure is created as is needed for immediate computation; then it is promptly released. In fact, the definition of the data structure is only there to guide the steps of the algorithm. We use a data structure as a control structure. Since data structures are much easier to visualize and debug than control structures, it’s almost always advantageous to use them to drive computation.

In fact, you may notice that, in the very last step of the computation, our accumulator recreates the original list of strings (actually, because of laziness, they are never fully reconstructed, but that’s not the point). In reality, the characters in the strings are never copied–the whole algorithm is just a choreographed dance of internal pointers, or iterators. But that’s exactly what happens in the original C++ algorithm. We just use a higher level of abstraction to describe this dance.

I haven’t looked at the performance of various implementations. Feel free to test it and report the results. The code is available on github.

Acknowledgments

I’m grateful to the participants of the Seattle Haskell Users’ Group for many helpful comments during my presentation.

Advertisements