Yes, it’s this time of the year again! I started a little tradition a year ago with Stalking a Hylomorphism in the Wild. This year I was reminded of the Advent of Code by a tweet with a succinct C++ program.
That piece of code is probably unreadable to a regular C++ programmer, but makes perfect sense to a Haskell programmer.
Here’s the description of the problem: You are given a list of equal-length strings. Every string is different, but two of these strings differ only by one character. Find these two strings and return their matching part. For instance, if the two strings were “abcd” and “abxd”, you would return “abd”.
What makes this problem particularly interesting is its potential application to a much more practical task of matching strands of DNA while looking for mutations. I decided to explore the problem a little beyond the brute force approach. And, of course, I had a hunch that I might encounter my favorite wild beast–the hylomorphism.
Brute force approach
First things first. Let’s do the boring stuff: read the file and split it into lines, which are the strings we are supposed to process. So here it is:
main = do
  txt <- readFile "day2.txt"
  let cs = lines txt
  print $ findMatch cs
The real work is done by the function findMatch, which takes a list of strings and produces the answer, which is a single string.
findMatch :: [String] -> String
First, let’s define a function that calculates the distance between any two strings.
distance :: (String, String) -> Int
We’ll define the distance as the count of mismatched characters.
Here’s the idea: We have to compare strings (which, let me remind you, are of equal length) character by character. Strings are lists of characters. The first step is to take two strings and zip them together, producing a list of pairs of characters. In fact we can combine the zipping with the next operation–in this case, comparison for inequality, (/=)–using the library function zipWith. However, zipWith is defined to act on two lists, and we will want it to act on a pair of lists–a subtle distinction, which can be easily overcome by applying uncurry:
uncurry :: (a -> b -> c) -> ((a, b) -> c)
which turns a function of two arguments into a function that takes a pair. Here’s how we use it:
uncurry (zipWith (/=))
The comparison operator (/=) produces a Boolean result, True or False. We want to count the number of differences, so we’ll convert True to one, and False to zero:
fromBool :: Num a => Bool -> a
fromBool False = 0
fromBool True = 1
(Notice that such subtleties as the difference between Bool and Int are blissfully ignored in C++.)
Finally, we’ll sum all the ones using sum. Altogether we have:
distance = sum . fmap fromBool . uncurry (zipWith (/=))
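To see the pipeline at work, here is a minimal sketch that names the intermediate step. The helper mismatches is introduced only for illustration and is not part of the original program. The sketch assumes both strings have the same length, since zipWith silently truncates to the shorter list.

```haskell
-- Convert a Boolean to a number, as in the text.
fromBool :: Num a => Bool -> a
fromBool False = 0
fromBool True  = 1

-- Hypothetical helper: position-by-position inequality test.
mismatches :: (String, String) -> [Bool]
mismatches = uncurry (zipWith (/=))

-- Count of mismatched positions (equal-length strings assumed).
distance :: (String, String) -> Int
distance = sum . fmap fromBool . mismatches
```

For the pair ("abcd", "abxd"), mismatches yields [False,False,True,False] and distance yields 1.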
Now that we know how to find the distance between any two strings, we’ll just apply it to all possible pairs of strings. To generate all pairs, we’ll use a list comprehension:
let ps = [(s1, s2) | s1 <- ss, s2 <- ss]
(In C++ code, this was done by cartesian_product.)
Our goal is to find the pair whose distance is exactly one. To this end, we’ll apply the appropriate filter:
filter ((== 1) . distance) ps
For our purposes, we’ll assume that there is exactly one such pair (if there isn’t one, we are willing to let the program fail with a fatal exception).
(s, s') = head $ filter ((== 1) . distance) ps
The final step is to remove the mismatched character:
filter (uncurry (==)) $ zip s s'
We use our friend uncurry again, because the equality operator (==) expects two arguments, and we are calling it with a pair of arguments. The result of filtering is a list of identical pairs. We’ll fmap fst to pick the first components.
findMatch :: [String] -> String
findMatch ss =
  let ps = [(s1, s2) | s1 <- ss, s2 <- ss]
      (s, s') = head $ filter ((== 1) . distance) ps
  in fmap fst $ filter (uncurry (==)) $ zip s s'
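The comprehension above pairs every string with itself and visits each pair in both orders. That is harmless here, because a string is at distance zero from itself, but it does more work than necessary. The following is a sketch of one possible variant, not the original program: it enumerates each unordered pair once and returns a Maybe instead of failing in head. The names pairs and findMatchSafe are made up for this illustration.

```haskell
import Data.List (tails)

fromBool :: Num a => Bool -> a
fromBool False = 0
fromBool True  = 1

-- Distance as defined in the text: the number of mismatched positions.
distance :: (String, String) -> Int
distance = sum . fmap fromBool . uncurry (zipWith (/=))

-- Hypothetical helper: every unordered pair of distinct positions, once.
pairs :: [a] -> [(a, a)]
pairs xs = [(x, y) | (x : rest) <- tails xs, y <- rest]

-- Hypothetical variant of findMatch that reports failure as Nothing.
findMatchSafe :: [String] -> Maybe String
findMatchSafe ss =
  case filter ((== 1) . distance) (pairs ss) of
    []          -> Nothing
    (s, s') : _ -> Just (fmap fst $ filter (uncurry (==)) $ zip s s')
```

With the input ["abcd", "efgh", "abxd"] this returns Just "abd"; with no pair at distance one it returns Nothing.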
This program produces the correct result and we could stop right here. But that wouldn’t be much fun, would it? Besides, it’s possible that other algorithms could perform better, or be more flexible when applied to a more general problem.
Data-driven approach
The main problem with our brute-force approach is that we are comparing everything with everything. As we increase the number of input strings, the number of comparisons grows quadratically: n strings produce n² ordered pairs. There is a standard way of cutting down on the number of comparisons: organizing the input into a neat data structure.
We are comparing strings, which are lists of characters, and list comparison is done recursively. Assume that you know that two strings share a prefix. Compare the next character. If it’s equal in both strings, recurse. If it’s not, we have a single character fault. The rest of the two strings must now match perfectly to be considered a solution. So the best data structure for this kind of algorithm should batch together strings with equal prefixes. Such a data structure is called a prefix tree, or a trie (pronounced try).
At every level of our prefix tree we’ll branch based on the current character (so the maximum branching factor is, in our case, 26). We’ll record the character, the count of strings that share the prefix that led us there, and the child trie storing all the suffixes.
data Trie = Trie [(Char, Int, Trie)] deriving (Show, Eq)
Here’s an example of a trie that stores just two strings, “abcd” and “abxd”. It branches after b.
a 2
b 2
c 1   x 1
d 1   d 1
When inserting a string into a trie, we recurse both on the characters of the string and the list of branches. When we find a branch with the matching character, we increment its count and insert the rest of the string into its child trie. If we run out of branches, we create a new one based on the current character, give it the count one, and the child trie with the rest of the string:
insertS :: Trie -> String -> Trie
insertS t "" = t
insertS (Trie bs) s = Trie (inS bs s)
  where
    inS ((x, n, t) : bs) (c : cs) =
      if c == x
      then (c, n + 1, insertS t cs) : bs
      else (x, n, t) : inS bs (c : cs)
    inS [] (c : cs) = [(c, 1, insertS (Trie []) cs)]
We convert our input to a trie by inserting all the strings into an (initially empty) trie:
mkTrie :: [String] -> Trie
mkTrie = foldl insertS (Trie [])
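As a sanity check, here is a small self-contained sketch that builds the trie for “abcd” and “abxd” and compares it with the structure written out by hand. The value twoStrings is a name invented for this example; the other definitions repeat the ones above.

```haskell
data Trie = Trie [(Char, Int, Trie)]
  deriving (Show, Eq)

insertS :: Trie -> String -> Trie
insertS t "" = t
insertS (Trie bs) s = Trie (inS bs s)
  where
    inS ((x, n, t) : rest) (c : cs) =
      if c == x
      then (c, n + 1, insertS t cs) : rest    -- shared prefix: bump the count
      else (x, n, t) : inS rest (c : cs)      -- keep looking in other branches
    inS [] (c : cs) = [(c, 1, insertS (Trie []) cs)]
    inS branches [] = branches                -- unreachable: "" is handled above

mkTrie :: [String] -> Trie
mkTrie = foldl insertS (Trie [])

-- The trie we expect for "abcd" and "abxd", written out by hand.
twoStrings :: Trie
twoStrings =
  Trie [('a', 2, Trie [('b', 2, Trie
    [ ('c', 1, Trie [('d', 1, Trie [])])
    , ('x', 1, Trie [('d', 1, Trie [])])
    ])])]
```

Evaluating mkTrie ["abcd", "abxd"] == twoStrings gives True: the counts are 2 along the shared prefix “ab” and 1 below the branching point.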
Of course, there are many optimizations we could use, if we were to run this algorithm on big data. For instance, we could compress the branches as is done in radix trees, or we could sort the branches alphabetically. I won’t do it here.
I won’t pretend that this implementation is simple and elegant. And it will get even worse before it gets better. The problem is that we are dealing explicitly with recursion in multiple dimensions. We recurse over the input string, the list of branches at each node, as well as the child trie. That’s a lot of recursion to keep track of–all at once.
Now brace yourself: We have to traverse the trie starting from the root. At every branch we check the prefix count: if it’s greater than one, we have more than one string going down, so we recurse into the child trie. But there is also another possibility: we can allow a mismatch at the current level. The current characters may be different but, since we allow only one mismatch, the rest of the strings have to match exactly. That’s what the function exact does. Notice that exact t is used inside foldMap, which is a version of fold that works on monoids–here, on strings.
match1 :: Trie -> [String]
match1 (Trie bs) = go bs
  where
    go :: [(Char, Int, Trie)] -> [String]
    go ((x, n, t) : bs) =
      let a1s = if n > 1
                then fmap (x:) $ match1 t
                else []
          a2s = foldMap (exact t) bs
          a3s = go bs -- recurse over list
      in a1s ++ a2s ++ a3s
    go [] = []
    exact t (_, _, t') = matchAll t t'
Here’s the function that finds all exact matches between two tries. It does it by generating all pairs of branches in which top characters match, and then recursing down.
matchAll :: Trie -> Trie -> [String]
matchAll (Trie bs) (Trie bs') = mAll bs bs'
  where
    mAll :: [(Char, Int, Trie)] -> [(Char, Int, Trie)] -> [String]
    mAll [] [] = [""]
    mAll bs bs' =
      let ps = [ (c, t, t') | (c, _, t) <- bs
                            , (c', _', t') <- bs'
                            , c == c']
      in foldMap go ps
    go (c, t, t') = fmap (c:) (matchAll t t')
When mAll reaches the leaves of the trie, it returns a singleton list containing an empty string. Subsequent actions of fmap (c:) will prepend characters to this string.
Since we are expecting exactly one solution to the problem, we’ll extract it using head:
findMatch1 :: [String] -> String
findMatch1 cs = head $ match1 (mkTrie cs)
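To try the trie-based search end to end, here is a self-contained sketch that repeats the definitions above and swaps head for listToMaybe from Data.Maybe, so that an input without a solution yields Nothing instead of an exception. The name findMatchMaybe is invented for this example, and equal-length input strings are assumed throughout.

```haskell
import Data.Maybe (listToMaybe)

data Trie = Trie [(Char, Int, Trie)]

insertS :: Trie -> String -> Trie
insertS t "" = t
insertS (Trie bs) s = Trie (inS bs s)
  where
    inS ((x, n, t) : rest) (c : cs)
      | c == x    = (c, n + 1, insertS t cs) : rest
      | otherwise = (x, n, t) : inS rest (c : cs)
    inS [] (c : cs) = [(c, 1, insertS (Trie []) cs)]
    inS branches [] = branches  -- unreachable: "" is handled above

mkTrie :: [String] -> Trie
mkTrie = foldl insertS (Trie [])

-- Strings that match with exactly one character dropped.
match1 :: Trie -> [String]
match1 (Trie bs) = go bs
  where
    go ((x, n, t) : rest) =
      let a1s = if n > 1 then fmap (x :) (match1 t) else []  -- mismatch deeper down
          a2s = foldMap (exact t) rest                       -- mismatch right here
          a3s = go rest                                      -- remaining branches
      in a1s ++ a2s ++ a3s
    go [] = []
    exact t (_, _, t') = matchAll t t'

-- Strings stored in both tries.
matchAll :: Trie -> Trie -> [String]
matchAll (Trie [])  (Trie [])  = [""]
matchAll (Trie bs) (Trie bs') =
  concat [ fmap (c :) (matchAll t t')
         | (c, _, t) <- bs, (c', _, t') <- bs', c == c' ]

-- Hypothetical safe variant of findMatch1.
findMatchMaybe :: [String] -> Maybe String
findMatchMaybe = listToMaybe . match1 . mkTrie
```

For ["abcd", "efgh", "abxd"] the result is Just "abd"; for ["abcd", "efgh"] it is Nothing.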
Recursion schemes
As you hone your functional programming skills, you realize that explicit recursion is to be avoided at all costs. There is a small number of recursive patterns that have been codified, and they can be used to solve the majority of recursion problems (for some categorical background, see F-Algebras). Recursion itself can be expressed in Haskell as a data structure: a fixed point of a functor:
newtype Fix f = In { out :: f (Fix f) }
In particular, our trie can be generated from the following functor:
data TrieF a = TrieF [(Char, a)] deriving (Show, Functor)
Notice how I have replaced the recursive call to the Trie type constructor with the free type variable a. The functor in question defines the structure of a single node, leaving holes marked by the occurrences of a for the recursion. When these holes are filled with full-blown tries, as in the definition of the fixed point, we recover the complete trie.
I have also made one more simplification by getting rid of the Int in every node. This is because, in the recursion scheme I’m going to use, the folding of the trie proceeds bottom-up, rather than top-down, so the multiplicity information can be passed upwards.
The main advantage of recursion schemes is that they let us use simpler, non-recursive building blocks such as algebras and coalgebras. Let’s start with a simple coalgebra that lets us build a trie from a list of strings. A coalgebra is a fancy name for a particular type of function:
type Coalgebra f x = x -> f x
Think of x as a type for a seed from which one can grow a tree. A coalgebra tells us how to use this seed to create a single node described by the functor f and populate it with (presumably smaller) seeds. We can then pass this coalgebra to a simple algorithm, which will recursively expand the seeds. This algorithm is called the anamorphism:
ana :: Functor f => Coalgebra f a -> a -> Fix f
ana coa = In . fmap (ana coa) . coa
Let’s see how we can apply it to the task of building a trie. The seed in our case is a list of strings (as per the definition of our problem, we’ll assume they are all equal length). We start by grouping these strings into bunches of strings that start with the same character. There is a library function called groupWith that does exactly that. We have to import the right library:
import GHC.Exts (groupWith)
This is the signature of the function:
groupWith :: Ord b => (a -> b) -> [a] -> [[a]]
It takes a function a -> b that converts each list element to a type that supports comparison (as per the typeclass Ord), and partitions the input into lists that compare equal under this particular ordering. In our case, we are going to extract the first character from a string using head and bunch together all strings that share that first character.
let sss = groupWith head ss
The tails of those strings will serve as seeds for the next tier of the trie. Eventually the strings will be shortened to nothing, triggering the end of recursion.
fromList :: Coalgebra TrieF [String]
fromList ss =
  -- are strings empty? (checking one is enough)
  if null (head ss)
  then TrieF [] -- leaf
  else
    let sss = groupWith head ss
    in TrieF $ fmap mkBranch sss
The function mkBranch takes a bunch of strings sharing the same first character and creates a branch seeded with the suffixes of those strings.
mkBranch :: [String] -> (Char, [String])
mkBranch sss =
  let c = head (head sss) -- they're all the same
  in (c, fmap tail sss)
Notice that we have completely avoided explicit recursion.
The next step is a little harder. We have to fold the trie. Again, all we have to define is a step that folds a single node whose children have already been folded. This step is defined by an algebra:
type Algebra f x = f x -> x
Just as the type x described the seed in a coalgebra, here it describes the accumulator–the result of the folding of a recursive data structure.
We pass this algebra to a special algorithm called a catamorphism that takes care of the recursion:
cata :: Functor f => Algebra f a -> Fix f -> a
cata alg = alg . fmap (cata alg) . out
Notice that the folding proceeds from the bottom up: the algebra assumes that all the children have already been folded.
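Before tackling the real algebra, it may help to see the two schemes cooperate on something trivial. The sketch below grows a trie with ana and immediately folds it back into a list of strings with cata. The algebra toStrings and the function roundTrip are made-up names for this illustration, and the input is assumed to be a non-empty list of equal-length strings. The DeriveFunctor pragma is needed on compilers that don’t enable it by default.

```haskell
{-# LANGUAGE DeriveFunctor #-}
import GHC.Exts (groupWith)

newtype Fix f = In { out :: f (Fix f) }

data TrieF a = TrieF [(Char, a)] deriving (Show, Functor)

type Coalgebra f x = x -> f x
type Algebra f x = f x -> x

ana :: Functor f => Coalgebra f a -> a -> Fix f
ana coa = In . fmap (ana coa) . coa

cata :: Functor f => Algebra f a -> Fix f -> a
cata alg = alg . fmap (cata alg) . out

-- One layer of the trie: group the strings by their first character.
fromList :: Coalgebra TrieF [String]
fromList ss =
  if null (head ss)
  then TrieF []
  else TrieF $ fmap mkBranch (groupWith head ss)

mkBranch :: [String] -> (Char, [String])
mkBranch sss = (head (head sss), fmap tail sss)

-- Hypothetical algebra: list every string stored below a node.
toStrings :: Algebra TrieF [String]
toStrings (TrieF []) = [""]
toStrings (TrieF bs) = concat [fmap (c :) ss | (c, ss) <- bs]

-- Build the trie, then flatten it again.
roundTrip :: [String] -> [String]
roundTrip = cata toStrings . ana fromList
```

Because groupWith orders the groups by their key, roundTrip ["abxd", "efgh", "abcd"] returns the strings in sorted order: ["abcd", "abxd", "efgh"].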
The hardest part of designing an algebra is figuring out what information needs to be passed up in the accumulator. We obviously need to return the final result which, in our case, is the list of strings with one mismatched character. But when we are in the middle of a trie, we have to keep in mind that the mismatch may still happen above us. So we also need a list of strings that may serve as suffixes when the mismatch occurs. We have to keep them all, because they might be matched later with strings from other branches.
In other words, we need to be accumulating two lists of strings. The first list accumulates all suffixes for future matching, the second accumulates the results: strings with one mismatch (after the mismatch has been removed). We therefore should implement the following algebra:
Algebra TrieF ([String], [String])
To understand the implementation of this algebra, consider a single node in a trie. It’s a list of branches, or pairs, whose first component is the current character, and the second a pair of lists of strings–the result of folding a child trie. The first list contains all the suffixes gathered from lower levels of the trie. The second list contains partial results: strings that were matched modulo single-character defect.
As an example, suppose that you have a node with two branches:
[ ('a', (["bcd", "efg"], ["pq"]))
, ('x', (["bcd"], []))]
First we prepend the current character to strings in both lists using the function prep with the following signature:
prep :: (Char, ([String], [String])) -> ([String], [String])
This way we convert each branch to a pair of lists.
[ (["abcd", "aefg"], ["apq"])
, (["xbcd"], [])]
We then merge all the lists of suffixes and, separately, all the lists of partial results, across all branches. In the example above, we concatenate the lists in the two columns.
(["abcd", "aefg", "xbcd"], ["apq"])
Now we have to construct new partial results. To do this, we create another list of accumulated strings from all branches (this time without prefixing them):
ss = concat $ fmap (fst . snd) bs
In our case, this would be the list:
["bcd", "efg", "bcd"]
To detect duplicate strings, we’ll insert them into a multiset, which we’ll implement as a map. We need to import the appropriate library:
import qualified Data.Map as M
and define a multiset Counts as:
type Counts a = M.Map a Int
Every time we add a new item, we increment the count:
add :: Ord a => Counts a -> a -> Counts a
add cs c = M.insertWith (+) c 1 cs
To insert all strings from a list, we use a fold:
mset = foldl add M.empty ss
We are only interested in items that have multiplicity greater than one. We can filter them and extract their keys:
dups = M.keys $ M.filter (> 1) mset
Here’s the complete algebra:
accum :: Algebra TrieF ([String], [String])
accum (TrieF []) = ([""], [])
accum (TrieF bs) = -- b :: (Char, ([String], [String]))
  let -- prepend chars to string in both lists
      pss = unzip $ fmap prep bs
      (ss1, ss2) = both concat pss
      -- find duplicates
      ss = concat $ fmap (fst . snd) bs
      mset = foldl add M.empty ss
      dups = M.keys $ M.filter (> 1) mset
  in (ss1, dups ++ ss2)
  where
    prep :: (Char, ([String], [String])) -> ([String], [String])
    prep (c, pss) = both (fmap (c:)) pss
I used a handy helper function that applies a function to both components of a pair:
both :: (a -> b) -> (a, a) -> (b, b)
both f (x, y) = (f x, f y)
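To connect the algebra with the two-branch example discussed earlier, here is a self-contained sketch that applies accum to that very node. The value exampleNode is a name invented for this illustration; the remaining definitions repeat the ones above (TrieF is declared without its Functor instance, which this fragment does not need).

```haskell
import qualified Data.Map as M

data TrieF a = TrieF [(Char, a)]

type Algebra f x = f x -> x
type Counts a = M.Map a Int

add :: Ord a => Counts a -> a -> Counts a
add cs c = M.insertWith (+) c 1 cs

both :: (a -> b) -> (a, a) -> (b, b)
both f (x, y) = (f x, f y)

accum :: Algebra TrieF ([String], [String])
accum (TrieF []) = ([""], [])
accum (TrieF bs) =
  let pss        = unzip $ fmap prep bs        -- prefix both lists per branch
      (ss1, ss2) = both concat pss             -- merge across branches
      ss         = concat $ fmap (fst . snd) bs -- unprefixed suffixes
      mset       = foldl add M.empty ss
      dups       = M.keys $ M.filter (> 1) mset -- suffixes seen more than once
  in (ss1, dups ++ ss2)
  where
    prep (c, p) = both (fmap (c :)) p

-- The node with two branches used as the running example in the text.
exampleNode :: TrieF ([String], [String])
exampleNode = TrieF
  [ ('a', (["bcd", "efg"], ["pq"]))
  , ('x', (["bcd"], []))
  ]
```

Evaluating accum exampleNode gives (["abcd","aefg","xbcd"], ["bcd","apq"]): the suffix "bcd" occurs under both 'a' and 'x', so it is reported as a new partial result with the mismatched character removed.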
And now for the grand finale: Since we create the trie using an anamorphism only to immediately fold it using a catamorphism, why don’t we cut out the middle person? Indeed, there is an algorithm called the hylomorphism that does just that. It takes the algebra, the coalgebra, and the seed, and returns the fully charged accumulator.
hylo :: Functor f => Algebra f a -> Coalgebra f b -> b -> a
hylo alg coa = alg . fmap (hylo alg coa) . coa
And this is how we extract and print the final result:
print $ head $ snd $ hylo accum fromList cs
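For readers who want to run the whole thing, here is a compact sketch that puts the pieces together in one module. It is a condensed restatement rather than the code from the repository: the wrapper solve is an invented name, the multiset is built with M.fromListWith instead of folding add, and the input is assumed to be a non-empty list of equal-length strings.

```haskell
{-# LANGUAGE DeriveFunctor #-}
import GHC.Exts (groupWith)
import qualified Data.Map as M

data TrieF a = TrieF [(Char, a)] deriving (Show, Functor)

type Algebra f x = f x -> x
type Coalgebra f x = x -> f x

hylo :: Functor f => Algebra f a -> Coalgebra f b -> b -> a
hylo alg coa = alg . fmap (hylo alg coa) . coa

-- Coalgebra: split the strings by first character, seeding with their tails.
fromList :: Coalgebra TrieF [String]
fromList ss
  | null (head ss) = TrieF []
  | otherwise      = TrieF $ fmap mkBranch (groupWith head ss)
  where
    mkBranch grp = (head (head grp), fmap tail grp)

both :: (a -> b) -> (a, a) -> (b, b)
both f (x, y) = (f x, f y)

-- Algebra: (all suffixes seen so far, matches with one character removed).
accum :: Algebra TrieF ([String], [String])
accum (TrieF []) = ([""], [])
accum (TrieF bs) = (ss1, dups ++ ss2)
  where
    prep (c, pss) = both (fmap (c :)) pss
    (ss1, ss2)    = both concat (unzip (fmap prep bs))
    -- same multiset as folding `add`, built in one call
    mset          = M.fromListWith (+) [(s, 1 :: Int) | (_, (ss, _)) <- bs, s <- ss]
    dups          = M.keys (M.filter (> 1) mset)

-- Hypothetical wrapper: every candidate answer for a non-empty input.
solve :: [String] -> [String]
solve = snd . hylo accum fromList
```

For the input ["abcd", "efgh", "abxd"], solve returns ["abd"], so head (solve cs) plays the role of the print statement above.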
Conclusion
The advantage of using the hylomorphism is that, because of Haskell’s laziness, the trie is never wholly constructed, and therefore doesn’t require large amounts of memory. At every step only as much of the data structure is created as is needed for immediate computation; then it is promptly released. In fact, the definition of the data structure is only there to guide the steps of the algorithm. We use a data structure as a control structure. Since data structures are much easier to visualize and debug than control structures, it’s almost always advantageous to use them to drive computation.
In fact, you may notice that, in the very last step of the computation, our accumulator recreates the original list of strings (actually, because of laziness, they are never fully reconstructed, but that’s not the point). In reality, the characters in the strings are never copied–the whole algorithm is just a choreographed dance of internal pointers, or iterators. But that’s exactly what happens in the original C++ algorithm. We just use a higher level of abstraction to describe this dance.
I haven’t looked at the performance of various implementations. Feel free to test it and report the results. The code is available on GitHub.
Acknowledgments
I’m grateful to the participants of the Seattle Haskell Users’ Group for many helpful comments during my presentation.
December 20, 2018 at 9:32 pm
Well, that was a nice, unexpected Christmas present. I started off reading just out of general interest in a CS term I hadn’t encountered before… and then got drawn in because I think you’ve described the solution to a performance problem I’ve been beating my head against for six months at work — a recursive algorithm where the intermediate data structure is blowing up memory. I knew in theory it shouldn’t be necessary, but had no idea how to get from A to B. Can’t Google what you don’t know to ask about! Thanks for the tip!
December 21, 2018 at 2:19 am
But you are still reading all the strings into memory from a file, aren’t you?
Is the following still your final main?
main = do
  txt <- readFile "day2.txt"
  let cs = lines txt
  print $ findMatch cs
I guess you should use a truly lazy IO like hGetContents for really big data, based on the following links:
http://book.realworldhaskell.org/read/io.html#io.lazy
https://stackoverflow.com/questions/9746352/parsing-large-log-files-in-haskell
Or is that negligible compared to the hylomorphism laziness and memory savings? Have you actually tried with some huge load of data just to show some quantifiable indicators? Thanks a lot.
Aside from that specific question, the post is wonderful and very interesting.
December 22, 2018 at 1:42 pm
There’s a nifty function Data.Bool.bool that does what your fromBool is doing… fromBool = bool 0 1 (I don’t know if it’s worth the extra import though)
December 27, 2018 at 2:11 pm
readFile is lazy.
December 27, 2018 at 2:33 pm
Thanks, I know, but what about lines txt? I guess it isn’t..
December 27, 2018 at 10:48 pm
Unless explicitly annotated (for instance with a bang !) or matched using a pattern, everything in Haskell is lazy.
December 30, 2018 at 11:17 am
Dear Bartosz, this is all very fascinating. 6 months ago I stumbled onto some podcast of yours and I barely understood anything. After evenings with “Learn you a Haskell for great good” I’m starting to grasp what it’s all about. Now I’ve spent hours reading this blog post and I understand the first half :). I’m glad the post doesn’t start with “reading time 20 min”. Keep up the inspiring work, I’ve just started digging into your “Category theory..” book. It might drive me nuts 😉
January 2, 2019 at 6:36 am
Bartosz, I’ve included some benchmarking wrapping at https://github.com/jrp2014/AoC2018/tree/master/Day02. The hylo version seems very good, although the straight Trie is even better.
January 2, 2019 at 8:44 am
Thanks for the great article! Can it be shown how the laziness works, so that the structure is never wholly created?
January 2, 2019 at 2:34 pm
I bet the performance can be improved by playing with strictness/laziness. Running the profiler to see memory consumption could show the bottlenecks.
January 2, 2019 at 4:18 pm
In Haskell everything is created on demand. If it’s not consumed, it’s never created. So the real question is how the data structure is consumed. In particular, when parts of it are consumed, which triggers their on-demand construction, are they garbage collected before other parts need to be constructed? So to convince yourself that laziness works, you’d have to use a profiler, which will show you peak memory usage.
January 3, 2019 at 5:36 am
I was thinking more about whether we can picture how the parts are consumed, because otherwise it is too abstract to reason about, and hard, for example, to make performance improvements or to compare it with a strict implementation.
January 3, 2019 at 1:44 pm
There are several operations that force evaluation. One is pattern matching. You can also force evaluation using the bang syntax. Also, I used foldl in my code for simplicity, but the forcing foldl’ (with the prime) would be better (I modified the code on github).
January 4, 2019 at 1:58 pm
The foldl’ seems to make little difference. I’ve put up some further benchmarks.