Back in 2023, I introduced Solve.hs, my newest course focused on problem solving in Haskell. This course was inspired by my experiences solving programming puzzles with Haskell, especially by the feeling of how different the experience was compared to other languages.
Solve.hs will teach you all the core knowledge you need around data structures and algorithms to tackle these kinds of puzzles (which often appear as interview questions), as well as the mindset shifts you have to make when solving them in Haskell.
In 2023, I released the first two modules, which focused on data structures, with a special emphasis on how Haskell uses linked lists. These modules also explored the patterns that replace ‘for’ and ‘while’ loops from other languages.
Then in 2024 I released module 3, which explained all of the most essential algorithms in great detail, and showed how we have to implement them differently in Haskell.
Finally, today, I am releasing the fourth and final module for this course! This module explains parsing in great detail. You’ll learn:
Basic string manipulation techniques for simple parsing
How to use libraries to parse common data formats (e.g. JSON)
How to use the Megaparsec library to parse any other kind of structured data
How to write your own monadic parser
How to use regular expressions for parsing in Haskell
These skills can be important in puzzle solving challenges where your input is just a string. But they’re also applicable in a wide variety of “real world” projects!
For the next 2 weeks, you can get Solve.hs for 20% off with the code SOLVE25. You can also get an extra 10% discount by subscribing to our newsletter!
After these 2 weeks are up, you’ll not only lose the discount, but the price of the course will go up to reflect the added material from module 4. This course will never be cheaper, so grab it now by going to the course page!
The GHC developers are very pleased to announce the availability
of the release candidate for GHC 9.10.2. Binary distributions, source
distributions, and documentation are available at downloads.haskell.org and
via GHCup.
GHC 9.10.2 is a bug-fix release fixing over 50 issues of a variety of
severities and scopes. A full accounting of these fixes can be found in the
release notes. As always, GHC’s release status, including planned future
releases, can be found on the GHC Wiki’s status page.
This release candidate will have a two-week testing period. If all goes well
the final release will be available the week of 1 May 2025.
We would like to thank Well-Typed, Tweag I/O, Juspay, QBayLogic, Channable,
Serokell, SimSpace, the Haskell Foundation, and other anonymous contributors
whose on-going financial and in-kind support has facilitated GHC maintenance
and release management over the years. Finally, this release would not have
been possible without the hundreds of open-source contributors whose work
comprises this release.
As always, do give this release a try and open a ticket if you see
anything amiss.
A few months ago, I announced that the GHC wasm
backend added support for Template Haskell and ghci. Initially, the
ghci feature only supported running code in nodejs and accessing the
nodejs context, and I’ve been asked a few times when ghci was going to work in
browsers in order to allow live-coding the frontend. Sure, why not? I promised
it in the last blog post’s wishlist. After all, GHCJS used to support
GHCJSi for browsers almost 10 years ago!
I was confident this could be done with moderate effort. Almost all the
pieces are already in place: the external interpreter logic in GHC is
there, and the wasm dynamic linker already works in nodejs. So just
make it runnable in browsers as well, add a bit of logic for
communicating with GHC, and we’re done, right? Well, it still took a few
months for me to land it…but finally here it is!
To keep this post within reasonable length, I will only introduce the
user-facing aspects of the wasm ghci browser mode and won’t cover the
underlying implementation. The rest of the post is an example ghci
session followed by a series of bite sized subsections, each covering
one important tip about using this feature.
How to use it
The ghc-wasm-meta repo provides user-facing
installation methods for the GHC wasm backend. Here we’ll go with the
simplest nix-based approach:
$ nix shell 'gitlab:haskell-wasm/ghc-wasm-meta?host=gitlab.haskell.org'
$ wasm32-wasi-ghc --interactive -fghci-browser
GHCi, version 9.12.2.20250327: https://www.haskell.org/ghc/ :? for help
Open http://127.0.0.1:38827/main.html or import http://127.0.0.1:38827/main.js to boot ghci
The -fghci-browser flag enables the browser mode. There are a couple
of other related flags which you can read about in the user
manual, but for now, let’s open that page to
proceed. You’ll see a blank page, but you can press F12 to open the
devtools panel and check the network monitor tab to see that it’s
sending a lot of requests and downloading a bunch of wasm modules.
Within a few seconds, the initial loading process should be complete,
and the ghci prompt should appear in the terminal and accept user
commands.
Let’s start with the simplest:
ghci> putStrLn "hello firefox"
ghci>
The message is printed in the browser’s devtools console. That’s not
impressive, so let’s try something that only works in a browser:
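(What follows is a rough sketch of such a snippet; the JSFFI import strings, and the "wrapper sync" import in particular, are assumptions based on the user manual rather than the post's exact code.)
ghci> import GHC.Wasm.Prim
ghci> -- assumed syntax: a "wrapper sync" import turns a Haskell IO action into a synchronous JS callback
ghci> foreign import javascript "wrapper sync" asSyncCallback :: IO () -> IO JSVal
ghci> foreign import javascript unsafe "document.createElement('button')" js_new_button :: IO JSVal
ghci> foreign import javascript unsafe "$1.textContent = $2" js_set_label :: JSVal -> JSString -> IO ()
ghci> foreign import javascript unsafe "document.body.appendChild($1)" js_append_to_body :: JSVal -> IO ()
ghci> foreign import javascript unsafe "$1.addEventListener('click', $2)" js_on_click :: JSVal -> JSVal -> IO ()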
The above code implements logic to export a Haskell IO () function
to a JavaScript synchronous callback that can be attached as a
button’s click event listener. Synchronous callbacks always attempt
to run Haskell computations to completion, which works fine as long as
the exported Haskell function’s main thread does not block indefinitely,
like waiting for an async JSFFI import to resolve or be rejected. You
can read more about JSFFI in the user manual, but let’s
carry on with this example:
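(Again a sketch; counter, btn and the js_* helpers are names from the snippet above, not from the post.)
ghci> import Data.IORef
ghci> counter <- newIORef (0 :: Int)
ghci> cb <- asSyncCallback (modifyIORef' counter (+ 1) >> readIORef counter >>= print)
ghci> btn <- js_new_button
ghci> js_set_label btn (toJSString "Click me!")
ghci> js_append_to_body btn
ghci> js_on_click btn cb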
Now, the button is attached to a simple counter in Haskell that prints
an incrementing integer to the console each time the button is
clicked. And that should be sufficient for a minimal demo! There are still a couple of important tips worth mentioning before we wrap up this post:
Hot reloading
Just like native ghci, you can perform hot reloading:
ghci> :r
Ok, no modules to be reloaded.
ghci> btn
<interactive>:15:1: error: [GHC-88464]
Variable not in scope: btn
Reloading nukes all bindings in the current scope. But it doesn’t
magically undo all the side effects we’ve performed so far: if you
click on the button now, you’ll notice the counter is still working
and the exported Haskell function is still retained by the JavaScript
side! And this behavior is also consistent with native ghci:
hot-reloading does not actually wipe the Haskell heap, and there
exist tricks like foreign-store to persist values
across ghci reloads.
For the wasm ghci, things like foreign-store should work, but you can also simply allocate a stable pointer and print it, then reconstruct the stable pointer and dereference it after a future reload. Since wasm ghci runs in a JavaScript runtime after all, you can also cook your own global variable by assigning to globalThis. Or locate the element and fetch its event handler: it should be the same Haskell callback exported earlier, which can be freed by freeJSVal.
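For example, a minimal sketch of the stable pointer trick using only the base API (the printed address is illustrative, and counter is the IORef from the earlier sketch):
ghci> import Foreign.StablePtr
ghci> import Foreign.Ptr
ghci> sp <- newStablePtr counter
ghci> print (ptrToIntPtr (castStablePtrToPtr sp))
81234
ghci> :r
Ok, no modules to be reloaded.
ghci> counter' <- deRefStablePtr (castPtrToStablePtr (intPtrToPtr 81234)) :: IO (IORef Int)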
So, when you do live-coding that involves some non-trivial back and
forth calling between JavaScript and Haskell, don’t forget that hot
reloads don’t kill old code, and you need to implement your own logic
to disable earlier callbacks to prevent inconsistent behavior.
Loading object code
The wasm ghci supports loading GHC bytecode and object code. All the
code you type into the interactive session is compiled to bytecode.
The code that you put in a .hs source file and load via command line
or :l commands can be compiled as object code if you pass
-fobject-code to ghci.
I fixed the ghci debugger for all 32-bit cross targets
since the last blog post. Just like native ghci, debugger features
like breakpoints now work for bytecode. If you don’t use the ghci
debugger, it’s recommended that you use -fobject-code to load
Haskell modules, since object code is faster and more robust at
run-time.
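For example, starting the session along these lines (Solver.hs is a hypothetical module) would load Solver.hs as object code, while anything typed at the prompt is still compiled to bytecode:
$ wasm32-wasi-ghc --interactive -fghci-browser -fobject-code Solver.hs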
Interrupting via ^C
My GHC patch that landed the ghci browser
mode also fixed a previous bug in wasm ghci: ^C was not handled at all
and would kill the ghci session. Now, the behavior should be
consistent with native ghci. With or without -fghci-browser, if
you’re running a long computation and you press ^C, an async exception
should interrupt the computation and unblock the ghci prompt.
Read the :doc, Luke
Among the many changes I landed in GHC since the last blog post, one of
them is adding proper haddock documentation to all user-facing things
exported by GHC.Wasm.Prim. Apart from the GHC user manual, the
haddock documentation is also worth reading for users. I haven’t set
up a static site to serve the haddock pages yet, but they are already
accessible in ghci via the :doc command. Just try import GHC.Wasm.Prim and check :doc JSVal or :doc freeJSVal; you can read them in plain text.
As the Haskell wasm user community grows, so will the frustration with
the lack of proper documentation. I’m slowly improving that. What you see
in :doc will continue to be polished, and the same goes for the user manual.
Importing an npm library in ghci
You can use JavaScript’s dynamic import() function as an async JSFFI
import. If you want to import an npm library in a ghci session, the
simplest approach is using a service like esm.run which
serves pre-bundled npm libraries as ES modules over a CDN.
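For instance, a sketch (the canvas-confetti package and the import strings are just illustrative choices):
ghci> foreign import javascript safe "import('https://esm.run/canvas-confetti')" js_import_confetti :: IO JSVal
ghci> foreign import javascript unsafe "$1.default()" js_fire :: JSVal -> IO ()
ghci> confetti <- js_import_confetti
ghci> js_fire confetti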
If you have a local npm project and want to use the code there, you
need to do your own bundling and start your own development server
that serves a page to make that code somehow accessible (e.g. via
globalThis bindings). But how does that interact with the wasm ghci?
Read on.
Using ghci to debug other websites
The browser mode works by starting a local HTTP server that serves
some requests to be made from the browser side. For convenience, that
HTTP server accepts CORS requests from any origin, which means
it’s possible to inject the main.js startup script into browser tabs
of other websites and use the wasm ghci session to debug those
websites! Once you fire up a ghci session, just open the devtools
console of another website and drop an
import("http://127.0.0.1:38827/main.js") call; if that website
doesn’t actively block third-party scripts, you can have more fun
than running it in the default blank page.
All JavaScript code for the GHC wasm backend consists of proper ES modules
that don’t pollute the globalThis namespace. This principle has
been enforced since day one, which allows multiple Haskell wasm
modules or even wasm ghci sessions to co-exist in the same page! It
works fine as long as you respect their boundaries and don’t attempt
to do things like freeing a JSVal allocated elsewhere, but even if
you only have one wasm module or ghci session, the “no global variable”
principle should also minimize the interference with the original page.
In my opinion, being able to interact with other websites is the most
exciting aspect of the browser mode. Sure, for Haskell developers that
want to experiment with frontend development, using ghci should
already be much easier than setting up a playground project and
manually handling linker flags, wrapper scripts, etc. But there’s even
greater potential: who said the website itself needs to be developed
in Haskell? Haskell can be used to test websites written in foreign
tech stacks, and testing backed by an advanced type system is
undoubtedly one of our core strengths! You can use libraries like
quickcheck-state-machine or
quickcheck-dynamic to perform state machine
property testing interactively, which has much greater potential of
finding bugs than just a few hard coded interactions in JavaScript.
No host file system in wasm
The default nodejs mode of wasm ghci has full access to the host file
system, so you can use Haskell APIs like readFile to operate on any
host file path. This is no longer the case for browser mode: the only
handles available are stdout/stderr, which output to the devtools
console in a line-buffered manner, and there’s no file to read/write
in wasm otherwise. The same restriction also applies to Template
Haskell splices evaluated in a browser mode ghci session, so splices
like $(embedFile ...) will fail.
This is a deliberate design choice. The dev environment backed by ghci
browser mode should be as close as possible to the production
environment used by statically linked wasm modules, and the production
environment won’t have access to the host file system either. It would be
possible to add extra plumbing to expose the host file system to ghci
browser mode, but that is quite a bit of extra work and also makes the
dev environment less realistic, so I’d like to keep the current design
for a while.
If you need to read a local asset, you can serve the asset via another
local HTTP server and fetch it in ghci. If you have modules that use
splices like embedFile, those modules should be pre-compiled to
object code and loaded later in ghci.
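A sketch of the fetch approach mentioned above, assuming the JSString helpers from GHC.Wasm.Prim and an asset server on port 8000:
ghci> foreign import javascript safe "fetch($1).then(r => r.text())" js_fetch_text :: JSString -> IO JSString
ghci> input <- fromJSString <$> js_fetch_text (toJSString "http://127.0.0.1:8000/input.txt")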
Don’t press F5
It’s very important that the browser page is never refreshed. The
lifetime of the browser tab is supposed to be tied to the ghci
session. Just exit ghci and close the tab when you’re done, but
refreshing the page would completely break ghci! A lot of shared state
between the browser side and host side is required to make it work,
and refreshing would break the browser side of the state.
Likewise, currently the browser mode can’t recover from network
glitches. It shouldn’t be a concern when you run GHC and the browser
on the same machine, but in case you use SSH port forwarding or
tailscale to establish the GHC/browser connection over an unstable
network, once the WebSocket is broken, the game is over.
This is not ideal for sure, but supporting auto-recovery upon network
issues or even page reloads is incredibly challenging, so let’s live
with what is supported for now.
Doesn’t work on Safari yet
Currently the browser mode works fine for Firefox/Chrome, including
desktop/mobile versions and all the forks with different logos and
names. Sadly, Safari users are quite likely to see spurious crashes
with a call_indirect to a null table entry error in the console.
Rest assured, normal statically-linked Haskell wasm modules still work
fine in Safari.
This is not my fault, but WebKit’s! I’ve filed a WebKit
bug and if we’re lucky, this may be looked into on their
side and get fixed eventually. If not, or if many people complain
loudly, I can implement a workaround that seems to mitigate the WebKit
bug to make the browser mode work in Safari too. That’ll be extra
maintenance burden, so for now, if you’re on macOS, your best bet is
installing Firefox/Chrome and using that for ghci.
Huge libraries don’t work yet
How large is “huge”? Well, you can check the source code of
V8, SpiderMonkey and
JavaScriptCore. In brief: there are limits agreed upon
among major browser engines that restrict a wasm module’s
import/export numbers, etc, and we do run into those limits
occasionally when the Haskell library is huge. For instance, the
monolithic ghc library exceeds the limit, and so does the profiling way of
ghc-internal. So cost-center profiling doesn’t work for the ghci
browser mode yet, though it does work for statically linked wasm
modules and ghci nodejs mode.
Unfortunately, this issue is definitely not a low hanging fruit even
for me. I maintain a nodejs fork that patches the V8 limits so that
the Template Haskell runner should still work for huge libraries, but
I can’t do the same for browsers. A fundamental fix to sidestep the
browser limits would be a huge amount of work. So I’ll be prioritizing
other work first. If you need to load a huge library in the browser,
you may need to split it into cabal sublibraries.
Wishlist, as usual
My past blog posts usually end with a “what comes next” section. This
one is no exception. The browser mode is in its early days, so it’s
natural to find bugs and other rough edges, and there will be
continuous improvement in the coming months. Another thing worth
looking into is profiling: modern browsers have powerful profilers,
and it would be nice to integrate our own profiling and event log
mechanism with browser devtools to improve developer experience.
The next big thing I’ll be working on is threaded RTS
support. Currently all Haskell wasm modules are
single-threaded and run in the browser main thread, but there may
exist workloads that can benefit from multiple CPU cores. Once this is
delivered, Haskell will also become the first functional language with
multi-core support in wasm!
You’re welcome to join the Haskell wasm Matrix room
to chat about the GHC wasm backend and get my quick updates on this
project.
Today, 2025-04-16, at 1830 UTC (11:30 am PDT, 2:30 pm EDT, 7:30 pm BST, 20:30 CEST, …)
we are streaming the 42nd episode of the Haskell Unfolder live on YouTube.
Functional programming is programming with mathematical functions, mapping
inputs to outputs. By contrast, logic programming—perhaps best known from
the language Prolog—is programming with mathematical relations between
values, without making a distinction between inputs and outputs. In this
two-year anniversary episode of the Haskell Unfolder we take a look at
typedKanren, an embedding of the logic programming language miniKanren in
Haskell. We will see how we can use it to write a type checker for a simple
functional language in a few lines of code.
About the Haskell Unfolder
The Haskell Unfolder is a YouTube series about all things Haskell hosted by
Edsko de Vries and Andres Löh, with episodes appearing approximately every two
weeks. All episodes are live-streamed, and we try to respond to audience
questions. All episodes are also available as recordings afterwards.
Purely functional list concatenation, xs ++ ys in Haskell syntax, is well known to be linear time
in the length of the first input and constant time in the length of the second, i.e. xs ++ ys is
O(length xs). This leads to quadratic complexity if we have a bunch of left associated uses of
concatenation.
The ancient trick to resolve this is to, instead of producing lists, produce list-to-list functions
a la [a] -> [a] or ShowS = String -> String = [Char] -> [Char]. “Concatenation” of “lists”
represented this way is just function composition which is a constant time operation. We can lift a
list xs to this representation via the section (xs ++). This will still lead to O(length xs)
amount of work to apply this function, but a composition of such functions applied to a list will
always result in a fully right associated expression even if the function compositions aren’t
right associated.
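As a minimal sketch of the technique (standard Haskell, not tied to any particular library):
-- One "piece" of output is a function that prepends its contents.
type ListFn a = [a] -> [a]   -- ShowS is this, specialized to Char

fromList :: [a] -> ListFn a
fromList xs = (xs ++)

toList :: ListFn a -> [a]
toList f = f []

-- "Concatenation" is (.), which is O(1); applying the result to [] produces a
-- fully right-associated chain of (++), however the compositions were grouped.
example :: String
example = toList ((fromList "foo" . fromList "bar") . fromList "baz")   -- "foobarbaz"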
In the last several years, it has become popular to refer to this technique as “difference lists”.
Often no justification is given for this name. When it is given, it is usually a reference to the
idea of difference lists in logic programming. Unfortunately, other than both techniques giving rise
to efficient concatenation, they have almost no similarities.
Functional Lists
To start, I want to do a deeper analysis of the “functional lists” approach, because I think what it
is doing is a bit misunderstood and, consequently, oversold1. Let’s see how we would model this approach in an OO
language without higher-order functions, such as early Java. I’ll use strings for simplicity, but it
would be exactly the same for generic lists.
This is just a straight, manual implementation of closures for (.) and (++) (specialized to
strings). Other lambdas not of the above two forms would lead to other implementations of
PrependTo. Let's say, however, that these are the only two forms that actually occur, which is mostly
true in Haskell practice. Then another view on this OO code (to escape back to FP) is that it is an
OOP encoding of the algebraic data type:
data PrependTo = Compose PrependTo PrependTo | Prepend String

prependTo :: PrependTo -> String -> String
prependTo (Compose left right) end = prependTo left (prependTo right end)
prependTo (Prepend s) end = s ++ end
We could have also arrived at this by defunctionalizing a typical example of the technique. Modulo
some very minor details (that could be resolved by using the Church-encoded version of this), this
does accurately reflect what’s going on in the technique. Compose is clearly constant time. Less
obviously, applying these functional lists requires traversing this tree of closures – made
into an explicit tree here. In fact, this reveals that this representation could require arbitrarily
large amounts of work for a given size of output. This is due to the fact that prepending an empty
string doesn’t increase the output size but still increases the size of the tree. In practice,
it’s a safe assumption that, on average, at least one character will be prepended per leaf of the
tree which makes the overhead proportional to the size of the output.
This tree representation is arguably better than the “functional list” representation. It’s less
flexible for producers, but that's arguably a good thing because we didn't really want arbitrary String -> String functions. It's more flexible for consumers. For example, getting the head of
the list is a relatively efficient operation compared to applying a “functional list” and taking
the head of the result even in an eager language. (Laziness makes both approaches comparably
efficient.) Getting the last element is just the same for the tree version, but, even with laziness,
is much worse for the functional version. More to the point, this concrete representation allows
the concatenation function to avoid adding empty nodes to the tree whereas (.) can’t pattern
match on whether a function is the identity function or not.
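For instance, sticking with the PrependTo type above, a concatenation that skips empty pieces is a few lines of pattern matching (a sketch):
-- With the concrete tree we can pattern match and drop empty pieces,
-- which (.) on opaque functions cannot do.
compose :: PrependTo -> PrependTo -> PrependTo
compose (Prepend "") right = right
compose left (Prepend "") = left
compose left right        = Compose left right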
This view makes it very clear what the functional version is doing.
Difference Lists in Prolog
List append is the archetypal example of a Prolog program due to the novelty of its “invertibility”.
For our purposes, viewing this as a function of the first two arguments, this is exactly the usual
functional implementation of list concatenation with exactly the same problems. We could, of course,
encode the defunctionalized version of the functional approach into (pure) Prolog. This would
produce:
(I’ll be ignoring the issues that arise due to Prolog’s untyped nature.)
However, this being a logic programming language means we have additional tools available to us
that functional languages lack. Namely, unification variables. For an imperative (destructive)
implementation of list concatenation, the way we’d support efficient append of linked lists is we’d
keep pointers to the start and end of the list. To append two lists, we’d simply use the end
pointer of the first to update the end of the first list to point at the start of the second. We’d
then return a pair consisting of the start pointer of the first and the end pointer of the second.
This is exactly how Prolog difference lists work, except instead of pointers, we use unification
variables which are more principled. Concretely, we represent a list as a pair of lists, but the
second list will be represented by an unbound unification variable and the first list contains
that same unification variable as a suffix. This pair is often represented using the infix
operator (“functor” in Prolog terminology), -, e.g. Xs - Ys. We could use diff(Xs, Ys) or
some other name. - isn't a built-in operator; it's essentially just a binary constructor.
At the level of logic, there are no unification variables. The constraints above mean that Xs - Ys
is a list Xs which contains Ys as a suffix.
The name “difference list” is arguably motivated by the definition of concatenation in this
representation.
concat(Xs-Ys,Ys-Zs,Xs-Zs).
This looks a lot like |Xs - Ys + Ys - Zs = Xs - Zs|. If the suffix component of the first argument
is unbound, like it’s supposed to be, then this is a constant-time operation of binding that
component to Ys. If it is bound, then we need to unify, which, in the worst case, is O(length Ys)
where the length is up to either nil or an unbound variable tail2.
We also have the unit of concat, i.e. the empty
list via3:
empty(Xs-Xs).
See the footnote, but this does in some way identify Xs - Ys with the “difference” of Xs and
Ys.
We get back to a “normal” list via:
to_list(Xs - [], Xs).
% or more generally,
prepend_to(Xs - Ys, Ys, Xs).
to_list is a constant-time operation, no matter what. Note, to_list binds the suffix component
of the difference list. This means that the first input no longer meets our condition to be a
difference list. In other words, to_list (and prepend_to) consumes the difference list.
More precisely, it constrains the possible suffixes the list could be.
Indeed, any operation that binds the suffix component of a difference list consumes it. For example,
concat consumes its first argument.
Of course, it still makes logical sense to work with the difference list when its suffix component
is bound, it’s just that its operational interpretation is different. More to the point, given a
difference list, you cannot prepend it (via prepend_to or concat) to two different lists to get
two different results.
Converting from a list does require traversing the list since we need to replace the nil node, i.e.
[], with a fresh unbound variable. Luckily, this is exactly what append does.
from_list(Xs, Ys - Zs) :- append(Xs, Zs, Ys).
from_list also suggests this "difference list" idea. If all of Xs, Ys, and Zs are ground
terms, then from_list(Xs, Ys - Zs) holds when append(Xs, Zs, Ys) holds; exactly when, in fact, if our
invariants are maintained, i.e. that Zs is a suffix of Ys. Writing these relations more
functionally and writing append as addition, we'd have |Xs = Ys - Zs| exactly when |Ys = Xs + Zs|.
If we did want to “duplicate” a difference list, we’d essentially need to convert it to a (normal)
list with to_list, and then we could use from_list multiple times on that result. This would,
of course, still consume the original difference list. We’d also be paying O(length Xs) for every
duplicate, including to replace the one we just consumed4.
That said, we can prepend a list to a difference list without consuming it. We can perform
other actions with the risk of (partially) consuming the list, e.g. indexing into the list. Indexing
into the list would force the list to be at least a certain length, but still allow prepending to
any list that will result in a final list at least that long.
Comparison
I’ll start the comparison with a massive discrepancy that we will ignore going forward. Nothing
enforces that a value of type ShowS actually just appends something to its input. We could use
abstract data type techniques or the defunctionalized version to avoid this. To be fair, difference
lists also need an abstraction barrier to ensure their invariants, though their failure modes are
different. A difference list can’t change what it is based on what it is prepended to.
Functional Representation | Difference Lists
constant-time concatenation | constant-time concatenation
constant-time conversion from a list (though you pay for it later) | O(n) conversion from a list
persistent | non-persistent, requires linear use
represented by a tree of closures | represented by a pair of a list and a unification variable
O(n) (or worse!) conversion to a list | constant-time conversion to a list
defunctionalized version can be implemented in pretty much any language | requires at least single-assignment variables
unclear connection to being the difference of two lists (which two lists?) | mathematical, if non-obvious, connection to being the difference of two (given) lists
As an illustration of the difference between persistent and non-persistent uses, the function:
double f = f . f
is a perfectly sensible function on ShowS values that behaves exactly as you’d expect. On the
other hand:
double(In, Out) :- concat(In, In, Out).
is nonsense that will fail the occurs check (if it is enabled; otherwise it will create a cyclic
list) except when In is the empty difference list.
Conclusion
I hope I’ve illustrated that the functional representation is not just not difference lists, but
is, in fact, wildly different from difference lists.
This functional representation is enshrined into Haskell via the ShowS type and related functions,
but I’d argue the concrete tree representation is actually clearer and better. The functional
representation is more of a cute trick that allows us to reuse existing functions. Really, ShowS
should have been an abstract type.
Difference lists are an interesting example of how imperative ideas can be incorporated into a
declarative language. That said, difference lists come with some of the downsides of an imperative
approach, namely the lack of persistence.
As far as I’m aware, there isn’t an unambiguous and widely accepted name for this functional
representation. Calling it “functional lists” or something like that is, in my opinion, very
ambiguous and potentially misleading. I think the lack of a good name for this is why “difference
lists” started becoming popular. As I’ve argued, using “difference list” in this context is even
more misleading and confusing.
If people really want a name, one option might be “delta list”. I don’t think this term is used.
It keeps the intuitive idea that the functional representation represents some “change” to a list,
a collection of deltas that will all be applied at once, but it doesn’t make any false reference to
difference lists. I’m not super into this name; I just want something that isn’t “difference list”
or otherwise misleading.
To be clear, it's still much, much
better than using plain concatenation.↩︎
Such a length relation couldn't
be written in pure Prolog but can be in actual Prolog.↩︎
For those algebraically minded, this almost makes concat and empty into another
monoid except concat is partial, but such a partial monoid is just a category! In other words,
we have a category whose objects are lists and whose homsets are, at most, singletons containing
Xs - Ys for Hom(Xs, Ys). If we maintain our invariant that we have Xs - Ys only when Ys is a
suffix of Xs, this thin category is exactly the category corresponding to the reflexive,
transitive “has suffix” relation. We could generalize this to any monoid via a “factors through”
relation, i.e. |\mathrm{Hom}(m, n)| is inhabited if and only if |\exists p. m = pn| which you can
easily prove is a reflexive, transitive relation given the monoid axioms. However, for a general
monoid, we can have a (potentially) non-thin category by saying |p \in \mathrm{Hom}(m,n)| if and
only if |m = pn|. The category will be thin if and only if the monoid is cancellative. This is
exactly the slice category of the monoid viewed as a one-object category.↩︎
Again, in actual Prolog, we could
make a duplicate without consuming the original, though it would still take O(length Xs) time using
the notion of length mentioned before.↩︎
We have recently contributed to a research study investigating how AI can help with realistic software development tasks.
METR initiated this study to measure how AI tools affect real-world software
engineering productivity, particularly in substantial open-source projects. The
study was designed to measure and assess how experts can use AI tools in order to
improve their workflows. It’s all well and good if the latest model can fix an artificial
test case, but what’s more interesting is how AI can be driven by expert knowledge.
The study required us to work on about 20 small normal development tasks. These
tasks were randomised; in some, we were allowed to use AI tools, and in others, we were not.
Apart from that, we could solve them however we wanted. We compiled notes on how we approached using the tools
and recorded our screens to provide a record of our experience.
Sam used the hours to work on GHC tickets, and I (Matt) used the hours to work on Cabal tickets.
I have included a list of the issues we worked on at the end of the post. Sam focused on fixing small bugs,
while I fixed all the known regressions in the cabal-install-3.14 release.
In this post, I’ll briefly discuss how we used the tools and what our overall experience was.
Haskell Programming with the help of AI
Before this study, neither of us was experienced with using AI tools to help with software development. I was
impressed that the models could interact with Haskell code at all. At the
start, it was quite overwhelming trying to understand what was available and what the trade-offs were between different tools.
The AI landscape is changing rapidly at the moment; there is a new model and tool every week.
Therefore, I won’t go into too much detail about what specific models or tools we used, but rather focus on our findings and experiences.
Development Environment
For the study, we were primarily using the following models and tools:
The text editor Cursor with AI autocomplete. Cursor is a fork of VSCode with AI-related features.
In the version of Cursor we used (0.45), there were two modes: the “chat” mode, which does not directly perform edits to your
files, and the “compose” mode, which does.
From within Cursor, the LLM claude-3.7-sonnet-thinking for the “chat” and “compose” features.
The standard ChatGPT 4o model from the web interface.
Using an editor with integrated LLM support, in particular one that supports Haskell Language Server, is key to getting the most
out of the AI tools:
Within the editor, it means that LLMs have access to relevant context for the task. This includes any files we pass to the
model as context, but also the rest of the codebase which can be searched by the LLM.
When an LLM suggests a change, it will receive feedback from HLS which will allow it to fix issues (e.g. fixing up missing imports,
resolving typechecker errors, etc). In practice, this made LLMs much more autonomous and reliable.
We didn’t use anything complicated or new, such as the Model Context Protocol or
very advanced thinking models such as ChatGPT o1.
Armed with these tools, we were ready to set about our task.
Architectural Understanding Tasks
For the AI-enabled tasks, we were encouraged to use the AI as much as possible.
Therefore, I typically started by just giving the AI a link to the GitHub issue
and asking it to explain to me what needed to be done. The summary was useful for checking that I understood correctly, and hearing the problem phrased differently was a good sanity check before starting the issue.
Asking specific questions about the codebase had more mixed results.
In general, the AI could usually give plausible answers to understanding tasks,
but they were often wrong in some subtle way. It is also very suggestible and tends to
agree with whatever you state you think the solution is.
My impression for architectural understanding tasks was that you would have to provide a summary document as context in order to answer
questions more accurately.
Technology Understanding Tasks
For tasks that required me to understand something new or unfamiliar,
the AI was very good. In one issue, I had to investigate something wrong with the GitHub CI setup,
which was an area I was quite clueless about. ChatGPT was able to suggest the probable cause of the issue
with minimal prompting and just the issue description for context. That certainly saved a lot of time.
The ability to generate ad-hoc scripts for particular tasks was also very useful. I generated several
useful single-use bash and python scripts for extracting specific pieces of information from the codebase.
These scripts can also be used to generate information to feed back into the prompt, which creates a useful feedback loop.
Code Generation Tasks
Once the AI demonstrated to me it understood the problem, I would ask it to
generate a solution. The AI could generate plausible, syntactically correct
code, but it was often the wrong idea. I think this was the biggest waste of
time. Once a solution was generated, it was quite tempting to just “fix” the
wrongness, but more often than not, the architecture or design was wrong. Many
fixes in a codebase
like Cabal require changing a few lines very precisely; that’s not something the AI is good at doing on its own at the moment.
On the other hand, if you are precise with your prompts and set the correct
context, the AI can save a lot of time generating specific definitions for you.
I would often use it to generate routine instances, simple definitions, or other well-defined generation tasks. It normally got these
correct, which I was very surprised about.
Generating test cases was also a good use of the AI. It was able to handle
generating the right structure for the custom Cabal test framework. These
invariably required some tweaking, but getting all the right files in place made it
a much simpler task.
The final approach to improve generation tasks is to first converse with the “chat” interface to clarify the problem, discuss
different parts of the design, and point out any issues. Once this context is established, you can ask the “chat” window to first generate
a prompt for the “compose” window. This prompt already gives “compose” precise instructions, but it can be
edited further if something is not quite right.
Documentation Tasks
Opinion was split between us about how useful the LLMs were for documentation tasks.
I thought that this was a strong point in favour of using LLMs. Often when working on an issue, you end up having to explain the same thing several
different times. First, you explain precisely to the machine what your intent is with the code you write. Then you explain to a developer
in the comments and commit message. Finally, you explain to the user in the changelog and documentation. Each
of these tasks requires modifying somewhere slightly different with a slightly different focus. I found
that I was much more inclined to include all these different parts when using the AI since it could do a good job
generating the necessary files without requiring too much further editing.
The code changes themselves, along with the context developed in “chat,” were normally enough to be able to generate the
commit message, changelog, and documentation updates with very little effort.
On the other hand, the suggestions weren’t to Sam’s taste. He thought that the style
generated for the commit messages was rambly and indirect. The model might focus on
explaining a small detail rather than giving a bigger picture overview. For the more
complicated code in GHC, the explanation of the code was a vague transcription rather
than relaying any higher-level ideas the user might want to know.
He felt similarly when it came to note writing, a developer documentation artifact common to
GHC development: the LLMs would “get stuck” explaining details of the code rather than the
bigger picture. He did have some success in writing commit messages: the LLMs were
good at summarising which functions and parts of the code were modified, which gave
a good starting point for structuring the necessary explanations.
It’s interesting that we had different experiences in this area; perhaps it was due to
the difference in the codebase, or a difference in our style of using the models.
People often struggle writing commit messages or documentation, and I think using
LLMs can reduce the barrier to entry in this area. A human-crafted commit message is
often much better than one generated by a model, but I would much prefer a commit message
generated by an LLM to none at all.
Verification Tasks
Another interesting use case is to use the AI to perform ad-hoc verification tasks. For example, I used the AI to check that
all NFData instances had a certain structure. For this, I first worked with the AI to generate a script to extract all the code for
NFData instances from the codebase. This required a small amount of debugging, but it would have taken me several hours to write the
awk script myself due to unfamiliarity with the language. Once I had the script, I extracted all the NFData instances and asked
ChatGPT to check that they all had the correct structure. The instance-by-instance summary allowed me to also quickly verify
the AI’s answer. It resulted in spotting a few missed cases that were very hard to spot by eye.
LLMs were also useful in diagnosing failing test cases. For example, Sam implemented a change to GHC which led to
a few dozen failing tests. After giving relevant context about the change, the LLM was able to categorise the
failing tests:
Some test results only involved minor output changes or improvements in error messages; these could be accepted.
The LLM further categorised the serious test failures, e.g. “tests 1, 4 and 5 failed for one reason, while test 2 failed for another reason”.
This categorisation was useful for identifying potential issues with a change and quickly addressing them.
It often happens in GHC development that a small change can lead to hundreds of failing test cases, and
it can be very time-consuming to go over all failing test cases individually. Having an assistant that can
quickly do a first pass at sorting the test failures is very helpful.
Of course, the answers given to you by an LLM must always be taken with suspicion.
In situations where 95% confidence is good enough, or when it is quick and easy to check the correctness of an answer,
they can be very useful.
Refactoring Tasks
Using an LLM can be helpful for refactoring tasks that are routine and
well-defined. In our experience, however, they tend to struggle with larger
tasks or those requiring nuance.
For instance, the LLMs performed well when adding a new error message to Cabal’s diagnostic
infrastructure. This kind of task requires modifying quite a few different places in
a routine manner. There is not much code to add, nothing to move around or delete.
Similarly, for smaller tasks like lifting an expression to a top-level definition or
adding debugging traces,
the AI was able to do this with a high degree of confidence.
For repetitive refactoring tasks, such as renaming fields or parameters, the Cursor
autocomplete is extremely useful. I could often just change the name of a field, navigate to the next
type error, and the autocomplete would suggest the correct modification.
I personally also found it useful that the AI liked to keep things consistent. Fields were named
consistently, functions named consistently, etc. I find this task of making sure the whole API is consistent
quite difficult to do manually.
Finally, I didn’t really try to use the AI for large refactoring tasks. There would
tend to be a lot of random or incomplete changes to the codebase, and subtle things would go
wrong. Sam also reported that when working on some more routine refactorings, there
might have been one or two places which required a decision to be made, and you
could waste quite a lot of time if the LLM chose incorrectly. It would be useful
if an LLM could indicate the places that it modified with lower confidence.
Conclusion
Overall, I found the experience of using AI tools in my normal development
workflow to be very useful, and I will continue to use them after the study. It’s
clear to me that it is going to become essential to be
familiar with these tools as a developer in the future.
Sam has a more negative outlook in comparison. While he found LLMs useful,
he is concerned that increased use of LLMs will affect our shared ability to reason about our code.
Usage of LLMs risks disincentivising thinking deeply about the design or architecture
of our software, which then increases the burden placed on reviewers and risks
the community losing their shared understanding of how the codebase is supposed to operate.
We appreciate METR’s support in conducting this research, which has helped us
better understand both the potential and limitations of AI-assisted development
in the Haskell ecosystem.
If your company is interested in funding open-source work
then we offer Haskell Ecosystem Support Packages to provide
commercial users with support from Well-Typed’s experts, while investing in the
Haskell community and its technical ecosystem.
Issues Fixed
Sam’s GHC Issues
Sam focused on fixing small bugs in the typechecker, together with a couple of bugfixes related to LLVM code generation.
Matt's Cabal Issues
Create a reproducer for the issue where Cabal 3.14.1.0 invokes test binaries with a corrupt (duplicated) environment variable list, aiding in debugging and resolution.
Develop a test for the regression where Cabal-3.14.1.0’s v1-test and Setup.hs test cause test suites of alex-3.4.0.1 and happy-1.20.1.1 to be unable to find data files.
Encountering “Error: Dependency on unbuildable library” when using three internal libraries with --enable-multi-repl, indicating issues in handling multiple internal dependencies.
The cabal repl command does not support renaming of re-exported modules when loading multiple components. This limitation leads to failures when one component re-exports a module from another with a different name.
Rust has always felt like a strange beast, culturally speaking. The community is
made of a mix of people with very different perspectives, including anything
from hardcore low-level kernel hackers to category-theorist and functional
programming gurus. This is also what makes this community so fertile: whether
you’re coming from C, Haskell or TypeScript, you’re likely to learn a lot from
other perspectives.
I’d like to add my modest contribution by introducing a pattern coming from the
functional programming world, recursion schemes1. Recursion
schemes are a design pattern for representing and traversing recursive data structures
(typically trees) which help factor the common part of recursive traversals,
making transformations nicer to write, to read and to compose.
Even in the functional programming world, recursion schemes are not so
well-known. Like monads, they are usually presented in Haskell with frightening
words like zygohistomorphic prepromorphisms. It’s a pity because
recursion schemes can be both simple, useful and practical. I’d even argue that
in Rust, the most interesting part is perhaps the representation technique, more
than the traversal, despite the latter being the original and the usual
motivation for using recursion schemes.
In this post, we’ll work through a concrete example to introduce recursion
schemes and what they can do. We’ll point to a more real life example of how we
use them in the implementation of the Nickel configuration language,
and we’ll discuss the pros and cons of using recursion schemes in the particular
context of Rust.
(In)flexible representations
Let’s say you’re writing a JSON parser library. You’ll need to expose a type
representing JSON values. For the sake of argument, let’s assume that you
support an extension of the JSON language with pairs, so you can write {"foo": ("hello","world")}. Here’s a natural representation:
This data structure is recursive: JSON values can contain other JSON values. We
thus have to use Box (or any other indirection) around recursive occurrences
of JsonValue. Otherwise, this enum would have an infinite size (except for
Array and Object, since Vec and HashMap add their own indirection, but
that’s more luck than design).
Now, user requestor asks that your parser add location information to the
output, because they validate some user-provided configuration and would like to
point to specific items on error. This is a reasonable
request which is sadly very hard
to satisfy in the serde ecosystem. Anyway, our parser isn’t interfacing
with serde, so we can add span information:
You can go different ways about this. We could have added a second argument to
each constructor of the enum, such as in String(String, Span), to avoid the
additional Spanned layer, but that would be a lot of repetition. We could also
have moved Box to data: Box<T>. Still, the general idea is that we now have
two layers:
a struct layer gathering the JSON data and the span together;
the original enum layer, the core of JSON, which is almost unchanged.
So far, so good. But user conservator is now complaining that you’ve spoiled
their performance. They’re using JSON as a machine exchange format and don’t
care about position information. Could you restore the old representation and provide a
way to produce it, ignoring spans?
Unfortunately, we had to change JsonValue. Copy-pasting the original
JsonValue enum under a different name is possible, but it’s unsatisfying, as
we now have multiple copies to maintain. It also doesn’t scale. Besides adding
position information, you might want to have a value representation that uses
Rc instead of Box, because you’re going to need to keep references to
arbitrary nodes during some complex transformation.
The functorial representation
The recursion schemes pattern has two components: a representation technique and
a transformation technique. I believe the representation part is particularly
interesting for Rust, so let’s start with that.
We’ll try to make our JSON representation more generic to accommodate the
different variations that we mentioned in the previous section. The fundamental
idea is to replace the recursive occurrences of JsonValue within itself,
Box<JsonValue> (or JsonValue for Array and Object), by a generic
parameter T. Doing so, we’re defining just one layer of a JSON tree where
recursive children can be anything, not necessarily JSON values (we use the F
suffix for that generic version because it’s technically a functor, but that
doesn’t really matter).
This is precisely a single node of a JSON tree: either a leaf, or a
marker of a node with children but without actually including them.
If we set T = Box<JsonValueF<T>>, we get back the original JsonValue.
But wait, you can’t define the generic parameter T to be something which
depends on T itself! In fact we can, but we need to introduce an extra
indirection:
The price to pay is an additional struct layer, so you need to match on
value.data, and wrap new values as JsonValue { data: JsonValueF::Number(0) }. Note that this layer doesn’t have any cost at run-time.
Another difference is that we now box the values in Array and Object,
which isn’t needed. For now I’ll just ignore that, but you could take a second
generic parameter U to represent the occurrences of T that don’t need an
indirection if this really matters to you.
If we extend our intermediate layer a bit, we can get SpannedValue!
This idea of putting a self-referential type within JsonValueF is referred to
as tying the knot. The power of this approach is that you can keep the core
JsonValueF type unchanged. This applies to any tree-like recursive structure.
Some methods can be implemented only once on JsonValueF for any T, say
is_string or is_number. With additional trait constraints on T, we can
write more involved functions, still operating on the generic functor
representation.
Let’s now see how to traverse our JSON values.
Traversals
The strong point of recursion schemes is to provide an interface for traversing
recursive structures that let you focus on what the function actually does,
which is otherwise mixed with how the recursion is done. The idea is to use
generic combinators which factor out the plumbing of recursive traversals.
Let’s count the number of String nodes in a JSON value, the naive way.
We’ll see how to write this function in the style of recursion schemes. First, we need to
define one core combinator: map.
map takes a JsonValueF<T> and a function f from T to U, and returns a
JsonValueF<U>. That is, map takes a JSON layer where all the direct children
(the recursive occurrences in our full type) are of some type T and applies
f to transform them to something of type U. This is the secret sauce for
defining traversals.
map isn’t specific to JsonValueF. It can be defined mechanically for any
functor representation (e.g. through a macro) of a data structure.
Note that there’s no recursion in sight: there can’t be, because T and U are
entirely generic and could very well be (), but we saw that JsonValueF<()>
is a single node. map only operates at the current layer.
The trick is that f can use map itself. Let’s see how to use it for
count_strings:
If you look closely, there’s no more recursion in the body of the pattern
matching. It’s factored out in the map call. Let’s break down this example:
map, given a function from T to U, promises you that it can transform
the direct children of type T in JsonValueF<T> to U, providing
JsonValueF<U>. We use it immediately with a recursive call to
count_strings, which can indeed transform the direct children from a
Box<JsonValue> to a u32. If the children have children themselves,
count_strings will do that recursively as its first action, down to the
leaves.
Once we’ve reduced potential children of deeper layers to u32s, we get a
JsonValueF<u32>. We sum its content at the current layer.
There is a catch though: our count_strings function takes an owned argument,
which consumes the original JSON value. I’ll come back to that later.
While I find the second version of count_strings a little cleaner, the
difference between the two isn’t really astonishing.
As a more compelling example, let’s define a generic bottom-up traversal
function on JsonValue. This traversal is able to map — that is to rewrite —
nodes (more exactly entire subtrees). map_bottom_up takes a generic
transformation f and applies this function to every subtree starting from the
leaves. You could use such a function to apply program transformations
or optimizations on an abstract syntax tree.
impl JsonValue {
    pub fn map_bottom_up(self: JsonValue, f: impl FnMut(JsonValue) -> JsonValue) -> JsonValue {
        let data = self.data.map(|v| Box::new(v.map_bottom_up(f)));
        f(JsonValue { data })
    }
}
This example is quite remarkable: it’s almost a one-liner and there is no
pattern matching at all! Once again, the structural recursion is entirely
factored out in the map function. We implemented map_bottom_up on
JsonValue directly, but with some trait constraints on T, we can write a
more generic version on JsonValueF that works on both the Box-ed and Rc-ed
versions (the arena one is more tricky as it requires an explicit allocator).
This example is only scratching the surface.
Mapping is just one example: other common traversals are folds (known as
catamorphisms in the recursion schemes jargon), which generalize the well-known
Iterator::fold from sequences to trees. In fact, count_strings would make
more sense as a fold, but we’ll leave that for another time.
Are recursion schemes useful in Rust?
Haskell has a number of features that make recursion schemes particularly nice
to use and to compose, not the least of which is garbage collection. You don’t
have to think about ownership; it’s references all the way down. Recursive data
structures are easy to express.
On the other side, there is Rust, which culturally doesn’t like recursive
functions that much, for good and bad reasons2. Though sometimes
recursion is hard to avoid, especially on tree-like data structures.
An important issue is that our count_strings consumes its argument, which is
unacceptable in practice. It is possible to write a version of map that takes
a value by reference, and thus similarly for count_strings, but it’s not
entirely straightforward nor free. You can find a by-reference version and more
explanations in our associated repository. At any rate, you can always
write specific traversals manually without resorting to the recursion schemes
way if needed. It’s not an all or nothing approach.
In fact, even if you don’t use map at all, the functor representation alone is
quite useful.
How we use recursion schemes in Nickel
In the implementation of the Nickel configuration language, we use the functor
representation for the abstract syntax tree of a static type.
Here are the stages we went through:
In the parser and most of the Nickel pipeline, we used to have a simple
Box-based, owned representation, akin to JsonValue.
However, during type inference, the Nickel typechecker needs to handle new
type constructions, in particular unification
variables. Those are as-of-yet unknown types, similar
to unknowns in an algebraic equation. Extending the base representation is
readily done as for SpannedJsonValue:
pub enum UnifType {
    Concrete(Box<TypeF<UnifType>>),
    /// A unification variable.
    UnifVar(VarId),
    // .. rigid type variables, etc.
}
More recently, we’ve split the historical, all-powerful unique representation
of expressions (including Nickel types) into two intermediate ones. The new
initial representation is arena-allocated, which makes it natural to use bare
references as the recursive indirection instead of allocating in the heap
through e.g. Box. This is easy with recursion schemes: that is precisely
the ArenaJsonValue example. For a smooth transition, we need to temporarily
keep the old Box-ed Type representation in parts of the codebase, but
having different representations co-exist is a basic feature of recursion
schemes.
We use map-based traversal typically to substitute type variables (that is,
a Nickel generic type, as our T in Rust) for a concrete type and similar
rewriting operations. We have variants of the core map function that can also
thread mutable state, raise errors, or both. Traversals by reference are
implemented manually, with a plain recursive function.
On the downside, type and core function definitions can be a bit verbose and
tricky to get right. For example, Nickel’s TypeF has sub-components that themselves
contain types, leading to 4 generic parameters. There are multiple possibilities
for Box placement in particular, only some of them are correct and they are
subtly different. Though once you’ve defined a new variant, this complexity is
mostly hidden from the consumers of your API. It can still manifest as terrible
Rust type errors sometimes if, God forbid, you’ve put a Box at the wrong
place.
Conclusion
We’ve introduced recursion schemes, a design pattern for representing and
traversing recursive data structures. While the traversal part isn’t as good a
fit as in purer functional languages like Haskell, it can still be useful in
Rust. The representation part is particularly relevant, making it easy to define
variations on a recursive data structure with different ownership models or
metadata. We’ve shown how we use recursion schemes in Nickel, and while there
are performance and complexity trade-offs to consider, they can bring value for
moderately complex tree types that need to be extended and transformed in
various ways.
Rust allocates on the stack by default, which makes it easier to overflow
(though the stack can be configured to be larger at compile time). However,
I have the impression that there’s a misleading idea that recursive
functions perform poorly. For tree transformations at least, the iterative
version is usually harder to write and can require explicitly representing
the context on the heap through an auxiliary data structure such as a
zipper, which is likely to perform worse. The stack can overflow, and
(recursive) function calls aren’t entirely free either, but in terms of
allocation, deallocation and locality, the stack is also hard to beat!↩
Google have chosen to break my blog by removing support for the chart API which I used for the formulae. For a better version with all the formulae go to my github repo.
There are many introductions to the Expectation-Maximisation algorithm.
Unfortunately every one I could find uses arbitrary seeming tricks that seem to be plucked out of a hat by magic.
They can all be justified in retrospect, but I find it more useful to learn from reusable techniques that you can apply to further problems.
Examples of tricks I've seen used are:
Using Jensen's inequality. It's easy to find inequalities that apply in any situation. But there are often many ways to apply them. Why apply it to this way of writing this expression and not that one which is equal?
Substituting in the middle of an expression. Again, you can use just about anywhere. Why choose this at this time? Similarly I found derivations that insert a into an expression.
Majorisation-Minimisation. This is a great technique, but involves choosing a function that majorises another. There are so many ways to do this, it's hard to imagine any general purpose method that tells you how to narrow down the choice.
My goal is to fill in the details of one key step in the derivation of the EM algorithm in a way that makes it inevitable rather than arbitrary.
There's nothing original here, I'm merely expanding on a stackexchange answer.
Generalities about EM
The EM algorithm seeks to construct a maximum likelihood estimator (MLE) with a twist: there are some variables in the system that we can't observe.
First assume no hidden variables.
We assume there is a vector of parameters that defines some model.
We make some observations .
We have a probability density that depends on .
The likelihood of given the observations is .
The maximum likelihood estimator for is the choice of that maximises for the we have observed.
Now suppose there are also some variables that we didn't get to observe.
We assume a density .
We now have
where we sum over all possible values of .
The MLE approach says we now need to maximise
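Since the displayed formulae in this post were lost with the chart API, here is a sketch of the standard setup being described; the notation (θ for the parameters, x for the observations, z for the hidden variables) is assumed here rather than taken from the original:

L(\theta) \;=\; p(x \mid \theta) \;=\; \sum_{z} p(x, z \mid \theta),
\qquad
\hat{\theta}_{\mathrm{MLE}} \;=\; \arg\max_{\theta} \; \log \sum_{z} p(x, z \mid \theta).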
One of the things that is a challenge here is that the components of might be mixed up among the terms in the sum.
If, instead, each term only referred to its own unique block of , then the maximisation would be easier as we could maximise each term independently of the others.
Here's how we might move in that direction.
Consider instead the log-likelihood
Now imagine that by magic we could commute the logarithm with the sum.
We'd need to maximise
One reason this would be to our advantage is that often takes the form where is a simple function to optimise.
In addition, may break up as a sum of terms, each with its own block of 's.
Moving the logarithm inside the sum would give us something we could easily maximise term by term.
What's more, the for each is often a standard probability distribution whose likelihood we already know how to maximise.
But, of course, we can't just move that logarithm in.
Maximisation by proxy
Sometimes a function is too hard to optimise directly.
But if we have a guess for an optimum, we can replace our function with a proxy function that approximates it in the neighbourhood of our guess and optimise that instead.
That will give us a new guess and we can continue from there.
This is the basis of gradient descent.
Suppose is a differentiable function in a neighbourhood of .
Then around we have
We can try optimising with respect to within a neighbourhood of .
If we pick a small circular neighbourhood then the optimal value will be in the direction of steepest descent.
(Note that picking a circular neighbourhood is itself a somewhat arbitrary step,
but that's another story.)
For gradient descent we're choosing because it matches both the value and derivatives of at .
We could go further and optimise a proxy that shares second derivatives too, and that leads to methods based on Newton-Raphson iteration.
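In symbols (a reconstruction, since the original image is gone), the first-order proxy that gradient descent optimises around a guess θ₀ is:

f(\theta) \;\approx\; f(\theta_0) \;+\; \nabla f(\theta_0) \cdot (\theta - \theta_0).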
We want our logarithm of a sum to be a sum of logarithms.
But instead we'll settle for a proxy function that is a sum of logarithms.
We'll make the derivatives of the proxy match those of the original function
precisely so we're not making an arbitrary choice.
Write
The are constants we'll determine.
We want to match the derivatives on either side of the
at :
On the other hand we have
To achieve equality we want to make these expressions match.
We choose
Our desired proxy function is:
So the procedure is to take an estimated and obtain a new estimate
by optimising this proxy function with respect to .
This is the standard EM algorithm.
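Spelling out the lost formulae, a sketch of the standard derivation in the assumed notation (λ_z for the constants, θ₀ for the current estimate) goes as follows: write the proxy as a sum of logarithms,

g(\theta) \;=\; \sum_{z} \lambda_z \log p(x, z \mid \theta).

Matching the derivative of the log-likelihood and of the proxy at θ₀,

\frac{d}{d\theta} \log \sum_{z} p(x, z \mid \theta)\Big|_{\theta_0}
  \;=\; \frac{\sum_{z} \partial_\theta\, p(x, z \mid \theta_0)}{\sum_{z'} p(x, z' \mid \theta_0)},
\qquad
\frac{d}{d\theta} g(\theta)\Big|_{\theta_0}
  \;=\; \sum_{z} \lambda_z \, \frac{\partial_\theta\, p(x, z \mid \theta_0)}{p(x, z \mid \theta_0)},

forces the choice

\lambda_z \;=\; \frac{p(x, z \mid \theta_0)}{\sum_{z'} p(x, z' \mid \theta_0)} \;=\; p(z \mid x, \theta_0),

giving the familiar proxy and update:

Q(\theta; \theta_0) \;=\; \sum_{z} p(z \mid x, \theta_0)\, \log p(x, z \mid \theta),
\qquad
\theta_{n+1} \;=\; \arg\max_{\theta} \, Q(\theta; \theta_n).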
It turns out that this proxy has some other useful properties.
For example, because of the concavity of the logarithm,
the proxy is always smaller than the original likelihood.
This means that when we optimise it we never optimise "too far"
and that progress optimising the proxy is always progress optimising the
original likelihood.
But I don't need to say anything about this as it's all part of the standard literature.
Afterword
As a side effect we have a general purpose optimisation algorithm that has nothing to do with statistics. If your goal is to compute
you can iterate, at each step computing
where is the previous iteration.
If the take a convenient form then this may turn out to be much easier.
Note
This was originally written as a PDF using LaTeX. It'll be available here for a while. Some fidelity was lost when converting it to HTML.
Do you use an LLM for coding? Do you maintain a personal benchmark based on problems you have posed the LLM? The purpose of this blog post is to convince you should do this: that you can do so with marginal effort on top of your day-to-day vibe coding and that you will get both short and long term benefits from making your own personal benchmark exist.
I started thinking about benchmarks for coding in part out of frustration with the discourse around LLMs in the public squares I frequent (Reddit and Twitter). People often want to know "what's the best model" or "what's the best coding IDE"? One might imagine that the way to answer this question would be to test the models on a variety of problems from real world uses of the LLM for coding, and then compare how well various systems do on this. Indeed, whenever a new SOTA model releases, the lab will usually tell you about the model's performance against a few well known coding benchmarks. Problem solved?
Of course not! In fact, for the most part, no one really talks about benchmarks when comparing models. Why? I argue the most popular benchmarks measure tasks that are largely different from what a user wants out of an LLM. For example, take the recent Gemini 2.5 Pro release. In their headline table, they test against LiveCodeBench, Aider Polyglot and SWE-bench Verified. Both LiveCodeBench and Aider Polyglot derive their problems from contest programming and pedagogical exercises (respectively), while SWE-bench assesses bug fixes to preexisting codebases. While useful, this is only a small slice of the things people want to do with LLMs.
Wouldn't it be great if you had your own, personal benchmark, based on problems you actually care about? If you are tweaking your .cursorrules, you could run your benchmark to see if a change you made helped or not. When a new model comes out, you could spend a few bucks to run your eval and make a decision if you should switch your daily driver. And then on social media, if you wanted to stan the new model, instead of asking the model to drop a ball inside a rotating hexagon or vagueposting about how the new model is incredible, you could just post your benchmark results.
It's a collection of nearly 100 tests I've extracted from my actual conversation history with various LLMs.
There are two defining features of this benchmark that make it interesting. Most importantly, I've implemented a simple dataflow domain specific language to make it easy for me (or anyone else!) to add new tests that realistically evaluate model capabilities. This DSL allows for specifying both how the question should be asked and also how the answer should be evaluated. Most questions are evaluated by actually running the code the model writes but the framework supports a bunch of other evaluation methods as well. And then, directly as a result of this, I've written nearly 100 tests for different situations I've actually encountered when working with LLMs as assistants.
I have been working on my own benchmark based off of Carlini's benchmark, and I can confirm that this works well for the traditional style of coding eval, where you have a one-shot task that generates and executes the code against some test cases. My basic strategy is to vibe code as usual, but whenever I give an LLM a task that it isn't able to one shot, I consider adding it to the benchmark. In more detail:
I only add a task if a SOTA LLM failed it. This ensures the benchmark consists of all appropriate difficulty problems: easy enough that I thought an LLM should be able to do it, but hard enough that a SOTA model failed on it. I don't need problems that are too hard (this is already well covered by well known benchmarks like SWE-Bench or SWE-Lancer), and I don't mind if my problems saturate because, hey, that means the models are that much better for my use cases!
After I have added the task to the benchmark, I can use the benchmark runner to tell if changing the model, tweaking the prompt, or even just running the prompt again at nonzero temperature can make it pass. Indeed, it's helpful to find some configuration that makes the eval pass, as this is good for debugging issues in the evaluation function itself... also it means you have working code for whatever task you were working on. Conversely, you can make the task harder by leaving things out from the prompt.
Writing the test is the labor intensive part, but you can always vibe code a test. Importantly, you have a failing implementation (your initial generation) and some way you (manually?) determined that the implementation was wrong, so just turn this into your evaluation function! (And for all you yak shaving aficionados, if the model fails to vibe code your test, well, you have another task for your benchmark!)
For example, the other day I needed to take an asciinema recording and convert it into a sequence of frames rendered as plain text. However, the only project for doing these conversions was agg, which converts recordings into animated gifs. In
agg_to_text, I ask an LLM to take agg's source code and create a new program which dumps the frames as plain text rather than gif images. The reason why this task is difficult, is because there is some discretion in deciding when to emit a frame, and with my original prompt the LLM didn't precisely replicate the original behavior in agg. While working on the benchmark, I realized that instructing the model specifically about how frame batching worked was enough to get it to preserve the original behavior. But I don't think I should need to do this: thus this task. (P.S. If this test saturates, well, I can always make it harder by removing the agg source code from the prompt.)
The ability to benchmark one shot tasks is here today, but I would like to speculate a bit about what lies beyond them. In particular, most of my LLM coding activity involves asking the LLM to make changes to a pre-existing project, which makes it less amenable to "single prompt creates self contained program". (Also, I usually only ask one-shot questions that the LLM can answer, so most of them would never go in my benchmark.)
In short, how can I extract tasks from my day-to-day work? There seem to be two big extra levers we have:
Codebase tasks. This is the heavy-weight approach: you record the Git commit of your codebase at the time you prompted for some new feature to be added, and then when you want to run an eval on a new model you just check out the codebase at that commit and let the end-to-end system go. You'll typically want to execute the modified code, which means you'll also need a way to reliably setup the runtime environment for the code; things like lockfiles can help a lot here.
Transcript tasks. You don't actually need the entire codebase to be available to ask an LLM for a completion; you only need the conversation transcript up to the point of the critical generation. If the transcript is mostly your agent system reading in files for context, you can end up with a relatively system generic prompt that can tell you something about other systems. Of course, if you want to actually run the change, you still need the full codebase, which is why this approach is much more amenable if you're going to do some static analysis on the output. For example, if a model keeps adding try: ... except: ... blocks that are suppressing errors, you can take some transcripts where you've caught the model red-handed doing this and make an eval that checks if the model is still doing this. I suspect testing on transcripts works best for testing if changing prompts or rules improves performance, since the transcript itself will put the model into some particular latent space and if it were a different model they might have made different choices leading to a different latent space. Transcripts from thinking models are especially susceptible to this!
I have started adapting Carlini's framework to work better for these cases, although I would love to be told someone has already solved this problem for me. In particular, I am very excited about using transcript tasks to evaluate whether or not things I add to my prompts / triggered rules are helping or not. Current SOTA model instruction following isn't great and I regularly catch models doing behaviors that I explicitly told them not to in the system prompt. I have started some initial analysis over all of my chat logs to find cases where the model misbehaved, although I haven't quite worked out how I want to build an eval out of it.
One word of warning: to make transcript tasks, you need an AI coding system that doesn't obscure how it assembles its underlying prompts (which rules out most of the popular closed source AI code editors.)
I started building evals for a selfish reason: I wanted to be able to tell if modifications to my prompts were doing anything. But I also think there is a broader opportunity that arises if we also publish these benchmarks to the world.
For one, building a real world benchmark on use cases we care about is a way to communicate to the people training AI models whether or not they are doing well. Historical evals have focused on LeetCoding, and consequently we have models that would ace any big tech interview and yet on real world tasks will drive you off a cliff at the first opportunity. And this is not just free labor for the top labs: if you believe in open source models, one of the biggest barriers to good small models is having really high quality data. We, the OSS vibe coding community, can directly help here.
I think there is a tremendous opportunity for the open source community to really push the state of the art in coding evaluations. There's only so many benchmarks that I, personally, can create, but if everyone is making benchmarks I could eventually imagine a universe of benchmarks where you could curate the problems that are relevant to your work and quickly and cheaply judge models in this way: a Wikipedia of Coding Benchmarks.
To summarize: every time an LLM fails to solve a problem you ask it for, this is a potential new benchmark. As long as there is a way to automate testing if the LLM has solved the problem, you can turn this into a benchmark. Do this for yourself, and you can quickly have a personal benchmark with which to evaluate new models. Do this at scale, and you can help push the frontier in coding models.
Haskell is the world’s best programming language1, but let’s
face the harsh reality that a lot of times in life you’ll have to write in other
programming languages. But alas, you have been fully Haskell-brained and
lost all ability to program unless it is type-directed; you don't even know how
to start writing a program without first imagining its shape as a type.
Well, fear not. The foundational theory behind Algebraic Data Types and
Generalized Algebraic Data Types (ADTs and GADTs) are so fundamental that
they’ll fit (somewhat) seamlessly into whatever language you’re forced to write.
After all, if they can fit profunctor
optics in Microsoft’s Java code, the sky’s the limit!
This is an “April Fools” joke in the tradition of my previous
one, in that some of the ways we are going to twist these other languages
might seem unconventional or possibly ill-advised… but also the title is
definitely a lie: these languages definitely should have them! :D
Normal ADTs
As a reminder, algebraic Data Types (ADTs) are products and sums; that’s why
they’re algebraic, after all!
Product Types
Products are just immutable structs, which pretty much every language
supports — as long as you’re able to make sure they are never mutated.
This is much simpler in languages where you can associate functions with
data, like OOP and classes. For example, this is the common “value object”
pattern in java (roughly related to the java bean2):
In this case, not only are these ADTs (algebraic data types), they’re also
ADTs (abstract data types): you are meant to work with them
based on a pre-defined abstract interface based on type algebra, instead of
their internal representations.
Sum Types
If your language doesn’t support sum types, usually the way to go is with the
visitor pattern: the underlying implementation is hidden, and the only
way to process a sum type value is by providing handlers for every branch — a
pattern match as a function, essentially. Your sum values then basically
determine which handler is called.
For example, we can implement it for a network address type that can either
be IPv4 or IPv6. Here we are using C++ just for generics and lambdas with
closures, for simplicity, but we’ll discuss how this might look in C later.
Note that in this way, the compiler enforces that we handle every branch.
And, if we ever add a new branch, everything that ever consumes
IPAddress with an IPAddressVisitor will have to add a
new handler.
In a language without generics or powerful enough polymorphism, it’s
difficult to enforce the “pure” visitor pattern because you can’t ensure that
all branches return the same type.
One common pattern is to have an “effectful” visitor pattern, where the point
isn’t to return something, but to execute something on the payload of
the present branch. This is pretty effective for languages like C, javascript,
python, etc. where types aren’t really a rigid thing.
For example, this might be how you treat an “implicit nullable”:
This is basically for_ from Haskell: You can do something like
conditionally launch some action if the value is present.
visitMaybe(
  () => console.log("Nothing to request"),
  (reqPayload) => makeRequest("google.com", reqPayload),
  maybeRequest
);
On a simpler note, if your language has subtyping built in (maybe with classes
and subclasses) or some other form of dynamic dispatch, you can implement it in
terms of that, which is nice in python, java, C++, etc.
interface ExprVisitor<R> {
    R visitLit(int value);
    R visitNegate(Expr unary);
    R visitAdd(Expr left, Expr right);
    R visitMul(Expr left, Expr right);
}

abstract class Expr {
    public abstract <R> R accept(ExprVisitor<R> visitor);
}
Alternatively, you’re in a language where lambdas are easy, instead of
tupling up the visitor, you could just have accept itself take a
number of arguments corresponding to each constructor:
// Alternative definition without an explicit Visitor
abstract class Expr {
    public abstract <R> R accept(
        Function<Integer, R> visitLit,
        Function<Expr, R> visitNegate,
        BiFunction<Expr, Expr, R> visitAdd,
        BiFunction<Expr, Expr, R> visitMul
    );
}
(Note that C++ doesn’t allow template virtual methods — not because it’s not
possible within the language semantics and syntax, but rather because the
maintainers are too lazy to add it — so doing this faithfully requires a bit
more creativity)
Now, if your language has dynamic dispatch or subclass polymorphism, you can
actually do a different encoding, instead of the tagged union. This will work in
languages that don’t allow or fully support naked union types, too. In this
method, each constructor becomes a class, but it’s important to only
allow access using accept to properly enforce the sum type
pattern.
class Lit extends Expr {
    private final int value;

    public Lit(int value) {
        this.value = value;
    }

    @Override
    public <R> R accept(ExprVisitor<R> visitor) {
        return visitor.visitLit(value);
    }
}

class Negate extends Expr {
    private final Expr unary;

    public Negate(Expr unary) {
        this.unary = unary;
    }

    @Override
    public <R> R accept(ExprVisitor<R> visitor) {
        return visitor.visitNegate(unary);
    }
}

class Add extends Expr {
    private final Expr left;
    private final Expr right;

    public Add(Expr left, Expr right) {
        this.left = left;
        this.right = right;
    }

    @Override
    public <R> R accept(ExprVisitor<R> visitor) {
        return visitor.visitAdd(left, right);
    }
}

class Mul extends Expr {
    private final Expr left;
    private final Expr right;

    public Mul(Expr left, Expr right) {
        this.left = left;
        this.right = right;
    }

    @Override
    public <R> R accept(ExprVisitor<R> visitor) {
        return visitor.visitMul(left, right);
    }
}
(But, just wanted to note that if you actually are working in java,
you can actually do something with sealed classes, which allows exhaustiveness
checking for its native switch/case statements.)
Alternatively you could make all of the subclasses anonymous and expose them
as factory methods, if your language allows it:
abstract class Expr {
    public abstract <R> R accept(ExprVisitor<R> visitor);

    public static Expr lit(int value) {
        return new Expr() {
            @Override
            public <R> R accept(ExprVisitor<R> visitor) {
                return visitor.visitLit(value);
            }
        };
    }

    public static Expr negate(Expr unary) {
        return new Expr() {
            @Override
            public <R> R accept(ExprVisitor<R> visitor) {
                return visitor.visitNegate(unary);
            }
        };
    }

    public static Expr add(Expr left, Expr right) {
        return new Expr() {
            @Override
            public <R> R accept(ExprVisitor<R> visitor) {
                return visitor.visitAdd(left, right);
            }
        };
    }

    // ... etc
}
Passing around function references like this is actually pretty close to the
scott encoding of our data type — and for non-recursive types, it’s essentially
the church encoding.
Recursive Types
Speaking of recursive types…what if your language doesn’t allow recursive
data types? What if it doesn’t allow recursion at all, or what if recursively
generated values are just annoying to deal with? Just imagine writing that
Expr type in a language with explicit memory management, for
example. Or, what if you wanted a way to express your recursive types in a more
elegant and runtime-safe manner?
One thing you can instead do is have your visitor be in its “catamorphism”,
or church encoding. Instead of having the “visitor” take the recursive
sub-values, instead have it return the result of recursively applying
itself.
Let’s do this in dhall, one of the most famous non-recursive
languages. Dhall does have native sum types, so we won’t worry about
manually writing a visitor pattern. But it does not have recursive data
types.
Let’s define a type like:
data Expr = Lit Natural | Add Expr Expr | Mul Expr Expr
But we can’t define data types in dhall that refer to themselves. So instead,
we can define them in their “church encoding”: give what you would do with an
Expr to consume it, where the consumption function is given as if
it were recursively applied.
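The dhall definition itself did not survive the formatting here, so as a rough analogue (written in Haskell, with field names that are my own rather than the post's), the church-encoded version looks something like this:

-- Sketch of the church encoding described above: each handler receives the
-- *result* of folding a sub-expression, not the sub-expression itself.
{-# LANGUAGE RankNTypes #-}

data ExprF r = ExprF
  { onLit :: Natural -> r
  , onAdd :: r -> r -> r
  , onMul :: r -> r -> r
  }

type Expr = forall r. ExprF r -> r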
Note that ExprF r is essentially
ExprVisitor<R>, except instead of add being
Expr -> Expr -> r, it’s r -> r -> r: the
input values aren’t the expression, but rather the results of recursively
folding on the expression. In fact, our original non-recursive
ExprVisitor<R> (to be more precise, the
R accept(ExprVisitor<R>)) is often called the “scott
encoding”, as opposed to the recursive “church encoding” fold.
For value creation, you take the visitor and recursively apply:
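Continuing the Haskell analogue sketched above (the dhall code is not shown here), value creation just passes the handler record down and applies it recursively:

lit :: Natural -> Expr
lit n = \h -> onLit h n

add :: Expr -> Expr -> Expr
add x y = \h -> onAdd h (x h) (y h)

mul :: Expr -> Expr -> Expr
mul x y = \h -> onMul h (x h) (y h)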
And finally, using the data type involves providing the
handler to fold up from the bottom to top. Note that
add : \(left : Natural) -> \(right : Natural) -> left + right
already assumes that the handler has been applied to the sub-expressions, so you
get Naturals on both sides instead of Expr.
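In the Haskell analogue, consuming such a value just means supplying the handlers, e.g. for evaluation:

-- Fold the church-encoded expression into a Natural by giving each handler.
evalExpr :: Expr -> Natural
evalExpr e = e ExprF { onLit = id, onAdd = (+), onMul = (*) }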
This pattern is useful even in languages with good datatype recursion, like
Haskell — it’s actually the recursion-schemes
refactoring of a recursive data type, and it can be useful to have it live
alongside your normal recursive types. I’ve written this blog
post talking about how useful this pattern is to have alongside your normal
recursive types.
This pattern is pretty portable to other languages too, as long as you can
scrounge together something like Rank-N types:
interface ExprFold<R> {
    R foldLit(int value);
    R foldNegate(R unary);
    R foldAdd(R left, R right);
    R foldMul(R left, R right);
}

interface Expr {
    public abstract <R> R accept(ExprFold<R> fold);

    public static Expr lit(int value) {
        return new Expr() {
            @Override
            public <R> R accept(ExprFold<R> fold) {
                return fold.foldLit(value);
            }
        };
    }

    public static Expr negate(Expr unary) {
        return new Expr() {
            @Override
            public <R> R accept(ExprFold<R> fold) {
                return fold.foldNegate(unary.accept(fold));
            }
        };
    }

    // etc.
}
By “Rank-N types” here, I mean that your objects can generate polymorphic
functions: given an Expr, you could generate an
<R> R accept(ExprFold <R> fold) for any R,
and not something pre-determined or pre-chosen by your choice of representation
of Expr.
Generalized Algebraic Data Types
You’ve implemented ADTs in your language of choice, or you are currently in a
language with native ADTs. Life is good, right? Until that sneaky voice starts
whispering in your ear: “we need more type safety.” You resist that urge, maybe
even get a lot done without it, but eventually you are compelled to give in and
embrace the warm yet harsh embrace of ultimate type safety. Now what?
Singletons and Witnesses
In Haskell, singletons are essentially enums used to associate a value with a
reifiable type. “Reifiable” here means that you can take the runtime value of a
singleton and use it to bring evidence to the type-level. I ran into a
real-world usage of this while writing https://coronavirus.jle.im/, a web-based data visualizer of
COVID-19 data (source here) in
purescript. I needed a singleton to represent scales for scatter plots
and linking them to the data that can be plotted. And, not only did it need to
be type-safe in purescript (which has ADTs but not GADTs), it had to be
type-safe in the javascript ffi as well.
Here’s how it might look in Haskell:
-- | Numeric types
data NType :: Type -> Type where
  NInt     :: NType Int
  NDouble  :: NType Double
  NPercent :: NType Percent

-- | Define a scale
data Scale :: Type -> Type where
  ScaleDate   :: Scale Date
  ScaleLinear :: Bool -> NType a -> Scale a  -- ^ whether to include zero in the axis or not
  ScaleLog    :: NType a -> Scale a
You’d then run it like this:
plot :: Scale a -> Scale b -> [(a, b)] -> Canvas
So, we have the type of the input tuples being determined by the
values you pass to plot:
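For example, a hypothetical call (myChart is my own name, just to illustrate how the value-level scales pin down the tuple type):

-- Choosing ScaleDate and a linear NDouble scale forces the points to have
-- type [(Date, Double)].
myChart :: [(Date, Double)] -> Canvas
myChart = plot ScaleDate (ScaleLinear True NDouble)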
But let’s say we only had ADTs. And then we’re passing them down to a
javascript FFI which only has structs and functions. We could drop the
type-safety and instead error on runtime, but…no. Type unsafety is not
acceptable.
The fundamental ability we want to gain is that if we pattern match on
ScaleDate, then we know a has to be
Date. If we match on NInt, we know that a has to be Int.
For the sake of this example, we’re going to be implementing a simpler
function in purescript and in javascript: a function that takes a scale type and
a list of points and prints the bounds. In Haskell, this looks like:
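The original Haskell snippet did not make it into this copy, so here is a minimal sketch of what such a function could look like (assuming Ord and Show instances for Percent, and Text.Printf for formatting; this is an illustration, not the post's exact code):

{-# LANGUAGE LambdaCase #-}
import Text.Printf (printf)

-- Print the bounds of the data; the GADT match tells us the element type
-- in each branch, so we can pick an appropriate formatter.
displayBounds :: NType a -> [a] -> String
displayBounds = \case
  NInt     -> \xs -> show (minimum xs) <> " to " <> show (maximum xs)
  NDouble  -> \xs -> printf "%.4f to %.4f" (minimum xs) (maximum xs)
  NPercent -> \xs -> show (minimum xs) <> " to " <> show (maximum xs)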
(Pretend the Percent type is just a newtype-wrapped
Float or something)
There are at least two main approaches to do this. We’ll be discussing
runtime equality witnesses and Higher-Kinded Eliminators.
Runtime Witnesses and Coyoneda
Embedding
The Yoneda Lemma
is one of the most powerful tools that Category Theory has yielded as a branch
of math, but its sibling coyoneda
is one of the most useful Haskell abstractions.
This doesn’t give you GADTs, but it’s a very lightweight way to “downgrade”
your GADTs into normal ADTs, which is appropriate if you don’t need the full
power.
The trick is this: if you have MyGADT a, and you know you are
going to be using it to produce as, you can do a covariant
coyoneda transform.
For example, if you have this type representing potential data sources:
data Source :: Type -> Type where
  ByteSource   :: Handle -> Source Word
  StringSource :: FilePath -> Source String

readByte :: Handle -> IO Word
readString :: FilePath -> IO String

readSource :: Source a -> IO a
readSource = \case
  ByteSource h    -> readByte h
  StringSource fp -> readString fp
You could instead turn Source into a non-GADT by making it a
normal parameterized ADT and adding a X -> a field, which is a
type of CPS transformation:
data Source a
  = ByteSource Handle (Word -> a)
  | StringSource FilePath (String -> a)

byteSource :: Handle -> Source Word
byteSource h = ByteSource h id

stringSource :: FilePath -> Source String
stringSource fp = StringSource fp id

readSource :: Source a -> IO a
readSource = \case
  ByteSource h out    -> out <$> readByte h
  StringSource fp out -> out <$> readString fp
A nice benefit of this method is that Source can now have a
Functor instance, which the original GADT could not.
And, if MyGADT a is going to be consuming as, you can do the contravariant
coyoneda transform:
data Sink a
  = ByteSink Handle (a -> Word)
  | StringSink FilePath (a -> String)
And, if you are going to be both consuming and producing as, you
can do the invariant coyoneda transform
data Interface a
  = ByteInterface Handle (Word -> a) (a -> Word)
  | StringInterface FilePath (String -> a) (a -> String)
However, in practice, true equality involves being able to lift
under injective type constructors, and carrying every single
continuation is unwieldy. We can package them up together with a runtime
equality witness.
This is something we can put “inside” NInt such that, when we
pattern match on a NType a, the type system can be assured that
a is an Int.
You need some sort of data of type IsEq a b with functions:
refl :: IsEq a a
to :: IsEq a b -> a -> b
sym :: IsEq a b -> IsEq b a
trans :: IsEq a b -> IsEq b c -> IsEq a c
inj :: IsEq (f a) (f b) -> IsEq a b
If you have to and sym you also get
from :: IsEq a b -> b -> a.
From all of this, we can recover our original
IsEq a Word -> Word -> a and
IsEq a Word -> a -> Word functions, saving us from having to
put two functions.
Your language of choice might already have this IsEq. But one of
the more interesting ways to me is Leibniz equality (discussed a lot in this
Ryan Scott post), which works in languages with higher-kinded polymorphism.
Leibniz equality in languages with higher-kinded polymorphism means that
a and b are equal if
forall p. p a -> p b: any property of a is also
true of b.
In Haskell, we write this like:
newtype Leibniz a b = Leibniz (forall p. p a -> p b)

refl :: Leibniz a a
refl = Leibniz id
The only possible way to construct a ‘Leibniz’ is with both type parameters
being the same: You can only ever create a value of type
Leibniz a a, never a value of Leibniz a b where
b is not a.
You can prove that this is actually equality by writing functions
Leibniz a b -> Leibniz b a and
Leibniz a b -> Leibniz b c -> Leibniz a c (this
Ryan Scott post goes over it well), but in practice we realize this equality
by safely coercing a and b back and forth:
newtype Identity a = Identity { runIdentity :: a }

to :: Leibniz a b -> a -> b
to (Leibniz f) = runIdentity . f . Identity

newtype Op a b = Op { getOp :: b -> a }

from :: Leibniz a b -> b -> a
from (Leibniz f) = getOp (f (Op id))
So, if your language supports higher-kinded Rank-2 types, you have a
solution!
There are other solutions in other languages, but they will usually all be
language-dependent.
Let’s write everything in purescript. The key difference is we use
map (to isNumber) :: Array a -> Array Number, etc., to turn our
Array a into an array whose element type we actually know.
import Text.Printf

newtype Leibniz a b = Leibniz (forall p. p a -> p b)

to :: Leibniz a b -> a -> b
from :: Leibniz a b -> b -> a

data NType a
  = NInt (Leibniz a Int)
  | NNumber (Leibniz a Number)
  | NPercent (Leibniz a Percent)

type AxisBounds a =
  { minValue :: a
  , minLabel :: String
  , maxValue :: a
  , maxLabel :: String
  }

displayNumericAxis :: NType a -> Array a -> AxisBounds a
displayNumericAxis = case _ of
  NInt isInt -> \xs ->
    let xMin = minimum $ map (to isInt) xs
        xMax = maximum $ map (to isInt) xs
        showInt = show
     in { minValue: xMin
        , minLabel: showInt xMin
        , maxValue: xMax
        , maxLabel: showInt xMax
        }
  NNumber isNumber -> \xs ->
    let xMin = minimum $ map (to isNumber) xs
        xMax = maximum $ map (to isNumber) xs
        showFloat = printf (Proxy :: Proxy "%.4f")  -- it works a little differently
     in { minValue: xMin
        , minLabel: showFloat xMin
        , maxValue: xMax
        , maxLabel: showFloat xMax
        }
  NPercent isPercent -> \xs ->
    let xMin = minimum $ map (to isPercent) xs
        xMax = maximum $ map (to isPercent) xs
        showPercent = printf (Proxy :: Proxy "%.1f%%") <<< (_ * 100.0)
     in { minValue: xMin
        , minLabel: showPercent xMin
        , maxValue: xMax
        , maxLabel: showPercent xMax
        }
To work with our [a] as if it were [Int], we have
to map the coercion function over it that our Leibniz a Int gave
us. Admittedly, this naive way adds a runtime cost of copying the array. But we
could be more creative with finding the minimum and maximum in this way in
constant space and no extra allocations.
And, if we wanted to outsource this to the javascript FFI, remember that
javascript doesn’t quite have sum types, so we can create a quick visitor:
type NVisitor a r =
  { nvInt :: Leibniz a Int -> r
  , nvNumber :: Leibniz a Number -> r
  , nvPercent :: Leibniz a Percent -> r
  }

type NAccept a = forall r. NVisitor a r -> r

toAccept :: NType a -> NAccept a
toAccept = case _ of
  NInt isInt -> \nv -> nv.nvInt isInt
  NNumber isNumber -> \nv -> nv.nvNumber isNumber
  NPercent isPercent -> \nv -> nv.nvPercent isPercent

foreign import _formatNumeric :: forall a. Fn2 (NAccept a) a String

formatNumeric :: NType a -> a -> String
formatNumeric nt = runFn2 _formatNumeric (toAccept nt)
Admittedly in the javascript we are throwing away the “GADT type safety”
because we throw away the equality. But we take what we can — we at least retain
the visitor pattern for sum-type type safety and exhaustiveness checking. I
haven’t done this in typescript yet so there might be a way to formalize Leibniz
equality to do this in typescript and keep the whole chain type-safe from top to
bottom.
Higher-Kinded Eliminators
This is essentially the higher-kinded version of the visitor pattern, except
in dependent type theory these visitors are more often called “eliminators” or
destructors, which is definitely a cooler name.
In the normal visitor you’d have:
data User = TheAdmin | Member Int

data UserHandler r = UH
  { uhTheAdmin :: r
  , uhMember   :: Int -> r
  }
But note that if you have the right set of continuations, you have something
that is essentially equal to User without having to actually use
User:
type User' = forall r. UserHandler r -> r

fromUser :: User -> User'
fromUser = \case
  TheAdmin      -> \UH{..} -> uhTheAdmin
  Member userId -> \UH{..} -> uhMember userId

toUser :: User' -> User
toUser f = f $ UH { uhTheAdmin = TheAdmin, uhMember = Member }
This means that User is actually equivalent to
forall r. UserHandler r -> r: they’re the same type, so if your
language doesn’t have sum types, you could encode it as
forall r. UserHandler r -> r instead. Visitors, baby.
But, then, what actually does the r type variable represent
here, semantically? Well, in a UserHandler r, r is the
“target” that we interpret into. But there’s a deeper relationship between
r and User: A UserHandler r essentially
“embeds” a User into an r. And, a
UserHandler r -> r is the application of that embedding to an
actual User.
If we pick r ~ (), then UserHandler () embeds
User into (). If we pick r ~ String, then
UserHandler String embeds User into String
(like, “showing” it). And if we pick r ~ User, a
UserHandler User embeds a User into…itself?
So here, r is essentially the projection that we view the user
through. And by making sure we are forall r. UserHandler r -> r
for all r, we ensure that we do not lose any information:
the embedding is completely 1-to-1. It lets you “create” the User
faithfully in a “polymorphic” way.
In fact, to hammer this home, some people like to use the name of the type as
the type variable: UserHandler user:
-- | The same thing as before but with things renamed to prove a point
data MakeUser user = MakeUser
  { uhTheAdmin :: user
  , uhMember   :: Int -> user
  }

type User' = forall user. MakeUser user -> user
The forall user. lets us faithfully “create” a User
within the system we have, without actually having a User data
type. Essentially we can imagine the r in the forall r
as “standing in” for User, even if that type doesn’t actually
exist.
Now, here’s the breakthrough: If we can use forall (r :: Type)
to substitute for User :: Type, how about we use a
forall (p :: Type -> Type) to substitute for a
Scale :: Type -> Type?
data Scale :: Type -> Type where
  ScaleDate   :: Scale Date
  ScaleLinear :: Bool -> NType a -> Scale a
  ScaleLog    :: NType a -> Scale a

data ScaleHandler p a = SH
  { shDate   :: p Date
  , shLinear :: Bool -> NType a -> p a
  , shLog    :: NType a -> p a
  }

type Scale' a = forall p. ScaleHandler p a -> p a

fromScale :: Scale a -> Scale' a
fromScale = \case
  ScaleDate              -> \SH{..} -> shDate
  ScaleLinear hasZero lt -> \SH{..} -> shLinear hasZero lt
  ScaleLog nt            -> \SH{..} -> shLog nt

toScale :: Scale' a -> Scale a
toScale f = f $ SH { shDate = ScaleDate, shLinear = ScaleLinear, shLog = ScaleLog }
So in our new system, forall p. ScaleHandler p a -> p a is
identical to Scale: we can use p a to substitute in
Scale in our language even if our language itself cannot support
GADTs.
So let’s write formatNType in purescript. We no longer have an
actual Scale sum type, but its higher-kinded church encoding:
type NType a =
  forall p.
  { int :: p Int
  , number :: p Number
  , percent :: p Percent
  } -> p a

type Scale a =
  forall p.
  { date :: p Date
  , linear :: Bool -> NType a -> p a
  , log :: NType a -> p a
  } -> p a

ntInt :: NType Int
ntInt nth = nth.int

ntNumber :: NType Number
ntNumber nth = nth.number

ntPercent :: NType Percent
ntPercent nth = nth.percent

formatNType :: NType a -> a -> String
formatNType nt = f
  where
    Op f = nt
      { int: Op show
      , number: Op $ printf (Proxy "%.4f")
      , percent: Op $ printf (Proxy "%.1f%%") <<< (_ * 100.0)
      }
Here we are using
newtype Op b a = Op (a -> b)
as our “target”: turning an NType a into an
Op String a. And an Op String a is an
a -> String, which is what we wanted! The int field
is Op String Int, the number field is
Op String Number, etc.
In many languages, using this technique effectively requires having a newtype
wrapper on-hand, so it might be unwieldy in non-trivial situations. For example,
if we wanted to write our previous axis function which is
NType a -> [a] -> String, we’d have to have a newtype wrapper
for [a] -> String that has a as its argument:
newtype OpList b a = OpList ([a] -> b)
or you could re-use Compose:
newtype Compose f g a = Compose (f (g a))
and your p projection type would be Compose Op [].
So, you don’t necessarily have to write a bespoke newtype wrapper, but you do
have to devote some brain cycles to think it through (unless you’re in a
language that doesn’t need newtype wrappers to have this work, like we’ll
discuss later).
By the way, this method generalizes well to multiple arguments: if you have a
type like MyGADT a b c, you just need to project into a
forall (p :: k1 -> k2 -> k3 -> Type).
I believe I have read somewhere that the two methods discussed here (runtime
equality witness vs. higher-kinded eliminator) are not actually fully identical
in their power, and there are GADTs where one would work and not the other … but
I can’t remember where I read this and I’m also not big-brained enough to figure
out what those situations are. But if you, reader, have any idea, please let me
know!
Existential Types
Let’s take a quick break to talk about something that’s not
technically related to GADTs but is often used alongside them.
What if we wanted to store a value with its NType and hide the
type variable? In Haskell we’d write this like:
data NType :: Type -> Type where
  NInt     :: NType Int
  NDouble  :: NType Double
  NPercent :: NType Percent

data SomeNType = forall a. SomeNType (NType a) a

formatNType :: NType a -> a -> String
formatNType nt x = ...

formatSomeNType :: SomeNType -> String
formatSomeNType (SomeNType nt x) = formatNType nt x

myFavoriteNumbers :: [SomeNType]
myFavoriteNumbers = [SomeNType NInt 3, SomeNType NDouble pi]
But what if our language doesn’t have existentials? Remember, this is
basically a value SomeNType that isn't generic (it has no type parameter), but
contains both an NType a and an a with the
same type variable.
One strategy we have available is to CPS-transform our existentials into
their CPS form (continuation-passing style form). Basically, we write exactly
what we want to do with our contents if we pattern matched on them.
It’s essentially a Rank-N visitor pattern with only a single constructor:
type SomeNType = forall r. (forall a. NType a -> a -> r) -> r

someNType :: NType a -> a -> SomeNType
someNType nt x = \f -> f nt x

formatSomeNumeric :: SomeNType -> String
formatSomeNumeric snt = snt \nt x -> formatNumeric nt x
You can imagine, syntactically, that snt acts as its “own”
pattern match, except instead of matching on
SomeNType nt x -> .., you “match” on
\nt x -> ..
This general pattern works for languages with traditional generics like Java
too:
interface SomeNTypeVisitor<R> {
    <A> R visit(NType<A> nt, A val);
}

interface SomeNType {
    public abstract <R> R accept(SomeNTypeVisitor<R> visitor);

    // One option: the factory method
    public static <A> SomeNType someNType(NType<A> nt, A val) {
        return new SomeNType() {
            @Override
            public <R> R accept(SomeNTypeVisitor<R> visitor) {
                return visitor.visit(nt, val);
            }
        };
    }
}

// Second option: the subtype hiding a type variable, which you have to always
// make sure to upcast into `SomeNType` after creating
class SomeNTypeImpl<A> implements SomeNType {
    private NType<A> nt;
    private A val;

    public SomeNTypeImpl(NType<A> nt, A val) {
        this.nt = nt;
        this.val = val;
    }

    @Override
    public <R> R accept(SomeNTypeVisitor<R> visitor) {
        return visitor.visit(nt, val);
    }
}
Does…anyone write java like this? I tried committing this once while at
Google and I got automatically flagged to be put on a PIP.
Recursive GADTs
The climax of this discussion: what if your language does not support GADTs
or recursive data types?
We’re going to be using dhall as an example again, but note that the
lessons applied here are potentially useful even when you do have
recursive types: we’re going to be talking about a higher-kinded church
encoding, which can be a useful form of your data types that live alongside your
normal recursive ones.
Let’s imagine Expr as a GADT, where Expr a
represents an Expr that evaluates to an a:
data Expr :: Type -> Type where
  NatLit  :: Natural -> Expr Natural
  BoolLit :: Bool -> Expr Bool
  Add     :: Expr Natural -> Expr Natural -> Expr Natural
  LTE     :: Expr Natural -> Expr Natural -> Expr Bool
  Ternary :: Expr Bool -> Expr a -> Expr a -> Expr a

eval :: Expr a -> a
eval = \case
  NatLit n      -> n
  BoolLit b     -> b
  Add x y       -> eval x + eval y
  LTE a b       -> eval a <= eval b
  Ternary b x y -> if eval b then eval x else eval y
Adding this type variable ensures that our Expr is type-safe:
it’s impossible to Add an Expr Bool, and the two
branches of a Ternary must have the same result type, etc. And, we
can write eval :: Expr a -> a and know exactly what type will be
returned.
Now, let’s combine the two concepts: First, the church encoding, where our
handlers take the “final result” of our fold r instead of the
recursive value Expr. Second, the higher-kinded eliminator pattern
where we embed Expr :: Type -> Type into
forall (p :: Type -> Type).
Again, now instead of add taking Expr, it takes
p Natural: the “Natural result of the fold”.
p not only stands in for what we embed Expr into, it
stands in for the result of the recursive fold. That’s why in eval,
the first arguments of add are the Natural results of
the sub-evaluation.
These values can be created in the same way as before, merging the two
techniques, sending the handlers downstream:
If all of this is difficult to parse, try reviewing both the recursive ADT
section and the higher-kinded eliminator section and making sure you understand
both well before tackling this, which combines them together!
Admittedly in Haskell (and purescript) this is a lot simpler because we don’t
have to explicitly pass in type variables:
data ExprF p = ExprF
  { natLit  :: Natural -> p Natural
  , boolLit :: Bool -> p Bool
  , add     :: p Natural -> p Natural -> p Natural
  , ternary :: forall a. p Bool -> p a -> p a -> p a
  }

type Expr a = forall p. ExprF p -> p a

eval :: Expr a -> a
eval e = runIdentity $ e ExprF
  { natLit  = Identity
  , boolLit = Identity
  , add     = \(Identity x) (Identity y) -> Identity (x + y)
  , ternary = \(Identity b) (Identity x) (Identity y) -> if b then x else y
  }

ternary :: Expr Bool -> Expr a -> Expr a -> Expr a
ternary b x y handlers = handlers.ternary (b handlers) (x handlers) (y handlers)
But one nice thing about the dhall version that’s incidental to dhall is that
it doesn’t require any extra newtype wrappers like the Haskell one does. That’s
because type inference tends to choke on things like this, but dhall doesn’t
really have any type inference: all of the types are passed explicitly. It’s one
of the facts about dhall that make it nice for things like this.
Congratulations
In any case, if you’ve made it this far, congratulations! You are a master of
ADTs and GADTs. Admittedly every language is different, and some of these
solutions have to be tweaked for the language in question. And, if your program
gets very complicated, there is a good chance that things will become
ergonomically unfeasible.
But I hope, at least, that this inspires your imagination to try to bring
your haskell principles, techniques, standards, practices, and brainrot into the
language of your choice (or language you are forced to work with).
And, if you ever find interesting ways to bring these things into a language
not discussed here (or a new interesting technique or pattern), I would
absolutely love to hear about it!
Until next time, happy “Haskelling”!
Special Thanks
I am very humbled to be supported by an amazing community, who make it
possible for me to devote time to researching and writing these posts. Very
special thanks to my supporter at the “Amazing” level on patreon, Josh Vera! :)
I bet you thought there was going to be some sort of caveat in this
footnote, didn’t you?↩︎
I didn’t think I’d ever write “java bean” non-ironically on my
blog, but there’s a first time for everything.↩︎
Be aware that this implementation is not necessarily
appropriately lazy or short-circuiting in Ternary: it might
evaluate both sides before returning the chosen branch.↩︎
To visit a tree or graph in breadth-first order, there are two main
implementation approaches: queue-based or level-based.
Our goal here is to develop a level-based approach where the levels of
the breadth-first walk are constructed compositionally and dynamically.
Compositionality means that for every node, its descendants—the other nodes
reachable from it—are defined by composing the descendants of its children.
Dynamism means that the children of a node are generated only when that node
is visited; we will see that this requirement corresponds to asking for a
monadic unfold.
A prior solution, using the Phases applicative functor,
is compositional but not dynamic in that sense. The essence of Phases
is a zipping operation in free applicative functors.
What if we did zipping in free monads instead?
A breadth-first walk explores the tree level by level; every level contains the
nodes at the same distance from the root. The list of levels of a tree can be defined
recursively—it is a fold. For a tree Node x l r, the first level contains
just the root node x, and the subsequent levels are obtained by appending the
levels of the subtrees l and r pairwise.
(We can’t just use zipWith because it throws away the end of a list when the
other list is empty.)
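The definitions themselves did not survive the copy here; a sketch of levels as just described, assuming data Tree a = Leaf | Node a (Tree a) (Tree a), could look like this:

levels :: Tree a -> [[a]]
levels Leaf = []
levels (Node x l r) = [x] : zipLevels (levels l) (levels r)
  where
    -- like zipWith (++), but keeps the leftover levels of the deeper subtree
    zipLevels (a : as) (b : bs) = (a ++ b) : zipLevels as bs
    zipLevels as []             = as
    zipLevels [] bs             = bs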
Finally, we concatenate the levels together to obtain the list of nodes in
breadth-first order.
toListBF :: Tree a -> [a]
toListBF = concat . levels
Thanks to laziness, the list will indeed be produced by walking the tree in
breadth-first order.
So far so good.
The above function lets us fold a tree in breadth-first order.
The next level of difficulty is to traverse a tree, producing a tree
with the same shape as the original tree, only with modified labels.
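Concretely, the function we are after has this shape (signature reconstructed from the surrounding discussion):

traverseBF :: Applicative m => (a -> m b) -> Tree a -> m (Tree b)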
This has the exact same type as traverse, which you might obtain with
deriving (Foldable, Traversable). The stock-derived Traversable—enabled
by the DeriveTraversable extension—is a depth-first traversal, but the laws
of traverse don’t specify the order in which nodes should be visited,
so you could make it a breadth-first traversal if you wanted.
“Breadth-first numbering” is a special case of “breadth-first traversal”
where the arrow (a -> m b) is specialized to a counter.
Okasaki presents a “numbering” solution based on queues and another solution
based on levels.
Both are easily adaptable to the more general “traversal” problem as we will
soon see.
There is a wonderful Discourse thread from 2024 on the topic of
breadth-first traversals.
The first post gives an elegant breadth-first numbering algorithm
which also appears in the appendix of Okasaki’s paper,
but sadly it does not generalize from “numbering” to
“traversal” beyond the special case m = State s.
Last but not least, another level-based solution to the breadth-first traversal
problem can be found in the
tree-traversals library by Noah Easterly.
It is built around an applicative transformer named Phases,
which is a list of actions—imagine the type “[m _]”—where each
element m _ represents one level of the tree.
The Phases applicative enables a compositional definition of a
breadth-first traversal, similarly to the levels function above:
the set of nodes reachable from the root is defined by combining the sets of
nodes reachable from its children. This concern of compositionality
is one of the main motivations behind this post.
Non-standard terminology
The broad family of algorithms being discussed is typically called
“breadth-first search” (BFS) or “breadth-first traversal”,
but in general these algorithms are not “searching” for anything,
and in Haskell, “traversal” is reserved for “things like traverse”.
Instead, this post will use “walks” as a term encompassing folds, traversals,
unfolds, or any concept that can be qualified with “breadth-first”.
Problem statement: Breadth-first unfolds
Both the fold toListBF and the traversal traverseBF had in common that they
receive a tree as an input. This explicit tree makes the notion of levels
“static”. With unfolds, we will have to deal with levels that exist only
“dynamically” as the result of unfolding the tree progressively.
To introduce the unfolding of a tree, it is convenient to introduce its “base
functor”. We modify the tree type by replacing the recursive tree fields with
an extra type parameter:
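The definition was lost in this copy; the base functor being described is presumably the following (the derived instances are an assumption):

-- The "base functor" of Tree: recursive positions become a parameter t.
data TreeF a t = LeafF | NodeF a t t
  deriving (Functor, Foldable, Traversable)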
An unfold generates a tree from a seed and a
function which expands the seed into a leaf or a node containing more seeds.
A pure unfold—or anamorphism—can be defined readily:
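In code, the pure unfold might look like this (a standard anamorphism, reconstructed rather than copied from the post):

unfold :: (s -> TreeF a s) -> s -> Tree a
unfold f s = case f s of
  LeafF       -> Leaf
  NodeF a l r -> Node a (unfold f l) (unfold f r)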
The order in which nodes are evaluated depends on
how the resulting tree is consumed. Hence unfold
is neither inherently “depth-first” nor “breadth-first”.
The situation changes if we make the unfold monadic.
unfoldM :: Monad m => (s -> m (TreeF a s)) -> s -> m (Tree a)
An implementation of unfoldM must decide upon an ordering between actions.
To see why adding an M to unfold imposes an ordering,
contemplate the fact that these expressions have the same meaning:
Node a (unfold f l) (unfold f r)
= ( let tl = unfold f l in
let tr = unfold f r in
Node a tl tr )
= ( let tr = unfold f r in
let tl = unfold f l in
Node a tl tr )
whereas these monadic expressions do not have the same meaning in general:
( unfoldM f l >>= \tl ->
unfoldM f r >>= \tr ->
pure (Node a tl tr) )
/=
( unfoldM f r >>= \tr ->
unfoldM f l >>= \tl ->
pure (Node a tl tr) )
Without further requirements, there is an “obvious” definition of unfoldM,
which is a depth-first unfold:
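A sketch of that obvious depth-first definition (reconstructed; the post's own code is not shown in this copy):

unfoldM_DF :: Monad m => (s -> m (TreeF a s)) -> s -> m (Tree a)
unfoldM_DF f s = f s >>= \case
  LeafF       -> pure Leaf
  NodeF a l r -> Node a <$> unfoldM_DF f l <*> unfoldM_DF f r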
We unfold the left subtree l fully before unfolding the right one r.
The problem is to define a breadth-first unfoldM.
If you want to think about this problem on your own, you can stop reading here.
The rest of this post presents solutions.
Queue-based unfold
The two breadth-first numbering algorithms in Okasaki’s paper can
actually be generalized to breadth-first unfolds.
Here is the first one that uses queues (using the function (<+) for “push” and
pattern-matching on (:>) for “pop”):
If you’re frowning upon the use of error—as you should be—you can replace
error with dummy values here (Empty, Leaf), but
(1) that won’t be possible with tree structures that must be non-empty
(e.g., if Leaf contained a value) and (2) this is dead code, which
is harmless but no more elegant than making it obvious with error.
The correctness of this solution is also not quite obvious.
There are subtle ways to get this implementation wrong:
should the recursive call be b2 <+ b1 <+ q or b1 <+ b2 <+ q?
Should the pattern be p :> t1 :> t2 or p :> t2 :> t1?
For another version of this challenge, try implementing the unfold for another
tree type, such as finger trees or rose trees, without getting lost in the
order of pushes and pops (by the way, this is Data.Tree.unfoldTreeM_BF in
containers). The invariant is not complex but there is room for mistakes.
I believe that the compositional approach that will be presented later is more
robust on that front, although it is admittedly a subjective quality for which
it is difficult to make a strong case.
Some uses of unfolds
Traversals from unfolds
One sense in which unfoldM is a more difficult problem than traverse is
that we can use unfoldM to implement traverse.
We do have to make light of the technicality that there is a Monad constraint
instead of Applicative, which makes unfoldM not suited to implement the
Traversable class.
A depth-first unfold gives a depth-first traversal:
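For instance, using the tree itself as the seed (a sketch; unfoldM_DF is the hypothetical depth-first unfold sketched earlier, and the same recipe with a breadth-first unfold gives a breadth-first traversal):

traverseDF :: Monad m => (a -> m b) -> Tree a -> m (Tree b)
traverseDF f = unfoldM_DF step
  where
    -- expand one node: run the effect on its label, keep its subtrees as seeds
    step Leaf         = pure LeafF
    step (Node a l r) = (\b -> NodeF b l r) <$> f a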
We can use a tree unfold to explore a graph.
This usage distinguishes unfolds from folds and traversals,
which only let you explore trees.
Given a type of vertices V, a directed graph is represented by a function
V -> F V, where F is a functor which describes the arity of each node.
The obvious choice for F is lists, but we will stick to TreeF here
so we can just reuse this post’s unfoldM implementations.
The TreeF functor restricts us to graphs where each node has zero or two
outgoing edges; it is a weird restriction, but we will make do for the sake of
example.
An ASCII drawing of a graph
+-------+
v |
+->1--->2--->3 |
| | | ^ |
| v v | |
| 4--->5--->6--+
| | | ^
| +----|----+
| |
+-------+
The graph drawn above turns into the following function, where every vertex
is mapped either to NodeF with the same vertex as the first argument followed
by its two adjacent vertices, or to LeafF if it has no outgoing edges or does
not belong to the graph.
If we simply feed that function to unfold, we will get the infinite tree
of all possible paths from a chosen starting vertex.
To obtain a finite tree, we want to keep track of vertices that we have
already visited, using a stateful memory. The following function wraps graph,
returning LeafF also if a vertex has already been visited.
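A possible shape for this wrapper, assuming a function graph :: Int -> TreeF Int Int encoding the drawing above (visitGraph is used by bfGraph_Q below):

import qualified Data.Set as Set
import Control.Monad.State (State, get, modify)

visitGraph :: Int -> State (Set.Set Int) (TreeF Int Int)
visitGraph v = do
  visited <- get
  if v `Set.member` visited
    then pure LeafF              -- already visited: cut the walk short
    else do
      modify (Set.insert v)      -- mark v as visited
      pure (graph v)             -- otherwise expand it as the graph dictates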
Applying unfoldM_BF to that function produces a “breadth-first tree”
of the graph, an encoding of the trajectory of a breadth-first walk through the
graph. “Breadth-first trees” are a concept from graph theory with well-studied
properties.
-- Visit `graph` in breadth-first order
bfGraph_Q :: Int -> Tree Int
bfGraph_Q = (`evalState` Set.empty) . unfoldM_BF_Q visitGraph
This post is a compilable Literate Haskell file. You can run all of the tests
and benchmarks in here. The source repository provides the necessary
configuration to build it with cabal.
$ cabal build breadth-first-unfolds
Test cases can then be selected with the -p option and a pattern
(see the tasty documentation for details).
Run all tests and benchmarks by passing no option.
$ cabal exec breadth-first-unfolds -- -p "/Q-graph/||/S-graph/"
All
Q-graph: OK
S-graph: OK
“Global” level-based unfold
The other solution from Okasaki’s paper can also be adapted into a monadic unfold.
The starting point is to unfold a list of seeds [s] instead of a single seed:
we can traverse the list with the expansion function s -> m (TreeF a s) to
obtain another list of seeds, the next level of the breadth-first unfold,
and keep going.
Iterating this process naively yields a variant of monadic unfold without a
result. This no-result variant can be generalized from TreeF to
any foldable structure:
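A hedged sketch of that no-result variant (the name unfoldM_BF_G_ reappears later in the post):

import Data.Foldable (toList)

unfoldM_BF_G_ :: (Monad m, Foldable f) => (s -> m (f s)) -> s -> m ()
unfoldM_BF_G_ f s0 = go [s0]
  where
    go []    = pure ()
    go seeds = traverse f seeds >>= go . concatMap toList   -- expand a whole level, then recurse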
Modifying this solution to create the output tree requires a little more thought.
We must keep hold of the intermediate list of ts :: [TreeF a s] to
reconstruct trees after the recursive call returns.
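One way this could look (a reconstruction, not necessarily the post's exact code):

unfoldM_BF_G :: Monad m => (s -> m (TreeF a s)) -> s -> m (Tree a)
unfoldM_BF_G f s0 = fmap root (go [s0])
  where
    root [t] = t
    root _   = error "impossible: one seed in, one tree out"

    go []    = pure []
    go seeds = do
      ts <- traverse f seeds                    -- expand one whole level, left to right
      let nextSeeds (NodeF _ l r) = [l, r]
          nextSeeds LeafF         = []
      subtrees <- go (concatMap nextSeeds ts)   -- unfold the next level
      pure (reconstruct ts subtrees)

    -- Rebuild this level's trees, consuming next-level subtrees in order.
    reconstruct []                 _              = []
    reconstruct (LeafF       : ts) subs           = Leaf : reconstruct ts subs
    reconstruct (NodeF a _ _ : ts) (l : r : subs) = Node a l r : reconstruct ts subs
    reconstruct (NodeF _ _ _ : _)  _              = error "impossible: missing subtrees"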
This solution is less brittle than the queue-based solution because
we always traverse lists left-to-right.
To avoid the uses of error in reconstruct,
you can probably create a specialized data structure in place of [TreeF a s],
but that is finicky in its own way.
In search of compositionality
Both of the solutions above (the queue-based and the “monolithic” level-based unfolds)
stem from a global view of breadth-first walks: we are iterating on a list or a
queue which holds all the seeds from one or two levels at a time.
That structure represents a “front line” between visited and unvisited
vertices, and every iteration advances the front line a little: with a queue we
advance it one vertex at a time, with a list we advance the whole front line
in an inner loop—one call to traverse—before recursing.
The opposite local view of breadth-first order is exemplified by the earlier
levels function: it only produces a list of lists of the vertices
reachable from the current root. It does so recursively, by composing
together the vertices reachable from its children. Our goal here is to find a
similarly local, compositional implementation of breadth-first unfolds.
Rather than defining unfoldM directly, which sequences the computations on
all levels into a single computation, we will introduce an intermediate
function weave that keeps levels separate—just as toListBF is defined
using levels.
The result of weave will be in an as yet unknown applicative functor F m
depending on m.
And because levels are kept separate, weave only needs
a constraint Applicative m to compose computations on the same level.
The goal is to implement this signature, where the result type F is also an
unknown:
With only what we know so far, a bit of type-directed programming leads to the
following incomplete definition. We have constructed something of type
m (F m (Tree a)), while we expect F m (Tree a):
To fill the hole _, we postulate the following primitive, weft,
as part of the unknown definition of F:
weft :: Applicative m => m (F m a) -> F m a
Intuitively, F m represents “multi-level computations”.
The weft function constructs a multi-level (F m)-computation from
one level of m-computation which returns the subsequent levels
as an (F m)-computation.
We fill the hole with weft, completing the definition of weave:
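Under the assumption that F m comes with an Applicative instance in addition to weft, the completed weave might read:

weave :: Applicative m => (s -> m (TreeF a s)) -> s -> F m (Tree a)
weave f s = weft (weaveF <$> f s)
  where
    weaveF LeafF         = pure Leaf
    weaveF (NodeF a l r) = liftA2 (Node a) (weave f l) (weave f r)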
The function weave defines a multi-level computation which represents
a breadth-first walk from a seed s:
the first level of the walk is f s, expanding the initial seed;
the auxiliary function weaveF constructs the remaining levels from
the initial seed’s expansion:
if the seed expands to LeafF, there are no more seeds,
and we terminate with an empty computation (pure);
if the seed expands to NodeF, we obtain two sub-seeds l and r,
they generate their own weaves recursively (weave f l and weave f r),
and we compose them (liftA2).
One way to think about weft is as a generalization of the following primitives:
we can “embed” m-computations into F m,
and we can “delay” multi-level (F m)-computations, shifting the
m-computation on each level to the next level.
The key law relating these two operations is that embedded computations
and delayed computations commute with each other:
embed u *> delay v = delay v <* embed u
The embed and delay operations are provided by the Phases applicative
functor that I mentioned earlier, which enables breadth-first traversals,
but not breadth-first unfolds. Thus, weft is a strictly more expressive
primitive than embed and delay.
Eventually, we will run a multi-level computation as a single m-computation
so that we can use weave to define unfoldM. The runner function will be
called mesh:
mesh :: Monad m => F m a -> m a
It is characterized by this law which says that mesh executes the first
level of the computation u :: m (F m a), then executes the remaining levels
recursively:
mesh (weft u) = u >>= mesh
Putting everything together, weave and mesh combine into a breadth-first unfold:
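In other words (hedged, since F, weft, and mesh are still abstract at this point):

unfoldM_BF :: Monad m => (s -> m (TreeF a s)) -> s -> m (Tree a)
unfoldM_BF f = mesh . weave f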
It remains to find an applicative functor F equipped with weft and mesh.
The weave applicative
A basic approach to design a type is to make some of the operations it
should support into constructors. The weave applicative WeaveS has
constructors for pure and weft:
data WeaveS m a = EndS a | WeftS (m (WeaveS m a))
(The suffix “S” stands for Spoilers. Read on!)
We instantiate the unknown functor F with WeaveS.
type F = WeaveS
Astute readers will have recognized WeaveS as the free monad.
Just as Phases has the same type definition as the free applicative functor but
a different Applicative instance, we will give WeaveS an Applicative
instance that does not coincide with the Applicative and Monad instances of
the free monad.
Starting with the easy functions,
weft is WeftS, and the equation for mesh above is basically its definition.
We just need to add an equation for EndS.
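That is, roughly:

weft :: m (WeaveS m a) -> WeaveS m a
weft = WeftS

mesh :: Monad m => WeaveS m a -> m a
mesh (EndS a)  = pure a        -- the extra equation: a finished computation
mesh (WeftS u) = u >>= mesh    -- the law mesh (weft u) = u >>= mesh, read as a definition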
Recall that WeaveS represents multi-level computations.
Computations are composed level-wise with the following liftS2.
The interesting case is the one where both arguments are WeftS: we compose
the first level with liftA2, and the subsequent ones with liftS2
recursively.
liftS2 will be the liftA2 in WeaveS’s Applicative instance.
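A possible rendering of these instances (hedged):

import Control.Applicative (liftA2)

instance Functor m => Functor (WeaveS m) where
  fmap f (EndS a)  = EndS (f a)
  fmap f (WeftS u) = WeftS (fmap (fmap f) u)   -- recurse down every remaining level

liftS2 :: Applicative m => (a -> b -> c) -> WeaveS m a -> WeaveS m b -> WeaveS m c
liftS2 f (EndS a)  wb        = fmap (f a) wb              -- one side is done: map over the other
liftS2 f wa        (EndS b)  = fmap (\a -> f a b) wa
liftS2 f (WeftS u) (WeftS v) = WeftS (liftA2 (liftS2 f) u v)   -- compose level-wise

instance Applicative m => Applicative (WeaveS m) where
  pure   = EndS
  liftA2 = liftS2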
The Functor and Applicative instances show that WeaveS is an
applicative transformer: for every applicative functor m,
WeaveS m is also an applicative functor.
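Specializing weave to WeaveS, with the tweaks described in the next paragraph, might look like this (a reconstruction):

import Data.Functor ((<&>))

weaveS :: Applicative m => (s -> m (TreeF a s)) -> s -> m (WeaveS m (Tree a))
weaveS f s = f s <&> \t -> case t of
  LeafF       -> EndS Leaf
  NodeF a l r -> WeftS (liftA2 (liftS2 (Node a)) (weaveS f l) (weaveS f r))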
The outer weft constructor was moved into the recursive calls.
The result type has an extra m, which makes it more apparent that
we always start with a call to f. It’s the same vibe as replacing the type
[a] with NonEmpty a when we know that a list will always have at least one
element; weaveS always produces at least one level of computation.
We also replace (<$>) with its flipped version (<&>) for aesthetic reasons:
we can apply it to a lambda without parentheses, and that change makes the
logic flow naturally from left to right: we first expand the seed s using
f, and continue depending on whether the expansion produced LeafF or NodeF.
To define unfoldM, instead of applying mesh directly, we chain it with
(>>=).
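For example (the name unfoldM_BF_S is my own, following the post's suffix convention):

unfoldM_BF_S :: Monad m => (s -> m (TreeF a s)) -> s -> m (Tree a)
unfoldM_BF_S f s = weaveS f s >>= mesh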
That solution is Obviously Correct™, but it has a terrible flaw:
it does not run in linear time!
We can demonstrate this by generating a “thin” tree whose height
is equal to its size.
The height h is the seed of the unfolding, and we generate a NodeF as long
as it is non-zero, asking for a decreased height h - 1 on the right,
and a zero height on the left.
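A plausible reconstruction of that generator, run in the Identity monad (an assumption):

import Data.Functor.Identity (Identity (..))

thinTreeS :: Int -> Tree ()
thinTreeS = runIdentity . unfoldM_BF_S (Identity . expand)
  where
    expand 0 = LeafF
    expand h = NodeF () 0 (h - 1)   -- zero-height left seed, shrinking right seed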
$ cabal exec breadth-first-unfolds -- -p "S-thin"
All
S-thin
1x: OK
27.6 μs ± 2.6 μs, 267 KB allocated, 317 B copied, 6.0 MB peak memory
10x: OK
2.90 ms ± 181 μs, 23 MB allocated, 178 KB copied, 7.0 MB peak memory, 105.35x
Multiplying the height by 10x makes the function run 100x slower.
Dramatically quadratic.
Complexity analysis
We can compare this implementation with levels from earlier, which is linear-time.
In particular, looking at zipLevels with liftS2—which play similar
roles—there is a crucial difference when one of the arguments is empty
([] or EndS):
zipLevels simply returns the other argument, whereas liftS2 calls (<$>),
continuing the recursion down the other argument.
So zipLevels stops working after reaching the end of either argument, whereas
liftS2 walks to the end of both arguments. There is at least one
call to liftS2 on every level which will walk to the bottom of the tree,
so we get a quadratic lower bound Ω(height²).
Out of sight, out of mind
The problematic combinators are fmap and liftS2, which weaveS uses to
construct the unfolded tree. If we don’t care about that tree—wanting only
the effect of a monadic unfold—then we can get rid of the complexity
associated with those combinators.
With no result to return, we remove the a type parameter from the definition
of WeaveS, yielding the oblivious (“O”) variant:
data WeaveO m = EndO | WeftO (m (WeaveO m))
We rewrite mesh into meshO, reducing a WeaveO m computation
into m () instead of m a.
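Roughly:

meshO :: Monad m => WeaveO m -> m ()
meshO EndO      = pure ()
meshO (WeftO u) = u >>= meshO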
To implement a breadth-first walk, we modify weaveS above by replacing
liftA2 (Node a) with (<>). Note that the type parameter a is no longer in
the result. It was only used in the tree that we decided to forget.
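Concretely, something along these lines, where (<>) composes oblivious computations level-wise and simply drops finished ones:

import Data.Functor ((<&>))

instance Applicative m => Semigroup (WeaveO m) where
  EndO    <> w       = w
  w       <> EndO    = w
  WeftO u <> WeftO v = WeftO (liftA2 (<>) u v)

weaveO :: Applicative m => (s -> m (TreeF a s)) -> s -> m (WeaveO m)
weaveO f s = f s <&> \t -> case t of
  LeafF       -> EndO
  NodeF _ l r -> WeftO (liftA2 (<>) (weaveO f l) (weaveO f r))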
Running weaveO with meshO yields an oblivious monadic unfold:
it produces m () instead of m (Tree a).
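That is, with the name used by the benchmark below:

unfoldM_BF_O_ :: Monad m => (s -> m (TreeF a s)) -> s -> m ()
unfoldM_BF_O_ f s = weaveO f s >>= meshO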
(You may remember seeing another implementation of that same signature
just earlier, unfoldM_BF_G_.)
Previously, we benchmarked the function thinTreeS that outputs a tree by
forcing the tree. With an oblivious unfold, there is no tree to force.
Instead we will count the number of generated NodeF constructors:
thinTreeO :: Int -> Int
thinTreeO = (`execState` 0) . unfoldM_BF_O_ (state . f)
  where
    f 0 counter = (LeafF, counter)
    f h counter = (NodeF () 0 (h - 1), counter + 1)   -- increment the counter for every NodeF
We adapt the benchmark from before to measure the complexity of
unfolding thin trees. We have to increase the baseline height from 100 to 500
because this benchmark runs so much faster than the previous ones.
$ cabal exec breadth-first-unfolds -- -p O-thin
All
O-thin
1x: OK
148 μs ± 8.3 μs, 543 KB allocated, 773 B copied, 6.0 MB peak memory
10x: OK
1.45 ms ± 113 μs, 5.4 MB allocated, 82 KB copied, 7.0 MB peak memory, 9.78x
The growth is linear, as desired:
the “10x” bench is 10x slower than the baseline “1x” bench.
Laziness for the win
The oblivious unfold avoided quadratic explosion by simplifying the problem.
Now let’s solve the original problem again,
so we can’t just get rid of fmap and liftA2.
As mentioned previously, the root cause was that (1) liftA2 calls fmap when
one of the constructors is EndS, and (2) fmap traverses the other argument.
The next solution will be to make fmap take constant time,
by storing the “mapped function” in the constructor.
Behold the “L” variant of WeaveS, which is a GADT:
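A guess at its shape: the WeftL constructor pairs an m-computation with the function accumulated by fmap, existentially hiding the intermediate type:

{-# LANGUAGE GADTs #-}

data WeaveL m a where
  EndL  :: a -> WeaveL m a
  WeftL :: m (WeaveL m b) -> (b -> a) -> WeaveL m a

instance Functor (WeaveL m) where
  fmap f (EndL a)    = EndL (f a)
  fmap f (WeftL u g) = WeftL u (f . g)   -- constant time: just compose functions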
The Applicative instance is… a good exercise for the reader.
The details are not immediately important—we only care about improving fmap
for now—we will come back to have a look at the Applicative instance soon.
The runner function meshL is a simple bit of type Tetris.
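Something like:

meshL :: Monad m => WeaveL m a -> m a
meshL (EndL a)    = pure a
meshL (WeftL u g) = fmap g (u >>= meshL)   -- the accumulated function is applied here, via m's fmap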
$ cabal exec breadth-first-unfolds -- -p "L-thin"
All
L-thin
1x: OK
14.1 μs ± 782 ns, 59 KB allocated, 5 B copied, 6.0 MB peak memory
10x: OK
140 μs ± 13 μs, 586 KB allocated, 51 B copied, 6.0 MB peak memory, 9.93x
Lazy in more ways than one
As hinted by the “L” and “S” suffixes,
WeaveL is a “lazy” variant of WeaveS: fmap for WeaveL “postpones”
work by accumulating functions in the WeftL constructor.
That work is “forced” by meshL, which is where the fmap ((<$>)) of the
underlying monad m is called, performing the work accumulated
by possibly many calls to WeaveL’s fmap.
One subtlety is that there are multiple “lazinesses” at play.
The main benefit of using WeaveL is really to delay computation,
that is a kind of laziness, but WeaveL doesn’t need to be
implemented in a lazy language.
We can rewrite all of the code we’ve seen so far in a strict language
with minor changes, and we will still observe the quadratic vs linear behavior
of WeaveS vs WeaveL on thin trees.
The “manufactured laziness” of WeaveL is a concept independent of the
“ambient laziness” in Haskell.
Nevertheless, we can still find an interesting role for that “ambient laziness”
in this story. Indeed, the function weaveL also happens to be lazier than
weaveS in the usual sense.
A concrete test case is worth a thousand words. Consider the following
tree generator which keeps unfolding left subtrees while making
every right subtree undefined:
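A guessed reconstruction of those test values (the () seed type and the exact shape of partialTreeF are assumptions):

partialTreeF :: () -> TreeF () ()
partialTreeF () = NodeF () () undefined   -- defined left seed, undefined right seed

partialTree :: Tree ()
partialTree = unfold partialTreeF ()      -- infinite left spine, undefined right subtrees

partialTreeS :: Tree ()
partialTreeS = runIdentity (unfoldM_BF_S (Identity . partialTreeF) ())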
whnfTreeS :: TestTree
whnfTreeS = expectFail $ testCase "S-whnf" $ do
  case partialTreeS of
    Node _ _ _ -> pure ()        -- Succeed
    Leaf -> error "unreachable"  -- definitely not a Leaf
As it turns out, this test using the “S” variant fails. (That’s
why the test is marked with expectFail.)
Forcing partialTreeS evaluates the undefined in partialTreeF.
Therefore partialTreeS is not equivalent to partialTree.
$ cabal exec breadth-first-unfolds -- -p "L-whnf"
All
L-whnf: OK
This difference can only be seen with “lazy monads”, where (>>=) is
lazy in its first argument.
(If this definition sounds not quite right, that’s probably because of seq.
It makes a precise definition of “lazy monad” more complicated.)
Examples of lazy monads from the transformers library
are Identity, Reader, lazy State, lazy Writer, and Accum.
The secret sauce is the definition of liftA2 for WeaveL:
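It might look as follows (a reconstruction); only the third clause matters for the discussion below:

instance Applicative m => Applicative (WeaveL m) where
  pure = EndL
  liftA2 f (EndL a)    wb          = fmap (f a) wb
  liftA2 f wa          (EndL b)    = fmap (\a -> f a b) wa
  liftA2 f (WeftL u g) (WeftL v h) =
    WeftL (liftA2 (liftA2 (,)) u v) (\ ~(a, b) -> f (g a) (h b))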
In the third clause of liftA2, we put the function f in a lambda with a
lazy pattern (~(a, b)) directly under the topmost constructor WeftL.
Thus, we can access the result of f from the second field of WeftL
without looking at the first field. In liftS2 earlier, f was
passed as an argument to (liftA2 . liftS2), which forces us to run the
computation before we can get hold of the result of f.
Maximizing laziness
The “L” variant of unfoldM is lazier than the “S” variant,
but there is still a gap between partialTreeL and the pure partialTree:
if we force not only the root, but also the left subtree of partialTreeL,
then we run into undefined again.
Although the unfold using WeaveL is lazier than using WeaveS,
it is not yet as lazy as it could be.
The reason is that, strictly speaking, WeaveL’s liftA2 is a strict function.
The expansion function partialTreeF produces a level with an undefined
sub-computation, which crashes the whole level.
Each level in a computation will be either completely defined or undefined.
To recap, we’ve been looking at the following trees:
It is natural to ask: can we define a breadth-first unfold that, when applied
to partialTreeF, will yield the same tree as partialTree?
More generally, the new problem is to define a breadth-first unfoldM
whose specialization with the Identity functor is equivalent to
the pure unfold even on partially-defined values. That is, it satisfies
the following equation:
unfold f = runIdentity . unfoldM (Identity . f)
Laziness without end
The strictness of liftA2 is caused by WeaveL having two constructors.
Let’s get rid of EndL.
Wait a second. I spoke too fast, GHC gives us an error:
error: [GHC-87005]
• An existential or GADT data constructor cannot be used
inside a lazy (~) pattern
• In the pattern: WeftE wa g
In the pattern: ~(WeftE wa g)
In an equation for ‘fmap’: fmap f ~(WeftE wa g) = WeftE wa (f . g)
|
641 | > fmap f ~(WeftE wa g) = WeftE wa (f . g)
| ^^^^^^^^^^
The feature we need is “first-class existentials”,
for which there is an open GHC proposal.
Not letting that stop us, there is a simple version of first-class existentials
available in the package some,
as the module Data.Some.Newtype (internally using unsafeCoerce).
That will be sufficient for our purposes.
All we need is an abstract type Some and a pattern synonym:
-- imported from Data.Some.Newtype
data Some f
pattern Some :: f a -> Some f
And we’re back on track. Here comes the actual “E” (endless) variant:
The endless WeaveE enables an even lazier implementation of unfoldM.
When specialized to the identity monad, it lets us force the resulting
tree in any order. The forceLeftTreeE test passes (unlike forceLeftTreeL).
$ cabal exec breadth-first-unfolds -- -p "E-left"
All
E-left: OK
One can also check that forcing the left spine of partialTreeE
arbitrarily deep throws no errors.
We made it lazy, but at what cost?
First, this “Endless” variant only works for lazy monads.
With a strict monad, the runner meshE will loop forever.
It is possible to run things more incrementally by pattern-matching on
WeaveE, but you’re better off using the oblivious WeaveO anyway.
Second, when you aren’t running into an unproductive loop, the “Endless” variant of
unfoldM has quadratic time complexity Ω(height²). The reason
is essentially the same as the “Strict” variant: liftA2 keeps looping even if
one argument is a pure weave—before, that was to traverse the other
non-pure argument, now, there isn’t even a way to tell when the computation
has ended.
Thus, every leaf may create work proportional to the height of the tree.
Running the same benchmark as before, we measure even more baffling timings:
Using the previous setup comparing a baseline and a 10x run, we see a more than
700x slowdown, so much worse than the 100x predicted by a quadratic model.
Interestingly, the raw output shows that the total cumulative allocations did
grow by a 100x factor.1
But it gets weirder with more data points: it does not follow a clear power law.
If Time(n) grew as n^c for some fixed exponent c, then the ratio
Time(Mn)/Time(n) would be M^c,
a constant that does not depend on n.
In the following benchmark, we keep doubling the height (M = 2) for every
test case, and we measure the time relative to the preceding case each time.
A quadratic model predicts a 4x slowdown at every step. Instead, we
observe wildly varying factors.
Benchmark output (each time factor is relative to the preceding line,
for example, the “4x” benchmark is 9.5x slower than the “2x” benchmark):
$ cabal exec breadth-first-unfolds -- -p "E-thin-more"
All
E-thin-more
1x: OK
222 μs ± 9.3 μs, 1.2 MB allocated, 13 KB copied, 6.0 MB peak memory
2x: OK
2.43 ms ± 85 μs, 4.8 MB allocated, 236 KB copied, 7.0 MB peak memory, 10.94x
4x: OK
23.1 ms ± 1.2 ms, 19 MB allocated, 2.7 MB copied, 10 MB peak memory, 9.53x
8x: OK
126 ms ± 7.8 ms, 76 MB allocated, 18 MB copied, 24 MB peak memory, 5.44x
16x: OK
181 ms ± 7.0 ms, 119 MB allocated, 30 MB copied, 24 MB peak memory, 1.44x
I believe this benchmark is triggering some pathological behavior in the garbage
collector. I modified tasty-bench with an option to measure CPU time without GC
(mutator time). At time of writing, tasty-bench is still waiting for a new release.
We can point Cabal to an unreleased commit of tasty-bench by adding the following
lines to cabal.project.local.
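For example, with a stanza along these lines (the repository location and commit are placeholders to adapt):

source-repository-package
  type: git
  location: https://github.com/Bodigrim/tasty-bench
  tag: <commit-with-the-mutator-time-option>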
For the “2x” benchmarks, we are closer to the expected 4x slowdown, but there is
still a noticeable gap.
I’m going to chalk the rest up to inherent measurement errors (the cost of
tasty-bench’s simplicity) exacerbated by the pathological GC behavior;
a possible explanation is that the pattern of memory usage becomes so bad that
it affects non-GC time.
Benchmark output (excluding GC time, each measurement is relative to the
preceding line):
Microbenchmarks: Queues vs Global Levels vs Weaves
So far we’ve focused on asymptotics (linear vs quadratic). Some readers
will inevitably wonder about real speed.
Among the linear-time algorithms—queues (“Q”), global levels (“G”),
and weaves (lazy “L” or oblivious “O”)—which one is faster?
tl;dr: Queues are (much) faster in these microbenchmarks (up to 25x!),
but keep in mind that these are all quite naive implementations.
There are two categories to measure separately: unfolds which produce trees,
and oblivious unfolds—which don’t produce trees. These microbenchmarks
construct full trees up to a chosen number of nodes. When there is an
output tree, we force it (using nf), otherwise we force a counter of the
number of nodes. We run on different sufficiently large sizes (500 and 5000)
to check the stability of the measured factors, ensuring that we are only
comparing the time components that dominate at scale.
The tables list times relative to the queue benchmark for each tree size.
I hope to have piqued your interest in breadth-first unfolds without
using queues.
To the best of my knowledge, this specific problem hasn’t been studied in the
literature. It is of course related to breadth-first traversals,
previously solved using the Phases applicative.2
The intersection of functional programming and breadth-first walks is a small
niche, which makes it quick to survey that corner of the world for ideas
related to those presented here.
The paper Modular models of monoids with operations by Zhixuan Yang
and Nicolas Wu, in ICFP 2023, mentions a general construction of Phases as an
example application of their theory. Basically, Phases is defined by a
fixed-point equation:
Phases f = Day f Phases :+: Identity
We can express Phases abstractly as a least fixed-point
μx.f▫x + Id in any monoidal category with a suitable structure.
If we instantiate the monoidal product ▫ not with Day convolution,
but with functor composition (Compose), then we get Weave.
In another coincidence, the monad-coroutine package
implements a weave function which is a generalization of
liftS2—this may require some squinting.
While WeaveS as a data type coincides with the free monad Free,
monad-coroutine’s core data type Coroutine coincides
with the free monad transformer FreeT.
We can view Phases as a generalization of “zipping” from
lists to free applicatives—which are essentially lists of actions,
and Weave generalizes that further to free monads. To recap, the surprise was
that the naive data type of free monads results in a quadratic-time unfold.
That issue motivated a “lazy” variant3 which achieves a linear-time
breadth-first unfold. That in turn suggested an even “lazier” variant which
enables more control on evaluation order at the cost of efficiency.
I’ve just released the weave library which implements
the main ideas of this post. I don’t expect it to have many users, given
how much slower it is compared to queue-based solutions.
But I would be curious to find a use case for the new compositionality
afforded by this abstraction.
Recap table
                     Unfolds         Time         Laziness         Compositional
Phases*              No              linear†      by levels        Yes
Queue (Q)            Yes             linear†      strict           No
Global Levels (G)    Yes             linear†      by levels        No
Strict Weave (S)     Yes             quadratic‡   strict           Yes
Oblivious Weave (O)  Oblivious only  linear†      N/A              Yes
Lazy Weave (L)       Yes             linear†      by levels        Yes
Endless Weave (E)    Yes             quadratic‡E  maximally lazy◊  Yes

† Linear wrt. size: Θ(size).
‡ Quadratic wrt. height: lower bound Ω(height²), upper bound O(height × size).
E The “Endless” meshE only terminates with lazy monads.
* I guess there exists an “endless Phases” variant, that would be quadratic and maximally lazy.
◊ The definition of “maximally lazy” in this post actually misses a range of possible lazy behaviors with monads other than Identity. A further refinement seems to be another can of worms.
Note that tasty-bench also reports memory statistics
(allocated, copied, and peak memory) when certain RTS options are enabled,
which I’ve done by compiling the test executable with -with-rtsopts=-T.↩︎
Today, 2025-03-16, at 1930 UTC (12:30 pm PST, 3:30 pm EST, 7:30 pm GMT, 20:30 CET, …)
we are streaming the 41st episode of the Haskell Unfolder live on YouTube.
Generic functions are a powerful tool that allows us to make more type classes derivable. In this episode, we’ll look at a simple example, namely deriving Monoid instances for product types, using both GHC’s built-in generics and the generics-sop library.
About the Haskell Unfolder
The Haskell Unfolder is a YouTube series about all things Haskell hosted by
Edsko de Vries and Andres Löh, with episodes appearing approximately every two
weeks. All episodes are live-streamed, and we try to respond to audience
questions. All episodes are also available as recordings afterwards.
The GHC developers are very pleased to announce the availability of
GHC 9.6.7. Binary distributions, source distributions, and
documentation are available on the
release page.
GHC 9.6.7 brings a number of fixes, including:
GHC’s internal Unique type has been widened to 64-bits on 32-bit
architectures, avoiding potential miscompilations on large projects.
Fix a runtime crash when using the compacting GC, caused by black
holes in large objects.
Added new flags -fspec-eval and -fspec-eval-dictfun to allow
switching off speculative evaluation.
The following libraries have been updated since GHC 9.6.6:
Note about Haskell Language Server and building GHC 9.8+:
The change of Unique to 64 bit (GHC#22010)
adds the exported symbol
ghc_unique_counter64 to the RTS. Unfortunately it’s impossible to
avoid this without breaking other things. If you encounter a linker
error related to ghc_unique_counter64 when building GHC (or building a
GHC-derived package like ghc-lib-parser) with GHC 9.6.7, you probably
have to add this fix
to the program you’re building.
We would like to thank GitHub, IOG, the Zw3rk stake pool, Well-Typed,
Tweag I/O, Serokell, Equinix, SimSpace, the Haskell Foundation, and
other anonymous contributors whose on-going financial and in-kind
support has facilitated GHC maintenance and release management over
the years. Finally, this release would not have been possible without
the hundreds of open-source contributors whose work comprise this
release.
Please give this release a try and open a ticket
if you see anything amiss.
This is the twenty-sixth edition of our GHC activities report, which describes
the work Well-Typed are doing on GHC, Cabal, HLS and other parts of the core Haskell toolchain.
The current edition covers roughly the months of December 2024 to February 2025.
You can find the previous editions collected under the
ghc-activities-report tag.
Sponsorship
We offer Haskell Ecosystem Support Packages to provide
commercial users with support from Well-Typed’s experts, while investing in the
Haskell community and its technical ecosystem. Clients who engage in these packages
both fund the work described in this report
and support the Haskell Foundation.
We are delighted to announce two new Bronze Haskell Ecosystem Supporters: Channable
and QBayLogic.
Many thanks also to our existing clients who also contribute to making this work possible:
Anduril, Juspay and Mercury,
and to the HLS Open Collective for
supporting HLS release management.
In addition, many others within Well-Typed contribute to GHC, Cabal and HLS
occasionally, or contribute to other open source Haskell libraries and tools.
GHC
GHC Releases
We are currently overseeing releases of the GHC 9.10 and 9.12 release series.
Zubin oversaw the preparation and final release of GHC 9.12.1 on 16 December 2024. Unfortunately, it was found in mid-January that this release was affected by a regression affecting sub-word division (#25653). In response to this we scheduled a minimal 9.12.2 release fixing this issue, which was released on 14 March 2025. We anticipate that the next GHC 9.12 release will come in the summer.
Our current release engineering focus is 9.10.2, which is currently being worked on by Andreas and Ben. We expect that the release candidate for this will be out in late March.
In parallel, Zubin has been working towards cutting a corresponding release of HLS and introducing support for GHC 9.12.
Platform support
For many years, GHC’s FreeBSD support has been in a state of limbo: while the compiler has usually been functional on FreeBSD, we have never had proper CI support, meaning that we could neither systematically validate correctness nor produce binary distributions.
Late last year Ben fixed several issues that had been plaguing FreeBSD, allowing it to pass
the testsuite, and worked with a contributor to bring up a continuous integration
runner for this platform (!13619, !13963). In response, we expect that GHC 9.14 will ship binary distributions for FreeBSD as a tier 2 platform.
However, supporting GHC’s compatibility matrix requires a real investment of time and energy. If you rely on FreeBSD or any other BSD, we would appreciate your help in looking after and improving GHC’s support of these platforms.
Frontend
Matthew, Adam, and Rodrigo wrote and proposed the now-accepted Explicit Level Imports proposal. This proposal represents a significant step forward in Haskell's staged metaprogramming story, introducing syntax to distinguish imports needed at runtime from those only needed at compile-time (e.g. for TemplateHaskell splices). This distinction opens the door to compile-time improvements, more robust cross-compilation support, and a more expressive metaprogramming story.
Sam finished up work by GHC contributor Jade, giving GHCi error messages
their own error codes (#23338, !13094).
Sam made several internal improvements to the typechecker, surrounding the
function checkTyEqRhs which is responsible for skolem escape, occurs checks
and representation-polymorphism checking (!13778, !13931).
Sam implemented a simplification of the logic for solving quantified constraints,
both improving solver efficiency and simplifying specialization (!13958).
Sam refactored the GHC “error context” infrastructure, migrating it to a
structured representation like for the error message contents (#23436, !10540).
Sam implemented defaulting of representational equalities, which allows
GHC to accept several uses of coerce that used to be rejected with ambiguous
types (#21003, !13834).
Sam fixed GHC emitting spurious “incomplete record selectors” warnings due to
missing long-distance information in the pattern-match checker (#25749, !13979).
Backend
Sam prevented GHC emitting LLVM code with incompatible vector types (e.g. the same
variable being declared as 4xi32 and used as 8xi16) (!13936).
Sam investigated CI failures with the LLVM backend, identifying several
critical bugs such as #25771 and #25773.
Ben improved the naming of various compiler-generated binders, which will make it easier
to make sense of -ddump-simpl output and runtime stacks (!13849, !13875).
Rodrigo ensured that certain join-points inline,
dramatically improving runtime allocations of certain programs (#25723, !13909).
In response to a serious correctness regression in 9.12.1,
Ben improved the testing story for primops by adding Cmm surface syntax for the
previously-untested Mul2 operations and expanding the scope of the
test-primops testsuite (!13843, test-primops!27).
Compiler performance
Matthew improved the performance of type family consistency checking by
ensuring checks are run in topological order, significantly reducing redundant work
(#25554, !13685).
Rodrigo made a variety of improvements in GHC’s memory consumption, including
refactoring the ModuleGraph interface (!13658) and reducing the memory
usage of module transitive closure calculations done when encountering
Template Haskell splices (#25634, !13753).
Rodrigo refactored the HomePackageTable and HomeUnitGraph to avoid
significant space leaks (#25511, !13675).
Rodrigo improved the performance of compiling deriving Show and deriving Data
(!13739).
Runtime system
Ben fixed a few bugs in the linker’s object unloading implementation resulting in
runtime crashes (#24935, !13704; #25039, !13714).
Ben removed some dead code in the IO manager which was causing some CI jobs
to fail (!13678).
Ben lifted the runtime system’s limit of 256 capabilities, ensuring that
the runtime system can scale to large multicore systems (#25560, !13692).
Ben fixed a bug in mmapInRegion which would cause it to loop indefinitely
in certain circumstances on FreeBSD (#25492, !13618).
GHCi & bytecode interpreter
Matthew improved the error reporting of out-of-scope qualified names in
GHCi (!13751).
Matthew fixed segfaults in the bytecode interpreter that were caused by the
FastString table being loaded unoptimised (!13877).
Matthew dramatically improved the performance of the bytecode interpreter
by avoiding generating no-op SLIDE x 0 instructions (!13868),
by using a strict genericLength function (!13885),
by avoiding intermediate lists in nameToCLabel (!13898),
and by using Name rather than FastString to key the symbol cache (!13914).
Matthew fixed the INTERP_STATS macro that is used for performance statistics
of GHCi (!13879), and then proceeded to fix accounting errors in these
statistics (!13956).
Ben and Matthew improved the printing of BCOs to assist debugging the bytecode
interpreter (!13570, !13878, !13955).
Ben fixed an incorrect assumption regarding which unlifted types can appear at
the top-level (#25641, !13796).
Libraries
Ben re-introduced missing {Enum/Show} IOSubSystem instances that were
accidentally removed in !9676 (#25549, !13683).
Ben introduced Data.Enum and Data.Bounded as agreed in CLC Proposal #208, correcting
an accidental inclusion of Data.Enum in the ghc-internal refactor
(#25320, !11347, !13790).
Ben implemented CLC proposal #305, ensuring that threads created by GHC’s base library can be easily identified by their thread label (#25452, !13566).
Rodrigo improved the implementation of SomeException for SomeAsyncException,
implementing CLC Proposal #309 (!13725).
Build system
Ben fixed #25501, ensuring that the ld-override logic is consistent between the configure
script and ghc-toolchain (!13617).
Ben mitigated a race condition with mktexfmt in Hadrian (#25564, !13703).
Ben allowed i686 to be parsed as part of triples in the configure script
(#25691, !13874).
Cabal
Matthew fixed two Cabal 3.14 regressions in which the current working directory
was not correctly taken into account, when creating the build folder
(#10772, #10800) and when running test executables (#10704, #10725).
Matthew fixed a Cabal 3.14 regression in which Cabal would erroneously pick
versions of build tools (such as alex or happy) from the system environment
rather than the versions specified in build-tool-depends (#10692, #10731).
ghc-debug
Zubin implemented support in the ghc-debug backend and Brick front-end for
streaming heap traversals, enabling constant-space analysis of large heaps
(!66).
A new release of Liquid Haskell is out after quite an active period of
development with 99 pull requests in the liquidhaskell repository, and
29 pull requests in the liquid-fixpoint repository from about ten contributors.
This post is to provide an overview of the changes that made it into the latest release.
There were contributions to the reflection and proof mechanisms; we got
contributions to the integration with GHC; the support of cvc5 was improved
when dealing with sets, bags, and maps; and there was a rather large overhaul
of the name resolution mechanism.
Reflection improvements
Liquid Haskell is a tool to verify Haskell programs. We can write formal
specifications inside special Haskell comments {-@ ... @-}, and the tool
will check whether the program behaves as specified. For instance, the following
specification of the filter function says that we expect all of the elements
in the result to satisfy the given predicate.
{-@ filter :: p:(a -> Bool) -> xs:[a] -> {v:[a] | all p v } @-}
Liquid Haskell would then analyze the implementation of filter to verify that
it does indeed yield elements that satisfy the predicate.
To verify such a specification, Liquid Haskell needs to attach a meaning to the
names in the predicate all p v. It readily learns that p is a parameter
of filter, and that v is the result. all, however, isn’t bound by the specification’s parameters, so it refers to whatever is in scope, which is the
Haskell function from the Prelude.
all :: (a -> Bool) -> [a] -> Bool
And Liquid Haskell has a mechanism, known as reflection, to give logical meaning to the implementation
of a function like all. While it has always been convenient to reflect functions in modules analyzed by Liquid Haskell, it was not so easy when there was a
mix of local and imported definitions from dependencies that are not analysed with
Liquid Haskell. Last year, there was an internship at Tweag to address exactly this
friction, which resulted in contributions to the
latest release.
Reasoning and reflection of lambdas
The reflection mechanism also has other specific limitations at the moment. For instance,
it doesn’t allow reflecting recursive functions defined in let or where bindings. And
until recently, it didn’t allow reflecting functions that contained anonymous functions.
For example,
takePositives = filter (\x -> x > 0)
In the latest release, we have several contributions that introduce support for reflecting lambdas and improve the story for reasoning with them.
This feature is considered experimental at the moment, since there are still usability and
performance concerns that deserve further contributions, but one can
already explore the experience that we could expect in the long run.
Integration with GHC
In 2020 Liquid Haskell became a compiler plugin for GHC. It was hooked into the
end of the type checking phase firstly to ensure it only runs on well-typed programs,
and secondly, to ensure the plugin runs when GHC is only asked to typecheck the
module but not to generate code, which was helpful to IDEs.
For a few technical reasons, the plugin was re-parsing and re-typechecking the module
instead of using the abstract syntax tree (AST) that GHC handed to it as the result of
type checking. That is no longer the case in the latest release, where the AST after
type checking is now used for all purposes. In addition, there were several improvements
to how the ghc library is used.
cvc5 support
Liquid Haskell offloads part of its reasoning to a family of automated theorem
provers known as SMT solvers. For most developments, Liquid Haskell has been
used with the Z3 SMT solver, and this is what has been used most of the time in
continuous integration pipelines.
In theory, any SMT solver can be used with Liquid Haskell, if it provides a standard
interface known as SMT-LIB. In practice, however, experiments are done with
theories that are not part of the standard. For instance, the reasoning capabilities
for bags, sets, and maps used to require z3. But now the latest release implements
support for cvc5 as well.
Name resolution overhaul
Name resolution determines, for each name in a program, what is the definition that
it refers to. Liquid Haskell, in particular, is responsible for resolving names
that appear in specifications. This task was problematic when the programs
it was asked to verify spanned many modules.
There were multiple kinds of names, each with their own name resolution rules,
and names were resolved in different environments when verifying a module and
when importing it elsewhere, not always yielding the same results, which often
produced confusing errors.
Name resolution, however, was done all over the code base, and any attempt to
rationalize it would require a few months of effort. I started such an epic last
September, and managed to conclude it in February.
These changes made it into the latest release together with an awful lot of
side quests to simplify the existing code.
The road ahead
There is no coordinated roadmap for Liquid Haskell. Much of the contributions
that it receives depend on the opportunity enabled by academic research or
the needs of particular use cases.
On my side, I’m trying to improve the adoption of Liquid Haskell. Much of the challenge
is reducing the amount of common workarounds that the proficient Liquid Haskeller
needs to employ today. For instance, supporting reflection of functions in local bindings
would save the user the trouble of rewriting her programs to put the recursive functions
in the top level.
Repairing the support for type classes would allow functions to be verified
even if they use type classes, which is a large subset of Haskell today.
And without having defined a scope with precision yet, Liquid Haskell still needs to
improve its user documentation, its error messages, and its tracing and logging.
The project is chugging along, though. It is making significant leaps in usability. The
upgrade costs have been quantified for a few GHC releases, and
no longer look like an unbounded risk. The number of external contributions
increased last year, although it remains to be seen whether that is a trend. And there is
no shortage of interest from academia and industrial interns.
Thanks to the many contributors for their work and their help during code
reviews. I look forward to learning what makes it into the coming Liquid Haskell releases!
On this episode of the Haskell Interlude, Andres Löh and Mike Sperber are joined by Farhad Mehta, a professor at OST Rapperswil, and one of the organizers of ZuriHac. Farhad tells us about formal methods, building tunnels, the importance of education, and the complicated relationship between academia and industry.
At work I sometimes need to deal with large and deep JSON objects where I'm only
interested in a few of the values. If all the interesting values are on the top
level, then aeson have functions that make it easy to implement FromJSON's
parseJSON (Constructors and accessors), but if the values are spread out then
the functions in aeson come up a bit short. That's when I reach for lens-aeson,
as lenses make it very easy to work with large structures. However, I've found
that using its lenses to implement parseJSON become a lot easier with a few
helper functions.
Many of the lenses produce results wrapped in Maybe, so the first function is
one that transforms a Maybe a to a Parser a. Here I make use of Parser
implementing MonadFail.
infixl 8 <!>
(<!>) :: (MonadFail m) => Maybe a -> String -> m a
(<!>) mv err = maybe (fail err) pure mv
In some code I wrote this week I used it to extract the user name out of a JWT
produced by Keycloak:
instance FromJSON OurClaimsSet where
  parseJSON = ... $ \o -> do
    cs <- parseJSON o
    n <- o ^? key "preferred_username" . _String <!> "preferred username missing"
    ...
    pure $ OurClaimsSet cs n ...
Also, all the lenses start with a Value and that makes the withX functions
in aeson to not be a perfect fit. So I define variations of the withX
functions, e.g.
withObjectV :: String -> (Value -> Parser a) -> Value -> Parser a
withObjectV s f = withObject s (f . Object)
That makes the full FromJSON instance for OurClaimsSet look like this:
instance FromJSON OurClaimsSet where
  parseJSON = withObjectV "OurClaimsSet" $ \o -> do
    cs <- parseJSON o
    n <- o ^? key "preferred_username" . _String <!> "name"
    let rs = o ^.. key "resource_access" . members . key "roles" . _Array . traverse . _String
    pure $ OurClaimsSet cs n rs
The GHC developers are happy to announce the release of GHC 9.12.2.
Binary distributions, source distributions, and documentation are available at
downloads.haskell.org.
We hope to have this release available via ghcup shortly. This is a small
release fixing a critical code generation bug, #25653, affecting some subword
division operations.
As always, GHC’s release status, including planned future releases, can
be found on the GHC Wiki status.
We would like to thank IOG, the Zw3rk stake pool,
Well-Typed, Tweag I/O, Serokell, Equinix, SimSpace, the Haskell
Foundation, and other anonymous contributors whose on-going financial
and in-kind support has facilitated GHC maintenance and release
management over the years. Finally, this release would not have been
possible without the hundreds of open-source contributors who
contribute their code, tickets, and energy to the GHC project.
As always, do give this release a try and open a ticket if you see
anything amiss.
I’ve created an open mirror
contest which will run in
parallel to the official contest, so if you want to grab some friends
and try solving some of the problems together using your favorite
language, be my guest!
Today, 2025-03-12, at 1930 UTC (12:30 pm PST, 3:30 pm EST, 7:30 pm GMT, 20:30 CET, …)
we are streaming the 40th episode of the Haskell Unfolder live on YouTube.
QuickCheck is useful for more than just testing. Comparing the behaviour of a system to a model can be used to check if a system under construction is working correctly, but it can also be used to better understand an already existing system. In this episode we show that this does not need to be very difficult, by designing a model that we can use to understand tensor convolutions in an existing large library.
About the Haskell Unfolder
The Haskell Unfolder is a YouTube series about all things Haskell hosted by
Edsko de Vries and Andres Löh, with episodes appearing approximately every two
weeks. All episodes are live-streamed, and we try to respond to audience
questions. All episodes are also available as recordings afterwards.
A few months ago I explained that one reason why this blog has become more quiet is that all my work on Lean is covered elsewhere.
This post is an exception, because it is an observation that is (arguably) interesting, but does not lead anywhere, so where else to put it than my own blog…
When defining a function recursively in Lean that has nested recursion, e.g. a recursive call that is in the argument to a higher-order function like List.map, extra attention used to be necessary so that Lean can see that xs.map applies its argument only to elements of the list xs. The usual idiom is to write xs.attach.map instead, where List.attach attaches to the list elements a proof that they are in that list. You can read more about this in my Lean blog post on recursive definitions and in our new shiny reference manual; look for Example “Nested Recursion in Higher-order Functions”.
To make this step less tedious I taught Lean to automatically rewrite xs.map to xs.attach.map (where suitable) within the construction of well-founded recursion, so that nested recursion just works (issue #5471). We already do such a rewriting to change if c then … else … to the dependent if h : c then … else …, but the attach-introduction is much more ambitious (the rewrites are not definitionally equal, there are higher-order arguments etc.) Rewriting the terms in a way that we can still prove the connection later when creating the equational lemmas is hairy at best. Also, we want the whole machinery to be extensible by the user, setting up their own higher order functions to add more facts to the context of the termination proof.
I implemented it like this (PR #6744) and it ships with 4.18.0, but in the course of this work I thought about a quite different and maybe better™ way to do this, and well-founded recursion in general:
To define a function by well-founded recursion with the fixed-point combinator
WellFounded.fix : (hwf : WellFounded r) (F : (x : α) → ((y : α) → r y x → C y) → C x) (x : α) : C x
we have to rewrite the functorial of the recursive function, which naturally has type
F : ((y : α) → C y) → ((x : α) → C x)
to the one above, where all recursive calls take the termination proof r y x. This is a fairly hairy operation, mangling the type of matcher’s motives and whatnot.
If we instead use a fixed-point combinator that takes the functorial unchanged together with a monotonicity proof (as partial_fixpoint does), the functorial’s type is unmodified (here β will be ((x : α) → C x)), and everything else is in the propositional side-condition monotone F. For this predicate we have a syntax-guided compositional tactic, and it’s easily extensible, e.g. by
theorem monotone_mapM (f : γ → α → m β) (xs : List α) (hmono : monotone f) :
monotone (fun x => xs.mapM (f x))
Once given, we don’t care about the content of that proof. In particular proving the unfolding theorem only deals with the unmodified F that closely matches the function definition as written by the user. Much simpler!
Isabelle has it easier
Isabelle also supports well-founded recursion, and has great support for nested recursion. And it’s much simpler!
There, all you have to do to make nested recursion work is to define a congruence lemma of the form, for List.map something like our List.map_congr_left
List.map_congr_left : (h : ∀ a ∈ l, f a = g a) :
List.map f l = List.map g l
This is because in Isabelle, too, the termination proofs is a side-condition that essentially states “the functorial F calls its argument f only on smaller arguments”.
Can we have it easy, too?
I had wished we could do the same in Lean for a while, but that form of congruence lemma just isn’t strong enough for us.
But maybe there is a way to do it, using an existential to give a witness that F can alternatively implemented using the more restrictive argument. The following callsOn P F predicate can express that F calls its higher-order argument only on arguments that satisfy the predicate P:
section setup
variable {α : Sort u}
variable {β : α → Sort v}
variable {γ : Sort w}
def callsOn (P : α → Prop) (F : (∀ y, β y) → γ) :=
∃ (F': (∀ y, P y → β y) → γ), ∀ f, F' (fun y _ => f y) = F f
variable (R : α → α → Prop)
variable (F : (∀ y, β y) → (∀ x, β x))
local infix:50 " ≺ " => R
def recursesVia : Prop := ∀ x, callsOn (· ≺ x) (fun f => F f x)
noncomputable def fix (wf : WellFounded R) (h : recursesVia R F) : (∀ x, β x) :=
wf.fix (fun x => (h x).choose)
def fix_eq (wf : WellFounded R) h x :
fix R F wf h x = F (fix R F wf h) x := by
unfold fix
rw [wf.fix_eq]
apply (h x).choose_spec
This allows nice compositional lemmas to discharge callsOn predicates:
theorem callsOn_base (y : α) (hy : P y) :
callsOn P (fun (f : ∀ x, β x) => f y) := by
exists fun f => f y hy
intros; rfl
@[simp]
theorem callsOn_const (x : γ) :
callsOn P (fun (_ : ∀ x, β x) => x) :=
⟨fun _ => x, fun _ => rfl⟩
theorem callsOn_app
{γ₁ : Sort uu} {γ₂ : Sort ww}
(F₁ : (∀ y, β y) → γ₂ → γ₁) -- can this also support dependent types?
(F₂ : (∀ y, β y) → γ₂)
(h₁ : callsOn P F₁)
(h₂ : callsOn P F₂) :
callsOn P (fun f => F₁ f (F₂ f)) := by
obtain ⟨F₁', h₁⟩ := h₁
obtain ⟨F₂', h₂⟩ := h₂
exists (fun f => F₁' f (F₂' f))
intros; simp_all
theorem callsOn_lam
{γ₁ : Sort uu}
(F : γ₁ → (∀ y, β y) → γ) -- can this also support dependent types?
(h : ∀ x, callsOn P (F x)) :
callsOn P (fun f x => F x f) := by
exists (fun f x => (h x).choose f)
intro f
ext x
apply (h x).choose_spec
theorem callsOn_app2
{γ₁ : Sort uu} {γ₂ : Sort ww}
(g : γ₁ → γ₂ → γ)
(F₁ : (∀ y, β y) → γ₁) -- can this also support dependent types?
(F₂ : (∀ y, β y) → γ₂)
(h₁ : callsOn P F₁)
(h₂ : callsOn P F₂) :
callsOn P (fun f => g (F₁ f) (F₂ f)) := by
apply_rules [callsOn_app, callsOn_const]
With this setup, we can have the following, possibly user-defined, lemma expressing that List.map calls its arguments only on elements of the list:
theorem callsOn_map (δ : Type uu) (γ : Type ww)
(P : α → Prop) (F : (∀ y, β y) → δ → γ) (xs : List δ)
(h : ∀ x, x ∈ xs → callsOn P (fun f => F f x)) :
callsOn P (fun f => xs.map (fun x => F f x)) := by
suffices callsOn P (fun f => xs.attach.map (fun ⟨x, h⟩ => F f x)) by
simpa
apply callsOn_app
· apply callsOn_app
· apply callsOn_const
· apply callsOn_lam
intro ⟨x', hx'⟩
dsimp
exact (h x' hx')
· apply callsOn_const
end setup
So here is the (manual) construction of a nested map for trees:
section examples
structure Tree (α : Type u) where
val : α
cs : List (Tree α)
-- essentially
-- def Tree.map (f : α → β) : Tree α → Tree β :=
-- fun t => ⟨f t.val, t.cs.map Tree.map⟩)
noncomputable def Tree.map (f : α → β) : Tree α → Tree β :=
fix (sizeOf · < sizeOf ·) (fun map t => ⟨f t.val, t.cs.map map⟩)
(InvImage.wf (sizeOf ·) WellFoundedRelation.wf) <| by
intro ⟨v, cs⟩
dsimp only
apply callsOn_app2
· apply callsOn_const
· apply callsOn_map
intro t' ht'
apply callsOn_base
-- ht' : t' ∈ cs -- !
-- ⊢ sizeOf t' < sizeOf { val := v, cs := cs }
decreasing_trivial
end examples
This makes me happy!
All details of the construction are now contained in a proof that can proceed by a syntax-driven tactic and that’s easily and (likely robustly) extensible by the user. It also means that we can share a lot of code paths (e.g. everything related to equational theorems) between well-founded recursion and partial_fixpoint.
I wonder if this construction is really as powerful as our current one, or if there are certain (likely dependently typed) functions where this doesn’t fit, but the β above is dependent, so it looks good.
With this construction, functions defined by well-founded recursion will reduce even worse in the kernel, I assume. This may be a good thing.
The cake is a lie
What unfortunately kills this idea, though, is the generation of the functional induction principles, which I believe is not (easily) possible with this construction: The functional induction principle is proved by massaging F to return a proof, but since the extra assumptions (e.g. for ite or List.map) only exist in the termination proof, they are not available in F.
Oh wey, how anticlimactic.
PS: Path dependencies
Curiously, if we didn’t have functional induction at this point yet, then very likely I’d change Lean to use this construction, and then we’d either not get functional induction, or it would be implemented very differently, maybe a more syntactic approach that would re-prove termination. I guess that’s called path dependence.
There’s yet again been a bit of functional programming-adjacent twitter drama
recently, but it’s actually sort of touched on some subtleties about sum types
that I am asked about (and think about) a lot nowadays. So, I’d like to take
this opportunity to talk a bit about the “why” and nature of sum types and how
to use them effectively, and how they contrast with other related concepts in
programming and software development, and even the cases where sum types aren’t
the best option.
Sum Types at their Best
The quintessential sum type that you just can’t live without is
Maybe, now adopted in a lot of languages as
Optional:
data Maybe a = Nothing | Just a
If you have a value of type Maybe Int, it means that its valid
values are Nothing, Just 0, Just 1,
etc.
This is also a good illustration of why we call it a “sum” type: if
a has n possible values, then Maybe a has
1 + n: we add the single new value Nothing to it.
The “benefit” of the sum type is illustrated pretty clearly here too: every
time you use a value of type Maybe Int, you are forced to
consider the fact that it could be Nothing:
showMaybeInt :: Maybe Int -> String
showMaybeInt = \case
  Nothing -> "There's nothing here"
  Just i -> "Something is here: " <> show i
That’s because sum types are usually implemented in a
way that forces you to handle each case exhaustively. Otherwise, sum types are
much less useful.
At the most fundamental level, this behaves like a compiler-enforced null
check, but built within the language in user-space instead of being compiler magic,
ad-hoc syntax1, or static analysis — and the fact that it
can live in user-space is why it’s been adopted so widely. At a higher level,
functional abstractions like Functor, Applicative, Monad, Foldable, Traversable
allow you to use a Maybe a like just a normal a with
the appropriate semantics, but that’s a
topic for another time (like 2014).
This power is very special to me on a personal level. I remember many years
ago on my first major haskell project changing a type from String
to Maybe String, and then GHC telling me every place in the
codebase where something needed to change in order for things to work still.
Coming from dynamically typed languages in the past, this sublime experience
truly altered my brain chemistry and Haskell-pilled me for the rest of my life.
I still remember the exact moment, what coffee shop I was at, what my order was,
the weather that day … it was truly the first day of the rest of my life.
It should be noted that I don’t consider sum types a “language feature” or a
compiler feature as much as I’d consider them a design pattern. Languages that
don’t have sum types built-in can usually implement them using typed unions and
an abstract visitor pattern interface (more on that later). Of course, having a
way to “check” your code before running it (like with a type system or
statically verified type annotations) does make a lot of the features much more
useful.
Anyway, this basic pattern can be extended to include more error information
in your Nothing branch, which is how you get the
Either e a type in the Haskell standard library, or the
Result<T,E> type in rust.
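For instance, a small invented illustration (safeDiv is not from any library; it just shows the failure branch carrying an error message instead of a bare Nothing):

safeDiv :: Int -> Int -> Either String Int
safeDiv _ 0 = Left "division by zero"
safeDiv x y = Right (x `div` y)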
Along different lines, we have the common use case of defining syntax
trees:
data Expr
  = Lit Int
  | Negate Expr
  | Add Expr Expr
  | Sub Expr Expr
  | Mul Expr Expr

eval :: Expr -> Int
eval = \case
  Lit i    -> i
  Negate x -> -(eval x)
  Add x y  -> eval x + eval y
  Sub x y  -> eval x - eval y
  Mul x y  -> eval x * eval y

pretty :: Expr -> String
pretty = go 0
  where
    wrap :: Int -> Int -> String -> String
    wrap prio opPrec s
      | prio > opPrec = "(" <> s <> ")"
      | otherwise     = s
    go prio = \case
      Lit i    -> show i
      Negate x -> wrap prio 2 $ "-" <> go 2 x
      Add x y  -> wrap prio 0 $ go 0 x <> " + " <> go 1 y
      Sub x y  -> wrap prio 0 $ go 0 x <> " - " <> go 1 y
      Mul x y  -> wrap prio 1 $ go 1 x <> " * " <> go 2 y

main :: IO ()
main = do
  putStrLn $ pretty myExpr
  print $ eval myExpr
  where
    myExpr = Mul (Negate (Add (Lit 4) (Lit 5))) (Lit 8)
-(4 + 5) * 8
-72
Now, if we add a new constructor to the sum type, the compiler forces us to
handle it.
data Expr
  = Lit Int
  | Negate Expr
  | Add Expr Expr
  | Sub Expr Expr
  | Mul Expr Expr
  | Abs Expr

eval :: Expr -> Int
eval = \case
  Lit i    -> i
  Negate x -> -(eval x)
  Add x y  -> eval x + eval y
  Sub x y  -> eval x - eval y
  Mul x y  -> eval x * eval y
  Abs x    -> abs (eval x)

pretty :: Expr -> String
pretty = go 0
  where
    wrap :: Int -> Int -> String -> String
    wrap prio opPrec s
      | prio > opPrec = "(" <> s <> ")"
      | otherwise     = s
    go prio = \case
      Lit i    -> show i
      Negate x -> wrap prio 2 $ "-" <> go 2 x
      Add x y  -> wrap prio 0 $ go 0 x <> " + " <> go 1 y
      Sub x y  -> wrap prio 0 $ go 0 x <> " - " <> go 1 y
      Mul x y  -> wrap prio 1 $ go 1 x <> " * " <> go 2 y
      Abs x    -> wrap prio 2 $ "|" <> go 0 x <> "|"
Another example where things shine is as clearly-defined APIs between
processes. For example, we can imagine a “command” type that sends different
types of commands with different payloads. This can be interpreted as perhaps
the result of parsing command line arguments or the message in some
communication protocol.
For example, you could have a protocol that launches and controls
processes:
data Command a
  = Launch String (Int -> a)  -- ^ takes a name, returns a process ID
  | Stop Int (Bool -> a)      -- ^ takes a process ID, returns success/failure

launch :: String -> Command Int
launch nm = Launch nm id

stop :: Int -> Command Bool
stop pid = Stop pid id
This ADT is written in the “interpreter” pattern (used often with things like
free monads), where any arguments not involving a are the command
payload, and any X -> a represents that the command can respond
with an X.
Let’s write a sample interpreter backing the state in an IntMap in an
IORef:
import Data.IORef
import qualified Data.IntMap as IM
import Data.IntMap (IntMap)

runCommand :: IORef (IntMap String) -> Command a -> IO a
runCommand ref = \case
  Launch newName next -> do
    currMap <- readIORef ref
    let newId = case IM.lookupMax currMap of
          Nothing     -> 0
          Just (i, _) -> i + 1
    modifyIORef ref $ IM.insert newId newName
    pure (next newId)
  Stop procId next -> do
    existed <- IM.member procId <$> readIORef ref
    modifyIORef ref $ IM.delete procId
    pure (next existed)

main :: IO ()
main = do
  ref <- newIORef IM.empty
  aliceId <- runCommand ref $ launch "alice"
  putStrLn $ "Launched alice with ID " <> show aliceId
  bobId <- runCommand ref $ launch "bob"
  putStrLn $ "Launched bob with ID " <> show bobId
  success <- runCommand ref $ stop aliceId
  putStrLn $
    if success
      then "alice successfully stopped"
      else "alice unsuccessfully stopped"
  print =<< readIORef ref
Launched alice with ID 0
Launched bob with ID 1
alice successfully stopped
fromList [(1, "bob")]
Let’s add a command to “query” a process id for its current status:
data Command a
  = Launch String (Int -> a)  -- ^ takes a name, returns a process ID
  | Stop Int (Bool -> a)      -- ^ takes a process ID, returns success/failure
  | Query Int (String -> a)   -- ^ takes a process ID, returns a status message

query :: Int -> Command String
query pid = Query pid id

runCommand :: IORef (IntMap String) -> Command a -> IO a
runCommand ref = \case
  -- ...
  Query procId next -> do
    procName <- IM.lookup procId <$> readIORef ref
    pure $ next $ case procName of
      Nothing -> "This process doesn't exist, silly."
      Just n  -> "Process " <> n <> " chugging along..."
Relationship with Unions
To clarify a common confusion: sum types can be described as “tagged unions”:
you have a tag to indicate which branch you are on (which can be case-matched
on), and then the rest of your data is conditionally present.
In many languages this can be implemented under the hood as a struct with a
tag and a union of data, along with some abstract visitor pattern
interface to ensure exhaustiveness.
Remember, it’s not exactly a union. Consider, for example, a type
like:
data Entity = User Int | Post Int
An Entity here could represent a user at a user id, or a post at
a post id. If we considered it purely as a union of Int and
Int:
union Entity {
    int user_id;
    int post_id;
};
we’d lose the ability to branch on whether we have a user or a post.
If we pair the union with a tag field (making it a tagged union), we recover the
original sum type semantics.
Of course, you still need an abstract interface like the visitor pattern to
actually be able to use this as a sum type with guarantees that you handle every
branch, but that’s a story for another day. Alternatively, if your language
supports dynamic dispatch nicely, that’s another underlying implementation that
would work to back a higher-level visitor pattern interface.
Subtypes Solve a Different Problem
Now, sum types aren’t exactly a part of common programming education
curriculum, but subtypes and supertypes definitely were
drilled into every CS student’s brain and waking nightmares from their first
year.
Informally (a la Liskov), B is a subtype of A (and
A is a supertype of B) if anywhere that expects an
A, you could also provide a B.
In normal object-oriented programming, this often shows up in early lessons
as Cat and Dog being subclasses of an
Animal class, or Square and Circle being
subclasses of a Shape class.
When people first learn about sum types, there is a tendency to understand
them as similar to subtyping. This is unfortunately understandable, since a lot
of introductions to sum types often start with something like
-- | Bad Sum Type Example!
data Shape = Circle Double | Rectangle Double Double
While there are situations where this might be a good sum type (ie, for an
API specification or a state machine), on face-value this is a bad example on
the sum types vs. subtyping distinction.
You might notice the essential “tension” of the sum type: you declare all of
your options up-front, but the functions that consume your value are open and
declared ad-hoc. And, if you add new options, all of the consuming functions
must be adjusted.
So, subtypes (and supertypes) are more effective when they lean into
the opposite end: the universe of possible options are open and declared ad-hoc,
but the consuming functions are closed. And, if you add new functions,
all of the members must be adjusted.
In typed languages with a concept of “objects” and “classes”, subtyping is
often implemented using inheritance and interfaces.
So, a function like processWidget(Widget widget) that expects a
Widget would be able to be passed a Button or
InputField or Box. And, if you had a container like
List<Widget>, you could assemble a structure using
Button, InputField, and Box. A perfect
Liskov storm.
In typical library design, you’re able to add new implementations of
Widget as an open universe easily: anyone that imports
Widget can, and they can now use it with functions taking
Widgets. But, if you ever wanted to add new functionality
to the Widget interface, that would be a breaking change to all
downstream implementations.
However, this implementation of subtyping, while prevalent, is the most
mind-numbingly boring realization of the concept, and it pained my soul to even
spend time talking about it. So let’s jump into the more interesting way that
subtype and supertype relationships manifest in the only language where anything
is interesting: Haskell.
Subtyping via Parametric Polymorphism
In Haskell, subtyping is implemented in terms of parametric polymorphism and
sometimes typeclasses. This allows us to work nicely with the concept of
functions and APIs as subtypes and supertypes of each other.
For example, let’s look at a function that takes indexers and applies
them:
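A minimal sketch of such a function (the name sumAtLocs comes from the prose below, but the exact body here is an assumption, written only so the discussion has something concrete to refer to):

sumAtLocs :: ([Double] -> Int -> Double) -> Double
sumAtLocs ix = ix xs 1 + ix xs 2 + ix xs 3
  where
    xs = [10, 20, 30, 40, 50]

-- for example, passing the list indexing operator:
-- >>> sumAtLocs (!!)
-- 90.0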
So, what functions could you pass to sumAtLocs? Can you
only pass [Double] -> Int -> Double?
Well, not quite. Look at the above where we passed (!!), which
has type forall a. [a] -> Int -> a!
In fact, what other types could we pass? Here are some examples:
fun1 :: [a] -> Int -> a
fun1 = (!!)

fun2 :: [a] -> Int -> a
fun2 xs i = reverse xs !! i

fun3 :: Floating a => [a] -> Int -> a
fun3 xs i = if length xs > i then xs !! i else pi

fun4 :: Num a => [a] -> Int -> a
fun4 xs i = sum (take i xs)

fun5 :: (Integral b, Num c) => a -> b -> c
fun5 xs i = fromIntegral i

fun6 :: (Foldable t, Fractional a, Integral b) => t a -> b -> a
fun6 xs i = sum xs / fromIntegral i

fun7 :: (Foldable t, Integral b, Floating a) => t a -> b -> a
fun7 xs i = logBase (fromIntegral i) (sum xs)
What’s going on here? Well, the function expects a
[Double] -> Int -> Double, but there are a lot of other types
that could be passed instead.
At first this might seem like meaningless semantics or trickery, but it’s
deeper than that: remember that each of the above types actually has a very
different meaning and different possible behaviors!
forall a. [a] -> Int -> a means that the a must come from the given list. In fact, any function with that type is
guaranteed to be partial: if you pass it an empty list, there is no
a available to use.
forall a. Num a => [a] -> Int -> a means that the
result might actually come from outside of the list: the implementation could
always return 0 or 1, even if the list is empty. It
also guarantees that it will only add, subtract, multiply, or abs: it will never
divide.
forall a. Fractional a => [a] -> Int -> a means that
we could possibly do division on the result, but we can’t do anything “floating”
like square rooting or logarithms.
forall a. Floating a => [a] -> Int -> a means that we
can possibly start square rooting or taking the logarithms of our input
numbers
[Double] -> Int -> Double gives us the least guarantees
about the behavior: the result could come from thin air (and not be a part of
the list), and we can even inspect the machine representation of our
inputs.
So, we have all of these types with completely different semantics and
meanings. And yet, they can all be passed to something expecting a
[Double] -> Int -> Double. That means that they are all
subtypes of [Double] -> Int -> Double!
[Double] -> Int -> Double is a supertype that houses
multitudes of possible values, uniting all of the possible values and semantics
into one big supertype.
Through the power of parametric polymorphism and typeclasses, you can
actually create an extensible hierarchy of supertypes, not just of
subtypes.
Consider a common API for json serialization. You could have multiple
functions that serialize into JSON:
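A minimal, self-contained sketch of what this might look like (Foo, Bar, and the Value type here are simplified stand-ins, not the real aeson API):

data Value = VNull | VNumber Double | VText String

data Foo = Foo Double
data Bar = Bar String

fooToJSON :: Foo -> Value
fooToJSON (Foo x) = VNumber x

barToJSON :: Bar -> Value
barToJSON (Bar s) = VText s

-- ...or, instead, a single overloaded serializer:
class ToJSON a where
  toJSON :: a -> Value

instance ToJSON Foo where
  toJSON = fooToJSON

instance ToJSON Bar where
  toJSON = barToJSON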
The type of toJSON :: forall a. ToJSON a => a -> Value is a
subtype of Foo -> Value, Bar -> Value, and
Baz -> Value, because everywhere you would want a
Foo -> Value, you could give toJSON instead. Every
time you want to serialize a Foo, you could use
toJSON.
This usage works well, as it gives you an extensible abstraction to design
code around. When you write code polymorphic over Monoid a, it
forces you to reason about your values with respect to only the aspects relating
to monoidness. If you write code polymorphic over Num a, it forces
you to reason about your values only with respect to how they can be added,
subtracted, negated, or multiplied, instead of having to worry about things like
their machine representation.
The extensibility comes from the fact that you can create even more
supertypes of forall a. ToJSON a => a -> Value easily,
just by defining a new typeclass instance. So, if you need a
MyType -> Value, you could make it a supertype of
toJSON :: ToJSON a => a -> Value by defining an instance of
the ToJSON typeclass, and now you have something you can use in its
place.
Practically this is used by many libraries. For example, ad uses it for automatic
differentiation: its diff function looks scary:
diff :: (forall s. AD s ForwardDouble -> AD s ForwardDouble) -> Double -> Double
But it relies on the fact that
(forall s. AD s ForwardDouble -> AD s ForwardDouble) is a
supertype of (forall a. Floating a => a -> a),
(forall a. Num a => a -> a), etc., so you can give it
functions like \x -> x * x (which is a
forall a. Num a => a -> a) and it will work as that
AD s type:
ghci> diff (\x -> x * x) 10
20   -- 2*x
This “numeric overloading” method is used by libraries for GPU programming,
as well, to accept numeric functions to be optimized and compiled to GPU
code.
Another huge application is in the lens library, which
uses subtyping to unite its hierarchy of optics.
For example, an Iso is a subtype of Lens, and a Lens
is a subtype of Traversal, which in turn is a subtype of
Fold, etc. In the end the system even
allows you to use id from the Prelude as a lens or a
traversal, because the type signature of id :: a -> a is
actually a subtype of all of those types!
Subtyping using Existential Types
What more closely matches the spirit of subtypes in OOP and other
languages is the existential type: a value that can be a value of any
type matching some interface.
For example, let’s imagine a value that could be any instance of
Num:
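A minimal sketch of such a type (the exact original definition is an assumption here, but it is consistent with the SomeFractional definition that appears below):

{-# LANGUAGE ExistentialQuantification #-}

data SomeNum = forall a. Num a => SomeNum a

-- a heterogeneous list of "any Num":
someNums :: [SomeNum]
someNums = [SomeNum (3 :: Int), SomeNum (0.5 :: Double), SomeNum (2 :: Integer)]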
This is somewhat equivalent to Java’s
List<MyInterface> or List<MyClass>, or
python’s List[MyClass].
Note that to use this effectively in Haskell with superclasses and
subclasses, you need to manually wrap and unwrap:
data SomeFractional = forall a. Fractional a => SomeFractional a

castUp :: SomeFractional -> SomeNum
castUp (SomeFractional x) = SomeNum x
So, SomeNum is “technically” a supertype of
SomeFractional: everywhere a SomeNum is expected, a
SomeFractional can be given…but in Haskell it’s a lot less
convenient because you have to explicitly cast.
In OOP languages, you can often cast “down” using runtime reflection
(SomeNum -> Maybe SomeFractional). However, this is impossible
in Haskell the way we have written it!
That’s because of type erasure: Haskell does not (by default) couple a value
at runtime with all of its associated interface implementations. When you create
a value of type SomeNum, you are packing an untyped pointer to that
value as well as a “dictionary” of all the functions you could use it with:
-- (schematic: real Haskell record fields can't be named (+), (*), etc., and the
-- repeated field names would need DuplicateRecordFields; the point is just the
-- shape of the dictionaries)
data NumDict a = NumDict
  { (+)         :: a -> a -> a
  , (*)         :: a -> a -> a
  , negate      :: a -> a
  , abs         :: a -> a
  , fromInteger :: Integer -> a
  }

mkNumDict :: Num a => NumDict a
mkNumDict = NumDict (+) (*) negate abs fromInteger

data FractionalDict a = FractionalDict
  { numDict      :: NumDict a
  , (/)          :: a -> a -> a
  , fromRational :: Rational -> a
  }

-- | Essentially equivalent to the previous 'SomeNum'
data SomeNum = forall a. SomeNum
  { numDict :: NumDict a
  , value   :: a
  }

-- | Essentially equivalent to the previous 'SomeFractional'
data SomeFractional = forall a. SomeFractional
  { fractionalDict :: FractionalDict a
  , value          :: a
  }

castUp :: SomeFractional -> SomeNum
castUp (SomeFractional (FractionalDict {numDict}) x) = SomeNum numDict x

castDown :: SomeNum -> Maybe SomeFractional
castDown (SomeNum nd x) = error "not possible!"
All of these function pointers essentially exist at runtime inside
the SomeNum. So, SomeFractional can be “cast up” to
SomeNum by simply dropping the FractionalDict.
However, you cannot “cast down” from SomeNum because there is no
way to materialize the FractionalDict: the association from type to
instance is lost at runtime. OOP languages usually get around this by having the
value itself hold pointers to all of its interface implementations at
runtime. However, in Haskell, we have type erasure by default: there are no
tables carried around at runtime.2
In the end, existential subtyping requires explicit wrapping/unwrapping
instead of implicit or lightweight casting possible in OOP languages optimized
around this sort of behavior.3 Existential-based subtyping is just less
common in Haskell because parametric polymorphism offers a solution to most
similar problems. For more on this topic, Simon Peyton Jones has a nice lecture on the
topic.
The pattern of using existentially qualified data in a container
(like [SomeNum]) is often called the “widget pattern” because it’s
used in libraries like xmonad to allow
extensible “widgets” stored alongside the methods used to manipulate them. It’s
more common to explicitly store the handler functions (a “dictionary”) inside
the type instead of existential typeclasses, but sometimes it can be nice to
let the compiler handle generating and passing your method tables implicitly for
you. Using existential typeclasses instead of explicit dictionaries also allows
you to bless certain methods and functions as “canonical” to your type, and the
compiler will make sure they are always coherent.
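As a rough sketch of the two styles (the names here are invented for illustration and are not xmonad's actual API):

{-# LANGUAGE ExistentialQuantification #-}

-- explicit "dictionary of handlers" stored as a plain record...
data WidgetHandlers = WidgetHandlers
  { drawIt     :: IO ()
  , describeIt :: String
  }

-- ...versus letting the compiler carry the method table via a class:
class IsWidget w where
  draw        :: w -> IO ()
  description :: w -> String

data SomeWidget = forall w. IsWidget w => SomeWidget w

drawAll :: [SomeWidget] -> IO ()
drawAll = mapM_ (\(SomeWidget w) -> putStrLn (description w) >> draw w)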
I do mention in a blog
post about different types of existential lists, however, that this
“container of instances” type is much less useful in Haskell than in other
languages for many reasons, including the up/downcasting issues mentioned above.
In addition, Haskell gives you a whole wealth of functionality to operate over
homogeneous parameters (like [a], where all items have the same
type), so much of which you give up by jumping to heterogeneous lists.
Aside
Let’s briefly take a moment to talk about how typeclass hierarchies give us
subtle subtype/supertype relationships.
Let’s look at the classic Num and Fractional:
class Num a
class Num a => Fractional a
Num is a superclass of Fractional, and
Fractional is a subclass of Num. Everywhere a
Num constraint is required, you can provide a
Fractional constraint to do the same thing.
However, in these two types:
Num a => a
Fractional a => a
forall a. Num a => a is actually a subtype of
forall a. Fractional a => a! That’s because if you need a
forall a. Fractional a => a, you can provide a
forall a. Num a => a instead. In fact, let’s look at three
levels: Double, forall a. Fractional a => a, and
forall a. Num a => a.
-- can be used as `Double`
1.0 :: Double
1.0 :: Fractional a => a
1   :: Num a => a

-- can be used as `forall a. Fractional a => a`
1.0 :: Fractional a => a
1   :: Num a => a

-- can be used as `forall a. Num a => a`
1 :: Num a => a
So, Double is a supertype of Fractional a => a
is a supertype of Num a => a.
The general idea here is that the more super- you go, the more you “know”
about the actual term you are creating. So, with Num a => a, you
know the least (and, you have the most possible actual terms because
there are more instances of Num than of Fractional).
And, with Double, you know the most: you even know its
machine representation!
So, Num is a superclass of Fractional but
forall a. Num a => a is a subtype of
forall a. Fractional a => a. This actually follows the typical
rules of subtyping: if something appears on the “left” of an arrow
(=> in this case), it gets flipped from sub- to super-. We often
call the left side a “negative” (contravariant) position and the right side a
“positive” position, because a negative of a negative (the left side of a left
side, like a in (a -> b) -> c) is a
positive.
Also note that our “existential wrappers”:
data SomeNum        = forall a. Num a        => SomeNum a
data SomeFractional = forall a. Fractional a => SomeFractional a
can be CPS-transformed to their equivalent types:
type SomeNum'        = forall r. (forall a. Num a        => a -> r) -> r
type SomeFractional' = forall r. (forall a. Fractional a => a -> r) -> r

toSomeNum' :: SomeNum -> SomeNum'
toSomeNum' (SomeNum x) f = f x

toSomeNum :: SomeNum' -> SomeNum
toSomeNum sn = sn SomeNum
And in those cases, Num and Fractional again appear
in the covariant (positive) position, since they’re the negative of a negative.
So, this aligns with our intuition that SomeFractional is a subtype
of SomeNum.
The Expression Problem
This tension that I described earlier is closely related to the expression
problem, and it is inherent to many different aspects of
language and abstraction design. However, in the context laid out in this post,
it serves as a good general guide to decide what pattern to go down:
If you expect a canonical set of “inhabitants” and an open set of
“operations”, sum types can suit that end of the spectrum well.
If you expect a canonical set of “operations” and an open set of
“inhabitants”, consider subtyping and supertyping.
I don’t really think of the expression problem as a “problem” in the sense of
“some hindrance to deal with”. Instead, I see it in the “math problem” sort of
way: by adjusting how you approach things, you can play with the equation to make
the most of the requirements you have in your design.
Looking Forward
A lot of frustration in Haskell (and programming in general) lies in trying
to force abstraction and tools to work in a way they weren’t meant to. Hopefully
this short run-down can help you avoid going against the point of these
design patterns and start making the most of what they can offer. Happy
Haskelling!
Special Thanks
I am very humbled to be supported by an amazing community, who make it
possible for me to devote time to researching and writing these posts. Very
special thanks to my supporter at the “Amazing” level on patreon, Josh Vera! :)
Most OOP languages also have mechanisms for type erasure, but
the default is unerased, which is the opposite of Haskell.↩︎
Note that there are current GHC proposals
that attempt to allow “naked” existentials without newtype wrappers, so we could
actually get the same seamless and implicit up-casting as we would get in OOP
languages. However, the jury is out on whether or not this is a good idea.↩︎
A while ago, we wrote a post on how we helped a client initially
integrate the Testwell CTC++ code coverage tool from Verifysoft into
their Bazel build.
Since then, some circumstances have changed, and we were recently challenged to
see if we could improve the CTC++/Bazel integration to the point where CTC++
coverage builds could enjoy the same benefits of Bazel caching and incremental
rebuilds as regular (non-coverage) builds. Our objective was to make it feasible
for developers to do coverage builds with CTC++ locally, rather than them
using different coverage tools or delaying coverage testing altogether.
Thus we could enable the client to focus their efforts on improving
overall test coverage with CTC++ as their only coverage tool.
In this sequel to the initial integration, we, as a team, have come up with a more involved
scheme for making CTC++ meet Bazel’s expectations of hermetic and reproducible
build actions. There is considerable extra complexity needed to make this work,
but the result is a typical speedup of 5-10 times on most coverage builds.
The kind of speedup that not only makes your CI faster, but that allows
developers to work in a different and more efficient way, altogether.
More generally, we hope this blog post can serve as a good example (or maybe a
cautionary tale 😉) of how to take a tool that does not play well with Bazel’s
idea of a well-behaved build step, and force it into a shape where we can still
leverage Bazel’s strengths.
The status quo
You can read our previous blog post for more details, but here
we’ll quickly summarize the relevant bits of the situation after our initial
integration of CTC++ coverage builds with Bazel:
CTC++ works by wrapping the compiler invocation with its ctc tool, and
adding coverage instrumentation between the preprocessing and compiling steps.
In addition to instrumenting the source code itself, ctc also writes
instrumentation data in a custom text format (aka. symbol data) to a separate
output file, typically called MON.sym (aka. the symbol file).
At runtime the instrumented unit tests will collect coverage statistics and
write these (in binary form) to another separate output file: MON.dat.
As far as Bazel is concerned, both the MON.sym and MON.dat files are
untracked side-effects of the respective compilation and testing steps. As
such we had to poke a hole in the Bazel sandbox and arrange for these files
to be written to a persistent location without otherwise being tracked or
managed by Bazel.
More importantly, these side-effects mean that we have to disable all caching
and re-run the entire build and all tests from scratch every single time.
Otherwise, we would end up with incomplete MON.sym and MON.dat files.
Another consideration - not emphasized in our previous post since we had to
disable caching of intermediate outputs in any case - is that the outputs from
ctc are not hermetic and reproducible. Both the instrumentation that is added
to the source code, as well as the symbol file that is written separately by
ctc contain the following information that is collected at compile time:
Absolute paths to source code files: Even though Bazel passes relative
paths on the command-line, ctc will still resolve these into absolute paths
and record these paths into its outputs. Since all these build steps run
inside the Bazel sandbox, the recorded paths vary arbitrarily from build to
build. Even worse: the paths are made invalid as soon as the sandbox is
removed, when the compilation step is done.
Timestamps: ctc will also record timestamps into the instrumented source
code and the symbol file. As far as we know, these might have been part of
some internal consistency check in previous versions of CTC++, but currently
they are simply copied into the final report, and displayed as a property of
the associated symbol data on which the HTML report is based. Since our
coverage reports are already tied to known Git commits in the code base,
these timestamps have no additional value for us.
Fingerprints: ctc calculates a 32-bit fingerprint based on the symbol
data, and records this fingerprint into both the instrumented source and the
symbol file. Since the symbol data already contains absolute path names as
detailed above, the resulting fingerprint will also vary accordingly, and thus
not be reproducible from one build to the next, even when all other inputs
remain unchanged.
Outlining the problems to be solved
If we are to make CTC++ coverage builds quicker by leveraging the Bazel cache,
we must answer these two questions:
Can we make ctc’s outputs reproducible? Without this, re-enabling the Bazel
cache for these builds is a non-starter, as each re-evaluation of an
intermediate build step will have never-before-seen action inputs, and none
of the cached outputs from previous builds will ever get reused.
Can we somehow capture the extra MON.sym output written by ctc at build
time, and appropriately include it into Bazel’s build graph?1
We need for Bazel to cache and reuse the symbol data associated with a
compilation unit in exactly the same way that it would cache and reuse the
object file associated with the same compilation unit.
Solving both of these would allow us to achieve a correct coverage report
assembled from cached object files and symbol data from previously-built and
unchanged source code, together with newly-built object files and symbol data
from recently-changed source code (in addition to the coverage statistics
collected from re-running all tests).
Achieving reproducibility
Let’s tackle the problem of making ctc’s outputs reproducible first. We start
by observing that ctc allows us to
configure hook scripts that will be invoked at various
points while ctc is running. We are specifically interested in:
RUN_AFTER_CPP, which allows access to the preprocessed source before the
instrumentation step, and
RUN_AFTER_INSTR, which allows access to the instrumented source before it’s passed
on to the underlying compiler.
From our existing work, we of course also have our own wrapper script around
ctc, which allows us to access the outputs of each ctc invocation before
they are handed back to Bazel. We also know, from our previous work, that we can
instruct ctc to write a separate symbol file per compilation unit, rather than
have all compilation units append to the same MON.sym file.
Together this allows us to rewrite the outputs from ctc in such a way as to
make them reproducible. What we want to rewrite, has already been outlined
above:
Absolute paths into the sandbox: We could rewrite these into corresponding
absolute paths to the original source tree instead, but we can just as well
take it one step further and simply strip the sandbox root directory prefix
from all absolute paths. This turns them into relative paths that happen to
resolve correctly, whether they’re taken relative to the sandbox directory at
compile time, or relative to the root of the source tree afterwards.
Timestamps: This one is relatively easy, we just need to decide on a
static timestamp that does not change across builds. For some reason the CTC++
report tooling did not like us passing the ultimate default timestamp, aka.
the Unix Epoch, so we instead settled for midnight on January 1 2024.2
Fingerprints: Here we need to calculate a 32-bit value that will reflect
the complete source code in this compilation unit (but importantly with
transient sandbox paths excluded). We don’t have direct access to the
in-progress symbol data that ctc uses to calculate its own fingerprint,
so instead we settle on calculating a CRC32 checksum across the entire
preprocessed source code (before ctc adds its own instrumentation).3
Once we’ve figured out what to rewrite, we can move on to the how:
Using the RUN_AFTER_CPP option to ctc, we can pass in a small script that
calculates our new fingerprint by running the preprocessed source code
through CRC32.
Using the RUN_AFTER_INSTR option to ctc, we can pass in a script that
processes the instrumented source, line by line:
rewriting any absolute paths that point into the Bazel sandbox,
rewriting the timestamp recorded by ctc into our static timestamp, and
rewriting the fingerprint to the one calculated in step 1.
In our script that wraps the ctc invocation, we can insert the above two
options on the ctc command line. We can also instruct ctc to write a
separate .sym file for this compilation unit inside the sandbox.
In the same wrapper script, after ctc is done producing the object file
and symbol file for a compilation unit, we can now rewrite the symbol file
that ctc produced. The rewrites are essentially the same as performed in
step 2, although the syntax of the symbol file is different than the
instrumented source.
At this point, we have managed to make ctc’s outputs reproducible, and we can
proceed to looking at the second problem from above: properly capturing and
maintaining the symbol data generated by ctc. However, we have changed
the nature of the symbol data somewhat: Instead of having multiple compilation
units write to the same MON.sym file outside of the sandbox, we now have one
.sym file per compilation unit written inside the sandbox. These files are
not yet known to Bazel, and would be removed together with the rest of the
sandbox as soon as the compilation step is finished.
Enabling correct cache/reuse of symbol data
What we want to achieve here is for the symbol data associated with a
compilation unit to closely accompany the corresponding object file from the
same compilation unit: If the object file is cached and later reused by Bazel,
we want the symbol file to be treated the same. And when the object file is
linked into an executable or a shared library, we want the symbol file to
automatically become part of any coverage report that is later created based on
running code from that executable or library.
I suspect there are other ways we could handle this, for example using
Bazel aspects, or similar, but since we’re already knee-deep
in compiler wrappers and rewriting outputs…
In for a penny, in for a pound…
Given that we want the symbol file to be as closely associated with the object
file as possible, let’s take that to the ultimate conclusion and make it a
stowaway inside the object file. After all, the object file is “just” an ELF
file, and it does not take too much squinting to regard the ELF format as a
generic container of sections, where a section really can be any piece of
data you like.
The objcopy tool, part of the GNU binutils tool suite, also comes to our aid
with options like --add-section and --dump-section to help us embed and
extract such sections from any ELF file.
With this in hand, we can design the following scheme:
In our wrapper script, after ctc has generated an object file with an
accompanying symbol file, we run
objcopy --add-section ctc_sym=$SYMBOL_FILE $OBJECT_FILE to embed the
symbol file as a new ctc_sym section inside the object file.
We make no changes to our Bazel build, otherwise. We merely expect Bazel to
collect, cache, and reuse the object files as it would do with any
intermediate build output. The symbol data is just along for the ride.
In the linking phase (which is already intercepted by ctc and our wrapper
script) we can forward the symbol data from the linker inputs (ELF object
files) into the linker output (a shared library or executable, also in the
ELF format), like this: Extract the ctc_sym from each object file passed as
input (objcopy --dump-section ctc_sym=$SYMBOL_FILE $OBJECT_FILE /dev/null),
then concatenate these symbol files together, and finally embed that into the
ELF output file from the linker.4
At test run time, in addition to running the tests (which together produce
MON.dat as a side effect), we can iterate over the test executables and
their shared library dependencies, and extract any ctc_sym sections that
we come across. These are then split into separate symbol files and placed
next to MON.dat.
Finally, we can pass MON.dat and all the .sym files on to the ctcreport
report generator to generate the final HTML report.5
Results
With all of the above in place, we can run coverage builds with and without our
changes, while testing various build scenarios, to see what we have achieved.
Let’s look at some sample build times for generating CTC++ coverage reports.
All times below are taken from the best of three runs, all on the same machine.
Status quo
Starting with the situation as of our previous blog post:
Scope of coverage build + tests | bazel build/test | ctcreport | Total
Entire source tree              | 38m46s           | 2m06s     | 44m26s
One large application           | 13m59s           | 43s       | 15m30s
One small application           | 21s              | 1s        | 35s
Since caching is intentionally disabled and there is no reuse between these
coverage builds, these are the kinds of numbers you will get, no matter the
size of your changes since the last coverage build.
Let’s look at the situation after we made the changes outlined above.
Worst case after our changes: No cache to be reused
First, for a new coverage build from scratch (i.e. a situation in which there is
nothing that can be reused from the cache):
Scope of coverage build + tests | bazel build/test | ctcreport | Total  | Speedup
Entire source tree              | 38m48s           | 1m59s     | 43m03s | 1.0x
One large application           | 13m04s           | 43s       | 14m26s | 1.1x
One small application           | 19s              | 1s        | 22s    | 1.6x
As expected, these numbers are very similar to the status quo. After all, we are
doing the same amount of work, and this is not the scenario we sought to improve
in any case.
There is maybe a marginal improvement in the overhead (i.e. the time spent
between/around bazel and ctcreport), but it’s pretty much lost in the noise,
and certainly nothing worth writing a blog post about.
Best case after our changes: Rebuild with no changes
This is the situation where we are now able to reuse already-instrumented
intermediate build outputs. In fact, in this case there are no changes
whatsoever, and Bazel can reuse the test executables from the previous build
directly, no (re-)building necessary. However, as discussed above, we do need
to re-run all tests and then re-generate the coverage report:
Scope of coverage build + tests | bazel build/test | ctcreport | Total | Speedup
Entire source tree              | 3m24s            | 1m58s     | 6m55s | 6.4x
One large application           | 1m31s            | 42s       | 2m49s | 5.5x
One small application           | 1s               | 1s        | 4s    | 8.8x
Common case after our changes: Rebuild with limited change set
This last table is in many ways the most interesting (but least accurate),
as it tries to reflect the common case that most developers are interested in:
“I’ve made a few changes to the source code, how long will I have to wait to
see the updated coverage numbers?”
Of course, as with a regular build, it depends on the size of your changes, and
the extent to which they cause misses in Bazel’s build cache. Here, I’ve done
some small source code changes that cause rebuilds in a handful of compilation
units:
Scope of coverage build + tests | bazel build/test | ctcreport | Total | Speedup
Entire source tree              | 3m23s            | 1m57s     | 6m54s | 6.4x
One large application           | 1m34s            | 42s       | 2m52s | 5.4x
One small application           | 4s               | 1s        | 6s    | 5.8x
The expectation here would be that the total time needed is the sum of how long
it takes to do a regular build of your changes, plus the numbers from the no-op
case above. And this seems to largely hold true, especially for the single-
application case, where we expect your changes to affect the application’s unit
tests, and therefore the build phase must strictly precede the test runs.
In the full source tree scenario, it seems that Bazel can start running other
(unrelated) tests concurrently with building your changes, and as long as your
changes, and the tests on which they depend, are not among the slowest tests
to run, then those other, slower tests will “hide” the marginal build time cost
imposed by your changes.
Conclusion
We have achieved what we set out to do: to leverage the Bazel cache to avoid
unnecessary re-building of coverage-instrumented source code. It involves a
fair amount of added complexity in the build process, in order to make CTC++’s
outputs reproducible, and thus reusable by Bazel, but the end result, in the
common case - a developer making a small source code change relative to a
previous coverage build - is a 5-10x speedup of the total time needed to build
and test with coverage instrumentation, including the generation of the final
coverage report.
Future work
A natural extension of the above scheme is to apply a similar treatment to the
generation of the coverage statistics at test runtime: Bazel allows for test
runs to be cached, so that later build/test runs can reuse the results and logs
from earlier test runs, rather than having to re-run tests that haven’t changed.
However, in much the same way as for symbol data at build time, we would need to
make sure that coverage statistics (.dat files) were saved and reused along
with the corresponding test run results/logs.
One could imagine each test creating a separate .dat file when run, and then
have Bazel cache this together with the test logs. The report generation phase
would then need to collect the .dat files from both the reused/cached and
the new/uncached test runs, and pass them all to the ctcreport tool.
Failure to do so correctly would cause coverage statistics to be lost, and the
resulting coverage report would be misleading.
With all this in place we could then enable caching of test results (in
practice, removing the --nocache_test_results flag that we currently pass),
and enjoy yet another speedup courtesy of Bazel’s cache.
That said, we are entering the realm of diminishing returns: Unit tests - once
they are built - typically run quickly, and there is certainly less time to be
saved here than what is saved by reusing cached build results. Looking at the
above numbers: even if we were able to fully eliminate time used by
bazel test, we would still only achieve another 2x speedup, theoretically.
For now, we can live with re-running all tests from scratch in order to create
a complete MON.dat file, every time.
And that is where I believe it stops: extending this even further to
incrementally generate the coverage report itself, in effect to re-generate
parts of the report based on a few changed inputs, is - as far as I can see -
not possible with the existing tools.
Finally, I want to commend Verifysoft for their understanding and cooperation.
I can only imagine that for someone not used to working with Bazel, our initial
questions must have seemed very eccentric. They were, however, eager to
understand our situation and find a way to make CTC++ work for us. They have
even hinted at including a feature in a future version of CTC++ to allow
shortening/mapping paths at instrumentation time. Using such a feature to
remove the sandbox paths would also have the nice side effect of making CTC++’s
own fingerprint logic reproducible, as far as we can see. Together, this would
enable us to stop rewriting paths and fingerprints on our own.
Thanks to Mark Karpov for being my main co-conspirator in coming up with this
scheme, and helping to work out all the side quests and kinks along the way.
Also thanks to Christopher Harrison, Joseph Neeman, and Malte Poll for their reviews of this article.
Four years ago I bought a pair of YubiKey 5s:
One YubiKey 5 Nano, which fits in my laptop’s USB slot, and another YubiKey 5 NFC as backup, which sat in my home office.
However, I kept worrying about what happens if my house burns down or something, taking both my laptop and office YubiKeys together at the same time.
On the other hand, if I stored my YubiKey 5 NFC offsite, then whenever I needed to register a new FIDO service, I would need to go fetch the key, update it, and then return it.
Based on my personal experience, even if that were not a big pain, the "return it" step often gets delayed indefinitely because it feels so low priority.
Then I read a popular comment made on Hacker News: Get three YubiKeys.
Suddenly everything clicked!
I bought a second YubiKey 5 NFC last year.
Now, I keep a second YubiKey 5 NFC offsite, in addition to the one in my laptop and the one in my office.
If my home burns down, I still have an offsite YubiKey available.
But the best thing about having a second YubiKey 5 NFC is that it partly mitigates the offsite update problem.
In the previous scenario, we required potentially two trips offsite to update the backup YubiKey.
However, now the procedure for registering a new FIDO service is to first update the office YubiKey 5 NFC key (and the YubiKey 5 Nano).
Then, at your earliest convenience, you swap the office YubiKey 5 NFC key with the offsite YubiKey 5 NFC.
When you get the offsite YubiKey home, you update it with the new FIDO service and then it becomes the new office YubiKey.
There is no need to return to the offsite location.
Part of the issue is that there is no "public FIDO key", like there is with a "public PGP key".
You need the actual YubiKey in hand to register it with a FIDO service, no matter whether it is a discoverable credential or not.
If you were only using the YubiKey as an OpenPGP smart card, then perhaps you could get away with just having a local key and an offsite key.
Even still, I would recommend a third YubiKey so that whenever the time comes to do some operation on your offsite key, you can perform the same swapping trick.
The title of this article says that three is the right number of YubiKeys.
However, this is only because I keep a single Nano in my laptop, since that is my primary computing interface.
I do have a desktop computer that I mostly only access as a remote server.
If you have multiple computer devices that you regularly use, it would make sense to have a YubiKey nano device in each of them.
And in addition to those, have one offsite key, and one local key for swapping with the offsite key.
Retrieval-augmented generation (RAG) is about providing large language models
with extra context to help them produce more informative responses. Like
any machine learning application, your RAG app needs to be monitored and
evaluated to ensure that it continues to respond accurately to user queries.
Fortunately, the RAG ecosystem has developed to the point where you can evaluate
your system in just a handful of lines of code.
The outputs of these evaluations are easily interpretable: numbers between
0 and 1, where higher numbers are better. Just copy our
sample code below, paste it into your continuous monitoring system,
and you’ll be looking at nice dashboards in no time. So that’s it, right?
Well, not quite. There are several common pitfalls in RAG evaluation. In this
blog post, drawing on knowledge we have gained in the field, you will learn what
the metrics mean and how to check that they’re working correctly on your data.
As they say, “forewarned is forearmed”!
Background
If you’re new to RAG evaluation, our previous posts about it give an
introduction to evaluation and discuss benchmark suites.
For now, you just need to know that a benchmark suite consists of a collection of
questions or prompts, and for each question establishes:
a “ground truth” context, consisting of documents from our database that are relevant
for answering the question; and
a “ground truth” answer to the question.
For example
Query                          | Ground truth context                                                                                | Ground truth answer
What is the capital of France? | Paris, the capital of France, is known for its delicious croissants.                               | Paris
Where are the best croissants? | Lune Croissanterie, in Melbourne, Australia, has been touted as ‘the best croissant in the world.’ | Melbourne
Then the RAG system provides (for each question):
a “retrieved” context — the documents that our RAG system thought were relevant — and
a generated answer.
The inputs to a RAG evaluator
Example
Here’s an example that uses the
Ragas library to evaluate the “faithfulness” (how well
the response was supported by the context) of a single RAG output, using an LLM
from AWS Bedrock:
from langchain_aws import ChatBedrockConverse
from ragas import EvaluationDataset, evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import Faithfulness
# In real life, this probably gets loaded from an internal file (and hopefully
# has more than one element!)
eval_dataset = EvaluationDataset.from_list([{
    "user_input": "What is the capital of France?",
    "retrieved_contexts": ["Berlin is the capital of Germany."],
    "response": "I don't know.",
}])

# The LLM to use for computing metrics (more on this below).
model = "anthropic.claude-3-haiku-20240307-v1:0"
evaluator = LangchainLLMWrapper(ChatBedrockConverse(model=model))

print(evaluate(dataset=eval_dataset, metrics=[Faithfulness(llm=evaluator)]))
If you paid close attention in the previous section, you’ll have noticed
that our evaluation dataset doesn’t include all of the components we talked about.
That’s because the “faithfulness” metric only requires the retrieved context and
the generated answer.
RAG evaluation metrics
There are a variety of RAG evaluation metrics available; to keep them straight, we
like to use the RAG Triad, a helpful system of categorizing some RAG metrics.
A RAG system has one input (the query) and two outputs
(the context and the response), and the RAG Triad lets us visualize the three interactions
that need to be evaluated.
The RAG triad
Evaluating retrieval
Feeding an LLM with accurate and relevant context can help it respond well; that’s the
whole idea of RAG.
Your system needs to find that relevant context, and your
evaluation system needs to figure out how well the retrieval is working. This
is the top-right side of the RAG Triad: evaluating the relationship between the query
and the retrieved context.
The two main retrieval metrics are precision and recall; each one has a classical
definition, plus an “LLM-enhanced” definition for RAG.
Roughly, “good precision” means that we don’t return irrelevant information,
while “good recall” means that we don’t miss any relevant information. Let’s
say that each of our benchmark queries is labelled with a ground truth set of
relevant documents, so that we can check how many of the retrieved documents are relevant.
Then the classical precision and recall are
\text{precision} = \frac{\text{\# relevant retrieved docs}}{\text{\# retrieved docs}}
\qquad
\text{recall} = \frac{\text{\# relevant retrieved docs}}{\text{\# relevant docs in the database}}
These metrics are well-established, useful, and easy to compute. But in a RAG system, the
database might be large, uncurated, and contain redundant documents.
For example, suppose you have ten related documents, each containing an
answer to the query. If your retrieval system returns just one of them then it will have
done its job adequately, but it will only receive a 10% recall score.
With a large database, it’s also possible that there’s a document with the necessary
context that wasn’t tagged as relevant by the benchmark builder. If the retrieval
system finds that document, it will be penalized in the precision score even though
the document is relevant.
Because of these issues with classical precision and recall, RAG evaluations often
adapt them to work on statements instead of documents. We
list the statements in the ground-truth context and in the retrieved
context; we call a retrieved statement “relevant” if it was present in the ground-truth
context.
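Mirroring the document-level formulas above, one way to write the statement-level versions down (our own phrasing, not an official definition from any particular library) is:

\text{precision} = \frac{\text{\# retrieved statements also present in the ground-truth context}}{\text{\# retrieved statements}}
\qquad
\text{recall} = \frac{\text{\# ground-truth statements also present in the retrieved context}}{\text{\# ground-truth statements}}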
This definition of precision and recall is better tailored to RAG than the classical one, but
it comes with a big disadvantage: you need to decide what a “statement” is, and whether two statements
are “the same.” Usually you’ll want to automate this decision with an LLM, but that raises
its own issues with cost and reliability. We’ll say more about that later.
Evaluating generation
Once your retrieval is working well — with continuous monitoring and
evaluation, of course — you’ll need to evaluate your generation step. The most
commonly used metric here is faithfulness1, which
measures whether a generated answer is factually supported by the retrieved
context; this is
the bottom side of the RAG Triad.
To calculate faithfulness, we count the number of factual claims in
the generated answer, and then decide which of them is supported
by the context. Then we define
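Written out in the same style as the formulas above (this follows the usual Ragas-style definition):

\text{faithfulness} = \frac{\text{\# claims in the generated answer supported by the retrieved context}}{\text{\# claims in the generated answer}}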
Like the RAG-adapted versions of context precision and recall,
this is a statement-based metric. To automate it, we’d need an LLM
to count the factual claims and decide which of them is context-supported.
You can evaluate faithfulness without having retrieval working yet, as long as you
have a benchmark with ground truth contexts. But if you do that, there’s one crucial point
to keep in mind: you also need to test generation when retrieval is bad, like when
it contains distracting irrelevant documents or just doesn’t have anything useful at all.
Bad retrieval will definitely happen in the wild, and so you need to ensure that your
generation (and your generation evaluation) will degrade gracefully. More on that below.
Evaluating the answer
Finally, there is a family of
commonly-used generation metrics that evaluate the quality of the answer by comparing it
to the prompt and the ground truth:
answer semantic similarity measures the semantic similarity between the generated answer and the ground truth;
answer correctness also compares the generated answer and the ground truth, but is based on counting factual claims
instead of semantic similarity; and
answer relevance measures how well the generated answer corresponds to the question that the prompt asked.
This is the top-left side of the RAG Triad.
These metrics directly get to the key outcome of your RAG system: are the generated responses good?
They come with the usual pluses and minuses of end-to-end metrics. On the one hand, they measure
exactly what you care about; on the other hand, when they fail you don’t know which component
is to blame.
As you’ve seen above, many of the metrics used for evaluating RAG rely on
LLMs to extract and evaluate factual claims. That means that some of the
same challenges you’ll face while building your RAG system also apply to its
evaluation:
You’ll need to decide which model (or models) to use for evaluation, taking into account cost, accuracy, and reliability.
You’ll need to sanity-check the evaluator’s responses, preferably with continuous monitoring and
occasional manual checks.
Because the field is moving so quickly, you’ll need to evaluate the options yourself — any
benchmarks you read online have a good chance of being obsolete by the time you
read them.
When the judges don’t agree
In order to better understand these issues, we ran a few experiments on a basic RAG
system — without query re-writing, context re-ranking or other tools to improve
retrieval — using the Neural Bridge benchmark dataset as our test set. We first ran
these experiments in early 2024; when we re-visited them in December 2024 we found
that newer base LLMs had improved results somewhat but not dramatically.
The Neural Bridge dataset contains 12,000 questions; each one comes with a
context and an answer. We selected 200 of these questions at random and ran them
through a basic RAG system using Chroma DB as the vector store and either Llama
2 or Claude Haiku 3 as the LLM for early 2024 and December 2024 runs, respectively. The RAG system was
not highly tuned — for example, its retrieval step was just a vector similarity search
— and so it gave a mix of good answers, bad answers, and answers saying essentially
“I don’t know: the context doesn’t say.”
Finally, we used Ragas to evaluate various metrics on the generated responses, while varying
the LLMs used to power the metrics.
Experimental results
Our goal in these experiments was to determine:
whether the LLM evaluators were correct, and
whether they were consistent with one another.
We found that different LLMs are often not in agreement. In particular, they can’t all be correct.
Here are the evaluation scores of five different models on four different metrics,
averaged across our benchmark dataset. You’ll notice a fair amount of spread
in the scores for faithfulness and context precision.
Average metrics scores across models
But the scores above are just averages across the dataset — they don’t tell
us how well the LLMs agreed on individual ratings. For that, we checked the
correlation
between model scores and again found some discrepancies between models.
Here are the results for answer relevancy scores: the correlations show
that even though the different models gave very similar average scores,
they aren’t in full agreement.
Correlation of answer relevancy scores across models. A score of one means
that the models agree completely, while a score of zero means that they
agree or disagree essentially at random.
It might not be too surprising that models from the same family (GPT 3.5 and 4,
and Sonnet 3 and 3.5) had larger overlaps than models from different families.
If your budget allows it, choosing multiple uncorrelated models and evaluating
with all of them might make your evaluation more robust.
When faithfulness gets difficult
We dug a little more into the specific reasons for LLM disagreement, and found
something interesting about the faithfulness score:
we restricted to the subset of questions for which
retrieval was particularly bad, having no overlap with the ground truth data.
Even the definition of faithfulness is tricky when the context is
bad. Let’s say the LLM decides that the context doesn’t have relevant information
and so responds “I don’t know” or “The context doesn’t say.” Are those factual statements?
If so, are they supported
by the context? If not, then according to the definition, the faithfulness is zero
divided by zero. Alternatively, you could try to detect responses like this
and treat them as a sort of meta-response that doesn’t go through the normal
metrics pipeline. We’re not sure how best to handle this corner case, but we do know
that you need to do it explicitly and consistently. You also need to be prepared
to handle null values and empty responses from your metrics pipeline, because
this situation often induces them.
Experimental results
On the subset of questions with poor retrieval our Ragas-computed faithfulness
scores ranged from 0%, as judged by Llama 3, to more than 80%, as judged by
Claude 3 Sonnet. We emphasize that these were faithfulness scores evaluated by
different LLMs judging the same retrievals, responses, and generated answers.
Even if you exclude Llama 3 as an outlier, there is a lot of variation.
Faithfulness scores across models, when the context is bad
This variation in scores doesn’t seem to be an intentional choice (to the extent
that LLMs can have “intent”) by the evaluator LLMs, but rather a situation of corner
cases compounding one another. We noticed that this confusing situation made some
models — Llama 3 most often, but also other models — fail to respond in the JSON
format expected by the Ragas library. Depending on how you treat these failures, this
can result in missing metrics or strange scores.
You can sidestep these issues somewhat if you have thorough evaluation across the entire
RAG pipeline: if other metrics are flagging poor retrieval, it matters less that your
generation metrics are behaving strangely on poorly-retrieved examples.
In general, there’s no good substitute for careful human evaluation. The LLM judges don’t agree with one another, so the question becomes: which one agrees best with ground-truth human evaluations, and is that agreement good enough
for your application? The answer will depend on your documents, your typical questions, and on future releases of improved models.
Conclusion
Oh, were you hoping we’d tell you which LLM you should use? No such luck: our advice would be
out of date by the time you read this, and if your data doesn’t closely resemble our benchmark
data, then our results might not apply anyway.
In summary, it’s easy to compute metrics for your RAG application, but don’t just
do it blindly. You’ll want to test different LLMs for driving the metrics, and you’ll need
to evaluate their outputs. Your metrics should cover all the sides of the RAG triad, and
you should know what they mean (and be aware of their corner cases) so that you can
interpret the results. We hope that helps, and happy measuring!
The terminology is not quite settled: what Ragas calls “faithfulness,” TruLens
calls “groundedness.” Since the RAG Triad was introduced by TruLens, you’ll
usually see it used in conjunction with their terminology. We’ll use the Ragas
terminology in this post, since that’s what we used for our experiments.↩
The GHC developers are happy to announce the availability of the first and
likely final release candidate of GHC 9.12.2. This is an important bug-fix
release resolving a significant correctness issue present in 9.12.1
(#25653).
In accordance with our under-discussion release policies this candidate
will have a two-week testing window. The final 9.12.2 release will likely come
the week of 12 March 2025.
As always, if you find anything amiss please open a ticket.
In this episode Wouter Swierstra and Niki Vazou talk with Conal Elliott. Conal discusses doing things just for the poetry, how most programs miss their purpose, and the simplest way to ask a question. Conal is currently working on a book about his ideas and actively looking for partners.
Regular, everyday stuff. But the instances for type constructors are more interesting, because they come with an instance context:
instance (Foo a, Foo b) => Foo (a, b) where
  ...
Then, of course, if we know both Foo a and Foo b, we can infer Foo (a, b). To make this fact overwhelmingly explicit, we can reify the usual constraint-solving logic by using the Dict type, and thus the following program will typecheck:
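The program itself is elided in this excerpt; presumably it is something along these lines (a minimal sketch, assuming the Dict type from the constraints package and a hypothetical name forwards):

{-# LANGUAGE ConstraintKinds #-}
import Data.Constraint (Dict (..))

-- From Foo a and Foo b, GHC's constraint solver can build Foo (a, b),
-- and Dict reifies that solved dictionary as a value.
forwards :: (Foo a, Foo b) => Dict (Foo (a, b))
forwards = Dict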
with the only change required coming from the type constructor instances:
instance (Foo a, Foo b) => Foo (a, b) where
  type Evidence (a, b) = (Foo a, Foo b)
  ...
or, if you want to be cute about it:
instance Evidence (a, b) => Foo (a, b) where
  type Evidence (a, b) = (Foo a, Foo b)
  ...
By sticking Evidence into the superclass constraint, GHC knows that this dictionary is always available when you’ve got a Foo dictionary around. And our earlier backwards program now typechecks as expected.
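Filling in the shape of the trick for concreteness (a sketch only, not the post's exact code; it assumes the Dict type from the constraints package, a hypothetical name backwards, and the extensions listed):

{-# LANGUAGE ConstraintKinds, TypeFamilies, UndecidableSuperClasses #-}
import Data.Constraint (Dict (..))
import Data.Kind (Constraint)

class Evidence a => Foo a where
  type Evidence a :: Constraint
  type Evidence a = ()   -- base instances can default to the empty constraint

-- the "backwards" direction: from a Foo (a, b) dictionary, recover the parts
backwards :: Foo (a, b) => Dict (Foo a, Foo b)
backwards = Dict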
Recently I looked again at PHOAS, and once again I concluded it's nice for library APIs, but so painful to do anything with inside those libraries. So let's convert it to something else, like de Bruijn.
There are standalone source files if you just want to see the code:
There is always a way to cheat, though. You can turn the PHOAS ->
untyped de Bruijn machinery into the PHOAS -> typed de Bruijn
machinery by checking that future contexts indeed extend past contexts
and throwing an error otherwise (which can't happen, because future
contexts always extend past contexts, but that is a metatheorem).
In "Generic Conversions of Abstract Syntax Representation" by Steven Keuchel and Johan Jeuring, authors also "cheat" a bit. The "Parametrhic higher-order abstract syntax" section ends with a somewhat disappointing
where postulate whatever : _
Keuchel and Jeuring also mention "Unembedding Domain-Specific Languages" by Robert Atkey, Sam Lindley and Jeremy Yallop; where there is one unsatisfactory ⊥ (undefined in Haskell) hiding.
I think that for practical developments (say, a library in Haskell) it is ok to make a small shortcut; but I kept wondering whether there is a way to do the conversion without cheating.
Well... it turns out that we cannot "cheat". Well-formedness of the PHOAS representation depends on parametricity, and the conversion challenge seems to require a theorem for which there is no proof in Agda.
In unpublished (?) work, Adam Chlipala shows a way to do the conversion without relying on postulates: http://adam.chlipala.net/cpdt/html/Intensional.html; but that procedure requires an extra well-formedness proof of the given PHOAS term.
This Agda development is a translation of that development.
Common setup
Our syntax representations will be well-typed, so we need types:
-- Types
data Ty : Set where
  emp : Ty
  fun : Ty → Ty → Ty

Ctx : Set
Ctx = List Ty

variable
  A B C : Ty
  Γ Δ Ω : Ctx
  v : Ty → Set
de Bruijn syntax
Var : Ctx → Ty → Set
Var Γ A = Idx A Γ  -- from agda-np, essentially a membership relation

data DB (Γ : Ctx) : Ty → Set where
  var : Var Γ A → DB Γ A
  app : DB Γ (fun A B) → DB Γ A → DB Γ B
  lam : DB (A ∷ Γ) B → DB Γ (fun A B)
  abs : DB Γ emp → DB Γ A
Parametric Higher-order abstract syntax
data PHOAS (v : Ty → Set) : Ty → Set where
  var : v A → PHOAS v A
  app : PHOAS v (fun A B) → PHOAS v A → PHOAS v B
  lam : (v A → PHOAS v B) → PHOAS v (fun A B)
  abs : PHOAS v emp → PHOAS v A

-- closed "true" PHOAS terms.
PHOAS° : Ty → Set₁
PHOAS° A = ∀ {v} → PHOAS v A
de Bruijn to PHOAS
This direction is trivial. It is anecdotal evidence that the de Bruijn representation is easier to do transformations on.
phoasify : NP v Γ → DB Γ A → PHOAS v A
phoasify γ (var x)   = var (lookup γ x)
phoasify γ (app f t) = app (phoasify γ f) (phoasify γ t)
phoasify γ (lam t)   = lam λ x → phoasify (x ∷ γ) t
phoasify γ (abs t)   = abs (phoasify γ t)
Interlude: Well-formedness of PHOAS terms
Adam Chlipala defines an equivalence relation between two PHOAS terms (exp_equiv in Intensional, wf in the CPDT book). We only need well-formedness of a single term, so we can do a little less.
Terms like invalid cannot be values of PHOAS°, as all values of "v" inside PHOAS° have to originate from lam-constructor abstractions. We really should keep the v parameter free, i.e. parametric, when constructing PHOAS terms.
The idea is then simply to track which variables (values of v) are introduced by lambda abstraction.
data phoasWf {v : Ty → Set} (G : List (Σ Ty v)) : {A : Ty} → PHOAS v A → Set where
  varWf : ∀ {A} {x : v A}
    → Idx (A , x) G
    → phoasWf G (var x)
  appWf : ∀ {A B} {f : PHOAS v (fun A B)} {t : PHOAS v A}
    → phoasWf G f
    → phoasWf G t
    → phoasWf G (app f t)
  lamWf : ∀ {A B} {f : v A → PHOAS v B}
    → (∀ (x : v A) → phoasWf ((A , x) ∷ G) (f x))
    → phoasWf G (lam f)
  absWf : ∀ {A} {t : PHOAS v emp}
    → phoasWf G t
    → phoasWf G (abs {A = A} t)

-- closed terms start with an empty G
phoasWf° : PHOAS° A → Set₁
phoasWf° tm = ∀ {v} → phoasWf {v = v} [] tm
A meta theorem is then that all PHOAS° terms are well-formed, i.e.
meta-theorem-proposition : Set₁
meta-theorem-proposition = ∀ {A} (t : PHOAS° A) → phoasWf° t
As far as I'm aware this proposition cannot be proved nor refuted in Agda.
de Bruijn to PHOAS translation creates well-formed PHOAS terms.
As a small exercise we can show that phoasify of closed de Bruijn terms creates well-formed PHOAS terms.
toList : NP v Γ → List (Σ Ty v)
toList []       = []
toList (x ∷ xs) = (_ , x) ∷ toList xs

phoasifyWfVar : (γ : NP v Γ) (x : Var Γ A) → Idx (A , lookup γ x) (toList γ)
phoasifyWfVar (x ∷ γ) zero    = zero
phoasifyWfVar (x ∷ γ) (suc i) = suc (phoasifyWfVar γ i)

phoasifyWf : (γ : NP v Γ) (t : DB Γ A) → phoasWf (toList γ) (phoasify γ t)
phoasifyWf γ (var x)   = varWf (phoasifyWfVar γ x)
phoasifyWf γ (app f t) = appWf (phoasifyWf γ f) (phoasifyWf γ t)
phoasifyWf γ (lam t)   = lamWf λ x → phoasifyWf (x ∷ γ) t
phoasifyWf γ (abs t)   = absWf (phoasifyWf γ t)

phoasifyWf° : (t : DB [] A) → phoasWf° (phoasify [] t)
phoasifyWf° t = phoasifyWf [] t
PHOAS to de Bruijn
The rest deals with the opposite direction.
In Intensional, Adam Chlipala uses the instantiation v = λ _ → ℕ to make the translation.
I think that in the typed setting using v = λ _ → Ctx turns out nicer.
The idea in both is that we instantiate PHOAS variables to be de Bruijn levels.
data IsSuffixOf {ℓ} {a : Set ℓ} : List a → List a → Set ℓ where
  refl : ∀ {xs} → IsSuffixOf xs xs
  cons : ∀ {xs ys} → IsSuffixOf xs ys → ∀ {y} → IsSuffixOf xs (y ∷ ys)
We need to establish well-formedness of a PHOAS expression in relation to some context Γ.
Note that variables encode de Bruijn levels, so the contexts we "remember" in variables should be suffixes of the outside context.
wf : (Γ : Ctx) → PHOAS (λ _ → Ctx) A → Set
wf {A = A} Γ (var Δ)  = IsSuffixOf (A ∷ Δ) Γ
wf Γ (app f t)        = wf Γ f × wf Γ t
wf Γ (lam {A = A} t)  = wf (A ∷ Γ) (t Γ)
wf Γ (abs t)          = wf Γ t
And if (A ∷ Δ) is suffix of context Γ, we can convert the evidence to the de Bruijn index (i.e. variable):
makeVar : IsSuffixOf (A ∷ Δ) Γ → Var Γ A
makeVar refl     = zero
makeVar (cons s) = suc (makeVar s)
Given the term is well-formed in relation to context Γ we can convert it to de Bruijn representation.
dbify : (t : PHOAS (λ _ → Ctx) A) → wf Γ t → DB Γ A
dbify (var x)         wf        = var (makeVar wf)
dbify (app f t)       (fʷ , tʷ) = app (dbify f fʷ) (dbify t tʷ)
dbify {Γ = Γ} (lam t) wf        = lam (dbify (t Γ) wf)
dbify (abs t)         wf        = abs (dbify t wf)
What is left is to show that we can construct wf for all phoasWf-well-formed terms.
Adam Chlipala defines a helper function:
makeG′ : Ctx → List (Σ Ty (λ _ → Ctx))
makeG′ []      = []
makeG′ (A ∷ Γ) = (A , Γ) ∷ makeG′ Γ
However for somewhat technical reasons, we rather define
Normalization by evaluation using parametric higher order syntax. In Agda.
I couldn't find a self-contained example of PHOAS NbE, so here it is. I hope someone might find it useful.
module NbEXP.PHOAS where

data Ty : Set where
  emp : Ty
  fun : Ty → Ty → Ty

data Tm (v : Ty → Set) : Ty → Set where
  var : ∀ {a} → v a → Tm v a
  app : ∀ {a b} → Tm v (fun a b) → Tm v a → Tm v b
  lam : ∀ {a b} → (v a → Tm v b) → Tm v (fun a b)

data Nf (v : Ty → Set) : Ty → Set
data Ne (v : Ty → Set) : Ty → Set

data Ne v where
  nvar : ∀ {a} → v a → Ne v a
  napp : ∀ {a b} → Ne v (fun a b) → Nf v a → Ne v b

data Nf v where
  neut : Ne v emp → Nf v emp
  nlam : ∀ {a b} → (v a → Nf v b) → Nf v (fun a b)

Sem : (Ty → Set) → Ty → Set
Sem v emp       = Ne v emp
Sem v (fun a b) = Sem v a → Sem v b

lower : ∀ {v : Ty → Set} (a : Ty) → Sem v a → Nf v a
raise : ∀ {v : Ty → Set} (a : Ty) → Ne v a → Sem v a

lower emp       s = neut s
lower (fun a b) s = nlam λ x → lower b (s (raise a (nvar x)))

raise emp       n   = n
raise (fun a b) n x = raise b (napp n (lower a x))

eval : {v : Ty → Set} {a : Ty} → Tm (Sem v) a → Sem v a
eval (var x)   = x
eval (app f t) = eval f (eval t)
eval (lam t) x = eval (t x)

nf : {a : Ty} → {v : Ty → Set} → Tm (Sem v) a → Nf v a
nf {a} t = lower a (eval t)

nf_parametric : {a : Ty} → ({v : Ty → Set} → Tm v a) → ({v : Ty → Set} → Nf v a)
nf_parametric t = nf t
This last month has been fascinating. I guess LLMs have finally
resonated with me on a deeper level. It wasn’t like I woke up and
suddenly everything was different, but their impact is growing on me
non-linearly, forcing me to rewire my brain.
I've been fortunate to be nominated for a few teaching awards over my career, and even to win a couple. The nomination I just received may be the best.
As a new student at the uni, Philip Wadler was the first introductory lecture I had, and his clear passion for the subject made me feel excited to begin my journey in computer science. In particular he emphasised the importance of asking questions, which made the idea of tutorials and lectures a lot less intimidating, and went on to give really valuable advice for starting university. I enjoyed this session so much, and so was looking forward to the guest lectures he was going to do for Inf1A at the end of semester 1. They certainly did not disappoint, the content he covered was engaging, interesting, and above all very entertaining to listen to, especially when he dressed up as a superhero to cement his point. Because I found these talks so rewarding, I also attended the STMU that he spoke at about AI and ChatGPT, and everyone I talked to after the event said they had a really good time whilst also having a completely new insightful perspective on the topic. In summary, Philip Wadler has delivered the best lectures I have attended since starting university, and I have gotten a lot out of them.
President Trump has started rolling out his tariffs, something I blogged about in November. People are talking about these tariffs a lot right now, with many people (correctly) commenting on how consumers will end up with higher prices as a result of these tariffs. While that part is true, I’ve seen a lot of people taking it to the next, incorrect step: that consumers will pay the entirety of the tax. I put up a poll on X to see what people thought, and while the right answer got a lot of votes, it wasn't the winner.
Checking on people's general view of taxes. When the government imposes a tax on trade (sales tax, VAT, tariff, or even payroll tax), which party absorbs the cost of the tax?
For purposes of this blog post, our ultimate question will be the following:
Suppose apples currently sell for $1 each in the entire United States.
There are domestic sellers and foreign sellers of apples, all receiving the same price.
There are no taxes or tariffs on the purchase of apples.
The question is: if the US federal government puts a $0.50 import tariff per apple, what will be the change in the following:
Number of apples bought in the US
Price paid by buyers for apples in the US
Post-tax price received by domestic apple producers
Post-tax price received by foreign apple producers
Before we can answer that question, we need to ask an easier, first question: before instituting the tariff, why do apples cost $1?
And finally, before we dive into the details, let me provide you with the answers to the ultimate question. I recommend you try to guess these answers before reading this, and if you get it wrong, try to understand why:
The number of apples bought will go down
The buyers will pay more for each apple they buy, but not the full amount of the tariff
Domestic apple sellers will receive a higher price per apple
Foreign apple sellers will receive a lower price per apple, but not lowered by the full amount of the tariff
In other words, regardless of who sends the payment to the government, both taxed parties (domestic buyers and foreign sellers) will absorb some of the costs of the tariff, while domestic sellers will benefit from the protectionism provided by tariffs and be able to sell at a higher price per unit.
Let’s say I absolutely love apples, they’re my favorite food. How much would I be willing to pay for a single apple? You might say “$1, that’s the price in the supermarket,” and in many ways you’d be right. If I walk into supermarket A, see apples on sale for $50, and know that I can buy them at supermarket B for $1, I’ll almost certainly leave A and go buy at B.
But that’s not what I mean. What I mean is: how high would the price of apples have to go everywhere so that I’d no longer be willing to buy a single apple? This is a purely personal, subjective opinion. It’s impacted by how much money I have available, other expenses I need to cover, and how much I like apples. But let’s say the number is $5.
How much would I be willing to pay for another apple? Maybe another $5. But how much am I willing to pay for the 1,000th apple? 10,000th? At some point, I’ll get sick of apples, or run out of space to keep the apples, or not be able to eat, cook, and otherwise preserve all those apples before they rot.
The point being: I’ll be progressively willing to spend less and less money for each apple. This form of analysis is called marginal benefit: how much benefit (expressed as dollars I’m willing to spend) will I receive from each apple? This is a downward sloping function: for each additional apple I buy (quantity demanded), the price I’m willing to pay goes down. This is what gives my personal demand curve. And if we aggregate demand curves across all market participants (meaning: everyone interested in buying apples), we end up with something like this:
Assuming no changes in people’s behavior and other conditions in the market, this chart tells us how many apples will be purchased by our buyers at each price point between $0.50 and $5. And ceteris paribus (all else being equal), this will continue to be the demand curve for apples.
Marginal cost
Demand is half the story of economics. The other half is supply, or: how many apples will I sell at each price point? Supply curves are upward sloping: the higher the price, the more a person or company is willing and able to sell a product.
Let’s understand why. Suppose I have an apple orchard. It’s a large property right next to my house. With about 2 minutes of effort, I can walk out of my house, find the nearest tree, pick 5 apples off the tree, and call it a day. 5 apples for 2 minutes of effort is pretty good, right?
Yes, there was all the effort necessary to buy the land, and plant the trees, and water them… and a bunch more that I likely can’t even guess at. We’re going to ignore all of that for our analysis, because for short-term supply-and-demand movement, we can ignore these kinds of sunk costs. One other simplification: in reality, supply curves often start descending before ascending. This accounts for achieving efficiencies of scale after the first number of units purchased. But since both these topics are unneeded for understanding taxes, I won’t go any further.
Anyway, back to my apple orchard. If someone offers me $0.50 per apple, I can do 2 minutes of effort and get $2.50 in revenue, which equates to a $75/hour wage for me. I’m more than happy to pick apples at that price!
However, let’s say someone comes to buy 10,000 apples from me instead. I no longer just walk out to my nearest tree. I’m going to need to get in my truck, drive around, spend the day in the sun, pay for gas, take a day off of my day job (let’s say it pays me $70/hour). The costs go up significantly. Let’s say it takes 5 days to harvest all those apples myself, it costs me $100 in fuel and other expenses, and I lose out on my $70/hour job for 5 days. We end up with:
Total expenditure: $100 + $70 * 8 hours a day * 5 days == $2900
Total revenue: $5000 (10,000 apples at $0.50 each)
Total profit: $2100
So I’m still willing to sell the apples at this price, but it’s not as attractive as before. And as the number of apples purchased goes up, my costs keep increasing. I’ll need to spend more money on fuel to travel more of my property. At some point I won’t be able to do the work myself anymore, so I’ll need to pay others to work on the farm, and they’ll be slower at picking apples than me (less familiar with the property, less direct motivation, etc.). The point being: at some point, the number of apples can go high enough that the $0.50 price point no longer makes me any money.
This kind of analysis is called marginal cost. It refers to the additional amount of expenditure a seller has to spend in order to produce each additional unit of the good. Marginal costs go up as quantity sold goes up. And like demand curves, if you aggregate this data across all sellers, you get a supply curve like this:
Equilibrium price
We now know, for every price point, how many apples buyers will purchase, and how many apples sellers will sell. Now we find the equilibrium: where the supply and demand curves meet. This point represents where the marginal benefit a buyer would receive from the next apple would be less than the cost it would take the next seller to produce it. Let’s see it in a chart:
You’ll notice that these two graphs cross at the $1 price point, where 63 apples are both demanded (bought by consumers) and supplied (sold by producers). This is our equilibrium price. We also have a visualization of the surplus created by these trades. Everything to the left of the equilibrium point and between the supply and demand curves represents surplus: an area where someone is receiving something of more value than they give. For example:
When I bought my first apple for $1, but I was willing to spend $5, I made $4 of consumer surplus. The consumer portion of the surplus is everything to the left of the equilibrium point, between the supply and demand curves, and above the equilibrium price point.
When a seller sells his first apple for $1, but it only cost $0.50 to produce it, the seller made $0.50 of producer surplus. The producer portion of the surplus is everything to the left of the equilibrium point, between the supply and demand curves, and below the equilibrium price point.
Another way of thinking of surplus is “every time someone got a better price than they would have been willing to take.”
OK, with this in place, we now have enough information to figure out how to price in the tariff, which we’ll treat as a negative externality.
Modeling taxes
Alright, the government has now instituted a $0.50 tariff on every apple sold within the US by a foreign producer. We can generally model taxes by either increasing the marginal cost of each unit sold (shifting the supply curve up), or by decreasing the marginal benefit of each unit bought (shifting the demand curve down). In this case, since only some of the producers will pay the tax, it makes more sense to modify the supply curve.
First, let’s see what happens to the foreign seller-only supply curve when you add in the tariff:
With the tariff in place, for each quantity level, the price at which the seller will sell is $0.50 higher than before the tariff. That makes sense: if I was previously willing to sell my 82nd apple for $3, I would now need to charge $3.50 for that apple to cover the cost of the tariff. We see this as the tariff “pushing up” or “pushing left” the original supply curve.
We can add this new supply curve to our existing (unchanged) supply curve for domestic-only sellers, and we end up with a result like this:
The total supply curve adds up the individual foreign and domestic supply curves. At each price point, we add up the total quantity each group would be willing to sell to determine the total quantity supplied for each price point. Once we have that cumulative supply curve defined, we can produce an updated supply-and-demand chart including the tariff:
As we can see, the equilibrium has shifted:
The equilibrium price paid by consumers has risen from $1 to $1.20.
The total number of apples purchased has dropped from 63 apples to 60 apples.
Consumers therefore received 3 fewer apples. They spent $72 for these 60 apples, whereas previously they spent $63 for 3 more apples, a definite decrease in consumer surplus.
Foreign producers sold 36 of those apples (see the raw data in the linked Google Sheet), for a gross revenue of $43.20. However, they also need to pay the tariff to the US government, which accounts for $18, meaning they only receive $25.20 post-tariff. Previously, they sold 42 apples at $1 each with no tariff to be paid, meaning they took home $42.
Domestic producers sold the remaining 24 apples at $1.20, giving them a revenue of $28.80. Since they don’t pay the tariff, they take home all of that money. By contrast, previously, they sold 21 apples at $1, for a take-home of $21.
The government receives $0.50 for each of the 60 apples sold, or in other words receives $30 in revenue it wouldn’t have received otherwise.
We could be more specific about the surpluses, and calculate the actual areas for consumer surplus, producer surplus, inefficiency from the tariff, and government revenue from the tariff. But I won’t bother, as those calculations get slightly more involved. Instead, let’s just look at the aggregate outcomes:
Consumers were unquestionably hurt. The price they paid went up by $0.20 per apple, and they received fewer apples.
Foreign producers were also hurt. Their price received went down from the original $1 to the new post-tariff price of $1.20, minus the $0.50 tariff. In other words: foreign producers only receive $0.70 per apple now. This hurt can be mitigated by shifting sales to other countries without a tariff, but the pain will exist regardless.
Domestic producers scored. They sell more apples at a higher price, and make more revenue doing it.
And the government walked away with an extra $30.
Hopefully you now see the answer to the original questions. Importantly, while the government imposed a $0.50 tariff, neither side fully absorbed that cost. Consumers paid a bit more, foreign producers received a bit less. The exact details of how that tariff was split across the groups is mediated by the relevant supply and demand curves of each group. If you want to learn more about this, the relevant search term is “price elasticity,” or how much a group’s quantity supplied or demanded will change based on changes in the price.
Other taxes
Most taxes are some kind of a tax on trade. Tariffs on apples is an obvious one. But the same applies to income tax (taxing the worker for the trade of labor for money) or payroll tax (same thing, just taxing the employer instead). Interestingly, you can use the same model for analyzing things like tax incentives. For example, if the government decided to subsidize domestic apple production by giving the domestic producers a $0.50 bonus for each apple they sell, we would end up with a similar kind of analysis, except instead of the foreign supply curve shifting up, we’d see the domestic supply curve shifting down.
And generally speaking, this is what you’ll always see with government involvement in the economy. It disrupts an existing equilibrium, lets the market readjust to a new equilibrium, and incentivizes some behavior, causing some people to benefit and others to lose out. As we saw with the apple tariff, domestic producers and the government benefited while others lost.
You can see the reverse though with tax incentives. If I give a tax incentive of providing a deduction (not paying income tax) for preschool, we would end up with:
Government needs to make up the difference in tax revenue, either by raising taxes on others or printing more money (leading to inflation). Either way, those paying the tax or those holding government debased currency will pay a price.
Those people who don’t use the preschool deduction will receive no benefit, so they simply pay a cost.
Those who do use the preschool deduction will end up paying less on tax+preschool than they would have otherwise.
This analysis is fully amoral. It’s not saying whether providing subsidized preschool is a good thing or not; it simply tells you where the costs will be felt, and points out that such government interference in free economic choice does result in inefficiencies in the system. Once you have that knowledge, you’re better educated to decide whether the costs of government intervention are worth the benefits.
For many years I wished I had a setup that would allow me to work (that is, code) productively outside in the bright sun. It’s winter right now, but when it’s summer again it’s always a bit of a pain. This weekend I got closer to that goal.
TL;DR: Using code-server on a beefy machine seems to be quite neat.
Passively lit coding
Personal history
Looking back at my own old blog entries I find one from 10 years ago describing how I bought a Kobo eBook reader with the intent of using it as an external monitor for my laptop. It seems that I got a proof-of-concept setup working, using VNC, but it was tedious to set up, and I never actually used that. I subsequently noticed that the eBook reader is rather useful to read eBooks, and it has been in heavy use for that ever since.
Four years ago I gave this old idea another shot and bought an Onyx BOOX Max Lumi. This is an A4-sized tablet running Android and had the very promising feature of an HDMI input. So hopefully I’d attach it to my laptop and it just works™. Turns out that this never worked as well as I hoped: Even if I set the resolution to exactly the tablet’s screen’s resolution I got blurry output, and it also drained the battery a lot, so I gave up on this. I subsequently noticed that the tablet is rather useful to take notes, and it has been in sporadic use for that.
Going off on this tangent: I later learned that the HDMI input of this device appears to the system like a camera input, and I don’t have to use Boox’s “monitor” app but could use other apps like FreeDCam as well. This somehow managed to fix the resolution issues, but the setup still wasn’t convenient enough to be used regularly.
I also played around with pure terminal approaches, e.g. SSH’ing into a system, but since my usual workflow was never purely text-based (I was at least used to using a window manager instead of a terminal multiplexer like screen or tmux) that never led anywhere either.
My colleagues have said good things about using VSCode with the remote SSH extension to work on a beefy machine, so I gave this a try now as well, and while it’s not a complete game changer for me, it does make certain tasks (rebuilding everything after a switching branches, running the test suite) very convenient. And it’s a bit spooky to run these work loads without the laptop’s fan spinning up.
In this setup, the workspace is remote, but VSCode still runs locally. But it made me wonder about my old goal of being able to work reasonably efficient on my eInk tablet. Can I replicate this setup there?
VSCode itself doesn’t run on Android directly. There are projects that run a Linux chroot or run in Termux on the Android system, and then you can connect to it via VNC (e.g. Andronix)… but that did not seem promising. It seemed fiddly, and I probably should take it easy on the tablet’s system.
code-server, running remotely
A more promising option is code-server. This is a fork of VSCode (actually of VSCodium) that runs completely on the remote machine, and the client machine just needs a browser. I set that up this weekend and found that I was able to do a little bit of work reasonably.
Access
With code-server one has to decide how to expose it safely enough. I decided against the tunnel-over-SSH option, as I expected that to be somewhat tedious to set up (both initially and for each session) on the android system, and I liked the idea of being able to use any device to work in my environment.
I also decided against the more involved “reverse proxy behind proper hostname with SSL” setups, because they involve a few extra steps, and some of them I cannot do as I do not have root access on the shared beefy machine I wanted to use.
That left me with the option of using code-server’s built-in support for self-signed certificates and a password:
With trust-on-first-use this seems reasonably secure.
Update: I noticed that the browsers would forget that I trust this self-signed cert after restarting the browser, and also that I cannot “install” the page (as a Progressive Web App) unless it has a valid certificate. But since I don’t have superuser access to that machine, I can’t just follow the official recommendation of using a reverse proxy on port 80 or 443 with automatic certificates. Instead, I pointed a hostname that I control to that machine, obtained a certificate manually on my laptop (using acme.sh) and copied the files over, so the configuration now reads as follows:
(I am using nix as a package manager on a Debian system there, hence the additional PATH and complex ExecStart. If you have a more conventional setup then you do not have to worry about Environment and can likely use ExecStart=code-server.)
For this to survive me logging out I had to ask the system administrator to run loginctl enable-linger joachim, so that systemd allows my jobs to linger.
Git credentials
The next issue to be solved was how to access the git repositories. The work is all on public repositories, but I still need a way to push my work. With the classic VSCode-SSH-remote setup from my laptop, this is no problem: My local SSH key is forwarded using the SSH agent, so I can seamlessly use that on the other side. But with code-server there is no SSH key involved.
I could create a new SSH key and store it on the server. That did not seem appealing, though, because SSH keys on Github always have full access. It wouldn’t be horrible, but I still wondered if I can do better.
I thought of creating fine-grained personal access tokens that only me to push code to specific repositories, and nothing else, and just store them permanently on the remote server. Still a neat and convenient option, but creating PATs for our org requires approval and I didn’t want to bother anyone on the weekend.
So I am experimenting with Github’s git-credential-manager now. I have configured it to use git’s credential cache with an elevated timeout, so that once I log in, I don’t have to again for one workday.
To log in, I have to visit https://github.com/login/device on an authenticated device (e.g. my phone) and enter an 8-character code. Not too shabby in terms of security. I only wish that webpage would not require me to press Tab after each character…
This still grants rather broad permissions to the code-server, but at least only temporarily.
Android setup
On the client side I could now open https://host.example.com:8080 in Firefox on my eInk Android tablet, click through the warning about self-signed certificates, log in with the fixed password mentioned above, and start working!
I switched to a theme that supposedly is eInk-optimized (eInk by Mufanza). It’s not perfect (e.g. git diffs are unhelpful because it is not possible to distinguish deleted from added lines), but it’s a start. There are more eInk themes on the official Visual Studio Marketplace, but because code-server is a fork it cannot use that marketplace, and for example this theme isn’t on Open-VSX.
For some reason the F11 key doesn’t work, but going fullscreen is crucial, because screen estate is scarce in this setup. I can go fullscreen using VSCode’s command palette (Ctrl-P) and invoking the command there, but Firefox often jumps out of the fullscreen mode, which is annoying. I still have to pay attention to when that’s happening; maybe it’s the Esc key, which I am of course using a lot due to me using vim bindings.
A more annoying problem was that on my Boox tablet, sometimes the on-screen keyboard would pop up, which is seriously annoying! It took me a while to track this down: The Boox has two virtual keyboards installed: the usual Google AOSP keyboard, and the Onyx Keyboard. The former is clever enough to stay hidden when there is a physical keyboard attached, but the latter isn’t. Moreover, pressing Shift-Ctrl on the physical keyboard rotates through the virtual keyboards. Now, VSCode has many keyboard shortcuts that require Shift-Ctrl (especially on an eInk device, where you really want to avoid using the mouse). And the limited settings exposed by the Boox Android system do not allow you to configure that or disable the Onyx keyboard! To solve this, I had to install the KISS Launcher, which would allow me to see more Android settings, and in particular allow me to disable the Onyx keyboard. So this is fixed.
I was hoping to improve the experience even more by opening the web page as a Progressive Web App (PWA), as described in the code-server FAQ. Unfortunately, that did not work. Firefox on Android did not recognize the site as a PWA (even though it recognizes a PWA test page). And I couldn’t use Chrome either because (unlike Firefox) it would not consider a site with a self-signed certificate as a secure context, and then code-server does not work fully. Maybe this is just some bug that gets fixed in later versions.
Now that I use a proper certificate, I can use it as a Progressive Web App, and with Firefox on Android this starts the app in full-screen mode (no system bars, no location bar). The F11 key still doesn’t work, and using the command palette to enter fullscreen does nothing visible, but then Esc leaves that fullscreen mode and I suddenly have the system bars again. But maybe if I just don’t do that I get the full screen experience. We’ll see.
I did not work enough with this yet to assess how much the smaller screen estate, the lack of colors and the slower refresh rate will bother me. I probably need to hide Lean’s InfoView more often, and maybe use the Error Lens extension, to avoid having to split my screen vertically.
I also cannot easily work on a park bench this way, with a tablet and a separate external keyboard. I’d need at least a table, or some additional piece of hardware that turns tablet + keyboard into some laptop-like structure that I can put on my, well, lap. There are cases for Onyx products that include a keyboard, and maybe they work on the lap, but they don’t have the Trackpoint that I have on my ThinkPad TrackPoint Keyboard II, and how can you live without that?
Conclusion
After this initial setup chances are good that entering and using this environment is convenient enough for me to actually use it; we will see when it gets warmer.
A few bits could be better. In particular logging in and authenticating GitHub access could be both more convenient and more safe – I could imagine that when I open the page I confirm that on my phone (maybe with a fingerprint), and that temporarily grants access to the code-server and to specific GitHub repositories only. Is that easily possible?
Below we present some animations that illustrate operations on finite patches of Penrose’s Kite and Dart tiles.
These were created using PenroseKiteDart which is a Haskell package available on Hackage making use of the Haskell Diagrams package. For details, see the PenroseKiteDart user guide.
Penrose’s Kite and Dart tiles can produce infinite aperiodic tilings of the plane. There are legal tiling rules to ensure aperiodicity, but these rules do not guarantee that a finite tiling will not get stuck. A legal finite tiling which can be continued to cover the whole plane is called a correct tiling. The rest, which are doomed to get stuck, are called incorrect tilings. (More details can be found in the links at the end of this blog.)
Decomposition Animations
The function decompose is a total operation which is guaranteed to preserve the correctness of a finite tiling represented as a tile graph (or Tgraph). Let us start with a particular Tgraph called sunGraph which is defined in PenroseKiteDart and consists of 5 kites arranged with a common origin vertex. It is drawn using default style in figure 1 on the left. On the right of figure 1 it is drawn with both vertex labels and dotted lines for half-tile join edges.
Figure 1: sunGraph
We can decompose sunGraph three times by selecting index 3 of the infinite list of its decompositions.
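In code this is presumably just an index into the package's decompositions list (a small sketch):

sunD3 :: Tgraph
sunD3 = decompositions sunGraph !! 3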
The result (sunD3) is drawn in figure 2 (scaled up).
Figure 2: sunD3
The animation in figure 3 illustrates two further decompositions of sunD3 in two stages.
Figure 3: Two decompositions of sunD3
Figure 4 also illustrates two decompositions, this time starting from forcedKingD.
forcedKingD :: Tgraph
forcedKingD = force (decompose kingGraph)
Figure 4: Two decompositions of forcedKingD
A Composition Animation
An inverse to decomposing (namely composing) has some extra intricacies. In the literature (see for example 1 and 2) versions of the following method are frequently described.
Firstly, split darts in half.
Secondly, glue all the short edges of the half-darts where they meet a kite (simultaneously). This will form larger scale complete darts and larger scale half kites.
Finally join the halves of the larger scale kites.
This works for infinite tilings, but we showed in Graphs, Kites and Darts and Theorems that this method is unsound for finite tilings. There is the trivial problem that a half-dart may not have a complete kite on its short edge. Worse still, the second step can convert a correct finite tiling into an incorrect larger scale tiling. An example of this is given in Graphs, Kites and Darts and Theorems where we also described our own safe method of composing (never producing an incorrect Tgraph when given a correct Tgraph). This composition can leave some boundary half-tiles out of the composition (called remainder half-tiles).
The animation in figure 5 shows such a composition where the remainder half-tiles are indicated with lime green edges.
Figure 5: Composition Animation
In general, compose is a partial operation as the resulting half-tiles can break some requirements for Tgraphs (namely, connectedness and no crossing boundaries). However we have shown that it is a total function on forced Tgraphs. (Forcing is discussed next.)
Forcing Animations
The process of forcing a Tgraph adds half-tiles on the boundary where only one legal choice is possible. This continues until either there are no more forced additions possible, or a clash is found showing that the tiling is incorrect. In the latter case it must follow that the initial tiling before forcing was already an incorrect tiling.
The process of forcing is animated in figure 6, starting with a 5 times decomposed kite and in figure 7 with a 5 times decomposed dart.
Figure 6: Force animationFigure 7: Another force animation
It is natural to wonder what forcing will do with cut-down (but still correct) Tgraphs. For example, taking just the boundary faces from the final Tgraph shown in the previous animation forms a valid Tgraph (boundaryExample) shown in figure 8.
Applying force to boundaryExample just fills in the hole to recreate force (decompositions dartGraph !!5) modulo vertex numbering. To make it more interesting we tried removing further half-tiles from boundaryExample to make a small gap. Forcing this also completes the filling in of the boundary half-tiles to recreate force (decompositions dartGraph !!5). However, we can see that this filling in is constrained to preserve the required Tgraph property of no crossing boundaries, which prevents the tiling closing round a hole.
This is illustrated in the animation shown in figure 9.
Figure 9: Boundary gap animation
As another experiment, we take the boundary faces of a (five times decomposed but not forced) star. When forced this fills in the star and also expands outwards, as illustrated in figure 10.
Figure 10: Star boundary
In the final example, we pick out a shape within a correct Tgraph (ensuring the chosen half-tiles form a valid Tgraph) then animate the force process and then run the animation in both directions (by adding a copy of the frames in reverse order).
The result is shown in figure 11.
Figure 11: Heart animation
Creating Animations
Animations as gif files can be produced by the Haskell Diagrams package using the rasterific back end.
The main module should import both Diagrams.Prelude and Diagrams.Backend.Rasterific.CmdLine. This will expose the type B standing for the imported backend, and diagrams then have type Diagram B.
An animation should have type [(Diagram B, Int)] and consist of a list of frames for the animation, each paired with an integer delay (in one-hundredths of a second).
The animation can then be passed to mainWith.
module Main (main) where

import Diagrams.Prelude
import Diagrams.Backend.Rasterific.CmdLine

...

fig :: [(Diagram B, Int)]
fig = myExampleAnimation

main :: IO ()
main = mainWith fig
If main is then compiled and run (e.g. with parameters -w 700 -o test.gif) it will produce an output file (test.gif with width 700).
Crossfade tool
The decompose and compose animations were defined using crossfade.
crossfade :: Int -> Diagram B -> Diagram B -> [Diagram B]
crossfade n d1 d2 = map blending ratios
  where
    blending r = opacity (1-r) d1 <> opacity r d2
    ratios = map ((/ fromIntegral n) . fromIntegral) [0..n]
Thus crossfade n d1 d2 produces n+1 frames, each with d1 overlaid on d2 but with varying opacities (decreasing for d1 and increasing for d2).
Adding the same pause (say 10 hundredths of a second) to every frame can be done by applying map (,10), and this will produce an animation, as in the sketch below.
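For example, a small usage sketch (d1 and d2 stand for any two diagrams already in scope; the tuple section (,10) needs the TupleSections extension):

fadeAnimation :: Diagram B -> Diagram B -> [(Diagram B, Int)]
fadeAnimation d1 d2 = map (,10) (crossfade 20 d1 d2)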
Force animation tool
To create force animations it was useful to create a tool (forceFrames) to produce frames with stages of forcing. It takes as arguments (one possible type signature is sketched after this list):
an angle argument (to rotate the diagrams in the animation from the default alignment of the Tgraph),
an Int (for the required number of frames),
a Tgraph (to be forced),
a triple of colours for filling darts, kites and grout (edge colour), respectively.
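A plausible type transcribing that argument list (the exact signature is not shown in this post, so treat the types, and in particular the result type, as assumptions):

forceFrames :: Angle Double -> Int -> Tgraph -> (Colour Double, Colour Double, Colour Double) -> [Diagram B]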
The definition of forceFrames uses stepForce to advance forcing a given number of steps to get the intermediate Tgraphs. The total number of forcing steps will be the number of faces (half-tiles) in the final force g less the number of faces in the initial g. All the Tgraphs are drawn (using colourDKG) but the resulting diagrams must all be aligned properly. The alignment can be achieved by creating a VPatch (vertex patch) from the final Tgraph which is then rotated. All the Tgraphs can then be drawn using sub vertex patches of the final rotated one. (For details see Overlaid examples in the PenroseKiteDart user guide.)
Empires and SuperForce – these new operations were based on observing properties of boundaries of forced Tgraphs.
Graphs, Kites and Darts introduced Tgraphs. This gave more details of implementation and results of early explorations. (The class Forcible was introduced subsequently).
Diagrams for Penrose Tiles – the first blog introduced drawing Pieces and Patches (without using Tgraphs) and provided a version of decomposing for Patches (decompPatch).
consider a generic implementation of alpha-beta game tree search with transposition table, generic enough to be applicable to any user-specified game. what should be its API? what features should it provide? (a rough sketch of one possible API appears after the list below.)
evaluate to infinite depth (possible because of transposition table), returning game value and line (principal variation). intended for small games.
return the transposition table so that it can be reused for subsequent moves.
evaluate to given depth. or, user-specified predicate of whether to stop searching, e.g., quiescence search. quiescence search wants access to the transposition table.
ambitious: because of the many ways game tree search can be customized (for many examples, albeit often poorly described, see the chessprogramming wiki), structure the algorithm as a collection of components, each of which can be modified and hooked together in various ways. I have no idea what language or framework could enable this kind of software engineering, though functional programming languages seem attractive as the first thing to try. but beware that a pure functional programming language such as Haskell easily leaks space for this kind of task, and threading state, the transposition table, through the computation may be awkward.
common customizations sacrifice accuracy (correctness or completeness) for speed. for example, if two different evaluated positions have the same key (for example, a 64-bit Zobrist hash in chess), one can optimize by doing no transposition table collision resolution; the second position gets ignored, assumed to have already been evaluated. the default algorithm should not do such optimizations but should allow the user to specify both safe and unsafe optimizations.
allow the search to be augmented with various statistics gathered along the way that get consumed by other user-specified parts of the algorithm. for example, the move generator could order moves based on values of similar moves already evaluated in other parts of the tree.
provide visibility into how user customizations are working, ways to evaluate whether or not they are worth it.
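to make the question concrete, here is a rough Haskell sketch of one possible shape for such an API. every name below is hypothetical; this is not an existing library, just one way the pieces discussed above might fit together.

import qualified Data.Map.Strict as M

-- how a stored value should be interpreted (exact, or an alpha/beta bound)
data Bound = Exact | AtLeast | AtMost

data Entry mv = Entry
  { entryDepth :: Int       -- depth to which the value was computed
  , entryBound :: Bound
  , entryValue :: Int
  , entryBest  :: Maybe mv  -- best move found, usable for move ordering
  }

-- the transposition table, returned so it can be reused for subsequent moves
type Table pos mv = M.Map pos (Entry mv)

data Game pos mv = Game
  { legalMoves :: pos -> [mv]          -- empty list means a terminal position
  , applyMove  :: pos -> mv -> pos
  , staticEval :: pos -> Int           -- evaluation from the side to move
  , cutoff     :: Int -> pos -> Bool   -- user-supplied stop predicate (depth limit, quiescence, ...)
  }

-- returns the game value, the principal variation, and the updated table
alphaBeta :: Ord pos => Game pos mv -> Table pos mv -> pos -> (Int, [mv], Table pos mv)
alphaBeta = error "interface sketch only"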
There’s a common anti-pattern I see in beginner-to-intermediate Haskell programmers that I wanted to discuss today. It’s the tendency to conceptualize the creation of an object by repeated mutation. Often this takes the form of repeated insertion into an empty container, but comes up under many other guises as well.
This anti-pattern isn’t particularly surprising in its prevalence; after all, if you’ve got the usual imperative brainworms, this is just how things get built. The gang of four “builder pattern” is exactly this; you can build an empty object, and setters on such a thing change the state but return the object itself. Thus, you build things by chaining together setter methods:
Even if you don’t ascribe to the whole OOP design principle thing, you’re still astronomically likely to think about building data structures like this:
Doodad doodad = new Doodad;
foreach (Widget widget in widgets) {
  doodad.addWidget(widget);
}
To be more concrete, maybe instead of doodads and widgets you have BSTs and Nodes. Or dictionaries and key-value pairs. Or graphs and edges. Anywhere you look, you’ll probably find examples of this sort of code.
Maybe you’re thinking to yourself “I’m a hairy-chested functional programmer and I scoff at patterns like these.” That might be true, but perhaps you too are guilty of writing code that looks like:
foldr (\(k, v) m -> Map.insert k v m) Map.empty $ toKVPairs something
Just because it’s dressed up with functional combinators doesn’t mean you’re not still writing C code. To my eye, the great promise of functional programming is its potential for conceptual clarity, and repeated mutation will always fall short of the mark.
The complaint, as usual, is that repeated mutation tells you how to build something, rather than focusing on what it is you’re building. An algorithm cannot be correct in the absence of intention—after all, you must know what you’re trying to accomplish in order to know if you succeeded. What these builder patterns, for loops, and foldrs all have in common is that they are algorithms: strategies for building something.
But you’ll notice none of them come with comments. And therefore we can only ever guess at what the original author intended, based on the context of the code we’re looking at.
I’m sure this all sounds like splitting hairs, but that’s because the examples so far have been extremely simple. But what about this one?
cgo :: (a -> (UInt, UInt)) -> [a] -> [NonEmpty a]
cgo f = foldr step []
  where
    step a [] = [pure a]
    step a bss0@((b :| bs) : bss)
      | let (al, ac) = f a
      , let (bl, bc) = f b
      , al + 1 == bl && ac == bc
      = (a :| b : bs) : bss
      | otherwise
      = pure a : bss0
which I found by grepping through haskell-language-server for foldr, and then mangled to remove the suggestive variable names. What does this one do? Based solely on the type we can presume it’s using that function to partition the list somehow. But how? And is it correct? We’ll never know—and the function doesn’t even come with any tests!
It’s Always Monoids
The shift in perspective necessary here is to reconceptualize building-by-repeated-mutation as building-by-combining. Rather than chiseling out the object you want, instead find a way of gluing it together from simple, obviously-correct pieces.
The notion of “combining together” should evoke in you a cozy warm fuzzy feeling. Much like being in a secret pillow fort. You must come to be one with the monoid. Once you have come to embrace monoids, you will have found inner programming happiness. Monoids are a sacred, safe place, at the fantastic intersection of “overwhelmingly powerful” and yet “hard to get wrong.”
As an amazingly fast recap, a monoid is a collection of three things: some type m, some value of that type mempty, and a binary operation over that type (<>) :: m -> m -> m, subject to a bunch of laws:
∀a. mempty <> a = a = a <> mempty
∀a b c. (a <> b) <> c = a <> (b <> c)
which is to say, mempty does nothing and (<>) doesn’t care where you stick the parentheses.
If you’re going to memorize any two particular examples of monoids, it had better be these two:
instance Monoid [a] where
  mempty = []
  a <> b = a ++ b

instance (Monoid a, Monoid b) => Monoid (a, b) where
  mempty = (mempty, mempty)
  (a1, b1) <> (a2, b2) = (a1 <> a2, b1 <> b2)
The first says that lists form a monoid under the empty list and concatenation. The second says that products preserve monoids.
The list monoid instance is responsible for the semantics of the ordered, “sequency” data structures. That is, if I have some sequential flavor of data structure, its monoid instance should probably satisfy the equation toList a <> toList b = toList (a <> b). Sequency data structures are things like lists, vectors, queues, deques, that sort of thing. Data structures where, when you combine them, you assume there is no overlap.
The second monoid instance here, over products, is responsible for pretty much all the other data structures. The first thing we can do with it is remember that functions are just really, really big product types, with one “slot” for every value in the domain. We can show an isomorphism between pairs and functions out of booleans, for example:
from :: (Bool -> a) -> (a, a)
from f = (f False, f True)

to :: (a, a) -> (Bool -> a)
to (a, _) False = a
to (_, a) True  = a
and under this isomorphism, we should thereby expect the Monoid a => Monoid (Bool -> a) instance to agree with Monoid a => Monoid (a, a). If you generalize this out, you get the following instance:
instance Monoid a => Monoid (x -> a) where
  mempty = \_ -> mempty
  f <> g = \x -> f x <> g x
which combines values in the codomain monoidally. We can show the equivalence between this monoid instance and our original product preservation:
from f <> from g
  = (f False, f True) <> (g False, g True)
  = (f False <> g False, f True <> g True)
  = ((f <> g) False, (f <> g) True)
  = from (f <> g)
and
to (a11, a12) <> to (a21, a22)
  = \x -> to (a11, a12) x <> to (a21, a22) x
  = \x -> case x of
      False -> to (a11, a12) False <> to (a21, a22) False
      True  -> to (a11, a12) True  <> to (a21, a22) True
  = \x -> case x of
      False -> a11 <> a21
      True  -> a12 <> a22
  = \x -> to (a11 <> a21, a12 <> a22) x
  = to (a11 <> a21, a12 <> a22)
which is a little proof that our function monoid agrees with the preservation-of-products monoid. The same argument works for any type x in the domain of the function, but showing it generically is challenging.
Anyway, I digress.
The reason to memorize this Monoid instance is that it’s the monoid instance that every data structure is trying to be. Recall that almost all data structures are merely different encodings of functions, designed to make some operations more efficient than they would otherwise be.
Don’t believe me? A Map k v is an encoding of the function k -> Maybe v optimized to efficiently query which k values map to Just something. That is to say, it’s a sparse representation of a function.
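To spell out that encoding claim concretely (a trivial sketch; toFunction is a made-up name):

import qualified Data.Map as Map

-- a Map is a sparse, efficiently queryable representation of this function
toFunction :: Ord k => Map.Map k v -> (k -> Maybe v)
toFunction m = \k -> Map.lookup k m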
From Theory to Practice
What does all of this look like in practice? Stuff like worrying about foldr is surely programming-in-the-small, which is worth knowing, but isn’t the sort of thing that turns the tides of a successful application.
The reason I’ve been harping on about the function and product monoids is that they are compositional. The uninformed programmer will be surprised by just how far one can get by composing these things.
At work, we need to reduce a tree (+ nonlocal references) into an honest-to-goodness graph. While we’re doing it, we need to collect certain nodes. And the tree has a few constructors which semantically change the scope of their subtrees, so we need to preserve that information as well.
It’s actually quite the exercise to sketch out an algorithm that will accomplish all of these goals when you’re thinking about explicit mutation. Our initial attempts at implementing this were clumsy. We’d fold the tree into a graph, adding fake nodes for the Scope constructors. Then we’d filter all the nodes in the graph, trying to find the ones we needed to collect. Then we’d do a graph traversal from the root, trying to find these Scope nodes, and propagating their information downstream.
Rather amazingly, this implementation kinda sorta worked! But it was slow, and took \(O(10k)\) SLOC to implement.
The insight here is that everything we needed to collect was monoidal:
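The Solution record itself is not shown in this excerpt, but judging from the helper functions below it presumably looks something like this sketch (field types are guesses; Node, Metadata and Graph are the post's domain types, Graph is assumed to have a Monoid instance, and Generically lives in GHC.Generics for base 4.17+):

{-# LANGUAGE DeriveGeneric, DerivingStrategies, DerivingVia #-}
import Data.Map (Map)
import Data.Set (Set)
import GHC.Generics (Generic, Generically (..))

data Solution = Solution
  { collectedNodes :: Set Node
  , metadata       :: Map Node Metadata
  , graph          :: Graph
  }
  deriving stock (Generic)
  deriving (Semigroup, Monoid) via Generically Solution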
where the deriving (Semigroup, Monoid) via Generically Solution stanza gives us the semigroup and monoid instances that we’d expect from Solution being the product of a bunch of other monoids.
And now for the coup de grace: we hook everything up with the Writer monad. Writer is a chronically slept-on type, because most people seem to think it’s useful only for logging, and underwhelming at logging compared to a real logger type. But the charm is in the details:
instance Monoid w => Monad (Writer w)
Writer w is a monad whenever w is a monoid, which makes it the perfect monad for solving data-structure-creation problems like the one we’ve got in mind. Such a thing gives rise to a few helper functions:
collectNode :: MonadWriter Solution m => Node -> m ()
collectNode n = tell $ mempty { collectedNodes = Set.singleton n }

addMetadata :: MonadWriter Solution m => Node -> Metadata -> m ()
addMetadata n m = tell $ mempty { metadata = Map.singleton n m }

emitGraphFragment :: MonadWriter Solution m => Graph -> m ()
emitGraphFragment g = tell $ mempty { graph = g }
each of which is responsible for adding a little piece to the final solution. Our algorithm is thus a function of the type:
algorithm
    :: Metadata  -- ^ the current scope
    -> Tree      -- ^ the tree we're reducing
    -> Writer Solution Node
       -- ^ our partial solution, and the node corresponding to the root of the tree
which traverses the Tree, recursing with a different Metadata whenever it comes across a Scope constructor, and calling our helper functions as it goes. At each step of the way, the only thing it needs to return is the root Node of the section of the graph it just built, which recursive calls can use to break up the problem into inductive pieces.
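As a hedged sketch of how this gets kicked off (buildGraph and rootScope are invented names, not from the original code):

-- Run the traversal; runWriter hands back the root node together with the
-- Solution accumulated by all the tell calls along the way.
buildGraph :: Metadata -> Tree -> (Node, Solution)
buildGraph rootScope tree = runWriter (algorithm rootScope tree)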
This new implementation is roughly 20x smaller, coming in at \(O(500)\) SLOC, and was free of all the bugs we’d been diligently trying to squash under the previous implementation.
Suppose we have a sequence of integers \(a_1, \dots, a_n\) and want to be
able to perform two operations:
we can update any \(a_i\) by adding some value \(v\) to it; or
we can perform a range query, which asks for the sum of the values
\(a_i + \dots + a_j\) for any range \([i,j]\).
There are several ways to solve this problem. For example:
We could just keep the sequence of integers in a mutable array.
Updating is \(O(1)\), but range queries are \(O(n)\) since we must
actually loop through the range and add up all the values.
We could keep a separate array of prefix sums on the side, so
that \(P_i\) stores the sum \(a_1 + \dots + a_i\). Then the range
query on \([i,j]\) can be computed as \(P_j - P_{i-1}\), which only
takes \(O(1)\); however, updates now take \(O(n)\) since we must also
update all the prefix sums which include the updated element.
We can get the best of both worlds using a segment tree, a binary
tree storing the elements at the leaves, with each internal node
caching the sum of its children. Then both update and range query
can be done in \(O(\lg n)\).
I won’t go through the details of this third solution here, but it is
relatively straightforward to understand and implement, especially in
a functional language.
However, there is a fourth solution, known as a Fenwick tree or
Fenwick array, independently invented by Ryabko (1989) and
Fenwick (1994). Here’s a typical Java implementation of a Fenwick
tree:
class FenwickTree {
    private long[] a;

    public FenwickTree(int n) { a = new long[n+1]; }

    public long prefix(int i) {
        long s = 0;
        for (; i > 0; i -= LSB(i)) s += a[i];
        return s;
    }

    public void update(int i, long delta) {
        for (; i < a.length; i += LSB(i)) a[i] += delta;
    }

    public long range(int i, int j) { return prefix(j) - prefix(i-1); }

    public long get(int i) { return range(i, i); }

    public void set(int i, long v) { update(i, v - get(i)); }

    private int LSB(int i) { return i & (-i); }
}
I know what you’re thinking: what the heck!? There are some loops adding and
subtracting LSB(i), which is defined as the bitwise AND of i and
-i? What on earth is this doing? Unless you have seen this
before, this code is probably a complete mystery, as it was for me the
first time I encountered it.
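To pin down what those loops are doing before deriving them properly, here is a rough Haskell transliteration of just the prefix query (my own sketch, not from the paper), assuming a 1-indexed Fenwick array stored in a vector with slot 0 unused:

import Data.Bits ((.&.))
import qualified Data.Vector.Unboxed as V

-- The least significant bit of i, i.e. the Java LSB(i) = i & (-i).
lsb :: Int -> Int
lsb i = i .&. negate i

-- Sum of a[1..i], walking down the array exactly like the Java for-loop.
prefix :: V.Vector Int -> Int -> Int
prefix a = go 0
  where
    go s i
      | i <= 0    = s
      | otherwise = go (s + a V.! i) (i - lsb i)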
However, from the right point of view, we can derive this mysterious imperative
code as an optimization of segment trees. In particular, in my
paper I show how we can:
Start with a segment tree.
Delete some redundant info from the segment tree, and shove the
remaining values into an array in a systematic way.
Define operations for moving around in the resulting Fenwick array by
converting array indices to indices in a segment tree, moving
around the tree appropriately, and converting back.
Describe these operations using a Haskell EDSL for
infinite-precision 2’s complement binary arithmetic, and fuse away
all the intermediate conversion steps, until the above mysterious
implementation pops out.
Sam Lindley is a Reader in Programming Languages Design and Implementation at the University of Edinburgh. In this episode, he tells us how difficult naming is, the different kinds of effect systems and handlers, languages *much* purer than Haskell, and Modal logic.
Admittedly a bit late, buuuuuut Merry belated Christmas and Happy New Years
to all!
This past December I again participated in Eric Wastl’s Advent of Code, a series of 25 daily
Christmas-themed puzzles. Each puzzle comes with a cute story about saving
Christmas, and the puzzles increase in difficulty as the stakes get higher and
higher. Every night at midnight EST, my friends and I (including the good people
of libera chat’s ##advent-of-code channel) discuss the latest
puzzle and creative ways to solve and optimize it. But, the main goal isn’t to
solve it quickly, it’s always to see creative ways to approach the puzzle and
share different insights. The puzzles are bite-sized enough that there are often
multiple ways to approach it, and in the past I’ve leveraged group theory, Galilean
transformations and linear algebra, and more group theory.
This year was also the special 10 year anniversary event, with callbacks to fun
story elements of all the previous years!
Most of the puzzles are also pretty nice to solve in Haskell! Lots of DFS’s
that melt away as simple recursion or recursion schemes, and even the BFS’s that
expose you to different data structures and encodings.
This year I’ve moved almost all of my Haskell code to an Advent of Code Megarepo.
I also like to post write-ups on Haskelly ways to approach the problems, and
they are auto-compiled on the megarepo wiki.
I try my best every year, but sometimes I am able to complete write-ups for
all 25 puzzles before the new year catches up. The last time was 2020, and I’m
proud to announce that 2024 is now also 100% complete!
You can find all of
them here, but here are links to each individual one. Hopefully you can find
them helpful. And if you haven’t yet, why not try Advent of Code yourself? :) And drop by the
libera ##advent-of-code channel, we’d love to say hi and chat, or
help out! Thanks all for reading, and also thanks to Eric for a great event this
year, as always!
Lucas Escot wrote a good blog post titled “Making My Life Easier with GADTs”, which contains a demonstration of GADTs that made his life easier.
He posted the article to reddit.
I’m going to trust that - for his requirements and anticipated program evolution - the solution is a good one for him, and that it actually made his life easier.
However, there’s one point in his post that I take issue with:
Dependent types and assimilated type-level features get a bad rep. They are often misrepresented as a futile toy for “galaxy-brain people”, providing no benefit to the regular programmer. I think this opinion stems from a severe misconception about the presumed complexity of dependent type systems.
I am often arguing against complexity in Haskell codebases.
While Lucas’s prediction about “misconceptions” may be true for others, it is not true for me.
I have worked extensively with Haskell’s most advanced features in large scale codebases.
I’ve studied “Types and Programming Languages,” the Idris book, “Type Theory and Formal Proof”, and many other resources on advanced type systems.
I don’t say this to indicate that I’m some kind of genius or authority, just that I’m not a rube who’s looking up at the Blub Paradox.
My argument for simplicity comes from the hard experience of having to rip these advanced features out, and the pleasant discovery that simpler alternatives are usually nicer in every respect.
They are often misrepresented as a futile toy for “galaxy-brain people”, providing no benefit to the regular programmer. I think this opinion stems from a severe misconception about the presumed complexity of dependent type systems.
This opinion - in my case at least - stems from having seen people code themselves into a corner with fancy type features where a simpler feature would have worked just as well.
In this case, the “simplest solution” is to have two entirely separate datatypes, as the blog post initially starts with. These datatypes, after all, represent different things - a typed environment and an untyped environment. Why mix the concerns? What pain or requirement is solved by having one more complicated datatype when two datatypes work pretty damn well?
I could indeed keep typed environments completely separate. Different datatypes, different information. But this would lead to a lot of code duplication. Given that the compilation logic will be mostly identical for these two targets, I don’t want to be responsible for the burden of keeping both implementations in sync.
Code duplication can be a real concern. In this case, we have code that is not precisely duplicated, but simply similar - we want compilation logic to work for both the untyped and typed representations, and to take typing information into account only where it is present. When we want code to work over multiple possible types, we have two options: parametric polymorphism and ad-hoc polymorphism.
With parametric polymorphism, the solution looks like this:
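(The snippet itself is missing from this copy; a rough sketch of the shape it describes, with all names invented for illustration:)

{-# LANGUAGE DeriveTraversable #-}
{-# LANGUAGE DerivingStrategies #-}

-- Placeholder AST: only the shape matters. A type parameter stands in for
-- the type annotations, instantiated at () for untyped and at LamBox.Type
-- for typed programs.
data Expr typ
  = Var String
  | Lam String typ (Expr typ)
  | App (Expr typ) (Expr typ)
  deriving stock (Show, Eq, Ord, Functor, Foldable, Traversable)

data GlobalDecl typ = GlobalDecl
  { declName :: String
  , declType :: typ
  , declBody :: Expr typ
  }
  deriving stock (Show, Eq, Ord, Functor, Foldable, Traversable)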
This is actually very similar to the GADT approach, because we’re threading a type variable through the system. For untyped, we can write GlobalDecl (), and for typed, we can write GlobalDecl LamBox.Type.
Functions which can work on either untyped or typed would have GlobalDecl a -> _ as their input, and functions which require a representation can specify it directly. This would look very similar to the GADT approach: in practice, replace GlobalDecl Typed with GlobalDecl Type and GlobalDecl Untyped with GlobalDecl () and you’re good.
(or, heck, data Untyped = Untyped and the change is even smaller).
This representation is much easier to work with. You can deriving stock (Show, Eq, Ord). You can $(deriveJSON ''GlobalEnv). You can delete several language extensions. It’s also more flexible: you can use Maybe Type to represent partially typed programs (or programs with type inference). You can use Either TypeError Type to represent full ASTs with type errors. You can deriving stock (Functor, Foldable, Traversable) to get access to fmap (change the type with a function) and toList (collect all the types in the AST) and traverse (change each type effectfully, combining results).
When we choose GADTs here, we pay significant implementation complexity costs, and we give up flexibility. What is the benefit? Well, the entire benefit is that we’ve given up flexibility. With the parametric polymorphism approach, we can put anything in for that type variable a. The GADT prevents us from writing TypeDecl () and it forbids you from having anything other than Some (type :: Type) or None in the fields.
This restriction is what I mean by ‘coding into a corner’. Let’s say you get a new requirement to support partially typed programs. If you want to stick with the GADT approach, then you need to change data Typing = Typed | Untyped | PartiallyTyped and modify all the WhenTyped machinery - Optional :: Maybe a -> WhenTyped PartiallyTyped a. Likewise, if you want to implement inference or type-checking, you need another constructor on Typing and another on WhenTyped - ... | TypeChecking and Checking :: Either TypeError a -> WhenTyped TypeChecking a.
But wait - now our TypeAliasDecl has become overly strict!
But, uh oh, we also want to write functions that can operate in many of these states. We can extend IsTypedish with a witness function witnessTypedish :: WhenTyped t Type -> Type, but that also doesn’t quite work - the t actually determines the output type.
but actually working with this becomes a bit obnoxious. You see, without knowing t, you can’t know the result of witnessTypedish, so you end up needing to say things like (IsTypedish t, TypedIshPayload t ~ f Type, Foldable f) => ... to cover the Maybe and Either cases - and this only lets you fold the result. But now you’re working with the infelicities of type classes (inherently open) and sum types (inherently closed) and the way that GHC tries to unify these two things with type class dispatch.
Whew.
Meanwhile, in parametric polymorphism land, we get almost all of the above for free. If we want to write code that covers multiple possible cases, then we can use much simpler type class programming. Consider how easy it is to write this function and type:
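(The function itself didn’t survive in this copy; presumably it’s something in the spirit of the following, reusing the derived Foldable from the sketch above. The name and exact shape here are guesses.)

import Data.Foldable (toList)

-- Collect every type annotation in a declaration, whatever Foldable context
-- the annotations live in (Maybe ty, Either err ty, Identity ty, ...).
declaredTypes :: Foldable f => GlobalDecl (f ty) -> [ty]
declaredTypes = foldMap toList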
My Emacs config's todo-list has long had an item about finding some way to
review GitHub PRs without having to leave Emacs and when the forge issue that I
subscribe to came alive again I thought it was time to see if I can improve my
config.
I've tried the first one before but at the time it didn't seem to work at all.
Apparently that's improved somewhat, though there's a PR with a change that's
necessary to make it work.1 The first two don't support comments on multiple
lines of a PR; there are issues/discussions for both.
The last one, emacs-pr-review, does support commenting on multiple lines, but
it lacks a nice way of opening a review from magit. What I can do is
position the cursor on a PR in the magit status view, then
copy the PR's URL using forge-copy-url-at-point-as-kill, and
open the PR by calling pr-review and pasting the PR's URL.
Which I did for a few days until I got tired of it and wrote a function to cut
out the copy/paste part.
(defun mes/pr-review-via-forge ()
  (interactive)
  (if-let* ((target (forge--browse-target))
            (url (if (stringp target) target (forge-get-url target)))
            (rev-url (pr-review-url-parse url)))
      (pr-review url)
    (user-error "No PR to review at point")))
I've bound it to a key in magit-mode-map to make it easier.
I have to say I'm not completely happy with emacs-pr-review, so if either of
the other two sort out commenting on multiple lines I'll check them out again.
The links to formulae here are broken but a PDF version is available at github.
Preface
Functional programming encourages us to program without mutable state.
Instead we compose functions that can be viewed as state transformers.
It's a change of perspective that can have a big impact on how we reason about our code.
But it's also a change of perspective that can be useful in mathematics and I'd like to give an example: a really beautiful technique that allows you to sample from the infinite limit of a probability distribution without needing an infinite number of operations.
(Unless you're infinitely unlucky!)
Markov Chains
A Markov chain is a sequence of random states where each state is drawn from a random distribution that possibly depends on the previous state, but not on any earlier state.
So it is a sequence \(X_1, X_2, X_3, \dots\) such that \(P(X_{n+1} = x \mid X_1 = x_1, \dots, X_n = x_n) = P(X_{n+1} = x \mid X_n = x_n)\) for all \(n\).
A basic example might be a model of the weather in which each day is either sunny or rainy but where it's more likely to be rainy (or sunny) if the previous day was rainy (or sunny).
(And to be technically correct: having information about two days or earlier doesn't help us if we know yesterday's weather.)
Like imperative code, this description is stateful.
The state at step \(n+1\) depends on the state at step \(n\).
Probability is often easier to reason about when we work with independent identically drawn random variables and our \(X_i\) aren't of this type.
But we can eliminate the state from our description using the same method used by functional programmers.
Let's choose a Markov chain to play with.
I'll pick one with 3 states called \(A\), \(B\) and \(C\) and with transition probabilities given by \(P(X_{n+1} = y \mid X_n = x) = T_{xy}\), where (with rows and columns ordered \(A\), \(B\), \(C\), as can be read off the step function below)
\[
T = \begin{pmatrix} \tfrac{1}{2} & \tfrac{1}{2} & 0 \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \\ 0 & \tfrac{1}{2} & \tfrac{1}{2} \end{pmatrix}
\]
Here's a diagram illustrating our states:
Implementation
First some imports:
> {-# LANGUAGE LambdaCase #-}
> {-# LANGUAGE TypeApplications #-}
> import Control.Monad.State
> import System.Random
> data ABC = A | B | C deriving (Eq, Show, Ord, Enum, Bounded)
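The code below also uses a replicateA helper (see the aside about Applicative vs Monad near the end); its definition seems to have been lost in this copy, but anything equivalent to replicateM with only an Applicative constraint will do, for example:

> replicateA :: Applicative f => Int -> f a -> f [a]
> replicateA n x = sequenceA (replicate n x)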
We are now in a position to simulate our Markov chain.
First we need some random numbers drawn uniformly from [0, 1]:
> uniform :: (RandomGen gen, MonadState gen m) => m Double
> uniform = state random
And now the code to take a single step in the Markov chain:
> step :: (RandomGen gen, MonadState gen m) => ABC -> m ABC
> step A = do
> a <- uniform
> if a < 0.5
> then return A
> else return B
> step B = do
> a <- uniform
> if a < 1/3.0
> then return A
> else if a < 2/3.0
> then return B
> else return C
> step C = do
> a <- uniform
> if a < 0.5
> then return B
> else return C
Notice how the step function generates a new state at random in a way that depends on the previous state.
The m ABC in the type signature makes it clear that we are generating random states at each step.
We can simulate the effect of taking \(n\) steps with a function like this:
> steps :: (RandomGen gen, MonadState gen m) => Int -> ABC -> m ABC
> steps 0 i = return i
> steps n i = do
> i <- steps (n-1) i
> step i
We can run for 3 steps, starting with \(A\), with a line like so:
*Main> evalState (steps 3 A) gen
B
The starting state of our random number generator is given by gen.
Consider the distribution of states after taking \(n\) steps.
For Markov chains of this type, we know that as \(n\) goes to infinity the distribution of the \(n\)th state approaches a limiting "stationary" distribution.
There are frequently times when we want to sample from this final distribution.
For a Markov chain as simple as this example, you can solve exactly to find the limiting distribution.
But for real world problems this can be intractable.
Instead, a popular solution is to pick a large \(n\) and hope it's large enough.
As \(n\) gets larger the distribution gets closer to the limiting distribution.
And that's the problem I want to solve here - sampling from the limit.
It turns out that by thinking about random functions instead of random states we can actually sample from the limiting distribution exactly.
Some random functions
Here is a new version of our random step function:
> step' :: (RandomGen gen, MonadState gen m) => m (ABC -> ABC)
> step' = do
> a <- uniform
> return $ \case
> A -> if a < 0.5 then A else B
> B -> if a < 1/3.0
> then A
> else if a < 2/3.0 then B else C
> C -> if a < 0.5 then B else C
In many ways it's similar to the previous one.
But there's one very big difference: the type signature m (ABC -> ABC) tells us that it's returning a random function, not a random state.
We can simulate the result of taking 10 steps, say, by drawing 10 random functions, composing them, and applying the result to our initial state:
> steps' :: (RandomGen gen, MonadState gen m) => Int -> m (ABC -> ABC)
> steps' n = do
> fs <- replicateA n step'
> return $ foldr (flip (.)) id fs
Notice the use of flip.
We want to compose functions \(f_n \circ f_{n-1} \circ \dots \circ f_1\), each time composing on the left by the new \(f_n\).
This means that for a fixed seed gen, each time you increase \(n\) by 1 you get the next step in a single simulation:
(BTW I used replicateA instead of replicateM to indicate that these are independent random draws.
It may be well known that you can use Applicative instead of Monad to indicate independence but I haven't seen it written down.)
*Main> [f A | n <- [0..10], let f = evalState (steps' n) gen]
[A,A,A,B,C,B,A,B,A,B,C]
When I first implemented this I accidentally forgot the flip.
So maybe you're wondering what effect removing the flip has?
The effect is about as close to a miracle as I've seen in mathematics.
It allows us to sample from the limiting distribution in a finite number of steps!
Here's the code:
> steps_from_past :: (RandomGen gen, MonadState gen m) => Int -> m (ABC -> ABC)
> steps_from_past n = do
> fs <- replicateA n step'
> return $ foldr (.) id fs
We end up building \(f_1 \circ f_2 \circ \dots \circ f_n\).
This is still a composition of independent identically distributed functions and so it's still drawing from exactly the same distribution as steps'.
Nonetheless, there is a difference: for a particular choice of seed, steps_from_past n no longer gives us a sequence of states from a Markov chain.
Running with argument \(n\) draws a random composition of \(n\) functions.
But if you increase \(n\) by 1 you don't add a new step at the end.
Instead you effectively restart the Markov chain with a new first step generated by a new random seed.
Try it and see:
*Main> [f A | n <- [0..10], let f = evalState (steps_from_past n) gen]
[A, A, A, A, A, A, A, A, A, A]
Maybe that's surprising.
It seems to get stuck in one state.
In fact, we can try applying the resulting function to all three states.
*Main> [fmap f [A, B, C] | n <- [0..10], let f = evalState (steps_from_past n) gen]
[[A,B,C],[A,A,B],[A,A,A],[A,A,A],[A,A,A],[A,A,A],[A,A,A],[A,A,A],[A,A,A],[A,A,A],[A,A,A]]
In other words, for large enough \(n\) we get the constant function.
Think of it this way:
If f isn't injective then it's possible that two states get collapsed to the same state.
If you keep picking random f's it's inevitable that you will eventually collapse down to the point where all arguments get mapped to the same state.
Once this happens, we'll get the same result no matter how large we take \(n\).
If we can detect this then we've found the limit of \(f_1 \circ f_2 \circ \dots \circ f_n\) as \(n\) goes to infinity.
But because we know composing forwards and composing backwards lead to draws from the same distribution, the limiting backward composition must actually be a draw from the same distribution as the limiting forward composition.
That flip can't change what probability distribution we're drawing from - just the dependence on the seed.
So the value the constant function takes is actually a draw from the limiting stationary distribution.
We can code this up:
> all_equal :: (Eq a) => [a] -> Bool
> all_equal [] = True
> all_equal [_] = True
> all_equal (a : as) = all (== a) as
> test_constant :: (Bounded a, Enum a, Eq a) => (a -> a) -> Bool
> test_constant f =
> all_equal $ map f $ enumFromTo minBound maxBound
This technique is called coupling from the past.
It's "coupling" because we've arranged that different starting points coalesce.
And it's "from the past" because we're essentially answering the question of what the outcome of a simulation would be if we started infinitely far in the past.
> couple_from_past :: (RandomGen gen, MonadState gen m, Enum a, Bounded a, Eq a) =>
> m (a -> a) -> (a -> a) -> m (a -> a)
> couple_from_past step f = do
> if test_constant f
> then return f
> else do
> f' <- step
> couple_from_past step (f . f')
We can now sample from the limiting distribution a million times, say:
*Main> let samples = map ($ A) $ evalState (replicateA 1000000 (couple_from_past step' id)) gen
We can now count how often A appears:
*Main> fromIntegral (length $ filter (== A) samples)/1000000
0.285748
That's a pretty good approximation to \(2/7 \approx 0.2857\), the exact answer that can be found by finding the eigenvector of the transition matrix corresponding to an eigenvalue of 1.
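(For the record, here is the calculation behind that exact answer; it isn't spelled out in the post, but it follows directly from the transition matrix given earlier.)

\[
\pi T = \pi, \qquad
T = \begin{pmatrix} \tfrac{1}{2} & \tfrac{1}{2} & 0 \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \\ 0 & \tfrac{1}{2} & \tfrac{1}{2} \end{pmatrix}
\;\Longrightarrow\;
\pi_A = \pi_C = \tfrac{2}{3}\pi_B
\;\Longrightarrow\;
\pi = \left(\tfrac{2}{7},\ \tfrac{3}{7},\ \tfrac{2}{7}\right),
\]
so \(\pi_A = 2/7 \approx 0.2857\), in good agreement with the sampled 0.285748.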
> gen = mkStdGen 669
Notes
The technique of coupling from the past first appeared in a paper by Propp and Wilson.
The paper Iterated Random Functions by Persi Diaconis gave me a lot of insight into it.
Note that the code above is absolutely not how you'd implement this for real.
I wrote the code that way so that I could switch algorithm with the simple removal of a flip.
In fact, with some clever tricks you can make this method work with state spaces so large that you couldn't possibly hope to enumerate all starting states to detect if convergence has occurred.
Or even with uncountably large state spaces.
But I'll let you read the Propp-Wilson paper to find out how.
Writing an interpreter for Brainfuck is almost a rite of passage for any programming language implementer,
and it’s my turn now. In this post, we’ll write not one but four Brainfuck interpreters in Haskell. Let’s go!
Brainfuck (henceforth BF) is the most famous of esoteric programming languages. Its fame lies in
the fact that it is extremely minimalistic, with only eight instructions, and very easy to implement.
Yet, it is Turing-complete and as capable as any other programming language1. Writing
an interpreter for BF is a fun exercise, and so there are hundreds, maybe even thousands of them. Since BF
is very verbose, optimizing BF interpreters is almost a sport, with people posting benchmarks of their
creations. I can’t say that what I have in this post is novel, but it was definitely a fun exercise for me.
BF has eight instructions of one character each. A BF program is a sequence of these instructions. It may have other characters as well, which are treated as comments and are ignored while executing. An instruction pointer (IP) points at the next instruction to be executed, starting with the first instruction. The instructions are executed sequentially, except for the jump instructions that may cause the IP to jump to remote instructions. The program terminates when the IP moves past the last instruction.
BF programs work by modifying data in a memory that is an array of at least 30000 byte cells initialized to zero. A data pointer (DP) points to the current byte of the memory to be modified, starting with the first byte of the memory. BF programs can also read from standard input and write to standard output, one byte at a time using the ASCII character encoding.
The eight BF instructions each consist of a single character:
>
Increment the DP by one to point to the next cell to the right.
<
Decrement the DP by one to point to the next cell to the left.
+
Increment the byte at the DP by one.
-
Decrement the byte at the DP by one.
.
Output the byte at the DP.
,
Accept one byte of input, and store its value in the byte at the DP.
[
If the byte at the DP is zero, then instead of moving the IP forward to the next command, jump it forward to the command after the matching ] command.
]
If the byte at the DP is nonzero, then instead of moving the IP forward to the next command, jump it back to the command after the matching [ command.
Each [ matches exactly one ] and vice versa, and the [ comes first. Together, they add conditions and loops to BF.
Some details are left to implementations. In our case, we assume that the memory cells are signed bytes that underflow and overflow without errors. Also, accessing the memory beyond array boundaries wraps to the opposite side without errors.
For a taste, here is a small BF program that prints Hello, World! when run:
As you can imagine, interpreting BF is easy, at least when doing it naively. So instead of writing one interpreter, we are going to write four, with increasing performance and complexity.
Setup
First, some imports:
{-# LANGUAGE GHC2021 #-}
{-# LANGUAGE LambdaCase #-}
{-# LANGUAGE TypeFamilies #-}

module Main where

import Control.Arrow ((>>>))
import Control.Monad (void)
import Data.Bits (shiftR, (.&.))
import Data.ByteArray qualified as BA
import Data.Char (chr, ord)
import Data.Functor (($>))
import Data.Int (Int8)
import Data.Kind (Type)
import Data.Vector qualified as V
import Data.Vector.Storable.Mutable qualified as MV
import Data.Word (Word16, Word8)
import Foreign.Ptr (Ptr, castPtr, minusPtr, plusPtr)
import Foreign.Storable qualified as S
import System.Environment (getArgs, getProgName)
import System.Exit (exitFailure)
import System.IO qualified as IO
import Text.ParserCombinators.ReadP qualified as P
We use the GHC2021 extension here that enables a lot of useful GHC extensions by default. Our non-base imports come from the memory and vector libraries.
We abstract the interpreter interface as a typeclass:
class Interpreter a where
  data Program a :: Type
  parse :: String -> Program a
  interpret :: Memory -> Program a -> IO ()
An Interpreter is specified by a data type Program and two functions: parse parses a string to a Program, and interpret interprets the parsed Program.
For modelling the mutable memory, we use a mutable unboxed IOVector of signed bytes (Int8) from the vector package. Since our interpreter runs in IO, this works well for us. The DP, hence, is modelled as an index into this vector, which we name the MemIdx type.
We wrap the IOVector Int8 with a Memory newtype. newMemory creates a new memory array of bytes initialized to zero. memorySize returns the size of the memory. readMemory, writeMemory and modifyMemory are for reading from, writing to and modifying the memory respectively. nextMemoryIndex and prevMemoryIndex increment and decrement the array index respectively, taking care of wrapping at boundaries.
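The wrapper code itself isn’t included in this excerpt; a minimal sketch of what it might look like (the post’s real definitions may differ in details):

-- A plausible Memory wrapper; MV is Data.Vector.Storable.Mutable from the
-- imports above.
newtype Memory = Memory (MV.IOVector Int8)

type MemIdx = Int

newMemory :: Int -> IO Memory
newMemory n = Memory <$> MV.replicate n 0

memorySize :: Memory -> Int
memorySize (Memory mem) = MV.length mem

readMemory :: Memory -> MemIdx -> IO Int8
readMemory (Memory mem) = MV.read mem

writeMemory :: Memory -> MemIdx -> Int8 -> IO ()
writeMemory (Memory mem) = MV.write mem

modifyMemory :: Memory -> (Int8 -> Int8) -> MemIdx -> IO ()
modifyMemory (Memory mem) = MV.modify mem

nextMemoryIndex :: Memory -> MemIdx -> MemIdx
nextMemoryIndex memory memIdx = (memIdx + 1) `rem` memorySize memory

prevMemoryIndex :: Memory -> MemIdx -> MemIdx
prevMemoryIndex memory memIdx = (memIdx - 1) `mod` memorySize memory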
Now we write the main function using the Interpreter typeclass functions:
The main function calls the parse and interpret functions for the right interpreter with a new memory and the input string read from the file specified in the command line argument. We make sure to filter out non-BF characters when reading the input file.
With the setup done, let’s move on to our first interpreter.
String Interpreter
A BF program can be interpreted directly from its string representation, going over the characters and executing the right logic for them. But strings in Haskell are notoriously slow because they are implemented as singly linked-lists of characters. Indexing into strings has \(O(n)\) time complexity, so it is not a good idea to use them directly. Instead, we use a char Zipper2.
Zippers are a special view of data structures, which allow one to navigate and easily update them. A zipper has a focus or cursor which is the current element of the data structure we are “at”. Alongside, it also captures the rest of the data structure in a way that makes it easy to move around it. We can update the data structure by updating the element at the focus3.
This zipper is a little different from the usual implementations because we need to know when the focus of the zipper has moved out of the program boundaries. Hence, we model the focus as Maybe Char. czFromString creates a char zipper from a string. czMoveLeft and czMoveRight move the focus left and right respectively, taking care of setting the focus to Nothing if we move outside the program string.
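The zipper code isn’t reproduced here; one plausible shape for it, based on the description above (the real definitions may differ):

data CharZipper = CharZipper
  { czLeft  :: [Char]
  , czFocus :: Maybe Char
  , czRight :: [Char]
  }

czFromString :: String -> CharZipper
czFromString []       = CharZipper [] Nothing []
czFromString (c : cs) = CharZipper [] (Just c) cs

-- Moving pushes the old focus (if any) onto the other side; the focus becomes
-- Nothing once we step off either end of the program.
czMoveLeft :: CharZipper -> CharZipper
czMoveLeft (CharZipper ls focus rs) =
  let rs' = maybe rs (: rs) focus
  in case ls of
       (l : ls') -> CharZipper ls' (Just l) rs'
       []        -> CharZipper [] Nothing rs'

czMoveRight :: CharZipper -> CharZipper
czMoveRight (CharZipper ls focus rs) =
  let ls' = maybe ls (: ls) focus
  in case rs of
       (r : rs') -> CharZipper ls' (Just r) rs'
       []        -> CharZipper ls' Nothing []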
Parsing the program is thus the same as creating the char zipper from the program string. For interpreting the program, we write this function:
interpretCharZipper :: Memory -> CharZipper -> IO ()
interpretCharZipper memory = go 0
  where
    go !memIdx !program = case czFocus program of
      Nothing -> return ()
      Just c -> case c of
        '+' -> modifyMemory memory (+ 1) memIdx >> goNext
        '-' -> modifyMemory memory (subtract 1) memIdx >> goNext
        '>' -> go (nextMemoryIndex memory memIdx) program'
        '<' -> go (prevMemoryIndex memory memIdx) program'
        ',' -> do
          getChar >>= writeMemory memory memIdx . fromIntegral . ord
          goNext
        '.' -> do
          readMemory memory memIdx >>= putChar . chr . fromIntegral
          goNext
        '[' -> readMemory memory memIdx >>= \case
          0 -> go memIdx $ skipRight 1 program
          _ -> goNext
        ']' -> readMemory memory memIdx >>= \case
          0 -> goNext
          _ -> go memIdx $ skipLeft 1 program
        _ -> goNext
      where
        program' = czMoveRight program
        goNext = go memIdx program'
Our main driver here is the tail-recursive go function that takes the memory index and the program as inputs. It then gets the current focus of the program zipper, and executes the BF logic accordingly.
If the current focus is Nothing, it means the program has finished running. So we end the execution. Otherwise, we switch over the character and do what the BF spec tells us to do.
For + and -, we increment or decrement respectively the value in the memory cell at the current index, and go to the next character. For > and <, we increment or decrement the memory index respectively, and go to the next character.
For ,, we read an ASCII encoded character from the standard input, and write it to the memory at the current memory index as a byte. For ., we read the byte from the memory at the current memory index, and write it out to the standard output as an ASCII encoded character. After either cases, we go to the next character.
For [, we read the byte at the current memory index, and if it is zero, we skip right over the part of the program till the matching ] is found. Otherwise, we go to the next character.
For ], we skip left over the part of the program till the matching [ is found, if the current memory byte is non-zero. Otherwise, we go to the next character.
The next two functions implement the skipping logic:
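The two functions themselves are missing from this copy; here is a sketch consistent with the description below, reusing the hypothetical CharZipper operations from earlier:

-- Move right until the ']' matching the current '[' is in focus, erroring
-- out if we fall off the end of the program.
skipRight :: Int -> CharZipper -> CharZipper
skipRight depth program = case czFocus program' of
  Nothing  -> error "skipRight: no matching ]"
  Just '[' -> skipRight (depth + 1) program'
  Just ']' -> if depth == 1 then program' else skipRight (depth - 1) program'
  Just _   -> skipRight depth program'
  where
    program' = czMoveRight program

-- Symmetrically, move left until the matching '[' is in focus.
skipLeft :: Int -> CharZipper -> CharZipper
skipLeft depth program = case czFocus program' of
  Nothing  -> error "skipLeft: no matching ["
  Just ']' -> skipLeft (depth + 1) program'
  Just '[' -> if depth == 1 then program' else skipLeft (depth - 1) program'
  Just _   -> skipLeft depth program'
  where
    program' = czMoveLeft program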
The tail-recursive functions skipRight and skipLeft skip over parts of the program by moving the focus to right and left respectively, till the matching bracket is found. Since the loops can contain nested loops, we keep track of the depth of loops we are in, and return only when the depth becomes zero. If we move off the program boundaries while skipping, we throw an error.
That’s it! We now have a fully functioning BF interpreter. To test it, we use these two BF programs: hanoi.bf and mandelbrot.bf.
hanoi.bf solves the Tower of Hanoi puzzle with animating the solution process as ASCII art:
A freeze-frame from the animation of solving the Tower of Hanoi puzzle with hanoi.bf
mandelbrot.bf prints an ASCII art showing the Mandelbrot set:
Mandelbrot set ASCII art by mandelbrot.bf
Both of these BF programs serve as good benchmarks for BF interpreters. Let’s test ours by compiling and running it4:
❯ nix-shell -p "ghc.withPackages (pkgs: with pkgs; [vector memory])" \
--run "ghc --make bfi.hs -O2"
[1 of 2] Compiling Main ( bfi.hs, bfi.o )
[2 of 2] Linking bfi [Objects changed]
❯ time ./bfi -s hanoi.bf > /dev/null
29.15 real 29.01 user 0.13 sys
❯ time ./bfi -s mandelbrot.bf > /dev/null
94.86 real 94.11 user 0.50 sys
That seems quite slow. We can do better.
AST Interpreter
Instead of executing BF programs from their string representations, we can parse them to an Abstract Syntax Tree (AST). This allows us to match brackets only once at parse time, instead of doing it repeatedly at run time. We capture loops as AST nodes, allowing us to skip them trivially.
There is one constructor per BF instruction, except for loops where the Loop constructor captures both the start and end of loop instructions. We use immutable boxed vectors for lists of instructions instead of Haskell lists so that we can index into them in \(O(1)\).
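The AST type isn’t shown in this excerpt; one possible shape, with constructor names invented for illustration:

-- One constructor per BF instruction; Loop holds the whole body so that
-- skipping it is a single pattern match.
data Instruction
  = IncrementPointer   -- >
  | DecrementPointer   -- <
  | IncrementByte      -- +
  | DecrementByte      -- -
  | OutputByte         -- .
  | InputByte          -- ,
  | Loop Instructions  -- [ ... ]
  deriving (Show)

type Instructions = V.Vector Instruction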
We use the parser combinator library ReadP to write a recursive-descent parser for BF:
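The parser itself isn’t reproduced here; a sketch of what it might look like with ReadP, using the hypothetical constructors above:

instructionsP :: P.ReadP Instructions
instructionsP = V.fromList <$> P.many instructionP

instructionP :: P.ReadP Instruction
instructionP =
  P.choice
    [ P.char '>' $> IncrementPointer
    , P.char '<' $> DecrementPointer
    , P.char '+' $> IncrementByte
    , P.char '-' $> DecrementByte
    , P.char '.' $> OutputByte
    , P.char ',' $> InputByte
    , Loop <$> P.between (P.char '[') (P.char ']') instructionsP
    ]

A top-level parse would then run instructionsP <* P.eof through P.readP_to_S on the (already filtered) program text and keep the complete parse.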
All cases except the loop one are straightforward. For loops, we call the parser recursively to parse the loop body. Note that the parser matches the loop brackets correctly. If the brackets don’t match, the parser fails.
The AST interpreter code is quite similar to the string interpreter one. This time we use an integer as the IP to index the Instructions vector. All cases except the loop one are pretty much the same as before.
For loops, we read the byte at the current memory index, and if it is zero, we skip executing the Loop AST node and go to the next instruction. Otherwise, we recursively interpret the loop body and go to the next instruction, taking care of passing the updated memory index returned from the recursive call to the execution of the next instruction.
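The interpreter code didn’t make it into this excerpt; here is a rough sketch built on the hypothetical Instruction and Memory definitions above. It re-tests the loop condition after each pass over the body and threads the updated memory index through, in the spirit of the description above.

interpretAST :: Memory -> Instructions -> IO ()
interpretAST memory program = void $ run 0 program
  where
    -- Interpret one block of instructions, returning the final memory index.
    run memIdx block = go memIdx 0
      where
        go !mi !ip
          | ip >= V.length block = pure mi
          | otherwise = case block V.! ip of
              IncrementByte    -> modifyMemory memory (+ 1) mi >> go mi (ip + 1)
              DecrementByte    -> modifyMemory memory (subtract 1) mi >> go mi (ip + 1)
              IncrementPointer -> go (nextMemoryIndex memory mi) (ip + 1)
              DecrementPointer -> go (prevMemoryIndex memory mi) (ip + 1)
              InputByte        -> do
                getChar >>= writeMemory memory mi . fromIntegral . ord
                go mi (ip + 1)
              OutputByte       -> do
                readMemory memory mi >>= putChar . chr . fromIntegral
                go mi (ip + 1)
              Loop body        -> readMemory memory mi >>= \case
                0 -> go mi (ip + 1)            -- skip the whole loop
                _ -> do
                  mi' <- run mi body           -- run the body once...
                  go mi' ip                    -- ...then re-test the condition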
And we are done. Let’s see how it performs:
❯ time ./bfi -a hanoi.bf > /dev/null
14.94 real 14.88 user 0.05 sys
❯ time ./bfi -a mandelbrot.bf > /dev/null
36.49 real 36.32 user 0.17 sys
Great! hanoi.bf runs 2x faster, whereas mandelbrot.bf runs 2.6x faster. Can we do even better?
Bytecode Interpreter
AST interpreters are well known to be slow because of how AST nodes are represented in the computer’s memory. The AST nodes contain pointers to other nodes, which may be anywhere in the memory. So while interpreting an AST, it jumps all over the memory, causing a slowdown. One solution to this is to convert the AST into a more compact and optimized representation known as Bytecode. That’s what our next interpreter uses.
We reuse the parser from the AST interpreter, but then we convert the resultant AST into bytecode by translating and assembling it5. We use the Bytes byte array data type from the memory package to represent bytecode.
Unlike AST, bytecode has a flat list of instructions—called Opcodes—that can be encoded in a single byte each, with optional parameters. Because of its flat nature and compactness, bytecode is more CPU friendly to execute, which is where it gets its performance from. The downside is that bytecode is not human readable unlike AST.
The assembleOpcode function assembles an Opcode to a list of bytes (Word8s). For all cases except for OpLoop, we simply return a unique byte for the opcode.
For OpLoop, we first recursively assemble the loop body. We encode both the body and the body length in the assembled bytecode, so that the bytecode interpreter can use the body length to skip over the loop body when required. We use two bytes to encode the body length, so we first check if the body length plus three is over 65536 (\(= 2^8*2^8\)). If so, we throw an error. Otherwise, we return:
a unique byte for loop start (6),
followed by the body length encoded in two bytes (in the Little-endian order),
then the assembled loop body,
followed by a unique byte for loop end (7),
finally followed by the encoded body length again.
We encode the body length at the end again so that we can use it to jump backward to the start of the loop, to continue looping. Let’s look at this example to understand the loop encoding better:
In Haskell, the pointer type Ptr is parametrized by the type of the data it points to. We have two types of pointers here, one that points to the bytecode program, and another that points to the memory cells. So in this case, the IP and DP are actually pointers.
The go function here is again the core of the interpreter loop. We track the current IP and DP in it, and execute the logic corresponding to the opcode the IP currently points to. go ends when the IP points to the end of the program byte array.
Most of the cases in go are similar to previous interpreters. Only difference is that we use pointers to read the current opcode and memory cell. For the loop start opcode, we read the byte pointed to by the DP, and if it is zero, we read the next two bytes from the program bytecode, and use it as the offset to jump the IP by to skip over the loop body. Otherwise, we jump the IP by 3 bytes to skip over the loop start opcode and encoded loop body length bytes. For the loop end opcode, we follow similar steps, except we jump backward to the start of the loop.
The helper functions for doing pointer arithmetic are as follows:
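The helpers are missing from this copy; a guess at what they might look like (their real signatures are unknown, so both the arguments and the wrapping strategy here are assumptions):

-- Step the DP forward/backward by one cell, wrapping around the memory block
-- that starts at memStartPtr and is size bytes long.
nextMemPtr :: Ptr Int8 -> Int -> Ptr Int8 -> Ptr Int8
nextMemPtr memStartPtr size ptr =
  let offset = (ptr `minusPtr` memStartPtr + 1) `rem` size
  in memStartPtr `plusPtr` offset

prevMemPtr :: Ptr Int8 -> Int -> Ptr Int8 -> Ptr Int8
prevMemPtr memStartPtr size ptr =
  let offset = (ptr `minusPtr` memStartPtr - 1) `mod` size
  in memStartPtr `plusPtr` offset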
nextMemPtr and prevMemPtr implement wrapping of pointers as we do for memory indices in nextMemoryIndex and prevMemoryIndex. Let’s see what the results of our hard work are:
❯ time ./bfi -b hanoi.bf > /dev/null
11.10 real 11.04 user 0.04 sys
❯ time ./bfi -b mandelbrot.bf > /dev/null
15.72 real 15.68 user 0.04 sys
1.3x and 2.3x speedups for hanoi.bf and mandelbrot.bf respectively over the AST interpreter. Not bad. But surely we can do even better?
Optimizing Bytecode Interpreter
We can optimize our bytecode interpreter by emitting specialized opcodes for particular patterns of opcodes that occur frequently. Think of it as replacing every occurrence of a long phrase in a text with a single word that means the same, leading to a shorter text and faster reading time. Since BF is so verbose, there are many opportunities for optimizing BF bytecode7. We are going to implement only one simple optimization, just to get a taste of how to do it.
The optimizing bytecode interpreter is pretty much same as the bytecode interpreter, with the optimize function called between the translation and assembly phases.
The pattern of opcode we are optimizing for is [-] and [+]. Both of these BF opcodes when executed, decrement or increment the current memory cell till it becomes zero. In effect, these patterns clear the current cell. We start the process by adding a new Opcode for clearing a cell:
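The actual code isn’t reproduced here; the idea, sketched with guessed names (only OpLoop is mentioned by name elsewhere in the post), is a rewrite pass over the translated opcodes that runs before assembly:

-- Hypothetical opcode for clearing the current cell, plus the rewrite that
-- introduces it; OpInc/OpDec and the other names are stand-ins.
data Opcode
  = OpInc | OpDec | OpMoveRight | OpMoveLeft | OpInput | OpOutput
  | OpClear            -- new: set the current cell to zero in one step
  | OpLoop [Opcode]

optimize :: [Opcode] -> [Opcode]
optimize = map $ \case
  OpLoop [OpDec] -> OpClear            -- [-]
  OpLoop [OpInc] -> OpClear            -- [+]
  OpLoop body    -> OpLoop (optimize body)
  op             -> op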
We can see how the patterns [-] and [+] that may execute operations tens, maybe hundreds, of times, are replaced by a single operation in the interpreter now. This is what gives us the speedup in this case. Let’s run it:
❯ time ./bfi -o hanoi.bf > /dev/null
4.07 real 4.04 user 0.01 sys
❯ time ./bfi -o mandelbrot.bf > /dev/null
15.58 real 15.53 user 0.04 sys
hanoi.bf runs 2.7x faster, whereas mandelbrot.bf is barely 1% faster as compared to the non-optimizing bytecode interpreter. This demonstrates how different optimizations apply to different programs, and hence the need to implement a wide variety of them to be able to optimize all programs well.
Comparison
It’s time for a final comparison of the run times of the four interpreters:
Interpreter           Hanoi    Mandelbrot
String                29.15s   94.86s
AST                   14.94s   36.49s
Bytecode              11.10s   15.72s
Optimizing Bytecode    4.07s   15.58s
The final interpreter is 7x faster than the baseline one for hanoi.bf, and 6x faster for mandelbrot.bf. Here’s the same data as a chart:
Run time of the four interpreters
That’s it for this post. I hope you enjoyed it and took something away from it. In a future post, we’ll explore more optimizations for our BF interpreter. The full code for this post is available here.
If you have any questions or comments, please leave a comment below. If you liked this post, please share it. Thanks for reading!
BF is Turing-complete. That means it can be used to implement any computable program. However, it is a Turing tarpit, which means it is not feasible to write any useful programs in it because of its lack of abstractions.↩︎
A string interpreter also serves as a useful baseline for measuring the performance of BF interpreters. That’s why I decided to use strings instead of Data.Text or Data.Sequence, which are more performant.↩︎
I am a big fan of zippers, as evidenced by this growing list of posts that I use them in.↩︎
We use Nix for getting the dependency libraries.↩︎
If you are unfamiliar, >>> is the left-to-right function composition function: f >>> g = g . f.↩︎
While the only way to access byte arrays is pointers, we could have continued accessing the memory vector using indices. I benchmarked both methods, and found that using pointers for memory access sped up the execution of hanoi.bf by 1.1x and mandelbrot.bf by 1.6x as compared to index-based access. It’s also nice to learn how to use pointers in Haskell. This is why we chose to use Storable vectors for the memory.↩︎
See BFC, which touts itself as “an industrial-grade Brainfuck compiler”, with a huge list of optimizations.↩︎
I’ve always considered lenses to be a bit uncomfortable. While they’re occasionally useful for doing deeply nested record updates, they often seem to be more trouble than they’re worth. There’s a temptation in the novice programmer to ^.. and folded their way to a solution that is much more naturally written merely as toList. And don’t get me started about the stateful operators like <<+= and their friends. Many programs which can be more naturally written functionally accidentally end up being imperative due to somebody finding a weird lens combinator and trying to use it in anger. Much like a serious drug collection, the tendency is to push it as far as you can.
Thus, my response has usually been one of pushback and moderation. I don’t avoid lenses at all costs, but I do try to limit myself to the prime types (Lens', Prism', Iso'), and to the boring combinators (view, set, over). I feel like these give me most of the benefits of lenses, without sending me tumbling down the rabbit hole.
All of this is to say that my grokkage of lenses has always been one of generalized injections and projections, for a rather shallow definition of “generalized”. That is, I’ve grown accustomed to thinking about lenses as getter/setter pairs for data structures—eg, I’ve got a big product type and I want to pull a smaller piece out of it, or modify a smaller piece in a larger structure. I think about prisms as the dual structure over coproducts—“generalized” injecting and pattern matching.
And this is all true; but I’ve been missing the forest for the trees on this one. That’s not to say that I want to write lensier code, but that I should be taking the “generalized” part much more seriously.
The big theme of my intellectual development over the last few years has been thinking about abstractions as shared vocabularies. Monoids are not inherently interesting; they’re interesting because of how they let you quotient seemingly-unrelated problems by their monoidal structure. Applicatives are cool because once you’ve grokked them, you begin to see them everywhere. Anywhere you’ve got conceptually-parallel, data-independent computations, you’ve got an applicative lurking somewhere under the surface (even if it happens to be merely the Identity applicative.)
I’ve had a similar insight about lenses, and that’s what I wanted to write about today.
The Context
At work, I’ve been thinking a lot about compilers and memory layout lately. I won’t get into the specifics of why, but we can come up with an inspired example. Imagine we’d like to use Haskell to write a little eDSL that we will use to generate x86 machine code.
The trick of course, is that we’re writing Haskell in order to not write machine code. So the goal is to design high-level combinators in Haskell that express our intent, while simultaneously generating machine code that faithfully implements the intention.
One particularly desirable feature about eDSLs is that they allow us to reuse Haskell’s type system. Thus, imagine we have some type:
type Code :: Type -> Type
data Code a = Code { getMachineCode :: [X86OpCode] }
Notice that the a parameter here is entirely phantom; it serves only to annotate the type of the value produced by executing getMachineCode. For today’s purpose, we’ll ignore all the details about calling conventions and register layout and what not; let’s just assume a Code a corresponds to a computation that leaves a value (or pointer) to something of type a in a well-known place, whether that be the top of the stack, or eax or something. It doesn’t matter!
Since the type parameter to Code is phantom, we need to think about what role it should have. Keeping it at phantom would be disastrous, since this type isn’t used by Haskell, but it is certainly used to ensure our program is correct. Similarly, representational seems wrong, since coerce is meaningful only when thinking about Haskell; which this thing decidedly is not. Thus, our only other option is:
type role Code nominal
Frustratingly, due to very similar reasoning, Code cannot be a functor, because there’s no way1 to lift an arbitrary Haskell function a -> b into a corresponding function Code a -> Code b. If there were, we’d be in the clear! But alas, we are not.
The Problem
All of the above is to say that we are reusing Haskell’s type system, but not its values. An expression of type Code Bool has absolutely no relation to the values True or False—except that we could write, by hand, a function litBool :: Bool -> Code Bool which happened to do the right thing.
It is tempting, however, to make new Haskell types in order to help constrain the assembly code we end up writing. For example, maybe we want to write a DSP for efficiently decoding audio. We can use Haskell’s types to organize our thoughts and prevent ourselves from making any stupid mistakes:
data Decoder = Decoder
  { format  :: Format
  , seekPos :: Int
  , state   :: ParserState
  }

data Chunk = ...

createDecoder :: Code MediaHandle -> Code Decoder
decodeChunk :: Code Decoder -> (Code Decoder, Code Chunk)
We now have a nice interface in our eDSL to guide end-users along the blessed path of signal decoding. We have documented what we are trying to do, and how it can be used once it’s implemented. But due to our phantom, yet nominal, parameter to Code, this is all just make believe. There is absolutely no correlation between what we’ve written down and how we can use it. The problem arises when we go to implement decodeChunk. We’ll need to know what state we’re in, which means we’ll need some function:
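(The signature is missing from this copy; presumably something like the following, where the name is a guess:)

-- The function we wish we could write. With a Functor instance it would just
-- be fmap state, but Code deliberately has no such instance.
getParserState :: Code Decoder -> Code ParserState
getParserState = error "not yet implementable; see the lens-based solution below"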
In a world where Code is a functor, this is implemented trivially as fmap state. But Code is not a functor! Alas! Woe! What ever can we do?
The Solution
Lenses, my guy!
Recall that Code is phantom in its argument, even if we use roles to restrict that fact. This means we can implement a safe-ish version of unsafeCoerce, that only fiddles with the parameter of our phantom type:
unsafeCoerceCode :: Code a -> Code b
unsafeCoerceCode (Code ops) = Code ops
Judicious use of unsafeCoerceCode allows us to switch between a value’s type and its in-memory representation. For example, given a type:
type Bytes :: Nat -> Type
data Bytes n
we can reinterpret a Decoder as a sequence of bytes:
decoderRep :: Iso' (Code Decoder) (Code (Bytes (32 + 4 + 1)))
decoderRep = iso unsafeCoerceCode unsafeCoerceCode

stateRep :: Iso' (Code ParserState) (Code (Bytes 1))
stateRep = iso unsafeCoerceCode unsafeCoerceCode
which says we are considering our Decoder to be laid out in memory like:
Of course, this is a completely unsafe transformation, as far as the Haskell type system is aware. We’re in the wild west out here, well past any type theoretical life buoys. We’d better be right that this coercion is sound. But assuming this is in fact the in-memory representation of a Decoder, we are well justified in this transformation.
Notice the phrasing of our Iso' above. It is not an iso between Decoder and Bytes 37, but between Codes of such things. This witnesses the fact that it is not true in the Haskell embedding, merely in our Code domain. Of course, isos are like the least exciting optics, so let’s see what other neat things we can do.
Imagine we have some primitives:
slice
    :: n <= m
    => Int            -- ^ offset
    -> Proxy n        -- ^ size
    -> Code (Bytes m)
    -> Code (Bytes n)

overwrite
    :: n <= m
    => Int            -- ^ offset
    -> Bytes n
    -> Bytes m
    -> Bytes m
which we can envision as Haskell bindings to the pseudo-C functions:
Such a lens acts exactly as a record selector would, in that it allows us to view, set, and over a ParserState inside of a Decoder. But recall that Code is just a list of instructions we eventually want the machine to run. We’re using the shared vocabulary of lenses to emit machine code! What looks like using a data structure to us when viewed through the Haskell perspective, is instead invoking an assembler.
Reflections
Once the idea sinks in, you’ll start seeing all sorts of cool things you can do with optics to generate code. Prisms generalize running initializer code. A Traversal over Code can be implemented as a loop. And since all the sizes are known statically, if you’re feeling plucky, you can decide to unroll the loop right there in the lens.
Outside of the context of Code, the realization that optics are this general is still doing my head in. Something I love about working in Haskell is that I’m still regularly having my mind blown, even after a decade.
Humans want the resources of other humans. I want the food that the supermarket owns so that I can eat it. Before buying it, I wanted the house that I now own. And before that, someone wanted to build a house on that plot of land, which was owned by someone else first. Most of the activities we engage in during our lifetime revolve around extracting something from someone else.
There are two basic modalities to getting the resources of someone else. The first, the simplest, and the one that has dominated the majority of human history, is force. Conquer people, kill them, beat them up and take their stuff, force them into slavery and make them do your work. It’s a somewhat effective strategy. This can also be more subtle, by using coercive and fraudulent methods to trick people into giving you their resources. Let’s call this modality the looter approach.
The second is trade. In the world of trade, I can only extract resources from someone else when they willingly give them to me in exchange for something else of value. This can be barter of value for value, payment in money, built-up goodwill, favors, charity (exchanging resources for the benefit you receive for helping someone else), and more. In order to participate in this modality, you need to create your own valuable resources that other people want to trade for. Let’s call this the producer approach.
The producer approach is better for society in every conceivable way. The looter approach causes unnecessary destruction, pushes production into ventures that don’t directly help anyone (like making more weapons), and rewards people for their ability to inflict harm. By contrast, the producer approach rewards the ability to meet the needs of others and causes resources to end up in the hands of those who value them the most.
Looter philosophy is rooted in the concept of the zero sum game, the mistaken belief that I can only have more if someone else has less. By contrast, the producer philosophy correctly identifies the fact that we can all end up better by producing more goods in more efficient ways. We live in our modern world of relatively widespread luxury because producers have made technological leaps—for their own self-serving motives—that have improved everyone’s ability to produce more goods going forward. Think of the steam engine, electricity, computing power, and more.
A producer-only world
It would be wonderful to live in a world in which there are no looters. We all produce, we all trade, everyone receives more value than they give, and there is no wasted energy or destruction from the use of force.
Think about how wonderful it could be! We wouldn’t need militaries, allowing a massive amount of productive capacity to be channeled into things that make everyone’s lives better. We wouldn’t need police. Not only would that free up more resources, but would remove the threat of improper use of force by the state against citizens. The list goes on and on.
I believe many economists—especially Austrian economists—are cheering for that world. I agree with them on the cheering. It’s why things like Donald Trump’s plans for tariffs are so horrific in their eyes. Tariffs introduce an artificial barrier between nations, impeding trade, preventing the peaceful transfer of resources, and leading to a greater likelihood of armed conflict.
There’s only one problem with this vision, and it’s also based in economics: game theory.
Game theory and looters
Imagine I’m a farmer. I’m a great farmer, I have a large plot of land, I run my operations efficiently, and I produce huge amounts of food. I sell that food into the marketplace, and with that money I’m able to afford great resources from other people, who willingly trade them to me because they value the money more than their own resources. For example, how many T-shirts does the clothing manufacturer need? Instead of his 1,000th T-shirt, he’d rather sell it for $5 and buy some food.
While I’m really great as a farmer, I’m not very good as a fighter. I have no weapons training, I keep no weapons on my property, and I dislike violence.
And finally, there’s a strong, skilled, unethical person down the street. He could get a job with me on the farm. For back-breaking work 8 hours a day, I’ll pay him 5% of my harvest. Or, by contrast, he could act like the mafia, demand a “protection fee” of 20%, and either beat me up, beat up my family, or cause harm to my property, if I don’t pay it.
In other words, he could be a producer and get 5% in exchange for hard work, or be a looter and get 20% in exchange for easy (and, likely for him, fun) work. As described, the game theoretic choice is clear.
So how do we stop a producer world from devolving back into a looter world?
Deterrence
There’s only one mechanism I’m aware of for this, and it’s deterrence. As the farmer, I made a mistake. I should get weapons training. I should keep weapons on my farm. I should be ready to defend myself and my property. Because if I don’t, game theory ultimately predicts that all trade will collapse, and society as we know it will crumble.
I don’t necessarily have to have the power of deterrence myself. I could hire a private security company, once again allowing the producer world to work out well. I trade something of lesser value (some money) for something I value more (the protection afforded by private security). If I’m lucky, that security company will never need to do anything, because the mere threat of their presence is sufficient.
And in modern society, we generally hope to rely on the government police force to provide this protection.
There are easy ways to defeat the ability of deterrence to protect our way of life. The simplest is to defang it. Decriminalize violent and destructive acts, for example. Remove the consequences for bad, looter behavior, and you will incentivize looting. This is far from a theoretical discussion. We’ve seen the clear outcome in California, which has decriminalized theft under $950, resulting—in a completely predictable way—in more theft, stores closing, and an overall erosion of producer philosophy.
And in California, this is even worse. Those who try to be their own deterrence, by arming themselves and protecting their rights, are often the targets of government force instead of the looters.
I’m guessing this phrasing has now split my reading audience into three groups. Group A agrees wholly with what I’m saying. Group B believes what I’ve just written is pure evil and garbage. Group C initially disagreed with my statements, but has an open mind and is willing to consider a different paradigm. The next section is targeted at groups A and C. Group B: good luck with the broken world you’re advocating.
Global scale
This concept of deterrence applies at a global scale too. I would love to live in a world where all nations exchange value for value and never use force against others. In fact, I believe the ultimate vision for this kind of a world ends with anarcho-capitalism (though I don’t know enough about the topic to be certain). There ends up being no need for any force against anyone else. It’s a beautiful vision for a unified world, where there are no borders, there is no destruction, there is only unity through trade. I love it.
But game theory destroys this too. If the entire world disarmed, it would take just one person who thinks he can do better through looter tactics to destroy the system. The only way to defeat that is to have a realistic threat of force to disincentivize someone from acting like a looter.
And this is the paradox. In order to live in our wonderful world of production, prosperity, health, and happiness, we always need to have our finger near enough to the trigger to respond to looters with force. I know of no other approach that allows production to happen. (And I am very interested in other theoretical solutions to this problem, if anyone wants to share reading material.)
Peace through strength
This line of thinking leads to the concept of peace through strength. When those tempted to use violence see the overwhelming strength of their potential victims, they will be disincentivized to engage in violent behavior. It’s the story of the guy who wants to rob my farm. Or the roaming army in the ancient world that bypassed the well fortified walled city and attacked its unprotected neighbor.
There are critics of this philosophy. As put by Andrew Bacevich, "'Peace through strength' easily enough becomes 'peace through war.'" I don’t disagree at all with that analysis, and it’s something we must remain vigilant against. But disarming is not the answer, as it will, of course, necessarily lead to the victory of those willing to use violence on others.
In other words, my thesis here is that the threat of violence must be present to keep society civilized. But the cost of using that violence must be high enough that neither side is incentivized to initiate it.
Israel
I’d been thinking of writing a blog post on this topic for a few months now, but finally decided to today. Israel just agreed to a hostage deal with Hamas. In exchange for the release of 33 hostages taken in the October 7 massacre, Israel will hand over 1,000 terrorists in Israeli prisons.
I have all the sympathy in the world for the hostages and their families. I also have great sympathy for the Palestinian civilians who have been harmed, killed, displaced, and worse by this war. And I have empathy (as one of the victims) for all of the Israeli citizens who have lived under threat of rocket attacks, had our lives disrupted, and for those who have been killed by this war. War is hell, full stop.
My message here is to those who have been pushing the lie of “peace through negotiations.” Or peace through capitulation. Or anything else. These tactics are the reason the war has continued. As long as the incentive structure makes initiating a war a positive, wars will continue to be initiated. Hamas has made its stance on the matter clear: it has sworn for the eradication of all Jews within the region, and considers civilian casualties on the Palestinian side not only acceptable, but advantageous.
I know that many people who criticize Israel and put pressure on us to stop the war in Gaza believe they are doing so for noble reasons. (For the record, I also believe many people have less altruistic reasons for their stance.) I know people like to point to the list of atrocities they believe Israel has committed. And, by contrast, the pro-Israel side is happy to respond with corresponding atrocities from the other side.
I honestly believe this is all far beyond irrelevant. The only question people should be asking is: how do we disincentivize the continuation of hostilities? And hostage deals that result in the release of terrorists, allow “aid” to come in (which, if history is any indication, will be used to further the construction of tunnels and other sources for attack on Israel), and give Hamas an opportunity to rearm, only incentivize the continuation of the war.
In other words, if you care about the innocent people on either side, you should be opposed to this kind of capitulation. Whatever you think about the morality of each side, more people will suffer with this approach.
Skin in the game
It’s easy to say things like that when your life isn’t on the line. I also don’t think that matters much. Either the philosophical, political, and economic analysis is correct, or it isn’t. Nonetheless, I do have skin in the game here. I still live in a warzone. I am less than 15 kilometers from the Lebanese border. We’ve had Hezbollah tunnels reaching into our surrounding cities. My family had to lock ourselves inside when Hezbollah paratroopers attempted to land in our city.
My wife (Miriam) and I have discussed this situation at length, many times, over the course of this war. If I’m ever taken hostage, I hope the Israeli government bombs the hell out of wherever I am being held. I say this not only because I believe it is the right, just, moral, ethical, and strategically correct thing to do. I say this because I am selfish:
I would rather die than be tortured by our enemies.
I would rather die than be leveraged to make my family and country less safe.
I would rather die than live the rest of my life a shell of my former self, haunted not only by the likely torture inflicted on me, but by the guilt of the harm to others resulting from my spared life.
I don’t know why this hostage deal went through now. I don’t know what pressures have been brought to bear on the leaders in Israel. I don’t know if they are good people trying to protect their citizens, nefarious power hungry cretins looking to abuse both the Israeli and Palestinian populace to stay in control, weak-willed toadies who do what they’re told by others, or simply stupid. But my own stance is clear.
But what about the Palestinians?
I said it above, and I’ll say it again: I truly do feel horrible for the trauma that the Palestinian people are going through. Not for the active terrorists mind you, I feel no qualms about those raising arms against us being destroyed. But everyone else, even those who wish me and my fellow Israelis harm. (And, if polling is to be believed, that’s the majority of Palestinians.) I would much rather that they not be suffering now, and that eventually through earned trust on both sides, everyone’s lots are improved.
But the framework being imposed by those who “love” peace isn’t allowing that to happen. Trust cannot be built when there’s a greater incentive to return to the use of force. I was strongly opposed to the 2005 disengagement from Gaza. But once it happened, it could have been one of those trust-building starting points. Instead, I saw many people justify further violence by Hamas—such as non-stop rocket attacks on the south of Israel—because Israel hadn’t done enough yet.
Notice how fundamentally flawed this mentality is, just from an incentives standpoint! Israel gives up control of land, something against its own overall interests and something desired by Palestinians, and is punished for it with increased violence against citizens. Hamas engaged in a brutal destruction of all of its opponents within the Palestinian population, launched attacks on Israel, and when Israel did respond with force, Israel was blamed for having not done enough to appease Hamas.
I know people will want to complicate this story by bringing up the laundry list of past atrocities, of assigning negative motivations to Israel and its leaders, and a million other evasions that are used to avoid actually solving this conflict. Instead, I beg everyone to just use basic logic.
The violence will continue as long as the violence gets results.
My blog posts and reading material have both been on a decidedly economics-heavy slant recently. The topic today, incentives, squarely falls into the category of economics. However, when I say economics, I’m not talking about “analyzing supply and demand curves.” I’m talking about the true basis of economics: understanding how human beings make decisions in a world of scarcity.
A fair definition of incentive is “a reward or punishment that motivates behavior to achieve a desired outcome.” When most people think about economic incentives, they’re thinking of money. If I offer my son $5 if he washes the dishes, I’m incentivizing certain behavior. We can’t guarantee that he’ll do what I want him to do, but we can agree that the incentive structure itself will guide and ultimately determine what outcome will occur.
The great thing about monetary incentives is how easy they are to talk about and compare. “Would I rather make $5 washing the dishes or $10 cleaning the gutters?” But much of the world is incentivized in non-monetary ways too. For example, using the “punishment” half of the definition above, I might threaten my son with losing Nintendo Switch access if he doesn’t wash the dishes. No money is involved, but I’m still incentivizing behavior.
And there are plenty of incentives beyond our direct control! My son is also incentivized to not wash dishes because it’s boring, or because he has some friends over that he wants to hang out with, or dozens of other things. Ultimately, the conflicting array of incentive structures placed on him will determine what actions he chooses to take.
Why incentives matter
A phrase I see often in discussions—whether they are political, parenting, economic, or business—is “if they could just do…” Each time I see that phrase, I cringe a bit internally. Usually, the underlying assumption of the statement is “if people would behave contrary to their incentivized behavior then things would be better.” For example:
If my kids would just go to bed when I tell them, they wouldn’t be so cranky in the morning.
If people would just use the recycling bin, we wouldn’t have such a landfill problem.
If people would just stop being lazy, our team would deliver our project on time.
In all these cases, the speakers are seemingly flummoxed as to why the people in question don’t behave more rationally. The problem is: each group is behaving perfectly rationally.
The kids have a high time preference, and care more about the joy of staying up now than the crankiness in the morning. Plus, they don’t really suffer the consequences of morning crankiness, their parents do.
No individual suffers much from their individual contribution to a landfill. If they stopped growing the size of the landfill, it would make an insignificant difference versus the amount of effort they need to engage in to properly recycle.
If a team doesn’t properly account for the productivity of individuals on a project, each individual receives less harm from their own inaction. Sure, the project may be delayed, company revenue may be down, and they may even risk losing their job when the company goes out of business. But their laziness individually won’t determine the entirety of that outcome. By contrast, they greatly benefit from being lazy by getting to relax at work, go on social media, read a book, or do whatever else they do when they’re supposed to be working.
My point here is that, as long as you ignore the reality of how incentives drive human behavior, you’ll fail at getting the outcomes you want.
If everything I wrote up until now made perfect sense, you understand the premise of this blog post. The rest of it will focus on a bunch of real-world examples to hammer home the point, and demonstrate how versatile this mental model is.
Running a company
Let’s say I run my own company, with myself as the only employee. My personal revenue will be 100% determined by my own actions. If I decide to take Tuesday afternoon off and go fishing, I’ve chosen to lose that afternoon’s revenue. Implicitly, I’ve decided that the enjoyment I get from an afternoon of fishing is greater than the potential revenue. You may think I’m being lazy, but it’s my decision to make. In this situation, the incentive (money) is perfectly aligned with my actions.
Compare this to a typical company/employee relationship. I might have a bank of Paid Time Off (PTO) days, in which case once again my incentives are relatively aligned. I know that I can take off 15 days throughout the year, and I’ve chosen to use half a day for the fishing trip. All is still good.
What about unlimited time off? Suddenly incentives are starting to misalign. I don’t directly pay a price for not showing up to work on Tuesday. Or Wednesday as well, for that matter. I might ultimately be fired for not doing my job, but that will take longer to work its way through the system than simply not making any money for the day taken off.
Compensation overall falls into this misaligned incentive structure. Let’s forget about taking time off. Instead, I work full time on a software project I’m assigned. But instead of using the normal toolchain we’re all used to at work, I play around with a new programming language. I get the fun and joy of playing with new technology, and potentially get to pad my resume a bit when I’m ready to look for a new job. But my current company gets slower results, less productivity, and is forced to subsidize my extracurricular learning.
When a CEO has a bonus structure based on profitability, he’ll do everything he can to make the company profitable. This might include things that actually benefit the company, like improving product quality, reducing internal red tape, or finding cheaper vendors. But it might also include destructive practices, like slashing the R&D budget to show massive profits this year, in exchange for a catastrophe next year when the next version of the product fails to ship.
Or my favorite example. My parents owned a business when I was growing up. They had a back office where they ran operations like accounting. All of the furniture was old couches from our house. After all, any money they spent on furniture came right out of their paychecks! But in a large corporate environment, each department is generally given a budget for office furniture, a budget which doesn’t roll over year-to-year. The result? Executives make sure to spend the entire budget each year, often buying furniture far more expensive than they would choose if it was their own money.
There are plenty of details you can quibble with above. It’s in a company’s best interest to give people downtime so that they can come back recharged. Having good ergonomic furniture can in fact increase productivity in excess of the money spent on it. But overall, the picture is pretty clear: in large corporate structures, you’re guaranteed to have mismatches between the company’s goals and the incentive structure placed on individuals.
Using our model from above, we can lament how lazy, greedy, and unethical the employees are for doing what they’re incentivized to do instead of what’s right. But that’s simply ignoring the reality of human nature.
Moral hazard
Moral hazard is a situation where one party is incentivized to take on more risk because another party will bear the consequences. Suppose I tell my son when he turns 21 (or whatever legal gambling age is) that I’ll cover all his losses for a day at the casino, but he gets to keep all the winnings.
What do you think he’s going to do? The most logical course of action is to place the largest possible bets for as long as possible, asking me to cover each time he loses, and taking money off the table and into his bank account each time he wins.
But let’s look at a slightly more nuanced example. I go to a bathroom in the mall. As I’m leaving, I wash my hands. It will take me an extra 1 second to turn off the water when I’m done washing. That’s a trivial price to pay. If I don’t turn off the water, the mall will have to pay for many liters of wasted water, benefiting no one. But I won’t suffer any consequences at all.
This is also a moral hazard, but most people will still turn off the water. Why? Usually due to some combination of other reasons such as:
We’re so habituated to turning off the water that we don’t even consider not turning it off. Put differently, the mental effort needed to not turn off the water is more expensive than the 1 second of time to turn it off.
Many of us have been brought up with a deep guilt about wasting resources like water. We have an internal incentive structure that makes the 1 second to turn off the water much less costly than the mental anguish of the waste we created.
We’re afraid we’ll be caught by someone else and face some kind of social repercussions. (Or maybe more than social. Are you sure there isn’t a law against leaving the water tap on?)
Even with all that in place, you may notice that many public bathrooms use automatic water dispensers. Sure, there’s a sanitation reason for that, but it’s also to avoid this moral hazard.
A common denominator in both of these is that the person taking the action that causes the liability (either the gambling or leaving the water on) is not the person who bears the responsibility for that liability (the father or the mall owner). Generally speaking, the closer together the person making the decision and the person incurring the liability are, the smaller the moral hazard.
It’s easy to demonstrate that by extending the casino example a bit. I said it was the father who was covering the losses of the gambler. Many children (though not all) would want to avoid totally bankrupting their parents, or at least financially hurting them. Instead, imagine that someone from the IRS shows up at your door, hands you a credit card, and tells you you can use it at a casino all day, taking home all the chips you want. The money is coming from the government. How many people would put any restriction on how much they spend?
And since we’re talking about the government already…
Government moral hazards
As I was preparing to write this blog post, the California wildfires hit. The discussions around those wildfires gave a huge number of examples of moral hazards. I decided to cherry-pick a few for this post.
The first and most obvious one: California is asking for disaster relief funds from the federal government. That sounds wonderful. These fires were a natural disaster, so why shouldn’t the federal government pitch in and help take care of people?
The problem is, once again, a moral hazard. In the case of the wildfires, California and Los Angeles both had ample actions they could have taken to mitigate the destruction of this fire: better forest management, larger fire department, keeping the water reservoirs filled, and probably much more that hasn’t come to light yet.
If the federal government bails out California, it will be a clear message for the future: your mistakes will be fixed by others. You know what kind of behavior that incentivizes? More risky behavior! Why spend state funds on forest management and extra firefighters—activities that don’t win politicians a lot of votes in general—when you could instead spend it on a football stadium, higher unemployment payments, or anything else, and then let the feds cover the cost of screw-ups.
You may notice that this is virtually identical to the 2008 “too big to fail” bail-outs. Wall Street took insanely risky behavior, reaped huge profits for years, and when they eventually got caught with their pants down, the rest of us bailed them out. “Privatizing profits, socializing losses.”
And here’s the absolute best part of this: I can’t even truly blame either California or Wall Street. (I mean, I do blame them, I think their behavior is reprehensible, but you’ll see what I mean.) In a world where the rules of the game implicitly include the bail-out mentality, you would be harming your citizens/shareholders/investors if you didn’t engage in that risky behavior. Since everyone is on the hook for those socialized losses, your best bet is to maximize those privatized profits.
There’s a lot more to government and moral hazard, but I think these two cases demonstrate the crux pretty solidly. But let’s leave moral hazard behind for a bit and get to general incentivization discussions.
Non-monetary competition
At least 50% of the economics knowledge I have comes from the very first econ course I took in college. That professor was amazing, and had some very colorful stories. I can’t vouch for the veracity of the two I’m about to share, but they definitely drive the point home.
In the 1970s, the US had an oil shortage. To “fix” this problem, they instituted price caps on gasoline, which of course resulted in insufficient gasoline. To “fix” this problem, they instituted policies where, depending on your license plate number, you could only fill up gas on certain days of the week. (Irrelevant detail for our point here, but this just resulted in people filling up their tanks more often, no reduction in gas usage.)
Anyway, my professor’s wife had a friend. My professor described in great detail how attractive this woman was. I’ll skip those details here since this is a PG-rated blog. In any event, she never had any trouble filling up her gas tank any day of the week. She would drive up, be told she couldn’t fill up gas today, bat her eyes at the attendant, explain how helpless she was, and was always allowed to fill up gas.
This is a demonstration of non-monetary compensation. Most of the time in a free market, capitalist economy, people are compensated through money. When price caps come into play, there’s a limit to how much monetary compensation someone can receive. And in that case, people find other ways of competing. Like this woman’s case: through using flirtatious behavior to compensate the gas station workers to let her cheat the rules.
The other example was much more insidious. Santa Monica had a problem: it was predominantly wealthy and white. They wanted to fix this problem, and decided to put in place rent controls. After some time, they discovered that Santa Monica had become wealthier and whiter, the exact opposite of their desired outcome. Why would that happen?
Someone investigated, and ended up interviewing a landlady that demonstrated the reason. She was an older white woman, and admittedly racist. Prior to the rent controls, she would list her apartments in the newspaper, and would be legally obligated to rent to anyone who could afford it. Once rent controls were in place, she took a different tack. She knew that she would only get a certain amount for the apartment, and that the demand for apartments was higher than the supply. That meant she could be picky.
She ended up finding tenants through friends-of-friends. Since it wasn’t an official advertisement, she wasn’t legally required to rent it out if someone could afford to pay. Instead, she got to interview people individually and then make them an offer. Normally, that would have resulted in receiving a lower rental price, but not under rent controls.
So who did she choose? A young, unmarried, wealthy, white woman. It made perfect sense. Women were less intimidating and more likely to maintain the apartment better. Wealthy people, she determined, would be better tenants. (I have no idea if this is true in practice or not, I’m not a landlord myself.) Unmarried, because no kids running around meant less damage to the property. And, of course, white. Because she was racist, and her incentive structure made her prefer whites.
You can deride her for being racist, I won’t disagree with you. But it’s simply the reality. Under the non-rent-control scenario, her profit motive for money outweighed her racism motive. But under rent control, the monetary competition was removed, and she was free to play into her racist tendencies without facing any negative consequences.
Bureaucracy
These were the two examples I remember from that course. But non-monetary compensation pops up in many more places. One highly pertinent example is bureaucracies. Imagine you have a government office, or a large corporation’s acquisition department, or the team that apportions grants at a university. In all these cases, you have a group of people making decisions about handing out money that has no monetary impact on them. If they give to the best qualified recipients, they receive no raises. If they spend the money recklessly on frivolous projects, they face no consequences.
Under such an incentivization scheme, there’s little to encourage the bureaucrats to make intelligent funding decisions. Instead, they’ll be incentivized to spend the money where they recognize non-monetary benefits. This is why it’s so common to hear about expensive meals, gift bags at conferences, and even more inappropriate ways of trying to curry favor with those that hold the purse strings.
Compare that ever so briefly with the purchases made by a small mom-and-pop store like my parents owned. Could my dad take a bribe to buy from a vendor who’s ripping him off? Absolutely he could! But he’d lose more on the deal than he’d make on the bribe, since he’s directly incentivized by the deal itself. It would make much more sense for him to go with the better vendor, save $5,000 on the deal, and then treat himself to a lavish $400 meal to celebrate.
Government incentivized behavior
This post is getting longer than I’d intended, so I’ll finish off with this section and make it a bit briefer. Beyond all the methods mentioned above, government has another mechanism for modifying behavior: directly changing incentives via legislation, regulation, and monetary policy. Let’s see some examples:
Artificial modification of interest rates encourages people to take on more debt than they would in a free capital market, leading to malinvestment and a consumer debt crisis, and causing the boom-bust cycle we all painfully experience.
Going along with that, giving tax breaks on interest payments further artificially incentivizes people to take on debt that they wouldn’t otherwise.
During COVID-19, at some points unemployment benefits were greater than minimum wage, incentivizing people to stay home rather than get a job, leading to reduced overall productivity in the economy and more printed dollars for benefits. In other words, it was a perfect recipe for inflation.
The tax code gives deductions to “help” people. That might be true, but the real impact is incentivizing people to make decisions they wouldn’t have otherwise. For example, giving out tax deductions on children encourages having more kids. Tax deductions on childcare and preschools incentivizes dual-income households. Whether or not you like the outcomes, it’s clear that it’s government that’s encouraging these outcomes to happen.
Tax incentives cause people to engage in behavior they wouldn’t otherwise (daycare+working mother, for example).
Inflation means that the value of your money goes down over time, which encourages people to spend more today, when their money has a larger impact. (Milton Friedman described this as high living.)
Conclusion
The idea here is simple, and fully encapsulated in the title: incentives determine outcomes. If you want to know how to get a certain outcome from others, incentivize them to want that to happen. If you want to understand why people act in seemingly irrational ways, check their incentives. If you’re confused why leaders (and especially politicians) seem to engage in destructive behavior, check their incentives.
We can bemoan these realities all we want, but they are realities. While there are some people who have a solid internal moral and ethical code, and that internal code incentivizes them to behave against their externally-incentivized interests, those people are rare. And frankly, those people are self-defeating. People should take advantage of the incentives around them. Because if they don’t, someone else will.
(If you want a literary example of that last comment, see the horse in Animal Farm.)
How do we improve the world under these conditions? Make sure the incentives align well with the overall goals of society. To me, it’s a simple formula:
Focus on free trade, value for value, as the basis of a society. In that system, people are always incentivized to provide value to other people.
Reduce the size of bureaucracies and large groups of all kinds. The larger an organization becomes, the farther the consequences of decisions are from those who make them.
And since the nature of human beings will be to try and create areas where they can control the incentive systems to their own benefits, make that as difficult as possible. That comes in the form of strict limits on government power, for example.
And even if you don’t want to buy in to this conclusion, I hope the rest of the content was educational, and maybe a bit entertaining!
At work a few weeks back, I found myself digging into profile reports, trying to determine why our program was running so slowly. Despite having the extremely obvious-in-retrospect data in front of me, I wasted a lot of time speeding up code that turned out to not move the needle at all.
Although perhaps it will be interesting only to future me, I thought it would be a good exercise to write up the experience—if only so I learn the lesson about how to read profiles and not make the same mistake again.
Some Context
I’m currently employed to work on a compiler. The performance has never been stellar, in that we were usually seeing about 5s to compile programs, even trivially small ones consisting of less than a hundred instructions. It was painful, but not that painful, since the test suite still finished in a minute or two. It was a good opportunity to get a coffee. I always assumed that the time penalties we were seeing were constant factors; perhaps it took a second or two to connect to Z3 or something like that.
But then we started unrolling loops, which turned trivially small programs into merely small programs, and our performance ballooned. Now we were looking at 45s for some of our tests! Uh oh! That’s no longer in the realm of constant factors, and it was clear that something was asymptotically wrong.
So I fired up GHC with the trusty old -prof flag, and ran the test suite in +RTS -p mode, which instruments the program with all sorts of profiling goodies. After a few minutes, the test suite completed, and left a test-suite.prof file lying around in the current directory. You can inspect such things by hand, but tools like profiteur make the experience much nicer.
Without further ado, here’s what our profile looked like:
Now we’re in business. I dutifully dug into toSSA, the transforms, and collectGarbage. I cached some things, used better data structures, stopped appending lists, you know, the usual Haskell tricks. My work was rewarded, in that I managed to shave 80% off the runtime of our program.
A few months later, we wrote a bigger program and fed it to the compiler. This one didn’t stop compiling. We left it overnight.
Uh oh. Turns out I hadn’t fixed the problem. I’d only papered over it.
Retrospective
So what went wrong here? Quite a lot, in fact! And worse, I had all of the information all along, but managed to misinterpret it at several steps of the process.
Unwinding the story stack, the most salient aspect of having not solved the problem was reducing the runtime by only 80%. Dramatic percentages feel like amazing improvements, but that’s because human brains are poorly designed for building software. In the real world, big percentages are fantastic. In software, they are linear improvements.
That is to say that a percentage-based improvement is \(O(n)\) faster in the best case. My efforts improved our runtime from 45s to 9s. Which feels great, but the real problem is that this program is measured in seconds at all.
It’s more informative to think in terms of orders of magnitude. Taking 45s on a ~3GHz processor is on the order of \(10^{11}\) instructions, while 9s is \(10^{10}\). How the hell is it taking us TEN BILLION instructions to compile a dinky little program? That’s the real problem. Improving things from one hundred billion down to ten billion is no longer very impressive at all.
To get a sense of the scale here, even if we spent 1M cycles (which feels conservatively expensive) for each instruction we wanted to compile, we should still be looking at < 0.1s. Somehow we are over 1000x worse than that.
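As a rough back-of-the-envelope check of those numbers (taking the ~3GHz figure above and assuming on the order of a hundred instructions to compile):

\[ 45\,\text{s} \times 3\times10^{9}\,\text{cycles/s} \approx 1.4\times10^{11}\ \text{cycles}, \qquad 100\ \text{instructions} \times 10^{6}\,\text{cycles/instruction} = 10^{8}\ \text{cycles} \approx 0.03\,\text{s}. \]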
So that’s one mistake I made: being impressed by extremely marginal improvements. Bad Sandy.
The other mistake came from my interpretation of the profile. As a quick pop quiz, scroll back up to the profile and see if you can spot where the problem is.
After expanding a few obviously-not-the-problem cost centres that were each 100% of the runtime, I turned my brain off and opened all of the 100% nodes. But in doing so, I accidentally breezed past the real problem. The real problem is either that compileProgram takes 100% of the time of the test, or that transformSSA takes 100% of compiling the program. Why’s that? Because unlike main and co, test does more work than just compiling the program. It also does non-trivial IO to produce debugging outputs, and property checks the resulting programs. Similarly for compileProgram, which does a great deal more than transformSSA.
This is somewhat of a philosophical enlightenment. The program execution hasn’t changed at all, but our perspective has. Rather than micro-optimizing the code that is running, this new perspective suggests we should focus our effort on determining why that code is running in the first place.
Digging through transformSSA made it very obvious the problem was an algorithmic one—we were running an unbounded loop that terminated on convergence, where each step took \(O(n^2)\) work. When I stopped to actually read the code, the problem was immediate, and the solution obvious.
The lesson? Don’t read the profile. Read the code. Use the profile to focus your attention.
In my previous two posts "Ways to use torch.compile" and "Ways to use torch.export", I often said that PyTorch would be good for a use case, but there might be some downsides. Some of the downsides are foundational and difficult to remove. But some... just seem like a little something is missing from PyTorch. In this post, here are some things I hope we will end up shipping in 2025!
Improving torch.compile
A programming model for PT2. A programming model is an abstract description of the system that is both simple (so anyone can understand it and keep it in their head all at once) and can be used to predict the system's behavior. The torch.export programming model is an example of such a description. Beyond export, we would like to help users understand why all aspects of PT2 behave the way they do (e.g., via improved error messages), and give simple, predictable tools for working around problems when they arise. The programming model helps us clearly define the intrinsic complexity of our compiler, which we must educate users about. This is a big effort involving many folks on the PyTorch team and I hope we can share more about this effort soon.
Pre-compilation: beyond single graph export. Whenever someone realizes that torch.compile compilation is taking a substantial amount of time on expensive cluster machines, the first thing they ask is, "Why don't we just compile it in advance?" Supporting precompilation with the torch.compile API exactly as-is is not so easy; unlike a traditional compiler which gets the source program directly as input, users of torch.compile must actually run their Python program to hit the regions of code that are intended to be compiled. Nor can these regions be trivially enumerated and then compiled: not only must we know all the metadata of the input tensors flowing into a region, a user might not even know what the compiled graphs are if a model has graph breaks.
OK, but why not just run the model, dump all the compiled products, and then reuse them later? This works! Here is a POC from Nikita Shulga where a special decorator aot_compile_sticky_cache swaps between exporting a graph and running the exported product. Zhengxu Chen used a similar idea to export Whisper as a few distinct graphs, which he then manually stitched together in C++ to get a Python-free version of Whisper. If you want training to work, you can more directly integrate AOTInductor as an Inductor backend, e.g., as seen in this POC. We are a stone's throw away from working precompilation, which can guarantee no compilation at runtime; we just need to put the pieces together!
Improving caching further. There are some gaps with caching which we hope to address in the near future: (1) loading Triton cache artifacts takes a long time because we still re-parse the Triton code before doing a cache lookup (James Wu is on this), (2) if you have a lot of small graphs, remote cache ends up having to do lots of small network requests, instead of one batched network request at the beginning (Oguz Ulgen recently landed this), (3) AOTAutograd cache is not fully rolled out yet (James Wu again). These collectively should be worth a 2x speedup or even more on warm cache time.
Fix multithreading. We should just make sure multithreading works, doing the testing and fiddly thread safety auditing needed to make it work. Here's a list of multithreading related issues.
Improving torch.export
Draft mode export. Export requires a lot of upfront work to even get an exported artifact in the first place. Draft mode export capitalizes on the idea that it's OK to generate an unsound "draft" graph early in the export, because even an incorrect graph is useful for kicking the tires on the downstream processing that happens after export. A draft export gives you a graph, and it also gives you a report describing what potential problems need to be fixed to get some guarantees about the correctness of the export. You can then chip away on the problems in the report until everything is green. One of the biggest innovations of draft-mode export is pervasive use of real tensor propagation when doing export: you run the export with actual tensors, so you can always trace through code, even if it is doing spicy things like data-dependent control flow.
Libtorch-free AOTInductor. AOTInductor generated binaries have a relatively small ABI surface that needs to be implemented. This hack from the most recent CUDA Mode meetup shows that you can just create an alternate implementation of the ABI that has no dependence on libtorch. This makes your deployed binary size much smaller!
Support for bundling CUDA kernels into AOTInductor. AOTInductor already supports directly bundling Triton kernels into the generated binary, but traditional CUDA kernels cannot be bundled in this way. There's no reason this has to be the case though: all we're doing is bundling cubins in both cases. If we have the ability to bundle traditional CUDA kernels into AOTInductor, this means you could potentially directly embed custom operators into AOTInductor binaries, which is nice because then those operators no longer have to be offered by the runtime (especially if you're commonly iterating on these kernels!)
Export multigraphs. Export's standard model is to give you a single graph that you call unconditionally. But it's easy to imagine a level of indirection on top of these graphs, where we can dispatch between multiple graphs depending on some arguments to the model. For example, if you have a model that optionally takes an extra Tensor argument, you can simply have two graphs, one for when the Tensor is absent, and one for when it is present.
ABI stable PyTorch extensions. It's hard work being a third-party PyTorch extension with native code, because whenever there's a new release of Python or PyTorch you have to rebuild all of your wheels. If there was a limited ABI that you could build your extension against that didn't expose CPython and only relied on a small, stable ABI of PyTorch functions, your binary packaging situation would be much simpler! And if an extension relied on a small ABI, it could even be bundled with AOTInductor binary, letting these export products be truly package agnostic (one of our lessons we learned with torch.package is picking the split between "what is packaged" and "what is not" is very difficult, and people would much rather just have everything be packaged.) Jane Xu is investigating how to do this, and separately, Scott Wolchok has been refactoring headers in libtorch so that a small set of headers can be used independently of the rest of libtorch.
GHC since version 9.8 allows us to create callbacks from JS to Haskell code, which enables us to create full-fledged browser apps.
This article shows how to use the JS backend with foreign component libraries.
When people talk about functional programming in modern multi-paradigm languages, they usually mention Rust, Scala, or Kotlin. You rarely hear Swift being mentioned. This is odd, as one might argue that, of these languages, Swift places the strongest emphasis on functional programming.
In this talk, I will explain the core functional programming features of Swift, including its expressive type system, value types, and mutability control. Furthermore, I will discuss how Swift’s language design is influenced by the desire to create a language that addresses the whole spectrum from low-level systems programming up to high-level applications with sophisticated graphical user interfaces. Beyond the core language itself, functional programming also permeates Swift’s rich ecosystem of libraries. To support this point, I will outline some FP-inspired core libraries, covering concepts from functional data structures over functional reactive programming to declarative user interfaces.
Finally, I will briefly summarise practical considerations for using Swift in your own projects. This includes the cross-platform toolchain, the package manager, and interoperability with other languages.
The seat layout fits on a grid. Each position is either floor (.), an empty seat (L), or an occupied seat (#). For example, the initial seat layout might look like this:
All decisions are based on the number of occupied seats adjacent to a given seat (one of the eight positions immediately up, down, left, right, or diagonal from the seat).
The following rules are applied to every seat simultaneously:
If a seat is empty (L) and there are no occupied seats adjacent to it, the seat becomes occupied.
If a seat is occupied (#) and four or more seats adjacent to it are also occupied, the seat becomes empty.
Otherwise, the seat’s state does not change.
Floor (.) never changes; seats don’t move, and nobody sits on the floor.
This is a classic Cellular Automaton problem. We need to write a program that simulates seats being occupied till no further seats are emptied or occupied, and returns the final number of occupied seats. Let’s solve this in Haskell.
The Cellular Automaton
First, some imports:
{-# LANGUAGE GHC2021 #-}
{-# LANGUAGE LambdaCase #-}
{-# LANGUAGE PatternSynonyms #-}
{-# LANGUAGE TypeFamilies #-}

module Main where

import Control.Arrow ((>>>))
import Control.Comonad (Comonad (..))
import Data.Function (on)
import Data.List (intercalate, nubBy)
import Data.Massiv.Array (Ix2 (..))
import Data.Massiv.Array qualified as A
import Data.Massiv.Array.Unsafe qualified as AU
import Data.Proxy (Proxy (..))
import Data.Vector.Generic qualified as VG
import Data.Vector.Generic.Mutable qualified as VGM
import Data.Vector.Unboxed qualified as VU
import System.Environment (getArgs, getProgName)
We use the GHC2021 extension here that enables a lot of useful GHC extensions by default. Our non-base imports come from the comonad, massiv and vector libraries.
A cellular automaton consists of a regular grid of cells, each in one of a finite number of states.
For each cell, a set of cells called its neighborhood is defined relative to the specified cell.
An initial state is selected by assigning a state for each cell.
A new generation is created, according to some fixed rule that determines the new state of each cell in terms of the current state of the cell and the states of the cells in its neighborhood.
Let’s model the automaton of the challenge using Haskell:
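The cell type and the rule might look like the following sketch (the derived instances and parseCell’s error handling are assumptions; the names and the rule itself come from the description below):

newtype Cell = Cell Char
  deriving (Eq, Show)

pattern Empty :: Cell
pattern Empty = Cell 'L'

pattern Occupied :: Cell
pattern Occupied = Cell '#'

pattern Floor :: Cell
pattern Floor = Cell '.'

parseCell :: Char -> Cell
parseCell = \case
  'L' -> Empty
  '#' -> Occupied
  '.' -> Floor
  c -> error $ "Invalid character: " <> show c

-- The puzzle rule: an empty seat with no occupied neighbours becomes occupied,
-- and an occupied seat with four or more occupied neighbours becomes empty.
rule :: Cell -> [Cell] -> Cell
rule cell neighbours = case cell of
  Empty | occupied == 0 -> Occupied
  Occupied | occupied >= 4 -> Empty
  _ -> cell
  where
    occupied = length (filter (== Occupied) neighbours)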
A cell in the grid can be in the empty, occupied or floor state. We encode this with the pattern synonyms Empty, Occupied and Floor over the Cell newtype, which wraps a Char.
The parseCell function parses a character to a Cell. The rule function implements the automaton rule.
The Solution
We are going to solve this puzzle in three different ways. So, let’s abstract the details and solve it top-down.
class (Eq a) => Grid a where
  fromLists :: [[Cell]] -> a
  step :: a -> a
  toLists :: a -> [[Cell]]

solve :: forall a. (Grid a) => Proxy a -> [[Cell]] -> Int
solve _ =
  fromLists @a
    >>> fix step
    >>> toLists
    >>> fmap (filter (== Occupied) >>> length)
    >>> sum
  where
    fix f x = let x' = f x in if x == x' then x else fix f x'
We solve the challenge using the Grid typeclass that all our different solutions implement. A grid is specified by three functions:
fromLists: converts a list of lists of cells to the grid.
step: runs one step of the CA simulation.
toLists: converts the grid back to a list of lists of cells.
The solve function calculates the final number of occupied seats for any instance of the Grid typeclass by running the simulation till it converges.
Now, we use solve to solve the challenge in three ways depending on the command line argument supplied:
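A sketch of what main might look like (the flags -z, -a and -s match the command lines shown later; the argument and error handling here are assumptions):

main :: IO ()
main = do
  args <- getArgs
  progName <- getProgName
  case args of
    [mode, fileName] -> do
      cells <- map (map parseCell) . lines <$> readFile fileName
      print $ case mode of
        "-z" -> solve (Proxy @(ZGrid Cell)) cells -- zipper-based solution
        "-a" -> solve (Proxy @(AGrid Cell)) cells -- array-based solution
        "-s" -> solve (Proxy @(SGrid Cell)) cells -- stencil-based solution
        _ -> error $ "Unknown mode: " <> mode
    _ -> error $ "Usage: " <> progName <> " -z|-a|-s <input-file>"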
We have set up the top (main) and the bottom (rule) of our solutions. Now let’s work on the middle part.
The Zipper
To simulate a CA, we need to focus on each cell of the automaton grid, and run the rule for the cell. What is the first thing that comes to the minds of functional programmers when we want to focus on a part of a data structure? Zippers!
Zippers are a special view of data structures, which allow one to navigate and easily update them. A zipper always has a focus or cursor which is the current element of the data structure we are “at”. Alongside, it also captures the rest of the data structure in a way that makes it easy to move around it. We can update the data structure by updating the element at the focus.
The first way to solve the challenge uses a zipper for once-nested lists. Let’s start with creating the zipper for a simple list:
data Zipper a = Zipper [a] a [a] deriving (Eq, Functor)

zPosition :: Zipper a -> Int
zPosition (Zipper left _ _) = length left

zLength :: Zipper a -> Int
zLength (Zipper left _ right) = length left + 1 + length right

listToZipper :: [a] -> Zipper a
listToZipper = \case
  [] -> error "Cannot create Zipper from empty list"
  (x : xs) -> Zipper [] x xs

zipperToList :: Zipper a -> [a]
zipperToList (Zipper left focus right) = reverse left <> (focus : right)

pShowZipper :: (Show a) => Zipper a -> String
pShowZipper (Zipper left focus right) =
  unwords $ map show (reverse left) <> (("[" <> show focus <> "]") : map show right)

zLeft :: Zipper a -> Zipper a
zLeft z@(Zipper left focus right) = case left of
  [] -> z
  x : xs -> Zipper xs x (focus : right)

zRight :: Zipper a -> Zipper a
zRight z@(Zipper left focus right) = case right of
  [] -> z
  x : xs -> Zipper (focus : left) x xs
A list zipper has a focus element, and two lists that capture the elements to the left and right of the focus. We use it through these functions:
zPosition returns the zero-indexed position of the focus in the zipper.
zLength returns the length of the zipper.
listToZipper and zipperToList do conversions between lists and zippers.
pShowZipper pretty-prints a zipper, highlighting the focus.
zLeft and zRight move the zipper’s focus to left and right respectively.
ZGrid is a newtype over a zipper of zippers. It has functions similar to Zipper for getting focus, position and size, for conversions to-and-from lists of lists, and for pretty-printing.
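A sketch of ZGrid and the helpers used later (the names zgPosition, zgSize, listsToZGrid and zGridToLists are assumptions; the pretty-printer is omitted):

newtype ZGrid a = ZGrid (Zipper (Zipper a)) deriving (Eq, Functor)

zgPosition :: ZGrid a -> (Int, Int)
zgPosition (ZGrid rows@(Zipper _ focusRow _)) = (zPosition rows, zPosition focusRow)

zgSize :: ZGrid a -> (Int, Int)
zgSize (ZGrid rows@(Zipper _ focusRow _)) = (zLength rows, zLength focusRow)

listsToZGrid :: [[a]] -> ZGrid a
listsToZGrid = ZGrid . listToZipper . map listToZipper

zGridToLists :: ZGrid a -> [[a]]
zGridToLists (ZGrid rows) = map zipperToList (zipperToList rows)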
Next, the functions to move the focus in the grid:
zgUp :: ZGrid a -> ZGrid a
zgUp (ZGrid rows) = ZGrid $ zLeft rows

zgDown :: ZGrid a -> ZGrid a
zgDown (ZGrid rows) = ZGrid $ zRight rows

zgLeft :: ZGrid a -> ZGrid a
zgLeft (ZGrid rows) = ZGrid $ fmap zLeft rows

zgRight :: ZGrid a -> ZGrid a
zgRight (ZGrid rows) = ZGrid $ fmap zRight rows
It works as expected. Now, how do we use this to simulate a CA?
The Comonad
A CA requires us to focus on each cell of the grid, and run a rule for the cell that depends on the neighbours of the cell. A Haskell abstraction that neatly fits this requirement is Comonad.
Comonads are duals of Monads. We don’t need to learn everything about them for now. For our purpose, Comonad provides an interface that exactly lines up with what is needed for simulating CA:
class Functor w => Comonad w where
  extract :: w a -> a
  duplicate :: w a -> w (w a)
  extend :: (w a -> b) -> w a -> w b
  {-# MINIMAL extract, (duplicate | extend) #-}
Assuming we can make ZGrid a Comonad instance, the signatures of the above functions, specialized to ZGrid Cell, would be:
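Written out (with the rule’s result also being a Cell), that specialization is:

extract :: ZGrid Cell -> Cell
duplicate :: ZGrid Cell -> ZGrid (ZGrid Cell)
extend :: (ZGrid Cell -> Cell) -> ZGrid Cell -> ZGrid Cell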
The extract function would return the current focus of the grid.
The duplicate function would return a grid of grids, one inner grid for each possible focus of the input grid.
The extend function would apply the automata rule to each possible focus of the grid, and return a new grid.
The nice part is, we need to implement only the extract and duplicate functions, and the generation of the new grid is taken care of automatically by the default implementation of the extend function. Let’s write the comonad instance for ZGrid.
First, we write the comonad instance for Zipper:
instance Comonad Zipper where
  extract (Zipper _ focus _) = focus
  duplicate zipper = Zipper left zipper right
    where
      pos = zPosition zipper
      left = iterateN pos zLeft $ zLeft zipper
      right = iterateN (zLength zipper - pos - 1) zRight $ zRight zipper

iterateN :: Int -> (a -> a) -> a -> [a]
iterateN n f = take n . iterate f
extract for Zipper simply returns the input zipper’s focus element.
duplicate returns a zipper of zippers, with the input zipper as its focus, and the left and right lists of zippers as variations of the input zipper with all possible focuses. Trying out the functions in GHCi gives a better idea:
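For example, a session along these lines (evaluated by hand here, so treat the output as illustrative):

ghci> z = listToZipper [1, 2, 3]
ghci> pShowZipper z
"[1] 2 3"
ghci> pShowZipper (zRight z)
"1 [2] 3"
ghci> map pShowZipper (zipperToList (duplicate z))
["[1] 2 3","1 [2] 3","1 2 [3]"]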
zGridNeighbours returns the neighbour cells of the currently focussed cell of the grid. It does so by moving the focus in all eight directions, and extracting the new focuses. We also make sure to return unique cells by their position.
stepZGrid implements one step of the CA using the extend function of the Comonad typeclass. We call extend with a function that takes the current grid, and returns the result of running the CA rule on its focus and the neighbours of the focus.
Finally, we plug our functions into the ZGrid Cell instance of Grid, as sketched below.
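A sketch of these pieces, built on the helpers above (the Comonad instance shown is one plausible reconstruction rather than necessarily the original code):

instance Comonad ZGrid where
  extract (ZGrid (Zipper _ (Zipper _ focus _) _)) = focus
  duplicate grid = ZGrid $ Zipper (mkRow <$> up) (mkRow grid) (mkRow <$> down)
    where
      (row, col) = zgPosition grid
      (rows, cols) = zgSize grid
      -- grids refocussed on the rows above and below, nearest first
      up = iterateN row zgUp (zgUp grid)
      down = iterateN (rows - row - 1) zgDown (zgDown grid)
      -- all horizontal refocussings of a grid, as a row of grids
      mkRow g =
        Zipper
          (iterateN col zgLeft (zgLeft g))
          g
          (iterateN (cols - col - 1) zgRight (zgRight g))

zGridNeighbours :: ZGrid a -> [a]
zGridNeighbours grid =
  map snd . nubBy ((==) `on` fst) $
    [ (zgPosition moved, extract moved)
      | move <- moves,
        let moved = move grid,
        zgPosition moved /= zgPosition grid -- drop moves blocked by the border
    ]
  where
    moves =
      [ zgUp, zgDown, zgLeft, zgRight,
        zgUp >>> zgLeft, zgUp >>> zgRight,
        zgDown >>> zgLeft, zgDown >>> zgRight
      ]

stepZGrid :: ZGrid Cell -> ZGrid Cell
stepZGrid = extend $ \grid -> rule (extract grid) (zGridNeighbours grid)

instance Grid (ZGrid Cell) where
  fromLists = listsToZGrid
  step = stepZGrid
  toLists = zGridToLists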
❯ nix-shell -p "ghc.withPackages (p: [p.massiv p.comonad])" \
--run "ghc --make seating-system.hs -O2"
[1 of 2] Compiling Main ( seating-system.hs, seating-system.o )
[2 of 2] Linking seating-system
❯ time ./seating-system -z input.txt
2243
2.72 real 2.68 user 0.02 sys
I verified with the Advent of Code website that the result is correct. We also see the time elapsed, which is 2.7 seconds. That seems pretty high. Can we do better?
The Array
The problem with the zipper approach is that lists in Haskell are too slow. Some operations on them like length are \(O(n)\). They are also lazy in spine and value, and build up thunks. We could switch to a different list-like data structure, or cache the grid size and neighbour indices for each index to make it run faster. Or we could try an entirely different approach.
Let’s think about it for a bit. Zippers intermix two things together: the data in the grid, and the focus. When running a step of the CA, the grid data does not change when focussing on all possible focuses, only the focus itself changes. What if we separate the data from the focus? Maybe that’ll make it faster. Let’s try it out.
Let’s model the grid as a combination of a 2D array and an index into the array. We are using the arrays from the massiv library.
data AGrid a = AGrid {aGrid :: A.Array A.B A.Ix2 a, aGridFocus :: A.Ix2}
  deriving (Eq, Functor)
A.Ix2 is massiv’s way of representing an index into a 2D array, and is essentially the same as a two-tuple of Ints. A.Array A.B A.Ix2 a here means a 2D boxed array of as. massiv uses representation strategies to decide how arrays are actually represented in memory, among which are boxed, unboxed, primitive, storable, delayed etc. Even though primitive and storable arrays are faster, we have to go with boxed arrays here because the Functor instance of A.Array exists only for boxed and delayed arrays, and boxed ones are the faster of the two for our purpose.
It is actually massively easier to write the Comonad instance for AGrid:
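A sketch (the specific massiv calls, such as computeAs, are assumptions):

instance Comonad AGrid where
  extract (AGrid grid focus) = grid A.! focus
  extend f (AGrid grid focus) =
    -- re-run f with every index as the focus; imap yields a delayed array,
    -- which we force back into a boxed one
    AGrid (A.computeAs A.B $ A.imap (\ix _ -> f (AGrid grid ix)) grid) focus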
The extract implementation simply looks up the element from the array at the focus index. This time, we don’t need to implement duplicate because it is easier to implement extend directly. We map with index (A.imap) over the grid, calling the function f for the variation of the grid with the index as the focus.
Next, we write the CA step:
listsToAGrid :: [[Cell]] -> AGrid Cell
listsToAGrid = A.fromLists' A.Seq >>> flip AGrid (0 :. 0)

aGridNeighbours :: AGrid a -> [a]
aGridNeighbours (AGrid grid (x :. y)) =
  [ grid A.! (x + i :. y + j)
    | i <- [-1, 0, 1],
      j <- [-1, 0, 1],
      (x + i, y + j) /= (x, y),
      validIndex (x + i, y + j)
  ]
  where
    A.Sz (rowCount :. colCount) = A.size grid
    validIndex (a, b) = and [a >= 0, b >= 0, a < rowCount, b < colCount]

stepAGrid :: AGrid Cell -> AGrid Cell
stepAGrid = extend $ \grid -> rule (extract grid) (aGridNeighbours grid)

instance Grid (AGrid Cell) where
  fromLists = listsToAGrid
  step = stepAGrid
  toLists = aGrid >>> A.toLists
listsToAGrid converts a list of lists of cells into an AGrid focussed at (0,0). aGridNeighbours finds the neighbours of the current focus of a grid by directly looking up the valid neighbour indices into the array. stepAGrid calls extract and aGridNeighbours to implement the CA step, much like the ZGrid case. And finally, we create the AGrid Cell instance of Grid.
Let’s compile and run it:
❯ rm ./seating-system
❯ nix-shell -p "ghc.withPackages (p: [p.massiv p.comonad])" \
--run "ghc --make seating-system.hs -O2"
[2 of 2] Linking seating-system
❯ time ./seating-system -a input.txt
2243
0.10 real 0.09 user 0.00 sys
Woah! It takes only 0.1 second this time. Can we do even better?
The Stencil
massiv has a construct called Stencil that can be used for simulating CA:
Stencil is abstract description of how to handle elements in the neighborhood of every array cell in order to compute a value for the cells in the new array.
That sounds like exactly what we need. Let’s try it out next.
With stencils, we do not need the instance of Comonad for the grid. So we can switch to the faster unboxed array representation:
newtype instance VU.MVector s Cell = MV_Char (VU.MVector s Char)
newtype instance VU.Vector Cell = V_Char (VU.Vector Char)
deriving instance VGM.MVector VU.MVector Cell
deriving instance VG.Vector VU.Vector Cell
instance VU.Unbox Cell

type SGrid a = A.Array A.U A.Ix2 a
First five lines make Cell an instance of the Unbox typeclass. We chose to make Cell a newtype wrapper over Char because Char has an Unbox instance.
Then we define a new grid type SGrid that is a 2D unboxed array.
Now, we define the stencil and the step function for our CA:
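A sketch of how this might look (AU.makeUnsafeStencil and the A.Fill Floor border handling are assumptions, guided by the imports and the description below):

-- a 3-by-3 stencil whose focus sits at (1, 1); get reads cells relative to the focus
ruleStencil :: A.Stencil A.Ix2 Cell Cell
ruleStencil = AU.makeUnsafeStencil (A.Sz (3 :. 3)) (1 :. 1) $ \_ get ->
  rule
    (get (0 :. 0))
    [get (i :. j) | i <- [-1, 0, 1], j <- [-1, 0, 1], (i, j) /= (0, 0)]

stepSGrid :: SGrid Cell -> SGrid Cell
stepSGrid = A.computeP . A.mapStencil (A.Fill Floor) ruleStencil

instance Grid (SGrid Cell) where
  fromLists = A.fromLists' A.Seq
  step = stepSGrid
  toLists = A.toLists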
We make a stencil of size 3-by-3, where the focus is at index (1,1) relative to the stencil’s top-left cell. In the callback function, we use the supplied get function to get the neighbours of the focus by using indices relative to the focus, and call rule with the cells at focus and neighbour indices.
Then we write the step function stepSGrid that maps the stencil over the grid. Finally we put everything together in the SGrid Cell instance of Grid.
Let’s compile and run it:
❯ rm ./seating-system
❯ nix-shell -p "ghc.withPackages (p: [p.massiv p.comonad])" \
--run "ghc --make seating-system.hs -O2"
[2 of 2] Linking seating-system
❯ time ./seating-system -s input.txt
2243
0.08 real 0.07 user 0.00 sys
It is only a bit faster than the previous solution. But, this time we have another trick up our sleeves. Did you notice A.computeP we sneaked in there? With stencils, we can now run the step for all cells in parallel! Let’s recompile it with the right options and run it again:
❯ rm ./seating-system
❯ nix-shell -p "ghc.withPackages (p: [p.massiv p.comonad])" \
--run "ghc --make seating-system.hs -O2 -threaded -rtsopts"
[2 of 2] Linking seating-system
❯ time ./seating-system -s input.txt +RTS -N
2243
0.04 real 0.11 user 0.05 sys
The -threaded option enables multithreading, and the +RTS -N option makes the process use all CPU cores. We get a nice speedup of 2x over the single-threaded version.
Bonus Round: Simulation Visualization
Since you’ve read the entire post, here is a bonus visualization of the CA simulation for you (warning: lots of fast blinking):
Play the simulation
That’s it for this post! I hope you enjoyed it and took something away from it. The full code for this post is available here.
If you have any questions or comments, please leave a comment below. If you liked this post, please share it. Thanks for reading!
The reason for using a newtype instead of a data is explained in the Stencil section.
If you are unfamiliar, >>> is the left-to-right function composition function: f >>> g = g . f.
Cartoonist Ann Telnaes has quit the Washington Post, after they refused to publish one of her cartoons, depicting Mark Zuckerberg (Meta), Sam Altman (Open AI), Patrick Soon-Shiong (LA Times), the Walt Disney Company (ABC News), and Jeff Bezos (Amazon & Washington Post). All that exists is her preliminary sketch, above. Why is this important? See her primer below. (Spotted via Boing Boing.)
Previously, I discussed the value proposition of torch.compile. While doing so, I observed a number of downsides (long compile time, complicated operational model, lack of packaging) that were intrinsic to torch.compile's API contract, which emphasized being able to work on Python code as is, with minimal intervention from users. torch.export occupies a different spot in the tradeoff space: in exchange for more upfront work making a model exportable, it allows for use of PyTorch models in environments where using torch.compile as is would be impossible.
Enable end-to-end C++ CPU/GPU Inference
Scenario: Like before, suppose you want to deploy your model for inference. However, now you have more stringent runtime requirements: perhaps you need to do inference from a CPython-less environment (because your QPS requirements require GIL-less multithreading; alternately, CPython execution overhead is unacceptable but you cannot use CUDA graphs, e.g., due to CPU inference or dynamic shapes requirements). Or perhaps your production environment requires hermetic deploy artifacts (for example, in a monorepo setup, where infrastructure code must be continually pushed but model code should be frozen). But like before, you would prefer not to have to rewrite your model; you would like the existing model to serve as the basis for your Python-less inference binary.
What to do: Use torch.export targeting AOTInductor. This will compile the model into a self-contained shared library which can then be directly invoked from a C++ runtime. This shared library contains all of the compiler generated Triton kernels as precompiled cubins and is guaranteed not to need any runtime compilation; furthermore, it relies only on a small runtime ABI (with no CPython dependency), so the binaries can be used across versions of libtorch. AOTInductor's multithreading capability and low runtime overhead also make it a good match for CPU inference.
You don't have to go straight to C++ CPU/GPU inference: you can start with using torch.compile on your code before investing in torch.export. There are four primary extra requirements export imposes: (1) your model must compile with fullgraph=True (though you can sometimes bypass missing Dynamo functionality by using non-strict export; sometimes, it is easier to do non-strict torch.export than it is to torch.compile!), (2) your model's inputs/outputs must only be in torch.export's supported set of argument types (think Tensors in pytrees), (3) your model must never recompile--specifically, you must specify what inputs have dynamic shapes, and (4) the top-level of your model must be an nn.Module (so that export can keep track of all of the parameters your model has).
Some tips:
Check out the torch.export programming model. The torch.export programming model is an upcoming doc which aims to help set expectations on what can and cannot be exported. It talks about things like "Tensors are the only inputs that can actually vary at runtime" and common mistakes such as module code which modifies NN modules (not supported!) or optional input types (you will end up with an export that takes in that input or not, there is no runtime optionality).
Budget time for getting a model to export. With torch.compile for Python inference, you could just slap it on your model and see what happens. For torch.export, you have to actually finish exporting your entire model before you can even consider running the rest of the pipeline. For some of the more complicated models we have exported, there were often dozens of issues that had to be worked around in one way or another. And that doesn't even account for all of the post-export work you have to do, like validating the numerics of the exported model.
Intermediate value debugging. AOTInductor has an option to add dumps of intermediate tensor values in the compiled C++ code. This is good for determining, e.g., the first time where a NaN shows up, in case you are suspecting a miscompilation.
Open source examples: Among other things, torchchat has an example end-to-end AOTInductor setup for server-side LLM inference, which you can view in run.cpp.
torch.export specific downsides:
No built-in support for guard-based dispatch (multiple compilations). Earlier, I mentioned that an exported model must not have any recompiles. This leads to some fairly common patterns of code not being directly supported by torch.export: you can't export a single model that takes an enum as input, or has an optional Tensor argument, or accepts two distinct tensor shapes that need to be compiled individually. Now, technically, we could support this: you could imagine a package that contains multiple exported artifacts and dispatches between them depending on some conditions (e.g., the value of the enum, whether or not the optional Tensor argument was provided, the shape of the input tensor). But you're on your own: torch.compile will do this for you, but torch.export will not.
No built-in support for models that are split into multiple graphs. Similarly, we've mentioned that an exported model must be a single graph. This is in contrast to torch.compile, which will happily insert graph breaks and compile distinct islands of code that can be glued together with Python eager code. Now, technically, you can do this with export too: you can carve out several distinct subnets of your model, export them individually, and then glue them together with some custom written code on the other end (in fact, Meta's internal recommendation systems do this), but there's no built-in support for this workflow.
The extra requirements often don't cover important components of real world models. I've mentioned this previously as the extra restrictions export places on you, but it's worth reiterating some of the consequences of this. Take an LLM inference application: obviously, there is a core model that takes in tokens and produces logit predictions--this part of the model is exportable. But there are also important other pieces such as the tokenizer and sampling strategy which are not exportable (tokenizer because it operates on strings, not tensors; sampling because it involves complicated control flow). Arguably, it would be much better if all of these things could be directly bundled with the model itself; in practice, end-to-end applications should just expect to directly implement these in native code (e.g., as is done in torchchat). Our experience with TorchScript taught us that we don't really want to be in the business of designing a general purpose programming language that is portable across all of export's targets; better to just bet that the tokenizer doesn't change that often and eat the cost of natively integrating it by hand.
AOTInductor specific downsides:
You still need libtorch to actually run the model. Although AOTInductor binaries bundle most of their compiled kernel implementations, they still require a minimal runtime that can offer basic necessities such as tensor allocation and access to custom operators. There is not yet an official offering of an alternative, lightweight implementation of the stable ABI that AOTInductor binaries depend on, so if you do want to deploy AOTInductor binaries you will typically have to also bring libtorch along. This is usually not a big deal server side, but it can be problematic if you want to do client side deployments!
No CUDA graphs support. This one is not such a big deal since you are much less likely to be CPU bound when the host side logic is all compiled C++, but there's no support for CUDA graphs in AOTInductor. (Funnily enough, this is also something you technically can orchestrate from outside of AOTInductor.)
Edge deployment
Scenario: You need to deploy your PyTorch model to edge devices (e.g., a mobile phone or a wearable device) where computational resources are limited. You have requirements that are a bit different from server side: you care a lot more about minimizing binary size and startup time. Traditional PyTorch deployment with full libtorch won't work. The device you're deploying to might also have some strange extra processors, like a DSP or NPU, that you want your model to target.
What to do: Use torch.export targeting Executorch. Among other things, Executorch offers a completely separate runtime for exported PyTorch programs (i.e., it has no dependency on libtorch, except perhaps there are a few headers which we share between the projects) which was specifically designed for edge deployment. (Historical note: we spent a long time trying to directly ship a stripped down version of libtorch to mobile devices, but it turns out it's really hard to write code that is portable on server and client, so it's better to only share when absolutely necessary.) Quantization is also a pretty important part of deployment to Edge, and Executorch incorporates this into the end-to-end workflow.
Open source examples: torchchat also has an Executorch integration letting you run an LLM on your Android phone.
Downsides. All of the export related downsides described previously apply here. But here's something to know specifically about Executorch:
The edge ecosystem is fragmented. At time of writing, there are seven distinct backends Executorch can target. This is not really Executorch's fault; it comes with the territory--but I want to call it out because it stands in stark contrast to NVIDIA's server-side hegemony. Yes, AMD GPUs are a thing, and various flavors of CPU are real, but it really is a lot easier to be focused on server side because NVIDIA GPUs come first.
Pre-compiled kernels for eager mode
Scenario: You need a new function or self-contained module with an efficient kernel implementation. However, you would prefer not to have to write the CUDA (or even Triton) by hand; the kernel is something that torch.compile can generate from higher level PyTorch implementation. At the same time, however, you cannot tolerate just-in-time compilation at all (perhaps you are doing a massive training job, and any startup latency makes it more likely that one of your nodes will fail during startup and then you make no progress at all; or maybe you just find it annoying when PyTorch goes out to lunch when you cache miss).
What to do: Use torch.export targeting AOTInductor, and then load and run the AOTInductor generated binary from Python.
Downsides. So, we know this use case works, because we have internally used this to unblock people who wanted to use Triton kernels but could not tolerate Triton's just-in-time compilation. But there's not much affordance in our APIs for this use case; for example, guard-based dispatch is often quite useful for compiled functions, but you'll have to roll that by hand. More generally, when compiling a kernel, you have to make tradeoffs about how static versus dynamic the kernel should be (for example, will you force the inputs to be evenly divisible by eight? Or would you have a separate kernel for the divisible and not divisible cases?) Once again, you're on your own for making the call there.
An exchange format across systems
Scenario: In an ideal world, you would have a model, you could export it to an AOTInductor binary, and then be all done. In reality, maybe this export process needs to be a multi-stage process, where it has to be processed to some degree on one machine, and then finish processing on another machine. Or perhaps you need to shift the processing over time: you want to export a model to freeze it (so it is no longer tied to its original source code), and then repeatedly run the rest of the model processing pipeline on this exported program (e.g., because you are continuously updating its weights and then reprocessing the model). Maybe you want to export the model and then train it from Python later, committing to a distributed training strategy only when you know how many nodes you are running. The ability to hermetically package a model and then process it later is one of the big value propositions of TorchScript and torch.package.
What to do: Use torch.export by itself, potentially using pre-dispatch if you need to support training use-cases. torch.export produces an ExportedProgram which has a clean intermediate representation that you can do processing on, or just serialize and then do processing on later.
Downsides:
Custom operators are not packaged. A custom operator typically refers to some native code which was linked with PyTorch proper. There's no way to extract out this kernel and embed it into the exported program so that there is no dependence; instead, you're expected to ensure the eventual runtime relinks with the same custom operator. Note that this problem doesn't apply to user defined Triton kernels, as export can simply compile it and package the binary directly into the exported product. (Technically, this applies to AOTInductor too, but this tends to be much more of a problem for use cases which are primarily about freezing rapidly evolving model code, as opposed to plain inference where you would simply just expect people to not be changing custom operators willy nilly.)
Choose your own decompositions. Export produces IR that only contains operators from a canonical operator set. However, the default choice is sometimes inappropriate for use cases (e.g., some users want aten.upsample_nearest2d.vec to be decomposed while others do not), so in practice for any given target you may have a bespoke operator set that is appropriate for that use case. Unfortunately, it can be fiddly getting your operator set quite right, and while we've talked about ideas like a "build your own operator set interactive tool" these have not been implemented yet.
Annoyingly large FC/BC surface. Something I really like about AOTInductor is that it has a very small FC/BC surface: I only need to make sure I don't make breaking changes to the C ABI, and I'm golden. With export IR, the FC/BC surface is all of the operators produced by export. Even a decomposition is potentially BC breaking: a downstream pass could be expecting to see an operator that no longer exists because I've decomposed it into smaller pieces. Matters get worse in pre-dispatch export, since the scope of APIs used inside export IR expands to include autograd control operators (e.g., torch.no_grad) as well as tensor subclasses (since Tensor subclasses cannot be desugared if we have not yet eliminated autograd). We will not break your AOTInductor blobs. We can't as easily give the same guarantee for the IR here.
Next time: What's missing, and what we're doing about it
Up until this year, my Bitcoin custody strategy was fairly straightforward, and likely familiar to other hodlers:
Buy a hardware wallet
Put the seed phrase on steel plates
Secure those steel plates somewhere on my property
But in October of last year, the situation changed. I live in Northern Israel, close to the Lebanese border. The past 14 months have involved a lot of rocket attacks, including destruction of multiple buildings in my home town. This brought into question how to properly secure my sats. Importantly, I needed to balance two competing goals:
Resiliency of the saved secrets against destruction. In other words: make sure I didn't lose access to the wallet.
Security against attackers trying to steal those secrets. In other words: make sure no one else got access to the wallet.
I put some time into designing a solution to these conflicting goals, and would like to share some thoughts for others looking to improve their BTC custody strategy. And if anyone has any recommendations for improvements, I'm all ears!
Goals
Self custody I didn't want to rely on an external custody company. Not your keys, not your coins.
Full access I always maintain full access to my funds, without relying on any external party.
Computer hack resilient If my computer systems are hacked, I will not lose access to or control of my funds (neither stolen nor lost).
Physical destruction resilient If my hardware device and steel plates are both destroyed (as well as anything else physically located in my home town), I can still recover my funds.
Will survive me If I'm killed, I want my wife, children, or other family members to be able to recover and inherit my BTC.
Multisig
The heart of this protection mechanism is a multisig wallet. Unfortunately, interfaces for setting up multisig wallets are tricky. I'll walk through the basics and then come back to how to set it up.
The concept of a multisig is that your wallet is protected by multiple signers. Each signer can be any "normal" wallet, e.g. a software or hardware wallet. You choose a number of signers and a threshold of signers required to perform a transaction.
For example, a 2 of 2 multisig would mean that 2 wallets can sign transactions, and both of them need to sign to make a valid transaction. A 3 of 5 would mean 5 total signers, any 3 of them being needed to sign a transaction.
For my setup, I created a 2 of 3 multisig, with the 3 signers being a software wallet, a hardware wallet, and a SLIP39 wallet. Let's go through each of those, explain how they work, and then see how the solution addresses the goals.
Software wallet
I set up a software wallet and saved the seed phrase in a dedicated password manager account using Bitwarden. Bitwarden offers an emergency access feature, which essentially means a trusted person can be listed as an emergency contact and can recover your account. The process includes a waiting period, during which the account owner can reject the request.
Put another way: Bitwarden is offering a cryptographically secure, third party hosted, fully managed, user friendly dead-man switch. Exactly what I needed.
I added a select group of trusted people as the recoverers on the account. Otherwise, I keep the account securely locked down in Bitwarden and can use it for signing when necessary.
Let's see how this stacks up against the goals:
Self custody Check, no reliance on anyone else
Full access Check, I have access to the wallet at all times
Computer hack resilient Fail, if my system is hacked, I lose control of the wallet
Physical destruction resilient Check, Bitwarden lives beyond my machines
Will survive me Check thanks to the dead-man switch
Hardware wallet
Not much to say about the hardware wallet setup that I haven't said already. Let's do the goals:
Self custody Check, no reliance on anyone else
Full access Check, I have access to the wallet at all times
Computer hack resilient Check, the private keys never leave the hardware device
Physical destruction resilient Fail, the wallet and plates could easily be destroyed, and the plates could easily be stolen. (The wallet could be stolen too, but thanks to the PIN mechanism would theoretically be resistant to compromise. But that's not a theory I'd want to bet my wealth on.)
Will survive me Check, anyone can take my plates and recover the wallet
SLIP39
This one requires a bit of explanation. SLIP39 is a not-so-common standard for taking some data and splitting it up into a number of shards. You can define the threshold of shards necessary to reconstruct the original secret. This uses an algorithm called Shamir's Secret Sharing. (And yes, it is very similar in function to multisig, but implemented differently).
The idea here is that this wallet is controlled by a group of friends and family members. Without getting into my actual setup, I could choose 7 very trusted individuals from all over the world and tell them that, should I contact them and ask for their shards, they should send them to me so I can reconstruct that third wallet. And to be especially morbid, they also know the identity of some backup people in the event of my death.
In any event, the idea is that if enough of these people agree, they can reconstruct the third wallet. The assumption is that these are all trustworthy people. But even with trustworthy people, (1) I could be wrong about how trustworthy they are, or (2) they could be coerced or tricked. So let's see how this security mechanism stands up:
Self custody Fail, I'm totally reliant on others.
Full access Fail, by design I don't keep this wallet myself, so I must rely on others.
Computer hack resilient Check, the holders of these shards keep them in secure, offline storage.
Physical destruction resilient Check (sort of), since the probability of all copies being destroyed or stolen is negligible.
Will survive me Check, by design
Comparison against goals
We saw how each individual wallet stacked up against the goals. How about all of them together? Well, there are certainly some theoretical ways I could lose the funds, e.g. my hardware wallet and plates are destroyed and a majority of shard holders for the SLIP39 lost their shards. However, if you look through the check/fail lists, every category has at least two checks. Meaning: on all dimensions, if some catastrophe happens, at least two of the wallets should survive.
Now the caveats (I seem to like that word). I did a lot of research on this, and this is at least tangential to my actual field of expertise. But I'm not a dedicated security researcher, and can't really claim full, deep understanding of all these topics. So if I made any mistakes here, please let me know.
How-to guide
OK, so how do you actually get a system like this running? I'll give you my own step-by-step guide. Best case scenario for all this: download all the websites and programs mentioned onto a fresh Linux system install, disconnect the internet, run the programs and copy down any data as needed, and then wipe the system again. (Or, alternatively, do all the actions from a Live USB session.)
Set up the SLIP39. You can use an online generator. Choose the number of bits of entropy (IMO 128bit is sufficient), choose the total shares and threshold, and then copy down the phrases.
Generate the software wallet. You can use a sister site to the SLIP39 generator. Choose either 12 or 24 words, and write those words down. On a different, internet-connected computer, you can save those words into a Bitwarden account, and set it up with appropriate emergency access.
Open up Electrum. (Other wallets, like Sparrow, probably work for this too, but I've only done it with Electrum.) The rest of this section will include a step-by-step guide through the Electrum steps. And yes, I took these screenshots on a Mac, but for a real setup use a Linux machine.
Set up a new wallet. Enter a name (doesn't matter what) and click next.
Choose a multisig wallet and click next.
Choose 3 cosigners and require 2 signatures.
Now we're going to enter all three wallets. The first one will be your hardware device. Click next, then follow all the prompts to set it up.
After a few screens (they'll be different based on your choice of hardware device), you'll be prompted to select a derivation path. Use native segwit and the standard derivation path.
This next screen was the single most complicated for me, simply because the terms were unclear. First, you'll see a Zpub string displayed as a "master public key," e.g.:
You need to write this down. It's the same as an xpub, but for multisig wallets. This represents all the possible public keys for your hardware wallet. Putting together the three Zpub values will allow your software of choice to generate all the receiving and change addresses for your new wallet. You'll need all three, so don't lose them! But on their own, they cannot be used to access your funds. Therefore, treat them with "medium" security. Backing them up in Bitwarden with your software wallet is a good idea, and potentially simply sending them to some friends to back up just in case.
And that explanation brings us back to the three choices on the screen. You can choose to either enter a cosigner key, a cosigner seed, or use another hardware wallet. The difference between key and seed is that the former is public information only, whereas the latter is full signing power. Often, multisig wallets are set up by multiple different people, and so instead of sharing the seed with each other (a major security violation), they each generate a seed phrase and only share the key with each other.
However, given that you're setting up the wallet with access to all seed phrases, and you're doing it on an airgapped device, it's safe to enter the seed phrases directly. And I'd recommend it, to avoid the risk of generating the wrong master key from a seed. So go ahead and choose "enter cosigner seed" and click next.
And now onto the second most confusing screen. I copied my seed phrase into this text box, but it won't let me continue!
The trick is that Electrum, by default, uses its own concept of seed phrases. You need to click on "Options" and then choose BIP39, and then enter your seed phrase.
Continue through the other screens until you're able to enter the final seed. This time, instead of choosing BIP39, choose SLIP39. You'll need to enter enough of the SLIP39 shards to meet the threshold.
And with that, you can continue through the rest of the screens, and you'll now have a fully operational multisig!
Open up Electrum again on an internet-connected computer. This time, connect the hardware wallet as before, enter the BIP39 as before, but for the SLIP39, enter the master key instead of the SLIP39 seed phrase. This will ensure that no internet connected device ever has both the software wallet and SLIP39 at the same time. You should confirm that the addresses on the airgapped machine match the addresses on the internet connected device.
If so, you're ready for the final test. Send a small amount of funds into the first receiving address, and then use Electrum on the internet connected device to (1) confirm in the history that it arrived and (2) send it back to another address. You should be asked to sign with your hardware wallet.
If you made it this far, congratulations! You're the proud owner of a new 2 of 3 multisig wallet.
Conclusion
I hope the topic of death and war wasn't too terribly morbid for others. But these are important topics to address in our world of self custody. I hope others found this useful. And once again, if anyone has recommendations for improvements to this setup, please do let me know!