I love when my current problem can be solved with a state machine. They’re fun to design and implement, and I have high confidence about correctness. They tend to:
- Present minimal, tidy interfaces
- Require few, fixed resources
- Hold no opinions about input and output
- Have a compact, concise implementation
- Be easy to reason about
State machines are perhaps one of those concepts you heard about in college but never put into practice. Maybe you use them regularly. Regardless, you certainly run into them regularly, from regular expressions to traffic lights.
Morse code decoder state machine
Inspired by a puzzle, I came up with this deterministic state machine for decoding Morse code. It accepts a dot (
'.'), dash (
'-'), or terminator (0) one at a time, advancing through a state machine step by step:
It typically compiles to under 200 bytes (table included), requires only a few bytes of memory to operate, and will fit on even the smallest of microcontrollers.
The state machine is trie-shaped, and the 100-byte table
t is the static encoding of the Morse code trie:
Dots traverse left, dashes right, terminals emit the character at the current node (terminal state). Stopping on red nodes, or attempting to take an unlisted edge is an error (invalid input).
Each node in the trie is a byte in the table. Dot and dash each have a bit indicating if their edge exists. The remaining bits index into a 1-based character table (at the end of
t), and a 0 “index” indicates an empty (red) node. The nodes themselves are laid out as a binary heap in an array: the left and right children of the node at
i are found at
i*2+2. No need to waste memory storing edges!
Since C sadly does not have multiple return values, I’m using the sign bit of the return value to create a kind of sum type. A negative return value is a state — which is why the state is negated internally before use. A positive result is a character output. If zero, the input was invalid. Only the initial state is non-negative (zero), which is fine since it’s, by definition, not possible to traverse to the initial state. No
c input will produce a bad state.
In the original problem the terminals were missing. Despite being a state machine,
morse_decode is a pure function. The caller can save their position in the trie by saving the state integer and trying different inputs from that state.
UTF-8 decoder state machine
The classic UTF-8 decoder state machine is Bjoern Hoehrmann’s Flexible and Economical UTF-8 Decoder. It packs the entire state machine into a relatively small table using clever tricks. It’s easily my favorite UTF-8 decoder.
I wanted to try my own hand at it, so I re-derived the same canonical UTF-8 automaton:
Then I encoded this diagram directly into a much larger (2,064-byte), less elegant table, too large to display inline here:
However, the trade-off is that the executable code is smaller, faster, and branchless again (by accident, I swear!):
Like Bjoern’s decoder, there’s a code point accumulator. The real state machine has 1,109,950 terminal states, and many more edges and nodes. The accumulator is an optimization to track exactly which edge was taken to which node without having to represent such a monstrosity.
Despite the huge table I’m pretty happy with it.
Word count state machine
Here’s another state machine I came up with awhile back for counting words one Unicode code point at a time while accounting for Unicode’s various kinds of whitespace. If your input is bytes, then plug this into the above UTF-8 state machine to convert bytes to code points! This one uses a switch instead of a lookup table since the table would be sparse (i.e. let the compiler figure it out).
I’m particularly happy with the edge-triggered state transition mechanism. The sign of the state tracks whether the “signal” is “high” (inside of a word) or “low” (outside of a word), and so it counts rising edges.
The counter is not technically part of the state machine — though it eventually overflows for practical reasons, it isn’t really “finite” — but is rather an external count of the times the state machine transitions from low to high, which is the actual, useful output.
Reader challenge: Find a slick, efficient way to encode all those code points as a table rather than rely on whatever the compiler generates for the
switch (chain of branches, jump table?).
Coroutines and generators as state machines
In languages that support them, state machines can be implemented using coroutines, including generators. I do particularly like the idea of compiler-synthesized coroutines as state machines, though this is a rare treat. The state is implicit in the coroutine at each yield, so the programmer doesn’t have to manage it explicitly. (Though often that explicit control is powerful!)
Unfortunately in practice it always feels clunky. The following implements the word count state machine (albeit in a rather un-Pythonic way). The generator returns the current count and is continued by sending it another code point:
However, the generator ceremony dominates the interface, so you’d probably want to wrap it in something nicer — at which point there’s really no reason to use the generator in the first place:
Same idea in Lua, which famously has full coroutines:
Except for initially priming the coroutine, at least
coroutine.wrap() hides the fact that it’s a coroutine.