The most efficient way to create hash maps that are insertion ordered is to use a variant of what Python refers to as a “compact dictionary,” which was popularized/rediscovered by Raymond Hettinger. (According to Hettinger, there is previous work, implemented in Java, that is similar to the internal object implementation of V8, which was devised by Lars Bak, another Toit co-founder.) In compact dictionaries, the keys and values are stored in a linear array, and the linear array is indexed by a more standard open addressing hash table. It appears as follows.
This is also how a hash map appears in Toit. Indeed, hash numbers in the index table are not required, but they speed up lookup. Storing the final few bits of the hash code enables us to discover the majority of index collisions without examining the keys in the underlying array. Additionally, having hash codes in the index speeds up the process of rebuilding the index when it grows too large.
If a user deletes an entry from this type of hash map, we change the key with a marker indicating that the entry was deleted. Perhaps disturbingly, we refer to these markings as “tombstones.” When there are sufficient of these tombstones, the hash map is recreated, which is acceptable. This has no effect on your average speed due to the magic of amortised constant time.
This is fantastic because it now provides an efficient method for creating hash maps with insertion ordering. These are non-aggressive hash maps, which is why we utilised them in the Dart project as well, which is now probably best known as the engine powering Flutter. (Several of the Dart founders later founded Toit.)
Additional indicators that your hash maps do not despise you
Additionally, it is occasionally essential to declare the equality function per map rather than per key type. In C++, this is accomplished using the Hash and Pred template parameters on unordered map. (However, as the name unordered map implies, conventional C++ collections despise you, and hence this additional feature provides no consolation.) At the moment, Toit does not provide per-map equality functions, but this will change shortly! In Java, you would need to encapsulate the keys in your map or use Trove.
The thorn in the side of the ointment and how to remove it
Even if your language has compact dictionaries, there is a slight kink in the works. Your hash maps will continue to act as though they despise you in certain scenarios, but in a different way! Certain individuals prefer to utilise hash maps (and sets, which implement the same concept) as a deduplicating work queue. You can add “tasks” to the set, and because it is a set, it will automatically disregard duplicates. A task queue (also known as a FIFO queue) requires both an append-to-the-end and a remove-from-the-start operation. We already have the append-to-the-end action, and it is simple to perform a remove-first operation by inspecting the beginning of the backing array. If the Toit language does not already include this function, the following could be added:
The problem arises after a while of using a set or map like this. There will be a lot of tombstones near the start of the backing, and suddenly the remove-from-the-start operation isn’t so fast. It needs to skip over more and more tombstones, and the run time to insert-then-remove n items becomes quadratic. This issue can’t be solved by rebuilding the hash map more frequently, because if you do that, you lose the property of amortized constant time operations.
An ad-hoc solution would be to record for each hash map where the first entry is in the backing array. This feels like a band-aid though. What if users want to repeatedly remove the last entry using a set as a sort of deduplicating stack (or LIFO)? The hateful quadratic hash map problem would be back. What if they want to repeatedly remove the second entry in the set? One thing I learned on V8 is that it’s hard to predict what your users will do with a programming language implementation you have built.
And we built Toit to work on tiny devices. Do we really want to spend an extra word per hash map to record the number of leading tombstones? The solution we came up with was tombstones with distance markers. Each tombstone additionally records the distance to the next real entry, either on the left or the right. We update this lazily when scanning through the backing, so it doesn’t cost any time on other operations, but it lets us skip long stretches of tombstones wherever they appear.
For the common case where there are no deletions, or the deletions are randomly distributed, nothing changes, but for the problematic cases we retain the desirable property of amortized constant time insertion and deletion.
Testing Languages for this Bug
So here, for your enjoyment, is the graph for various languages. Currently, all the compact-dictionary-based implementations (except Toit) have this issue. Hopefully, this blog post will inspire the implementers to fix it.