Just as we were finishing up this blog post, a team at the University of Cambridge released a paper describing such an attack. Their approach, however, is quite different from ours: it focuses on the Unicode bidirectional mechanism (Bidi). We have implemented a different take on what the paper calls “Invisible Character Attacks” and “Homoglyph Attacks”.
Without further ado, here’s the backdoor. Can you spot it?
The script implements a very simple network health check HTTP endpoint that executes ping -c 1 google.com as well as curl -s http://example.com and returns whether these commands executed successfully. The optional HTTP parameter timeout limits the command execution time.
Characters with the Unicode property ID_Start can be used in identifiers (characters with the property ID_Continue can be used after the initial character). The character “ㅤ” (0x3164 in hex) is called “HANGUL FILLER” and belongs to the Unicode category “Letter, other”. As this character is considered to be a letter, it has the ID_Start property and can therefore appear in a JavaScript identifier.
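The identifier rule can be checked directly in Node. The following is a minimal sketch, assuming a runtime that supports Unicode property escapes in regular expressions (Node 10 or later); the helper name `isValidIdentifierStart` is made up for illustration.

```javascript
// Hypothetical helper: true if the character may start a JavaScript identifier.
// "$" and "_" are allowed in addition to ID_Start characters.
function isValidIdentifierStart(ch) {
  return /^[$_\p{ID_Start}]$/u.test(ch);
}

const hangulFiller = "\u3164"; // HANGUL FILLER, renders as blank space

console.log(isValidIdentifierStart(hangulFiller)); // true
console.log(isValidIdentifierStart(" "));          // false: an ordinary space is not a letter
console.log(/^\p{Lo}$/u.test(hangulFiller));       // true: category "Letter, other"

// The engine agrees: declaring and reading a variable named U+3164 works.
const roundTrip = new Function("var \u3164 = 42; return \u3164;");
console.log(roundTrip()); // 42
```

An ordinary space fails the same check, which is why the trick depends on this particular kind of character.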
Next, a way to use this invisible character unnoticed had to be found. The following visualizes the chosen approach by replacing the character in question with its escape sequence representation:
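A minimal sketch of that escaped view is shown here. It is an illustration rather than the original listing: the `req` object is a hand-built stand-in for the framework's request object, and the parameter values are invented.

```javascript
// Stand-in for the request object; the values are invented for this example.
const req = { query: { timeout: "2000", "\u3164": "echo injected" } };

// With the character written as an escape, the second binding becomes visible.
// Unescaped, only `timeout` and some blank space would show between the braces.
const { timeout, \u3164 } = req.query;

// Copy the hidden binding into a visibly named variable for printing.
const hidden = \u3164;

console.log(timeout); // "2000"
console.log(hidden);  // "echo injected"
```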
A destructuring assignment is used to deconstruct the HTTP parameters from req.query. Contrary to what can be seen, the parameter timeout is not the sole parameter unpacked from the req.query attribute! An additional variable/HTTP parameter named “ㅤ” is retrieved: if an HTTP parameter named “ㅤ” is passed, it is assigned to the invisible variable of the same name.
Similarly, when the checkCommands array is constructed, this invisible variable is included in the array as well.
Each element in the array, the hardcoded commands as well as the user-supplied parameter, is then passed to the exec function. This function executes OS commands. To execute arbitrary OS commands, an attacker only has to pass a parameter named “ㅤ” (in its URL-encoded form, %E3%85%A4) to the endpoint.
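How the pieces fit together can be sketched without running any command. In this illustration `exec` is replaced by a stub that only records the command strings; the function name `networkHealth`, the 5000 ms default, and the payload `id` are assumptions made for the example and are not taken from the original script.

```javascript
// Stub standing in for child_process.exec: records commands instead of running them.
const executed = [];
const exec = (cmd, options) => executed.push(cmd);

// Hypothetical handler body; the escapes make the hidden identifier visible.
function networkHealth(query) {
  const { timeout, \u3164 } = query;
  const checkCommands = [
    "ping -c 1 google.com",
    "curl -s http://example.com",
    \u3164, // unescaped, this element is invisible and resembles a trailing comma
  ];
  for (const cmd of checkCommands) {
    // The hidden element is skipped when the parameter was not supplied.
    if (cmd) exec(cmd, { timeout: Number(timeout) || 5000 }); // default value is assumed
  }
}

// Percent-encoded name of the hidden parameter (UTF-8 bytes E3 85 A4).
const paramName = encodeURIComponent("\u3164");
console.log(paramName); // %E3%85%A4

// Parse a query string into a plain object, then call the handler with it.
const query = Object.fromEntries(new URLSearchParams(`timeout=1000&${paramName}=id`));
networkHealth(query);
console.log(executed); // the two health checks followed by "id"
```

A query string of the form `?%E3%85%A4=<command>` is therefore enough to append an attacker-chosen command to the list.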
This approach cannot be detected through syntax highlighting, as invisible characters are not shown at all and therefore are not colorized by the IDE/text editor.
The attack requires the IDE/text editor (and the font in use) to render the invisible character correctly, that is, as blank space. At least Notepad++ and VS Code do so (in VS Code the invisible character is slightly wider than ASCII characters). The script behaves as described at least with Node 14.
Besides invisible characters, one could also introduce backdoors using Unicode characters that look very similar to, for example, operators.
The “ǃ” character used is not an exclamation mark but the letter for the alveolar click (U+01C3; its formal Unicode name is “LATIN LETTER RETROFLEX CLICK”). The affected line therefore does not compare the variable environment to the string "PRODUCTION" but instead assigns the string "PRODUCTION" to the previously undefined variable environmentǃ, since the click character is a letter and becomes part of the identifier. Thus, the expression within the if statement is always true (tested with Node 14).
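The effect can be reproduced in isolation. The sketch below builds the condition from a string so that the look-alike character can be written as the escape `\u01C3`; the wrapper function and its return values are invented for the example. Code created with `new Function` runs in non-strict mode, which is what allows the silent assignment to an undeclared variable (in strict mode the same line would throw a ReferenceError).

```javascript
// U+01C3 looks like "!" but is a letter, so it extends the identifier.
const click = "\u01C3";

// Reads like `environment != "PRODUCTION"`, but parses as
// `environment\u01C3 = "PRODUCTION"`: an assignment to a new variable.
const source =
  "if (environment" + click + '= "PRODUCTION") { return "production branch"; }' +
  ' return "other branch";';

// Function bodies built this way are non-strict, so an implicit global is created.
const check = new Function("environment", source);

console.log(check("TESTING"));                  // "production branch"
console.log(globalThis["environment" + click]); // "PRODUCTION"
console.log(/^\p{ID_Continue}$/u.test(click));  // true
```

The branch is taken even though the argument is "TESTING", because the value of the assignment, a non-empty string, is truthy.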
There are many other characters that look similar to the ones used in code which may be used for such purposes (e.g. “／”, “−”, “＋”, “⩵”, “❨”, “⫽”, “꓿”, “∗”). Unicode calls these characters “confusables”.
Note that messing with Unicode to hide vulnerable or malicious code is not a new idea (nor is the use of invisible characters for this purpose), and Unicode inherently opens up additional possibilities to obfuscate code. We believe that these tricks are quite neat though, which is why we wanted to share them.
Unicode should be kept in mind when reviewing code from unknown or untrusted contributors. This is especially relevant for open source projects, as they might receive contributions from developers who are effectively anonymous.
The Cambridge team proposes restricting Bidi Unicode characters. As we have shown, homoglyph attacks and invisible characters can pose a threat as well. In our experience, non-ASCII characters are pretty rare in code. Many development teams choose to use English as the primary development language (both for code and strings within the code) in order to allow for international cooperation (ASCII covers all/most characters used in the English language). Translation into other languages is often done using dedicated files. When we review German language code, we mostly see non-ASCII characters being substituted with ASCII characters (e.g. ä → ae, ß → ss). It might therefore be a good idea to disallow any non-ASCII characters.
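One way to enforce such a rule is a small check that runs before code review or in CI. The following is a minimal sketch, not a complete linter: the function name `findNonAscii` and the shape of its result are assumptions, and a real project may need an allow-list for string literals or translation files.

```javascript
// Report every non-ASCII code point in a source string together with its position.
function findNonAscii(source) {
  const findings = [];
  source.split("\n").forEach((lineText, lineIndex) => {
    // The u flag makes the pattern match whole code points.
    for (const match of lineText.matchAll(/[^\x00-\x7F]/gu)) {
      const codePoint = match[0].codePointAt(0);
      findings.push({
        line: lineIndex + 1,
        column: match.index + 1, // 1-based, counted in UTF-16 code units
        codePoint: "U+" + codePoint.toString(16).toUpperCase().padStart(4, "0"),
      });
    }
  });
  return findings;
}

// Two lines containing the characters discussed above, written as escapes.
const sample = 'const { timeout,\u3164} = req.query;\nif (env\u01C3= "PRODUCTION") {}';
const findings = findNonAscii(sample);
console.log(findings);
// [ { line: 1, column: 17, codePoint: 'U+3164' },
//   { line: 2, column: 8, codePoint: 'U+01C3' } ]
```

Reporting the code point rather than the character itself keeps the output readable even when the offending character is invisible.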