UTF-8 🤦♂️ I already knew about the "confusables", e.g.: e vs. е. Which look ~same but are different. But you can also smuggle arbitrary byte streams in any character via "variation selectors". So this emoji: 😀󠅧󠅕󠄐󠅑󠅢󠅕󠄐󠅓󠅟󠅟󠅛󠅕󠅔 is 53 tokens. Yay https://paulbutler.org/2025/smuggling-arbitrary-data-through-an-emoji/
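A minimal sketch of the trick, assuming the byte-to-selector mapping described in the linked post: bytes 0-15 map to U+FE00..U+FE0F and bytes 16-255 map to U+E0100..U+E01EF. The helper names (`hide`, `reveal`, and so on) are hypothetical and chosen only for illustration.

```python
from typing import Optional

def byte_to_selector(b: int) -> str:
    """Map one byte (0-255) to one of the 256 Unicode variation selectors."""
    if b < 16:
        return chr(0xFE00 + b)          # VS1..VS16
    return chr(0xE0100 + (b - 16))      # VS17..VS256

def selector_to_byte(ch: str) -> Optional[int]:
    """Inverse mapping; returns None for anything that is not a variation selector."""
    cp = ord(ch)
    if 0xFE00 <= cp <= 0xFE0F:
        return cp - 0xFE00
    if 0xE0100 <= cp <= 0xE01EF:
        return cp - 0xE0100 + 16
    return None

def hide(base: str, payload: bytes) -> str:
    """Append one invisible variation selector per payload byte to a visible base string."""
    return base + "".join(byte_to_selector(b) for b in payload)

def reveal(text: str) -> bytes:
    """Collect every variation selector in the text and turn it back into bytes."""
    out = bytearray()
    for ch in text:
        b = selector_to_byte(ch)
        if b is not None:
            out.append(b)
    return bytes(out)

stego = hide("😀", b"we are cooked")
print(len(stego))        # code points: 1 visible emoji + 13 selectors
print(reveal(stego))     # recovers the hidden bytes
```

The string still renders as a single emoji in most environments, but it carries 13 extra code points. How many tokens that becomes depends on the tokenizer, so the sketch does not try to reproduce the count.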