Andrej Karpathy

@karpathy · Twitter ·

UTF-8 🤦‍♂️ I already knew about the "confusables", e.g.: e vs. е. Which look ~same but are different. But you can also smuggle arbitrary byte streams in any character via "variation selectors". So this emoji: 😀󠅧󠅕󠄐󠅑󠅢󠅕󠄐󠅓󠅟󠅟󠅛󠅕󠅔 is 53 tokens. Yay https://paulbutler.org/2025/smuggling-arbitrary-data-through-an-emoji/
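A minimal Python sketch of the variation-selector trick mentioned above, assuming the byte mapping described in the linked post: byte values 0 to 15 map to U+FE00..U+FE0F and values 16 to 255 map to U+E0100..U+E01EF. The function names (`byte_to_selector`, `selector_to_byte`, `encode`, `decode`) are hypothetical and chosen here for illustration only.

```python
def byte_to_selector(b: int) -> str:
    """Map one byte to a variation selector code point (assumed mapping)."""
    if not 0 <= b <= 255:
        raise ValueError("expected a byte value in 0..255")
    if b < 16:
        return chr(0xFE00 + b)          # VS1..VS16
    return chr(0xE0100 + (b - 16))      # VS17..VS256


def selector_to_byte(ch: str):
    """Inverse of byte_to_selector; returns None for any other character."""
    cp = ord(ch)
    if 0xFE00 <= cp <= 0xFE0F:
        return cp - 0xFE00
    if 0xE0100 <= cp <= 0xE01EF:
        return cp - 0xE0100 + 16
    return None


def encode(base: str, payload: bytes) -> str:
    """Append one invisible variation selector per payload byte to `base`."""
    return base + "".join(byte_to_selector(b) for b in payload)


def decode(text: str) -> bytes:
    """Collect the first contiguous run of variation selectors in `text`."""
    out = bytearray()
    for ch in text:
        b = selector_to_byte(ch)
        if b is not None:
            out.append(b)
        elif out:
            break  # the run of selectors has ended
    return bytes(out)


if __name__ == "__main__":
    hidden = encode("\U0001F600", b"hello")
    print(len(hidden))      # code points: 1 visible emoji + 5 selectors
    print(decode(hidden))

    # The "confusables" case: Latin e (U+0065) vs. Cyrillic е (U+0435).
    print("e" == "\u0435")
```

The encoded string typically renders as a single emoji, yet every hidden byte is a separate code point, which is why a tokenizer can end up spending many tokens on what looks like one character.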
