Unicode

Data Encoding

Unicode is a universal character encoding standard that assigns a unique number (called a code point) to every character from virtually every writing system in the world. Unlike ASCII, which handles only 128 characters, Unicode can represent over 140,000 characters, including letters from all languages, emojis, mathematical symbols, and historical scripts. The most common encoding form for Unicode is UTF-8, which is backward-compatible with ASCII.
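The backward compatibility with ASCII can be checked directly: any pure-ASCII string produces exactly the same bytes whether it is encoded as ASCII or as UTF-8. A minimal sketch in Python:

```python
s = "Hello"

# For ASCII-only text, the UTF-8 bytes are identical to the ASCII bytes.
print(s.encode("ascii") == s.encode("utf-8"))  # True
print(s.encode("utf-8"))                       # b'Hello'
```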

Why it matters

Unicode solved the problem of incompatible character encodings that plagued early computing. Before Unicode, different regions used different encoding systems, causing text to display incorrectly when shared across systems. Today, Unicode is the standard for the web, programming languages, JSON data, and operating systems. When you see emojis, Chinese characters, or Arabic script displayed correctly on a webpage, Unicode is making that possible.

How it works

Each character in Unicode has a code point written as "U+" followed by a hexadecimal number. For example, the letter "A" is U+0041, the emoji 😀 is U+1F600, and the Chinese character 中 is U+4E2D. UTF-8 encoding then converts these code points into binary bytes for storage and transmission. Simple English characters use just one byte (staying compatible with ASCII), while complex characters may use up to four bytes.
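The code points and byte counts above can be verified in a few lines of Python, using the built-in `ord` to get a character's code point and `str.encode` to see its UTF-8 bytes:

```python
# Print each character's code point and how many bytes UTF-8 uses for it.
for ch in ["A", "中", "😀"]:
    code_point = ord(ch)              # the character's Unicode code point
    utf8_bytes = ch.encode("utf-8")   # its UTF-8 byte sequence
    print(f"{ch!r}: U+{code_point:04X}, {len(utf8_bytes)} byte(s) in UTF-8")
```

This prints `U+0041` with 1 byte for "A", `U+4E2D` with 3 bytes for 中, and `U+1F600` with 4 bytes for 😀, matching the ranges described above.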

Common encodings

UTF-8 is the dominant encoding on the web because it is efficient for English text while supporting all Unicode characters. UTF-16 uses two or four bytes per character and is common in Windows and Java. UTF-32 uses exactly four bytes per character, making it simple but less space-efficient. You will encounter these encodings when working with APIs, reading files, or handling international text in your applications.
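The space trade-offs between these encodings are easy to see by encoding the same string three ways. A short sketch in Python (using the little-endian variants so no byte-order mark is added):

```python
# One ASCII-heavy string containing a CJK character and an emoji.
text = "Hi 中 😀"

for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    data = text.encode(encoding)
    print(f"{encoding}: {len(data)} bytes")
```

For this 6-character string, UTF-8 needs 11 bytes, UTF-16 needs 14 (the emoji takes a 4-byte surrogate pair), and UTF-32 needs a fixed 4 bytes per character, 24 in total.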
