# What's the Character Encoding?

We constantly hear terms like `ASCII`, `UTF-8`, `UTF-16`, and `Unicode` when writing code, and we sometimes call them character sets, sometimes character encodings. What is the relationship between them? Are they the same thing? And how does a computer use them to store characters? Let's take a deep look today.

We all know that computers store data on disk in binary. To store a character, the character must have a corresponding binary value; you can imagine this as a large mapping table, and that table is what we call a character set. A character encoding, on the other hand, is a transformation algorithm: we use the character encoding to transform a character's binary value into a new one before saving it to disk, and we invert that transformation when reading the character back.
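The distinction shows up directly in Python, where a string is a sequence of code points (the character-set side) and `encode` applies a character encoding to produce the bytes actually stored. A minimal illustration:

```python
# A character set maps each character to a number (its code point).
text = "A中"
code_points = [ord(ch) for ch in text]
print(code_points)  # [65, 20013]

# A character encoding transforms those code points into bytes for disk.
encoded = text.encode("utf-8")
print(list(encoded))  # [65, 228, 184, 173] - 'A' takes 1 byte, '中' takes 3

# Decoding inverts the transformation when reading the bytes back.
decoded = encoded.decode("utf-8")
print(decoded == text)  # True
```

Note that `'A'` survives unchanged while `'中'` is transformed into three bytes: the encoding step is a real conversion, not a straight copy of the code point.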

Why do we need to transform a character's binary with a character encoding rather than storing the character's binary directly? To answer this, we need to understand how character sets evolved along with computers. Since the computer was invented in America, the earliest character sets covered the Latin alphabet: its 26 letters, plus Arabic numerals, various punctuation marks, and special control characters, together constitute the ASCII (American Standard Code for Information Interchange) set. ASCII contains 128 characters, so seven bits (2^7 = 128) are just enough to store them in binary, but an extra bit was added in case more characters were needed in the future. This also answers an interesting question: why does a byte have eight bits rather than, say, three or four? Because every ASCII character can be represented as an eight-bit binary value, and that is essentially what a byte means.

Up to now we seemingly haven't found any use for character encodings. In fact, ASCII's character set is its own character encoding; that is, ASCII needs no conversion at all. But as computing developed, more and more countries began using computers, and every country has its own language. Chinese, for example, has tens of thousands of characters, while the maximum capacity of an 8-bit, ASCII-style set is 256; clearly that is not enough, so each country began creating its own character set, such as Europe's `ISO/IEC 8859`, Japan's `Shift_JIS`, and China's GB2312/GBK/GB18030. When using a text editor in the past, we had to specify the correct character encoding, otherwise we would get a bunch of garbled characters. These large new character sets need more bits to store: covering tens of thousands of Chinese characters might require up to 4 bytes, but only the characters toward the back of the set need 4 bytes. The first values match ASCII for compatibility, and many characters need only 2 bytes. If every character occupied a fixed 4 bytes, it would waste a great deal of space and break compatibility with ASCII.

To solve this problem, we need to use different numbers of bytes for different characters. But the bytes saved on disk form one continuous stream, so how does the computer determine whether a character takes one byte or several? This is where we need to design an algorithm. Take the GBK encoding: because the number of collected characters isn't very large, 1-2 bytes are enough. ASCII characters keep the highest bit of their byte at 0, while the lead byte of a two-byte character has its highest bit set to 1, so the computer can tell whether a character takes one byte or two when reading the stream.
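A minimal sketch of that decoding rule, using Python's built-in GBK codec: inspect the high bit of each byte to decide whether the next character is one byte or two.

```python
# In GBK, ASCII characters keep the high bit 0, while the lead byte of a
# two-byte character has the high bit set to 1.
data = "Hi中".encode("gbk")
print(list(data))  # [72, 105, 214, 208]

chars = []
i = 0
while i < len(data):
    if data[i] < 0x80:  # high bit is 0: a single-byte ASCII character
        chars.append(data[i:i + 1].decode("gbk"))
        i += 1
    else:               # high bit is 1: the lead byte of a two-byte character
        chars.append(data[i:i + 2].decode("gbk"))
        i += 2
print(chars)  # ['H', 'i', '中']
```

Real decoders handle invalid sequences and edge cases; this sketch only shows how the high bit lets a decoder walk a mixed one-byte/two-byte stream without ambiguity.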

Later, GB18030 collected even more characters and uses up to 4 bytes per character, which requires a more complex encoding design. The job of a character encoding is to convert the fixed-length binary of a character set into variable-length binary to optimize storage space. The ASCII character set needs only 1 byte, so it has no separate concept of a character encoding; GBK's designers built the encoding into the character set itself, so the set is its own encoding. By contrast, the Unicode character set we discuss next is a pure character set: it must be paired with a character encoding such as UTF-8, UTF-16, or UTF-32. Variable-length encodings save storage space, but large character sets require extra calculations when converting characters, which slows down decoding.
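The difference between "the set is its own encoding" and "a pure character set" can be made concrete. For ASCII, the stored byte equals the code point; for Unicode, one fixed code point yields different bytes depending on the encoding you pair it with:

```python
# For ASCII, the stored byte IS the code point: the set is its own encoding.
assert "A".encode("ascii") == bytes([ord("A")])

# Unicode is a pure character set: the code point is fixed, but the stored
# bytes depend on which character encoding you choose.  (The "-be" variants
# are used here to avoid the byte-order mark Python would otherwise prepend.)
print(hex(ord("中")))                            # 0x4e2d - the code point
print("utf-8 :", list("中".encode("utf-8")))     # [228, 184, 173] - 3 bytes
print("utf-16:", list("中".encode("utf-16-be"))) # [78, 45]        - 2 bytes
print("utf-32:", list("中".encode("utf-32-be"))) # [0, 0, 78, 45]  - 4 bytes
```

The code point U+4E2D never changes; only its on-disk representation does, which is exactly the character-set/character-encoding split the article describes.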

As computers spread to more and more countries, more and more character sets came into use, and these sets were incompatible with each other: a text document written on one computer might display as a bunch of garbled characters on another. We urgently needed one large, unified character set containing all characters, so that we would no longer be torn between selecting and converting character sets. Out of this expectation, Unicode was created. The first official Unicode version was published in 1991, and the standard defines a code space of more than a million code points. Modern computer systems support Unicode from the ground up: ordinary software, web pages, writing, and coding all use Unicode.

We said above that Unicode is a pure character set: it doesn't prescribe any particular character encoding, so we can choose UTF-8, UTF-16, or UTF-32 depending on storage space and performance needs. Here are the differences between them:

• UTF-8: Variable-length encoding; uses 1 to 4 bytes to store a character (the original design allowed up to 6).
• UTF-32: Fixed-length encoding; every character occupies four bytes.
• UTF-16: Variable-length encoding; uses two or four bytes to store a character, a compromise between space and performance.
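The trade-offs in the list above can be measured directly. A small sketch comparing the byte count per character for an ASCII letter, a Chinese character, and an emoji:

```python
# Byte counts per character under each UTF encoding.
# The "-be" variants avoid the byte-order mark Python prepends otherwise.
for ch in ("A", "中", "😀"):
    sizes = {enc: len(ch.encode(enc))
             for enc in ("utf-8", "utf-16-be", "utf-32-be")}
    print(f"U+{ord(ch):04X}", sizes)
# UTF-8 grows from 1 to 4 bytes, UTF-16 from 2 to 4, UTF-32 is always 4.
```

For ASCII-heavy text, UTF-8 is by far the smallest; for text dominated by CJK characters, UTF-16 can actually beat UTF-8 (2 bytes vs 3 per character), which is part of why "best" depends on the data.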

There is no single best encoding; each is optimized for different purposes. But as hardware performance and algorithms improved, users came to care more about data transmission speed, and since UTF-8 typically needs the fewest bytes to transmit, it has become the most popular encoding format today. Most web pages, systems, and editors use UTF-8 as the default encoding.

One final word: characters stored with a variable-length encoding, such as UTF-8, are sometimes called narrow characters, while those stored with a fixed-length encoding are called wide characters. Character encodings were once complex and fragmented, but after generation upon generation of improvement, we now basically use the Unicode character set with the UTF-8 character encoding to store and transmit data.
