Character Encoding

1. Purpose

Character encoding maps human-readable characters to bytes so that text can be stored on a computer, transferred to another person, and displayed correctly by the reader's program. It is tied to human languages. This distinguishes it from Base64, URL and HTML encoding, which are used for communication between applications.
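The difference can be illustrated with a minimal Python sketch. The sample character and the variable names are only illustrative assumptions; the point is that a character encoding turns text into bytes, while Base64 re-wraps bytes that already exist.

```python
import base64

text = "中"

# Character encoding: text -> bytes (this is what gets stored or transferred).
stored = text.encode("utf-8")

# The reader must decode with the same character encoding to display the text.
restored = stored.decode("utf-8")

# Base64 works on bytes, not on characters: it only makes bytes ASCII-safe.
wrapped = base64.b64encode(stored)
unwrapped = base64.b64decode(wrapped)

print(stored.hex(" "), restored, wrapped, unwrapped == stored)
```

Base64 never needs to know which human language the bytes represent, whereas decoding with the wrong character encoding would display the wrong text.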

2. Types

-ASCII: (American Standard Code for Information Interchange), 7 bits, including Latin letters (A~Z, a~z), Arabic numerals (0~9), English punctuation symbols and control characters. (No preamble) (ASCII table)

-EASCII: Extended ASCII, 8 bits.

-Unicode: Represents most human languages within a single character set (Universal Character Set, UCS). UCS has several UCS Transformation Formats (UTF), including UTF-7, UTF-8, UTF-16 and UTF-32. UTF-16 and UTF-32 each come in two kinds, big-endian (most significant byte first) and little-endian (least significant byte first). In Windows and .NET contexts, "Unicode" usually means UTF-16 little-endian.

Each transformation format has its own preamble (byte order mark, BOM):

UTF-8: EF BB BF
UTF-16 big-endian byte order: FE FF
UTF-16 little-endian byte order: FF FE
UTF-32 big-endian byte order: 00 00 FE FF
UTF-32 little-endian byte order: FF FE 00 00
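The preambles listed above can be used to guess the encoding of a byte stream. The following is a minimal Python sketch, not a complete detector: `detect_bom` is a hypothetical helper name, and it relies only on the BOM constants from the standard `codecs` module.

```python
import codecs

# Check the 4-byte preambles first: the UTF-32 little-endian preamble
# (FF FE 00 00) begins with the UTF-16 little-endian preamble (FF FE).
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def detect_bom(data: bytes):
    """Return (encoding, preamble_length), or (None, 0) if no known preamble is found."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


# Usage: build a UTF-16 little-endian sample with its preamble, then decode it.
sample = codecs.BOM_UTF16_LE + "中".encode("utf-16-le")
encoding, skip = detect_bom(sample)
decoded = sample[skip:].decode(encoding)
print(encoding, skip, decoded)
```

A preamble is only a hint. Encodings without one (ASCII, GB2312) cannot be detected this way, and UTF-16 little-endian text that starts with a NUL character would be mistaken for UTF-32 by this sketch.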

-GB2312: Simplified Chinese character set, including 6763 Chinese characters and other symbols. (No preamble)

-GBK: Extension of GB2312, including 20902 Chinese characters and other symbols. These correspond one-to-one to the Chinese characters (CJK Unified Ideographs) defined in early versions of Unicode.

-BIG5: Traditional Chinese character set.

-ANSI: Also known as MBCS (Multi-Byte Character Set). It is not an actual character encoding; it varies according to the system's code page setting (AKA the default character encoding), and could be GB2312, BIG5 or others.
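Because "ANSI" depends on the system, the same bytes can turn into different text on different machines. The sketch below is illustrative only: it assumes Python's standard `locale` module and its built-in `gbk` and `big5` codecs, and the variable names are made up for the example.

```python
import locale

# The default ("ANSI") encoding is whatever this system is configured to use.
system_encoding = locale.getpreferredencoding(False)
print("System default encoding:", system_encoding)

# Bytes written on a system whose code page is GBK...
raw = "中".encode("gbk")

# ...read back correctly only when decoded with the same code page.
as_gbk = raw.decode("gbk")

# A system set to BIG5 interprets the very same bytes differently.
as_big5 = raw.decode("big5", errors="replace")

print(raw.hex(" "), "->", as_gbk, "vs", as_big5)
```

This is why text saved as "ANSI" should be exchanged together with the name of the code page it was written in, or converted to a Unicode transformation format first.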