譚惟心: Unicode, UTF-8, and Data Types

Wikipedia: UTF-8

劉任昌, Unit 101

劉任昌, Unit 102

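UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode, and it is also a prefix code. It encodes every valid code point in the Unicode character set using one to four bytes, and it is part of the Unicode Standard. (Compiled by 劉任昌 from Wikipedia)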
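To make the one-to-four-byte range concrete, here is a minimal Python sketch that encodes one character from each length class with the built-in `str.encode`. The sample characters and the helper name `utf8_length` are chosen here for illustration and do not come from the original text.

```python
# Sample characters chosen for illustration (not from the original text).
samples = [
    "A",           # U+0041, Latin capital letter A
    "\u00e9",      # U+00E9, Latin small letter e with acute
    "\u4e2d",      # U+4E2D, a CJK ideograph
    "\U0001F600",  # U+1F600, an emoji outside the Basic Multilingual Plane
]


def utf8_length(ch: str) -> int:
    """Return how many bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ')} ({utf8_length(ch)} byte(s))")
```

The four samples need 1, 2, 3, and 4 bytes respectively.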

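Characters with small code values (half-width Latin letters, Arabic digits, and so on) are the most frequently used, so storing them directly as fixed-width Unicode code values is inefficient: it wastes memory, processing resources, and transmission time. UTF-8 was designed to be backward compatible with ASCII. The first 128 Unicode characters are each encoded as a single byte with the same binary value as in ASCII, and they correspond one-to-one with the ASCII characters, so software written to handle ASCII text can keep working with little or no modification.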
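The ASCII compatibility and the space saving can be checked directly. The sketch below, which assumes a hypothetical ASCII-only sample string, compares the bytes produced by Python's `ascii`, `utf-8`, and `utf-32-be` codecs; UTF-32 stands in here for a fixed-width Unicode encoding.

```python
text = "Hello, 123"  # hypothetical ASCII-only sample, 10 characters

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
utf32_bytes = text.encode("utf-32-be")  # fixed 4 bytes per code point, no BOM

# For pure ASCII text, the UTF-8 bytes are identical to the ASCII bytes.
same_bytes = ascii_bytes == utf8_bytes

print("identical to ASCII:", same_bytes)
print("UTF-8 size: ", len(utf8_bytes), "bytes")
print("UTF-32 size:", len(utf32_bytes), "bytes")
```

For this sample, UTF-8 uses 10 bytes where UTF-32 uses 40.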

Unicode

●Unicode
 ●UTF-7
 ●UTF-8 
 ●UTF-16 
 ●UTF-32

Codepage layout

 The following table summarizes usage of UTF-8 code units (individual bytes or octets) in a code page format. The upper half (0_ to 7_) is for bytes used only in single-byte codes, so it looks like a normal code page; the lower half is for continuation bytes (8_ to B_) and leading bytes (C_ to F_), and is explained further in the legend below.

Blue cells are 7-bit (single-byte) sequences. They must not be followed by a continuation byte.

 Orange cells with a large dot are continuation bytes. The hexadecimal number shown after the + symbol is the value of the 6 bits they add. A continuation byte never occurs as the first byte of a multi-byte sequence. 

 White cells are the leading bytes for a sequence of multiple bytes,[19] the length shown at the left edge of the row. The text shows the Unicode blocks encoded by sequences starting with this byte, and the hexadecimal code point shown in the cell is the lowest character value encoded using that leading byte. 
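The byte classes in this legend can be told apart from the byte value alone. The following Python sketch is one way to express that; the function names are hypothetical, and `sequence_length` returns 0 for any byte that cannot start a valid sequence (continuation bytes, plus C0, C1, and F5 to FF, which never appear in valid UTF-8).

```python
def is_continuation(byte: int) -> bool:
    """True for bytes of the form 10xxxxxx (0x80 to 0xBF)."""
    return (byte & 0xC0) == 0x80


def continuation_payload(byte: int) -> int:
    """Return the 6 payload bits carried by a continuation byte."""
    return byte & 0x3F


def sequence_length(lead: int) -> int:
    """Return the sequence length announced by a first byte, or 0 if the
    byte cannot start a valid UTF-8 sequence."""
    if lead <= 0x7F:
        return 1  # single-byte (7-bit) character
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # continuation byte, or a byte that is never valid


# Decode the three bytes of U+4E2D (E4 B8 AD) by hand.
data = b"\xe4\xb8\xad"
assert sequence_length(data[0]) == 3
code_point = (
    (data[0] & 0x0F) << 12
    | continuation_payload(data[1]) << 6
    | continuation_payload(data[2])
)
print(f"U+{code_point:04X}")
```

Some leading bytes accepted here (E0, ED, F0, F4) still restrict which continuation bytes may follow, so this check alone is not a full validator.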
 
Red cells must never appear in a valid UTF-8 sequence. The first two red cells (C0 and C1) could be used only for a 2-byte encoding of a 7-bit ASCII character, which should be encoded in 1 byte; as described below, such "overlong" sequences are disallowed. To understand why, consider code point 128 (hex 80, binary 1000 0000). To encode it in 2 bytes, the low six bits are stored in the second byte as 10 000000 (which is the value 128 itself), and the upper two bits are stored in the first byte as 110 00010, making the minimum first byte C2. The red cells in the F_ row (F5 to FD) indicate leading bytes of 4-byte or longer sequences that cannot be valid because they would encode code points larger than the U+10FFFF limit of Unicode (a limit derived from the maximum code point encodable in UTF-16). FE and FF do not match any allowed character pattern and are therefore not valid start bytes.
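The arithmetic for code point 128 can be reproduced in a few lines. This is a minimal sketch with hypothetical helper names: `encode_two_byte` builds the 110xxxxx 10xxxxxx pattern by hand, and `is_valid_utf8` relies on the strict `utf-8` decoder built into Python to show that red-cell bytes are rejected.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF as 110xxxxx 10xxxxxx."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point outside the 2-byte range")
    first = 0xC0 | (code_point >> 6)     # marker 110 plus the upper 5 bits
    second = 0x80 | (code_point & 0x3F)  # marker 10 plus the lower 6 bits
    return bytes([first, second])


def is_valid_utf8(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


smallest = encode_two_byte(0x80)
print(smallest.hex(" "))                    # smallest 2-byte sequence
print(is_valid_utf8(b"\xc0\x80"))           # overlong form starting with C0
print(is_valid_utf8(b"\xf5\x80\x80\x80"))   # F5 would exceed U+10FFFF
```

The hand-built result for U+0080 is C2 80, matching the built-in encoder, while the C0 and F5 sequences are rejected.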

 Pink cells are the leading bytes for a sequence of multiple bytes, of which some, but not all, possible continuation sequences are valid. E0 and F0 could start overlong encodings; in these cells the lowest non-overlong-encoded code point is shown. F4 can start code points greater than U+10FFFF, which are invalid. ED can start the encoding of a code point in the range U+D800–U+DFFF; these are invalid since they are reserved for UTF-16 surrogate halves. (Compiled by 譚惟心 from Wikipedia)
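Each pink-cell case has a boundary that can be probed with a pair of byte sequences, one just inside the valid range and one just outside. The sketch below uses the strict `utf-8` decoder built into Python as the judge; the helper `is_valid_utf8` and the labels in `cases` are hypothetical names introduced for this example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


cases = {
    "E0 overlong":              b"\xe0\x80\x80",      # would re-encode U+0000
    "E0 lowest valid, U+0800":  b"\xe0\xa0\x80",
    "F0 overlong":              b"\xf0\x80\x80\x80",  # would re-encode U+0000
    "F0 lowest valid, U+10000": b"\xf0\x90\x80\x80",
    "F4 highest, U+10FFFF":     b"\xf4\x8f\xbf\xbf",
    "F4 beyond U+10FFFF":       b"\xf4\x90\x80\x80",
    "ED below surrogates":      b"\xed\x9f\xbf",      # U+D7FF
    "ED surrogate, U+D800":     b"\xed\xa0\x80",
}

results = {label: is_valid_utf8(data) for label, data in cases.items()}
for label, ok in results.items():
    print(f"{label:26} {'valid' if ok else 'invalid'}")
```

In every pair, only the second byte differs between the accepted and the rejected sequence, which is why these leading bytes cannot be classified as simply valid or invalid on their own.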
