UTF-8
Published: 2019-05-26


From Wikipedia, the free encyclopedia

UTF-8 (UCS Transformation Format, 8-bit) is a variable-width encoding for Unicode. Like UTF-16 and UTF-32, it can represent every character in the Unicode character set. Unlike them, it is backward-compatible with ASCII, and it avoids the complications of endianness and the resulting need for byte order marks (BOM). For these and other reasons, UTF-8 has become the dominant character encoding for the World Wide Web, accounting for more than half of all Web pages. The Internet Engineering Task Force (IETF) requires all Internet protocols to identify the encoding used for character data, and the supported character encodings must include UTF-8. The Internet Mail Consortium (IMC) recommends that all e-mail programs be able to display and create mail using UTF-8. UTF-8 is also increasingly being used as the default character encoding in operating systems, programming languages, APIs, and software applications.

UTF-8 encodes each of the 1,112,064 code points in the Unicode character set using one to four 8-bit bytes (termed "octets" in the Unicode Standard). Code points with lower numerical values (i.e., earlier code positions in the Unicode character set, which tend to occur more frequently in practice) are encoded using fewer bytes, making the encoding scheme reasonably efficient. In particular, the first 128 characters of the Unicode character set, which correspond one-to-one with ASCII, are encoded using a single octet with the same binary value as the corresponding ASCII character, so valid ASCII text is also valid UTF-8-encoded Unicode text.
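As a rough illustration of the "fewer bytes for lower code points" property, the sketch below asks Python's built-in UTF-8 codec how many bytes a few characters need. The sample characters and the helper name `utf8_length` are chosen here for illustration and do not come from the article.

```python
# Sample characters picked for illustration only (one per sequence length).
SAMPLES = ["A", "\u00e9", "\u20ac", "\U0001F600"]


def utf8_length(ch: str) -> int:
    """Return how many bytes Python's UTF-8 codec uses for one character."""
    return len(ch.encode("utf-8"))


for ch in SAMPLES:
    encoded = ch.encode("utf-8")
    hex_bytes = " ".join(f"{byte:02X}" for byte in encoded)
    print(f"U+{ord(ch):04X} needs {utf8_length(ch)} byte(s): {hex_bytes}")

# ASCII text is byte-for-byte identical when encoded as UTF-8.
assert "plain ASCII".encode("ascii") == "plain ASCII".encode("utf-8")
```

The final assertion checks the ASCII compatibility claim directly: encoding the same text with the ASCII codec and with the UTF-8 codec yields identical bytes.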

The official IANA code for the UTF-8 character encoding is UTF-8.

History

By early 1992 the search was on for a good byte-stream encoding of multi-byte character sets. The draft ISO 10646 standard contained a non-required annex called UTF that provided a byte-stream encoding of its 32-bit code points. This encoding was not satisfactory on performance grounds, but it did introduce the notion that bytes in the ASCII range of 0–127 represent themselves in UTF, thereby providing backward compatibility.

In July 1992, the X/Open committee XoJIG was looking for a better encoding. Dave Prosser of Unix System Laboratories submitted a proposal for one that had faster implementation characteristics and introduced the improvement that 7-bit ASCII characters would only represent themselves; all multibyte sequences would include only bytes where the high bit was set.

In August 1992, this proposal was circulated by an X/Open representative to interested parties. Ken Thompson of the Plan 9 operating system group at Bell Labs then made a crucial modification to the encoding to allow it to be self-synchronizing, meaning that it was not necessary to read from the beginning of the string to find code point boundaries. Thompson's design was outlined on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike. Over the following days, Pike and Thompson implemented it and updated Plan 9 to use it throughout, and then communicated their success back to X/Open.

UTF-8 was first officially presented at the USENIX conference in San Diego, from January 25–29, 1993.

The original specification allowed for sequences of up to six bytes, covering numbers up to 31 bits (the original limit of the Universal Character Set). In November 2003 UTF-8 was restricted by RFC 3629 to four bytes covering only the range U+0000 to U+10FFFF, in order to match the constraints of the UTF-16 character encoding.

Design

The design of UTF‑8 as originally proposed by Dave Prosser and subsequently modified by Ken Thompson was intended to satisfy two objectives:

  1. To be backward-compatible with ASCII; and
  2. To enable encoding of up to at least 2^31 characters (the theoretical limit of the first draft proposal for the Universal Character Set).

Being backward-compatible with ASCII implied that every valid ASCII character (a 7-bit character set) also be a valid UTF‑8 character sequence, specifically, a one-byte UTF‑8 character sequence whose binary value equals that of the corresponding ASCII character:

Bits Last code point Byte 1
  7 U+007F 0xxxxxxx

Prosser’s and Thompson’s challenge was to extend this scheme to handle code points with up to 31 bits.  The solution proposed by Prosser as subsequently modified by Thompson was as follows:

Bits  Last code point  Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6
 7    U+007F           0xxxxxxx
11    U+07FF           110xxxxx  10xxxxxx
16    U+FFFF           1110xxxx  10xxxxxx  10xxxxxx
21    U+1FFFFF         11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
26    U+3FFFFFF        111110xx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
31    U+7FFFFFFF       1111110x  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
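The table above can be turned into a small encoder. What follows is a minimal sketch of the original one-to-six-byte scheme, not of modern UTF-8: the function name `encode_original_utf8` is hypothetical, and the sketch deliberately omits the later restrictions (the U+10FFFF ceiling and the exclusion of surrogate code points).

```python
def encode_original_utf8(code_point: int) -> bytes:
    """Encode a value of up to 31 bits using the original 1-6 byte table."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("code point must fit in 31 bits")
    if code_point <= 0x7F:
        return bytes([code_point])  # row 1: 0xxxxxxx

    # (last code point, total bytes, lead-byte marker) for rows 2-6.
    rows = [
        (0x7FF, 2, 0xC0),       # 110xxxxx
        (0xFFFF, 3, 0xE0),      # 1110xxxx
        (0x1FFFFF, 4, 0xF0),    # 11110xxx
        (0x3FFFFFF, 5, 0xF8),   # 111110xx
        (0x7FFFFFFF, 6, 0xFC),  # 1111110x
    ]
    for last, length, marker in rows:
        if code_point <= last:
            break

    # Peel off six bits at a time, lowest bits first, as 10xxxxxx bytes.
    out = bytearray()
    value = code_point
    for _ in range(length - 1):
        out.append(0x80 | (value & 0x3F))
        value >>= 6
    out.append(marker | value)  # the remaining high bits go in the lead byte
    return bytes(reversed(out))


print(encode_original_utf8(0x20AC).hex(" "))
print(encode_original_utf8(0x7FFFFFFF).hex(" "))
```

For code points that modern UTF-8 still allows, the result should agree with Python's built-in codec, which gives a convenient cross-check.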

The salient features of the above scheme are as follows:

  1. Every valid ASCII character is also a valid UTF‑8 encoded Unicode character with the same binary value.  (Thus, valid ASCII text is also valid UTF‑8-encoded Unicode text.)
  2. For every UTF‑8 byte sequence corresponding to a single Unicode character, the first byte unambiguously indicates the length of the sequence in bytes.
  3. All continuation bytes (byte nos. 2 – 6 in the table above) have 10 as their two most-significant bits (bits 7 – 6); in contrast, the first byte never has 10 as its two most-significant bits.  As a result, it is immediately obvious whether any given byte anywhere in a (valid) UTF‑8 stream represents the first byte of a byte sequence corresponding to a single character, or a continuation byte of such a byte sequence.
  4. As a consequence of no. 3 above, starting with any arbitrary byte anywhere in a (valid) UTF‑8 stream, it is necessary to back up by only at most five bytes in order to get to the beginning of the byte sequence corresponding to a single character (three bytes in actual UTF‑8 as explained in the next section).
  5. Starting with the second row in the table above (two bytes), every additional byte extends the maximum number of bits by five (six additional bits from the additional continuation byte, minus one bit lost in the first byte).
  6. Prosser's and Thompson's scheme was sufficiently general to be extended beyond 6-byte sequences. However, this would have allowed FE or FF bytes to occur in valid UTF-8 text, and indefinite extension would lose the desirable feature that the length of a sequence can be determined from the start byte only.
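Features 2 to 4 in the list above can be sketched in a few lines. The helper names below (`is_continuation`, `sequence_start`, `sequence_length`) are hypothetical, and the code assumes its input is already valid UTF-8; it performs no validation beyond rejecting bytes that cannot be lead bytes.

```python
def is_continuation(byte: int) -> bool:
    """True if the two most-significant bits of the byte are 10."""
    return (byte & 0xC0) == 0x80


def sequence_start(data: bytes, index: int) -> int:
    """Back up from any byte offset to the first byte of its sequence."""
    while index > 0 and is_continuation(data[index]):
        index -= 1
    return index


def sequence_length(first_byte: int) -> int:
    """Read the sequence length off the first byte (original 1-6 byte scheme)."""
    if first_byte < 0x80:
        return 1
    if is_continuation(first_byte):
        raise ValueError("continuation byte cannot start a sequence")
    # Count the leading 1 bits of the first byte.
    length = 0
    mask = 0x80
    while first_byte & mask:
        length += 1
        mask >>= 1
    if length > 6:
        raise ValueError("FE and FF never appear as lead bytes")
    return length


data = "a\u20acb".encode("utf-8")    # bytes: 61 E2 82 AC 62
start = sequence_start(data, 3)      # offset 3 is the last byte of the euro sign
print(start, sequence_length(data[start]))
```

Because a continuation byte is recognizable on its own, `sequence_start` never needs to look at anything before the current sequence, which is the self-synchronizing property the list describes.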

Description

UTF-8 is a variable-width encoding, with each character represented by one to four bytes. If the character is encoded by just one byte, the high-order bit is 0 and the other bits give the code value (in the range 0..127). If the character is encoded by a sequence of more than one byte, the first byte has as many leading '1' bits as the total number of bytes in the sequence, followed by a '0' bit, and the succeeding bytes are all marked by a leading "10" bit pattern. The remaining bits in the byte sequence are concatenated to form the Unicode value. Thus a byte with lead bit '0' is a single-byte code, a byte with multiple leading '1' bits is the first of a multi-byte sequence, and a byte with a leading "10" bit pattern is a continuation byte of a multi-byte sequence. The format of the bytes thus allows the beginning of each sequence to be detected without decoding from the beginning of the string.
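The description above, read in reverse, gives a decoder for a single sequence: strip the marker bits and concatenate what remains. This is a minimal sketch with a hypothetical function name, `decode_sequence`; it handles only the one-to-four-byte forms and, unlike a conforming decoder, does not reject overlong encodings, surrogates, or values above U+10FFFF.

```python
def decode_sequence(seq: bytes) -> int:
    """Decode one complete 1-4 byte UTF-8 sequence into its code point."""
    if not seq:
        raise ValueError("empty sequence")
    first = seq[0]
    if first < 0x80:                   # 0xxxxxxx
        length, value = 1, first
    elif first >> 5 == 0b110:          # 110xxxxx
        length, value = 2, first & 0x1F
    elif first >> 4 == 0b1110:         # 1110xxxx
        length, value = 3, first & 0x0F
    elif first >> 3 == 0b11110:        # 11110xxx
        length, value = 4, first & 0x07
    else:
        raise ValueError("not a valid lead byte")
    if len(seq) != length:
        raise ValueError("wrong number of bytes for this lead byte")
    for byte in seq[1:]:
        if byte >> 6 != 0b10:          # every later byte must be 10xxxxxx
            raise ValueError("expected a continuation byte")
        value = (value << 6) | (byte & 0x3F)  # append the six payload bits
    return value


print(f"U+{decode_sequence(bytes([0xE2, 0x82, 0xAC])):04X}")
```

Each continuation byte contributes six bits, so the loop shifts the accumulated value left by six before adding the next payload.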

Code point range       Binary code point            UTF-8 bytes
U+0000 to U+007F       0xxxxxxx                     0xxxxxxx
U+0080 to U+07FF       00000yyy yyxxxxxx            110yyyyy 10xxxxxx
U+0800 to U+FFFF       zzzzyyyy yyxxxxxx            1110zzzz 10yyyyyy 10xxxxxx
U+010000 to U+10FFFF   000wwwzz zzzzyyyy yyxxxxxx   11110www 10zzzzzz 10yyyyyy 10xxxxxx

Examples:
character '$' = code point U+0024 = 00100100, encoded as 00100100 → hexadecimal 24
character '¢' = code point U+00A2 = 00000000 10100010, encoded as 11000010 10100010 → hexadecimal C2 A2
character '€' = code point U+20AC = 00100000 10101100, encoded as 11100010 10000010 10101100 → hexadecimal E2 82 AC
character '
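The three complete examples above ('$', '¢' and '€') can be checked against Python's built-in UTF-8 codec. The helper name `utf8_bits` is hypothetical; it only formats the encoded bytes in binary and hexadecimal so they can be compared with the table by eye.

```python
def utf8_bits(ch: str) -> tuple[str, str]:
    """Return the UTF-8 bytes of one character as (binary, hexadecimal) text."""
    encoded = ch.encode("utf-8")
    binary = " ".join(f"{byte:08b}" for byte in encoded)
    hexadecimal = " ".join(f"{byte:02X}" for byte in encoded)
    return binary, hexadecimal


for ch in ["$", "\u00a2", "\u20ac"]:
    binary, hexadecimal = utf8_bits(ch)
    print(f"U+{ord(ch):04X}: {binary} (hexadecimal {hexadecimal})")
```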

Reposted from: http://efali.baihongyu.com/
