Language(s) | International |
---|---|
Standard | RFC 2152 |
Classification | Unicode Transformation Format, ASCII armor, variable-width encoding, stateful encoding |
Transforms / Encodes | ISO/IEC 10646 (Unicode) |
Preceded by | HZ-GB-2312 |
Succeeded by | UTF-8 over 8BITMIME |
UTF-7 (7-bit Unicode Transformation Format) is an obsolete variable-length character encoding for representing Unicode text using a stream of ASCII characters. It was originally intended to provide a means of encoding Unicode text for use in Internet E-mail messages that was more efficient than the combination of UTF-8 with QP-encoding.
According to its RFC, UTF-7 is not a "Unicode Transformation Format", because the definition can only encode code points in the BMP (the first 65,536 Unicode code points, which exclude emoji and many other characters). However, a UTF-7 translator that converts to or from UTF-16 can (and probably does)[citation needed] encode each surrogate half as though it were a 16-bit code point, and can thus encode all code points. It is unclear whether other UTF-7 software (such as translators to UTF-32 or UTF-8) supports this.
UTF-7 has never been an official standard of the Unicode Consortium. It is known to have security issues, which is why software has been changed to disable its use.[1] It is prohibited in HTML 5.[2][3]
MIME, the modern standard for e-mail formats, forbids encoding of headers using byte values above the ASCII range. Although MIME allows encoding the message body in various character sets (broader than ASCII), the underlying transmission infrastructure (SMTP, the main E-mail transfer standard) is still not guaranteed to be 8-bit clean. Therefore, a non-trivial content transfer encoding has to be applied in case of doubt. Unfortunately base64 has a disadvantage of making even US-ASCII characters unreadable in non-MIME clients. On the other hand, UTF-8 combined with quoted-printable produces a very size-inefficient format requiring 6–9 bytes for non-ASCII characters from the BMP and 12 bytes for characters outside the BMP.
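The 6–9 and 12 byte figures follow from quoted-printable turning every non-ASCII byte into a three-byte `=XX` escape. A minimal sketch of that arithmetic is shown below; the helper name `qp_len` is hypothetical, and the model deliberately ignores line wrapping and the escaping of `=` itself.

```python
def qp_len(text: str) -> int:
    """Approximate size in bytes of UTF-8 text after quoted-printable escaping.

    Simplified model (an assumption of this sketch): each byte >= 0x80
    becomes '=XX' (3 bytes); ASCII bytes are counted as 1 byte each.
    """
    return sum(3 if b >= 0x80 else 1 for b in text.encode("utf-8"))


# U+00A3 needs 2 UTF-8 bytes, U+2020 needs 3, U+1F600 (outside the BMP) needs 4.
sizes = {f"U+{ord(ch):04X}": qp_len(ch) for ch in ["\u00a3", "\u2020", "\U0001F600"]}
print(sizes)  # {'U+00A3': 6, 'U+2020': 9, 'U+1F600': 12}
```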
Provided certain rules are followed during encoding, UTF-7 can be sent in e-mail without using an underlying MIME transfer encoding, but still must be explicitly identified as the text character set. In addition, if used within e-mail headers such as "Subject:", UTF-7 must be contained in MIME encoded words identifying the character set. Since encoded words force use of either quoted-printable or base64, UTF-7 was designed to avoid using the = sign as an escape character to avoid double escaping when it is combined with quoted-printable (or its variant, the RFC 2047/1522 "Q"-encoding of headers).
UTF-7 is generally not used as a native representation within applications as it is very awkward to process. Despite its size advantage over the combination of UTF-8 with either quoted-printable or base64, the now defunct Internet Mail Consortium recommended against its use.[4]
8BITMIME has also been introduced, which reduces the need to encode message bodies in a 7-bit format.
A modified form of UTF-7 (sometimes dubbed 'mUTF-7'[5]) was used in the Internet Message Access Protocol (IMAP) e-mail retrieval protocol, version 4 rev 1, for "international" mailbox names.[6] The following version, IMAP version 4 rev 2, uses UTF-8 instead.[7]
UTF-7 was first proposed as an experimental protocol in RFC 1642, A Mail-Safe Transformation Format of Unicode. This RFC has been made obsolete by RFC 2152, an informational RFC which never became a standard. As RFC 2152 clearly states, the RFC "does not specify an Internet standard of any kind". Despite this, RFC 2152 is quoted as the definition of UTF-7 in the IANA's list of charsets. Neither is UTF-7 a Unicode Standard. The Unicode Standard 5.0 only lists UTF-8, UTF-16 and UTF-32. There is also a modified version, specified in RFC 2060, which is sometimes identified as UTF-7.
Some characters can be represented directly as single ASCII bytes. The first group is known as "direct characters" and contains 62 alphanumeric characters and 9 symbols: ' ( ) , - . / : ? These direct characters are safe to include literally. The other main group, known as "optional direct characters", contains all other printable characters in the range U+0020–U+007E except ~ \ + and space (the characters \ and ~ being excluded due to being redefined in "variants of ASCII" such as JIS-Roman). Using the optional direct characters reduces size and enhances human readability but also increases the chance of breakage by things like badly designed mail gateways and may require extra escaping when used in encoded words for header fields.
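The two character groups can be written out as sets, which makes the counts easy to check. This is an illustrative sketch only; the names `DIRECT`, `OPTIONAL_DIRECT` and `classify` are hypothetical and not taken from any library.

```python
import string

# Direct characters: 62 alphanumerics plus the nine symbols listed above.
DIRECT = set(string.ascii_letters + string.digits + "'(),-./:?")

# Optional direct characters: the remaining printable, non-space ASCII
# characters, except ~ \ and +.
OPTIONAL_DIRECT = {chr(c) for c in range(0x21, 0x7F)} - DIRECT - set("~\\+")


def classify(ch: str) -> str:
    """Return which group a single character falls into."""
    if ch in DIRECT:
        return "direct"
    if ch in OPTIONAL_DIRECT:
        return "optional direct"
    return "other"  # whitespace, '+', '~', '\\' and all non-ASCII characters


print(len(DIRECT), len(OPTIONAL_DIRECT))  # 71 20
```

Characters classified as "other" here are handled by the separate rules for whitespace, the plus sign, and Base64 blocks.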
Space, tab, carriage return and line feed may also be represented directly as single ASCII bytes. However, if the encoded text is to be used in e-mail, care is needed to ensure that these characters are used in ways that do not require further content transfer encoding to be suitable for e-mail. The plus sign (+) may be encoded as +-.
Other characters must be encoded in UTF-16 (hence U+10000 and higher would be encoded into two surrogates), and then in modified Base64. The start of these blocks of modified Base64-encoded UTF-16 is indicated by a + sign. The end is indicated by any character not in the modified Base64 set. If the character after the modified Base64 is a - (ASCII hyphen-minus) then it is consumed by the decoder and decoding resumes with the next character. Otherwise decoding resumes with the character after the Base64.
"Hello, World!" is encoded as "Hello, World+ACE-"
"1 + 1 = 2" is encoded as "1 +- 1 +AD0- 2"
"£1" is encoded as "+AKM-1". The Unicode code point for the pound sign is U+00A3, which converts into modified Base64 as shown in the tables below. There are two bits left over, which are padded to 0.
Hex digit | 0 | 0 | A | 3 | (padding) |
---|---|---|---|---|---|
Bit pattern | 0000 | 0000 | 1010 | 0011 | 00 |

6-bit group | 000000 | 001010 | 001100 |
---|---|---|---|
Index | 0 | 10 | 12 |
Base64-Encoded | A | K | M |
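The three examples above can be checked with Python's built-in `utf-7` codec. The sketch below only decodes, since the choice of which optional direct characters to escape is left to each encoder and output may differ between implementations.

```python
# Encoded form -> expected text, taken from the examples above.
examples = {
    b"Hello, World+ACE-": "Hello, World!",
    b"1 +- 1 +AD0- 2": "1 + 1 = 2",
    b"+AKM-1": "\u00a31",  # pound sign followed by the digit 1
}

# Decode each example with the standard library codec.
decoded = {encoded: encoded.decode("utf-7") for encoded in examples}

for encoded, expected in examples.items():
    assert decoded[encoded] == expected
```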
First, an encoder must decide which characters to represent directly in ASCII form, which + has to be escaped as +-, and which to place in blocks of Unicode characters. The expansion cost of UTF-7 can be high: for example, the character sequence U+10FFFF U+0077 U+10FFFF is 9 bytes in UTF-8, but 17 bytes in UTF-7. (At worst, treating every code point as a sequence in its own right produces the maximum expansion of 5x, e.g. when encoding @@ as +AEA-+AEA-.) Each Unicode sequence must be encoded using the following procedure, then surrounded by the appropriate delimiters.
Using the £† (U+00A3 U+2020) character sequence as an example:
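A minimal sketch of encoding one Unicode block, following the description given earlier (UTF-16, then modified Base64 without padding characters, then the + and - delimiters). The function names `six_bit_groups` and `encode_block` are hypothetical; the standard `base64` module is used for the table lookup, with its `=` padding stripped afterwards.

```python
import base64


def six_bit_groups(text: str) -> list[str]:
    """Show the zero-padded 6-bit groups of the big-endian UTF-16 form of text."""
    bits = "".join(f"{b:08b}" for b in text.encode("utf-16-be"))
    bits += "0" * (-len(bits) % 6)  # pad the last group with zero bits
    return [bits[i:i + 6] for i in range(0, len(bits), 6)]


def encode_block(text: str) -> str:
    """Encode text as a single UTF-7 Unicode block, delimiters included."""
    utf16 = text.encode("utf-16-be")  # 16-bit units, big-endian, no BOM
    b64 = base64.b64encode(utf16).decode("ascii")
    return "+" + b64.rstrip("=") + "-"  # modified Base64 omits '=' padding


print(six_bit_groups("\u00a3\u2020"))
# ['000000', '001010', '001100', '100000', '001000', '000000']
print(encode_block("\u00a3\u2020"))  # +AKMgIA-
```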
First, the encoded data must be separated into plain ASCII text chunks (including + signs followed by a dash) and nonempty Unicode blocks, as mentioned in the description section. Once this is done, each Unicode block must be decoded with the following procedure (using the result of the encoding example above as our example):
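The split-then-decode approach can be sketched as follows. This is an illustrative decoder under simplifying assumptions, not a complete or hardened implementation: the name `decode_utf7` is hypothetical, input is assumed to be well-formed ASCII, and a + that starts neither a Base64 run nor "+-" is rejected.

```python
import base64
import re

# A '+', then a (possibly empty) run of Base64 characters, then an optional '-'.
_BLOCK = re.compile(r"\+([A-Za-z0-9+/]*)(-?)")


def _decode_block(match: re.Match) -> str:
    b64, dash = match.group(1), match.group(2)
    if not b64:
        if dash:
            return "+"  # "+-" stands for a literal plus sign
        raise ValueError("'+' not followed by Base64 data or '-'")
    padded = b64 + "=" * (-len(b64) % 4)  # restore the padding Base64 expects
    return base64.b64decode(padded).decode("utf-16-be")


def decode_utf7(data: str) -> str:
    """Replace every Unicode block in data with its decoded text."""
    return _BLOCK.sub(_decode_block, data)


result = decode_utf7("+AKMgIA-")  # the two characters U+00A3 U+2020
```

Text outside the matched blocks passes through unchanged, which corresponds to the plain ASCII chunks described above.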
A byte order mark (BOM) is an optional special byte sequence at the very start of a stream or file that, without being data itself, indicates the encoding used for the data that follows; it can be used in the absence of metadata that denotes the encoding. For a given encoding scheme, it is that scheme's representation of Unicode code point U+FEFF.[8]
While it is typically a single, fixed byte sequence, in UTF-7 four variations may appear, because the last 2 bits of the 4th byte of the UTF-7 encoding of U+FEFF belong to the following character, resulting in 4 possible bit patterns and therefore 4 different possible bytes in the 4th position. See the UTF-7 entry in the table of Unicode byte order marks.[9]
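The four variants can be derived directly from the bit layout: the 16 bits of U+FEFF fill two whole Base64 characters and four bits of a third, whose remaining two bits come from whatever follows. A small sketch (the variable names are illustrative):

```python
import string

# The standard Base64 alphabet, indexed by 6-bit value.
B64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

bits = f"{0xFEFF:016b}"  # '1111111011111111'

# '+' opens the block; the first 12 bits give two complete Base64 characters.
prefix = "+" + "".join(B64[int(bits[i:i + 6], 2)] for i in (0, 6))

# The last 4 bits share a character with the top 2 bits of the next code unit.
variants = [prefix + B64[int(bits[12:] + f"{nxt:02b}", 2)] for nxt in range(4)]

print(variants)  # ['+/v8', '+/v9', '+/v+', '+/v/']
```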
UTF-7 allows multiple representations of the same source string. In particular, ASCII characters can be represented as part of Unicode blocks. As such, if standard ASCII-based escaping or validation processes are used on strings that may be later interpreted as UTF-7, then Unicode blocks may be used to slip malicious strings past them. To mitigate this problem, systems should perform decoding before validation and should avoid attempting to autodetect UTF-7.
Older versions of Internet Explorer can be tricked into interpreting a page as UTF-7. This can be used for a cross-site scripting attack as the < and > marks can be encoded as +ADw- and +AD4- in UTF-7, which most validators let through as simple text.[10]
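The problem can be illustrated with Python's standard `html.escape` standing in for an ASCII-based escaping step (an assumption of this sketch; real validators vary). Applied before decoding, it finds nothing to escape; applied after decoding, it neutralizes the markup.

```python
import html

# Markup hidden inside UTF-7 Unicode blocks; the bytes are pure ASCII.
payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"

# Escaping first: there is no '<', '>' or '&' to find, so nothing changes.
escaped_first = html.escape(payload.decode("ascii"))
assert escaped_first == payload.decode("ascii")

# Interpreted as UTF-7, the same bytes turn into a script element.
decoded = payload.decode("utf-7")
assert decoded == "<script>alert(1)</script>"

# Decoding first and escaping afterwards handles the markup correctly.
safe = html.escape(decoded)
print(safe)  # &lt;script&gt;alert(1)&lt;/script&gt;
```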
UTF-7 is considered obsolete, at least for Microsoft software (.NET), with code paths previously supporting it intentionally broken (to prevent security issues) in .NET 5, in 2020.[1]
Store mailbox names on disk using UTF-8 instead of modified UTF-7 (mUTF-7).
In modified UTF-7, printable US-ASCII characters, except for "&", represent themselves…. The character "&" (0x26) is represented by the two-octet sequence "&-". All other characters… are represented in modified BASE64….
In IMAP4rev2, mailbox names are encoded in Net-Unicode (this differs from IMAP4rev1).