Byte order mark

The byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. Its code point is U+FEFF. BOM use is optional, and, if used, it should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in.[1]

Because Unicode can be encoded as 16-bit or 32-bit integers, a computer receiving these encodings from arbitrary sources needs to know which byte order the integers are encoded in. The BOM gives the producer of the text a way to describe the text stream's endianness to the consumer of the text without requiring some contract or metadata outside of the text stream itself. Once the receiving computer has consumed the text stream, it presumably processes the characters in its own native byte order and no longer needs the BOM. Hence the need for a BOM arises in the context of text interchange, rather than in normal text processing within a closed environment.
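As a minimal sketch of this producer/consumer arrangement, the Python snippet below serializes the same text in both byte orders with a leading BOM and lets the generic `utf-16` codec work out the byte order from the BOM alone. The sample string is an arbitrary assumption; `codecs.BOM_UTF16_BE` and `codecs.BOM_UTF16_LE` are standard-library constants.

```python
import codecs

text = "byte order"  # arbitrary sample text (assumption)

# Producer side: the same characters serialized in each byte order, BOM first.
big_endian = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
little_endian = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Consumer side: the generic "utf-16" codec reads the BOM, picks the byte
# order accordingly, and drops the BOM from the decoded result.
decoded_be = big_endian.decode("utf-16")
decoded_le = little_endian.decode("utf-16")
```

Both decoded strings compare equal to the original text, with no out-of-band agreement about byte order.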

Usage

If the BOM character appears in the middle of a data stream, Unicode says it should be interpreted as a "zero-width non-breaking space" (essentially a null character). As of Unicode 3.2, this usage is deprecated in favour of the "Word Joiner" character, U+2060.[1] This allows U+FEFF to be used solely as a BOM.
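The difference between a leading and an embedded U+FEFF can be illustrated with Python's standard codecs. This is only a sketch with an arbitrary sample string: the generic `utf-16` decoder consumes the first U+FEFF as a BOM and leaves a later one in the text.

```python
import unicodedata

# A BOM at the start, plus a second U+FEFF in the middle of the text.
# "utf-16-be" adds no BOM of its own, so the bytes start with FE FF.
raw = "\ufeffab\ufeffcd".encode("utf-16-be")

# Only the leading U+FEFF is treated as a BOM and removed.
decoded = raw.decode("utf-16")

# Character names as recorded in the Unicode Character Database.
names = (unicodedata.name("\ufeff"), unicodedata.name("\u2060"))
```

Here `decoded` still contains the embedded U+FEFF between "ab" and "cd", and `names` holds the formal names of the two characters discussed above.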

UTF-8

The UTF-8 representation of the BOM is the byte sequence 0xEF,0xBB,0xBF. A text editor or web browser interpreting the text as ISO-8859-1 or CP1252 will display the characters ï»¿ for this.
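A short Python check of these byte values, offered as an illustration rather than as part of any standard: it encodes U+FEFF as UTF-8 and then misreads the bytes in the two legacy encodings named above.

```python
# Encode the BOM character as UTF-8.
bom_utf8 = "\ufeff".encode("utf-8")

# Misinterpret the same three bytes as legacy single-byte encodings.
as_latin1 = bom_utf8.decode("iso-8859-1")
as_cp1252 = bom_utf8.decode("cp1252")
```

Both misreadings yield the same three visible characters, because ISO-8859-1 and CP1252 agree on the byte values 0xEF, 0xBB and 0xBF.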

The Unicode Standard permits the BOM in UTF-8,[2] but neither requires nor recommends its use.[3][4] Byte order has no meaning in UTF-8,[5] so the BOM's only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8. The BOM may appear when UTF-8 data is converted from other encodings that use a BOM; in this case the standard recommends against removing it so that round-tripping between encodings does not lose information.
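In Python, for example, the choice between keeping and dropping the signature is exposed through two standard codecs, `utf-8` and `utf-8-sig`. The snippet is a sketch with made-up sample data.

```python
data = b"\xef\xbb\xbfhello"  # sample UTF-8 bytes with a leading BOM

# Plain "utf-8" keeps the BOM as a U+FEFF character, so re-encoding
# reproduces the original bytes exactly (lossless round trip).
kept = data.decode("utf-8")
round_tripped = kept.encode("utf-8")

# "utf-8-sig" removes a leading BOM if one is present.
stripped = data.decode("utf-8-sig")

# On output, "utf-8-sig" writes the signature; plain "utf-8" does not.
with_sig = "hello".encode("utf-8-sig")
without_sig = "hello".encode("utf-8")
```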

ASCII characters encode as themselves in UTF-8. Therefore a plain ASCII file is already in UTF-8 encoding. Requiring a BOM would break this backward compatibility, which is why the Standard does not specifically advocate the UTF-8 BOM. Furthermore, common appearance of a BOM encourages systems to rely on it to identify UTF-8 text, artificially distinguishing between ASCII and UTF-8. The distinction could discourage use of Unicode. UTF-8 can be easily identified by pattern recognition, so the BOM should be unnecessary.
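One simple form of such pattern recognition is to test whether the bytes are well-formed UTF-8. The function below is a hypothetical helper, not something defined by the Standard, and it is only a heuristic: pure ASCII always passes, and short legacy-encoded inputs can occasionally pass by accident.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


ascii_only = b"plain ASCII text"          # valid UTF-8 by construction
utf8_text = "caf\u00e9".encode("utf-8")    # non-ASCII, encoded as UTF-8
latin1_text = "caf\u00e9".encode("iso-8859-1")  # same text, legacy encoding
```

No BOM is involved: the ASCII and UTF-8 inputs are accepted, while the ISO-8859-1 bytes are rejected because the lone 0xE9 byte is not a complete UTF-8 sequence.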

A BOM also complicates migration toward Unicode. Many programs without Unicode support can accept UTF-8 bytes internally but cannot handle a BOM at the start. For example, non-ASCII UTF-8 text might appear as a string literal in the source code of a computer program, and when executed the program will write the correct UTF-8 to a file or to a display even though the programming language knows nothing about UTF-8. By contrast, a BOM at the start of the source file would raise a syntax error, even though the language can otherwise handle the UTF-8 bytes.

A leading BOM can also defeat software that uses pattern matching on the start of a text file, since it inserts 3 bytes before the pattern. Though commonly associated with the Unix shebang at the start of an interpreted script,[6] the problem is more widespread. For instance in PHP, the existence of a BOM will cause the page to begin output before the initial code is interpreted, causing problems if the page is trying to send custom HTTP headers (which must be set before output begins).
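The shebang case can be sketched as follows. `has_shebang` and its `tolerate_bom` parameter are hypothetical names introduced for illustration; real kernels and interpreters perform their own checks.

```python
import codecs


def has_shebang(data: bytes, tolerate_bom: bool = False) -> bool:
    """Check whether the file content starts with the "#!" marker."""
    if tolerate_bom and data.startswith(codecs.BOM_UTF8):
        # Skip the three BOM bytes before looking for the pattern.
        data = data[len(codecs.BOM_UTF8):]
    return data.startswith(b"#!")


script = b"#!/bin/sh\necho hello\n"
script_with_bom = codecs.BOM_UTF8 + script
```

A naive check succeeds on `script` and fails on `script_with_bom`; only a matcher that explicitly skips the BOM finds the marker in both.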

Some common programs from Microsoft, such as Notepad and Visual C++,[7] add BOMs to UTF-8 files by default. Google Docs adds a BOM when a Microsoft Word document is downloaded as a .txt file.

Java does not support UTF-8 with a BOM, and there is no intention to implement such support in future releases.[8][9]

UTF-16

In UTF-16, a BOM (U+FEFF) may be placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream.

  • If the 16-bit units are represented in big-endian byte order, this BOM character will appear in the sequence of bytes as 0xFE followed by 0xFF. This sequence appears as the ISO-8859-1 characters þÿ in a text display that expects the text to be ISO-8859-1.
  • If the 16-bit units use little-endian order, the sequence of bytes will have 0xFF followed by 0xFE. This sequence appears as the ISO-8859-1 characters ÿþ in a text display that expects the text to be ISO-8859-1.

Programs expecting UTF-8 may show these characters or error indicators, depending on how they handle UTF-8 encoding errors. In all cases they will probably display the rest of the file as garbage, although a UTF-16 text containing only ASCII characters will remain fairly readable.
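What a UTF-8 consumer sees can be approximated in Python by decoding UTF-16 bytes as UTF-8 with replacement of invalid bytes. This is a sketch of one possible error-handling policy, not a description of any particular program.

```python
# Little-endian UTF-16 with a leading BOM: FF FE, then "H" 00, "i" 00.
data = "\ufeffHi".encode("utf-16-le")

# A lenient UTF-8 consumer: each invalid byte becomes U+FFFD, the
# replacement character, while ASCII bytes and NUL bytes pass through.
shown = data.decode("utf-8", errors="replace")

# A strict UTF-8 consumer rejects the data outright.
try:
    data.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
```

The ASCII letters survive in `shown`, interleaved with NUL characters, which is why ASCII-only UTF-16 text stays partly legible.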

For the IANA registered charsets UTF-16BE and UTF-16LE, a byte order mark should not be used because the names of these character sets already determine the byte order. If encountered anywhere in such a text stream, U+FEFF is to be interpreted as a "zero width no-break space".
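Python's codecs follow the same split, which makes for a convenient illustration (a sketch, using an arbitrary one-letter sample): the explicit `utf-16-le` codec neither writes nor strips a BOM, while the generic `utf-16` codec writes one.

```python
import codecs

# Explicit byte order: no BOM is written.
explicit = "A".encode("utf-16-le")

# Generic "utf-16": a BOM is written, followed by native-order code units.
generic = "A".encode("utf-16")

# With an explicit byte order, a leading U+FEFF is ordinary text and
# survives decoding as a zero width no-break space.
roundtrip = "\ufeffA".encode("utf-16-le").decode("utf-16-le")
```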

Conformance clause D98 (section 3.10) of the Unicode Standard states, "The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian." Whether or not a higher-level protocol is in force is open to interpretation. Files local to a computer for which the native byte ordering is little-endian, for example, might be argued to be encoded as UTF-16LE implicitly. Therefore the presumption of big-endian is widely ignored. When those same files are accessible on the Internet, on the other hand, no such presumption can be made. Searching for ASCII characters or just the space character (U+0020) is a method of determining the UTF-16 byte order.
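The ASCII-search approach can be sketched as a crude heuristic. `guess_utf16_byte_order` is a hypothetical helper that assumes the data is BOM-less UTF-16 containing a fair amount of ASCII; it falls back to big-endian, the default quoted from clause D98, when the evidence is inconclusive.

```python
def guess_utf16_byte_order(data: bytes) -> str:
    """Guess "be" or "le" for BOM-less UTF-16 data that contains ASCII."""
    # For an ASCII character the high byte of its 16-bit unit is zero.
    # That zero sits at an even offset in big-endian data and at an odd
    # offset in little-endian data.
    zeros_even = data[0::2].count(0)
    zeros_odd = data[1::2].count(0)
    if zeros_odd > zeros_even:
        return "le"
    # Ties and big-endian evidence both yield the D98 default.
    return "be"


sample_le = "hello world".encode("utf-16-le")
sample_be = "hello world".encode("utf-16-be")
```

Text with few ASCII characters gives this heuristic little to work with, so its answer should be treated as a guess.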

UTF-32

Although a BOM could be used with UTF-32, this encoding is rarely used for transmission. Otherwise the same rules as for UTF-16 are applicable.
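For completeness, a brief Python sketch of the UTF-32 case, using the standard-library constants `codecs.BOM_UTF32_BE` and `codecs.BOM_UTF32_LE` and an arbitrary one-letter sample:

```python
import codecs

text = "A"  # arbitrary sample text (assumption)

# The BOM occupies one 32-bit code unit in the chosen byte order.
utf32_be = codecs.BOM_UTF32_BE + text.encode("utf-32-be")
utf32_le = codecs.BOM_UTF32_LE + text.encode("utf-32-le")

# The generic "utf-32" codec uses the BOM to pick the byte order.
decoded = (utf32_be.decode("utf-32"), utf32_le.decode("utf-32"))
```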

Representations of byte order marks by encoding

This table illustrates how BOMs are represented as octet sequences and how they appear expressed in another encoding (here the legacy ISO-8859-1):

Encoding          Representation (hexadecimal)   Representation (decimal)   Representation (ISO-8859-1)
UTF-8             EF BB BF                       239 187 191                ï»¿
UTF-16 (BE)       FE FF                          254 255                    þÿ
UTF-16 (LE)       FF FE                          255 254                    ÿþ
UTF-32 (BE)       00 00 FE FF                    0 0 254 255                □□þÿ (□ is the ASCII null character)
UTF-32 (LE)       FF FE 00 00                    255 254 0 0                ÿþ□□ (□ is the ASCII null character)
UTF-7[10]         2B 2F 76 38                    43 47 118 56               +/v8
                  2B 2F 76 39                    43 47 118 57               +/v9
                  2B 2F 76 2B                    43 47 118 43               +/v+
                  2B 2F 76 2F[11]                43 47 118 47               +/v/
UTF-1[10]         F7 64 4C                       247 100 76                 ÷dL
UTF-EBCDIC[10]    DD 73 66 73                    221 115 102 115            Ýsfs
SCSU[10]          0E FE FF                       14 254 255                 □þÿ (□ is the ASCII "shift out" character)
BOCU-1[10]        FB EE 28                       251 238 40                 ûî(
GB-18030[10]      84 31 95 33                    132 49 149 51              □1■3 (□ and ■ are unmapped ISO-8859-1 characters)
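The table above translates directly into a signature lookup. The sketch below is illustrative only: `BOM_SIGNATURES` and `sniff_bom` are hypothetical names, and a match is merely a hint, since the same bytes can occur at the start of data that is not BOM-prefixed text (for instance, FF FE 00 00 is also a UTF-16 (LE) BOM followed by U+0000).

```python
# Signatures copied from the table. Longer signatures are listed first so
# that UTF-32 (LE), FF FE 00 00, is not mistaken for UTF-16 (LE), FF FE.
BOM_SIGNATURES = [
    (b"\x00\x00\xfe\xff", "UTF-32 (BE)"),
    (b"\xff\xfe\x00\x00", "UTF-32 (LE)"),
    (b"\x84\x31\x95\x33", "GB-18030"),
    (b"\xdd\x73\x66\x73", "UTF-EBCDIC"),
    (b"\x2b\x2f\x76\x38", "UTF-7"),
    (b"\x2b\x2f\x76\x39", "UTF-7"),
    (b"\x2b\x2f\x76\x2b", "UTF-7"),
    (b"\x2b\x2f\x76\x2f", "UTF-7"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xf7\x64\x4c", "UTF-1"),
    (b"\x0e\xfe\xff", "SCSU"),
    (b"\xfb\xee\x28", "BOCU-1"),
    (b"\xfe\xff", "UTF-16 (BE)"),
    (b"\xff\xfe", "UTF-16 (LE)"),
]


def sniff_bom(data: bytes):
    """Return the encoding name suggested by a leading BOM, or None."""
    for signature, name in BOM_SIGNATURES:
        if data.startswith(signature):
            return name
    return None
```

Checking the four-byte signatures before the two-byte ones is the main design choice; reversing the order would report every UTF-32 (LE) stream as UTF-16 (LE).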

See also

  • Left-to-right mark
  • Non-breaking space
  • Punctuation







Source: Wikipedia. The above article is available under the GNU FDL.




©2011-2013 TutorGig.info All Rights Reserved. Privacy Statement