5 HTML Document Representation

Contents

  1. The Document Character Set
  2. Character encodings
    1. Choosing an encoding
    2. Specifying the character encoding
  3. Character references
    1. Numeric character references
    2. Character entity references
  4. Undisplayable characters

In this chapter, we discuss how HTML documents are represented on a computer and over the Internet.

The section on the document character set addresses the issue of what abstract characters may be part of an HTML document. Characters include the Latin letter "A", the Cyrillic letter "I", the Chinese character meaning "water", etc.

The section on character encodings addresses the issue of how those characters may be represented in a file or when transferred over the Internet. As some character encodings cannot directly represent all characters an author may want to include in a document, HTML offers other mechanisms, called character references, for referring to any character.

Since there are a great number of characters throughout human languages, and a great variety of ways to represent those characters, proper care must be taken so that documents may be understood by user agents around the world.

5.1 The Document Character Set

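To promote interoperability, SGML requires that each application (including HTML) specify its document character set. A document character set consists of:

  1. A repertoire: a set of abstract characters, such as the Latin letter "A", the Cyrillic letter "I", the Chinese character meaning "water", etc.
  2. Code positions: a set of integer references to characters in the repertoire.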

Each SGML document (including each HTML document) is a sequence of characters from the repertoire. Computer systems identify each character by its code position; for example, in the ASCII character set, code positions 65, 66, and 67 refer to the characters 'A', 'B', and 'C', respectively.
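
As a small illustration of code positions, the following Python sketch maps the three characters named above to their integer positions and back. The variable names are chosen for this example only; Python's built-in `ord` and `chr` operate on Unicode code positions, which coincide with ASCII for these characters.

```python
# Code positions of 'A', 'B', 'C' in ASCII (and, identically, in Unicode).
letters = ["A", "B", "C"]
code_positions = [ord(ch) for ch in letters]

# chr() is the inverse mapping: code position -> character.
round_trip = [chr(n) for n in code_positions]
```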

The ASCII character set is not sufficient for a global information system such as the Web, so HTML uses the much more complete character set called the Universal Character Set (UCS), defined in [ISO10646]. This standard defines a repertoire of thousands of characters used by communities all over the world.

The character set defined in [ISO10646] is character-by-character equivalent to Unicode 2.0 ([UNICODE]). Both of these standards are updated from time to time with new characters, and the amendments should be consulted at the respective Web sites. In the current specification, references to ISO/IEC-10646 or Unicode imply the same document character set. However, the HTML specification also refers to the Unicode specification for other issues such as the bidirectional text algorithm.

The document character set, however, does not suffice to allow user agents to correctly interpret HTML documents as they are typically exchanged -- encoded as a sequence of bytes in a file or during a network transmission. User agents must also know the specific character encoding that was used to transform the document character stream into a byte stream.

5.2 Character encodings

What this specification calls a character encoding is known by different names in other specifications (which may cause some confusion). However, the concept is largely the same across the Internet. Also, protocol headers, attributes, and parameters referring to character encodings share the same name -- "charset" -- and use the same values from the [IANA] registry (see [CHARSETS] for a complete list).

The "charset" parameter identifies a character encoding, which is a method of converting a sequence of bytes into a sequence of characters. This conversion fits naturally with the scheme of Web activity: servers send HTML documents to user agents as a stream of bytes; user agents interpret them as a sequence of characters. The conversion method can range from simple one-to-one correspondence to complex switching schemes or algorithms.

A simple one-byte-per-character encoding technique is not sufficient for text strings over a character repertoire as large as [ISO10646]. There are several different encodings of parts of [ISO10646] in addition to encodings of the entire character set (such as UCS-4).
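
To make the difference between encodings concrete, here is a minimal Python sketch that encodes the same two characters in three ways. It assumes Python's "utf-32-be" codec as a stand-in for a four-byte-per-character UCS-4 encoding; the variable names are illustrative only.

```python
# LATIN CAPITAL LETTER A followed by the Chinese character for water (U+6C34).
text = "A\u6c34"

utf8_bytes = text.encode("utf-8")       # variable width: 1 byte, then 3 bytes
utf16_bytes = text.encode("utf-16-be")  # 2 bytes for each of these characters
ucs4_bytes = text.encode("utf-32-be")   # 4 bytes per character (assumed UCS-4 stand-in)

# Decoding with the matching encoding recovers the same character sequence.
decoded = utf8_bytes.decode("utf-8")
```

The same character sequence thus yields byte sequences of different lengths, which is why a user agent has to know the encoding before it can interpret the bytes.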

5.2.1 Choosing an encoding

Authoring tools (e.g., text editors) may encode HTML documents in the character encoding of their choice, and the choice largely depends on the conventions used by the system software. These tools may employ any convenient encoding that covers most of the characters contained in the document, provided the encoding is correctly labeled. Occasional characters that fall outside this encoding may still be represented by character references. These always refer to the document character set, not the character encoding.
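
An authoring tool could apply this rule roughly as follows. This is a sketch, not a prescribed algorithm: the helper name `encode_with_references` is hypothetical, and it relies on Python's standard "xmlcharrefreplace" error handler, which emits decimal numeric character references.

```python
def encode_with_references(text: str, charset: str) -> bytes:
    """Encode text in charset; characters the encoding cannot
    represent are written as decimal numeric character references."""
    return text.encode(charset, errors="xmlcharrefreplace")


# U+00E5 exists in ISO-8859-1; U+6C34 (water) does not.
encoded = encode_with_references("\u00e5 \u6c34", "iso-8859-1")
```

The reference produced for the second character carries its code position in the document character set (27700), independent of the ISO-8859-1 encoding used for the rest of the text.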

Servers and proxies may change a character encoding (called transcoding) on the fly to meet the requests of user agents (see section 14.2 of [RFC2068], the "Accept-Charset" HTTP request header). Servers and proxies do not have to serve a document in a character encoding that covers the entire document character set.

Commonly used character encodings on the Web include ISO-8859-1 (also referred to as "Latin-1"; usable for most Western European languages), ISO-8859-5 (which supports Cyrillic), SHIFT_JIS (a Japanese encoding), EUC-JP (another Japanese encoding), and UTF-8 (an encoding of ISO 10646 using a different number of bytes for different characters). Names for character encodings are case-insensitive, so that for example "SHIFT_JIS", "Shift_JIS", and "shift_jis" are equivalent.
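
Case-insensitive comparison of encoding names can be sketched in Python as below. Note the assumption: Python's codec registry is used here in place of the [IANA] registry, and `same_encoding` is a hypothetical helper written for this example.

```python
import codecs


def same_encoding(name_a: str, name_b: str) -> bool:
    """Return True if both labels resolve to the same codec.
    codecs.lookup() ignores letter case when resolving a label."""
    return codecs.lookup(name_a).name == codecs.lookup(name_b).name
```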

This specification does not mandate which character encodings a user agent must support.

Conforming user agents must correctly map to Unicode all characters in any character encodings that they recognize (or they must behave as if they did).

Notes on specific encodings 

When HTML text is transmitted in UTF-16 (charset=UTF-16), text data should be transmitted in network byte order ("big-endian", high-order byte first) in accordance with [ISO10646], Section 6.3 and [UNICODE], clause C3, page 3-1.

Furthermore, to maximize chances of proper interpretation, it is recommended that documents transmitted as UTF-16 always begin with a ZERO-WIDTH NON-BREAKING SPACE character (hexadecimal FEFF, also called Byte Order Mark (BOM)) which, when byte-reversed, becomes hexadecimal FFFE, a character guaranteed never to be assigned. Thus, a user-agent receiving a hexadecimal FFFE as the first bytes of a text would know that bytes have to be reversed for the remainder of the text.
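
A receiver might apply the byte order mark convention along these lines. This is a minimal sketch under the assumptions stated in the two paragraphs above (big-endian when no mark is present); `decode_utf16` is a hypothetical name, not part of any specification.

```python
def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 text, using a leading byte order mark if present."""
    if data[:2] == b"\xfe\xff":
        # FEFF read as-is: bytes are already in network (big-endian) order.
        return data[2:].decode("utf-16-be")
    if data[:2] == b"\xff\xfe":
        # FFFE seen first: every byte pair must be reversed (little-endian).
        return data[2:].decode("utf-16-le")
    # No mark: assume network byte order, as recommended above.
    return data.decode("utf-16-be")
```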

The UTF-1 transformation format of [ISO10646] (registered by IANA as ISO-10646-UTF-1) should not be used. For information about ISO 8859-8 and the bidirectional algorithm, please consult the section on bidirectionality and character encoding.

5.2.2 Specifying the character encoding

How does a server determine which character encoding applies for a document it serves? Some servers examine the first few bytes of the document, or check against a database of known files and encodings. Many modern servers give Web masters more control over charset configuration than old servers do. Web masters should use these mechanisms to send out a "charset" parameter whenever possible, but should take care not to identify a document with the wrong "charset" parameter value.

How does a user agent know which character encoding has been used? The server should provide this information. The most straightforward way for a server to inform the user agent about the character encoding of the document is to use the "charset" parameter of the "Content-Type" header field of the HTTP protocol ([RFC2068], sections 3.4 and 14.18). For example, the following HTTP header announces that the character encoding is EUC-JP:

Content-Type: text/html; charset=EUC-JP

Please consult the section on conformance for the definition of text/html.

The HTTP protocol ([RFC2068], section 3.7.1) mentions ISO-8859-1 as a default character encoding when the "charset" parameter is absent from the "Content-Type" header field. In practice, this recommendation has proved useless because some servers don't allow a "charset" parameter to be sent, and others may not be configured to send the parameter. Therefore, user agents must not assume any default value for the "charset" parameter.
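
Reading the "charset" parameter without assuming a default might look like the following Python sketch. It uses the standard library's `email.message.Message` to parse the header value; the function name `charset_from_content_type` is hypothetical.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value,
    lower-cased, or None when the parameter is absent."""
    msg = Message()
    msg["Content-Type"] = header_value
    # No fallback is supplied, so a missing parameter yields None
    # rather than a default such as ISO-8859-1.
    return msg.get_content_charset()
```

Returning `None` instead of a default leaves the decision to the other mechanisms described in this section.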

To address server or configuration limitations, HTML documents may include explicit information about the document's character encoding; the META element can be used to provide user agents with this information.

For example, to specify that the character encoding of the current document is "EUC-JP", a document should include the following META declaration:

<META http-equiv="Content-Type" content="text/html; charset=EUC-JP">

The META declaration must only be used when the character encoding is organized such that ASCII characters stand for themselves (at least until the META element is parsed). META declarations should appear as early as possible in the HEAD element.
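
Because ASCII characters stand for themselves up to the META element, a user agent can search the raw bytes for the declaration before it knows the encoding. The Python sketch below illustrates the idea only: the helper `charset_from_meta`, the 1024-byte limit, and the regular expression are assumptions of this example, and the pattern expects "http-equiv" to precede "content", as in the declaration shown above.

```python
import re
from typing import Optional

# Byte-level pattern: works on any encoding in which ASCII stands for itself.
_META_CHARSET = re.compile(
    rb"""<meta[^>]*http-equiv\s*=\s*["']?content-type["']?[^>]*"""
    rb"""charset\s*=\s*["']?([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def charset_from_meta(document: bytes, limit: int = 1024) -> Optional[str]:
    """Look for a META Content-Type declaration near the start of the
    document and return its charset value, or None if none is found."""
    match = _META_CHARSET.search(document[:limit])
    return match.group(1).decode("ascii") if match else None
```

Limiting the search to the first bytes of the document is one reason the declaration should appear as early as possible in the HEAD element.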

For cases where neither the HTTP protocol nor the META element provides information about the character encoding of a document, HTML also provides the charset attribute on several elements. By combining these mechanisms, an author can greatly improve the chances that, when the user retrieves a resource, the user agent will recognize the character encoding.

To sum up, conforming user agents must observe the following priorities when determining a document's character encoding (from highest priority to lowest):

  1. An HTTP "charset" parameter in a "Content-Type" field.
  2. A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
  3. The charset attribute set on an element that designates an external resource.
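
The priority order above can be sketched as a simple first-match selection. The function `pick_charset` and its parameter names are hypothetical; the optional last argument stands for any further fallback a user agent may choose to apply, which this specification leaves open.

```python
from typing import Optional


def pick_charset(
    http_charset: Optional[str],
    meta_charset: Optional[str],
    link_charset: Optional[str],
    fallback: Optional[str] = None,
) -> Optional[str]:
    """Return the first available charset, from highest priority to lowest."""
    for candidate in (http_charset, meta_charset, link_charset, fallback):
        if candidate:
            return candidate
    return None
```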

In addition to this list of priorities, the user agent may use heuristics and user settings. For example, many user agents use a heuristic to distinguish the various encodings used for Japanese text. Also, user agents typically have a user-definable, local default character encoding which they apply in the absence of other indicators.

User agents may provide a mechanism that allows users to override incorrect "charset" information. However, if a user agent offers such a mechanism, it should only offer it for browsing and not for editing, to avoid the creation of Web pages marked with an incorrect "charset" parameter.

Note. If, for a specific application, it becomes necessary to refer to characters outside [ISO10646], characters should be assigned to a private zone to avoid conflicts with present or future versions of the standard. This is highly discouraged, however, for reasons of portability.

5.3 Character references

A given character encoding may not be able to express all characters of the document character set. For such encodings, or when hardware or software configurations do not allow users to input some document characters directly, authors may use SGML character references. Character references are a character encoding-independent mechanism for entering any character from the document character set.

Character references in HTML may appear in two forms:

  1. Numeric character references (either decimal or hexadecimal).
  2. Character entity references.

Character references within comments have no special meaning; they are comment data only.

Note. HTML provides other ways to present character data, in particular inline images.

Note. In SGML, it is possible to eliminate the final ";" after a character reference in some cases (e.g., at a line break or immediately before a tag). In other circumstances it may not be eliminated (e.g., in the middle of a word). We strongly suggest using the ";" in all cases to avoid problems with user agents that require this character to be present.

5.3.1 Numeric character references

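Numeric character references specify the code position of a character in the document character set. Numeric character references may take two forms:

  1. The syntax "&#D;", where D is a decimal number, refers to the ISO 10646 decimal character number D.
  2. The syntax "&#xH;" or "&#XH;", where H is a hexadecimal number, refers to the ISO 10646 hexadecimal character number H. Hexadecimal numbers in numeric character references are case-insensitive.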

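Here are some examples of numeric character references:

  1. &#229; (in decimal) represents the letter "a" with a small circle above it (used, for example, in Norwegian).
  2. &#xE5; (in hexadecimal) represents the same character.
  3. &#Xe5; (in hexadecimal) represents the same character as well.
  4. &#1048; (in decimal) represents the Cyrillic capital letter "I".
  5. &#x6C34; (in hexadecimal) represents the Chinese character for water.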

Note. Although the hexadecimal representation is not defined in [ISO8879], it is expected to be in the revision, as described in [WEBSGML]. This convention is particularly useful since character standards generally use hexadecimal representations.
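
The decimal and hexadecimal forms can be checked with a short Python sketch. It uses the standard library function `html.unescape`, which follows later HTML parsing rules rather than HTML 4.0 itself; for the well-formed references used here the two agree. The dictionary name `decoded` is illustrative.

```python
import html

references = ("&#229;", "&#xE5;", "&#Xe5;", "&#1048;", "&#x6C34;")

# Map each numeric character reference to the character it denotes.
decoded = {ref: html.unescape(ref) for ref in references}
```

The first three references resolve to the same character, showing that the decimal form, the hexadecimal form, and the letter case of the hexadecimal digits do not affect the result.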

5.3.2 Character entity references

In order to give authors a more intuitive way of referring to characters in the document character set, HTML offers a set of character entity references. Character entity references use symbolic names so that authors need not remember code positions. For example, the character entity reference &aring; refers to the lower case "a" character topped with a ring; "&aring;" is easier to remember than &#229;.

HTML 4.0 does not define a character entity reference for every character in the document character set. For instance, there is no character entity reference for the Cyrillic capital letter "I". Please consult the full list of character references defined in HTML 4.0.

Character entity references are case-sensitive. Thus, &Aring; refers to a different character (upper case A, ring) than &aring; (lower case a, ring).
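
The case sensitivity of entity names can be observed with Python's standard `html` and `html.entities` modules, used here as a convenient table of entity names; the variable names are illustrative.

```python
import html
from html.entities import name2codepoint

lower = html.unescape("&aring;")  # lower case a with ring
upper = html.unescape("&Aring;")  # upper case A with ring

# The two names map to different code positions.
lower_position = name2codepoint["aring"]
upper_position = name2codepoint["Aring"]
```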

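Four character entity references deserve special mention since they are frequently used to escape special characters:

  1. "&lt;" represents the < sign.
  2. "&gt;" represents the > sign.
  3. "&amp;" represents the & sign.
  4. "&quot;" represents the " mark.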

Authors wishing to put the "<" character in text should use "&lt;" (ASCII decimal 60) to avoid possible confusion with the beginning of a tag (start tag open delimiter). Similarly, authors should use "&gt;" (ASCII decimal 62) in text instead of ">" to avoid problems with older user agents that incorrectly perceive this as the end of a tag (tag close delimiter) when it appears in quoted attribute values.

Authors should use "&amp;" (ASCII decimal 38) instead of "&" to avoid confusion with the beginning of a character reference (entity reference open delimiter). Authors should also use "&amp;" in attribute values since character references are allowed within CDATA attribute values.

Some authors use the character entity reference "&quot;" to encode instances of the double quote mark (") since that character may be used to delimit attribute values.
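
An authoring tool could apply these escapes automatically. The sketch below uses Python's standard `html.escape`, which replaces "<", ">", "&" and, by default, quotation marks; the sample string is made up for this example.

```python
import html

source_text = 'x < y & y > "z"'

# Replace the special characters with character entity references.
escaped = html.escape(source_text)

# Unescaping recovers the original text.
restored = html.unescape(escaped)
```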

5.4 Undisplayable characters

A user agent may not be able to render all characters in a document meaningfully, for instance, because the user agent lacks a suitable font, a character has a value that may not be expressed in the user agent's internal character encoding, etc.

Because there are many different things that may be done in such cases, this document does not prescribe any specific behavior. Depending on the implementation, undisplayable characters may also be handled by the underlying display system and not the application itself. In the absence of more sophisticated behavior, for example tailored to the needs of a particular script or language, we recommend the following behavior for user agents:

  1. Adopt a clearly visible, but unobtrusive mechanism to alert the user of missing resources.
  2. If missing characters are presented using their numeric representation, use the hexadecimal (not decimal) form since this is the form used in character set standards.
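
The second recommendation can be sketched as a one-line helper. The name `placeholder_for` is hypothetical, and the choice of the "&#x...;" notation for display is an assumption of this example rather than a requirement.

```python
def placeholder_for(ch: str) -> str:
    """Return a hexadecimal numeric representation of a character
    that cannot be displayed."""
    return f"&#x{ord(ch):X};"


water = placeholder_for("\u6c34")
```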