Character Encoding

A character encoding maps a character set to units of a specific width and defines byte
serialization and ordering rules. Many character sets have more than one encoding. For
example, Java programs can represent Japanese character sets using the EUC-JP or Shift-JIS
encodings, among others. Each encoding has rules for representing and serializing a character
set.
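The point that one character set can have several encodings can be illustrated with a short sketch. It uses the Unicode encodings UTF-8 and UTF-16 rather than EUC-JP or Shift-JIS, only because those are guaranteed to be present on every Java platform; the class name EncodingComparison and the helper hex are hypothetical names chosen for this illustration.

```java
import java.nio.charset.StandardCharsets;

final class EncodingComparison {
    // Formats bytes as space-separated uppercase hex, for example "E3 81 82".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+3042 HIRAGANA LETTER A, written as an escape so the source file
        // encoding does not matter.
        String hiraganaA = "\u3042";

        // Same character, three different byte serializations.
        System.out.println(hex(hiraganaA.getBytes(StandardCharsets.UTF_8)));    // E3 81 82
        System.out.println(hex(hiraganaA.getBytes(StandardCharsets.UTF_16BE))); // 30 42
        System.out.println(hex(hiraganaA.getBytes(StandardCharsets.UTF_16LE))); // 42 30
    }
}
```

The two UTF-16 variants differ only in byte order, which is the kind of serialization rule an encoding has to define.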

The ISO 8859 series defines 15 character encodings that can represent texts in dozens of
languages. Each ISO 8859 character encoding can have up to 256 characters. ISO-8859-1
(Latin-1) comprises the ASCII character set, characters with diacritics (accents, diaereses,
cedillas, circumflexes, and so on), and additional symbols.
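As a rough illustration of the 256-character limit, the following sketch encodes two characters with ISO-8859-1. The class name Latin1Demo and the method toLatin1 are hypothetical. It relies on the documented behavior of String.getBytes(Charset), which substitutes the charset's replacement byte for characters it cannot map.

```java
import java.nio.charset.StandardCharsets;

final class Latin1Demo {
    // Encodes text with ISO-8859-1; unmappable characters become '?'.
    static byte[] toLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) is part of Latin-1: one byte, 0xE9.
        byte[] eAcute = toLatin1("\u00E9");
        System.out.println(eAcute.length + " byte, value 0x"
                + Integer.toHexString(eAcute[0] & 0xFF));

        // U+20AC (euro sign) is not in Latin-1, so it is replaced by '?' (0x3F).
        byte[] euro = toLatin1("\u20AC");
        System.out.println(euro.length + " byte, value 0x"
                + Integer.toHexString(euro[0] & 0xFF));
    }
}
```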

UTF-8 (Unicode Transformation Format, 8-bit form) is a variable-width character encoding
that encodes each Unicode code point as one to four bytes. A byte whose high-order bit is zero
represents a 7-bit ASCII character on its own; otherwise, the byte is part of a multibyte
sequence of two to four bytes.
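The variable width can be observed directly. The sketch below, with the hypothetical names Utf8Widths and utf8Length, counts the bytes UTF-8 needs for four sample code points chosen for this illustration.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Widths {
    // Returns the number of bytes UTF-8 uses for a single Unicode code point.
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length('A'));      // 1: ASCII, high-order bit is zero
        System.out.println(utf8Length(0x00E9));   // 2: e with acute accent
        System.out.println(utf8Length(0x20AC));   // 3: euro sign
        System.out.println(utf8Length(0x1F600));  // 4: a code point beyond U+FFFF

        // A code point above U+FFFF does not fit in one 16-bit Java char;
        // it is stored as a surrogate pair.
        System.out.println(Character.toChars(0x1F600).length); // 2
    }
}
```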

UTF-8 is compatible with the majority of existing web content and provides access to the
Unicode character set. Current versions of browsers and email clients support UTF-8. In
addition, many new web standards specify UTF-8 as their character encoding. For example,
UTF-8 is one of the two required encodings for XML documents (the other is UTF-16).

See Appendix, Figure 376, for more information on character encodings in the Java 2 platform.

Web components usually use PrintWriter to produce responses; by default, the PrintWriter
encodes its output using ISO-8859-1. Servlets can also output binary data using OutputStream
classes, which perform no encoding. An application whose character set cannot be represented in
the default encoding must explicitly set a different encoding.
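The effect of the writer's encoding can be reproduced outside a web container. The following is a stand-alone analogy, not the servlet API: it wraps an in-memory stream in a PrintWriter with an explicit charset. The names WriterEncodingDemo and write are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class WriterEncodingDemo {
    // Writes text through a PrintWriter that encodes with the given charset
    // and returns the bytes that reached the underlying stream.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (PrintWriter writer = new PrintWriter(new OutputStreamWriter(out, charset))) {
            writer.print(text);
        } // closing the writer flushes the encoded bytes
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String hiraganaA = "\u3042"; // not representable in ISO-8859-1

        // With ISO-8859-1 the character is lost: a single '?' byte is written.
        System.out.println(write(hiraganaA, StandardCharsets.ISO_8859_1).length); // 1

        // With UTF-8 the character survives as a three-byte sequence.
        System.out.println(write(hiraganaA, StandardCharsets.UTF_8).length); // 3
    }
}
```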

For web components, three encodings must be considered:

Request

Page (JSP pages)

Response

Request Encoding

The request encoding is the character encoding in which parameters in an incoming request are
interpreted. Currently, many browsers do not send a request encoding qualifier with the
Content-Type header. In such cases, a web container will use the default encoding, ISO-8859-1,
to parse request data.
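To see why the fallback matters, consider a parameter value that a client encoded as UTF-8 without saying so. The sketch below simulates the container's decoding step with plain strings; it does not use the servlet API, and the names RequestDecodingDemo and decode are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class RequestDecodingDemo {
    // Interprets raw request bytes using the given charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // The client sends "caf" followed by U+00E9, encoded as UTF-8
        // (bytes 63 61 66 C3 A9).
        byte[] sent = "caf\u00E9".getBytes(StandardCharsets.UTF_8);

        // Decoded with the ISO-8859-1 default, each of the two UTF-8 bytes
        // for U+00E9 becomes its own character, so the value has 5 characters.
        String misread = decode(sent, StandardCharsets.ISO_8859_1);
        System.out.println(misread.length()); // 5

        // Decoded with the encoding the client actually used, the value is intact.
        String correct = decode(sent, StandardCharsets.UTF_8);
        System.out.println(correct.equals("caf\u00E9")); // true
    }
}
```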

If the client hasn't set character encoding and the request data is encoded with a different
encoding from the default, the data won't be interpreted correctly. To remedy this situation,
you can use the ServletRequest.setCharacterEncoding(String enc) method to override the
character encoding supplied by the container. To control the request encoding from JSP pages,


Chapter 15 · Internationalizing and Localizing Web Applications
