There are many character encodings, the most common being ASCII,
Latin1, and Unicode. ASCII is a 7-bit code that has become the standard
for the first 128 characters of the others. It contains NO accented
letters. When stored in an 8-bit byte (octet), the leftmost bit is 0.
If you, or your form users, might use any characters beyond the ASCII
set, it is important to be clear about how they will be encoded.
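
As a rough illustration of the 7-bit limit, the following Python sketch tests whether a string stays inside the ASCII set. The helper name `is_ascii` is invented for this example and is not part of any form-handling library.

```python
def is_ascii(text: str) -> bool:
    """Return True if every character of text fits in 7-bit ASCII."""
    # Code points 0..127 are ASCII; stored in an 8-bit byte,
    # their leftmost bit is 0.
    return all(ord(ch) < 0x80 for ch in text)


for sample in ("resume", "résumé"):
    print(sample, is_ascii(sample))
```

A check like this can flag form input that needs an agreed encoding before it is stored or sent on.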
Latin1 (ISO-8859-1) is an 8-bit code. It uses the other 128 codes
(those with leftmost bit 1) for the additional characters used by
"western European" languages. Examples: é É à â ç ö ñ. Latin9
(ISO-8859-15) is a newer variant that includes the Euro sign.
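
A short Python sketch can confirm both points, assuming the standard `iso-8859-1` and `iso-8859-15` codecs that ship with Python. The variable names are illustrative only.

```python
# Each accented example letter is a single byte in Latin1,
# and that byte has its leftmost bit set (value 0x80 or above).
latin1_bytes = {ch: ch.encode("iso-8859-1") for ch in "éÉàâçöñ"}
for ch, raw in latin1_bytes.items():
    print(ch, raw.hex(), raw[0] >= 0x80)

# The Euro sign can be encoded in Latin9 but not in Latin1.
euro_latin9 = "€".encode("iso-8859-15")
print(euro_latin9)  # b'\xa4'

try:
    "€".encode("iso-8859-1")
    euro_in_latin1 = True
except UnicodeEncodeError:
    euro_in_latin1 = False
print(euro_in_latin1)  # False
```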
Unicode, by contrast, was originally designed as a 16-bit code with
65,536 possible combinations. It has since grown beyond 16 bits (code
points now run up to U+10FFFF) and covers the scripts of nearly all
written languages, plus many special symbols. The first 256 values are
the same as Latin1 (and hence the first 128 are the same as ASCII).
Since UTF-16 would double the size of most files that are mainly ASCII
text, it is not commonly used for web pages. UTF-8 stores all ASCII
characters in a single byte, and all other characters (including the
Latin1 extras) in multiple bytes, all beginning with 1 on the left. For
mostly ASCII text this generally results in a file only slightly larger
than Latin1, and it can accommodate all Unicode characters. (This does
not mean that any browser can render them all; that depends on the
fonts installed on the client computer.)
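
To make the size comparison concrete, here is a small Python sketch that encodes one sample string three ways. The sample text and the helper `high_bit_set` are made up for illustration. `utf-16-le` is used so that no byte order mark is added to the count.

```python
text = "Café crème"  # 10 characters, two of them outside ASCII

# Byte length of the same text under each encoding.
sizes = {
    name: len(text.encode(name))
    for name in ("iso-8859-1", "utf-8", "utf-16-le")
}
print(sizes)  # {'iso-8859-1': 10, 'utf-8': 12, 'utf-16-le': 20}


def high_bit_set(ch: str) -> bool:
    """Return True if every UTF-8 byte of ch has its leftmost bit set."""
    return all(byte & 0x80 for byte in ch.encode("utf-8"))


# Non-ASCII characters use only bytes with the leftmost bit set;
# plain ASCII characters do not.
print(high_bit_set("é"), high_bit_set("π"), high_bit_set("A"))  # True True False
```

Because the bytes of a multi-byte character never look like ASCII bytes, software that only understands ASCII can still find the ASCII parts of a UTF-8 file reliably.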
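| Character | Character reference | Notes |
| < | &lt; | |
| > | &gt; | |
| & | &amp; | |
| " | &quot; | |
| @ | &#64; or &#x40; | |
| É | &Eacute; | |
| à | &agrave; | |
| Û | &Ucirc; or &#219; | |
| π | &pi; or &#960; | Greek letter pi (or &#x3C0;) |
| “ | &ldquo; | left double quote |
| ” | &rdquo; | right double quote |
| — | &mdash; | "em" dash, width of letter m |
|   | &nbsp; | nonbreaking space (forces space, won't wrap) |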
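
For readers who generate or parse HTML in a script, the same conversions can be sketched with Python's standard `html` module. This is only one possible way to produce and decode character references, and the sample strings are invented.

```python
import html

# Turn the markup-significant characters into character references.
escaped = html.escape('<a title="R&D">')
print(escaped)  # &lt;a title=&quot;R&amp;D&quot;&gt;

# Turn named, decimal, and hexadecimal references back into characters.
decoded = html.unescape("&Eacute; &agrave; &pi; &#960; &#x3C0; &mdash;")
print(decoded)  # É à π π π —
```

Note that the named, decimal, and hexadecimal forms of a reference all decode to the same character.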
For an authoritative list of characters, see "The whole story" at w3.org. For more than you wanted to know, follow the link there to the full list of character references.
<meta content="text/html; charset=ISO-8859-1"
http-equiv="content-type">
... or UTF-8. Otherwise, the client has no information, and may send
form data using whichever encoding it likes. This can vary from system
to system.
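
The effect of a mismatch can be sketched in Python. The example assumes a hypothetical form value that the client sends as UTF-8 while the server expects Latin1, and then the reverse case.

```python
typed = "café"               # what the user typed into the form
sent = typed.encode("utf-8")  # bytes a UTF-8 client would submit

# Decoding with the wrong encoding silently garbles the accented letter.
wrong = sent.decode("iso-8859-1")
right = sent.decode("utf-8")
print(wrong)  # cafÃ©
print(right)  # café

# The reverse mismatch (Latin1 bytes read as UTF-8) fails outright.
try:
    typed.encode("iso-8859-1").decode("utf-8")
    reverse_failed = False
except UnicodeDecodeError:
    reverse_failed = True
print(reverse_failed)  # True
```

Declaring one encoding for the page, and decoding submissions with that same encoding, avoids both outcomes.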