Character Encoding

There are many character encodings, the most common being ASCII, Latin1, and Unicode. ASCII is a 7-bit code whose 128 characters have become the standard for the first 128 characters of the others. It contains NO accented letters. When an ASCII character is stored in an 8-bit byte (octet), the leftmost bit is 0.

If you, or your form users, might use any characters beyond the ASCII set, it is important to be clear about how they will be encoded.
Latin1 (ISO-8859-1) is an 8-bit code; it uses the other 128 codes (with leftmost bit 1) for the additional characters used by "western European" languages. Examples: é É à â ç ö ñ . Latin9 (ISO-8859-15) is a newer variant that includes the Euro sign.
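
The bit patterns described above can be checked with a short Python sketch. Python is only an illustration language here (the notes themselves do not use it), and the sample characters are arbitrary choices.

```python
# ASCII characters fit in 7 bits, so the leftmost bit of each byte is 0.
ascii_bits = [format(b, "08b") for b in "A~".encode("ascii")]
# 'A' is 65 and '~' is 126: ['01000001', '01111110']

# Latin1 stores one byte per character; the accented letters use the
# codes whose leftmost bit is 1.
latin1_bits = [format(b, "08b") for b in "éñ".encode("latin-1")]
# é is 0xE9 and ñ is 0xF1: ['11101001', '11110001']

# The Euro sign has a code in Latin9 (ISO-8859-15) but none in Latin1.
euro_latin9 = "€".encode("iso-8859-15")  # b'\xa4'
try:
    "€".encode("latin-1")
    euro_in_latin1 = True
except UnicodeEncodeError:
    euro_in_latin1 = False
```

Trying to encode é as ASCII fails in the same way that the Euro sign fails in Latin1: the character simply has no code in that set.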

Unicode, by contrast, was originally designed as a 16-bit code with 65,536 possible values, and has since been extended beyond 16 bits (code points now run up to hexadecimal 10FFFF). It covers essentially every written language in use and many special symbols. The first 256 values are the same as Latin1 (and hence the first 128 are the same as ASCII). Unicode text can be stored in several ways. UTF-16 uses at least two bytes per character; since this would double the size of most Western-language files, it is not commonly used on the web. UTF-8 stores all ASCII characters in a single byte, and all other characters (including the Latin1 extras) in multiple bytes, each with leftmost bit 1. This generally results in a file only slightly larger than Latin1, and can accommodate all Unicode characters. (This does not mean that any browser can render them all; that depends on the fonts installed on the client computer.)
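
To make the size comparison concrete, here is a small Python sketch that encodes the same word three ways. The sample word is an arbitrary choice, and "utf-16-le" is used so that no byte-order mark is counted.

```python
text = "café"  # four characters, one of them outside ASCII

latin1 = text.encode("latin-1")    # one byte per character
utf8 = text.encode("utf-8")        # ASCII stays one byte, é takes two
utf16 = text.encode("utf-16-le")   # two bytes per character for this text

sizes = {"latin-1": len(latin1), "utf-8": len(utf8), "utf-16": len(utf16)}
# {'latin-1': 4, 'utf-8': 5, 'utf-16': 8}

# Each byte of a multi-byte UTF-8 sequence has its leftmost bit set to 1.
e_acute_bits = [format(b, "08b") for b in "é".encode("utf-8")]
# ['11000011', '10101001']
```

For mostly-ASCII text the UTF-8 file is only a little larger than the Latin1 one, while the UTF-16 file is twice the size.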


From the beginning, HTML needed a way to "escape" some special characters ( < > & ), and the same mechanism safely encodes non-ASCII characters. Because an entity is written using only ASCII characters, it works regardless of the encoding of the page. Any Unicode character can be expressed as an entity by number (decimal, or hexadecimal with an x), and common ones have names. For instance, all Greek letters have their English names, e.g. alpha, Alpha.
Entity                         Character
&lt;                           <   less-than sign
&#64; or &#x40;                @   at sign
&#219; or &Ucirc;              Û   capital U with circumflex
&#960; or &pi; (or &#x3c0;)    π   Greek letter pi
&ldquo;                        left double quote
&rdquo;                        right double quote
&mdash;                        "em" dash, width of letter m
&nbsp;                         nonbreaking space (forces space, won't wrap)

For an authoritative list of characters, see "The whole story"; for more than you wanted to know, follow the link to the full list of character references.
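
As a rough illustration of the entity forms listed above, the sketch below uses Python's standard html module to escape the special characters and to decode numeric and named entities. This is just one convenient way to experiment; the sample strings are made up for the example.

```python
import html

# Escaping the characters that are special in HTML.
escaped = html.escape("a < b & c > d")
# 'a &lt; b &amp; c &gt; d'

# Decimal, hexadecimal, and named forms decode to the same character.
decoded = [html.unescape(e) for e in
           ["&#64;", "&#x40;", "&Ucirc;", "&#960;", "&pi;", "&#x3c0;"]]
# ['@', '@', 'Û', 'π', 'π', 'π']

# Going the other way: replace every non-ASCII character by a numeric
# entity, so the result is pure ASCII.
as_entities = "Ûπ".encode("ascii", "xmlcharrefreplace").decode("ascii")
# '&#219;&#960;'
```

The last step shows why entities are safe in any encoding: the output contains only ASCII characters.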

Specifying an encoding:

The default Apache configuration on many systems sends a header: Content-Type: text/html; charset=UTF-8
If this is in effect, you have no real choice: you cannot use Latin1 in your page, and form data will be encoded and received in UTF-8.
Otherwise (the current case for Osiris and cs-linux) you should specify a charset with a meta tag in the <head> section of your page:
  <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

... or UTF-8. Without either the header or the meta tag, the client has no information and may send form data in whichever encoding it likes. This can vary from system to system.
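
The sketch below illustrates, in Python, what goes wrong when the two sides disagree about the encoding, and how a declared charset can be read from a Content-Type value. The header value is the one quoted above; using the standard email.message class to parse it is only a convenience for this example, not something the server setup here depends on.

```python
from email.message import Message

# A client sends é encoded as UTF-8...
sent = "é".encode("utf-8")         # b'\xc3\xa9'
# ...but the receiver assumes Latin1, so two wrong characters appear.
misread = sent.decode("latin-1")   # 'Ã©'

# Reading the declared charset out of a Content-Type header value.
header = Message()
header["Content-Type"] = "text/html; charset=UTF-8"
charset = header.get_content_charset()   # 'utf-8' (returned in lower case)

# Decoding with the declared charset recovers the original text.
correct = sent.decode(charset)     # 'é'
```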

Encoding in your Database

You should use the same encoding for your database as for your pages. OOPS! I just set up your database for UNICODE (UTF-8). Should you try to store Latin1-encoded data, an ERROR will result. If you want to use Latin1, I should drop and recreate your database before you create any tables! PostgreSQL can convert back and forth if so instructed, but this would be awkward and unnecessary.
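
The following Python sketch is only an analogy for the database error described above; it does not talk to PostgreSQL. It shows that Latin1 bytes for an accented letter are not valid UTF-8, and what an explicit conversion looks like.

```python
# é stored as Latin1 is the single byte 0xE9.
latin1_bytes = "é".encode("latin-1")   # b'\xe9'

# Treating that byte as UTF-8 fails, much as a UTF-8 database would
# reject it.
try:
    latin1_bytes.decode("utf-8")
    accepted = True
except UnicodeDecodeError:
    accepted = False

# Converting means decoding with the encoding the data really uses,
# then re-encoding in the target encoding.
converted = latin1_bytes.decode("latin-1").encode("utf-8")   # b'\xc3\xa9'
```

Keeping pages, form data, and database all in one encoding avoids the need for this conversion step.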