How to develop a multi-lingual HTML page

This page's topic is the creation of a Web page with mixed national scripts, say, Cyrillic, Polish and Japanese, on the same page.

Different case first: Cyrillic-only or Cyrillic+English Web pages

When a Web page contains only Cyrillic letters, or Cyrillic and English letters,
there is no mix of Character Sets: a Cyrillic Character Set (and any Cyrillic font) contains the English letters (ASCII symbols, really), too - ASCII symbols are included in practically every Character Set (and every font) used on the Web.

The creation of such non-multilingual Cyrillic+English HTML files is covered on another page of my site - "How to develop Cyrillic HTML page".

Now let's start the main topic of this page -
multi-lingual HTML (mixed scripts on the same Web page).

The standards tell us that a given HTML text (from <html> to </html>, one .html file) can have only one encoding of its content:

- If the encoding is specified in the HTTP Header sent from the Web server along with the page, then there is only one value there, in the charset parameter of the Content-Type field.
  One can see HTTP Header fields/values in Netscape/Mozilla via View / Page Info
- If the encoding for the page is specified in the HTML text itself, then again, only one value can be present there, for example:
  <META http-equiv="content-type" content="text/html; charset=windows-1251">

Note. The same holds, by the way, for XML texts - only one encoding for the document, say
<?xml version="1.0" encoding="windows-1251"?>

Unlike an MS Word file, an .HTML file (as well as an .XML file) is a plain text file, exactly the same as a .TXT file.

Word files, in addition to the text itself, contain a lot of other information: each national symbol is tagged with the name of the Character Set and encoding that the symbol belongs to. Plain text files do not contain such information; they have only the text itself.
Therefore an HTML file, just like a .TXT file, can contain the symbols of only one Character Set.

The HTML standard requires a developer to specify one and only one Character Set Encoding (charset) for an HTML file.

So it's not possible to have in one HTML file (i.e. on one Web page) a mix of letters of, say, the Western European Character Set (such as accented French or German letters) and letters of the Cyrillic Character Set.

Here is a simple example: German a-umlaut has the code 228 in the encodings of the Western European Character Set (iso-8859-1, windows-1252). The same code in the Cyrillic Character Set (encoded in Windows-1251) is assigned to the small Russian 'd'.

Thus, if an .HTML (plain text) file contains a byte with a value of 228, then a user sees either German a-umlaut in all places where the code is 228
(if page is announced as "Western": charset=iso-8859-1 or charset=windows-1252)
or small Russian 'd' in all such places
(if page is announced as "Cyrillic(Windows)": charset=windows-1251)
but not both letters on the same page!
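
The same effect can be reproduced outside a browser. Below is a minimal Python sketch (Python is my choice for illustration here, it is not something the browsers use): it takes one byte with the value 228 and decodes it under the two charsets named above. The variable names are made up for this example.

```python
# One byte with the value 228 (hex E4), as it would sit in a plain text file.
raw = bytes([228])

# The same byte read under two different announced charsets.
as_western = raw.decode("cp1252")    # charset=windows-1252
as_cyrillic = raw.decode("cp1251")   # charset=windows-1251

print(f"U+{ord(as_western):04X}")    # U+00E4 - German a-umlaut
print(f"U+{ord(as_cyrillic):04X}")   # U+0434 - small Russian 'd'
```

The byte itself never changes - only the charset used to interpret it does.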

If a charset is not specified for a page, then whether a user sees German a-umlaut or small Russian 'd' depends on the character encoding selected in the browser's menu. For example, in Internet Explorer 5 it depends on the encoding currently selected in View/Encoding - "Western" or "Cyrillic(Windows)".

Then how could a developer create a Web page where s/he wants letters of several different alphabets?

There are two possible cases:

Case 1.

The page is not really a multilanguage one. That is, on, say, a Russian page (encoding Cyrillic Windows-1251) we need to have one or two German (or Greek for math or whatever) letters, and not large portions of non-Russian text.

Such a German or Greek letter cannot be typed as a real letter in the .HTML file, as explained above, but we can use one of these two special ways to present it on our Cyrillic page:

1. As an HTML entity; for example, German a-umlaut can be entered into .HTML as &auml;
Note that this method can be used only for Western letters and some special symbols (including Greek symbols used in math).
One cannot use it to represent, say, Russian or Polish or Latvian letters: the standard does not define entity names for them.
The list of all HTML entities, as well as more details about this standard, can be found here:
http://www.w3.org/TR/html401/sgml/entities.html
2. As a so-called 'big number' - &#nnnn; - which is the Unicode value of the symbol (in Decimal form).
For example, German a-umlaut can be entered into a Russian .HTML as &#228;
Similarly, Russian 'd' can be entered into a Western .HTML (encoding "Western") as &#1076;
The simplest way to find the needed 'big number':
- Go to the Microsoft page that lists all encodings - "Codepages" - and find the needed one, e.g. "Cyrillic, Windows codepage 1251".
  There, under each letter, you will see its Unicode value (in Hexadecimal form); for example, under Russian 'd' - 0434
- Now we need to convert Hex 0434 into Decimal form. Start the Calculator included in MS Windows - either via Start/Programs/Accessories or just do Start/Run - type calc
  In the Calculator choose View/Scientific. Click on Hex and type 0434. Click on Dec to see the needed Decimal value - 1076.
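
If a scripting language is at hand, the Hex-to-Decimal step can be done without the Calculator. Here is a small Python sketch; the helper name big_number is made up for this example. The last lines rely on Python's standard "xmlcharrefreplace" error handler, which swaps every letter missing from the target charset for its decimal 'big number'.

```python
def big_number(ch: str) -> str:
    """Return the decimal 'big number' (numeric character reference) for one character."""
    return f"&#{ord(ch)};"

print(int("0434", 16))        # 1076 - the Calculator step from above
print(big_number("\u0434"))   # &#1076; - small Russian 'd'
print(big_number("\u00e4"))   # &#228;  - German a-umlaut

# Text for a Windows-1251 page: Russian 'd', a space, German a-umlaut.
# The Russian letter fits into Windows-1251 as one byte; the a-umlaut does not,
# so it is replaced by its 'big number'.
page_bytes = "\u0434 \u00e4".encode("cp1251", errors="xmlcharrefreplace")
```

Here page_bytes holds the byte E4 for the Russian letter, a space, and then the ASCII text &#228; for the German one.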

Note. Both methods described above do not work in Netscape 4, so if you need to support Netscape 4, you should use the methods of Case 2 right below.

Case 2.

The page is really multilingual - has large pieces of different languages and not just 2-3 letters
(and/or you need to support Netscape 4)

Obviously, no one would input large German texts into a Russian .HTML by representing each and every German letter as &auml; or as &#nnnn;.
Moreover, modifying such an .HTML (e.g. fixing a spelling mistake) would be very hard, because there would be no readable text there to look at.

In such a case people do the following.

The rule stays the same - one Character Set Encoding per page - but the developer should use a new, very large Character Set, UNICODE, that contains letters of almost all the world's alphabets - all European, most Asian, etc.

The Unicode Character Set includes characters from many alphabets, so for the above example it contains both letters - Russian small 'd' and German a-umlaut - with different codes assigned to them!
(In each encoding, e.g. "Western" or "Cyrillic", a unique code is assigned to each symbol, and Unicode is no exception to this obvious rule - each symbol of Unicode has its own unique code)

That is, there are the following Character Sets, and you need to choose one for the creation of your Web page:

- "Western" - contains letters of Western European languages.
  Encodings used: iso-8859-1, windows-1252.
- "Cyrillic" - contains Cyrillic letters.
  Encodings used: Windows-1251, KOI8-R.
- "Central European" - contains Polish, Czech, Hungarian and other letters of Central European languages.
  Encodings used: iso-8859-2, windows-1250.
- ...
- "Unicode" - a new, very large Character Set that contains letters of almost all languages in the world.
  Encodings used: UTF-8, UCS-2 and some others, but only UTF-8 is used on the WWW.

It's clear from the above that Unicode Character Set is the only candidate for a multi-lingual Web page.
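
A quick way to convince yourself is to try saving a two-letter text (German a-umlaut plus small Russian 'd') in each of the encodings listed above. The Python sketch below does just that; the names sample, charsets and can_encode are made up for this example.

```python
# German a-umlaut followed by small Russian 'd'.
sample = "\u00e4\u0434"

# The encodings named in the list above.
charsets = ["iso-8859-1", "windows-1252", "windows-1251", "koi8-r",
            "iso-8859-2", "windows-1250", "utf-8"]

def can_encode(text: str, charset: str) -> bool:
    """True if every character of text exists in the given charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

usable = [c for c in charsets if can_encode(sample, c)]
print(usable)   # ['utf-8']
```

Every 8-bit encoding fails on one of the two letters; only UTF-8 can hold both.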

As in any other Encoding of a Character Set, in a Unicode encoding each letter has a unique code, so in Unicode(UTF-8) German a-umlaut and small Russian 'd' have different codes - 0xC3A4 and 0xD0B4 respectively
(letters that take one byte in the 8-bit legacy encodings, such as Russian or accented Western European letters, are coded as 2-byte items in UTF-8)
and thus both of them can be present 'as is' in the UTF-8 HTML text.
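
These two codes are easy to check. A minimal Python sketch (the helper name utf8_hex is made up for this example):

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of one character as an uppercase hex string."""
    return ch.encode("utf-8").hex().upper()

print(utf8_hex("\u00e4"))   # C3A4 - German a-umlaut, 2 bytes
print(utf8_hex("\u0434"))   # D0B4 - small Russian 'd', 2 bytes
print(utf8_hex("a"))        # 61   - plain ASCII letters stay 1 byte
```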

This is the reason why a Unicode plain text file (.HTML, .TXT, .XML) can have letters of different alphabets.
Of course, the font used has to be a Unicode font, too, but that's not a problem - most modern versions of Windows do have Unicode fonts such as "Arial", "Times New Roman", etc.
Therefore, using UTF-8 you will be able to mix, say, German, Russian, and Greek on the same Web page!

Developers of multi-lingual Web pages use UTF-8 encoding of Unicode.
That is, a text inside such HTML file is a Unicode(UTF-8) text.

Note. If you also need some Far East characters, that is, you'd like to have, say, German, Russian, and Japanese, then:

you need to find a Unicode font that, in addition to European characters (English, German, Russian, Greek, etc.) has Far East characters.
Usually, because of the size, Unicode fonts cover only a subset of a full Unicode table, so "Arial" and "Times New Roman" don't have Japanese or Chinese.
That is, you may need to download and install a Unicode font that covers more languages than regular "Arial" or "Times New Roman" used in your browser by default for the encoding (script)="UTF-8".
For instance, here is one very large Unicode font that contains Japanese and Chinese in addition to European alphabets
(it's already present on your PC if you have either of the following: MS Office ver. 2000 or higher, FrontPage 2000, MS Publisher ver. 2000 or higher):

"Arial Unicode MS"
If you do not have this font, then find a similar one (i.e. one containing Far Eastern symbols) on this page: "Unicode fonts for Windows computers"; for example, download and install the freeware font
Bitstream CyberBit

then you need to choose this, Far East capable font in your browser:

in Netscape ver. 4 and higher - choose this font for Encoding: "Unicode" in Edit/Preferences/Appearance/Fonts
In Netscape ver. 6 and higher (and Mozilla) it's not really necessary:
there is a new, enhanced functionality called "Font Linking":
For example, "Times New Roman" is specified for Unicode, but it does not contain Japanese letters that are present on some UTF-8 page.
It's fine - "Font Linking" means that browser will search all Windows fonts until it finds one that does contain Japanese!

MS Internet Explorer 5+ has a problem there:
user can not select any font for Unicode in
(Tools/Internet Options/Fonts - for Scripts)
because "Unicode" is not in the list!
In MS IE 4 it was there...
But it's not very critical - Internet Explorer also has "Font Linking" capabilities, so if a UTF-8 page contains some Japanese, then Internet Explorer will find a font that contains Japanese and use its Japanese glyphs.

Here are my own test pages for UTF-8 that include a form with ...method=GET to see - in Address bar - what URL-encoded Hex values the browser sends out after a user hits "Submit":

UTF-8: Russian, German, Polish

UTF-8: Russian, German, Polish, and Japanese
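
For comparison, here is roughly what such URL-encoded values look like when produced by Python's standard urllib instead of a browser. This is only a sketch: the field name "text" is made up, and a real browser derives the encoding from the page, while here it is passed in explicitly.

```python
from urllib.parse import urlencode

# German a-umlaut plus small Russian 'd', sent as if from a UTF-8 page.
utf8_query = urlencode({"text": "\u00e4\u0434"}, encoding="utf-8")
print(utf8_query)      # text=%C3%A4%D0%B4

# The same Russian letter, sent as if from a Windows-1251 page.
cp1251_query = urlencode({"text": "\u0434"}, encoding="cp1251")
print(cp1251_query)    # text=%E4
```

The Hex values after each % are exactly the bytes of the letter in the encoding of the page.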

A Web developer who needs to create such a multi-lingual page has to find an HTML editor capable of handling Unicode(UTF-8) text.
In such an editor a developer works as in MS Word 97/2000 - different alphabets are accessible via the keyboard mode switch:
switch the keyboard to Russian to input Russian letters, switch to German to input German text, etc.

Or, if you need to insert just a handful of non-Russian letters into your UTF-8 HTML text, you can copy them from a utility Charmap included into MS Windows:

- Start/Run and type charmap
  (or find this utility in Start/Programs/Accessories)
- choose some regular font such as "Arial"
- click on the needed non-Russian letter(s) to place them into the selection field
- click on "Copy" - the letter(s) go to the Windows Clipboard and you will be able to Paste them into your editor's window

Again, .HTML is a plain text file, so to confirm that it really contains UTF-8 letters and not letters in some other encoding, a developer may look at the HTML file in a Hex Viewer and check for the correct UTF-8 codes of the letters, such as C3A4 for German a-umlaut and D0B4 for Russian small 'd'.
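
If no Hex Viewer is installed, a few lines of Python can play the same role. This is just a sketch: the function names hex_dump and looks_like_utf8 are made up, and the second one is only a rough check - a pure-ASCII file passes it no matter which encoding it was meant to be in.

```python
def hex_dump(data: bytes) -> str:
    """Show each byte as two uppercase hex digits, like a Hex Viewer does."""
    return " ".join(f"{b:02X}" for b in data)

def looks_like_utf8(data: bytes) -> bool:
    """True if the bytes form valid UTF-8 (rough check only)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

utf8_text = "\u00e4\u0434".encode("utf-8")    # a-umlaut + Russian 'd' saved as UTF-8
cp1251_text = "\u0434".encode("cp1251")       # Russian 'd' saved as Windows-1251

print(hex_dump(utf8_text))            # C3 A4 D0 B4
print(looks_like_utf8(utf8_text))     # True
print(looks_like_utf8(cp1251_text))   # False
```

For a real file, read its bytes first, e.g. open("page.html", "rb").read(), where page.html stands for your own file name.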

Important! As with any other Character Set Encoding, it's a good idea for a Unicode Web page to announce its Character Set Encoding, to let a browser know what to expect. The corresponding line for Unicode should be
charset=utf-8

This is done either via Web Server tune-up (the preferred way), when the Server sends, along with the page itself, an HTTP Header where one of the fields contains this information (charset=utf-8), or
by including this information in the HTML file itself, using the HTML tag
<META http-equiv="content-type" content="text/html; charset=utf-8">

as on my test UTF-8 pages (I have no way to force Web server of my ISP - Compuserve - to build needed HTTP Header for me).
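
To show what the 'Web Server tune-up' amounts to, here is a toy server written with Python's standard http.server module. It is only a sketch under my own assumptions, not the setup of any real server: the names PAGE, build_headers, Utf8Handler and serve, as well as the port 8000, are made up for this example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical page text: German a-umlaut and small Russian 'd' together.
PAGE = "<html><body>\u00e4 \u0434</body></html>"

def build_headers(body: bytes) -> list:
    """HTTP Header fields that announce the encoding of the body."""
    return [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]

class Utf8Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = PAGE.encode("utf-8")   # the bytes that are actually sent
        self.send_response(200)
        for name, value in build_headers(body):
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(body)

def serve(port: int = 8000) -> None:
    """Call serve() by hand to try it; it runs until interrupted."""
    HTTPServer(("127.0.0.1", port), Utf8Handler).serve_forever()
```

With the header in place the page needs no META line at all, because the browser learns the encoding before it reads the first byte of the HTML text.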

If none of the above is done, it's still Okay - user can manually choose Unicode(UTF-8) in the browser's menu to view a UTF-8 page.

Here are the UTF-8 enabled editors that I tried myself

Free plain text Unicode editor UniPad.
This is just a plain text editor, similar to Notepad, but it lets you work with several different forms of Unicode and UTF-8 is one of them.
This editor is for those developers who write HTML code themselves, typing all HTML tags, etc.
The file can be stored in Unicode(UTF-8) via Save As menu option.
After you save your file as UTF-8, you need to do one additional thing:
- go to File / File Properties
- uncheck the box called "Byte Order Mark" - HTML text should not have this mark
Because it's not a specialized WYSIWYG HTML editor, it, unlike Composer, does not insert any encoding information lines, so if you need to have this
<META http-equiv="content-type" content="text/html; charset=utf-8">
line, then insert it yourself (i.e. if you cannot specify the encoding via HTTP Header sent by your Web server).
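
If you are not sure whether a saved file still carries the "Byte Order Mark", it can be checked (and removed) with a few lines of Python. A sketch only; the function name strip_utf8_bom is made up for this example, while codecs.BOM_UTF8 is the standard constant for the three bytes EF BB BF.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return the bytes without a leading UTF-8 Byte Order Mark, if there is one."""
    if data.startswith(codecs.BOM_UTF8):      # the bytes EF BB BF
        return data[len(codecs.BOM_UTF8):]
    return data

with_bom = codecs.BOM_UTF8 + b"<html></html>"
print(strip_utf8_bom(with_bom))            # b'<html></html>'
print(strip_utf8_bom(b"<html></html>"))    # b'<html></html>' - unchanged
```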
Microsoft Front Page 2000

Note. If your future UTF-8 text will contain Far East letters such as Japanese, then you need to specify that a larger Unicode font such as "Arial Unicode MS" should be used for UTF-8:
Tools/Page Options/Default Font, choose "Multilingual (UTF-8)" in the list and then choose such large font in both fields that you see right below.

In Front Page 2000 you need to open a new document and immediately specify that you are going to create a UTF-8 page:
- File/Properties/Language
- Find the "HTML encoding" section, where you need to choose "Multilingual (UTF-8)" in both fields
Now you can input your text.
Front Page 2000 will insert the following line at the top of the file:
<META http-equiv="content-type" content="text/html; charset=utf-8">
Netscape Composer
Netscape ver. 4 and above has a built-in WYSIWYG HTML editor - Composer:
- Netscape ver. 6 and higher, as well as Mozilla:
  - Netscape 6 - File/New/"Blank Page to Edit" or Task/Composer
  - Netscape 7 and Mozilla - File/New/"Composer Page" or Windows/Composer
- Netscape 4.x - Communicator/Composer or File/New/Blank Page
This HTML editor is a very handy tool for developing a UTF-8 Web page.
An important thing (applicable to any page in any encoding) is that in Composer you should not choose any specific font name in the fonts window. You should leave there what Composer has initially:
Variable Width font
Here is how to create a UTF-8 HTML text in Composer:
- If you are developing a UTF-8 page 'from scratch' then just open a blank Composer page and select there
  View/Character Set/Unicode(UTF-8)
  (View/Character Encoding/Unicode(UTF-8) in ver. 6+)
  Now you can start typing in any language supported by a Unicode font in use. Standard fonts such as "Arial", "Times New Roman" will let you type in any mix of European alphabets, so you can create, say, a page where UTF-8 text will include Russian, German, Czech, and Greek!
  Again, you type using the same approach as you use in MS Word 97/2000:
  switch keyboard to "RU" and type some Russian, then switch to "EN" and type some English, then switch to "DE" and type some German, etc.
  The resulting .HTML file will have Unicode(UTF-8) characters inside. For example, Russian as well as accented Western European letters are coded as 2-byte items, as was mentioned above.
  When you click on File/SaveAs, Composer will create a .HTML file where the following line will be present at the top of the text:
  <META http-equiv="content-type" content="text/html; charset=utf-8">
- If it's not a work 'from scratch' and you have some existing Russian HTML file where you'd like to add, say, some German or Greek letters, then it's a slightly different approach. It means that you want to change the encoding of the HTML text from, say, Cyrillic(Windows-1251) to Unicode(UTF-8).
  Here is how to do it:
  - Close Composer if you had it open.
  - In main Netscape window select the encoding of your existing page. For example, if it was Russian page in Windows-1251, then you need to do:
    - Netscape 6+, Mozilla - View/ Character Encoding / Cyrillic(Windows-1251)
    - Netscape 4.5+ - View/ CharacterSet / Cyrillic(Windows-1251)
    - Netscape 4.0x - View / Encoding / Cyrillic(Windows-1251)
  - Call Composer. There, do File/Open and load this existing Russian HTML file
  - Now you can change the encoding - in Composer:
    - Netscape 4.5+ - View / CharacterSet / Unicode(UTF-8)
    - Netscape 4.0x - View / Encoding / Unicode(UTF-8)
    - Netscape 6+, Mozilla - File / Save As Charset - choose Unicode(UTF-8) and write this new file to the hard drive
    Because UTF-8 does contain Cyrillic letters, the conversion will work Okay, so just ignore the warning message that Composer presents to you.
  - Now you have UTF-8 text, so you can type some German, Greek, etc.
    Then save the file - File/SaveAs for Netscape 4 or File/Save for Netscape 6 where you've already saved once.
  - Close Composer.
  - Composer, as mentioned above, inserts the following line at the top:
    <META http-equiv="content-type" content="text/html; charset=utf-8">
    but Netscape 4 does not delete any lines, so 'old' line specifying the original encoding of this HTML file will still be there (if any):
    <META http-equiv="content-type" content="text/html; charset=windows-1251">
    Therefore, if you use Netscape 4, you need to remove this old encoding line if it exists.
    To do so, use the following menu option of Composer:
    Tools/HTML Tools/Edit HTML Source
    
    and delete the line with charset=windows-1251 located very close to the top of the text, right after a line with charset=utf-8.
    Then do File/Save again.
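
For those who prefer a script to an editor, the same conversion (Windows-1251 text to UTF-8, with the old charset line replaced rather than left behind) can be sketched in Python. This is not what Composer does internally, just an illustration; the function name convert_to_utf8 is made up, and the sketch assumes the old charset is named in the file exactly as charset=windows-1251.

```python
import re

def convert_to_utf8(data: bytes, source_charset: str = "windows-1251") -> bytes:
    """Re-encode HTML bytes as UTF-8 and update the charset named in the text."""
    text = data.decode(source_charset)
    # Replace "charset=<old>" with "charset=utf-8", so no stale line is left behind.
    text = re.sub(r"(charset\s*=\s*)" + re.escape(source_charset),
                  r"\g<1>utf-8", text, flags=re.IGNORECASE)
    return text.encode("utf-8")

old_page = ('<META http-equiv="content-type" '
            'content="text/html; charset=windows-1251">\u0434').encode("cp1251")
new_page = convert_to_utf8(old_page)
print(new_page[-2:].hex().upper())   # D0B4 - Russian 'd' is now a 2-byte UTF-8 item
```

After that the file can be opened in any UTF-8 capable editor to add the German or Greek letters.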
MS Word 2000 (Word XP probably works the same way)
It's not recommended to use Word for the creation of HTML file, because Word creates for you an HTML code that contains a lot of unnecessary HTML tags, file is large, etc.
But anyway, here is how to do it in Word 2000.
There are two different scenarios - either you create a brand new HTML text or you convert an existing .doc to .html.
1. Creating brand new HTML text
- File / New / Web Page
- Let Word know at once that you are creating a UTF-8 HTML file -
  go to Tools / Options and:
  - in the General tab window click on "Web Options" button
  - in the "Web Options" window, go to "Encoding" tab
  - choose "Unicode(UTF-8)" in the "Save this document as" list
- Now you can type your multilingual text
- File / Save As. Don't use non-English (Russian or French, etc.) letters in the file name.
  Click on "Title" button in the File/SaveAs dialog to change the Title if necessary - it is not a good idea to use non-English letters in the Web page Title.
The newly created HTML file will contain normal UTF-8 letters inside and also Word inserts the following line at the top of the HTML code (you can see it using View / HTML Source):
<META http-equiv="content-type" content="text/html; charset=utf-8">
2. Converting existing .doc to HTML
- Open your multilingual document (.doc) in Word 2000
- Let Word know at once that you are creating a UTF-8 HTML file -
  go to Tools / Options and:
  - in the General tab window click on "Web Options" button
  - in the "Web Options" window, go to "Encoding" tab
  - choose "Unicode(UTF-8)" in the "Save this document as" list
- File / Save As Web Page. Don't use non-English (Russian or French, etc.) letters in the file name.
  Click on "Title" button in the File/SaveAsWebPage dialog to change the Title if necessary - it is not a good idea to use non-English letters in the Web page Title.
The newly created HTML file will contain normal UTF-8 letters inside and also Word inserts the following line at the top of the HTML code (you can see it using View / HTML Source):
<META http-equiv="content-type" content="text/html; charset=utf-8">

If you need more technical details about the development of multi-lingual Web pages, UTF-8, etc. then here are two very well-known sites devoted to this issue:

J.Korpela. Techniques for multilingual Web sites

A.Flavell. HTML Internationalization