
How to develop a multi-lingual HTML page

Paul Gorodyansky 'Cyrillic (Russian): instructions for Windows and Internet'

 
This page explains how to create a Web page with mixed national scripts, say, Cyrillic, Polish and Japanese on the same page.


A different case first: Cyrillic-only or Cyrillic+English Web pages

When a Web page contains only Cyrillic letters, or Cyrillic and English letters,
there is no mix of Character Sets, because a Cyrillic Character Set (and any Cyrillic font) contains the English letters (really, the ASCII symbols) too: the ASCII symbols are included in practically every Character Set (and font) used on the Web.
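
The point about ASCII can be checked with a small Python sketch. The list of encodings and the helper name ascii_bytes_agree are choices made for this illustration only, not something this page prescribes:

```python
# Encodings picked for this illustration (Python codec names).
ENCODINGS = ["ascii", "windows-1251", "koi8-r", "iso-8859-1", "windows-1252", "utf-8"]

def ascii_bytes_agree(text):
    """Return True if an ASCII-only `text` encodes to identical bytes in every encoding above."""
    encoded = {text.encode(enc) for enc in ENCODINGS}
    return len(encoded) == 1

# English letters and HTML markup are plain ASCII, so the bytes never differ.
print(ascii_bytes_agree("Hello, <html>!"))
```

This is why English text and HTML tags look the same no matter which of these charsets a page announces.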

The creation of such non-multilingual Cyrillic+English HTML files is covered on another page of my site - "How to develop Cyrillic HTML page".


Now let's start the main topic of this page -
multi-lingual HTML (mixed scripts on the same Web page).

The standards tell us that a given HTML text (from <html> to </html>, one .html file) can have only one encoding of its content.

Note. By the way, the same is true for XML texts: only one encoding per document, for example
  <?xml version="1.0" encoding="windows-1251"?>


Unlike an MS Word file, an .HTML file (as well as an .XML file) is a plain text file, exactly like a .TXT file.

Therefore an HTML file, like a .TXT file, can contain the symbols of only one Character Set:
a Word file contains, in addition to the text itself, a lot of other information, and each national symbol is tagged with the name of the Character Set and encoding that it belongs to, while a plain text file contains no such information, only the text itself.

The HTML standard allows only one Character Set Encoding (charset) to be specified for an HTML file.

With the traditional 8-bit Character Sets it is therefore not possible to have in one HTML file (i.e. on one Web page) a mix of letters of, say, a Western European Character Set (such as accented French or German letters) and letters of a Cyrillic Character Set.

Here is a simple example: German a-umlaut has the code 228 in the Western European Character Set encodings (ISO-8859-1, Windows-1252). The same code in the Cyrillic Character Set (encoded as Windows-1251) is assigned to the small Russian 'd'.

Thus, if an .HTML (plain text) file contains a byte with the value 228, then a user sees either German a-umlaut in all places where the code is 228
  (if the page is announced as "Western": charset=iso-8859-1 or charset=windows-1252)
or the small Russian 'd' in all such places
  (if the page is announced as "Cyrillic(Windows)": charset=windows-1251)
but not both letters on the same page!
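
The byte-228 example can be reproduced with a short Python sketch. The helper name render is invented for this illustration; it only decodes bytes the way a browser would for the announced charset:

```python
import unicodedata

RAW = bytes([228])  # the single byte discussed above

def render(raw, charset):
    """Decode raw bytes as a browser would for the announced charset."""
    return raw.decode(charset)

for charset in ("iso-8859-1", "windows-1252", "windows-1251"):
    letter = render(RAW, charset)
    # Print the official Unicode name so the output stays readable on any console.
    print(charset, "->", unicodedata.name(letter))
```

The same byte comes out as LATIN SMALL LETTER A WITH DIAERESIS under the two "Western" charsets and as CYRILLIC SMALL LETTER DE under windows-1251.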

If no charset is specified for a page, then whether the user sees German a-umlaut or the small Russian 'd' depends on the character encoding selected in the browser's menu. For example, in Internet Explorer 5 it depends on the encoding currently selected in View/Encoding - "Western" or "Cyrillic(Windows)".

Then how can a developer create a Web page where s/he wants letters of several different alphabets?

There are two possible cases:

Case 1.

The page is not really a multi-language one. That is, on a Russian page (encoding Cyrillic Windows-1251) we need just one or two German letters (or Greek ones for math, or whatever) and not large portions of non-Russian text.

Such a German or Greek letter cannot be typed as a real letter in the .HTML file, as was explained above, but we can use one of two special ways to present such a letter on our Cyrillic page: a named entity (for example, &auml; for German a-umlaut) or a numeric reference of the form &#nnnn;, where nnnn is the Unicode code of the letter.



Note. Neither of the methods described above works in Netscape 4, so if you need to support Netscape 4, you should use the methods of Case 2 right below.
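
As a rough illustration of Case 1, Python's standard library can produce both kinds of reference. The function names here are invented for the sketch, and it assumes the page itself is saved as windows-1251:

```python
import html
from html.entities import codepoint2name

def to_windows_1251_html(text):
    """Encode text for a windows-1251 page; letters outside that
    charset are replaced by numeric references of the form &#nnnn;."""
    return text.encode("windows-1251", errors="xmlcharrefreplace")

def named_entity(ch):
    """Return the named entity for a character, or None if HTML 4 defines none."""
    name = codepoint2name.get(ord(ch))
    return f"&{name};" if name else None

mixed = "\u0434\u00e4"  # small Russian 'd' followed by German a-umlaut
print(to_windows_1251_html(mixed))   # b'\xe4&#228;'
print(named_entity("\u00e4"))        # &auml;
print(html.unescape("&#228;") == html.unescape("&auml;"))  # True
```

Note how the Russian letter is stored as the raw byte 228 (0xE4), while the German letter, whose Unicode code is also 228, has to be written out as the reference &#228;.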



Case 2.

The page is really multilingual: it has large pieces of text in different languages and not just 2-3 letters
(and/or you need to support Netscape 4).

Obviously, no one would input large German texts into a Russian .HTML file by representing each and every German letter as a named entity such as &auml; or as &#nnnn;.
Moreover, modifying such an .HTML file (e.g. fixing a spelling error) would be nearly impossible, because there would be no readable text to look at.

In such a case people do the following.

The rule stays the same - one Character Set Encoding per page - but the developer should use a newer, very large Character Set, Unicode, which contains the letters of almost all of the world's alphabets: all European, most Asian, etc.

The Unicode Character Set includes characters from many alphabets, so for the above example it contains both letters - the small Russian 'd' and German a-umlaut - with different codes assigned to them!
(In each encoding, e.g. "Western" or "Cyrillic", a unique code is assigned to each symbol, and Unicode is no exception to this obvious rule: each symbol of Unicode has its own unique code.)

That is, there are the following Character Sets and you need to choose one for the creation of your Web page:

It's clear from the above that the Unicode Character Set is the only candidate for a multi-lingual Web page.

As in any other Encoding of a Character Set, in a Unicode encoding each letter has a unique code, so in Unicode(UTF-8) German a-umlaut and the small Russian 'd' have different codes - 0xC3A4 and 0xD0B4 respectively, two bytes each
(non-ASCII letters of the 8-bit legacy encodings, such as Russian or accented Western European letters, are coded as 2-byte items in UTF-8)
and thus both of them can be present 'as is' in the UTF-8 HTML text.
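
The two UTF-8 codes quoted above can be verified with a few lines of Python. The helper name describe is made up for this sketch:

```python
import unicodedata

def describe(ch):
    """Return (Unicode code point, UTF-8 bytes in hex, official name) for one character."""
    return (f"U+{ord(ch):04X}", ch.encode("utf-8").hex().upper(), unicodedata.name(ch))

# "\u00e4" is German a-umlaut, "\u0434" is the small Russian 'd'.
for ch in ("\u00e4", "\u0434"):
    print(describe(ch))
```

Each letter has its own Unicode code point (U+00E4 and U+0434), and UTF-8 stores each of them as a distinct two-byte sequence, so both can sit side by side in one file.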

This is the reason why a Unicode plain text file (.HTML, .TXT, .XML) can contain letters of different alphabets.
Of course, the font used has to be a Unicode font too, but this is not a problem - most modern versions of Windows have Unicode fonts such as "Arial", "Times New Roman", etc.
Therefore, using UTF-8 you will be able to mix, say, German, Russian, and Greek on the same Web page!

Developers of multi-lingual Web pages use the UTF-8 encoding of Unicode.
That is, the text inside such an HTML file is Unicode(UTF-8) text.
 

Note. If you also need some Far East characters, that is, you'd like to have, say, German, Russian, and Japanese, then:


Here are my own test pages for UTF-8 that include a form with ...method=GET to see - in Address bar - what URL-encoded Hex values the browser sends out after a user hits "Submit":
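
What such a GET submission puts into the Address bar can be approximated offline with Python's urllib. This is only a sketch: the field name "text" and the helper query_string are hypothetical, and it assumes the browser submits the form in the encoding of the page:

```python
from urllib.parse import urlencode, unquote

def query_string(field, value, charset="utf-8"):
    """Build the URL-encoded query string for one form field in the given charset."""
    return urlencode({field: value}, encoding=charset)

both = "\u00e4\u0434"  # German a-umlaut + small Russian 'd'
print(query_string("text", both))                     # text=%C3%A4%D0%B4
print(query_string("text", "\u0434", "windows-1251")) # text=%E4
print(unquote("%C3%A4%D0%B4") == both)                # True
```

On a UTF-8 page each of the two letters travels as two percent-encoded bytes, while on a windows-1251 page the Russian letter is sent as the single byte E4.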


A Web developer who needs to create such a multi-lingual page has to find an HTML editor capable of handling Unicode(UTF-8) text.
In such an editor a developer works as in MS Word 97/2000 - different alphabets are accessible via the keyboard mode switch:
switch the keyboard to Russian to input Russian letters, switch to German to input German text, etc.

Or, if you need to insert just a handful of non-Russian letters into your UTF-8 HTML text, you can copy them from the Charmap utility included in MS Windows:



Again, an .HTML file is a plain text file, so to confirm that it contains UTF-8 letters and not letters in some other encoding, a developer may look at the file with a Hex Viewer and see the correct UTF-8 codes for the letters, such as C3A4 for German a-umlaut and D0B4 for the small Russian 'd'.
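
If no Hex Viewer is at hand, the same check can be sketched in Python. The helper names hex_view and is_valid_utf8 are invented for this illustration:

```python
def hex_view(data):
    """Show bytes the way a hex viewer would: two hex digits per byte."""
    return " ".join(f"{b:02X}" for b in data)

def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

utf8_text = "\u00e4\u0434".encode("utf-8")  # a-umlaut + Russian 'd' saved as UTF-8
legacy_text = bytes([228])                  # the same kind of letter saved in an 8-bit charset

print(hex_view(utf8_text))         # C3 A4 D0 B4
print(is_valid_utf8(utf8_text))    # True
print(is_valid_utf8(legacy_text))  # False
```

For a real file, the bytes could be read with pathlib.Path("page.html").read_bytes(), where page.html stands for your own file name. Keep in mind that a pure-ASCII file also passes the UTF-8 check, so looking at the hex codes of the national letters remains the decisive test.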



Important! As with any other Character Set Encoding, a Unicode Web page should announce its Character Set Encoding to let a browser know what to expect. The corresponding declaration for Unicode is
  charset=utf-8

This is done either via Web Server configuration (the preferred way), where the Server sends, along with the page itself, an HTTP header in which one of the fields contains this information (charset=utf-8), or
by including this information in the HTML file itself using the HTML tag
  <META http-equiv="content-type" content="text/html; charset=utf-8">

as on my test UTF-8 pages (I have no way to force the Web server of my ISP - Compuserve - to build the needed HTTP header for me).

If neither of the above is done, it's still okay - the user can manually choose Unicode(UTF-8) in the browser's menu to view a UTF-8 page.
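
The two ways of announcing the charset can be modelled with a small Python sketch. It is a simplification, not how any particular browser is implemented: the function names and the regular expression are assumptions made for the illustration, and the HTTP header is given priority over the META tag:

```python
import re
from email.message import Message

# Simplified pattern for a charset inside a META tag (illustration only).
META_RE = re.compile(rb'<meta[^>]+charset=["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)

PAGE = (b'<html><head>'
        b'<META http-equiv="content-type" content="text/html; charset=utf-8">'
        b'</head><body>...</body></html>')

def charset_from_header(content_type):
    """Extract the charset parameter from an HTTP Content-Type value, if any."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()

def declared_charset(content_type, html_bytes):
    """Prefer the HTTP header; fall back to the META tag; None if neither declares a charset."""
    if content_type:
        charset = charset_from_header(content_type)
        if charset:
            return charset
    match = META_RE.search(html_bytes[:1024])
    return match.group(1).decode("ascii").lower() if match else None

print(declared_charset("text/html; charset=utf-8", PAGE))  # utf-8 (from the header)
print(declared_charset("text/html", PAGE))                 # utf-8 (from the META tag)
print(declared_charset(None, b"<html></html>"))            # None (user must choose manually)
```

The last call corresponds to the situation where nothing is announced and the user has to pick the encoding in the browser's menu.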



Here are the UTF-8 enabled editors that I tried myself


If you need more technical details about the development of multi-lingual Web pages, UTF-8, etc. then here are two very well-known sites devoted to this issue:

J.Korpela. Techniques for multilingual Web sites

A.Flavell. HTML Internationalization

