The creation of such non-multilingual Cyrillic+English HTML files is covered on another
page of my
Now let's start the main topic of this
multi-lingual HTML (mixed scripts on the same Web page).
The standards tell us that a given HTML text (from <html> to </html>, one .html file) can have only one encoding of its content:
Note. The same is true, by the way, for XML texts: only one encoding per document, say
<?xml version="1.0" encoding="windows-1251"?>
Unlike an MS Word file, an .HTML file (like an .XML file) is a plain text file, exactly like a .TXT file.
Therefore an HTML file, like a .TXT file, can contain the symbols of only one
Word files, in addition to the text itself,
contain a lot of other information, and each national symbol is tagged with
the name of the Character Set and encoding that the symbol belongs to, while
the HTML standard requires a developer to specify one and only one
It's not possible to have in one HTML file
Here is a simple example: German a-umlaut has a code 228 in an encoded
Thus, if an .HTML (plain text) file contains a byte with a value of 228, then
a user sees either German a-umlaut in every place where the code is 228
(if the page is announced as "Western": charset=iso-8859-1 or charset=windows-1252)
or small Russian 'd' in every such place
(if the page is announced as "Cyrillic (Windows)": charset=windows-1251),
but never both letters on the same page!
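The effect is easy to reproduce outside a browser. The following is a minimal sketch in Python (the helper name `show_byte` is only illustrative, not part of any standard): it takes the single byte 228 and decodes it under each of the charsets named above.

```python
# One and the same byte, value 228 (hex E4).
RAW = bytes([228])

def show_byte(raw, charsets):
    """Decode the same bytes under several charsets; return {charset: text}."""
    return {cs: raw.decode(cs) for cs in charsets}

rendered = show_byte(RAW, ["iso-8859-1", "windows-1252", "windows-1251"])
for cs, ch in rendered.items():
    # The two Western charsets give a-umlaut, the Cyrillic one gives Russian 'd'.
    print(f"{cs:>13}: {ch}  (U+{ord(ch):04X})")
```

The byte itself never changes; only the announced charset decides which letter the reader sees.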
If no charset is specified for a page, then whether the user sees German
a-umlaut or small Russian 'd' depends on the character encoding selected
in the browser's menu. For example, in
Then how could a developer create a Web page where s/he wants letters of several different alphabets?
There are two possible cases:
Case 1.
The page is not really a multilanguage one. That is, say,
on a Russian page (encoding Cyrillic Windows-1251) we need to have one or two
Such a German or Greek letter cannot be typed as a real letter in the .HTML file, as explained above, but we can use one of these two special ways to present such a letter on our Cyrillic page:
Note that this method can be used only for Western letters and some special symbols (including Greek symbols used in math).
One cannot use this method to represent, say, Russian, Polish, or Latvian letters: the standard does not allow it.
The list of all HTML entities as well as more details about this standard
can be found here:
http://www.w3.org/TR/html401/sgml/entities.html
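To check quickly whether a given letter has an entity name at all, one option is Python's standard `html.entities` module, whose `codepoint2name` table holds the HTML 4 entity names. This is only a sketch; the helper `named_entity` is a made-up name for illustration.

```python
from html.entities import codepoint2name  # table of HTML 4 entity names

def named_entity(ch):
    """Return the entity for one character, or None if no name is defined."""
    name = codepoint2name.get(ord(ch))
    return f"&{name};" if name else None

samples = {
    "German a-umlaut": "\u00e4",
    "Greek alpha": "\u03b1",
    "Russian d": "\u0434",
    "Polish l with stroke": "\u0142",
}
for label, ch in samples.items():
    print(f"{label:>22}: {named_entity(ch)}")
```

The Western and Greek letters get a name back, while the Russian and Polish ones return `None`, which matches the restriction described above.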
The simplest way to find the needed 'big number':
- Go to the Microsoft page that lists all encodings, "Codepages", and find the needed one, e.g. "Cyrillic, Windows codepage 1251".
- There, under each letter, you will see its Unicode value (in hexadecimal form), for example 0434 under Russian 'd'.
- Now convert hex 0434 into decimal form. Start the Calculator included in MS Windows, either via Start/Programs/Accessories or via Start/Run (type calc).
- In the Calculator choose View/Scientific. Click on Hex and type 0434. Click on Dec and read the needed decimal value: 1076.
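The same hex-to-decimal conversion can also be done in a couple of lines of Python instead of the Calculator. This is just a sketch; `numeric_reference` is an illustrative helper name.

```python
import html

def numeric_reference(ch):
    """Build the decimal &#nnnn; reference for a single character."""
    return f"&#{ord(ch)};"

code_point = int("0434", 16)       # hex value read from the code page chart
print(code_point)                  # 1076
reference = numeric_reference(chr(code_point))
print(reference)                   # &#1076;

# Round trip: a browser-style unescape gives back Russian 'd' (U+0434).
print(html.unescape(reference) == "\u0434")   # True
```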
Note. Both methods described above do not work in Netscape 4, so if you need to support
Case 2.
The page is really multilingual - has large pieces of different languages and not just 2-3 letters
(and/or you need to support Netscape 4)
Obviously, no one would input large German texts into a Russian .HTML
by representing each and every German letter as ¨ or as &#nnnn;.
Moreover, modifying such an .HTML file (e.g. fixing a spelling error) would be nearly impossible,
because there would be no readable text there to look at.
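To see how unreadable such text becomes, here is a small Python sketch. It relies on the standard `xmlcharrefreplace` error handler, which replaces every non-ASCII character with its &#nnnn; reference; the sample sentence is made up for illustration.

```python
import html

# "Gruesse aus Koeln" written with u-umlaut, sharp s and o-umlaut.
german = "Gr\u00fc\u00dfe aus K\u00f6ln"

# Replace every non-ASCII letter with a numeric character reference.
escaped = german.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(escaped)   # Gr&#252;&#223;e aus K&#246;ln

# The text survives the round trip, but the escaped form is hard to proofread.
restored = html.unescape(escaped)
print(restored == german)   # True
```

Even in this short phrase the words are hard to recognize; a full page written this way would be very difficult to edit.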
In such a case people do the following.
The rule stays the same - one Character Set Encoding per page,
but the developer should use a new, very large
Unicode Character Set includes characters from many alphabets, so for the above example,
it contains both
(In each encoding, f.e. "Western" or "Cyrillic" a unique code is assigned to each symbol
and Unicode is not an exception to this obvious
That is, there are the following Character Sets and you need to choose one for the creation of your Web page:
It's clear from the above that the Unicode Character Set is the only candidate for a multi-lingual Web page.
As in any other Encoding of a Character Set, in a Unicode encoding
each letter has a unique code, so in
(letters of 8-bit legacy encodings, such as Russian or
accented Western European letters, are coded as 2-byte items in UTF-8)
and thus both of them can be present
This is the reason why in Unicode plain text
Definitely, the font used has to be a Unicode font, too, but it's not
a
Therefore, using UTF-8 you will be able to mix, say, German, Russian, and Greek on the
same Web page!
Developers of multi-lingual Web pages use the UTF-8 encoding of Unicode.
That is, the text inside such an HTML file is Unicode (UTF-8) text.
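As a quick illustration of the 2-byte coding, the Python sketch below encodes one German, one Russian, and one Greek letter to UTF-8 and prints the resulting bytes in hex. The helper name `utf8_hex` and the sample letters are chosen only for this example.

```python
def utf8_hex(text):
    """Return the UTF-8 bytes of text as an uppercase hex string."""
    return text.encode("utf-8").hex().upper()

SAMPLES = {
    "German a-umlaut": "\u00e4",
    "Russian d": "\u0434",
    "Greek alpha": "\u03b1",
    "plain Latin a": "a",
}
for label, ch in SAMPLES.items():
    print(f"{label:>16}: {utf8_hex(ch)}")

# All three alphabets live together in one UTF-8 byte string.
mixed = "\u00e4\u0434\u03b1"
mixed_bytes = mixed.encode("utf-8")
print(len(mixed), "letters,", len(mixed_bytes), "bytes")
```

Each of the three national letters has its own distinct 2-byte code, while plain ASCII letters stay 1 byte, so there is no clash like the one with code 228 in the 8-bit encodings.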
Note. If you also need some Far East characters, that is, you'd like to have, say, German, Russian, and Japanese, then:
- you need to find a Unicode font that, in addition to European
characters (English, German, Russian, Greek, etc.), has Far East characters.
Usually, because of the size, Unicode fonts cover only a subset of the full Unicode table, so "Arial" and "Times New Roman" don't have Japanese or Chinese.
That is, you may need to download and install a Unicode font that covers more languages than the regular "Arial" or
"Times New Roman" used in your browser by default for the encoding (script) "UTF-8".
For instance, here is one very large Unicode font that contains Japanese and Chinese in addition to European alphabets
(it's already present on your PC if you have any of the following: MS Office ver. 2000 or higher, FrontPage 2000, MS Publisher ver. 2000 or higher):
"Arial Unicode MS"
If you do not have this font, then find a similar one (i.e. one containing Far Eastern symbols) on this page:
"Unicode fonts for Windows computers"; for example, download and install the freeware font Bitstream CyberBit
- then you need to choose this Far East capable font in your browser:
- in Netscape ver. 4 and higher, choose this font for Encoding: "Unicode" in
Edit/Preferences/Appearance/Fonts
In Netscape ver. 6 and higher (and Mozilla) it's not really necessary:
there is a new, enhanced functionality called "Font Linking".
For example, "Times New Roman" is specified for Unicode, but it does not contain the Japanese letters that are present on some UTF-8 page.
That's fine: "Font Linking" means that the browser will search all Windows fonts until it finds one that does contain Japanese!
- MS Internet Explorer 5+ has a problem there:
the user cannot select any font for Unicode in
(Tools/Internet Options/Fonts - for Scripts)
because "Unicode" is not in the list!
In MS IE 4 it was there... But it's not very critical: Internet Explorer also has
"Font Linking" capabilities, so if a UTF-8 page contains some Japanese, then Internet Explorer will find a font that contains Japanese and use its Japanese glyphs.
Here are my own test pages for UTF-8 that include a form with ...method=GET
to
A Web developer who needs to create such a multi-lingual page has to find an
HTML editor capable of handling
In such editor a developer works as in
switch keyboard to Russian to input Russian letters, switch
to German to input German text, etc.
Or, if you need to insert just a handful of non-Russian letters into your UTF-8 HTML text, you can copy them from the Charmap utility included in MS Windows:
Again, .HTML is a plain text file, so to confirm that it contains UTF-8 letters
and not letters in some other encoding, a developer may look at the HTML file with
a hex viewer and see the correct UTF-8 codes for the letters, such as
C3A4 for German
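If no hex viewer is at hand, the same check can be sketched in Python. Both helpers below (`is_valid_utf8`, `hex_dump`) are illustrative names, not an existing tool; they work on the raw bytes of a file, for example the result of `open(path, "rb").read()`.

```python
def is_valid_utf8(data):
    """True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def hex_dump(data, width=16):
    """Return hex-viewer style lines: offset followed by the byte values."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        lines.append(f"{offset:08X}  " + " ".join(f"{b:02X}" for b in chunk))
    return lines

utf8_sample = "\u00e4".encode("utf-8")          # a-umlaut stored as UTF-8
legacy_sample = "\u00e4".encode("iso-8859-1")   # a-umlaut stored as one byte

print(hex_dump(utf8_sample)[0], is_valid_utf8(utf8_sample))
print(hex_dump(legacy_sample)[0], is_valid_utf8(legacy_sample))
```

Note that a file containing only plain English (ASCII) text passes this check in any of these encodings, so the test is meaningful only when national letters are present.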
Important! As for any other Character Set Encoding, it's a good idea that
Unicode Web page announces its
charset=utf-8
This is done either via Web Server tune-up (preferred way) when the Server
sends, along with page itself, an
via including this information into HTML file itself using HTML tag
<META http-equiv="content-type" content="text/html; charset=utf-8">
as on my test UTF-8 pages (I have no way to force Web server of my
If none of the above is done, it's still Okay - user can manually
choose
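The two ways of announcing the charset can be sketched in Python as follows. This is a simplified illustration, not how any particular browser works: the function names are made up, the META search is a rough regular expression rather than a real HTML parser, and the rule "HTTP header first, META tag second" reflects the preference stated above.

```python
import re
from email.message import Message

# Rough pattern for charset=... inside a META tag (an approximation only).
META_RE = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_\-]+)',
                     re.IGNORECASE)

def charset_from_header(content_type):
    """Extract charset from an HTTP Content-Type header value, if present."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()

def charset_from_meta(html_bytes):
    """Look for a META charset declaration near the top of the file."""
    match = META_RE.search(html_bytes[:1024])
    return match.group(1).decode("ascii").lower() if match else None

def announced_charset(content_type, html_bytes):
    """HTTP header wins; otherwise fall back to the META tag; else None."""
    if content_type:
        charset = charset_from_header(content_type)
        if charset:
            return charset
    return charset_from_meta(html_bytes)

page = b'<META http-equiv="content-type" content="text/html; charset=utf-8">'
print(announced_charset("text/html; charset=UTF-8", b""))
print(announced_charset(None, page))
print(announced_charset("text/html", b"<html></html>"))
```

When both lookups come back empty, nothing was announced, and the choice is left to the browser's encoding menu.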
Here are the UTF-8 enabled editors that I tried myself
This is just a plain text editor, similar to Notepad, but it lets you work with several
different forms of Unicode and UTF-8 is one of them.
This editor is for those developers who write HTML code themselves, typing
all HTML tags, etc.
The file can be stored in Unicode(UTF-8) via Save As menu option.
After you save your file as UTF-8, you need to do one additional thing.
Because this is not a specialized WYSIWYG HTML editor, it, unlike Composer,
does not insert any encoding information lines, so if you need to have this
<META http-equiv="content-type" content="text/html; charset=utf-8">
line, then insert it yourself (i.e. if you cannot specify the encoding via the HTTP header sent
by your Web server).
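For those who prefer to script this step, here is a minimal Python sketch that adds the META line when it is missing. The function name `ensure_utf8_meta` is made up for this example, and the regular expressions are a rough approximation rather than a full HTML parser.

```python
import re

META_LINE = '<META http-equiv="content-type" content="text/html; charset=utf-8">'

def ensure_utf8_meta(html_text):
    """Insert the UTF-8 META line right after <head> unless a charset is declared."""
    if re.search(r'<meta[^>]+charset\s*=', html_text, re.IGNORECASE):
        return html_text                      # some charset is already announced
    head = re.search(r'<head(\s[^>]*)?>', html_text, re.IGNORECASE)
    if head is None:
        return META_LINE + "\n" + html_text   # no <head>: put the line on top
    pos = head.end()
    return html_text[:pos] + "\n" + META_LINE + html_text[pos:]

page = "<html><head><title>Test</title></head><body>\u00e4 \u0434</body></html>"
fixed = ensure_utf8_meta(page)
print(fixed)

# When writing the file, the bytes must really be UTF-8 to match the META line.
utf8_bytes = fixed.encode("utf-8")
```

The META line only announces the encoding; the file still has to be saved as UTF-8, as described above.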
Note. If your future UTF-8 text will contain Far East letters such as Japanese, then you need to specify that a larger Unicode font such as "Arial Unicode MS" should be used for UTF-8:
Tools/Page Options/Default Font, choose "Multilingual (UTF-8)" in the list and then choose such a large font in both fields that you see right below.
In Front Page 2000 you need to open a new document and immediately
specify that you are going to create a UTF-8 page:
Now you can input your text.
Front Page 2000 will insert the following line at the top of the file:
<META http-equiv="content-type" content="text/html; charset=utf-8">
This HTML editor is a very handy tool for developing a UTF-8 Web page.
Important thing (applicable to any page in any encoding)
is that in Composer you should not choose any
Variable Width font
Here is how to create a UTF-8 HTML text in Composer:
Now you can start typing in any language supported by a Unicode
font in use. Standard fonts such as "Arial",
Again, you type using the same approach as you use in
switch keyboard to "RU" and type some Russian, then switch to "EN"
and type some English, then switch to "DE" and type some German, etc.
The resulting .HTML file will have Unicode (UTF-8) characters inside. For example, Russian as well as accented Western European letters are coded as 2-byte items, as mentioned above.
When you click on File/SaveAs, Composer will create a .HTML file
where the following line will be present at the top of the text:
<META http-equiv="content-type" content="text/html; charset=utf-8">
Here is how to do it:
Because UTF-8 does contain Cyrillic letters, the conversion will work Okay, so just ignore the warning message that Composer presents to you.
but Netscape 4 does not delete any lines, so 'old'
line specifying the original encoding of this HTML file
will still be there (if any):
<META http-equiv="content-type" content="text/html; charset=windows-1251">
Therefore, if you use Netscape 4,
you need to remove this old encoding line if it exists.
To do so, use the following menu option of Composer:
Tools/HTML Tools/Edit HTML Source
and delete the line with charset=windows-1251 located
very close to the top of the text, right after a line
with charset=utf-8.
Then do File/Save again.
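For comparison, the same conversion and cleanup can be sketched outside Composer in Python. This is an assumption-laden illustration: the function name `convert_1251_to_utf8` is made up, the input is assumed to be stored as windows-1251, and the regular expression only approximates a real HTML parser.

```python
import re

# Matches a whole META line that still announces the old encoding.
OLD_META = re.compile(
    r'[ \t]*<meta[^>]+charset\s*=\s*["\']?windows-1251[^>]*>\r?\n?',
    re.IGNORECASE)

def convert_1251_to_utf8(raw):
    """Decode windows-1251 bytes, drop the stale META line, re-encode as UTF-8."""
    text = raw.decode("windows-1251")
    text = OLD_META.sub("", text)
    return text.encode("utf-8")

old_page = (
    "<html><head>\n"
    '<META http-equiv="content-type" content="text/html; charset=utf-8">\n'
    '<META http-equiv="content-type" content="text/html; charset=windows-1251">\n'
    "</head><body>\u0434</body></html>"
).encode("windows-1251")

new_page = convert_1251_to_utf8(old_page)
print(new_page.decode("utf-8"))
```

Only the line with charset=windows-1251 is removed; the charset=utf-8 line stays, and Russian 'd' is now stored as the 2-byte UTF-8 code D0B4.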
1. Creating brand new HTML text
The newly created HTML file will contain normal UTF-8 letters inside and also Word
inserts the following line at the top of the HTML code (you can see it using
<META http-equiv="content-type" content="text/html; charset=utf-8">
2. Converting existing .doc to HTML
The newly created HTML file will contain normal UTF-8 letters inside and also Word
inserts the following line at the top of the HTML code (you can see it using
<META http-equiv="content-type" content="text/html; charset=utf-8">
If you need more technical details about the development of multi-lingual Web pages, UTF-8, etc. then here are two very well-known sites devoted to this issue:
J.Korpela. Techniques for multilingual Web sites
A.Flavell. HTML Internationalization