Unicode and Cyrillic: Copy/Paste and other problems

This page is Chapter 2 of the "Unicode and Cyrillic: problems & solutions" section of my site.

Problem: Cyrillic text is unreadable or shows up as question marks (????) when you work with MS Word 97 and newer or with other Unicode-based programs
(Internet Explorer, Outlook Express, MS Outlook, Netscape 7/Mozilla, etc.):

- during Copy/Paste between a non-Unicode application and MS Word 97 and newer (or another Unicode program)
- (Word-specific problem) while working with a Cyrillic plain text (.txt) file in Word 97 and newer: either loading one into Word (unreadable text as a result) or saving your Word text as a .TXT file (a set of "???" as a result)

Terminology Note.
Most of these problems do not exist in the 'Russian Windows' environment.
When I write "Russian version of Windows" below, I do not mean only the special, localized version where the word "Start" is in Russian.
I mean any Windows installation (even with an English interface) where the system code page (System Default locale) is Russian ("Cyrillic" code page 1251).
(The system code page issue is described in detail in the "Full Russification" section of my site.)

So, to avoid repeating this long description ("...when system code page..."), I call such a Windows installation a Russian Windows.
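
As a rough illustration, the test for a 'Russian Windows' in the sense used on this page could be sketched in Python as shown here. The function names are my own (hypothetical). The sketch assumes a Windows machine, where the Win32 call GetACP reports the number of the system (ANSI) code page; on other systems it simply returns None.

```python
import sys

CYRILLIC_CODE_PAGE = 1251  # the "Cyrillic" system code page discussed on this page


def system_code_page():
    """Return the Windows system (ANSI) code page number, or None outside Windows."""
    if sys.platform != "win32":
        return None
    import ctypes  # imported here because ctypes.windll exists only on Windows
    return ctypes.windll.kernel32.GetACP()


def is_russian_windows(code_page):
    """True when the given system code page is Cyrillic (1251)."""
    return code_page == CYRILLIC_CODE_PAGE


print(is_russian_windows(system_code_page()))
```

On a "Western" installation the call would typically report 1252, so the check prints False even if Cyrillic fonts and keyboard are enabled.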

These new problems appear in modern software (such as Word 97/2000, Internet Explorer, etc.) because modern applications use a new type of data encoding: Unicode.

Many Windows applications are still non-Unicode programs and use legacy encodings such as "Western European, Code Page 1252" or "Cyrillic, Code Page 1251".

Examples of the non-Unicode programs where you can type some Cyrillic text:

- Text input windows of Netscape ver. 3 and 4, for example, the e-mail preparation window (Composition window)
- UltraEdit by Ian D. Mead, which I use all the time for preparing my Web pages and for other text-related and programming work.
  To work with Russian, I just need to choose (via View/SetFont), say, "Courier New" with Script="Cyrillic" there.
- The majority of 3rd party plain text editors (those working with .TXT files) are non-Unicode programs.
- Macromedia Dreamweaver: text input window
  (I haven't seen it myself, but have read that Macromedia software is also of a non-Unicode type)

When you want to move Cyrillic texts (perform Copy and Paste) between Word 97 (or another Unicode-based program) and some non-Unicode program, or want to work with a plain text file (.TXT) along with Word 97/2000, you may face some problems:

working simultaneously with both types of Cyrillic text (text that uses Unicode and text that does not) often leads, under a non-Russian Windows, to unreadable, gibberish text or just a set of question marks instead of Cyrillic letters.

Below you will find the solutions for these problems.

Note.
I assume that you already know how to enable Cyrillic fonts and Cyrillic keyboard tools in your Windows.
If it's not the case, then do it before reading any further here.
To enable Cyrillic fonts and keyboard, read the "Cyrillic in Windows" section of my site.

Table of Contents

Copy/Paste between Unicode and non-Unicode programs:
- Unicode program ---> non-Unicode program
  From Unicode program (f.e. Word 97 and newer, Internet Explorer, Outlook Express, MS Outlook 2000, Netscape 7/Mozilla) to a non-Unicode one (f.e. Netscape 4.79, UltraEdit, Dreamweaver) -
  question marks (???) instead of Cyrillic in the non-Unicode program's window
- non-Unicode program ---> Unicode program
  From non-Unicode program (f.e. Netscape 4.79 or UltraEdit) to Unicode one (f.e. Word 97 and newer, Internet Explorer, Outlook Express, MS Outlook, Netscape 7/Mozilla) -
  unreadable (gibberish) text instead of Cyrillic in the Unicode program's window
Specific to MS Word 97 and newer: Cyrillic plain text (.TXT) file -
- reading such file: unreadable (gibberish) text instead of Cyrillic
- saving your Cyrillic document as .TXT: question marks instead of Cyrillic

Copy/Paste:
Unicode program ---> non-Unicode program

You try to copy some Cyrillic text from a Unicode program (f.e. Word 97 and newer, Internet Explorer, Outlook Express, MS Outlook 2000, Netscape 7/Mozilla) to a non-Unicode one (f.e. Netscape 4.79, UltraEdit, Dreamweaver)
and see just question marks (???) instead of Cyrillic as a result.

This usually happens under a non-Russian version of Windows (that is, where System Code Page is not "Cyrillic, CP-1251").
Conversion from Unicode text to a non-Unicode text is usually based on System Code Page, and thus under the "Western" Windows installation (where system code page is "Western European, CP-1252") the following happens:

Unicode contains Cyrillic letters, while the "Western" code page (encoding) does not contain any.
Therefore, for each Cyrillic letter, the result of such a conversion is a question mark ('?'), which is the designated symbol meaning "character is not found in the target code page".
By the way, it's a real, regular question mark and nothing more, i.e. the text with question marks no longer contains any Cyrillic letters, and they cannot be recovered from it.
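
The effect is easy to reproduce outside of any particular program. Here is a minimal Python sketch, assuming the non-Unicode side uses "Western" CP-1252; the sample text and variable names are my own, and real applications perform this conversion through Windows itself rather than through Python.

```python
# Simulate a "Western" (CP-1252) system converting Unicode Cyrillic text
# for a non-Unicode program.
text = "Привет, Word"

# errors="replace" puts '?' in place of every character that is missing
# from the target code page, much like the conversion described above.
western_bytes = text.encode("cp1252", errors="replace")
print(western_bytes)  # b'??????, Word'

# The question marks are ordinary '?' characters: reading the bytes back
# as Cyrillic does not bring the letters back.
print(western_bytes.decode("cp1251"))  # ??????, Word
```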

Solution: use an intermediate window - a program that understands Unicode and also lets you specify that you are dealing with "Cyrillic" and not "Western" encoding.

I suggest using one of the following programs of this type (the instructions for each one are given below):

- Netscape 4.7x as intermediate window
  Only if you already have Netscape (by the way, earlier 4.x versions are fine, too).
  If you don't have Netscape 4 installed, you'd better use the other program I suggest, UniPad: being just a text editor, it is a much 'lighter' program.
- Unicode text editor UniPad as intermediate window

Netscape 4.7x as intermediate window for Copy/Paste

Netscape 4 can help to solve the problem during Copy/Paste from a Unicode program (f.e. Internet Explorer or Word 97/2000) to a non-Unicode program (f.e. plain text editor or Dreamweaver):

Netscape Communicator 4.x has a built-in HTML editor, Composer, that is good for this: it understands Unicode and also lets us specify that we are dealing with Cyrillic text and not "Western":

Start Netscape
Open Composer window via the menu - Communicator/Composer
Switch to Cyrillic(Windows) encoding:
- in Netscape 4.x: View/CharacterSet/Cyrillic(Windows-1251)
- in Netscape 4.0x: View/Encoding/Cyrillic(Windows-1251)

Now you can use this window as an intermediate one:

Copy the text from your Unicode program to Netscape Composer first
(where the current encoding is Cyrillic(Windows)!)
Select the text that was copied to Composer (f.e. via Ctrl/A or Edit/SelectAll) and now copy it to the needed non-Unicode program.
This will produce normal Cyrillic text, not question marks, because the system now 'knows' that the text is in "Cyrillic" encoding and not in the encoding of the system code page ("Western")

Unicode text editor UniPad as intermediate window for Copy/Paste

UniPad (freeware for personal use) can help to solve the problem during Copy/Paste from a Unicode program (f.e. Internet Explorer or MS Word 97/2000) to a non-Unicode program (f.e. plain text editor or Dreamweaver):

Download and install the UniPad editor from the "UniPad Home Page"
Open UniPad and do File/New in it to have a new document window

Now you can use this window as an intermediate one:

Copy the text from your Unicode program to the UniPad window first.
Because UniPad understands Unicode, you should see normal Cyrillic there.
Select the text in this UniPad window and then do Edit / Copy As
Choose the needed encoding in the list, "Windows CP-1251 (Cyrillic)", and click Ok.
That was the conversion from Unicode text to non-Unicode text where, instead of using the default System Code Page ("Western") as the target encoding, we explicitly specified that the target encoding is "Cyrillic"!
Now you can safely paste the text to any non-Unicode program - you'll see normal Cyrillic as a result and not question marks.
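
What the explicit choice of target encoding changes can be sketched in Python. The helper name copy_as is hypothetical and only mimics the idea of the menu command; it is not how UniPad is implemented.

```python
def copy_as(text, target_encoding):
    """Convert Unicode text to bytes in an explicitly chosen code page."""
    return text.encode(target_encoding)


# Explicit "Cyrillic" target instead of the system default ("Western"):
clipboard_bytes = copy_as("Привет", "cp1251")
print(clipboard_bytes)  # b'\xcf\xf0\xe8\xe2\xe5\xf2'

# Every letter became one Cyrillic byte and none turned into '?'.
print(b"?" in clipboard_bytes)  # False
```

A non-Unicode program that shows these bytes with a Cyrillic font displays the original word.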

Copy/Paste:
non-Unicode program ---> Unicode program

You try to copy some Cyrillic text from a non-Unicode program (f.e. Netscape 4.79 or UltraEdit or Macromedia Dreamweaver) to a Unicode program (f.e. Word 97 and newer, Internet Explorer, Outlook Express, MS Outlook 2000, Netscape 7/Mozilla) and see just unreadable (gibberish) text instead of Cyrillic as a result.

This usually happens under a non-Russian version of Windows (that is, where System Code Page is not "Cyrillic, CP-1251").
The Unicode program does not know that the incoming text is a Cyrillic one and is using system code page as a default during the conversion from non-Unicode text to Unicode text.
For example, under "Western" installation of Windows it looks at the incoming bytes as a sequence of "Western" encoding bytes and performs the conversion
"Western European, CP-1252" ---> Unicode

For example:
The Cyrillic small letter 'д' (Russian 'd') contained in that original non-Unicode text has a byte value of 228 in the "Cyrillic, CP-1251" code page. But the Unicode program assumes that the incoming data belongs to the "Western" encoding! In the "Western, CP-1252" code page the value 228 is the German a-umlaut ('ä'), so the following conversion takes place:
non-Unicode German a-umlaut ---> Unicode German a-umlaut
and you'll see a German a-umlaut in that Unicode program instead of the Russian 'd' after you paste the text there.
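
The byte value 228 example can be checked with a few lines of Python. This is only a sketch of the two interpretations of the same byte; the variable names are my own.

```python
# The Russian letter 'д' as stored by a non-Unicode (CP-1251) program.
raw = "д".encode("cp1251")
print(list(raw))  # [228]

# What a Unicode program assumes on a "Western" system:
print(raw.decode("cp1252"))  # ä

# What it should have assumed:
print(raw.decode("cp1251"))  # д
```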

There are 2 possible solutions to this situation. Some non-Unicode programs let you use very simple Solution 1, so just try it first, but if it does not work, then use Solution 2.

Note. Word 2000/XP has its own solution for text pasted into a Word window - see the "MS macro Eefonts for Word 2000/XP" section below.

Solution 1
Use the following approach while copying the text from a non-Unicode program (f.e. Netscape 4.79 or UltraEdit or Dreamweaver, etc.) to the Windows Clipboard:

Select the text you want to copy
Before you choose Edit/Copy in that program's menu (or press Ctrl/C), you need to switch your keyboard to the needed Cyrillic mode, say, "Russian" if it's a Russian text.
(The activation of Russian keyboard is covered in the "Russian Keyboard: standard and phonetic" section of my site)
Now, having "RU" on your Taskbar keyboard language indicator, do Edit/Copy.
This kind of 'tells' the system that you are trying to do Copy/Paste with a Cyrillic text and not "Western"
When you paste the text to a Unicode program now (f.e. to Word 97/2000 or Internet Explorer), you should see normal Cyrillic - the conversion from non-Unicode to Unicode is correctly assuming that incoming text belongs to "Cyrillic, CP-1251" code page.

This Solution 1 (switching the keyboard to Cyrillic mode before copying) may not work for each and every non-Unicode program.
In such a case:

- if the Unicode program you are pasting the Cyrillic text into is Word 2000/XP, then use the MS macro Eefonts for Word 2000/XP
- otherwise, use the universal Solution 2

MS macro Eefonts for Word 2000/XP

Microsoft offers a free macro that solves the problem of a non-readable text copied from some non-Unicode program to a Word 2000/XP document.
The same macro also helps to make readable an old Cyrillic .doc created in the past with, for example, non-Unicode Word 6.

Go to the Microsoft page (Knowledge base article Q260162)
"Incorrect Characters Appear When You Open Document in Earlier Eastern European Version of Word".

Find the link to download Eefonts.exe there.

Download and install it. Now in your Word 2000/XP you will have a new option under the Tools menu:
Tools / Fix Broken Text

When you copy some Cyrillic from a non-Unicode program to Word 2000/XP, you will first see some gibberish text (as explained above).
Select that text and then:

- do Tools / Fix Broken Text
- choose "Russian" in the list (if the text you are copying is Russian).

Now you will have readable Cyrillic!
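
I do not know how the macro is implemented internally, but the idea of repairing such gibberish can be sketched in Python: turn the wrongly decoded characters back into the original bytes, then read those bytes as Cyrillic. The function name and its parameters are hypothetical, and the sketch assumes that the wrong guess was "Western" CP-1252 and that the text is Russian.

```python
def fix_broken_text(gibberish, wrong="cp1252", intended="cp1251"):
    """Undo a wrong code page guess: recover the bytes, decode them correctly."""
    return gibberish.encode(wrong).decode(intended)


# Gibberish produced by pasting CP-1251 bytes as if they were CP-1252.
broken = "Привет".encode("cp1251").decode("cp1252")
print(broken)                   # Ïðèâåò
print(fix_broken_text(broken))  # Привет
```

Unlike the question marks case, nothing was lost here: the bytes are intact and only their interpretation was wrong, which is why a repair is possible.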

Solution 2
for non-Unicode --> Unicode copying case

The universal solution for the successful copy of Cyrillic text from a non-Unicode program (Netscape 4.79, Dreamweaver, plain text editor, etc.) to a Unicode one (Internet Explorer, Outlook, etc.) is the following:

Use an intermediate window such as a program that understands Unicode and also lets you specify that you are dealing with "Cyrillic" and not "Western" encoding.

I suggest using the freeware (for personal use) editor UniPad as such an intermediate program:

Download and install the UniPad editor:
"UniPad Home Page"
Open UniPad and do File/New in it to have a new document window

Now you can use this UniPad window while copying Cyrillic from a non-Unicode program (f.e. Netscape 4.79 or Dreamweaver or UltraEdit) to some Unicode program (MS Word 97/2000, Internet Explorer, Outlook Express, etc.):

In your non-Unicode program select and copy the text that contains Cyrillic letters to Windows Clipboard (Edit/Copy in your program's menu or Ctrl/C)
Go to UniPad and do Edit / Paste As
In the list choose needed encoding -
"Windows CP-1251 (Cyrillic)" and click Ok.
Now you should see normal Cyrillic text in this UniPad window.
That was the conversion from non-Unicode text to Unicode text (UniPad is a Unicode editor) where, instead of using the System Code Page (say, "Western") as the source encoding, we explicitly specified that the source encoding is "Cyrillic"!
Now you can safely select and copy the text from this UniPad window (which is already a Unicode text now) to any Unicode-based program - you'll see normal Cyrillic as a result

Cyrillic in MS Word 97 and newer:
working with .TXT files

opening such file in Word: unreadable (gibberish) text instead of Cyrillic

saving your Cyrillic document as .TXT: question marks instead of Cyrillic

The above happens under a non-Russian Windows, i.e. when system code page is not "Cyrillic, CP-1251".

Plain text files (.TXT) contain non-Unicode text, so when Unicode-based Word 97/2000 deals with such files, it performs the conversion between Unicode text and non-Unicode text.
By default, this conversion uses system code page and therefore we see the above problems if system code page is say "Western, CP-1252" and not "Cyrillic".

The solution is to specify that the content of the plain text (.TXT) file belongs to "Cyrillic" encoding and not to system code page.

MS Word 2000 and newer has its own way to specify that, while Word 97 requires an intermediate program to be used.

Here are the solutions for the two cases where plain text (.TXT) Cyrillic files are involved:

Opening a Cyrillic plain text (.TXT) file in MS Word 97+
Saving your Word 97 and newer Cyrillic text as a plain text (.TXT) file

Opening a Cyrillic plain text (.TXT) file in MS Word 97 and newer

Let's assume that you have some plain text (.TXT) Russian file that contains the text in "Cyrillic CP-1251" encoding (a.k.a "Cyrillic(Windows)" or "Windows-1251").

Word 2000 (and newer versions) allows you to specify that this file is really a Cyrillic one, while Word 97 requires a more complex approach.

MS Word 2000 and newer

Go to Tools/Options/General and check the box "Confirm Conversion at Open" (i.e. show the conversion details dialog)
Word 2010 - go to File/Options/Advanced, find "General" at the end of this list of options and check "Confirm file format conversion on open"
File / Open and then in the "Files of Type" choose
"Encoded text files (.txt)"
(in Word 2003 - choose "Text file (*.txt)")
Point to your Cyrillic plain text file and click "Open"
Word 2000 presents another list called "Convert File". Choose "Encoded Text" there
Word 2003 offers "Encoded text" right away (it already 'knows' that this is the case)
Now Word asks you to specify the encoding of that text:
- Click on "Other Encoding"
- In the list, choose "Cyrillic(Windows)"
- Click on Ok
You should now see normal Cyrillic text in your Word window.
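
The difference between the default conversion and the explicit "Cyrillic(Windows)" choice can be sketched in Python. The file name is hypothetical, and the sample file is created in a temporary folder just for this illustration.

```python
import tempfile
from pathlib import Path

# Create a small sample file the way a non-Unicode editor would save it
# (CP-1251 bytes, one byte per letter).
sample_path = Path(tempfile.mkdtemp()) / "sample.txt"  # hypothetical name
sample_path.write_bytes("Привет".encode("cp1251"))

# Default behaviour on a "Western" system: the bytes are read as CP-1252.
as_western = sample_path.read_text(encoding="cp1252")

# Explicit choice, like picking "Cyrillic(Windows)" in the encoding dialog.
as_cyrillic = sample_path.read_text(encoding="cp1251")

print(as_western)   # Ïðèâåò
print(as_cyrillic)  # Привет
```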

MS Word 97

There are several possible solutions for loading a Cyrillic .TXT file into Word 97. Let's look at two of them:

- If you already have the Netscape browser, then use the simplest one, the Netscape-based method
- If you don't have Netscape but have a non-Unicode plain text editor that can open Cyrillic .TXT files (or are willing to install such an editor), then use the Plain Text Editor-based method, which requires more steps

Netscape-based method for loading Cyrillic .TXT into Word 97

In Netscape, do File/Open and choose "Text (.TXT)" as the "Files of Type".
Your Cyrillic .txt file opens in Netscape. Change the encoding to Cyrillic(Windows-1251):

- in Netscape 6: View/Character Encoding/Cyrillic(Windows-1251)
- in Netscape 4.5+: View/Character Set/Cyrillic(Windows-1251)
- in Netscape 4.0x: View/Encoding/Cyrillic(Windows-1251)

Now you should see normal Cyrillic text and can safely copy it to Word 97.

Plain Text non-Unicode editor-based method of loading Cyrillic .TXT file into Word 97.

Instead of opening your Cyrillic plain text (.TXT) file directly in Word 97, you need to open it in any non-Unicode plain text editor and then use Copy/Paste methods of this page to place this text into Word 97.
I use the shareware plain text editor UltraEdit; you can download it, too, or use your favorite plain text editor that works with Cyrillic.

Let's use UltraEdit as an example:

In the UltraEdit menu, go to View / Set Font and there:
- select "Courier New".
- in the Script list at the bottom right, choose "Cyrillic"
Do File / Open to load your Cyrillic .TXT file into UltraEdit. You should see normal Cyrillic text.
To copy the text from non-Unicode UltraEdit to Unicode-based Word 97 use the method explained above on this page:
Copy/Paste technique non-Unicode --->Unicode

Saving Word 97+ Cyrillic text as a plain text (.TXT) file

Let's assume that you want to save your document opened in MS Word, as a plain text (.TXT) Russian file.

Word 2000 (and newer versions) allows you to specify that this file should be saved in a Cyrillic encoding, while Word 97 requires a more complex approach.

MS Word 2000 and newer

File / Save As, type the name of the new file and then in "Save as Type" choose "Encoded text (.txt)"
(in Word 2003 choose "Plain Text (*.txt)")
Word presents another window, "File Conversion", where it asks you to specify the encoding of that text:
Click on "Other Encoding"
In the list, choose "Cyrillic(Windows)"
Click on Ok

That newly created plain text file contains normal Windows-1251 Cyrillic text and not question marks :)
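
In Python terms, this save step corresponds to writing the text with an explicitly named encoding instead of relying on the system default. A minimal sketch follows; the file name and the sample text are my own.

```python
import tempfile
from pathlib import Path

document = "Привет"

# Save with an explicit Cyrillic encoding, like choosing
# "Cyrillic(Windows)" in the "File Conversion" window.
out_path = Path(tempfile.mkdtemp()) / "saved.txt"  # hypothetical name
out_path.write_text(document, encoding="cp1251")

saved = out_path.read_bytes()
print(len(saved) == len(document))  # True: one CP-1251 byte per letter
print(b"?" in saved)                # False: no question marks were written
```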

MS Word 97

So you have some Cyrillic text in your open MS Word 97 window and want to save it as a plain text file.
Instead of creating this Cyrillic plain text (.TXT) file using Word 97, you need to copy the text to any non-Unicode plain text editor and then do Save As there.
I use the shareware plain text editor UltraEdit; you can download it, too, or use your favorite plain text editor that works with Cyrillic.

Let's use UltraEdit as an example:

In the UltraEdit menu, go to View / Set Font and there:
- select "Courier New".
- in the Script list at the bottom right, choose "Cyrillic"
To copy the text from Unicode-based Word 97 to non-Unicode UltraEdit, you need to use the method explained above on this page:
Copy/Paste technique Unicode --->non-Unicode
Now, in UltraEdit, when you see normal Cyrillic text copied from MS Word 97, you can create this plain text (.txt) file - via File / Save As menu of UltraEdit.
