Chapter 4. Editing documents

 Working with Unicode

Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. Unicode is an internationally recognized standard, adopted by industry leaders. It is required by modern standards such as XML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0 and WML, and is the official way to implement ISO/IEC 10646.

It is supported in many operating systems, all modern browsers, and many other products. The emergence of the Unicode Standard, and the availability of tools supporting it, are among the most significant recent global software technology trends. Incorporating Unicode into client-server or multi-tiered applications and websites offers significant cost savings over the use of legacy character sets.

As a modern XML Editor, <oXygen/> provides support for the Unicode standard, enabling your XML application to be targeted across multiple platforms, languages and countries without re-engineering. Internally, the <oXygen/> XML Editor uses 16-bit characters covering the Unicode character set.

As a Java application, <oXygen/> comes with a default Java input method for typing characters with Unicode codes. However, the default input method does not cover all the Unicode codes, for example the codes for some accented characters or characters of East Asian languages. Such characters can be inserted in the editor panel of <oXygen/> either with the Character Map dialog, available from the menu Edit → Insert from Character Map, or by installing a Java input method that supports the insertion of the needed characters. The installation of a Java input method depends on the platform on which <oXygen/> runs (Windows, Mac OS X, Linux, etc.) and is the same for any Java application.

 Opening and saving Unicode documents

On loading documents of the type XML, XSL, XSD and DTD, <oXygen/> reads the document prolog to determine the specified encoding. The corresponding Java encoding is then used both to load the document and to save it. If the encoding cannot be determined, <oXygen/> displays the Available Java Encodings dialog, which lists all encodings supported by the Java platform.
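The prolog lookup described above can be sketched in a few lines of Java. This is only an illustration, not the <oXygen/> implementation: the class name PrologEncodingSniffer and the method declaredEncoding are hypothetical, and the sketch assumes the start of the file is in an ASCII-compatible encoding (it does not handle UTF-16 documents).

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the encoding name declared in an XML prolog.
final class PrologEncodingSniffer {
    // Matches the encoding pseudo-attribute inside a leading XML declaration.
    private static final Pattern ENCODING = Pattern.compile(
            "^<\\?xml[^>]*?\\bencoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /** Returns the declared encoding name, or empty if the prolog declares none. */
    static Optional<String> declaredEncoding(byte[] head) {
        // ISO-8859-1 maps every byte to exactly one char, so this decoding never fails.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(text);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

When the result is empty, an application has to fall back to a default or ask the user, which corresponds to the dialog mentioned above.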

If the opened document contains a character which cannot be represented with the encoding detected from the document prolog or selected from the Available Java Encodings dialog, <oXygen/> applies the policy specified for handling such errors. If the policy is set to REPORT, <oXygen/> displays an error dialog about the character not allowed by the encoding. If the policy is set to IGNORE, the character is removed from the document displayed in the editor panel. If the policy is set to REPLACE, the character is replaced with a standard replacement character for that encoding.
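The three policy names match the constants of the standard Java class java.nio.charset.CodingErrorAction. Whether <oXygen/> uses that class internally is an assumption; the following minimal sketch (with a hypothetical class DecodePolicyDemo) only shows how the three behaviors differ when plain Java decodes the same bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical demo class: decodes bytes under a chosen error policy.
final class DecodePolicyDemo {
    /** Decodes bytes, applying the same action to malformed and unmappable input. */
    static String decode(byte[] bytes, Charset charset, CodingErrorAction action)
            throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(action)       // byte sequences that are illegal in the charset
                .onUnmappableCharacter(action)  // legal sequences with no Unicode mapping
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }
}
```

For the bytes 0x61 0xFF 0x62 decoded as UTF-8, where 0xFF is not valid, CodingErrorAction.REPORT throws a CharacterCodingException, IGNORE yields "ab", and REPLACE yields "a", the replacement character U+FFFD, then "b".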

In most cases you will use UTF-8. To save a document in a different encoding, simply change the encoding name declared in the document prolog; the file will then be saved using the new encoding.

On saving the edited document, if it contains characters not included in the encoding declared in the document prolog, <oXygen/> detects the problem and signals it to the user, who must resolve the conflict before the document can be saved.
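A check of this kind can be approximated with the standard CharsetEncoder.canEncode method. The sketch below is an illustration under that assumption, not the actual <oXygen/> code; the class EncodabilityCheck and the method unencodable are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists the characters a target encoding cannot represent.
final class EncodabilityCheck {
    /** Returns the code points of text that the given charset cannot encode. */
    static List<Integer> unencodable(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        List<Integer> bad = new ArrayList<>();
        // Iterate by code point so that surrogate pairs are tested as one character.
        text.codePoints().forEach(cp -> {
            if (!encoder.canEncode(new String(Character.toChars(cp)))) {
                bad.add(cp);
            }
        });
        return bad;
    }
}
```

For example, the euro sign (U+20AC) is reported for ISO-8859-1, while the same text passes for UTF-8. An empty result means the document can be saved in the declared encoding without loss.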

To edit documents written in Japanese or Chinese, you will need to change the font to one that supports the specific characters (a Unicode font). For the Windows platform, use of Arial Unicode MS or MS Gothic is recommended. Do not expect Wordpad or Notepad to handle these encodings. If needed, use Internet Explorer or Word to examine XML documents.

When a document with a UTF-16 encoding is edited and saved in <oXygen/>, the saved document has a byte order mark (BOM) that specifies the byte order of the document's content. The default byte order is platform dependent: a UTF-16 document created on a Windows platform (where the default byte order is UnicodeLittle) will have a different BOM than a UTF-16 document created on a Mac OS platform (where the default byte order is UnicodeBig). When an existing document is edited and saved, <oXygen/> preserves its byte order and BOM. This behavior can be changed from the Encoding preferences panel.
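The BOM of a UTF-16 file is the character U+FEFF stored as its first two bytes: FE FF for big-endian and FF FE for little-endian content. The following sketch shows how those bytes can be inspected; the class BomSniffer is a hypothetical name, and the sketch deliberately ignores the UTF-32 little-endian BOM, which also starts with FF FE.

```java
// Hypothetical helper: reads the UTF-16 byte order from the first two bytes of a file.
final class BomSniffer {
    enum ByteOrder { BIG_ENDIAN, LITTLE_ENDIAN, UNKNOWN }

    /** Returns the byte order announced by a UTF-16 BOM, or UNKNOWN if there is none. */
    static ByteOrder utf16ByteOrder(byte[] head) {
        if (head.length >= 2) {
            int b0 = head[0] & 0xFF; // mask to treat the bytes as unsigned
            int b1 = head[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) return ByteOrder.BIG_ENDIAN;
            if (b0 == 0xFF && b1 == 0xFE) return ByteOrder.LITTLE_ENDIAN;
        }
        return ByteOrder.UNKNOWN;
    }
}
```

In standard Java, encoding a string with StandardCharsets.UTF_16 writes a big-endian BOM, while StandardCharsets.UTF_16LE and UTF_16BE write no BOM at all, so the sniffer returns UNKNOWN for their output.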

Note

The naming convention used under Java does not always correspond to the common names used by the Unicode standard. For instance, while in XML you will use encoding="UTF-8", in Java the same encoding has the name "UTF8".
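The standard java.nio.charset.Charset class accepts both spellings and reports a canonical name, which makes the mapping easy to inspect. A minimal sketch, with CharsetNames and canonicalName as hypothetical names:

```java
import java.nio.charset.Charset;

// Hypothetical helper: maps an encoding name or alias to its canonical java.nio name.
final class CharsetNames {
    /** Resolves any name or alias known to the Java runtime; lookup is case-insensitive. */
    static String canonicalName(String nameOrAlias) {
        // Throws UnsupportedCharsetException if the runtime does not know the name.
        return Charset.forName(nameOrAlias).name();
    }
}
```

Both canonicalName("UTF8") and canonicalName("UTF-8") return "UTF-8", and Charset.forName("UTF-8").aliases() lists the other names the runtime accepts for that encoding.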

 The Unicode toolbar

The Unicode toolbar is switched on and off from Perspective → Show Toolbar → Unicode. It contains the actions Change text orientation (default shortcut Ctrl + Shift + O) and Insert from Character Map.

The Change text orientation action enables editing documents in languages with right-to-left writing (Hebrew, Arabic, etc.) by moving the caret to the left when new characters are inserted in the document. Please note that you may have to set an appropriate Unicode-aware font for the editor panel, one able to render the characters of the language of the edited file.

The Insert from Character Map action opens a dialog in which you can select one character in the matrix of all characters available in a font and insert it in the edited document. The action is also available in the Edit menu.

 

Figure 4.1. The Character Map dialog

The character selected in the character table or an entity with the decimal code or the hexadecimal code of that character can be inserted in the current editor. You will see it in the editor if the font is able to render it. The Insert button inserts the selected character in the editor. The Copy button copies it to the clipboard without inserting it in the editor.
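In XML, an entity with the decimal or hexadecimal code of a character is a numeric character reference such as &#233; or &#xE9;. The sketch below builds the three forms for a given code point; the class CharacterReferences and its method names are hypothetical, and the exact text that <oXygen/> inserts (for example the letter case of the hexadecimal digits) is an assumption.

```java
import java.util.Locale;

// Hypothetical helper: builds the three forms in which a character can be inserted.
final class CharacterReferences {
    /** Decimal numeric character reference, e.g. &#233; for U+00E9. */
    static String decimalReference(int codePoint) {
        return "&#" + codePoint + ";";
    }

    /** Hexadecimal numeric character reference, e.g. &#xE9; for U+00E9. */
    static String hexReference(int codePoint) {
        return "&#x" + Integer.toHexString(codePoint).toUpperCase(Locale.ROOT) + ";";
    }

    /** The character itself; code points above U+FFFF become two 16-bit chars. */
    static String literal(int codePoint) {
        return new String(Character.toChars(codePoint));
    }
}
```

A numeric reference is useful when the document encoding cannot represent the character directly, because the reference itself consists only of ASCII characters.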

A character can be located very quickly in the map if you know its Unicode code: just type the code in the search field above the character map and the character is selected automatically in the map. If the code is hexadecimal, the radio button for hexadecimal codes is selected automatically. Selecting a radio button with the mouse starts a search for the code in the map.

The Character Map dialog cannot be used to insert Unicode characters in the grid version of a document editor. Accordingly, the Insert button of the dialog is disabled if the current document is edited in grid mode.