emWin Language Support

Text written in languages such as Arabic, Thai or Chinese contains characters that are normally not part of the fonts shipped with emWin.
This chapter explains the basics: the Unicode standard, which defines all characters available worldwide, and the UTF-8 encoding scheme, which emWin uses to decode text containing Unicode characters.
It also explains how to enable Arabic language support and how to render text with Shift JIS (Japanese Industrial Standards) encoding.

Unicode

The Unicode standard is a character encoding scheme defined by the Unicode consortium. It collects the characters of the world's writing systems in a single character set that works globally. emWin handles Unicode characters as 16-bit values, which covers the character range 0x0000 to 0xFFFF.

emWin can display individual characters or strings in Unicode, although it is most common to simply use mixed strings, which can have any number of Unicode sequences within one ASCII string.

UTF-8 encoding

ISO/IEC 10646-1 defines a multi-octet character set called the Universal Character Set (UCS) which encompasses most of the world's writing systems. Multi-octet characters, however, are not compatible with many current applications and protocols, and this has led to the development of a few UCS transformation formats (UTF), each with different characteristics.

UTF-8 has the characteristic of preserving the full ASCII range, providing compatibility with file systems, parsers and other software that rely on ASCII values but are transparent to other values.

In emWin, UTF-8 characters are encoded using sequences of 1 to 3 octets. In a sequence of one octet, the high-order bit is set to 0 and the remaining 7 bits encode the character value. In a sequence of n octets, n>1, the initial octet has the n higher-order bits set to 1, followed by a bit set to 0. The remaining bit(s) of that octet contain bits from the value of the character to be encoded. The following octet(s) all have the higher-order bit set to 1 and the following bit set to 0, leaving 6 bits in each to contain bits from the character to be encoded. The following table shows the encoding ranges:

Character range UTF-8 Octet sequence
0000 - 007F 0xxxxxxx
0080 - 07FF 110xxxxx 10xxxxxx
0800 - FFFF 1110xxxx 10xxxxxx 10xxxxxx
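
The ranges in the table translate directly into a small encoder. The following is a minimal sketch, not the emWin implementation; the function name EncodeUTF8() is hypothetical and is not part of the emWin API. It assumes 16-bit character codes, so it never produces more than 3 octets.

```c
#include <stdint.h>

/* Hypothetical helper (not an emWin function): encodes one 16-bit
   character code as UTF-8 and returns the number of octets written. */
int EncodeUTF8(uint16_t Code, uint8_t * pBuffer) {
  if (Code <= 0x007F) {                       /* 0xxxxxxx */
    pBuffer[0] = (uint8_t)Code;
    return 1;
  }
  if (Code <= 0x07FF) {                       /* 110xxxxx 10xxxxxx */
    pBuffer[0] = (uint8_t)(0xC0 | (Code >> 6));
    pBuffer[1] = (uint8_t)(0x80 | (Code & 0x3F));
    return 2;
  }
  /* 0800 - FFFF: 1110xxxx 10xxxxxx 10xxxxxx */
  pBuffer[0] = (uint8_t)(0xE0 | (Code >> 12));
  pBuffer[1] = (uint8_t)(0x80 | ((Code >> 6) & 0x3F));
  pBuffer[2] = (uint8_t)(0x80 | (Code & 0x3F));
  return 3;
}
```

The caller has to provide a buffer of at least 3 bytes. For the character code 0x00F6 the sketch writes the two octets C3 B6.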

Encoding example

The text "Halöle" contains ASCII characters and European extensions. The following hexdump shows this text as UTF-8 encoded text:

48 61 6C C3 B6 6C 65
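
Reading the hexdump back into character codes is the reverse operation. The sketch below shows one possible way to do this. DecodeUTF8() and DecodeString() are hypothetical names, not emWin functions, and the code assumes well-formed input with sequences of at most 3 octets.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes one character from a well-formed UTF-8
   string, stores it in *pCode and returns the number of octets consumed.
   Malformed or truncated sequences are not detected. */
int DecodeUTF8(const char * s, uint16_t * pCode) {
  uint8_t b0 = (uint8_t)s[0];

  if ((b0 & 0x80) == 0x00) {                  /* 0xxxxxxx */
    *pCode = b0;
    return 1;
  }
  if ((b0 & 0xE0) == 0xC0) {                  /* 110xxxxx 10xxxxxx */
    *pCode = (uint16_t)(((b0 & 0x1F) << 6) | ((uint8_t)s[1] & 0x3F));
    return 2;
  }
  /* 1110xxxx 10xxxxxx 10xxxxxx */
  *pCode = (uint16_t)(((b0 & 0x0F) << 12)
                    | (((uint8_t)s[1] & 0x3F) << 6)
                    |  ((uint8_t)s[2] & 0x3F));
  return 3;
}

/* Decodes a zero-terminated string into at most MaxChars character codes.
   Returns the number of characters stored. */
size_t DecodeString(const char * s, uint16_t * pDest, size_t MaxChars) {
  size_t n = 0;

  while (*s && (n < MaxChars)) {
    s += DecodeUTF8(s, &pDest[n]);
    n++;
  }
  return n;
}
```

Applied to the 7 octets of the example above, the sketch yields 6 character codes, the fourth of which is 0x00F6.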

Programming examples

To display text containing non-ASCII characters, the UTF-8 codes of the non-ASCII characters can be computed manually and written into the string as hexadecimal escape sequences. However, if your compiler supports UTF-8 encoding (sometimes called multi-byte encoding), non-ASCII characters can be used directly in strings.

//
// Example using ASCII encoding:
//
GUI_UC_SetEncodeUTF8();       /* required only once to activate UTF-8*/
GUI_DispString("Hal\xc3\xb6le");
//
// Example using UTF-8 encoding:
//
GUI_UC_SetEncodeUTF8(); /* required only once to activate UTF-8*/
GUI_DispString("Halöle");

Unicode characters

The character output routine used by emWin (GUI_DispChar()) always takes an unsigned 16-bit value (U16) and is able to display any character defined by Unicode within that range. It only requires a font which contains the character you want to display.

UTF-8 strings

This is the recommended way to display Unicode. No special functions are required. If UTF-8 encoding is enabled, each emWin function which handles strings decodes the given text as UTF-8 text.

Using U2C.exe to convert UTF-8 text into C-code

The Tool subdirectory of emWin contains the tool U2C.exe to convert UTF-8 text to C code. It reads a UTF-8 text file and creates a C file with C strings. The following steps show how to convert a text file into C strings and how to display them with emWin:

Step 1: Creating a UTF-8 text file

Save the text to be converted in UTF-8 format. You can use Notepad.exe to do this. Load the text in Notepad.exe.

Choose "File/Save As...". The file dialog should contain a combo box to set the encoding format. Choose "UTF-8" and save the text file.

Step 2: Converting the text file into a C-code file

Start U2C.exe and select the text file to be converted. Then select the name of the C file to be created. Output of U2C.exe:

"Japanese:"
"1 - \xe3\x82\xa8\xe3\x83\xb3\xe3\x82\xb3\xe3\x83\xbc"
 "\xe3\x83\x87\xe3\x82\xa3\xe3\x83\xb3\xe3\x82\xb0"
"2 - \xe3\x83\x86\xe3\x82\xad\xe3\x82\xb9\xe3\x83\x88"
"3 - \xe3\x82\xb5\xe3\x83\x9d\xe3\x83\xbc\xe3\x83\x88"
"English:"
"1 - encoding"
"2 - text"
"3 - support"

Step 3: Using the output in the application code

The following example shows how to display the UTF-8 text with emWin:

#include "GUI.h"
static const char * _apStrings[] = {
  "Japanese:",
  "1 - \xe3\x82\xa8\xe3\x83\xb3\xe3\x82\xb3\xe3\x83\xbc"
      "\xe3\x83\x87\xe3\x82\xa3\xe3\x83\xb3\xe3\x82\xb0",
  "2 - \xe3\x83\x86\xe3\x82\xad\xe3\x82\xb9\xe3\x83\x88",
  "3 - \xe3\x82\xb5\xe3\x83\x9d\xe3\x83\xbc\xe3\x83\x88",
  "English:",
  "1 - encoding",
  "2 - text",
  "3 - support"
};

void MainTask(void) {
  int i;
  GUI_Init();
  GUI_SetFont(&GUI_Font16_1HK);
  GUI_UC_SetEncodeUTF8();
  for (i = 0; i < GUI_COUNTOF(_apStrings); i++) {
    GUI_DispString(_apStrings[i]);
    GUI_DispNextLine();
  }
  while(1) {
    GUI_Delay(500);
  }
}
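
The conversion performed in step 2 can be approximated in a few lines of C. The following is only a sketch and not the implementation of U2C.exe; the function name EscapeLine() is hypothetical. It writes every byte above 0x7F as a hexadecimal escape sequence. As a C hex escape would otherwise swallow a directly following hexadecimal digit, the sketch closes and reopens the string literal in that case.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical sketch (not U2C.exe): converts one line of UTF-8 text into
   a quoted C string literal. Returns the length of the result, or 0 if
   the destination buffer is too small. */
size_t EscapeLine(const char * pSrc, char * pDest, size_t DestSize) {
  size_t n = 0;
  int    PrevWasHex = 0;

  if (DestSize < 3) {
    return 0;
  }
  pDest[n++] = '"';
  for (; *pSrc; pSrc++) {
    unsigned char c = (unsigned char)*pSrc;

    if (n + 6 > DestSize) {          /* Room for "\xNN", closing quote, NUL */
      return 0;
    }
    if (c >= 0x80) {                 /* Non-ASCII byte: write as \xNN */
      n += (size_t)sprintf(pDest + n, "\\x%02x", (unsigned)c);
      PrevWasHex = 1;
    } else {
      if (PrevWasHex && isxdigit(c)) {
        pDest[n++] = '"';            /* Split the literal so that the hex  */
        pDest[n++] = '"';            /* escape does not absorb this digit  */
      }
      if ((c == '"') || (c == '\\')) {
        pDest[n++] = '\\';           /* Escape quote and backslash */
      }
      pDest[n++] = (char)c;
      PrevWasHex = 0;
    }
  }
  pDest[n++] = '"';
  pDest[n]   = '\0';
  return n;
}
```

Control characters are copied unchanged, which is sufficient for single lines of plain text.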

Unicode API

The table below lists the available routines in alphabetical order within their respective categories. Detailed descriptions of the routines can be found in the sections that follow.

Routine Description
UTF-8 functions
GUI_UC_ConvertUC2UTF8 Converts a Unicode string into UTF-8 format.
GUI_UC_ConvertUTF82UC Converts a UTF-8 string into Unicode format.
GUI_UC_EnableBIDI Enables/Disables the support for bidirectional fonts.
GUI_UC_Encode Encodes the given character with the current encoding.
GUI_UC_GetCharCode Returns the decoded character.
GUI_UC_GetCharSize Returns the number of bytes used to encode the given character.
GUI_UC_SetEncodeNone Disables encoding.
GUI_UC_SetEncodeUTF8 Enables UTF-8 encoding.
Double byte functions
GUI_UC_DispString Displays a double byte string.

Arabic language support

The basic difference between western languages and Arabic is that Arabic is written from right to left and has no uppercase and lowercase characters. Furthermore, the character codes of the text are not identical to the character indices in the font file used to render the characters, because the notation form of a character depends on its position in the text.

Notation forms

The Arabic base character set is defined in the Unicode standard within the range from 0x0600 to 0x06FF. These character codes cannot be used directly to look up the character in the font, because the notation form depends on the character position in the text. One character can have up to 4 different notation forms:

  • One, if it is at the beginning of a word (initial)
  • One, if it is at the end of a word (final)
  • One, if it is in the middle of a word (medial)
  • One, if the character stands alone (isolated)

Not every character may be joined both to the left and to the right (double-joined). The character 'Hamza', for example, always stands separately, and 'Alef' is only allowed at the end of a word or separately. Character combinations of the letters 'Lam' and 'Alef' should be transformed into a 'ligature', which means that one character is drawn as a substitute for the combination of 'Lam' and 'Alef'.

The above explanation shows that the notation form is normally not identical to the character code of the text. The following table shows how emWin transforms the characters into the notation form depending on the text position:

Base Isolated Final Initial Medial Character
0x0621 0xFE80 - - - Hamza
0x0622 0xFE81 0xFE82 - - Alef with Madda above
0x0623 0xFE83 0xFE84 - - Alef with Hamza above
0x0624 0xFE85 0xFE86 - - Waw with Hamza above
0x0625 0xFE87 0xFE88 - - Alef with Hamza below
0x0626 0xFE89 0xFE8A 0xFE8B 0xFE8C Yeh with Hamza above
0x0627 0xFE8D 0xFE8E - - Alef
0x0628 0xFE8F 0xFE90 0xFE91 0xFE92 Beh
0x0629 0xFE93 0xFE94 - - Teh Marbuta
0x062A 0xFE95 0xFE96 0xFE97 0xFE98 Teh
0x062B 0xFE99 0xFE9A 0xFE9B 0xFE9C Theh
0x062C 0xFE9D 0xFE9E 0xFE9F 0xFEA0 Jeem
0x062D 0xFEA1 0xFEA2 0xFEA3 0xFEA4 Hah
0x062E 0xFEA5 0xFEA6 0xFEA7 0xFEA8 Khah
0x062F 0xFEA9 0xFEAA - - Dal
0x0630 0xFEAB 0xFEAC - - Thal
0x0631 0xFEAD 0xFEAE - - Reh
0x0632 0xFEAF 0xFEB0 - - Zain
0x0633 0xFEB1 0xFEB2 0xFEB3 0xFEB4 Seen
0x0634 0xFEB5 0xFEB6 0xFEB7 0xFEB8 Sheen
0x0635 0xFEB9 0xFEBA 0xFEBB 0xFEBC Sad
0x0636 0xFEBD 0xFEBE 0xFEBF 0xFEC0 Dad
0x0637 0xFEC1 0xFEC2 0xFEC3 0xFEC4 Tah
0x0638 0xFEC5 0xFEC6 0xFEC7 0xFEC8 Zah
0x0639 0xFEC9 0xFECA 0xFECB 0xFECC Ain
0x063A 0xFECD 0xFECE 0xFECF 0xFED0 Ghain
0x0641 0xFED1 0xFED2 0xFED3 0xFED4 Feh
0x0642 0xFED5 0xFED6 0xFED7 0xFED8 Qaf
0x0643 0xFED9 0xFEDA 0xFEDB 0xFEDC Kaf
0x0644 0xFEDD 0xFEDE 0xFEDF 0xFEE0 Lam
0x0645 0xFEE1 0xFEE2 0xFEE3 0xFEE4 Meem
0x0646 0xFEE5 0xFEE6 0xFEE7 0xFEE8 Noon
0x0647 0xFEE9 0xFEEA 0xFEEB 0xFEEC Heh
0x0648 0xFEED 0xFEEE - - Waw
0x0649 0xFEEF 0xFEF0 - - Alef Maksura
0x064A 0xFEF1 0xFEF2 0xFEF3 0xFEF4 Yeh
0x067E 0xFB56 0xFB57 0xFB58 0xFB59 Peh
0x0686 0xFB7A 0xFB7B 0xFB7C 0xFB7D Tcheh
0x0698 0xFB8A 0xFB8B - - Jeh
0x06A9 0xFB8E 0xFB8F 0xFB90 0xFB91 Keheh
0x06AF 0xFB92 0xFB93 0xFB94 0xFB95 Gaf
0x06CC 0xFBFC 0xFBFD 0xFBFE 0xFBFF Farsi Yeh
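
The selection of a notation form can be illustrated with a small lookup. The sketch below is a simplified assumption of how such a transformation may work and not the emWin implementation: it uses only an excerpt of the table, ignores ligatures and diacritics, and the names ARABIC_FORMS and SelectNotationForm() are hypothetical. A character is joined to its predecessor only if the predecessor has an initial form and the character itself has a final form.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
  uint16_t Base, Isolated, Final, Initial, Medial;   /* 0: form not available */
} ARABIC_FORMS;

/* Excerpt of the table above */
static const ARABIC_FORMS _aForms[] = {
  { 0x0621, 0xFE80, 0,      0,      0      },  /* Hamza */
  { 0x0627, 0xFE8D, 0xFE8E, 0,      0      },  /* Alef  */
  { 0x0628, 0xFE8F, 0xFE90, 0xFE91, 0xFE92 },  /* Beh   */
  { 0x0644, 0xFEDD, 0xFEDE, 0xFEDF, 0xFEE0 },  /* Lam   */
  { 0x0645, 0xFEE1, 0xFEE2, 0xFEE3, 0xFEE4 },  /* Meem  */
};

static const ARABIC_FORMS * _FindForms(uint16_t Code) {
  size_t i;

  for (i = 0; i < sizeof(_aForms) / sizeof(_aForms[0]); i++) {
    if (_aForms[i].Base == Code) {
      return &_aForms[i];
    }
  }
  return NULL;
}

/* Prev and Next are the neighbors in logical (typing) order,
   0 if there is none. Unknown characters are returned unchanged. */
uint16_t SelectNotationForm(uint16_t Prev, uint16_t Cur, uint16_t Next) {
  const ARABIC_FORMS * pPrev = _FindForms(Prev);
  const ARABIC_FORMS * pCur  = _FindForms(Cur);
  const ARABIC_FORMS * pNext = _FindForms(Next);
  int JoinPrev, JoinNext;

  if (pCur == NULL) {
    return Cur;
  }
  JoinPrev = (pPrev != NULL) && pPrev->Initial && pCur->Final;
  JoinNext = (pNext != NULL) && pNext->Final   && pCur->Initial;
  if (JoinPrev && JoinNext) return pCur->Medial;
  if (JoinPrev)             return pCur->Final;
  if (JoinNext)             return pCur->Initial;
  return pCur->Isolated;
}
```

For the sequence Beh, Alef, Beh (0x0628, 0x0627, 0x0628) the sketch selects the initial form of Beh, the final form of Alef and, because Alef does not join to the following character, the isolated form of the last Beh.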

Ligatures

Character combinations of 'Lam' and 'Alef' need to be transformed into ligatures. The following table shows how emWin transforms these combinations into ligatures, if the first letter is a 'Lam' (code 0x0644):

Second letter Ligature (final) Ligature (elsewhere)
0x0622, Alef with Madda above 0xFEF6 0xFEF5
0x0623, Alef with Hamza above 0xFEF8 0xFEF7
0x0625, Alef with Hamza below 0xFEFA 0xFEF9
0x0627, Alef 0xFEFC 0xFEFB
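
The ligature table can be expressed as a simple lookup. The following sketch is not taken from emWin; GetLamAlefLigature() is a hypothetical name, and the parameter IsFinal is assumed to state whether the 'final' column of the table applies.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the ligature code for the combination
   First + Second, or 0 if the two characters do not form a ligature. */
uint16_t GetLamAlefLigature(uint16_t First, uint16_t Second, int IsFinal) {
  static const struct {
    uint16_t Second, Final, Elsewhere;
  } aLig[] = {
    { 0x0622, 0xFEF6, 0xFEF5 },   /* Alef with Madda above */
    { 0x0623, 0xFEF8, 0xFEF7 },   /* Alef with Hamza above */
    { 0x0625, 0xFEFA, 0xFEF9 },   /* Alef with Hamza below */
    { 0x0627, 0xFEFC, 0xFEFB },   /* Alef                  */
  };
  size_t i;

  if (First != 0x0644) {          /* First letter has to be 'Lam' */
    return 0;
  }
  for (i = 0; i < sizeof(aLig) / sizeof(aLig[0]); i++) {
    if (aLig[i].Second == Second) {
      return IsFinal ? aLig[i].Final : aLig[i].Elsewhere;
    }
  }
  return 0;
}
```

A return value of 0 tells the caller to keep the two characters and to select their notation forms individually.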

Bidirectional text alignment

As mentioned above, Arabic is written from right to left (RTL). However, if the Arabic text contains numbers built of more than one digit, these numbers should be written from left to right. If Arabic text is mixed with European text, a couple of further rules need to be followed to get the right visual alignment of the text.
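
The rule for numbers can be illustrated with a strongly simplified sketch. It is neither the Unicode bidirectional algorithm nor the emWin implementation: it assumes a line which consists only of RTL characters, spaces and ASCII digits, and the function name ReorderRTLLine() is hypothetical. The line is reversed to get the visual order, and each run of digits is then reversed again so that numbers still read from left to right.

```c
#include <stddef.h>
#include <stdint.h>

static void _Reverse(uint16_t * p, size_t n) {
  size_t i;

  for (i = 0; i < n / 2; i++) {
    uint16_t t   = p[i];
    p[i]         = p[n - 1 - i];
    p[n - 1 - i] = t;
  }
}

static int _IsDigit(uint16_t c) {
  return (c >= '0') && (c <= '9');
}

/* Hypothetical helper: converts a line from logical order into visual
   order (index 0 is the leftmost character). Works in place. */
void ReorderRTLLine(uint16_t * pText, size_t NumChars) {
  size_t i = 0;

  _Reverse(pText, NumChars);                    /* RTL: last character first */
  while (i < NumChars) {
    size_t Start;

    if (!_IsDigit(pText[i])) {
      i++;
      continue;
    }
    Start = i;
    while ((i < NumChars) && _IsDigit(pText[i])) {
      i++;
    }
    _Reverse(pText + Start, i - Start);         /* Keep digit runs LTR */
  }
}
```

Mixed European text, separators within numbers and neutral characters require the additional rules mentioned above and are not covered by this sketch.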

The Unicode consortium has defined these rules in the Unicode standard. If bidirectional text support is enabled, emWin follows most of these rules to get the right visual order before drawing the text. emWin also supports mirroring of neutral characters in RTL-aligned text. This is important if, for example, Arabic text contains parentheses. Mirroring is done by replacing the code of the character to be mirrored with the code of a mirror partner whose image matches the mirrored image. A table containing all characters with existing mirror partners makes this replacement fast. Note that mirroring of further characters is not supported.
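
A table-based replacement of this kind could look as follows. The sketch is an assumption for illustration only: the table is a short excerpt with ASCII pairs, the real table used by emWin is not shown in this chapter, and GetMirrorPartner() is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Excerpt only: a few pairs of characters whose images mirror each other */
static const uint16_t _aMirrorPair[][2] = {
  { '(', ')' },
  { '<', '>' },
  { '[', ']' },
  { '{', '}' },
};

/* Hypothetical helper: returns the mirror partner of the given character
   or the character itself if the table contains no partner. */
uint16_t GetMirrorPartner(uint16_t Code) {
  size_t i;

  for (i = 0; i < sizeof(_aMirrorPair) / sizeof(_aMirrorPair[0]); i++) {
    if (Code == _aMirrorPair[i][0]) return _aMirrorPair[i][1];
    if (Code == _aMirrorPair[i][1]) return _aMirrorPair[i][0];
  }
  return Code;
}
```

Characters without an entry in the table are returned unchanged, which corresponds to the restriction that only characters with an existing mirror partner are mirrored.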

The following example shows how bidirectional text is rendered by emWin:

UTF-8 text Rendering
\xd8\xb9\xd9\x84\xd8\xa7 1, 2, 345
\xd8\xba\xd9\x86\xd9\x8a XYZ
\xd8\xa3\xd9\x86\xd8\xa7

Requirements

Arabic language support is part of the emWin basic package. emWin standard fonts do not contain Arabic characters. Font files containing Arabic characters can be created using the Font Converter.

Memory

The bidirectional text alignment and Arabic character transformation use approx. 60 KB of ROM and approx. 800 bytes of additional stack.

How to enable Arabic support

By default, emWin always writes text from left to right and does not perform the Arabic character transformation described above. To enable support for bidirectional text and Arabic character transformation, add the following line to your application:

GUI_UC_EnableBIDI(1);

If enabled, emWin follows the rules of the bidirectional algorithm, described by the Unicode consortium, to get the right visual order before drawing text.

Example

The Sample folder contains the example FONT_Arabic, which shows how to draw Arabic text. It contains an emWin font with Arabic characters and some small Arabic text examples.

Font files used with Arabic text

Font files used to render Arabic languages need to include at least all characters defined in the 'Arabic' range 0x600-0x6FF and the notation forms and ligatures listed in the tables of this chapter.

Thai language support

The Thai alphabet uses 44 consonants and 15 basic vowel characters. These are horizontally placed, left to right, with no intervening space, to form syllables, words, and sentences. Vowels are written above, below, before, or after the consonant they modify, although the consonant always sounds first when the syllable is spoken. The vowel characters (and a few consonants) can be combined in various ways to produce numerous compound vowels (diphthongs and triphthongs).
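
The effect of vowels and marks written above or below a consonant can be illustrated by counting how many character cells a text occupies. The sketch below is not based on emWin code. It relies on the nonspacing marks of the Unicode 'Thai' range (0x0E31, 0x0E34-0x0E3A and 0x0E47-0x0E4E), and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the character is drawn above or below the preceding
   consonant and therefore does not advance the cursor. */
int IsThaiCombiningMark(uint16_t c) {
  return  (c == 0x0E31)
      || ((c >= 0x0E34) && (c <= 0x0E3A))
      || ((c >= 0x0E47) && (c <= 0x0E4E));
}

/* Counts the characters of the text which advance the cursor. */
size_t CountThaiCells(const uint16_t * pText, size_t NumChars) {
  size_t i, NumCells = 0;

  for (i = 0; i < NumChars; i++) {
    if (!IsThaiCombiningMark(pText[i])) {
      NumCells++;
    }
  }
  return NumCells;
}
```

A text of the 3 characters 0x0E01, 0x0E34 and 0x0E19 occupies only 2 cells, because 0x0E34 is placed above the preceding consonant.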

Requirements

As explained above, the Thai language makes extensive use of compound characters. To draw compound characters, emWin needs a font type which contains all required character information, such as the image size, image position and cursor increment value. emWin supports a font type with this information from version 4.00 on. This also means that older font types cannot be used to draw Thai text. Note that the standard fonts of emWin do not contain Thai characters. To create a Thai font file, the Font Converter of version 3.04 or newer is required.

Memory

The Thai language support needs no additional ROM or RAM.

How to enable Thai support

Thai support does not need to be enabled by a configuration switch. The only thing required to draw Thai text is a font file of type 'Extended' created with the font converter from version 3.04 or newer.

Example

The Sample folder contains the example FONT_ThaiText.c, which shows how to draw Thai text. It contains an emWin font with Thai characters and some small Thai text examples.

Font files used with Thai text

Font files used to render Thai text need to include at least all characters defined in the 'Thai' range 0xE00-0xE7F.

Shift JIS support

Shift JIS (Japanese Industrial Standards) is a character encoding method for the Japanese language. It is the most common Japanese encoding method. Shift JIS encoding makes generous use of 8-bit characters, and the value of the first byte is used to distinguish single-byte from multiple-byte characters. The Shift JIS support of emWin is only needed if text with Shift JIS encoding needs to be rendered. No special function calls are needed to draw a Shift JIS string. The main requirement is a font file which contains the Shift JIS characters.
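
How the first byte distinguishes single-byte from double-byte characters can be sketched as follows. The code is not taken from emWin and the function names are hypothetical. It assumes the commonly used lead byte ranges 0x81-0x9F and 0xE0-0xFC, where the upper end of the second range depends on the Shift JIS variant in use.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the byte starts a double-byte character
   (assumed lead byte ranges, see text above). */
int SJIS_IsLeadByte(uint8_t b) {
  return ((b >= 0x81) && (b <= 0x9F)) || ((b >= 0xE0) && (b <= 0xFC));
}

/* Counts the characters of a zero-terminated Shift JIS string. */
size_t SJIS_CountChars(const char * s) {
  size_t NumChars = 0;

  while (*s) {
    if (SJIS_IsLeadByte((uint8_t)*s) && (s[1] != '\0')) {
      s += 2;                        /* Double-byte character */
    } else {
      s += 1;                        /* Single-byte character */
    }
    NumChars++;
  }
  return NumChars;
}
```

All other byte values, including the range 0xA1-0xDF, are treated as single-byte characters.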

Creating Shift JIS fonts

The Font Converter can generate a Shift JIS font for emWin from any Windows font. When using a Shift JIS font, the functions used to display Shift JIS characters are linked automatically with the library. For detailed information on how to create Shift-JIS fonts, contact SEGGER Microcontroller GmbH & Co. KG (info@segger.com). A separate Font Converter documentation describes all you need for an efficient way of implementing Shift JIS in your emWin projects.