15.7 CCS tables


The iconv library stores the files with CCS tables in the ccs/ subdirectory. The tables for any CCS may be kept in two forms: in binary form (.cct files, see the ccs/binary/ subdirectory) and in the form of compilable .c source files. The .cct files are only used when the --enable-newlib-iconv-external-ccs configure script option is enabled. The .c files are linked into the Newlib library if the corresponding encoding is enabled.


As stated earlier, the Newlib iconv library performs all conversions through the 32-bit UCS, but the codes used in most CCS-es fit into the first 16-bit subset of the 32-bit UCS set. Thus, to make the CCS tables more compact, the 16-bit UCS-2 is used instead of the 32-bit UCS-4.


CCS tables may be 8 or 16 bits wide. 8-bit CCS tables map 8-bit CCS codes to 16-bit UCS-2 and vice versa, while 16-bit CCS tables map 16-bit CCS codes to 16-bit UCS-2 and vice versa. 8-bit tables are small, while 16-bit tables may be quite large. Because of this, 16-bit CCS tables may be either speed- or size-optimized. Size-optimized CCS tables are smaller than speed-optimized ones, but the conversion is slower when they are used. 8-bit CCS tables have only the speed-optimized variant.

Each CCS table (both speed- and size-optimized) consists of "from_ucs" and "to_ucs" subtables. The "from_ucs" subtable maps UCS-2 codes to CCS codes, while the "to_ucs" subtable maps CCS codes to UCS-2 codes.


Almost all 16-bit CCS tables contain fewer than 0xFFFF codes, so a lot of gaps exist.

15.7.1 Speed-optimized tables format


In the case of 8-bit speed-optimized CCS tables, the "to_ucs" subtable format is trivial: it is just an array of 256 16-bit UCS-2 codes. Therefore, the UCS-2 code Y corresponding to a CCS code X is calculated as Y = to_ucs[X].
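The direct-index lookup described above can be sketched in a few lines of C. This is only an illustration: the function name ccs8_to_ucs2 is hypothetical and is not taken from the Newlib sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Map an 8-bit CCS code x to UCS-2 by direct indexing: Y = to_ucs[X].
   The table has exactly 256 entries, one per possible 8-bit code,
   so no range check on x is needed. */
static uint16_t ccs8_to_ucs2(const uint16_t to_ucs[256], uint8_t x)
{
    assert(to_ucs != NULL);
    return to_ucs[x];
}
```

The whole cost of the conversion is one array access, which is why no smaller variant of this subtable is needed for 8-bit CCS-es.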


Obviously, the simplest way to create the "from_ucs" table or the 16-bit "to_ucs" table is to use a huge 16-bit array, as in the case of the 8-bit "to_ucs" table. But almost all 16-bit CCS tables contain fewer than 0xFFFF code mappings, and this fact may be exploited to reduce the size of the CCS tables.


This section describes the 8-bit "UCS-2 -> CCS" table format. The 16-bit "CCS -> UCS-2" table format is the same, except for the mapping direction and the number of CCS bits.


In the case of the 8-bit speed-optimized table, the "from_ucs" subtable corresponds to the "from_ucs" array and has the following layout:


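from_ucs array:
-------------------------------------
0xFF mapping (2 bytes) (only for 8-bit table).
-------------------------------------
Heading block
-------------------------------------
Block 1
-------------------------------------
Block 2
-------------------------------------
...
-------------------------------------
Block N
-------------------------------------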


The 0x0000-0xFFFF 16-bit code range is divided into 256 code subranges. Each subrange is represented by a 256-element block (256 1-byte elements, or 256 2-byte elements in the case of a 16-bit CCS table) whose elements are the CCS codes of this subrange. If the "UCS-2 -> CCS" mapping has big enough gaps, some blocks will be absent and there will be fewer than 256 blocks.


Element number m of the heading block (which contains 256 2-byte elements) corresponds to the m-th 256-element subrange. If the subrange contains some codes, the m-th element of the heading block contains the offset of the corresponding block in the "from_ucs" array. If there are no codes in the subrange, the heading block element contains 0xFFFF.


If there are some gaps in a block, the corresponding block elements have the 0xFF value. If the 0xFF code is present in the CCS, its mapping is defined in the first 2-byte element of the "from_ucs" array.


With this table format, the algorithm for finding the CCS code X that corresponds to the UCS-2 code Y is as follows.


  1. If Y is equal to the value of the first 2-byte element of the "from_ucs" array, X is 0xFF. Otherwise, continue the search.
  2. Calculate the block number: BlkN = (Y & 0xFF00) >> 8.
  3. If the heading block element with number BlkN is 0xFFFF, there is no corresponding CCS code (error, wrong input data). Otherwise, fetch from that element BlkIdx, the "from_ucs" array index of the BlkN-th block.
  4. Calculate the offset of the X code in its block: Xindex = Y & 0xFF.
  5. If the value of the Xindex-th element of the block (which is from_ucs[BlkIdx+Xindex]) is 0xFF, there is no corresponding CCS code (error, wrong input data). Otherwise, X = from_ucs[BlkIdx+Xindex].
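The steps above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the Newlib implementation: the struct from_ucs8 type, its field names, and the function from_ucs8_lookup are hypothetical. For readability the three parts of the table are kept in separate fields, and block offsets are taken relative to the start of the blocks area, whereas the real table is a single array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NO_BLOCK 0xFFFFu  /* heading element: the subrange has no block */
#define NO_CODE  0xFFu    /* block element: gap, no CCS code */

/* Hypothetical in-memory view of an 8-bit speed-optimized "from_ucs" table. */
struct from_ucs8 {
    uint16_t ff_mapping;    /* UCS-2 code that maps to CCS code 0xFF */
    uint16_t heading[256];  /* block offset per subrange, or NO_BLOCK */
    const uint8_t *blocks;  /* the present 256-byte blocks, concatenated */
};

/* Find the CCS code for UCS-2 code y.
   Returns 0 and stores the code in *x, or returns -1 if y has no mapping. */
static int from_ucs8_lookup(const struct from_ucs8 *t, uint16_t y, uint8_t *x)
{
    assert(t != NULL && x != NULL);

    /* Step 1: 0xFF doubles as the gap marker, so it is handled separately. */
    if (y == t->ff_mapping) {
        *x = 0xFF;
        return 0;
    }

    /* Steps 2 and 3: the high byte selects the heading element. */
    uint16_t blk_n = (uint16_t)((y & 0xFF00u) >> 8);
    uint16_t blk_idx = t->heading[blk_n];
    if (blk_idx == NO_BLOCK)
        return -1;

    /* Steps 4 and 5: the low byte selects the element inside the block. */
    uint8_t code = t->blocks[blk_idx + (y & 0xFFu)];
    if (code == NO_CODE)
        return -1;

    *x = code;
    return 0;
}
```

A lookup costs at most one comparison and two array accesses, regardless of how many blocks are present.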

15.7.2 Size-optimized tables format


As stated above, size-optimized tables exist only for 16-bit CCS-es. This is because the difference between the speed-optimized and the size-optimized table sizes is too small in the case of 8-bit CCS-es.


The formats of the "to_ucs" and "from_ucs" subtables are identical in the case of size-optimized tables.

This section describes the format of the "UCS-2 -> CCS" size-optimized CCS table. The format of the "CCS -> UCS-2" table is the same.

The idea of the size-optimized tables is to split the UCS-2 codes ("from" codes) into ranges (a range is a run of consecutive UCS-2 codes). CCS codes ("to" codes) are then stored only for the codes in these ranges. Distinct "from" codes which belong to no range (unranged codes) are stored together with the corresponding "to" codes.


The following is the layout of the size-optimized table array:


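size_arr array:
-------------------------------------
Ranges number (2 bytes)
-------------------------------------
Unranged codes number (2 bytes)
-------------------------------------
Unranged codes array index (2 bytes)
-------------------------------------
Ranges indexes (triads)
-------------------------------------
Ranges
-------------------------------------
Unranged codes array
-------------------------------------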


The "Ranges indexes" section of size_arr helps to find the offset of the needed range in size_arr and consists of triads in the following format:
the first code in range, the last code in range, range offset.


The array of these triads is sorted by the first element, therefore it is possible to find the needed range index quickly.


Each range has the corresponding sub-array containing the "to" codes. These sub-arrays are stored in the place marked as "Ranges" in the layout diagram.


The "Unranged codes array" contains pairs ("from" code, "to" code) for each unranged code. The array of these pairs is sorted by "from" code values, therefore it is possible to find the needed pair quickly.


Note that each range requires 6 bytes for its index. If, for example, there are two ranges (1 - 5 and 9 - 10) and one unranged code (7), 12 bytes are needed for the two range indexes and 4 bytes for the unranged code (total 16). But it is better to join both ranges into 1 - 10 and mark codes 6 and 8 as absent. In this case, only 6 bytes for the single range index and 4 bytes to mark codes 6 and 8 as absent are needed (total 10 bytes). This optimization is done in the size-optimized tables. Thus, ranges may contain small gaps. The absent codes in ranges are marked as 0xFFFF.


Note that a pair of consecutive "from" codes is stored as two unranged codes rather than as a range, since forming the range takes more space than storing two unranged codes (5 two-byte elements against 4: a three-element index triad plus two "to" codes, versus two ("from", "to") pairs).
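The space trade-off in the two notes above can be checked with a small calculation. The helper names range_bytes and unranged_bytes are hypothetical. The sketch counts every stored 2-byte element, including the "to" codes kept inside ranges, so its absolute totals are larger than the overhead-only figures quoted above, while the conclusion is the same.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes used by one range: a 6-byte index triad plus one 2-byte "to" code
   per position in the range (an absent code still occupies a 0xFFFF slot). */
static size_t range_bytes(unsigned first, unsigned last)
{
    assert(last >= first);
    return 6 + 2 * (size_t)(last - first + 1);
}

/* Bytes used by n unranged codes: one ("from", "to") pair of
   2-byte codes for each of them. */
static size_t unranged_bytes(size_t n)
{
    return 4 * n;
}
```

With these helpers, ranges 1 - 5 and 9 - 10 plus the unranged code 7 take range_bytes(1, 5) + range_bytes(9, 10) + unranged_bytes(1) = 16 + 10 + 4 = 30 bytes, while the joined range 1 - 10 takes range_bytes(1, 10) = 26 bytes. A two-code range takes 10 bytes (5 elements) against unranged_bytes(2) = 8 bytes (4 elements).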


The algorithm for finding the CCS code X which corresponds to the UCS-2 code Y (input) in the "UCS-2 -> CCS" size-optimized table is as follows.


  1. Try to find the triad whose range contains Y in the "Ranges indexes" section. Since the triads are sorted, a binary search can be used.
  2. If the triad is found, fetch the X code from the corresponding range array. If it is 0xFFFF, return an error.
  3. If there is no corresponding triad, search for the X code among the sorted unranged codes. Return an error if nothing was found.
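The three steps above can be sketched in C as follows. This is an illustration only: the function name size_tbl_lookup is hypothetical, and the sketch assumes that the table is an array of 16-bit values in native byte order and that the range offsets and the unranged codes array index are counted in 2-byte elements from the start of size_arr. Neither assumption is confirmed by this section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ABSENT 0xFFFFu  /* marks an absent code inside a range */

/* Find the "to" code for "from" code y in a size-optimized table.
   Returns 0 and stores the code in *x, or returns -1 if y has no mapping. */
static int size_tbl_lookup(const uint16_t *size_arr, uint16_t y, uint16_t *x)
{
    assert(size_arr != NULL && x != NULL);

    size_t n_ranges = size_arr[0];
    size_t n_unranged = size_arr[1];
    const uint16_t *triads = size_arr + 3;            /* (first, last, offset) */
    const uint16_t *pairs = size_arr + size_arr[2];   /* ("from", "to") */

    /* Step 1: binary search for the triad whose range contains y. */
    size_t lo = 0, hi = n_ranges;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const uint16_t *t = triads + 3 * mid;
        if (y < t[0]) {
            hi = mid;
        } else if (y > t[1]) {
            lo = mid + 1;
        } else {
            /* Step 2: fetch the code from the range sub-array. */
            uint16_t code = size_arr[t[2] + (y - t[0])];
            if (code == ABSENT)
                return -1;
            *x = code;
            return 0;
        }
    }

    /* Step 3: binary search among the sorted unranged pairs. */
    lo = 0;
    hi = n_unranged;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        uint16_t from = pairs[2 * mid];
        if (y < from) {
            hi = mid;
        } else if (y > from) {
            lo = mid + 1;
        } else {
            *x = pairs[2 * mid + 1];
            return 0;
        }
    }
    return -1;
}
```

Both searches are logarithmic in the number of entries, which is the price paid for the smaller table compared with the constant-time speed-optimized lookup.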

15.7.3 .cct and .c CCS Table files


The .c source files for 8-bit CCS tables have "to_ucs" and "from_ucs" speed-optimized tables. The .c source files for 16-bit CCS tables have "to_ucs_speed", "to_ucs_size", "from_ucs_speed" and "from_ucs_size" tables.


When .c files are compiled and used, all the 16-bit and 32-bit values have the native endian format (Big Endian for BE systems and Little Endian for LE systems), since they are compiled for the target system before they are used.


In the case of .cct files, which are intended for dynamic loading of CCS tables, the tables are stored in either LE or BE format. Since the .cct files are generated by the ’mktbl.pl’ Perl script, it is possible to choose the endianness of the tables. It is also possible to store two copies (both LE and BE) of the CCS tables in one .cct file. The default .cct files (which come with the Newlib sources) have both LE and BE CCS tables. The Newlib iconv library automatically chooses the needed CCS tables (with the appropriate endianness).
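The byte-order issue described above can be illustrated with a short C sketch. How a .cct file marks its LE and BE copies is not described in this section, so the sketch only shows the two generic building blocks: detecting the host byte order and reading a 16-bit value stored in a known byte order. The names byte_order, host_byte_order and read_u16 are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum byte_order { ORDER_LE, ORDER_BE };

/* Detect the host byte order by looking at the first byte
   of a known 16-bit value. */
static enum byte_order host_byte_order(void)
{
    const uint16_t probe = 0x0102;
    return *(const unsigned char *)&probe == 0x02 ? ORDER_LE : ORDER_BE;
}

/* Read a 16-bit value stored at p in the given byte order.
   The result does not depend on the host byte order. */
static uint16_t read_u16(const unsigned char *p, enum byte_order order)
{
    assert(p != NULL);
    return order == ORDER_LE
        ? (uint16_t)(p[0] | (p[1] << 8))
        : (uint16_t)((p[0] << 8) | p[1]);
}
```

A loader that picks the copy matching host_byte_order() can use the table in place, which is presumably why both copies are shipped. Reading through a helper like read_u16 would instead work with either copy, at the cost of extra work per access.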


Note that the .cct files are only used when the --enable-newlib-iconv-external-ccs configure script option is enabled.

15.7.4 The ’mktbl.pl’ Perl script


The ’mktbl.pl’ script is intended to generate .cct and .c CCS table files from the CCS source files.


The CCS source files are just text files which have one or more columns with the CCS <-> UCS-2 code mapping. For an example of a CCS table source file, open one of them using the URLs given below.


The following table describes where the source files for CCS table files provided by the Newlib distribution are located.

Name
    URL

big5
    http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT
cns11643_plane1 cns11643_plane14 cns11643_plane2
    http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/CNS11643.TXT
cp775 cp850 cp852 cp855 cp866
    http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/
iso_8859_1 iso_8859_2 iso_8859_3 iso_8859_4 iso_8859_5 iso_8859_6 iso_8859_7 iso_8859_8 iso_8859_9 iso_8859_10 iso_8859_11 iso_8859_13 iso_8859_14 iso_8859_15
    http://www.unicode.org/Public/MAPPINGS/ISO8859/
iso_ir_111
    http://crl.nmsu.edu/~mleisher/csets/ISOIR111.TXT
jis_x0201_1976 jis_x0208_1990 jis_x0212_1990
    http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0201.TXT
koi8_r
    http://www.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8-R.TXT
koi8_ru
    http://crl.nmsu.edu/~mleisher/csets/KOI8RU.TXT
koi8_u
    http://crl.nmsu.edu/~mleisher/csets/KOI8U.TXT
koi8_uni
    http://crl.nmsu.edu/~mleisher/csets/KOI8UNI.TXT
ksx1001
    http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSX1001.TXT
win_1250 win_1251 win_1252 win_1253 win_1254 win_1255 win_1256 win_1257 win_1258
    http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/

The CCS source files aren’t distributed with Newlib because of license restrictions in most of the Unicode.org files.

The following are the ’mktbl.pl’ options which were used to generate the .cct files. Note that the -s option should be added to generate the CCS table source files.

  1. For the iso_8859_10.cct, iso_8859_13.cct, iso_8859_14.cct, iso_8859_15.cct, iso_8859_1.cct, iso_8859_2.cct, iso_8859_3.cct, iso_8859_4.cct, iso_8859_5.cct, iso_8859_6.cct, iso_8859_7.cct, iso_8859_8.cct, iso_8859_9.cct, iso_8859_11.cct, win_1250.cct, win_1252.cct, win_1254.cct, win_1256.cct, win_1258.cct, win_1251.cct, win_1253.cct, win_1255.cct, win_1257.cct, koi8_r.cct, koi8_ru.cct, koi8_u.cct, koi8_uni.cct, iso_ir_111.cct, big5.cct, cp775.cct, cp850.cct, cp852.cct, cp855.cct, cp866.cct, cns11643.cct files, only the -i <SRC_FILE_NAME> option was used.
  2. To generate the jis_x0208_1990.cct file, the -i jis_x0208_1990.txt -x 2 -y 3 options were used.
  3. To generate the cns11643_plane1.cct file, the -i cns11643.txt -p1 -N cns11643_plane1 -o cns11643_plane1.cct options were used.
  4. To generate the cns11643_plane2.cct file, the -i cns11643.txt -p2 -N cns11643_plane2 -o cns11643_plane2.cct options were used.
  5. To generate the cns11643_plane14.cct file, the -i cns11643.txt -p0xE -N cns11643_plane14 -o cns11643_plane14.cct options were used.

For more info about the ’mktbl.pl’ options, see the ’mktbl.pl -h’ output.


It is assumed that CCS codes are 16 or fewer bits wide. If there are wider CCS codes in the CCS source file, the bits above the low 16 define the plane (see the cns11643.txt CCS source file).
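The split of a wide source code into a plane and a 16-bit CCS code can be sketched in C as follows. The helper names ccs_plane and ccs_code16 are hypothetical, and the sample value in the note below is only an illustration in the style of the cns11643.txt source file.

```c
#include <assert.h>
#include <stdint.h>

/* The bits above the low 16 select the plane. */
static unsigned ccs_plane(uint32_t src_code)
{
    return (unsigned)(src_code >> 16);
}

/* The low 16 bits are the CCS code within that plane. */
static uint16_t ccs_code16(uint32_t src_code)
{
    return (uint16_t)(src_code & 0xFFFFu);
}
```

For example, a source code of 0xE2121 splits into plane 0xE and CCS code 0x2121, which is the plane selected by the -p0xE option used for cns11643_plane14.cct.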


Sometimes it is impossible to map some CCS codes to the 16-bit UCS, for example when several different CCS codes are mapped to one UCS-2 code or when one CCS code is mapped to a pair of UCS-2 codes. Such CCS codes (lost codes) aren’t simply rejected; instead, they are mapped to the default UCS-2 code (which is currently the code of the ? character).