9 Some IBM or Microsoft charsets

Recode provides various IBM or Microsoft code pages (see Tabular sources (RFC 1345)). An easy way to list them all from Recode itself is the command:

recode -l | grep -Ei '(CP|IBM)[0-9]'

See also the few special charsets presented in the following sections.


9.1 EBCDIC code

This charset is IBM’s Extended Binary Coded Decimal Interchange Code. This is an eight-bit code. The following three variants were implemented in Recode independently of RFC 1345:

EBCDIC

In Recode, the us..ebcdic conversion is identical to the ‘dd conv=ebcdic’ conversion, and the ebcdic..us conversion is identical to the ‘dd conv=ascii’ conversion. This charset also represents the way Control Data Corporation relates EBCDIC to 8-bit ASCII.

EBCDIC-CCC

In Recode, the us..ebcdic-ccc and ebcdic-ccc..us conversions represent the way Concurrent Computer Corporation (formerly Perkin Elmer) relates EBCDIC to 8-bit ASCII.

EBCDIC-IBM

In Recode, the us..ebcdic-ibm conversion is almost identical to the GNU ‘dd conv=ibm’ conversion. Given the exact ‘dd conv=ibm’ conversion table, Recode once said:

Codes  91 and 213 both recode to 173
Codes  93 and 229 both recode to 189
No character recodes to  74
No character recodes to 106

So I arbitrarily chose to recode 213 by 74 and 229 by 106. This makes the EBCDIC-IBM recoding reversible, but this is not necessarily the best correction. In any case, I think that GNU dd should be amended. dd and Recode should ideally agree on the same correction. So, this table might change once again.
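For the plain EBCDIC variant, the stated equivalence with dd can be checked by hand. The following is only a minimal sketch, assuming a POSIX shell with dd and od available. The sample string and the variable names are illustrative, and the comparison with Recode runs only if a recode executable is found on the search path.

```shell
# Illustrative sample text (an assumption, not taken from the manual).
text='Hello, EBCDIC'

# Convert to EBCDIC with dd and show the resulting bytes in hexadecimal.
ebcdic_hex=$(printf '%s' "$text" | dd conv=ebcdic 2>/dev/null | od -An -tx1 | tr -d ' \n')

# Convert to EBCDIC and back again; the text should come back unchanged.
roundtrip=$(printf '%s' "$text" | dd conv=ebcdic 2>/dev/null | dd conv=ascii 2>/dev/null)

echo "dd bytes:     $ebcdic_hex"
echo "round trip:   $roundtrip"

# Optional comparison: only attempted when recode is installed.
if command -v recode >/dev/null 2>&1; then
    recode_hex=$(printf '%s' "$text" | recode us..ebcdic | od -An -tx1 | tr -d ' \n')
    echo "recode bytes: $recode_hex"
fi
```

If Recode is installed, the two hexadecimal lines are expected to match for this variant; for EBCDIC-IBM, the codes listed above are where differences from ‘dd conv=ibm’ would show up.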

RFC 1345 brings into Recode 15 other EBCDIC charsets, and 21 other charsets having EBCDIC in at least one of their alias names. You can get a list of all these by executing:

recode -l | grep -i ebcdic

Note that Recode can convert a pure stream of EBCDIC characters, but it does not know how to handle the binary data that is sometimes inserted between records to delimit them and build physical blocks. If ends of lines are not marked, a fixed record size may still produce something readable, but VB or VBS blocking is likely to yield some garbage in the converted results.
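For the fixed record size case, one possible workaround outside Recode is the block and unblock conversions of dd, which pad or strip records of a given length. The sketch below is only an illustration under stated assumptions: a POSIX dd is available, the record length of 8 bytes is arbitrary, and the sample data and variable names are made up for the example. It does nothing for VB or VBS blocking.

```shell
cbs=8   # record length in bytes; an arbitrary value for this illustration

# Build fixed-length EBCDIC records: each input line is padded with
# spaces to $cbs bytes and its newline is dropped.
records_len=$(printf 'ab\ncdef\n' \
    | dd conv=ebcdic,block cbs="$cbs" 2>/dev/null \
    | wc -c | tr -d ' ')

# Convert the records back: translate to ASCII, strip trailing spaces
# from each $cbs-byte record, and append a newline.
lines=$(printf 'ab\ncdef\n' \
    | dd conv=ebcdic,block cbs="$cbs" 2>/dev/null \
    | dd conv=ascii,unblock cbs="$cbs" 2>/dev/null)

echo "bytes in fixed-length form: $records_len"
printf '%s\n' "$lines"
```

Once the records have been turned into ordinary lines this way, the result is a plain character stream of the kind Recode handles.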


9.2 IBM’s PC code

This charset is available in Recode under the name IBM-PC, with dos, MSDOS and pc as acceptable aliases. The shortest way of specifying it in Recode is pc.

The charset is aimed at the IBM PC microcomputer and its compatibles. This is an eight-bit code. This charset is fairly old in Recode: its tables were produced a long while ago by mere inspection of a printed chart of the IBM-PC codes and glyphs.

It has CR-LF as its implied surface. This means that, if the original ends of lines have to be preserved when converting out of IBM-PC, they should currently be added back by applying a surface to the other charset or, better, just never removed. Here are examples for both cases, in that order:

recode pc..l2/cl < input > output
recode pc/..l2 < input > output

RFC 1345 brings into Recode 44 ‘IBM’ charsets or code pages, and also 8 other code pages. You can get a list of all these by executing (13):

recode -l | grep -Ei '(CP|IBM)[0-9]'

All charsets or aliases beginning with the letters ‘CP’ or ‘IBM’ also have CR-LF as their implied surface. The same is true for a purely numeric alias in the same family. For example, all of 819, CP819 and IBM819 imply CR-LF as a surface. Note that ISO-8859-1 does not imply a surface, even though it shares the same tabular data as 819.

There are a few discrepancies between this IBM-PC charset and the very similar RFC 1345 charset ibm437. The IBM-PC charset has two extra characters at positions 20 (Latin-1 0xB6, pilcrow) and 21 (Latin-1 0xA7, section sign); further, it has position 250 as 0xB7, middle dot, while ibm437 has middle dot at position 249. According to this comparison of code tables: https://www.haible.de/bruno/charsets/conversion-tables/CP437.html the source for RFC 1345, dkuug.dk/IBM437.TXT, is the only source that defines the mapping this way.


9.3 Unisys’ Icon code

This charset is available in Recode under the name Icon-QNX, with QNX as an acceptable alias.

This charset follows the Unisys Icon convention, under the QNX system, of representing diacritics with code 25 escape sequences. This is a seven-bit code, even though eight-bit codes can flow through as part of the IBM-PC charset.


Footnotes

(13)

On DOS/Windows, stock shells do not know that apostrophes quote special characters like |, so one needs to use double quotes instead of apostrophes.