Technical reference

Tokenised file format

A tokenised program file on a disk device has the following format.

Magic byte
FF
Program lines
Each line is stored as follows:
Bytes Format Meaning
2 Unsigned 16-bit little-endian integer. Memory location of the line following the current one. This is used internally by GW-BASIC but ignored when a program is loaded.
2 Unsigned 16-bit little-endian integer. The line number.
Variable Tokenised BASIC, see below. The contents of the line.
1 00 (NUL byte) End of line marker.
End of file marker
A 1A byte is written to mark the end of the file. This is optional; the file will be read without problems if it is omitted.
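
As a rough illustration of the record layout above, the following sketch packs one line record and reads its four-byte header back. The helper names pack_line and read_line_header are made up for this example, and the tokenised content is assumed to be supplied by the caller. Note that the content may itself contain 00 bytes (for instance inside a 16-bit integer operand), so the end of a line cannot be found by searching for the first NUL byte.

```python
import struct

def pack_line(next_addr, line_number, content):
    """Pack one program line: pointer, line number, content, NUL marker."""
    return struct.pack("<HH", next_addr, line_number) + content + b"\x00"

def read_line_header(data, pos=0):
    """Return (next_addr, line_number) from the four header bytes at pos."""
    return struct.unpack_from("<HH", data, pos)

# Hypothetical example: line 10 with two placeholder content bytes.
record = pack_line(0x1234, 10, b"\x20\x41")
next_addr, line_number = read_line_header(record)
```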

Tokenised BASIC

The printable ASCII characters in the range 20 to 7E are used for string literals, comments, variable names, and elements of statement syntax that are not reserved words. Reserved words are represented by their reserved word tokens and numeric literals are represented by numeric token sequences.

Numeric token sequences

Numeric literals are stored in tokenised programs according to the following representation. All numbers are positive; negative numbers are stored simply by preceding the number with EA, the token for -.

Class Bytes Format
Indirect line numbers 3 0E followed by an unsigned 16-bit little-endian integer.
Octal integers 3 0B followed by an unsigned 16-bit little-endian integer.
Hexadecimal integers 3 0C followed by an unsigned 16-bit little-endian integer.
Positive decimal integers less than 11 1 Tokens 11 to 1B represent 0 to 10.
Positive decimal integers less than 256 2 0F followed by an unsigned 8-bit integer.
Other decimal integers 3 1C followed by a two's complement signed 16-bit little-endian integer. GW-BASIC will recognise a negative number encountered this way but it will not store negative numbers itself using the two's complement, but rather by preceding the positive number with EA.
Single precision floating-point number 5 1D followed by a four-byte single in Microsoft Binary Format.
Double precision floating-point number 9 1F followed by an eight-byte double in Microsoft Binary Format.
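
The table above can be turned into a small reader. The sketch below is one possible way to do it, not the interpreter's own routine: the function name read_numeric_token and the kind labels it returns are invented for this example, and floating-point operands are returned as raw bytes rather than decoded.

```python
import struct

# Operand sizes in bytes for the tokens that are followed by data.
OPERAND_SIZE = {0x0B: 2, 0x0C: 2, 0x0E: 2, 0x0F: 1, 0x1C: 2, 0x1D: 4, 0x1F: 8}
UNSIGNED_KIND = {0x0B: "octal", 0x0C: "hex", 0x0E: "line"}

def read_numeric_token(data, pos):
    """Return (kind, value, length) for the numeric token sequence at pos."""
    token = data[pos]
    if 0x11 <= token <= 0x1B:
        # One-byte tokens 11..1B stand for the integers 0..10.
        return ("decimal", token - 0x11, 1)
    if token not in OPERAND_SIZE:
        raise ValueError("not a numeric token: %02X" % token)
    size = OPERAND_SIZE[token]
    operand = data[pos + 1:pos + 1 + size]
    if len(operand) != size:
        raise ValueError("truncated numeric token sequence")
    if token == 0x0F:
        return ("decimal", operand[0], 2)
    if token == 0x1C:
        return ("decimal", struct.unpack("<h", operand)[0], 3)
    if token == 0x1D:
        return ("single", operand, 5)   # raw MBF bytes, not decoded here
    if token == 0x1F:
        return ("double", operand, 9)   # raw MBF bytes, not decoded here
    return (UNSIGNED_KIND[token], struct.unpack("<H", operand)[0], 3)
```

The returned length tells a caller how many bytes to skip, which is what makes it possible to step over operands that contain 00 bytes.
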
Keyword tokens

Most keywords in PC-BASIC are reserved words. Reserved words are represented in a tokenised program by a single- or double-byte token. The complete list is below.

All function names and operators are reserved words and all statements start with a reserved word (which in the case of LET is optional). However, the converse is not true: not all reserved words are statements, functions, or operators. For example, TO and SPC( only occur as part of a statement syntax. Furthermore, some keywords that form part of statement syntax are not reserved words: examples are AS, BASE, and ACCESS.

Keywords that are not reserved words are spelt out in full text in the tokenised source.

A variable or user-defined function name must not be identical to a reserved word. The list below is an exhaustive list of reserved words that can be used to determine whether a name is legal.

The following additional reserved words are activated by the option syntax={pcjr|tandy}.

Internal use tokens

The tokens 10, 1E and 0D are known to be used internally by GW-BASIC. They should not appear in a correctly stored tokenised program file.

Microsoft Binary Format

Floating point numbers in GW-BASIC and PC-BASIC are represented in Microsoft Binary Format (MBF), which differs from the IEEE 754 standard used by practically all modern software and hardware. Consequently, binary files generated by either BASIC are fully compatible with each other and with some applications contemporary to GW-BASIC, but not easily interchanged with other software. QBASIC, for example, uses IEEE floats.

MBF differs from IEEE in the position of the sign bit and in using only 8 bits for the exponent, both in single- and in double-precision. This makes the range of allowable numbers in an MBF double-precision number smaller, but their precision higher, than for an IEEE double: an MBF single has 23 bits of precision, while an MBF double has 55 bits of precision. Both have the same range.

Unlike IEEE, the Microsoft Binary Format does not support signed zeroes, subnormal numbers, infinities or not-a-number values.

MBF floating point numbers are represented in bytes as follows:

Single
M3 M2 M1 E0
Double
M7 M6 M5 M4 M3 M2 M1 E0

Here, E0 is the exponent byte and the other bytes form the mantissa, in little-endian order so that M1 is the most significant byte. The most significant bit of M1 is the sign bit, followed by the most significant bits of the mantissa: M1 = s0 f1 f2 f3 f4 f5 f6 f7. The other bytes contain the less-significant mantissa bits: M2 = f8 f9 fA fB fC fD fE fF, and so on.

The value of the floating-point number is v = 0 if E0 = 0 and v = (-1)^s0 × mantissa × 2^(E0 - 128) otherwise, where the mantissa is formed as the binary fraction mantissa = 0.1 f1 f2 f3 ...
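
A minimal sketch of this decoding in Python follows. The function name mbf_to_float is made up for the example. Python floats are IEEE doubles, so an MBF double with more significant bits than an IEEE double can hold will be rounded by this conversion.

```python
def mbf_to_float(raw):
    """Decode a 4-byte (single) or 8-byte (double) MBF number to a float."""
    if len(raw) not in (4, 8):
        raise ValueError("expected 4 or 8 bytes")
    exponent = raw[-1]              # E0 is stored last
    if exponent == 0:
        return 0.0
    m1 = raw[-2]                    # most significant mantissa byte
    sign = -1.0 if m1 & 0x80 else 1.0
    # Replace the sign bit by the implicit leading 1 of the fraction.
    mantissa_bytes = bytes(raw[:-2]) + bytes([m1 | 0x80])
    fraction = int.from_bytes(mantissa_bytes, "little")
    mantissa = fraction / (1 << (8 * len(mantissa_bytes)))  # 0.1 f1 f2 ...
    return sign * mantissa * 2.0 ** (exponent - 128)
```

For example, the single with bytes 00 00 00 81 has mantissa 0.1 (binary) = 0.5 and exponent 129 - 128 = 1, giving the value 1.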


Protected file format

The protected format is an encrypted form of the tokenised format. GW-BASIC would refuse to show the source code of such files. This protection scheme could easily be circumvented by changing a flag in memory. Deprotection programs have circulated widely for decades and the decryption algorithm and keys were published in a mathematical magazine.

A protected program file on a disk device has the following format.

Magic byte
FE
Payload
Encrypted content of a tokenised program file, including its end of file marker but excluding its magic byte. The encryption cipher rotates through an 11-byte and a 13-byte key, so that the transformation repeats every 143 bytes (11 × 13). For each byte,
  • Subtract the corresponding byte from the 11-byte sequence
    0B 0A 09 08 07 06 05 04 03 02 01
  • Exclusive-or with the corresponding byte from the 11-byte key
    1E 1D C4 77 26 97 E0 74 59 88 7C
  • Exclusive-or with the corresponding byte from the 13-byte key
    A9 84 8D CD 75 83 43 63 24 83 19 F7 9A
  • Add the corresponding byte from the 13-byte sequence
    0D 0C 0B 0A 09 08 07 06 05 04 03 02 01
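
The four steps listed above can be sketched as follows. The text does not say whether these steps are applied when writing or when reading a protected file, so the sketch simply calls them apply_steps and provides undo_steps as their inverse; both names are invented for this example. All arithmetic is assumed to wrap around modulo 256.

```python
SEQ11 = bytes(range(11, 0, -1))   # 0B 0A ... 01
SEQ13 = bytes(range(13, 0, -1))   # 0D 0C ... 01
KEY11 = bytes.fromhex("1E 1D C4 77 26 97 E0 74 59 88 7C")
KEY13 = bytes.fromhex("A9 84 8D CD 75 83 43 63 24 83 19 F7 9A")

def apply_steps(data):
    """Apply the four listed steps, in order, to every byte of data."""
    out = bytearray()
    for i, c in enumerate(data):
        c = (c - SEQ11[i % 11]) & 0xFF
        c ^= KEY11[i % 11]
        c ^= KEY13[i % 13]
        c = (c + SEQ13[i % 13]) & 0xFF
        out.append(c)
    return bytes(out)

def undo_steps(data):
    """Invert apply_steps by undoing the steps in reverse order."""
    out = bytearray()
    for i, c in enumerate(data):
        c = (c - SEQ13[i % 13]) & 0xFF
        c ^= KEY13[i % 13]
        c ^= KEY11[i % 11]
        c = (c + SEQ11[i % 11]) & 0xFF
        out.append(c)
    return bytes(out)
```

Because both key indices return to zero together only every 143 bytes, bytes 0 and 143 of the input are transformed identically.
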
End of file marker
A 1A byte is written to mark the end of the file. This is optional; the file will be read without problems if it is omitted. Since the end-of-file marker of the tokenised program is included in the encrypted content, a protected file is usually one byte longer than its unprotected equivalent.

BSAVE file format

A memory-dump file on a disk device has the following format.

Magic byte
FD
Header
Bytes Format Meaning
2 Unsigned 16-bit little-endian integer. Segment of the memory block.
2 Unsigned 16-bit little-endian integer. Offset of the first byte of the memory block.
2 Unsigned 16-bit little-endian integer. Length of the memory block in bytes.
Payload
The bytes of the memory block.
Footer
On Tandy only, the magic byte and the six bytes of the header are repeated here. This is optional; the file will be read without problems if it is omitted.
End of file marker
A 1A byte is written to mark the end of the file. This is optional; the file will be read without problems if it is omitted.
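
A short sketch of reading such a file is given below. The function name parse_bsave is hypothetical. The payload is taken by the length stored in the header, so an optional Tandy footer or end of file marker after it is simply ignored.

```python
import struct

def parse_bsave(data):
    """Return (segment, offset, payload) from the bytes of a BSAVE file."""
    if data[:1] != b"\xfd":
        raise ValueError("missing FD magic byte")
    segment, offset, length = struct.unpack_from("<HHH", data, 1)
    payload = data[7:7 + length]
    if len(payload) != length:
        raise ValueError("file shorter than the length given in the header")
    return segment, offset, payload
```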

Cassette file format

Files on cassette are stored as frequency-modulated sound. The payload format of files on cassette is the same as for files on a disk device, but the headers are different and the files may be split into chunks.

Modulation

A 1-bit is represented by a single 1 ms wave period (1000 Hz). A 0-bit is represented by a single 0.5 ms wave period (2000 Hz).

Byte format

A byte is sent as 8 bits, most significant first. There are no start or stop bits.
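
The bit order and timing can be illustrated with the sketch below, which lists the bits of a byte in transmission order and adds up the wave periods. The function names are invented for this example.

```python
def byte_to_bits(value):
    """Return the 8 bits of a byte, most significant first."""
    return [(value >> shift) & 1 for shift in range(7, -1, -1)]

def duration_ms(data):
    """Playing time of data in ms: 1 ms per 1-bit, 0.5 ms per 0-bit."""
    total = 0.0
    for byte in data:
        for bit in byte_to_bits(byte):
            total += 1.0 if bit else 0.5
    return total
```

A run of 256 FF bytes consists only of 1-bits and therefore lasts 256 × 8 × 1 ms = 2048 ms.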

Record format

A file is made up of two or more records. Each record has the following format:

Length Format Meaning
256 bytes All FF 2048 ms pilot wave at 1000 Hz, used for calibration.
1 bit 0 Synchronisation bit.
1 byte 16 (SYN) Synchronisation byte.
256 bytes Data block.
2 bytes Unsigned 16-bit big-endian integer CRC-16-CCITT checksum.
31 bits 30 1s followed by a 0. End of record marker.

Tokenised, protected and BSAVE files consist of a header record followed by a single record which may contain multiple 256-byte data blocks, each followed by the 2 CRC bytes. Plain text program files and data files consist of a header record followed by multiple single-block records.
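
For reference, a bitwise CRC using the CCITT polynomial can be sketched as follows. The text names the checksum only as CRC-16-CCITT: the polynomial &h1021 is common to that family, but the initial value and any final inversion used on cassette are not stated here, so the init parameter and its default are assumptions of this example.

```python
def crc16_ccitt(data, init=0xFFFF):
    """Bitwise CRC-16 with polynomial 0x1021, most significant bit first."""
    crc = init
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def crc_bytes(crc):
    """Store a 16-bit checksum as two bytes in big-endian order."""
    return crc.to_bytes(2, "big")
```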

Header block format
Bytes Format Meaning
1 A5 Header record magic byte
8 8 characters Filename.
1 File type. 00 for data file, 01 for memory dump, 20 or A0 for protected, 40 for plain text program, 80 for tokenised program.
2 Unsigned 16-bit little-endian integer Length of next data record, in bytes.
2 Unsigned 16-bit little-endian integer Segment of memory location.
2 Unsigned 16-bit little-endian integer Offset of memory location.
1 00 End of header data
239 All 01 Filler
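
The header block layout adds up to exactly 256 bytes, as the sketch below shows. The function name pack_header_block and the FILE_TYPES labels are made up for this example, and padding short filenames with spaces is an assumption that the table does not confirm.

```python
import struct

# File type codes from the table; the dictionary keys are labels for this sketch.
FILE_TYPES = {"data": 0x00, "bsave": 0x01, "protected": 0x20,
              "text": 0x40, "tokenised": 0x80}

def pack_header_block(name, file_type, length, segment, offset):
    """Build the 256-byte data block of a cassette header record."""
    # Assumption: names shorter than 8 characters are padded with spaces.
    name_bytes = name.encode("ascii")[:8].ljust(8)
    block = (b"\xa5" + name_bytes + bytes([file_type])
             + struct.pack("<HHH", length, segment, offset) + b"\x00")
    return block + b"\x01" * 239

block = pack_header_block("HELLO", FILE_TYPES["tokenised"], 300, 0x13AD, 0x126D)
```
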
Data block format
Bytes Format Meaning
1 8-bit unsigned integer Number of payload bytes in last record, plus one. If zero, the next record is not the last record.
255 Payload data. If this is the last record, any unused bytes are filled by repeating the last payload byte.
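
One way to split a payload into data blocks of this shape is sketched below. The function name make_data_blocks is hypothetical. The count byte is read here as 0 for every block but the last, and as the number of payload bytes plus one in the last block. Two further assumptions are made that the table does not settle: a payload whose length is an exact multiple of 255 gets a final block with no payload bytes (since 255 + 1 would not fit in one byte), and an empty payload is filled with 00.

```python
def make_data_blocks(payload):
    """Split payload into 256-byte blocks: one count byte plus 255 data bytes."""
    blocks = []
    pos = 0
    # Full blocks that are not the last one carry a count byte of zero.
    while len(payload) - pos >= 255:
        blocks.append(b"\x00" + payload[pos:pos + 255])
        pos += 255
    tail = payload[pos:]
    # Unused bytes repeat the last payload byte (00 if there is no payload).
    fill = payload[-1:] or b"\x00"
    blocks.append(bytes([len(tail) + 1]) + tail + fill * (255 - len(tail)))
    return blocks
```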


Emulator file formats

PC-BASIC uses a number of file formats to support its emulation of legacy hardware, which are documented in this section. These file formats are not used by GW-BASIC or contemporary software.

HEX font file format

The HEX file format for bitfonts was developed by Roman Czyborra for the GNU Unifont package. PC-BASIC uses an extended version of this file format to store its fonts.

A HEX file is an ASCII text file, consisting of lines terminated by LF. Each line of this file is one of the following:

UCP code page file format

Unicode-codepage mappings are stored in UCP files.

A UCP file is an ASCII text file, consisting of lines terminated by LF. Each line of this file is one of the following:

CAS tape file format

A CAS file is a bit-level representation of cassette data introduced by the PCE emulator. CAS files produced by PC-BASIC start with the characters PC-BASIC tape followed by an EOF character (1A). This sequence is followed by seven 0 bits, followed by the tape contents. The seven zero bits are intended to ensure that the tape contents are byte-aligned; the eighth bit is supplied by the synchronisation bit following the pilot wave.

Note that PC-BASIC does not require the introductory sequence to read a CAS file correctly, nor does it require the contents of a CAS file to be byte-aligned. However, new files produced by PC-BASIC follow this convention.


Character codes

Depending on context, PC-BASIC will treat a code point in the control characters range as a control character or as a glyph defined by the active codepage which by default is codepage 437. Code points of &h80 or higher are always interpreted as a codepage glyph.

ASCII

This is a list of the American Standard Code for Information Interchange (ASCII). ASCII only covers 128 characters and defines the code point ranges &h00 to &h1F and &h7F as control characters which do not have a printable glyph assigned to them. This includes such values as the Carriage Return (CR) character that ends a program line.

In the context of this documentation, character &h1A (SUB) will usually be indicated as EOF since it plays the role of end-of-file marker in DOS.

  _0 _1 _2 _3 _4 _5 _6 _7 _8 _9 _A _B _C _D _E _F
0_ NUL SOH STX ETX EOT ENQ ACK BEL BS  HT  LF  VT  FF  CR  SO  SI 
1_ DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM  SUB ESC FS  GS  RS  US 
2_ ! " # $ % & ' ( ) * + , - . /
3_ 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4_ @ A B C D E F G H I J K L M N O
5_ P Q R S T U V W X Y Z [ \ ] ^ _
6_ ` a b c d e f g h i j k l m n o
7_ p q r s t u v w x y z { | } ~ DEL

Codepage 437

This table shows the characters that are produced by the 256 single-byte code points when the DOS Latin USA codepage 437 is loaded, which is the default. Other codepages can be loaded to assign other characters to these code points.

  _0 _1 _2 _3 _4 _5 _6 _7 _8 _9 _A _B _C _D _E _F
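0_   ☺ ☻ ♥ ♦ ♣ ♠ • ◘ ○ ◙ ♂ ♀ ♪ ♫ ☼
1_ ► ◄ ↕ ‼ ¶ § ▬ ↨ ↑ ↓ → ← ∟ ↔ ▲ ▼
2_ ! " # $ % & ' ( ) * + , - . /
3_ 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4_ @ A B C D E F G H I J K L M N O
5_ P Q R S T U V W X Y Z [ \ ] ^ _
6_ ` a b c d e f g h i j k l m n o
7_ p q r s t u v w x y z { | } ~ ⌂
8_ Ç ü é â ä à å ç ê ë è ï î ì Ä Å
9_ É æ Æ ô ö ò û ù ÿ Ö Ü ¢ £ ¥ ₧ ƒ
A_ á í ó ú ñ Ñ ª º ¿ ⌐ ¬ ½ ¼ ¡ « »
B_ ░ ▒ ▓ │ ┤ ╡ ╢ ╖ ╕ ╣ ║ ╗ ╝ ╜ ╛ ┐
C_ └ ┴ ┬ ├ ─ ┼ ╞ ╟ ╚ ╔ ╩ ╦ ╠ ═ ╬ ╧
D_ ╨ ╤ ╥ ╙ ╘ ╒ ╓ ╫ ╪ ┘ ┌ █ ▄ ▌ ▐ ▀
E_ α ß Γ π Σ σ µ τ Φ Θ Ω δ ∞ φ ε ∩
F_ ≡ ± ≥ ≤ ⌠ ⌡ ÷ ≈ ° ∙ · √ ⁿ ² ■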

Keycodes

Scancodes

PC-BASIC uses PC/XT scancodes, which originated on the 83-key IBM Model F keyboard supplied with the IBM PC 5150. The layout of this keyboard was quite distinct from modern standard keyboards with 101 or more keys, but keys on a modern keyboard produce the same scancode as the key with the same function on the Model F. For example, the key that (on a US keyboard) produces the \ was located next to the left Shift key on the Model F keyboard and has scancode &h2B. The (US) backslash key still has this scancode, even though it is now usually found above the Enter key.

To further complicate matters, keyboards for different locales have their layout remapped in software rather than in hardware, which means that they produce the same scancode as the key that on a US keyboard is in the same location, regardless of which character they actually produce.

Therefore, the A on a French keyboard will produce the same scancode as the Q on a UK or US keyboard. The aforementioned US \ key is identified with the key that is generally found to the bottom left of Enter on non-US keyboards. For example, on my UK keyboard this is the # key. Non-US keyboards have an additional key next to the left Shift which on the UK keyboard is the \. Therefore, while this key is in the same location and has the same function as the Model F \, it has a different scancode.

In the table below, the keys are marked by their function on a US keyboard, but it should be kept in mind that the scancode is linked to the position, not the function, of the key.

Key Scancode
Esc 01
1 ! 02
2 @ 03
3 # 04
4 $ 05
5 % 06
6 ^ 07
7 & 08
8 * 09
9 ( 0A
0 ) 0B
- _ 0C
= + 0D
Backspace 0E
Tab 0F
q Q 10
w W 11
e E 12
r R 13
t T 14
y Y 15
u U 16
i I 17
o O 18
p P 19
[ { 1A
] } 1B
Enter 1C
Ctrl 1D
a A 1E
s S 1F
d D 20
f F 21
g G 22
h H 23
j J 24
k K 25
l L 26
; : 27
' " 28
` ~ 29
Left Shift 2A
\ | 2B
z Z 2C
x X 2D
c C 2E
v V 2F
b B 30
n N 31
m M 32
, < 33
. > 34
/ ? 35
Right Shift 36
keypad * PrtSc 37
Alt 38
Space 39
Caps Lock 3A
F1 3B
F2 3C
F3 3D
F4 3E
F5 3F
F6 40
F7 41
F8 42
F9 43
F10 44
Num Lock 45
Scroll Lock Pause 46
keypad 7 Home 47
keypad 8 48
keypad 9 Pg Up 49
keypad - 4A
keypad 4 4B
keypad 5 4C
keypad 6 4D
keypad + 4E
keypad 1 End 4F
keypad 2 50
keypad 3 Pg Dn 51
keypad 0 Ins 52
keypad . Del 53
SysReq 54
\ | (Non-US 102-key) 56
F11 57
F12 58
Left Logo (Windows 104-key) 5B
Right Logo (Windows 104-key) 5C
Menu (Windows 104-key) 5D
ひらがな/カタカナ Hiragana/Katakana (Japanese 106-key) 70
\ _ (Japanese 106-key) 73
変換 Henkan (Japanese 106-key) 79
無変換 Muhenkan (Japanese 106-key) 7B
半角/全角 Hankaku/Zenkaku (Japanese 106-key) 29
¥ | (Japanese 106-key) 7D
한자 Hanja (Korean 103-key) F1
한/영 Han/Yeong (Korean 103-key) F2
\ ? ° (Brazilian ABNT2) 73
keypad . (Brazilian ABNT2) 7E

e-ASCII codes

Alongside scancodes, most keys also carry a character value the GW-BASIC documentation calls extended ASCII. Since this is a rather overloaded term, we shall use the abbreviation e-ASCII exclusively for these values. The values returned by the INKEY$ function are e-ASCII values.

e-ASCII codes are one or two bytes long; single-byte codes are simply ASCII codes whereas double-byte codes consist of a NUL character plus a code indicating the key pressed. Some, but certainly not all, of these codes agree with the keys' scancodes.
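
The one-or-two-byte structure can be illustrated by splitting a stream of key values into individual codes. The function name split_eascii is invented for this sketch.

```python
def split_eascii(stream):
    """Split a byte string into one- and two-byte e-ASCII codes."""
    codes = []
    i = 0
    while i < len(stream):
        # A NUL byte introduces a two-byte code, if another byte follows it.
        size = 2 if stream[i] == 0 and i + 1 < len(stream) else 1
        codes.append(stream[i:i + size])
        i += size
    return codes
```

For instance, the bytes 61 00 3B 0D split into the three codes 61, 00 3B and 0D.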

Unlike scancodes, e-ASCII codes of unmodified keys and those of keys modified by Shift, Ctrl or Alt are all different.

Unmodified, Shifted and Ctrled e-ASCII codes are connected to a key's meaning, not its location. For example, the e-ASCII code for Ctrl+a is the same on a French and a US keyboard. By contrast, the Alted codes are connected to the key's location, like scancodes. The US keyboard layout is used in the table below.

Key e-ASCII e-ASCII Shift e-ASCII Ctrl e-ASCII Alt
Esc 1B 1B 1B
1 ! 31 21 00 78
2 @ 32 40 00 03 00 79
3 # 33 23 00 7A
4 $ 34 24 00 7B
5 % 35 25 00 7C
6 ^ 36 5E 1E 00 7D
7 & 37 26 00 7E
8 * 38 2A 00 7F
9 ( 39 28 00 80
0 ) 30 29 00 81
- _ 2D 5F 1F 00 82
= + 3D 2B 00 83
Backspace 08 08 7F 00 8C
Tab 09 00 0F 00 8D 00 8E
q Q 71 51 11 00 10
w W 77 57 17 00 11
e E 65 45 05 00 12
r R 72 52 12 00 13
t T 74 54 14 00 14
y Y 79 59 19 00 15
u U 75 55 15 00 16
i I 69 49 09 00 17
o O 6F 4F 0F 00 18
p P 70 50 10 00 19
[ { 5B 7B 1B
] } 5D 7D 1D
Enter 0D 0D 0A 00 8F
a A 61 41 01 00 1E
s S 73 53 13 00 1F
d D 64 44 04 00 20
f F 66 46 06 00 21
g G 67 47 07 00 22
h H 68 48 08 00 23
j J 6A 4A 0A 00 24
k K 6B 4B 0B 00 25
l L 6C 4C 0C 00 26
; : 3B 3A
' " 27 22
` ~ 60 7E
\ | 5C 7C 1C
z Z 7A 5A 1A 00 2C
x X 78 58 18 00 2D
c C 63 43 03 00 2E
v V 76 56 16 00 2F
b B 62 42 02 00 30
n N 6E 4E 0E 00 31
m M 6D 4D 0D 00 32
, < 2C 3C
. > 2E 3E
/ ? 2F 3F
PrtSc 00 72 00 46
Space 20 20 20 00 20
F1 00 3B 00 54 00 5E 00 68
F2 00 3C 00 55 00 5F 00 69
F3 00 3D 00 56 00 60 00 6A
F4 00 3E 00 57 00 61 00 6B
F5 00 3F 00 58 00 62 00 6C
F6 00 40 00 59 00 63 00 6D
F7 00 41 00 5A 00 64 00 6E
F8 00 42 00 5B 00 65 00 6F
F9 00 43 00 5C 00 66 00 70
F10 00 44 00 5D 00 67 00 71
F11 (Tandy) 00 98 00 A2 00 AC 00 B6
F12 (Tandy) 00 99 00 A3 00 AD 00 B7
Home 00 47 00 47 00 77
End 00 4F 00 4F 00 75
PgUp 00 49 00 49 00 84
PgDn 00 51 00 51 00 76
Up 00 48 00 48
Left 00 4B 00 87 00 73
Right 00 4D 00 88 00 74
Down 00 50 00 50
keypad 5 35 35 05
Ins 00 52 00 52
Del 00 53 00 53

Memory model

PC-BASIC (rather imperfectly) emulates the memory of real-mode MS-DOS. This means that memory can be addressed in segments of 64 KiB. Each memory address is given by a segment value and a byte offset of 0 to 65535 within that segment. Note that segments overlap: the actual memory address is found as segment*16 + offset. The maximum memory size that can be addressed by this scheme is thus 1 MiB, which was the size of the conventional and upper memory in real-mode MS-DOS.
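
The address calculation can be written out as a one-line function. The name linear_address is made up for this sketch.

```python
def linear_address(segment, offset):
    """Convert a segment:offset pair to a linear memory address."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must be 16-bit values")
    return segment * 16 + offset
```

Since segments overlap, different pairs can name the same byte: segment 1 with offset 0 is the same address as segment 0 with offset 16.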

Overview

Areas of memory with a special importance are:

Segment Name Purpose
&h0000 Low memory Holds machine information, among other things
&h13AD (may vary) Data segment Program code, variables, arrays, strings
&hA000 (EGA), &hB000 (MDA), &hB800 (CGA) Video segment Text and graphics on visible and virtual screens
&hC000 -- RAM font definition, among other things
&hF000 Read-only memory ROM font definition, among other things

Data segment

The data segment is organised as follows. The addresses may vary depending on the settings of various options; given here are the default values for GW-BASIC 3.23.

Offset Size (bytes) Function
&h0000 3429 Interpreter workarea. Unused in PC-BASIC; can be adjusted with the --reserved-memory option.
&h0D65 (max-files+1) * 322 File blocks: one for the program plus one for each file allowed by --max-files.
&h126D 3 + c Program code. An empty program uses 3 bytes.
&h1270 + c v Scalar variables.
&h1270 + c + v a Array variables.
&hFDFC - s s String variables, filled downward from &hFDFC
&hFDFC 512 BASIC stack, size set by CLEAR statement.
&hFFFE Top of data segment, set by CLEAR statement.
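
The offsets in the lower part of this table follow from one another, which the sketch below makes explicit. The function name and dictionary keys are invented for this example. A max-files value of 3 is not stated in the table; it is inferred from the fact that it reproduces the program code offset &h126D.

```python
WORKAREA_SIZE = 3429     # default size of the interpreter workarea, in bytes
FILE_BLOCK_SIZE = 322    # size of one file block, in bytes

def data_segment_layout(max_files, code_size=0, scalar_size=0):
    """Return the offsets of the lower areas of the data segment."""
    file_blocks = WORKAREA_SIZE
    program = file_blocks + (max_files + 1) * FILE_BLOCK_SIZE
    scalars = program + 3 + code_size      # an empty program uses 3 bytes
    arrays = scalars + scalar_size
    return {"file_blocks": file_blocks, "program": program,
            "scalars": scalars, "arrays": arrays}

layout = data_segment_layout(max_files=3)
```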