SMOKE-16 Portable Character Encoding
------------------------------------
$Id: portable.txt,v 1.3 2001/09/07 17:36:43 bsittler Exp $

This document contains a brief description of the SMOKE-16
portable character encoding. --BCWS

-----------------------------------------------------------------

NOTE
=================================================================
The Blue Madness refers to EBCD*C, a bizarre set of non-standard
character encodings "invented" by a certain ancient and nameless
mainframe manufacturer as part of a devious customer-control
plot. Fortunately, it is rapidly being replaced by ASCII-derived
encodings.
=================================================================

OVERVIEW OF THE SMOKE-16 PORTABLE CHARACTER ENCODING

Object files must remain portable among different SMOKE-16
toolset hosts. This includes textual information such as symbol
names and archive member names. To keep this information portable
among hosts with incompatible native character encodings (such as
The Blue Madness, ISO 8859-x, SJIS, UTF-8 and ASCII,) the
SMOKE-16 toolset uses a portable character encoding (the
"portable encoding") for the "string table" section of object
files (see 'a_out.txt' in the 'doc' directory for more
information on the string table, and SMOKE-16 object files in
general.)

dec  oct   hex     char    ascii name
                 +------+
  0  0000  0x00  | '\0' |  null (nul)
                   ...
  7  0007  0x07  | '\a' |  bell (bel)
  8  0010  0x08  | '\b' |  backspace (bs)
  9  0011  0x09  | '\t' |  character tabulation (ht)
 10  0012  0x0a  | '\n' |  line feed (lf)
 11  0013  0x0b  | '\v' |  line tabulation (vt)
 12  0014  0x0c  | '\f' |  form feed (ff)
 13  0015  0x0d  | '\r' |  carriage return (cr)
                   ...
 32  0040  0x20  | ' '  |  space
 33  0041  0x21  | '!'  |  exclamation mark
 34  0042  0x22  | '\"' |  quotation mark
 35  0043  0x23  | '#'  |  number sign
 36  0044  0x24  | '$'  |  dollar sign
 37  0045  0x25  | '%'  |  percent sign
 38  0046  0x26  | '&'  |  ampersand
 39  0047  0x27  | '\'' |  apostrophe
 40  0050  0x28  | '('  |  left parenthesis
 41  0051  0x29  | ')'  |  right parenthesis
 42  0052  0x2a  | '*'  |  asterisk
 43  0053  0x2b  | '+'  |  plus sign
 44  0054  0x2c  | ','  |  comma
 45  0055  0x2d  | '-'  |  hyphen-minus
 46  0056  0x2e  | '.'  |  full stop
 47  0057  0x2f  | '/'  |  solidus
 48  0060  0x30  | '0'  |  digit zero
 49  0061  0x31  | '1'  |  digit one
 50  0062  0x32  | '2'  |  digit two
 51  0063  0x33  | '3'  |  digit three
 52  0064  0x34  | '4'  |  digit four
 53  0065  0x35  | '5'  |  digit five
 54  0066  0x36  | '6'  |  digit six
 55  0067  0x37  | '7'  |  digit seven
 56  0070  0x38  | '8'  |  digit eight
 57  0071  0x39  | '9'  |  digit nine
 58  0072  0x3a  | ':'  |  colon
 59  0073  0x3b  | ';'  |  semicolon
 60  0074  0x3c  | '<'  |  less-than sign
 61  0075  0x3d  | '='  |  equals sign
 62  0076  0x3e  | '>'  |  greater-than sign
 63  0077  0x3f  | '\?' |  question mark
 64  0100  0x40  | '@'  |  commercial at
 65  0101  0x41  | 'A'  |  latin capital letter a
 66  0102  0x42  | 'B'  |  latin capital letter b
 67  0103  0x43  | 'C'  |  latin capital letter c
 68  0104  0x44  | 'D'  |  latin capital letter d
 69  0105  0x45  | 'E'  |  latin capital letter e
 70  0106  0x46  | 'F'  |  latin capital letter f
 71  0107  0x47  | 'G'  |  latin capital letter g
 72  0110  0x48  | 'H'  |  latin capital letter h
 73  0111  0x49  | 'I'  |  latin capital letter i
 74  0112  0x4a  | 'J'  |  latin capital letter j
 75  0113  0x4b  | 'K'  |  latin capital letter k
 76  0114  0x4c  | 'L'  |  latin capital letter l
 77  0115  0x4d  | 'M'  |  latin capital letter m
 78  0116  0x4e  | 'N'  |  latin capital letter n
 79  0117  0x4f  | 'O'  |  latin capital letter o
 80  0120  0x50  | 'P'  |  latin capital letter p
 81  0121  0x51  | 'Q'  |  latin capital letter q
 82  0122  0x52  | 'R'  |  latin capital letter r
 83  0123  0x53  | 'S'  |  latin capital letter s
 84  0124  0x54  | 'T'  |  latin capital letter t
 85  0125  0x55  | 'U'  |  latin capital letter u
 86  0126  0x56  | 'V'  |  latin capital letter v
 87  0127  0x57  | 'W'  |  latin capital letter w
 88  0130  0x58  | 'X'  |  latin capital letter x
 89  0131  0x59  | 'Y'  |  latin capital letter y
 90  0132  0x5a  | 'Z'  |  latin capital letter z
 91  0133  0x5b  | '['  |  left square bracket
 92  0134  0x5c  | '\\' |  reverse solidus
 93  0135  0x5d  | ']'  |  right square bracket
 94  0136  0x5e  | '^'  |  circumflex accent
 95  0137  0x5f  | '_'  |  low line
 96  0140  0x60  | '`'  |  grave accent
 97  0141  0x61  | 'a'  |  latin small letter a
 98  0142  0x62  | 'b'  |  latin small letter b
 99  0143  0x63  | 'c'  |  latin small letter c
100  0144  0x64  | 'd'  |  latin small letter d
101  0145  0x65  | 'e'  |  latin small letter e
102  0146  0x66  | 'f'  |  latin small letter f
103  0147  0x67  | 'g'  |  latin small letter g
104  0150  0x68  | 'h'  |  latin small letter h
105  0151  0x69  | 'i'  |  latin small letter i
106  0152  0x6a  | 'j'  |  latin small letter j
107  0153  0x6b  | 'k'  |  latin small letter k
108  0154  0x6c  | 'l'  |  latin small letter l
109  0155  0x6d  | 'm'  |  latin small letter m
110  0156  0x6e  | 'n'  |  latin small letter n
111  0157  0x6f  | 'o'  |  latin small letter o
112  0160  0x70  | 'p'  |  latin small letter p
113  0161  0x71  | 'q'  |  latin small letter q
114  0162  0x72  | 'r'  |  latin small letter r
115  0163  0x73  | 's'  |  latin small letter s
116  0164  0x74  | 't'  |  latin small letter t
117  0165  0x75  | 'u'  |  latin small letter u
118  0166  0x76  | 'v'  |  latin small letter v
119  0167  0x77  | 'w'  |  latin small letter w
120  0170  0x78  | 'x'  |  latin small letter x
121  0171  0x79  | 'y'  |  latin small letter y
122  0172  0x7a  | 'z'  |  latin small letter z
123  0173  0x7b  | '{'  |  left curly bracket
124  0174  0x7c  | '|'  |  vertical line
125  0175  0x7d  | '}'  |  right curly bracket
126  0176  0x7e  | '~'  |  tilde
                   ...
                 +------+

Not coincidentally, this portable character encoding includes all
the printable ASCII characters and those ASCII control characters
having standard C character escape sequences ('\a', '\b', '\t',
'\n', '\v', '\f' and '\r'.)
The null character ('\0') is also included, since it has the same
value in every character encoding; it is used to terminate
entries in the string table.

SYMBOLS AND ARCHIVE MEMBERS

Symbol names and archive member names are restricted to
characters from the portable encoding. This means that your
object files need names that are translatable to ASCII before
they can be placed in a SMOKE-16 object archive. This ensures
that archive members can be manually extracted on a wide range of
systems. Unfortunately, it also means you can't use extended
characters from The Blue Madness, ISO 8859-x, SJIS or UTF-8,
arbitrary ASCII control characters, or other extended characters
in symbol names or in archive member filenames. That's the price
of wide portability.

ASSEMBLING PORTABLE PROGRAMS

By default, when you assemble SMOKE-16 programs, the resulting
SMOKE-16 object files use the assembling machine's native
character encoding (the "native encoding") for character and
string constants. These object files can still be used on
machines whose character encoding differs from the native
encoding, but program logic (including any character and string
constants) remains in the native encoding.

When you give the '-portable' option to 'as', character and
string constants are assembled using the portable encoding
instead of the native encoding, and characters outside the
portable character encoding must be referred to numerically,
using hexadecimal or octal character escape sequences.

Of course, the ideal solution would be for every host to use the
same character set.

RUNNING PORTABLE PROGRAMS

When you run SMOKE-16 executables using the SMOKE-16 emulator
'emu', strings from the emulated SMOKE-16 environment are passed
directly to the emulating machine's C library and/or system
calls, with no character encoding translation. For this reason,
programs must be assembled using a character encoding compatible
with the emulating machine's execution character encoding.
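
EXAMPLE: CHECKING A NAME AGAINST THE PORTABLE ENCODING

The restriction described under SYMBOLS AND ARCHIVE MEMBERS can
be sketched in C as a range check against the table above. This
is only an illustration, not code from the SMOKE-16 toolset: the
function names 'portable_has_code' and 'portable_name_ok' are
hypothetical. The sketch assumes a name has already been
translated to portable code values, so it compares numbers rather
than host character constants (on a Blue Madness host, 'A' is not
65).

```c
#include <stddef.h>

/* Hypothetical helper (not part of the SMOKE-16 toolset).
 * Returns 1 if the code value c appears in the portable encoding
 * table, 0 otherwise. Values are portable code numbers, not host
 * character constants. */
int portable_has_code(unsigned int c)
{
    return c == 0x00                    /* null (nul)               */
        || (c >= 0x07 && c <= 0x0d)     /* '\a' through '\r'        */
        || (c >= 0x20 && c <= 0x7e);    /* space through tilde      */
}

/* Hypothetical helper (not part of the SMOKE-16 toolset).
 * Returns 1 if every code in the null-terminated sequence 'name'
 * is in the portable encoding, 0 otherwise. The terminating null
 * ends the scan, as it ends an entry in the string table. An
 * empty name passes vacuously; whether the real tools accept
 * empty names is not stated in this document. */
int portable_name_ok(const unsigned char *name)
{
    size_t i;

    for (i = 0; name[i] != 0x00; i++) {
        if (!portable_has_code(name[i]))
            return 0;
    }
    return 1;
}
```

For instance, the code sequence 0x5f 0x6d 0x61 0x69 0x6e 0x00
("_main") passes the check, while any sequence containing 0x1b
(escape), 0x7f (delete) or a value above 0x7e does not.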