So in an effort to achieve feature parity with the original BASIC Engine, I added support for national keyboards a while ago. In doing so, I realized that there are quite a few deficiencies in the way text is currently handled.
The existing default font (ATI 6x8), which is a CP437 font, does not support all of the fancy Latin characters. The randomly-encoded 8-bit fonts that I have thrown together and that are a holdover from the original Famicom system obviously don't either. And even if they did, it would be a major undertaking to make these fonts in any way "compatible" with one another. And when we factor the Japanese font in, the whole approach breaks down completely.
Meanwhile, in another corner of the system, I've been working on exercising the new on-board C compiler (TCC) and the API I have thrown together to program the system in C by porting a variety of text editors. Why text editors? For one, the current editor (E) is not very good. For another, text editors typically are old-ass code bases with a gigantic API footprint, so if those monsters work, so will everything else. The biggest issue I have encountered is that modern editors don't play well with 40-year-old IBM PC code pages.
In a third corner I have for a long time considered how to add proper internationalization. It would appear that people who are learning something new are better at it when they do so in a language they already know, and the enormous 512MB of RAM and the gigantic SD cards could be put to good use by adding a bunch of translations. The whole font and encoding chaos would, however, make it a pain in the ass to interact with translators, both human and machine.
The solution to this whole mess is, of course, Unicode, specifically UTF-8. The catch with UTF-8 is that it is a multi-byte encoding, meaning that each character is encoded in anywhere from one to four bytes. This means that there is no longer a direct correspondence between a character on the screen, a character read from the keyboard, a character read from or written to a file, and a character in a string in memory.
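To make the variable length concrete, here is a small sketch in Python (not the system's own BASIC or C; the helper name utf8_len is made up for illustration) that prints how many bytes a few characters take up in UTF-8:

```python
# Byte length of single characters when encoded as UTF-8.
# utf8_len is a hypothetical helper name used only for this illustration.

def utf8_len(ch: str) -> int:
    """Return the number of bytes the character ch occupies in UTF-8."""
    return len(ch.encode("utf-8"))

samples = ["A", "ä", "€", "😀"]  # ASCII, Latin-1 range, BMP symbol, emoji

for ch in samples:
    encoded = ch.encode("utf-8")
    # Show the character, its byte count and the raw bytes in hex.
    print(ch, utf8_len(ch), encoded.hex())
```

Plain ASCII stays at one byte per character, which is why old ASCII-only text keeps working unchanged; anything beyond that takes two, three or four bytes.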
The bottom line is that everything needed to change, and it was a big, big pain. While doing it I also decided to chuck out the old fonts and rely on TrueType fonts exclusively instead of trying to make the ancient bitmap fonts Unicode-ready. In order to reap a tangible benefit from this whole exercise, I have also extracted all messages and help text and piped them through various machine translation systems, then had them translated back to German for verification. The result is that we now have a Unicode-capable text subsystem based on TrueType fonts, and complete internationalization of system messages and help texts in four languages (English, German, French and Spanish). Which is nice.
There is, unfortunately, no way to do all of this without breaking some stuff. There will obviously be bugs, but there are also more-or-less unavoidable incompatibilities. So here is what you need to know as a user:
Text editor
I had to chuck out E because it does not support multi-byte encodings. In theory it would have been possible to fix it, but it was never that great to begin with, and I had already ported atto as an exercise, so I threw that in. I think it's a major upgrade, and it fully supports UTF-8.
Fonts
The ATI 6x8 font is available as a TrueType font, but it still only contains the characters in CP437. I have therefore decided to throw it out and replace it with a 6x8 font from the HP 100 LX that supports all major non-CJK scripts.
Strings
Since all strings are now UTF-8 encoded, you can no longer rely on one byte in a string being one character. This means, for instance, that LEFT$("äöü", 2) is ä, not äö, as one might expect, because ä alone takes up two bytes.
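The effect can be modelled outside the system. The following Python sketch assumes that LEFT$ simply counts bytes of the UTF-8 data, which is what the example above suggests; left_bytes and left_chars are hypothetical names, and what the interpreter does when a cut lands in the middle of a character is not covered here (the sketch substitutes a replacement character).

```python
# Model of a byte-counting LEFT$ versus a character-counting one.
# Both function names are made up for this illustration.

def left_bytes(s: str, n: int) -> str:
    """Keep the first n bytes of the UTF-8 encoding of s.

    A cut in the middle of a multi-byte character is decoded as
    U+FFFD here; that is an assumption of this sketch only.
    """
    return s.encode("utf-8")[:n].decode("utf-8", errors="replace")

def left_chars(s: str, n: int) -> str:
    """Keep the first n characters (code points) of s."""
    return s[:n]

text = "äöü"
print(left_bytes(text, 2))   # two bytes cover only the first character
print(left_chars(text, 2))   # two characters
print(left_bytes(text, 4))   # four bytes cover two characters
```

For strings that contain only ASCII, both functions return the same result, so programs that stick to ASCII should not notice the difference.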
Program text
Because all strings are UTF-8, all program text read from and written to files is UTF-8 as well. This causes problems with existing programs, which are, well, not UTF-8. To allow loading these programs I have added an option to LOAD that converts single-byte encoded programs to UTF-8. This conversion depends on which font the program to be loaded uses. To load the Tetris demo, for instance, you would use LOAD "tetris.bas" FONT 2.
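As a rough idea of what such a conversion involves, here is a Python sketch that turns a single-byte encoded program line into UTF-8. It assumes the old text is CP437, matching the former default font; the mapping that LOAD actually applies for each FONT number is not described here, and legacy_to_utf8 is a hypothetical name.

```python
# Convert single-byte encoded program text to UTF-8.
# Assumption: the legacy text is CP437; other old fonts would need
# their own mapping tables, which this sketch does not attempt.

def legacy_to_utf8(raw: bytes, encoding: str = "cp437") -> bytes:
    """Decode raw from a single-byte code page and re-encode it as UTF-8."""
    return raw.decode(encoding).encode("utf-8")

# In CP437, 0x84 is ä, 0x94 is ö and 0x81 is ü.
legacy = b'10 PRINT "\x84\x94\x81"\n'
converted = legacy_to_utf8(legacy)

print(converted.decode("utf-8"))
print(len(legacy), len(converted))  # the UTF-8 version is longer
```

The ASCII part of the line passes through unchanged; only the bytes above 0x7F grow into multi-byte sequences, which is why the converted file is a few bytes larger.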
There are a few things that are missing at the moment that I intend to add at some point:
- There are no fallback fonts yet. If the current font doesn't have a glyph, the text subsystem will not try to find it in another font.
- There are no CJK fonts.
- The 8-bit fonts (Commodore and Amstrad) don't have any Latin (let alone non-Latin) characters.
- There is no way to load custom fonts.
I am, on the whole, very happy with how this turned out. The breakage seems to be minimal, and the translations are shockingly good. To check them out, set the language with CONFIG 11,<lang>, with <lang> being 0 for English, 1 for German, 2 for French and 3 for Spanish.