BREAKAGE AHEAD: UTF-8, internationalization, TrueType fonts, oh my!

uli

So in an effort to achieve feature parity with the original BASIC Engine, I added support for national keyboards a while ago. In doing so, I realized that there are quite a few deficiencies in the way text is currently handled.

The existing default font (ATI 6x8), which is a CP437 font, does not support all of the fancy Latin characters. The randomly-encoded 8-bit fonts that I have thrown together and that are a holdover from the original Famicom system obviously don't either. And even if they did, it would be a major undertaking to make these fonts in any way "compatible" with one another. And when we factor the Japanese font in, the whole approach breaks down completely.

Meanwhile, in another corner of the system, I've been working on exercising the new on-board C compiler (TCC) and the API I have thrown together to program the system in C by porting a variety of text editors. Why text editors? For one, the current editor (E) is not very good. For another, text editors typically are old-ass code bases with a gigantic API footprint, so if those monsters work, so will everything else. The biggest issue I have encountered is that modern editors don't play well with 40-year-old IBM PC code pages.

In a third corner I have for a long time considered how to add proper internationalization. It would appear that people who are learning something new are better at it when they do so in a language they already know, and the enormous 512MB of RAM and the gigantic SD cards could be put to good use by adding a bunch of translations. The whole font and encoding chaos would, however, make it a pain in the ass to interact with translators, both human and machine.

The solution to this whole mess is, of course, Unicode, specifically UTF-8. The catch with UTF-8 is that it is a multi-byte encoding, meaning that each character may be encoded in any number of bytes, from one to four. This means that there is no direct correspondence anymore between a character on the screen, a character read from the keyboard, a character read or written to and from a file, and a character in a string in memory.
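To make the variable width concrete, here is a minimal sketch in Python (used purely for illustration; it is not part of the BASIC Engine). It prints how many bytes a few sample characters occupy in UTF-8. The sample characters and the helper name `utf8_len` are picked for this example only.

```python
# Illustration only: how many bytes different characters need in UTF-8.
# "A", "ä", "€", and an emoji, written as escapes to avoid encoding ambiguity.
samples = ["A", "\u00e4", "\u20ac", "\U0001F601"]


def utf8_len(ch: str) -> int:
    """Return the number of bytes the single character `ch` occupies in UTF-8."""
    return len(ch.encode("utf-8"))


for ch in samples:
    print(f"U+{ord(ch):04X} takes {utf8_len(ch)} byte(s)")
```

The four samples cover the one-, two-, three- and four-byte cases, which is why a byte offset in a string no longer tells you which character you are looking at.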

The bottom line is that everything needed to change, and it was a big, big pain. While doing it I also decided to chuck out the old fonts and rely on TrueType fonts exclusively instead of trying to make the ancient bitmap fonts Unicode-ready. In order to reap a tangible benefit from this whole exercise, I have also extracted all messages and help text and piped them through various machine translation systems, then had them translated back to German for verification. The result is that we now have a Unicode-capable text subsystem based on TrueType fonts, and complete internationalization of system messages and help texts in four languages (English, German, French and Spanish). Which is nice.

There is, unfortunately, no way to do all of this without breaking some stuff. There will obviously be bugs, but there are also more-or-less unavoidable incompatibilities. So here is what you need to know as a user:

Text editor

I had to chuck out E because it does not support multi-byte encodings. In theory it would have been possible to fix it, but it was never that great to begin with, and I had already ported atto as an exercise, so I threw that in. I think it's a major upgrade, and it fully supports UTF-8.

Fonts

The ATI 6x8 font is available as a TrueType font, but it still only contains the characters in CP437. I have therefore decided to throw it out and replace it with a 6x8 font from the HP 100 LX that supports all major non-CJK scripts.

Strings

Since now all strings are UTF-8 encoded, you can no longer rely on one byte in a string to be one character. This means, for instance, that LEFT$("äöü", 2) is ä, not äö, as one might expect.
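The effect can be reproduced outside the system. The following Python sketch mimics a byte-wise LEFT$ on the UTF-8 encoding of the string; `left_bytes` is a hypothetical stand-in written for this example, not the interpreter's actual code.

```python
def left_bytes(s: str, n: int) -> bytes:
    """Hypothetical byte-wise LEFT$: the first n bytes of the UTF-8 encoding of s."""
    return s.encode("utf-8")[:n]


text = "\u00e4\u00f6\u00fc"  # "äöü": three characters, two bytes each

print(len(text.encode("utf-8")))             # total number of bytes
print(left_bytes(text, 2).decode("utf-8"))   # two bytes are exactly one character
print(left_bytes(text, 3))                   # three bytes end in the middle of "ö"
```

An odd byte count cuts a character in half, so the result is no longer valid UTF-8 on its own.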

Program text

Since all strings are UTF-8, all program text read from and written to files is UTF-8 as well. This causes problems with existing programs, which are, well, not UTF-8. To allow loading these programs I have added an option to LOAD that converts single-byte encoded programs to UTF-8. This conversion depends on which font the program to be loaded uses. To load the Tetris demo, for instance, you would use LOAD "tetris.bas" FONT 2.
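As a rough idea of what such a conversion involves, here is a Python sketch that re-encodes single-byte program text as UTF-8. It assumes the legacy text is CP437 (the encoding of the old ATI 6x8 font); which encoding each FONT number actually selects is not shown here, and `legacy_to_utf8` is a made-up name for this example.

```python
def legacy_to_utf8(raw: bytes, codepage: str = "cp437") -> bytes:
    """Re-encode single-byte text as UTF-8.

    Assumption: `codepage` names the encoding of the legacy font. CP437 is
    used as the default only because the old ATI 6x8 font is a CP437 font.
    """
    return raw.decode(codepage).encode("utf-8")


# A legacy program line; 0x84, 0x94 and 0x81 are "ä", "ö" and "ü" in CP437.
legacy = b'10 PRINT "\x84\x94\x81"'
converted = legacy_to_utf8(legacy)

print(converted.decode("utf-8"))
print(len(legacy), "bytes before,", len(converted), "bytes after")
```

Each of the three umlauts grows from one byte to two, which is why the file cannot simply be loaded unchanged.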

There are a few things that are missing at the moment that I intend to add at some point:

There are no fallback fonts yet. If the current font doesn't have a glyph, the text subsystem will not try to find it in another font.
There are no CJK fonts.
The 8-bit fonts (Commodore and Amstrad) don't have any Latin (let alone non-Latin) characters.
There is no way to load custom fonts.

I am, on the whole, very happy with how this turned out. The breakage seems to be minimal, and the translations are shockingly good. To check them out, set the language with CONFIG 11,<lang>, with <lang> being 0 for English, 1 for German, 2 for French and 3 for Spanish.

Dmian

Awesome!
The only thing with old custom flavours of ASCII (like PETSCII or other ISO-646 variants) when converted to UTF-8 is that you normally lose the famous PET graphic characters (or any custom character), or they are mapped weirdly in higher (or wrong) positions.

The best reference I know for the conversion is: https://style64.org/petscii/

The good thing about Unicode is that if you want to encode your own characters, you have the Private Use Areas (U+E000 to U+F8FF in the Basic Multilingual Plane, plus planes 15 and 16) that allow you to put there whatever you want.
You hardly see a font that has glyphs for all Unicode characters.

The most common strategy as a Designer is to get one of the common ones (for example, Arial), and create glyphs for the characters you want, leaving the original glyphs for the rest. It looks weird when you suddenly get an Arial glyph in the middle of a (for example) pixel font, but it's better than showing nothing there.
One pixel font editor that uses this strategy, if I'm not mistaken, is:
http://www.pentacom.jp/pentacom/bitfontmaker2/

It's not that difficult to create a pixel font that has, at least, ISO 8859-1 using that editor.
In my case, I would love to keep using PETSCII, so I'll see what changes need to be done to keep using that font.
BTW, Style64 also has a C64 TrueType font: https://style64.org/c64-truetype

uli

Dmian I'm actually using the C64 Pro Mono font (and its mapping), so if I didn't get any magic numbers wrong, we should be fine.

I forgot to mention that I have also changed the keyboard map so that the PETSCII characters can be entered using the same keys as on Commodore keyboards (but with Alt instead of Shift and Alt-Shift instead of C=). To the extent that the mappings of the PETSCII characters are outside the private use area, that also works with the HP 6x8 font.

The HP font is from https://int10h.org/oldschool-pc-fonts/fontlist/ , which has a bunch of old PC fonts augmented with additional scripts, but they don't have an Amstrad or Commodore font, nor do they have any CJK support.

The Amstrad TTF font that is in use right now is highly deficient, though: It doesn't seem to have any of the graphical characters, but I couldn't find anything better...

Dmian

uli Perfect! 😁
What's missing from the current Amstrad font? Is there anything I can help you with?
I don't mind expanding a set, if possible.

uli

If you are so inclined, you could go to http://www.cpcwiki.eu/index.php?title=Keyboard_Versions#Character_Set_ROMs and mash those character sets together into a Creative Commons-licensed TrueType font made from scratch. 🙂

[I'd like to have the same for the other fonts as well, actually, because the random licensing terms might cause a headache some time down the road, but that's obviously a lot of work, which is why I didn't do it...]

Dmian

uli Ok, I’ll give it a go this weekend. Do you prefer CC for any particular reason? For fonts I’ve made in the past I’ve used SIL’s OFL. But I believe I made a CC font once. No problem with using any license in particular.

uli

Dmian CC is just my knee-jerk reaction to everything that is not code. OFL looks fine as well. Thanks!

davegardnerisme That would make sense, but there are a few issues. The intention so far was for strings to double as byte arrays. Once you start interpreting the contents of the string that goes out the window. I don't really have a good idea how to deal with this. The obvious way is to introduce stuff like BLEN(), BLEFT$() etc. for bytewise access, or alternatively MBLEN(), MBLEFT$() etc. for UTF-8, but neither looks very appealing to me...

davegardnerisme

Looking forward to trying out new editor; I had no love for the old one.

Is the plan to “fix” LEFT$ for example? So that you can rely on it in a similar fashion to how you can slice runes in Go? E.g. make it return the leftmost number of characters, regardless of how many bytes those characters take up? Just curious really.

Rizzo2049er

uli At the end of the day, the need of having utility functions for both Unicode and byte strings should manifest itself as inevitable. And it seems to me that going the B_func_() way would be the best match to BASIC's lenient character as an explicit and unsurprising programming language.
Also, looking at what has been done to other languages in introducing Unicode or attempts thereof, there is much evidence of endless pain being inflicted onto the programmer, motivated by the most noble intentions.
It seems to me that the "BASIC way" would be providing just the barely necessary, such as the traditional string functions and I/O commands with Unicode code point semantics for UTF-8. In that vein, the least painful method for dealing with invalid UTF-8 sequences may be simply the official Unicode recommendation of replacing all error instances with U+FFFD.
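For illustration, this is what the U+FFFD replacement policy looks like in Python, whose built-in decoder offers it via `errors="replace"`. This is only a sketch of the policy, not a statement about how the BASIC Engine handles invalid input; `decode_lenient` is a name invented here.

```python
def decode_lenient(raw: bytes) -> str:
    """Decode UTF-8, substituting U+FFFD for every invalid sequence."""
    return raw.decode("utf-8", errors="replace")


# 0xFF can never appear in valid UTF-8, so it is replaced.
broken = b"abc\xffdef"
result = decode_lenient(broken)

print(result)
print([f"U+{ord(ch):04X}" for ch in result])
```

The valid text on either side of the bad byte survives untouched, so one corrupt byte does not take the rest of the string down with it.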

Dmian

uli Let's see if something like this will do:
http://damianvila.com/basicengine/AmstradCPC464-Regular.ttf

I mapped the characters to their corresponding Unicode code points, following this: https://en.wikipedia.org/wiki/Amstrad_CPC_character_set

Control characters are not in their original position, but in their Unicode representation. If you need me to put them in their position (the square representing null in 0000, etc), let me know and I'll see if I can move them there.

In the case of the rocket, the bomb and the cloud with lightning, I used PUA characters (E000, E001, E002), because there are no equivalent glyphs in Unicode.

More characters can be added, if needed, to complete the Extended Latin set, or add characters needed for Mac and Windows. The style is not difficult to follow, and besides, the glyphs used for local changes are pretty ugly, so I guess I can't do it worse than that... 😆
Let me know if it's what you need, and what you think.

uli

Dmian Thanks a lot! Unfortunately this font doesn't render very well. It doesn't have a crisp outline like the others. You can see for yourself, using LOAD FONT "<filename>" SIZE 8,8 in the latest build.

Rizzo2049er That makes sense. I implemented it that way, adding BLEN(), BLEFT$(), BRIGHT$() and BMID$(), and fixing CHR$() and ASC() to return/accept UTF-8 as well. Thank you for your input.
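The split between the two families can be sketched in Python as follows. The function names only echo the BASIC ones; they are analogies written for this example and say nothing about the interpreter's internals.

```python
def clen(s: str) -> int:
    """Like LEN() with code point semantics: number of characters."""
    return len(s)


def blen(s: str) -> int:
    """Like BLEN(): number of bytes in the UTF-8 encoding."""
    return len(s.encode("utf-8"))


def cleft(s: str, n: int) -> str:
    """Like LEFT$() with code point semantics: the first n characters."""
    return s[:n]


def bleft(s: str, n: int) -> bytes:
    """Like BLEFT$(): the first n bytes of the UTF-8 encoding."""
    return s.encode("utf-8")[:n]


text = "\u00e4\u00f6\u00fc"  # "äöü"

print(clen(text), blen(text))  # character count versus byte count
print(cleft(text, 2))          # two whole characters
print(bleft(text, 2))          # two bytes, i.e. only the first character

# ASC()/CHR$() analogues working on code points rather than bytes:
euro = chr(0x20AC)
print(euro, hex(ord(euro)))
```

Keeping both families explicit means the programmer always states whether a count is in characters or in bytes.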

Dmian

uli I can try other ways to create it. I’ll give it a spin to see if I can correct that.
Edit: I think it may be metrics. I'll see if I can correct it.

Dmian

uli Ok, I think this may be it:

http://damianvila.com/basicengine/cpc464.ttf

There's something I can't quite control when I create my fonts, and they turn out ugly. So, funnily enough, I resorted to taking one of the 8x8 fonts from the collection of pixel fonts you linked, importing it into my font program, getting rid of the glyphs and info, and replacing everything with my own glyphs and info (luckily, I had created my font with the same metrics as those). This allowed it to be rendered correctly. I'm probably forgetting something, but I don't know what.
In any case, it seems the font works. Let me know if you see something peculiar or strange about it.

Rizzo2049er

uli You are most welcome 😃 I also think that this is probably the least painful option for porting "historic" sources from classical BASIC dialects. Anyway, I have no time to look into your new code right now, so I wonder how 8-bit versions of CHR$() and ASC() should be named (if not yet implemented). I guess the "B for byte" prefix would do here as well, just for consistency, notwithstanding the cryptic "feel" of those names.

uli

Dmian Works for me. Could I have a license for that? Then I could replace the existing CPC font with it.

Rizzo2049er I don't think legacy implementations of CHR$() and ASC() are required. a$[0] is equivalent to the old ASC(a$), and CHR$(a) can be replaced by something like a$=[a]. That doesn't work in expressions, though, so you may have to add a couple extra characters, but that is not reason enough for me to add a BCHR$().

Dmian

uli I don't know how you do it in other systems, but when you get the info on a Mac, you can see the licence. I published it with a CC0 1.0 Universal (public domain, no need for recognition).
I'll put a page for these fonts soon on my website, explicitly indicating the licence, but it's in the font metadata already.

uli

Ah, OK. I can see it in the font information in fontforge. Thanks again!

Dmian

uli 😉

Btw, I'll see if I can expand the character set to include, at least, all Western European languages. I can transplant my Kana designs to this font too, if you think it may be helpful.