Unicode support
nhmall
2015-01-09 01:43:05 UTC
What are people's thoughts on the best approach to achieving the
following goals in the base nethack code:
- Allow players to use any language to name themselves or
their pets and possessions, and have those names used in the game
messages.
- Display Unicode characters on the map for various things.

While a lot (majority?) of projects have chosen to use UTF-8 and
variable 8-bit character strings, it would also be possible for nethack
to stray from the pack and internally use UTF-32 with strings of
fixed-size 32-bit characters and convert them to UTF-8 for input and
output. The code is already set up that way.

The obvious argument against using UTF-32 internally is the wasted
bytes, just like any other project. On the other hand, converting the
nethack code to use variable character strings everywhere might be the
challenge for UTF-8.
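
To make the UTF-32 option concrete, here is a minimal sketch of the output-side conversion: fixed-size 32-bit characters internally, encoded to UTF-8 only at the I/O boundary. The function names utf8_encode and utf32_to_utf8 are made up for illustration and are not taken from the nethack sources.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8 into out (room for 4 bytes required).
 * Returns the number of bytes written, or 0 if cp is a surrogate or
 * lies beyond U+10FFFF. Hypothetical helper, not a nethack function. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0; /* surrogates are not valid characters */
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Convert a 0-terminated UTF-32 string to a NUL-terminated UTF-8 string.
 * Returns the number of bytes written (excluding the NUL), or (size_t)-1
 * if the input is invalid or the result does not fit in dstsize bytes. */
size_t utf32_to_utf8(const uint32_t *src, char *dst, size_t dstsize)
{
    size_t n = 0;
    unsigned char tmp[4];

    for (; *src != 0; src++) {
        size_t i, len = utf8_encode(*src, tmp);

        if (len == 0 || n + len + 1 > dstsize)
            return (size_t) -1;
        for (i = 0; i < len; i++)
            dst[n++] = (char) tmp[i];
    }
    if (n + 1 > dstsize)
        return (size_t) -1;
    dst[n] = '\0';
    return n;
}
```

The trade-off shows up in the buffer sizes: the internal string costs four bytes per character regardless of script, while the UTF-8 output costs one to four.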

What do you think?

Some potentially related reading:
http://en.wikipedia.org/wiki/UTF-32
http://en.wikipedia.org/wiki/UTF-8
http://www.joelonsoftware.com/articles/Unicode.html
http://stackoverflow.com/questions/496321/utf8-utf16-and-utf32
http://utf8everywhere.org/
http://programmers.stackexchange.com/questions/102205/should-utf-16-be-considered-harmful
ais523
2015-01-09 19:09:18 UTC
I've written a blog post on the issue here:
http://nethack4.org/blog/unicode.html

I'll be collecting opinions on this that I hear from other people, and
sending the consensus opinion to the DevTeam, if one emerges.
--
ais523
NetHack4 maintainer
Ray Chason
2015-01-10 05:16:47 UTC
Would you mind if I add:
http://sourceforge.net/projects/nethack-i18n/
to that list? The primary intention here was a full i18n job (!) but
secondarily it uses UTF-8 throughout. Thus it allows any Unicode string for
names, engravings, etc. and Unicode symbols on the map. My current project is
adding support for combining characters on the map.

Some issues I encountered, and am still encountering:

* Engravings: The scuffing routine needs to be modified, or it will produce
invalid UTF-8. A simple approach, which an early NetHack-i18n used, is to check
the byte being scuffed to see if it is a continuation byte, and replace the
whole character with '?'. The result is at least non-perverse. The current
nethack-i18n takes the character properties into account, assigning a weight of
2 to full width CJK characters, 0 to combining characters and 1 to all others,
and the weight determines the chance that the character will be scuffed. Full
width characters scuff to U+FF1F FULLWIDTH QUESTION MARK, as in JNetHack.

* Various user interfaces:
* The old school TTY interface is a pain to convert for Unicode, and
NetHack-i18n does not support it. The Curses interface makes a nice
substitute, and leaves any character conversion issues up to the Curses
library. I suggest win32a, a derivative of PDCurses, for the Windows
port; see http://www.projectpluto.com/win32a.htm .
* The label widget in the Xorg version of Xaw is broken, and cannot display
Unicode. The Xorg people refuse to fix it. That might not be a problem,
since you're not translating the game text.
* In the unmodified NetHack, only the Win32 interface supports tabbed menu
columns. You'll want these for all interfaces, including Curses, because
bytes, characters and columns do not correspond in UTF-8. A bonus is that
all interfaces (except Curses of course) can support proportional fonts.
* I shudder to think how Unicode might play out in the older interfaces
(MS-DOS, Amiga, Atari ST, OS/2, BeOS, Mac Carbon). NetHack-i18n does not
address this issue.

* Centering in the text tombstone needs to take character width and multibyte
characters into account. This is similar to lining up menu columns in Curses.

* Capitalization, the a/an rule and such: The default properties work well for
case mapping (though Dutch players, and we've had a few of those at RGRN, might
trip over that IJ digraph). Normalization form D can help with choosing "a" or
"an". Name your fruit "éclair", with the accent, in unmodified NetHack and it
will say "a éclair" because it doesn't recognize "é" as a vowel; NetHack-i18n
normalizes, separating the accent from the "e", and gives "an éclair".

* Buffer sizes in the binary files: PL_NSIZ may be limiting for some scripts.
A character can take as many as three bytes, or four if supplementary characters
are used. This might not be a problem for Chinese and Japanese, as a single
kanji can encode as much information as several Latin letters; but if you're
using, say, Devanagari, you're taking up three bytes for as much information as
one carries in ASCII. NetHack-i18n does not (yet) address this issue, because
doing so will break save compatibility.
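
As a rough sketch of two of the points above (the simple continuation-byte approach to scuffing, and fitting a name into a fixed-size buffer such as one of PL_NSIZ bytes), both come down to the same test for a UTF-8 continuation byte. The function names here are hypothetical and this is not the NetHack-i18n code; it assumes the input is already valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* True if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
static int utf8_is_cont(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Replace the whole character that contains byte index pos with a
 * single '?', shifting the rest of the string down.
 * Assumes s is valid UTF-8 and pos < strlen(s).
 * Returns the new length of s in bytes. */
size_t scuff_char_at(char *s, size_t pos)
{
    size_t len = strlen(s);
    size_t start = pos, end = pos + 1;

    /* Back up to the lead byte of the character. */
    while (start > 0 && utf8_is_cont((unsigned char) s[start]))
        start--;
    /* Advance past any remaining continuation bytes. */
    while (end < len && utf8_is_cont((unsigned char) s[end]))
        end++;

    s[start] = '?';
    memmove(s + start + 1, s + end, len - end + 1); /* includes the NUL */
    return len - (end - start) + 1;
}

/* Copy src into dst (dstsize bytes, including the NUL), cutting only
 * at a character boundary so the result is never a split sequence.
 * Returns the number of bytes copied, excluding the NUL. */
size_t utf8_copy_trunc(char *dst, const char *src, size_t dstsize)
{
    size_t n = strlen(src);

    if (dstsize == 0)
        return 0;
    if (n >= dstsize) {
        n = dstsize - 1;
        /* If the cut would land inside a character, drop that character. */
        while (n > 0 && utf8_is_cont((unsigned char) src[n]))
            n--;
    }
    memcpy(dst, src, n);
    dst[n] = '\0';
    return n;
}
```

Truncating at a boundary keeps the stored name valid, but it does not make the buffer any roomier: a Devanagari name still fits only about a third as many characters as an ASCII one.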
Martijn Lievaart
2015-01-10 19:16:38 UTC
Post by Ray Chason
* Capitalization, the a/an rule and such: The default properties work
well for case mapping (though Dutch players, and we've had a few of
those at RGRN, might trip over that IJ digraph).
Not the digraph per se, but even MS doesn't get Capelle aan de IJssel
right. Would be cool if NH did. :-)

M4
