Hull and Outfitting

Using Unicode as the new Marine Internal Format

Last Updated Jan 09, 2024

The major design decision for the Unicode conversion of Marine was to replace ASCII codes with 32-bit Unicode scalar (US) values as the integer character representation, holding one US character code per array element, and to use the UTF8 format for character byte strings, where a single character occupies up to 4 bytes.
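The two representations can be illustrated with a minimal Python sketch. This is not Marine code: the function names `to_scalars` and `to_utf8` and the sample string are assumptions made for illustration only.

```python
def to_scalars(text: str) -> list[int]:
    """Return one Unicode scalar value (an integer) per character."""
    return [ord(ch) for ch in text]


def to_utf8(text: str) -> bytes:
    """Return the UTF8 byte string for the same text."""
    return text.encode("utf-8")


sample = "Ø12 €"            # hypothetical sample: 5 characters
scalars = to_scalars(sample)  # one array element per character
utf8 = to_utf8(sample)        # 1 to 4 bytes per character

print(len(scalars))  # 5 characters
print(len(utf8))     # 8 bytes: Ø takes 2, € takes 3, the rest 1 each
```

The scalar array always has one element per character, while the length of the byte string depends on which characters it holds.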

The points below describe a few important properties of Unicode scalars and the UTF8 format.

A Unicode scalar is an integer code, held here in 32 bits, which uniquely represents a single Unicode character. Valid scalar values run from 0 to 0x10FFFF (excluding the surrogate range 0xD800 to 0xDFFF), so the 32-bit range copes uniquely with all the world's character sets.
For the ASCII character set (codes 0 to 127, including the printable codes 32 to 126) the ASCII codes are equal to the Unicode scalar codes.
UTF8 encoding needs 1, 2, 3 or 4 bytes to represent a Unicode scalar. With UTF8, the number of characters held must be clearly distinguished from the number of bytes needed to represent them, as the two cannot be assumed to be the same.
Within UTF8 the ASCII characters can always be found by a simple byte by byte search in either direction.
UTF8 allows the first byte of any adjacent character to be found by a simple byte by byte search in either direction, and every first byte yields the number of bytes in its character.
Once the first byte of a UTF8 character has been found (it may itself be an ASCII character such as $, /, &, space or ~), the subsequent bytes of that character are never ASCII bytes. An ASCII byte found in a UTF8 string is therefore always a genuine character and never part of another character.
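The UTF8 properties listed above can be sketched in Python as follows. The helper names (`is_continuation`, `utf8_char_length`, `char_start`, `count_characters`) and the sample data are hypothetical and are not part of Marine; the bit patterns are those defined by the UTF8 encoding itself.

```python
def is_continuation(byte: int) -> bool:
    """True for a non-first byte of a multi-byte character (10xxxxxx)."""
    return (byte & 0xC0) == 0x80


def utf8_char_length(first_byte: int) -> int:
    """Return the number of bytes in a character, given its first byte."""
    if first_byte < 0x80:            # 0xxxxxxx: ASCII, 1 byte
        return 1
    if (first_byte & 0xE0) == 0xC0:  # 110xxxxx: 2 bytes
        return 2
    if (first_byte & 0xF0) == 0xE0:  # 1110xxxx: 3 bytes
        return 3
    if (first_byte & 0xF8) == 0xF0:  # 11110xxx: 4 bytes
        return 4
    raise ValueError("not a valid UTF8 first byte")


def char_start(data: bytes, index: int) -> int:
    """Search backward from index to the first byte of its character."""
    while index > 0 and is_continuation(data[index]):
        index -= 1
    return index


def count_characters(data: bytes) -> int:
    """Count characters by counting first bytes, not all bytes."""
    return sum(1 for b in data if not is_continuation(b))


data = "Ø12/€".encode("utf-8")  # hypothetical sample: 5 characters, 8 bytes

start = char_start(data, len(data) - 1)  # last byte belongs to €
print(start, utf8_char_length(data[start]))  # 5 3
print(data.find(b"/"))                       # 4: a plain byte search is safe
print(count_characters(data), len(data))     # 5 8
```

Because every byte of a multi-byte character has its top bit set, the plain byte search for `/` cannot match in the middle of another character.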
