Using Unicode as the new AVEVA E3D Internal Format
- Last Updated: Oct 25, 2022
The major design decision for the Unicode conversion of AVEVA E3D was to use 32 bit Unicode Scalars (US) instead of ASCII codes as its integer character representation, holding 1 US character code per array element, and to use UTF8 format for its character byte strings, with up to 4 bytes to represent 1 character.
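As a rough illustration of the two representations, the following Python sketch converts one string into an array of Unicode scalars (one code per element) and into a UTF8 byte string. It is not AVEVA E3D code; the function names `to_scalars` and `to_utf8` are hypothetical and exist only for this example.

```python
def to_scalars(text: str) -> list[int]:
    """Return one Unicode scalar (integer code) per character."""
    return [ord(ch) for ch in text]


def to_utf8(text: str) -> bytes:
    """Return the UTF8 byte string for the same text."""
    return text.encode("utf-8")


# "A", pound sign (U+00A3) and euro sign (U+20AC):
# 1-byte, 2-byte and 3-byte characters in UTF8.
sample = "A\u00a3\u20ac"
scalars = to_scalars(sample)
encoded = to_utf8(sample)

print(scalars)       # [65, 163, 8364]
print(len(scalars))  # 3 (characters)
print(len(encoded))  # 6 (bytes)
```

The same three characters occupy three array elements as scalars but six bytes as UTF8, which is why character counts and byte counts have to be tracked separately.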
The list below describes a few important properties of Unicode Scalars and the UTF8 format.
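- A Unicode scalar is an integer code that uniquely represents a single Unicode character; AVEVA E3D holds it in a 32 bit integer. Unicode scalar values run from 0 to 0x10FFFF (excluding the surrogate range 0xD800 to 0xDFFF), so a 32 bit integer comfortably covers the character sets of the world.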
- For the ASCII character set (codes 0 to 127) the codes are equal to the Unicode scalar codes.
- UTF8 encoding needs 1, 2, 3 or 4 bytes to represent a Unicode scalar. With UTF8 you must distinguish clearly between the number of characters held and the number of bytes needed to represent them, as the two cannot be assumed to be the same.
- Within UTF8 the ASCII characters can always be found by a simple byte by byte search in either direction.
- UTF8 allows the first byte of any adjacent character to be found by simple byte by byte search in either direction, and every first byte yields the number of bytes in the character.
- Every byte of a multi-byte UTF8 character, the first byte included, has a value of 128 or above, so it is never an ASCII byte (for example, $, /, &, space, ~, and so on). So when you have found an ASCII byte, it is a genuine character and not part of another character.
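The byte-level properties in the list above can be sketched in Python as follows. This is a minimal illustration, not AVEVA E3D code: the helper names `char_length`, `is_continuation`, `char_start` and `count_chars` are hypothetical, and the input is assumed to be well-formed UTF8.

```python
def char_length(first_byte: int) -> int:
    """Number of bytes in the character that starts with first_byte."""
    if first_byte < 0x80:            # 0xxxxxxx: ASCII, 1 byte
        return 1
    if 0xC0 <= first_byte < 0xE0:    # 110xxxxx: 2 bytes
        return 2
    if 0xE0 <= first_byte < 0xF0:    # 1110xxxx: 3 bytes
        return 3
    if 0xF0 <= first_byte < 0xF8:    # 11110xxx: 4 bytes
        return 4
    raise ValueError("not a UTF8 first byte")


def is_continuation(byte: int) -> bool:
    """True for the second and later bytes of a character (10xxxxxx)."""
    return (byte & 0xC0) == 0x80


def char_start(data: bytes, index: int) -> int:
    """Step backwards from index to the first byte of its character."""
    while index > 0 and is_continuation(data[index]):
        index -= 1
    return index


def count_chars(data: bytes) -> int:
    """Count characters by hopping from first byte to first byte."""
    count, index = 0, 0
    while index < len(data):
        index += char_length(data[index])
        count += 1
    return count


# "/", e-acute (2 bytes), euro sign (3 bytes), "/", an emoji (4 bytes)
data = "/\u00e9\u20ac/\U0001F600".encode("utf-8")

print(len(data))                           # 11 (bytes)
print(count_chars(data))                   # 5 (characters)
print(char_start(data, 5))                 # 3 (first byte of the euro sign)
print(data.find(b"/"), data.rfind(b"/"))   # 0 6
```

The plain byte searches for `/` in the last line are safe in either direction because no byte of a multi-byte character can equal an ASCII byte.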