Assorted Unicode Oddities
Here are a few things I find odd about the Unicode standard, in no particular order.
U+1F93B 🤻 Modern Pentathlon and U+1F946 🥆 Rifle are not Extended_Pictographic, even though every other code point in Supplemental Symbols and Pictographs – assigned or unassigned – is.
U+FFE8 ￨ Halfwidth Forms Light Vertical is the narrow variant of U+2502 │ Box Drawings Light Vertical. Back in Unicode 1.0.0, all box drawing characters used the word ‘Forms’ in place of ‘Box Drawings’, but this was changed when the character set was merged with ISO/IEC 10646. Today the name of U+FFE8 is the only remnant of the old naming scheme.
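The mismatch is easy to see with Python's standard unicodedata module. This is only a quick sketch; the variable names are mine, and the output reflects whatever Unicode version your Python build bundles.

```python
import unicodedata

halfwidth = "\uFFE8"  # Halfwidth Forms Light Vertical
box = "\u2502"        # Box Drawings Light Vertical

# The halfwidth character keeps the old 'FORMS' wording in its name ...
print(unicodedata.name(halfwidth))
# ... while its full-width counterpart uses 'BOX DRAWINGS'.
print(unicodedata.name(box))
# The <narrow> decomposition ties the two characters together.
print(unicodedata.decomposition(halfwidth))
```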
Some superscript modifier letters do not decompose into anything because the base letters they theoretically derive from do not exist in Unicode, or did not exist when the modifier letters were first added. This includes:
- U+1D2F ᴯ Modifier Letter Capital Barred B
- U+1D3B ᴻ Modifier Letter Capital Reversed N
- U+1D4E ᵎ Modifier Letter Small Turned I
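A small check of the list above, again assuming Python's unicodedata module. U+1D2C ᴬ Modifier Letter Capital A is not part of the list; I include it only as a contrasting example of a modifier letter that does decompose.

```python
import unicodedata

# The three modifier letters from the list, plus U+1D2C for contrast.
CODE_POINTS = [0x1D2F, 0x1D3B, 0x1D4E, 0x1D2C]

def decomposition_of(cp: int) -> str:
    """Return the raw decomposition field, or '' if there is none."""
    return unicodedata.decomposition(chr(cp))

for cp in CODE_POINTS:
    mapping = decomposition_of(cp) or "(no decomposition)"
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}: {mapping}")
```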
U+1D4C ᵌ Modifier Letter Small Turned Open E and U+1D9F ᶟ Modifier Letter Small Reversed Open E both decompose to U+025C ɜ Latin Small Letter Reversed Open E, even though their names would suggest U+1D08 ᴈ Latin Small Letter Turned Open E to be the better fit for the former. U+1D4C and U+1D08 were both added in Unicode 4.0, so this isn’t a case of the base character missing and a close relative having to serve double duty. In fact, U+1D9F was added after all the others.
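The practical consequence shows up under compatibility normalization: both modifier letters fold to the same base letter. A minimal sketch, with variable names of my own choosing:

```python
import unicodedata

turned_mod = "\u1D4C"    # Modifier Letter Small Turned Open E
reversed_mod = "\u1D9F"  # Modifier Letter Small Reversed Open E
turned_base = "\u1D08"   # Latin Small Letter Turned Open E (never a target)

for ch in (turned_mod, reversed_mod):
    folded = unicodedata.normalize("NFKD", ch)
    # Both lines report LATIN SMALL LETTER REVERSED OPEN E as the result.
    print(f"{unicodedata.name(ch)} -> {unicodedata.name(folded)}")
```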
The decomposition type of U+3036 〶 Circled Postal Mark is <compat> and not <circle>.
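For comparison, here is a sketch that extracts the decomposition tag. The helper decomposition_tag is hypothetical, and U+2460 ① Circled Digit One is my own pick for an ordinary circled character.

```python
import unicodedata

def decomposition_tag(ch: str) -> str:
    """Return the <tag> of a compatibility decomposition, or '' if absent."""
    parts = unicodedata.decomposition(ch).split()
    if parts and parts[0].startswith("<"):
        return parts[0]
    return ""

for ch in ("\u3036", "\u2460"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {decomposition_tag(ch)}")
```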
U+145BD 𔖽 Anatolian Hieroglyph A393 Eight includes its numeric reading in its name, which is unique among all Anatolian Hieroglyphs. Initially the set was proposed with character names that consisted of both a catalogue number and the symbol’s meaning, but in the final version the meanings were dropped from the names of all characters except three special‐purpose symbols (U+145B1, U+145CE, U+145CF), and this one by mistake.
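One way to spot the stragglers is to scan the Anatolian Hieroglyphs block (U+14400 to U+1467F) for names with anything after the catalogue number. This is a rough sketch with a hypothetical helper; it assumes a Python whose Unicode data already includes the block.

```python
import unicodedata

PREFIX = "ANATOLIAN HIEROGLYPH "

def names_with_extra_words(start: int = 0x14400, end: int = 0x1467F) -> dict:
    """Map code points to names that carry words after the catalogue number."""
    found = {}
    for cp in range(start, end + 1):
        name = unicodedata.name(chr(cp), None)  # None for unassigned
        if name is None or not name.startswith(PREFIX):
            continue
        if len(name[len(PREFIX):].split()) > 1:
            found[cp] = name
    return found

for cp, name in names_with_extra_words().items():
    print(f"U+{cp:04X} {name}")
```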
Line_Break property values can sometimes be a bit counterintuitive, presumably due to the various interactions between several source sets. For example:
- U+2591 ░ Light Shade and U+2590 ▐ Right Half Block are Alphabetic while U+2592 ▒ Medium Shade, U+2593 ▓ Dark Shade, and U+258C ▌ Left Half Block are Ambiguous.
The same goes for East_Asian_Width:
- U+1F004 🀄 Mahjong Tile Red Dragon and U+1F0CF 🃏 Playing Card Black Joker are Wide, but all the other Mahjong tiles and playing cards are Neutral.
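The East_Asian_Width outliers can be found with unicodedata.east_asian_width. In this sketch the helper wide_in and the block ranges are my own framing, and it assumes a reasonably recent Python, since older Unicode data classified these two characters differently.

```python
import unicodedata

def wide_in(start: int, end: int) -> list:
    """List assigned code points in [start, end] whose East_Asian_Width is W."""
    return [
        cp
        for cp in range(start, end + 1)
        if unicodedata.name(chr(cp), None) is not None  # skip unassigned
        and unicodedata.east_asian_width(chr(cp)) == "W"
    ]

mahjong_wide = wide_in(0x1F000, 0x1F02B)  # Mahjong Tiles
cards_wide = wide_in(0x1F0A0, 0x1F0FF)    # Playing Cards

for cp in mahjong_wide + cards_wide:
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}")
```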