unicode_data v0.1.2 UnicodeData

Provides access to Unicode properties needed for more complex text processing.

Script detection

Proper text layout requires knowing which script is in use for a run of text. Unicode provides the Script property to identify the script associated with a codepoint. The script short name is also provided, which can be passed to font engines or cross-referenced with ISO 15924.

Once the script is identified, it’s possible to determine if the script is a right-to-left script, as well as what additional support might be required for proper layout.
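As a rough illustration of the flow described above, the sketch below splits a string into runs of code points that share a script, and reports a direction for a given script. `ScriptRuns`, `runs/2` and `direction/2` are hypothetical names that are not part of unicode_data. The `data` argument is an assumption of this sketch: it defaults to `UnicodeData` and lets a stub module stand in for it.

```elixir
defmodule ScriptRuns do
  # Hypothetical helper for illustration; not part of unicode_data.
  # `data` is any module exposing script_from_codepoint/1 and
  # right_to_left?/1. It defaults to UnicodeData.

  # Splits `text` into {script, substring} runs of consecutive code points
  # that report the same script.
  def runs(text, data \\ UnicodeData) do
    text
    |> String.codepoints()
    |> Enum.chunk_by(&data.script_from_codepoint(&1))
    |> Enum.map(fn [first | _] = chars ->
      {data.script_from_codepoint(first), Enum.join(chars)}
    end)
  end

  # Maps a script name to :rtl or :ltr.
  def direction(script, data \\ UnicodeData) do
    if data.right_to_left?(script), do: :rtl, else: :ltr
  end
end
```

Spaces, digits and punctuation are reported as Common, so in this simple form they end up in runs of their own. The sketch keeps `direction/2` separate because this page does not say how `right_to_left?/1` treats Common or Inherited; a caller would normally resolve those to a neighbouring script first.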

Shaping support

The Joining_Type and Joining_Group properties provide support for shaping engines doing layout of cursive scripts.

Summary

Functions

joining_group(codepoint): Determine the joining group for cursive scripts.

joining_type(codepoint): Determine the joining type for cursive scripts.

right_to_left?(script): Determine if the script is written right-to-left.

script_from_codepoint(codepoint): Look up the script property associated with a codepoint.

script_to_tag(script): Get the short name associated with a script. This is the tag used to identify scripts in OpenType fonts and generally matches the script code defined in ISO 15924.

uses_joining_type?(script): Determine if a script uses the Joining_Type property to select contextual forms.

Functions

joining_group(codepoint)

Determine the joining group for cursive scripts.

Characters from scripts that are not cursive return No_Joining_Group, as they do not participate in cursive shaping.

The ALAPH and DALATH RISH joining groups are of particular interest to shaping engines dealing with Syriac. Section 9.3 of the Unicode Standard discusses Syriac shaping in detail.

This is sourced from ArabicShaping.txt.

Examples

iex> UnicodeData.joining_group("ك")
"KAF"
iex> UnicodeData.joining_group("د")
"DAL"
iex> UnicodeData.joining_group("ܐ")
"ALAPH"

joining_type(codepoint)

Determine the joining type for cursive scripts.

Cursive scripts have the following join types:

  • R Right_Joining (top-joining for vertical)
  • L Left_Joining (bottom-joining for vertical)
  • D Dual_Joining
  • C Join_Causing
  • U Non_Joining
  • T Transparent

Characters from scripts that are not cursive return U, as they do not participate in cursive shaping.
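To make the joining types above a little more concrete, here is a minimal sketch of how a shaping engine might decide whether two adjacent characters connect. `JoinCheck` and `joins?/2` are hypothetical names, not part of unicode_data. The rule used is the general Unicode cursive joining rule, simplified: it ignores Transparent characters, which a real engine skips over before comparing neighbours.

```elixir
defmodule JoinCheck do
  # Hypothetical helper for illustration; not part of unicode_data.

  # Types that can connect to the character that follows in logical order.
  @joins_forward ["D", "L", "C"]
  # Types that can connect to the character that precedes in logical order.
  @joins_backward ["D", "R", "C"]

  # Takes the joining types of two adjacent characters, in logical order,
  # and returns true when they form a cursive connection.
  def joins?(prev_type, next_type) do
    prev_type in @joins_forward and next_type in @joins_backward
  end
end
```

With the types returned by `joining_type/1`, a D character followed by an R character connects (`JoinCheck.joins?("D", "R")` is true), while an R character never connects to what follows it (`JoinCheck.joins?("R", "D")` is false).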

This is sourced from ArabicShaping.txt.

Examples

iex> UnicodeData.joining_type("ك")
"D"
iex> UnicodeData.joining_type("د")
"R"
iex> UnicodeData.joining_type("ܐ")
"R"

right_to_left?(script)

Determine if the script is written right-to-left.

This data is derived from ISO 15924. There’s a handy sortable table on the Wikipedia page for ISO 15924.

Examples

iex> UnicodeData.right_to_left?("Latin")
false
iex> UnicodeData.right_to_left?("Arabic")
true

You can also pass the script short name.

iex> UnicodeData.right_to_left?("adlm")
true

script_from_codepoint(codepoint)

Look up the script property associated with a codepoint.

This returns the script property value. In addition to the explicitly defined scripts, there are three special values:

  • Characters with script value Inherited inherit the script of the preceding character.
  • Characters with script value Common are used in multiple scripts.
  • Characters of Unknown script are unassigned, private use, noncharacter or surrogate code points.
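The Inherited rule above can be sketched as a single pass that carries the last resolved script forward. `InheritedScripts` and `resolve/2` are hypothetical names, not part of unicode_data, and the `data` argument (defaulting to `UnicodeData`) is an assumption of this sketch so that a stub module can be substituted. Common is deliberately left untouched, since how to resolve it is a layout policy decision.

```elixir
defmodule InheritedScripts do
  # Hypothetical helper for illustration; not part of unicode_data.

  # Returns {codepoint, script} pairs in which "Inherited" is replaced by
  # the script of the preceding code point. A leading "Inherited" code point
  # has nothing to inherit from and is returned unchanged.
  def resolve(text, data \\ UnicodeData) do
    text
    |> String.codepoints()
    |> Enum.map_reduce(nil, fn cp, previous ->
      script =
        case data.script_from_codepoint(cp) do
          "Inherited" when not is_nil(previous) -> previous
          other -> other
        end

      # Emit the pair and remember the resolved script for the next step.
      {{cp, script}, script}
    end)
    |> elem(0)
  end
end
```

A combining mark that follows a Latin letter would then be reported as Latin rather than Inherited, which keeps it in the same run as its base character.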

This is sourced from Scripts.txt.

Examples

iex> UnicodeData.script_from_codepoint("a")
"Latin"
iex> UnicodeData.script_from_codepoint("9")
"Common"
iex> UnicodeData.script_from_codepoint("ك")
"Arabic"

script_to_tag(script)

Get the short name associated with a script. This is the tag used to identify scripts in OpenType fonts and generally matches the script code defined in ISO 15924.

See Unicode Standard Annex #24 for more about the relationship between Unicode and ISO 15924.

This is sourced from the OpenType script tags and PropertyValueAliases.txt.

Examples

iex> UnicodeData.script_to_tag("Latin")
"latn"
iex> UnicodeData.script_to_tag("Unknown")
"zzzz"
iex> UnicodeData.script_to_tag("Adlam")
"adlm"

uses_joining_type?(script)

Determine if a script uses the Joining Type property to select contextual forms.

Typically this is used to select a shaping engine, which will then call joining_type/1 and joining_group/1 to do cursive shaping.
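A minimal sketch of that selection step, assuming a run of text whose script is already known: `CursiveInfo` and `joining_info/3` are hypothetical names, not part of unicode_data, and the `data` argument (defaulting to `UnicodeData`) is an assumption of this sketch so that a stub module can be substituted.

```elixir
defmodule CursiveInfo do
  # Hypothetical helper for illustration; not part of unicode_data.

  # For a cursive script, returns {codepoint, joining_type, joining_group}
  # for every code point in `run`. For any other script, returns
  # :not_cursive so the caller can fall back to a simpler shaper.
  def joining_info(run, script, data \\ UnicodeData) do
    if data.uses_joining_type?(script) do
      for cp <- String.codepoints(run) do
        {cp, data.joining_type(cp), data.joining_group(cp)}
      end
    else
      :not_cursive
    end
  end
end
```

Going by the examples on this page, `CursiveInfo.joining_info("كد", "Arabic")` should pair "ك" with "D" and "KAF", and "د" with "R" and "DAL".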

Examples

iex> UnicodeData.uses_joining_type?("Latin")
false
iex> UnicodeData.uses_joining_type?("Arabic")
true
iex> UnicodeData.uses_joining_type?("Nko")
true

You can also pass the script short name.

iex> UnicodeData.uses_joining_type?("syrc")
true