hanyutils v0.2.3 Pinyin View Source

Utilities to deal with pinyin syllables and groups thereof.

The main goal of this module is to provide functions to manipulate strings that contain pinyin words, which are potentially mixed with other content. Users of this module can use read/2, read!/2 or sigil_p/2 to parse a string and turn it into a pinyin_list/0. Afterwards, such a list can be converted into a "numbered" or a "marked" string. Numbered strings are created with numbered/1; in this representation, tone marks are not added to the pinyin syllable, numbers are used to indicate the tone instead. When marked/1 is used, pinyin is printed with tone marks.

When a string is parsed with read/2, it is converted into a list containing strings and t/0 structs. These structs encode pinyin syllables. Users of this module generally do not need to worry about manipulating these structs directly, but they are exposed for users who want to handle pinyin using custom logic. create/2, from_marked/1 and from_numbered/1 can be used to directly create a pinyin struct for a given syllable. Like pinyin_list/0, t/0 structs can be converted to strings with numbered/1 and marked/1.

Link to this section Summary

Types

List of pinyin syllables mixed with plain strings.

t()

Representation of a pinyin syllable.

Functions

Create a Pinyin struct (t/0) from an unmarked string and a tone numeral.

Create a pinyin struct (t/0) from a string with tone marks.

Create a pinyin struct (t/0) from a string with a tone number.

Convert a t/0 or pinyin_list/0 to a tone-marked string.

Convert a t/0 or pinyin_list/0 to a numbered version.

Read a string and convert it into a list of string and pinyin structs.

Identical to read/2, but returns the result or a Pinyin.ParseError

Sigil to create a pinyin list or struct.

Link to this section Types

Specs

pinyin_list() :: [t() | String.t()]

List of pinyin syllables mixed with plain strings.

Specs

t() :: %Pinyin{tone: 0..4, word: String.t()}

Representation of a pinyin syllable.

This struct represents a single syllable in pinyin. It stores a textual representation of the syllable without any tone marks. In this representation, ü is always stored as v. The tone of the syllable is stored in the tone field. 0 represents the neutral tone.

Do not create a pinyin struct manually. Instead, use a function such as create/2, from_marked/1, from_numbered/1 or use the sigil_p/2 sigil.

Link to this section Functions

Specs

create(String.t(), 0..4) :: t()

Create a Pinyin struct (t/0) from an unmarked string and a tone numeral.

This function is useful if you want to dynamically create pinyin structs. The use of this function is preferred over directly using %Pinyin{}, as this function normalises word and verifies tone is valid before the struct is created.

Examples

iex> Pinyin.create("ni", 3)
%Pinyin{tone: 3, word: "ni"}
iex> Pinyin.create("lüe", 4)
%Pinyin{tone: 4, word: "lve"}
iex> Pinyin.create("lve", 4)
%Pinyin{tone: 4, word: "lve"}
iex> Pinyin.create("ni", 5)
** (FunctionClauseError) no function clause matching in Pinyin.create/2

Specs

from_marked(String.t()) :: t()

Create a pinyin struct (t/0) from a string with tone marks.

When converting the string, the tone marker is stripped and placed in the tone field of the resulting struct. An ArgumentError is thrown if multiple tone marks are present. Therefore, this function should only be used for a single pinyin word.

Examples

iex> Pinyin.from_marked("nǐ")
%Pinyin{tone: 3, word: "ni"}
iex> Pinyin.from_marked("nǐ")
%Pinyin{tone: 3, word: "ni"}
iex> Pinyin.from_marked("nǐhǎo")
** (ArgumentError) Multiple tone marks present in 'nǐhǎo'

Specs

from_numbered(String.t()) :: t()

Create a pinyin struct (t/0) from a string with a tone number.

The tone number has to be 1, 2, 3 or 4 and has to be the last element of the string. If this is not the case, an invalid pinyin struct is obtained.

If the tone of the word is known upfront, the use of create/2 should be preferred, as it does not need to parse the string.

Examples

iex> Pinyin.from_numbered("ni3")
%Pinyin{tone: 3, word: "ni"}
iex> Pinyin.from_numbered("ni5")
%Pinyin{tone: 0, word: "ni5"}
iex> Pinyin.from_numbered("ni")
%Pinyin{tone: 0, word: "ni"}

Specs

marked(t() | pinyin_list()) :: String.t()

Convert a t/0 or pinyin_list/0 to a tone-marked string.

The tone-marked string consists of the pinyin word with the tone added in the correct location. It is generally used when printing pinyin. Any occurence of "v" is shown as "ü".

Examples

iex> marked(~p/ni3/s)
"nǐ"
iex> marked(~p/lve4/s)
"lüè"
iex> marked(~p/ni3hao3/)
"nǐhǎo"
iex> marked(~p/NI3HAO3/)
"NǏHǍO"
iex> marked(~p/Ni3hao3, how are you?/w)
"Nǐhǎo, how are you?"

Specs

numbered(t() | pinyin_list()) :: String.t()

Convert a t/0 or pinyin_list/0 to a numbered version.

The numbered version consists of the word without tone markings followed by the number of the tone. It is often used when typing pinyin manually. Any occurence of "ü" is shown as "v". Non-pinyin text is not modified.

Examples

iex> numbered(~p/nǐ/s)
"ni3"
iex> numbered(~p/lüè/s)
"lve4"
iex> numbered(~p/nǐhǎo/)
"ni3hao3"
iex> numbered(~p/NǏHǍO/)
"NI3HAO3"
iex> numbered(~p/Nǐhǎo, how are you?/w)
"Ni3hao3, how are you?"
Link to this function

read(string, mode \\ :exclusive)

View Source

Specs

read(String.t(), :exclusive | :words | :mixed) ::
  {:ok, pinyin_list()} | {:error, String.t()}

Read a string and convert it into a list of string and pinyin structs.

This function reads a string containing pinyin mixed with normal text. The output of this function is a list of strings and pinyin structs. White space and punctuation will be separated from other strings.

The input string may contain tone-marked (e.g. "nǐ") pinyin, numbered ("ni3") pinyin or a mix thereof (nǐ hao3). Note that in numbered pinyin mode tone numerals must not be separated from their word. For instance, "ni3" will be correctly parsed as "nǐ", "ni 3" will not. When tone marked pinyin is used the tone must be marked on the correct letter. For instance, hǎo will parse correctly, haǒ will not; we recommend the use of numbered pinyin if there is uncertainty about the location of the tone mark.

Parse Modes

By default, this function only accepts strings which consists exclusively of pinyin, whitespace and puncutation. Parsing any text that cannot be interpreted as pinyin will result in an error:

iex> Pinyin.read("Ni3hao3!")
{:ok, [%Pinyin{tone: 3, word: "Ni"}, %Pinyin{tone: 3, word: "hao"}, "!"]}
iex> Pinyin.read("Ni3hao3, hello!")
{:error, "hello!"}

This behaviour can be tweaked if pinyin mixed with regular text needs to be parsed; this can be done by passing a mode to this function. There are 3 available modes:

  • :exclusive: The default. Every character (except white space and punctuation) is interpreted as pinyin. If this is not possible, an error is returned.
  • :words: Any word (i.e. a continuous part of the string that does not contain whitespace or punctuation) is either interpreted as a sequence of pinyin syllables or as non-pinyin text. If a word contains any characters that cannot be interpreted as pinyin, the whole word is considered to be non-pinyin text. This mode does not return errors.
  • :mixed: Any word can contain a mixture of pinyin and non-pinyin characters. The read function will interpret anything it can interpret as pinyin as pinyin and leaves the other text unmodified. This is mainly useful to mix characters and pinyin. It is not recommended to use this mode to mix pinyin and normal text. This mode does not return errors.

The following examples show the use of all three modes:

iex> Pinyin.read("Ni3hao3!", :exclusive)
{:ok, [%Pinyin{tone: 3, word: "Ni"}, %Pinyin{tone: 3, word: "hao"}, "!"]}
iex> Pinyin.read("Ni3hao3, hello!", :exclusive)
{:error, "hello!"}
iex> Pinyin.read("Ni3好hao3, hello!", :exclusive)
{:error, "Ni3好hao3, hello!"}

iex> Pinyin.read("Ni3hao3!", :words)
{:ok, [%Pinyin{tone: 3, word: "Ni"}, %Pinyin{tone: 3, word: "hao"}, "!"]}
iex> Pinyin.read("Ni3hao3, hello!", :words)
{:ok, [%Pinyin{tone: 3, word: "Ni"}, %Pinyin{tone: 3, word: "hao"}, ", ", "hello", "!"]}
iex> Pinyin.read("Ni3好hao3, hello!", :words)
{:ok, ["Ni3好hao3",  ", ", "hello", "!"]}

iex> Pinyin.read("Ni3hao3!", :mixed)
{:ok, [%Pinyin{tone: 3, word: "Ni"}, %Pinyin{tone: 3, word: "hao"}, "!"]}
iex> Pinyin.read("Ni3hao3, hello!", :mixed)
{:ok, [%Pinyin{tone: 3, word: "Ni"}, %Pinyin{tone: 3, word: "hao"}, ", ", %Pinyin{word: "he"}, "l", %Pinyin{word: "lo"}, "!"]}
iex> Pinyin.read("Ni3好hao3, hello!", :mixed)
{:ok, [%Pinyin{tone: 3, word: "Ni"}, "好",  %Pinyin{tone: 3, word: "hao"}, ", ", %Pinyin{word: "he"}, "l", %Pinyin{word: "lo"}, "!"]}

When :mixed or :word mode is used, it is possible some words are incorrectly identified as pinyin. This is generally not a problem for users who just wish to use marked/1 or numbered/1 on the result of read/2, since pinyin syllables with no tone are printed as is.

Capitalization and -r suffix

This function is able to read capitalized and uppercase pinyin strings. That is, strings such as "Ni3hao3", "NI3HAO3" and "NI3hao3" are accepted. However, pinyin words with mixed capitalization are not recognized:

iex> Pinyin.read("Hao3")
{:ok, [%Pinyin{tone: 3, word: "Hao"}]}
iex> Pinyin.read("HAO3")
{:ok, [%Pinyin{tone: 3, word: "HAO"}]}
iex> Pinyin.read("HaO3")
{:error, "HaO3"}

Finally, this function does not detect the -r suffix. Users of the library should take care to fully write out er instead. That is, do not write "zher", use "zheer" instead.

iex> Pinyin.read("zher")
{:error, "zher"}
iex> Pinyin.read("zheer")
{:ok, [%Pinyin{word: "zhe"}, %Pinyin{word: "er"}]}
Link to this function

read!(string, mode \\ :exclusive)

View Source

Specs

read!(String.t(), :exclusive | :words | :mixed) :: pinyin_list() | no_return()

Identical to read/2, but returns the result or a Pinyin.ParseError

Examples

iex> Pinyin.read!("ni3hao3")
[%Pinyin{tone: 3, word: "ni"}, %Pinyin{tone: 3, word: "hao"}]
iex> Pinyin.read!("ni3 hao3")
[%Pinyin{tone: 3, word: "ni"}, " ", %Pinyin{tone: 3, word: "hao"}]
iex> Pinyin.read!("ni 3")
** (Pinyin.ParseError) Error occurred when attempting to parse: `3`
Link to this macro

sigil_p(arg, mode)

View Source (macro)

Sigil to create a pinyin list or struct.

When used without any modifiers, this sigil converts its input into a pinyin list through the use of read!/2 in :exclusive mode. The w and m modifiers can be used to use :words or :mixed mode respectively.

When this sigil is called with the s modifier, a pinyin struct is created by calling from_numbered/1.

Examples

iex> ~p/ni3/
[%Pinyin{tone: 3, word: "ni"}]
iex> ~p/ni3 hello/w
[%Pinyin{tone: 3, word: "ni"}, " ", "hello"]
iex> ~p/ni3好/m
[%Pinyin{tone: 3, word: "ni"}, "好"]
iex> ~p/ni3/s
%Pinyin{tone: 3, word: "ni"}