Image.OCR.Languages (image_ocr v0.2.0)


Translation between user-facing language identifiers and Tesseract's trained-data filename codes.

All public Image.OCR and mix-task language arguments accept any of:

  • An ISO 639-1 two-letter code, as a string or atom: "en", :en, "fr", :de.

  • A BCP-47 tag for region- or script-specific variants where ISO 639-1 alone is ambiguous: "zh-Hans", "zh-Hant", "sr-Latn", "az-Cyrl".

  • A Tesseract trained-data code passed through verbatim, for languages or artefacts ISO 639-1 cannot express: "eng", "chi_sim", "frk" (German Fraktur), "osd" (orientation/script detection), "script/Latin".

  • A +-joined combination of any of the above: "en+fr", "chi_sim+eng".

Ambiguous codes

The ISO 639-1 code zh does not specify a script and is rejected. Use "zh-Hans" for Simplified Chinese or "zh-Hant" for Traditional Chinese.

Summary

Functions

from_tesseract(tess_code)

Inverse of to_tesseract/1: returns the ISO 639-1 code for a Tesseract trained-data code when one exists, otherwise the input unchanged.

to_tesseract(code)

Translates a user-supplied language identifier to the corresponding Tesseract trained-data code (the basename of the .traineddata file).

Functions

from_tesseract(tess_code)

@spec from_tesseract(String.t()) :: String.t()

Inverse of to_tesseract/1: returns the ISO 639-1 code for a Tesseract trained-data code when one exists, otherwise the input unchanged.

Examples

iex> Image.OCR.Languages.from_tesseract("eng")
"en"

iex> Image.OCR.Languages.from_tesseract("frk")
"frk"
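
The "otherwise the input unchanged" fallback can be pictured as a map lookup that defaults to its own key. The following is a minimal sketch, not the library's implementation: the module name InverseSketch and its one-entry table are hypothetical and exist only for illustration.

```elixir
defmodule InverseSketch do
  # Hypothetical reverse table with a single entry, for illustration only.
  @from_tesseract %{"eng" => "en"}

  # Map.get/3 returns its third argument when the key is missing,
  # which gives the "input unchanged" fallback described above.
  def from_tesseract(tess_code) when is_binary(tess_code) do
    Map.get(@from_tesseract, tess_code, tess_code)
  end
end
```

With this sketch, "eng" maps to "en", while "frk" has no entry in the table and comes back as "frk".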

to_tesseract(code)

@spec to_tesseract(atom() | String.t()) :: String.t()

Translates a user-supplied language identifier to the corresponding Tesseract trained-data code (the basename of the .traineddata file).

Compound +-joined identifiers are translated component-wise.

Arguments

  • code is a string or atom describing one or more languages. See the moduledoc for the accepted forms.

Returns

  • The translated string, e.g. "eng", "chi_sim+eng", or "frk".

Examples

iex> Image.OCR.Languages.to_tesseract(:en)
"eng"

iex> Image.OCR.Languages.to_tesseract("zh-Hans")
"chi_sim"

iex> Image.OCR.Languages.to_tesseract("en+fr")
"eng+fra"

iex> Image.OCR.Languages.to_tesseract("frk")
"frk"
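
The component-wise handling of +-joined identifiers can be sketched as split, translate each part, then rejoin. This is an illustrative sketch under stated assumptions rather than the module's actual code: LanguageSketch and its three-entry table are hypothetical, unknown components are assumed to be Tesseract codes already, and the rejection of a bare zh is left out.

```elixir
defmodule LanguageSketch do
  # Hypothetical, deliberately tiny lookup table; the entries mirror
  # the documented examples and nothing more.
  @to_tesseract %{"en" => "eng", "fr" => "fra", "zh-Hans" => "chi_sim"}

  # Atoms are normalised to strings first, so :en and "en" behave alike.
  def to_tesseract(code) when is_atom(code) do
    to_tesseract(Atom.to_string(code))
  end

  # Split on "+", translate each component independently, then rejoin.
  def to_tesseract(code) when is_binary(code) do
    code
    |> String.split("+")
    |> Enum.map(&translate_one/1)
    |> Enum.join("+")
  end

  # Assumption: a component missing from the table is treated as a
  # Tesseract code and passed through verbatim.
  defp translate_one(component) do
    Map.get(@to_tesseract, component, component)
  end
end
```

In this sketch a single identifier is just the one-component case of the same path, so "frk" and "chi_sim+eng" pass through untouched while "en+fr" becomes "eng+fra".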