Translation between user-facing language identifiers and Tesseract's trained-data filename codes.
All public Image.OCR and mix-task language arguments accept any of:
- An ISO 639-1 two-letter code, as a string or atom: "en", :en, "fr", :de.
- A BCP-47 tag for region- or script-specific variants where ISO 639-1 alone is ambiguous: "zh-Hans", "zh-Hant", "sr-Latn", "az-Cyrl".
- A Tesseract trained-data code passed through verbatim, for languages or artefacts ISO 639-1 cannot express: "eng", "chi_sim", "frk" (German Fraktur), "osd" (orientation/script detection), "script/Latin".
- A +-joined combination of any of the above: "en+fr", "chi_sim+eng".
Ambiguous codes
ISO 639-1 "zh" does not specify a script and is rejected. Use "zh-Hans"
for Simplified Chinese or "zh-Hant" for Traditional Chinese.
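As an illustration of the rejection rule, the following is a minimal sketch and not the library's implementation. The module name `AmbiguitySketch`, the `check/1` function, and the `{:ok, _}` / `{:error, _}` return shape are all assumptions made for this example; how Image.OCR.Languages actually signals a rejected code is not stated here.

```elixir
defmodule AmbiguitySketch do
  # Hypothetical table: a bare code mapped to the script-specific
  # BCP-47 tags a caller should use instead.
  @script_variants %{"zh" => ["zh-Hans", "zh-Hant"]}

  # Atoms are normalised to strings before the lookup.
  def check(code) when is_atom(code), do: check(Atom.to_string(code))

  # Returns {:ok, code} for an unambiguous identifier, or
  # {:error, message} naming the alternatives (assumed return shape).
  def check(code) when is_binary(code) do
    case Map.fetch(@script_variants, code) do
      {:ok, variants} ->
        suggestions = Enum.map_join(variants, ", ", &inspect/1)
        {:error, "#{inspect(code)} is ambiguous; use one of #{suggestions}"}

      :error ->
        {:ok, code}
    end
  end
end

IO.inspect(AmbiguitySketch.check("zh"))
IO.inspect(AmbiguitySketch.check("zh-Hans"))
```

Keeping the ambiguous codes in their own table lets the error message list the valid replacements, rather than only reporting that the code was refused.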
Summary
Functions
from_tesseract/1
Inverse of to_tesseract/1: returns the ISO 639-1 code for a Tesseract
trained-data code when one exists, otherwise the input unchanged.
to_tesseract/1
Translates a user-supplied language identifier to the corresponding
Tesseract trained-data code (the basename of the .traineddata file).
Functions
from_tesseract/1
Inverse of to_tesseract/1: returns the ISO 639-1 code for a Tesseract
trained-data code when one exists, otherwise the input unchanged.
Examples
iex> Image.OCR.Languages.from_tesseract("eng")
"en"
iex> Image.OCR.Languages.from_tesseract("frk")
"frk"
to_tesseract/1
Translates a user-supplied language identifier to the corresponding
Tesseract trained-data code (the basename of the .traineddata file).
Compound +-joined identifiers are translated component-wise.
Arguments
- code is a string or atom describing one or more languages. See the moduledoc for the accepted forms.
Returns
- The translated string, e.g. "eng", "chi_sim+eng", or "frk".
Examples
iex> Image.OCR.Languages.to_tesseract(:en)
"eng"
iex> Image.OCR.Languages.to_tesseract("zh-Hans")
"chi_sim"
iex> Image.OCR.Languages.to_tesseract("en+fr")
"eng+fra"
iex> Image.OCR.Languages.to_tesseract("frk")
"frk"