Charex.Token (charex v0.4.0)
Wrapper of the Charabia Token.
Notes:
char_map
Seems to always be nil from what I can tell. Some languages may use it, so it's here for compatibility, but don't be surprised if it's nil for you.
char_start/char_end
These are not really characters as we commonly think of them (those are actually called grapheme clusters in Unicode). String.length/1 in Elixir, for example, counts grapheme clusters, whereas this range counts codepoint indexes. For example, "👨‍👩‍👧‍👦" would have a start/end that is 7 numbers apart instead of 1 number apart. These are the same indexes you'd get from String.codepoints/1 or String.to_charlist/1.
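To make the grapheme/codepoint difference concrete, here is a minimal sketch using only the Elixir standard library. The `CodepointSketch.slice/3` helper is hypothetical (it is not part of Charex), and it assumes `char_end` is an exclusive index, which this page does not state.

```elixir
defmodule CodepointSketch do
  # Hypothetical helper, not part of Charex: slice a string by codepoint
  # indexes. ASSUMPTION: char_end is exclusive (half-open range).
  def slice(text, char_start, char_end) do
    text
    |> String.codepoints()
    |> Enum.slice(char_start, char_end - char_start)
    |> Enum.join()
  end
end

# Family emoji: four people joined by three zero-width joiners (U+200D).
family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}"
text = "hi " <> family <> " ok"

# Grapheme clusters: what String.length/1 counts.
IO.inspect(String.length(family))

# Codepoints: what char_start/char_end count (7 for this emoji).
IO.inspect(length(String.codepoints(family)))
IO.inspect(length(String.to_charlist(family)))

# "hi " occupies codepoint indexes 0..2, so the emoji spans 3..10.
IO.inspect(CodepointSketch.slice(text, 3, 10) == family)
```

Converting to a codepoint list on every call is linear in the string length, so for many tokens over the same text it may be cheaper to slice by the byte offsets instead.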
byte_start/byte_end
These are byte offsets, as you'd expect, and can be used to slice the string with :binary.part/3, but NOT with String.slice/3. String.slice/3 counts grapheme clusters, so it may appear to work for Latin/ASCII characters but will break your code as soon as non-ASCII characters are used. :binary.part/3 is the correct function to use.
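The difference between the two slicing functions can be sketched with the standard library alone. The offsets below are worked out by hand for illustration rather than produced by Charex, and the sketch assumes `byte_end` is exclusive, so the length passed to `:binary.part/3` is `byte_end - byte_start`.

```elixir
text = "héllo wörld"

# Hand-computed byte offsets for "wörld" (illustrative, not Charex output).
# "héllo" is 6 bytes because "é" takes 2 bytes in UTF-8; the space is byte 6.
byte_start = 7
byte_end = 13
len = byte_end - byte_start

# Correct: :binary.part/3 takes a byte position and a byte length.
by_bytes = :binary.part(text, byte_start, len)
IO.inspect(by_bytes)
# → "wörld"

# Wrong: String.slice/3 interprets the same numbers as grapheme positions.
by_graphemes = String.slice(text, byte_start, len)
IO.inspect(by_graphemes)
# → "örld"
```

With pure ASCII input both calls would return the same result, which is why the mistake tends to go unnoticed until accented or non-Latin text shows up.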
Summary
Functions
Returns a list of all 70 of the language values used in the language key.
Returns a list of all 25 of the script values used in the script key.
Types
@type lang() ::
:other
| :hy
| :tl
| :ca
| :sk
| :la
| :af
| :sn
| :zu
| :ak
| :tk
| :km
| :si
| :ne
| :my
| :or
| :ml
| :fa
| :te
| :id
| :az
| :pa
| :uz
| :gu
| :th
| :ur
| :vi
| :ta
| :et
| :lv
| :lt
| :mk
| :sr
| :hr
| :sl
| :ro
| :kn
| :mr
| :be
| :bg
| :el
| :cs
| :hu
| :nl
| :tr
| :fi
| :sv
| :da
| :nb
| :ko
| :jv
| :am
| :pl
| :yi
| :he
| :ja
| :hi
| :ar
| :ka
| :uk
| :de
| :fr
| :bn
| :it
| :pt
| :es
| :zh
| :ru
| :en
| :eo
@type script() ::
:other
| :hans
| :thai
| :telu
| :taml
| :sinh
| :orya
| :mymr
| :mlym
| :latn
| :khmr
| :jpan
| :knda
| :hebr
| :hang
| :guru
| :gujr
| :grek
| :geor
| :ethi
| :deva
| :cyrl
| :beng
| :armn
| :arab
@type t() :: %Charex.Token{
  byte_end: pos_integer(),
  byte_start: pos_integer(),
  char_end: pos_integer(),
  char_map: %{required(pos_integer()) => pos_integer()} | nil,
  char_start: pos_integer(),
  kind: :word | :stop_word | {:separator, :hard} | {:separator, :soft},
  language: lang(),
  lemma: String.t(),
  script: script()
}
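Because kind mixes plain atoms with tagged tuples, pattern matching in function heads is a natural way to branch on it. The sketch below uses plain maps as stand-ins for %Charex.Token{} so that it runs without the library; the `TokenKindSketch` module, its `searchable?/1` function, and the sample data are all hypothetical.

```elixir
defmodule TokenKindSketch do
  # Hypothetical classifier: keep only ordinary words.
  def searchable?(%{kind: :word}), do: true
  def searchable?(%{kind: :stop_word}), do: false
  # Matches both {:separator, :hard} and {:separator, :soft}.
  def searchable?(%{kind: {:separator, _hardness}}), do: false
end

# Plain maps standing in for %Charex.Token{} structs (made-up sample data).
tokens = [
  %{kind: :word, lemma: "quick"},
  %{kind: {:separator, :soft}, lemma: " "},
  %{kind: :stop_word, lemma: "the"},
  %{kind: {:separator, :hard}, lemma: ". "},
  %{kind: :word, lemma: "fox"}
]

lemmas = for token <- tokens, TokenKindSketch.searchable?(token), do: token.lemma
IO.inspect(lemmas)
# → ["quick", "fox"]
```

The same function heads would also match real structs, since a map pattern such as `%{kind: :word}` matches any struct that has that key and value.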