Charex.Token (charex v0.4.0)

Wrapper of the Charabia Token.

Notes:

char_map

This seems to always be nil as far as I can tell. Some languages might use it, so it is kept for compatibility, but don't be surprised if it's nil for you.

char_start/char_end

These are not really characters as we commonly think of them (those are called grapheme clusters in Unicode). String.length/1 in Elixir, for example, counts grapheme clusters, whereas this range counts codepoint indexes. For example, "👨‍👩‍👧‍👦" would have a start and end that are 7 apart instead of 1 apart. This matches what you'd get from String.codepoints/1 or to_charlist/1.
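The difference between the three ways of counting can be checked with the standard library alone. The following is a minimal sketch, not taken from Charex itself. The emoji is written with codepoint escapes to sidestep encoding problems, and the variable names are illustrative.

```elixir
# Family emoji: four person emoji joined by three zero-width joiners (U+200D).
family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"

# Codepoints are the unit that char_start/char_end count in.
codepoint_count = length(String.codepoints(family))
charlist_count = length(String.to_charlist(family))
IO.inspect(codepoint_count, label: "codepoints") # 7
IO.inspect(charlist_count, label: "charlist")    # 7

# Bytes are the unit that byte_start/byte_end count in.
byte_count = byte_size(family)
IO.inspect(byte_count, label: "bytes")           # 25

# Grapheme clusters are what String.length/1 counts. The exact result depends
# on the Unicode data shipped with your Elixir version, so no value is asserted.
IO.inspect(String.length(family), label: "graphemes")
```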

byte_start/byte_end

These are byte offsets, as you'd expect, and can be used to slice the string with :binary.part/3, but NOT with String.slice/3. String.slice/3 counts grapheme clusters rather than bytes, so it may appear to work for ASCII text but will break your code as soon as non-ASCII characters are used. :binary.part/3 is the correct function to use.
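The pitfall can be reproduced without the library. The sketch below uses a plain map with hand-computed offsets standing in for a token, so the values are hypothetical and not output from Charex. It also assumes that byte_start is inclusive and byte_end is exclusive, which you should confirm against real tokens.

```elixir
# "héllo wörld" with the accented letters written as escapes (2 bytes each in UTF-8).
text = "h\u00E9llo w\u00F6rld"

# Hypothetical offsets for the word "wörld"; a plain map, not a %Charex.Token{}.
# Assumption: byte_start is inclusive and byte_end is exclusive.
token = %{byte_start: 7, byte_end: 13}

# :binary.part/3 takes a start position and a length, not an end position.
len = token.byte_end - token.byte_start
correct = :binary.part(text, token.byte_start, len)
IO.inspect(correct, label: "binary.part")  # "wörld"

# String.slice/3 counts grapheme clusters, so the byte offset lands one
# grapheme too far to the right and the result is silently wrong.
wrong = String.slice(text, token.byte_start, len)
IO.inspect(wrong, label: "String.slice")   # "örld"
```

With purely ASCII input both calls would return the same slice, which is why the bug tends to go unnoticed until accented or non-Latin text shows up.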

Summary

Functions

all_langs()

Returns a list of all 70 of the language values used in the language key

all_scripts()

Returns a list of all 25 of the script values used in the script key

Types

@type lang() ::
  :other
  | :hy
  | :tl
  | :ca
  | :sk
  | :la
  | :af
  | :sn
  | :zu
  | :ak
  | :tk
  | :km
  | :si
  | :ne
  | :my
  | :or
  | :ml
  | :fa
  | :te
  | :id
  | :az
  | :pa
  | :uz
  | :gu
  | :th
  | :ur
  | :vi
  | :ta
  | :et
  | :lv
  | :lt
  | :mk
  | :sr
  | :hr
  | :sl
  | :ro
  | :kn
  | :mr
  | :be
  | :bg
  | :el
  | :cs
  | :hu
  | :nl
  | :tr
  | :fi
  | :sv
  | :da
  | :nb
  | :ko
  | :jv
  | :am
  | :pl
  | :yi
  | :he
  | :ja
  | :hi
  | :ar
  | :ka
  | :uk
  | :de
  | :fr
  | :bn
  | :it
  | :pt
  | :es
  | :zh
  | :ru
  | :en
  | :eo
@type script() ::
  :other
  | :hans
  | :thai
  | :telu
  | :taml
  | :sinh
  | :orya
  | :mymr
  | :mlym
  | :latn
  | :khmr
  | :jpan
  | :knda
  | :hebr
  | :hang
  | :guru
  | :gujr
  | :grek
  | :geor
  | :ethi
  | :deva
  | :cyrl
  | :beng
  | :armn
  | :arab
@type t() :: %Charex.Token{
  byte_end: pos_integer(),
  byte_start: pos_integer(),
  char_end: pos_integer(),
  char_map: %{required(pos_integer()) => pos_integer()} | nil,
  char_start: pos_integer(),
  kind: :word | :stop_word | {:separator, :hard} | {:separator, :soft},
  language: lang(),
  lemma: String.t(),
  script: script()
}

Functions

@spec all_langs() :: [lang()]

Returns a list of all 70 of the language values used in the language key

@spec all_scripts() :: [script()]

Returns a list of all 25 of the script values used in the script key