Codepagex v0.1.4
Codepagex is an Elixir library for converting between utf-8 strings and binaries in other encodings. Like iconv, but written in pure Elixir.
All the encodings are fetched from unicode.org tables, and the conversion functions are generated from these at compile time.
Examples
The package is meant to be used only through the Codepagex module.
iex> from_string("æøåÆØÅ", :iso_8859_1)
{:ok, <<230, 248, 229, 198, 216, 197>>}
iex> to_string(<<230, 248, 229, 198, 216, 197>>, :iso_8859_1)
{:ok, "æøåÆØÅ"}
iex> from_string!("æøåÆØÅ", :iso_8859_1)
<<230, 248, 229, 198, 216, 197>>
iex> to_string!(<<230, 248, 229, 198, 216, 197>>, :iso_8859_1)
"æøåÆØÅ"
When a String contains codepoints that cannot be represented in the target encoding, or an encoded binary contains bytes that are invalid in its encoding, the functions return an error. If you still want to convert such input, you may pass a function that handles these cases. E.g.:
iex> from_string("Hello æøå!", :ascii, replace_nonexistent("_"))
{:ok, "Hello ___!", 3}
iex> iso = "Hello æøå!" |> from_string!(:iso_8859_1)
iex> to_string!(iso, :ascii, use_utf_replacement())
"Hello ���!"
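The two calling styles above can be combined into a strict-first conversion that only falls back to replacement when needed. This is a minimal sketch, assuming the codepagex package is a dependency; the module name LossyEncode, the function to_ascii/2, and the :lossy tag are hypothetical and not part of Codepagex.

```elixir
defmodule LossyEncode do
  # Hypothetical helper, not part of Codepagex.
  # Tries a strict conversion first and only falls back to replacing
  # unrepresentable codepoints when the strict conversion fails.
  def to_ascii(text, placeholder \\ "_") do
    case Codepagex.from_string(text, :ascii) do
      {:ok, binary} ->
        {:ok, binary}

      {:error, _reason} ->
        handler = Codepagex.replace_nonexistent(placeholder)

        case Codepagex.from_string(text, :ascii, handler, 0) do
          # count is the accumulator: the number of replacements made
          {:ok, binary, count} -> {:lossy, binary, count}
          {:error, reason, _count} -> {:error, reason}
        end
    end
  end
end
```

Tagging the fallback result as :lossy lets the caller decide whether a conversion that dropped information is acceptable.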
Encodings
A full list of encodings is found by running encoding_list/1.
The encodings are best supplied as an atom, or else the string is converted to an atom for you (but with a somewhat less efficient function lookup). E.g.:
iex> from_string("æøå", "ISO8859/8859-9")
{:ok, <<230, 248, 229>>}
iex> from_string("æøå", :"ISO8859/8859-9")
{:ok, <<230, 248, 229>>}
For some encodings, an alias is set up for easier dispatch. The list of aliases is found by running aliases/1. The code looks like:
iex> from_string!("Hello æøåÆØÅ!", :iso_8859_1)
<<72, 101, 108, 108, 111, 32, 230, 248, 229, 198, 216, 197, 33>>
Encoding selection
By default, all ISO-8859 encodings and ASCII are included. A few more are available, and these must be specified in the config/config.exs file. The specified encodings are then compiled. Adding many encodings may increase compilation times, in particular for the largest ones.
To specify the encodings to use, add the following lines to your config/config.exs and recompile:
use Mix.Config
config :codepagex, :encodings, [:ascii]
This will add only the ASCII encoding, as specified by its shorthand alias. Any number of encodings may be specified like this in the list. The list may contain strings, atoms or regular expressions that match either an alias or a full encoding name, e.g.:
use Mix.Config
config :codepagex, :encodings, [
:ascii, # by alias name
~r[iso8859]i, # by a regex matching the full name
"ETSI/GSM0338", # by the full name as a string
:"MISC/CP856" # by a full name as an atom
]
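On Elixir 1.9 and later, Mix.Config is deprecated in favour of the Config module. The following is a sketch of the same kind of configuration in that style, assuming Codepagex reads the :encodings key in the same way regardless of which of the two modules the config file uses.

```elixir
# config/config.exs
import Config

# Same key and value format as in the examples above
config :codepagex, :encodings, [
  :ascii,         # by alias name
  "ETSI/GSM0338"  # by the full name as a string
]
```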
The encodings that are known to require very long compile times are:
- VENDORS/MISC/KPS9566
- VENDORS/MICSFT/WINDOWS/CP932
- VENDORS/MICSFT/WINDOWS/CP936
- VENDORS/MICSFT/WINDOWS/CP949
- VENDORS/MICSFT/WINDOWS/CP950
TODO
- A few encodings are not yet supported, for different reasons. In particular the Asian and Arabic ones with left-right and up-down variations.
- Test Elixir function specs
Summary
Functions
- aliases/1: See Codepagex.Mappings.aliases/1
- encoding_list/1: See Codepagex.Mappings.encoding_list/1
- from_string/2: Converts an Elixir string in utf-8 encoding to a binary in another encoding
- from_string/4: Convert an Elixir String in utf-8 to a binary in a specified encoding. A function parameter specifies how to deal with codepoints that are not representable in the target encoding
- from_string!/2: Like from_string/2 but raising exceptions on errors
- from_string!/4: Like from_string/4 but raising exceptions on errors
- replace_nonexistent/1: This function may be used in conjunction with from_string/4 or from_string!/4. If there are utf-8 codepoints in the source string that are not possible to represent in the target encoding, they are replaced with a String
- to_string/2: Converts a binary in a specified encoding to an Elixir string in utf-8 encoding
- to_string/4: Convert a binary in a specified encoding into an Elixir string in utf-8 encoding
- to_string!/2: Like to_string/2 but raises exceptions on errors
- to_string!/4: Like to_string/4 but raises exceptions on errors
- translate/3: Convert a binary in one encoding to a binary in another encoding. The string is converted to utf-8 internally in the process
- translate!/3: Like translate/3 but raises exceptions on errors
- use_utf_replacement/0: This function may be used as a parameter to to_string/4 or to_string!/4 such that any bytes in the input binary that don’t have a proper encoding are replaced with a special unicode character and the function will not fail
Types
from_s_missing_outer() :: (String.t -> {:ok, from_s_missing_inner} | {:error, term})
to_s_missing_inner() :: (binary, term -> {:ok, String.t, binary, term} | {:error, term})
to_s_missing_outer() :: (String.t -> {:ok, to_s_missing_inner} | {:error, term})
Functions
See Codepagex.Mappings.encoding_list/1.
Converts an Elixir string in utf-8 encoding to a binary in another encoding.
The encoding parameter should be in encoding_list/0 as an atom or String, or in aliases/0.
Examples
iex> from_string("HÉ¦¦Ó", :iso_8859_1)
{:ok, <<72, 201, 166, 166, 211>>}
iex> from_string("HÉ¦¦Ó", :"ISO8859/8859-1") # without alias
{:ok, <<72, 201, 166, 166, 211>>}
iex> from_string("ʒ", :iso_8859_1)
{:error, "Invalid bytes for encoding"}
from_string(binary, encoding, from_s_missing_outer, term) :: {:ok, String.t, integer} | {:error, term, integer}
Convert an Elixir String in utf-8 to a binary in a specified encoding. A function parameter specifies how to deal with codepoints that are not representable in the target encoding.
Compared to from_string/2, you may pass a missing_fun function parameter to handle encoding errors in string. The function replace_nonexistent/1 may be used as a default error handling mechanism.
The encoding parameter should be in encoding_list/0 as an atom or String, or in aliases/0.
Implementing missing_fun
The missing_fun must be an anonymous function that returns a second function. The outer function will receive the encoding used by from_string/4, and must then return {:ok, inner_function} or {:error, reason}. Returning {:error, reason} will cause from_string/4 to fail.
The returned inner function must receive two arguments:
- a String containing the remainder of the string parameter that is still unprocessed
- the accumulator acc
The return value must be
- {:ok, replacement, new_rest, new_acc} to continue processing
- {:error, reason, new_acc} to cause from_string/4 to fail
The acc parameter from from_string/4 is passed between every invocation of the inner function, then returned by from_string/4. In many use cases, acc may be ignored.
Examples
Using the replace_nonexistent/1 function to handle invalid bytes:
iex> from_string("Hello æøå!", :ascii, replace_nonexistent("_"))
{:ok, "Hello ___!", 3}
Defining a custom missing_fun:
iex> missing_fun =
...>   fn encoding ->
...>     case from_string("#", encoding) do
...>       {:ok, replacement} ->
...>         inner_fun =
...>           fn <<_ :: utf8, rest :: binary>>, acc ->
...>             {:ok, replacement, rest, acc + 1}
...>           end
...>         {:ok, inner_fun}
...>       err ->
...>         err
...>     end
...>   end
iex> from_string("Hello æøå!", :ascii, missing_fun, 0)
{:ok, "Hello ###!", 3}
The previous code was included for completeness. If you know your replacement is valid in the target encoding, you might as well do:
iex> missing_fun = fn _encoding ->
...>   inner_fun =
...>     fn <<_ :: utf8, rest :: binary>>, acc ->
...>       {:ok, "#", rest, acc + 1}
...>     end
...>   {:ok, inner_fun}
...> end
iex> from_string("Hello æøå!", :ascii, missing_fun, 10)
{:ok, "Hello ###!", 13}
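The examples above only use the {:ok, ...} return of the inner function. The sketch below uses the {:error, reason, new_acc} return as well, giving up once a limit of replacements is reached. The module CappedReplacement and its missing_fun/2 are hypothetical names, not part of Codepagex, and the replacement is assumed to be valid in the target encoding.

```elixir
defmodule CappedReplacement do
  # Hypothetical helper, not part of Codepagex.
  # Builds a missing_fun for from_string/4 that replaces at most `max`
  # unrepresentable codepoints and then makes the conversion fail.
  # Assumes the accumulator passed to from_string/4 starts as an integer.
  def missing_fun(replacement, max) do
    fn _encoding ->
      inner_fun = fn
        # Limit reached: stop the conversion with an error
        <<_::utf8, _rest::binary>>, acc when acc >= max ->
          {:error, "too many unrepresentable codepoints", acc}

        # Drop the offending codepoint, emit the replacement, count it
        <<_::utf8, rest::binary>>, acc ->
          {:ok, replacement, rest, acc + 1}
      end

      {:ok, inner_fun}
    end
  end
end
```

It would be passed as from_string("Hello æøå!", :ascii, CappedReplacement.missing_fun("#", 2), 0), which is expected to fail on the third unrepresentable codepoint.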
Like from_string/2 but raising exceptions on errors.
Examples
iex> from_string!("HÉ¦¦Ó", :iso_8859_1)
<<72, 201, 166, 166, 211>>
iex> from_string!("ʒ", :iso_8859_1)
** (Codepagex.Error) Invalid bytes for encoding
from_string!(String.t, encoding, from_s_missing_outer, term) :: binary | no_return
Like from_string/4 but raising exceptions on errors.
Examples
iex> missing_fun = replace_nonexistent("_")
iex> from_string!("Hello æøå!", :ascii, missing_fun)
"Hello ___!"
replace_nonexistent/1 may be used in conjunction with from_string/4 or from_string!/4. If there are utf-8 codepoints in the source string that cannot be represented in the target encoding, they are replaced with the given String.
When using this function, from_string/4 will never return an error if replace_with converts to the target encoding without errors.
The accumulator input acc of from_string/4 is incremented on each replacement done.
Examples
iex> from_string!("Hello æøå!", :ascii, replace_nonexistent("_"))
"Hello ___!"
iex> from_string("Hello æøå!", :ascii, replace_nonexistent("_"), 100)
{:ok, "Hello ___!", 103}
Converts a binary in a specified encoding to an Elixir string in utf-8 encoding.
The encoding parameter should be in encoding_list/0 (passed as atoms or strings), or in aliases/0.
Examples
iex> to_string(<<72, 201, 166, 166, 211>>, :iso_8859_1)
{:ok, "HÉ¦¦Ó"}
iex> to_string(<<128>>, "ETSI/GSM0338")
{:error, "Invalid bytes for encoding"}
to_string(binary, encoding, to_s_missing_outer, term) :: {:ok, String.t, integer} | {:error, term, integer}
Convert a binary in a specified encoding into an Elixir string in utf-8 encoding.
Compared to to_string/2, you may pass a missing_fun function parameter to handle encoding errors in the binary. The function use_utf_replacement/0 may be used as a default error handling mechanism.
Implementing missing_fun
The missing_fun must be an anonymous function that returns a second function. The outer function will receive the encoding used by to_string/4, and must then return {:ok, inner_function} or {:error, reason}. Returning {:error, reason} will cause to_string/4 to fail.
The returned inner function must receive two arguments:
- a binary containing the remainder of the binary parameter that is still unprocessed
- the accumulator acc
The return value must be
- {:ok, replacement, new_rest, new_acc} to continue processing
- {:error, reason, new_acc} to cause to_string/4 to fail
The acc parameter from to_string/4 is passed between every invocation of the inner function, then returned by to_string/4. In many use cases, acc may be ignored.
Examples
Using the use_utf_replacement/0 function to handle invalid bytes:
iex> iso = "Hello æøå!" |> from_string!(:iso_8859_1)
iex> to_string(iso, :ascii, use_utf_replacement())
{:ok, "Hello ���!", 3}
iex> iso = "Hello æøå!" |> from_string!(:iso_8859_1)
iex> missing_fun =
...>   fn encoding ->
...>     case to_string("#", encoding) do
...>       {:ok, replacement} ->
...>         inner_fun =
...>           fn <<_, rest :: binary>>, acc ->
...>             {:ok, replacement, rest, acc + 1}
...>           end
...>         {:ok, inner_fun}
...>       err ->
...>         err
...>     end
...>   end
iex> to_string(iso, :ascii, missing_fun, 0)
{:ok, "Hello ###!", 3}
The previous code was included for completeness. If you know your replacement is valid in the target encoding, you might as well do:
iex> iso = "Hello æøå!" |> from_string!(:iso_8859_1)
iex> missing_fun =
...>   fn _encoding ->
...>     inner_fun =
...>       fn <<_, rest :: binary>>, acc ->
...>         {:ok, "#", rest, acc + 1}
...>       end
...>     {:ok, inner_fun}
...>   end
iex> to_string(iso, :ascii, missing_fun, 10)
{:ok, "Hello ###!", 13}
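Because the inner function receives the unprocessed bytes, it can also build the replacement from the byte it skips instead of using a fixed string. The sketch below renders each undecodable byte as a hex escape. HexEscape is a hypothetical module, not part of Codepagex, and the sketch assumes that to_string/4 accepts a replacement longer than one character.

```elixir
defmodule HexEscape do
  # Hypothetical helper, not part of Codepagex.
  # Builds a missing_fun for to_string/4 that keeps a visible trace of
  # every undecodable byte, e.g. byte 230 becomes the four characters \xE6.
  # Assumes the accumulator passed to to_string/4 starts as an integer.
  def missing_fun do
    fn _encoding ->
      inner_fun = fn <<byte, rest::binary>>, acc ->
        hex = byte |> Integer.to_string(16) |> String.pad_leading(2, "0")
        {:ok, "\\x" <> hex, rest, acc + 1}
      end

      {:ok, inner_fun}
    end
  end
end
```

It would be passed as to_string(iso, :ascii, HexEscape.missing_fun(), 0). Unlike a fixed placeholder, this keeps enough information to recover the original bytes later.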
Like to_string/2 but raises exceptions on errors.
Examples
iex> to_string!(<<72, 201, 166, 166, 211>>, :iso_8859_1)
"HÉ¦¦Ó"
iex> to_string!(<<128>>, "ETSI/GSM0338")
** (Codepagex.Error) Invalid bytes for encoding
to_string!(binary, encoding, to_s_missing_outer, term) :: String.t | no_return
Like to_string/4 but raises exceptions on errors.
Examples
iex> iso = "Hello æøå!" |> from_string!(:iso_8859_1)
iex> to_string!(iso, :ascii, use_utf_replacement())
"Hello ���!"
Convert a binary in one encoding to a binary in another encoding. The string is converted to utf-8 internally in the process.
The encoding parameters should be in encoding_list/0 or aliases/0. They may be passed as atoms, or as strings for full encoding names.
Examples
iex> translate(<<174>>, :iso_8859_1, :iso_8859_15)
{:ok, <<174>>}
iex> translate(<<174>>, :iso_8859_1, :iso_8859_2)
{:error, "Invalid bytes for encoding"}
Like translate/3 but raises exceptions on errors.
Examples
iex> translate!(<<174>>, :iso_8859_1, :iso_8859_15)
<<174>>
iex> translate!(<<174>>, :iso_8859_1, :iso_8859_2)
** (Codepagex.Error) Invalid bytes for encoding
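translate/3 takes no missing_fun, so a translation that should tolerate unmappable input has to be composed from to_string/4 and from_string/4. The following is a minimal sketch of that composition, assuming the codepagex package is a dependency. LossyTranslate and its translate/4 are hypothetical names, not part of Codepagex.

```elixir
defmodule LossyTranslate do
  # Hypothetical helper, not part of Codepagex.
  # Decodes to utf-8 with use_utf_replacement/0, then encodes to the target
  # with replace_nonexistent/1, returning both replacement counts.
  def translate(binary, from, to, placeholder \\ "?") do
    with {:ok, utf8, decode_count} <-
           Codepagex.to_string(binary, from, Codepagex.use_utf_replacement(), 0),
         {:ok, encoded, encode_count} <-
           Codepagex.from_string(utf8, to, Codepagex.replace_nonexistent(placeholder), 0) do
      {:ok, encoded, decode_count, encode_count}
    end
  end
end
```

If a byte is replaced during decoding and the target encoding cannot represent the replacement character either, that byte is counted in both counters. Any error tuple from either step is returned unchanged.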
use_utf_replacement/0 may be used as a parameter to to_string/4 or to_string!/4 such that any bytes in the input binary that don’t have a proper encoding are replaced with the Unicode replacement character (�) and the function will not fail.
If this function is used, to_string/4 will never return an error.
The accumulator input acc of to_string/4 is incremented by the number of replacements made.
Examples
iex> iso = "Hello æøå!" |> from_string!(:iso_8859_1)
iex> to_string!(iso, :ascii, use_utf_replacement())
"Hello ���!"
iex> iso = "Hello æøå!" |> from_string!(:iso_8859_1)
iex> to_string(iso, :ascii, use_utf_replacement())
{:ok, "Hello ���!", 3}