Ftfy.Fixes (ftfy v0.1.0)

The individual fixes that Ftfy.fix_text/2 can perform, and the functions named in "explanations" such as the output of Ftfy.fix_and_explain/2.

Port of the ftfy.fixes module from the Python ftfy library.

Summary

Functions

ansi_re()

decode_escapes(text)

Decode backslashed escape sequences (\x, \u, \U, octal, and the single-character escapes), even in the presence of other Unicode.

decode_inconsistent_utf8(text)

Fix UTF-8 mojibake that is embedded within otherwise-fine text, by fixing the detected sequences in isolation.

fix_c1_controls(text)

Treat any remaining C1 control characters as their Windows-1252 equivalents, the way web browsers do.

fix_character_width(text)

Replace fullwidth and halfwidth characters with their standard forms.

fix_latin_ligatures(text)

Replace single-character Latin ligatures with their component letters.

fix_line_breaks(text)

Convert all line breaks to the Unix \n style.

fix_surrogates(text)

Replace properly-paired UTF-16 surrogate codepoints with the character they represent, or with U+FFFD otherwise.

remove_bom(text)

Remove a byte-order mark decoded as if it were part of the text.

remove_control_chars(text)

Remove control characters that have no displayed effect on text.

remove_terminal_escapes(text)

Strip out ANSI terminal escape sequences, such as color codes.

replace_lossy_sequences(bytes)

Replace lossy UTF-8 sequences (where a continuation byte became 0x1A or '?') with the UTF-8 encoding of U+FFFD. Operates on raw bytes.

restore_byte_a0(bytes)

Put back byte A0 (non-breaking space) where a Windows-1252 program replaced it with an ASCII space, when doing so makes a fixable UTF-8 sequence. Operates on raw bytes.

uncurl_quotes(text)

Replace curly quotation marks with straight equivalents.

unescape_html(text)

Decode HTML entities and character references, including some nonstandard all-caps ones, but only the unambiguous ones that end in semicolons.

Functions

ansi_re()

decode_escapes(text)

@spec decode_escapes(binary()) :: binary()

Decode backslashed escape sequences (\x, \u, \U, octal, and the single-character escapes), even in the presence of other Unicode.

Unlike the rest of ftfy, this must be called explicitly; escaped text is not necessarily a mistake.
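
As an illustration of the general technique, the sketch below decodes a small subset of escapes (\xHH plus \n, \t, \r, and \\) with a single regular expression. It is not this library's implementation: the module name `EscapeSketch` is hypothetical, and \u, \U, and octal escapes are deliberately left out to keep the example short.

```elixir
defmodule EscapeSketch do
  # Hypothetical helper, not part of Ftfy. Single-character escapes handled here.
  @simple %{"n" => "\n", "t" => "\t", "r" => "\r", "\\" => "\\"}

  # Decode \xHH and a few single-character escapes; everything else,
  # including non-ASCII text, passes through untouched.
  def decode(text) do
    Regex.replace(~r/\\(?:x([0-9a-fA-F]{2})|([ntr\\]))/, text, fn
      # \xHH matched: group 1 holds the two hex digits, group 2 is empty.
      _whole, hex, "" -> <<String.to_integer(hex, 16)::utf8>>
      # Single-character escape matched: group 2 holds the character.
      _whole, "", char -> Map.fetch!(@simple, char)
    end)
  end
end
```

For example, `EscapeSketch.decode("caf\\xe9")` turns the four literal characters `\xe9` into `é`, while text that contains no backslash is returned unchanged.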

decode_inconsistent_utf8(text)

@spec decode_inconsistent_utf8(binary()) :: binary()

Fix UTF-8 mojibake that is embedded within otherwise-fine text, by fixing the detected sequences in isolation.

fix_c1_controls(text)

@spec fix_c1_controls(binary()) :: binary()

Treat any remaining C1 control characters as their Windows-1252 equivalents, the way web browsers do.
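
The idea can be sketched as a codepoint lookup: each C1 control (U+0080 through U+009F) is replaced by the character that Windows-1252 assigns to the same byte value. The module `C1Sketch` and its deliberately partial table are assumptions made for illustration, not the library's code, and the sketch assumes valid UTF-8 input.

```elixir
defmodule C1Sketch do
  # Hypothetical, partial table: C1 codepoint => Windows-1252 character.
  # A complete table would cover every assigned byte from 0x80 to 0x9F.
  @cp1252 %{
    0x80 => 0x20AC, # euro sign
    0x85 => 0x2026, # horizontal ellipsis
    0x91 => 0x2018, # left single quotation mark
    0x92 => 0x2019, # right single quotation mark
    0x93 => 0x201C, # left double quotation mark
    0x94 => 0x201D, # right double quotation mark
    0x96 => 0x2013, # en dash
    0x99 => 0x2122  # trade mark sign
  }

  # Walk the string one codepoint at a time; codepoints missing from the
  # table are copied through unchanged.
  def fix(text) do
    for <<cp::utf8 <- text>>, into: "" do
      <<Map.get(@cp1252, cp, cp)::utf8>>
    end
  end
end
```

With this table, `C1Sketch.fix("it\u0092s")` yields `"it\u2019s"`, that is, the word with a right single quotation mark in place of the control character.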

fix_character_width(text)

@spec fix_character_width(binary()) :: binary()

Replace fullwidth and halfwidth characters with their standard forms.
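
For the fullwidth half of this fix, the mapping is simple arithmetic: the fullwidth ASCII variants U+FF01 through U+FF5E sit exactly 0xFEE0 above their standard forms, and U+3000 is the ideographic space. The sketch below shows only that part, under the assumption of valid UTF-8 input. `WidthSketch` is a hypothetical name, and halfwidth katakana and the other width variants are not handled.

```elixir
defmodule WidthSketch do
  # Hypothetical helper: narrows fullwidth ASCII variants only.
  def narrow(text) do
    for <<cp::utf8 <- text>>, into: "" do
      cond do
        # Fullwidth '!' through '~' map onto ASCII 0x21..0x7E.
        cp in 0xFF01..0xFF5E -> <<(cp - 0xFEE0)::utf8>>
        # Ideographic space becomes an ordinary space.
        cp == 0x3000 -> " "
        true -> <<cp::utf8>>
      end
    end
  end
end
```

Calling `WidthSketch.narrow("\uFF28\uFF45\uFF4C\uFF4C\uFF4F\uFF01")` returns `"Hello!"`.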

fix_latin_ligatures(text)

@spec fix_latin_ligatures(binary()) :: binary()

Replace single-character Latin ligatures with their component letters.
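
A minimal sketch of the replacement, assuming only the five f-ligatures U+FB00 through U+FB04 matter. The module name `LigatureSketch` and the choice of table are illustrative; the set of ligatures the library actually covers may be larger.

```elixir
defmodule LigatureSketch do
  # Hypothetical, partial table: single-codepoint ligature => component letters.
  @ligatures [
    {"\uFB00", "ff"},
    {"\uFB01", "fi"},
    {"\uFB02", "fl"},
    {"\uFB03", "ffi"},
    {"\uFB04", "ffl"}
  ]

  # Apply each replacement in turn over the whole string.
  def expand(text) do
    Enum.reduce(@ligatures, text, fn {ligature, letters}, acc ->
      String.replace(acc, ligature, letters)
    end)
  end
end
```

Here `LigatureSketch.expand("\uFB01sh")` returns `"fish"`.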

fix_line_breaks(text)

@spec fix_line_breaks(binary()) :: binary()

Convert all line breaks to the Unix \n style.
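
One way to sketch this normalization with the standard library alone is shown below. Which separators count as line breaks is an assumption here (CRLF, lone CR, U+2028, U+2029, and U+0085), and `LineBreakSketch` is a hypothetical module, not the library's code.

```elixir
defmodule LineBreakSketch do
  # Hypothetical helper. CRLF is handled first so that it collapses to a
  # single "\n" rather than two.
  def normalize(text) do
    text
    |> String.replace("\r\n", "\n")
    |> String.replace(["\r", "\u2028", "\u2029", "\u0085"], "\n")
  end
end
```

For instance, `LineBreakSketch.normalize("a\r\nb\rc")` returns `"a\nb\nc"`.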

fix_surrogates(text)

@spec fix_surrogates(binary()) :: binary()

Replace properly-paired UTF-16 surrogate codepoints with the character they represent, or with U+FFFD otherwise.

Note: a valid UTF-8 binary cannot contain surrogate codepoints (U+D800 through U+DFFF), whether paired or lone, so for any valid string input this is a no-op. It exists for API parity and to handle the (rare) case of a charlist that carries surrogate codepoints.
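
The pairing arithmetic itself can be illustrated on a list of integer codepoints, which is the only place surrogates can appear in Elixir data. This is a sketch under that assumption, with a hypothetical module name `SurrogateSketch`; it does not reflect how the library handles its binary input.

```elixir
defmodule SurrogateSketch do
  import Bitwise

  # A high surrogate followed by a low surrogate combines into one
  # supplementary-plane codepoint.
  def fix([hi, lo | rest]) when hi in 0xD800..0xDBFF and lo in 0xDC00..0xDFFF do
    [0x10000 + ((hi - 0xD800) <<< 10) + (lo - 0xDC00) | fix(rest)]
  end

  # Any other surrogate is unpaired and becomes U+FFFD.
  def fix([cp | rest]) when cp in 0xD800..0xDFFF, do: [0xFFFD | fix(rest)]

  # Ordinary codepoints are copied through.
  def fix([cp | rest]), do: [cp | fix(rest)]
  def fix([]), do: []
end
```

`SurrogateSketch.fix([0xD83D, 0xDCA9])` returns `[0x1F4A9]`, and a lone `0xD83D` becomes `0xFFFD`.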

remove_bom(text)

@spec remove_bom(binary()) :: binary()

Remove a byte-order mark decoded as if it were part of the text.
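
Once decoded, a byte-order mark is simply the codepoint U+FEFF at the start of the string, so the fix can be sketched with `String.trim_leading/2`. The module name `BomSketch` is hypothetical, and whether the library removes one or several leading marks is not stated in the source; this sketch removes all of them.

```elixir
defmodule BomSketch do
  # Hypothetical helper: drop every leading U+FEFF, leave the rest alone.
  def strip(text), do: String.trim_leading(text, "\uFEFF")
end
```

`BomSketch.strip("\uFEFFhello")` returns `"hello"`; a string without a leading mark is returned as is.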

remove_control_chars(text)

@spec remove_control_chars(binary()) :: binary()

Remove control characters that have no displayed effect on text.

remove_terminal_escapes(text)

@spec remove_terminal_escapes(binary()) :: binary()

Strip out ANSI terminal escape sequences, such as color codes.
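
A rough sketch of the technique, assuming the sequences of interest are CSI sequences of the form ESC, `[`, optional digits and semicolons, then one letter. The pattern below is an illustration in a hypothetical module `AnsiSketch`; it is not necessarily the pattern that `ansi_re/0` returns.

```elixir
defmodule AnsiSketch do
  # Hypothetical helper: delete ESC [ <digits and semicolons> <letter>.
  def strip(text) do
    Regex.replace(~r/\x1b\[[0-9;]*[a-zA-Z]/, text, "")
  end
end
```

`AnsiSketch.strip("\e[31mred\e[0m text")` returns `"red text"`.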

replace_lossy_sequences(bytes)

@spec replace_lossy_sequences(binary()) :: binary()

Replace lossy UTF-8 sequences (where a continuation byte became 0x1A or '?') with the UTF-8 encoding of U+FFFD. Operates on raw bytes.

restore_byte_a0(bytes)

@spec restore_byte_a0(binary()) :: binary()

Put back byte A0 (non-breaking space) where a Windows-1252 program replaced it with an ASCII space, when doing so makes a fixable UTF-8 sequence. Operates on raw bytes.

uncurl_quotes(text)

@spec uncurl_quotes(binary()) :: binary()

Replace curly quotation marks with straight equivalents.
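
As a minimal sketch, the replacement can be written with `String.replace/3` and a list of patterns. Only the four common left and right marks are covered here, which is an assumption; the library may treat additional quotation characters as curly. `QuoteSketch` is a hypothetical name.

```elixir
defmodule QuoteSketch do
  # Hypothetical helper covering U+2018, U+2019, U+201C, and U+201D only.
  @single ["\u2018", "\u2019"]
  @double ["\u201C", "\u201D"]

  def uncurl(text) do
    text
    |> String.replace(@single, "'")
    |> String.replace(@double, "\"")
  end
end
```

`QuoteSketch.uncurl("\u201Cit\u2019s\u201D")` returns `"\"it's\""`.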

unescape_html(text)

@spec unescape_html(binary()) :: binary()

Decode HTML entities and character references, including some nonstandard all-caps ones, but only the unambiguous ones that end in semicolons.
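
The semicolon requirement can be sketched as follows: a reference is decoded only when it is terminated by `;`, and anything unknown or unterminated is left exactly as written. The five-entry table and the module name `EntitySketch` are assumptions made for illustration; the library's entity table, including its all-caps variants, is not reproduced here.

```elixir
defmodule EntitySketch do
  # Hypothetical, tiny entity table.
  @named %{"amp" => "&", "lt" => "<", "gt" => ">", "quot" => "\"", "nbsp" => "\u00A0"}

  # Match "&name;", "&#123;", or "&#x7B;". The trailing semicolon is mandatory.
  def unescape(text) do
    pattern = ~r/&(#[0-9]{1,7}|#[xX][0-9a-fA-F]{1,6}|[A-Za-z][A-Za-z0-9]*);/

    Regex.replace(pattern, text, fn whole, body ->
      case decode(body) do
        {:ok, replacement} -> replacement
        # Unknown names and invalid codepoints stay as they were.
        :error -> whole
      end
    end)
  end

  defp decode("#x" <> hex), do: from_codepoint(String.to_integer(hex, 16))
  defp decode("#X" <> hex), do: from_codepoint(String.to_integer(hex, 16))
  defp decode("#" <> dec), do: from_codepoint(String.to_integer(dec, 10))
  defp decode(name), do: Map.fetch(@named, name)

  # Reject zero, surrogates, and values beyond U+10FFFF.
  defp from_codepoint(cp) when cp in 1..0xD7FF or cp in 0xE000..0x10FFFF do
    {:ok, <<cp::utf8>>}
  end

  defp from_codepoint(_cp), do: :error
end
```

`EntitySketch.unescape("a &lt; b &amp; c")` returns `"a < b & c"`, while `"AT&T &amp Co"` comes back unchanged because neither ampersand starts a semicolon-terminated reference.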