This module contains the individual fixes that Ftfy.fix_text/2 can perform. Its functions are the ones named in "explanations", such as the output of Ftfy.fix_and_explain/2.
It is a port of ftfy.fixes from the Python ftfy library.
Functions
Decode backslashed escape sequences (\x, \u, \U, octal, and the
single-character escapes), even in the presence of other Unicode.
Unlike the rest of ftfy, this must be called explicitly; escaped text is not necessarily a mistake.
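As an illustration of the idea, here is a minimal sketch in Python (the language of the original ftfy), using only the standard library. It is not this module's implementation; the name `decode_escapes` and the exact pattern are assumptions made for the example. Only the matched escape sequences are decoded, so other Unicode in the string passes through untouched.

```python
import codecs
import re

# Hypothetical pattern covering the escape forms listed above.
ESCAPE_RE = re.compile(
    r"""
      \\U[0-9a-fA-F]{8}     # 8-digit hex escape
    | \\u[0-9a-fA-F]{4}     # 4-digit hex escape
    | \\x[0-9a-fA-F]{2}     # 2-digit hex escape
    | \\[0-7]{1,3}          # octal escape
    | \\[\\'"abfnrtv]       # single-character escapes
    """,
    re.VERBOSE,
)


def decode_escapes(text):
    """Decode each matched escape on its own, leaving all other text alone."""
    return ESCAPE_RE.sub(
        lambda m: codecs.decode(m.group(0), "unicode-escape"), text
    )


print(decode_escapes("caf\\u00e9 \u2615 \\x41"))
```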
Fix UTF-8 mojibake that is embedded within otherwise-fine text, by fixing the detected sequences in isolation.
Treat any remaining C1 control characters as their Windows-1252 equivalents, the way web browsers do.
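A rough Python sketch of this reinterpretation, under the assumption that C1 code points with no Windows-1252 assignment are left unchanged (the port may handle those differently). The name `fix_c1_controls` is illustrative.

```python
import re

# C1 control characters occupy U+0080 through U+009F.
C1_RE = re.compile("[\x80-\x9f]")


def _as_windows_1252(match):
    raw = match.group(0).encode("latin-1")  # back to the single byte 0x80..0x9F
    try:
        return raw.decode("windows-1252")
    except UnicodeDecodeError:
        # Assumption: bytes unassigned in Windows-1252 (such as 0x81) stay as-is.
        return match.group(0)


def fix_c1_controls(text):
    return C1_RE.sub(_as_windows_1252, text)
```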
Replace fullwidth and halfwidth characters with their standard forms.
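One way to build such a mapping, sketched in Python: apply NFKC normalization to each character in the Halfwidth and Fullwidth Forms block only, and map the ideographic space to an ASCII space. The table below is an assumption for illustration and may not match the port's table exactly.

```python
import unicodedata

# Ideographic space, plus every width variant whose NFKC form differs.
WIDTH_MAP = {0x3000: " "}
for codepoint in range(0xFF01, 0xFFF0):
    char = chr(codepoint)
    standard = unicodedata.normalize("NFKC", char)
    if standard != char:
        WIDTH_MAP[codepoint] = standard


def fix_character_width(text):
    return text.translate(WIDTH_MAP)
```

Restricting normalization to this block avoids the broader changes that NFKC would make to the rest of the text.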
Replace single-character Latin ligatures with their component letters.
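A small Python sketch of the replacement, covering only the Latin ligatures in the Alphabetic Presentation Forms block. The real table is likely larger; this subset and the function name are assumptions.

```python
# Partial, illustrative table: U+FB00..U+FB06.
LIGATURES = str.maketrans({
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "\u017ft",  # long s + t
    "\ufb06": "st",
})


def fix_latin_ligatures(text):
    return text.translate(LIGATURES)
```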
Convert all line breaks to the Unix \n style.
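A minimal Python sketch follows. Which characters count as line breaks is an assumption here: CRLF, bare CR, LINE SEPARATOR, PARAGRAPH SEPARATOR, and NEXT LINE. CRLF is replaced first so that it does not turn into two newlines.

```python
def fix_line_breaks(text):
    return (
        text.replace("\r\n", "\n")    # Windows: must come before the bare-CR rule
        .replace("\r", "\n")          # classic Mac OS
        .replace("\u2028", "\n")      # LINE SEPARATOR
        .replace("\u2029", "\n")      # PARAGRAPH SEPARATOR
        .replace("\u0085", "\n")      # NEXT LINE
    )
```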
Replace properly-paired UTF-16 surrogate codepoints with the character they represent, or with U+FFFD otherwise.
Note: the BEAM cannot represent lone surrogate codepoints in a UTF-8 binary, so for any valid string input this is a no-op. It exists for API parity and to handle the (rare) case of a charlist that carries surrogate codepoints.
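For context, the sketch below shows the behavior in Python, where a string can hold surrogate code points. It is illustrative only, with assumed names, and is not how the port is written.

```python
import re

SURROGATE_RE = re.compile("[\ud800-\udfff]")
SURROGATE_PAIR_RE = re.compile("[\ud800-\udbff][\udc00-\udfff]")


def _convert_pair(match):
    high, low = match.group(0)
    # Standard UTF-16 decoding arithmetic.
    return chr(0x10000 + (ord(high) - 0xD800) * 0x400 + (ord(low) - 0xDC00))


def fix_surrogates(text):
    if SURROGATE_RE.search(text):
        text = SURROGATE_PAIR_RE.sub(_convert_pair, text)  # proper pairs first
        text = SURROGATE_RE.sub("\ufffd", text)            # leftovers are unpaired
    return text
```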
Remove a byte-order mark decoded as if it were part of the text.
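In Python this can be sketched in one line, assuming only leading occurrences of U+FEFF are removed:

```python
def remove_bom(text):
    # U+FEFF at the start of text is a decoded byte-order mark.
    return text.lstrip("\ufeff")
```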
Remove control characters that have no displayed effect on text.
Strip out ANSI terminal escape sequences, such as color codes.
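A Python sketch with an assumed pattern: ESC, `[`, optional digits and semicolons, then a single letter. This covers color codes and similar CSI sequences, though it is not an exhaustive grammar for terminal escapes.

```python
import re

ANSI_RE = re.compile(r"\x1b\[[0-9;]*[a-zA-Z]")


def remove_terminal_escapes(text):
    return ANSI_RE.sub("", text)


print(remove_terminal_escapes("\x1b[36;44mI'm blue\x1b[0m"))
```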
Replace lossy UTF-8 sequences (where a continuation byte became 0x1A or '?') with the UTF-8 encoding of U+FFFD. Operates on raw bytes.
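The sketch below is a deliberately simplified Python version that handles only two-byte and three-byte sequences whose continuation byte became 0x1A. The '?' case and four-byte sequences are omitted, and the pattern is an assumption, not the port's.

```python
import re

LOSSY_RE = re.compile(
    rb"[\xc2-\xdf]\x1a"                                  # 2-byte lead + lost byte
    rb"|[\xe0-\xef](?:\x1a[\x80-\xbf\x1a]|[\x80-\xbf]\x1a)"  # 3-byte, >= 1 lost
)

REPLACEMENT = "\ufffd".encode("utf-8")  # b"\xef\xbf\xbd"


def replace_lossy_sequences(data):
    """Operates on bytes, replacing each damaged sequence with U+FFFD."""
    return LOSSY_RE.sub(REPLACEMENT, data)
```

Intact sequences contain no 0x1A in a continuation position, so they do not match and are left alone.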
Put back byte A0 (non-breaking space) where a Windows-1252 program replaced it with an ASCII space, when doing so makes a fixable UTF-8 sequence. Operates on raw bytes.
Replace curly quotation marks with straight equivalents.
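A short Python sketch, assuming the affected characters are U+2018 through U+201B plus the modifier apostrophe U+02BC for single quotes, and U+201C through U+201F for double quotes:

```python
import re

SINGLE_QUOTE_RE = re.compile("[\u02bc\u2018-\u201b]")
DOUBLE_QUOTE_RE = re.compile("[\u201c-\u201f]")


def uncurl_quotes(text):
    return SINGLE_QUOTE_RE.sub("'", DOUBLE_QUOTE_RE.sub('"', text))


print(uncurl_quotes("\u201chere\u2019s a test\u201d"))
```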
Decode HTML entities and character references, including some nonstandard all-caps ones, but only the unambiguous ones that end in semicolons.
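To make the "unambiguous" rule concrete, here is a Python sketch using the standard library's entity table. Only references that end in a semicolon are matched, and an all-caps name is accepted when its lowercase form is a known entity. The names and the pattern are assumptions for illustration.

```python
import html
import re
from html.entities import html5  # keys look like "eacute;"

ENTITY_RE = re.compile(r"&#?[0-9A-Za-z]{2,24};")


def _replace_entity(match):
    entity = match.group(0)
    name = entity[1:]
    if name.startswith("#"):
        return html.unescape(entity)       # numeric character reference
    if name in html5:
        return html5[name]                 # standard named entity
    if name == name.upper() and name.lower() in html5:
        return html5[name.lower()].upper()  # nonstandard all-caps form
    return entity                          # unknown: leave untouched


def unescape_html(text):
    return ENTITY_RE.sub(_replace_entity, text)
```

A bare `&amp` without the semicolon, or text such as `this&that`, does not match the pattern and is returned unchanged.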