Hyphen8
Introduction
Hyphen8 is a pure Elixir implementation of the Knuth-Liang Hyphenation Algorithm. That algorithm formed the basis of Franklin Mark Liang's 1983 Stanford Dissertation, WORD HY-PHEN-A-TION BY COM-PU-TER. It remains the standard hyphenation method in TeX.
Usage
Pass a string to Hyphen8.Engine.main()
:
iex> Hyphen8.Engine.main("let's hyphenate containerization orchestration platform")
You will receive the hyphenated string:
"let s hy-phen-ate con-tainer-iza-tion or-ches-tra-tion plat-form"
The current version will not reconstruct punctuation. To customize the string and word splitting, adjust the regular expressions in Hyphen8.Engine.parse_words()
and Hyphen8.Engine.parse_characters()
.
History of the Knuth-Liang Algorithm
Liang built a program, Patgen, which developed a large pattern table. He says the following about the resulting algorithm:
"The new hyphenation algorithm is based on the idea of hyphenating and inhibiting patterns. These are simply strings of letters that, when they match a word, give us information about hyphenation at some point in the pattern. For example, '-tion' and *c-c' are good hyphenating patterns. An important feature of this method is that a suitable set of patterns can be extracted automatically from the dictionary.
"...The resulting hyphenation algorithm uses about 4500 patterns...These patterns find 89% of the hyphens in a pocket dictionary word list, with essentially no error."
In the official resource for Patgen, we find a clear description of the algorithm's logic:
"The patterns consist of strings of letters and digits, where a digit indicates a
hyphenation value
for some intercharacter position. For example, the pattern\.{3t2ion}
specifies that if the string\.{tion}
occurs in a word, we should assign a hyphenation value of3
to the position immediately before the\.{t}
, and a value of2
to the position between the\.{t}
and the\.{i}
.
"...To hyphenate a word, we find all patterns that match within the word and determine the hyphenation values for each intercharacter position. If more than one pattern applies to a given position, we take the maximum of the values specified (i.e., the higher value takes priority). If the resulting hyphenation value is odd, this position is a feasible breakpoint; if the value is even or if no value has been specified, we are not allowed to break at this position.
Hyphen8 uses Liang's original tables. There is a lot of room for optimization in this Elixir port, but the basic implementation is there.
There is a large sample output in this repo. View this hyphenated version of Moby-Dick. The input was the unabridged, UTF-8 copy of Moby Dick on Project Gutenberg. This file is Hyphen8's raw output.
Additional Resources
Installation
If available in Hex, the package can be installed
by adding hyphen8
to your list of dependencies in mix.exs
:
def deps do
[
{:hyphen8, "~> 0.1.5"}
]
end