Penelope v0.5.0 Penelope.ML.Text.POSFeaturizer
The POS featurizer converts a list of lists of tokens into nested lists containing feature maps relevant to POS tagging for each token.
Features used for the POS tagger are largely inspired by "A Maximum Entropy Model for Part-Of-Speech Tagging". The following is an example feature map for an individual token:
token_list = ["it", "is", "a", "little-known", "fact"]
token = "little-known"
%{
"has_hyphen" => true,
"has_digit" => false,
"has_cap" => false,
"pre_1" => "l",
"pre_2" => "li",
"pre_3" => "lit",
"pre_4" => "litt",
"suff_1" => "n",
"suff_2" => "wn",
"suff_3" => "own",
"suff_4" => "nown",
"tok_-2" => "is",
"tok_-1" => "a",
"tok_0" => "little-known",
"tok_1" => "fact",
"tok_2" => "",
}
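A feature map like the one above could be derived roughly as follows. This is a minimal sketch, not Penelope's actual implementation: the module name `POSFeatureSketch`, its one-argument `transform/1`, and the helpers are hypothetical. The handling of tokens shorter than four characters and the ASCII-only digit and capital checks are assumptions.

```elixir
defmodule POSFeatureSketch do
  # Hypothetical module for illustration; not part of Penelope.

  # Maps each token list to a list of feature maps, one per token.
  def transform(token_lists) do
    Enum.map(token_lists, fn tokens ->
      tokens
      |> Enum.with_index()
      |> Enum.map(fn {token, index} -> features(tokens, token, index) end)
    end)
  end

  defp features(tokens, token, index) do
    # Surface-shape flags (ASCII-only checks are an assumption).
    shape = %{
      "has_hyphen" => String.contains?(token, "-"),
      "has_digit" => String.match?(token, ~r/[0-9]/),
      "has_cap" => String.match?(token, ~r/[A-Z]/)
    }

    # Prefixes and suffixes of length 1 to 4.
    prefixes = for n <- 1..4, into: %{}, do: {"pre_#{n}", String.slice(token, 0, n)}
    suffixes = for n <- 1..4, into: %{}, do: {"suff_#{n}", suffix(token, n)}

    # Context window of two tokens on either side; "" marks out-of-range positions.
    window =
      for offset <- -2..2, into: %{} do
        {"tok_#{offset}", token_at(tokens, index + offset)}
      end

    shape
    |> Map.merge(prefixes)
    |> Map.merge(suffixes)
    |> Map.merge(window)
  end

  # Last n characters; tokens shorter than n yield the whole token (assumption).
  defp suffix(token, n) do
    len = String.length(token)
    String.slice(token, max(len - n, 0), n)
  end

  # Guard against negative indices, which Enum.at/3 would count from the end.
  defp token_at(_tokens, i) when i < 0, do: ""
  defp token_at(tokens, i), do: Enum.at(tokens, i, "")
end
```

Calling `POSFeatureSketch.transform([["it", "is", "a", "little-known", "fact"]])` returns one inner list with five feature maps; the fourth corresponds to "little-known".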
Summary
Functions
Transforms the token lists into lists of feature maps.
Functions
Transforms the token lists into lists of feature maps.