content_indexer v0.1.0 ContentIndexer.Services.Calculator

Summary

Calculates the content_indexer (tf-idf) weights for a document of tokens against a corpus of tokenized documents.

https://en.wikipedia.org/wiki/Tf-idf

** What is Tf-Idf **

tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is
intended to reflect how important a word is to a document in a collection or corpus. It is often
used as a weighting factor in information retrieval and text mining.
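
For reference, one common formulation is sketched below. The linked article lists several variants, and this page does not state which one the library implements, so the symbols here are illustrative: f_{t,d} is the raw count of token t in document d, D is the corpus, and N is the number of documents in D.

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t, D) = \ln \frac{N}{\left|\{ d \in D : t \in d \}\right|}, \qquad
\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)
```

A token that is frequent in one document but rare across the corpus receives a high weight, while a token that appears in every document receives a weight of zero under this variant.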

This library supports calculating weights for large datasets in parallel, using Erlang/OTP servers and actor-style processes.
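
The library's own server and process layout is not shown on this page. As a rough illustration of the general idea only, the sketch below uses the standard `Task.async_stream/3` to process each document of a corpus concurrently. The module `ParallelCorpusSketch` and the function `map_documents/2` are hypothetical names and not part of content_indexer.

```elixir
defmodule ParallelCorpusSketch do
  # Hypothetical helper, not part of content_indexer.
  # Applies `fun` to every document, each in its own task. By default
  # Task.async_stream caps concurrency at the number of online schedulers.
  # Results are returned in the same order as the input documents.
  def map_documents(corpus, fun) when is_function(fun, 1) do
    corpus
    |> Task.async_stream(fun, ordered: true, timeout: :infinity)
    |> Enum.map(fn {:ok, result} -> result end)
  end
end
```

For example, `ParallelCorpusSketch.map_documents(corpus, &Enum.frequencies/1)` would count tokens in every document of `corpus` concurrently.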

Currently the supported file types are plain text, PDF and DOCX (XML).

** Basic Usage **

Pass a list of tokens and a corpus (a list of token lists). The function returns each token paired with its
content_indexer weight relative to the corpus.

iex> ContentIndexerService.calculate_content_indexer_documents(
...>   ["bread", "butter", "jam"],
...>   [["red", "brown", "jam"], ["blue", "green", "butter"], ["pink", "green", "bread", "jam"]]
...> )
{:ok, [bread: 0.3662040962227032, butter: 0.3662040962227032, jam: 0.3662040962227032]}

Summary

Functions

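calculate_content_indexer_documents(tokens, corpus_of_tokens)
Calculates the content_indexer weights for each token in the list of tokens against the corpus of tokens.

calculate_content_indexer_query(tokens)
Calculates the content_indexer weights for each token in the query, weighting the query against itself.

calculate_tf_document(tokens)
Calculates the term frequency for each token in the list of tokens representing the document and returns the tokens with their respective term frequencies.

calculate_token_count_document(tokens)
Calculates the word count for each token in the list of tokens representing the document and returns the tokens with their respective word counts.

calculate_tokens_againts_corpus(content, corpus)
Calculates the content_indexer weights for a comma-separated string of tokens against a corpus of comma-separated strings.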

Functions

calculate_content_indexer_documents(tokens, corpus_of_tokens)

Calculates the content_indexer weights for each token in the list of tokens against the corpus of tokens.

iex> ContentIndexerService.calculate_content_indexer_documents(
...>   ["bread", "butter", "jam"],
...>   [["red", "brown", "jam"], ["blue", "green", "butter"], ["pink", "green", "bread", "jam"]]
...> )

calculate_content_indexer_documents(tokens, corpus_of_tokens, corpus_size)

calculate_content_indexer_query(tokens)

Calculates the content_indexer weights for each token in the query, weighting the query against itself.

iex> ContentIndexerService.calculate_content_indexer_query(
...>   ["bread", "butter", "jam"]
...> )

calculate_tf_document(tokens)

Calculates the term frequency for each token in the list of tokens representing the document and returns the tokens with their respective term frequencies.

iex> ContentIndexerService.calculate_tf_document(["bread", "butter", "jam", "jam", "bread", "bread"])

calculate_token_count_document(tokens)

Calculates the word count for each token in the list of tokens representing the document and returns the tokens with their respective word counts.

iex> ContentIndexerService.calculate_token_count_document(["bread", "butter", "jam", "jam", "bread", "bread"])
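
Neither of the two examples above shows a return value, so the exact result shape is not documented here. As a minimal sketch of what token counting and term frequency usually mean, the hypothetical module below returns plain maps and assumes term frequency is the token count divided by the document length. The library may use a different shape or normalisation.

```elixir
defmodule TokenStatsSketch do
  # Hypothetical helpers; names and return shapes are illustrative only
  # and are not taken from content_indexer.

  # Number of occurrences of each distinct token in the document.
  def token_counts(tokens), do: Enum.frequencies(tokens)

  # Occurrences of each token divided by the total number of tokens
  # (an assumed normalisation).
  def term_frequencies(tokens) do
    total = length(tokens)

    tokens
    |> token_counts()
    |> Map.new(fn {token, count} -> {token, count / total} end)
  end
end
```

With the six tokens used above, `token_counts/1` yields `%{"bread" => 3, "butter" => 1, "jam" => 2}`, and `term_frequencies/1` gives "bread" a frequency of 0.5 (3 of 6 tokens).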

calculate_tokens_againts_corpus(content, corpus)

Calculates the content_indexer weights for a comma-separated string of tokens against a corpus of comma-separated strings.

iex> ContentIndexerValidateService.calculate_tokens_againts_corpus(
...>   "bread,butter,jam",
...>   ["red,brown,jam", "blue,green,butter", "pink,green,bread,jam"]
...> )
{:ok,
 [
   {"bread", 0.13515503603605478},
   {"butter", 0.13515503603605478},
   {"jam", 0.0}
 ]}
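
The figures in this example are consistent with a term frequency of count / document length multiplied by a smoothed inverse document frequency of ln(N / (1 + df)), where N is the corpus size and df is the number of corpus documents containing the token. That formula is inferred from the numbers, not confirmed by the source, and it does not reproduce the figures in the Basic Usage example. The hypothetical `TfIdfSketch` module below implements it for illustration only.

```elixir
defmodule TfIdfSketch do
  # Hypothetical reimplementation for illustration; not the library's code.
  # Assumes comma-separated input and idf = ln(N / (1 + df)).
  def weights(content, corpus) do
    tokens = String.split(content, ",")
    documents = Enum.map(corpus, &String.split(&1, ","))
    corpus_size = length(documents)
    total = length(tokens)

    weights =
      tokens
      |> Enum.frequencies()
      |> Enum.map(fn {token, count} ->
        # Number of corpus documents that contain the token.
        doc_freq = Enum.count(documents, &(token in &1))
        {token, count / total * :math.log(corpus_size / (1 + doc_freq))}
      end)
      |> Enum.sort()

    {:ok, weights}
  end
end
```

Under these assumptions "jam" appears in two of the three documents, so its idf is ln(3 / 3) and its weight is 0.0, while "bread" and "butter" each appear in one document and receive (1/3) × ln(3/2), approximately 0.1352.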

init_calculator()

list_contains(list, item)