content_indexer v0.2.0 ContentIndexer.Services.Calculator
Summary
Calculates the content_indexer (tf-idf) weights for a document of tokens against a corpus of tokenized documents.
https://en.wikipedia.org/wiki/Tf-idf
** What is Tf-Idf **
tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is
intended to reflect how important a word is to a document in a collection or corpus. It is often
used as a weighting factor in information retrieval and text mining.
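For reference, the textbook definition can be written as below. This is the common unsmoothed form with a natural logarithm; the exact variant this library uses (smoothing term, log base) is not stated here, so treat it as an assumption. The symbols are introduced for illustration: f_{t,d} is the number of times token t occurs in document d, and N is the number of documents in the corpus D.

```latex
\[
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}},
\qquad
\mathrm{idf}(t, D) = \ln \frac{N}{\left|\{\, d \in D : t \in d \,\}\right|},
\qquad
\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)
\]
```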
This library can process large datasets in parallel using an Erlang/OTP-based server and actors.
Currently the supported file types are plain text, PDF and DOCX (XML).
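To give a feel for the kind of parallelism OTP makes available, here is a minimal sketch that computes term frequencies for several documents concurrently with `Task.async_stream/3`. The module `ParallelSketch` and its functions are hypothetical and are not part of this library; how the library actually distributes work across its server and actors is not described in this section.

```elixir
defmodule ParallelSketch do
  # Hypothetical helper: relative frequency of each token in one document.
  def term_frequencies(tokens) do
    total = length(tokens)

    tokens
    |> Enum.frequencies()
    |> Map.new(fn {token, count} -> {token, count / total} end)
  end

  # Runs one task per document; results are returned in input order.
  def term_frequencies_for_corpus(corpus) do
    corpus
    |> Task.async_stream(&term_frequencies/1, ordered: true)
    |> Enum.map(fn {:ok, result} -> result end)
  end
end

corpus = [["red", "brown", "jam"], ["jam", "jam", "bread", "butter"]]
IO.inspect(ParallelSketch.term_frequencies_for_corpus(corpus))
```

`Task.async_stream/3` caps concurrency at the number of online schedulers by default, which is usually a sensible starting point for CPU-bound work like this.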
** Basic Usage **
Pass a list of tokens and a corpus of tokens (a list of token lists). The function returns the tokens
with their corresponding content_indexer weights, calculated against the corpus.
iex> ContentIndexerService.calculate_content_indexer_documents(
...>   ["bread", "butter", "jam"],
...>   [["red", "brown", "jam"], ["blue", "green", "butter"], ["pink", "green", "bread", "jam"]]
...> )
{:ok, [bread: 0.3662040962227032, butter: 0.3662040962227032, jam: 0.3662040962227032]}
Summary
Functions
Calculates the content_indexer weights for each token in the list of tokens against the corpus of tokens.
Calculates the content_indexer weights for each token in the list of tokens against the corpus of tokens, given a precomputed corpus size.
Calculates the content_indexer weights for each token in the query, weighting the query against itself.
Calculates the term frequency of each token in the document's token list and returns the tokens with their term frequencies.
Calculates the word count of each token in the document's token list and returns the tokens with their word counts.
Calculates the content_indexer weights for a comma-separated string of tokens against a corpus of comma-separated strings.
Simple function that checks whether an item is contained in a list.
Functions
Calculates the content_indexer weights for each token in the list of tokens against the corpus of tokens.
## Parameters
- tokens: List of tokens to be indexed
- corpus_of_tokens: List of lists representing the corpus of all tokens
## Example
iex> ContentIndexerService.calculate_content_indexer_documents(
...>   ["bread", "butter", "jam"],
...>   [
...>     ["red", "brown", "jam"],
...>     ["blue", "green", "butter"],
...>     ["pink", "green", "bread", "jam"]
...>   ]
...> )
{:ok, [bread: 0.3662040962227032, butter: 0.3662040962227032, jam: 0.3662040962227032]}
Calculates the content_indexer weights for each token in the list of tokens against the corpus of tokens, given a precomputed corpus size.
## Parameters
- tokens: List of tokens to be indexed
- corpus_of_tokens: List of lists representing the corpus of all tokens
- corpus_size: Integer size of corpus_of_tokens, passed in so that it does not have to be recalculated
## Example
iex> ContentIndexerService.calculate_content_indexer_documents(
...>   ["bread", "butter", "jam"],
...>   [
...>     ["red", "brown", "jam"],
...>     ["blue", "green", "butter"],
...>     ["pink", "green", "bread", "jam"]
...>   ],
...>   3
...> )
{:ok, [bread: 0.3662040962227032, butter: 0.3662040962227032, jam: 0.3662040962227032]}
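The reason for a `corpus_size` parameter is that `length/1` walks the whole list, so when many documents are weighted against the same corpus it is cheaper to measure the corpus once and reuse the number. The sketch below illustrates that pattern only; `CorpusSizeSketch.document_fraction/3` is a hypothetical stand-in that reports the fraction of corpus documents containing each token, not the library's weighting function.

```elixir
defmodule CorpusSizeSketch do
  # Hypothetical stand-in for a call that accepts a precomputed corpus size.
  # Returns, per distinct token, the fraction of corpus documents containing it.
  def document_fraction(tokens, corpus, corpus_size) do
    for token <- Enum.uniq(tokens) do
      df = Enum.count(corpus, &Enum.member?(&1, token))
      {token, df / corpus_size}
    end
  end
end

corpus = [
  ["red", "brown", "jam"],
  ["blue", "green", "butter"],
  ["pink", "green", "bread", "jam"]
]

# Computed once (linear in the number of documents), then reused for every document.
corpus_size = length(corpus)

documents = [["bread", "butter", "jam"], ["green", "jam"]]

results =
  Enum.map(documents, &CorpusSizeSketch.document_fraction(&1, corpus, corpus_size))

IO.inspect(results)
```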
Calculates the content_indexer weights for each token in the query, weighting the query against itself.
## Parameters
- tokens: List of tokens to be indexed
## Example
iex> ContentIndexerService.calculate_content_indexer_query(["bread","butter","jam"])
{:ok, [bread: 0.0, butter: 0.0, jam: 0.0]}
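One way to read the zero weights: if the query is treated as the only document in its own corpus, every token occurs in 1 of 1 documents, and the unsmoothed inverse document frequency ln(1/1) is 0. The sketch below assumes that formula; whether the library computes the query weights this way is not stated here, and `QuerySketch` is a hypothetical module. It returns string-keyed tuples rather than an atom-keyed keyword list.

```elixir
defmodule QuerySketch do
  # Hypothetical helper: weights a token list against a corpus holding only itself.
  def weigh_against_self(tokens) do
    total = length(tokens)
    corpus = [tokens]
    corpus_size = length(corpus)

    tokens
    |> Enum.frequencies()
    |> Enum.sort()
    |> Enum.map(fn {token, count} ->
      df = Enum.count(corpus, &Enum.member?(&1, token))
      # Assumed unsmoothed idf: ln(N / df); with N = df = 1 this is 0.0.
      idf = :math.log(corpus_size / df)
      {token, count / total * idf}
    end)
  end
end

IO.inspect(QuerySketch.weigh_against_self(["bread", "butter", "jam"]))
# [{"bread", 0.0}, {"butter", 0.0}, {"jam", 0.0}]
```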
Calculates the term frequency of each token in the document's token list and returns the tokens with their term frequencies.
## Parameters
- tokens: List of tokens to be indexed
## Example
iex> ContentIndexerService.calculate_tf_document(["bread","butter","jam","jam","bread","bread"])
{:ok, [bread: 0.5, butter: 0.16666666666666666, jam: 0.3333333333333333]}
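The figures in this example match each token's count divided by the total number of tokens (3/6, 1/6 and 2/6). A minimal sketch of that calculation follows; `TfSketch` is a hypothetical module, and it returns string-keyed tuples rather than the atom-keyed keyword list shown above.

```elixir
defmodule TfSketch do
  # Hypothetical helper: relative frequency of each distinct token, sorted by token.
  def term_frequencies(tokens) do
    total = length(tokens)

    tokens
    |> Enum.frequencies()
    |> Enum.sort()
    |> Enum.map(fn {token, count} -> {token, count / total} end)
  end
end

IO.inspect(TfSketch.term_frequencies(["bread", "butter", "jam", "jam", "bread", "bread"]))
```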
Calculates the word count of each token in the document's token list and returns the tokens with their word counts.
## Parameters
- tokens: List of tokens to be indexed
## Example
iex> ContentIndexerService.calculate_token_count_document(["bread","butter","jam","jam","bread","bread"])
{:ok, [bread: 3, butter: 1, jam: 2]}
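Counting tokens like this maps directly onto `Enum.frequencies/1` from the standard library. The snippet below is a sketch of the idea, not the library's implementation; it yields string-keyed tuples sorted by token.

```elixir
tokens = ["bread", "butter", "jam", "jam", "bread", "bread"]

# Enum.frequencies/1 returns a map of token => count; sorting gives a stable order.
counts =
  tokens
  |> Enum.frequencies()
  |> Enum.sort()

IO.inspect(counts)
# [{"bread", 3}, {"butter", 1}, {"jam", 2}]
```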
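Calculates the content_indexer weights for a comma-separated string of tokens against a corpus of comma-separated strings.
## Parameters
- content: Comma-separated string of tokens to be indexed
- corpus: List of comma-separated token strings representing the corpus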
## Example
iex> ContentIndexerValidateService.calculate_tokens_againts_corpus(
...>   "bread,butter,jam",
...>   ["red,brown,jam", "blue,green,butter", "pink,green,bread,jam"]
...> )
{:ok,
 [
   {"bread", 0.13515503603605478},
   {"butter", 0.13515503603605478},
   {"jam", 0.0}
 ]}
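The weights in this example are consistent with term frequency multiplied by a smoothed inverse document frequency, ln(N / (1 + df)): "bread" appears in 1 of 3 documents, giving (1/3) · ln(3/2) ≈ 0.1352, and "jam" appears in 2 of 3, giving (1/3) · ln(3/3) = 0. The sketch below reproduces that arithmetic. The smoothing formula is inferred from the numbers rather than confirmed by the source, and `SmoothedTfIdfSketch` is a hypothetical module.

```elixir
defmodule SmoothedTfIdfSketch do
  # Hypothetical helper, not the library's function.
  def weigh(content, corpus) do
    tokens = String.split(content, ",")
    documents = Enum.map(corpus, &String.split(&1, ","))
    total = length(tokens)
    corpus_size = length(documents)

    tokens
    |> Enum.frequencies()
    |> Enum.sort()
    |> Enum.map(fn {token, count} ->
      # Number of corpus documents that contain the token.
      df = Enum.count(documents, &Enum.member?(&1, token))
      # Assumed smoothing: 1 is added to the document frequency.
      idf = :math.log(corpus_size / (1 + df))
      {token, count / total * idf}
    end)
  end
end

corpus = ["red,brown,jam", "blue,green,butter", "pink,green,bread,jam"]
IO.inspect(SmoothedTfIdfSketch.weigh("bread,butter,jam", corpus))
```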
Simple function that checks whether an item is contained in a list.
## Parameters
- list: List of any type
- item: Any type of item stored in the list
## Example
iex> ContentIndexerService.calculate_content_indexer_documents(
...>   ["bread", "butter", "jam"],
...>   [
...>     ["red", "brown", "jam"],
...>     ["blue", "green", "butter"],
...>     ["pink", "green", "bread", "jam"]
...>   ]
...> )
{:ok, [bread: 0.3662040962227032, butter: 0.3662040962227032, jam: 0.3662040962227032]}
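A membership check of the kind described for this function can be sketched with `Enum.member?/2` from the standard library. The module and function names below are hypothetical; the library's own name for this helper does not appear in this section.

```elixir
defmodule ListSketch do
  # Hypothetical name; delegates to Enum.member?/2, which accepts items of any type.
  def contains?(list, item), do: Enum.member?(list, item)
end

IO.inspect(ListSketch.contains?(["red", "brown", "jam"], "jam"))
# true
IO.inspect(ListSketch.contains?(["red", "brown", "jam"], "bread"))
# false
```

The `in` operator (`"jam" in list`) is equivalent. Both scan the list linearly, so for large token lists a `MapSet` may be a better fit.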