content_indexer v0.2.0 ContentIndexer.Services.Calculator

Summary

Calculates the content_indexer weights for a document of tokens against a corpus of tokenized documents.

https://en.wikipedia.org/wiki/Tf-idf

** What is Tf-Idf **

tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is
intended to reflect how important a word is to a document in a collection or corpus. It is often
used as a weighting factor in information retrieval and text mining.
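
As a minimal sketch of the statistic described above, the module below computes the textbook form, tf(t, d) × ln(N / df(t)). The module name `TfIdfSketch` and its functions are hypothetical and not part of content_indexer; the library's own weighting and smoothing may differ from this.

```elixir
defmodule TfIdfSketch do
  # Hypothetical illustration of textbook tf-idf; not the library's code.

  # Term frequency: occurrences of `term` divided by the document length.
  # Assumes `doc` is a non-empty list of tokens.
  def tf(term, doc) do
    Enum.count(doc, &(&1 == term)) / length(doc)
  end

  # Inverse document frequency: ln(N / df), where df is the number of
  # corpus documents containing `term`. Returning 0.0 when df is 0 is an
  # assumption of this sketch, made to avoid dividing by zero.
  def idf(term, corpus) do
    df = Enum.count(corpus, &(term in &1))
    if df == 0, do: 0.0, else: :math.log(length(corpus) / df)
  end

  def tf_idf(term, doc, corpus), do: tf(term, doc) * idf(term, corpus)
end
```

Under this definition a term that appears in every document gets a weight of 0.0, because ln(N / N) is zero: `TfIdfSketch.tf_idf("a", ["a", "b"], [["a", "b"], ["a", "c"]])` evaluates to `0.0`.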

This library supports processing large datasets in parallel using an Erlang/OTP-based server and actors.

Currently the supported file types are plain text, PDF, and DOCX (XML).
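
To illustrate the general idea of parallel processing on the BEAM, here is a rough sketch that counts tokens for several documents concurrently with `Task.async_stream/2` from the Elixir standard library. `ParallelCountSketch` is a hypothetical name; the library's actual server and actor modules are not shown in this documentation and may be organised quite differently.

```elixir
defmodule ParallelCountSketch do
  # Hypothetical helper; not part of content_indexer.

  # Counts tokens for each document concurrently, one task per document,
  # and returns the results in the same order as the input.
  def count_all(documents) do
    documents
    |> Task.async_stream(&Enum.frequencies/1)
    |> Enum.map(fn {:ok, counts} -> counts end)
  end
end
```

For example, `ParallelCountSketch.count_all([["jam", "jam"], ["bread"]])` returns `[%{"jam" => 2}, %{"bread" => 1}]`.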

** Basic Usage **

Pass a list of tokens and a corpus (a list of token lists, one per document). The function returns the tokens
with their corresponding content_indexer weights, calculated against the corpus.

iex> ContentIndexerService.calculate_content_indexer_documents(
  ["bread","butter","jam"],
  [["red","brown","jam"],["blue","green","butter"],["pink","green","bread","jam"]]
)
{:ok, [bread: 0.3662040962227032, butter: 0.3662040962227032, jam: 0.3662040962227032]}
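
The returned value is an `{:ok, keyword_list}` tuple, so it can be consumed with ordinary pattern matching and the `Keyword` module. The sketch below works on a copy of the result shown above rather than calling the library; the error shape is not documented in this section, so only the `:ok` case is handled, and the 0.1 threshold is arbitrary.

```elixir
# Result tuple copied from the usage example; no library call is made here.
result =
  {:ok, [bread: 0.3662040962227032, butter: 0.3662040962227032, jam: 0.3662040962227032]}

# Pattern-match on the :ok tuple, then read one weight from the keyword list.
{:ok, weights} = result
jam_weight = Keyword.get(weights, :jam)

# Keep only the tokens whose weight exceeds the (arbitrary) threshold.
heavy = for {token, weight} <- weights, weight > 0.1, do: token
```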

Summary

Functions

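calculate_content_indexer_documents(tokens, corpus_of_tokens)

Calculates the content_indexer weights for each token in the list of tokens against the corpus of tokens.

calculate_content_indexer_documents(tokens, corpus_of_tokens, corpus_size)

Calculates the content_indexer weights for each token in the list of tokens against the corpus of tokens.

calculate_content_indexer_query(tokens)

Calculates the content_indexer weights for each token in the query, weighting the query against itself.

calculate_tf_document(tokens)

Calculates the term frequency for each token in the list of tokens representing the document and returns a list of the tokens with their respective term frequencies.

calculate_token_count_document(tokens)

Calculates the word count for each token in the list of tokens representing the document and returns a list of the tokens with their respective word counts.

calculate_tokens_againts_corpus(content, corpus)

Calculates the content_indexer weights for a string of tokens against a corpus of token strings.

init_calculator()

list_contains(list, item)

Checks whether an item is contained in the list.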

Functions

calculate_content_indexer_documents(tokens, corpus_of_tokens)

Calculates the content_indexer weights for each token in the list of tokens against the corpus of tokens.

## Parameters

- tokens: List of tokens to be indexed
- corpus_of_tokens: List of lists representing the corpus of all tokens

## Example

iex> ContentIndexerService.calculate_content_indexer_documents(
  ["bread","butter","jam"],
  [
    ["red","brown","jam"],
    ["blue","green","butter"],
    ["pink","green","bread","jam"]
  ]
)
{:ok, [bread: 0.3662040962227032, butter: 0.3662040962227032, jam: 0.3662040962227032]}

calculate_content_indexer_documents(tokens, corpus_of_tokens, corpus_size)

Calculates the content_indexer weights for each token in the list of tokens against the corpus of tokens, using a precomputed corpus size.

## Parameters

- tokens: List of tokens to be indexed
- corpus_of_tokens: List of lists representing the corpus of all tokens
- corpus_size: Integer size of corpus_of_tokens (the number of documents), passed in to avoid recalculating it
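
The reason for passing `corpus_size` can be sketched as follows: `length/1` traverses the whole list, so when many documents are weighted against the same corpus it is cheaper to compute the size once. The library call is left commented out because it needs content_indexer to be loaded, and the `documents` name in it is hypothetical.

```elixir
corpus = [
  ["red", "brown", "jam"],
  ["blue", "green", "butter"],
  ["pink", "green", "bread", "jam"]
]

# length/1 walks the entire list, so compute it once up front.
corpus_size = length(corpus)

# Reuse the same corpus_size for every document (hypothetical `documents` list):
# Enum.map(documents, fn tokens ->
#   ContentIndexerService.calculate_content_indexer_documents(tokens, corpus, corpus_size)
# end)
```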

## Example

iex> ContentIndexerService.calculate_content_indexer_documents(
  ["bread","butter","jam"],
  [
    ["red","brown","jam"],
    ["blue","green","butter"],
    ["pink","green","bread","jam"]
  ],
  3
)
{:ok, [bread: 0.3662040962227032, butter: 0.3662040962227032, jam: 0.3662040962227032]}

calculate_content_indexer_query(tokens)

Calculates the content_indexer weights for each token in the query, weighting the query against itself.

## Parameters

- tokens: List of tokens to be indexed

## Example

iex> ContentIndexerService.calculate_content_indexer_query(["bread","butter","jam"])
{:ok, [bread: 0.0, butter: 0.0, jam: 0.0]}

calculate_tf_document(tokens)

Calculates the term frequency for each token in the list of tokens representing the document and returns a list of the tokens with their respective term frequencies.

## Parameters

- tokens: List of tokens to be indexed
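
Term frequency here is each token's count divided by the total number of tokens in the document. A minimal sketch of that calculation follows; `TermFrequencySketch` is a hypothetical module, not the library's implementation, and it returns a map keyed by the original strings rather than the keyword list the library returns.

```elixir
defmodule TermFrequencySketch do
  # Hypothetical illustration; not the library's implementation.

  # Returns a map of token => count / total tokens in the document.
  def term_frequencies(tokens) do
    total = length(tokens)

    tokens
    |> Enum.frequencies()
    |> Map.new(fn {token, count} -> {token, count / total} end)
  end
end
```

For `["bread", "butter", "jam", "jam", "bread", "bread"]` this gives 3/6 for bread, 1/6 for butter, and 2/6 for jam.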

## Example

iex> ContentIndexerService.calculate_tf_document(["bread","butter","jam","jam","bread","bread"])
{:ok, [bread: 0.5, butter: 0.16666666666666666, jam: 0.3333333333333333]}

calculate_token_count_document(tokens)

Calculates the word count for each token in the list of tokens representing the document and returns a list of the tokens with their respective word counts.

## Parameters

- tokens: List of tokens to be indexed
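
Counting tokens amounts to folding over the list and incrementing a per-token counter. The snippet below is one way to sketch that with `Enum.reduce/3` and `Map.update/4`; it is an illustration only and produces a map rather than the library's keyword list.

```elixir
tokens = ["bread", "butter", "jam", "jam", "bread", "bread"]

# Fold over the tokens, incrementing each token's counter in a map.
# A token seen for the first time starts at 1.
counts =
  Enum.reduce(tokens, %{}, fn token, acc ->
    Map.update(acc, token, 1, &(&1 + 1))
  end)
```

The standard library's `Enum.frequencies/1` produces the same map in a single call.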

## Example

iex> ContentIndexerService.calculate_token_count_document(["bread","butter","jam","jam","bread","bread"])
{:ok, [bread: 3, butter: 1, jam: 2]}

calculate_tokens_againts_corpus(content, corpus)

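Calculates the content_indexer weights for each token in a comma-separated string of tokens against a corpus of comma-separated token strings.

## Parameters

- content: Comma-separated string of tokens to be indexed
- corpus: List of comma-separated token strings, one per document, representing the corpus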
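
The weights in this function's example are numerically consistent with tf × ln(N / (1 + df)), a smoothed inverse document frequency. That formula is inferred from the example values alone, not from the library source, so the sketch below is a guess at the behaviour and `CorpusWeightSketch` is a hypothetical module.

```elixir
defmodule CorpusWeightSketch do
  # Hypothetical reconstruction inferred from the example's numbers only;
  # the library's real formula is not shown in this documentation.

  def weights(content, corpus) do
    tokens = String.split(content, ",")
    docs = Enum.map(corpus, &String.split(&1, ","))
    n = length(docs)

    tokens
    |> Enum.uniq()
    |> Enum.map(fn token ->
      # Term frequency within the content string.
      tf = Enum.count(tokens, &(&1 == token)) / length(tokens)
      # Number of corpus documents that contain the token.
      df = Enum.count(docs, &(token in &1))
      # Assumed smoothed idf: ln(N / (1 + df)).
      {token, tf * :math.log(n / (1 + df))}
    end)
  end
end
```

With this assumption, "jam" appears in two of the three documents, so its idf is ln(3 / 3) and its weight is 0.0, while "bread" and "butter" each get (1/3) × ln(3 / 2), roughly 0.135.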

## Example

iex> ContentIndexerValidateService.calculate_tokens_againts_corpus(
  "bread,butter,jam",
  ["red,brown,jam","blue,green,butter","pink,green,bread,jam"]
)
{:ok,
  [
    {"bread", 0.13515503603605478},
    {"butter", 0.13515503603605478},
    {"jam", 0.0}
  ]
}

init_calculator()

list_contains(list, item)

Checks whether an item is contained in the list.

## Parameters

- list: List of any type
- item: Any type of item stored in the list
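
A membership check of this kind can be sketched with the standard library alone. The snippet uses `Enum.member?/2` and the `in` operator as stand-ins; the actual return shape of `list_contains/2` is not shown in this documentation, so no library call is made.

```elixir
# Stand-in for a membership check using only the Elixir standard library.
list = ["bread", "butter", "jam"]

has_jam = Enum.member?(list, "jam")
has_honey = "honey" in list
```

Here `has_jam` is `true` and `has_honey` is `false`.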
