babble

A small Markov chain text generator: train it on example text, then generate new sentences that sound almost like the source. It offers incremental training, sentence-aware generation that stops at a natural full stop, and a pluggable sampler that puts you in control of how each next word is chosen.

let model =
  babble.new(order: 2, tokenization: babble.Words)
  |> babble.train("the cat sat on the mat.")
  |> babble.train("the dog sat on the log.")

let assert Ok(sentence) = babble.generate(model, babble.weighted, max_tokens: 200)

Generation is driven by a Sampler — see that type along with weighted and most_likely.

Types

Tunable parameters for a model, fixed at construction.

order is the n-gram context length (clamped to >= 1) and tokenization selects word- vs character-level base tokens. The generation length cap is a generate argument, not a model setting.

pub type Config {
  Config(order: Int, tokenization: Tokenization)
}

Constructors

Why a generation request produced no output.

pub type GenerateError {
  EmptyModel
}

Constructors

  • EmptyModel

    The model has no learned transitions to start from (it was never trained, or only ever saw empty/whitespace messages).

An opaque Markov model: its config, the learned transitions (each context of order tokens mapped to a multiset of successor tokens), and how many messages it has seen.

pub opaque type Model

A strategy for choosing the next Step during generation.

At each point in the walk, babble hands the sampler every successor it has seen for the current context: a non-empty list of #(step, count) pairs, where count is how many times that step followed the context in training. The sampler returns one of those steps — Continue emits a word and the walk goes on, Stop ends the sentence. (A Stop candidate is present whenever the model saw a sentence end after this context.)

babble ships two: weighted (random, proportional to the counts) and most_likely (deterministic). Everything else — temperature, top-k, blocklists, biasing — you write yourself, because a sampler is just an ordinary function:

// Ignore the counts and pick a successor uniformly at random.
fn uniform(candidates: List(#(babble.Step, Int))) -> babble.Step {
  case list.drop(candidates, int.random(list.length(candidates))) {
    [#(step, _), ..] -> step
    [] -> babble.Stop
  }
}

A sampler is called once per word and keeps no state between calls, so for reproducible output reach for most_likely rather than trying to seed randomness here.

pub type Sampler =
  fn(List(#(Step, Int))) -> Step

The next step in a sentence: emit a word, or stop. A Sampler chooses one of these from the weighted candidates at each point in the walk.

pub type Step {
  Continue(String)
  Stop
}

Constructors

  • Continue(String)

    Continue the sentence by emitting this word/grapheme.

  • Stop

    End the sentence here (a learned sentence boundary).

Whether base tokens are whole words or single characters.

pub type Tokenization {
  Words
  Characters
}

Constructors

  • Words
  • Characters

Values

pub fn config(model: Model) -> Config

The (clamped) configuration this model was built with.

pub fn generate(
  model: Model,
  sampler: fn(List(#(Step, Int))) -> Step,
  max_tokens max_tokens: Int,
) -> Result(String, GenerateError)

Generate one sentence, choosing each next word with sampler and emitting at most max_tokens of them.

Walks the chain from the start of a sentence, asking sampler for the next step at each point, until it stops at a learned sentence end or reaches max_tokens (clamped to >= 1). Returns Error(EmptyModel) if the model has never been trained.

Pass weighted for varied, corpus-like output or most_likely for deterministic output. See Sampler to write your own.

Examples

// Varied output — a different sentence each call:
let assert Ok(sentence) = babble.generate(model, babble.weighted, max_tokens: 200)

// No data yet:
let empty = babble.new(order: 2, tokenization: babble.Words)
assert babble.generate(empty, babble.weighted, max_tokens: 50) == Error(babble.EmptyModel)
pub fn generate_paragraph(
  model: Model,
  sentences: Int,
  sampler: fn(List(#(Step, Int))) -> Step,
  max_tokens max_tokens: Int,
) -> Result(String, GenerateError)

Generate sentences sentences (at least 1) with sampler, each capped at max_tokens, joined by spaces.

pub fn generate_starting_with(
  model: Model,
  prefix: String,
  sampler: fn(List(#(Step, Int))) -> Step,
  max_tokens max_tokens: Int,
) -> Result(String, GenerateError)

Generate a sentence that begins with prefix, choosing with sampler and emitting at most max_tokens words beyond the prefix.

The continuation seeds from the last order prefix words (left-padded with Start); an unknown prefix falls back to the start context, but the prefix words are always kept at the front. Empty models return Error(EmptyModel).

pub fn is_empty(model: Model) -> Bool

True when the model has learned no transitions yet.

pub fn message_count(model: Model) -> Int

How many non-empty messages have been folded into the model.

pub fn most_likely(candidates: List(#(Step, Int))) -> Step

A Sampler that always picks the most frequent successor, with ties broken deterministically so the result never depends on internal map ordering.

Generation with this sampler is fully reproducible: a given model always produces the same sentence. That makes it ideal for tests and snapshots, or a fixed “house style” output. Because it always takes the single most-travelled path, its output tends to reproduce whole training sentences verbatim.

Examples

let model =
  babble.new(order: 2, tokenization: babble.Words)
  |> babble.train("the cat sat.")

assert babble.generate(model, babble.most_likely, max_tokens: 50) == Ok("the cat sat.")
pub fn new(
  order order: Int,
  tokenization tokenization: Tokenization,
) -> Model

A new empty model, ready to train.

Both settings are fixed at construction (changing them would invalidate the learned counts, so there are no setters):

  • order — the n-gram context length: how many previous tokens to condition on when picking the next. Clamped to >= 1. Higher = more coherent but more verbatim; 2 is a good default.
  • tokenizationWords or Characters.

The generation length cap is passed to generate, not set here.

Examples

let model = babble.new(order: 2, tokenization: babble.Words)
assert babble.is_empty(model)
pub fn train(model: Model, message: String) -> Model

Fold a single message into the model, returning a new model.

Each sentence is tokenised, padded with order Start markers and a trailing End, and every order-length context -> next transition is counted. The message counter bumps once if the message held a non-empty sentence. It is cheap and never rebuilds, so you can keep folding in new text.

Examples

let model =
  babble.new(order: 2, tokenization: babble.Words)
  |> babble.train("the cat sat.")

assert babble.message_count(model) == 1
pub fn train_many(model: Model, messages: List(String)) -> Model

Fold many messages into the model, in order.

pub fn weighted(candidates: List(#(Step, Int))) -> Step

A Sampler that picks a successor at random, with probability proportional to how often it followed the context in training.

This is the natural “talk like the corpus” behaviour and the one you’ll want most of the time. It uses the platform RNG, so output varies between calls — pass it straight to generate; you rarely call it yourself.

Examples

let assert Ok(sentence) = babble.generate(model, babble.weighted)
Search Document