# Expectation System Design

## Overview

The expectation system is the core of ExDataCheck's validation framework. Inspired by Great Expectations, it provides a declarative way to define data quality requirements.

## Expectation Behavior

All expectations implement the `ExDataCheck.Expectation` behavior:

```elixir
defmodule ExDataCheck.Expectation do
  @type dataset :: list(map()) | Stream.t()
  @type column :: atom() | String.t()
  @type result :: %ExDataCheck.ExpectationResult{}
  @type opts :: keyword()

  @callback validate(dataset, column, opts) :: result
  @callback describe(column, opts) :: String.t()
end
```

## Expectation Categories

### 1. Value Expectations

Test individual values in a column.

#### `expect_column_values_to_be_between/3`

```elixir
@spec expect_column_values_to_be_between(column, min, max) :: expectation
```

**Purpose:** Ensure all values in a column fall within a specified range.

**Implementation:**
```elixir
defmodule ExDataCheck.Expectations.Value do
  def expect_column_values_to_be_between(column, min, max) do
    %Expectation{
      type: :value_range,
      column: column,
      validator: fn dataset ->
        values = extract_column(dataset, column)
        failing = Enum.filter(values, fn v -> v < min or v > max end)

        %ExpectationResult{
          success: length(failing) == 0,
          expectation: "column #{column} values between #{min} and #{max}",
          observed: %{
            total_values: length(values),
            failing_values: length(failing),
            failing_examples: Enum.take(failing, 5)
          }
        }
      end
    }
  end
end
```

**Usage:**
```elixir
expect_column_values_to_be_between(:age, 0, 120)
expect_column_values_to_be_between(:score, 0.0, 1.0)
```

#### `expect_column_values_to_be_in_set/2`

```elixir
@spec expect_column_values_to_be_in_set(column, set :: list()) :: expectation
```

**Purpose:** Ensure all values are members of a specified set.

**Implementation:**
```elixir
def expect_column_values_to_be_in_set(column, allowed_values) do
  set = MapSet.new(allowed_values)

  %Expectation{
    type: :value_set,
    column: column,
    validator: fn dataset ->
      values = extract_column(dataset, column)
      invalid = Enum.reject(values, &MapSet.member?(set, &1))

      %ExpectationResult{
        success: length(invalid) == 0,
        expectation: "column #{column} values in set #{inspect(allowed_values)}",
        observed: %{
          total_values: length(values),
          invalid_values: Enum.uniq(invalid),
          invalid_count: length(invalid)
        }
      }
    end
  }
end
```

**Usage:**
```elixir
expect_column_values_to_be_in_set(:status, ["active", "pending", "completed"])
expect_column_values_to_be_in_set(:country, ["US", "UK", "CA"])
```

#### `expect_column_values_to_match_regex/2`

```elixir
@spec expect_column_values_to_match_regex(column, regex :: Regex.t()) :: expectation
```

**Purpose:** Ensure values match a regular expression pattern.

**Usage:**
```elixir
expect_column_values_to_match_regex(:email, ~r/@/)
expect_column_values_to_match_regex(:phone, ~r/^\d{3}-\d{3}-\d{4}$/)
```

#### `expect_column_values_to_not_be_null/1`

```elixir
@spec expect_column_values_to_not_be_null(column) :: expectation
```

**Purpose:** Ensure no null/nil values in column.

**Usage:**
```elixir
expect_column_values_to_not_be_null(:user_id)
expect_column_values_to_not_be_null(:timestamp)
```

#### `expect_column_values_to_be_unique/1`

```elixir
@spec expect_column_values_to_be_unique(column) :: expectation
```

**Purpose:** Ensure all values are unique (no duplicates).

**Usage:**
```elixir
expect_column_values_to_be_unique(:transaction_id)
expect_column_values_to_be_unique(:email)
```

### 2. Statistical Expectations

Test statistical properties of columns.

#### `expect_column_mean_to_be_between/3`

```elixir
@spec expect_column_mean_to_be_between(column, min, max) :: expectation
```

**Purpose:** Ensure column mean falls within range.

**Implementation:**
```elixir
def expect_column_mean_to_be_between(column, min, max) do
  %Expectation{
    type: :statistical_mean,
    column: column,
    validator: fn dataset ->
      values = extract_column(dataset, column) |> Enum.filter(&is_number/1)
      mean = Statistics.mean(values)

      %ExpectationResult{
        success: mean >= min and mean <= max,
        expectation: "column #{column} mean between #{min} and #{max}",
        observed: %{
          mean: mean,
          count: length(values)
        }
      }
    end
  }
end
```

**Usage:**
```elixir
expect_column_mean_to_be_between(:age, 25, 45)
expect_column_mean_to_be_between(:score, 0.7, 0.9)
```

#### `expect_column_stdev_to_be_between/3`

```elixir
@spec expect_column_stdev_to_be_between(column, min, max) :: expectation
```

**Purpose:** Ensure column standard deviation falls within range.

**Usage:**
```elixir
expect_column_stdev_to_be_between(:scores, 0.1, 0.3)
```

#### `expect_column_median_to_be_between/3`

```elixir
@spec expect_column_median_to_be_between(column, min, max) :: expectation
```

**Purpose:** Ensure column median falls within range.

**Usage:**
```elixir
expect_column_median_to_be_between(:income, 40000, 60000)
```

#### `expect_column_quantile_to_be/3`

```elixir
@spec expect_column_quantile_to_be(column, quantile :: float, expected :: number) :: expectation
```

**Purpose:** Check specific quantile values.

**Usage:**
```elixir
expect_column_quantile_to_be(:age, 0.95, 65)  # 95th percentile should be ~65
expect_column_quantile_to_be(:latency, 0.99, 500)  # p99 latency
```

### 3. ML-Specific Expectations

Expectations designed for machine learning workflows.

#### `expect_feature_distribution/3`

```elixir
@spec expect_feature_distribution(column, distribution :: atom, opts :: keyword) :: expectation
```

**Purpose:** Ensure feature follows expected distribution.

**Supported Distributions:**
- `:normal` - Normal/Gaussian distribution
- `:uniform` - Uniform distribution
- `:exponential` - Exponential distribution
- `:binomial` - Binomial distribution

**Implementation:**
```elixir
def expect_feature_distribution(column, :normal, opts) do
  expected_mean = Keyword.fetch!(opts, :mean)
  expected_stdev = Keyword.fetch!(opts, :stdev)
  tolerance = Keyword.get(opts, :tolerance, 0.1)

  %Expectation{
    type: :distribution,
    column: column,
    validator: fn dataset ->
      values = extract_column(dataset, column)
      observed_mean = Statistics.mean(values)
      observed_stdev = Statistics.stdev(values)

      # Kolmogorov-Smirnov test
      ks_statistic = ks_test(values, :normal, expected_mean, expected_stdev)

      %ExpectationResult{
        success: ks_statistic < tolerance,
        expectation: "column #{column} follows normal distribution",
        observed: %{
          mean: observed_mean,
          stdev: observed_stdev,
          ks_statistic: ks_statistic
        }
      }
    end
  }
end
```

**Usage:**
```elixir
expect_feature_distribution(:age, :normal, mean: 35, stdev: 10)
expect_feature_distribution(:score, :uniform, min: 0.0, max: 1.0)
```

#### `expect_label_balance/2`

```elixir
@spec expect_label_balance(column, opts :: keyword) :: expectation
```

**Purpose:** Ensure classification labels are reasonably balanced.

**Implementation:**
```elixir
def expect_label_balance(column, opts) do
  min_ratio = Keyword.get(opts, :min_ratio, 0.2)

  %Expectation{
    type: :label_balance,
    column: column,
    validator: fn dataset ->
      values = extract_column(dataset, column)
      counts = Enum.frequencies(values)
      total = length(values)
      ratios = Map.new(counts, fn {k, v} -> {k, v / total} end)
      min_observed = Enum.min(Map.values(ratios))

      %ExpectationResult{
        success: min_observed >= min_ratio,
        expectation: "column #{column} labels balanced (min ratio: #{min_ratio})",
        observed: %{
          ratios: ratios,
          min_ratio: min_observed,
          counts: counts
        }
      }
    end
  }
end
```

**Usage:**
```elixir
expect_label_balance(:class, min_ratio: 0.3)  # No class < 30%
expect_label_balance(:sentiment, min_ratio: 0.2)
```

#### `expect_feature_correlation/3`

```elixir
@spec expect_feature_correlation(column1, column2, opts :: keyword) :: expectation
```

**Purpose:** Check correlation between features (detect multicollinearity).

**Usage:**
```elixir
expect_feature_correlation(:feature_a, :feature_b, max: 0.9)
expect_feature_correlation(:age, :tenure, min: 0.3, max: 0.7)
```

#### `expect_no_data_drift/2`

```elixir
@spec expect_no_data_drift(column, baseline :: %{}) :: expectation
```

**Purpose:** Detect data drift from training distribution.

**Implementation:**
```elixir
def expect_no_data_drift(column, baseline) do
  %Expectation{
    type: :drift_detection,
    column: column,
    validator: fn dataset ->
      values = extract_column(dataset, column)
      drift_score = calculate_drift(values, baseline)
      threshold = 0.05  # p-value threshold

      %ExpectationResult{
        success: drift_score > threshold,
        expectation: "column #{column} no significant drift from baseline",
        observed: %{
          drift_score: drift_score,
          method: :kolmogorov_smirnov,
          drifted: drift_score <= threshold
        }
      }
    end
  }
end
```

**Usage:**
```elixir
baseline = ExDataCheck.Drift.create_baseline(training_data, :age)
expect_no_data_drift(:age, baseline)
```

### 4. Schema Expectations

Validate data structure and types.

#### `expect_column_to_exist/1`

```elixir
@spec expect_column_to_exist(column) :: expectation
```

**Purpose:** Ensure column exists in dataset.

**Usage:**
```elixir
expect_column_to_exist(:user_id)
expect_column_to_exist(:timestamp)
```

#### `expect_column_to_be_of_type/2`

```elixir
@spec expect_column_to_be_of_type(column, type :: atom) :: expectation
```

**Supported Types:**
- `:integer`
- `:float`
- `:string`
- `:boolean`
- `:atom`
- `:list`
- `:map`
- `:datetime`

**Usage:**
```elixir
expect_column_to_be_of_type(:age, :integer)
expect_column_to_be_of_type(:score, :float)
expect_column_to_be_of_type(:tags, :list)
```

#### `expect_column_count_to_equal/1`

```elixir
@spec expect_column_count_to_equal(count :: integer) :: expectation
```

**Purpose:** Ensure dataset has exact number of columns.

**Usage:**
```elixir
expect_column_count_to_equal(10)
```

#### `expect_table_row_count_to_be_between/2`

```elixir
@spec expect_table_row_count_to_be_between(min, max) :: expectation
```

**Purpose:** Ensure dataset has expected number of rows.

**Usage:**
```elixir
expect_table_row_count_to_be_between(1000, 10000)
expect_table_row_count_to_be_between(100, :infinity)
```

## Expectation Result Structure

```elixir
defmodule ExDataCheck.ExpectationResult do
  @type t :: %__MODULE__{
    success: boolean(),
    expectation: String.t(),
    observed: map(),
    metadata: map()
  }

  defstruct [
    :success,
    :expectation,
    :observed,
    metadata: %{}
  ]
end
```

## Validation Flow

```mermaid
sequenceDiagram
    participant Dataset
    participant Validator
    participant Expectation
    participant Result

    Dataset->>Validator: validate(expectations)
    loop For each expectation
        Validator->>Expectation: execute(dataset)
        Expectation->>Dataset: extract_column()
        Dataset-->>Expectation: column_data
        Expectation->>Expectation: apply_validation()
        Expectation->>Result: create_result()
        Result-->>Validator: expectation_result
    end
    Validator->>Validator: aggregate_results()
    Validator-->>Dataset: validation_result
```

## Custom Expectations

Users can define custom expectations:

```elixir
defmodule MyApp.CustomExpectations do
  use ExDataCheck.Expectation

  def expect_valid_email(column) do
    %Expectation{
      type: :custom_email,
      column: column,
      validator: fn dataset ->
        values = extract_column(dataset, column)
        invalid = Enum.reject(values, &valid_email?/1)

        %ExpectationResult{
          success: length(invalid) == 0,
          expectation: "column #{column} contains valid emails",
          observed: %{
            total: length(values),
            invalid_count: length(invalid),
            invalid_examples: Enum.take(invalid, 5)
          }
        }
      end
    }
  end

  defp valid_email?(email) do
    # Email validation logic
    String.match?(email, ~r/@/)
  end
end
```

## Expectation Suites

Group related expectations:

```elixir
defmodule MyApp.ExpectationSuites do
  def user_data_suite do
    [
      expect_column_to_exist(:user_id),
      expect_column_values_to_be_unique(:user_id),
      expect_column_values_to_not_be_null(:email),
      expect_column_values_to_match_regex(:email, ~r/@/),
      expect_column_values_to_be_between(:age, 18, 100)
    ]
  end

  def ml_training_suite do
    [
      expect_no_missing_values(:features),
      expect_label_balance(:target, min_ratio: 0.2),
      expect_column_count_to_equal(10),
      expect_table_row_count_to_be_between(1000, 1_000_000)
    ]
  end
end
```

## Best Practices

### 1. Start Simple

Begin with basic expectations and add complexity as needed:

```elixir
# Start with
expectations = [
  expect_column_to_exist(:age)
]

# Then add
expectations = [
  expect_column_to_exist(:age),
  expect_column_to_be_of_type(:age, :integer)
]

# Finally refine
expectations = [
  expect_column_to_exist(:age),
  expect_column_to_be_of_type(:age, :integer),
  expect_column_values_to_be_between(:age, 0, 120),
  expect_column_mean_to_be_between(:age, 25, 45)
]
```

### 2. Use Profiling to Inform Expectations

Profile your data first to understand reasonable thresholds:

```elixir
profile = ExDataCheck.profile(data)
# Use profile statistics to set expectation thresholds
expect_column_mean_to_be_between(:age,
  profile.columns.age.mean * 0.9,
  profile.columns.age.mean * 1.1
)
```

### 3. Combine Expectations

Layer expectations for comprehensive validation:

```elixir
[
  # Schema validation
  expect_column_to_exist(:score),
  expect_column_to_be_of_type(:score, :float),

  # Value validation
  expect_column_values_to_be_between(:score, 0.0, 1.0),
  expect_column_values_to_not_be_null(:score),

  # Statistical validation
  expect_column_mean_to_be_between(:score, 0.7, 0.9),
  expect_column_stdev_to_be_between(:score, 0.1, 0.3)
]
```

### 4. Document Your Expectations

Add metadata to expectations:

```elixir
%Expectation{
  type: :value_range,
  column: :age,
  validator: validator_fn,
  metadata: %{
    rationale: "Age should be reasonable for adult users",
    impact: "Critical - invalid ages break downstream models",
    owner: "data-team"
  }
}
```