ExLLM.Infrastructure.CircuitBreaker.HealthCheck (ex_llm v0.8.1)

Health check and monitoring system for circuit breakers.

Provides comprehensive health assessment for both individual circuits and the overall circuit breaker system. Includes health scoring, issue detection, and recommendations for maintaining optimal fault tolerance.

Health Scoring

Health scores range from 0 to 100:

  • 90-100: Excellent - Circuit is performing optimally
  • 70-89: Good - Circuit is stable with minor concerns
  • 50-69: Fair - Circuit has issues that should be monitored
  • 30-49: Poor - Circuit requires attention
  • 0-29: Critical - Circuit needs immediate intervention
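
The bands above can be read as a simple lookup from score to level. The following is a minimal sketch of that mapping, assuming the level names match the health_level() type documented on this page; the module name HealthLevelSketch is hypothetical and this is not the library's own implementation.

```elixir
defmodule HealthLevelSketch do
  # Hypothetical helper, not part of ExLLM: maps a 0..100 score to the
  # level names used in the documented score bands.
  @spec level(0..100) :: :excellent | :good | :fair | :poor | :critical
  def level(score) when score in 90..100, do: :excellent
  def level(score) when score in 70..89, do: :good
  def level(score) when score in 50..69, do: :fair
  def level(score) when score in 30..49, do: :poor
  def level(score) when score in 0..29, do: :critical
end
```

Scores outside 0..100 raise a FunctionClauseError in this sketch, which keeps invalid input from being silently classified.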

Health Factors

  • State: Circuit breaker state (closed/open/half-open)
  • Failure Rate: Recent failure percentage
  • Recovery Time: Time circuits spend in the open state
  • Frequency: How often circuits trip
  • Bulkhead Utilization: Concurrency and queue usage
  • Configuration: Threshold appropriateness

Usage

alias ExLLM.Infrastructure.CircuitBreaker.HealthCheck

# Check overall system health; returns {:ok, system_health()} or {:error, reason}
HealthCheck.system_health()

# Check specific circuit health; returns {:ok, circuit_health()} or {:error, reason}
HealthCheck.circuit_health("api_service")

# Get health summary for all circuits; returns {:ok, [map()]} or {:error, reason}
HealthCheck.health_summary()

# Get detailed health report; returns {:ok, map()} or {:error, reason}
HealthCheck.health_report()

Summary

Functions

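circuit_health(circuit_name, opts \\ []): Get detailed health status for a specific circuit.

critical_circuits(opts \\ []): Get circuits that need immediate attention.

health_report(opts \\ []): Get a detailed health report for dashboard/monitoring systems.

health_summary(opts \\ []): Get a summary of health status for all circuits.

healthy?(opts \\ []): Check if the circuit breaker system is healthy overall.

system_health(opts \\ []): Get comprehensive health status for the entire circuit breaker system.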

Types

circuit_health()

@type circuit_health() :: %{
  circuit_name: String.t(),
  health_score: health_score(),
  health_level: health_level(),
  state: :closed | :open | :half_open,
  issues: [String.t()],
  recommendations: [String.t()],
  metrics: map(),
  last_updated: DateTime.t()
}
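
To illustrate how a map of this shape might be consumed, the sketch below renders a one-line summary from the documented fields. The module CircuitHealthFormatSketch and the sample values are hypothetical; the sample is hand-written data, not output captured from the library.

```elixir
defmodule CircuitHealthFormatSketch do
  # Hypothetical helper, not part of ExLLM: builds one line of text from a
  # map shaped like the circuit_health() type.
  def summarize(%{
        circuit_name: name,
        health_score: score,
        health_level: level,
        state: state,
        issues: issues
      }) do
    "#{name}: #{score}/100 (#{level}), state=#{state}, issues=#{length(issues)}"
  end
end

# Hand-written sample data; the values are illustrative only.
sample = %{
  circuit_name: "api_service",
  health_score: 62,
  health_level: :fair,
  state: :half_open,
  issues: ["elevated failure rate"],
  recommendations: ["review failure threshold"],
  metrics: %{},
  last_updated: DateTime.utc_now()
}

IO.puts(CircuitHealthFormatSketch.summarize(sample))
```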

health_level()

@type health_level() :: :excellent | :good | :fair | :poor | :critical

health_score()

@type health_score() :: 0..100

system_health()

@type system_health() :: %{
  overall_score: health_score(),
  overall_level: health_level(),
  total_circuits: non_neg_integer(),
  healthy_circuits: non_neg_integer(),
  unhealthy_circuits: non_neg_integer(),
  critical_circuits: non_neg_integer(),
  issues: [String.t()],
  recommendations: [String.t()],
  last_updated: DateTime.t()
}
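
As one way to use the counters in this type, the sketch below derives the percentage of healthy circuits for a dashboard. The module SystemHealthRatioSketch is hypothetical, and returning nil when there are no circuits is an assumption of this sketch rather than documented library behavior.

```elixir
defmodule SystemHealthRatioSketch do
  # Hypothetical helper, not part of ExLLM: computes the share of healthy
  # circuits from a map shaped like the system_health() type.
  # Returns nil when no circuits are registered (a choice made for this sketch).
  def healthy_percent(%{total_circuits: 0}), do: nil

  def healthy_percent(%{total_circuits: total, healthy_circuits: healthy}) do
    Float.round(healthy * 100 / total, 1)
  end
end
```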

Functions

circuit_health(circuit_name, opts \\ [])

@spec circuit_health(
  String.t(),
  keyword()
) :: {:ok, circuit_health()} | {:error, term()}

Get detailed health status for a specific circuit.

critical_circuits(opts \\ [])

@spec critical_circuits(keyword()) :: {:ok, [String.t()]} | {:error, term()}

Get circuits that need immediate attention.
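
A caller will usually need to handle both shapes in the spec above. The sketch below turns either result into log-ready messages; CriticalAlertSketch is a hypothetical module, and it takes the result tuple as an argument so it can run without the library loaded.

```elixir
defmodule CriticalAlertSketch do
  # Hypothetical helper, not part of ExLLM: accepts a value shaped like the
  # return of critical_circuits/1 and produces a list of message strings.
  def messages({:ok, names}) when is_list(names) do
    Enum.map(names, fn name -> "circuit #{name} needs immediate attention" end)
  end

  def messages({:error, reason}) do
    ["health check failed: #{inspect(reason)}"]
  end
end
```

In an application, the argument would typically be the result of calling critical_circuits() directly.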

health_report(opts \\ [])

@spec health_report(keyword()) :: {:ok, map()} | {:error, term()}

Get a detailed health report for dashboard/monitoring systems.

health_summary(opts \\ [])

@spec health_summary(keyword()) :: {:ok, [map()]} | {:error, term()}

Get a summary of health status for all circuits.

healthy?(opts \\ [])

@spec healthy?(keyword()) :: boolean()

Check if the circuit breaker system is healthy overall.
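
Because healthy?/1 returns a plain boolean, it fits naturally into a readiness probe. The sketch below shows one possible shape, assuming a hypothetical ReadinessSketch module; the check is injected as a zero-arity function so the example runs on its own, and the status codes are illustrative.

```elixir
defmodule ReadinessSketch do
  # Hypothetical helper, not part of ExLLM. The health check is passed in
  # as a zero-arity function returning a boolean.
  def status(healthy_fun) when is_function(healthy_fun, 0) do
    if healthy_fun.() do
      {200, "ok"}
    else
      {503, "circuit breakers unhealthy"}
    end
  end
end
```

In an application, the injected function would typically wrap the call, for example fn -> ExLLM.Infrastructure.CircuitBreaker.HealthCheck.healthy?() end.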

system_health(opts \\ [])

@spec system_health(keyword()) :: {:ok, system_health()} | {:error, term()}

Get comprehensive health status for the entire circuit breaker system.