Server API

Go REST API service for managing LLM evaluation workflows.

All endpoints are under /api/v1. Request and response bodies use JSON. The OpenAPI 3.1.0 specification is served at /openapi.yaml.

See https://eval-hub.github.io/eval-hub/ for the full specification.

POST /api/v1/evaluations/jobs # Submit evaluation
GET /api/v1/evaluations/jobs # List jobs
GET /api/v1/evaluations/jobs/{id} # Get job status and results
DELETE /api/v1/evaluations/jobs/{id} # Cancel job
POST /api/v1/evaluations/jobs/{id}/events # Status/result callback (adapter → server)
GET /api/v1/evaluations/providers # List providers
POST /api/v1/evaluations/providers # Register provider
GET /api/v1/evaluations/providers/{id} # Get provider
PUT /api/v1/evaluations/providers/{id} # Update provider
PATCH /api/v1/evaluations/providers/{id} # Patch provider
DELETE /api/v1/evaluations/providers/{id} # Delete provider

Query parameters: benchmarks=true|false (default true) and scope=system|tenant (unset by default, which returns all providers).
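As a minimal client-side sketch, the routes and query parameters above can be turned into requests with the Python standard library. The helper name `api_request`, the base URL, and the job ID are illustrative assumptions, and the sketch assumes the query parameters apply to the provider listing; only the paths, the /api/v1 prefix, and the default port 8080 come from this page.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

API_PREFIX = "/api/v1"  # all endpoints live under this prefix


def api_request(base_url, method, path, query=None, body=None):
    """Build (but do not send) a JSON request for an endpoint under /api/v1."""
    url = base_url.rstrip("/") + API_PREFIX + path
    if query:
        url += "?" + urlencode(query)
    data = None
    headers = {"Accept": "application/json"}
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return Request(url, data=data, headers=headers, method=method)


# Assumed local base URL; 8080 is the documented default for service.port.
BASE = "http://localhost:8080"

# List tenant-scoped providers without their nested benchmarks.
list_req = api_request(
    BASE, "GET", "/evaluations/providers",
    query={"benchmarks": "false", "scope": "tenant"},
)

# Cancel a job; "job-123" is a placeholder ID, not a real job.
cancel_req = api_request(BASE, "DELETE", "/evaluations/jobs/job-123")

print(list_req.get_method(), list_req.full_url)
print(cancel_req.get_method(), cancel_req.full_url)
```

The requests are only constructed here; passing one to `urllib.request.urlopen` would send it to a running server.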

Benchmarks are returned as part of the provider response. There is no separate /benchmarks endpoint.

Each provider response may include an optional agent object with structured metadata for AI agent consumption:

| Field | Type | Description |
|---|---|---|
| evaluates | string[] | Semantic capability tags (e.g. safety, reasoning) |
| recommended_when | string[] | Natural-language recommendation conditions |
| target_type | string | model, agent, or inference_server |
| summary | string | Concise description (max 200 chars) |
| complements | string[] | Related provider IDs for follow-up evaluations |
| hints | string[] | Operational guidance for job construction |
| result_interpretation | string[] | How to interpret evaluation results |

Benchmarks nested in the provider response may include their own agent block with result_interpretation and score_ranges.

Example (abbreviated):

{
  "resource": { "id": "garak" },
  "name": "garak",
  "agent": {
    "evaluates": ["safety", "security", "red_teaming", "toxicity"],
    "target_type": "model",
    "summary": "Red-team an LLM for safety vulnerabilities, toxicity, and OWASP risks"
  }
}

Provider agent metadata can be updated via PATCH /api/v1/evaluations/providers/{id} with paths under /agent. There is no server-side ?target_type= or ?evaluates= query filter — filter client-side or use the MCP discover_providers tool.
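Because filtering happens on the client, a short sketch may help. It assumes the caller has already extracted a list of provider objects shaped like the abbreviated example (the list envelope is not described on this page). The `agent_patch` helper further assumes that the PATCH body is a JSON Patch (RFC 6902) document, which the phrase "paths under /agent" suggests but does not confirm. Both function names are illustrative.

```python
def filter_providers(providers, target_type=None, evaluates=None):
    """Return providers whose optional agent block matches every criterion."""
    wanted = set(evaluates or [])
    matches = []
    for provider in providers:
        agent = provider.get("agent") or {}  # the agent object is optional
        if target_type is not None and agent.get("target_type") != target_type:
            continue
        # Every requested tag must appear in the provider's evaluates list.
        if not wanted.issubset(agent.get("evaluates") or []):
            continue
        matches.append(provider)
    return matches


def agent_patch(field, value):
    """Assumed JSON Patch operation replacing one field under /agent."""
    return [{"op": "replace", "path": f"/agent/{field}", "value": value}]


providers = [
    {
        "resource": {"id": "garak"},
        "name": "garak",
        "agent": {
            "evaluates": ["safety", "security", "red_teaming", "toxicity"],
            "target_type": "model",
        },
    },
    # Hypothetical provider without agent metadata.
    {"resource": {"id": "no-metadata"}, "name": "no-metadata"},
]

safety = filter_providers(providers, target_type="model", evaluates=["safety"])
print([p["name"] for p in safety])  # ['garak']
```

Providers without an agent object are skipped as soon as any criterion is given, and returned unchanged when no criterion is set.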

See Agent Discoverability for the full metadata model and discovery workflows.

GET /api/v1/evaluations/collections # List collections
POST /api/v1/evaluations/collections # Create collection
GET /api/v1/evaluations/collections/{id} # Get collection
PUT /api/v1/evaluations/collections/{id} # Update collection
PATCH /api/v1/evaluations/collections/{id} # Patch collection
DELETE /api/v1/evaluations/collections/{id} # Delete collection

Collection responses may include an optional agent object with the same fields as providers except target_type:

| Field | Type | Description |
|---|---|---|
| evaluates | string[] | Dimensions this collection assesses |
| recommended_when | string[] | When to suggest this collection |
| summary | string | Concise description for agents |
| complements | string[] | Related collection or provider IDs |
| hints | string[] | Operational guidance (duration, resources) |
| result_interpretation | string[] | How to interpret aggregate scores |

GET /api/v1/health # Health check
GET /metrics # Prometheus metrics
GET /openapi.yaml # OpenAPI specification
GET /docs # Interactive API docs

Configuration loads from config/config.yaml, with environment variable and file-based secret overrides.

| Setting | Env Var | Default | Description |
|---|---|---|---|
| service.port | PORT | 8080 | API listen port |
| database.driver | - | sqlite | sqlite or pgx |
| database.url | DB_URL | SQLite in-memory | Connection string |
| mlflow.tracking_uri | MLFLOW_TRACKING_URI | - | MLflow server URL |
| prometheus.enabled | - | true | Enable /metrics |
| otel.enabled | - | false | Enable OpenTelemetry |
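The override behaviour can be illustrated with a small sketch. This is not the server's Go code: it assumes a precedence of environment variable, then config file, then default, flattens the YAML into dotted keys for brevity, and leaves out file-based secrets because their layout is not described here. `resolve_settings` and the example connection string are hypothetical.

```python
import os

# (setting, env var or None, default) taken from the configuration table.
# database.url defaults to an in-memory SQLite database; the literal
# connection string is not given on this page, so None stands in for it.
SETTINGS = [
    ("service.port", "PORT", "8080"),
    ("database.driver", None, "sqlite"),
    ("database.url", "DB_URL", None),
    ("mlflow.tracking_uri", "MLFLOW_TRACKING_URI", None),
    ("prometheus.enabled", None, "true"),
    ("otel.enabled", None, "false"),
]


def resolve_settings(file_values, env=os.environ):
    """Assumed precedence: environment variable, then config file, then default."""
    resolved = {}
    for name, env_var, default in SETTINGS:
        if env_var and env_var in env:
            resolved[name] = env[env_var]
        else:
            resolved[name] = file_values.get(name, default)
    return resolved


cfg = resolve_settings(
    {"database.driver": "pgx"},  # values as if read from config/config.yaml
    env={"PORT": "9090", "DB_URL": "postgres://example/evalhub"},
)
print(cfg["service.port"], cfg["database.driver"])  # 9090 pgx
```

Settings without an environment variable in the table, such as database.driver, can only come from the file or the default in this sketch.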

Providers are loaded from YAML files in config/providers/. Built-in providers: lm_evaluation_harness (167 benchmarks), garak (8), guidellm (7), lighteval (24).

Custom providers can be added via YAML files or the POST /api/v1/evaluations/providers endpoint.

On Kubernetes, the server creates one Job per benchmark with:

  • ConfigMap: JobSpec mounted at /meta/job.json
  • Adapter container: Runs the evaluation framework
  • Sidecar container: Forwards status events to the server
  • Volumes: OCI credentials, MLflow token, model auth secrets
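From the adapter's side, the mounted ConfigMap is just a JSON file. A minimal sketch of reading it follows; the JobSpec fields are not described on this page, so the function returns the parsed document untouched, and the assumption that the top level is a JSON object is labeled in the code. `load_job_spec` is an illustrative name.

```python
import json
from pathlib import Path

JOB_SPEC_PATH = Path("/meta/job.json")  # ConfigMap mount point from the list above


def load_job_spec(path=JOB_SPEC_PATH):
    """Read the JobSpec JSON and return it without interpreting its fields."""
    with open(path, encoding="utf-8") as fh:
        spec = json.load(fh)
    # Assumption: the JobSpec is a JSON object at the top level.
    if not isinstance(spec, dict):
        raise ValueError(f"{path}: expected a JSON object, got {type(spec).__name__}")
    return spec
```

Passing an explicit `path` makes the same function usable outside a cluster, for example against a file written by a test.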

In local mode, the server spawns subprocesses (up to 5 workers) for each benchmark. Local mode is enabled with the -local flag and is useful for development without a cluster.
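The bounded-concurrency idea behind local mode can be sketched as follows. This is an illustration rather than the server's Go implementation: it assumes a single pool of five workers shared by all benchmarks, and the child command is a stand-in that only echoes the benchmark name.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 5  # mirrors the documented "up to 5 workers"


def run_benchmark(benchmark):
    """Run one benchmark in a child process and collect its result."""
    # Stand-in command; a real adapter invocation would go here.
    cmd = [sys.executable, "-c",
           "import sys; print('done ' + sys.argv[1])", benchmark]
    proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return benchmark, proc.returncode, proc.stdout.strip()


def run_all(benchmarks):
    """Run benchmarks concurrently, never more than MAX_WORKERS at once."""
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        # map() yields results in input order, not completion order.
        return list(pool.map(run_benchmark, benchmarks))


if __name__ == "__main__":
    for name, code, out in run_all(["bench-a", "bench-b", "bench-c"]):
        print(name, code, out)
```

Threads are enough here because each worker spends its time waiting on a child process rather than running Python code.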

The server is deployed by the TrustyAI Operator via the EvalHub custom resource. See OpenShift Setup for production deployment.