Evaluations and testing
Hyponema gives you four complementary loops for validating agent behavior:
- Playground — render a persona against a user and inspect the resolved prompt before any session runs.
- Agent tests — replay or simulate conversations against a dataset and score each row.
- Online scorers — score live production traffic continuously.
- Post-session runners — run an LLM extraction job after each conversation ends to produce structured records.
Use them in roughly that order: playground while authoring, tests before publish, online scorers and post-session runners after going live.
Playground
Open Playground in the dashboard to pick an agent, a user, and any dynamic-variable overrides, then preview the rendered system prompt with system variables, custom variables, and the memory context block all resolved.
The playground does not send a turn — it confirms what the model would see. Use it to catch broken templates, missing required variables, or unwanted leakage of profile data into the prompt.
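To make that concrete, here is the shape of the check with an illustrative template and its preview. The {{...}} syntax and the variable names are placeholders, not the documented template language:

```text
# Persona template (syntax and variable names illustrative)
You are the reception agent for {{clinic_name}}. Today is {{current_date}}.
{{memory_context}}

# Rendered preview shown by the playground
You are the reception agent for Northside Dental. Today is 2025-03-14.
Caller context: prefers afternoon appointments; last visit 2025-02-01.
```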
Datasets
A dataset is a named bag of test rows. Each row has an input (user message), expected behavior, and optional metadata. Manage datasets from Tests → Datasets:
```bash
curl -X POST "https://api.hyponema.ai/workspaces/$WORKSPACE_ID/datasets" \
  -H "Authorization: Bearer $HYPONEMA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "slug": "billing-faq",
    "name": "Billing FAQ",
    "description": "Common billing questions a reception agent must handle."
  }'
```

Bulk-add rows with POST /datasets/{slug}/rows:bulk.
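A sketch of a bulk insert; the rows wrapper and the per-row field names (input, expected, metadata) mirror the row model above, but the exact payload shape is an assumption:

```bash
# Sketch: payload shape assumed from the row model described above
curl -X POST "https://api.hyponema.ai/workspaces/$WORKSPACE_ID/datasets/billing-faq/rows:bulk" \
  -H "Authorization: Bearer $HYPONEMA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "rows": [
      {
        "input": "Why was I charged twice this month?",
        "expected": "Explains how duplicate charges are reviewed and offers to escalate.",
        "metadata": { "topic": "duplicate-charge" }
      },
      {
        "input": "Do you offer payment plans?",
        "expected": "Describes the payment-plan policy without quoting exact terms."
      }
    ]
  }'
```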
Scorers
A scorer judges a single conversation turn or a whole conversation. Hyponema ships LLM-driven scorers (rubric-based judgments) and rule scorers (string match, regex, JSON-path).
Create one from Tests → Scorers:
```bash
curl -X POST "https://api.hyponema.ai/workspaces/$WORKSPACE_ID/scorers" \
  -H "Authorization: Bearer $HYPONEMA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Stays in scope",
    "kind": "llm_judge",
    "rubric": "Score 0–1. Did the agent stay within the persona'\''s no-go zones?"
  }'
```

Attach scorers to an agent for visibility in runs.
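Rule scorers are created through the same endpoint. A sketch of a regex scorer that flags raw account numbers; the kind value and the pattern field name are assumptions (only the string match / regex / JSON-path family is documented above):

```bash
# Sketch: "kind": "regex" and "pattern" are assumed field values/names
curl -X POST "https://api.hyponema.ai/workspaces/$WORKSPACE_ID/scorers" \
  -H "Authorization: Bearer $HYPONEMA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Flags raw account numbers",
    "kind": "regex",
    "pattern": "\\b[0-9]{10,}\\b"
  }'
```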
Agent tests
A test binds an agent, a dataset, and one or more scorers. Each run replays the dataset rows against the agent and scores the responses.
```bash
# Define
curl -X POST "https://api.hyponema.ai/workspaces/$WORKSPACE_ID/agents/$AGENT_ID/tests" \
  -H "Authorization: Bearer $HYPONEMA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Pre-publish smoke",
    "dataset_slug": "billing-faq",
    "scorer_ids": ["scorer_..."]
  }'

# Run
curl -X POST "https://api.hyponema.ai/workspaces/$WORKSPACE_ID/agents/$AGENT_ID/tests/$TEST_ID/run" \
  -H "Authorization: Bearer $HYPONEMA_API_KEY"
```

Inspect runs through Tests → Runs, drill into individual rows, and compare against the previous run.
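If you want to poll results from CI rather than the dashboard, an endpoint in this shape is a reasonable guess; the runs listing path is an assumption inferred from the resource layout above:

```bash
# Sketch: the /runs listing path is assumed, not documented
curl "https://api.hyponema.ai/workspaces/$WORKSPACE_ID/agents/$AGENT_ID/tests/$TEST_ID/runs" \
  -H "Authorization: Bearer $HYPONEMA_API_KEY"
```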
Online scorers (production scoring)
Production calls are scored continuously through online scorer rules. Each rule pairs a scorer with a sampling policy (every conversation, every Nth, only conversations matching a tag). Results land in observability alongside the trace.
Manage rules from Tests → Online scorers or:
```bash
curl -X POST "https://api.hyponema.ai/workspaces/$WORKSPACE_ID/online-scorer-rules" \
  -H "Authorization: Bearer $HYPONEMA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "scorer_id": "scorer_...",
    "agent_id": "agent_...",
    "sample_rate": 0.1
  }'
```

Use online scoring to catch regressions a published agent shows in the wild.
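Tag-matched sampling is one of the documented policies; a sketch of a rule scoped to tagged conversations, where the tag field name is an assumption:

```bash
# Sketch: "tag" is an assumed field name; tag-matched sampling itself is documented
curl -X POST "https://api.hyponema.ai/workspaces/$WORKSPACE_ID/online-scorer-rules" \
  -H "Authorization: Bearer $HYPONEMA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "scorer_id": "scorer_...",
    "agent_id": "agent_...",
    "tag": "escalation",
    "sample_rate": 1.0
  }'
```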
Post-session runners
A post-session runner is an LLM job that fires after a conversation ends. It reads the transcript (and optionally prior extraction records), calls a small read-only memory tool set, and returns either a free-form summary or a JSON object that conforms to an output_schema.
This is the operator-visible side of “structured data after each call”: risk score updates, ticket creation hints, sentiment, follow-up flags, anything you want to compute from the transcript.
Configure runners from the agent’s Post-session tab:
```bash
curl -X POST "https://api.hyponema.ai/workspaces/$WORKSPACE_ID/agents/$AGENT_ID/post-session-runners" \
  -H "Authorization: Bearer $HYPONEMA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Triage flag",
    "prompt": "Read the transcript. If the user mentioned an emergency, set urgent=true and quote the line.",
    "output_mode": "structured",
    "output_schema": {
      "type": "object",
      "properties": {
        "urgent": { "type": "boolean" },
        "quote": { "type": "string" }
      },
      "required": ["urgent"]
    }
  }'
```

GET .../runners/{id}/records lists past extractions. Records also surface in the user detail page so you can trace a flag back to the conversation that produced it.
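A sketch of reading those records back; the full path is an assumption that extends the post-session-runners resource above:

```bash
# Sketch: full records path assumed from the runner resource above
curl "https://api.hyponema.ai/workspaces/$WORKSPACE_ID/agents/$AGENT_ID/post-session-runners/$RUNNER_ID/records" \
  -H "Authorization: Bearer $HYPONEMA_API_KEY"
```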
The runner has access to read-only memory tools and to two post-session-specific tools — one for fetching prior extractions for the same user, and one to declare structured output when output_mode=structured. Tool access is bounded by max_tool_iterations and timeout_seconds.
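Both limits live on the runner. A sketch assuming they are accepted as top-level fields at creation time; max_tool_iterations and timeout_seconds are documented names, but their placement and the "summary" output_mode value are assumptions:

```bash
# Sketch: max_tool_iterations and timeout_seconds are the documented limits;
# passing them as top-level creation fields, and "output_mode": "summary",
# are assumptions.
curl -X POST "https://api.hyponema.ai/workspaces/$WORKSPACE_ID/agents/$AGENT_ID/post-session-runners" \
  -H "Authorization: Bearer $HYPONEMA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Call summary",
    "prompt": "Summarize the conversation in three sentences.",
    "output_mode": "summary",
    "max_tool_iterations": 3,
    "timeout_seconds": 60
  }'
```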
What to wire when
- Before first publish — Playground + one or two agent tests with a small dataset.
- Each persona change — re-run the agent test, diff scores against the previous run.
- In production — at least one online scorer rule and at least one post-session runner per agent.
- After incidents — add a dataset row that reproduces the failure mode.