Evaluations in TypeScript SDK


Manage evaluators, test suites, and batch evaluations programmatically with the TypeScript SDK

The TypeScript SDK provides an EvaluationClient that talks to the Evaluations API so you can manage evaluators, test suites, run configurations, trigger runs with scoring, and read results—all from code.

For full endpoint details and request/response shapes, see the Evaluations API reference.

Setup: create a client

```typescript
import { EvaluationClient } from "@inkeep/agents-sdk";

const client = new EvaluationClient({
  tenantId: process.env.INKEEP_TENANT_ID!,
  projectId: process.env.INKEEP_PROJECT_ID!,
  apiUrl: "https://api.inkeep.com",
  apiKey: process.env.INKEEP_API_KEY,
});
```
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| tenantId | string | Yes | Your tenant (organization) ID |
| projectId | string | Yes | Your project ID |
| apiUrl | string | Yes | API base URL (e.g. https://api.inkeep.com or your self-hosted URL) |
| apiKey | string | No | Bearer token for authenticated requests |

End-to-end example

The walkthrough below creates an evaluator, a test suite with items, a run configuration, triggers an evaluation, and checks the results. Each step builds on the previous one.

Step 1 — Create an evaluator

An evaluator defines how to score agent output. You provide a prompt, a JSON schema for the structured result, and a model to run the evaluation.

The schema should include a numeric score, a boolean passed, and a reasoning string so you get both a quantitative metric and a human-readable explanation for every evaluation.

```typescript
const evaluator = (await client.createEvaluator({
  name: "Answer quality",
  description: "Checks correctness, helpfulness, and tone of agent replies",
  prompt: `You are an expert QA evaluator.

Given the full conversation between a user and an AI assistant, evaluate the **last assistant reply** on three dimensions:
1. **Correctness** — Is the information factually accurate? If an expected output is provided, does the reply match its intent?
2. **Helpfulness** — Does the reply fully address the user's question without unnecessary filler?
3. **Tone** — Is the reply professional, clear, and appropriately concise?

Return a JSON object with:
- "score": a number from 1 to 5 (1 = poor, 5 = excellent) reflecting overall quality.
- "passed": true if the reply is acceptable for production (score >= 4), false otherwise.
- "reasoning": 1-2 sentences explaining the score, citing specific strengths or issues.`,
  schema: {
    type: "object",
    properties: {
      score: {
        type: "number",
        description: "Overall quality score from 1 to 5",
      },
      passed: {
        type: "boolean",
        description: "Whether the reply meets production quality bar",
      },
      reasoning: {
        type: "string",
        description: "Brief explanation of the score",
      },
    },
    required: ["score", "passed", "reasoning"],
  },
  model: { model: "gpt-4o-mini" },
  passCriteria: {
    operator: "and",
    conditions: [{ field: "score", operator: ">=", value: 4 }],
  },
})) as { id: string };
```
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | Yes | Display name |
| description | string | No | What this evaluator checks |
| prompt | string | Yes | Instructions for the evaluation model |
| schema | object (JSON Schema) | Yes | Structure of the evaluation output — typically includes score, passed, and reasoning |
| model | object | Yes | { model: string, providerOptions?: object } |
| passCriteria | object | No | { operator: "and" \| "or", conditions: [{ field, operator, value }] }. Operators: >, <, >=, <=, =, != |
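Pass criteria are applied to the structured output the evaluator returns. As an illustration of the semantics in the table above (this mirrors the described behavior; it is not the SDK's actual implementation), a condition checker might look like:

```typescript
// Illustrative only: how "and"/"or" pass criteria combine field conditions
// against an evaluator's structured result.
type Condition = {
  field: string;
  operator: ">" | "<" | ">=" | "<=" | "=" | "!=";
  value: number | string | boolean;
};
type PassCriteria = { operator: "and" | "or"; conditions: Condition[] };

function passes(result: Record<string, unknown>, criteria: PassCriteria): boolean {
  const check = (c: Condition): boolean => {
    const actual = result[c.field] as any;
    switch (c.operator) {
      case ">": return actual > c.value;
      case "<": return actual < c.value;
      case ">=": return actual >= c.value;
      case "<=": return actual <= c.value;
      case "=": return actual === c.value;
      case "!=": return actual !== c.value;
    }
  };
  return criteria.operator === "and"
    ? criteria.conditions.every(check)
    : criteria.conditions.some(check);
}
```

With the pass criteria from Step 1, a result of `{ score: 4.5, ... }` passes while `{ score: 3, ... }` does not.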

Step 2 — Create a test suite and add items

A test suite is a named collection of items. Each item has input (the messages sent to the agent) and optional expectedOutput (reference answer).

```typescript
const testSuite = (await client.createDataset({
  name: "Support FAQs",
})) as { id: string };

await client.createDatasetItem(testSuite.id, {
  input: {
    messages: [
      { role: "user", content: { text: "How do I reset my password?" } },
    ],
  },
  expectedOutput: [
    { role: "assistant", content: { text: "Go to Settings → Security → Reset password." } },
  ],
});

await client.createDatasetItems(testSuite.id, [
  {
    input: {
      messages: [{ role: "user", content: { text: "What are your hours?" } }],
    },
  },
  {
    input: {
      messages: [{ role: "user", content: { text: "Do you offer refunds?" } }],
    },
  },
]);
```
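If your test cases start out as plain strings, a small mapping helper can build the item payloads for the bulk call. The helper below is hypothetical (not part of the SDK) and simply assumes the item shape shown above:

```typescript
// Hypothetical helper: turn plain question strings into the item shape
// that createDatasetItems expects (single user message, no expectedOutput).
function toItems(questions: string[]) {
  return questions.map((text) => ({
    input: { messages: [{ role: "user", content: { text } }] },
  }));
}

// Usage with the test suite from this step:
// await client.createDatasetItems(testSuite.id, toItems([
//   "What are your hours?",
//   "Do you offer refunds?",
// ]));
```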

Step 3 — Create a run configuration

A run configuration ties the test suite to one or more agents and optional default evaluators. It is a saved "recipe" you can trigger repeatedly.

```typescript
const runConfig = (await client.createDatasetRunConfig({
  name: "Support smoke test",
  datasetId: testSuite.id,
  agentIds: ["your-agent-id"],
  evaluatorIds: [evaluator.id],
})) as { id: string };
```
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | Yes | Display name for the config |
| description | string | No | Optional description |
| datasetId | string | Yes | The test suite to run |
| agentIds | string[] | No | Agents that process each item (at least one is required before you can trigger a run) |
| evaluatorIds | string[] | No | Default evaluators attached when you trigger a run |

Step 4 — Trigger a run

Starting a run queues every item × agent combination. When evaluators are included, a batch evaluation job is automatically created for the resulting conversations.

```typescript
const started = await client.triggerDatasetRun(runConfig.id, {
  evaluatorIds: [evaluator.id],
});
// { datasetRunId: "...", status: "pending", totalItems: 3 }
```

Passing evaluatorIds when you trigger a run is optional — omit it to use the defaults from the run configuration, or pass different IDs to override them.

Step 5 — Check run status

```typescript
const run = await client.getDatasetRun(started.datasetRunId);
console.log(run);
```

Poll getDatasetRun or check the Test Suites page in the Visual Builder to watch items transition from pending to completed with evaluation scores.
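Polling can be wrapped in a small helper. This is a minimal sketch, not part of the SDK; it assumes the run object exposes a status string that ends at "completed" or "failed" (check the actual response shape in the Evaluations API reference):

```typescript
// Minimal polling sketch: call getRun until the run reaches a terminal
// status or the timeout elapses. Status values are assumptions; verify
// them against the Evaluations API reference.
type RunStatus = { status: string };

async function pollRun(
  getRun: () => Promise<RunStatus>,
  { intervalMs = 5000, timeoutMs = 600_000 } = {},
): Promise<RunStatus> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const run = await getRun();
    if (run.status === "completed" || run.status === "failed") return run;
    if (Date.now() >= deadline) throw new Error("Timed out waiting for the run to finish");
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Usage with the client and run from the steps above:
// const finished = await pollRun(() => client.getDatasetRun(started.datasetRunId));
```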

Step 6 (optional) — Evaluate a previous run after the fact

If you ran a test suite without evaluators and want to score those conversations later, trigger a batch evaluation scoped to the run:

```typescript
await client.triggerBatchEvaluation({
  evaluatorIds: [evaluator.id],
  datasetRunIds: [started.datasetRunId],
});
```

Method reference

Evaluators

| Method | Purpose |
| --- | --- |
| createEvaluator(data) | Create an evaluator (prompt, schema, model, optional pass criteria) |
| listEvaluators() | List all evaluators in the project |
| getEvaluator(evaluatorId) | Fetch one evaluator |
| updateEvaluator(evaluatorId, partial) | Update evaluator fields |
| deleteEvaluator(evaluatorId) | Delete an evaluator |

Test suites and items

| Method | Purpose |
| --- | --- |
| listDatasets() | List test suites for the project |
| getDataset(testSuiteId) | Fetch one test suite |
| createDataset({ name }) | Create a test suite |
| updateDataset(testSuiteId, partial) | Update name and other fields |
| deleteDataset(testSuiteId) | Delete a test suite and its items |
| listDatasetItems(testSuiteId) | List items |
| getDatasetItem(testSuiteId, itemId) | Fetch one item |
| createDatasetItem(testSuiteId, itemData) | Create an item (input required; expectedOutput optional) |
| createDatasetItems(testSuiteId, items[]) | Bulk create items |
| updateDatasetItem(testSuiteId, itemId, partial) | Update an item |
| deleteDatasetItem(testSuiteId, itemId) | Delete an item |

Run configurations and runs

| Method | Purpose |
| --- | --- |
| createDatasetRunConfig({ name, datasetId, agentIds?, evaluatorIds? }) | Create a run configuration (which agents run the suite; optional default evaluators) |
| triggerDatasetRun(runConfigId, { evaluatorIds?, branchName? }?) | Start a run; returns datasetRunId, status, totalItems |
| listDatasetRuns(testSuiteId) | List runs for a test suite |
| getDatasetRun(runId) | Fetch a run with items and conversations |

Batch evaluation

| Method | Purpose |
| --- | --- |
| triggerBatchEvaluation({ evaluatorIds, name?, conversationIds?, dateRange?, datasetRunIds? }) | One-off batch evaluation over conversations |

| Option | Type | Required | Description |
| --- | --- | --- | --- |
| evaluatorIds | string[] | Yes | IDs of evaluators to run |
| name | string | No | Name for the job |
| conversationIds | string[] | No | Limit to these conversations |
| dateRange | { startDate, endDate } (YYYY-MM-DD) | No | Limit to conversations in this date range |
| datasetRunIds | string[] | No | Limit to conversations from these test suite runs |
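For example, to score the last week of conversations, the dateRange can be built with a small local utility. lastNDays below is hypothetical (not part of the SDK); it only produces the YYYY-MM-DD strings the dateRange option expects:

```typescript
// Hypothetical helper: build a { startDate, endDate } range covering the
// last `days` days, formatted as YYYY-MM-DD (UTC).
function lastNDays(days: number, now = new Date()): { startDate: string; endDate: string } {
  const day = (d: Date) => d.toISOString().slice(0, 10);
  const start = new Date(now.getTime() - days * 24 * 60 * 60 * 1000);
  return { startDate: day(start), endDate: day(now) };
}

// Usage with the client from the setup section:
// await client.triggerBatchEvaluation({
//   evaluatorIds: ["your-evaluator-id"],
//   name: "Weekly quality sweep",
//   dateRange: lastNDays(7),
// });
```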

Evaluation suite configs (continuous tests)

A suite config groups evaluators with an optional sample rate and agent filters, for continuous tests that automatically evaluate a fraction of live conversations.

| Method | Purpose |
| --- | --- |
| createEvaluationSuiteConfig({ evaluatorIds, sampleRate?, filters? }) | Create a suite config |
| addEvaluatorToSuiteConfig(configId, evaluatorId) | Add an evaluator |
| removeEvaluatorFromSuiteConfig(configId, evaluatorId) | Remove an evaluator |
| listEvaluationSuiteConfigEvaluators(configId) | List evaluators on a config |

| Option | Type | Required | Description |
| --- | --- | --- | --- |
| evaluatorIds | string[] | Yes | At least one evaluator ID |
| sampleRate | number | No | Fraction of matching conversations to evaluate (0–1) |
| filters | object | No | Restrict scope, e.g. { agentIds: ["agent-id"] } |
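Putting the options together, a guarded call might look like the sketch below. assertSampleRate is a hypothetical local check (not part of the SDK) that catches out-of-range sample rates before the request is sent:

```typescript
// Hypothetical guard: sampleRate is a fraction, so reject anything
// outside the 0-1 range before creating the suite config.
function assertSampleRate(rate: number): number {
  if (!Number.isFinite(rate) || rate < 0 || rate > 1) {
    throw new Error(`sampleRate must be between 0 and 1, got ${rate}`);
  }
  return rate;
}

// Usage with the client from the setup section:
// await client.createEvaluationSuiteConfig({
//   evaluatorIds: ["your-evaluator-id"],
//   sampleRate: assertSampleRate(0.1), // score roughly 10% of matching conversations
//   filters: { agentIds: ["your-agent-id"] },
// });
```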

To list results by job or run config, use the Evaluations API.