Quick Start
Get started with @open-evals/core in minutes
Installation
npm install @open-evals/core @open-evals/metrics @ai-sdk/openai
Basic Example
The core package provides everything you need to evaluate LLM responses. Start by creating a dataset of samples, choose your metrics, and run the evaluation:
import { EvaluationDataset, evaluate } from '@open-evals/core'
import { Faithfulness } from '@open-evals/metrics'
import { openai } from '@ai-sdk/openai'
// 1. Create a dataset with your test samples
const dataset = new EvaluationDataset([
{
query: 'What is TypeScript?',
response: 'TypeScript is a typed superset of JavaScript.',
retrievedContexts: ['TypeScript adds static typing to JavaScript.'],
},
])
// 2. Configure the metric with your chosen model
const metric = new Faithfulness({ model: openai('gpt-4o-mini') })
// 3. Run evaluation - the function handles parallelization automatically
const results = await evaluate(dataset, [metric])
console.log(results.statistics.averages) // { faithfulness: 0.92 }
Understanding Sample Types
The framework supports two different evaluation scenarios, each with a specific sample format.
Single-Turn Samples
Use this format for simple query-response pairs, like QA systems or RAG applications. Each sample represents one question and one answer:
{
query: 'What is the capital of France?',
response: 'Paris is the capital of France.',
reference: 'The capital of France is Paris.', // Optional: ground truth answer
retrievedContexts: ['Paris is the capital city of France.'] // Optional: for RAG evaluation
}
Multi-Turn Samples
Use this format for conversations where context from previous messages matters. The framework can evaluate the full conversation flow:
{
messages: [
{ role: 'user', content: 'Hello!' },
{ role: 'assistant', content: 'Hi! How can I help?' },
{ role: 'user', content: 'Tell me about Paris' },
{ role: 'assistant', content: 'Paris is the capital of France...' },
]
}
Type Guards
When working with datasets that contain both single-turn and multi-turn samples, use the provided type guard functions for type-safe handling:
import { isSingleTurnSample, isMultiTurnSample } from '@open-evals/core'
dataset.forEach((sample) => {
if (isSingleTurnSample(sample)) {
// TypeScript knows sample has query, response, etc.
console.log('Query:', sample.query)
} else if (isMultiTurnSample(sample)) {
// TypeScript knows sample has messages
console.log('Messages:', sample.messages.length)
}
})
These helpers provide type narrowing, giving you full TypeScript autocomplete and type safety when processing samples.
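The same guards also work as filter predicates. Here is a minimal sketch (assuming filter() accepts any predicate over samples, as shown later in Working with Datasets) that splits a mixed dataset into per-format subsets before running format-specific metrics:
import { EvaluationDataset, isSingleTurnSample, isMultiTurnSample } from '@open-evals/core'

// A small mixed dataset with one sample of each format
const mixed = new EvaluationDataset([
  {
    query: 'What is TypeScript?',
    response: 'TypeScript is a typed superset of JavaScript.',
  },
  {
    messages: [
      { role: 'user', content: 'Hello!' },
      { role: 'assistant', content: 'Hi! How can I help?' },
    ],
  },
])

// filter() returns a new dataset, so the guards can split the mixed data
// into single-turn and multi-turn subsets that you evaluate separately
const singleTurn = mixed.filter(isSingleTurnSample)
const multiTurn = mixed.filter(isMultiTurnSample)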
Generating Responses
When you have a dataset with queries but need to populate the responses, use the generate() method. This is a common workflow when testing your LLM application:
import { EvaluationDataset, evaluate, isSingleTurnSample } from '@open-evals/core'
import { Faithfulness } from '@open-evals/metrics'
import { generateText } from 'ai'
import { openai } from '@ai-sdk/openai'
// 1. Start with queries and reference answers
const dataset = new EvaluationDataset([
{
query: 'What is the capital of France?',
response: '', // Empty - will be filled by generate()
reference: 'The capital of France is Paris.',
},
{
query: 'Explain TypeScript',
response: '',
reference: 'TypeScript is a typed superset of JavaScript.',
},
])
// 2. Generate responses using your model
const populatedDataset = await dataset.generate(
async (sample) => {
// Use type guard for type safety
if (isSingleTurnSample(sample)) {
const result = await generateText({
model: openai('gpt-4o-mini'),
prompt: sample.query,
})
return result.text
}
return ''
},
{ concurrency: 10 } // Process 10 samples at a time
)
// 3. Evaluate the generated responses
const results = await evaluate(populatedDataset, [
new Faithfulness({ model: openai('gpt-4o-mini') }),
])
The generate() method handles parallel execution automatically, making it efficient for large datasets. Adjust the concurrency parameter based on your API rate limits.
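If you run into provider rate limits, lower the concurrency and retry failed calls. The sketch below shows one way to do that; the withRetry helper is plain TypeScript written for this example, not part of the framework:
import { EvaluationDataset, isSingleTurnSample } from '@open-evals/core'
import { generateText } from 'ai'
import { openai } from '@ai-sdk/openai'

// Hypothetical helper: retry an async call with exponential backoff
async function withRetry<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn()
    } catch (error) {
      if (attempt >= attempts - 1) throw error
      // Wait 1s, 2s, 4s, ... before trying again
      await new Promise((resolve) => setTimeout(resolve, 2 ** attempt * 1000))
    }
  }
}

const dataset = new EvaluationDataset([
  { query: 'What is the capital of France?', response: '' },
])

const populated = await dataset.generate(
  async (sample) => {
    if (!isSingleTurnSample(sample)) return ''
    return withRetry(async () => {
      const { text } = await generateText({
        model: openai('gpt-4o-mini'),
        prompt: sample.query,
      })
      return text
    })
  },
  { concurrency: 5 } // lower concurrency to stay under stricter rate limits
)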
Using with Evalite
You can also use metrics with Evalite by converting them with toEvaliteScorer:
import { evalite } from 'evalite'
import { generateText } from 'ai'
import { Faithfulness, toEvaliteScorer } from '@open-evals/metrics'
import { openai } from '@ai-sdk/openai'
evalite('Documentation Q&A Evaluation', {
data: async () => [
{
input: {
query: 'What is TypeScript?',
retrievedContexts: ['TypeScript adds static typing to JavaScript.'],
},
},
],
task: async (input) => {
// Your LLM task that generates a response from the provided input
const result = await generateText({
model: openai('gpt-4o-mini'),
prompt: input.query,
})
return result.text
},
scorers: [
toEvaliteScorer(
new Faithfulness({
model: openai('gpt-4.1-mini'),
})
),
],
})
The toEvaliteScorer function adapts metrics from @open-evals/metrics to work seamlessly with Evalite, preserving all scoring metadata.
Configuration Options
The evaluate function accepts configuration to control how evaluations run. These options help you balance speed, error handling, and tracking.
await evaluate(dataset, metrics, {
concurrency: 10, // Number of samples to evaluate in parallel (default: 10)
throwOnError: false, // Whether to stop on first error or collect all results
metadata: { exp: '001' }, // Custom data to attach to this evaluation run
})
Concurrency determines how many evaluations run simultaneously. Higher values speed up large datasets but may hit API rate limits. Start with 10 for API-based metrics and increase it for local models.
Error handling lets you choose between failing fast and collecting all results, including errors.
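For example, you might fail fast while iterating locally and collect everything in CI. This sketch reuses the dataset and metrics from the earlier examples; the metadata fields are illustrative, and it assumes throwOnError: true rethrows the first metric error:
// Fail fast during local development so the first bad sample surfaces immediately
try {
  await evaluate(dataset, metrics, { concurrency: 5, throwOnError: true })
} catch (error) {
  console.error('Evaluation aborted:', error)
}

// In CI, collect every result and report the aggregate scores
const results = await evaluate(dataset, metrics, {
  throwOnError: false,
  metadata: { exp: '002', commit: process.env.GIT_SHA }, // hypothetical metadata fields
})
console.log(results.statistics.averages)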
Working with Datasets
The EvaluationDataset class provides utility methods for common data operations. All methods return new datasets, so you can chain operations:
// Filter samples based on criteria
const filtered = dataset.filter((s) => s.query.includes('TypeScript'))
// Get a random subset for testing
const subset = dataset.sample(10)
// Split into training and test sets
const [train, test] = dataset.split(0.8)
// Persist your dataset for later use (readFile/writeFile come from 'node:fs/promises')
await writeFile('data.jsonl', dataset.toJSONL())
const loaded = EvaluationDataset.fromJSONL(await readFile('data.jsonl'))
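Because each transform returns a new EvaluationDataset, the operations compose. This sketch (reusing the single-turn dataset and the isSingleTurnSample guard from earlier) filters, samples, and splits in one chain:
const [train, holdout] = dataset
  .filter((s) => isSingleTurnSample(s) && s.query.includes('TypeScript'))
  .sample(20) // random subset of at most 20 samples (assumed behavior when fewer match)
  .split(0.8) // 80% for tuning prompts, 20% held out for evaluation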