Metrics
Production-ready evaluation metrics for LLM applications
Overview
The @open-evals/metrics package provides production-ready evaluation metrics for assessing the quality of LLM applications. These metrics use LLM-as-a-judge approaches to evaluate different aspects of your AI system's performance, from factual accuracy to grounding in source material.
Each metric is designed to be easy to use, highly configurable, and grounded in evaluation methodologies proven in research.
Installation
npm install @open-evals/metrics
Vibe Checking vs Metric-Driven Development
The Black Box Problem
LLMs are fundamentally black boxes. Unlike traditional software where you control the logic, LLMs are:
- Non-Deterministic - Same input can produce different outputs
- Opaque - You can't inspect their internal reasoning process
- Unpredictable - Small prompt changes can cause unexpected behavior shifts
- Hard to Debug - No stack traces, no breakpoints, just text in and text out
This makes traditional development practices insufficient for LLM applications.
From Vibes to Metrics
Most teams start with "vibe checking" - manually reviewing a few outputs and saying "looks good" based on gut feeling. This is a good starting point, but it doesn't scale:
- Inconsistent - Different people have different standards
- Slow - Manual review becomes a bottleneck
- Incomplete - You can't check every output
- Unreliable - Misses edge cases and regressions
- Unscientific - Can't prove improvements or track quality over time
Metric-driven development replaces vibes with objective, automated measurements:
- Objective - Consistent scoring criteria across all evaluations
- Scalable - Evaluate hundreds or thousands of outputs automatically
- Trackable - Monitor quality metrics over time as your system evolves
- Comparable - A/B test different prompts, models, or approaches with confidence
- Reliable - Catch regressions before they reach production
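To make the workflow concrete, here is a rough sketch of a metric-driven regression gate, assuming the Faithfulness metric and evaluateSingleTurn API introduced later in this guide; the fixture data and the 0.8 threshold are illustrative placeholders, not recommendations.
import { Faithfulness } from '@open-evals/metrics'
import { openai } from '@ai-sdk/openai'
// Sketch: fail the build when average faithfulness drops below a chosen threshold.
const faithfulness = new Faithfulness({ model: openai('gpt-4o-mini') })
const samples = [
  {
    query: 'What is the capital of France?',
    response: 'Paris is the capital of France.',
    retrievedContexts: ['Paris is the capital city of France.'],
  },
  // ...more fixtures from your test set
]
const results = await Promise.all(
  samples.map((sample) => faithfulness.evaluateSingleTurn(sample)),
)
const average = results.reduce((sum, r) => sum + r.score, 0) / results.length
if (average < 0.8) {
  console.error(`Average faithfulness ${average.toFixed(2)} fell below 0.8`)
  process.exit(1) // surface the regression before it reaches production
}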
Why Use Evaluation Metrics?
When building LLM applications, you need to ensure they're producing high-quality outputs:
- Detecting Hallucinations - Ensure responses are grounded in provided context, not made up
- Measuring Accuracy - Verify that answers match expected ground truth
- Tracking Performance - Monitor quality as your application evolves
- Comparing Approaches - Objectively compare different prompts, models, or retrieval strategies
Automated metrics address these needs by providing consistent, scalable ways to evaluate your LLM outputs.
How Metrics Help
The metrics package addresses these challenges by:
- Automating Evaluation - Use LLMs to judge quality at scale, eliminating manual review bottlenecks
- Providing Consistency - Get reproducible scores across different runs and team members
- Offering Transparency - Each score includes reasoning explaining the judgment
- Enabling Iteration - Quickly test changes and measure their impact on quality
- Supporting Multiple Use Cases - Different metrics for RAG systems, Q&A, fact-checking, and more
How It Works
Evaluating your LLM application with metrics is a simple three-step process:
1. Choose a Metric
Select the metric that matches your evaluation needs:
import { Faithfulness, FactualCorrectness } from '@open-evals/metrics'
import { openai } from '@ai-sdk/openai'
// For RAG applications - check if responses are grounded in context
const faithfulness = new Faithfulness({
model: openai('gpt-4o-mini'),
})
// For Q&A systems - check accuracy against reference answers
const factualCorrectness = new FactualCorrectness({
model: openai('gpt-4o-mini'),
mode: 'f1',
})
2. Evaluate Samples
Pass your query, response, and context to the metric:
const result = await faithfulness.evaluateSingleTurn({
query: 'What is the capital of France?',
response: 'Paris is the capital of France.',
retrievedContexts: ['Paris is the capital city of France.'],
})
console.log(`Score: ${result.score}`) // 0.0 - 1.0
console.log(`Reason: ${result.reason}`) // Explanation
3. Analyze Results
Each metric returns a score (0.0 to 1.0) and a reason explaining the judgment:
{
name: "faithfulness",
score: 0.95,
reason: "The response is fully supported by the provided context...",
metadata: {
// Detailed evaluation data
}
}
Key Features
LLM-as-a-Judge Evaluation
All metrics use language models to evaluate outputs, leveraging their reasoning capabilities to assess quality in nuanced ways that simple string matching cannot.
Configurable Evaluation
Fine-tune metrics for your specific needs:
const metric = new FactualCorrectness({
model: openai('gpt-4o'),
mode: 'precision', // or 'recall', 'f1'
atomicity: 'high', // Granularity of claim decomposition
coverage: 'high', // Thoroughness of evaluation
})
Research-Backed Approaches
Metrics implement proven evaluation methodologies:
- Faithfulness - Based on statement verification and NLI approaches
- Factual Correctness - Uses claim decomposition and alignment techniques
Transparent Scoring
Every evaluation includes:
- Numeric score - Quantitative measure (0.0 to 1.0)
- Reasoning - Natural language explanation
- Metadata - Detailed breakdown of the evaluation process
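As a minimal sketch of putting all three fields to work, the snippet below reuses the faithfulness metric from the earlier example and surfaces the reasoning and metadata whenever the score falls below a review threshold; the 0.7 cutoff and the deliberately incorrect response are assumptions for illustration, and the exact metadata shape varies by metric.
const result = await faithfulness.evaluateSingleTurn({
  query: 'What is the capital of France?',
  response: 'Paris is the capital of Germany.', // intentionally wrong to trigger a low score
  retrievedContexts: ['Paris is the capital city of France.'],
})
if (result.score < 0.7) {
  // The reason and metadata explain why the judge scored the response low.
  console.warn(`Low faithfulness (${result.score}): ${result.reason}`)
  console.warn(JSON.stringify(result.metadata, null, 2))
}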
Integration with Datasets
Metrics work seamlessly with the evaluation framework:
import { evaluate, EvaluationDataset } from '@ai-sdk-eval/core'
const dataset = new EvaluationDataset([
/* your samples */
])
const results = await evaluate(dataset, [faithfulness, factualCorrectness])
Integration with Evalite
All metrics can be used with the Evalite evaluation framework using toEvaliteScorer. This allows you to leverage Evalite's experiment tracking, comparison tools, and web UI while using the production-ready metrics from this package.
import { evalite } from 'evalite'
import { Faithfulness, toEvaliteScorer } from '@open-evals/metrics'
import { openai } from '@ai-sdk/openai'
evalite('My Evaluation', {
data: async () => [{ input: { query: '...', retrievedContexts: ['...'] } }],
task: async (sample) => 'Your LLM response',
scorers: [
toEvaliteScorer(new Faithfulness({ model: openai('gpt-4.1-mini') })),
],
})
See the Evalite Integration guide for complete documentation.
Architecture
Metrics in this package extend the base LLMMetric class from @ai-sdk-eval/core, which provides:
- Consistent Interface - All metrics implement the same evaluation methods
- Type Safety - Full TypeScript support with proper type inference
- Extensibility - Create custom metrics by extending the base classes
- Composability - Combine multiple metrics for comprehensive evaluation
This design ensures metrics are easy to use while remaining flexible enough for advanced use cases.
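The core package is the source of truth for the exact LLMMetric contract, which is not reproduced here. Purely as an illustration of the shape a custom metric can take, the hypothetical sketch below mirrors the evaluateSingleTurn signature and result object shown earlier and uses generateText from the AI SDK as the judge call; the class name, prompt, and scoring rule are assumptions rather than part of the package API, and in practice you would extend LLMMetric as described above.
import { generateText, type LanguageModel } from 'ai'
import { openai } from '@ai-sdk/openai'
// Hypothetical sketch: a conciseness judge that returns the same
// { name, score, reason, metadata } shape as the built-in metrics.
class ConcisenessMetric {
  constructor(private readonly options: { model: LanguageModel }) {}
  async evaluateSingleTurn(sample: { query: string; response: string }) {
    const { text } = await generateText({
      model: this.options.model,
      prompt: [
        'Rate how concisely the response answers the query on a 0-10 scale.',
        'Reply with the number only.',
        `Query: ${sample.query}`,
        `Response: ${sample.response}`,
      ].join('\n'),
    })
    const raw = Number.parseInt(text.trim(), 10)
    const score = Number.isNaN(raw) ? 0 : Math.min(Math.max(raw / 10, 0), 1)
    return {
      name: 'conciseness',
      score, // normalized to the same 0.0 - 1.0 range as the built-in metrics
      reason: `Judge rated conciseness ${raw}/10`,
      metadata: { rawJudgeOutput: text },
    }
  }
}
const conciseness = new ConcisenessMetric({ model: openai('gpt-4o-mini') })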