Open Evals

Metrics

Production-ready evaluation metrics for LLM applications

Overview

The @open-evals/metrics package provides production-ready evaluation metrics for assessing the quality of LLM applications. These metrics use LLM-as-a-judge approaches to evaluate different aspects of your AI system's performance, from factual accuracy to grounding in source material.

Each metric is designed to be easy to use, highly configurable, and grounded in proven evaluation methodologies from the research literature.

Installation

npm install @open-evals/metrics

Vibe Checking vs Metric-Driven Development

The Black Box Problem

LLMs are fundamentally black boxes. Unlike traditional software where you control the logic, LLMs are:

  • Non-Deterministic - Same input can produce different outputs
  • Opaque - You can't inspect their internal reasoning process
  • Unpredictable - Small prompt changes can cause unexpected behavior shifts
  • Hard to Debug - No stack traces, no breakpoints, just text in and text out

This makes traditional development practices insufficient for LLM applications.

From Vibes to Metrics

Most teams start with "vibe checking" - manually reviewing a few outputs and saying "looks good" based on gut feeling. This is a good starting point, but it doesn't scale:

  • Inconsistent - Different people have different standards
  • Slow - Manual review becomes a bottleneck
  • Incomplete - You can't check every output
  • Unreliable - Misses edge cases and regressions
  • Unscientific - Can't prove improvements or track quality over time

Metric-driven development replaces vibes with objective, automated measurements:

  • Objective - Consistent scoring criteria across all evaluations
  • Scalable - Evaluate hundreds or thousands of outputs automatically
  • Trackable - Monitor quality metrics over time as your system evolves
  • Comparable - A/B test different prompts, models, or approaches with confidence
  • Reliable - Catch regressions before they reach production

Why Use Evaluation Metrics?

When building LLM applications, you need to ensure they produce high-quality outputs. In practice, that means:

  • Detecting Hallucinations - Ensure responses are grounded in provided context, not made up
  • Measuring Accuracy - Verify that answers match expected ground truth
  • Tracking Performance - Monitor quality as your application evolves
  • Comparing Approaches - Objectively compare different prompts, models, or retrieval strategies

Automated metrics meet these needs by providing consistent, scalable ways to evaluate your LLM outputs.

How Metrics Help

The metrics package addresses these challenges by:

  • Automating Evaluation - Use LLMs to judge quality at scale, eliminating manual review bottlenecks
  • Providing Consistency - Get reproducible scores across different runs and team members
  • Offering Transparency - Each score includes reasoning explaining the judgment
  • Enabling Iteration - Quickly test changes and measure their impact on quality
  • Supporting Multiple Use Cases - Different metrics for RAG systems, Q&A, fact-checking, and more

How It Works

Evaluating your LLM application with metrics is a simple three-step process:

1. Choose a Metric

Select the metric that matches your evaluation needs:

import { Faithfulness, FactualCorrectness } from '@open-evals/metrics'
import { openai } from '@ai-sdk/openai'

// For RAG applications - check if responses are grounded in context
const faithfulness = new Faithfulness({
  model: openai('gpt-4o-mini'),
})

// For Q&A systems - check accuracy against reference answers
const factualCorrectness = new FactualCorrectness({
  model: openai('gpt-4o-mini'),
  mode: 'f1',
})

2. Evaluate Samples

Pass your query, response, and context to the metric:

const result = await faithfulness.evaluateSingleTurn({
  query: 'What is the capital of France?',
  response: 'Paris is the capital of France.',
  retrievedContexts: ['Paris is the capital city of France.'],
})

console.log(`Score: ${result.score}`) // 0.0 - 1.0
console.log(`Reason: ${result.reason}`) // Explanation

3. Analyze Results

Each metric returns a score (0.0 to 1.0) and a reason explaining the judgment:

{
  name: "faithfulness",
  score: 0.95,
  reason: "The response is fully supported by the provided context...",
  metadata: {
    // Detailed evaluation data
  }
}

Key Features

LLM-as-a-Judge Evaluation

All metrics use language models to evaluate outputs, leveraging their reasoning capabilities to assess quality in nuanced ways that simple string matching cannot.
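
For example, a simple exact-match check rejects a correct paraphrase, while the judge evaluates meaning. A minimal sketch, reusing the faithfulness metric configured earlier (the exactMatch helper is purely illustrative):

// Naive string matching: a correct paraphrase is marked as a failure
const exactMatch = (response: string, reference: string) =>
  response.trim() === reference.trim()

const matched = exactMatch(
  "France's capital city is Paris.",
  'Paris is the capital of France.',
) // false, even though the answer is correct

// LLM-as-a-judge: evaluates the meaning, not the surface form
const judged = await faithfulness.evaluateSingleTurn({
  query: 'What is the capital of France?',
  response: "France's capital city is Paris.",
  retrievedContexts: ['Paris is the capital city of France.'],
})
// judged.score is expected to be high, because the claim is supported by the context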

Configurable Evaluation

Fine-tune metrics for your specific needs:

const metric = new FactualCorrectness({
  model: openai('gpt-4o'),
  mode: 'precision', // or 'recall', 'f1'
  atomicity: 'high', // Granularity of claim decomposition
  coverage: 'high', // Thoroughness of evaluation
})

Research-Backed Approaches

Metrics implement proven evaluation methodologies:

  • Faithfulness - Based on statement verification and NLI approaches
  • Factual Correctness - Uses claim decomposition and alignment techniques
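
To make the precision, recall, and f1 modes of FactualCorrectness concrete: in a typical claim-alignment setup, the response and reference are each decomposed into atomic claims, response claims supported by the reference count toward precision, reference claims covered by the response count toward recall, and f1 combines the two. The claim counts below are hypothetical, purely to show the arithmetic:

// Hypothetical claim alignment produced by the judge model (illustration only)
const supportedResponseClaims = 4 // response claims backed by the reference
const unsupportedResponseClaims = 1 // response claims not found in the reference
const missedReferenceClaims = 2 // reference claims missing from the response

const precision =
  supportedResponseClaims / (supportedResponseClaims + unsupportedResponseClaims) // 0.8
const recall =
  supportedResponseClaims / (supportedResponseClaims + missedReferenceClaims) // ~0.67
const f1 = (2 * precision * recall) / (precision + recall) // ~0.73

// mode: 'precision' | 'recall' | 'f1' selects which of these the metric reports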

Transparent Scoring

Every evaluation includes:

  • Numeric score - Quantitative measure (0.0 to 1.0)
  • Reasoning - Natural language explanation
  • Metadata - Detailed breakdown of the evaluation process
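
A short example of putting all three fields to work, reusing the faithfulness metric from earlier (the 0.8 threshold is an arbitrary choice for illustration):

const { score, reason, metadata } = await faithfulness.evaluateSingleTurn({
  query: 'What is the capital of France?',
  response: 'Paris is the capital of France.',
  retrievedContexts: ['Paris is the capital city of France.'],
})

// Gate on the numeric score, but keep the reasoning for debugging
if (score < 0.8) {
  console.warn(`Faithfulness below threshold: ${score}`)
  console.warn(`Judge reasoning: ${reason}`)
  console.debug(metadata) // detailed breakdown of the evaluation
}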

Integration with Datasets

Metrics work seamlessly with the evaluation framework:

import { evaluate, EvaluationDataset } from '@ai-sdk-eval/core'

const dataset = new EvaluationDataset([
  /* your samples */
])

// faithfulness and factualCorrectness are the metric instances created earlier
const results = await evaluate(dataset, [faithfulness, factualCorrectness])

Integration with Evalite

All metrics can be used with the Evalite evaluation framework using toEvaliteScorer. This allows you to leverage Evalite's experiment tracking, comparison tools, and web UI while using the production-ready metrics from this package.

import { evalite } from 'evalite'
import { Faithfulness, toEvaliteScorer } from '@open-evals/metrics'
import { openai } from '@ai-sdk/openai'

evalite('My Evaluation', {
  data: async () => [{ input: { query: '...', retrievedContexts: ['...'] } }],
  task: async (sample) => 'Your LLM response',
  scorers: [
    toEvaliteScorer(new Faithfulness({ model: openai('gpt-4.1-mini') })),
  ],
})

See the Evalite Integration guide for complete documentation.

Architecture

Metrics in this package extend the base LLMMetric class from @ai-sdk-eval/core, which provides:

  • Consistent Interface - All metrics implement the same evaluation methods
  • Type Safety - Full TypeScript support with proper type inference
  • Extensibility - Create custom metrics by extending the base classes (see the sketch below)
  • Composability - Combine multiple metrics for comprehensive evaluation

This design ensures metrics are easy to use while remaining flexible enough for advanced use cases.
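
As a rough illustration of the extensibility point above, a custom metric might extend LLMMetric and implement the same evaluateSingleTurn shape used throughout this page. The sketch below assumes the base class exposes the configured judge model as this.model and accepts a name/score/reason result object; consult @ai-sdk-eval/core for the actual contract before relying on it:

import { LLMMetric } from '@ai-sdk-eval/core'
import { generateText } from 'ai'

// Hypothetical custom metric: asks the judge model to rate politeness.
// Base-class details (constructor options, the `model` property, the exact
// return type) are assumptions, not the documented API.
class Politeness extends LLMMetric {
  name = 'politeness'

  async evaluateSingleTurn({ response }: { response: string }) {
    const { text } = await generateText({
      model: this.model, // assumed: the judge model passed to the constructor
      prompt: `Rate the politeness of this response from 0 to 1 and reply with only the number:\n\n${response}`,
    })
    const score = Math.min(1, Math.max(0, Number.parseFloat(text) || 0))
    return {
      name: this.name,
      score,
      reason: `Judge returned ${text.trim()} as the politeness rating`,
    }
  }
}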
