Faithfulness
Measure how grounded the model's response is in the provided context
Overview
The Faithfulness metric evaluates how well a model's response is grounded in the provided context. It ensures that the generated answer doesn't contain hallucinations or claims that cannot be verified from the retrieved documents.
This metric is particularly valuable for Retrieval-Augmented Generation (RAG) applications where factual accuracy and grounding in source material are critical.
How It Works
The Faithfulness metric uses a three-step evaluation process:
- Statement Decomposition: The model's response is broken down into atomic, standalone statements
- Faithfulness Verification: Each statement is checked against the retrieved contexts using natural language inference (NLI)
- Score Calculation: The final score is the ratio of faithful statements to total statements
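Conceptually, the three steps can be sketched as below. This is a hypothetical illustration, not the library's implementation: the real metric uses the configured LLM for decomposition and NLI verification, whereas the stubs here use a naive sentence split and substring match.

```typescript
// Hypothetical sketch of the three-step process. The real metric delegates
// decomposition and verification to the configured evaluation LLM.
type Verdict = { statement: string; verdict: 0 | 1 };

// Step 1: break the response into atomic statements (stub: sentence split).
function decompose(response: string): string[] {
  return response
    .split(/(?<=[.!?])\s+/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// Step 2: check each statement against the contexts (stub: substring match
// stands in for natural language inference).
function verify(statement: string, contexts: string[]): Verdict {
  const needle = statement.toLowerCase().replace(/[.!?]$/, '');
  const supported = contexts.some((c) => c.toLowerCase().includes(needle));
  return { statement, verdict: supported ? 1 : 0 };
}

// Step 3: score = faithful statements / total statements.
function faithfulness(response: string, contexts: string[]): number {
  const verdicts = decompose(response).map((s) => verify(s, contexts));
  if (verdicts.length === 0) return 0;
  return verdicts.filter((v) => v.verdict === 1).length / verdicts.length;
}
```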
Installation
npm i @open-evals/metrics
Usage
import { Faithfulness } from '@open-evals/metrics'
import { openai } from '@ai-sdk/openai'
// Create the metric
const faithfulness = new Faithfulness({
model: openai('gpt-4.1'),
})
// Evaluate a sample
const result = await faithfulness.evaluateSingleTurn({
query: 'What is the capital of France?',
response: 'The capital of France is Paris. It is located on the Seine River.',
retrievedContexts: [
'Paris is the capital and most populous city of France. It is situated on the Seine River.',
],
})
console.log(result)
// {
// name: 'faithfulness',
// score: 1.0,
// reason: '2 out of 2 statements were faithful to the context',
// metadata: {
// statements: [...]
// }
// }
Configuration
Options
The constructor accepts the following option:
- model: The language model used to run the evaluation (e.g. openai('gpt-4.1'))
Required Sample Fields
The Faithfulness metric requires the following fields in your evaluation sample:
- query: The user's question or input
- response: The model's generated answer
- retrievedContexts: Array of context strings that were used to generate the response
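In TypeScript terms, a sample carrying these fields might be shaped as follows. The interface name is an assumption for illustration; the field names match the list above.

```typescript
// Sketch of the sample shape the metric consumes; the interface name
// SingleTurnSample is an assumption, the field names come from the docs.
interface SingleTurnSample {
  query: string;               // the user's question or input
  response: string;            // the model's generated answer
  retrievedContexts: string[]; // contexts used to generate the response
}

const sample: SingleTurnSample = {
  query: 'What is the capital of France?',
  response: 'The capital of France is Paris.',
  retrievedContexts: ['Paris is the capital of France.'],
};
```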
Scoring
The metric returns a score between 0 and 1:
- 1.0: All statements in the response are fully grounded in the provided context
- 0.5: Half of the statements can be verified from the context
- 0.0: None of the statements are supported by the context
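The mapping from per-statement verdicts to a score is a simple ratio, sketched here (a minimal illustration, not library code):

```typescript
// Compute the faithfulness ratio from per-statement verdicts (1 = faithful).
// An empty verdict list scores 0, matching the documented edge-case behavior.
function faithfulnessScore(verdicts: number[]): number {
  if (verdicts.length === 0) return 0;
  return verdicts.filter((v) => v === 1).length / verdicts.length;
}

faithfulnessScore([1, 1]);    // 1.0 — both statements faithful
faithfulnessScore([1, 0, 1]); // ≈ 0.67 — two of three faithful
```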
Score Calculation
Faithfulness Score = (Number of Faithful Statements) / (Total Number of Statements)
Example with Metadata
The metric includes detailed metadata about the evaluation:
const result = await faithfulness.evaluateSingleTurn({
query: 'Who was Albert Einstein?',
response:
'Albert Einstein was a German-born theoretical physicist who developed the theory of relativity.',
retrievedContexts: [
'Albert Einstein was a German-born theoretical physicist, widely acknowledged to be one of the greatest physicists of all time. He is best known for developing the theory of relativity.',
],
})
console.log(result.metadata?.statements)
// [
// {
// statement: "Albert Einstein was a German-born theoretical physicist.",
// reason: "This statement is directly supported by the context.",
// verdict: 1
// },
// {
// statement: "Albert Einstein developed the theory of relativity.",
// reason: "The context explicitly mentions he developed the theory of relativity.",
// verdict: 1
// }
// ]
Use Cases
- RAG Systems: Verify that generated answers don't hallucinate beyond the retrieved documents
- Question Answering: Ensure responses stay grounded in provided source material
- Fact Checking: Validate that claims in generated text can be traced back to context
- Content Generation: Monitor AI-generated content for accuracy against source documents
Best Practices
- Use High-Quality LLMs: The metric's accuracy depends on the quality of the evaluation model.
- Provide Complete Context: Ensure that retrievedContexts contains all relevant information used to generate the response.
- Monitor Edge Cases: Empty responses or responses with no decomposable statements will return a score of 0.
- Combine with Other Metrics: Use Faithfulness alongside other metrics like Factual Correctness for comprehensive evaluation.
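When running the metric over a dataset, a small aggregation helper can surface low-scoring samples for review. This is a sketch; the 0.8 threshold is an arbitrary assumption, not a library default.

```typescript
// Aggregate per-sample faithfulness scores and count samples falling
// below a review threshold (0.8 here is an arbitrary choice).
function summarize(scores: number[], threshold = 0.8) {
  const mean = scores.reduce((a, b) => a + b, 0) / Math.max(scores.length, 1);
  const flagged = scores.filter((s) => s < threshold).length;
  return { mean, flagged };
}

summarize([1.0, 0.5, 1.0]); // mean ≈ 0.83, flagged: 1
```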