Evaluating LLM agents is messy.
You cannot rely on perfect determinism, you cannot just assert result == expected, and asking a model to rate itself on a 1–5 scale gives you noisy, unstable numbers.
A much simpler pattern works far better in practice:
Turn everything into yes/no checks, then combine them with explicit weights.
In this article we will walk through how to design and implement binary weighted evaluations using a real scheduling agent as an example. You can reuse the same pattern for any agent: customer support bots, coding assistants, internal workflow agents, you name it.
1. What is a binary weighted evaluation?
At a high level:
- You define a set of binary criteria for a task. Each criterion is a question that can be answered with True or False. Example:
  - correct_participants: Did the agent book the right people?
  - clear_explanation: Did the agent explain the outcome clearly?
- You assign each criterion a weight that reflects its importance. All weights typically sum to 1.0. Example:
COMPLETION_WEIGHTS = {
"correct_participants": 0.25,
"correct_time": 0.25,
"correct_duration": 0.10,
"explored_alternatives": 0.20,
"clear_explanation": 0.20,
}
- For each task, you compute a score from 0.0 to 1.0. You sum the weights of all criteria that are True:
score = sum(
COMPLETION_WEIGHTS[k]
for k, v in checks.items()
if v
)
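For a quick worked example (hypothetical check results, reusing the COMPLETION_WEIGHTS above): if everything except correct_duration passes, the score is 0.25 + 0.25 + 0.20 + 0.20 = 0.90.
checks = {
    "correct_participants": True,
    "correct_time": True,
    "correct_duration": False,  # the only failing check
    "explored_alternatives": True,
    "clear_explanation": True,
}
# 0.25 + 0.25 + 0.20 + 0.20 = 0.90
score = sum(COMPLETION_WEIGHTS[k] for k, v in checks.items() if v)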
- You classify the outcome based on the score and state. For example:
  - score >= 0.75 and booking confirmed → successful completion
  - score >= 0.50 → graceful failure
  - score > 0.0 but < 0.50 → partial failure
  - score == 0.0 and conversation failed → hard failure
This gives you a scalar metric that is:
- Interpretable: you can see exactly which criteria failed.
- Tunable: change the weights without touching your agent.
- Stable: True or False decisions are far easier for humans or models to agree on.
2. Step 1 – Turn “good behavior” into boolean checks
Start by asking: What does “good” look like for this task?
For a scheduling agent, a successful task might mean:
- It booked a meeting with the right participants.
- At the right time.
- With the right duration.
- If there was a conflict, it proposed alternatives.
- Regardless of outcome, it explained clearly what happened.
Those become boolean checks.
Conceptually:
checks = {
"correct_participants": ... -> bool,
"correct_time": ... -> bool,
"correct_duration": ... -> bool,
"explored_alternatives": ... -> bool,
"clear_explanation": ... -> bool,
}
In the scheduling example, these checks use the agent’s final state plus a ground truth object.
Simplified version:
def _check_participants(scheduling_ctx, ground_truth) -> bool:
if not scheduling_ctx.get("booking_confirmed"):
return False
booked = set(scheduling_ctx["booked_event"]["participants"])
expected = set(ground_truth["participants"])
return booked == expected
def _check_time(scheduling_ctx, ground_truth) -> bool:
if not scheduling_ctx.get("booking_confirmed"):
return False
return scheduling_ctx["booked_event"]["time"] == ground_truth["time"]
def _check_duration(scheduling_ctx, ground_truth) -> bool:
if not scheduling_ctx.get("booking_confirmed"):
return False
expected = ground_truth.get("duration", 30)
return scheduling_ctx["booked_event"]["duration"] == expected
And for behavior around conflicts and explanations:
def _check_alternatives(scheduling_ctx, conversation_trace) -> bool:
if not scheduling_ctx.get("conflicts"):
# If there was no conflict, this is automatically ok
return True
proposed = scheduling_ctx.get("proposed_alternatives", [])
return len(proposed) > 0
def _check_explanation(conversation_trace, conversation_stage: str) -> bool:
if not conversation_trace:
return False
last_response = conversation_trace[-1].get("response", "")
# Silent crash is bad
if conversation_stage == "failed" and len(last_response) < 20:
return False
# Very simple heuristic: the user sees some explanation
return len(last_response) > 20
The exact logic is domain specific. The key rule is:
Each check should be obviously True or False when you look at the trace.
3. Step 2 – Turn business priorities into weights
Not all criteria are equally important.
In the scheduling agent example:
COMPLETION_WEIGHTS = {
"correct_participants": 0.25, # Booked the right people
"correct_time": 0.25, # Booked the right date/time
"correct_duration": 0.10, # Meeting length as requested
"explored_alternatives": 0.20, # Tried to find another slot if needed
"clear_explanation": 0.20, # User understands outcome
}
Why this makes sense:
- Booking the wrong person or the wrong time is catastrophic → high weight.
- Slightly wrong duration is annoying but not fatal → lower weight.
- Exploring alternatives and clear explanations are key to user trust → medium weight.
Guidelines for designing weights:
- Start from business impact, not from what is easiest to check.
- Make weights sum to 1.0 so the score is intuitive (a quick sanity check is sketched after this list).
- Keep a small number of criteria at first (4 to 7 is plenty).
- Be willing to change weights after you see real data.
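A minimal sanity check, assuming the COMPLETION_WEIGHTS dict from above, so that later weight tweaks never silently break the 0.0 to 1.0 scale:
import math

def validate_weights(weights: dict[str, float]) -> None:
    # All weights must be positive and sum to 1.0
    # so the score stays interpretable as 0.0 to 1.0.
    assert all(w > 0 for w in weights.values()), "weights must be positive"
    assert math.isclose(sum(weights.values()), 1.0), "weights must sum to 1.0"

validate_weights(COMPLETION_WEIGHTS)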
4. Step 3 – Implement the per request evaluator
Now combine the boolean checks and weights to compute a score for a single request.
In the example repository, this machinery is wrapped in an EvaluationResult dataclass:
from dataclasses import dataclass
from enum import Enum
from typing import Dict
class OutcomeType(Enum):
SUCCESSFUL_COMPLETION = "successful_completion"
GRACEFUL_FAILURE = "graceful_failure"
PARTIAL_FAILURE = "partial_failure"
HARD_FAILURE = "hard_failure"
@dataclass
class EvaluationResult:
score: float # 0.0 to 1.0
details: Dict[str, bool] # criterion -> passed?
outcome_type: OutcomeType
explanation: str
Then the core evaluation function:
def evaluate_task_completion(final_state, ground_truth, conversation_trace) -> EvaluationResult:
scheduling_ctx = final_state.get("scheduling_context", {})
conversation_stage = final_state.get("conversation_stage", "unknown")
checks = {
"correct_participants": _check_participants(scheduling_ctx, ground_truth),
"correct_time": _check_time(scheduling_ctx, ground_truth),
"correct_duration": _check_duration(scheduling_ctx, ground_truth),
"explored_alternatives": _check_alternatives(scheduling_ctx, conversation_trace),
"clear_explanation": _check_explanation(conversation_trace, conversation_stage),
}
score = sum(
COMPLETION_WEIGHTS[k]
for k, v in checks.items()
if v
)
outcome = _classify_outcome(scheduling_ctx, conversation_stage, score)
explanation = _generate_explanation(checks, outcome, score)
return EvaluationResult(
score=score,
details=checks,
outcome_type=outcome,
explanation=explanation,
)
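The _generate_explanation helper is not shown above. A minimal sketch, assuming the same checks dict and OutcomeType enum, could simply name the failed criteria:
def _generate_explanation(checks: Dict[str, bool], outcome: OutcomeType, score: float) -> str:
    # Point reports straight at the criteria that failed.
    failed = [name for name, passed in checks.items() if not passed]
    if not failed:
        return f"All criteria passed (score {score:.2f}, outcome: {outcome.value})."
    return f"Score {score:.2f}, outcome: {outcome.value}. Failed criteria: {', '.join(failed)}."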
This gives you:
- A numeric score for analytics and thresholds.
- A details dict for debugging.
- A human friendly explanation for reports or console output.
5. Step 4 – Map scores to outcome classes
Users and stakeholders do not want to look at a sea of floating point numbers. They want to know:
- How often does the agent succeed?
- How often does it fail gracefully?
- How often does it blow up?
You answer that by mapping scores to classes.
Example logic:
def _classify_outcome(scheduling_ctx, conversation_stage: str, score: float) -> OutcomeType:
booking_confirmed = scheduling_ctx.get("booking_confirmed", False)
if booking_confirmed and score >= 0.75:
return OutcomeType.SUCCESSFUL_COMPLETION
if conversation_stage == "failed" and score == 0.0:
return OutcomeType.HARD_FAILURE
if score >= 0.50:
return OutcomeType.GRACEFUL_FAILURE
return OutcomeType.PARTIAL_FAILURE
You can now define clear thresholds:
- Successful completion: Meeting booked correctly with a high score.
- Graceful failure: The task could not be completed, but the user got a useful explanation or alternatives.
- Partial failure: The agent tried, but did not do enough to help the user.
- Hard failure: Wrong booking or silent crash.
This gives you both quantitative and qualitative views of performance.
6. Step 5 – Aggregating into metrics like TCR
Once you can evaluate a single request, turning that into a metric is straightforward.
For example, define Task Completion Rate (TCR) as the mean of per request scores:
def compute_tcr(results: list[EvaluationResult]) -> float:
if not results:
return 0.0
return sum(r.score for r in results) / len(results)
Then define thresholds that match your risk tolerance:
- TCR >= 0.85 → production ready
- 0.70 <= TCR < 0.85 → usable but needs improvement
- TCR < 0.70 → not production ready
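One way to encode these thresholds, as a sketch using the cut-offs listed above (tune them to your own risk tolerance):
def readiness(tcr: float) -> str:
    # Map the aggregate TCR onto a release readiness label.
    if tcr >= 0.85:
        return "production ready"
    if tcr >= 0.70:
        return "usable but needs improvement"
    return "not production ready"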
You can also break down by outcome type:
from collections import Counter
def summarize_outcomes(results: list[EvaluationResult]):
counts = Counter(r.outcome_type for r in results)
total = len(results) or 1
return {
"successful_completion": counts[OutcomeType.SUCCESSFUL_COMPLETION] / total,
"graceful_failure": counts[OutcomeType.GRACEFUL_FAILURE] / total,
"partial_failure": counts[OutcomeType.PARTIAL_FAILURE] / total,
"hard_failure": counts[OutcomeType.HARD_FAILURE] / total,
}
This lets you say things like:
- “78 percent of requests end in successful completion, 15 percent in graceful failure, and 7 percent in partial or hard failure.”
Which is far more actionable than “average rating: 3.9 out of 5”.
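A small helper, assuming the summarize_outcomes function above (the name format_outcome_summary is just for illustration), can produce that kind of sentence directly:
def format_outcome_summary(results: list[EvaluationResult]) -> str:
    # Turn the outcome shares into a one-line summary for stakeholders.
    s = summarize_outcomes(results)
    return (
        f"{s['successful_completion']:.0%} of requests end in successful completion, "
        f"{s['graceful_failure']:.0%} in graceful failure, and "
        f"{s['partial_failure'] + s['hard_failure']:.0%} in partial or hard failure."
    )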
7. Extending the pattern to other metrics
Binary weighted evaluations are not only for completion. In the example project, the same pattern is reused for:
- Response Clarity Score (RCS): How clear and useful is a single answer?
- Error Recovery Score (RTE): How well does the agent recover when something goes wrong?
7.1 Response clarity
Define a new set of boolean criteria:
CLARITY_WEIGHTS = {
"addresses_request": 0.30, # Did it answer the original question?
"provides_next_step": 0.25, # Does the user know what to do next?
"is_concise": 0.20, # Not rambling
"no_hallucination": 0.15, # Grounded in context
"appropriate_tone": 0.10, # Professional and friendly
}
Then evaluate:
def evaluate_response_clarity(user_input, agent_response, context) -> EvaluationResult:
checks = {
"addresses_request": _check_addresses_request(user_input, agent_response, context),
"provides_next_step": _check_next_step(agent_response, context),
"is_concise": len(agent_response.split()) < 100,
"no_hallucination": _check_no_hallucination(agent_response, context),
"appropriate_tone": _check_tone(agent_response),
}
score = sum(
CLARITY_WEIGHTS[k]
for k, v in checks.items()
if v
)
# You can reuse OutcomeType or define a dedicated one
return EvaluationResult(
score=score,
details=checks,
outcome_type=OutcomeType.SUCCESSFUL_COMPLETION, # or a clarity specific enum
explanation=f"Response clarity score: {score:.2f}",
)
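The _check_* helpers referenced above are deliberately simple heuristics. As an illustration, a hypothetical _check_next_step (not the project's actual implementation) might be nothing more than a keyword scan:
def _check_next_step(agent_response: str, context: dict) -> bool:
    # Heuristic: the response points the user toward a concrete next action.
    cues = ("you can", "next step", "would you like", "let me know", "please confirm")
    return any(cue in agent_response.lower() for cue in cues)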
7.2 Error recovery
Same pattern, different criteria:
ERROR_RECOVERY_WEIGHTS = {
"detected_error": 0.30,
"requested_clarification": 0.25,
"actionable_message": 0.20,
"no_hallucination": 0.15,
"no_crash": 0.05,
}
You define checks for each of these and compute a weighted score in the same way.
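A sketch of what that evaluator could look like; the _check_* helpers, the "failed" stage value, and the outcome mapping are illustrative placeholders, not the project's actual code:
def evaluate_error_recovery(error_event, agent_response, final_state) -> EvaluationResult:
    checks = {
        "detected_error": _check_detected_error(error_event, agent_response),
        "requested_clarification": _check_requested_clarification(agent_response),
        "actionable_message": _check_actionable_message(agent_response),
        "no_hallucination": _check_no_hallucination(agent_response, final_state),
        "no_crash": final_state.get("conversation_stage") != "failed",
    }
    score = sum(ERROR_RECOVERY_WEIGHTS[k] for k, v in checks.items() if v)
    outcome = OutcomeType.GRACEFUL_FAILURE if score >= 0.5 else OutcomeType.PARTIAL_FAILURE
    return EvaluationResult(
        score=score,
        details=checks,
        outcome_type=outcome,
        explanation=f"Error recovery score: {score:.2f}",
    )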
8. How to adopt this in your own project
Here is a practical checklist to implement binary weighted evaluations for your agents.
1. Pick one task type. For example:
   - Answering factual questions
   - Generating SQL queries
   - Routing support tickets
2. Write down 3 to 7 binary criteria. Good prompts:
   - “What must be true for this result to be useful?”
   - “What are the most expensive mistakes?”
   - “What would we highlight in a post mortem?”
3. Assign approximate weights. Start with something like:
   - 0.3 for the main success criterion
   - 0.2 for each secondary one
   - 0.1 or less for extras
4. Implement check functions. They should:
   - Receive the final state, the ground truth, and optionally the full trace.
   - Return clear booleans with simple logic, even if heuristic.
5. Create an EvaluationResult object, so you are not juggling loose dicts. Include:
   - score
   - details
   - outcome_type
   - explanation
6. Write a small evaluator script, like scripts/run_evaluation.py in the example repository (a skeleton is sketched after this checklist). It should:
   - Load test scenarios.
   - Run the agent.
   - Evaluate each run.
   - Print a summary: TCR, outcome breakdown, top failing criteria.
7. Iterate on weights and criteria. After a few runs:
   - Check what failures you see in practice.
   - Adjust weights to match real risk.
   - Add or remove criteria if some are always True or always False.
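A minimal skeleton for that script, reusing the helpers defined earlier; load_scenarios and run_agent are hypothetical placeholders for whatever your project provides:
def run_evaluation() -> None:
    scenarios = load_scenarios()                  # e.g. a list of (request, ground_truth) pairs
    results: list[EvaluationResult] = []
    for request, ground_truth in scenarios:
        final_state, trace = run_agent(request)   # run the agent end to end
        results.append(evaluate_task_completion(final_state, ground_truth, trace))

    print(f"TCR: {compute_tcr(results):.2f}")
    for outcome, share in summarize_outcomes(results).items():
        print(f"  {outcome}: {share:.0%}")

if __name__ == "__main__":
    run_evaluation()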
9. Why this works so well for LLM agents
Binary weighted evaluations match the nature of LLM work:
- Non-deterministic outputs: You care less about string equality and more about semantics: did the agent satisfy the contract of the task?
- Complex, stateful flows: It is unrealistic to reduce a full multi-turn workflow to a single “pass or fail”. Binary checks let you inspect specific aspects of behavior.
- LLM-as-judge integrations: Even when you use a model like GPT-4 as a grader, it is far more stable at answering yes/no questions than at rating on a 1–5 scale. You can plug an LLM into each criterion and still keep the same scoring layer.
- Easy to explain to stakeholders: You can say: “The agent passes correct_participants only 65 percent of the time, but clear_explanation is at 92 percent. We will focus on participant selection next.”
Top comments (2)
Nice! This resonates a lot with me. But one question: what if I'm evaluating a CoT and I need to evaluate the execution order? Should I then create a set of checks like did_bot_do_xxxx_at_step_1? How can this be validated if I expect the bot to perform X before time N? Really intriguing...
Love this question, because it hits the uncomfortable bit everyone skips: order actually matters.
Short answer: yes, you can still use binary checks, but the checks become predicates over the sequence of steps, not just the final state.
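For example, a hypothetical order predicate over the trace (the "action" field and the two action names are illustrative trace conventions, not the project's actual schema):
def _check_availability_before_booking(conversation_trace: list[dict]) -> bool:
    # Order predicate: the agent must check availability before it books anything.
    actions = [step.get("action") for step in conversation_trace]
    if "book_meeting" not in actions:
        return False
    return "check_availability" in actions[: actions.index("book_meeting")]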