Mak Sò

Binary weighted evaluations...how to

Evaluating LLM agents is messy.

You cannot rely on perfect determinism, you cannot just assert result == expected, and asking a model to rate itself on a 1–5 scale gives you noisy, unstable numbers.

A much simpler pattern works far better in practice:

Turn everything into yes/no checks, then combine them with explicit weights.

In this article we will walk through how to design and implement binary weighted evaluations using a real scheduling agent as an example. You can reuse the same pattern for any agent: customer support bots, coding assistants, internal workflow agents, you name it.


1. What is a binary weighted evaluation?

At a high level:

  1. You define a set of binary criteria for a task

    Each criterion is a question that can be answered with True or False.

    • Example:
      • correct_participants: Did the agent book the right people?
      • clear_explanation: Did the agent explain the outcome clearly?
  2. You assign each criterion a weight that reflects its importance

    All weights typically sum to 1.0.

   COMPLETION_WEIGHTS = {
       "correct_participants": 0.25,
       "correct_time": 0.25,
       "correct_duration": 0.10,
       "explored_alternatives": 0.20,
       "clear_explanation": 0.20,
   }
  3. For each task, you compute a score from 0.0 to 1.0

    You sum the weights of all criteria that are True.
   score = sum(
       COMPLETION_WEIGHTS[k]
       for k, v in checks.items()
       if v
   )
  4. You classify the outcome based on the score and state

    For example:
    • score >= 0.75 and booking confirmed → successful completion
    • score >= 0.50 → graceful failure
    • score > 0.0 but < 0.50 → partial failure
    • score == 0.0 and conversation failed → hard failure

This gives you a scalar metric that is:

  • Interpretable: you can see exactly which criteria failed.
  • Tunable: change the weights without touching your agent.
  • Stable: True or False decisions are far easier for humans or models to agree on.
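To make the arithmetic concrete, here is a small worked example using the weights above; the check results are hypothetical:

# Hypothetical check results for one scheduling task
checks = {
    "correct_participants": True,
    "correct_time": True,
    "correct_duration": False,      # e.g. booked 60 minutes instead of 30
    "explored_alternatives": True,
    "clear_explanation": True,
}

score = sum(COMPLETION_WEIGHTS[k] for k, v in checks.items() if v)
# 0.25 + 0.25 + 0.20 + 0.20 = 0.90

# With the booking confirmed and score >= 0.75, this run counts as a successful completion.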

2. Step 1 – Turn “good behavior” into boolean checks

Start by asking: What does “good” look like for this task?

For a scheduling agent, a successful task might mean:

  • It booked a meeting with the right participants.
  • At the right time.
  • With the right duration.
  • If there was a conflict, it proposed alternatives.
  • Regardless of outcome, it explained clearly what happened.

Those become boolean checks.

Conceptually:

checks = {
    "correct_participants": ... -> bool,
    "correct_time": ... -> bool,
    "correct_duration": ... -> bool,
    "explored_alternatives": ... -> bool,
    "clear_explanation": ... -> bool,
}

In the scheduling example, these checks use the agent’s final state plus a ground truth object.

Simplified version:

def _check_participants(scheduling_ctx, ground_truth) -> bool:
    if not scheduling_ctx.get("booking_confirmed"):
        return False
    booked = set(scheduling_ctx["booked_event"]["participants"])
    expected = set(ground_truth["participants"])
    return booked == expected


def _check_time(scheduling_ctx, ground_truth) -> bool:
    if not scheduling_ctx.get("booking_confirmed"):
        return False
    return scheduling_ctx["booked_event"]["time"] == ground_truth["time"]


def _check_duration(scheduling_ctx, ground_truth) -> bool:
    if not scheduling_ctx.get("booking_confirmed"):
        return False
    expected = ground_truth.get("duration", 30)
    return scheduling_ctx["booked_event"]["duration"] == expected

And for behavior around conflicts and explanations:

def _check_alternatives(scheduling_ctx, conversation_trace) -> bool:
    if not scheduling_ctx.get("conflicts"):
        # If there was no conflict, this is automatically ok
        return True

    proposed = scheduling_ctx.get("proposed_alternatives", [])
    return len(proposed) > 0


def _check_explanation(conversation_trace, conversation_stage: str) -> bool:
    if not conversation_trace:
        return False

    last_response = conversation_trace[-1].get("response", "")
    # Silent crash is bad
    if conversation_stage == "failed" and len(last_response) < 20:
        return False

    # Very simple heuristic: the user sees some explanation
    return len(last_response) > 20

The exact logic is domain specific. The key rule is:

Each check should be obviously True or False when you look at the trace.
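One way to keep that property honest is to pin each check function to a couple of hand-written states, for example with pytest. A minimal sketch using _check_participants from above; the sample state is made up:

def test_check_participants_rejects_wrong_people():
    scheduling_ctx = {
        "booking_confirmed": True,
        "booked_event": {"participants": ["alice", "charlie"], "time": "10:00", "duration": 30},
    }
    ground_truth = {"participants": ["alice", "bob"], "time": "10:00", "duration": 30}
    # Wrong attendee list -> the check must fail
    assert _check_participants(scheduling_ctx, ground_truth) is False


def test_check_participants_accepts_exact_match():
    scheduling_ctx = {
        "booking_confirmed": True,
        "booked_event": {"participants": ["alice", "bob"], "time": "10:00", "duration": 30},
    }
    ground_truth = {"participants": ["alice", "bob"], "time": "10:00", "duration": 30}
    assert _check_participants(scheduling_ctx, ground_truth) is True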


3. Step 2 – Turn business priorities into weights

Not all criteria are equally important.

In the scheduling agent example:

COMPLETION_WEIGHTS = {
    "correct_participants": 0.25,   # Booked the right people
    "correct_time": 0.25,           # Booked the right date/time
    "correct_duration": 0.10,       # Meeting length as requested
    "explored_alternatives": 0.20,  # Tried to find another slot if needed
    "clear_explanation": 0.20,      # User understands outcome
}

Why this makes sense:

  • Booking the wrong person or the wrong time is catastrophic → high weight.
  • Slightly wrong duration is annoying but not fatal → lower weight.
  • Exploring alternatives and clear explanations are key to user trust → medium weight.

Guidelines for designing weights:

  1. Start from business impact, not from what is easiest to check.
  2. Make weights sum to 1.0 so the score is intuitive.
  3. Keep a small number of criteria at first (4 to 7 is plenty).
  4. Be willing to change weights after you see real data.
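Since the score is only intuitive when the weights sum to 1.0, it is worth guarding that invariant in code. A minimal sketch that fails fast if the weights drift:

import math

def validate_weights(weights: dict) -> None:
    # Guard the invariant that the maximum possible score is exactly 1.0
    total = sum(weights.values())
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"Weights must sum to 1.0, got {total:.3f}")

validate_weights(COMPLETION_WEIGHTS)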

4. Step 3 – Implement the per-request evaluator

Now combine the boolean checks and weights to compute a score for a single request.

In the example repository, this machinery is wrapped in an EvaluationResult dataclass:

from dataclasses import dataclass
from enum import Enum
from typing import Dict


class OutcomeType(Enum):
    SUCCESSFUL_COMPLETION = "successful_completion"
    GRACEFUL_FAILURE = "graceful_failure"
    PARTIAL_FAILURE = "partial_failure"
    HARD_FAILURE = "hard_failure"


@dataclass
class EvaluationResult:
    score: float                 # 0.0 to 1.0
    details: Dict[str, bool]     # criterion -> passed?
    outcome_type: OutcomeType
    explanation: str

Then the core evaluation function:

def evaluate_task_completion(final_state, ground_truth, conversation_trace) -> EvaluationResult:
    scheduling_ctx = final_state.get("scheduling_context", {})
    conversation_stage = final_state.get("conversation_stage", "unknown")

    checks = {
        "correct_participants": _check_participants(scheduling_ctx, ground_truth),
        "correct_time": _check_time(scheduling_ctx, ground_truth),
        "correct_duration": _check_duration(scheduling_ctx, ground_truth),
        "explored_alternatives": _check_alternatives(scheduling_ctx, conversation_trace),
        "clear_explanation": _check_explanation(conversation_trace, conversation_stage),
    }

    score = sum(
        COMPLETION_WEIGHTS[k]
        for k, v in checks.items()
        if v
    )

    outcome = _classify_outcome(scheduling_ctx, conversation_stage, score)
    explanation = _generate_explanation(checks, outcome, score)

    return EvaluationResult(
        score=score,
        details=checks,
        outcome_type=outcome,
        explanation=explanation,
    )

This gives you:

  • A numeric score for analytics and thresholds.
  • A details dict for debugging.
  • A human-friendly explanation for reports or console output.
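The _generate_explanation helper is not shown in the snippet above; here is a minimal sketch of what it could look like, working only from the checks dict, the outcome, and the score:

def _generate_explanation(checks: Dict[str, bool], outcome: OutcomeType, score: float) -> str:
    # List the failed criteria so the report is immediately actionable
    failed = [name for name, passed in checks.items() if not passed]
    if not failed:
        return f"{outcome.value}: all criteria passed (score {score:.2f})"
    return f"{outcome.value}: score {score:.2f}, failed criteria: {', '.join(failed)}"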

5. Step 4 – Map scores to outcome classes

Users and stakeholders do not want to look at a sea of floating point numbers. They want to know:

  • How often does the agent succeed?
  • How often does it fail gracefully?
  • How often does it blow up?

You answer that by mapping scores to classes.

Example logic:

def _classify_outcome(scheduling_ctx, conversation_stage: str, score: float) -> OutcomeType:
    booking_confirmed = scheduling_ctx.get("booking_confirmed", False)

    if booking_confirmed and score >= 0.75:
        return OutcomeType.SUCCESSFUL_COMPLETION

    if conversation_stage == "failed" and score == 0.0:
        return OutcomeType.HARD_FAILURE

    if score >= 0.50:
        return OutcomeType.GRACEFUL_FAILURE

    return OutcomeType.PARTIAL_FAILURE

You can now define clear thresholds:

  • Successful completion: meeting booked correctly with a high score.
  • Graceful failure: the task could not be completed, but the user got a useful explanation or alternatives.
  • Partial failure: the agent tried, but did not do enough to help the user.
  • Hard failure: wrong booking or silent crash.

This gives you both quantitative and qualitative views of performance.


6. Step 5 – Aggregating into metrics like TCR

Once you can evaluate a single request, turning that into a metric is straightforward.

For example, define Task Completion Rate (TCR) as the mean of per-request scores:

def compute_tcr(results: list[EvaluationResult]) -> float:
    if not results:
        return 0.0
    return sum(r.score for r in results) / len(results)

Then define thresholds that match your risk tolerance:

  • TCR >= 0.85 → production ready
  • 0.70 <= TCR < 0.85 → usable but needs improvement
  • TCR < 0.70 → not production ready
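Those thresholds are easy to keep next to the metric itself; a minimal sketch using the cut-offs listed above (adjust them to your own risk tolerance):

def classify_tcr(tcr: float) -> str:
    # Map the aggregate score onto the readiness labels above
    if tcr >= 0.85:
        return "production ready"
    if tcr >= 0.70:
        return "usable but needs improvement"
    return "not production ready"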

You can also break down by outcome type:

from collections import Counter

def summarize_outcomes(results: list[EvaluationResult]):
    counts = Counter(r.outcome_type for r in results)
    total = len(results) or 1

    return {
        "successful_completion": counts[OutcomeType.SUCCESSFUL_COMPLETION] / total,
        "graceful_failure": counts[OutcomeType.GRACEFUL_FAILURE] / total,
        "partial_failure": counts[OutcomeType.PARTIAL_FAILURE] / total,
        "hard_failure": counts[OutcomeType.HARD_FAILURE] / total,
    }

This lets you say things like:

  • “78 percent of requests end in successful completion, 15 percent in graceful failure, and 7 percent in partial or hard failure.”

Which is far more actionable than “average rating: 3.9 out of 5”.


7. Extending the pattern to other metrics

Binary weighted evaluations are not only for completion. In the example project, the same pattern is reused for:

  • Response Clarity Score (RCS)

    How clear and useful is a single answer?

  • Error Recovery Score (RTE)

    How well does the agent recover when something goes wrong?

7.1 Response clarity

Define a new set of boolean criteria:

CLARITY_WEIGHTS = {
    "addresses_request": 0.30,      # Did it answer the original question?
    "provides_next_step": 0.25,     # Does the user know what to do next?
    "is_concise": 0.20,             # Not rambling
    "no_hallucination": 0.15,       # Grounded in context
    "appropriate_tone": 0.10,       # Professional and friendly
}

Then evaluate:

def evaluate_response_clarity(user_input, agent_response, context) -> EvaluationResult:
    checks = {
        "addresses_request": _check_addresses_request(user_input, agent_response, context),
        "provides_next_step": _check_next_step(agent_response, context),
        "is_concise": len(agent_response.split()) < 100,
        "no_hallucination": _check_no_hallucination(agent_response, context),
        "appropriate_tone": _check_tone(agent_response),
    }

    score = sum(
        CLARITY_WEIGHTS[k]
        for k, v in checks.items()
        if v
    )

    # You can reuse OutcomeType or define a dedicated one
    return EvaluationResult(
        score=score,
        details=checks,
        outcome_type=OutcomeType.SUCCESSFUL_COMPLETION,  # or a clarity specific enum
        explanation=f"Response clarity score: {score:.2f}",
    )

7.2 Error recovery

Same pattern, different criteria:

ERROR_RECOVERY_WEIGHTS = {
    "detected_error": 0.30,
    "requested_clarification": 0.25,
    "actionable_message": 0.20,
    "no_hallucination": 0.15,
    "no_crash": 0.05,
}

You define checks for each of these and compute a weighted score in the same way.
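A minimal sketch of that evaluator; the helpers _detected_error and _requested_clarification are hypothetical and would follow the same shape as the earlier check functions:

def evaluate_error_recovery(final_state, conversation_trace) -> EvaluationResult:
    last_response = conversation_trace[-1].get("response", "") if conversation_trace else ""

    checks = {
        "detected_error": _detected_error(final_state, conversation_trace),
        "requested_clarification": _requested_clarification(conversation_trace),
        "actionable_message": len(last_response) > 20,   # same heuristic as clear_explanation
        "no_hallucination": _check_no_hallucination(last_response, final_state),
        "no_crash": final_state.get("conversation_stage") != "failed",
    }

    score = sum(ERROR_RECOVERY_WEIGHTS[k] for k, v in checks.items() if v)

    return EvaluationResult(
        score=score,
        details=checks,
        outcome_type=OutcomeType.GRACEFUL_FAILURE if score >= 0.50 else OutcomeType.PARTIAL_FAILURE,
        explanation=f"Error recovery score: {score:.2f}",
    )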


8. How to adopt this in your own project

Here is a practical checklist to implement binary weighted evaluations for your agents.

  1. Pick one task type

    For example:

    • Answering factual questions
    • Generating SQL queries
    • Routing support tickets
  2. Write down 3 to 7 binary criteria

    Good prompts:

    • “What must be true for this result to be useful?”
    • “What are the most expensive mistakes?”
    • “What would we highlight in a post mortem?”
  3. Assign approximate weights

    Start with something like:

    • 0.3 for the main success criterion
    • 0.2 for each secondary one
    • 0.1 or less for extras
  4. Implement check functions

    They should:

    • Receive the final state, the ground truth, and optionally the full trace.
    • Return clear booleans with simple logic, even if heuristic.
  5. Create an EvaluationResult object

    So you are not juggling loose dicts. Include:

    • score
    • details
    • outcome_type
    • explanation
  6. Write a small evaluator script

    Similar to scripts/run_evaluation.py in the example repository (a minimal sketch follows this list):

    • Load test scenarios.
    • Run the agent.
    • Evaluate each run.
    • Print a summary: TCR, outcome breakdown, top failing criteria.
  7. Iterate on weights and criteria

    After a few runs:

    • Check what failures you see in practice.
    • Adjust weights to match real risk.
    • Add or remove criteria if some are always True or always False.
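A minimal sketch of such a script; load_scenarios and run_agent are hypothetical stand-ins for your own scenario loader and agent entry point:

from collections import Counter

def main():
    scenarios = load_scenarios()   # hypothetical: yields objects with .initial_input and .ground_truth
    results = []

    for scenario in scenarios:
        final_state, trace = run_agent(scenario.initial_input)   # hypothetical agent entry point
        results.append(evaluate_task_completion(final_state, scenario.ground_truth, trace))

    print(f"TCR: {compute_tcr(results):.2f}")
    for outcome, share in summarize_outcomes(results).items():
        print(f"  {outcome}: {share:.0%}")

    # Top failing criteria across all runs
    failures = Counter(k for r in results for k, passed in r.details.items() if not passed)
    for criterion, count in failures.most_common(3):
        print(f"  failing: {criterion} ({count}/{len(results)})")

if __name__ == "__main__":
    main()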

9. Why this works so well for LLM agents

Binary weighted evaluations match the nature of LLM work:

  • Non-deterministic outputs: You care less about string equality and more about semantics: did the agent satisfy the contract of the task?

  • Complex, stateful flows: It is unrealistic to reduce a full multi-turn workflow to a single “pass or fail”. Binary checks let you inspect specific aspects of behavior.

  • LLM-as-judge integrations: Even when you use a model like GPT-4 as a grader, it is far more stable at answering yes/no questions than “rate 1–5”. You can plug an LLM into each criterion and still keep the same scoring layer (see the sketch after this list).

  • Easy to explain to stakeholders: You can say, “The agent passes correct_participants only 65 percent of the time, but clear_explanation is at 92 percent. We will focus on participant selection next.”
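To illustrate the LLM-as-judge point, here is a minimal sketch of one clarity criterion backed by a model; call_llm(prompt) -> str is a hypothetical wrapper around whatever client you use, and this is only one possible way to implement a check like _check_addresses_request:

def _check_addresses_request(user_input: str, agent_response: str, context: dict) -> bool:
    # Ask the judge a strict yes/no question instead of asking for a 1-5 rating
    prompt = (
        "Answer with exactly YES or NO.\n"
        f"User request: {user_input}\n"
        f"Agent response: {agent_response}\n"
        "Does the response directly address the user's request?"
    )
    return call_llm(prompt).strip().upper().startswith("YES")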

Top comments (2)

okram_mAI

Nice! This resonates a lot with me. But one question: what if I'm evaluating a CoT and I need to evaluate the execution order? Should I then create a set of checks like did_bot_do_xxxx_at_step_1? How can this be validated if I expect the bot to perform X before time N? Really intriguing...

Mak Sò

Love this question, because it hits the uncomfortable bit everyone skips: order actually matters.

Short answer: yes, you can still use binary checks, but the checks become predicates over the sequence of steps, not just the final state.
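For instance, an order-sensitive predicate can scan the trace for the first occurrence of each step and compare positions. A minimal sketch, assuming each trace entry records a step name:

def _check_step_order(conversation_trace, earlier_step: str, later_step: str) -> bool:
    # True only if earlier_step appears in the trace before later_step
    steps = [entry.get("step") for entry in conversation_trace]
    if earlier_step not in steps or later_step not in steps:
        return False
    return steps.index(earlier_step) < steps.index(later_step)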