One of the earliest problems I ran into when designing DebriefSales's scoring model was variance. Ask a large language model to score a sales call transcript and you will get a good answer. Ask it again with the same transcript and you will get a slightly different answer. Ask it a third time and the scores might have drifted enough to matter.
This is a fundamental property of how language models work. They are probabilistic, not deterministic. That is fine for generating coaching feedback — some natural variation in language is acceptable and even desirable. But it is not acceptable for scoring. If a rep scores 7.2 on a discovery call today and re-uploads the same call tomorrow and gets 6.8, the tool has lost credibility. Scores need to mean something stable.
The naive approach and why it fails
The obvious solution is to just prompt the model to be consistent. "Score this call on a scale of 1 to 10. Be precise and consistent." This helps, but it does not solve the problem. The model is still making subjective judgements about how to weight different elements of the call, and those judgements shift slightly with each inference.
The other naive approach is to set the model's temperature — the parameter that controls how random its outputs are — to zero. This produces more deterministic outputs, but it also makes the coaching language mechanical and repetitive. Reps stop reading it because it sounds the same every time.
Neither approach gives you what you actually want: consistent, defensible scores and varied, engaging coaching feedback.
The two-stage solution
The architecture we use in DebriefSales separates these two requirements into two distinct AI passes:
Stage 1: Structured evidence extraction
- The model reads the transcript and extracts specific, factual observations
- It identifies whether specific behaviours occurred — did the rep ask an open question in the first two minutes? Did they confirm a next step before ending the call?
- Outputs are structured JSON: binary flags and evidence quotes, not scores
- Temperature is low. The task is factual extraction, not generation.
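A minimal sketch of what Stage 1's output contract might look like. The field names, flags, and `parse_evidence` helper are illustrative, not DebriefSales's actual schema; the point is that the model's JSON is validated into a typed record that contains only facts, never scores:

```python
import json
from dataclasses import dataclass

# Illustrative evidence schema: binary flags plus supporting quotes.
# No scores appear anywhere in Stage 1's output.
@dataclass(frozen=True)
class CallEvidence:
    open_question_first_two_min: bool
    next_step_confirmed: bool
    evidence_quotes: dict[str, str]  # flag name -> verbatim quote

def parse_evidence(raw_json: str) -> CallEvidence:
    """Validate the model's Stage 1 output into a typed record.
    Rejecting malformed output here keeps Stage 2 purely deterministic."""
    data = json.loads(raw_json)
    flags = {k: data[k] for k in ("open_question_first_two_min",
                                  "next_step_confirmed")}
    if not all(isinstance(v, bool) for v in flags.values()):
        raise ValueError("evidence flags must be booleans, not scores")
    return CallEvidence(**flags,
                        evidence_quotes=dict(data.get("evidence_quotes", {})))

# What a low-temperature extraction pass might return:
raw = ('{"open_question_first_two_min": true, '
       '"next_step_confirmed": false, '
       '"evidence_quotes": {"open_question_first_two_min": '
       '"What prompted you to look at this now?"}}')
evidence = parse_evidence(raw)
```

Because the output is booleans plus quotes rather than free-form judgement, two extraction runs that disagree are easy to detect and debug.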
Stage 2: Scoring and coaching generation
- The model receives the structured evidence from Stage 1 — not the raw transcript
- Scores are calculated from the evidence flags using a defined rubric, not inferred from the transcript directly
- Coaching language is generated from the scored evidence with normal temperature, producing natural variation
- The score is deterministic. The coaching feedback is generated.
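The deterministic half of Stage 2 can be as simple as a weighted rubric over the evidence flags. The weights and flag names below are hypothetical, but the shape is the point: no model call, so identical evidence always produces an identical score.

```python
# Hypothetical rubric: each observed behaviour carries a fixed weight.
RUBRIC = {
    "open_question_first_two_min": 4.0,
    "next_step_confirmed": 3.0,
    "pain_point_quantified": 3.0,
}

def score_call(flags: dict[str, bool]) -> float:
    """Deterministic scoring: sum the weights of observed behaviours
    and scale onto a 1-10 range. Same flags in, same score out."""
    earned = sum(w for name, w in RUBRIC.items() if flags.get(name))
    total = sum(RUBRIC.values())
    return round(1 + 9 * earned / total, 1)

flags = {"open_question_first_two_min": True,
         "next_step_confirmed": True,
         "pain_point_quantified": False}
score_call(flags)  # → 7.3, on every run
```

The rubric also makes scores defensible: a rep who disputes a 7.3 can be shown exactly which behaviours were and were not observed, with quotes.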
The key insight is that by separating what happened on the call (factual extraction) from what it means (scoring and coaching), you can apply deterministic logic to the scoring step while leaving the language generation free to be expressive.
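The separation can be sketched end to end. The two `llm_*` functions below are stubs standing in for the real model calls (low temperature for extraction, normal temperature for coaching); everything else, including the function names, is illustrative:

```python
import random

def llm_extract(transcript: str) -> dict:
    # Stage 1 stub: in the real system, a low-temperature LLM call
    # returning factual flags as JSON. No scores.
    return {"next_step_confirmed": "next step" in transcript.lower()}

def apply_rubric(evidence: dict) -> float:
    # Deterministic: same evidence always maps to the same score.
    return 8.0 if evidence["next_step_confirmed"] else 4.0

def llm_coach(evidence: dict, score: float) -> str:
    # Stage 2 stub: wording varies run to run; the facts do not.
    if evidence["next_step_confirmed"]:
        return random.choice(["Good close.", "Solid wrap-up.",
                              "Well handled at the end."])
    return random.choice(["No next step was locked in.",
                          "The call ended without a commitment."])

def analyse_call(transcript: str) -> tuple[float, str]:
    evidence = llm_extract(transcript)        # probabilistic, but structured
    score = apply_rubric(evidence)            # deterministic, no model call
    return score, llm_coach(evidence, score)  # probabilistic language
```

Run `analyse_call` twice on the same transcript and the score is identical while the coaching wording may differ, which is exactly the property the architecture is after.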
The practical impact
Since moving to this architecture, score variance across duplicate submissions of the same call has dropped to within 0.1 points in the vast majority of cases. The coaching feedback remains varied and natural because it is generated, not retrieved.
This is not a novel idea in AI system design — it is a standard pattern for any application where you need reliable outputs from a probabilistic model. But applying it to call scoring specifically required considerable rethinking of the prompt architecture, and the result is a system that earns rep trust in a way that single-pass scoring does not.
If you are building a product that uses AI for evaluation rather than generation, this separation of concerns is probably the most important architectural decision you will make.