One of the earliest problems I ran into when designing DebriefSales's scoring model was variance. Ask a large language model to score a sales call transcript and you will get a good answer. Ask it again with the same transcript and you will get a slightly different answer. Ask it a third time and the scores might have drifted enough to matter.
This is a fundamental property of how language models work. They are probabilistic, not deterministic. That is fine for generating coaching feedback — some natural variation in language is acceptable and even desirable. But it is not acceptable for scoring. If a rep scores 7.2 on a discovery call today and re-uploads the same call tomorrow and gets 6.8, the tool has lost credibility. Scores need to mean something stable.
The naive approach and why it fails
The obvious solution is to just prompt the model to be consistent. "Score this call on a scale of 1 to 10. Be precise and consistent." This helps, but it does not solve the problem. The model is still making subjective judgements about how to weight different elements of the call, and those judgements shift slightly with each inference.
The other naive approach is to set the model's temperature — the parameter that controls how random its outputs are — to zero. This produces more deterministic outputs, but it also makes the coaching language mechanical and repetitive. Reps stop reading it because it sounds the same every time.
Neither approach gives you what you actually want: consistent, defensible scores and varied, engaging coaching feedback.
The two-stage solution
The architecture we use in DebriefSales separates these two requirements into two distinct AI passes:
Stage 1: Structured evidence extraction
- The model reads the transcript and extracts specific, factual observations
- It identifies whether specific behaviours occurred — did the rep ask an open question in the first two minutes? Did they confirm a next step before ending the call?
- Outputs are structured JSON: binary flags and evidence quotes, not scores
- Temperature is low. The task is factual extraction, not generation.
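A minimal sketch of what Stage 1's output contract might look like. The field names, flags, and `parse_evidence` helper are illustrative, not DebriefSales's actual schema; the point is that the model's JSON is validated into a typed record that contains only facts, never scores:

```python
import json
from dataclasses import dataclass

# Illustrative evidence schema: binary flags plus supporting quotes.
# No scores appear anywhere in Stage 1's output.
@dataclass(frozen=True)
class CallEvidence:
    open_question_first_two_min: bool
    next_step_confirmed: bool
    evidence_quotes: dict[str, str]  # flag name -> verbatim quote

def parse_evidence(raw_json: str) -> CallEvidence:
    """Validate the model's Stage 1 output into a typed record.
    Rejecting malformed output here keeps Stage 2 purely deterministic."""
    data = json.loads(raw_json)
    flags = {k: data[k] for k in ("open_question_first_two_min",
                                  "next_step_confirmed")}
    if not all(isinstance(v, bool) for v in flags.values()):
        raise ValueError("evidence flags must be booleans, not scores")
    return CallEvidence(**flags,
                        evidence_quotes=dict(data.get("evidence_quotes", {})))

# What a low-temperature extraction pass might return:
raw = ('{"open_question_first_two_min": true, '
       '"next_step_confirmed": false, '
       '"evidence_quotes": {"open_question_first_two_min": '
       '"What prompted you to look at this now?"}}')
evidence = parse_evidence(raw)
```

Because the output is booleans plus quotes rather than free-form judgement, two extraction runs that disagree are easy to detect and debug.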
Stage 2: Scoring and coaching generation
- The model receives the structured evidence from Stage 1 — not the raw transcript
- Scores are calculated from the evidence flags using a defined rubric, not inferred from the transcript directly
- Coaching language is generated from the scored evidence with normal temperature, producing natural variation
- The score is deterministic. The coaching feedback is generated.
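The deterministic half of Stage 2 can be as simple as a weighted rubric over the evidence flags. The weights and flag names below are hypothetical, but the shape is the point: no model call, so identical evidence always produces an identical score.

```python
# Hypothetical rubric: each observed behaviour carries a fixed weight.
RUBRIC = {
    "open_question_first_two_min": 4.0,
    "next_step_confirmed": 3.0,
    "pain_point_quantified": 3.0,
}

def score_call(flags: dict[str, bool]) -> float:
    """Deterministic scoring: sum the weights of observed behaviours
    and scale onto a 1-10 range. Same flags in, same score out."""
    earned = sum(w for name, w in RUBRIC.items() if flags.get(name))
    total = sum(RUBRIC.values())
    return round(1 + 9 * earned / total, 1)

flags = {"open_question_first_two_min": True,
         "next_step_confirmed": True,
         "pain_point_quantified": False}
score_call(flags)  # → 7.3, on every run
```

The rubric also makes scores defensible: a rep who disputes a 7.3 can be shown exactly which behaviours were and were not observed, with quotes.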
The key insight is that by separating what happened on the call (factual extraction) from what it means (scoring and coaching), you can apply deterministic logic to the scoring step while leaving the language generation free to be expressive.
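The separation can be sketched end to end. The two `llm_*` functions below are stubs standing in for the real model calls (low temperature for extraction, normal temperature for coaching); everything else, including the function names, is illustrative:

```python
import random

def llm_extract(transcript: str) -> dict:
    # Stage 1 stub: in the real system, a low-temperature LLM call
    # returning factual flags as JSON. No scores.
    return {"next_step_confirmed": "next step" in transcript.lower()}

def apply_rubric(evidence: dict) -> float:
    # Deterministic: same evidence always maps to the same score.
    return 8.0 if evidence["next_step_confirmed"] else 4.0

def llm_coach(evidence: dict, score: float) -> str:
    # Stage 2 stub: wording varies run to run; the facts do not.
    if evidence["next_step_confirmed"]:
        return random.choice(["Good close.", "Solid wrap-up.",
                              "Well handled at the end."])
    return random.choice(["No next step was locked in.",
                          "The call ended without a commitment."])

def analyse_call(transcript: str) -> tuple[float, str]:
    evidence = llm_extract(transcript)        # probabilistic, but structured
    score = apply_rubric(evidence)            # deterministic, no model call
    return score, llm_coach(evidence, score)  # probabilistic language
```

Run `analyse_call` twice on the same transcript and the score is identical while the coaching wording may differ, which is exactly the property the architecture is after.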
The practical impact
Since moving to this architecture, score variance across duplicate submissions of the same call has dropped to within 0.1 points in the vast majority of cases. The coaching feedback remains varied and natural because it is generated, not retrieved.
This is not a novel idea in AI system design — it is a standard pattern for any application where you need reliable outputs from a probabilistic model. But applying it to call scoring specifically required considerable rethinking of the prompt architecture, and the result is a system that earns rep trust in a way that single-pass scoring does not.
If you are building a product that uses AI for evaluation rather than generation, this separation of concerns is probably the most important architectural decision you will make.