Scoring Rubric — ThreatReady

1. How scoring works

A ThreatReady assessment consists of a single scenario — a real architecture with an active attack chain — followed by five adaptive questions. Each answer is evaluated by our 3-Pass AI engine (Evaluator → Challenger → Reconciler) against five anchored dimensions, each on a 1–10 integer scale. Every score band has a concrete written definition — a 7 in Threat Identification means something specific and verifiable, not arbitrary AI judgment.

This page is the canonical reference. We publish it because we want our scores to be auditable. If you ever question a result, every score traces back to one of these bands.

The 3-Pass scoring engine

Every answer passes through three independent AI evaluations:

Pass 1 — Evaluator · Scores generously against the anchored rubric, citing specific evidence from the answer. Tends to over-credit verbose responses (intentional — Pass 2 corrects).

Pass 2 — Challenger · Adversarially reviews. Catches technical errors, hallucinated CVEs, non-existent MITRE techniques, fabricated tools, inflated scores, and missing critical elements.

Pass 3 — Reconciler · Reviews both passes, resolves disagreements, applies question-specific dimension weighting, generates the model answer, and assigns a confidence level (HIGH / MEDIUM / LOW) based on inter-pass agreement.

A score below 5 means the candidate has gaps that would concern a hiring manager. A score of 5–7 means they are competent but not yet senior. A score of 7–8.5 means strong senior-ready performance. A score above 8.5 indicates expert-level reasoning — rare and worth noting. These are guidelines, not promises; final hiring decisions are always yours.

2. The five anchored dimensions

Dimension 1 · Threat Identification

Measures whether the candidate correctly identifies attack vectors, techniques, and threat actor behavior.

Score 1–2 · Cannot identify the primary attack vector. Names incorrect or irrelevant techniques.

Score 3–4 · Identifies general attack category but cannot specify technique, tooling, or path. No MITRE mapping.

Score 5–6 · Correctly identifies primary vector with reasonable MITRE mapping. May miss secondary vectors.

Score 7–8 · Identifies primary and secondary vectors. Accurate MITRE mapping. Explains attacker decision logic.

Score 9–10 · Identifies all vectors including non-obvious. Precise MITRE technique IDs. References real-world parallels. Rules out false positives.

Dimension 2 · Containment & Response Logic

Measures whether the candidate can design an effective containment strategy that minimizes blast radius while preserving evidence.

Score 1–2 · No containment strategy or actions that worsen the situation.

Score 3–4 · Strategy exists but too aggressive (nuclear option) or too passive. No evidence preservation.

Score 5–6 · Reasonable containment. Considers evidence OR business continuity but not both.

Score 7–8 · Proportional containment. Balances evidence and business. Correct order of operations.

Score 9–10 · Operational depth. Second-order effects. Communication protocols. Compliance timelines (CERT-In, GDPR, DPDPA).

Dimension 3 · Architecture & Blast Radius Analysis

Measures understanding of architectural interconnections and accurate impact assessment.

Score 1–2 · No understanding of how architecture components connect.

Score 3–4 · Identifies compromised component but misses dependencies. Treats components as isolated.

Score 5–6 · Identifies 1–2 downstream dependencies. Basic blast radius.

Score 7–8 · Maps full blast radius. Traces data flows, credential reuse, trust relationships. Indirect paths.

Score 9–10 · Complete analysis including third-party impact. Identifies architectural weaknesses. Proposes segmentation improvements.

Dimension 4 · Communication Quality

Measures clarity of reasoning, structural organization, and audience-appropriateness.

Score 1–2 · Incoherent or incomprehensible response.

Score 3–4 · Ideas present but poorly organized. Excessive jargon without explanation.

Score 5–6 · Reasonably clear. Technical peer would understand; manager might struggle.

Score 7–8 · Well-structured. Clear logical flow. Both peer and manager would understand key points.

Score 9–10 · Exceptionally clear. Adapts language to audience. Effective in board briefing or technical deep-dive.

Dimension 5 · Framework & Best Practice Application

Measures correct reference to and application of security frameworks (MITRE ATT&CK, NIST, CIS, ISO 27001).

Score 1–2 · No reference to any framework, standard, or established methodology.

Score 3–4 · Name-drops frameworks without applying them. References superficial or incorrect.

Score 5–6 · Correctly references 1–2 frameworks. Applies them at surface level.

Score 7–8 · Applies multiple frameworks correctly and specifically. Accurate technique IDs and control references.

Score 9–10 · Seamlessly integrates multiple frameworks. Knows when frameworks apply and when they don't. Working vocabulary.

3. Qualitative bands

Every score also maps to a qualitative band that's visible from day one — even before the platform has enough cohort data for percentile rankings. Bands describe what a score means in plain language, so candidates and hiring managers don't need to interpret a 7.2 in isolation.

Developing · 1.0 – 3.9 · Early-stage security reasoning. Foundational gaps to close.

Foundation · 4.0 – 5.9 · Developing security reasoning. Building toward independent judgment.

Proficient · 6.0 – 7.4 · Mid-level security reasoning. Capable of handling routine incidents.

Advanced · 7.5 – 8.9 · Senior-level security reasoning. Interview-grade decision quality.

Expert · 9.0 – 10.0 · Principal-level security reasoning. Leads incident response or architecture.
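
The band lookup can be sketched in a few lines of Python. This is an illustrative sketch, not the platform's implementation: the names `BANDS` and `band_for` are hypothetical, and rounding to one decimal place before banding is an assumption (the table lists bands to one decimal but does not say how a value such as 7.46 is handled).

```python
# Band lower bounds copied from the table above, highest first.
BANDS = [
    (9.0, "Expert"),
    (7.5, "Advanced"),
    (6.0, "Proficient"),
    (4.0, "Foundation"),
    (1.0, "Developing"),
]


def band_for(score: float) -> str:
    """Return the qualitative band for a score on the 1.0-10.0 scale."""
    # Assumption: scores are rounded to one decimal before banding.
    rounded = round(score, 1)
    if not 1.0 <= rounded <= 10.0:
        raise ValueError(f"score out of range: {score}")
    for lower_bound, name in BANDS:
        if rounded >= lower_bound:
            return name
    raise AssertionError("unreachable: the range check guarantees a match")


print(band_for(7.2))  # Proficient
```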

4. Confidence levels & dual score display

Not every score is equally certain. When the Evaluator (Pass 1) and Challenger (Pass 2) disagree significantly, the final score carries that uncertainty forward as a confidence flag. Pretending every AI score is equally reliable destroys trust the moment a buyer catches a bad one, so we publish the confidence level with every score.

High · All five dimensions agreed within 1 point between Pass 1 and Pass 2. No critical errors flagged.

Medium · 1–2 dimensions disagreed by 2 points; remaining dimensions within 1 point. No critical errors.

Low · Any dimension disagreed by 3+ points, or Pass 2 flagged critical technical errors that Pass 1 missed. The hiring manager dashboard shows a "Review Recommended" tag and the score is added to the internal QA queue.
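
The three confidence rules can be read as a small decision function. The sketch below is one possible reading, with hypothetical names (`DIMENSIONS`, `confidence_level`). The rules above do not say what happens when three or more dimensions each disagree by 2 points; treating that case as LOW is an assumption made here to stay conservative.

```python
from typing import Dict

# Dimension abbreviations as used in this rubric.
DIMENSIONS = ("TI", "CR", "AB", "CQ", "FA")


def confidence_level(pass1: Dict[str, int], pass2: Dict[str, int],
                     critical_error: bool = False) -> str:
    """Derive HIGH / MEDIUM / LOW from Pass 1 vs Pass 2 agreement."""
    gaps = [abs(pass1[d] - pass2[d]) for d in DIMENSIONS]

    # LOW: any gap of 3+ points, or a critical error flagged by Pass 2.
    if critical_error or max(gaps) >= 3:
        return "LOW"

    two_point_gaps = sum(1 for gap in gaps if gap == 2)
    if two_point_gaps == 0:
        return "HIGH"    # every dimension within 1 point
    if two_point_gaps <= 2:
        return "MEDIUM"  # one or two dimensions off by 2 points
    # Assumption: three or more 2-point gaps is not covered by the
    # published rules, so it is treated as LOW here.
    return "LOW"


evaluator = {"TI": 8, "CR": 7, "AB": 7, "CQ": 8, "FA": 6}
challenger = {"TI": 6, "CR": 7, "AB": 6, "CQ": 8, "FA": 6}
print(confidence_level(evaluator, challenger))  # MEDIUM
```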

Dual score display. Every result also shows two scores side by side: Scenario Score (raw, on the question's own scale) and Readiness Score (difficulty-adjusted with caps — Beginner 6.0, Intermediate 8.0, Advanced 9.0, Expert 10.0). Both numbers are honest. The candidate's effort is recognized; the hiring manager's signal is preserved.
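
As a rough illustration of the caps, the sketch below assumes the Scenario Score is already expressed on the 1–10 scale and that the difficulty adjustment is nothing more than the cap itself. Both are assumptions: the actual adjustment may involve more than this, and `READINESS_CAPS` and `readiness_score` are hypothetical names.

```python
# Caps per difficulty level, as listed above.
READINESS_CAPS = {
    "Beginner": 6.0,
    "Intermediate": 8.0,
    "Advanced": 9.0,
    "Expert": 10.0,
}


def readiness_score(scenario_score: float, difficulty: str) -> float:
    """Cap a scenario score at the ceiling for its difficulty level."""
    # Assumption: the adjustment is a plain cap with no further scaling.
    return min(scenario_score, READINESS_CAPS[difficulty])


# A strong answer to a Beginner scenario keeps its raw 8.4 as the
# Scenario Score, while the Readiness Score is held at the 6.0 cap.
print(readiness_score(8.4, "Beginner"))  # 6.0
print(readiness_score(8.4, "Advanced"))  # 8.4
```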

5. Question-specific dimension weighting

Not every question tests all five dimensions equally. A "map this attack to MITRE" question should weight Threat Identification at 40%. A "write an executive brief" question should weight Communication at 50%. Without per-question weighting, scoring would feel wrong even when the engine works correctly.

Question category	TI	CR	AB	CQ	FA
Threat Identification	40%	10%	15%	10%	25%
Containment & Response	10%	40%	20%	15%	15%
Architecture Analysis	15%	10%	40%	10%	25%
Executive Communication	10%	10%	5%	50%	25%
Incident Response	15%	35%	15%	20%	15%
Default (balanced)	20%	20%	20%	20%	20%

TI = Threat Identification · CR = Containment & Response · AB = Architecture & Blast Radius · CQ = Communication Quality · FA = Framework Application
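
To make the effect of the weights concrete, here is a minimal sketch that treats the final score as a weighted mean of the five dimension scores. The rubric does not publish the exact aggregation formula, and the Reconciler may adjust the final number, so the weighted mean is an assumption; `WEIGHTS` and `weighted_score` are hypothetical names, and the dimension scores in the example are made up.

```python
from typing import Dict

# Weights in percent, copied from the table above.
WEIGHTS: Dict[str, Dict[str, int]] = {
    "Threat Identification":   {"TI": 40, "CR": 10, "AB": 15, "CQ": 10, "FA": 25},
    "Containment & Response":  {"TI": 10, "CR": 40, "AB": 20, "CQ": 15, "FA": 15},
    "Architecture Analysis":   {"TI": 15, "CR": 10, "AB": 40, "CQ": 10, "FA": 25},
    "Executive Communication": {"TI": 10, "CR": 10, "AB": 5,  "CQ": 50, "FA": 25},
    "Incident Response":       {"TI": 15, "CR": 35, "AB": 15, "CQ": 20, "FA": 15},
    "Default (balanced)":      {"TI": 20, "CR": 20, "AB": 20, "CQ": 20, "FA": 20},
}


def weighted_score(dimension_scores: Dict[str, int],
                   category: str = "Default (balanced)") -> float:
    """Weighted mean of the 1-10 dimension scores, to one decimal."""
    weights = WEIGHTS[category]
    total = sum(dimension_scores[dim] * pct for dim, pct in weights.items())
    return round(total / 100, 1)


# Hypothetical dimension scores for a single answer.
example = {"TI": 8, "CR": 6, "AB": 7, "CQ": 9, "FA": 5}
print(weighted_score(example, "Containment & Response"))   # 6.7
print(weighted_score(example, "Executive Communication"))  # 7.5
```

The same answer lands almost a full point apart depending on the question category: strong communication is rewarded in an executive brief, while average containment logic dominates a containment question.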

6. A worked example

Scenario: A Lambda function has unusual outbound traffic. The IAM role attached has s3:* and secrets:*. The S3 bucket contains customer PII.

Question: What is your first containment step, and why does sequence matter here?

Candidate A · 3.8 / 10

"I would look at CloudTrail to see what happened and then maybe change the IAM role."

Why: Threat Identification 4 — recognises something is wrong but no specific attack path. Containment & Response 3 — investigation-first during active exfiltration. Communication Quality 5 — readable but unstructured. Architecture & Blast Radius 3 — no awareness of dependencies. Framework Application 2 — no MITRE or NIST reference.

Candidate B · 6.5 / 10

"First, I'd scope down the IAM role by removing secrets:* and narrowing s3:* to only the specific bucket. Then I'd check CloudTrail for what was accessed and rotate any credentials the Lambda could have read."

Why: Threat Identification 7 — names the IAM over-permission path. Containment & Response 6 — specific, correct mitigations but doesn't justify sequence. Architecture & Blast Radius 6 — names what the role can reach. Communication Quality 7 — clear structure. Framework Application 5 — implicit least-privilege but no explicit standard cited.

Candidate C · 8.9 / 10

"Disable the Lambda function first — remove its trigger or set concurrency to zero. That stops active exfiltration without destroying forensic state, which matters because wiping the role before preserving evidence loses the CloudTrail correlation we'll need later. After that: scope the role (remove secrets:*, narrow s3:* to the specific bucket), rotate credentials the role could have accessed, snapshot the Lambda environment for forensics, and review CloudTrail for scope of what was already taken. Sequence matters because investigation before containment lets the attacker keep reading PII the whole time we're triaging."

Why: Threat Identification 9 — names primary and secondary vectors. Containment & Response 9 — proportional, evidence-preserving, sequenced. Architecture & Blast Radius 9 — full credential reuse map. Communication Quality 9 — explicit sequencing and rationale. Framework Application 8 — implicit MITRE alignment, NIST 800-61 sequencing visible.

7. Adaptive questioning logic

After each answer, the AI generates a follow-up question based on what the candidate actually said. This serves two purposes:

Prevents pre-prepared answers. Generic responses trigger probing follow-ups that can't be anticipated.
Explores depth. Strong answers earn harder follow-ups; shallow answers trigger clarifying ones.

Adaptive questions are bounded by the scenario's MITRE ATT&CK tactics and the role's competency map — the AI cannot wander off-topic into unrelated domains.

8. What we deliberately do not score

Grammar or spelling, beyond the threshold where clarity suffers
Typing speed or answer length for its own sake
Regional English or accent when voice mode is used
Specific tool preferences (AWS vs Azure terminology, Splunk vs Elastic) — we score on the reasoning, not the brand
"Correct" answers that depend on organizational context we haven't given — we score the reasoning, not the guess
Agreeing with our model answer — a well-reasoned alternative can score as well as the model answer if defensible

9. Consistency & calibration

We take rubric consistency seriously. Concrete measures:

All scoring prompts include the full anchored rubric (not a summary)
Each question includes a role-specific and difficulty-specific rubric anchor
Nightly regression at 02:00 IST: 50+ golden answers across 7 categories — Clearly Excellent, Clearly Poor, Fluent But Wrong, Correct But Poorly Communicated, Overconfident Hallucination, Partial But Well-Prioritized, Technically Strong But Business-Blind — run through the live engine. If pass-rate drops below 80%, we freeze prompt changes until investigated.
Pass 1, 2, 3 raw outputs are stored for 90 days for auditability. Hiring managers can request explanation of any score; the system regenerates the full Pass 1 / Pass 2 / Pass 3 trace.
Every month, the founder and 2 core engineers manually review 20–30 flagged evaluations to track AI-vs-human agreement.
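
The nightly regression gate described above can be sketched as a pass-rate check. Only the 80% threshold comes from this page. What counts as a "pass" for a golden answer is not specified here, so the sketch assumes a golden answer passes when the live score lands within a tolerance of its expected score; `GoldenResult`, `SCORE_TOLERANCE`, and `should_freeze_prompts` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List

PASS_RATE_THRESHOLD = 0.80  # from the rubric: freeze below 80%
SCORE_TOLERANCE = 1.0       # assumption: allowed drift from the expected score


@dataclass
class GoldenResult:
    category: str          # e.g. "Fluent But Wrong"
    expected_score: float  # score the golden answer should receive
    actual_score: float    # score produced by the live engine tonight

    @property
    def passed(self) -> bool:
        return abs(self.actual_score - self.expected_score) <= SCORE_TOLERANCE


def should_freeze_prompts(results: List[GoldenResult]) -> bool:
    """Return True when the pass rate falls below the threshold."""
    if not results:
        raise ValueError("no golden answers were scored")
    pass_rate = sum(r.passed for r in results) / len(results)
    return pass_rate < PASS_RATE_THRESHOLD


run = [
    GoldenResult("Clearly Excellent", 9.0, 8.8),
    GoldenResult("Clearly Poor", 2.0, 2.5),
    GoldenResult("Fluent But Wrong", 3.0, 6.5),  # engine over-credited fluency
    GoldenResult("Overconfident Hallucination", 2.5, 5.0),
    GoldenResult("Partial But Well-Prioritized", 6.5, 6.0),
]
print(should_freeze_prompts(run))  # True: only 3 of 5 passed
```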

10. Score challenges

If you believe your score is wrong, you can challenge it. Here's how:

Within 14 days of receiving the score, email [email protected] with your session ID and a specific explanation of what you believe was mis-scored and why
A security engineer reviews the full transcript and rubric application within 7 business days
We respond with either a confirmed score (with a rubric-grounded explanation) or a revised score
All challenges and outcomes are logged. Patterns are fed back into rubric refinement

11. Version history

Version	Date	Summary of changes
1.0	22 April 2026	Initial publication.
2.0	May 2026	Updated to v4 spec — five anchored dimensions, 3-Pass scoring engine, qualitative bands, confidence levels, dual score display, question-specific weighting, golden-answer regression. Replaces the v1 three-dimension model.
