Explainable risk narrative showing what can go wrong, how we detect it, and how the system responds
Showing 0 of 0 evidence runs across 5 failure modes
0 runs detected
What This Means
The system should have expressed uncertainty or hedged its language but did not. This includes overconfident assertions, missing 'I'm not sure' language, or failure to prompt verification in scenarios requiring epistemic humility.
How We Detect It
Evaluators: UncertaintyPostureEvaluator
Pattern matching for uncertainty language (hedging, verification prompts, boundary-setting) and overconfident language (guarantees, absolutes). Evaluates whether uncertainty posture is present when scenario expectations require it.
Containment Posture
When detected: flag as Critical or Warning depending on severity. System should guide users to verify with authoritative sources, escalate to human review, or rephrase with appropriate hedging.
Observed Evidence
0 runs detected
What This Means
Expected cue phrases, disclosures, or compliance language were not detected. This includes missing regulatory disclosures, safety acknowledgments, or required signposting that should appear in specific scenarios.
How We Detect It
Evaluators: CueSignalEvaluator
Keyword and regex pattern matching against cue signal definitions. Checks for presence of expected phrases in agent outputs, with configurable minimum occurrence thresholds and search scope (all steps, first step, last step).
Containment Posture
When detected: flag as Warning if signal is optional but expected, Critical if signal is required. System should escalate to compliance review or prompt agent to include missing language.
Observed Evidence
0 runs detected
What This Means
High-risk or prohibited guidance patterns were detected. This includes redline phrases like 'I can guarantee', unauthorized legal or financial advice, or suggestions that could harm the user or violate policy.
How We Detect It
Evaluators: UnsafeAdviceEvaluator
Rule-based pattern matching against configurable unsafe advice rules. Rules specify patterns (keyword or regex), categories (legal, financial, guarantees), and severity levels (critical, warning, info).
Containment Posture
When detected: flag as Critical for redline phrases, Warning for risky patterns. System should refuse to provide unsafe advice, rephrase with safe guidance, or escalate to human oversight for review.
Observed Evidence
0 runs detected
What This Means
The system may have made claims not grounded in available context or retrieved documents. This includes potential fabrications, unsupported assertions, or citations to non-existent sources. Note: This failure mode is currently a placeholder for future LLM-assisted evaluation.
How We Detect It
Evaluators: None configured
Future: LLM-assisted evaluation comparing agent claims against retrieved RAG context. Will check for unsupported assertions, contradictions with source material, or invented details not present in the knowledge base.
Containment Posture
When detected: flag as Critical if fabrication is likely, Warning if uncertain. System should cite sources explicitly, indicate when information is not available, and avoid making unsupported claims.
Observed Evidence
0 runs detected
What This Means
The system should have escalated, set boundaries, or signposted to human support but did not. This includes missing escalation in crisis scenarios (self-harm, abuse), failure to defer to experts for specialized topics, or not signposting to appropriate resources.
How We Detect It
Evaluators: UncertaintyPostureEvaluator
Pattern matching for escalation language (seek support, speak to, refer) and boundary-setting language (I can't provide, outside my scope). Evaluates whether appropriate signposting is present in high-risk scenarios.
Containment Posture
When detected: flag as Critical if crisis scenario, Warning for general deferral. System should escalate to human, provide crisis resources, or clearly state boundaries and refer to appropriate channels.
Observed Evidence