INFORMATIVE · DRAFT · Documentation Governance

Results & Status

1. Purpose

This document defines the formal semantics of conformance evaluation outcomes.

When Validation Lab or any conformance evaluator produces a result, it MUST use these exact terms and meanings.

2. Primary Outcomes

MPLP conformance evaluation produces exactly one of three outcomes:

| Outcome | Meaning |
| --- | --- |
| CONFORMANT | Evidence satisfies all requirements for the claimed class |
| NON-CONFORMANT | Evidence violates one or more requirements |
| INCOMPLETE-EVIDENCE | Cannot determine; evidence is missing or invalid |

2.1 CONFORMANT

Definition: The Evidence Pack contains all required artifacts, all artifacts pass schema validation, and all evaluation dimensions pass for the claimed conformance class.

Implications:

  • The system produced evidence matching the protocol specification
  • The lifecycle was correctly recorded
  • Governance gates were properly applied (if applicable)

Does NOT imply:

  • Correctness of agent decisions
  • Quality of generated plans
  • Security of the runtime
  • Legal compliance

2.2 NON-CONFORMANT

Definition: The Evidence Pack fails one or more required evaluation dimensions for the claimed conformance class.

Implications:

  • At least one violation was detected
  • The system did not follow the protocol in some aspect
  • Detailed failure reasons SHOULD be provided

Common Causes:

  • Schema validation failures
  • Missing required artifacts
  • Broken referential integrity
  • Ungated high-risk actions (for L2+)
  • Missing trace segments

2.3 INCOMPLETE-EVIDENCE

Definition: The evaluation cannot be completed because required evidence is missing, corrupted, or invalid.

Implications:

  • The evaluator can make no conformance determination
  • The system may or may not be conformant
  • Additional evidence is required

Common Causes:

  • Missing Context or Plan
  • Corrupted JSON files
  • Export failure
  • Partial evidence pack
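The three definitions above can be read as a small decision procedure. The following is a minimal sketch, not a normative algorithm: the `Outcome` enum, the `derive_outcome` function, and the `evidence_usable` flag are hypothetical names introduced for illustration. It assumes the evaluator has already decided whether the evidence is usable and has a per-dimension status for each check. Statuses other than PASS, FAIL, and SKIP are rejected, because this document does not define how they map to a primary outcome.

```python
from enum import Enum


class Outcome(str, Enum):
    """The three primary outcomes, using the exact terms from this document."""
    CONFORMANT = "CONFORMANT"
    NON_CONFORMANT = "NON-CONFORMANT"
    INCOMPLETE_EVIDENCE = "INCOMPLETE-EVIDENCE"


def derive_outcome(evidence_usable: bool, dimensions: dict[str, str]) -> Outcome:
    """Map evidence usability and per-dimension statuses to one primary outcome."""
    # Missing, corrupted, or invalid evidence: no determination is possible.
    if not evidence_usable:
        return Outcome.INCOMPLETE_EVIDENCE

    # Assumption of this sketch: only PASS, FAIL, and SKIP are handled.
    unexpected = {s for s in dimensions.values() if s not in {"PASS", "FAIL", "SKIP"}}
    if unexpected:
        raise ValueError(f"no outcome mapping assumed for statuses: {sorted(unexpected)}")

    # One failed required dimension is enough for NON-CONFORMANT.
    if "FAIL" in dimensions.values():
        return Outcome.NON_CONFORMANT

    # Every applicable dimension passed; SKIP entries do not count against the pack.
    return Outcome.CONFORMANT
```

For example, `derive_outcome(True, {"schema_validity": "PASS", "governance_gating": "FAIL"})` yields `Outcome.NON_CONFORMANT`, while `derive_outcome(False, {})` yields `Outcome.INCOMPLETE_EVIDENCE`.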

3. Secondary Status Values

For granular reporting, evaluations may include secondary status values:

| Status | Meaning | Used When |
| --- | --- | --- |
| PASS | Single check passed | Per-dimension reporting |
| FAIL | Single check failed | Per-dimension reporting |
| SKIP | Check not applicable | L1 evaluation skipping L3 checks |
| ERROR | Evaluation error | Evaluator bug or crash |
| TIMEOUT | Evaluation exceeded time limit | Large evidence packs |

4. Result Structure

Conformance results SHOULD be structured as:

> [!NOTE]
> Hypothetical Example
>
> The following JSON structure is a non-normative example for illustration only. It does not represent a real evaluation result.

{
  "evaluation_id": "eval-550e8400-e29b-41d4-a716-446655440000",
  "evaluated_at": "2025-12-28T00:00:00Z",
  "protocol_version": "1.0.0",
  "claimed_class": "L2",

  "outcome": "NON-CONFORMANT",

  "dimensions": {
    "schema_validity": "PASS",
    "lifecycle_completeness": "PASS",
    "governance_gating": "FAIL",
    "trace_integrity": "PASS",
    "failure_bounding": "SKIP",
    "version_declaration": "PASS"
  },

  "failures": [
    {
      "dimension": "governance_gating",
      "artifact": "plans/plan-456.json",
      "message": "Step step-3 requires confirm but no Confirm object found",
      "severity": "error"
    }
  ],

  "evidence_summary": {
    "contexts": 1,
    "plans": 1,
    "traces": 1,
    "confirms": 0
  }
}

5. Severity Levels

Failures may have different severities:

| Severity | Meaning | Impact |
| --- | --- | --- |
| error | Hard failure | Results in NON-CONFORMANT |
| warning | Soft issue | Does not affect outcome |
| info | Observation | Informational only |

Rule: Any failure with severity error results in a NON-CONFORMANT outcome.
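The severity rule can be expressed as a single predicate over the failure list. This is a sketch only: `forces_non_conformant` is a hypothetical helper name, and it assumes each failure is a dictionary with a `severity` key, as in the illustrative result structure in Section 4.

```python
def forces_non_conformant(failures: list[dict]) -> bool:
    """Return True if at least one recorded failure has severity "error".

    Failures with severity "warning" or "info" never affect the outcome.
    """
    return any(failure.get("severity") == "error" for failure in failures)
```

A list holding only `warning` and `info` entries returns `False`, so by itself it leaves the outcome unchanged.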

6. Outcome Stability

6.1 Determinism

Given the same Evidence Pack and protocol version, the outcome MUST be deterministic.

evaluate(pack_v1, protocol_1.0.0) = CONFORMANT
evaluate(pack_v1, protocol_1.0.0) = CONFORMANT // Always same result
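One way to exercise this requirement in a test suite is to evaluate the same pack several times and compare the outcomes. The sketch below assumes a hypothetical `evaluate(pack, protocol_version)` callable that returns the outcome as a string; neither that signature nor the `is_deterministic` helper is defined by this document. Repeated runs can only reveal non-determinism; they cannot prove its absence.

```python
from typing import Callable


def is_deterministic(
    evaluate: Callable[[dict, str], str],  # hypothetical evaluator signature
    pack: dict,
    protocol_version: str,
    runs: int = 5,
) -> bool:
    """Evaluate the same pack `runs` times and report whether all outcomes agree."""
    outcomes = {evaluate(pack, protocol_version) for _ in range(runs)}
    return len(outcomes) == 1
```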

6.2 Monotonicity

Adding valid evidence to a pack cannot change its outcome from CONFORMANT to NON-CONFORMANT.

pack_a = {context, plan, trace}       → CONFORMANT
pack_b = pack_a + {more_valid_traces} → CONFORMANT (still)
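The monotonicity property can be checked in the same style. This sketch assumes a pack is a dictionary with a `traces` list and that `evaluate(pack, protocol_version)` returns the outcome as a string; both are illustrative assumptions, as is the `preserves_conformance` name. It also assumes the caller supplies only valid extra traces, since the property says nothing about invalid additions.

```python
from typing import Callable


def preserves_conformance(
    evaluate: Callable[[dict, str], str],  # hypothetical evaluator signature
    pack: dict,
    extra_traces: list,
    protocol_version: str,
) -> bool:
    """Check that adding valid traces does not turn CONFORMANT into NON-CONFORMANT."""
    before = evaluate(pack, protocol_version)
    # Build the extended pack without mutating the original.
    extended = {**pack, "traces": list(pack.get("traces", [])) + list(extra_traces)}
    after = evaluate(extended, protocol_version)
    return not (before == "CONFORMANT" and after == "NON-CONFORMANT")
```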

7. Reporting Requirements

Evaluation reports MUST include:

| Field | Required | Description |
| --- | --- | --- |
| outcome | Yes | Primary outcome |
| protocol_version | Yes | Version evaluated against |
| evaluated_at | Yes | Timestamp of evaluation |
| claimed_class | Yes | L1, L2, or L3 |
| dimensions | Yes | Per-dimension status |
| failures | If NON-CONFORMANT | Failure details |
| evidence_summary | Recommended | Artifact counts |
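A report consumer could check these requirements mechanically. The following sketch assumes the report has been parsed into a dictionary and that the five fields listed without a condition are unconditionally required; `REQUIRED_FIELDS` and `missing_report_fields` are hypothetical names. The recommended `evidence_summary` field is deliberately not enforced.

```python
REQUIRED_FIELDS = (
    "outcome",
    "protocol_version",
    "evaluated_at",
    "claimed_class",
    "dimensions",
)


def missing_report_fields(report: dict) -> list[str]:
    """Return the names of required fields that are absent from a report."""
    missing = [name for name in REQUIRED_FIELDS if name not in report]
    # Failure details are required only when the outcome is NON-CONFORMANT.
    if report.get("outcome") == "NON-CONFORMANT" and not report.get("failures"):
        missing.append("failures")
    return missing
```

An empty list means the report carries every field this section requires for its outcome.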

Scope: Defines 3 primary outcomes, severity levels, result structure
Requirement: Evaluators MUST use these exact terms