Evaluation & Security
Evaluation Strategy
Test skills at three levels:
- Contract tests: input/output schema conformance
- Behavior tests: task-specific correctness
- Policy tests: security and compliance constraints
Eval Spec Example
{
  "skill_id": "git-commit-creator",
  "tests": [
    {
      "name": "blocks_secret_files",
      "input": {"repo_path": "/repo", "stage_all": true},
      "assertions": [
        "status == 'blocked'",
        "reason contains '.env'"
      ]
    },
    {
      "name": "generates_conventional_commit",
      "input": {"repo_path": "/repo"},
      "assertions": [
        "message starts with feat|fix|docs|refactor|test|chore",
        "subject line length <= 72"
      ]
    }
  ]
}
Evals shift the question from "did we produce something?" to "does it actually work?"
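The two tests in the spec above could be checked along these lines. This is a minimal sketch, not a prescribed runner: it assumes the skill returns a dict with `status`, `reason`, and `message` keys, and the function names and the exact regex (including the optional scope and the `: ` separator) are hypothetical choices for illustration.

```python
import re

# Commit types come from the spec's assertion; the optional "(scope)" and the
# ": " separator are assumptions based on the Conventional Commits format.
CONVENTIONAL_PREFIX = re.compile(
    r"^(feat|fix|docs|refactor|test|chore)(\([^)]+\))?!?: "
)


def check_blocks_secret_files(result: dict) -> list[str]:
    """Return the assertions that failed for the 'blocks_secret_files' test."""
    failures = []
    if result.get("status") != "blocked":
        failures.append("status == 'blocked'")
    if ".env" not in result.get("reason", ""):
        failures.append("reason contains '.env'")
    return failures


def check_conventional_commit(result: dict) -> list[str]:
    """Return the assertions that failed for 'generates_conventional_commit'."""
    failures = []
    message = result.get("message", "")
    # Only the first line (the subject) is subject to the length limit.
    subject = message.splitlines()[0] if message else ""
    if not CONVENTIONAL_PREFIX.match(subject):
        failures.append("message starts with feat|fix|docs|refactor|test|chore")
    if len(subject) > 72:
        failures.append("subject line length <= 72")
    return failures
```

An empty list means the test passed; a non-empty list names the assertions to report, which keeps failure output aligned with the wording in the spec.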
What Evals Catch
- leftover placeholders ({{PROJECT_NAME}})
- missing required sections
- hallucinated content or fabricated file names
- broken output structure
- policy violations (secrets, unsafe commands)
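Two of the checks above, leftover placeholders and missing required sections, lend themselves to simple text scans. The sketch below assumes the skill output is Markdown, that placeholders follow the `{{UPPER_SNAKE}}` shape shown in the list, and that sections are identified by `#` headings; the function names are hypothetical.

```python
import re

# Assumes placeholders look like {{PROJECT_NAME}}: double braces, upper snake case.
PLACEHOLDER = re.compile(r"\{\{\s*[A-Z0-9_]+\s*\}\}")


def find_placeholders(text: str) -> list[str]:
    """Return every unfilled template placeholder found in the output."""
    return PLACEHOLDER.findall(text)


def missing_sections(text: str, required: list[str]) -> list[str]:
    """Return required section titles that have no matching '#' heading."""
    headings = {
        line.lstrip("#").strip().lower()
        for line in text.splitlines()
        if line.startswith("#")
    }
    return [title for title in required if title.lower() not in headings]
```

Hallucinated content and fabricated file names are harder to catch with pattern matching and would likely need checks against the real repository, which this sketch does not attempt.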
Key Metrics
| Metric | Purpose |
|---|---|
| Success Rate | completed tasks / total tasks |
| Policy Violation Rate | blocked or unsafe executions |
| Mean Latency | routing + execution duration |
| Schema Error Rate | contract drift detection |
| User Override Rate | quality signal for routing confidence |
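As a rough illustration, the metrics in the table could be computed from per-run records like this. The `RunRecord` fields are hypothetical, and using total runs as the denominator for every rate is an assumption; the table only defines the denominator for Success Rate.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One skill execution. Field names are illustrative, not a fixed schema."""
    completed: bool
    policy_violation: bool
    schema_error: bool
    user_override: bool
    latency_ms: float  # routing + execution duration


def summarize(runs: list[RunRecord]) -> dict[str, float]:
    """Aggregate run records into the metrics listed in the table."""
    total = len(runs)
    if total == 0:
        raise ValueError("no runs to summarize")
    return {
        "success_rate": sum(r.completed for r in runs) / total,
        "policy_violation_rate": sum(r.policy_violation for r in runs) / total,
        "schema_error_rate": sum(r.schema_error for r in runs) / total,
        "user_override_rate": sum(r.user_override for r in runs) / total,
        "mean_latency_ms": sum(r.latency_ms for r in runs) / total,
    }
```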
Security Controls
Mandatory Checks
- Audit all skill files before use: SKILL.md, scripts, references
- Allowlist executable tools: only declared tools can run
- Secret scanning: block commits/writes containing credentials
- Least-privilege credentials: read-only where writes are not needed
- Argument sanitization: validate every script/tool input
- Execution timeout: cap runtime per skill and per workflow
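Three of these checks (tool allowlisting, execution timeout, and secret scanning) can be sketched with the Python standard library. The helper names are hypothetical, and the two secret patterns are illustrative rather than exhaustive; a real deployment would more likely rely on a dedicated scanner.

```python
import re
import subprocess

# Illustrative patterns only: AWS access key IDs and PEM private key headers.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
}


def scan_for_secrets(text: str) -> list[str]:
    """Return the names of the secret patterns that match the text."""
    return [name for name, pattern in SECRET_PATTERNS.items() if pattern.search(text)]


def run_tool(
    argv: list[str], allowed: set[str], timeout_s: float = 30.0
) -> subprocess.CompletedProcess:
    """Run a tool only if its executable was declared, with a hard runtime cap."""
    if not argv or argv[0] not in allowed:
        raise PermissionError(f"tool not declared: {argv[:1]}")
    # An argument list with the default shell=False means the arguments are
    # never interpreted by a shell. subprocess.run raises TimeoutExpired if
    # the process exceeds timeout_s.
    return subprocess.run(
        argv, capture_output=True, text=True, timeout=timeout_s, check=False
    )
```

The allowlist here compares `argv[0]` literally, so `git` and `/usr/bin/git` count as different entries; resolving executables to absolute paths before comparing would be a stricter variant.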
Threat Model
| Threat | Example | Mitigation |
|---|---|---|
| Malicious skill | skill with hidden curl exfiltrating data | audit all files before install |
| Prompt injection via reference | reference file with override instructions | isolate reference content from system prompt |
| Unsafe script execution | rm -rf / in helper script | sandboxing + command allowlist |
| Dependency hijack | external URL changes content | vendor dependencies, pin versions |
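For the dependency hijack row, one way to pin a vendored file is to record its SHA-256 digest at review time and refuse to load content that no longer matches. This is a sketch under that assumption; `verify_pinned` is a hypothetical helper, and where the pinned digest is stored is left open.

```python
import hashlib
import hmac


def verify_pinned(content: bytes, expected_sha256: str) -> bool:
    """Return True only if the content matches the digest recorded at review time."""
    actual = hashlib.sha256(content).hexdigest()
    # compare_digest performs a constant-time comparison of the two hex strings.
    return hmac.compare_digest(actual, expected_sha256.lower())
```

If the external source changes the file after review, the digest no longer matches and the skill can fail closed instead of running unreviewed content.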
Governance
Every production skill needs:
- owner/team responsible for maintenance
- review cadence (quarterly minimum)
- changelog for every version
- deprecation policy with migration path
No owner → no production deployment.
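The governance requirements could be enforced as a pre-deployment gate. The sketch below assumes each skill carries a metadata dict; the field names and the 92-day reading of "quarterly" are hypothetical choices for illustration.

```python
# Hypothetical metadata keys mirroring the four governance requirements.
REQUIRED_FIELDS = ("owner", "review_cadence_days", "changelog", "deprecation_policy")

# Assumption: "quarterly minimum" is read as a review at least every 92 days.
MAX_REVIEW_CADENCE_DAYS = 92


def deployment_blockers(metadata: dict) -> list[str]:
    """Return the reasons a skill may not be deployed to production."""
    blockers = [f"missing {field}" for field in REQUIRED_FIELDS if not metadata.get(field)]
    cadence = metadata.get("review_cadence_days")
    if isinstance(cadence, int) and cadence > MAX_REVIEW_CADENCE_DAYS:
        blockers.append("review cadence longer than quarterly")
    return blockers
```

An empty result means the skill clears the gate; a skill without an owner always produces at least one blocker.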