This week’s blog title pays tribute to one of my favorite books, “Measure What Matters” by John Doerr. In my earlier post, I briefly addressed the concept of dynamic evaluations for agents. This topic resonates with me because of my professional experience in application lifecycle management; I have also worked with cloud orchestration, cloud security, and low-code application development. There is a clear need for autonomous, intelligent continuous security in our field. Over the past several weeks, I have conducted extensive research, primarily reviewing publications from http://www.arxiv.org, to explore the emerging possibilities enabled by dynamic evaluations for agents.
This week’s discussion includes a significant mathematical part. To clarify, when referencing intelligent continuous security, I define it as follows:
- End-to-end security
- Continuous security in every phase
- Integration of lifecycle security practices leveraging AI and ML
The excitement surrounding this area stems from employing AI technologies to bolster defense against an evolving threat landscape, a landscape that is itself increasingly accelerated by advancements in AI. This article will examine the primary objects under evaluation; key metrics for security agent testing, risk-weighted security impact, and coverage; and the dynamic algorithms and scenario generation that keep evaluation continuous. These elements are all crucial within the framework of autonomous red, blue, and purple team operations for security scenarios. Then, a straightforward scenario will be presented to illustrate how these components interrelate.
This topic holds significant importance due to the current shortage of cybersecurity professionals. This is particularly relevant given the proliferation of autonomous vehicles, delivery systems, and defensive mechanisms. As these technologies advance, the demand for self-learning autonomous red, blue, and purple teams will become imperative. For instance, consider the ramifications if an autonomous vehicle were compromised and transformed into a weaponized entity.
What do “dynamic evals” mean in this context?
For security agents (red/blue/purple)
- Static evals: fixed test suite (e.g., canned OWASP tests) -> one-off score
- Dynamic evals:
- Continuously generate new attack and defense scenarios.
- Re-sample them over time as the system and agents change.
- Use online/off-policy algorithms to compare new policies safely.
A recent paper on red-team and dynamic evaluation frameworks for LLM agents argues that static benchmarks go stale quickly and must be replaced by ongoing, scenario-generating eval systems.
For security, we also anchor to the OWASP ASVS/Testing Guide for what “good coverage” means, and to CVSS/OWASP risk ratings for how bad a found vulnerability is.
Objects we’re evaluating
Think of your environment as a Markov decision process (MDP). An MDP models situations where outcomes are partly random and partly under the control of a decision maker; it is a formal way to describe decision-making over time under uncertainty. With that out of the way, these are the components of the MDP in the context of dynamic evals.
- State s: slices of system state + context
- code snapshot, open ports, auth config, logs, alerts, etc.
- Action a: what the agent does
- probe, run scanner X, craft request Y, deploy honeypot, block IP, open ticket, etc.
- Transition P(s' | s, a): how the system changes.
- Reward r: how “good” or “bad” that step was.
Dynamic eval = define good rewards, log trajectories (s_t, a_t, r_t, s_{t+1}), then use off-policy evaluation and online testing to compare policies.
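As a concrete starting point, here is a minimal sketch of the trajectory log that everything downstream (off-policy evaluation, bandits, drift detection) would consume. The field names are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    state: dict          # slice of system state + context (open ports, auth config, alerts, ...)
    action: str          # e.g., "run_scanner_x", "block_ip", "deploy_honeypot"
    reward: float        # step-level reward from the reward function
    next_state: dict     # system state after the action
    p_behavior: float    # probability the logging policy assigned to this action (needed later for OPE)

Trajectory = List[Step]  # one scenario run: (s_t, a_t, r_t, s_{t+1}) for t = 0..T
```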
Core metrics for security-testing agents
Task-level detection/exploitation metrics
On each scenario j (e.g., “there is a SQL injection in service A”):
- True positive rate (TPR): fraction of runs in which the agent detects the planted or known vulnerability, i.e., TP / (TP + FN).
- False positive rate (FPR): how often the agent raises a finding when no real issue exists, i.e., FP / (FP + TN).
- Mean time to detect (MTTD) across runs: average elapsed time from the start of a run to the first correct detection.
- Exploit chain depth for red agents: average number of steps in successful attack chains.
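A minimal sketch of how these per-scenario metrics might be computed from run logs; the record fields (detected, false_alarms, actions, time_to_detect, chain_len) are illustrative assumptions about what your harness logs:

```python
import numpy as np

def detection_metrics(runs):
    """Task-level metrics for one scenario, given a list of run records."""
    detected = [r["detected"] for r in runs]
    tpr = float(np.mean(detected))  # scenario is known to contain a real flaw
    # Simple FPR proxy: spurious findings per action taken.
    fpr = float(np.mean([r["false_alarms"] / max(r["actions"], 1) for r in runs]))
    detect_times = [r["time_to_detect"] for r in runs if r["detected"]]
    mttd = float(np.mean(detect_times)) if detect_times else float("nan")
    chains = [r["chain_len"] for r in runs if r.get("chain_len")]
    chain_depth = float(np.mean(chains)) if chains else float("nan")
    return {"TPR": tpr, "FPR": fpr, "MTTD": mttd, "chain_depth": chain_depth}
```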
Risk-weighted security impact
Use CVSS or similar scoring to weight each vulnerability by severity:
- For each found vulnerability v with CVSS score c_v, define the Risk-Weighted Yield (RWY) of a run as the sum of those scores: RWY = Σ_v c_v.
- You can normalize by time or by number of actions:
- Risk per 100 actions: RWY@100a = 100 · RWY / (total actions)
- Risk per test hour: RWY / (hours of testing)
For blue-team agents, we need to invert it:
- Residual risk after defense actions = baseline RWY – RWY after patching/hardening
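A minimal sketch of RWY and its normalizations, following the definitions above; the findings are assumed to be records carrying a CVSS score from your scanner or triage process:

```python
def rwy(findings):
    """Risk-Weighted Yield: sum of CVSS scores over the vulnerabilities found in one run."""
    return sum(f["cvss"] for f in findings)

def rwy_per_100_actions(findings, n_actions):
    return 100.0 * rwy(findings) / max(n_actions, 1)

def rwy_per_hour(findings, test_hours):
    return rwy(findings) / max(test_hours, 1e-9)

def residual_risk_reduction(baseline_findings, post_hardening_findings):
    # Blue-team view: how much risk-weighted yield the defensive actions removed.
    return rwy(baseline_findings) - rwy(post_hardening_findings)
```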
Behavioral metrics (agent quality)
For each trajectory:
- Stealth score (red) or stability score (blue)
- e.g., fraction of actions that did not trigger noise/unnecessary alerts.
- Action efficiency:
- e.g., reward (risk-weighted yield) earned per action, so agents that reach the objective with fewer wasted steps score higher.
- Policy entropy over actions: high entropy means the agent is still exploring; low entropy means more deterministic behavior. Track this over time.
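A minimal sketch of these behavioral metrics over a single trajectory; the per-step flags (quiet, useful) and the action field are illustrative labels you would derive from your own logs:

```python
import numpy as np
from collections import Counter

def behavioral_metrics(trajectory):
    n = len(trajectory)
    stealth = sum(step["quiet"] for step in trajectory) / n      # steps that triggered no unnecessary alerts
    efficiency = sum(step["useful"] for step in trajectory) / n  # steps that actually advanced the objective
    counts = np.array(list(Counter(step["action"] for step in trajectory).values()), dtype=float)
    probs = counts / counts.sum()
    entropy = float(-np.sum(probs * np.log(probs)))              # policy entropy over observed actions
    return {"stealth": stealth, "efficiency": efficiency, "entropy": entropy}
```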
Coverage metrics
Map ASVS/Testing Guide controls to scenarios.
Define a coverage vector over requirement IDs (which requirements have been exercised by at least one scenario).
You can also track Markovian coverage: how frequently the agent visits particular regions of the state space, such as auth or data paths, estimated by clustering logged states.
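A minimal sketch of both coverage views; the requirement list and the scenario-to-control mapping are illustrative placeholders:

```python
import numpy as np

def asvs_coverage(requirement_ids, executed_scenarios):
    """Coverage vector over ASVS/Testing Guide requirement IDs."""
    covered = {c for s in executed_scenarios for c in s["controls"]}
    vector = {rid: (rid in covered) for rid in requirement_ids}
    return vector, sum(vector.values()) / len(requirement_ids)

def state_cluster_coverage(cluster_labels):
    """Markovian coverage: fraction of visits landing in each cluster of logged states."""
    labels, counts = np.unique(cluster_labels, return_counts=True)
    return dict(zip(labels.tolist(), (counts / counts.sum()).tolist()))
```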
Algorithms to make this dynamic
Off-policy evaluation (OPE) for new agent policies
You don’t want to put every experimental red agent directly against your real systems. Instead:
- Log trajectories from baseline policies (humans, old agents)
- Propose a new policy
- Use OPE to estimate how it would perform on the same logged states.
Standard tools from RL/bandits:
- Importance Sampling (IS):
- For each trajectory, weight its return by the likelihood ratio w = Π_t π_new(a_t | s_t) / π_old(a_t | s_t), then estimate the new policy’s value as the average of the weighted returns over the logged trajectories.
- Self-normalized IS (SNIS) to reduce variance: divide by the sum of the weights instead of the number of trajectories.
- Doubly robust (DR) estimators:
- Combine a model-based value estimate with IS to get a lower-variance estimate that stays unbiased if either the model or the importance weights are correct.
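A minimal sketch of the IS and SNIS estimators, assuming each logged step records the action probability under both the behavior (old) policy and the candidate (new) policy; the field names are illustrative:

```python
import numpy as np

def ope_estimates(trajectories):
    """Ordinary and self-normalized importance-sampling estimates of the new policy's value."""
    weights, returns = [], []
    for traj in trajectories:
        # Likelihood ratio of the whole trajectory under the new vs. old policy.
        w = np.prod([step["p_new"] / step["p_old"] for step in traj])
        g = sum(step["reward"] for step in traj)  # undiscounted return of the trajectory
        weights.append(w)
        returns.append(g)
    weights, returns = np.array(weights), np.array(returns)
    is_estimate = float(np.mean(weights * returns))                     # ordinary IS
    snis_estimate = float(np.sum(weights * returns) / np.sum(weights))  # self-normalized IS
    return {"IS": is_estimate, "SNIS": snis_estimate}
```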
Safety-aware contextual bandits for online testing
The bandit problem is a fundamental topic in statistics and machine learning, focusing on decision-making under uncertainty. The goal is to maximize rewards by balancing exploration of different options and exploitation of those with the best-known outcomes. A common example is choosing among slot machines at a casino. Each has its own payout probability. You try different machines to learn which pays best. Then you continue playing the most rewarding one.
When you go online, treat “Which policy should handle this security test?” as a bandit problem:
- Context = environment traits (service, tech stack, criticality)
- Arms = candidate agents (policies)
- Rewards = risk-weighted yield (for red) or residual risk reduction (for blue), with penalties for unsafe behavior
Use Thompson sampling (a Bayesian approach commonly used in multi-armed bandit problems) or Upper Confidence Bound (UCB), which relies on confidence intervals, but constrain them (e.g., allocate no more than X% of traffic to a new policy unless the lower confidence bound on its reward is above the safety floor). Recent work on safety-constrained bandits/OPE explicitly tackles this.
This gives you a continuous, adaptive “tournament” for agents without fully trusting unproven ones.
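A minimal sketch of safety-constrained Thompson sampling over candidate policies, assuming rewards are scaled to [0, 1] (e.g., normalized risk-weighted yield); the traffic cap, safety floor, and Beta-prior update are illustrative choices, not a prescribed design:

```python
import numpy as np

class SafePolicyBandit:
    """Thompson sampling over candidate agent policies with a traffic cap and safety floor."""

    def __init__(self, n_policies, max_new_traffic=0.1, safety_floor=0.3):
        self.alpha = np.ones(n_policies)   # Beta prior successes
        self.beta = np.ones(n_policies)    # Beta prior failures
        self.pulls = np.zeros(n_policies)
        self.max_new_traffic = max_new_traffic
        self.safety_floor = safety_floor

    def select(self, baseline=0):
        samples = np.random.beta(self.alpha, self.beta)
        arm = int(np.argmax(samples))
        total = max(self.pulls.sum(), 1)
        mean = self.alpha[arm] / (self.alpha[arm] + self.beta[arm])
        # Fall back to the baseline policy if the new arm exceeds its traffic
        # allowance or its estimated reward is below the safety floor.
        if arm != baseline and (self.pulls[arm] / total > self.max_new_traffic or mean < self.safety_floor):
            arm = baseline
        return arm

    def update(self, arm, reward):
        # Fractional rewards in [0, 1] treated as pseudo-counts (a common approximation).
        self.pulls[arm] += 1
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward
```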
Sequential hypothesis testing/drift detection
You want to trigger alarms when a new version regresses:
- Let V_A and V_B be the performance estimates (e.g., RWY@100a or TPR) for the old (A) versus new (B) agent.
- Use bootstrap over scenarios/trajectories to get confidence intervals.
- Apply sequential tests (e.g., the sequential probability ratio test) so that you can stop early when it is clear that B is better or worse.
- If performance drops below a threshold (e.g., TPR falls, or RWY@100a tanks), auto-fail the rollout (pump the brakes on the CI/CD pipeline when deploying the agents).
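A minimal sketch of the bootstrap comparison; inputs are per-scenario metric values for the old (A) and new (B) agents, and the confidence level and regression rule are illustrative:

```python
import numpy as np

def bootstrap_regression_check(metric_a, metric_b, n_boot=10_000, alpha=0.05, seed=0):
    """Bootstrap CI on mean(B) - mean(A); flag a regression if the whole CI is below zero."""
    metric_a, metric_b = np.asarray(metric_a, float), np.asarray(metric_b, float)
    rng = np.random.default_rng(seed)
    diffs = []
    for _ in range(n_boot):
        a = rng.choice(metric_a, size=len(metric_a), replace=True)
        b = rng.choice(metric_b, size=len(metric_b), replace=True)
        diffs.append(b.mean() - a.mean())
    lo, hi = np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return {"ci": (float(lo), float(hi)), "regressed": hi < 0}
```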
Dynamic scenario generation
Dynamic evals need a living corpus of tests, not just a fixed checklist.
Scenario Generator
- Parameterize the tests from frameworks like the OWASP ASVS/Testing Guide and MITRE ATT&CK into templates:
- “Auth bypass on endpoint with pattern X”
- “Least privilege violation in role Y”
- Combine them with:
- New code paths/services (from your repos & infra graph)
- Past vulnerabilities (re-tests)
- Recent external vulnerability classes (e.g., new serialization bugs)
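A minimal sketch of what a parameterized template could look like; the ASVS/ATT&CK identifiers, endpoints, and pattern names are illustrative placeholders, not an authoritative mapping:

```python
from dataclasses import dataclass
from itertools import product

@dataclass
class ScenarioTemplate:
    name: str
    control_ids: list   # e.g., ASVS requirement or ATT&CK technique IDs the template exercises
    params: dict        # parameter name -> candidate values (from repos, infra graph, past vulns)

    def expand(self):
        keys = list(self.params)
        for combo in product(*(self.params[k] for k in keys)):
            yield {"template": self.name, "controls": self.control_ids, **dict(zip(keys, combo))}

auth_bypass = ScenarioTemplate(
    name="auth_bypass_on_endpoint",
    control_ids=["ASVS-V2", "T1078"],                 # illustrative identifiers
    params={"endpoint": ["/admin", "/api/orders"],    # e.g., pulled from your repo/infra graph
            "pattern": ["jwt_none_alg", "session_fixation"]},
)
scenarios = list(auth_bypass.expand())
```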
Scenario selection: bandits again
You won’t run everything all the time. Use multi-armed bandits on the scenarios themselves (remember, you are optimizing overall outcomes under uncertainty):
- Each scenario is an arm.
- Reward = information gain (did we learn something?) or “surprise” (difference between expected and observed agent performance).
- Prefer:
- High-risk, high-impact areas (per OWASP risk rating & CVSS)
- Areas where metrics are uncertain (high variance)
This ensures your evals stay focused and fresh instead of hammering the same easy tests.
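A minimal sketch of UCB-style scenario selection, where each scenario carries a risk weight (from CVSS/OWASP risk ratings) and a running “surprise” score; all field names are illustrative:

```python
import numpy as np

def pick_scenarios(stats, k=5, c=1.0):
    """Rank scenarios by risk-weighted surprise plus an exploration bonus; return the top k IDs."""
    total_runs = sum(s["runs"] for s in stats.values()) + 1
    scores = {}
    for sid, s in stats.items():
        mean_surprise = s["surprise_sum"] / max(s["runs"], 1)
        exploration = c * np.sqrt(np.log(total_runs) / max(s["runs"], 1))  # UCB-style uncertainty bonus
        scores[sid] = s["risk_weight"] * (mean_surprise + exploration)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```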
Example: End-to-end dynamic eval loop
Phew! That was a lot of math. Imagine researching all of this, learning or relearning some of these concepts, and doing my day job. In the age of AI, I appreciate a good prompt that can help with research and summarize the basic essence of the papers and webpages I’ve referenced. Without further ado, let’s get into it:
- Define the reward function for each agent type (yes, sounds like training mice in a lab)
- Red teams: risk-weighted yield of the vulnerabilities they find, with penalties for noisy or unsafe actions.
- Blue teams: residual risk reduction after patching/hardening, again with penalties for unsafe behavior.
- Continuously generate scenarios from ASVS/ATT&CK-like templates, weighted by business criticality.
- Schedule tests via a scenario-bandit (focus on high-risk and uncertain areas).
- Route tests to agents using safety-constrained policy bandits.
- Log trajectories and security outcomes (vulnerabilities found, incidents observed).
- Run OPE offline to evaluate new agents before they touch critical environments.
- Run sequential tests and drift detection to auto-rollback regressed versions.
- Periodically recompute coverage & risk (this is important)
- ASVS Coverage, RWY@time, TPR/FPR trends, calibration of risk estimates
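Tying it together, here is a minimal sketch of one cycle of the loop, reusing the earlier sketches (pick_scenarios, SafePolicyBandit, rwy); run_agent and log_trajectory are assumed, environment-specific helpers, and “expected” is a running per-scenario performance estimate:

```python
def dynamic_eval_cycle(scenario_stats, scenarios_by_id, policy_bandit, agents):
    for sid in pick_scenarios(scenario_stats, k=10):                          # scenario bandit
        arm = policy_bandit.select()                                          # safety-constrained policy bandit
        trajectory, findings = run_agent(agents[arm], scenarios_by_id[sid])   # assumed helper
        reward = rwy(findings)                                                # risk-weighted yield of this run
        policy_bandit.update(arm, min(reward / 10.0, 1.0))                    # crude scaling of CVSS sums to [0, 1]
        stats = scenario_stats[sid]
        stats["runs"] += 1
        stats["surprise_sum"] += abs(reward - stats.get("expected", 0.0))     # feeds the scenario bandit
        log_trajectory(trajectory, findings)                                  # feeds OPE and drift detection
```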
Risks and Concerns
Dynamic evals can still overfit if:
- Agents memorize your test templates
- You don’t rotate/mutate scenarios
- You over-optimize to a narrow set of metrics (e.g., “find anything, even if low impact” → high noise)
Mitigations:
- Keep a hidden eval set of scenarios and environments never used for training or interactive training (yes, this is needed)
- Perform “probe-based” agentic red teaming (inject adversarial conditions at specific nodes of the agent workflow, not just at the inputs, i.e., chaos monkey, agentic style) to detect brittle behaviors
- Track metric diversity: impact, precision, stability, coverage
- Require a minimum threshold on all metrics, not just one
As you can see, dynamic evals present challenges, but the cost of failure escalates significantly when agents perform poorly in a customer-facing scenario. The current body of work in coding, such as Agents.MD, etc., is mostly about shortening the context window to get a reasonable amount of determinism, and the only reason agents get away with it is that developers fix the code and provide the appropriate feedback.
That topic is a conversation for a different day.