
R7 · EV-display variance — Resolution Report

Per-combo EV inconsistencies surfaced at RFI positions · classified as B1 known-failure (KI-4 / G-series / I-series family) · 2026-05-11

B1 known-failure family: KI-4 (model EV shape concerns)
Owner (model): Trung
Owner (QA framework): Nimit

The issue (one paragraph)

At RFI positions on Cash NLHE 6-max 100bb 2-blind no ante, the model's per-combo raise-action EV varies dramatically across "similar" folded hands. Standard junk hands (72o, 82o, 92o, T2o, J2o) show sensible raise EVs in the −0.2 to −2.0 bb range. But a small set of rarely-played buckets — most prominently 26o — show absurdly positive raise EVs (+4.98 bb at CO, +5.25 bb at MP, +6.77 bb at UTG) despite the model folding them 100%. The fold decision itself is consistent; the displayed EV column for unselected actions is unreliable.

Why this matters

A student or coach inspecting "What's the EV of 26o vs 72o at UTG?" will see two hands that both fold 100% but whose displayed alternatives carry very different EVs. The fold decision is correct in both cases, yet the surface column for "what if I raised?" tells a false story: 26o looks like a clear open (+6.77 bb) while 72o looks like a marginal one (−1.80 bb). The displayed EV is not trustworthy as a coaching signal.

Existing B1 properties that should catch this

The B1 consistency framework already defines and implements 26 EV-related properties across F/G/H/I series. The I-series (Policy-EV consistency, 5 properties) is the most directly relevant:

| ID | Property | What it checks | Catches R7? |
| --- | --- | --- | --- |
| I1 | High-prob actions have high EV | Dominant action (p > 0.5) EV within 30% of max EV range | Partial |
| I2 | Mixing implies indifference | Mixed actions (both p > 0.3) must have similar EV (≤ 0.5 bb) | No (mixing only) |
| I3 | Zero-prob high-EV flag | For each combo: any zero-prob action (p < 0.001) with EV > max-active EV + 0.01 bb is a violation | YES — direct match |
| I4 | Policy-weighted EV consistency | \|Σ p(a) × EV(a) − combo.ev\| ≤ 0.1 bb | Partial |
| I5 | EV ordering at policy extremes | Pure (100%) actions: EV should equal max action EV | YES |
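
The I3 and I5 predicates above reduce to a few lines each. A minimal sketch for reference (the `actions` mapping of action name to `(prob, ev)` is a hypothetical data shape, not the real runner's; the canonical implementation lives in `classI.ts`):

```python
# Sketch of the I3 and I5 checks. The combo record shape
# (action -> (prob, ev)) is hypothetical; thresholds are from the table.

def i3_violation(actions, eps=0.01, zero_p=0.001):
    """I3: some zero-prob action's EV exceeds the best active action's EV."""
    active_evs = [ev for p, ev in actions.values() if p >= zero_p]
    if not active_evs:
        return False
    max_active = max(active_evs)
    return any(p < zero_p and ev > max_active + eps
               for p, ev in actions.values())

def i5_violation(actions, pure_p=0.999):
    """I5: a pure (100%) action should also be the max-EV action."""
    max_ev = max(ev for _, ev in actions.values())
    return any(p >= pure_p and ev < max_ev
               for p, ev in actions.values())

# The 26o UTG anomaly from this report trips both checks:
combo_26o = {"fold": (1.0, 0.0), "raise 2.5": (0.0, 6.77)}
assert i3_violation(combo_26o)   # zero-prob raise EV exceeds fold EV
assert i5_violation(combo_26o)   # pure fold is not the max-EV action
```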

All 5 I-series properties have been implemented in projects_dev/agserving-rangeviewer-v2/tests/parity/properties/runners/classI.ts since 2026-04-16. Together with the 8 F-series (EV constraints), 6 G-series (EV monotonicity, known-failure family KI-4), and 7 H-series (EV coherence) properties, that makes 26 EV-related properties in total.

2026-05-11 re-run against current model

Replicated the I1/I3/I4/I5 checks against the current production endpoint (model moe_dynamic_703_universal_v1_20260330_0900_no_cap_trung.onnx), iterating all 1,326 combos per situation (the production runner samples only combos[0] per hand class). 6 situations × ~1,326 combos. Test script: tests/pull_b1_iclass_rerun.py.
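
The difference in sampling depth between the production runner and this re-run is the combinatorics below; a self-contained sketch (the hand-class expansion is illustrative, not the runner's actual code):

```python
from itertools import combinations

RANKS = "23456789TJQKA"
SUITS = "cdhs"

def all_combos():
    """All 1,326 two-card combos (52 choose 2)."""
    cards = [r + s for r in RANKS for s in SUITS]
    return list(combinations(cards, 2))

def hand_class(c1, c2):
    """Collapse a combo to its 169-class label, e.g. 'AKs', '72o', 'TT'."""
    r1, s1 = c1
    r2, s2 = c2
    hi, lo = sorted((r1, r2), key=RANKS.index, reverse=True)
    if hi == lo:
        return hi + lo
    return hi + lo + ("s" if s1 == s2 else "o")

combos = all_combos()
assert len(combos) == 1326
assert len({hand_class(*c) for c in combos}) == 169
# The production runner checks one combo per class (169 queries per
# situation); this re-run checks all 1,326, so suit-specific outliers
# in combos[1:] can surface.
```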

| Property | Tested | Violations | Rate | Implication |
| --- | --- | --- | --- | --- |
| I1 | 7,860 | 1,628 | 20.7% | 1 in 5 dominant actions has an EV >30% below max |
| I3 | 45,598 | 12,988 | 28.5% | 1 in 3.5 zero-prob actions has an EV exceeding the max-active EV |
| I4 | 0 | 0 | n/a | combo.ev field absent in response; can't run |
| I5 | 5,186 | 3,446 | 66.5% | 2 in 3 pure (100%) actions are not the max-EV option |

Sample I3 violations — not just edge hands

The 26o anomaly is one visible instance of a broader pattern. The same check fires on premium hands:

| Combo | Situation | Played action | Played EV | Zero-prob alt | Zero-prob EV | Gap |
| --- | --- | --- | --- | --- | --- | --- |
| AA | UTG open | raise (100%) | +8.58 | call | +9.85 | +1.27 bb |
| AKs | UTG open | raise 2.5 (100%) | +0.97 | raise 5.0 | +2.27 | +1.30 bb |
| AKs | UTG open | raise 2.5 (100%) | +0.97 | raise 7.5 | +2.88 | +1.91 bb |
| 26o | UTG open | fold (100%) | 0.00 | raise 2.5 | +6.77 | +6.77 bb |

The AA example is especially striking — the model "leaves +1.27 bb on the table per AA hand" by raising instead of calling, according to its own EV estimates. Calling AA from UTG is not a real strategic option; the model's EV for the unselected call is just noisy.

Why production B1 didn't catch this

Nimit's 2026-04-23 B1 cross-format sweep did not list I3 in failing properties for any format. Three plausible reasons:

  1. Model variant difference. Nimit's run likely targeted the pre-_no_cap_trung model. The current endpoint serves _no_cap_trung, which may have introduced or amplified EV noise on rarely-played buckets.
  2. Sampling depth. The production runner iterates only hand.combos[0] per hand class (~169 classes). My re-run iterated all 1,326 combos. Suit-specific anomalies in combos[1+] could be missed by the production runner.
  3. DSL → payload translation. The runner uses its own DSL parser that may produce slightly different payloads than the canonical V2 strategy_grid contract — could lead to subtle differences in what's being queried.

Resolution

Two parallel tracks. Both required.

Track A — Run B1 properties against the current model

The framework is already designed to catch this class of issue. The fix here is operational, not methodological.

  1. Re-run the full B1 vitest suite against moe_dynamic_703_universal_v1_20260330_0900_no_cap_trung.onnx (current production endpoint).
  2. Specifically inspect I1, I3, I5 results — these are the EV-coherence properties most relevant to R7. Expect non-trivial violations based on our re-run (I3 at 28.5%, I5 at 66.5%).
  3. Cross-check G-series (G1, G2, G3, G4, G6) — these are already in KI-4 known-failure family and should be re-confirmed against the current model.
  4. Cadence: every model version bump must re-run B1 before deploying as production endpoint. If a model variant ships without B1 sign-off, EV-coherence regressions like R7 ship to coaches uncaught.
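
One lightweight way to enforce the cadence in step 4 is a pre-deploy gate that refuses to promote a model without a recorded B1 pass. A sketch under stated assumptions (the sign-off manifest and its `{"status": ..., "date": ...}` schema are hypothetical; no such manifest exists today):

```python
def b1_signed_off(model_name: str, manifest: dict) -> bool:
    """True only if this exact model version has a recorded B1 pass."""
    entry = manifest.get(model_name)
    return bool(entry) and entry.get("status") == "pass"

def gate_deploy(model_name: str, manifest: dict) -> None:
    """Refuse promotion when the model lacks a B1 sign-off."""
    if not b1_signed_off(model_name, manifest):
        raise RuntimeError(
            f"refusing to promote {model_name}: no B1 sign-off on record")

# Illustrative manifest state after this re-run:
signoffs = {
    "moe_dynamic_703_universal_v1_20260330_0900_no_cap_trung.onnx":
        {"status": "fail", "date": "2026-05-11"},
}
assert not b1_signed_off(
    "moe_dynamic_703_universal_v1_20260330_0900_no_cap_trung.onnx", signoffs)
```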

Track B — Verify B1 property implementations

Properties existed and "passed" in Nimit's 2026-04-23 run but my expanded re-run shows 28-66% violations. Either the model changed or the production runner has blind spots.

  1. Audit hand.combos[0] sampling. Modify the production runner to iterate all combos per hand class (or at least sample multiple per class) so suit-specific EV outliers surface.
  2. Verify DSL → payload contract matches the canonical V2 strategy_grid spec (RVV2-exact payload semantics: raise_to = cumulative chips, rake_cap_bb=3, no grid_weight_type).
  3. Compare runner output on identical situations between (a) the vitest runner and (b) the Python expanded check (tests/pull_b1_iclass_rerun.py). Reconcile any disagreement.
  4. Document the property semantics with a concrete example per property (e.g., R7 → I3) so future model bumps are diagnosed correctly.
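
Step 3's reconciliation can be mechanical: dump per-(combo, action) rows of (prob, EV) from both runners and diff them within tolerance. A minimal sketch (the row format and tolerance values are assumptions, not an agreed contract):

```python
def reconcile(rows_a, rows_b, ev_tol=0.01, p_tol=0.001):
    """Compare (combo, action) -> (prob, ev) maps from two runners;
    return the keys where they disagree beyond tolerance."""
    diffs = []
    for key in sorted(set(rows_a) | set(rows_b)):
        if key not in rows_a or key not in rows_b:
            diffs.append((key, "missing"))
            continue
        pa, ea = rows_a[key]
        pb, eb = rows_b[key]
        if abs(pa - pb) > p_tol or abs(ea - eb) > ev_tol:
            diffs.append((key, "mismatch"))
    return diffs

# Hypothetical dumps from the vitest runner vs. the Python re-run:
vitest_rows = {("26o", "fold"): (1.0, 0.00), ("26o", "raise 2.5"): (0.0, 6.77)}
python_rows = {("26o", "fold"): (1.0, 0.00), ("26o", "raise 2.5"): (0.0, 6.75)}
assert reconcile(vitest_rows, python_rows) == [(("26o", "raise 2.5"), "mismatch")]
```

Any non-empty diff list means the two runners are not querying (or decoding) the same thing, which would explain the 2026-04-23 pass.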

Suggested next steps (priority order)

  1. Re-run B1 I-series against _no_cap_trung — confirm I1/I3/I5 violation rates on the current production model. Run via the production vitest runner (gold standard).
  2. Modify production runner to iterate all combos per hand class — current combos[0]-only sampling is too thin to surface the scale of EV noise this re-run shows.
  3. Escalate I-series + KI-4 family to Trung (owner of EV) — the production model leaks 1-2 bb per hand on premium ranges (AA call-vs-raise gap = 1.27 bb is the cleanest example). Worth investigating whether it's a model defect, an EV-computation issue, or a training-coverage gap on alternative actions.
  4. Cross-link R7 ↔ I3 in b1-qa-tracker.md — connect the reviewer-page finding back to the B1 property identity so future runs are interpreted correctly.
  5. Run the same I-checks at postflop spots (BB defense, c-bet decisions) to see if EV-noise is preflop-specific or pervasive across decision types.
  6. Coach-surface display decision — should the UI display raw model EV for unselected actions, or apply a sanity-clamp (e.g. "EV unreliable: below training threshold" for hands with fold prob ≥ 95% and noise-suspect EV)? Routes to product after Track A completes.
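
The sanity-clamp from step 6 could look like the following on the display side. This is a product sketch, not an implemented API: the `display_ev` signature and both thresholds are hypothetical choices pending the Track A results.

```python
def display_ev(action_prob, action_ev, fold_prob, *,
               zero_p=0.001, fold_clamp=0.95):
    """Return the EV string to show, or a caution marker when the
    model's EV for an unselected action is likely noise."""
    if action_prob < zero_p and fold_prob >= fold_clamp:
        return "EV unreliable"  # never-played action on a near-pure fold
    return f"{action_ev:+.2f} bb"

# 26o at UTG: folds 100%, so the +6.77 raise EV is clamped.
assert display_ev(0.0, 6.77, 1.0) == "EV unreliable"
# AA at UTG: the played raise EV displays normally.
assert display_ev(1.0, 8.58, 0.0) == "+8.58 bb"
```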

Open questions for Trung + Nimit