
R7 · EV-display variance — Resolution Report

Per-combo EV inconsistencies surfaced at RFI positions · classified as B1 known-failure (KI-4 / G-series / I-series family) · 2026-05-11

B1 known-failure family: KI-4 (model EV shape concerns)
Owner (model): Trung
Owner (QA framework): Nimit

The issue (one paragraph)

At RFI positions on Cash NLHE 6-max 100bb 2-blind no ante, the model's per-combo raise-action EV varies dramatically across "similar" folded hands. Standard junk hands (72o, 82o, 92o, T2o, J2o) show sensible raise EVs in the −0.2 to −2.0 bb range. But a small set of rarely-played buckets — most prominently 26o — show absurdly positive raise EVs (+4.98 bb at CO, +5.25 bb at MP, +6.77 bb at UTG) despite the model folding them 100%. The fold decision itself is consistent; the displayed EV column for unselected actions is unreliable.

Why this matters

A student or coach inspecting "What's the EV of 26o vs 72o at UTG?" will see two hands that both fold 100% but whose displayed alternatives carry very different EVs. The fold decision is correct in both cases, yet the surface column for "what if I raised?" tells a false story: 26o looks like a clear open (+6.77 bb) while 72o looks like a marginal one (−1.80 bb). The displayed EV is not trustworthy as a coaching signal.

Existing B1 properties that should catch this

The B1 consistency framework already defines and implements 26 EV-related properties across F/G/H/I series. The I-series (Policy-EV consistency, 5 properties) is the most directly relevant:

| ID | Property | What it checks | Catches R7? |
| --- | --- | --- | --- |
| I1 | High-prob actions have high EV | Dominant action (p > 0.5) EV within 30% of max EV range | Partial |
| I2 | Mixing implies indifference | Mixed actions (both p > 0.3) must have similar EV (≤ 0.5 bb) | No (mixing only) |
| I3 | Zero-prob high-EV flag | For each combo: any zero-prob action (p < 0.001) with EV > max-active EV + 0.01 bb is a violation | YES — direct match |
| I4 | Policy-weighted EV consistency | \|Σ p(a) × EV(a) − combo.ev\| ≤ 0.1 bb | Partial |
| I5 | EV ordering at policy extremes | Pure (100%) actions: EV should equal max action EV | YES |
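
The I3 and I5 predicates above reduce to a few lines each. A minimal sketch for reference (the `actions` mapping of action name to `(prob, ev)` is a hypothetical data shape, not the real runner's; the canonical implementation lives in `classI.ts`):

```python
# Sketch of the I3 and I5 checks. The combo record shape
# (action -> (prob, ev)) is hypothetical; thresholds are from the table.

def i3_violation(actions, eps=0.01, zero_p=0.001):
    """I3: some zero-prob action's EV exceeds the best active action's EV."""
    active_evs = [ev for p, ev in actions.values() if p >= zero_p]
    if not active_evs:
        return False
    max_active = max(active_evs)
    return any(p < zero_p and ev > max_active + eps
               for p, ev in actions.values())

def i5_violation(actions, pure_p=0.999):
    """I5: a pure (100%) action should also be the max-EV action."""
    max_ev = max(ev for _, ev in actions.values())
    return any(p >= pure_p and ev < max_ev
               for p, ev in actions.values())

# The 26o UTG anomaly from this report trips both checks:
combo_26o = {"fold": (1.0, 0.0), "raise 2.5": (0.0, 6.77)}
assert i3_violation(combo_26o)   # zero-prob raise EV exceeds fold EV
assert i5_violation(combo_26o)   # pure fold is not the max-EV action
```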

All 5 I-series properties have been implemented in projects_dev/agserving-rangeviewer-v2/tests/parity/properties/runners/classI.ts since 2026-04-16. Together with the 8 F-series (EV constraints), 6 G-series (EV monotonicity, known-failure family KI-4), and 7 H-series (EV coherence) properties, that makes 26 EV-related properties in total.

2026-05-11 re-run against current model

Replicated the I1/I3/I4/I5 checks against the current production endpoint (model moe_dynamic_703_universal_v1_20260330_0900_no_cap_trung.onnx), iterating all 1,326 combos per situation (the production runner samples only combos[0] per hand class). 6 situations × ~1,326 combos. Test script: tests/pull_b1_iclass_rerun.py.
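
The difference in sampling depth between the production runner and this re-run is the combinatorics below; a self-contained sketch (the hand-class expansion is illustrative, not the runner's actual code):

```python
from itertools import combinations

RANKS = "23456789TJQKA"
SUITS = "cdhs"

def all_combos():
    """All 1,326 two-card combos (52 choose 2)."""
    cards = [r + s for r in RANKS for s in SUITS]
    return list(combinations(cards, 2))

def hand_class(c1, c2):
    """Collapse a combo to its 169-class label, e.g. 'AKs', '72o', 'TT'."""
    r1, s1 = c1
    r2, s2 = c2
    hi, lo = sorted((r1, r2), key=RANKS.index, reverse=True)
    if hi == lo:
        return hi + lo
    return hi + lo + ("s" if s1 == s2 else "o")

combos = all_combos()
assert len(combos) == 1326
assert len({hand_class(*c) for c in combos}) == 169
# The production runner checks one combo per class (169 queries per
# situation); this re-run checks all 1,326, so suit-specific outliers
# in combos[1:] can surface.
```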

| Property | Tested | Violations | Rate | Implication |
| --- | --- | --- | --- | --- |
| I1 | 7,860 | 1,628 | 20.7% | 1 in 5 dominant actions has an EV >30% below max |
| I3 | 45,598 | 12,988 | 28.5% | 1 in 3.5 zero-prob actions has an EV exceeding the max-active EV |
| I4 | 0 | 0 | n/a | combo.ev field absent in response; can't run |
| I5 | 5,186 | 3,446 | 66.5% | 2 in 3 pure (100%) actions are not the max-EV option |

Sample I3 violations — not just edge hands

The 26o anomaly is one visible instance of a broader pattern. The same check fires on premium hands:

| Combo | Situation | Played action | Played EV | Zero-prob alt | Zero-prob EV | Gap |
| --- | --- | --- | --- | --- | --- | --- |
| AA | UTG open | raise (100%) | +8.58 | call | +9.85 | +1.27 bb |
| AKs | UTG open | raise 2.5 (100%) | +0.97 | raise 5.0 | +2.27 | +1.30 bb |
| AKs | UTG open | raise 2.5 (100%) | +0.97 | raise 7.5 | +2.88 | +1.91 bb |
| 26o | UTG open | fold (100%) | 0.00 | raise 2.5 | +6.77 | +6.77 bb |

The AA example is especially striking — the model "leaves +1.27 bb on the table per AA hand" by raising instead of calling, according to its own EV estimates. Calling AA from UTG is not a real strategic option; the model's EV for the unselected call is just noisy.

Why production B1 didn't catch this

Nimit's 2026-04-23 B1 cross-format sweep did not list I3 in failing properties for any format. Three plausible reasons:

  1. Model variant difference. Nimit's run likely targeted the pre-_no_cap_trung model. The current endpoint serves _no_cap_trung, which may have introduced or amplified EV noise on rarely-played buckets.
  2. Sampling depth. The production runner iterates only hand.combos[0] per hand class (~169 classes). My re-run iterated all 1,326 combos. Suit-specific anomalies in combos[1+] could be missed by the production runner.
  3. DSL → payload translation. The runner uses its own DSL parser that may produce slightly different payloads than the canonical V2 strategy_grid contract — could lead to subtle differences in what's being queried.

Resolution

Two parallel tracks. Both required.

Track A — Run B1 properties against the current model

The framework is already designed to catch this class of issue. The fix here is operational, not methodological.

  1. Re-run the full B1 vitest suite against moe_dynamic_703_universal_v1_20260330_0900_no_cap_trung.onnx (current production endpoint).
  2. Specifically inspect I1, I3, I5 results — these are the EV-coherence properties most relevant to R7. Expect non-trivial violations based on our re-run (I3 at 28.5%, I5 at 66.5%).
  3. Cross-check G-series (G1, G2, G3, G4, G6) — these are already in KI-4 known-failure family and should be re-confirmed against the current model.
  4. Cadence: every model version bump must re-run B1 before deploying as production endpoint. If a model variant ships without B1 sign-off, EV-coherence regressions like R7 ship to coaches uncaught.
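
One lightweight way to enforce the cadence in step 4 is a pre-deploy gate that refuses to promote a model without a recorded B1 pass. A sketch under stated assumptions (the sign-off manifest and its `{"status": ..., "date": ...}` schema are hypothetical; no such manifest exists today):

```python
def b1_signed_off(model_name: str, manifest: dict) -> bool:
    """True only if this exact model version has a recorded B1 pass."""
    entry = manifest.get(model_name)
    return bool(entry) and entry.get("status") == "pass"

def gate_deploy(model_name: str, manifest: dict) -> None:
    """Refuse promotion when the model lacks a B1 sign-off."""
    if not b1_signed_off(model_name, manifest):
        raise RuntimeError(
            f"refusing to promote {model_name}: no B1 sign-off on record")

# Illustrative manifest state after this re-run:
signoffs = {
    "moe_dynamic_703_universal_v1_20260330_0900_no_cap_trung.onnx":
        {"status": "fail", "date": "2026-05-11"},
}
assert not b1_signed_off(
    "moe_dynamic_703_universal_v1_20260330_0900_no_cap_trung.onnx", signoffs)
```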

Track B — Verify B1 property implementations

Properties existed and "passed" in Nimit's 2026-04-23 run but my expanded re-run shows 28-66% violations. Either the model changed or the production runner has blind spots.

  1. Audit hand.combos[0] sampling. Modify the production runner to iterate all combos per hand class (or at least sample multiple per class) so suit-specific EV outliers surface.
  2. Verify DSL → payload contract matches the canonical V2 strategy_grid spec (RVV2-exact payload semantics: raise_to = cumulative chips, rake_cap_bb=3, no grid_weight_type).
  3. Compare runner output on identical situations between (a) the vitest runner and (b) the Python expanded check (tests/pull_b1_iclass_rerun.py). Reconcile any disagreement.
  4. Document the property semantics with a concrete example per property (e.g., R7 → I3) so future model bumps are diagnosed correctly.
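
Step 3's reconciliation can be mechanical: dump per-(combo, action) rows of (prob, EV) from both runners and diff them within tolerance. A minimal sketch (the row format and tolerance values are assumptions, not an agreed contract):

```python
def reconcile(rows_a, rows_b, ev_tol=0.01, p_tol=0.001):
    """Compare (combo, action) -> (prob, ev) maps from two runners;
    return the keys where they disagree beyond tolerance."""
    diffs = []
    for key in sorted(set(rows_a) | set(rows_b)):
        if key not in rows_a or key not in rows_b:
            diffs.append((key, "missing"))
            continue
        pa, ea = rows_a[key]
        pb, eb = rows_b[key]
        if abs(pa - pb) > p_tol or abs(ea - eb) > ev_tol:
            diffs.append((key, "mismatch"))
    return diffs

# Hypothetical dumps from the vitest runner vs. the Python re-run:
vitest_rows = {("26o", "fold"): (1.0, 0.00), ("26o", "raise 2.5"): (0.0, 6.77)}
python_rows = {("26o", "fold"): (1.0, 0.00), ("26o", "raise 2.5"): (0.0, 6.75)}
assert reconcile(vitest_rows, python_rows) == [(("26o", "raise 2.5"), "mismatch")]
```

Any non-empty diff list means the two runners are not querying (or decoding) the same thing, which would explain the 2026-04-23 pass.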

Suggested next steps (priority order)

  1. Re-run B1 I-series against _no_cap_trung — confirm I1/I3/I5 violation rates on the current production model. Run via the production vitest runner (gold standard).
  2. Modify production runner to iterate all combos per hand class — current combos[0]-only sampling is too thin to surface the scale of EV noise this re-run shows.
  3. Escalate I-series + KI-4 family to Trung (owner of EV) — the production model leaks 1-2 bb per hand on premium ranges (AA call-vs-raise gap = 1.27 bb is the cleanest example). Worth investigating whether it's a model defect, an EV-computation issue, or a training-coverage gap on alternative actions.
  4. Cross-link R7 ↔ I3 in b1-qa-tracker.md — connect the reviewer-page finding back to the B1 property identity so future runs are interpreted correctly.
  5. Run the same I-checks at postflop spots (BB defense, c-bet decisions) to see if EV-noise is preflop-specific or pervasive across decision types.
  6. Coach-surface display decision — should the UI display raw model EV for unselected actions, or apply a sanity-clamp (e.g. "EV unreliable: below training threshold" for hands with fold prob ≥ 95% and noise-suspect EV)? Routes to product after Track A completes.
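
The sanity-clamp from step 6 could look like the following on the display side. This is a product sketch, not an implemented API: the `display_ev` signature and both thresholds are hypothetical choices pending the Track A results.

```python
def display_ev(action_prob, action_ev, fold_prob, *,
               zero_p=0.001, fold_clamp=0.95):
    """Return the EV string to show, or a caution marker when the
    model's EV for an unselected action is likely noise."""
    if action_prob < zero_p and fold_prob >= fold_clamp:
        return "EV unreliable"  # never-played action on a near-pure fold
    return f"{action_ev:+.2f} bb"

# 26o at UTG: folds 100%, so the +6.77 raise EV is clamped.
assert display_ev(0.0, 6.77, 1.0) == "EV unreliable"
# AA at UTG: the played raise EV displays normally.
assert display_ev(1.0, 8.58, 0.0) == "+8.58 bb"
```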

Open questions for Trung + Nimit