Per-combo EV inconsistencies surfaced at RFI positions · classified as B1 known-failure (KI-4 / G-series / I-series family) · 2026-05-11
At RFI positions on Cash NLHE 6-max 100bb 2-blind no ante, the model's per-combo raise-action EV varies dramatically across "similar" folded hands. Standard junk hands (72o, 82o, 92o, T2o, J2o) show sensible raise EVs in the −0.2 to −2.0 bb range. But a small set of rarely-played buckets — most prominently 26o — show absurdly positive raise EVs (+4.98 bb at CO, +5.25 bb at MP, +6.77 bb at UTG) despite the model folding them 100%. The fold decision itself is consistent; the displayed EV column for unselected actions is unreliable.
A student or coach inspecting "What's the EV of 26o vs 72o at UTG?" will see two hands that both fold 100% but with displayed alternatives of very different EV. The fold decision is correct in both cases — but the surface column for "what if I raised?" creates a false story: 26o looks like a clear open (+6.77 bb) while 72o looks like a marginal one (−1.80 bb). The displayed EV is not trustworthy as a coaching signal.
The B1 consistency framework already defines and implements 26 EV-related properties across F/G/H/I series. The I-series (Policy-EV consistency, 5 properties) is the most directly relevant:
| ID | Property | What it checks | Catches R7? |
|---|---|---|---|
| I1 | High-prob actions have high EV | Dominant action (p > 0.5) EV within 30% of max EV range | Partial |
| I2 | Mixing implies indifference | Mixed actions (both p > 0.3) must have similar EV (≤ 0.5 BB) | No (only mixing) |
| I3 | Zero-prob High-EV flag | For each combo: any zero-prob action (p < 0.001) with EV > max-active-EV + 0.01 BB is a violation | YES (direct match) |
| I4 | Policy-weighted EV consistency | \|Σ p(a) × EV(a) − combo.ev\| ≤ 0.1 BB | Partial |
| I5 | EV ordering at policy extremes | Pure (100%) actions: EV should equal max action EV | YES |
All 5 I-series properties are implemented in projects_dev/agserving-rangeviewer-v2/tests/parity/properties/runners/classI.ts as of 2026-04-16. Together with 8 F-series (EV constraints), 6 G-series (EV monotonicity, known-failure family KI-4), and 7 H-series (EV coherence) properties, this gives 26 EV-related properties in total.
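As a rough illustration of what I3 and I5 assert, the sketch below applies both rules to a single combo. It is not the classI.ts implementation: the `Action` shape, the function names, and the I5 parameters (`pure_prob`, `tol_bb`) are assumptions made for this example. Only the I3 thresholds (p < 0.001, +0.01 BB) come from the I-series table.

```python
from dataclasses import dataclass

ZERO_PROB = 0.001     # I3: an action with p below this counts as zero-prob
EV_MARGIN_BB = 0.01   # I3: slack allowed above the best active EV


@dataclass
class Action:
    # Hypothetical per-action record; the real response shape may differ.
    name: str
    prob: float
    ev_bb: float


def i3_violations(actions):
    """Return zero-prob actions whose EV beats the best active EV by more than the margin."""
    active = [a for a in actions if a.prob >= ZERO_PROB]
    if not active:
        return []
    max_active_ev = max(a.ev_bb for a in active)
    return [
        a for a in actions
        if a.prob < ZERO_PROB and a.ev_bb > max_active_ev + EV_MARGIN_BB
    ]


def i5_violation(actions, pure_prob=0.999, tol_bb=0.01):
    """Return True if a pure action is not the max-EV action (thresholds are assumed)."""
    pure = [a for a in actions if a.prob >= pure_prob]
    if not pure:
        return False
    return pure[0].ev_bb < max(a.ev_bb for a in actions) - tol_bb


# UTG examples using the EVs reported in this note.
combo_26o = [Action("fold", 1.0, 0.00), Action("raise 2.5", 0.0, 6.77)]
combo_72o = [Action("fold", 1.0, 0.00), Action("raise 2.5", 0.0, -1.80)]

flagged_26o = i3_violations(combo_26o)
flagged_72o = i3_violations(combo_72o)
```

With these inputs, 26o is flagged by both rules, while 72o passes both because its unselected raise has a lower EV than the fold.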
I replicated the I1/I3/I4/I5 checks against the current production endpoint (model moe_dynamic_703_universal_v1_20260330_0900_no_cap_trung.onnx) across 6 situations, iterating all 1,326 combos per situation. The production runner, by contrast, samples only combos[0] per hand class. Test script: tests/pull_b1_iclass_rerun.py.
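To put the sampling difference in numbers, the following sketch enumerates every two-card combo and groups the combos into hand classes. The card encoding and the `hand_class` helper are assumptions made for this example and say nothing about the endpoint's response format. The helper writes the higher rank first, so this note's 26o appears as 62o.

```python
from collections import defaultdict
from itertools import combinations

RANKS = "23456789TJQKA"
SUITS = "cdhs"


def hand_class(card_a, card_b):
    """Map two cards such as 'Ah' and 'Kd' to a class label such as 'AKo'."""
    hi, lo = sorted((card_a, card_b), key=lambda c: RANKS.index(c[0]), reverse=True)
    if hi[0] == lo[0]:
        return hi[0] + lo[0]  # pocket pair, e.g. 'AA'
    return hi[0] + lo[0] + ("s" if hi[1] == lo[1] else "o")


deck = [rank + suit for rank in RANKS for suit in SUITS]

# Group every unordered pair of distinct cards by its hand class.
classes = defaultdict(list)
for a, b in combinations(deck, 2):
    classes[hand_class(a, b)].append((a, b))

total_combos = sum(len(combos) for combos in classes.values())
sampled = len(classes)  # one combo per class, as with combos[0]-only sampling
coverage = sampled / total_combos

print(total_combos, sampled, f"{coverage:.1%}")
```

One combo per class covers 169 of the 1,326 combos, roughly 13%. An offsuit class such as 62o has 12 combos, so 11 of them are never inspected under combos[0]-only sampling.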
| Property | Tested | Violations | Rate | Implication |
|---|---|---|---|---|
| I1 | 7,860 | 1,628 | 20.7% | 1 in 5 dominant actions have EV >30% below max |
| I3 | 45,598 | 12,988 | 28.5% | 1 in 3.5 zero-prob actions have EV exceeding max-active EV |
| I4 | 0 | 0 | n/a | combo.ev field absent in response; can't run |
| I5 | 5,186 | 3,446 | 66.4% | 2 in 3 pure (100%) actions are not the max-EV option |
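As a sanity check, the violation rates can be recomputed from the tested and violation counts in the table. The dictionary layout below is only for illustration. I4 is left out because nothing was tested for it.

```python
# (tested, violations) per property, copied from the results table.
counts = {
    "I1": (7_860, 1_628),
    "I3": (45_598, 12_988),
    "I5": (5_186, 3_446),
}

# Violation rate = violations / tested.
rates = {prop: violations / tested for prop, (tested, violations) in counts.items()}

for prop, rate in rates.items():
    print(f"{prop}: {rate:.1%}")
```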
The 26o anomaly is one visible instance of a broader pattern. The same check fires on premium hands:
| Combo | Situation | Played action | Played EV | Zero-prob alt | Zero-prob EV | Gap |
|---|---|---|---|---|---|---|
| AA | UTG open | raise (100%) | +8.58 | call | +9.85 | +1.27 bb |
| AKs | UTG open | raise 2.5 (100%) | +0.97 | raise 5.0 | +2.27 | +1.30 bb |
| AKs | UTG open | raise 2.5 (100%) | +0.97 | raise 7.5 | +2.88 | +1.91 bb |
| 26o | UTG open | fold (100%) | 0.00 | raise 2.5 | +6.77 | +6.77 bb |
The AA example is especially striking — the model "leaves +1.27 bb on the table per AA hand" by raising instead of calling, according to its own EV estimates. Calling AA from UTG is not a real strategic option; the model's EV for the unselected call is just noisy.
Nimit's 2026-04-23 B1 cross-format sweep did not list I3 in failing properties for any format. Three plausible reasons:
- [...] _no_cap_trung model. The current endpoint serves _no_cap_trung, which may have introduced or amplified EV noise on rarely-played buckets.
- The production runner samples only hand.combos[0] per hand class (~169 classes). My re-run iterated all 1,326 combos. Suit-specific anomalies in combos[1+] could be missed by the production runner.

Two parallel tracks. Both required.
The framework is already designed to catch this class of issue. The fix here is operational, not methodological.
[...] moe_dynamic_703_universal_v1_20260330_0900_no_cap_trung.onnx (current production endpoint).

Properties existed and "passed" in Nimit's 2026-04-23 run, but my expanded re-run shows 28-66% violations. Either the model changed or the production runner has blind spots.
- [...] hand.combos[0] sampling. Modify the production runner to iterate all combos per hand class (or at least sample multiple per class) so suit-specific EV outliers surface.
- [...] raise_to = cumulative chips, rake_cap_bb=3, no grid_weight_type).
- [...] tests/pull_b1_iclass_rerun.py). Reconcile any disagreement.
- [...] _no_cap_trung: confirm I1/I3/I5 violation rates on the current production model. Run via the production vitest runner (gold standard).
- [...] combos[0]-only sampling is too thin to surface the scale of EV noise this re-run shows.
- [...] b1-qa-tracker.md: connect the reviewer-page finding back to the B1 property identity so future runs are interpreted correctly.
- [...] combos[0]-only sampling masking the violations?