R8 · Stand-Up Game deep-stack OOD — Resolution Report
Stand-Up Game solver produces unstable verdicts at 1260bb · root cause is training-coverage gap; symptom is consistency violation · 2026-05-11
P2
B1 known-failure
Surfaced by: Mariano AA $340K comic article
Editorial mitigation: already shipped (500bb-only)
The issue (one paragraph)
At deep Stand-Up Game stack depths (1260bb in the Mariano AA hand), the model's verdicts flip fold ↔ jam ↔ call across small parameter perturbations and violate val-monotonicity (fold% should decrease and commit% should increase as val increases from 1 → 10; at 1260bb it doesn't). The root cause is a training-coverage gap — the Stand-Up Game model wasn't trained at this stack-depth regime. The way we DETECTED it was via consistency checks (val-monotonicity sweep), which is exactly the right diagnostic: if monotonicity breaks, the query is out-of-distribution.
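The val-monotonicity diagnostic described above can be sketched as a small predicate. This is a hypothetical illustration, not the real analyst tooling: the `(fold%, commit%)` tuple shape and the example numbers are assumptions; only the property itself (fold% non-increasing, commit% non-decreasing as val rises) comes from the report.

```python
def is_val_monotone(verdicts):
    """verdicts: list of (fold_pct, commit_pct) tuples ordered by
    ascending val (e.g. val in {1, 2, 3, 5, 10}).

    Returns True iff fold% is non-increasing and commit% is
    non-decreasing across the sweep -- the property that holds
    in-distribution and breaks at 1260bb.
    """
    folds = [v[0] for v in verdicts]
    commits = [v[1] for v in verdicts]
    fold_ok = all(a >= b for a, b in zip(folds, folds[1:]))
    commit_ok = all(a <= b for a, b in zip(commits, commits[1:]))
    return fold_ok and commit_ok


# Illustrative in-distribution sweep: passes.
assert is_val_monotone([(80, 5), (70, 10), (55, 20), (30, 45), (5, 90)])
# Illustrative 1260bb-style non-monotonic sweep: fails.
assert not is_val_monotone([(80, 5), (40, 30), (65, 15), (20, 60), (50, 25)])
```

If this check returns False for a query, treat that query as out-of-distribution rather than trusting the verdict percentages.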
Context — Mariano AA $340K hand
Mariano vs Lambo Tyler, HCL High Stakes Friday, Jan 30 2026. AA vs 88. Stand-Up Game with asymmetric marker counts (Mariano + Tyler still need a NIT button; other 5 seats have one). Effective stack depth: ~1260bb ($252K / $200 BB).
When the comic article author queried this real state, verdicts flipped across small parameter perturbations — not the smooth gradient you'd expect from a converged equilibrium. The article shipped only 500bb-validated queries as the editorial answer; 1260bb-specific verdicts were not published.
What we observed (consistency symptoms)
| Consistency check | What it tests | What 1260bb shows |
| --- | --- | --- |
| Val-monotonicity | fold% monotone-decreasing and commit% monotone-increasing across val ∈ {1, 2, 3, 5, 10} | Violates — non-monotonic at 1260bb (was the OOD diagnostic) |
| Verdict stability across perturbations | Nudging state counts or depth slightly should produce continuous changes, not categorical flips | Violates — fold ↔ jam ↔ call categorical flips |
| B1 stack-depth continuity | "No cliffs between adjacent depths" — automated B1 property check | Likely violates; needs an explicit run at 1260bb to confirm (current runner tests stacks [20, 50, 100, 200] only) |
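The perturbation-stability symptom in the table can be sketched the same way. Everything here is illustrative: `solve` stands in for the real solver call, and the ±1–2% depth nudges and toy verdict behavior are assumptions chosen to show size-1 vs size>1 verdict sets.

```python
def perturbation_verdicts(solve, base_depth_bb,
                          rel_nudges=(-0.02, -0.01, 0.01, 0.02)):
    """Collect the set of categorical verdicts seen across small depth
    perturbations. In-distribution this set should have size 1 (the
    percentages move continuously, the category stays put); at 1260bb
    we observed fold/jam/call flips, i.e. size > 1."""
    seen = {solve(base_depth_bb)}
    for nudge in rel_nudges:
        seen.add(solve(base_depth_bb * (1 + nudge)))
    return seen


# Toy solver that is stable at 500bb but flips categorically deeper in:
def toy_solve(depth_bb):
    if depth_bb < 700:
        return "call"
    return ("fold", "jam", "call")[int(depth_bb) % 3]


assert perturbation_verdicts(toy_solve, 500) == {"call"}
assert len(perturbation_verdicts(toy_solve, 1260)) > 1
```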
Root cause
Training-coverage gap. The Stand-Up Game model was trained at stack depths within a bounded range (the default 500bb is what's tested). Deeper stacks (1500bb-class) are outside that range, so the model extrapolates rather than interpolating within its training distribution. Consistency violations (val-monotonicity break, perturbation instability) are how OOD manifests in observable output.
Connection to existing tracking
- Adjacent to KI-5 family (training-coverage thin-spots) — but for Stand-Up Game at deep stacks specifically, not the same spot as KI-5 (MP val=1 in Squid). Not previously filed as its own KI.
- Same family as R7 / B1 KI-4 (model behavior in low-training-signal regimes). EV/policy noise on rare buckets at RFI (R7) and verdict flipping at deep-stack Stand-Up (R8) are two manifestations of the same root cause: where training signal is thin, output is unstable.
- Memory reference: feedback_squid_query_actual_marker_state (May 2026) codifies the val-monotonicity check as the canonical OOD diagnostic.
Resolution
Track A — Improve training coverage (model team)
Owner: Trung / model training team
- Verify current Stand-Up Game training coverage at deep stacks (500bb–1500bb regime).
- Decide whether to:
- Extend training data — add deep-stack Stand-Up scenarios to the next training run; or
- Document a "max-supported depth" gate in solver clients (e.g., refuse to serve Stand-Up queries above 700bb with a clear error, rather than returning unstable output silently).
- Re-run val-monotonicity sweep at 500/800/1000/1260/1500bb after training extension. Confirm the property passes across the extended range.
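The "max-supported depth" gate option above could look like the following. This is a minimal sketch, not the real `strategy_grid_client.py` API: the function name, exception class, and per-mode limits dict are assumptions; the 700bb Stand-Up cutoff comes from the bullet above and the 200bb cash figure from the current B1 runner coverage.

```python
# Assumed per-game-mode supported depth ceilings (illustrative values).
MAX_SUPPORTED_DEPTH_BB = {
    "standup": 700,   # model not validated beyond this regime
    "cash": 200,      # matches current B1 runner coverage
}


class UnsupportedStackDepth(ValueError):
    """Raised instead of silently returning unstable solver output."""


def check_depth_supported(game_mode, depth_bb):
    """Refuse out-of-range queries with a clear error."""
    limit = MAX_SUPPORTED_DEPTH_BB.get(game_mode)
    if limit is None:
        raise UnsupportedStackDepth(f"unknown game mode: {game_mode!r}")
    if depth_bb > limit:
        raise UnsupportedStackDepth(
            f"{game_mode} queries supported up to {limit}bb; got {depth_bb}bb"
        )
```

The design point is the error path: a hard, explanatory refusal at the client and serving boundary is preferable to an answer the model cannot back with training signal.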
Track B — Extend B1 runner stack-depth coverage
Owner: Nimit / B1 framework
- Current A1 / B1 (stack-depth continuity) runners test stacks [20, 50, 100, 200] only — covers cash but doesn't reach the Stand-Up deep-stack regime.
- Extend the stack-depth axis in the runner DSL to include [500, 800, 1260, 1500] for Stand-Up Game specifically.
- Add val-monotonicity as a first-class B1 property runner for Stand-Up mode (today it's analyst discipline per feedback_squid_query_actual_marker_state; it should be automated). See R9 for the framework-side gap.
- Run extended B1 against current Stand-Up model — expected outcome: B1 stack-depth continuity property FAILS at 800bb+ on Stand-Up. This becomes the property-test signal for when Track A's training extension lands.
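A per-game-mode stack-depth axis for the runner (which the open question to Nimit asks about) could be sketched like this. The depth lists come from the text; the config shape and helper name are assumptions, not the actual B1 runner DSL.

```python
# Assumed per-mode depth axes; lists taken from the report text.
B1_STACK_DEPTHS_BB = {
    "cash": [20, 50, 100, 200],          # current coverage
    "standup": [500, 800, 1260, 1500],   # proposed extension
}


def adjacent_depth_pairs(game_mode):
    """Pairs of adjacent depths to check for the stack-depth-continuity
    property ("no cliffs between adjacent depths")."""
    depths = B1_STACK_DEPTHS_BB[game_mode]
    return list(zip(depths, depths[1:]))


assert adjacent_depth_pairs("standup") == [(500, 800), (800, 1260), (1260, 1500)]
```

Keeping the axis per-mode means cash coverage stays untouched while Stand-Up picks up the deep-stack pairs.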
Suggested next steps (priority order)
1. Decide between training extension vs max-depth gate (Trung / model team). Gating is cheaper and ships faster; extension is the real fix. Either is acceptable as long as the API stops silently returning unstable output.
2. If gating: implement the max-depth check in strategy_grid_client.py and the agserving endpoint. Document the supported depth range per game mode in the API docs.
3. If extending: add deep-stack Stand-Up training scenarios to the next training run. Track convergence via the val-monotonicity sweep.
4. Extend B1 runner stack-depth coverage regardless of which Track A path is chosen — this gives us the test signal for when the model behavior changes.
5. Add val-monotonicity as a B1 property runner (see R9 — separate finding).
Open questions
- Trung / model team: What's the current training-distribution upper bound on stack depth for Stand-Up Game? Is it documented anywhere?
- Trung / model team: Cost of extending training to 1500bb vs cost of adding a max-depth gate to solver clients? Which approach makes sense given roadmap?
- Nimit: Can the B1 runner accept a stack-depth-range parameter per game mode (so Stand-Up gets [500, 800, 1260, 1500] while Cash stays at [20, 50, 100, 200])?
- Product / coaches: Is there a coaching surface that currently lets coaches query Stand-Up at 1260bb? If so, what does it show today — the unstable output, or is there already a gate?