R8 · Stand-Up Game deep-stack OOD — Resolution Report
Stand-Up Game solver produces unstable verdicts at 1260bb · root cause is training-coverage gap; symptom is consistency violation · 2026-05-11
P2
B1 known-failure
Surfaced by: Mariano AA $340K comic article
Editorial mitigation: already shipped (500bb-only)
The issue (one paragraph)
At deep Stand-Up Game stack depths (1260bb in the Mariano AA hand), the model's verdicts flip fold ↔ jam ↔ call across small parameter perturbations and violate val-monotonicity (fold% should decrease and commit% should increase as val increases from 1 → 10; at 1260bb it doesn't). The root cause is a training-coverage gap — the Stand-Up Game model wasn't trained at this stack-depth regime. The way we DETECTED it was via consistency checks (val-monotonicity sweep), which is exactly the right diagnostic: if monotonicity breaks, the query is out-of-distribution.
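The val-monotonicity diagnostic described above can be sketched as a small predicate. This is a hypothetical illustration, not the real analyst tooling: the `(fold%, commit%)` tuple shape and the example numbers are assumptions; only the property itself (fold% non-increasing, commit% non-decreasing as val rises) comes from the report.

```python
def is_val_monotone(verdicts):
    """verdicts: list of (fold_pct, commit_pct) tuples ordered by
    ascending val (e.g. val in {1, 2, 3, 5, 10}).

    Returns True iff fold% is non-increasing and commit% is
    non-decreasing across the sweep -- the property that holds
    in-distribution and breaks at 1260bb.
    """
    folds = [v[0] for v in verdicts]
    commits = [v[1] for v in verdicts]
    fold_ok = all(a >= b for a, b in zip(folds, folds[1:]))
    commit_ok = all(a <= b for a, b in zip(commits, commits[1:]))
    return fold_ok and commit_ok


# Illustrative in-distribution sweep: passes.
assert is_val_monotone([(80, 5), (70, 10), (55, 20), (30, 45), (5, 90)])
# Illustrative 1260bb-style non-monotonic sweep: fails.
assert not is_val_monotone([(80, 5), (40, 30), (65, 15), (20, 60), (50, 25)])
```

If this check returns False for a query, treat that query as out-of-distribution rather than trusting the verdict percentages.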
Context — Mariano AA $340K hand
Mariano vs Lambo Tyler, HCL High Stakes Friday, Jan 30 2026. AA vs 88. Stand-Up Game with asymmetric marker counts (Mariano + Tyler still need a NIT button; other 5 seats have one). Effective stack depth: ~1260bb ($252K / $200 BB).
When the comic article author queried this real state, verdicts flipped across small parameter perturbations — not the smooth gradient you'd expect from a converged equilibrium. The article shipped only 500bb-validated queries as the editorial answer; 1260bb-specific verdicts were not published.
What we observed (consistency symptoms)
| Consistency check | What it tests | What 1260bb shows |
| --- | --- | --- |
| Val-monotonicity | fold% monotone-decreasing and commit% monotone-increasing across val ∈ {1, 2, 3, 5, 10} | Violates — non-monotonic at 1260bb (was the OOD diagnostic) |
| Verdict stability across perturbations | Nudging state counts or depth slightly should produce continuous changes, not categorical flips | Violates — fold ↔ jam ↔ call categorical flips |
| B1 stack-depth continuity | "No cliffs between adjacent depths" — automated B1 property check | Likely violates; needs an explicit run at 1260bb to confirm (current runner tests stacks [20, 50, 100, 200] only) |
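The perturbation-stability symptom in the table can be sketched the same way. Everything here is illustrative: `solve` stands in for the real solver call, and the ±1–2% depth nudges and toy verdict behavior are assumptions chosen to show size-1 vs size>1 verdict sets.

```python
def perturbation_verdicts(solve, base_depth_bb,
                          rel_nudges=(-0.02, -0.01, 0.01, 0.02)):
    """Collect the set of categorical verdicts seen across small depth
    perturbations. In-distribution this set should have size 1 (the
    percentages move continuously, the category stays put); at 1260bb
    we observed fold/jam/call flips, i.e. size > 1."""
    seen = {solve(base_depth_bb)}
    for nudge in rel_nudges:
        seen.add(solve(base_depth_bb * (1 + nudge)))
    return seen


# Toy solver that is stable at 500bb but flips categorically deeper in:
def toy_solve(depth_bb):
    if depth_bb < 700:
        return "call"
    return ("fold", "jam", "call")[int(depth_bb) % 3]


assert perturbation_verdicts(toy_solve, 500) == {"call"}
assert len(perturbation_verdicts(toy_solve, 1260)) > 1
```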
Root cause
Training-coverage gap. The Stand-Up Game model was trained at stack depths within a bounded range (the default 500bb is what's tested). Deeper stacks (1500bb-class) are outside that range, so the model extrapolates rather than interpolating within its training distribution. Consistency violations (val-monotonicity break, perturbation instability) are how OOD manifests in observable output.
Connection to existing tracking
- Adjacent to KI-5 family (training-coverage thin-spots) — but for Stand-Up Game at deep stacks specifically, not the same spot as KI-5 (MP val=1 in Squid). Not previously filed as its own KI.
- Same family as R7 / B1 KI-4 (model behavior in low-training-signal regimes). EV/policy noise on rare buckets at RFI (R7) and verdict flipping at deep-stack Stand-Up (R8) are two manifestations of the same root cause: where training signal is thin, output is unstable.
- Memory reference: feedback_squid_query_actual_marker_state (May 2026) codifies the val-monotonicity check as the canonical OOD diagnostic.
Resolution
Track A — Improve training coverage (model team)
Owner: Trung / model training team
- Verify current Stand-Up Game training coverage at deep stacks (500bb–1500bb regime).
- Decide whether to:
- Extend training data — add deep-stack Stand-Up scenarios to the next training run; or
- Document a "max-supported depth" gate in solver clients (e.g., refuse to serve Stand-Up queries above 700bb with a clear error, rather than returning unstable output silently).
- Re-run val-monotonicity sweep at 500/800/1000/1260/1500bb after training extension. Confirm the property passes across the extended range.
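The "max-supported depth" gate option above could look like the following. This is a minimal sketch, not the real `strategy_grid_client.py` API: the function name, exception class, and per-mode limits dict are assumptions; the 700bb Stand-Up cutoff comes from the bullet above and the 200bb cash figure from the current B1 runner coverage.

```python
# Assumed per-game-mode supported depth ceilings (illustrative values).
MAX_SUPPORTED_DEPTH_BB = {
    "standup": 700,   # model not validated beyond this regime
    "cash": 200,      # matches current B1 runner coverage
}


class UnsupportedStackDepth(ValueError):
    """Raised instead of silently returning unstable solver output."""


def check_depth_supported(game_mode, depth_bb):
    """Refuse out-of-range queries with a clear error."""
    limit = MAX_SUPPORTED_DEPTH_BB.get(game_mode)
    if limit is None:
        raise UnsupportedStackDepth(f"unknown game mode: {game_mode!r}")
    if depth_bb > limit:
        raise UnsupportedStackDepth(
            f"{game_mode} queries supported up to {limit}bb; got {depth_bb}bb"
        )
```

The design point is the error path: a hard, explanatory refusal at the client and serving boundary is preferable to an answer the model cannot back with training signal.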
Track B — Extend B1 runner stack-depth coverage
Owner: Nimit / B1 framework
- Current A1 / B1 (stack-depth continuity) runners test stacks [20, 50, 100, 200] only — covers cash but doesn't reach the Stand-Up deep-stack regime.
- Extend the stack-depth axis in the runner DSL to include [500, 800, 1260, 1500] for Stand-Up Game specifically.
- Add val-monotonicity as a first-class B1 property runner for Stand-Up mode (today it's analyst discipline per feedback_squid_query_actual_marker_state; it should be automated). See R9 for the framework-side gap.
- Run extended B1 against current Stand-Up model — expected outcome: B1 stack-depth continuity property FAILS at 800bb+ on Stand-Up. This becomes the property-test signal for when Track A's training extension lands.
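A per-game-mode stack-depth axis for the runner (which the open question to Nimit asks about) could be sketched like this. The depth lists come from the text; the config shape and helper name are assumptions, not the actual B1 runner DSL.

```python
# Assumed per-mode depth axes; lists taken from the report text.
B1_STACK_DEPTHS_BB = {
    "cash": [20, 50, 100, 200],          # current coverage
    "standup": [500, 800, 1260, 1500],   # proposed extension
}


def adjacent_depth_pairs(game_mode):
    """Pairs of adjacent depths to check for the stack-depth-continuity
    property ("no cliffs between adjacent depths")."""
    depths = B1_STACK_DEPTHS_BB[game_mode]
    return list(zip(depths, depths[1:]))


assert adjacent_depth_pairs("standup") == [(500, 800), (800, 1260), (1260, 1500)]
```

Keeping the axis per-mode means cash coverage stays untouched while Stand-Up picks up the deep-stack pairs.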
Suggested next steps (priority order)
1. Decide between training extension vs max-depth gate (Trung / model team). Gating is cheaper and ships faster; extension is the real fix. Either is acceptable as long as the API stops silently returning unstable output.
2. If gating: implement the max-depth check in strategy_grid_client.py and the agserving endpoint. Document the supported depth range per game mode in the API docs.
3. If extending: add deep-stack Stand-Up training scenarios to the next training run. Track convergence via the val-monotonicity sweep.
4. Extend B1 runner stack-depth coverage regardless of which Track A path is chosen — this gives us the test signal for when the model behavior changes.
5. Add val-monotonicity as a B1 property runner (see R9 — separate finding).
Open questions
- Trung / model team: What's the current training-distribution upper bound on stack depth for Stand-Up Game? Is it documented anywhere?
- Trung / model team: Cost of extending training to 1500bb vs cost of adding a max-depth gate to solver clients? Which approach makes sense given roadmap?
- Nimit: Can the B1 runner accept a stack-depth-range parameter per game mode (so Stand-Up gets [500, 800, 1260, 1500] while Cash stays at [20, 50, 100, 200])?
- Product / coaches: Is there a coaching surface that currently lets coaches query Stand-Up at 1260bb? If so, what does it show today — the unstable output, or is there already a gate?