
R8 · Stand-Up Game deep-stack OOD — Resolution Report

Stand-Up Game solver produces unstable verdicts at 1260bb · root cause is training-coverage gap; symptom is consistency violation · 2026-05-11

P2 · B1 · known-failure
Surfaced by: Mariano AA $340K comic article
Editorial mitigation: already shipped (500bb-only)

The issue (one paragraph)

At deep Stand-Up Game stack depths (1260bb in the Mariano AA hand), the model's verdicts flip fold ↔ jam ↔ call across small parameter perturbations and violate val-monotonicity (fold% should decrease and commit% should increase as val rises from 1 to 10; at 1260bb it doesn't). The root cause is a training-coverage gap: the Stand-Up Game model wasn't trained at this stack-depth regime. We detected it via consistency checks (the val-monotonicity sweep), which is exactly the right diagnostic: if monotonicity breaks, the query is out-of-distribution.
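As a concrete illustration of the diagnostic, here is a minimal sketch of the sweep, assuming a hypothetical solve() client that returns per-verdict frequencies for a Stand-Up Game state (the real strategy_grid_client interface may differ):

```python
# Sketch of the val-monotonicity sweep used as the OOD diagnostic. `solve` is
# an assumed client call returning verdict frequencies, e.g.
# {"fold": 0.62, "commit": 0.31, ...}; the real interface may differ.
VALS = [1, 2, 3, 5, 10]

def val_monotonicity_holds(solve, state, depth_bb, tol=1e-6):
    """fold% must be monotone-decreasing and commit% monotone-increasing in val."""
    results = [solve(state, depth_bb=depth_bb, val=v) for v in VALS]
    folds = [r["fold"] for r in results]
    commits = [r["commit"] for r in results]
    fold_ok = all(b <= a + tol for a, b in zip(folds, folds[1:]))
    commit_ok = all(b >= a - tol for a, b in zip(commits, commits[1:]))
    return fold_ok and commit_ok  # False => treat the query as out-of-distribution
```

At validated depths (e.g., 500bb) the check passes; at 1260bb it fails, which is the signal that the query sits outside training coverage.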

Context — Mariano AA $340K hand

Mariano vs Lambo Tyler, HCL High Stakes Friday, Jan 30 2026. AA vs 88. Stand-Up Game with asymmetric marker counts (Mariano and Tyler still need a NIT button; the other five seats each have one). Effective stack depth: ~1260bb ($252K effective / $200 big blind).

When the comic-article author queried this real state, verdicts flipped across small parameter perturbations rather than showing the smooth gradient you'd expect from a converged equilibrium. The article shipped only 500bb-validated queries as the editorial answer; 1260bb-specific verdicts were not published.

What we observed (consistency symptoms)

| Consistency check | What it tests | What 1260bb shows |
| --- | --- | --- |
| Val-monotonicity | fold% monotone-decreasing and commit% monotone-increasing across val ∈ {1, 2, 3, 5, 10} | Violates: non-monotonic at 1260bb (this was the OOD diagnostic) |
| Verdict stability across perturbations | Nudging state counts or depth slightly should produce continuous changes, not categorical flips | Violates: fold ↔ jam ↔ call categorical flips |
| B1 (stack-depth continuity) | "No cliffs between adjacent depths"; automated B1 property check | Likely violates; needs an explicit run at 1260bb to confirm (current runner tests stacks [20, 50, 100, 200] only) |
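The second row can be automated the same way: nudge the depth (or marker counts) slightly and flag categorical verdict flips. A sketch, reusing the hypothetical solve() client from above:

```python
def verdict_stable_under_perturbation(solve, state, depth_bb, val, deltas=(-5, 5)):
    """Small depth nudges should change frequencies continuously,
    not flip the argmax verdict (fold/jam/call) categorically."""
    def top_verdict(d):
        freqs = solve(state, depth_bb=d, val=val)
        return max(freqs, key=freqs.get)  # highest-frequency verdict

    base = top_verdict(depth_bb)
    return all(top_verdict(depth_bb + delta) == base for delta in deltas)
```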

Root cause

Training-coverage gap. The Stand-Up Game model was trained at stack depths within a bounded range (500bb is the default, and the only depth that's routinely tested). Deeper stacks (the 1500bb class) fall outside that range, so the model extrapolates instead of interpolating within its training distribution. The consistency violations (val-monotonicity break, perturbation instability) are how that OOD behavior shows up in observable output.

Connection to existing tracking

Resolution

Track A — Improve training coverage (model team)

Owner: Trung / model training team

  1. Verify current Stand-Up Game training coverage at deep stacks (500bb–1500bb regime).
  2. Decide whether to:
    • Extend training data — add deep-stack Stand-Up scenarios to the next training run; or
    • Document a "max-supported depth" gate in solver clients (e.g., refuse to serve Stand-Up queries above 700bb with a clear error rather than silently returning unstable output); see the gate sketch after this list.
  3. Re-run val-monotonicity sweep at 500/800/1000/1260/1500bb after training extension. Confirm the property passes across the extended range.
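If the gate path is chosen, here is a minimal sketch of what it could look like in strategy_grid_client.py. MAX_SUPPORTED_DEPTH_BB, UnsupportedDepthError, and check_depth_supported are illustrative names, not existing code, and the 700bb Stand-Up limit is the provisional value from step 2:

```python
# Illustrative max-depth gate; all names here are hypothetical, not existing
# client code. The 700bb Stand-Up limit is the provisional value from step 2.
MAX_SUPPORTED_DEPTH_BB = {"standup": 700}

class UnsupportedDepthError(ValueError):
    """Raised instead of silently serving out-of-distribution verdicts."""

def check_depth_supported(game_mode: str, depth_bb: float) -> None:
    limit = MAX_SUPPORTED_DEPTH_BB.get(game_mode)
    if limit is not None and depth_bb > limit:
        raise UnsupportedDepthError(
            f"{game_mode} queries above {limit}bb are outside validated "
            f"training coverage (requested {depth_bb}bb); verdicts are unstable."
        )
```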

Track B — Extend B1 runner stack-depth coverage

Owner: Nimit / B1 framework

  1. The current A1 / B1 (stack-depth continuity) runners test stacks [20, 50, 100, 200] only; this covers cash games but doesn't reach the Stand-Up deep-stack regime.
  2. Extend the stack-depth axis in the runner DSL to include [500, 800, 1260, 1500] for Stand-Up Game specifically (see the sketch after this list).
  3. Add val-monotonicity as a first-class B1 property runner for Stand-Up mode (today it's analyst discipline per feedback_squid_query_actual_marker_state; should be automated). See R9 for the framework-side gap.
  4. Run extended B1 against current Stand-Up model — expected outcome: B1 stack-depth continuity property FAILS at 800bb+ on Stand-Up. This becomes the property-test signal for when Track A's training extension lands.
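A sketch of what steps 2 and 3 could look like in the runner, assuming a register_property-style DSL (the real B1 framework API may differ); it reuses the val_monotonicity_holds sketch from earlier, and the 0.2 jump threshold and val=5 query point are illustrative choices:

```python
# Hypothetical runner registration; the real B1 DSL may differ.
STANDUP_STACKS = [500, 800, 1260, 1500]  # extended Stand-Up stack-depth axis

def depth_continuity_holds(solve, state, lo_bb, hi_bb, val, max_jump=0.2):
    """B1 stack-depth continuity: no verdict cliffs between adjacent depths."""
    lo = solve(state, depth_bb=lo_bb, val=val)
    hi = solve(state, depth_bb=hi_bb, val=val)
    return all(abs(lo[k] - hi[k]) <= max_jump for k in ("fold", "commit"))

def register_standup_properties(runner, solve):
    for depth in STANDUP_STACKS:
        runner.register_property(
            name=f"standup/val_monotonicity/{depth}bb",
            check=lambda state, d=depth: val_monotonicity_holds(solve, state, d),
        )
    for lo, hi in zip(STANDUP_STACKS, STANDUP_STACKS[1:]):
        runner.register_property(
            name=f"standup/depth_continuity/{lo}bb-{hi}bb",
            check=lambda state, a=lo, b=hi: depth_continuity_holds(
                solve, state, a, b, val=5
            ),
        )
```

Per step 4, running this against the current model should fail the continuity properties at 800bb+, and those same properties then flip green when Track A's training extension lands.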

Suggested next steps (priority order)

  1. Decide between training extension vs max-depth gate (Trung / model team). Gating is cheaper and ships faster; extension is the real fix. Either is acceptable as long as the API stops returning unstable output silently.
  2. If gating: implement max-depth check in strategy_grid_client.py + agserving endpoint. Document the supported depth range per game mode in API docs.
  3. If extending: add deep-stack Stand-Up training scenarios to next training run. Track convergence via val-monotonicity sweep.
  4. Extend B1 runner stack-depth coverage regardless of which Track A path is chosen — gives us the test signal for when the model behavior changes.
  5. Add val-monotonicity as a B1 property runner (see R9 — separate finding).

Open questions