
R9 · Stand-Up Game depth-flip on Klu T9 — Resolution Report

Categorical sign change between 100bb and 226bb — call → fold — at Stand-Up val=10 · 2026-05-11

P2 · B1 · known-failure
Surfaced by: HCL 5-way Stand-Up comic article
B1 property needed: SQ4 (stack-depth × val × state continuity)

The issue (one paragraph)

In the HCL 5-way Stand-Up Game hand (Jan 3, 2025) at val=10 with Klu standing (Scenario D), the solver gives opposite verdicts for Klu's T9 depending on stack depth: at 100bb Klu calls 35% of the time; at 226bb Klu folds 100%. Standard poker intuition says format-pressure effects (the Stand-Up penalty) should weaken as stacks deepen, because the penalty becomes proportionally smaller relative to the pot; they should not flip direction. The flip is categorical (call → fold), not a smooth shift. The equivalent NLHE cash sweep does not exhibit this behavior (depth-stable within ±2pp on UTG RFI across 50–1000bb), so this looks Stand-Up-specific.

Data — the flip

| Stack depth | Klu T9 verdict | Probability |
| --- | --- | --- |
| 100bb | Call | 35% |
| 226bb | Fold | 100% |

Setup: HCL 5-way Stand-Up, Scenario D (Peter + Adi + Klu standing), val=10. Source: hcl-five-way-stand-up/deviation-log.md · c:/tmp/hcl-five-way-research/SUMMARY.md Open Question #3.

Supporting evidence — NLHE cash control sweep

To distinguish "stack-depth model behavior in general" from "Stand-Up-specific depth interaction," we ran an analogous depth sweep on NLHE cash UTG RFI across [50, 100, 150, 200, 300, 500, 800, 1000] bb. Result: NLHE cash is stable — total swing in range-aggregate RFI is −1.73pp from 50bb to 1000bb. The only flagged hand (Q9s) shows a smooth 13pp drift across 950bb of depth, not a categorical flip.

| Comparison | Depth range | Verdict swing | Type |
| --- | --- | --- | --- |
| H1: Klu T9, Stand-Up val=10 | 100bb → 226bb (126bb) | Call 35% → Fold 100% (−65pp) | CATEGORICAL FLIP |
| NLHE Q9s, UTG RFI | 500bb → 800bb (300bb) | Raise 50% → Fold 54% (+4pp) | Smooth shift |
| NLHE 87s, UTG RFI | 50bb → 1000bb (950bb) | Fold 78% → 58% → 58% (smooth U-curve) | Smooth |

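By the table's figures, Stand-Up's flip (−65pp across 126bb) is ~16× larger in magnitude than NLHE's biggest step shift (Q9s, +4pp) and happens across ~8× less depth than the 950bb NLHE sweep. This is a different phenomenon, not a general model-depth issue.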
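The two ratios can be reproduced from the comparison table under one reading of how they were derived. This reading is an assumption: magnitude is taken against the Q9s +4pp step, and depth against the full 950bb control sweep. The last ratio uses the 13pp Q9s drift quoted in the control-sweep paragraph instead, which gives a smaller multiple.

```typescript
// All figures are copied from this report; nothing here is new data.
const standUpSwingPp = 65;            // Klu T9, as reported in the table
const standUpSpanBb = 226 - 100;      // 126bb between the two depths
const q9sStepPp = 4;                  // Q9s, 500bb -> 800bb
const q9sDriftPp = 13;                // Q9s drift quoted for the full sweep
const nlheSweepSpanBb = 1000 - 50;    // 950bb control sweep

// Assumed derivation of "~16x larger in magnitude".
const magnitudeRatio = standUpSwingPp / q9sStepPp;
// Assumed derivation of "~8x less depth".
const depthRatio = nlheSweepSpanBb / standUpSpanBb;
// Same comparison against the 13pp full-sweep drift.
const magnitudeRatioVsDrift = standUpSwingPp / q9sDriftPp;

console.log(magnitudeRatio.toFixed(2));        // 16.25
console.log(depthRatio.toFixed(2));            // 7.54
console.log(magnitudeRatioVsDrift.toFixed(2)); // 5.00
```

Either baseline leaves the Stand-Up swing several times larger over a much shorter depth span, so the qualitative conclusion does not depend on which Q9s figure is used.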

Existing B1 property coverage

The right property family is B1 (Stack-depth continuity, "no cliffs between adjacent depths"). It exists. It applies to all formats including Stand-Up. But the production runner has three coverage gaps:

| Property | Spec | Implementation | Covers H1? |
| --- | --- | --- | --- |
| B1 (Stack depth continuity) | "No cliffs between adjacent depths"; applies_to: [core] | classB.ts:47, but stack range capped at [20…100bb], only tests UTG : open, no val-axis | No: runner doesn't reach 226bb, doesn't test Stand-Up's val parameter, doesn't test postflop multi-way spots |
| SQ1 (Val monotonicity) | VPIP non-decreasing across val ∈ {1, 2, 3, 5, 10}, fixed depth | (squid.yaml SQ-series) | No: different axis (val, not depth) |
| SQ2 (State monotonicity) | VPIP ordering: hero-has ≤ fresh ≤ hero-no-squid ≤ all-desperate | (squid.yaml) | No: different axis (state) |
| SQ3 (Squid × position) | Later position responds more to val | (squid.yaml) | No: different axis (position × val) |
| SQ4 (proposed) | Stack-depth × val × state continuity | Doesn't exist yet | Yes: this is the gap H1 surfaces |

Resolution

Track A — Diagnose the Klu T9 flip (model team / Trung)

Solver-tree artifact or real depth interaction?

  1. Sweep intermediate depths (100 / 130 / 150 / 180 / 200 / 226 / 260bb) on the exact same Scenario D · val=10 setup. Map where the call → fold transition happens. Gradual = real depth interaction; categorical = solver-tree artifact.
  2. Inspect the bet-sizing tree at each depth — does the legal action set or sizing increments change between 100bb and 226bb in a way that mechanically forces fold?
  3. Cross-check at val=5 (less extreme) and val=2 — does the flip still occur, or is it specifically a val=10 phenomenon?
  4. Document the diagnosis in llm-verifier-game-expansion/squid-classic/known-issues/ if it turns out to be a real model finding.
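Step 1 above can be sketched as a small sweep helper. This is a minimal illustration, not the team's tooling: `CallFreqQuery` is a hypothetical hook standing in for the solver endpoint (its real name and response shape are not given in this report), and "share of total movement carried by the largest step" is one possible way to quantify gradual versus categorical.

```typescript
// Hypothetical hook: returns Klu's T9 call frequency (0..1) at a given depth.
type CallFreqQuery = (depthBb: number) => number;

interface StepDelta { fromBb: number; toBb: number; deltaPp: number; }

interface TransitionMap {
  steps: StepDelta[];             // change in call frequency per adjacent depth pair
  largestStep: StepDelta | null;  // the step with the biggest absolute change
  largestStepShare: number;       // |largest| / sum of |all steps|; 0 if nothing moves
}

// Depths proposed in Track A step 1.
const TRACK_A_DEPTHS_BB = [100, 130, 150, 180, 200, 226, 260];

function mapTransition(depthsBb: number[], query: CallFreqQuery): TransitionMap {
  const sorted = [...depthsBb].sort((a, b) => a - b);
  const steps: StepDelta[] = [];
  let prev: { depthBb: number; freq: number } | null = null;
  for (const depthBb of sorted) {
    const freq = query(depthBb);
    if (prev !== null) {
      steps.push({ fromBb: prev.depthBb, toBb: depthBb, deltaPp: (freq - prev.freq) * 100 });
    }
    prev = { depthBb, freq };
  }

  let largestStep: StepDelta | null = null;
  let totalAbs = 0;
  for (const step of steps) {
    totalAbs += Math.abs(step.deltaPp);
    if (largestStep === null || Math.abs(step.deltaPp) > Math.abs(largestStep.deltaPp)) {
      largestStep = step;
    }
  }
  const largestStepShare =
    largestStep !== null && totalAbs > 0 ? Math.abs(largestStep.deltaPp) / totalAbs : 0;
  return { steps, largestStep, largestStepShare };
}
```

A share close to 1 means a single depth step carries almost the entire change (the categorical case); a share spread evenly across steps points to a gradual shift. Where to draw the line between the two is a judgment call this sketch deliberately leaves open.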

Track B — Add SQ4 property + extend B1 runner depth (Nimit / B1 framework)

Framework needs to be able to detect this class of issue automatically.

  1. Add SQ4 to squid.yaml: "Stack-depth × val × state continuity — for each (val ∈ {1,2,3,5,10}, state) pair, assert no dominant-action flip between adjacent depths in stack sweep [100, 150, 200, 226, 300, 500, 800] bb."
  2. Implement runSQ4() in classB.ts (or new classSQ.ts if SQ-series gets its own file). Use the existing assertAdjacentSmooth primitive but with the val/state coordinate axes.
  3. Extend runB1 stack range for Stand-Up format — current [20, 25, 30, 40, 50, 60, 80, 100] caps at 100bb; add [150, 200, 226, 300, 500] conditionally when format = squid/standup.
  4. Register the Klu T9 spot (Scenario D, val=10, T9 at 100bb vs 226bb) as a fixed B1 property test case so this exact regression is detectable on future model versions.
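Steps 3 and 4 above might look roughly like the following. This is a sketch under stated assumptions: the `Format` names, `b1DepthsFor`, and the `FixedDepthCase` shape are hypothetical and do not reflect the actual structure of classB.ts; only the depth lists and the Klu T9 coordinates come from this report.

```typescript
// Format identifiers are assumed; the report only says "format = squid/standup".
type Format = "core" | "squid" | "standup";

// Current B1 sweep, capped at 100bb.
const BASE_B1_DEPTHS_BB: readonly number[] = [20, 25, 30, 40, 50, 60, 80, 100];
// Deep extension proposed for Stand-Up only.
const DEEP_B1_DEPTHS_BB: readonly number[] = [150, 200, 226, 300, 500];

// Returns the depth grid for a format, adding the deep tail conditionally.
function b1DepthsFor(format: Format): number[] {
  const isStandUp = format === "squid" || format === "standup";
  return isStandUp
    ? [...BASE_B1_DEPTHS_BB, ...DEEP_B1_DEPTHS_BB]
    : [...BASE_B1_DEPTHS_BB];
}

// Hypothetical shape for a pinned regression spot.
interface FixedDepthCase {
  id: string;
  scenario: string;
  val: number;
  hand: string;
  depthsBb: [number, number];
}

// The exact spot from H1, pinned so future model versions are checked against it.
const KLU_T9_CASE: FixedDepthCase = {
  id: "R9-klu-t9",   // illustrative identifier
  scenario: "D",
  val: 10,
  hand: "T9",
  depthsBb: [100, 226],
};
```

Keeping the deep tail behind a format check leaves the core sweep unchanged, so existing B1 results for other formats stay comparable across runs.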

Proposed SQ4 yaml spec

- id: SQ4
  category: X
  description: "Stack-depth × val × state continuity"
  formal_test: |
    For each (val, state) pair in {val ∈ {1, 2, 3, 5, 10}} × {fresh,
    hero-has, hero-no-squid, all-desperate}, sweep stack depth across
    [100, 150, 200, 226, 300, 500, 800] bb. Assert no dominant-action
    flip between adjacent depths (e.g., call-dominant → fold-dominant)
    unless the action distribution shifts smoothly (≥30% of the change
    happens across at least 2 adjacent depth steps).
  applies_to: [squid]
  severity: gate
  rationale: |
    H1 (Klu T9 100bb call 35% → 226bb fold 100%) surfaced that the
    stack-depth × val interaction can produce categorical sign changes
    in Stand-Up that do NOT occur in NLHE cash. This property catches
    that class of behavior automatically. Companion control: NLHE cash
    UTG RFI 50bb→1000bb shows total swing of −1.73pp; smooth, not flipping.
  added_in: v2

Suggested next steps (priority order)

  1. Map the flip point with intermediate-depth queries (100/130/150/180/200/226bb on Scenario D · val=10 · Klu T9). Confirms whether it's gradual or categorical. Fastest test; ~10 endpoint queries.
  2. Cross-check at val=5 / val=2 — is the flip val-specific or does it persist across val? Tells us whether the issue is "deep stack" or "val=10 + deep stack" specifically.
  3. Add SQ4 to squid.yaml + implement runner. Smallest, highest-leverage framework fix.
  4. Extend the B1 runner's stack range for Stand-Up to cover the 100–500bb range.
  5. Cross-reference with R8 — R8's 1260bb instability and R9's 226bb flip may share root causes (Stand-Up training coverage in deep regimes). Joint Track A could cover both.

Open questions for the team