sledtools/pika branch #27

pika-git-1

Model CI blocked and unhealthy states

Target branch: master

Merge Commit: c9dea1d8641b7307a05564ce0bb731bc9dd3dd47

branch: merged · tutorial: ready · ci: success
Continuous Integration

CI: success

Latest run #32 success

2 passed

head be8393c045024d3dd357022105dddcf15bd5eb4e · queued 2026-03-24 13:34:45 · 2 lane(s)

queued 4m 27s · ran 24s

check-notifications · success
check-agent-contracts · success

Summary

This branch introduces CI blocked and unhealthy target states to the pika-news CI system. It adds a new ci_target_health table and per-lane columns (execution_reason, failure_kind, ci_target_key) to track why queued lanes are not yet running (blocked by concurrency group, waiting for capacity, or target unhealthy) and to classify failures (test failure, timeout, infrastructure). When a lane finishes with an infrastructure failure, the system updates the target health record and, after a configurable threshold of consecutive infrastructure failures, marks the target as unhealthy with a cooloff period. Unhealthy targets cause queued lanes to be skipped during claim and visually flagged in both the CLI and web dashboards. A periodic refresh mechanism keeps execution reasons up to date across all queued lanes.

Tutorial Steps

Define CI state enums and health tracking types

Intent: Introduce a new `ci_state` module containing the core domain types: `CiLaneExecutionReason` (queued, running, blocked_by_concurrency_group, waiting_for_capacity, target_unhealthy, stale_recovered), `CiLaneFailureKind` (test_failure, timeout, infrastructure), `CiTargetHealthState` (healthy, unhealthy), and `CiTargetHealthSnapshot`. Also add the `classify_ci_failure` heuristic for auto-detecting failure kind from log text, `configured_target_key_for_lane` for resolving target keys from lane config, and `next_target_cooloff_until` for computing cooloff windows.

Affected files: crates/pika-news/src/ci_state.rs, crates/pika-news/src/main.rs

Evidence
@@ +0,0 ... enum CiLaneExecutionReason { Queued, Running, BlockedByConcurrencyGroup, WaitingForCapacity, TargetUnhealthy, StaleRecovered }
@@ +0,0 ... enum CiLaneFailureKind { TestFailure, Timeout, Infrastructure }
@@ +0,0 ... pub struct CiTargetHealthSnapshot { target_id, state, consecutive_infra_failure_count, ... cooloff_until }
@@ +0,0 ... pub fn classify_ci_failure(log_text: &str) -> CiLaneFailureKind
@@ +0,0 ... pub fn configured_target_key_for_lane(lane: &ForgeLane) -> Option<String>
@@ +0,0 ... pub fn next_target_cooloff_until(now: DateTime<Utc>) -> String

A new module ci_state.rs is added and registered in main.rs. It defines the enums and structs that represent the lifecycle of a CI lane beyond simple queued/running/passed/failed:

  • CiLaneExecutionReason captures why a lane is in its current status. A lane with status = queued might have execution_reason = target_unhealthy to indicate it is deliberately held back.
  • CiLaneFailureKind classifies failures into test_failure, timeout, or infrastructure. The classify_ci_failure function inspects log text for patterns like "timed out", "ci runner error", or "infrastructure" to automatically assign the kind.
  • CiTargetHealthState and CiTargetHealthSnapshot model per-target health. A target becomes unhealthy after CI_TARGET_HEALTH_INFRA_FAILURE_THRESHOLD (default 2) consecutive infrastructure failures. Once unhealthy, next_target_cooloff_until computes a 15-minute cooloff window.
  • configured_target_key_for_lane resolves the target key from lane configuration, preferring pikaci_target_id, then falling back to well-known concurrency groups like apple-host.

All enums implement FromStr for database deserialization and as_str()/label() for serialization and display.
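
As an illustration of the enum round-trip and the log heuristic described above, here is a minimal sketch using only the standard library. The substring patterns are the ones named in this step; the case-insensitive matching, the order of the checks, and the fallback to `TestFailure` are assumptions, not confirmed behavior of the real `classify_ci_failure`.

```rust
use std::str::FromStr;

/// Failure classification, mirroring the three kinds named above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CiLaneFailureKind {
    TestFailure,
    Timeout,
    Infrastructure,
}

impl CiLaneFailureKind {
    /// Stable string form, suitable for a TEXT column.
    pub fn as_str(self) -> &'static str {
        match self {
            Self::TestFailure => "test_failure",
            Self::Timeout => "timeout",
            Self::Infrastructure => "infrastructure",
        }
    }
}

impl FromStr for CiLaneFailureKind {
    type Err = String;

    /// Parses the stored string form back into the enum.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "test_failure" => Ok(Self::TestFailure),
            "timeout" => Ok(Self::Timeout),
            "infrastructure" => Ok(Self::Infrastructure),
            other => Err(format!("unknown failure kind: {other}")),
        }
    }
}

/// Assumed heuristic: case-insensitive substring checks, timeout first,
/// with test failure as the fallback when nothing matches.
pub fn classify_ci_failure(log_text: &str) -> CiLaneFailureKind {
    let lower = log_text.to_lowercase();
    if lower.contains("timed out") {
        CiLaneFailureKind::Timeout
    } else if lower.contains("ci runner error") || lower.contains("infrastructure") {
        CiLaneFailureKind::Infrastructure
    } else {
        CiLaneFailureKind::TestFailure
    }
}
```

Keeping `as_str` and `from_str` as exact inverses means a value written to the database can always be parsed back, and an unknown string surfaces as an error rather than silently defaulting.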

Add database migration for new columns and ci_target_health table

Intent: Create migration 0019 that adds `execution_reason`, `failure_kind`, and `ci_target_key` columns to both `branch_ci_run_lanes` and `nightly_run_lanes`, backfills existing data, adds composite indexes for efficient querying, and creates the `ci_target_health` table.

Affected files: crates/pika-news/migrations/0019_ci_queue_state_and_target_health.sql

Evidence
@@ -0,0 +1,70 @@
+ALTER TABLE branch_ci_run_lanes
+ADD COLUMN execution_reason TEXT NOT NULL DEFAULT 'queued';
+ALTER TABLE branch_ci_run_lanes
+ADD COLUMN failure_kind TEXT;
+ALTER TABLE branch_ci_run_lanes
+ADD COLUMN ci_target_key TEXT;
+UPDATE branch_ci_run_lanes
+SET execution_reason = CASE
+    WHEN status = 'running' THEN 'running'
+    ELSE execution_reason
+END;
+UPDATE branch_ci_run_lanes
+SET ci_target_key = CASE
+    WHEN pikaci_target_id IS NOT NULL AND TRIM(pikaci_target_id) <> '' THEN pikaci_target_id
+    WHEN concurrency_group = 'apple-host' THEN 'apple-host'
+    ...
+END
+WHERE ci_target_key IS NULL;
+CREATE INDEX idx_branch_ci_run_lanes_status_reason
+    ON branch_ci_run_lanes(status, execution_reason, id ASC);
+CREATE INDEX idx_branch_ci_run_lanes_target_key
+    ON branch_ci_run_lanes(ci_target_key, status, id ASC);
+CREATE TABLE ci_target_health (
+    target_id TEXT PRIMARY KEY,
+    state TEXT NOT NULL,
+    consecutive_infra_failure_count INTEGER NOT NULL DEFAULT 0,
+    last_success_at TEXT,
+    last_failure_at TEXT,
+    last_failure_kind TEXT,
+    cooloff_until TEXT,
+    updated_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
+);

Migration 0019_ci_queue_state_and_target_health.sql makes five categories of schema changes:

  1. New columns on branch_ci_run_lanes and nightly_run_lanes: execution_reason (NOT NULL, defaults to 'queued'), failure_kind (nullable), and ci_target_key (nullable).

  2. Backfill execution_reason: Already-running lanes get 'running' instead of the default 'queued'.

  3. Backfill ci_target_key: Populates the target key from pikaci_target_id if present, otherwise from well-known concurrency group names (apple-host, nightly-android).

  4. Composite indexes: Four new indexes on (status, execution_reason, id) and (ci_target_key, status, id) for both lane tables, enabling efficient claim queries and execution reason refresh scans.

  5. ci_target_health table: Tracks per-target health with state, consecutive_infra_failure_count, timestamps for last success/failure, the kind of last failure, and a cooloff_until timestamp.

Extend lane record structs with new fields

Intent: Add `execution_reason`, `failure_kind`, `ci_target_key`, and `target_health` fields to `BranchCiLaneRecord` and `NightlyLaneRecord` so the full lane state is available throughout the application.

Affected files: crates/pika-news/src/branch_store.rs

Evidence
@@ -107,8 +113,12 @@ pub struct BranchCiLaneRecord {
+    pub execution_reason: CiLaneExecutionReason,
+    pub failure_kind: Option<CiLaneFailureKind>,
     pub pikaci_run_id: Option<String>,
     pub pikaci_target_id: Option<String>,
+    pub ci_target_key: Option<String>,
+    pub target_health: Option<CiTargetHealthSnapshot>,
@@ -168,8 +178,12 @@ pub struct NightlyLaneRecord {
+    pub execution_reason: CiLaneExecutionReason,
+    pub failure_kind: Option<CiLaneFailureKind>,
+    pub ci_target_key: Option<String>,
+    pub target_health: Option<CiTargetHealthSnapshot>,

Both BranchCiLaneRecord and NightlyLaneRecord gain four new fields:

  • execution_reason: CiLaneExecutionReason — why the lane is in its current state
  • failure_kind: Option<CiLaneFailureKind> — classification of the failure if the lane failed
  • ci_target_key: Option<String> — the resolved CI target identifier
  • target_health: Option<CiTargetHealthSnapshot> — the current health snapshot of the lane's target, hydrated after query

The rerun source structs (BranchLaneRerunSource, NightlyLaneRerunSource) also gain ci_target_key so reruns preserve target assignment.

Persist new columns during lane creation and rerun

Intent: Update all INSERT statements for branch and nightly lanes to include `ci_target_key` and `execution_reason` columns, ensuring new lanes and reruns are created with proper initial values.

Affected files: crates/pika-news/src/branch_store.rs

Evidence
@@ -749,15 +766,18 @@ impl Store {
+                        ci_target_key,
+                        execution_reason,
                         status
-                     ) VALUES (?1, ?2, ?3, ?4, ?5, ?6, 'queued')",
+                     ) VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, 'queued', 'queued')",
@@ -1042,9 +1068,11 @@ impl Store {
+                    ci_target_key,
+                    execution_reason,
                     status,
                     rerun_of_lane_run_id
-                 ) VALUES (?1, ?2, ?3, ?4, ?5, ?6, 'queued', ?7)",
+                 ) VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, 'queued', 'queued', ?8)",

Every INSERT INTO branch_ci_run_lanes and INSERT INTO nightly_run_lanes statement is updated to:

  1. Call configured_target_key_for_lane(lane) to resolve and store ci_target_key at creation time.
  2. Include execution_reason with an initial value of 'queued'.

This applies to four code paths:

  • Initial lane creation for branch CI suites
  • Initial lane creation for nightly runs
  • Rerun lane creation for branch CI
  • Rerun lane creation for nightly runs

The rerun queries also now SELECT and propagate ci_target_key from the source lane.
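
The resolution order used at creation time might look roughly like the sketch below. `LaneConfig` is a hypothetical stand-in for `ForgeLane`, whose real fields are not shown in this review, and `target_key_for_lane` is an illustrative name rather than the actual `configured_target_key_for_lane`. The group names come from the migration backfill; trimming the returned ID is an assumption.

```rust
/// Hypothetical stand-in for the lane config; the real `ForgeLane` has more fields.
pub struct LaneConfig {
    pub pikaci_target_id: Option<String>,
    pub concurrency_group: Option<String>,
}

/// Concurrency groups treated as target keys (names taken from the migration backfill).
const WELL_KNOWN_GROUPS: [&str; 2] = ["apple-host", "nightly-android"];

/// Resolves a target key: explicit target ID first, then a well-known group.
pub fn target_key_for_lane(lane: &LaneConfig) -> Option<String> {
    // Prefer an explicit, non-blank pikaci target ID.
    if let Some(id) = lane.pikaci_target_id.as_deref().map(str::trim) {
        if !id.is_empty() {
            return Some(id.to_string());
        }
    }
    // Otherwise fall back to a well-known concurrency group, if any.
    lane.concurrency_group
        .as_deref()
        .filter(|group| WELL_KNOWN_GROUPS.contains(group))
        .map(str::to_string)
}
```

A lane with neither an explicit target nor a well-known group resolves to `None`, which matches the nullable `ci_target_key` column.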

Update lane reset operations to clear failure state

Intent: Ensure that all operations that reset a lane back to 'queued' status (cancel, retry, stale recovery) also reset `execution_reason` and `failure_kind` to their initial values.

Affected files: crates/pika-news/src/branch_store.rs

Evidence
@@ -1267,6 +1304,8 @@ impl Store {
+                     execution_reason = 'queued',
+                     failure_kind = NULL,
@@ -1312,6 +1351,8 @@ impl Store {
+                         execution_reason = 'queued',
+                         failure_kind = NULL,
@@ -1966,6 +2014,8 @@ impl Store {
+                         execution_reason = 'stale_recovered',
+                         failure_kind = NULL,

Six distinct lane-reset code paths are updated:

  1. Cancel branch CI lane → sets execution_reason = 'queued', failure_kind = NULL
  2. Cancel all branch CI lanes → same
  3. Cancel nightly lane → same
  4. Cancel all nightly lanes → same
  5. Stale branch lane recovery → sets execution_reason = 'stale_recovered' (a distinct reason to track lanes that were recovered from stale state)
  6. Stale nightly lane recovery → same

The lease-lost failure path also sets failure_kind = NULL, since lease loss is neither a test failure nor an infrastructure failure.

Skip unhealthy targets during lane claim

Intent: Modify the claim logic for both branch and nightly lanes to load currently unhealthy targets and skip queued lanes that target them, preventing dispatch to broken infrastructure.

Affected files: crates/pika-news/src/branch_store.rs

Evidence
@@ -2065,11 +2132,13 @@ impl Store {
+            let unhealthy_targets =
+                current_unhealthy_target_ids(&tx, Utc::now()).context("load unhealthy ci targets")?;
@@ -2100,6 +2170,19 @@ impl Store {
+                    let target_key = tx
+                        .query_row(
+                            "SELECT ci_target_key FROM branch_ci_run_lanes WHERE id = ?1",
+                            params![job.lane_run_id],
+                            |row| row.get::<_, Option<String>>(0),
+                        )
+                    if target_key
+                        .as_deref()
+                        .is_some_and(|target| unhealthy_targets.contains(target))
+                    {
+                        continue;
+                    }
@@ -2109,6 +2192,8 @@ impl Store {
+                             execution_reason = 'running',
+                             failure_kind = NULL,

The claim_pending_branch_ci_lane_runs and claim_pending_nightly_lane_runs methods now:

  1. Load the set of currently unhealthy target IDs via current_unhealthy_target_ids(), which queries ci_target_health for targets in the unhealthy state whose cooloff window is still active.
  2. For each candidate lane, look up its ci_target_key and skip it if the target is unhealthy.
  3. When a lane is claimed (transitioned to running), set execution_reason = 'running' and failure_kind = NULL.

This creates a natural back-pressure mechanism: lanes targeting unhealthy infrastructure stay queued until the cooloff expires or the target recovers.
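
The skip step can be sketched in isolation, without the database. `Candidate` and `claimable_lane_ids` are hypothetical names for illustration; the real code performs this check per lane inside the claim transaction.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified view of a queued lane candidate.
pub struct Candidate {
    pub lane_run_id: i64,
    pub ci_target_key: Option<String>,
}

/// Returns the IDs of candidates that may be claimed, in input order.
pub fn claimable_lane_ids(
    candidates: &[Candidate],
    unhealthy_targets: &HashSet<String>,
) -> Vec<i64> {
    candidates
        .iter()
        .filter(|c| {
            // Lanes without a target key are never held back by target health.
            !c.ci_target_key
                .as_deref()
                .is_some_and(|target| unhealthy_targets.contains(target))
        })
        .map(|c| c.lane_run_id)
        .collect()
}
```

Because the unhealthy set is loaded once per claim pass, every candidate in that pass is judged against the same view of target health.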

Update target health on lane finish and classify failures

Intent: When a lane finishes, automatically classify the failure kind from log text, persist it, and update the `ci_target_health` table to track consecutive infrastructure failures and trigger unhealthy state transitions.

Affected files: crates/pika-news/src/branch_store.rs

Evidence
@@ -2182,33 +2267,70 @@ impl Store {
+    pub fn finish_branch_ci_lane_run(
+        ...
+        let failure_kind = if status == CiLaneStatus::Failed.as_str() {
+            Some(classify_ci_failure(log_text))
+        } else {
+            None
+        };
+        self.finish_branch_ci_lane_run_with_kind(
@@ +... fn update_target_health_after_lane_finish(
+    conn: &Connection,
+    ci_target_key: Option<&str>,
+    status: &str,
+    failure_kind: Option<CiLaneFailureKind>,
+    now: DateTime<Utc>,
+) -> anyhow::Result<()> {
@@ +... let state = if consecutive_failures >= CI_TARGET_HEALTH_INFRA_FAILURE_THRESHOLD {
+                    CiTargetHealthState::Unhealthy
+                } else {
+                    CiTargetHealthState::Healthy
+                };

The finish_branch_ci_lane_run and finish_nightly_lane_run methods are refactored into a two-layer pattern:

  1. Public entry point calls classify_ci_failure(log_text) for failed lanes to auto-detect the failure kind, then delegates to finish_*_with_kind.
  2. finish_*_with_kind persists failure_kind alongside the status update and calls update_target_health_after_lane_finish.

The update_target_health_after_lane_finish function implements the health state machine:

  • Infrastructure failure: Increments consecutive_infra_failure_count. If it reaches the threshold (2), marks the target as unhealthy and sets cooloff_until to 15 minutes in the future.
  • Non-infrastructure failure (for example, a test failure): Does not count toward the unhealthy threshold, but the failure is still recorded.
  • Success (passed): Resets consecutive_infra_failure_count to 0 and marks the target as healthy.

The function uses INSERT ... ON CONFLICT ... DO UPDATE (upsert) to handle both new and existing target health records.
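
The transitions above can be sketched as a pure function over an in-memory record. This is a simplified model, not the real upsert: all type and function names are hypothetical, timestamps are plain Unix seconds, and two behaviors are assumptions because the review does not state them: a non-infrastructure failure leaves the counter and state untouched, and a success clears the cooloff.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FailureKind {
    TestFailure,
    Timeout,
    Infrastructure,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HealthState {
    Healthy,
    Unhealthy,
}

/// Outcome of a finished lane, as far as target health is concerned.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LaneOutcome {
    Passed,
    Failed(FailureKind),
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TargetHealth {
    pub state: HealthState,
    pub consecutive_infra_failure_count: u32,
    pub last_failure_kind: Option<FailureKind>,
    /// Unix seconds here; the real column stores a timestamp string.
    pub cooloff_until: Option<u64>,
}

pub const INFRA_FAILURE_THRESHOLD: u32 = 2;
pub const COOLOFF_SECS: u64 = 15 * 60;

/// Applies one finished lane to the target's health record.
pub fn apply_outcome(health: &mut TargetHealth, outcome: LaneOutcome, now: u64) {
    match outcome {
        LaneOutcome::Passed => {
            // Success resets the streak and marks the target healthy.
            health.state = HealthState::Healthy;
            health.consecutive_infra_failure_count = 0;
            health.cooloff_until = None; // assumption: cooloff cleared on success
        }
        LaneOutcome::Failed(FailureKind::Infrastructure) => {
            health.consecutive_infra_failure_count += 1;
            health.last_failure_kind = Some(FailureKind::Infrastructure);
            if health.consecutive_infra_failure_count >= INFRA_FAILURE_THRESHOLD {
                health.state = HealthState::Unhealthy;
                health.cooloff_until = Some(now + COOLOFF_SECS);
            }
        }
        LaneOutcome::Failed(kind) => {
            // Assumption: recorded, but counter and state are left as they were.
            health.last_failure_kind = Some(kind);
        }
    }
}
```

Modeling the transition as a pure function makes the threshold and cooloff arithmetic easy to unit-test separately from the SQL upsert that persists the result.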

Hydrate lane records with target health snapshots

Intent: After loading lane records from the database, batch-load their associated target health snapshots so the full health context is available for rendering.

Affected files: crates/pika-news/src/branch_store.rs

Evidence
@@ +... fn hydrate_lane_target_health(
+    conn: &Connection,
+    lanes: &mut [BranchCiLaneRecord],
+) -> anyhow::Result<()> {
@@ +... fn load_ci_target_health_snapshots(
+    conn: &Connection,
+    target_ids: &[String],
+) -> anyhow::Result<HashMap<String, CiTargetHealthSnapshot>> {
@@ -2560,6 +2744,7 @@ fn list_branch_ci_run_lanes(
+    hydrate_lane_target_health(conn, &mut lanes)?;

Lane listing functions (list_branch_ci_run_lanes, list_nightly_run_lanes) now call hydration functions after loading lanes:

  1. hydrate_lane_target_health / hydrate_nightly_lane_target_health: Collect unique ci_target_key values from the loaded lanes, then batch-load their health snapshots.
  2. load_ci_target_health_snapshots: Builds a dynamic IN (...) query against ci_target_health and returns a HashMap<String, CiTargetHealthSnapshot>.
  3. Each lane's target_health field is populated by looking up its ci_target_key in the map.

The lane SELECT queries are also updated to read the new columns (execution_reason, failure_kind, ci_target_key) and parse them via parse_execution_reason and parse_optional_failure_kind helper functions.
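
The three hydration steps can be sketched without a database connection. In this simplified model the health snapshot is reduced to a `String`, and `Lane`, `unique_target_keys`, `health_query_sql`, and `hydrate` are hypothetical names; only the table and column names come from the migration. The numbered-placeholder style (`?1, ?2, ...`) is an assumption about how the dynamic `IN (...)` clause is built.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical, simplified lane record; the snapshot is reduced to a string.
pub struct Lane {
    pub ci_target_key: Option<String>,
    pub target_health: Option<String>,
}

/// Step 1: collect the distinct target keys referenced by the loaded lanes.
pub fn unique_target_keys(lanes: &[Lane]) -> Vec<String> {
    lanes
        .iter()
        .filter_map(|lane| lane.ci_target_key.clone())
        .collect::<BTreeSet<_>>()
        .into_iter()
        .collect()
}

/// Step 2: build a query with one numbered placeholder per target key.
/// Returns `None` for an empty key list, since `IN ()` is not valid SQL.
pub fn health_query_sql(target_count: usize) -> Option<String> {
    if target_count == 0 {
        return None;
    }
    let placeholders = (1..=target_count)
        .map(|i| format!("?{i}"))
        .collect::<Vec<_>>()
        .join(", ");
    Some(format!(
        "SELECT target_id, state, consecutive_infra_failure_count, cooloff_until \
         FROM ci_target_health WHERE target_id IN ({placeholders})"
    ))
}

/// Step 3: attach each lane's snapshot by looking up its target key.
pub fn hydrate(lanes: &mut [Lane], snapshots: &HashMap<String, String>) {
    for lane in lanes.iter_mut() {
        lane.target_health = lane
            .ci_target_key
            .as_ref()
            .and_then(|key| snapshots.get(key))
            .cloned();
    }
}
```

Batching the lookup this way costs one query per listing rather than one per lane, which matters when a run has many lanes sharing a few targets.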

Add periodic execution reason refresh

Intent: Introduce a `refresh_ci_lane_execution_reasons` method that recalculates the execution_reason for all queued lanes based on current system state (concurrency groups, capacity, target health), and wire it into the CI polling loop.

Affected files: crates/pika-news/src/branch_store.rs, crates/pika-news/src/ci.rs

Evidence
@@ +... pub fn refresh_ci_lane_execution_reasons(
+        &self,
+        ci_concurrency: Option<usize>,
+    ) -> anyhow::Result<()> {
@@ +... fn refresh_queued_lane_reasons_for_table(
+    conn: &Connection,
+    table: &str,
+    available_slots: &mut Option<usize>,
+    running_groups: &mut HashSet<String>,
+    unhealthy_targets: &HashSet<String>,
+) -> anyhow::Result<()> {

A new refresh_ci_lane_execution_reasons method on Store is called periodically (integrated into ci.rs polling) to keep execution reasons accurate:

  1. Running lanes: Forces execution_reason = 'running' for any lane with status = 'running' that somehow has a stale reason.
  2. Queued lanes: Iterates all queued lanes in ID order (nightly first, then branch) and assigns the reason based on priority:
    • TargetUnhealthy if the lane's target is in the unhealthy set
    • BlockedByConcurrencyGroup if the lane's concurrency group is occupied
    • WaitingForCapacity if global CI concurrency slots are exhausted
    • StaleRecovered preserved if already set (to maintain audit trail)
    • Queued otherwise (the lane would be picked up on next claim)
  3. Simulates claim order: lanes that would be claimed consume available slots and occupy concurrency groups, so downstream lanes get accurate reasons.

This runs within a single IMMEDIATE transaction to provide a consistent snapshot.
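
The priority order and the claim simulation can be sketched over in-memory data. The parameter shapes follow the `refresh_queued_lane_reasons_for_table` signature in the evidence, but `QueuedLane`, `Reason`, and `refresh_reasons` are hypothetical names, and two points are assumptions: `None` slots means unlimited capacity, and a lane that keeps `StaleRecovered` still consumes a slot and occupies its group because it would be claimed.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Reason {
    Queued,
    BlockedByConcurrencyGroup,
    WaitingForCapacity,
    TargetUnhealthy,
    StaleRecovered,
}

/// Hypothetical, simplified view of one queued lane.
pub struct QueuedLane {
    pub ci_target_key: Option<String>,
    pub concurrency_group: Option<String>,
    pub current_reason: Reason,
}

/// Computes a reason per lane, in the order given, simulating claim order.
pub fn refresh_reasons(
    lanes: &[QueuedLane],
    available_slots: &mut Option<usize>, // None = no global limit (assumption)
    running_groups: &mut HashSet<String>,
    unhealthy_targets: &HashSet<String>,
) -> Vec<Reason> {
    let mut reasons = Vec::with_capacity(lanes.len());
    for lane in lanes {
        let unhealthy = lane
            .ci_target_key
            .as_deref()
            .is_some_and(|t| unhealthy_targets.contains(t));
        let blocked = lane
            .concurrency_group
            .as_deref()
            .is_some_and(|g| running_groups.contains(g));
        let reason = if unhealthy {
            Reason::TargetUnhealthy
        } else if blocked {
            Reason::BlockedByConcurrencyGroup
        } else if *available_slots == Some(0) {
            Reason::WaitingForCapacity
        } else {
            // This lane would be claimed: consume a slot and occupy its group
            // so that later lanes see the effect.
            if let Some(slots) = available_slots.as_mut() {
                *slots -= 1;
            }
            if let Some(group) = &lane.concurrency_group {
                running_groups.insert(group.clone());
            }
            if lane.current_reason == Reason::StaleRecovered {
                Reason::StaleRecovered
            } else {
                Reason::Queued
            }
        };
        reasons.push(reason);
    }
    reasons
}
```

Passing the slot counter and running-group set by mutable reference lets the same state carry over from the nightly table to the branch table, matching the "nightly first, then branch" ordering described above.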

Update pikaci target ID to also set ci_target_key

Intent: When a running lane reports its pikaci target ID, backfill `ci_target_key` if it was not already set, ensuring target health tracking covers dynamically assigned targets.

Affected files: crates/pika-news/src/branch_store.rs

Evidence
@@ -2228,7 +2350,8 @@ impl Store {
+                         ci_target_key = COALESCE(ci_target_key, ?2)
@@ -2415,7 +2593,8 @@ impl Store {
+                         ci_target_key = COALESCE(ci_target_key, ?2)

Both update_branch_ci_lane_pikaci_ids and update_nightly_lane_pikaci_ids now include ci_target_key = COALESCE(ci_target_key, ?2) in their UPDATE statements. This means:

  • If the lane already had a ci_target_key from configuration, it is preserved.
  • If it was NULL (no configured target), it gets populated from the pikaci_target_id reported at runtime.

This ensures that even lanes without preconfigured targets participate in target health tracking once they are assigned to a runner.
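
In Rust terms, the `COALESCE(ci_target_key, ?2)` rule behaves like `Option::or_else`. The helper below is a hypothetical illustration of that rule, not code from the branch.

```rust
/// Mirrors `COALESCE(ci_target_key, ?2)`: keep the existing key if present,
/// otherwise adopt the target ID reported at runtime.
pub fn coalesce_target_key(existing: Option<String>, reported: &str) -> Option<String> {
    existing.or_else(|| Some(reported.to_string()))
}
```

A configured key therefore always wins over a runtime-reported one, so health history stays attached to the same target across reruns.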

Extend the CLI client with new lane display fields

Intent: Update the ph CLI tool's data model and rendering to display execution reasons, failure kinds, and target health information in branch status output and wait snapshots.

Affected files: crates/ph/src/lib.rs

Evidence
@@ -205,8 +205,18 @@ struct CiLane {
+    #[serde(default)]
+    execution_reason: CiLaneExecutionReason,
+    #[serde(default)]
+    failure_kind: Option<CiLaneFailureKind>,
+    #[serde(default)]
+    ci_target_key: Option<String>,
+    #[serde(default)]
+    target_health_state: Option<CiTargetHealthState>,
+    #[serde(default)]
+    target_health_summary: Option<String>,
@@ -705,41 +781,94 @@ fn print_branch_status
+fn render_lane_status_line(lane: &CiLane) -> String {
+    if matches!(lane.status.as_str(), "queued" | "running")
+        && lane.execution_reason != CiLaneExecutionReason::Queued
+        && lane.execution_reason.as_str() != lane.status
+    {
+        line.push_str(" · ");
+        line.push_str(lane.execution_reason.label());
+    }
@@ +... fn render_lane_snapshot_fragment(lane: &CiLane) -> String {
+    format!(
+        "{}:{}:{}:{}:{}",
+        lane.id,
+        lane.status,
+        lane.execution_reason.as_str(),

The CLI's CiLane struct gains five new fields (all #[serde(default)] for backward compatibility with older servers):

  • execution_reason, failure_kind, ci_target_key, target_health_state, target_health_summary

Corresponding enums (CiLaneExecutionReason, CiLaneFailureKind, CiTargetHealthState) are defined with display logic.

Rendering is refactored:

  • render_branch_status replaces the direct println! calls in print_branch_status with a String-returning function for testability.
  • render_lane_status_line builds a rich single-line display: lane_id status [· reason] [· failure=kind] [· health summary] [target run_id].
  • render_lane_snapshot_fragment replaces the old active_lane_titles-based snapshot with a detailed id:status:reason:failure:health format, enabling --wait to detect changes in execution reason and health state, not just status.

A comprehensive test branch_status_renders_blocked_unhealthy_and_failure_details validates all rendering paths.
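
A simplified sketch of the two renderers follows, built around the condition shown in the evidence. The `CiLane` struct here is a reduced, hypothetical version with string-typed failure and health fields; the `stale recovered` label and the `-` placeholder for missing snapshot fields are assumptions, and the `[target run_id]` suffix is omitted.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExecutionReason {
    Queued,
    Running,
    BlockedByConcurrencyGroup,
    WaitingForCapacity,
    TargetUnhealthy,
    StaleRecovered,
}

impl ExecutionReason {
    /// Machine-readable form, as stored and serialized.
    pub fn as_str(self) -> &'static str {
        match self {
            Self::Queued => "queued",
            Self::Running => "running",
            Self::BlockedByConcurrencyGroup => "blocked_by_concurrency_group",
            Self::WaitingForCapacity => "waiting_for_capacity",
            Self::TargetUnhealthy => "target_unhealthy",
            Self::StaleRecovered => "stale_recovered",
        }
    }

    /// Human-readable form for display (wording assumed).
    pub fn label(self) -> &'static str {
        match self {
            Self::Queued => "queued",
            Self::Running => "running",
            Self::BlockedByConcurrencyGroup => "blocked by concurrency group",
            Self::WaitingForCapacity => "waiting for capacity",
            Self::TargetUnhealthy => "target unhealthy",
            Self::StaleRecovered => "stale recovered",
        }
    }
}

/// Reduced, hypothetical lane model for rendering.
pub struct CiLane {
    pub id: String,
    pub status: String,
    pub execution_reason: ExecutionReason,
    pub failure_kind: Option<String>,
    pub target_health_summary: Option<String>,
}

/// One display line: id, status, then optional reason, failure, and health.
pub fn render_lane_status_line(lane: &CiLane) -> String {
    let mut line = format!("{} {}", lane.id, lane.status);
    // Show the reason only when it adds information beyond the status.
    if matches!(lane.status.as_str(), "queued" | "running")
        && lane.execution_reason != ExecutionReason::Queued
        && lane.execution_reason.as_str() != lane.status
    {
        line.push_str(" · ");
        line.push_str(lane.execution_reason.label());
    }
    if let Some(kind) = &lane.failure_kind {
        line.push_str(" · failure=");
        line.push_str(kind);
    }
    if let Some(summary) = &lane.target_health_summary {
        line.push_str(" · ");
        line.push_str(summary);
    }
    line
}

/// Compact id:status:reason:failure:health fragment for change detection.
pub fn render_lane_snapshot_fragment(lane: &CiLane) -> String {
    format!(
        "{}:{}:{}:{}:{}",
        lane.id,
        lane.status,
        lane.execution_reason.as_str(),
        lane.failure_kind.as_deref().unwrap_or("-"),
        lane.target_health_summary.as_deref().unwrap_or("-"),
    )
}
```

Because the snapshot fragment includes the reason and health fields, two polls with the same status but a different blocking reason produce different strings, which is what lets a wait loop notice the change.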

Update web API serialization for new lane fields

Intent: Extend the JSON API responses to include execution_reason, failure_kind, ci_target_key, target_health_state, and target_health_summary so web dashboards and CLI clients can display the new state.

Affected files: crates/pika-news/src/web.rs

Evidence
@@ ... fn serialize_branch_lane ... +json!({
+    "execution_reason": lane.execution_reason.as_str(),
+    "failure_kind": lane.failure_kind.map(|k| k.as_str()),
+    "ci_target_key": lane.ci_target_key,
+    "target_health_state": lane.target_health.as_ref().map(|h| h.state.as_str()),
+    "target_health_summary": lane.target_health.as_ref().map(|h| h.summary_line()),

The web layer's lane serialization (used for both branch CI and nightly lane API responses) is updated to include:

  • execution_reason: the string representation of the execution reason enum
  • failure_kind: optional string classification of the failure
  • ci_target_key: the resolved target identifier
  • target_health_state: "healthy" or "unhealthy" from the hydrated snapshot
  • target_health_summary: a human-readable one-line summary like "target apple-host unhealthy · consecutive infra failures 2 · cooloff until 2026-03-24T00:15:00Z"

This applies to both branch lane and nightly lane serialization paths.

Update HTML templates with blocked and health indicators

Intent: Add visual indicators in the branch CI and nightly live dashboard templates to show execution reasons, failure classifications, and target health status.

Affected files: crates/pika-news/templates/branch_ci_live.html, crates/pika-news/templates/nightly_live.html

Evidence
@@ ... <span class="ci-lane-execution-reason">{{ lane.execution_reason }}</span>
@@ ... {% if lane.target_health_state == "unhealthy" %}<span class="ci-target-unhealthy">target unhealthy</span>{% endif %}
@@ ... {% if lane.failure_kind %}<span class="ci-failure-kind">{{ lane.failure_kind }}</span>{% endif %}

Both branch_ci_live.html and nightly_live.html templates are updated to render the new fields:

  • Execution reason badge: Displayed for queued/running lanes when the reason is not the default (e.g., shows "blocked by concurrency group", "waiting for capacity", "target unhealthy").
  • Failure kind badge: Shown for failed lanes, classifying the failure as "test failure", "timeout", or "infrastructure".
  • Target health indicator: When a lane's target is unhealthy, a warning is displayed with the health summary (including consecutive failure count and cooloff deadline).

These indicators give operators immediate visibility into why lanes are stalled or what kind of failure occurred, directly in the live dashboard.

Wire refresh into the CI polling loop

Intent: Call the execution reason refresh on each CI poll tick so that the dashboard always reflects current blocking reasons.

Affected files: crates/pika-news/src/ci.rs

Evidence
@@ ... store.refresh_ci_lane_execution_reasons(ci_concurrency)

The CI module's main polling function is updated to call store.refresh_ci_lane_execution_reasons(ci_concurrency) on each tick. This ensures that:

  1. Lanes blocked by concurrency groups are correctly labeled even if the blocking lane started between refreshes.
  2. Lanes targeting newly unhealthy targets are flagged on the next tick.
  3. Lanes that were waiting for capacity but now have slots available get their reason updated to queued.
  4. The dashboard and CLI --wait snapshots always reflect the latest blocking reasons.

The refresh runs within the same transaction as the rest of the poll tick, maintaining consistency.

Diff