This branch adds blocked-lane and unhealthy-target states to the pika-news CI system. It introduces a new ci_target_health table and three per-lane columns (execution_reason, failure_kind, ci_target_key). Together they record why a queued lane is not yet running (blocked by its concurrency group, waiting for capacity, or held back because its target is unhealthy) and classify failures as test failure, timeout, or infrastructure. When a lane finishes with an infrastructure failure, the system updates the target's health record; after a configurable threshold of consecutive infrastructure failures, it marks the target unhealthy and starts a cooloff period. Queued lanes that target an unhealthy target are skipped during claim and flagged in both the CLI and the web dashboards. A periodic refresh keeps execution reasons up to date across all queued lanes.
Tutorial Steps
Define CI state enums and health tracking types
Intent: Introduce a new `ci_state` module containing the core domain types: `CiLaneExecutionReason` (queued, running, blocked_by_concurrency_group, waiting_for_capacity, target_unhealthy, stale_recovered), `CiLaneFailureKind` (test_failure, timeout, infrastructure), `CiTargetHealthState` (healthy, unhealthy), and `CiTargetHealthSnapshot`. Also add the `classify_ci_failure` heuristic for auto-detecting failure kind from log text, `configured_target_key_for_lane` for resolving target keys from lane config, and `next_target_cooloff_until` for computing cooloff windows.
A new module ci_state.rs is added and registered in main.rs. It defines the enums and structs that represent the lifecycle of a CI lane beyond simple queued/running/passed/failed:
CiLaneExecutionReason captures why a lane is in its current status. A lane with status = queued might have execution_reason = target_unhealthy to indicate it is deliberately held back.
CiLaneFailureKind classifies failures into test_failure, timeout, or infrastructure. The classify_ci_failure function inspects log text for patterns like "timed out", "ci runner error", or "infrastructure" to automatically assign the kind.
CiTargetHealthState and CiTargetHealthSnapshot model per-target health. A target becomes unhealthy after CI_TARGET_HEALTH_INFRA_FAILURE_THRESHOLD (default 2) consecutive infrastructure failures. Once unhealthy, next_target_cooloff_until computes a 15-minute cooloff window.
configured_target_key_for_lane resolves the target key from lane configuration, preferring pikaci_target_id, then falling back to well-known concurrency groups like apple-host.
All enums implement FromStr for database deserialization and as_str()/label() for serialization and display.
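As an illustration of the enum conventions and the log heuristic described above, here is a minimal sketch for CiLaneFailureKind only. The variant names, the error type, the case-insensitive matching, and the order in which patterns are checked are assumptions; the branch's actual classify_ci_failure may differ.

```rust
use std::str::FromStr;

/// Failure classification with the three kinds named in the text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CiLaneFailureKind {
    TestFailure,
    Timeout,
    Infrastructure,
}

impl CiLaneFailureKind {
    /// Stable string form, as it would be stored in the failure_kind column.
    pub fn as_str(self) -> &'static str {
        match self {
            Self::TestFailure => "test_failure",
            Self::Timeout => "timeout",
            Self::Infrastructure => "infrastructure",
        }
    }
}

impl FromStr for CiLaneFailureKind {
    // Assumption: a plain String error; the real module may use another type.
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "test_failure" => Ok(Self::TestFailure),
            "timeout" => Ok(Self::Timeout),
            "infrastructure" => Ok(Self::Infrastructure),
            other => Err(format!("unknown failure kind: {}", other)),
        }
    }
}

/// Assumed heuristic: case-insensitive substring checks, timeout first,
/// then infrastructure markers, with test failure as the fallback.
pub fn classify_ci_failure(log_text: &str) -> CiLaneFailureKind {
    let lower = log_text.to_lowercase();
    if lower.contains("timed out") {
        CiLaneFailureKind::Timeout
    } else if lower.contains("ci runner error") || lower.contains("infrastructure") {
        CiLaneFailureKind::Infrastructure
    } else {
        CiLaneFailureKind::TestFailure
    }
}
```

Keeping as_str and from_str as exact inverses means a value written to the database always parses back to the same variant.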
Add database migration for new columns and ci_target_health table
Intent: Create migration 0019 that adds `execution_reason`, `failure_kind`, and `ci_target_key` columns to both `branch_ci_run_lanes` and `nightly_run_lanes`, backfills existing data, adds composite indexes for efficient querying, and creates the `ci_target_health` table.
+UPDATE branch_ci_run_lanes
+SET execution_reason = CASE
+ WHEN status = 'running' THEN 'running'
+ ELSE execution_reason
+END;
+UPDATE branch_ci_run_lanes
+SET ci_target_key = CASE
+ WHEN pikaci_target_id IS NOT NULL AND TRIM(pikaci_target_id) <> '' THEN pikaci_target_id
+ WHEN concurrency_group = 'apple-host' THEN 'apple-host'
+ ...
+END
+WHERE ci_target_key IS NULL;
+CREATE INDEX idx_branch_ci_run_lanes_status_reason
+ ON branch_ci_run_lanes(status, execution_reason, id ASC);
+CREATE INDEX idx_branch_ci_run_lanes_target_key
+ ON branch_ci_run_lanes(ci_target_key, status, id ASC);
+CREATE TABLE ci_target_health (
+ target_id TEXT PRIMARY KEY,
+ state TEXT NOT NULL,
+ consecutive_infra_failure_count INTEGER NOT NULL DEFAULT 0,
+ last_success_at TEXT,
+ last_failure_at TEXT,
+ last_failure_kind TEXT,
+ cooloff_until TEXT,
+ updated_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
+);
Migration 0019_ci_queue_state_and_target_health.sql makes five categories of schema changes:
New columns on branch_ci_run_lanes and nightly_run_lanes: execution_reason (NOT NULL, defaults to 'queued'), failure_kind (nullable), and ci_target_key (nullable).
Backfill execution_reason: Already-running lanes get 'running' instead of the default 'queued'.
Backfill ci_target_key: Populates the target key from pikaci_target_id if present, otherwise from well-known concurrency group names (apple-host, nightly-android).
Composite indexes: Four new indexes on (status, execution_reason, id) and (ci_target_key, status, id) for both lane tables, enabling efficient claim queries and execution reason refresh scans.
ci_target_health table: Tracks per-target health with state, consecutive_infra_failure_count, timestamps for last success/failure, the kind of last failure, and a cooloff_until timestamp.
Extend lane record structs with new fields
Intent: Add `execution_reason`, `failure_kind`, `ci_target_key`, and `target_health` fields to `BranchCiLaneRecord` and `NightlyLaneRecord` so the full lane state is available throughout the application.
Both BranchCiLaneRecord and NightlyLaneRecord gain four new fields:
execution_reason: CiLaneExecutionReason — why the lane is in its current state
failure_kind: Option<CiLaneFailureKind> — classification of the failure if the lane failed
ci_target_key: Option<String> — the resolved CI target identifier
target_health: Option<CiTargetHealthSnapshot> — the current health snapshot of the lane's target, hydrated after query
The rerun source structs (BranchLaneRerunSource, NightlyLaneRerunSource) also gain ci_target_key so reruns preserve target assignment.
Persist new columns during lane creation and rerun
Intent: Update all INSERT statements for branch and nightly lanes to include `ci_target_key` and `execution_reason` columns, ensuring new lanes and reruns are created with proper initial values.
Every INSERT INTO branch_ci_run_lanes and INSERT INTO nightly_run_lanes statement is updated to:
Call configured_target_key_for_lane(lane) to resolve and store ci_target_key at creation time.
Include execution_reason with an initial value of 'queued'.
This applies to four code paths:
Initial lane creation for branch CI suites
Initial lane creation for nightly runs
Rerun lane creation for branch CI
Rerun lane creation for nightly runs
The rerun queries also now SELECT and propagate ci_target_key from the source lane.
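The target key resolution used at creation time can be sketched as follows. LaneConfig is a hypothetical stand-in for the real lane configuration type, and the list of well-known groups contains only the two names mentioned in this walkthrough; the real function's signature and fallback list are not shown in the source.

```rust
/// Hypothetical stand-in for the lane configuration; only the two fields
/// the resolution rule reads are modeled.
pub struct LaneConfig {
    pub pikaci_target_id: Option<String>,
    pub concurrency_group: Option<String>,
}

/// Concurrency groups that the walkthrough names as doubling as target keys.
const WELL_KNOWN_TARGET_GROUPS: [&str; 2] = ["apple-host", "nightly-android"];

pub fn configured_target_key_for_lane(lane: &LaneConfig) -> Option<String> {
    // Prefer an explicit, non-blank pikaci target ID.
    if let Some(id) = lane.pikaci_target_id.as_deref() {
        if !id.trim().is_empty() {
            return Some(id.to_string());
        }
    }
    // Otherwise fall back to a well-known concurrency group, if any.
    lane.concurrency_group
        .as_deref()
        .filter(|group| WELL_KNOWN_TARGET_GROUPS.contains(group))
        .map(str::to_string)
}
```

The blank-string check mirrors the TRIM(pikaci_target_id) <> '' condition in the migration backfill, so newly created lanes and backfilled rows resolve to the same key.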
Update lane reset operations to clear failure state
Intent: Ensure that all operations that reset a lane back to 'queued' status (cancel, retry, stale recovery) also reset `execution_reason` and `failure_kind` to their initial values.
Cancel branch CI lane → sets execution_reason = 'queued', failure_kind = NULL
Cancel all branch CI lanes → same
Cancel nightly lane → same
Cancel all nightly lanes → same
Stale branch lane recovery → sets execution_reason = 'stale_recovered' (a distinct reason to track lanes that were recovered from stale state)
Stale nightly lane recovery → same
The lease-lost failure path also sets failure_kind = NULL, since lease loss is neither a test failure nor an infrastructure failure.
Skip unhealthy targets during lane claim
Intent: Modify the claim logic for both branch and nightly lanes to load currently unhealthy targets and skip queued lanes that target them, preventing dispatch to broken infrastructure.
The claim_pending_branch_ci_lane_runs and claim_pending_nightly_lane_runs methods now:
Load the set of currently unhealthy target IDs via current_unhealthy_target_ids(), which queries ci_target_health for targets that are in unhealthy state with an active cooloff window.
For each candidate lane, look up its ci_target_key and skip it if the target is unhealthy.
When a lane is claimed (transitioned to running), set execution_reason = 'running' and failure_kind = NULL.
This creates a natural back-pressure mechanism: lanes targeting unhealthy infrastructure stay queued until the cooloff expires or the target recovers.
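The skip rule can be sketched as a filter over candidate lanes. QueuedLane and select_claimable_lanes are hypothetical names, and the sketch assumes that lanes without a ci_target_key are never skipped; the real claim methods also perform the database update, which is omitted here.

```rust
use std::collections::HashSet;

/// Hypothetical, trimmed-down view of a queued lane row.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct QueuedLane {
    pub id: i64,
    pub ci_target_key: Option<String>,
}

/// Returns the IDs of up to `limit` lanes that may be claimed, in input
/// order, skipping any lane whose target is currently unhealthy.
/// Assumption: a lane with no target key is always eligible.
pub fn select_claimable_lanes(
    candidates: &[QueuedLane],
    unhealthy_targets: &HashSet<String>,
    limit: usize,
) -> Vec<i64> {
    candidates
        .iter()
        .filter(|lane| match lane.ci_target_key.as_deref() {
            Some(key) => !unhealthy_targets.contains(key),
            None => true,
        })
        .take(limit)
        .map(|lane| lane.id)
        .collect()
}
```

Because the filter runs before take, a skipped lane does not use up a claim slot, so healthy lanes further back in the queue can still be dispatched.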
Update target health on lane finish and classify failures
Intent: When a lane finishes, automatically classify the failure kind from log text, persist it, and update the `ci_target_health` table to track consecutive infrastructure failures and trigger unhealthy state transitions.
+ let state = if consecutive_failures >= CI_TARGET_HEALTH_INFRA_FAILURE_THRESHOLD {
+ CiTargetHealthState::Unhealthy
+ } else {
+ CiTargetHealthState::Healthy
+ };
The finish_branch_ci_lane_run and finish_nightly_lane_run methods are refactored into a two-layer pattern:
Public entry point calls classify_ci_failure(log_text) for failed lanes to auto-detect the failure kind, then delegates to finish_*_with_kind.
finish_*_with_kind persists failure_kind alongside the status update and calls update_target_health_after_lane_finish.
The update_target_health_after_lane_finish function implements the health state machine:
Infrastructure failure: Increments consecutive_infra_failure_count. If it reaches the threshold (2), marks the target as unhealthy and sets cooloff_until to 15 minutes in the future.
Non-infrastructure failure (test failure or timeout): Does not count toward unhealthy state, but records the failure.
Success (passed): Resets consecutive_infra_failure_count to 0 and marks the target as healthy.
The function uses INSERT ... ON CONFLICT ... DO UPDATE (upsert) to handle both new and existing target health records.
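The transitions above can be sketched in memory, leaving out the SQL upsert. TargetHealthRow, apply_lane_finish, and CI_TARGET_COOLOFF_SECS are hypothetical names, timestamps are plain Unix seconds, and the signature of next_target_cooloff_until is assumed. Two behaviours are also assumptions: a non-infrastructure failure leaves the counter untouched, and a success clears cooloff_until.

```rust
/// Threshold and cooloff length as stated in the walkthrough.
pub const CI_TARGET_HEALTH_INFRA_FAILURE_THRESHOLD: u32 = 2;
pub const CI_TARGET_COOLOFF_SECS: u64 = 15 * 60; // hypothetical constant name

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CiLaneFailureKind {
    TestFailure,
    Timeout,
    Infrastructure,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CiTargetHealthState {
    Healthy,
    Unhealthy,
}

/// In-memory stand-in for one ci_target_health row (Unix-second timestamps).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TargetHealthRow {
    pub state: CiTargetHealthState,
    pub consecutive_infra_failure_count: u32,
    pub last_failure_kind: Option<CiLaneFailureKind>,
    pub cooloff_until: Option<u64>,
}

impl TargetHealthRow {
    /// A target with no recorded history starts out healthy.
    pub fn new() -> Self {
        Self {
            state: CiTargetHealthState::Healthy,
            consecutive_infra_failure_count: 0,
            last_failure_kind: None,
            cooloff_until: None,
        }
    }
}

/// Assumed signature: end of the cooloff window that starts at `now`.
pub fn next_target_cooloff_until(now: u64) -> u64 {
    now + CI_TARGET_COOLOFF_SECS
}

/// `outcome` is None for a passed lane and Some(kind) for a failed one.
pub fn apply_lane_finish(row: &mut TargetHealthRow, outcome: Option<CiLaneFailureKind>, now: u64) {
    match outcome {
        None => {
            // Success resets the streak and marks the target healthy.
            row.consecutive_infra_failure_count = 0;
            row.state = CiTargetHealthState::Healthy;
            row.cooloff_until = None; // assumption: success clears the cooloff
        }
        Some(CiLaneFailureKind::Infrastructure) => {
            row.consecutive_infra_failure_count += 1;
            row.last_failure_kind = Some(CiLaneFailureKind::Infrastructure);
            if row.consecutive_infra_failure_count >= CI_TARGET_HEALTH_INFRA_FAILURE_THRESHOLD {
                row.state = CiTargetHealthState::Unhealthy;
                row.cooloff_until = Some(next_target_cooloff_until(now));
            } else {
                row.state = CiTargetHealthState::Healthy;
            }
        }
        Some(kind) => {
            // Assumption: recorded, but the infra streak is left untouched.
            row.last_failure_kind = Some(kind);
        }
    }
}
```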
Hydrate lane records with target health snapshots
Intent: After loading lane records from the database, batch-load their associated target health snapshots so the full health context is available for rendering.
Lane listing functions (list_branch_ci_run_lanes, list_nightly_run_lanes) now call hydration functions after loading lanes:
hydrate_lane_target_health / hydrate_nightly_lane_target_health: Collect unique ci_target_key values from the loaded lanes, then batch-load their health snapshots.
load_ci_target_health_snapshots: Builds a dynamic IN (...) query against ci_target_health and returns a HashMap<String, CiTargetHealthSnapshot>.
Each lane's target_health field is populated by looking up its ci_target_key in the map.
The lane SELECT queries are also updated to read the new columns (execution_reason, failure_kind, ci_target_key) and parse them via parse_execution_reason and parse_optional_failure_kind helper functions.
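The hydration flow might look roughly like this sketch, which models the three steps without a database connection. The Lane struct, the reduced snapshot fields, the helper names unique_target_keys and health_query_sql, and the exact SQL text are illustrative assumptions.

```rust
use std::collections::{BTreeSet, HashMap};

/// Reduced snapshot; the real struct carries more fields.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CiTargetHealthSnapshot {
    pub target_id: String,
    pub state: String,
}

/// Hypothetical, trimmed-down lane record.
#[derive(Debug, Clone)]
pub struct Lane {
    pub id: i64,
    pub ci_target_key: Option<String>,
    pub target_health: Option<CiTargetHealthSnapshot>,
}

/// Step 1: collect the distinct, non-null target keys (sorted for stability).
pub fn unique_target_keys(lanes: &[Lane]) -> Vec<String> {
    let keys: BTreeSet<&str> = lanes
        .iter()
        .filter_map(|lane| lane.ci_target_key.as_deref())
        .collect();
    keys.into_iter().map(str::to_string).collect()
}

/// Step 2: build the batch query text with one numbered placeholder per key.
pub fn health_query_sql(key_count: usize) -> String {
    let placeholders: Vec<String> = (1..=key_count).map(|i| format!("?{}", i)).collect();
    format!(
        "SELECT target_id, state FROM ci_target_health WHERE target_id IN ({})",
        placeholders.join(", ")
    )
}

/// Step 3: fill each lane's target_health from the loaded map.
pub fn hydrate_lane_target_health(
    lanes: &mut [Lane],
    snapshots: &HashMap<String, CiTargetHealthSnapshot>,
) {
    for lane in lanes.iter_mut() {
        lane.target_health = lane
            .ci_target_key
            .as_ref()
            .and_then(|key| snapshots.get(key).cloned());
    }
}
```

Deduplicating keys first keeps the IN list short when many lanes share one target. A caller would need to skip the query entirely when there are no keys, since an empty IN () list is not portable SQL.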
Add periodic execution reason refresh
Intent: Introduce a `refresh_ci_lane_execution_reasons` method that recalculates the execution_reason for all queued lanes based on current system state (concurrency groups, capacity, target health), and wire it into the CI polling loop.
A new refresh_ci_lane_execution_reasons method on Store is called periodically (integrated into ci.rs polling) to keep execution reasons accurate:
Running lanes: Forces execution_reason = 'running' for any lane with status = 'running' that somehow has a stale reason.
Queued lanes: Iterates all queued lanes in ID order (nightly first, then branch) and assigns the reason based on priority:
TargetUnhealthy if the lane's target is in the unhealthy set
BlockedByConcurrencyGroup if the lane's concurrency group is occupied
WaitingForCapacity if global CI concurrency slots are exhausted
StaleRecovered preserved if already set (to maintain audit trail)
Queued otherwise (the lane would be picked up on next claim)
Simulates claim order: lanes that would be claimed consume available slots and occupy concurrency groups, so downstream lanes get accurate reasons.
This runs within a single IMMEDIATE transaction to provide a consistent snapshot.
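The queued-lane pass can be sketched as a single loop that both assigns reasons and simulates claiming. The function name refresh_queued_reasons, the QueuedLane struct, and the way running state is passed in (a set of occupied groups plus a free-slot count) are assumptions; the real method reads and writes this state inside the transaction.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CiLaneExecutionReason {
    Queued,
    BlockedByConcurrencyGroup,
    WaitingForCapacity,
    TargetUnhealthy,
    StaleRecovered,
}

/// Hypothetical, trimmed-down view of a queued lane.
#[derive(Debug, Clone)]
pub struct QueuedLane {
    pub id: i64,
    pub ci_target_key: Option<String>,
    pub concurrency_group: Option<String>,
    pub execution_reason: CiLaneExecutionReason,
}

/// Recomputes reasons for queued lanes, which must already be in claim order.
/// `occupied_groups` and `free_slots` describe what running lanes hold now.
pub fn refresh_queued_reasons(
    lanes: &mut [QueuedLane],
    unhealthy_targets: &HashSet<String>,
    mut occupied_groups: HashSet<String>,
    mut free_slots: usize,
) {
    for lane in lanes.iter_mut() {
        let target_unhealthy = lane
            .ci_target_key
            .as_deref()
            .map_or(false, |key| unhealthy_targets.contains(key));
        let group_busy = lane
            .concurrency_group
            .as_deref()
            .map_or(false, |group| occupied_groups.contains(group));

        if target_unhealthy {
            lane.execution_reason = CiLaneExecutionReason::TargetUnhealthy;
        } else if group_busy {
            lane.execution_reason = CiLaneExecutionReason::BlockedByConcurrencyGroup;
        } else if free_slots == 0 {
            lane.execution_reason = CiLaneExecutionReason::WaitingForCapacity;
        } else {
            // This lane would be claimed next: consume a slot and occupy its
            // group so that later lanes see the resulting state.
            free_slots -= 1;
            if let Some(group) = &lane.concurrency_group {
                occupied_groups.insert(group.clone());
            }
            // Keep StaleRecovered as an audit trail; otherwise plain Queued.
            if lane.execution_reason != CiLaneExecutionReason::StaleRecovered {
                lane.execution_reason = CiLaneExecutionReason::Queued;
            }
        }
    }
}
```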
Update pikaci target ID to also set ci_target_key
Intent: When a running lane reports its pikaci target ID, backfill `ci_target_key` if it was not already set, ensuring target health tracking covers dynamically assigned targets.
Both update_branch_ci_lane_pikaci_ids and update_nightly_lane_pikaci_ids now include ci_target_key = COALESCE(ci_target_key, ?2) in their UPDATE statements. This means:
If the lane already had a ci_target_key from configuration, it is preserved.
If it was NULL (no configured target), it gets populated from the pikaci_target_id reported at runtime.
This ensures that even lanes without preconfigured targets participate in target health tracking once they are assigned to a runner.
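For readers less familiar with COALESCE, its effect on ci_target_key corresponds to the following Rust expression. merge_target_key is a hypothetical helper used only for illustration; the branch performs this in SQL.

```rust
/// Mirrors `ci_target_key = COALESCE(ci_target_key, ?2)`: keep the existing
/// key when present, otherwise adopt the pikaci target ID reported at runtime.
pub fn merge_target_key(existing: Option<String>, reported_pikaci_target_id: &str) -> Option<String> {
    existing.or_else(|| Some(reported_pikaci_target_id.to_string()))
}
```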
Extend the CLI client with new lane display fields
Intent: Update the ph CLI tool's data model and rendering to display execution reasons, failure kinds, and target health information in branch status output and wait snapshots.
On the client side, corresponding enums (CiLaneExecutionReason, CiLaneFailureKind, CiTargetHealthState) are defined with their own display logic, mirroring the server-side types.
Rendering is refactored:
render_branch_status replaces the direct println! calls in print_branch_status with a String-returning function for testability.
render_lane_status_line builds a rich single-line display: lane_id status [· reason] [· failure=kind] [· health summary] [target run_id].
render_lane_snapshot_fragment replaces the old active_lane_titles-based snapshot with a detailed id:status:reason:failure:health format, enabling --wait to detect changes in execution reason and health state, not just status.
A comprehensive test branch_status_renders_blocked_unhealthy_and_failure_details validates all rendering paths.
Update web API serialization for new lane fields
Intent: Extend the JSON API responses to include execution_reason, failure_kind, ci_target_key, target_health_state, and target_health_summary so web dashboards and CLI clients can display the new state.
The web layer's lane serialization (used for both branch CI and nightly lane API responses) is updated to include:
execution_reason — the string representation of the execution reason enum
failure_kind — optional string classification of the failure
ci_target_key — the resolved target identifier
target_health_state — "healthy" or "unhealthy" from the hydrated snapshot
target_health_summary — a human-readable one-line summary like "target apple-host unhealthy · consecutive infra failures 2 · cooloff until 2026-03-24T00:15:00Z"
This applies to both branch lane and nightly lane serialization paths.
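A summary string in the shape of the example above could be assembled as in this sketch. The function name target_health_summary, the reduced snapshot fields, and the rule for omitting segments (no failure count when it is zero, no cooloff when unset) are assumptions rather than the branch's confirmed behaviour.

```rust
/// Reduced snapshot with only the fields the summary needs.
pub struct CiTargetHealthSnapshot {
    pub target_id: String,
    pub state: String,
    pub consecutive_infra_failure_count: u32,
    pub cooloff_until: Option<String>,
}

/// Joins the available segments with " · ", in the order used by the example.
pub fn target_health_summary(snapshot: &CiTargetHealthSnapshot) -> String {
    let mut parts = vec![format!("target {} {}", snapshot.target_id, snapshot.state)];
    if snapshot.consecutive_infra_failure_count > 0 {
        parts.push(format!(
            "consecutive infra failures {}",
            snapshot.consecutive_infra_failure_count
        ));
    }
    if let Some(until) = &snapshot.cooloff_until {
        parts.push(format!("cooloff until {}", until));
    }
    parts.join(" · ")
}
```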
Update HTML templates with blocked and health indicators
Intent: Add visual indicators in the branch CI and nightly live dashboard templates to show execution reasons, failure classifications, and target health status.
Both branch_ci_live.html and nightly_live.html templates are updated to render the new fields:
Execution reason badge: Displayed for queued/running lanes when the reason is not the default (e.g., shows "blocked by concurrency group", "waiting for capacity", "target unhealthy").
Failure kind badge: Shown for failed lanes, classifying the failure as "test failure", "timeout", or "infrastructure".
Target health indicator: When a lane's target is unhealthy, a warning is displayed with the health summary (including consecutive failure count and cooloff deadline).
These indicators give operators immediate visibility into why lanes are stalled or what kind of failure occurred, directly in the live dashboard.
Wire refresh into the CI polling loop
Intent: Call the execution reason refresh on each CI poll tick so that the dashboard always reflects current blocking reasons.