sledtools/pika branch #25

pika-git-1

Add manual CI recovery controls

Target branch: master

Merge Commit: 8b4c20ca4482a4a450ca98a31d486e38d0235b26

branch: merged · tutorial: ready · ci: success
Open CI Details

Continuous Integration

CI: success

Compact status on the review page, with full logs on the CI page.

Latest run #30 success

2 passed

head 10828d6783a578f64e58e777f61166c470f4e9f4 · queued 2026-03-20 21:34:56 · 2 lane(s)

queued 16s · ran 24s

check-notifications · success
check-agent-contracts · success

Summary

This branch adds manual CI recovery controls to the pika forge system, enabling operators to intervene when CI lanes become stuck, unresponsive, or need manual resolution. Four new CLI commands are introduced — fail-lane, requeue-lane, recover-run, and wake-ci — along with their corresponding server-side API endpoints and database store operations. The changes span the ph CLI client (command parsing, API client methods, lane resolution logic), the pika-news backend store (transactional fail/requeue/recover mutations with claim-token fencing), the web server (new POST handler routes, operator hint rendering, nightly detail API), and the HTML templates (action buttons for fail/requeue/recover in both branch CI and nightly views). Heartbeat and lease-expiry columns are also surfaced throughout the lane record types to support staleness detection and operator guidance in the UI.

Tutorial Steps

Register four new CLI subcommands in the ph client

Intent: Wire up the new FailLane, RequeueLane, RecoverRun, and WakeCi subcommands in the top-level command dispatch and clap enum so users can invoke them from the command line.

Affected files: crates/ph/src/lib.rs

Evidence
@@ -35,6 +35,10 @@ pub fn run() -> anyhow::Result<()> {
+        PhCommand::FailLane(args) => cmd_fail_lane(&cli, args),
+        PhCommand::RequeueLane(args) => cmd_requeue_lane(&cli, args),
+        PhCommand::RecoverRun(args) => cmd_recover_run(&cli, args),
+        PhCommand::WakeCi => cmd_wake_ci(&cli),
@@ -82,6 +86,10 @@ enum PhCommand {
+    FailLane(LaneActionArgs),
+    RequeueLane(LaneActionArgs),
+    RecoverRun(RecoverRunArgs),
+    WakeCi,

The PhCommand enum gains four new variants. FailLane and RequeueLane both carry LaneActionArgs (a shared argument struct), RecoverRun carries RecoverRunArgs, and WakeCi is a unit variant with no arguments. The match in run() dispatches each variant to its corresponding cmd_* function.

This is the entry point for all four recovery operations — the rest of the CLI changes implement the argument structs, command bodies, and API client methods these dispatch calls depend on.

Define argument structs with mutually-exclusive selectors

Intent: Provide clap-derived argument structs that allow users to target lanes by either branch name or nightly run ID, and select lanes by either lane name or lane run ID, enforcing mutual exclusivity at the argument-parsing level.

Affected files: crates/ph/src/lib.rs

Evidence
@@ -92,6 +100,30 @@ struct LoginArgs {
+#[derive(Debug, Clone, clap::Args)]
+struct LaneActionArgs {
+    branch_or_id: Option<String>,
+    #[arg(long, conflicts_with = "branch_or_id")]
+    nightly_run_id: Option<i64>,
+    #[arg(
+        long,
+        conflicts_with = "lane_run_id",
+        required_unless_present = "lane_run_id"
+    )]
+    lane: Option<String>,
+    #[arg(long, conflicts_with = "lane", required_unless_present = "lane")]
+    lane_run_id: Option<i64>,
+}
+
+#[derive(Debug, Clone, clap::Args)]
+struct RecoverRunArgs {
+    branch_or_id: Option<String>,
+    #[arg(long, conflicts_with = "branch_or_id")]
+    nightly_run_id: Option<i64>,
+    #[arg(long, conflicts_with = "nightly_run_id")]
+    run_id: Option<i64>,
+}

LaneActionArgs supports two targeting modes:

  1. Branch mode: positional branch_or_id (branch name or numeric ID) to target a branch's CI lanes.
  2. Nightly mode: --nightly-run-id to target a nightly run's lanes.

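Within either mode, the lane is selected by --lane (a lane name such as check-pika) or --lane-run-id (a numeric lane run ID). Clap's conflicts_with and required_unless_present annotations enforce that exactly one lane selector is provided. The target pair is looser: conflicts_with only forbids combining branch_or_id with --nightly-run-id, so both may be omitted.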

RecoverRunArgs is simpler: it identifies a whole run (a branch run, optionally pinned with --run-id, or a nightly run via --nightly-run-id) and recovers every lane in it that has not succeeded.
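The targeting rule can be sketched without clap as a plain match over the two optional inputs. This is a minimal illustration, not the CLI's actual code: the Target enum and resolve_target are hypothetical names, and the CurrentBranch fallback is an assumption based on branch_or_id being optional.

```rust
/// Which CI target a recovery command addresses (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum Target {
    /// A nightly run selected with --nightly-run-id.
    Nightly(i64),
    /// A branch given positionally, as a name or a numeric ID string.
    Branch(String),
    /// Nothing given: assumed to fall back to the current branch.
    CurrentBranch,
}

/// Hand-written equivalent of the conflicts_with rule on the target pair.
fn resolve_target(
    branch_or_id: Option<&str>,
    nightly_run_id: Option<i64>,
) -> Result<Target, String> {
    match (branch_or_id, nightly_run_id) {
        // Both supplied: the two targeting modes conflict.
        (Some(_), Some(_)) => {
            Err("branch_or_id cannot be combined with --nightly-run-id".to_string())
        }
        (None, Some(id)) => Ok(Target::Nightly(id)),
        (Some(branch), None) => Ok(Target::Branch(branch.to_string())),
        (None, None) => Ok(Target::CurrentBranch),
    }
}
```

With the derive-based structs, clap rejects the conflicting combination before any command body runs, so a check like this would only be needed if the arguments were built by hand.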

Add API response types for lane mutations and recovery

Intent: Define the deserialization types needed to interpret server responses from the new fail, requeue, recover, and wake endpoints.

Affected files: crates/ph/src/lib.rs

Evidence
@@ -191,6 +223,46 @@ struct BranchLogsResponse {
+#[derive(Debug, Clone, Deserialize, PartialEq, Eq)]
+struct NightlyDetailResponse {
+    nightly_run_id: i64,
+    ...
+    lanes: Vec<CiLane>,
+}
+
+#[derive(Debug, Clone, Deserialize, PartialEq, Eq)]
+struct LaneMutationResponse {
+    status: String,
+    branch_id: Option<i64>,
+    nightly_run_id: Option<i64>,
+    lane_run_id: i64,
+    lane_status: String,
+}
+
+#[derive(Debug, Clone, Deserialize, PartialEq, Eq)]
+struct RecoverRunResponse {
+    status: String,
+    ...
+    recovered_lane_count: usize,
+}
+
+#[derive(Debug, Clone, Deserialize, PartialEq, Eq)]
+struct WakeCiResponse {
+    status: String,
+    message: String,
+}

Four new response structs:

  • NightlyDetailResponse: Full nightly run detail including lanes, used by nightly_detail() to resolve lane names within a nightly.
  • LaneMutationResponse: Returned by both fail and requeue endpoints; carries the resulting lane_status.
  • RecoverRunResponse: Returned by recover endpoints; carries recovered_lane_count so the CLI can report how many lanes were reset.
  • WakeCiResponse: Simple status + message from the scheduler wake endpoint.

Implement command bodies for fail-lane, requeue-lane, recover-run, and wake-ci

Intent: Write the four cmd_* functions that authenticate, resolve targets, call the appropriate API method, and print results.

Affected files: crates/ph/src/lib.rs

Evidence
@@ -366,6 +438,53 @@ fn cmd_url(cli: &Cli, branch_or_id: Option<&str>) -> anyhow::Result<()> {
+fn cmd_fail_lane(cli: &Cli, args: &LaneActionArgs) -> anyhow::Result<()> {
+    let session = load_session(&cli.state_dir)?;
+    ...
+    execute_lane_action(&api, args, LaneActionKind::Fail)
+}
+
+fn cmd_requeue_lane(cli: &Cli, args: &LaneActionArgs) -> anyhow::Result<()> {
+    ...
+    execute_lane_action(&api, args, LaneActionKind::Requeue)
+}
+
+fn cmd_recover_run(cli: &Cli, args: &RecoverRunArgs) -> anyhow::Result<()> {
+    ...
+    let response = api.recover_branch_ci_run(resolved.branch_id, run_id)?;
+    println!("recovered branch #{} run #{} lanes={}", ...);
+}
+
+fn cmd_wake_ci(cli: &Cli) -> anyhow::Result<()> {
+    ...
+    let response = api.wake_ci()?;
+    println!("{}", response.message);
+}

Each command follows the same pattern: load session, construct an authenticated ApiClient, then dispatch.

cmd_fail_lane and cmd_requeue_lane both delegate to execute_lane_action (a shared helper that handles the branch-vs-nightly branching logic and lane resolution). cmd_recover_run checks for --nightly-run-id first, falling back to branch resolution. cmd_wake_ci is the simplest — it just POSTs to the scheduler wake endpoint and prints the response message.

All commands use println! for human-readable output summarizing what was done.

Build lane resolution helpers for branch and nightly targets

Intent: Implement the shared execute_lane_action function and the lane selector resolution logic that maps user-supplied --lane or --lane-run-id arguments to actual CiLane records, searching across CI runs.

Affected files: crates/ph/src/lib.rs

Evidence
@@ -431,6 +565,112 @@ fn infer_current_branch() -> anyhow::Result<String> {
+fn execute_lane_action(
+    api: &ApiClient,
+    args: &LaneActionArgs,
+    action: LaneActionKind,
+) -> anyhow::Result<()> {
+    if let Some(nightly_run_id) = args.nightly_run_id {
+        let nightly = api.nightly_detail(nightly_run_id)?;
+        let lane = resolve_lane_selector(&nightly.lanes, ...);
+        ...
+    }
+    let resolved = resolve_branch_ref(api, args.branch_or_id.as_deref())?;
+    let branch = api.branch_detail(resolved.branch_id)?;
+    let lane = resolve_branch_lane(&branch, ...);
+    ...
+}
@@ +fn resolve_branch_lane<'a>(
+    branch: &'a BranchDetailResponse,
+    lane: Option<&str>,
+    lane_run_id: Option<i64>,
+) -> anyhow::Result<&'a CiLane> {
+    let selector = lane_selector(lane, lane_run_id)?;
+    for run in &branch.ci_runs {
+        if let Ok(found) = resolve_lane_selector(&run.lanes, ...) {
+            return Ok(found);
+        }
+    }
+    ...
+}
@@ +fn resolve_lane_selector<'a>(
+    lanes: &'a [CiLane],
+    lane: Option<&str>,
+    lane_run_id: Option<i64>,
+) -> anyhow::Result<&'a CiLane> {

The resolution logic works in layers:

  1. execute_lane_action: Branches on whether nightly_run_id is set. For nightly targets, it fetches the nightly detail and resolves against its lanes. For branch targets, it resolves the branch ref, fetches detail, and resolves against branch lanes.

  2. resolve_branch_lane: Iterates over all CI runs on a branch (newest first, since that's the API ordering) and returns the first matching lane. This means --lane check-pika resolves against the latest run that has a lane with that name.

  3. resolve_lane_selector: The lowest-level resolver that matches either by lane name (lane_id field) or by numeric lane run ID.

  4. lane_selector: Validates that exactly one of --lane or --lane-run-id was provided.

  5. resolve_branch_run_id: For recover-run, validates that the requested run ID actually exists on the branch, defaulting to the first (latest) run.
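The layers above can be sketched with in-memory stand-ins. The function names follow the list, but the signatures are simplified assumptions (the real helpers return anyhow::Result and take the raw Option arguments), and CiRun and LaneSelector are hypothetical types introduced for the illustration.

```rust
/// Trimmed stand-in for the CLI's lane record.
#[derive(Debug, Clone, PartialEq, Eq)]
struct CiLane {
    lane_run_id: i64,
    lane_id: String, // lane name, e.g. "check-pika"
}

/// Hypothetical run record: an ID plus its lanes.
#[derive(Debug, Clone)]
struct CiRun {
    run_id: i64,
    lanes: Vec<CiLane>,
}

/// Exactly one way of naming a lane.
#[derive(Debug, PartialEq, Eq)]
enum LaneSelector<'a> {
    Name(&'a str),
    RunId(i64),
}

/// Reject the "neither" and "both" cases.
fn lane_selector<'a>(
    lane: Option<&'a str>,
    lane_run_id: Option<i64>,
) -> Result<LaneSelector<'a>, String> {
    match (lane, lane_run_id) {
        (Some(name), None) => Ok(LaneSelector::Name(name)),
        (None, Some(id)) => Ok(LaneSelector::RunId(id)),
        _ => Err("provide exactly one of --lane or --lane-run-id".to_string()),
    }
}

/// Lowest layer: match by lane name or by numeric lane run ID.
fn resolve_lane_selector<'a>(
    lanes: &'a [CiLane],
    selector: &LaneSelector<'_>,
) -> Option<&'a CiLane> {
    lanes.iter().find(|lane| match selector {
        LaneSelector::Name(name) => lane.lane_id == *name,
        LaneSelector::RunId(id) => lane.lane_run_id == *id,
    })
}

/// Walk runs in the order given (assumed newest first); first match wins.
fn resolve_branch_lane<'a>(
    runs: &'a [CiRun],
    selector: &LaneSelector<'_>,
) -> Option<&'a CiLane> {
    runs.iter()
        .find_map(|run| resolve_lane_selector(&run.lanes, selector))
}

/// Default to the first (latest) run; otherwise the ID must exist.
fn resolve_branch_run_id(runs: &[CiRun], requested: Option<i64>) -> Result<i64, String> {
    match requested {
        None => runs
            .first()
            .map(|run| run.run_id)
            .ok_or_else(|| "branch has no CI runs".to_string()),
        Some(id) if runs.iter().any(|run| run.run_id == id) => Ok(id),
        Some(id) => Err(format!("run #{} not found on branch", id)),
    }
}
```

Because the search stops at the first match, a lane name that appears in several runs always resolves to the newest one, while --lane-run-id can still reach a lane in an older run.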

Add API client methods for all new server endpoints

Intent: Extend the ApiClient with methods to call the fail, requeue, recover, wake, and nightly detail endpoints on the server.

Affected files: crates/ph/src/lib.rs

Evidence
@@ -710,6 +950,15 @@ impl ApiClient {
+    fn nightly_detail(&self, nightly_run_id: i64) -> anyhow::Result<NightlyDetailResponse> {
+        self.send(Method::GET, &format!("/news/api/forge/nightly/{nightly_run_id}"), None::<&()>, true)
+    }
@@ -728,6 +977,89 @@ impl ApiClient {
+    fn fail_branch_ci_lane(&self, branch_id: i64, lane_run_id: i64) -> anyhow::Result<LaneMutationResponse> {
+        self.send(Method::POST, &format!("/news/branch/{branch_id}/ci/fail/{lane_run_id}"), Some(&serde_json::json!({})), true)
+    }
+    fn requeue_branch_ci_lane(...)
+    fn recover_branch_ci_run(...)
+    fn fail_nightly_lane(...)
+    fn requeue_nightly_lane(...)
+    fn recover_nightly_run(...)
+    fn wake_ci(&self) -> anyhow::Result<WakeCiResponse> {
+        self.send(Method::POST, "/news/api/forge/ci/wake", Some(&serde_json::json!({})), true)
+    }

Eight new ApiClient methods are added, each a thin wrapper around self.send():

Method                   HTTP   Path
nightly_detail           GET    /news/api/forge/nightly/{id}
fail_branch_ci_lane      POST   /news/branch/{id}/ci/fail/{lane_run_id}
requeue_branch_ci_lane   POST   /news/branch/{id}/ci/requeue/{lane_run_id}
recover_branch_ci_run    POST   /news/branch/{id}/ci/recover/{run_id}
fail_nightly_lane        POST   /news/nightly/{id}/fail/{lane_run_id}
requeue_nightly_lane     POST   /news/nightly/{id}/requeue/{lane_run_id}
recover_nightly_run      POST   /news/nightly/{id}/recover
wake_ci                  POST   /news/api/forge/ci/wake

All mutation endpoints send an empty JSON body and require authentication (true for the auth parameter).
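As a rough illustration of how the mutation paths are assembled, the sketch below builds them from typed inputs. The path shapes come from the routes listed above; ActionTarget and the three function names are hypothetical and do not appear in the ph client.

```rust
/// The two lane-level actions (variant names as used by the CLI dispatch).
#[derive(Debug, Clone, Copy)]
enum LaneActionKind {
    Fail,
    Requeue,
}

/// Hypothetical target type: a branch ID or a nightly run ID.
#[derive(Debug, Clone, Copy)]
enum ActionTarget {
    Branch(i64),
    Nightly(i64),
}

/// Path for failing or requeueing a single lane run.
fn lane_action_path(target: ActionTarget, action: LaneActionKind, lane_run_id: i64) -> String {
    let verb = match action {
        LaneActionKind::Fail => "fail",
        LaneActionKind::Requeue => "requeue",
    };
    match target {
        // Branch routes carry an extra "ci" segment.
        ActionTarget::Branch(branch_id) => {
            format!("/news/branch/{}/ci/{}/{}", branch_id, verb, lane_run_id)
        }
        ActionTarget::Nightly(nightly_run_id) => {
            format!("/news/nightly/{}/{}/{}", nightly_run_id, verb, lane_run_id)
        }
    }
}

/// Branch recovery names the run explicitly.
fn recover_branch_path(branch_id: i64, run_id: i64) -> String {
    format!("/news/branch/{}/ci/recover/{}", branch_id, run_id)
}

/// Nightly recovery needs only the nightly run ID.
fn recover_nightly_path(nightly_run_id: i64) -> String {
    format!("/news/nightly/{}/recover", nightly_run_id)
}
```

Note the asymmetry: a branch can have many CI runs, so its recover path includes a run ID, whereas a nightly run is itself the unit being recovered.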

Add heartbeat and lease columns to lane record types

Intent: Surface the last_heartbeat_at and lease_expires_at columns in both BranchCiLaneRecord and NightlyLaneRecord, and update the SQL queries that populate them.

Affected files: crates/pika-news/src/branch_store.rs

Evidence
@@ -115,6 +115,8 @@ pub struct BranchCiLaneRecord {
+    pub last_heartbeat_at: Option<String>,
+    pub lease_expires_at: Option<String>,
@@ -174,6 +176,8 @@ pub struct NightlyLaneRecord {
+    pub last_heartbeat_at: Option<String>,
+    pub lease_expires_at: Option<String>,
@@ -2169,7 +2529,7 @@ fn list_branch_ci_run_lanes(
-            "SELECT id, lane_id, title, entrypoint, status, pikaci_run_id, pikaci_target_id, log_text, retry_count, rerun_of_lane_run_id, created_at, started_at, finished_at
+            "SELECT id, lane_id, title, entrypoint, status, pikaci_run_id, pikaci_target_id, log_text, retry_count, rerun_of_lane_run_id, created_at, started_at, finished_at, last_heartbeat_at, lease_expires_at

Two new Option<String> fields, last_heartbeat_at and lease_expires_at, are added to both BranchCiLaneRecord and NightlyLaneRecord. The corresponding SQL SELECT queries in list_branch_ci_run_lanes and list_nightly_run_lanes are extended to select both columns, which land at zero-based indices 13 and 14 of the result row.

These columns already exist in the database schema (they're used by the worker heartbeat/lease system). Surfacing them in the record types allows the web layer to compute operator hints about lane health.

Implement store-level fail, requeue, and recover operations with claim-token fencing

Intent: Add transactional database mutations that allow operators to manually fail lanes, requeue lanes, or recover entire runs, using claim_token increments to fence out stale workers.

Affected files: crates/pika-news/src/branch_store.rs

Evidence
@@ -1141,6 +1163,344 @@ impl Store {
+    pub fn fail_branch_ci_lane(
+        &self,
+        branch_id: i64,
+        lane_run_id: i64,
+        actor_npub: &str,
+    ) -> anyhow::Result<Option<()>> {
+        ...
+        if !matches!(row.status.as_str(), "queued" | "running") {
+            bail!("lane {} is already {}", row.lane_id, row.status);
+        }
+        ...
+        SET status = 'failed', log_text = ?1, finished_at = CURRENT_TIMESTAMP,
+            lease_expires_at = NULL, claim_token = ?2
+        ...
+    }
@@ +    pub fn requeue_branch_ci_lane(
+        ...
+        if !matches!(row.status.as_str(), "queued" | "running" | "failed") {
+            bail!("lane {} cannot be requeued from {}", ...);
+        }
+        ...
+        SET status = 'queued', log_text = NULL, ... claim_token = ?1
+        ...
+    }
@@ +    pub fn recover_branch_ci_run(
+        ...
+        UPDATE branch_ci_run_lanes
+        SET status = 'queued', ... claim_token = claim_token + 1
+        WHERE branch_ci_run_id = ?1 AND status IN ('queued', 'running', 'failed')
+        ...
+    }

Six new Store methods are implemented, three for branches and three mirrored for nightlies:

fail_branch_ci_lane / fail_nightly_lane

  • Validates lane is in queued or running state (rejects already-terminal lanes).
  • Sets status to failed, appends a manual failure note to log_text identifying the actor.
  • Clears lease_expires_at and increments claim_token to invalidate any active worker lease.
  • Calls update_branch_ci_suite_status / update_nightly_run_status to recalculate the parent run's aggregate status.

requeue_branch_ci_lane / requeue_nightly_lane

  • Accepts lanes in queued, running, or failed state.
  • Resets to queued with all progress fields cleared (log_text, pikaci_run_id, started_at, finished_at, last_heartbeat_at, lease_expires_at).
  • Increments retry_count and claim_token.
  • The claim_token increment is the key safety mechanism: any worker holding an old claim_token will be rejected by the heartbeat and finish operations (the existing CI_LANE_LEASE_LOST check).
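The fencing idea behind fail and requeue can be modelled in memory. This is a sketch under stated assumptions, not the store's SQL: the Lane struct, the claim method, and the error message text are hypothetical (only the CI_LANE_LEASE_LOST name comes from the source), and claim is assumed to hand out the current token unchanged.

```rust
/// Name taken from the source; the message text is an assumption.
const CI_LANE_LEASE_LOST: &str = "ci lane lease lost";

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LaneStatus {
    Queued,
    Running,
    Failed,
    Success,
}

#[derive(Debug)]
struct Lane {
    status: LaneStatus,
    claim_token: i64,
    retry_count: i64,
    lease_expires_at: Option<u64>,
}

impl Lane {
    /// A worker claims a queued lane and keeps the token it was given.
    fn claim(&mut self, now: u64, lease_secs: u64) -> Option<i64> {
        if self.status != LaneStatus::Queued {
            return None;
        }
        self.status = LaneStatus::Running;
        self.lease_expires_at = Some(now + lease_secs);
        Some(self.claim_token)
    }

    /// Heartbeats succeed only for the holder of the current token.
    fn heartbeat(&mut self, token: i64, now: u64, lease_secs: u64) -> Result<(), String> {
        if self.status != LaneStatus::Running || token != self.claim_token {
            return Err(CI_LANE_LEASE_LOST.to_string());
        }
        self.lease_expires_at = Some(now + lease_secs);
        Ok(())
    }

    /// Operator fail: only queued or running lanes qualify.
    fn fail(&mut self) -> Result<(), String> {
        match self.status {
            LaneStatus::Queued | LaneStatus::Running => {
                self.status = LaneStatus::Failed;
                self.claim_token += 1; // fence out the current worker
                self.lease_expires_at = None;
                Ok(())
            }
            other => Err(format!("lane is already {:?}", other)),
        }
    }

    /// Operator requeue: reset progress and bump the token.
    fn requeue(&mut self) -> Result<(), String> {
        match self.status {
            LaneStatus::Queued | LaneStatus::Running | LaneStatus::Failed => {
                self.status = LaneStatus::Queued;
                self.retry_count += 1;
                self.claim_token += 1; // stale workers now hold an old token
                self.lease_expires_at = None;
                Ok(())
            }
            LaneStatus::Success => Err("lane cannot be requeued from Success".to_string()),
        }
    }
}
```

The token comparison is what makes the operator action safe: a worker that was mid-run when the lane was requeued cannot overwrite the new attempt, because every later call it makes presents the old token.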

recover_branch_ci_run / recover_nightly_run

  • Bulk-requeues every lane that has not succeeded (queued, running, or failed) in a single UPDATE statement.
  • Returns the count of affected rows so the caller knows how many lanes were recovered.
  • Skips success lanes, so a partially-complete run only reruns what hasn't passed.
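The bulk recovery rule amounts to a filtered update plus a count. The sketch below models it over a slice instead of a SQL UPDATE; the Lane struct and recover_run function are hypothetical stand-ins for the table rows and the store method.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LaneStatus {
    Queued,
    Running,
    Failed,
    Success,
}

#[derive(Debug)]
struct Lane {
    status: LaneStatus,
    retry_count: i64,
    claim_token: i64,
}

/// Requeue every lane that has not succeeded and report how many changed.
fn recover_run(lanes: &mut [Lane]) -> usize {
    let mut recovered = 0;
    for lane in lanes
        .iter_mut()
        .filter(|lane| lane.status != LaneStatus::Success)
    {
        lane.status = LaneStatus::Queued;
        lane.retry_count += 1;
        lane.claim_token += 1; // same fencing as a single-lane requeue
        recovered += 1;
    }
    recovered
}
```

Applied to a run with one successful, one failed, and one running lane, this returns 2 and leaves the successful lane untouched.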

All operations use IMMEDIATE transactions to prevent concurrent mutation races.

Two helper functions support log annotation:

  • manual_lane_failure_note: Formats a note like "Manual fail by npub1...: marked check-pika failed for CI recovery."
  • append_log_note: Appends to existing log text or creates new.
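A minimal sketch of the two helpers, assuming simplified signatures. The note wording follows the example above; the blank-line separator and the treatment of whitespace-only logs are assumptions, since the source does not show the function bodies.

```rust
/// Build the audit note recorded when an operator fails a lane.
fn manual_lane_failure_note(actor_npub: &str, lane_id: &str) -> String {
    format!(
        "Manual fail by {}: marked {} failed for CI recovery.",
        actor_npub, lane_id
    )
}

/// Append a note to existing log text, or start a new log with it.
fn append_log_note(existing: Option<&str>, note: &str) -> String {
    match existing {
        // Keep prior output and separate the note with a blank line (assumed).
        Some(text) if !text.trim().is_empty() => {
            format!("{}\n\n{}", text.trim_end(), note)
        }
        // No log yet, or only whitespace: the note becomes the whole log.
        _ => note.to_string(),
    }
}
```

Appending rather than replacing preserves whatever the worker had already logged, so the record still shows how far the lane got before the manual failure.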

Register new HTTP routes and implement request handlers in the web server

Intent: Add the server-side POST routes for fail/requeue/recover on both branch and nightly lanes, a GET route for nightly detail, and a POST route for scheduler wake, plus their handler implementations.

Affected files: crates/pika-news/src/web.rs

Evidence
@@ -1002,10 +1028,34 @@ pub async fn serve(
+        .route("/news/branch/:branch_id/ci/fail/:lane_run_id", post(fail_branch_ci_lane_handler))
+        .route("/news/branch/:branch_id/ci/requeue/:lane_run_id", post(requeue_branch_ci_lane_handler))
+        .route("/news/branch/:branch_id/ci/recover/:run_id", post(recover_branch_ci_run_handler))
+        .route("/news/nightly/:nightly_run_id/fail/:lane_run_id", post(fail_nightly_lane_handler))
+        .route("/news/nightly/:nightly_run_id/requeue/:lane_run_id", post(requeue_nightly_lane_handler))
+        .route("/news/nightly/:nightly_run_id/recover", post(recover_nightly_run_handler))
@@ +        .route("/news/api/forge/nightly/:nightly_run_id", get(api_forge_nightly_detail_handler))
+        .route("/news/api/forge/ci/wake", post(wake_ci_handler))

Eight new routes are added to the Axum router:

Mutation routes (authenticated, require forge-write permission):

  • POST /news/branch/:branch_id/ci/fail/:lane_run_id
  • POST /news/branch/:branch_id/ci/requeue/:lane_run_id
  • POST /news/branch/:branch_id/ci/recover/:run_id
  • POST /news/nightly/:nightly_run_id/fail/:lane_run_id
  • POST /news/nightly/:nightly_run_id/requeue/:lane_run_id
  • POST /news/nightly/:nightly_run_id/recover
  • POST /news/api/forge/ci/wake

Read route (authenticated):

  • GET /news/api/forge/nightly/:nightly_run_id

Each handler extracts path parameters, validates authentication, calls the corresponding store method, and returns a JSON response with the operation result.

Add operator hints and enhanced lane status rendering

Intent: Compute and display human-readable operator hints on lane status views to help operators identify stuck, lease-expired, or long-running lanes that may need manual intervention.

Affected files: crates/pika-news/src/web.rs

Evidence
@@ -405,6 +405,9 @@ struct CiLaneView {
+    last_heartbeat_at: Option<String>,
+    lease_expires_at: Option<String>,
+    operator_hint: Option<String>,
@@ -1848,6 +1931,66 @@ fn lane_status_badge_class(status: &str) -> &'static str {
+fn parse_ci_timestamp(raw: &str) -> Option<DateTime<Utc>> {
+    DateTime::parse_from_rfc3339(raw)
+        .map(|value| value.with_timezone(&Utc))
+        .ok()
+        .or_else(|| {
+            NaiveDateTime::parse_from_str(raw, "%Y-%m-%d %H:%M:%S")
+                ...
+        })
+}
+
+fn lane_operator_hint(
+    status: &str,
+    created_at: &str,
+    started_at: Option<&str>,
+    finished_at: Option<&str>,
+    last_heartbeat_at: Option<&str>,
+    lease_expires_at: Option<&str>,
+) -> Option<String> {

Both CiLaneView and NightlyLaneView gain three new fields: last_heartbeat_at, lease_expires_at, and operator_hint.

The lane_operator_hint function analyzes lane timing data to produce actionable warnings:

  • Queued lanes: Flags if queued for an unusually long time, suggesting the scheduler may not be picking them up.
  • Running lanes: Checks if the lease has expired (worker likely dead) or if no heartbeat has been received recently.
  • General: Uses parse_ci_timestamp which handles both RFC 3339 and SQLite's %Y-%m-%d %H:%M:%S datetime formats.
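The checks above can be sketched as a pure function over already-parsed timestamps. This is an illustration only: the signature is simplified to Unix seconds with an explicit now, the two thresholds are hypothetical values, and the hint wording is invented, because the source shows neither the function body nor its constants.

```rust
/// Hypothetical thresholds; the real values are not shown in the source.
const QUEUED_TOO_LONG_SECS: i64 = 300;
const HEARTBEAT_STALE_SECS: i64 = 120;

/// Return a warning for lanes that look stuck, or None for healthy ones.
/// All timestamps are Unix seconds (an assumption made for this sketch).
fn lane_operator_hint(
    status: &str,
    now: i64,
    created_at: i64,
    last_heartbeat_at: Option<i64>,
    lease_expires_at: Option<i64>,
) -> Option<String> {
    match status {
        "queued" => {
            let waited = now - created_at;
            if waited > QUEUED_TOO_LONG_SECS {
                Some(format!(
                    "queued for {}s; the scheduler may not be picking this lane up",
                    waited
                ))
            } else {
                None
            }
        }
        "running" => {
            // An expired lease is the stronger signal, so check it first.
            if let Some(expiry) = lease_expires_at {
                if expiry < now {
                    return Some(format!(
                        "lease expired {}s ago; the worker is likely dead",
                        now - expiry
                    ));
                }
            }
            match last_heartbeat_at {
                Some(beat) if now - beat > HEARTBEAT_STALE_SECS => {
                    Some(format!("no heartbeat for {}s", now - beat))
                }
                _ => None,
            }
        }
        // Finished lanes never need a hint.
        _ => None,
    }
}
```

Passing now explicitly keeps the function deterministic and easy to test; a caller would supply the current time after parsing the stored strings.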

The NightlyLaneView also gains status_badge_class and is_failed fields that were previously missing (making nightly lane rendering consistent with branch lane rendering).

Pass identifiers to live templates for action button targeting

Intent: Include branch_id and nightly_run_id in the live HTML templates so that client-side action buttons can construct the correct API URLs for fail/requeue/recover operations.

Affected files: crates/pika-news/src/web.rs, crates/pika-news/templates/branch_ci_live.html, crates/pika-news/templates/nightly_live.html

Evidence
@@ -439,6 +448,7 @@ struct BranchCiLiveTemplate {
+    branch_id: i64,
@@ -495,6 +505,22 @@ struct NightlyLiveTemplate {
+    nightly_run_id: i64,
@@ -1741,6 +1796,7 @@ fn render_branch_ci_live_html(
+        branch_id: record.branch_id,
@@ -1763,6 +1819,7 @@ fn render_nightly_live_html(
+        nightly_run_id: run.nightly_run_id,

The BranchCiLiveTemplate and NightlyLiveTemplate structs now carry the parent entity's ID. This ID is used in the HTML templates to construct action URLs for the fail, requeue, and recover buttons — e.g., POST /news/branch/{branch_id}/ci/fail/{lane_run_id}.

The render functions (render_branch_ci_live_html and render_nightly_live_html) pass these values through from the source records.

Add action buttons and operator hints to HTML templates

Intent: Render fail, requeue, and recover action buttons in the branch CI, nightly, and detail HTML templates, along with operator hint warnings, so operators can perform recovery actions from the web UI.

Affected files: crates/pika-news/templates/branch_ci_live.html, crates/pika-news/templates/detail.html, crates/pika-news/templates/nightly.html, crates/pika-news/templates/nightly_live.html

Evidence
@@ branch_ci_live.html - action buttons for branch CI lanes
@@ nightly_live.html - action buttons for nightly lanes
@@ detail.html - operator hints and actions on branch detail page
@@ nightly.html - operator hints and actions on nightly detail page

All four templates are updated to provide manual recovery controls:

  • branch_ci_live.html: Adds fail/requeue buttons per lane and a recover button per run. Buttons are wired to POST to the corresponding API endpoints using JavaScript fetch calls.
  • nightly_live.html: Mirrors the branch template pattern for nightly lanes — fail/requeue per lane, recover per run.
  • detail.html: The branch detail page gains operator hint display (colored warning text) and inline action buttons.
  • nightly.html: The nightly detail page gains the same operator hints and action buttons.

Operator hints are rendered conditionally — they only appear when lane_operator_hint returned a non-None value, indicating a lane that may need attention (e.g., expired lease, stuck in queue).

Add comprehensive tests for lane fencing and recovery semantics

Intent: Verify that requeued lanes reject stale workers via claim-token fencing, and that recover-run only resets non-terminal lanes while leaving successful lanes untouched.

Affected files: crates/pika-news/src/branch_store.rs, crates/ph/src/lib.rs

Evidence
@@ -3367,4 +3742,181 @@ mod tests {
+    #[test]
+    fn manual_requeue_rejects_old_workers_for_branch_and_nightly_lanes() {
+        ...
+        store.requeue_branch_ci_lane(branch.branch_id, first_branch.lane_run_id)...
+        assert!(store.heartbeat_branch_ci_lane_run(first_branch.lane_run_id, first_branch.claim_token, 120)
+            .expect_err("stale branch heartbeat should fail")
+            .to_string().contains(CI_LANE_LEASE_LOST));
+        ...
+        assert!(second_branch.claim_token > first_branch.claim_token);
+    }
@@ +    #[test]
+    fn recover_run_requeues_only_nonterminal_lanes() {
+        ...
+        store.finish_branch_ci_lane_run(jobs[0].lane_run_id, jobs[0].claim_token, "success", "ok")...
+        store.finish_branch_ci_lane_run(jobs[1].lane_run_id, jobs[1].claim_token, "failed", "boom")...
+        let recovered = store.recover_branch_ci_run(branch.branch_id, jobs[2].suite_id)...
+        assert_eq!(recovered, 2);
+        assert_eq!(lanes[0].status, "success");
+        assert_eq!(lanes[1].status, "queued");
+        assert_eq!(lanes[2].status, "queued");
+    }
@@ ph/src/lib.rs tests
+    #[test]
+    fn fail_lane_resolves_branch_lane_name_against_latest_run() {
+    ...
+    #[test]
+    fn requeue_lane_resolves_nightly_lane_name() {
+    ...
+    #[test]
+    fn wake_ci_hits_scheduler_wake_endpoint() {

Store tests (branch_store.rs)

manual_requeue_rejects_old_workers_for_branch_and_nightly_lanes: The critical safety test. It:

  1. Claims a lane (gets a claim_token).
  2. Requeues it (increments claim_token).
  3. Verifies the old worker's heartbeat and finish calls are rejected with CI_LANE_LEASE_LOST.
  4. Claims the requeued lane and verifies the new claim_token is higher.
  5. Repeats the entire flow for nightly lanes.

recover_run_requeues_only_nonterminal_lanes: Sets up a 3-lane run where lane 1 succeeds, lane 2 fails, and lane 3 is still running. After recovery:

  • Lane 1 (success) is untouched.
  • Lanes 2 and 3 are reset to queued with retry_count = 1.
  • Verified for both branch and nightly runs.

CLI tests (lib.rs)

Three integration tests using mock HTTP servers:

  • fail_lane_resolves_branch_lane_name_against_latest_run: Verifies that ph fail-lane feature/recover --lane check-pika correctly resolves the lane name against the latest CI run (run #5, lane #91) and not the older run (#4, lane #90).
  • requeue_lane_resolves_nightly_lane_name: Verifies ph requeue-lane --nightly-run-id 12 --lane nightly_pika correctly resolves and calls the requeue endpoint.
  • wake_ci_hits_scheduler_wake_endpoint: Verifies ph wake-ci authenticates and POSTs to the wake endpoint.

Diff