Exit status and signals
W&B uses the exit status of the training process to decide whether a run is requeued and how run state is recorded.
Exit code contract:
- Exit code 0: The run is considered to have completed successfully and is not requeued.
- Non-zero exit code: The run is treated as failed or preempted. When you use mark_preempting(), W&B requeues the run so another agent (or the same agent after restart) can resume it.
The exit code can be set explicitly with a sys.exit() call. Understanding this contract is essential in preemptible or cluster environments.
When the process exits due to a catchable signal, your handler can run, call wandb.run.mark_preempting() if you want the run requeued, perform cleanup (for example, save a checkpoint), then exit with a non-zero code. A common convention is sys.exit(128 + signum) for termination by signal. W&B records that exit code and the same requeue rules apply. When the process is killed by the operating system kernel with SIGKILL, the process cannot run exit hooks, so no final summary is written and the run may appear as crashed or killed; the agent still starts the next run.
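The 128 + signum convention mentioned above can be sketched as a pair of small helpers. The function names exit_code_for_signal and signal_from_exit_code are hypothetical and are not part of the wandb library; they only illustrate the arithmetic.

```python
import signal


def exit_code_for_signal(signum: int) -> int:
    """Return the conventional exit code for termination by `signum`."""
    return 128 + int(signum)


def signal_from_exit_code(code: int):
    """Return the signal number encoded in `code`, or None.

    Codes of 128 or below do not follow the 128 + signum convention.
    """
    if code > 128:
        return code - 128
    return None
```

For example, a handler for SIGTERM (signal number 15) would exit with code 143, and a handler for SIGINT (signal number 2) would exit with code 130.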
Stale runs and server-side timeouts
If a run neither finishes nor posts new metrics for roughly five minutes, the W&B server marks the run as crashed. That often happens when the training process hangs, stops logging, or is terminated without a clean exit (for example, after SIGKILL). Logging metrics on a steady cadence or exiting with a defined code helps keep run state aligned with what actually happened.
Catchable signals and preemption
You can register custom signal handlers in your training script. When a catchable signal is delivered, your handler runs; metrics already sent to W&B are preserved, and the agent detects the process exit and starts the next run.
Best practices:
- Register handlers early (for example, before entering the main training loop).
- In the handler, call wandb.run.mark_preempting() when you intend the run to be requeued after preemption, perform cleanup (for example, save a checkpoint), then exit with a non-zero code.
A typical setup handles SIGUSR1 (a typical cluster preemption signal) and SIGTERM, and leaves SIGINT free for interactive use (for example, manual cancellation from the terminal). The handler calls wandb.run.mark_preempting() and exits using 128 + signum.
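A minimal sketch of such a handler follows. The helper names make_preemption_handler and register_preemption_handlers are hypothetical, and the run argument is assumed to be the object returned by wandb.init() (any object with a mark_preempting() method works). The exit_fn parameter exists only so the handler can be exercised without ending the process.

```python
import signal
import sys


def make_preemption_handler(run, exit_fn=sys.exit):
    """Build a handler that marks `run` as preempting, then exits non-zero.

    `run` is assumed to expose mark_preempting(), as a W&B run does.
    `exit_fn` is injectable for testing; it defaults to sys.exit.
    """

    def handler(signum, frame):
        if run is not None:
            run.mark_preempting()  # ask W&B to requeue this run
        # Save a checkpoint or perform other cleanup here.
        exit_fn(128 + signum)  # conventional "terminated by signal" code

    return handler


def register_preemption_handlers(run, signals=(signal.SIGUSR1, signal.SIGTERM)):
    """Install the handler for each signal (POSIX only); SIGINT is untouched."""
    handler = make_preemption_handler(run)
    for sig in signals:
        signal.signal(sig, handler)
    return handler
```

In a training script, you would call register_preemption_handlers(run) right after run = wandb.init() and before the training loop starts, so the handlers are in place before any preemption can arrive.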
SIGKILL (uncatchable)
SIGKILL cannot be caught or ignored. The process terminates immediately with no chance to run handlers or atexit callbacks. W&B cannot write a final summary for the run. The agent still recovers and continues the sweep, but run data for that run is incomplete. Use SIGKILL only as a last resort; prefer SIGTERM or SIGINT when you need graceful shutdown.
Forwarding signals from agent to child
When you use the wandb agent CLI, the agent runs your training script as a child process. When you interrupt the agent (for example, with Ctrl+C or when a scheduler sends SIGTERM to the job), the child (training process) does not receive the signal by default, so the training script cannot run its handler or call mark_preempting(). This is described in GitHub #3667.
To let the child shut down gracefully and call wandb.run.mark_preempting() in a handler, run the CLI agent with --forward-signals, for example wandb agent --forward-signals entity/project/sweep_ID.
The --forward-signals option does not apply to wandb.agent() in the Python API. That path runs your training function in a thread, not as a separate child process, so the same forwarding behavior does not apply.
When the CLI agent receives SIGINT or SIGTERM with forwarding enabled, it relays the signal to the child so your training script’s handler can run, call wandb.run.mark_preempting() and wandb.finish() with a non-zero exit code if needed, and exit with a non-zero code. If you press Ctrl+C twice on the agent process, the agent receives SIGTERM by default. With --forward-signals, SIGINT can be forwarded to the child so your handler runs.
See the wandb agent CLI reference for details.
Preemptible clusters like SLURM
On preemption, the training process must receive the signal, mark the run as preempting, and exit with a non-zero code so the run is requeued. A new agent (or the same agent after the job is requeued) can then resume the run.
Ensure the training process receives the signal:
- When the scheduler signals the agent: Run the agent with wandb agent --forward-signals so that when the scheduler (or user) sends a signal to the agent, the agent forwards it to the child. The child’s handler can then call wandb.run.mark_preempting(), wandb.finish(exit_code=...) with a non-zero code, and sys.exit(128 + signum) (or another non-zero exit code).
- When the scheduler signals the launch script (not the agent directly): Have the launch script send the preemption signal directly to the training process. For example, the training script writes its process ID to a file; the launch script traps the cluster signal (for example SIGUSR1) and runs kill -SIGUSR1 $(cat $PID_FILE) so the training process’s handler runs.
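The PID file approach in the second option can be sketched in Python as follows. The helper names and the pid_file argument are hypothetical. write_pid_file would run in the training script, while send_preemption_signal mirrors what the launch script does with kill -SIGUSR1 $(cat $PID_FILE).

```python
import os
import signal
from pathlib import Path


def write_pid_file(pid_file):
    """Training-script side: record this process's ID in `pid_file`."""
    path = Path(pid_file)
    path.write_text(str(os.getpid()))
    return path


def read_pid_file(pid_file):
    """Return the process ID stored in `pid_file` as an int."""
    return int(Path(pid_file).read_text().strip())


def send_preemption_signal(pid_file, signum=signal.SIGUSR1):
    """Launch-script side (POSIX only): signal the recorded process."""
    os.kill(read_pid_file(pid_file), signum)
```

Writing the file early in the training script, before the training loop, gives the launch script a valid target for the whole lifetime of the run.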
In the training script, register a handler for the preemption signal (for example SIGTERM or SIGUSR1). In the handler, call wandb.run.mark_preempting() if a run is active, then finish the run with a non-zero exit code and sys.exit(128 + signum) (or another non-zero code) so the run is requeued. See Resume preemptible Sweeps runs for when runs are requeued and how that interacts with mark_preempting().
Sweep state: Run wandb sweep entity/project/sweep_ID --resume before starting the agent so the sweep is in resume mode and will hand out requeued runs.
Multi-agent coordination: When many agents run at once (such as SLURM array jobs), they can race to claim the same preempted run. This is a known limitation. To reduce the chance of a collision, stagger agent startup or use an external coordination mechanism such as a lock.
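One way to stagger startup, shown here only as a sketch, is to delay each array task in proportion to its index before launching the agent. The helper names and the 10-second step are assumptions chosen for illustration. SLURM_ARRAY_TASK_ID is the environment variable that SLURM sets for array jobs. Staggering lowers the likelihood of a race but does not eliminate it.

```python
import os
import time


def startup_delay_seconds(task_index: int, step_seconds: float = 10.0) -> float:
    """Return how long an agent with this array index should wait."""
    if task_index < 0:
        raise ValueError("task_index must be non-negative")
    return task_index * step_seconds


def wait_for_turn(environ=os.environ, sleep=time.sleep) -> float:
    """Sleep for a delay derived from SLURM_ARRAY_TASK_ID (0 if unset).

    `environ` and `sleep` are injectable so the logic can be tested
    without a real cluster or a real wait.
    """
    task_index = int(environ.get("SLURM_ARRAY_TASK_ID", "0"))
    delay = startup_delay_seconds(task_index)
    sleep(delay)
    return delay
```

A launch wrapper could call wait_for_turn() and then start the agent, so that agents in the same array job do not all request work at the same moment.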
wandb sweep --cancel
You cancel a sweep using the W&B API, not an OS signal. Run a command like wandb sweep --cancel entity/project/sweep_ID. The server tells the agent to exit, and the agent then terminates running child processes and stops. There can be a short delay (on the order of the agent’s API polling interval) before cancellation takes effect.
Cancellation delivers SIGKILL to runs. Child processes have no chance to run user-defined signal handlers. The same applies when you use the Cancel control on the Sweeps UI. Use --cancel when you want to stop the entire sweep and mark it cancelled. For graceful shutdown of the current run, send a catchable signal to the run (or use --forward-signals with the CLI agent and signal the agent). For graceful sweep completion, use wandb sweep --stop instead of --cancel.
See Manage sweeps for pause, resume, stop, and cancel options.
Killing the agent vs the run
If you send a signal to the agent process (not the child training process), the agent may exit while the child continues running as an orphan. The orphan may keep printing to your terminal, and the shell may not show a new prompt until you press Enter. Unless you use --forward-signals with the CLI agent, stopping the agent does not guarantee the child training process stops.
To confirm the agent has exited, use an OS command like ps -p <agent_pid> or pgrep -f "wandb agent" instead of relying on prompt appearance.
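If you prefer to check from Python rather than with ps or pgrep, a rough equivalent on POSIX systems is to send signal 0, which performs the existence check without delivering a signal. This is a substitute technique, not something the wandb library provides, and pid_is_running is a hypothetical helper name. A zombie process that has not been reaped still counts as existing.

```python
import os


def pid_is_running(pid: int) -> bool:
    """Return True if a process with `pid` exists (POSIX only).

    Signal 0 performs error checking only; nothing is delivered.
    Do not use this on Windows, where os.kill behaves differently.
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # process exists but belongs to another user
    return True
```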
Reference: mark_preempting() and final run state
The table below summarizes how run state depends on when you call mark_preempting() and how the process exits. It assumes you use the wandb agent CLI with your training program as a subprocess.
| Scenario | No mark_preempting() | Signal handler calls mark_preempting() and exits non-zero | mark_preempting() always called right after init() |
|---|---|---|---|
| Run completes normally with exit code 0 | FINISHED | FINISHED | FINISHED |
| Run fails with non-zero exit code | FAILED | FAILED | PREEMPTED |
| Run receives SIGKILL | CRASHED after about five minutes | CRASHED after about five minutes (uncatchable) | PREEMPTED after about five minutes |
| Run receives SIGINT | KILLED | PREEMPTED (with a SIGINT handler) | PREEMPTED |
| Run receives another signal (for example SIGTERM or SIGUSR1) | CRASHED after about five minutes | PREEMPTED (with a matching handler) | PREEMPTED after about five minutes |
If you call mark_preempting() only inside a signal handler, you do not cover cases where the handler never runs, such as SIGKILL.
If you always call mark_preempting() immediately after wandb.init(), any failure can be treated as preemption and the run may be requeued repeatedly, including for bugs or bad configuration.
For environments with a well-defined preemption signal, the usual approach is a signal handler that calls mark_preempting() and exits non-zero, not an unconditional call after init().