Monitoring

Monitoring

Monitor runtime state, automation health, and execution reliability in Agent Observer.

Monitoring

Agent Observer exposes runtime state so operators can detect failures quickly and keep long-running work healthy.

What To Monitor

Agent/session level

status transitions (thinking, tool_calling, streaming, done, error)
active agent count vs expected active chats
token movement relative to workload intensity

Scheduler level

task status distribution (success/error)
nextRunAt correctness
recurring failure patterns by task

Todo Runner level

completed vs total todos
failed and blocked counts
current item and retry pressure

Daily Operator Checks

Review any error statuses first.
Check for blocked todo items.
Validate recurring tasks still match current repository structure.
Confirm token/activity behavior is plausible.

Alert-Like Heuristics

Investigate quickly if:

many schedule failures appear after a repo change
todo jobs stall repeatedly on one item
token counters flatline during expected model activity
activity labels do not match runtime behavior

Label Semantics

CRON should represent scheduled repeated work only.
Todo Runner work should be interpreted through job progress state, not CRON semantics.

Incident Notes Best Practice

For recurring failures, record:

failing task/job id
exact error message
first observed timestamp
root cause
mitigation applied

This makes future regression diagnosis much faster.

Overview

What Agent Observer is, who it is for, and how to use it as a local control plane for AI coding workflows.

Memory

Scope and context persistence model for chat, schedules, and todo-runner execution.