Monitoring
Monitor runtime state, automation health, and execution reliability in Agent Observer.
Agent Observer exposes runtime state so operators can detect failures quickly and keep long-running work healthy.
What To Monitor
Agent/session level
- status transitions (thinking, tool_calling, streaming, done, error)
- active agent count vs expected active chats
- token movement relative to workload intensity
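
The agent/session checks above could be sketched as a small summary over a snapshot of agent states. This is only an illustration: the snapshot shape (a list of dicts with `id` and `status`) and the `summarize_agents` helper are hypothetical, and treating thinking, tool_calling, and streaming as the "active" statuses is an assumption rather than documented Agent Observer behavior.

```python
from collections import Counter

# Status names come from the list above. Which of them count as "active"
# is an assumption made for this sketch.
ACTIVE_STATUSES = {"thinking", "tool_calling", "streaming"}


def summarize_agents(agents, expected_active):
    """Count agents per status and compare active agents with expected chats."""
    counts = Counter(agent["status"] for agent in agents)
    active = sum(counts[status] for status in ACTIVE_STATUSES)
    return {
        "counts": dict(counts),
        "active": active,
        "errors": counts["error"],
        "active_mismatch": active != expected_active,
    }


# Hypothetical snapshot: two active agents, one failed, one finished.
agents = [
    {"id": "a1", "status": "thinking"},
    {"id": "a2", "status": "streaming"},
    {"id": "a3", "status": "error"},
    {"id": "a4", "status": "done"},
]
summary = summarize_agents(agents, expected_active=3)
```

Here `summary["active_mismatch"]` is true because only two agents are active while three chats were expected, which is the kind of gap worth a closer look.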
Scheduler level
- task status distribution (success/error)
- nextRunAt correctness
- recurring failure patterns by task
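
One way to approximate the scheduler checks is shown below. It is a sketch under stated assumptions: the task record shape (`id`, `nextRunAt` as an ISO 8601 string, `recentStatuses` as a list) and the `check_tasks` helper are hypothetical, and "a nextRunAt in the past is suspicious" is a heuristic chosen for illustration, not a rule taken from Agent Observer.

```python
from datetime import datetime, timezone


def check_tasks(tasks, now, failure_threshold=3):
    """Return (ids with a nextRunAt in the past, ids with repeated errors)."""
    stale_next_run = []
    recurring_failures = []
    for task in tasks:
        # Heuristic: a recurring task whose next run is already in the past
        # may not have been rescheduled correctly.
        next_run = datetime.fromisoformat(task["nextRunAt"])
        if next_run < now:
            stale_next_run.append(task["id"])
        # Count errors in the recent run history for this task.
        errors = sum(1 for status in task["recentStatuses"] if status == "error")
        if errors >= failure_threshold:
            recurring_failures.append(task["id"])
    return stale_next_run, recurring_failures


# Hypothetical task records for illustration only.
now = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
tasks = [
    {
        "id": "nightly-lint",
        "nextRunAt": "2024-05-02T00:00:00+00:00",
        "recentStatuses": ["success", "success", "error"],
    },
    {
        "id": "dep-audit",
        "nextRunAt": "2024-04-30T00:00:00+00:00",
        "recentStatuses": ["error", "error", "error"],
    },
]
stale, failing = check_tasks(tasks, now)
```

Grouping the error count per task, rather than across all tasks, keeps a single noisy task from hiding behind an otherwise healthy success/error distribution.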
Todo Runner level
- completed vs total todos
- failed and blocked counts
- current item and retry pressure
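
The Todo Runner signals above can be rolled into one progress summary per job. The job shape, the status names (completed, failed, blocked, running, pending), and the `retries` field are all assumptions made for this sketch; the actual Todo Runner data model may differ.

```python
from collections import Counter


def todo_progress(job):
    """Summarize completed/total, failed, blocked, and the current item."""
    todos = job["todos"]
    by_status = Counter(todo["status"] for todo in todos)
    # The first item marked as running is treated as the current item.
    current = next((todo for todo in todos if todo["status"] == "running"), None)
    return {
        "completed": by_status["completed"],
        "total": len(todos),
        "failed": by_status["failed"],
        "blocked": by_status["blocked"],
        "current": current["title"] if current else None,
        "current_retries": current["retries"] if current else 0,
    }


# Hypothetical job used only to exercise the helper.
job = {
    "id": "job-1",
    "todos": [
        {"title": "Update changelog", "status": "completed", "retries": 0},
        {"title": "Fix flaky test", "status": "running", "retries": 2},
        {"title": "Bump dependencies", "status": "blocked", "retries": 0},
        {"title": "Regenerate docs", "status": "pending", "retries": 0},
    ],
}
progress = todo_progress(job)
```

A rising `current_retries` value on the same item across successive checks is one way to read "retry pressure".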
Daily Operator Checks
- Review any error statuses first.
- Check for blocked todo items.
- Validate recurring tasks still match current repository structure.
- Confirm that token counts and activity labels are plausible for the work being run.
Alert-Like Heuristics
Investigate quickly if:
- many schedule failures appear after a repo change
- todo jobs stall repeatedly on one item
- token counters flatline during expected model activity
- activity labels do not match runtime behavior
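
As an illustration of the token flatline heuristic, the sketch below flags an agent whose token counter has not moved across several consecutive samples while its status suggests the model should be working. The sample format (`(status, total_tokens)` pairs in time order), the `token_flatline` helper, and the choice of thinking and streaming as "expected model activity" are assumptions, not part of Agent Observer.

```python
# Statuses during which token movement is expected (an assumption).
MODEL_ACTIVE = {"thinking", "streaming"}


def token_flatline(samples, min_samples=3):
    """Return True if the last `min_samples` samples are all model-active
    and report the same total token count."""
    recent = samples[-min_samples:]
    if len(recent) < min_samples:
        return False  # not enough history to judge
    all_active = all(status in MODEL_ACTIVE for status, _ in recent)
    unchanged = len({tokens for _, tokens in recent}) == 1
    return all_active and unchanged


# Hypothetical samples: (status, total_tokens) in time order.
stuck = [("thinking", 100), ("streaming", 100), ("streaming", 100)]
healthy = [("thinking", 100), ("streaming", 140), ("streaming", 180)]
idle = [("done", 100), ("done", 100), ("done", 100)]

stuck_flag = token_flatline(stuck)
healthy_flag = token_flatline(healthy)
idle_flag = token_flatline(idle)
```

Requiring an active status avoids flagging finished agents, whose counters are expected to stay flat.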
Label Semantics
- CRON should represent scheduled repeated work only.
- Todo Runner work should be interpreted through job progress state, not CRON semantics.
Incident Notes Best Practice
For recurring failures, record:
- failing task/job id
- exact error message
- first observed timestamp
- root cause
- mitigation applied
This makes future regression diagnosis much faster.
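
If incident notes are kept in a structured form, the fields listed above map naturally onto a small record type. The `IncidentNote` class, its field names, and the example values below are hypothetical; this is one possible layout, not a format that Agent Observer defines.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class IncidentNote:
    """One recurring-failure record, mirroring the fields listed above."""
    task_id: str          # failing task/job id
    error_message: str    # exact error message, copied verbatim
    first_observed: str   # ISO 8601 timestamp of the first occurrence
    root_cause: str
    mitigation: str


# Hypothetical example values for illustration only.
note = IncidentNote(
    task_id="nightly-lint",
    error_message="ENOENT: no such file or directory, open 'src/index.ts'",
    first_observed="2024-05-01T02:00:00+00:00",
    root_cause="Entry point moved during a repository restructure",
    mitigation="Updated the task path and re-ran the task manually",
)

# One JSON object per line makes the notes easy to append and search later.
line = json.dumps(asdict(note))
```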