Agent Observer Docs

Monitoring

Monitor runtime state, automation health, and execution reliability in Agent Observer.

Agent Observer exposes runtime state so operators can detect failures quickly and keep long-running work healthy.

What To Monitor

Agent/session level

  • status transitions (thinking, tool_calling, streaming, done, error)
  • active agent count vs expected active chats
  • token counters that move in proportion to workload intensity
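
The agent-level checks above can be sketched as a small summary function. This is an illustrative sketch only: the function name, the input shape (a plain list of status strings), and the choice of which statuses count as "active" are assumptions, not part of Agent Observer's API.

```python
from collections import Counter

# Status names come from the list above. Treating these three as "active"
# is an assumption made for this sketch.
ACTIVE_STATUSES = {"thinking", "tool_calling", "streaming"}

def summarize_agents(statuses, expected_active):
    """Count statuses and compare the active agent count to the expected one."""
    counts = Counter(statuses)
    active = sum(counts[s] for s in ACTIVE_STATUSES)
    return {
        "counts": dict(counts),
        "active": active,
        "errors": counts["error"],
        "active_mismatch": active != expected_active,
    }

# Hypothetical snapshot: four agents, three chats expected to be active.
summary = summarize_agents(
    ["thinking", "streaming", "done", "error"], expected_active=3
)
print(summary)
```

A mismatch between the active count and the expected number of chats is a prompt to look closer, not proof of a fault.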

Scheduler level

  • task status distribution (success/error)
  • nextRunAt values that match each task's schedule
  • recurring failure patterns by task
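
One way to approximate the scheduler checks above is shown here. Only `nextRunAt` and the success/error statuses appear in this page; the task record shape (`id`, `status`, `history`), the grace period, and the failure threshold are hypothetical choices for the sketch. It treats a `nextRunAt` that is well in the past as suspicious, which is only one possible reading of "correctness".

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

def check_scheduler(tasks, now, grace=timedelta(minutes=1), failure_threshold=3):
    """Return the status distribution, stale task ids, and repeat offenders."""
    distribution = Counter(t["status"] for t in tasks)
    # A nextRunAt older than the grace period suggests the task was not rescheduled.
    stale = [t["id"] for t in tasks if t["nextRunAt"] < now - grace]
    recurring = []
    for t in tasks:
        # Count consecutive errors at the end of the run history (oldest first).
        streak = 0
        for outcome in reversed(t["history"]):
            if outcome != "error":
                break
            streak += 1
        if streak >= failure_threshold:
            recurring.append(t["id"])
    return dict(distribution), stale, recurring

# Hypothetical task records for illustration.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
tasks = [
    {"id": "lint", "status": "success", "nextRunAt": now + timedelta(hours=1),
     "history": ["success", "success"]},
    {"id": "sync", "status": "error", "nextRunAt": now - timedelta(hours=2),
     "history": ["success", "error", "error", "error"]},
]
distribution, stale, recurring = check_scheduler(tasks, now)
print(distribution, stale, recurring)
```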

Todo Runner level

  • completed vs total todos
  • failed and blocked counts
  • the current item and its retry pressure (how often it has been retried)
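
The Todo Runner signals above could be reduced to a single progress summary along these lines. The todo record fields (`title`, `state`, `retries`), the state names, and the `max_retries` threshold are assumptions for illustration and may not match the real job data.

```python
def todo_progress(todos, max_retries=3):
    """Summarize a todo job: completion, failures, blocks, and retry pressure."""
    states = [t["state"] for t in todos]
    # The "running" state name is hypothetical; it marks the current item here.
    current = next((t for t in todos if t["state"] == "running"), None)
    return {
        "completed": states.count("completed"),
        "total": len(todos),
        "failed": states.count("failed"),
        "blocked": states.count("blocked"),
        "current": current["title"] if current else None,
        # Retry pressure: the current item has hit or passed the retry threshold.
        "retry_pressure": current is not None and current["retries"] >= max_retries,
    }

# Hypothetical job with three items.
todos = [
    {"title": "update deps", "state": "completed", "retries": 0},
    {"title": "fix tests", "state": "running", "retries": 4},
    {"title": "write docs", "state": "blocked", "retries": 0},
]
progress = todo_progress(todos)
print(progress)
```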

Daily Operator Checks

  1. Review any error statuses first.
  2. Check for blocked todo items.
  3. Validate that recurring tasks still match the current repository structure.
  4. Confirm that token counts and activity labels are plausible for the work in progress.

Alert-Like Heuristics

Investigate quickly if:

  • many schedule failures appear after a repo change
  • todo jobs stall repeatedly on one item
  • token counters flatline during expected model activity
  • activity labels do not match runtime behavior
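
As an example of turning one of these heuristics into a check, the sketch below flags a token counter that stays flat while an agent reports an active status. It assumes samples are `(status, cumulative_tokens)` pairs ordered oldest first and that the counter never decreases; the sample shape, the window size, and the set of "active" statuses are all hypothetical.

```python
# Assumption: these statuses imply the model should be consuming or producing tokens.
ACTIVE_STATUSES = {"thinking", "tool_calling", "streaming"}

def tokens_flatlined(samples, window=3):
    """Return True if the last `window` samples are all active with no token change."""
    recent = samples[-window:]
    if len(recent) < window:
        return False  # Not enough data to judge.
    all_active = all(status in ACTIVE_STATUSES for status, _ in recent)
    # With a non-decreasing counter, equal endpoints mean no movement in between.
    no_movement = recent[-1][1] == recent[0][1]
    return all_active and no_movement

# Hypothetical sample series.
stuck = [("streaming", 1200), ("streaming", 1200), ("streaming", 1200)]
healthy = [("thinking", 1200), ("streaming", 1350), ("streaming", 1500)]
idle = [("done", 1500), ("done", 1500), ("done", 1500)]

print(tokens_flatlined(stuck), tokens_flatlined(healthy), tokens_flatlined(idle))
```

A flat counter on an idle agent is expected, so the check only fires when the status says work should be happening.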

Label Semantics

  • CRON should represent scheduled repeated work only.
  • Todo Runner work should be interpreted through job progress state, not CRON semantics.

Incident Notes Best Practice

For recurring failures, record:

  • failing task or job ID
  • exact error message
  • first observed timestamp
  • root cause
  • mitigation applied

This makes future regression diagnosis much faster.
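
A structured record keeps these fields consistent across incidents. The following is a minimal sketch of such a record; the class name, field names, and JSON output are illustrative choices, not a format defined by Agent Observer.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

@dataclass
class IncidentNote:
    """One recurring-failure record, mirroring the fields listed above."""
    task_id: str
    error_message: str
    first_observed: datetime
    root_cause: str
    mitigation: str

    def to_json(self):
        data = asdict(self)
        # datetime is not JSON-serializable, so store it as an ISO 8601 string.
        data["first_observed"] = self.first_observed.isoformat()
        return json.dumps(data, indent=2)

# Hypothetical incident for illustration.
note = IncidentNote(
    task_id="sync",
    error_message="path not found: src/legacy",
    first_observed=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    root_cause="directory renamed during a repository restructure",
    mitigation="updated the task to point at the new path",
)
print(note.to_json())
```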
