One day, MissionControl will be augmented or replaced with a real monitoring page, driven by Munin or Spong or something, which will make life happy and wonderful. One day, dispense rewrites will be finished. One day, ([DAA], [TRS], [GMB], [AHC], [The latest wheel member]) will graduate.

A dispense rewrite has been finished; [DAA] and [AHC] have graduated; let's make a start on the first item.

Thinking about monitoring

Monitoring covers both alerting (letting people know that stuff is broken) and trending (graphing utilisation or capacity). Both are important; an alert that we have run out of DHCP leases is useful but so is knowing that we are running close to 90% utilisation all the time.
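To make the alerting/trending split concrete, here is a minimal sketch using the DHCP lease example above. Everything in it is an assumption for illustration: the `PoolSample` type, the function names and the 90% threshold are hypothetical and not taken from any existing configuration, and it assumes the pool size is greater than zero.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only to match the example in the text.
WARN_FRACTION = 0.90  # running this close to full is worth knowing about
FULL_FRACTION = 1.00  # pool exhausted: something is broken, so alert


@dataclass
class PoolSample:
    """One measurement of a DHCP pool (assumes pool_size > 0)."""
    leases_in_use: int
    pool_size: int

    @property
    def utilisation(self) -> float:
        return self.leases_in_use / self.pool_size


def classify(sample: PoolSample) -> str:
    """Alerting: return 'alert', 'warn' or 'ok' for a single sample."""
    if sample.utilisation >= FULL_FRACTION:
        return "alert"
    if sample.utilisation >= WARN_FRACTION:
        return "warn"
    return "ok"


def sustained_warning(samples: list[PoolSample]) -> bool:
    """Trending: True if every sample in a non-empty window is at or
    above the warn threshold, i.e. we are close to full all the time."""
    return bool(samples) and all(s.utilisation >= WARN_FRACTION for s in samples)
```

`classify` answers "is it broken right now?", while `sustained_warning` looks at a window of samples, which is the sort of question that graphs normally answer.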

So we want to monitor:

  • Services and their utilisation, including (but not limited to) DNS, DHCP, SMTP, LDAP, SSH, NFS, Samba, RADIUS, dispense, the web server, user web sites, webmail, git, Subversion, IRC, TFTP and IMAP.
  • Computers and their resources, including uptime, temperature, disk space, memory usage, logged-on users, etc.
  • Networks and their utilisation, including port throughput, wireless clients, etc.

What we actually care about is services, but most monitoring software is set up to think about machines. Sigh.
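One way around the machine-centric view is to let the software check machines and then roll the results up per service ourselves. The sketch below assumes check results arrive as `(host, service, is_up)` tuples; that shape, the `service_status` function and the three status names are all hypothetical, not the output of any particular monitoring package.

```python
from collections import defaultdict


def service_status(results):
    """Group per-host check results by service.

    `results` is an iterable of (host, service, is_up) tuples.
    A service is 'up' if every host providing it passed its check,
    'degraded' if only some did, and 'down' if none did.
    """
    by_service = defaultdict(list)
    for _host, service, is_up in results:
        by_service[service].append(is_up)

    status = {}
    for service, outcomes in by_service.items():
        if all(outcomes):
            status[service] = "up"
        elif any(outcomes):
            status[service] = "degraded"
        else:
            status[service] = "down"
    return status


# Made-up host names, purely for illustration.
example = [
    ("host-a", "DNS", True),
    ("host-b", "DNS", False),
    ("host-a", "DHCP", True),
    ("host-c", "IMAP", False),
]
summary = service_status(example)
```

With this, a status page can list DNS, DHCP and friends directly, and the machines only show up when someone drills down.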

The original intention behind @ucc_status was to receive automatic alerts and disseminate them over SMS, but it has always been updated manually. We could either hook the alerting software up to a new Twitter account or, in the interest of transparency, just push everything to @ucc_status. We could also insert information such as disk utilisation into the Phonehome database, or email it to hostmaster.
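Whichever destinations we pick, the routing itself can stay simple. This is a rough sketch only: the `Event` type, the "alert"/"trend" kinds and `make_router` are hypothetical names, and the sinks here are in-memory lists standing in for a status feed and a hostmaster mailbox, so no real Twitter, Phonehome or mail interface is being described.

```python
from typing import Callable, NamedTuple


class Event(NamedTuple):
    kind: str     # "alert" (something is broken) or "trend" (utilisation data)
    message: str


Sink = Callable[[Event], None]


def make_router(routes: dict[str, list[Sink]]) -> Sink:
    """Return a function that hands each event to every sink for its kind."""
    def route(event: Event) -> None:
        for sink in routes.get(event.kind, []):
            sink(event)
    return route


# Stand-in sinks: plain lists instead of real delivery mechanisms.
status_feed: list[str] = []
hostmaster_mail: list[str] = []

route = make_router({
    "alert": [lambda e: status_feed.append(e.message)],
    "trend": [lambda e: hostmaster_mail.append(e.message)],
})

route(Event("alert", "DHCP pool exhausted"))
route(Event("trend", "disk utilisation at 85%"))
```

Pushing everything to one public feed would just mean listing the same sink under both kinds.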