
System Monitoring

Project ID: 22
Current stage: 
Manager: 
Unit: 
What: 

Description: Deploy a monitoring system configured via LCFG resources. In stage 1, this system will monitor only the AFS service; stage 2 will extend it so that all service component authors can use it.

Deliverables: Stage 1: a monitoring service suitable for monitoring the availability of the AFS file and database servers.

Stage 2: a monitoring framework that can be extended to monitor any LCFG-configured service.

Why: 

Customer: In stage 1, the AFS system managers.

Stage 2 will open this up to the entire CO community. Better notification of system outages and the collection of uptime statistics will also improve the service we provide to end users.

Case statement: Monitoring within DICE is currently done in an extremely ad hoc fashion, where it happens at all. As we deploy more and more critical services with redundancy built in, it becomes vital to know when individual components fail. Redundancy can hide initial failures until the last redundant system goes down and the entire service fails.

In particular, AFS has a set of redundant database and file servers. It is important to know when one of these goes down, because the service as a whole will continue to run despite the failure of an individual server.

When: 

Status:

Timescales:

Priority:

Time:

How: 

Proposal: A detailed proposal was circulated to COs in December 2005, following investigation of a number of different options for monitoring technology and configuration approaches. We will deploy Nagios, and use a configuration system which directly fetches profile information from the LCFG servers to manage host configuration.
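As a rough illustration of generating host configuration from profile data, the sketch below renders Nagios `define host` objects from a dictionary. The dictionary layout, the example host names and addresses, and the `generic-host` template name are all assumptions made for illustration; the real system fetches profiles from the LCFG servers, and that step is not shown here.

```python
# Hypothetical profile data. In the deployed system this information would be
# fetched from the LCFG servers; the layout below is assumed for illustration.
PROFILES = {
    "afsdb1.example.org": {"address": "192.0.2.10"},
    "afsfs1.example.org": {"address": "192.0.2.20"},
}


def render_host(name, profile, template="generic-host"):
    """Render one Nagios host object definition from a profile dictionary."""
    lines = [
        "define host {",
        f"    use        {template}",  # assumed template name
        f"    host_name  {name}",
        f"    address    {profile['address']}",
        "}",
    ]
    return "\n".join(lines)


def render_config(profiles):
    """Render host definitions for every profile, sorted by host name."""
    return "\n\n".join(
        render_host(name, profile) for name, profile in sorted(profiles.items())
    )


if __name__ == "__main__":
    print(render_config(PROFILES))
```

Keeping the rendering separate from the profile source means the same function could be fed from a file, a test fixture or a live profile fetch.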

It was proposed that the development be undertaken in two stages. The first stage will produce a system capable of monitoring all AFS database and file servers.

Stage 2 will extend this to produce a framework capable of monitoring any service configured through LCFG, by providing a means for component authors to write monitoring scripts based on their component's resources.
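A monitoring script for a redundant service might look something like the following sketch. It assumes the script follows the standard Nagios plugin convention of signalling state through exit status (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function name `evaluate` and the server names are hypothetical, and the per-server probes are replaced by a plain dictionary. The point is that a single failed server is reported even though the service as a whole still works.

```python
# Nagios plugin convention: the exit status conveys the service state.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(server_states):
    """Map {server_name: is_up} to a (status, message) pair.

    Any server being down yields WARNING, even though redundancy keeps the
    service running; every server being down yields CRITICAL.
    """
    if not server_states:
        return UNKNOWN, "UNKNOWN: no servers configured"
    down = sorted(name for name, up in server_states.items() if not up)
    if not down:
        return OK, f"OK: all {len(server_states)} servers responding"
    if len(down) == len(server_states):
        return CRITICAL, "CRITICAL: all servers down: " + ", ".join(down)
    return WARNING, "WARNING: redundancy reduced, down: " + ", ".join(down)


def main():
    # Hard-coded states stand in for real probes of each server.
    status, message = evaluate({"afsdb1": True, "afsdb2": False, "afsdb3": True})
    print(message)
    return status
```

A real plugin would end with `sys.exit(main())` so that the monitoring system sees the status code.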

In order to increase the utility of monitoring messages, initial service failure notifications will be provided via Jabber, using presence to avoid sending notifications to users who are unavailable. Escalation via email will also be provided.
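The notification policy described above can be sketched as a small routing function. This is only one possible reading: the `Contact` type, the 30-minute escalation threshold, and the choice to fall back to email immediately when nobody is available on Jabber are assumptions, not details taken from the proposal.

```python
from dataclasses import dataclass


@dataclass
class Contact:
    name: str
    jabber_available: bool  # presence as reported by the Jabber service
    email: str


def choose_notifications(contacts, minutes_unacknowledged, escalate_after=30):
    """Return a list of (channel, recipient) pairs for one failure event.

    Jabber messages go only to contacts whose presence shows them as
    available. Email is added for everyone once the failure has gone
    unacknowledged for `escalate_after` minutes (an assumed threshold), or
    immediately when no contact is available on Jabber (also an assumption).
    """
    targets = [("jabber", c.name) for c in contacts if c.jabber_available]
    if not targets or minutes_unacknowledged >= escalate_after:
        targets += [("email", c.email) for c in contacts]
    return targets
```

For example, with one available and one unavailable contact, a fresh failure produces a single Jabber message, and the same failure left unacknowledged past the threshold also produces an email to each contact.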

Any other notification methods, such as SMS or pager, are outside the scope of this project.

Resources: Simon's time (which is pulled in many different directions ...)

Plan:

Other: 

Dependencies: If we wish to offer reporting via Jabber, a production-quality Jabber service will be required for COs.

LDAP schema changes are required to support Nagios user configuration.

Risks:

URL: https://wiki.inf.ed.ac.uk/DICE/MonitoringProject

Milestones

Proposed date | Achieved date | Name | Description
2007-05-28 | 2007-05-25 | sysmon-jabber | Deploy production version of required notification service (Jabber). Depends upon services-unit making hardware available.
2007-06-25 | 2007-06-20 | sysmon-codecomp | Code complete, tested and deployed on development hardware.
2007-12-05 | 2007-10-03 | sysmon-afscomp | AFS component rewritten to actually configure AFS services, and therefore be monitorable. This milestone transferred to the AFS project.
2007-07-23 | 2007-07-01 | installhardware | Hardware for production service installed. Date depends upon hardware delivery, and time availability within inf-unit.
2007-07-27 | 2007-07-01 | sysmon-prodcode | System moved to production machines. Date depends upon installation of production hardware.
2007-11-05 | 2007-09-07 | sysmon-slave | Install 'backup' slave monitoring system.
2007-09-07 | 2007-09-07 | sysmon-doc | Produce documentation on how to write a monitoring component.