Server Hardware Interaction

Project ID: 134
Current stage: 
Manager: 
Unit: 
What: 

Description: This project aims to improve the management of and interaction with the server hardware.

Deliverables: Easy, automatic updates for server firmware and RAID firmware; improved monitoring of RAID controllers; monitoring of ambient temperature and clean server shutdown when it gets too high; an OMSA configuration which is useful to us and which we can live with; an improved appreciation of what is and isn't useful and easily automatable in this area. The highest priority deliverable is automated monitoring of the RAID controllers.

Why: 

Customer: Directly, the computing staff of the School of Informatics. Indirectly, users of School of Informatics services, through improved reliability and quicker & easier maintenance tasks.

Case statement: We need to improve our interaction with server hardware. For example, we have in the past lost data because RAID controller firmware was not kept up to date. It is useful to keep server BIOS versions up to date and to know when disks in a RAID array have developed a fault. We should also be able to shut machines down automatically and cleanly when the ambient temperature gets too high. Software exists to help manage these sorts of tasks; we need to work out which software would help us and how, then design and implement lightweight solutions accordingly. This would improve the reliability of those servers which aren't being painstakingly monitored and upgraded by their managers, and would automate some tedious, time-consuming tasks for those managers who do take pains to upgrade firmware etc.

When: 

Status:

Timescales: The project has been allocated two weeks of time.

Priority: CEG has judged this to be a high priority project w.r.t. temperature measurement and RAID monitoring.

Time: The project has been allocated two weeks.

How: 

Proposal: see Plan.

Resources: Plentiful access to examples of our more popular models of rack-mounted servers.

Plan: The highest priority deliverable for this project is now clean automatic shutdown on excessive ambient temperature. The next highest is RAID monitoring. The plan for that is to find out about our RAID facilities; try out possible monitoring software and get it working; pick the most suitable; and write or obtain a Nagios translator for it.
If time is left over after achieving this then the project will go on to work on other possible deliverables such as more automatic and easier firmware updates for server hardware.
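
As an illustration of what a "Nagios translator" for RAID status might involve, here is a minimal Python sketch. It is not the project's implementation. The exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are the standard Nagios plugin convention; everything else is assumed for the example: the function name raid_to_nagios, the disk names, the state strings ("online", "rebuilding", "degraded", "failed") and the choice to rank CRITICAL above UNKNOWN. Real RAID tools report their own vocabulary, which would need to be mapped.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}

# Hypothetical mapping from RAID disk states to Nagios severities.
# These state names are assumptions, not the output of any particular tool.
STATE_SEVERITY = {
    "online": OK,
    "rebuilding": WARNING,
    "degraded": WARNING,
    "failed": CRITICAL,
}

# Design choice for this sketch: a known failure outranks an unknown state.
RANK = {OK: 0, UNKNOWN: 1, WARNING: 2, CRITICAL: 3}


def raid_to_nagios(disk_states):
    """Translate {disk_name: state} into an (exit_code, status_line) pair."""
    if not disk_states:
        return UNKNOWN, "RAID UNKNOWN - no disk status available"

    worst = OK
    problems = []
    for disk, state in sorted(disk_states.items()):
        # Unrecognised states are reported as UNKNOWN rather than ignored.
        severity = STATE_SEVERITY.get(state.lower(), UNKNOWN)
        if severity != OK:
            problems.append(f"{disk}={state}")
        if RANK[severity] > RANK[worst]:
            worst = severity

    detail = ", ".join(problems) if problems else f"{len(disk_states)} disks online"
    return worst, f"RAID {LABELS[worst]} - {detail}"
```

A wrapper script would print the status line and exit with the returned code, which is how Nagios plugins report their result.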

Other: 

Dependencies: The project will depend on getting access to servers...

Risks:

Milestones

Proposed date  Achieved date  Name    Description
2010-02-04     2010-02-12     temp1   Design an emergency temperature protection system
2010-02-05     2010-02-12     temp2   Implement an emergency temperature protection system
2010-02-05     2010-02-12     temp3   Test the emergency temperature protection system
2010-02-08     2010-02-12     temp4   Document the emergency temperature protection system
2010-05-21     2010-05-21     hwmon1  Learn about Nagios and RAID status in preparation for developing a simple hardware status monitor for DICE servers.
2010-05-31     2010-05-28     hwmon2  Implement and document a simple hardware status check for DICE servers for use with the Nagios monitoring system. It should check at least RAID disk status.
2010-07-02                    hwmon3  Test and promote the simple hardware status check system for DICE servers.
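
The temp1 to temp4 milestones concern an emergency temperature protection system. As a rough sketch of the decision such a system has to make, the Python function below asks for a clean shutdown only after several consecutive ambient readings are at or above a threshold, so that one spurious reading does not take a server down. It does not describe the system that was actually built. The function name, the 35.0 °C threshold and the count of three readings are all hypothetical values chosen for illustration.

```python
def should_shut_down(readings_c, threshold_c=35.0, consecutive=3):
    """Return True if the last `consecutive` readings are all >= threshold_c.

    readings_c  -- ambient temperature readings in degrees Celsius, oldest first
    threshold_c -- hypothetical shutdown threshold (an assumption, not a site value)
    consecutive -- how many recent readings must agree before acting
    """
    if consecutive < 1:
        raise ValueError("consecutive must be at least 1")
    recent = readings_c[-consecutive:]
    # With too few readings there is not enough evidence to act on.
    if len(recent) < consecutive:
        return False
    return all(r >= threshold_c for r in recent)
```

The caller would be responsible for gathering readings and for invoking the clean shutdown itself; the sketch covers only the threshold logic.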