
Increase the number of nodes in the 64-node cluster.

Project ID: 23
Current stage: 
Manager: 
Unit: 
What: 

Description: There is currently space in the beowulf racking for additional nodes for this cluster. With the addition of a shelf in rack 11 and by utilising existing spare network kit, it should be possible to add up to 16 desktop machines to lion. These nodes would act as a bank of hot spares and at the same time could be used to run short jobs on individual nodes (a usage sketch follows below).

This would make use of equipment currently sitting in storage, improve the usability of the cluster and provide some extra capacity for cluster users.
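
As an illustration of the short-job use, here is a minimal sketch of how a user might pin a job to one of the spare nodes through gridengine, assuming the standard qsub options (-l h_rt for a run-time limit, -q queue@host to target a single queue instance). The queue name short.q, the node name lionspare01 and the job script are hypothetical placeholders, not lion's actual configuration.

    """Sketch: submit a short job to a named spare node via gridengine.

    Assumes qsub is on PATH; 'short.q', 'lionspare01' and 'myjob.sh' are
    hypothetical names, not lion's real queue, host or job names."""
    import subprocess

    def submit_short_job(script, node, queue="short.q", runtime="00:30:00"):
        """Submit script to a single node with a hard wall-clock limit."""
        cmd = [
            "qsub",
            "-l", "h_rt=%s" % runtime,       # hard run-time limit, keeps the job "short"
            "-q", "%s@%s" % (queue, node),   # pin to one queue instance, i.e. one node
            script,
        ]
        subprocess.check_call(cmd)

    if __name__ == "__main__":
        submit_short_job("myjob.sh", "lionspare01")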

Deliverables: An additional 10-16 nodes for the lion cluster.

Why: 

Customer: The gridengine user base: research staff and MSc/PhD students.

Case statement: Lion's nodes are now 4+ years old and we are seeing more unexplained glitches and the first outright hardware failures. Until now the nodes have been remarkably reliable and it has been possible to maintain the cluster using two "cold spares". Recently the frequency of problems has reached the stage where, on average, one or two nodes may be unavailable during any given week, which causes problems for users who need to run jobs on a large number (or all) of the nodes.

Adding nodes would reduce the pressure both on what is still a heavily used cluster and on the support staff to maintain 100% uptime on all the nodes. It could also act as a trial for replacing a percentage of the cluster nodes with old desktop machines on a rolling basis.

When: 

Status: pending approval

Timescales: This should be done as the clusters are upgraded to FC5, before January 2007

Priority: This would be a fairly cheap way to expand the cluster and to reduce support overheads slightly, using what is essentially surplus kit.

If we want to go ahead with this we should do it now, or defer it until we make whatever changes to the clusters when we move to the Forum.

Time: Some support/technician time to identify likely spare PCs and transport them to KB.

Some technician time to find and install a shelf (this should just be a matter of grabbing a BSI standard shelf from 3312 or 2905) and to install some trunking for routing power cables.

4-5 days to integrate the switch and new nodes into the cluster.

How: 

Proposal: Redeploy the minibw procurve and up to 16 desktops from storage to create a hot-spare pool for the 64-node cluster.

Resources: The nodes, a BSS shelf, cables and, if we have any, 1 or 2 spare 1Gb cards for a Procurve 4000.

Plan:

1. Fit new shelf and move minibw procurve.

2. Fit bridge trunking between beowulf rack and rack

3. Reorganise cluster power cabling (2 FTE days). During cluster downtime for the upgrade.

4. Install nodes (2 FTE days). Ideally integrated with the cluster upgrade.

5. Add nodes into the gridengine configuration (1 FTE day).
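
As a rough sketch of step 5, assuming a standard gridengine setup administered with qconf: the node names and the @lionhosts host group below are hypothetical placeholders, and the execution host entries themselves are assumed to be created in the usual way (qconf -ae interactively, or qconf -Ae from a file) before this is run.

    """Sketch: register new spare nodes with the gridengine master.

    Assumes qconf is on PATH and is run as a gridengine admin user; the
    node names and the host group are illustrative placeholders only."""
    import subprocess

    NEW_NODES = ["lionspare%02d" % n for n in range(1, 17)]  # hypothetical node names
    HOST_GROUP = "@lionhosts"                                # hypothetical host group used by the cluster queues

    def qconf(*args):
        """Run a qconf command, echoing it first so the steps are logged."""
        cmd = ["qconf"] + list(args)
        print(" ".join(cmd))
        subprocess.check_call(cmd)

    for node in NEW_NODES:
        qconf("-ah", node)  # allow the node to act as an administrative host
        qconf("-as", node)  # allow job submission from the node, if that is wanted
        # The execution host object for the node is assumed to exist already
        # (qconf -ae / -Ae); here it is only added to the host group so that
        # queues defined over @lionhosts pick it up.
        qconf("-aattr", "hostgroup", "hostlist", node, HOST_GROUP)

If the cluster queues list hosts directly rather than through a host group, the equivalent last step would be qconf -aattr queue hostlist <node> <queue>.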

Other: 

Dependencies: The project doesn't have any dependencies, but ideally it should be coordinated with the cluster upgrades to FC5.

Risks: None foreseen.

URL: http://www.dice.inf.ed.ac.uk/units/research_and_teaching/projects/lion_e...

Milestones

Proposed date    Achieved date    Name               Description
2007-03-03       2007-03-04       Reconfigure rack   Install shelving and move minibw switch to rack 12
2007-03-18       2007-03-20       Source hardware    Source redundant desktops from support and install them in racking
2007-03-30       2007-04-04       Integrate nodes    Integrate the new nodes into the cluster
2007-06-20       2007-05-30       Report             Write up a report on the project
2007-07-14                        Signoff            Submit project to be signed off