Cluster Parallel filesystem

Project ID: 73
Current stage: 
Manager: 
Unit: 
What: 

Description: A parallel filesystem for use on the department's clusters, integrating with the university's ECDF cluster.

Deliverables:
* Access to Informatics SAN space at ECDF on our clusters.
* A parallel filesystem based on GPFS, accessible from ECDF and Informatics cluster nodes.
* A GPFS installation which can be managed via LCFG.
* A parallel filesystem available on individual hosts within the school.
* Documentation for end users and support staff.

Why: 

Customer: Primarily the users of the compute clusters, but conceivably any user with a requirement for access to a large filesystem supporting fast IO.

Case statement: Jobs currently running on the clusters are hitting the limits of the currently available technology. Since 64, 32 or even 16 nodes can saturate the bandwidth to a single fileserver, we will need to switch to some kind of parallel filesystem with multiple "servers".
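
As a back-of-the-envelope illustration of this pressure, the sketch below (Python; the 1 Gbit/s fileserver link speed is an assumption for illustration, not a measured figure) shows how the per-node share shrinks as the node count grows, and how multiple servers recover aggregate bandwidth:

    # Rough arithmetic behind the case statement (illustrative figures only).
    LINK_MBIT = 1000.0  # assumed single-fileserver network link, Mbit/s

    # Per-node share when N nodes hit one fileserver at once.
    for nodes in (16, 32, 64):
        print(f"{nodes} nodes on one link: ~{LINK_MBIT / nodes:.0f} Mbit/s each")

    # With S parallel "servers" the aggregate scales roughly linearly,
    # which is the motivation for a parallel filesystem such as GPFS.
    for servers in (2, 4, 6):
        print(f"{servers} servers: ~{servers * LINK_MBIT / 1000:.1f} Gbit/s aggregate")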

ECDF are using GPFS to make SAN filespace available to the eddie cluster, and there is already demand from our cluster users to have this filespace available on our clusters. It would be sensible to have all cluster filespace available across all clusters to minimise duplication of data, and using GPFS to do this would minimise duplication of effort.

There may also be a case for making GPFS available on other hosts in the School; however, it is not expected to be generally available.

When: 

Status: Proposal. Some initial work has been done to test GPFS on FC5/6, but it has become clear that this is not currently viable.

Timescales: Realistically, we can't roll out a production service until GPFS is available on an LCFG-managed SL5. However, it should be possible to have everything in place to roll out shortly after the platform becomes available.

This actually took 19.11 weeks.

Priority: high

Time: At this point the total effort is unknown; the evaluation period alone is expected to take 4-6 man-weeks of effort.

How: 

Proposal:

Resources: For the evaluation phase we would need a suitable number of hosts on which to set up a filesystem. IS indicate that 6+ IO servers are required for optimum performance, and it would be useful to have that order of identical machines available to run performance tests. Older desktops would be fine, but we'd need a reasonable amount of free disk space.
It would also be useful to have additional machines available to test a heterogeneous cluster; however, these could be machines "borrowed" from the existing clusters and cluster infrastructure.

The project would need a security review of the filesystem and the proposed configuration developed in the evaluation phase.

Resources required for deployment are dependent on the results of the evaluation phase. If we were to convert the cluster home directories & scratch space at KB, we would need to replace har with a number of additional Linux-based servers.

Plan: We need to set up a test cluster to gain experience; this should investigate the following issues (a bring-up sketch follows the list):

* Management.
* Network traffic.
* Optimum number of "servers" for clusters.
* Likely performance relative to the current clusters.
* Security.
* Integration with ECDF and eddie.
* Data retention & backup strategies.
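
As a concrete starting point, here is a minimal bring-up sketch for such a test cluster. This is a sketch only: it assumes the GPFS RPMs are already installed, that passwordless root ssh works between the nodes, that node.list and disk.desc descriptor files exist in the formats the mm* commands expect, and that gpfs1 is a placeholder hostname. The exact command syntax varies between GPFS releases (the form below is 3.2-era), so check the manuals for the version in use.

    # Hypothetical bring-up of a small GPFS test cluster via the mm* tools.
    import subprocess

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # Create the cluster from a node descriptor file, using ssh/scp as the
    # remote shell/copy commands (gpfs1 is a placeholder primary server).
    run(["mmcrcluster", "-N", "node.list", "-p", "gpfs1",
         "-r", "/usr/bin/ssh", "-R", "/usr/bin/scp"])
    run(["mmcrnsd", "-F", "disk.desc"])                   # turn raw disks into NSDs
    run(["mmstartup", "-a"])                              # start GPFS on all nodes
    run(["mmcrfs", "/gpfs", "gpfs0", "-F", "disk.desc"])  # create the filesystem
    run(["mmmount", "gpfs0", "-a"])                       # mount it cluster-wide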

At the end of the evaluation stage the project should produce a proposal outlining the production service; the plan would then be revised.
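
To give a flavour of the performance tests envisaged above, the following minimal streaming-write benchmark sketch measures throughput on a given mount. The mount point, file size and block size are assumptions, and real runs would also vary node counts and mirroring/striping settings:

    # Crude single-node streaming-write benchmark (illustrative only;
    # /gpfs/benchmark.tmp is a hypothetical path on the test filesystem).
    import os, time

    PATH = "/gpfs/benchmark.tmp"
    SIZE_MB = 256
    BLOCK = 1024 * 1024  # write in 1 MiB chunks

    def write_throughput():
        data = os.urandom(BLOCK)
        start = time.time()
        with open(PATH, "wb") as f:
            for _ in range(SIZE_MB):
                f.write(data)
            f.flush()
            os.fsync(f.fileno())  # force data out past the page cache
        elapsed = time.time() - start
        os.remove(PATH)
        return SIZE_MB / elapsed  # MB/s

    print(f"streaming write: ~{write_throughput():.1f} MB/s")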

Other: 

Dependencies: For roll-out onto the clusters the project is dependent on:
* An LCFG port to Scientific Linux (probably version 5)
* GPFS support for the equivalent RHEL version
* IS help in integrating their filesystem with our clusters and vice versa.

GPFS for RHEL5 is not likely to appear until the third quarter of this year.

Risks: We would be developing a filesystem based on technology that is closed source, is currently not available on our target OS, and is unlikely to be officially supported on that OS. This risk is mitigated by the fact that ECDF would be in the same position.

Based on our current knowledge, GPFS will degrade the security of the clusters. It may not be possible to meet all the project deliverables whilst maintaining an acceptable level of security.

URL: http://www.dice.inf.ed.ac.uk/units/research_and_teaching/projects/GPFS/

Milestones

Proposed date   Achieved date   Name              Description
2007-10-02      2007-10-06      set up cluster    Obtain 4-6 nodes which can be used to set up a GPFS test cluster and install SL4.
2007-10-12      2007-10-15      root based auth   Investigate the best way to set up root-based passwordless ssh access for the cluster.
2007-10-17      2007-10-18      test install gp   Set up a test cluster using GPFS on SL4.
2007-10-28      2007-12-30      acquire GPFS 3    We want to grab GPFS 3.2 as soon as it is released and take a look. If it works ok on SL5 we need to upgrade our test cluster.
2007-11-02      2007-11-06      cluster upgrade   Upgrade the test cluster to SL5.
2007-12-01      2007-12-15      LCFGise setup     Convert all the manual configuration tweaks required to get GPFS installed into header files and RPMs.
2007-11-25      2007-11-30      run performance   Run a series of performance tests showing the effects of adding more server nodes and using mirroring/striping techniques.
2010-01-30                      shared filesyst   Set up a shared filesystem with eddie (either way) and run some benchmarks to check GPFS performance over SRIF.
2010-04-27      2010-04-27      produce report    Produce a report and decide what happens next.
2009-04-10      2009-04-01      Test production   Set up a test production service to establish what kind of setup would be suitable for the school's use.
2009-05-10      2009-05-04      tests             Run some tests to try to identify bottlenecks.
2009-06-20      2009-06-16      analyze results   Produce some graphs and analysis of the data.
2009-11-25      2009-11-14      fix hardware      Replace broken hardware with new cast-offs.
2010-05-01                      finish            Submit project for sign-off.