Production Condor Service
Description:
Create a production-quality Condor service for use on staff and lab machines, allowing research staff and students to use spare CPU cycles for computationally intensive tasks.
Condor is a clustering system targeted at harvesting spare CPU cycles on otherwise unused desktops; it can be configured to use only desktop machines where the keyboard and mouse are idle. Should Condor detect that a machine is no longer available (for example, a key press), in many circumstances it can transparently checkpoint a job and migrate it to a different machine which would otherwise be idle.
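For illustration, a minimal submit description file for a checkpointable job might look like the sketch below. The standard universe is what enables checkpointing and migration; the executable and file names here are hypothetical, and the binary must first be relinked with condor_compile.

```
# Hypothetical submit file, sim.sub: run one checkpointable job.
universe   = standard     # standard universe => checkpoint/migrate support
executable = sim          # binary relinked with condor_compile
output     = sim.out      # stdout of the job
error      = sim.err      # stderr of the job
log        = sim.log      # Condor's event log for this job
queue                     # submit one instance
```

The user would then run condor_submit sim.sub and follow progress with condor_q.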
We have had a test cluster running on a number of School machines for some time, and we are now in a position to produce a production-quality service on lab and staff machines. This will allow research staff to submit jobs to Condor pools which will use idle CPUs on standard desktop machines to process jobs.
Deliverables: A production service running on lab and staff machines, with multiple master nodes sited at KB and in the city centre. Suitable documentation for the system to be managed day to day by the support unit and to provide a gentle introduction for new users. Beyond basic testing, usage testing will be carried out by the Distributed Computing working group.
Customer: This project will support research across a number of Institutes and would be a resource available to all research staff and students. It would also be possible to open up access to the cluster for non-research use on nodes which had not been grant funded. The Distributed Computing working group would sign off on acceptability.
Case statement: The School purchases a large number of desktop computers which spend much, in some cases most, of their time unused. Deploying Condor on lab and staff machines would make use of this currently wasted resource and at the same time free up more time on dedicated clusters for jobs with more stringent resource requirements.
Status:
Timescales:
Second pilot: July-August.
Start phased roll-out on lab and staff machines from the middle of semester 1, 2006-2007.
Multiple masters in place by December 2006.
Condor running site-wide by the start of semester 3, 2006-2007.
Priority:
There is a user request (https://rt3.inf.ed.ac.uk/Ticket/Display.html?id=23966) for Condor to be deployed in the labs by July.
There is considerable pressure from some Institutes for more cluster computing resources.
The current pilot is well subscribed and well used.
Time:
This is not fully quantifiable at this stage, as a certain amount of time is needed simply to evaluate the technology and decide how best to implement certain aspects of the proposal.
Steps 1 and 2 of the plan below would require about 5 man days' work, with a contingency of 2 man days.
Step 3 would require 3 man days from RAT and 1 man day from Frontline Support, with an as yet unknown user-support commitment (probably from both units).
The best current estimate for the remainder is 1-3 man months of effort, but this is very difficult to quantify: it is not clear what level of configuration or help staff would need to get their machines ready to run as Condor nodes, and some of the risks listed below could have a major impact on the project. Figures will become clearer once stage 3 is complete.
For example, if the kernel patch doesn't work and Condor can't background tasks, then a much larger commitment of resources would be required to produce a solution, or we would have to restrict Condor use to outside office hours.
Proposal: Deploy Condor site-wide on lab/general-access machines, with an opt-out for individual users' machines.
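One plausible way to implement the opt-out (an assumption, not a settled design) is a local configuration fragment on opted-out machines which simply refuses all jobs:

```
# Hypothetical local config for an opted-out desktop: the startd still
# reports to the pool, but the machine never agrees to start a job.
START = False
```

Such a fragment could be pushed out through the machine's profile, so users could opt out without any per-machine hand editing.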
Resources:
Hardware for multiple master nodes would be required. This can be fairly low-spec in terms of processor and disk, but should be geared towards high availability (rack mount, console access, RAID-capable disks, some form of UPS).
The project would need expertise from the MDP unit to patch the kernel and
there would be an ongoing support requirement to ensure that the patch was applied to subsequent kernels.
The project would need expertise from the Services unit on how best to integrate Condor with AFS.
There would be a need for staff time from Frontline Support to reconfigure the lab machines.
There would be a need for ongoing support for upgrading Condor RPMs.
There would be an ongoing need for Condor-specific documentation and user support.
Plan:
1. Patch kernel and test with current Condor pool.
2. Rework component to work with third-party Condor RPMs.
3. Set up second Condor pool to test the above.
4. Large-scale testing of step 2 in a student lab (probably a newly upgraded FC5 lab).
5. Review the above and plan for larger-scale deployment in labs and on volunteer user desktops.
6. Deploy multiple masters.
7. Deploy in all labs and on volunteer user desktops.
8. Deploy school-wide.
Dependencies: Condor currently cannot detect console usage with USB keyboards and mice, and consequently does not suspend or migrate processes when a user starts using the machine. We would need a patched kernel deployed on Condor nodes.
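For context, the idle-desktop policy hangs off the KeyboardIdle value that the startd derives from the console devices; a sketch of the relevant condor_config expressions (thresholds are illustrative, not our actual settings):

```
# Illustrative policy: start jobs only on genuinely idle desktops,
# suspend the moment the console is touched, resume after 5 idle minutes.
START    = (KeyboardIdle > 15 * $(MINUTE)) && (LoadAvg - CondorLoadAvg <= 0.3)
SUSPEND  = (KeyboardIdle < $(MINUTE))
CONTINUE = (KeyboardIdle > 5 * $(MINUTE))
```

With USB input devices KeyboardIdle never resets, so SUSPEND never fires; that is precisely why the kernel patch is a hard dependency.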
Risks:
If take-up is high, Condor usage could impact fileserver or LAN performance.
Master nodes are open to denial-of-service attacks from malicious or badly written submission scripts (a possible mitigation is sketched at the end of this section).
Running desktop machines at high load continuously for long periods may degrade the hardware. It may also turn a latent fault into large-scale failures: for example, with the capacitor problem on the GX270s, we might get a lab full of machines failing within days rather than over a period of weeks.
We currently have no way of running Condor with AFS.
Automatic backgrounding of jobs currently does not work on nodes with USB input devices on FC5.
Large numbers of nodes running at high load in the labs may produce complaints from students about the heat and noise generated.
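On the denial-of-service risk above, one partial mitigation (a sketch only; the numbers are placeholders) is to cap queue growth in the schedd's condor_config:

```
# Illustrative schedd limits to bound the damage from a runaway submit loop.
MAX_JOBS_RUNNING   = 200    # ceiling on simultaneously running jobs
MAX_JOBS_SUBMITTED = 5000   # refuse submissions beyond this queue size
```

This does not stop a determined attacker, but it keeps a badly written script from exhausting the master's resources.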
URL: https://wiki.inf.ed.ac.uk/DICE/CondorProject
Milestones
Proposed date | Achieved date | Name | Description |
---|---|---|---|
2006-08-10 | 2006-08-10 | third party | Rework component to work with third party condor rpms |
2006-09-27 | 2006-08-15 | test pool | Set up second condor pool to test reworked component |
2007-02-19 | 2006-08-20 | kernel | Patch kernel and test with current condor Pool |
2007-02-17 | 2006-08-16 | student lab | Larger scale user testing of pool in student lab (using locked FC5 lab) |
2007-04-01 | 2006-08-20 | review | Review testing and plan for larger scale deployment in labs and volunteer user desktops |
2007-12-13 | 2006-08-31 | multiple master | Deploy multiple masters |
2007-03-06 | 2007-12-01 | all labs | Deploy in all labs and volunteer user desktops |
2008-03-26 | 2008-03-06 | school wide | Deploy school wide. |
2008-12-31 | 2008-09-05 | documentation | Produce Documentation for users and COs |
2009-05-21 | | review docs | Have support review documentation until they accept it. |
2008-01-14 | 2007-12-21 | activate hawkeye | Condor has a monitoring system, Hawkeye, which can run scripts on arbitrary events or when checks fail. Running this to monitor LDAP and/or free memory would limit LDAP problems. |
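On the hawkeye milestone: Hawkeye's mechanism lives on in Condor as "startd cron", where a periodic script publishes ClassAd attributes into the machine ad. A sketch in that syntax (the script path and attribute name are hypothetical):

```
# Illustrative startd cron job: probe LDAP every 5 minutes and publish
# the result (lines like "LdapResponsive = True") into the machine ad.
STARTD_CRON_JOBLIST = LDAPCHECK
STARTD_CRON_LDAPCHECK_EXECUTABLE = /usr/local/sbin/check_ldap
STARTD_CRON_LDAPCHECK_PERIOD = 5m
```

Policy expressions such as START could then require LdapResponsive, keeping jobs off nodes whose LDAP has wedged.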