
Hadoop cluster revamp

Project ID: 
291
Current stage: 
Manager: 
Unit: 
Summary: 
Rework the existing hadoop component and headers to provide better management and monitoring.
What: 

Rework the hadoop component to:
* Allow individual nodes to be added to and removed from the jobtracker (a sketch of the removal side follows this list).
* Allow additional disk space on nodes to be added to and removed from the HDFS filesystem.
* Better support the use of multiple clusters.
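
As a rough illustration of the removal mechanism, the sketch below assumes Hadoop's standard exclude-file decommissioning; the conf path and the availability of "hadoop mradmin" (Hadoop 1.x) are assumptions, not decisions.

    import subprocess

    # Hypothetical path; in practice whatever dfs.hosts.exclude /
    # mapred.hosts.exclude point at in the managed configuration.
    EXCLUDE_FILE = "/etc/hadoop/conf/exclude"

    def decommission(node):
        """Add a node to the exclude file and ask the daemons to re-read it."""
        with open(EXCLUDE_FILE, "a") as f:
            f.write(node + "\n")
        # The namenode starts draining the datanode's blocks to other nodes.
        subprocess.check_call(["hadoop", "dfsadmin", "-refreshNodes"])
        # The jobtracker stops scheduling tasks on the node ("hadoop mradmin"
        # exists in Hadoop 1.x; older versions may need a jobtracker restart).
        subprocess.check_call(["hadoop", "mradmin", "-refreshNodes"])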

Develop a small test suite that can be used to automatically test the operation of the cluster, and possibly integrate this with Nagios (a minimal check is sketched below).
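
A minimal Nagios-style check might look like the following sketch: it runs a cheap end-to-end HDFS operation and maps the outcome onto the standard plugin exit codes. The choice of "hadoop fs -ls /" as the probe is a placeholder; a real check would run one of the standard test jobs.

    #!/usr/bin/env python
    # check_hadoop: minimal Nagios-style health check (sketch).
    import subprocess
    import sys

    OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3   # standard Nagios exit codes

    try:
        # Listing the HDFS root exercises the namenode end to end.
        subprocess.check_call(["hadoop", "fs", "-ls", "/"],
                              stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL)
    except subprocess.CalledProcessError:
        print("HDFS CRITICAL: 'hadoop fs -ls /' failed")
        sys.exit(CRITICAL)
    except OSError:
        print("HDFS UNKNOWN: hadoop binary not found")
        sys.exit(UNKNOWN)
    print("HDFS OK")
    sys.exit(OK)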

Deploy performance/usage monitoring software to better quantify cluster usage.

Why: 

The current component works well for configuring a fairly static cluster, but adding and removing nodes, both as job-running nodes and as disk nodes in HDFS, is awkward and time-consuming. The configuration often does not reflect the actual nodes in the cluster, which has led to data loss in the past.

Being able to remove nodes when the cluster is not being used would allow us to sleep nodes (where supported) and reduce the cluster's power footprint. Ideally this could be scripted, allowing the cluster to automatically minimise its power usage without affecting its usability.

The current Hadoop configuration (version and installed packages) is largely dictated by the Extreme Computing course. Being able to set up multiple Hadoop or HDFS clusters on the existing hardware with a minimum of overhead would allow us to respond more flexibly to requests from other groups of users where better security or a more recent version of Hadoop is required.

Having a standard set of test jobs would allow us to monitor the cluster's health and hopefully reduce downtime.

How: 

Core:
* Rewrite the hadoop component to fully manage the configuration of all the clients and the starting and stopping of the various daemons.
* Produce a mechanism that automates, as far as possible, the management of cluster membership (say via a spanning map) and also manages which member nodes are currently active within the cluster (a sketch follows this list).
* Take the basic Hadoop example tutorial and automate its execution so that it can be run as a nightly cron job.
* Generate basic monthly logs of cluster usage.
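
One possible shape for the membership mechanism is sketched below: regenerate the slaves and exclude files from a single declarative node list. In the component proper the list would be populated from the spanning map rather than hard-coded, and the conf path is a placeholder.

    # Sketch: derive Hadoop's membership files from one declarative node list.
    nodes = {
        "node01": "active",
        "node02": "active",
        "node03": "inactive",   # configured but currently withdrawn
    }

    def write_membership(conf_dir="/etc/hadoop/conf"):   # hypothetical path
        active = sorted(n for n, s in nodes.items() if s == "active")
        inactive = sorted(n for n, s in nodes.items() if s == "inactive")
        with open(conf_dir + "/slaves", "w") as f:
            f.write("\n".join(active) + "\n")
        with open(conf_dir + "/exclude", "w") as f:
            f.write("\n".join(inactive) + "\n")

Deriving both files from one source would directly address the problem of the configuration drifting away from the actual cluster membership.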

Optionally:
* Write a daemon that monitors cluster usage and dynamically activates and sleeps nodes to match demand (a rough sketch follows this list).
* Investigate any Hadoop Nagios clients and integrate them into our Nagios setup.
* Check for any usage/accounting packages available for Hadoop, or write one if none exists.
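
The power-management daemon might reduce to a polling loop along these lines. Everything here is an assumption to be refined: the poll interval, the crude idleness test, and waking nodes via the wakeonlan tool; the decommission-and-sleep step would reuse the exclude-file mechanism sketched earlier.

    import subprocess
    import time

    POLL_INTERVAL = 300     # seconds between checks; tuning is guesswork for now
    SLEEPING_MACS = []      # MAC addresses of nodes we have put to sleep

    def cluster_idle():
        """Crude idleness test: the jobtracker reports no running jobs.
        In Hadoop 1.x 'hadoop job -list' starts with 'N jobs currently running'."""
        out = subprocess.check_output(["hadoop", "job", "-list"])
        return out.lstrip().startswith(b"0")

    def wake(mac):
        """Send a magic packet (assumes the wakeonlan tool is installed)."""
        subprocess.check_call(["wakeonlan", mac])

    while True:
        if cluster_idle():
            pass   # decommission an idle node, then sleep it (see earlier sketch)
        elif SLEEPING_MACS:
            wake(SLEEPING_MACS.pop())
        time.sleep(POLL_INTERVAL)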

Effort estimate: 
2-3 weeks for reworking the component, 2-3 days for the test setup, and 2+ weeks for the optional work.
Other: 

Dependencies: Any development would have to fit in alongside current cluster usage.

Risks: