When companies purchase a significant number of machines and cluster them together to solve their computing needs, their site environment often drives specific requirements for their clusters. These requirements can include specific networking configurations, particular applications they want to have managed, a specific approach to software installation and maintenance, or existing management software and processes that they want to use on the cluster.
The key to successful cluster administration software is that it be flexible enough to accommodate many of these environments. For optimum flexibility, the systems management software must have the following characteristics:
It must have some fundamental capabilities that can be used to accomplish a wide variety of tasks. These include capabilities like parallel command execution, configuration file management, and software maintenance.
The out-of-band hardware control must be extensible to support a wide variety of hardware.
It must support a variety of node installation methodologies, for example, direct installation using the native installer, cloning nodes, or running diskless nodes.
It must support a variety of networking configurations, including routers, firewalls, low bandwidth networks, and high-security environments.
The monitoring capabilities must be configurable, extensible, and support standards.
The management software must have the proper APIs and command-line interfaces necessary to support running it in a hierarchical fashion for very large clusters or subdivided clusters.
It must be modular and customizable so that it can fit into companies' existing structures and processes (CLI, extensibility, use of isolated parts, etc.).
It must have mechanisms for allowing frequent updates and user contributions.
This article discusses each of these characteristics in turn and gives examples of cluster administration software that possesses these qualities.
Flexible Fundamental Capabilities
Cluster administration encompasses a wide variety of tasks that are often unique to the cluster or to the cluster's purpose. Therefore cluster management tools need to provide ways to accomplish many different tasks with simple tools. The more inherent the flexibility in these tools the better. Basic functionality that is needed for cluster management includes:
Support for multiple distributions: Tools that work across multiple operating systems and architectures can be used in more environments. While Red Hat Enterprise Linux and SUSE Linux Enterprise Server (SLES) are two of the main distributions for enterprise clusters, many cluster users also want support for free distributions like Fedora and Debian.
Distributed command execution: A distributed shell is an essential clustering component, as it allows the administrator to quickly perform command-line tasks across the entire cluster or a subset of nodes. This capability is a catchall, because it allows the administrator to perform tasks that are not specifically supported by the rest of the administration software. Required flexibility includes time-out values, skipping of offline nodes, and the ability to use any underlying remote shell.
Distribution of files: Distribution of files comes in a close second as an essential clustering capability. There are two modes of file distribution: one time copy, and a repository of files kept common throughout the cluster. The latter mode is useful for maintaining configuration files throughout the cluster or on a subset of nodes, and it can have increased flexibility by automatically running user-defined scripts before and after files are copied to the nodes.
Software maintenance: The ability to install and upgrade software after a node is installed lets administrators add or upgrade individual applications without reinstalling the node. This feature must also automatically install prerequisite RPMs.
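The distributed command execution item above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used by any of the products named in this article: the function names dsh and run_on_node, the is_online callback, and the default of ssh as the remote shell are all hypothetical choices. The sketch shows the three flexibility points mentioned: a time-out value, skipping of offline nodes, and a replaceable underlying remote shell.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_on_node(node, command, remote_shell, timeout):
    """Run one command on one node and return (node, status, output)."""
    try:
        proc = subprocess.run(
            [*remote_shell, node, command],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # The child process is killed by subprocess.run before this is raised.
        return node, "timeout", ""
    except OSError as exc:
        # For example, the remote shell executable does not exist.
        return node, f"error: {exc}", ""
    status = "ok" if proc.returncode == 0 else f"exit {proc.returncode}"
    return node, status, proc.stdout.strip()


def dsh(nodes, command, remote_shell=("ssh",), timeout=10.0,
        is_online=lambda node: True, max_workers=32):
    """Run `command` on every online node in parallel.

    remote_shell is the argument prefix placed before the node name, so any
    remote shell can be plugged in. is_online is a hypothetical callback that
    lets the caller skip nodes known to be down.
    """
    results = {}
    targets = []
    for node in nodes:
        if is_online(node):
            targets.append(node)
        else:
            results[node] = ("skipped", "")
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(run_on_node, node, command, remote_shell, timeout)
            for node in targets
        ]
        for future in futures:
            node, status, output = future.result()
            results[node] = (status, output)
    return results
```

A call such as dsh(["node01", "node02"], "uptime", timeout=5) would return a dictionary mapping each node name to a (status, output) pair, so a wrapper script can report failed, timed-out, and skipped nodes separately.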
With the basic tools above it's possible to accomplish a large number of complex cluster tasks including installation and startup of the HPC stack, cluster-wide user management, configuration and startup of services, and addition of nodes to workload queues. For instance, installation and startup of HPC software can be done with software maintenance and the distributed shell. The configuration and startup of services like NTP and automounter, as well as user management, can be configured mainly through the distribution of files from the management server.
Examples of cluster administration tools that include forms of the above functionality are xCAT, the C3 tools in OSCAR, Scali Manage, and CSM. Some of the tasks above can also be accomplished with enterprise-level software that can be used in clusters. One example is Red Carpet, which provides software maintenance.
Extensible Hardware Control
Many clusters consist of heterogeneous hardware. Even if all the nodes are the same machine type, there are still non-node devices such as switches and terminal servers to consider. This provides a challenging environment for remote hardware control (power on, off, and query) to the various types of hardware, as many models require unique methods of power control.
To support the ever-growing number of power methods, the administration software must support user-defined power methods that can be plugged into the main power commands. A pluggable method allows the software to more easily support new hardware, and allows the user to run the same command to all the nodes, despite their different control methods. It also allows other software components, such as installation, to drive the power control to the various hardware.
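One way to picture a pluggable power method is a small registry that maps a method name to a function, with a single front-end command that looks up the right method for each node. The sketch below is hypothetical throughout: the names power_method, power_nodes, and the "simulated" plug-in are invented for illustration and do not reflect the plug-in interface of xCAT, Scali Manage, or CSM.

```python
from typing import Callable, Dict

# A power method takes (node, action) and returns a status string.
PowerMethod = Callable[[str, str], str]

# Registry of user-defined power methods, keyed by method name.
POWER_METHODS: Dict[str, PowerMethod] = {}


def power_method(name):
    """Decorator that registers a function as a named power method."""
    def register(func):
        POWER_METHODS[name] = func
        return func
    return register


@power_method("simulated")
def simulated_power(node, action):
    # Stand-in plug-in; a real one would talk to a service processor,
    # a terminal server, or a switched power strip.
    return f"{node}: {action} (simulated)"


def power_nodes(node_methods, nodes, action):
    """Apply one action to many nodes, whatever their control method.

    node_methods maps each node name to the name of its power method.
    """
    if action not in ("on", "off", "query"):
        raise ValueError(f"unsupported action: {action}")
    results = {}
    for node in nodes:
        method_name = node_methods.get(node)
        method = POWER_METHODS.get(method_name)
        if method is None:
            results[node] = f"no power method {method_name!r}"
        else:
            results[node] = method(node, action)
    return results
```

Because the front end only depends on the registry, other components such as installation can call power_nodes without knowing which hardware sits behind each node.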
In addition to power control of the cluster hardware, remote console is another area that requires pluggable methods. There is a wide variety of terminal servers on the market, and now Serial over LAN (SOL) support as well, and each of these has its own intricacies for establishing a remote console session to the node.
In addition to writing your own console method for new terminal servers, "in-house" development can allow more flexibility when upgrading cluster hardware: instead of being required to wait for and upgrade to the latest version of the software to support new hardware, you can script your own solution. Examples of simple hardware control methods that cluster administrators can easily develop are power on through Wake On LAN, power off through a distributed shell, and power control via a power switch like APC or Baytech. Cluster products that provide extensible power control include xCAT, Scali Manage, and CSM.
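As an illustration of how small such a user-written method can be, here is a sketch of power on through Wake On LAN. A Wake On LAN "magic packet" consists of six 0xFF bytes followed by the target MAC address repeated 16 times, usually sent as a UDP broadcast. The function names, the default broadcast address, and the choice of UDP port 9 are assumptions made for this example, and the node must have Wake On LAN enabled in its firmware and network adapter for the packet to have any effect.

```python
import socket


def magic_packet(mac: str) -> bytes:
    """Build a Wake On LAN magic packet for a MAC such as 00:11:22:33:44:55."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"not a 6-byte MAC address: {mac}")
    mac_bytes = bytes.fromhex(hex_digits)  # raises ValueError on non-hex input
    # Six 0xFF bytes, then the MAC address repeated 16 times.
    return b"\xff" * 6 + mac_bytes * 16


def wake(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Send the magic packet as a UDP broadcast (address and port are assumptions)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(magic_packet(mac), (broadcast, port))
```

A function like wake could then be plugged into the main power command as the "on" action for nodes that have no service processor.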
Variety of Node Installation Methods
Installing the operating system and applications on nodes is one of the most important functions of cluster administration software, because it can take so long to do manually. Because the method of installation affects the other administration processes, it's important for the software to support a variety of installation methods.
For clusters in which the nodes are not all identical and for which there exists a separate software maintenance procedure, the approach of directly installing the RPMs from the distribution media is generally the most useful. This allows the administrator to initiate an install with just the distribution CDs in hand and to easily specify a different list of RPMs for different nodes. Products that support this installation method include Rocks, Clusterworx, Scali Manage, xCAT, and CSM. They generally use the unattended installation features of Kickstart and AutoYaST to automate the installation of multiple nodes over the network in parallel.
While many users like the simplicity of the direct installation method, an equally large user camp prefers the cloning method. This generally combines the node installation method with the node software maintenance strategy. In this approach a typical node (sometimes called a "golden" node) is installed manually and configured exactly how the administrator wants the rest of the nodes to be. Then the software image is captured from the golden node and replicated to the other nodes.
When updates or configuration changes are necessary, the golden node is updated and the capture/replicate process is done again. This approach is most effective for clusters in which the nodes in the cluster are almost identical, in terms of both hardware and software.
There are a variety of ways the software image can be captured. Some tools, like Clusterworx and the open source tool ghost, take a snapshot of the disk image and replicate that disk image to the nodes. This approach has the advantage of being independent of the operating system that is being captured, but can only work on homogeneous hardware, since the disks all need to be similar. Another approach, used by SystemImager, captures all the files in each of the file systems on the golden node using rsync. This has a couple of advantages:
The software image can be replicated between nodes with different hardware and disks.
When updates are made to the golden node, just the delta changes can be captured and replicated, increasing performance dramatically.
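The idea of capturing only the delta changes can be illustrated with a simple manifest comparison. This sketch is not how rsync or SystemImager work internally; it is a simplified stand-in that hashes every file under a directory and compares two such manifests. The function names are hypothetical, and the sketch reads each file fully into memory, which a real tool would avoid.

```python
import hashlib
import os


def build_manifest(root):
    """Map each file path (relative to root) to a SHA-256 digest of its contents."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            manifest[os.path.relpath(path, root)] = digest
    return manifest


def delta(old, new):
    """Return (changed_or_added, removed) paths between two manifests."""
    changed = sorted(path for path, digest in new.items() if old.get(path) != digest)
    removed = sorted(path for path in old if path not in new)
    return changed, removed
```

Only the paths returned by delta would need to be copied to, or deleted from, the other nodes after the golden node is updated, which is where the performance gain comes from.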
Products that use SystemImager to provide cloning capability include OSCAR, xCAT, and CSM. (Clusterworx also supports file system-based cloning.)
While installing the operating system locally on each node generally works well (disks are cheap, and the OS files load more quickly from a local disk), some users are moving to diskless nodes. The motivation for this is generally not price or even easier maintenance (there are both pros and cons in this area). The motivation is usually reliability in large clusters, because the last moving part in the node is eliminated. (For certain users, security can also be a motivation, since there is no persistent information on the nodes.) There are three main classes of diskless clusters:
The node is network booted via PXE or bootp, the kernel is sent over the network, and the OS image comes from file systems either mounted from an NFS server or in the RAM disk sent with the kernel. Products that support this type include CIT, Clusterworx, and CSM.
A kernel and minimal OS image are loaded over the network when the node boots. The node performs only specialized functions such as allowing application processes to be migrated to it. A head node generally handles the user interface, job queues, and process scheduling. The Scyld software is an example of this type of cluster.
Each node has its own OS image on a SAN. Some don't consider this a diskless cluster, although it has most of the benefits of one (usually at a premium price). Most cluster products that support either direct install or cloning can support this approach, as long as the correct drivers and LUN configuration are added. The Egenera BladeFrame system is based on this architecture.
Most users have definite requirements about the type of node installation they want to use, since it is central to their whole administration strategy. Therefore, it is important for the cluster administration software to support as many of the presented node installation methods as possible. These methods should also be customizable by supporting the use of user-defined post installation scripts and by supporting install/image servers to increase performance.
Extensible Monitoring Capabilities
Similar to hardware control, extensible monitoring of the cluster is a useful tool for the automation of cluster events. While there are many enterprise software packages that provide error detection and response, it's useful to have at least some set of customizable and user-defined monitoring capabilities in a cluster administration product. Common events across the cluster to which the software may need to respond include node down and up events (useful for manipulating workload queues), filesystem space used, processor idle time, network adapter throughput, and syslog entries. The following extensibility points are important in event monitoring:
User-defined sensors to monitor any arbitrary values in the system, and the ability to monitor standard instrumentation such as SNMP and CIM
User-defined response scripts to be run locally or across the cluster in response to occurring events
The ability to forward events to a variety of enterprise monitoring products such as the Tivoli Enterprise Console or CA Unicenter
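The first two extensibility points can be sketched as a loop that evaluates user-defined sensors and calls a user-defined response when a condition holds. The Monitor class, its method names, and the fs_percent_used sensor are hypothetical and only illustrate the shape of such an interface; they are not the API of Ganglia, Scali Manage, Clusterworx, or CSM. Forwarding to an enterprise monitoring product could be modeled as just another response function.

```python
import shutil


def fs_percent_used(path="/"):
    """Example sensor: percentage of file system space used at `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total if usage.total else 0.0


class Monitor:
    """Evaluates user-defined sensors and runs user-defined responses."""

    def __init__(self):
        self.conditions = []

    def add_condition(self, name, sensor, predicate, response):
        # sensor() returns a value, predicate(value) decides whether the
        # event occurred, and response(name, value) reacts to it.
        self.conditions.append((name, sensor, predicate, response))

    def poll(self):
        """Evaluate every condition once; return the names of events that fired."""
        fired = []
        for name, sensor, predicate, response in self.conditions:
            value = sensor()
            if predicate(value):
                response(name, value)
                fired.append(name)
        return fired
```

For example, monitor.add_condition("fs_full", fs_percent_used, lambda v: v > 90, notify) would call a user-supplied notify function whenever the root file system is more than 90 percent full at poll time.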
Figure 1 shows a typical architecture of an extensible monitoring system. Products that have this capability include Ganglia, Scali Manage, Clusterworx, and CSM.
Hierarchical Support
There are several possible reasons for using cluster administration in a hierarchical fashion. The obvious reason is to be able to manage more nodes than supported by the current scaling limit of the administration software. Another reason is to divide up the nodes into smaller sets that can be managed individually, sometimes by different administrators. A third reason is to handle unusual networking configurations, for example, cross-geography clusters.
A typical hierarchical cluster consists of a three-level hierarchy in which there are sets of nodes, with each set being managed by a management server (called the First Line Management Server in Figure 2, or FMS). A top-level management server (Executive Management Server or EMS) manages all of the First Line Management Servers.
Ideally, all management operations could be done from the EMS. At a minimum, it must be possible to do the following from the EMS:
Install the FMSs, replicate any required distribution images, and drive the FMS to install its leaf nodes.
Push out updated software to the FMSs and all the leaf nodes.
Push out configuration files and data to the FMSs and all the leaf nodes.
Run commands on any or all of the FMSs and leaf nodes.
Control the leaf node hardware (power on/off, etc.).
Configure event monitoring and monitor events from all of the FMSs and leaf nodes.
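Several of the operations above come down to the same pattern: the EMS groups the target leaf nodes by the FMS that manages them and delegates the work to each FMS. The sketch below shows that fan-out for running a command. The names group_by_fms and run_from_ems, the fms_of mapping, and the send_to_fms transport callback are all assumptions made for illustration; the article does not describe how xCAT or CSM implement this.

```python
from collections import defaultdict


def group_by_fms(targets, fms_of):
    """Group leaf nodes by the First Line Management Server that manages them."""
    groups = defaultdict(list)
    for node in targets:
        groups[fms_of[node]].append(node)
    return dict(groups)


def run_from_ems(targets, command, fms_of, send_to_fms):
    """Fan a command out from the EMS through each FMS to its leaf nodes.

    send_to_fms(fms, leaves, command) is a hypothetical transport callback
    that asks one FMS to run the command on its own leaves and returns a
    {leaf: output} dictionary.
    """
    results = {}
    for fms, leaves in group_by_fms(targets, fms_of).items():
        results.update(send_to_fms(fms, leaves, command))
    return results
```

The EMS contacts each FMS only once per operation, regardless of how many leaf nodes sit behind it, which is what lets this arrangement scale past the limit of a single management server.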
Hierarchical support is important to allow the cluster administration software to work in more cluster environments. The only products that we know of that support hierarchical clusters as described here are xCAT and CSM. Several products, for example, CIT, support a hierarchy for one specific operation, usually for node installation or diskless boot.
Modular and Customizable
We've already mentioned that customers often have established system management processes in their lab prior to using any of the administration products mentioned in this article. It's not normally well received when the product dictates the processes to be used for all the administration tasks (installation, software maintenance, user management, configuration, monitoring, etc.). To avoid this "barrier to entry," the product must have the following characteristics:
A complete command-line interface: This facilitates administrators writing scripts around the product's administration commands so they can be called from the administrator's own processes. To enable this, all the commands must be script-ready by not prompting for input, and by making the output unambiguously parsable, even in internationalized environments.
Modular: The administrator must be able to use only parts of the product, while ignoring other parts for which he or she already has a solution. For example, the product should not require that the administrator use its userID management solution to be able to access any of the other functions.
Extensible: As mentioned in previous sections, the product needs to be extensible to support different distributions easily (CSM has this), support new hardware for hardware control (xCAT and CSM), monitor user-defined resources (CSM and Ganglia), and support setting up user applications (Rocks, OSCAR).
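To make the "unambiguously parsable" requirement concrete, here is one possible approach, offered as an assumption rather than as what any product above does: emit one record per line with a fixed delimiter and stable, untranslated attribute names, and quote any value that contains the delimiter. The sketch uses Python's csv module for the quoting; the function names and the node:attribute:value layout are hypothetical.

```python
import csv
import io


def format_records(records, delimiter=":"):
    """Write (node, attribute, value) records one per line, quoting as needed."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator="\n")
    for node, attribute, value in records:
        writer.writerow([node, attribute, value])
    return buf.getvalue()


def parse_records(text, delimiter=":"):
    """Parse the output of format_records back into a list of tuples."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [tuple(row) for row in reader]
```

A value that itself contains a colon, such as a time of day, is quoted on output and restored on input, so a calling script never has to guess where one field ends and the next begins.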
Frequent Updates and User Contributions
As we all know, Linux software and its associated hardware do not stand still. The many components of a typical Linux cluster continue to evolve with new versions, usually several times a year, with all the components on different release schedules. And new technology continually appears. As a result, the administration software needs to continually adapt to its changing environment. This requires the ability to put out frequent updates to the product. Open source solutions (e.g., Rocks, CIT, OSCAR) generally have an easier time of this due to their iterative development style and because less testing is done by the development team (and more by the user community). But even vendor products need to find ways to release updates often.
CSM uses a combination of traditional product releases and early updates on the IBM alphaWorks site. User contributions can also help tremendously in keeping up with all the changing components. This is business as usual for open source solutions, but can be difficult for vendor products due to legal restrictions. This issue must be resolved in order for vendor products to be able to keep up with the changing environment.
Summary
In Linux clusters, there are so many open source administration utilities and so many home-grown solutions that there is very little need for a one-size-fits-all cluster administration product. The administration software must be extremely flexible to accommodate a variety of environments and to complement, but not conflict with, the utilities already being used.
About Jennifer Cranfill
Jennifer Cranfill is a Staff Software Engineer at IBM in Poughkeepsie, NY. She has spent the past 5 years working with HPC clusters and has focused on parallel file systems and cluster management tools.
About Bruce Potter
Bruce is a Senior Technical Staff Member at IBM. He has been working in the area of systems management of clusters since 1989, including AIX clusters based on the IBM SP2, Windows clusters with IBM Director, and Linux & AIX clusters with CSM and open source.