High Performance Computing (HPC) Resource Management (RM) Systems

Explanation:

This semester I am taking a special topics research course at the University of Arkansas called "HPC Resource Management Systems". There are several promising areas of research that I can pursue in this course, and since it is largely self-guided, I can choose the most interesting areas to examine.

Background: My Master's Thesis research was on virtualization in HPC systems. For my Thesis, I designed and developed a system (with a great deal of help) called Dynamic Virtual Clustering (DVC) that can deploy virtual machines (VMs) in a cluster and run jobs inside those VMs without user intervention. DVC has been integrated into a beta version of the Moab cluster scheduler, and is currently in use at Arizona State University. See the main DVC page for more information on what DVC does, and how it will be useful.

During my Thesis research, it became obvious that we needed a VM resource management system to simplify all the cool things we wanted to do. If we want to migrate VMs from one machine to another, we need to schedule the operation and ensure that the migration is valid (i.e. that the target machine has enough memory, disk space, CPUs, capabilities, etc.). We also need to support things like VM creation and customization: we should be able to take a VM image and change its IP, hostname, DNS servers, etc. on the fly upon boot. We need to keep track of the physical disk space, memory, and CPUs available on local machines so that we can correctly report whether a VM is supported. For parallel jobs (MPI applications, for example), we may have to create several VMs on different physical machines before we can execute the application. If we want to do any VM preservation/checkpointing, we must coordinate the preservation of each VM. We also need a VM RM system to keep track of the capabilities of individual computers: some machines may not support HVM domains, while others may be able to run VMware or QEmu virtual machines.
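As a rough illustration of the migration validity check described above, here is a minimal Python sketch. The Host and VM classes, their field names, and the migration_problems function are all hypothetical and invented for this example; they are not part of DVC, Moab, or any existing RM.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical record of what a physical machine has free."""
    name: str
    free_mem_mb: int
    free_disk_gb: int
    free_cpus: int
    capabilities: set = field(default_factory=set)


@dataclass
class VM:
    """Hypothetical record of what a VM needs from its host."""
    name: str
    mem_mb: int
    disk_gb: int
    cpus: int
    required_caps: set = field(default_factory=set)


def migration_problems(vm, target):
    """Return the reasons a migration is invalid (empty list means valid)."""
    problems = []
    if vm.mem_mb > target.free_mem_mb:
        problems.append("memory")
    if vm.disk_gb > target.free_disk_gb:
        problems.append("disk")
    if vm.cpus > target.free_cpus:
        problems.append("cpus")
    missing = vm.required_caps - target.capabilities
    if missing:
        problems.append("capabilities: " + ", ".join(sorted(missing)))
    return problems
```

Returning the list of reasons, rather than a plain yes/no, would let the RM tell the scheduler why a target was rejected.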

The current DVC implementation only supports a few of the things listed above, and all of that functionality is implemented in the job scheduler, which is exactly the wrong place for it. It belongs inside a resource management system.

This semester, I plan to look at some of the issues surrounding the use of virtual machines in cluster computing. This may mean designing a system to efficiently deploy VMs in a cluster and exploit the advantages of VMs. It could also be an opportunity to add some functionality to an existing resource management system (like PBS or SLURM) to efficiently support VM use.

HPC Scheduling Overview

Modern high performance computing clusters use batch schedulers to execute user jobs. Here is the typical life cycle of a batch scheduled job:

  1. The user creates a job script that specifies a number of processors, memory, disk space, etc. (Many users only specify processors, and defaults are used for everything else.)
  2. The user submits the job to a scheduler, and then waits.
  3. Once the scheduler has the job, it will attempt to find the best possible time and location to execute the job.
  4. After the scheduler allocates resources to the job, the resource manager is told to execute the job on the assigned resources.
  5. The resource manager is then responsible for job execution. It must monitor the job's execution to ensure that it does not use more resources than assigned, must report the job's status to the scheduler, and must start the job, wait for it to complete (or run out of execution time), and then clean up the job.
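The last step of the life cycle (start the job, wait for it to finish or run out of execution time, clean up, report status) can be sketched in a few lines of Python. This is only an illustration using the standard subprocess module: the run_job function and the status strings are made up for this example, and a real RM would also enforce memory and CPU limits, which this sketch does not.

```python
import subprocess
import sys


def run_job(command, walltime_s):
    """Start a job, wait for completion or walltime expiry, report a status."""
    proc = subprocess.Popen(command)
    try:
        exit_code = proc.wait(timeout=walltime_s)
        status = "completed" if exit_code == 0 else "failed"
    except subprocess.TimeoutExpired:
        proc.kill()  # walltime exceeded: terminate the job
        proc.wait()  # reap the process so no zombie is left behind
        exit_code = proc.returncode
        status = "walltime_exceeded"
    # This dictionary stands in for the report sent back to the scheduler.
    return {"status": status, "exit_code": exit_code}


if __name__ == "__main__":
    job = [sys.executable, "-c", "print('hello from the job')"]
    print(run_job(job, walltime_s=5))
```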

Resource Management Systems

Resource Managers are responsible for controlling access to various resources (obviously). CPUs, memory, and disk space are the resources that most commonly come to mind, but considering only these severely understates what an RM can actually manage.

Resources

"Resource" is an extremely generic term. In HPC, we generally restrict the definition of a resource to things whose access and use we can deny or allow. For example, a memory bus qualifies as a "resource" in the most general definition; however, since we can't restrict access to the memory bus, it isn't considered a manageable resource.

Even if we limit the definition of a resource to those things that we can manage access to, the set of all manageable resources is still extremely large. Network bandwidth and latency are two resources that can be managed. Another resource could be licenses from a license server. Resources can be disk bandwidth, or node features (special network cards, public internet access, installed software, etc.). A resource can also be VM support.

Xen, VMware, QEmu, Parallels, Virtual PC, and Bochs are just a few of the virtual machine monitors (VMMs) on the market today. My work up to this point has focused on the Xen hypervisor as a means for using VMs in clusters. However, Xen supports two types of virtual machines: paravirtualized (PV) and hardware-assisted (HVM). PV machines don't require any special hardware support, while HVM machines do. The CPU's ability to support HVM virtual machines can be thought of as a node feature/resource that can be managed.

Once we have the resources that can be managed, we need a scheduler that can effectively allocate resources (CPU, disk, memory, network bandwidth, etc.) to a task. Task scheduling is a much more difficult problem than resource management, but a good resource management system greatly reduces the amount of work a scheduler has to do in order to make effective decisions.
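To make the idea of HVM support as a node feature concrete, here is a small Python sketch of feature-based node selection. The node names, the feature strings ("xen-pv", "xen-hvm", "public-internet"), and the nodes_with_features function are assumptions made up for this example and do not come from any particular RM.

```python
# Hypothetical feature table: node name -> set of features the node offers.
NODES = {
    "node01": {"xen-pv"},
    "node02": {"xen-pv", "xen-hvm"},
    "node03": {"xen-pv", "xen-hvm", "public-internet"},
}


def nodes_with_features(nodes, required):
    """Return the sorted names of nodes offering every required feature."""
    required = set(required)
    return sorted(
        name for name, features in nodes.items() if required <= features
    )
```

A scheduler asking for an HVM guest would call nodes_with_features(NODES, {"xen-hvm"}) and consider only the nodes returned.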

Batch Scheduler Systems

In the simplest case, job scheduling onto computational resources is an example of the bin-packing problem, a well-known NP-hard problem (its decision version is NP-complete). The scheduler must take a number of tasks (a workload) and determine the best way to "pack" the workload into the available resources. If we assume that each task runs for a finite amount of time, a good scheduler will attempt to minimize the amount of time required to complete the workload. However, most cluster schedulers must deal with many more factors in scheduling (priority, reservations, back-filling, fairshare, etc.) that greatly complicate the issue of resource allocation.
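Because exact bin packing is intractable for large workloads, heuristics are normally used. The sketch below shows first-fit decreasing, a standard bin-packing heuristic, in Python. It is a deliberate simplification: it packs along a single dimension (say, CPUs per node), assumes identical nodes, and ignores run time, priority, reservations, and everything else a real scheduler handles. The function name and arguments are made up for this example.

```python
def first_fit_decreasing(job_sizes, node_capacity):
    """Pack jobs onto identical nodes using first-fit decreasing.

    Returns a list of nodes, each a list of the job sizes placed on it.
    """
    bins = []  # jobs placed on each node
    free = []  # remaining capacity of each node
    for size in sorted(job_sizes, reverse=True):
        if size > node_capacity:
            raise ValueError(f"job of size {size} does not fit on any node")
        for i, room in enumerate(free):
            if size <= room:
                bins[i].append(size)
                free[i] -= size
                break
        else:
            # No existing node has room, so open a new one.
            bins.append([size])
            free.append(node_capacity - size)
    return bins
```

For example, first_fit_decreasing([4, 8, 1, 4, 2, 1], 10) places the six jobs on two nodes: [[8, 2], [4, 4, 1, 1]].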

Possible Avenues of Research

RM Examination

We examine several common RMs and look at scalability, ... We also look at what others have done in the area, look at common trends and desired capabilities, and provide a meta-analysis of RMs in use.

VM RM Design

Geoffroy Vallee has implemented some VM resource management functionality into a product called OSCARV, but has focused on VM installation and deployment instead of the functionality we wish to provide.

We specify how an RM that manages VMs in a cluster environment should interact with the scheduler and other RMs. We also specify what VMs should look like and how they should be managed.

VM Support in RMs

We add basic VM support to an existing RM so that we can specify children, state, features, etc. Since VMs must use the resources the parent owns, reporting a node and its VM children as independent nodes is quite stupid. Instead, we need to have a way to say that a VM is the child of a particular node. When the node or child is queried, only the parent will respond. This will also be extremely useful for VM migration, preservation, and checkpointing. If we migrate a VM from one node to another, we only have to flip some bits in the RM to denote what has happened. If we preserve a VM, the RM will know that the VM resides on the node, but will have free resources to execute new (possibly higher priority) jobs.
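The parent/child bookkeeping described above might look something like the following Python sketch. The ResourceRegistry class, its method names, and the "running"/"preserved" states are hypothetical and chosen for this example; memory is the only resource tracked, to keep it short.

```python
class ResourceRegistry:
    """Toy RM bookkeeping in which VMs are children of physical nodes."""

    def __init__(self):
        self.node_mem = {}   # node name -> total memory in MB
        self.vm_parent = {}  # VM name -> parent node name
        self.vm_mem = {}     # VM name -> memory in MB
        self.vm_state = {}   # VM name -> "running" or "preserved"

    def add_node(self, node, mem_mb):
        self.node_mem[node] = mem_mb

    def add_vm(self, vm, node, mem_mb):
        self.vm_parent[vm] = node
        self.vm_mem[vm] = mem_mb
        self.vm_state[vm] = "running"

    def resolve(self, name):
        """A query for a VM is answered by its parent node."""
        return self.vm_parent.get(name, name)

    def migrate(self, vm, target):
        """Migration is only a bookkeeping change in the RM."""
        self.vm_parent[vm] = target

    def preserve(self, vm):
        """A preserved VM stays on its node but releases its resources."""
        self.vm_state[vm] = "preserved"

    def free_mem(self, node):
        used = sum(
            self.vm_mem[vm]
            for vm, parent in self.vm_parent.items()
            if parent == node and self.vm_state[vm] == "running"
        )
        return self.node_mem[node] - used
```

Because a VM's resources are always charged to its parent, the scheduler never sees a node and its VMs as independent machines.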

Parallel Checkpointing VM RM integration

Design and integrate a system that can transparently checkpoint a parallel app running in a VM. We may have to modify an RM to handle the parallel checkpointing (which must be very well coordinated). We could also make some modifications to Xen to give a VM the notion of a "group". If one member of the group is paused/saved/restored, all other VMs in the group should do the same. (Note: This would be a whole lot of work!) See the "Published Papers" link above for some basic info on parallel checkpointing with VMs.