Batch systems support

Overview

The A-REX has to interfaced to the LRMS in order to be able to submit jobs and query their information. The A-REX supports several Local Resource Management Systems (LRMS), with which it interacts by several backend scripts.

The A-REX assumes that the LRMS has one or more queues, which is a couple of (usually homogeneous) worker nodes grouped together. The different LRMSes have different concepts of queues (or have no queues at all).

Nevertheless, in the A-REX configuration, the machines of the LRMS should be mapped to A-REX queues. The client side job submission tools query the information system for possible places to submit the jobs, where each queue on a CE is represented as an execution target, and treated separately.

Configring A-REX to use one of these LRMS backends typically involves the following steps:

Sharing directories between A-REX, the LRMS frontend and its working nodes. It might involve setup of shared filesystems such as NFS or similar.

Configuring [lrms] block and [queue] blocks in arc.conf in respect to LRMS setup.

Configuring the A-REX in respect to the shared scratch directories configuration.

Fig. 3 The LRMS frontend and the nodes sharing the session directory and the local users

General LRMS configuration

In the [lrms] block he name of the LRMS has to be specified with the lrms option.

The supported LRMS are:

fork - fork jobs on the ARC CE host node, not a cluster. Targeted for testing and development but not for real production workloads.

condor - uses HTCondor-powered HTC resource

slurm - for SLURM clusters

pbs - any flavor of PBS batch system, including Torque and PBSPro

ll - Load Leveler batch system

lsf - Load Sharing Facility batch system

sge - Oragle Grid Engine (formely Sun Grid Engine)

boinc - works as a gateway to BOINC volunteer computing rfesources

slurmpy - new experimental SLURM backend written in Python (requires nordugrid-arc-python-lrms package instsalled).

Each LRMS has its own specific configuration options that are prefixed with LRMS name in [lrms] block.

Besided this specific options, the behaviour of LRMS backend is affected by storage areas and limits setup, in particular:

tmpdir - defines the path to directory for temporary files on the worker nodes

shared_filesystem, scratchdir and shared_scratch - changes the way how A-REX will store the jobs’ data during the processing. More details can be found in Job scratch area document.

defaultmemory and req-queue defaultmemory - set the memory limit values for jobs that has no explcit requirements in the job description.

Accounting considerations

A-REX has several approaches to collect accounting information:

using cgroups measurements of memory and CPU on the WNs (starting from 6.2 release)

using measurement from the GNU Time utility that wraps the job executable invokation inside the job script

using data provided by LRMS

Depending on LRMS type in use there are different kind of information availble and/or missing in the LRMS accounting subsystem.

It is recommended to use cgroups or GNU Time methods to have reliable resources measurements in all cases. You can find more details in the Measuring accounting metrics of the job document.

Fork Backend

The Fork back-end is a simple back-end that interfaces to the local machine, i.e.: there is no batch system underneath. It simply forks the job, hence the name. The back-end then uses standard posix commands (e.g. ps or kill) to manage the job.

For is the deafult backend used in ARC pre-shipped zero configuration.

Recommended batch system configuration

Since fork is a simple back-end and does not use any batch system, there is no specific configuration needed for the underlying system.

It is still reqiuires queue definition and the queue should be named fork.

Example:

[lrms]
lrms = fork

[queue:fork]

Known limitations

Since Fork is not a batch system, many of the queue specific attributes or detailed job information is not available. The support for the Fork batch system was introduced so that quick deployments and testing of the middleware can be possible without dealing with deployment of a real batch system since fork is available on every UNIX box.

The Fork back-end is not recommended to be used in production. The back-end by its nature, has lots of limitations, for example it does not support parallel jobs.

Portable Batch System (PBS)

The Portable Batch System (PBS) is one of the most popular batch systems for small clusters. PBS comes in many flavours such as OpenPBS (unsupported), Terascale Open-Source Resource and QUEue Manager (TORQUE) and PBSPro (currently owned by Altair Engineering). ARC supports all the flavours and versions of PBS.

Recommended batch system configuration

PBS is a very powerful LRMS with dozens of configurable options. Server, queue and node attributes can be used to configure the cluster’s behaviour. In order to correctly interface PBS to ARC (mainly the information provider scripts) there are a couple of configuration REQUIREMENTS asked to be implemented by the local system administrator:

The computing nodes MUST be declared as cluster nodes (job-exclusive), at the moment time-shared nodes are not supported by the ARC setup. If you intend to run more than one job on a single processor then you can use the virtual processor feature of PBS.
For each queue, one of the max_user_run or max_running attributes MUST be set and its value SHOULD BE IN AGREEMENT with the number of available resources (i.e. don’t set the max_running = 10 if there are only six (virtual) processors in the system). If both max_running and max_user_run are set then obviously max_user_run has to be less or equal to max_running.
For the time being, do NOT set server limits like max_running, please use queue-based limits instead.
Avoid using the max_load and the ideal_load directives. The Node Manager (MOM) configuration file (<PBS home on the node>/mom_priv/config) should not contain any max_load or ideal_load directives. PBS closes down a node (no jobs are allocated to it) when the load on the node reaches the max_load value. The max_load value is meant for controlling time-shared nodes. In case of job-exclusive nodes there is no need for setting these directives, moreover incorrectly set values can close down a node.
Routing queues are now supported in a simple setup were a routing queue has a single queue behind it. This leverages MAUI work in most cases. Other setups (i.e. two or more execution queues behind a routing queue) cannot be used within ARC correctly.
PBS server logs SHOULD BE shared with ARC CE to allow backend scripts to check the job status and collect information needed for accounting. The path to logs on the ARC CE is defined with pbs_log_path option.

Additional useful configuration hints:

If possible, please use queue-based attributes instead of server level ones.
The acl_user_enable = True attribute may be used with the acl_users = user1,user2 attribute to enable user access control for the queue.
It is advisory to set the max_queuable attribute in order to avoid a painfully long dead queue.
Node properties from the <PBS home on the server>/server_priv/nodes file together with the resources_default.neednodes can be used to assign a queue to a certain type of node.

Checking the PBS configuration:

The node definition can be checked by pbsnodes -a. All the nodes MUST have ntype=cluster.
The required queue attributes can be checked as qstat -f -Q queuename. There MUST be a max_user_run or a max_running attribute listed with a REASONABLE value.

Example:

[lrms]
lrms = pbs
defaultmemory = 512
pbs_log_path = /net/bs/var/log/torque/server_logs

[queue:grid_rt]
comment = Realtime queue for infrastructure testing
allowaccess = ops
advertisedvo = ops

[queue:alien]
comment = Dedicated queue for ALICE
allowaccess = alice
advertisedvo = alice
defaultmemory = 3500

Known limitations

Some of the limitations are already mentioned under the PBS deployment requirements. No support for routing queues, difficulty of treating overlapping queues, the complexity of node string specifications for parallel jobs are the main shortcomings.

SLURM

SLURM is an open-source (GPL) resource manager designed for Linux clusters of all sizes. It is designed to operate in a heterogeneous cluster with up to 65,536 nodes. SLURM is actively being developed, distributed and supported by Lawrence Livermore National Laboratory, Hewlett-Packard, Bull, Cluster Resources and SiCortex.

Recommended batch system configuration

The backend should work with a normal installation using only SLURM or SLURM+MOAB/MAUI. Do not keep nodes with different amount of memory in the same queue.

For production use-cases it is recommended to enable slurm_use_sacct option.

Example:

[lrms]
lrms=slurm
slurm_use_sacct=yes
defaultmemory=4096

[queue:normal]
comment=Queue for grid jobs
architecture=x86_64
totalcpus=1500

Using Python LRMS backend implementation

Experimental python LRMS backend can be used for SLURM after nordugrid-arc-python-lrms package installation. Python backed is distinguished by slurmpy name that should be specified in lrms option.

This backend respects the same options set, as a classical SLURM backend script, but additionally allows the connection over SSH when [lrms/ssh] block is enabled and configured.

Known limitations

If you have nodes with different amount of memory in the same queue, this will lead to miscalculations. If SLURM is stopped, jobs on the resource will get canceled, not stalled. The SLURM backend is only tested with SLURM 1.3, it should however work with 1.2 as well.

HTCondor

The HTCondor system, developed at the University of Wisconsin-Madison, was initially used to harness free cpu cycles of workstations. Over time it has evolved into a complex system with many grid-oriented features. Condor is available on a large variety of platforms.

Recommended batch system configuration

Install HTCondor on the A-REX node and configure it as a submit machine. Next, add the following to the node’s Condor configuration (or define CONDOR_IDS as an environment variable):

CONDOR_IDS = 0.0

CONDOR_IDS has to be 0.0, so that Condor will be run as root and can then access the Grid job’s session directories (needed to extract various information from the job log).

Make sure that no normal users are allowed to submit Condor jobs from this node. If normal user logins are not allowed on the A-REX machine, then nothing needs to be done. If for some reason users are allowed to log into the A-REX machine, simply don’t allow them to execute the condor_submit program. This can be done by putting all local Unix users allocated to the grid in a single group, e.g. griduser, and then setting the file ownership and permissions on condor_submit like this:

[root ~]# chgrp griduser $condor_bin_path/condor_submit
[root ~]# chmod 750 $condor_bin_path/condor_submit

Example:

[lrms]
lrms = condor
defaultmemory = 2000

[queue:EL7]
comment = EL7 queue
defaultmemory = 3000
nodememory = 16384
condor_requirements = (Opsys == "linux") && (OpSysMajorVer == 66)

Known limitations

Only Vanilla universe is supported. MPI universe (for multi-CPU jobs) is not supported. Neither is Java universe (for running Java executables). ARC can only send jobs to Linux machines in the Condor pool, therefore excluding other unixes and Windows destinations.

LoadLeveler

LoadLeveler(LL), or Tivoli Workload Scheduler LoadLeveler in full, is a parallel job scheduling system developed by IBM.

Recommended batch system configuration

The back-end should work fine with a standard installation of LoadLeveler. For the back-end to report the correct memory usage and cputime spent, while running. LoadLeveler has to be set-up to show this data in the llq command. Normally this is turned off for performance reasons. It is up to the cluster administrator to decide whether or not to publish this information. The back-end will work whether or not this is turned on.

Known limitations

There is at the moment no support for parallel jobs on the LoadLeveler back-end.

LSF

Load Sharing Facility (or simply LSF) is a commercial computer software job scheduler sold by Platform Computing. It can be used to execute batch jobs on networked Unix and Windows systems on many different architectures.

Recommended batch system configuration

Set up one or more LSF queues dedicated for access by grid users. All nodes in these queues should have a resource type which corresponds to the one of the the frontend and which is reported to the outside. The resource type needs to be set properly in the lsb.queues configuration file.

Be aware that LSF distinguishes between 32 and 64 bit for Linux. For a homogeneous cluster, the type==any option is a convenient alternative.

In lsb.queues set one of the following:

RES_REQ = type==X86_64
RES_REQ = type==any

See the -R option of the bsub command man page for more explanation.

The lsf_profile_path option must be set to the filename of the LSF profile that the back-end should use.

Furthermore it is very important to specify the correct architecture for a given queue in arc.conf. Because the architecture flag is rarely set in the xRSL file the LSF back-end will automatically set the architecture to match the chosen queue.

LSF’s standard behaviour is to assume the same architecture as the frontend. This will fail for instance if the frontend is a 32 bit machine and all the cluster resources are 64 bit. If this is not done the result will be jobs being rejected by LSF because LSF believes there are no useful resources available.

Known limitations

Parallel jobs have not been tested on the LSF back-end.

The back-end does not at present support reporting different number of free CPUs per user.

SGE

Sun Grid Engine (SGE, Oracle Grid Engine, Codine) is an open source batch system maintained by Sun (Oracle). It is supported on Linux, and Solaris in addition to numerous other systems.

Recommended batch system configuration

Set up one or more SGE queues for access by grid users. Queues can be shared by normal and grid users. In case it is desired to set up more than one ARC queue, make sure that the corresponding SGE queues have no shared nodes among them. Otherwise the counts of free and occupied CPUs might be wrong. Only SGE versions 6 and above are supported. You must also make sure that the ARC CE can run qacct, as this is used to supply accounting information.

Example:

[lrms]
lrms = sge
sge_root = /opt/n1ge6
sge_bin_path = /opt/n1ge6/bin/lx24-x86

[queue: long]
sge_jobopts= -P atlas -r yes

Known limitations

Multi-CPU support is not well tested. All users are shown with the same quotas in the information system, even if they are mapped to different local users. The requirement that one ARC queue maps to one SGE queue is too restrictive, as the SGE’s notion of a queue differs widely from ARC’s definition. The flexibility available in SGE for defining policies is difficult to accurately translate into NorduGrid’s information schema. The closest equivalent of nordugrid-queue-maxqueuable is a per-cluster limit in SGE, and the value of nordugrid-queue-localqueued is not well defined if pending jobs can have multiple destination queues.

BOINC

BOINC is an open-source software platform for computing using volunteered resources. Support for BOINC in ARC is currently at the development level and to use it may require editing of the source code files to fit with each specific project.

Recommended batch system configuration

The BOINC database can be local to the ARC CE or remote. Read-access is required from the ARC CE to check for finished jobs and gather information on available resources. The ARC CE must be able to run commands in the project’s bin/ directory.

Project-specific variables can be set up in an RTE which must be used for each job. The following example shows the variables which must be defined to allow job submission to BOINC for the project “example” to work:

export PROJECT_ROOT="/home/boinc/project/example" # project directory
export BOINC_APP="example"                        # app name
export WU_TEMPLATE="templates/example_IN"         # input file template
export RESULT_TEMPLATE="templates/example_OUT"    # output file template
export RTE_LOCATION="$PROJECT_ROOT/Input/RTE.tar.gz" # RTEs, see below

The last variable is a tarball of runtime environments required by the job.

Known limitations

The BOINC back-end was designed around projects that use virtualisation. The prototype implementation in the current ARC version may not be generic enough to suit all BOINC projects.

When preparing a BOINC job, the ARC CE copies a tarball of the session directory to the BOINC project download area. Once the job is completed and the output uploaded to the BOINC peoject upload area, a modified assimilator daemon must be used to copy the result back to the session directory so that it can be retrieved by ARC clients or uploaded to Grid storage by the ARC CE.