Batch systems support
Overview
The A-REX has to be interfaced to the LRMS in order to submit jobs and query their status. The A-REX supports several Local Resource Management Systems (LRMS), with which it interacts through a set of backend scripts.
The A-REX assumes that the LRMS has one or more queues, where a queue is a group of (usually homogeneous) worker nodes. Different LRMSes have different concepts of queues (or no queues at all).
Nevertheless, in the A-REX configuration, the machines of the LRMS should be mapped to A-REX queues. The client-side job submission tools query the information system for possible places to submit the jobs, where each queue on a CE is represented as an execution target and treated separately.
Configuring A-REX to use one of these LRMS backends typically involves the following steps:
Sharing directories between the A-REX host, the LRMS frontend and its worker nodes. This may require setting up a shared filesystem such as NFS.
Configuring the [lrms] block and [queue] blocks in arc.conf according to the LRMS setup.
Configuring the A-REX with respect to the shared scratch directories.
Fig. 2 The LRMS frontend and the nodes sharing the session directory and the local users
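The exact filesystem layout is site-specific, but as an illustration, sharing the session directory over NFS could look like the following minimal sketch (the session directory path, network range and hostname are hypothetical):
# On the frontend: export the session directory to the worker nodes (/etc/exports)
/var/spool/arc/sessiondir 10.0.0.0/24(rw,no_root_squash,sync)
# Reload the export table on the frontend
exportfs -ra
# On each worker node: mount it under the same path (/etc/fstab)
frontend.example.org:/var/spool/arc/sessiondir  /var/spool/arc/sessiondir  nfs  defaults  0 0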
General LRMS configuration
In the [lrms] block the name of the LRMS has to be specified with the lrms option.
The supported LRMS are:
fork - fork jobs on the ARC CE host node, not a cluster. Targeted for testing and development but not for real production workloads.
condor - uses HTCondor-powered HTC resource
slurm - for SLURM clusters
The community-supported LRMSes are:
pbs - any flavor of PBS batch system, including Torque and PBSPro
ll - Load Leveler batch system
lsf - Load Sharing Facility batch system
sge - Oracle Grid Engine (formerly Sun Grid Engine)
boinc - works as a gateway to BOINC volunteer computing resources
Each LRMS has its own specific configuration options that are prefixed with the LRMS name in the [lrms] block.
Besides these specific options, the behaviour of the LRMS backend is affected by the storage areas and limits setup, in particular:
tmpdir - defines the path to the directory for temporary files on the worker nodes
shared_filesystem, scratchdir and shared_scratch - changes the way A-REX will store the jobs’ data during the processing. More details can be found in Job scratch area document.
defaultmemory and the per-queue defaultmemory - set the memory limit values for jobs that have no explicit requirements in the job description.
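As an illustration, these options could be combined in arc.conf like this (a sketch only; the paths and values are hypothetical and should be adapted to the local storage layout):
[lrms]
lrms = slurm
defaultmemory = 2048
tmpdir = /tmp
shared_filesystem = no
scratchdir = /local/scratch
shared_scratch = /net/scratch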
Accounting considerations
A-REX has several approaches to collect accounting information:
using cgroups measurements of memory and CPU on the WNs
using measurements from the GNU Time utility that wraps the job executable invocation inside the job script
using data provided by LRMS
Depending on the LRMS type in use, different kinds of information are available or missing in the LRMS accounting subsystem.
It is recommended to use the cgroups or GNU Time methods to have reliable resource measurements in all cases. You can find more details in the Measuring accounting metrics of the job document.
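For example, if the arc.conf of your ARC release provides the gnu_time option in the [lrms] block (check the arc.conf reference of your release), the GNU Time based measurement can be pointed at the time binary available on the worker nodes; a hedged sketch:
[lrms]
lrms = slurm
gnu_time = /usr/bin/time    # assumed option name and path; verify against the arc.conf reference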
Fork Backend
The Fork back-end is a simple back-end that interfaces to the local machine, i.e. there is no batch system underneath. It simply forks the job, hence the name. The back-end then uses standard POSIX commands (e.g. ps or kill) to manage the job.
Fork is the default backend used in the ARC pre-shipped zero configuration.
Recommended batch system configuration
Since fork is a simple back-end and does not use any batch system, there is no specific configuration needed for the underlying system.
A queue definition is still required, and the queue must be named fork.
Example:
[lrms]
lrms = fork
[queue:fork]
Known limitations
Since Fork is not a batch system, many of the queue-specific attributes and detailed job information are not available. Support for the Fork backend was introduced to allow quick deployments and testing of the middleware without setting up a real batch system, since fork is available on every UNIX box.
The Fork back-end must not be used in production. By its nature the back-end has many limitations; for example, it does not support parallel jobs.
SLURM
SLURM is an open source resource manager designed for heterogeneous Linux clusters of all sizes. A large number of HPC centres use SLURM.
Recommended batch system configuration
The backend should work with a normal installation of SLURM. Nodes with different specifications such as amount of memory or cores must be organized in different queues.
For production use-cases it is recommended to enable slurm_use_sacct option.
Example:
[lrms]
lrms=slurm
slurm_use_sacct=yes
defaultmemory=4096
[queue:normal]
comment=Queue for grid jobs
architecture=x86_64
totalcpus=1500
[queue:himem]
comment=Queue for highmem grid jobs
architecture=x86_64
nodememory=2048
HTCondor
The HTCondor system is another widely used and popular open source resource manager.
Recommended batch system configuration
Install HTCondor on the A-REX node and configure it as a submit machine.
Next, add the following to the node's Condor configuration (or define CONDOR_IDS as an environment variable):
CONDOR_IDS = 0.0
CONDOR_IDS has to be 0.0, so that Condor will be run as root and can then access the Grid job's session directories (needed to extract various information from the job log).
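On a typical package-based installation this can be placed in a local HTCondor configuration file and activated with condor_reconfig; a minimal sketch (the file path is an assumption, any file HTCondor reads works):
# /etc/condor/condor_config.local (assumed local configuration file)
CONDOR_IDS = 0.0
# Make the running HTCondor daemons re-read the configuration
condor_reconfig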
Make sure that no normal users are allowed to submit Condor jobs from this node. If normal user logins are not allowed on the A-REX machine, then nothing needs to be done. If for some reason users are allowed to log into the A-REX machine, simply don't allow them to execute the condor_submit program. This can be done by putting all local Unix users allocated to the grid in a single group, e.g. griduser, and then setting the file ownership and permissions on condor_submit like this:
[root ~]# chgrp griduser $condor_bin_path/condor_submit
[root ~]# chmod 750 $condor_bin_path/condor_submit
Example:
[lrms]
lrms = condor
defaultmemory = 2000
[queue:EL7]
comment = EL7 queue
defaultmemory = 3000
nodememory = 16384
condor_requirements = (Opsys == "linux") && (OpSysMajorVer == 7)
Known limitations
Only the Vanilla universe is supported. The MPI universe (for multi-CPU jobs) is not supported, and neither is the Java universe (for running Java executables). ARC can only send jobs to Linux machines in the Condor pool, therefore excluding other Unix flavours and Windows destinations.
Portable Batch System (PBS)
The Portable Batch System (PBS) is one of the most popular batch systems for small clusters. PBS comes in many flavours such as OpenPBS (unsupported), Terascale Open-Source Resource and QUEue Manager (TORQUE) and PBSPro (currently owned by Altair Engineering). ARC supports all the flavours and versions of PBS.
Recommended batch system configuration
PBS is a very powerful LRMS with dozens of configurable options. Server, queue and node attributes can be used to configure the cluster's behaviour. In order to correctly interface PBS to ARC (mainly the information provider scripts), there are a couple of configuration REQUIREMENTS that the local system administrator is asked to implement (an illustrative qmgr sketch follows the list):
The computing nodes MUST be declared as cluster nodes (job-exclusive); at the moment time-shared nodes are not supported by the ARC setup. If you intend to run more than one job on a single processor, you can use the virtual processor feature of PBS.
For each queue, one of the max_user_run or max_running attributes MUST be set and its value SHOULD BE IN AGREEMENT with the number of available resources (i.e. don't set max_running = 10 if there are only six (virtual) processors in the system). If both max_running and max_user_run are set, then obviously max_user_run has to be less than or equal to max_running.
For the time being, do NOT set server limits like max_running; please use queue-based limits instead.
Avoid using the max_load and the ideal_load directives. The Node Manager (MOM) configuration file (<PBS home on the node>/mom_priv/config) should not contain any max_load or ideal_load directives. PBS closes down a node (no jobs are allocated to it) when the load on the node reaches the max_load value. The max_load value is meant for controlling time-shared nodes. In case of job-exclusive nodes there is no need for setting these directives; moreover, incorrectly set values can close down a node.
Routing queues are now supported in a simple setup where a routing queue has a single execution queue behind it. This leverages MAUI work in most cases. Other setups (i.e. two or more execution queues behind a routing queue) cannot be used within ARC correctly.
PBS server logs SHOULD BE shared with the ARC CE to allow the backend scripts to check the job status and collect information needed for accounting. The path to the logs on the ARC CE is defined with the pbs_log_path option.
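As an illustration of the queue-based limits above, they can be set and inspected with qmgr on Torque/PBSPro; the queue name and values below are hypothetical:
# Set per-queue limits for a hypothetical queue named grid_long
qmgr -c "set queue grid_long max_running = 64"
qmgr -c "set queue grid_long max_user_run = 32"
qmgr -c "set queue grid_long max_queuable = 500"
# Verify the resulting queue attributes
qstat -f -Q grid_long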
Additional useful configuration hints:
If possible, please use queue-based attributes instead of server level ones.
The acl_user_enable = True attribute may be used with the acl_users = user1,user2 attribute to enable user access control for the queue.
It is advisable to set the max_queuable attribute in order to avoid a painfully long dead queue.
Node properties from the <PBS home on the server>/server_priv/nodes file together with the resources_default.neednodes attribute can be used to assign a queue to a certain type of node.
Checking the PBS configuration:
The node definition can be checked with pbsnodes -a. All the nodes MUST have ntype=cluster.
The required queue attributes can be checked with qstat -f -Q queuename. There MUST be a max_user_run or a max_running attribute listed with a REASONABLE value.
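For example, the checks could look like this (the queue name grid_long is hypothetical):
# All nodes must report ntype = cluster
pbsnodes -a | grep ntype
# The queue must list a reasonable max_user_run or max_running value
qstat -f -Q grid_long | grep -E 'max_user_run|max_running'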
Example:
[lrms]
lrms = pbs
defaultmemory = 512
pbs_log_path = /net/bs/var/log/torque/server_logs
[queue:grid_rt]
comment = Realtime queue for infrastructure testing
allowaccess = ops
advertisedvo = ops
[queue:alien]
comment = Dedicated queue for ALICE
allowaccess = alice
advertisedvo = alice
defaultmemory = 3500
Known limitations
Some of the limitations are already mentioned under the PBS deployment requirements. The main shortcomings are the limited support for routing queues, the difficulty of treating overlapping queues, and the complexity of node string specifications for parallel jobs.
LoadLeveler
LoadLeveler (LL), or Tivoli Workload Scheduler LoadLeveler in full, is a parallel job scheduling system developed by IBM.
Recommended batch system configuration
The back-end should work fine with a standard installation of LoadLeveler. For the back-end to report the correct memory usage and cputime spent while running, LoadLeveler has to be set up to show this data in the llq command. Normally this is turned off for performance reasons. It is up to the cluster administrator to decide whether or not to publish this information. The back-end will work whether or not this is turned on.
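To check whether LoadLeveler publishes this data, the long listing of a running job can be inspected (the job id is a placeholder):
# Long listing of a single job; look for the memory and CPU time fields
llq -l <job_id>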
Known limitations
There is at the moment no support for parallel jobs on the LoadLeveler back-end.
LSF
Load Sharing Facility (or simply LSF) is a commercial computer software job scheduler sold by Platform Computing. It can be used to execute batch jobs on networked Unix and Windows systems on many different architectures.
Recommended batch system configuration
Set up one or more LSF queues dedicated for access by grid users. All nodes in these queues should have a resource type which corresponds to the one of the frontend and which is reported to the outside. The resource type needs to be set properly in the lsb.queues configuration file. Be aware that LSF distinguishes between 32 and 64 bit for Linux. For a homogeneous cluster, the type==any option is a convenient alternative.
In lsb.queues set one of the following:
RES_REQ = type==X86_64
RES_REQ = type==any
See the -R option of the bsub command man page for more explanation.
The lsf_profile_path option must be set to the filename of the LSF profile that the back-end should use.
Furthermore, it is very important to specify the correct architecture for a given queue in arc.conf. Because the architecture flag is rarely set in the xRSL file, the LSF back-end will automatically set the architecture to match the chosen queue.
LSF's standard behaviour is to assume the same architecture as the frontend. This will fail, for instance, if the frontend is a 32 bit machine and all the cluster resources are 64 bit. If the architecture is not specified in arc.conf, jobs will be rejected by LSF because LSF believes there are no useful resources available.
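Putting this together, an arc.conf sketch for an LSF cluster could look like the following (the profile path, queue name and values are illustrative assumptions):
[lrms]
lrms = lsf
lsf_profile_path = /opt/lsf/conf/profile.lsf
defaultmemory = 2048
[queue:grid]
comment = Queue for grid jobs
architecture = x86_64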
Known limitations
Parallel jobs have not been tested on the LSF back-end.
The back-end does not at present support reporting different number of free CPUs per user.
SGE
Sun Grid Engine (SGE, Oracle Grid Engine, Codine) is an open source batch system maintained by Sun (Oracle). It is supported on Linux and Solaris, in addition to numerous other systems.
Recommended batch system configuration
Set up one or more SGE queues for access by grid users. Queues can be shared by normal and grid users. In case it is desired to set up more than one ARC queue, make sure that the corresponding SGE queues have no shared nodes among them. Otherwise the counts of free and occupied CPUs might be wrong. Only SGE versions 6 and above are supported. You must also make sure that the ARC CE can run qacct, as this is used to supply accounting information.
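Whether qacct is usable from the ARC CE can be verified by querying the accounting record of any completed job (the job id is a placeholder; the SGE environment has to be set up for the user running the command):
qacct -j <job_id>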
Example:
[lrms]
lrms = sge
sge_root = /opt/n1ge6
sge_bin_path = /opt/n1ge6/bin/lx24-x86
[queue:long]
sge_jobopts = -P atlas -r yes
Known limitations
Multi-CPU support is not well tested. All users are shown with the same quotas in the information system, even if they are mapped to different local users. The requirement that one ARC queue maps to one SGE queue is too restrictive, as the SGE’s notion of a queue differs widely from ARC’s definition. The flexibility available in SGE for defining policies is difficult to accurately translate into NorduGrid’s information schema. The closest equivalent of nordugrid-queue-maxqueuable is a per-cluster limit in SGE, and the value of nordugrid-queue-localqueued is not well defined if pending jobs can have multiple destination queues.
BOINC
BOINC is an open-source software platform for computing using volunteered resources. Support for BOINC in ARC is currently at the development level and to use it may require editing of the source code files to fit with each specific project.
Recommended batch system configuration
The BOINC database can be local to the ARC CE or remote. Read access is required from the ARC CE to check for finished jobs and gather information on available resources. The ARC CE must be able to run commands in the project's bin/ directory.
Project-specific variables can be set up in an RTE which must be used for each job. The following example shows the variables which must be defined to allow job submission to BOINC for the project “example” to work:
export PROJECT_ROOT="/home/boinc/project/example" # project directory
export BOINC_APP="example" # app name
export WU_TEMPLATE="templates/example_IN" # input file template
export RESULT_TEMPLATE="templates/example_OUT" # output file template
export RTE_LOCATION="$PROJECT_ROOT/Input/RTE.tar.gz" # RTEs, see below
The last variable is a tarball of runtime environments required by the job.
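The tarball itself is simply an archive of the runtime environment scripts needed on the volunteer hosts; as a hypothetical sketch matching the paths above:
# Pack the runtime environment scripts into the location given by RTE_LOCATION (hypothetical layout)
cd /home/boinc/project/example/Input
tar czf RTE.tar.gz runtime/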
Known limitations
The BOINC back-end was designed around projects that use virtualisation. The prototype implementation in the current ARC version may not be generic enough to suit all BOINC projects.
When preparing a BOINC job, the ARC CE copies a tarball of the session directory to the BOINC project download area. Once the job is completed and the output uploaded to the BOINC project upload area, a modified assimilator daemon must be used to copy the result back to the session directory so that it can be retrieved by ARC clients or uploaded to Grid storage by the ARC CE.