A-REX Technical Description

Internal files of the A-REX

A-REX stores information about jobs in files in the control directory. Information is stored in files to make it easier to recover in case of failure, but for faster processing job state is also held in memory while A-REX is running.

The files and sub-directories in the control directory and their formats are described below:

  • accounting - sub-directory containing accounting-related information (typically an sqlite database with accounting records)

  • delegations – sub-directory containing a collection of delegated credentials and an sqlite database associating them with submitted jobs.

  • logs – sub-directory with information prepared for reporting plugins.

  • dhparam.pem - file with Diffie-Hellman parameters for establishing TLS connections with the A-REX server. This file is generated the first time A-REX is started, and it may take some time until it is populated.

  • dtr.state - file with the current state of the data staging functionality

  • gm-heartbeat - the modification time of this file is continuously updated by A-REX to indicate that its main processing loop is running

  • gm.fifo - FIFO communication channel to A-REX, mostly used by backend scripts to indicate jobs whose state has changed

  • info.xml - file with the current state of the A-REX CE expressed in GLUE2 XML
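
As an illustration of how the gm-heartbeat file can be used, the following minimal sketch reports how long ago the file was last touched and treats A-REX as running if that age is below a threshold. The function names and the 120-second default are assumptions made for this example; the text above does not state how often A-REX updates the file.

```python
import os
import time


def heartbeat_age(control_dir, now=None):
    """Return seconds since gm-heartbeat was last modified, or None if it is missing."""
    path = os.path.join(control_dir, "gm-heartbeat")
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return None
    if now is None:
        now = time.time()
    return now - mtime


def arex_looks_alive(control_dir, max_age=120.0, now=None):
    """Guess whether the main loop is running; max_age is an assumed threshold."""
    age = heartbeat_age(control_dir, now)
    return age is not None and age <= max_age
```

A monitoring script could call arex_looks_alive() with the configured control directory and raise an alert when it returns False.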

  • ID.status – file with the current state of the job. Here ID is an arbitrary ASCII string assigned to the submitted job. Currently the ID consists of 12 lowercase hex symbols, but that may change without notice. This is a plain text file containing a single word: the internal name of the job's current state. Possible values are:

    • ACCEPTED

    • PREPARING

    • SUBMIT

    • INLRMS

    • FINISHING

    • FINISHED

    • CANCELING

    • DELETED

    See corresponding Section for a description of the various states. Additionally, each value can be prefixed with “PENDING:” (for example PENDING:ACCEPTED, see corresponding Section). This indicates that a job is ready to move to the next state but has to stay in its current state because otherwise some limits set in the configuration would be exceeded.

    This file is not stored directly in the control directory but in the following sub-directories:

    • accepting - for jobs in ACCEPTED state

    • finished - for jobs in FINISHED and DELETED states

    • processing - for other states

    • restarting - temporary location for jobs being restarted on user request or after restart of A-REX
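
The layout of the status file described above can be read with a short sketch like the following. It looks for ID.status in the four sub-directories and splits off an optional “PENDING:” prefix. The function names and the search order are assumptions made for this example, not part of A-REX.

```python
import os

# Sub-directories that may hold ID.status; the search order is an assumption.
STATUS_SUBDIRS = ("accepting", "processing", "finished", "restarting")

KNOWN_STATES = {
    "ACCEPTED", "PREPARING", "SUBMIT", "INLRMS",
    "FINISHING", "FINISHED", "CANCELING", "DELETED",
}


def parse_status(text):
    """Split status-file content into (state, pending)."""
    word = text.strip()
    pending = word.startswith("PENDING:")
    if pending:
        word = word[len("PENDING:"):]
    if word not in KNOWN_STATES:
        raise ValueError("unknown job state: %r" % word)
    return word, pending


def read_job_status(control_dir, job_id):
    """Return (state, pending) for a job, or None if no status file is found."""
    for sub in STATUS_SUBDIRS:
        path = os.path.join(control_dir, sub, job_id + ".status")
        try:
            with open(path) as handle:
                return parse_status(handle.read())
        except FileNotFoundError:
            continue
    return None
```

Because a job can move between sub-directories while it is being looked up, a caller may want to retry once when None is returned.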

  • description – file containing the description of the job (JD). This and all the following files are stored in a hierarchy of sub-directories jobs/SUBID/SUBID/SUBID/SUBID. Here each SUBID is an arbitrary ASCII string; put together, the SUBIDs form the ID of the job. The files are stored inside a set of sub-directories to reduce load on filesystems, which typically suffer a performance decrease as the number of files in a directory grows. Currently each SUBID consists of 3 lowercase hex symbols, but that may change.
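
The mapping from a job ID to its per-job directory can be sketched as follows, assuming the current scheme of 3-character SUBIDs. The function name and the decision to reject IDs whose length is not a multiple of the SUBID length are assumptions made for this example.

```python
import os


def job_files_dir(control_dir, job_id, subid_len=3):
    """Build jobs/SUBID/SUBID/... for a job ID split into fixed-length SUBIDs."""
    if not job_id or len(job_id) % subid_len != 0:
        raise ValueError("job ID length must be a non-zero multiple of %d" % subid_len)
    subids = [job_id[i:i + subid_len] for i in range(0, len(job_id), subid_len)]
    return os.path.join(control_dir, "jobs", *subids)
```

For a 12-symbol ID such as 0123456789ab this yields the SUBIDs 012, 345, 678 and 9ab, so the description file would be found under jobs/012/345/678/9ab inside the control directory.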

  • local – information about the job used by the A-REX. It consists of lines of format “name = value”. Not all of them are always available. The following names are defined:

    • globalid – job identifier as seen by user tools. Depending on the interface used, it is either a BES ActivityIdentifier XML tree, an EMI ES GUID or a GridFTP URL.

    • headnode – URL of service interface used to submit this job.

    • interface – name of the interface used for job submission - org.nordugrid.xbes, org.ogf.glue.emies.activitycreation or org.nordugrid.gridftpjob.

    • lrms – name of the LRMS backend to be used for local submission

    • queue – name of the queue to run the job at

    • localid – job id in LRMS (appears only after the job reached state InLRMS)

    • args – main executable name followed by a list of command-line arguments

    • argscode – code which main executable returns in case of success

    • pre – executable name followed by a list of command-line arguments for an executable to run before the main executable. There may be several of them

    • precode – code which pre-executable returns in case of success

    • post – executable name followed by a list of command-line arguments for an executable to run after the main executable. There may be several of them

    • postcode – code which post-executable returns in case of success

    • subject – user certificate’s subject, also known as the distinguished name (DN)

    • starttime – GMT time when the job was accepted represented in the Generalized Time format of LDAP

    • lifetime – time period to preserve the SD after the job has finished in seconds

    • notify – email addresses, with flags, to which mail about the specified job status changes is sent

    • processtime – GMT time when to start processing the job in Generalized Time format

    • exectime – GMT time when to start job execution in Generalized Time format

    • clientname – name (as provided by the user interface) and IP address:port of the submitting client machine

    • clientsoftware – version of software used to submit the job

    • rerun – number of retries left to rerun the job

    • priority – data staging priority (1 - 100)

    • downloads – number of files to download into the SD before execution

    • uploads – number of files to upload from the SD after execution

    • jobname – name of the job as supplied by the user

    • projectname – name of the project as supplied by the user. There may be several of them

    • jobreport – URL of a user-requested accounting service. The A-REX will send job records to this service in addition to the default accounting service set in the configuration. There may be several of them

    • cleanuptime – GMT time, in Generalized Time format, when the job should be removed from the cluster and its SD deleted

    • expiretime – GMT time when the credentials delegated to the job expire in Generalized Time format

    • gmlog – directory name which holds files containing information about the job when accessed through GridFTP interface

    • sessiondir – the job’s SD

    • failedstate – state in which job failed (available only if it is possible to restart the job)

    • failedcause – contains “internal” for jobs that failed because of a processing error and “client” if the client requested job cancellation.

    • credentialserver – URL of MyProxy server to use for renewing credentials.

    • freestagein – yes if client is allowed to stage-in any file

    • activityid – job ID of the previous job in case the job has been resubmitted or migrated. This value can appear multiple times if a job has been resubmitted or migrated more than once.

    • migrateactivityid –

    • forcemigration – This boolean is only used for migration of jobs. It determines whether the job should persist if the termination of the previous job fails.

    • transfershare – name of share used in Preparing and Finishing states.

    This file is filled partially during job submission and fully when the job moves from the Accepted to the Preparing state.
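
A reader for the local file can be sketched as below. Since several names (for example pre, post, projectname, jobreport and activityid) may appear more than once, every name maps to a list of values. The function name, the stripping of whitespace around names and values, and the silent skipping of lines without “=” are assumptions made for this example.

```python
from collections import defaultdict


def parse_local(text):
    """Parse "name = value" lines into a dict mapping each name to a list of values."""
    values = defaultdict(list)
    for line in text.splitlines():
        # Split at the first "=" only, so values may themselves contain "=".
        name, sep, value = line.partition("=")
        if not sep:
            continue  # not a name/value line
        name = name.strip()
        if name:
            values[name].append(value.strip())
    return dict(values)
```

Since not all names are always available, callers should use a lookup with a default, such as parse_local(text).get("localid", []), rather than indexing directly.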

  • input – list of input files. Each line contains up to 3 values separated by a space. The first value is the name of the file relative to the SD. The second value is a URL or a file description. Example:

    input.dat gsiftp://grid.domain.org/dir/input_12378.dat

    A URL represents a location from which a file can be downloaded. Each URL can contain additional options.

    A file description refers to a file uploaded from the UI and consists of [size][.checksum] where

    • size - size of the file in bytes.

    • checksum - checksum of the file identical to the one produced by cksum(1).

    These values are used to verify the transfer of the uploaded file. Both size and checksum can be left out. A special file description “.” is used to specify files which are not required to exist. The optional third value is the path to delegated credentials to be used for communication with the remote server.

    This file is used by the data staging subsystem of the A-REX. Files with a URL will be downloaded to the SD or cache, and files with a ’file description’ will simply be checked for existence. Each time a new valid file appears in the SD, it is removed from the list and the input file is updated.
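
One line of the input file can be interpreted along the following lines. The sketch assumes that values never contain spaces and that anything containing “://” is a URL; both are simplifications for this example, and the InputEntry class and parse_input_line function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InputEntry:
    name: str                          # file name relative to the SD
    url: Optional[str] = None          # set when the file is to be downloaded
    size: Optional[int] = None         # size in bytes from a file description
    checksum: Optional[str] = None     # checksum from a file description
    optional: bool = False             # True for the special "." description
    credentials: Optional[str] = None  # path to delegated credentials


def parse_input_line(line):
    """Parse one line of the input file into an InputEntry."""
    parts = line.split()
    if not 2 <= len(parts) <= 3:
        raise ValueError("expected 2 or 3 values: %r" % line)
    entry = InputEntry(name=parts[0])
    if len(parts) == 3:
        entry.credentials = parts[2]
    source = parts[1]
    if "://" in source:
        entry.url = source
    else:
        # File description of the form [size][.checksum]
        size_text, _, checksum = source.partition(".")
        entry.size = int(size_text) if size_text else None
        entry.checksum = checksum or None
        entry.optional = source == "."
    return entry
```

Entries with url set correspond to files the data staging subsystem downloads; the remaining entries describe files expected to be uploaded by the client.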

  • input_status – contains list of files uploaded by client to the SD.

  • output – list of output files. Each line contains 1, 2 or 3 values separated by a space. The first value is the name of the file relative to the SD. The second value, if present, is a URL. Supported URLs are the same as those supported by the input file. The optional third value is the path to delegated credentials to be used while accessing the remote server.

    This file is used by the data staging subsystem of the A-REX. Files with URL will be uploaded to SE and remaining files will be left in the SD. Each time a file is uploaded it is removed from the list and output file is updated. Files not mentioned as output files are removed from the SD at the beginning of the Finishing state.
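
The distinction between output files that are uploaded and those that stay in the SD can be sketched as follows. As with the input file, the sketch assumes that values contain no spaces; the function name is hypothetical.

```python
def split_output_list(text):
    """Return (upload, keep): files with a URL and files left in the SD."""
    upload, keep = [], []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip empty lines
        if len(parts) > 3:
            raise ValueError("expected at most 3 values: %r" % line)
        name = parts[0]
        url = parts[1] if len(parts) >= 2 else None
        credentials = parts[2] if len(parts) == 3 else None
        if url:
            upload.append((name, url, credentials))
        else:
            keep.append(name)
    return upload, keep
```

Each tuple in upload holds the file name, the destination URL and the optional credentials path; keep lists the names that remain in the SD.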

  • output_status – list of output files successfully pushed to remote locations.

  • failed – the existence of this file marks the failure of the job. It can also contain one or more lines of text describing the reason for the failure. A non-zero return code of the job itself also counts as failure.

  • errors – this file contains the output produced on stderr by external utilities such as data staging, the script for job submission to the LRMS, etc. These are not necessarily errors, but can be just useful information about actions taken during job processing. If a problem occurs, include the content of this file when asking for help.

  • diag – information about resources used during execution of the job and other information suitable for diagnostics and statistics. Its format is similar to that of the local file. At least the following names are defined:

    • nodename – name of computing node which was used to execute job,

    • runtimeenvironments – used runtime environments separated by ’;’,

    • exitcode – numerical exit code of job,

    • frontend distribution – name and version of operating system distribution on frontend computer,

    • frontend system – name of operating system on frontend computer,

    • frontend subject – subject (DN) of certificate representing frontend computer,

    • frontend ca – subject (DN) of issuer of certificate representing frontend computer,

    and other information provided by the GNU time utility. Note that some implementations of time insert unrequested information in their output. Hence some lines can have a broken format.

  • proxy – delegated X509 credentials or chain of public certificates.

  • proxy.tmp – temporary X509 credentials with different UNIX ownership used by processes run with effective user id different from job owner’s id.

  • statistics – statistics on input and output data transfer

  • xml - job’s current state expressed in GLUE2 XML rendering

There may be other files inside the jobs/SUBID/SUBID/SUBID/SUBID sub-directories which are created and used by different parts of the A-REX. Their presence cannot be guaranteed and may change as the A-REX code changes.