How to work with data?

The ARC libraries are very flexible in terms of supported data transfer protocols. They are designed to be extensible via a set of pluggable Data Management Components (DMCs).

The available protocols depend on the DMCs installed on the system (look for Additional plugins in particular).

Note

Notice the special plugin that integrates additional GFAL2 plugins for data transfer protocols, in addition to the ARC native DMCs. For example, to add support for the legacy LFC file catalogue protocol to ARC, install nordugrid-arc-plugins-gfal and gfal2-plugin-lfc.

This applies both to client tools and ARC Data Staging.

Data transfer URLs

File locations in ARC can be specified both as local file names, and as Internet standard Uniform Resource Locators (URL). There are also some additional URL options that can be used.

Depending on the installed ARC components some or all of the following transfer protocols and metadata services are supported:

Table 10 List of main supported protocols

  ftp      ordinary File Transfer Protocol (FTP)
  gsiftp   GridFTP, the Globus-enhanced FTP protocol with security, encryption, etc., developed by the Globus Alliance
  http     ordinary Hyper-Text Transfer Protocol (HTTP) with PUT and GET methods using multiple streams
  https    HTTP with SSL
  httpg    HTTP with Globus GSI
  dav      WebDAV
  davs     WebDAV with SSL
  ldap     ordinary Lightweight Directory Access Protocol (LDAP)
  srm      Storage Resource Manager (SRM) service
  root     Xrootd protocol
  rucio    Rucio, a data management system used by ATLAS and other scientific experiments
  acix     ARC Cache Index
  s3       Amazon S3
  file     local file name on the host, with a full path

A URL can be used in the standard form, i.e.

protocol://[host[:port]]/file

Or, to enhance the performance or take advantage of various features, it can have additional options:

protocol://[host[:port]][;option[;option[...]]]/file[:metadataoption[:metadataoption[...]]]

A metadata service URL is constructed as follows:

protocol://[url[|url[...]]@]host[:port][;option[;option[...]]]
  /lfn[:metadataoption[:metadataoption[...]]]

where the nested URL(s) are physical replicas. Options are passed on to all replicas, but if it is desired to use the same option with a different value for all replicas, the option can be specified as a common option using the following syntax:

protocol://[;commonoption[;commonoption]|][url[|url[...]]@]host[:port]
  [;option[;option[...]]]/lfn[:metadataoption[:metadataoption[...]]]

In user-level tools, URLs may be expressed using this syntax, or there may be simpler ways to construct complex URLs. In particular, command line tools such as arccp and the xRSL language provide methods to express URLs and options in a simpler way.

For the SRM service, the syntax is:

srm://host[:port][;options]/[service_path?SFN=]file[:metadataoptions]

Versions 1.1 and 2.2 of the SRM protocol are supported. The default service_path is srm/managerv2 when the server supports v2.2, srm/managerv1 otherwise.
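For illustration, the two forms below address the same file on a hypothetical host, the first relying on the default service path and the second naming it explicitly (assuming the server supports SRM v2.2):

```
srm://srm.example.org/griddir/user/file1
srm://srm.example.org/srm/managerv2?SFN=/griddir/user/file1
```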

For Rucio the following URL is used to look up replicas of the given scope and name:

rucio://rucio-lb-prod.cern.ch/replicas/scope/name

The Rucio authorisation URL can be specified with the environment variable $RUCIO_AUTH_URL. The Rucio account to use can be specified either through the rucioaccount URL option or the $RUCIO_ACCOUNT environment variable. If neither is specified, the account is taken from the VOMS nickname attribute.
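Putting this together, a sketch of a Rucio replica lookup might look as follows; the authorisation URL and account name are example values, not authoritative ones, and arcls is only invoked if the ARC client tools are installed:

```shell
# Rucio authorisation endpoint used by the ARC data libraries (example value)
export RUCIO_AUTH_URL="https://rucio-auth-prod.cern.ch"
# Rucio account to authenticate as (hypothetical account name)
export RUCIO_ACCOUNT="myaccount"
# Look up replicas of the given scope/name, if the ARC client tools are present
if command -v arcls >/dev/null 2>&1; then
    # The lookup may still fail without valid grid credentials
    arcls "rucio://rucio-lb-prod.cern.ch/replicas/user.grid/data.root" || true
fi
```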

For ACIX the URLs look like:

acix://cacheindex.ndgf.org:6443/data/index?url=http://host.org/file1

S3 authentication is done through keys which must be set by the environment variables $S3_ACCESS_KEY and $S3_SECRET_KEY.
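For example, a sketch of an S3 download; the endpoint, bucket and key values are placeholders, and arccp is only invoked if the ARC client tools are installed:

```shell
# Credentials for the S3 endpoint (placeholder values)
export S3_ACCESS_KEY="my-access-key"
export S3_SECRET_KEY="my-secret-key"
# Copy an object to a local file, if the ARC client tools are present
if command -v arccp >/dev/null 2>&1; then
    # The transfer will fail if the hypothetical endpoint does not exist
    arccp "s3://s3.example.org/mybucket/data.bin" /tmp/data.bin || true
fi
```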

The URL components are:

  host[:port]     hostname or IP address [and port] of a server
  lfn             Logical File Name
  url             URL of the file as registered in an indexing service
  service_path    end-point path of the web service
  file            file name with full path
  option          URL option
  commonoption    URL option applied to all replicas
  metadataoption  metadata option

The following URL options are supported:

threads=<number> specifies the number of parallel streams to be used by GridFTP or HTTP(s,g); the default value is 1, the maximum is 10
exec=yes|no means the file should be treated as executable
preserve=yes|no specifies whether the file must be uploaded to this destination even if job processing failed (default is no)
cache=yes|no|renew|copy|check|invariant indicates whether the file should be cached; the default for input files in A-REX is yes. renew forces a fresh download of the file, even if the cached copy is still valid. copy forces the cached file to be copied (rather than linked) to the session directory; this is useful if, for example, the file is to be modified. check forces a check of the permissions and modification time against the original source. invariant disables checking of the original source modification time.
readonly=yes|no for transfers to file:// destinations, specifies whether the file should be read-only (unmodifiable) or not; default is yes
secure=yes|no indicates whether the GridFTP data channel should be encrypted; default is no
blocksize=<number> specifies size of chunks/blocks/buffers used in GridFTP or HTTP(s,g) transactions; default is protocol dependent
checksum=cksum|md5|adler32|no specifies the algorithm for checksum to be computed (for transfer verification or provided to the indexing server). This is overridden by any metadata options specified (see below). If this option is not provided, the default for the protocol is used. checksum=no disables checksum calculation.
overwrite=yes|no makes the software try (or not) to overwrite existing file(s); if yes, the tool will try to remove any information/content associated with the specified URL before writing to the destination.
protocol=gsi|gssapi|ssl|tls|ssl3 distinguishes between different kinds of HTTPS/HTTPG and SRM protocols. Here gssapi stands for an HTTPG implementation using only GSSAPI functions to wrap data, while gsi uses additional headers as implemented in Globus IO. The ssl and tls options stand for ordinary HTTPS and are only usable with the SRM protocol. The ssl3 option is mostly the same as ssl but uses SSLv3 handshakes when establishing HTTPS connections. The default is gssapi for SRM connections, tls for HTTPS and gssapi for HTTPG. In the case of SRM, if the default fails, gsi is tried.
spacetoken=<pattern> specifies a space token to be used for uploads to SRM storage elements supporting SRM version 2.2 or higher
autodir=yes|no specifies whether before writing to the specified location the software should try to create all directories mentioned in the specified URL. Currently this applies to FTP and GridFTP only. Default value for these protocols is yes
tcpnodelay=yes|no controls the use of the TCP_NODELAY socket option (which disables Nagle’s algorithm). Applies to HTTP(S) only. Default is no (supported only in arcls and other arc* tools)
transferprotocol=protocols specifies transfer protocols for meta-URLs such as SRM. Multiple protocols can be specified as a comma-separated list in order of preference.
rucioaccount=account specifies the Rucio account to use when authenticating with Rucio.
httpputpartial=yes|no while storing a file on an HTTP(S) server, the software will try to send it in chunks/parts. If the server reports an error for the partial PUT command, the software will fall back to transferring the file in a single piece. This behavior is non-standard and not all servers report errors properly, hence the default is the safer no.
httpgetpartial=yes|no while retrieving a file from an HTTP(S) server, the software will try to read it in chunks/parts. If the server does not support the partial GET command, it usually ignores the requested transfer range and the file is transferred in one piece. Default is yes.
failureallowed=yes|no if set to yes for a job input or output file, then a failure to transfer this file will not cause a failure of the job. Default is no.
relativeuri=yes|no if set to yes, HTTP operations will use the path instead of the full URL. Default is no.
accesslatency=disk|tape filter replicas returned from an index service based on their access latency. (available from version 6.12)
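Several URL options can be combined in a single URL. The sketch below (hypothetical host and path) composes a GridFTP source URL requesting four parallel streams, an encrypted data channel and an adler32 checksum; arccp is only invoked if the ARC client tools are installed:

```shell
# Compose a GridFTP URL carrying several transfer options (hypothetical host/path)
src="gsiftp://grid.example.org:2811;threads=4;secure=yes;checksum=adler32/dir/input.dat"
echo "$src"
# Download the file, if the ARC client tools are present
if command -v arccp >/dev/null 2>&1; then
    # The transfer may still fail without valid grid credentials
    arccp "$src" /tmp/input.dat || true
fi
```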

Local files are referred to by specifying either a location relative to the job submission working directory, or by an absolute path (the one that starts with /), preceded with a file:// prefix.

URLs also support metadata options which can be used for registering additional metadata attributes or querying the service using metadata attributes. These options are specified at the end of the LFN and consist of name and value pairs separated by colons. The following attributes are supported:

  checksumtype   type of checksum; supported values are cksum (default), md5 and adler32
  checksumvalue  the checksum of the file

The checksum attributes may also be used to validate files that were uploaded to remote storage.

Examples of URLs are:

  • http://grid.domain.org/dir/script.sh
  • gsiftp://grid.domain.org:2811;threads=10;secure=yes/dir/input_12378.dat
  • ldap://grid.domain.org:389/lc=collection1,rc=Nordugrid,dc=nordugrid,dc=org
  • file:///home/auser/griddir/steer.cra
  • srm://srm.domain.org/griddir/user/file1:checksumtype=adler32:checksumvalue=12345678 [1]
  • srm://srm.domain.org;transferprotocol=https/data/file2 [2]
  • rucio://rucio-lb-prod.cern.ch/replicas/user.grid/data.root
[1]This is a destination URL. The file will be copied to srm.domain.org at the path griddir/user/file1 and the checksum will be compared to what is reported by the SRM service after the transfer.
[2]This is a source or destination URL. When getting a transport URL from SRM, the HTTPS transfer protocol will be requested.

Stage-in during submission

During job submission to the computing element, data can be moved from the client machine along with the job description.

The inputFiles directive of the job description automatically activates this kind of data movement if the source of the data is a local path or is empty (""). An empty source value means that the input file is taken from the current working directory on the submission machine.

For example:

(inputFiles=("data1" "/mnt/data/analyses/data11.22.33")
            ("data2" ""))

During the job submission process the /mnt/data/analyses/data11.22.33 file will be uploaded to the job session directory on the CE as data1. The data2 file from the current working directory will be uploaded as data2.
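A complete minimal xRSL job description using this directive might look as follows (the executable name run.sh is made up for illustration; submitting the description with arcsub uploads both input files along with it):

```
&(executable="run.sh")
 (inputFiles=("data1" "/mnt/data/analyses/data11.22.33")
             ("data2" ""))
```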

Stage-in on ARC CE

Instead of copying data from the submission machine, the ARC CE can download it from available storage services specified by a URL in the inputFiles source value.

In this case all the power of the A-REX Data Caching can be used as well.

For example:

(inputFiles=("data.root.1" "srm://srm.mystorage.example.org/atlas/disk/data11_7TeV/data.root.1"))

During job submission the data WILL NOT be uploaded. Instead, A-REX will analyze the job description and start a stage-in process. During stage-in the data.root.1 file will be downloaded from the provided SRM URL.
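The URL options described earlier can be combined with this mechanism. In the sketch below (hypothetical host and paths), cache=copy asks A-REX to place a copy of the cached file, rather than a link, into the session directory, which is useful if the job modifies the file:

```
(inputFiles=("data.root.1"
             "srm://srm.example.org;cache=copy/atlas/disk/data11_7TeV/data.root.1"))
```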

Manual data movement with arc* tools

There is a set of data management tools that can be used to manipulate data manually, outside of the job processing context.

arcls, arccp, arcrm and arcmkdir work similarly to their classic Unix counterparts, but accept local or remote URLs.

Again, any URL supported by installed ARC data management plugins can be passed to the tools as an argument.

[user ~]$ arccp srm://srm.mystorage.example.org/atlas/disk/data11_7TeV/data.root.1 /mnt/data/data.root.1

[user ~]$ arcls http://download.nordugrid.org/repos/6
centos/
debian/
fedora/
ubuntu/
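The remaining tools follow the same pattern. For example, against a hypothetical WebDAV server, a directory can be created, filled and cleaned up like this:

[user ~]$ arcmkdir davs://storage.example.org/webdav/myresults
[user ~]$ arccp /mnt/data/result.root davs://storage.example.org/webdav/myresults/result.root
[user ~]$ arcrm davs://storage.example.org/webdav/myresults/result.root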