How to work with data?

The ARC libraries are very flexible in terms of supported data transfer protocols. They are designed to be extensible via a set of pluggable Data Management Components (DMCs).

The available protocols depend on the DMCs installed on the system (look for Additional plugins in particular).

Note

Notice the special plugin that integrates additional GFAL2 plugins for data transfer protocols, in addition to the ARC native DMCs. For example, to add support for the legacy LFC file catalogue protocol to ARC, install nordugrid-arc-plugins-gfal and gfal2-plugin-lfc.

This applies both to client tools and ARC Data Staging.

Data transfer URLs

File locations in ARC can be specified both as local file names, and as Internet standard Uniform Resource Locators (URL). There are also some additional URL options that can be used.

Depending on the installed ARC components some or all of the following transfer protocols and metadata services are supported:

Table 10 List of main supported protocols

  ftp      ordinary File Transfer Protocol (FTP)
  gsiftp   GridFTP, the Globus-enhanced FTP protocol with security, encryption, etc., developed by the Globus Alliance
  http     ordinary Hyper-Text Transfer Protocol (HTTP) with PUT and GET methods using multiple streams
  https    HTTP with SSL
  httpg    HTTP with Globus GSI
  dav      WebDAV
  davs     WebDAV with SSL
  ldap     ordinary Lightweight Directory Access Protocol (LDAP)
  srm      Storage Resource Manager (SRM) service
  root     Xrootd protocol
  rucio    Rucio, a data management system used by ATLAS and other scientific experiments
  acix     ARC Cache Index
  s3       Amazon S3
  file     local file name on the host, with a full path

A URL can be used in the standard form, i.e.

protocol://[host[:port]]/file

Or, to enhance the performance or take advantage of various features, it can have additional options:

protocol://[host[:port]][;option[;option[...]]]/file[:metadataoption[:metadataoption[...]]]

A metadata service URL is constructed as follows:

protocol://[url[|url[...]]@]host[:port][;option[;option[...]]]
  /lfn[:metadataoption[:metadataoption[...]]]

where the nested URL(s) are physical replicas. Options are passed on to all replicas, but if it is desired to use the same option with a different value for all replicas, the option can be specified as a common option using the following syntax:

protocol://[;commonoption[;commonoption]|][url[|url[...]]@]host[:port]
  [;option[;option[...]]]/lfn[:metadataoption[:metadataoption[...]]]

In user-level tools, URLs may be expressed using this syntax, or there may be simpler ways to construct complex URLs. In particular, command line tools such as arccp and the xRSL language provide methods to express URLs and options in a simpler way.

For the SRM service, the syntax is:

srm://host[:port][;options]/[service_path?SFN=]file[:metadataoptions]

Versions 1.1 and 2.2 of the SRM protocol are supported. The default service_path is srm/managerv2 when the server supports v2.2, srm/managerv1 otherwise.
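For illustration, the two forms below address the same file on a hypothetical host, the first relying on the default service path and the second naming it explicitly (assuming the server supports SRM v2.2):

```
srm://srm.example.org/griddir/user/file1
srm://srm.example.org/srm/managerv2?SFN=/griddir/user/file1
```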

For Rucio the following URL is used to look up replicas of the given scope and name:

rucio://rucio-lb-prod.cern.ch/replicas/scope/name

The Rucio authorisation URL can be specified with the environment variable $RUCIO_AUTH_URL. The Rucio account to use can be specified either through the rucioaccount URL option or the $RUCIO_ACCOUNT environment variable. If neither is specified, the account is taken from the VOMS nickname attribute.
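Putting this together, a sketch of a Rucio replica lookup might look as follows; the authorisation URL and account name are example values, not authoritative ones, and arcls is only invoked if the ARC client tools are installed:

```shell
# Rucio authorisation endpoint used by the ARC data libraries (example value)
export RUCIO_AUTH_URL="https://rucio-auth-prod.cern.ch"
# Rucio account to authenticate as (hypothetical account name)
export RUCIO_ACCOUNT="myaccount"
# Look up replicas of the given scope/name, if the ARC client tools are present
if command -v arcls >/dev/null 2>&1; then
    # The lookup may still fail without valid grid credentials
    arcls "rucio://rucio-lb-prod.cern.ch/replicas/user.grid/data.root" || true
fi
```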

For ACIX the URLs look like:

acix://cacheindex.ndgf.org:6443/data/index?url=http://host.org/file1

S3 authentication is done through keys which must be set by the environment variables $S3_ACCESS_KEY and $S3_SECRET_KEY.
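For example, a sketch of an S3 download; the endpoint, bucket and key values are placeholders, and arccp is only invoked if the ARC client tools are installed:

```shell
# Credentials for the S3 endpoint (placeholder values)
export S3_ACCESS_KEY="my-access-key"
export S3_SECRET_KEY="my-secret-key"
# Copy an object to a local file, if the ARC client tools are present
if command -v arccp >/dev/null 2>&1; then
    # The transfer will fail if the hypothetical endpoint does not exist
    arccp "s3://s3.example.org/mybucket/data.bin" /tmp/data.bin || true
fi
```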

The URL components are:

  host[:port]     hostname or IP address [and port] of a server
  lfn             Logical File Name
  url             URL of the file as registered in an indexing service
  service_path    end-point path of the web service
  file            file name with full path
  option          URL option
  commonoption    URL option applied to all replicas
  metadataoption  metadata option

The following URL options are supported:

threads=<number> specifies the number of parallel streams to be used by GridFTP or HTTP(s,g); the default value is 1, the maximum is 10
exec=yes|no means the file should be treated as executable
preserve=yes|no specifies whether the file must be uploaded to this destination even if job processing failed (default is no)
cache=yes|no|renew|copy|check|invariant indicates whether the file should be cached; the default for input files in A-REX is yes. renew forces a fresh download of the file, even if the cached copy is still valid. copy forces the cached file to be copied (rather than linked) to the session directory; this is useful if, for example, the file is to be modified. check forces a check of the permissions and modification time against the original source. invariant disables checking of the original source modification time.
readonly=yes|no for transfers to file:// destinations, specifies whether the file should be read-only (unmodifiable) or not; default is yes
secure=yes|no indicates whether the GridFTP data channel should be encrypted; default is no
blocksize=<number> specifies size of chunks/blocks/buffers used in GridFTP or HTTP(s,g) transactions; default is protocol dependent
checksum=cksum|md5|adler32|no specifies the algorithm for checksum to be computed (for transfer verification or provided to the indexing server). This is overridden by any metadata options specified (see below). If this option is not provided, the default for the protocol is used. checksum=no disables checksum calculation.
overwrite=yes|no makes the software try (or not) to overwrite existing file(s); if yes, the tool will try to remove any information/content associated with the specified URL before writing to the destination.
protocol=gsi|gssapi|ssl|tls|ssl3 distinguishes between different kinds of HTTPS/HTTPG and SRM protocols. Here gssapi stands for an HTTPG implementation using only GSSAPI functions to wrap data, while gsi uses additional headers as implemented in Globus IO. The ssl and tls options stand for ordinary HTTPS and are only usable with the SRM protocol. The ssl3 option is mostly the same as ssl but uses SSLv3 handshakes when establishing HTTPS connections. The default is gssapi for SRM connections, tls for HTTPS and gssapi for HTTPG. In the case of SRM, if the default fails, gsi is tried.
spacetoken=<pattern> specifies a space token to be used for uploads to SRM storage elements supporting SRM version 2.2 or higher
autodir=yes|no specifies whether before writing to the specified location the software should try to create all directories mentioned in the specified URL. Currently this applies to FTP and GridFTP only. Default value for these protocols is yes
tcpnodelay=yes|no controls the use of the TCP_NODELAY socket option (which disables Nagle’s algorithm). Applies to HTTP(S) only. Default is no (supported only in arcls and other arc* tools)
transferprotocol=protocols specifies transfer protocols for meta-URLs such as SRM. Multiple protocols can be specified as a comma-separated list in order of preference.
rucioaccount=account specifies the Rucio account to use when authenticating with Rucio.
httpputpartial=yes|no while storing a file on an HTTP(S) server, the software will try to send it in chunks/parts. If the server reports an error for the partial PUT command, the software will fall back to transferring the file in a single piece. This behavior is non-standard and not all servers report errors properly, hence the default is the safer no.
httpgetpartial=yes|no while retrieving a file from an HTTP(S) server, the software will try to read it in chunks/parts. If the server does not support the partial GET command, it usually ignores the requested transfer range and the file is transferred in one piece. Default is yes.
failureallowed=yes|no if set to yes for a job input or output file, then a failure to transfer this file will not cause a failure of the job. Default is no.
relativeuri=yes|no if set to yes, HTTP operations will use the path instead of the full URL. Default is no.
accesslatency=disk|tape filter replicas returned from an index service based on their access latency. (available from version 6.12)
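Several URL options can be combined in a single URL. The sketch below (hypothetical host and path) composes a GridFTP source URL requesting four parallel streams, an encrypted data channel and an adler32 checksum; arccp is only invoked if the ARC client tools are installed:

```shell
# Compose a GridFTP URL carrying several transfer options (hypothetical host/path)
src="gsiftp://grid.example.org:2811;threads=4;secure=yes;checksum=adler32/dir/input.dat"
echo "$src"
# Download the file, if the ARC client tools are present
if command -v arccp >/dev/null 2>&1; then
    # The transfer may still fail without valid grid credentials
    arccp "$src" /tmp/input.dat || true
fi
```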

Local files are referred to by specifying either a location relative to the job submission working directory, or by an absolute path (the one that starts with /), preceded with a file:// prefix.

URLs also support metadata options which can be used for registering additional metadata attributes or querying the service using metadata attributes. These options are specified at the end of the LFN and consist of name and value pairs separated by colons. The following attributes are supported:

  checksumtype   type of checksum; supported values are cksum (default), md5 and adler32
  checksumvalue  the checksum of the file

The checksum attributes may also be used to validate files that were uploaded to remote storage.

Examples of URLs are:

  • http://grid.domain.org/dir/script.sh
  • gsiftp://grid.domain.org:2811;threads=10;secure=yes/dir/input_12378.dat
  • ldap://grid.domain.org:389/lc=collection1,rc=Nordugrid,dc=nordugrid,dc=org
  • file:///home/auser/griddir/steer.cra
  • srm://srm.domain.org/griddir/user/file1:checksumtype=adler32:checksumvalue=12345678 [1]
  • srm://srm.domain.org;transferprotocol=https/data/file2 [2]
  • rucio://rucio-lb-prod.cern.ch/replicas/user.grid/data.root
[1]This is a destination URL. The file will be copied to srm.domain.org at the path griddir/user/file1 and the checksum will be compared to what is reported by the SRM service after the transfer.
[2]This is a source or destination URL. When getting a transport URL from SRM, the HTTPS transfer protocol will be requested.

Stage-in during submission

During job submission to the computing element, data can be moved from the client machine along with the job description.

The inputFiles directive of the job description automatically activates this kind of data movement if the source of the data is a local path or is empty (""). An empty source value means that the input file is taken from the current working directory on the submission machine.

For example:

(inputFiles=("data1" "/mnt/data/analyses/data11.22.33")
            ("data2" ""))

During the job submission process the /mnt/data/analyses/data11.22.33 file will be uploaded to the job session directory on the CE as data1. The data2 file from the current working directory will be uploaded as data2.
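A complete minimal xRSL job description using this directive might look as follows (the executable name run.sh is made up for illustration; submitting the description with arcsub uploads both input files along with it):

```
&(executable="run.sh")
 (inputFiles=("data1" "/mnt/data/analyses/data11.22.33")
             ("data2" ""))
```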

Stage-in on ARC CE

Instead of copying data from the submission machine, the ARC CE can download it from available storage services specified by a URL in the inputFiles source value.

In this case all the power of the A-REX Data Caching can be used as well.

For example:

(inputFiles=("data.root.1" "srm://srm.mystorage.example.org/atlas/disk/data11_7TeV/data.root.1"))

During job submission the data WILL NOT be uploaded. Instead, A-REX will analyze the job description and start a stage-in process. During stage-in the data.root.1 file will be downloaded from the provided SRM URL.
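The URL options described earlier can be combined with this mechanism. In the sketch below (hypothetical host and paths), cache=copy asks A-REX to place a copy of the cached file, rather than a link, into the session directory, which is useful if the job modifies the file:

```
(inputFiles=("data.root.1"
             "srm://srm.example.org;cache=copy/atlas/disk/data11_7TeV/data.root.1"))
```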

Manual data movement with arc* tools

There is a set of data management tools that can be used to manipulate data manually, outside of the job processing context.

arcls, arccp, arcrm and arcmkdir work similarly to their classic Unix counterparts, but accept local or remote URLs.

Again, any URL supported by installed ARC data management plugins can be passed to the tools as an argument.

[user ~]$ arccp srm://srm.mystorage.example.org/atlas/disk/data11_7TeV/data.root.1 /mnt/data/data.root.1

[user ~]$ arcls http://download.nordugrid.org/repos/6
centos/
debian/
fedora/
ubuntu/
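The remaining tools follow the same pattern. For example, against a hypothetical WebDAV server, a directory can be created, filled and cleaned up like this:

[user ~]$ arcmkdir davs://storage.example.org/webdav/myresults
[user ~]$ arccp /mnt/data/result.root davs://storage.example.org/webdav/myresults/result.root
[user ~]$ arcrm davs://storage.example.org/webdav/myresults/result.root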