How to work with data?
ARC libraries are very flexible in terms of supported data transfer protocols. They are designed to be extensible via a set of pluggable Data Management Components (DMCs).
The available protocols depend on the DMCs installed on the system (look for Additional plugins in particular).
Note
Notice the special plugin that integrates GFAL2 plugins for additional data transfer protocols on top of the ARC native DMCs. For example, to add support for the legacy LFC file catalogue protocol to ARC you have to install nordugrid-arc-plugins-gfal and gfal2-plugin-lfc.
This applies both to client tools and ARC Data Staging.
Data transfer URLs
File locations in ARC can be specified both as local file names, and as Internet standard Uniform Resource Locators (URL). There are also some additional URL options that can be used.
Depending on the installed ARC components some or all of the following transfer protocols and metadata services are supported:
Protocol | Description
---|---
ftp | ordinary File Transfer Protocol (FTP)
gsiftp | GridFTP, the Globus-enhanced FTP protocol with security, encryption, etc., developed by the Globus Alliance
http | ordinary Hyper-Text Transfer Protocol (HTTP) with PUT and GET methods using multiple streams
https | HTTP with SSL
httpg | HTTP with Globus GSI
dav | WebDAV
davs | WebDAV with SSL
ldap | ordinary Lightweight Directory Access Protocol (LDAP)
srm | Storage Resource Manager (SRM) service
root | Xrootd protocol
rucio | Rucio, a data management system used by ATLAS and other scientific experiments
acix | ARC Cache Index
s3 | Amazon S3
file | local to the host file name with a full path
A URL can be used in a standard form, i.e.
protocol://[host[:port]]/file
Or, to enhance performance or take advantage of various features, it can have additional options:
protocol://[host[:port]][;option[;option[...]]]/file[:metadataoption[:metadataoption[...]]]
For a metadata service URL, construction is the following:
protocol://[url[|url[...]]@]host[:port][;option[;option[...]]]
/lfn[:metadataoption[:metadataoption[...]]]
where the nested URL(s) are physical replicas. Options are passed on to all replicas, but if it is desired to use the same option with a different value for all replicas, the option can be specified as a common option using the following syntax:
protocol://[;commonoption[;commonoption]|][url[|url[...]]@]host[:port]
[;option[;option[...]]]/lfn[:metadataoption[:metadataoption[...]]]
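For illustration, a catalogue URL registering two physical replicas and applying a common cache=no option to both might look like the following (hostnames and paths are purely illustrative, and the LFC protocol itself requires the GFAL2 plugins noted earlier):

lfc://;cache=no|gsiftp://se1.example.org/data/file1|https://se2.example.org/data/file1@lfc.example.org/grid/myvo/file1

Here the two nested URLs before the @ sign are the physical replicas, and cache=no is applied to both of them.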
In user-level tools, URLs may be expressed using this syntax, or there may be simpler ways to construct complex URLs. In particular, command line tools such as arccp, and the xRSL language, provide methods to express URLs and options in a simpler way.
For the SRM service, the syntax is:
srm://host[:port][;options]/[service_path?SFN=]file[:metadataoptions]
Versions 1.1 and 2.2 of the SRM protocol are supported. The default service_path is srm/managerv2 when the server supports v2.2, and srm/managerv1 otherwise.
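For example, an SRM URL with an explicit port and service path could look like this (hostname and path are illustrative):

srm://srm.example.org:8443/srm/managerv2?SFN=/griddir/user/file1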
For Rucio the following URL is used to look up replicas of the given scope and name:
rucio://rucio-lb-prod.cern.ch/replicas/scope/name
The Rucio authorisation URL can be specified with the environment variable $RUCIO_AUTH_URL. The Rucio account to use can be specified either through the rucioaccount URL option or the $RUCIO_ACCOUNT environment variable. If neither is specified, the account is taken from the VOMS nickname attribute.
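As a sketch (the account name and scope below are illustrative), the account can be selected either via the environment variable:

[user ~]$ export RUCIO_ACCOUNT=myaccount
[user ~]$ arcls rucio://rucio-lb-prod.cern.ch/replicas/user.myaccount/data.root

or per-URL with the rucioaccount option:

[user ~]$ arcls "rucio://rucio-lb-prod.cern.ch;rucioaccount=myaccount/replicas/user.myaccount/data.root"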
For ACIX the URLs look like:
acix://cacheindex.ndgf.org:6443/data/index?url=http://host.org/file1
S3 authentication is done through keys which must be set via the environment variables $S3_ACCESS_KEY and $S3_SECRET_KEY.
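A sketch of an S3 download (keys, bucket and hostname are illustrative placeholders):

[user ~]$ export S3_ACCESS_KEY=<access key>
[user ~]$ export S3_SECRET_KEY=<secret key>
[user ~]$ arccp s3://s3.example.org/mybucket/file1 /tmp/file1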
The URL components are:
host[:port] | Hostname or IP address [and port] of a server
lfn | Logical File Name
url | URL of the file as registered in an indexing service
service_path | End-point path of the web service
file | File name with full path
option | URL option
commonoption | URL option for all replicas
metadataoption | Metadata option
The following URL options are supported:
threads=<number> | specifies the number of parallel streams to be used by GridFTP or HTTP(s,g); default value is 1, maximal value is 10
exec=yes|no | means the file should be treated as executable
preserve=yes|no | specifies if the file must be uploaded to this destination even if job processing failed (default is no)
cache=yes|no|renew|copy|check|invariant | indicates whether the file should be cached; default for input files in A-REX is yes. renew forces a download of the file even if the cached copy is still valid. copy forces the cached file to be copied (rather than linked) to the session directory; this is useful if, for example, the file is to be modified. check forces a check of the permission and modification time against the original source. invariant disables checking the original source modification time.
readonly=yes|no | for transfers to file:// destinations, specifies whether the file should be read-only (unmodifiable) or not; default is yes
secure=yes|no | indicates whether the GridFTP data channel should be encrypted; default is no
blocksize=<number> | specifies the size of chunks/blocks/buffers used in GridFTP or HTTP(s,g) transactions; default is protocol dependent
checksum=cksum|md5|adler32|no | specifies the checksum algorithm to be computed (for transfer verification or provided to the indexing server). This is overridden by any metadata options specified (see below). If this option is not provided, the default for the protocol is used. checksum=no disables checksum calculation.
overwrite=yes|no | makes the software try (or not) to overwrite existing file(s); if yes, the tool will try to remove any information/content associated with the specified URL before writing to the destination
protocol=gsi|gssapi|ssl|tls|ssl3 | distinguishes between different kinds of HTTPS/HTTPG and SRM protocols. Here gssapi stands for the HTTPG implementation using only GSSAPI functions to wrap data, and gsi uses additional headers as implemented in Globus IO. The ssl and tls options stand for the usual HTTPS and are specifically usable only with the SRM protocol. The ssl3 option is mostly the same as ssl but uses SSLv3 handshakes while establishing HTTPS connections. The default is gssapi for SRM connections, tls for HTTPS and gssapi for HTTPG. In the case of SRM, if the default fails, gsi is tried.
spacetoken=<pattern> | specifies a space token to be used for uploads to SRM storage elements supporting SRM version 2.2 or higher
autodir=yes|no | specifies whether, before writing to the specified location, the software should try to create all directories mentioned in the specified URL. Currently this applies to FTP and GridFTP only. Default value for these protocols is yes
tcpnodelay=yes|no | controls the use of the TCP_NODELAY socket option (which disables Nagle's algorithm). Applies to HTTP(S) only. Default is no (supported only in arcls and other arc* tools)
transferprotocol=protocols | specifies transfer protocols for meta-URLs such as SRM. Multiple protocols can be specified as a comma-separated list in order of preference
rucioaccount=account | specifies the Rucio account to use when authenticating with Rucio
httpputpartial=yes|no | while storing a file on an HTTP(S) server, the software will try to send it in chunks/parts. If the server reports an error for the partial PUT command, the software falls back to transferring the file in a single piece. This behaviour is non-standard and not all servers report errors properly; hence the default is a safer no.
httpgetpartial=yes|no | while retrieving a file from an HTTP(S) server, the software will try to read it in chunks/parts. If the server does not support the partial GET command, it usually ignores the requested transfer range and the file is transferred in one piece. Default is yes.
failureallowed=yes|no | if set to yes for a job input or output file, a failure to transfer this file will not cause a failure of the job. Default is no.
relativeuri=yes|no | if set to yes, HTTP operations will use the path instead of the full URL. Default is no.
accesslatency=disk|tape | filters replicas returned from an index service based on their access latency (available from version 6.12)
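Several options can be combined in a single URL. For example, a GridFTP download using five parallel streams, an encrypted data channel and a larger block size might look like this (hostname and path are illustrative):

[user ~]$ arccp "gsiftp://se.example.org:2811;threads=5;secure=yes;blocksize=1048576/data/file1" /tmp/file1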
Local files are referred to by specifying either a location relative to the job submission working directory, or an absolute path (one that starts with /), preceded by the file:// prefix.
URLs also support metadata options which can be used for registering additional metadata attributes or querying the service using metadata attributes. These options are specified at the end of the LFN and consist of name and value pairs separated by colons. The following attributes are supported:
checksumtype | Type of checksum. Supported values are cksum (default), md5 and adler32
checksumvalue | The checksum of the file
The checksum attributes may also be used to validate files that were uploaded to remote storage.
Examples of URLs are:
http://grid.domain.org/dir/script.sh
gsiftp://grid.domain.org:2811;threads=10;secure=yes/dir/input_12378.dat
ldap://grid.domain.org:389/lc=collection1,rc=Nordugrid,dc=nordugrid,dc=org
file:///home/auser/griddir/steer.cra
srm://srm.domain.org/griddir/user/file1:checksumtype=adler32:checksumvalue=12345678
[1]srm://srm.domain.org;transferprotocol=https/data/file2
[2]rucio://rucio-lb-prod.cern.ch/replicas/user.grid/data.root
[1] | This is a destination URL. The file will be copied to srm.domain.org at the path griddir/user/file1 and the checksum will be compared to what is reported by the SRM service after the transfer. |
[2] | This is a source or destination URL. When getting a transport URL from SRM, the HTTPS transfer protocol will be requested. |
Stage-in during submission
During job submission to the computing element, data can be moved from the client machine along with the job description.
The inputFiles directive of the job description automatically activates this kind of data movement if the source of the data is a local path or empty (""). An empty source value means that the input file is taken from the current working directory on the submission machine.
For example:
(inputFiles=("data1" "/mnt/data/analyses/data11.22.33")
("data2" ""))
During the job submission process the /mnt/data/analyses/data11.22.33 file will be uploaded to the job session directory on the CE as data1. The data2 file from the current working directory will be uploaded as data2.
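Assuming the job description above is saved as job.xrsl, a sketch of submitting it to a CE (the endpoint name is illustrative) would be:

[user ~]$ arcsub -C ce.example.org job.xrsl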
Stage-in on ARC CE
Instead of copying data from the submission machine, the ARC CE can download it from available storage services specified by a URL in the inputFiles source value.
In this case all the power of the A-REX Data Caching can be used as well.
For example:
(inputfiles=("data.root.1" "srm://srm.mystorage.example.org/atlas/disk/data11_7TeV/data.root.1"))
During job submission the data WILL NOT be uploaded. Instead, A-REX analyzes the job description and starts a stage-in process, during which the data.root.1 file is downloaded from the provided SRM URL.
Manual data movement with arc* tools
There is a set of data management tools that can be used to manipulate data manually, outside of the job processing context.
arcls, arccp, arcrm and arcmkdir work similarly to the classic Unix commands of the same purpose, but accept local or remote URLs.
Again, any URL supported by the installed ARC data management plugins can be passed to the tools as an argument.
[user ~]$ arccp srm://srm.mystorage.example.org/atlas/disk/data11_7TeV/data.root.1 /mnt/data/data.root.1
[user ~]$ arcls http://download.nordugrid.org/repos/6
centos/
debian/
fedora/
ubuntu/
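Similarly, a hypothetical session using the other tools might look like this (hostname and paths are illustrative):

[user ~]$ arcmkdir gsiftp://se.example.org/griddir/newdir
[user ~]$ arccp /mnt/data/data.root.1 gsiftp://se.example.org/griddir/newdir/data.root.1
[user ~]$ arcrm gsiftp://se.example.org/griddir/newdir/data.root.1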