How to work with data?
ARC libraries are very flexible in terms of supported data transfer protocols. They are designed to be extendable via a set of pluggable Data Management Components (DMCs). The available protocols therefore depend on the DMCs installed on the system (look for Additional plugins in particular).
Note

A special plugin integrates GFAL2 plugins, providing additional data transfer protocols on top of the ARC native DMCs. For example, to add support for the legacy LFC file catalogue protocol to ARC, install nordugrid-arc-plugins-gfal and gfal2-plugin-lfc.

This applies both to the client tools and to ARC Data Staging.
Data transfer URLs
File locations in ARC can be specified either as local file names or as Internet standard Uniform Resource Locators (URLs). A number of additional URL options can also be used.
Depending on the installed ARC components some or all of the following transfer protocols and metadata services are supported:
| Protocol | Description |
|---|---|
| `ftp` | ordinary File Transfer Protocol (FTP) |
| `gsiftp` | GridFTP, the Globus-enhanced FTP protocol with security, encryption, etc., developed by The Globus Alliance |
| `http` | ordinary Hyper-Text Transfer Protocol (HTTP) with PUT and GET methods using multiple streams |
| `https` | HTTP with SSL |
| `httpg` | HTTP with Globus GSI |
| `dav` | WebDAV |
| `davs` | WebDAV with SSL |
| `ldap` | ordinary Lightweight Directory Access Protocol (LDAP) |
| `srm` | Storage Resource Manager (SRM) service |
| `root` | Xrootd protocol |
| `rucio` | Rucio – a data management system used by ATLAS and other scientific experiments |
| `acix` | ARC Cache Index |
| `s3` | Amazon S3 |
| `file` | local to the host file name with a full path |
A URL can be used in a standard form, i.e.
protocol://[host[:port]]/file
Or, to enhance performance or take advantage of various features, it can have additional options:
protocol://[host[:port]][;option[;option[...]]]/file[:metadataoption[:metadataoption[...]]]
For a metadata service URL, the construction is the following:
protocol://[url[|url[...]]@]host[:port][;option[;option[...]]]
/lfn[:metadataoption[:metadataoption[...]]]
where the nested URL(s) are physical replicas. Options are passed on to all replicas, but if it is desired to use the same option with a different value for all replicas, the option can be specified as a common option using the following syntax:
protocol://[;commonoption[;commonoption]|][url[|url[...]]@]host[:port]
[;option[;option[...]]]/lfn[:metadataoption[:metadataoption[...]]]
In user-level tools, URLs may be expressed using this syntax, or there may be simpler ways to construct complex URLs. In particular, command line tools such as arccp and the xRSL language provide methods to express URLs and options in a simpler way.
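For illustration, URL options can be embedded directly in a URL passed to a command line tool. In the following sketch the host name and paths are placeholders; note that the URL is quoted so that the shell does not interpret the ; option separators:

[user ~]$ arccp "gsiftp://grid.domain.org:2811;threads=4;secure=yes/dir/input.dat" /tmp/input.dat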
For the SRM service, the syntax is:
srm://host[:port][;options]/[service_path?SFN=]file[:metadataoptions]
Versions 1.1 and 2.2 of the SRM protocol are supported. The default
service_path is srm/managerv2
when the server supports v2.2,
srm/managerv1
otherwise.
For Rucio the following URL is used to look up replicas of the given scope and name:
rucio://rucio-lb-prod.cern.ch/replicas/scope/name
The Rucio authorisation URL can be specified with the environment variable $RUCIO_AUTH_URL. The Rucio account to use can be specified either through the rucioaccount URL option or the $RUCIO_ACCOUNT environment variable. If neither is specified, the account is taken from the VOMS nickname attribute.
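For example, the environment can be prepared before listing Rucio replicas; the authorisation host, account and scope/name below are placeholders:

[user ~]$ export RUCIO_AUTH_URL=https://rucio-auth-prod.cern.ch
[user ~]$ export RUCIO_ACCOUNT=auser
[user ~]$ arcls rucio://rucio-lb-prod.cern.ch/replicas/user.auser/data.root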
For ACIX the URLs look like:
acix://cacheindex.ndgf.org:6443/data/index?url=http://host.org/file1
S3 authentication is done through keys which must be set by the
environment variables $S3_ACCESS_KEY
and $S3_SECRET_KEY
.
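For example, an S3 listing might look like the following sketch, where the keys and bucket name are placeholders:

[user ~]$ export S3_ACCESS_KEY=myaccesskey
[user ~]$ export S3_SECRET_KEY=mysecretkey
[user ~]$ arcls s3://s3.amazonaws.com/mybucket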
The URL components are:
| Component | Description |
|---|---|
| `host[:port]` | Hostname or IP address [and port] of a server |
| `lfn` | Logical File Name |
| `url` | URL of the file as registered in the indexing service |
| `service_path` | End-point path of the web service |
| `file` | File name with full path |
| `option` | URL option |
| `commonoption` | URL option for all replicas |
| `metadataoption` | Metadata option |
The following URL options are supported:
| Option | Description |
|---|---|
| `threads=<number>` | specifies the number of parallel streams to be used by GridFTP or HTTP(s,g); default value is 1, maximal value is 10 |
| `exec=yes\|no` | means the file should be treated as executable |
| `preserve=yes\|no` | specifies if the file must be uploaded to this destination even if job processing failed (default is `no`) |
| `cache=yes\|no\|renew\|copy\|check\|invariant` | indicates whether the file should be cached; default for input files in A-REX is `yes` |
| `readonly=yes\|no` | for transfers to `file://` destinations, specifies whether the file should be write-protected; default is `yes` |
| `secure=yes\|no` | indicates whether the GridFTP data channel should be encrypted; default is `no` |
| `blocksize=<number>` | specifies size of chunks/blocks/buffers used in GridFTP or HTTP(s,g) transactions; default is protocol dependent |
| `checksum=cksum\|md5\|adler32\|no` | specifies the algorithm for checksum to be computed (for transfer verification or provided to the indexing server). This is overridden by any metadata options specified (see below). If this option is not provided, the default for the protocol is used. |
| `overwrite=yes\|no` | makes the software try (or not) to overwrite existing file(s); if `yes`, the tool will attempt to remove any information/content associated with the specified URL before writing to the destination |
| `protocol=gssapi\|gsi\|ssl\|tls` | distinguishes between different kinds of HTTPS/HTTPG and SRM protocols |
| `spacetoken=<pattern>` | specifies a space token to be used for uploads to SRM storage elements supporting SRM version 2.2 or higher |
| `autodir=yes\|no` | specifies whether, before writing to the specified location, the software should try to create all directories mentioned in the specified URL. Currently this applies to FTP and GridFTP only. Default value for these protocols is `yes` |
| `tcpnodelay=yes\|no` | controls the use of the TCP_NODELAY socket option; default is `no` |
| `transferprotocol=<protocols>` | specifies transfer protocols for meta-URLs such as SRM. Multiple protocols can be specified as a comma-separated list in order of preference. |
| `rucioaccount=<account>` | specifies the Rucio account to use when authenticating with Rucio. |
| `httpputpartial=yes\|no` | while storing a file on an HTTP(S) server, the software will try to send it in chunks/parts. If the server reports an error for the partial PUT command, the software will fall back to transferring the file in a single piece. This behaviour is non-standard and not all servers report errors properly. Hence the default is a safer `no`. |
| `httpgetpartial=yes\|no` | while retrieving a file from an HTTP(S) server, the software will try to read it in chunks/parts. If the server does not support the partial GET command, it usually ignores requests for a partial transfer range and the file is transferred in one piece. Default is `yes`. |
| | if set to |
| | if set to |
| | filter replicas returned from an index service based on their access latency (available from version 6.12) |
Local files are referred to by specifying either a location relative to the job submission working directory, or an absolute path (one that starts with /), preceded by a file:// prefix.
URLs also support metadata options which can be used for registering additional metadata attributes or querying the service using metadata attributes. These options are specified at the end of the LFN and consist of name and value pairs separated by colons. The following attributes are supported:
| Attribute | Description |
|---|---|
| `checksumtype` | Type of checksum. Supported values are `cksum`, `md5` and `adler32` |
| `checksumvalue` | The checksum of the file |
The checksum attributes may also be used to validate files that were uploaded to remote storage.
Examples of URLs are:
http://grid.domain.org/dir/script.sh
gsiftp://grid.domain.org:2811;threads=10;secure=yes/dir/input_12378.dat
ldap://grid.domain.org:389/lc=collection1,rc=Nordugrid,dc=nordugrid,dc=org
file:///home/auser/griddir/steer.cra
srm://srm.domain.org/griddir/user/file1:checksumtype=adler32:checksumvalue=12345678
srm://srm.domain.org;transferprotocol=https/data/file2
rucio://rucio-lb-prod.cern.ch/replicas/user.grid/data.root
Stage-in during submission
During job submission to the computing element, data can be moved from the client machine along with the job description.
The inputFiles directive of the job description automatically activates this kind of data movement when the source of the data is a local path or empty (""). An empty source value means that the input file is taken from the current working directory on the submission machine.
For example:
(inputFiles=("data1" "/mnt/data/analyses/data11.22.33")
("data2" ""))
During the job submission process the /mnt/data/analyses/data11.22.33 file will be uploaded to the job session directory on the CE as data1. The data2 file from the current working directory will be uploaded as data2.
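Assuming the directives above are saved in a job description file, e.g. myjob.xrsl (the file name and CE endpoint are placeholders), the upload happens as part of an ordinary submission:

[user ~]$ arcsub -c arc.example.org myjob.xrsl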
Stage-in on ARC CE
Instead of copying data from the submission machine, the ARC CE can download it from available storage services, specified by a URL in the inputFiles source value. In this case all the power of A-REX Data Caching can be used as well.
For example:
(inputFiles=("data.root.1" "srm://srm.mystorage.example.org/atlas/disk/data11_7TeV/data.root.1"))
During job submission the data WILL NOT be uploaded from the client. Instead, A-REX analyzes the job description and starts a stage-in process, during which the data.root.1 file is downloaded from the provided SRM URL.
Manual data movement with arc* tools
There is a set of data management tools that can be used to manipulate data manually, outside of the job processing context. arcls, arccp, arcrm and arcmkdir work similarly to the classic Unix commands, but accept local or remote URLs.
Again, any URL supported by the installed ARC data management plugins can be passed to these tools as an argument.
[user ~]$ arccp srm://srm.mystorage.example.org/atlas/disk/data11_7TeV/data.root.1 /mnt/data/data.root.1
[user ~]$ arcls http://download.nordugrid.org/repos/6
centos/
debian/
fedora/
ubuntu/
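The remaining tools follow the same pattern. For example, creating a remote directory and then removing a file might look like the following sketch, where the storage URL and paths are placeholders:

[user ~]$ arcmkdir srm://srm.mystorage.example.org/atlas/disk/user.tests
[user ~]$ arcrm srm://srm.mystorage.example.org/atlas/disk/user.tests/old.root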