Dulcinea
- Introduction
- Getting it to run
- Operating ATLAS production
- Advanced details: Interaction with Windmill
- Appendix: Useful Windmill tips
Introduction
Dulcinea is one of the executors of the ATLAS
Production System. It receives tasks from the Windmill
supervisor, translates them into a form suitable
for NorduGrid ARC job submission, submits the corresponding
jobs to ARC-enabled resources, and performs other interactions with
this Grid system.
Like other executors, Dulcinea communicates with Windmill
via XML messages.
Dulcinea is a largely self-contained service: it does not use
other ATLAS Production System tools, such as Don Quijote,
the ATLAS Data Management System. This is possible
because the ARC middleware can move files from their temporary
output locations (where they were produced) to their final
destination on a Storage Element and register them in the RLS
without the need for external tools.
Dulcinea is built as a Python module on top of the ARC
User-Interface API. This makes it simple to reuse
the brokering and information-querying facilities of the ARC
User-Interface in Dulcinea. Because of this, Dulcinea is quite
robust and can easily recover from unexpected crashes.
Getting it to run
Step-by-step instructions
- Download Windmill from http://www-hep.uta.edu/windmill.
- Untar it and fix the known bugs (for Windmill 0.9.15,
apply the patch windmill-0.9.15-bugs.patch):
tar xvzf windmill-0.9.15.tar.gz
patch -p0 < windmill-0.9.15-bugs.patch
- Install GPT, Globus, nordugrid-client and nordugrid-devel
(if you haven't done so already) from the NorduGrid downloads
area. See ARC client installation instructions for details.
- Download the Dulcinea executor code from NorduGrid CVS
- Build it (edit the Makefile if your Globus,
nordugrid-client, nordugrid-devel, libxml2 or Python
headers are in non-standard locations):
make
- Add the following line to windmill-x.x.x/launch_executor so that the ARC libraries are in the library search path:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$NORDUGRID_LOCATION/lib
- Create a symbolic link called dulcinea in the
windmill-x.x.x directory to the Dulcinea executor
directory:
ln -s /my/path/dulcinea /my/path/windmill-x.x.x/dulcinea
- Create a Windmill configuration file
windmill-x.x.x/data/windmill.xml (there are some
example template files in the directory).
- To use the Dulcinea executor, specify in the configuration:
<exetype>dulcinea</exetype>
- Adjust the other parameters in the configuration to reasonable values (this takes a bit of trial and error).
- Set the <oraconnection> in the configuration to the database
where your jobs are defined (typically, the ATLAS proddb), e.g.:
<oraconnection>atlas_prodsys/xxxxxxx@atlassg</oraconnection>
where xxxxxxx is the database access password.
- Make sure you pick up your own jobs using
<grid>, <implementation>,
<uses> and/or <currentstate>
tags, depending on how your jobs are defined in the
database. (This is a bit vague, but it is a common
problem for all the executors, so maybe there will be a common
solution)
- Create a proxy:
grid-proxy-init
If you are going to submit jobs that run longer than 12 hours (most
likely you are), set a suitable proxy validity period using the -valid
HH:MM option, e.g.
grid-proxy-init -valid 168:00
- Start the executor by running
./launch_executor in windmill-x.x.x
- Start the supervisor by running
./launch_supervisor in windmill-x.x.x
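Before launching, it can help to check that the configuration file contains the two settings described in the steps above. The following is only a minimal sketch: the function name check_windmill_config and the root element name used in the usage note are assumptions made for illustration, and the real windmill.xml contains many more parameters that are not checked here.

```python
import xml.etree.ElementTree as ET


def check_windmill_config(xml_text):
    """Return a list of problems found in a Windmill configuration string.

    Only <exetype> and <oraconnection> are inspected, because these are
    the two settings named in the installation steps.
    """
    root = ET.fromstring(xml_text)
    problems = []

    # The executor type must be 'dulcinea' to select this executor.
    exetype = root.findtext(".//exetype")
    if exetype is None or exetype.strip() != "dulcinea":
        problems.append("exetype is %r, expected 'dulcinea'" % exetype)

    # The connection string has the shape user/password@database.
    ora = root.findtext(".//oraconnection")
    if not ora or "/" not in ora or "@" not in ora:
        problems.append("oraconnection should look like user/password@database")

    return problems
```

Calling check_windmill_config on the text of windmill-x.x.x/data/windmill.xml returns an empty list when both settings look plausible. It does not verify that the password or the database name is correct.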
Operating ATLAS production
Under construction
Important things to know
- Beware of GACL. Most Storage Elements protect files
via Grid Access Control Lists (GACL), which means that by
default only the executor owner can read the files produced by
the respective jobs. To make the files available to
everybody, the executor uploads a customized GACL file for the
outputs. The example .dulcineagacl file can
be modified by removing, adding or editing <entry>...</entry>
blocks, but remember always to keep the following entries:
<entry>
<person>
<dn>your-personal-certificate-DN</dn>
</person>
<allow><read/><list/><write/><admin/></allow>
</entry>
<entry>
<any-user/>
<allow><read/><list/></allow>
</entry>
If Dulcinea failed to upload the GACL file, or the permissions
have to be modified (e.g., a new authorised person has to be
added), you can manually upload any GACL file to the files
for which you have admin privileges. This is done
with the ngacl set tool:
ngacl set gsiftp://sepath/file < newgaclfile
Unfortunately, there is no way to know who has
administrative powers over which file, so you must somehow
keep track of the files you created, for example by
keeping a list of job names, or by finding the corresponding
files in the ATLAS Production Database. If you know a file's
logical name, you can find its location via ngls:
ngls -L rls://gridsrv3.nbi.dk/your_files_logical_name
- Watch your proxy
- Look at Grid Manager logs
- Keep your certificate repository up-to-date
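Since an edited GACL file must always keep the owner entry and the any-user entry described in the GACL item above, a quick check before uploading can catch accidental deletions. The sketch below is an assumption-laden illustration: the function name missing_required_entries is hypothetical, and the code only relies on the <entry>, <person>, <dn>, <any-user> and <allow> elements shown in the example entries.

```python
import xml.etree.ElementTree as ET

OWNER_PERMS = {"read", "list", "write", "admin"}
PUBLIC_PERMS = {"read", "list"}


def missing_required_entries(gacl_text, owner_dn):
    """Return which of the two mandatory entries are absent.

    The result is a list containing 'owner' and/or 'any-user';
    an empty list means both entries are present.
    """
    root = ET.fromstring(gacl_text)
    owner_ok = False
    public_ok = False

    for entry in root.iter("entry"):
        allow = entry.find("allow")
        # Permissions are the child element names of <allow>.
        perms = {child.tag for child in allow} if allow is not None else set()
        dns = [dn.text.strip() for dn in entry.findall("person/dn") if dn.text]

        if owner_dn in dns and OWNER_PERMS <= perms:
            owner_ok = True
        if entry.find("any-user") is not None and PUBLIC_PERMS <= perms:
            public_ok = True

    missing = []
    if not owner_ok:
        missing.append("owner")
    if not public_ok:
        missing.append("any-user")
    return missing
```

The check is purely local: it reads the text of the GACL file and does not contact any Storage Element.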
Details of interactions with the Windmill
Dulcinea handles each Windmill request as follows:
-
numJobsWanted
To answer this question, the executor runs
ngstat on all clusters and matches the passed
job requirements against the information provided by the clusters.
The executor then answers with 2 times the
number of available CPUs matching the job criteria.
-
executeJobs
Job definitions are translated from XML into
the ARC User-Interface xRSL language and
submitted using ngsub.
-
getStatus
To collect status information about jobs, Dulcinea runs
ngstat on the jobs and copies back the submitted xRSL
and certain output files containing important
information. These are then parsed to give the supervisor the
information it needs.
-
killJob
A simple ngkill suffices here.
-
getExecutorData
This is done simply by parsing the .ngjobs file.
-
fixJob
Not implemented so far.
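The numJobsWanted answer described in the list above (twice the number of available CPUs on clusters that match the job requirements) can be sketched as follows. This is an illustration only, not Dulcinea's actual code: the function name, the dictionary keys free_cpus and max_cputime_min, and the matching predicate are all hypothetical stand-ins for the cluster information that the real executor obtains through the ARC User-Interface.

```python
def num_jobs_wanted(clusters, matches, factor=2):
    """Return factor times the free CPUs on clusters accepted by `matches`.

    clusters: iterable of dicts, each with a (hypothetical) 'free_cpus' key.
    matches:  callable taking one cluster dict and returning True when the
              cluster satisfies the job requirements.
    factor:   multiplier applied to the CPU count (2 in the description).
    """
    free = sum(c["free_cpus"] for c in clusters if matches(c))
    return factor * free


if __name__ == "__main__":
    # Hypothetical cluster records; real values would come from the
    # ARC information system.
    clusters = [
        {"name": "cluster-a", "free_cpus": 4, "max_cputime_min": 1440},
        {"name": "cluster-b", "free_cpus": 10, "max_cputime_min": 60},
    ]
    # Example requirement: the job needs at least 600 minutes of CPU time.
    print(num_jobs_wanted(clusters, lambda c: c["max_cputime_min"] >= 600))
```

Asking for more jobs than there are free CPUs presumably keeps the matching clusters busy while earlier jobs finish, although the source does not state the reason for the factor of 2.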
Useful Windmill tips
Taken from the slides by Kaushik De
- Running Windmill: a screenshot
[sm@atlas002 windmill-0.9.15]$ ./launch_supervisor
Logged in as supervisor to server atlas000.uta.edu
>>>>> Starting to process jobs <<<<<
supervisor: enter command or help>
executor@atlas000.uta.edu/lexor-submit.cambridge.brochu is available (None / None)
...
*** recover submitted job - retry: 476983 536920 running None
*** recover submitted job - retry: 481396 536921 running None
***** supervisor is sending numJobsWanted xml
** Info - current jobcounts: 0 0 29
***** supervisor is sending getStatus
*** checking job - passed: 476669 536789 finished [Cap: OK ]
...
********** executor has returned status 476983 finished
*** renaming /datafiles/dc2/pileup/dc2.003003.lumi10.B1_jets_180/dc2.003003.lumi10.B1_jets_180._17182.pool.root.1 -> /datafiles/dc2/pileup/dc2.003003.lumi10.B1_jets_180/dc2.003003.lumi10.B1_jets_180._17182.pool.root
...
>>stopping submission of new jobs<<
- Getting list of Windmill commands: a screenshot
supervisor: enter command or help>
help
Documented commands (type help <topic>):
========================================
auto chat connect disconnect exit
help log manual nolog noprint
noxml pause print reload resume
start status stop trigger xml
supervisor: enter command or help>
help log
log: start logging to file
supervisor: enter command or help>
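When the supervisor output is saved to a file, a small script can summarise how many jobs the executor reported in each state. The sketch below assumes that saved lines have the same shape as the "executor has returned status" line in the first screenshot; the function name tally_returned_states is hypothetical, and other Windmill versions may format these messages differently.

```python
import re
from collections import Counter

# Pattern modelled on the screenshot line:
#   ********** executor has returned status 476983 finished
STATUS_RE = re.compile(r"executor has returned status (\d+) (\w+)")


def tally_returned_states(log_lines):
    """Count how many times each job state was reported by the executor."""
    counts = Counter()
    for line in log_lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts
```

Lines that do not match the pattern are ignored, so the whole log can be passed in without filtering it first.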