NorduGrid technical meeting
17 June 2004, Copenhagen
Minutes
Present:
Daniel Kalici, Mike Gindonis, Jonas Bardino, Balázs Kónya,
Aleksandr Konstantinov, Oxana Smirnova, Jukka Klem,
Henrik Thostrup Jensen, Jesper Ryge Leth, Jakob Langgaard Nielsen,
Martin Folkman, Mattias Ellert, Anders Wäänänen, Michael Grønager,
Niels Elgaard Larsen
Next meeting:
In Reykjavík, September (? TBD).
Knoppix CD:
Presentation by Niels
- Status:
- CD working as:
- client and server (frontend + worker node) standalone
- client with NorduGrid certificate read from: Floppy/USB/mounted Windows harddrive
- Not working as server with host certificate - ideas:
- Nice to have an Unsafe CA - Automatic for Knoppix use
- Do we want screensaver science?
(feature req for Frontends running behind NAT... from Daniel to Aleksandr)
Job Manager:
Presentation by Henrik
Notes from the following discussion:
- The present user interface will be implemented using the
Job Manager (as a test) – today only job submission can be
handled.
- Very pluggable: Just a new handler for each of the new ng*
commands.
- Failover: Two (or more) Job Managers can be run
simultaneously – one takes over if the other one
fails.
- No state is saved to disk – a database is being considered
instead (probably the Zope Object Database).
- In one month (mid-July) the current functionality (ng*)
will be implemented.
- For more see: the proposal
- Implemented using XML-RPC (a lightweight predecessor
of SOAP)
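
[ed: The pluggable design noted above, one handler per ng* command exposed over XML-RPC, could look roughly like the Python sketch below. It is an illustration only, not the actual Job Manager code: the handler names, their arguments, the job id format and the in-memory job table are all hypothetical. Requests are dispatched in-process, so no network port is opened.]

```python
from xmlrpc.client import dumps, loads
from xmlrpc.server import SimpleXMLRPCDispatcher

# Hypothetical in-memory job table (the minutes note that no state is
# saved to disk yet and that a database is being considered).
_jobs = {}


def ngsub_handler(job_description):
    """Hypothetical submission handler: store the job and return an id."""
    job_id = "job-%d" % (len(_jobs) + 1)
    _jobs[job_id] = {"description": job_description, "state": "ACCEPTED"}
    return job_id


def ngstat_handler(job_id):
    """Hypothetical status handler: return the stored state of a job."""
    return _jobs[job_id]["state"]


def build_dispatcher():
    """Register one handler per ng* command; a new command is one more line."""
    dispatcher = SimpleXMLRPCDispatcher(allow_none=True)
    dispatcher.register_function(ngsub_handler, "ngsub")
    dispatcher.register_function(ngstat_handler, "ngstat")
    return dispatcher


def call(dispatcher, method, *params):
    """Encode an XML-RPC request, dispatch it in-process, decode the reply."""
    request = dumps(params, methodname=method).encode("utf-8")
    response = dispatcher._marshaled_dispatch(request)
    return loads(response)[0][0]
```

In this sketch, `call(build_dispatcher(), "ngsub", "...")` goes through the same XML encoding and decoding a remote client would use; a real server would hand the same handlers to an HTTP front end instead.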
ATLAS DC2 – CA-policy
Problem: Grid manager has to know the CAs of a submitted job
– otherwise job might fail... Solution ideas:
- Advertise the CAs you authorize against
- Make a list of needed CAs based on Proxy, etc...
- Associate CAs with runtime envs...
- Make an auth-plugin that checks if the job can connect to
all storages – Aleksandr will look into this: Basically the
local cluster can use an auth-plugin to check if access to
input and output files of the script can be authorized –
before the job enters the ACCEPTED state.
The auth-plugin kind of functionality could also make it
possible to e.g. run ldd on all executables in the script
before entering the ACCEPTED state...
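
[ed: The "list of needed CAs" idea, combined with a check before the ACCEPTED state, could be sketched as below. This assumes the plugin can already find out which CA issued the user's proxy and which CAs issued the host certificates of the job's storage servers; how that information is obtained is not covered. The function names, the job dictionary layout and the CA names in the usage note are hypothetical.]

```python
def missing_cas(needed_cas, trusted_cas):
    """Return the needed CAs that the cluster does not trust, sorted."""
    return sorted(set(needed_cas) - set(trusted_cas))


def may_accept(job, trusted_cas):
    """Decide whether a job may enter ACCEPTED; returns (ok, missing)."""
    # Needed CAs: the issuer of the user's proxy plus the issuers of the
    # host certificates of all input/output storage servers (assumed known).
    needed = [job["proxy_ca"]] + list(job["storage_cas"])
    missing = missing_cas(needed, trusted_cas)
    return (not missing, missing)
```

For example, a job with `proxy_ca="CA-A"` and `storage_cas=["CA-C"]` on a cluster trusting only `CA-A` and `CA-B` would be rejected with `CA-C` reported as missing, rather than failing later during data staging.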
Managing keys:
- Install an RPM called ca-policy-atlas (or
eugridpma [ed: ouch]), which depends on the needed
CA-packages.
Other issues:
RLS: When a file is uploaded, RLS is asked for it; if it is not
found (or the server is down), another place should be asked, but
instead it just fails (after trying the same place 5 times).
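
[ed: The desired behaviour, moving on to another place instead of retrying the same one, could be sketched as below. This is not the existing client code; the `ServerDown` exception and the lookup callables are hypothetical stand-ins for real index server queries.]

```python
class ServerDown(Exception):
    """Raised by a lookup function when its server cannot be reached."""


def resolve(lfn, lookups):
    """Ask each index server in turn; return the first non-empty answer.

    `lookups` is a list of callables that take a logical file name and
    return a list of physical locations (empty if the file is unknown),
    or raise ServerDown if the server is unreachable.
    """
    for lookup in lookups:
        try:
            locations = lookup(lfn)
        except ServerDown:
            continue  # unreachable: move on instead of retrying this one
        if locations:
            return locations
    return []  # nobody knows the file
```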
Release of ARC: Clusters need at least 0.4.1. When is the
next release? 0.4.3 – TODAY!
Status:
LRMS Port (Balázs):
- Fork: Problems: PBS provides more information than fork
(history, queue etc.). The Grid Manager front end for fork is
not working correctly. To be finished in 2 days.
- Sun Grid Engine: Looking into this after fork...
- Condor: up and running, problems with ngclean
(Mattias) – implemented as a queue, not a
cluster.
- All above should be up and running in a month: July 17th
2004.
Frontend model (Daniel):
- Running fine
- Implemented on "Horseshoe" {620 CPU Linux clusters},
"Theory" {Altix, Onyx, Origin}: same frontend, different
queues. Installation on IBM PowerPC cluster in progress, and
"Niflheim" with 800 CPUs.
- Big test is ATLAS DC2 running on Horseshoe...
Runtime Environment (Mike for Juha): in progress...
- (Balázs) Swegrid would like to administer their own
namespace – to be sure an RE really means the same – from
now on to be discussed in NDGF.
Accounting (Martin):
- Global Bank service, Logger (LUTS) service and different
account services per project/ per VO etc.
- Cluster-wise: JARM (Job account
res. Manager). Auth-plugin. (clients to LUTS and BANK)
- Based on Java program from GT3 (needs 1.2 and up) used in
auth-plugin as well as in the global bank.
- First alpha release of Admin web interface ready. Tested
on Hagrid, but no big testing.
- Advanced quoting schemes were discussed – like get a
quote/short time reservation etc. – some could
already be implemented by Johan Tordsson from Umeå.
- A scheme to specify when you want the job run, and how
much you want to pay.
- For more see e.g. the talk at the Swegrid inauguration
- Aleksandr presented NGLogger – a graphical interface
to the database. (See slides by Ugur)
Mailing list
- Everyone happy with current status (Anders, Balázs, Oxana)
GM turnaround time:
Everybody agrees signalling should be implemented; Anders
files the feature request.
Data Management:
RLS problems (Jakob):
- Globus RLS
- Freezes regularly – randomly stops accepting connections.
- Reasons for freezing are being investigated.
- RC was dying, too, but that was easy to catch (error
message) and restart. RLS doesn't die, just freezes.
- Does not support collections.
- Command-line tool is really horrible and lacks some
functionality, and is also illogical.
- It also leaks memory on non-authenticated searches.
- All in all, it is very easy to bring RLS down.
- Positive things:
- allows free-form attributes to get associated with entries
- works nicely with authentication.
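
[ed: Since a frozen RLS does not exit, it cannot be caught the way the dying RC was. One possible approach is an external probe that checks whether the server still accepts connections, sketched below. This only tests the TCP level; a server that accepts the connection but never answers a query would need an actual RLS request with a timeout, which is not shown. The function name and the default timeout are hypothetical.]

```python
import socket


def accepts_connections(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections and timeouts
        return False
```

A watchdog could call this periodically and restart the server when it returns False several times in a row.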
Storage element (Aleksandr):
Currently, most SEs are NG GridFTP servers with static access
control. Indexing: 1 RC, 2 RLS servers are up.
Smart SE (SSE): GM for data tasks, kind of. Has internal
states (non-exposed):
  Collecting    Downloading
        \          /
         \        /
          \      /
           \    /
          Complete
             |
             |
          Verified
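
[ed: The internal states in the diagram above can be written down as a small transition table, sketched below. Only the arrows shown in the diagram are encoded; what each state means inside the SSE is not specified in these minutes, and the `advance` helper is hypothetical.]

```python
from enum import Enum


class State(Enum):
    COLLECTING = "Collecting"
    DOWNLOADING = "Downloading"
    COMPLETE = "Complete"
    VERIFIED = "Verified"


# Allowed transitions, as read from the diagram.
TRANSITIONS = {
    State.COLLECTING: {State.COMPLETE},
    State.DOWNLOADING: {State.COMPLETE},
    State.COMPLETE: {State.VERIFIED},
    State.VERIFIED: set(),
}


def advance(current, target):
    """Return the new state, or raise ValueError for a disallowed move."""
    if target not in TRANSITIONS[current]:
        raise ValueError(
            "cannot go from %s to %s" % (current.value, target.value)
        )
    return target
```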
Reliable file transfer. Communication: https, transfer: http,
control: Web-services. So far works fine. Can support various
indexing services.
File transfer is slower than FTP, but multistream transfer is
possible.
An active indexing service for files is needed (see Trond's
proposal). At the moment there is no network of SEs, just a
bunch of disks. No automatic replication. No bulk
replication.
New tool: ngrequest – asynchronous
ngcopy. SSE makes the transfer and possibly LFN
resolution. To be documented (Aleksandr).
SRM: Oxana to send specs. Would be nice to have
SRM-compliant SEs and a bulk/automatic replication tool.
Indexing service: Jakob to write an architecture proposal.
Aleksandr will add replication functionality to
ngcopy: copying an LFN to itself should create a new
replica.
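
[ed: The intended semantics, copying an LFN to itself adds a replica, can be illustrated with a toy in-memory catalogue as below. This is not ngcopy's implementation: the `ReplicaCatalog` class, the `copy` function and the way a new location is chosen are all hypothetical, and the actual data transfer is left out.]

```python
class ReplicaCatalog:
    """Toy in-memory index mapping a logical file name to its replicas."""

    def __init__(self):
        self._replicas = {}

    def register(self, lfn, location):
        self._replicas.setdefault(lfn, []).append(location)

    def replicas(self, lfn):
        return list(self._replicas.get(lfn, []))


def copy(catalog, source_lfn, dest_lfn, pick_new_location):
    """Copy source to dest; if both are the same LFN, add a new replica."""
    existing = catalog.replicas(source_lfn)
    if not existing:
        raise KeyError(source_lfn)
    # The caller-supplied function chooses where the new copy should go.
    new_location = pick_new_location(existing)
    # (the data transfer from an existing replica is omitted here)
    catalog.register(dest_lfn, new_location)
    return new_location
```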
Issue: access rules for SE files. GACL won't scale for a VO
case: it will be big, and is always static. VOMS, CAS, PERMIS:
Aleksandr will have a look.
Minutes taken by M.P.G. and O.S.