NorduGrid technical meeting

16-19 March 2005, Vilnius

Minutes

Present: Balázs, Anders, Mattias, Mikael, Aleksandr, Oxana, Henrik, Mike, Niels, Mikko, Andrei I., Andrei Z., Vitaly, Lauri, Hardi, Csaba, Marko (taking the minutes).

Wednesday, 16 March
===================

Rome production issues/problems:
--------------------------------

Oxana: The system is not well suited for an environment where someone
creates a job and someone else manages it. Data access control (by
GACL) is the bottleneck.

Authorization for X to see Y's jobs/logs is difficult or ill defined
in critical components (job management, file management).

Anders: With a VOMS system, one could allocate auth for ATLAS group.

Balazs: GACL attributes with jobs: failures in servers (reported by
Russian Lund users - file in Bugzilla).

Anders: VOMS is hard to administer: to set up the admin GUI. Balazs &
Mikko & Csaba: We might have a VOMS server in HIP Finland or in Norway.

Anders: Maybe GACL does not really work with VOMS: 1) files: yes; 2)
jobs; 3) rls. Must test.

Plan: Put up a VOMS service in Norway or Finland, test the issues
above. Make Andrei report a bug (discussed on Sat).

How to live with the current situation:

Problem: rls server in Denmark: users cannot find the files because
they are not authorized in rls. Authorization is by manual config,
with no "group" concept. How about this policy: read access to
everybody?

Balazs: write access only to production people. - ok.

Plan: rls read access to everybody, write access to production
people. Install CA certs of "everybody". Done.
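
The agreed policy can be sketched as a simple check. This is only an
illustration: the function name, the operation names and the subject
DN below are hypothetical, and a real rls server is configured through
its own access control settings, not through code like this.

```python
# Hypothetical sketch of the agreed rls policy, not actual rls configuration.
# The DN below is a made-up example of a "production person".
PRODUCTION_PEOPLE = {"/O=Grid/O=NorduGrid/CN=Example Production Manager"}


def rls_allows(subject_dn, operation, ca_accepted=True):
    """Return True if the policy permits the operation for this subject."""
    if not ca_accepted:
        # Users whose CA certificate is not installed cannot authenticate at all.
        return False
    if operation == "read":
        # Read access to everybody.
        return True
    if operation == "write":
        # Write access only to production people.
        return subject_dn in PRODUCTION_PEOPLE
    return False
```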

Balazs: We need an rls test service and a new production rls.

Mattias: will set up a test service in Uppsala.

It is an ATLAS issue to find a new production rls.

GACL for storage 
----------------

Balazs: .gacl default policy in /DC for ATLAS would look like what?

Oxana: The current default is ok, but it's a pain that I cannot see
what the top .gacl is.

Log file access 
---------------

Balazs: gmlog for failed jobs should be made readable for
debuggers. How?

Mattias: the executor can do it. It downloads the log anyway.

Balazs: then it should send the log to storage using ngcopy? In a
separate cleanup process?

Mattias: well, yeah.

Balazs: this is important and urgent. Support requests all about this.

Anders: Let's file a request to Dulcinea developers.

Mattias: I'll take a look.

Balazs: How to solve this problem generically is an Atlas issue.

CA policies 
-----------

Anders: All "interesting" CA's are soon to be a part of EU PMA package
(EuroGrid).

Balazs: ARC should include default access for users whose CA's provide
ARC resources + demoCA.

Anders: Should these CA's be packaged with other than standalone?

Balazs: For sites, what do we recommend?

Anders: EuroGrid, Baltic, DFN, GridPP (expired), Sri Lanka, Australia
(Melbourne?).

All: As a policy NG accepts: has NG policy, has resources, behaves ok,
willing to put their CA in PMA.

Plan: ask the Aussies what's their plan. Otherwise ready for the next
release. For www.nordugrid.org/NorduGridVO, admins of "new" user
groups should send their information, user list and policy
description. There should be specific groups for Baltics, Slovenia,
Slovakia, ARC.

Improve documentation: how to put limits on Grid resource usage or
give priorities to some VOs.

Invitation: there will be an rls (Uppsala) and storage for anyone with
an NG certificate and an accepted CA. Demo CE boxes and a GIIS,
too. Make a runtime environment for this. Host certificate web
application: Marko.

Infosystem policies 
-------------------

Balazs: Contact persons' email addresses need to be checked
(practically done).

Plan: Rename existing toplevel servers to index1-index4. Countries
need to register into 2 indexes.

Release 0.4.5 
-------------

Anders: Patch for 0.4.4 -> 0.4.5. Tested?

Balazs: yes.

Aleksandr: Main changes in run.cc

Anders: The "cancelling jobs kills the manager" issue should be
fixed. Final testing needed.

Mattias: Will be installed.

Plan: By 24th March, if no problems, Anders will tag, put release
notes in.

Release 0.6 
-----------

Balazs: Main areas and persons responsible for each, new CVS directory
that allows "make client".

Anders: Dynamic versioning of API's vs the whole stuff (since
libraries, monitor, javaclient are relatively independent).

Balazs: This should not be reflected in packaging. For 0.6 let's keep
the current (monolith) versioning.

Mattias: In 2 months client tools based on new libs.

Anders: We can keep the current CVS and still do "make client"/"make
server".

Anders + Aleksandr: new CVS + component structure +
autotools. ./configure --disable-client = prepare only the server for
compilation.

Plan: Use autotools. Separate deps and targets. Start migrating.
After Easter check status. Target release date: end of April.

Slovenia certificate problems
-----------------------------

Solved by Anders by fixing the corresponding policy file. Will be
packaged.

External software
-----------------

Globus

Aleksandr: New release should be made against what ARC uses now
(2.4.x). If it compiles/works with 3.2.x/3.9.x, fine: that can be
considered.

Plan: see above.

Gpt: Balazs: Can we get a new binary release without gpt? Anders: Yes,
the spec file does not require gpt.

gsoap: 2.5.2 works. 2.6 does not.

Perl libs: Perl-ldap is not a dependency.

GACL: Keep it as it is.

VOMS: Keep it as it is - you can build 0.5 without VOMS.

mysql: Needed by logger.


Sofa meeting in Hotel Atrium, Thursday, 17th March
==================================================

After 0.6, 1.0 
--------------

Wish list: GLUE 2.0 compat, consistent data management (requirements
tomorrow), ARClib: c++, Python, Java: unified function signatures/data
type names.

Only GSI from Globus? No, gridftp is needed.

Resource reservation (in advance).

GUI interfaces (Java, QT?), application portals.

Different platforms: Solaris? Debian. Mac? We could have a Java based
client for basic client functions, but brokering: would need to be
re-written in Java.

Resource control: quotas (data, jobs, per person/VO). Resource status
should be available for admin monitoring. Are there libs available for
this?

Babysitter by rules like: if the job has been in the queue for 30 min,
resubmit. How to handle massive amounts of jobs?
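
A babysitter rule of this kind could be sketched as below. Only the
30 minute threshold comes from the discussion; the Job record, the
state name "QUEUED" and the function name are hypothetical.

```python
from dataclasses import dataclass

QUEUE_TIMEOUT_MIN = 30  # threshold from the example rule


@dataclass
class Job:
    job_id: str
    state: str               # hypothetical state names, e.g. "QUEUED", "RUNNING"
    minutes_in_state: float  # time spent in the current state


def jobs_to_resubmit(jobs, timeout_min=QUEUE_TIMEOUT_MIN):
    """Return ids of jobs that have waited in the queue for too long."""
    return [
        job.job_id
        for job in jobs
        if job.state == "QUEUED" and job.minutes_in_state >= timeout_min
    ]
```

Evaluating the rule over a list in one pass, rather than per job,
is one way to keep it manageable for large numbers of jobs.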

Better logging: information can be used by other components.

Web services? Job submission in 1.0. Job descriptions in JSDL.

Job submission/status/monitoring through the same
channel/interface. Resource monitoring too in 2.0.

VOMS support.

Mike suggests an improvement: Person X's failed data transfers could
be continued if he/she submits new jobs.


Data Management, Friday, 18 March 
=================================

Oxana: We should think of data manipulation as jobs: we don't want to
do it interactively.

Aleksandr: There is a Globus tool.

Oxana: GDMP was a good idea but poor implementation. We need storage
servers that move data on my behalf. You should be able to send jobs
to do it. Description: what files? What? Accessible to whom? Scheduled
when? "Rename files that have been created by me during last 2 weeks".

Balazs: To implement this, one needs basic services.

Oxana: Users should not care about that.

Oxana: But these things should not stay in job queues. We need file
placement services, indexing..

Aleksandr: How much would this system hide from the user? Oxana: On
the same level as grid-manager hides the queue system.

Csaba: How about Matrix SRB? Aleksandr: License issues must be
settled.

Oxana: Should be integrated in job management: in the same job descr.

Aleksandr: If we write stuff on top of SRB (and the license becomes an
issue) we have to emulate SRB: too much.

Balazs: Let's test SRB. There's an installation in Norway. Agreed as a
plan. Moreover: test SSE-Fireman integration, too (see Saturday's
minutes).


High-level design for storables (Aleksandr)
-------------------------------------------

Main operations

 AA enabled: get file, put, delete, rename, copy, list, modify property

Objects

 collections (set or ordered), files, services (indexing, storage..)

Properties

 access control (for everything), id, reference, storage (how the file
 is stored?)
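
As an illustration only, the objects and properties listed above might
map onto data structures like these. All class, field and function
names are hypothetical; the minutes do not define an API, and only one
of the main operations (modify property) is shown.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Storable:
    """Common base: every object has an id and access control."""
    id: str
    acl: Dict[str, Set[str]] = field(default_factory=dict)  # subject -> allowed operations
    properties: Dict[str, str] = field(default_factory=dict)

    def allows(self, subject: str, operation: str) -> bool:
        return operation in self.acl.get(subject, set())


@dataclass
class File(Storable):
    storage: str = ""  # how the file is stored


@dataclass
class Collection(Storable):
    ordered: bool = False                             # set or ordered
    members: List[str] = field(default_factory=list)  # ids of contained objects


def modify_property(obj: Storable, subject: str, key: str, value: str) -> None:
    """AA-enabled operation: check access control before changing anything."""
    if not obj.allows(subject, "modify"):
        raise PermissionError(f"{subject} may not modify {obj.id}")
    obj.properties[key] = value
```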


Saturday, 19 March: Pending issues
==================================

Next NGN meeting 
----------------

Arrival on Wed (morning). Tech meeting would start Wed afternoon
already.

GACL + data management 
----------------------

The SE should support: read list of admins from a file.

Aleksandr: This is a kludge. There should be a gridftp service for
different groups (blocks in nordugrid.conf).

Solution:

SE is physical space; you can access it with different URLs:
xyz/ATLAS, xyz/CMS. NorduGrid conf:

[VO]
URL=xyz/ATLAS
NAME=ATLAS-SE
FILE=/etc/nordugrid/ATLAS-VO.TXT

And you use ATLAS-SE in the GACL.
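
A minimal sketch of how such a block could be read, assuming a
[block] header followed by one KEY=value pair per line. The exact
syntax of the proposed [VO] block is not settled in these minutes, and
parse_blocks is a hypothetical helper, not part of ARC.

```python
def parse_blocks(text):
    """Parse '[name]' blocks followed by KEY=value lines into a list of dicts."""
    blocks = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            current = {"_block": line[1:-1]}
            blocks.append(current)
        elif "=" in line and current is not None:
            key, value = line.split("=", 1)
            current[key.strip()] = value.strip()
    return blocks


# Values taken from the example in the minutes; layout is assumed.
EXAMPLE = """
[VO]
URL=xyz/ATLAS
NAME=ATLAS-SE
FILE=/etc/nordugrid/ATLAS-VO.TXT
"""
```

The NAME value (ATLAS-SE here) is what would then be referenced from
the GACL.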

This goes to 0.6.

How about GACL templates? To be checked. Improve error messages:
upload/download of file x failed because of insufficient privileges in
dir Y.

I/O design: Filename, host+port, path, streams, cache, retry, secure,
blocksize, location, pattern, checksum, guid, multiple i/o, metadata,
acl.

RLS: 
----

Current issues: When the SSE registers to RLS, the AA in RLS is by the
SSE's host cert, not user cert. To be fixed.

St. Petersburg will run an RLS service.

Options for NG/RLS 

1. Ask M. Branco what kind of improvements "would minimize his
pain". (I'll do this -- Marko)

2. Use FireMan as index service: write/contribute
glite-catalog-search?

Option 2 preferred if FireMan can be installed.

Plan: Don't drop RLS but integrate FireMan for 2.0. Invite Peter
Kunszt to next NG meeting.

Retries in transfer: GridManager or executor? Proposed: ng-rerun that
would take care of the retries: download, queue, upload. The reason
for the failure is in the errorstring message. Implement this on the
client side: if the job failed in a certain state, the executor tries
to re-run. Improve errorstring.
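
The client-side decision could look roughly like this. ng-rerun is
only a proposal at this point, so everything here is an assumption:
the stage names are derived from the list above, and the function name
and max_attempts limit are invented for illustration.

```python
# Stages named in the proposal as candidates for a retry.
RERUNNABLE_STAGES = {"download", "queue", "upload"}


def should_rerun(failed_stage, attempts, max_attempts=3):
    """Decide whether the executor should try to re-run a failed job.

    failed_stage: the stage in which the job failed (hypothetical names).
    attempts: how many times the job has already been run.
    max_attempts: assumed cap, not discussed in the minutes.
    """
    return failed_stage in RERUNNABLE_STAGES and attempts < max_attempts
```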

Redefine finishing codes: exit code, rerunnable? Mattias: Check that
it is compatible with ngstat.

For 1.0: Re-designed downloader/uploader infrastructure.

Logger 
------

Vitaly & Andrei will take a closer look at the logger database
structure. Target: ready for 0.6.

Debug and log files: Default log level for all the services: 0 (only
errors). Nothing: -1.

Error messages 
--------------

For jobs: what the queue provides + exit codes. Error codes for
services: we'll need a new attribute. Atlas_error_code is too
specific. Job failures and messages should be improved: write
grid-manager appendix.

Client error messages: change the "no cluster with enough CPU" error
message to "you are not authorized to use any cluster" if that is the
case. Trace other uninformative messages.
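
The intended behaviour, sketched under the assumption that the client
knows per cluster whether the user is authorized. The dictionary key
and the function name are hypothetical; the two message texts are the
ones quoted above.

```python
def no_match_message(clusters):
    """Pick the error message to show when no cluster can take the job.

    clusters: list of dicts with a hypothetical boolean key 'authorized'.
    """
    if not any(cluster["authorized"] for cluster in clusters):
        # The real cause is authorization, so say so.
        return "you are not authorized to use any cluster"
    return "no cluster with enough CPU"
```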

File transfer: Globus messages are not informative.

Each service should have a default log file.

Web page updates 
----------------

A couple of paragraphs about all the components of ARC.

Compatibility 
-------------

There are client_version and server_version variables.

SE & Inf.System attributes 
--------------------------

 SE name (different from URL): hostname + string unique on the site
 SE type: (enum) gridftp, SSE
 Contact URL (multival)
 Access control framework: GACL, trivial
 Network speed: ambiguous
 i/o speed:
 hardware style: disk/raid/tape/network
 status:
 etc. (see Balazs' draft, will be continued)
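
For illustration, the attributes that are already fairly clear could
be modelled as below. This is a sketch, not the infosystem schema: the
class and field names are hypothetical, the ":" separator in the SE
name is an assumption, and the still-open attributes (speeds, hardware
style, status) are left out.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class SEType(Enum):
    GRIDFTP = "gridftp"
    SSE = "SSE"


class AccessControl(Enum):
    GACL = "GACL"
    TRIVIAL = "trivial"


@dataclass
class StorageElementInfo:
    hostname: str
    local_name: str   # string unique on the site
    se_type: SEType
    contact_urls: List[str] = field(default_factory=list)  # multivalued
    access_control: AccessControl = AccessControl.GACL

    @property
    def name(self) -> str:
        # SE name = hostname + site-unique string; the separator is assumed.
        return f"{self.hostname}:{self.local_name}"
```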