NorduGrid technical meeting

18-21 May 2005, St.Petersburg

Minutes

Present (in and out): Balázs, Aleksandr, Oxana, Marko, Andrei Z., Ilja, Mike, Nikolay, Juha, Christian, Frederik, Peter, Andrei I., Farid, Mattias (at the minutes).

* 0. Misc: agenda, next meeting, publication, etc.


Agenda finalized:

- Wednesday: Release 0.6 issues

- Thursday:  Data Management Issues


Next meeting:

Copenhagen June 23-24 (pending OK from local organizers)


Publications:

- ARC: Marko will be editor.
       Draft May 30.
       Final version June 6.

- DC2: Based on the CHEP paper.
       Conclusions: Better data management needed
                    Better error handling needed
       Oxana and Mattias will divide tasks.
       Deadline June 13.


* 1. Action list follow-up from previous meetings.

Aleksandr: More meaningful error messages from the grid-manager. Some
  advances, more work in progress.

Aleksandr: Document current error codes in grid-manager manual appendix.

Everyone: Links to courses, tutorials, application and student
  projects should be sent to Oxana for publication on the web
  page. Permanent action item.

Mattias: Single user interface configuration file. Probably only for
  the new arclib based cli.

Balázs: Server configuration file clean-up.

Anders, Oxana: Download page reorganisation. Postponed until after new
  build structure.

Anders: CVS structure / build structure. Main topic at the next meeting
  in Copenhagen.

Anders: migrate to globus 3.2.1. Now obsolete. New action item: migrate
  to globus 4.0.

Oxana: Cached files are read-only. Should be documented in manuals.
  Still open.

Balázs: infosys provider for services. In progress.

All: Non-root service. In progress. Work started by Balázs.

Aleksandr, Leif: Separate authorization/authentication. Unixmap option
  in gridftp (documented, but not in template yet).

Aleksandr: Zombie job handling. Still open.

Balázs, Mattias, Aleksandr: cputime/walltime. Almost finished.
  Finalize and close.

Oxana: Document benchmarks in manuals. Finalize and close.

Balázs: Propagate grid-manager's internal limits to infosys.

Aleksandr: Utility to clean broken jobs. Plan to implement as feature
  in gm-jobs utility.

Katarina, Aleksandr: logger interface in CVS. Part of CVS
  reorganization?

Aleksandr: Logger clean-up (garbage collecting). Done.

Balázs: Logger reliability. Postponed until after logger database
  reorganization.

Aleksandr: Server side error codes. See above.

Mattias: Client side error messages. Still open.

Andrei I.: logger database reorganisation. Later in the summer.

Mattias: ngrerun. Almost there...

Aleksandr, Marko: Fireman built by Marko. Testing continues.

Marko: Collect DQ Requirements from Miguel. Wants SE info from MDS.

Aleksandr: host/user cert for RLS reg from SSE. Not yet. Issues with
  proxy lifetimes.

Oxana: I/O specification in XRSL. Still open. JSDL not possible
  without extensions. Switch to JSDL with "standard extensions".

Balázs: Merge nordugridmap.conf with nordugrid.conf

Balázs: Demo infrastructure. After 0.6 release.

Everyone: Document how to set priorities between jobs on a cluster.
  Too LRMS specific - probably better handled by LRMS documentation.

Marko: VOMS server tests. The client part works without modifications
  if the server is compiled with the new VOMS. voms-based GACL still
  not tested.

Andrei Z.: St Petersburg RLS server. Now running on a SuSE 9 machine.


* 2. Todo list for 0.6

Deadline for the 0.6 release is midsummer.

grid-manager / gridftpd
- non-root suid grid-manager.
- fireman client integration.

userinterface
- single configuration file for the new arclib based cli commands.
  objections from Oxana on this point.

jarclib:
- experimental, but include in distribution.

GUI:
- probably not ready for inclusion in distribution.

configuration:
- environment variables: Should not need more than globus and voms do
- add missing variables to the template

globus:
- Should be possible to compile w/o Replica Catalog libraries.

documentation:
- Anders: Release Notes
- Balázs: Feature List, Release roadmap
- Anders: Build Instruction, Dependencies (INSTALL file)
- Mattias/Balázs: Client/Server Install instructions
- Balázs: Configuration Instruction/Documentation
- Balázs/Aleksandr/Oxana/Mattias: Main Technical Manuals (original
  author should maintain)
- Anders: INSTALL, README, LICENCE
- Everyone: ChangeLog
- Anders: error code documentation

testing:
- need more 0.5.x servers for testing


* 3. Data Management

Use cases based on ATLAS DC experience (Oxana):

input file / output file staging should be automatic:
- Grid3: All files in one place.
- LCG: If input file in job description - all jobs to data.
       wrapper script that does data management - lots of assumptions
       about capabilities of worker nodes.
- ARC: The best solution. Grid-manager takes care of staging in and out
       data. Caching mechanism. The only bad thing is the handling of
       storage elements for output files (no checks for full SEs etc, no
       soft registering).

file copying:
- ngcopy is a nice tool for single file copy and registration (but
  same limitations as for grid-manager above).

- example: copy 4000 files at site A or site B where
    A: normal gsiftp
    B: no 3rd party, no single-thread, no modify time
  tricky with the present tools.

- requirements:
    meta storage
    dataset aware gridtools
    batch data movement service (should be possible to initiate from
      laptop and close down)
    access control (gacl) aware tools
    synchronize the gacls between replicas
    SE should be data management object (i.e. allow ls, cp, rm etc.)
      list SEs by VO, country, size, ACL
      list files on the SE
      move files from SE1 to SE2 (or from SE1 to a GRID)

Selected comments from the following discussion:
- Why more than one grid?
- DQ might be a good tool if further developed.
- gacl is a road block not a hole


Presentation of SSE (Aleksandr)

- gridftp is bad, because ftp is bad
- hence use http. It supports multiple streams, chunks and has
    standard secure channel https (or use globus httpg)
- uses host certificate for registration
- would like to have fine-grained delegation in proxy

Peter: Should be easy to write an SRM layer on the server side.


Presentation of gLite (Peter)

- storage elements - use what exists
- storage resource manager (SRM) - uniform interface to various mass
    storage technologies
- access protocols - gsiftp, https, rfio, ...
- catalogs - fireman file/replica/authorization/metadata
             gLite standalone metadata catalog
             supports unix like namespace (directories)
- posix I/O (gLite I/O based on alien I/O) through dedicated server
    but need file on local cluster
- file transfer service / file placement service
- data scheduler planned for release 2.
- user interface: glite-get glite-put, glite-rm (on LFN or GUID)
                  glite-catalog-* commands (ls, create, rename, ..)
                  glite-transfer-* commands (submit, status, cancel, ..)
- APIs: glite-io (C), fireman (C, C++, Java, Perl),
        file placement (C, C++, Java, Perl)
- POOL File Catalog API (glite catalog implementation)
- Catalogs store "basic permissions" and ACL
- grid-only access model vs. mixed local and grid model

Selected comments from the following discussion:
- Namespaces in the file catalog:
    Can I ask "what SEs service a specific namespace"?
- Is fireman VO specific? Answer: yes.
- Which file do you get when you belong to more than one VO?
- WSDL for fireman from EGEE webpage
- symbolic links are allowed but not hard links
- symbolic links are only allowed for files, not for namespaces (directories)
- file placement service transfers files using server certificates


* 4. Swiss feature requests (Frederik)

- retries for uploads 
- downloads can be bottleneck since handled by the frontend only
- prioritization between jobs at stage-in
- queuing jobs can prevent pending jobs from starting even though they
    are handled by a different LRMS queue.
- get rid of the shared session directories (depends on server, does
    not scale, scp)
- better support for non-PBS schedulers
- grid-manager scalability problem for > 1000 jobs
- grid-manager as non-root does not work
- job submission can fail if many jobs are submitted too quickly
- support for queues with fast response
- logging service allowing tracking of CPU usage by VOs
- well documented and easy to use file catalogue
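
The "retries for uploads" request above could look roughly like the
following Python sketch. It is only an illustration under assumptions:
the name upload_with_retries, its parameters and the use of OSError to
signal a failed transfer are hypothetical and not taken from the
grid-manager code.

```python
import time


def upload_with_retries(upload, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call upload() until it succeeds or the attempts are exhausted.

    upload is any callable that raises OSError on a failed transfer
    (an assumption of this sketch). The delay doubles after each failure.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return upload()
        except OSError:
            if attempt == attempts:
                raise  # give up and propagate the last error
            # Wait base_delay, 2*base_delay, 4*base_delay, ... between tries.
            sleep(base_delay * 2 ** (attempt - 1))
```

The sleep function is passed in as a parameter so that the back-off can
be tested without actually waiting.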


* 5. Runtime Environments (Juha)

Would be nice if people could follow the recommendations.
Not easy to enforce.
RTEs must be documented to be useful.
No clear overview of what application belongs where in the namespace.
Placed according to the wishes of the maintainers.
Can the monitor distinguish between registered and not registered RTEs?
Test RTEs must not match production RTEs

Problems with parallel environments since they have many version
numbers (version of MPI implementation, version of compiler, version
of implementation)

Some RTEs are defined for things that would fit better in the
information system, like e.g. LOCALDISK, TESTSITE.


* 6. Logger service

Andrei I. will redesign the database. Asking for use cases. Do we want
to do queries on xrsl?

XRSL attributes that need to be queryable should be duplicated in the
database.

Change to new version of usage record.
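
One possible reading of "duplicate the queried attributes" is sketched
below in Python with sqlite3. Everything here is an assumption made for
illustration: the table name usage_record, the column choice (jobname,
queue, cputime) and the helper names are hypothetical, and the real
logger database and usage record format are not described in these
minutes.

```python
import sqlite3

# Hypothetical set of xrsl attributes worth querying; the real list is undecided.
QUERYABLE = ("jobname", "queue", "cputime")


def create_schema(conn):
    """Create one table holding the raw xrsl plus the duplicated attributes."""
    conn.execute(
        "CREATE TABLE usage_record ("
        " job_id TEXT PRIMARY KEY,"
        " xrsl TEXT NOT NULL,"
        " jobname TEXT, queue TEXT, cputime INTEGER)"
    )


def store_job(conn, job_id, xrsl, attributes):
    """Store the xrsl verbatim and copy the queryable attributes into columns.

    attributes is a dict already extracted from the xrsl by the caller;
    parsing xrsl is out of scope for this sketch.
    """
    values = [attributes.get(name) for name in QUERYABLE]
    conn.execute(
        "INSERT INTO usage_record (job_id, xrsl, jobname, queue, cputime)"
        " VALUES (?, ?, ?, ?, ?)",
        [job_id, xrsl, *values],
    )


def jobs_in_queue(conn, queue):
    """Query on a duplicated column instead of searching the xrsl text."""
    rows = conn.execute(
        "SELECT job_id FROM usage_record WHERE queue = ? ORDER BY job_id",
        (queue,),
    )
    return [row[0] for row in rows]
```

With this layout the xrsl itself never has to be parsed at query time;
only attributes outside the duplicated set would require that.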


* 7. Renaming stuff

ngcopy -> ngcp
ngremove -> ngrm
ngrequest -> ngtransfer

keep MDS namespace nordugrid and ng* cli names
change RPM packages to arc-...
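
If old scripts need to keep working for a while, the command renames
listed above could be handled by a small translation table, sketched
here in Python. The helper name translate is hypothetical and no such
compatibility layer was decided at the meeting.

```python
# Old command name -> new command name, as listed in the minutes.
RENAMED = {
    "ngcopy": "ngcp",
    "ngremove": "ngrm",
    "ngrequest": "ngtransfer",
}


def translate(argv):
    """Return argv with an old command name replaced by its new name.

    Commands that were not renamed (e.g. ngsub) pass through unchanged.
    """
    if not argv:
        return []
    return [RENAMED.get(argv[0], argv[0]), *argv[1:]]
```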


* 8. Future Features

- daemonizable ngsub for automatic retrieval.

- make gmlog file uploadable (should be possible to specify the gmlog
    directory in outputfiles argument).


* 9. Middleware lookaround

EDG/LCG: data management compatible with arc through gsiftp, RLS and DQ
gLite: We will use fireman. gsiftp protocol in common for data transfers

fireman server to be set up in Oslo.

SRM basic functionality to be implemented in SSE. Client SRM already supported
in 0.5.x

globus 4 - what do we want/need from the new functionalities?

  - CAS - similar to VOMS but not widely deployed - stick to VOMS for now
  - web service containers - maybe
  - myproxy - maybe
  - RFT - reliable file transfer service - maybe

  - new gridftp server implementation
     globus-ftp-control API might be dropped by globus in the future