NorduGrid technical meeting

March 9-11, 2002, CERN

Present: M.E., A.K., B.K., O.S., A.W.

Minutes by O.S.

Day 1

 * A.W., O.S.: reports from the 4th EDG Workshop in Paris
   - EU review of the EDG: very positive, the most successful project
     ever
   - However, application WP's (8-10) and some others are not quite
     satisfied, and request stability, stability and stability
   - EDG roadmap: most middleware workpackages are coming with
     development of advanced versions of the stuff, but no bug fixes
     for the existing one. ITeam intends to push WP1-5 towards bug
     fixing though, in order to achieve a stable production release.
   - Some details: 
     1) Globus moves towards OGSA, first pre-alpha to be released in
      April; Globus 2 to be finalized and will be co-existing for
      another couple of years
     2) Ftree is dropped, and EDG's WP3 is preoccupied with R-GMA,
      though it is used by no one
     3) A.W. gave a talk on licenses for EDG: BSD-type seems to suit
      everybody so far, Slashgrid is about to adopt one as well
     4) ATF (Architectural Task Force) is resurrected and re-shuffled,
      aiming at [finally] drafting an EDG architecture; there's
      certain interest to evaluate the NorduGrid's architecture

 * A.W. will present the NorduGrid architecture at the PARA'02
   conference in Espoo; contribution to be ready by April 1st and to
   be distributed within EDG's ATF

 * Issue of cooperation with EDG w.r.t. developing NorduGrid's own
   installation: keep collaborating, sacrificing a machine per site to
   EDG's exercises, but only with a second priority

 * Issue of ATLAS software
   - Jacob is preparing AFS-free RPM's (few days later: Christian
     Arnaud & Cal Loomis did the same, but never tested :-)
   - Objectivity is officially dropped by ATLAS in favor of ROOT/IO
   - RedHat 7.2 is about to get certified as the official CERN
     platform
   - DC1 is the natural application, starts on April 15

 * Organisational issues:
   - Milestones:
     1) March 22nd (read: April 1st): architecture paper
     deadline. Paper and its follow-ups to be submitted to all
     possible forthcoming conferences (Balazs will distribute the list
     with deadlines), and possibly submitted as the ATLAS note. To be
     evolved in 3 instances: proposal -> implementation -> application
     2) April 15th: start of DC1
     3) May 18th (?): NorduGrid meeting in Helsinki. Demonstration of
     the NorduGrid architecture, using ATLAS-related jobs
     4) June 15-18: PARA'02 conference in Espoo

   - E-mail discipline: 
     1) mails with the tag "URGENT" in the subject line must be
     answered by everybody within 48 hours 
     2) threads are to be marked with a keyword (e.g., Information
     System, Grid Manager etc)
     3) initiator of a thread has to summarize the discussion before
     closing it

   - Remote conferencing: keep on calling for a phone conference
     whenever necessary; look into the possibility of
     videoconferencing (VRVS?)

   - Next meeting is to be the dedicated integration meeting; to be
     called end-April - beginning of May (May 1st is Wednesday) for 3
     days in Copenhagen

 * Short discussion of the Globus roadmap (after the slides of Bill
   Allcock). OGSA is OK, but not of an immediate concern. Updates to
   GridFTP, GRAM etc will be released meanwhile.

 * Packaging:
   - RPMs to be created from Globus' official releases (4 bundles?),
     not CVS
   - number of RPMs must be limited to 5 (server, client, common,
     development, info)
   - NorduGrid stuff should be installable on the top of an existing
     Globus installation
   - NorduGrid distribution should contain a lightweight, stripped
     version of Globus, possibly with necessary patches and fixes
   - User Interface (UI) (client) subset should be self-contained and
     sufficient, i.e., installable on machines without pre-existing
     Globus and without being a superuser. Must contain GridFTP and
     LDAP (GSI-based search)

 * Location of the NorduGrid software: to be referred to as
   NORDUGRID_LOCATION. Must not be hard-coded; standard location:
   /opt/nordugrid/ , containing the whole necessary subtree (./bin ,
   ./lib , ./etc ...). Everything must be relocatable.

 * Configuration files: the Information System (IS) and the Grid
   Manager (GM) need one each, but they can be merged into a single
   nordugrid.conf file, residing at a standard location: either
      /etc/nordugrid.conf
   or
      $NORDUGRID_LOCATION/etc/nordugrid.conf
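
   Note: the two points above (relocatable NORDUGRID_LOCATION and the
   two possible places for nordugrid.conf) could be sketched as
   follows. This is only an illustration: the function names and the
   search order (/etc first) are assumptions, nothing of the kind was
   agreed upon at the meeting.

```python
import os
from pathlib import Path

# Standard location named in the minutes; used only when the
# NORDUGRID_LOCATION environment variable is not set.
DEFAULT_LOCATION = "/opt/nordugrid"


def nordugrid_location(env=None):
    """Return the installation root without hard-coding it."""
    env = os.environ if env is None else env
    return Path(env.get("NORDUGRID_LOCATION") or DEFAULT_LOCATION)


def config_candidates(env=None):
    """Candidate paths for nordugrid.conf (search order is an assumption)."""
    return [
        Path("/etc/nordugrid.conf"),
        nordugrid_location(env) / "etc" / "nordugrid.conf",
    ]


def find_config(env=None, exists=Path.exists):
    """Return the first existing candidate, or None if there is none."""
    for path in config_candidates(env):
        if exists(path):
            return path
    return None
```

   A relocated installation would then only need NORDUGRID_LOCATION
   to be set before any of the tools is started.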


Day 2

 * Slashgrid: A.W.'s account. An example setup to be done at NBI, to
   check the functionality of the certificate-based access

 * Information System
   - Issue of prefixes in attributes: prefixes should be identified
     uniquely with each organisation, e.g., nordugrid. It should be
     possible to choose the prefix during the installation/setup
   - nordugrid-cluster-nodecpupower: at present expressed in bogomips,
     but since it is not used for scheduling, should be more
     human-readable.
   - an explanatory description of attributes and possible accepted
     values should be made (B.K.)
   - RC info: to add DN of the catalogue and DN's of collections
   - SE info: to add access protocol and mount point (a la EDG)
   - queue: B.K. to write suggestions for a NorduGrid PBS queue
     configuration
   - nordugrid-authuser-sn: to be shortened to just a real human name,
     plus some ID in case of identical ones

 * RSL
   - not all the Globus-defined attributes need to be supported by the
     GM (O.S. to prepare the final list)
   - non-supported attributes should produce a warning message
   - several attributes (executable, arguments, stdin, stdout, stderr),
     specified Globus-way, will have to be re-written by the UI to
     suit GM (executable: dummy, arguments: actual executables etc)
   - startTime: refers to the download start time, not execution
   - lifeTime: from the moment the job is finished; to be added to
     MDS, can be user-specific
   - some additional (to the previously distributed list) attributes:
     action, jobid (internal for UI), lrmsType, replicaCollection
   - O.S.: to prepare the RSL template
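
   Note: a possible shape of the UI-side RSL handling discussed above
   (warning on non-supported attributes, re-writing of executable and
   arguments for the GM). The job is represented as a plain Python
   dictionary rather than real RSL text; the SUPPORTED set, the
   placeholder executable and both function names are hypothetical,
   since the final attribute list is still to be prepared by O.S.

```python
import warnings

# Hypothetical set: attributes mentioned in these minutes only.
SUPPORTED = {
    "executable", "arguments", "stdin", "stdout", "stderr",
    "startTime", "lifeTime", "action", "jobid", "lrmsType",
    "replicaCollection",
}

# Placeholder for the "dummy" executable; the actual value is an assumption.
DUMMY_EXECUTABLE = "/bin/true"


def warn_unsupported(job):
    """Emit one warning per non-supported attribute and return their names."""
    unsupported = sorted(name for name in job if name not in SUPPORTED)
    for name in unsupported:
        warnings.warn(f"RSL attribute '{name}' is not supported")
    return unsupported


def rewrite_for_gm(job):
    """Return a copy where the real executable is moved into the arguments."""
    out = dict(job)
    if "executable" in out:
        out["arguments"] = [out["executable"], *out.get("arguments", [])]
        out["executable"] = DUMMY_EXECUTABLE
    return out
```

   The input dictionary is left untouched, so the UI can still show
   the user the job description as it was originally written.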

 * Grid Manager
   - A.K. produced a flow-chart, which has to be documented
   - GM starts downloader, Globus job submission (jobmanager-ng),
     uploader etc
   - job status info (files), RSL etc are contained in the job control
     directory; status files to be owned by root
   - each job is assigned a session directory, path of which is a part
     of the JobID (JobID is being put into MDS, this information can be
     accessed only by the grid-mapped users)
   - status of a job is scanned by the Helper from PBS logs (instead
     of issuing qstat). Possible values:
      ACCEPTED/PREPARING/EXECUTING/FINISHING/FINISHED
   - upload of directories is not supported; neither are wildcards
   - if RC location is not specified, GM should use the local one (?)
   - jobmanager-ng is enabled in Oslo
   - A.K.: to send around an example of user-side RSL file
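
   Note: the job states listed above could be represented as below.
   The state names are taken from the minutes; the forward-only
   ordering rule and the function name are assumptions made for the
   sake of illustration, not a description of what the GM does.

```python
from enum import Enum


class JobState(Enum):
    """Job states as listed in the minutes, in their assumed order."""
    ACCEPTED = 1
    PREPARING = 2
    EXECUTING = 3
    FINISHING = 4
    FINISHED = 5


def is_forward(old, new):
    """Assumption: a job never moves back to an earlier state."""
    return new.value > old.value


def parse_state(text):
    """Turn a status string (e.g. read from a status file) into a JobState."""
    return JobState[text.strip().upper()]
```

   Looking up an unknown name raises KeyError, which makes a
   mistyped status value show up immediately.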

Day 3

 * User Interface
   - performs job submission/cancellation/status query
   - all active job ID's should be listed in the jobhistory file. Upon
     job completion, its ID is removed from the list. The jobhistory
     file should be reconstructable from the MDS (in case of accidental
     removal)
   - user commands suggestions:
     ngsub   : submits a job
     ngkill  : terminates the execution
     ngclean : terminates the execution and erases all the traces of
	       the job, including the info in MDS
     ngget   : retrieves the output, optionally issues ngclean
     ngstat  : queries MDS and retrieves the job status (options
	       -f[ull] , -a[ll], -u[ser]  should be
	       available). Job final status: either SUCCESS or
	       FAILURE; in the latter case ngget returns all the
	       associated files
     ngresub : moves a job from a queue to another (forced
	       re-scheduling)
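
   Note: the jobhistory bookkeeping described above (ID added on
   submission, removed on completion) might look like the sketch
   below. The file format (one job ID per line) and all function
   names are assumptions; reconstruction from the MDS is not shown.

```python
from pathlib import Path


def read_history(path):
    """Return the list of active job IDs; a missing file means no jobs."""
    p = Path(path)
    if not p.exists():
        return []
    return [line.strip() for line in p.read_text().splitlines() if line.strip()]


def write_history(path, job_ids):
    """Write the job IDs back, one per line (assumed file format)."""
    Path(path).write_text("".join(job_id + "\n" for job_id in job_ids))


def add_job(path, job_id):
    """Record a newly submitted job, avoiding duplicate entries."""
    job_ids = read_history(path)
    if job_id not in job_ids:
        job_ids.append(job_id)
        write_history(path, job_ids)


def remove_job(path, job_id):
    """Drop a completed job from the list."""
    write_history(path, [i for i in read_history(path) if i != job_id])
```

   In this picture ngsub would call add_job, while ngclean (or ngget
   followed by ngclean) would call remove_job.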

 * Application runtime environment
   - at the moment, ATLAS software releases. RPMs are on the way
     (A.W.), to be installed at all the sites concerned
   - in future, a description for each environment is needed
   - possibly consider CERNLIB etc as a runtime environment?

 * Storage Element (SE)
   - each site: to set up a SE for test (not necessarily for ATLAS DC,
     since that one would need 2 TB of storage space). An SE better to
     be a separate machine, allowing independent user-mapping
   - main requirement: a user should be able to upload files from a
     UI to a SE at any time, without submitting a job
   - open question: shall a new RSL parameter, requesting the job to
     be moved to the data, be introduced?
   - issue of mirroring (caching) data on request: not feasible with
     existing tools, but can be a part of a future architecture

 * Immediate actions:
   - SE: each site to set up a separate machine with GRIS, mapping
     everybody to a single user (B.K.: week 12, A.W.: week 13, M.E.,
     A.K.: to study the possibilities)
   - IS (B.K.): 
     1) update the schema
     2) modify the providers (user - queue length, job - GM status)
     3) re-write parts related to the static information: to be read
	from the configuration file
     4) prepare description of all attributes and options
   - PBS configuration (B.K.):
     1) cluster configuration suggestions
     2) test script checking whether the PBS configuration makes sense
   - RSL
     1) example of an RSL script: A.K.
     2) list of attributes and a general RSL template: O.S.
   - GM (A.K.)
     1) implement root ownership for the common control directory
     2) provide an input for the configuration file
     3) documentation (including the flowchart)
     4) enable automatic selection of location in RC
   - packaging (A.W.): to come with a reasonable proposal of how to
     fit a stripped Globus and the NorduGrid package into 5 RPMs
   - UI (M.E.)
     1) documentation (flowchart) of brokering
     2) actual implementation
   - Application runtime environment: A.W. to send around the ATLAS
     software details
   - Cooperation with EDG: sacrifice a machine per cluster to commit
     EDG/Testbed1 exercises
   - Remote conferencing: A.W., A.K. to study the issue of a VRVS
     virtual room (reflector)
   - Slashgrid: A.W. to set up an example at NBI