The meeting takes place in Room C366, Fysicum, Professorsgatan 1, Lund. The three-day meeting starts on Monday, September 13, at 10:00 and finishes on Wednesday, September 15, in the afternoon.
Next meeting: in Linkoeping, October 18 to 20
NorduGrid Technical meeting, 13-15 September 2004
[logger] Monday
(Monday, Katarina Pajchel)
-logger reliability,
-tests, running services
-interface {Katarina Pajchel}
[other middlewares, interoperability] Monday
(Tuesday or Wednesday, Mattias Wadenstein)
-LCG/EGEE
-Condor-G
[LRMS, grid layer, frontend model, platforms] Monday
-frontend model extension/performance
-grid-manager latency (5 seconds hello grid jobs?)
-LRMS backends status {Balazs Konya}
-sessiondir, controldir cleanup
-zombie jobs on the frontend
-log rotation
-gm internal limits
[build & packaging, integration of new modules]
(Monday afternoon, Tuesday, Wednesday, Johan Tordsson)
-arc CVS
-arc-globus
-ng -> arc renaming
-gridssh (Tuesday)
-benchmarking stuff, integrating Johan Tordsson's work
-conflicting files in client & server rpms
-TESTING (no sites running the tags)
-shrinking number of build platforms
[userinterface] Wednesday
(Monday afternoon, Tuesday, Wednesday, Johan Tordsson)
-The original command-line UI: ui config file, giislist, ngresub
-The GridBlocks portal
-The NDGF DALTON portal
-The Qt-based GUI {Jakob Langgaard}
-Java-based GUI by Ilja (in progress) {Ilja Livenson}
-Python-based application specific UI by Jonas (?)
-NDGF Grid-BLAST
-job-manager from Aalborg
-alternative broker from Umea
-more
[data management] Tuesday
(Tuesday or Wednesday, Mattias Wadenstein)
-RLS robustness in case of a dead SE
-gridftp clients: gsincftp vs uberftp {Nikolaj Kutovskij}
-reliable automatic replication tool
-RLS-replica-manager API {Jakob Langgaard}
-SE configuration
[schemas, attributes, meanings]
-walltime limits, requesting walltime, cputime/walltime xrsl stuff {Balazs Konya}
-meaning of the memory attribute/xrsl {Balazs Konya}
-job exit codes, failures
-job states {Balazs Konya}
-logger schema {Jakob Langgaard}
[road to version 0.6]
-new features already on the 0.5 branch
-features that must be implemented for 0.6
-timeline for the 0.6
-ARC's build/runtime dependencies
-TESTING!
[DC2, lessons, status, future]
-most common failures, reasons
-Oxana's list:
1) bulk file replication tool (URGENT) and a human-friendly RLS-interface
2) re-introduction of the dataset concept, replication by dataset
3) CA credentials monitoring/management tool
4) adaptive scheduling, re-scheduling
5) adequate information collection, esp. on shared back-fill-type clusters
6) logging and bookkeeping: stored info and tools
7) generic LRMS back-end
8) RLS servers discovery (e.g. publication of the contact string in MDS)
9) information system description and configuration manual (URGENT)
10) apt-get/yum repository
11) massive jobs fallout prevention (e.g. urgent migration of queued jobs, resubmission)
12) publication of host certificate expiration time in MDS and usage of this info in brokering
13) tape storage interface
14) SE browser tool, with possibility to modify file access rights
15) multiple RLS servers
[misc]
-follow-up from the previous meeting
-bugzilla status, policies!
-next meeting
-NGN
-regular VRVS
-updates on ongoing student projects, planned tutorials, grid courses
-handling, posting draft documents
[ARC-Grid deployment,management]
-CA repository
-Runtime environment registry
[applications]
-mpi runtimeenvironment
-arc client library
-application portals
Present: Balazs, Anders, Mikael, Jakob, Katarina, Ilja, Aleksandr, Mattias E., Nikolai, Marko, Mikko, Steffen Ramsøe, Leif, Magnus Ullner, Jonas Lindemann, Lars Malinowsky, Henrik, Mattias Wadenstein, Johan Tordsson, Niels, Jonas Bardino, Oxana (at the minutes).
Aleksandr: logger database setup. SOAP over HTTPS, not much protection otherwise. Kolya will test logger reliability. Logger clean-up (garbage collection) is needed: Aleksandr.
Installing the logger server is not a trivial quest: somebody will eventually prepare installation instructions and maybe even a package. Jakob will play around with NGLogger; rumours are that maybe somebody on Swegrid managed to set it up.
GUI for the logger: slides by Katarina.
Anders is concerned about privacy. To be discussed some time later (e.g. at the NGN meeting).
Logger DB and the interface should go into CVS.
Some discussion on GLite and LCG. Interoperability is not easily achievable. Standard interfaces are not enough; common schemas for the information system and data management are needed.
Log rotation: Aleksandr to add logrotate notes into GM manual and into nordugrid.conf template. Anders to make sure that log rotation is the default behaviour (in the server package).
Odd jobs: A utility to clean "broken" jobs is needed. Broken files / incomplete file sets in the control directory are not unusual. Aleksandr will provide some tool (?)
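What such a clean-up utility might check can be sketched as follows. This is only an illustration: the `job.<id>.<suffix>` file naming and the set of required suffixes are assumptions made for the example, and `find_broken_jobs` is a hypothetical helper, not the tool Aleksandr may provide.

```python
import os
from collections import defaultdict

# Assumed set of per-job files that must all be present in the control
# directory; the real grid-manager file set may differ.
REQUIRED_SUFFIXES = frozenset({"status", "description", "local"})

def find_broken_jobs(control_dir, required=REQUIRED_SUFFIXES):
    """Return {job_id: [missing suffixes]} for jobs with incomplete file sets.

    Assumes files are named job.<id>.<suffix>; anything else is ignored.
    """
    found = defaultdict(set)
    for name in os.listdir(control_dir):
        parts = name.split(".")
        if len(parts) == 3 and parts[0] == "job":
            found[parts[1]].add(parts[2])
    broken = {}
    for job_id, suffixes in found.items():
        missing = required - suffixes
        if missing:
            broken[job_id] = sorted(missing)
    return broken
```

The sketch only reports incomplete file sets; whether such jobs should be deleted or repaired is left to the operator.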
Zombie jobs: those whose owner is not authorised anymore. They will not be deleted for quite a while, and will keep appearing in ngstat etc. Annoying. Aleksandr to see how to allow non-authorised users to retrieve and clean their old jobs - if these were submitted while the user was still authorised.
Quick "Hello world" jobs: Aleksandr makes it a top priority. Implement a configurable scan period parameter.
Different LRMSes: PBS, SGE, Condor and fork are supported. LRMS back-ends: at the moment, not all of them read nordugrid.conf, while they must. No hardwired values, and no other configuration source.
Anders is obsessed by hyperthreading. Balazs suggests it is not a Grid issue, and Anders finally agrees.
Job storms: Anders has another problem: PBS can't swallow more than a certain number of jobs submitted simultaneously. He argues that GM should limit job storms somehow. Others think it's a local LRMS problem. Aleksandr: back-end-specific fixes should not go into GM. GM will display limits (via gm-jobs), which will be used by the infosystem to evaluate free CPUs per user.
GT3 packaging: The move to Globus 3.2.1 should happen a.s.a.p. Anders demonstrates Globus 3.2.1 on AMD x86_64. It is possible to install only the necessary Globus packages (for the client) rather than all of Globus. So Globus is modular now (many RPMs), and is available via yum/up2date. Balazs still wants a single Globus file; another issue is how regular users would install it. New build instructions and a patch list should be made available. Anders will start using GT3 for the nightlies. Virtual RPMs like globus-client and globus-server should be a good idea.
Henrik: problems with 2.6 kernel; need recompilation from source to solve the issue.
For INLRMS - a better name to be found. Generalized states to which various local peculiarities should be mapped:
R(unning)
Q(ueueing)
S(uspended)
H(eld)
U(nknown)
In case GM died and the state is unknown, the latest known state is published, appended with a note "GMdied".
There will be no blank spaces in the state string.
Final state to be split in 2: "FINISHED" and "FAILED". In the latter case, failure reason is taken from the GM files and appended.
Error message: will contain the state in which the job failed and the error message as proposed by Aleksandr.
Exit code of the job: to go into a separate attribute nordugrid-job-exitcode
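The generalized-state convention above can be sketched in Python. This is an illustration under stated assumptions: the PBS-style letter codes in the table, the function names, and the ":" separator before "GMdied" are invented for the example and are not the infoprovider's actual behaviour.

```python
# Hypothetical table mapping local batch-system state codes to the
# generalized states R, Q, S and H. The left-hand codes are assumptions.
LOCAL_TO_GENERAL = {
    "R": "R",  # running
    "E": "R",  # exiting, still occupying the node
    "Q": "Q",  # queued
    "W": "Q",  # waiting
    "S": "S",  # suspended
    "H": "H",  # held
}

def generalize(local_code):
    """Map a local LRMS state code to R/Q/S/H, or U if it is not recognised."""
    return LOCAL_TO_GENERAL.get(local_code, "U")

def published_state(last_known_state, gm_alive=True):
    """Build the published state string.

    Blank spaces are stripped, and if the grid-manager died the note
    "GMdied" is appended to the latest known state (the ":" separator
    is an assumption).
    """
    state = last_known_state.replace(" ", "")
    if not gm_alive:
        state += ":GMdied"
    return state
```

For example, `generalize("W")` gives `"Q"`, and any code missing from the table falls back to `"U"`. The FINISHED/FAILED split and the format of the appended failure reason are not covered here, since the minutes do not fix them.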
Some discussion on leading slashes in names (infoproviders strip them). Some discussion on MPI - nothing really useful it seems (in a cross-site sense). Codes are not really binary-compatible.
CA keys: Anders will remove the other CAs (except for the NG and Estonia ones); a link/virtual RPM including EUGrid PMA will be made and instructions will be written. By September 28, 2004, Anders finalizes the stuff and fixes the installation instructions.
Download site: Anders will change the entire repository; Oxana will follow this and change the download page accordingly.
Every executable must have a man-page
Build on GT3
CVS reorganisation: Urgent; Anders and Aleksandr will start the new CVS. Start on September 21, end by October 5 or 12. Only the HEAD version will be moved; 0.4.x stays in the old one.
"nordugrid" will be changed to "arc"; maybe also in infosystem, if it is not taken.
UI: few bugfixes.
Configuration: Will be improved, but no XML. UI configuration file to appear.
Users mapping: Separate it from the authorization procedure; dynamic user mapping. Leif to provide use case and algorithm to Aleksandr.
Non-root mode: someone has to investigate it
Plug-ins: There is a plugin "inputcheck" - tests whether an input file can be downloaded, because a CA key might be missing (doesn't make sense when used with several possible replicas). The job won't be accepted if there are no proper keys. Shipped with the s/w. Documentation is missing though. Same is true for other plugins. Aleksandr will add a block to nordugrid.conf.
Aleksandr's ideas for new features in 0.6: Web-service compatible with GRAM (as in GT3), SRM interface. Data indexing service better than RLS. Authenticated infosystem.
Aleksandr doesn't like LDAP server either. Prefers each service to serve its own information, no information collectors. Balazs disagrees.
In Linkoeping: go through the things done/to-be-done; candidate release by 1st November.
Test sites: Anders, Henrik, Mattias. To use "TESTSITE" runtimeenvironment and the authplugin to be provided by Leif, that accepts only jobs requesting this environment. The plugin to be shipped with the s/w.
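The check such a plugin would perform can be sketched as follows. This is a rough illustration, not Leif's plugin: it assumes the job description is available as xRSL text with a simple `(runtimeenvironment="TESTSITE")` relation, and `requests_testsite` is a hypothetical name.

```python
import re

# Matches a simple xRSL relation such as (runtimeenvironment="TESTSITE")
# or (runTimeEnvironment=TESTSITE). Attribute names are matched
# case-insensitively; comparison operators other than "=" are not handled.
_RTE_RELATION = re.compile(
    r'\(\s*runtimeenvironment\s*=\s*"?([^")\s]+)"?\s*\)', re.IGNORECASE
)

def requests_testsite(xrsl_text, required="TESTSITE"):
    """Return True if the job description requests the required runtime environment."""
    return any(name == required for name in _RTE_RELATION.findall(xrsl_text))
```

How the result is reported back to the grid-manager depends on the authplugin interface, which is not described in these minutes.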
Platforms: RH7.3, RH9, RHEL on Opteron, FC1 & 2, Debian, mdk10, SuSE? Others on case-by-case basis. Attempts to build natively on Windows continue.
UI configuration file.
uberftp: Nikolay's comparison table; uberftp is very fresh, less user-friendly, interactive only, and had problems with NG gridftp (fixed). So far, stay with gsincftp with the data corruption patch; re-evaluate uberftp later.
DC2 summary: Jakob; problem: pile-up, the cache can accumulate hundreds of files (600+ GB). Aleksandr says that if there's no space in the cache, files go to the sessiondir. Anyway, the recommendation is to enable the cache.
Related issue: the xRSL documentation should mention that the files in the cache are read-only. If cache=no, the files can theoretically be changed. Aleksandr plans to add an option to make cache files writeable.
DC2 problems: Oxana; Balazs will fix the RLS entry in MDS. And the infosystem manual (?). Anders will invent a cron-job that warns about a soon-to-expire host certificate. Big problem: there's a bug which causes segfaults if one of the SEs registered in RLS is unavailable.
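The core of such an expiry warning can be sketched in Python. This is a sketch under assumptions: the expiry date is taken as a string in the form printed by `openssl x509 -enddate -noout` (after the `notAfter=` prefix), the 30-day threshold is arbitrary, and the function names are hypothetical.

```python
import ssl

def days_until_expiry(not_after, now_epoch):
    """Days left before a certificate expires.

    not_after: expiry date such as "Sep 28 12:00:00 2004 GMT".
    now_epoch: current time in seconds since the epoch (e.g. time.time()).
    """
    expires = ssl.cert_time_to_seconds(not_after)
    return (expires - now_epoch) / 86400.0

def needs_warning(not_after, now_epoch, warn_days=30):
    """True if the certificate expires within warn_days (assumed threshold)."""
    return days_until_expiry(not_after, now_epoch) < warn_days
```

A cron job could call `needs_warning` once a day and mail the site administrator when it returns True.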
RLS: Jakob; RLS does not freeze anymore thanks to changes by Globus. There are reports of multiple RLS servers getting out of sync, so only one server is up. Recently discovered problem: the reference count is inconsistent. This may lead to unexpected removal of a record. Showed a clumsy, very low-level native RLS command-line interface.
SSE: Aleksandr; SSE is essentially a flat storage, knows nothing about collections; registers automagically to RLS and synchronises its contents with it. Requires no special configuration. No bulk operation tools. "Browsable" by ngls. File transfer: ngcopy (no 3rd-party transfer). ngrequest provides 3rd-party transfer (gridftp to gridftp, any location to SSE). Anders suggests having ngrequest as an option of ngcopy. To be discussed. Also ngremove and ngacl. XACML is unacceptably heavy-weight. Files stored at SSE are validated against size and checksum (cksum, md5 are supported). Hence reliable file transfer - as much as possible.
Replication Tool: Jakob decided to implement Oxana's old wishlist in a Replica Manager API. A client based on RLS. A collection is a special LFN (specific pattern).
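The size-and-checksum validation mentioned for SSE can be illustrated with a short Python sketch. It covers md5 only (cksum is not shown), and `validate_file` and its parameters are hypothetical names, not the SSE implementation.

```python
import hashlib
import os

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def validate_file(path, expected_size, expected_md5):
    """Return True only if both the size and the md5 checksum match."""
    if os.path.getsize(path) != expected_size:
        return False  # cheap check first, avoids reading a truncated file
    return md5_of_file(path) == expected_md5.lower()
```

Checking the size before the checksum rejects truncated transfers without reading the whole file.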
RLS has a zillion of problems (single point of failure, no fine-grained authorization, no collections etc), but there's nothing better on the market. An indexing service should be written from scratch.
Command-line UI: configuration file is needed. Mattias will look into it.
Johan's scheduler: there's some messy prototype, to be rewritten using the new libraries. Feature: to be able to use advance reservation.
Benchmarking effort gradually dies out.
Ilja's GUI: a demonstration. Proxy initialization, job creation, submission and monitoring work. Needs JRE, CoG, LDAP, NorduGrid and some other libraries. Altogether it is 6-7 MB. In principle, it can be used via WebStart. People would like to see it implemented as a portal and/or as an applet. Everybody would like to see it in CVS.
Babysitting: some argue babysitting should be in UI, some think not.
Jakob's GUI: a demonstration. A requirement: configurable, extensible application plugin - to spare users from typing xrsl. This should eventually have application-specific menus, buttons. Mikael: GUIs must be application-specific, not Grid-specific; there is no sense in plugging applications into a GUI, but rather Grid into existing application GUIs. E.g., an ARC plugin for Mathematica. Anders wants to have the same features in both GUIs. Markko will see how to extend the portal accordingly as well.
LUNARC Applications portal: Jonas L., Web portal. Non-generic, dedicated to specific (including commercial) applications.
Library: Henrik was extremely unhappy with the existing libngui one, went on creating a better, developer-friendly set. Jakob helped. The ultimate goal: to have jobmanager using Jakob's GUI and Johan's scheduler. Should be a standalone library - except for the Globus dependency. 0.7 will start moving to these libraries; some already can be used. Everything is called "arc" there. All the services sooner or later will switch to the new library, as long as it works.
Indexing service idea: Jakob and Anders have an idea to re-use the file system and GACL for an indexing service. Many dislike the idea, as it doesn't solve the problem of authorisation (synchronisation of this with actual data), and looks like it'll be slow.
Monitoring: a geographical dynamic map is a good task for a thesis.
Applications: Jonas points out that users are used to getting their files to their home areas without any SEs. Henrik says their job-manager is meant to do exactly this. It is good to reduce as much as possible the extra steps needed on the Grid as compared to a local batch system. Jonas to post the comparison.
Runtime environment: link Juha's page (Oxana).
Web page: add links to courses, tutorials, student projects (send links to Oxana).
Bugzilla: everybody is encouraged to sign up and submit all the bugs and enhancement requests there. Don't forget to close reported bugs. More flags are needed: e.g. "CONSIDERING"; Anders will have a look.
Wiki: Markko to look into setting up a Wiki area for NorduGrid drafts at HIP's premises.