The list describes missing functionalities and necessary enhancements. While some tasks could be solved by a single developer, others need a dedicated working group. Contact the NorduGrid staff if you or your group is interested in taking over a task or would like to propose a new one.
| Area | Task | Developers | Supervisors | Links | Priority | Status |
| Indexing service for stored files | Requirement collection and review of existing tools and services | O.Smirnova | 1 | |||
| Service design: reliable, decentralized, interfaced to storage facilities, intelligent, fast | Indexing service proposal draft; Data Management System design draft | 1 | ||||
| User- and group-based access control | 2 | |||||
| Management tools (copy, move, delete, rename, replicate etc) | A.Konstantinov | Replication tool requirements | 1 | |||
| Disk SE interface | A.Konstantinov | 1 | ||||
| Interface to mass storage system(s) | 1 | |||||
| Storage element | Disk-based SE concept | A.Konstantinov | Data Management System design draft | |||
| "Smart" and "Stupid" SE testing | J.Nielsen | |||||
| Mass storage support | ||||||
| User- and group-based space control, quotas | ||||||
| Role-, user- and group-based access rights and permissions | A.Konstantinov | |||||
| GACL tests, user guide, migration | A.Wäänänen | GACL | ||||
| "Stupid Storage Element" configuration | B.Kónya | |||||
| Information system | Authorized access to information | B.Kónya | ||||
| Requirements collection and design for the information indexing service | B.Kónya | |||||
| Fail-safe topology | B.Kónya | |||||
| Fast / scalable response, performance studies | B.Kónya | |||||
| QoS control, registration authorisation | B.Kónya | |||||
| Storage Element information | B.Kónya | |||||
| Better support for schedulers (like Maui) | B.Kónya | |||||
| Logging, bookkeeping and accounting | Authorized access | |||||
| Full job history / provenance data | A.Konstantinov | Requirements proposal | ||||
| System performance statistics | ||||||
| Resource usage account per user and per group (CPU, memory, disk space, bandwidth etc) | ||||||
| Web interface to logs | O.Smirnova | |||||
| Accounting plugins | L.Nixon | |||||
| User- and group-based management (authorisation and resource allocation) | User and group policies | A.Wäänänen | ||||
| Set up a security group | A.Wäänänen | |||||
| CA Web page: certificate request on-line, public keys etc | A.Wäänänen | |||||
| Extended/unambiguous information associated to a personal certificate | ||||||
| Replace existing VO server with VOMS one | B.Kónya | |||||
| Task-driven public key downloads | ||||||
| Proxy storage/delegation service (MyProxy?) | ||||||
| Grid Manager | Replacing GridFTP with a more sophisticated protocol(s) (?) | |||||
| Quotas for session directories | ||||||
| Support for clusters without shared file systems | ||||||
| Automatic registration of cached files into the indexing service | A.Konstantinov | Almost done | ||||
| Support for interactive tasks | ||||||
| Parallel input file download from different sources | A.Konstantinov | |||||
| Automatic compressing/decompressing input/output files | J.Nielsen | |||||
| Workload management / brokering | Respect for local policies (User- and group-based time allocation etc) | |||||
| Cost evaluation (data transfer, storage, execution) | ||||||
| Benchmark-based load balancing | ||||||
| Re-scheduling, re-submission, recovery | ||||||
| User- and group-based resource discovery | ||||||
| Cross-cluster parallelism | ||||||
| Installation | Installer/uninstaller script(s) and/or procedures | A.Wäänänen | ||||
| Configuration tool | A.Wäänänen | |||||
| Clean up external packages, sort out dependencies (INSTALL file etc) | A.Wäänänen | |||||
| Build-on-demand | A.Wäänänen | Almost done | ||||
| User-specific software installation and support | ATLAS EventFilter | J.Nielsen | ||||
| ATLAS Production system interface | J.Nielsen | Production System Proposal | ||||
| Interoperability with POOL | J.Nielsen | POOL – Persistency Framework | ||||
| Other | New technologies: GT3 | Cs.Anderlik | ||||
| Different systems (Solaris, Mac, Windows) | ||||||
| Interface to Condor, – both GM and IS | H.Riiser | |||||
| Performance tests: resources, users, jobs | ||||||
| Setting up a user support system | ||||||
| Graphical user interface |
People attending: Balázs Kónya, Aleksandr Konstantinov, Mattias Ellert, Oxana Smirnova, Ugur Erkarslan, Jakob Nielsen, Anders Wäänänen, Csaba Anderlik, Haakon Riiser (Thursday/Friday), Vandy Berten (Thursday/Friday), Morten Hansen (Friday), Leif Nixon (hit-and-run)
Minutes by J.N.
Haakon Riiser from Oslo, Norway: Seeking a project for his master's thesis. Has worked on Condor before.
Csaba Anderlik from Bergen, Norway (NGDF): Norwegian representative in NGDF.
Vandy Berten, Bruxelles: Starting a Ph.D. on Grid brokering and scheduling.
Ugur Erkarslan, Oslo, Norway: Postdoc. Will stay 1 year in Oslo working on ATLAS/NorduGrid.
Morten Hansen, SDU, Denmark: Grid accounting, student of Brian Vinter.
27th – 28th of November in Lund.
One of the last two weekends in January. Anders will post which weekend he prefers on the mailing list next week. The meeting will take place in Helsinki. Balázs will arrange that with the Finns.
Aleksandr and Oxana were worried about the project: there is not enough manpower to develop new features, NorduGrid has reached saturation, and the NorduGrid project no longer exists politically.
Aleksandr read the tasklist from the last meeting. Many of the things mentioned there have not been done – or are just being done.
The main tasks are already well-defined.
Oxana wanted to know whether it was worth making another tasklist when the tasks on the old one have not been completed. Everybody seemed to agree that it was, and committed themselves to do more development.
Furthermore, a number of people present wanted to work on some of the tasks. The goal is to integrate these people into the development.
Make a tasklist and send it to the people who requested it (Brian, Niclas).
An indexing service is needed to replace the Globus Replica Catalog. The possibility to register collections in the indexing service is essential.
One task would be to collect requirements from different user groups and evaluate the solutions that already exist. Aleksandr suggested reviewing a simple indexing service based on local replica catalogs on each SE, connected to an RLS. Everybody agreed.
Balázs reviewed the current Storage Element solution. The configuration is very messy, but is being cleaned up. Balázs will work on a simplified configuration over the next weeks. Parts of this will be made obsolete by some of Anders' work on configuration (see later).
There are two plugins for the Storage Element: the simple fileplugin and the gaclplugin, which implements GACLs. Aleksandr suggested experimenting more with the gaclplugin, as it has not been tested much; this could for example be done in the upcoming Data-Challenge. Anders was put in charge of testing the gaclplugin. He wants to put the NBI users on the grid and give them access rights to their personal /data-directory matching those on the local /data-directory.
The SE does not report anything to, e.g., a (local) indexing service. A task would be to make it report its use to such an indexing service, for example a local replica catalog.
It would be nice to have some way of keeping quotas for users and VOs as part of the Storage Element solution.
Aleksandr has developed a general and simple framework for creating http-services. At the moment a Logger-backend has been created using this framework and an SE-backend (the new SE-service) is being worked on. The new SE-service is much more flexible than the old SE-solution. For example, one can configure the new SE so that it registers files into the Replica Catalog automatically.
Jakob wants to test the new SE – also in view of the upcoming Data-Challenge.
Aleksandr suggested deploying the new OGSA-based GT3 toolkit. He has tried it himself 5 times, but failed every time. It would take somebody with a lot of Java experience to understand the error messages. Csaba says he wants to try it.
Balázs reviewed the status of the information system. He has been investigating the performance of the information providers thoroughly. LDAP itself does not introduce much overhead. However, he has found and fixed some huge bottlenecks in the information providers that appear when a lot of jobs are present. The next releases should exhibit a big performance gain.
At the moment, there is no authorization regarding access to information. It would be nice to have an indexing-service for user-jobs, but so far there's no authorization layer, so every user can read the jobs of everybody else. This needs to be added. Balázs is thinking about reimplementing the ldap-backend to introduce authorization.
It was discussed whether to have blacklists for misbehaving clusters. There was no clear consensus on this point but definitely there should be at least a clear policy for blacklisting sites and a clear policy how to get off the blacklist.
Aleksandr has developed a logger-utility that sends job-information to a central logger-database. The uploading is done through the new https-framework. The logger will be included in the next release. It still needs a web-interface to present the information to users, and in particular an authorization component.
In the central database, one can already make simple queries. The address is: https://grid.uio.no:8001/logger. Offset should be less than 100.
Morten from SDU:
Two different problems exist: development of an economic model and development of a technical solution. He assumed that an economic model already exists and has the following suggestion for a technical solution:
Assume the existence of a grid-bank. When submitting a job, the broker and the resources should first negotiate a price. Morten assumes that this problem has already been solved. At job-submission, the user then submits a cheque for the specified amount together with the job-description. The resource gives this to the bank and the bank transfers the amount charged to a closed account. After the simulations the resource tells the bank what amount to really charge (for example, the user might not use everything that was requested, so the price might be lower).
Morten will work on the grid-bank implementation and the interface between the grid-bank and the resources but assumes there is an existing economic model. He will produce a design-document and post it to one of the nordugrid mailing lists.
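The cheque flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Morten's design: the class `GridBank`, its methods `deposit_cheque` and `settle`, and the in-memory balances are all hypothetical names, and price negotiation is taken as already done.

```python
class InsufficientFunds(Exception):
    """Raised when a cheque exceeds the user's available balance."""


class GridBank:
    """Toy in-memory bank (hypothetical): user balances plus one closed account per cheque."""

    def __init__(self, balances):
        self.balances = dict(balances)  # user -> available funds
        self.closed = {}                # cheque id -> (user, reserved amount)
        self._next_id = 0

    def deposit_cheque(self, user, amount):
        """The resource hands in the user's cheque; the amount moves to a closed account."""
        if amount < 0:
            raise ValueError("cheque amount must be non-negative")
        if self.balances.get(user, 0) < amount:
            raise InsufficientFunds(user)
        self.balances[user] -= amount
        self._next_id += 1
        self.closed[self._next_id] = (user, amount)
        return self._next_id

    def settle(self, cheque_id, actual_cost):
        """After the job, charge the real cost (never more than reserved) and refund the rest."""
        user, reserved = self.closed.pop(cheque_id)
        charged = min(max(actual_cost, 0), reserved)
        self.balances[user] += reserved - charged
        return charged
```

Capping the charge at the reserved amount is a design choice of this sketch; the minutes only say that the final price may be lower than the cheque.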
Leif presenting accounting ideas of Peter Gardfjall and Thomas Sandholm:
Similar to Morten's, except that the closed account is replaced by holds and hold-ids. The currency in the account is for the moment CPU-time, even across different clusters. So 30 minutes on Monolith is the same as 30 minutes on Ingvar.
Peter and Thomas don't want to code, and if they do, then in OGSA.
Leif also presented his own simple idea:
The user submits an xrsl to the GM, which (through a plugin) calculates the price and submits this to the bank. The bank says yes or no according to the amount on the account. If the bank is down, the GM says yes. At the end of the job, the amount spent is sent from the GM to the bank, which deducts the amount.
Aleksandr can write the hooks for the accounting plugin in about 1 day. Leif will write the accounting-plugin.
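Leif's idea can be sketched in a few lines. This is a minimal illustration, not the actual Grid Manager plugin interface: `InMemoryBank`, `BankUnavailable`, `estimate_price`, `accept_job` and `charge_job` are hypothetical names, and the price is assumed to be requested CPU minutes times a flat rate.

```python
class BankUnavailable(Exception):
    """Raised by the (hypothetical) bank client when the bank cannot be reached."""


class InMemoryBank:
    """Stand-in for the real bank, used only to make the sketch runnable."""

    def __init__(self, balances, up=True):
        self.balances = dict(balances)
        self.up = up

    def _check_up(self):
        if not self.up:
            raise BankUnavailable()

    def can_afford(self, user, price):
        self._check_up()
        return self.balances.get(user, 0) >= price

    def deduct(self, user, amount):
        self._check_up()
        self.balances[user] = self.balances.get(user, 0) - amount


def estimate_price(cpu_minutes, rate_per_minute=1):
    """Assumed pricing rule: a flat rate per CPU minute."""
    return cpu_minutes * rate_per_minute


def accept_job(bank, user, cpu_minutes):
    """Ask the bank for a yes/no at submission; if the bank is down, say yes."""
    try:
        return bank.can_afford(user, estimate_price(cpu_minutes))
    except BankUnavailable:
        return True


def charge_job(bank, user, cpu_minutes_used):
    """At the end of the job, send the amount actually spent to the bank."""
    bank.deduct(user, estimate_price(cpu_minutes_used))
```

The sketch lets `BankUnavailable` propagate out of `charge_job`; what should happen when the bank is down at the end of a job is not covered by the minutes and would need to be decided.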
Anders is working on an xml-configurator to simplify the configuration of the NorduGrid toolkit. He has implemented a schema and a stylesheet for creating documentation from the schema. The infoproviders will need to be rewritten.
Aleksandr does not want to replace gridftp with a more sophisticated protocol.
It would be nice to have quotas for session directories/input files. At the moment there is no guarantee that the disk-space requested at job-submission is available when the job actually starts. As a first step, it would be nice to have a small utility that calculates the disk-space usage of each user on the cluster. Everybody will think about a complete solution.
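A small per-user disk-usage utility of the kind mentioned above could look roughly like this. It is a sketch only: the function name `disk_usage_by_uid` is hypothetical, it attributes each file to its owning uid, and it sums apparent file sizes rather than allocated blocks.

```python
import os
from collections import defaultdict


def disk_usage_by_uid(root):
    """Walk `root` and return {uid: total bytes of regular files and links owned by that uid}."""
    usage = defaultdict(int)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                # lstat: do not follow symlinks, so linked files are not counted twice
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or is unreadable; skip it
            usage[st.st_uid] += st.st_size
    return dict(usage)
```

Pointing it at the directory that holds the session directories would give a per-uid total, which could then be mapped to user names with the standard `pwd` module.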
Aleksandr is working on automatic registration of cached files into the Replica Catalog. This can then be used in the brokering. An alternative solution is to have the cached files in the information system.
Anders has been looking into making a gsissh package. There are a couple of security issues though: if somebody steals a certificate, he will have complete shell-access to the cluster. The biggest problem is the configuration of the gsissh-package.
Mattias will update the man-page for ngrenew.
Mattias will try to improve the inputfile-situation for ngresub.
Anders wants to have a policy for introducing new people to NorduGrid: which access-list and VO should they be added to? Undecided.
Oxana would like to have automatic renewal of proxies for avoiding failed jobs due to expired proxies. It would therefore be nice to have support for MyProxy.
Balázs wanted to know if NG-patches to Globus are being fed back.
Mattias has uploaded his patches to the Globus bugzilla. So far nothing has been done on the Globus-side. Anders continues to apply the patches to the NorduGrid distribution.
Anders has built the new 2.4.3 globus. New features are the rls-client and the rls-server for the next release. Otherwise it's mainly a security update. There's also a new gpt package. NorduGrid builds without problems with the new gpt and globus.
Anders can build the NG-software as a NG-gridjob on lscf. It uses the same backend as the nightly build.
Balázs would like to have the package-dependencies in a file in the repository – both what is needed for building and what is needed for installing.
On the download page, there should be a configuration section. The packages globus-config, certrequest-config should be moved there.
There should be clear installation instructions and an installation/uninstallation script by the next meeting.
The goal of Condor is to harness computing power from unused machines. Condor has one central manager that locates a computer that is not currently executing jobs and is not otherwise in use. The central manager hosts a queue. One can at any time get the status of the jobs from the central manager.
Once an execution host has been found, a direct connection between the submit host and the execution host is created, and system calls are forwarded from the execution host to the submission host (the program being run should be relinked with a special static libc-library).
Because of this, it will be the responsibility of the user to provide the condor-binary.
Haakon Riiser would like to work on getting NorduGrid job-submission to work on Condor pools.
If this works, we will try ATLAS jobs on Condor. This requires recompiling and relinking atlsim.
Mattias described the present scheduling algorithm:
At the end, a list of possible submission targets with free CPUs is created. It chooses the one with the least amount of data to download.
Balázs suggested putting clusters where the user already has queueing jobs at the end of the list of possible submission clusters.
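The selection rule and Balázs' suggestion can be sketched together as a sort key. This is an illustration under assumptions, not the actual client code: the `Cluster` record, its field names and `rank_targets` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    free_cpus: int
    bytes_to_download: int     # input data not already available at the cluster
    user_queued_jobs: int = 0  # jobs of this user currently queueing there


def rank_targets(clusters, cpus_needed=1):
    """Keep clusters with enough free CPUs; prefer the least data to download,
    but move clusters where the user already has queueing jobs to the end."""
    candidates = [c for c in clusters if c.free_cpus >= cpus_needed]
    return sorted(
        candidates,
        key=lambda c: (c.user_queued_jobs > 0, c.bytes_to_download),
    )
```

The first element of the returned list would be the submission target; the rest could serve as fallbacks.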
In general a framework is needed for answering queries from the user e.g. when can my job start, when does it end? There should be a communication channel between the user-interface and the grid-manager. For example the user can upload an xrsl-description and wait for the grid-manager to process the query. Somehow the information from the script on the server side should be propagated back.
Anders suggested looking more closely at GARA and SILVER.
Vandy thinks there is too much functionality in the client. Some kind of distributed broker separated from the UI would be nice.
Heated discussion on scheduling – whether the present solution is good enough.
Everybody agrees that advanced job-resubmission would be a very good thing – at least manually. We need to understand if automatic resubmission is feasible with accounting.
One approach is a job-pool that accepts jobs and allows the clusters to pull jobs from this job-pool.
Vandy will look at this?
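The pull-based approach could be sketched as below. This is a toy illustration only: `JobPool`, `submit` and `pull` are hypothetical names, jobs are matched purely on CPU count, and a real pool would also need authorization, persistence and handling of failed jobs.

```python
class JobPool:
    """Central pool that accepts jobs and hands them out when clusters ask for work."""

    def __init__(self):
        self._jobs = []  # (job_id, cpus_needed), oldest first

    def submit(self, job_id, cpus_needed=1):
        self._jobs.append((job_id, cpus_needed))

    def pull(self, free_cpus):
        """Called by a cluster: return the oldest job that fits, or None if nothing fits."""
        for i, (job_id, cpus_needed) in enumerate(self._jobs):
            if cpus_needed <= free_cpus:
                del self._jobs[i]
                return job_id
        return None
```

A job that no cluster can fit simply stays in the pool, which is also where manual or automatic resubmission could hook in.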
A new gridtime xrsl-attribute will be introduced that is a scaled version of CPUTime according to the processor speed. The gridtime will be translated to real CPUTime when the cluster has been chosen.
Mattias will do this.
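The translation from gridtime to CPUTime might look like the following sketch. The exact scaling rule was not fixed at the meeting; the assumption here is that each cluster publishes a processor speed relative to a reference processor, and `gridtime_to_cputime`, `cluster_speed` and `reference_speed` are hypothetical names.

```python
def gridtime_to_cputime(gridtime_minutes, cluster_speed, reference_speed=1.0):
    """Convert time on the reference processor into CPU time on the chosen cluster.

    Assumed rule: a cluster twice as fast as the reference needs half the time.
    """
    if cluster_speed <= 0 or reference_speed <= 0:
        raise ValueError("speeds must be positive")
    return gridtime_minutes * reference_speed / cluster_speed
```

Under this assumption, a job requesting 60 minutes of gridtime would be given 30 minutes of CPUTime on a cluster with relative speed 2.0 and 120 minutes on one with relative speed 0.5.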
We should try to find out the exact requirements on NorduGrid for DC2 before the next meeting. The problem is that even ATLAS doesn't know them at the moment.
But investigate the requirements from POOL, the Executor and so on.