Present: Balázs, Anders, Mattias, Mikael, Aleksandr, Oxana, Henrik, Mike, Niels, Mikko, Andrei I., Andrei Z., Vitaly, Lauri, Hardi, Csaba, Marko (taking the minutes).

Wednesday, 16 March
===================

Rome production issues/problems:
--------------------------------

Oxana: The system is not well suited for an environment where someone creates a job and someone else manages it. Data access control (by GACL) is the bottleneck. Authorization for X to see Y's jobs/logs is difficult or ill-defined in critical components (job management, file management).
Anders: With a VOMS system, one could allocate authorization for the ATLAS group.
Balazs: GACL attributes with jobs: failures in servers (reported by Russian Lund users - file in Bugzilla).
Anders: VOMS is hard to administer: setting up the admin GUI.
Balazs & Mikko & Csaba: We might have a VOMS server in HIP Finland or in Norway.
Anders: Maybe GACL does not really work with VOMS: 1) files - yes 2) jobs 3) rls. Must test.

Plan: Put up a VOMS service in Norway or Finland, test the issues above. Make Andrei report a bug (discussed on Sat).

How to live with the current situation:
Problem: rls server in Denmark: users cannot find the files because they are not authorized to use the rls. Authorization is by manual config, with no "group" concept.
How about a policy: read access to everybody?
Balazs: write access only to production people. - ok.

Plan: rls read access to everybody, write access to production people. Install CA certs of "everybody". Done.

Balazs: We need an rls test service and a new production rls.
Mattias: will set up a test service in Uppsala. It is an ATLAS issue to find a new production rls.

GACL for storage
----------------

Balazs: What would the .gacl default policy in /DC for ATLAS look like?
Oxana: The current default is ok, but it's a pain that I cannot see what the top .gacl is.

Log file access
---------------

Balazs: gmlog for failed jobs should be made readable for debuggers. How?
Mattias: the executor can do it. It downloads the log anyway.
Balazs: then it should send the log to storage using ngcopy? In a separate cleanup process?
Mattias: well, yeah.
Balazs: this is important and urgent.
Support requests are all about this.
Anders: Let's file a request to the Dulcinea developers.
Mattias: I'll take a look.
Balazs: How to solve this problem generically is an ATLAS issue.

CA policies
-----------

Anders: All "interesting" CAs are soon to be a part of the EU PMA package (EuroGrid).
Balazs: ARC should include default access for users whose CAs provide ARC resources + demoCA.
Anders: Should these CAs be packaged with other than standalone?
Balazs: For sites, what do we recommend?
Anders: EuroGrid, Baltic, DFN, GridPP (expired), Sri Lanka, Australia (Melbourne?).
All: As a policy NG accepts: has NG policy, has resources, behaves ok, willing to put their CA in PMA.

Plan: ask the Aussies what their plan is. Otherwise ready for the next release.

For www.nordugrid.org/NorduGridVO, admins of "new" user groups should send their information, user list and policy description. There should be specific groups for Baltics, Slovenia, Slovakia, ARC.

Improve documentation: how to put limits on Grid resource usage or give priorities to some VOs.

Invitation: there will be an rls (Uppsala) and storage for anyone with an NG certificate and an accepted CA. Demo CE boxes and a GIIS, too. Make a runtime environment for this.

Host certificate web application: Marko.

Infosystem policies
-------------------

Balazs: Contact persons' email addresses need to be checked (practically done).

Plan: Rename existing toplevel servers to index1-index4. Countries need to register into 2 indexes.

Release 0.4.5
-------------

Anders: Patch for 0.4.4 -> 0.4.5. Tested?
Balazs: yes.
Aleksandr: Main changes in run.cc
Anders: The "cancelling jobs kills the manager" issue should be fixed. Final testing needed.
Mattias: Will be installed.

Plan: By 24th March, if no problems, Anders will tag, put release notes in.

Release 0.6
-----------

Balazs: Main areas and persons responsible for each, new CVS directory that allows "make client".
Anders: Dynamic versioning of APIs vs the whole stuff (since libraries, monitor, javaclient are relatively independent).
Balazs: This should not be reflected in packaging. For 0.6 let's keep the current (monolith) versioning.
Mattias: In 2 months client tools based on new libs.
Anders: We can keep the current CVS and still do "make client"/"make server".
Anders + Aleksandr: new CVS + component structure + autotools.
  ./configure --disable-client = prepare only the server for compilation.

Plan: Use autotools. Separate deps and targets. Start migrating. After Easter check status. Target release date: end of April.

Slovenia certificate problems
-----------------------------

Solved by Anders by fixing the corresponding policy file. Will be packaged.

External software
-----------------

Globus
Aleksandr: New release should be made against what ARC uses now (2.4.x). If it compiles and works fine with 3.2.x/3.9.x, those can be considered.
Plan: see above.

Gpt
Balazs: Can we get a new binary rel without gpt?
Anders: Yes, the spec file does not require gpt.

gsoap: 2.5.2 works. 2.6 does not.
Perl libs: Perl-ldap is not a dependency.
GACL: Keep it as it is.
VOMS: Keep it as it is - you can build 0.5 without VOMS.
mysql: Needed by logger.

Sofa meeting in Hotel Atrium, Thursday, 17th March
==================================================

After 0.6, 1.0
--------------

Wish list:
GLUE 2.0 compat, consistent data management (requirements tomorrow).
ARClib: c++, Python, Java: unified function signatures/data type names.
Only GSI from Globus? No, gridftp is needed.
Resource reservation (in advance).
GUI interfaces (Java, QT?), application portals.
Different platforms: Solaris? Debian. Mac? We could have a Java based client for basic client functions, but brokering would need to be re-written in Java.
Resource control: quotas (data, jobs, per person/VO). Resource status should be available for admin monitoring. Are there libs available for this?
Babysitter by rules like: if the job has been in the queue for 30 min, resubmit.
How to handle massive amounts of jobs?
Better logging: information can be used by other components.
Web services? Job submission in 1.0. Job descriptions in JSDL. Job submission/status/monitoring through the same channel/interface. Resource monitoring too in 2.0.
VOMS support.
Mike suggests an improvement: Person X's failed data transfers could be continued if he/she submits new jobs.

Data Management, Friday, 18 March
=================================

Oxana: We should think of data manipulation as jobs: we don't want to do it interactively.
Aleksandr: There is a Globus tool.
Oxana: GDMP was a good idea but a poor implementation. We need storage servers that move data on my behalf. You should be able to send jobs to do it. Description: what files? What? Accessible to whom? Scheduled when? "Rename files that have been created by me during last 2 weeks".
Balazs: To implement this, one needs basic services.
Oxana: Users should not care about that.
Oxana: But these things should not stay in job queues. We need file placement services, indexing..
Aleksandr: How much would this system hide from the user?
Oxana: On the same level as grid-manager hides the queue system.
Csaba: How about Matrix SRB?
Aleksandr: License issues must be settled.
Oxana: Should be integrated in job management: in the same job descr.
Aleksandr: If we write stuff on top of SRB (and the license becomes an issue) we have to emulate SRB: too much.
Balazs: Let's test SRB. There's an installation in Norway.

Agreed as a plan. Moreover: test SSE-Fireman integration, too (see Saturday's minutes).

High-level design for storables (Aleksandr)
-------------------------------------------

Main operations
  AA enabled: get file, put, delete, rename, copy, list, modify property

Objects
  collections (set or ordered), files, services (indexing, storage..)

Properties
  access control (for everything)
  id reference
  storage: how the file is stored?
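
The storables outline above can be read as a small object model. The following is only a minimal sketch under stated assumptions, not Aleksandr's design: the class names (`Storable`, `File`, `Collection`), the `acl` layout (operation name mapped to a set of allowed identities) and the `list_members` helper are all hypothetical, and the "id reference" property is reduced to a plain `id` string.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Storable:
    """Any object in the outline: a file, a collection or a service."""
    id: str
    # "access control (for everything)": operation name -> allowed identities
    # (this layout is an assumption made for the sketch)
    acl: Dict[str, Set[str]] = field(default_factory=dict)

    def allows(self, who: str, operation: str) -> bool:
        return who in self.acl.get(operation, set())


@dataclass
class File(Storable):
    # "storage: how the file is stored?" kept as a free-form string here
    storage: str = ""


@dataclass
class Collection(Storable):
    ordered: bool = False  # "collections (set or ordered)"
    members: List[Storable] = field(default_factory=list)

    def list_members(self, who: str) -> List[str]:
        """The 'list' operation, checked against the access control."""
        if not self.allows(who, "list"):
            raise PermissionError(f"{who} may not list {self.id}")
        ids = [m.id for m in self.members]
        # ordered collections keep insertion order; sets are deduplicated
        return ids if self.ordered else sorted(set(ids))


dc = Collection(id="dc-files", acl={"list": {"alice"}}, ordered=True)
dc.members.append(File(id="f1", storage="gsiftp://se.example.org/f1"))
dc.members.append(File(id="f0"))
print(dc.list_members("alice"))
```

Putting the access check inside each operation mirrors the "AA enabled" note: every operation is authorized against the object's own access control rather than by a separate gatekeeper.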

Saturday, 19 March: Pending issues
==================================

Next NGN meeting
----------------

Arrival on Wed (morning). Tech meeting would start Wed afternoon already.

GACL + data management
----------------------

The SE should support: read list of admins from a file.
Aleksandr: This is a kludge. There should be a gridftp service for different groups (blocks in nordugrid.conf).

Solution: SE is physical space, you can access it with different URLs: xyz/ATLAS, xyz/CMS

NorduGrid conf

  [VO]
  URL=xyz/ATLAS
  NAME=ATLAS-SE
  FILE=/etc/nordugrid/ATLAS-VO.TXT

And you use ATLAS-SE in the GACL. This goes to 0.6.

How about GACL templates? To be checked.

Improve error messages: upload/download of file x failed because of insufficient privileges in dir Y.

I/O design: Filename, host+port, path, streams, cache, retry, secure, blocksize, location, pattern, checksum, guid, multiple i/o, metadata, acl.

RLS:
----

Current issues: When the SSE registers to RLS, the AA in RLS is by the SSE's host cert, not user cert. To be fixed.

St. Petersburg will run an RLS service.

Options for NG/RLS
1. Ask M. Branco what kind of improvements "would minimize his pain". (I'll do this -- Marko)
2. Use FireMan as index service: write/contribute glite-catalog-search?

Option 2 preferred if FireMan can be installed.

Plan: Don't drop RLS but integrate FireMan for 2.0. Invite Peter Kunszt to next NG meeting.

Retries in transfer: GridManager or executor?
Proposed: ng-rerun that would take care of the retries: downloads, queue, upload.
The reason for the failure is in the errorstring message. Implement this on the client side: if the job failed in a certain state, the executor tries to re-run. Improve errorstring. Redefine finishing codes: exit code, rerunnable?
Mattias: Check that it is compatible with ngstat.

For 1.0: Re-designed downloader/uploader infrastructure.

Logger
------

Vitaly & Andrei will take a closer look at the logger database structure. Target: ready for 0.6.
Debug and log files: Default log level for all the services: 0 (only errors). Nothing: -1.

Error messages
--------------

For jobs: what the queue provides + exit codes.
Error codes for services: we'll need a new attribute. Atlas_error_code is too specific.
Job failures and messages should be improved: write grid-manager appendix.
Client error messages: change the "no cluster with enough CPU" error message to "you are not authorized to use any cluster" if that is the case. Trace other uninformative messages.
File transfer: Globus messages are not informative.
Each service should have a default log file.

Web page updates
----------------

A couple of paragraphs about all the components of ARC.

Compatibility
-------------

There are client_version and server_version variables.

SE & Inf.System attributes
--------------------------

SE name (different from URL): hostname + string unique on the site
SE type: (enum) gridftp, SSE
Contact URL (multival)
Access control framework: GACL, trivial
Network speed: ambiguous
i/o speed: hardware style: disk/raid/tape/network
status: etc

(see Balazs' draft, will be continued)
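
As an illustration only, the first four attributes listed above could be captured in a record like the one below. This is a sketch under assumptions, not the information system schema: the class and field names are hypothetical, the ":" separator used to join hostname and site-unique string is invented for the example, and the speed, style and status attributes are left out because the draft does not yet fix their form.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class SEType(Enum):
    GRIDFTP = "gridftp"
    SSE = "SSE"


class AccessControl(Enum):
    GACL = "GACL"
    TRIVIAL = "trivial"


@dataclass
class StorageElementRecord:
    hostname: str
    site_label: str  # string unique on the site (hypothetical field name)
    se_type: SEType
    contact_urls: List[str] = field(default_factory=list)  # multivalued
    access_control: AccessControl = AccessControl.GACL

    @property
    def name(self) -> str:
        # SE name = hostname + site-unique string; the separator is an assumption
        return f"{self.hostname}:{self.site_label}"


atlas_se = StorageElementRecord(
    hostname="se.example.org",
    site_label="ATLAS-SE",
    se_type=SEType.GRIDFTP,
    contact_urls=["gsiftp://se.example.org/ATLAS"],
)
print(atlas_se.name)
```

Keeping the name separate from the contact URLs follows the note that the SE name is different from the URL, so one physical SE can be reached through several URLs.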