NORDUGRID-MEMO-2

Comments on Installation Procedure and Usage of Workload Management Software of European DataGrid.

This text was written by Aleksandr Konstantinov in January 2002 using information provided by other participants of the NorduGrid project: Mattias Ellert, Oxana Smirnova, Balázs Kónya.

The following is a list of problems encountered during installation and usage of the Workload Management Software provided by Work Package 1 of the European DataGrid project (further referred to as 'broker').

Note:
The information labeled "Possible solution" shows how the problem was solved at NorduGrid and is not intended to be the best or only solution.
Note:
'Broker' version 1.0.3 and the "WP1 - WMS Software Administrator and User Guide" (further referred to as 'Guide') were used.
Note:
The RB, JSS and Personal CondorG components were installed to run under a dedicated user 'dguser', because the 'Guide' encourages doing so. This led to additional problems.
Note:
The installation was performed on a system other than RedHat 6.2; problems caused by that difference are not mentioned here.
 

1. The binary 'broker' distribution was probably compiled in a great hurry. As a result, different components require different versions of the libstdc++ library, and some of them even use two different versions simultaneously. In addition, the same file name is used for different versions. Namely, the UI components work only with the libstdc++-libc6.1-2.so.3 provided on the EDG site in the egcs-1.2 package, while for the other components the one provided by the compat-* package from RedHat was found to be suitable.
Possible solutions:


2. Although the version of Python mentioned in the 'Guide' was installed (python-2.1.1-3), the Python scripts complain about an undefined 'em'. Also, the Python interpreter in python-2.1.1-3 is named python2, not python.
Possible solution:


3. The 'Guide' suggests using "adduser -r". There are two problems here.
a) This option is RedHat specific.
b) With this option the user's home directory is not created, but this directory is required during installation of CondorG and by the startup scripts of RB and JSS.
Possible solution:


4. The startup scripts of RB and JSS expect to find the directory /home/${RBJSS_USER}. Users' home directories are not necessarily located in /home.
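To illustrate the point, a home directory can be looked up in the passwd database instead of being assumed to live under /home. The following is only a minimal sketch in present-day Python, not the change made at NorduGrid; the function name home_directory is invented for this example.

```python
import pwd


def home_directory(username):
    """Return the home directory recorded for `username` in the passwd database.

    `home_directory` is a hypothetical helper name used only in this sketch.
    pwd.getpwnam raises KeyError if the account does not exist.
    """
    return pwd.getpwnam(username).pw_dir
```

A startup script could then use the returned path wherever it currently builds /home/${RBJSS_USER} by hand.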
Possible solutions:


5. The startup scripts use 'su - ${RBJSS_USER}', which removes the X509_* environment variables needed for authentication. The same goes for other configuration variables such as CONDOR_CONFIG, GLOBUS_LOCATION, etc. The 'Guide' mentions the variables that need to be set, but is not clear about where in the startup sequence that should be done.
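One way to picture the problem: since 'su -' starts a fresh login environment, any variable the service needs has to be set again inside the command that su runs. The sketch below, in present-day Python, builds such a command line. It is an illustration under assumptions, not the approach taken at NorduGrid: the helper name su_command is invented, the list of forwarded variables is only an example (X509_USER_CERT and X509_USER_KEY stand in for the X509_* family), and the service command is whatever the site actually starts.

```python
import shlex

# Example variable names only; which ones a site needs is an assumption.
FORWARDED = ("X509_USER_CERT", "X509_USER_KEY", "CONDOR_CONFIG", "GLOBUS_LOCATION")


def su_command(user, command, environ, names=FORWARDED):
    """Build an argument list for `su - user -c ...` that re-sets selected variables.

    `command` is a shell command string and is passed through unchanged.
    Only variables that are present in `environ` are forwarded.
    """
    assignments = [f"{name}={shlex.quote(environ[name])}"
                   for name in names if name in environ]
    inner = " ".join(["env"] + assignments + [command])
    return ["su", "-", user, "-c", inner]
```

The returned list can be handed to subprocess.run by a script running as root; values containing spaces or shell metacharacters are quoted by shlex.quote.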
Possible solutions:


6. The startup scripts for the RB and JSS components each create a Globus user proxy. Given that, according to the 'Guide', those components are supposed to run on the same computer and under the same account, that looks redundant. At the same time those proxies are inaccessible to the user running the aforementioned components, because they are created by root (startup scripts placed in /etc/rc.d/init.d are run by root). Also, the way the proxies are created (concatenation of the public certificate and the private key) seems to be insecure.
Possible solutions:


7. When run as an ordinary user, the default value of GRIDMAP is not /etc/grid-security/grid-mapfile but $HOME/.gridmap. If the latter is not present, all DNs are accepted and mapped to the username of the current user.
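The behaviour described above can be modelled in a few lines, which may help when checking which file a service will actually consult. This is a sketch of the rules as stated in this item, written in present-day Python; it is not taken from the Globus sources, and the function names are invented for the example.

```python
import os


def gridmap_path(environ, uid):
    """Return the grid-mapfile path that would be used, following the rules above.

    An explicit GRIDMAP setting wins; otherwise root gets the system-wide
    file and any other user gets $HOME/.gridmap.
    """
    explicit = environ.get("GRIDMAP")
    if explicit:
        return explicit
    if uid == 0:
        return "/etc/grid-security/grid-mapfile"
    return os.path.join(environ["HOME"], ".gridmap")


def mapping_is_enforced(environ, uid):
    """Return True if the selected grid-mapfile exists.

    Per the description above, a missing per-user file means every DN is
    accepted, so a False result deserves a warning at startup.
    """
    return os.path.isfile(gridmap_path(environ, uid))
```

A startup script for a service running under a dedicated account could call mapping_is_enforced(os.environ, os.getuid()) and refuse to start when it returns False.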
Possible solutions:


8. CondorG (at least the standard one) uses old-style certificate checking. This is not a problem of the WP1 software, but it should be mentioned in the 'Guide'.
Possible solution:


9. In the file Workload/Broker/Server/RBJob.cpp, in the function RBJob::SetupFileAccessPermissions(), the username of the connected user is extracted from the GRIDMAP file (see the possible problem above) and then used with the getgrnam function. getgrnam is supposed to find a group with the name provided as its argument. Normally this fails, because not every user in the system has a group with the same name.
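The difference between the two lookups can be shown with Python's pwd and grp modules, which wrap the same system calls (getpwnam and getgrnam). This is a sketch of the distinction only, in present-day Python, and not a patch for RBJob.cpp; both function names are invented for the example.

```python
import grp
import pwd


def primary_gid(username):
    """Return the numeric primary group ID of `username` from the passwd database.

    This works for every existing account, whatever its group is called.
    """
    return pwd.getpwnam(username).pw_gid


def has_group_named_after_user(username):
    """Return True only if a group with the same name as the user exists.

    This mirrors the getgrnam(username) lookup described above, which fails
    on systems where users do not each have a same-named group.
    """
    try:
        grp.getgrnam(username)
    except KeyError:
        return False
    return True
```

Looking up the user first and taking the group from the user's record avoids relying on a naming convention that many systems do not follow.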
Possible solutions:


10. Later in the same function the sandbox for the job is created, and the group owning those directories is changed to the aforementioned group. Because the RB server is supposed to run under a dedicated user, this would normally also fail (an unprivileged process can change group ownership only to a group it belongs to).
Possible solution:


11. In the file Workload/Broker/Client/GetOutputSandboxStub.cpp, in the function GetOutputSandbox::ReceiveOutputSandbox, the path to the sandbox is hardcoded as '/tmp'. This makes the corresponding configuration option completely unusable: if that option is changed in the configuration files, it becomes impossible to retrieve results from the sandbox (at least using dg-job-get-output).
Possible solution:


12. The use of gsincftp* is deprecated in Globus. It would be better to use globus-url-copy, which is more robust and reliable. Tests at NorduGrid showed that the latest version of gsincftp* has bugs, while globus-url-copy appears to be a stable tool. For additional information see the following NorduGrid documents: "GridFTP tests" at /slides/20011102-balazs.pdf and the archives of the nordugrid-discuss mailing list at http://lscf.nbi.dk/mailman/listinfo/nordugrid-discuss .
Possible solution:


13. It was noticed that after a failed authentication to 'RBserver' (for instance, opening a telnet connection to the default port 7771 and closing it), the server starts consuming as much CPU as it can.
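The cause is not identified above. One common source of such symptoms, assumed here purely for illustration, is a read loop that does not treat end-of-stream as a closed connection and therefore keeps retrying a read that returns immediately. The sketch below, in present-day Python, shows the check that prevents this; it says nothing about how RBserver is actually written, and the function name read_request is invented for the example.

```python
import socket


def read_request(conn, bufsize=4096):
    """Read from `conn` until the peer closes it and return the bytes received."""
    chunks = []
    while True:
        data = conn.recv(bufsize)
        if not data:
            # recv returning b"" means the peer closed the connection.
            # Without this check the loop would spin, using CPU continuously.
            break
        chunks.append(data)
    return b"".join(chunks)
```

A client that connects and disconnects without sending anything then simply yields an empty request, which the server can reject and move on.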
Possible solution:

At NorduGrid we used the 'broker' with the Globus MDS structure. If that mode of operation is ever considered, the following should also be taken into account.

EDG CEInformationProviders can easily be incorporated into the Globus 2.0 GRIS. For more information see /documents/comments_on_ceinfo.pdf. However, the 'broker' can have problems finding CEs. The problem is triggered by the OpenLDAP implementation used in EDG and Globus, which unfortunately has the following issue:

If the LDAP tree contains two databases under, for instance, 'dn=base1,dn=base' and 'dn=base2,dn=base', a search with base DN 'dn=base' returns nothing (actually an error: no such object).
The RB component performs LDAP search requests at the central InformationIndex (the equivalent of a GIIS) and at the GRISes, looking for available CEs. Both searches are performed with the same base DN (by default 'o=Grid'). As a result of the aforementioned issue, any search with base DN 'o=Grid' in Globus MDS (both GIIS and GRIS) will fail. To obtain proper results, searches in GRIS and GIIS should be performed with different base DNs.
It would be better to have both DNs configurable. This would allow the RB to be used both with the II provided by WP1 and with Globus MDS with the CEInformationProvider integrated.
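What "both DNs configurable" could look like is sketched below in present-day Python. This is an illustration of the suggestion, not existing 'broker' configuration: the class, field and function names are invented, and the DN values in the usage note are examples that a site would replace with its own.

```python
from dataclasses import dataclass


@dataclass
class SearchBases:
    """Base DNs for the two kinds of LDAP server the RB queries.

    Both default to 'o=Grid', the single default described above.
    """
    giis_base_dn: str = "o=Grid"
    gris_base_dn: str = "o=Grid"


def base_dn_for(server_kind, bases):
    """Return the base DN to use when searching a 'giis' or a 'gris' server."""
    if server_kind == "giis":
        return bases.giis_base_dn
    if server_kind == "gris":
        return bases.gris_base_dn
    raise ValueError(f"unknown server kind: {server_kind!r}")
```

With the defaults the behaviour is unchanged; for Globus MDS a site could instead set, for example, SearchBases(giis_base_dn="Mds-Vo-name=site,o=Grid", gris_base_dn="Mds-Vo-name=local,o=Grid"), where both values are assumptions about that site's tree.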