This text was written by Aleksandr Konstantinov in January 2002 using information provided by other participants of the NorduGrid project: Mattias Ellert, Oxana Smirnova, Balázs Kónya.
The following is a list of problems encountered during installation and use of the Workload Management Software provided by Work Package 1 of the European Data Grid project (further referred to as 'broker').
Note:
The information labeled "Possible solution" shows how the problem was
solved at NorduGrid and is not intended to be the best or only solution.
Note:
'Broker' version 1.0.3 and "WP1 - WMS Software Administrator and User
Guide" (further referred to as 'Guide') were used.
Note:
The RB, JSS and Personal CondorG components were installed to run under
a dedicated user 'dguser', because 'Guide' encourages doing so. This
led to additional problems.
Note:
The installation was not performed on a RedHat 6.2 system; the
problems caused by that difference are not mentioned here.
1. The binary 'broker' distribution appears to have been compiled in
a hurry. As a result, different components require different versions of
the libstdc++ library, and some of them even use two different versions
simultaneously. The same file name is also used for different versions.
Namely, UI components work only with the libstdc++-libc6.1-2.so.3 provided
on the EDG site in the egcs-1.2 package, while for other components the one
provided by the compat-* package from RedHat was found to be suitable.
Possible solutions:
2. Although the version of Python mentioned in 'Guide' was installed
(python-2.1.1-3), the Python scripts complain about undefined 'em'. Also,
the interpreter installed by python-2.1.1-3 is named python2, not python.
Possible solution:
3. 'Guide' suggests using "adduser -r". There are two problems
here.
a) This option is RedHat-specific.
b) With this option the home directory of the user is not created, but
this directory is required during installation of CondorG and by the
startup scripts of RB and JSS.
Possible solution:
4. The startup scripts of RB and JSS expect to find the directory
/home/${RBJSS_USER}. Users' home directories are not necessarily located
in /home.
Possible solutions:
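As an illustration only, and not necessarily what was done at NorduGrid, the home directory can be taken from the password database instead of being assumed to be /home/<user>. The sketch below uses Python's standard pwd module; the function names and the injectable `lookup` parameter are hypothetical helpers introduced for this example.

```python
import os
import pwd


def home_directory(username, lookup=pwd.getpwnam):
    """Return the home directory recorded in the password database.

    `lookup` is injectable so the function can be exercised without
    depending on which accounts exist on the local machine.
    """
    return lookup(username).pw_dir


def guessed_home(username):
    """The assumption made by the startup scripts: homes live in /home."""
    return os.path.join("/home", username)


if __name__ == "__main__":
    # 'dguser' is the dedicated account named earlier in this text;
    # it may not exist on the machine where this sketch is run.
    try:
        print(home_directory("dguser"))
    except KeyError:
        print("no such user: dguser")
```

The two functions return the same path only when the account happens to live under /home, which is exactly the assumption that fails on other layouts.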
5. The startup scripts use 'su - ${RBJSS_USER}', which removes the
X509_* environment variables needed for authentication. The same goes for
other configuration variables like CONDOR_CONFIG, GLOBUS_LOCATION etc.
'Guide' mentions the variables that have to be set but is not clear about
where in the startup sequence that should be done.
Possible solutions:
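As an illustration only (this is not taken from the NorduGrid setup), one way to keep selected variables across 'su -' is to pass them explicitly on the command line that su executes. The sketch below only builds the argument list and does not run anything; the list of forwarded names and the command path in the usage example are assumptions made for this example.

```python
import shlex

# Assumed list of variables to forward; the text above only names
# X509_*, CONDOR_CONFIG and GLOBUS_LOCATION.
FORWARDED = (
    "X509_USER_CERT",
    "X509_USER_KEY",
    "X509_USER_PROXY",
    "X509_CERT_DIR",
    "GLOBUS_LOCATION",
    "CONDOR_CONFIG",
)


def su_command(user, command, environ, names=FORWARDED):
    """Build an argv for 'su - user -c ...' that re-creates variables.

    'su -' starts a login shell with a fresh environment, so each
    variable is set again with env(1) inside the command string.
    `command` is a shell command line and is passed through unchanged.
    """
    assignments = [
        f"{name}={shlex.quote(environ[name])}"
        for name in names
        if name in environ
    ]
    inner = " ".join(["env", *assignments, command])
    return ["su", "-", user, "-c", inner]


if __name__ == "__main__":
    import os

    # '/path/to/rb_start' is a placeholder, not a real broker script.
    print(su_command("dguser", "/path/to/rb_start", os.environ))
```

Variables that are not in `names` are deliberately not forwarded, so the login environment of the dedicated user stays otherwise untouched.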
6. The startup scripts for the RB and JSS components each create a
Globus user proxy. Given that, according to 'Guide', those components
are supposed to run on the same computer and under the same account, this
looks redundant. Moreover, those proxies are inaccessible to the user
that runs the aforementioned components, because they are created by root
(startup scripts placed into /etc/rc.d/init.d are run by root). Also, the
way the proxy is created (concatenation of the public certificate and the
private key) seems to be insecure.
Possible solutions:
7. When running as an ordinary user, the default value of GRIDMAP is not
/etc/grid-security/grid-mapfile but $HOME/.gridmap. If the latter is not
present, all DNs are accepted and mapped to the user name of the current
user.
Possible solutions:
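As an illustration only (not taken from the NorduGrid setup), the rule described in this item can be written down so that a startup script can check which grid-mapfile will actually be used and whether it exists. The function names are hypothetical, and the precedence of an explicitly set GRIDMAP variable over the defaults is an assumption of this sketch.

```python
import os

SYSTEM_GRIDMAP = "/etc/grid-security/grid-mapfile"


def default_gridmap(environ, uid, home):
    """Return the grid-mapfile path that applies to this process.

    Assumption: an explicitly set GRIDMAP variable wins. Otherwise
    root gets the system-wide file and any other user gets
    $HOME/.gridmap, as described in the item above.
    """
    explicit = environ.get("GRIDMAP")
    if explicit:
        return explicit
    if uid == 0:
        return SYSTEM_GRIDMAP
    return os.path.join(home, ".gridmap")


def gridmap_is_enforced(path, exists=os.path.exists):
    """A missing file means every DN is accepted, so report that."""
    return exists(path)


if __name__ == "__main__":
    path = default_gridmap(os.environ, os.getuid(), os.path.expanduser("~"))
    state = "present" if gridmap_is_enforced(path) else "MISSING"
    print(path, state)
```

Setting GRIDMAP explicitly before starting a service run under a dedicated user avoids depending on the per-user default.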
8. CondorG (at least the standard one) uses old-style certificate
checking. This is not a problem with the WP1 software but should be
mentioned in 'Guide'.
Possible solution:
9. In the file Workload/Broker/Server/RBJob.cpp, in function
RBJob::SetupFileAccessPermissions(), the user name of the connected user is
extracted from the GRIDMAP file (see the possible problem above) and then
passed to the getgrnam function. getgrnam is supposed to find a group with
the name given as its argument. Normally this fails, because not every user
in the system has a group with the same name.
Possible solutions:
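As an illustration only (this is not the code of RBJob.cpp and not necessarily the NorduGrid fix), a user's primary group can be found by going through the password database first and then resolving the numeric group id, instead of looking up a group named after the user. The sketch uses Python's standard pwd and grp modules; the function name and the injectable lookup parameters are hypothetical.

```python
import grp
import pwd


def primary_group_name(username, getpwnam=pwd.getpwnam, getgrgid=grp.getgrgid):
    """Return the name of the user's primary group.

    The user entry carries a numeric gid (pw_gid); the group database
    is then queried by that id, so no group named like the user has
    to exist. The lookups are injectable for testing.
    """
    gid = getpwnam(username).pw_gid
    return getgrgid(gid).gr_name


if __name__ == "__main__":
    import os

    # Resolve the primary group of whoever runs this sketch.
    me = pwd.getpwuid(os.getuid()).pw_name
    print(me, primary_group_name(me))
```

With a user 'alice' whose primary group is 'users', a lookup of a group called 'alice' fails while the function above returns 'users'.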
10. Later in the same function the sandbox for the job is
created, and the group ownership of its directories is changed to the
aforementioned group. Because the RB server is supposed to run under a
dedicated user, this would normally also fail.
Possible solution:
11. In the file Workload/Broker/Client/GetOutputSandboxStub.cpp,
in function GetOutputSandbox::ReceiveOutputSandbox, the path to the sandbox
is hardcoded to '/tmp'. This makes the corresponding configuration option
completely unusable. If that option is changed in the configuration files, it
becomes impossible to retrieve results from the sandbox (at least using
dg-job-get-output).
Possible solution:
12. The use of gsincftp* is deprecated in Globus. It would
be better to use globus-url-copy, which is more robust and reliable.
The tests at NorduGrid showed that the latest version of gsincftp* has bugs,
while globus-url-copy appears to be a stable tool. For additional
information see the following NorduGrid documents: "GridFTP tests"
at /slides/20011102-balazs.pdf
and the archives of the nordugrid-discuss mailing list at http://lscf.nbi.dk/mailman/listinfo/nordugrid-discuss.
Possible solution:
13. It was noticed that after a failed authentication
to 'RBserver' (for instance, opening a telnet connection to the default
port 7771 and closing it) the server starts consuming as much CPU as it
can get.
Possible solution:
At NorduGrid we used 'broker' with the Globus MDS structure. If that mode of operation is ever considered, the following should also be taken into account.
EDG CEInformationProviders can be easily incorporated into a Globus 2.0 GRIS. For more information see /documents/comments_on_ceinfo.pdf. However, 'broker' can have problems finding CEs. The problem is triggered by the OpenLDAP implementation used in EDG and Globus, which unfortunately has the following issue:
If the LDAP tree contains two databases under, for instance, 'dn=base1,dn=base' and 'dn=base2,dn=base', a search with base DN 'dn=base' returns nothing (actually an error, "no such object"). The RB component performs LDAP search requests at the central InformationIndex (the equivalent of a GIIS) and at the GRISes, looking for available CEs. Both searches are performed with the same base DN (by default 'o=Grid'). As a result of the aforementioned issue, any search with base DN 'o=Grid' in Globus MDS (both GIIS and GRIS) will fail. To obtain proper results, searches in GRIS and GIIS should be performed with different base DNs.
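The behaviour described above can be reproduced with a toy model. This is a simplified sketch, not OpenLDAP code: it assumes that the server hands a search to the one database whose suffix is equal to, or an ancestor of, the base DN, and reports "no such object" when no database qualifies. All function names are hypothetical.

```python
def normalize(dn):
    """Lower-case a DN and strip spaces around its components.

    Simplification: escaped commas inside values are not handled.
    """
    return ",".join(part.strip().lower() for part in dn.split(","))


def is_under(dn, suffix):
    """True if `dn` equals `suffix` or lies below it in the tree."""
    dn, suffix = normalize(dn), normalize(suffix)
    return dn == suffix or dn.endswith("," + suffix)


def select_database(base_dn, suffixes):
    """Return the suffix of the database that would serve the search.

    None models the "no such object" error: the base DN lies above
    every configured database, so none of them is selected.
    """
    for suffix in suffixes:
        if is_under(base_dn, suffix):
            return suffix
    return None


if __name__ == "__main__":
    databases = ["dn=base1,dn=base", "dn=base2,dn=base"]
    for base in ("dn=base", "dn=base1,dn=base", "cn=x,dn=base2,dn=base"):
        print(base, "->", select_database(base, databases))
```

In this model the search based at 'dn=base' selects no database, while a search based at or below either database suffix is served, which is why GRIS and GIIS need to be queried with base DNs that match their own suffixes.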