NorduGrid Note


proposal

2003-05-17




Design of a database for mapping logical filenames to physical locations in a distributed computing environment

Trond Myklebust*

Abstract


We describe a possible design for a database capable of tracking the physical location of a file.








  1. Design criteria



  2. Local storage element

The data area consists of two main areas. The “incoming” area is a write-only area, meant to allow users to upload data safely. The choice of method for uploading data into this area is not a concern for this protocol.

The main “database” area is writable only by a local database manager process. It is by default world readable; however, access to individual data entries may be subject to whatever ACLs are registered in the associated metadata. Again, the choice of supported methods for downloading the data lies outside the protocol, but the chosen methods should obviously respect any access permissions that may be set on the entry, including ACLs.
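To make the later sections concrete, the sketch below shows one possible shape for the record kept by the database manager for each entry in its database area, using the entry states (“unregistered”, “registered”, “updating”, “invalid”) referred to in the algorithm sections of this note. All identifiers in the sketch are illustrative assumptions, not part of the proposal.

#include <time.h>

enum entry_state {
    ENTRY_UNREGISTERED,    /* moved into the database area, not yet known to the nameserver */
    ENTRY_REGISTERED,      /* registration with the logical nameserver has completed */
    ENTRY_UPDATING,        /* an update request is in flight */
    ENTRY_INVALID          /* deleted or invalidated; eligible for garbage collection */
};

struct local_entry {
    char *data_path;                 /* location of the data in the database area */
    char *metadata_path;             /* location of the associated metadata */
    unsigned char data_md5[16];      /* data checksum */
    unsigned char meta_md5[16];      /* metadata checksum */
    enum entry_state state;
    time_t registered_at;            /* timestamp used as the base for the grace period */
};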

    2.1. List of supported operations

  3. Logical nameserver

Data must be addressable without the user being required to know the physical location of data instances. The replication operation implies that the location of the closest instance of a dataset is not fixed. The “logical nameserver” is a central database for tracking all instances of a dataset. Any instance which is to be made available for general use must therefore be registered in this database.

For each dataset, the logical nameserver therefore stores the logical filename, the data and metadata checksums, the file size, and the list of URLs pointing to the registered physical instances (replicas) of the dataset.

The nameserver can also store references to collections of logical filenames. Such an entry will differ from the entry for a dataset in that it stores a list of logical filenames and logical names rather than a list of physical locations.

Both logical names and logical filenames are world readable.
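For illustration only, the information stored by the nameserver might be represented along the following lines; the field names, types and the singly linked replica list are assumptions made for this sketch.

struct replica {
    char *url;                         /* physical location of one registered instance */
    struct replica *next;
};

struct dataset_entry {
    char *logical_filename;
    unsigned char data_md5[16];        /* data checksum */
    unsigned char meta_md5[16];        /* metadata checksum */
    unsigned long long size;           /* file size in bytes */
    struct replica *replicas;          /* list of URLs of all registered instances */
};

struct collection_entry {
    char *logical_name;
    char **members;                    /* logical filenames and/or logical names */
    unsigned int nmembers;
};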

    3.1. List of supported operations

Once the data has been registered, a user may retrieve the entire list of replicas by presenting a logical filename to the database.

In the case where the identifier references a logical name, a user should be able to request a simple expansion which should result in the nameserver returning the stored list of logical filenames and logical names. The user should also be able to request a full expansion, which should result in the nameserver returning the expansion in terms of logical filenames alone.
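As an illustration only, the operations described above might be exposed through an interface along the following lines. The function names and types are assumptions made for this sketch; the actual interface would be carried over the inter-process protocol discussed later in this note, and each call returns one of the error codes listed at the end.

struct string_list {
    char **items;                      /* URLs, or logical filenames and names */
    unsigned int n;
};

/* Return the full list of replica URLs registered for a logical filename. */
int ns_lookup_replicas(const char *logical_filename, struct string_list *urls);

/* Simple expansion: the stored list of logical filenames and logical names. */
int ns_expand(const char *logical_name, struct string_list *members);

/* Full expansion: the expansion in terms of logical filenames alone. */
int ns_expand_full(const char *logical_name, struct string_list *filenames);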

Note that the design of a library of utilities for interrogating the databases and performing such tasks as automatically selecting the closest physical instance of a dataset is highly nontrivial, and should be specified in a separate document.

  4. Fault tolerance

The design must be robust against client and server unavailability. It must be assumed that network partitions and other faults may interrupt an operation at any moment; if so, both the client and the server need to be able to recover. If possible, the server should complete the operation; should this not be possible, the recovery process must involve rewinding the database to a consistent state and garbage collecting any resulting incompletely registered data.

  5. Inter-process communication

The various databases will not, in general, be running on the same computer. Communication between the database managers must therefore go via a secure external protocol. We propose to use a SOAP/XML-based protocol, the details of which are yet to be determined.

  6. Data replication support

Network bandwidth limitations are a typical source of delay when doing computations that involve the transfer of large datasets. The GRID model therefore calls for the ability to establish local data repositories in close physical proximity to the computational nodes. Since there may be several such nodes in different physical locations, it follows that data may be replicated over several such local repositories. Unless such replicas are tracked, this model may conflict with the need to provide for subsequent data and metadata updates.

Data and metadata integrity is ensured by means of two MD5 checksums, one for the data and one for the metadata.
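By way of example, a minimal sketch of how a database manager could compute such a checksum with the OpenSSL MD5 routines is given below (compile with -lcrypto); the function name and buffering strategy are assumptions made for this illustration.

#include <openssl/md5.h>
#include <stdio.h>

/* Compute the MD5 checksum of the file at `path`.  Returns 0 on success. */
int compute_md5(const char *path, unsigned char digest[MD5_DIGEST_LENGTH])
{
    unsigned char buf[8192];
    size_t n;
    MD5_CTX ctx;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return -1;                     /* caller maps this onto an error code */

    MD5_Init(&ctx);
    while ((n = fread(buf, 1, sizeof(buf), fp)) > 0)
        MD5_Update(&ctx, buf, n);
    MD5_Final(digest, &ctx);

    fclose(fp);
    return 0;
}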

  7. Algorithm for registering new data to the database

The user should first upload the data to the storage element's “incoming” area. Once this is done, the following information is sent to the database manager: the location of the data, the location of any associated metadata, and the logical filename.

The database manager process should then immediately move the new data and metadata into its protected database area. Both the data and metadata checksums are calculated and registered. The manager flags the entry as being in the “unregistered” state before timestamping it. Finally, it contacts the logical nameserver in order to register the logical filename entry.

Once the user has been authenticated by the nameserver process, the latter initiates an attempt to register the logical filename. If no existing entry is found, the logical filename is registered as a new entry, together with the checksums, the file size, and the list of URLs. The dataset is registered as the sole replica instantiation.

If an existing duplicate entry is found, the checksums and file sizes are compared. Should they match, and no duplicate entry be found for this replica, the nameserver registers the dataset in its database as a replica of the existing entry. Should they match, but an entry already exist for this replica, no further action is taken and no error is returned. Should they not match, the registration is refused and the error EEXIST is propagated back to the storage element's database manager.

If an error was returned by the nameserver, the database manager returns the error to the user. If, however, no error was returned, it flags the entry as being in the “registered” state and returns control to the user process together with the OK code.
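On the storage element side, the sequence above might be sketched as follows. Every type and helper function in this sketch (move_into_database, compute_checksums, nameserver_register, and so on) is a hypothetical placeholder rather than a defined interface.

#include <time.h>

enum { OK = 0, ENOENT = 1, EPERM = 2, EEXIST = 3 };  /* see the error code list below */

struct local_entry;                                  /* local database record */

/* Hypothetical helpers -- placeholders only. */
void move_into_database(struct local_entry *e);
void compute_checksums(struct local_entry *e);
void set_state(struct local_entry *e, const char *state);
void set_timestamp(struct local_entry *e, time_t t);
int  nameserver_register(const char *logical_filename, const struct local_entry *e);

int register_new_data(struct local_entry *e, const char *logical_filename)
{
    int err;

    move_into_database(e);             /* out of "incoming", into the database area */
    compute_checksums(e);              /* data and metadata MD5 checksums */
    set_state(e, "unregistered");      /* flag the entry first ...       */
    set_timestamp(e, time(NULL));      /* ... then timestamp it          */

    err = nameserver_register(logical_filename, e);  /* contact the nameserver */
    if (err != OK)
        return err;                    /* e.g. EEXIST on a checksum mismatch */

    set_state(e, "registered");
    return OK;                         /* control returns to the user */
}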

  8. Algorithm for deleting a logical name/filename

A logical name may be deleted by an authenticated user with the necessary privileges by simply notifying the nameserver that it is no longer in use. A dataset that is no longer of use may be deleted by removing its logical filename from the logical nameserver. Once this is done, the nameserver process should consult its list of instances of that dataset, and attempt to notify each storage element that the dataset has been deleted.

Upon receiving such notification, the storage element's database manager moves its instance of the dataset into the “invalid” state. It may then remove the instance from its database, or leave this task up to the garbage collector.
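A storage element's handling of such a notification might then reduce to something like the sketch below; the identifiers are again placeholders, not a defined interface.

struct local_entry;                    /* local database record */

/* Hypothetical helpers -- placeholders only. */
struct local_entry *db_find(const char *logical_filename);
void set_state(struct local_entry *e, const char *state);
void db_remove(struct local_entry *e);

void on_dataset_deleted(const char *logical_filename)
{
    struct local_entry *e = db_find(logical_filename);

    if (e == NULL)
        return;                        /* nothing registered locally */

    set_state(e, "invalid");           /* move the instance into the "invalid" state */
    db_remove(e);                      /* or leave this step to the garbage collector */
}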

TCP with keepalive should be used as insurance against network partitions.
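On a POSIX system this amounts to little more than the following; the socket descriptor is assumed to be already connected.

#include <sys/socket.h>

/* Enable TCP keepalive on a connected socket so that a peer lost to a
 * network partition is eventually detected. */
int enable_keepalive(int fd)
{
    int on = 1;
    return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
}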

  9. Algorithm for updating a dataset

A suitably authenticated user with update privileges may modify the dataset by uploading the modified data into the “incoming” area and submitting an “update” request to the storage element database manager. After changing the local entry's state to “updating”, the latter will calculate the necessary checksums and forward the request on to the nameserver. Appended to the original request are the new checksums as well as the old checksums.

The nameserver, upon reception of the request, should compare the old checksums supplied in the request with the checksums currently stored in its entry. If they match, and the user's update privileges are verified, the update is accepted and the nameserver entry is updated. The nameserver process then notifies all instances of the dataset that they have been invalidated, and returns an “OK” result.

Finally, the storage element database manager, upon reception of the “OK” result, must change the entry's state back to “registered”.
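On the nameserver side, the checksum comparison might be sketched as follows. The helper functions, the concatenation of the data and metadata checksums into a single 32-byte buffer, and the choice of error code on a mismatch are all assumptions made for this illustration.

#include <string.h>

enum { OK = 0, ENOENT = 1, EPERM = 2, EEXIST = 3 };  /* see the error code list below */

struct ns_entry;                       /* nameserver entry */

/* Hypothetical helpers -- placeholders only. */
const unsigned char *stored_checksums(const struct ns_entry *e);  /* data + metadata, 32 bytes */
int  user_may_update(const struct ns_entry *e);
void store_checksums(struct ns_entry *e, const unsigned char *new_sums);
void invalidate_replicas(struct ns_entry *e);        /* notify every registered instance */

int ns_update(struct ns_entry *e,
              const unsigned char old_sums[32], const unsigned char new_sums[32])
{
    if (!user_may_update(e))
        return EPERM;

    /* The old checksums supplied with the request must match the checksums
     * currently stored for the entry before the update is accepted. */
    if (memcmp(old_sums, stored_checksums(e), 32) != 0)
        return EEXIST;                 /* error code choice is an assumption */

    store_checksums(e, new_sums);      /* accept the update */
    invalidate_replicas(e);            /* all other instances are now stale */
    return OK;
}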

  10. Garbage collection

All data which is not in the “registered” state may be subject to garbage collection. In order to ensure that this does not interfere with the process of registering new data (for instance by deleting data while it is still being uploaded), the garbage collectors should respect a minimum “grace period”, the length of which will be configurable by the site administrator.

All data, prior to registration, must be timestamped in order to provide a base for the grace period. For files in the “incoming” area, the standard filesystem timestamps such as the UNIX “mtime” may be used. For files that have been moved into the storage element's database, the registration timestamp should be used.
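The following is a sketch of a garbage-collection pass over the “incoming” area, assuming a grace period expressed in seconds; the directory handling is standard POSIX, and the function name is an assumption.

#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/* Remove regular files in the "incoming" area whose mtime is older than
 * the configured grace period. */
void collect_incoming(const char *dir, time_t grace_period)
{
    char path[PATH_MAX];
    struct dirent *de;
    struct stat st;
    time_t now = time(NULL);
    DIR *d = opendir(dir);

    if (d == NULL)
        return;

    while ((de = readdir(d)) != NULL) {
        if (de->d_name[0] == '.')      /* skip "." and ".." */
            continue;
        snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
        if (stat(path, &st) == 0 && S_ISREG(st.st_mode) &&
            now - st.st_mtime > grace_period)
            unlink(path);              /* grace period expired: collect it */
    }
    closedir(d);
}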

  11. Reboot recovery

  12. Error codes

enum {
    OK     = 0,    /* Operation succeeded with no error */
    ENOENT = 1,    /* No such entry was found */
    EPERM  = 2,    /* Permission denied */
    EEXIST = 3     /* A duplicate entry already exists */
};




*trond.myklebust@fys.uio.no