1.4.2010
dCache going standard
The NFS v4.1 dCache implementation.
Since its earliest days, High Energy Physics has been one of the main scientific disciplines producing large volumes of data every year. As a result we have always had a storage problem: data never fits on a single disk or even on a single data server. Many custom solutions have been developed at various labs to address this issue. Castor, dCache, rfio, xrootd and others are used to store most of our permanently increasing physics data. Nevertheless they are not handy enough and require application modifications, which is not always an option.

In 1989 NFSv2 was introduced to unify access to remote data. However, it was still bound to a single data server and as a result was not widely used. Even NFS version 3 and version 4 did not solve the problem. After 16 years of waiting, storage vendors finally addressed our requirements. This joint effort of prominent storage vendors resulted in the NFSv4.1 protocol specification (RFC 5661 [1]). Today the Linux community, Microsoft, Oracle (Sun), NetApp, Panasas and many others are working hard to make a rock-solid client part of their OS, providing standard access to a distributed heterogeneous storage environment. The NFSv4.1 specification defines the client/server interaction while leaving room for server vendors to choose different implementations. Right now four server implementations are available: Linux, NetApp, Sun and dCache.
The NFSv4.1 protocol makes a distinction between metadata and data access. There are three types of data server access protocols defined: BLOCK-based, FILE-based and OBJECT-based. Nevertheless there is room to define new protocols. dCache implements the FILE-based data access protocol only. Data access is steered by mapping file data to the storage devices holding the data. Such a mapping is called a "file layout". It defines how data is organized on one or more storage devices. Prior to any IO operation, a client has to request the layout for the given file. Each layout is associated with a device ID, which identifies a group of storage devices. In the simplest case this group may point to a single storage device. Clients can use five operations:
- LAYOUTGET : to get the layout of a file
- LAYOUTRETURN : to notify the server that the client is not going to use this layout any more
- LAYOUTCOMMIT : to inform the metadata server about data written to the data servers
- GETDEVICEINFO : to get a mapping between device ID and data server address.
- GETDEVICELIST : allows clients to fetch all device IDs for a specific file system.
In addition, there are operations defined that allow metadata servers to recall a layout or to notify the client about changes to a device ID.
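The relationship between a file's layout, its device ID and the data server addresses can be pictured with a small data model. The following Python sketch is illustrative only: the class names `Layout` and `DeviceTable`, their fields, and the example address are assumptions made for this example and do not reflect dCache internals or the wire format of the protocol.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Layout:
    """Hypothetical, simplified view of what LAYOUTGET returns for one file."""
    file_handle: str
    device_id: str   # identifies a group of storage devices
    iomode: str      # "READ" or "RW" (simplified for this sketch)


@dataclass
class DeviceTable:
    """Hypothetical device ID -> data server address mapping."""
    devices: Dict[str, List[str]] = field(default_factory=dict)

    def get_device_info(self, device_id: str) -> List[str]:
        # Analogue of GETDEVICEINFO: resolve a device ID to addresses.
        # In the simplest case the group holds a single storage device.
        return self.devices[device_id]

    def get_device_list(self) -> List[str]:
        # Analogue of GETDEVICELIST: all device IDs of one file system.
        return sorted(self.devices)


# Example data (made up): one device ID pointing to a single data server.
table = DeviceTable({"dev-1": ["pool-a.example.org:2049"]})
layout = Layout(file_handle="fh-42", device_id="dev-1", iomode="READ")

addresses = table.get_device_info(layout.device_id)
print(addresses)
```

The point of the indirection is that the layout only names a device ID; the client resolves it to concrete addresses in a separate step.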
The typical IO operation on a file will result in a series of requests:
OPEN->LAYOUTGET->GETDEVICEINFO->READ/WRITE->CLOSE
| Operation | Direction | Server |
|---|---|---|
| OPEN | client ⇒ server | metadata server |
| LAYOUTGET | client ⇒ server | metadata server |
| device ID | client ⇐ server | metadata server |
| GETDEVICEINFO | client ⇒ server | metadata server |
| device address | client ⇐ server | metadata server |
| READ / WRITE | client ⇒ server | data server |
| LAYOUTRETURN | client ⇒ server | metadata server |
| CLOSE | client ⇒ server | metadata server |
Depending on the metadata server policy, layouts can be reused or have to be returned after the IO operations are completed.
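The request series above can be traced with a toy model. This is a minimal sketch, not a real NFS client: `ToyMetadataServer`, `ToyDataServer`, `read_file` and the `return_layout` flag are hypothetical names, and the file path and pool address are made up. It only shows which server each request goes to and that returning the layout is a policy decision.

```python
from typing import Dict, List, Tuple

Trace = List[Tuple[str, str]]  # (server role, operation)


class ToyMetadataServer:
    """Hypothetical in-memory metadata server that records its requests."""

    def __init__(self, layouts: Dict[str, str], devices: Dict[str, str],
                 trace: Trace) -> None:
        self.layouts = layouts   # file handle -> device ID
        self.devices = devices   # device ID -> data server address
        self.trace = trace

    def open(self, path: str) -> str:
        self.trace.append(("MDS", "OPEN"))
        return path  # the path doubles as the file handle in this toy model

    def layout_get(self, handle: str) -> str:
        self.trace.append(("MDS", "LAYOUTGET"))
        return self.layouts[handle]

    def get_device_info(self, device_id: str) -> str:
        self.trace.append(("MDS", "GETDEVICEINFO"))
        return self.devices[device_id]

    def layout_return(self, handle: str) -> None:
        self.trace.append(("MDS", "LAYOUTRETURN"))

    def close(self, handle: str) -> None:
        self.trace.append(("MDS", "CLOSE"))


class ToyDataServer:
    """Hypothetical data server holding file contents in memory."""

    def __init__(self, files: Dict[str, bytes], trace: Trace) -> None:
        self.files = files
        self.trace = trace

    def read(self, handle: str) -> bytes:
        self.trace.append(("DS", "READ"))
        return self.files[handle]


def read_file(mds: ToyMetadataServer, pools: Dict[str, ToyDataServer],
              path: str, return_layout: bool = True) -> bytes:
    handle = mds.open(path)
    device_id = mds.layout_get(handle)        # layout names a device ID
    address = mds.get_device_info(device_id)  # device ID -> address
    data = pools[address].read(handle)        # IO goes to the data server
    if return_layout:                         # depends on server policy
        mds.layout_return(handle)
    mds.close(handle)
    return data


trace: Trace = []
mds = ToyMetadataServer({"/data/run1.root": "dev-1"},
                        {"dev-1": "pool-a:2049"}, trace)
pools = {"pool-a:2049": ToyDataServer({"/data/run1.root": b"events"}, trace)}
payload = read_file(mds, pools, "/data/run1.root")
print(" -> ".join(op for _, op in trace))
```

Calling `read_file(..., return_layout=False)` models the case where the client keeps the layout for reuse instead of returning it.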
Since dCache version 1.9.3 (June 2009), NFSv4.1 has been part of the standard distribution. This allows dCache to be mounted as a regular NFS server providing standard POSIX file access. As NFSv4.1 is just another protocol that dCache supports, all features, e.g. pool selection, load balancing, checksumming, and automatic and manual replication, apply as well. This means, however, that files in dCache are still immutable, even when accessed through NFSv4.1. DCAP access over the mounted NFSv4.1 file system is supported as well. In addition to local file access, NFSv4.1 makes dCache more firewall-friendly by using a single TCP port per pool.

To avoid denial of service (DoS) on the backend tape systems, the dCache NFS interface does not trigger tape restores. More fine-grained ACLs on tertiary storage access through NFSv4.1, as already in place for other protocols, are under discussion. At the time of this writing, dCache does not support file striping. This means that only a single location of a file is returned to a client at a time, even if the file has one or more replicas on other pools. A second request from a different client may, however, return the location of a different copy of the file. This and other restrictions will be addressed in future dCache releases.
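The single-location behaviour without striping can be illustrated as follows. This sketch assumes a simple round-robin choice between replicas purely for illustration; it is not how dCache's pool selection works, and `ToyLocationSelector` as well as the pool names are hypothetical.

```python
from itertools import cycle
from typing import Dict, Iterator, List


class ToyLocationSelector:
    """Hypothetical selector: one location per request, rotating over replicas.

    Round-robin is an assumption of this sketch, not dCache's real policy.
    """

    def __init__(self, replicas: Dict[str, List[str]]) -> None:
        # One independent round-robin iterator per file.
        self._iters: Dict[str, Iterator[str]] = {
            path: cycle(pools) for path, pools in replicas.items()
        }

    def select(self, path: str) -> str:
        # Exactly one pool is returned, never a striped list of pools.
        return next(self._iters[path])


# A file with two replicas on different pools (made-up names).
selector = ToyLocationSelector({"/data/run1.root": ["pool-a", "pool-b"]})
first = selector.select("/data/run1.root")    # request from one client
second = selector.select("/data/run1.root")   # request from another client
print(first, second)
```

Each request sees exactly one pool, so a single client never reads one file from several pools in parallel, while different clients can still be spread across the replicas.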
Of course it makes no sense to have an NFSv4.1 server if there are no clients available. According to the Linux kernel roadmap, parallel NFS (NFS 4.1 pNFS) support will become part of the standard kernel with 2.6.36. Other vendors have not published any timelines yet. Today there are several ways to evaluate the power of NFSv4.1. If you are lucky enough to use Fedora 12 Linux, you can simply use the kernel builds provided by Steve Dickson [2]. For OpenSolaris users there is the Nevada build provided by Oracle (Sun) [3]. Hard-core users may build their own kernel from source. For the majority of the scientific world there is an SL5 (RHEL5) build provided by dCache.ORG [4].
While NFSv4.1 is still at an experimental stage, in dCache as well as in operating systems, the only way to get it into production is to start using it. This will not only help developers find and fix bugs at an early stage, but will also send a clear signal to vendors that with NFSv4.1 they are addressing storage issues not only for WLCG but also for other science communities that need standard access to a huge and exponentially increasing amount of data.
- [1] The NFSv4.1 RFC (RFC 5661)
- [2] NFS 4.1 in Fedora
- [3] NFS 4.1 in OpenSolaris
- [4] NFS 4.1 in dCache
- [5] NFS 4.1, some numbers