dCache 1.9.1 Release Notes
Introduction
It is our long-term goal to renovate the entire dCache code base,
focusing on increased modularity, consistency and orthogonality. As a
first step in this direction, the 1.9.1 release introduces a
refactored pool component.
The new pool component is a drop-in replacement for the old pool,
with an unaltered external interface to ease testing and
deployment. The new pool has run in limited production at NDGF for
several months, and we are fairly confident that, although not perfect,
it is safe to use. That being said, other sites have different usage
patterns, and we urge everybody to carefully test the new pool before
widespread deployment. We provide an upgrade strategy below.
It has been a design requirement to make the new pool a drop-in
replacement for the existing pool. Therefore the administrative
interface is mostly unchanged, and the feature set is more or less the
same. We provide a detailed list of new and dropped features
below.
Further iterations of the pool code are planned. Where the current
release focuses on modularity and internal consistency, future
releases will focus on mover management and improved support for
advanced protocols like NFS 4.1 and xrootd. These protocols
incorporate knowledge of the distributed nature of dCache and do not
fit well with the current mover-based I/O model.
Work has already begun on 1.9.2, our next feature release. It is
our intention to make bug fix releases for both 1.9.0 and 1.9.1, should
this become necessary.
Upgrading from 1.9.0
There are no restrictions on the order in which components can be
upgraded to 1.9.1: Any mix of 1.9.0 and 1.9.1 is supported.
The new pool is a drop-in replacement. It is therefore possible to
upgrade a single pool to 1.9.1 in an existing 1.9.0 installation for
the purpose of evaluation and testing. At any time the installation
can be downgraded to 1.9.0. The restrictions described in the 1.9.0
release notes also apply to 1.9.1 and a 1.9.1 pool should not be
used with any 1.8.0 head nodes.
Besides the new pool, 1.9.1 introduces significant changes to the
logging infrastructure, see below. To take advantage of these changes
on a dCache node, that node needs to be updated to 1.9.1.
Important: It appears that config/log4j.properties is not always
updated automatically on upgrade. Please make sure to use the version
shipped with the 1.9.1 release, as logging will otherwise be very noisy
and likely in the wrong format.
Known Issues
Modified Handling of Transfer Failures
The new pool treats any upload failure as an error and marks the
replica as broken (using the flags BAD and FROM_CLIENT). This is a
change from the old pool, which would mark such files as PRECIOUS.
The new behaviour is currently being debated by the dCache team and
will be altered again in the future.
Changed and New Features in 1.9.1
Refactored Pool
As described above, 1.9.1 ships with a refactored pool. The new
pool is an evolution of the old pool, not a complete rewrite. The old
pool is still shipped with 1.9.1, although deactivated by default. A
batch file is available for activating the old pool.
Modular Pool Design
Although not really a new feature, the modular structure is evident
through several commands in the pool. In particular the output of the
info command has changed: the output is now grouped per module.
A new set of commands with the prefix bean was added. The bean
commands allow inspection of the loaded modules and were introduced
mainly for debugging purposes.
Removed Pool Functionality
Several features have been removed from the new pool compared to
the old one. Most of these features did not work in the old pool
anyway or have been replaced by new concepts. The following features
are not supported:
- LFS mode=hsm
- Suppressing flush of zero sized files (should be done in HSM script)
- Classic space reservation (that is, the old space manager)
- pool inventory command
- pp keep command
- storage info key "overwrite"
- -permanent option is gone
- -sticky=allowed|denied option is gone: Since 1.8 sticky
flags are essential for correct functioning of dCache. They are always
allowed in the new pool.
- -remove-unexisting-entries-on-flush option is gone: We now
always suppress flush when the file doesn't exist in PNFS, but we
don't delete the replica until the cleaner explicitly sends a remove
message.
Log4j
For some time now dCache has used two logging systems: an old
logging system based on a simple printout level per cell, and a new
logging system based on Log4j. This has led to some confusion, as the
two systems used incompatible configuration mechanisms and inconsistent
formatting. This has been resolved in version 1.9.1.
Internally, dCache still uses two logging systems, but the old
system has been refurbished to call through to Log4j. That means the
user only has to deal with one logging system.
The old logging system only had two log levels (debug and
error). These are mapped to equivalent log levels in Log4j, but the
mapping may be adjusted through the printout level. This mapping is a
stopgap that exists only while some code still uses the old logging
calls, and it will eventually be removed. Code logging to Log4j directly
is not affected
by the printout level. The Log4j runtime user interface introduced in
version 1.9.0 should be preferred to adjusting the printout level.
The Cells pinboard system has been restructured. Previously code
would write directly to the pinboard using special log calls. This has
been replaced by a custom Log4j appender. This appender redirects
messages to the appropriate cell's pinboard. Which messages are added
to the pinboard is configurable through Log4j.
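The idea of an appender that routes log messages to a per-cell
pinboard can be sketched in a few lines. The sketch below uses Python's
logging module rather than Log4j and is not the dCache implementation:
the class name PinboardHandler, the "cell" attribute on the log record
and the pinboard capacity are all assumptions made for illustration.
The handler's level stands in for the Log4j configuration that decides
which messages reach the pinboard.

```python
import logging
from collections import deque


class PinboardHandler(logging.Handler):
    """Hypothetical stand-in for the custom appender described above."""

    def __init__(self, capacity=200, level=logging.INFO):
        super().__init__(level)
        self.capacity = capacity
        self.pinboards = {}  # cell name -> bounded list of recent messages

    def emit(self, record):
        # The "cell" attribute is an assumption of this sketch; it is
        # supplied by the caller through the `extra` argument.
        cell = getattr(record, "cell", "unknown")
        board = self.pinboards.setdefault(cell, deque(maxlen=self.capacity))
        board.append(self.format(record))


handler = PinboardHandler(capacity=3, level=logging.WARNING)
logger = logging.getLogger("pinboard_demo")
logger.setLevel(logging.DEBUG)
logger.propagate = False
logger.addHandler(handler)

logger.debug("not pinned", extra={"cell": "pool_a"})  # below the handler level
logger.warning("disk nearly full", extra={"cell": "pool_a"})
logger.error("flush failed", extra={"cell": "pool_b"})
```

After these calls each hypothetical cell has its own pinboard, and the
debug message is absent because the handler only accepts WARNING and
above.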
Log Context
We have introduced a logging context. Using our default Log4j
configuration, the log context is printed in square brackets in front
of any log message. The content of the log context depends on the
context (hence the name), but will in many cases contain:
- The source of the request (cell name)
- Session ID
- Message type
- PNFS ID
Session ID uniquely identifies an activity across multiple
cells. Support for session ID is limited in 1.9.1, since not all
components generate or forward the session ID. Support will improve
over time.
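The following sketch illustrates how a log context of this kind can
be attached to every message. It uses Python's logging module, not
Log4j, and does not reproduce the actual dCache log format: the field
order, the space separator and all sample values (cell name, session
ID, message type, PNFS ID) are assumptions chosen for the example.

```python
import contextvars
import io
import logging

# Holds the current context fields for the running activity.
_log_context = contextvars.ContextVar("log_context", default=())


class ContextFilter(logging.Filter):
    """Copies the current context onto each record so the formatter can print it."""

    def filter(self, record):
        record.context = " ".join(_log_context.get())
        return True


def push_context(*fields):
    """Append fields to the context; the returned token restores the old value."""
    return _log_context.set(_log_context.get() + tuple(f for f in fields if f))


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("[%(context)s] %(message)s"))
handler.addFilter(ContextFilter())

logger = logging.getLogger("context_demo")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

# Hypothetical values: cell name, session ID, message type, PNFS ID.
token = push_context("pool_a", "session-42", "ExampleMessage", "0001AB")
logger.info("mover started")
_log_context.reset(token)
logger.info("idle")
```

The first message is prefixed with the four context fields in square
brackets; once the context is reset, the brackets are empty.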
Migration Module
A migration module was written for the new pool. This module
subsumes the functionality of the copy module for the maintenance
cell. The migration module allows replicas to be copied or moved
between pools. If 1.9.0 head nodes are used, then the only supported
value for the -target option is pool.
In contrast to the old copy module, the migration module can
maintain sticky flags and can update the state of the source after
transfer, including deleting the source replica. Notice that the
migration module is unaware of the pin manager and the space
manager.
More development is planned for future releases. Known issues:
Several migration tasks running from the same source pool may in
rare cases interact in unforeseen ways. Logging is almost
non-existent. In case of failure, the module retries eagerly without
pausing.
The migration module is accessed through the admin commands of the
pool. All commands are prefixed with migration. Please use
help migration copy for usage information. For your convenience, the
documentation is reprinted below:
Copies files to other pools. Unless filter options are specified,
all files on the source pool are copied.
The operation is idempotent, that is, it can safely be repeated
without creating extra copies of the files. If the replica exists
on any of the target pools, then it is not copied again.
Both the state of the local replica and that of the target replica
can be specified. If the target replica already exists, the state
is updated to be at least as strong as the specified target state,
that is, the lifetime of sticky bits is extended, but never reduced,
and cached can be changed to precious, but never the opposite.
Syntax:
copy [options] <target> ...
Options:
-state=cached|precious
Only copy replicas in the given state.
-sticky[=<owner>[,<owner> ...]]
Only copy sticky replicas. Can optionally be limited to
the list of owners. A sticky flag for each owner must be
present for the replica to be selected.
-storage=<class>
Only copy replicas with the given storage class.
-pnfsid=<pnfsid>
Only copy the replica with the given PNFS ID.
-smode=same|cached|precious|removable|delete[+<owner>[(<lifetime>)] ...]
Update the local replica to the given mode after transfer.
'same' does not change the local state (this is the
default), 'cached' marks it cached, 'precious' marks it
precious, 'removable' marks it cached and strips all
existing sticky flags, and 'delete' deletes the replica.
An optional list of sticky flags can be specified. The
lifetime is in seconds. A lifetime of 0 causes the flag
to expire immediately. Notice that existing sticky flags
of the same owner are overwritten.
-tmode=same|cached|precious[+<owner>[(<lifetime>)] ...]
Set the mode of the target replica. 'same' applies the
state and sticky bits of the local replica (this is the
default), 'cached' marks it cached, 'precious' marks it
precious. An optional list of sticky flags can be
specified. The lifetime is in seconds.
-select=proportional|best|random
Determines how a pool is selected from the set of target
pools. 'proportional' selects a pool with a probability
inversely proportional to the cost of the pool. 'best'
selects the pool with the lowest cost. 'random' selects
a pool randomly. The default is 'proportional'.
-target=pool|pgroup|link
Determines the interpretation of the target names. 'pool'
is the default.
-refresh=<time>
Specifies the period in seconds of when target pool
information is queried from the pool manager. The
default is 300 seconds.
-exclude=<pool>[,<pool> ...]
Exclude target pools.
-concurrency=<concurrency>
Specifies how many concurrent transfers to perform.
Defaults to 1.
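Two rules in the help text above can be made concrete with a short
sketch: the "at least as strong" update of an existing target replica,
and the three -select modes. This is an illustration in Python, not the
migration module's code. The function names, the representation of
sticky flags as a mapping from owner to expiry time, and the
requirement that pool costs are strictly positive are assumptions of
the sketch.

```python
import random

# 'precious' is treated as stronger than 'cached', as the help text states.
STATE_STRENGTH = {"cached": 0, "precious": 1}


def merge_target_state(existing_state, existing_sticky, wanted_state, wanted_sticky):
    """Return a state and sticky flags at least as strong as both inputs.

    Sticky flags are modelled as {owner: expiry}; a larger expiry means a
    longer lifetime (this representation is an assumption of the sketch).
    """
    state = max(existing_state, wanted_state, key=STATE_STRENGTH.__getitem__)
    sticky = dict(existing_sticky)
    for owner, expiry in wanted_sticky.items():
        # Lifetimes are extended, never reduced.
        sticky[owner] = max(expiry, sticky.get(owner, expiry))
    return state, sticky


def select_pool(costs, mode="proportional", rng=random):
    """Pick a target pool from {pool: cost}; costs are assumed to be > 0."""
    pools = list(costs)
    if mode == "best":
        return min(pools, key=costs.__getitem__)
    if mode == "random":
        return rng.choice(pools)
    if mode == "proportional":
        # Probability inversely proportional to the cost of each pool.
        weights = [1.0 / costs[p] for p in pools]
        return rng.choices(pools, weights=weights, k=1)[0]
    raise ValueError("unknown selection mode: " + mode)


state, sticky = merge_target_state(
    "precious", {"system": 100}, "cached", {"system": 50, "admin": 10}
)
cheapest = select_pool({"pool_a": 2.0, "pool_b": 0.5}, mode="best")
```

In this example the existing precious replica stays precious, the
longer "system" lifetime is kept, and "best" picks the hypothetical
pool_b. On the admin interface, a command following the documented
syntax might look like
migration copy -state=precious -smode=removable -concurrency=2 pool_b
where the pool name is again hypothetical.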
dCache 1.9.2 Goals
- New authorization infrastructure
- Unpin by VO
Changelog since 1.9.0-4
Pool
- [r8700] Added initial incomplete implementation of new repository subsystem
- [r8765] Documentation updates and various minor fixes
- [r8768] Eliminated some of the parameters on the constructor
- [r9029] Simplified synchronisation in CacheRepositoryV5. I want the first iteration to be simple, so reader-writer locking is overkill
- [r9034] Copied pool classes which either directly or indirectly depend on the
- [r9036] Updated sweeper creation in new repository subsystem to not expect the
- [r9037] Don't pass CacheRepository to HSM and P2P subsystems. This will make
- [r9058] Changed LRU order such that removable files are placed first. With this
- [r9115] Refined several throws-declarations, and prevented createEntry from throwing DuplicateEntryException
- [r9119] WriteHandle:
- [r9123] Copied from modules/dCache/diskCacheV111/pools. Needs to be modified for the new repository subsystem
- [r9127] Ported from old repository to new repository subsystem
- [r9135] Ported PoolV4 and P2PClient to new repository. The two space monitor
- [r9205] Ported HSM support to new pool repository
- [r9212] Fixed 'rh restore' command; due to changes in HsmStorageHandler2,
- [r9214] Implemented the 'flush pnfsid' command
- [r9215] Removed space reservation support (was commented out and we decided that
- [r9218] Implemented onRestore checksum policy
- [r9223] Enabled the replication handler for the new pool. Currently as a hack
- [r9262] Extended new repository interface to provide access to the sticky
- [r9263] Added support for PoolSetStickyMessage
- [r9265] Added support for saving configuration
- [r9279] Made new pool inherit from AbstractCell
- [r9304] Added fault listener to new pool. It doesn't do much yet, but it allows
- [r9315] Added periodic health check of repository
- [r9317] Gave the repository the ability to add information to getInfo
- [r9334] Added shutdown method to repository. This allows the repository to shut
- [r9383] Added error handling for when enqueuing a mover fails
- [r9391] Removed zero-size-file-flush-suppression; after discussion between
- [r9425] Implemented support for the h-flag
- [r9438] Resolved an issue in how the h-flag was updated (it should only be
- [r9442] Added NoRouteToCellException declarations in anticipation of doing
- [r9450] Removed overwrite option from the client code used to send a new
- [r9462] Throw exception when attempting to open a broken entry
- [r9478] WriteHandleImpl.close now throws a CacheException if the replica size
- [r9486] Propagate InterruptedException from storeChecksumInPnfs method
- [r9487] Moved checksum verification after restore from PoolV4 to
- [r9505] Moved ping thread creation to top of constructor: in case of failure,
- [r9510] Fixed spelling error in method name
- [r9529] Added lfs=volatile support
- [r9569] Removed obsolete comments
- [r9570] Reworked handling of sweeper instantiation and "Storage Mode" to getInfo
- [r9571] It is said that it is easier to get forgiveness than permission, so here
- [r9578] Added space allocation to new P2PClient
- [r9579] Fixed checksum calculation for empty files
- [r9580] Restructured CacheRepositoryV5 to use setter injection rather than rely
- [r9655] Made ReplicationHandler a StateChangeListener. Thus we no longer need to
- [r9664] Fixed synchronization issues found by findbugs
- [r9668] Added a call to register a cache location for broken entries (broken
- [r9725] This patch changes StorageClassInfo to log via log4j rather than printing to stderr. The patch only addresses the
- [r9800] Use complete sentence when logging pool mode changes
- [r9801] Remove Logable support from HsmStorageHandler2
- [r9809] Decouples SpaceSweeper from CellAdapter
- [r9837] Use source-routing for DoorTransferFinishedMessage
- [r9838] Use DelayedReply for HsmStorageInterpreter
- [r9841] Decouples RepositoryInterpreter from CellAdapter
- [r9853] Use log4j logging and CellEndpoint in movers
- [r9854] Inherit thread group in CacheRepositoryV5
- [r9858] Decouple ChecksumModuleV1 from CellAdapter
- [r9859] Use log4j in HsmStorageInterpreter
- [r9860] Do not implement Logable in PoolV4
- [r9863] Do not access CellNucleus in PoolV4
- [r9864] Use log4j for new pool
- [r9865] Decouples P2PClient from CellAdapter
- [r9872] Decouples HsmFlushController in the new pool from CellAdapter
- [r9897] Fixes deadlock problem in CacheRepositoryV5
- [r9907] Decouple HsmStorageHandler2 from CellAdapter
- [r9908] Adds close method to StickyInspector
- [r9909] Fixes regression introduced in rev. 9908
- [r9910] Removes SpaceSweeper0 and SpaceSweeper1
- [r9911] Rework SimpleJobScheduler naming
- [r9913] Move component creation from CacheRepositoryV5 to Spring container
- [r9929] Removes deprecated or dead code from PoolV4
- [r9930] Reduce PoolV4's coupling to AbstractCell and CellAdapter
- [r9931] Avoid repeated creation of CellPath instance
- [r9935] Prepare pool components for use with UniversalSpringCell
- [r9936] Move the new pool to UniversalSpringCell
- [r9937] Moves Spring XML files into the jar file
- [r9942] Changes 'reply required' flag for PoolFetchFileMessage
- [r9943] Introduced AbstractCellComponent
- [r9944] Removes unused code from CacheRepositoryV5
- [r9953] Wrap cache location registration in try-finally block
- [r9957] Use new message dispatch infrastructure for new pool
- [r9961] Introduce interface for repository subsystem
- [r9962] Yeah yeah - forgot to run 'svn add' and 'svn remove'.... Shame on me
- [r9969] Catch RuntimeException thrown in StateChangeListener
- [r9970] Makes PnfsHandler.addCacheLocation blocking
- [r9986] Don't touch file when creating new entry
- [r9987] Minor cleanup in printSetup of new pool
- [r9990] Modifies runInventory() to keep retrying upon communication failures
- [r9991] Fixes deadlock in new pool
- [r9994] Fix CacheRepositoryV5 tests. Were broken by rev. 9991
- [r10063] Suppress thread interrupts after fetch or store
- [r10085] Remove deprecated -allowSticky and -permanent flags
- [r10098] Cleaned up log output of P2PClient
- [r10101] Increased logging for sources of the remove entry in the pool repository, Patch#3419
- [r10149] Fix mover arguments
- [r10175] Clean up logging in checksum module
- [r10176] Clean up checksum logic in checksum module
- [r10177] Clean up logging in HsmFlushController
- [r10179] Cleanup info command in UniversalSpringCell
- [r10189] Remove replica if cache location registration fails due to delete file
- [r10210] Update pool to handle new trash better
- [r10257] Clean up logging in HsmStorageHandler2
- [r10259] Berkeley DB generates an error message when the DB is closed several
- [r10260] Avoid closing Berkeley DB twice
- [r10291] Copied diskCacheV111.pools.StorageClassInfoFlushable (trunk r10290) to
- [r10308] Forgot to update the JobTimeoutManager package name in the Spring file
- [r10318] Add support for extending sticky bits
- [r10319] Adds support for creating entries with multiple sticky records
- [r10320] Refactor P2PClient into plugin for new pool
- [r10327] Added missing P2P files from r10320
- [r10328] Removed old P2PClient implementation
- [r10329] Give helper thread for 'rh restore' command a name
- [r10331] Migration module for the new pool
- [r10332] Use NIO channels in P2PClient
- [r10334] Adds shutdown command to job scheduler
- [r10335] Replace WriteHandle.cancel with WriteHandle.commit
- [r10342] Clean up log output of new pool
- [r10407] Mark CacheRepositoryEntry dirty whenever it is changed
- [r10410] Fixes several issues in log4j shell
- [r10415] Copied missing space sweepers from current trunk to new pool
- [r10453] Don't hold lock while P2P callback is executed
- [r10458] Clone CellPath for CopyFinished message
- [r10465] Fix message ordering assumption in migration module
- [r10468] Do not attempt to migrate broken files
- [r10470] Fixes defaults for 'migration replica' command
- [r10472] Renamed a couple of migration commands
- [r10474] Set PoolListFromPoolManager._pools to empty list
- [r10476] Include command in migration info output
- [r10485] Finish implementation of 'migration ls' command
- [r10486] Implement FixedPoolList.toString
- [r10496] Fix leak in StickyInspector
- [r10500] Fixed test to match changes in r10496
- [r10517] Ignore NACK on migration ping
- [r10519] Fix healer regression introduced in #3829
- [r10536] Added test to check overallocation handling
- [r10540] Fix bug described in patch #3939
pnfsDomain
- [r10263] Add commands to hsmcleaner to query its internal state
- [r10344] Fixed HSM cleaner handling of multiple occurrences of the same location
- [r10370] Merged trunk r9451 into 1.9 (don't overwrite checksum in PNFS)
dCacheDomain
- [r9122] Removed _spaceCostFactor and _performanceCostFactor variables, since they were not used anymore
- [r9242] more debug messages on selection
- [r9319] print exclude pool message on exclude only
- [r9347] removed unused variables
- [r9405] synchronized javadoc with actual values
Cells
- [r9615] Added various interfaces for expressing aspects of Cell adapter. Rather
- [r9618] Added new glue cell to integrate the cell and spring frameworks. The old
- [r9627] Split CellCommunicationAware into two interfaces: One for sending and
- [r9628] Added checks to shell commands to detect an uninitialised cell
- [r9726] Documentation update in JavaDoc
- [r9932] Implement setup file support in UniversalSpringCell
- [r9934] Misc updates to UniversalSpringCell
- [r9943] Introduced AbstractCellComponent
- [r10179] Cleanup info command in UniversalSpringCell
- [r10181] Add 'bean messages' command to UniversalSpringCell
- [r10272] Use log4j for Cells package
- [r10345] Renamed org.dcache.cell to org.dcache.cells and moved AbstractCell
- [r10400] Refactor cells log4j support for better encapsulation
- [r10402] Adds session identifier to cells
- [r10404] Forgot to add file in r10402
- [r10491] Add cell timeout task to AbstractCell
infoDomain
- [r10282] Add missing pool-to-link persistent metadata
- [r10424] Adding metric to indicate door interface ordering
- [r10426] The following patch tidies up two visitors that share some common code
Misc
- [r9450] Removed overwrite option from the client code used to send a new
- [r9480] Use blocking send for sending PnfsSetChecksumMessage (otherwise setting
- [r9576] Don't log error messages for something which is quite normal
- [r9852] Use CellEndpoint for ChecksumFactory and ChecksumPersistence
- [r10086] Log message type when receiving unexpected message
- [r9720] This patch changes ExternalTask to log via log4j rather than a Cell Logable target. The patch only addresses the
- [r9721] This patch changes the RunSystem class to log via log4j rather than a Cell Logable target. The patch only addresses
- [r9734] Marked deprecated constructor (the one requiring Logable) as
- [r9738] Added logging of return code if it is non-zero. Suggested by Paul
- [r9871] Removes most uses of the Logable interface
- [r9120] Made DEFAULT_ERROR_CODE public
- [r10290] Mark constants static to avoid serialization
- [r10296] make use of new startup script in rpm pre/post sections
- [r10314] replaced ManagerV2 w/ Manager, an omission I regret
- [r10337] Merged trunk r9531 into 1.9 (adds spring jars)
- [r10334] Adds shutdown command to job scheduler
- [r10371] Merged trunk r9523 into 1.9 (styling of restore web page)
- [r10412] In case you are declaring a DNS-like VO name in site-info.def, remember to change the "." or "-" with "_", like in the example:
- [r10418] aim_config_file_get_value has been used for some time in other areas
- [r10432] Fix for inconsistencies between this script and dcache-core dcache-pool as stated in patch #3874 in starting and stopping the adminDoor domain
- [r10440] committing Patch #3874 intended to change the domain name of the domain hosting the
- [r10443] patch 3880
- [r10456] Silence PTLS debug logger
- [r10525] patch #3930, reduce the default level of srm logging
SRM Client Tools
- [r10303] handle "-2" "-1" and "-srm_protocol_version" in consistent manner
- [r10302] make sure we set TRetentionPolicy appropriately, based on client input
FTP Door
- [r10322] Move perfMarkerTask field into Transfer object
- [r10324] Replaces _transferInProgress with Transfer.aborted field
- [r10326] Renamed transfer_error to abortTransfer
Xrootd
- [r10046] Patch 1 of 2: xrootd API change
- [r10047] Patch 2 of 2: xrootd API change
hsmcp.rb
- [r7125] Merged from production-1-7-0-NDGF: New hsmcp script written in Ruby
- [r7251] layout reorganize
- [r7369] Added remove command to HSM copy script
- [r7379] Fixed bug introduced in rev. 7369 (Ruby is not C :-) )
- [r7511] Fixed type error
- [r7515] Do not fail if file is already flushed (HsmStorageHandler2 expects this
- [r9203] Updated help strings to show required parameters