mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-03-22 08:51:30 +03:00

Author	SHA1	Message	Date
Brian Behlendorf	6bf4d76f47	Linux Compat: inode->i_mutex/i_sem Create spl_inode_lock/spl_inode_unlock compability macros to simply access to the inode mutex/sem. This avoids the need to have to ugly up the code with the required #define's at every call site. At the moment the SPL only uses this in one place but higher layers can benefit from the macro.	2011-01-11 12:14:48 -08:00
Brian Behlendorf	95c73795b0	Fix ZVOL rename minor devices During a rename we need to be careful to destroy and create a new minor for the ZVOL _only_ if the rename succeeded. The previous code would both destroy you minor device unconditionally, it would also fail to create the new minor device on success.	2011-01-07 12:26:02 -08:00
Brian Behlendorf	149e873ab1	Fix minor compiler warnings These compiler warnings were introduced when code which was previously #ifdef'ed out by HAVE_ZPL was re-added for use by the posix layer. All of the following changes should be obviously correct and will cause no semantic changes.	2011-01-06 15:04:28 -08:00
Brian Behlendorf	683fe41fc7	Add missing mkdirp prototype For while now mkdirp has been built as part of libspl however the protoype was never added to libgen.h. This went unnoticed until enabling the mount support which uses mkdirp().	2010-12-14 10:06:44 -08:00
Brian Behlendorf	5b63b3eb6f	Use cv_timedwait_interruptible in arc The issue is that cv_timedwait() sleeps uninterruptibly to block signals and avoid waking up early. Under Linux this counts against the load average keeping it artificially high. This change allows the arc to sleep interruptibly which mean it may be woken up early due to a signal. Normally this means some extra care must be taken to handle a potential signal. But for the arcs usage of cv_timedwait() there is no harm in waking up before the timeout expires so no extra handling is required.	2010-12-14 10:06:44 -08:00
Ricardo M. Correia	8d4e8140ef	Fix block device-related issues in zdb. Specifically, this fixes the two following errors in zdb when a pool is composed of block devices: 1) 'Value too large for defined data type' when running 'zdb <dataset>'. 2) 'character device required' when running 'zdb -l <block-device>'. Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-12-14 09:52:46 -08:00
Brian Behlendorf	a7dc7e5d5a	Enable rrwlock.c compilation With the addition of the thread specific data interfaces to the SPL it is safe to enable compilation of the re-enterant read reader/writer locks.	2010-12-07 16:05:25 -08:00
Brian Behlendorf	135cf6a8ae	Refresh autogen.sh products Refresh the autogen.sh products based on the versions which are installed by default in the GA RHEL6.0 release. autoconf (GNU Autoconf) 2.63 automake (GNU automake) 1.11.1 ltmain.sh (GNU libtool) 2.2.6b	2010-12-07 15:33:12 -08:00
Brian Behlendorf	b7dc313837	Add Thread Specific Data (TSD) Regression Test To validate the correct behavior of the TSD interfaces it's important that we add a regression test. This test is designed to minimally exercise the fundamental TSD behavior, it does not attempt to validate all potential corner cases. The test will first create 32 keys via tsd_create() and register a common destructor. Next 16 wait threads will be created each of which set/verify a random value for all 32 keys, then block waiting to be released by the control thread. Meanwhile the control thread verifies that none of the destructors have been run prematurely. The next phase of the test is to create 16 exit threads which set/verify a random value for all 32 keys. They then immediately exit. This is is designed to verify tsd_exit() which will be called via thread_exit(). This must result in all registered destructors being run and the memory for the tsd being free'd. After this tsd_destroy() is verified by destroying all 32 keys. Once again we must see the expected number of destructors run and the tsd memory free'd. At this point the blocked threads are released and they exit calling tsd_exit() which should do very little since all the tsd has already been destroyed. If this all goes off without a hitch the test passes. To ensure no memory has been leaked, I have manually verified that after spl module unload no memory is reported leaked.	2010-12-07 10:02:44 -08:00
Brian Behlendorf	9fe45dc1ac	Add Thread Specific Data (TSD) Implementation Thread specific data has implemented using a hash table, this avoids the need to add a member to the task structure and allows maximum portability between kernels. This implementation has been optimized to keep the tsd_set() and tsd_get() times as small as possible. The majority of the entries in the hash table are for specific tsd entries. These entries are hashed by the product of their key and pid because by design the key and pid are guaranteed to be unique. Their product also has the desirable properly that it will be uniformly distributed over the hash bins providing neither the pid nor key is zero. Under linux the zero pid is always the init process and thus won't be used, and this implementation is careful to never to assign a zero key. By default the hash table is sized to 512 bins which is expected to be sufficient for light to moderate usage of thread specific data. The hash table contains two additional type of entries. They first type is entry is called a 'key' entry and it is added to the hash during tsd_create(). It is used to store the address of the destructor function and it is used as an anchor point. All tsd entries which use the same key will be linked to this entry. This is used during tsd_destory() to quickly call the destructor function for all tsd associated with the key. The 'key' entry may be looked up with tsd_hash_search() by passing the key you wish to lookup and DTOR_PID constant as the pid. The second type of entry is called a 'pid' entry and it is added to the hash the first time a process set a key. The 'pid' entry is also used as an anchor and all tsd for the process will be linked to it. This list is using during tsd_exit() to ensure all registered destructors are run for the process. The 'pid' entry may be looked up with tsd_hash_search() by passing the PID_KEY constant as the key, and the process pid. Note that tsd_exit() is called by thread_exit() so if your using the Solaris thread API you should not need to call tsd_exit() directly.	2010-12-07 10:02:32 -08:00
Brian Behlendorf	8beea9ac24	Refresh autogen.sh products Refresh the autogen.sh products based on the versions which are installed by default in the GA RHEL6.0 release. autoconf (GNU Autoconf) 2.63 automake (GNU automake) 1.11.1 ltmain.sh (GNU libtool) 2.2.6b	2010-11-30 10:36:58 -08:00
Ricardo M. Correia	c2f997b0b3	Make kmutex_t typesafe in all cases. When HAVE_MUTEX_OWNER and CONFIG_SMP are defined, kmutex_t is just a typedef for struct mutex. This is generally OK but has the downside that it can make mistakes such as mutex_lock(&kmutex_var) to pass by unnoticed until someone compiles the code without HAVE_MUTEX_OWNER or CONFIG_SMP (in which case kmutex_t is a real struct). Note that the correct API to call should have been mutex_enter() rather than mutex_lock(). We prevent these kind of mistakes by making kmutex_t a real structure with only one field. This makes kmutex_t typesafe and it shouldn't have any impact on the generated assembly code. Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-11-29 11:25:32 -08:00
Brian Behlendorf	058de03caa	Clear cv->cv_mutex when not in use For debugging purposes the condition varaibles keep track of the mutex used during a wait. The idea is to validate that all callers always use the same mutex. Unfortunately, we have seen cases where the caller reuses the condition variable with a different mutex but in a way which is known to be safe. My reading of the man pages suggests you should not do this and always cv_destroy()/cv_init() a new mutex. However, there is overhead in doing this and it does appear to be allowed under Solaris. To accomidate this behavior cv_wait_common() and __cv_timedwait() have been modified to clear the associated mutex when the last waiter is dropped. This ensures that while the condition variable is in use the incorrect mutex case is detected. It also allows the condition variable to be safely recycled without requiring the overhead of a cv_destroy()/cv_init() as long as it isn't currently in use. Finally, spin lock cv->cv_lock was removed because it is not required. When the condition variable is used properly the caller will always be holding the mutex so the spin lock is redundant. The lock was originally added because I expected to need to protect more than just the cv->cv_mutex. It turns out that was not the case. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-11-29 11:02:34 -08:00
Ned Bass	31165fd9aa	Remove partition from vdev name in zfault.sh As of the 0.5.2 tag, names of whole-disk vdevs must be specified to the command line tools without partition identifiers. This commit fixes a 'zpool online' command in zfault.sh that incorrectly includes he partition in the vdev name, causing test 9 to fail. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-11-29 10:53:53 -08:00
Brian Behlendorf	5e7affae52	Skip /dev/hpet during 'zpool import' If libblkid does not contain ZFS support, then 'zpool import' will scan all block devices in /dev/ to determine which ones are components of a ZFS filesystem. It does this by opening all the devices and stat'ing them to determine which ones are block devices. If the device turns out not to be a block device it is skipped. Usually, this whole process is pretty harmless (although slow). But there are certain devices in /dev/ which must be handled in a very specific way or your system may crash. For example, if /dev/watchdog is simply opened the watchdog timer will be started and your system will panic when the timer expires. It turns out the /dev/hpet causes similiar problems although only when accessed under a virtual machine. For some reason accessing /dev/hpet causes qemu to crash. To address this issue this commit adds /dev/hpet to the device blacklist, it will be skipped solely based on its name.	2010-11-12 09:33:17 -08:00
Brian Behlendorf	e0f3df67e5	Add '-ts' options to zconfig.sh/zfault.sh usage When adding this functionality originally the options to only run specific tests (-t), or conversely skip specific tests (-s) were omitted from the usage page. This commit adds the missing documentation.	2010-11-11 11:40:06 -08:00
Brian Behlendorf	7dc3830c0f	Remove spl/zfs modules as part of cleanup The idea behind the '-c' flag is to cleanup everything from a previous test run which might cause the test script to fail. This should also include removing the previously loaded module. This makes it a little easier to run 'zconfig.sh -c', however remember this is a test script and it will take all of your other zpools offline for the purposes of the test. This notion has also been extended to the default 'make check' behavior.	2010-11-11 11:40:06 -08:00
Brian Behlendorf	cf47fad67d	Unconditionally load core kernel modules Loading and unloading the zlib modules as part of the zfs.sh script has proven a little problematic for a few reasons. * First, your kernel may not need to load either zlib_inflate or zlib_deflate. This functionality may be built directly in to your kernel. It depends entirely on what your distribution decided was the right thing to do. * Second, even if you do manage to load the correct modules you may not be able to unload them. There may other consumers of the modules with a reference preventing the unload. To avoid both of these issues the test scripts have been updated to attempt to unconditionally load all modules listed in KERNEL_MODULES. If the module is successfully loaded you must have needed it. If the module can't be loaded that almost certainly means either it is built in to your kernel or is already being used by another consumer. In both cases this is not an issue and we can move on to the spl/zfs modules. Finally, by removing these kernel modules from the MODULES list we ensure they are never unloaded during 'zfs.sh -u'. This avoids the issue of the script failing because there is another consumer using the module we were not aware of. In other words the script restricts unloading modules to only the spl/zfs modules. Closes #78	2010-11-11 11:38:25 -08:00
Ned Bass	e06be58641	Fix for access beyond end of device error This commit fixes a sign extension bug affecting l2arc devices. Extremely large offsets may be passed down to the low level block device driver on reads, generating errors similar to attempt to access beyond end of device sdbi1: rw=14, want=36028797014862705, limit=125026959 The unwanted sign extension occurrs because the function arc_read_nolock() stores the offset as a daddr_t, a 32-bit signed int type in the Linux kernel. This offset is then passed to zio_read_phys() as a uint64_t argument, causing sign extension for values of 0x80000000 or greater. To avoid this, we store the offset in a uint64_t. This change also changes a few daddr_t struct members to uint64_t in the libspl headers to avoid similar bugs cropping up in the future. We also add an ASSERT to __vdev_disk_physio() to check for invalid offsets. Closes #66 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-11-10 21:29:07 -08:00
Brian Behlendorf	1f30b9d432	Linux 2.6.36 compat, use fops->unlocked_ioctl() As of linux-2.6.36 the last in-tree consumer of fops->ioctl() has been removed and thus fops()->ioctl() has also been removed. The replacement hook is fops->unlocked_ioctl() which has existed in kernel since 2.6.12. Since the ZFS code only contains support back to 2.6.18 vintage kernels, I'm not adding an autoconf check for this and simply moving everything to use fops->unlocked_ioctl().	2010-11-10 17:01:08 -08:00
Brian Behlendorf	8326eb4605	Linux 2.6.36 compat, blk_* macros removed Most of the blk_* macros were removed in 2.6.36. Ostensibly this was done to improve readability and allow easier grepping. However, from a portability stand point the macros are helpful. Therefore the needed macros are redefined here if they are missing from the kernel.	2010-11-10 17:00:40 -08:00
Brian Behlendorf	675de5aa37	Linux 2.6.36 compat, synchronous bio flag The name of the flag used to mark a bio as synchronous has changed again in the 2.6.36 kernel due to the unification of the BIO_RW_* and REQ_* flags. The new flag is called REQ_SYNC. To simplify checking this flag I have introduced the vdev_disk_dio_is_sync() helper function. Based on the results of several new autoconf tests it uses the correct mask to check for a synchronous bio. Preferred interface for flagging a synchronous bio: 2.6.12-2.6.29: BIO_RW_SYNC 2.6.30-2.6.35: BIO_RW_SYNCIO 2.6.36-2.6.xx: REQ_SYNC	2010-11-10 17:00:33 -08:00
Brian Behlendorf	f4af6bb783	Linux 2.6.36 compat, use REQ_FAILFAST_MASK As of linux-2.6.36 the BIO_RW_FAILFAST and REQ_FAILFAST flags have been unified under the REQ_* names. These flags always had to be kept in-sync so this is a nice step forward, unfortunately it means we need to be careful to only use the new unified flags when the BIO_RW_* flags are not defined. Additional autoconf checks were added for this and if it is ever unclear which method to use no flags are set. This is safe but may result in longer delays before a disk is failed. Perferred interface for setting FAILFAST on a bio: 2.6.12-2.6.27: BIO_RW_FAILFAST 2.6.28-2.6.35: BIO_RW_FAILFAST_{DEV\|TRANSPORT\|DRIVER} 2.6.36-2.6.xx: REQ_FAILFAST_{DEV\|TRANSPORT\|DRIVER}	2010-11-10 16:59:49 -08:00
Ned Bass	b04cffc9b0	Remove inconsistent use of EOPNOTSUPP Commit `3ee56c292b` changed an ENOTSUP return value in one location to ENOTSUPP to fix user programs seeing an invalid ioctl() error code. However, use of ENOTSUP is widespread in the zfs module. Instead of changing all of those uses, we fixed the ENOTSUP definition in the SPL to be consistent with user space. The changed return value in the above commit is therefore no longer needed, so this commit reverses it to maintain consistency. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-11-10 13:26:56 -08:00
Ned Bass	00ba7ef900	Give ENOTSUP a valid user space error value The ZFS module returns ENOTSUP for several error conditions where an operation is not (yet) supported. The SPL defined ENOTSUP in terms of ENOTSUPP, but that is an internal Linux kernel error code that should not be seen by user programs. As a result the zfs utilities print a confusing error message if an unsupported operation is attempted: internal error: Unknown error 524 Aborted This change defines ENOTSUP in terms of EOPNOTSUPP which is consistent with user space. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-11-10 13:25:49 -08:00
Brian Behlendorf	8655ce492f	Linux 2.6.36 compat, use fops->unlocked_ioctl() As of linux-2.6.36 the last in-tree consumer of fops->ioctl() has been removed and thus fops()->ioctl() has also been removed. The replacement hook is fops->unlocked_ioctl() which has existed in kernel since 2.6.12. Since the SPL only contains support back to 2.6.18 vintage kernels, I'm not adding an autoconf check for this and simply moving everything to use fops->unlocked_ioctl().	2010-11-10 13:16:12 -08:00
Brian Behlendorf	9b2048c26b	Linux 2.6.36 compat, fs_struct->lock type change In the linux-2.6.36 kernel the fs_struct lock was changed from a rwlock_t to a spinlock_t. If the kernel would export the set_fs_pwd() symbol by default this would not have caused us any issues, but they don't. So we're forced to add a new autoconf check which sets the HAVE_FS_STRUCT_SPINLOCK define when a spinlock_t is used. We can then correctly use either spin_lock or write_lock in our custom set_fs_pwd() implementation.	2010-11-09 13:29:47 -08:00
Brian Behlendorf	a50cede388	Linux 2.6.36 compat, wrap RLIM64_INFINITY As of linux-2.6.36 RLIM64_INFINITY is defined in linux/resource.h. This is handled by conditionally defining RLIM64_INFINITY in the SPL only when the kernel does not provide it.	2010-11-09 13:28:55 -08:00
Brian Behlendorf	1e18307b61	Fix incorrect krw_type_t type Flagged by the default compile options on archlinux 2010.05, we should be using the krw_t type not the krw_type_t type in the private data. module/splat/splat-rwlock.c: In function ‘splat_rwlock_test4_func’: module/splat/splat-rwlock.c:432:6: warning: case value ‘1’ not in enumerated type ‘krw_type_t’	2010-11-09 10:18:01 -08:00
Brian Behlendorf	8c3ab23f4b	Add lustre zpios-test workload The lustre zpios-test simulates a reasonable lustre workload. It will create 128 threads, the same as a Lustre OSS, and then 4096 individual objects. Each objects is 16MiB in size and will be written/read in 1MiB from a random thread. This is fundamentally how we expect Lustre to behave for large IO intensive workloads.	2010-11-08 14:03:36 -08:00
Brian Behlendorf	a8179b5139	Prep for 0.5.2 tag Update META file to prep for 0.5.2 tag.	2010-11-08 14:03:36 -08:00
Brian Behlendorf	cb39a6c6aa	Replace custom zpool configs with generic configs To streamline testing I have in the past added several custom configs to the zpool-config directory. This change reverts those custom configs and replaces them with three generic config which can do the same thing. The generic config behavior can be set by setting various environment variables when calling either the zpool-create.sh or zpios.sh scripts. For example if you wanted to create and test a single 4-disk Raid-Z2 configuration using disks [A-D]1 with dedicated ZIL and L2ARC devices you could run the following. $ ZIL="log A2" L2ARC="cache B2" RANKS=1 CHANNELS=4 LEVEL=2 \ zpool-create.sh -c zpool-raidz $ zpool status tank pool: tank state: ONLINE scan: none requested config: NAME STATE READ WRITE CKSUM tank ONLINE 0 0 0 raidz2-0 ONLINE 0 0 0 A1 ONLINE 0 0 0 B1 ONLINE 0 0 0 C1 ONLINE 0 0 0 D1 ONLINE 0 0 0 logs A2 ONLINE 0 0 0 cache B2 ONLINE 0 0 0 errors: No known data errors	2010-11-08 14:03:36 -08:00
Ned Bass	3ee56c292b	Make rollbacks fail gracefully Support for rolling back datasets require a functional ZPL, which we currently do not have. The zfs command does not check for ZPL support before attempting a rollback, and in preparation for rolling back a zvol it removes the minor node of the device. To prevent the zvol device node from disappearing after a failed rollback operation, this change wraps the zfs_do_rollback() function in an #ifdef HAVE_ZPL and returns ENOSYS in the absence of a ZPL. This is consistent with the behavior of other ZPL dependent commands such as mount. The orginal error message observed with this bug was rather confusing: internal error: Unknown error 524 Aborted This was because zfs_ioc_rollback() returns ENOTSUP if we don't HAVE_ZPL, but Linux actually has no such error code. It should instead return EOPNOTSUPP, as that is how ENOTSUP is defined in user space. With that we would have gotten the somewhat more helpful message cannot rollback 'tank/fish': unsupported version This is rather a moot point with the above changes since we will no longer make that ioctl call without a ZPL. But, this change updates the error code just in case. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-11-08 14:03:36 -08:00
Brian Behlendorf	7e55f4e00c	Increate zio write interrupt thread count. Increasing the default zio_wr_int thread count from 8 to 16 improves write performence by 13% on large systems. More testing need to be done but I suspect the ideal tuning here is ZTI_BATCH() with a minimum of 8 threads.	2010-11-08 14:03:35 -08:00
Brian Behlendorf	451041db53	Shorten zio_* thread names Linux kernel thread names are expected to be short. This change shortens the zio thread names to 10 characters leaving a few chracters to append the /<cpuid> to which the thread is bound. For example: z_wr_iss/0.	2010-11-08 14:03:35 -08:00
Brian Behlendorf	c11908c75d	Prep for 0.5.2 tag Update META file to prep for 0.5.2 tag.	2010-11-05 11:52:46 -07:00
Brian Behlendorf	8294c69bb7	Clear owner after dropping mutex It's important to clear mp->owner after calling mutex_unlock() because when CONFIG_DEBUG_MUTEXES is defined the mutex owner is verified in mutex_unlock(). If we set it to NULL this check fails and the lockdep support is immediately disabled.	2010-11-05 11:52:30 -07:00
Ned Bass	b1c5821375	Fix panic mounting unformatted zvol On some older kernels, i.e. 2.6.18, zvol_ioctl_by_inode() may get passed a NULL file pointer if the user tries to mount a zvol without a filesystem on it. This change adds checks to prevent a null pointer dereference. Closes #73. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-10-29 14:46:33 -07:00
Brian Behlendorf	23aa63cbf5	Fix 2.6.35 shrinker callback API change As of linux-2.6.35 the shrinker callback API now takes an additional argument. The shrinker struct is passed to the callback so that users can embed the shrinker structure in private data and use container_of() to access it. This removes the need to always use global state for the shrinker. To handle this we add the SPL_AC_3ARGS_SHRINKER_CALLBACK autoconf check to properly detect the API. Then we simply setup a callback function with the correct number of arguments. For now we do not make use of the new 3rd argument.	2010-10-22 14:51:26 -07:00
Ned Bass	6ee71f5ce3	Call modprobe with absolute path Some sudo configurations may not include /sbin in the PATH. libzfs_load_module() currently does not call modprobe with an absolute path, so it may fail under such configurations if called under sudo. This change adds the absolute path to modprobe so we no longer rely on how PATH is set. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-10-22 12:39:57 -07:00
Ned Bass	d877ac6bfe	Fix intermittent 'zpool add' failures Creating whole-disk vdevs can intermittently fail if a udev-managed symlink to the disk partition is already in place. To avoid this, we now remove any such symlink before partitioning the disk. This makes zpool_label_disk_wait() truly wait for the new link to show up instead of returning if it finds an old link still in place. Otherwise there is a window between when udev deletes and recreates the link during which access attempts will fail with ENOENT. Also, clean up a comment about waiting for udev to create symlinks. It no longer needs to describe the special cases for the link names, since that is now handled in a separate helper function. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-10-22 12:38:58 -07:00
Ned Bass	d4055aac3c	Add zconfig test for adding and removing vdevs This test performs a sanity check of the zpool add and remove commands. It tests adding and removing both a cache disk and a log disk to and from a zpool. Usage of both a shorthand device path and a full path is covered. The test uses a scsi_debug device as the disk to be added and removed. This is done so that zpool will see it as a whole disk and partition it, which it does not currently done for loopback devices. We want to verify that the manipulation done to whole disks paths to hide the parition information does not break the add/remove interface. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-10-22 12:41:57 -07:00
Ned Bass	4682b8c14e	Remove solaris-specific code from make_leaf_vdev() Portability between Solaris and Linux isn't really an issue for us anymore, and removing sections like this one helps simplify the code. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-10-22 12:25:58 -07:00
Ned Bass	a2c6816c34	Support shorthand names with zpool remove zpool status displays abbreviated vdev names without leading path components and, in the case of whole disks, without partition information. Also, the zpool subcommands 'create' and 'add' support using shorthand devices names without qualified paths. Prior to this change, however, removing a device generally required specifying its name as it is stored in the vdev label. So while zpool status might list a cache disk with a name like A16, removing it would require a full path such as /dev/disk/zpool/A16-part1, which is non-intuitive. This change adds support for shorthand device names with the remove subcommand so one can simply type, for example, zpool remove tank A16 A consequence of this change is that including the partition information when removing a whole-disk vdev now results in an error. While this is arguably the correct behavior, it is a departure from how zpool previously worked in this project. This change removes the only reference to ctd_check_path(), so that function is also removed to avoid compiler warnings. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-10-22 12:25:46 -07:00
Ned Bass	79e7242a91	Add helper functions for manipulating device names This change adds two helper functions for working with vdev names and paths. zfs_resolve_shortname() resolves a shorthand vdev name to an absolute path of a file in /dev, /dev/disk/by-id, /dev/disk/by-label, /dev/disk/by-path, /dev/disk/by-uuid, /dev/disk/zpool. This was previously done only in the function is_shorthand_path(), but we need a general helper function to implement shorthand names for additional zpool subcommands like remove. is_shorthand_path() is accordingly updated to call the helper function. There is a minor change in the way zfs_resolve_shortname() tests if a file exists. is_shorthand_path() effectively used open() and stat64() to test for file existence, since its scope includes testing if a device is a whole disk and collecting file status information. zfs_resolve_shortname(), on the other hand, only uses access() to test for existence and leaves it to the caller to perform any additional file operations. This seemed like the most general and lightweight approach, and still preserves the semantics of is_shorthand_path(). zfs_append_partition() appends a partition suffix to a device path. This should be used to generate the name of a whole disk as it is stored in the vdev label. The user-visible names of whole disks do not contain the partition information, while the name in the vdev label does. The code was lifted from the function make_disks(), which now just calls the helper function. Again, having a helper function to do this supports general handling of shorthand names in the user interface. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-10-22 12:25:30 -07:00
Brian Behlendorf	0ee8118bd3	Add zfault zpool configurations and tests Eleven new zpool configurations were added to allow testing of various failure cases. The first 5 zpool configurations leverage the 'faulty' md device type which allow us to simuluate IO errors at the block layer. The last 6 zpool configurations leverage the scsi_debug module provided by modern kernels. This device allows you to create virtual scsi devices which are backed by a ram disk. With this setup we can verify the full IO stack by injecting faults at the lowest layer. Both methods of fault injection are important to verifying the IO stack. The zfs code itself also provides a mechanism for error injection via the zinject command line tool. While we should also take advantage of this appraoch to validate the code it does not address any of the Linux integration issues which are the most concerning. For the moment we're trusting that the upstream Solaris guys are running zinject and would have caught internal zfs logic errors. Currently, there are 6 r/w test cases layered on top of the 'faulty' md devices. They include 3 writes tests for soft/transient errors, hard/permenant errors, and all writes error to the device. There are 3 matching read tests for soft/transient errors, hard/permenant errors, and fixable read error with a write. Although for this last case zfs doesn't do anything special. The seventh test case verifies zfs detects and corrects checksum errors. In this case one of the drives is extensively damaged and by dd'ing over large sections of it. We then ensure zfs logs the issue and correctly rebuilds the damage. The next test cases use the scsi_debug configuration to injects error at the bottom of the scsi stack. This ensures we find any flaws in the scsi midlayer or our usage of it. Plus it stresses the device specific retry, timeout, and error handling outside of zfs's control. The eighth test case is to verify that the system correctly handles an intermittent device timeout. Here the scsi_debug device drops 1 in N requests resulting in a retry either at the block level. The ZFS code does specify the FAILFAST option but it turns out that for this case the Linux IO stack with still retry the command. The FAILFAST logic located in scsi_noretry_cmd() does no seem to apply to the simply timeout case. It appears to be more targeted to specific device or transport errors from the lower layers. The ninth test case handles a persistent failure in which the device is removed from the system by Linux. The test verifies that the failure is detected, the device is made unavailable, and then can be successfully re-add when brought back online. Additionally, it ensures that errors and events are logged to the correct places and the no data corruption has occured due to the failure.	2010-10-12 15:20:03 -07:00
Brian Behlendorf	baa40d45cb	Fix missing 'zpool events' It turns out that 'zpool events' over 1024 bytes in size where being silently dropped. This was discovered while writing the zfault.sh tests to validate common failure modes. This could occur because the zfs interface for passing an arbitrary size nvlist_t over an ioctl() is to provide a buffer for the packed nvlist which is usually big enough. In this case 1024 byte is the default. If the kernel determines the buffer is to small it returns ENOMEM and the minimum required size of the nvlist_t. This was working properly but in the case of 'zpool events' the event stream was advanced dispite the error. Thus the retry with the bigger buffer would succeed but it would skip over the previous event. The fix is to pass this size to zfs_zevent_next() and determine before removing the event from the list if it will fit. This was preferable to checking after the event was returned because this avoids the need to rewind the stream.	2010-10-12 14:55:03 -07:00
Brian Behlendorf	a69052be7f	Initial zio delay timing While there is no right maximum timeout for a disk IO we can start laying the ground work to measure how long they do take in practice. This change simply measures the IO time and if it exceeds 30s an event is posted for 'zpool events'. This value was carefully selected because for sd devices it implies that at least one timeout (SD_TIMEOUT) has occured. Unfortunately, even with FAILFAST set we may retry and request and not get an error. This behavior is strongly dependant on the device driver and how it is hooked in to the scsi error handling stack. However by setting the limit at 30s we can log the event even if no error was returned. Slightly longer term we can start recording these delays perhaps as a simple power-of-two histrogram. This histogram can then be reported as part of the 'zpool status' command when given an command line option. None of this code changes the internal behavior of ZFS. Currently it is simply for reporting excessively long delays.	2010-10-12 14:55:02 -07:00
Brian Behlendorf	2959d94a0a	Add FAILFAST support ZFS works best when it is notified as soon as possible when a device failure occurs. This allows it to immediately start any recovery actions which may be needed. In theory Linux supports a flag which can be set on bio's called FAILFAST which provides this quick notification by disabling the retry logic in the lower scsi layers. That's the theory at least. In practice is turns out that while the flag exists you oddly have to set it with the BIO_RW_AHEAD flag. And even when it's set it you may get retries in the low level drivers decides that's the right behavior, or if you don't get the right error codes reported to the scsi midlayer. Unfortunately, without additional kernels patchs there's not much which can be done to improve this. Basically, this just means that it may take 2-3 minutes before a ZFS is notified properly that a device has failed. This can be improved and I suspect I'll be submitting patches upstream to handle this.	2010-10-12 14:55:02 -07:00
Brian Behlendorf	c5343ba71b	Fix 'zpool events' formatting for awk To make the 'zpool events' output simple to parse with awk the extra newline after embedded nvlists has been dropped. This allows the entire event to be parsed as a single whitespace seperated record. The -H option has been added to operate in scripted mode. For the 'zpool events' command this means don't print the header. The usage of -H is consistent with scripted mode for other zpool commands.	2010-10-12 14:55:01 -07:00

... 93 94 95 96 97 ...

5357 Commits