mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-04-17 08:54:52 +03:00

Author	SHA1	Message	Date
Brian Behlendorf	72d5e2da3e	Add HAVE_SCANSTAMP This functionality is not supported under Linux, perhaps it will be some day if it's decided it's useful.	2011-02-10 09:20:33 -08:00
Brian Behlendorf	872e8d2697	Add initial rw_uio functions to the dmu These functions were dropped originally because I felt they would need to be rewritten anyway to avoid using uios. However, this patch re-adds them with the idea they can just be reworked and the uio bits dropped.	2011-02-04 16:14:34 -08:00
Brian Behlendorf	b4ead57cfb	Remove HAVE_ZPL from commands and libraries Thanks to the previous few commits we can now build all of the user space commands and libraries with support for the zpl.	2011-02-04 16:14:34 -08:00
Brian Behlendorf	9a616b5d17	Documentation updates Minor Linux specific documentation updates to the comments and man pages.	2011-02-04 16:14:34 -08:00
Brian Behlendorf	c5d915f423	Minimal libshare infrastructure ZFS even under Solaris does not strictly require libshare to be available. The current implementation attempts to dlopen() the library to access the needed symbols. If this fails libshare support is simply disabled. This means that on Linux we only need the most minimal libshare implementation. In fact just enough to prevent the build from failing. Longer term we can decide if we want to implement a libshare library like Solaris. At best this would be an abstraction layer between ZFS and NFS/SMB. Alternately, we can drop libshare entirely and directly integrate ZFS with Linux's NFS/SMB. Finally the bare bones user-libshare.m4 test was dropped. If we do decide to implement libshare at some point it will surely be as part of this package so the check is not needed.	2011-02-04 16:14:29 -08:00
Brian Behlendorf	3fb1fcdea1	Add 'zfs mount' support By design the zfs utility is supposed to handle mounting and unmounting a zfs filesystem. We could allow zfs to do this directly. There are system calls available to mount/umount a filesystem. And there are library calls available to manipulate /etc/mtab. But there are a couple very good reasons not to take this approach... for now. Instead of directly calling the system and library calls to (u)mount the filesystem we fork and exec a (u)mount process. The principal reason for this is to delegate the responsibility for locking and updating /etc/mtab to (u)mount(8). This ensures maximum portability and ensures the right locking scheme for your version of (u)mount will be used. If we didn't do this we would have to resort to an autoconf test to determine what locking mechanism is used. The downside to using mount(8) instead of mount(2) is that we lose the exact errno which was returned by the kernel. The return code from mount(8) provides some insight into what went wrong but it is not quite as good. For the moment this is translated as a best guess into an errno for the higher layers of zfs. In the long term a shared library called libmount is under development which provides a common API to address the locking and errno issues. Once the standard mount utility has been updated to use this library we can then leverage it. Until then this is the only safe solution. http://www.kernel.org/pub/linux/utils/util-linux/libmount-docs/index.html	2011-02-04 16:11:58 -08:00
Brian Behlendorf	feb46b92a7	Open up libzfs_run_process/libzfs_load_module Recently helper functions were added to libzfs_util to load a kernel module or execute a process. Initially this functionality was limited to libzfs but it has become clear there will be other consumers. This change opens up the interface so it may be used where appropriate.	2011-01-28 12:47:57 -08:00
Brian Behlendorf	95c4cae39f	Disable umount.zfs helper For the moment, the only advantage in registering a umount helper would be to automatically unshare a zfs filesystem. Since under Linux this would be unexpected (but nice) behavior there is no harm in disabling it. This is desirable because the 'zfs unmount' path invokes the system umount. This is done to ensure correct mtab locking but has the side effect that the umount.zfs helper would be called if it exists. By default this helper calls back in to zfs to do the unmount on Solaris which we don't want under Linux. Once libmount is available and we have a safe way to correctly lock and update the /etc/mtab file we can reconsider the need for a umount helper. Using libmount is the preferred solution.	2011-01-28 12:47:57 -08:00
Brian Behlendorf	3b8cfee8af	Enable mount.zfs helper While not strictly required to mount a zfs filesystem, using a mount helper has certain advantages. First, we need it if we want to honor the mount behavior as found on Solaris. As part of the mount we need to validate that the dataset has the legacy mount property set if we are using 'mount' instead of 'zfs mount'. Secondly, by using a mount helper we can automatically load the zpl kernel module. This way you can just issue a 'mount' or 'zfs mount' and it will just work. Finally, it gives us a common hook in user space to add any zfs specific mount options we might want. At the moment we don't have any but now the infrastructure is at least in place.	2011-01-28 12:47:57 -08:00
Brian Behlendorf	b3259b6a2b	Autoconf selinux support If libselinux is detected on your system at configure time link against it. This allows us to use a library call to detect if selinux is enabled and if it is to pass the mount option: "context=\"system_u:object_r:file_t:s0" For now this is required because none of the existing selinux policies are aware of the zfs filesystem type. Because of this they do not properly enable xattr based labeling even though zfs supports all of the required hooks. Until distros add zfs as a known xattr friendly fs type we must use mntpoint labeling. Alternately, end users could modify their existing selinux policy with a little guidance.	2011-01-28 12:45:19 -08:00
Brian Behlendorf	95c73795b0	Fix ZVOL rename minor devices During a rename we need to be careful to destroy and create a new minor for the ZVOL _only_ if the rename succeeded. The previous code would both destroy your minor device unconditionally and fail to create the new minor device on success.	2011-01-07 12:26:02 -08:00
Brian Behlendorf	149e873ab1	Fix minor compiler warnings These compiler warnings were introduced when code which was previously #ifdef'ed out by HAVE_ZPL was re-added for use by the posix layer. All of the following changes should be obviously correct and will cause no semantic changes.	2011-01-06 15:04:28 -08:00
Brian Behlendorf	683fe41fc7	Add missing mkdirp prototype For a while now mkdirp has been built as part of libspl however the prototype was never added to libgen.h. This went unnoticed until enabling the mount support which uses mkdirp().	2010-12-14 10:06:44 -08:00
Brian Behlendorf	5b63b3eb6f	Use cv_timedwait_interruptible in arc The issue is that cv_timedwait() sleeps uninterruptibly to block signals and avoid waking up early. Under Linux this counts against the load average keeping it artificially high. This change allows the arc to sleep interruptibly which means it may be woken up early due to a signal. Normally this means some extra care must be taken to handle a potential signal. But for the arc's usage of cv_timedwait() there is no harm in waking up before the timeout expires so no extra handling is required.	2010-12-14 10:06:44 -08:00
Ricardo M. Correia	8d4e8140ef	Fix block device-related issues in zdb. Specifically, this fixes the two following errors in zdb when a pool is composed of block devices: 1) 'Value too large for defined data type' when running 'zdb <dataset>'. 2) 'character device required' when running 'zdb -l <block-device>'. Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-12-14 09:52:46 -08:00
Brian Behlendorf	a7dc7e5d5a	Enable rrwlock.c compilation With the addition of the thread specific data interfaces to the SPL it is safe to enable compilation of the re-entrant reader/writer locks.	2010-12-07 16:05:25 -08:00
Brian Behlendorf	135cf6a8ae	Refresh autogen.sh products Refresh the autogen.sh products based on the versions which are installed by default in the GA RHEL6.0 release. autoconf (GNU Autoconf) 2.63 automake (GNU automake) 1.11.1 ltmain.sh (GNU libtool) 2.2.6b	2010-12-07 15:33:12 -08:00
Ned Bass	31165fd9aa	Remove partition from vdev name in zfault.sh As of the 0.5.2 tag, names of whole-disk vdevs must be specified to the command line tools without partition identifiers. This commit fixes a 'zpool online' command in zfault.sh that incorrectly includes the partition in the vdev name, causing test 9 to fail. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-11-29 10:53:53 -08:00
Brian Behlendorf	5e7affae52	Skip /dev/hpet during 'zpool import' If libblkid does not contain ZFS support, then 'zpool import' will scan all block devices in /dev/ to determine which ones are components of a ZFS filesystem. It does this by opening all the devices and stat'ing them to determine which ones are block devices. If the device turns out not to be a block device it is skipped. Usually, this whole process is pretty harmless (although slow). But there are certain devices in /dev/ which must be handled in a very specific way or your system may crash. For example, if /dev/watchdog is simply opened the watchdog timer will be started and your system will panic when the timer expires. It turns out that /dev/hpet causes similar problems although only when accessed under a virtual machine. For some reason accessing /dev/hpet causes qemu to crash. To address this issue this commit adds /dev/hpet to the device blacklist; it will be skipped solely based on its name.	2010-11-12 09:33:17 -08:00
Brian Behlendorf	e0f3df67e5	Add '-ts' options to zconfig.sh/zfault.sh usage When adding this functionality originally the options to only run specific tests (-t), or conversely skip specific tests (-s) were omitted from the usage page. This commit adds the missing documentation.	2010-11-11 11:40:06 -08:00
Brian Behlendorf	7dc3830c0f	Remove spl/zfs modules as part of cleanup The idea behind the '-c' flag is to clean up everything from a previous test run which might cause the test script to fail. This should also include removing the previously loaded module. This makes it a little easier to run 'zconfig.sh -c', however remember this is a test script and it will take all of your other zpools offline for the purposes of the test. This notion has also been extended to the default 'make check' behavior.	2010-11-11 11:40:06 -08:00
Brian Behlendorf	cf47fad67d	Unconditionally load core kernel modules Loading and unloading the zlib modules as part of the zfs.sh script has proven a little problematic for a few reasons. * First, your kernel may not need to load either zlib_inflate or zlib_deflate. This functionality may be built directly in to your kernel. It depends entirely on what your distribution decided was the right thing to do. * Second, even if you do manage to load the correct modules you may not be able to unload them. There may be other consumers of the modules with a reference preventing the unload. To avoid both of these issues the test scripts have been updated to attempt to unconditionally load all modules listed in KERNEL_MODULES. If the module is successfully loaded you must have needed it. If the module can't be loaded that almost certainly means either it is built in to your kernel or is already being used by another consumer. In both cases this is not an issue and we can move on to the spl/zfs modules. Finally, by removing these kernel modules from the MODULES list we ensure they are never unloaded during 'zfs.sh -u'. This avoids the issue of the script failing because there is another consumer using the module we were not aware of. In other words the script restricts unloading modules to only the spl/zfs modules. Closes #78	2010-11-11 11:38:25 -08:00
Ned Bass	e06be58641	Fix for access beyond end of device error This commit fixes a sign extension bug affecting l2arc devices. Extremely large offsets may be passed down to the low level block device driver on reads, generating errors similar to attempt to access beyond end of device sdbi1: rw=14, want=36028797014862705, limit=125026959 The unwanted sign extension occurs because the function arc_read_nolock() stores the offset as a daddr_t, a 32-bit signed int type in the Linux kernel. This offset is then passed to zio_read_phys() as a uint64_t argument, causing sign extension for values of 0x80000000 or greater. To avoid this, we store the offset in a uint64_t. This change also changes a few daddr_t struct members to uint64_t in the libspl headers to avoid similar bugs cropping up in the future. We also add an ASSERT to __vdev_disk_physio() to check for invalid offsets. Closes #66 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-11-10 21:29:07 -08:00
Brian Behlendorf	1f30b9d432	Linux 2.6.36 compat, use fops->unlocked_ioctl() As of linux-2.6.36 the last in-tree consumer of fops->ioctl() has been removed and thus fops->ioctl() has also been removed. The replacement hook is fops->unlocked_ioctl() which has existed in the kernel since 2.6.12. Since the ZFS code only contains support back to 2.6.18 vintage kernels, I'm not adding an autoconf check for this and simply moving everything to use fops->unlocked_ioctl().	2010-11-10 17:01:08 -08:00
Brian Behlendorf	8326eb4605	Linux 2.6.36 compat, blk_* macros removed Most of the blk_* macros were removed in 2.6.36. Ostensibly this was done to improve readability and allow easier grepping. However, from a portability stand point the macros are helpful. Therefore the needed macros are redefined here if they are missing from the kernel.	2010-11-10 17:00:40 -08:00
Brian Behlendorf	675de5aa37	Linux 2.6.36 compat, synchronous bio flag The name of the flag used to mark a bio as synchronous has changed again in the 2.6.36 kernel due to the unification of the BIO_RW_* and REQ_* flags. The new flag is called REQ_SYNC. To simplify checking this flag I have introduced the vdev_disk_dio_is_sync() helper function. Based on the results of several new autoconf tests it uses the correct mask to check for a synchronous bio. Preferred interface for flagging a synchronous bio: 2.6.12-2.6.29: BIO_RW_SYNC 2.6.30-2.6.35: BIO_RW_SYNCIO 2.6.36-2.6.xx: REQ_SYNC	2010-11-10 17:00:33 -08:00
Brian Behlendorf	f4af6bb783	Linux 2.6.36 compat, use REQ_FAILFAST_MASK As of linux-2.6.36 the BIO_RW_FAILFAST and REQ_FAILFAST flags have been unified under the REQ_* names. These flags always had to be kept in-sync so this is a nice step forward, unfortunately it means we need to be careful to only use the new unified flags when the BIO_RW_* flags are not defined. Additional autoconf checks were added for this and if it is ever unclear which method to use no flags are set. This is safe but may result in longer delays before a disk is failed. Preferred interface for setting FAILFAST on a bio: 2.6.12-2.6.27: BIO_RW_FAILFAST 2.6.28-2.6.35: BIO_RW_FAILFAST_{DEV|TRANSPORT|DRIVER} 2.6.36-2.6.xx: REQ_FAILFAST_{DEV|TRANSPORT|DRIVER}	2010-11-10 16:59:49 -08:00
Ned Bass	b04cffc9b0	Remove inconsistent use of EOPNOTSUPP Commit `3ee56c292b` changed an ENOTSUP return value in one location to EOPNOTSUPP to fix user programs seeing an invalid ioctl() error code. However, use of ENOTSUP is widespread in the zfs module. Instead of changing all of those uses, we fixed the ENOTSUP definition in the SPL to be consistent with user space. The changed return value in the above commit is therefore no longer needed, so this commit reverses it to maintain consistency. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-11-10 13:26:56 -08:00
Brian Behlendorf	8c3ab23f4b	Add lustre zpios-test workload The lustre zpios-test simulates a reasonable lustre workload. It will create 128 threads, the same as a Lustre OSS, and then 4096 individual objects. Each object is 16MiB in size and will be written/read in 1MiB chunks from a random thread. This is fundamentally how we expect Lustre to behave for large IO intensive workloads.	2010-11-08 14:03:36 -08:00
Brian Behlendorf	a8179b5139	Prep for 0.5.2 tag Update META file to prep for 0.5.2 tag.	2010-11-08 14:03:36 -08:00
Brian Behlendorf	cb39a6c6aa	Replace custom zpool configs with generic configs To streamline testing I have in the past added several custom configs to the zpool-config directory. This change reverts those custom configs and replaces them with three generic configs which can do the same thing. The generic config behavior can be set by setting various environment variables when calling either the zpool-create.sh or zpios.sh scripts. For example if you wanted to create and test a single 4-disk Raid-Z2 configuration using disks [A-D]1 with dedicated ZIL and L2ARC devices you could run the following. $ ZIL="log A2" L2ARC="cache B2" RANKS=1 CHANNELS=4 LEVEL=2 \ zpool-create.sh -c zpool-raidz $ zpool status tank pool: tank state: ONLINE scan: none requested config: NAME STATE READ WRITE CKSUM tank ONLINE 0 0 0 raidz2-0 ONLINE 0 0 0 A1 ONLINE 0 0 0 B1 ONLINE 0 0 0 C1 ONLINE 0 0 0 D1 ONLINE 0 0 0 logs A2 ONLINE 0 0 0 cache B2 ONLINE 0 0 0 errors: No known data errors	2010-11-08 14:03:36 -08:00
Ned Bass	3ee56c292b	Make rollbacks fail gracefully Support for rolling back datasets requires a functional ZPL, which we currently do not have. The zfs command does not check for ZPL support before attempting a rollback, and in preparation for rolling back a zvol it removes the minor node of the device. To prevent the zvol device node from disappearing after a failed rollback operation, this change wraps the zfs_do_rollback() function in an #ifdef HAVE_ZPL and returns ENOSYS in the absence of a ZPL. This is consistent with the behavior of other ZPL dependent commands such as mount. The original error message observed with this bug was rather confusing: internal error: Unknown error 524 Aborted This was because zfs_ioc_rollback() returns ENOTSUP if we don't HAVE_ZPL, but Linux actually has no such error code. It should instead return EOPNOTSUPP, as that is how ENOTSUP is defined in user space. With that we would have gotten the somewhat more helpful message cannot rollback 'tank/fish': unsupported version This is rather a moot point with the above changes since we will no longer make that ioctl call without a ZPL. But, this change updates the error code just in case. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-11-08 14:03:36 -08:00
Brian Behlendorf	7e55f4e00c	Increase zio write interrupt thread count. Increasing the default zio_wr_int thread count from 8 to 16 improves write performance by 13% on large systems. More testing needs to be done but I suspect the ideal tuning here is ZTI_BATCH() with a minimum of 8 threads.	2010-11-08 14:03:35 -08:00
Brian Behlendorf	451041db53	Shorten zio_* thread names Linux kernel thread names are expected to be short. This change shortens the zio thread names to 10 characters leaving a few characters to append the /<cpuid> to which the thread is bound. For example: z_wr_iss/0.	2010-11-08 14:03:35 -08:00
Ned Bass	b1c5821375	Fix panic mounting unformatted zvol On some older kernels, i.e. 2.6.18, zvol_ioctl_by_inode() may get passed a NULL file pointer if the user tries to mount a zvol without a filesystem on it. This change adds checks to prevent a null pointer dereference. Closes #73. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-10-29 14:46:33 -07:00
Ned Bass	6ee71f5ce3	Call modprobe with absolute path Some sudo configurations may not include /sbin in the PATH. libzfs_load_module() currently does not call modprobe with an absolute path, so it may fail under such configurations if called under sudo. This change adds the absolute path to modprobe so we no longer rely on how PATH is set. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-10-22 12:39:57 -07:00
Ned Bass	d877ac6bfe	Fix intermittent 'zpool add' failures Creating whole-disk vdevs can intermittently fail if a udev-managed symlink to the disk partition is already in place. To avoid this, we now remove any such symlink before partitioning the disk. This makes zpool_label_disk_wait() truly wait for the new link to show up instead of returning if it finds an old link still in place. Otherwise there is a window between when udev deletes and recreates the link during which access attempts will fail with ENOENT. Also, clean up a comment about waiting for udev to create symlinks. It no longer needs to describe the special cases for the link names, since that is now handled in a separate helper function. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-10-22 12:38:58 -07:00
Ned Bass	d4055aac3c	Add zconfig test for adding and removing vdevs This test performs a sanity check of the zpool add and remove commands. It tests adding and removing both a cache disk and a log disk to and from a zpool. Usage of both a shorthand device path and a full path is covered. The test uses a scsi_debug device as the disk to be added and removed. This is done so that zpool will see it as a whole disk and partition it, which it does not currently do for loopback devices. We want to verify that the manipulation done to whole-disk paths to hide the partition information does not break the add/remove interface. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-10-22 12:41:57 -07:00
Ned Bass	4682b8c14e	Remove solaris-specific code from make_leaf_vdev() Portability between Solaris and Linux isn't really an issue for us anymore, and removing sections like this one helps simplify the code. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-10-22 12:25:58 -07:00
Ned Bass	a2c6816c34	Support shorthand names with zpool remove zpool status displays abbreviated vdev names without leading path components and, in the case of whole disks, without partition information. Also, the zpool subcommands 'create' and 'add' support using shorthand device names without qualified paths. Prior to this change, however, removing a device generally required specifying its name as it is stored in the vdev label. So while zpool status might list a cache disk with a name like A16, removing it would require a full path such as /dev/disk/zpool/A16-part1, which is non-intuitive. This change adds support for shorthand device names with the remove subcommand so one can simply type, for example, zpool remove tank A16 A consequence of this change is that including the partition information when removing a whole-disk vdev now results in an error. While this is arguably the correct behavior, it is a departure from how zpool previously worked in this project. This change removes the only reference to ctd_check_path(), so that function is also removed to avoid compiler warnings. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-10-22 12:25:46 -07:00
Ned Bass	79e7242a91	Add helper functions for manipulating device names This change adds two helper functions for working with vdev names and paths. zfs_resolve_shortname() resolves a shorthand vdev name to an absolute path of a file in /dev, /dev/disk/by-id, /dev/disk/by-label, /dev/disk/by-path, /dev/disk/by-uuid, /dev/disk/zpool. This was previously done only in the function is_shorthand_path(), but we need a general helper function to implement shorthand names for additional zpool subcommands like remove. is_shorthand_path() is accordingly updated to call the helper function. There is a minor change in the way zfs_resolve_shortname() tests if a file exists. is_shorthand_path() effectively used open() and stat64() to test for file existence, since its scope includes testing if a device is a whole disk and collecting file status information. zfs_resolve_shortname(), on the other hand, only uses access() to test for existence and leaves it to the caller to perform any additional file operations. This seemed like the most general and lightweight approach, and still preserves the semantics of is_shorthand_path(). zfs_append_partition() appends a partition suffix to a device path. This should be used to generate the name of a whole disk as it is stored in the vdev label. The user-visible names of whole disks do not contain the partition information, while the name in the vdev label does. The code was lifted from the function make_disks(), which now just calls the helper function. Again, having a helper function to do this supports general handling of shorthand names in the user interface. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-10-22 12:25:30 -07:00
Brian Behlendorf	0ee8118bd3	Add zfault zpool configurations and tests Eleven new zpool configurations were added to allow testing of various failure cases. The first 5 zpool configurations leverage the 'faulty' md device type which allows us to simulate IO errors at the block layer. The last 6 zpool configurations leverage the scsi_debug module provided by modern kernels. This device allows you to create virtual scsi devices which are backed by a ram disk. With this setup we can verify the full IO stack by injecting faults at the lowest layer. Both methods of fault injection are important to verifying the IO stack. The zfs code itself also provides a mechanism for error injection via the zinject command line tool. While we should also take advantage of this approach to validate the code it does not address any of the Linux integration issues which are the most concerning. For the moment we're trusting that the upstream Solaris guys are running zinject and would have caught internal zfs logic errors. Currently, there are 6 r/w test cases layered on top of the 'faulty' md devices. They include 3 write tests for soft/transient errors, hard/permanent errors, and all writes error to the device. There are 3 matching read tests for soft/transient errors, hard/permanent errors, and fixable read error with a write. Although for this last case zfs doesn't do anything special. The seventh test case verifies zfs detects and corrects checksum errors. In this case one of the drives is extensively damaged by dd'ing over large sections of it. We then ensure zfs logs the issue and correctly rebuilds the damage. The next test cases use the scsi_debug configuration to inject errors at the bottom of the scsi stack. This ensures we find any flaws in the scsi midlayer or our usage of it. Plus it stresses the device specific retry, timeout, and error handling outside of zfs's control. The eighth test case is to verify that the system correctly handles an intermittent device timeout. Here the scsi_debug device drops 1 in N requests resulting in a retry at the block level. The ZFS code does specify the FAILFAST option but it turns out that for this case the Linux IO stack will still retry the command. The FAILFAST logic located in scsi_noretry_cmd() does not seem to apply to the simple timeout case. It appears to be more targeted to specific device or transport errors from the lower layers. The ninth test case handles a persistent failure in which the device is removed from the system by Linux. The test verifies that the failure is detected, the device is made unavailable, and then can be successfully re-added when brought back online. Additionally, it ensures that errors and events are logged to the correct places and that no data corruption has occurred due to the failure.	2010-10-12 15:20:03 -07:00
Brian Behlendorf	baa40d45cb	Fix missing 'zpool events' It turns out that 'zpool events' over 1024 bytes in size were being silently dropped. This was discovered while writing the zfault.sh tests to validate common failure modes. This could occur because the zfs interface for passing an arbitrary size nvlist_t over an ioctl() is to provide a buffer for the packed nvlist which is usually big enough. In this case 1024 bytes is the default. If the kernel determines the buffer is too small it returns ENOMEM and the minimum required size of the nvlist_t. This was working properly but in the case of 'zpool events' the event stream was advanced despite the error. Thus the retry with the bigger buffer would succeed but it would skip over the previous event. The fix is to pass this size to zfs_zevent_next() and determine before removing the event from the list if it will fit. This was preferable to checking after the event was returned because this avoids the need to rewind the stream.	2010-10-12 14:55:03 -07:00
Brian Behlendorf	a69052be7f	Initial zio delay timing While there is no right maximum timeout for a disk IO we can start laying the ground work to measure how long they do take in practice. This change simply measures the IO time and if it exceeds 30s an event is posted for 'zpool events'. This value was carefully selected because for sd devices it implies that at least one timeout (SD_TIMEOUT) has occurred. Unfortunately, even with FAILFAST set we may retry a request and not get an error. This behavior is strongly dependent on the device driver and how it is hooked in to the scsi error handling stack. However by setting the limit at 30s we can log the event even if no error was returned. Slightly longer term we can start recording these delays perhaps as a simple power-of-two histogram. This histogram can then be reported as part of the 'zpool status' command when given a command line option. None of this code changes the internal behavior of ZFS. Currently it is simply for reporting excessively long delays.	2010-10-12 14:55:02 -07:00
Brian Behlendorf	2959d94a0a	Add FAILFAST support ZFS works best when it is notified as soon as possible when a device failure occurs. This allows it to immediately start any recovery actions which may be needed. In theory Linux supports a flag which can be set on bio's called FAILFAST which provides this quick notification by disabling the retry logic in the lower scsi layers. That's the theory at least. In practice it turns out that while the flag exists you oddly have to set it with the BIO_RW_AHEAD flag. And even when it's set you may get retries if the low level driver decides that's the right behavior, or if you don't get the right error codes reported to the scsi midlayer. Unfortunately, without additional kernel patches there's not much which can be done to improve this. Basically, this just means that it may take 2-3 minutes before ZFS is notified properly that a device has failed. This can be improved and I suspect I'll be submitting patches upstream to handle this.	2010-10-12 14:55:02 -07:00
Brian Behlendorf	c5343ba71b	Fix 'zpool events' formatting for awk To make the 'zpool events' output simple to parse with awk the extra newline after embedded nvlists has been dropped. This allows the entire event to be parsed as a single whitespace separated record. The -H option has been added to operate in scripted mode. For the 'zpool events' command this means don't print the header. The usage of -H is consistent with scripted mode for other zpool commands.	2010-10-12 14:55:01 -07:00
Brian Behlendorf	312c07edfd	Generate zevents for speculative and soft errors By default the Solaris code does not log speculative or soft io errors in either 'zpool status' or post an event. Under Linux we don't want to change the expected behavior of 'zpool status' so these io errors are still suppressed there. However, since we do need to know about these events for Linux FMA and the 'zpool events' interface is new we do post the events. With the addition of the zio_flags field the posted events now contain enough information that a user space consumer can identify and discard these events if it sees fit.	2010-10-12 14:55:00 -07:00
Brian Behlendorf	d148e95156	Fix negative zio->io_error which must be positive. All the upper layers of zfs expect zio->io_error to be positive. I was careful but I missed one instance in vdev_disk_physio_completion() which could return a negative error. To ensure all cases are always caught I have additionally added an ASSERT() to check this before zio_interpret(). Finally, as a debugging aid when zfs is built with --enable-debug all errors from the backing block devices will be reported to the console with an error message like this: ZFS: zio error=5 type=1 offset=4217856 size=8192 flags=60440	2010-10-12 14:55:00 -07:00
Brian Behlendorf	398f129ca3	Suppress large kmem_alloc() warning. Observed during failure mode testing, dsl_scan_setup_sync() allocates 73920 bytes. This is way over the limit of what is wise to do with a kmem_alloc() and it should probably be moved to a slab. For now I'm just flagging it with KM_NODEBUG to quiet the error until this can be revisited.	2010-10-12 14:54:59 -07:00
Ned Bass	5c1bad0013	Fix undersized buffer in is_shorthand_path() The string array 'char dirs[5][8]' was too small to accommodate the terminating NUL character in "by-label". This change adds the needed additional byte. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-10-12 14:47:39 -07:00
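
The buffer sizing issue described in commit 5c1bad0013 can be illustrated with a small check. This is only a sketch, not the code from is_shorthand_path(): the helper name name_fits_row() and the two size macros are hypothetical, chosen here to show why an 8-byte row cannot hold "by-label" plus its terminating NUL while a 9-byte row can.

```c
#include <stddef.h>
#include <string.h>

/* Row size of the array quoted in the commit message: char dirs[5][8]. */
#define OLD_ROW_SIZE 8
/* Row size after adding the one extra byte the commit describes. */
#define NEW_ROW_SIZE 9

/*
 * Hypothetical helper: return 1 if `name` together with its
 * terminating NUL fits in a row of `row_size` bytes, 0 otherwise.
 */
int
name_fits_row(const char *name, size_t row_size)
{
	/* strlen() excludes the NUL, so one more byte is required. */
	return (strlen(name) + 1 <= row_size);
}
```

With these assumptions, name_fits_row("by-label", OLD_ROW_SIZE) reports that the name does not fit, because "by-label" is 8 characters long and needs 9 bytes in total.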

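The sign extension described in commit e06be58641 can be reproduced in isolation. The sketch below is not the arc_read_nolock() code; narrow_daddr_t and both function names are hypothetical stand-ins, assuming a 32-bit signed daddr_t as the commit message states. Note that converting an out-of-range value to a signed 32-bit type is implementation-defined in C; the sketch assumes the usual two's complement wrap-around found on common Linux compilers.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 32-bit signed daddr_t. */
typedef int32_t narrow_daddr_t;

/*
 * Store the offset in a signed 32-bit variable, then pass it on as a
 * uint64_t. Values of 0x80000000 or greater become negative in the
 * narrow type and are sign-extended when widened again.
 */
uint64_t
offset_via_signed32(uint64_t offset)
{
	/* Assumes two's complement wrap-around on conversion. */
	narrow_daddr_t narrow = (narrow_daddr_t)(uint32_t)offset;

	/* Widening a negative int32_t fills the upper 32 bits with ones. */
	return ((uint64_t)narrow);
}

/* The approach the commit describes: keep the offset in a uint64_t. */
uint64_t
offset_via_uint64(uint64_t offset)
{
	return (offset);
}
```

Under these assumptions, an offset of 0x80000000 comes back from offset_via_signed32() as 0xFFFFFFFF80000000, while offset_via_uint64() returns it unchanged.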

1051 Commits