There were two cases when attempting to set the vdev block device
scheduler which would causes console warnings.
The first case was when the vdev used a loop, ram, dm, or other
such device which doesn't support a configurable scheduler. In
these cases attempting to set a scheduler is pointless and can
be safely skipped.
The secord case is slightly more troubling. We were seeing
transient cases where setting the elevator would return -EFAULT.
On retry everything is fine so there appears to be a small window
where this is possible. To handle that case we silently retry
up to three times before reporting the warning.
In all of the above cases the warning is harmless and at worse you
may see slightly different performance characteristics from one
or more of your vdevs.
Initial testing has shown the the right IO scheduler to use under Linux
is noop. This strikes the ideal balance by allowing the zfs elevator
to do all request ordering and prioritization. While allowing the
Linux elevator to do the maximum front/back merging allowed by the
physical device. This yields the largest possible requests for the
device with the lowest total overhead.
While 'noop' should be right for your system you can choose a different
IO scheduler with the 'zfs_vdev_scheduler' option. You may set this
value to any of the standard Linux schedulers: noop, cfq, deadline,
anticipatory. In addition, if you choose 'none' zfs will not attempt
to change the IO scheduler for the block device.
This commit fixes a sign extension bug affecting l2arc devices. Extremely
large offsets may be passed down to the low level block device driver on
reads, generating errors similar to
attempt to access beyond end of device
sdbi1: rw=14, want=36028797014862705, limit=125026959
The unwanted sign extension occurrs because the function arc_read_nolock()
stores the offset as a daddr_t, a 32-bit signed int type in the Linux kernel.
This offset is then passed to zio_read_phys() as a uint64_t argument, causing
sign extension for values of 0x80000000 or greater. To avoid this, we store
the offset in a uint64_t.
This change also changes a few daddr_t struct members to uint64_t in the libspl
headers to avoid similar bugs cropping up in the future. We also add an ASSERT
to __vdev_disk_physio() to check for invalid offsets.
Closes#66
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The name of the flag used to mark a bio as synchronous has changed
again in the 2.6.36 kernel due to the unification of the BIO_RW_*
and REQ_* flags. The new flag is called REQ_SYNC. To simplify
checking this flag I have introduced the vdev_disk_dio_is_sync()
helper function. Based on the results of several new autoconf
tests it uses the correct mask to check for a synchronous bio.
Preferred interface for flagging a synchronous bio:
2.6.12-2.6.29: BIO_RW_SYNC
2.6.30-2.6.35: BIO_RW_SYNCIO
2.6.36-2.6.xx: REQ_SYNC
While there is no right maximum timeout for a disk IO we can start
laying the ground work to measure how long they do take in practice.
This change simply measures the IO time and if it exceeds 30s an
event is posted for 'zpool events'.
This value was carefully selected because for sd devices it implies
that at least one timeout (SD_TIMEOUT) has occured. Unfortunately,
even with FAILFAST set we may retry and request and not get an
error. This behavior is strongly dependant on the device driver
and how it is hooked in to the scsi error handling stack. However
by setting the limit at 30s we can log the event even if no error
was returned.
Slightly longer term we can start recording these delays perhaps
as a simple power-of-two histrogram. This histogram can then be
reported as part of the 'zpool status' command when given an command
line option.
None of this code changes the internal behavior of ZFS. Currently
it is simply for reporting excessively long delays.
ZFS works best when it is notified as soon as possible when a device
failure occurs. This allows it to immediately start any recovery
actions which may be needed. In theory Linux supports a flag which
can be set on bio's called FAILFAST which provides this quick
notification by disabling the retry logic in the lower scsi layers.
That's the theory at least. In practice is turns out that while the
flag exists you oddly have to set it with the BIO_RW_AHEAD flag.
And even when it's set it you may get retries in the low level
drivers decides that's the right behavior, or if you don't get the
right error codes reported to the scsi midlayer.
Unfortunately, without additional kernels patchs there's not much
which can be done to improve this. Basically, this just means that
it may take 2-3 minutes before a ZFS is notified properly that a
device has failed. This can be improved and I suspect I'll be
submitting patches upstream to handle this.
All the upper layers of zfs expect zio->io_error to be positive. I was
careful but I missed one instance in vdev_disk_physio_completion() which
could return a negative error. To ensure all cases are always caught I
had additionally added an ASSERT() to check this before zio_interpret().
Finally, as a debugging aid when zfs is build with --enable-debug all
errors from the backing block devices will be reported to the console
with an error message like this:
ZFS: zio error=5 type=1 offset=4217856 size=8192 flags=60440
This commit fixes a bug in vdev_disk_open() in which the whole_disk property
was getting set to 0 for disk devices, even when it was stored as a 1 when the
zpool was created. The whole_disk property lets us detect when the partition
suffix should be stripped from the device name in CLI output. It is also used
to determine how writeback cache should be set for a device.
When an existing zpool is imported its configuration is read from the vdev
label by user space in zpool_read_label(). The whole_disk property is saved in
the nvlist which gets passed into the kernel, where it in turn gets saved in
the vdev struct in vdev_alloc(). Therefore, this value is available in
vdev_disk_open() and should not be overridden by checking the provided device
path, since that path will likely point to a partition and the check will
return the wrong result.
We also add an ASSERT that the whole_disk property is set. We are not aware of
any cases where vdev_disk_open() should be called with a config that doesn't
have this property set. The ASSERT is there so that when debugging is enabled
we can identify any legitimate cases that we are missing. If we never hit the
ASSERT, we can at some point remove it along with the conditional whole_disk
check.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>