While there is no right maximum timeout for a disk IO we can start
laying the ground work to measure how long they do take in practice.
This change simply measures the IO time and if it exceeds 30s an
event is posted for 'zpool events'.
This value was carefully selected because for sd devices it implies
that at least one timeout (SD_TIMEOUT) has occured. Unfortunately,
even with FAILFAST set we may retry and request and not get an
error. This behavior is strongly dependant on the device driver
and how it is hooked in to the scsi error handling stack. However
by setting the limit at 30s we can log the event even if no error
was returned.
Slightly longer term we can start recording these delays perhaps
as a simple power-of-two histrogram. This histogram can then be
reported as part of the 'zpool status' command when given an command
line option.
None of this code changes the internal behavior of ZFS. Currently
it is simply for reporting excessively long delays.
ZFS works best when it is notified as soon as possible when a device
failure occurs. This allows it to immediately start any recovery
actions which may be needed. In theory Linux supports a flag which
can be set on bio's called FAILFAST which provides this quick
notification by disabling the retry logic in the lower scsi layers.
That's the theory at least. In practice is turns out that while the
flag exists you oddly have to set it with the BIO_RW_AHEAD flag.
And even when it's set it you may get retries in the low level
drivers decides that's the right behavior, or if you don't get the
right error codes reported to the scsi midlayer.
Unfortunately, without additional kernels patchs there's not much
which can be done to improve this. Basically, this just means that
it may take 2-3 minutes before a ZFS is notified properly that a
device has failed. This can be improved and I suspect I'll be
submitting patches upstream to handle this.
To make the 'zpool events' output simple to parse with awk the extra
newline after embedded nvlists has been dropped. This allows the
entire event to be parsed as a single whitespace seperated record.
The -H option has been added to operate in scripted mode. For the
'zpool events' command this means don't print the header. The usage
of -H is consistent with scripted mode for other zpool commands.
By default the Solaris code does not log speculative or soft io errors
in either 'zpool status' or post an event. Under Linux we don't want
to change the expected behavior of 'zpool status' so these io errors
are still suppressed there.
However, since we do need to know about these events for Linux FMA and
the 'zpool events' interface is new we do post the events. With the
addition of the zio_flags field the posted events now contain enough
information that a user space consumer can identify and discard these
events if it sees fit.
All the upper layers of zfs expect zio->io_error to be positive. I was
careful but I missed one instance in vdev_disk_physio_completion() which
could return a negative error. To ensure all cases are always caught I
had additionally added an ASSERT() to check this before zio_interpret().
Finally, as a debugging aid when zfs is build with --enable-debug all
errors from the backing block devices will be reported to the console
with an error message like this:
ZFS: zio error=5 type=1 offset=4217856 size=8192 flags=60440
Observed during failure mode testing, dsl_scan_setup_sync() allocates
73920 bytes. This is way over the limit of what is wise to do with a
kmem_alloc() and it should probably be moved to a slab. For now I'm
just flagging it with KM_NODEBUG to quiet the error until this can be
revisited.
The string array 'char dirs[5][8]' was too small to accomodate the terminating
NUL character in "by-label". This change adds the needed additional byte.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This commit modifies libzfs_init() to attempt to load the zfs kernel module if
it is not already loaded. This is done to simplify initialization by letting
users simply import their zpools without having to first load the module.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Under Solaris, the slice number is chopped off when displaying the device name
if the vdev is a whole disk. Under Linux we should similarly discard the
partition number. This commit adds the logic to perform the name truncation
for devices ending in -partX, XpX, or X, where X is a string of digits. The
second case handles devices like md0p0. The third case is limited to scsi and
ide disks, i.e. those beginning with "sd" or "hd", in order to avoid stripping
the number from names like "loop0".
This commit removes the Solaris-specific code for removing slices, since we no
longer reasonably expect our changes to be merged in upstream. The partition
stripping code was moved off to a helper function to improve readability.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This commit fixes a bug in vdev_disk_open() in which the whole_disk property
was getting set to 0 for disk devices, even when it was stored as a 1 when the
zpool was created. The whole_disk property lets us detect when the partition
suffix should be stripped from the device name in CLI output. It is also used
to determine how writeback cache should be set for a device.
When an existing zpool is imported its configuration is read from the vdev
label by user space in zpool_read_label(). The whole_disk property is saved in
the nvlist which gets passed into the kernel, where it in turn gets saved in
the vdev struct in vdev_alloc(). Therefore, this value is available in
vdev_disk_open() and should not be overridden by checking the provided device
path, since that path will likely point to a partition and the check will
return the wrong result.
We also add an ASSERT that the whole_disk property is set. We are not aware of
any cases where vdev_disk_open() should be called with a config that doesn't
have this property set. The ASSERT is there so that when debugging is enabled
we can identify any legitimate cases that we are missing. If we never hit the
ASSERT, we can at some point remove it along with the conditional whole_disk
check.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This callback is needed for properly accounting the per-uid and per-gid
space usage. Even if we don't have the ZPL, we still need this callback
in order to have proper on-disk ZPL compatibility and to be able to use
Lustre quotas.
Fortunately, the callback doesn't have any ZPL/VFS dependencies so we
can just move it out of #ifdef HAVE_ZPL.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Top-level vdev names in zpool status output should follow a <type-id> naming
convention. In the case of raidz devices, the type portion of the name was
missing.
This commit fixes a bug in zpool_vdev_name() where in this snprintf call
(void) snprintf(buf, sizeof (buf), "%s-%llu", path,
(u_longlong_t)id);
buf and path may point to the same location. The result is that buf ends up
containing only the "-id" part. This only occurred for raidz devices because
the code for appending the parity level to the type string stored its result in
buf then set path to point there. To fix this we allocate a new temporary
buffer on the stack instead of reusing buf.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#57
Required for the DB_DNODE_ENTER()/DB_DNODE_EXIT() helpers.
Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
In my machine, dnode_hold_impl() allocates 9992 bytes in DEBUG mode and it
causes a large stream of stack traces in the logs. Instead, use KM_NODEBUG
to quiet down this known large alloc.
Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
By default the zpool_layout command would always use the slot
number assigned by Linux when generating the zdev.conf file.
This is a reasonable default there are cases when it makes
sense to remap the slot id assigned by Linux using your own
custom mapping.
This commit adds support to zpool_layout to provide a custom
slot mapping file. The file contains in the first column the
Linux slot it and in the second column the custom slot mapping.
By passing this map file with '-m map' to zpool_config the
mapping will be applied when generating zdev.conf.
Additionally, two sample mapping have been added which reflect
different ways to map the slots in the dragon drawers.
These two lines were being rendered incorrectly on the GitHub
site. To fix the issue there needs to be leading whitespace
before each line to ensure each command is rendered on its
own line.
$ ./configure
$ make pkg
The wiki contents have been converted to html and made available
at their new home http://zfsonlinux.org. The wiki has also been
disabled the html pages are now the official documentation.
Occasional failures were observed in zconfig.sh because udev
could be delayed for a few seconds. To handle this the wait_udev
function has been added to wait for timeout seconds for an
expected device before returning an error. By default callers
currently use a 30 seconds timeout which should be much longer
than udev ever needs but not so long to worry the test suite
is hung.
Commit 6283f55ea1 updated _almost_
everything to use the correct top level object directory. This
was done to correctly supporting building in custom directories.
Unfortunately, I missed this one instance in the zfs-module.spec.in
rpm spec file. Fix it.
The zfs package supports the option --with-config=srpm which
is used to bootstrap configure to allow the 'make srpm' target
to work. This has the advantage of allowing creation of source
rpms without having all your -devel packages installed. This
source package can then be feed back in to an automated build
farm which only installs the required packages listed by the
srpm. This ensures that all proper dependencies are expressed
by the source package, because if they are not you will get
configure/build failures.
The trouble here is that --with-config=srpm prevents the
architecture check from running resulting in TARGET_ASM_DIR
being set to the default asm-generic. The 'make dist' rule
then fails because there is no asm-generic/atomic.S file
because it is generated at build time. To handle this I
have added an empty file asm-generic/atomic.S simply as a
place holder for 'make dist'.
Previously the project contained who zfs_context.h files,
one for user space builds and one for kernel space builds.
It was the responsibility of the source including the file
to ensure the right one was included based on the order of
the include paths.
This was the way it was done in OpenSolaris but for our
purposes I felt it was overly obscure. The user and kernel
zfs_context.h files have been combined in to a single file
and a #define determines if you get the user or kernel
context.
The issue here was that I used the _KERNEL macro which is
defined as part of the spl which will only be defined for
most builds after you include the right zfs_context. It is
safer to use the __KERNEL__ macro which is automatically
defined as part of the kernel build process and passed as
a command line compiler option. It will always be defined
if your building in the kernel and never for user space.
Under Ubuntu 10.04 the default compiler flags include -Wformat
and -Wformat-security which cause the above warning. In particular,
cases where "%s" was forgotten as part of the format specifier.
https://wiki.ubuntu.com/CompilerFlags
One of the neat tricks an autoconf style project is capable of
is allow configurion/building in a directory other than the
source directory. The major advantage to this is that you can
build the project various different ways while making changes
in a single source tree.
For example, this project is designed to work on various different
Linux distributions each of which work slightly differently. This
means that changes need to verified on each of those supported
distributions perferably before the change is committed to the
public git repo.
Using nfs and custom build directories makes this much easier.
I now have a single source tree in nfs mounted on several different
systems each running a supported distribution. When I make a
change to the source base I suspect may break things I can
concurrently build from the same source on all the systems each
in their own subdirectory.
wget -c http://github.com/downloads/behlendorf/zfs/zfs-x.y.z.tar.gz
tar -xzf zfs-x.y.z.tar.gz
cd zfs-x-y-z
------------------------- run concurrently ----------------------
<ubuntu system> <fedora system> <debian system> <rhel6 system>
mkdir ubuntu mkdir fedora mkdir debian mkdir rhel6
cd ubuntu cd fedora cd debian cd rhel6
../configure ../configure ../configure ../configure
make make make make
make check make check make check make check
This change also moves many of the include headers from individual
incude/sys directories under the modules directory in to a single
top level include directory. This has the advantage of making
the build rules cleaner and logically it makes a bit more sense.
The spl_config.h file is checked to determine the spl version.
However, the zfs code was looking for it in the source directory
and not the build directory.
The GIT file was removed from the tree because I have stopped
using TopGit. Because of this is must also be removed from
the top level Makefile.am as will as the zfs.spec.in file
which referenced it.
Fix type in lib/libzpool/Makefile.am which was preventing
the needed zrlock.h header from being included by 'make dist'.
I simply had the name wrong in the Makefile.am.
Regenerated autogen.sh build products.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This script is now dynamically generated at configure time
from scripts/common.sh.in. This change was made by commit
26e61dd074df64f9e1d779273efd56fa9d92cdc5 but we accidentally
kept the common.sh file around.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Full update to date build information will stay on the wiki for
now, but there is no harm in adding the bare bones instructions
to the README. They shouldn't change and are a reasonable
quick start.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Add the initial products from autogen.sh. These products will
be updated incrementally after this point as development occurs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Minor changes to ztest for this environment. These including
updating ztest to run in the local development tree, as well
as relocating some local variables in this function to the heap.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This topic branch contains required changes to the user space
utilities to allow them to integrate cleanly with Linux.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This topic branch contains all the changes needed to integrate the user
side zfs tools with Linux style devices. Primarily this includes fixing
up the Solaris libefi library to be Linux friendly, and integrating with
the libblkid library which is provided by e2fsprogs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Track various large hunks which have been dropped simply
because they are not relevant to this port.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Solaris recently introduced the idea of drive topology because
where a drive is located does matter. I have already handled
this with udev/blkid integration under Linux so I'm hopeful
this case can simply be removed but for now I've just stubbed
out what is needed in libspl and commented out the rest here.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The upstream ZFS code has correctly moved to a faster native sha2
implementation. Unfortunately, under Linux that's going to be a little
problematic so we revert the code to the more portable version contained
in earlier ZFS releases. Using the native sha2 implementation in Linux
is possible but the API is slightly different in kernel version user
space depending on which libraries are used. Ideally, we need a fast
implementation of SHA256 which builds as part of ZFS this shouldn't be
that hard to do but it will take some effort.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
All changes needed for the libspl layer. This includes modifications
to files directly copied from OpenSolaris and the addition of new
files needed to fill in the gaps.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This branch contains the majority of the changes required to cleanly
intergrate with Linux style special devices (/dev/zfs). Mainly this
means dropping all the Solaris style callbacks and replacing them
with the Linux equivilants.
This patch also adds the onexit infrastructure needed to track
some minimal state between ioctls. Under Linux it would be easy
to do this simply using the file->private_data. But under Solaris
they apparent need to pass the file descriptor as part of the ioctl
data and then perform a lookup in the kernel. Once again to keep
code change to a minimum I've implemented the Solaris solution.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The ZFS update to onnv_141 brought with it support for a
security label attribute called mlslabel. This feature
depends on zones to work correctly and thus I am disabling
it under Linux. Equivilant functionality could be added
at some point in the future.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>