This commit adds module options for all existing zfs tunables.
Ideally the average user should never need to modify any of these
values. However, in practice sometimes you do need to tweak these
values for one reason or another. In those cases it's nice not to
have to resort to rebuilding from source. All tunables are visable
to modinfo and the list is as follows:
$ modinfo module/zfs/zfs.ko
filename: module/zfs/zfs.ko
license: CDDL
author: Sun Microsystems/Oracle, Lawrence Livermore National Laboratory
description: ZFS
srcversion: 8EAB1D71DACE05B5AA61567
depends: spl,znvpair,zcommon,zunicode,zavl
vermagic: 2.6.32-131.0.5.el6.x86_64 SMP mod_unload modversions
parm: zvol_major:Major number for zvol device (uint)
parm: zvol_threads:Number of threads for zvol device (uint)
parm: zio_injection_enabled:Enable fault injection (int)
parm: zio_bulk_flags:Additional flags to pass to bulk buffers (int)
parm: zio_delay_max:Max zio millisec delay before posting event (int)
parm: zio_requeue_io_start_cut_in_line:Prioritize requeued I/O (bool)
parm: zil_replay_disable:Disable intent logging replay (int)
parm: zfs_nocacheflush:Disable cache flushes (bool)
parm: zfs_read_chunk_size:Bytes to read per chunk (long)
parm: zfs_vdev_max_pending:Max pending per-vdev I/Os (int)
parm: zfs_vdev_min_pending:Min pending per-vdev I/Os (int)
parm: zfs_vdev_aggregation_limit:Max vdev I/O aggregation size (int)
parm: zfs_vdev_time_shift:Deadline time shift for vdev I/O (int)
parm: zfs_vdev_ramp_rate:Exponential I/O issue ramp-up rate (int)
parm: zfs_vdev_read_gap_limit:Aggregate read I/O over gap (int)
parm: zfs_vdev_write_gap_limit:Aggregate write I/O over gap (int)
parm: zfs_vdev_scheduler:I/O scheduler (charp)
parm: zfs_vdev_cache_max:Inflate reads small than max (int)
parm: zfs_vdev_cache_size:Total size of the per-disk cache (int)
parm: zfs_vdev_cache_bshift:Shift size to inflate reads too (int)
parm: zfs_scrub_limit:Max scrub/resilver I/O per leaf vdev (int)
parm: zfs_recover:Set to attempt to recover from fatal errors (int)
parm: spa_config_path:SPA config file (/etc/zfs/zpool.cache) (charp)
parm: zfs_zevent_len_max:Max event queue length (int)
parm: zfs_zevent_cols:Max event column width (int)
parm: zfs_zevent_console:Log events to the console (int)
parm: zfs_top_maxinflight:Max I/Os per top-level (int)
parm: zfs_resilver_delay:Number of ticks to delay resilver (int)
parm: zfs_scrub_delay:Number of ticks to delay scrub (int)
parm: zfs_scan_idle:Idle window in clock ticks (int)
parm: zfs_scan_min_time_ms:Min millisecs to scrub per txg (int)
parm: zfs_free_min_time_ms:Min millisecs to free per txg (int)
parm: zfs_resilver_min_time_ms:Min millisecs to resilver per txg (int)
parm: zfs_no_scrub_io:Set to disable scrub I/O (bool)
parm: zfs_no_scrub_prefetch:Set to disable scrub prefetching (bool)
parm: zfs_txg_timeout:Max seconds worth of delta per txg (int)
parm: zfs_no_write_throttle:Disable write throttling (int)
parm: zfs_write_limit_shift:log2(fraction of memory) per txg (int)
parm: zfs_txg_synctime_ms:Target milliseconds between tgx sync (int)
parm: zfs_write_limit_min:Min tgx write limit (ulong)
parm: zfs_write_limit_max:Max tgx write limit (ulong)
parm: zfs_write_limit_inflated:Inflated tgx write limit (ulong)
parm: zfs_write_limit_override:Override tgx write limit (ulong)
parm: zfs_prefetch_disable:Disable all ZFS prefetching (int)
parm: zfetch_max_streams:Max number of streams per zfetch (uint)
parm: zfetch_min_sec_reap:Min time before stream reclaim (uint)
parm: zfetch_block_cap:Max number of blocks to fetch at a time (uint)
parm: zfetch_array_rd_sz:Number of bytes in a array_read (ulong)
parm: zfs_pd_blks_max:Max number of blocks to prefetch (int)
parm: zfs_dedup_prefetch:Enable prefetching dedup-ed blks (int)
parm: zfs_arc_min:Min arc size (ulong)
parm: zfs_arc_max:Max arc size (ulong)
parm: zfs_arc_meta_limit:Meta limit for arc size (ulong)
parm: zfs_arc_reduce_dnlc_percent:Meta reclaim percentage (int)
parm: zfs_arc_grow_retry:Seconds before growing arc size (int)
parm: zfs_arc_shrink_shift:log2(fraction of arc to reclaim) (int)
parm: zfs_arc_p_min_shift:arc_c shift to calc min/max arc_p (int)
This change fixes a kernel panic which would occur when resizing
a dataset which was not open. The objset_t stored in the
zvol_state_t will be set to NULL when the block device is closed.
To avoid this issue we pass the correct objset_t as the third arg.
The code has also been updated to correctly notify the kernel
when the block device capacity changes. For 2.6.28 and newer
kernels the capacity change will be immediately detected. For
earlier kernels the capacity change will be detected when the
device is next opened. This is a known limitation of older
kernels.
Online ext3 resize test case passes on 2.6.28+ kernels:
$ dd if=/dev/zero of=/tmp/zvol bs=1M count=1 seek=1023
$ zpool create tank /tmp/zvol
$ zfs create -V 500M tank/zd0
$ mkfs.ext3 /dev/zd0
$ mkdir /mnt/zd0
$ mount /dev/zd0 /mnt/zd0
$ df -h /mnt/zd0
$ zfs set volsize=800M tank/zd0
$ resize2fs /dev/zd0
$ df -h /mnt/zd0
Original-patch-by: Fajar A. Nugraha <github@fajar.net>
Closes#68Closes#84
This build failure was accidentally introduced by previous commit
bfd214a which fixed the load average. Unfortunately, the wrapper
for cv_wait_interruptible was not available in the zfs_context.h
user compatibility code. I failed to notice this because I didn't
rebuild everything cleanly before committing.
undefined reference to `cv_wait_interruptible'
collect2: ld returned 1 exit status
Closes#181
This commit fixes issue on
https://github.com/behlendorf/zfs/issues/#issue/172
Changes:
- update BLKZNAME to use _IOR instead of _IO. Kernel 2.6.32 allows
read parameters (copy_to_user) with _IO, while newer kernels (tested
Archlinux's 2.6.37 kernel) enforces _IOR (which is correct)
- fix return code and message on error
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Added insert_inode_locked() helper function, prior to this most callers
used insert_inode_hash(). The older method doesn't check for collisions
in the inode_hashtable but it still acceptible for use. Fallback to
using insert_inode_hash() when insert_inode_locked() is unavailable.
Now that KM_SLEEP is not defined as GFP_NOFS there is the possibility
of synchronous reclaim deadlocks. These deadlocks never existed in the
original OpenSolaris code because all memory reclaim on Solaris is done
asyncronously. Linux does both synchronous (direct) and asynchronous
(indirect) reclaim.
This commit addresses a deadlock caused by inode eviction. A KM_SLEEP
allocation may trigger direct memory reclaim and shrink the inode cache.
This can occur while a mutex in the array of ZFS_OBJ_HOLD mutexes is
held. Through the ->shrink_icache_memory()->evict()->zfs_inactive()->
zfs_zinactive() call path the same mutex may be reacquired resulting
in a deadlock. To avoid this deadlock the process must not reacquire
the mutex when it is already holding it.
This is a reasonable fix for now but longer term the ZFS_OBJ_HOLD
mutex locking should be reevaluated. This infrastructure already
prevents us from ever using the Linux lock dependency analysis tools,
and it may limit scalability.
To support automatically mounting your zfs on filesystem on boot
a basic init script is needed. Unfortunately, every distribution
has their own idea of the _right_ way to do things. Rather than
write one very complicated portable init script, which would be
invariably replaced by the distributions own anyway. I have
instead added support to provide multiple distribution specific
init scripts.
The correct init script for your distribution will be selected
by ZFS_AC_DEFAULT_PACKAGE which will set DEFAULT_INIT_SCRIPT.
During 'make install' the correct script for your system will
be installed from zfs/etc/init.d/zfs.DEFAULT_INIT_SCRIPT to the
usual /etc/init.d/zfs location.
Currently, there is zfs.fedora and a more generic zfs.lsb init
script. Hopefully, the distribution maintainers who know best
how they want their init scripts to function will feedback their
approved versions to be included in the project.
This change does not consider upstart jobs but I'm not at all
opposed to add that sort of thing.
Register the missing .remount_fs handler. This handler isn't strictly
required because the VFS does a pretty good job updating most of the
MS_* flags. However, there's no harm in using the hook to call the
registered zpl callback for various MS_* flags. Additionaly, this
allows us to lay the ground work for more complicated argument parsing
in the future.
Register the missing .sync_fs handler. This is a noop in most cases
because the usual requirement is that sync just be initiated. As part
of the DMU's normal transaction processing txgs will be frequently
synced. However, when the 'wait' flag is set the requirement is that
.sync_fs must not return until the data is safe on disk. With the
addition of the .sync_fs handler this is now properly implemented.
The original range lock implementation had to be modified by commit
8926ab7 because it was unsafe on Linux. In particular, calling
cv_destroy() immediately after cv_broadcast() is dangerous because
the waiters may still be asleep. Thus the following cv_destroy()
will free memory which may still be in use.
This was fixed by updating cv_destroy() to block on waiters but
this in turn introduced a deadlock. The deadlock was resolved
with the use of a taskq to move the offending free outside the
range lock. This worked well but using the taskq for the free
resulted in a serious performace hit. This is somewhat ironic
because at the time I felt using the taskq might improve things
by making the free asynchronous.
This patch refines the original fix and moves the free from the
taskq to a private free list. Then items which must be free'd
are simply inserted in to the list. When the range lock is dropped
it's safe to free the items. The list is walked and all rl_t
entries are freed.
This change improves small cached read performance by 26x. This
was expected because for small reads the number of locking calls
goes up significantly. More surprisingly this change significantly
improves large cache read performance. This probably attributable
to better cpu/memory locality. Very likely the same processor
which allocated the memory is now freeing it.
bs ext3 zfs zfs+fix faster
----------------------------------------------
512 435 3 79 26x
1k 820 7 160 22x
2k 1536 14 305 21x
4k 2764 28 572 20x
8k 3788 50 1024 20x
16k 4300 86 1843 21x
32k 4505 138 2560 18x
64k 5324 252 3891 15x
128k 5427 276 4710 17x
256k 5427 413 5017 12x
512k 5427 497 5324 10x
1m 5427 521 5632 10x
Closes#142
In the original implementation the zfs_open()/zfs_close() hooks
were dropped for simplicity. This was functional but not 100%
correct with the expected ZFS sematics. Updating and re-adding the
zfs_open()/zfs_close() hooks resolves the following issues.
1) The ZFS_APPENDONLY file attribute is once again honored. While
there are still no Linux tools to set/clear these attributes once
there are it should behave correctly.
2) Minimal virus scan file attribute hooks were added. Once again
this support in disabled but the infrastructure is back in place.
3) Most importantly correctly handle assigning files which were
opened syncronously to the intent log. Without this change O_SYNC
modifications could be lost during a system crash even though they
were marked synchronous.
When I began work on the Posix layer it immediately became clear to
me that to integrate cleanly with the Linux VFS certain Solaris
specific things would have to go. One of these things was to elimate
as many Solaris specific types from the ZPL layer as possible. They
would be replaced with their Linux equivalents. This would not only
be good for performance, but for the general readability and health of
the code. The Solaris and Linux VFS are different beasts and should
be treated as such. Most of the code remains common for constructing
transactions and such, but there are subtle and important differenced
which need to be repsected.
This policy went quite for for certain types such as the vnode_t,
and it initially seemed to be working out well for the vattr_t. There
was a relatively small amount of related xvattr_t code I was forced to
comment out with HAVE_XVATTR. But it didn't look that hard to come
back soon and replace it all with a native Linux type.
However, after going doing this path with xvattr some distance it
clear that this code was woven in the ZPL more deeply than I thought.
In particular its hooks went very deep in to the ZPL replay code
and replacing it would not be as easy as I originally thought.
Rather than continue persuing replacing and removing this code I've
taken a step back and reevaluted things. This commit reverts many of
my previous commits which removed xvattr related code. It restores
much of the code to its original upstream state and now relies on
improved xvattr_t support in the zfs package itself.
The result of this is that much of the code which I had commented
out, which accidentally broke things like replay, is now back in
place and working. However, there may be a small performance
impact for getattr/setattr operations because they now require
a translation from native Linux to Solaris types. For now that's
a price I'm willing to pay. Once everything is completely functional
we can revisting the issue of removing the vattr_t/xvattr_t types.
Closes#111
With the removal of the minimal xvattr support from the spl this
support needs to be replaced in the zfs package. This is fairly
easily accomplished by directly adding portions of the sys/vnode.h
header from OpenSolaris. These xvattr additions have been placed
in the sys/xvattr.h header file and included as needed where simply
a sys/vnode.h was included before.
In additon to the xvattr types and helper macros two functions
were also included. The xva_init() and xva_getxoptattr() functions
were included as static inline functions in xvattr.h. They are
simple enough and it was simpler to place them here rather than
in their own .c file.
This commit allows zvols with names longer than 32 characters, which
fixes issue on https://github.com/behlendorf/zfs/issues/#issue/102.
Changes include:
- use /dev/zd* device names for zvol, where * is the device minor
(include/sys/fs/zfs.h, module/zfs/zvol.c).
- add BLKZNAME ioctl to get dataset name from userland
(include/sys/fs/zfs.h, module/zfs/zvol.c, cmd/zvol_id).
- add udev rule to create /dev/zvol/[dataset_name] and the legacy
/dev/[dataset_name] symlink. For partitions on zvol, it will create
/dev/zvol/[dataset_name]-part* (etc/udev/rules.d/60-zvol.rules,
cmd/zvol_id).
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The open_bdev_exclusive() function has been replaced (again) by the
more generic blkdev_get_by_path() function. Additionally, the
counterpart function close_bdev_exclusive() has been replaced by
blkdev_put(). Because these functions are more generic versions
of the functions they replaced the compatibility macro must add
the FMODE_EXCL mask to ensure they are exclusive.
Closes#114
For legacy reasons the zvol.c and vdev_disk.c Linux compatibility
code ended up in sys/blkdev.h and sys/vdev_disk.h headers. While
there are worse places for this code to live it should be in a
linux/blkdev_compat.h header. This change moves this block device
Linux compatibility code in to the linux/blkdev_compat.h header
and updates all the correct #include locations. This is not a
functional change or bug fix, it is just code cleanup.
This adds an API to wait for pending commit callbacks of already-synced
transactions to finish processing. This is needed by the DMU-OSD in
Lustre during device finalization when some callbacks may still not be
called, this leads to non-zero reference count errors. See lustre.org
bug 23931.
The new prefered inteface for evicting an inode from the inode cache
is the ->evict_inode() callback. It replaces both the ->delete_inode()
and ->clear_inode() callbacks which were previously used for this.
The fsync() callback in the file_operations structure used to take
3 arguments. The callback now only takes 2 arguments because the
dentry argument was determined to be unused by all consumers. To
handle this a compatibility prototype was added to ensure the right
prototype is used. Our implementation never used the dentry argument
either so it's just a matter of using the right prototype.
The const keyword was added to the 'struct xattr_handler' in the
generic Linux super_block structure. To handle this we define an
appropriate xattr_handler_t typedef which can be used. This was
the preferred solution because it keeps the code clean and readable.
Initial testing has shown the the right IO scheduler to use under Linux
is noop. This strikes the ideal balance by allowing the zfs elevator
to do all request ordering and prioritization. While allowing the
Linux elevator to do the maximum front/back merging allowed by the
physical device. This yields the largest possible requests for the
device with the lowest total overhead.
While 'noop' should be right for your system you can choose a different
IO scheduler with the 'zfs_vdev_scheduler' option. You may set this
value to any of the standard Linux schedulers: noop, cfq, deadline,
anticipatory. In addition, if you choose 'none' zfs will not attempt
to change the IO scheduler for the block device.
It's worth taking a moment to describe how mmap is implemented
for zfs because it differs considerably from other Linux filesystems.
However, this issue is handled the same way under OpenSolaris.
The issue is that by design zfs bypasses the Linux page cache and
leaves all caching up to the ARC. This has been shown to work
well for the common read(2)/write(2) case. However, mmap(2)
is problem because it relies on being tightly integrated with the
page cache. To handle this we cache mmap'ed files twice, once in
the ARC and a second time in the page cache. The code is careful
to keep both copies synchronized.
When a file with an mmap'ed region is written to using write(2)
both the data in the ARC and existing pages in the page cache
are updated. For a read(2) data will be read first from the page
cache then the ARC if needed. Neither a write(2) or read(2) will
will ever result in new pages being added to the page cache.
New pages are added to the page cache only via .readpage() which
is called when the vfs needs to read a page off disk to back the
virtual memory region. These pages may be modified without
notifying the ARC and will be written out periodically via
.writepage(). This will occur due to either a sync or the usual
page aging behavior. Note because a read(2) of a mmap'ed file
will always check the page cache first even when the ARC is out
of date correct data will still be returned.
While this implementation ensures correct behavior it does have
have some drawbacks. The most obvious of which is that it
increases the required memory footprint when access mmap'ed
files. It also adds additional complexity to the code keeping
both caches synchronized.
Longer term it may be possible to cleanly resolve this wart by
mapping page cache pages directly on to the ARC buffers. The
Linux address space operations are flexible enough to allow
selection of which pages back a particular index. The trick
would be working out the details of which subsystem is in
charge, the ARC, the page cache, or both. It may also prove
helpful to move the ARC buffers to a scatter-gather lists
rather than a vmalloc'ed region.
Additionally, zfs_write/read_common() were used in the readpage
and writepage hooks because it was fairly easy. However, it
would be better to update zfs_fillpage and zfs_putapage to be
Linux friendly and use them instead.
The Linux specific file operations have all been located in the
file zpl_file.c. These functions primarily rely on the reworked
zfs_* functions to do their job. They are also responsible for
converting the possible Solaris style error codes to negative
Linux errors.
This first zpl_* commit also includes a common zpl.h header with
minimal entries to register the Linux specific hooks. In also
adds all the new zpl_* file to the Makefile.in. This is not a
standalone commit, you required the following zpl_* commits.
A new flag is required for the zfs_rlock code to determine if
it is operation of the zvol of zpl dataset. This used to be
keyed off the zp->z_vnode, which was a hack to begin with, but
with the removal of vnodes we needed a dedicated flag.
I appologize in advance why to many things ended up in this commit.
When it could be seperated in to a whole series of commits teasing
that all apart now would take considerable time and I'm not sure
there's much merrit in it. As such I'll just summerize the intent
of the changes which are all (or partly) in this commit. Broadly
the intent is to remove as much Solaris specific code as possible
and replace it with native Linux equivilants. More specifically:
1) Replace all instances of zfsvfs_t with zfs_sb_t. While the
type is largely the same calling it private super block data
rather than a zfsvfs is more consistent with how Linux names
this. While non critical it makes the code easier to read when
your thinking in Linux friendly VFS terms.
2) Replace vnode_t with struct inode. The Linux VFS doesn't have
the notion of a vnode and there's absolutely no good reason to
create one. There are in fact several good reasons to remove it.
It just adds overhead on Linux if we were to manage one, it
conplicates the code, and it likely will lead to bugs so there's
a good change it will be out of date. The code has been updated
to remove all need for this type.
3) Replace all vtype_t's with umode types. Along with this shift
all uses of types to mode bits. The Solaris code would pass a
vtype which is redundant with the Linux mode. Just update all the
code to use the Linux mode macros and remove this redundancy.
4) Remove using of vn_* helpers and replace where needed with
inode helpers. The big example here is creating iput_aync to
replace vn_rele_async. Other vn helpers will be addressed as
needed but they should be be emulated. They are a Solaris VFS'ism
and should simply be replaced with Linux equivilants.
5) Update znode alloc/free code. Under Linux it's common to
embed the inode specific data with the inode itself. This removes
the need for an extra memory allocation. In zfs this information
is called a znode and it now embeds the inode with it. Allocators
have been updated accordingly.
6) Minimal integration with the vfs flags for setting up the
super block and handling mount options has been added this
code will need to be refined but functionally it's all there.
This will be the first and last of these to large to review commits.
For the moment we do not use dmu_write_pages() to write pages
directly in to a dmu object. It may be required at some point
in the future, but for now is simplest and cleanest to drop it.
It can be easily readded if/when needed.
This code is used for snapshot and heavily leverages Solaris
functionality we do not want to reimplement. These files have
been removed, including references to them, and will be replaced
by a zfs_snap.c/zpl_snap.c implementation which handles snapshots.
For the moment we have left ZFS unchanged and it updates many values
as part of the znode. However, some of these values should be set
in the inode. For the moment this is handled by adding a function
called zfs_inode_update() which updates the inode based on the znode.
This is considered a workaround until we can systematically go
through the ZFS code and have it directly update the inode. At
which point zfs_update_inode() can be dropped entirely. Keeping
two copies of the same data isn't only inefficient it's a breeding
ground for bugs.
Under Linux the convention for filesystem specific data structure is
to embed it along with the generic vfs data structure. This differs
significantly from Solaris.
Since we want to integrates as cleanly with the Linux VFS as possible.
This changes modifies zfs_znode_alloc() to allocate a znode with an
embedded inode for use with the generic VFS. This is done by calling
iget_locked() which will allocate a new inode if needed by calling
sb->alloc_inode(). This function allocates enough memory for a
znode_t by returns a pointer to the inode structure for Linux's VFS.
This function is also responsible for setting the callback
znode->z_set_ops_inodes() which is used to register the correct
handlers for the inode.
Lay the initial ground work for a include/linux/ compatibility
directory. This was less critical in the past because the bulk
of the ZFS code consumes the Solaris API via the SPL. This API
was stable and the bulk Linux API differences were handled in
the SPL.
However, with the addition of a full Posix layer written directly
against the Linux APIs we are going to need more compatibility
code. It makes sense that all this code should be cleanly located
in one place. Subsequent patches should move the existing zvol
and vdev_disk compatibility code in to this directory.
This code originates in OpenSolaris and was modified by KQ Infotech
to be compatible with Linux. While supporting uios in the short
term is useful to get something working this is not an abstraction
we want to keep. This code is expected to be short lived and
removed as soon as all the remaining uio based APIs and updated.
These functions were dropped originally because I felt they would
need to be rewritten anyway to avoid using uios. However, this
patch readds then with they dea they can just be reworked and
the uio bits dropped.
ZFS even under Solaris does not strictly require libshare to be
available. The current implementation attempts to dlopen() the
library to access the needed symbols. If this fails libshare
support is simply disabled.
This means that on Linux we only need the most minimal libshare
implementation. In fact just enough to prevent the build from
failing. Longer term we can decide if we want to implement a
libshare library like Solaris. At best this would be an abstraction
layer between ZFS and NFS/SMB. Alternately, we can drop libshare
entirely and directly integrate ZFS with Linux's NFS/SMB.
Finally the bare bones user-libshare.m4 test was dropped. If we
do decide to implement libshare at some point it will surely be
as part of this package so the check is not needed.
If libselinux is detected on your system at configure time link
against it. This allows us to use a library call to detect if
selinux is enabled and if it is to pass the mount option:
"context=\"system_u:object_r:file_t:s0"
For now this is required because none of the existing selinux
policies are aware of the zfs filesystem type. Because of this
they do not properly enable xattr based labeling even though
zfs supports all of the required hooks.
Until distro's add zfs as a known xattr friendly fs type we
must use mntpoint labeling. Alternately, end users could modify
their existing selinux policy with a little guidance.
The issue is that cv_timedwait() sleeps uninterruptibly to block signals
and avoid waking up early. Under Linux this counts against the load
average keeping it artificially high. This change allows the arc to
sleep interruptibly which mean it may be woken up early due to a signal.
Normally this means some extra care must be taken to handle a potential
signal. But for the arcs usage of cv_timedwait() there is no harm in
waking up before the timeout expires so no extra handling is required.
Most of the blk_* macros were removed in 2.6.36. Ostensibly this was
done to improve readability and allow easier grepping. However, from
a portability stand point the macros are helpful. Therefore the needed
macros are redefined here if they are missing from the kernel.
The name of the flag used to mark a bio as synchronous has changed
again in the 2.6.36 kernel due to the unification of the BIO_RW_*
and REQ_* flags. The new flag is called REQ_SYNC. To simplify
checking this flag I have introduced the vdev_disk_dio_is_sync()
helper function. Based on the results of several new autoconf
tests it uses the correct mask to check for a synchronous bio.
Preferred interface for flagging a synchronous bio:
2.6.12-2.6.29: BIO_RW_SYNC
2.6.30-2.6.35: BIO_RW_SYNCIO
2.6.36-2.6.xx: REQ_SYNC
As of linux-2.6.36 the BIO_RW_FAILFAST and REQ_FAILFAST flags
have been unified under the REQ_* names. These flags always had
to be kept in-sync so this is a nice step forward, unfortunately
it means we need to be careful to only use the new unified flags
when the BIO_RW_* flags are not defined. Additional autoconf
checks were added for this and if it is ever unclear which method
to use no flags are set. This is safe but may result in longer
delays before a disk is failed.
Perferred interface for setting FAILFAST on a bio:
2.6.12-2.6.27: BIO_RW_FAILFAST
2.6.28-2.6.35: BIO_RW_FAILFAST_{DEV|TRANSPORT|DRIVER}
2.6.36-2.6.xx: REQ_FAILFAST_{DEV|TRANSPORT|DRIVER}
It turns out that 'zpool events' over 1024 bytes in size where being
silently dropped. This was discovered while writing the zfault.sh
tests to validate common failure modes.
This could occur because the zfs interface for passing an arbitrary
size nvlist_t over an ioctl() is to provide a buffer for the packed
nvlist which is usually big enough. In this case 1024 byte is the
default. If the kernel determines the buffer is to small it returns
ENOMEM and the minimum required size of the nvlist_t. This was
working properly but in the case of 'zpool events' the event stream
was advanced dispite the error. Thus the retry with the bigger
buffer would succeed but it would skip over the previous event.
The fix is to pass this size to zfs_zevent_next() and determine
before removing the event from the list if it will fit. This was
preferable to checking after the event was returned because this
avoids the need to rewind the stream.
While there is no right maximum timeout for a disk IO we can start
laying the ground work to measure how long they do take in practice.
This change simply measures the IO time and if it exceeds 30s an
event is posted for 'zpool events'.
This value was carefully selected because for sd devices it implies
that at least one timeout (SD_TIMEOUT) has occured. Unfortunately,
even with FAILFAST set we may retry and request and not get an
error. This behavior is strongly dependant on the device driver
and how it is hooked in to the scsi error handling stack. However
by setting the limit at 30s we can log the event even if no error
was returned.
Slightly longer term we can start recording these delays perhaps
as a simple power-of-two histrogram. This histogram can then be
reported as part of the 'zpool status' command when given an command
line option.
None of this code changes the internal behavior of ZFS. Currently
it is simply for reporting excessively long delays.
ZFS works best when it is notified as soon as possible when a device
failure occurs. This allows it to immediately start any recovery
actions which may be needed. In theory Linux supports a flag which
can be set on bio's called FAILFAST which provides this quick
notification by disabling the retry logic in the lower scsi layers.
That's the theory at least. In practice is turns out that while the
flag exists you oddly have to set it with the BIO_RW_AHEAD flag.
And even when it's set it you may get retries in the low level
drivers decides that's the right behavior, or if you don't get the
right error codes reported to the scsi midlayer.
Unfortunately, without additional kernels patchs there's not much
which can be done to improve this. Basically, this just means that
it may take 2-3 minutes before a ZFS is notified properly that a
device has failed. This can be improved and I suspect I'll be
submitting patches upstream to handle this.
By default the Solaris code does not log speculative or soft io errors
in either 'zpool status' or post an event. Under Linux we don't want
to change the expected behavior of 'zpool status' so these io errors
are still suppressed there.
However, since we do need to know about these events for Linux FMA and
the 'zpool events' interface is new we do post the events. With the
addition of the zio_flags field the posted events now contain enough
information that a user space consumer can identify and discard these
events if it sees fit.
Previously the project contained who zfs_context.h files,
one for user space builds and one for kernel space builds.
It was the responsibility of the source including the file
to ensure the right one was included based on the order of
the include paths.
This was the way it was done in OpenSolaris but for our
purposes I felt it was overly obscure. The user and kernel
zfs_context.h files have been combined in to a single file
and a #define determines if you get the user or kernel
context.
The issue here was that I used the _KERNEL macro which is
defined as part of the spl which will only be defined for
most builds after you include the right zfs_context. It is
safer to use the __KERNEL__ macro which is automatically
defined as part of the kernel build process and passed as
a command line compiler option. It will always be defined
if your building in the kernel and never for user space.
One of the neat tricks an autoconf style project is capable of
is allow configurion/building in a directory other than the
source directory. The major advantage to this is that you can
build the project various different ways while making changes
in a single source tree.
For example, this project is designed to work on various different
Linux distributions each of which work slightly differently. This
means that changes need to verified on each of those supported
distributions perferably before the change is committed to the
public git repo.
Using nfs and custom build directories makes this much easier.
I now have a single source tree in nfs mounted on several different
systems each running a supported distribution. When I make a
change to the source base I suspect may break things I can
concurrently build from the same source on all the systems each
in their own subdirectory.
wget -c http://github.com/downloads/behlendorf/zfs/zfs-x.y.z.tar.gz
tar -xzf zfs-x.y.z.tar.gz
cd zfs-x-y-z
------------------------- run concurrently ----------------------
<ubuntu system> <fedora system> <debian system> <rhel6 system>
mkdir ubuntu mkdir fedora mkdir debian mkdir rhel6
cd ubuntu cd fedora cd debian cd rhel6
../configure ../configure ../configure ../configure
make make make make
make check make check make check make check
This change also moves many of the include headers from individual
incude/sys directories under the modules directory in to a single
top level include directory. This has the advantage of making
the build rules cleaner and logically it makes a bit more sense.