This change introduces zpool_get_prop_literal. It's an expanded version
of zpool_get_prop taking one additional boolean parameter. With this
parameter set to B_FALSE it will behave identically to zpool_get_prop.
Setting it to B_TRUE will return full precision numbers for the
following properties:
ZPOOL_PROP_SIZE
ZPOOL_PROP_ALLOCATED
ZPOOL_PROP_FREE
ZPOOL_PROP_FREEING
ZPOOL_PROP_EXPANDSZ
ZPOOL_PROP_ASHIFT
Also introduced is a wrapper function for zpool_get_prop making it
use zpool_get_prop_literal in the background.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1813
Currently there is no mechanism to inspect which dbufs are being
cached by the system. There are some coarse counters in arcstats
by they only give a rough idea of what's being cached. This patch
aims to improve the current situation by adding a new dbufs kstat.
When read this new kstat will walk all cached dbufs linked in to
the dbuf_hash. For each dbuf it will dump detailed information
about the buffer. It will also dump additional information about
the referenced arc buffer and its related dnode. This provides a
more complete view in to exactly what is being cached.
With this generic infrastructure in place utilities can be written
to post-process the data to understand exactly how the caching is
working. For example, the data could be processed to show a list
of all cached dnodes and how much space they're consuming. Or a
similar list could be generated based on dnode type. Many other
ways to interpret the data exist based on what kinds of questions
you're trying to answer.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
This change adds a new kstat to gain some visibility into the
amount of time spent in each call to dmu_tx_assign. A histogram
is exported via the new dmu_tx_assign file. The information
contained in this histogram is the frequency dmu_tx_assign
took to complete given an interval range.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This change is an attempt to add visibility in to how txgs are being
formed on a system, in real time. To do this, a list was added to the
in memory SPA data structure for a pool, with each element on the list
corresponding to txg. These entries are then exported through the kstat
interface, which can then be interpreted in userspace.
For each txg, the following information is exported:
* Unique txg number (uint64_t)
* The time the txd was born (hrtime_t)
(*not* wall clock time; relative to the other entries on the list)
* The current txg state ((O)pen/(Q)uiescing/(S)yncing/(C)ommitted)
* The number of reserved bytes for the txg (uint64_t)
* The number of bytes read during the txg (uint64_t)
* The number of bytes written during the txg (uint64_t)
* The number of read operations during the txg (uint64_t)
* The number of write operations during the txg (uint64_t)
* The time the txg was closed (hrtime_t)
* The time the txg was quiesced (hrtime_t)
* The time the txg was synced (hrtime_t)
Note that while the raw kstat now stores relative hrtimes for the
open, quiesce, and sync times. Those relative times are used to
calculate how long each state took and these deltas and printed by
output handlers.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This change is an attempt to add visibility into the arc_read calls
occurring on a system, in real time. To do this, a list was added to the
in memory SPA data structure for a pool, with each element on the list
corresponding to a call to arc_read. These entries are then exported
through the kstat interface, which can then be interpreted in userspace.
For each arc_read call, the following information is exported:
* A unique identifier (uint64_t)
* The time the entry was added to the list (hrtime_t)
(*not* wall clock time; relative to the other entries on the list)
* The objset ID (uint64_t)
* The object number (uint64_t)
* The indirection level (uint64_t)
* The block ID (uint64_t)
* The name of the function originating the arc_read call (char[24])
* The arc_flags from the arc_read call (uint32_t)
* The PID of the reading thread (pid_t)
* The command or name of thread originating read (char[16])
From this exported information one can see, in real time, exactly what
is being read, what function is generating the read, and whether or not
the read was found to be already cached.
There is still some work to be done, but this should serve as a good
starting point.
Specifically, dbuf_read's are not accounted for in the currently
exported information. Thus, a follow up patch should probably be added
to export these calls that never call into arc_read (they only hit the
dbuf hash table). In addition, it might be nice to create a utility
similar to "arcstat.py" to digest the exported information and display
it in a more readable format. Or perhaps, log the information and allow
for it to be "replayed" at a later time.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
These kstat interfaces are required to port
"Illumos #3537 want pool io kstats" to ZFS on Linux.
kstat_waitq_enter()
kstat_waitq_exit()
kstat_runq_enter()
kstat_runq_exit()
Additionally, zero out the ks_data buffer in __kstat_create() so
that the kstat_io_t counters are initialized to zero.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
While porting Illumos #3537 I found that ks_lock member of kstat_t
structure is different between Illumos and SPL. It is a pointer to
the kmutex_t in Illumos, but the mutex lock itself in SPL.
Apparently Illumos kstat API allows consumer to override the lock
if required. With SPL implementation it is not possible anymore.
Things were alright until the first attempt to actually override
the lock. Porting of Illumos #3537 introduced such code for the
first time.
In order to provide the Solaris/Illumos like functionality we:
1. convert ks_lock to "kmutex_t *ks_lock"
2. create a new field "kmutex_t ks_private_lock"
3. On kstat_create() ks_lock = &ks_private_lock
Thus if consumer doesn't care we still have our internal lock in use.
If, however, consumer does care she has a chance to set ks_lock to
anything else before calling kstat_install().
The rest of the code will use ks_lock regardless of its origin.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #286
When creating a new pool, or adding/replacing a disk in an existing
pool, partition tables will be automatically created on the devices.
Under normal circumstances it will take less than a second for udev
to create the expected device files under /dev/. However, it has
been observed that if the system is doing heavy IO concurrently udev
may take far longer. If you also throw in some cheap dodgy hardware
it may take even longer.
To prevent zpool commands from failing due to this the default wait
time for udev is being increased to 30 seconds. This will have no
impact on normal usage, the increase timeout should only be noticed
if your udev rules are incorrectly configured.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1646
Linus Torvalds merged LZ4 into Linux 3.11. This causes a conflict
whenever CONFIG_LZ4_DECOMPRESS=y or CONFIG_LZ4_COMPRESS=y are set in the
kernel's .config. We rename the symbols to avoid the conflict.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1789
This reverts commit dba79fcbf2 in
favor of using the generic KSTAT_TYPE_RAW callbacks. The advantage
of this approach is that arbitrary types can be added without the
need to add them to the SPL.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #296
This change adds simple wrappers for accessing a thread's PID and
command character string.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #296
The current implementation for displaying kstats of type KSTAT_TYPE_RAW
is rather crude. This patch attempts to enhance this handling by
allowing a kstat user to register formatting callbacks which can
optionally be used.
The callbacks allow the user to implement functions for interpreting
their data and transposing it into a character buffer. This buffer,
containing a string representation of the raw data, is then be displayed
through the current /proc textual interface.
Additionally the kstats are made writable because it's now possible
to provide a useful handler via the existing ks_update() interface.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #296
This is needed for the Illumos #4045 write throttle patch. It is used
in the arc eviction code to avoid blocking all arc activity by sitting on
arcs_mtx too long.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #286
2882 implement libzfs_core
2883 changing "canmount" property to "on" should not always remount dataset
2900 "zfs snapshot" should be able to create multiple, arbitrary snapshots at once
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Chris Siden <christopher.siden@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Dan Kruchinin <dan.kruchinin@gmail.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>
References:
https://www.illumos.org/issues/2882https://www.illumos.org/issues/2883https://www.illumos.org/issues/2900illumos/illumos-gate@4445fffbbb
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1293
Porting notes:
WARNING: This patch changes the user/kernel ABI. That means that
the zfs/zpool utilities built from master are NOT compatible with
the 0.6.2 kernel modules. Ensure you load the matching kernel
modules from master after updating the utilities. Otherwise the
zfs/zpool commands will be unable to interact with your pool and
you will see errors similar to the following:
$ zpool list
failed to read pool configuration: bad address
no pools available
$ zfs list
no datasets available
Add zvol minor device creation to the new zfs_snapshot_nvl function.
Remove the logging of the "release" operation in
dsl_dataset_user_release_sync(). The logging caused a null dereference
because ds->ds_dir is zeroed in dsl_dataset_destroy_sync() and the
logging functions try to get the ds name via the dsl_dataset_name()
function. I've got no idea why this particular code would have worked
in Illumos. This code has subsequently been completely reworked in
Illumos commit 3b2aab1 (3464 zfs synctask code needs restructuring).
Squash some "may be used uninitialized" warning/erorrs.
Fix some printf format warnings for %lld and %llu.
Apply a few spa_writeable() changes that were made to Illumos in
illumos/illumos-gate.git@cd1c8b8 as part of the 3112, 3113, 3114 and
3115 fixes.
Add a missing call to fnvlist_free(nvl) in log_internal() that was added
in Illumos to fix issue 3085 but couldn't be ported to ZoL at the time
(zfsonlinux/zfs@9e11c73) because it depended on future work.
Commit torvalds/linux@2233f31aad
replaced ->readdir() with ->iterate() in struct file_operations.
All filesystems must now use the new ->iterate method.
To handle this the code was reworked to use the new ->iterate
interface. Care was taken to keep the majority of changes
confined to the ZPL layer which is already Linux specific.
However, minor changes were required to the common zfs_readdir()
function.
Compatibility with older kernels was accomplished by adding
versions of the trivial dir_emit* helper functions. Also the
various *_readdir() functions were reworked in to wrappers
which create a dir_context structure to pass to the new
*_iterate() functions.
Unfortunately, the new dir_emit* functions prevent us from
passing a private pointer to the filldir function. The xattr
directory code leveraged this ability through zfs_readdir()
to generate the list of xattr names. Since we can no longer
use zfs_readdir() a simplified zpl_xattr_readdir() function
was added to perform the same task.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1653
Issue #1591
3618 ::zio dcmd does not show timestamp data
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Dan McDonald <danmcd@nexenta.com>
References:
http://www.illumos.org/issues/3618illumos/illumos-gate@c55e05cb35
Notes on porting to ZFS on Linux:
The original changeset mostly deals with mdb ::zio dcmd.
However, in order to provide the requested functionality
it modifies vdev and zio structures to keep the timing data
in nanoseconds instead of ticks. It is these changes that
are ported over in the commit in hand.
One visible change of this commit is that the default value
of 'zfs_vdev_time_shift' tunable is changed:
zfs_vdev_time_shift = 6
to
zfs_vdev_time_shift = 29
The original value of 6 was inherited from OpenSolaris and
was subotimal - since it shifted the raw tick value - it
didn't compensate for different tick frequencies on Linux and
OpenSolaris. The former has HZ=1000, while the latter HZ=100.
(Which itself led to other interesting performance anomalies
under non-trivial load. The deadline scheduler delays the IO
according to its priority - the lower priority the further
the deadline is set. The delay is measured in units of
"shifted ticks". Since the HZ value was 10 times higher,
the delay units were 10 times shorter. Thus really low
priority IO like resilver (delay is 10 units) and scrub
(delay is 20 units) were scheduled much sooner than intended.
The overall effect is that resilver and scrub IO consumed
more bandwidth at the expense of the other IO.)
Now that the bookkeeping is done is nanoseconds the shift
behaves correctly for any tick frequency (HZ).
Ported-by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1643
When CONFIG_UIDGID_STRICT_TYPE_CHECKS is enabled uid_t/git_t are
replaced by kuid_t/kgid_t, which are structures instead of integral
types. This causes any code that uses an integral type to fail to build.
The User Namespace functionality introduced in Linux 3.8 requires
CONFIG_UIDGID_STRICT_TYPE_CHECKS, so we could not build against any
kernel that supported it.
We resolve this by converting between the new kuid_t/kgid_t structures
and the original uid_t/gid_t types.
Original-patch-by: DHE
Rewrite-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#260
num_physpages was removed by
torvalds/linux@cfa11e08ed, so lets replace
it with totalram_pages.
This is a bug fix as much as it is a compatibility fix because
num_physpages did not reflect the number of pages actually available to
the kernel:
http://lkml.indiana.edu/hypermail/linux/kernel/0908.2/01001.html
Also, there are known issues with memory calculations when ZFS is in a
Xen dom0. There is a chance that using totalram_pages could resolve
them. This conjecture is untested at the time of writing.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#273
These functions are used in neither Illumos nor ZFSOnLinux. They appear
to have been replaced by arc_buf_alloc()/arc_buf_free(), so lets remove
them.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1614
This change adds a new kstat to gain some visibility into the amount of
time spent in each call to dmu_tx_assign. A histogram is exported via
a new dmu_tx_assign_histogram-$POOLNAME file. The information contained
in this histogram is the frequency dmu_tx_assign took to complete given
an interval range. For example, given the below histogram file:
$ cat /proc/spl/kstat/zfs/dmu_tx_assign_histogram-tank
12 1 0x01 32 1536 19792068076691 20516481514522
name type data
1 us 4 859
2 us 4 252
4 us 4 171
8 us 4 2
16 us 4 0
32 us 4 2
64 us 4 0
128 us 4 0
256 us 4 0
512 us 4 0
1024 us 4 0
2048 us 4 0
4096 us 4 0
8192 us 4 0
16384 us 4 0
32768 us 4 1
65536 us 4 1
131072 us 4 1
262144 us 4 4
524288 us 4 0
1048576 us 4 0
2097152 us 4 0
4194304 us 4 0
8388608 us 4 0
16777216 us 4 0
33554432 us 4 0
67108864 us 4 0
134217728 us 4 0
268435456 us 4 0
536870912 us 4 0
1073741824 us 4 0
2147483648 us 4 0
one can see most calls to dmu_tx_assign completed in 32us or less, but a
few outliers did not. Specifically, 4 of the calls took between 262144us
and 131072us. This information is difficult, if not impossible, to gather
without this change.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1584
The FreeBSD implementation of zfs adds the 'zpool labelclear'
command. Since this functionality is helpful and straight
forward to add it is being included in ZoL.
References:
freebsd/freebsd@119a041dc9
Ported-by: Dmitry Khasanov <pik4ez@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1126
Linux kernel commit torvalds/linux#59d8053f moved the definition of
struct proc_dir_entry from include/linux/proc_fs.h to the private
header fs/proc/internal.h. The SPL relied on that to map Solaris'
kstat to entries in /proc/spl/kstat.
Since the proc_dir_entry structure is now private the only safe
thing to do is wrap the opaque proc handle with our own structure.
This actually ends up simplify the code and is good because it
moves us away from depending on implementation details of /proc.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #257
Linux kernel commmit torvalds/linux@db3808c1 moved the
vmalloc_info structure from a private to a public header.
Now that it's available for kernel modules use it.
Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #257
The approach taken was the rework zfs_holey() as little as
possible and then just wrap the code as needed to ensure
correct locking and error handling.
Tested with xfstests 285 and 286. All tests pass except for
7-9 of 285 which try to reserve blocks first via fallocate(2)
and fail because fallocate(2) is not yet supported.
Note that the filp->f_lock spinlock did not exist prior to
Linux 2.6.30, but we avoid the need for autotools check by
virtue of the fact that SEEK_DATA/SEEK_HOLE support was not
added until Linux 3.1.
An autoconf check was added for lseek_execute() which is
currently a private function but the expectation is that it
will be exported perhaps as early as Linux 3.11.
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1384
Until these hooks are fully implemented return the expected
-EOPNOTSUPP error to indicate they are not functional. This
allows test suites such as xfstests to cleanly skip testing
this functionality until it's implemented.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #229
Ensure the value is cast to a 'long long' for printing purposes. The
expectation is that ASSERT0/VERIFY0 are mostly used for validating
return values and thus may commonly be negative.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #246
The non-blocking allocation handlers in nvlist_alloc() would be
mistakenly assigned if any flags other than KM_SLEEP were passed.
This meant that nvlists allocated with KM_PUSHPUSH or other KM_*
debug flags were effectively always using atomic allocations.
While these failures were unlikely it could lead to assertions
because KM_PUSHPAGE allocations in particular are guaranteed to
succeed or block. They must never fail.
Since the existing API does not allow us to pass allocation
flags to the private allocators the cleanest thing to do is to
add a KM_PUSHPAGE allocator.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closeszfsonlinux/spl#249
3805 arc shouldn't cache freed blocks
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Richard Elling <richard.elling@dey-sys.com>
Reviewed by: Will Andrews <will@firepipe.net>
Approved by: Dan McDonald <danmcd@nexenta.com>
References:
illumos/illumos-gate@6e6d5868f5https://www.illumos.org/issues/3805
ZFS should proactively evict freed blocks from the cache.
On dcenter, we saw that we were caching ~256GB of metadata, while the
pool only had <4GB of metadata on disk. We were wasting about half the
system's RAM (252GB) on blocks that have been freed.
Even though these freed blocks will never be used again, and thus will
eventually be evicted, this causes us to use memory inefficiently for 2
reasons:
1. A block that is freed has no chance of being accessed again, but will
be kept in memory preferentially to a block that was accessed before it
(and is thus older) but has not been freed and thus has at least some
chance of being accessed again.
2. We partition the ARC into several buckets:
user data that has been accessed only once (MRU)
metadata that has been accessed only once (MRU)
user data that has been accessed more than once (MFU)
metadata that has been accessed more than once (MFU)
The user data vs metadata split is somewhat arbitrary, and the primary
control on how much memory is used to cache data vs metadata is to
simply try to keep the proportion the same as it has been in the past
(each bucket "evicts against" itself). The secondary control is to
evict data before evicting metadata.
Because of this bucketing, we may end up with one bucket mostly
containing freed blocks that are very old, while another bucket has more
recently accessed, still-allocated blocks. Data in the useful bucket
(with still-allocated blocks) may be evicted in preference to data in
the useless bucket (with old, freed blocks).
On dcenter, we saw that the MFU metadata bucket was 230MB, while the MFU
data bucket was 27GB and the MRU metadata bucket was 256GB. However,
the vast majority of data in the MRU metadata bucket (256GB) was freed
blocks, and thus useless. Meanwhile, the MFU metadata bucket (230MB)
was constantly evicting useful blocks that will be soon needed.
The problem of cache segmentation is a larger problem that needs more
investigation. However, if we stop caching freed blocks, it should
reduce the impact of this more fundamental issue.
Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1503
The definition of zfs_vdev_holder casts VDEV_HOLDER into a function pointer
passing to linux kernel's block layer function blkdev_get_by_path.
However current VDEV_HOLDER is defined to be wider than 32 bits and the compiler
warns about potential overflows. Instead of specifying different values for 32-bit and
64-bit systems using ifdefs, choose the common factor 32-bit addresses.
Redefine VDEV_HOLDER to 0x2401de7("zholder") here.
Signed-off-by: Ying Zhu <casualfisher@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1520
The Illumos code introduced the ASSERT0 and VERIFY0 macros which
are to be used instead of ASSERT3S(x, ==, 0) and VERIFY3S(x, ==, 0).
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Madhav Suresh <madhav.suresh@delphix.com>
Closes#246
The vn_rdwr() function performs I/O by calling the vfs_write() or
vfs_read() functions. These functions reside just below the system
call layer and the expectation is they have almost the entire 8k of
stack space to work with. In fact, certain layered configurations
such as ext+lvm+md+multipath require the majority of this stack to
avoid stack overflows.
To avoid this posibility the vn_rdwr() call in dump_bytes() has been
moved to the ZIO_TYPE_FREE, taskq. This ensures that all I/O will be
performed with the majority of the stack space available. This ends
up being very similiar to as if the I/O were issued via sys_write()
or sys_read().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1399Closes#1423
3581 spa_zio_taskq[ZIO_TYPE_FREE][ZIO_TASKQ_ISSUE]->tq_lock is piping hot
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
illumos/illumos-gate@ec94d32https://illumos.org/issues/3581
Notes for Linux port:
Earlier commit 08d08eb reduced contention on this taskq lock by simply
reducing the number of z_fr_iss threads from 100 to one-per-CPU. We
also optimized the taskq implementation in zfsonlinux/spl@3c6ed54.
These changes significantly improved unlink performance to acceptable
levels.
This patch further reduces time spent spinning on this lock by
randomly dispatching the work items over multiple independent task
queues. The Illumos ZFS developers stated that this lock contention
only arose after "3329 spa_sync() spends 10-20% of its time in
spa_free_sync_cb()" was landed. It's not clear if 3329 affects the
Linux port or not. I didn't see spa_free_sync_cb() show up in
oprofile sessions while unlinking large files, but I may just not
have used the right test case.
I tested unlinking a 1 TB of data with and without the patch and
didn't observe a meaningful difference in elapsed time. However,
oprofile showed that the percent time spent in taskq_thread() was
reduced from about 16% to about 5%. Aside from a possible slight
performance benefit this may be worth landing if only for the sake of
maintaining consistency with upstream.
Ported-by: Ned Bass <bass6@llnl.gov>
Closes#1327
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
NOTES: This patch has been reworked from the original in the
following ways to accomidate Linux ZFS implementation
*) Usage of the cyclic interface was replaced by the delayed taskq
interface. This avoids the need to implement new compatibility
code and allows us to rely on the existing taskq implementation.
*) An extern for zfs_txg_synctime_ms was added to sys/dsl_pool.h
because declaring externs in source files as was done in the
original patch is just plain wrong.
*) Instead of panicing the system when the deadman triggers a
zevent describing the blocked vdev and the first pending I/O
is posted. If the panic behavior is desired Linux provides
other generic methods to panic the system when threads are
observed to hang.
*) For reference, to delay zios by 30 seconds for testing you can
use zinject as follows: 'zinject -d <vdev> -D30 <pool>'
References:
illumos/illumos-gate@283b84606bhttps://www.illumos.org/issues/3246
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1396
Somewhat amazingly it went unnoticed that the delay() function
doesn't actually cause the task to block. Since the task state
is never changed from TASK_RUNNING before schedule_timeout() the
scheduler allows to task to continue running without any delay.
Using schedule_timeout_interruptible() resolves the issue by
correctly setting TASK_UNINTERRUPTIBLE.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Add wrappers for the Solaris MSEC_TO_TICK, USEC_TO_TICK, and
NSEC_TO_TICK conversion functions. They are mapped directly to
their Linux counterparts with the exception of NSEC_TO_TICK
can cannot use usecs_to_jiffies() because it is not exported
by the kernel.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Install the common spl kernel development headers under
/usr/src/spl-<version>/ rather than in a kernel specific
directory. The kernel specific build products such as
spl_config.h and Modules.symvers are left installed under
/usr/src/spl-<version>/<kernel>.
This was done to be consistent with where dkms expects
kernel module source to be packaged. It also allows for
a common spl-kmod-devel package which includes the headers,
and per-kernel spl-kmod-devel-<kernel> packages.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>