Enable zfs_getpage, zfs_fillpage, zfs_putpage, zfs_putapage functions.
The functions have been modified to make them Linux friendly.
ZFS uses these functions to read/write the mmapped pages. Using them
from readpage/writepage results in cleaner code. The patch also adds
readpages and writepages interface functions to read/write a list of
pages in one function call.
The code change handles the first mmap optimization mentioned on
https://github.com/behlendorf/zfs/issues/225
Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com>
Signed-off-by: Brian Behlendorf <behlendorf@llnl.gov>
Issue #255
According to Linux kernel commit 2c27c65e, using truncate_setsize in
setattr simplifies the code. Therefore, the patch replaces the call
to vmtruncate() with truncate_setsize().
zfs_setattr uses zfs_freesp to free the disk space belonging to the
file. As truncate_setsize may release the page cache and flush
the dirty data to disk, it must be called before zfs_freesp.
Suggested-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com>
Closes #255
The inode eviction should unmap the pages associated with the inode.
These pages should also be flushed to disk to avoid data loss.
Therefore, use truncate_setsize() in evict_inode() to release the
pagecache.
The API truncate_setsize() was added in the 2.6.35 kernel. To ensure
compatibility with older kernels, the patch defines its own
truncate_setsize function.
Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com>
Closes #255
To accommodate the updated Linux 3.0 shrinker API the spl
shrinker compatibility code was updated. Unfortunately, this
couldn't be done cleanly without slightly adjusting the compat
API. See spl commit a55bcaad18.
This commit updates the ZFS code to use the slightly modified
API. You must use the latest SPL if you're building ZFS.
The problem here is that prune_icache() tries to evict/delete
both the xattr directory inode as well as at least one xattr
inode contained in that directory. Here's what happens:
1. File is created.
2. xattr is created for that file (behind the scenes an xattr
directory and a file in that xattr directory are created)
3. File is deleted.
4. Both the xattr directory inode and at least one xattr
inode from that directory are evicted by prune_icache();
prune_icache() acquires a lock on both inodes before it
calls ->evict() on the inodes
When the xattr directory inode is evicted zfs_zinactive attempts
to delete the xattr files contained in that directory. While
enumerating these files zfs_zget() is called to obtain a reference
to the xattr file znode - which tries to lock the xattr inode.
However that very same xattr inode was already locked by
prune_icache() further up the call stack, thus leading to a
deadlock.
This can be reliably reproduced like this:
$ touch test
$ attr -s a -V b test
$ rm test
$ echo 3 > /proc/sys/vm/drop_caches
This patch fixes the deadlock by moving the zfs_purgedir() call to
zfs_unlinked_drain(). Instead zfs_rmnode() now checks whether the
xattr dir is empty and leaves the xattr dir in the unlinked set if
it finds any xattrs.
To ensure zfs_unlinked_drain() never accesses a stale super block
zfsvfs_teardown() has been updated to block until the iput taskq
has been drained. This avoids a potential race where a file with
an xattr directory is removed and the file system is immediately
unmounted.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #266
iput_final() already calls zpl_inode_destroy() -> zfs_inode_destroy()
for us after zfs_zinactive(), thus making sure that the inode is
properly cleaned up.
The zfs_inode_destroy() calls in zfs_rmnode() would lead to a
double-free.
Fixes #282
The WRITE_FLUSH, WRITE_FUA, and WRITE_FLUSH_FUA flags have been
introduced as a replacement for WRITE_BARRIER. This was done
to allow richer semantics to be expressed to the block layer.
It is the block layer's responsibility to choose the correct way
to implement these semantics.
This change simply updates the bios to use the new kernel API
which should be absolutely safe. However, since ZFS depends
entirely on this working as designed for correctness we do
want to be careful.
Closes #281
Stack usage for ddt_class_contains() reduced from 524 bytes to 68
bytes. This large stack allocation significantly contributed to
the likelihood of a stack overflow when scrubbing/resilvering
dedup pools.
Stack usage for ddt_zap_lookup() reduced from 368 bytes to 120
bytes. This large stack allocation significantly contributed to
the likelihood of a stack overflow when scrubbing/resilvering
dedup pools.
This abomination is no longer required because the zio's issued
during this recursive call path will now be handled asynchronously
by the taskq thread pool.
This reverts commit 6656bf5621.
The majority of the recursive operations performed by the dsl
are done either in the context of the txg_sync_thread or during
pool import. It is these recursive operations which contribute
greatly to the stack depth. When this recursion is coupled with
a synchronous I/O in the same context, overflow becomes possible.
Previously, to handle this case, I focused on keeping the
individual stack frames as light as possible. This is a good
idea as long as it can be done in a way which doesn't overly
complicate the code. However, there is a better solution.
If we treat all zio's issued by the txg_sync_thread as async then
we can use the txg_sync_thread stack for the recursive parts, and
the zio_* threads for the I/O parts. This effectively doubles our
available stack space with the only drawback being a small delay
to schedule the I/O. However, in practice the scheduling time
is so much smaller than the actual I/O time this isn't an issue.
Another benefit of making the zio async is that the zio pipeline
is now parallel. This should improve performance for CPU intensive
pipelines such as compression or dedup.
With this change in place the worst case stack usage observed so
far is 6902 bytes. This is still higher than I'd like but
significantly improved. Additional changes to specific functions
should improve this further. This change allows us to revert
commit 6656bf5 which did some horrible things to the recursive
traverse_visitbp() callpath in the name of saving stack.
Yesterday I ran across a 3TB drive which exposed 4K sectors to
Linux. While I thought I had gotten this support correct it
turns out there were 2 subtle bugs which prevented it from
working.
sudo ./cmd/zpool/zpool create -f large-sector /dev/sda
cannot create 'large-sector': one or more devices is currently unavailable
1) The first issue was that it was possible that bdev_capacity()
would return the number of 512 byte sectors rather than the number
of 4096 byte sectors. Internally, certain Linux functions only operate
with 512 byte sectors so you need to be careful. To avoid any
confusion in the future I've updated bdev_capacity() to simply
return the device (or partition) capacity in bytes. The higher
levels of ZFS want the value in bytes anyway so this is cleaner.
2) When creating a bio the ->bi_sector count must always be
expressed in 512 byte sectors. The existing code would scale
the byte offset by the logical sector size. Until now this was
always 512 so it never caused problems. Trying a 4K sector drive
clearly exposed the issue. The problem has been fixed by
hard coding the 512 byte sector which is exactly what the bio
code does internally.
With these changes I'm now able to create ZFS pools using 4K
sector drives. No issues were observed during fairly extensive
testing. This is also a low risk change if you're using 512b
sector devices because none of the logic changes.
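The two unit conversions described above can be sketched in user space
C. This is only an illustration: the helper names and the
SECTOR_SHIFT_512 constant are assumptions made for this example, not the
identifiers used by the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * The block layer counts in 512 byte units regardless of the logical
 * sector size the device reports. (Constant name is hypothetical.)
 */
#define	SECTOR_SHIFT_512	9

/*
 * Hypothetical helper: convert a count of 512 byte sectors to bytes,
 * the unit the higher levels of ZFS want.
 */
uint64_t
capacity_bytes(uint64_t sectors_512)
{
	return (sectors_512 << SECTOR_SHIFT_512);
}

/*
 * Hypothetical helper: the starting sector of a bio is always the
 * byte offset expressed in 512 byte units.
 */
uint64_t
bio_start_sector(uint64_t byte_offset)
{
	return (byte_offset >> SECTOR_SHIFT_512);
}

/*
 * The problematic form, for comparison: scaling the byte offset by
 * the logical sector size. Correct only when that size is 512.
 */
uint64_t
bio_start_sector_buggy(uint64_t byte_offset, uint64_t logical_sector_size)
{
	return (byte_offset / logical_sector_size);
}
```

For a 1 MiB offset the fixed helper yields sector 2048 on any device,
while the old scaling yields 256 on a 4K sector drive, which is 8 times
too small.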
Closes #256
The default buffer size when requesting multiple quota entries
is 100 times the zfs_useracct_t size. In practice this works out
to exactly 27200 bytes. Since this will be a short lived buffer
in a non-performance critical path it is preferable to vmem_alloc()
the needed memory.
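The arithmetic behind the 27200 byte figure can be checked with a small
sketch. The structure layout below is an assumption chosen to be
consistent with the size quoted above (272 bytes per entry); the
QUOTA_ENTRIES name and the helper are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout of one quota entry: 256 + 4 + 4 + 8 = 272 bytes. */
typedef struct zfs_useracct {
	char		zu_domain[256];
	uint32_t	zu_rid;
	uint32_t	zu_pad;
	uint64_t	zu_space;
} zfs_useracct_t;

/* Default number of entries requested at once (hypothetical name). */
#define	QUOTA_ENTRIES	100

/* Size of the short lived buffer that is better served by vmem_alloc(). */
size_t
quota_buffer_size(void)
{
	return (QUOTA_ENTRIES * sizeof (zfs_useracct_t));
}
```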
Initially when zfsdev_ioctl() was ported to Linux we didn't have
any credential support implemented. So at the time we simply
passed NULL which wasn't much of a problem since most of the
secpolicy code was disabled.
However, one exception is quota handling which does require the
credential. Now that proper credentials are supported we can
safely start passing the callers credential. This is also an
initial step towards fully implementing the zfs secpolicy.
Normally when the arc_shrinker_func() function is called the return
value should be:
>=0 - To indicate the number of freeable objects in the cache, or
-1 - To indicate this cache should be skipped
However, when the shrinker callback is called with 'nr_to_scan' equal
to zero, the caller simply wants the number of freeable objects in
the cache and we must never return -1. This patch reorders the
first two conditionals in arc_shrinker_func() to ensure this behavior.
This patch also now explicitly casts arc_size and arc_c_min to signed
int64_t types so MAX(x, 0) works as expected. As unsigned types
we would never see a negative value which defeated the purpose of
the MAX() lower bound and broke the shrinker logic.
Finally, when nr_to_scan is non-zero we explicitly prevent all reclaim
below arc_c_min. This is done to prevent the Linux page cache from
completely crowding out the ARC. This limit is tunable and some
experimentation is likely going to be required to set it exactly right.
For now we're sticking with the OpenSolaris defaults.
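Both points can be illustrated with a user space sketch. The function
names and the 'skip' flag are hypothetical; 'skip' stands in for
whatever condition makes the real callback return -1, and the reclaim
itself is elided.

```c
#include <assert.h>
#include <stdint.h>

#define	MAX(a, b)	((a) > (b) ? (a) : (b))

/* Unsigned subtraction wraps, so the MAX() lower bound never applies. */
uint64_t
freeable_unsigned(uint64_t arc_size, uint64_t arc_c_min)
{
	return (MAX(arc_size - arc_c_min, 0));
}

/* Casting to a signed type first lets MAX(x, 0) clamp at zero. */
int64_t
freeable_signed(uint64_t arc_size, uint64_t arc_c_min)
{
	return (MAX((int64_t)arc_size - (int64_t)arc_c_min, 0));
}

/*
 * Model of the conditional ordering: a call with nr_to_scan == 0 is
 * only a query and must be answered with a count, never with -1.
 */
int64_t
shrinker_model(int nr_to_scan, int skip, uint64_t arc_size,
    uint64_t arc_c_min)
{
	int64_t freeable = freeable_signed(arc_size, arc_c_min);

	if (nr_to_scan == 0)
		return (freeable);

	if (skip)
		return (-1);

	/* Actual reclaim is elided in this sketch. */
	return (freeable);
}
```

With arc_size below arc_c_min the unsigned variant reports a huge
positive count while the signed variant reports zero.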
Closes #218
Closes #243
The comment in zfs_close() pertaining to decrementing the synchronous
open count needs to be updated for Linux. The code was already
updated to be correct, but the comment was missed and is now misleading.
Under Linux the zfs_close() hook is only called once when the final
reference is dropped. This differs from Solaris where zfs_close()
is called for each close.
Closes #237
Update the handling of named pipes and sockets to be consistent with
other platforms with regard to the rdev attribute. While all ZFS
implementations store the rdev for device files in a system attribute
(SA), this is not the case for FIFOs and sockets. Indeed, Linux always
passes rdev=0 to mknod() for FIFOs and sockets, so the value is not
needed. Add an ASSERT that rdev==0 for FIFOs and sockets to detect if
the expected behavior ever changes.
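The invariant being asserted can be sketched in user space with the
standard file type macros. The function names are hypothetical and the
commit does not say where the real ASSERT lives, so this only
illustrates the condition itself.

```c
/* Ask for the XSI file type macros; harmless if already defined. */
#ifndef _XOPEN_SOURCE
#define	_XOPEN_SOURCE	700
#endif

#include <assert.h>
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Hypothetical predicate: FIFOs and sockets are expected to arrive
 * with rdev == 0, any other file type is unconstrained here.
 */
int
rdev_is_expected(mode_t mode, dev_t rdev)
{
	if (S_ISFIFO(mode) || S_ISSOCK(mode))
		return (rdev == 0);

	return (1);
}

/* Only device files need their rdev stored in a system attribute. */
int
rdev_is_stored(mode_t mode)
{
	return (S_ISCHR(mode) || S_ISBLK(mode));
}
```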
Closes #216
The direct reclaim path in the z_wr_* threads must be disabled
to ensure forward progress is always maintained for txg processing.
This ensures that a txg will never get stuck waiting on itself
because it entered the following memory reclaim callpath.
->prune_icache()->dispose_list()->zpl_clear_inode()->zfs_inactive()
->dmu_tx_assign()->dmu_tx_wait()->txg_wait_open()
It would be preferable to target this exact code path but the
kernel offers no way to do this without custom patches. To avoid
this we are forced to disable all reclaim for these threads. It
should not be necessary to do this for other z_* threads
because they will not hold a txg open.
Closes #232
How nfsd handles .fsync() has been changed a couple of times in the
recent kernels. But basically there are three cases we need to
consider.
Linux 2.6.12 - 2.6.33
* The .fsync() hook takes 3 arguments
* The nfsd will call .fsync() with a NULL file struct pointer.
Linux 2.6.34
* The .fsync() hook takes 3 arguments
* The nfsd no longer calls .fsync() but instead uses sync_inode()
Linux 2.6.35 - 2.6.x
* The .fsync() hook takes 2 arguments
* The nfsd no longer calls .fsync() but instead uses sync_inode()
For once it looks like we've gotten lucky. The first two cases can
actually be collapsed into one if we stop using the file struct
pointer entirely. Since the dentry is still passed in both cases
this is possible. The last case can then be safely handled by
unconditionally using the dentry in the file struct pointer now
that we know the nfsd caller has been removed.
Closes #230
The default buffer size when requesting history is 128k. This
is far too large for a kmem_alloc() so instead use the slower
vmem_alloc(). This path has no performance concerns and the
buffer is immediately free'd after its contents are copied to
the user space buffer.
This commit adds module options for all existing zfs tunables.
Ideally the average user should never need to modify any of these
values. However, in practice sometimes you do need to tweak these
values for one reason or another. In those cases it's nice not to
have to resort to rebuilding from source. All tunables are visible
to modinfo and the list is as follows:
$ modinfo module/zfs/zfs.ko
filename: module/zfs/zfs.ko
license: CDDL
author: Sun Microsystems/Oracle, Lawrence Livermore National Laboratory
description: ZFS
srcversion: 8EAB1D71DACE05B5AA61567
depends: spl,znvpair,zcommon,zunicode,zavl
vermagic: 2.6.32-131.0.5.el6.x86_64 SMP mod_unload modversions
parm: zvol_major:Major number for zvol device (uint)
parm: zvol_threads:Number of threads for zvol device (uint)
parm: zio_injection_enabled:Enable fault injection (int)
parm: zio_bulk_flags:Additional flags to pass to bulk buffers (int)
parm: zio_delay_max:Max zio millisec delay before posting event (int)
parm: zio_requeue_io_start_cut_in_line:Prioritize requeued I/O (bool)
parm: zil_replay_disable:Disable intent logging replay (int)
parm: zfs_nocacheflush:Disable cache flushes (bool)
parm: zfs_read_chunk_size:Bytes to read per chunk (long)
parm: zfs_vdev_max_pending:Max pending per-vdev I/Os (int)
parm: zfs_vdev_min_pending:Min pending per-vdev I/Os (int)
parm: zfs_vdev_aggregation_limit:Max vdev I/O aggregation size (int)
parm: zfs_vdev_time_shift:Deadline time shift for vdev I/O (int)
parm: zfs_vdev_ramp_rate:Exponential I/O issue ramp-up rate (int)
parm: zfs_vdev_read_gap_limit:Aggregate read I/O over gap (int)
parm: zfs_vdev_write_gap_limit:Aggregate write I/O over gap (int)
parm: zfs_vdev_scheduler:I/O scheduler (charp)
parm: zfs_vdev_cache_max:Inflate reads small than max (int)
parm: zfs_vdev_cache_size:Total size of the per-disk cache (int)
parm: zfs_vdev_cache_bshift:Shift size to inflate reads too (int)
parm: zfs_scrub_limit:Max scrub/resilver I/O per leaf vdev (int)
parm: zfs_recover:Set to attempt to recover from fatal errors (int)
parm: spa_config_path:SPA config file (/etc/zfs/zpool.cache) (charp)
parm: zfs_zevent_len_max:Max event queue length (int)
parm: zfs_zevent_cols:Max event column width (int)
parm: zfs_zevent_console:Log events to the console (int)
parm: zfs_top_maxinflight:Max I/Os per top-level (int)
parm: zfs_resilver_delay:Number of ticks to delay resilver (int)
parm: zfs_scrub_delay:Number of ticks to delay scrub (int)
parm: zfs_scan_idle:Idle window in clock ticks (int)
parm: zfs_scan_min_time_ms:Min millisecs to scrub per txg (int)
parm: zfs_free_min_time_ms:Min millisecs to free per txg (int)
parm: zfs_resilver_min_time_ms:Min millisecs to resilver per txg (int)
parm: zfs_no_scrub_io:Set to disable scrub I/O (bool)
parm: zfs_no_scrub_prefetch:Set to disable scrub prefetching (bool)
parm: zfs_txg_timeout:Max seconds worth of delta per txg (int)
parm: zfs_no_write_throttle:Disable write throttling (int)
parm: zfs_write_limit_shift:log2(fraction of memory) per txg (int)
parm: zfs_txg_synctime_ms:Target milliseconds between tgx sync (int)
parm: zfs_write_limit_min:Min tgx write limit (ulong)
parm: zfs_write_limit_max:Max tgx write limit (ulong)
parm: zfs_write_limit_inflated:Inflated tgx write limit (ulong)
parm: zfs_write_limit_override:Override tgx write limit (ulong)
parm: zfs_prefetch_disable:Disable all ZFS prefetching (int)
parm: zfetch_max_streams:Max number of streams per zfetch (uint)
parm: zfetch_min_sec_reap:Min time before stream reclaim (uint)
parm: zfetch_block_cap:Max number of blocks to fetch at a time (uint)
parm: zfetch_array_rd_sz:Number of bytes in a array_read (ulong)
parm: zfs_pd_blks_max:Max number of blocks to prefetch (int)
parm: zfs_dedup_prefetch:Enable prefetching dedup-ed blks (int)
parm: zfs_arc_min:Min arc size (ulong)
parm: zfs_arc_max:Max arc size (ulong)
parm: zfs_arc_meta_limit:Meta limit for arc size (ulong)
parm: zfs_arc_reduce_dnlc_percent:Meta reclaim percentage (int)
parm: zfs_arc_grow_retry:Seconds before growing arc size (int)
parm: zfs_arc_shrink_shift:log2(fraction of arc to reclaim) (int)
parm: zfs_arc_p_min_shift:arc_c shift to calc min/max arc_p (int)
When a new znode/inode pair is created both the znode and the inode
should be immediately updated to the correct values. This was done
for the znode and for most of the values in the inode, but not all
of them. This normally wasn't a problem because most subsequent
operations would cause the inode to be immediately updated. This
change ensures the inode is now fully updated before it is inserted
in to the inode hash.
Closes #116
Closes #146
Closes #164
This change fixes a kernel panic which would occur when resizing
a dataset which was not open. The objset_t stored in the
zvol_state_t will be set to NULL when the block device is closed.
To avoid this issue we pass the correct objset_t as the third arg.
The code has also been updated to correctly notify the kernel
when the block device capacity changes. For 2.6.28 and newer
kernels the capacity change will be immediately detected. For
earlier kernels the capacity change will be detected when the
device is next opened. This is a known limitation of older
kernels.
Online ext3 resize test case passes on 2.6.28+ kernels:
$ dd if=/dev/zero of=/tmp/zvol bs=1M count=1 seek=1023
$ zpool create tank /tmp/zvol
$ zfs create -V 500M tank/zd0
$ mkfs.ext3 /dev/zd0
$ mkdir /mnt/zd0
$ mount /dev/zd0 /mnt/zd0
$ df -h /mnt/zd0
$ zfs set volsize=800M tank/zd0
$ resize2fs /dev/zd0
$ df -h /mnt/zd0
Original-patch-by: Fajar A. Nugraha <github@fajar.net>
Closes #68
Closes #84
The vdev_metaslab_init() function has been observed to allocate
larger than 8k chunks. However, they are not much larger than 8k
and it does this infrequently so it is allowed and the warning is
suppressed.
The dsl_scan_visit() function is a little heavy weight taking 464
bytes on the stack. This can be easily reduced for little cost by
moving zap_cursor_t and zap_attribute_t off the stack and on to the
heap. After this change dsl_scan_visit() has been reduced in size
by 320 bytes.
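The general technique of moving large scratch structures off the stack
can be sketched as follows. The cursor_t and attribute_t types are
stand-ins for zap_cursor_t and zap_attribute_t with made-up sizes, and
malloc()/free() replace the kernel allocator, so this is a model of the
approach rather than the actual change.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stand-ins for the real structures; the sizes are illustrative. */
typedef struct cursor {
	char	c_state[64];
} cursor_t;

typedef struct attribute {
	char	a_name[256];
} attribute_t;

/*
 * Only two pointers live in this stack frame. The bulky structures
 * are allocated on the heap and released before returning.
 * Returns 0 on success, -1 if an allocation failed.
 */
int
visit_with_heap_scratch(const char *name, size_t *len_out)
{
	cursor_t *zc = malloc(sizeof (*zc));
	attribute_t *za = malloc(sizeof (*za));
	int err = -1;

	if (zc != NULL && za != NULL) {
		memset(zc, 0, sizeof (*zc));
		(void) snprintf(za->a_name, sizeof (za->a_name), "%s", name);
		*len_out = strlen(za->a_name);
		err = 0;
	}

	/* free(NULL) is a no-op, so both calls are always safe. */
	free(za);
	free(zc);

	return (err);
}
```

The cost is an allocation per call, which is the small price the text
above refers to.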
This change was made to reduce stack usage in the dsl_scan_sync()
callpath which is recursive and has been observed to overflow the
stack.
Issue #174
This function is called recursively so everything possible must be
done to limit its stack consumption. The dprintf_bp() debugging
function adds 30 bytes of local variables to the function we cannot
afford. By commenting out this debugging we save 30 bytes per
recursion and depths of 13 are not uncommon. This yields a total
stack saving of 390 bytes on our 8k stack.
Issue #174
The recursive call chain dsl_scan_visitbp() -> dsl_scan_recurse() ->
dsl_scan_visitdnode() -> dsl_scan_visitbp() has been observed to consume
considerable stack resulting in a stack overflow (>8k). The cleanest
way I see to fix this with minimal impact to the existing flow of
code, and with the fewest performance concerns, is to always inline
dsl_scan_recurse() and dsl_scan_visitdnode(). While this will increase
the function size of dsl_scan_visitbp(), by 4660 bytes, it also reduces
the stack requirements by removing the function call overhead.
Issue #174
It's possible for a zvol_write thread to enter direct memory reclaim
while holding open a transaction group. This results in the system
attempting to write out data to the disk to free memory. Unfortunately,
this can't succeed because the thread doing reclaim is holding open
the txg which must be closed to be synced to disk. To prevent this
the offending allocation is marked KM_PUSHPAGE which will prevent it
from attempting writeback.
Closes #191
Occasionally we would see an -EFAULT returned when setting the
I/O scheduler on a vdev. This was caused by an improperly formatted
user mode helper command.
This commit restructures the command to something simpler, allocates
space for it dynamically to save stack, and removes the retry logic
which is no longer needed.
Closes #169
This change ensures the ARC meta-data limits are enforced. Without
this enforcement meta-data can grow to consume all of the ARC cache
pushing out data and hurting performance. The cache is aggressively
reclaimed but this is a soft and not a hard limit. The cache may
exceed the set limit briefly before being brought under control.
By default 25% of the ARC capacity can be used for meta-data. This
limit can be tuned by setting the 'zfs_arc_meta_limit' module option.
Once this limit is exceeded meta-data reclaim will occur in 3 percent
chunks, or may be tuned using 'arc_reduce_dnlc_percent'.
Closes #193
Fixed a bug where zfs_zget could access a stale znode pointer when
the inode had already been removed from the inode cache via iput ->
iput_final -> ... -> zfs_zinactive but the corresponding SA handle
was still alive.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #180
As part of zfs_ioc_recv() a zfs_cmd_t is allocated in the kernel
which is 17808 bytes in size. This sort of thing in general should
be avoided. However, since this should be an infrequent event for
now we allow it and simply suppress the warning with the KM_NODEBUG
flag. This can be revisited later if/when it becomes an issue.
Closes #178
If the attribute's new value was shorter than the old one the old
code would leave parts of the old value in the xattr znode.
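The failure mode can be modeled in user space. The xattr_store_t type
and both functions are hypothetical stand-ins for the xattr znode and
its update path; the point is only that the stored size must always
become the new size.

```c
#include <assert.h>
#include <string.h>

#define	XATTR_MAX	64

/* Stand-in for the data held by an xattr znode. */
typedef struct xattr_store {
	char	xs_data[XATTR_MAX];
	size_t	xs_size;
} xattr_store_t;

/* Old behavior: overwrite in place but never shrink the stored size. */
int
xattr_set_buggy(xattr_store_t *xs, const void *value, size_t size)
{
	if (size > XATTR_MAX)
		return (-1);

	memcpy(xs->xs_data, value, size);
	if (size > xs->xs_size)
		xs->xs_size = size;

	return (0);
}

/* Fixed behavior: the stored size always becomes the new size. */
int
xattr_set(xattr_store_t *xs, const void *value, size_t size)
{
	if (size > XATTR_MAX)
		return (-1);

	memcpy(xs->xs_data, value, size);
	xs->xs_size = size;

	return (0);
}
```

Writing "longvalue" and then "abc" through the old path leaves
"abcgvalue" visible, while the fixed path leaves just "abc".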
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #203
Without this we may mistakenly believe we have a dentry and try to
d_instantiate() it. This will result in the following BUG. It's
important to note that while the xattr directory has an inode
associated with it we never create a dentry for it.
kernel BUG at fs/dcache.c:1418!
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #202
When compiling ZFS in user space gcc-4.6.0 correctly identifies
the variable 'os' as being set but never used. This generates a
warning and a build failure when using --enable-debug. However,
the code is correct; we only want to use 'os' for the kernel space
builds. To suppress the warning the call was wrapped with a
VERIFY() which has the nice side effect of ensuring the 'os'
actually never is NULL. This was observed under Fedora 15.
module/zfs/dsl_pool.c: In function ‘dsl_pool_create’:
module/zfs/dsl_pool.c:229:12: error: variable ‘os’ set but not used
[-Werror=unused-but-set-variable]
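The pattern can be sketched in user space as follows. The VERIFY macro
here is a simplified stand-in for the real one, and create_objset() and
pool_create_model() are hypothetical names; only the shape of the
wrapped assignment is taken from the description above.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Simplified VERIFY: always evaluated, aborts if the check fails. */
#define	VERIFY(cond)							\
	do {								\
		if (!(cond)) {						\
			(void) fprintf(stderr,				\
			    "VERIFY(%s) failed\n", #cond);		\
			abort();					\
		}							\
	} while (0)

static int objset_storage;

/* Hypothetical stand-in for the call which creates the object set. */
int *
create_objset(void)
{
	return (&objset_storage);
}

int
pool_create_model(void)
{
	int *os;

	/*
	 * The result of the assignment feeds the comparison, so 'os' is
	 * no longer only set, and a NULL return is caught as well.
	 */
	VERIFY(NULL != (os = create_objset()));

#ifdef _KERNEL
	/* Only kernel space builds go on to use 'os'. */
	*os = 1;
#endif

	return (0);
}
```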
Update code to use the spl_invalidate_inodes() wrapper. This hides
some of the complexity of determining if invalidate_inodes() was
exported, and if so what is its prototype. The second argument
of spl_invalidate_inodes() determines how dirty
inodes are handled. By passing a zero we are indicating that we
want those inodes to be treated as busy and skipped.
The .sync_fs fix as applied did not use the updated SPL credential
API. This broke builds on Debian Lenny; this change applies the
needed fix to use the portable API. The original credential changes
are part of commit 81e97e2187.
Disable the normal reclaim path for the txg_sync thread. This
ensures the thread will never enter dmu_tx_assign() which can
otherwise occur due to direct reclaim. If this is allowed to
happen the system can deadlock. Direct reclaim call path:
->shrink_icache_memory->prune_icache->dispose_list->
clear_inode->zpl_clear_inode->zfs_inactive->dmu_tx_assign
Under OpenSolaris all memory reclaim is done asynchronously. Under
Linux memory reclaim is done asynchronously _and_ synchronously.
When a process allocates memory with GFP_KERNEL it explicitly allows
the kernel to do reclaim on its behalf to satisfy the allocation.
If that GFP_KERNEL allocation fails the kernel may take more drastic
measures to reclaim the memory such as killing user space processes.
This was observed to happen with ZFS because the ARC could consume
a large fraction of the system memory but no synchronous reclaim
could be performed on it. The result was GFP_KERNEL allocations
could fail resulting in OOM events, and only moments later the
arc_reclaim thread would free unused memory from the ARC.
This change leaves the arc_thread in place to manage the fundamental
ARC behavior. But it adds a synchronous (direct) reclaim path for
the ARC which can be called when memory is badly needed. It also
adds an asynchronous (indirect) reclaim path which is called
much more frequently to prune the ARC slab caches.
The following useful values were missing from the arcstats. This change
adds them to provide greater visibility into the ARC's behavior.
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_meta_used 4 624774592
arc_meta_limit 4 400785408
arc_meta_max 4 625594176
Under Linux a dentry referencing an inode must be instantiated before
the inode is unlocked. To accomplish this without overly modifying
the core ZFS code the dentry is passed via the vattr_t. There are
cases, such as replay, when a dentry is not available. In that case
it is obviously not initialized at inode creation time; if a dentry
is needed it will be spliced in when required via d_lookup().
Kernel threads which sleep uninterruptibly on Linux are marked in the (D)
state. These threads are usually in the process of performing IO and are
thus counted against the load average. The txg_quiesce and txg_sync threads
were always sleeping uninterruptibly and thus inflating the load average.
This change makes them sleep interruptibly. Some care is required however
because these threads may now be woken early by signals. In this case the
callers are all careful to check that the required conditions are met after
waking up. If we're woken early due to a signal they will simply go back
to sleep. In this case these changes are safe.
Closes #175
The .freeze_fs/.unfreeze_fs hooks were not added until Linux 2.6.29.
Since these hooks are currently unused they are being removed to
allow support of older kernels.
As of Linux 2.6.29 a clean credential API was added to the Linux kernel.
Previously the credential was embedded in the task_struct. Because the
SPL already has considerable support for handling this API change the
ZPL code has been updated to use the Solaris credential API.
Now that KM_SLEEP is not defined as GFP_NOFS there is the possibility
of synchronous reclaim deadlocks. These deadlocks never existed in the
original OpenSolaris code because all memory reclaim on Solaris is done
asynchronously. Linux does both synchronous (direct) and asynchronous
(indirect) reclaim.
This commit addresses a deadlock caused by inode eviction. A KM_SLEEP
allocation may trigger direct memory reclaim and shrink the inode cache.
This can occur while a mutex in the array of ZFS_OBJ_HOLD mutexes is
held. Through the ->shrink_icache_memory()->evict()->zfs_inactive()->
zfs_zinactive() call path the same mutex may be reacquired resulting
in a deadlock. To avoid this deadlock the process must not reacquire
the mutex when it is already holding it.
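The idea of skipping the acquisition when the caller already owns the
mutex can be modeled without real threads. Everything below is
hypothetical: obj_mutex_t tracks an owner token instead of blocking, and
zinactive_model() stands in for the eviction path.

```c
#include <assert.h>
#include <stddef.h>

/* Toy mutex which records its owner instead of blocking. */
typedef struct obj_mutex {
	const void	*om_owner;		/* NULL when unlocked */
	int		om_acquisitions;	/* successful lock count */
} obj_mutex_t;

int
mutex_held_by(const obj_mutex_t *m, const void *thread)
{
	return (m->om_owner == thread);
}

/*
 * Returns 0 on success. Returns -1 where a real mutex would block,
 * which for the current owner means a self deadlock.
 */
int
mutex_enter_model(obj_mutex_t *m, const void *thread)
{
	if (m->om_owner != NULL)
		return (-1);

	m->om_owner = thread;
	m->om_acquisitions++;

	return (0);
}

void
mutex_exit_model(obj_mutex_t *m)
{
	m->om_owner = NULL;
}

/* Eviction path: take the mutex only if this thread does not hold it. */
int
zinactive_model(obj_mutex_t *m, const void *thread)
{
	int need_lock = !mutex_held_by(m, thread);

	if (need_lock && mutex_enter_model(m, thread) != 0)
		return (-1);

	/* Tear down of the object is elided in this sketch. */

	if (need_lock)
		mutex_exit_model(m);

	return (0);
}
```

When the caller already holds the mutex the eviction path leaves it
held on return, so the outer code still owns it and releases it as
before.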
This is a reasonable fix for now but longer term the ZFS_OBJ_HOLD
mutex locking should be reevaluated. This infrastructure already
prevents us from ever using the Linux lock dependency analysis tools,
and it may limit scalability.
It used to be the case that all KM_SLEEP allocations were GFP_NOFS.
Unfortunately this often resulted in the kernel being unable to
reclaim the ARC, inode, and dentry caches in a timely manner.
The fix was to make KM_SLEEP a GFP_KERNEL allocation in the SPL.
However, this increases the possibility of deadlocking the system
on a zfs write thread. If a zfs write thread attempts to perform
an allocation it may trigger synchronous reclaim. This reclaim
may attempt to flush dirty data/inode to disk to free memory.
Unfortunately, this write cannot finish because the write thread
which would handle it is holding the previous transaction open.
Deadlock.
To avoid this all allocations in the zfs write thread path must
use KM_PUSHPAGE which prohibits synchronous reclaim for that
thread. In this way forward progress is ensured. The risk
with this change is that I missed updating an allocation for the
write threads, leaving an increased possibility of deadlock. If
any deadlocks remain they will be unlikely but we'll have to
make sure they all get fixed.
Register the missing .remount_fs handler. This handler isn't strictly
required because the VFS does a pretty good job updating most of the
MS_* flags. However, there's no harm in using the hook to call the
registered zpl callback for various MS_* flags. Additionally, this
allows us to lay the groundwork for more complicated argument parsing
in the future.
Register the missing .sync_fs handler. This is a noop in most cases
because the usual requirement is that sync just be initiated. As part
of the DMU's normal transaction processing txgs will be frequently
synced. However, when the 'wait' flag is set the requirement is that
.sync_fs must not return until the data is safe on disk. With the
addition of the .sync_fs handler this is now properly implemented.