Commit Graph

210 Commits

Author SHA1 Message Date
Christian Kohlschütter
df30f56639 Add "ashift" property to zpool create
Some disks with internal sectors larger than 512 bytes (e.g., 4k) can
suffer from bad write performance when ashift is not configured
correctly.  This is caused by the disk not reporting its actual sector
size, but a sector size of 512 bytes.  The drive may behave this way
for compatibility reasons.  For example, the WDC WD20EARS disks are
known to exhibit this behavior.

When creating a zpool, ZFS takes that wrong sector size and sets the
"ashift" property accordingly (to 9: 1<<9=512), whereas it should be
set to 12 for 4k sectors (1<<12=4096).

This patch allows an adminstrator to manual specify the known correct
ashift size at 'zpool create' time.  This can significantly improve
performance in certain cases.  However, it will have an impact on your
total pool capacity.  See the updated ashift property description
in the zpool.8 man page for additional details.

Valid values for the ashift property range from 9 to 17 (512B-128KB).
Additionally, you may set the ashift to 0 if you wish to auto-detect
the sector size based on what the disk reports, this is the default
behavior.  The most common ashift values are 9 and 12.

  Example:
  zpool create -o ashift=12 tank raidz2 sda sdb sdc sdd

Closes #280

Original-patch-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-06-17 16:35:49 -07:00
Brian Behlendorf
96801d2906 Linux 2.6.37 compat, WRITE_FLUSH_FUA
The WRITE_FLUSH, WRITE_FUA, and WRITE_FLUSH_FUA flags have been
introduced as a replacement for WRITE_BARRIER.  This was done
to allow richer semantics to be expressed to the block layer.
It is the block layers responsibility to choose the correct way
to implement these semantics.

This change simply updates the bio's to use the new kernel API
which should be absolutely safe.  However, since ZFS depends
entirely on this working as designed for correctness we do
want to be careful.

Closes #281
2011-06-17 14:37:26 -07:00
Brian Behlendorf
e95b3bdcbb Fix stack ddt_class_contains()
Stack usage for ddt_class_contains() reduced from 524 bytes to 68
bytes.  This large stack allocation significantly contributed to
the likelyhood of a stack overflow when scrubbing/resilvering
dedup pools.
2011-05-31 12:17:27 -07:00
Brian Behlendorf
5b8c7bbcea Fix stack ddt_zap_lookup()
Stack usage for ddt_zap_lookup() reduced from 368 bytes to 120
bytes.  This large stack allocation significantly contributed to
the likelyhood of a stack overflow when scrubbing/resilvering
dedup pools.
2011-05-31 12:17:27 -07:00
Brian Behlendorf
c7f8f831a4 Revert "Fix stack traverse_visitbp()"
This abomination is no longer required because the zio's issued
during this recursive call path will now be handled asynchronously
by the taskq thread pool.

This reverts commit 6656bf5621.
2011-05-31 12:17:27 -07:00
Brian Behlendorf
2fac4c2a74 Make tgx_sync_thread zio's async
The majority of the recursive operations performed by the dsl
are done either in the context of the tgx_sync_thread or during
pool import.  It is these recursive operations which contribute
greatly to the stack depth.  When this recursion is coupled with
a synchronous I/O in the same context overflow becomes possible.

Previously to handle this case I have focused on keeping the
individual stack frames as light as possible.  This is a good
idea as long as it can be done in a way which doesn't overly
complicate the code.  However, there is a better solution.

If we treat all zio's issued by the tgx_sync_thread as async then
we can use the tgx_sync_thread stack for the recursive parts, and
the zio_* threads for the I/O parts.  This effectively doubles our
available stack space with the only drawback being a small delay
to schedule the I/O.  However, in practice the scheduling time
is so much smaller than the actual I/O time this isn't an issue.
Another benefit of making the zio async is that the zio pipeline
is now parallel.  That should mean for CPU intensive pipelines
such as compression or dedup performance may be improved.

With this change in place the worst case stack usage observed so
far is 6902 bytes.  This is still higher than I'd like but
significantly improved.  Additional changes to specific functions
should improve this further.  This change allows us to revent
commit 6656bf5 which did some horrible things to the recursive
traverse_visitbp() callpath in the name of saving stack.
2011-05-31 12:17:27 -07:00
Brian Behlendorf
f74fae8b30 Fix 4K sector support
Yesterday I ran across a 3TB drive which exposed 4K sectors to
Linux.  While I thought I had gotten this support correct it
turns out there were 2 subtle bugs which prevented it from
working.

  sudo ./cmd/zpool/zpool create -f large-sector /dev/sda
  cannot create 'large-sector': one or more devices is currently unavailable

1) The first issue was that it was possible that bdev_capacity()
would return the number of 512 byte sectors rather than the number
of 4096 sectors.  Internally, certain Linux functions only operate
with 512 byte sectors so you need to be careful.  To avoid any
confusion in the future I've updated bdev_capacity() to simply
return the device (or partition) capacity in bytes.  The higher
levels of ZFS want the value in bytes anyway so this is cleaner.

2) When creating a bio the ->bi_sector count must always be
expressed in 512 byte sectors.  The existing code would scale
the byte offset by the logical sector size.   Until now this was
always 512 so it never caused problems.  Trying a 4K sector drive
clearly exposed the issue.  The problem has been fixed by
hard coding the 512 byte sector which is exactly what the bio
code does internally.

With these changes I'm now able to create ZFS pools using 4K
sector drives.  No issues were observed during fairly extensive
testing.  This is also a low risk change if your using 512b
sectors devices because none of the logic changes.

Closes #256
2011-05-27 11:38:53 -07:00
Brian Behlendorf
2b8cad6159 Use vmem_alloc() for zfs_ioc_userspace_many()
The default buffer size when requesting multiple quota entries
is 100 times the zfs_useracct_t size.  In practice this works out
to exactly 27200 bytes.  Since this will be a short lived buffer
in a non-performance critical path it is preferable to vmem_alloc()
the needed memory.
2011-05-20 14:23:18 -07:00
Brian Behlendorf
f01b360e67 Pass caller's credential in zfsdev_ioctl()
Initially when zfsdev_ioctl() was ported to Linux we didn't have
any credential support implemented.  So at the time we simply
passed NULL which wasn't much of a problem since most of the
secpolicy code was disabled.

However, one exception is quota handling which does require the
credential.  Now that proper credentials are supported we can
safely start passing the callers credential.  This is also an
initial step towards fully implemented the zfs secpolicy.
2011-05-20 10:12:25 -07:00
Brian Behlendorf
3fd70ee6b0 Fix 'negative objects to delete' warning
Normally when the arc_shrinker_func() function is called the return
value should be:

   >=0 - To indicate the number of freeable objects in the cache, or
   -1  - To indicate this cache should be skipped

However, when the shrinker callback is called with 'nr_to_scan' equal
to zero.  The caller simply wants the number of freeable objects in
the cache and we must never return -1.  This patch reorders the
first two conditionals in arc_shrinker_func() to ensure this behavior.

This patch also now explictly casts arc_size and arc_c_min to signed
int64_t types so MAX(x, 0) works as expected.  As unsigned types
we would never see an negative value which defeated the purpose of
the MAX() lower bound and broke the shrinker logic.

Finally, when nr_to_scan is non-zero we explictly prevent all reclaim
below arc_c_min.  This is done to prevent the Linux page cache from
completely crowding out the ARC.  This limit is tunable and some
experimentation is likely going to be required to set it exactly right.
For now we're sticking with the OpenSolaris defaults.

Closes #218
Closes #243
2011-05-18 10:29:22 -07:00
Brian Behlendorf
e814770f2e Update synchronous open zfs_close() comment
The comment in zfs_close() pertaining to decrementing the synchronous
open count needs to be updated for Linux.  The code was already
updated to be correct, but the comment was missed and is now misleading.
Under Linux the zfs_close() hook is only called once when the final
reference is dropped.  This differs from Solaris where zfs_close()
is called for each close.

Closes #237
2011-05-13 08:20:06 -07:00
Brian Behlendorf
c91d229809 Merge pull request #235 from nedbass/rdev
Don't store rdev in SA for FIFOs and sockets
2011-05-09 16:41:28 -07:00
Ned A. Bass
aa6d8c1086 Don't store rdev in SA for FIFOs and sockets
Update the handling of named pipes and sockets to be consistent with
other platforms with regard to the rdev attribute.  While all ZFS
ipmlementations store the rdev for device files in a system attribute
(SA), this is not the case for FIFOs and sockets.  Indeed, Linux always
passes rdev=0 to mknod() for FIFOs and sockets, so the value is not
needed.  Add an ASSERT that rdev==0 for FIFOs and sockets to detect if
the expected behavior ever changes.

Closes #216
2011-05-09 13:35:07 -07:00
Brian Behlendorf
21ade34764 Disable direct reclaim for z_wr_* threads
The direct reclaim path in the z_wr_* threads must be disabled
to ensure forward progress is always maintained for txg processing.
This ensures that a txg will never get stuck waiting on itself
because it entered the following memory reclaim callpath.

  ->prune_icache()->dispose_list()->zpl_clear_inode()->zfs_inactive()
  ->dmu_tx_assign()->dmu_tx_wait()->tgx_wait_open()

It would be preferable to target this exact code path but the
kernel offers no way to do this without custom patches.  To avoid
this we are forced to disable all reclaim for these threads.  It
should not be necessary to do this for other other z_* threads
because they will not hold a txg open.

Closes #232
2011-05-06 15:26:26 -07:00
Brian Behlendorf
3117dd0b90 Handle NULL in nfsd .fsync() hook
How nfsd handles .fsync() has been changed a couple of times in the
recent kernels.  But basically there are three cases we need to
consider.

Linux 2.6.12 - 2.6.33
* The .fsync() hook takes 3 arguments
* The nfsd will call .fsync() with a NULL file struct pointer.

Linux 2.6.34
* The .fsync() hook takes 3 arguments
* The nfsd no longer calls .fsync() but instead used sync_inode()

Linux 2.6.35 - 2.6.x
* The .fsync() hook takes 2 arguments
* The nfsd no longer calls .fsync() but instead used sync_inode()

For once it looks like we've gotten lucky.  The first two cases can
actually be collased in to one if we stop using the file struct
pointer entirely.  Since the dentry is still passed in both cases
this is possible.  The last case can then be safely handled by
unconditionally using the dentry in the file struct pointer now
that we know the nfsd caller has been removed.

Closes #230
2011-05-06 12:33:45 -07:00
Brian Behlendorf
34b84cb831 Use vmem_alloc() for zfs_ioc_pool_get_history()
The default buffer size when requesting history is 128k.  This
is far to large for a kmem_alloc() so instead use the slower
vmem_alloc().  This path has no performance concerns and the
buffer is immediately free'd after its contents are copied to
the user space buffer.
2011-05-06 09:59:52 -07:00
Brian Behlendorf
c409e4647f Add missing ZFS tunables
This commit adds module options for all existing zfs tunables.
Ideally the average user should never need to modify any of these
values.  However, in practice sometimes you do need to tweak these
values for one reason or another.  In those cases it's nice not to
have to resort to rebuilding from source.  All tunables are visable
to modinfo and the list is as follows:

$ modinfo module/zfs/zfs.ko
filename:       module/zfs/zfs.ko
license:        CDDL
author:         Sun Microsystems/Oracle, Lawrence Livermore National Laboratory
description:    ZFS
srcversion:     8EAB1D71DACE05B5AA61567
depends:        spl,znvpair,zcommon,zunicode,zavl
vermagic:       2.6.32-131.0.5.el6.x86_64 SMP mod_unload modversions
parm:           zvol_major:Major number for zvol device (uint)
parm:           zvol_threads:Number of threads for zvol device (uint)
parm:           zio_injection_enabled:Enable fault injection (int)
parm:           zio_bulk_flags:Additional flags to pass to bulk buffers (int)
parm:           zio_delay_max:Max zio millisec delay before posting event (int)
parm:           zio_requeue_io_start_cut_in_line:Prioritize requeued I/O (bool)
parm:           zil_replay_disable:Disable intent logging replay (int)
parm:           zfs_nocacheflush:Disable cache flushes (bool)
parm:           zfs_read_chunk_size:Bytes to read per chunk (long)
parm:           zfs_vdev_max_pending:Max pending per-vdev I/Os (int)
parm:           zfs_vdev_min_pending:Min pending per-vdev I/Os (int)
parm:           zfs_vdev_aggregation_limit:Max vdev I/O aggregation size (int)
parm:           zfs_vdev_time_shift:Deadline time shift for vdev I/O (int)
parm:           zfs_vdev_ramp_rate:Exponential I/O issue ramp-up rate (int)
parm:           zfs_vdev_read_gap_limit:Aggregate read I/O over gap (int)
parm:           zfs_vdev_write_gap_limit:Aggregate write I/O over gap (int)
parm:           zfs_vdev_scheduler:I/O scheduler (charp)
parm:           zfs_vdev_cache_max:Inflate reads small than max (int)
parm:           zfs_vdev_cache_size:Total size of the per-disk cache (int)
parm:           zfs_vdev_cache_bshift:Shift size to inflate reads too (int)
parm:           zfs_scrub_limit:Max scrub/resilver I/O per leaf vdev (int)
parm:           zfs_recover:Set to attempt to recover from fatal errors (int)
parm:           spa_config_path:SPA config file (/etc/zfs/zpool.cache) (charp)
parm:           zfs_zevent_len_max:Max event queue length (int)
parm:           zfs_zevent_cols:Max event column width (int)
parm:           zfs_zevent_console:Log events to the console (int)
parm:           zfs_top_maxinflight:Max I/Os per top-level (int)
parm:           zfs_resilver_delay:Number of ticks to delay resilver (int)
parm:           zfs_scrub_delay:Number of ticks to delay scrub (int)
parm:           zfs_scan_idle:Idle window in clock ticks (int)
parm:           zfs_scan_min_time_ms:Min millisecs to scrub per txg (int)
parm:           zfs_free_min_time_ms:Min millisecs to free per txg (int)
parm:           zfs_resilver_min_time_ms:Min millisecs to resilver per txg (int)
parm:           zfs_no_scrub_io:Set to disable scrub I/O (bool)
parm:           zfs_no_scrub_prefetch:Set to disable scrub prefetching (bool)
parm:           zfs_txg_timeout:Max seconds worth of delta per txg (int)
parm:           zfs_no_write_throttle:Disable write throttling (int)
parm:           zfs_write_limit_shift:log2(fraction of memory) per txg (int)
parm:           zfs_txg_synctime_ms:Target milliseconds between tgx sync (int)
parm:           zfs_write_limit_min:Min tgx write limit (ulong)
parm:           zfs_write_limit_max:Max tgx write limit (ulong)
parm:           zfs_write_limit_inflated:Inflated tgx write limit (ulong)
parm:           zfs_write_limit_override:Override tgx write limit (ulong)
parm:           zfs_prefetch_disable:Disable all ZFS prefetching (int)
parm:           zfetch_max_streams:Max number of streams per zfetch (uint)
parm:           zfetch_min_sec_reap:Min time before stream reclaim (uint)
parm:           zfetch_block_cap:Max number of blocks to fetch at a time (uint)
parm:           zfetch_array_rd_sz:Number of bytes in a array_read (ulong)
parm:           zfs_pd_blks_max:Max number of blocks to prefetch (int)
parm:           zfs_dedup_prefetch:Enable prefetching dedup-ed blks (int)
parm:           zfs_arc_min:Min arc size (ulong)
parm:           zfs_arc_max:Max arc size (ulong)
parm:           zfs_arc_meta_limit:Meta limit for arc size (ulong)
parm:           zfs_arc_reduce_dnlc_percent:Meta reclaim percentage (int)
parm:           zfs_arc_grow_retry:Seconds before growing arc size (int)
parm:           zfs_arc_shrink_shift:log2(fraction of arc to reclaim) (int)
parm:           zfs_arc_p_min_shift:arc_c shift to calc min/max arc_p (int)
2011-05-04 10:02:37 -07:00
Brian Behlendorf
5f35b19007 Fully update inode when created
When a new znode/inode pair is created both the znode and the inode
should be immediately updated to the correct values.  This was done
for the znode and for most of the values in the inode, but not all
of them.  This normally wasn't a problem because most subsequent
operations would cause the inode to be immediately updated.  This
change ensures the inode is now fully updated before it is inserted
in to the inode hash.

Closes #116
Closes #146
Closes #164
2011-05-02 14:04:19 -07:00
Brian Behlendorf
df554c148e Fix 'zfs set volsize=N pool/dataset'
This change fixes a kernel panic which would occur when resizing
a dataset which was not open.  The objset_t stored in the
zvol_state_t will be set to NULL when the block device is closed.
To avoid this issue we pass the correct objset_t as the third arg.

The code has also been updated to correctly notify the kernel
when the block device capacity changes.  For 2.6.28 and newer
kernels the capacity change will be immediately detected.  For
earlier kernels the capacity change will be detected when the
device is next opened.  This is a known limitation of older
kernels.

Online ext3 resize test case passes on 2.6.28+ kernels:
$ dd if=/dev/zero of=/tmp/zvol bs=1M count=1 seek=1023
$ zpool create tank /tmp/zvol
$ zfs create -V 500M tank/zd0
$ mkfs.ext3 /dev/zd0
$ mkdir /mnt/zd0
$ mount /dev/zd0 /mnt/zd0
$ df -h /mnt/zd0
$ zfs set volsize=800M tank/zd0
$ resize2fs /dev/zd0
$ df -h /mnt/zd0

Original-patch-by: Fajar A. Nugraha <github@fajar.net>
Closes #68
Closes #84
2011-05-02 08:54:40 -07:00
Gunnar Beutner
055656d4f4 Implemented NFS export_operations.
Implemented the required NFS operations for exporting ZFS datasets
using the in-kernel NFS daemon.
2011-04-29 12:36:13 -07:00
Brian Behlendorf
5476e6952c Suppress 'vdev_metaslab_init' memory warning
The vdev_metaslab_init() function has been observed to allocate
larger than 8k chunks.  However, they are not much larger than 8k
and it does this infrequently so it is allowed and the warning is
supressed.
2011-04-27 09:35:18 -07:00
Brian Behlendorf
40a39e1103 Conserve stack in dsl_scan_visit()
The dsl_scan_visit() function is a little heavy weight taking 464
bytes on the stack.  This can be easily reduced for little cost by
moving zap_cursor_t and zap_attribute_t off the stack and on to the
heap.  After this change dsl_scan_visit() has been reduced in size
by 320 bytes.

This change was made to reduce stack usage in the dsl_scan_sync()
callpath which is recursive and has been observed to overflow the
stack.

Issue #174
2011-04-26 15:48:00 -07:00
Brian Behlendorf
b81c4ac9af Conserve stack in dsl_scan_visitbp()
This function is called recursively so everything possible must be
done to limit its stack consumption.  The dprintf_bp() debugging
function adds 30 bytes of local variables to the function we cannot
afford.  By commenting out this debugging we save 30 bytes per
recursion and depths of 13 are not uncommon.  This yeilds a total
stack saving of 390 bytes on our 8k stack.

Issue #174
2011-04-26 15:48:00 -07:00
Brian Behlendorf
7a060636b0 Conserve stack in dsl_scan_visitbp()
The recursive call chain dsl_scan_visitbp() -> dsl_scan_recurse() ->
dsl_scan_visitdnode() -> dsl_scan_visitbp has been observed to consume
considerable stack resulting in a stack overflow (>8k).  The cleanest
way I see to fix this with minimal impact to the existing flow of
code, and with the fewest performance concerns, is to always inline
dsl_scan_recurse() and dsl_scan_visitdnode().  While this will increase
the function size of dsl_scan_visitbp(), by 4660 bytes, it also reduces
the stack requirements by removing the function call overhead.

Issue #174
2011-04-26 13:37:35 -07:00
Brian Behlendorf
701b1f8168 Fix zvol deadlock
It's possible for a zvol_write thread to enter direct memory reclaim
while holding open a transaction group.  This results in the system
attempting to write out data to the disk to free memory.  Unfortunately,
this can't succeed because the the thread doing reclaim is holding open
the txg which must be closed to be synced to disk.  To prevent this
the offending allocation is marked KM_PUSHPAGE which will prevent it
from attempting writeback.

Closes #191
2011-04-26 12:56:35 -07:00
Brian Behlendorf
e2448b0e62 Fix spurious -EFAULT when setting I/O scheduler
Occasionally we would see an -EFAULT returned when setting the
I/O scheduler on a vdev.  This was caused an improperly formatted
user mode helper command.

This commit restructures the command to something simpler, allocates
space for it dynamically to save stack, and removes the retry logic
which is no longer needed.

Closes #169
2011-04-22 14:55:35 -07:00
Brian Behlendorf
6a8f9b6bf0 Enforce ARC meta-data limits
This change ensures the ARC meta-data limits are enforced.  Without
this enforcement meta-data can grow to consume all of the ARC cache
pushing out data and hurting performance.  The cache is aggressively
reclaimed but this is a soft and not a hard limit.  The cache may
exceed the set limit briefly before being brought under control.

By default 25% of the ARC capacity can be used for meta-data.  This
limit can be tuned by setting the 'zfs_arc_meta_limit' module option.
Once this limit is exceeded meta-data reclaim will occur in 3 percent
chunks, or may be tuned using 'arc_reduce_dnlc_percent'.

Closes #193
2011-04-21 13:49:31 -07:00
Gunnar Beutner
36df284366 Fixed a use-after-free bug in zfs_zget().
Fixed a bug where zfs_zget could access a stale znode pointer when
the inode had already been removed from the inode cache via iput ->
iput_final -> ... -> zfs_zinactive but the corresponding SA handle
was still alive.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #180
2011-04-21 13:48:01 -07:00
Brian Behlendorf
d247f2a3cc Suppress 'zfs receive' memory warning
As part of zfs_ioc_recv() a zfs_cmd_t is allocated in the kernel
which is 17808 bytes in size.  This sort of thing in general should
be avoided.  However, since this should be an infrequent event for
now we allow it and simply suppress the warning with the KM_NODEBUG
flag.  This can be revisited latter if/when it becomes an issue.

Closes #178
2011-04-20 10:22:31 -07:00
Gunnar Beutner
bec30953cd Truncate the xattr znode when updating existing attributes.
If the attribute's new value was shorter than the old one the old
code would leave parts of the old value in the xattr znode.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #203
2011-04-19 14:14:40 -07:00
Gunnar Beutner
274b7e79f3 Added missing initialization for va.va_dentry in zfs_get_xattrdir.
Without this we may mistakenly believe we have a dentry and try to
d_instantiate() it.  This will result in the following BUG.  It's
important to note that while the xattr directory has an inode
assoicated with it we never create a dentry for it.

  kernel BUG at fs/dcache.c:1418!

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #202
2011-04-19 13:57:41 -07:00
Brian Behlendorf
0fe3d820f5 Fix gcc compiler warning, dsl_pool_create()
When compiling ZFS in user space gcc-4.6.0 correctly identifies
the variable 'os' as being set but never used.  This generates a
warning and a build failure when using --enable-debug.  However,
the code is correct we only want to use 'os' for the kernel space
builds.  To suppress the warning the call was wrapped with a
VERIFY() which has the nice side effect of ensuring the 'os'
actually never is NULL.  This was observed under Fedora 15.

  module/zfs/dsl_pool.c: In function ‘dsl_pool_create’:
  module/zfs/dsl_pool.c:229:12: error: variable ‘os’ set but not used
  [-Werror=unused-but-set-variable]
2011-04-19 09:04:51 -07:00
Brian Behlendorf
e30c0ada6d Linux 2.6.39 compat, invalidate_inodes()
Update code to use the spl_invalidate_inodes() wrapper.  This hides
some of the complexity of determining if invalidate_inodes() was
exported, and if so what is its prototype.  The second argument
of spl_invalidate_inodes() determined the behavior of how dirty
inodes are handled.  By passing a zero we are indicated that we
want those inodes to be treated as busy and skipped.
2011-04-19 08:57:23 -07:00
Brian Behlendorf
0d3ac5e735 Linux 2.6.29 compat, credentials
The .sync_fs fix as applied did not use the updated SPL credential
API.  This broke builds on Debian Lenny, this change applies the
needed fix to use the portable API.  The original credential changes
are part of commit 81e97e2187.
2011-04-07 14:27:09 -07:00
Brian Behlendorf
eec8164771 Fix ASSERTION(!dsl_pool_sync_context(tx->tx_pool))
Disable the normal reclaim path for the txg_sync thread.  This
ensures the thread will never enter dmu_tx_assign() which can
otherwise occur due to direct reclaim.  If this is allowed to
happen the system can deadlock.  Direct reclaim call path:

  ->shrink_icache_memory->prune_icache->dispose_list->
  clear_inode->zpl_clear_inode->zfs_inactive->dmu_tx_assign
2011-04-07 09:52:16 -07:00
Brian Behlendorf
7cb67b45f3 Add direct+indirect ARC reclaim
Under OpenSolaris all memory reclaim is done asyncronously.  Under
Linux memory reclaim is done asynchronously _and_ synchronously.
When a process allocates memory with GFP_KERNEL it explicitly allows
the kernel to do reclaim on its behalf to satify the allocation.
If that GFP_KERNEL allocation fails the kernel may take more drastic
measures to reclaim the memory such as killing user space processes.

This was observed to happen with ZFS because the ARC could consume
a large fraction of the system memory but no synchronous reclaim
could be performed on it.  The result was GFP_KERNEL allocations
could fail resulting in OOM events, and only moments latter the
arc_reclaim thread would free unused memory from the ARC.

This change leaves the arc_thread in place to manage the fundamental
ARC behavior.  But it adds a synchronous (direct) reclaim path for
the ARC which can be called when memory is badly needed.  It also
adds an asynchronous (indirect) reclaim path which is called
much more frequently to prune the ARC slab caches.
2011-04-07 09:52:10 -07:00
Brian Behlendorf
1834f2d8b7 Add missing arcstats
The following useful values were missing the arcstats.  This change
adds them in to provide greater visibility in to the arcs behavior.

arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_meta_used                   4    624774592
arc_meta_limit                  4    400785408
arc_meta_max                    4    625594176
2011-04-07 09:52:05 -07:00
Brian Behlendorf
c85b224faf Call d_instantiate before unlocking inode
Under Linux a dentry referencing an inode must be instantiated before
the inode is unlocked.  To accomplish this without overly modifing
the core ZFS code the dentry it passed via the vattr_t.  There are
cases such as replay when a dentry is not available.  In which case
it is obviously not initialized at inode creation time, if a dentry
is needed it will be spliced as when required via d_lookup().
2011-04-07 09:51:57 -07:00
Brian Behlendorf
d433c20651 Fix make distclean for `./configure --with-config=user
Making distclean in module
    make[1]: Entering directory `/zfs/module'
    make -C  SUBDIRS=`pwd`  clean
    make: Entering an unknown directory
    make: *** SUBDIRS=/zfs/module: No such file or directory.  Stop.

When using --with-config=user the 'distclean' target would fail
because it assumes the kernel configuration infrastrure is set up.
This is not the case, nor does it need to be, because the
'--with-config=user' option will prune the entire ./module subtree
from SUBDIRS.  This prevents most build rules from operating in the
./module directory.

However, the 'dist*' rules will still traverse this directory
because it is listed in DIST_SUBDIRS.  This is correct because we
need to ensure the dist rules package the directory contents
regardless of the configuration for the 'dist' rule.  The correct
way to handle this is to only invoke the kernel build system as
part of the 'clean' rule when CONFIG_KERNEL_TRUE is set.

Initial fix provided by Darik Horn <dajhorn@vanadac.com>.
This commit is a slightly refined form of the original.
2011-04-05 13:33:28 -07:00
Brian Behlendorf
bfd214af01 Fix inflated load average
Kernel threads which sleep uninterruptibly on Linux are marked in the (D)
state.  These threads are usually in the process of performing IO and are
thus counted against the load average.  The txg_quiesce and txg_sync threads
were always sleeping uninterruptibly and thus inflating the load average.

This change makes them sleep interruptibly.  Some care is required however
because these threads may now be woken early by signals.  In this case the
callers are all careful to check that the required conditions are met after
waking up.  If we're woken early due to a signal they will simply go back
to sleep.  In this case these changes are safe.

Closes #175
2011-03-31 17:07:12 -07:00
Brian Behlendorf
7a1cdc0775 Linux 2.6.29 compat, .freeze_fs/.unfreeze_fs
The .freeze_fs/.unfreeze_fs hooks were not added until Linux 2.6.29
Since these hooks are currently unused they are being removed to
allow support of older kernels.
2011-03-22 12:17:24 -07:00
Brian Behlendorf
81e97e2187 Linux 2.6.29 compat, credentials
As of Linux 2.6.29 a clean credential API was added to the Linux kernel.
Previously the credential was embedded in the task_struct.  Because the
SPL already has considerable support for handling this API change the
ZPL code has been updated to use the Solaris credential API.
2011-03-22 12:15:54 -07:00
Brian Behlendorf
d6bd8eaae4 Fix evict() deadlock
Now that KM_SLEEP is not defined as GFP_NOFS there is the possibility
of synchronous reclaim deadlocks.  These deadlocks never existed in the
original OpenSolaris code because all memory reclaim on Solaris is done
asyncronously.  Linux does both synchronous (direct) and asynchronous
(indirect) reclaim.

This commit addresses a deadlock caused by inode eviction.  A KM_SLEEP
allocation may trigger direct memory reclaim and shrink the inode cache.
This can occur while a mutex in the array of ZFS_OBJ_HOLD mutexes is
held.  Through the ->shrink_icache_memory()->evict()->zfs_inactive()->
zfs_zinactive() call path the same mutex may be reacquired resulting
in a deadlock.  To avoid this deadlock the process must not reacquire
the mutex when it is already holding it.

This is a reasonable fix for now but longer term the ZFS_OBJ_HOLD
mutex locking should be reevaluated.  This infrastructure already
prevents us from ever using the Linux lock dependency analysis tools,
and it may limit scalability.
2011-03-22 12:14:55 -07:00
Brian Behlendorf
691f6ac4c2 Use KM_PUSHPAGE instead of KM_SLEEP
It used to be the case that all KM_SLEEP allocations were GFS_NOFS.
Unfortunately this often resulted in the kernel being unable to
reclaim the ARC, inode, and dentry caches in a timely manor.
The fix was to make KM_SLEEP a GFP_KERNEL allocation in the SPL.

However, this increases the posibility of deadlocking the system
on a zfs write thread.  If a zfs write thread attempts to perform
an allocation it may trigger synchronous reclaim.  This reclaim
may attempt to flush dirty data/inode to disk to free memory.
Unforunately, this write cannot finish because the write thread
which would handle it is holding the previous transaction open.
Deadlock.

To avoid this all allocations in the zfs write thread path must
use KM_PUSHPAGE which prohibits synchronous reclaim for that
thread.  In this way forward progress in ensured.  The risk
with this change is I missed updating an allocation for the
write threads leaving an increased posibility of deadlock.  If
any deadlocks remain they will be unlikely but we'll have to
make sure they all get fixed.
2011-03-22 12:14:55 -07:00
Brian Behlendorf
0de19dad9c Register .remount_fs handler
Register the missing .remount_fs handler.  This handler isn't strictly
required because the VFS does a pretty good job updating most of the
MS_* flags.  However, there's no harm in using the hook to call the
registered zpl callback for various MS_* flags.  Additionaly, this
allows us to lay the ground work for more complicated argument parsing
in the future.
2011-03-15 13:33:29 -07:00
Brian Behlendorf
03f9ba9d99 Register .sync_fs handler
Register the missing .sync_fs handler.  This is a noop in most cases
because the usual requirement is that sync just be initiated.  As part
of the DMU's normal transaction processing txgs will be frequently
synced.  However, when the 'wait' flag is set the requirement is that
.sync_fs must not return until the data is safe on disk.  With the
addition of the .sync_fs handler this is now properly implemented.
2011-03-15 13:33:29 -07:00
Brian Behlendorf
04516a45b2 Don't set I/O Scheduler for Partitions
ZFS should only change the i/o scheduler for a disk when it has
ownership of the whole disk.  This is basically the same logic as
adjusting the write cache behavior on a disk.  This change updates
the vdev disk code to skip partitions when setting the i/o scheduler.

Closes #152
2011-03-10 13:34:17 -08:00
Brian Behlendorf
adf2e8778e Fix O_APPEND Corruption
Due to an uninitialized variable files opened with O_APPEND may
overwrite the start of the file rather than append to it.  This
was introduced accidentally when I removed the Solaris vnodes.

The zfs_range_lock_writer() function used to key off zf->z_vnode
to determine if a znode_t was for a zvol of zpl object.  With
the removal of vnodes this was replaced by the flag zp->z_is_zvol.
This flag was used to control the append behavior for range locks.

Unfortunately, this value was never properly initialized after
the vnode removal.  However, because most of memory is usually
zeros it happened to be set correctly most of the time making
the bug appear racy.  Properly initializing zp->z_is_zvol to
zero completely resolves the problem with O_APPEND.

Closes #126
2011-03-09 13:31:00 -08:00
Brian Behlendorf
17c37660a1 Conserve stack in zfs_setattr()
Move 'bulk' and 'xattr_bulk' from the stack to the heap to minimize
stack space usage.  These two arrays consumed 448 bytes on the stack
and have been replaced by two 8 byte points for a total stack space
saving of 432 bytes.  The zfs_setattr() path had been previously
observed to overrun the stack in certain circumstances.
2011-03-09 13:30:03 -08:00
Brian Behlendorf
450dc149bd Range lock performance improvements
The original range lock implementation had to be modified by commit
8926ab7 because it was unsafe on Linux.  In particular, calling
cv_destroy() immediately after cv_broadcast() is dangerous because
the waiters may still be asleep.  Thus the following cv_destroy()
will free memory which may still be in use.

This was fixed by updating cv_destroy() to block on waiters but
this in turn introduced a deadlock.  The deadlock was resolved
with the use of a taskq to move the offending free outside the
range lock.  This worked well but using the taskq for the free
resulted in a serious performace hit.  This is somewhat ironic
because at the time I felt using the taskq might improve things
by making the free asynchronous.

This patch refines the original fix and moves the free from the
taskq to a private free list.  Then items which must be free'd
are simply inserted in to the list.  When the range lock is dropped
it's safe to free the items.  The list is walked and all rl_t
entries are freed.

This change improves small cached read performance by 26x.  This
was expected because for small reads the number of locking calls
goes up significantly.  More surprisingly this change significantly
improves large cache read performance.  This probably attributable
to better cpu/memory locality.  Very likely the same processor
which allocated the memory is now freeing it.

bs	ext3	zfs	zfs+fix		faster
----------------------------------------------
512     435     3       79      	26x
1k      820     7       160     	22x
2k      1536    14      305     	21x
4k      2764    28      572     	20x
8k      3788    50      1024    	20x
16k     4300    86      1843    	21x
32k     4505    138     2560    	18x
64k     5324    252     3891    	15x
128k    5427    276     4710    	17x
256k    5427    413     5017    	12x
512k    5427    497     5324    	10x
1m      5427    521     5632    	10x

Closes #142
2011-03-08 12:44:06 -08:00