Commit Graph

311 Commits

Author SHA1 Message Date
Gunnar Beutner
6f0cf71e0d Removed erroneous zfs_inode_destroy() calls from zfs_rmnode().
iput_final() already calls zpl_inode_destroy() -> zfs_inode_destroy()
for us after zfs_zinactive(), thus making sure that the inode is
properly cleaned up.

The zfs_inode_destroy() calls in zfs_rmnode() would lead to a
double-free.

Fixes #282
2011-06-20 10:30:17 -07:00
Christian Kohlschütter
df30f56639 Add "ashift" property to zpool create
Some disks with internal sectors larger than 512 bytes (e.g., 4k) can
suffer from bad write performance when ashift is not configured
correctly.  This is caused by the disk not reporting its actual sector
size, but a sector size of 512 bytes.  The drive may behave this way
for compatibility reasons.  For example, the WDC WD20EARS disks are
known to exhibit this behavior.

When creating a zpool, ZFS takes that wrong sector size and sets the
"ashift" property accordingly (to 9: 1<<9=512), whereas it should be
set to 12 for 4k sectors (1<<12=4096).

This patch allows an adminstrator to manual specify the known correct
ashift size at 'zpool create' time.  This can significantly improve
performance in certain cases.  However, it will have an impact on your
total pool capacity.  See the updated ashift property description
in the zpool.8 man page for additional details.

Valid values for the ashift property range from 9 to 17 (512B-128KB).
Additionally, you may set the ashift to 0 if you wish to auto-detect
the sector size based on what the disk reports, this is the default
behavior.  The most common ashift values are 9 and 12.

  Example:
  zpool create -o ashift=12 tank raidz2 sda sdb sdc sdd

Closes #280

Original-patch-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-06-17 16:35:49 -07:00
Brian Behlendorf
96801d2906 Linux 2.6.37 compat, WRITE_FLUSH_FUA
The WRITE_FLUSH, WRITE_FUA, and WRITE_FLUSH_FUA flags have been
introduced as a replacement for WRITE_BARRIER.  This was done
to allow richer semantics to be expressed to the block layer.
It is the block layers responsibility to choose the correct way
to implement these semantics.

This change simply updates the bio's to use the new kernel API
which should be absolutely safe.  However, since ZFS depends
entirely on this working as designed for correctness we do
want to be careful.

Closes #281
2011-06-17 14:37:26 -07:00
Brian Behlendorf
e95b3bdcbb Fix stack ddt_class_contains()
Stack usage for ddt_class_contains() reduced from 524 bytes to 68
bytes.  This large stack allocation significantly contributed to
the likelyhood of a stack overflow when scrubbing/resilvering
dedup pools.
2011-05-31 12:17:27 -07:00
Brian Behlendorf
5b8c7bbcea Fix stack ddt_zap_lookup()
Stack usage for ddt_zap_lookup() reduced from 368 bytes to 120
bytes.  This large stack allocation significantly contributed to
the likelyhood of a stack overflow when scrubbing/resilvering
dedup pools.
2011-05-31 12:17:27 -07:00
Brian Behlendorf
c7f8f831a4 Revert "Fix stack traverse_visitbp()"
This abomination is no longer required because the zio's issued
during this recursive call path will now be handled asynchronously
by the taskq thread pool.

This reverts commit 6656bf5621.
2011-05-31 12:17:27 -07:00
Brian Behlendorf
2fac4c2a74 Make tgx_sync_thread zio's async
The majority of the recursive operations performed by the dsl
are done either in the context of the tgx_sync_thread or during
pool import.  It is these recursive operations which contribute
greatly to the stack depth.  When this recursion is coupled with
a synchronous I/O in the same context overflow becomes possible.

Previously to handle this case I have focused on keeping the
individual stack frames as light as possible.  This is a good
idea as long as it can be done in a way which doesn't overly
complicate the code.  However, there is a better solution.

If we treat all zio's issued by the tgx_sync_thread as async then
we can use the tgx_sync_thread stack for the recursive parts, and
the zio_* threads for the I/O parts.  This effectively doubles our
available stack space with the only drawback being a small delay
to schedule the I/O.  However, in practice the scheduling time
is so much smaller than the actual I/O time this isn't an issue.
Another benefit of making the zio async is that the zio pipeline
is now parallel.  That should mean for CPU intensive pipelines
such as compression or dedup performance may be improved.

With this change in place the worst case stack usage observed so
far is 6902 bytes.  This is still higher than I'd like but
significantly improved.  Additional changes to specific functions
should improve this further.  This change allows us to revent
commit 6656bf5 which did some horrible things to the recursive
traverse_visitbp() callpath in the name of saving stack.
2011-05-31 12:17:27 -07:00
Brian Behlendorf
f74fae8b30 Fix 4K sector support
Yesterday I ran across a 3TB drive which exposed 4K sectors to
Linux.  While I thought I had gotten this support correct it
turns out there were 2 subtle bugs which prevented it from
working.

  sudo ./cmd/zpool/zpool create -f large-sector /dev/sda
  cannot create 'large-sector': one or more devices is currently unavailable

1) The first issue was that it was possible that bdev_capacity()
would return the number of 512 byte sectors rather than the number
of 4096 sectors.  Internally, certain Linux functions only operate
with 512 byte sectors so you need to be careful.  To avoid any
confusion in the future I've updated bdev_capacity() to simply
return the device (or partition) capacity in bytes.  The higher
levels of ZFS want the value in bytes anyway so this is cleaner.

2) When creating a bio the ->bi_sector count must always be
expressed in 512 byte sectors.  The existing code would scale
the byte offset by the logical sector size.   Until now this was
always 512 so it never caused problems.  Trying a 4K sector drive
clearly exposed the issue.  The problem has been fixed by
hard coding the 512 byte sector which is exactly what the bio
code does internally.

With these changes I'm now able to create ZFS pools using 4K
sector drives.  No issues were observed during fairly extensive
testing.  This is also a low risk change if your using 512b
sectors devices because none of the logic changes.

Closes #256
2011-05-27 11:38:53 -07:00
Brian Behlendorf
2b8cad6159 Use vmem_alloc() for zfs_ioc_userspace_many()
The default buffer size when requesting multiple quota entries
is 100 times the zfs_useracct_t size.  In practice this works out
to exactly 27200 bytes.  Since this will be a short lived buffer
in a non-performance critical path it is preferable to vmem_alloc()
the needed memory.
2011-05-20 14:23:18 -07:00
Brian Behlendorf
f01b360e67 Pass caller's credential in zfsdev_ioctl()
Initially when zfsdev_ioctl() was ported to Linux we didn't have
any credential support implemented.  So at the time we simply
passed NULL which wasn't much of a problem since most of the
secpolicy code was disabled.

However, one exception is quota handling which does require the
credential.  Now that proper credentials are supported we can
safely start passing the callers credential.  This is also an
initial step towards fully implemented the zfs secpolicy.
2011-05-20 10:12:25 -07:00
Brian Behlendorf
3fd70ee6b0 Fix 'negative objects to delete' warning
Normally when the arc_shrinker_func() function is called the return
value should be:

   >=0 - To indicate the number of freeable objects in the cache, or
   -1  - To indicate this cache should be skipped

However, when the shrinker callback is called with 'nr_to_scan' equal
to zero.  The caller simply wants the number of freeable objects in
the cache and we must never return -1.  This patch reorders the
first two conditionals in arc_shrinker_func() to ensure this behavior.

This patch also now explictly casts arc_size and arc_c_min to signed
int64_t types so MAX(x, 0) works as expected.  As unsigned types
we would never see an negative value which defeated the purpose of
the MAX() lower bound and broke the shrinker logic.

Finally, when nr_to_scan is non-zero we explictly prevent all reclaim
below arc_c_min.  This is done to prevent the Linux page cache from
completely crowding out the ARC.  This limit is tunable and some
experimentation is likely going to be required to set it exactly right.
For now we're sticking with the OpenSolaris defaults.

Closes #218
Closes #243
2011-05-18 10:29:22 -07:00
Brian Behlendorf
e814770f2e Update synchronous open zfs_close() comment
The comment in zfs_close() pertaining to decrementing the synchronous
open count needs to be updated for Linux.  The code was already
updated to be correct, but the comment was missed and is now misleading.
Under Linux the zfs_close() hook is only called once when the final
reference is dropped.  This differs from Solaris where zfs_close()
is called for each close.

Closes #237
2011-05-13 08:20:06 -07:00
Brian Behlendorf
c91d229809 Merge pull request #235 from nedbass/rdev
Don't store rdev in SA for FIFOs and sockets
2011-05-09 16:41:28 -07:00
Ned A. Bass
aa6d8c1086 Don't store rdev in SA for FIFOs and sockets
Update the handling of named pipes and sockets to be consistent with
other platforms with regard to the rdev attribute.  While all ZFS
ipmlementations store the rdev for device files in a system attribute
(SA), this is not the case for FIFOs and sockets.  Indeed, Linux always
passes rdev=0 to mknod() for FIFOs and sockets, so the value is not
needed.  Add an ASSERT that rdev==0 for FIFOs and sockets to detect if
the expected behavior ever changes.

Closes #216
2011-05-09 13:35:07 -07:00
Brian Behlendorf
21ade34764 Disable direct reclaim for z_wr_* threads
The direct reclaim path in the z_wr_* threads must be disabled
to ensure forward progress is always maintained for txg processing.
This ensures that a txg will never get stuck waiting on itself
because it entered the following memory reclaim callpath.

  ->prune_icache()->dispose_list()->zpl_clear_inode()->zfs_inactive()
  ->dmu_tx_assign()->dmu_tx_wait()->tgx_wait_open()

It would be preferable to target this exact code path but the
kernel offers no way to do this without custom patches.  To avoid
this we are forced to disable all reclaim for these threads.  It
should not be necessary to do this for other other z_* threads
because they will not hold a txg open.

Closes #232
2011-05-06 15:26:26 -07:00
Brian Behlendorf
3117dd0b90 Handle NULL in nfsd .fsync() hook
How nfsd handles .fsync() has been changed a couple of times in the
recent kernels.  But basically there are three cases we need to
consider.

Linux 2.6.12 - 2.6.33
* The .fsync() hook takes 3 arguments
* The nfsd will call .fsync() with a NULL file struct pointer.

Linux 2.6.34
* The .fsync() hook takes 3 arguments
* The nfsd no longer calls .fsync() but instead used sync_inode()

Linux 2.6.35 - 2.6.x
* The .fsync() hook takes 2 arguments
* The nfsd no longer calls .fsync() but instead used sync_inode()

For once it looks like we've gotten lucky.  The first two cases can
actually be collased in to one if we stop using the file struct
pointer entirely.  Since the dentry is still passed in both cases
this is possible.  The last case can then be safely handled by
unconditionally using the dentry in the file struct pointer now
that we know the nfsd caller has been removed.

Closes #230
2011-05-06 12:33:45 -07:00
Brian Behlendorf
34b84cb831 Use vmem_alloc() for zfs_ioc_pool_get_history()
The default buffer size when requesting history is 128k.  This
is far to large for a kmem_alloc() so instead use the slower
vmem_alloc().  This path has no performance concerns and the
buffer is immediately free'd after its contents are copied to
the user space buffer.
2011-05-06 09:59:52 -07:00
Brian Behlendorf
c409e4647f Add missing ZFS tunables
This commit adds module options for all existing zfs tunables.
Ideally the average user should never need to modify any of these
values.  However, in practice sometimes you do need to tweak these
values for one reason or another.  In those cases it's nice not to
have to resort to rebuilding from source.  All tunables are visable
to modinfo and the list is as follows:

$ modinfo module/zfs/zfs.ko
filename:       module/zfs/zfs.ko
license:        CDDL
author:         Sun Microsystems/Oracle, Lawrence Livermore National Laboratory
description:    ZFS
srcversion:     8EAB1D71DACE05B5AA61567
depends:        spl,znvpair,zcommon,zunicode,zavl
vermagic:       2.6.32-131.0.5.el6.x86_64 SMP mod_unload modversions
parm:           zvol_major:Major number for zvol device (uint)
parm:           zvol_threads:Number of threads for zvol device (uint)
parm:           zio_injection_enabled:Enable fault injection (int)
parm:           zio_bulk_flags:Additional flags to pass to bulk buffers (int)
parm:           zio_delay_max:Max zio millisec delay before posting event (int)
parm:           zio_requeue_io_start_cut_in_line:Prioritize requeued I/O (bool)
parm:           zil_replay_disable:Disable intent logging replay (int)
parm:           zfs_nocacheflush:Disable cache flushes (bool)
parm:           zfs_read_chunk_size:Bytes to read per chunk (long)
parm:           zfs_vdev_max_pending:Max pending per-vdev I/Os (int)
parm:           zfs_vdev_min_pending:Min pending per-vdev I/Os (int)
parm:           zfs_vdev_aggregation_limit:Max vdev I/O aggregation size (int)
parm:           zfs_vdev_time_shift:Deadline time shift for vdev I/O (int)
parm:           zfs_vdev_ramp_rate:Exponential I/O issue ramp-up rate (int)
parm:           zfs_vdev_read_gap_limit:Aggregate read I/O over gap (int)
parm:           zfs_vdev_write_gap_limit:Aggregate write I/O over gap (int)
parm:           zfs_vdev_scheduler:I/O scheduler (charp)
parm:           zfs_vdev_cache_max:Inflate reads small than max (int)
parm:           zfs_vdev_cache_size:Total size of the per-disk cache (int)
parm:           zfs_vdev_cache_bshift:Shift size to inflate reads too (int)
parm:           zfs_scrub_limit:Max scrub/resilver I/O per leaf vdev (int)
parm:           zfs_recover:Set to attempt to recover from fatal errors (int)
parm:           spa_config_path:SPA config file (/etc/zfs/zpool.cache) (charp)
parm:           zfs_zevent_len_max:Max event queue length (int)
parm:           zfs_zevent_cols:Max event column width (int)
parm:           zfs_zevent_console:Log events to the console (int)
parm:           zfs_top_maxinflight:Max I/Os per top-level (int)
parm:           zfs_resilver_delay:Number of ticks to delay resilver (int)
parm:           zfs_scrub_delay:Number of ticks to delay scrub (int)
parm:           zfs_scan_idle:Idle window in clock ticks (int)
parm:           zfs_scan_min_time_ms:Min millisecs to scrub per txg (int)
parm:           zfs_free_min_time_ms:Min millisecs to free per txg (int)
parm:           zfs_resilver_min_time_ms:Min millisecs to resilver per txg (int)
parm:           zfs_no_scrub_io:Set to disable scrub I/O (bool)
parm:           zfs_no_scrub_prefetch:Set to disable scrub prefetching (bool)
parm:           zfs_txg_timeout:Max seconds worth of delta per txg (int)
parm:           zfs_no_write_throttle:Disable write throttling (int)
parm:           zfs_write_limit_shift:log2(fraction of memory) per txg (int)
parm:           zfs_txg_synctime_ms:Target milliseconds between tgx sync (int)
parm:           zfs_write_limit_min:Min tgx write limit (ulong)
parm:           zfs_write_limit_max:Max tgx write limit (ulong)
parm:           zfs_write_limit_inflated:Inflated tgx write limit (ulong)
parm:           zfs_write_limit_override:Override tgx write limit (ulong)
parm:           zfs_prefetch_disable:Disable all ZFS prefetching (int)
parm:           zfetch_max_streams:Max number of streams per zfetch (uint)
parm:           zfetch_min_sec_reap:Min time before stream reclaim (uint)
parm:           zfetch_block_cap:Max number of blocks to fetch at a time (uint)
parm:           zfetch_array_rd_sz:Number of bytes in a array_read (ulong)
parm:           zfs_pd_blks_max:Max number of blocks to prefetch (int)
parm:           zfs_dedup_prefetch:Enable prefetching dedup-ed blks (int)
parm:           zfs_arc_min:Min arc size (ulong)
parm:           zfs_arc_max:Max arc size (ulong)
parm:           zfs_arc_meta_limit:Meta limit for arc size (ulong)
parm:           zfs_arc_reduce_dnlc_percent:Meta reclaim percentage (int)
parm:           zfs_arc_grow_retry:Seconds before growing arc size (int)
parm:           zfs_arc_shrink_shift:log2(fraction of arc to reclaim) (int)
parm:           zfs_arc_p_min_shift:arc_c shift to calc min/max arc_p (int)
2011-05-04 10:02:37 -07:00
Brian Behlendorf
5f35b19007 Fully update inode when created
When a new znode/inode pair is created both the znode and the inode
should be immediately updated to the correct values.  This was done
for the znode and for most of the values in the inode, but not all
of them.  This normally wasn't a problem because most subsequent
operations would cause the inode to be immediately updated.  This
change ensures the inode is now fully updated before it is inserted
in to the inode hash.

Closes #116
Closes #146
Closes #164
2011-05-02 14:04:19 -07:00
Brian Behlendorf
df554c148e Fix 'zfs set volsize=N pool/dataset'
This change fixes a kernel panic which would occur when resizing
a dataset which was not open.  The objset_t stored in the
zvol_state_t will be set to NULL when the block device is closed.
To avoid this issue we pass the correct objset_t as the third arg.

The code has also been updated to correctly notify the kernel
when the block device capacity changes.  For 2.6.28 and newer
kernels the capacity change will be immediately detected.  For
earlier kernels the capacity change will be detected when the
device is next opened.  This is a known limitation of older
kernels.

Online ext3 resize test case passes on 2.6.28+ kernels:
$ dd if=/dev/zero of=/tmp/zvol bs=1M count=1 seek=1023
$ zpool create tank /tmp/zvol
$ zfs create -V 500M tank/zd0
$ mkfs.ext3 /dev/zd0
$ mkdir /mnt/zd0
$ mount /dev/zd0 /mnt/zd0
$ df -h /mnt/zd0
$ zfs set volsize=800M tank/zd0
$ resize2fs /dev/zd0
$ df -h /mnt/zd0

Original-patch-by: Fajar A. Nugraha <github@fajar.net>
Closes #68
Closes #84
2011-05-02 08:54:40 -07:00
Gunnar Beutner
055656d4f4 Implemented NFS export_operations.
Implemented the required NFS operations for exporting ZFS datasets
using the in-kernel NFS daemon.
2011-04-29 12:36:13 -07:00
Brian Behlendorf
5476e6952c Suppress 'vdev_metaslab_init' memory warning
The vdev_metaslab_init() function has been observed to allocate
larger than 8k chunks.  However, they are not much larger than 8k
and it does this infrequently so it is allowed and the warning is
supressed.
2011-04-27 09:35:18 -07:00
Brian Behlendorf
40a39e1103 Conserve stack in dsl_scan_visit()
The dsl_scan_visit() function is a little heavy weight taking 464
bytes on the stack.  This can be easily reduced for little cost by
moving zap_cursor_t and zap_attribute_t off the stack and on to the
heap.  After this change dsl_scan_visit() has been reduced in size
by 320 bytes.

This change was made to reduce stack usage in the dsl_scan_sync()
callpath which is recursive and has been observed to overflow the
stack.

Issue #174
2011-04-26 15:48:00 -07:00
Brian Behlendorf
b81c4ac9af Conserve stack in dsl_scan_visitbp()
This function is called recursively so everything possible must be
done to limit its stack consumption.  The dprintf_bp() debugging
function adds 30 bytes of local variables to the function we cannot
afford.  By commenting out this debugging we save 30 bytes per
recursion and depths of 13 are not uncommon.  This yeilds a total
stack saving of 390 bytes on our 8k stack.

Issue #174
2011-04-26 15:48:00 -07:00
Brian Behlendorf
7a060636b0 Conserve stack in dsl_scan_visitbp()
The recursive call chain dsl_scan_visitbp() -> dsl_scan_recurse() ->
dsl_scan_visitdnode() -> dsl_scan_visitbp has been observed to consume
considerable stack resulting in a stack overflow (>8k).  The cleanest
way I see to fix this with minimal impact to the existing flow of
code, and with the fewest performance concerns, is to always inline
dsl_scan_recurse() and dsl_scan_visitdnode().  While this will increase
the function size of dsl_scan_visitbp(), by 4660 bytes, it also reduces
the stack requirements by removing the function call overhead.

Issue #174
2011-04-26 13:37:35 -07:00
Brian Behlendorf
701b1f8168 Fix zvol deadlock
It's possible for a zvol_write thread to enter direct memory reclaim
while holding open a transaction group.  This results in the system
attempting to write out data to the disk to free memory.  Unfortunately,
this can't succeed because the the thread doing reclaim is holding open
the txg which must be closed to be synced to disk.  To prevent this
the offending allocation is marked KM_PUSHPAGE which will prevent it
from attempting writeback.

Closes #191
2011-04-26 12:56:35 -07:00
Brian Behlendorf
e2448b0e62 Fix spurious -EFAULT when setting I/O scheduler
Occasionally we would see an -EFAULT returned when setting the
I/O scheduler on a vdev.  This was caused an improperly formatted
user mode helper command.

This commit restructures the command to something simpler, allocates
space for it dynamically to save stack, and removes the retry logic
which is no longer needed.

Closes #169
2011-04-22 14:55:35 -07:00
Brian Behlendorf
6a8f9b6bf0 Enforce ARC meta-data limits
This change ensures the ARC meta-data limits are enforced.  Without
this enforcement meta-data can grow to consume all of the ARC cache
pushing out data and hurting performance.  The cache is aggressively
reclaimed but this is a soft and not a hard limit.  The cache may
exceed the set limit briefly before being brought under control.

By default 25% of the ARC capacity can be used for meta-data.  This
limit can be tuned by setting the 'zfs_arc_meta_limit' module option.
Once this limit is exceeded meta-data reclaim will occur in 3 percent
chunks, or may be tuned using 'arc_reduce_dnlc_percent'.

Closes #193
2011-04-21 13:49:31 -07:00
Gunnar Beutner
36df284366 Fixed a use-after-free bug in zfs_zget().
Fixed a bug where zfs_zget could access a stale znode pointer when
the inode had already been removed from the inode cache via iput ->
iput_final -> ... -> zfs_zinactive but the corresponding SA handle
was still alive.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #180
2011-04-21 13:48:01 -07:00
Brian Behlendorf
d247f2a3cc Suppress 'zfs receive' memory warning
As part of zfs_ioc_recv() a zfs_cmd_t is allocated in the kernel
which is 17808 bytes in size.  This sort of thing in general should
be avoided.  However, since this should be an infrequent event for
now we allow it and simply suppress the warning with the KM_NODEBUG
flag.  This can be revisited latter if/when it becomes an issue.

Closes #178
2011-04-20 10:22:31 -07:00
Gunnar Beutner
bec30953cd Truncate the xattr znode when updating existing attributes.
If the attribute's new value was shorter than the old one the old
code would leave parts of the old value in the xattr znode.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #203
2011-04-19 14:14:40 -07:00
Gunnar Beutner
274b7e79f3 Added missing initialization for va.va_dentry in zfs_get_xattrdir.
Without this we may mistakenly believe we have a dentry and try to
d_instantiate() it.  This will result in the following BUG.  It's
important to note that while the xattr directory has an inode
assoicated with it we never create a dentry for it.

  kernel BUG at fs/dcache.c:1418!

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #202
2011-04-19 13:57:41 -07:00
Brian Behlendorf
0fe3d820f5 Fix gcc compiler warning, dsl_pool_create()
When compiling ZFS in user space gcc-4.6.0 correctly identifies
the variable 'os' as being set but never used.  This generates a
warning and a build failure when using --enable-debug.  However,
the code is correct we only want to use 'os' for the kernel space
builds.  To suppress the warning the call was wrapped with a
VERIFY() which has the nice side effect of ensuring the 'os'
actually never is NULL.  This was observed under Fedora 15.

  module/zfs/dsl_pool.c: In function ‘dsl_pool_create’:
  module/zfs/dsl_pool.c:229:12: error: variable ‘os’ set but not used
  [-Werror=unused-but-set-variable]
2011-04-19 09:04:51 -07:00
Brian Behlendorf
e30c0ada6d Linux 2.6.39 compat, invalidate_inodes()
Update code to use the spl_invalidate_inodes() wrapper.  This hides
some of the complexity of determining if invalidate_inodes() was
exported, and if so what is its prototype.  The second argument
of spl_invalidate_inodes() determined the behavior of how dirty
inodes are handled.  By passing a zero we are indicated that we
want those inodes to be treated as busy and skipped.
2011-04-19 08:57:23 -07:00
Brian Behlendorf
0d3ac5e735 Linux 2.6.29 compat, credentials
The .sync_fs fix as applied did not use the updated SPL credential
API.  This broke builds on Debian Lenny, this change applies the
needed fix to use the portable API.  The original credential changes
are part of commit 81e97e2187.
2011-04-07 14:27:09 -07:00
Brian Behlendorf
eec8164771 Fix ASSERTION(!dsl_pool_sync_context(tx->tx_pool))
Disable the normal reclaim path for the txg_sync thread.  This
ensures the thread will never enter dmu_tx_assign() which can
otherwise occur due to direct reclaim.  If this is allowed to
happen the system can deadlock.  Direct reclaim call path:

  ->shrink_icache_memory->prune_icache->dispose_list->
  clear_inode->zpl_clear_inode->zfs_inactive->dmu_tx_assign
2011-04-07 09:52:16 -07:00
Brian Behlendorf
7cb67b45f3 Add direct+indirect ARC reclaim
Under OpenSolaris all memory reclaim is done asyncronously.  Under
Linux memory reclaim is done asynchronously _and_ synchronously.
When a process allocates memory with GFP_KERNEL it explicitly allows
the kernel to do reclaim on its behalf to satify the allocation.
If that GFP_KERNEL allocation fails the kernel may take more drastic
measures to reclaim the memory such as killing user space processes.

This was observed to happen with ZFS because the ARC could consume
a large fraction of the system memory but no synchronous reclaim
could be performed on it.  The result was GFP_KERNEL allocations
could fail resulting in OOM events, and only moments latter the
arc_reclaim thread would free unused memory from the ARC.

This change leaves the arc_thread in place to manage the fundamental
ARC behavior.  But it adds a synchronous (direct) reclaim path for
the ARC which can be called when memory is badly needed.  It also
adds an asynchronous (indirect) reclaim path which is called
much more frequently to prune the ARC slab caches.
2011-04-07 09:52:10 -07:00
Brian Behlendorf
1834f2d8b7 Add missing arcstats
The following useful values were missing the arcstats.  This change
adds them in to provide greater visibility in to the arcs behavior.

arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_meta_used                   4    624774592
arc_meta_limit                  4    400785408
arc_meta_max                    4    625594176
2011-04-07 09:52:05 -07:00
Brian Behlendorf
c85b224faf Call d_instantiate before unlocking inode
Under Linux a dentry referencing an inode must be instantiated before
the inode is unlocked.  To accomplish this without overly modifing
the core ZFS code the dentry it passed via the vattr_t.  There are
cases such as replay when a dentry is not available.  In which case
it is obviously not initialized at inode creation time, if a dentry
is needed it will be spliced as when required via d_lookup().
2011-04-07 09:51:57 -07:00
Brian Behlendorf
d433c20651 Fix make distclean for `./configure --with-config=user
Making distclean in module
    make[1]: Entering directory `/zfs/module'
    make -C  SUBDIRS=`pwd`  clean
    make: Entering an unknown directory
    make: *** SUBDIRS=/zfs/module: No such file or directory.  Stop.

When using --with-config=user the 'distclean' target would fail
because it assumes the kernel configuration infrastrure is set up.
This is not the case, nor does it need to be, because the
'--with-config=user' option will prune the entire ./module subtree
from SUBDIRS.  This prevents most build rules from operating in the
./module directory.

However, the 'dist*' rules will still traverse this directory
because it is listed in DIST_SUBDIRS.  This is correct because we
need to ensure the dist rules package the directory contents
regardless of the configuration for the 'dist' rule.  The correct
way to handle this is to only invoke the kernel build system as
part of the 'clean' rule when CONFIG_KERNEL_TRUE is set.

Initial fix provided by Darik Horn <dajhorn@vanadac.com>.
This commit is a slightly refined form of the original.
2011-04-05 13:33:28 -07:00
Brian Behlendorf
bfd214af01 Fix inflated load average
Kernel threads which sleep uninterruptibly on Linux are marked in the (D)
state.  These threads are usually in the process of performing IO and are
thus counted against the load average.  The txg_quiesce and txg_sync threads
were always sleeping uninterruptibly and thus inflating the load average.

This change makes them sleep interruptibly.  Some care is required however
because these threads may now be woken early by signals.  In this case the
callers are all careful to check that the required conditions are met after
waking up.  If we're woken early due to a signal they will simply go back
to sleep.  In this case these changes are safe.

Closes #175
2011-03-31 17:07:12 -07:00
Brian Behlendorf
7a1cdc0775 Linux 2.6.29 compat, .freeze_fs/.unfreeze_fs
The .freeze_fs/.unfreeze_fs hooks were not added until Linux 2.6.29
Since these hooks are currently unused they are being removed to
allow support of older kernels.
2011-03-22 12:17:24 -07:00
Brian Behlendorf
81e97e2187 Linux 2.6.29 compat, credentials
As of Linux 2.6.29 a clean credential API was added to the Linux kernel.
Previously the credential was embedded in the task_struct.  Because the
SPL already has considerable support for handling this API change the
ZPL code has been updated to use the Solaris credential API.
2011-03-22 12:15:54 -07:00
Brian Behlendorf
d6bd8eaae4 Fix evict() deadlock
Now that KM_SLEEP is not defined as GFP_NOFS there is the possibility
of synchronous reclaim deadlocks.  These deadlocks never existed in the
original OpenSolaris code because all memory reclaim on Solaris is done
asyncronously.  Linux does both synchronous (direct) and asynchronous
(indirect) reclaim.

This commit addresses a deadlock caused by inode eviction.  A KM_SLEEP
allocation may trigger direct memory reclaim and shrink the inode cache.
This can occur while a mutex in the array of ZFS_OBJ_HOLD mutexes is
held.  Through the ->shrink_icache_memory()->evict()->zfs_inactive()->
zfs_zinactive() call path the same mutex may be reacquired resulting
in a deadlock.  To avoid this deadlock the process must not reacquire
the mutex when it is already holding it.

This is a reasonable fix for now but longer term the ZFS_OBJ_HOLD
mutex locking should be reevaluated.  This infrastructure already
prevents us from ever using the Linux lock dependency analysis tools,
and it may limit scalability.
2011-03-22 12:14:55 -07:00
Brian Behlendorf
691f6ac4c2 Use KM_PUSHPAGE instead of KM_SLEEP
It used to be the case that all KM_SLEEP allocations were GFS_NOFS.
Unfortunately this often resulted in the kernel being unable to
reclaim the ARC, inode, and dentry caches in a timely manor.
The fix was to make KM_SLEEP a GFP_KERNEL allocation in the SPL.

However, this increases the posibility of deadlocking the system
on a zfs write thread.  If a zfs write thread attempts to perform
an allocation it may trigger synchronous reclaim.  This reclaim
may attempt to flush dirty data/inode to disk to free memory.
Unforunately, this write cannot finish because the write thread
which would handle it is holding the previous transaction open.
Deadlock.

To avoid this all allocations in the zfs write thread path must
use KM_PUSHPAGE which prohibits synchronous reclaim for that
thread.  In this way forward progress in ensured.  The risk
with this change is I missed updating an allocation for the
write threads leaving an increased posibility of deadlock.  If
any deadlocks remain they will be unlikely but we'll have to
make sure they all get fixed.
2011-03-22 12:14:55 -07:00
Brian Behlendorf
0de19dad9c Register .remount_fs handler
Register the missing .remount_fs handler.  This handler isn't strictly
required because the VFS does a pretty good job updating most of the
MS_* flags.  However, there's no harm in using the hook to call the
registered zpl callback for various MS_* flags.  Additionaly, this
allows us to lay the ground work for more complicated argument parsing
in the future.
2011-03-15 13:33:29 -07:00
Brian Behlendorf
03f9ba9d99 Register .sync_fs handler
Register the missing .sync_fs handler.  This is a noop in most cases
because the usual requirement is that sync just be initiated.  As part
of the DMU's normal transaction processing txgs will be frequently
synced.  However, when the 'wait' flag is set the requirement is that
.sync_fs must not return until the data is safe on disk.  With the
addition of the .sync_fs handler this is now properly implemented.
2011-03-15 13:33:29 -07:00
Brian Behlendorf
04516a45b2 Don't set I/O Scheduler for Partitions
ZFS should only change the i/o scheduler for a disk when it has
ownership of the whole disk.  This is basically the same logic as
adjusting the write cache behavior on a disk.  This change updates
the vdev disk code to skip partitions when setting the i/o scheduler.

Closes #152
2011-03-10 13:34:17 -08:00
Brian Behlendorf
adf2e8778e Fix O_APPEND Corruption
Due to an uninitialized variable files opened with O_APPEND may
overwrite the start of the file rather than append to it.  This
was introduced accidentally when I removed the Solaris vnodes.

The zfs_range_lock_writer() function used to key off zf->z_vnode
to determine if a znode_t was for a zvol of zpl object.  With
the removal of vnodes this was replaced by the flag zp->z_is_zvol.
This flag was used to control the append behavior for range locks.

Unfortunately, this value was never properly initialized after
the vnode removal.  However, because most of memory is usually
zeros it happened to be set correctly most of the time making
the bug appear racy.  Properly initializing zp->z_is_zvol to
zero completely resolves the problem with O_APPEND.

Closes #126
2011-03-09 13:31:00 -08:00
Brian Behlendorf
17c37660a1 Conserve stack in zfs_setattr()
Move 'bulk' and 'xattr_bulk' from the stack to the heap to minimize
stack space usage.  These two arrays consumed 448 bytes on the stack
and have been replaced by two 8 byte points for a total stack space
saving of 432 bytes.  The zfs_setattr() path had been previously
observed to overrun the stack in certain circumstances.
2011-03-09 13:30:03 -08:00
Brian Behlendorf
450dc149bd Range lock performance improvements
The original range lock implementation had to be modified by commit
8926ab7 because it was unsafe on Linux.  In particular, calling
cv_destroy() immediately after cv_broadcast() is dangerous because
the waiters may still be asleep.  Thus the following cv_destroy()
will free memory which may still be in use.

This was fixed by updating cv_destroy() to block on waiters but
this in turn introduced a deadlock.  The deadlock was resolved
with the use of a taskq to move the offending free outside the
range lock.  This worked well but using the taskq for the free
resulted in a serious performace hit.  This is somewhat ironic
because at the time I felt using the taskq might improve things
by making the free asynchronous.

This patch refines the original fix and moves the free from the
taskq to a private free list.  Then items which must be free'd
are simply inserted in to the list.  When the range lock is dropped
it's safe to free the items.  The list is walked and all rl_t
entries are freed.

This change improves small cached read performance by 26x.  This
was expected because for small reads the number of locking calls
goes up significantly.  More surprisingly this change significantly
improves large cache read performance.  This probably attributable
to better cpu/memory locality.  Very likely the same processor
which allocated the memory is now freeing it.

bs	ext3	zfs	zfs+fix		faster
----------------------------------------------
512     435     3       79      	26x
1k      820     7       160     	22x
2k      1536    14      305     	21x
4k      2764    28      572     	20x
8k      3788    50      1024    	20x
16k     4300    86      1843    	21x
32k     4505    138     2560    	18x
64k     5324    252     3891    	15x
128k    5427    276     4710    	17x
256k    5427    413     5017    	12x
512k    5427    497     5324    	10x
1m      5427    521     5632    	10x

Closes #142
2011-03-08 12:44:06 -08:00
Brian Behlendorf
126400a1ca Add zfs_open()/zfs_close()
In the original implementation the zfs_open()/zfs_close() hooks
were dropped for simplicity.  This was functional but not 100%
correct with the expected ZFS sematics.  Updating and re-adding the
zfs_open()/zfs_close() hooks resolves the following issues.

1) The ZFS_APPENDONLY file attribute is once again honored.  While
there are still no Linux tools to set/clear these attributes once
there are it should behave correctly.

2) Minimal virus scan file attribute hooks were added.  Once again
this support in disabled but the infrastructure is back in place.

3) Most importantly correctly handle assigning files which were
opened syncronously to the intent log.  Without this change O_SYNC
modifications could be lost during a system crash even though they
were marked synchronous.
2011-03-08 11:04:51 -08:00
Brian Behlendorf
53cf50e081 Set stat->st_dev and statfs->f_fsid
Filesystems like ZFS must use what the kernel calls an anonymous super
block.  Basically, this is just a filesystem which is not backed by a
single block device.  Normally this block device's dev_t is stored in
the super block.  For anonymous super blocks a unique reserved dev_t
is assigned as part of get_sb().

This sb->s_dev must then be set in the returned stat structures as
stat->st_dev.  This allows userspace utilities to easily detect the
boundries of a specific filesystem.  Tools such as 'du' depend on this
for proper accounting.

Additionally, under OpenSolaris the statfs->f_fsid is set to the device
id.  To preserve consistency with OpenSolaris we also set the fsid to
the device id.  Other Linux filesystem (ext) set the fsid to a unique
value determined by the filesystems uuid.  This value is unique but
maintains no relationship to the device id.  This may be desirable
when exporting NFS filesystem because it minimizes to chance of a
client observing the same fsid from two different servers.

Closes #140
2011-03-07 16:06:22 -08:00
Brian Behlendorf
6742abf9ec Use Linux ATTR_ versions
The AT_ versions of these macros are used on Solaris and while they
map to their Linux equivilants the code has been updated to use the
ATTR_ versions.
2011-03-03 11:29:15 -08:00
Brian Behlendorf
f4ea75d492 Conserve stack in zfs_setattr()
Move 'tmpxvattr' from the stack to the heap to minimize stack
space usage.  This is enough to get us below the 1024 byte stack
frame warning.  That however is still a large stack frame and it
should be further reduced by moving the 'bulk' and 'xattr_bulk'
sa_bulk_attr_t variables to the heap in a future patch.
2011-03-02 14:18:58 -08:00
Brian Behlendorf
5484965ab6 Drop HAVE_XVATTR macros
When I began work on the Posix layer it immediately became clear to
me that to integrate cleanly with the Linux VFS certain Solaris
specific things would have to go.  One of these things was to elimate
as many Solaris specific types from the ZPL layer as possible.  They
would be replaced with their Linux equivalents.  This would not only
be good for performance, but for the general readability and health of
the code.  The Solaris and Linux VFS are different beasts and should
be treated as such.  Most of the code remains common for constructing
transactions and such, but there are subtle and important differenced
which need to be repsected.

This policy went quite for for certain types such as the vnode_t,
and it initially seemed to be working out well for the vattr_t.  There
was a relatively small amount of related xvattr_t code I was forced to
comment out with HAVE_XVATTR.  But it didn't look that hard to come
back soon and replace it all with a native Linux type.

However, after going doing this path with xvattr some distance it
clear that this code was woven in the ZPL more deeply than I thought.
In particular its hooks went very deep in to the ZPL replay code
and replacing it would not be as easy as I originally thought.

Rather than continue persuing replacing and removing this code I've
taken a step back and reevaluted things.  This commit reverts many of
my previous commits which removed xvattr related code.  It restores
much of the code to its original upstream state and now relies on
improved xvattr_t support in the zfs package itself.

The result of this is that much of the code which I had commented
out, which accidentally broke things like replay, is now back in
place and working.  However, there may be a small performance
impact for getattr/setattr operations because they now require
a translation from native Linux to Solaris types.  For now that's
a price I'm willing to pay.  Once everything is completely functional
we can revisting the issue of removing the vattr_t/xvattr_t types.

Closes #111
2011-03-02 11:44:34 -08:00
Brian Behlendorf
9623f736d9 Remove caller_context_t
Remove the remaining callers of caller_context_t.  This type has
been removed because it is not needed for the Linux port.
2011-03-02 11:35:35 -08:00
Darik Horn
a23cc0a443 Add the zpool and filesystem versions
Print the supported zpool and filesystem versions at module load
time.  This change removes an ambiguity and adds information that
system administrators care about.  The phrase "ZFS pool version %s"
is the same as zpool upgrade -v so that the operator is familiar
with the message.

  ZFS: Loaded module v0.6.0, ZFS pool version 28, ZFS filesystem version 5
  ZFS: Unloaded module v0.6.0

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-02-28 09:46:23 -08:00
Brian Behlendorf
fdcd952b4d Fix set block scheduler warnings
There were two cases when attempting to set the vdev block device
scheduler which would causes console warnings.

The first case was when the vdev used a loop, ram, dm, or other
such device which doesn't support a configurable scheduler.  In
these cases attempting to set a scheduler is pointless and can
be safely skipped.

The secord case is slightly more troubling.  We were seeing
transient cases where setting the elevator would return -EFAULT.
On retry everything is fine so there appears to be a small window
where this is possible.  To handle that case we silently retry
up to three times before reporting the warning.

In all of the above cases the warning is harmless and at worse you
may see slightly different performance characteristics from one
or more of your vdevs.
2011-02-25 11:37:11 -08:00
Fajar A. Nugraha
4c0d8e50b9 Use udev to create /dev/zvol/[dataset_name] links
This commit allows zvols with names longer than 32 characters, which
fixes issue on https://github.com/behlendorf/zfs/issues/#issue/102.

Changes include:
- use /dev/zd* device names for zvol, where * is the device minor
  (include/sys/fs/zfs.h, module/zfs/zvol.c).
- add BLKZNAME ioctl to get dataset name from userland
  (include/sys/fs/zfs.h, module/zfs/zvol.c, cmd/zvol_id).
- add udev rule to create /dev/zvol/[dataset_name] and the legacy
  /dev/[dataset_name] symlink. For partitions on zvol, it will create
  /dev/zvol/[dataset_name]-part* (etc/udev/rules.d/60-zvol.rules,
  cmd/zvol_id).

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-02-25 09:43:19 -08:00
Brian Behlendorf
dc1d7665c5 Remove rdev packing
Remove custom code to pack/unpack dev_t's.  Under Linux all dev_t's
are an unsigned 32-bit value even on 64-bit platforms.  The lower
20 bits are used for the minor number and the upper 12 for the major
number.

This means if your importing a pool from Solaris you may get strange
major/minor numbers.  But it doesn't really matter because even if
we add compatibility code to translate the encoded Solaris major/minor
they won't do you any good under Linux.  You will still need to
recreate the dev_t with a major/minor which maps to reserved major
numbers used under Linux.

Dropping this code also resolves 32-bit builds by removing the
offending 32-bit compatibility code.
2011-02-23 15:13:03 -08:00
Brian Behlendorf
99c564bc48 Use correct ASSERT3* variant
ASSERT3P should be used instead of ASSERT3U when comparing
pointers.  Using ASSERT3U with the cast causes a compiler
warning for 32-bit builds which is fatal with --enable-debug.
2011-02-23 15:03:30 -08:00
Brian Behlendorf
05ff35c602 Increase fragment size to block size
The underlying storage pool actually uses multiple block
size.  Under Solaris frsize (fragment size) is reported as
the smallest block size we support, and bsize (block size)
as the filesystem's maximum block size.  Unfortunately,
under Linux the fragment size and block size are often used
interchangeably.  Thus we are forced to report both of them
as the filesystem's maximum block size.

Closes #112
2011-02-23 14:00:06 -08:00
Brian Behlendorf
f6dcdf13f8 Fix 'statement with no effect' warning
Because the secpolicy_* macros are all currently defined to (0).
And because the caller of this function does not check the return
code.  The compiler complains that this statement has no effect
which is correct and OK.  To suppress the warning explictly cast
the result to (void).
2011-02-23 13:03:19 -08:00
Brian Behlendorf
a31a70bbd1 Fix enum compiler warning
Generally it's a good idea to use enums for switch statements,
but in this case it causes warning because the enum is really a
set of flags.  These flags are OR'ed together in some cases
resulting in values which are not part of the original enum.
This causes compiler warning such as this about invalid cases.

  error: case value ‘33’ not in enumerated type ‘zprop_source_t’

To handle this we simply case the enum to an int for the switch
statement.  This leaves all other enum type checking in place
and effectively disabled these warnings.
2011-02-23 12:52:51 -08:00
Brian Behlendorf
61e909608d Linux 2.6.x compat, blkdev_compat.h
For legacy reasons the zvol.c and vdev_disk.c Linux compatibility
code ended up in sys/blkdev.h and sys/vdev_disk.h headers.  While
there are worse places for this code to live it should be in a
linux/blkdev_compat.h header.  This change moves this block device
Linux compatibility code in to the linux/blkdev_compat.h header
and updates all the correct #include locations.  This is not a
functional change or bug fix, it is just code cleanup.
2011-02-23 12:29:38 -08:00
Brian Behlendorf
5d0265c0dd Merge branch 'zpl' 2011-02-18 09:31:25 -08:00
Brian Behlendorf
037849f854 Use provided uid/gid for setattr
When changing the uid/gid of a file via zfs_setattr() use the
Posix id passed in iattr->ia_uid/gid.  While the zfs_fuid_create()
code already had the fuid support disabled for Linux it was
returning the uid/gid from the credential.  With this change
the 'chown' command which relies on setxattr is now working
properly.

Also remove a little stray white space which was in front of
zfs_update_inode() call and the end of zfs_setattr().
2011-02-17 14:23:48 -08:00
Brian Behlendorf
efd1832bc6 Fix symlink(2) inode reference count
Under Linux sys_symlink(2) should result in a inode being created
with one reference for the inode itself, and a second reference on
the inode which is held by the new dentry.  Under Solaris this
appears not to be the case.  Their zfs_symlink() handler drops
the inode reference before returning.

The result of this under Linux is that the reference count for
symlinks is always one smaller than it should have been. This
results in a BUG() when the symlink is unlinked.  To handle this
the Linux port now keeps the inode reference which differs from
the Solaris behavior.  This results in correct reference counts.

Closes #96
2011-02-17 11:34:47 -08:00
Brian Behlendorf
5095000169 Use -zfs_readlink() error
The zfs_readlink() function returns a Solaris positive error value
and that needs to be converted to a Linux negative error value.
While in this case nothing would actually go wrong, it's still
incorrect and should be fixed if for no other reason than clarity.
2011-02-17 09:48:06 -08:00
Brian Behlendorf
8b4f9a2d55 Fix readlink(2)
This patch addresses three issues related to symlinks.

1) Revert the zfs_follow_link() function to a modified version
of the original zfs_readlink().  The only changes from the
original OpenSolaris version relate to using Linux types.
For the moment this means no vnode's and no zfsvfs_t.  The
caller zpl_follow_link() was also updated accordingly.  This
change was reverted because it was slightly gratuitious.

2) Update zpl_follow_link() to use local variables for the
link buffer.  I'd forgotten that iov.iov_base is updated by
uiomove() so after the call to zfs_readlink() it can not longer
be used.  We need our own private copy of the link pointer.

3) Allocate MAXPATHLEN instead of MAXPATHLEN+1.  By default
MAXPATHLEN is 4096 bytes which is a full page, adding one to
it pushes it slightly over a page.  That means you'll likely
end up allocating 2 pages which is wasteful of memory and
possibly slightly slower.
2011-02-16 15:54:55 -08:00
Ricardo M. Correia
54a179e7b8 Add API to wait for pending commit callbacks
This adds an API to wait for pending commit callbacks of already-synced
transactions to finish processing.  This is needed by the DMU-OSD in
Lustre during device finalization when some callbacks may still not be
called, this leads to non-zero reference count errors.  See lustre.org
bug 23931.
2011-02-16 11:20:06 -08:00
Brian Behlendorf
a6695d83b7 Add get/setattr, get/setxattr hooks
While the attr/xattr hooks were already in place for regular
files this hooks can also apply to directories and special files.
While they aren't typically used in this way, it should be
supported.  This patch registers these additional callbacks
for both directory and special inode types.
2011-02-16 09:55:53 -08:00
Brian Behlendorf
d8fd10545b Fix FIFO and socket handling
Under Linux when creating a fifo or socket type device in the ZFS
filesystem it's critical that the rdev is stored in a SA.  This
was already being correctly done for character and block devices,
but that logic needed to be extended to include FIFOs and sockets.

This patch takes care of device creation but a follow on patch
may still be required to verify that the dev_t is being correctly
packed/unpacked from the SA.
2011-02-16 09:51:44 -08:00
Brian Behlendorf
d567444809 Create minors for all zvols
It was noticed that when you have zvols in multiple datasets
not all of the zvol devices are created at module load time.
Fajarnugraha did the leg work to identify that the root cause of
this bug is a non-zero return value from zvol_create_minors_cb().

Returning a non-zero value from the dmu_objset_find_spa() callback
function results in aborting processing the remaining children in
a dataset.  Since we want to ensure that the callback in run on
all children regardless of error simply unconditionally return
zero from the zvol_create_minors_cb().  This callback function
is solely used for this purpose so surpressing the error is safe.

Closes #96
2011-02-16 09:50:06 -08:00
Brian Behlendorf
2c395def27 Linux 2.6.36 compat, sops->evict_inode()
The new prefered inteface for evicting an inode from the inode cache
is the ->evict_inode() callback.  It replaces both the ->delete_inode()
and ->clear_inode() callbacks which were previously used for this.
2011-02-11 13:47:51 -08:00
Brian Behlendorf
f9637c6c8b Linux 2.6.33 compat, get/set xattr callbacks
The xattr handler prototypes were sanitized with the idea being that
the same handlers could be used for multiple methods.  The result of
this was the inode type was changes to a dentry, and both the get()
and set() hooks had a handler_flags argument added.  The list()
callback was similiarly effected but no autoconf check was added
because we do not use the list() callback.
2011-02-11 10:41:00 -08:00
Brian Behlendorf
7268e1bec8 Linux 2.6.35 compat, fops->fsync()
The fsync() callback in the file_operations structure used to take
3 arguments.  The callback now only takes 2 arguments because the
dentry argument was determined to be unused by all consumers.  To
handle this a compatibility prototype was added to ensure the right
prototype is used.  Our implementation never used the dentry argument
either so it's just a matter of using the right prototype.
2011-02-11 09:05:51 -08:00
Brian Behlendorf
777d4af891 Linux 2.6.35 compat, const struct xattr_handler
The const keyword was added to the 'struct xattr_handler' in the
generic Linux super_block structure.  To handle this we define an
appropriate xattr_handler_t typedef which can be used.  This was
the preferred solution because it keeps the code clean and readable.
2011-02-10 16:29:00 -08:00
Brian Behlendorf
6839eed23e Use 'noop' IO Scheduler
Initial testing has shown the the right IO scheduler to use under Linux
is noop.  This strikes the ideal balance by allowing the zfs elevator
to do all request ordering and prioritization.  While allowing the
Linux elevator to do the maximum front/back merging allowed by the
physical device.  This yields the largest possible requests for the
device with the lowest total overhead.

While 'noop' should be right for your system you can choose a different
IO scheduler with the 'zfs_vdev_scheduler' option.  You may set this
value to any of the standard Linux schedulers: noop, cfq, deadline,
anticipatory.  In addition, if you choose 'none' zfs will not attempt
to change the IO scheduler for the block device.
2011-02-10 09:27:22 -08:00
Brian Behlendorf
4db77a74a6 Suppress large kmem_alloc() warning
The following warning was observed under normal operation.  It's
not fatal but it's something to be addressed long term.  Flag the
offending allocation with KM_NODEBUG to suppress the warning and
flag the call site.

SPL: Showing stack for process 21761
Pid: 21761, comm: iozone Tainted: P           ----------------
2.6.32-71.14.1.el6.x86_64 #1
Call Trace:
 [<ffffffffa05465a7>] spl_debug_dumpstack+0x27/0x40 [spl]
 [<ffffffffa054a84d>] kmem_alloc_debug+0x11d/0x130 [spl]
 [<ffffffffa05de166>] dmu_buf_hold_array_by_dnode+0xa6/0x4e0 [zfs]
 [<ffffffffa05de825>] dmu_buf_hold_array+0x65/0x90 [zfs]
 [<ffffffffa05de891>] dmu_read_uio+0x41/0xd0 [zfs]
 [<ffffffffa0654827>] zfs_read+0x147/0x470 [zfs]
 [<ffffffffa06644a2>] zpl_read_common+0x52/0x70 [zfs]
 [<ffffffffa0664503>] zpl_read+0x43/0x70 [zfs]
 [<ffffffff8116d905>] vfs_read+0xb5/0x1a0
 [<ffffffff8116da41>] sys_read+0x51/0x90
 [<ffffffff81013172>] system_call_fastpath+0x16/0x1b
2011-02-10 09:27:22 -08:00
Brian Behlendorf
ceb43b935d Invalidate dcache and inode cache
When performing a 'zfs rollback' it's critical to invalidate
the previous dcache and inode cache.  If we don't there will
stale cache entries which when accessed will result in EIOs.
2011-02-10 09:27:22 -08:00
Brian Behlendorf
8926ab7a50 Move cv_destroy() outside zp->z_range_lock()
With the recent SPL change (d599e4fa) that forces cv_destroy()
to block until all waiters have been woken.  It is now unsafe
to call cv_destroy() under the zp->z_range_lock() because it
is used as the condition variable mutex.  If there are waiters
cv_destroy() will block until they wake up and aquire the mutex.
However, they will never aquire the mutex because cv_destroy()
will not return allowing it's caller to drop the lock.  Deadlock.

To avoid this cv_destroy() is now run asynchronously in a taskq.
This solves two problems:

1) It is no longer run under the zp->z_range_lock so no deadlock.
2) Since cv_destroy() may now block we don't want this slowing
   down zfs_range_unlock() and throttling the system.

This was not as much of an issue under OpenSolaris because their
cv_destroy() implementation does not do anything.  They do however
risk a bad paging request if cv_destroy() returns, the memory holding
the condition variable is free'd, and then the waiters wake up and
try to reference it.  It's a very small unlikely race, but it is
possible.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
c0d35759c5 Add mmap(2) support
It's worth taking a moment to describe how mmap is implemented
for zfs because it differs considerably from other Linux filesystems.
However, this issue is handled the same way under OpenSolaris.

The issue is that by design zfs bypasses the Linux page cache and
leaves all caching up to the ARC.  This has been shown to work
well for the common read(2)/write(2) case.  However, mmap(2)
is problem because it relies on being tightly integrated with the
page cache.  To handle this we cache mmap'ed files twice, once in
the ARC and a second time in the page cache.  The code is careful
to keep both copies synchronized.

When a file with an mmap'ed region is written to using write(2)
both the data in the ARC and existing pages in the page cache
are updated.  For a read(2) data will be read first from the page
cache then the ARC if needed.  Neither a write(2) or read(2) will
will ever result in new pages being added to the page cache.

New pages are added to the page cache only via .readpage() which
is called when the vfs needs to read a page off disk to back the
virtual memory region.  These pages may be modified without
notifying the ARC and will be written out periodically via
.writepage().  This will occur due to either a sync or the usual
page aging behavior.  Note because a read(2) of a mmap'ed file
will always check the page cache first even when the ARC is out
of date correct data will still be returned.

While this implementation ensures correct behavior it does have
have some drawbacks.  The most obvious of which is that it
increases the required memory footprint when access mmap'ed
files.  It also adds additional complexity to the code keeping
both caches synchronized.

Longer term it may be possible to cleanly resolve this wart by
mapping page cache pages directly on to the ARC buffers.  The
Linux address space operations are flexible enough to allow
selection of which pages back a particular index.  The trick
would be working out the details of which subsystem is in
charge, the ARC, the page cache, or both.  It may also prove
helpful to move the ARC buffers to a scatter-gather lists
rather than a vmalloc'ed region.

Additionally, zfs_write/read_common() were used in the readpage
and writepage hooks because it was fairly easy.  However, it
would be better to update zfs_fillpage and zfs_putapage to be
Linux friendly and use them instead.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
cc5f931cfd Add Hooks for Linux Xattr Operations
The Linux specific xattr operations have all been located in the
file zpl_xattr.c.  These functions primarily rely on the reworked
zfs_* functions to do their job.  They are also responsible for
converting the possible Solaris style error codes to negative
Linux errors.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
51f0bbe425 Add Hooks for Linux Super Block Operations
The Linux specific super block operations have all been located in the
file zpl_super.c.  These functions primarily rely on the reworked
zfs_* functions to do their job.  They are also responsible for
converting the possible Solaris style error codes to negative
Linux errors.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
ee154f01bf Add Hooks for Linux Inode Operations
The Linux specific inode operations have all been located in the
file zpl_inode.c.  These functions primarily rely on the reworked
zfs_* functions to do their job.  They are also responsible for
converting the possible Solaris style error codes to negative
Linux errors.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
1efb473f89 Add Hooks for Linux File Operations
The Linux specific file operations have all been located in the
file zpl_file.c.  These functions primarily rely on the reworked
zfs_* functions to do their job.  They are also responsible for
converting the possible Solaris style error codes to negative
Linux errors.

This first zpl_* commit also includes a common zpl.h header with
minimal entries to register the Linux specific hooks.  In also
adds all the new zpl_* file to the Makefile.in.  This is not a
standalone commit, you required the following zpl_* commits.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
633e8030b3 Wrap with HAVE_XVATTR
For the moment exactly how to handle xvattr is not clear.  This
change largely consists of the code to comment out the offending
bits until something reasonable can be done.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
3c4988c83e Add zp->z_is_zvol flag
A new flag is required for the zfs_rlock code to determine if
it is operation of the zvol of zpl dataset.  This used to be
keyed off the zp->z_vnode, which was a hack to begin with, but
with the removal of vnodes we needed a dedicated flag.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
3558fd73b5 Prototype/structure update for Linux
I appologize in advance why to many things ended up in this commit.
When it could be seperated in to a whole series of commits teasing
that all apart now would take considerable time and I'm not sure
there's much merrit in it.  As such I'll just summerize the intent
of the changes which are all (or partly) in this commit.  Broadly
the intent is to remove as much Solaris specific code as possible
and replace it with native Linux equivilants.  More specifically:

1) Replace all instances of zfsvfs_t with zfs_sb_t.  While the
type is largely the same calling it private super block data
rather than a zfsvfs is more consistent with how Linux names
this.  While non critical it makes the code easier to read when
your thinking in Linux friendly VFS terms.

2) Replace vnode_t with struct inode.  The Linux VFS doesn't have
the notion of a vnode and there's absolutely no good reason to
create one.  There are in fact several good reasons to remove it.
It just adds overhead on Linux if we were to manage one, it
conplicates the code, and it likely will lead to bugs so there's
a good change it will be out of date.  The code has been updated
to remove all need for this type.

3) Replace all vtype_t's with umode types.  Along with this shift
all uses of types to mode bits.  The Solaris code would pass a
vtype which is redundant with the Linux mode.  Just update all the
code to use the Linux mode macros and remove this redundancy.

4) Remove using of vn_* helpers and replace where needed with
inode helpers.  The big example here is creating iput_aync to
replace vn_rele_async.  Other vn helpers will be addressed as
needed but they should be be emulated.  They are a Solaris VFS'ism
and should simply be replaced with Linux equivilants.

5) Update znode alloc/free code.  Under Linux it's common to
embed the inode specific data with the inode itself.  This removes
the need for an extra memory allocation.  In zfs this information
is called a znode and it now embeds the inode with it.  Allocators
have been updated accordingly.

6) Minimal integration with the vfs flags for setting up the
super block and handling mount options has been added this
code will need to be refined but functionally it's all there.

This will be the first and last of these to large to review commits.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
6149f4c45f Remove dmu_write_pages() support
For the moment we do not use dmu_write_pages() to write pages
directly in to a dmu object.  It may be required at some point
in the future, but for now is simplest and cleanest to drop it.
It can be easily readded if/when needed.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
eb28321e2d Create a root znode without VFS dependencies
For portability reasons it's handy to be able to create a root
znode and basic filesystem components without requiring the full
cooperation of the VFS.  We are committing to this to simply the
filesystem creations code.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
bcf308227c Remove zfs_ctldir.[ch]
This code is used for snapshot and heavily leverages Solaris
functionality we do not want to reimplement.  These files have
been removed, including references to them, and will be replaced
by a zfs_snap.c/zpl_snap.c implementation which handles snapshots.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
b516a07b99 Disable fuid features
These features should probably be enabled in the Linux zpl code.
For now I'm disabling them until it's clear what needs to be done.
2011-02-10 09:27:20 -08:00
Brian Behlendorf
d5e53f9d06 Disable zfs_sync during oops/panic
Minor update to ensure zfs_sync() is disabled if a kernel oops/panic
is triggered.  As the comment says 'data integrity is job one'.  This
change could have been done by defining panicstr to oops_in_progress
in the SPL.  But I felt it was better to use the native Linux API
here since to be clear.
2011-02-10 09:27:20 -08:00
Brian Behlendorf
acb5376940 Disable Shutdown/Reboot
This support has been disable with HAVE_SHUTDOWN.  We can support
this at some point by adding the needed reboot notifiers.
2011-02-10 09:27:20 -08:00
Brian Behlendorf
cb28b3494e Remove SYNC_ATTR check
This flag does not need to be support under Linux.  As the comment
says it was only there to support fsflush() for old filesystem like
UFS.  This is not needed under Linux.
2011-02-10 09:27:20 -08:00
Brian Behlendorf
e15c023014 Remove mount options
Mount option parsing is still very Linux specific and will be
handled above this zfs filesystem layer.  Honoring those mount
options once set if of course the responsibility of the lower
layers.
2011-02-10 09:27:20 -08:00
Brian Behlendorf
d7cafa8e3e Remove zfs_active_fs_count
This variable was used to ensure that the ZFS module is never
removed while the filesystem is mounted.  Once again the generic
Linux VFS handles this case for us so it can be removed.
2011-02-10 09:27:20 -08:00
Brian Behlendorf
42ab36aa36 Remove unused mount functions
The functions zfs_mount_label_policy(), zfs_mountroot(), zfs_mount()
will not be needed because most of what they do is already handled
by the generic Linux VFS layer.  They all call zfs_domount() which
creates the actual dataset, the caller of this library call which
will be in the zpl layer is responsible for what's left.
2011-02-10 09:27:20 -08:00
Brian Behlendorf
c0b3dc7d07 Remove zfs_major/zfs_minor/zfsfstype
Under Linux we don't need to reserve a major or minor number for
the filesystem.  We can rely on the VFS to handle colisions without
this being handled by the lower ZFS layers.

Additionally, there is no need to keep a zfsfstype around.  We are
not limited on Linux by the OpenSolaris infrastructure which needed
this.  The upper zpl layer can specify the filesystem type.
2011-02-10 09:27:20 -08:00
Brian Behlendorf
4b3f12ecd5 Remove Solaris VFS Hooks
The ZFS code is being restructured to act as a library and a stand
alone module.  This allows us to leverage most of the existing code
with minimal modification.  It also means we need to drop the Solaris
vfs/vnode functions they will be replaced by Linux equivilants and
updated to be Linux friendly.
2011-02-10 09:27:20 -08:00
Brian Behlendorf
960e08fe3e VFS: Add zfs_inode_update() helper
For the moment we have left ZFS unchanged and it updates many values
as part of the znode.  However, some of these values should be set
in the inode.  For the moment this is handled by adding a function
called zfs_inode_update() which updates the inode based on the znode.

This is considered a workaround until we can systematically go
through the ZFS code and have it directly update the inode.  At
which point zfs_update_inode() can be dropped entirely.  Keeping
two copies of the same data isn't only inefficient it's a breeding
ground for bugs.
2011-02-10 09:27:20 -08:00
Brian Behlendorf
7304b6e50f VFS: Integrate zfs_znode_alloc()
Under Linux the convention for filesystem specific data structure is
to embed it along with the generic vfs data structure.  This differs
significantly from Solaris.

Since we want to integrates as cleanly with the Linux VFS as possible.
This changes modifies zfs_znode_alloc() to allocate a znode with an
embedded inode for use with the generic VFS.  This is done by calling
iget_locked() which will allocate a new inode if needed by calling
sb->alloc_inode().  This function allocates enough memory for a
znode_t by returns a pointer to the inode structure for Linux's VFS.
This function is also responsible for setting the callback
znode->z_set_ops_inodes() which is used to register the correct
handlers for the inode.
2011-02-10 09:27:20 -08:00
Brian Behlendorf
10c6047ea5 Enable zfs_znode compilation
Basic compilation of the bulk of zfs_znode.c has been enabled.  After
much consideration it was decided to convert the existing vnode based
interfaces to more friendly Linux interfaces.  The following commits
will systematically replace update the requiter interfaces.  There
are of course pros and cons to this decision.

Pros:
* This simplifies intergration with Linux in the long term.  There is
  no longer any need to manage vnodes which are a foreign concept to
  the Linux VFS.
* Improved long term maintainability.
* Minor performance improvements by removing vnode overhead.

Cons:
* Added work in the short term to modify multiple ZFS interfaces.
* Harder to pull in changes if we ever see any new code from Solaris.
* Mixed Solaris and Linux interfaces in some ZFS code.
2011-02-10 09:27:20 -08:00
Brian Behlendorf
a405c8a665 ACL related changes
A small collection of ACL related changes related to not
supporting fuid mapping.  This whole are will need to be
closely investigated.
2011-02-10 09:26:26 -08:00
Brian Behlendorf
3fc050aaf2 Init/destroy tsd
Add missing tsd_destroy() call for rrw_tsd_key to avoid a leak.
2011-02-10 09:25:38 -08:00
Brian Behlendorf
ab892c5f0a Replace VOP_* calls with direct zfs_* calls
These generic Solaris wrappers are no longer required.  Simply
directly call the correct zfs functions for clarity.
2011-02-10 09:21:43 -08:00
Brian Behlendorf
590329b50c Add basic uio support
This code originates in OpenSolaris and was modified by KQ Infotech
to be compatible with Linux.  While supporting uios in the short
term is useful to get something working this is not an abstraction
we want to keep.  This code is expected to be short lived and
removed as soon as all the remaining uio based APIs and updated.
2011-02-10 09:21:43 -08:00
Brian Behlendorf
538f669f63 Add trivial acl helpers
The zfs acl code makes use of the two OpenSolaris helper functions
acl_trivial_access_masks() and ace_trivial_common().  Since they are
only called from zfs_acl.c I've brought them over from OpenSolaris
and added them as static function to this file.  This way I don't
need to reimplement this functionality from scratch in the SPL.

Long term once I take a more careful look at the acl implementation
it may be the case that these functions really aren't needed.  If
that turns out to be the case they can then be removed.
2011-02-10 09:21:43 -08:00
Brian Behlendorf
c60bc1fbf0 Remove dead ACL code
The following code was unused which caused gcc to complain.
Since it was deadcode it has simply been removed.
2011-02-10 09:21:43 -08:00
Brian Behlendorf
4e1b54fdde Remove zfs_parse_bootfs() support
Remove unneeded bootfs functions.  This support shouldn't be required
for the Linux port, and even if it is it would need to be reworked
to integrate cleanly with Linux.
2011-02-10 09:21:43 -08:00
Brian Behlendorf
9ee7fac531 VFS: Wrap with HAVE_SHARE
Certain NFS/SMB share functionality is not yet in place.  These
functions used to be wrapped with the generic HAVE_ZPL to prevent
them from being compiled.  I still don't want them compiled but
I'm working toward eliminating the use of HAVE_ZPL.  So I'm just
renaming the wrapper here to HAVE_SHARE.  They still won't be
compiled until all the share issues are worked through.  Share
support is the last missing piece from zfs_ioctl.c.
2011-02-10 09:21:43 -08:00
Brian Behlendorf
bc3e15e386 Wrap with HAVE_MLSLABEL
The zfs_check_global_label() function is part of the HAVE_MLSLABEL
support which was previously commented out by a HAVE_ZPL check.
Since we're still deciding what to do about mls labels wrap it
with the preexisting macro to keep it compiled out.
2011-02-10 09:21:42 -08:00
Brian Behlendorf
5649246dd3 Remove znode move functionality
Unlike Solaris the Linux implementation embeds the inode in the
znode, and has no use for a vnode.  So while it's true that fragmention
of the znode cache may occur it should not be worse than any of the
other Linux FS inode caches.  Until proven that this is a problem it's
just added complexity we don't need.
2011-02-10 09:21:42 -08:00
Brian Behlendorf
f30484afc3 Conserve stack in zfs_mkdir()
Move the sa_attrs array from the stack to the heap to minimize stack
space usage.
2011-02-10 09:21:42 -08:00
Brian Behlendorf
1ee1b76786 Conserve stack in zfs_sa_upgrade()
As always under Linux stack space is at a premium.  Relocate two
20 element sa_bulk_attr_t arrays in zfs_sa_upgrade() from the stack
to the heap.
2011-02-10 09:21:42 -08:00
Brian Behlendorf
e5c39b95a7 Export required vfs/vn symbols 2011-02-10 09:21:42 -08:00
Brian Behlendorf
72d5e2da3e Add HAVE_SCANSTAMP
This functionality is not supported under Linux, perhaps it
will be some day if it's decided it's useful.
2011-02-10 09:20:33 -08:00
Brian Behlendorf
872e8d2697 Add initial rw_uio functions to the dmu
These functions were dropped originally because I felt they would
need to be rewritten anyway to avoid using uios.  However, this
patch readds then with they dea they can just be reworked and
the uio bits dropped.
2011-02-04 16:14:34 -08:00
Brian Behlendorf
9a616b5d17 Documentation updates
Minor Linux specific documentation updates to the comments and
man pages.
2011-02-04 16:14:34 -08:00
Brian Behlendorf
95c73795b0 Fix ZVOL rename minor devices
During a rename we need to be careful to destroy and create a
new minor for the ZVOL _only_ if the rename succeeded.  The previous
code would both destroy you minor device unconditionally, it would
also fail to create the new minor device on success.
2011-01-07 12:26:02 -08:00
Brian Behlendorf
149e873ab1 Fix minor compiler warnings
These compiler warnings were introduced when code which was
previously #ifdef'ed out by HAVE_ZPL was re-added for use
by the posix layer.  All of the following changes should be
obviously correct and will cause no semantic changes.
2011-01-06 15:04:28 -08:00
Brian Behlendorf
5b63b3eb6f Use cv_timedwait_interruptible in arc
The issue is that cv_timedwait() sleeps uninterruptibly to block signals
and avoid waking up early.  Under Linux this counts against the load
average keeping it artificially high.  This change allows the arc to
sleep interruptibly which mean it may be woken up early due to a signal.

Normally this means some extra care must be taken to handle a potential
signal.  But for the arcs usage of cv_timedwait() there is no harm in
waking up before the timeout expires so no extra handling is required.
2010-12-14 10:06:44 -08:00
Brian Behlendorf
a7dc7e5d5a Enable rrwlock.c compilation
With the addition of the thread specific data interfaces to the
SPL it is safe to enable compilation of the re-enterant read
reader/writer locks.
2010-12-07 16:05:25 -08:00
Ned Bass
e06be58641 Fix for access beyond end of device error
This commit fixes a sign extension bug affecting l2arc devices.  Extremely
large offsets may be passed down to the low level block device driver on
reads, generating errors similar to

    attempt to access beyond end of device
    sdbi1: rw=14, want=36028797014862705, limit=125026959

The unwanted sign extension occurrs because the function arc_read_nolock()
stores the offset as a daddr_t, a 32-bit signed int type in the Linux kernel.
This offset is then passed to zio_read_phys() as a uint64_t argument, causing
sign extension for values of 0x80000000 or greater.  To avoid this, we store
the offset in a uint64_t.

This change also changes a few daddr_t struct members to uint64_t in the libspl
headers to avoid similar bugs cropping up in the future.  We also add an ASSERT
to __vdev_disk_physio() to check for invalid offsets.

Closes #66
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-11-10 21:29:07 -08:00
Brian Behlendorf
1f30b9d432 Linux 2.6.36 compat, use fops->unlocked_ioctl()
As of linux-2.6.36 the last in-tree consumer of fops->ioctl() has
been removed and thus fops()->ioctl() has also been removed.  The
replacement hook is fops->unlocked_ioctl() which has existed in
kernel since 2.6.12.  Since the ZFS code only contains support
back to 2.6.18 vintage kernels, I'm not adding an autoconf check
for this and simply moving everything to use fops->unlocked_ioctl().
2010-11-10 17:01:08 -08:00
Brian Behlendorf
675de5aa37 Linux 2.6.36 compat, synchronous bio flag
The name of the flag used to mark a bio as synchronous has changed
again in the 2.6.36 kernel due to the unification of the BIO_RW_*
and REQ_* flags.  The new flag is called REQ_SYNC.  To simplify
checking this flag I have introduced the vdev_disk_dio_is_sync()
helper function.  Based on the results of several new autoconf
tests it uses the correct mask to check for a synchronous bio.

Preferred interface for flagging a synchronous bio:
  2.6.12-2.6.29: BIO_RW_SYNC
  2.6.30-2.6.35: BIO_RW_SYNCIO
  2.6.36-2.6.xx: REQ_SYNC
2010-11-10 17:00:33 -08:00
Ned Bass
b04cffc9b0 Remove inconsistent use of EOPNOTSUPP
Commit 3ee56c292b changed an ENOTSUP return value
in one location to ENOTSUPP to fix user programs seeing an invalid ioctl()
error code.  However, use of ENOTSUP is widespread in the zfs module.  Instead
of changing all of those uses, we fixed the ENOTSUP definition in the SPL to be
consistent with user space.  The changed return value in the above commit is
therefore no longer needed, so this commit reverses it to maintain consistency.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-11-10 13:26:56 -08:00
Ned Bass
3ee56c292b Make rollbacks fail gracefully
Support for rolling back datasets require a functional ZPL, which we currently
do not have.  The zfs command does not check for ZPL support before attempting
a rollback, and in preparation for rolling back a zvol it removes the minor
node of the device.  To prevent the zvol device node from disappearing after a
failed rollback operation, this change wraps the zfs_do_rollback() function in
an #ifdef HAVE_ZPL and returns ENOSYS in the absence of a ZPL.  This is
consistent with the behavior of other ZPL dependent commands such as mount.

The orginal error message observed with this bug was rather confusing:

    internal error: Unknown error 524
    Aborted

This was because zfs_ioc_rollback() returns ENOTSUP if we don't HAVE_ZPL, but
Linux actually has no such error code.  It should instead return EOPNOTSUPP, as
that is how ENOTSUP is defined in user space.  With that we would have gotten
the somewhat more helpful message

    cannot rollback 'tank/fish': unsupported version

This is rather a moot point with the above changes since we will no longer make
that ioctl call without a ZPL.  But, this change updates the error code just in
case.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-11-08 14:03:36 -08:00
Brian Behlendorf
7e55f4e00c Increate zio write interrupt thread count.
Increasing the default zio_wr_int thread count from 8 to 16 improves
write performence by 13% on large systems.  More testing need to be
done but I suspect the ideal tuning here is ZTI_BATCH() with a minimum
of 8 threads.
2010-11-08 14:03:35 -08:00
Brian Behlendorf
451041db53 Shorten zio_* thread names
Linux kernel thread names are expected to be short.  This change shortens
the zio thread names to 10 characters leaving a few chracters to append
the /<cpuid> to which the thread is bound.  For example: z_wr_iss/0.
2010-11-08 14:03:35 -08:00
Ned Bass
b1c5821375 Fix panic mounting unformatted zvol
On some older kernels, i.e. 2.6.18, zvol_ioctl_by_inode() may get passed a NULL
file pointer if the user tries to mount a zvol without a filesystem on it.
This change adds checks to prevent a null pointer dereference.

Closes #73.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-10-29 14:46:33 -07:00
Brian Behlendorf
baa40d45cb Fix missing 'zpool events'
It turns out that 'zpool events' over 1024 bytes in size where being
silently dropped.  This was discovered while writing the zfault.sh
tests to validate common failure modes.

This could occur because the zfs interface for passing an arbitrary
size nvlist_t over an ioctl() is to provide a buffer for the packed
nvlist which is usually big enough.  In this case 1024 byte is the
default.  If the kernel determines the buffer is to small it returns
ENOMEM and the minimum required size of the nvlist_t.  This was
working properly but in the case of 'zpool events' the event stream
was advanced dispite the error.  Thus the retry with the bigger
buffer would succeed but it would skip over the previous event.

The fix is to pass this size to zfs_zevent_next() and determine
before removing the event from the list if it will fit.  This was
preferable to checking after the event was returned because this
avoids the need to rewind the stream.
2010-10-12 14:55:03 -07:00
Brian Behlendorf
a69052be7f Initial zio delay timing
While there is no right maximum timeout for a disk IO we can start
laying the ground work to measure how long they do take in practice.
This change simply measures the IO time and if it exceeds 30s an
event is posted for 'zpool events'.

This value was carefully selected because for sd devices it implies
that at least one timeout (SD_TIMEOUT) has occured.  Unfortunately,
even with FAILFAST set we may retry and request and not get an
error.  This behavior is strongly dependant on the device driver
and how it is hooked in to the scsi error handling stack.  However
by setting the limit at 30s we can log the event even if no error
was returned.

Slightly longer term we can start recording these delays perhaps
as a simple power-of-two histrogram.  This histogram can then be
reported as part of the 'zpool status' command when given an command
line option.

None of this code changes the internal behavior of ZFS.  Currently
it is simply for reporting excessively long delays.
2010-10-12 14:55:02 -07:00
Brian Behlendorf
2959d94a0a Add FAILFAST support
ZFS works best when it is notified as soon as possible when a device
failure occurs.  This allows it to immediately start any recovery
actions which may be needed.  In theory Linux supports a flag which
can be set on bio's called FAILFAST which provides this quick
notification by disabling the retry logic in the lower scsi layers.

That's the theory at least.  In practice is turns out that while the
flag exists you oddly have to set it with the BIO_RW_AHEAD flag.
And even when it's set it you may get retries in the low level
drivers decides that's the right behavior, or if you don't get the
right error codes reported to the scsi midlayer.

Unfortunately, without additional kernels patchs there's not much
which can be done to improve this.  Basically, this just means that
it may take 2-3 minutes before a ZFS is notified properly that a
device has failed.  This can be improved and I suspect I'll be
submitting patches upstream to handle this.
2010-10-12 14:55:02 -07:00
Brian Behlendorf
312c07edfd Generate zevents for speculative and soft errors
By default the Solaris code does not log speculative or soft io errors
in either 'zpool status' or post an event.  Under Linux we don't want
to change the expected behavior of 'zpool status' so these io errors
are still suppressed there.

However, since we do need to know about these events for Linux FMA and
the 'zpool events' interface is new we do post the events.  With the
addition of the zio_flags field the posted events now contain enough
information that a user space consumer can identify and discard these
events if it sees fit.
2010-10-12 14:55:00 -07:00
Brian Behlendorf
d148e95156 Fix negative zio->io_error which must be positive.
All the upper layers of zfs expect zio->io_error to be positive.  I was
careful but I missed one instance in vdev_disk_physio_completion() which
could return a negative error.  To ensure all cases are always caught I
had additionally added an ASSERT() to check this before zio_interpret().

Finally, as a debugging aid when zfs is build with --enable-debug all
errors from the backing block devices will be reported to the console
with an error message like this:

	ZFS: zio error=5 type=1 offset=4217856 size=8192 flags=60440
2010-10-12 14:55:00 -07:00
Brian Behlendorf
398f129ca3 Suppress large kmem_alloc() warning.
Observed during failure mode testing, dsl_scan_setup_sync() allocates
73920 bytes.  This is way over the limit of what is wise to do with a
kmem_alloc() and it should probably be moved to a slab.  For now I'm
just flagging it with KM_NODEBUG to quiet the error until this can be
revisited.
2010-10-12 14:54:59 -07:00
Ned Bass
3a7381e531 Use stored whole_disk property when opening a vdev
This commit fixes a bug in vdev_disk_open() in which the whole_disk property
was getting set to 0 for disk devices, even when it was stored as a 1 when the
zpool was created.  The whole_disk property lets us detect when the partition
suffix should be stripped from the device name in CLI output.  It is also used
to determine how writeback cache should be set for a device.

When an existing zpool is imported its configuration is read from the vdev
label by user space in zpool_read_label().  The whole_disk property is saved in
the nvlist which gets passed into the kernel, where it in turn gets saved in
the vdev struct in vdev_alloc().  Therefore, this value is available in
vdev_disk_open() and should not be overridden by checking the provided device
path, since that path will likely point to a partition and the check will
return the wrong result.

We also add an ASSERT that the whole_disk property is set.  We are not aware of
any cases where vdev_disk_open() should be called with a config that doesn't
have this property set.  The ASSERT is there so that when debugging is enabled
we can identify any legitimate cases that we are missing.  If we never hit the
ASSERT, we can at some point remove it along with the conditional whole_disk
check.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-10-04 13:53:18 -07:00
Ricardo M. Correia
0151834d65 Register the space accounting callback even when we don't have the ZPL.
This callback is needed for properly accounting the per-uid and per-gid
space usage.  Even if we don't have the ZPL, we still need this callback
in order to have proper on-disk ZPL compatibility and to be able to use
Lustre quotas.

Fortunately, the callback doesn't have any ZPL/VFS dependencies so we
can just move it out of #ifdef HAVE_ZPL.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-10-04 11:34:39 -07:00
Ricardo M. Correia
368f4c10ae Export ZFS symbols needed by Lustre.
Required for the DB_DNODE_ENTER()/DB_DNODE_EXIT() helpers.

Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-09-17 16:24:15 -07:00
Ricardo M. Correia
1e411a4c12 Quiet down very frequent large allocation warning in ZFS.
In my machine, dnode_hold_impl() allocates 9992 bytes in DEBUG mode and it
causes a large stream of stack traces in the logs. Instead, use KM_NODEBUG
to quiet down this known large alloc.

Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-09-17 16:24:15 -07:00
Brian Behlendorf
6283f55ea1 Support custom build directories and move includes
One of the neat tricks an autoconf style project is capable of
is allow configurion/building in a directory other than the
source directory.  The major advantage to this is that you can
build the project various different ways while making changes
in a single source tree.

For example, this project is designed to work on various different
Linux distributions each of which work slightly differently.  This
means that changes need to verified on each of those supported
distributions perferably before the change is committed to the
public git repo.

Using nfs and custom build directories makes this much easier.
I now have a single source tree in nfs mounted on several different
systems each running a supported distribution.  When I make a
change to the source base I suspect may break things I can
concurrently build from the same source on all the systems each
in their own subdirectory.

wget -c http://github.com/downloads/behlendorf/zfs/zfs-x.y.z.tar.gz
tar -xzf zfs-x.y.z.tar.gz
cd zfs-x-y-z

------------------------- run concurrently ----------------------
<ubuntu system>  <fedora system>  <debian system>  <rhel6 system>
mkdir ubuntu     mkdir fedora     mkdir debian     mkdir rhel6
cd ubuntu        cd fedora        cd debian        cd rhel6
../configure     ../configure     ../configure     ../configure
make             make             make             make
make check       make check       make check       make check

This change also moves many of the include headers from individual
incude/sys directories under the modules directory in to a single
top level include directory.  This has the advantage of making
the build rules cleaner and logically it makes a bit more sense.
2010-09-08 12:38:56 -07:00
Brian Behlendorf
f5e79474f0 Fix zfsdev_compat_ioctl() case
For the !CONFIG_COMPAT case fix the zfsdev_compat_ioctl()
compatibility function name.  This was caught by the
chaos4.3 builder.
2010-09-01 16:00:15 -07:00
Brian Behlendorf
302ef1517e Add linux zpios support
Linux kernel implementation of PIOS test app.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 13:42:01 -07:00
Brian Behlendorf
d603ed6c27 Add linux user disk support
This topic branch contains all the changes needed to integrate the user
side zfs tools with Linux style devices.  Primarily this includes fixing
up the Solaris libefi library to be Linux friendly, and integrating with
the libblkid library which is provided by e2fsprogs.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 13:42:00 -07:00
Brian Behlendorf
054bc00b4c Add linux compatibility
Resolve minor Linux compatibility issues.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 13:41:59 -07:00
Brian Behlendorf
7b89a54996 Add linux spa thread support
Disable the spa thread under Linux until it can be implemented.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 13:41:59 -07:00