Commit Graph

4669 Commits

Author SHA1 Message Date
Brian Behlendorf 99c452bbba Fix taskq_wait_id()
The existing taskq_wait_id() function can incorrectly block
indefinitely.  Reimplement it more simply using wait_event()
in a similar fashion to taskq_wait_all().

This flaw was uncovered in the context of moving vn_rdwr() to
a taskq.  Previously taskq_wait_id() had no consumers outside
the SPLAT task framework which is why the issue went unnoticed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-05-03 14:32:29 -07:00
George.Wilson cc92e9d0c3 3246 ZFS I/O deadman thread
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

NOTES: This patch has been reworked from the original in the
following ways to accomidate Linux ZFS implementation

*) Usage of the cyclic interface was replaced by the delayed taskq
   interface.  This avoids the need to implement new compatibility
   code and allows us to rely on the existing taskq implementation.

*) An extern for zfs_txg_synctime_ms was added to sys/dsl_pool.h
   because declaring externs in source files as was done in the
   original patch is just plain wrong.

*) Instead of panicing the system when the deadman triggers a
   zevent describing the blocked vdev and the first pending I/O
   is posted.  If the panic behavior is desired Linux provides
   other generic methods to panic the system when threads are
   observed to hang.

*) For reference, to delay zios by 30 seconds for testing you can
   use zinject as follows: 'zinject -d <vdev> -D30 <pool>'

References:
  illumos/illumos-gate@283b84606b
  https://www.illumos.org/issues/3246

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1396
2013-05-01 17:05:52 -07:00
Brian Behlendorf 57f5a2008e Fix txg_quiesce thread deadlock
A deadlock was accidentally introduced by commit e95853a which
can occur when the system is under memory pressure.  What happens
is that while the txg_quiesce thread is holding the tx->tx_cpu
locks it enters memory reclaim.  In the context of this memory
reclaim it then issues synchronous I/O to a ZVOL swap device.
Because the txg_quiesce thread is holding the tx->tx_cpu locks
a new txg cannot be opened to handle the I/O.  Deadlock.

The fix is straight forward.  Move the memory allocation outside
the critical region where the tx->tx_cpu locks are held.  And for
good measure change the offending allocation to KM_PUSHPAGE to
ensure it never attempts to issue I/O during reclaim.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1274
2013-04-26 14:42:36 -07:00
Brian Behlendorf f706421173 Correctly return ERANGE in getxattr(2)
According to the getxattr(2) man page the ERANGE errno should be
returned when the size of the value buffer is to small to hold the
result.  Prior to this patch the implementation would just truncate
the value to size bytes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1408
2013-04-24 12:35:04 -07:00
Chris Dunlop 254255f735 Trivial spelling fix
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1411
2013-04-19 15:43:16 -07:00
Caleb James DeLisle 8f1e11b610 Remove .readdir from zpl_file_operations table
The zpl_readdir() function shouldn't be registered as part of
the zpl_file_operations table, it must only be part of the
zpl_dir_file_operations table.  By removing this callback
the VFS will now correctly return ENOTDIR when calling
getdents() on a file.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1404
2013-04-19 15:36:47 -07:00
Martin Matuska b28e57cb82 Allow setting a lower ashift with -o ashift
Previous patches have allowed you to set an increased ashift to
avoid doing 512b IO with 4k sector devices.  However, it was not
possible to set the ashift lower than the reported physical sector
size even when a smaller logical size was supported.  In practice,
there are several cases where settong a lower ashift is useful:

* Most modern drives now correctly report their physical sector
  size as 4k.  This causes zfs to correctly default to using a 4k
  sector size (ashift=12).  However, for some usage models this
  new default ashift value causes an unacceptable increase in
  space usage.  Filesystems with many small files may see the
  total available space reduced to 30-40% which is unacceptable.

* When replacing a drive in an existing pool which was created
  with ashift=9 a modern 4k sector drive cannot be used.  The
  'zpool replace' command will issue an error that the new drive
  has an 'incompatible sector alignment'.  However, by allowing
  the ashift to be manual specified as smaller, non-optimal,
  value the device may still be safely used.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1381
Closes #1328
Issue #967
Issue #548
2013-04-12 10:50:46 -07:00
George Wilson 295304bed6 Illumos #3422, #3425
3422 zpool create/syseventd race yield non-importable pool
3425 first write to a new zvol can fail with EFBIG

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  illumos/illumos-gate@bda8819455
  https://www.illumos.org/issues/3422
  https://www.illumos.org/issues/3425

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1390
2013-04-12 09:01:36 -07:00
Jan Engelhardt a9e86ac4fd gitignore: anchor entries at their respective directory
.ko is specific to module, .m4 to config, etc.

Signed-off-by: Jan Engelhardt <jengelh@inai.de>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-04-02 11:07:52 -07:00
Jan Engelhardt ea0fcfc875 gitignore: anchor entries at their respective directory
.ko is specific to module, .m4 to config, etc.

Signed-off-by: Jan Engelhardt <jengelh@inai.de>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-04-02 10:50:17 -07:00
Jan Engelhardt 4e95cc99b0 build: resolve orthographic and other grammatical errors
Signed-off-by: Jan Engelhardt <jengelh@inai.de>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-04-02 10:44:52 -07:00
Richard Yao feaf1e321d Do not call cond_resched() in spl_slab_reclaim()
Calling cond_resched() after each object is freed and then after each
slab is freed can cause slabs of objects to live for excessive periods
of time following reclaimation. This interferes with the kernel's own
memory management when called from kswapd and can cause direct reclaim
to occur in response to memory pressure that should have been resolved.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
2013-03-21 12:58:44 -07:00
Brian Behlendorf 5dc6af0eec Add zio_ddt_free()+ddt_phys_decref() error handling
The assumption in zio_ddt_free() is that ddt_phys_select() must
always find a match.  However, if that fails due to a damaged
DDT or some other reason the code will NULL dereference in
ddt_phys_decref().

While this should never happen it has been observed on various
platforms.  The result is that unless your willing to patch the
ZFS code the pool is inaccessible.  Therefore, we're choosing
to more gracefully handle this case rather than leave it fatal.

http://mail.opensolaris.org/pipermail/zfs-discuss/2012-February/050972.html

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1308
2013-03-19 13:01:01 -07:00
Brian Behlendorf 30b92c1de6 Add metaslab_debug option
Enabling metaslab debugging will prevent space maps from being
automatically unloaded.  This can significantly increase the
memory footprint but being able to dynamically control this is
helpful for debugging and certain performance testing.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-18 16:47:43 -07:00
Richard Yao 4a31e5aa9b Linux 3.9 compat: Switch to hlist_for_each{,_rcu}
torvalds/linux@b67bfe0d42 changed
hlist_for_each_entry{,_rcu} to take 3 arguments instead of 4. We handle
this by switching to hlist_for_each{,_rcu}, which works across all
supported kernels.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-14 10:43:34 -07:00
Richard Yao 8274ed5988 Drop support for 3 argument version of set_fs_pwd
This was a suggestion that Brian Behlendorf made when reviewing an early
pull request for Linux 3.9 support. This commit was made intentionally
easy to revert should we ever have a reason to reintroduce support for
older kernels.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-14 10:43:31 -07:00
Richard Yao a54718cfe0 Linux 3.9 compat: set_fs_root takes const struct path *
torvalds/linux@dcf787f391 enforces
const-correctness in passing struct path *.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-14 10:43:29 -07:00
Richard Yao 2a305c34c8 Linux 3.9 compat: vfs_getattr takes two arguments
The function prototype of vfs_getattr previoulsy took struct vfsmount *
and struct dentry * as arguments. These would always be defined together
in a struct path *.

torvalds/linux@3dadecce20 modified
vfs_getattr to take struct path * is taken as an argument instead.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-14 10:43:26 -07:00
Richard Yao bc90df6688 Linux 3.9 compat: Do not depend on f_vfsmnt
torvalds/linux@182be68478 removed the
preprocessor definition for f_vfsmnt. The ability to access the
mountpoint via ->f_path.mnt has been stable for a long time, so we
switch to that.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-14 10:43:23 -07:00
Richard Yao 1c24b699b0 Linux 3.9 compat: Undefine GCC_VERSION
The mainline kernel started defining GCC_VERSION with commit
torvalds/linux@3f3f8d2f48.
Unfortunately, LZ4 also defines this macro, but the two
defintions are incompatible. We undefine GCC_VERSION in lz4.c
to handle this.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1339
2013-03-06 15:48:48 -08:00
Ned Bass 92db59ca3b Refresh links to web site
A few files still refer to @behlendorf's private fork on
github.  Use the primary web site URL instead.  Two typos
are also corrected.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-06 15:46:41 -08:00
Brian Behlendorf d09f98a9a6 Add KMODDIR to install target
Provide a mechanism to control the directory name the modules
are installed in.  The kernel privdes INSTALL_MOD_DIR for
this but it was hardcoded to be 'addon/zfs'.

Add a KMODDIR variable which can be passed to 'make install'
to override the default directory name.  While we're here
change the default from 'addon/zfs' to 'extra' which is the
kernel.org default.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-06 15:46:40 -08:00
Eric Dillmann 0b4d1b5853 Add snapdev=[hidden|visible] dataset property
The new snapdev dataset property may be set to control the
visibility of zvol snapshot devices.  By default this value
is set to 'hidden' which will prevent zvol snapshots from
appearing under /dev/zvol/ and /dev/<dataset>/.  When set to
'visible' all zvol snapshots for the dataset will be visible.

This functionality was largely added because when automatic
snapshoting is enabled large numbers of read-only zvol snapshots
will be created.  When creating these devices the kernel will
attempt to read their partition tables, and blkid will attempt
to identify any filesystems on those partitions.  This leads
to a variety of issues:

1) The zvol partition tables will be read in the context of
   the `modprobe zfs` for automatically imported pools.  This
   is undesirable and should be done asynchronously, but for
   now reducing the number of visible devices helps.

2) Udev expects to be able to complete its work for a new
   block devices fairly quickly.  When many zvol devices are
   added at the same time this is no longer be true.  It can
   lead to udev timeouts and missing /dev/zvol links.

3) Simply having lots of devices in /dev/ can be aukward from
   a management standpoint.  Hidding the devices your unlikely
   to ever use helps with this.  Any snapshot device which is
   needed can be made visible by changing the snapdev property.

NOTE: This patch changes the default behavior for zvols which
      was effectively 'snapdev=visible'.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1235
Closes #945
Issue #956
Issue #756
2013-03-05 12:37:54 -08:00
Ned Bass 3d6af2dd6d Refresh links to web site
Update links to refer to the official ZFS on Linux website instead of
@behlendorf's personal fork on github.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-04 19:09:34 -08:00
George Wilson a4430fce69 Merge zvol.c changes from PSARC 2010/306 Read-only ZFS pools
The changes to zvol.c were never merged from the last onnv_147
bulk update.  This was because zvol.c was largely rewritten
for Linux making it fairly easy to miss these sorts of changes.

This causes a regression when importing a zpool with zvols
read-only.  This does not impact pool which only contain
filesystem datasets.

References:
  illumos/illumos-gate@f9af39b

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1332
Closes #1333
2013-03-04 09:56:13 -08:00
Richard Yao b01615d5ac Constify structures containing function pointers
The PaX team modified the kernel's modpost to report writeable function
pointers as section mismatches because they are potential exploit
targets. We could ignore the warnings, but their presence can obscure
actual issues. Proper const correctness can also catch programming
mistakes.

Building the kernel modules against a PaX/GrSecurity patched Linux 3.4.2
kernel reports 133 section mismatches prior to this patch. This patch
eliminates 130 of them. The quantity of writeable function pointers
eliminated by constifying each structure is as follows:

vdev_opts_t             52
zil_replay_func_t       24
zio_compress_info_t     24
zio_checksum_info_t     9
space_map_ops_t         7
arc_byteswap_func_t     5

The remaining 3 writeable function pointers cannot be addressed by this
patch. 2 of them are in zpl_fs_type. The kernel's sget function requires
that this be non-const. The final writeable function pointer is created
by SPL_SHRINKER_DECLARE. The kernel's set_shrinker() and
remove_shrinker() functions also require that this be non-const.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1300
2013-03-04 08:49:32 -08:00
Brian Behlendorf 0298f3d67f Add KMODDIR to install target
Provide a mechanism to control the directory name the modules
are installed in.  The kernel privdes INSTALL_MOD_DIR for
this but it was hardcoded to be 'addon/spl'.

Add a KMODDIR variable which can be passed to 'make install'
to override the default directory name.  While we're here
change the default from 'addon/spl' to 'extra' which is the
kernel.org default.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-01 16:55:06 -08:00
Brian Behlendorf 8128bd89fb Fix hot spares
The issue with hot spares in ZoL is because it opens all leaf
vdevs exclusively (O_EXCL).  On Linux, exclusive opens cause
subsequent exclusive opens to fail with EBUSY.

This could be resolved by not opening any of the devices
exclusively, which is what Illumos does, but the additional
protection offered by exclusive opens is desirable.  It cleanly
prevents you from accidentally adding an in-use non-ZFS device
to your pool.

To fix this we very slightly relaxed the usage of O_EXCL in
the following ways.

1) Functions which open the device but only read had the
   O_EXCL flag removed and were updated to use O_RDONLY.

2) A common holder was added to the vdev disk code.  This
   allow the ZFS code to internally open the device multiple
   times but non-ZFS callers may not.

3) An exception was added to make_disks() for hot spare when
   creating partition tables.  For hot spare devices which
   are already opened exclusively we skip creating the partition
   table because this must already have been done when the disk
   was originally added as a hot spare.

Additional minor changes include fixing check_in_use() to use
a partition instead of a slice suffix.  And is_spare() was moved
above make_disks() to avoid adding a forward reference.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #250
2013-03-01 13:31:02 -08:00
Brian Behlendorf bd99a7584a Remove wholedisk check from vdev_disk_open()
As described by the comment and enforced the by assertion the
v->vdev_wholedisk will never be -1.  The wholedisk handling
is performed by the user space utilities.  To prevent confusion
this dead code is being removed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-02-28 12:02:59 -08:00
Brian Behlendorf 0d8103d956 Leaf vdevs should not be reopened
When vdev_disk.c was implemented for Linux we failed to handle the
reopen case.  According to the vdev_reopen() comment leaf vdevs should
not be closed or opened when v->vdev_reopening is set.  Under Linux
we would always close and open the device.

This issue was only noticed when a 'zpool scrub' command was run while
the leaf vdev device names in /dev/disk/by-vdev were missing.  The
scrub command calls vdev_reopen() which caused the vdevs to be closed
but they couldn't be reopened due to the missing links.  The result
was that all the vdevs were marked unavailable and the pool was
halted due to failmode=wait.

This patch adds the missing functionality in a similiar fashion to
to the Illumos code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-02-28 12:02:59 -08:00
Etienne Dechamps d9b0ebbe82 Remove the bio_empty_barrier() check.
To determine whether the kernel is capable of handling empty barrier
BIOs, we check for the presence of the bio_empty_barrier() macro,
which was introduced in 2.6.24. If this macro is defined, then we can
flush disk vdevs; if it isn't, then flushing is disabled.

Unfortunately, the bio_empty_barrier() macro was removed in 2.6.37,
even though the kernel is still capable of handling empty barrier BIOs.

As a result, flushing is effectively disabled on kernels >= 2.6.37,
meaning that starting from this kernel version, zfs doesn't use
barriers to guarantee on-disk data consistency. This is quite bad and
can lead to potential data corruption on power failures.

This patch fixes the issue by removing the configure check for
bio_empty_barrier(), as we don't support kernels <= 2.6.24 anymore.

Thanks to Richard Kojedzinszky for catching this nasty bug.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1318
2013-02-24 10:22:34 -08:00
Brian Behlendorf 546c978bbd Enable zfs_arc_memory_throttle_disable by default
The zfs_arc_memory_throttle_disable module option was introduced
by commit 0c5493d470 to resolve a
memory miscalculation which could result in the txg_sync thread
spinning.

When this was first introduced the default behavior was left
unchanged until enough real world usage confirmed there were no
unexpected issues.  We've now reached that point.  Linux's
direct reclaim is working as expected so we're enabling this
behavior by default.

This helps pave the way to retire the spl_kmem_availrmem()
functionality in the SPL layer.  This was the only caller.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #938
2013-02-21 13:38:24 -08:00
Richard Yao 8dca0a9a38 Make spa.c assertions catch unsupported pre-feature flag pool versions
A couple of assertions in spa.c were designed to prevent the use of
invalid pool versions. They were written under the assumption
that all valid pools are less than SPA_VERSION. Since feature flags
jumped from 28 to 5000, any numbers in the range 28 to 5000
non-inclusive will fail to trigger them.  We switch to the new
SPA_VERSION_IS_SUPPORTED macro to correct this.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1282
2013-02-12 10:27:44 -08:00
Brian Behlendorf 9878a89d7a Add explicit MAXNAMELEN check
It turns out that the Linux VFS doesn't strictly handle all cases
where a component path name exceeds MAXNAMELEN.  It does however
appear to correctly handle MAXPATHLEN for us.

The right way to handle this appears to be to add an explicit
check to the zpl_lookup() function.  Several in-tree filesystems
handle this case the same way.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1279
2013-02-12 10:27:39 -08:00
Ned Bass ed2e157605 Switch KM_SLEEP to KM_PUSHPAGE
Two more locations where KM_SLEEP was used in a call which must
use KM_PUSHPAGE were found while using the zpool upgrade command.
See commit b8d06fc for additional details.

Also make a small correction to the comment block above
dsl_dir_open_spa().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1268
2013-02-06 11:19:58 -08:00
Brian Behlendorf 4bf3909e51 Disable automatic log dumping
Long ago infrastructure was added to the SPL to keep an internal
debug log of the last few seconds of activity.  This was helpful
during the early development, but these days it is no longer
needed.  I haven't had to resort to this debug buffer to resolve
an issue for several years now.

Today better more generic tools like systemtap and ftrace have
evolved to the point where they can be used for this purpose.
Along with the stack trace dumped to the system console, and in
rare cases a crash dump we almost always have the debug we need.

Therefore, I'm disabling the code which automatically dumps
this log to disk during an assertion except for the case where
spl_debug_panic_on_bug is set (disabled by default).

This should be viewed as a first step towards either.

  a) Retiring this infrastructure and complexity entirely, or
  b) Integrating this logging more properly with ftrace.

As part of this change I'm also removing from the packages the
undocumented spl utility which is used to decode the binary logs.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-02-05 16:13:27 -08:00
Brian Behlendorf dd26aa535b Cast 'zfs bad bloc' to ULL for x86
Explicitly case this value to an unsigned long long for 32-bit
systems to inform the compiler that a long type should not be
used.  Otherwise we get the following compiler error:

  dmu_send.c:376: error: integer constant is too large for
  ‘long’ type

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-02-04 16:39:08 -08:00
Brian Behlendorf 0c5493d470 Add zfs_arc_memory_throttle_disable module option
The way in which virtual box ab(uses) memory can throw off the
free memory calculation in arc_memory_throttle().  The result is
the txg_sync thread will effectively spin waiting for memory to
be released even though there's lots of memory on the system.

To handle this case I'm adding a zfs_arc_memory_throttle_disable
module option largely for virtual box users.  Setting this option
disables free memory checks which allows the txg_sync thread to
make progress.

By default this option is disabled to preserve the current
behavior.  However, because Linux supports direct memory reclaim
it's doubtful throttling due to perceived memory pressure is ever
a good idea.  We should enable this option by default once we've
done enough real world testing to convince ourselve there aren't
any unexpected side effects.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #938
2013-02-01 11:17:14 -08:00
Brian Behlendorf 1f7c30df8f Add zfs_disable_dup_eviction module option
Commit 1eb5bfa introduced a new zfs_disable_dup_eviction tunable.
It should have been made available as a module option in the
original patch but was overlooked.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-02-01 09:57:57 -08:00
Brian Behlendorf 6ef94aa67a Fix tsd_get/set() race with tsd_exit/destroy()
The tsd_exit() and tsd_destroy() functions remove entries from
hash bins without taking the hash bin lock.  They do take the
table lock, but tsd_get() and tsd_set() only take the hash bin
lock to allow for maximum concurency.

The result is that while tsd_get() and tsd_set() are traversing
the hash bin list it can be modified by another thread in which
happens to hash to the same value.  To avoid this add the needed
locking to tsd_exit() and tsd_destroy().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #174
2013-01-31 13:54:59 -08:00
Ned Bass 36f86f73f6 Fix mismatch between SA header size and layout
When a system attribute layout is created an inconsistency may occur
between the system attribute header (sa_hdr_phys_t) size and the
variable-sized attribute count stored in the layout.  The inconsistency
results in the following failed assertion when SA_HDR_SIZE_MATCH_LAYOUT
returns false:

SPLError: 11315:0:(sa.c:1541:sa_find_idx_tab())
ASSERTION((IS_SA_BONUSTYPE(bonustype) && SA_HDR_SIZE_MATCH_LAYOUT(hdr,
tb)) || !IS_SA_BONUSTYPE(bonustype) || (IS_SA_BONUSTYPE(bonustype) &&
hdr->sa_layout_info == 0)) failed

The bug originates in this snippet from sa_find_sizes().

    if (is_var_sz && var_size > 1) {
            if (P2ROUNDUP(hdrsize + sizeof (uint16_t),
                *total < full_space) {
                    hdrsize += sizeof (uint16_t);

This assumes that the current variable-sized attribute will be stored in
the current buffer and accounts for the space needed to store its size
in the sa_hdr_phys_t. However if the next attribute spills over we need
to store a blkptr_t at the end of the bonus buffer to point to the spill
block. If the current attribute is in the way of the blkptr_t then it
too will be relocated into the spill block. But since we've already
accounted for it in the header size we get the inconsistency described
above.

To avoid this, record the index of the last variable-sized attribute
that prompted a hdrsize increase, and reverse the increase if we later
determine that that attribute will be relocated to the spill block.

Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1250
2013-01-31 10:31:19 -08:00
Ned Bass 67629d0f08 Fix rounding discrepancy in sa_find_sizes()
A rounding discrepancy exists between how sa_build_layouts() and
sa_find_sizes() calculate when the spill block needs to be kicked in.
This results in a narrow size range where sa_build_layouts() believes
there must be a spill block allocated but due to the discrepancy there
isn't.  A panic then occurs when the hdl->sa_spill NULL pointer is
dereferenced.

The following reproducer for this bug was isolated:

    truncate -s 128m /tmp/tank
    zpool create tank /tmp/tank
    zfs create -o xattr=sa tank/fish
    ln -s `perl -e 'print "z" x 41'` /tank/fish/z
    setfattr -hn trusted.foo -v`perl -e 'print "z"x45'` /tank/fish/z

This test results in roughly the following system attribute (SA)
layout:

  176 bytes - "standard" SA's
   41 bytes - name of symbolic link target
  100 bytes - XDR encoded nvlist for xattr
  ---
  317 bytes - total

Because 317 is less than DN_MAX_BONUSLEN (320), sa_find_sizes()
decides no spill block is needed. But sa_build_layouts() rounds 41 up
to 48 when computing the space requirements so it tries to switch to
the spill block.

Note that we were only able to reproduce this bug using a combination
of symbolic links and the Linux-specific xattr=sa dataset property.
So while this issue is not technically Linux-specific, it may be
difficult or impossible to hit the narrow size range needed to
reproduce it on other platforms.

To fix the discrepancy, round the running total in sa_find_sizes() up
to an 8-byte boundary before accounting for each SA, since this is how
they will be stored in the bonus and (possibly) spill buffers.

To make the intent of the code more clear, explicitly assert key
assumptions about expected alignment of data and whether spill-over
will occur.

Signed-off-by: Matthew Ahrens <mahrens@delphix.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1240
2013-01-31 10:31:13 -08:00
Adam H. Leventhal 89103a2643 Illumos #3447 improve the comment in txg.c
3447 improve the comment in txg.c

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Richard Elling <richard.elling@dey-sys.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  illumos/illumos-gate@adbbcfface
  https://www.illumos.org/issues/3447

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-30 08:55:20 -08:00
Eric Dillmann 9759c60f1a Illumos #3035 LZ4 compression support in ZFS and GRUB
3035 LZ4 compression support in ZFS and GRUB

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Christopher Siden <csiden@delphix.com>

References:
  illumos/illumos-gate@a6f561b4ae
  https://www.illumos.org/issues/3035
  http://wiki.illumos.org/display/illumos/LZ4+Compression+In+ZFS

This patch has been slightly modified from the upstream Illumos
version to be compatible with Linux.  Due to the very limited
stack space in the kernel a lz4 workspace kmem cache is used.
Since we are using gcc we are also able to take advantage of the
gcc optimized __builtin_ctz functions.

Support for GRUB has been dropped from this patch.  That code
is available but those changes will need to made to the upstream
GRUB package.

Lastly, several hunks of dead code were dropped for clarity.  They
include the functions real_LZ4_uncompress(), LZ4_compressBound()
and the Visual Studio specific hunks wrapped in _MSC_VER.

Ported-by: Eric Dillmann <eric@jave.fr>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1217
2013-01-29 09:28:20 -08:00
Brian Behlendorf 0936c3449f Add spl_kmem_cache_expire module option
Cache aging was implemented because it was part of the default Solaris
kmem_cache behavior.  The idea is that per-cpu objects which haven't been
accessed in several seconds should be returned to the cache.  On the other
hand Linux slabs never move objects back to the slabs unless there is
memory pressure on the system.

This behavior is now configurable through the 'spl_kmem_cache_expire'
module option.  The value is a bit mask with the following meaning.

  0x1 - Solaris style cache aging eviction is enabled.
  0x2 - Linux style low memory eviction is enabled.

Both methods may be safely enabled simultaneously, but by default
both are disabled.  It has never been clear if the kmem cache aging
(which has been around from day one) actually does any good.  It has
however been the source of numerous bugs so I wouldn't mind retiring
it entirely.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#1227
Closes #210
2013-01-28 09:34:12 -08:00
Chris Wedgwood ddc07fa57a Avoid gcc -Werror=maybe-uninitialized warnings
Explicitly set acl details to zero to silence gcc (zfs_acl_node_read
can't be sure zfs_acl_znode_info will set acl_count and aclsize).
Normally suppressing these warnings by setting this to zero at
declaration time is a bad idea but in this instance it's hard to
avoid and should be fairly safe.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1244
2013-01-28 09:10:29 -08:00
Brian Behlendorf 6772fb679a Use dsl_dataset_snap_lookup()
Retire the dmu_snapshot_id() function which was introduced in the
initial .zfs control directory implementation.  There is already
an existing dsl_dataset_snap_lookup() which does exactly what we
need, and the dmu_snapshot_id() function as implemented is racy.

https://github.com/zfsonlinux/zfs/issues/1215#issuecomment-12579879

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1238
2013-01-25 15:07:40 -08:00
Brian Behlendorf bf01b5e616 Add d_clear_d_op() compatibility
Added d_clear_d_op() helper function which clears some flags and the
registered dentry->d_op table.  This is required because d_set_d_op()
issues a warning when the dentry operations table is already set.
For the .zfs control directory to work properly we must be able to
override the default operations table and register custom .d_automount
and .d_revalidate callbacks.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #1230
2013-01-23 16:33:29 -08:00
Ned Bass 1305d33a4b fzap_cursor_move_to_key() should drop l_rwlock
Callers of zap_deref_leaf() must be careful to drop leaf->l_rwlock
since that function returns with the lock held on success.  All other
callers drop the lock correctly but it seems fzap_cursor_move_to_key()
does not.  This may block writers or cause VERIFY failures when the
lock is freed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1215
Closes zfsonlinux/spl#143
Closes zfsonlinux/spl#97
2013-01-23 16:31:16 -08:00
Brian Behlendorf 09a661e960 Fix zpl_revalidate() NULL deref
In zpl_revalidate() it's possible for the nameidata to be NULL
for kernels which still accept the parameter.  In particular,
lookup_one_len() calls d_revalidate() with a NULL nameidata.

Resolve the issue by checking for a NULL nameidata in which case
just set the flags to 0.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1226
2013-01-22 09:38:17 -08:00