Commit Graph

574 Commits

Author SHA1 Message Date
Brian Behlendorf
1c50c992ba Revert "Avoid ELOOP on auto-mounted snapshots"
This reverts commit 7afcf5b1da which
accidentally introduced a regression with the .zfs snapshot directory.
While the updated code still does correctly mount the requested
snapshot.  It updates the vfsmount such that it references the
original dataset vfsmount.  The result is that the snapshot itself
isn't visible.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #816
2013-01-09 11:24:47 -08:00
Brian Behlendorf
4cec9b2dc7 Only reduce __zio_execute() stack usage in kernel space
Related to 91579709fc we need to
be very careful about not overrunning the stack in kernel space.
However, in user space we're already allowing slightly larger
stacks so this stack usage optimization is not required there.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-09 10:34:35 -08:00
George Wilson
1eb5bfa3dc Illumos #3145, #3212
3145 single-copy arc
3212 ztest: race condition between vdev_online() and spa_vdev_remove()

Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Justin T. Gibbs <gibbs@scsiguy.com>
Approved by: Eric Schrock <eric.schrock@delphix.com>

References:
  illumos-gate/commit/9253d63df408bb48584e0b1abfcc24ef2472382e
  illumos changeset: 13840:97fd5cdf328a
  https://www.illumos.org/issues/3145
  https://www.illumos.org/issues/3212

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #989
Closes #1137
2013-01-08 10:35:44 -08:00
Matthew Ahrens
753c38392d Illumos #3104: eliminate empty bpobjs
3104 eliminate empty bpobjs
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Eric Schrock <eric.schrock@delphix.com>

References:
  illumos/illumos-gate@f174573681
  illumos changeset: 13782:8f78aae28a63
  https://www.illumos.org/issues/3104

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-08 10:35:43 -08:00
Brian Behlendorf
91579709fc Fix __zio_execute() asynchronous dispatch
To save valuable stack all zio's were made asynchronous when in the
tgx_sync_thread context or during pool initialization.  See commit
2fac4c2 for the original patch and motivation.

Unfortuantely, the changes to dsl_pool_sync_context() made by the
feature flags broke this logic causing in __zio_execute() to dispatch
itself infinitely when called during pool initialization.  This
commit refines the existing logic to specificly target only the two
cases we care about.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-08 10:35:43 -08:00
George Wilson
ea0b2538cd Illumos #3349: zpool upgrade -V bumps the on disk version number
3349 zpool upgrade -V bumps the on disk version number, but leaves
the in core version
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  illumos/illumos-gate@25345e4666
  https://www.illumos.org/issues/3349

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-08 10:35:43 -08:00
Matthew Ahrens
29809a6cba Illumos #3086: unnecessarily setting DS_FLAG_INCONSISTENT on async
3086 unnecessarily setting DS_FLAG_INCONSISTENT on async
destroyed datasets
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>

References:
  illumos/illumos-gate@ce636f8b38
  illumos changeset: 13776:cd512c80fd75
  https://www.illumos.org/issues/3086

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-08 10:35:43 -08:00
Christopher Siden
b9b24bb4ca Illumos #2762: zpool command should have better support for feature flags
2762 zpool command should have better support for feature flags
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>

References:
  illumos/illumos-gate@57221772c3
  https://www.illumos.org/issues/2762

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-08 10:35:43 -08:00
George Wilson
3bc7e0fb0f Illumos #3090 and #3102
3090 vdev_reopen() during reguid causes vdev to be treated as corrupt
3102 vdev_uberblock_load() and vdev_validate() may read the wrong label

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>

References:
  illumos/illumos-gate@dfbb943217
  illumos changeset: 13777:b1e53580146d
  https://www.illumos.org/issues/3090
  https://www.illumos.org/issues/3102

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #939
2013-01-08 10:35:42 -08:00
Christopher Siden
9ae529ec5d Illumos #2619 and #2747
2619 asynchronous destruction of ZFS file systems
2747 SPA versioning with zfs feature flags
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Dan Kruchinin <dan.kruchinin@gmail.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>

References:
  illumos/illumos-gate@53089ab7c8
  illumos/illumos-gate@ad135b5d64
  illumos changeset: 13700:2889e2596bd6
  https://www.illumos.org/issues/2619
  https://www.illumos.org/issues/2747

NOTE: The grub specific changes were not ported.  This change
must be made to the Linux grub packages.

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-08 10:35:35 -08:00
Ned Bass
37f000c5aa Fix gcc array subscript above bounds warning
In a debug build, certain GCC versions flag an array bounds warning in
the below code from dnode_sync.c

    } else {
            int i;
            ASSERT(dn->dn_next_nblkptr[txgoff] < dnp->dn_nblkptr);
            /* the blkptrs we are losing better be unallocated */
            for (i = dn->dn_next_nblkptr[txgoff];
                i < dnp->dn_nblkptr; i++)
                    ASSERT(BP_IS_HOLE(&dnp->dn_blkptr[i]));

This usage is in fact safe, since the ASSERT ensures the index does
not exceed to maximum possible number of block pointers. However gcc
can't determine that the assignment 'i = dn->dn_next_nblkptr[txgoff];'
falls within the array bounds so it issues a warning.  To avoid this,
initialize i to zero to make gcc happy but skip the elements before
dn->dn_next_nblkptr[txgoff] in the loop body.  Since a dnode contains
at most 3 block pointers this overhead should be negligible.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #950
2013-01-07 11:21:52 -08:00
Matt Johnston
72938d6905 Use cv_wait_io() which will will account for iowait
Update zio_wait() to use cv_wait_io() to ensure the iowait time
is properly accounted for.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-07 10:52:52 -08:00
Matt Johnston
72f53c5694 Revert part of "Log I/Os longer than zio_delay_max (30s default)"
This reverts commit 9dcb971983
which was originally introduced to debug occasional slow I/Os.
These I/Os would complete eventually but were observed to take
several 100 seconds.

The root cause of this issue was the CFQ scheduler which can,
under certain conditions, excessively delay an I/O from being
issued to the device.  This issue was mitigated somewhat by
commit 84daaddedb which ensures
the I/O elevator gets changed even for DM style devices.

This change isn't in any way harmful but it does conflict with
a required change to properly account from I/O wait time.
Because Linux does not export the io_schedule_timeout() function
we must instead rely  on io_schedule() via cv_wait_io().

The additional debugging information which was added to the
delay event has been intentionally left in place.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-07 10:51:04 -08:00
Brian Behlendorf
65d56083b4 Fix zpool on zvol lock inversion deadlock
In all but one case the spa_namespace_lock is taken before the
bdev->bd_mutex lock.  But Linux __blkdev_get() function calls
fops->open() with the bdev->bd_mutex lock held and we must
somehow still safely acquire the spa_namespace_lock.

To avoid a potential lock inversion deadlock we preemptively
try to take the spa_namespace_lock().  Normally it will not
be contended and this is safe because spa_open_common() handles
the case where the caller already holds the spa_namespace_lock.

When it is contended we risk a lock inversion if we were to
block waiting for the lock.  Luckily, the __blkdev_get()
function allows us to return -ERESTARTSYS which will result in
bdev->bd_mutex being dropped, reacquired, and fops->open() being
called again.  This process can be repeated safely until both
locks are acquired.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #612
2012-12-20 09:57:39 -08:00
Brian Behlendorf
d5446cfc52 Revert "Remove TSD zfs_fsyncer_key"
This reverts commit 31f2b5abdf back
to the original code until the fsync(2) performance regression
can be addressed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-20 09:56:28 -08:00
Brian Behlendorf
31f2b5abdf Remove TSD zfs_fsyncer_key
It's my understanding that the zfs_fsyncer_key TSD was added as
a performance omtimization to reduce contention on the zl_lock
from zil_commit().  This issue manifested itself as very long
(100+ms) fsync() system call times for fsync() heavy workloads.

However, under Linux I'm not seeing the same contention that
was originally described.  Therefore, I'm removing this code
in order to ween ourselves off any dependence on TSD.  If the
original performance issue reappears on Linux we can revisit
fixing it without resorting to TSD.

This just leaves one small ZFS TSD consumer.  If it can be
cleanly removed from the code we'll be able to shed the SPL
TSD implementation entirely.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/spl#174
2012-12-19 09:08:01 -08:00
Prakash Surya
84daaddedb Set elevator for DM devices despite vdev_wholedisk
The current state of udev and devicer-mapper devices makes it difficult
to construct a mapping of DM partitions and their underlying DM device.
For example, with a /dev directory with the following contents:

    $ ls -d /dev/dm-*
    /dev/dm-0
    /dev/dm-1
    /dev/dm-2
    /dev/dm-3

it is not immediately apparent if these are completely separate devices,
or partitions and real devices intermixed. In contrast, SCSI devices
would appear as so:

    $ ls -d /dev/sd*
    /dev/sda
    /dev/sda1
    /dev/sdb
    /dev/sdb1

Here, one can immediately determine that there are two devices (sda and
sdb), each containing a single partition. The lack of a predictable and
consistent mapping from DM devices to DM device partitions makes it
difficult for user space to process these devices the same way it does
SCSI devices.

As a result, the ZFS utilities do not partition DM devices, and instead
set the "vdev_wholedisk" label to 0 and treat them as partitions. This
has the side effect that, even if ZFS has sole ownership of the device,
the IO scheduler will not be modified because it is treated as a
partition.

This change adds an exception for DM devices in vdev_elevator_switch,
allowing the elevator to be modified even though the "vdev_wholedisk"
property is not set.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1149
2012-12-18 15:12:40 -08:00
Jorgen Lundman
6c2856726f Fix using zvol as slog device
During the original ZoL port the vdev_uses_zvols() function was
disabled until it could be properly implemented.  This prevented
a zpool from use a zvol for its slog device.

This patch implements that missing functionality by adding a
zvol_is_zvol() function to zvol.c.  Given the full path to a
device it will lookup the device and verify its major number
against the registered zvol major number for the system.  If
they match we know the device is a zvol.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1131
2012-12-18 11:02:28 -08:00
Brian Behlendorf
8780c53961 Update SAs when an inode is dirtied
Revert the portion of commit d3aa3ea which always resulted in the
SAs being update when an mmap()'ed file was closed.  That change
accidentally resulted in unexpected ctime updates which upset tools
like git.  That was always a horrible hack and I'm happy it will
never make it in to a tagged release.

The right fix is something I initially resisted doing because I
was worried about the additional overhead.  However, in hindsight
the overhead isn't as bad as I feared.

This patch implemented the sops->dirty_inode() callback which is
unsurprisingly called when an inode is dirtied.  We leverage this
callback to keep the znode SAs strictly in sync with the inode.

However, for now we're going to go slowly to avoid introducing
any new unexpected issues by only updating the atime, mtime, and
ctime.  This will cover the callpath of most concern to us.

  ->filemap_page_mkwrite->file_update_time->update_time->
      mark_inode_dirty_sync->__mark_inode_dirty->dirty_inode

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #764
Closes #1140
2012-12-14 12:18:54 -08:00
Ned Bass
7afcf5b1da Avoid ELOOP on auto-mounted snapshots
Ensure that the path member pointers are associated with the
newly-mounted snapshot when zpl_snapdir_automount() returns.  Otherwise
the follow_automount() function may be called repeatedly, leading to an
incorrect ELOOP error return. This problem was observed as a 'Too many
levels of symbolic links' error from user-space commands accessing an
unmounted snapshot in the .zfs/snapshot directory.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #816
2012-12-13 08:57:11 -08:00
Brian Behlendorf
2ae1031962 Linux 3.7 compat, schedule_delayed_work()
Linux kernel commit d8e794d accidentally broke the delayed work
APIs for non-GPL callers.   While the APIs to schedule a delayed
work item are still available to all callers, it is no longer
possible to initialize the delayed work item.

I'm cautiously optimistic we could get the delayed_work_timer_fn
exported for all callers in the upstream kernel.  But frankly
the compatibility code to use this kernel interface has always
been problematic.

Therefore, this patch abandons direct use the of the Linux
kernel interface in favor of the new delayed taskq interface.
It provides roughly the same functionality as delayed work queues
but it's a stable interface under our control.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1053
2012-12-12 10:47:05 -08:00
Richard Yao
e4d89e9cfc Switch KM_SLEEP to KM_PUSHPAGE
When writes to zvols invoke ZIL, zfs_range_new_proxy() is called,
which allocates memory using KM_SLEEP, triggering a warning.
Switch to KM_PUSHPAGE to silence that warning.  See commit
b8d06fca08 for additional details.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1138
2012-12-10 09:44:45 -08:00
Brian Behlendorf
53c7411919 Revert "Fix unlink/xattr deadlock"
This reverts commit b00131d43c which
is no longer needed due to e89260a1c8.

This change forces all xattr znodes to hold a reference on their
parent which ensures prune_icache() will never attempt to evict
both the parent and child concurrently.  This effectively prevents
the deadlock condition from ever occuring.

Therefore we can safely revert back to the upstream synchronous
cleanup code.  This is nice because it keeps our code base closer
to upstream and resolves the performance issues introduced by the
original deadlock fix.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #457
2012-12-05 13:41:30 -08:00
Brian Behlendorf
d3aa3ea96e Preserve inode mtime/ctime in .writepage()
When updating a file via mmap()'ed I/O preserve the mtime/ctime
which were updated when the page was made writable by the generic
callback filemap_page_mkwrite().

But more importantly than preserving the exact time add the missing
call to sa_bulk_update().  This ensures that the znode modifications
are written to disk as part of the transaction.  Without this the
inode may mistaken rollback to the previous on-disk znode state.

Additionally, for mmap()'ed znodes explicitly set the atime, mtime,
and ctime on close using the up to date values in the inode.  This
is critical because writepage() may occur after close and on close
we need to ensure the values are correct.

Original-patch-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #764
2012-12-05 13:00:25 -08:00
Brian Behlendorf
e89260a1c8 Directory xattr znodes hold a reference on their parent
Unlike normal file or directory znodes, an xattr znode is
guaranteed to only have a single parent.  Therefore, we can
take a refernce on that parent if it is provided at create
time and cache it.  Additionally, we take care to cache it
on any subsequent zfs_zaccess() where the parent is provided
as an optimization.

This allows us to avoid needing to do a zfs_zget() when
setting up the SELinux security xattr in the create path.
This is critical because a hash lookup on the directory
will deadlock since it is locked.

The zpl_xattr_security_init() call has also been moved up
to the zpl layer to ensure TXs to create the required
xattrs are performed after the create TX.  Otherwise we
run the risk of deadlocking on the open create TX.

Ideally the security xattr should be fully constructed
before the new inode is unlocked.  However, doing so would
require far more extensive changes to ZFS.

This change may also have the benefitial side effect of
ensuring xattr directory znodes are evicted from the cache
before normal file or directory znodes due to the extra
reference.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #671
2012-12-03 12:10:46 -08:00
Brian Behlendorf
c3275b56a1 Add load_nvlist() error handling
Add the missing error handling to load_nvlist().  There's no good
reason this needs to be fatal.  All callers of load_nvlist() do
correctly handle an error condition and it is preferable that an
error be returned.  This will allow 'zpool import -FX' to safely
attempt to rollback through previous txgs looking for a good one.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1120
2012-11-30 13:48:17 -08:00
Brian Behlendorf
004324ecc6 Disable page allocation warnings for super block
Due to the slightly increased size of the ZFS super block
caused by 30315d2 there are now allocation warnings.  The
allocation size is still small (just over 8k) and super
blocks are rarely allocated so we suppress the warning.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1101
2012-11-30 11:04:44 -08:00
Brian Behlendorf
f74a147c02 Fix NULL deref when zvol_alloc() fails
If zvol_alloc() fails zv will be set to NULL and dereferenced
in out_dmu_objset_disown.  To avoid this entirely the zv->objset
line is moved up in to the success block.

Original-patch-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1109
2012-11-27 14:10:31 -08:00
George Wilson
32a9872bba Illumos #2671: zpool import should not fail if vdev ashift has increased
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Richard Lowe <richlowe@richlowe.net>

Refererces to Illumos issue:
      https://www.illumos.org/issues/2671

This patch has been slightly modified from the upstream Illumos
version.  In the upstream implementation a warning message is
logged to the console.  To prevent pointless console noise this
notification is now posted as a "ereport.fs.zfs.vdev.bad_ashift"
event.

The event indicates a non-optimial (but entirely safe) ashift
value was used to create the pool.  Depending on your workload
this may impact pool performance.  Unfortunately, the only way
to correct the issue is to recreate the pool with a new ashift.

NOTE: The unrelated fix to the comment in zpool_main.c appears
in the upstream commit and was preserved for consistnecy.

Ported-by: Cyril Plisko <cyril.plisko@mountall.com>
Reworked-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #955
2012-11-15 11:05:59 -08:00
Brian Behlendorf
4c837f0d93 Fix "allocating allocated segment" panic
Gunnar Beutner did all the hard work on this one by correctly
identifying that this issue is a race between dmu_sync() and
dbuf_dirty().

Now in all cases the caller is responsible for preventing this
race by making sure the zfs_range_lock() is held when dirtying
a buffer which may be referenced in a log record.  The mmap
case which relies on zfs_putpage() was not taking the range
lock.  This code was accidentally dropped when the function
was rewritten for the Linux VFS.

This patch adds the required range locking to zfs_putpage().

It also adds the missing ZFS_ENTER()/ZFS_EXIT() macros which
aren't strictly required due to the VFS holding a reference.
However, this makes the code more consistent with the upsteam
code and there's no harm in being extra careful here.

Original-patch-by: Gunnar Beutner <gunnar@beutner.name>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #541
2012-11-09 19:01:09 -08:00
Brian Behlendorf
e26ade5101 Fix zvol+btrfs hang
When using a zvol to back a btrfs filesystem the btrfs mount
would hang.  This was due to the bio completion callback used
in btrfs assuming that lower level drivers would never modify
the bio->bi_io_vecs after they were submitted via bio_submit().
If they are modified btrfs will miscalculate which pages need
to be unlocked resulting in a hang.

It's worth mentioning that other file systems such as ext[234]
and xfs work fine because they do not make the same assumption
in the bio completion callback.

The most straight forward way to fix the issue is to present
the semantics expected by btrfs.  This is done by cloning the
bios attached to each request and then using the clones bvecs
to perform the required accounting.  The clones are freed after
each read/write and the original unmodified bios are linked back
in to the request.

Signed-off-by: Chris Wedgwood <cw@f00f.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #469
2012-11-09 12:24:51 -08:00
Brian Behlendorf
9dcb971983 Log I/Os longer than zio_delay_max (30s default)
There have been reports of ZFS deadlocking due to what appears to
be a lost IO.  This patch addes some debugging to determine the
exact state of the IO which neither 1) completed, 2) failed, or
3) timed out after zio_delay_max (30) seconds.

This information will be logged using the ZFS FMA infrastructure
as a 'delay' event and posted to the internal zevent log.  By
default the last 64 events will be kept in the log but the limit
is configurable via the zfs_zevent_len_max module option.

To dump the contents of the log use the 'zpool events -v' command
and look for the resource.fs.zfs.delay event.  It will include
various information about the pool, vdev, and zio which may shed
some light on the issue.

In the context of this change the 120 second kernel blocked thread
watchdog has been disabled for synchronous IOs.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #930
2012-11-02 15:45:59 -07:00
Brian Behlendorf
e95853a331 Add txgs-<pool> kstat file
Create a kstat file which contains useful statistics about the
last N txgs processed.  This can be helpful when analyzing pool
performance.  The new KSTAT_TYPE_TXG type was added for this
purpose and it tracks the following statistics per-txg.

  txg          - Unique txg number
  state        - State (O)pen/(Q)uiescing/(S)yncing/(C)ommitted
  birth;       - Creation time
  nread        - Bytes read
  nwritten;    - Bytes written
  reads        - IOPs read
  writes       - IOPs write
  open_time;   - Length in nanoseconds the txg was open
  quiesce_time - Length in nanoseconds the txg was quiescing
  sync_time;   - Length in nanoseconds the txg was syncing

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-11-02 15:45:56 -07:00
Brian Behlendorf
e8fd45a0f9 Add ddt_object_count() error handling
The interface for the ddt_zap_count() function assumes it can
never fail.  However, internally ddt_zap_count() is implemented
with zap_count() which can potentially fail.  Now because there
was no way to return the error to the caller a VERIFY was used
to ensure this case never happens.

Unfortunately, it has been observed that pools can be damaged in
such a way that zap_count() fails.  The result is that the pool can
not be imported without hitting the VERIFY and crashing the system.

This patch reworks ddt_object_count() so the error can be safely
caught and returned to the caller.  This allows a pool which has
be damaged in this way to be safely rewound for import.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #910
2012-10-29 08:57:45 -07:00
Brian Behlendorf
178e73b376 Revert "Don't ashift-align vdev read requests."
This reverts commit a5c20e2a0a which
accidentally introduced a regression for real 4k sector devices.
See issue #1065 for details.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1065
2012-10-24 15:25:33 -07:00
Brian Behlendorf
f21e5c6a17 Remove 'Resized bio's/dio' warning
The following warning was originally added to provide visibility
in to how often a dio gets heavily fragmented in to over 16 bios.
This can happen due to constraints imposed by the block device
and may have a negitive impact on performance but is otherwise
harmless.  To prevent needless confusion and worry the message
has been removed.

  kernel: WARNING: Resized bio's/dio to 32

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-10-22 10:17:10 -07:00
Brian Behlendorf
c7dfc08629 Quote snapshot and mountpoint for .zfs automount
When automounting a snapshot in the .zfs/snapshot directory
make sure to quote both the dataset name and the mount point.
This ensures that if either component contains spaces, which
are allowed, they get handled correctly.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1027
2012-10-17 13:26:18 -07:00
Etienne Dechamps
5d7a86d114 Use the slog even with logbias=throughput.
In the current code, logbias=throughput implies the following:
 1) All synchronous writes are logged in indirect mode.
 2) The slog is not used.

(1) makes sense because it avoids writing the data twice, which is
obviously a good thing when the user wants maximum pool throughput.

(2), however, is a surprising decision. Considering all writes are
indirect, the log record doesn't contain the actual data, only pointers
to DMU blocks. As a result, log records written in logbias=throughput
mode are quite small, and as such, it doesn't make any sense to write
them to the main pool since slogs are usually optimized for small
synchronous writes.

In fact, the current behavior is actually harmful for performance,
because log blocks and data blocks from dmu_sync() seldom have the same
allocation size and as a result are usually allocated from different
metaslabs. This means that if a spindle has to write both log blocks and
DMU blocks (which is likely to happen under heavy load), it will have to
seek between the two. Allocating the log blocks from the slog pool
instead of the main pool avoids these unnecessary seeks.

This commit makes ZFS use the slog on datasets with logbias=throughput.
Real-life performance testing shows a 50% synchronous write performance
increase with some large commit sizes, and no negative effect in other
cases.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1013
2012-10-17 08:56:46 -07:00
Etienne Dechamps
920dd524fb Add FASTWRITE algorithm for synchronous writes.
Currently, ZIL blocks are spread over vdevs using hint block pointers
managed by the ZIL commit code and passed to metaslab_alloc(). Spreading
log blocks accross vdevs is important for performance: indeed, using
mutliple disks in parallel decreases the ZIL commit latency, which is
the main performance metric for synchronous writes. However, the current
implementation suffers from the following issues:

1) It would be best if the ZIL module was not aware of such low-level
details. They should be handled by the ZIO and metaslab modules;

2) Because the hint block pointer is managed per log, simultaneous
commits from multiple logs might use the same vdevs at the same time,
which is inefficient;

3) Because dmu_write() does not honor the block pointer hint, indirect
writes are not spread.

The naive solution of rotating the metaslab rotor each time a block is
allocated for the ZIL or dmu_sync() doesn't work in practice because the
first ZIL block to be written is actually allocated during the previous
commit. Consequently, when metaslab_alloc() decides the vdev for this
block, it will do so while a bunch of other allocations are happening at
the same time (from dmu_sync() and other ZILs). This means the vdev for
this block is chosen more or less at random. When the next commit
happens, there is a high chance (especially when the number of blocks
per commit is slightly less than the number of the disks) that one disk
will have to write two blocks (with a potential seek) while other disks
are sitting idle, which defeats spreading and increases the commit
latency.

This commit introduces a new concept in the metaslab allocator:
fastwrites. Basically, each top-level vdev maintains a counter
indicating the number of synchronous writes (from dmu_sync() and the
ZIL) which have been allocated but not yet completed. When the metaslab
is called with the FASTWRITE flag, it will choose the vdev with the
least amount of pending synchronous writes. If there are multiple vdevs
with the same value, the first matching vdev (starting from the rotor)
is used. Once metaslab_alloc() has decided which vdev the block is
allocated to, it updates the fastwrite counter for this vdev.

The rationale goes like this: when an allocation is done with
FASTWRITE, it "reserves" the vdev until the data is written. Until then,
all future allocations will naturally avoid this vdev, even after a full
rotation of the rotor. As a result, pending synchronous writes at a
given point in time will be nicely spread over all vdevs. This contrasts
with the previous algorithm, which is based on the implicit assumption
that blocks are written instantaneously after they're allocated.

metaslab_fastwrite_mark() and metaslab_fastwrite_unmark() are used to
manually increase or decrease fastwrite counters, respectively. They
should be used with caution, as there is no per-BP tracking of fastwrite
information, so leaks and "double-unmarks" are possible. There is,
however, an assert in the vdev teardown code which will fire if the
fastwrite counters are not zero when the pool is exported or the vdev
removed. Note that as stated above, marking is also done implictly by
metaslab_alloc().

ZIO also got a new FASTWRITE flag; when it is used, ZIO will pass it to
the metaslab when allocating (assuming ZIO does the allocation, which is
only true in the case of dmu_sync). This flag will also trigger an
unmark when zio_done() fires.

A side-effect of the new algorithm is that when a ZIL stops being used,
its last block can stay in the pending state (allocated but not yet
written) for a long time, polluting the fastwrite counters. To avoid
that, I've implemented a somewhat crude but working solution which
unmarks these pending blocks in zil_sync(), thus guaranteeing that
linguering fastwrites will get pruned at each sync event.

The best performance improvements are observed with pools using a large
number of top-level vdevs and heavy synchronous write workflows
(especially indirect writes and concurrent writes from multiple ZILs).
Real-life testing shows a 200% to 300% performance increase with
indirect writes and various commit sizes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1013
2012-10-17 08:56:41 -07:00
Brian Behlendorf
a298dbde92 Condition variable usage, zp->r_{rd,wr}_cv
The following incorrect usage of cv_broadcast() was caught by
code inspection.  The cv_broadcast() function must be called
under the associated mutex to preventing racing with cv_wait().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-10-15 16:02:03 -07:00
Brian Behlendorf
8c0712fd88 Condition variable usage, zilog->zl_cv_batch
The following incorrect usage of cv_signal and cv_broadcast()
was caught by code inspection.  The cv_signal and cv_broadcast()
functions must be called under the associated mutex to preventing
racing with cv_wait().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-10-15 16:01:58 -07:00
Brian Behlendorf
99db9bfde7 Condition variable usage, zevent_cv
The following incorrect usage of cv_broadcast() was caught by
code inspection.  The cv_broadcast() function must be called
under the associated mutex to preventing racing with cv_wait().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-10-15 16:01:54 -07:00
Massimo Maggi
6f53a6a229 Switch KM_SLEEP to KM_PUSHPAGE
In this particular instance the allocation occurred in the context
of sys_msync()->...->zpl_putpage() where we must be careful not to
initiate additional I/O.

Signed-off-by: Massimo Maggi <massimo@mmmm.it>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1038
2012-10-15 09:32:38 -07:00
Brian Behlendorf
c418410393 Limit zfs_vdev_aggregation_limit to SPA_MAXBLOCKSIZE
Prevent users from setting the zfs_vdev_aggregation_limit tuning
larger than SPA_MAXBLOCKSIZE.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #520
2012-10-15 09:28:43 -07:00
Yuxuan Shui
45ca2d91cb Return positive error number in zfsctl_shares_lookup.
Otherwise it will cause zpl_shares_lookup() to return a invalid
pointer when an error occurs.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Closes #626 #885 #947 #977
2012-10-15 09:11:56 -07:00
Yuxuan Shui
558ef6d080 Linux 3.6 compat, iops->create()
As of Linux commit ebfc3b49a7ac25920cb5be5445f602e51d2ea559 the
struct nameidata is no longer passed to iops->create.  Instead
only the result of (inamedata->flags & LOOKUP_EXCL) is passed.

ZFS like almost all Linux fileystems never made use of this so
only the prototype needs to be wrapped for compatibility.

Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #873
2012-10-14 14:42:25 -07:00
Yuxuan Shui
8f195a908f Linux 3.6 compat, iops->lookup()
As of Linux commit 00cd8dd3bf95f2cc8435b4cac01d9995635c6d0b the
struct nameidata is no longer passed to iops->lookup.  Instead
only the inamedata->flags are passed.

ZFS like almost all Linux fileystems never made use of this so
only the prototype needs to be wrapped for compatibility.

Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #873
2012-10-14 13:06:54 -07:00
Yuxuan Shui
3c20361075 Linux 3.6 compat, sget()
As of Linux commit 9249e17fe094d853d1ef7475dd559a2cc7e23d42 the
mount flags are now passed to sget() so they can be used when
initializing a new superblock.

ZFS never uses sget() in this fashion so we can simply pass a
zero and add a zpl_sget() compatibility wrapper.

Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #873
2012-10-14 13:06:48 -07:00
Yuxuan Shui
af26c4d4ab Linux 3.6 compat, sops->write_super() removed
The .write_super callback was removed the the super_operations
structure by Linux commit f0cd2dbb6cf387c11f87265462e370bb5469299e.
All file systems are now expected to self manage writing any dirty
state assoicated with their super block.

ZFS never made use of this callback so it can simply be removed
from the super_operations structure.

Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #873
2012-10-14 11:33:56 -07:00
Etienne Dechamps
a5c20e2a0a Don't ashift-align vdev read requests.
Currently, the size of read and write requests on vdevs is aligned
according to the vdev's ashift, allocating a new ZIO buffer and padding
if need be.

This makes sense for write requests to prevent read/modify/write if the
write happens to be smaller than the device's internal block size.

For reads however, the rationale is less clear. It seems that the
original code aligns reads because, on Solaris, device drivers will
outright refuse unaligned requests.

We don't have that issue on Linux. Indeed, Linux block devices are able
to accept requests of any size, and take care of alignment issues
themselves.

As a result, there's no point in enforcing alignment for read requests
on Linux. This is a nice optimization opportunity for two reasons:
- We remove a memory allocation in a heavily-used code path;
- The request gets aligned in the lowest layer possible, which shrinks
  the path that the additional, useless padding data has to travel.
  For example, when using 4k-sector drives that lie about their sector
  size, using 512b read requests instead of 4k means that there will
  be less data traveling down the ATA/SCSI interface, even though the
  drive actually reads 4k from the platter.

The only exception is raidz, because raidz needs to read the whole
allocated block for parity.

This patch removes alignment enforcement for read requests, except on
raidz. Note that we also remove an assertion that checks that we're
aligning a top-level vdev I/O, because that's not the case anymore for
repair writes that results from failed reads.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1022
2012-10-12 12:01:56 -07:00
Richard Yao
b68503fb30 Remove vmem_size() consumers
There are currently three vmem_size() consumers all of which are
part of the ARC implemention.  However, since the expected behavior
of the Linux and Solaris virtual memory subsystems are so different
the behavior in each of these instances needs to be reevaluated.

* arc_evict_needed() - This is actually dead code.  Arena support
was never added to the SPL and zio_arena is always NULL.  This
support isn't needed so we simply remove this dead code.

* arc_memory_throttle() - On Solaris where virtual memory constitutes
almost all of the address space we can reasonably expect there to be
a fairly large amount free.  However, on Linux by default we only
have about 100MB total and that's heavily used by the ARC.  So the
expectation on Linux is that this will usually be a small value.
Therefore we remove the vmem_size() check for i386 systems because
the expectation is that it will be less than the zfs_write_limit_max.

* arc_init() - Here vmem_size() is used to initially size the ARC.
Since the ARC is currently backed by the virtual address space it
makes sense to use this as a limit on the ARC for 32-bit systems.
This code can be removed when the ARC is backed by the page cache.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #831
2012-10-12 10:03:03 -07:00
Brian Behlendorf
87d98efe9e Fix zfs_txg_timeout module parameter
Allow the zfs_txg_timeout variable to be dynamically tuned at run
time.  By pulling it down out of the variable declaration it will
be evaluted each time through the loop.

The zfs_txg_timeout variable is now declared extern in a the common
sys/txg.h header rather than locally in dsl_scan.c.  This prevents
potential type mismatches if the global variable needs to be used
elsewhere.

Move the module_param() code in to the same source file where
zfs_txg_timeout is declared.  This is the most logical location.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-10-11 15:07:09 -07:00
Richard Yao
7df05a4266 Fix zfs_write_limit_max integer size mismatch on 32-bit systems
Commit c409e4647f introduced a
number of module parameters.  This required several types to be
changed to accomidate the required module parameters Linux macros.

Unfortunately, arc.c contained its own extern definition of the
zfs_write_limit_max variable and its type was not updated to be
consistent with its dsl_pool.c counterpart.  If the variable had
been properly marked extern in a common header, then gcc would
have generated a warning and this would not have slipped through.

The result of this was that the ARC unconditionally expected
zfs_write_limit_max to be 64-bit. Unfortunately, the largest size
integer module parameter that Linux supports is unsigned long, which
varies in size depending on the host system's native word size. The
effect was that on 32-bit systems, ARC incorrectly performed 64-bit
operations on a 32-bit value by reading the neighboring 32 bits as
the upper 32 bits of the 64-bit value.

We correct that by changing the extern declaration to use the unsigned
long type and move these extern definitions in to the common arc.h
header. This should make ARC correctly treat zfs_write_limit_max as a
32-bit value on 32-bit systems.

Reported-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #749
2012-10-11 11:09:25 -07:00
Cyril Plisko
15fd274973 Make zfs_immediate_write_sz a module paramater
zfs_immediate_write_sz variable is a tunable, but lacks proper
module_param() instrumentation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1032
2012-10-11 11:09:21 -07:00
Cyril Plisko
5b7e5b5ab9 txg is spelled as tgx in places
Term 'transaction group' is commonly abbreviated as txg in ZFS sources.
There are some places (Linux specific MODULE_PARAM_DESC() macros)
where it is incorrectly spelled as 'tgx'.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1030
2012-10-11 09:19:08 -07:00
Massimo Maggi
beb999445a Switch KM_SLEEP to KM_PUSHPAGE
Prevent snapshot_check to initiate I/O during memory allocation.

Signed-off-by: Massimo Maggi <massimo@mmmm.it>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1023
2012-10-08 10:19:05 -07:00
Brian Behlendorf
7bd04f2d7d Set default zvol elevator to noop
It doesn't make sense for a zvol to use the default system I/O
scheduler because it is a virtual device.  Therefore, we change
the default scheduler to 'noop' for zvols provided that the
elevator_change() function is available.  This interface has
been available since Linux 2.6.36 and appears in the RHEL 6.x
kernels.

We deliberately do not implement the method for older kernels
because it was racy and could result in system crashes.  It's
better to simply manually tune the scheduler for these kernels.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1017
2012-10-05 12:39:59 -07:00
Etienne Dechamps
089fa91bc5 Align DISCARD requests on zvols.
Currently, when processing DISCARD requests, zvol_discard() calls
dmu_free_long_range() with the precise offset and size of the request.

Unfortunately, this is not optimal for requests that are not aligned to
the zvol block boundaries. Indeed, in the case of an unaligned range,
dnode_free_range() will zero out the unaligned parts. Not only is this
useless since we are not freeing any space by doing so, it is also slow
because it translates to a read-modify-write operation.

This patch fixes the issue by rounding up the discard start offset to
the next volume block boundary, and rounding down the discard end
offset to the previous volume block boundary.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1010
2012-10-04 16:01:44 -07:00
Chris Dunlop
d75d6f294e Switch KM_SLEEP to KM_PUSHPAGE
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim.  See commit b8d06fc for additional details.

  SPL: Fixing allocation for task txg_sync (6093) which
  used GFP flags 0x297bda7c with PF_NOFS set

Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1002
2012-10-04 10:44:09 -07:00
Matthew Ahrens
04434775b7 Illumos #3100: zvol rename fails with EBUSY when dirty.
illumos/illumos-gate@2e2c135528
Illumos changeset: 13780:6da32a929222

3100 zvol rename fails with EBUSY when dirty

Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: Adam H. Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Eric Schrock <eric.schrock@delphix.com>

Ported-by: Etienne Dechamps <etienne.dechamps@ovh.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #995
2012-10-03 13:59:02 -07:00
George Wilson
65947351e7 Illumos #3129, #3130
illumos/illumos-gate@d6afdce20f
Illumos changeset: 13794:7c5e0e746b2c

3129 'zpool reopen' restarts resilvers
3130 ztest failure: Assertion failed:
     0 == dmu_objset_destroy(name, B_FALSE) (0x0 == 0x10)

Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  https://www.illumos.org/issues/3129
  https://www.illumos.org/issues/3130

Ported by: Etienne Dechamps <etienne.dechamps@ovh.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #994
2012-10-03 13:59:02 -07:00
Brian Behlendorf
6d1d976b2c Modify vdev_elevator_switch() to use elevator_change()
As of Linux 2.6.36 an elevator_change() interface was added.
This commit updates vdev_elevator_switch() to use this interface
when available, otherwise it falls back to the usermodehelper
method.

Original-patch-by: foobarz <sysop@xeon.(none)>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #906
2012-10-03 13:31:44 -07:00
Cyril Plisko
393b44c711 Implement .commit_metadata hook for NFS export
In order to implement synchronous NFS metadata semantics ZFS
needs to provide the .commit_metadata hook.  All it takes there
is to make sure changes are committed to ZIL.  Fortunately
zfs_fsync() does just that, so simply calling it from
zpl_commit_metadata() does the trick.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #969
2012-10-03 10:49:45 -07:00
Chris Wedgwood
23a61ccc1b zvol_probe should return NULL when the device isn't found.
Previously we returned ERR_PTR(-ENOENT) which the rest of the kernel
doesn't expect and as such we can oops.

Signed-off-by: Chris Wedgwood <cw@f00f.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #949
Closes #931
Closes #789
Closes #743
Closes #730
2012-10-03 10:39:12 -07:00
Bill Pijewski
37abac6d55 Illumos #2703: add mechanism to report ZFS send progress
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>

References:
  https://www.illumos.org/issues/2703

Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-09-19 13:39:06 -07:00
Chris Siden
1bd201e70d Illumos #1948: zpool list should show more detailed pool info
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Eric Schrock <eric.schrock@delphix.com>

References:
  https://www.illumos.org/issues/1948

Ported by:	Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #685
2012-09-19 13:39:05 -07:00
Brian Behlendorf
95fd8c9a7f Switch KM_SLEEP to KM_PUSHPAGE
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim.  See commit b8d06fca08
for additional details.

  SPL: Fixing allocation for task txg_sync (6093) which
  used GFP flags 0x297bda7c with PF_NOFS set

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #973
2012-09-19 11:52:36 -07:00
Brian Behlendorf
ba367276d8 Switch KM_SLEEP to KM_PUSHPAGE
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim.  See commit b8d06fca08
for additional details.

  SPL: Fixing allocation for task txg_sync (6093) which
  used GFP flags 0x297bda7c with PF_NOFS set

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
2012-09-17 11:22:23 -07:00
Cyril Plisko
49d39798f2 ZFS replay transaction error 5
When zfs_replay_write() replays TX_WRITE records from ZIL
it calls zpl_write_common() to perform the actual write.
zpl_write_common() returns the number of bytes written
(similar to write() system call) or an (negative) error.
However, the code expects the positive return value to be
a residual counter. Thus when zpl_write_common() successfully
completes it is mistakenly considered to be a partial write and
the error code delivered further. At this point the ZIL processing
is aborted with famous "ZFS replay transaction error 5" error
message given to the message buffer.

The fix is to compare the zpl_write_commmon() return value with
the buffer size and flag error only when they disagree.

Signed-off-by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #933
2012-09-17 11:06:58 -07:00
Brian Behlendorf
8312c6df55 Clear PG_writeback for sync I/O error case
Commit 2b2861362f accidentally
introduced this issue by only conditionally registering the
commit callback in the async case.

The error handing code for the dmu_tx_assign() failure case
relied on there always being a registered commit callback to
clear the PG_writeback bit.  Since that is no longer strictly
true for the synchronous case we must explicitly invoke the
callback.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #961
2012-09-14 15:53:47 -07:00
Brian Behlendorf
5915791096 Move iput() after zfs_inode_update()
When replaying an unlink/remove operation via zfs_rmdir() the object
being removed will be instantiated by a call to zfs_dirent_lock().
This means that there is a single reference protecting the object.
Right before the call to zfs_inode_update() this reference is dropped
which may cause the object to be destroyed.  This will result in a
NULL dereference as shown by the stack trace is issue #782.

This likely isn't an issue during normal operation because there is
always an additional reference held on the object by the VFS.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #782
2012-09-12 14:22:52 -07:00
Brian Behlendorf
4ca9a43644 Remove zvol device node
The 'zfs destroy' changes in 330d06f disrupted how zvol devices
get removed on ZoL.  However, it basically boils down to the
fact that we are no longer reliably calling zvol_remove_minor()
via zfs_ioc_destroy_snaps().

Therefore we add the missing call and handle things similarly
to the existing zfs_unmount_snap() case.  Ideally we would check
if this is of type DMU_OST_ZFS or DMU_OST_ZVOL and just do the
right thing as in zfs_ioc_destroy().  However, it looks like
it would be fairly expensive to get the type, and it's harmless
to simply attempt the umount and minor removal.

This is also an issue in the latest FreeBSD and Illumos code.
It was being tracked under the following issue, and we may want
to refresh our code when they settle on what they want to do
about it upstream.

  https://www.illumos.org/issues/3170

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #903
2012-09-10 10:25:08 -07:00
Cyril Plisko
04f9432d3b Make ZFS filesystem id persistent across different machines
Use ZFS dataset fsid guid as a unique file system id, similar to what is
done on Illumos/OpenSolaris.

Signed-off-by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #888
2012-09-06 12:47:11 -07:00
Brian Behlendorf
ebcfc8a534 Disable page allocation warnings for ARC buffers
Buffers for the ARC are normally backed by the SPL virtual slab.
However, if memory is low, AND no slab objects are available,
AND a new slab cannot be quickly constructed a new emergency
object will be directly allocated.

These objects can be as large as order 5 on a system with 4k
pages.  And because they are allocated with KM_PUSHPAGE, to
avoid a potential deadlock, they are not allowed to initiate I/O
to satisfy the allocation.  This can result in the occasional
allocation failure.

However, since these allocations are allowed to block and
perform operations such as memory compaction they will eventually
succeed.  Since this is not unexpected (just unlikely) behavior
this patch disables the warning for the allocation failure.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #465
2012-09-06 11:53:08 -07:00
Brian Behlendorf
cafa9709f3 Switch KM_SLEEP to KM_PUSHPAGE
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim.  See commit b8d06fca08
for additional details.

  SPL: Fixing allocation for task txg_sync (6093) which
  used GFP flags 0x297bda7c with PF_NOFS set

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
2012-09-05 08:44:58 -07:00
Brian Behlendorf
0ef0ff546e Switch KM_SLEEP to KM_PUSHPAGE
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim.  See commit b8d06fca08
for additional details.

  SPL: Fixing allocation for task txg_sync (6093) which
  used GFP flags 0x297bda7c with PF_NOFS set

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
2012-09-04 16:00:06 -07:00
Brian Behlendorf
594b4dd82a Switch KM_SLEEP to KM_PUSHPAGE
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim.  See commit b8d06fca08
for additional details.

  SPL: Fixing allocation for task txg_sync (6093) which
  used GFP flags 0x297bda7c with PF_NOFS set

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
2012-09-04 08:41:12 -07:00
Chris Dunlop
20a083cbe2 Switch KM_SLEEP to KM_PUSHPAGE
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim.  See commit b8d06fca08
for additional details.

  SPL: Fixing allocation for task txg_sync (6093) which
  used GFP flags 0x297bda7c with PF_NOFS set

Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
2012-09-02 10:15:49 -07:00
Brian Behlendorf
b404a3f07f Switch KM_SLEEP to KM_PUSHPAGE
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim.  See commit b8d06fca08
for additional details.

  SPL: Fixing allocation for task txg_sync (6093) which
  used GFP flags 0x297bda7c with PF_NOFS set

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
2012-08-31 17:39:29 -07:00
Brian Behlendorf
2b2861362f Clear PG_writeback after zil_commit() for sync I/O
When writing via ->writepage() the writeback bit was always cleared
as part of the txg commit callback.  However, when the I/O is also
being written synchronsously to the zil we can immediately clear this
bit.  There is no need to wait for the subsequent TXG sync since the
data is already safe on stable storage.

This has been observed to reduce the msync(2) delay from up to 5
seconds down 10s of miliseconds.  One workload which is expected
to benefit from this are the intermittent samba hands described
in issue #700.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #700
Closes #907
2012-08-30 20:16:28 -07:00
Richard Yao
b8d06fca08 Switch KM_SLEEP to KM_PUSHPAGE
Differences between how paging is done on Solaris and Linux can cause
deadlocks if KM_SLEEP is used in any the following contexts.

  * The txg_sync thread
  * The zvol write/discard threads
  * The zpl_putpage() VFS callback

This is because KM_SLEEP will allow for direct reclaim which may result
in the VM calling back in to the filesystem or block layer to write out
pages.  If a lock is held over this operation the potential exists to
deadlock the system.  To ensure forward progress all memory allocations
in these contexts must us KM_PUSHPAGE which disables performing any I/O
to accomplish the memory allocation.

Previously, this behavior was acheived by setting PF_MEMALLOC on the
thread.  However, that resulted in unexpected side effects such as the
exhaustion of pages in ZONE_DMA.  This approach touchs more of the zfs
code, but it is more consistent with the right way to handle these cases
under Linux.

This is patch lays the ground work for being able to safely revert the
following commits which used PF_MEMALLOC:

  21ade34 Disable direct reclaim for z_wr_* threads
  cfc9a5c Fix zpl_writepage() deadlock
  eec8164 Fix ASSERTION(!dsl_pool_sync_context(tx->tx_pool))

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #726
2012-08-27 12:01:37 -07:00
Brian Behlendorf
991fc1d7ae mzap_upgrade() must use kmem_alloc()
These allocations in mzap_update() used to be kmem_alloc() but
were changed to vmem_alloc() due to the size of the allocation.
However, since it turns out this function may be called in the
context of the txg_sync thread they must be changed back to use
a kmem_alloc() to ensure the KM_PUSHPAGE flag is honored.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:01:37 -07:00
Brian Behlendorf
8630650a8d Annotate KM_PUSHPAGE call paths with PF_NOFS
The txg_sync(), zfs_putpage(), zvol_write(), and zvol_discard()
call paths must only use KM_PUSHPAGE to avoid potential deadlocks
during direct reclaim.

This patch annotates these call paths so any accidental use of
KM_SLEEP will be quickly detected.   In the interest of stability
if debugging is disabled the offending allocation will have its
GFP flags automatically corrected.  When debugging is enabled
any misuse will be treated as a fatal error.

This patch is entirely for debugging.  We should be careful to
NOT become dependant on it fixing up the incorrect allocations.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:01:37 -07:00
Brian Behlendorf
86dd0fd922 Pre-allocate vdev I/O buffers
The vdev queue layer may require a small number of buffers
when attempting to create aggregate I/O requests.  Rather than
attempting to allocate them from the global zio buffers, which
is slow under memory pressure, it makes sense to pre-allocate
them because...

1) These buffers are short lived.  They are only required for
the life of a single I/O at which point they can be used by
the next I/O.

2) The maximum number of concurrent buffers needed by a vdev is
small.  It's roughly limited by the zfs_vdev_max_pending tunable
which defaults to 10.

By keeping a small list of these buffer per-vdev we can ensure
one is always available when we need it.  This significantly
reduces contention on the vq->vq_lock, because we no longer
need to perform a slow allocation under this lock.  This is
particularly important when memory is already low on the system.

It would probably be wise to extend the use of these buffers beyond
aggregate I/O and in to the raidz implementation.  The inability
to quickly allocate buffer for the parity stripes could result in
similiar problems.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:01:37 -07:00
Richard Yao
44f21da41c Revert Disable direct reclaim for z_wr_* threads
This commit used PF_MEMALLOC to prevent a memory reclaim deadlock.
However, commit 49be0ccf1f eliminated
the invocation of __cv_init(), which was the cause of the deadlock.
PF_MEMALLOC has the side effect of permitting pages from ZONE_DMA
to be allocated.  The use of PF_MEMALLOC was found to cause stability
problems when doing swap on zvols. Since this technique is known to
cause problems and no longer fixes anything, we revert it.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #726
2012-08-27 12:01:37 -07:00
Richard Yao
62c4165a1b Revert Fix zpl_writepage() deadlock
The commit, cfc9a5c88f, to fix deadlocks
in zpl_writepage() relied on PF_MEMALLOC.   That had the effect of
disabling the direct reclaim path on all allocations originating from
calls to this function, but it failed to address the actual cause of
those deadlocks.  This led to the same deadlocks being observed with
swap on zvols, but not with swap on the loop device, which exercises
this code.

The use of PF_MEMALLOC also had the side effect of permitting
allocations to be made from ZONE_DMA in instances that did not require
it.  This contributes to the possibility of panics caused by depletion
of pages from ZONE_DMA.

As such, we revert this patch in favor of a proper fix for both issues.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #726
2012-08-27 12:01:37 -07:00
Richard Yao
b876dac776 Revert Fix ASSERTION(!dsl_pool_sync_context(tx->tx_pool))
Commit eec8164771 worked around an issue
involving direct reclaim through the use of PF_MEMALLOC.   Since we
are reworking thing to use KM_PUSHPAGE so that swap works, we revert
this patch in favor of the use of KM_PUSHPAGE in the affected areas.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #726
2012-08-27 12:01:37 -07:00
Brian Behlendorf
cd38ac58a3 rmdir(2) should return ENOTEMPTY
Under Solaris the behavior for rmdir(2) is to return EEXIST when
a directory still contains entries.  However, on Linux ENOTEMPTY
is the expected return value with EEXIST being technically allowed.
According to rmdir(2):

ENOTEMPTY
   pathname contains entries other than . and .. ; or, pathname has
   ..  as its final component.  POSIX.1-2001 also allows EEXIST for
   this condition.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #895
2012-08-26 13:55:45 -07:00
Christopher Siden
9e11c7eee2 Illumos #3085: zfs diff panics, then panics in a loop on booting
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/3085

Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-25 12:32:25 -07:00
Simon Klinkert
c578f007ff Illumos #2901: zfs receive fails for exabyte sparse files
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  https://www.illumos.org/issues/2901

Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-25 12:28:29 -07:00
Javen Wu
a47587389e Drop spill buffer reference
When calling sa_update() and friends it is possible that a spill
buffer will be needed to accomidate the update.  When this happens
a hold is taken on the new dbuf and that hold must be released
before calling dmu_tx_commit().  Failing to release the hold will
cause a copy of the dbuf to be made in dbuf_sync_leaf().  This is
done to ensure further updates to the dbuf never sneak in to the
syncing txg.

This could be left to the sa_update() caller.  But then the caller
would need to be aware of this internal SA implementation detail.
It is therefore preferable to handle this all internally in the
SA implementation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #503
Closes #513
2012-08-25 09:26:10 -07:00
Brian Behlendorf
f828e63a0d Revert "Use SA_HDL_PRIVATE for SA xattrs"
This reverts commit ec2626ad3f which
caused consistency problems between the shared and private handles.
Reverting this change should resolve issues #709 and #727.  It
will also reintroduce an arc_anon memory leak which is addressed
by the next commit.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #709
Closes #727
2012-08-25 09:25:56 -07:00
Prakash Surya
15a9e03368 Wrap smp_processor_id in kpreempt_[dis|en]able
After surveying the code, the few places where smp_processor_id is used
were deemed to be safe to use with a preempt enabled kernel. As such, no
core logic had to be changed. These smp_processor_id call sites are simply
are wrapped in kpreempt_disable and kpreempt_enabled to prevent the
Linux kernel from emitting scary warnings.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Issue #83
2012-08-24 13:19:06 -07:00
Brian Behlendorf
4047414a6a Export dmu_buf_rele() symbol
While I'd like to remove the various pragmas in module/zfs/dbuf.c.
There are consumers such as Lustre which still depend on dmu_buf_*
versions of the symbols.  Until all consumers can be converted to
use only the dbuf_* names leave this symbol exported.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-14 08:38:19 -07:00
Brian Behlendorf
bafc4e9e2a Suppress 'zfs_sb_create' memory warning
When mutex debugging is enabled in your kernel the increased
size of the mutex structures can push the zfs_sb_t type beyond
the 8k warning threshold.  This isn't harmful so we suppress
the warning for this case.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #628
2012-08-10 16:43:32 -07:00
Brian Behlendorf
8f576c2321 Export dbuf_* symbols
Export these symbols so they may be used by other ZFS consumers
besides the ZPL.

Remove three stale prototype definites from dbuf.h.  The actual
implementations of these functions were removed/renamed long ago.

It would be good in the long term to remove the existing pragmas
we inherited from Solaris and simply use the dbuf_* names.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-10 16:45:13 -07:00
Dan McDonald
d96eb2b153 Illumos #1693: persistent 'comment' field for a zpool
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/1693

Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #678
2012-08-08 11:49:37 -07:00
Etienne Dechamps
ee5fd0bb80 Set zvol discard_granularity to the volblocksize.
Currently, zvols have a discard granularity set to 0, which suggests to
the upper layer that discard requests of arbirarily small size and
alignment can be made efficiently.

In practice however, ZFS does not handle unaligned discard requests
efficiently: indeed, it is unable to free a part of a block. It will
write zeros to the specified range instead, which is both useless and
inefficient (see dnode_free_range).

With this patch, zvol block devices expose volblocksize as their discard
granularity, so the upper layer is aware that it's not supposed to send
discard requests smaller than volblocksize.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #862
2012-08-07 14:55:31 -07:00
Etienne Dechamps
7c0e570888 Limit the number of blocks to discard at once.
The number of blocks that can be discarded in one BLKDISCARD ioctl on a
zvol is currently unlimited. Some applications, such as mkfs, discard
the whole volume at once and they use the maximum possible discard size
to do that. As a result, several gigabytes discard requests are not
uncommon.

Unfortunately, if a large amount of data is allocated in the zvol, ZFS
can be quite slow to process discard requests. This is especially true
if the volblocksize is low (e.g. the 8K default). As a result, very
large discard requests can take a very long time (seconds to minutes
under heavy load) to complete. This can cause a number of problems, most
notably if the zvol is accessed remotely (e.g. via iSCSI), in which case
the client has a high probability of timing out on the request.

This patch solves the issue by adding a new tunable module parameter:
zvol_max_discard_blocks. This indicates the maximum possible range, in
zvol blocks, of one discard operation. It is set by default to 16384
blocks, which appears to be a good tradeoff. Using the default
volblocksize of 8K this is equivalent to 128 MB. When using the maximum
volblocksize of 128K this is equivalent to 2 GB.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #858
2012-07-31 09:46:09 -07:00
Matthew Ahrens
330d06f90d Illumos #1644, #1645, #1646, #1647, #1708
1644 add ZFS "clones" property
1645 add ZFS "written" and "written@..." properties
1646 "zfs send" should estimate size of stream
1647 "zfs destroy" should determine space reclaimed by
     destroying multiple snapshots
1708 adjust size of zpool history data

References:
  https://www.illumos.org/issues/1644
  https://www.illumos.org/issues/1645
  https://www.illumos.org/issues/1646
  https://www.illumos.org/issues/1647
  https://www.illumos.org/issues/1708

This commit modifies the user to kernel space ioctl ABI.  Extra
care should be taken when updating to ensure both the kernel
modules and utilities are updated.  This change has reordered
all of the new ioctl()s to the end of the list.  This should
help minimize this issue in the future.

Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Albert Lee <trisk@opensolaris.org>
Approved by: Garrett D'Amore <garret@nexenta.com>

Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #826
Closes #664
2012-07-31 09:25:30 -07:00
Etienne Dechamps
2ee4a18b2a Add script for builtin module building.
This commit introduces a "copy-builtin" script designed to prepare a
kernel source tree for building ZFS as a builtin module. The script
makes a full copy of all needed files, thus making the kernel source
tree fully independent of the zfs source package.

To achieve that, some compilation flags (-include, -I) have been moved
to module/Makefile. This Makefile is only used when compiling external
modules; when compiling builtin modules, a Kbuild file generated by the
configure-builtin script is used instead. This makes sure Makefiles
inside the kernel source tree does not contain references to the zfs
source package.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #851
2012-07-26 13:45:09 -07:00
Richard Yao
739a1a82e0 Linux 3.5 compat, end_writeback() changed to clear_inode()
The end_writeback() function was changed by moving the call to
inode_sync_wait() earlier in to evict().   This effecitvely changes
the ordering of the sync but it does not impact the details of
the zfs implementation.

However, as part of this change end_writeback() was renamed to
clear_inode() to reflect the new semantics.  This change does
impact us and clear_inode() now maps to end_writeback() for
kernels prior to 3.5.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #784
2012-07-23 12:29:36 -07:00
Richard Yao
ea1fdf46e2 Linux 3.5 compat, iops->truncate_range() removed
The vmtruncate_range() support has been removed from the kernel in
favor of using the fallocate method in the file_operations table.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #784
2012-07-23 12:29:32 -07:00
Richard Yao
756c3e5a9c Linux 3.5 compat, eops->encode_fh() takes inodes
The export_operations member ->encode_fh() has been updated to
take both the child and parent inodes.  This interface used to
take the child dentry and a bool describing if the parent is needed.

NOTE: While updating this code I noticed that we do not currently
cleanly handle the case where we're passed a connectable parent.
This code should be audited to make sure we're doing the right thing.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #784
2012-07-23 12:29:23 -07:00
Brian Behlendorf
fc173c8589 Disable .zfs directory on 32-bit systems
The .zfs control directory implementation currently relies on
the fact that there is a direct 1:1 mapping from an object id
to its inode number.  This works well as long as the system
uses a 64-bit value to store the inode number.

Unfortunately, the Linux kernel defines the inode number as
an 'unsigned long' type.  This means that for 32-bit systems
will only have 32-bit inode numbers but we still have 64-bit
object ids.

This problem is particularly acute for the .zfs directories
which leverage those upper 32-bits.  This is done to avoid
conflicting with object ids which are allocated monotonically
starting from 0.  This is likely to also be a problem for
datasets on 32-bit systems with more than ~2 billion files.

The right long term fix must remove the simple 1:1 mapping.
Until that's done the only safe thing to do is to disable the
.zfs directory on 32-bit systems.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-07-20 12:20:57 -07:00
Brian Behlendorf
2a4a9dc2f0 Add ddt_object_load() error handling
Add the missing error handling to ddt_object_load().  There's no
good reason this needs to be fatal.  It is preferable that an
error be returned.  This will allow 'zpool import -FX' to safely
attempt to rollback through previous txgs looking for a good one.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-07-20 10:36:21 -07:00
Brian Behlendorf
10be533e33 Add 'inline' keyword
The '__attribute__((always_inline))' does not strictly imply
'inline'.  Newer versions of gcc detect this misuse and issue
the following warning.  Including the missing 'inline' resolves
the build warning.

    ./module/zfs/dsl_scan.c:758:1:error: always_inline function
    might not be inlinable [-Werror=attributes]

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-07-19 13:41:00 -07:00
Richard Yao
0a6b03d3b8 Fix build failures on PaX/GRSecurity patched kernels
Gentoo Hardened kernels include the PaX/GRSecurity patches. They use a
dialect of C that relies on a GCC plugin. In particular, struct
file_operations has been marked do_const in the PaX/GRSecurity dialect,
which causes GCC to consider all instances of it as const. This caused
failures in the autotools checks and the ZFS source code.

To address this, we modify the autotools checks to take into account
differences between the PaX C dialect and the regular C dialect. We also
modify struct zfs_acl's z_ops member to be a pointer to a function
pointer table. Lastly, we modify zpl_put_link() to address a PaX change
to the function prototype of nd_get_link().  This avoids compiler errors
in the PaX/GRSecurity dialect.

Note that the change in zpl_put_link() causes a warning that becomes a
build failure when debugging is enabled. Fixing that warning requires
ryao/spl@5ca50ef459.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #484
2012-07-17 09:22:43 -07:00
Etienne Dechamps
b5a28807cd Move partition scanning from userspace to module.
Currently, zpool online -e (dynamic vdev expansion) doesn't work on
whole disks because we're invoking ioctl(BLKRRPART) from userspace
while ZFS still has a partition open on the disk, which results in
EBUSY.

This patch moves the BLKRRPART invocation from the zpool utility to the
module. Specifically, this is done just before opening the device in
vdev_disk_open() which is called inside vdev_reopen(). This requires
jumping through some hoops to get to the disk device from the partition
device, and to make sure we can still open the partition after the
BLKRRPART call.

Note that this new code path is triggered on dynamic vdev expansion
only; other actions, like creating a new pool, are unchanged and still
call BLKRRPART from userspace.

This change also depends on API changes which are available in 2.6.37
and latter kernels.  The build system has been updated to detect this,
but there is no compatibility mode for older kernels.  This means that
online expansion will NOT be available in older kernels.  However, it
will still be possible to expand the vdev offline.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #808
2012-07-17 09:17:31 -07:00
George Wilson
c7f2d69de3 Illumos #1949, #1953
1949 crash during reguid causes stale config
1953 allow and unallow missing from zpool history since removal of pyzfs

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Garrett D'Amore <garrett.damore@gmail.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Steve Gonczi <gonczi@comcast.net>
Approved by: Eric Schrock <eric.schrock@delphix.com>

References:
  https://www.illumos.org/issues/1949
  https://www.illumos.org/issues/1953

Ported by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #665
2012-07-11 13:33:31 -07:00
Garrett D'Amore
3541dc6d02 Illumos #1748: desire support for reguid in zfs
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Alexander Eremin <alexander.eremin@nexenta.com>
Reviewed by: Alexander Stetsenko <ams@nexenta.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/1748

This commit modifies the user to kernel space ioctl ABI.  Extra
care should be taken when updating to ensure both the kernel
modules and utilities are updated.  If only the user space
component is updated both the 'zpool events' command and the
'zpool reguid' command will not work until the kernel modules
are updated.

Ported by:     Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #665
2012-07-11 13:08:56 -07:00
Brian Behlendorf
42d3b990cf Update incorrect ddt_zap_lookup() assertion
When the ddt_zap_lookup() function was updated to dynamically
allocate memory for the cbuf variable, to save stack space, the
'csize <= sizeof (cbuf)' assertion was not updated.  The result
of this was that the size of the pointer was being used in the
comparison rather than the buffer size.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
2012-07-03 15:14:34 -07:00
Etienne Dechamps
b6ad9671ac Add ZIL statistics.
The performance of the ZIL is usually the main bottleneck when dealing with
synchronous, write-heavy workloads (e.g. databases). Understanding the
behavior of the ZIL is required to diagnose performance issues for these
workloads, and to tune ZIL parameters (like zil_slog_limit) accordingly.

This commit adds a new kstat page dedicated to the ZIL with some counters
which, hopefully, scheds some light into what the ZIL is doing, and how it is
doing it.

Currently, these statistics are available in /proc/spl/kstat/zfs/zil.
A description of the fields can be found in zil.h.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #786
2012-06-29 09:56:51 -07:00
Pawel Jakub Dawidek
0cee24064a Speed up 'zfs list -t snapshot -o name -s name'
FreeBSD #xxx:  Dramatically optimize listing snapshots when user
requests only snapshot names and wants to sort them by name, ie.
when executes:

  # zfs list -t snapshot -o name -s name

Because only name is needed we don't have to read all snapshot
properties.

Below you can find how long does it take to list 34509 snapshots
from a single disk pool before and after this change with cold and
warm cache:

    before:

        # time zfs list -t snapshot -o name -s name > /dev/null
        cold cache: 525s
        warm cache: 218s

    after:

        # time zfs list -t snapshot -o name -s name > /dev/null
        cold cache: 1.7s
        warm cache: 1.1s

NOTE: This patch only appears in FreeBSD.  If/when Illumos picks up
the change we may want to drop this patch and adopt their version.
However, for now this addresses a real issue.

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #450
2012-06-14 09:49:04 -07:00
Darik Horn
74497b7ab6 Add zvol_inhibit_dev module option.
ZoL can create more zvols at runtime than can be configured during
system start, which hangs the init stack at reboot.

When a slow system has more than a few hundred zvols, udev will
fork bomb during system start and spend too much time in device
detection routines, so upstart kills it.

The zfs_inhibit_dev option allows an affected system to be rescued
by skipping /dev/zd* creation and thereby avoiding the udev
overload. All zvols are made inaccessible if this option is set, but
the `zfs destroy` and `zfs send` commands still work, and ZFS
filesystems can be mounted.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-06-13 17:05:16 -07:00
Etienne Dechamps
ee191e802c Make zil_slog_limit a tunable module parameter.
zil_slog_limit specifies the maximum commit size to be written to the separate
log device. Larger commits bypass the separate log device and go directly to
the data devices.

The optimal value for zil_slog_limit directly depends on the latency and
throughput characteristics of both the separate log device and the data disks.
Small synchronous writes are faster on low-latency separate log devices (e.g.
SSDs) whereas large synchronous writes are faster on high-latency data disks
(e.g. spindles) because of higher throughput, especially with a large array.
The point is, the line between "small" and "large" synchronous writes in this
scenario is heavily dependent on the hardware used. That's why it should be
made configurable.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #783
2012-06-12 08:45:53 -07:00
Richard Yao
6a0936babc Linux 3.4 compat, d_make_root() replaces d_alloc_root()
torvalds/linux@adc0e91ab1 introduced
introduced d_make_root() as a replacement for d_alloc_root(). Further
commits appear to have removed d_alloc_root() from the Linux source
tree. This causes the following failure:

  error: implicit declaration of function 'd_alloc_root'
  [-Werror=implicit-function-declaration]

To correct this we update the code to use the current d_make_root()
interface for readability.  Then we introduce an autotools check
to determine if d_make_root() is available.  If it isn't then we
define some compatibility logic which used the older d_alloc_root()
interface.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #776
2012-06-11 10:04:49 -07:00
Etienne Dechamps
ab85f8455b Honor logbias when writing to ZVOLs.
The logbias option is not taken into account when writing to ZVOLs. We fix
that by using the same logic as in the zfs filesystem write code
(see zfs_log.c).

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #774
2012-06-11 09:43:48 -07:00
Brian Behlendorf
710114089f Revert "Disable direct reclaim on zvols"
This reverts commit ce90208cf9.  This
change was observed to cause problems when using a zvol to back a VM
under 2.6.32.59 kernels.  This issue was filed as #710.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #342
Issue #710
2012-04-30 14:26:49 -07:00
Brian Behlendorf
b39d3b9f7b Linux 3.3 compat, iops->create()/mkdir()/mknod()
The mode argument of iops->create()/mkdir()/mknod() was changed from
an 'int' to a 'umode_t'.  To prevent a compiler warning an autoconf
check was added to detect the API change and then correctly set a
zpl_umode_t typedef.  There is no functional change.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #701
2012-04-30 12:52:38 -07:00
Richard Yao
ce90208cf9 Disable direct reclaim on zvols
Previously, it was possible for the direct reclaim path to be invoked
when a write to a zvol was made. When a zvol is used as a swap device,
this often causes swap requests to depend on additional swap requests,
which deadlocks. We address this by disabling the direct reclaim path
on zvols.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #342
2012-04-30 11:25:36 -07:00
Richard Yao
518b487602 Update ARC memory limits to account for SLUB internal fragmentation
23bdb07d4e updated the ARC memory limits
to be 1/2 of memory or all but 4GB. Unfortunately, these values assume
zero internal fragmentation in the SLUB allocator, when in reality, the
internal fragmentation could be as high as 50%, effectively doubling
memory usage. This poses clear safety issues, because it permits the
size of ARC to exceed system memory.

This patch changes this so that the default value of arc_c_max is always
1/2 of system memory. This effectively limits the ARC to the memory that
the system has physically installed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #660
2012-04-30 10:04:34 -07:00
Brian Behlendorf
302f753f16 Integrate ARC more tightly with Linux
Under Solaris the ARC was designed to stay one step ahead of the
VM subsystem.  It would attempt to recognize low memory situtions
before they occured and evict data from the cache.  It would also
make assessments about if there was enough free memory to perform
a specific operation.

This was all possible because Solaris exposes a fairly decent
view of the memory state of the system to other kernel threads.
Linux on the other hand does not make this information easily
available.  To avoid extensive modifications to the ARC the SPL
attempts to provide these same interfaces.  While this works it
is not ideal and problems can arise when the ARC and Linux have
different ideas about when your out of memory.  This has manifested
itself in the past as a spinning arc_reclaim_thread.

This patch abandons the emulated Solaris interfaces in favor of
the prefered Linux interface.  That means moving the bulk of the
memory reclaim logic out of the arc_reclaim_thread and in to the
evict driven shrinker callback.  The Linux VM will call this
function when it needs memory.  The ARC is then responsible for
attempting to free the requested amount of memory if possible.

Several interfaces have been modified to accomidate this approach,
however the basic user space implementation remains the same.
The following changes almost exclusively just apply to the kernel
implementation.

* Removed the hdr_recl() reclaim callback which is redundant
  with the broader arc_shrinker_func().

* Reduced arc_grow_retry to 5 seconds from 60.  This is now used
  internally in the ARC with arc_no_grow to indicate that direct
  reclaim was recently performed.  This typically indicates a
  rapid change in memory demands which the kswapd threads were
  unable to keep ahead of.  As long as direct reclaim is happening
  once every 5 seconds arc growth will be paused to avoid further
  contributing to the existing memory pressure.  The more common
  indirect reclaim paths will not set arc_no_grow.

* arc_shrink() has been extended to take the number of bytes by
  which arc_c should be reduced.  This allows for a more granual
  reduction of the arc target.  Since the kernel provides a
  reclaim value to the arc_shrinker_func() this value is used
  instead of 1<<arc_shrink_shift.

* arc_reclaim_needed() has been removed.  It was used to determine
  if the system was under memory pressure and relied extensively
  on Solaris specific VM interfaces.  In most case the new code
  just checks arc_no_grow which indicates that within the last
  arc_grow_retry seconds direct memory reclaim occurred.

* arc_memory_throttle() has been updated to always include the
  amount of evictable memory (arc and page cache) in its free
  space calculations.  This space is largely available in most
  call paths due to direct memory reclaim.

* The Solaris pageout code was also removed to avoid confusion.
  It has always been disabled due to proc_pageout being defined
  as NULL in the Linux port.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-04-30 10:03:05 -07:00
Brian Behlendorf
afec56b43f Add zfs_mdcomp_disable module option
Expose the zfs_mdcomp_disable variable as a module option.  This
can be used to disable compression of zfs meta data which is
enabled by default.  This shouldn't need to be tuned but for
most workloads, however there may be very specific instances
where it makes sense to trade disk capacity for extra cpu cycles.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-04-27 16:28:02 -07:00
Brian Behlendorf
ebf8e3a237 Illumos #1909: disk sync write perf regression when slog is used post oi_148
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Steve Gonczi <gonczi@comcast.net>
Reviewed by: Garrett D'Amore <garrett.damore@gmail.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Albert Lee <trisk@nexenta.com>
Approved by: Eric Schrock <eric.schrock@delphix.com>

Refererces to Illumos issue:
  https://www.illumos.org/issues/1909

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #680
2012-04-19 16:26:29 -07:00
Prakash Surya
409dc1a570 Use KM_PUSHPAGE in l2arc_write_buffers
There is potential for deadlock in the l2arc_feed thread if KM_PUSHPAGE
is not used for the allocations made in l2arc_write_buffers.
Specifically, if KM_PUSHPAGE is not used for these allocations, it is
possible for reclaim to be triggered which can cause the l2arc_feed
thread to deadlock itself on the ARC_mru mutex. An example of this is
demonstrated in the following backtrace of the l2arc_feed thread:

    crash> bt 4123
    PID: 4123   TASK: ffff88062f8c1500  CPU: 6   COMMAND: "l2arc_feed"
      0 [ffff88062511d610] schedule at ffffffff814eeee0
      1 [ffff88062511d6d8] __mutex_lock_slowpath at ffffffff814f057e
      2 [ffff88062511d748] mutex_lock at ffffffff814f041b
      3 [ffff88062511d768] arc_evict at ffffffffa05130ca [zfs]
      4 [ffff88062511d858] arc_adjust at ffffffffa05139a9 [zfs]
      5 [ffff88062511d878] arc_shrink at ffffffffa0513a95 [zfs]
      6 [ffff88062511d898] arc_kmem_reap_now at ffffffffa0513be8 [zfs]
      7 [ffff88062511d8c8] arc_shrinker_func at ffffffffa0513ccc [zfs]
      8 [ffff88062511d8f8] shrink_slab at ffffffff8112a17a
      9 [ffff88062511d958] do_try_to_free_pages at ffffffff8112bfdf
     10 [ffff88062511d9e8] try_to_free_pages at ffffffff8112c3ed
     11 [ffff88062511da98] __alloc_pages_nodemask at ffffffff8112431d
     12 [ffff88062511dbb8] kmem_getpages at ffffffff8115e632
     13 [ffff88062511dbe8] fallback_alloc at ffffffff8115f24a
     14 [ffff88062511dc68] ____cache_alloc_node at ffffffff8115efc9
     15 [ffff88062511dcc8] __kmalloc at ffffffff8115fbf9
     16 [ffff88062511dd18] kmem_alloc_debug at ffffffffa047b8cb [spl]
     17 [ffff88062511dda8] l2arc_feed_thread at ffffffffa0511e71 [zfs]
     18 [ffff88062511dea8] thread_generic_wrapper at ffffffffa047d1a1 [spl]
     19 [ffff88062511dee8] kthread at ffffffff81090a86
     20 [ffff88062511df48] kernel_thread at ffffffff8100c14a

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-04-17 11:56:21 -07:00
Martin Matuska
7d5cd71da6 Illumos #1346: zfs incremental receive may leave behind temporary clones
1356 zfs dataset prefetch code not working
Reviewed by: Matthew Ahrens <matt@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References to Illumos issue:
  https://www.illumos.org/issues/1346
  https://www.illumos.org/issues/1356

Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #647
2012-04-11 12:02:27 -07:00
Albert Lee
22cd4a4653 Illumos #1475: zfs spill block hold can access invalid spill blkptr
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Approved by: Garrett D'Amore <garrett@nexenta.com>

References to Illumos issue:
  https://www.illumos.org/issues/1475

Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #648
2012-04-11 11:46:30 -07:00
George Wilson
5ffb9d1d05 Illumos #1951: leaking a vdev when removing an l2cache device
1952 memory leak when adding a file-based l2arc device
1954 leak in ZFS from metaslab_group_create and zfs_ereport_checksum

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Eric Schrock <eric.schrock@delphix.com>

References to Illumos issues:
  https://www.illumos.org/issues/1951
  https://www.illumos.org/issues/1952
  https://www.illumos.org/issues/1954

Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #650
2012-04-11 11:32:06 -07:00
Martin Matuska
b129c6590e OS-926: zfs panic in zfs_fill_zplprops_impl()
This change appears to be exclusive to SmartOS. It is not present in
illumos-gate but it just adds some needed error handling.  This is
clearly preferable to simply ASSERTING which is what would occur
prior to the patch.

Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #652
2012-04-11 11:29:19 -07:00
Andriy Gapon
3adfc400f5 Illumos #1680: zfs vdev_file_io_start: validate vdev before using vdev_tsd
vdev_tsd can be NULL for certain vdev states.
At least in userland testing with ztest.

References to Illumos issue:
  https://www.illumos.org/issues/1680

Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #655
2012-04-11 11:23:18 -07:00
Brian Behlendorf
f0fd83be65 Export additional dsl symbols
Principly these symbols were exported to get access to the
dsl_prop_register/dsl_prop_unregister functions.  They allow
us to cleanly register a callback which is called when a
dataset property is modified.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-04-11 09:26:55 -07:00
Gunnar Beutner
1f0d8a566f Fixed a NULL pointer dereference bug in zfs_preumount
When zpl_fill_super -> zfs_domount fails (e.g. because the dataset
was destroyed before it could be successfully mounted) the subsequent
call to zpl_kill_sb -> zfs_preumount would derefence a NULL pointer.

This bug can be reproduced using this shell script:

 #!/bin/sh
 (
 while true; do
 	zfs create -o mountpoint=legacz tank/bar
 	zfs destroy tank/bar
 done
 ) &

 (
 while true; do
 	mount -t zfs tank/bar /mnt
 	umount /mnt
 done
 ) &

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #639
2012-04-05 11:29:42 -07:00
Brian Behlendorf
fc41c6402b Properly expose the mfu ghost list kstats
Due to a typo the mru ghost lists stats were accidentally being
exposed as the mfu ghost list stats.  This was harmless but
confusing since memory usage could be over reported.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-03-27 15:08:22 -07:00
Brian Behlendorf
1c5de20ae2 Add --enable-debug-dmu-tx configure option
Allow rigorous (and expensive) tx validation to be enabled/disabled
indepentantly from the standard zfs debugging.  When enabled these
checks ensure that all txs are constructed properly and that a dbuf
is never dirtied without taking the correct tx hold.

This checking is particularly helpful when adding new dmu consumers
like Lustre.  However, for established consumers such as the zpl
with no known outstanding tx construction problems this is just
overhead.

--enable-debug-dmu-tx  - Enable/disable validation of each tx as
--disable-debug-dmu-tx   it is constructed.  By default validation
                         is disabled due to performance concerns.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-03-23 12:25:17 -07:00
Brian Behlendorf
99ea23c583 Enhance a dmu_tx_dirty_buf() assertion
The following assertion is good to validate the correctness of
new DMU consumers, but it doesn't quite provide enough information.
Slightly rework the assertion so that when it is hit the actual
offending values will be included in the output.

  SPLError: 4787:0:(dmu_tx.c:828:dmu_tx_dirty_buf())
  ASSERTION(dn == NULL || dn->dn_assigned_txg == tx->tx_txg) failed

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-03-23 12:24:05 -07:00
Brian Behlendorf
4b5d425f14 Add ZFS_META_RELEASE to module load/unload messages
Include the ZFS_META_RELEASE in the module load/unload messages
to more clearly indidcate exactly what version of ZFS has been
loaded.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-03-23 12:14:35 -07:00
Brian Behlendorf
9ed86e7cc7 Account for .zfs ctldir inodes
Because the .zfs ctldir inodes are not backed by physical storage
they use a different create path which was not properly accounting
for them as used.  This could result in ->nr_cached_objects()
returning 0 and cause a divide by zero error in prune_super().

In my option there's a kernel bug here too which allows this to
happen.  They should either be checking for 0 or adding +1 like
they correctly do earlier in the function.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #617
2012-03-22 15:43:55 -07:00
Brian Behlendorf
ebe7e575ea Add .zfs control directory
Add support for the .zfs control directory.  This was accomplished
by leveraging as much of the existing ZFS infrastructure as posible
and updating it for Linux as required.  The bulk of the core
functionality is now all there with the following limitations.

*) The .zfs/snapshot directory automount support requires a 2.6.37
   or newer kernel.  The exception is RHEL6.2 which has backported
   the d_automount patches.

*) Creating/destroying/renaming snapshots with mkdir/rmdir/mv
   in the .zfs/snapshot directory works as expected.  However,
   this functionality is only available to root until zfs
   delegations are finished.

      * mkdir - create a snapshot
      * rmdir - destroy a snapshot
      * mv    - rename a snapshot

The following issues are known defeciences, but we expect them to
be addressed by future commits.

*) Add automount support for kernels older the 2.6.37.  This should
   be possible using follow_link() which is what Linux did before.

*) Accessing the .zfs/snapshot directory via NFS is not yet possible.
   The majority of the ground work for this is complete.  However,
   finishing this work will require resolving some lingering
   integration issues with the Linux NFS kernel server.

*) The .zfs/shares directory exists but no futher smb functionality
   has yet been implemented.

Contributions-by: Rohan Puri <rohan.puri15@gmail.com>
Contributiobs-by: Andrew Barnes <barnes333@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #173
2012-03-22 13:03:47 -07:00
Brian Behlendorf
49be0ccf1f Add zio constructor/destructor
Add a standard zio constructor and destructor.  Normally, this is
done to reduce to cost of allocating a new structure by reducing
expensive operations such as memory allocations.  However, in this
case none of the operations moved out of zio_create() were really
very expensive.

This change was principly made as a debug patch (and workaround)
for a zio_destroy() race.  The is good evidence that zio_create()
is reinitializing a mutex which is really still in use by another
thread.  This would completely explain the observed symptoms in
the issue report.

This patch doesn't fix the root cause of the race, but it should
make it less likely by only initializing the mutex once in the
constructor.  Also, this particular flaw might have gone unnoticed
in other zfs implementations due to the specific implementation
details of Linux ticket spinlocks.

Once the real root cause is determined and resolved this change
can be safely reverted.  Until then this should help workaround
the issue.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #496
2012-03-21 14:51:44 -07:00
Brian Behlendorf
c8df41538d Revert "Add zio constructor/destructor"
This patch was slightly flawed and allowed for zio->io_logical
to potentially not be reinitialized for a new zio.  This could
lead to assertion failures in specific cases when debugging is
enabled (--enable-debug) and I/O errors are encountered.  It
may also have caused problems when issues logical I/Os.

Since we want to make sure this workaround can be easily removed
in the future (when we have the real fix).  I'm reverting this
change and applying a new version of the patch which includes
the zio->io_logical fix.

This reverts commit 2c6d0b1e07.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #602
Issue #604
2012-03-21 14:51:01 -07:00
Brian Behlendorf
77a405ae52 Add missing NULL in zpl_xattr_handlers
The xattr_resolve_name() helper function expects the registered
list of xattr handlers to be NULL terminated.  This NULL was
accidentally missing which could result in a NULL dereference.

Interestingly this issue only manifested itself on certain 32-bit
systems.  Presumably on 64-bit kernels we just always happen to
get lucky and the memory following the structure is zeroed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #594
2012-03-15 15:18:29 -07:00
Brian Behlendorf
0ece356db5 Add sa_spill_rele() interface
Add a SA interface which allows us to release the spill block
from a SA handle without destroying the handle.  This is useful
because we can then ensure that a copy of the dirty spill block
is not made at sync time due to the extra hold.  Susequent calls
to sa_update() or sa_lookup() with transparently refetch the
spill block dbuf from the ARC hash.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-03-07 16:28:00 -08:00
Brian Behlendorf
2c6d0b1e07 Add zio constructor/destructor
Add a standard zio constructor and destructor.  Normally, this is
done to reduce to cost of allocating a new structure by reducing
expensive operations such as memory allocations.  However, in this
case none of the operations moved out of zio_create() were really
very expensive.

This change was principly made as a debug patch (and workaround)
for a zio_destroy() race.  The is good evidence that zio_create()
is reinitializing a mutex which is really still in use by another
thread.  This would completely explain the observed symptoms in
the issue report.

This patch doesn't fix the root cause of the race, but it should
make it less likely by only initializing the mutex once in the
constructor.  Also, this particular flaw might have gone unnoticed
in other zfs implementations due to the specific implementation
details of Linux ticket spinlocks.

Once the real root cause is determined and resolved this change
can be safely reverted.  Until then this should help workaround
the issue.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #496
2012-03-07 16:06:23 -08:00
Brian Behlendorf
ec2626ad3f Use SA_HDL_PRIVATE for SA xattrs
A private SA handle must be used to ensure we can drop the dbuf
hold on the spill block prior to calling dmu_tx_commit().  If we
call dmu_tx_commit() before sa_handle_destroy(), then our hold
will trigger a copy of the dbuf to be made.  This is done to
prevent data from leaking in to the syncing txg.  As a result
the original dirty spill block will remain cached.

Additionally, relying on the shared zp->z_sa_hdl is unsafe in
the xattr context because the znode may be asynchronously dropped
from the cache.  It's far safer and simpler just to use a private
handle for xattrs.  Plus any additional overhead is offset by
the avoidance of the previously mentioned memory copy.

These forever dirty buffers can be noticed in the arcstats under
the anon_size.  On a quiescent system the value should be zero.
Without this fix and a SA xattr write workload you will see
anon_size increase.  Eventually, if enough dirty data builds up
your system it will appear to hang.  This occurs because the dmu
won't allow new txs to be assigned until that dirty data is
flushed, and it won't be because it's not part of an assigned tx.

As an aside, I typically see anon_size lurk around 16k so I think
there is another place in the code which needs a similar fix.
However, this value doesn't grow over time so it isn't critical.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #503
Issue #513
2012-03-02 13:20:48 -08:00
Brian Behlendorf
570827e129 Add 'dmu_tx' kstats entry
Keep counters for the various reasons that a thread may end up
in txg_wait_open() waiting on a new txg.  This can be useful
when attempting to determine why a particular workload is
under performing.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-27 08:59:10 -08:00
Brian Behlendorf
13be560d89 Add arc_state_t stats to arcstats
To ensure the arc is behaving properly we need greater visibility
in to exactly how it's managing the systems memory.  This patch
takes one step in that direction be adding the current arc_state_t
for the anon, mru, mru_ghost, mfu, and mfs_ghost lists.  The l2
arc_state_t is already well represented in the arcstats.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-27 08:58:59 -08:00
Alex Zhuravlev
a473d90cee Export symbols for zero-copy
Export additional symbols to make use of the DMU's zero-copy
API.  This allows external modules to move data in to and out of
the ARC without incurring the cost of a memory copy.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-17 12:43:02 -08:00
Brian Behlendorf
b10c77f70a Export symbols for zero-copy
Exported the required symbols to make use of the DMU's zero-copy
API.  This allows external modules to move data in to and out of
the ARC without incurring the cost of a memory copy.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-10 11:56:55 -08:00
Brian Behlendorf
a31acb462d Use spl_debug_* helpers
When configuring the spl debug log support use the provided wrapper
functions.  This ensures that if --disable-debug-log was used when
buiding the spl the functions will have no effect.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-09 16:37:48 -08:00
Etienne Dechamps
30930fba21 Add support for DISCARD to ZVOLs.
DISCARD (REQ_DISCARD, BLKDISCARD) is useful for thin provisioning.
It allows ZVOL clients to discard (unmap, trim) block ranges from
a ZVOL, thus optimizing disk space usage by allowing a ZVOL to
shrink instead of just grow.

We can't use zfs_space() or zfs_freesp() here, since these functions
only work on regular files, not volumes. Fortunately we can use the
low-level function dmu_free_long_range() which does exactly what we
want.

Currently the discard operation is not added to the log. That's not
a big deal since losing discard requests cannot result in data
corruption. It would however result in disk space usage higher than
it should be. Thus adding log support to zvol_discard() is probably
a good idea for a future improvement.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-09 16:19:38 -08:00
Etienne Dechamps
cb2d19010d Support the fallocate() file operation.
Currently only the (FALLOC_FL_PUNCH_HOLE) flag combination is
supported, since it's the only one that matches the behavior of
zfs_space(). This makes it pretty much useless in its current
form, but it's a start.

To support other flag combinations we would need to modify
zfs_space() to make it more flexible, or emulate the desired
functionality in zpl_fallocate().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #334
2012-02-09 16:19:32 -08:00
Etienne Dechamps
aec69371a6 Check permissions in zfs_space().
This isn't done on Solaris because on this OS zfs_space() can
only be called with an opened file handle. Since the addition of
zpl_truncate_range() this isn't the case anymore, so we need to
enforce access rights.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #334
2012-02-09 15:20:37 -08:00
Etienne Dechamps
5cb63a57f8 Implement the truncate_range() inode operation.
This operation allows "hole punching" in ZFS files. On Solaris this
is done via the vop_space() system call, which maps to the zfs_space()
function. So we just need to write zpl_truncate_range() as a wrapper
around zfs_space().

Note that this only works for regular files, not ZVOLs.

This is currently an insecure implementation without permission
checking, although this isn't that big of a deal since truncate_range()
isn't even callable from userspace.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #334
2012-02-09 15:20:32 -08:00
Etienne Dechamps
dde9380a1b Use 32 as the default number of zvol threads.
Currently, the `zvol_threads` variable, which controls the number of worker
threads which process items from the ZVOL queues, is set to the number of
available CPUs.

This choice seems to be based on the assumption that ZVOL threads are
CPU-bound. This is not necessarily true, especially for synchronous writes.
Consider the situation described in the comments for `zil_commit()`, which is
called inside `zvol_write()` for synchronous writes:

> itxs are committed in batches. In a heavily stressed zil there will be a
> commit writer thread who is writing out a bunch of itxs to the log for a
> set of committing threads (cthreads) in the same batch as the writer.
> Those cthreads are all waiting on the same cv for that batch.
>
> There will also be a different and growing batch of threads that are
> waiting to commit (qthreads). When the committing batch completes a
> transition occurs such that the cthreads exit and the qthreads become
> cthreads. One of the new cthreads becomes he writer thread for the batch.
> Any new threads arriving become new qthreads.

We can easily deduce that, in the case of ZVOLs, there can be a maximum of
`zvol_threads` cthreads and qthreads. The default value for `zvol_threads` is
typically between 1 and 8, which is way too low in this case. This means
there will be a lot of small commits to the ZIL, which is very inefficient
compared to a few big commits, especially since we have to wait for the data
to be on stable storage. Increasing the number of threads will increase the
amount of data waiting to be commited and thus the size of the individual
commits.

On my system, in the context of VM disk image storage (lots of small
synchronous writes), increasing `zvol_threads` from 8 to 32 results in a 50%
increase in sequential synchronous write performance.

We should choose a more sensible default for `zvol_threads`. Unfortunately
the optimal value is difficult to determine automatically, since it depends
on the synchronous write latency of the underlying storage devices. In any
case, a hardcoded value of 32 would probably be better than the current
situation. Having a lot of ZVOL threads doesn't seem to have any real
downside anyway.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fixes #392
2012-02-08 13:58:10 -08:00
Etienne Dechamps
34037afe24 Improve ZVOL queue behavior.
The Linux block device queue subsystem exposes a number of configurable
settings described in Linux block/blk-settings.c. The defaults for these
settings are tuned for hard drives, and are not optimized for ZVOLs. Proper
configuration of these options would allow upper layers (I/O scheduler) to
take better decisions about write merging and ordering.

Detailed rationale:

 - max_hw_sectors is set to unlimited (UINT_MAX). zvol_write() is able to
   handle writes of any size, so there's no reason to impose a limit. Let the
   upper layer decide.

 - max_segments and max_segment_size are set to unlimited. zvol_write() will
   copy the requests' contents into a dbuf anyway, so the number and size of
   the segments are irrelevant. Let the upper layer decide.

 - physical_block_size and io_opt are set to the ZVOL's block size. This
   has the potential to somewhat alleviate issue #361 for ZVOLs, by warning
   the upper layers that writes smaller than the volume's block size will be
   slow.

 - The NONROT flag is set to indicate this isn't a rotational device.
   Although the backing zpool might be composed of rotational devices, the
   resulting ZVOL often doesn't exhibit the same behavior due to the COW
   mechanisms used by ZFS. Setting this flag will prevent upper layers from
   making useless decisions (such as reordering writes) based on incorrect
   assumptions about the behavior of the ZVOL.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-07 16:23:06 -08:00
Etienne Dechamps
b18019d2d8 Fix synchronicity for ZVOLs.
zvol_write() assumes that the write request must be written to stable storage
if rq_is_sync() is true. Unfortunately, this assumption is incorrect. Indeed,
"sync" does *not* mean what we think it means in the context of the Linux
block layer. This is well explained in linux/fs.h:

    WRITE:       A normal async write. Device will be plugged.
    WRITE_SYNC:  Synchronous write. Identical to WRITE, but passes down
                 the hint that someone will be waiting on this IO
                 shortly.
    WRITE_FLUSH: Like WRITE_SYNC but with preceding cache flush.
    WRITE_FUA:   Like WRITE_SYNC but data is guaranteed to be on
                 non-volatile media on completion.

In other words, SYNC does not *mean* that the write must be on stable storage
on completion. It just means that someone is waiting on us to complete the
write request. Thus triggering a ZIL commit for each SYNC write request on a
ZVOL is unnecessary and harmful for performance. To make matters worse, ZVOL
users have no way to express that they actually want data to be written to
stable storage, which means the ZIL is broken for ZVOLs.

The request for stable storage is expressed by the FUA flag, so we must
commit the ZIL after the write if the FUA flag is set. In addition, we must
commit the ZIL before the write if the FLUSH flag is set.

Also, we must inform the block layer that we actually support FLUSH and FUA.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-07 16:23:06 -08:00
Etienne Dechamps
56c34bac44 Support "sync=always" for ZVOLs.
Currently the "sync=always" property works for regular ZFS datasets, but not
for ZVOLs. This patch remedies that.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fixes #374.
2012-02-07 16:23:06 -08:00
Brian Behlendorf
47621f3d76 Linux 3.3 compat, sops->show_options()
The second argument of sops->show_options() was changed from a
'struct vfsmount *' to a 'struct dentry *'.  Add an autoconf check
to detect the API change and then conditionally define the expected
interface.  In either case we are only interested in the zfs_sb_t.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #549
2012-02-03 10:02:01 -08:00
Brian Behlendorf
d7e398ce1a Cleanup ZFS debug infrastructure
Historically the internal zfs debug infrastructure has been
scattered throughout the code.  Since we expect to start making
more use of this code this patch performs some cleanup.

* Consolidate the zfs debug infrastructure in the zfs_debug.[ch]
  files.  This includes moving the zfs_flags and zfs_recover
  variables, plus moving the zfs_panic_recover() function.

* Remove the existing unused functionality in zfs_debug.c and
  replace it with code which correctly utilized the spl logging
  infrastructure.

* Remove the __dprintf() function from zfs_ioctl.c.  This is
  dead code, the dprintf() functionality in the kernel relies
  on the spl log support.

* Remove dprintf() from hdr_recl().  This wasn't particularly
  useful and was missing the required format specifier anyway.

* Subsequent patches should unify the dprintf() and zfs_dbgmsg()
  functions.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-02 11:24:30 -08:00
Brian Behlendorf
0c5dde492f Allow multiple values per directory entry
When using zfs to back a Lustre filesystem it's advantageous to
to store a fid with the object id in the directory zap.  The only
technical impediment to doing this is that the zpl code expects
a single value in the zap per directory entry.

This change relaxes that requirement such that multiple entries
are allowed provided the first one is the object id.  The zpl
code will just ignore additional entries.  This allows the ZoL
count to mount datasets which are being used as Lustre server
backends.

Once the upstream feature flags support is merged in this change
should be updated to a read-only feature.  Until this occurs
other zfs implementations will not be able to read the zfs
filesystems created by Lustre.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-02 11:22:08 -08:00
Brian Behlendorf
e29be02e46 Export symbol zfs_attr_table
Export the zfs_attr_table symbol so it may be used by non-zpl
consumers which are still interested in writing a zpl compatible
dataset (e.g. Lustre).

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-01-27 09:23:36 -08:00
Richard Laager
57a4eddc4d Allow setting bootfs on any pool
The vdev_is_bootable() restrictions are no longer necessary
with recent GRUB2 code.  FreeBSD has implemented the same
change, except that I moved the Solaris comment to be inside
the #ifdef __sun__ block.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #317
2012-01-17 13:49:07 -08:00
Ned Bass
08d08ebba2 Reduce number of zio free threads
As described in Issue #458 and #258, unlinking large amounts of data
can cause the threads in the zio free wait queue to start spinning.
Reducing the number of z_fr_iss threads from a fixed value of 100 to 1
per cpu signficantly reduces contention on the taskq spinlock and
improves throughput.

Instrumenting the taskq code showed that __taskq_dispatch() can spend
a long time holding tq->tq_lock if there are a large number of threads
in the queue.  It turns out the time spent in wake_up() scales
linearly with the number of threads in the queue.  When a large number
of short work items are dispatched, as seems to be the case with
unlink, the worker threads drain the queue faster than the dispatcher
can fill it.  They then all pile into the work wait queue to wait for
new work items.  So if 100 threads are in the queue, wake_up() takes
about 100 times as long, and the woken threads have to spin until the
dispatcher releases the lock.

Reducing the number of threads helps with the symptoms, but doesn't
get to the root of the problem.  It would seem that wake_up()
shouldn't scale linearly in time with queue depth, particularly if we
are only trying to wake up one thread.  In that vein, I tried making
all of the waiting processes exclusive to prevent the scheduler from
iterating over the entire list, but I still saw the linear time
scaling.  So further investigation is needed, but in the meantime
reducing the thread count is an easy workaround.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #258
Issue #458
2012-01-17 08:54:00 -08:00
Darik Horn
96b91ef0d6 Apply the ZoL coding standard to zpl_xattr.c
Make the indenting in the zpl_xattr.c file consistent with the Sun
coding standard by removing soft tabs.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-01-12 15:12:03 -08:00
Brian Behlendorf
166dd49de0 Linux 3.2 compat, security_inode_init_security()
The security_inode_init_security() API has been changed to include
a filesystem specific callback to write security extended attributes.
This was done to support the initialization of multiple LSM xattrs
and the EVM xattr.

This change updates the code to use the new API when it's available.
Otherwise it falls back to the previous implementation.

In addition, the ZFS_AC_KERNEL_6ARGS_SECURITY_INODE_INIT_SECURITY
autoconf test has been made more rigerous by passing the expected
types.  This is done to ensure we always properly the detect the
correct form for the security_inode_init_security() API.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #516
2012-01-12 15:06:39 -08:00
Brian Behlendorf
ab26409db7 Linux 3.1 compat, super_block->s_shrink
The Linux 3.1 kernel has introduced the concept of per-filesystem
shrinkers which are directly assoicated with a super block.  Prior
to this change there was one shared global shrinker.

The zfs code relied on being able to call the global shrinker when
the arc_meta_limit was exceeded.  This would cause the VFS to drop
references on a fraction of the dentries in the dcache.  The ARC
could then safely reclaim the memory used by these entries and
honor the arc_meta_limit.  Unfortunately, when per-filesystem
shrinkers were added the old interfaces were made unavailable.

This change adds support to use the new per-filesystem shrinker
interface so we can continue to honor the arc_meta_limit.  The
major benefit of the new interface is that we can now target
only the zfs filesystem for dentry and inode pruning.  Thus we
can minimize any impact on the caching of other filesystems.

In the context of making this change several other important
issues related to managing the ARC were addressed, they include:

* The dnlc_reduce_cache() function which was called by the ARC
to drop dentries for the Posix layer was replaced with a generic
zfs_prune_t callback.  The ZPL layer now registers a callback to
drop these dentries removing a layering violation which dates
back to the Solaris code.  This callback can also be used by
other ARC consumers such as Lustre.

  arc_add_prune_callback()
  arc_remove_prune_callback()

* The arc_reduce_dnlc_percent module option has been changed to
arc_meta_prune for clarity.  The dnlc functions are specific to
Solaris's VFS and have already been largely eliminated already.
The replacement tunable now represents the number of bytes the
prune callback will request when invoked.

* Less aggressively invoke the prune callback.  We used to call
this whenever we exceeded the arc_meta_limit however that's not
strictly correct since it results in over zeleous reclaim of
dentries and inodes.  It is now only called once the arc_meta_limit
is exceeded and every effort has been made to evict other data from
the ARC cache.

* More promptly manage exceeding the arc_meta_limit.  When reading
meta data in to the cache if a buffer was unable to be recycled
notify the arc_reclaim thread to invoke the required prune.

* Added arcstat_prune kstat which is incremented when the ARC
is forced to request that a consumer prune its cache.  Remember
this will only occur when the ARC has no other choice.  If it
can evict buffers safely without invoking the prune callback
it will.

* This change is also expected to resolve the unexpect collapses
of the ARC cache.  This would occur because when exceeded just the
arc_meta_limit reclaim presure would be excerted on the arc_c
value via arc_shrink().  This effectively shrunk the entire cache
when really we just needed to reclaim meta data.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #466
Closes #292
2012-01-11 11:46:02 -08:00
Darik Horn
28eb9213d8 Linux 3.2 compat: set_nlink()
Directly changing inode->i_nlink is deprecated in Linux 3.2 by commit

  SHA: bfe8684869601dacfcb2cd69ef8cfd9045f62170

Use the new set_nlink() kernel function instead.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #462
2011-12-16 20:02:52 -08:00
Garrett D'Amore
a38718a63d Illumos #734: Use taskq_dispatch_ent() interface
It has been observed that some of the hottest locks are those
of the zio taskqs.  Contention on these locks can limit the
rate at which zios are dispatched which limits performance.

This upstream change from Illumos uses new interface to the
taskqs which allow them to utilize a prealloc'ed taskq_ent_t.
This removes the need to perform an allocation at dispatch
time while holding the contended lock.  This has the effect
of improving system performance.

Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Alexey Zaytsev <alexey.zaytsev@nexenta.com>
Reviewed by: Jason Brian King <jason.brian.king@gmail.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References to Illumos issue:
  https://www.illumos.org/issues/734

Ported-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #482
2011-12-14 09:19:30 -08:00
Brian Behlendorf
30a9524e45 Set zvol_major/zvol_threads permissions
The zvol_major and zvol_threads module options were being created
with 0 permission bits.  This prevented them from being listed in
the /sys/module/zfs/parameters/ directory, although they were
visible in `modinfo zfs`.  This patch fixes the issue by updating
the permission bits to 0444.  For the moment these options must
be read-only because they are used during module initialization.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #392
2011-12-07 09:27:50 -08:00
Brian Behlendorf
23bdb07d4e Update default ARC memory limits
In the upstream OpenSolaris ZFS code the maximum ARC usage is
limited to 3/4 of memory or all but 1GB, whichever is larger.
Because of how Linux's VM subsystem is organized these defaults
have proven to be too large which can lead to stability issues.

To avoid making everyone manually tune the ARC the defaults are
being changed to 1/2 of memory or all but 4GB.  The rational for
this is as follows:

* Desktop Systems (less than 8GB of memory)

  Limiting the ARC to 1/2 of memory is desirable for desktop
  systems which have highly dynamic memory requirements.  For
  example, launching your web browser can suddenly result in a
  demand for several gigabytes of memory.  This memory must be
  reclaimed from the ARC cache which can take some time.  The
  user will experience this reclaim time as a sluggish system
  with poor interactive performance.  Thus in this case it is
  preferable to leave the memory as free and available for
  immediate use.

* Server Systems (more than 8GB of memory)

  Using all but 4GB of memory for the ARC is preferable for
  server systems.  These systems often run with minimal user
  interaction and have long running daemons with relatively
  stable memory demands.  These systems will benefit most by
  having as much data cached in memory as possible.

These values should work well for most configurations.  However,
if you have a desktop system with more than 8GB of memory you may
wish to further restrict the ARC.  This can still be accomplished
by setting the 'zfs_arc_max' module option.

Additionally, keep in mind these aren't currently hard limits.
The ARC is based on a slab implementation which can suffer from
memory fragmentation.  Because this fragmentation is not visible
from the ARC it may believe it is within the specified limits while
actually consuming slightly more memory.  How much more memory get's
consumed will be determined by how badly fragmented the slabs are.

In the long term this can be mitigated by slab defragmentation code
which was OpenSolaris solution.  Or preferably, using the page cache
to back the ARC under Linux would be even better.  See issue #75
for the benefits of more tightly integrating with the page cache.

This change also fixes a issue where the default ARC max was being
set incorrectly for machines with less than 2GB of memory.  The
constant in the arc_c_max comparison must be explicitly cast to
a uint64_t type to prevent overflow and the wrong conditional
branch being taken.  This failure was typically observed in VMs
which are commonly created with less than 2GB of memory.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #75
2011-12-05 12:02:12 -08:00
Brian Behlendorf
f31b3ebe6e Allow xattrs on symlinks
The Solaris version of ZFS does not allow xattrs to be set on
symlinks due to the way they implemented the attropen() system
call.  Linux however implements xattrs through the lgetxattr()
and lsetxattr() system calls which do not have this limitation.

The only reason this hasn't always worked under ZFS on Linux
is that the xattr handlers were not registered for symlink type
inodes.  This was done simply to be consistent with the Solaris
behavior.

Upon futher reflection I believe this should be allowed under
Linux.  The only ill effect would be that the xattrs on symlinks
will not be visible when the pool is imported on a Solaris
system.  This also has the benefit that it allows for SELinux
style security xattr labeling which expects to be able to set
xattrs on all inode types.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #272
2011-11-29 10:24:24 -08:00
Brian Behlendorf
82a37189aa Implement SA based xattrs
The current ZFS implementation stores xattrs on disk using a hidden
directory.  In this directory a file name represents the xattr name
and the file contexts are the xattr binary data.  This approach is
very flexible and allows for arbitrarily large xattrs.  However,
it also suffers from a significant performance penalty.  Accessing
a single xattr can requires up to three disk seeks.

  1) Lookup the dnode object.
  2) Lookup the dnodes's xattr directory object.
  3) Lookup the xattr object in the directory.

To avoid this performance penalty Linux filesystems such as ext3
and xfs try to store the xattr as part of the inode on disk.  When
the xattr is to large to store in the inode then a single external
block is allocated for them.  In practice most xattrs are small
and this approach works well.

The addition of System Attributes (SA) to zfs provides us a clean
way to make this optimization.  When the dataset property 'xattr=sa'
is set then xattrs will be preferentially stored as System Attributes.
This allows tiny xattrs (~100 bytes) to be stored with the dnode and
up to 64k of xattrs to be stored in the spill block.  If additional
xattr space is required, which is unlikely under Linux, they will be
stored using the traditional directory approach.

This optimization results in roughly a 3x performance improvement
when accessing xattrs which brings zfs roughly to parity with ext4
and xfs (see table below).  When multiple xattrs are stored per-file
the performance improvements are even greater because all of the
xattrs stored in the spill block will be cached.

However, by default SA based xattrs are disabled in the Linux port
to maximize compatibility with other implementations.  If you do
enable SA based xattrs then they will not be visible on platforms
which do not support this feature.

----------------------------------------------------------------------
   Time in seconds to get/set one xattr of N bytes on 100,000 files
------+--------------------------------+------------------------------
      |            setxattr            |            getxattr
bytes |  ext4     xfs zfs-dir  zfs-sa  |  ext4     xfs zfs-dir  zfs-sa
------+--------------------------------+------------------------------
1     |  2.33   31.88   21.50    4.57  |  2.35    2.64    6.29    2.43
32    |  2.79   30.68   21.98    4.60  |  2.44    2.59    6.78    2.48
256   |  3.25   31.99   21.36    5.92  |  2.32    2.71    6.22    3.14
1024  |  3.30   32.61   22.83    8.45  |  2.40    2.79    6.24    3.27
4096  |  3.57  317.46   22.52   10.73  |  2.78   28.62    6.90    3.94
16384 |   n/a 2342.39   34.30   19.20  |   n/a   45.44  145.90    7.55
65536 |   n/a 2941.39  128.15  131.32* |   n/a  141.92  256.85  262.12*

Legend:
* ext4      - Stock RHEL6.1 ext4 mounted with '-o user_xattr'.
* xfs       - Stock RHEL6.1 xfs mounted with default options.
* zfs-dir   - Directory based xattrs only.
* zfs-sa    - Prefer SAs but spill in to directories as needed, a
              trailing * indicates overflow in to directories occured.

NOTE: Ext4 supports 4096 bytes of xattr name/value pairs per file.
NOTE: XFS and ZFS have no limit on xattr name/value pairs per file.
NOTE: Linux limits individual name/value pairs to 65536 bytes.
NOTE: All setattr/getattr's were done after dropping the cache.
NOTE: All tests were run against a single hard drive.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #443
2011-11-28 15:45:51 -08:00
Brian Behlendorf
adcd70bd1a Linux 3.1 compat, fops->fsync()
The Linux 3.1 kernel updated the fops->fsync() callback yet again.
They now pass the requested range and delegate the responsibility
for calling filemap_write_and_wait_range() to the callback.  In
addition imutex is no longer held by the caller and the callback
is responsible for taking the lock if required.

This commit updates the code to provide a zpl_fsync() function
for the updated API.  Implementations for the previous two APIs
are also maintained for compatibility.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #445
2011-11-10 10:03:08 -08:00
Brian Behlendorf
5547c2f1bf Simplify BDI integration
Update the code to use the bdi_setup_and_register() helper to
simplify the bdi integration code.  The updated code now just
registers the bdi during mount and destroys it during unmount.

The only complication is that for 2.6.32 - 2.6.33 kernels the
helper wasn't available so in these cases the zfs code must
provide it.  Luckily the bdi_setup_and_register() function
is trivial.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #367
2011-11-08 10:19:03 -08:00
Brian Behlendorf
591fb62f19 Disown dataset in zfs_sb_create()
Fix an unlikely failure cause in zfs_sb_create() which could
leave the dataset owned on error and thus unavailable until
after a reboot.  Disown the dataset if SA are expected but
are in fact missing.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-11-08 10:18:40 -08:00
Brian Behlendorf
ae6ba3dbe6 Improve meta data performance
Profiling the system during meta data intensive workloads such
as creating/removing millions of files, revealed that the system
was cpu bound.  A large fraction of that cpu time was being spent
waiting on the virtual address space spin lock.

It turns out this was caused by certain heavily used kmem_caches
being backed by virtual memory.  By default a kmem_cache will
dynamically determine the type of memory used based on the object
size.  For large objects virtual memory is usually preferable
and for small object physical memory is a better choice.  See
the spl_slab_alloc() function for a longer discussion on this.

However, there is a certain amount of gray area when defining a
'large' object.  For the following caches it turns out they were
just over the line:

  * dnode_cache
  * zio_cache
  * zio_link_cache
  * zio_buf_512_cache
  * zfs_data_buf_512_cache

Now because we know there will be a lot of churn in these caches,
and because we know the slabs will still be reasonably sized.
We can safely request with the KMC_KMEM flag that the caches be
backed with physical memory addresses.  This entirely avoids the
need to serialize on the virtual address space lock.

As a bonus this also reduces our vmalloc usage which will be good
for 32-bit kernels which have a very small virtual address space.
It will also probably be good for interactive performance since
unrelated processes could also block of this same global lock.
Finally, we may see less cpu time being burned in the arc_reclaim
and txg_sync_threads.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #258
2011-11-03 10:19:21 -07:00
Brian Behlendorf
6a95d0b74c Fix NULL deref in balance_pgdat()
Be careful not to unconditionally clear the PF_MEMALLOC bit in
the task structure.  It may have already been set when entering
zpl_putpage() in which case it must remain set on exit.  In
particular the kswapd thread will have PF_MEMALLOC set in
order to prevent it from entering direct reclaim.  By clearing
it we allow the following NULL deref to potentially occur.

  BUG: unable to handle kernel NULL pointer dereference at (null)
  IP: [<ffffffff8109c7ab>] balance_pgdat+0x25b/0x4ff

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #287
2011-11-03 10:15:39 -07:00
Gunnar Beutner
a7b125e9a5 Fix a race condition in zfs_getattr_fast()
zfs_getattr_fast() was missing a lock on the ZFS superblock which
could result in zfs_znode_dmu_fini() clearing the zp->z_sa_hdl member
while zfs_getattr_fast() was accessing the znode. The result of this
would usually be a panic.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fixes #431
2011-11-03 10:13:09 -07:00
Xin Li
c475167627 Illumos #1661: Fix flaw in sa_find_sizes() calculation
When calculating space needed for SA_BONUS buffers, hdrsize is
always rounded up to next 8-aligned boundary. However, in two places
the round up was done against sum of 'total' plus hdrsize. On the
other hand, hdrsize increments by 4 each time, which means in certain
conditions, we would end up returning with will_spill == 0 and
(total + hdrsize) larger than full_space, leading to a failed
assertion because it's invalid for dmu_set_bonus.

Reviewed by: Matthew Ahrens <matt@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References to Illumos issue:
  https://www.illumos.org/issues/1661

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #426
2011-10-24 09:57:52 -07:00
Darik Horn
3cee2262a6 Change sun.com URLs to zfsonlinux.org
ZFS contains error messages that point to the defunct www.sun.com
domain, which is currently offline.  Change these error messages
to use the zfsonlinux.org mirror instead.

This commit depends on:

  zfsonlinux/zfsonlinux.github.com@8e10ead3dc

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-10-24 09:52:21 -07:00
Brian Behlendorf
6f2255ba8a Set mtime on symbolic links
Register the setattr/getattr callbacks for symlinks.  Without these
the generic inode_setattr() and generic_fillattr() functions will
be used.  In the setattr case this will only result in the inode being
updated in memory, the dirty_inode callback would also normally run
but none is registered for zfs.

The straight forward fix is to set the setattr/getattr callbacks
for symlinks so they are handled just like files and directories.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #412
2011-10-18 15:49:31 -07:00
Alexander Stetsenko
8d35c1499d Illumos #755: dmu_recv_stream builds incomplete guid_to_ds_map
An incomplete guid_to_ds_map would cause restore_write_byref() to fail
while receiving a de-duplicated backup stream.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Garrett D`Amore <garrett@nexenta.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References to Illumos issue and patch:
- https://www.illumos.org/issues/755
- https://github.com/illumos/illumos-gate/commit/ec5cf9d53a

Signed-off-by: Gunnar Beutner <gunnar@beutner.name>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #372
2011-10-18 11:18:14 -07:00
Brian Behlendorf
86f35f34f4 Export symbols for the VFS API
Export all symbols already marked extern in the zfs_vfsops.h
header.  Several non-static symbols have also been added to
the header and exportewd.  This allows external modules to
more easily create and manipulate properly created ZFS
filesystem type datasets.

Rename zfsvfs_teardown() to zfs_sb_teardown and export it.
This is done simply for consistency with the rest of the code
base.  All other zfsvfs_* functions have already been renamed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-10-11 10:25:59 -07:00
Brian Behlendorf
e45aa45298 Export symbols for the full SA API
Export all the symbols for the system attribute (SA) API.  This
allows external module to cleanly manipulate the SAs associated
with a dnode.  Documention for the SA API can be found in the
module/zfs/sa.c source.

This change also removes the zfs_sa_uprade_pre, and
zfs_sa_uprade_post prototypes.  The functions themselves were
dropped some time ago.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-10-05 15:59:56 -07:00
Andreas Dilger
baab063016 zpl: Fix "df -i" to have better free inodes value
Due to the confusion in Linux statfs between f_frsize and f_bsize
the blocks counts were changed to be in units of z_max_blksize
instead of SPA_MINBLOCKSIZE as it is on other platforms.

However, the free files calculation in zfs_statvfs() is limited by
the free blocks count, since each dnode consumes one block/sector.
This provided a reasonable estimate of free inodes, but on Linux
this meant that the free inodes count was underestimated by a large
amount, since 256 512-byte dnodes can fit into a 128kB block, and
more if the max blocksize is increased to 1MB or larger.

Also, the use of SPA_MINBLOCKSIZE is semantically incorrect since
DNODE_SIZE may change to a value other than SPA_MINBLOCKSIZE and
may even change per dataset, and devices with large sectors setting
ashift will also use a larger blocksize.

Correct the f_ffree calculation to use (availbytes >> DNODE_SHIFT)
to more accurately compute the maximum number of dnodes that can
be created.

Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #413
Closes #400
2011-09-28 11:27:10 -07:00
Brian Behlendorf
dee28b0700 Export symbols for the full ZAP API
Export all the symbols for the ZAP API.  This allows external modules
to cleanly interface with ZAP type objects.  Previously only a subset
of the functionality was exposed.  Documention for the ZAP API can be
found in the sys/zap.h header.

This change also removes a duplicate zap_increment_int() prototype.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-09-27 16:12:36 -07:00
Brian Behlendorf
fa6e5ced2f Suppress kmem_alloc() warning in zfs_prop_set_special()
Suppress the warning for this large kmem_alloc() because it is not
that far over the warning threshhold (8k) and it is short lived.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-09-15 20:26:51 -07:00
Brian Behlendorf
2708f716c0 Fix usage of zsb after free
Caught by code inspection, the variable zsb was referenced after
being freed.  Move the kmem_free() to the end of the function.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-09-09 10:29:48 -07:00
Brian Behlendorf
95d9fd028b Fix incompatible pointer type warning
This warning was accidentally introduced by commit
f3ab88d646 which updated the
.readpages() implementation.  The fix is to simply cast
the helper function to the appropriate type when passed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-08-19 15:16:30 -07:00
Brian Behlendorf
f3ab88d646 Correctly lock pages for .readpages()
Unlike the .readpage() callback which is passed a single locked page
to be populated.  The .readpages() callback is passed a list of unlocked
pages which are all marked for read-ahead (PG_readahead set).  It is
the responsibly of .readpages() to ensure to pages are properly locked
before being populated.

Prior to this change the requested read-ahead pages would be updated
outside of the page lock which is unsafe.  The unlocked pages would then
be unlocked again which is harmless but should have been immediately
detected as bug.  Unfortunately, newer kernels failed detect this issue
because the check is done with a VM_BUG_ON which is disabled by default.
Luckily, the old Debian Lenny 2.6.26 kernel caught this because it
simply uses a BUG_ON.

The straight forward fix for this is to update the .readpages() callback
to use the read_cache_pages() helper function.  The helper function will
ensure that each page in the list is properly locked before it is passed
to the .readpage() callback.  In addition resolving the bug, this results
in a nice simplification of the existing code.

The downside to this change is that instead of passing one large read
request to the dmu multiple smaller ones are submitted.  All of these
requests however are marked for readahead so the lower layers should
issue a large I/O regardless.  Thus most of the request should hit the
ARC cache.

Futher optimization of this code can be done in the future is a perform
analysis determines it to be worthwhile.  But for the moment, it is
preferable that code be correct and understandable.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #355
2011-08-08 13:24:52 -07:00
Brian Behlendorf
76659dc110 Add backing_device_info per-filesystem
For a long time now the kernel has been moving away from using the
pdflush daemon to write 'old' dirty pages to disk.  The primary reason
for this is because the pdflush daemon is single threaded and can be
a limiting factor for performance.  Since pdflush sequentially walks
the dirty inode list for each super block any delay in processing can
slow down dirty page writeback for all filesystems.

The replacement for pdflush is called bdi (backing device info).  The
bdi system involves creating a per-filesystem control structure each
with its own private sets of queues to manage writeback.  The advantage
is greater parallelism which improves performance and prevents a single
filesystem from slowing writeback to the others.

For a long time both systems co-existed in the kernel so it wasn't
strictly required to implement the bdi scheme.  However, as of
Linux 2.6.36 kernels the pdflush functionality has been retired.

Since ZFS already bypasses the page cache for most I/O this is only
an issue for mmap(2) writes which must go through the page cache.
Even then adding this missing support for newer kernels was overlooked
because there are other mechanisms which can trigger writeback.

However, there is one critical case where not implementing the bdi
functionality can cause problems.  If an application handles a page
fault it can enter the balance_dirty_pages() callpath.  This will
result in the application hanging until the number of dirty pages in
the system drops below the dirty ratio.

Without a registered backing_device_info for the filesystem the
dirty pages will not get written out.  Thus the application will hang.
As mentioned above this was less of an issue with older kernels because
pdflush would eventually write out the dirty pages.

This change adds a backing_device_info structure to the zfs_sb_t
which is already allocated per-super block.  It is then registered
when the filesystem mounted and unregistered on unmount.  It will
not be registered for mounted snapshots which are read-only.  This
change will result in flush-<pool> thread being dynamically created
and destroyed per-mounted filesystem for writeback.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #174
2011-08-04 13:37:38 -07:00
Brian Behlendorf
3c0e5c0f45 Cleanup mmap(2) writes
While the existing implementation of .writepage()/zpl_putpage() was
functional it was not entirely correct.  In particular, it would move
dirty pages in to a clean state simply after copying them in to the
ARC cache.  This would result in the pages being lost if the system
were to crash enough though the Linux VFS believed them to be safe on
stable storage.

Since at the moment virtually all I/O, except mmap(2), bypasses the
page cache this isn't as bad as it sounds.  However, as hopefully
start using the page cache more getting this right becomes more
important so it's good to improve this now.

This patch takes a big step in that direction by updating the code
to correctly move dirty pages through a writeback phase before they
are marked clean.  When a dirty page is copied in to the ARC it will
now be set in writeback and a completion callback is registered with
the transaction.  The page will stay in writeback until the dmu runs
the completion callback indicating the page is on stable storage.
At this point the page can be safely marked clean.

This process is normally entirely asynchronous and will be repeated
for every dirty page.  This may initially sound inefficient but most
of these pages will end up in a few txgs.  That means when they are
eventually written to disk they should be nicely batched.  However,
there is room for improvement.  It may still be desirable to batch
up the pages in to larger writes for the dmu.  This would reduce
the number of callbacks and small 4k buffer required by the ARC.

Finally, if the caller requires that the I/O be done synchronously
by setting WB_SYNC_ALL or if ZFS_SYNC_ALWAYS is set.  Then the I/O
will trigger a zil_commit() to flush the data to stable storage.
At which point the registered callbacks will be run leaving the
date safe of disk and marked clean before returning from .writepage.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-08-02 10:34:55 -07:00
Martin Matuska
cddafdcbc5 Illumos #1313: Integer overflow in txg_delay()
The function txg_delay() is used to delay txg (transaction group)
threads in ZFS.  The timeout value for this function is calculated
using:

    int timeout = ddi_get_lbolt() + ticks;

Later, the actual wait is performed:

    while (ddi_get_lbolt() < timeout &&
        tx->tx_syncing_txg < txg-1 && !txg_stalled(dp))
            (void) cv_timedwait(&tx->tx_quiesce_more_cv, &tx->tx_sync_lock,
                timeout - ddi_get_lbolt());

The ddi_get_lbolt() function returns current uptime in clock ticks
and is typed as clock_t.  The clock_t type on 64-bit architectures
is int64_t.

The "timeout" variable will overflow depending on the tick frequency
(e.g. for 1000 it will overflow in 28.855 days). This will make the
expression "ddi_get_lbolt() < timeout" always false - txg threads will
not be delayed anymore at all. This leads to a slowdown in ZFS writes.

The attached patch initializes timeout as clock_t to match the return
value of ddi_get_lbolt().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #352
2011-08-01 12:09:43 -07:00
Martin Matuska
ca5252204a Illumos #1043: Recursive zfs snapshot destroy fails
Prior to revision 11314 if a user was recursively destroying
snapshots of a dataset the target dataset was not required to
exist.  The zfs_secpolicy_destroy_snaps() function introduced
the security check on the target dataset, so since then if the
target dataset does not exist, the recursive destroy is not
performed.  Before 11314, only a delete permission check on
the snapshot's master dataset was performed.

Steps to reproduce:
zfs create pool/a
zfs snapshot pool/a@s1
zfs destroy -r pool@s1

Therefore I suggest to fallback to the old security check, if
the target snapshot does not exist and continue with the destroy.

References to Illumos issue and patch:
- https://www.illumos.org/issues/1043
- https://www.illumos.org/attachments/217/recursive_dataset_destroy.patch

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
2011-08-01 12:09:11 -07:00
Eric Schrock
3e31d2b080 Illumos #883: ZIL reuse during remount corruption
Moving the zil_free() cleanup to zil_close() prevents this
problem from occurring in the first place.  There is a very
good description of the issue and fix in Illumus #883.

Reviewed by: Matt Ahrens <Matt.Ahrens@delphix.com>
Reviewed by: Adam Leventhal <Adam.Leventhal@delphix.com>
Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Garrett D'Amore <garrett@nexenta.com>
Reivewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References to Illumos issue and patch:
- https://www.illumos.org/issues/883
- https://github.com/illumos/illumos-gate/commit/c9ba2a43cb

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
2011-08-01 12:09:11 -07:00
Matt Ahrens
f5fc4acaa7 Illumos #1092: zfs refratio property
Add a "REFRATIO" property, which is the compression ratio based on
data referenced. For snapshots, this is the same as COMPRESSRATIO,
but for filesystems/volumes, the COMPRESSRATIO is based on the
data "USED" (ie, includes blocks in children, but not blocks
shared with the origin).

This is needed to figure out how much space a filesystem would
use if it were not compressed (ignoring snapshots).

Reviewed by: George Wilson <George.Wilson@delphix.com>
Reviewed by: Adam Leventhal <Adam.Leventhal@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Mark Musante <Mark.Musante@oracle.com>
Reviewed by: Garrett D'Amore <garrett@nexenta.com>
Approved by: Garrett D'Amore <garrett@nexenta.com>

References to Illumos issue and patch:
- https://www.illumos.org/issues/1092
- https://github.com/illumos/illumos-gate/commit/187d6ac08a

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
2011-08-01 12:09:11 -07:00
George Wilson
6d974228ef Illumos #1051: zfs should handle imbalanced luns
Today zfs tries to allocate blocks evenly across all devices.
This means when devices are imbalanced zfs will use lots of
CPU searching for space on devices which tend to be pretty
full.  It should instead fail quickly on the full LUNs and
move onto devices which have more availability.

Reviewed by: Eric Schrock <Eric.Schrock@delphix.com>
Reviewed by: Matt Ahrens <Matt.Ahrens@delphix.com>
Reviewed by: Adam Leventhal <Adam.Leventhal@delphix.com>
Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Approved by: Garrett D'Amore <garrett@nexenta.com>

References to Illumos issue and patch:
- https://www.illumos.org/issues/510
- https://github.com/illumos/illumos-gate/commit/5ead3ed965

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
2011-08-01 12:09:11 -07:00
Garrett D'Amore
2cc6c8db12 Illumos #175: zfs vdev cache consumes excessive memory
Note that with the current ZFS code, it turns out that the vdev
cache is not helpful, and in some cases actually harmful.  It
is better if we disable this.  Once some time has passed, we
should actually remove this to simplify the code.  For now we
just disable it by setting the zfs_vdev_cache_size to zero.
Note that Solaris 11 has made these same changes.

References to Illumos issue and patch:
- https://www.illumos.org/issues/175
- https://github.com/illumos/illumos-gate/commit/b68a40a845

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
2011-08-01 12:09:11 -07:00
Gordon Ross
ef3c1dea70 Illumos #764: panic in zfs:dbuf_sync_list
Hypothesis about what's going on here.

At some time in the past, something, i.e. dnode_reallocate()
calls one of:
dbuf_rm_spill(dn, tx);

These will do:
dbuf_rm_spill(dnode_t *dn, dmu_tx_t *tx)
dbuf_free_range(dn, DMU_SPILL_BLKID, DMU_SPILL_BLKID, tx)
dbuf_undirty(db, tx)

Currently dbuf_undirty can leave a spill block in dn_dirty_records[],
(it having been put there previously by dbuf_dirty) and free it.
Sometime later, dbuf_sync_list trips over this reference to free'd
(and typically reused) memory.

Also, dbuf_undirty can call dnode_clear_range with a bogus
block ID. It needs to test for DMU_SPILL_BLKID, similar to
how dnode_clear_range is called in dbuf_dirty().

References to Illumos issue and patch:
- https://www.illumos.org/issues/764
- https://github.com/illumos/illumos-gate/commit/3f2366c2bb

Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Mark.Maybe@oracle.com
Reviewed by: Albert Lee <trisk@nexenta.com
Approved by: Garrett D'Amore <garrett@nexenta.com>

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
2011-08-01 12:09:11 -07:00