Commit Graph

1852 Commits

Author SHA1 Message Date
Brian Behlendorf
302f753f16 Integrate ARC more tightly with Linux
Under Solaris the ARC was designed to stay one step ahead of the
VM subsystem.  It would attempt to recognize low memory situtions
before they occured and evict data from the cache.  It would also
make assessments about if there was enough free memory to perform
a specific operation.

This was all possible because Solaris exposes a fairly decent
view of the memory state of the system to other kernel threads.
Linux on the other hand does not make this information easily
available.  To avoid extensive modifications to the ARC the SPL
attempts to provide these same interfaces.  While this works it
is not ideal and problems can arise when the ARC and Linux have
different ideas about when your out of memory.  This has manifested
itself in the past as a spinning arc_reclaim_thread.

This patch abandons the emulated Solaris interfaces in favor of
the prefered Linux interface.  That means moving the bulk of the
memory reclaim logic out of the arc_reclaim_thread and in to the
evict driven shrinker callback.  The Linux VM will call this
function when it needs memory.  The ARC is then responsible for
attempting to free the requested amount of memory if possible.

Several interfaces have been modified to accomidate this approach,
however the basic user space implementation remains the same.
The following changes almost exclusively just apply to the kernel
implementation.

* Removed the hdr_recl() reclaim callback which is redundant
  with the broader arc_shrinker_func().

* Reduced arc_grow_retry to 5 seconds from 60.  This is now used
  internally in the ARC with arc_no_grow to indicate that direct
  reclaim was recently performed.  This typically indicates a
  rapid change in memory demands which the kswapd threads were
  unable to keep ahead of.  As long as direct reclaim is happening
  once every 5 seconds arc growth will be paused to avoid further
  contributing to the existing memory pressure.  The more common
  indirect reclaim paths will not set arc_no_grow.

* arc_shrink() has been extended to take the number of bytes by
  which arc_c should be reduced.  This allows for a more granual
  reduction of the arc target.  Since the kernel provides a
  reclaim value to the arc_shrinker_func() this value is used
  instead of 1<<arc_shrink_shift.

* arc_reclaim_needed() has been removed.  It was used to determine
  if the system was under memory pressure and relied extensively
  on Solaris specific VM interfaces.  In most case the new code
  just checks arc_no_grow which indicates that within the last
  arc_grow_retry seconds direct memory reclaim occurred.

* arc_memory_throttle() has been updated to always include the
  amount of evictable memory (arc and page cache) in its free
  space calculations.  This space is largely available in most
  call paths due to direct memory reclaim.

* The Solaris pageout code was also removed to avoid confusion.
  It has always been disabled due to proc_pageout being defined
  as NULL in the Linux port.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-04-30 10:03:05 -07:00
Brian Behlendorf
afec56b43f Add zfs_mdcomp_disable module option
Expose the zfs_mdcomp_disable variable as a module option.  This
can be used to disable compression of zfs meta data which is
enabled by default.  This shouldn't need to be tuned but for
most workloads, however there may be very specific instances
where it makes sense to trade disk capacity for extra cpu cycles.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-04-27 16:28:02 -07:00
Brian Behlendorf
ebf8e3a237 Illumos #1909: disk sync write perf regression when slog is used post oi_148
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Steve Gonczi <gonczi@comcast.net>
Reviewed by: Garrett D'Amore <garrett.damore@gmail.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Albert Lee <trisk@nexenta.com>
Approved by: Eric Schrock <eric.schrock@delphix.com>

Refererces to Illumos issue:
  https://www.illumos.org/issues/1909

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #680
2012-04-19 16:26:29 -07:00
Prakash Surya
409dc1a570 Use KM_PUSHPAGE in l2arc_write_buffers
There is potential for deadlock in the l2arc_feed thread if KM_PUSHPAGE
is not used for the allocations made in l2arc_write_buffers.
Specifically, if KM_PUSHPAGE is not used for these allocations, it is
possible for reclaim to be triggered which can cause the l2arc_feed
thread to deadlock itself on the ARC_mru mutex. An example of this is
demonstrated in the following backtrace of the l2arc_feed thread:

    crash> bt 4123
    PID: 4123   TASK: ffff88062f8c1500  CPU: 6   COMMAND: "l2arc_feed"
      0 [ffff88062511d610] schedule at ffffffff814eeee0
      1 [ffff88062511d6d8] __mutex_lock_slowpath at ffffffff814f057e
      2 [ffff88062511d748] mutex_lock at ffffffff814f041b
      3 [ffff88062511d768] arc_evict at ffffffffa05130ca [zfs]
      4 [ffff88062511d858] arc_adjust at ffffffffa05139a9 [zfs]
      5 [ffff88062511d878] arc_shrink at ffffffffa0513a95 [zfs]
      6 [ffff88062511d898] arc_kmem_reap_now at ffffffffa0513be8 [zfs]
      7 [ffff88062511d8c8] arc_shrinker_func at ffffffffa0513ccc [zfs]
      8 [ffff88062511d8f8] shrink_slab at ffffffff8112a17a
      9 [ffff88062511d958] do_try_to_free_pages at ffffffff8112bfdf
     10 [ffff88062511d9e8] try_to_free_pages at ffffffff8112c3ed
     11 [ffff88062511da98] __alloc_pages_nodemask at ffffffff8112431d
     12 [ffff88062511dbb8] kmem_getpages at ffffffff8115e632
     13 [ffff88062511dbe8] fallback_alloc at ffffffff8115f24a
     14 [ffff88062511dc68] ____cache_alloc_node at ffffffff8115efc9
     15 [ffff88062511dcc8] __kmalloc at ffffffff8115fbf9
     16 [ffff88062511dd18] kmem_alloc_debug at ffffffffa047b8cb [spl]
     17 [ffff88062511dda8] l2arc_feed_thread at ffffffffa0511e71 [zfs]
     18 [ffff88062511dea8] thread_generic_wrapper at ffffffffa047d1a1 [spl]
     19 [ffff88062511dee8] kthread at ffffffff81090a86
     20 [ffff88062511df48] kernel_thread at ffffffff8100c14a

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-04-17 11:56:21 -07:00
Martin Matuska
7d5cd71da6 Illumos #1346: zfs incremental receive may leave behind temporary clones
1356 zfs dataset prefetch code not working
Reviewed by: Matthew Ahrens <matt@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References to Illumos issue:
  https://www.illumos.org/issues/1346
  https://www.illumos.org/issues/1356

Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #647
2012-04-11 12:02:27 -07:00
Albert Lee
22cd4a4653 Illumos #1475: zfs spill block hold can access invalid spill blkptr
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Approved by: Garrett D'Amore <garrett@nexenta.com>

References to Illumos issue:
  https://www.illumos.org/issues/1475

Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #648
2012-04-11 11:46:30 -07:00
George Wilson
5ffb9d1d05 Illumos #1951: leaking a vdev when removing an l2cache device
1952 memory leak when adding a file-based l2arc device
1954 leak in ZFS from metaslab_group_create and zfs_ereport_checksum

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Eric Schrock <eric.schrock@delphix.com>

References to Illumos issues:
  https://www.illumos.org/issues/1951
  https://www.illumos.org/issues/1952
  https://www.illumos.org/issues/1954

Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #650
2012-04-11 11:32:06 -07:00
Martin Matuska
b129c6590e OS-926: zfs panic in zfs_fill_zplprops_impl()
This change appears to be exclusive to SmartOS. It is not present in
illumos-gate but it just adds some needed error handling.  This is
clearly preferable to simply ASSERTING which is what would occur
prior to the patch.

Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #652
2012-04-11 11:29:19 -07:00
Andriy Gapon
3adfc400f5 Illumos #1680: zfs vdev_file_io_start: validate vdev before using vdev_tsd
vdev_tsd can be NULL for certain vdev states.
At least in userland testing with ztest.

References to Illumos issue:
  https://www.illumos.org/issues/1680

Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #655
2012-04-11 11:23:18 -07:00
Brian Behlendorf
f0fd83be65 Export additional dsl symbols
Principly these symbols were exported to get access to the
dsl_prop_register/dsl_prop_unregister functions.  They allow
us to cleanly register a callback which is called when a
dataset property is modified.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-04-11 09:26:55 -07:00
Gunnar Beutner
1f0d8a566f Fixed a NULL pointer dereference bug in zfs_preumount
When zpl_fill_super -> zfs_domount fails (e.g. because the dataset
was destroyed before it could be successfully mounted) the subsequent
call to zpl_kill_sb -> zfs_preumount would derefence a NULL pointer.

This bug can be reproduced using this shell script:

 #!/bin/sh
 (
 while true; do
 	zfs create -o mountpoint=legacz tank/bar
 	zfs destroy tank/bar
 done
 ) &

 (
 while true; do
 	mount -t zfs tank/bar /mnt
 	umount /mnt
 done
 ) &

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #639
2012-04-05 11:29:42 -07:00
Brian Behlendorf
fc41c6402b Properly expose the mfu ghost list kstats
Due to a typo the mru ghost lists stats were accidentally being
exposed as the mfu ghost list stats.  This was harmless but
confusing since memory usage could be over reported.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-03-27 15:08:22 -07:00
Brian Behlendorf
1c5de20ae2 Add --enable-debug-dmu-tx configure option
Allow rigorous (and expensive) tx validation to be enabled/disabled
indepentantly from the standard zfs debugging.  When enabled these
checks ensure that all txs are constructed properly and that a dbuf
is never dirtied without taking the correct tx hold.

This checking is particularly helpful when adding new dmu consumers
like Lustre.  However, for established consumers such as the zpl
with no known outstanding tx construction problems this is just
overhead.

--enable-debug-dmu-tx  - Enable/disable validation of each tx as
--disable-debug-dmu-tx   it is constructed.  By default validation
                         is disabled due to performance concerns.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-03-23 12:25:17 -07:00
Brian Behlendorf
99ea23c583 Enhance a dmu_tx_dirty_buf() assertion
The following assertion is good to validate the correctness of
new DMU consumers, but it doesn't quite provide enough information.
Slightly rework the assertion so that when it is hit the actual
offending values will be included in the output.

  SPLError: 4787:0:(dmu_tx.c:828:dmu_tx_dirty_buf())
  ASSERTION(dn == NULL || dn->dn_assigned_txg == tx->tx_txg) failed

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-03-23 12:24:05 -07:00
Brian Behlendorf
4b5d425f14 Add ZFS_META_RELEASE to module load/unload messages
Include the ZFS_META_RELEASE in the module load/unload messages
to more clearly indidcate exactly what version of ZFS has been
loaded.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-03-23 12:14:35 -07:00
Brian Behlendorf
9ed86e7cc7 Account for .zfs ctldir inodes
Because the .zfs ctldir inodes are not backed by physical storage
they use a different create path which was not properly accounting
for them as used.  This could result in ->nr_cached_objects()
returning 0 and cause a divide by zero error in prune_super().

In my option there's a kernel bug here too which allows this to
happen.  They should either be checking for 0 or adding +1 like
they correctly do earlier in the function.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #617
2012-03-22 15:43:55 -07:00
Brian Behlendorf
ebe7e575ea Add .zfs control directory
Add support for the .zfs control directory.  This was accomplished
by leveraging as much of the existing ZFS infrastructure as posible
and updating it for Linux as required.  The bulk of the core
functionality is now all there with the following limitations.

*) The .zfs/snapshot directory automount support requires a 2.6.37
   or newer kernel.  The exception is RHEL6.2 which has backported
   the d_automount patches.

*) Creating/destroying/renaming snapshots with mkdir/rmdir/mv
   in the .zfs/snapshot directory works as expected.  However,
   this functionality is only available to root until zfs
   delegations are finished.

      * mkdir - create a snapshot
      * rmdir - destroy a snapshot
      * mv    - rename a snapshot

The following issues are known defeciences, but we expect them to
be addressed by future commits.

*) Add automount support for kernels older the 2.6.37.  This should
   be possible using follow_link() which is what Linux did before.

*) Accessing the .zfs/snapshot directory via NFS is not yet possible.
   The majority of the ground work for this is complete.  However,
   finishing this work will require resolving some lingering
   integration issues with the Linux NFS kernel server.

*) The .zfs/shares directory exists but no futher smb functionality
   has yet been implemented.

Contributions-by: Rohan Puri <rohan.puri15@gmail.com>
Contributiobs-by: Andrew Barnes <barnes333@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #173
2012-03-22 13:03:47 -07:00
Brian Behlendorf
49be0ccf1f Add zio constructor/destructor
Add a standard zio constructor and destructor.  Normally, this is
done to reduce to cost of allocating a new structure by reducing
expensive operations such as memory allocations.  However, in this
case none of the operations moved out of zio_create() were really
very expensive.

This change was principly made as a debug patch (and workaround)
for a zio_destroy() race.  The is good evidence that zio_create()
is reinitializing a mutex which is really still in use by another
thread.  This would completely explain the observed symptoms in
the issue report.

This patch doesn't fix the root cause of the race, but it should
make it less likely by only initializing the mutex once in the
constructor.  Also, this particular flaw might have gone unnoticed
in other zfs implementations due to the specific implementation
details of Linux ticket spinlocks.

Once the real root cause is determined and resolved this change
can be safely reverted.  Until then this should help workaround
the issue.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #496
2012-03-21 14:51:44 -07:00
Brian Behlendorf
c8df41538d Revert "Add zio constructor/destructor"
This patch was slightly flawed and allowed for zio->io_logical
to potentially not be reinitialized for a new zio.  This could
lead to assertion failures in specific cases when debugging is
enabled (--enable-debug) and I/O errors are encountered.  It
may also have caused problems when issues logical I/Os.

Since we want to make sure this workaround can be easily removed
in the future (when we have the real fix).  I'm reverting this
change and applying a new version of the patch which includes
the zio->io_logical fix.

This reverts commit 2c6d0b1e07.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #602
Issue #604
2012-03-21 14:51:01 -07:00
Brian Behlendorf
77a405ae52 Add missing NULL in zpl_xattr_handlers
The xattr_resolve_name() helper function expects the registered
list of xattr handlers to be NULL terminated.  This NULL was
accidentally missing which could result in a NULL dereference.

Interestingly this issue only manifested itself on certain 32-bit
systems.  Presumably on 64-bit kernels we just always happen to
get lucky and the memory following the structure is zeroed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #594
2012-03-15 15:18:29 -07:00
Brian Behlendorf
0ece356db5 Add sa_spill_rele() interface
Add a SA interface which allows us to release the spill block
from a SA handle without destroying the handle.  This is useful
because we can then ensure that a copy of the dirty spill block
is not made at sync time due to the extra hold.  Susequent calls
to sa_update() or sa_lookup() with transparently refetch the
spill block dbuf from the ARC hash.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-03-07 16:28:00 -08:00
Brian Behlendorf
2c6d0b1e07 Add zio constructor/destructor
Add a standard zio constructor and destructor.  Normally, this is
done to reduce to cost of allocating a new structure by reducing
expensive operations such as memory allocations.  However, in this
case none of the operations moved out of zio_create() were really
very expensive.

This change was principly made as a debug patch (and workaround)
for a zio_destroy() race.  The is good evidence that zio_create()
is reinitializing a mutex which is really still in use by another
thread.  This would completely explain the observed symptoms in
the issue report.

This patch doesn't fix the root cause of the race, but it should
make it less likely by only initializing the mutex once in the
constructor.  Also, this particular flaw might have gone unnoticed
in other zfs implementations due to the specific implementation
details of Linux ticket spinlocks.

Once the real root cause is determined and resolved this change
can be safely reverted.  Until then this should help workaround
the issue.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #496
2012-03-07 16:06:23 -08:00
Brian Behlendorf
ec2626ad3f Use SA_HDL_PRIVATE for SA xattrs
A private SA handle must be used to ensure we can drop the dbuf
hold on the spill block prior to calling dmu_tx_commit().  If we
call dmu_tx_commit() before sa_handle_destroy(), then our hold
will trigger a copy of the dbuf to be made.  This is done to
prevent data from leaking in to the syncing txg.  As a result
the original dirty spill block will remain cached.

Additionally, relying on the shared zp->z_sa_hdl is unsafe in
the xattr context because the znode may be asynchronously dropped
from the cache.  It's far safer and simpler just to use a private
handle for xattrs.  Plus any additional overhead is offset by
the avoidance of the previously mentioned memory copy.

These forever dirty buffers can be noticed in the arcstats under
the anon_size.  On a quiescent system the value should be zero.
Without this fix and a SA xattr write workload you will see
anon_size increase.  Eventually, if enough dirty data builds up
your system it will appear to hang.  This occurs because the dmu
won't allow new txs to be assigned until that dirty data is
flushed, and it won't be because it's not part of an assigned tx.

As an aside, I typically see anon_size lurk around 16k so I think
there is another place in the code which needs a similar fix.
However, this value doesn't grow over time so it isn't critical.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #503
Issue #513
2012-03-02 13:20:48 -08:00
Brian Behlendorf
570827e129 Add 'dmu_tx' kstats entry
Keep counters for the various reasons that a thread may end up
in txg_wait_open() waiting on a new txg.  This can be useful
when attempting to determine why a particular workload is
under performing.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-27 08:59:10 -08:00
Brian Behlendorf
13be560d89 Add arc_state_t stats to arcstats
To ensure the arc is behaving properly we need greater visibility
in to exactly how it's managing the systems memory.  This patch
takes one step in that direction be adding the current arc_state_t
for the anon, mru, mru_ghost, mfu, and mfs_ghost lists.  The l2
arc_state_t is already well represented in the arcstats.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-27 08:58:59 -08:00
Alex Zhuravlev
a473d90cee Export symbols for zero-copy
Export additional symbols to make use of the DMU's zero-copy
API.  This allows external modules to move data in to and out of
the ARC without incurring the cost of a memory copy.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-17 12:43:02 -08:00
Brian Behlendorf
b10c77f70a Export symbols for zero-copy
Exported the required symbols to make use of the DMU's zero-copy
API.  This allows external modules to move data in to and out of
the ARC without incurring the cost of a memory copy.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-10 11:56:55 -08:00
Brian Behlendorf
a31acb462d Use spl_debug_* helpers
When configuring the spl debug log support use the provided wrapper
functions.  This ensures that if --disable-debug-log was used when
buiding the spl the functions will have no effect.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-09 16:37:48 -08:00
Etienne Dechamps
30930fba21 Add support for DISCARD to ZVOLs.
DISCARD (REQ_DISCARD, BLKDISCARD) is useful for thin provisioning.
It allows ZVOL clients to discard (unmap, trim) block ranges from
a ZVOL, thus optimizing disk space usage by allowing a ZVOL to
shrink instead of just grow.

We can't use zfs_space() or zfs_freesp() here, since these functions
only work on regular files, not volumes. Fortunately we can use the
low-level function dmu_free_long_range() which does exactly what we
want.

Currently the discard operation is not added to the log. That's not
a big deal since losing discard requests cannot result in data
corruption. It would however result in disk space usage higher than
it should be. Thus adding log support to zvol_discard() is probably
a good idea for a future improvement.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-09 16:19:38 -08:00
Etienne Dechamps
cb2d19010d Support the fallocate() file operation.
Currently only the (FALLOC_FL_PUNCH_HOLE) flag combination is
supported, since it's the only one that matches the behavior of
zfs_space(). This makes it pretty much useless in its current
form, but it's a start.

To support other flag combinations we would need to modify
zfs_space() to make it more flexible, or emulate the desired
functionality in zpl_fallocate().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #334
2012-02-09 16:19:32 -08:00
Etienne Dechamps
aec69371a6 Check permissions in zfs_space().
This isn't done on Solaris because on this OS zfs_space() can
only be called with an opened file handle. Since the addition of
zpl_truncate_range() this isn't the case anymore, so we need to
enforce access rights.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #334
2012-02-09 15:20:37 -08:00
Etienne Dechamps
5cb63a57f8 Implement the truncate_range() inode operation.
This operation allows "hole punching" in ZFS files. On Solaris this
is done via the vop_space() system call, which maps to the zfs_space()
function. So we just need to write zpl_truncate_range() as a wrapper
around zfs_space().

Note that this only works for regular files, not ZVOLs.

This is currently an insecure implementation without permission
checking, although this isn't that big of a deal since truncate_range()
isn't even callable from userspace.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #334
2012-02-09 15:20:32 -08:00
Etienne Dechamps
dde9380a1b Use 32 as the default number of zvol threads.
Currently, the `zvol_threads` variable, which controls the number of worker
threads which process items from the ZVOL queues, is set to the number of
available CPUs.

This choice seems to be based on the assumption that ZVOL threads are
CPU-bound. This is not necessarily true, especially for synchronous writes.
Consider the situation described in the comments for `zil_commit()`, which is
called inside `zvol_write()` for synchronous writes:

> itxs are committed in batches. In a heavily stressed zil there will be a
> commit writer thread who is writing out a bunch of itxs to the log for a
> set of committing threads (cthreads) in the same batch as the writer.
> Those cthreads are all waiting on the same cv for that batch.
>
> There will also be a different and growing batch of threads that are
> waiting to commit (qthreads). When the committing batch completes a
> transition occurs such that the cthreads exit and the qthreads become
> cthreads. One of the new cthreads becomes he writer thread for the batch.
> Any new threads arriving become new qthreads.

We can easily deduce that, in the case of ZVOLs, there can be a maximum of
`zvol_threads` cthreads and qthreads. The default value for `zvol_threads` is
typically between 1 and 8, which is way too low in this case. This means
there will be a lot of small commits to the ZIL, which is very inefficient
compared to a few big commits, especially since we have to wait for the data
to be on stable storage. Increasing the number of threads will increase the
amount of data waiting to be commited and thus the size of the individual
commits.

On my system, in the context of VM disk image storage (lots of small
synchronous writes), increasing `zvol_threads` from 8 to 32 results in a 50%
increase in sequential synchronous write performance.

We should choose a more sensible default for `zvol_threads`. Unfortunately
the optimal value is difficult to determine automatically, since it depends
on the synchronous write latency of the underlying storage devices. In any
case, a hardcoded value of 32 would probably be better than the current
situation. Having a lot of ZVOL threads doesn't seem to have any real
downside anyway.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fixes #392
2012-02-08 13:58:10 -08:00
Etienne Dechamps
34037afe24 Improve ZVOL queue behavior.
The Linux block device queue subsystem exposes a number of configurable
settings described in Linux block/blk-settings.c. The defaults for these
settings are tuned for hard drives, and are not optimized for ZVOLs. Proper
configuration of these options would allow upper layers (I/O scheduler) to
take better decisions about write merging and ordering.

Detailed rationale:

 - max_hw_sectors is set to unlimited (UINT_MAX). zvol_write() is able to
   handle writes of any size, so there's no reason to impose a limit. Let the
   upper layer decide.

 - max_segments and max_segment_size are set to unlimited. zvol_write() will
   copy the requests' contents into a dbuf anyway, so the number and size of
   the segments are irrelevant. Let the upper layer decide.

 - physical_block_size and io_opt are set to the ZVOL's block size. This
   has the potential to somewhat alleviate issue #361 for ZVOLs, by warning
   the upper layers that writes smaller than the volume's block size will be
   slow.

 - The NONROT flag is set to indicate this isn't a rotational device.
   Although the backing zpool might be composed of rotational devices, the
   resulting ZVOL often doesn't exhibit the same behavior due to the COW
   mechanisms used by ZFS. Setting this flag will prevent upper layers from
   making useless decisions (such as reordering writes) based on incorrect
   assumptions about the behavior of the ZVOL.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-07 16:23:06 -08:00
Etienne Dechamps
b18019d2d8 Fix synchronicity for ZVOLs.
zvol_write() assumes that the write request must be written to stable storage
if rq_is_sync() is true. Unfortunately, this assumption is incorrect. Indeed,
"sync" does *not* mean what we think it means in the context of the Linux
block layer. This is well explained in linux/fs.h:

    WRITE:       A normal async write. Device will be plugged.
    WRITE_SYNC:  Synchronous write. Identical to WRITE, but passes down
                 the hint that someone will be waiting on this IO
                 shortly.
    WRITE_FLUSH: Like WRITE_SYNC but with preceding cache flush.
    WRITE_FUA:   Like WRITE_SYNC but data is guaranteed to be on
                 non-volatile media on completion.

In other words, SYNC does not *mean* that the write must be on stable storage
on completion. It just means that someone is waiting on us to complete the
write request. Thus triggering a ZIL commit for each SYNC write request on a
ZVOL is unnecessary and harmful for performance. To make matters worse, ZVOL
users have no way to express that they actually want data to be written to
stable storage, which means the ZIL is broken for ZVOLs.

The request for stable storage is expressed by the FUA flag, so we must
commit the ZIL after the write if the FUA flag is set. In addition, we must
commit the ZIL before the write if the FLUSH flag is set.

Also, we must inform the block layer that we actually support FLUSH and FUA.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-07 16:23:06 -08:00
Etienne Dechamps
56c34bac44 Support "sync=always" for ZVOLs.
Currently the "sync=always" property works for regular ZFS datasets, but not
for ZVOLs. This patch remedies that.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fixes #374.
2012-02-07 16:23:06 -08:00
Brian Behlendorf
47621f3d76 Linux 3.3 compat, sops->show_options()
The second argument of sops->show_options() was changed from a
'struct vfsmount *' to a 'struct dentry *'.  Add an autoconf check
to detect the API change and then conditionally define the expected
interface.  In either case we are only interested in the zfs_sb_t.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #549
2012-02-03 10:02:01 -08:00
Brian Behlendorf
d7e398ce1a Cleanup ZFS debug infrastructure
Historically the internal zfs debug infrastructure has been
scattered throughout the code.  Since we expect to start making
more use of this code this patch performs some cleanup.

* Consolidate the zfs debug infrastructure in the zfs_debug.[ch]
  files.  This includes moving the zfs_flags and zfs_recover
  variables, plus moving the zfs_panic_recover() function.

* Remove the existing unused functionality in zfs_debug.c and
  replace it with code which correctly utilized the spl logging
  infrastructure.

* Remove the __dprintf() function from zfs_ioctl.c.  This is
  dead code, the dprintf() functionality in the kernel relies
  on the spl log support.

* Remove dprintf() from hdr_recl().  This wasn't particularly
  useful and was missing the required format specifier anyway.

* Subsequent patches should unify the dprintf() and zfs_dbgmsg()
  functions.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-02 11:24:30 -08:00
Brian Behlendorf
0c5dde492f Allow multiple values per directory entry
When using zfs to back a Lustre filesystem it's advantageous to
to store a fid with the object id in the directory zap.  The only
technical impediment to doing this is that the zpl code expects
a single value in the zap per directory entry.

This change relaxes that requirement such that multiple entries
are allowed provided the first one is the object id.  The zpl
code will just ignore additional entries.  This allows the ZoL
count to mount datasets which are being used as Lustre server
backends.

Once the upstream feature flags support is merged in this change
should be updated to a read-only feature.  Until this occurs
other zfs implementations will not be able to read the zfs
filesystems created by Lustre.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-02 11:22:08 -08:00
Brian Behlendorf
e29be02e46 Export symbol zfs_attr_table
Export the zfs_attr_table symbol so it may be used by non-zpl
consumers which are still interested in writing a zpl compatible
dataset (e.g. Lustre).

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-01-27 09:23:36 -08:00
Richard Laager
57a4eddc4d Allow setting bootfs on any pool
The vdev_is_bootable() restrictions are no longer necessary
with recent GRUB2 code.  FreeBSD has implemented the same
change, except that I moved the Solaris comment to be inside
the #ifdef __sun__ block.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #317
2012-01-17 13:49:07 -08:00
Ned Bass
08d08ebba2 Reduce number of zio free threads
As described in Issue #458 and #258, unlinking large amounts of data
can cause the threads in the zio free wait queue to start spinning.
Reducing the number of z_fr_iss threads from a fixed value of 100 to 1
per cpu signficantly reduces contention on the taskq spinlock and
improves throughput.

Instrumenting the taskq code showed that __taskq_dispatch() can spend
a long time holding tq->tq_lock if there are a large number of threads
in the queue.  It turns out the time spent in wake_up() scales
linearly with the number of threads in the queue.  When a large number
of short work items are dispatched, as seems to be the case with
unlink, the worker threads drain the queue faster than the dispatcher
can fill it.  They then all pile into the work wait queue to wait for
new work items.  So if 100 threads are in the queue, wake_up() takes
about 100 times as long, and the woken threads have to spin until the
dispatcher releases the lock.

Reducing the number of threads helps with the symptoms, but doesn't
get to the root of the problem.  It would seem that wake_up()
shouldn't scale linearly in time with queue depth, particularly if we
are only trying to wake up one thread.  In that vein, I tried making
all of the waiting processes exclusive to prevent the scheduler from
iterating over the entire list, but I still saw the linear time
scaling.  So further investigation is needed, but in the meantime
reducing the thread count is an easy workaround.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #258
Issue #458
2012-01-17 08:54:00 -08:00
Darik Horn
96b91ef0d6 Apply the ZoL coding standard to zpl_xattr.c
Make the indenting in the zpl_xattr.c file consistent with the Sun
coding standard by removing soft tabs.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-01-12 15:12:03 -08:00
Brian Behlendorf
166dd49de0 Linux 3.2 compat, security_inode_init_security()
The security_inode_init_security() API has been changed to include
a filesystem specific callback to write security extended attributes.
This was done to support the initialization of multiple LSM xattrs
and the EVM xattr.

This change updates the code to use the new API when it's available.
Otherwise it falls back to the previous implementation.

In addition, the ZFS_AC_KERNEL_6ARGS_SECURITY_INODE_INIT_SECURITY
autoconf test has been made more rigerous by passing the expected
types.  This is done to ensure we always properly the detect the
correct form for the security_inode_init_security() API.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #516
2012-01-12 15:06:39 -08:00
Brian Behlendorf
ab26409db7 Linux 3.1 compat, super_block->s_shrink
The Linux 3.1 kernel has introduced the concept of per-filesystem
shrinkers which are directly assoicated with a super block.  Prior
to this change there was one shared global shrinker.

The zfs code relied on being able to call the global shrinker when
the arc_meta_limit was exceeded.  This would cause the VFS to drop
references on a fraction of the dentries in the dcache.  The ARC
could then safely reclaim the memory used by these entries and
honor the arc_meta_limit.  Unfortunately, when per-filesystem
shrinkers were added the old interfaces were made unavailable.

This change adds support to use the new per-filesystem shrinker
interface so we can continue to honor the arc_meta_limit.  The
major benefit of the new interface is that we can now target
only the zfs filesystem for dentry and inode pruning.  Thus we
can minimize any impact on the caching of other filesystems.

In the context of making this change several other important
issues related to managing the ARC were addressed, they include:

* The dnlc_reduce_cache() function which was called by the ARC
to drop dentries for the Posix layer was replaced with a generic
zfs_prune_t callback.  The ZPL layer now registers a callback to
drop these dentries removing a layering violation which dates
back to the Solaris code.  This callback can also be used by
other ARC consumers such as Lustre.

  arc_add_prune_callback()
  arc_remove_prune_callback()

* The arc_reduce_dnlc_percent module option has been changed to
arc_meta_prune for clarity.  The dnlc functions are specific to
Solaris's VFS and have already been largely eliminated already.
The replacement tunable now represents the number of bytes the
prune callback will request when invoked.

* Less aggressively invoke the prune callback.  We used to call
this whenever we exceeded the arc_meta_limit however that's not
strictly correct since it results in over zeleous reclaim of
dentries and inodes.  It is now only called once the arc_meta_limit
is exceeded and every effort has been made to evict other data from
the ARC cache.

* More promptly manage exceeding the arc_meta_limit.  When reading
meta data in to the cache if a buffer was unable to be recycled
notify the arc_reclaim thread to invoke the required prune.

* Added arcstat_prune kstat which is incremented when the ARC
is forced to request that a consumer prune its cache.  Remember
this will only occur when the ARC has no other choice.  If it
can evict buffers safely without invoking the prune callback
it will.

* This change is also expected to resolve the unexpect collapses
of the ARC cache.  This would occur because when exceeded just the
arc_meta_limit reclaim presure would be excerted on the arc_c
value via arc_shrink().  This effectively shrunk the entire cache
when really we just needed to reclaim meta data.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #466
Closes #292
2012-01-11 11:46:02 -08:00
Darik Horn
28eb9213d8 Linux 3.2 compat: set_nlink()
Directly changing inode->i_nlink is deprecated in Linux 3.2 by commit

  SHA: bfe8684869601dacfcb2cd69ef8cfd9045f62170

Use the new set_nlink() kernel function instead.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #462
2011-12-16 20:02:52 -08:00
Garrett D'Amore
a38718a63d Illumos #734: Use taskq_dispatch_ent() interface
It has been observed that some of the hottest locks are those
of the zio taskqs.  Contention on these locks can limit the
rate at which zios are dispatched which limits performance.

This upstream change from Illumos uses new interface to the
taskqs which allow them to utilize a prealloc'ed taskq_ent_t.
This removes the need to perform an allocation at dispatch
time while holding the contended lock.  This has the effect
of improving system performance.

Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Alexey Zaytsev <alexey.zaytsev@nexenta.com>
Reviewed by: Jason Brian King <jason.brian.king@gmail.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References to Illumos issue:
  https://www.illumos.org/issues/734

Ported-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #482
2011-12-14 09:19:30 -08:00
Brian Behlendorf
30a9524e45 Set zvol_major/zvol_threads permissions
The zvol_major and zvol_threads module options were being created
with 0 permission bits.  This prevented them from being listed in
the /sys/module/zfs/parameters/ directory, although they were
visible in `modinfo zfs`.  This patch fixes the issue by updating
the permission bits to 0444.  For the moment these options must
be read-only because they are used during module initialization.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #392
2011-12-07 09:27:50 -08:00
Brian Behlendorf
23bdb07d4e Update default ARC memory limits
In the upstream OpenSolaris ZFS code the maximum ARC usage is
limited to 3/4 of memory or all but 1GB, whichever is larger.
Because of how Linux's VM subsystem is organized these defaults
have proven to be too large which can lead to stability issues.

To avoid making everyone manually tune the ARC the defaults are
being changed to 1/2 of memory or all but 4GB.  The rational for
this is as follows:

* Desktop Systems (less than 8GB of memory)

  Limiting the ARC to 1/2 of memory is desirable for desktop
  systems which have highly dynamic memory requirements.  For
  example, launching your web browser can suddenly result in a
  demand for several gigabytes of memory.  This memory must be
  reclaimed from the ARC cache which can take some time.  The
  user will experience this reclaim time as a sluggish system
  with poor interactive performance.  Thus in this case it is
  preferable to leave the memory as free and available for
  immediate use.

* Server Systems (more than 8GB of memory)

  Using all but 4GB of memory for the ARC is preferable for
  server systems.  These systems often run with minimal user
  interaction and have long running daemons with relatively
  stable memory demands.  These systems will benefit most by
  having as much data cached in memory as possible.

These values should work well for most configurations.  However,
if you have a desktop system with more than 8GB of memory you may
wish to further restrict the ARC.  This can still be accomplished
by setting the 'zfs_arc_max' module option.

Additionally, keep in mind these aren't currently hard limits.
The ARC is based on a slab implementation which can suffer from
memory fragmentation.  Because this fragmentation is not visible
from the ARC it may believe it is within the specified limits while
actually consuming slightly more memory.  How much more memory get's
consumed will be determined by how badly fragmented the slabs are.

In the long term this can be mitigated by slab defragmentation code
which was OpenSolaris solution.  Or preferably, using the page cache
to back the ARC under Linux would be even better.  See issue #75
for the benefits of more tightly integrating with the page cache.

This change also fixes a issue where the default ARC max was being
set incorrectly for machines with less than 2GB of memory.  The
constant in the arc_c_max comparison must be explicitly cast to
a uint64_t type to prevent overflow and the wrong conditional
branch being taken.  This failure was typically observed in VMs
which are commonly created with less than 2GB of memory.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #75
2011-12-05 12:02:12 -08:00
Brian Behlendorf
f31b3ebe6e Allow xattrs on symlinks
The Solaris version of ZFS does not allow xattrs to be set on
symlinks due to the way they implemented the attropen() system
call.  Linux however implements xattrs through the lgetxattr()
and lsetxattr() system calls which do not have this limitation.

The only reason this hasn't always worked under ZFS on Linux
is that the xattr handlers were not registered for symlink type
inodes.  This was done simply to be consistent with the Solaris
behavior.

Upon futher reflection I believe this should be allowed under
Linux.  The only ill effect would be that the xattrs on symlinks
will not be visible when the pool is imported on a Solaris
system.  This also has the benefit that it allows for SELinux
style security xattr labeling which expects to be able to set
xattrs on all inode types.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #272
2011-11-29 10:24:24 -08:00
Brian Behlendorf
82a37189aa Implement SA based xattrs
The current ZFS implementation stores xattrs on disk using a hidden
directory.  In this directory a file name represents the xattr name
and the file contexts are the xattr binary data.  This approach is
very flexible and allows for arbitrarily large xattrs.  However,
it also suffers from a significant performance penalty.  Accessing
a single xattr can requires up to three disk seeks.

  1) Lookup the dnode object.
  2) Lookup the dnodes's xattr directory object.
  3) Lookup the xattr object in the directory.

To avoid this performance penalty Linux filesystems such as ext3
and xfs try to store the xattr as part of the inode on disk.  When
the xattr is to large to store in the inode then a single external
block is allocated for them.  In practice most xattrs are small
and this approach works well.

The addition of System Attributes (SA) to zfs provides us a clean
way to make this optimization.  When the dataset property 'xattr=sa'
is set then xattrs will be preferentially stored as System Attributes.
This allows tiny xattrs (~100 bytes) to be stored with the dnode and
up to 64k of xattrs to be stored in the spill block.  If additional
xattr space is required, which is unlikely under Linux, they will be
stored using the traditional directory approach.

This optimization results in roughly a 3x performance improvement
when accessing xattrs which brings zfs roughly to parity with ext4
and xfs (see table below).  When multiple xattrs are stored per-file
the performance improvements are even greater because all of the
xattrs stored in the spill block will be cached.

However, by default SA based xattrs are disabled in the Linux port
to maximize compatibility with other implementations.  If you do
enable SA based xattrs then they will not be visible on platforms
which do not support this feature.

----------------------------------------------------------------------
   Time in seconds to get/set one xattr of N bytes on 100,000 files
------+--------------------------------+------------------------------
      |            setxattr            |            getxattr
bytes |  ext4     xfs zfs-dir  zfs-sa  |  ext4     xfs zfs-dir  zfs-sa
------+--------------------------------+------------------------------
1     |  2.33   31.88   21.50    4.57  |  2.35    2.64    6.29    2.43
32    |  2.79   30.68   21.98    4.60  |  2.44    2.59    6.78    2.48
256   |  3.25   31.99   21.36    5.92  |  2.32    2.71    6.22    3.14
1024  |  3.30   32.61   22.83    8.45  |  2.40    2.79    6.24    3.27
4096  |  3.57  317.46   22.52   10.73  |  2.78   28.62    6.90    3.94
16384 |   n/a 2342.39   34.30   19.20  |   n/a   45.44  145.90    7.55
65536 |   n/a 2941.39  128.15  131.32* |   n/a  141.92  256.85  262.12*

Legend:
* ext4      - Stock RHEL6.1 ext4 mounted with '-o user_xattr'.
* xfs       - Stock RHEL6.1 xfs mounted with default options.
* zfs-dir   - Directory based xattrs only.
* zfs-sa    - Prefer SAs but spill in to directories as needed, a
              trailing * indicates overflow in to directories occured.

NOTE: Ext4 supports 4096 bytes of xattr name/value pairs per file.
NOTE: XFS and ZFS have no limit on xattr name/value pairs per file.
NOTE: Linux limits individual name/value pairs to 65536 bytes.
NOTE: All setattr/getattr's were done after dropping the cache.
NOTE: All tests were run against a single hard drive.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #443
2011-11-28 15:45:51 -08:00
Brian Behlendorf
adcd70bd1a Linux 3.1 compat, fops->fsync()
The Linux 3.1 kernel updated the fops->fsync() callback yet again.
They now pass the requested range and delegate the responsibility
for calling filemap_write_and_wait_range() to the callback.  In
addition imutex is no longer held by the caller and the callback
is responsible for taking the lock if required.

This commit updates the code to provide a zpl_fsync() function
for the updated API.  Implementations for the previous two APIs
are also maintained for compatibility.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #445
2011-11-10 10:03:08 -08:00
Brian Behlendorf
5547c2f1bf Simplify BDI integration
Update the code to use the bdi_setup_and_register() helper to
simplify the bdi integration code.  The updated code now just
registers the bdi during mount and destroys it during unmount.

The only complication is that for 2.6.32 - 2.6.33 kernels the
helper wasn't available so in these cases the zfs code must
provide it.  Luckily the bdi_setup_and_register() function
is trivial.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #367
2011-11-08 10:19:03 -08:00
Brian Behlendorf
591fb62f19 Disown dataset in zfs_sb_create()
Fix an unlikely failure cause in zfs_sb_create() which could
leave the dataset owned on error and thus unavailable until
after a reboot.  Disown the dataset if SA are expected but
are in fact missing.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-11-08 10:18:40 -08:00
Brian Behlendorf
ae6ba3dbe6 Improve meta data performance
Profiling the system during meta data intensive workloads such
as creating/removing millions of files, revealed that the system
was cpu bound.  A large fraction of that cpu time was being spent
waiting on the virtual address space spin lock.

It turns out this was caused by certain heavily used kmem_caches
being backed by virtual memory.  By default a kmem_cache will
dynamically determine the type of memory used based on the object
size.  For large objects virtual memory is usually preferable
and for small object physical memory is a better choice.  See
the spl_slab_alloc() function for a longer discussion on this.

However, there is a certain amount of gray area when defining a
'large' object.  For the following caches it turns out they were
just over the line:

  * dnode_cache
  * zio_cache
  * zio_link_cache
  * zio_buf_512_cache
  * zfs_data_buf_512_cache

Now because we know there will be a lot of churn in these caches,
and because we know the slabs will still be reasonably sized.
We can safely request with the KMC_KMEM flag that the caches be
backed with physical memory addresses.  This entirely avoids the
need to serialize on the virtual address space lock.

As a bonus this also reduces our vmalloc usage which will be good
for 32-bit kernels which have a very small virtual address space.
It will also probably be good for interactive performance since
unrelated processes could also block of this same global lock.
Finally, we may see less cpu time being burned in the arc_reclaim
and txg_sync_threads.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #258
2011-11-03 10:19:21 -07:00
Brian Behlendorf
6a95d0b74c Fix NULL deref in balance_pgdat()
Be careful not to unconditionally clear the PF_MEMALLOC bit in
the task structure.  It may have already been set when entering
zpl_putpage() in which case it must remain set on exit.  In
particular the kswapd thread will have PF_MEMALLOC set in
order to prevent it from entering direct reclaim.  By clearing
it we allow the following NULL deref to potentially occur.

  BUG: unable to handle kernel NULL pointer dereference at (null)
  IP: [<ffffffff8109c7ab>] balance_pgdat+0x25b/0x4ff

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #287
2011-11-03 10:15:39 -07:00
Gunnar Beutner
a7b125e9a5 Fix a race condition in zfs_getattr_fast()
zfs_getattr_fast() was missing a lock on the ZFS superblock which
could result in zfs_znode_dmu_fini() clearing the zp->z_sa_hdl member
while zfs_getattr_fast() was accessing the znode. The result of this
would usually be a panic.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fixes #431
2011-11-03 10:13:09 -07:00
Xin Li
c475167627 Illumos #1661: Fix flaw in sa_find_sizes() calculation
When calculating space needed for SA_BONUS buffers, hdrsize is
always rounded up to next 8-aligned boundary. However, in two places
the round up was done against sum of 'total' plus hdrsize. On the
other hand, hdrsize increments by 4 each time, which means in certain
conditions, we would end up returning with will_spill == 0 and
(total + hdrsize) larger than full_space, leading to a failed
assertion because it's invalid for dmu_set_bonus.

Reviewed by: Matthew Ahrens <matt@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References to Illumos issue:
  https://www.illumos.org/issues/1661

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #426
2011-10-24 09:57:52 -07:00
Darik Horn
3cee2262a6 Change sun.com URLs to zfsonlinux.org
ZFS contains error messages that point to the defunct www.sun.com
domain, which is currently offline.  Change these error messages
to use the zfsonlinux.org mirror instead.

This commit depends on:

  zfsonlinux/zfsonlinux.github.com@8e10ead3dc

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-10-24 09:52:21 -07:00
Brian Behlendorf
6f2255ba8a Set mtime on symbolic links
Register the setattr/getattr callbacks for symlinks.  Without these
the generic inode_setattr() and generic_fillattr() functions will
be used.  In the setattr case this will only result in the inode being
updated in memory, the dirty_inode callback would also normally run
but none is registered for zfs.

The straight forward fix is to set the setattr/getattr callbacks
for symlinks so they are handled just like files and directories.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #412
2011-10-18 15:49:31 -07:00
Alexander Stetsenko
8d35c1499d Illumos #755: dmu_recv_stream builds incomplete guid_to_ds_map
An incomplete guid_to_ds_map would cause restore_write_byref() to fail
while receiving a de-duplicated backup stream.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Garrett D`Amore <garrett@nexenta.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References to Illumos issue and patch:
- https://www.illumos.org/issues/755
- https://github.com/illumos/illumos-gate/commit/ec5cf9d53a

Signed-off-by: Gunnar Beutner <gunnar@beutner.name>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #372
2011-10-18 11:18:14 -07:00
Brian Behlendorf
86f35f34f4 Export symbols for the VFS API
Export all symbols already marked extern in the zfs_vfsops.h
header.  Several non-static symbols have also been added to
the header and exportewd.  This allows external modules to
more easily create and manipulate properly created ZFS
filesystem type datasets.

Rename zfsvfs_teardown() to zfs_sb_teardown and export it.
This is done simply for consistency with the rest of the code
base.  All other zfsvfs_* functions have already been renamed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-10-11 10:25:59 -07:00
Brian Behlendorf
e45aa45298 Export symbols for the full SA API
Export all the symbols for the system attribute (SA) API.  This
allows external module to cleanly manipulate the SAs associated
with a dnode.  Documention for the SA API can be found in the
module/zfs/sa.c source.

This change also removes the zfs_sa_uprade_pre, and
zfs_sa_uprade_post prototypes.  The functions themselves were
dropped some time ago.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-10-05 15:59:56 -07:00
Andreas Dilger
baab063016 zpl: Fix "df -i" to have better free inodes value
Due to the confusion in Linux statfs between f_frsize and f_bsize
the blocks counts were changed to be in units of z_max_blksize
instead of SPA_MINBLOCKSIZE as it is on other platforms.

However, the free files calculation in zfs_statvfs() is limited by
the free blocks count, since each dnode consumes one block/sector.
This provided a reasonable estimate of free inodes, but on Linux
this meant that the free inodes count was underestimated by a large
amount, since 256 512-byte dnodes can fit into a 128kB block, and
more if the max blocksize is increased to 1MB or larger.

Also, the use of SPA_MINBLOCKSIZE is semantically incorrect since
DNODE_SIZE may change to a value other than SPA_MINBLOCKSIZE and
may even change per dataset, and devices with large sectors setting
ashift will also use a larger blocksize.

Correct the f_ffree calculation to use (availbytes >> DNODE_SHIFT)
to more accurately compute the maximum number of dnodes that can
be created.

Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #413
Closes #400
2011-09-28 11:27:10 -07:00
Brian Behlendorf
dee28b0700 Export symbols for the full ZAP API
Export all the symbols for the ZAP API.  This allows external modules
to cleanly interface with ZAP type objects.  Previously only a subset
of the functionality was exposed.  Documention for the ZAP API can be
found in the sys/zap.h header.

This change also removes a duplicate zap_increment_int() prototype.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-09-27 16:12:36 -07:00
Brian Behlendorf
fa6e5ced2f Suppress kmem_alloc() warning in zfs_prop_set_special()
Suppress the warning for this large kmem_alloc() because it is not
that far over the warning threshhold (8k) and it is short lived.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-09-15 20:26:51 -07:00
Brian Behlendorf
2708f716c0 Fix usage of zsb after free
Caught by code inspection, the variable zsb was referenced after
being freed.  Move the kmem_free() to the end of the function.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-09-09 10:29:48 -07:00
Brian Behlendorf
95d9fd028b Fix incompatible pointer type warning
This warning was accidentally introduced by commit
f3ab88d646 which updated the
.readpages() implementation.  The fix is to simply cast
the helper function to the appropriate type when passed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-08-19 15:16:30 -07:00
Brian Behlendorf
f3ab88d646 Correctly lock pages for .readpages()
Unlike the .readpage() callback which is passed a single locked page
to be populated.  The .readpages() callback is passed a list of unlocked
pages which are all marked for read-ahead (PG_readahead set).  It is
the responsibly of .readpages() to ensure to pages are properly locked
before being populated.

Prior to this change the requested read-ahead pages would be updated
outside of the page lock which is unsafe.  The unlocked pages would then
be unlocked again which is harmless but should have been immediately
detected as bug.  Unfortunately, newer kernels failed detect this issue
because the check is done with a VM_BUG_ON which is disabled by default.
Luckily, the old Debian Lenny 2.6.26 kernel caught this because it
simply uses a BUG_ON.

The straight forward fix for this is to update the .readpages() callback
to use the read_cache_pages() helper function.  The helper function will
ensure that each page in the list is properly locked before it is passed
to the .readpage() callback.  In addition resolving the bug, this results
in a nice simplification of the existing code.

The downside to this change is that instead of passing one large read
request to the dmu multiple smaller ones are submitted.  All of these
requests however are marked for readahead so the lower layers should
issue a large I/O regardless.  Thus most of the request should hit the
ARC cache.

Futher optimization of this code can be done in the future is a perform
analysis determines it to be worthwhile.  But for the moment, it is
preferable that code be correct and understandable.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #355
2011-08-08 13:24:52 -07:00
Brian Behlendorf
76659dc110 Add backing_device_info per-filesystem
For a long time now the kernel has been moving away from using the
pdflush daemon to write 'old' dirty pages to disk.  The primary reason
for this is because the pdflush daemon is single threaded and can be
a limiting factor for performance.  Since pdflush sequentially walks
the dirty inode list for each super block any delay in processing can
slow down dirty page writeback for all filesystems.

The replacement for pdflush is called bdi (backing device info).  The
bdi system involves creating a per-filesystem control structure each
with its own private sets of queues to manage writeback.  The advantage
is greater parallelism which improves performance and prevents a single
filesystem from slowing writeback to the others.

For a long time both systems co-existed in the kernel so it wasn't
strictly required to implement the bdi scheme.  However, as of
Linux 2.6.36 kernels the pdflush functionality has been retired.

Since ZFS already bypasses the page cache for most I/O this is only
an issue for mmap(2) writes which must go through the page cache.
Even then adding this missing support for newer kernels was overlooked
because there are other mechanisms which can trigger writeback.

However, there is one critical case where not implementing the bdi
functionality can cause problems.  If an application handles a page
fault it can enter the balance_dirty_pages() callpath.  This will
result in the application hanging until the number of dirty pages in
the system drops below the dirty ratio.

Without a registered backing_device_info for the filesystem the
dirty pages will not get written out.  Thus the application will hang.
As mentioned above this was less of an issue with older kernels because
pdflush would eventually write out the dirty pages.

This change adds a backing_device_info structure to the zfs_sb_t
which is already allocated per-super block.  It is then registered
when the filesystem mounted and unregistered on unmount.  It will
not be registered for mounted snapshots which are read-only.  This
change will result in flush-<pool> thread being dynamically created
and destroyed per-mounted filesystem for writeback.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #174
2011-08-04 13:37:38 -07:00
Brian Behlendorf
3c0e5c0f45 Cleanup mmap(2) writes
While the existing implementation of .writepage()/zpl_putpage() was
functional it was not entirely correct.  In particular, it would move
dirty pages in to a clean state simply after copying them in to the
ARC cache.  This would result in the pages being lost if the system
were to crash enough though the Linux VFS believed them to be safe on
stable storage.

Since at the moment virtually all I/O, except mmap(2), bypasses the
page cache this isn't as bad as it sounds.  However, as hopefully
start using the page cache more getting this right becomes more
important so it's good to improve this now.

This patch takes a big step in that direction by updating the code
to correctly move dirty pages through a writeback phase before they
are marked clean.  When a dirty page is copied in to the ARC it will
now be set in writeback and a completion callback is registered with
the transaction.  The page will stay in writeback until the dmu runs
the completion callback indicating the page is on stable storage.
At this point the page can be safely marked clean.

This process is normally entirely asynchronous and will be repeated
for every dirty page.  This may initially sound inefficient but most
of these pages will end up in a few txgs.  That means when they are
eventually written to disk they should be nicely batched.  However,
there is room for improvement.  It may still be desirable to batch
up the pages in to larger writes for the dmu.  This would reduce
the number of callbacks and small 4k buffer required by the ARC.

Finally, if the caller requires that the I/O be done synchronously
by setting WB_SYNC_ALL or if ZFS_SYNC_ALWAYS is set.  Then the I/O
will trigger a zil_commit() to flush the data to stable storage.
At which point the registered callbacks will be run leaving the
date safe of disk and marked clean before returning from .writepage.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-08-02 10:34:55 -07:00
Martin Matuska
cddafdcbc5 Illumos #1313: Integer overflow in txg_delay()
The function txg_delay() is used to delay txg (transaction group)
threads in ZFS.  The timeout value for this function is calculated
using:

    int timeout = ddi_get_lbolt() + ticks;

Later, the actual wait is performed:

    while (ddi_get_lbolt() < timeout &&
        tx->tx_syncing_txg < txg-1 && !txg_stalled(dp))
            (void) cv_timedwait(&tx->tx_quiesce_more_cv, &tx->tx_sync_lock,
                timeout - ddi_get_lbolt());

The ddi_get_lbolt() function returns current uptime in clock ticks
and is typed as clock_t.  The clock_t type on 64-bit architectures
is int64_t.

The "timeout" variable will overflow depending on the tick frequency
(e.g. for 1000 it will overflow in 28.855 days). This will make the
expression "ddi_get_lbolt() < timeout" always false - txg threads will
not be delayed anymore at all. This leads to a slowdown in ZFS writes.

The attached patch initializes timeout as clock_t to match the return
value of ddi_get_lbolt().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #352
2011-08-01 12:09:43 -07:00
Martin Matuska
ca5252204a Illumos #1043: Recursive zfs snapshot destroy fails
Prior to revision 11314 if a user was recursively destroying
snapshots of a dataset the target dataset was not required to
exist.  The zfs_secpolicy_destroy_snaps() function introduced
the security check on the target dataset, so since then if the
target dataset does not exist, the recursive destroy is not
performed.  Before 11314, only a delete permission check on
the snapshot's master dataset was performed.

Steps to reproduce:
zfs create pool/a
zfs snapshot pool/a@s1
zfs destroy -r pool@s1

Therefore I suggest to fallback to the old security check, if
the target snapshot does not exist and continue with the destroy.

References to Illumos issue and patch:
- https://www.illumos.org/issues/1043
- https://www.illumos.org/attachments/217/recursive_dataset_destroy.patch

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
2011-08-01 12:09:11 -07:00
Eric Schrock
3e31d2b080 Illumos #883: ZIL reuse during remount corruption
Moving the zil_free() cleanup to zil_close() prevents this
problem from occurring in the first place.  There is a very
good description of the issue and fix in Illumus #883.

Reviewed by: Matt Ahrens <Matt.Ahrens@delphix.com>
Reviewed by: Adam Leventhal <Adam.Leventhal@delphix.com>
Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Garrett D'Amore <garrett@nexenta.com>
Reivewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References to Illumos issue and patch:
- https://www.illumos.org/issues/883
- https://github.com/illumos/illumos-gate/commit/c9ba2a43cb

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
2011-08-01 12:09:11 -07:00
Matt Ahrens
f5fc4acaa7 Illumos #1092: zfs refratio property
Add a "REFRATIO" property, which is the compression ratio based on
data referenced. For snapshots, this is the same as COMPRESSRATIO,
but for filesystems/volumes, the COMPRESSRATIO is based on the
data "USED" (ie, includes blocks in children, but not blocks
shared with the origin).

This is needed to figure out how much space a filesystem would
use if it were not compressed (ignoring snapshots).

Reviewed by: George Wilson <George.Wilson@delphix.com>
Reviewed by: Adam Leventhal <Adam.Leventhal@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Mark Musante <Mark.Musante@oracle.com>
Reviewed by: Garrett D'Amore <garrett@nexenta.com>
Approved by: Garrett D'Amore <garrett@nexenta.com>

References to Illumos issue and patch:
- https://www.illumos.org/issues/1092
- https://github.com/illumos/illumos-gate/commit/187d6ac08a

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
2011-08-01 12:09:11 -07:00
George Wilson
6d974228ef Illumos #1051: zfs should handle imbalanced luns
Today zfs tries to allocate blocks evenly across all devices.
This means when devices are imbalanced zfs will use lots of
CPU searching for space on devices which tend to be pretty
full.  It should instead fail quickly on the full LUNs and
move onto devices which have more availability.

Reviewed by: Eric Schrock <Eric.Schrock@delphix.com>
Reviewed by: Matt Ahrens <Matt.Ahrens@delphix.com>
Reviewed by: Adam Leventhal <Adam.Leventhal@delphix.com>
Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Approved by: Garrett D'Amore <garrett@nexenta.com>

References to Illumos issue and patch:
- https://www.illumos.org/issues/510
- https://github.com/illumos/illumos-gate/commit/5ead3ed965

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
2011-08-01 12:09:11 -07:00
Garrett D'Amore
2cc6c8db12 Illumos #175: zfs vdev cache consumes excessive memory
Note that with the current ZFS code, it turns out that the vdev
cache is not helpful, and in some cases actually harmful.  It
is better if we disable this.  Once some time has passed, we
should actually remove this to simplify the code.  For now we
just disable it by setting the zfs_vdev_cache_size to zero.
Note that Solaris 11 has made these same changes.

References to Illumos issue and patch:
- https://www.illumos.org/issues/175
- https://github.com/illumos/illumos-gate/commit/b68a40a845

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
2011-08-01 12:09:11 -07:00
Gordon Ross
ef3c1dea70 Illumos #764: panic in zfs:dbuf_sync_list
Hypothesis about what's going on here.

At some time in the past, something, i.e. dnode_reallocate()
calls one of:
dbuf_rm_spill(dn, tx);

These will do:
dbuf_rm_spill(dnode_t *dn, dmu_tx_t *tx)
dbuf_free_range(dn, DMU_SPILL_BLKID, DMU_SPILL_BLKID, tx)
dbuf_undirty(db, tx)

Currently dbuf_undirty can leave a spill block in dn_dirty_records[],
(it having been put there previously by dbuf_dirty) and free it.
Sometime later, dbuf_sync_list trips over this reference to free'd
(and typically reused) memory.

Also, dbuf_undirty can call dnode_clear_range with a bogus
block ID. It needs to test for DMU_SPILL_BLKID, similar to
how dnode_clear_range is called in dbuf_dirty().

References to Illumos issue and patch:
- https://www.illumos.org/issues/764
- https://github.com/illumos/illumos-gate/commit/3f2366c2bb

Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Mark.Maybe@oracle.com
Reviewed by: Albert Lee <trisk@nexenta.com
Approved by: Garrett D'Amore <garrett@nexenta.com>

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
2011-08-01 12:09:11 -07:00
Tim Haley
7b8518cb8d Illumos #xxx: zdb -vvv broken after zfs diff integration
References to Illumos issue and patch:
- https://github.com/illumos/illumos-gate/commit/163eb7ff

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
2011-08-01 12:09:02 -07:00
Brian Behlendorf
beb9826902 Fix txg_sync_thread deadlock
Update two kmem_alloc()'s in dbuf_dirty() to use KM_PUSHPAGE.
Because these functions are called from txg_sync_thread we
must ensure they don't reenter the zfs filesystem code via
the .writepage callback.  This would result in a deadlock.

This deadlock is rare and has only been observed once under
an abusive mmap() write workload.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-07-22 15:24:57 -07:00
Brian Behlendorf
22872ff5da Use zfs_mknode() to create dataset root
Long, long, long ago when the effort to port ZFS was begun
the zfs_create_fs() function was heavily modified to remove
all of its VFS dependencies.  This allowed Lustre to use
the dataset without us having to spend the time porting all
the required VFS code.

Fast-forward several years and we now have all the VFS code
in place but are still relying on the modified zfs_create_fs().
This isn't required anymore and we can now use zfs_mknode()
to create the root znode for the filesystem.

This commit reverts the contents of zfs_create_fs() to largely
match the upstream OpenSolaris code.  There have been minor
modifications to accomidate the Linux VFS but that is all.

This code fixes issue #116 by bootstraping enough of the VFS
data structures so we can rely on zfs_mknode() to create the
root directory.  This ensures it is created properly with
support for system attributes.  Previously it wasn't which
is why it behaved differently that all other directories
when modified.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #116
2011-07-20 19:52:26 -07:00
Brian Behlendorf
9fd91daeef Honor setgit bit on directories
Newly created files were always being created with the fsuid/fsgid
in the current users credentials.  This is correct except in the
case when the parent directory sets the 'setgit' bit.  In this
case according to posix the newly created file/directory should
inherit the gid of the parent directory.  Additionally, in the
case of a subdirectory it should also inherit the 'setgit' bit.

Finally, this commit performs a little cleanup of the vattr_t
initialization by moving it to a common helper function.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #262
2011-07-20 14:07:13 -07:00
Brian Behlendorf
cfc9a5c88f Fix zpl_writepage() deadlock
Disable the normal reclaim path for zpl_putpage().  This ensures that
all memory allocations under this call path will never enter direct
reclaim.  If this were to happen the VM might try to write out
additional pages by calling zpl_putpage() again resulting in a
deadlock.

This sitution is typically handled in Linux by marking each offending
allocation GFP_NOFS.  However, since much of the code used is common
it makes more sense to use PF_MEMALLOC to flag the entire call tree.
Alternately, the code could be updated to pass the needed allocation
flags but that's a more invasive change.

The following example of the above described deadlock was triggered
by test 074 in the xfstest suite.

Call Trace:
 [<ffffffff814dcdb2>] down_write+0x32/0x40
 [<ffffffffa05af6e4>] dnode_new_blkid+0x94/0x2d0 [zfs]
 [<ffffffffa0597d66>] dbuf_dirty+0x556/0x750 [zfs]
 [<ffffffffa05987d1>] dmu_buf_will_dirty+0x81/0xd0 [zfs]
 [<ffffffffa059ee70>] dmu_write+0x90/0x170 [zfs]
 [<ffffffffa0611afe>] zfs_putpage+0x2ce/0x360 [zfs]
 [<ffffffffa062875e>] zpl_putpage+0x1e/0x60 [zfs]
 [<ffffffffa06287b2>] zpl_writepage+0x12/0x20 [zfs]
 [<ffffffff8115f907>] writeout+0xa7/0xd0
 [<ffffffff8115fa6b>] move_to_new_page+0x13b/0x170
 [<ffffffff8115fed4>] migrate_pages+0x434/0x4c0
 [<ffffffff811559ab>] compact_zone+0x4fb/0x780
 [<ffffffff81155ed1>] compact_zone_order+0xa1/0xe0
 [<ffffffff8115602c>] try_to_compact_pages+0x11c/0x190
 [<ffffffff811200bb>] __alloc_pages_nodemask+0x5eb/0x8b0
 [<ffffffff8115464a>] alloc_pages_current+0xaa/0x110
 [<ffffffff8111e36e>] __get_free_pages+0xe/0x50
 [<ffffffffa03f0e2f>] kv_alloc+0x3f/0xb0 [spl]
 [<ffffffffa03f11d9>] spl_kmem_cache_alloc+0x339/0x660 [spl]
 [<ffffffffa05950b3>] dbuf_create+0x43/0x370 [zfs]
 [<ffffffffa0596fb1>] __dbuf_hold_impl+0x241/0x480 [zfs]
 [<ffffffffa0597276>] dbuf_hold_impl+0x86/0xc0 [zfs]
 [<ffffffffa05977ff>] dbuf_hold_level+0x1f/0x30 [zfs]
 [<ffffffffa05a9dde>] dmu_tx_check_ioerr+0x4e/0x110 [zfs]
 [<ffffffffa05aa1f9>] dmu_tx_count_write+0x359/0x6f0 [zfs]
 [<ffffffffa05aa5df>] dmu_tx_hold_write+0x4f/0x70 [zfs]
 [<ffffffffa0611a6d>] zfs_putpage+0x23d/0x360 [zfs]
 [<ffffffffa062875e>] zpl_putpage+0x1e/0x60 [zfs]
 [<ffffffff811221f9>] write_cache_pages+0x1c9/0x4a0
 [<ffffffffa0628738>] zpl_writepages+0x18/0x20 [zfs]
 [<ffffffff81122521>] do_writepages+0x21/0x40
 [<ffffffff8119bbbd>] writeback_single_inode+0xdd/0x2c0
 [<ffffffff8119bfbe>] writeback_sb_inodes+0xce/0x180
 [<ffffffff8119c11b>] writeback_inodes_wb+0xab/0x1b0
 [<ffffffff8119c4bb>] wb_writeback+0x29b/0x3f0
 [<ffffffff8119c6cb>] wb_do_writeback+0xbb/0x240
 [<ffffffff811308ea>] bdi_forker_task+0x6a/0x310
 [<ffffffff8108ddf6>] kthread+0x96/0xa0

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #327
2011-07-19 16:17:01 -07:00
Brian Behlendorf
abd39a8289 Fix zio_execute() deadlock
To avoid deadlocking the system it is crucial that all memory
allocations performed in the zio_execute() call path are marked
KM_PUSHPAGE (GFP_NOFS).  This ensures that while a z_wr_iss
thread is processing the syncing transaction group it does
not re-enter the filesystem code and deadlock on itself.

Call Trace:
 [<ffffffffa02580e8>] cv_wait_common+0x78/0xe0 [spl]
 [<ffffffffa0347bab>] txg_wait_open+0x7b/0xa0 [zfs]
 [<ffffffffa030e73d>] dmu_tx_wait+0xed/0xf0 [zfs]
 [<ffffffffa0376a49>] zfs_putpage+0x219/0x360 [zfs]
 [<ffffffffa038d75e>] zpl_putpage+0x1e/0x60 [zfs]
 [<ffffffffa038d7b2>] zpl_writepage+0x12/0x20 [zfs]
 [<ffffffff8115f907>] writeout+0xa7/0xd0
 [<ffffffff8115fa6b>] move_to_new_page+0x13b/0x170
 [<ffffffff8115fed4>] migrate_pages+0x434/0x4c0
 [<ffffffff811559ab>] compact_zone+0x4fb/0x780
 [<ffffffff81155ed1>] compact_zone_order+0xa1/0xe0
 [<ffffffff8115602c>] try_to_compact_pages+0x11c/0x190
 [<ffffffff811200bb>] __alloc_pages_nodemask+0x5eb/0x8b0
 [<ffffffff81159932>] kmem_getpages+0x62/0x170
 [<ffffffff8115a54a>] fallback_alloc+0x1ba/0x270
 [<ffffffff8115a2c9>] ____cache_alloc_node+0x99/0x160
 [<ffffffff8115b059>] __kmalloc+0x189/0x220
 [<ffffffffa02539fb>] kmem_alloc_debug+0xeb/0x130 [spl]
 [<ffffffffa031454a>] dnode_hold_impl+0x46a/0x550 [zfs]
 [<ffffffffa0314649>] dnode_hold+0x19/0x20 [zfs]
 [<ffffffffa03042e3>] dmu_read+0x33/0x180 [zfs]
 [<ffffffffa034729d>] space_map_load+0xfd/0x320 [zfs]
 [<ffffffffa03300bc>] metaslab_activate+0x10c/0x170 [zfs]
 [<ffffffffa0330ad9>] metaslab_alloc+0x469/0x800 [zfs]
 [<ffffffffa038963c>] zio_dva_allocate+0x6c/0x2f0 [zfs]
 [<ffffffffa038a249>] zio_execute+0x99/0xf0 [zfs]
 [<ffffffffa0254b1c>] taskq_thread+0x1cc/0x330 [spl]
 [<ffffffff8108ddf6>] kthread+0x96/0xa0

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #291
2011-07-19 11:55:42 -07:00
Brian Behlendorf
a140dc5469 Fix mmap(2)/write(2)/read(2) deadlock
When modifing overlapping regions of a file using mmap(2) and
write(2)/read(2) it is possible to deadlock due to a lock inversion.
The zfs_write() and zfs_read() hooks first take the zfs range lock
and then lock the individual pages.  Conversely, when using mmap'ed
I/O the zpl_writepage() hook is called with the individual page
locks already taken and then zfs_putpage() takes the zfs range lock.

The most straight forward fix is to simply not take the zfs range
lock in the mmap(2) case.  The individual pages will still be locked
thus serializing access.  Updating the same region of a file with
write(2) and mmap(2) has always been a dodgy thing to do.  This change
at a minimum ensures we don't deadlock and is consistent with the
existing Linux semantics enforced by the VFS.

This isn't an issue under Solaris because the only range locking
performed will be with the zfs range locks.  It's up to each filesystem
to perform its own file locking.  Under Linux the VFS provides many
of these services.

It may be possible/desirable at a latter date to entirely dump the
existing zfs range locking and rely on the Linux VFS page locks.
However, for now its safest to perform both layers of locking until
zfs is more tightly integrated with the page cache.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #302
2011-07-19 11:55:42 -07:00
Brian Behlendorf
61f218b090 Fix send/recv 'dataset is busy' errors
This commit fixes a regression which was accidentally introduced by
the Linux 2.6.39 compatibility chanages.  As part of these changes
instead of holding an active reference on the namepsace (which is
no longer posible) a reference is taken on the super block.  This
reference ensures the super block remains valid while it is in use.

To handle the unlikely race condition of the filesystem being
unmounted concurrently with the start of a 'zfs send/recv' the
code was updated to only take the super block reference when there
was an existing reference.  This indicates that the filesystem is
active and in use.

Unfortunately, in the 'zfs recv' case this is not the case.  The
newly created dataset will not have a super block without an
active reference which results in the 'dataset is busy' error.

The most straight forward fix for this is to simply update the
code to always take the reference even when it's zero.  This
may expose us to very very unlikely concurrent umount/send/recv
case but the consequences of that are minor.

Closes #319
2011-07-15 16:37:19 -07:00
Brian Behlendorf
057e8eee35 Improve fstat(2) performance
There is at most a factor of 3x performance improvement to be
had by using the Linux generic_fillattr() helper.  However, to
use it safely we need to ensure the values in a cached inode
are kept rigerously up to date.  Unfortunately, this isn't
the case for the blksize, blocks, and atime fields.  At the
moment the authoritative values are still stored in the znode.

This patch introduces an optimized zfs_getattr_fast() call.
The idea is to use the up to date values from the inode and
the blksize, block, and atime fields from the znode.  At some
latter date we should be able to strictly use the inode values
and further improve performance.

The remaining overhead in the zfs_getattr_fast() call can be
attributed to having to take the znode mutex.  This overhead is
unavoidable until the inode is kept strictly up to date.  The
the careful reader will notice the we do not use the customary
ZFS_ENTER()/ZFS_EXIT() macros.  These macro's are designed to
ensure the filesystem is not torn down in the middle of an
operation.  However, in this case the VFS is holding a
reference on the active inode so we know this is impossible.

=================== Performance Tests ========================

This test calls the fstat(2) system call 10,000,000 times on
an open file description in a tight loop.  The test results
show the zfs stat(2) performance is now only 22% slower than
ext4.  This is a 2.5x improvement and there is a clear long
term plan to get to parity with ext4.

filesystem    | test-1  test-2  test-3  | average | times-ext4
--------------+-------------------------+---------+-----------
ext4          |  7.785s  7.899s  7.284s |  7.656s | 1.000x
zfs-0.6.0-rc4 | 24.052s 22.531s 23.857s | 23.480s | 3.066x
zfs-faststat  |  9.224s  9.398s  9.485s |  9.369s | 1.223x

The second test is to run 'du' of a copy of the /usr tree
which contains 110514 files.  The test is run multiple times
both using both a cold cache (/proc/sys/vm/drop_caches) and
a hot cache.  As expected this change signigicantly improved
the zfs hot cache performance and doesn't quite bring zfs to
parity with ext4.

A little surprisingly the zfs cold cache performance is better
than ext4.  This can probably be attributed to the zfs allocation
policy of co-locating all the meta data on disk which minimizes
seek times.  By default the ext4 allocator will spread the data
over the entire disk only co-locating each directory.

filesystem    | cold    | hot
--------------+---------+--------
ext4          | 13.318s | 1.040s
zfs-0.6.0-rc4 |  4.982s | 1.762s
zfs-faststat  |  4.933s | 1.345s
2011-07-11 09:11:22 -07:00
Brian Behlendorf
abd8610cd5 Add L2ARC tunables
The performance of the L2ARC can be tweaked by a number of tunables, which
may be necessary for different workloads:

     l2arc_write_max         max write bytes per interval
     l2arc_write_boost       extra write bytes during device warmup
     l2arc_noprefetch        skip caching prefetched buffers
     l2arc_headroom          number of max device writes to precache
     l2arc_feed_secs         seconds between L2ARC writing
     l2arc_feed_min_ms       min feed interval in milliseconds
     l2arc_feed_again        turbo L2ARC warmup
     l2arc_norw              no reads during writes

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #316
2011-07-08 12:44:11 -07:00
Gunnar Beutner
3c9609b322 Renamed HAVE_SHARE ifdefs to HAVE_SMB_SHARE.
The remaining code that is guarded by HAVE_SHARE ifdefs is related to the
.zfs/shares functionality which is currently not available on Linux.

On Solaris the .zfs/shares directory can be used to set permissions for
SMB shares.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-07-06 09:20:28 -07:00
Gunnar Beutner
46e18b3f0f Implemented sharing datasets via NFS using libshare.
The sharenfs and sharesmb properties depend on the libshare library
to export datasets via NFS and SMB. This commit implements the base
libshare functionality as well as support for managing NFS shares.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-07-06 09:20:28 -07:00
Brian Behlendorf
285226eff3 Always allow non-user xattrs
Under Linux you may only disable USER xattrs.  The SECURITY,
SYSTEM, and TRUSTED xattr namespaces must always be available
if xattrs are supported by the filesystem.  The enforcement
of USER xattrs is performed in the zpl_xattr_user_* handlers.

Under Solaris there is only a single xattr namespace which
is managed globally.
2011-07-01 13:39:48 -07:00
Rohan Puri
a89c3e0bd5 Support mandatory locks (nbmand)
The Linux kernel already has support for mandatory locking.  This
change just replaces the Solaris mandatory locking calls with the
Linux equivilants.  In fact, it looks like this code could be
removed entirely because this checking is already done generically
in the Linux VFS.  However, for now we'll leave it in place even
if it is redundant just in case we missed something.

The original patch to update the code to support mandatory locking
was done by Rohan Puri.  This patch is an updated version which is
compatible with the previous mount option handling changes.

Original-Patch-by: Rohan Puri <rohan.puri15@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #222
Closes #253
2011-07-01 13:39:40 -07:00
Brian Behlendorf
2cf7f52bc4 Linux compat 2.6.39: mount_nodev()
The .get_sb callback has been replaced by a .mount callback
in the file_system_type structure.  When using the new
interface the caller must now use the mount_nodev() helper.

Unfortunately, the new interface no longer passes the vfsmount
down to the zfs layers.  This poses a problem for the existing
implementation because we currently save this pointer in the
super block for latter use.  It provides our only entry point
in to the namespace layer for manipulating certain mount options.

This needed to be done originally to allow commands like
'zfs set atime=off tank' to work properly.  It also allowed me
to keep more of the original Solaris code unmodified.  Under
Solaris there is a 1-to-1 mapping between a mount point and a
file system so this is a fairly natural thing to do.  However,
under Linux they many be multiple entries in the namespace
which reference the same filesystem.  Thus keeping a back
reference from the filesystem to the namespace is complicated.

Rather than introduce some ugly hack to get the vfsmount and
continue as before.  I'm leveraging this API change to update
the ZFS code to do things in a more natural way for Linux.
This has the upside that is resolves the compatibility issue
for the long term and fixes several other minor bugs which
have been reported.

This commit updates the code to remove this vfsmount back
reference entirely.  All modifications to filesystem mount
options are now passed in to the kernel via a '-o remount'.
This is the expected Linux mechanism and allows the namespace
to properly handle any options which apply to it before passing
them on to the file system itself.

Aside from fixing the compatibility issue, removing the
vfsmount has had the benefit of simplifying the code.  This
change which fairly involved has turned out nicely.

Closes #246
Closes #217
Closes #187
Closes #248
Closes #231
2011-07-01 13:36:39 -07:00
Brian Behlendorf
5c03efc379 Linux compat 2.6.39: security_inode_init_security()
The security_inode_init_security() function now takes an additional
qstr argument which must be passed in from the dentry if available.
Passing a NULL is safe when no qstr is available the relevant
security checks will just be skipped.

Closes #246
Closes #217
Closes #187
2011-07-01 12:40:08 -07:00
Brian Behlendorf
e2e7aa2df8 Add ZFS specific mmap() checks
Under Linux the VFS handles virtually all of the mmap() access
checks.  Filesystem specific checks are left to be handled in
the .mmap() hook and normally there arn't any.

However, ZFS provides a few attributes which can influence the
mmap behavior and should be honored.  Note, currently the code
to modify these attributes has not been implemented under Linux.

* ZFS_IMMUTABLE | ZFS_READONLY | ZFS_APPENDONLY: when any of these
  attributes are set a file may not be mmaped with write access.

* ZFS_AV_QUARANTINED: when set a file file may not be mmaped with
  read or exec access.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-07-01 12:23:46 -07:00
Brian Behlendorf
f0b2486034 Remove unused MMAP functions
The following functions were required for the OpenSolaris mmap
implementation.  Because the Linux VFS does most the most heavy
lifting for us they are not required and are being removed to
keep the code clean and easy to understand.

  * zfs_null_putapage()
  * zfs_frlock()
  * zfs_no_putpage()

Signed-off-by: Brian Behlendorf <behlendorf@llnl.gov>
2011-07-01 12:22:57 -07:00
Prasad Joshi
dde471ef5a MMAP Optimization
Enable zfs_getpage, zfs_fillpage, zfs_putpage, zfs_putapage functions.
The functions have been modified to make them Linux friendly.

ZFS uses these functions to read/write the mmapped pages. Using them
from readpage/writepage results in clear code. The patch also adds
readpages and writepages interface functions to read/write list of
pages in one function call.

The code change handles the first mmap optimization mentioned on
https://github.com/behlendorf/zfs/issues/225

Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com>
Signed-off-by: Brian Behlendorf <behlendorf@llnl.gov>
Issue #255
2011-07-01 12:22:52 -07:00
Prasad Joshi
218b8eafbd Use truncate_setsize in zfs_setattr
According to Linux kernel commit 2c27c65e, using truncate_setsize in
setattr simplifies the code. Therefore, the patch replaces the call
to vmtruncate() with truncate_setsize().

zfs_setattr uses zfs_freesp to free the disk space belonging to the
file.  As truncate_setsize may release the page cache and flushing
the dirty data to disk, it must be called before the zfs_freesp.

Suggested-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com>
Closes #255
2011-06-27 09:59:52 -07:00
Prasad Joshi
b312979252 Tear down and flush the mmap region
The inode eviction should unmap the pages associated with the inode.
These pages should also be flushed to disk to avoid the data loss.
Therefore, use truncate_setsize() in evict_inode() to release the
pagecache.

The API truncate_setsize() was added in 2.6.35 kernel. To ensure
compatibility with the old kernel, the patch defines its own
truncate_setsize function.

Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com>
Closes #255
2011-06-27 09:59:19 -07:00
Brian Behlendorf
7e7baecaa3 Linux 3.0 compat, shrinker compatibility
To accomindate the updated Linux 3.0 shrinker API the spl
shrinker compatibility code was updated.  Unfortunately, this
couldn't be done cleanly without slightly adjusting the comapt
API.  See spl commit a55bcaad18.

This commit updates the ZFS code to use the slightly modified
API.  You must use the latest SPL if your building ZFS.
2011-06-21 14:36:39 -07:00
Gunnar Beutner
b00131d43c Fix unlink/xattr deadlock
The problem here is that prune_icache() tries to evict/delete
both the xattr directory inode as well as at least one xattr
inode contained in that directory. Here's what happens:

1. File is created.
2. xattr is created for that file (behind the scenes a xattr
   directory and a file in that xattr directory are created)
3. File is deleted.
4. Both the xattr directory inode and at least one xattr
   inode from that directory are evicted by prune_icache();
   prune_icache() acquires a lock on both inodes before it
   calls ->evict() on the inodes

When the xattr directory inode is evicted zfs_zinactive attempts
to delete the xattr files contained in that directory. While
enumerating these files zfs_zget() is called to obtain a reference
to the xattr file znode - which tries to lock the xattr inode.
However that very same xattr inode was already locked by
prune_icache() further up the call stack, thus leading to a
deadlock.

This can be reliably reproduced like this:
$ touch test
$ attr -s a -V b test
$ rm test
$ echo 3 > /proc/sys/vm/drop_caches

This patch fixes the deadlock by moving the zfs_purgedir() call to
zfs_unlinked_drain().  Instead zfs_rmnode() now checks whether the
xattr dir is empty and leaves the xattr dir in the unlinked set if
it finds any xattrs.

To ensure zfs_unlinked_drain() never accesses a stale super block
zfsvfs_teardown() has been update to block until the iput taskq
has been drained.  This avoids a potential race where a file with
an xattr directory is removed and the file system is immediately
unmounted.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #266
2011-06-20 13:47:03 -07:00
Gunnar Beutner
6f0cf71e0d Removed erroneous zfs_inode_destroy() calls from zfs_rmnode().
iput_final() already calls zpl_inode_destroy() -> zfs_inode_destroy()
for us after zfs_zinactive(), thus making sure that the inode is
properly cleaned up.

The zfs_inode_destroy() calls in zfs_rmnode() would lead to a
double-free.

Fixes #282
2011-06-20 10:30:17 -07:00
Brian Behlendorf
96801d2906 Linux 2.6.37 compat, WRITE_FLUSH_FUA
The WRITE_FLUSH, WRITE_FUA, and WRITE_FLUSH_FUA flags have been
introduced as a replacement for WRITE_BARRIER.  This was done
to allow richer semantics to be expressed to the block layer.
It is the block layers responsibility to choose the correct way
to implement these semantics.

This change simply updates the bio's to use the new kernel API
which should be absolutely safe.  However, since ZFS depends
entirely on this working as designed for correctness we do
want to be careful.

Closes #281
2011-06-17 14:37:26 -07:00
Brian Behlendorf
e95b3bdcbb Fix stack ddt_class_contains()
Stack usage for ddt_class_contains() reduced from 524 bytes to 68
bytes.  This large stack allocation significantly contributed to
the likelyhood of a stack overflow when scrubbing/resilvering
dedup pools.
2011-05-31 12:17:27 -07:00
Brian Behlendorf
5b8c7bbcea Fix stack ddt_zap_lookup()
Stack usage for ddt_zap_lookup() reduced from 368 bytes to 120
bytes.  This large stack allocation significantly contributed to
the likelyhood of a stack overflow when scrubbing/resilvering
dedup pools.
2011-05-31 12:17:27 -07:00
Brian Behlendorf
c7f8f831a4 Revert "Fix stack traverse_visitbp()"
This abomination is no longer required because the zio's issued
during this recursive call path will now be handled asynchronously
by the taskq thread pool.

This reverts commit 6656bf5621.
2011-05-31 12:17:27 -07:00
Brian Behlendorf
2fac4c2a74 Make tgx_sync_thread zio's async
The majority of the recursive operations performed by the dsl
are done either in the context of the tgx_sync_thread or during
pool import.  It is these recursive operations which contribute
greatly to the stack depth.  When this recursion is coupled with
a synchronous I/O in the same context overflow becomes possible.

Previously to handle this case I have focused on keeping the
individual stack frames as light as possible.  This is a good
idea as long as it can be done in a way which doesn't overly
complicate the code.  However, there is a better solution.

If we treat all zio's issued by the tgx_sync_thread as async then
we can use the tgx_sync_thread stack for the recursive parts, and
the zio_* threads for the I/O parts.  This effectively doubles our
available stack space with the only drawback being a small delay
to schedule the I/O.  However, in practice the scheduling time
is so much smaller than the actual I/O time this isn't an issue.
Another benefit of making the zio async is that the zio pipeline
is now parallel.  That should mean for CPU intensive pipelines
such as compression or dedup performance may be improved.

With this change in place the worst case stack usage observed so
far is 6902 bytes.  This is still higher than I'd like but
significantly improved.  Additional changes to specific functions
should improve this further.  This change allows us to revent
commit 6656bf5 which did some horrible things to the recursive
traverse_visitbp() callpath in the name of saving stack.
2011-05-31 12:17:27 -07:00
Brian Behlendorf
f74fae8b30 Fix 4K sector support
Yesterday I ran across a 3TB drive which exposed 4K sectors to
Linux.  While I thought I had gotten this support correct it
turns out there were 2 subtle bugs which prevented it from
working.

  sudo ./cmd/zpool/zpool create -f large-sector /dev/sda
  cannot create 'large-sector': one or more devices is currently unavailable

1) The first issue was that it was possible that bdev_capacity()
would return the number of 512 byte sectors rather than the number
of 4096 sectors.  Internally, certain Linux functions only operate
with 512 byte sectors so you need to be careful.  To avoid any
confusion in the future I've updated bdev_capacity() to simply
return the device (or partition) capacity in bytes.  The higher
levels of ZFS want the value in bytes anyway so this is cleaner.

2) When creating a bio the ->bi_sector count must always be
expressed in 512 byte sectors.  The existing code would scale
the byte offset by the logical sector size.   Until now this was
always 512 so it never caused problems.  Trying a 4K sector drive
clearly exposed the issue.  The problem has been fixed by
hard coding the 512 byte sector which is exactly what the bio
code does internally.

With these changes I'm now able to create ZFS pools using 4K
sector drives.  No issues were observed during fairly extensive
testing.  This is also a low risk change if your using 512b
sectors devices because none of the logic changes.

Closes #256
2011-05-27 11:38:53 -07:00
Brian Behlendorf
2b8cad6159 Use vmem_alloc() for zfs_ioc_userspace_many()
The default buffer size when requesting multiple quota entries
is 100 times the zfs_useracct_t size.  In practice this works out
to exactly 27200 bytes.  Since this will be a short lived buffer
in a non-performance critical path it is preferable to vmem_alloc()
the needed memory.
2011-05-20 14:23:18 -07:00
Brian Behlendorf
f01b360e67 Pass caller's credential in zfsdev_ioctl()
Initially when zfsdev_ioctl() was ported to Linux we didn't have
any credential support implemented.  So at the time we simply
passed NULL which wasn't much of a problem since most of the
secpolicy code was disabled.

However, one exception is quota handling which does require the
credential.  Now that proper credentials are supported we can
safely start passing the callers credential.  This is also an
initial step towards fully implemented the zfs secpolicy.
2011-05-20 10:12:25 -07:00
Brian Behlendorf
3fd70ee6b0 Fix 'negative objects to delete' warning
Normally when the arc_shrinker_func() function is called the return
value should be:

   >=0 - To indicate the number of freeable objects in the cache, or
   -1  - To indicate this cache should be skipped

However, when the shrinker callback is called with 'nr_to_scan' equal
to zero.  The caller simply wants the number of freeable objects in
the cache and we must never return -1.  This patch reorders the
first two conditionals in arc_shrinker_func() to ensure this behavior.

This patch also now explictly casts arc_size and arc_c_min to signed
int64_t types so MAX(x, 0) works as expected.  As unsigned types
we would never see an negative value which defeated the purpose of
the MAX() lower bound and broke the shrinker logic.

Finally, when nr_to_scan is non-zero we explictly prevent all reclaim
below arc_c_min.  This is done to prevent the Linux page cache from
completely crowding out the ARC.  This limit is tunable and some
experimentation is likely going to be required to set it exactly right.
For now we're sticking with the OpenSolaris defaults.

Closes #218
Closes #243
2011-05-18 10:29:22 -07:00
Brian Behlendorf
e814770f2e Update synchronous open zfs_close() comment
The comment in zfs_close() pertaining to decrementing the synchronous
open count needs to be updated for Linux.  The code was already
updated to be correct, but the comment was missed and is now misleading.
Under Linux the zfs_close() hook is only called once when the final
reference is dropped.  This differs from Solaris where zfs_close()
is called for each close.

Closes #237
2011-05-13 08:20:06 -07:00
Brian Behlendorf
c91d229809 Merge pull request #235 from nedbass/rdev
Don't store rdev in SA for FIFOs and sockets
2011-05-09 16:41:28 -07:00
Ned A. Bass
aa6d8c1086 Don't store rdev in SA for FIFOs and sockets
Update the handling of named pipes and sockets to be consistent with
other platforms with regard to the rdev attribute.  While all ZFS
ipmlementations store the rdev for device files in a system attribute
(SA), this is not the case for FIFOs and sockets.  Indeed, Linux always
passes rdev=0 to mknod() for FIFOs and sockets, so the value is not
needed.  Add an ASSERT that rdev==0 for FIFOs and sockets to detect if
the expected behavior ever changes.

Closes #216
2011-05-09 13:35:07 -07:00
Brian Behlendorf
21ade34764 Disable direct reclaim for z_wr_* threads
The direct reclaim path in the z_wr_* threads must be disabled
to ensure forward progress is always maintained for txg processing.
This ensures that a txg will never get stuck waiting on itself
because it entered the following memory reclaim callpath.

  ->prune_icache()->dispose_list()->zpl_clear_inode()->zfs_inactive()
  ->dmu_tx_assign()->dmu_tx_wait()->tgx_wait_open()

It would be preferable to target this exact code path but the
kernel offers no way to do this without custom patches.  To avoid
this we are forced to disable all reclaim for these threads.  It
should not be necessary to do this for other other z_* threads
because they will not hold a txg open.

Closes #232
2011-05-06 15:26:26 -07:00
Brian Behlendorf
3117dd0b90 Handle NULL in nfsd .fsync() hook
How nfsd handles .fsync() has been changed a couple of times in the
recent kernels.  But basically there are three cases we need to
consider.

Linux 2.6.12 - 2.6.33
* The .fsync() hook takes 3 arguments
* The nfsd will call .fsync() with a NULL file struct pointer.

Linux 2.6.34
* The .fsync() hook takes 3 arguments
* The nfsd no longer calls .fsync() but instead used sync_inode()

Linux 2.6.35 - 2.6.x
* The .fsync() hook takes 2 arguments
* The nfsd no longer calls .fsync() but instead used sync_inode()

For once it looks like we've gotten lucky.  The first two cases can
actually be collased in to one if we stop using the file struct
pointer entirely.  Since the dentry is still passed in both cases
this is possible.  The last case can then be safely handled by
unconditionally using the dentry in the file struct pointer now
that we know the nfsd caller has been removed.

Closes #230
2011-05-06 12:33:45 -07:00
Brian Behlendorf
34b84cb831 Use vmem_alloc() for zfs_ioc_pool_get_history()
The default buffer size when requesting history is 128k.  This
is far to large for a kmem_alloc() so instead use the slower
vmem_alloc().  This path has no performance concerns and the
buffer is immediately free'd after its contents are copied to
the user space buffer.
2011-05-06 09:59:52 -07:00
Brian Behlendorf
c409e4647f Add missing ZFS tunables
This commit adds module options for all existing zfs tunables.
Ideally the average user should never need to modify any of these
values.  However, in practice sometimes you do need to tweak these
values for one reason or another.  In those cases it's nice not to
have to resort to rebuilding from source.  All tunables are visable
to modinfo and the list is as follows:

$ modinfo module/zfs/zfs.ko
filename:       module/zfs/zfs.ko
license:        CDDL
author:         Sun Microsystems/Oracle, Lawrence Livermore National Laboratory
description:    ZFS
srcversion:     8EAB1D71DACE05B5AA61567
depends:        spl,znvpair,zcommon,zunicode,zavl
vermagic:       2.6.32-131.0.5.el6.x86_64 SMP mod_unload modversions
parm:           zvol_major:Major number for zvol device (uint)
parm:           zvol_threads:Number of threads for zvol device (uint)
parm:           zio_injection_enabled:Enable fault injection (int)
parm:           zio_bulk_flags:Additional flags to pass to bulk buffers (int)
parm:           zio_delay_max:Max zio millisec delay before posting event (int)
parm:           zio_requeue_io_start_cut_in_line:Prioritize requeued I/O (bool)
parm:           zil_replay_disable:Disable intent logging replay (int)
parm:           zfs_nocacheflush:Disable cache flushes (bool)
parm:           zfs_read_chunk_size:Bytes to read per chunk (long)
parm:           zfs_vdev_max_pending:Max pending per-vdev I/Os (int)
parm:           zfs_vdev_min_pending:Min pending per-vdev I/Os (int)
parm:           zfs_vdev_aggregation_limit:Max vdev I/O aggregation size (int)
parm:           zfs_vdev_time_shift:Deadline time shift for vdev I/O (int)
parm:           zfs_vdev_ramp_rate:Exponential I/O issue ramp-up rate (int)
parm:           zfs_vdev_read_gap_limit:Aggregate read I/O over gap (int)
parm:           zfs_vdev_write_gap_limit:Aggregate write I/O over gap (int)
parm:           zfs_vdev_scheduler:I/O scheduler (charp)
parm:           zfs_vdev_cache_max:Inflate reads small than max (int)
parm:           zfs_vdev_cache_size:Total size of the per-disk cache (int)
parm:           zfs_vdev_cache_bshift:Shift size to inflate reads too (int)
parm:           zfs_scrub_limit:Max scrub/resilver I/O per leaf vdev (int)
parm:           zfs_recover:Set to attempt to recover from fatal errors (int)
parm:           spa_config_path:SPA config file (/etc/zfs/zpool.cache) (charp)
parm:           zfs_zevent_len_max:Max event queue length (int)
parm:           zfs_zevent_cols:Max event column width (int)
parm:           zfs_zevent_console:Log events to the console (int)
parm:           zfs_top_maxinflight:Max I/Os per top-level (int)
parm:           zfs_resilver_delay:Number of ticks to delay resilver (int)
parm:           zfs_scrub_delay:Number of ticks to delay scrub (int)
parm:           zfs_scan_idle:Idle window in clock ticks (int)
parm:           zfs_scan_min_time_ms:Min millisecs to scrub per txg (int)
parm:           zfs_free_min_time_ms:Min millisecs to free per txg (int)
parm:           zfs_resilver_min_time_ms:Min millisecs to resilver per txg (int)
parm:           zfs_no_scrub_io:Set to disable scrub I/O (bool)
parm:           zfs_no_scrub_prefetch:Set to disable scrub prefetching (bool)
parm:           zfs_txg_timeout:Max seconds worth of delta per txg (int)
parm:           zfs_no_write_throttle:Disable write throttling (int)
parm:           zfs_write_limit_shift:log2(fraction of memory) per txg (int)
parm:           zfs_txg_synctime_ms:Target milliseconds between tgx sync (int)
parm:           zfs_write_limit_min:Min tgx write limit (ulong)
parm:           zfs_write_limit_max:Max tgx write limit (ulong)
parm:           zfs_write_limit_inflated:Inflated tgx write limit (ulong)
parm:           zfs_write_limit_override:Override tgx write limit (ulong)
parm:           zfs_prefetch_disable:Disable all ZFS prefetching (int)
parm:           zfetch_max_streams:Max number of streams per zfetch (uint)
parm:           zfetch_min_sec_reap:Min time before stream reclaim (uint)
parm:           zfetch_block_cap:Max number of blocks to fetch at a time (uint)
parm:           zfetch_array_rd_sz:Number of bytes in a array_read (ulong)
parm:           zfs_pd_blks_max:Max number of blocks to prefetch (int)
parm:           zfs_dedup_prefetch:Enable prefetching dedup-ed blks (int)
parm:           zfs_arc_min:Min arc size (ulong)
parm:           zfs_arc_max:Max arc size (ulong)
parm:           zfs_arc_meta_limit:Meta limit for arc size (ulong)
parm:           zfs_arc_reduce_dnlc_percent:Meta reclaim percentage (int)
parm:           zfs_arc_grow_retry:Seconds before growing arc size (int)
parm:           zfs_arc_shrink_shift:log2(fraction of arc to reclaim) (int)
parm:           zfs_arc_p_min_shift:arc_c shift to calc min/max arc_p (int)
2011-05-04 10:02:37 -07:00
Brian Behlendorf
5f35b19007 Fully update inode when created
When a new znode/inode pair is created both the znode and the inode
should be immediately updated to the correct values.  This was done
for the znode and for most of the values in the inode, but not all
of them.  This normally wasn't a problem because most subsequent
operations would cause the inode to be immediately updated.  This
change ensures the inode is now fully updated before it is inserted
in to the inode hash.

Closes #116
Closes #146
Closes #164
2011-05-02 14:04:19 -07:00
Brian Behlendorf
df554c148e Fix 'zfs set volsize=N pool/dataset'
This change fixes a kernel panic which would occur when resizing
a dataset which was not open.  The objset_t stored in the
zvol_state_t will be set to NULL when the block device is closed.
To avoid this issue we pass the correct objset_t as the third arg.

The code has also been updated to correctly notify the kernel
when the block device capacity changes.  For 2.6.28 and newer
kernels the capacity change will be immediately detected.  For
earlier kernels the capacity change will be detected when the
device is next opened.  This is a known limitation of older
kernels.

Online ext3 resize test case passes on 2.6.28+ kernels:
$ dd if=/dev/zero of=/tmp/zvol bs=1M count=1 seek=1023
$ zpool create tank /tmp/zvol
$ zfs create -V 500M tank/zd0
$ mkfs.ext3 /dev/zd0
$ mkdir /mnt/zd0
$ mount /dev/zd0 /mnt/zd0
$ df -h /mnt/zd0
$ zfs set volsize=800M tank/zd0
$ resize2fs /dev/zd0
$ df -h /mnt/zd0

Original-patch-by: Fajar A. Nugraha <github@fajar.net>
Closes #68
Closes #84
2011-05-02 08:54:40 -07:00
Gunnar Beutner
055656d4f4 Implemented NFS export_operations.
Implemented the required NFS operations for exporting ZFS datasets
using the in-kernel NFS daemon.
2011-04-29 12:36:13 -07:00
Brian Behlendorf
5476e6952c Suppress 'vdev_metaslab_init' memory warning
The vdev_metaslab_init() function has been observed to allocate
larger than 8k chunks.  However, they are not much larger than 8k
and it does this infrequently so it is allowed and the warning is
supressed.
2011-04-27 09:35:18 -07:00
Brian Behlendorf
40a39e1103 Conserve stack in dsl_scan_visit()
The dsl_scan_visit() function is a little heavy weight taking 464
bytes on the stack.  This can be easily reduced for little cost by
moving zap_cursor_t and zap_attribute_t off the stack and on to the
heap.  After this change dsl_scan_visit() has been reduced in size
by 320 bytes.

This change was made to reduce stack usage in the dsl_scan_sync()
callpath which is recursive and has been observed to overflow the
stack.

Issue #174
2011-04-26 15:48:00 -07:00
Brian Behlendorf
b81c4ac9af Conserve stack in dsl_scan_visitbp()
This function is called recursively so everything possible must be
done to limit its stack consumption.  The dprintf_bp() debugging
function adds 30 bytes of local variables to the function we cannot
afford.  By commenting out this debugging we save 30 bytes per
recursion and depths of 13 are not uncommon.  This yeilds a total
stack saving of 390 bytes on our 8k stack.

Issue #174
2011-04-26 15:48:00 -07:00
Brian Behlendorf
7a060636b0 Conserve stack in dsl_scan_visitbp()
The recursive call chain dsl_scan_visitbp() -> dsl_scan_recurse() ->
dsl_scan_visitdnode() -> dsl_scan_visitbp has been observed to consume
considerable stack resulting in a stack overflow (>8k).  The cleanest
way I see to fix this with minimal impact to the existing flow of
code, and with the fewest performance concerns, is to always inline
dsl_scan_recurse() and dsl_scan_visitdnode().  While this will increase
the function size of dsl_scan_visitbp(), by 4660 bytes, it also reduces
the stack requirements by removing the function call overhead.

Issue #174
2011-04-26 13:37:35 -07:00
Brian Behlendorf
701b1f8168 Fix zvol deadlock
It's possible for a zvol_write thread to enter direct memory reclaim
while holding open a transaction group.  This results in the system
attempting to write out data to the disk to free memory.  Unfortunately,
this can't succeed because the the thread doing reclaim is holding open
the txg which must be closed to be synced to disk.  To prevent this
the offending allocation is marked KM_PUSHPAGE which will prevent it
from attempting writeback.

Closes #191
2011-04-26 12:56:35 -07:00
Brian Behlendorf
e2448b0e62 Fix spurious -EFAULT when setting I/O scheduler
Occasionally we would see an -EFAULT returned when setting the
I/O scheduler on a vdev.  This was caused an improperly formatted
user mode helper command.

This commit restructures the command to something simpler, allocates
space for it dynamically to save stack, and removes the retry logic
which is no longer needed.

Closes #169
2011-04-22 14:55:35 -07:00
Brian Behlendorf
6a8f9b6bf0 Enforce ARC meta-data limits
This change ensures the ARC meta-data limits are enforced.  Without
this enforcement meta-data can grow to consume all of the ARC cache
pushing out data and hurting performance.  The cache is aggressively
reclaimed but this is a soft and not a hard limit.  The cache may
exceed the set limit briefly before being brought under control.

By default 25% of the ARC capacity can be used for meta-data.  This
limit can be tuned by setting the 'zfs_arc_meta_limit' module option.
Once this limit is exceeded meta-data reclaim will occur in 3 percent
chunks, or may be tuned using 'arc_reduce_dnlc_percent'.

Closes #193
2011-04-21 13:49:31 -07:00
Gunnar Beutner
36df284366 Fixed a use-after-free bug in zfs_zget().
Fixed a bug where zfs_zget could access a stale znode pointer when
the inode had already been removed from the inode cache via iput ->
iput_final -> ... -> zfs_zinactive but the corresponding SA handle
was still alive.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #180
2011-04-21 13:48:01 -07:00
Brian Behlendorf
d247f2a3cc Suppress 'zfs receive' memory warning
As part of zfs_ioc_recv() a zfs_cmd_t is allocated in the kernel
which is 17808 bytes in size.  This sort of thing in general should
be avoided.  However, since this should be an infrequent event for
now we allow it and simply suppress the warning with the KM_NODEBUG
flag.  This can be revisited latter if/when it becomes an issue.

Closes #178
2011-04-20 10:22:31 -07:00
Gunnar Beutner
bec30953cd Truncate the xattr znode when updating existing attributes.
If the attribute's new value was shorter than the old one the old
code would leave parts of the old value in the xattr znode.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #203
2011-04-19 14:14:40 -07:00
Gunnar Beutner
274b7e79f3 Added missing initialization for va.va_dentry in zfs_get_xattrdir.
Without this we may mistakenly believe we have a dentry and try to
d_instantiate() it.  This will result in the following BUG.  It's
important to note that while the xattr directory has an inode
assoicated with it we never create a dentry for it.

  kernel BUG at fs/dcache.c:1418!

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #202
2011-04-19 13:57:41 -07:00
Brian Behlendorf
0fe3d820f5 Fix gcc compiler warning, dsl_pool_create()
When compiling ZFS in user space gcc-4.6.0 correctly identifies
the variable 'os' as being set but never used.  This generates a
warning and a build failure when using --enable-debug.  However,
the code is correct we only want to use 'os' for the kernel space
builds.  To suppress the warning the call was wrapped with a
VERIFY() which has the nice side effect of ensuring the 'os'
actually never is NULL.  This was observed under Fedora 15.

  module/zfs/dsl_pool.c: In function ‘dsl_pool_create’:
  module/zfs/dsl_pool.c:229:12: error: variable ‘os’ set but not used
  [-Werror=unused-but-set-variable]
2011-04-19 09:04:51 -07:00
Brian Behlendorf
e30c0ada6d Linux 2.6.39 compat, invalidate_inodes()
Update code to use the spl_invalidate_inodes() wrapper.  This hides
some of the complexity of determining if invalidate_inodes() was
exported, and if so what is its prototype.  The second argument
of spl_invalidate_inodes() determined the behavior of how dirty
inodes are handled.  By passing a zero we are indicated that we
want those inodes to be treated as busy and skipped.
2011-04-19 08:57:23 -07:00
Brian Behlendorf
0d3ac5e735 Linux 2.6.29 compat, credentials
The .sync_fs fix as applied did not use the updated SPL credential
API.  This broke builds on Debian Lenny, this change applies the
needed fix to use the portable API.  The original credential changes
are part of commit 81e97e2187.
2011-04-07 14:27:09 -07:00
Brian Behlendorf
eec8164771 Fix ASSERTION(!dsl_pool_sync_context(tx->tx_pool))
Disable the normal reclaim path for the txg_sync thread.  This
ensures the thread will never enter dmu_tx_assign() which can
otherwise occur due to direct reclaim.  If this is allowed to
happen the system can deadlock.  Direct reclaim call path:

  ->shrink_icache_memory->prune_icache->dispose_list->
  clear_inode->zpl_clear_inode->zfs_inactive->dmu_tx_assign
2011-04-07 09:52:16 -07:00
Brian Behlendorf
7cb67b45f3 Add direct+indirect ARC reclaim
Under OpenSolaris all memory reclaim is done asyncronously.  Under
Linux memory reclaim is done asynchronously _and_ synchronously.
When a process allocates memory with GFP_KERNEL it explicitly allows
the kernel to do reclaim on its behalf to satify the allocation.
If that GFP_KERNEL allocation fails the kernel may take more drastic
measures to reclaim the memory such as killing user space processes.

This was observed to happen with ZFS because the ARC could consume
a large fraction of the system memory but no synchronous reclaim
could be performed on it.  The result was GFP_KERNEL allocations
could fail resulting in OOM events, and only moments latter the
arc_reclaim thread would free unused memory from the ARC.

This change leaves the arc_thread in place to manage the fundamental
ARC behavior.  But it adds a synchronous (direct) reclaim path for
the ARC which can be called when memory is badly needed.  It also
adds an asynchronous (indirect) reclaim path which is called
much more frequently to prune the ARC slab caches.
2011-04-07 09:52:10 -07:00
Brian Behlendorf
1834f2d8b7 Add missing arcstats
The following useful values were missing the arcstats.  This change
adds them in to provide greater visibility in to the arcs behavior.

arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_meta_used                   4    624774592
arc_meta_limit                  4    400785408
arc_meta_max                    4    625594176
2011-04-07 09:52:05 -07:00
Brian Behlendorf
c85b224faf Call d_instantiate before unlocking inode
Under Linux a dentry referencing an inode must be instantiated before
the inode is unlocked.  To accomplish this without overly modifing
the core ZFS code the dentry it passed via the vattr_t.  There are
cases such as replay when a dentry is not available.  In which case
it is obviously not initialized at inode creation time, if a dentry
is needed it will be spliced as when required via d_lookup().
2011-04-07 09:51:57 -07:00
Brian Behlendorf
bfd214af01 Fix inflated load average
Kernel threads which sleep uninterruptibly on Linux are marked in the (D)
state.  These threads are usually in the process of performing IO and are
thus counted against the load average.  The txg_quiesce and txg_sync threads
were always sleeping uninterruptibly and thus inflating the load average.

This change makes them sleep interruptibly.  Some care is required however
because these threads may now be woken early by signals.  In this case the
callers are all careful to check that the required conditions are met after
waking up.  If we're woken early due to a signal they will simply go back
to sleep.  In this case these changes are safe.

Closes #175
2011-03-31 17:07:12 -07:00
Brian Behlendorf
7a1cdc0775 Linux 2.6.29 compat, .freeze_fs/.unfreeze_fs
The .freeze_fs/.unfreeze_fs hooks were not added until Linux 2.6.29
Since these hooks are currently unused they are being removed to
allow support of older kernels.
2011-03-22 12:17:24 -07:00
Brian Behlendorf
81e97e2187 Linux 2.6.29 compat, credentials
As of Linux 2.6.29 a clean credential API was added to the Linux kernel.
Previously the credential was embedded in the task_struct.  Because the
SPL already has considerable support for handling this API change the
ZPL code has been updated to use the Solaris credential API.
2011-03-22 12:15:54 -07:00
Brian Behlendorf
d6bd8eaae4 Fix evict() deadlock
Now that KM_SLEEP is not defined as GFP_NOFS there is the possibility
of synchronous reclaim deadlocks.  These deadlocks never existed in the
original OpenSolaris code because all memory reclaim on Solaris is done
asyncronously.  Linux does both synchronous (direct) and asynchronous
(indirect) reclaim.

This commit addresses a deadlock caused by inode eviction.  A KM_SLEEP
allocation may trigger direct memory reclaim and shrink the inode cache.
This can occur while a mutex in the array of ZFS_OBJ_HOLD mutexes is
held.  Through the ->shrink_icache_memory()->evict()->zfs_inactive()->
zfs_zinactive() call path the same mutex may be reacquired resulting
in a deadlock.  To avoid this deadlock the process must not reacquire
the mutex when it is already holding it.

This is a reasonable fix for now but longer term the ZFS_OBJ_HOLD
mutex locking should be reevaluated.  This infrastructure already
prevents us from ever using the Linux lock dependency analysis tools,
and it may limit scalability.
2011-03-22 12:14:55 -07:00
Brian Behlendorf
691f6ac4c2 Use KM_PUSHPAGE instead of KM_SLEEP
It used to be the case that all KM_SLEEP allocations were GFS_NOFS.
Unfortunately this often resulted in the kernel being unable to
reclaim the ARC, inode, and dentry caches in a timely manor.
The fix was to make KM_SLEEP a GFP_KERNEL allocation in the SPL.

However, this increases the posibility of deadlocking the system
on a zfs write thread.  If a zfs write thread attempts to perform
an allocation it may trigger synchronous reclaim.  This reclaim
may attempt to flush dirty data/inode to disk to free memory.
Unforunately, this write cannot finish because the write thread
which would handle it is holding the previous transaction open.
Deadlock.

To avoid this all allocations in the zfs write thread path must
use KM_PUSHPAGE which prohibits synchronous reclaim for that
thread.  In this way forward progress in ensured.  The risk
with this change is I missed updating an allocation for the
write threads leaving an increased posibility of deadlock.  If
any deadlocks remain they will be unlikely but we'll have to
make sure they all get fixed.
2011-03-22 12:14:55 -07:00
Brian Behlendorf
0de19dad9c Register .remount_fs handler
Register the missing .remount_fs handler.  This handler isn't strictly
required because the VFS does a pretty good job updating most of the
MS_* flags.  However, there's no harm in using the hook to call the
registered zpl callback for various MS_* flags.  Additionaly, this
allows us to lay the ground work for more complicated argument parsing
in the future.
2011-03-15 13:33:29 -07:00
Brian Behlendorf
03f9ba9d99 Register .sync_fs handler
Register the missing .sync_fs handler.  This is a noop in most cases
because the usual requirement is that sync just be initiated.  As part
of the DMU's normal transaction processing txgs will be frequently
synced.  However, when the 'wait' flag is set the requirement is that
.sync_fs must not return until the data is safe on disk.  With the
addition of the .sync_fs handler this is now properly implemented.
2011-03-15 13:33:29 -07:00
Brian Behlendorf
04516a45b2 Don't set I/O Scheduler for Partitions
ZFS should only change the i/o scheduler for a disk when it has
ownership of the whole disk.  This is basically the same logic as
adjusting the write cache behavior on a disk.  This change updates
the vdev disk code to skip partitions when setting the i/o scheduler.

Closes #152
2011-03-10 13:34:17 -08:00
Brian Behlendorf
adf2e8778e Fix O_APPEND Corruption
Due to an uninitialized variable files opened with O_APPEND may
overwrite the start of the file rather than append to it.  This
was introduced accidentally when I removed the Solaris vnodes.

The zfs_range_lock_writer() function used to key off zf->z_vnode
to determine if a znode_t was for a zvol of zpl object.  With
the removal of vnodes this was replaced by the flag zp->z_is_zvol.
This flag was used to control the append behavior for range locks.

Unfortunately, this value was never properly initialized after
the vnode removal.  However, because most of memory is usually
zeros it happened to be set correctly most of the time making
the bug appear racy.  Properly initializing zp->z_is_zvol to
zero completely resolves the problem with O_APPEND.

Closes #126
2011-03-09 13:31:00 -08:00
Brian Behlendorf
17c37660a1 Conserve stack in zfs_setattr()
Move 'bulk' and 'xattr_bulk' from the stack to the heap to minimize
stack space usage.  These two arrays consumed 448 bytes on the stack
and have been replaced by two 8 byte points for a total stack space
saving of 432 bytes.  The zfs_setattr() path had been previously
observed to overrun the stack in certain circumstances.
2011-03-09 13:30:03 -08:00
Brian Behlendorf
450dc149bd Range lock performance improvements
The original range lock implementation had to be modified by commit
8926ab7 because it was unsafe on Linux.  In particular, calling
cv_destroy() immediately after cv_broadcast() is dangerous because
the waiters may still be asleep.  Thus the following cv_destroy()
will free memory which may still be in use.

This was fixed by updating cv_destroy() to block on waiters but
this in turn introduced a deadlock.  The deadlock was resolved
with the use of a taskq to move the offending free outside the
range lock.  This worked well but using the taskq for the free
resulted in a serious performace hit.  This is somewhat ironic
because at the time I felt using the taskq might improve things
by making the free asynchronous.

This patch refines the original fix and moves the free from the
taskq to a private free list.  Then items which must be free'd
are simply inserted in to the list.  When the range lock is dropped
it's safe to free the items.  The list is walked and all rl_t
entries are freed.

This change improves small cached read performance by 26x.  This
was expected because for small reads the number of locking calls
goes up significantly.  More surprisingly this change significantly
improves large cache read performance.  This probably attributable
to better cpu/memory locality.  Very likely the same processor
which allocated the memory is now freeing it.

bs	ext3	zfs	zfs+fix		faster
----------------------------------------------
512     435     3       79      	26x
1k      820     7       160     	22x
2k      1536    14      305     	21x
4k      2764    28      572     	20x
8k      3788    50      1024    	20x
16k     4300    86      1843    	21x
32k     4505    138     2560    	18x
64k     5324    252     3891    	15x
128k    5427    276     4710    	17x
256k    5427    413     5017    	12x
512k    5427    497     5324    	10x
1m      5427    521     5632    	10x

Closes #142
2011-03-08 12:44:06 -08:00
Brian Behlendorf
126400a1ca Add zfs_open()/zfs_close()
In the original implementation the zfs_open()/zfs_close() hooks
were dropped for simplicity.  This was functional but not 100%
correct with the expected ZFS sematics.  Updating and re-adding the
zfs_open()/zfs_close() hooks resolves the following issues.

1) The ZFS_APPENDONLY file attribute is once again honored.  While
there are still no Linux tools to set/clear these attributes once
there are it should behave correctly.

2) Minimal virus scan file attribute hooks were added.  Once again
this support in disabled but the infrastructure is back in place.

3) Most importantly correctly handle assigning files which were
opened syncronously to the intent log.  Without this change O_SYNC
modifications could be lost during a system crash even though they
were marked synchronous.
2011-03-08 11:04:51 -08:00
Brian Behlendorf
53cf50e081 Set stat->st_dev and statfs->f_fsid
Filesystems like ZFS must use what the kernel calls an anonymous super
block.  Basically, this is just a filesystem which is not backed by a
single block device.  Normally this block device's dev_t is stored in
the super block.  For anonymous super blocks a unique reserved dev_t
is assigned as part of get_sb().

This sb->s_dev must then be set in the returned stat structures as
stat->st_dev.  This allows userspace utilities to easily detect the
boundries of a specific filesystem.  Tools such as 'du' depend on this
for proper accounting.

Additionally, under OpenSolaris the statfs->f_fsid is set to the device
id.  To preserve consistency with OpenSolaris we also set the fsid to
the device id.  Other Linux filesystem (ext) set the fsid to a unique
value determined by the filesystems uuid.  This value is unique but
maintains no relationship to the device id.  This may be desirable
when exporting NFS filesystem because it minimizes to chance of a
client observing the same fsid from two different servers.

Closes #140
2011-03-07 16:06:22 -08:00
Brian Behlendorf
6742abf9ec Use Linux ATTR_ versions
The AT_ versions of these macros are used on Solaris and while they
map to their Linux equivilants the code has been updated to use the
ATTR_ versions.
2011-03-03 11:29:15 -08:00
Brian Behlendorf
f4ea75d492 Conserve stack in zfs_setattr()
Move 'tmpxvattr' from the stack to the heap to minimize stack
space usage.  This is enough to get us below the 1024 byte stack
frame warning.  That however is still a large stack frame and it
should be further reduced by moving the 'bulk' and 'xattr_bulk'
sa_bulk_attr_t variables to the heap in a future patch.
2011-03-02 14:18:58 -08:00
Brian Behlendorf
5484965ab6 Drop HAVE_XVATTR macros
When I began work on the Posix layer it immediately became clear to
me that to integrate cleanly with the Linux VFS certain Solaris
specific things would have to go.  One of these things was to elimate
as many Solaris specific types from the ZPL layer as possible.  They
would be replaced with their Linux equivalents.  This would not only
be good for performance, but for the general readability and health of
the code.  The Solaris and Linux VFS are different beasts and should
be treated as such.  Most of the code remains common for constructing
transactions and such, but there are subtle and important differenced
which need to be repsected.

This policy went quite for for certain types such as the vnode_t,
and it initially seemed to be working out well for the vattr_t.  There
was a relatively small amount of related xvattr_t code I was forced to
comment out with HAVE_XVATTR.  But it didn't look that hard to come
back soon and replace it all with a native Linux type.

However, after going doing this path with xvattr some distance it
clear that this code was woven in the ZPL more deeply than I thought.
In particular its hooks went very deep in to the ZPL replay code
and replacing it would not be as easy as I originally thought.

Rather than continue persuing replacing and removing this code I've
taken a step back and reevaluted things.  This commit reverts many of
my previous commits which removed xvattr related code.  It restores
much of the code to its original upstream state and now relies on
improved xvattr_t support in the zfs package itself.

The result of this is that much of the code which I had commented
out, which accidentally broke things like replay, is now back in
place and working.  However, there may be a small performance
impact for getattr/setattr operations because they now require
a translation from native Linux to Solaris types.  For now that's
a price I'm willing to pay.  Once everything is completely functional
we can revisting the issue of removing the vattr_t/xvattr_t types.

Closes #111
2011-03-02 11:44:34 -08:00
Brian Behlendorf
9623f736d9 Remove caller_context_t
Remove the remaining callers of caller_context_t.  This type has
been removed because it is not needed for the Linux port.
2011-03-02 11:35:35 -08:00
Darik Horn
a23cc0a443 Add the zpool and filesystem versions
Print the supported zpool and filesystem versions at module load
time.  This change removes an ambiguity and adds information that
system administrators care about.  The phrase "ZFS pool version %s"
is the same as zpool upgrade -v so that the operator is familiar
with the message.

  ZFS: Loaded module v0.6.0, ZFS pool version 28, ZFS filesystem version 5
  ZFS: Unloaded module v0.6.0

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-02-28 09:46:23 -08:00
Brian Behlendorf
fdcd952b4d Fix set block scheduler warnings
There were two cases when attempting to set the vdev block device
scheduler which would causes console warnings.

The first case was when the vdev used a loop, ram, dm, or other
such device which doesn't support a configurable scheduler.  In
these cases attempting to set a scheduler is pointless and can
be safely skipped.

The secord case is slightly more troubling.  We were seeing
transient cases where setting the elevator would return -EFAULT.
On retry everything is fine so there appears to be a small window
where this is possible.  To handle that case we silently retry
up to three times before reporting the warning.

In all of the above cases the warning is harmless and at worse you
may see slightly different performance characteristics from one
or more of your vdevs.
2011-02-25 11:37:11 -08:00
Fajar A. Nugraha
4c0d8e50b9 Use udev to create /dev/zvol/[dataset_name] links
This commit allows zvols with names longer than 32 characters, which
fixes issue on https://github.com/behlendorf/zfs/issues/#issue/102.

Changes include:
- use /dev/zd* device names for zvol, where * is the device minor
  (include/sys/fs/zfs.h, module/zfs/zvol.c).
- add BLKZNAME ioctl to get dataset name from userland
  (include/sys/fs/zfs.h, module/zfs/zvol.c, cmd/zvol_id).
- add udev rule to create /dev/zvol/[dataset_name] and the legacy
  /dev/[dataset_name] symlink. For partitions on zvol, it will create
  /dev/zvol/[dataset_name]-part* (etc/udev/rules.d/60-zvol.rules,
  cmd/zvol_id).

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-02-25 09:43:19 -08:00
Brian Behlendorf
dc1d7665c5 Remove rdev packing
Remove custom code to pack/unpack dev_t's.  Under Linux all dev_t's
are an unsigned 32-bit value even on 64-bit platforms.  The lower
20 bits are used for the minor number and the upper 12 for the major
number.

This means if your importing a pool from Solaris you may get strange
major/minor numbers.  But it doesn't really matter because even if
we add compatibility code to translate the encoded Solaris major/minor
they won't do you any good under Linux.  You will still need to
recreate the dev_t with a major/minor which maps to reserved major
numbers used under Linux.

Dropping this code also resolves 32-bit builds by removing the
offending 32-bit compatibility code.
2011-02-23 15:13:03 -08:00
Brian Behlendorf
99c564bc48 Use correct ASSERT3* variant
ASSERT3P should be used instead of ASSERT3U when comparing
pointers.  Using ASSERT3U with the cast causes a compiler
warning for 32-bit builds which is fatal with --enable-debug.
2011-02-23 15:03:30 -08:00
Brian Behlendorf
05ff35c602 Increase fragment size to block size
The underlying storage pool actually uses multiple block
size.  Under Solaris frsize (fragment size) is reported as
the smallest block size we support, and bsize (block size)
as the filesystem's maximum block size.  Unfortunately,
under Linux the fragment size and block size are often used
interchangeably.  Thus we are forced to report both of them
as the filesystem's maximum block size.

Closes #112
2011-02-23 14:00:06 -08:00
Brian Behlendorf
f6dcdf13f8 Fix 'statement with no effect' warning
Because the secpolicy_* macros are all currently defined to (0).
And because the caller of this function does not check the return
code.  The compiler complains that this statement has no effect
which is correct and OK.  To suppress the warning explictly cast
the result to (void).
2011-02-23 13:03:19 -08:00
Brian Behlendorf
a31a70bbd1 Fix enum compiler warning
Generally it's a good idea to use enums for switch statements,
but in this case it causes warning because the enum is really a
set of flags.  These flags are OR'ed together in some cases
resulting in values which are not part of the original enum.
This causes compiler warning such as this about invalid cases.

  error: case value ‘33’ not in enumerated type ‘zprop_source_t’

To handle this we simply case the enum to an int for the switch
statement.  This leaves all other enum type checking in place
and effectively disabled these warnings.
2011-02-23 12:52:51 -08:00
Brian Behlendorf
61e909608d Linux 2.6.x compat, blkdev_compat.h
For legacy reasons the zvol.c and vdev_disk.c Linux compatibility
code ended up in sys/blkdev.h and sys/vdev_disk.h headers.  While
there are worse places for this code to live it should be in a
linux/blkdev_compat.h header.  This change moves this block device
Linux compatibility code in to the linux/blkdev_compat.h header
and updates all the correct #include locations.  This is not a
functional change or bug fix, it is just code cleanup.
2011-02-23 12:29:38 -08:00
Brian Behlendorf
5d0265c0dd Merge branch 'zpl' 2011-02-18 09:31:25 -08:00
Brian Behlendorf
037849f854 Use provided uid/gid for setattr
When changing the uid/gid of a file via zfs_setattr() use the
Posix id passed in iattr->ia_uid/gid.  While the zfs_fuid_create()
code already had the fuid support disabled for Linux it was
returning the uid/gid from the credential.  With this change
the 'chown' command which relies on setxattr is now working
properly.

Also remove a little stray white space which was in front of
zfs_update_inode() call and the end of zfs_setattr().
2011-02-17 14:23:48 -08:00
Brian Behlendorf
efd1832bc6 Fix symlink(2) inode reference count
Under Linux sys_symlink(2) should result in a inode being created
with one reference for the inode itself, and a second reference on
the inode which is held by the new dentry.  Under Solaris this
appears not to be the case.  Their zfs_symlink() handler drops
the inode reference before returning.

The result of this under Linux is that the reference count for
symlinks is always one smaller than it should have been. This
results in a BUG() when the symlink is unlinked.  To handle this
the Linux port now keeps the inode reference which differs from
the Solaris behavior.  This results in correct reference counts.

Closes #96
2011-02-17 11:34:47 -08:00
Brian Behlendorf
5095000169 Use -zfs_readlink() error
The zfs_readlink() function returns a Solaris positive error value
and that needs to be converted to a Linux negative error value.
While in this case nothing would actually go wrong, it's still
incorrect and should be fixed if for no other reason than clarity.
2011-02-17 09:48:06 -08:00
Brian Behlendorf
8b4f9a2d55 Fix readlink(2)
This patch addresses three issues related to symlinks.

1) Revert the zfs_follow_link() function to a modified version
of the original zfs_readlink().  The only changes from the
original OpenSolaris version relate to using Linux types.
For the moment this means no vnode's and no zfsvfs_t.  The
caller zpl_follow_link() was also updated accordingly.  This
change was reverted because it was slightly gratuitious.

2) Update zpl_follow_link() to use local variables for the
link buffer.  I'd forgotten that iov.iov_base is updated by
uiomove() so after the call to zfs_readlink() it can not longer
be used.  We need our own private copy of the link pointer.

3) Allocate MAXPATHLEN instead of MAXPATHLEN+1.  By default
MAXPATHLEN is 4096 bytes which is a full page, adding one to
it pushes it slightly over a page.  That means you'll likely
end up allocating 2 pages which is wasteful of memory and
possibly slightly slower.
2011-02-16 15:54:55 -08:00
Ricardo M. Correia
54a179e7b8 Add API to wait for pending commit callbacks
This adds an API to wait for pending commit callbacks of already-synced
transactions to finish processing.  This is needed by the DMU-OSD in
Lustre during device finalization when some callbacks may still not be
called, this leads to non-zero reference count errors.  See lustre.org
bug 23931.
2011-02-16 11:20:06 -08:00
Brian Behlendorf
a6695d83b7 Add get/setattr, get/setxattr hooks
While the attr/xattr hooks were already in place for regular
files this hooks can also apply to directories and special files.
While they aren't typically used in this way, it should be
supported.  This patch registers these additional callbacks
for both directory and special inode types.
2011-02-16 09:55:53 -08:00
Brian Behlendorf
d8fd10545b Fix FIFO and socket handling
Under Linux when creating a fifo or socket type device in the ZFS
filesystem it's critical that the rdev is stored in a SA.  This
was already being correctly done for character and block devices,
but that logic needed to be extended to include FIFOs and sockets.

This patch takes care of device creation but a follow on patch
may still be required to verify that the dev_t is being correctly
packed/unpacked from the SA.
2011-02-16 09:51:44 -08:00
Brian Behlendorf
d567444809 Create minors for all zvols
It was noticed that when you have zvols in multiple datasets
not all of the zvol devices are created at module load time.
Fajarnugraha did the leg work to identify that the root cause of
this bug is a non-zero return value from zvol_create_minors_cb().

Returning a non-zero value from the dmu_objset_find_spa() callback
function results in aborting processing the remaining children in
a dataset.  Since we want to ensure that the callback in run on
all children regardless of error simply unconditionally return
zero from the zvol_create_minors_cb().  This callback function
is solely used for this purpose so surpressing the error is safe.

Closes #96
2011-02-16 09:50:06 -08:00
Brian Behlendorf
2c395def27 Linux 2.6.36 compat, sops->evict_inode()
The new prefered inteface for evicting an inode from the inode cache
is the ->evict_inode() callback.  It replaces both the ->delete_inode()
and ->clear_inode() callbacks which were previously used for this.
2011-02-11 13:47:51 -08:00
Brian Behlendorf
f9637c6c8b Linux 2.6.33 compat, get/set xattr callbacks
The xattr handler prototypes were sanitized with the idea being that
the same handlers could be used for multiple methods.  The result of
this was the inode type was changes to a dentry, and both the get()
and set() hooks had a handler_flags argument added.  The list()
callback was similiarly effected but no autoconf check was added
because we do not use the list() callback.
2011-02-11 10:41:00 -08:00
Brian Behlendorf
7268e1bec8 Linux 2.6.35 compat, fops->fsync()
The fsync() callback in the file_operations structure used to take
3 arguments.  The callback now only takes 2 arguments because the
dentry argument was determined to be unused by all consumers.  To
handle this a compatibility prototype was added to ensure the right
prototype is used.  Our implementation never used the dentry argument
either so it's just a matter of using the right prototype.
2011-02-11 09:05:51 -08:00
Brian Behlendorf
777d4af891 Linux 2.6.35 compat, const struct xattr_handler
The const keyword was added to the 'struct xattr_handler' in the
generic Linux super_block structure.  To handle this we define an
appropriate xattr_handler_t typedef which can be used.  This was
the preferred solution because it keeps the code clean and readable.
2011-02-10 16:29:00 -08:00
Brian Behlendorf
6839eed23e Use 'noop' IO Scheduler
Initial testing has shown the the right IO scheduler to use under Linux
is noop.  This strikes the ideal balance by allowing the zfs elevator
to do all request ordering and prioritization.  While allowing the
Linux elevator to do the maximum front/back merging allowed by the
physical device.  This yields the largest possible requests for the
device with the lowest total overhead.

While 'noop' should be right for your system you can choose a different
IO scheduler with the 'zfs_vdev_scheduler' option.  You may set this
value to any of the standard Linux schedulers: noop, cfq, deadline,
anticipatory.  In addition, if you choose 'none' zfs will not attempt
to change the IO scheduler for the block device.
2011-02-10 09:27:22 -08:00
Brian Behlendorf
4db77a74a6 Suppress large kmem_alloc() warning
The following warning was observed under normal operation.  It's
not fatal but it's something to be addressed long term.  Flag the
offending allocation with KM_NODEBUG to suppress the warning and
flag the call site.

SPL: Showing stack for process 21761
Pid: 21761, comm: iozone Tainted: P           ----------------
2.6.32-71.14.1.el6.x86_64 #1
Call Trace:
 [<ffffffffa05465a7>] spl_debug_dumpstack+0x27/0x40 [spl]
 [<ffffffffa054a84d>] kmem_alloc_debug+0x11d/0x130 [spl]
 [<ffffffffa05de166>] dmu_buf_hold_array_by_dnode+0xa6/0x4e0 [zfs]
 [<ffffffffa05de825>] dmu_buf_hold_array+0x65/0x90 [zfs]
 [<ffffffffa05de891>] dmu_read_uio+0x41/0xd0 [zfs]
 [<ffffffffa0654827>] zfs_read+0x147/0x470 [zfs]
 [<ffffffffa06644a2>] zpl_read_common+0x52/0x70 [zfs]
 [<ffffffffa0664503>] zpl_read+0x43/0x70 [zfs]
 [<ffffffff8116d905>] vfs_read+0xb5/0x1a0
 [<ffffffff8116da41>] sys_read+0x51/0x90
 [<ffffffff81013172>] system_call_fastpath+0x16/0x1b
2011-02-10 09:27:22 -08:00
Brian Behlendorf
ceb43b935d Invalidate dcache and inode cache
When performing a 'zfs rollback' it's critical to invalidate
the previous dcache and inode cache.  If we don't there will
stale cache entries which when accessed will result in EIOs.
2011-02-10 09:27:22 -08:00
Brian Behlendorf
8926ab7a50 Move cv_destroy() outside zp->z_range_lock()
With the recent SPL change (d599e4fa) that forces cv_destroy()
to block until all waiters have been woken.  It is now unsafe
to call cv_destroy() under the zp->z_range_lock() because it
is used as the condition variable mutex.  If there are waiters
cv_destroy() will block until they wake up and aquire the mutex.
However, they will never aquire the mutex because cv_destroy()
will not return allowing it's caller to drop the lock.  Deadlock.

To avoid this cv_destroy() is now run asynchronously in a taskq.
This solves two problems:

1) It is no longer run under the zp->z_range_lock so no deadlock.
2) Since cv_destroy() may now block we don't want this slowing
   down zfs_range_unlock() and throttling the system.

This was not as much of an issue under OpenSolaris because their
cv_destroy() implementation does not do anything.  They do however
risk a bad paging request if cv_destroy() returns, the memory holding
the condition variable is free'd, and then the waiters wake up and
try to reference it.  It's a very small unlikely race, but it is
possible.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
c0d35759c5 Add mmap(2) support
It's worth taking a moment to describe how mmap is implemented
for zfs because it differs considerably from other Linux filesystems.
However, this issue is handled the same way under OpenSolaris.

The issue is that by design zfs bypasses the Linux page cache and
leaves all caching up to the ARC.  This has been shown to work
well for the common read(2)/write(2) case.  However, mmap(2)
is problem because it relies on being tightly integrated with the
page cache.  To handle this we cache mmap'ed files twice, once in
the ARC and a second time in the page cache.  The code is careful
to keep both copies synchronized.

When a file with an mmap'ed region is written to using write(2)
both the data in the ARC and existing pages in the page cache
are updated.  For a read(2) data will be read first from the page
cache then the ARC if needed.  Neither a write(2) or read(2) will
will ever result in new pages being added to the page cache.

New pages are added to the page cache only via .readpage() which
is called when the vfs needs to read a page off disk to back the
virtual memory region.  These pages may be modified without
notifying the ARC and will be written out periodically via
.writepage().  This will occur due to either a sync or the usual
page aging behavior.  Note because a read(2) of a mmap'ed file
will always check the page cache first even when the ARC is out
of date correct data will still be returned.

While this implementation ensures correct behavior it does have
have some drawbacks.  The most obvious of which is that it
increases the required memory footprint when access mmap'ed
files.  It also adds additional complexity to the code keeping
both caches synchronized.

Longer term it may be possible to cleanly resolve this wart by
mapping page cache pages directly on to the ARC buffers.  The
Linux address space operations are flexible enough to allow
selection of which pages back a particular index.  The trick
would be working out the details of which subsystem is in
charge, the ARC, the page cache, or both.  It may also prove
helpful to move the ARC buffers to a scatter-gather lists
rather than a vmalloc'ed region.

Additionally, zfs_write/read_common() were used in the readpage
and writepage hooks because it was fairly easy.  However, it
would be better to update zfs_fillpage and zfs_putapage to be
Linux friendly and use them instead.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
cc5f931cfd Add Hooks for Linux Xattr Operations
The Linux specific xattr operations have all been located in the
file zpl_xattr.c.  These functions primarily rely on the reworked
zfs_* functions to do their job.  They are also responsible for
converting the possible Solaris style error codes to negative
Linux errors.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
51f0bbe425 Add Hooks for Linux Super Block Operations
The Linux specific super block operations have all been located in the
file zpl_super.c.  These functions primarily rely on the reworked
zfs_* functions to do their job.  They are also responsible for
converting the possible Solaris style error codes to negative
Linux errors.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
ee154f01bf Add Hooks for Linux Inode Operations
The Linux specific inode operations have all been located in the
file zpl_inode.c.  These functions primarily rely on the reworked
zfs_* functions to do their job.  They are also responsible for
converting the possible Solaris style error codes to negative
Linux errors.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
1efb473f89 Add Hooks for Linux File Operations
The Linux specific file operations have all been located in the
file zpl_file.c.  These functions primarily rely on the reworked
zfs_* functions to do their job.  They are also responsible for
converting the possible Solaris style error codes to negative
Linux errors.

This first zpl_* commit also includes a common zpl.h header with
minimal entries to register the Linux specific hooks.  In also
adds all the new zpl_* file to the Makefile.in.  This is not a
standalone commit, you required the following zpl_* commits.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
633e8030b3 Wrap with HAVE_XVATTR
For the moment exactly how to handle xvattr is not clear.  This
change largely consists of the code to comment out the offending
bits until something reasonable can be done.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
3c4988c83e Add zp->z_is_zvol flag
A new flag is required for the zfs_rlock code to determine if
it is operation of the zvol of zpl dataset.  This used to be
keyed off the zp->z_vnode, which was a hack to begin with, but
with the removal of vnodes we needed a dedicated flag.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
3558fd73b5 Prototype/structure update for Linux
I appologize in advance why to many things ended up in this commit.
When it could be seperated in to a whole series of commits teasing
that all apart now would take considerable time and I'm not sure
there's much merrit in it.  As such I'll just summerize the intent
of the changes which are all (or partly) in this commit.  Broadly
the intent is to remove as much Solaris specific code as possible
and replace it with native Linux equivilants.  More specifically:

1) Replace all instances of zfsvfs_t with zfs_sb_t.  While the
type is largely the same calling it private super block data
rather than a zfsvfs is more consistent with how Linux names
this.  While non critical it makes the code easier to read when
your thinking in Linux friendly VFS terms.

2) Replace vnode_t with struct inode.  The Linux VFS doesn't have
the notion of a vnode and there's absolutely no good reason to
create one.  There are in fact several good reasons to remove it.
It just adds overhead on Linux if we were to manage one, it
conplicates the code, and it likely will lead to bugs so there's
a good change it will be out of date.  The code has been updated
to remove all need for this type.

3) Replace all vtype_t's with umode types.  Along with this shift
all uses of types to mode bits.  The Solaris code would pass a
vtype which is redundant with the Linux mode.  Just update all the
code to use the Linux mode macros and remove this redundancy.

4) Remove using of vn_* helpers and replace where needed with
inode helpers.  The big example here is creating iput_aync to
replace vn_rele_async.  Other vn helpers will be addressed as
needed but they should be be emulated.  They are a Solaris VFS'ism
and should simply be replaced with Linux equivilants.

5) Update znode alloc/free code.  Under Linux it's common to
embed the inode specific data with the inode itself.  This removes
the need for an extra memory allocation.  In zfs this information
is called a znode and it now embeds the inode with it.  Allocators
have been updated accordingly.

6) Minimal integration with the vfs flags for setting up the
super block and handling mount options has been added this
code will need to be refined but functionally it's all there.

This will be the first and last of these to large to review commits.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
6149f4c45f Remove dmu_write_pages() support
For the moment we do not use dmu_write_pages() to write pages
directly in to a dmu object.  It may be required at some point
in the future, but for now is simplest and cleanest to drop it.
It can be easily readded if/when needed.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
eb28321e2d Create a root znode without VFS dependencies
For portability reasons it's handy to be able to create a root
znode and basic filesystem components without requiring the full
cooperation of the VFS.  We are committing to this to simply the
filesystem creations code.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
bcf308227c Remove zfs_ctldir.[ch]
This code is used for snapshot and heavily leverages Solaris
functionality we do not want to reimplement.  These files have
been removed, including references to them, and will be replaced
by a zfs_snap.c/zpl_snap.c implementation which handles snapshots.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
b516a07b99 Disable fuid features
These features should probably be enabled in the Linux zpl code.
For now I'm disabling them until it's clear what needs to be done.
2011-02-10 09:27:20 -08:00
Brian Behlendorf
d5e53f9d06 Disable zfs_sync during oops/panic
Minor update to ensure zfs_sync() is disabled if a kernel oops/panic
is triggered.  As the comment says 'data integrity is job one'.  This
change could have been done by defining panicstr to oops_in_progress
in the SPL.  But I felt it was better to use the native Linux API
here since to be clear.
2011-02-10 09:27:20 -08:00
Brian Behlendorf
acb5376940 Disable Shutdown/Reboot
This support has been disable with HAVE_SHUTDOWN.  We can support
this at some point by adding the needed reboot notifiers.
2011-02-10 09:27:20 -08:00
Brian Behlendorf
cb28b3494e Remove SYNC_ATTR check
This flag does not need to be support under Linux.  As the comment
says it was only there to support fsflush() for old filesystem like
UFS.  This is not needed under Linux.
2011-02-10 09:27:20 -08:00
Brian Behlendorf
e15c023014 Remove mount options
Mount option parsing is still very Linux specific and will be
handled above this zfs filesystem layer.  Honoring those mount
options once set if of course the responsibility of the lower
layers.
2011-02-10 09:27:20 -08:00
Brian Behlendorf
d7cafa8e3e Remove zfs_active_fs_count
This variable was used to ensure that the ZFS module is never
removed while the filesystem is mounted.  Once again the generic
Linux VFS handles this case for us so it can be removed.
2011-02-10 09:27:20 -08:00
Brian Behlendorf
42ab36aa36 Remove unused mount functions
The functions zfs_mount_label_policy(), zfs_mountroot(), zfs_mount()
will not be needed because most of what they do is already handled
by the generic Linux VFS layer.  They all call zfs_domount() which
creates the actual dataset, the caller of this library call which
will be in the zpl layer is responsible for what's left.
2011-02-10 09:27:20 -08:00
Brian Behlendorf
c0b3dc7d07 Remove zfs_major/zfs_minor/zfsfstype
Under Linux we don't need to reserve a major or minor number for
the filesystem.  We can rely on the VFS to handle colisions without
this being handled by the lower ZFS layers.

Additionally, there is no need to keep a zfsfstype around.  We are
not limited on Linux by the OpenSolaris infrastructure which needed
this.  The upper zpl layer can specify the filesystem type.
2011-02-10 09:27:20 -08:00
Brian Behlendorf
4b3f12ecd5 Remove Solaris VFS Hooks
The ZFS code is being restructured to act as a library and a stand
alone module.  This allows us to leverage most of the existing code
with minimal modification.  It also means we need to drop the Solaris
vfs/vnode functions they will be replaced by Linux equivilants and
updated to be Linux friendly.
2011-02-10 09:27:20 -08:00
Brian Behlendorf
960e08fe3e VFS: Add zfs_inode_update() helper
For the moment we have left ZFS unchanged and it updates many values
as part of the znode.  However, some of these values should be set
in the inode.  For the moment this is handled by adding a function
called zfs_inode_update() which updates the inode based on the znode.

This is considered a workaround until we can systematically go
through the ZFS code and have it directly update the inode.  At
which point zfs_update_inode() can be dropped entirely.  Keeping
two copies of the same data isn't only inefficient it's a breeding
ground for bugs.
2011-02-10 09:27:20 -08:00
Brian Behlendorf
7304b6e50f VFS: Integrate zfs_znode_alloc()
Under Linux the convention for filesystem specific data structure is
to embed it along with the generic vfs data structure.  This differs
significantly from Solaris.

Since we want to integrates as cleanly with the Linux VFS as possible.
This changes modifies zfs_znode_alloc() to allocate a znode with an
embedded inode for use with the generic VFS.  This is done by calling
iget_locked() which will allocate a new inode if needed by calling
sb->alloc_inode().  This function allocates enough memory for a
znode_t by returns a pointer to the inode structure for Linux's VFS.
This function is also responsible for setting the callback
znode->z_set_ops_inodes() which is used to register the correct
handlers for the inode.
2011-02-10 09:27:20 -08:00
Brian Behlendorf
10c6047ea5 Enable zfs_znode compilation
Basic compilation of the bulk of zfs_znode.c has been enabled.  After
much consideration it was decided to convert the existing vnode based
interfaces to more friendly Linux interfaces.  The following commits
will systematically replace update the requiter interfaces.  There
are of course pros and cons to this decision.

Pros:
* This simplifies intergration with Linux in the long term.  There is
  no longer any need to manage vnodes which are a foreign concept to
  the Linux VFS.
* Improved long term maintainability.
* Minor performance improvements by removing vnode overhead.

Cons:
* Added work in the short term to modify multiple ZFS interfaces.
* Harder to pull in changes if we ever see any new code from Solaris.
* Mixed Solaris and Linux interfaces in some ZFS code.
2011-02-10 09:27:20 -08:00
Brian Behlendorf
a405c8a665 ACL related changes
A small collection of ACL related changes related to not
supporting fuid mapping.  This whole are will need to be
closely investigated.
2011-02-10 09:26:26 -08:00
Brian Behlendorf
3fc050aaf2 Init/destroy tsd
Add missing tsd_destroy() call for rrw_tsd_key to avoid a leak.
2011-02-10 09:25:38 -08:00
Brian Behlendorf
ab892c5f0a Replace VOP_* calls with direct zfs_* calls
These generic Solaris wrappers are no longer required.  Simply
directly call the correct zfs functions for clarity.
2011-02-10 09:21:43 -08:00
Brian Behlendorf
538f669f63 Add trivial acl helpers
The zfs acl code makes use of the two OpenSolaris helper functions
acl_trivial_access_masks() and ace_trivial_common().  Since they are
only called from zfs_acl.c I've brought them over from OpenSolaris
and added them as static function to this file.  This way I don't
need to reimplement this functionality from scratch in the SPL.

Long term once I take a more careful look at the acl implementation
it may be the case that these functions really aren't needed.  If
that turns out to be the case they can then be removed.
2011-02-10 09:21:43 -08:00
Brian Behlendorf
c60bc1fbf0 Remove dead ACL code
The following code was unused which caused gcc to complain.
Since it was deadcode it has simply been removed.
2011-02-10 09:21:43 -08:00
Brian Behlendorf
4e1b54fdde Remove zfs_parse_bootfs() support
Remove unneeded bootfs functions.  This support shouldn't be required
for the Linux port, and even if it is it would need to be reworked
to integrate cleanly with Linux.
2011-02-10 09:21:43 -08:00
Brian Behlendorf
9ee7fac531 VFS: Wrap with HAVE_SHARE
Certain NFS/SMB share functionality is not yet in place.  These
functions used to be wrapped with the generic HAVE_ZPL to prevent
them from being compiled.  I still don't want them compiled but
I'm working toward eliminating the use of HAVE_ZPL.  So I'm just
renaming the wrapper here to HAVE_SHARE.  They still won't be
compiled until all the share issues are worked through.  Share
support is the last missing piece from zfs_ioctl.c.
2011-02-10 09:21:43 -08:00
Brian Behlendorf
bc3e15e386 Wrap with HAVE_MLSLABEL
The zfs_check_global_label() function is part of the HAVE_MLSLABEL
support which was previously commented out by a HAVE_ZPL check.
Since we're still deciding what to do about mls labels wrap it
with the preexisting macro to keep it compiled out.
2011-02-10 09:21:42 -08:00
Brian Behlendorf
5649246dd3 Remove znode move functionality
Unlike Solaris the Linux implementation embeds the inode in the
znode, and has no use for a vnode.  So while it's true that fragmention
of the znode cache may occur it should not be worse than any of the
other Linux FS inode caches.  Until proven that this is a problem it's
just added complexity we don't need.
2011-02-10 09:21:42 -08:00
Brian Behlendorf
f30484afc3 Conserve stack in zfs_mkdir()
Move the sa_attrs array from the stack to the heap to minimize stack
space usage.
2011-02-10 09:21:42 -08:00
Brian Behlendorf
1ee1b76786 Conserve stack in zfs_sa_upgrade()
As always under Linux stack space is at a premium.  Relocate two
20 element sa_bulk_attr_t arrays in zfs_sa_upgrade() from the stack
to the heap.
2011-02-10 09:21:42 -08:00
Brian Behlendorf
e5c39b95a7 Export required vfs/vn symbols 2011-02-10 09:21:42 -08:00
Brian Behlendorf
72d5e2da3e Add HAVE_SCANSTAMP
This functionality is not supported under Linux, perhaps it
will be some day if it's decided it's useful.
2011-02-10 09:20:33 -08:00
Brian Behlendorf
872e8d2697 Add initial rw_uio functions to the dmu
These functions were dropped originally because I felt they would
need to be rewritten anyway to avoid using uios.  However, this
patch readds then with they dea they can just be reworked and
the uio bits dropped.
2011-02-04 16:14:34 -08:00
Brian Behlendorf
9a616b5d17 Documentation updates
Minor Linux specific documentation updates to the comments and
man pages.
2011-02-04 16:14:34 -08:00
Brian Behlendorf
95c73795b0 Fix ZVOL rename minor devices
During a rename we need to be careful to destroy and create a
new minor for the ZVOL _only_ if the rename succeeded.  The previous
code would both destroy you minor device unconditionally, it would
also fail to create the new minor device on success.
2011-01-07 12:26:02 -08:00
Brian Behlendorf
149e873ab1 Fix minor compiler warnings
These compiler warnings were introduced when code which was
previously #ifdef'ed out by HAVE_ZPL was re-added for use
by the posix layer.  All of the following changes should be
obviously correct and will cause no semantic changes.
2011-01-06 15:04:28 -08:00
Brian Behlendorf
5b63b3eb6f Use cv_timedwait_interruptible in arc
The issue is that cv_timedwait() sleeps uninterruptibly to block signals
and avoid waking up early.  Under Linux this counts against the load
average keeping it artificially high.  This change allows the arc to
sleep interruptibly which mean it may be woken up early due to a signal.

Normally this means some extra care must be taken to handle a potential
signal.  But for the arcs usage of cv_timedwait() there is no harm in
waking up before the timeout expires so no extra handling is required.
2010-12-14 10:06:44 -08:00
Brian Behlendorf
a7dc7e5d5a Enable rrwlock.c compilation
With the addition of the thread specific data interfaces to the
SPL it is safe to enable compilation of the re-enterant read
reader/writer locks.
2010-12-07 16:05:25 -08:00
Ned Bass
e06be58641 Fix for access beyond end of device error
This commit fixes a sign extension bug affecting l2arc devices.  Extremely
large offsets may be passed down to the low level block device driver on
reads, generating errors similar to

    attempt to access beyond end of device
    sdbi1: rw=14, want=36028797014862705, limit=125026959

The unwanted sign extension occurrs because the function arc_read_nolock()
stores the offset as a daddr_t, a 32-bit signed int type in the Linux kernel.
This offset is then passed to zio_read_phys() as a uint64_t argument, causing
sign extension for values of 0x80000000 or greater.  To avoid this, we store
the offset in a uint64_t.

This change also changes a few daddr_t struct members to uint64_t in the libspl
headers to avoid similar bugs cropping up in the future.  We also add an ASSERT
to __vdev_disk_physio() to check for invalid offsets.

Closes #66
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-11-10 21:29:07 -08:00
Brian Behlendorf
675de5aa37 Linux 2.6.36 compat, synchronous bio flag
The name of the flag used to mark a bio as synchronous has changed
again in the 2.6.36 kernel due to the unification of the BIO_RW_*
and REQ_* flags.  The new flag is called REQ_SYNC.  To simplify
checking this flag I have introduced the vdev_disk_dio_is_sync()
helper function.  Based on the results of several new autoconf
tests it uses the correct mask to check for a synchronous bio.

Preferred interface for flagging a synchronous bio:
  2.6.12-2.6.29: BIO_RW_SYNC
  2.6.30-2.6.35: BIO_RW_SYNCIO
  2.6.36-2.6.xx: REQ_SYNC
2010-11-10 17:00:33 -08:00
Ned Bass
b04cffc9b0 Remove inconsistent use of EOPNOTSUPP
Commit 3ee56c292b changed an ENOTSUP return value
in one location to ENOTSUPP to fix user programs seeing an invalid ioctl()
error code.  However, use of ENOTSUP is widespread in the zfs module.  Instead
of changing all of those uses, we fixed the ENOTSUP definition in the SPL to be
consistent with user space.  The changed return value in the above commit is
therefore no longer needed, so this commit reverses it to maintain consistency.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-11-10 13:26:56 -08:00
Ned Bass
3ee56c292b Make rollbacks fail gracefully
Support for rolling back datasets require a functional ZPL, which we currently
do not have.  The zfs command does not check for ZPL support before attempting
a rollback, and in preparation for rolling back a zvol it removes the minor
node of the device.  To prevent the zvol device node from disappearing after a
failed rollback operation, this change wraps the zfs_do_rollback() function in
an #ifdef HAVE_ZPL and returns ENOSYS in the absence of a ZPL.  This is
consistent with the behavior of other ZPL dependent commands such as mount.

The orginal error message observed with this bug was rather confusing:

    internal error: Unknown error 524
    Aborted

This was because zfs_ioc_rollback() returns ENOTSUP if we don't HAVE_ZPL, but
Linux actually has no such error code.  It should instead return EOPNOTSUPP, as
that is how ENOTSUP is defined in user space.  With that we would have gotten
the somewhat more helpful message

    cannot rollback 'tank/fish': unsupported version

This is rather a moot point with the above changes since we will no longer make
that ioctl call without a ZPL.  But, this change updates the error code just in
case.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-11-08 14:03:36 -08:00
Brian Behlendorf
7e55f4e00c Increate zio write interrupt thread count.
Increasing the default zio_wr_int thread count from 8 to 16 improves
write performence by 13% on large systems.  More testing need to be
done but I suspect the ideal tuning here is ZTI_BATCH() with a minimum
of 8 threads.
2010-11-08 14:03:35 -08:00
Brian Behlendorf
451041db53 Shorten zio_* thread names
Linux kernel thread names are expected to be short.  This change shortens
the zio thread names to 10 characters leaving a few chracters to append
the /<cpuid> to which the thread is bound.  For example: z_wr_iss/0.
2010-11-08 14:03:35 -08:00
Ned Bass
b1c5821375 Fix panic mounting unformatted zvol
On some older kernels, i.e. 2.6.18, zvol_ioctl_by_inode() may get passed a NULL
file pointer if the user tries to mount a zvol without a filesystem on it.
This change adds checks to prevent a null pointer dereference.

Closes #73.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-10-29 14:46:33 -07:00
Brian Behlendorf
baa40d45cb Fix missing 'zpool events'
It turns out that 'zpool events' over 1024 bytes in size where being
silently dropped.  This was discovered while writing the zfault.sh
tests to validate common failure modes.

This could occur because the zfs interface for passing an arbitrary
size nvlist_t over an ioctl() is to provide a buffer for the packed
nvlist which is usually big enough.  In this case 1024 byte is the
default.  If the kernel determines the buffer is to small it returns
ENOMEM and the minimum required size of the nvlist_t.  This was
working properly but in the case of 'zpool events' the event stream
was advanced dispite the error.  Thus the retry with the bigger
buffer would succeed but it would skip over the previous event.

The fix is to pass this size to zfs_zevent_next() and determine
before removing the event from the list if it will fit.  This was
preferable to checking after the event was returned because this
avoids the need to rewind the stream.
2010-10-12 14:55:03 -07:00
Brian Behlendorf
a69052be7f Initial zio delay timing
While there is no right maximum timeout for a disk IO we can start
laying the ground work to measure how long they do take in practice.
This change simply measures the IO time and if it exceeds 30s an
event is posted for 'zpool events'.

This value was carefully selected because for sd devices it implies
that at least one timeout (SD_TIMEOUT) has occured.  Unfortunately,
even with FAILFAST set we may retry and request and not get an
error.  This behavior is strongly dependant on the device driver
and how it is hooked in to the scsi error handling stack.  However
by setting the limit at 30s we can log the event even if no error
was returned.

Slightly longer term we can start recording these delays perhaps
as a simple power-of-two histrogram.  This histogram can then be
reported as part of the 'zpool status' command when given an command
line option.

None of this code changes the internal behavior of ZFS.  Currently
it is simply for reporting excessively long delays.
2010-10-12 14:55:02 -07:00
Brian Behlendorf
2959d94a0a Add FAILFAST support
ZFS works best when it is notified as soon as possible when a device
failure occurs.  This allows it to immediately start any recovery
actions which may be needed.  In theory Linux supports a flag which
can be set on bio's called FAILFAST which provides this quick
notification by disabling the retry logic in the lower scsi layers.

That's the theory at least.  In practice is turns out that while the
flag exists you oddly have to set it with the BIO_RW_AHEAD flag.
And even when it's set it you may get retries in the low level
drivers decides that's the right behavior, or if you don't get the
right error codes reported to the scsi midlayer.

Unfortunately, without additional kernels patchs there's not much
which can be done to improve this.  Basically, this just means that
it may take 2-3 minutes before a ZFS is notified properly that a
device has failed.  This can be improved and I suspect I'll be
submitting patches upstream to handle this.
2010-10-12 14:55:02 -07:00
Brian Behlendorf
312c07edfd Generate zevents for speculative and soft errors
By default the Solaris code does not log speculative or soft io errors
in either 'zpool status' or post an event.  Under Linux we don't want
to change the expected behavior of 'zpool status' so these io errors
are still suppressed there.

However, since we do need to know about these events for Linux FMA and
the 'zpool events' interface is new we do post the events.  With the
addition of the zio_flags field the posted events now contain enough
information that a user space consumer can identify and discard these
events if it sees fit.
2010-10-12 14:55:00 -07:00
Brian Behlendorf
d148e95156 Fix negative zio->io_error which must be positive.
All the upper layers of zfs expect zio->io_error to be positive.  I was
careful but I missed one instance in vdev_disk_physio_completion() which
could return a negative error.  To ensure all cases are always caught I
had additionally added an ASSERT() to check this before zio_interpret().

Finally, as a debugging aid when zfs is build with --enable-debug all
errors from the backing block devices will be reported to the console
with an error message like this:

	ZFS: zio error=5 type=1 offset=4217856 size=8192 flags=60440
2010-10-12 14:55:00 -07:00
Brian Behlendorf
398f129ca3 Suppress large kmem_alloc() warning.
Observed during failure mode testing, dsl_scan_setup_sync() allocates
73920 bytes.  This is way over the limit of what is wise to do with a
kmem_alloc() and it should probably be moved to a slab.  For now I'm
just flagging it with KM_NODEBUG to quiet the error until this can be
revisited.
2010-10-12 14:54:59 -07:00
Ned Bass
3a7381e531 Use stored whole_disk property when opening a vdev
This commit fixes a bug in vdev_disk_open() in which the whole_disk property
was getting set to 0 for disk devices, even when it was stored as a 1 when the
zpool was created.  The whole_disk property lets us detect when the partition
suffix should be stripped from the device name in CLI output.  It is also used
to determine how writeback cache should be set for a device.

When an existing zpool is imported its configuration is read from the vdev
label by user space in zpool_read_label().  The whole_disk property is saved in
the nvlist which gets passed into the kernel, where it in turn gets saved in
the vdev struct in vdev_alloc().  Therefore, this value is available in
vdev_disk_open() and should not be overridden by checking the provided device
path, since that path will likely point to a partition and the check will
return the wrong result.

We also add an ASSERT that the whole_disk property is set.  We are not aware of
any cases where vdev_disk_open() should be called with a config that doesn't
have this property set.  The ASSERT is there so that when debugging is enabled
we can identify any legitimate cases that we are missing.  If we never hit the
ASSERT, we can at some point remove it along with the conditional whole_disk
check.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-10-04 13:53:18 -07:00
Ricardo M. Correia
0151834d65 Register the space accounting callback even when we don't have the ZPL.
This callback is needed for properly accounting the per-uid and per-gid
space usage.  Even if we don't have the ZPL, we still need this callback
in order to have proper on-disk ZPL compatibility and to be able to use
Lustre quotas.

Fortunately, the callback doesn't have any ZPL/VFS dependencies so we
can just move it out of #ifdef HAVE_ZPL.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-10-04 11:34:39 -07:00
Ricardo M. Correia
368f4c10ae Export ZFS symbols needed by Lustre.
Required for the DB_DNODE_ENTER()/DB_DNODE_EXIT() helpers.

Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-09-17 16:24:15 -07:00
Ricardo M. Correia
1e411a4c12 Quiet down very frequent large allocation warning in ZFS.
In my machine, dnode_hold_impl() allocates 9992 bytes in DEBUG mode and it
causes a large stream of stack traces in the logs. Instead, use KM_NODEBUG
to quiet down this known large alloc.

Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-09-17 16:24:15 -07:00
Brian Behlendorf
6283f55ea1 Support custom build directories and move includes
One of the neat tricks an autoconf style project is capable of
is allow configurion/building in a directory other than the
source directory.  The major advantage to this is that you can
build the project various different ways while making changes
in a single source tree.

For example, this project is designed to work on various different
Linux distributions each of which work slightly differently.  This
means that changes need to verified on each of those supported
distributions perferably before the change is committed to the
public git repo.

Using nfs and custom build directories makes this much easier.
I now have a single source tree in nfs mounted on several different
systems each running a supported distribution.  When I make a
change to the source base I suspect may break things I can
concurrently build from the same source on all the systems each
in their own subdirectory.

wget -c http://github.com/downloads/behlendorf/zfs/zfs-x.y.z.tar.gz
tar -xzf zfs-x.y.z.tar.gz
cd zfs-x-y-z

------------------------- run concurrently ----------------------
<ubuntu system>  <fedora system>  <debian system>  <rhel6 system>
mkdir ubuntu     mkdir fedora     mkdir debian     mkdir rhel6
cd ubuntu        cd fedora        cd debian        cd rhel6
../configure     ../configure     ../configure     ../configure
make             make             make             make
make check       make check       make check       make check

This change also moves many of the include headers from individual
incude/sys directories under the modules directory in to a single
top level include directory.  This has the advantage of making
the build rules cleaner and logically it makes a bit more sense.
2010-09-08 12:38:56 -07:00
Brian Behlendorf
f5e79474f0 Fix zfsdev_compat_ioctl() case
For the !CONFIG_COMPAT case fix the zfsdev_compat_ioctl()
compatibility function name.  This was caught by the
chaos4.3 builder.
2010-09-01 16:00:15 -07:00
Brian Behlendorf
d603ed6c27 Add linux user disk support
This topic branch contains all the changes needed to integrate the user
side zfs tools with Linux style devices.  Primarily this includes fixing
up the Solaris libefi library to be Linux friendly, and integrating with
the libblkid library which is provided by e2fsprogs.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 13:42:00 -07:00
Brian Behlendorf
054bc00b4c Add linux compatibility
Resolve minor Linux compatibility issues.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 13:41:59 -07:00
Brian Behlendorf
7b89a54996 Add linux spa thread support
Disable the spa thread under Linux until it can be implemented.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 13:41:59 -07:00
Brian Behlendorf
9c905c550b Add linux sha2 support
The upstream ZFS code has correctly moved to a faster native sha2
implementation.  Unfortunately, under Linux that's going to be a little
problematic so we revert the code to the more portable version contained
in earlier ZFS releases.  Using the native sha2 implementation in Linux
is possible but the API is slightly different in kernel version user
space depending on which libraries are used.  Ideally, we need a fast
implementation of SHA256 which builds as part of ZFS this shouldn't be
that hard to do but it will take some effort.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 13:41:59 -07:00
Brian Behlendorf
c28b227942 Add linux kernel module support
Setup linux kernel module support, this includes:
- zfs context for kernel/user
- kernel module build system integration
- kernel module macros
- kernel module symbol export
- kernel module options

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 13:41:58 -07:00
Brian Behlendorf
00b46022c6 Add linux kernel memory support
Required kmem/vmem changes

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 13:41:57 -07:00
Brian Behlendorf
60101509ee Add linux kernel disk support
Native Linux vdev disk interfaces

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 13:41:57 -07:00
Brian Behlendorf
325f023544 Add linux kernel device support
This branch contains the majority of the changes required to cleanly
intergrate with Linux style special devices (/dev/zfs).  Mainly this
means dropping all the Solaris style callbacks and replacing them
with the Linux equivilants.

This patch also adds the onexit infrastructure needed to track
some minimal state between ioctls.  Under Linux it would be easy
to do this simply using the file->private_data.  But under Solaris
they apparent need to pass the file descriptor as part of the ioctl
data and then perform a lookup in the kernel.  Once again to keep
code change to a minimum I've implemented the Solaris solution.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 13:41:50 -07:00
Brian Behlendorf
47d0ed1e6f Add linux spl debug support
Use spl debug if HAVE_SPL defined

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 13:41:50 -07:00
Brian Behlendorf
d2c15e84e9 Add linux mlslabel support
The ZFS update to onnv_141 brought with it support for a
security label attribute called mlslabel.  This feature
depends on zones to work correctly and thus I am disabling
it under Linux.  Equivilant functionality could be added
at some point in the future.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 13:41:49 -07:00
Brian Behlendorf
266852767f Add linux events
This topic branch leverages the Solaris style FMA call points
in ZFS to create a user space visible event notification system
under Linux.  This new system is called zevent and it unifies
all previous Solaris style ereports and sysevent notifications.

Under this Linux specific scheme when a sysevent or ereport event
occurs an nvlist describing the event is created which looks almost
exactly like a Solaris ereport.  These events are queued up in the
kernel when they occur and conditionally logged to the console.
It is then up to a user space application to consume the events
and do whatever it likes with them.

To make this possible the existing /dev/zfs ABI has been extended
with two new ioctls which behave as follows.

* ZFS_IOC_EVENTS_NEXT
Get the next pending event.  The kernel will keep track of the last
event consumed by the file descriptor and provide the next one if
available.  If no new events are available the ioctl() will block
waiting for the next event.  This ioctl may also be called in a
non-blocking mode by setting zc.zc_guid = ZEVENT_NONBLOCK.  In the
non-blocking case if no events are available ENOENT will be returned.
It is possible that ESHUTDOWN will be returned if the ioctl() is
called while module unloading is in progress.  And finally ENOMEM
may occur if the provided nvlist buffer is not large enough to
contain the entire event.

* ZFS_IOC_EVENTS_CLEAR
Clear are events queued by the kernel.  The kernel will keep a fairly
large number of recent events queued, use this ioctl to clear the
in kernel list.  This will effect all user space processes consuming
events.

The zpool command has been extended to use this events ABI with the
'events' subcommand.  You may run 'zpool events -v' to output a
verbose log of all recent events.  This is very similar to the
Solaris 'fmdump -ev' command with the key difference being it also
includes what would be considered sysevents under Solaris.  You
may also run in follow mode with the '-f' option.  To clear the
in kernel event queue use the '-c' option.

$ sudo cmd/zpool/zpool events -fv
TIME                        CLASS
May 13 2010 16:31:15.777711000 ereport.fs.zfs.config.sync
        class = "ereport.fs.zfs.config.sync"
        ena = 0x40982b7897700001
        detector = (embedded nvlist)
                version = 0x0
                scheme = "zfs"
                pool = 0xed976600de75dfa6
        (end detector)

        time = 0x4bec8bc3 0x2e5aed98
        pool = "zpios"
        pool_guid = 0xed976600de75dfa6
        pool_context = 0x0

While the 'zpool events' command is handy for interactive debugging
it is not expected to be the primary consumer of zevents.  This ABI
was primarily added to facilitate the addition of a user space
monitoring daemon.  This daemon would consume all events posted by
the kernel and based on the type of event perform an action.  For
most events simply forwarding them on to syslog is likely enough.
But this interface also cleanly allows for more sophisticated
actions to be taken such as generating an email for a failed drive.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 13:41:36 -07:00
Brian Behlendorf
c9c0d073da Add build system
Add autoconf style build infrastructure to the ZFS tree.  This
includes autogen.sh, configure.ac, m4 macros, some scripts/*,
and makefiles for all the core ZFS components.
2010-08-31 13:41:27 -07:00
Brian Behlendorf
6656bf5621 Fix stack traverse_visitbp()
Due to  limited stack space recursive functions are frowned upon in
the Linux kernel.  However, they often are the most elegant solution
to a problem.  The following code preserves the recursive function
traverse_visitbp() but moves the local variables AND function
arguments to the heap to minimize the stack frame size.  Enough
space is initially allocated on the stack for 20 levels of recursion.
This change does ugly-up-the-code but it reduces the worst case
usage from roughly 4160 bytes to 960 bytes on x86_64 archs.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 08:38:50 -07:00
Ned Bass
da6b4005c9 Fix stack zio_execute()
Implement zio_execute() as a wrapper around the static function
__zio_execute() so that we can force  __zio_execute() to be inlined.
This reduces stack overhead which is important because __zio_execute()
is called recursively in several zio code paths.  zio_execute() itself
cannot be inlined because it is externally visible.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 08:38:50 -07:00
Brian Behlendorf
c776b317e4 Fix stack zio_done()
Eliminated local variables pointing to members of the zio struct.
Just refer to the struct members directly.  This saved about 32 bytes per
call, but this function can be called recurisvely up to 19 levels deep,
so we potentially save up to 608 bytes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 08:38:50 -07:00
Brian Behlendorf
5fed499def Fix stack vdev_cache_read()
Moving the vdev_cache_entry_t struct ve_search from the stack to
the heap saves ~100 bytes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 08:38:49 -07:00
Brian Behlendorf
47050a88ac Fix stack traverse_impl()
Stack use reduced from 560 bytes to 128 bytes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 08:38:49 -07:00
Brian Behlendorf
60948de1ef Fix stack noinline
Certain function must never be automatically inlined by gcc because
they are stack heavy or called recursively.  This patch flags all
such functions I've found as 'noinline' to prevent gcc from making
the optimization.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 08:38:49 -07:00
Brian Behlendorf
18a89ba43d Fix stack lzjb
Reduce kernel stack usage by lzjb_compress() by moving uint16 array
off the stack and on to the heap.  The exact performance implications
of this I have not measured but we absolutely need to keep stack
usage to a minimum.  If/when this becomes and issue we optimize.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 08:38:49 -07:00
Brian Behlendorf
bf701a83c5 Fix stack inline
Decrease stack usage for various call paths by forcing certain
functions to be inlined.  By inlining the functions the overhead
of a new stack frame is removed at the cost of increased code size.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 08:38:48 -07:00
Brian Behlendorf
161ce7ce3c Fix stack dsl_scan_visitbp()
To reduce stack overhead this topic branch moves the 128 byte
blkptr_t data strucutre in dsl_scan_visitbp() to the heap.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 08:38:48 -07:00
Brian Behlendorf
fcf37ec6c2 Fix stack dsl_dir_open_spa()
Reduce stack usage by 256 bytes by moving buf char array from
the stack to the heap.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 08:38:48 -07:00
Brian Behlendorf
48c67dc8f8 Fix stack dsl_deleg_get()
Reduce stack usage in dsl_deleg_get, gcc flagged it as consuming a
whopping 1040 bytes or potentially 1/4 of a 4K stack.  This patch
moves all the large structures and buffer off the stack and on to
the heap.  This includes 2 zap_cursor_t structs each 52 bytes in
size, 2 zap_attribute_t structs each 280 bytes in size, and 1
256 byte char array.  The total saves on the stack is 880 bytes
after you account for the 5 new pointers added.

Also the source buffer length has been increased from MAXNAMELEN
to MAXNAMELEN+strlen(MOS_DIR_NAME)+1 as described by the comment in
dsl_dir_name().  A buffer overrun may have been possible with the
slightly smaller buffer.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 08:38:48 -07:00
Brian Behlendorf
81a4966389 Fix stack dsl_dataset_destroy()
Move dsl_dataset_t local variable from the stack to the heap.
This reduces the stack usage of this function from 2048 bytes
to 176 bytes for x84_64 arches.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 08:38:48 -07:00
Brian Behlendorf
a8ac8e715e Fix stack dmu_objset_snapshot()
Reduce stack usage by 276 bytes by moving the snaparg struct from the
stack to the heap.  We have limited stack space we must not waste.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 08:38:47 -07:00
Brian Behlendorf
fc5bb51f08 Fix stack dbuf_hold_impl()
This commit preserves the recursive function dbuf_hold_impl() but moves
the local variables and function arguments to the heap to minimize
the stack frame size.  Enough space is initially allocated on the
stack for 20 levels of recursion.  This technique was based on commit
34229a2f2ac07363f64ddd63e014964fff2f0671 which reduced stack usage of
traverse_visitbp().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 08:38:47 -07:00
Brian Behlendorf
5ac1241a95 Fix dnode_move() scope
The dnode_move() functionality is only used in the kernel build.
As such we should be careful to wrap all of the related code
with '#ifdef _KERNEL' to avoid gcc warnings about unused code.
2010-08-31 08:38:47 -07:00
Brian Behlendorf
8a8f5c6b3c Fix zfs_ioc_objset_stats
Interestingly this looks like an upstream bug as well.  If for some
reason we are unable to get a zvols statistics, because perhaps the
zpool is hopelessly corrupt, we would trigger the VERIFY.  This
commit adds the proper error handling just to propagate the error
back to user space.  Now the user space tools still must handle this
properly but in the worst case the tool will crash or perhaps have
some missing output.  That's far far better than crashing the host.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 08:38:47 -07:00
Brian Behlendorf
5cc556b447 Fix zio_taskq_dispatch to use TQ_NOSLEEP
The zio_taskq_dispatch() function may be called at interrupt time
and it is critical that we never sleep.

Additionally, wrap taskq_dispatch() in a while loop because it may
fail.  This is non optimal but is OK for now.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 08:38:46 -07:00
Brian Behlendorf
ef5319df8e Fix rw_init() usage
Properly initialize rwlock primitives.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 08:38:46 -07:00
Brian Behlendorf
eaa8687be3 Fix zmod.h usage in userspace
Do not use zmod.h in userspace.

This has also been filed with the ZFS team. It makes the userspace
libzpool code use the zlib API, instead of the Solaris-only and
non-standard zmod.h.  The zlib API is almost identical and is a de
facto standard, so this is a no-brainer.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 08:38:46 -07:00
Brian Behlendorf
3f50448292 Fix missing newlines
Add missing \n's to dprintf()s

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 08:38:46 -07:00
Brian Behlendorf
22c81dd8a9 Fix metaslab
If your only going to allow one allocator to be used and it is defined
at compile time there is no point including the others in the build.
This patch could/should be refined for Linux to make the metaslab
configurable at run time.  That might be a bit tricky however since
you would need to quiese all IO.  Short of that making it configurable
as a module load option would be a reasonable compromise.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 08:38:45 -07:00
Brian Behlendorf
98f72a539c Fix list handling to only use the API
Remove all instances of list handling where the API is not used
and instead list data members are directly accessed.  Doing this
sort of thing is bad for portability.

Additionally, ensure that list_link_init() is called on newly
created list nodes.  This ensures the node is properly initialized
and does not rely on the assumption that zero'ing the list_node_t
via kmem_zalloc() is the same as proper initialization.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 08:38:45 -07:00
Brian Behlendorf
59e6e7ca85 Fix kstat xuio
Move xiou stat structures from a header to the dmu.c source as is
done with all the other kstat interfaces.  This information is local
to dmu.c registered the xuio kstat and should stay that way.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 08:38:45 -07:00
Brian Behlendorf
754c6663a3 Fix dbuf eviction assertion
Replace non-fatal assertion with warning.  This was being observed
during testing and it should not be fatal.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 08:38:45 -07:00
Brian Behlendorf
753972fccf Fix dbuf_dirty_record_t leaks
Fix two leaks with dbuf_dirty_record_t

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 08:38:44 -07:00
Brian Behlendorf
5631c03889 Fix variables named current
In the linux kernel 'current' is defined to mean the current process
and can never be used as a local variable in a function.  Simply
replace all usage of 'current' with 'curr' in this function.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 08:38:44 -07:00
Ricardo M. Correia
090ff0929e Fix commit callbacks
The upstream commit cb code had a few bugs:

1) The arguments of the list_move_tail() call in txg_dispatch_callbacks()
were reversed by mistake. This caused the commit callbacks to not be
called at all.

2) ztest had a bug in ztest_dmu_commit_callbacks() where "error" was not
initialized correctly. This seems to have caused the test to always take
the simulated error code path, which made ztest unable to detect whether
commit cbs were being called for transactions that successfuly complete.

3) ztest had another bug in ztest_dmu_commit_callbacks() where the commit
cb threshold was not being compared correctly.

4) The commit cb taskq was using 'max_ncpus * 2' as the maxalloc argument
of taskq_create(), which could have caused unnecessary delays in the txg
sync thread.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 08:38:44 -07:00
Brian Behlendorf
d4ed667343 Fix gcc uninitialized variable warnings
Gcc -Wall warn: 'uninitialized variable'

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 08:38:43 -07:00
Brian Behlendorf
1fde1e3720 Fix gcc unused variable warnings
Gcc -Wall warn: 'unused variable'

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 08:38:43 -07:00
Brian Behlendorf
c65aa5b2b9 Fix gcc missing parenthesis warnings
Gcc -Wall warn: 'missing parenthesis'

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 08:38:35 -07:00
Brian Behlendorf
e75c13c353 Fix gcc missing case warnings
Gcc ASSERT() missing cases are impossible

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-27 15:34:03 -07:00
Brian Behlendorf
2598c0012d Fix gcc missing braces warnings
Resolve compiler warnings concerning missing braces.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-27 15:34:03 -07:00
Brian Behlendorf
0bc8fd7884 Fix gcc invalid prototype warnings
Gcc -Wall warn: 'invalid prototype'

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-27 15:34:03 -07:00
Ricardo M. Correia
e5dc681a50 Fix gcc ident pragma warnings
Remove all ident pragmas which are unknown to gcc.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-27 15:34:02 -07:00
Brian Behlendorf
f709a82dc1 Fix gcc useless debug warnings
Gcc useless debugging.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-27 15:34:01 -07:00
Brian Behlendorf
b8864a233c Fix gcc cast warnings
Gcc -Wall warn: 'lacks a cast'
Gcc -Wall warn: 'comparison between pointer and integer'

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-27 15:33:32 -07:00
Brian Behlendorf
d6320ddb78 Fix gcc c90 compliance warnings
Fix non-c90 compliant code, for the most part these changes
simply deal with where a particular variable is declared.
Under c90 it must alway be done at the very start of a block.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-27 15:28:32 -07:00
Ricardo M. Correia
c5b3a7bbcc Fix gcc 64-bit constant warnings
Add 'ull' suffix to 64-bit constants.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-26 15:18:01 -07:00
Brian Behlendorf
572e285762 Update to onnv_147
This is the last official OpenSolaris tag before the public
development tree was closed.
2010-08-26 14:24:34 -07:00
Brian Behlendorf
428870ff73 Update core ZFS code from build 121 to build 141. 2010-05-28 13:45:14 -07:00
Brian Behlendorf
fa42225a3d Add Solaris FMA style support 2010-04-29 10:37:15 -07:00
Brian Behlendorf
0aa61e8427 Remove zvol.c when updating in update-zfs.sh Linux version available. 2009-11-15 16:20:01 -08:00
Brian Behlendorf
45d1cae3b8 Rebase master to b121 2009-08-18 11:43:27 -07:00
Brian Behlendorf
9babb37438 Rebase master to b117 2009-07-02 15:44:48 -07:00
Brian Behlendorf
d164b20935 Rebase master to b108 2009-02-18 12:51:31 -08:00
Brian Behlendorf
fb5f0bc833 Rebase master to b105 2009-01-15 13:59:39 -08:00
Brian Behlendorf
172bb4bd5e Move the world out of /zfs/ and seperate out module build tree 2008-12-11 11:08:09 -08:00