Commit Graph

3104 Commits

Author SHA1 Message Date
Brian Behlendorf ba367276d8 Switch KM_SLEEP to KM_PUSHPAGE
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim.  See commit b8d06fca08
for additional details.

  SPL: Fixing allocation for task txg_sync (6093) which
  used GFP flags 0x297bda7c with PF_NOFS set

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
2012-09-17 11:22:23 -07:00
Cyril Plisko 49d39798f2 ZFS replay transaction error 5
When zfs_replay_write() replays TX_WRITE records from ZIL
it calls zpl_write_common() to perform the actual write.
zpl_write_common() returns the number of bytes written
(similar to write() system call) or an (negative) error.
However, the code expects the positive return value to be
a residual counter. Thus when zpl_write_common() successfully
completes it is mistakenly considered to be a partial write and
the error code delivered further. At this point the ZIL processing
is aborted with famous "ZFS replay transaction error 5" error
message given to the message buffer.

The fix is to compare the zpl_write_commmon() return value with
the buffer size and flag error only when they disagree.

Signed-off-by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #933
2012-09-17 11:06:58 -07:00
Brian Behlendorf 8312c6df55 Clear PG_writeback for sync I/O error case
Commit 2b2861362f accidentally
introduced this issue by only conditionally registering the
commit callback in the async case.

The error handing code for the dmu_tx_assign() failure case
relied on there always being a registered commit callback to
clear the PG_writeback bit.  Since that is no longer strictly
true for the synchronous case we must explicitly invoke the
callback.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #961
2012-09-14 15:53:47 -07:00
Brian Behlendorf 5915791096 Move iput() after zfs_inode_update()
When replaying an unlink/remove operation via zfs_rmdir() the object
being removed will be instantiated by a call to zfs_dirent_lock().
This means that there is a single reference protecting the object.
Right before the call to zfs_inode_update() this reference is dropped
which may cause the object to be destroyed.  This will result in a
NULL dereference as shown by the stack trace is issue #782.

This likely isn't an issue during normal operation because there is
always an additional reference held on the object by the VFS.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #782
2012-09-12 14:22:52 -07:00
Brian Behlendorf 3050c9314f Switch KM_SLEEP to KM_PUSHPAGE
Under certain circumstances the following functions may be called
in a context where KM_SLEEP is unsafe and can result in a deadlocked
system.  To avoid this problem the unconditional KM_SLEEPs are
converted to KM_PUSHPAGEs.  This will prevent them from attempting
to initiate any I/O during direct reclaim.

This change was originally part of cd5ca4b but was reverted by
330fe01.  It always should have had its own commit for exactly
this reason.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-09-12 12:27:09 -07:00
Brian Behlendorf 9b51f21841 Remove TQ_SLEEP -> KM_SLEEP mapping
When the taskq code was originally written it seemed like a good
idea to simply map TQ_SLEEP to KM_SLEEP.  Unfortunately, this
assumed that the TQ_* flags would never confict with any of the
Linux GFP_* flags.  When adding the TQ_PUSHPAGE support in commit
cd5ca4b this invariant was accidentally broken.

Therefore to support TQ_PUSHPAGE, which is needed for Linux, and
prevent any further confusion I have removed this direct mapping.
The TQ_SLEEP, TQ_NOSLEEP, and TQ_PUSHPAGE are no longer defined
in terms of their KM_* counterparts.  Instead a simple mapping
function is introduce to convert TQ_* -> KM_* where needed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #171
2012-09-12 11:41:42 -07:00
Brian Behlendorf 330fe010e4 Revert "Switch KM_SLEEP to KM_PUSHPAGE"
This reverts commit cd5ca4b2f8
due to conflicts in the higher TQ_ bits which caused incorrect
behavior.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-09-12 10:07:48 -07:00
Brian Behlendorf 4ca9a43644 Remove zvol device node
The 'zfs destroy' changes in 330d06f disrupted how zvol devices
get removed on ZoL.  However, it basically boils down to the
fact that we are no longer reliably calling zvol_remove_minor()
via zfs_ioc_destroy_snaps().

Therefore we add the missing call and handle things similarly
to the existing zfs_unmount_snap() case.  Ideally we would check
if this is of type DMU_OST_ZFS or DMU_OST_ZVOL and just do the
right thing as in zfs_ioc_destroy().  However, it looks like
it would be fairly expensive to get the type, and it's harmless
to simply attempt the umount and minor removal.

This is also an issue in the latest FreeBSD and Illumos code.
It was being tracked under the following issue, and we may want
to refresh our code when they settle on what they want to do
about it upstream.

  https://www.illumos.org/issues/3170

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #903
2012-09-10 10:25:08 -07:00
Brian Behlendorf 3c60f5054c Debug cv_destroy() with mutex held
There still appears to be a race in the condition variables where
->cv_mutex is set after we are woken from the cv_destroy wait queue.
This might be possible when cv_destroy() is called immediately after
cv_broadcast().  We had some troubles with this previously but
there may still be a small race, see commit d599e4f.

The following patch closes one small race and improves the ASSERTs
such that they log the offending value.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
zfsonlinux/zfs#943
2012-09-10 10:23:26 -07:00
Brian Behlendorf 95331f4437 Set KMC_NOEMERGENCY for zlib workspaces
The workspace required by zlib to perform compression is roughly
512MB (order-7).  These allocations are so large that we should
never attempt to directly kmalloc an emergency object for them.

It is far preferable to asynchronously vmalloc an additional slab
in case it's needed.  Then simply block waiting for an existing
object to be released or for the new slab to be allocated.

This can be accomplished by disabling emergency slab objects by
passing the KMC_NOEMERGENCY flag at slab creation time.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
zfsonlinux/zfs#917
2012-09-07 14:36:26 -07:00
Brian Behlendorf cb5c2acebb Add KMC_NOEMERGENCY slab flag
Provide a flag to disable the use of emergency objects for a
specific kmem cache.  There may be instances where under no
circumstances should you kmalloc() an emergency object.  For
example, when you cache contains very large objects (>128k).

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-09-07 14:27:03 -07:00
Cyril Plisko 04f9432d3b Make ZFS filesystem id persistent across different machines
Use ZFS dataset fsid guid as a unique file system id, similar to what is
done on Illumos/OpenSolaris.

Signed-off-by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #888
2012-09-06 12:47:11 -07:00
Brian Behlendorf ebcfc8a534 Disable page allocation warnings for ARC buffers
Buffers for the ARC are normally backed by the SPL virtual slab.
However, if memory is low, AND no slab objects are available,
AND a new slab cannot be quickly constructed a new emergency
object will be directly allocated.

These objects can be as large as order 5 on a system with 4k
pages.  And because they are allocated with KM_PUSHPAGE, to
avoid a potential deadlock, they are not allowed to initiate I/O
to satisfy the allocation.  This can result in the occasional
allocation failure.

However, since these allocations are allowed to block and
perform operations such as memory compaction they will eventually
succeed.  Since this is not unexpected (just unlikely) behavior
this patch disables the warning for the allocation failure.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #465
2012-09-06 11:53:08 -07:00
Brian Behlendorf cafa9709f3 Switch KM_SLEEP to KM_PUSHPAGE
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim.  See commit b8d06fca08
for additional details.

  SPL: Fixing allocation for task txg_sync (6093) which
  used GFP flags 0x297bda7c with PF_NOFS set

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
2012-09-05 08:44:58 -07:00
Brian Behlendorf 0ef0ff546e Switch KM_SLEEP to KM_PUSHPAGE
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim.  See commit b8d06fca08
for additional details.

  SPL: Fixing allocation for task txg_sync (6093) which
  used GFP flags 0x297bda7c with PF_NOFS set

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
2012-09-04 16:00:06 -07:00
Brian Behlendorf 594b4dd82a Switch KM_SLEEP to KM_PUSHPAGE
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim.  See commit b8d06fca08
for additional details.

  SPL: Fixing allocation for task txg_sync (6093) which
  used GFP flags 0x297bda7c with PF_NOFS set

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
2012-09-04 08:41:12 -07:00
Chris Dunlop 20a083cbe2 Switch KM_SLEEP to KM_PUSHPAGE
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim.  See commit b8d06fca08
for additional details.

  SPL: Fixing allocation for task txg_sync (6093) which
  used GFP flags 0x297bda7c with PF_NOFS set

Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
2012-09-02 10:15:49 -07:00
Brian Behlendorf b404a3f07f Switch KM_SLEEP to KM_PUSHPAGE
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim.  See commit b8d06fca08
for additional details.

  SPL: Fixing allocation for task txg_sync (6093) which
  used GFP flags 0x297bda7c with PF_NOFS set

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
2012-08-31 17:39:29 -07:00
Brian Behlendorf 46b3945d5d Suppress task_hash_table_init() large allocation warning
When various kernel debuging options are enabled this allocation
may be larger than usual as shown by the following warning.  It
is in no way harmful so we suppress the warning.

  SPL: large kmem_alloc(40960, 0x80d0) at
  tsd_hash_table_init:358 (76495/76495)

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #93
2012-08-30 21:02:52 -07:00
Brian Behlendorf 2b2861362f Clear PG_writeback after zil_commit() for sync I/O
When writing via ->writepage() the writeback bit was always cleared
as part of the txg commit callback.  However, when the I/O is also
being written synchronsously to the zil we can immediately clear this
bit.  There is no need to wait for the subsequent TXG sync since the
data is already safe on stable storage.

This has been observed to reduce the msync(2) delay from up to 5
seconds down 10s of miliseconds.  One workload which is expected
to benefit from this are the intermittent samba hands described
in issue #700.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #700
Closes #907
2012-08-30 20:16:28 -07:00
Brian Behlendorf efcd0ca32d Enhance SPLAT kmem:slab_overcommit test
After the emergency slab objects were merged I started observing
timeout failures in the kmem:slab_overcommit test.  These were
due to the ineffecient way the slab_overcommit reclaim function
was implemented.  And due to the additional cost of potentially
allocating ten of thousands of emergency objects and tracking
them on a single list.

This patch addresses the first concern by enhansing the test
case to trace all of the allocations objects as a linked list.
This allows for a cleaner version of the reclaim function to
simply release SPLAT_KMEM_OBJ_RECLAIM objects.

Since this touches some common code all the tests which share
these data structions were also updated.  After making these
changes slab_overcommit is reliably passing.  However, there
is certainly additional cleanup which could be done here.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-30 15:49:00 -07:00
Richard Yao b8d06fca08 Switch KM_SLEEP to KM_PUSHPAGE
Differences between how paging is done on Solaris and Linux can cause
deadlocks if KM_SLEEP is used in any the following contexts.

  * The txg_sync thread
  * The zvol write/discard threads
  * The zpl_putpage() VFS callback

This is because KM_SLEEP will allow for direct reclaim which may result
in the VM calling back in to the filesystem or block layer to write out
pages.  If a lock is held over this operation the potential exists to
deadlock the system.  To ensure forward progress all memory allocations
in these contexts must us KM_PUSHPAGE which disables performing any I/O
to accomplish the memory allocation.

Previously, this behavior was acheived by setting PF_MEMALLOC on the
thread.  However, that resulted in unexpected side effects such as the
exhaustion of pages in ZONE_DMA.  This approach touchs more of the zfs
code, but it is more consistent with the right way to handle these cases
under Linux.

This is patch lays the ground work for being able to safely revert the
following commits which used PF_MEMALLOC:

  21ade34 Disable direct reclaim for z_wr_* threads
  cfc9a5c Fix zpl_writepage() deadlock
  eec8164 Fix ASSERTION(!dsl_pool_sync_context(tx->tx_pool))

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #726
2012-08-27 12:01:37 -07:00
Brian Behlendorf 991fc1d7ae mzap_upgrade() must use kmem_alloc()
These allocations in mzap_update() used to be kmem_alloc() but
were changed to vmem_alloc() due to the size of the allocation.
However, since it turns out this function may be called in the
context of the txg_sync thread they must be changed back to use
a kmem_alloc() to ensure the KM_PUSHPAGE flag is honored.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:01:37 -07:00
Brian Behlendorf 8630650a8d Annotate KM_PUSHPAGE call paths with PF_NOFS
The txg_sync(), zfs_putpage(), zvol_write(), and zvol_discard()
call paths must only use KM_PUSHPAGE to avoid potential deadlocks
during direct reclaim.

This patch annotates these call paths so any accidental use of
KM_SLEEP will be quickly detected.   In the interest of stability
if debugging is disabled the offending allocation will have its
GFP flags automatically corrected.  When debugging is enabled
any misuse will be treated as a fatal error.

This patch is entirely for debugging.  We should be careful to
NOT become dependant on it fixing up the incorrect allocations.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:01:37 -07:00
Brian Behlendorf 86dd0fd922 Pre-allocate vdev I/O buffers
The vdev queue layer may require a small number of buffers
when attempting to create aggregate I/O requests.  Rather than
attempting to allocate them from the global zio buffers, which
is slow under memory pressure, it makes sense to pre-allocate
them because...

1) These buffers are short lived.  They are only required for
the life of a single I/O at which point they can be used by
the next I/O.

2) The maximum number of concurrent buffers needed by a vdev is
small.  It's roughly limited by the zfs_vdev_max_pending tunable
which defaults to 10.

By keeping a small list of these buffer per-vdev we can ensure
one is always available when we need it.  This significantly
reduces contention on the vq->vq_lock, because we no longer
need to perform a slow allocation under this lock.  This is
particularly important when memory is already low on the system.

It would probably be wise to extend the use of these buffers beyond
aggregate I/O and in to the raidz implementation.  The inability
to quickly allocate buffer for the parity stripes could result in
similiar problems.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:01:37 -07:00
Richard Yao 44f21da41c Revert Disable direct reclaim for z_wr_* threads
This commit used PF_MEMALLOC to prevent a memory reclaim deadlock.
However, commit 49be0ccf1f eliminated
the invocation of __cv_init(), which was the cause of the deadlock.
PF_MEMALLOC has the side effect of permitting pages from ZONE_DMA
to be allocated.  The use of PF_MEMALLOC was found to cause stability
problems when doing swap on zvols. Since this technique is known to
cause problems and no longer fixes anything, we revert it.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #726
2012-08-27 12:01:37 -07:00
Richard Yao 62c4165a1b Revert Fix zpl_writepage() deadlock
The commit, cfc9a5c88f, to fix deadlocks
in zpl_writepage() relied on PF_MEMALLOC.   That had the effect of
disabling the direct reclaim path on all allocations originating from
calls to this function, but it failed to address the actual cause of
those deadlocks.  This led to the same deadlocks being observed with
swap on zvols, but not with swap on the loop device, which exercises
this code.

The use of PF_MEMALLOC also had the side effect of permitting
allocations to be made from ZONE_DMA in instances that did not require
it.  This contributes to the possibility of panics caused by depletion
of pages from ZONE_DMA.

As such, we revert this patch in favor of a proper fix for both issues.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #726
2012-08-27 12:01:37 -07:00
Richard Yao b876dac776 Revert Fix ASSERTION(!dsl_pool_sync_context(tx->tx_pool))
Commit eec8164771 worked around an issue
involving direct reclaim through the use of PF_MEMALLOC.   Since we
are reworking thing to use KM_PUSHPAGE so that swap works, we revert
this patch in favor of the use of KM_PUSHPAGE in the affected areas.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #726
2012-08-27 12:01:37 -07:00
Brian Behlendorf cd5ca4b2f8 Switch KM_SLEEP to KM_PUSHPAGE
Under certain circumstances the following functions may be called
in a context where KM_SLEEP is unsafe and can result in a deadlocked
system.  To avoid this problem the unconditional KM_SLEEPs are
converted to KM_PUSHPAGEs.  This will prevent them from attempting
to initiate any I/O during direct reclaim.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:00:55 -07:00
Brian Behlendorf 500e95c884 Revert "Disable vmalloc() direct reclaim"
This reverts commit 2092cf68d8.  The
use of the PF_MEMALLOC flag was always a hack to work around memory
reclaim deadlocks.  Those issues are believed to be resolved so this
workaround can be safely reverted.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:00:55 -07:00
Brian Behlendorf 617f79de6a Revert "Fix NULL deref in balance_pgdat()"
This reverts commit b8b6e4c453.  The
use of the PF_MEMALLOC flag was always a hack to work around memory
reclaim deadlocks.  Those issues are believed to be resolved so this
workaround can be safely reverted.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:00:55 -07:00
Brian Behlendorf bc03e07a7c Revert "Detect kernels that honor gfp flags passed to vmalloc()"
This reverts commit 36811b4430.
Which is no longer required because there is now SPL code in
place to safely handle the deadlocks the kernel patch was designed
to address.  Therefore we can unconditionally use vmalloc() and
drop all the PF_MEMALLOC code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:00:55 -07:00
Brian Behlendorf d47e664ad4 Revert "Add TASKQ_NORECLAIM flag"
This reverts commit 372c257233.  The
use of the PF_MEMALLOC flag was always a hack to work around memory
reclaim deadlocks.  Those issues are believed to be resolved so this
workaround can be safely reverted.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:00:42 -07:00
Brian Behlendorf e2dcc6e2b8 Emergency slab objects
This patch is designed to resolve a deadlock which can occur with
__vmalloc() based slabs.  The issue is that the Linux kernel does
not honor the flags passed to __vmalloc().  This makes it unsafe
to use in a writeback context.  Unfortunately, this is a use case
ZFS depends on for correct operation.

Fixing this issue in the upstream kernel was pursued and patches
are available which resolve the issue.

  https://bugs.gentoo.org/show_bug.cgi?id=416685

However, these changes were rejected because upstream felt that
using __vmalloc() in the context of writeback should never be done.
Their solution was for us to rewrite parts of ZFS to accomidate
the Linux VM.

While that is probably the right long term solution, and it is
something we want to pursue, it is not a trivial task and will
likely destabilize the existing code.  This work has been planned
for the 0.7.0 release but in the meanwhile we want to improve the
SPL slab implementation to accomidate this expected ZFS usage.

This is accomplished by performing the __vmalloc() asynchronously
in the context of a work queue.  This doesn't prevent the posibility
of the worker thread from deadlocking.  However, the caller can now
safely block on a wait queue for the slab allocation to complete.

Normally this will occur in a reasonable amount of time and the
caller will be woken up when the new slab is available,.  The objects
will then get cached in the per-cpu magazines and everything will
proceed as usual.

However, if the __vmalloc() deadlocks for the reasons described
above, or is just very slow, then the callers on the wait queues
will timeout out.  When this rare situation occurs they will attempt
to kmalloc() a single minimally sized object using the GFP_NOIO flags.
This allocation will not deadlock because kmalloc() will honor the
passed flags and the caller will be able to make forward progress.

As long as forward progress can be maintained then even if the
worker thread is deadlocked the critical thread will make progress.
This will eventually allow the deadlocked worker thread to complete
and normal operation will resume.

These emergency allocations will likely be slow since they require
contiguous pages.  However, their use should be rare so the impact
is expected to be minimal.  If that turns out not to be the case in
practice further optimizations are possible.

One additional concern is if these emergency objects are long lived.
Right now they are simply tracked on a list which must be walked when
an object is freed.  Is they accumulate on a system and the list
grows freeing objects will become more expensive.  This could be
handled relatively easily by using a hash instead of a list, but that
optimization (if needed) is left for a follow up patch.

Additionally, these emeregency objects could be repacked in to existing
slabs as objects are freed if the kmem_cache_set_move() functionality
was implemented.  See issue https://github.com/zfsonlinux/spl/issues/26
for full details.  This work would also help reduce ZFS's memory
fragmentation problems.

The /proc/spl/kmem/slab file has had two new columns added at the
end.  The 'emerg' column reports the current number of these emergency
objects in use for the cache, and the following 'max' column shows
the historical worst case.  These value should give us a good idea
of how often these objects are needed.  Based on these values under
real use cases we can tune the default behavior.

Lastly, as a side benefit using a single work queue for the slab
allocations should reduce cpu contention on the global virtual address
space lock.   This should manifest itself as reduced cpu usage for
the system.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:00:42 -07:00
Brian Behlendorf cd38ac58a3 rmdir(2) should return ENOTEMPTY
Under Solaris the behavior for rmdir(2) is to return EEXIST when
a directory still contains entries.  However, on Linux ENOTEMPTY
is the expected return value with EEXIST being technically allowed.
According to rmdir(2):

ENOTEMPTY
   pathname contains entries other than . and .. ; or, pathname has
   ..  as its final component.  POSIX.1-2001 also allows EEXIST for
   this condition.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #895
2012-08-26 13:55:45 -07:00
Christopher Siden 9e11c7eee2 Illumos #3085: zfs diff panics, then panics in a loop on booting
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/3085

Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-25 12:32:25 -07:00
Simon Klinkert c578f007ff Illumos #2901: zfs receive fails for exabyte sparse files
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  https://www.illumos.org/issues/2901

Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-25 12:28:29 -07:00
Javen Wu a47587389e Drop spill buffer reference
When calling sa_update() and friends it is possible that a spill
buffer will be needed to accomidate the update.  When this happens
a hold is taken on the new dbuf and that hold must be released
before calling dmu_tx_commit().  Failing to release the hold will
cause a copy of the dbuf to be made in dbuf_sync_leaf().  This is
done to ensure further updates to the dbuf never sneak in to the
syncing txg.

This could be left to the sa_update() caller.  But then the caller
would need to be aware of this internal SA implementation detail.
It is therefore preferable to handle this all internally in the
SA implementation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #503
Closes #513
2012-08-25 09:26:10 -07:00
Brian Behlendorf f828e63a0d Revert "Use SA_HDL_PRIVATE for SA xattrs"
This reverts commit ec2626ad3f which
caused consistency problems between the shared and private handles.
Reverting this change should resolve issues #709 and #727.  It
will also reintroduce an arc_anon memory leak which is addressed
by the next commit.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #709
Closes #727
2012-08-25 09:25:56 -07:00
Prakash Surya 15a9e03368 Wrap smp_processor_id in kpreempt_[dis|en]able
After surveying the code, the few places where smp_processor_id is used
were deemed to be safe to use with a preempt enabled kernel. As such, no
core logic had to be changed. These smp_processor_id call sites are simply
are wrapped in kpreempt_disable and kpreempt_enabled to prevent the
Linux kernel from emitting scary warnings.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Issue #83
2012-08-24 13:19:06 -07:00
Prakash Surya 08850eddcb Avoid calling smp_processor_id in spl_magazine_age
The spl_magazine_age function had the implied assumption that it will
remain on its current cpu through its execution. In order to support
preempt enabled kernels, this assumption had to be removed.

The spl_kmem_magazine structure now holds the cpu id of the cpu it is
local to. This allows spl_magazine_age to use this field when scheduling
work to be done by the magazine's local cpu.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #98
2012-08-24 09:43:22 -07:00
Richard Yao 15d0411297 Remove Makefile from non-toplevel .gitignore files
When building SPL support into the kernel, ./copy-builtin will copy
non-toplevel .gitignore files. These files list /Makefile, which causes
git-archive to omit ./module/{spl,splat}/Makefile. The absence of these
files result in build failures when SPL is selected. ZFS is unaffected
because it puts Makefile in the toplevel .gitignore, which is not
copied. We fix SPL by emulating that behavior.

Reported-by: Fabio Erculiani <lxnay@gentoo.org>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #152
2012-08-23 12:49:04 -07:00
Prakash Surya 9baf44bc17 Wrap trace_set_debug_header in trace_[get|put]_tcd
To properly support CONFIG_PREEMPT enabled kernels, we must refrain from
using a CPU index when preemption is enabled. As a result, this change
moves the trace_set_debug_header call (which calls smp_processor_id)
within trace_get_tcd and trace_put_tcd (which disable and enable
preemption respectively).

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #160
2012-08-23 10:01:20 -07:00
Brian Behlendorf 4047414a6a Export dmu_buf_rele() symbol
While I'd like to remove the various pragmas in module/zfs/dbuf.c.
There are consumers such as Lustre which still depend on dmu_buf_*
versions of the symbols.  Until all consumers can be converted to
use only the dbuf_* names leave this symbol exported.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-14 08:38:19 -07:00
Brian Behlendorf bafc4e9e2a Suppress 'zfs_sb_create' memory warning
When mutex debugging is enabled in your kernel the increased
size of the mutex structures can push the zfs_sb_t type beyond
the 8k warning threshold.  This isn't harmful so we suppress
the warning for this case.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #628
2012-08-10 16:43:32 -07:00
Brian Behlendorf 8f576c2321 Export dbuf_* symbols
Export these symbols so they may be used by other ZFS consumers
besides the ZPL.

Remove three stale prototype definites from dbuf.h.  The actual
implementations of these functions were removed/renamed long ago.

It would be good in the long term to remove the existing pragmas
we inherited from Solaris and simply use the dbuf_* names.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-10 16:45:13 -07:00
Dan McDonald d96eb2b153 Illumos #1693: persistent 'comment' field for a zpool
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/1693

Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #678
2012-08-08 11:49:37 -07:00
Etienne Dechamps ee5fd0bb80 Set zvol discard_granularity to the volblocksize.
Currently, zvols have a discard granularity set to 0, which suggests to
the upper layer that discard requests of arbirarily small size and
alignment can be made efficiently.

In practice however, ZFS does not handle unaligned discard requests
efficiently: indeed, it is unable to free a part of a block. It will
write zeros to the specified range instead, which is both useless and
inefficient (see dnode_free_range).

With this patch, zvol block devices expose volblocksize as their discard
granularity, so the upper layer is aware that it's not supposed to send
discard requests smaller than volblocksize.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #862
2012-08-07 14:55:31 -07:00
Richard Yao 6576a1a70d Fix incorrect type in spl_kmem_cache_set_move() parameter
A preprocessor definition renders this harmless. However, it is a good
idea to change this to be consistent.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
2012-08-01 16:35:18 -07:00
Etienne Dechamps 7c0e570888 Limit the number of blocks to discard at once.
The number of blocks that can be discarded in one BLKDISCARD ioctl on a
zvol is currently unlimited. Some applications, such as mkfs, discard
the whole volume at once and they use the maximum possible discard size
to do that. As a result, several gigabytes discard requests are not
uncommon.

Unfortunately, if a large amount of data is allocated in the zvol, ZFS
can be quite slow to process discard requests. This is especially true
if the volblocksize is low (e.g. the 8K default). As a result, very
large discard requests can take a very long time (seconds to minutes
under heavy load) to complete. This can cause a number of problems, most
notably if the zvol is accessed remotely (e.g. via iSCSI), in which case
the client has a high probability of timing out on the request.

This patch solves the issue by adding a new tunable module parameter:
zvol_max_discard_blocks. This indicates the maximum possible range, in
zvol blocks, of one discard operation. It is set by default to 16384
blocks, which appears to be a good tradeoff. Using the default
volblocksize of 8K this is equivalent to 128 MB. When using the maximum
volblocksize of 128K this is equivalent to 2 GB.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #858
2012-07-31 09:46:09 -07:00