Commit Graph

5081 Commits

Author SHA1 Message Date
Brian Behlendorf c418410393 Limit zfs_vdev_aggregation_limit to SPA_MAXBLOCKSIZE
Prevent users from setting the zfs_vdev_aggregation_limit tuning
larger than SPA_MAXBLOCKSIZE.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #520
2012-10-15 09:28:43 -07:00
Yuxuan Shui 45ca2d91cb Return positive error number in zfsctl_shares_lookup.
Otherwise it will cause zpl_shares_lookup() to return a invalid
pointer when an error occurs.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Closes #626 #885 #947 #977
2012-10-15 09:11:56 -07:00
Yuxuan Shui bcb15891ab Linux 3.6 compat, kern_path_locked() added
The kern_path_parent() function was removed from Linux 3.6 because
it was observed that all the callers just want the parent dentry.
The simpler kern_path_locked() function replaces kern_path_parent()
and does the lookup while holding the ->i_mutex lock.

This is good news for the vn implementation because it removes the
need for us to handle the locking.  However, it makes it harder to
implement a single readable vn_remove()/vn_rename() function which
is usually what we prefer.

Therefore, we implement a new version of vn_remove()/vn_rename()
for Linux 3.6 and newer kernels.  This allows us to leave the
existing working implementation untouched, and to add a simpler
version for newer kernels.

Long term I would very much like to see all of the vn code removed
since what this code enabled is generally frowned upon in the kernel.
But that can't happen util we either abondon the zpool.cache file
or implement alternate infrastructure to update is correctly in
user space.

Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #154
2012-10-14 16:26:21 -07:00
Yuxuan Shui 558ef6d080 Linux 3.6 compat, iops->create()
As of Linux commit ebfc3b49a7ac25920cb5be5445f602e51d2ea559 the
struct nameidata is no longer passed to iops->create.  Instead
only the result of (inamedata->flags & LOOKUP_EXCL) is passed.

ZFS like almost all Linux fileystems never made use of this so
only the prototype needs to be wrapped for compatibility.

Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #873
2012-10-14 14:42:25 -07:00
Yuxuan Shui 8f195a908f Linux 3.6 compat, iops->lookup()
As of Linux commit 00cd8dd3bf95f2cc8435b4cac01d9995635c6d0b the
struct nameidata is no longer passed to iops->lookup.  Instead
only the inamedata->flags are passed.

ZFS like almost all Linux fileystems never made use of this so
only the prototype needs to be wrapped for compatibility.

Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #873
2012-10-14 13:06:54 -07:00
Yuxuan Shui 3c20361075 Linux 3.6 compat, sget()
As of Linux commit 9249e17fe094d853d1ef7475dd559a2cc7e23d42 the
mount flags are now passed to sget() so they can be used when
initializing a new superblock.

ZFS never uses sget() in this fashion so we can simply pass a
zero and add a zpl_sget() compatibility wrapper.

Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #873
2012-10-14 13:06:48 -07:00
Yuxuan Shui af26c4d4ab Linux 3.6 compat, sops->write_super() removed
The .write_super callback was removed the the super_operations
structure by Linux commit f0cd2dbb6cf387c11f87265462e370bb5469299e.
All file systems are now expected to self manage writing any dirty
state assoicated with their super block.

ZFS never made use of this callback so it can simply be removed
from the super_operations structure.

Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #873
2012-10-14 11:33:56 -07:00
Etienne Dechamps a5c20e2a0a Don't ashift-align vdev read requests.
Currently, the size of read and write requests on vdevs is aligned
according to the vdev's ashift, allocating a new ZIO buffer and padding
if need be.

This makes sense for write requests to prevent read/modify/write if the
write happens to be smaller than the device's internal block size.

For reads however, the rationale is less clear. It seems that the
original code aligns reads because, on Solaris, device drivers will
outright refuse unaligned requests.

We don't have that issue on Linux. Indeed, Linux block devices are able
to accept requests of any size, and take care of alignment issues
themselves.

As a result, there's no point in enforcing alignment for read requests
on Linux. This is a nice optimization opportunity for two reasons:
- We remove a memory allocation in a heavily-used code path;
- The request gets aligned in the lowest layer possible, which shrinks
  the path that the additional, useless padding data has to travel.
  For example, when using 4k-sector drives that lie about their sector
  size, using 512b read requests instead of 4k means that there will
  be less data traveling down the ATA/SCSI interface, even though the
  drive actually reads 4k from the platter.

The only exception is raidz, because raidz needs to read the whole
allocated block for parity.

This patch removes alignment enforcement for read requests, except on
raidz. Note that we also remove an assertion that checks that we're
aligning a top-level vdev I/O, because that's not the case anymore for
repair writes that results from failed reads.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1022
2012-10-12 12:01:56 -07:00
Richard Yao b68503fb30 Remove vmem_size() consumers
There are currently three vmem_size() consumers all of which are
part of the ARC implemention.  However, since the expected behavior
of the Linux and Solaris virtual memory subsystems are so different
the behavior in each of these instances needs to be reevaluated.

* arc_evict_needed() - This is actually dead code.  Arena support
was never added to the SPL and zio_arena is always NULL.  This
support isn't needed so we simply remove this dead code.

* arc_memory_throttle() - On Solaris where virtual memory constitutes
almost all of the address space we can reasonably expect there to be
a fairly large amount free.  However, on Linux by default we only
have about 100MB total and that's heavily used by the ARC.  So the
expectation on Linux is that this will usually be a small value.
Therefore we remove the vmem_size() check for i386 systems because
the expectation is that it will be less than the zfs_write_limit_max.

* arc_init() - Here vmem_size() is used to initially size the ARC.
Since the ARC is currently backed by the virtual address space it
makes sense to use this as a limit on the ARC for 32-bit systems.
This code can be removed when the ARC is backed by the page cache.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #831
2012-10-12 10:03:03 -07:00
Massimo Maggi dea3505dff Switch KM_SLEEP to KM_PUSHPAGE
In this particular instance the allocation occurred in the context
of sys_msync()->...->zpl_putpage() where we must be careful not to
initiate additional I/O.

Signed-off-by: Massimo Maggi <massimo@mmmm.it>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-10-11 16:22:29 -07:00
Brian Behlendorf 87d98efe9e Fix zfs_txg_timeout module parameter
Allow the zfs_txg_timeout variable to be dynamically tuned at run
time.  By pulling it down out of the variable declaration it will
be evaluted each time through the loop.

The zfs_txg_timeout variable is now declared extern in a the common
sys/txg.h header rather than locally in dsl_scan.c.  This prevents
potential type mismatches if the global variable needs to be used
elsewhere.

Move the module_param() code in to the same source file where
zfs_txg_timeout is declared.  This is the most logical location.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-10-11 15:07:09 -07:00
Richard Yao 7df05a4266 Fix zfs_write_limit_max integer size mismatch on 32-bit systems
Commit c409e4647f introduced a
number of module parameters.  This required several types to be
changed to accomidate the required module parameters Linux macros.

Unfortunately, arc.c contained its own extern definition of the
zfs_write_limit_max variable and its type was not updated to be
consistent with its dsl_pool.c counterpart.  If the variable had
been properly marked extern in a common header, then gcc would
have generated a warning and this would not have slipped through.

The result of this was that the ARC unconditionally expected
zfs_write_limit_max to be 64-bit. Unfortunately, the largest size
integer module parameter that Linux supports is unsigned long, which
varies in size depending on the host system's native word size. The
effect was that on 32-bit systems, ARC incorrectly performed 64-bit
operations on a 32-bit value by reading the neighboring 32 bits as
the upper 32 bits of the 64-bit value.

We correct that by changing the extern declaration to use the unsigned
long type and move these extern definitions in to the common arc.h
header. This should make ARC correctly treat zfs_write_limit_max as a
32-bit value on 32-bit systems.

Reported-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #749
2012-10-11 11:09:25 -07:00
Cyril Plisko 15fd274973 Make zfs_immediate_write_sz a module paramater
zfs_immediate_write_sz variable is a tunable, but lacks proper
module_param() instrumentation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1032
2012-10-11 11:09:21 -07:00
Cyril Plisko 5b7e5b5ab9 txg is spelled as tgx in places
Term 'transaction group' is commonly abbreviated as txg in ZFS sources.
There are some places (Linux specific MODULE_PARAM_DESC() macros)
where it is incorrectly spelled as 'tgx'.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1030
2012-10-11 09:19:08 -07:00
Massimo Maggi beb999445a Switch KM_SLEEP to KM_PUSHPAGE
Prevent snapshot_check to initiate I/O during memory allocation.

Signed-off-by: Massimo Maggi <massimo@mmmm.it>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1023
2012-10-08 10:19:05 -07:00
Brian Behlendorf 7bd04f2d7d Set default zvol elevator to noop
It doesn't make sense for a zvol to use the default system I/O
scheduler because it is a virtual device.  Therefore, we change
the default scheduler to 'noop' for zvols provided that the
elevator_change() function is available.  This interface has
been available since Linux 2.6.36 and appears in the RHEL 6.x
kernels.

We deliberately do not implement the method for older kernels
because it was racy and could result in system crashes.  It's
better to simply manually tune the scheduler for these kernels.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1017
2012-10-05 12:39:59 -07:00
Etienne Dechamps bbdc6ae495 Add interface for file hole punching.
This adds an interface to "punch holes" (deallocate space) in VFS
files. The interface is identical to the Solaris VOP_SPACE interface.
This interface is necessary for TRIM support on file vdevs.

This is implemented using Linux fallocate(FALLOC_FL_PUNCH_HOLE), which
was introduced in 2.6.38. For a brief time before 2.6.38 this was done
using the truncate_range inode operation, which was quickly deprecated.
This patch only supports FALLOC_FL_PUNCH_HOLE.

This adds support for the truncate_range() inode operation to
VOP_SPACE() for file hole punching. This API is deprecated and removed
in 3.5, so it's only useful for old kernels.

On tmpfs, the truncate_range() inode operation translates to
shmem_truncate_range(). Unfortunately, this function expects the end
offset to be inclusive and aligned to the end of a page. If it is not,
the kernel will stop with a BUG_ON().

This patch fixes the issue by adapting to the constraints set forth by
shmem_truncate_range().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #168
2012-10-04 16:22:07 -07:00
Etienne Dechamps 089fa91bc5 Align DISCARD requests on zvols.
Currently, when processing DISCARD requests, zvol_discard() calls
dmu_free_long_range() with the precise offset and size of the request.

Unfortunately, this is not optimal for requests that are not aligned to
the zvol block boundaries. Indeed, in the case of an unaligned range,
dnode_free_range() will zero out the unaligned parts. Not only is this
useless since we are not freeing any space by doing so, it is also slow
because it translates to a read-modify-write operation.

This patch fixes the issue by rounding up the discard start offset to
the next volume block boundary, and rounding down the discard end
offset to the previous volume block boundary.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1010
2012-10-04 16:01:44 -07:00
Chris Dunlop d75d6f294e Switch KM_SLEEP to KM_PUSHPAGE
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim.  See commit b8d06fc for additional details.

  SPL: Fixing allocation for task txg_sync (6093) which
  used GFP flags 0x297bda7c with PF_NOFS set

Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1002
2012-10-04 10:44:09 -07:00
Matthew Ahrens 04434775b7 Illumos #3100: zvol rename fails with EBUSY when dirty.
illumos/illumos-gate@2e2c135528
Illumos changeset: 13780:6da32a929222

3100 zvol rename fails with EBUSY when dirty

Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: Adam H. Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Eric Schrock <eric.schrock@delphix.com>

Ported-by: Etienne Dechamps <etienne.dechamps@ovh.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #995
2012-10-03 13:59:02 -07:00
George Wilson 65947351e7 Illumos #3129, #3130
illumos/illumos-gate@d6afdce20f
Illumos changeset: 13794:7c5e0e746b2c

3129 'zpool reopen' restarts resilvers
3130 ztest failure: Assertion failed:
     0 == dmu_objset_destroy(name, B_FALSE) (0x0 == 0x10)

Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  https://www.illumos.org/issues/3129
  https://www.illumos.org/issues/3130

Ported by: Etienne Dechamps <etienne.dechamps@ovh.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #994
2012-10-03 13:59:02 -07:00
Brian Behlendorf 6d1d976b2c Modify vdev_elevator_switch() to use elevator_change()
As of Linux 2.6.36 an elevator_change() interface was added.
This commit updates vdev_elevator_switch() to use this interface
when available, otherwise it falls back to the usermodehelper
method.

Original-patch-by: foobarz <sysop@xeon.(none)>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #906
2012-10-03 13:31:44 -07:00
Cyril Plisko 393b44c711 Implement .commit_metadata hook for NFS export
In order to implement synchronous NFS metadata semantics ZFS
needs to provide the .commit_metadata hook.  All it takes there
is to make sure changes are committed to ZIL.  Fortunately
zfs_fsync() does just that, so simply calling it from
zpl_commit_metadata() does the trick.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #969
2012-10-03 10:49:45 -07:00
Chris Wedgwood 23a61ccc1b zvol_probe should return NULL when the device isn't found.
Previously we returned ERR_PTR(-ENOENT) which the rest of the kernel
doesn't expect and as such we can oops.

Signed-off-by: Chris Wedgwood <cw@f00f.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #949
Closes #931
Closes #789
Closes #743
Closes #730
2012-10-03 10:39:12 -07:00
Bill Pijewski 37abac6d55 Illumos #2703: add mechanism to report ZFS send progress
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>

References:
  https://www.illumos.org/issues/2703

Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-09-19 13:39:06 -07:00
Chris Siden 1bd201e70d Illumos #1948: zpool list should show more detailed pool info
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Eric Schrock <eric.schrock@delphix.com>

References:
  https://www.illumos.org/issues/1948

Ported by:	Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #685
2012-09-19 13:39:05 -07:00
Brian Behlendorf 95fd8c9a7f Switch KM_SLEEP to KM_PUSHPAGE
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim.  See commit b8d06fca08
for additional details.

  SPL: Fixing allocation for task txg_sync (6093) which
  used GFP flags 0x297bda7c with PF_NOFS set

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #973
2012-09-19 11:52:36 -07:00
Brian Behlendorf ba367276d8 Switch KM_SLEEP to KM_PUSHPAGE
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim.  See commit b8d06fca08
for additional details.

  SPL: Fixing allocation for task txg_sync (6093) which
  used GFP flags 0x297bda7c with PF_NOFS set

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
2012-09-17 11:22:23 -07:00
Cyril Plisko 49d39798f2 ZFS replay transaction error 5
When zfs_replay_write() replays TX_WRITE records from ZIL
it calls zpl_write_common() to perform the actual write.
zpl_write_common() returns the number of bytes written
(similar to write() system call) or an (negative) error.
However, the code expects the positive return value to be
a residual counter. Thus when zpl_write_common() successfully
completes it is mistakenly considered to be a partial write and
the error code delivered further. At this point the ZIL processing
is aborted with famous "ZFS replay transaction error 5" error
message given to the message buffer.

The fix is to compare the zpl_write_commmon() return value with
the buffer size and flag error only when they disagree.

Signed-off-by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #933
2012-09-17 11:06:58 -07:00
Brian Behlendorf 8312c6df55 Clear PG_writeback for sync I/O error case
Commit 2b2861362f accidentally
introduced this issue by only conditionally registering the
commit callback in the async case.

The error handing code for the dmu_tx_assign() failure case
relied on there always being a registered commit callback to
clear the PG_writeback bit.  Since that is no longer strictly
true for the synchronous case we must explicitly invoke the
callback.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #961
2012-09-14 15:53:47 -07:00
Brian Behlendorf 5915791096 Move iput() after zfs_inode_update()
When replaying an unlink/remove operation via zfs_rmdir() the object
being removed will be instantiated by a call to zfs_dirent_lock().
This means that there is a single reference protecting the object.
Right before the call to zfs_inode_update() this reference is dropped
which may cause the object to be destroyed.  This will result in a
NULL dereference as shown by the stack trace is issue #782.

This likely isn't an issue during normal operation because there is
always an additional reference held on the object by the VFS.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #782
2012-09-12 14:22:52 -07:00
Brian Behlendorf 3050c9314f Switch KM_SLEEP to KM_PUSHPAGE
Under certain circumstances the following functions may be called
in a context where KM_SLEEP is unsafe and can result in a deadlocked
system.  To avoid this problem the unconditional KM_SLEEPs are
converted to KM_PUSHPAGEs.  This will prevent them from attempting
to initiate any I/O during direct reclaim.

This change was originally part of cd5ca4b but was reverted by
330fe01.  It always should have had its own commit for exactly
this reason.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-09-12 12:27:09 -07:00
Brian Behlendorf 9b51f21841 Remove TQ_SLEEP -> KM_SLEEP mapping
When the taskq code was originally written it seemed like a good
idea to simply map TQ_SLEEP to KM_SLEEP.  Unfortunately, this
assumed that the TQ_* flags would never confict with any of the
Linux GFP_* flags.  When adding the TQ_PUSHPAGE support in commit
cd5ca4b this invariant was accidentally broken.

Therefore to support TQ_PUSHPAGE, which is needed for Linux, and
prevent any further confusion I have removed this direct mapping.
The TQ_SLEEP, TQ_NOSLEEP, and TQ_PUSHPAGE are no longer defined
in terms of their KM_* counterparts.  Instead a simple mapping
function is introduce to convert TQ_* -> KM_* where needed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #171
2012-09-12 11:41:42 -07:00
Brian Behlendorf 330fe010e4 Revert "Switch KM_SLEEP to KM_PUSHPAGE"
This reverts commit cd5ca4b2f8
due to conflicts in the higher TQ_ bits which caused incorrect
behavior.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-09-12 10:07:48 -07:00
Brian Behlendorf 4ca9a43644 Remove zvol device node
The 'zfs destroy' changes in 330d06f disrupted how zvol devices
get removed on ZoL.  However, it basically boils down to the
fact that we are no longer reliably calling zvol_remove_minor()
via zfs_ioc_destroy_snaps().

Therefore we add the missing call and handle things similarly
to the existing zfs_unmount_snap() case.  Ideally we would check
if this is of type DMU_OST_ZFS or DMU_OST_ZVOL and just do the
right thing as in zfs_ioc_destroy().  However, it looks like
it would be fairly expensive to get the type, and it's harmless
to simply attempt the umount and minor removal.

This is also an issue in the latest FreeBSD and Illumos code.
It was being tracked under the following issue, and we may want
to refresh our code when they settle on what they want to do
about it upstream.

  https://www.illumos.org/issues/3170

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #903
2012-09-10 10:25:08 -07:00
Brian Behlendorf 3c60f5054c Debug cv_destroy() with mutex held
There still appears to be a race in the condition variables where
->cv_mutex is set after we are woken from the cv_destroy wait queue.
This might be possible when cv_destroy() is called immediately after
cv_broadcast().  We had some troubles with this previously but
there may still be a small race, see commit d599e4f.

The following patch closes one small race and improves the ASSERTs
such that they log the offending value.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
zfsonlinux/zfs#943
2012-09-10 10:23:26 -07:00
Brian Behlendorf 95331f4437 Set KMC_NOEMERGENCY for zlib workspaces
The workspace required by zlib to perform compression is roughly
512MB (order-7).  These allocations are so large that we should
never attempt to directly kmalloc an emergency object for them.

It is far preferable to asynchronously vmalloc an additional slab
in case it's needed.  Then simply block waiting for an existing
object to be released or for the new slab to be allocated.

This can be accomplished by disabling emergency slab objects by
passing the KMC_NOEMERGENCY flag at slab creation time.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
zfsonlinux/zfs#917
2012-09-07 14:36:26 -07:00
Brian Behlendorf cb5c2acebb Add KMC_NOEMERGENCY slab flag
Provide a flag to disable the use of emergency objects for a
specific kmem cache.  There may be instances where under no
circumstances should you kmalloc() an emergency object.  For
example, when you cache contains very large objects (>128k).

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-09-07 14:27:03 -07:00
Cyril Plisko 04f9432d3b Make ZFS filesystem id persistent across different machines
Use ZFS dataset fsid guid as a unique file system id, similar to what is
done on Illumos/OpenSolaris.

Signed-off-by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #888
2012-09-06 12:47:11 -07:00
Brian Behlendorf ebcfc8a534 Disable page allocation warnings for ARC buffers
Buffers for the ARC are normally backed by the SPL virtual slab.
However, if memory is low, AND no slab objects are available,
AND a new slab cannot be quickly constructed a new emergency
object will be directly allocated.

These objects can be as large as order 5 on a system with 4k
pages.  And because they are allocated with KM_PUSHPAGE, to
avoid a potential deadlock, they are not allowed to initiate I/O
to satisfy the allocation.  This can result in the occasional
allocation failure.

However, since these allocations are allowed to block and
perform operations such as memory compaction they will eventually
succeed.  Since this is not unexpected (just unlikely) behavior
this patch disables the warning for the allocation failure.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #465
2012-09-06 11:53:08 -07:00
Brian Behlendorf cafa9709f3 Switch KM_SLEEP to KM_PUSHPAGE
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim.  See commit b8d06fca08
for additional details.

  SPL: Fixing allocation for task txg_sync (6093) which
  used GFP flags 0x297bda7c with PF_NOFS set

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
2012-09-05 08:44:58 -07:00
Brian Behlendorf 0ef0ff546e Switch KM_SLEEP to KM_PUSHPAGE
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim.  See commit b8d06fca08
for additional details.

  SPL: Fixing allocation for task txg_sync (6093) which
  used GFP flags 0x297bda7c with PF_NOFS set

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
2012-09-04 16:00:06 -07:00
Brian Behlendorf 594b4dd82a Switch KM_SLEEP to KM_PUSHPAGE
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim.  See commit b8d06fca08
for additional details.

  SPL: Fixing allocation for task txg_sync (6093) which
  used GFP flags 0x297bda7c with PF_NOFS set

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
2012-09-04 08:41:12 -07:00
Chris Dunlop 20a083cbe2 Switch KM_SLEEP to KM_PUSHPAGE
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim.  See commit b8d06fca08
for additional details.

  SPL: Fixing allocation for task txg_sync (6093) which
  used GFP flags 0x297bda7c with PF_NOFS set

Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
2012-09-02 10:15:49 -07:00
Brian Behlendorf b404a3f07f Switch KM_SLEEP to KM_PUSHPAGE
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim.  See commit b8d06fca08
for additional details.

  SPL: Fixing allocation for task txg_sync (6093) which
  used GFP flags 0x297bda7c with PF_NOFS set

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
2012-08-31 17:39:29 -07:00
Brian Behlendorf 46b3945d5d Suppress task_hash_table_init() large allocation warning
When various kernel debuging options are enabled this allocation
may be larger than usual as shown by the following warning.  It
is in no way harmful so we suppress the warning.

  SPL: large kmem_alloc(40960, 0x80d0) at
  tsd_hash_table_init:358 (76495/76495)

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #93
2012-08-30 21:02:52 -07:00
Brian Behlendorf 2b2861362f Clear PG_writeback after zil_commit() for sync I/O
When writing via ->writepage() the writeback bit was always cleared
as part of the txg commit callback.  However, when the I/O is also
being written synchronsously to the zil we can immediately clear this
bit.  There is no need to wait for the subsequent TXG sync since the
data is already safe on stable storage.

This has been observed to reduce the msync(2) delay from up to 5
seconds down 10s of miliseconds.  One workload which is expected
to benefit from this are the intermittent samba hands described
in issue #700.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #700
Closes #907
2012-08-30 20:16:28 -07:00
Brian Behlendorf efcd0ca32d Enhance SPLAT kmem:slab_overcommit test
After the emergency slab objects were merged I started observing
timeout failures in the kmem:slab_overcommit test.  These were
due to the ineffecient way the slab_overcommit reclaim function
was implemented.  And due to the additional cost of potentially
allocating ten of thousands of emergency objects and tracking
them on a single list.

This patch addresses the first concern by enhansing the test
case to trace all of the allocations objects as a linked list.
This allows for a cleaner version of the reclaim function to
simply release SPLAT_KMEM_OBJ_RECLAIM objects.

Since this touches some common code all the tests which share
these data structions were also updated.  After making these
changes slab_overcommit is reliably passing.  However, there
is certainly additional cleanup which could be done here.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-30 15:49:00 -07:00
Richard Yao b8d06fca08 Switch KM_SLEEP to KM_PUSHPAGE
Differences between how paging is done on Solaris and Linux can cause
deadlocks if KM_SLEEP is used in any the following contexts.

  * The txg_sync thread
  * The zvol write/discard threads
  * The zpl_putpage() VFS callback

This is because KM_SLEEP will allow for direct reclaim which may result
in the VM calling back in to the filesystem or block layer to write out
pages.  If a lock is held over this operation the potential exists to
deadlock the system.  To ensure forward progress all memory allocations
in these contexts must us KM_PUSHPAGE which disables performing any I/O
to accomplish the memory allocation.

Previously, this behavior was acheived by setting PF_MEMALLOC on the
thread.  However, that resulted in unexpected side effects such as the
exhaustion of pages in ZONE_DMA.  This approach touchs more of the zfs
code, but it is more consistent with the right way to handle these cases
under Linux.

This is patch lays the ground work for being able to safely revert the
following commits which used PF_MEMALLOC:

  21ade34 Disable direct reclaim for z_wr_* threads
  cfc9a5c Fix zpl_writepage() deadlock
  eec8164 Fix ASSERTION(!dsl_pool_sync_context(tx->tx_pool))

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #726
2012-08-27 12:01:37 -07:00
Brian Behlendorf 991fc1d7ae mzap_upgrade() must use kmem_alloc()
These allocations in mzap_update() used to be kmem_alloc() but
were changed to vmem_alloc() due to the size of the allocation.
However, since it turns out this function may be called in the
context of the txg_sync thread they must be changed back to use
a kmem_alloc() to ensure the KM_PUSHPAGE flag is honored.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:01:37 -07:00