4897 Space accounting mismatch in L2ARC/zpool
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
From the illumos issue tracker:
L2ARC vdev space usage statistics are calculated as the delta
between the maximum and minimum vdev offset ever written to
by the L2ARC fill thread, but do not inform the user of how
much space in between these two offsets is actually taken up by
cached buffers. This fix changes that so that vdev space usage
stats on L2ARC devices accurately track the volume of buffers
stored on them, allowing users to see the exact L2ARC usage in
"zpool iostat -v".
References:
https://www.illumos.org/issues/4897https://github.com/illumos/illumos-gate/commit/3038a2b
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2555
4390 i/o errors when deleting filesystem/zvol can lead to space map corruption
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/4390https://github.com/illumos/illumos-gate/commit/7fd05ac
Porting notes:
Previous stack-reduction efforts in traverse_visitb() caused a fair
number of un-mergable pieces of code. This patch should reduce its
stack footprint a bit more.
The new local bptree_entry_phys_t in bptree_add() is dynamically-allocated
using kmem_zalloc() for the purpose of stack reduction.
The new global zfs_free_leak_on_eio has been defined as an integer
rather than a boolean_t as was the case with the related zfs_recover
global. Also, zfs_free_leak_on_eio's definition has been inserted into
zfs_debug.c for consistency with the existing definition of zfs_recover.
Illumos placed it in spa_misc.c.
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2545
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Description from Matt Ahrens's bug report at Delphix:
Add a new zfs property, "redundant_metadata" which can have values
"all" or "most". The default will be "all", which is the current
behavior. Setting to "most" will cause us to only store 1 copy of
level-1 indirect blocks of user data files.
Additional notes:
The new man page section for this property states
"The exact behavior of which metadata blocks
are stored redundantly may change in future releases."
and:
"When set to most, ZFS stores an extra copy of most types of
metadata. This can improve performance of random writes,
because less metadata must be written."
The current implementation is as described above in Matt's blog.
It is controlled by a new global integer
"zfs_redundant_metadata_most_ditto_level", currently initialized
to 2. When "redundant_metadata" is set to "most", only indirect
blocks of the specified level and higher will have additional ditto
blocks created.
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2542
Added highbit64() and howmany() which are used in recent upstream
code. Both highbit() and highbit64() should at some point be
re-factored to use the optimized fls() and fls64() functions.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#363
There have been issues in the past where excessive debug logging
to the console has resulted in significant performance impacts.
In the vast majority of these cases only a few stack traces are
required to diagnose the issue. Therefore, stack traces dumped to
the console will now we limited to 5 every 60s.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Closes#374
As part of the write throttle & i/o schedule performance work the
zfs_trunc() function should have been updated to use TXG_WAIT.
Using TXG_WAIT ensures that the tx will be part of the next txg.
If TXG_NOWAIT is used and retried for ERESTART errors then the
tx can suffer from starvation.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#2488
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101https://www.illumos.org/issues/4102https://www.illumos.org/issues/4103https://www.illumos.org/issues/4105https://www.illumos.org/issues/4106https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2488
This changes moves the called to metaslab_group_alloc_update() to the
metaslab_sync_reassess() function. The original placement of the call
within metaslab_sync_done() appears to have been a simple mistake,
introduced by ac72fac3ea.
This aligns us more closely to the upstream illumos code base.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Update the current code to ensure inodes are never dirtied if they are
part of a read-only file system or snapshot. If they do somehow get
dirtied an attempt will make made to write them to disk. In the case
of snapshots, which don't have a ZIL, this will result in a NULL
dereference in zil_commit().
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2405
The parameter was added as illumos issue 4081 which was committed to
zfsonlinux in ac72fac3ea. This patch
documents the parameter and allows for it to be set as a module parameter.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2483
If the smaller of 2 different sized child vdev's of a mirrored vdev is
detached, and the pool has the autoexpand property set to off, as the
remaining larger vdev is promoted to a top level vdev it fails to retain
the asize of the original top level mirror vdev and therefore partially
autoexpands.
This partially autoexpanded state leaves the new vdev too large to
re-mirror by adding the smaller vdev back in, and the pool fails to
utilize the space until next imported.
If the autoexpand property is set to on, the child vdev grows
in size after it has been promoted to a top level vdev as expected.
This commit causes the remaining child mirror to retain the asize of its
old parent mirror vdev if the autoexpand property is set to off,
this allows the smaller vdev to be re-added if required the vdev
can then be told to expand if required by the usual using zpool online -e.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrew Barnes <barnes333@gmail.com>
Signed-off-by: George Wilson <george.wilson@delphix.com>
Closes#1208
Spl's debugging and assertion macros macro used the typical do/while(0)
form for if/else friendliness, however, this limits their use in contexts
where a do loop is not valid; such as within another multi-statement
style macro.
The following macros have been converted to not use do/while(0):
PANIC, ASSERT, ASSERTF, VERIFY, VERIFY3_IMPL
PANIC has been converted to a wrapper around the new spl_PANIC() function.
The other macros have been converted to use the "&&" operator for the
branch-predicition conditional and also to use spl_PANIC().
The __ASSERT() macro was not touched. It is only used by the debugging
infrastructure and that code, including this macro, will be retired when
the tracepoint patches are merged.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#367
Added several comments regarding the removal of real_LZ4_uncompress()
which exists in the reference implementation but has been removed here
since it's not used.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This reverts commit 91579709fc which
limited the asynchronous dispatch to kernel space. We want to do
this for two reasons:
1) While we have slightly more headroom in user space excessively
deep stacks have been observed while running ztest, see #2293.
2) Removing this conditional makes the pipeline behave consistently
regardless of if it's executing in kernel space or user space.
This way we're more likely to uncover subtle issues with ztest.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2384
We should not override the default memory type of the kmem cache. This
was done previously to force certain objects which were slightly over
object size limit cut off in to KMC_KMEM caches for better performance.
The zfsonlinux/spl#356 patch slightly increases the default cut off
from 511 bytes 1024 bytes for x86_64. This means there is long longer
a need to override the default for the caches. And since the default
values are now being used the new spl_kmem_cache_slab_limit and
spl_kmem_cache_kmem_limit tunables will apply to all kmem caches.
The following is a list of caches that will be impacted:
| object size | forced type | default type
----------------- | ------------- | ------------- | --------------
dnode_t | 936 bytes | KMC_KMEM | KMC_KMEM
zio_cache | 1104 bytes | *KMC_KMEM | *KMC_VMEM
zio_link_cache | 48 bytes | KMC_KMEM | KMC_KMEM
zio_vdev_cache | 131088 bytes | KMC_VMEM | KMC_VMEM
zio_buf_512 | 512 bytes | KMC_KMEM | KMC_KMEM
zio_data_buf_512 | 512 bytes | KMC_KMEM | KMC_KMEM
zio_buf_1024 | 1024 bytes | KMC_KMEM | KMC_KMEM
zio_data_buf_1024 | 1024 bytes | +KMC_VMEM | +KMC_KMEM
* Cache memory type will change from KMC_KMEM to KMC_VMEM.
+ Cache memory type will change from KMC_VMEM to KMC_KMEM.
This patch removes another slight point of divergence between ZoL
and Illumos.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Closes#2337
The correct behavior for all registered shrinkers is to return the
number of objects in their cache. In theory this allows the Linux
VM to balance memory reclaim across all registered caches.
In commit b9b3715 this behavior was disabled in favor of returning
-1 which notifies the VM that no additional objects are available
for reclaim. This was done as a workaround to resolve thrashing
in shrink_slabs() which could occur when memory was low and numerous
core where in reclaim. Unfortunately, this has been observed to
increase the likelihood of OOM events when SPL slab consumers are
responsible for consuming the majority of memory.
Therefore, this patch makes this behavior tunable. Setting the
spl_kmem_cache_reclaim module option to 0x1 will result in the
shrinker only being called once. This is the default behavior.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Closes#358
For small objects the Linux slab allocator has several advantages
over its counterpart in the SPL. These include:
1) It is more memory-efficient and packs objects more tightly.
2) It is continually tuned to maximize performance.
Therefore it makes sense to layer the SPLs slab allocator on top
of the Linux slab allocator. This allows us to leverage the
advantages above while preserving the Illumos semantics we depend
on. However, there are some things we need to be careful of:
1) The Linux slab allocator was never designed to work well with
large objects. Because the SPL slab must still handle this use
case a cut off limit was added to transition from Linux slab
backed objects to kmem or vmem backed slabs.
spl_kmem_cache_slab_limit - Objects less than or equal to this
size in bytes will be backed by the Linux slab. By default
this value is zero which disables the Linux slab functionality.
Reasonable values for this cut off limit are in the range of
4096-16386 bytes.
spl_kmem_cache_kmem_limit - Objects less than or equal to this
size in bytes will be backed by a kmem slab. Objects over this
size will be vmem backed instead. This value defaults to
1/8 a page, or 512 bytes on an x86_64 architecture.
2) Be aware that using the Linux slab may inadvertently introduce
new deadlocks. Care has been taken previously to ensure that
all allocations which occur in the write path use GFP_NOIO.
However, there may be internal allocations performed in the
Linux slab which do not honor these flags. If this is the case
a deadlock may occur.
The path forward is definitely to start relying on the Linux slab.
But for that to happen we need to start building confidence that
there aren't any unexpected surprises lurking for us. And ideally
need to move completely away from using the SPLs slab for large
memory allocations. This patch is a first step.
NOTES:
1) The KMC_NOMAGAZINE flag was leveraged to support the Linux slab
backed caches but it is not supported for kmem/vmem backed caches.
2) Regardless of the spl_kmem_cache_*_limit settings a cache may
be explicitly set to a given type by passed the KMC_KMEM,
KMC_VMEM, or KMC_SLAB flags during cache creation.
3) The constructors, destructors, and reclaim callbacks are all
functional and will be called regardless of the cache type.
4) KMC_SLAB caches will not appear in /proc/spl/kmem/slab due to
the issues involved in presenting correct object accounting.
Instead they will appear in /proc/slabinfo under the same names.
5) Several kmem SPLAT tests needed to be fixed because they relied
incorrectly on internal kmem slab accounting. With the updated
test cases all the SPLAT tests pass as expected.
6) An autoconf test was added to ensure that the __GFP_COMP flag
was correctly added to the default flags used when allocating
a slab. This is required to ensure all pages in higher order
slabs are properly refcounted, see ae16ed9.
7) When using the SLUB allocator there is no need to attempt to
set the __GFP_COMP flag. This has been the default behavior
for the SLUB since Linux 2.6.25.
8) When using the SLUB it may be desirable to set the slub_nomerge
kernel parameter to prevent caches from being merged.
Original-patch-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes#356
For consistency with disk vdevs honor the zfs_nocacheflush tunable.
This setting is available primarily for debugging and performance
analysis.
Signed-off-by: HC <mmttdebbcc@yahoo.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2336
In the case where a variable-sized SA overlaps the spill block pointer and
a new variable-sized SA is being added, the header size was improperly
calculated to include the to-be-moved SA. This problem could be
reproduced when xattr=sa enabled as follows:
ln -s $(perl -e 'print "x" x 120') blah
setfattr -n security.selinux -v blahblah -h blah
The symlink is large enough to interfere with the spill block pointer and
has a typical SA registration as follows (shown in modified "zdb -dddd"
<SA attr layout obj> format):
[ ... ZPL_DACL_COUNT ZPL_DACL_ACES ZPL_SYMLINK ]
Adding the SA xattr will attempt to extend the registration to:
[ ... ZPL_DACL_COUNT ZPL_DACL_ACES ZPL_SYMLINK ZPL_DXATTR ]
but since the ZPL_SYMLINK SA interferes with the spill block pointer, it
must also be moved to the spill block which will have a registration of:
[ ZPL_SYMLINK ZPL_DXATTR ]
This commit updates extra_hdrsize when this condition occurs, allowing
hdrsize to be subsequently decreased appropriately.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue #2214
Issue #2228
Issue #2316
Issue #2343
Restructure the zfsdev_state_list to allow for lock-free reading by
converting to a simple singly-linked list from which items are never
deleted and over which only forward iterations are performed. It depends
on, among other things, the atomicity of accessing the zs_minor integer
and zs_next pointer.
This fixes a lock inversion in which the zfsdev_state_lock is used by
both the sync task (txg_sync) and indirectly by any user program which
uses /dev/zfs; the zfsdev_release method uses the same lock and then
blocks on the sync task.
The most typical failure scenerio occurs when the sync task is cleaning
up a user hold while various concurrent "zfs" commands are in progress.
Neither Illumos nor Solaris are affected by this issue because they use
DDI interface which provides lock-free reading of device state via the
ddi_get_soft_state() function.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2301
Originally, vdev_file used system_taskq. This would cause a deadlock,
especially on system with few CPUs. The reason is that the prefetcher
threads, which are on system_taskq, will sometimes be blocked waiting
for I/O to finish. If the prefetcher threads consume all the tasks in
system_taskq, the I/O cannot be served and thus results in a deadlock.
We fix this by creating a dedicated vdev_file_taskq for vdev_file I/O.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2270
The dva_get_dsize_sync() function incorrectly assumes that the call
to vdev_lookup_top() cannot fail. However, the NULL dereference at
clearly shows that under certain circumstances it is possible. Note
that offset 0x570 (1376) maps as expected to vd->vdev_deflate_ratio.
BUG: unable to handle kernel NULL pointer dereference at 00000570
crash> struct -o vdev
struct vdev {
[0] uint64_t vdev_id;
... ...
[1376] uint64_t vdev_deflate_ratio;
Given that this can happen this patch add the required error handling.
In the case where vdev_lookup_top() fails assume that no deflation
will occur for the DVA and use the asize.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Alexey Zhuravlev <alexey.zhuravlev@intel.com>
Closes#1707Closes#1987Closes#1891
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When fetching property values of snapshots, a check against the head
dataset type must be performed. Previously, this additional check was
performed only when fetching "version", "normalize", "utf8only" or "case".
This caused the ZPL properties "acltype", "exec", "devices", "nbmand",
"setuid" and "xattr" to be erroneously displayed with meaningless values
for snapshots of volumes. It also did not allow for the display of
"volsize" of a snapshot of a volume.
This patch adds the headcheck flag paramater to zfs_prop_valid_for_type()
and zprop_valid_for_type() to indicate the check is being done
against a head dataset's type in order that properties valid only for
snapshots are handled correctly. This allows the the head check in
get_numeric_property() to be performed when fetching a property for
a snapshot.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2265
A minor style issue was accidentally introduced by aa7d06a.
This change resolves that style problem.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Today the metaslab_debug logic performs two tasks:
- load all metaslabs on import/open
- don't unload metaslabs at the end of spa_sync
This change provides knobs for each of these independently.
References:
https://illumos.org/issues/4101https://github.com/illumos/illumos-gate/commit/0713e23
Notes:
1) This is a small piece of the metaslab improvement patch from
Illumos. It was worth bringing over before the rest, since it's
low risk and it can be useful on fragmented pools (e.g. Lustre
MDTs). metaslab_debug_unload would give the performance benefit
of the old metaslab_debug option without causing unwanted delay
during pool import.
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2227
When the system attributes (SAs) for an object exceed what can
can be stored in the bonus area of a dnode a spill block is
allocated. These spill blocks are currently considered data
blocks. However, they should be accounted for as meta data
because they are effectively an extension of the dnode.
While this may seem like a minor accounting issue it has broader
implications. The key thing to be aware of is that each spill
block will hold a reference on its parent dnode. The dnode in
turn holds a reference on its dbuf in the dnode object. This
means that a single 512 byte data buffer for a spill block can
pin over 16k of meta data. This is analogous to the small file
situation described in 2b13331 where a relatively small number
of data buffer can cause the ARC to exceed the meta limit.
However, unlike the small file case a spill block can legitimately
be considered meta data. By changing the spill block to meta data
they will now be dropped from the cache when the meta limit is
reached. This then allows the dnodes and dbufs which the spill
block was pinning to be released.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Closes#2294
As described in the comment above dmu_tx_assign() this function must
only fail if the pool is out of space. If for some other reason the
TX cannot be assigned (such as memory pressure) ERESTART must be
returned. Alternately, EAGAIN could be returned to inject a delay
but that isn't required because the caller will block on the condition
variable waiting for the next TXG.
/*
* Assign tx to a transaction group. txg_how can be one of:
*
* (1) TXG_WAIT. If the current open txg is full, waits until there's
* a new one. This should be used when you're not holding locks.
* It will only fail if we're truly out of space (or over quota).
* ...
*/
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#2287
We add support for lsattr and chattr to resolve a regression caused
by 88c283952f that broke Python's
xattr.list(). That changet broke Gentoo Portage's FEATURES=xattr,
which depended on Python's xattr.list().
Only attributes common to both Solaris and Linux are supported. These
are 'a', 'd' and 'i' in Linux's lsattr and chattr commands. File
attributes exclusive to Solaris are present in the ZFS code, but cannot
be accessed or modified through this method. That was the case prior to
this patch. The resolution of issue zfsonlinux/zfs#229 should implement
some method to permit access and modification of Solaris-specific
attributes.
References:
https://bugs.gentoo.org/show_bug.cgi?id=483516
Original-patch-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1691
Some network related block device uses tcp_sendpage, which doesn't
behave well when using 0-count page. Add assertion to catch them.
This has a runtime dependency on:
zfsonlinux/spl@ae16ed9 Fix crash when using ZFS on Ceph rbd
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2277
The problem is described in commit aeeb4e0c0a.
However, instead of disabling the binding to CPU altogether we just keep the
last CPU index across calls to taskq_create() and thus achieve even
distribution of the taskq threads across all available CPUs.
The implementation based on assumption that task queues initialization
performed in serial manner.
Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com>
Signed-off-by: Andrey Vesnovaty <andreyv@infinidat.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#336
When using __get_free_pages to get high order memory, only the first page's
_count will set to 1, other's will be 0. When an internal page get passed into
rbd, it will eventully go into tcp_sendpage. There, it will be called with
get_page and put_page, and get freed erroneously when _count jump back to 0.
The solution to this problem is to use compound page. All pages in a
high order compound page share a single _count. So get_page and put_page in
tcp_sendpage will not cause _count jump to 0.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#251
Endianness detection in LZ4 is broken in user-space builds. This
bug corrupts compressed data and manifests itself in several ztest
failures. When LZ4 was originally ported to Illumos ZFS, the proper
checks for Linux were stripped out. The Linux port then inherited
the remaining detection code that works on Illumos but not on Linux.
The current LZ4 endianness check misuses the condition
defined(__BIG_ENDIAN) to indicate a big-endian system. On Linux
__BIG_ENDIAN is defined uncondtionally in the user-space header
/usr/include/endian.h, regardless of the endianness of the system.
The kernel does not use this header, so only user-space builds are
affected.
While we could fix this by restoring the upstream LZ4 endianness
detection code, reliable checks already exist in
libspl/include/sys/isa_defs.h. This change uses the libspl results
to replace the word-size and endianness checks in LZ4, simplifying
the code and reducing duplication.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fixes#1963Fixes#1964Fixes#1965
Due to an asymmetry in the kmem accounting a memory leak was being
reported when it was only an accounting issue. All memory allocated
with kmem_alloc() must be released with kmem_free() or it will not
be properly accounted for.
In this case the code used strfree() to release the memory allocated
by kmem_alloc(). Presumably this was done because the size of the
memory region wasn't available when the memory needed to be freed.
To resolve this issue the code has been updated to use strdup() instead
of kmem_alloc() to allocate the memory. Like strfree(), strdup() is
not integrated with the memory accounting. This means we can use
strfree() to release it like Illumos.
SPL: kmem leaked 10/4368729 bytes
address size data func:line
ffff880067e9aa40 10 ZZZZZZZZZZ zfsdev_ioctl:5655
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes#2262
This behavior is more consistent with the way memory reclaim
is expected to work under Linux.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#349
Also, make sure we use clock_t for ddi_get_lbolt to prevent type conversion
from screwing things.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2142
By default maximal number of objects in slab can't exceed (16*2 - 1) and slab
size can't exceed 32M.
Today's high end servers having couple hundreds of RAM available for ARC may
run into a trouble with virtual memory because of the restriction mentioned
above.
Problem:
Reasons for very high number of virtual memory allocations:
* Real slab size very small relative to the size of the entire RAM
* Slabs allocated on virtual memory and fill entire ARC
The result is very high number of allocated virtual memory ranges (hundreds of
ranges). When virtual memory subsystem manages high number of ranges its
performance become so poor that it freezes from time to time.
Solution:
Number of objects per slab should be increased taking into account maximal
slab size which can also be increased if needed.
Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#337