When the ddt_zap_lookup() function was updated to dynamically
allocate memory for the cbuf variable, to save stack space, the
'csize <= sizeof (cbuf)' assertion was not updated. The result
of this was that the size of the pointer was being used in the
comparison rather than the buffer size.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
The Chaos 4.x distribution is based on RHEL 5.x which is no longer
supported by ZoL since it uses a 2.6.18 kernel.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The Chaos 4.x distribution is based on RHEL 5.x which is no longer
supported by ZoL since it uses a 2.6.18 kernel.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This commit adds support for building debug and debug-devel sub packages
of the zfs-modules main package. This is to allow building packages
which are built against a debug kernel. By default, only packages are
built against a regular non-debug kernel. This can be toggled by passing
the '--with kernel-debug' parameter to rpmbuild.
Examples:
# To build packages against only the non-debug kernel
$ rpmbuild --rebuild --with kernel --without kernel-debug $SRPM
# To build packages against only the debug kernel
$ rpmbuild --rebuild --without kernel --with kernel-debug $SRPM
# To build packages against debug and non-debug kernel
$ rpmbuild --rebuild --with kernel --with kernel-debug $SRPM
Note: Only the RHEL 5/6, CHAOS 5, and Fedora distributions are supported
for building the debug and debug-devel packages.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This commit adds support for building debug and debug-devel sub packages
of the spl-modules main package. This is to allow building packages
which are built against a debug kernel. By default, only packages are
built against a regular non-debug kernel. This can be toggled by passing
the '--with kernel-debug' parameter to rpmbuild.
Examples:
# To build packages against only the non-debug kernel
$ rpmbuild --rebuild --with kernel --without kernel-debug $SRPM
# To build packages against only the debug kernel
$ rpmbuild --rebuild --without kernel --with kernel-debug $SRPM
# To build packages against debug and non-debug kernel
$ rpmbuild --rebuild --with kernel --with kernel-debug $SRPM
Note: Only the RHEL 5/6, CHAOS 5, and Fedora distributions are supported
for building the debug and debug-devel packages.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #115
Usage of get_current() is not supported across all architectures.
The correct interface to use is the '#define current' which will
map to the appropriate function, usually current_thread_info().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#119
The performance of the ZIL is usually the main bottleneck when dealing with
synchronous, write-heavy workloads (e.g. databases). Understanding the
behavior of the ZIL is required to diagnose performance issues for these
workloads, and to tune ZIL parameters (like zil_slog_limit) accordingly.
This commit adds a new kstat page dedicated to the ZIL with some counters
which, hopefully, scheds some light into what the ZIL is doing, and how it is
doing it.
Currently, these statistics are available in /proc/spl/kstat/zfs/zil.
A description of the fields can be found in zil.h.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#786
FreeBSD #xxx: Dramatically optimize listing snapshots when user
requests only snapshot names and wants to sort them by name, ie.
when executes:
# zfs list -t snapshot -o name -s name
Because only name is needed we don't have to read all snapshot
properties.
Below you can find how long does it take to list 34509 snapshots
from a single disk pool before and after this change with cold and
warm cache:
before:
# time zfs list -t snapshot -o name -s name > /dev/null
cold cache: 525s
warm cache: 218s
after:
# time zfs list -t snapshot -o name -s name > /dev/null
cold cache: 1.7s
warm cache: 1.1s
NOTE: This patch only appears in FreeBSD. If/when Illumos picks up
the change we may want to drop this patch and adopt their version.
However, for now this addresses a real issue.
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #450
ZoL can create more zvols at runtime than can be configured during
system start, which hangs the init stack at reboot.
When a slow system has more than a few hundred zvols, udev will
fork bomb during system start and spend too much time in device
detection routines, so upstart kills it.
The zfs_inhibit_dev option allows an affected system to be rescued
by skipping /dev/zd* creation and thereby avoiding the udev
overload. All zvols are made inaccessible if this option is set, but
the `zfs destroy` and `zfs send` commands still work, and ZFS
filesystems can be mounted.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
torvalds/linux@1dce27c5aa introduced
__clear_close_on_exec() as a replacement for FD_CLR. Further commits
appear to have removed FD_CLR from the Linux source tree. This
causes the following failure:
error: implicit declaration of function '__FD_CLR'
[-Werror=implicit-function-declaration]
To correct this we update the code to use the current
__clear_close_on_exec() interface for readability. Then we introduce
an autotools check to determine if __clear_close_on_exec() is available.
If it isn't then we define some compatibility logic which used the older
FD_CLR() interface.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#124
Gcc version 4.7.0 reports the delta.tv_sec in the slab reclaim test
as potentially unitialized. In practice this will never occur but
to keep gcc happy we initialize the variable to zero.
Signed-off-by: Brian Behlendorf <behlendo@fedora-17-amd64.(none)>
Prevent 'rpm -Uvh *.rpm" from automatically replacing your vdev.conf
file by flagging it as a non replacable config file.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#486
This is not a proper fix. It is just a workaround for the stack
smashing detected by gcc in zvol_id. We simply disable the gcc
stack protector for now when building the zvol_id udev helper.
Once the root cause is resolved this patch should be reverted.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issues #569
zil_slog_limit specifies the maximum commit size to be written to the separate
log device. Larger commits bypass the separate log device and go directly to
the data devices.
The optimal value for zil_slog_limit directly depends on the latency and
throughput characteristics of both the separate log device and the data disks.
Small synchronous writes are faster on low-latency separate log devices (e.g.
SSDs) whereas large synchronous writes are faster on high-latency data disks
(e.g. spindles) because of higher throughput, especially with a large array.
The point is, the line between "small" and "large" synchronous writes in this
scenario is heavily dependent on the hardware used. That's why it should be
made configurable.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#783
When failing to remove a zvol device link because it's busy, wait
a bit and retry in a loop instead of giving up immediately. This
technique is similar to the loop in zpool_label_disk_wait(), with
the same goal: waiting for the asynchronous udev processes to finish
their work.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#692
When compiling ZFS with CFLAGS=-O0 it will trigger the following error.
Resolve the issue by properly including locale.h.
../../cmd/mount_zfs/mount_zfs.c: In function 'main':
../../cmd/mount_zfs/mount_zfs.c:318:2: warning: implicit declaration
of function 'setlocale' [-Wimplicit-function-declaration]
../../cmd/mount_zfs/mount_zfs.c:318:19: error: 'LC_ALL' undeclared
(first use in this function)
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#724
torvalds/linux@adc0e91ab1 introduced
introduced d_make_root() as a replacement for d_alloc_root(). Further
commits appear to have removed d_alloc_root() from the Linux source
tree. This causes the following failure:
error: implicit declaration of function 'd_alloc_root'
[-Werror=implicit-function-declaration]
To correct this we update the code to use the current d_make_root()
interface for readability. Then we introduce an autotools check
to determine if d_make_root() is available. If it isn't then we
define some compatibility logic which used the older d_alloc_root()
interface.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#776
The logbias option is not taken into account when writing to ZVOLs. We fix
that by using the same logic as in the zfs filesystem write code
(see zfs_log.c).
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#774
The configure script error message for kernels built with
CONFIG_DEBUG_LOCK_ALLOC may give the impression that the issue is
strictly with license compliance. To avoid confusion add some words
indicating that the linking stage will fail if the build continues.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#773
In the module unload path the vm_file_cache was being destroyed
under a spin lock. Because this operation might sleep it was
possible, although very very unlikely, that this could result
in a deadlock.
This issue was indentified by using a Linux debug kernel and
has been fixed by moving the kmem_cache_destroy() out from under
the spin lock. There is no need to lock this operation here.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closeszfsonlinux/zfs#771
When a device is already open O_EXCL by another process the
`zpool import` will correctly fail. However, the default failure
message isn't very helpful. It may in fact be harmful if you
take its advise and destroy your pool.
cannot import 'tank': pool is busy
Destroy and re-create the pool from
a backup source.
Improve the error message in the EBUSY case to simply print a
message indicating that the devices are current in use. The user
will need to manually identify which process has the device open
exclusively and why.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When creating pools short device names may be used when those
devices appear in certain well known locations under /dev/.
This change adds /dev/mapper/ to that list.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
vdev_id parses the file /etc/zfs/vdev_id.conf to map a physical path
in a storage topology to a channel name. The channel name is combined
with a disk enclosure slot number to create an alias that reflects the
physical location of the drive. This is particularly helpful when it
comes to tasks like replacing failed drives. Slot numbers may also be
re-mapped in case the default numbering is unsatisfactory. The drive
aliases will be created as symbolic links in /dev/disk/by-vdev.
The only currently supported topologies are sas_direct and sas_switch:
o sas_direct - a channel is uniquely identified by a PCI slot and a
HBA port
o sas_switch - a channel is uniquely identified by a SAS switch port
A multipath mode is supported in which dm-mpath devices are handled by
examining the first running component disk, as reported by 'multipath
-l'. In multipath mode the configuration file should contain a
channel definition with the same name for each path to a given
enclosure.
vdev_id can replace the existing zpool_id script on systems where the
storage topology conforms to sas_direct or sas_switch. The script
could be extended to support other topologies as well. The advantage
of vdev_id is that it is driven by a single static input file that can
be shared across multiple nodes having a common storage toplogy.
zpool_id, on the other hand, requires a unique /etc/zfs/zdev.conf per
node and a separate slot-mapping file. However, zpool_id provides the
flexibility of using any device names that show up in
/dev/disk/by-path, so it may still be needed on some systems.
vdev_id's functionality subsumes that of the sas_switch_id script, and
it is unlikely that anyone is using it, so sas_switch_id is removed.
Finally, /dev/disk/by-vdev is added to the list of directories that
'zpool import' will scan.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#713
The CONFIG_DEBUG_LOCK_ALLOC check at configure time was added to
detect when mutex_lock() is defined as a GPL-only symbol. However,
the check as written only inferred this from this configuration
setting, it never actually checked. This change introduces that
missing check to prevent false positives.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Correctly implementating 64-bit division for ARM requires more than
just providing the __aeabi_uldivmod() and __aeabi_ldivmod() symbols.
They are need to be implemented is such a way that the quotient and
remainder and left in specific registers after the division operation
completes. This change updates the wrapper functions to accomplish
this according to the official ARM Run-time ABI.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closeszfsonlinux/zfs#706
Originally I believed that these interfaces would be needed.
However, in practice it turned out that it was more straight
forward and maintainable to use the native Linux interfaces.
As such, this is all dead code and can be safely removed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#109
The resolution of issue #31 made KM_PUSHPAGE imply GFP_NOFS. This
was done to prevent situations where filesystem operations which are
holding locks enter direct reclaim and attempt to reaquire those
same locks. This clearly will result in a deadlock.
This works for datasets which are implemented in terms for filesystem
operations. But unfortunately, swapping to a zvol will encounter
many of the same deadlocks and GFP_NOFS will not prevent this. As
such, it is appropriate to extend KM_PUSHPAGE to use the broader
GFP_NOIO mask to handle these non-filesystem cases.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#342Closes#105
This test is designed to verify that direct reclaim is functioning as
expected. We allocate a large number of objects thus creating a large
number of slabs. We then apply memory pressure and expect that the
direct reclaim path can easily recover those slabs. The registered
reclaim function will free the objects and the slab shrinker will call
it repeatedly until at least a single slab can be freed.
Note it may not be possible to reclaim every last slab via direct reclaim
without a failure because the shrinker_rwsem may be contended. For this
reason, quickly reclaiming 3/4 of the slabs is considered a success.
This should all be possible within 10 seconds. For reference, on a
system with 2G of memory this test takes roughly 0.2 seconds to run.
It may take longer on larger memory systems but should still easily
complete in the alloted 10 seconds.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#107
To minimize the chance of triggering an OOM during direct reclaim.
The kmem caches have been improved to make a best effort to reclaim
at least one slab when a reclaim function is registered. This helps
avoid the case where objects are released but they are spread over
multiple slabs so no memory gets reclaimed.
Care has been taken to avoid deadlocking if the reclaim function
is unable to make forward progress. Additionally, the reclaim
function may be skipped entirely if there are already free slabs
which can be safely reaped.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#107
The Linux direct reclaim path uses this out of band value to
determine if forward progress is being made. Normally this is
incremented by kmem_freepages() which is part of the various
Linux slab implementations. However, since we are using none
of that infrastructure we're responsible for incrementing this
count.
If no forward progress is detected and a subsequent allocation
fails the OOM killer will be invoked. If there was forward
progress additional reclaim will be attempted via the page
cache and registerd shrinker until the allocation succeeds.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#107
When memory pressure triggers direct memory reclaim, a slabs age
and delay should not prevent it from being freed. This patch ensures
these values are ignored, allowing an empty slab to be freed in this
code path no matter the value of its age and delay.
This prevents needless scanning of the partial slabs and has been
observed to significantly reduce the total cpu usage. In addition,
it should allow for snappier reclaim under memory pressure.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#102
Previously, the SPL tried to maintain Solaris semantics by freeing
all available (empty) slabs from its slab caches when the shrinker
was called. This is not desirable when running on Linux. To make
the SPL shrinker more Linux friendly, the actual number of freed
slabs from each of the slab caches is now derived from nr_to_scan
and skc_slab_objs.
Additionally, an accounting bug was fixed in spl_slab_reclaim()
which could cause us to reclaim one more slab than requested.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#101
Leverage the existing generic 64-bit division operations which
were originally implemented for x86 to support ARM. All that is
required is to make the symbols available to the linker with the
expected names.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This reverts commit ce90208cf9. This
change was observed to cause problems when using a zvol to back a VM
under 2.6.32.59 kernels. This issue was filed as #710.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #342
Issue #710
The zfs-test package additionally depends on mdadm and bc to
run the zfault.sh tests.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#690
The mode argument of iops->create()/mkdir()/mknod() was changed from
an 'int' to a 'umode_t'. To prevent a compiler warning an autoconf
check was added to detect the API change and then correctly set a
zpl_umode_t typedef. There is no functional change.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#701
Previously, it was possible for the direct reclaim path to be invoked
when a write to a zvol was made. When a zvol is used as a swap device,
this often causes swap requests to depend on additional swap requests,
which deadlocks. We address this by disabling the direct reclaim path
on zvols.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#342
As of the removal of the taskq work list made in commit:
commit 2c02b71b14
Author: Prakash Surya <surya1@llnl.gov>
Date: Mon Dec 5 17:32:48 2011 -0800
Replace tq_work_list and tq_threads in taskq_t
To lay the ground work for introducing the taskq_dispatch_prealloc()
interface, the tq_work_list and tq_threads fields had to be replaced
with new alternatives in the taskq_t structure.
the comment above taskq_wait_check has been incorrect. This change is an
attempt at bringing that description more in line with the current
implementation. Essentially, references to the old task work list had to
be updated to reference the new taskq thread active list.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #65
23bdb07d4e updated the ARC memory limits
to be 1/2 of memory or all but 4GB. Unfortunately, these values assume
zero internal fragmentation in the SLUB allocator, when in reality, the
internal fragmentation could be as high as 50%, effectively doubling
memory usage. This poses clear safety issues, because it permits the
size of ARC to exceed system memory.
This patch changes this so that the default value of arc_c_max is always
1/2 of system memory. This effectively limits the ARC to the memory that
the system has physically installed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#660
Under Solaris the ARC was designed to stay one step ahead of the
VM subsystem. It would attempt to recognize low memory situtions
before they occured and evict data from the cache. It would also
make assessments about if there was enough free memory to perform
a specific operation.
This was all possible because Solaris exposes a fairly decent
view of the memory state of the system to other kernel threads.
Linux on the other hand does not make this information easily
available. To avoid extensive modifications to the ARC the SPL
attempts to provide these same interfaces. While this works it
is not ideal and problems can arise when the ARC and Linux have
different ideas about when your out of memory. This has manifested
itself in the past as a spinning arc_reclaim_thread.
This patch abandons the emulated Solaris interfaces in favor of
the prefered Linux interface. That means moving the bulk of the
memory reclaim logic out of the arc_reclaim_thread and in to the
evict driven shrinker callback. The Linux VM will call this
function when it needs memory. The ARC is then responsible for
attempting to free the requested amount of memory if possible.
Several interfaces have been modified to accomidate this approach,
however the basic user space implementation remains the same.
The following changes almost exclusively just apply to the kernel
implementation.
* Removed the hdr_recl() reclaim callback which is redundant
with the broader arc_shrinker_func().
* Reduced arc_grow_retry to 5 seconds from 60. This is now used
internally in the ARC with arc_no_grow to indicate that direct
reclaim was recently performed. This typically indicates a
rapid change in memory demands which the kswapd threads were
unable to keep ahead of. As long as direct reclaim is happening
once every 5 seconds arc growth will be paused to avoid further
contributing to the existing memory pressure. The more common
indirect reclaim paths will not set arc_no_grow.
* arc_shrink() has been extended to take the number of bytes by
which arc_c should be reduced. This allows for a more granual
reduction of the arc target. Since the kernel provides a
reclaim value to the arc_shrinker_func() this value is used
instead of 1<<arc_shrink_shift.
* arc_reclaim_needed() has been removed. It was used to determine
if the system was under memory pressure and relied extensively
on Solaris specific VM interfaces. In most case the new code
just checks arc_no_grow which indicates that within the last
arc_grow_retry seconds direct memory reclaim occurred.
* arc_memory_throttle() has been updated to always include the
amount of evictable memory (arc and page cache) in its free
space calculations. This space is largely available in most
call paths due to direct memory reclaim.
* The Solaris pageout code was also removed to avoid confusion.
It has always been disabled due to proc_pageout being defined
as NULL in the Linux port.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Expose the zfs_mdcomp_disable variable as a module option. This
can be used to disable compression of zfs meta data which is
enabled by default. This shouldn't need to be tuned but for
most workloads, however there may be very specific instances
where it makes sense to trade disk capacity for extra cpu cycles.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Joshua M. Clulow <josh@sysmgr.org>
Approved by: Richard Lowe <richlowe@richlowe.net>
Reference to Illumos issue:
https://www.illumos.org/issues/1946
Ported by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Garrett D'Amore <garrett@nexenta.com>
Approved by: Eric Schrock <eric.schrock@delphix.com>
Refererce to Illumos issue:
https://www.illumos.org/issues/952
Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#607
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Steve Gonczi <gonczi@comcast.net>
Reviewed by: Garrett D'Amore <garrett.damore@gmail.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Albert Lee <trisk@nexenta.com>
Approved by: Eric Schrock <eric.schrock@delphix.com>
Refererces to Illumos issue:
https://www.illumos.org/issues/1909
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#680