The get_vmalloc_info() function was used to back the vmem_size()
function. This was always problematic and resulted in brittle
code because the kernel never provided a clean interface for
modules.
However, it turns out that the only caller of this function in
ZFS uses it to determine the total virtual address space size.
This can be determined easily without get_vmalloc_info() so
vmem_size() has been updated to take this approach which allows
us to shed the get_vmalloc_info() dependency.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The on_each_cpu() function has been available since Linux 2.6.27.
There is no longer a need to maintain this compatibility code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The mutex_lock_nested() function has been available since Linux 2.6.18.
There is no longer a need to maintain this compatibility code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The inode structure has used i_mutex as its internal locking
primitive since 2.6.16. The compatibility code to check for
the previous semaphore primitive has been removed. However,
the wrapper function itself is being kept because it's entirely
possible this primitive will change again to allow finer grained
locking.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The kmalloc_node() function has been available since Linux 2.6.12.
There is no longer a need to maintain this compatibility code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The uaccess header has been available in the same location since
Linux 2.6.18. There is no longer a need to maintain this
compatibility code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The uintptr_t typedef has been available since Linux 2.6.24.
There is no longer a need to maintain this compatibility code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The atomic64_xchg() and atomic64_cmpxchg() functions have been
available since Linux 2.6.24. There is no longer a need to
maintain this compatibility code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Many of the time functions had grown overly complex in order to
handle kernel compatibility issues. However, as of Linux 2.6.26
all the required functionality is available. This allows us to
retire numerous configure checks and greatly simplify the time
compatibility wrappers.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The fls64() function has been available since Linux 2.6.16 and
it should be used to implemented highbit64(). This allows us
to provide an optimized implementation and simplify the code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Support for the CTL_UNNUMBERED sysctl interface was removed in
Linux 2.6.19. There is no longer any reason to maintain this
compatibility code. There also issue any reason to keep around
the CTL_NAME macro and helpers so they have been retired.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The register_sysctl() interface has been stable since Linux 2.6.21.
There is no longer a need to maintain compatibility code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
There is no longer a need to wrap this because utsname() is provided
by the kernel and can be called directly. This will require a small
change in the ZFS code because utsname is expected to be a global
structure and not a function.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Since the Linux 2.6.29 kernel all mutexes have been adaptive mutexs.
There is no longer any point in keeping this code so it is being
removed to simplify the code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When the SPL was originally written it was designed to use the
device_create() and device_destroy() functions. Unfortunately,
these functions changed considerably over the years making them
difficult to rely on.
As it turns out a better choice would have been to use the
misc_register()/misc_deregister() functions. This interface
for registering character devices has remained stable, is simple,
and provides everything we need.
Therefore the code has been reworked to use this interface. The
higher level ZFS code has always depended on these same interfaces
so this is also as a step towards minimizing our kernel dependencies.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Apply the license specified in the META file to ensure the
compatibility checks are all performed consistently.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Modify the code to use the utsname() kernel function rather than
a global variable. This results is cleaner more portable code
because utsname() is already provided by the kernel and can be
easily emulated in user space via uname(2). This means that it
will behave consistently in both contexts.
This is also has the benefit that it allows the removal of a few
_KERNEL pre-processor conditions. And it also is a pre-requisite
for a proper FUSE port because we need to provide a valid utsname.
Finally, it allows us to remove this functionality from the SPL
and all the related compatibility code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2757
When ZPIOS was originally written it was designed to use the
device_create() and device_destroy() functions. Unfortunately,
these functions changed considerably over the years making them
difficult to rely on.
As it turns out a better choice would have been to use the
misc_register()/misc_deregister() functions. This interface
for registering character devices has remained stable, is simple,
and provides everything we need.
Therefore the code has been reworked to use this interface. The
higher level ZFS code has always depended on these same interfaces
so this is also as a step towards minimizing our kernel dependencies.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2757
Creating virtual machines that have their rootfs on ZFS on hosts that
have their rootfs on ZFS causes SPA namespace collisions when the
standard name rpool is used. The solution is either to give each guest
pool a name unique to the host, which is not always desireable, or boot
a VM environment containing an ISO image to install it, which is
cumbersome.
26b42f3f9d introduced `zpool import -t
...` to simplify situations where a host must access a guest's pool when
there is a SPA namespace conflict. We build upon that to introduce
`zpool import -t tname ...`. That allows us to create a pool whose
in-core name is tname, but whose on-disk name is the normal name
specified.
This simplifies the creation of machine images that use a rootfs on ZFS.
That benefits not only real world deployments, but also ZFSOnLinux
development by decreasing the time needed to perform rootfs on ZFS
experiments.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2417
To aid in detecting and debugging stack overflow issues make the
user space stack limit configurable via a new ZFS_STACK_SIZE
environment variable. The value assigned to ZFS_STACK_SIZE will
be used as the default stack size in bytes.
Because this is mainly useful as a debugging aid in conjunction
with ztest the stack limit is disabled by default. See the ztest(1)
man page for additional details on using the ZFS_STACK_SIZE
environment variable.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#2743
Issue #2293
Add support for the FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE mode of
fallocate(2). Mimic the behavior of other native file systems such as
ext4 in cases where the file might be extended. If the offset is beyond
the end of the file, return success without changing the file. If the
extent of the punched hole would extend the file, only the existing tail
of the file is punched.
Add the zfs_zero_partial_page() function, modeled after update_page(),
to handle zeroing partial pages in a hole-punching operation. It must
be used under a range lock for the requested region in order that the
ARC and page cache stay in sync.
Move the existing page cache truncation via truncate_setsize() into
zfs_freesp() for better source structure compatibility with upstream code.
Add page cache truncation to zfs_freesp() and zfs_free_range() to handle
hole punching.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#2619
nfsd uses do_readv_writev() to implement fops->read and fops->write.
do_readv_writev() will attempt to read/write using fops->aio_read and
fops->aio_write, but it will fallback to fops->read and fops->write when
AIO is not available. However, the fallback will perform a call for each
individual data page. Since our default recordsize is 128KB, sequential
operations on NFS will generate 32 DMU transactions where only 1
transaction was needed. That was unnecessary overhead and we implement
fops->aio_read and fops->aio_write to eliminate it.
ZFS originated in OpenSolaris, where the AIO API is entirely implemented
in userland's libc by intelligently mapping them to VOP_WRITE, VOP_READ
and VOP_FSYNC. Linux implements AIO inside the kernel itself. Linux
filesystems therefore must implement their own AIO logic and nearly all
of them implement fops->aio_write synchronously. Consequently, they do
not implement aio_fsync(). However, since the ZPL works by mapping
Linux's VFS calls to the functions implementing Illumos' VFS operations,
we instead implement AIO in the kernel by mapping the operations to the
VOP_READ, VOP_WRITE and VOP_FSYNC equivalents. We therefore implement
fops->aio_fsync.
One might be inclined to make our fops->aio_write implementation
synchronous to make software that expects this behavior safe. However,
there are several reasons not to do this:
1. Other platforms do not implement aio_write() synchronously and since
the majority of userland software using AIO should be cross platform,
expectations of synchronous behavior should not be a problem.
2. We would hurt the performance of programs that use POSIX interfaces
properly while simultaneously encouraging the creation of more
non-compliant software.
3. The broader community concluded that userland software should be
patched to properly use POSIX interfaces instead of implementing hacks
in filesystems to cater to broken software. This concept is best
described as the O_PONIES debate.
4. Making an asynchronous write synchronous is non sequitur.
Any software dependent on synchronous aio_write behavior will suffer
data loss on ZFSOnLinux in a kernel panic / system failure of at most
zfs_txg_timeout seconds, which by default is 5 seconds. This seems like
a reasonable consequence of using non-compliant software.
It should be noted that this is also a problem in the kernel itself
where nfsd does not pass O_SYNC on files opened with it and instead
relies on a open()/write()/close() to enforce synchronous behavior when
the flush is only guarenteed on last close.
Exporting any filesystem that does not implement AIO via NFS risks data
loss in the event of a kernel panic / system failure when something else
is also accessing the file. Exporting any file system that implements
AIO the way this patch does bears similar risk. However, it seems
reasonable to forgo crippling our AIO implementation in favor of
developing patches to fix this problem in Linux's nfsd for the reasons
stated earlier. In the interim, the risk will remain. Failing to
implement AIO will not change the problem that nfsd created, so there is
no reason for nfsd's mistake to block our implementation of AIO.
It also should be noted that `aio_cancel()` will always return
`AIO_NOTCANCELED` under this implementation. It is possible to implement
aio_cancel by deferring work to taskqs and use `kiocb_set_cancel_fn()`
to set a callback function for cancelling work sent to taskqs, but the
simpler approach is allowed by the specification:
```
Which operations are cancelable is implementation-defined.
```
http://pubs.opengroup.org/onlinepubs/009695399/functions/aio_cancel.html
The only programs on my system that are capable of using `aio_cancel()`
are QEMU, beecrypt and fio use it according to a recursive grep of my
system's `/usr/src/debug`. That suggests that `aio_cancel()` users are
rare. Implementing aio_cancel() is left to a future date when it is
clear that there are consumers that benefit from its implementation to
justify the work.
Lastly, it is important to know that handling of the iovec updates differs
between Illumos and Linux in the implementation of read/write. On Linux,
it is the VFS' responsibility whle on Illumos, it is the filesystem's
responsibility. We take the intermediate solution of copying the iovec
so that the ZFS code can update it like on Solaris while leaving the
originals alone. This imposes some overhead. We could always revisit
this should profiling show that the allocations are a problem.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#223Closes#2373
Some nvlist_t could be leaked in error handling paths.
Also make sure cb argument to zfs_zevent_post() cannnot
be NULL.
Signed-off-by: Isaac Huang <he.huang@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2158
Add #ifndef PAGESIZE to avoid redefinition warning on platforms
where this value is already provided.
Signed-off-by: stf <s@ctrlc.hu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#382
Add a new 'overlay' property (default 'off') that controls whether the
filesystem should be mounted even if the mountpoint is busy or if it
should fail with a 'mountpoint not empty'.
Doing overlay mounts is the default mount behavior on Linux, but not
in ZFS. It have been decided that following the ZFS behavior should
be the default, but this overlay allows for site administrator to
override this decision on a per-dataset basis.
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #2503
We should have included sys/taskq.h directly because we use the taskq
code here, but we instead had files that included sys/taskq.h also
include sys/kmem.h, which happened to include sys/taskq.h. sys/kmem.h no
longer does this, so we must define the include as we should
have done in the first place.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2411
zfsonlinux/spl#bcb15891ab394e11615eee08bba1fd85ac32e158 implemented
Linux 3.6+ support by adding duplicate vn_rename and vn_remove
functions. The new ones were cleaner, but the duplicate functions made
the codebase less maintainable. This adds some compatibility shims that
allow us to retire the older vn_rename and vn_remove in favor of the new
ones on old kernels. The result is a net 143 line reduction in lines of
code and a cleaner codebase.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#370
Most of the code base already uses va_list, which is specified by
iso-c. gcc/glibc provides 'typedef __gnuc_va_list va_list'. and
when not using gcc/glibc we can't expect to find __gnuc_va_list.
Signed-off-by: Alec Salazar <alec.j.salazar@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2588
Handle all iputs in zfs_purgedir() and zfs_inode_destroy()
asynchronously to prevent deadlocks. When the iputs are allowed
to run synchronously in the destroy call path deadlocks between
xattr directory inodes and their parent file inodes are possible.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#457
Linux kernel 3.17 removes the action function argument from
wait_on_bit(). Add autoconf test and compatibility macro to support
the new interface.
The former "wait_on_bit" interface required an 'action' function to
be provided which does the actual waiting. There were over 20 such
functions in the kernel, many of them identical, though most cases
can be satisfied by one of just two functions: one which uses
io_schedule() and one which just uses schedule(). This API change
was made to consolidate all of those redundant wait functions.
References: torvalds/linux@7431620
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#378
As part of commit e8b96c6 the search zio used by the
vdev_queue_io_to_issue() function was moved to the heap
to minimize stack usage. Functionally this is fine, but
to maximize performance it's best to minimize the number
of dynamic allocations.
To avoid this allocation temporary space for the search
zio has been reserved in the vdev_queue structure. All
access must be serialized through the vq_lock.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#2572
4390 i/o errors when deleting filesystem/zvol can lead to space map corruption
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/4390https://github.com/illumos/illumos-gate/commit/7fd05ac
Porting notes:
Previous stack-reduction efforts in traverse_visitb() caused a fair
number of un-mergable pieces of code. This patch should reduce its
stack footprint a bit more.
The new local bptree_entry_phys_t in bptree_add() is dynamically-allocated
using kmem_zalloc() for the purpose of stack reduction.
The new global zfs_free_leak_on_eio has been defined as an integer
rather than a boolean_t as was the case with the related zfs_recover
global. Also, zfs_free_leak_on_eio's definition has been inserted into
zfs_debug.c for consistency with the existing definition of zfs_recover.
Illumos placed it in spa_misc.c.
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2545
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Description from Matt Ahrens's bug report at Delphix:
Add a new zfs property, "redundant_metadata" which can have values
"all" or "most". The default will be "all", which is the current
behavior. Setting to "most" will cause us to only store 1 copy of
level-1 indirect blocks of user data files.
Additional notes:
The new man page section for this property states
"The exact behavior of which metadata blocks
are stored redundantly may change in future releases."
and:
"When set to most, ZFS stores an extra copy of most types of
metadata. This can improve performance of random writes,
because less metadata must be written."
The current implementation is as described above in Matt's blog.
It is controlled by a new global integer
"zfs_redundant_metadata_most_ditto_level", currently initialized
to 2. When "redundant_metadata" is set to "most", only indirect
blocks of the specified level and higher will have additional ditto
blocks created.
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2542
The atomic_swap_32() function maps to atomic_xchg(), and
the atomic_swap_64() function maps to atomic64_xchg().
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#377
This loop in dmu_objset_write_ready():
for (i = 0; i < dnp->dn_nblkptr; i++)
bp->blk_fill += dnp->dn_blkptr[i].blk_fill;
invokes _undefined behavior_ for the (common) case of dn_nblkptr=3,
therefore, the compiler is free to do whatever it wants (such as
optimizing it away, or otherwise messing up your expections).
The fix is to be honest about the array size.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2511Closes#2010
Added highbit64() and howmany() which are used in recent upstream
code. Both highbit() and highbit64() should at some point be
re-factored to use the optimized fls() and fls64() functions.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#363
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101https://www.illumos.org/issues/4102https://www.illumos.org/issues/4103https://www.illumos.org/issues/4105https://www.illumos.org/issues/4106https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2488
Update the current code to ensure inodes are never dirtied if they are
part of a read-only file system or snapshot. If they do somehow get
dirtied an attempt will make made to write them to disk. In the case
of snapshots, which don't have a ZIL, this will result in a NULL
dereference in zil_commit().
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2405
Spl's debugging and assertion macros macro used the typical do/while(0)
form for if/else friendliness, however, this limits their use in contexts
where a do loop is not valid; such as within another multi-statement
style macro.
The following macros have been converted to not use do/while(0):
PANIC, ASSERT, ASSERTF, VERIFY, VERIFY3_IMPL
PANIC has been converted to a wrapper around the new spl_PANIC() function.
The other macros have been converted to use the "&&" operator for the
branch-predicition conditional and also to use spl_PANIC().
The __ASSERT() macro was not touched. It is only used by the debugging
infrastructure and that code, including this macro, will be retired when
the tracepoint patches are merged.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#367