ZFS on Linux originally implemented xattr namespaces in a way that is
incompatible with other operating systems. On illumos, xattrs do not
have namespaces. Every xattr name is visible. FreeBSD has two
universally defined namespaces: EXTATTR_NAMESPACE_USER and
EXTATTR_NAMESPACE_SYSTEM. The system namespace is used for protected
FreeBSD-specific attributes such as MAC labels and pnfs state. These
attributes have the namespace string "freebsd:system:" prefixed to the
name in the encoding scheme used by ZFS. The user namespace is used
for general purpose user attributes and obeys normal access control
mechanisms. These attributes have no namespace string prefixed, so
xattrs written on illumos are accessible in the user namespace on
FreeBSD, and xattrs written to the user namespace on FreeBSD are
accessible by the same name on illumos.
Linux has several xattr namespaces. On Linux, ZFS encodes the
namespace in the xattr name for every namespace, including the user
namespace. As a consequence, an xattr in the user namespace with the
name "foo" is stored by ZFS with the name "user.foo" and therefore
appears on FreeBSD and illumos to have the name "user.foo" rather than
"foo". Conversely, none of the xattrs written on FreeBSD or illumos
are accessible on Linux unless the name happens to be prefixed with one
of the Linux xattr namespaces, in which case the namespace is stripped
from the name. This makes xattrs entirely incompatible between Linux
and other platforms.
We want to make the encoding of user namespace xattrs compatible across
platforms. A critical requirement of this compatibility is for xattrs
from existing pools from FreeBSD and illumos to be accessible by the
same names in the user namespace on Linux. It is also necessary that
existing pools with xattrs written by Linux retain access to those
xattrs by the same names on Linux. Making user namespace xattrs from
Linux accessible by the correct names on other platforms is important.
The handling of other namespaces is not required to be consistent.
Add a fallback mechanism for listing and getting xattrs to treat xattrs
as being in the user namespace if they do not match a known prefix.
Do not allow setting or getting xattrs with a name that is prefixed
with one of the namespace names used by ZFS on supported platforms.
Allow choosing between legacy illumos and FreeBSD compatibility and
legacy Linux compatibility with a new tunable. This facilitates
replication and migration of pools between hosts with different
compatibility needs.
The tunable controls whether or not to prefix the namespace to the
name. If the xattr is already present with the alternate prefix,
remove it so only the new version persists. By default the platform's
existing convention is used.
Reviewed-by: Christian Schwarz <christian.schwarz@nutanix.com>
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#11919
It's the only one actually used
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes#12901
Add hooks for when spa is created, exported, activated and
deactivated. Used by macOS to attach iokit, and lock
kext as busy (to stop unloads).
Userland, Linux, and, FreeBSD have empty stubs.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes#12801
Unfortunately macOS has obj-C keyword "fallthrough" in the OS headers.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes#13097
This allows reads/writes caused by accesses to mmap files to be
accounted correctly in the per-dataset kstats for both Linux and
FreeBSD.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Signed-off-by: Matthias Blankertz <matthias@blankertz.org>
Closes#12994Closes#13044
To follow a change in illumos taskq
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes#12802
There's no need to make the platform ops dynamic dispatch.
This change replaces the dynamic dispatch with static calls to the
platform-specific functions.
To avoid name collisions, prefix all platform-specific functions
with `zvol_os_`.
I actually find `zvol_..._os` slightly nicer to read in the calling
code, but having it as a prefix is useful.
Advantage:
- easier jump-to-definition / grepping
- potential benefits to static analysis
- better legibility
Future work: also prefix remaining `static` functions in zvol_os.c.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christian Schwarz <christian.schwarz@nutanix.com>
Closes#12965
69 CSTYLED BEGINs remain, appx. 30 of which can be removed if cstyle(1)
had a useful policy regarding
CALL(ARG1,
ARG2,
ARG3);
above 2 lines. As it stands, it spits out *both*
sysctl_os.c: 385: continuation line should be indented by 4 spaces
sysctl_os.c: 385: indent by spaces instead of tabs
which is very cool
Another >10 could be fixed by removing "ulong" &al. handling.
I don't foresee anyone actually using it intentionally
(does it even exist in modern headers? why did it in the first place?).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes#12993
When using the two argument version of submit_bio() in kernel's prior
to 4.8 the first argument should be specified. It's used by block
dump to report the bio direction.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Finix Yan <yancw@info2soft.com>
Closes#13006
Linux 5.17 sees a rename from complete_and_exit()
to kthread complete_and_exit()
Upstream commit cead18552660702a4a46f58e65188fe5f36e9dfe
("exit: Rename complete_and_exit to kthread_complete_and_exit")
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes#12989
For us, I think it's always just FALLOC_FL_PUNCH_HOLE with a fake
mustache on.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Coleman Kane <ckane@colemankane.org>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#12975
add_disk went from void to must-check int return.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Coleman Kane <ckane@colemankane.org>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#12975
Raw receiving a snapshot back to the originating dataset is currently
impossible because of user accounting being present in the originating
dataset.
One solution would be resetting user accounting when raw receiving on
the receiving dataset. However, to recalculate it we would have to dirty
all dnodes, which may not be preferable on big datasets.
Instead, we rely on the os_phys flag
OBJSET_FLAG_USERACCOUNTING_COMPLETE to indicate that user accounting is
incomplete when raw receiving. Thus, on the next mount of the receiving
dataset the local mac protecting user accounting is zeroed out.
The flag is then cleared when user accounting of the raw received
snapshot is calculated.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#12981Closes#10523Closes#11221Closes#11294Closes#12594
Issue #11300
Evaluated every variable that lives in .data (and globals in .rodata)
in the kernel modules, and constified/eliminated/localised them
appropriately. This means that all read-only data is now actually
read-only data, and, if possible, at file scope. A lot of previously-
global-symbols became inlinable (and inlined!) constants. Probably
not in a big Wowee Performance Moment, but hey.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes#12899
sc.nr_to_scan is an input to super_cache_clean (via
shrinker->scan_objects), used to set the number of objects to scan
in the various caches. However super_cache_scan also modifies
sc.nr_to_scan, so when used in a loop we need to reset
sc.nr_to_scan back to our desired nr_to_scan for the next
iteration.
Issue discovered and solution suggested by
Tenzin Lhakhang @tlhakhan.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Issue #12433Closes#12908
Provide access to file generation number on Linux.
Add test coverage.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Closes#12856
When restructuring the zvol_open() logic for the Linux 5.13 kernel
a lock inversion was accidentally introduced. In the updated code
the spa_namespace_lock is now taken before the zv_suspend_lock
allowing the following scenario to occur:
down_read <=== waiting for zv_suspend_lock
zvol_open <=== holds spa_namespace_lock
__blkdev_get
blkdev_get_by_dev
blkdev_open
...
mutex_lock <== waiting for spa_namespace_lock
spa_open_common
spa_open
dsl_pool_hold
dmu_objset_hold_flags
dmu_objset_hold
dsl_prop_get
dsl_prop_get_integer
zvol_create_minor
dmu_recv_end
zfs_ioc_recv_impl <=== holds zv_suspend_lock via zvol_suspend()
zfs_ioc_recv
...
This commit resolves the issue by moving the acquisition of the
spa_namespace_lock back to after the zv_suspend_lock which restores
the original ordering.
Additionally, as part of this change the error exit paths were
simplified where possible.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#12863
The definition of struct blkcg_gq was moved into blk-cgroup.h, which is
a header that's been in Linux since 2015. This is used by
vdev_blkg_tryget() in module/os/linux/zfs/vdev_disk.c. Since the kernel
for CentOS 7 and similar-generation releases doesn't have this header,
its inclusion is guarded by a configure test.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes#12819
This change adds a confiugre check to determine if bio_set_dev is a
helper macro or not. If not, then the attempt to override its internal
call to bio_associate_blkg(), with a macro definition to our own
version, is no longer possible, as the compiler won't use it when
compiling the new inline function replacement implemented in the header.
This change also creates a new vdev_bio_set_dev() function that performs
the same work, and also performs the work implemented in
vdev_bio_associate_blkg(), as it is the only thing calling that function
in our code. Our custom vdev_bio_associate_blkg() is now only compiled
if the bio_set_dev() is a macro in the Linux headers.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes#12819
The iov_iter->type member was renamed iov_iter->iter_type. However,
while looking into this, realized that in 2018 a iov_iter_type(*iov)
accessor function was introduced. So if that is present, use it,
otherwise fall back to trying the existing behavior of directly
accessing type from iov_iter.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes#12819
The return type for the submit_bio member of struct
block_device_operations was changed to no longer return a value.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes#12819
Due to a possible lock inversion the zvol open call path on Linux
needs to be able to retry in the case where the spa_namespace_lock
cannot be acquired.
For Linux 5.12 an older kernel this was accomplished by returning
-ERESTARTSYS from zvol_open() to request that blkdev_get() drop
the bdev->bd_mutex lock, reaquire it, then call the open callback
again. However, as of the 5.13 kernel this behavior was removed.
Therefore, for 5.12 and older kernels we preserved the existing
retry logic, but for 5.13 and newer kernels we retry internally in
zvol_open(). This should always succeed except in the case where
a pool's vdev are layed on zvols, in which case it may fail. To
handle this case vdev_disk_open() has been updated to retry when
opening a device when -ERESTARTSYS is returned.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #12301Closes#12759
Add properties, similar to pool properties, to each vdev.
This makes use of the existing per-vdev ZAP that was added as
part of device evacuation/removal.
A large number of read-only properties are exposed,
many of the members of struct vdev_t, that provide useful
statistics.
Adds support for read-only "removing" vdev property.
Adds the "allocating" property that defaults to "on" and
can be set to "off" to prevent future allocations from that
top-level vdev.
Supports user-defined vdev properties.
Includes support for properties.vdev in SYSFS.
Co-authored-by: Allan Jude <allan@klarasystems.com>
Co-authored-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes#11711
Instead, linux/pagemap.h offers a number of folio-specific functions to
be called instead. In this case, module/os/linux/zfs/zfs_vnops_os.c
wants to call wait_on_page_bit(pp, PG_writeback). This gets replaced
with folio_wait_bit(folio_page(pp), PG_writeback). This change modifies
the code to conditionally compile that if configure identifies th
presence of the folio_wait_bit() function.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes#12800
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#12771
The submit_bio() prototype has changed again. The version is 5.16
still only expects a single argument but the return type has changed
to void. Since we never used the returned value before update the
configure check to detect both single arg versions.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#12725
On Linux, sometimes, when ZFS goes to unmount an automounted snap,
it fails a VERIFY check on debug builds, because taskq_cancel_id
returned ENOENT after not finding the taskq it was trying to cancel.
This presumably happens when it already died for some reason; in this
case, we don't really mind it already being dead, since we're just
going to dispatch a new task to unmount it right after.
So we just ignore it if we get back ENOENT trying to cancel here,
retry a couple times if we get back the only other possible condition
(EBUSY), and log to dbgmsg if we got anything but ENOENT or success.
(We also add some locking around taskqid, to avoid one or two cases
of two instances of trying to cancel something at once.)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#11632Closes#12670
We move the spinlock unlock before the thread creation. This should be
safe because the thread creation code doesn't actually manipulate any
taskq data structures; that's done by the thread once it's created.
We also remove the assertion that the maxthreads is the current threads
plus one; that assertion could fail if multiple hotplug events come in
quick succession, and the first new taskq thread hasn't had a chance to
start processing yet.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
eviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#12714
As of the Linux 5.9 kernel a fallthrough macro has been added which
should be used to anotate all intentional fallthrough paths. Once
all of the kernel code paths have been updated to use fallthrough
the -Wimplicit-fallthrough option will because the default. To
avoid warnings in the OpenZFS code base when this happens apply
the fallthrough macro.
Additional reading: https://lwn.net/Articles/794944/
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#12441
Kernel commits
332f606b32b6 ovl: enable RCU'd ->get_acl()
0cad6246621b vfs: add rcu argument to ->get_acl() callback
Added compatibility code to detect the new ->get_acl() interface
and correctly handle the case where the new rcu argument is set.
Reviewed-by: Coleman Kane <ckane@colemankane.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#12548
Linux 4.11 added a new statx system call that allows us to expose crtime
as btime. We do this by caching crtime in the znode to match how atime,
ctime and mtime are cached in the inode.
statx also introduced a new way of reporting whether the immutable,
append and nodump bits have been set. It adds support for reporting
compression and encryption, but the semantics on other filesystems is
not just to report compression/encryption, but to allow it to be turned
on/off at the file level. We do not support that.
We could implement semantics where we refuse to allow user modification
of the bit, but we would need to do a dnode_hold() in zfs_znode_alloc()
to find out encryption/compression information. That would introduce
locking that will have a minor (although unmeasured) performance cost.
It also would be inferior to zdb, which reports far more detailed
information. We therefore omit reporting of encryption/compression
through statx in favor of recommending that users interested in such
information use zdb.
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#8507
When a header is allocated for full overwrite it is a waste of time
to allocate b_pabd/b_rabd for it, since arc_write() will free them
without ever being touched. If it is a read or a partial overwrite
then arc_read() and arc_hdr_decrypt() allocate them explicitly.
Reduced memory allocation in user threads also reduces ARC eviction
throttling there, proportionally increasing it in ZIO threads, that
is not good. To minimize or even avoid it introduce ARC allocation
reserve, allowing certain arc_get_data_abd() callers to allocate a
bit longer in situations where user threads will already throttle.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#12398
Before OpenZFS 2.0, trying to set the FreeBSD sysctl vfs.zfs.arc_max
to a disallowed value would return an error.
Since the switch, it instead only generates WARN_IF_TUNING_IGNORED
Keep the ability to set the sysctl's specifically to 0, even though
that is less than the minimum, because some tests depend on this.
Also lost, was the ability to set vfs.zfs.arc_max to a value less
than the default vfs.zfs.arc_min at boot time. Restore this as well.
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes#12161
The /proc/diskstats accounting needs to be explicitly enabled
for block devices which do not use multi-queue.
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#12440Closes#12066
CpaDcGeneratefooter function that obtain the checksum code
does not support the CPA_DC_STATELESS mode. So we get the
adler32 chencksum of the end of the zlib from dc_results.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chengfei Zhu <chengfeix.zhu@intel.com>
Signed-off-by: hedong.zhang <h_d_zhang@163.com>
Closes#12343
It is wrong for arc_write_ready() to use zfs_abd_scatter_enabled to
decide whether to reallocate/copy the buffer, because the answer is
OS-specific and depends on the buffer size. Instead of that use
abd_size_alloc_linear(), moved into public header.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes#12425
Kernel 5.14 introduced a change where set_page_dirty of
struct address_space_operations is no longer implicitly set to
__set_page_dirty_buffers(), which ended up resulting in a NULL
pointer deref in the kernel when it is attempted to be called.
This change sets .set_page_dirty in the structure to
__set_page_dirty_nobuffers(), which was introduced with the
related patch set. The breaking change was introduce in commit
0af573780b0b13fceb7fabd49dc1b073cee9a507 to torvalds/linux.git.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes#12427
In Linux 5.14, blk_alloc_queue is no longer exported, and its usage
has been superseded by blk_alloc_disk, which returns a gendisk struct
from which we can still retrieve the struct request_queue* that is
needed in the one place where it is used. This also replaces the call
to alloc_disk(minors), and minors is now set via struct member
assignment.
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Coleman Kane <ckane@colemankane.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#12362Closes#12409
* zio: avoid callback typecasting
* zil: avoid zil_itxg_clean() callback typecasting
* zpl: decouple zpl_readpage() into two separate callbacks
* nvpair: explicitly declare callbacks for xdr_array()
* linux/zfs_nvops: don't use external iput() as a callback
* zcp_synctask: don't use fnvlist_free() as a callback
* zvol: don't use ops->zv_free() as a callback for taskq_dispatch()
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Closes#12260
Callers of zfs_file_get and zfs_file_put can corrupt the reference
counts for the file structure resulting in a panic or a soft lockup.
When zfs send/recv runs, it will add a reference count to the
open file, and begin to send or recv the stream. If the file descriptor
is closed, then when dmu_recv_stream() or dmu_send() return we will
call zfs_file_put to remove the reference we placed on the file
structure. Unfortunately, because zfs_file_put() uses the file
descriptor to lookup the file structure, it may end up finding that
the file descriptor table no longer contains the file struct, thus
leaking the file structure. Or it might end up finding a file
descriptor for a different file and blindly updating its reference
counts. Other failure modes probably exists.
This change reworks the zfs_file_[get|put] interface to not rely
on the file descriptor but instead pass the zfs_file_t pointer around.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: George Wilson <gwilson@delphix.com>
External-issue: DLPX-76119
Closes#12299
Fix a leak of abd_t that manifested mostly when using
raidzN with at least as many columns as N (e.g. a
four-disk raidz2 but not a three-disk raidz2).
Sufficiently heavy raidz use would eventually run a system
out of memory.
Additionally:
* Switch abd_cache arena to FIRSTFIT, which empirically
improves perofrmance.
* Make abd_chunk_cache more performant and debuggable.
* Allocate the abd_zero_buf from abd_chunk_cache rather
than the heap.
* Don't try to reap non-existent qcaches in abd_cache arena.
* KM_PUSHPAGE->KM_SLEEP when allocating chunks from their
own arena
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Co-authored-by: Sean Doran <smd@use.net>
Closes#12295
Compiling with gcc 11.1.0 produces three new warnings.
Change the code slightly to avoid them.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes#12130Closes#12188Closes#12237
ZFS loves using %llu for uint64_t, but that requires a cast to not
be noisy - which is even done in many, though not all, places.
Also a couple places used %u for uint64_t, which were promoted
to %llu.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#12233
In all places except two spa_get_random() is used for small values,
and the consumers do not require well seeded high quality values.
Switch those two exceptions directly to random_get_pseudo_bytes()
and optimize spa_get_random(), renaming it to random_in_range(),
since it is not related to SPA or ZFS in general.
On FreeBSD directly map random_in_range() to new prng32_bounded() KPI
added in FreeBSD 13. On Linux and in user-space just reduce the type
used to uint32_t to avoid more expensive 64bit division.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#12183
wmsum was designed exactly for cases like these with many updates
and rare reads. It allows to completely avoid atomic operations on
congested global variables.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#12172
In zfs_znode_alloc we always hash inodes. If the
znode is unlinked, we do not need to hash it. This
fixes the problem where zfs_suspend_fs is doing zrele
(iput) in an async fashion, and zfs_resume_fs unlinked
drain processing will try to hash an inode that could
still be hashed, resulting in a panic.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alan Somers <asomers@gmail.com>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes#9741Closes#11223Closes#11648Closes#12210
This mostly reverts "3537 want pool io kstats" commit of 8 years ago.
From one side this code using pool-wide locks became pretty bad for
performance, creating significant lock contention in I/O pipeline.
From another, there are more efficient ways now to obtain detailed
statistics, while this statistics is illumos-specific and much less
usable on Linux and FreeBSD, reported only via procfs/sysctls.
This commit does not remove KSTAT_TYPE_IO implementation, that may
be removed later together with already unused KSTAT_TYPE_INTR and
KSTAT_TYPE_TIMER.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#12212
`getfsstat(2)` is used to retrieve the list of mounted file systems,
which libzfs uses when fetching properties like mountpoint, atime,
setuid, etc. The `mode` parameter may be `MNT_NOWAIT`, which uses
information in the VFS's cache, or `MNT_WAIT`, which effectively does a
`statfs` on every single mounted file system in order to fetch the most
up-to-date information. As far as I can tell, the only fields that
libzfs cares about are the filesystem's name, mountpoint, fstypename,
and mount flags. Those things are always updated on mount and unmount,
so they will always be accurate in the VFS's mount cache except in two
circumstances:
1) When a file system is busy unmounting
2) When a ZFS file system changes the value of a mount-overridable
property like atime or setuid, but doesn't remount the file system.
Right now that only happens when the property is changed by an
unprivileged user who has delegated authority to change the property
but not to mount the dataset. But perhaps libzfs could choose to do
it for other reasons in the future.
Switching to `MNT_NOWAIT` will greatly improve speed with no downside,
as long as we explicitly update the mount cache whenever we change a
mount-overridable property.
For comparison, Illumos gets this information using the native
`getmntany` and `getmntent` functions, which also use cached
information. The illumos function that would refresh the cache,
`resetmnttab`, is never called by libzfs.
And on GNU/Linux, `getmntany` and `getmntent` don't even communicate
with the kernel directly. They simply parse the file they are given,
which is usually /etc/mtab or /proc/mounts. Perhaps the implementation
of /proc/mounts is synchronous, ala MNT_WAIT; I don't know.
Sponsored-by: Axcient
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alan Somers <asomers@gmail.com>
Closes: #12091
For small objects the kernel's slab implementation is very fast and
space efficient. However, as the allocation size increases to
require multiple pages performance suffers. The SPL kmem cache
allocator was designed to better handle these large allocation
sizes. Therefore, on Linux the kmem_cache_* compatibility wrappers
prefer to use the kernel's slab allocator for small objects and
the custom SPL kmem cache allocator for larger objects.
This logic was effectively disabled for all architectures using
a non-4K page size which caused all kmem caches to only use the
SPL implementation. Functionally this is fine, but the SPL code
which calculates the target number of objects per-slab does not
take in to account that __vmalloc() always returns page-aligned
memory. This can result in a massive amount of wasted space when
allocating tiny objects on a platform using large pages (64k).
To resolve this issue we set the spl_kmem_cache_slab_limit cutoff
to 16K for all architectures.
This particular change does not attempt to update the logic used
to calculate the optimal number of pages per slab. This remains
an issue which should be addressed in a future change.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#12152Closes#11429Closes#11574Closes#12150
Both were removed in 4fbdb10c7b ("remove
kmem_cache module parameter KMC_EXPIRE_AGE")
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes#12157
The additional iter advance is incorrect, as copy_from_iter() has
already done the right thing. This will result in the following
warning being printed to the console as of the 5.12 kernel.
Attempted to advance past end of bvec iter
This change should have been included with #11378 when a
similar change was made on the read side.
Suggested-by: @siebenmann
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Issue #11378Closes#12041Closes#12155
Just like #12087, the set_acl signature changed with all the bolted-on
*userns parameters, which disabled set_acl usage, and caused #12076.
Turn zpl_set_acl into zpl_set_acl and zpl_set_acl_impl, and add a
new configure test for the new version.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#12076Closes#12093
Linux changed the tmpfile() signature again in torvalds/linux@6521f89,
which in turn broke our HAVE_TMPFILE detection in configure.
Update that macro to include the new case, and change the signature of
zpl_tmpfile as appropriate.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes: #12060Closes: #12087
Commit d1d4769 takes into account the encryption key version to
decide if the local_mac could be zeroed out. However, this could lead
to failure mounting encrypted datasets created with intermediate
versions of ZFS encryption available in master between major releases.
In order to prevent this situation revert d1d4769 pending a more
comprehensive fix which addresses the mount failure case.
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #11294
Issue #12025
Issue #12300Closes#12033
Linux kernel commit 0f00b82e5413571ed225ddbccad6882d7ea60bc7 removes the
revalidate_disk() handler from struct block_device_operations. This
caused a regression, and this commit eliminates the call to it and the
assignment in the block_device_operations static handler assignment
code, when configure identifies that the kernel doesn't support that
API handler.
Reviewed-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes#11967Closes#11977
Reviewed-by: Adam Moss <c@yotes.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#11972
zfs_log_create returns void, so there is no reason to cast its return
value to void at the call site.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#11994
Quoting <linux/exportfs.h>:
> encode_fh() should return the fileid_type on success and on error
> returns 255 (if the space needed to encode fh is greater than
> @max_len*4 bytes). On error @max_len contains the minimum size (in 4
> byte unit) needed to encode the file handle.
ZFS was not setting max_len in the case where the handle was too
small. As a result of this, the `t_name_to_handle_at.c' example in
name_to_handle_at(2) did not work on ZFS.
zfsctl_fid() will itself set max_len if called with a fid that is too
small, so if we give zfs_fid() that behavior as well, the fix is quite
easy: if the handle is too small, just use a zero-size fid instead of
the handle.
Tested by running t_name_to_handle_at on a normal file, a directory, a
.zfs directory, and a snapshot.
Thanks-to: Puck Meerburg <puck@puckipedia.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Closes#11995
Remove some extra whitespace.
Use pointer-typed asserts in Linux's znode cache destructor for more
info when debugging.
Simplify a couple of conversions from inode to znode when we already
have the znode.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#11974
This fixes /proc/sys/kernel/spl/hostid on kernels with mainline commit
32927393dc1ccd60fb2bdc05b9e8e88753761469 ("sysctl: pass kernel pointers
to ->proc_handler") ‒ 5.7-rc1 and up
The access_ok() check in copy_to_user() in proc_copyout_string() would
always fail, so all userspace reads and writes would fail with EINVAL
proc_dostring() strips only the final new-line,
but simple_strtoul() doesn't actually need a back-trimmed string ‒
writing "012345678 \n" is still allowed, as is "012345678zupsko", &c.
This alters what happens when an invalid value is written ‒
previously it'd get set to what-ever simple_strtoul() returned
(probably 0, thereby resetting it to default), now it does nothing
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes#11878Closes#11879
This change adds SIGSTOP and SIGTSTP handling to the issig function;
this mirrors its behavior on Solaris. This way, long running kernel
tasks can be stopped with the appropriate signals. Note that doing
so with ctrl-z on the command line doesn't return control of the tty
to the shell, because tty handling is done separately from stopping
the process. That can be future work, if people feel that it is a
necessary addition.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Issue #810
Issue #10843Closes#11801
After 3937ab20f zfsdev_get_state_impl can become zfsdev_get_state.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Closes#11833
SMACK needs to have the ZFS dentry security field setup before
SMACK's d_instantiate() hook is called as it requires functioning
'__vfs_getxattr()' calls to properly set the labels.
Fxes:
1) file instantiation properly setting the object label to the
subject's label
2) proper file labeling in a transmutable directory
Functions Updated:
1) zpl_create()
2) zpl_mknod()
3) zpl_mkdir()
4) zpl_symlink()
External-issue: https://github.com/cschaufler/smack-next/issues/1
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: TerraTech <TerraTech@users.noreply.github.com>
Closes#11646Closes#11839
`kmem_alloc(size>PAGESIZE, KM_SLEEP)` is backed by `kmalloc()`, which
finds contiguous physical memory. If there isn't enough contiguous
physical memory available (e.g. due to physical page fragmentation), the
OOM killer will be invoked to make more memory available. This is not
ideal because processes may be killed when there is still plenty of free
memory (it just happens to be in individual pages, not contiguous runs
of pages). We have observed this when allocating the ~13KB `zfs_cmd_t`,
for example in `zfsdev_ioctl()`.
This commit changes the behavior of
`kmem_alloc(size>PAGESIZE, KM_SLEEP)` when there are insufficient
contiguous free pages. In this case we will find individual pages and
stitch them together using virtual memory. This is accomplished by
using `kvmalloc()`, which implements the described behavior by trying
`kmalloc(__GFP_NORETRY)` and falling back on `vmalloc()`.
The behavior of `kmem_alloc(KM_NOSLEEP)` is not changed; it continues to
use `kmalloc(GPF_ATOMIC | __GFP_NORETRY)`. This is because `vmalloc()`
may sleep.
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#11461
Correct an assortment of typos throughout the code base.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#11774
We have exclusive access to our zfsdev state object in this section
until it is invalidated by setting zs_minor to -1, so we can destroy
the state without taking a lock if we do the invalidation last, after
a member to ensure correct ordering.
While here, strengthen the assertions that zs_minor is valid when we
enter.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Closes#11751
Other (all?) Linux filesystems seem to return -EPERM instead of -EACCESS
when trying to set FS_APPEND_FL or FS_IMMUTABLE_FL without the
CAP_LINUX_IMMUTABLE capability. This was detected by generic/545 test
in the fstest suite.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Luis Henriques <henrix@camandro.org>
Closes#11791
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#11775
It used to be required to pass a enum km_type to kmap_atomic() and
kunmap_atomic(), however this is no longer necessary and the wrappers
zfs_k(un)map_atomic removed these. This is confusing in the ABD code as
the struct abd_iter member iter_km no longer exists and the wrapper
macros simply compile them out.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam Moss <c@yotes.com>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes#11768
The BIO_MAX_PAGES macro is being retired in favor of a bio_max_segs()
function that implements the typical MIN(x,y) logic used throughout the
kernel for bounding the allocation, and also the new implementation is
intended to be signed-safe (which the former was not).
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes#11765
In Linux 5.12, the filesystem API was modified to support ipmapped
mounts by adding a "struct user_namespace *" parameter to a number
functions and VFS handlers. This change adds the needed autoconf
macros to detect the new interfaces and updates the code appropriately.
This change does not add support for idmapped mounts, instead it
preserves the existing behavior by passing the initial user namespace
where needed. A subsequent commit will be required to add support
for idmapped mounted.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes#11712
zhold() wraps igrab() on Linux, and igrab() may fail when the inode
is in the process of being deleted. This means zhold() must only be
called when a reference exists and therefore it cannot be deleted.
This is the case for all existing consumers so add a VERIFY and a
comment explaining this requirement.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Adam Moss <c@yotes.com>
Closes#11704
This will allow platforms to implement it as they see fit, in particular
in a different manner than rrm locks.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes#11153
zil_replaying(zil, tx) has the side-effect of informing the ZIL that an
entry has been replayed in the (still open) tx. The ZIL uses that
information to record the replay progress in the ZIL header when that
tx's txg syncs.
ZPL log entries are not idempotent and logically dependent and thus
calling zil_replaying() is necessary for correctness.
For ZVOLs the question of correctness is more nuanced: ZVOL logs only
TX_WRITE and TX_TRUNCATE, both of which are idempotent. Logical
dependencies between two records exist only if the write or discard
request had sync semantics or if the ranges affected by the records
overlap.
Thus, at a first glance, it would be correct to restart replay from
the beginning if we crash before replay completes. But this does not
address the following scenario:
Assume one log record per LWB.
The chain on disk is
HDR -> 1:W(1, "A") -> 2:W(1, "B") -> 3:W(2, "X") -> 4:W(3, "Z")
where N:W(O, C) represents log entry number N which is a TX_WRITE of C
to offset A.
We replay 1, 2 and 3 in one txg, sync that txg, then crash.
Bit flips corrupt 2, 3, and 4.
We come up again and restart replay from the beginning because
we did not call zil_replaying() during replay.
We replay 1 again, then interpret 2's invalid checksum as the end
of the ZIL chain and call replay done.
The replayed zvol content is "AX".
If we had called zil_replaying() the HDR would have pointed to 3
and our resumed replay would not have replayed anything because
3 was corrupted, resulting in zvol content "BX".
If 3 logically depends on 2 then the replay corrupted the ZVOL_OBJ's
contents.
This patch adds the zil_replaying() calls to the replay functions.
Since the callbacks in the replay function need the zilog_t* pointer
so that they can call zil_replaying() we open the ZIL while
replaying in zvol_create_minor(). We also verify that replay has
been done when on-demand-opening the ZIL on the first modifying
bio.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes#11667
ZFS_READONLY represents the "DOS R/O" attribute.
When that flag is set, we should behave as if write access
were not granted by anything in the ACL. In particular:
We _must_ allow writes after opening the file r/w, then
setting the DOS R/O attribute, and writing some more.
(Similar to how you can write after fchmod(fd, 0444).)
Restore these semantics which were lost on FreeBSD when refactoring
zfs_write. To my knowledge Linux does not actually expose this flag,
but we'll need it to eventually so I've added the supporting checks.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#11693
When populating a ZIL destination buffer ensure it is always
zeroed before its contents are constructed.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Tom Caputi <caputit1@tcnj.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#11687
The spl_kmem_alloc showed up in some flamegraphs in a single-threaded
4k sync write workload at 85k IOPS on an
Intel(R) Xeon(R) Silver 4215 CPU @ 2.50GHz.
Certainly not a huge win but I believe the change is clean and
easy to maintain down the road.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes#11666
The struct bio member bi_disk was moved underneath a new member named
bi_bdev. So all attempts to reference bio->bi_disk need to now become
bio->bi_bdev->bd_disk.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes#11639
On Linux increase the maximum allowed size of the src nvlist which
can be passed to the /dev/zfs ioctl. Originally, this was set
to a maximum of KMALLOC_MAX_SIZE (4M) because it was kmalloc'd.
Since that time it's been converted to a vmalloc so that's no
longer a hard limit, and it's desirable for `zfs send/recv` to
allow larger nvlists so more snapshots can be sent at once.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#6572Closes#11638
zfs_znode_update_vfs is a more platform-agnostic name than
zfs_inode_update. Besides that, the function's prototype is moved to
include/sys/zfs_znode.h as the function is also used in common code.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ka Ho Ng <khng300@gmail.com>
Sponsored by: The FreeBSD Foundation
Closes#11580
The first time through the loop prevdb and prevhdl are NULL. They
are then both set, but only prevdb is checked. Add an ASSERT to
make it clear that prevhdl must be set when prevdb is.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Kleber <klebertarcisio@yahoo.com.br>
Closes#10754Closes#11575
`__vdev_disk_physio()` uses `abd_nr_pages_off()` to allocate a bio with
a sufficient number of iovec's to process this zio (i.e.
`nr_iovecs`/`bi_max_vecs`). If there are not enough iovec's in the bio,
then additional bio's will be allocated. However, this is a sub-optimal
code path. In particular, it requires several abd calls (to
`abd_nr_pages_off()` and `abd_bio_map_off()`) which will have to walk
the constituents of the ABD (the pages or the gang children) because
they are looking for offsets > 0.
For gang ABD's, `abd_nr_pages_off()` returns the number of iovec's
needed for the first constituent, rather than the sum of all
constituents (within the requested range). This always under-estimates
the required number of iovec's, which causes us to always need several
bio's. The end result is that `__vdev_disk_physio()` is usually O(n^2)
for gang ABD's (and occasionally O(n^3), when more than 16 bio's are
needed).
This commit fixes `abd_nr_pages_off()`'s handling of gang ABD's, to
correctly determine how many iovec's are needed, by adding up the number
of iovec's for each of the gang children in the requested range.
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#11536