Without this patch, the following crash can occur when
a file system is configured with "xattr=dir".
VNASSERT failed: locked not true at
/posix-acl/freebsd-rdma/sys/kern/vfs_subr.c:5786 (assert_vop_locked)
hold count flags ()
flags ()
lock type zfs: UNLOCKED
panic: zfs_dirent_lookup: vnode is not locked but should be
cpuid = 3
time = 1770520763
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b
vpanic() at vpanic+0x136/frame 0xfffffe00914c8270
panic() at panic+0x43/frame 0xfffffe00914c82d0
assert_vop_locked() at assert_vop_locked+0x78
zfs_dirent_lookup() at zfs_dirent_lookup+0x41
zfs_setattr_dir() at zfs_setattr_dir+0x123
zfs_setattr() at zfs_setattr+0x1389
zfs_freebsd_setattr() at zfs_freebsd_setattr+0x56b
VOP_SETATTR_APV() at VOP_SETATTR_APV+0x5d
setfown() at setfown+0xb1
kern_fchownat() at kern_fchownat+0x192
This patch fixes the problem by moving the vput() call for
attrzp to after the zfs_setattr_dir() call that takes it as
an argument.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rick Macklem <rmacklem@uoguelph.ca>
Closes: #18188
Add DWPD (Drive Writes Per Day) rate limiting to control L2ARC write
speeds and protect SSD endurance. Write rate is constrained by the
minimum of l2arc_write_max and DWPD-calculated budget. Devices
accumulate unused write budget over 24-hour periods with automatic reset
and carry-over. Writes occur in controlled bursts (max 50MB) with
adaptive intervals to achieve target rates. Applies after initial device
fill.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#18093
This option is removed upstream in favour of plain INVARIANTS.
VNASSERT is always defined so I see no reason to use it conditionally.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes#18136
Fix zfs_open() to skip zil_async_to_sync() for the snapshot, as it won't
have any transactions. zfsvfs->z_log is NULL for the snapshot.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com>
Closes#18091
If the file already has more than one block, then the current
block size cannot change. But if the file block size is less
than the maximum block size supported by the file system, and
there are multiple blocks in the file, the current code will
almost always extend the rangelock to its maximum size.
This means that all writes become serialized and even reads
are slowed as they will more often contend with writes. This
commit adjusts the test so that we will not lock the entire
range if there is more than one block in the file already.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Mark Maybee <mark.maybee@perforce.com>
Closes#18046Closes#18064
In general it's possible for a vnode to not have an associated VM
object. This happens in particular with named pipes, which have
some distinct VOPs, defined in zfs_fifoops. Thus, this chunk of
zfs_freebsd_fsync() needs to check for the FIFO case, like other
vm_object_mightbedirty() callers do.
(Note that vn_flush_cached_data() calls are predicated on
zn_has_cached_data() returning true, and it checks for a NULL v_object
pointer already.)
Fixes: ef4058fcdc
Reported-by: Collin Funk <collin.funk1@gmail.com>
Reviewed-by: Sean Eric Fagan <sef@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes#18015
A deadlock occurs when snapshot expiry tasks are cancelled while holding
locks. The snapshot expiry task (snapentry_expire) spawns an umount
process and waits for it to complete. Concurrently, ARC memory pressure
triggers arc_prune which calls zfs_exit_fs(), attempting to cancel the
expiry task while holding locks. The umount process spawned by the
expiry task blocks trying to acquire locks held by arc_prune, which is
blocked waiting for the expiry task to complete. This creates a circular
dependency: expiry task waits for umount, umount waits for arc_prune,
arc_prune waits for expiry task.
Fix by adding non-blocking cancellation support to taskq_cancel_id().
The zfs_exit_fs() path calls zfsctl_snapshot_unmount_delay() to
reschedule the unmount, which needs to cancel any existing expiry task.
It now uses non-blocking cancellation to avoid waiting while holding
locks, breaking the deadlock by returning immediately when the task is
already running.
The per-entry se_taskqid_lock has been removed, with all taskqid
operations now protected by the global zfs_snapshot_lock held as
WRITER. Additionally, an se_in_umount flag prevents recursive waits when
zfsctl_destroy() is called during unmount. The taskqid is now only
cleared by the caller on successful cancellation; running tasks clear
their own taskqid upon completion.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#17941
Pull all of the internal debug infrastructure up in to the zfs
code to clean up the layering. Remove all the dodgy usage of
SET_ERROR and DTRACE_PROBE from the spl. Luckily it was
lightly used in the spl layer so we're not losing much.
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#17861
This is useful as debugging support, as it lets namespace lock
operations be traced directly. It will also be useful for future work to
reduce the use of spa_namespace_lock, traditionally a source of
difficult deadlocks.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17906
FreeBSD now has a pathconf name called _PC_CASE_INSENSITIVE
used to check if a file system performs case insensitive
name lookups.
This patch adds support for this name.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rick Macklem <rmacklem@uoguelph.ca>
Closes#17908
We need to specifically use the FX_XFLAG_* macros in zpl_ioctl_*attr()
codepaths, and the FS_*_FL macros in the zpl_ioctl_*flags() codepaths.
The earlier code just assumes the FS_*_FL macros for both codepaths.
The 6.17 kernel add a bitmask check in copy_fsxattr_from_user() that
exposed this error via failing 'projectquota' ZTS tests.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#17884Closes#17869
In cases where all issued ZIOs must succeed, and we can't do
anything clever about the errors, we should just explicitly set
ZIO_FLAG_TRYHARD and let OS to do all the reasonable retries.
In other cases, where retries can be different from the original,
for example, some ZIOs are allowed to fail due to redundancy, or
we can disable aggregation on retrial to get at least some of
the data, we can do first pass without TRYHARD, and only if needed
retry with ZIO_FLAG_IO_RETRY (which implies TRYHARD semantics).
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17877
Over the time many of DMU functions got flags argument to control
prefetch, caching, etc. Few functions though left without it, even
though closer look shown that many of them do not require prefetch
due to their access pattern. This patch adds the flags argument to
dmu_write(), dmu_buf_hold_array() and dmu_buf_hold_array_by_bonus(),
passing DMU_READ_NO_PREFETCH where applicable.
I am going to also pass DMU_UNCACHEDIO to some of them later.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17872
In zio_crypt_key_wrap and zio_crypt_key_unwrap, the cuio_s variable was
not initialized before the calls to zfs_uio_init, leading to
uninitialized access to cuio_s.uio_offset. Initialize it to avoid gcc
warnings.
Similar issue as fixed in 2bf152021 ("Fix gcc uninitialized warning in
FreeBSD zio_crypt.c")
Signed-off-by: Ryan Libby <rlibby@FreeBSD.org>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#17863
Initially, `zfs_getpages()` is provided with an array of busy pages by
the vnode pager. It then tries to acquire the range lock, but if there
is a concurrent `zfs_write()` running and fails to acquire that range
lock, it "unbusies" the pages to avoid a deadlock with `zfs_write()`.
After that, it grabs the pages again and retries to acquire the range
lock, and so on.
Once it got the range lock, it filters out valid pages, then copy DMU
data to the remaining invalid pages.
The problem is that freshly allocated zero'd pages it grabbed itself are
marked as valid. Therefore they are skipped by the second part of the
function and DMU data is never copied to these pages. This causes mapped
pages to contain zeros instead of the expected file content.
This was discovered while working on RabbitMQ on FreeBSD. I could
reproduce the problem easily with the following commands:
git clone https://github.com/rabbitmq/rabbitmq-server.git
cd rabbitmq-server/deps/rabbit
gmake distclean-ct RABBITMQ_METADATA_STORE=mnesia \
ct-amqp_client t=cluster_size_3:leader_transfer_stream_send
The testsuite fails because there is a sendfile(2) that can happen
concurrently to a write(2) on the same file. This leads to sendfile(2)
or read(2) (after the sendfile) sending/returning data with zeros, which
causes a function to crash.
The patch consists of not setting the `VM_ALLOC_ZERO` flag when
`zfs_getpages()` grabs pages again. Then, the last page is zero'd if it
is invalid, in case it would be partially filled with the end of the
file content. Other pages are either valid (and will be skipped) or they
will be entirely overwritten by the file content.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Mark Johnston <markj@FreeBSD.org>
Signed-off-by: Jean-Sébastien Pédron <dumbbell@FreeBSD.org>
Closes#17851
The actual minimum hole size on ZFS is variable, but we always report
SPA_MINBLOCKSIZE, which is 512. This may lead applications to believe
that they can reliably create holes at 512-byte boundaries and waste
resources trying to punch holes that ZFS ends up filling anyway.
* In the general case, if the vnode is a regular file, return its
current block size, or the record size if the file is smaller than
its own block size. If the vnode is a directory, return the dataset
record size. If it is neither a regular file nor a directory,
return EINVAL.
* In the control directory case, always return EINVAL.
Signed-off-by: Dag-Erling Smørgrav <des@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17750
zfs_aclset_common() might be called for newly created or not even
created vnodes, that triggers assertions on newer FreeBSD versions
with DEBUG_VFS_LOCKS included into INVARIANTS. In the first case
make sure to call vn_seqc_write_begin()/_end(), in the second just
skip the assertion.
The similar has to be done for project management IOCTL and file-
bases extended attributes, since those are not going through VFS.
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17722
FreeBSD now has a pathconf name called _PC_CLONE_BLKSIZE
which is the block size supported for block cloning for
the file system. Since ZFS's block size varies per file,
return the largest size likely to be used, or zero if block
cloning is not supported.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Rick Macklem <rmacklem@uoguelph.ca>
Closes#17645
These are old pre-OpenZFS tunable names that have long been
available via either conventional ZFS_MODULE_PARAM tunables or through
kstats. There's no point doubling up anymore, so delete them.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#17375
zfsctl_root_readdir(): properly set eof.
readdir(): set *eofp to 1 on eof.
If there were no dirents to copy out, return EINVAL same as UFS.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Konstantin Belousov <kib@FreeBSD.org>
Closes#17655
zvol_state_lock is intended to protect access to the global name->zvol
lists (zvol_find_by_name()), but has also been used to control access to
OS-side private data, accessed through whatever kernel object is used to
represent the volume (gendisk, geom, etc).
This appears to have been necessary to some degree because the OS-side
object is what's used to get a handle on zvol_state_t, so zv_state_lock
and zv_suspend_lock can't be used to manage access, but also, with the
private object and the zvol_state_t being shutdown and destroyed at the
same time in zvol_os_free(), we must ensure that the private object
pointer only ever corresponds to a real zvol_state_t, not one in partial
destruction. Taking the global lock seems like a convenient way to
ensure this.
The problem with this is that zvol_state_lock does not actually protect
access to the zvol_state_t internals, so we need to take zv_state_lock
and/or zv_suspend_lock. If those are contended, this can then cause
OS-side operations (eg zvol_open()) to sleep to wait for them while hold
zvol_state_lock. This then blocks out all other OS-side operations which
want to get the private data, and any ZFS-side control operations that
would take the write half of the lock. It's even worse if ZFS-side
operations induce OS-side calls back into the zvol (eg creating a zvol
triggers a partition probe inside the kernel, and also a userspace
access from udev to set up device links). And it gets even works again
if anything decides to defer those ops to a task and wait on them, which
zvol_remove_minors_impl() will do under high load.
However, since the previous commit, we have a guarantee that the private
data pointer will always be NULL'd out in zvol_os_remove_minor()
_before_ the zvol_state_t is made invalid, but it won't happen until all
users are ejected. So, if we make access to the private object pointer
atomic, we remove the need to take a global lockout to access it, and so
we can remove all acquisitions of zvol_state_lock from the OS side.
While here, I've rewritten much of the locking theory comment at the top
of zvol.c. It wasn't wrong, but it hadn't been followed exactly, so I've
tried to describe the purpose of each lock in a little more detail, and
in particular describe where it should and shouldn't be used.
Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Fedor Uporov <fuporov.vstack@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17625
When destroying a zvol, it is not "unpublished" from the system (that
is, /dev/zd* node removed) until zvol_os_free(). Under Linux, at the
time del_gendisk() and put_disk() are called, the device node may still
be have an active hold, from a userspace program or something inside the
kernel (a partition probe). As it is currently, this can lead to calls
to zvol_open() or zvol_release() while the zvol_state_t is partially or
fully freed. zvol_open() has some protection against this by checking
that private_data is NULL, but zvol_release does not.
This implements a better ordering for all of this by adding a new
OS-side method, zvol_os_remove_minor(), which is responsible for fully
decoupling the "private" (OS-side) objects from the zvol_state_t. For
Linux, that means calling put_disk(), nulling private_data, and freeing
zv_zso.
This takes the place of zvol_os_clear_private(), which was a nod in that
direction but did not do enough, and did not do it early enough.
Equivalent changes are made on the FreeBSD side to follow the API
change.
Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Fedor Uporov <fuporov.vstack@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17625
Back in 2014 the zfs_autoimport_disable module option was added to
control whether the kmods should load the pool configs from the cache
file on module load. The default value since that time has been for
the kernel to not process the cache file.
Detecting and importing pools during boot is now controlled outside
of the kmod on both Linux and FreeBSD. By all accounts this has been
working well and we can remove this dormant code on the kernel side.
The spa_config_load() function is has been moved to userspace, it is
now only used by libzpool. Additionally, the spa_boot_init() hook
which was used by FreeBSD now looks to be used and was removed.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#17618
Page writeback is considered completed when the associated itx callback
completes. A syncing writeback will receive the error in its callback
directly, but an in-flight async writeback that was promoted to sync by
the ZIL may also receive an error.
Writeback errors, even syncing writeback errors, are not especially
serious on their own, because the error will ultimately be returned to
the zil_commit() caller, either zfs_fsync() for an explicit sync op (eg
msync()) or to zfs_putpage() itself for a syncing (VM_PAGER_PUT_SYNC)
writeback.
The only thing we need to do when a page writeback fails is to skip
marking the page clean ("undirty"), since we don't know if it made it to
disk yet. This will ensure that it gets written out again in the future,
either some scheduled async writeback or another explicit syncing call.
On the other side, we need to make sure that if a syncing op arrives,
any changes on dirty pages are written back to the DMU and/or the ZIL
first. We do this by starting an async writeback on the vnode cache
first, so any dirty data has been recorded in the ZIL, ready for the
followup zfs_sync()->zil_commit() to find.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17398
ITX callbacks are used to signal that something can be cleaned up after
a itx is committed. Presently that's only used when syncing out mapped
pages (msync()) to mark dirty pages clean.
This extends the callback interface so it can be passed an error, and
take a different cleanup action if necessary.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17398
This changes zil_commit() to have an int return, and updates all callers
to check it. There are no corresponding internal changes yet; it will
always return 0.
Since zil_commit() is an indication that the caller _really_ wants the
associated data to be durability stored, I've annotated it with the
__warn_unused_result__ compiler attribute (via __must_check), to emit a
warning if it's ever ussd without doing something with the return code.
I hope this will mean we never misuse it in the future.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17398
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/Closes#17591
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/Closes#17591
All this machinery is there to try to understand when there an async
writeback waiting to complete because the intent log callbacks are still
outstanding, and force them with a timely zil_commit(). The next commit
fixes this properly, so there's no need for all this extra housekeeping.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17584
Now zvol minors creation logic is passed thru spa_zvol_taskq, like it
is doing for remove/rename zvol minors functions. Appropriate
zvol minors creation functions are refactored:
- The zvol_create_minor()/zvol_minors_create_recursive() were removed.
- The single zvol_create_minors() is added instead.
Also, it become possible to collect zvol minors subtasks status, to
detect, if some zvol minor subtask is failed in the subtasks chain.
The appropriate message is reported to zfs_dbgmsg buffer in this case.
Sponsored-by: vStack, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes#17575
Avoid calling dbuf_evict_one() from memory reclaim contexts (e.g. Linux
kswapd, FreeBSD pagedaemon). This prevents deadlock caused by reclaim
threads waiting for the dbuf hash lock in the call sequence:
dbuf_evict_one -> dbuf_destroy -> arc_buf_destroy
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Kaitlin Hoang <kthoang@amazon.com>
Closes#17561
Implement zvol_alloc() function on FreeBSD side to increase code base
compatibility with Linux. Also, fix issue with late returning in case
if volmode=none.
Sponsored-by: vStack, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes#17482
FreeBSD commit 2ec2ba7e232d added the Solaris style syscall interface
for extended attributes. This patch wires this interface into the
FreeBSD ZFS port, since this style of extended attributes is supported
by OpenZFS internally when the "xattr" property is set to "dir".
Some specific changes:
LOOKUP_NAMED_ATTR is defined to indicate the need to set V_NAMEDATTR
for calls to zfs_zaccess().
V_NAMEDATTR indicates that the access checking does need to be done
for FreeBSD.
The access checking code for extended attributes was copy/pasted from
the Linux port into zfs_zaccess() in the FreeBSD port.
Most of the changes are in zfs_freebsd_lookup() and
zfs_freebsd_create().
The semantics of these functions should remain unchanged unless named
attributes are being manipulated.
All the code changes are enabled for __FreeBSD_version 1500040 and
newer.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rick Macklem <rmacklem@uoguelph.ca>
Closes#17540
Return from zvol_os_create_minor() function immediately after
dsl_prop_get_integer() call if volmode property value is set to
'none', like it is doing on Linux side.
Sponsored-by: vStack, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes#17405
In syncing mode, zfs_putpages() would put the entire range of pages onto
the ZIL, then return VM_PAGER_OK for each page to the kernel. However,
an associated zil_commit() or txg sync had not happened at this point,
so the write may not actually be on disk.
So, we rework that case to use a ZIL commit callback, and do the
post-write work of undirtying the page and signaling completion there.
We return VM_PAGER_PEND to the kernel instead so it knows that we will
take care of it.
The original version of this (238eab7dc1) copied the Linux model and did
the cleanup in a ZIL callback for both async and sync. This was a
mistake, as FreeBSD does not have a separate "busy for writeback" flag
like Linux which keeps the page usable. The full sbusy flag locks the
entire page out until the itx callback fires, which for async is after
txg sync, which could be literal seconds in the future.
For the async case, the data is already on the DMU and the in-memory
ZIL, which is sufficient for async writeback, so the old method of
logging it without a callback, undirtying the page and returning is more
than sufficient and reclaims that lost performance.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Mark Johnston <markj@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17533
This causes async putpages to leave the pages sbusied for a long time,
which hurts concurrency. Revert for now until we have a better
approach.
This reverts commit 238eab7dc1.
Reported by: Ihor Antonov <ngor@hugpoint.tech>
Discussed with: Rob Norris <rob.norris@klarasystems.com>
References: freebsd/freebsd-src@738a9a7
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Mark Johnston <markj@FreeBSD.org>
Ported-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17533
In FreeBSD there is now a pathconf name _PC_HAS_HIDDENSYSTEM.
This patch adds support for it to OpenZFS.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rick Macklem <rmacklem@uoguelph.ca>
Closes#17518
The field is subsequently accessed in zfs_mknode(), in
zfs_inherit_projid(). The Linux implementation of zfs_create_fs() has
this initialization already; there is no counterpart to
zfs_create_share_dir() that I can see.
Reported-by: KMSAN
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes#17486
While FreeBSD itself does not support projects, there is no reason
why it can't be controlled via `zfs project` and other subcommands.
Most of the code is actually already there and just needs some
revival and sync with Linux, plus enabling some tests not depending
on the OS support.
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#17423
zfs_putpages() would put the entire range of pages onto the ZIL, then
return VM_PAGER_OK for each page to the kernel. However, an associated
zil_commit() or txg sync had not happened at this point, so the write
may not actually be on disk.
So, we rework it to use a ZIL commit callback, and do the post-write
work of undirtying the page and signaling completion there. We return
VM_PAGER_PEND to the kernel instead so it knows that we will take care
of it.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Mark Johnston <markj@FreeBSD.org>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17445
The module parameter name was not changed in FreeBSD sysctls
list: 'vfs.zfs.vol.mode'. Also, on Linux side the name is:
/sys/module/zfs/parameters/zvol_volmode.
Sponsored-by: vStack, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes#17386
The module parameter now is represented in FreeBSD sysctls list
with name: 'vfs.zfs.vol.prefetch_bytes'. The default value is 131072,
same as on Linux side.
Sponsored-by: vStack, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes#17385
The module parameter now is represented in FreeBSD sysctls list with
name: 'vfs.zfs.vol.inhibit_dev'. The default value is '0', same as on
Linux side.
Sponsored-by: vStack, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes#17384
As commit 320f0c6 did for Linux, connect POSIX_FADV_WILLNEED
up to dmu_prefetch() on FreeBSD.
While there, fix portability problems in tests/functional/fadvise.
1. Instead of relying on the numerical values of POSIX_FADV_XXX macros,
accept macro names as arguments to the file_fadvise program. (The
numbers happen to match on Linux and FreeBSD, but future systems may
vary and it seems a little strange/raw to count on that.)
2. For implementation reasons, SEQUENTIAL doesn't reach ZFS via FreeBSD
VFS currently (perhaps something that should be investigated in
FreeBSD). Since on Linux we're treating SEQUENTIAL and WILLNEED the
same, it doesn't really matter which one we use, so switch the test
over to WILLNEED exercise the new prefetch code on both OSes the
same way.
Reviewed-by: Mateusz Guzik <mjg@FreeBSD.org>
Reviewed-by: Fedor Uporov <fuporov.vstack@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Thomas Munro <tmunro@FreeBSD.org>
Co-authored-by: Alexander Motin <mav@FreeBSD.org>
Closes#17379
Nothing in any FreeBSD code uses them.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#17377
By the assertion, vdev_geom_io_done() only expects ENXIO on an error
when the geom is a top-level (allocating) vdev[1][2]. However, zinject
currently can't insert ENXIO directly, possibly because on Solaris
outright disk failures were reported with EIO[2][3].
This is a narrow workaround to convert EIO to ENXIO when injections are
enabled, to avoid the assertion and allow the test suite to test
behaviour related to probe failure on FreeBSD.
1. freebsd/freebsd-src@37ec52ca7a
2. freebsd/freebsd-src@cd730bd6b2
3. illumos/illumos-gate@ea8dc4b6d2
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17355