mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-05-15 18:56:59 +03:00

Author	SHA1	Message	Date
Coleman Kane	21875dd090	Linux 6.6 compat: generic_fillattr has a new u32 request_mask added at arg2 In commit 0d72b92883c651a11059d93335f33d65c6eb653b, a new u32 argument for the request_mask was added to generic_fillattr. This is the same request_mask for statx that's present in the most recent API implemented by zpl_getattr_impl. This commit conditionally adds it to the zpl_generic_fillattr(...) macro, as well as the zfs_getattr_fast(...) implementation, when configure determines it's present in the kernel's generic_fillattr(...). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #15263	2023-11-08 12:15:41 -08:00
Coleman Kane	fe9d409e90	Linux 6.6 compat: use inode_get/set_ctime(...) In Linux commit 13bc24457850583a2e7203ded05b7209ab4bc5ef, direct access to the i_ctime member of struct inode was removed. The new approach is to use accessor methods that exclusively handle passing the timestamp around by value. This change adds new tests for each of these functions and introduces zpl_ equivalents in include/os/linux/zfs/sys/zpl.h. In where the inode_get/set_ctime() functions exist, these zpl_ calls will be mapped to the new functions. On older kernels, these macros just wrap direct-access calls. The code that operated on an address of ip->i_ctime to call ZFS_TIME_DECODE() now will take a local copy using zpl_inode_get_ctime(), and then pass the address of the local copy when performing the ZFS_TIME_DECODE() call, in all cases, rather than directly accessing the member. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #15263 Closes #15257	2023-11-08 12:15:41 -08:00
Tony Hutter	8ba748d414	Revert "zvol: Temporally disable blk-mq" This reverts commit `aefb6a2bd6`. `aefb6a2bd` temporally disabled blk-mq until we could fix a fix for Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #15439	2023-11-06 16:47:32 -08:00
Tony Hutter	e860cb0200	zvol: Remove broken blk-mq optimization This fix removes a dubious optimization in zfs_uiomove_bvec_rq() that saved the iterator contents of a rq_for_each_segment(). This optimization allowed restoring the "saved state" from a previous rq_for_each_segment() call on the same uio so that you wouldn't need to iterate though each bvec on every zfs_uiomove_bvec_rq() call. However, if the kernel is manipulating the requests/bios/bvecs under the covers between zfs_uiomove_bvec_rq() calls, then it could result in corruption from using the "saved state". This optimization results in an unbootable system after installing an OS on a zvol with blk-mq enabled. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #15351	2023-11-06 16:47:24 -08:00
Daniel Berlin	810fc49a3e	Ensure we call fput when cloning fails due to different devices. Right now, zpl_ioctl_ficlone and zpl_ioctl_ficlonerange do not call put on the src fd if the source and destination are on two different devices. This leaves the source file held open in this case. Reviewed-by: Kay Pedersen <mail@mkwg.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Daniel Berlin <dberlin@dberlin.org> Closes #15386	2023-10-10 19:19:09 -07:00
Tony Hutter	a80e1f1c90	zvol: Temporally disable blk-mq There was a report of zvol data loss (#15351) after enabling blk-mq on a zvol backed with 16k physical block sized disks. Out of an abundance of caution, do not allow the user to enable blk-mq until we can look into the issue. Note that blk-mq was not enabled by default on zvols. It was always opt-in via the zvol_use_blk_mq module parameter. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Addresses: #15351 Closes #15378	2023-10-10 19:19:09 -07:00
Mateusz Guzik	f7a07d76ee	Retire z_nr_znodes Added in `ab26409db7` ("Linux 3.1 compat, super_block->s_shrink"), with the only consumer which needed the count getting retired in `066e825221` ("Linux compat: Minimum kernel version 3.10"). The counter gets in the way of not maintaining the list to begin with. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #15274	2023-09-19 08:52:06 -07:00
Coleman Kane	58a707375f	Linux 6.5 compat: Use copy_splice_read instead of filemap_splice_read Using the filemap_splice_read function for the splice_read handler was leading to occasional data corruption under certain circumstances. Favor using copy_splice_read instead, which does not demonstrate the same erroneous behavior under the tested failure cases. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #15164	2023-09-19 08:50:01 -07:00
Coleman Kane	5a22de144a	Linux 6.5 compat: replace generic_file_splice_read with filemap_splice_read The generic_file_splice_read function was removed in Linux 6.5 in favor of filemap_splice_read. Add an autoconf test for filemap_splice_read and use it if it is found as the handler for .splice_read in the file_operations struct. Additionally, ITER_PIPE was removed in 6.5. This change removes the ITER_* macros that OpenZFS doesn't use from being tested in config/kernel-vfs-iov_iter.m4. The removal of ITER_PIPE was causing the test to fail, which also affected the code responsible for setting the .splice_read handler, above. That behavior caused run-time panics on Linux 6.5. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #15155	2023-09-19 08:50:01 -07:00
Coleman Kane	8be6308e85	Linux 4.20 compat: wrapper function for iov_iter type access An iov_iter_type() function to access the "type" member of the struct iov_iter was added at one point. Move the conditional logic to decide which method to use for accessing it into a macro and simplify the zpl_uio_init code. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #15100	2023-09-19 08:50:01 -07:00
Coleman Kane	0bf2c5365e	Linux 6.4 compat: iter_iov() function now used to get old iov member The iov_iter->iov member is now iov_iter->__iov and must be accessed via the accessor function iter_iov(). Create a wrapper that is conditionally compiled to use the access method appropriate for the target kernel version. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #15100	2023-09-19 08:50:01 -07:00
Coleman Kane	d76de9fb17	Linux 6.5 compat: blkdev changes Multiple changes to the blkdev API were introduced in Linux 6.5. This includes passing (void* holder) to blkdev_put, adding a new blk_holder_ops* arg to blkdev_get_by_path, adding a new blk_mode_t type that replaces uses of fmode_t, and removing an argument from the release handler on block_device_operations that we weren't using. The open function definition has also changed to take gendisk* and blk_mode_t, so update it accordingly, too. Implement local wrappers for blkdev_get_by_path() and vdev_blkdev_put() so that the in-line calls are cleaner, and place the conditionally-compiled implementation details inside of both of these local wrappers. Both calls are exclusively used within vdev_disk.c, at this time. Add blk_mode_is_open_write() to test FMODE_WRITE / BLK_OPEN_WRITE The wrapper function is now used for testing using the appropriate method for the kernel, whether the open mode is writable or not. Emphasize fmode_t arg in zvol_release is not used Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #15099	2023-09-19 08:50:01 -07:00
Volker Mauel	4da8c7d11e	Intel QAT 1.7 compatibility Based on the intel QAT samples which are bundled in the 1.x drivers, this is the preferred approach since api version 1.6. See: https://www.intel.de/content/www/de/de/download/19734/intel-quickassist-technology-driver-for-linux-hw-version-1-x.html? Reviewed-by: Weigang Li <weigang.li@intel.com> Signed-off-by: Volker Mauel <volkermauel@gmail.com> Closes #15190	2023-09-07 16:10:52 -07:00
Rob N	92f095a903	copy_file_range: fix fallback when source create on same txg In `019dea0a5` we removed the conversion from EAGAIN->EXDEV inside zfs_clone_range(), but forgot to add a test for EAGAIN to the copy_file_range() entry points to trigger fallback to a content copy. This commit fixes that. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kay Pedersen <mail@mkwg.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #15170 Closes #15172	2023-08-25 13:33:40 -07:00
oromenahar	895cb689d3	zfs_clone_range should return a descriptive error codes Return the more descriptive error codes instead of `EXDEV` when the parameters don't match the requirements of the clone function. Updated the comments in `brt.c` accordingly. The first three errors are just invalid parameters, which zfs can not handle. The fourth error indicates that the block which should be cloned is created and cloned or modified in the same transaction group (`txg`). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Signed-off-by: Kay Pedersen <mail@mkwg.de> Closes #15148	2023-08-25 13:33:40 -07:00
наб	bd1eab16eb	linux: zfs: ctldir: set [amc]time to snapshot's creation property If looking up a snapdir inode failed, hold pool config – hold the snapshot – get its creation property – release it – release it, then use that as the [amc]time in the allocated inode. If that fails then fall back to current time. No performance impact since this is only done when allocating a new snapdir inode. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #15110 Closes #15117	2023-08-02 08:53:45 -07:00
Rob N	c47f0f4417	linux/copy_file_range: properly request a fallback copy on Linux <5.3 Before Linux 5.3, the filesystem's copy_file_range handler had to signal back to the kernel that we can't fulfill the request and it should fallback to a content copy. This is done by returning -EOPNOTSUPP. This commit converts the EXDEV return from zfs_clone_range to EOPNOTSUPP, to force the kernel to fallback for all the valid reasons it might be unable to clone. Without it the copy_file_range() syscall will return EXDEV to userspace, breaking its semantics. Add test for copy_file_range fallbacks. copy_file_range should always fallback to a content copy whenever ZFS can't service the request with cloning. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kay Pedersen <mail@mkwg.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #15131	2023-08-02 08:52:40 -07:00
Rob Norris	2768dc04cc	linux: implement filesystem-side copy/clone functions for EL7 Redhat have backported copy_file_range and clone_file_range to the EL7 kernel using an "extended file operations" wrapper structure. This connects all that up to let cloning work there too. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kay Pedersen <mail@mkwg.de> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes #15050	2023-07-26 08:46:58 -07:00
Rob Norris	3366ceaf3a	linux: implement filesystem-side clone ioctls Prior to Linux 4.5, the FICLONE etc ioctls were specific to BTRFS, and were implemented as regular filesystem-specific ioctls. This implements those ioctls directly in OpenZFS, allowing cloning to work on older kernels. There's no need to gate these behind version checks; on later kernels Linux will simply never deliver these ioctls, instead calling the approprate VFS op. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kay Pedersen <mail@mkwg.de> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes #15050	2023-07-26 08:46:58 -07:00
Rob Norris	5d12545da8	linux: implement filesystem-side copy/clone functions This implements the Linux VFS ops required to service the file copy/clone APIs: .copy_file_range (4.5+) .clone_file_range (4.5-4.19) .dedupe_file_range (4.5-4.19) .remap_file_range (4.20+) Note that dedupe_file_range() and remap_file_range(REMAP_FILE_DEDUP) are hooked up here, but are not implemented yet. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kay Pedersen <mail@mkwg.de> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes #15050	2023-07-26 08:46:58 -07:00
Chunwei Chen	b221f43943	Fix zpl_test_super race with zfs_umount We cannot call zpl_enter in zpl_test_super, because zpl_test_super is under spinlock so we can't sleep, and also because zpl_test_super is called without sb->s_umount taken, so it's possible we would race with zfs_umount and call zpl_enter on freed zfsvfs. Here's an stack trace when this happens: [ 2379.114837] VERIFY(cvp->cv_magic == CV_MAGIC) failed [ 2379.114845] PANIC at spl-condvar.c:497:__cv_broadcast() [ 2379.114854] Kernel panic - not syncing: VERIFY(cvp->cv_magic == CV_MAGIC) failed [ 2379.115012] Call Trace: [ 2379.115019] dump_stack+0x74/0x96 [ 2379.115024] panic+0x114/0x2f6 [ 2379.115035] spl_panic+0xcf/0xfc [spl] [ 2379.115477] __cv_broadcast+0x68/0xa0 [spl] [ 2379.115585] rrw_exit+0xb8/0x310 [zfs] [ 2379.115696] rrm_exit+0x4a/0x80 [zfs] [ 2379.115808] zpl_test_super+0xa9/0xd0 [zfs] [ 2379.115920] sget+0xd1/0x230 [ 2379.116033] zpl_mount+0xdc/0x230 [zfs] [ 2379.116037] legacy_get_tree+0x28/0x50 [ 2379.116039] vfs_get_tree+0x27/0xc0 [ 2379.116045] path_mount+0x2fe/0xa70 [ 2379.116048] do_mount+0x80/0xa0 [ 2379.116050] __x64_sys_mount+0x8b/0xe0 [ 2379.116052] do_syscall_64+0x35/0x50 [ 2379.116054] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 2379.116057] RIP: 0033:0x7f9912e8b26a Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #15077	2023-07-21 16:35:12 -07:00
Prakash Surya	945e39fc3a	Enable tuning of ZVOL open timeout value The default timeout for ZVOL opens may not be sufficient for all cases, so we should enable the value to be more easily tuned to account for systems where the default value is insufficient. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Prakash Surya <prakash.surya@delphix.com> Closes #15023	2023-06-30 11:34:05 -07:00
Alexander Motin	b3ad3f48d9	Use list_remove_head() where possible. ... instead of list_head() + list_remove(). On FreeBSD the list functions are not inlined, so in addition to more compact code this also saves another function call. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14955	2023-06-09 10:12:52 -07:00
Rob Norris	2b9f8ba673	znode: expose zfs_get_zplprop to libzpool There's no particular reason this function should be kernel-only, and I want to use it (indirectly) from zdb. I've moved it to zfs_znode.c because libzpool does not compile in zfs_vfsops.c, and this at least matches the header its imported from. Sponsored-By: Klara, Inc. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: WHR <msl0000023508@gmail.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #14642	2023-06-05 11:53:44 -07:00
youzhongyang	f8447cf22e	Linux 6.4 compat: reclaimed_slab renamed to reclaimed Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Youzhong Yang <yyang@mathworks.com> Closes #14891	2023-05-24 12:23:42 -07:00
Richard Yao	ab71b24d20	Linux: zfs_zaccess_trivial() should always call generic_permission() Building with Clang on Linux generates a warning that err could be uninitialized if mnt_ns is a NULL pointer. However, mnt_ns should never be NULL, so there is no need to put this behind an if statement. Taking it outside of the if statement means that the possibility of err being uninitialized goes from being always zero in a way that the compiler could not realize to a way that is always zero in a way that the compiler can realize. Sponsored-By: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Youzhong Yang <yyang@mathworks.com> Signed-off-by: Richard Yao <richard.yao@klarasystems.com> Closes #14738	2023-04-20 10:29:44 -07:00
youzhongyang	d4dc53dad2	Linux 6.3 compat: idmapped mount API changes Linux kernel 6.3 changed a bunch of APIs to use the dedicated idmap type for mounts (struct mnt_idmap), we need to detect these changes and make zfs work with the new APIs. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Youzhong Yang <yyang@mathworks.com> Closes #14682	2023-04-10 14:15:36 -07:00
youzhongyang	8eb2f26057	Linux 6.3 compat: writepage_t first arg struct folio* The type def of writepage_t in kernel 6.3 is changed to take struct folio* as the first argument. We need to detect this change and pass correct function to write_cache_pages(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Youzhong Yang <yyang@mathworks.com> Closes #14699	2023-04-05 10:01:38 -07:00
Brian Behlendorf	1142362ff6	Use vmem_zalloc to silence allocation warning The kmem allocation in zfs_prune_aliases() will trigger a large allocation warning on systems with 64K pages. Resolve this by switching to vmem_alloc() which internally uses kvmalloc() so the right allocator will be used based on the allocation size. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8491 Closes #14694	2023-03-31 09:43:54 -07:00
naivekun	60cfd3bbc2	QAT: Fix uninitialized seed in QAT compression CpaDcRqResults have to be initialized with checksum=1 for adler32. Otherwise when error CPA_DC_OVERFLOW occurred, the next compress operation will continue on previously part-compressed data, and write invalid checksum data. When zfs decompress the compressed data, a invalid checksum will occurred and lead to #14463 Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Weigang Li <weigang.li@intel.com> Reviewed-by: Chengfei Zhu <chengfeix.zhu@intel.com> Signed-off-by: naivekun <naivekun0817@gmail.com> Closes #14632 Closes #14463	2023-03-16 11:54:10 -07:00
Richard Yao	d1807f168e	nvpair: Constify string functions After addressing coverity complaints involving `nvpair_name()`, the compiler started complaining about dropping const. This lead to a rabbit hole where not only `nvpair_name()` needed to be constified, but also `nvpair_value_string()`, `fnvpair_value_string()` and a few other static functions, plus variable pointers throughout the code. The result became a fairly big change, so it has been split out into its own patch. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14612	2023-03-14 15:25:50 -07:00
Richard Yao	66a38fd10a	Linux: Suppress clang static analyzer warning in zfs_remove() Clang's static analyzer points out that if we fail to find an extended attribute directory, but somehow find it when calculating delete_now and delete_now is true, we will have a NULL pointer dereference when we try to unlink the extended attribute directory. I am not sure if this is possible, but if it is, I do not see a sane way of handling this other than rolling back the transaction and retrying. For now, let us do an VERIFY_IMPLY(). If this trips, it will stop the transaction from committing, which will prevent an attribute directory leak. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14575	2023-03-08 13:52:04 -08:00
Richard Yao	c2550a136e	Linux: Silence static analyzer warning in crypto_create_ctx_template() A CodeChecker report from Clang's CTU analysis indicated that we were assigning uninitialized values in crypto_create_ctx_template() when we call it from zio_crypt_key_init(). This occurs because the ->cm_param and ->cm_param_len fields are uninitialized. Thankfully, the uninitialized values are only used in the skein via KCF_PROV_CREATE_CTX_TEMPLATE() -> skein_create_ctx_template() -> skein_mac_ctx_build() -> skein_get_digest_bitlen(), but that should not be called from here. We fix this to avoid a possible trap should this code change in the future. The FreeBSD version of zio_crypt_key_init() is unaffected. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14575	2023-03-08 13:51:59 -08:00
Richard Yao	5dd0f019cd	Linux cleanup: zvol_discard() should only call blk_queue_io_stat() once Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14575	2023-03-08 13:51:40 -08:00
Alexander Motin	a8d83e2a24	More adaptive ARC eviction Traditionally ARC adaptation was limited to MRU/MFU distribution. But for years people with metadata-centric workload demanded mechanisms to also manage data/metadata distribution, that in original ZFS was just a FIFO. As result ZFS effectively got separate states for data and metadata, minimum and maximum metadata limits etc, but it all required manual tuning, was not adaptive and in its heart remained a bad FIFO. This change removes most of existing eviction logic, rewriting it from scratch. This makes MRU/MFU adaptation individual for data and meta- data, same as the distribution between data and metadata themselves. Since most of required states separation was already done, it only required to make arcs_size state field specific per data/metadata. The adaptation logic is still based on previous concept of ghost hits, just now it balances ARC capacity between 4 states: MRU data, MRU metadata, MFU data and MFU metadata. To simplify arc_c changes instead of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd and arc_pm, representing ARC balance between metadata and data, MRU and MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed point fractions. Since we care about the math result only when need to evict, this moves all the logic from arc_adapt() to arc_evict(), that reduces per-block overhead, since per-block operations are limited to stats collection, now moved from arc_adapt() to arc_access() and using cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag from many places. This change also removes number of metadata specific tunables, part of which were actually not functioning correctly, since not all metadata are equal and some (like L2ARC headers) are not really evictable. Instead it introduced single opaque knob zfs_arc_meta_balance, tuning ARC's reaction on ghost hits, allowing administrator give more or less preference to metadata without setting strict limits. Some of old code parts like arc_evict_meta() are just removed, because since introduction of ABD ARC they really make no sense: only headers referenced by small number of buffers are not evictable, and they are really not evictable no matter what this code do. Instead just call arc_prune_async() if too much metadata appear not evictable. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14359	2023-03-08 11:17:23 -08:00
Richard Yao	dd108f5d73	Linux: zfs_fillpage() should handle partial pages from end of file After `89cd2197b9` was merged, Clang's static analyzer began complaining about a dead assignment in `zfs_fillpage()`. Upon inspection, I noticed that the dead assignment was because we are not using the calculated io_len that we should use to avoid asking the DMU to read past the end of a file. This should result in `dmu_buf_hold_array_by_dnode()` calling `zfs_panic_recover()`. This issue predates `89cd2197b9`, but its simplification of zfs_fillpage() eliminated the only use of the assignment to io_len, which made Clang's static analyzer complain about the issue. Also, as a precaution, we add an assertion that io_offset < i_size. If this ever fails, bad things will happen. Otherwise, we are blindly trusting the kernel not to give us invalid offsets. We continue to blindly trust it on non-debug kernels. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14534	2023-03-01 13:19:47 -08:00
John Poduska	73c383f541	Prevent incorrect datasets being mounted During a mount, zpl_mount_impl(), uses sget() with the callback zpl_test_super() to find a super_block with a matching objset, stored in z_os. It does so without taking the teardown lock on the zfsvfs. The problem is that operations like rollback will replace the z_os. And, there is a window where the objset in the rollback is freed, but z_os still points to it. Then, a mount like operation, for instance a clone, can reallocate that exact same pointer and zpl_test_super() will then match the super_block associated with the rollback as opposed to the clone. This fix tests for a match and if so, takes the teardown lock before doing the final match test. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: John Poduska <jpoduska@datto.com> Closes #14518	2023-02-27 16:49:34 -08:00
Brian Behlendorf	89cd2197b9	Fix buffered/direct/mmap I/O race When a page is faulted in for memory mapped I/O the page lock may be dropped before it has been read and marked up to date. If a buffered read encounters such a page in mappedread() it must wait until the page has been updated. Failure to do so will result in a panic on debug builds and incorrect data on production builds. The critical part of this change is in mappedread() where pages which are not up to date are now handled. Additionally, it includes the following simplifications. - zfs_getpage() and zfs_fillpage() could be passed an array of pages. This could be more efficient if it was used but in practice only a single page was ever provided. These interfaces were simplified to acknowledge that. - update_pages() was modified to correctly set the PG_error bit on a page when it cannot be read by dmu_read(). - Setting PG_error and PG_uptodate was moved to zfs_fillpage() from zpl_readpage_common(). This is consistent with the handling in update_pages() and mappedread(). - Minor additional refactoring to comments and variable declarations to improve readability. - Add a test case to exercise concurrent buffered, direct, and mmap IO to the same file. - Reduce the mmap_sync test case default run time. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13608 Closes #14498	2023-02-23 10:57:24 -08:00
Brian Behlendorf	3fc92adc40	Linux: use filemap_range_has_page() As of the 4.13 kernel filemap_range_has_page() can be used to check if there is a page mapped in a given file range. When available this interface should be used which eliminates the need for the zp->z_is_mapped boolean. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #14493	2023-02-14 11:04:34 -08:00
Rich Ercolani	cfd57573ff	quick fix for lingering snapdir unmount problems Unfortunately, even after `e79b6807`, I still, much more rarely, tripped asserts when playing with many ctldir mounts at once. Since this appears to happen if we dispatched twice too fast, just ignore it. We don't actually need to do anything if someone already started doing it for us. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #14462	2023-02-13 16:40:13 -08:00
Prakash Surya	13312e2fa1	Reduce need for contiguous memory for ioctls We've had cases where we trigger an OOM despite having memory freely available on the system. For example, here, we had about 21GB free: kernel: Node 0 Normal: 24187584kB (UME) 15495338kB (UE) 016kB 032kB 064kB 0128kB 0256kB 0512kB 01024kB 02048kB 0*4096kB = 22071296kB The problem being, all the memory is in 4K and 8K contiguous regions, but the allocation request was for a 16K contiguous region: kernel: SafeExecutors-4 invoked oom-killer: gfp_mask=0x42dc0(GFP_KERNEL\|__GFP_NOWARN\|__GFP_COMP\|__GFP_ZERO), order=2, oom_score_adj=0 The offending allocation came from this call trace: kernel: Call Trace: kernel: dump_stack+0x57/0x7a kernel: dump_header+0x4f/0x1e1 kernel: oom_kill_process.cold.33+0xb/0x10 kernel: out_of_memory+0x1ad/0x490 kernel: __alloc_pages_slowpath+0xd55/0xe40 kernel: __alloc_pages_nodemask+0x2df/0x330 kernel: kmalloc_large_node+0x42/0x90 kernel: __kmalloc_node+0x25a/0x320 kernel: ? spl_kmem_free_impl+0x21/0x30 [spl] kernel: spl_kmem_alloc_impl+0xa5/0x100 [spl] kernel: spl_kmem_zalloc+0x19/0x20 [spl] kernel: zfsdev_ioctl+0x2b/0xe0 [zfs] kernel: do_vfs_ioctl+0xa9/0x640 kernel: ? __audit_syscall_entry+0xdd/0x130 kernel: ksys_ioctl+0x67/0x90 kernel: __x64_sys_ioctl+0x1a/0x20 kernel: do_syscall_64+0x5e/0x200 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9 kernel: RIP: 0033:0x7fdca3674317 The problem is, for each ioctl that ZFS makes, it has to allocate a zfs_cmd_t structure, which is 13744 bytes in size (on my system): sdb> sizeof zfs_cmd (size_t)13744 This size, coupled with the fact that we currently allocate it with kmem_zalloc, means we need a 16K contiguous region of memory to satisfy the request. The solution taken by this change, is to use "vmem" instead of "kmem" to do the allocation, such that we don't necessarily need a contiguous 16K memory region to satisfy the allocation. Arguably, a better solution would be not to require such a large allocation to begin with (e.g. reduce the size of the zfs_cmd_t structure), but that'd be a much larger change than this "one liner". Thus, I've opted for this approach for now; we can always circle back and attempt to reduce the size of the structure in the future. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: Prakash Surya <prakash.surya@delphix.com> Closes #14474	2023-02-13 16:35:59 -08:00
Richard Yao	66953686c0	Add assertion and make variables unsigned in abd_alloc_chunks() Clang's static analyzer pointed out that if alloc_pages >= nr_pages before the loop, the value of page will be undefined and will be used anyway. This should not be possible, but as cleanup, we add an assertion. We also recognize that the local variables should be unsigned in the first place, so we make them unsigned. This is not enough to avoid the need for the assertion, since there is still the case that alloc_pages == nr_pages and nr_pages == 0, which the assertion implicitly checks. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14456	2023-02-06 11:10:50 -08:00
Richard Yao	3a7d2a0ce0	zfs_get_temporary_prop() should not pass NULL to strcpy() `dsl_dir_activity_in_progress()` can call `zfs_get_temporary_prop()` with the forth value set to NULL, which will pass NULL to `strcpy()` when there is a match Clang's static analyzer caught this with the help of CodeChecker for Cross Translation Unit analysis. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14456	2023-02-06 11:08:57 -08:00
Coleman Kane	9cd71c8604	linux 6.2 compat: zpl_set_acl arg2 is now struct dentry Linux 6.2 changes the second argument of the set_acl operation to be a "struct dentry " rather than a "struct inode ". The inode* parameter is still available as dentry->d_inode, so adjust the call to the _impl function call to dereference and pass that pointer to it. Also document that the get_acl -> get_inode_acl member name change from commit `884a693` was an API change also introduced in Linux 6.2. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #14415	2023-01-24 11:20:50 -08:00
Chunwei Chen	c6dab6dd39	Fix unprotected zfs_znode_dmu_fini In original code, zfs_znode_dmu_fini is called in zfs_rmnode without zfs_znode_hold_enter. It seems to assume it's ok to do so when the znode is unlinked. However this assumption is not correct, as zfs_zget can be called by NFS through zpl_fh_to_dentry as pointed out by Christian in https://github.com/openzfs/zfs/pull/12767, which could result in a use-after-free bug. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #12767 Closes #14364	2023-01-19 16:59:05 -08:00
Richard Yao	2e7f664f04	Cleanup of dead code suggested by Clang Static Analyzer (#14380 ) I recently gained the ability to run Clang's static analyzer on the linux kernel modules via a few hacks. This extended coverage to code that was previously missed since Clang's static analyzer only looked at code that we built in userspace. Running it against the Linux kernel modules built from my local branch produced a total of 72 reports against my local branch. Of those, 50 were reports of logic errors and 22 were reports of dead code. Since we already had cleaned up all of the previous dead code reports, I felt it would be a good next step to clean up these dead code reports. Clang did a further breakdown of the dead code reports into: Dead assignment 15 Dead increment 2 Dead nested assignment 5 The benefit of cleaning these up, especially in the case of dead nested assignment, is that they can expose places where our error handling is incorrect. A number of them were fairly straight forward. However several were not: In vdev_disk_physio_completion(), not only were we not using the return value from the static function vdev_disk_dio_put(), but nothing used it, so I changed it to return void and removed the existing (void) cast in the other area where we call it in addition to no longer storing it to a stack value. In FSE_createDTable(), the function is dead code. Its helper function FSE_freeDTable() is also dead code, as are the CPP definitions in `module/zstd/include/zstd_compat_wrapper.h`. We just delete it all. In zfs_zevent_wait(), we have an optimization opportunity. cv_wait_sig() returns 0 if there are waiting signals and 1 if there are none. The Linux SPL version literally returns `signal_pending(current) ? 0 : 1)` and FreeBSD implements the same semantics, we can just do `!cv_wait_sig()` in place of `signal_pending(current)` to avoid unnecessarily calling it again. zfs_setattr() on FreeBSD version did not have error handling issue because the code was removed entirely from FreeBSD version. The error is from updating the attribute directory's files. After some thought, I decided to propapage errors on it to userspace. In zfs_secpolicy_tmp_snapshot(), we ignore a lack of permission from the first check in favor of checking three other permissions. I assume this is intentional. In zfs_create_fs(), the return value of zap_update() was not checked despite setting an important version number. I see no backward compatibility reason to permit failures, so we add an assertion to catch failures. Interestingly, Linux is still using ASSERT(error == 0) from OpenSolaris while FreeBSD has switched to the improved ASSERT0(error) from illumos, although illumos has yet to adopt it here. ASSERT(error == 0) was used on Linux while ASSERT0(error) was used on FreeBSD since the entire file needs conversion and that should be the subject of another patch. dnode_move()'s issue was caused by us not having implemented POINTER_IS_VALID() on Linux. We have a stub in `include/os/linux/spl/sys/kmem_cache.h` for it, when it really should be in `include/os/linux/spl/sys/kmem.h` to be consistent with Illumos/OpenSolaris. FreeBSD put both `POINTER_IS_VALID()` and `POINTER_INVALIDATE()` in `include/os/freebsd/spl/sys/kmem.h`, so we copy what it did. Whenever a report was in platform-specific code, I checked the FreeBSD version to see if it also applied to FreeBSD, but it was only relevant a few times. Lastly, the patch that enabled Clang's static analyzer to be run on the Linux kernel modules needs more work before it can be put into a PR. I plan to do that in the future as part of the on-going static analysis work that I am doing. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14380	2023-01-17 09:57:12 -08:00
Richard Yao	e6328fda2e	Cleanup: !A \|\| A && B is equivalent to !A \|\| B In zfs_zaccess_dataset_check(), we have the following subexpression: (!IS_DEVVP(ZTOV(zp)) \|\| (IS_DEVVP(ZTOV(zp)) && (v4_mode & WRITE_MASK_ATTRS))) When !IS_DEVVP(ZTOV(zp)) is false, IS_DEVVP(ZTOV(zp)) is true under the law of the excluded middle since we are not doing pseudoboolean alegbra. Therefore doing: (IS_DEVVP(ZTOV(zp)) && (v4_mode & WRITE_MASK_ATTRS)) Is unnecessary and we can just do: (v4_mode & WRITE_MASK_ATTRS) The Linux 5.16.14 kernel's coccicheck caught this. The semantic patch that caught it was: ./scripts/coccinelle/misc/excluded_middle.cocci Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14372	2023-01-12 16:00:15 -08:00
Richard Yao	9c8fabffa2	Cleanup: Replace oldstyle struct hack with C99 flexible array members The Linux 5.16.14 kernel's coccicheck caught this. The semantic patch that caught it was: ./scripts/coccinelle/misc/flexible_array.cocci However, unlike the cases where the GNU zero length array extension had been used, coccicheck would not suggest patches for the older style single member arrays. That was good because blindly changing them would break size calculations in most cases. Therefore, this required care to make sure that we did not break size calculations. In the case of `indirect_split_t`, we use `offsetof(indirect_split_t, is_child[is->is_children])` to calculate size. This might be subtly wrong according to an old mailing list thread: https://inbox.sourceware.org/gcc-prs/20021226123454.27019.qmail@sources.redhat.com/T/ That is because the C99 specification should consider the flexible array members to start at the end of a structure, but compilers prefer to put padding at the end. A suggestion was made to allow compilers to allocate padding after the VLA like compilers already did: http://std.dkuug.dk/JTC1/SC22/WG14/www/docs/n983.htm However, upon thinking about it, whether or not we allocate end of structure padding does not matter, so using offsetof() to calculate the size of the structure is fine, so long as we do not mix it with sizeof() on structures with no array members. In the case that we mix them and padding causes offsetof(struct_t, vla_member[0]) to differ from sizeof(struct_t), we would be doing unsafe operations if we underallocate via `offsetof()` and then overcopy via sizeof(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14372	2023-01-12 16:00:03 -08:00
Richard Yao	d35ccc1f59	Cleanup: Fix indentation in zfs_dbgmsg_t `fdc2d30371` accidentally broke the indentation. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14372	2023-01-12 15:59:56 -08:00
Richard Yao	8e7ebf4e2d	Cleanup: Use C99 flexible array members instead of zero length arrays The Linux 5.16.14 kernel's coccicheck caught this. The semantic patch that caught it was: ./scripts/coccinelle/misc/flexible_array.cocci The Linux kernel's documentation makes a good case for why we should not use these: https://www.kernel.org/doc/html/latest/process/deprecated.html#zero-length-and-one-element-arrays Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14372	2023-01-12 15:59:41 -08:00

1 2 3 4 5 ...

378 Commits