mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2025-11-07 23:04:53 +03:00

Author	SHA1	Message	Date
Paul Dagnelie	43eaef6de8	Fix zrele race in zrele_async that can cause hang There is a race condition in zfs_zrele_async when we are checking if we would be the one to evict an inode. This can lead to a txg sync deadlock. Instead of calling into iput directly, we attempt to perform the atomic decrement ourselves, unless that would set the i_count value to zero. In that case, we dispatch a call to iput to run later, to prevent a deadlock from occurring. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #11527 Closes #11530	2021-01-28 11:39:13 -08:00
Brian Behlendorf	dd487640b2	Linux 5.10 compat: restore custom uio_prefaultpages() As part of commit `1c2358c1` the custom uio_prefaultpages() code was removed in favor of using the generic kernel provided iov_iter_fault_in_readable() interface. Unfortunately, it turns out that up until the Linux 4.7 kernel the function would only ever fault in the first iovec of the iov_iter. The result being uiomove_iov() may hang waiting for the page. This commit effectively restores the custom uio_prefaultpages() pages code for Linux 4.9 and earlier kernels which contain the troublesome version of iov_iter_fault_in_readable(). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #11463 Closes #11484	2021-01-22 09:58:49 -08:00
Konstantin Khorenko	59570a05d8	VZ 7 kernel compat: introduce ITER-enabled .direct_IO() via IOVECs Virtuozzo 7 kernels starting 3.10.0-1127.18.2.vz7.163.46 have the following configuration: * no HAVE_VFS_RW_ITERATE * HAVE_VFS_DIRECT_IO_ITER_RW_OFFSET => let's add implementation of zpl_direct_IO() via zpl_aio_{read,write}() in this case. https://bugs.openvz.org/browse/OVZ-7243 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com> Closes #11410 Closes #11411	2021-01-05 10:33:41 -08:00
Brian Behlendorf	fcd9966ed9	Linux 5.11 compat: blk_{un}register_region() As of 5.11 the blk_register_region() and blk_unregister_region() functions have been retired. This isn't a problem since add_disk() has implicitly allocated minor numbers for a very long time. Reviewed-by: Rafael Kitover <rkitover@gmail.com> Reviewed-by: Coleman Kane <ckane@colemankane.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #11387 Closes #11390	2021-01-05 10:26:39 -08:00
Brian Behlendorf	a2621753b2	Linux 5.11 compat: revalidate_disk_size() Both revalidate_disk_size() and revalidate_disk() have been removed. Functionally this isn't a problem because we only relied on these functions to call zvol_revalidate_disk() for us and to perform any additional handling which might be needed for that kernel version. When neither are available we know there's no additional handling needed and we can directly call zvol_revalidate_disk(). Reviewed-by: Rafael Kitover <rkitover@gmail.com> Reviewed-by: Coleman Kane <ckane@colemankane.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #11387 Closes #11390	2021-01-05 10:26:32 -08:00
Brian Behlendorf	e888f28988	Linux 5.11 compat: bdev_whole() The bd_contains member was removed from the block_device structure. Callers needing to determine if a vdev is a whole block device should use the new bdev_whole() wrapper. For older kernels we provide our own bdev_whole() wrapper which relies on bd_contains for compatibility. Reviewed-by: Rafael Kitover <rkitover@gmail.com> Reviewed-by: Coleman Kane <ckane@colemankane.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #11387 Closes #11390	2021-01-05 10:26:25 -08:00
Brian Behlendorf	305510fd33	Linux 5.11 compat: bio_start_io_acct() / bio_end_io_acct() The generic IO accounting functions have been removed in favor of the bio_start_io_acct() and bio_end_io_acct() functions which provide a better interface. These new functions were introduced in the 5.8 kernels but it wasn't until the 5.11 kernel that the previous generic IO accounting interfaces were removed. This commit updates the blk_generic_*_io_acct() wrappers to provide and interface similar to the updated kernel interface. It's slightly different because for older kernels we need to pass the request queue as well as the bio. Reviewed-by: Rafael Kitover <rkitover@gmail.com> Reviewed-by: Coleman Kane <ckane@colemankane.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #11387 Closes #11390	2021-01-05 10:26:18 -08:00
Brian Behlendorf	67cff6e4c1	Linux 5.11 compat: lookup_bdev() The lookup_bdev() function has been updated to require a dev_t be passed as the second argument. This is actually pretty nice since the major number stored in the dev_t was the only part we were interested in. This allows to us avoid handling the bdev entirely. The vdev_lookup_bdev() wrapper was updated to emulate the behavior of the new lookup_bdev() for all supported kernels. Reviewed-by: Rafael Kitover <rkitover@gmail.com> Reviewed-by: Coleman Kane <ckane@colemankane.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #11387 Closes #11390	2021-01-05 10:26:09 -08:00
Michael D Labriola	7a7e101437	Linux 5.10 compat: also zvol_revalidate_disk() Commit `59b68723` added a configure check for 5.10, which removed revalidate_disk(), and conditionally replaced it's usage with a call to the new revalidate_disk_size() function. However, the old function also invoked the device's registered callback, in our case zvol_revalidate_disk(). This commit adds a call to zvol_revalidate_disk() in zvol_update_volsize() to make sure the code path stays the same. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Michael D Labriola <michael.d.labriola@gmail.com> Closes #11358	2020-12-23 14:35:47 -08:00
Brian Behlendorf	401ba57ccd	Fix maybe uninitialized variable warning Commit `1c2358c12` restructured this code and introduced a warning about the variable maybe not being initialized. This cannot happen with the updated code but we should initialize the variable anyway to silence the warning. zpl_file.c: In function ‘zpl_iter_write’: zpl_file.c:324:9: warning: ‘count’ may be used uninitialized in this function [-Wmaybe-uninitialized] Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #11373	2020-12-23 14:35:47 -08:00
Brian Behlendorf	188950df9e	Remove iov_iter_advance() from iter_read There's no need to call iov_iter_advance() in zpl_iter_read(). This was preserved from the previous code where it wasn't needed but also didn't cause any problems. Now that the iter functions also handle pipes that's no longer the case. When fully reading a pipe buffer iov_iter_advance() may results in the pipe buf release function being called which will not be registered resulting in a NULL dereference. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #11375 Closes #11378	2020-12-23 14:35:47 -08:00
Brian Behlendorf	58bc86c5cb	Linux 5.10 compat: use iov_iter in uio structure As of the 5.10 kernel the generic splice compatibility code has been removed. All filesystems are now responsible for registering a ->splice_read and ->splice_write callback to support this operation. The good news is the VFS provided generic_file_splice_read() and iter_file_splice_write() callbacks can be used provided the ->iter_read and ->iter_write callback support pipes. However, this is currently not the case and only iovecs and bvecs (not pipes) are ever attached to the uio structure. This commit changes that by allowing full iov_iter structures to be attached to uios. Ever since the 4.9 kernel the iov_iter structure has supported iovecs, kvecs, bvevs, and pipes so it's desirable to pass the entire thing when possible. In conjunction with this the uio helper functions (i.e uiomove(), uiocopy(), etc) have been updated to understand the new UIO_ITER type. Note that using the kernel provided uio_iter interfaces allowed the existing Linux specific uio handling code to be simplified. When there's no longer a need to support kernel's older than 4.9, then it will be possible to remove the iovec and bvec members from the uio structure and always use a uio_iter. Until then we need to maintain all of the existing types for older kernels. Some additional refactoring and cleanup was included in this change: - Added checks to configure to detect available iov_iter interfaces. Some are available all the way back to the 3.10 kernel and are used when available. In particular, uio_prefaultpages() now always uses iov_iter_fault_in_readable() which is available for all supported kernels. - The unused UIO_USERISPACE type has been removed. It is no longer needed now that the uio_seg enum is platform specific. - Moved zfs_uio.c from the zcommon.ko module to the Linux specific platform code for the zfs.ko module. This gets it out of libzfs where it was never needed and keeps this Linux specific code out of the common sources. - Removed unnecessary O_APPEND handling from zfs_iter_write(), this is redundant and O_APPEND is already handled in zfs_write(); NOTE: Cleanly applying this kernel compatibility change required applying the following commits. This makes the change larger than it absolutely needs to be, but the resulting code matches what's in the branch branch. This is both more tested and makes it easier to apply any future backports in this area. `7cf4cd824` Remove incorrect assertion `783be694f` Reduce confusion in zfs_write `af5626ac2` Return EFAULT at the end of zfs_write() when set `cc1f85be8` Simplify offset and length limit in zfs_write `9585538d0` Const some unchanging variables in zfs_write `86e74dc16` Remove redundant oid parameter to update_pages `b3d723fb0` Factor uid, gid, and projid out of loop in zfs_write `3d40b6554` Share zfs_fsync, zfs_read, zfs_write, et al between Linux and FreeBSD Reviewed-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #11351	2020-12-23 14:35:39 -08:00
Ryan Moeller	86e74dc162	Remove redundant oid parameter to update_pages The oid comes from the znode we are already passing. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Macy <mmacy@FreeBSD.org> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #11176	2020-12-23 14:34:59 -08:00
Matthew Macy	3d40b65540	Share zfs_fsync, zfs_read, zfs_write, et al between Linux and FreeBSD The zfs_fsync, zfs_read, and zfs_write function are almost identical between Linux and FreeBSD. With a little refactoring they can be moved to the common code which is what is done by this commit. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #11078	2020-12-23 14:34:59 -08:00
Brian Behlendorf	f217a2b902	Fix possibly uninitialized 'root_inode' variable warning Resolve an uninitialized variable warning when compiling. In function ‘zfs_domount’: warning: ‘root_inode’ may be used uninitialized in this function [-Wmaybe-uninitialized] sb->s_root = d_make_root(root_inode); Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #11306	2020-12-23 14:34:59 -08:00
Matthew Macy	043ef5c25e	Fix problems in zvol_set_volmode_impl - Don't leave fstrans set when passed a snapshot - Don't remove minor if volmode already matches new value - (FreeBSD) Wait for GEOM ops to complete before trying remove (at create time GEOM will be "tasting" in parallel) - (FreeBSD) Don't leak zvol_state_lock on open if zv == NULL - (FreeBSD) Don't try to unlock zv->zv_state lock if zv == NULL Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #11199	2020-11-17 12:20:09 -08:00
Brian Behlendorf	d02fc15ba1	Linux: Fix ZFS_ENTER/ZFS_EXIT/ZFS_VERFY_ZP usage The ZFS_ENTER/ZFS_EXIT/ZFS_VERFY_ZP macros should not be used in the Linux zpl_*.c source files. They return a positive error value which is correct for the common code, but not for the Linux specific kernel code which expects a negative return value. The ZPL_ENTER/ZPL_EXIT/ZPL_VERFY_ZP macros should be used instead. Furthermore, the ZPL_EXIT macro has been updated to not call the zfs_exit_fs() function. This prevents a possible deadlock which can occur when a snapshot is automatically unmounted because the zpl_show_devname() must never wait on in progress automatic snapshot unmounts. Reviewed-by: Adam Moss <c@yotes.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #11169 Closes #11201	2020-11-14 10:51:27 -08:00
Brian Behlendorf	d237d9a918	Linux: Fix mount/unmount when dataset name has a space The custom zpl_show_devname() helper should translate spaces in to the octal escape sequence \040. The getmntent(2) function is aware of this convention and properly translates the escape character back to a space when reading the fsname. Without this change the `zfs mount` and `zfs unmount` commands incorrectly detect when a dataset with a name containing spaces is mounted. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #11182 Closes #11187	2020-11-12 09:01:55 -08:00
Tony Perkins	2132ae465d	Start snapdir_iterate traversals to begin wtih the value of zero. The microzap hash can sometimes be zero for single digit snapnames. The zap cursor can then have a serialized value of two (for . and ..), and skip the first entry in the avl tree for the .zfs/snapshot directory listing, and therefore does not return all snapshots. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Cedric Berger <cedric@precidata.com> Signed-off-by: Tony Perkins <tperkins@datto.com> Closes #11039	2020-11-12 09:01:27 -08:00
Mateusz Guzik	995b80fa3a	G/C struct znode -> z_moved The field is yet another leftover from unsupported zfs_znode_move. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #11186	2020-11-11 11:40:15 -08:00
Coleman Kane	a30fed54f4	Linux 5.10 compat: revalidate_disk_size() added A new function was added named revalidate_disk_size() and the old revalidate_disk() appears to have been deprecated. As the only ZFS code that calls this function is zvol_update_volsize, swapping the old function call out for the new one should be all that is required. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #11085	2020-11-03 09:51:31 -08:00
Coleman Kane	e767b1cacc	Linux 5.10 compat: check_disk_change() removed Kernel 5.10 removed check_disk_change() in favor of callers using the faster bdev_check_media_change() instead, and explicitly forcing bdev revalidation when they desire that behavior. To preserve prior behavior, I have wrapped this into a zfs_check_media_change() macro that calls an inline function for the new API that mimics the old behavior when check_disk_change() doesn't exist, and just calls check_disk_change() if it exists. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #11085	2020-11-03 09:51:26 -08:00
Coleman Kane	d2090becab	Linux 5.10 compat: percpu_ref added data member Kernel commit 2b0d3d3e4fcfb brought in some changes to the struct percpu_ref structure that moves most of its fields into a member struct named "data" of type struct percpu_ref_data. This includes the "count" member which is updated by vdev_blkg_tryget(), so update this function to chase the API change, and detect it via configure. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #11085	2020-11-03 09:51:20 -08:00
Ryan Moeller	00a27515f0	zvol_os: Tidy up asserts Using more specific assert variants gives better messages on failure. No functional change. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Matt Macy <mmacy@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #11117	2020-10-30 16:06:24 -07:00
Ryan Moeller	c3ae9321bf	Update references to nonexistent man pages in code Refer to the correct section or alternative for FreeBSD and Linux. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #11132	2020-10-30 16:04:41 -07:00
Mateusz Guzik	7e76d21bc8	Linux: g/c leftover fence in zfs_znode_alloc The port removed provisions for zfs_znode_move but the cleanup missed this bit. To quote the original: [snip] list_insert_tail(&zfsvfs->z_all_znodes, zp); membar_producer(); /* * Everything else must be valid before assigning z_zfsvfs makes the * znode eligible for zfs_znode_move(). */ zp->z_zfsvfs = zfsvfs; [/snip] In the current code it is immediately followed by unlock which issues the same fence, thus plays no role in correctness. Reviewed-by: Matt Macy <mmacy@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #11115	2020-10-30 16:04:05 -07:00
Ryan Moeller	725c9e22ca	Cross-platform acltype The acltype property is currently hidden on FreeBSD and does not reflect the NFSv4 style ZFS ACLs used on the platform. This makes it difficult to observe that a pool imported from FreeBSD on Linux has a different type of ACL that is being ignored, and vice versa. Add an nfsv4 acltype and expose the property on FreeBSD. Make the default acltype nfsv4 on FreeBSD. Setting acltype to an unhanded style is treated the same as setting it to off. The ACLs will not be removed, but they will be ignored. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10520	2020-10-16 13:05:00 -07:00
Ryan Moeller	5e7198b873	Linux: Initialize zp in zfs_setattr_dir The value of zp is used without having been initialized under some conditions. Initialize the pointer to NULL. Add a regression test case using chown in acl/posix. However, this is not enough because the setup sets xattr=sa, which means zfs_setattr_dir will not be called. Create a second group of acl tests in acl/posix-sa duplicating the acl/posix tests with symlinks, and remove xattr=sa from the original acl/posix tests. This provides more coverage for the default xattr=on code. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10043 Closes #11025	2020-10-16 13:01:29 -07:00
Brian Behlendorf	46c71074ca	Replace ZFS on Linux references with OpenZFS This change updates the documentation to refer to the project as OpenZFS instead ZFS on Linux. Web links have been updated to refer to https://github.com/openzfs/zfs. The extraneous zfsonlinux.org web links in the ZED and SPL sources have been dropped. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #11007	2020-10-16 13:01:24 -07:00
Ryan Moeller	47e3dba972	Throw const on some strings In C, const indicates to the reader that mutation will not occur. It can also serve as a hint about ownership. Add const in a few places where it makes sense. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org> Closes #10997	2020-10-16 12:55:56 -07:00
Brian Behlendorf	fc5966589b	Fix buggy procfs_list_seq_next warning The kernel seq_read() helper function expects ->next() to update the passed position even there are no more entries. Failure to do so results in the following warning being logged. seq_file: buggy .next function procfs_list_seq_next [spl] did not update position index Functionally there is no issue with the way procfs_list_seq_next() is implemented and the warning is harmless. However, we want to silence this some what scary incorrect warning. This commit updates the Linux procfs code to advance the position even for the last entry. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #10984 Closes #10996	2020-10-01 12:30:28 -07:00
Brian Behlendorf	7b353d2c8c	Fix PREEMPTION=y and BLK_CGROUP=y config on arm64 With PREEMPTION=y and BLK_CGROUP=y preempt_schedule_notrace() is being used on arm64 which is a GPL-only function and hence the build of the DKMS kernel module fails. Fix that by redefining preempt_schedule_notrace() to preempt_schedule() which should be safe as long as tracing is not used. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Juerg Haefliger <juergh@canonical.com> Closes #8545 Closes #9948 Closes #10416 Closes #10973	2020-10-01 12:20:59 -07:00
Matthew Macy	c70c6e004e	FreeBSD: Add support for procfs_list The procfs_list interface is required by several kstats. Implement this functionality for FreeBSD to provide access to these kstats. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10890	2020-10-01 12:18:56 -07:00
George Wilson	5899ea5a77	vdev_ashift should only be set once == Motivation and Context The new vdev ashift optimization prevents the removal of devices when a zfs configuration is comprised of disks which have different logical and physical block sizes. This is caused because we set 'spa_min_ashift' in vdev_open and then later call 'vdev_ashift_optimize'. This would result in an inconsistency between spa's ashift calculations and that of the top-level vdev. In addition, the optimization logical ignores the overridden ashift value that would be provided by '-o ashift=<val>'. == Description This change reworks the vdev ashift optimization so that it's only set the first time the device is configured. It still allows the physical and logical ahsift values to be set every time the device is opened but those values are only consulted on first open. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Cedric Berger <cedric@precidata.com> Signed-off-by: George Wilson <gwilson@delphix.com> External-Issue: DLPX-71831 Closes #10932	2020-09-18 12:40:20 -07:00
George Wilson	dacb4f6a61	pool may become suspended during device expansion When expanding a device zfs needs to rescan the partition table to get the correct size. This can only happen when we're in the kernel and requires the device to be closed. As part of the rescan, udev is notified and the device links are removed and recreated. This leave a window where the vdev code may try to reopen the device before udev has recreated the link. If that happens, then the pool may end up in a suspended state. To correct this, we leverage the BLKPG_RESIZE_PARTITION ioctl which allows the partition information to be modified even while it's in use. This ioctl also does not remove the device link associated with the zfs data partition so it eliminates the race condition that can occur in the kernel. Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Wilson <gwilson@delphix.com> Closes #10897	2020-09-18 12:38:30 -07:00
John Poduska	aa7817c151	Need a long hold in zpl_mount_impl In zpl_mount_impl, there is: dmu_objset_hold ; returns with pool & ds held dsl_pool_rele sget dsl_dataset_rele As spelled out in the "DSL Pool Configuration Lock" in dsl_pool.c, this requires a long hold. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Paul Zuchowski <pzuchowski@datto.com> Signed-off-by: John Poduska <jpoduska@datto.com> Closes #10936	2020-09-18 12:38:09 -07:00
Ryan Moeller	2cec08a1f0	Rename acltype=posixacl to acltype=posix Prefer acltype=off\|posix, retaining the old names as aliases. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10918	2020-09-18 12:38:00 -07:00
Ryan Moeller	c8bbb0c93d	Linux: Prevent destruction while showing mount devname Use ZFS_ENTER and ZFS_EXIT to protect datasets while their mount devname is being retrieved. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10892 Closes #10927	2020-09-15 18:36:03 -07:00
Matthew Macy	36f36610c3	Replace cv_{timed}wait_sig with cv_{timed}wait_idle where appropriate There are a number of places where cv_?_sig is used simply for accounting purposes but the surrounding code has no ability to cope with actually receiving a signal. On FreeBSD it is possible to send signals to individual kernel threads so this could enable undesirable behavior. This patch adds routines on Linux that will do the same idle accounting as _sig without making the task interruptible. On FreeBSD cv__idle are all aliases for cv_ Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10843	2020-09-09 10:21:01 -07:00
Spencer Kinny	fd20a81b9a	Links in Source Files Added comments in following files with links to Illumos manual pages: ./module/avl/avl.c ./module/nvpair/nvpair.c ./module/os/linux/spl/spl-kstat.c ./module/os/freebsd/spl/spl_kstat.c Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Spencer Kinny <spencerkinny1995@gmail.com> Closes #5113 Closes #10859	2020-09-03 16:17:18 -07:00
Ryan Moeller	76a157f004	Add 'zfs rename -u' to rename without remounting Allow to rename file systems without remounting if it is possible. It is possible for file systems with 'mountpoint' property set to 'legacy' or 'none' - we don't have to change mount directory for them. Currently such file systems are unmounted on rename and not even mounted back. This introduces layering violation, as we need to update 'f_mntfromname' field in statfs structure related to mountpoint (for the dataset we are renaming and all its children). In my opinion it is worth it, as it allow to update FreeBSD in even cleaner way - in ZFS-only configuration root file system is ZFS file system with 'mountpoint' property set to 'legacy'. If root dataset is named system/rootfs, we can snapshot it (system/rootfs@upgrade), clone it (system/oldrootfs), update FreeBSD and if it doesn't boot we can boot back from system/oldrootfs and rename it back to system/rootfs while it is mounted as /. Before it was not possible, because unmounting / was not possible. Authored by: Pawel Jakub Dawidek <pjd@FreeBSD.org> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported by: Matt Macy <mmacy@freebsd.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10839	2020-09-03 16:16:15 -07:00
Matthew Macy	baed4fbacb	Move spa_stats.c to common code Initially it was considered simplest to stub out all of the functions on FreeBSD. Now that FreeBSD supports KSTAT_TYPE_RAW at least some of the functionality should be made available. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10842	2020-08-30 14:19:08 -07:00
Brian Behlendorf	c6ee83893e	Linux 5.9 compat: NR_SLAB_RECLAIMABLE Commit `dcdc12e` added compatibility code to treat NR_SLAB_RECLAIMABLE_B as if it were the same as NR_SLAB_RECLAIMABLE. However, the new value is in bytes while the old value was in pages which means they are not interchangeable. The only place the reclaimable slab size is used is as a component of the calculation done by arc_free_memory(). This function returns the amount of memory the ARC considers to be free or reclaimable at little cost. Rather than switch to a new interface to get this value it has been removed it from the calculation. It is normally a minor component compared to the number of inactive or free pages, and removing it aligns the behavior with the FreeBSD version of arc_free_memory(). Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Coleman Kane <ckane@colemankane.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #10834	2020-08-30 14:18:50 -07:00
youzhongyang	b900799768	Fix inability to destroy snapshot used over NFS The cache of struct svc_export and struct svc_expkey by nfsd and rpc.mountd for the snapshot holds references to the mount point. We need to flush them out before unmounting, otherwise umount would fail with EBUSY. Reviewed-by: Don Brady <don.brady@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Youzhong Yang <yyang@mathworks.com> Closes #6000 Closes #10783	2020-08-24 17:33:02 -07:00
Andrew	a741b386d3	Prevent zfs_acl_chmod() if aclmode restricted and ACL inherited In absence of inheriting entry for owner@, group@, or everyone@, zfs_acl_chmod() is called to set these. This can cause confusion for Samba admins who do not expect these entries to appear on newly created files and directories once they have been stripped from from the parent directory. When aclmode is set to "restricted", chmod is prevented on non-trivial ACLs. It is not a stretch to assume that in this case the administrator does not want ZFS to add the missing special entries. Add check for this aclmode, and if an inherited entry is present skip zfs_acl_chmod(). Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrew Walker <awalker@ixsystems.com> Closes #10748	2020-08-22 21:49:25 -07:00
Ryan Moeller	6fe3498ca3	Import vdev ashift optimization from FreeBSD Many modern devices use physical allocation units that are much larger than the minimum logical allocation size accessible by external commands. Two prevalent examples of this are 512e disk drives (512b logical sector, 4K physical sector) and flash devices (512b logical sector, 4K or larger allocation block size, and 128k or larger erase block size). Operations that modify less than the physical sector size result in a costly read-modify-write or garbage collection sequence on these devices. Simply exporting the true physical sector of the device to ZFS would yield optimal performance, but has two serious drawbacks: 1. Existing pools created with devices that have different logical and physical block sizes, but were configured to use the logical block size (e.g. because the OS version used for pool construction reported the logical block size instead of the physical block size) will suddenly find that the vdev allocation size has increased. This can be easily tolerated for active members of the array, but ZFS would prevent replacement of a vdev with another identical device because it now appears that the smaller allocation size required by the pool is not supported by the new device. 2. The device's physical block size may be too large to be supported by ZFS. The optimal allocation size for the vdev may be quite large. For example, a RAID controller may export a vdev that requires read-modify-write cycles unless accessed using 64k aligned/sized requests. ZFS currently has an 8k minimum block size limit. Reporting both the logical and physical allocation sizes for vdevs solves these problems. A device may be used so long as the logical block size is compatible with the configuration. By comparing the logical and physical block sizes, new configurations can be optimized and administrators can be notified of any existing pools that are sub-optimal. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Matthew Macy <mmacy@freebsd.org> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10619	2020-08-21 12:53:17 -07:00
Ryan Moeller	009cc8e884	Make zc_nvlist_src_size limit tunable We limit the size of nvlists passed to the kernel so a user cannot make the kernel do an unreasonably large allocation. On FreeBSD this limit was 128 kiB, which turns out to be a bit too small when doing some operations involving a large number of datasets or snapshots, for example replication. Make this limit tunable, with a platform-specific auto default. Linux keeps its limit at KMALLOC_MAX_SIZE. FreeBSD uses 1/4 of the system limit on user wired memory, which allows it to scale depending on system configuration. Reviewed-by: Matt Macy <mmacy@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org> Issue #6572 Closes #10706	2020-08-18 09:33:55 -07:00
Matthew Ahrens	85ec5cbae2	Include scatter_chunk_waste in arc_size The ARC caches data in scatter ABD's, which are collections of pages, which are typically 4K. Therefore, the space used to cache each block is rounded up to a multiple of 4K. The ABD subsystem tracks this wasted memory in the `scatter_chunk_waste` kstat. However, the ARC's `size` is not aware of the memory used by this round-up, it only accounts for the size that it requested from the ABD subsystem. Therefore, the ARC is effectively using more memory than it is aware of, due to the `scatter_chunk_waste`. This impacts observability, e.g. `arcstat` will show that the ARC is using less memory than it effectively is. It also impacts how the ARC responds to memory pressure. As the amount of `scatter_chunk_waste` changes, it appears to the ARC as memory pressure, so it needs to resize `arc_c`. If the sector size (`1<<ashift`) is the same as the page size (or larger), there won't be any waste. If the (compressed) block size is relatively large compared to the page size, the amount of `scatter_chunk_waste` will be small, so the problematic effects are minimal. However, if using 512B sectors (`ashift=9`), and the (compressed) block size is small (e.g. `compression=on` with the default `volblocksize=8k` or a decreased `recordsize`), the amount of `scatter_chunk_waste` can be very large. On a production system, with `arc_size` at a constant 50% of memory, `scatter_chunk_waste` has been been observed to be 10-30% of memory. This commit adds `scatter_chunk_waste` to `arc_size`, and adds a new `waste` field to `arcstat`. As a result, the ARC's memory usage is more observable, and `arc_c` does not need to be adjusted as frequently. Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10701	2020-08-17 20:04:04 -07:00
Matthew Ahrens	994de7e4b7	Remove KMC_KMEM and KMC_VMEM `KMC_KMEM` and `KMC_VMEM` are now unused since all SPL-implemented caches are `KMC_KVMEM`. KMC_KMEM: Given the default value of `spl_kmem_cache_kmem_limit`, we don't use kmalloc to back the SPL caches, instead we use kvmalloc (KMC_KVMEM). The flag, module parameter, /proc entries, and associated code are removed. KMC_VMEM: This flag is not used, and kvmalloc() is always preferable to vmalloc(). The flag, /proc entries, and associated code are removed. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10673	2020-08-17 16:04:28 -07:00
Jorgen Lundman	faa296c73c	Release onexit/events with any missed zfsdev_state Linux and FreeBSD will most likely never see this issue. On macOS when kext is unloaded, but zed is still connected, zed will be issued ENODEV. As the cdevsw is released, the kernel will not have zfsdev_release() called to release minor/onexit/events, and it "leaks". This ensures it is cleaned up before unload. Changed the for loop from zsprev, to zsnext style, for less code duplication. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #10700	2020-08-13 15:03:23 -07:00

1 2 3 4

188 Commits