mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-04-17 08:54:52 +03:00

Author	SHA1	Message	Date
Brian Behlendorf	62278325a5	Export dmu_offset_next() symbol Export the dmu_offset_next() symbol for use by Lustre. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #10796	2020-09-16 14:17:51 -07:00
DeHackEd	76a1232ee7	Use boot_ncpus in place of max_ncpus in taskq_create Due to hotplug support or BIOS bugs sometimes max_ncpus can be an absurdly high value. I have a system with 32 cores/threads but reports max_ncpus == 440. This many threads potentially cripples the system during arc_prune floods for example. boot_ncpus is the number of working CPUs when called so use that instead. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: DHE <git@dehacked.net> Closes #10282	2020-09-16 13:35:42 -07:00
Olaf Faaland	0801e4e5c9	Initialize mmp_last_write when the mmp thread starts (#10912) A great deal of time may go by between when mmp_init() is called and the MMP thread starts, particularly if there are bad devices, because there is I/O checking configs etc. If this time is too long, (gethrtime() - mmp_last_write) > mmp_fail_ns at the time the MMP thread starts. If MMP is configured to suspend the pool, the pool will be suspended immediately. This can be seen in issue #10838 The value of mmp_last_write doesn't matter before the mmp thread starts. To give the MMP thread time to issue and land MMP writes, initialize mmp_last_write when the MMP thread starts. Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Closes #10873	2020-09-16 00:16:47 +00:00
Brian Behlendorf	d2acd3696f	Linux 5.9 compat: NR_SLAB_RECLAIMABLE (#10865) Commit `dcdc12e` added compatibility code to treat NR_SLAB_RECLAIMABLE_B as if it were the same as NR_SLAB_RECLAIMABLE. However, the new value is in bytes while the old value was in pages which means they are not interchangeable. The only place the reclaimable slab size is used is as a component of the calculation done by arc_free_memory(). This function returns the amount of memory the ARC considers to be free or reclaimable at little cost. Rather than switch to a new interface to get this value it has been removed from the calculation. It is normally a minor component compared to the number of inactive or free pages, and removing it aligns the behavior with the FreeBSD version of arc_free_memory(). Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Coleman Kane <ckane@colemankane.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Backported-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Closes #10834	2020-09-15 21:16:58 +00:00
Coleman Kane	c33b623535	Linux 5.9 compat: make_request_fn replaced with submit_bio interface The make_request_fn and associated API was replaced recently in a Linux 5.9 merge, to replace its functionality with a new submit_bio member in struct block_device_operations. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #10696 (cherry picked from commit `d817c17100`) Signed-off-by: Eli Schwartz <eschwartz@archlinux.org>	2020-09-15 21:16:58 +00:00
Jorgen Lundman	04837c8dcb	Replace sprintf()->snprintf() and strcpy()->strlcpy() The strcpy() and sprintf() functions are deprecated on some platforms. Care is needed to ensure correct size is used. If some platforms miss snprintf, we can add a #define to sprintf, likewise strlcpy(). The biggest change is adding a size parameter to zfs_id_to_fuidstr(). The various *_impl_get() functions are only used on linux and have not yet been updated.	2020-09-15 21:16:58 +00:00
George Amanakis	4fe5f9016f	Add missing zfs_refcount_destroy() in key_mapping_rele() Otherwise when running with reference_tracking_enable=TRUE mounting and unmounting an encrypted dataset panics with: Call Trace: dump_stack+0x66/0x90 slab_err+0xcd/0xf2 ? __kmalloc+0x174/0x260 ? __kmem_cache_shutdown+0x158/0x240 __kmem_cache_shutdown.cold+0x1d/0x115 shutdown_cache+0x11/0x140 kmem_cache_destroy+0x210/0x230 spl_kmem_cache_destroy+0x122/0x3e0 [spl] zfs_refcount_fini+0x11/0x20 [zfs] spa_fini+0x4b/0x120 [zfs] zfs_kmod_fini+0x6b/0xa0 [zfs] _fini+0xa/0x68c [zfs] __x64_sys_delete_module+0x19c/0x2b0 do_syscall_64+0x5b/0x1a0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-By: Tom Caputi <tcaputi@datto.com> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #10246	2020-05-12 10:53:32 -07:00
Brian Behlendorf	e6142ac0f2	Linux 5.7 compat: blk_alloc_queue() Commit https://github.com/torvalds/linux/commit/3d745ea5 simplified the blk_alloc_queue() interface by updating it to take the request queue as an argument. Add a wrapper function which accepts the new arguments and internally uses the available interfaces. Other minor changes include increasing the Linux-Maximum to 5.6 now that 5.6 has been released. It was not bumped to 5.7 because this release has not yet been finalized and is still subject to change. Added local 'struct zvol_state_os *zso' variable to zvol_alloc. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #10181 Closes #10187	2020-05-12 10:53:32 -07:00
Matthew Macy	ea15efd4c9	Prefix struct rangelock A struct rangelock already exists on FreeBSD. Add a zfs_ prefix as per our convention to prevent any conflict with existing symbols. This change is a follow up to `2cc479d0`. Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9534	2020-05-12 10:53:32 -07:00
Fabio Scaccabarozzi	590ababea2	Bugfix/fix uio partial copies In zfs_write(), the loop continues to the next iteration without accounting for partial copies occurring in uiomove_iov when copy_from_user/__copy_from_user_inatomic return a non-zero status. This results in "zfs: accessing past end of object..." in the kernel log, and the write failing. Account for partial copies and update uio struct before returning EFAULT, leave a comment explaining the reason why this is done. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: ilbsmart <wgqimut@gmail.com> Signed-off-by: Fabio Scaccabarozzi <fsvm88@gmail.com> Closes #8673 Closes #10148	2020-05-12 10:53:32 -07:00
Mark Roper	009ff83548	Prevent deadlock in arc_read in Linux memory reclaim callback Using zfs with Lustre, an arc_read can trigger kernel memory allocation that in turn leads to a memory reclaim callback and a deadlock within a single zfs process. This change uses spl_fstrans_mark and spl_fstrans_unmark to prevent the reclaim attempt and the deadlock (https://zfsonlinux.topicbox.com/groups/zfs-devel/T4db2c705ec1804ba). The stack trace observed is: __schedule at ffffffff81610f2e schedule at ffffffff81611558 schedule_preempt_disabled at ffffffff8161184a __mutex_lock at ffffffff816131e8 arc_buf_destroy at ffffffffa0bf37d7 [zfs] dbuf_destroy at ffffffffa0bfa6fe [zfs] dbuf_evict_one at ffffffffa0bfaa96 [zfs] dbuf_rele_and_unlock at ffffffffa0bfa561 [zfs] dbuf_rele_and_unlock at ffffffffa0bfa32b [zfs] osd_object_delete at ffffffffa0b64ecc [osd_zfs] lu_object_free at ffffffffa06d6a74 [obdclass] lu_site_purge_objects at ffffffffa06d7fc1 [obdclass] lu_cache_shrink_scan at ffffffffa06d81b8 [obdclass] shrink_slab at ffffffff811ca9d8 shrink_node at ffffffff811cfd94 do_try_to_free_pages at ffffffff811cfe63 try_to_free_pages at ffffffff811d01c4 __alloc_pages_slowpath at ffffffff811be7f2 __alloc_pages_nodemask at ffffffff811bf3ed new_slab at ffffffff81226304 ___slab_alloc at ffffffff812272ab __slab_alloc at ffffffff8122740c kmem_cache_alloc at ffffffff81227578 spl_kmem_cache_alloc at ffffffffa048a1fd [spl] arc_buf_alloc_impl at ffffffffa0befba2 [zfs] arc_read at ffffffffa0bf0924 [zfs] dbuf_read at ffffffffa0bf9083 [zfs] dmu_buf_hold_by_dnode at ffffffffa0c04869 [zfs] Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mark Roper <markroper@gmail.com> Closes #9987	2020-05-12 10:53:32 -07:00
Alexander Motin	4e55349857	Fix infinite scan on a pool with only special allocations Attempt to run scrub or resilver on a new pool containing only special allocations (special vdev added on creation) caused infinite loop because of dsl_scan_should_clear() limiting memory usage to 5% of pool size, which it calculated accounting only normal allocation class. Addition of special and just in case dedup classes fixes the issue. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #10106 Closes #8694	2020-05-12 10:53:32 -07:00
Brian Behlendorf	06b473a8ae	Linux 5.6 compat: timestamp_truncate() The timestamp_truncate() function was added, it replaces the existing timespec64_trunc() function. This change renames our wrapper function to be consistent with the upstream name and updates the compatibility code for older kernels accordingly. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #9956 Closes #9961	2020-05-12 10:53:32 -07:00
Brian Behlendorf	cf2a3464e9	Linux 5.6 compat: time_t As part of the Linux kernel's y2038 changes the time_t type has been fully retired. Callers are now required to use the time64_t type. Rather than move to the new type, I've removed the few remaining places where a time_t is used in the kernel code. They've been replaced with a uint64_t which is already how ZFS internally handled these values. Going forward we should work towards updating the remaining user space time_t consumers to the 64-bit interfaces. Reviewed-by: Matthew Macy <mmacy@freebsd.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #10052 Closes #10064	2020-05-12 10:53:32 -07:00
Brian Behlendorf	49f065d5a4	Linux 5.5 compat: blkg_tryget() Commit https://github.com/torvalds/linux/commit/9e8d42a0f accidentally converted the static inline function blkg_tryget() to GPL-only for kernels built with CONFIG_PREEMPT_RCU=y and CONFIG_BLK_CGROUP=y. Resolve the build issue by providing our own equivalent functionality when needed which uses rcu_read_lock_sched() internally as before. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #9745 Closes #10072	2020-05-12 10:53:32 -07:00
Tony Hutter	9e36832d31	Fix zfs-0.8.3 "qat.h" This applies the patch from: https://github.com/zfsonlinux/zfs/issues/9476#issuecomment-543854498 ...which was originally from: `9fa8b5b` QAT related bug fixes This allows QAT to build. Signed-off-by: Tony Hutter <hutter2@llnl.gov>	2020-01-22 13:49:07 -08:00
jwpoduska	1be3cba381	Prevent unnecessary resilver restarts If a device is participating in an active resilver, then it will have a non-empty DTL. Operations like vdev_{open,reopen,probe}() can cause the resilver to be restarted (or deferred to be restarted later), which is unnecessary if the DTL is still covered by the current scan range. This is similar to the logic in vdev_dtl_should_excise() where the DTL can only be excised if its max txg is in the resilvered range. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: John Gallagher <john.gallagher@delphix.com> Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl> Signed-off-by: John Poduska <jpoduska@datto.com> Issue #840 Closes #9155 Closes #9378 Closes #9551 Closes #9588	2020-01-22 13:49:07 -08:00
Brian Behlendorf	0fd9a28de8	Fix QAT allocation failure return value When qat_compress() fails to allocate the required contiguous memory it mistakenly returns success. This prevents the fallback software compression from taking over and (un)compressing the block. Resolve the issue by correctly setting the local 'status' variable on all exit paths. Furthermore, initialize it to CPA_STATUS_FAIL to ensure qat_compress() always fails safe to guard against any similar bugs in the future. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #9784 Closes #9788	2020-01-22 13:49:07 -08:00
Steve Mokris	da6a7f0239	Avoid some crashes when importing a pool with corrupt metadata - Skip invalid DVAs when importing pools in readonly mode (in addition to when the config is untrusted). - Upon encountering a DVA with a null VDEV, fail gracefully instead of panicking with a NULL pointer dereference. Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Steve Mokris <smokris@softpixel.com> Closes #9022	2020-01-22 13:49:06 -08:00
Brian Behlendorf	bb04f9c195	Cancel initialize and TRIM before vdev_metaslab_fini() Any running 'zpool initialize' or TRIM must be cancelled prior to the vdev_metaslab_fini() call in spa_vdev_remove_log() which will unload the metaslabs and set ms->ms_group == NULL. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8602 Closes #9751	2020-01-22 13:49:06 -08:00
loli10K	e05c965d5b	Fix for ARC sysctls ignored at runtime This change leverages module_param_call() to run arc_tuning_update() immediately after the ARC tunable has been updated as suggested in `cffa837` code review. A simple test case is added to the ZFS Test Suite to prevent future regressions in functionality. This is a backport of #9489 provided from: https://github.com/zfsonlinux/zfs/pull/9776#issuecomment-569418370 Signed-off-by: loli10K <ezomori.nozomu@gmail.com>	2020-01-22 13:49:06 -08:00
Brian Behlendorf	d01290f44d	cppcheck: (warning) Possible null pointer dereference: dnp The dnp argument can only be set to NULL when the DNODE_DRY_RUN flag is set. In which case, an early return path will be executed and a NULL pointer dereference at the given location is impossible. Add an additional ASSERT to silence the cppcheck warning and document that dbp must never be NULL at the point in the function. [module/zfs/dnode.c:1566]: (warning) Possible null pointer deref: dnp Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #9732	2020-01-22 13:49:06 -08:00
Tomohiro Kusumi	6455859ee7	Don't fail to apply umask for O_TMPFILE files Apply umask to `mode` which will eventually be applied to inode. This is needed since VFS doesn't apply umask for O_TMPFILE files. (Note that zpl_init_acl() applies `ip->i_mode &= ~current_umask();` only when POSIX ACL is used.) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #8997 Closes #8998	2020-01-22 13:49:05 -08:00
Tom Caputi	7ad0ae91d5	Allow empty ds_props_obj to be destroyed Currently, 'zfs list' and 'zfs get' commands can be slow when working with snapshots that have a ds_props_obj. This is because the code that discovers all of the properties for these snapshots needs to read this object for each snapshot, which almost always ends up causing an extra random synchronous read for each snapshot. This performance penalty exists even if the properties on that snapshot have been unset because the object is normally only freed when the snapshot is freed, even though it is only created when it is needed. This patch allows the user to regain 'zfs list' performance on these snapshots by destroying the ds_props_obj when it no longer has any entries left. In practice on a production machine, this optimization seems to make 'zfs list' about 55% faster. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Paul Zuchowski <pzuchowski@datto.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #9704	2020-01-22 13:49:05 -08:00
Matthew Ahrens	856d185dc2	Fix use-after-free of vd_path in spa_vdev_remove() After spa_vdev_remove_aux() is called, the config nvlist is no longer valid, as it's been replaced by the new one (with the specified device removed). Therefore any pointers into the nvlist are no longer valid. So we can't save the result of `fnvlist_lookup_string(nv, ZPOOL_CONFIG_PATH)` (in vd_path) across the call to spa_vdev_remove_aux(). Instead, use spa_strdup() to save a copy of the string before calling spa_vdev_remove_aux. Found by AddressSanitizer: ERROR: AddressSanitizer: heap-use-after-free on address ... READ of size 34 at 0x608000a1fcd0 thread T686 #0 0x7fe88b0c166d (/usr/lib/x86_64-linux-gnu/libasan.so.4+0x5166d) #1 0x7fe88a5acd6e in spa_strdup spa_misc.c:1447 #2 0x7fe88a688034 in spa_vdev_remove vdev_removal.c:2259 #3 0x55ffbc7748f8 in ztest_vdev_aux_add_remove ztest.c:3229 #4 0x55ffbc769fba in ztest_execute ztest.c:6714 #5 0x55ffbc779a90 in ztest_thread ztest.c:6761 #6 0x7fe889cbc6da in start_thread #7 0x7fe8899e588e in __clone 0x608000a1fcd0 is located 48 bytes inside of 88-byte region freed by thread T686 here: #0 0x7fe88b14e7b8 in __interceptor_free #1 0x7fe88ae541c5 in nvlist_free nvpair.c:874 #2 0x7fe88ae543ba in nvpair_free nvpair.c:844 #3 0x7fe88ae57400 in nvlist_remove_nvpair nvpair.c:978 #4 0x7fe88a683c81 in spa_vdev_remove_aux vdev_removal.c:185 #5 0x7fe88a68857c in spa_vdev_remove vdev_removal.c:2221 #6 0x55ffbc7748f8 in ztest_vdev_aux_add_remove ztest.c:3229 #7 0x55ffbc769fba in ztest_execute ztest.c:6714 #8 0x55ffbc779a90 in ztest_thread ztest.c:6761 #9 0x7fe889cbc6da in start_thread Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #9706	2020-01-22 13:49:05 -08:00
Paul Zuchowski	4d658bda32	zio_decompress_data always ASSERTs successful decompression This interferes with zdb_read_block trying all the decompression algorithms when the 'd' flag is specified, as some are expected to fail. Also control the output when guessing algorithms, try the more common compression types first, allow specifying lsize/psize, and fix an uninitialized variable. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Zuchowski <pzuchowski@datto.com> Closes #9612 Closes #9630	2020-01-22 13:49:05 -08:00
Matthew Macy	d2233a08fa	Exclude data from cores unconditionally and metadata conditionally This change allows us to align the core dump logic across platforms. Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9691	2020-01-22 13:49:05 -08:00
Brian Behlendorf	2525b71c68	ZTS: Fix zpool_reopen_001_pos Update the vdev_disk_open() retry logic to use a specified number of milliseconds to be more robust. Additionally, on failure log both the time waited and requested timeout to the internal log. The default maximum allowed open retry time has been increased from 500ms to 1000ms. Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #9680 Conflicts:	2020-01-22 13:49:05 -08:00
Alexander Motin	388ef045b2	Fix use-after-free in case of L2ARC prefetch failure In case an L2ARC read fails, l2arc_read_done() creates a _different_ ZIO to read data from the original storage device. Unfortunately, a pointer to the failed ZIO remains in hdr->b_l1hdr.b_acb->acb_zio_head, and if some other read tries to bump the ZIO priority, it will crash. The problem is reproducible by corrupting L2ARC content and reading some data with prefetch if the l2arc_noprefetch tunable is changed to 0. With the default setting the issue is probably not reproducible now. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #9648	2020-01-22 13:49:05 -08:00
Brian Behlendorf	36fe63042c	Remove zfs_vdev_elevator module option As described in commit `f81d5ef6` the zfs_vdev_elevator module option is being removed. Users who require this functionality should update their systems to set the disk scheduler using a udev rule. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #8664 Closes #9417 Closes #9609	2020-01-22 13:49:04 -08:00
Mauricio Faria de Oliveira	bc21c56c2d	Check for unlinked znodes after igrab() The changes in commit `41e1aa2a` / PR #9583 introduced a regression on tmpfile_001_pos: fsetxattr() on a O_TMPFILE file descriptor started to fail with errno ENODATA: openat(AT_FDCWD, "/test", O_RDWR|O_TMPFILE, 0666) = 3 <...> fsetxattr(3, "user.test", <...>, 64, 0) = -1 ENODATA The originally proposed change on PR #9583 is not susceptible to it, so just move the code/if-checks around back in that way, to fix it. Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Original-patch-by: Heitor Alves de Siqueira <halves@canonical.com> Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> Closes #9602	2020-01-22 13:49:04 -08:00
Heitor Alves de Siqueira	20e124dd71	Break out of zfs_zget early if unlinked znode If zp->z_unlinked is set, we're working with a znode that has been marked for deletion. If that's the case, we can skip the "goto again" loop and return ENOENT, as the znode should not be discovered. Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Heitor Alves de Siqueira <halves@canonical.com> Closes #9583	2020-01-22 13:49:04 -08:00
loli10K	880a37aa35	Prevent NULL pointer dereference in blkg_tryget() on EL8 kernels blkg_tryget() as shipped in EL8 kernels does not seem to handle NULL @blkg as input; this is different from its mainline counterpart where NULL is accepted. To prevent dereferencing a NULL pointer when dealing with block devices which do not set a root_blkg on the request queue perform the NULL check in vdev_bio_associate_blkg(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #9546 Closes #9577	2020-01-22 13:49:04 -08:00
Alexander Motin	edaec84225	Improve logging of 128KB writes Before my ZIL space optimization few years ago 128KB writes were logged as two 64KB+ records in two 128KB log blocks. After that change it became ~127KB+/1KB+ in two 128KB log blocks to free space in the second block for another record. Unfortunately in case of 128KB only writes, when space in the second block remained unused, that change increased write latency by unbalancing checksum computation and write times between parallel threads. It also didn't help with SLOG space efficiency in that case. This change introduces new 68KB log block size, used for both writes below 67KB and 128KB-sharp writes. Writes of 68-127KB are still using one 128KB block to not increase processing overhead. Writes above 131KB are still using full 128KB blocks, since possible saving there is small. Mixed loads will likely also fall back to previous 128KB, since code uses maximum of the last 16 requested block sizes. Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #9409	2020-01-22 13:49:04 -08:00
Matthew Macy	ca0f9b7473	Include prototypes for vdev_initialize Address two prototype related warnings emitted by clang. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9535	2020-01-22 13:49:03 -08:00
Tom Caputi	7e1b772edd	Fix 'zfs change-key' with unencrypted child Currently, when you call 'zfs change-key' on an encrypted dataset that has an unencrypted child, the code will trigger a VERIFY. This VERIFY is leftover from before we allowed unencrypted datasets to exist underneath encrypted ones. This patch fixes the issue by simply replacing the VERIFY with an early return when recursing through datasets. Reviewed by: Jason King <jason.brian.king@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #9524	2020-01-22 13:49:03 -08:00
loli10K	f1ba5478a3	Fix pool creation with feature@allocation_classes disabled When "feature@allocation_classes" is not enabled on the pool no vdev with "special" or "dedup" allocation type should be allowed to exist in the vdev tree. Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #9427 Closes #9429	2020-01-22 13:49:02 -08:00
Brian Behlendorf	5a1bf9e8b1	Fix automount for root filesystems Commit `093bb64` resolved an automount failure for chroot'd processes but inadvertently broke automounting for root filesystems where the vfs_mntpoint is NULL. Resolve the issue by checking for NULL in order to generate the correct path. Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #9381 Closes #9384	2020-01-22 13:49:02 -08:00
Matthew Macy	b43893de86	Rename rangelock_ functions to zfs_rangelock_ A rangelock KPI already exists on FreeBSD. Add a zfs_ prefix as per our convention to prevent any conflict with existing symbols. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9402	2020-01-22 13:49:02 -08:00
Brian Behlendorf	62c034f6d4	Linux 5.0 compat: SIMD compatibility Restore the SIMD optimization for 4.19.38 LTS, 4.14.120 LTS, and 5.0 and newer kernels. This commit squashes the following commits from master in to a single commit which can be applied to 0.8.2. `10fa2545` - Linux 4.14, 4.19, 5.0+ compat: SIMD save/restore `b88ca2ac` - Enable SIMD for encryption `095b5412` - Fix CONFIG_X86_DEBUG_FPU build failure `e5db3134` - Linux 5.0 compat: SIMD compatibility Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> TEST_ZIMPORT_SKIP="yes"	2020-01-22 13:49:01 -08:00
Brian Behlendorf	055238d2eb	Add warning for zfs_vdev_elevator option removal Originally the zfs_vdev_elevator module option was added as a convenience so the requested elevator would be automatically set on the underlying block devices. At the time this was simple because the kernel provided an API function which did exactly this. This API was then removed in the Linux 4.12 kernel which prompted us to add compatibility code to set the elevator via a usermodehelper. While well intentioned this introduced a bug which could cause a system hang, that issue was subsequently fixed by commit `2a0d4188`. In order to avoid future bugs in this area, and to simplify the code, this functionality is being deprecated. A console warning has been added to notify any existing consumers and the documentation updated accordingly. This option will remain for the lifetime of the 0.8.x series for compatibility but is planned to be phased out of master. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #8664 Closes #9317	2020-01-22 13:49:01 -08:00
loli10K	ec5d76e853	diff_cb() does not handle large dnodes Trying to 'zfs diff' a snapshot with large dnodes will incorrectly try to access its interior slots when dnodesize > sizeof(dnode_phys_t). This is normally not an issue because the interior slots are zero-filled, which report_dnode() handles by calling report_free_dnode_range(). However this is not the case for encrypted large dnodes or filesystems using many SA based xattrs where the extra data past the legacy dnode size boundary is interpreted as a dnode_phys_t. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #7678 Closes #8931 Closes #9343	2020-01-22 13:49:01 -08:00
loli10K	444df1051c	Device removal of indirect vdev panics the kernel This commit fixes a NULL pointer dereference triggered in spa_vdev_remove_top_check() by trying to "zpool remove" an indirect vdev. Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #9327	2020-01-22 13:49:01 -08:00
Tom Caputi	5986c5c687	Fix clone handling with encryption roots Currently, spa_keystore_change_key_sync_impl() does not recurse into clones when updating encryption roots for either a call to 'zfs promote' or 'zfs change-key'. This can cause children of these clones to end up in a state where they point to the wrong dataset as the encryption root. It can also trigger ASSERTs in some cases where the code checks reference counts on wrapping keys. This patch fixes this issue by ensuring that this function properly recurses into clones during processing. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alek Pinchuk <apinchuk@datto.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #9267 Closes #9294	2020-01-22 13:49:00 -08:00
Tom Caputi	8747ee4513	Fix stalled txg with repeated noop scans Currently, the DSL scan code figures out when it should suspend processing and allow a txg to continue by calling the function dsl_scan_check_suspend(). Unfortunately, this function only allows the scan to suspend at a level 0 block. In the event that the system is scanning a bunch of empty snapshots or a resilver is running with a high enough scn_cur_min_txg, the scan will stop processing each dataset at the root level, deciding it has nothing left to do. This means that the check_suspend function is never called and the txg remains stuck until a dataset is found that has data to scan. This patch fixes the problem by allowing scans to suspend at the root level of the objset. For backwards compatibility, we use the bookmark <objsetid, 0, 0, 0> when we suspend here so that older versions of the code will work as intended. Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #9300	2020-01-22 13:49:00 -08:00
Igor K	5cb46afcf1	Fix panic on DilOS with kstat per dataset statistics Account for ZFS_MAX_DATASET_NAME_LEN in kstat data size. This value is ignored in the Linux kstat code but resolves the issue for other platforms. Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Igor Kozhukhov <igor@dilos.org> Closes #9254 Closes #9151	2020-01-22 13:49:00 -08:00
George Wilson	7e93917309	maxinflight can overflow in spa_load_verify_cb() When running on larger memory systems, we can overflow the value of maxinflight. This can result in maxinflight having a value of 0 causing the system to hang. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Wilson <george.wilson@delphix.com> Closes #9272	2020-01-22 13:49:00 -08:00
Andrea Gelmini	5097eb6ac9	Fix typos in module/zfs/ Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Closes #9240	2020-01-22 13:48:59 -08:00
Paul Dagnelie	ebdb770554	Prevent metaslab_sync panic due to spa_final_dirty_txg If a pool enables the SPACEMAP_HISTOGRAM feature shortly before being exported, we can enter a situation that causes a kernel panic. Any metaslabs that are loaded during the final dirty txg and haven't already been condensed will cause metaslab_sync to proceed after the final dirty txg so that the condense can be performed, which there are assertions to prevent. Because of the nature of this issue, there are a number of ways we can enter this state. Rather than try to prevent each of them one by one, potentially missing some edge cases, we instead cut it off at the point of intersection; by preventing metaslab_sync from proceeding if it would only do so to perform a condense and we're past the final dirty txg, we preserve the utility of the existing asserts while preventing this particular issue. Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #9185 Closes #9186 Closes #9231 Closes #9253	2020-01-22 13:48:58 -08:00
Tony Nguyen	16f42e1b6d	Use smaller default slack/delta value for schedule_hrtimeout_range() For interrupt coalescing, cv_timedwait_hires() uses a 100us slack/delta for calls to schedule_hrtimeout_range(). This 100us slack can be costly for small writes. This change improves small write performance by passing resolution `res` parameter to schedule_hrtimeout_range() to be used as delta/slack. A new tunable `spl_schedule_hrtimeout_slack_us` is added to preserve old behavior when desired. Performance observations on 8K recordsize filesystem: - 8K random writes at 1-64 threads, up to 60% improvement for one thread and smaller gains as thread count increases. At >64 threads, 2-5% decrease in performance was observed. - 8K sequential writes, similar 60% improvement for one thread and leveling out around 64 threads. At >64 threads, 5-10% decrease in performance was observed. - 128K sequential write sees 1-5 for the 128K. No observed regression at high thread count. Testing done on Ubuntu 18.04 with 4.15 kernel, 8vCPUs and SSD storage on VMware ESX. Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Tony Nguyen <tony.nguyen@delphix.com> Closes #9217	2020-01-22 13:48:58 -08:00
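
The timing problem described in commit 0801e4e5c9 (Initialize mmp_last_write when the mmp thread starts) can be illustrated with a minimal sketch. It is written in Python for brevity and is not the ZFS implementation: only the comparison `(gethrtime() - mmp_last_write) > mmp_fail_ns` comes from the commit message, while the function name, the variable names, and every numeric value below are hypothetical.

```python
NSEC_PER_SEC = 1_000_000_000


def mmp_would_suspend(now_ns, last_write_ns, fail_ns):
    """Mirror the check quoted in the commit message:
    (gethrtime() - mmp_last_write) > mmp_fail_ns."""
    return (now_ns - last_write_ns) > fail_ns


# Hypothetical timeline; the values are chosen only for illustration.
init_time_ns = 0                      # moment mmp_init() is called
thread_start_ns = 10 * NSEC_PER_SEC   # MMP thread starts 10 s later (slow devices)
fail_ns = 5 * NSEC_PER_SEC            # assumed failure threshold

# Before the fix: mmp_last_write still holds the value set at init time,
# so the very first check already exceeds the threshold.
suspend_before_fix = mmp_would_suspend(thread_start_ns, init_time_ns, fail_ns)

# After the fix: mmp_last_write is initialized when the thread starts,
# so the thread gets a full fail_ns window to issue and land MMP writes.
suspend_after_fix = mmp_would_suspend(thread_start_ns, thread_start_ns, fail_ns)

print(suspend_before_fix, suspend_after_fix)
```

With these assumed values the first check reports a suspension and the second does not, which matches the behaviour the commit message describes for a pool configured to suspend on MMP failure.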


2136 Commits