mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-05-26 20:22:14 +03:00

Author	SHA1	Message	Date
Tony Nguyen	477edd642c	Run arc_evict thread at higher priority Run arc_evict thread at higher priority, nice=0, to give it more CPU time which can improve performance for workload with high ARC evict activities. On mixed read/write and sequential read workloads, I've seen between 10-40% better performance. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Tony Nguyen <tony.nguyen@delphix.com> Closes #12397	2021-09-14 14:30:13 -07:00
Rich Ercolani	23184b172a	Make get_key_material_file fail more verbosely It turns out, there are a lot of possible reasons for fopen to fail. Let's share which reason we failed for today. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12410	2021-09-14 14:30:13 -07:00
Brian Behlendorf	32a971e749	Enable /proc/diskstats for zvols The /proc/diskstats accounting needs to be explicitly enabled for block devices which do not use multi-queue. Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12440 Closes #12066	2021-09-14 14:30:13 -07:00
George Melikov	c07ed69577	Man zpool-scrub.8: describe sequential scrub Describe sequential scrub and add examples of scrub status. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: George Melikov <mail@gmelikov.ru> Closes #12429	2021-09-14 14:29:46 -07:00
hedongzhang	ddb732e2c8	Modify checksum obtain method of QAT The CpaDcGeneratefooter function that obtains the checksum does not support the CPA_DC_STATELESS mode, so we get the adler32 checksum at the end of the zlib stream from dc_results. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Chengfei Zhu <chengfeix.zhu@intel.com> Signed-off-by: hedong.zhang <h_d_zhang@163.com> Closes #12343	2021-09-14 14:29:46 -07:00
Mark Johnston	451d6da988	Allow disabling of unmapped I/O on FreeBSD We have a tunable which permits one to disable the use of unmapped I/O for the buffer cache. Respect it in ZFS as well. This is useful for KMSAN, which cannot easily maintain shadow state for unmapped pages. No functional change intended, as unmapped I/O is permitted by default and there's no real reason to disable it in practice except for debugging. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #12446	2021-09-14 14:29:46 -07:00
Alexander Motin	e298ac5d04	Add comment on metaslab_class_throttle_reserve() locking Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Issue #12314 Closes #12419	2021-09-14 13:09:40 -07:00
John Wren Kennedy	9429910781	Assorted fixes for the performance tests - Bail out early if we're running the perf tests and forget to specify disks. - Allow perf tests to run with any number of disks. - Remove weekly vs. nightly settings - Move variables with common values to perf.shlib - Use zinject to clear the ARC over export/import - Fix dbuf cache size calculation When the meaning of `dbuf_cache_max_bytes` changed, the performance test that covers the dbuf cache started to fail. The test would try to write files for the test using the max possible size of the cache, inevitably filling the pool and failing. This change uses `dbuf_cache_shift` to correctly calculate the dbuf cache size. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: John Kennedy <john.kennedy@delphix.com> Closes #12408	2021-09-14 13:09:24 -07:00
Matthew Ahrens	8a969f3e2d	Read past end of argv array in zpool_do_import() `zpool_do_import()` passes `argv[0]`, (optionally) `argv[1]`, and `pool_specified` to `import_pools()`. If `pool_specified==FALSE`, the `argv[]` arguments are not used. However, these values may be off the end of the `argv[]` array, so loading them could dereference unmapped memory. This error is reported by the asan build: ``` ================================================================= ==6003==ERROR: AddressSanitizer: heap-buffer-overflow READ of size 8 at 0x6030000004a8 thread T0 #0 0x562a078b50eb in zpool_do_import zpool_main.c:3796 #1 0x562a078858c5 in main zpool_main.c:10709 #2 0x7f5115231bf6 in __libc_start_main #3 0x562a07885eb9 in _start 0x6030000004a8 is located 0 bytes to the right of 24-byte region allocated by thread T0 here: #0 0x7f5116ac6b40 in __interceptor_malloc #1 0x562a07885770 in main zpool_main.c:10699 #2 0x7f5115231bf6 in __libc_start_main ``` This commit passes NULL for these arguments if they are off the end of the `argv[]` array. Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #12339	2021-09-14 13:08:53 -07:00
Václav Skála	898b1e173c	Add missing properties to zfs allow manpage Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Václav Skála <skala@vshosting.cz> Closes #12402	2021-09-14 13:08:19 -07:00
George Amanakis	406534f807	Fixes in persistent L2ARC In l2arc_add_vdev() first decide whether the device is eligible for L2ARC rebuild or whole device trim and then add it to the list of cache devices. Otherwise l2arc_feed_thread() might already start writing on the device invalidating previous content as l2ad_hand = l2ad_start. However l2arc_rebuild_vdev() needs the device present in the cache device list to figure out its l2arc_dev_t. Fix this by moving most of l2arc_rebuild_vdev() in a new function l2arc_rebuild_dev() which does not need to search in the cache device list. In contrast to l2arc_add_vdev() we do not have to worry about l2arc_feed_thread() invalidating previous content when onlining a cache device. The device parameters (l2ad*) are not cleared when offlining the device and writing new buffers will not invalidate all previous content. In worst case only buffers that have not had their log block written to the device will be lost. Retire persist_l2arc_00{4,5,8} tests since they cover code already covered by the remaining ones. Test persist_l2arc_006 is renamed to persist_l2arc_004 and persist_l2arc_007 is renamed to persist_l2arc_005. Fix a typo in persist_l2arc_004, and remove an assertion that is not always true from l2arc_arcstats_pos. Also update an assertion in persist_l2arc_005 and explain why in a comment. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #12365	2021-09-14 13:07:44 -07:00
Mark Johnston	ac573e3105	Initialize dn_next_type[] in the dnode constructor It seems nothing ensures that this array is zeroed when a dnode is freshly allocated, so in principle it retains the values from the previous allocation. In practice it seems to be the case that the fields should end up zeroed, but we can zero the field anyway for consistency. This was found using KMSAN. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #12383	2021-09-14 13:07:44 -07:00
Mark Johnston	99df200ffc	Zero pad bytes following TX_WRITE log data When logging a TX_WRITE record in the case where file data has to be copied from the DMU, we pad the log record size to a multiple of 8 bytes. In this case, any padding bytes should be zeroed, otherwise the contents of uninitialized memory are written to the ZIL. This was found using KMSAN. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #12383	2021-09-14 12:42:21 -07:00
Mark Johnston	bd910fdeb0	Zero pad bytes when allocating a ZIL record When allocating a record, we round up the allocation size to a multiple of 8. In this case, any padding bytes should be zeroed, otherwise the contents of uninitialized memory are written to the ZIL. This was found using KMSAN. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #12383	2021-09-14 12:42:21 -07:00
Mark Johnston	9cc9821014	Initialize all fields in zfs_log_xvattr() When logging TX_SETATTR, we could otherwise fail to initialize part of the corresponding ZIL record depending on which fields are present in the xvattr. Initialize the creation time and the AV scan timestamp to zero so that uninitialized bytes are not written to the ZIL. This was found using KMSAN. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #12383	2021-09-14 12:42:21 -07:00
Mark Johnston	fceda40c1e	Initialize "autoreplace" in spa_ld_get_props() spa_prop_find() may fail to find the specified property, in which case it suppresses ENOENT from zap_lookup(). In this case, the return value is left uninitialized, so spa_autoreplace was being initialized using an uninitialized stack variable. This was found using KMSAN. It appears to be a regression from commit `9eb7b46ed0`, which removed the initialization of "autoreplace" from the definition. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #12383	2021-09-14 12:41:10 -07:00
Coleman Kane	4434baab11	Linux 5.14 compat: explicitly assign set_page_dirty Kernel 5.14 introduced a change where set_page_dirty of struct address_space_operations is no longer implicitly set to __set_page_dirty_buffers(), which ended up resulting in a NULL pointer deref in the kernel when the kernel attempts to call it. This change sets .set_page_dirty in the structure to __set_page_dirty_nobuffers(), which was introduced with the related patch set. The breaking change was introduced in commit 0af573780b0b13fceb7fabd49dc1b073cee9a507 to torvalds/linux.git. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #12427	2021-09-14 12:41:10 -07:00
Rich Ercolani	6385f4e70e	Fix unfortunate NULL in spa_update_dspace After `1325434b`, we can in certain circumstances end up calling spa_update_dspace with vd->vdev_mg NULL, which ends poorly during vdev removal. So let's not do that further space adjustment when we can't. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12380 Closes #12428	2021-09-14 12:41:10 -07:00
Brian Behlendorf	2f073cc9c6	Linux 5.14 compat: blk_alloc_disk() In Linux 5.14, blk_alloc_queue is no longer exported, and its usage has been superseded by blk_alloc_disk, which returns a gendisk struct from which we can still retrieve the struct request_queue* that is needed in the one place where it is used. This also replaces the call to alloc_disk(minors), and minors is now set via struct member assignment. Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Coleman Kane <ckane@colemankane.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12362 Closes #12409	2021-09-14 12:40:45 -07:00
Ryan Moeller	729eb48666	zloop: Add a max iterations option, use default run/pass times It is useful to have control over the number of iterations of zloop so we can easily produce "x core dumps found in y iterations" metrics. Using random values for run/pass times doesn't improve coverage in a meaningful way. Randomizing run time could be seen as a compromise between running a greater variety of shorter tests versus a smaller variety of longer tests within a fixed time span. However, it is not desirable when running a fixed number of iterations. Pass time already incorporates randomness within ztest. Either parameter can be passed to ztest explicitly if the defaults are not satisfactory. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #12411	2021-09-14 12:40:45 -07:00
Alexander Motin	93e11e257b	FreeBSD: Ignore make_dev_s() errors Since errors returned by zvol_create_minor_impl() are ignored by the common code, it is more convenient to ignore make_dev_s() errors there. It allows, for example, to get device created for the zvol after later rename instead of having it further stuck in half-created state. zvol_rename_minor() already ignores those errors. While there, switch from MAXPHYS to maxphys in FreeBSD 13+. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12375	2021-09-14 12:40:45 -07:00
Jorgen Lundman	eaa10257ca	Remove old orig_fd variable from zfs send Possibly required in the past, but it currently serves no purpose. Ordinarily such a tiny cleanup is not worth it; however, on the macOS port, in a future commit, we do unspeakable things to the "fd" for send/recv, and it would be easier to only have to deal with one "fd" instead of two. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #12404	2021-09-14 12:40:16 -07:00
Alexander Motin	32c0b6468c	Optimize allocation throttling Remove mc_lock use from metaslab_class_throttle_(). The math there is based on refcounts and so atomic, so the only race possible there is between zfs_refcount_count() and zfs_refcount_add(). But in most cases metaslab_class_throttle_reserve() is called with the allocator lock held, which covers the race. In cases where the lock is not held, GANG_ALLOCATION() or METASLAB_MUST_RESERVE are set, and so we do not use zfs_refcount_count(). And even if we assume some other non-existing scenario, the worst that may happen from this race is few more I/Os get to allocation earlier, that is not a problem. Move locks and data of different allocators into different cache lines to avoid false sharing. Group spa_alloc_ arrays together into single array of aligned struct spa_alloc spa_allocs. Align struct metaslab_class_allocator. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12314	2021-09-14 12:40:15 -07:00
George Melikov	7c61e1ef9d	CI: generate ABI files if changed So commit author can just download them as artifacts and commit. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: George Melikov <mail@gmelikov.ru> Closes #12379	2021-09-14 12:40:15 -07:00
Alexander Motin	6a49948c73	Minor ARC optimizations Remove unneeded global, practically constant, state pointer variables (arc_anon, arc_mru, etc.), replacing them with macros of real state variables addresses (&ARC_anon, &ARC_mru, etc.). Change ARC_EVICT_ALL from -1ULL to UINT64_MAX, not requiring special handling in inner loop of ARC reclamation. Respectively change bytes argument of arc_evict_state() from int64_t to uint64_t. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12348	2021-09-14 12:39:48 -07:00
Jorgen Lundman	4dfb698aac	dmu_redact.c does not call bqueue_destroy Ensure all calls to bqueue_init() have a corresponding call to bqueue_destroy(). Reviewed-by: Paul Dagnelie <pcd@delphix.com> Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #12118	2021-09-14 12:39:48 -07:00
Alexander	4affa09f3e	A few fixes of callback typecasting (for the upcoming ClangCFI) * zio: avoid callback typecasting * zil: avoid zil_itxg_clean() callback typecasting * zpl: decouple zpl_readpage() into two separate callbacks * nvpair: explicitly declare callbacks for xdr_array() * linux/zfs_nvops: don't use external iput() as a callback * zcp_synctask: don't use fnvlist_free() as a callback * zvol: don't use ops->zv_free() as a callback for taskq_dispatch() Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Alexander Lobakin <alobakin@pm.me> Closes #12260	2021-09-14 12:39:48 -07:00
Ryan Moeller	0ca9558561	Remove unused fields from zvol_task_t We don't use or need the pool name or value source in the zvol tasks. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #12361	2021-09-14 12:39:17 -07:00
Alexander Motin	c2c4d05700	FreeBSD: Switch from MAXPHYS to maxphys on FreeBSD 13+ Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12378	2021-09-14 12:39:17 -07:00
George Melikov	f8c2e91db5	zpool_influxdb: fix -Werror=stringop-truncation Use strlcpy instead of problematic strncpy Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: George Melikov <mail@gmelikov.ru> Closes #12344	2021-09-14 12:39:17 -07:00
Rich Ercolani	056c273939	Correct zfs-send(8) on readonly sends zfs-send(8) claimed in the flags list you could use -pR when sending a readonly filesystem or volume. You cannot. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12336	2021-09-14 12:38:51 -07:00
Alexander Motin	ba76bb30a6	Introduce dsl_dir_diduse_transfer_space() Most of dsl_dir_diduse_space() and dsl_dir_transfer_space() CPU time is a dd_lock overhead and time spent in dmu_buf_will_dirty(). Calling them one after another is a waste of time and even more contention. Doing that twice for each rewritten block within dbuf_write_done() via dsl_dataset_block_kill() and dsl_dataset_block_born() created one of the biggest CPU overheads in case of small blocks rewrite. dsl_dir_diduse_transfer_space() combines functionality of these two functions for cases where it is needed, but without double overhead, practically for the cost of dsl_dir_diduse_space() or even cheaper. While there, optimize dsl_dir_phys() calls in dsl_dir_diduse_space() and dsl_dir_transfer_space(). It seems Clang detects some aliasing there, repeating dd->dd_dbuf->db_data dereference multiple times, increasing dd_lock scope and contention. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Author: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12300	2021-09-14 12:38:51 -07:00
наб	968dc13572	config/libatomic: require -latomic iff atomic.c doesn't link w/o it In absence of LTO, and dynamic libatomic, la.so ends up in the needs section of every toolchain executable; some consider this an issue. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12345 Closes #12359	2021-09-14 12:38:51 -07:00
Rich Ercolani	960a5a557b	Tinker with slop space accounting with dedup * Tinker with slop space accounting with dedup Do not include the deduplicated space usage in the slop space reservation, it leads to surprising outcomes. * Update spa_dedup_dspace sometimes Sometimes, we get into spa_get_slop_space() with spa_dedup_dspace=~0ULL, AKA "unset", while spa_dspace is correctly set. So call the code to update it before we use it if we hit that case. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12271	2021-09-14 12:38:05 -07:00
Alexander Motin	45305a067f	Fix ARC ghost states eviction accounting arc_evict_hdr() returns number of evicted bytes in scope of specific state. For ghost states it does not mean the amount of really freed memory, but the logical buffer size. It is correct for the eviction process, but not for waking up threads waiting for ARC size reduction, as added in "Revise ARC shrinker algorithm" commit, causing premature wakeups while ARC is still overflowed, allowing even bigger overflow, plus processing overhead when next allocation will also get blocked, probably also for too short time. To fix that make arc_evict_hdr() also return the amount of really freed memory, which for the ghost states is only the header, and use it to update arc_evict_count instead. Originally I was thinking to not return it at all, since arc_get_data_impl() does not account for the headers, but decided that some slow allocation progress is better than long waits, reaching on my tests up to 100ms. To reduce negative latency effects of long time periods when reclaim thread can free little real memory, start reclamation process earlier, before we actually reached the overflow threshold, when we have to throttle new allocations. We can also do it without taking global arc_evict_lock, reducing the contention. Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12279	2021-09-14 12:38:05 -07:00
Brian Behlendorf	a5e68f0478	Update bug report template - Remove the "SPL Version" line, the repositories have been merged since the 0.8 release and we no longer need to ask about this. - Simply ask for the kernel version / patch level and add a hint about how to get this information on Linux and FreeBSD. - Remove "Status: Triage Needed" from the template, in practice we really haven't been using this label so let's stop setting it. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes: #12340	2021-09-14 12:38:05 -07:00
George Wilson	8415c3c170	file reference counts can get corrupted Callers of zfs_file_get and zfs_file_put can corrupt the reference counts for the file structure resulting in a panic or a soft lockup. When zfs send/recv runs, it will add a reference count to the open file, and begin to send or recv the stream. If the file descriptor is closed, then when dmu_recv_stream() or dmu_send() return we will call zfs_file_put to remove the reference we placed on the file structure. Unfortunately, because zfs_file_put() uses the file descriptor to look up the file structure, it may end up finding that the file descriptor table no longer contains the file struct, thus leaking the file structure. Or it might end up finding a file descriptor for a different file and blindly updating its reference counts. Other failure modes probably exist. This change reworks the zfs_file_[get|put] interface to not rely on the file descriptor but instead pass the zfs_file_t pointer around. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Co-authored-by: Allan Jude <allan@klarasystems.com> Signed-off-by: George Wilson <gwilson@delphix.com> External-issue: DLPX-76119 Closes #12299	2021-09-14 12:37:38 -07:00
Jorgen Lundman	04ebe29188	dprintf_dnode: strcpy -> strlcpy Missed a couple of strcpy() in earlier commit, this is only used with --enable-debug. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #12311	2021-09-14 12:37:38 -07:00
Jorgen Lundman	a0b4da2297	Replace strchrnul() with strrchr() Could have gone either way with this one, either adding it to macOS/Windows SPL, or returning it to "classic" usage with strrchr(). Since the new special way isn't really used, and only used once, we have this commit. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #12312	2021-09-14 12:37:38 -07:00
Alexander Motin	c84670950a	FreeBSD: Use unmapped I/O for scattered/gang ABD buffers Many FreeBSD disk drivers support an "unmapped" I/O mode, in which the data buffer is represented not by a virtually contiguous KVA-mapped address range, but by a list of physical memory pages. Originally it was designed to do I/O from buffers without KVA mapping (unmapped). But moving virtual addresses out of the equation allows us to operate on even non-contiguous data buffers, with one condition: all buffer discontinuities must be aligned to memory page borders. When doing I/O to a capable GEOM device, this patch traverses non-linear ABD buffers, validating the chunk borders. If the condition is met, it supplies GEOM with the list of original physical memory pages instead of copying the data into a temporary contiguous buffer. On capable hardware, on pools with ashift=12 and the default ABD chunk of 4KB, it should handle all the I/O without additional memory copying. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12320	2021-09-14 12:37:02 -07:00
Alexander Motin	49bb454120	FreeBSD: Hardcode abd_chunk_size to PAGE_SIZE It makes no sense to set it below PAGE_SIZE, since it increases all overheads and makes returning memory to OS problematic. It makes no sense to set it above PAGE_SIZE, since such allocations and especially frees are too expensive and cause KVA fragmentation to benefit from fewer chunks. After that it makes no sense to keep more complicated math here. What may have sense though is just a tunable border between linear and scatter ABDs, previously also controlled by this tunable. Retain that functionality by taking abd_scatter_min_size tunable from Linux, just with different default value. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12328	2021-09-14 12:36:44 -07:00
Alexander Motin	41b33dce44	Move gethrtime() calls out of vdev queue lock This dramatically reduces the lock contention on systems with slower (non-TSC) timecounters. With TSC the difference is minimal, but since this lock is pretty congested, any improvement counts. Plus I don't see any reason to do it under the lock other than the latency of the lock itself, which this change actually reduces. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12281	2021-09-14 12:35:53 -07:00
Justin Gottula	dab147d65a	Use substantially more robust program exit status logic in zvol_id Currently, there are several places in zvol_id where the program logic returns particular errno values, or even particular ioctl return values, as the program exit status, rather than a straightforward system of explicit zero on success and explicit nonzero value(s) on failure. This is problematic for multiple reasons. One particularly interesting problem that can arise, is that if any of these values happens to have all 8 least significant bits unset (i.e., it is a positive or negative multiple of 256), then although the C program sees a nonzero int value (presumed to be a failure exit status), the actual exit status as seen by the system is only the bottom 8 bits of that integer: zero. This can happen in practice, and I have encountered it myself. In a particularly weird situation, the zvol_open code in the zfs kernel module was behaving in such a manner that it caused the open() syscall to fail and for errno to be set to a kernel-private value (ERESTARTSYS, which happens to be defined as 512). It turns out that 512 is evenly divisible by 256; or, in other words, its least significant 8 bits are all-zero. So even though zvol_id believed it was returning a nonzero (failure) exit status of 512, the system modulo'd that value by 256, resulting in the actual exit status visible to other programs being 0! This actually-zero (non-failure) exit status caused problems: udev believed that the program was operating successfully, when in fact it was attempting to indicate failure via a nonzero exit status integer. Combined with another problem, this led to the creation of nonsense symlinks for zvol dev nodes by udev. Let's get rid of all this problematic logic, and simply return EXIT_SUCCESS (0) if everything went fine, and EXIT_FAILURE (1) if anything went wrong. 
Additionally, let's clarify some of the variable names (error is similar to errno, etc) and clean up the overall program flow a bit. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Signed-off-by: Justin Gottula <justin@jgottula.com> Closes #12302	2021-09-14 12:23:38 -07:00
Justin Gottula	7138fe7205	Print zvol_id error messages to stderr rather than stdout The zvol_id program is invoked by udev, via a PROGRAM key in the 60-zvol.rules.in rule file, to determine the "pretty" /dev/zvol/* symlink paths that should be generated for each opaquely named /dev/zd* dev node. The udev rule uses the PROGRAM key, followed by a SYMLINK+= assignment containing the %c substitution, to collect the program's stdout and then "paste" it directly into the name of the symlink(s) to be created. Unfortunately, as currently written, zvol_id outputs both its intended output (a single string representing the symlink path that should be created to refer to the name of the dataset whose /dev/zd* path is given) AND its error messages (if any) to stdout. When processing PROGRAM keys (and others, such as IMPORT{program}), udev uses only the data written to stdout for functional purposes. Any data written to stderr is used solely for the purposes of logging (if udev's log_level is set to debug). The unintended consequence of this is as follows: if zvol_id encounters an error condition; and then udev fails to halt processing of the current rule (either because zvol_id didn't return a nonzero exit status, or because the PROGRAM key in the rule wasn't written properly to result in a "non-match" condition that would stop the current rule on a nonzero exit); then udev will create a space-delimited list of symlink names derived directly from the words of the error message string! I've observed this exact behavior on my own system, in a situation where the open() syscall on /dev/zd* dev nodes was failing sporadically (for reasons that aren't especially relevant here). Because the open() call failed, zvol_id printed "Unable to open device file: /dev/zd736\n" to stdout and then exited. The udev rule finished with SYMLINK+="zvol/%c %c". 
Assuming a volume name like pool/foo/bar, this would ordinarily expand to SYMLINK+="zvol/pool/foo/bar pool/foo/bar" and would cause symlinks to be created like this: /dev/zvol/pool/foo/bar -> /dev/zd736 /dev/pool/foo/bar -> /dev/zd736 But because of the combination of error messages being printed to stdout, and the udev syntax freely accepting a space-delimited sequence of names in this context, the error message string "Unable to open device file: /dev/zd736\n" in reality expanded to SYMLINK+="zvol/Unable to open device file: /dev/zd736" which caused the following symlinks to actually be created: /dev/zvol/Unable -> /dev/zd736 /dev/to -> /dev/zd736 /dev/open -> /dev/zd736 /dev/device -> /dev/zd736 /dev/file: -> /dev/zd736 /dev//dev/zd736 -> /dev/zd736 (And, because multiple zvols had open() syscall errors, multiple zvols attempted to claim several of those symlink names, resulting in numerous udev errors and timeouts and general chaos.) This commit rectifies all this silliness by simply printing error messages to stderr, as Dennis Ritchie originally intended. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Signed-off-by: Justin Gottula <justin@jgottula.com> Closes #12302	2021-09-14 12:23:38 -07:00
Justin Gottula	fd2e4d143d	Udev rules: use match (==) rather than assign (=) for PROGRAM Assignment syntax (=) can be used for the PROGRAM key. But the PROGRAM key is really a match key, not an assign key. The internal logic used by udev to decide whether a PROGRAM key "matched" or not (which determines whether the remainder of the rule is evaluated) depends on whether the operator was OP_MATCH (==) or OP_NOMATCH (!=). [1] The man page claims that '"=", ":=", and "+=" have the same effect as "=="' for PROGRAM keys. And, after a brief perusal, the udev source code does seem to confirm that operators other than OP_MATCH (==) or OP_NOMATCH (!=) are implicitly converted to OP_MATCH (==). [2] But it's not entirely clear that this is definitely the case: anecdotal testing seems to indicate that when OP_ASSIGN (=) is used, the program's exit status is disregarded and the remainder of the rule is processed regardless of whether it was, in fact, a successful exit. The bottom line here is that, if zvol_id hits some snag and returns a nonzero exit status, then we almost certainly do NOT want to continue on with the rule and use whatever the stdout contents may have been to mindlessly create /dev/zvol/* symlinks. Therefore, let's be extra-sure and use the match (==) operator explicitly, to eliminate any possibility that udev might do the wrong thing, and ensure that a nonzero exit status will definitely short-circuit the rest of the rule, bypassing the SYMLINK+= assignments. 
[1] udev, file src/udev/udev-rules.c, func udev_rule_apply_token_to_event, switch case TK_M_PROGRAM if r != 0 (nonzero exit status): return token->op == OP_NOMATCH; switch case TK_M_PROGRAM if r == 0 (zero exit status): return token->op == OP_MATCH; func retval 0 => key is considered to have matched func retval 1 => key is considered to have NOT matched [2] udev, file src/udev/udev-rules.c, func parse_token, at func start: bool is_match = IN_SET(op, OP_MATCH, OP_NOMATCH); in else-if case streq(key, "PROGRAM"): if (!is_match) op = OP_MATCH; Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Signed-off-by: Justin Gottula <justin@jgottula.com> Closes #12302	2021-09-14 12:23:10 -07:00
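The two udev code paths quoted in footnotes [1] and [2] above can be restated as a small standalone C sketch. The enum and function names here are hypothetical stand-ins that mirror the operator names in the commit message; this is not udev's code, and it only reproduces the two expressions as quoted, without modelling how udev interprets the returned value.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the operator names quoted in the commit. */
enum rule_op { OP_MATCH, OP_NOMATCH, OP_ASSIGN, OP_ASSIGN_FINAL, OP_ADD };

/*
 * Footnote [2]: for a PROGRAM key, any operator other than OP_MATCH or
 * OP_NOMATCH is converted to OP_MATCH at parse time.
 */
enum rule_op
normalize_program_op(enum rule_op op)
{
	bool is_match = (op == OP_MATCH || op == OP_NOMATCH);

	return (is_match ? op : OP_MATCH);
}

/*
 * Footnote [1]: the expression evaluated for a PROGRAM token, given the
 * program's exit status. Nonzero exit compares against OP_NOMATCH; zero
 * exit compares against OP_MATCH.
 */
bool
program_token_result(enum rule_op op, int exit_status)
{
	if (exit_status != 0)
		return (op == OP_NOMATCH);
	return (op == OP_MATCH);
}
```

Under this reading, `PROGRAM=` and `PROGRAM==` would behave identically after normalization, which is why the commit treats the explicit `==` as a belt-and-braces choice rather than a proven behaviour change.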
Justin Gottula	0cb122941e	Udev rules: replace deprecated $tempnode with $devnode The $tempnode substitution is so old that it's not even mentioned in the man page anymore. It is still technically supported by udev, but with plenty of "deprecated" comments surrounding it. The preferred modern equivalent of $tempnode is $devnode (or alternatively, %N). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Signed-off-by: Justin Gottula <justin@jgottula.com> Closes #12302	2021-09-14 12:23:10 -07:00
Justin Gottula	c20ba9bd7a	Udev rules: use non-ancient comma syntax This file is old as dirt. It's entirely possible that commas were optional in udev back at that time. But they're definitely supposed to be there nowadays. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Signed-off-by: Justin Gottula <justin@jgottula.com> Closes #12302	2021-09-14 12:23:10 -07:00
Alexander Motin	15177c1aac	Compact dbuf/buf hashes and lock arrays With default dbuf cache size of 1/32 of ARC, it makes no sense to have hash table of the same size (or even bigger on Linux). Reduce it to 1/8 of ARC's one, still leaving some slack, assuming higher I/O rate via dbuf cache than via ARC. Remove padding from ARC hash locks array. The idea behind padding is to avoid false sharing between locks. It would make sense if there were a limited number of very busy locks. But since we have no limit on the number, using the same memory for more locks we can achieve even lower lock contention with the same false sharing, or we can use less memory for the same contention level. Reduce number of hash locks from 8192 to 2048. The number is still big enough to not cause contention, but reduced memory size improves cache hit rate for mutex_tryenter() in ARC eviction thread, saving about 1% of the thread time. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12289	2021-09-14 12:22:46 -07:00
Jorgen Lundman	035219ee10	Fix abd leak, kmem_free correct size of abd_t Fix a leak of abd_t that manifested mostly when using raidzN with at least as many columns as N (e.g. a four-disk raidz2 but not a three-disk raidz2). Sufficiently heavy raidz use would eventually run a system out of memory. Additionally: * Switch abd_cache arena to FIRSTFIT, which empirically improves performance. * Make abd_chunk_cache more performant and debuggable. * Allocate the abd_zero_buf from abd_chunk_cache rather than the heap. * Don't try to reap non-existent qcaches in abd_cache arena. * KM_PUSHPAGE->KM_SLEEP when allocating chunks from their own arena Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Co-authored-by: Sean Doran <smd@use.net> Closes #12295	2021-09-14 12:22:28 -07:00
Jorgen Lundman	2334bc4efa	Upstream: dmu_zfetch_stream_fini leaks refcount dmu_zfetch_stream_fini() is missing calls to destroy the refcounts, leaking them and the mutex inside. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #12294	2021-09-14 12:21:55 -07:00

6975 Commits