mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-03-22 08:51:30 +03:00

Author	SHA1	Message	Date
Allan Jude	6c4ede4026	ZFS allow send:encrypted A new `zfs allow` permission that ONLY allows sending replication streams in raw (encrypted) mode, so encrypted data will not be decrypted as part of the replication process. Sponsored-by: Klara, Inc. Sponsored-by: Karakun AG Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Co-authored-by: JT Pennington <jt.pennington@klarasystems.com> Signed-off-by: Allan Jude <allan@klarasystems.com> Closes #17543	2025-09-12 15:05:02 -07:00
Tony Hutter	4a7a04630d	zed: Add synchronous zedlets Historically, ZED has blindly spawned off zedlets in parallel and never worried about their completion order. This means that you can potentially have zedlets for event number 2 starting before zedlets for event number 1 had finished. Most of the time this is fine, and it actually helps a lot when the system is getting spammed with hundreds of events. However, there are times when you want your zedlets to be executed in sequence with the event ID. That is where synchronous zedlets come in. ZED will wait for all previously spawned zedlets to finish before running a synchronous zedlet. Synchronous zedlets are guaranteed to be the only zedlet running. No other zedlets may run in parallel with a synchronous zedlet. Users should be careful to only use synchronous zedlets when needed, since they decrease parallelism. To make a zedlet synchronous, simply add a "-sync-" immediately following the event name in the zedlet's file name: EVENT_NAME-sync-ZEDLETNAME.sh For example, if you wanted a synchronous statechange script: statechange-sync-myzedlet.sh Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #17335	2025-09-11 15:58:59 -07:00
Paul Dagnelie	df55ba7c49	Detect a slow raidz child during reads A single slow responding disk can affect the overall read performance of a raidz group. When a raidz child disk is determined to be a persistent slow outlier, then have it sit out during reads for a period of time. The raidz group can use parity to reconstruct the data that was skipped. Each time a slow disk is placed into a sit out period, its `vdev_stat.vs_slow_ios` count is incremented and a zevent class `ereport.fs.zfs.delay` is posted. The length of the sit out period can be changed using the `raid_read_sit_out_secs` module parameter. Setting it to zero disables slow outlier detection. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Contributions-by: Don Brady <don.brady@klarasystems.com> Contributions-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #17227	2025-09-10 15:31:30 -07:00
Paul Dagnelie	e2e708241a	Enable zhack to work properly with 4k sector size disks Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Closes #17576	2025-09-10 15:01:32 -07:00
Paul Dagnelie	26983d6fa7	Add allocation profile export and zhack subcommand for import When attempting to debug performance problems on large systems, one of the major factors that affect performance is free space fragmentation. This heavily affects the allocation process, which is an area of active development in ZFS. Unfortunately, fragmenting a large pool for testing purposes is time consuming; it usually involves filling the pool and then repeatedly overwriting data until the free space becomes fragmented, which can take many hours. And even if the time is available, artificial workloads rarely generate the same fragmentation patterns as the natural workloads they're attempting to mimic. This patch has two parts. First, in zdb, we add the ability to export the full allocation map of the pool. It iterates over each vdev, printing every allocated segment in the ms_allocatable range tree. This can be done while the pool is online, though in that case the allocation map may actually be from several different TXGs as new ones are loaded on demand. The second is a new subcommand for zhack, zhack metaslab leak (and its supporting kernel changes). This is a zhack subcommand that imports a pool and then modifies the range trees of the metaslabs, allowing the sync process to write them out normally. It does not currently store those allocations anywhere to make them reversible, and there is no corresponding free subcommand (which would be extremely dangerous); this is an irreversible process, only intended for performance testing. The only way to reclaim the space afterwards is to destroy the pool or roll back to a checkpoint. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #17576	2025-09-10 15:01:28 -07:00
Shengqi Chen	717c57c834	cmd: rename arcstat to zarcstat Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Colm Buckley <colm@tuatha.org> Signed-off-by: Shengqi Chen <harry-chen@outlook.com> Closes #16357 Closes #17712	2025-09-10 15:01:20 -07:00
Shengqi Chen	5bf1500ee3	Remove renaming notice and symlinks for arcstat and arc_summary Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Colm Buckley <colm@tuatha.org> Signed-off-by: Shengqi Chen <harry-chen@outlook.com> Closes #16357 Closes #17712	2025-09-10 15:01:12 -07:00
Shengqi Chen	cbc6d57012	Add upcoming renaming notice for arc_summary and arcstat They will become zarcsummary and zarcstat in 2.4.0. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Shengqi Chen <harry-chen@outlook.com> Closes #16357 Closes #17695	2025-09-09 17:05:26 -07:00
Alexander Ziaee	92d4b135b6	manuals: Audit/bump dates for last content change Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Ziaee <ziaee@FreeBSD.org> Closes #17676	2025-09-09 17:04:19 -07:00
Shawn Bayern	8604e67dc9	Add description of default sorting behavior to zfs_list.8 The sorting logic is all in cmd/zfs/zfs_iter.c. I borrowed where I could from the comments in the source code, but please note that the comment to zfs_sort() is a little imprecise, or at least incomplete, because it doesn't give any indication of the chronological sort that will be used by default for snapshots in zfs_compare(). While adding this description, I took the liberty to copy-edit the rest of the file lightly. In those edits, I've removed "If specified, you can list property information by the absolute pathname or the relative pathname" because, in context, it seems more confusing than helpful. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Shawn Bayern <sbayern@law.fsu.edu> Closes #15713 Closes #15869	2025-09-09 17:03:55 -07:00
r-ricci	30a915efed	zfs-send.8: mention combination of -c/-e flags and zstd_compress feature Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Roberto Ricci <io@r-ricci.it> Closes #17647	2025-08-19 10:56:58 -04:00
Brian Behlendorf	5061f959d1	Retire zfs_autoimport_disable kmod option Back in 2014 the zfs_autoimport_disable module option was added to control whether the kmods should load the pool configs from the cache file on module load. The default value since that time has been for the kernel to not process the cache file. Detecting and importing pools during boot is now controlled outside of the kmod on both Linux and FreeBSD. By all accounts this has been working well and we can remove this dormant code on the kernel side. The spa_config_load() function has been moved to userspace; it is now only used by libzpool. Additionally, the spa_boot_init() hook which was used by FreeBSD now looks to be unused and was removed. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #17618	2025-08-14 14:58:58 -07:00
Alan Somers	d3c1d27afd	zdb: better handling for corrupt block pointers When dumping indirect blocks, attempt to print corrupt block pointers rather than abort the program. When corruption is detected zdb will exit with an error code of 3. Sponsored by: ConnectWise Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Alek Pinchuk <alek.pinchuk@connectwise.com> Signed-off-by: Alan Somers <asomers@gmail.com> Closes #17166	2025-08-12 14:16:37 -07:00
Alexander Motin	8302b6e32b	Some documentation polishing for log vdevs Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17592	2025-08-06 10:45:45 -07:00
Alexander Motin	60f714e6e2	Implement physical rewrites Based on the previous commit, this implements the `zfs rewrite -P` flag, making ZFS keep blocks' logical birth times while rewriting files. It should exclude the rewritten blocks from incremental sends, snapshot diffs, etc. Snapshot space usage will at the same time reflect the additional space usage from newly allocated blocks. Since this begins to use the new "rewrite" flag in the block pointers, this commit introduces a new read-compatible per-dataset feature physical_rewrite. It must be enabled for the command to not fail; it is activated on first use and deactivated on deletion of the last affected dataset. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17565	2025-08-06 10:36:56 -07:00
Mariusz Zaborski	894edd084e	Add TXG timestamp database This feature enables tracking of when TXGs are committed to disk, providing an estimated timestamp for each TXG. With this information, it becomes possible to perform scrubs based on specific date ranges, improving the granularity of data management and recovery operations. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #16853	2025-08-06 10:31:21 -07:00
Alexander Motin	f70c85086b	BRT: Fix ZAP entry endianness During original block cloning implementation a mistake was made, making BRT ZAP entries an array of 8 1-byte entries instead of 1 entry of 8 bytes. This makes the pools non-endian-safe. This commit introduces a new read-compatible pool feature "com.truenas:block_cloning_endian", fixing the endianness issue for new pools while maintaining compatibility with existing ones. The feature is automatically activated when creating the first BRT ZAP (ensuring we don't activate it on pools that already have BRT entries in the old format). When active, BRT entries are stored as single 8-byte values. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17572	2025-07-30 09:42:47 -07:00
Akash B	b6e8db509d	zpool/zfs: Add '-a|--all' option to scrub, trim, initialize Add support for the '-a | --all' option to perform trim, scrub, and initialize operations on all pools. Previously, specifying a pool name was mandatory for these operations. With this enhancement, users can now execute these operations across all pools at once, without needing to manually iterate over each pool from the command line. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Akash B <akash-b@hpe.com> Closes #17524	2025-07-29 14:50:44 -07:00
Brian Behlendorf	cf146460c1	Default to zfs_bclone_wait_dirty=1 Update the default FICLONE and FICLONERANGE ioctl behavior to wait on dirty blocks. While this does remove some control from the application, in practice ZFS is better positioned to do the optimal thing and immediately force a TXG sync. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #17455	2025-07-25 10:42:23 -04:00
Alexander Motin	be1e991a1a	Allow and prefer special vdevs as ZIL Before this change ZIL blocks were allocated only from normal or SLOG vdevs. In typical situation when special vdevs are SSDs and normal are HDDs it could cause weird inversions when data blocks are written to SSDs, but ZIL referencing them to HDDs. This change assumes that special vdevs typically have much better (or at least not worse) latency than normal, and so in absence of SLOGs should store ZIL blocks. It means similar to normal vdevs introduction of special embedded log allocation class and updating the allocation fallback order to: SLOG -> special embedded log -> special -> normal embedded log -> normal. The code tries to guess whether data block is going to be written to normal or special vdev (it can not be done precisely before compression) and prefer indirect writes for blocks written to a special vdev to avoid double-write. For blocks that are going to be written to normal vdev, special vdev by default plays as SLOG, reducing write latency by the cost of higher special vdev wear, but it is tunable via module parameter. This should allow HDD pools with decent SSD as special vdev to work under synchronous workloads without requiring additional SLOG SSD, impractical in many scenarios. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17505	2025-07-18 18:44:14 -07:00
Rob Norris	fce18e04d5	libzpool: tunable-based option interface for zdb/ztest Removes the old dlsym() based option setter and adds a new function handle_tunable_option() that can set, get and list all the tunables in the system. And then wire it up to zdb and ztest. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17537	2025-07-15 15:47:03 -07:00
Paul Dagnelie	a981cb69e4	Implement dynamic gang header sizes ZFS gang block headers are currently fixed at 512 bytes. This is increasingly wasteful in the era of larger disk sector sizes. This PR allows any size allocation to work as a gang header. It also contains supporting changes to ZDB to make gang headers easier to work with. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Closes #17004	2025-07-09 14:02:53 -07:00
Rob Norris	6af8db61b1	metaslab: don't pass whole zio to throttle reserve APIs They only need a couple of fields, and passing the whole thing just invites fiddling around inside it, like modifying flags, which then makes it much harder to understand the zio state from inside zio.c. We move the flag update to just after a successful throttle in zio.c. Rename ZIO_FLAG_IO_ALLOCATING to ZIO_FLAG_ALLOC_THROTTLED Better describes what it means, and makes it look less like IO_IS_ALLOCATING, which means something different. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17508	2025-07-04 23:22:22 -04:00
Alexander Motin	4e92aee233	Relax special_small_blocks restrictions special_small_blocks is applied to blocks after compression, so it makes no sense to demand its values to be power of 2. At most they could be multiple of 512, but that would still buy us nothing, so lets allow them be any within SPA_MAXBLOCKSIZE. Also special_small_blocks does not really need to depend on the set recordsize, enabled pool features or presence of special vdev. At worst in any of those cases it will just do nothing, so we should not complicate users lives by artificial limitations. While there, polish comments for recordsize and volblocksize. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17497	2025-07-02 11:11:37 -07:00
Rob Norris	3ff2eca0be	zfs-program(8): document zfs.sync.clone() Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #17426	2025-06-10 14:53:18 -07:00
Alexander Motin	b7f919d228	Relax zfs_vnops_read_chunk_size limitations It makes no sense to limit read size below the block size, since DMU will consume resources for the whole block anyway, while the current zfs_vnops_read_chunk_size is only 1MB, which is smaller than the maximum block size of 16MB. Plus in case of misaligned Uncached I/O the buffer may get evicted between the chunks, requiring repeating I/Os. On 64-bit platforms increase zfs_vnops_read_chunk_size to 32MB. This reduces dependence on the speculative prefetcher when an application requests a specific size, first not waiting for the prefetcher to start and later not prefetching more than needed. Also while there, we don't need to align reads to the chunk size, but only to a block size, which is smaller and so more forgiving. My profiles show ~4% of CPU time saving when reading 16MB blocks. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17415	2025-06-04 11:24:15 -04:00
Rob Norris	5764e218ba	vdev_disk: remove classic IO submission Since it was disabled for 2.3, there's been no confirmed sightings of strange IO errors, misalignments or related shenanigans. Absence of evidence and all that, but I'd rather fix bugs in the new code than in the old. "It isn't hubris until he's failed." -- Chrisjen Avasarala Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17399	2025-05-30 10:31:02 -04:00
Rob Norris	44e3266894	events: include zio type in IO error reports Usually the IO type can be inferred from the other fields (in particular, priority and flags), but sometimes it's not easy to see. This is just another little debug helper. May 27 2025 00:54:54.024110493 ereport.fs.zfs.data class = "ereport.fs.zfs.data" ena = 0x1f5ecfae600801 ... zio_delta = 0x0 zio_type = 0x2 [WRITE] zio_priority = 0x3 [ASYNC_WRITE] zio_objset = 0x0 Document zio_type and zio_priority. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17381	2025-05-30 10:29:29 -04:00
Rob Norris	fc617645a3	vdev_disk: remove zfs_vdev_scheduler option It has existed as a warning since 0.8.3, 5+ years ago. I think people have had enough time. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17376	2025-05-27 15:06:15 -07:00
Rob Norris	284580c878	dmu_traverse: remove 'ignore_hole_birth' tunable alias It's been many years, we can probably do without. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17376	2025-05-27 15:05:09 -07:00
Don Brady	b048bfa9c1	Allow opt-in of zvol blocks in special class Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl> Signed-off-by: Don Brady <dev.fs.zfs@gmail.com> Closes #14876	2025-05-24 16:44:26 -04:00
Cameron Harr	92157c840c	Refactor man page and CLI help output per mandoc The man page and the usage statement from the CLI have been refactored to abide by the ManDoc standard. Style changes include: * Upper-case letters before lower-case * List short options w/o arguments first * Then list short options w/ arguments * Then list long arguments Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Cameron Harr <harr1@llnl.gov> Closes #17357	2025-05-23 09:10:30 -07:00
Cameron Harr	cdb4c44684	Reformat cli help and man page to be in sync The man page and CLI usage statements were both a little out of sync and neither fully alphabetized correctly. That has been fixed. One outstanding question is whether to get rid of the ellipses on the CLI usage. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Cameron Harr <harr1@llnl.gov> Closes #16004 Closes #17357	2025-05-23 09:10:21 -07:00
Alexander Motin	d5616ad34a	Increase meta-dnode redundancy in "some" mode Loss of one indirect block of the meta dnode likely means loss of the whole dataset. It is worse than one file that the man page promises, and in my opinion is not much better than "none" mode. This change restores redundancy of the meta-dnode indirect blocks, while same time still corrects expectations in the man page. Reviewed-by: Akash B <akash-b@hpe.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17339	2025-05-16 13:23:32 -04:00
Allan Jude	b6916f995e	ARC: parallel eviction On systems with enormous amounts of memory, the single arc_evict thread can become a bottleneck if reads and writes are stuck behind it, waiting for old data to be evicted before new data can take its place. This commit adds support for evicting from multiple ARC lists in parallel, by farming the evict work out to some number of threads and then accumulating their results. A new tuneable, zfs_arc_evict_threads, sets the number of threads. By default, it will scale based on the number of CPUs. Sponsored-by: Expensify, Inc. Sponsored-by: Klara, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Youzhong Yang <youzhong@gmail.com> Signed-off-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com> Signed-off-by: Alexander Stetsenko <alex.stetsenko@klarasystems.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Co-authored-by: Rob Norris <rob.norris@klarasystems.com> Co-authored-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com> Co-authored-by: Alexander Stetsenko <alex.stetsenko@klarasystems.com> Closes #16486	2025-05-14 10:38:32 -04:00
Alexander Motin	0aa83dce99	Linux: Stop using NR_FILE_PAGES for ARC scaling I've found that QEMU/KVM guest memory accounted as shared also included into NR_FILE_PAGES. But it is actually a non-evictable anonymous memory. Using it as a base for zfs_arc_pc_percent parameter makes ARC to ignore shrinker requests while page cache does not really have anything to evict, ending up in OOM killer killing the QEMU process. Instead use of NR_ACTIVE_FILE + NR_INACTIVE_FILE should represent the part of a page cache that is actually evictable, which should be safer to use as a reference for ARC scaling. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17334	2025-05-14 09:29:02 -04:00
Alexander Motin	734eba251d	Wire O_DIRECT also to Uncached I/O (#17218) Before Direct I/O was implemented, I implemented a lighter version I called Uncached I/O. It uses the normal DMU/ARC data path with some optimizations, but evicts data from caches as soon as possible and reasonable. Originally I wired it only to a primarycache property, but am now completing the integration all the way up to the VFS. While Direct I/O has the lowest possible memory bandwidth usage, it also has a significant number of limitations. It requires I/Os to be page aligned, does not allow speculative prefetch, etc. The Uncached I/O does not have those limitations, but instead requires an additional memory copy, though still one less than regular cached I/O. As such it should fill the gap in between. Considering this I've disabled annoying EINVAL errors on misaligned requests, adding a tunable for those who want to test their applications. To pass the information between the layers I had to change a number of APIs. But as a side effect upper layers can now control not only the caching, but also speculative prefetch. I haven't wired it to VFS yet, since it requires looking at some OS specifics. But while there I've implemented speculative prefetch of indirect blocks for Direct I/O, controllable via all the same mechanisms. Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Fixes #17027 Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2025-05-13 14:26:55 -07:00
Alexander Motin	49fbdd4533	Introduce zfs rewrite subcommand (#17246) This allows rewriting the content of specified file(s) as-is without modifications, but at a different location, compression, checksum, dedup, copies and other parameter values. It is faster than read plus write, since it does not require data copying to user-space. It is also faster for sync=always datasets, since without data modification it does not require ZIL writing. Also since it is protected by normal range locks, it can be done under any other load. Also it does not affect file's modification time or other properties. Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com>	2025-05-12 10:22:17 -07:00
Quentin Thébault	63de2d2dbd	zfs-rollback.8: fix typo in example number Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Alexander Ziaee <ziaee@FreeBSD.org> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Quentin Thébault <quentin.thebault@defenso.fr> Closes #17282	2025-04-28 15:38:08 -04:00
Ameer Hamza	7c4ff2a051	zfsprops.7 manpage changes for default quotas Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2025-04-03 10:36:49 -07:00
Simon Howard	fd018248d5	Disambiguate reference to kibibytes, not kilobytes A minor nitpick that is kind of obvious based on the surrounding context and reference to powers of two. It's better to be explicit, though. Signed-off-by: Simon Howard <fraggle@gmail.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org>	2025-03-24 14:37:43 -07:00
Simon Howard	ef81812726	Fix spelling errors Unlike some of my other fixes which are more subtle, these are unambiguously spelling errors. Signed-off-by: Simon Howard <fraggle@gmail.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org>	2025-03-24 14:37:40 -07:00
Simon Howard	e759a86fa5	Correct "umount" to "unmount" in a couple of places This is admittedly a nitpicky change, but `umount` is the command that performs an unmount. So if we are talking about unmounting something we should phrase it that way. Signed-off-by: Simon Howard <fraggle@gmail.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org>	2025-03-24 14:37:36 -07:00
Simon Howard	1d4505d7a1	Capitalize in various places where appropriate These are mostly acronyms (CPUs; ZILs) but also proper nouns such as "Unix" and "Unicode" which should also be capitalized. Signed-off-by: Simon Howard <fraggle@gmail.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org>	2025-03-24 14:37:34 -07:00
Simon Howard	b386bf87c1	Fix cases where "descendent" is used as a noun As per Wiktionary: "descendent" may be used as an adjective (e.g. "a descendent dataset") but for nouns (e.g. "descendants of this dataset"), "descendant" is the correct spelling. Signed-off-by: Simon Howard <fraggle@gmail.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org>	2025-03-24 14:37:31 -07:00
Simon Howard	73494f3352	Make use of "i.e." (id est) consistent This is the most common way it is written throughout the manpages, but there are a few cases where it is written slightly differently. Signed-off-by: Simon Howard <fraggle@gmail.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org>	2025-03-24 14:37:26 -07:00
Simon Howard	530ddcd5f1	Harmonize on American spelling in several places Most of the documentation is written in American English, so it makes sense to be consistent. Signed-off-by: Simon Howard <fraggle@gmail.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org>	2025-03-24 14:36:34 -07:00
Alexander Motin	94a3fabcb0	Unified allocation throttling (#17020) Existing allocation throttling had a goal to improve write speed by allocating more data to vdevs that are able to write it faster. But in the process it completely broke the original mechanism, designed to balance vdev space usage. With severe vdev space use imbalance it is possible that some with higher use start growing fragmentation sooner than others and after getting full will stop any writes at all. Also after vdev addition it might take a very long time for pool to restore the balance, since the new vdev does not have any real preference, unless the old one is already much slower due to fragmentation. Also the old throttling was request-based, which was unpredictable with block sizes varying from 512B to 16MB, nor did it make much sense in case of I/O aggregation, when its 32-100 requests could be aggregated into few, leaving device underutilized, submitting fewer and/or shorter requests, or in opposite try to queue up to 1.6GB of writes per device. This change presents a completely new throttling algorithm. Unlike the request-based old one, this one measures allocation queue in bytes. It makes possible to integrate with the reworked allocation quota (aliquot) mechanism, which is also byte-based. Unlike the original code, balancing the vdevs amounts of free space, this one balances their free/used space fractions. It should result in a lower and more uniform fragmentation in a long run. This algorithm still allows to improve write speed by allocating more data to faster vdevs, but does it in more controllable way. On top of space-based allocation quota, it also calculates minimum queue depth that vdev is allowed to maintain, and respectively the amount of extra allocations it can receive if it appears faster. That amount is based on vdev's capacity and space usage, but also applied only when the pool is busy. This way the code can choose between faster writes when needed and better vdev balance when not, with the choice gradually reducing together with the free space. This change also makes allocation queues per-class, allowing them to throttle independently and in parallel. Allocations that are bounced between classes due to allocation errors will be able to properly throttle in the new class. Allocations that should not be throttled (ZIL, gang, copies) are not, but may still follow the rotor and allocation quota mechanism of the class without disrupting it. Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Paul Dagnelie <pcd@delphix.com>	2025-03-24 09:25:01 -07:00
Rob Norris	7d8dd8d9a5	SPDX: license tags: MIT Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2025-03-13 17:56:54 -07:00
Rob Norris	eb9098ed47	SPDX: license tags: CDDL-1.0 Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2025-03-13 17:56:27 -07:00
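
The endianness problem described in commit f70c85086b (BRT: Fix ZAP entry endianness) can be illustrated with a small model. This is only a sketch, not ZFS code: the function names and the example value are hypothetical, and the "new" format is simplified to a value stored together with the writer's byte order so a reader can byte-swap it. It shows why a 64-bit value stored as eight 1-byte entries is misread on a host of the opposite endianness, while a single 8-byte entry can be normalized.

```python
import struct

LITTLE, BIG = "<", ">"  # struct byte-order prefixes standing in for host endianness


def write_old(value, writer):
    """Old layout (model): 8 entries of 1 byte, i.e. the writer's raw memory image."""
    return struct.pack(writer + "Q", value)


def read_old(raw, reader):
    """Single bytes are never byte-swapped, so the reader reinterprets them as-is."""
    return struct.unpack(reader + "Q", raw)[0]


def write_new(value, writer):
    """New layout (model): 1 entry of 8 bytes, tagged with the writer's byte order."""
    return writer, struct.pack(writer + "Q", value)


def read_new(entry, reader):
    """The storage layer knows this is one 8-byte integer and swaps it if needed."""
    writer, raw = entry
    value = struct.unpack(reader + "Q", raw)[0]
    if writer != reader:
        # Reverse the byte order of the 64-bit value.
        value = int.from_bytes(value.to_bytes(8, "big"), "little")
    return value


refcount = 0x0102030405060708  # hypothetical example value

same_host = read_old(write_old(refcount, LITTLE), LITTLE)
cross_old = read_old(write_old(refcount, LITTLE), BIG)
cross_new = read_new(write_new(refcount, LITTLE), BIG)
print(hex(same_host), hex(cross_old), hex(cross_new))
```

In this model the old format round-trips only when writer and reader share a byte order, which matches the commit's statement that the original layout made pools non-endian-safe.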


1081 Commits
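
The file naming rule from commit 4a7a04630d (zed: Add synchronous zedlets), EVENT_NAME-sync-ZEDLETNAME.sh, can be sketched as a simple prefix check. This is a minimal illustration under the assumption that a plain prefix match captures the rule; it is not ZED's actual matching logic, and the helper name and the non-synchronous file name are hypothetical.

```python
def is_synchronous_zedlet(filename, event_name):
    """Return True if filename follows EVENT_NAME-sync-ZEDLETNAME.sh for event_name.

    Hypothetical helper: checks only that "-sync-" immediately follows the
    event name, as the commit message describes.
    """
    return filename.startswith(event_name + "-sync-")


def partition_zedlets(filenames, event_name):
    """Split zedlet file names for one event into (synchronous, parallel) lists."""
    sync, parallel = [], []
    for name in filenames:
        (sync if is_synchronous_zedlet(name, event_name) else parallel).append(name)
    return sync, parallel


zedlets = ["statechange-sync-myzedlet.sh", "statechange-myzedlet.sh"]
sync, parallel = partition_zedlets(zedlets, "statechange")
print(sync, parallel)
```

Per the commit message, a zedlet in the synchronous group would run only after all previously spawned zedlets have finished, and nothing else would run alongside it.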