mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-06-09 23:46:38 +03:00

Author	SHA1	Message	Date
Rob Norris	224393a321	feature: large_microzap In `a4b21eadec` we added the zap_micro_max_size tuneable to raise the size at which "micro" (single-block) ZAPs are upgraded to "fat" (multi-block) ZAPs. Before this, a microZAP was limited to 128KiB, which was the old largest block size. The side effect of raising the max size past 128KiB is that it be stored in a large block, requiring the large_blocks feature. Unfortunately, this means that a backup stream created without the --large-block (-L) flag to zfs send would split the microZAP block into smaller blocks and send those, as is normal behaviour for large blocks. This would be received correctly, but since microZAPs are limited to the first block in the object by definition, the entries in the later blocks would be inaccessible. For directory ZAPs, this gives the appearance of files being lost. This commit adds a feature flag, large_microzap, that must be enabled for microZAPs to grow beyond 128KiB, and which will be activated the first time that occurs. This feature is later checked when generating the stream and if active, the send operation will abort unless --large-block has also been requested. Changing the limit still requires zap_micro_max_size to be changed. The state of this flag effectively sets the upper value for this tuneable, that is, if the feature is disabled, the tuneable will be clamped to 128KiB. A stream flag is also added to ensure that the receiver also activates its own feature flag upon receiving the stream. This is not strictly necessary to _use_ the received microZAP, since it doesn't care how large its block is, but it is required to send the microZAP object on, otherwise the original problem occurs again. Because it's difficult to reliably distinguish a microZAP from a fatZAP from outside the ZAP code, and because it seems unlikely that most users are affected (a fairly niche tuneable combined with what should be an uncommon use of send), and for the sake of expediency, this change activates the feature the first time a microZAP grows to use a large block, and is never deactivated after that. This can be improved in the future. This commit changes nothing for existing pools that already have large microZAPs. The feature will not be retroactively applied, but will be activated the next time a microZAP grows past the limit. Don't use large_blocks feature for enable/disable tests. The large_microzap depends on large_blocks, so it gets enabled as a dependency, breaking the test. Instead use feature "longname", which has the exact same feature characteristics. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16593	2024-10-02 20:47:11 -07:00
Brian Behlendorf	d34d4f97a8	snapdir: add 'disabled' value to make .zfs inaccessible In some environments, just making the .zfs control dir hidden from sight might not be enough. In particular, the following scenarios might warrant not allowing access at all: - old snapshots with wrong permissions/ownership - old snapshots with exploitable setuid/setgid binaries - old snapshots with sensitive contents Introducing a new 'disabled' value that not only hides the control dir, but prevents access to its contents by returning ENOENT solves all of the above. The new property value takes advantage of 'iuv' semantics ("ignore unknown value") to automatically fall back to the old default value when a pool is accessed by an older version of ZFS that doesn't yet know about 'disabled' semantics. I think that technically the zfs_dirlook change is enough to prevent access, but preventing lookups and dir entries in an already opened .zfs handle might also be a good idea to prevent races when modifying the property at runtime. Add zfs_snapshot_no_setuid parameter to control whether automatically mounted snapshots have the setuid mount option set or not. this could be considered a partial fix for one of the scenarios mentioned in desired. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com> Co-authored-by: Fabian Grünbichler <f.gruenbichler@proxmox.com> Closes #3963 Closes #16587	2024-10-02 09:12:02 -07:00
rilysh	86737c5927	Avoid computing strlen() inside loops Compiling with -O0 (no proper optimizations), strlen() call in loops for comparing the size, isn't being called/initialized before the actual loop gets started, which causes n-numbers of strlen() calls (as long as the string is). Keeping the length before entering in the loop is a good idea. On some places, even with -O2, both GCC and Clang can't recognize this pattern, which seem to happen in an array of char pointer. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: rilysh <nightquick@proton.me> Closes #16584	2024-10-02 09:10:06 -07:00
Sanjeev Bagewadi	20232ecfaa	Support for longnames for files/directories (Linux part) This patch adds the ability for zfs to support file/dir name up to 1023 bytes. This number is chosen so we can support up to 255 4-byte characters. This new feature is represented by the new feature flag feature@longname. A new dataset property "longname" is also introduced to toggle longname support for each dataset individually. This property can be disabled, even if it contains longname files. In such case, new file cannot be created with longname but existing longname files can still be looked up. Note that, to my knowledge native Linux filesystems don't support name longer than 255 bytes. So there might be programs not able to work with longname. Note that NFS server may needs to use exportfs_get_name to reconnect dentries, and the buffer being passed is limit to NAME_MAX+1 (256). So NFS may not work when longname is enabled. Note, FreeBSD vfs layer imposes a limit of 255 name lengh, so even though we add code to support it here, it won't actually work. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #15921	2024-10-01 13:40:27 -07:00
w0xel	ccc420acd5	Add missing guard defines for simd_stat This adds the HAVE_KERNEL_NEON and HAVE_KERNEL_FPU_INTERNAL guards to simd_stat.c defaulted to 0 to make it build again. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Shengqi Chen <harry-chen@outlook.com> Signed-off-by: Sebastian Wuerl <s.wuerl@mailbox.org> Closes #16558	2024-09-24 09:07:26 -07:00
George Melikov	e419a63bf4	xattr dataset prop: change defaults to sa It's the main recommendation to set xattr=sa even in man pages, so let's set it by default. xattr=sa don't use feature flag, so in the worst case we'll have non-readable xattrs by other non-openzfs platforms. Non-overridden default `xattr` prop of existing pools will automatically use `sa` after this commit too. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Melikov <mail@gmelikov.ru> Closes #15147	2024-09-23 09:50:48 -07:00
Rich Ercolani	1d84c9eb66	Fix /proc/spl/kstat/simd on x86 Evidently while reworking it on aarch64, I broke it on x86 and didn't notice. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #16556	2024-09-22 13:11:19 -07:00
Brian Behlendorf	f9d4f1b480	Add SIMD metadata in /proc on Linux follow up This change accidentally broke the FreeBSD build due to a conflict between the simd_stat_init()/simd_stat_fini() macros on FreeBSD and the extern function prototype. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #16552	2024-09-20 15:48:12 -07:00
Rich Ercolani	5d01243964	Add SIMD metadata in /proc on Linux Too many times, people's performance problems have amounted to "somehow your SIMD support isn't working", and determining that at runtime is difficult to describe to people. This adds a /proc/spl/kstat/zfs/simd node, which exposes metadata about which instructions ZFS thinks it can use, on AArch64 and x86_64 Linux, to make investigating things like this much easier. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #16530	2024-09-20 08:16:44 -07:00
Shengqi Chen	0ae4460c61	zcommon: add specialized versions of cityhash4 Specializing cityhash4 on 32-bit architectures can reduce the size of stack frames as well as instruction count. This is a tiny but useful optimization, since some callers invoke it frequently. When specializing into 1/2/3/4-arg versions, the stack usage (in bytes) on some 32-bit arches are listed as follows: - x86: 32, 32, 32, 40 - arm-v7a: 20, 20, 28, 36 - riscv: 0, 0, 0, 16 - power: 16, 16, 16, 32 - mipsel: 8, 8, 8, 24 And each actual argument (even if passing 0) contributes evenly to the number of multiplication instructions generated: - x86: 9, 12, 15 ,18 - arm-v7a: 6, 8, 10, 12 - riscv / power: 12, 18, 20, 24 - mipsel: 9, 12, 15, 19 On 64-bit architectures, the tendencies are similar. But both stack sizes and instruction counts are significantly smaller thus negligible. Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Shengqi Chen <harry-chen@outlook.com> Closes #16131 Closes #16483	2024-09-19 15:18:59 -07:00
Brian Atkinson	a10e552b99	Adding Direct IO Support Adding O_DIRECT support to ZFS to bypass the ARC for writes/reads. O_DIRECT support in ZFS will always ensure there is coherency between buffered and O_DIRECT IO requests. This ensures that all IO requests, whether buffered or direct, will see the same file contents at all times. Just as in other FS's , O_DIRECT does not imply O_SYNC. While data is written directly to VDEV disks, metadata will not be synced until the associated TXG is synced. For both O_DIRECT read and write request the offset and request sizes, at a minimum, must be PAGE_SIZE aligned. In the event they are not, then EINVAL is returned unless the direct property is set to always (see below). For O_DIRECT writes: The request also must be block aligned (recordsize) or the write request will take the normal (buffered) write path. In the event that request is block aligned and a cached copy of the buffer in the ARC, then it will be discarded from the ARC forcing all further reads to retrieve the data from disk. For O_DIRECT reads: The only alignment restrictions are PAGE_SIZE alignment. In the event that the requested data is in buffered (in the ARC) it will just be copied from the ARC into the user buffer. For both O_DIRECT writes and reads the O_DIRECT flag will be ignored in the event that file contents are mmap'ed. In this case, all requests that are at least PAGE_SIZE aligned will just fall back to the buffered paths. If the request however is not PAGE_SIZE aligned, EINVAL will be returned as always regardless if the file's contents are mmap'ed. Since O_DIRECT writes go through the normal ZIO pipeline, the following operations are supported just as with normal buffered writes: Checksum Compression Encryption Erasure Coding There is one caveat for the data integrity of O_DIRECT writes that is distinct for each of the OS's supported by ZFS. FreeBSD - FreeBSD is able to place user pages under write protection so any data in the user buffers and written directly down to the VDEV disks is guaranteed to not change. There is no concern with data integrity and O_DIRECT writes. Linux - Linux is not able to place anonymous user pages under write protection. Because of this, if the user decides to manipulate the page contents while the write operation is occurring, data integrity can not be guaranteed. However, there is a module parameter `zfs_vdev_direct_write_verify` that controls the if a O_DIRECT writes that can occur to a top-level VDEV before a checksum verify is run before the contents of the I/O buffer are committed to disk. In the event of a checksum verification failure the write will return EIO. The number of O_DIRECT write checksum verification errors can be observed by doing `zpool status -d`, which will list all verification errors that have occurred on a top-level VDEV. Along with `zpool status`, a ZED event will be issues as `dio_verify` when a checksum verification error occurs. ZVOLs and dedup is not currently supported with Direct I/O. A new dataset property `direct` has been added with the following 3 allowable values: disabled - Accepts O_DIRECT flag, but silently ignores it and treats the request as a buffered IO request. standard - Follows the alignment restrictions outlined above for write/read IO requests when the O_DIRECT flag is used. always - Treats every write/read IO request as though it passed O_DIRECT and will do O_DIRECT if the alignment restrictions are met otherwise will redirect through the ARC. This property will not allow a request to fail. There is also a module parameter zfs_dio_enabled that can be used to force all reads and writes through the ARC. By setting this module parameter to 0, it mimics as if the direct dataset property is set to disabled. Reviewed-by: Brian Behlendorf <behlendorf@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Co-authored-by: Mark Maybee <mark.maybee@delphix.com> Co-authored-by: Matt Macy <mmacy@FreeBSD.org> Co-authored-by: Brian Behlendorf <behlendorf@llnl.gov> Closes #10018	2024-09-14 13:47:59 -07:00
Rob Norris	82ff9aafd6	value strings: pretty printers for flags and enums This adds zfs_valstr, a collection of pretty printers for bitfields and enums. These are useful in debugging, logging and other display contexts where raw values are difficult for the untrained (or even trained!) eye to decipher. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com>	2024-09-05 13:40:05 -07:00
Rob Norris	db2b1fdb79	ddt: add FDT feature and support for legacy and new on-disk formats This is the supporting infrastructure for the upcoming dedup features. Traditionally, dedup objects live directly in the MOS root. While their details vary (checksum, type and class), they are all the same "kind" of thing - a store of dedup entries. The new features are more varied than that, and are better thought of as a set of related stores for the overall state of a dedup table. This adds a new feature flag, SPA_FEATURE_FAST_DEDUP. Enabling this will cause new DDTs to be created as a ZAP in the MOS root, named DDT-<checksum>. The is used as the root object for the normal type/class store objects, but will also be a place for any storage required by new features. This commit adds two new fields to ddt_t, for version and flags. These are intended to describe the structure and features of the overall dedup table, and are stored as-is in the DDT root. In this commit, flags are always zero, but the intent is that they can be used to hang optional logic or state onto for new dedup features. Version is always 1. For a "legacy" dedup table, where no DDT root directory exists, the version will be 0. ddt_configure() is expected to determine the version and flags features currently in operation based on whether or not the fast_dedup feature is enabled, and from what's available on disk. In this way, its possible to support both old and new tables. This also provides a migration path. A legacy setup can be upgraded to FDT by creating the DDT root ZAP, moving the existing objects into it, and setting version and flags appropriately. There's no support for that here, but it would be straightforward to add later and allows the possibility that newer features could be applied to existing dedup tables. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15892	2024-08-16 11:58:59 -07:00
Allan Jude	62e7d3c89e	ddt: add support for prefetching tables into the ARC This change adds a new `zpool prefetch -t ddt $pool` command which causes a pool's DDT to be loaded into the ARC. The primary goal is to remove the need to "warm" a pool's cache before deduplication stops slowing write performance. It may also provide a way to reload portions of a DDT if they have been flushed due to inactivity. Sponsored-by: iXsystems, Inc. Sponsored-by: Catalogics, Inc. Sponsored-by: Klara, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Will Andrews <will.andrews@klarasystems.com> Signed-off-by: Fred Weigel <fred.weigel@klarasystems.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Signed-off-by: Don Brady <don.brady@klarasystems.com> Co-authored-by: Will Andrews <will.andrews@klarasystems.com> Co-authored-by: Don Brady <don.brady@klarasystems.com> Closes #15890	2024-07-26 09:16:18 -07:00
Allan Jude	c7ada64bb6	ddt: dedup table quota enforcement This adds two new pool properties: - dedup_table_size, the total size of all DDTs on the pool; and - dedup_table_quota, the maximum possible size of all DDTs in the pool When set, quota will be enforced by checking when a new entry is about to be created. If the pool is over its dedup quota, the entry won't be created, and the corresponding write will be converted to a regular non-dedup write. Note that existing entries can be updated (ie their refcounts changed), as that reuses the space rather than requiring more. dedup_table_quota can be set to 'auto', which will set it based on the size of the devices backing the "dedup" allocation device. This makes it possible to limit the DDTs to the size of a dedup vdev only, such that when the device fills, no new blocks are deduplicated. Sponsored-by: iXsystems, Inc. Sponsored-By: Klara Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Signed-off-by: Don Brady <don.brady@klarasystems.com> Co-authored-by: Don Brady <don.brady@klarasystems.com> Co-authored-by: Rob Wing <rob.wing@klarasystems.com> Co-authored-by: Sean Eric Fagan <sean.fagan@klarasystems.com> Closes #15889	2024-07-25 09:47:36 -07:00
Don Brady	fb6d8cf229	Add some missing vdev properties (#16346 ) Sponsored-by: Klara, Inc. Sponsored-By: Wasabi Technology, Inc. Signed-off-by: Don Brady <don.brady@klarasystems.com> Co-authored-by: Don Brady <don.brady@klarasystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2024-07-23 16:34:09 -07:00
chenqiuhao1997	41ae864b69	Replace P2ALIGN with P2ALIGN_TYPED and delete P2ALIGN. In P2ALIGN, the result would be incorrect when align is unsigned integer and x is larger than max value of the type of align. In that case, -(align) would be a positive integer, which means high bits would be zero and finally stay zero after '&' when align is converted to a larger integer type. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Youzhong Yang <yyang@mathworks.com> Signed-off-by: Qiuhao Chen <chenqiuhao1997@gmail.com> Closes #15940	2024-05-10 08:47:21 -07:00
Don Brady	cbe882298e	Add slow disk diagnosis to ZED Slow disk response times can be indicative of a failing drive. ZFS currently tracks slow I/Os (slower than zio_slow_io_ms) and generates events (ereport.fs.zfs.delay). However, no action is taken by ZED, like is done for checksum or I/O errors. This change adds slow disk diagnosis to ZED which is opt-in using new VDEV properties: VDEV_PROP_SLOW_IO_N VDEV_PROP_SLOW_IO_T If multiple VDEVs in a pool are undergoing slow I/Os, then it skips the zpool_vdev_degrade(). Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Rob Wing <rob.wing@klarasystems.com> Signed-off-by: Don Brady <don.brady@klarasystems.com> Closes #15469	2024-02-08 09:19:52 -08:00
Don Brady	5caeef02fa	RAID-Z expansion feature This feature allows disks to be added one at a time to a RAID-Z group, expanding its capacity incrementally. This feature is especially useful for small pools (typically with only one RAID-Z group), where there isn't sufficient hardware to add capacity by adding a whole new RAID-Z group (typically doubling the number of disks). == Initiating expansion == A new device (disk) can be attached to an existing RAIDZ vdev, by running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank raidz2-0 sda`. The new device will become part of the RAIDZ group. A "raidz expansion" will be initiated, and the new device will contribute additional space to the RAIDZ group once the expansion completes. The `feature@raidz_expansion` on-disk feature flag must be `enabled` to initiate an expansion, and it remains `active` for the life of the pool. In other words, pools with expanded RAIDZ vdevs can not be imported by older releases of the ZFS software. == During expansion == The expansion entails reading all allocated space from existing disks in the RAIDZ group, and rewriting it to the new disks in the RAIDZ group (including the newly added device). The expansion progress can be monitored with `zpool status`. Data redundancy is maintained during (and after) the expansion. If a disk fails while the expansion is in progress, the expansion pauses until the health of the RAIDZ vdev is restored (e.g. by replacing the failed disk and waiting for reconstruction to complete). The pool remains accessible during expansion. Following a reboot or export/import, the expansion resumes where it left off. == After expansion == When the expansion completes, the additional space is available for use, and is reflected in the `available` zfs property (as seen in `zfs list`, `df`, etc). Expansion does not change the number of failures that can be tolerated without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after expansion). A RAIDZ vdev can be expanded multiple times. After the expansion completes, old blocks remain with their old data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but distributed among the larger set of disks. New blocks will be written with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ vdev's "assumed parity ratio" does not change, so slightly less space than is expected may be reported for newly-written blocks, according to `zfs list`, `df`, `ls -s`, and similar tools. Sponsored-by: The FreeBSD Foundation Sponsored-by: iXsystems, Inc. Sponsored-by: vStack Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Authored-by: Matthew Ahrens <mahrens@delphix.com> Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com> Contributions-by: Stuart Maybee <stuart.maybee@comcast.net> Contributions-by: Thorsten Behrens <tbehrens@outlook.com> Contributions-by: Fmstrat <nospam@nowsci.com> Contributions-by: Don Brady <dev.fs.zfs@gmail.com> Signed-off-by: Don Brady <dev.fs.zfs@gmail.com> Closes #15022	2023-11-08 10:19:41 -08:00
Ameer Hamza	60387facd2	zvol: Implement zvol threading as a Property Currently, zvol threading can be switched through the zvol_request_sync module parameter system-wide. By making it a zvol property, zvol threading can be switched per zvol. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #15409	2023-10-31 09:50:32 -07:00
Brian Behlendorf	07345ac252	Add prefetch property ZFS prefetch is currently governed by the zfs_prefetch_disable tunable. However, this is a module-wide settings - if a specific dataset benefits from prefetch, while others have issue with it, an optimal solution does not exists. This commit introduce the "prefetch" tri-state property, which enable granular control (at dataset/volume level) for prefetching. This patch does not remove the zfs_prefetch_disable, which remains a system-wide switch for enable/disable prefetch. However, to avoid duplication, it would be preferable to deprecate and then remove the module tunable. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Signed-off-by: Gionatan Danti <g.danti@assyoma.it> Co-authored-by: Gionatan Danti <g.danti@assyoma.it> Closes #15237 Closes #15436	2023-10-24 11:00:07 -07:00
Paul Dagnelie	bee9cfb813	Increase limit of redaction list by using spill block Currently redaction bookmarks and their associated redaction lists have a relatively low limit of 36 redaction snapshots. This is imposed by the number of snapshot GUIDs that fit in the bonus buffer of the redaction list object. While this is more than enough for most use cases, there are some limited cases where larger numbers would be useful to support. We tweak the redaction list creation code to use a spill block if the number of redaction snapshots is above the amount that would fit in the bonus buffer. We also make a small change to allow spill blocks to be use for types of data besides SA. In order to fully leverage this logic, we also change the redaction code to use vmem_alloc, to handle extremely large allocations if needed. Finally, small tweaks were made to the zfs commands and the test suite. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #15018	2023-08-26 11:34:43 -07:00
Yuri Pankov	8beabfd3bf	set autotrim default to 'off' everywhere As it turns out having autotrim default to 'on' on FreeBSD never really worked due to mess with defines where userland and kernel module were getting different default values (userland was defaulting to 'off', module was thinking it's 'on'). Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Yuri Pankov <yuripv@FreeBSD.org> Closes #15079	2023-07-20 09:06:55 -07:00
rob-wing	3e4ed4213d	Create zap for root vdev And add it to the AVZ, this is not backwards compatible with older pools due to an assertion in spa_sync() that verifies the number of ZAPs of all vdevs matches the number of ZAPs in the AVZ. Granted, the assertion only applies to #DEBUG builds - still, a feature flag is introduced to avoid the assertion, com.klarasystems:vdev_zaps_v2 Notably, this allows to get/set properties on the root vdev: % zpool set user:prop=value <pool> root-0 Before this commit, it was already possible to get/set properties on top-level vdevs with the syntax <type>-<vdev_id> (e.g. mirror-0): % zpool set user:prop=value <pool> mirror-0 This syntax also applies to the root vdev as it is is of type 'root' with a vdev_id of 0, root-0. The keyword 'root' as an alias for 'root-0'. The following tests have been added: - zpool get all properties from root vdev - zpool set a property on root vdev - verify root vdev ZAP is created Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Wing <rob.wing@klarasystems.com> Sponsored-by: Seagate Technology Submitted-by: Klara, Inc. Closes #14405	2023-04-20 10:07:56 -07:00
Martin Matuška	a3f82aec93	Miscellaneous FreBSD compilation bugfixes Add missing machine/md_var.h to spl/sys/simd_aarch64.h and spl/sys/simd_arm.h In spl/sys/simd_x86.h, PCB_FPUNOSAVE exists only on amd64, use PCB_NPXNOSAVE on i386 In FreeBSD sys/elf_common.h redefines AT_UID and AT_GID on FreeBSD, we need a hack in vnode.h similar to Linux. sys/simd.h needs to be included early. In zfs_freebsd_copy_file_range() we pass a (size_t )lenp to zfs_clone_range() that expects a (uint64_t ) Allow compiling armv6 world by limiting ARM macros in sha256_impl.c and sha512_impl.c to __ARM_ARCH > 6 Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Pawel Jakub Dawidek <pawel@dawidek.net> Reviewed-by: Signed-off-by: WHR <msl0000023508@gmail.com> Signed-off-by: Martin Matuska <mm@FreeBSD.org> Closes #14674	2023-04-06 10:35:02 -07:00
Rich Ercolani	0ad5f43442	Drop lying to the compiler in the fletcher4 code This is probably the uncontroversial part of #13631, which fixes a real problem people are having. There's still things to improve in our code after this is merged, but it should stop the breakage that people have reported, where we lie about a type always being aligned and then pass in stack objects with no alignment requirement and hope for the best. Of course, our SIMD code was written with unaligned accesses, so it doesn't care if we drop this...but some auto-vectorized code that gcc emits sure does, since we told it it can assume they're aligned. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #14649	2023-03-24 10:29:19 -07:00
Richard Yao	d1807f168e	nvpair: Constify string functions After addressing coverity complaints involving `nvpair_name()`, the compiler started complaining about dropping const. This lead to a rabbit hole where not only `nvpair_name()` needed to be constified, but also `nvpair_value_string()`, `fnvpair_value_string()` and a few other static functions, plus variable pointers throughout the code. The result became a fairly big change, so it has been split out into its own patch. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14612	2023-03-14 15:25:50 -07:00
Attila Fülöp	78289b8458	zcommon: Refactor FPU state handling in fletcher4 Currently calls to kfpu_begin() and kfpu_end() are split between the init() and fini() functions of the particular SIMD implementation. This was done in #14247 as an optimization measure for the ABD adapter. Unfortunately the split complicates FPU handling on platforms that use a local FPU state buffer, like Windows and macOS. To ease porting, we introduce a boolean struct member in fletcher_4_ops_t, indicating use of the FPU, and move the FPU state handling from the SIMD implementations to the call sites. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Attila Fülöp <attila@fueloep.org> Closes #14600	2023-03-14 09:45:28 -07:00
Pawel Jakub Dawidek	67a1b03791	Implementation of block cloning for ZFS Block Cloning allows to manually clone a file (or a subset of its blocks) into another (or the same) file by just creating additional references to the data blocks without copying the data itself. Those references are kept in the Block Reference Tables (BRTs). The whole design of block cloning is documented in module/zfs/brt.c. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Christian Schwarz <christian.schwarz@nutanix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net> Closes #13392	2023-03-10 11:59:53 -08:00
rob-wing	69f024a56e	Configure zed's diagnosis engine with vdev properties Introduce four new vdev properties: checksum_n checksum_t io_n io_t These properties can be used for configuring the thresholds of zed's diagnosis engine and are interpeted as <N> events in T <seconds>. When this property is set to a non-default value on a top-level vdev, those thresholds will also apply to its leaf vdevs. This behavior can be overridden by explicitly setting the property on the leaf vdev. Note that, these properties do not persist across vdev replacement. For this reason, it is advisable to set the property on the top-level vdev instead of the leaf vdev. The default values for zed's diagnosis engine (10 events, 600 seconds) remains unchanged. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Rob Wing <rob.wing@klarasystems.com> Sponsored-by: Seagate Technology LLC Closes #13805	2023-01-23 13:14:25 -08:00
Richard Yao	2e7f664f04	Cleanup of dead code suggested by Clang Static Analyzer (#14380 ) I recently gained the ability to run Clang's static analyzer on the linux kernel modules via a few hacks. This extended coverage to code that was previously missed since Clang's static analyzer only looked at code that we built in userspace. Running it against the Linux kernel modules built from my local branch produced a total of 72 reports against my local branch. Of those, 50 were reports of logic errors and 22 were reports of dead code. Since we already had cleaned up all of the previous dead code reports, I felt it would be a good next step to clean up these dead code reports. Clang did a further breakdown of the dead code reports into: Dead assignment 15 Dead increment 2 Dead nested assignment 5 The benefit of cleaning these up, especially in the case of dead nested assignment, is that they can expose places where our error handling is incorrect. A number of them were fairly straight forward. However several were not: In vdev_disk_physio_completion(), not only were we not using the return value from the static function vdev_disk_dio_put(), but nothing used it, so I changed it to return void and removed the existing (void) cast in the other area where we call it in addition to no longer storing it to a stack value. In FSE_createDTable(), the function is dead code. Its helper function FSE_freeDTable() is also dead code, as are the CPP definitions in `module/zstd/include/zstd_compat_wrapper.h`. We just delete it all. In zfs_zevent_wait(), we have an optimization opportunity. cv_wait_sig() returns 0 if there are waiting signals and 1 if there are none. The Linux SPL version literally returns `signal_pending(current) ? 0 : 1)` and FreeBSD implements the same semantics, we can just do `!cv_wait_sig()` in place of `signal_pending(current)` to avoid unnecessarily calling it again. zfs_setattr() on FreeBSD version did not have error handling issue because the code was removed entirely from FreeBSD version. The error is from updating the attribute directory's files. After some thought, I decided to propapage errors on it to userspace. In zfs_secpolicy_tmp_snapshot(), we ignore a lack of permission from the first check in favor of checking three other permissions. I assume this is intentional. In zfs_create_fs(), the return value of zap_update() was not checked despite setting an important version number. I see no backward compatibility reason to permit failures, so we add an assertion to catch failures. Interestingly, Linux is still using ASSERT(error == 0) from OpenSolaris while FreeBSD has switched to the improved ASSERT0(error) from illumos, although illumos has yet to adopt it here. ASSERT(error == 0) was used on Linux while ASSERT0(error) was used on FreeBSD since the entire file needs conversion and that should be the subject of another patch. dnode_move()'s issue was caused by us not having implemented POINTER_IS_VALID() on Linux. We have a stub in `include/os/linux/spl/sys/kmem_cache.h` for it, when it really should be in `include/os/linux/spl/sys/kmem.h` to be consistent with Illumos/OpenSolaris. FreeBSD put both `POINTER_IS_VALID()` and `POINTER_INVALIDATE()` in `include/os/freebsd/spl/sys/kmem.h`, so we copy what it did. Whenever a report was in platform-specific code, I checked the FreeBSD version to see if it also applied to FreeBSD, but it was only relevant a few times. Lastly, the patch that enabled Clang's static analyzer to be run on the Linux kernel modules needs more work before it can be put into a PR. I plan to do that in the future as part of the on-going static analysis work that I am doing. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14380	2023-01-17 09:57:12 -08:00
Richard Yao	59493b63c1	Micro-optimize fletcher4 calculations When processing abds, we execute 1 `kfpu_begin()`/`kfpu_end()` pair on every page in the abd. This is wasteful and slows down checksum performance versus what the benchmark claimed. We correct this by moving those calls to the init and fini functions. Also, we always check the buffer length against 0 before calling the non-scalar checksum functions. This means that we do not need to execute the loop condition for the first loop iteration. That allows us to micro-optimize the checksum calculations by switching to do-while loops. Note that we do not apply that micro-optimization to the scalar implementation because there is no check in `fletcher_4_incremental_native()`/`fletcher_4_incremental_byteswap()` against 0 sized buffers being passed. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14247	2022-12-05 11:00:34 -08:00
Richard Yao	303678350a	Convert some sprintf() calls to kmem_scnprintf() These `sprintf()` calls are used repeatedly to write to a buffer. There is no protection against overflow other than reviewers explicitly checking to see if the buffers are big enough. However, such issues are easily missed during review and when they are missed, we would rather stop printing rather than have a buffer overflow, so we convert these functions to use `kmem_scnprintf()`. The Linux kernel provides an entire page for module parameters, so we are safe to write up to PAGE_SIZE. Removing `sprintf()` from these functions removes the last instances of `sprintf()` usage in our platform-independent kernel code. This improves XNU kernel compatibility because the XNU kernel does not support (removed support for?) `sprintf()`. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14209	2022-11-28 13:49:58 -08:00
Mariusz Zaborski	16f0fdaddd	Allow to control failfast Linux defaults to setting "failfast" on BIOs, so that the OS will not retry IOs that fail, and instead report the error to ZFS. In some cases, such as errors reported by the HBA driver, not the device itself, we would wish to retry rather than generating vdev errors in ZFS. This new property allows that. This introduces a per vdev option to disable the failfast option. This also introduces a global module parameter to define the failfast mask value. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com> Sponsored-by: Seagate Technology LLC Submitted-by: Klara, Inc. Closes #14056	2022-11-10 13:37:12 -08:00
Damian Szuberski	109731cd73	dsl_prop_known_index(): check for invalid prop Resolve UBSAN array-index-out-of-bounds error in zprop_desc_t. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: szubersk <szuberskidamian@gmail.com> Closes #14142 Closes #14147	2022-11-08 10:16:01 -08:00
Akash B	5405be0365	Add options to zfs redundant_metadata property Currently, additional/extra copies are created for metadata in addition to the redundancy provided by the pool(mirror/raidz/draid), due to this 2 times more space is utilized per inode and this decreases the total number of inodes that can be created in the filesystem. By setting redundant_metadata to none, no additional copies of metadata are created, hence can reduce the space consumed by the additional metadata copies and increase the total number of inodes that can be created in the filesystem. Additionally, this can improve file create performance due to the reduced amount of metadata which needs to be written. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Dipak Ghosh <dipak.ghosh@hpe.com> Signed-off-by: Akash B <akash-b@hpe.com> Closes #13680	2022-10-19 17:07:51 -07:00
George Melikov	fbc210fab2	Enable relatime by default Linux sets relatime on mount by default for any file system, but relatime=off in ZFS disables it explicitly. Let's be consistent with other file systems on Linux. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Melikov <mail@gmelikov.ru> Closes #13614	2022-08-12 14:20:25 -07:00
Umer Saleem	9681de4657	Add snapshots_changed as property Make dd_snap_cmtime property persistent across mount and unmount operations by storing in ZAP and restore the value from ZAP on hold into dd_snap_cmtime instead of updating it. Expose dd_snap_cmtime as 'snapshots_changed' property that provides a mechanism to quickly determine whether snapshot list for dataset has changed without having to mount a dataset or iterate the snapshot list. It specifies the time at which a snapshot for a dataset was last created or deleted. This allows us to be more efficient how often we query snapshots. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Umer Saleem <usaleem@ixsystems.com> Closes #13635	2022-08-02 16:45:30 -07:00
Tino Reichardt	1d3ba0bf01	Replace dead opensolaris.org license link The commit replaces all findings of the link: http://www.opensolaris.org/os/licensing with this one: https://opensource.org/licenses/CDDL-1.0 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Closes #13619	2022-07-11 14:16:13 -07:00
наб	a926aab902	Enable -Wwrite-strings Also, fix leak from ztest_global_vars_to_zdb_args() Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13348	2022-06-29 14:08:54 -07:00
Tino Reichardt	985c33b132	Introduce BLAKE3 checksums as an OpenZFS feature This commit adds BLAKE3 checksums to OpenZFS, it has similar performance to Edon-R, but without the caveats around the latter. Homepage of BLAKE3: https://github.com/BLAKE3-team/BLAKE3 Wikipedia: https://en.wikipedia.org/wiki/BLAKE_(hash_function)#BLAKE3 Short description of Wikipedia: BLAKE3 is a cryptographic hash function based on Bao and BLAKE2, created by Jack O'Connor, Jean-Philippe Aumasson, Samuel Neves, and Zooko Wilcox-O'Hearn. It was announced on January 9, 2020, at Real World Crypto. BLAKE3 is a single algorithm with many desirable features (parallelism, XOF, KDF, PRF and MAC), in contrast to BLAKE and BLAKE2, which are algorithm families with multiple variants. BLAKE3 has a binary tree structure, so it supports a practically unlimited degree of parallelism (both SIMD and multithreading) given enough input. The official Rust and C implementations are dual-licensed as public domain (CC0) and the Apache License. Along with adding the BLAKE3 hash into the OpenZFS infrastructure a new benchmarking file called chksum_bench was introduced. When read it reports the speed of the available checksum functions. On Linux: cat /proc/spl/kstat/zfs/chksum_bench On FreeBSD: sysctl kstat.zfs.misc.chksum_bench This is an example output of an i3-1005G1 test system with Debian 11: implementation 1k 4k 16k 64k 256k 1m 4m edonr-generic 1196 1602 1761 1749 1762 1759 1751 skein-generic 546 591 608 615 619 612 616 sha256-generic 240 300 316 314 304 285 276 sha512-generic 353 441 467 476 472 467 426 blake3-generic 308 313 313 313 312 313 312 blake3-sse2 402 1289 1423 1446 1432 1458 1413 blake3-sse41 427 1470 1625 1704 1679 1607 1629 blake3-avx2 428 1920 3095 3343 3356 3318 3204 blake3-avx512 473 2687 4905 5836 5844 5643 5374 Output on Debian 5.10.0-10-amd64 system: (Ryzen 7 5800X) implementation 1k 4k 16k 64k 256k 1m 4m edonr-generic 1840 2458 2665 2719 2711 2723 2693 skein-generic 870 966 996 992 1003 1005 1009 sha256-generic 415 442 453 455 457 457 457 sha512-generic 608 690 711 718 719 720 721 blake3-generic 301 313 311 309 309 310 310 blake3-sse2 343 1865 2124 2188 2180 2181 2186 blake3-sse41 364 2091 2396 2509 2463 2482 2488 blake3-avx2 365 2590 4399 4971 4915 4802 4764 Output on Debian 5.10.0-9-powerpc64le system: (POWER 9) implementation 1k 4k 16k 64k 256k 1m 4m edonr-generic 1213 1703 1889 1918 1957 1902 1907 skein-generic 434 492 520 522 511 525 525 sha256-generic 167 183 187 188 188 187 188 sha512-generic 186 216 222 221 225 224 224 blake3-generic 153 152 154 153 151 153 153 blake3-sse2 391 1170 1366 1406 1428 1426 1414 blake3-sse41 352 1049 1212 1174 1262 1258 1259 Output on Debian 5.10.0-11-arm64 system: (Pi400) implementation 1k 4k 16k 64k 256k 1m 4m edonr-generic 487 603 629 639 643 641 641 skein-generic 271 299 303 308 309 309 307 sha256-generic 117 127 128 130 130 129 130 sha512-generic 145 165 170 172 173 174 175 blake3-generic 81 29 71 89 89 89 89 blake3-sse2 112 323 368 379 380 371 374 blake3-sse41 101 315 357 368 369 364 360 Structurally, the new code is mainly split into these parts: - 1x cross platform generic c variant: blake3_generic.c - 4x assembly for X86-64 (SSE2, SSE4.1, AVX2, AVX512) - 2x assembly for ARMv8 (NEON converted from SSE2) - 2x assembly for PPC64-LE (POWER8 converted from SSE2) - one file for switching between the implementations Note the PPC64 assembly requires the VSX instruction set and the kfpu_begin() / kfpu_end() calls on PowerPC were updated accordingly. Reviewed-by: Felix Dörre <felix@dogcraft.de> Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Co-authored-by: Rich Ercolani <rincebrain@gmail.com> Closes #10058 Closes #12918	2022-06-08 15:55:57 -07:00
George Amanakis	0409d33273	Improve zpool status output, list all affected datasets Currently, determining which datasets are affected by corruption is a manual process. The primary difficulty in reporting the list of affected snapshots is that since the error was initially found, the snapshot where the error originally occurred in, may have been deleted. To solve this issue, we add the ID of the head dataset of the original snapshot which the error was detected in, to the stored error report. Then any time a filesystem is deleted, the errors associated with it are deleted as well. Any time a clone promote occurs, we modify reports associated with the original head to refer to the new head. The stored error reports are identified by this head ID, the birth time of the block which the error occurred in, as well as some information about the error itself are also stored. Once this information is stored, we can find the set of datasets affected by an error by walking back the list of snapshots in the given head until we find one with the appropriate birth txg, and then traverse through the snapshots of the clone family, terminating a branch if the block was replaced in a given snapshot. Then we report this information back to libzfs, and to the zpool status command, where it is displayed as follows: pool: test state: ONLINE status: One or more devices has experienced an error resulting in data corruption. Applications may be affected. action: Restore the file in question if possible. Otherwise restore the entire pool from backup. see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A scan: scrub repaired 0B in 00:00:00 with 800 errors on Fri Dec 3 08:27:57 2021 config: NAME STATE READ WRITE CKSUM test ONLINE 0 0 0 sdb ONLINE 0 0 1.58K errors: Permanent errors have been detected in the following files: test@1:/test.0.0 /test/test.0.0 /test/1clone/test.0.0 A new feature flag is introduced to mark the presence of this change, as well as promotion and backwards compatibility logic. This is an updated version of #9175. Rebase required fixing the tests, updating the ABI of libzfs, updating the man pages, fixing bugs, fixing the error returns, and updating the old on-disk error logs to the new format when activating the feature. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Co-authored-by: TulsiJain <tulsi.jain@delphix.com> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #9175 Closes #12812	2022-04-25 17:25:42 -07:00
наб	ad9e767657	linux: module: weld all but spl.ko into zfs.ko Originally it was thought it would be useful to split up the kmods by functionality. This would allow external consumers to only load what was needed. However, in practice we've never had a case where this functionality would be needed, and conversely managing multiple kmods can be awkward. Therefore, this change merges all but the spl.ko kmod in to a single zfs.ko kmod. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13274	2022-04-20 13:28:24 -07:00
наб	9292cf761e	Fix string/index variables being unflexed by default I got the status backward (B_FALSE for fixed, rather than B_TRUE for flex); before: $ zfs get mountpoint tarta-zoot -r NAME PROPERTY VALUE SOURCE tarta-zoot mountpoint / local tarta-zoot/PAGEFILE.SYS mountpoint - - tarta-zoot/etc mountpoint /etc inherited from tarta-zoot tarta-zoot/home mountpoint /home inherited from tarta-zoot tarta-zoot/home/xspon mountpoint /home/xspon inherited from tarta-zoot tarta-zoot/home/nabijaczleweli mountpoint /home/nabijaczleweli inherited from tarta-zoot tarta-zoot/home/nabijaczleweli/tftp mountpoint /home/nabijaczleweli/tftp inherited from tarta-zoot tarta-zoot/home/root mountpoint /root local after: $ zfs get mountpoint tarta-zoot -r NAME PROPERTY VALUE SOURCE tarta-zoot mountpoint / local tarta-zoot/PAGEFILE.SYS mountpoint - - tarta-zoot/etc mountpoint /etc inherited from tarta-zoot tarta-zoot/home mountpoint /home inherited from tarta-zoot tarta-zoot/home/xspon mountpoint /home/xspon inherited from tarta-zoot tarta-zoot/home/nabijaczleweli mountpoint /home/nabijaczleweli inherited from tarta-zoot tarta-zoot/home/nabijaczleweli/tftp mountpoint /home/nabijaczleweli/tftp inherited from tarta-zoot tarta-zoot/home/root mountpoint /root local Fixes: `be8e1d81bf` ("Flex non-pretty-printed properties and raw-/pretty-print remaining ones") Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13286	2022-04-05 09:34:46 -07:00
наб	d465fc5844	Forbid b{copy,zero,cmp}(). Don't include <strings.h> for <string.h> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12996	2022-03-15 15:13:48 -07:00
наб	861166b027	Remove bcopy(), bzero(), bcmp() bcopy() has a confusing argument order and is actually a move, not a copy; they're all deprecated since POSIX.1-2001 and removed in -2008, and we shim them out to mem*() on Linux anyway Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12996	2022-03-15 15:13:42 -07:00
Attila Fülöp	ce7a5dbf4b	Linux x86 SIMD: factor out unneeded kernel dependencies Cleanup the kernel SIMD code by removing kernel dependencies. - Replace XSTATE_XSAVE with our own XSAVE implementation for all kernels not exporting kernel_fpu{begin,end}(), see #13059 - Replace union fpregs_state by a uint8_t * buffer and get the size of the buffer from the hardware via the CPUID instruction - Replace kernels xgetbv() by our own implementation which was already there for userspace. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Attila Fülöp <attila@fueloep.org> Closes #13102	2022-03-08 16:19:15 -08:00
наб	be8e1d81bf	Flex non-pretty-printed properties and raw-/pretty-print remaining ones Before: nabijaczleweli@tarta:~/store/code/zfs$ /sbin/zpool list -Td -o name,size,alloc,free,ckpoint,expandsz,guid,load_guid,frag,cap,dedup,health,altroot,guid,dedupditto,load_guid,maxblocksize,maxdnodesize 2>/dev/null Sun 20 Feb 03:57:44 CET 2022 NAME SIZE ALLOC FREE CKPOINT EXPANDSZ GUID LOAD_GUID FRAG CAP DEDUP HEALTH ALTROOT GUID DEDUPDITTO LOAD_GUID MAXBLOCKSIZE MAXDNODESIZE filling 25.5T 6.52T 18.9T - 64M 11512889483096932869 11656109927366648364 1% 25% 1.00x ONLINE - 11512889483096932869 0 11656109927366648364 1048576 16384 tarta-boot 240M 50.6M 189M - - 2372068846917849656 7752280792179633787 12% 21% 1.00x ONLINE - 2372068846917849656 0 7752280792179633787 1048576 512 tarta-zoot 55.5G 6.42G 49.1G - - 12971868889665384604 8622632123393589527 17% 11% 1.00x ONLINE - 12971868889665384604 0 8622632123393589527 1048576 16384 nabijaczleweli@tarta:~/store/code/zfs$ /sbin/zfs list -o name,guid,keyguid,ivsetguid,createtxg,objsetid,pbkdf2iters,refratio -r tarta-zoot NAME GUID KEYGUID IVSETGUID CREATETXG OBJSETID PBKDF2ITERS REFRATIO tarta-zoot 1110930838977259561 659P - 1 54 0 1.03x tarta-zoot/PAGEFILE.SYS 2202570496672997800 3.20E - 2163 1539 0 1.07x tarta-zoot/dupa 16941280502417785695 9.81E - 2274707 1322 1000000000000 1.00x tarta-zoot/etc 17029963068508333530 12.9E - 3663 1087 0 1.52x tarta-zoot/home 3508163802370032575 8.50E - 3664 294 0 1.00x tarta-zoot/home/misio 7283672744014848555 13.0E - 3665 302 0 2.28x tarta-zoot/home/nabijaczleweli 12286744508078616303 5.15E - 3666 200 0 2.05x tarta-zoot/home/nabijaczleweli/tftp 13551632689932817643 5.16E - 3667 1095 0 1.00x tarta-zoot/home/root 5203106193060067946 15.4E - 3668 698 0 2.86x tarta-zoot/home/shared-config 8866040021005142194 14.5E - 3670 2069 0 1.20x tarta-zoot/home/tymek 9472751824283011822 4.56E - 3671 1202 0 1.32x tarta-zoot/oldboot 10460192444135730377 13.8E - 2268398 1232 0 1.01x tarta-zoot/opt 9945621324983170410 5.84E - 3672 1210 0 1.00x tarta-zoot/opt/icecc 13178238931846132425 9.04E - 3673 1103 0 2.83x tarta-zoot/opt/swtpm 10172962421514870859 4.13E - 825669 145132 0 1.87x tarta-zoot/srv 217179989022738337 3.90E - 3674 2469 0 1.00x tarta-zoot/usr 12214213243060765090 15.0E - 3675 2477 0 2.58x tarta-zoot/usr/local 7542700368693813134 941P - 3676 2484 0 2.33x tarta-zoot/var 13414177124447929530 10.2E - 3677 2492 0 1.57x tarta-zoot/var/lib 6969944550407159241 5.28E - 3678 2499 0 2.34x tarta-zoot/var/tmp 6399468088048343912 1.34E - 3679 1218 0 3.95x After: nabijaczleweli@tarta:~/store/code/zfs$ cmd/zpool/zpool list -Td -o name,size,alloc,free,ckpoint,expandsz,guid,load_guid,frag,cap,dedup,health,altroot,guid,dedupditto,load_guid,maxblocksize,maxdnodesize 2>/dev/null Sun 20 Feb 03:57:42 CET 2022 NAME SIZE ALLOC FREE CKPOINT EXPANDSZ GUID LOAD_GUID FRAG CAP DEDUP HEALTH ALTROOT GUID DEDUPDITTO LOAD_GUID MAXBLOCKSIZE MAXDNODESIZE filling 25.5T 6.52T 18.9T - 64M 11512889483096932869 11656109927366648364 1% 25% 1.00x ONLINE - 11512889483096932869 0 11656109927366648364 1M 16K tarta-boot 240M 50.6M 189M - - 2372068846917849656 7752280792179633787 12% 21% 1.00x ONLINE - 2372068846917849656 0 7752280792179633787 1M 512 tarta-zoot 55.5G 6.42G 49.1G - - 12971868889665384604 8622632123393589527 17% 11% 1.00x ONLINE - 12971868889665384604 0 8622632123393589527 1M 16K nabijaczleweli@tarta:~/store/code/zfs$ cmd/zfs/zfs list -o name,guid,keyguid,ivsetguid,createtxg,objsetid,pbkdf2iters,refratio -r tarta-zoot NAME GUID KEYGUID IVSETGUID CREATETXG OBJSETID PBKDF2ITERS REFRATIO tarta-zoot 1110930838977259561 741529699813639505 - 1 54 0 1.03x tarta-zoot/PAGEFILE.SYS 2202570496672997800 3689529982640017884 - 2163 1539 0 1.07x tarta-zoot/dupa 16941280502417785695 11312442953423259518 - 2274707 1322 1000000000000 1.00x tarta-zoot/etc 17029963068508333530 14852574366795347233 - 3663 1087 0 1.52x tarta-zoot/home 3508163802370032575 9802810070759776956 - 3664 294 0 1.00x tarta-zoot/home/misio 7283672744014848555 14983161489316798151 - 3665 302 0 2.28x tarta-zoot/home/nabijaczleweli 12286744508078616303 5937870537299886218 - 3666 200 0 2.05x tarta-zoot/home/nabijaczleweli/tftp 13551632689932817643 5950522828900813054 - 3667 1095 0 1.00x tarta-zoot/home/root 5203106193060067946 17718025091255443518 - 3668 698 0 2.86x tarta-zoot/home/shared-config 8866040021005142194 16716354482778968577 - 3670 2069 0 1.20x tarta-zoot/home/tymek 9472751824283011822 5251854710505749954 - 3671 1202 0 1.32x tarta-zoot/oldboot 10460192444135730377 15894065034622168157 - 2268398 1232 0 1.01x tarta-zoot/opt 9945621324983170410 6737735639539098405 - 3672 1210 0 1.00x tarta-zoot/opt/icecc 13178238931846132425 10425145983015238428 - 3673 1103 0 2.83x tarta-zoot/opt/swtpm 10172962421514870859 4764783754852521469 - 825669 145132 0 1.87x tarta-zoot/srv 217179989022738337 4492810461439647259 - 3674 2469 0 1.00x tarta-zoot/usr 12214213243060765090 17306702395865262834 - 3675 2477 0 2.58x tarta-zoot/usr/local 7542700368693813134 1059954157997659784 - 3676 2484 0 2.33x tarta-zoot/var 13414177124447929530 11764397504176937123 - 3677 2492 0 1.57x tarta-zoot/var/lib 6969944550407159241 6084753728494937404 - 3678 2499 0 2.34x tarta-zoot/var/tmp 6399468088048343912 1548692824635344277 - 3679 1218 0 3.95x Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13122 Closes #13125	2022-03-04 12:08:33 -08:00
наб	a9a89755fa	module: zcommon: zprop: common: zprop_width: namespace exceptions Before this, /all/ numerical properties 1 (ZFS_PROP_CREATION, ZPOOL_PROP_SIZE, VDEV_PROP_CAPACITY) would be non-fixed and /all/ numerical properties 5 (ZFS_PROP_COMPRESSRATIO, ZPOOL_PROP_HEALTH, VDEV_PROP_PSIZE) would be 8-wide Realistically, this doesn't appear to be much of a problem Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13125	2022-03-04 12:08:30 -08:00
Jitendra Patidar	361a7e8211	log xattr=sa create/remove/update to ZIL As such, there are no specific synchronous semantics defined for the xattrs. But for xattr=on, it does log to ZIL and zil_commit() is done, if sync=always is set on dataset. This provides sync semantics for xattr=on with sync=always set on dataset. For the xattr=sa implementation, it doesn't log to ZIL, so, even with sync=always, xattrs are not guaranteed to be synced before xattr call returns to caller. So, xattr can be lost if system crash happens, before txg carrying xattr transaction is synced. This change adds xattr=sa logging to ZIL on xattr create/remove/update and xattrs are synced to ZIL (zil_commit() done) for sync=always. This makes xattr=sa behavior similar to xattr=on. Implementation notes: The actual logging is fairly straight-forward and does not warrant additional explanation. However, it has been 14 years since we last added new TX types to the ZIL [1], hence this is the first time we do it after the introduction of zpool features. Therefore, here is an overview of the feature activation and deactivation workflow: 1. The feature must be enabled. Otherwise, we don't log the new record type. This ensures compatibility with older software. 2. The feature is activated per-dataset, since the ZIL is per-dataset. 3. If the feature is enabled and dataset is not for zvol, any append to the ZIL chain will activate the feature for the dataset. Likewise for starting a new ZIL chain. 4. A dataset that doesn't have a ZIL chain has the feature deactivated. We ensure (3) by activating on the first zil_commit() after the feature was enabled. Since activating the features requires waiting for txg sync, the first zil_commit() after enabling the feature will be slower than usual. The downside is that this is really a conservative approximation: even if we never append a 'TX_SETSAXATTR' to the ZIL chain, we pay the penalty for feature activation. The upside is that the user is in control of when we pay the penalty, i.e., upon enabling the feature. We ensure (4) by hooking into zil_sync(), where ZIL destroy actually happens. One more piece on feature activation, since it's spread across multiple functions: zil_commit() zil_process_commit_list() if lwb == NULL // first zil_commit since zil_open zil_create() if no log block pointer in ZIL header: if feature enabled and not active: // CASE 1 enable, COALESCE txg wait with dmu_tx that allocated the log block else // log block was allocated earlier than this zil_open if feature enabled and not active: // CASE 2 enable, EXPLICIT txg wait else // already have an in-DRAM LWB if feature enabled and not active: // this happens when we enable the feature after zil_create // CASE 3 enable, EXPLICIT txg wait [1] https://github.com/illumos/illumos-gate/commit/da6c28aaf62fa55f0fdb8004aa40f88f23bf53f0 Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Christian Schwarz <christian.schwarz@nutanix.com> Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com> Closes #8768 Closes #9078	2022-02-22 13:06:43 -08:00

1 2 3 4 5

237 Commits