mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2024-11-17 10:01:01 +03:00

Author	SHA1	Message	Date
Brian Atkinson	a10e552b99	Adding Direct IO Support Adding O_DIRECT support to ZFS to bypass the ARC for writes/reads. O_DIRECT support in ZFS will always ensure there is coherency between buffered and O_DIRECT IO requests. This ensures that all IO requests, whether buffered or direct, will see the same file contents at all times. Just as in other FS's , O_DIRECT does not imply O_SYNC. While data is written directly to VDEV disks, metadata will not be synced until the associated TXG is synced. For both O_DIRECT read and write request the offset and request sizes, at a minimum, must be PAGE_SIZE aligned. In the event they are not, then EINVAL is returned unless the direct property is set to always (see below). For O_DIRECT writes: The request also must be block aligned (recordsize) or the write request will take the normal (buffered) write path. In the event that request is block aligned and a cached copy of the buffer in the ARC, then it will be discarded from the ARC forcing all further reads to retrieve the data from disk. For O_DIRECT reads: The only alignment restrictions are PAGE_SIZE alignment. In the event that the requested data is in buffered (in the ARC) it will just be copied from the ARC into the user buffer. For both O_DIRECT writes and reads the O_DIRECT flag will be ignored in the event that file contents are mmap'ed. In this case, all requests that are at least PAGE_SIZE aligned will just fall back to the buffered paths. If the request however is not PAGE_SIZE aligned, EINVAL will be returned as always regardless if the file's contents are mmap'ed. Since O_DIRECT writes go through the normal ZIO pipeline, the following operations are supported just as with normal buffered writes: Checksum Compression Encryption Erasure Coding There is one caveat for the data integrity of O_DIRECT writes that is distinct for each of the OS's supported by ZFS. FreeBSD - FreeBSD is able to place user pages under write protection so any data in the user buffers and written directly down to the VDEV disks is guaranteed to not change. There is no concern with data integrity and O_DIRECT writes. Linux - Linux is not able to place anonymous user pages under write protection. Because of this, if the user decides to manipulate the page contents while the write operation is occurring, data integrity can not be guaranteed. However, there is a module parameter `zfs_vdev_direct_write_verify` that controls the if a O_DIRECT writes that can occur to a top-level VDEV before a checksum verify is run before the contents of the I/O buffer are committed to disk. In the event of a checksum verification failure the write will return EIO. The number of O_DIRECT write checksum verification errors can be observed by doing `zpool status -d`, which will list all verification errors that have occurred on a top-level VDEV. Along with `zpool status`, a ZED event will be issues as `dio_verify` when a checksum verification error occurs. ZVOLs and dedup is not currently supported with Direct I/O. A new dataset property `direct` has been added with the following 3 allowable values: disabled - Accepts O_DIRECT flag, but silently ignores it and treats the request as a buffered IO request. standard - Follows the alignment restrictions outlined above for write/read IO requests when the O_DIRECT flag is used. always - Treats every write/read IO request as though it passed O_DIRECT and will do O_DIRECT if the alignment restrictions are met otherwise will redirect through the ARC. This property will not allow a request to fail. There is also a module parameter zfs_dio_enabled that can be used to force all reads and writes through the ARC. By setting this module parameter to 0, it mimics as if the direct dataset property is set to disabled. Reviewed-by: Brian Behlendorf <behlendorf@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Co-authored-by: Mark Maybee <mark.maybee@delphix.com> Co-authored-by: Matt Macy <mmacy@FreeBSD.org> Co-authored-by: Brian Behlendorf <behlendorf@llnl.gov> Closes #10018	2024-09-14 13:47:59 -07:00
Rob Norris	ba2209ec9e	abd_get_from_buf_struct: wrap existing buf with ABD stored on stack This allows a simple "wrapping" ABD for an existing linear buffer to be allocated on the stack, avoiding an allocation. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com>	2024-08-22 16:22:24 -07:00
Rob Norris	b3f4e4e1ec	abd: remove ABD_FLAG_ZEROS Nothing ever checks it. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16253	2024-08-21 13:36:24 -07:00
Rob Norris	b0bf14cdb5	abd: lift ABD zero scan from zio_compress_data() to abd_cmp_zero() It's now the caller's responsibility do special handling for holes if that's something it wants. This also makes zio_compress_data() and zio_decompress_data() properly the inverse of each other. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Jason Lee <jasonlee@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16326	2024-08-09 14:30:26 -07:00
Rob Norris	390b448726	abd: add page iterator The regular ABD iterators yield data buffers, so they have to map and unmap pages into kernel memory. If the caller only wants to count chunks, or can use page pointers directly, then the map/unmap is just unnecessary overhead. This adds adb_iterate_page_func, which yields unmapped struct page instead. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #15533 Closes #15588	2024-03-25 16:50:35 -07:00
Mark Johnston	f4cd1bac72	Make abd_raidz_gen_iterate() pass an initialized pointer to the callback Otherwise callbacks may trigger KMSAN violations in the dlen == 0 case. For example, raidz_syn_pq_abd() will compare an uninitialized pointer with itself before returning. This seems harmless, but let's maintain good hygiene and avoid passing uninitialized variables, if only to placate KMSAN. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #15491	2023-11-07 10:24:15 -08:00
Alexander Motin	05a7348a7e	RAIDZ: Use cache blocking during parity math RAIDZ parity is calculated by adding data one column at a time. It works OK for small blocks, but for large blocks results of previous addition may already be evicted from CPU caches to main memory, and in addition to extra memory write require extra read to get it back. This patch splits large parity operations into 64KB chunks, that should in most cases fit into CPU L2 caches from the last decade. I haven't touched more complicated cases of data reconstruction to not over complicate the code. Those should be relatively rare. My tests on Xeon Gold 6242R CPU with 1MB of L2 cache per core show up to 10/20% memory traffic reduction when writing to 4-wide RAIDZ/ RAIDZ2 blocks of ~4MB and up. Older CPUs with 256KB of L2 cache should see the effect even on smaller blocks. Wider vdevs may need bigger blocks to be affected. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15448	2023-10-30 14:54:27 -07:00
Alexander Motin	e007908a16	ABD: Be more assertive in iterators Once we verified the ABDs and asserted the sizes we should never see premature ABDs ends. Assert that and remove extra branches from production builds. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15428	2023-10-24 14:33:58 -07:00
Alexander Motin	190290a9ac	Fix two abd_gang_add_gang() issues. - There is no reason to assert that added gang is not empty. It may be weird to add an empty gang, but it is legal. - When moving chain list from the added gang clear its size, or it will trigger assertion in abd_verify() when that gang is freed. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14816	2023-05-05 09:17:55 -07:00
Alexander Motin	bba7cbf0a4	Fix positive ABD size assertion in abd_verify(). Gang ABDs without childred are legal, and they do have zero size. For other ABD types zero size doesn't have much sense and likely not working correctly now. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14795	2023-04-26 09:20:43 -07:00
Richard Yao	d692e6c36e	abd_return_buf() should call zfs_refcount_remove_many() early Calling zfs_refcount_remove_many() after freeing memory means we pass a reference to freed memory as the holder. This is not believed to be able to cause a problem, but there is a bit of a tradition of fixing these issues when they appear so that they do not obscure more serious issues in static analyzer output, so we fix this one too. Clang's static analyzer found this with the help of CodeChecker's CTU analysis. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14043	2022-10-19 17:11:01 -07:00
Tino Reichardt	1d3ba0bf01	Replace dead opensolaris.org license link The commit replaces all findings of the link: http://www.opensolaris.org/os/licensing with this one: https://opensource.org/licenses/CDDL-1.0 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Closes #13619	2022-07-11 14:16:13 -07:00
Jorgen Lundman	9a70e97fe1	Rename fallthrough to zfs_fallthrough Unfortunately macOS has obj-C keyword "fallthrough" in the OS headers. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #13097	2022-02-15 08:58:59 -08:00
наб	14e4e3cb9f	module: zfs: fix unused, remove argsused Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12844	2021-12-23 09:42:47 -08:00
Brian Behlendorf	6954c22f35	Use fallthrough macro As of the Linux 5.9 kernel a fallthrough macro has been added which should be used to anotate all intentional fallthrough paths. Once all of the kernel code paths have been updated to use fallthrough the -Wimplicit-fallthrough option will because the default. To avoid warnings in the OpenZFS code base when this happens apply the fallthrough macro. Additional reading: https://lwn.net/Articles/794944/ Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12441	2021-09-14 10:17:54 -06:00
Alexander Motin	7eebcd2be6	Avoid small buffer copying on write It is wrong for arc_write_ready() to use zfs_abd_scatter_enabled to decide whether to reallocate/copy the buffer, because the answer is OS-specific and depends on the buffer size. Instead of that use abd_size_alloc_linear(), moved into public header. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12425	2021-07-27 16:05:47 -07:00
Jorgen Lundman	c6d1112bf4	Fix abd leak, kmem_free correct size of abd_t Fix a leak of abd_t that manifested mostly when using raidzN with at least as many columns as N (e.g. a four-disk raidz2 but not a three-disk raidz2). Sufficiently heavy raidz use would eventually run a system out of memory. Additionally: * Switch abd_cache arena to FIRSTFIT, which empirically improves perofrmance. * Make abd_chunk_cache more performant and debuggable. * Allocate the abd_zero_buf from abd_chunk_cache rather than the heap. * Don't try to reap non-existent qcaches in abd_cache arena. * KM_PUSHPAGE->KM_SLEEP when allocating chunks from their own arena Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Co-authored-by: Sean Doran <smd@use.net> Closes #12295	2021-07-01 09:28:15 -06:00
Alexander Motin	5e2c8338bf	Help compiller optimize out abd_verify() While abd_verify() does nothing when built without debug, compiler can't optimize it out by itself due to calls to external list_*() and abd_verify_scatter(). This commit makes it explicit. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Adam Moss <c@yotes.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12280	2021-06-25 16:38:31 -07:00
Andrea Gelmini	bf169e9f15	Fix various typos Correct an assortment of typos throughout the code base. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Closes #11774	2021-04-02 18:52:15 -07:00
Jorgen Lundman	8a6d444825	Fix abd_get_offset_struct() may allocate new abd Even when supplied with an abd to abd_get_offset_struct(), the call to abd_get_offset_impl() can allocate a different abd. Ensure to call abd_fini_struct() on the abd that is not used. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #11683	2021-03-05 12:22:57 -08:00
Matthew Ahrens	2d4bbd14fc	The abd child/parent relationship does not need to be tracked ABD's currently track their parent/child relationship. This applies to `abd_get_offset()` and `abd_borrow_buf()`. However, nothing depends on knowing this relationship, it's only used for consistency checks to verify that we are not destroying an ABD that's still in use. When we are creating/destroying ABD's frequently, the performance impact of maintaining these data structures (in particular the atomic increment/decrement operations) can be measurable. This commit removes this verification code on production builds, but keeps it when ZFS_DEBUG is set. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #11535	2021-01-30 10:04:42 -08:00
Brian Atkinson	2993698eb3	Fixing gang ABD when adding another gang I originally applied a fix in #11539 to fix a parent's child references when a gang ABD is free'd. However, I did not take into account abd_gang_add_gang(). We still need to make sure to update the child references in this function as well. In order to resolve this I removed decreasing the gang ABD's size in abd_free_gang() as well as moved back the original placeent of zfs_refcount_remove_many() in abd_free(). Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Closes #11542	2021-01-28 16:54:12 -08:00
Brian Atkinson	416015ef54	Removing ABD Parent Child Reference Before Freeing ABD Moving the call to zfs_refcount_remove_many() in abd_free() to be called before any of the ABD free variants are called. This is necessary because abd_free_gang() adjusts the abd_size for the gang ABD. If the parent's child references are removed after free'ing the gang ABD the refcount is not adjusted correctly for the parent's children. I also removed some stray abd_put() in comments and changed abd_free_gang_abd() -> abd_free_gang(). Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Closes #11539	2021-01-28 09:15:17 -08:00
Colm	4a90d4d6fc	Fix two minor lint errors (cppcheck) Fix two minor errors reported by cppcheck: In module/zfs/abd.c (abd_get_offset_impl), add non-NULL assertion to prevent NULL dereference warning. In module/zfs/arc.c (l2arc_write_buffers), change 'try' variable to 'pass' to avoid C++ reserved word. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Colm Buckley <colm@tuatha.org> Closes #11507	2021-01-23 15:49:32 -08:00
Matthew Ahrens	e2af2acce3	allow callers to allocate and provide the abd_t struct The `abd_get_offset_()` routines create an abd_t that references another abd_t, and doesn't allocate any pages/buffers of its own. In some workloads, these routines may be called frequently, to create many abd_t's representing small pieces of a single large abd_t. In particular, the upcoming RAIDZ Expansion project makes heavy use of these routines. This commit adds the ability for the caller to allocate and provide the abd_t struct to a variant of `abd_get_offset_()`. This eliminates the cost of allocating the abd_t and performing the accounting associated with it (`abdstat_struct_size`). The RAIDZ/DRAID code uses this for the `rc_abd`, which references the zio's abd. The upcoming RAIDZ Expansion project will leverage this infrastructure to increase performance of reads post-expansion by around 50%. Additionally, some of the interfaces around creating and destroying abd_t's are cleaned up. Most significantly, the distinction between `abd_put()` and `abd_free()` is eliminated; all types of abd_t's are now disposed of with `abd_free()`. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Issue #8853 Closes #11439	2021-01-20 11:24:37 -08:00
Brian Behlendorf	b2255edcc0	Distributed Spare (dRAID) Feature This patch adds a new top-level vdev type called dRAID, which stands for Distributed parity RAID. This pool configuration allows all dRAID vdevs to participate when rebuilding to a distributed hot spare device. This can substantially reduce the total time required to restore full parity to pool with a failed device. A dRAID pool can be created using the new top-level `draid` type. Like `raidz`, the desired redundancy is specified after the type: `draid[1,2,3]`. No additional information is required to create the pool and reasonable default values will be chosen based on the number of child vdevs in the dRAID vdev. zpool create <pool> draid[1,2,3] <vdevs...> Unlike raidz, additional optional dRAID configuration values can be provided as part of the draid type as colon separated values. This allows administrators to fully specify a layout for either performance or capacity reasons. The supported options include: zpool create <pool> \ draid[<parity>][:<data>d][:<children>c][:<spares>s] \ <vdevs...> - draid[parity] - Parity level (default 1) - draid[:<data>d] - Data devices per group (default 8) - draid[:<children>c] - Expected number of child vdevs - draid[:<spares>s] - Distributed hot spares (default 0) Abbreviated example `zpool status` output for a 68 disk dRAID pool with two distributed spares using special allocation classes. ``` pool: tank state: ONLINE config: NAME STATE READ WRITE CKSUM slag7 ONLINE 0 0 0 draid2:8d:68c:2s-0 ONLINE 0 0 0 L0 ONLINE 0 0 0 L1 ONLINE 0 0 0 ... U25 ONLINE 0 0 0 U26 ONLINE 0 0 0 spare-53 ONLINE 0 0 0 U27 ONLINE 0 0 0 draid2-0-0 ONLINE 0 0 0 U28 ONLINE 0 0 0 U29 ONLINE 0 0 0 ... U42 ONLINE 0 0 0 U43 ONLINE 0 0 0 special mirror-1 ONLINE 0 0 0 L5 ONLINE 0 0 0 U5 ONLINE 0 0 0 mirror-2 ONLINE 0 0 0 L6 ONLINE 0 0 0 U6 ONLINE 0 0 0 spares draid2-0-0 INUSE currently in use draid2-0-1 AVAIL ``` When adding test coverage for the new dRAID vdev type the following options were added to the ztest command. These options are leverages by zloop.sh to test a wide range of dRAID configurations. -K draid\|raidz\|random - kind of RAID to test -D <value> - dRAID data drives per group -S <value> - dRAID distributed hot spares -R <value> - RAID parity (raidz or dRAID) The zpool_create, zpool_import, redundancy, replacement and fault test groups have all been updated provide test coverage for the dRAID feature. Co-authored-by: Isaac Huang <he.huang@intel.com> Co-authored-by: Mark Maybee <mmaybee@cray.com> Co-authored-by: Don Brady <don.brady@delphix.com> Co-authored-by: Matthew Ahrens <mahrens@delphix.com> Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mmaybee@cray.com> Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #10102	2020-11-13 13:51:51 -08:00
Brian Atkinson	6fba7bfd0e	Add gang ABD child to parent gang ABD By design a gang ABD can not have another gang ABD as a child. This is to make sure the logical offset in a gang ABD is consistent with the individual ABDS it contains as children. If a gang ABD is added as a child of a gang ABD we will add the individual children of the gang ABD to the parent gang ABD. This allows for a consistent view of offsets within the parent gang ABD. Reviewed-by: Mark Maybee <mmaybee@cray.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Closes #10430	2020-07-24 21:09:20 -07:00
Brian Atkinson	e4d3d77684	Fixing gang ABD child removal race condition On linux the list debug code has been setting off a failure when checking that the node->next->prev value is pointing back at the node. At times this check evaluates to 0xdead. When removing a child from a gang ABD we must acquire the child's abd_mtx to make sure that the same ABD is not being added to another gang ABD while it is being removed from a gang ABD. This fixes a race condition when checking if an ABDs link is already active and part of another gang ABD before adding it to a gang. Added additional debug code for the gang ABD in abd_verify() to make sure each child ABD has active links. Also check to make sure another gang ABD is not added to a gang ABD. Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <matt@delphix.com> Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Closes #10511	2020-07-14 11:04:35 -07:00
Andrea Gelmini	dd4bc569b9	Fix typos Correct various typos in the comments and tests. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Closes #10423	2020-06-09 21:24:09 -07:00
Brian Atkinson	fb822260b1	Gang ABD Type Adding the gang ABD type, which allows for linear and scatter ABDs to be chained together into a single ABD. This can be used to avoid doing memory copies to/from ABDs. An example of this can be found in vdev_queue.c in the vdev_queue_aggregate() function. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Brian <bwa@clemson.edu> Co-authored-by: Mark Maybee <mmaybee@cray.com> Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Closes #10069	2020-05-20 18:06:09 -07:00
Brian Behlendorf	2ade659eb4	Fix abd_enter/exit_critical wrappers Commit `fc551d7` introduced the wrappers abd_enter_critical() and abd_exit_critical() to mark critical sections. On Linux these are implemented with the local_irq_save() and local_irq_restore() macros which set the 'flags' argument when saving. By wrapping them with a function the local variable is no longer set by the macro and is no longer properly restored. Convert abd_enter_critical() and abd_exit_critical() to macros to resolve this issue and ensure the flags are properly restored. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #10332	2020-05-14 20:45:16 -07:00
Brian Atkinson	fc551d7efb	Combine OS-independent ABD Code into Common Source File Reorganizing ABD code base so OS-independent ABD code has been placed into a common abd.c file. OS-dependent ABD code has been left in each OS's ABD source files, and these source files have been renamed to abd_os. The OS-independent ABD code is now under: module/zfs/abd.c With the OS-dependent code in: module/os/linux/zfs/abd_os.c module/os/freebsd/zfs/abd_os.c Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Closes #10293	2020-05-10 12:23:52 -07:00
Matthew Macy	bced7e3aaa	OpenZFS restructuring - move platform specific sources Move platform specific Linux source under module/os/linux/ and update the build system accordingly. Additional code restructuring will follow to make the common code fully portable. Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Macy <mmacy@FreeBSD.org> Closes #9206	2019-09-06 11:26:26 -07:00
Don Brady	28c91ab66d	Tag ABD pages for exclusion in kernel crash dumps Tag the ABD data pages so that they can be identified for exclusion from kernel crash dumps. Eliminating the zfs file data allows for significantly smaller crash dump files. Note that ZFS in illumos has always excluded the zfs data pages from a kernel crash dump. This change tags ARC scatter data pages so they can be identified from the makedumpfile(8) command. That command is used to create smaller dump files by ignoring some memory regions and using compression. It already filters file data from the VFS page cache and will now be able to exclude ZFS file data pages from the dump file. A corresponding change to makeumpfile(8) is required to identify ZFS data pages. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: Don Brady <don.brady@delphix.com> Closes #8899	2019-08-28 10:44:46 -07:00
Tony Hutter	a9ebdfdd43	Linux 5.3: Fix switch() fall though compiler errors Fix some switch() fall-though compiler errors: abd.c:1504:9: error: this statement may fall through Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #9170	2019-08-21 09:29:23 -07:00
Matthew Ahrens	5662fd5794	single-chunk scatter ABDs can be treated as linear Scatter ABD's are allocated from a number of pages. In contrast to linear ABD's, these pages are disjoint in the kernel's virtual address space, so they can't be accessed as a contiguous buffer. Therefore routines that need a linear buffer (e.g. abd_borrow_buf() and friends) must allocate a separate linear buffer (with zio_buf_alloc()), and copy the contents of the pages to/from the linear buffer. This can have a measurable performance overhead on some workloads. https://github.com/zfsonlinux/zfs/commit/87c25d567fb7969b44c7d8af63990e ("abd_alloc should use scatter for >1K allocations") increased the use of scatter ABD's, specifically switching 1.5K through 4K (inclusive) buffers from linear to scatter. For workloads that access blocks whose compressed sizes are in this range, that commit introduced an additional copy into the read code path. For example, the sequential_reads_arc_cached tests in the test suite were reduced by around 5% (this is doing reads of 8K-logical blocks, compressed to 3K, which are cached in the ARC). This commit treats single-chunk scattered buffers as linear buffers, because they are contiguous in the kernel's virtual address space. All single-page (4K) ABD's can be represented this way. Some multi-page ABD's can also be represented this way, if we were able to allocate a single "chunk" (higher-order "page" which represents a power-of-2 series of physically-contiguous pages). This is often the case for 2-page (8K) ABD's. Representing a single-entry scatter ABD as a linear ABD has the performance advantage of avoiding the copy (and allocation) in abd_borrow_buf_copy / abd_return_buf_copy. A performance increase of around 5% has been observed for ARC-cached reads (of small blocks which can take advantage of this), fixing the regression introduced by `87c25d567`. Note that this optimization is only possible because all physical memory is always mapped into the kernel's address space. This is not the case for HIGHMEM pages, so the optimization can not be made on 32-bit systems. Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #8580	2019-06-11 09:02:31 -07:00
Matthew Ahrens	87c25d567f	abd_alloc should use scatter for >1K allocations abd_alloc() normally does scatter allocations, thus solving the problem that ABD originally set out to: the bulk of ZFS's allocations are single pages, which are faster to allocate and free, and don't suffer from internal fragmentation (and the inability to reclaim memory because some buffers in the slab are still allocated). However, the current code does linear allocations for 4KB and smaller allocations, defeating the purpose of ABD. Scatter ABD's use at least one page each, so sub-page allocations waste some space when allocated as scatter (e.g. 2KB scatter allocation wastes half of each page). Using linear ABD's for small allocations means that they will be put on slabs which contain many allocations. This can improve memory efficiency, but it also makes it much harder for ARC evictions to actually free pages, because all the buffers on one slab need to be freed in order for the slab (and underlying pages) to be freed. Typically, 512B and 1KB kmem caches have 16 buffers per slab, so it's possible for them to actually waste more memory than scatter (one page per buf = wasting 3/4 or 7/8th; one buf per slab = wasting 15/16th). Spill blocks are typically 512B and are heavily used on systems running selinux with the default dnode size and the `xattr=sa` property set. By default we will use linear allocations for 512B and 1KB, and scatter allocations for larger (1.5KB and up). Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: DHE <git@dehacked.net> Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #8455	2019-02-28 17:52:55 -08:00
Tim Schumacher	424fd7c3e0	Prefix all refcount functions with zfs_ Recent changes in the Linux kernel made it necessary to prefix the refcount_add() function with zfs_ due to a name collision. To bring the other functions in line with that and to avoid future collisions, prefix the other refcount functions as well. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Schumacher <timschumi@gmx.de> Closes #7963	2018-10-01 10:42:05 -07:00
Brian Behlendorf	93ce2b4ca5	Update build system and packaging Minimal changes required to integrate the SPL sources in to the ZFS repository build infrastructure and packaging. Build system and packaging: * Renamed SPL_* autoconf m4 macros to ZFS_. Removed redundant SPL_* autoconf m4 macros. * Updated the RPM spec files to remove SPL package dependency. * The zfs package obsoletes the spl package, and the zfs-kmod package obsoletes the spl-kmod package. * The zfs-kmod-devel* packages were updated to add compatibility symlinks under /usr/src/spl-x.y.z until all dependent packages can be updated. They will be removed in a future release. * Updated copy-builtin script for in-kernel builds. * Updated DKMS package to include the spl.ko. * Updated stale AUTHORS file to include all contributors. * Updated stale COPYRIGHT and included the SPL as an exception. * Renamed README.markdown to README.md * Renamed OPENSOLARIS.LICENSE to LICENSE. * Renamed DISCLAIMER to NOTICE. Required code changes: * Removed redundant HAVE_SPL macro. * Removed _BOOT from nvpairs since it doesn't apply for Linux. * Initial header cleanup (removal of empty headers, refactoring). * Remove SPL repository clone/build from zimport.sh. * Use of DEFINE_RATELIMIT_STATE and DEFINE_SPINLOCK removed due to build issues when forcing C99 compilation. * Replaced legacy ACCESS_ONCE with READ_ONCE. * Include needed headers for `current` and `EXPORT_SYMBOL`. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> TEST_ZIMPORT_SKIP="yes" Closes #7556	2018-05-29 16:00:33 -07:00
Brian Behlendorf	94183a9d8a	Update for cppcheck v1.80 Resolve new warnings and errors from cppcheck v1.80. * [lib/libshare/libshare.c:543]: (warning) Possible null pointer dereference: protocol * [lib/libzfs/libzfs_dataset.c:2323]: (warning) Possible null pointer dereference: srctype * [lib/libzfs/libzfs_import.c:318]: (error) Uninitialized variable: link * [module/zfs/abd.c:353]: (error) Uninitialized variable: sg * [module/zfs/abd.c:353]: (error) Uninitialized variable: i * [module/zfs/abd.c:385]: (error) Uninitialized variable: sg * [module/zfs/abd.c:385]: (error) Uninitialized variable: i * [module/zfs/abd.c:553]: (error) Uninitialized variable: i * [module/zfs/abd.c:553]: (error) Uninitialized variable: sg * [module/zfs/abd.c:763]: (error) Uninitialized variable: i * [module/zfs/abd.c:763]: (error) Uninitialized variable: sg * [module/zfs/abd.c:305]: (error) Uninitialized variable: tmp_page * [module/zfs/zpl_xattr.c:342]: (warning) Possible null pointer dereference: value * [module/zfs/zvol.c:208]: (error) Uninitialized variable: p Convert the following suppression to inline. * [module/zfs/zfs_vnops.c:840]: (error) Possible null pointer dereference: aiov Exclude HAVE_UIO_ZEROCOPY and HAVE_DNLC from analysis since these macro's will never be defined until this functionality is implemented. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6879	2017-11-18 14:08:00 -08:00
Don Brady	1c27024e22	Undo c89 workarounds to match with upstream With PR 5756 the zfs module now supports c99 and the remaining past c89 workarounds can be undone. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Don Brady <don.brady@delphix.com> Closes #6816	2017-11-04 13:25:13 -07:00
Brian Behlendorf	57f4ef2e81	Fix abdstats kstat on 32-bit systems When decrementing the struct_size and scatter_chunk_waste kstats the value needs to be cast to an int on 32-bit systems. Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6721	2017-10-06 11:23:12 -07:00
jxiong	2b91b5119c	minor improvement to abd_free_pages() It doesn't need to have a loop to free page in a single scatterlist entry because it should be single or compound page. The pages can be freed in one invocation to __free_pages() for both cases. Reviewed-by: Gvozden Neskovic <neskovic@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Jinshan Xiong <jinshan.xiong@gmail.com> Closes #6057	2017-05-02 10:06:18 -07:00
Gvozden Neskovic	84c07adadb	Remove dependency on linear ABD Wherever possible it's best to avoid depending on a linear ABD. Update the code accordingly in the following areas. - vdev_raidz - zio, zio_checksum - zfs_fm - change abd_alloc_for_io() to use abd_alloc() Reviewed-by: David Quigley <david.quigley@intel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Closes #5668	2017-03-29 12:24:51 -07:00
George Melikov	4ea3f86426	codebase style improvements for OpenZFS 6459 port	2017-01-22 13:25:40 -08:00
Brian Behlendorf	02730c333c	Use cstyle -cpP in `make cstyle` check Enable picky cstyle checks and resolve the new warnings. The vast majority of the changes needed were to handle minor issues with whitespace formatting. This patch contains no functional changes. Non-whitespace changes are as follows: * 8 times ; to { } in for/while loop * fix missing ; in cmd/zed/agents/zfs_diagnosis.c * comment (confim -> confirm) * change endline , to ; in cmd/zpool/zpool_main.c * a number of /* BEGIN CSTYLED / / END CSTYLED / blocks /* CSTYLED / markers change == 0 to ! * ulong to unsigned long in module/zfs/dsl_scan.c * rearrangement of module_param lines in module/zfs/metaslab.c * add { } block around statement after for_each_online_node Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Håkan Johansson <f96hajo@chalmers.se> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5465	2016-12-12 10:46:26 -08:00
luozhengzheng	c077090a9b	Fix coverity defects: CID 154617 CID 154617: Memory - illegal accesses (UNINIT) The value here just needs to be initialized to make Coverity happy. When dsize == 0, then value of daiter.iter_mapaddr is irrelevant. That address won't be accessed, it's only used for some arithmetic. dsize can be zero either if dabd is null, or if code column is longer than the current data column. Reviewed-by: Gvozden Neskovic <neskovic@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn> Closes #5437	2016-12-08 14:48:09 -07:00
luozhengzheng	ba712624d6	Fix incorrect operator in abd_alloc_sametype() This should be & and not \| so is_metadata is set correctly. Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn> Closes #5438	2016-12-01 16:45:16 -07:00
Chunwei Chen	9829574834	ABD optimized page allocation code * Convert ABD to use the Linux Kernel scatterlist implementation instead of the hand rolled one from illumos. * Scatter ABDs are preferentially populated with higher order compound pages from a single zone. Allocation size is progressively decreased until it can be satisfied without performing reclaim or compaction. * An alternate page allocator is provided for kernels older than 3.6 and for CONFIG_HIGHMEM systems. This allocator is designed as a fallback for maximum compatibility. * Extended abdstats to provide visibility in the the allocator. * Add cached value for PAGESIZE in userspace. Contributions-by: Chunwei Chen <david.chen@osnexus.com> Gvozden Neskovic <neskovic@gmail.com> Jinshan Xiong <jinshan.xiong@intel.com> Isaac Huang <he.huang@intel.com> David Quigley <david.quigley@intel.com> Brian Behlendorf <behlendorf1@llnl.gov>	2016-11-29 14:34:33 -08:00
Chunwei Chen	4f60152910	ABD kmap to kmap_atomic Convert usage of kmap to kmap_atomic while correctly saving off irq state.	2016-11-29 14:34:33 -08:00

1 2

53 Commits