mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-04-11 22:21:48 +03:00

Author	SHA1	Message	Date
Brian Behlendorf	bfe27ace0d	Fix unused variable warnings Resolved unused variable warnings observed after restricting -Wno-unused-but-set-variable to only libzfs and libzpool. Reviewed-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6941	2018-01-09 12:28:03 -08:00
Brian Behlendorf	06401e4222	Fix ztest_verify_dnode_bt() test case In ztest_verify_dnode_bt the ztest_object_lock must be held in order to safely verify the unused bonus space. Reviewed-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6941	2018-01-09 12:27:12 -08:00
DHE	460f239e69	Fix -fsanitize=address memory leak kmem_alloc(0, ...) in userspace returns a leakable pointer. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: DHE <git@dehacked.net> Issue #6941	2018-01-09 12:26:25 -08:00
George Amanakis	be54a13c3e	Fix percentage styling in zfs-module-parameters.5 Replace "percent" with "%", add bold to default values. Reviewed-by: bunder2015 <omfgbunder@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #7018	2018-01-09 11:51:11 -08:00
Brian Behlendorf	b02becaa00	Reduce codecov PR comments Attempt to reduce the number of comments posted by codecov to PR requests. Based on the codecov documenation setting "require_changes=yes" and "behavior=once" should result in a single comment under most circumstances. https://docs.codecov.io/v4.3.6/docs/pull-request-comments Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #7022 Closes #7025	2018-01-09 11:15:55 -08:00
Nathaniel Wesley Filardo	8b20a9f996	zhack: fix getopt return type This fixes zhack's command processing on ARM. On ARM char is unsigned, and so, in promotion to an int, it will never compare equal to -1. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Nathaniel Wesley Filardo <nwf@cs.jhu.edu> Closes #7016	2018-01-09 11:14:45 -08:00
Brian Behlendorf	0873bb6337	Fix ARC hit rate When the compressed ARC feature was added in commit `d3c2ae1` the method of reference counting in the ARC was modified. As part of this accounting change the arc_buf_add_ref() function was removed entirely. This would have be fine but the arc_buf_add_ref() function served a second undocumented purpose of updating the ARC access information when taking a hold on a dbuf. Without this logic in place a cached dbuf would not migrate its associated arc_buf_hdr_t to the MFU list. This would negatively impact the ARC hit rate, particularly on systems with a small ARC. This change reinstates the missing call to arc_access() from dbuf_hold() by implementing a new arc_buf_access() function. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6171 Closes #6852 Closes #6989	2018-01-08 09:52:36 -08:00
LOLi	390d679acd	Fix 'zpool add' handling of nested interior VDEVs When replacing a faulted device which was previously handled by a spare multiple levels of nested interior VDEVs will be present in the pool configuration; the following example illustrates one of the possible situations: NAME STATE READ WRITE CKSUM testpool DEGRADED 0 0 0 raidz1-0 DEGRADED 0 0 0 spare-0 DEGRADED 0 0 0 replacing-0 DEGRADED 0 0 0 /var/tmp/fault-dev UNAVAIL 0 0 0 cannot open /var/tmp/replace-dev ONLINE 0 0 0 /var/tmp/spare-dev1 ONLINE 0 0 0 /var/tmp/safe-dev ONLINE 0 0 0 spares /var/tmp/spare-dev1 INUSE currently in use This is safe and allowed, but get_replication() needs to handle this situation gracefully to let zpool add new devices to the pool. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #6678 Closes #6996	2017-12-28 10:15:32 -08:00
Prakash Surya	2fe61a7ecc	OpenZFS 8909 - 8585 can cause a use-after-free kernel panic Authored by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: John Kennedy <jwk404@gmail.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Igor Kozhukhov <igor@dilos.org> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Prakash Surya <prakash.surya@delphix.com> PROBLEM ======= There's a race condition that exists if `zil_free_lwb` races with either `zil_commit_waiter_timeout` and/or `zil_lwb_flush_vdevs_done`. Here's an example panic due to this bug: > ::status debugging crash dump vmcore.0 (64-bit) from ip-10-110-205-40 operating system: 5.11 dlpx-5.2.2.0_2017-12-04-17-28-32b6ba51fb (i86pc) image uuid: 4af0edfb-e58e-6ed8-cafc-d3e9167c7513 panic message: BAD TRAP: type=e (#pf Page fault) rp=ffffff0010555970 addr=60 occurred in module "zfs" due to a NULL pointer dereference dump content: kernel pages only > $c zio_shrink+0x12() zil_lwb_write_issue+0x30d(ffffff03dcd15cc0, ffffff03e0730e20) zil_commit_waiter_timeout+0xa2(ffffff03dcd15cc0, ffffff03d97ffcf8) zil_commit_waiter+0xf3(ffffff03dcd15cc0, ffffff03d97ffcf8) zil_commit+0x80(ffffff03dcd15cc0, 9a9) zfs_write+0xc34(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0) fop_write+0x5b(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0) write+0x250(42, fffffd7ff4832000, 2000) sys_syscall+0x177() If there's an outstanding lwb that's in `zil_commit_waiter_timeout` waiting to timeout, waiting on it's waiter's CV, we must be sure not to call `zil_free_lwb`. If we end up calling `zil_free_lwb`, then that LWB may be freed and can result in a use-after-free situation where the stale lwb pointer stored in the `zil_commit_waiter_t` structure of the thread waiting on the waiter's CV is used. A similar situation can occur if an lwb is issued to disk, and thus in the `LWB_STATE_ISSUED` state, and `zil_free_lwb` is called while the disk is servicing that lwb. In this situation, the lwb will be freed by `zil_free_lwb`, which will result in a use-after-free situation when the lwb's zio completes, and `zil_lwb_flush_vdevs_done` is called. This race condition is prevented in `zil_close` by calling `zil_commit` before `zil_free_lwb` is called, which will ensure all outstanding (i.e. all lwb's in the `LWB_STATE_OPEN` and/or `LWB_STATE_ISSUED` states) reach the `LWB_STATE_DONE` state before the lwb's are freed (`zil_commit` will not return untill all the lwb's are `LWB_STATE_DONE`). Further, this race condition is prevented in `zil_sync` by only calling `zil_free_lwb` for lwb's that do not have their `lwb_buf` pointer set. All lwb's not in the `LWB_STATE_DONE` state will have a non-null value for this pointer; the pointer is only cleared in `zil_lwb_flush_vdevs_done`, at which point the lwb's state will be changed to `LWB_STATE_DONE`. This race is present in `zil_suspend`, leading to this bug. At first glance, it would appear as though this would not be true because `zil_suspend` will call `zil_commit`, just like `zil_close`, but the problem is that `zil_suspend` will set the zilog's `zl_suspend` field prior to calling `zil_commit`. Further, in `zil_commit`, if `zl_suspend` is set, `zil_commit` will take a special branch of logic and use `txg_wait_synced` instead of performing the normal `zil_commit` logic. This call to `txg_wait_synced` might be good enough for the data to reach disk safely before it returns, but it does not ensure that all outstanding lwb's reach the `LWB_STATE_DONE` state before it returns. This is because, if there's an lwb "stuck" in `zil_commit_waiter_timeout`, waiting for it's lwb to timeout, it will maintain a non-null value for it's `lwb_buf` field and thus `zil_sync` will not free that lwb. Thus, even though the lwb's data is already on disk, the lwb will be left lingering, waiting on the CV, and will eventually timeout and be issued to disk even though the write is unnecessary. So, after `zil_commit` is called from `zil_suspend`, we incorrectly assume that there are not outstanding lwb's, and proceed to free all lwb's found on the zilog's lwb list. As a result, we free the lwb that will later be used `zil_commit_waiter_timeout`. SOLUTION ======== The solution to this, is to ensure all outstanding lwb's complete before calling `zil_free_lwb` via `zil_destroy` in `zil_suspend`. This patch accomplishes this goal by forcing the normal `zil_commit` logic when called from `zil_sync`. Now, `zil_suspend` will call `zil_commit_impl` which will always use the normal logic of waiting/issuing lwb's to disk before it returns. As a result, any lwb's outstanding when `zil_commit_impl` is called will be guaranteed to reach the `LWB_STATE_DONE` state by the time it returns. Further, no new lwb's will be created via `zil_commit` since the zilog's `zl_suspend` flag will be set. This will force all new callers of `zil_commit` to use `txg_wait_synced` instead of creating and issuing new lwb's. Thus, all lwb's left on the zilog's lwb list when `zil_destroy` is called will be in the `LWB_STATE_DONE` state, and we'll avoid this race condition. OpenZFS-issue: https://www.illumos.org/issues/8909 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ece62b6f8d Closes #6940	2017-12-28 10:18:04 -08:00
lidongyang	823d48bfb1	Call commit callbacks from the tail of the list Our zfs backed Lustre MDT had soft lockups while under heavy metadata workloads while handling transaction callbacks from osd_zfs. The problem is zfs is not taking advantage of the fast path in Lustre's trans callback handling, where Lustre will skip the calls to ptlrpc_commit_replies() when it already saw a higher transaction number. This patch corrects this, it also has a positive impact on metadata performance on Lustre with osd_zfs, plus some cleanup in the headers. A similar issue for ext4/ldiskfs is described on: https://jira.hpdd.intel.com/browse/LU-6527 Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Li Dongyang <dongyang.li@anu.edu.au> Closes #6986	2017-12-22 10:19:51 -08:00
Tom Caputi	44b61ea506	Remove empty files accidentally added by `a8b2e306` This patch simply removes 2 empty files that were accidentally added a part of the scrub priority patch. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #6990	2017-12-22 10:17:48 -08:00
Tom Caputi	a8b2e30685	Support re-prioritizing asynchronous prefetches When sequential scrubs were merged, all calls to arc_read() (including prefetch IOs) were given ZIO_PRIORITY_ASYNC_READ. Unfortunately, this behaves badly with an existing issue where prefetch IOs cannot be re-prioritized after the issue. The result is that synchronous reads end up in the same vdev_queue as the scrub IOs and can have (in some workloads) multiple seconds of latency. This patch incorporates 2 changes. The first ensures that all scrub IOs are given ZIO_PRIORITY_SCRUB to allow the vdev_queue code to differentiate between these I/Os and user prefetches. Second, this patch introduces zio_change_priority() to provide the missing capability to upgrade a zio's priority. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #6921 Closes #6926	2017-12-21 09:13:06 -08:00
Simon Guest	993669a7bf	vdev_id: new slot type ses This extends vdev_id to support a new slot type, ses, for SCSI Enclosure Services. With slot type ses, the disk slot numbers are determined by using the device slot number reported by sg_ses for the device with matching SAS address, found by querying all available enclosures. This is primarily of use on systems with a deficient driver omitting support for bay_identifier in /sys/devices. In my testing, I found that the existing slot types of port and id were not stable across disk replacement, so an alternative was required. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Simon Guest <simon.guest@tesujimath.org> Closes #6956	2017-12-20 09:42:07 -08:00
Giuseppe Di Natale	89a66a0457	Handle broken pipes in arc_summary Using a command similar to 'arc_summary.py \| head' causes a broken pipe exception. Gracefully exit in the case of a broken pipe in arc_summary.py. Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov> Closes #6965 Closes #6969	2017-12-19 13:19:24 -08:00
LOLi	c4ba46dead	Handle invalid options in arc_summary If an invalid option is provided to arc_summary.py we handle any error thrown from the getopt Python module and print the usage help message. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #6983	2017-12-19 13:02:40 -08:00
Dominik Hassler	2e7c1bb35a	OpenZFS 8794 - cstyle generates warnings with recent perl Authored by: Dominik Hassler <hadfl@omniosce.org> Reviewed by: Andy Fiddaman <andy@omniosce.org> Reviewed by: Igor Kozhukhov <igor@dilos.org> Reviewed by: Toomas Soome <tsoome@me.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Dan McDonald <danmcd@joyent.com> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/8794 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/578f67364c Closes #6973	2017-12-19 12:54:08 -08:00
LOLi	c30e34faa1	ZTS: Fix create-o_ashift test case The function that fills the uberblock ring buffer on every device label has been reworked to avoid occasional failures caused by a race condition that prevents 'zpool sync' from writing some uberblock sequentially: this happens when the pool sync ioctl dispatch code calls txg_wait_synced() while we're already waiting for a TXG to sync. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #6924 Closes #6977	2017-12-19 10:49:33 -08:00
Brian Behlendorf	bbffb59efc	Fix multihost stale cache file import When the multihost property is enabled it should be impossible to import an active pool even using the force (-f) option. This patch prevents a forced import from succeeding when importing with a stale cache file. The root cause of the problem is that the kernel modules trusted the hostid provided in configuration. This is always correct when the configuration is generated by scanning for the pool. However, when using an existing cache file the hostid could be stale which would result in the activity check being skipped. Resolve the issue by always using the hostid read from the label configuration where the best uberblock was found. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6933 Closes #6971	2017-12-18 10:28:27 -08:00
LOLi	e2d936e0f8	Honor --with-mounthelperdir where applicable Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #6962	2017-12-17 14:14:07 -08:00
LOLi	ee410eefc2	Fix --with-systemd on Debian-based distributions (#6963 ) These changes propagate the "--with-systemd" configure option to the RPM spec file, allowing Debian-based distributions to package systemd-related files. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #6591 Closes #6963	2017-12-17 14:08:48 -08:00
Brian Behlendorf	516c09d0d5	Remove lib/libspl/include/sys/frame.h The functionality provided by this header is not required by any of the ZFS user space code. Minimal functionality was provided in commit `c28a677` which added include/sys/frame.h. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6960 Closes #6972	2017-12-17 14:02:29 -08:00
David Qian	f6940bb9ea	Enable QAT support in zfs-dkms RPM Enable QAT accelerated gzip compression in zfs-dkms RPM package when environment variant ICP_ROOT is set to QAT drive source code folder and QAT hardware presence. Otherwise, use default gzip compression. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: David Qian <david.qian@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6932	2017-12-14 12:44:27 -05:00
Lalufu	9920950ccb	Add zfs-import.target services in spec file Add missing zfs-import.target to list of systemd services in zfs RPM spec file. Reviewed-by: Niklas Wagner <Skaro@Skaronator.com> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ralf Ertzinger <ralf@skytale.net> Issue #6953 Closes #6955	2017-12-13 17:09:22 -08:00
LOLi	4e9b156960	Various ZED fixes * Teach ZED to handle spares usingi the configured ashift: if the zpool 'ashift' property is set then ZED should use its value when kicking in a hotspare; with this change 512e disks can be used as spares for VDEVs that were created with ashift=9, even if ZFS natively detects them as 4K block devices. * Introduce an additional auto_spare test case which verifies that in the face of multiple device failures an appropiate number of spares are kicked in. * Fix zed_stop() in "libtest.shlib" which did not correctly wait the target pid. * Fix ZED crashing on startup caused by a race condition in libzfs when used in multi-threaded context. * Convert ZED over to using the tpool library which is already present in the Illumos FMA code. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #2562 Closes #6858	2017-12-08 16:58:41 -08:00
Brian Behlendorf	3ab3166347	Disable vdev_zaps_004_pos Occasionally observed failure of vdev_zaps_004_pos due to the test case not being 100% reliable. In order to prevent false positives disable this test case until it can be made reliable. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #6935 Closes #6936	2017-12-07 16:43:59 -08:00
Brian Behlendorf	c28a67733c	Suppress incorrect objtool warnings Suppress incorrect warnings from versions of objtool which are not aware of x86 EVEX prefix instructions used for AVX512. module/zfs/vdev_raidz_math_avx512bw.o: warning: objtool: <func+offset>: can't find jump dest instruction at .text Reviewed-by: Don Brady <don.brady@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6928	2017-12-07 10:28:50 -08:00
Tony Hutter	674b89342e	Fix segfault in zpool iostat when adding VDEVs Fix a segfault when running 'zpool iostat -v 1' while adding a VDEV. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #6748 Closes #6872	2017-12-06 11:43:07 -08:00
Prakash Surya	1b2b0acab5	OpenZFS 8603 - rename zilog's "zl_writer_lock" to "zl_issuer_lock" This is a purely cosmetic change. The zilog's "zl_writer_lock" field is being renamed to "zl_issuer_lock" to try and make the code easier to understand; no other changes are made. Authored by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: C Fraire <cfraire@me.com> Approved by: Dan McDonald <danmcd@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/8603 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/2daf06546b Closes #6927	2017-12-06 11:38:10 -08:00
Brian Behlendorf	0c415a93d2	Disable create-o_ashift Occasionally observed failure of create-o_ashift due to the test case not being 100% reliable. In order to prevent false positives disable this test case until it can be made reliable. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #6924 Closes #6925	2017-12-06 10:13:54 -08:00
Prakash Surya	1ce23dcaff	OpenZFS 8585 - improve batching done in zil_commit() Authored by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Dan McDonald <danmcd@joyent.com> Ported-by: Prakash Surya <prakash.surya@delphix.com> Problem ======= The current implementation of zil_commit() can introduce significant latency, beyond what is inherent due to the latency of the underlying storage. The additional latency comes from two main problems: 1. When there's outstanding ZIL blocks being written (i.e. there's already a "writer thread" in progress), then any new calls to zil_commit() will block waiting for the currently oustanding ZIL blocks to complete. The blocks written for each "writer thread" is coined a "batch", and there can only ever be a single "batch" being written at a time. When a batch is being written, any new ZIL transactions will have to wait for the next batch to be written, which won't occur until the current batch finishes. As a result, the underlying storage may not be used as efficiently as possible. While "new" threads enter zil_commit() and are blocked waiting for the next batch, it's possible that the underlying storage isn't fully utilized by the current batch of ZIL blocks. In that case, it'd be better to allow these new threads to generate (and issue) a new ZIL block, such that it could be serviced by the underlying storage concurrently with the other ZIL blocks that are being serviced. 2. Any call to zil_commit() must wait for all ZIL blocks in its "batch" to complete, prior to zil_commit() returning. The size of any given batch is proportional to the number of ZIL transaction in the queue at the time that the batch starts processing the queue; which doesn't occur until the previous batch completes. Thus, if there's a lot of transactions in the queue, the batch could be composed of many ZIL blocks, and each call to zil_commit() will have to wait for all of these writes to complete (even if the thread calling zil_commit() only cared about one of the transactions in the batch). To further complicate the situation, these two issues result in the following side effect: 3. If a given batch takes longer to complete than normal, this results in larger batch sizes, which then take longer to complete and further drive up the latency of zil_commit(). This can occur for a number of reasons, including (but not limited to): transient changes in the workload, and storage latency irregularites. Solution ======== The solution attempted by this change has the following goals: 1. no on-disk changes; maintain current on-disk format. 2. modify the "batch size" to be equal to the "ZIL block size". 3. allow new batches to be generated and issued to disk, while there's already batches being serviced by the disk. 4. allow zil_commit() to wait for as few ZIL blocks as possible. 5. use as few ZIL blocks as possible, for the same amount of ZIL transactions, without introducing significant latency to any individual ZIL transaction. i.e. use fewer, but larger, ZIL blocks. In theory, with these goals met, the new allgorithm will allow the following improvements: 1. new ZIL blocks can be generated and issued, while there's already oustanding ZIL blocks being serviced by the storage. 2. the latency of zil_commit() should be proportional to the underlying storage latency, rather than the incoming synchronous workload. Porting Notes ============= Due to the changes made in commit `119a394ab0`, the lifetime of an itx structure differs than in OpenZFS. Specifically, the itx structure is kept around until the data associated with the itx is considered to be safe on disk; this is so that the itx's callback can be called after the data is committed to stable storage. Since OpenZFS doesn't have this itx callback mechanism, it's able to destroy the itx structure immediately after the itx is committed to an lwb (before the lwb is written to disk). To support this difference, and to ensure the itx's callbacks can still be called after the itx's data is on disk, a few changes had to be made: * A list of itxs was added to the lwb structure. This list contains all of the itxs that have been committed to the lwb, such that the callbacks for these itxs can be called from zil_lwb_flush_vdevs_done(), after the data for the itxs is committed to disk. * A list of itxs was added on the stack of the zil_process_commit_list() function; the "nolwb_itxs" list. In some circumstances, an itx may not be committed to an lwb (e.g. if allocating the "next" ZIL block on disk fails), so this list is used to keep track of which itxs fall into this state, such that their callbacks can be called after the ZIL's writer pipeline is "stalled". * The logic to actually call the itx's callback was moved into the zil_itx_destroy() function. Since all consumers of zil_itx_destroy() were effectively performing the same logic (i.e. if callback is non-null, call the callback), it seemed like useful code cleanup to consolidate this logic into a single function. Additionally, the existing Linux tracepoint infrastructure dealing with the ZIL's probes and structures had to be updated to reflect these code changes. Specifically: * The "zil__cw1" and "zil__cw2" probes were removed, so they had to be removed from "trace_zil.h" as well. * Some of the zilog structure's fields were removed, which affected the tracepoint definitions of the structure. * New tracepoints had to be added for the following 3 new probes: * zil__process__commit__itx * zil__process__normal__itx * zil__commit__io__error OpenZFS-issue: https://www.illumos.org/issues/8585 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5d95a3a Closes #6566	2017-12-05 09:39:16 -08:00
Brian Behlendorf	7b3407003f	Fix NFS sticky bit permission denied error When zfs_sticky_remove_access() was originally adapted for Linux a typo was made which altered the intended behavior. As described in the block comment, the intended behavior is that permission should be granted when the entry is a regular file and you have write access. That is, S_ISREG should have been used instead of S_ISDIR. Restricting permission to regular files made good sense for older systems where setting the bit on executable files would instruct the system to save the program's text segment on the swap device. On modern systems this behavior has been replaced by the sticky bit acting as a restricted deletion flag and the plain file restriction has been relaxed. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6889 Closes #6910	2017-12-04 11:55:57 -08:00
JKDingwall	9717fe052b	Add /usr/bin/env to COPY_EXEC_LIST initramfs hook `5dc1ff29` changed the user space program to mount a zfs snapshot from /bin/sh to /usr/bin/env. If the executable is not present in the initramfs then snapshots cannot be automounted. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: James Dingwall <james.dingwall@zynstra.com> Closes #5360 Closes #6913	2017-12-04 11:53:57 -08:00
Brian Behlendorf	ea39f75f64	Fix 'zpool create\|add' replication level check When the pool configuration contains a hole due to a previous device removal ignore this top level vdev. Failure to do so will result in the current configuration being assessed to have a non-uniform replication level and the expected warning will be disabled. The zpool_add_010_pos test case was extended to cover this scenario. Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6907 Closes #6911	2017-12-04 11:50:35 -08:00
Brian Behlendorf	72841b9fd9	Preserve itx alloc size for zio_data_buf_free() Using zio_data_buf_alloc() to allocate the itx's may be unsafe because the itx->itx_lr.lrc_reclen field is not constant from allocation to free. Using a different itx->itx_lr.lrc_reclen size in zio_data_buf_free() can result in the allocation being returned to the wrong kmem cache. This issue can be avoided entirely by storing the allocation size in itx->itx_size and using that for zio_data_buf_free(). Reviewed by: Prakash Surya <prakash.surya@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6912	2017-12-04 11:44:39 -08:00
Tom Caputi	d4677269f2	Unbreak the scan status ABI When `d4a72f23` was merged, pss_pass_issued was incorrectly added to the middle of the pool_scan_stat_t structure instead of the end. This patch simply corrects this issue. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #6909	2017-11-30 09:40:13 -08:00
LOLi	ed15d54481	Fix 'zfs get {user\|group}objused@' functionality Fix a regression accidentally introduced in `1b81ab4` that prevents 'zfs get {user\|group}objused@' from correctly reporting the requested value. Update "userspace_003_pos.ksh" and "groupspace_003_pos.ksh" to verify this functionality. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #6908	2017-11-29 11:59:22 -08:00
Mark Wright	56d8d8ace4	Linux 4.14 compat: CONFIG_GCC_PLUGIN_RANDSTRUCT Fix build errors with gcc 7.2.0 on Gentoo with kernel 4.14 built with CONFIG_GCC_PLUGIN_RANDSTRUCT=y such as: module/nvpair/nvpair.c:2810:2:error: positional initialization of field in ?struct? declared with 'designated_init' attribute [-Werror=designated-init] nvs_native_nvlist, ^~~~~~~~~~~~~~~~~ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mark Wright <gienah@gentoo.org> Closes #5390 Closes #6903	2017-11-28 17:33:48 -06:00
Richard Laager	48ac22d855	initramfs: Honor canmount=off The initramfs script was not honoring canmount=off. With this change, it does. If the administrator has asked that a filesystem not be mounted, that should be honored. As an exception, the initramfs script ignores canmount=off on the rootfs. The rootfs should not have canmount=off set either. However, mounting it anyway seems harmless because it is being asked for explicitly. The point of this exception is to avoid the risk of breaking existing systems, just in case someone has canmount=off set on their rootfs. The initramfs still mounts filesystems with canmount=noauto. This is necessary because it is typical to set that on the rootfs so that it can be cloned. Without canmount=noauto, the clones' duplicate mountpoints would conflict. This is the remainder of the fix for: https://github.com/zfsonlinux/pkg-zfs/issues/221 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #6897	2017-11-28 09:38:13 -08:00
Richard Laager	bd2958dea0	initramfs: Honor mountpoint=none/legacy For filesystems that are children of the rootfs, when mountpoint=none or mountpoint=legacy, the initrafms script would assume a mountpoint based on the dataset path. Given that the rootfs should have mountpoint=/ and mountpoint inheritance is is the default behavior of ZFS, this behavior seems unnecessary. In any event, it turns mountpoint=none into a no-op. That removes this option from the administrator, and if someone uses it, it does not work as expected. Worse yet, if the mountpoint directory does not exist (which is the typical case for mountpoint=none), the mounting and thus the boot process will fail. For the case of mountpoint=legacy, the assumed mountpoint may not be the correct value set in /etc/fstab. This change makes the initramfs script not mount the filesystem in either case. For mountpoint=none, this means we are correctly honoring the setting. For mountpoint=legacy, there are two scenarios: If canmount=on, the filesystem will be mounted by the normal mechanisms later in the boot process. If canmount=noauto, the filesystem will not be mounted at all, unless the administrator has done something special. If they're not doing something special and they want it mounted by the initramfs, they can simply not set mountpoint=legacy. This is part of the fix for: https://github.com/zfsonlinux/pkg-zfs/issues/221 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #6897	2017-11-28 09:38:00 -08:00
DeHackEd	1c68856bca	zpool(8): Fix "zpool import -t" Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: DHE <git@dehacked.net> Closes #6894	2017-11-28 11:10:52 -06:00
Brian Behlendorf	94183a9d8a	Update for cppcheck v1.80 Resolve new warnings and errors from cppcheck v1.80. * [lib/libshare/libshare.c:543]: (warning) Possible null pointer dereference: protocol * [lib/libzfs/libzfs_dataset.c:2323]: (warning) Possible null pointer dereference: srctype * [lib/libzfs/libzfs_import.c:318]: (error) Uninitialized variable: link * [module/zfs/abd.c:353]: (error) Uninitialized variable: sg * [module/zfs/abd.c:353]: (error) Uninitialized variable: i * [module/zfs/abd.c:385]: (error) Uninitialized variable: sg * [module/zfs/abd.c:385]: (error) Uninitialized variable: i * [module/zfs/abd.c:553]: (error) Uninitialized variable: i * [module/zfs/abd.c:553]: (error) Uninitialized variable: sg * [module/zfs/abd.c:763]: (error) Uninitialized variable: i * [module/zfs/abd.c:763]: (error) Uninitialized variable: sg * [module/zfs/abd.c:305]: (error) Uninitialized variable: tmp_page * [module/zfs/zpl_xattr.c:342]: (warning) Possible null pointer dereference: value * [module/zfs/zvol.c:208]: (error) Uninitialized variable: p Convert the following suppression to inline. * [module/zfs/zfs_vnops.c:840]: (error) Possible null pointer dereference: aiov Exclude HAVE_UIO_ZEROCOPY and HAVE_DNLC from analysis since these macro's will never be defined until this functionality is implemented. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6879	2017-11-18 14:08:00 -08:00
Scot W. Stevenson	8d18776973	Fix data on evict_skips in arc_summary.py Display correct data from kstat arcstats for evict_skips, which is currently repeating the data from mutex_misses. Fixes #6882 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Scot W. Stevenson <scot.stevenson@gmail.com> Closes #6882 Closes #6883	2017-11-18 14:07:04 -08:00
DeHackEd	da5d4697a8	Fix ARC pointer overrun Only access the `b_crypt_hdr` field of an ARC header if the content is encrypted. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: DHE <git@dehacked.net> Closes #6877	2017-11-17 15:11:39 -08:00
Tom Caputi	d4a72f2386	Sequential scrub and resilvers Currently, scrubs and resilvers can take an extremely long time to complete. This is largely due to the fact that zfs scans process pools in logical order, as determined by each block's bookmark. This makes sense from a simplicity perspective, but blocks in zfs are often scattered randomly across disks, particularly due to zfs's copy-on-write mechanisms. This patch improves performance by splitting scrubs and resilvers into a metadata scanning phase and an IO issuing phase. The metadata scan reads through the structure of the pool and gathers an in-memory queue of I/Os, sorted by size and offset on disk. The issuing phase will then issue the scrub I/Os as sequentially as possible, greatly improving performance. This patch also updates and cleans up some of the scan code which has not been updated in several years. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Authored-by: Saso Kiselkov <saso.kiselkov@nexenta.com> Authored-by: Alek Pinchuk <apinchuk@datto.com> Authored-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #3625 Closes #6256	2017-11-15 17:27:01 -08:00
Scot W. Stevenson	e301113c17	Minor code cleanups in arc_python.py Remove unused library re and associated variable kstat_pobj. Add note to documentation at start of program about required support for old versions of Python. Change variable "format" (which is a built-in function) to "fmt". Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Scot W. Stevenson <scot.stevenson@gmail.com> Closes #6869	2017-11-15 10:28:11 -08:00
Brian Behlendorf	454365bbaa	Fix dirty check in dmu_offset_next() The correct way to determine if a dnode is dirty is to check if any of the dn->dn_dirty_link's are active. Relying solely on the dn->dn_dirtyctx can result in the dnode being mistakenly reported as clean. Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3125 Closes #6867	2017-11-15 10:19:32 -08:00
Brian Behlendorf	13589da974	Disable automatic dependencies in zfs-test package All of the ZTS test scripts specify /bin/ksh as the interpreter. Unfortunately, as of Fedora 27 only /usr/bin/ksh is provided by the package manager. Rather than change all the scripts to accommodate the latest Fedora disable automatic dependencies for the zfs-test package. Functionally this will not cause any problems since /bin is a symlink to /usr/bin. Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6868	2017-11-15 09:12:52 -08:00
Brian Behlendorf	71788d91f4	Disable zvol_ENOSPC_001_pos on 32-bit systems Occasionally observed failure of zvol_ENOSPC_001_pos due to the test case taking too long to complete. Disable the test case until it can be improved. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #5848 Closes #6862	2017-11-13 16:26:15 -08:00
LOLi	99834d1950	Fix truncate(2) mtime and ctime handling On Linux, ftruncate(2) always changes the file timestamps, even if the file size is not changed. However, in case of a successfull truncate(2), the timestamps are updated only if the file size changes. This translates to the VFS calling the ZFS Posix Layer "setattr" function (zpl_setattr) with ATTR_MTIME and ATTR_CTIME unconditionally set on the iattr mask only when doing a ftruncate(2), while the truncate(2) is left to the filesystem implementation to be dealt with. This behaviour is consistent with POSIX:2004/SUSv3 specifications where there's no explicit requirement for file size changes to update the timestamps only for ftruncate(2): http://pubs.opengroup.org/onlinepubs/009695399/functions/truncate.html http://pubs.opengroup.org/onlinepubs/009695399/functions/ftruncate.html This has been later updated in POSIX:2008/SUSv4 where, for both truncate(2)/ftruncate(2), there's no mention of this size change requirement: http://austingroupbugs.net/view.php?id=489 http://pubs.opengroup.org/onlinepubs/9699919799/functions/truncate.html http://pubs.opengroup.org/onlinepubs/9699919799/functions/ftruncate.html Unfortunately the Linux VFS is still calling into the ZPL without ATTR_MTIME/ATTR_CTIME set in the truncate(2) case: we fix this by explicitly updating the timestamps when detecting the ATTR_SIZE bit, which is always set in do_truncate(), on the iattr mask. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #6811 Closes #6819	2017-11-13 09:24:26 -08:00
George Melikov	13042a6ccd	Add .travis.yml Travis builders have maximum work time ~49 minutes, so we have to use 5 builders and spread the ZTS over them using test group tags. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: George Melikov <mail@gmelikov.ru> Closes #6829	2017-11-13 09:18:18 -08:00

... 3 4 5 6 7 ...

3431 Commits