mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2025-12-01 09:32:08 +03:00

Author	SHA1	Message	Date
Tobin Harding	80cc2f6111	Remove unnecessary equality check Currently `if` statement includes an assignment (from a function return value) and a equality check. The parenthesis are in the incorrect place, currently the code clobbers the function return value because of this. We can fix this by simplifying the `if` statement. `if (foo != 0)` can be more succinctly expressed as `if (foo)` Remove the equality check, add parenthesis to correct the statement. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Tobin C. Harding <me@tobin.cc> Closes #6685 Close #6719	2017-10-16 10:57:55 -07:00
Isaac Huang	b97948276d	Use linear abd in vdev_copy_uberblocks() The vdev_copy_uberblocks() function should use abd_alloc_linear() to allocate ub_abd, because abd_to_buf(ub_abd)) is used later. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Isaac Huang <he.huang@intel.com> Closes #6718 Closes #6713	2017-10-16 10:57:55 -07:00
Ned Bass	4cfc086e4d	receive_freeobjects() skips freeing some objects When receiving a FREEOBJECTS record, receive_freeobjects() incorrectly skips a freed object in some cases. Specifically, this happens when the first object in the range to be freed doesn't exist, but the second object does. This leaves an object allocated on disk on the receiving side which is unallocated on the sending side, which may cause receiving subsequent incremental streams to fail. The bug was caused by an incorrect increment of the object index variable when current object being freed doesn't exist. The increment is incorrect because incrementing the object index is handled by a call to dmu_object_next() in the increment portion of the for loop statement. Add test case that exposes this bug. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #6694 Closes #6695	2017-10-16 10:57:55 -07:00
chrisrd	25d232f407	Scale the dbuf cache with arc_c Commit `d3c2ae1` introduced a dbuf cache with a default size of the minimum of 100M or 1/32 maximum ARC size. (These figures may be adjusted using dbuf_cache_max_bytes and dbuf_cache_max_shift.) The dbuf cache is counted as metadata for the purposes of ARC size calculations. On a 1GB box the ARC maximum size defaults to c_max 493M which gives a dbuf cache default minimum size of 15.4M, and the ARC metadata defaults to minimum 16M. I.e. the dbuf cache is an significant proportion of the minimum metadata size. With other overheads involved this actually means the ARC metadata doesn't get down to the minimum. This patch dynamically scales the dbuf cache to the target ARC size instead of statically scaling it to the maximum ARC size. (The scale is still set by dbuf_cache_max_shift and the maximum size is still fixed by dbuf_cache_max_bytes.) Using the target ARC size rather than the current ARC size is done to help the ARC reach the target rather than simply focusing on the current size. Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Chris Dunlop <chris@onthe.net.au> Issue #6506 Closes #6561	2017-10-16 10:57:54 -07:00
Giuseppe Di Natale	bef6a8bc3a	Correct cppcheck errors (#6662 ) ZFS buildbot STYLE builder was moved to Ubuntu 17.04 which has a newer version of cppcheck. Handle the new cppcheck errors. uu_* functions removed in this commit were unused and effectively dead code. They are now retired. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov> Closes #6653	2017-09-20 12:59:21 -07:00
Brian Behlendorf	266b181e75	Increase default arc_c_min Increase the default arc_c_min value to which whichever is larger, either 32M or 1/32 of total system memory. This is advantageous for systems with more than 1G of memory where performance issues may occur when the ARC is allowed to collapse below a minimum size. At the same time we want to use the bare minimum value which is still functional so the filesystem can be used in very low memory environments. Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6659	2017-09-20 10:25:54 -07:00
Brian Behlendorf	c474f5e9a7	Export symbol dmu_tx_mark_netfree() This symbol is needed by Lustre for the same reason it was needed by the ZPL. It should have been exported when the original patch was merged. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6660	2017-09-20 10:25:54 -07:00
Brian Behlendorf	661907e6bc	Linux 4.14 compat: IO acct, global_page_state, etc (#6655 ) generic_start_io_acct/generic_end_io_acct in the master branch of the linux kernel requires that the request_queue be provided. Move the logic from freemem in the spl to arc_free_memory in arc.c. Do this so we can take advantage of global_page_state interface checks in zfs. Upstream kernel replaced struct block_device with struct gendisk in struct bio. Determine if the function bio_set_dev exists during configure and have zfs use that if it exists. bio_set_dev https://github.com/torvalds/linux/commit/74d4699 global_node_page_state https://github.com/torvalds/linux/commit/75ef718 io acct https://github.com/torvalds/linux/commit/d62e26b Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov> Closes #6635 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2017-09-19 14:24:34 -07:00
Gaurav Kumar	d3e7d981d4	Modifying XATTRs doesnt change the ctime Changing any metadata, should modify the ctime. Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: gaurkuma <gauravk.18@gmail.com> Closes #3644 Closes #6586	2017-09-13 16:05:18 -07:00
Brian Behlendorf	a2a0440918	Fix volume WR_INDIRECT log replay (#6620 ) The portion of the zvol_replay_write() handler responsible for replaying indirect log records for some reason never existed. As a result indirect log records were not being correctly replayed. This went largely unnoticed since the majority of zvol log records were of the type WR_COPIED or WR_NEED_COPY prior to OpenZFS 7578. This patch updates zvol_replay_write() to correctly handle these log records and adds a new test case which verifies volume replay to prevent any regression. The existing test case which verified replay on filesystem was renamed slog_replay_fs.ksh for clarity. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6603	2017-09-13 16:04:16 -07:00
Giuseppe Di Natale	45d1abc74d	Improved dnode allocation and dmu_hold_impl() (#6611 ) Refactor dmu_object_alloc_dnsize() and dnode_hold_impl() to simplify the code, fix errors introduced by commit `dbeb879` (PR #6117) interacting badly with large dnodes, and improve performance. * When allocating a new dnode in dmu_object_alloc_dnsize(), update the percpu object ID for the core's metadnode chunk immediately. This eliminates most lock contention when taking the hold and creating the dnode. * Correct detection of the chunk boundary to work properly with large dnodes. * Separate the dmu_hold_impl() code for the FREE case from the code for the ALLOCATED case to make it easier to read. * Fully populate the dnode handle array immediately after reading a block of the metadnode from disk. Subsequently the dnode handle array provides enough information to determine which dnode slots are in use and which are free. * Add several kstats to allow the behavior of the code to be examined. * Verify dnode packing in large_dnode_008_pos.ksh. Since the test is purely creates, it should leave very few holes in the metadnode. * Add test large_dnode_009_pos.ksh, which performs concurrent creates and deletes, to complement existing test which does only creates. With the above fixes, there is very little contention in a test of about 200,000 racing dnode allocations produced by tests 'large_dnode_008_pos' and 'large_dnode_009_pos'. name type data dnode_hold_dbuf_hold 4 0 dnode_hold_dbuf_read 4 0 dnode_hold_alloc_hits 4 3804690 dnode_hold_alloc_misses 4 216 dnode_hold_alloc_interior 4 3 dnode_hold_alloc_lock_retry 4 0 dnode_hold_alloc_lock_misses 4 0 dnode_hold_alloc_type_none 4 0 dnode_hold_free_hits 4 203105 dnode_hold_free_misses 4 4 dnode_hold_free_lock_misses 4 0 dnode_hold_free_lock_retry 4 0 dnode_hold_free_overflow 4 0 dnode_hold_free_refcount 4 57 dnode_hold_free_txg 4 0 dnode_allocate 4 203154 dnode_reallocate 4 0 dnode_buf_evict 4 23918 dnode_alloc_next_chunk 4 4887 dnode_alloc_race 4 0 dnode_alloc_next_block 4 18 The performance is slightly improved for concurrent creates with 16+ threads, and unchanged for low thread counts. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov>	2017-09-13 15:46:15 -07:00
dbavatar	89950722c6	Linux 4.8+ compatibility fix for vm stats vm_node_stat must be used instead of vm_zone_stat. Unfortunately the old code still compiles potentially leading to silent failure of arc_evictable_memory() AKAMAI: CR 3816601: Regression in zfs dropcache test Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com> Closes #6528	2017-09-13 14:21:59 -07:00
LOLi	ae5b4a05ff	Fix range locking in ZIL commit codepath Since OpenZFS 7578 (`1b7c1e5`) if we have a ZVOL with logbias=throughput we will force WR_INDIRECT itxs in zvol_log_write() setting itx->itx_lr offset and length to the offset and length of the BIO from zvol_write()->zvol_log_write(): these offset and length are later used to take a range lock in zillog->zl_get_data function: zvol_get_data(). Now suppose we have a ZVOL with blocksize=8K and push 4K writes to offset 0: we will only be range-locking 0-4096. This means the ASSERTion we make in dbuf_unoverride() is no longer valid because now dmu_sync() is called from zilog's get_data functions holding a partial lock on the dbuf. Fix this by taking a range lock on the whole block in zvol_get_data(). Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #6238 Closes #6315 Closes #6356 Closes #6477	2017-08-21 16:46:54 -07:00
LOLi	3468fdbd34	Fix remounting snapshots read-write It's not enough to preserve/restore MS_RDONLY on the superblock flags to avoid remounting a snapshot read-write: be explicit about our intentions to the VFS layer so the readonly bit is updated correctly in do_remount_sb(). Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #6510 Closes #6515	2017-08-21 16:46:52 -07:00
Chunwei Chen	aec4318870	Fix NULL pointer when O_SYNC read in snapshot When doing read on a file open with O_SYNC, it will trigger zil_commit. However for snapshot, there's no zil, so we shouldn't be doing that. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #6478 Closes #6494	2017-08-21 16:41:22 -07:00
sanjeevbagewadi	2d9b57d39f	zio_dva_throttle_done() should allow zinjected ZIO If fault injection is enabled, the ZIO_FLAG_IO_RETRY could be set by zio_handle_device_injection() to generate the FMA events and update stats. Hence, ignore the flag and process such zios. A better fix would be to add another flag in the zio_t to indicate that the zio is failed because of a zinject rule. However, considering the fact that we do this in debug bits, we could do with the crude check using the global flag zio_injection_enabled which is set to 1 when zinject records are added. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Sanjeev Bagewadi <sanjeev.bagewadi@gmail.com> Closes #6383 Closes #6384	2017-08-21 16:41:22 -07:00
Brian Behlendorf	751941e248	Fix dnode allocation race When performing concurrent object allocations using the new multi-threaded allocator and large dnodes it's possible to allocate overlapping large dnodes. This case should have been handled by detecting an error returned by dnode_hold_impl(). But that logic only checked the returned dnp was not-NULL, and the dnp variable was not reset to NULL when retrying. Resolve this issue by properly checking the return value of dnode_hold_impl(). Additionally, it was possible that dnode_hold_impl() would misreport a dnode as free when it was in fact in use. This could occurs for two reasons: * The per-slot zrl_lock must be held over the entire critical section which includes the alloc/free until the new dnode is assigned to children_dnodes. Additionally, all of the zrl_lock's in the range must be held to protect moving dnodes. * The dn->dn_ot_type cannot be solely relied upon to check the type. When allocating a new dnode its type will be DMU_OT_NONE after dnode_create(). Only latter when dnode_allocate() is called will it transition to the new type. This means there's a window when allocating where it can mistaken for a free dnode. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Ned Bass <bass6@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6414 Closes #6439	2017-08-08 10:17:33 -07:00
Ned Bass	ef605a5517	Add debug log entries for failed receive records Log contents of a receive record if an error occurs while writing it out to the pool. This may help determine the cause when backup streams are rejected as invalid. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #6465	2017-08-08 10:17:23 -07:00
Tony Hutter	07cbcd5089	Only record zio->io_delay on reads and writes While investigating https://github.com/zfsonlinux/zfs/issues/6425 I noticed that ioctl ZIOs were not setting zio->io_delay correctly. They would set the start time in zio_vdev_io_start(), but never set the end time in zio_vdev_io_done(), since ioctls skip it and go straight to zio_done(). This was causing spurious "delayed IO" events to appear, which would eventually get rate-limited and displayed as "Missed events" messages in zed. To get around the problem, this patch only sets zio->io_delay for read and write ZIOs, since that's all we care about anyway. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #6425 Closes #6440	2017-08-02 11:37:18 -07:00
LOLi	20c88dc3ef	Fix volmode=none property behavior at import time At import time spa_import() calls zvol_create_minors() directly: with the current implementation we have no way to avoid device node creation when volmode=none. Fix this by enforcing volmode=none directly in zvol_alloc(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #6426	2017-08-02 11:21:14 -07:00
Giuseppe Di Natale	bff245dd34	OpenZFS 8508 - Mounting a zpool on 32-bit platforms panics Authored by: Justin Hibbits <chmeeedalf@gmail.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/8508 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/15fc257 Closes #6404	2017-07-26 09:44:21 -07:00
Ned Bass	8740cf4a2f	Add line info and SET_ERROR() to ZFS debug log Redefine the SET_ERROR macro in terms of __dprintf() so the error return codes get logged as both tracepoint events (if tracepoints are enabled) and as ZFS debug log entries. This also allows us to use the same definition of SET_ERROR() in kernel and user space. Define a new debug flag ZFS_DEBUG_SET_ERROR=512 that may be bitwise or'd into zfs_flags. Setting this flag enables both dprintf() and SET_ERROR() messages in the debug log. That is, setting ZFS_DEBUG_SET_ERROR and ZFS_DEBUG_DPRINTF\|ZFS_DEBUG_SET_ERROR are equivalent (this was done for sake of simplicity). Leaving ZFS_DEBUG_SET_ERROR unset suppresses the SET_ERROR() messages which helps avoid cluttering up the logs. To enable SET_ERROR() logging, run: echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable echo 512 > /sys/module/zfs/parameters/zfs_flags Remove the zfs_set_error_class tracepoints event class since SET_ERROR() now uses __dprintf(). This sacrifices a bit of granularity when selecting individual tracepoint events to enable but it makes the code simpler. Include file, function, and line number information in debug log entries. The information is now added to the message buffer in __dprintf() and as a result the zfs_dprintf_class tracepoints event class was changed from a 4 parameter interface to a single parameter. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #6400	2017-07-25 23:09:48 -07:00
Ned Bass	73aac4aa41	Some additional send stream validity checking Check in the DMU whether an object record in a send stream being received contains an unsupported dnode slot count, and return an error if it does. Failure to catch an unsupported dnode slot count would result in a panic when the SPA attempts to increment the reference count for the large_dnode feature and the pool has the feature disabled. This is not normally an issue for a well-formed send stream which would have the DMU_BACKUP_FEATURE_LARGE_DNODE flag set if it contains large dnodes, so it will be rejected as unsupported if the required feature is disabled. This change adds a missing object record field validation. Add missing stream feature flag checks in dmu_recv_resume_begin_check(). Consolidate repetitive comment blocks in dmu_recv_begin_check(). Update zstreamdump to print the dnode slot count (dn_slots) for an object record when running in verbose mode. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #6396	2017-07-25 18:52:40 -07:00
Brian Behlendorf	3f759c0c73	Fix 'zpool clear' on suspended pools 'zpool clear' should be able to resume I/O on suspended, but otherwise healthy, pools. `4a283c7` accidentally introduced a new code path where we call txg_wait_synced() on the suspended pool before we had the chance to resume I/O via zio_resume(): this results in the 'zpool clear' command hanging indefinitely, waiting for a TXG that cannot be synced. Fix this by avoiding the call to txg_wait_synced(). Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6399	2017-07-25 12:20:52 -07:00
Olaf Faaland	e889f0f520	Report MMP_STATE_NO_HOSTID immediately There is no need to perform the activity check before detecting that the user must set the system hostid, because the pool's multihost property is on, but spa_get_hostid() returned 0. The initial call to vdev_uberblock_load() provided the information required. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #6388	2017-07-25 13:22:28 -04:00
Olaf Faaland	0582e40322	Add callback for zfs_multihost_interval Add a callback to wake all running mmp threads when zfs_multihost_interval is changed. This is necessary when the interval is changed from a very large value to a significantly lower one, while pools are imported that have the multihost property enabled. Without this commit, the mmp thread does not wake up and detect the new interval until after it has waited the old multihost interval time. A user monitoring mmp writes via the provided kstat would be led to believe that the changed setting did not work. Added a test in the ZTS under mmp to verify the new functionality is working. Added a test to ztest which starts and stops mmp threads, and calls into the code to signal sleeping mmp threads, to test for deadlocks or similar locking issues. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #6387	2017-07-25 13:22:20 -04:00
Olaf Faaland	ffb195c256	Release SCL_STATE in map_write_done() The config lock must be held for the duration of the MMP write. Since the I/Os are executed via map_nowait(), the done function is the only place where we know the write has completed. Since SCL_STATE is taken as reader, overlapping I/Os do not create a deadlock. The refcount is simply increased when new I/Os are queued and decreased when I/Os complete. Test case added which exercises the probe IO call path to verify the fix and prevent a regression. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #6394	2017-07-25 12:25:05 -04:00
Olaf Faaland	f43615d0cc	Revert Fix vdev_probe() call wrt SCL_STATE_ALL This reverts commit `cc9c6bc`, which has been causing intermittent test failures on buildbot. A correct fix for this locking issue has been applied in a separate patch. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov>	2017-07-25 12:24:42 -04:00
Olaf Faaland	b6e5c40382	Use correct macro for hz in mmp.c Commit `379ca9c` Multi-modifier protection (MMP) used HZ to convert nanoseconds to ticks for use with cv_timedwait() and ddi_get_lbolt(). The correct macro is hz, which is defined within the SPL for kernel space, and within zfs_context.h for user space. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #6357 Closes #6360	2017-07-24 11:22:10 -07:00
Giuseppe Di Natale	802ae562ed	Fix coverity defects: CID 165755 CID 165755: Division or modulo by zero (DIVIDE_BY_ZERO) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov> Closes #6352	2017-07-24 11:16:58 -07:00
Brian Behlendorf	36ba27e9e0	Linux 4.13 compat: bio->bi_status and blk_status_t Commit torvalds/linux@4e4cbee9. The bio->bi_error field was replaced with bio->bi_status which is an enum that describes all possible error types. Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6351	2017-07-23 19:37:12 -07:00
Brian Behlendorf	cc9c6bcb73	Fix vdev_probe() call outside SCL_STATE_ALL lock When an IO fails then zio_vdev_io_done() can call vdev_probe() to determine the health of the vdev. This is safe as long as the original zio was submitted with zio_wait() and holds the SCL_STATE_ALL lock over the operation. If zio_no_wait() was used then the done callback will submit the probe IO outside the SCL_STATE_ALL lock and hit this ASSERT in zio_create() ASSERT(!vd \|\| spa_config_held(spa, SCL_STATE_ALL, RW_READER)); Resolve the issue by only allowing vdev_probe() to be called when there's a waiter indicating the caller is using zio_wait(). This assumes that caller is still holding SCL_STATE_ALL. This issue isn't MMP specific but was surfaced when testing. Without this patch it can be reproduced by running: zpool set multihost on <pool> zinject -d <vdev> -e io -T write -f 50 <pool> -L uber Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@intel.com> Closes #745 Closes #6279	2017-07-13 13:54:10 -04:00
Olaf Faaland	379ca9cf2b	Multi-modifier protection (MMP) Add multihost=on\|off pool property to control MMP. When enabled a new thread writes uberblocks to the last slot in each label, at a set frequency, to indicate to other hosts the pool is actively imported. These uberblocks are the last synced uberblock with an updated timestamp. Property defaults to off. During tryimport, find the "best" uberblock (newest txg and timestamp) repeatedly, checking for change in the found uberblock. Include the results of the activity test in the config returned by tryimport. These results are reported to user in "zpool import". Allow the user to control the period between MMP writes, and the duration of the activity test on import, via a new module parameter zfs_multihost_interval. The period is specified in milliseconds. The activity test duration is calculated from this value, and from the mmp_delay in the "best" uberblock found initially. Add a kstat interface to export statistics about Multiple Modifier Protection (MMP) updates. Include the last synced txg number, the timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV label that received the last MMP update, and the VDEV path. Abbreviated output below. $ cat /proc/spl/kstat/zfs/mypool/multihost 31 0 0x01 10 880 105092382393521 105144180101111 txg timestamp mmp_delay vdev_guid vdev_label vdev_path 20468 261337 250274925 68396651780 3 /dev/sda 20468 261339 252023374 6267402363293 1 /dev/sdc 20468 261340 252000858 6698080955233 1 /dev/sdx 20468 261341 251980635 783892869810 2 /dev/sdy 20468 261342 253385953 8923255792467 3 /dev/sdd 20468 261344 253336622 042125143176 0 /dev/sdab 20468 261345 253310522 1200778101278 2 /dev/sde 20468 261346 253286429 0950576198362 2 /dev/sdt 20468 261347 253261545 96209817917 3 /dev/sds 20468 261349 253238188 8555725937673 3 /dev/sdb Add a new tunable zfs_multihost_history to specify the number of MMP updates to store history for. By default it is set to zero meaning that no MMP statistics are stored. When using ztest to generate activity, for automated tests of the MMP function, some test functions interfere with the test. For example, the pool is exported to run zdb and then imported again. Add a new ztest function, "-M", to alter ztest behavior to prevent this. Add new tests to verify the new functionality. Tests provided by Giuseppe Di Natale. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Ned Bass <bass6@llnl.gov> Reviewed-by: Andreas Dilger <andreas.dilger@intel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #745 Closes #6279	2017-07-13 13:54:00 -04:00
Dave Eddy	12fa0466df	OpenZFS 6939 - add sysevents to zfs core for commands Authored by: Dave Eddy <dave@daveeddy.com> Reviewed by: Patrick Mooney <patrick.mooney@joyent.com> Reviewed by: Joshua M. Clulow <jmc@joyent.com> Reviewed by: Josh Wilsdon <jwilsdon@joyent.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed by: Alan Somers <asomers@gmail.com> Reviewed by: Andrew Stormont <andyjstormont@gmail.com> Approved by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6939 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ce1577b Closes #6328	2017-07-12 21:28:13 -07:00
LOLi	cf8738d853	Add port of FreeBSD 'volmode' property The volmode property may be set to control the visibility of ZVOL block devices. This allow switching ZVOL between three modes: full - existing fully functional behaviour (default) dev - hide partitions on ZVOL block devices none - not exposing volumes outside ZFS Additionally the new zvol_volmode module parameter can be used to control the default behaviour. This functionality can be used, for instance, on "backup" pools to avoid cluttering /dev with unneeded zd* devices. Original-patch-by: mav <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> FreeBSD-commit: https://github.com/freebsd/freebsd/commit/dd28e6bb Closes #1796 Closes #3438 Closes #6233	2017-07-12 13:05:37 -07:00
Yuri Pankov	e19572e4cc	OpenZFS 5428 - provide fts(), reallocarray(), and strtonum() Authored by: Yuri Pankov <yuri.pankov@nexenta.com> Reviewed by: Robert Mustacchi <rm@joyent.com> Approved by: Joshua M. Clulow <josh@sysmgr.org> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Porting Notes: * All hunks unrelated to ZFS were dropped. OpenZFS-issue: https://www.illumos.org/issues/5428 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4585130 Closes #6326	2017-07-08 20:35:35 -07:00
Matthew Ahrens	2ade4a99f0	OpenZFS 8126 - ztest assertion failed in dbuf_dirty due to dn_nlevels changing The sync thread is concurrently modifying dn_phys->dn_nlevels while dbuf_dirty() is trying to assert something about it, without holding the necessary lock. We need to move this assertion further down in the function, after we have acquired the dn_struct_rwlock. Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/8126 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0ef125d Closes #6314	2017-07-07 14:58:33 -07:00
Matthew Ahrens	a896468c78	OpenZFS 8067 - zdb should be able to dump literal embedded block pointer Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Yuri Pankov <yuri.pankov@gmail.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/8067 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8173085 Closes #6319	2017-07-07 11:28:01 -07:00
LOLi	92e43c1718	Fix 'zpool clear' on readonly pools Illumos 4080 inadvertently allows 'zpool clear' on readonly pools: fix this by reintroducing a check (POOL_CHECK_READONLY) in zfs_ioc_clear registration code. Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #6306	2017-07-07 10:39:53 -07:00
Alek P	0ea05c64f8	Implemented zpool scrub pause/resume Currently, there is no way to pause a scrub. Pausing may be useful when the pool is busy with other I/O to preserve bandwidth. This patch adds the ability to pause and resume scrubbing. This is achieved by maintaining a persistent on-disk scrub state. While the state is 'paused' we do not scrub any more blocks. We do however perform regular scan housekeeping such as freeing async destroyed and deadlist blocks while paused. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Thomas Caputi <tcaputi@datto.com> Reviewed-by: Serapheim Dimitropoulos <serapheimd@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alek Pinchuk <apinchuk@datto.com> Closes #6167	2017-07-06 22:16:13 -07:00
Arkadiusz Bubała	94b25662c5	Reschedule processes on -ERESTARTSYS On the single core machine the system may hang when the spa_namespare_lock acquisition fails in the zvol_first_open function. It returns -ERESTARTSYS error what causes the endless loop in __blkdev_get function. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Arkadiusz Bubała <arkadiusz.bubala@open-e.com> Closes #6283 Closes #6312	2017-07-06 08:38:24 -07:00
Matthew Ahrens	02dc43bc46	OpenZFS 8378 - crash due to bp in-memory modification of nopwrite block Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> The problem is that zfs_get_data() supplies a stale zgd_bp to dmu_sync(), which we then nopwrite against. zfs_get_data() doesn't hold any DMU-related locks, so after it copies db_blkptr to zgd_bp, dbuf_write_ready() could change db_blkptr, and dbuf_write_done() could remove the dirty record. dmu_sync() then sees the stale BP and that the dbuf it not dirty, so it is eligible for nop-writing. The fix is for dmu_sync() to copy db_blkptr to zgd_bp after acquiring the db_mtx. We could still see a stale db_blkptr, but if it is stale then the dirty record will still exist and thus we won't attempt to nopwrite. OpenZFS-issue: https://www.illumos.org/issues/8378 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3127742 Closes #6293	2017-07-04 15:41:24 -07:00
Andriy Gapon	8ca78ab002	OpenZFS 7600 - zfs rollback should pass target snapshot to kernel Authored by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> The existing kernel-side code only provides a method to rollback to a latest snapshot, whatever it happens to be at the time when the rollback is actually done. That could be unsafe or confusing in environments where concurrent DSL changes are possible as the resulting state could correspond to a newer or older snapshot than the originally requested one. This change allows to amend that method such that the rollback is performed only when the latest snapshot has a specific name. That is, if a new snapshot is concurrently created or the target snapshot is destroyed, then no rollback is done and EXDEV error is returned. New libzfs_core function lzc_rollback_to() is provided for the new functionality. libzfs is changed to use lzc_rollback_to() to implement zfs rollback command. Perhaps we should return different errors to distinguish the case where the desired snapshot exists but it's not the latest snapshot and the case where the desired snapshot does not exist. OpenZFS-issue: https://www.illumos.org/issues/7600 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3d645eb Closes #6292	2017-07-04 15:29:52 -07:00
Andriy Gapon	018503911c	OpenZFS 7910 - l2arc_write_buffers() may write beyond target_sz Authored by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/7910 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/cb6af4b Closes #6291	2017-07-04 15:28:58 -07:00
Matthew Ahrens	c6f6767eea	OpenZFS 8377 - Panic in bookmark deletion Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> The problem is that when dsl_bookmark_destroy_check() is executed from open context (the pre-check), it fills in dbda_success based on the existence of the bookmark. But the bookmark (or containing filesystem as in this case) can be destroyed before we get to syncing context. When we re-run dsl_bookmark_destroy_check() in syncing context, it will not add the deleted bookmark to dbda_success, intending for dsl_bookmark_destroy_sync() to not process it. But because the bookmark is still in dbda_success from the open-context call, we do try to destroy it. The fix is that dsl_bookmark_destroy_check() should not modify dbda_success when called from open context. OpenZFS-issue: https://www.illumos.org/issues/8377 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/b0b6fe3 Closes #6286	2017-06-30 11:11:01 -07:00
Matthew Ahrens	817b1b6e7b	Clean up large dnode code Resolves issues discovered when porting to OpenZFS. * Lint warnings. * Made dnode_move_impl() large dnode aware. This functionality is currently unused on Linux. Reviewed-by: Ned Bass <bass6@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #6262	2017-06-29 10:18:03 -07:00
chrisrd	b8a97fb101	Set arc_meta_limit, arc_dnode_limit on change Make zfs_arc_meta_limit_percent and zfs_arc_dnode_limit_percent behave as you would expect from zfs-module-parameters.5. - recalculate arc_meta_limit if zfs_arc_meta_limit_percent changes - recalculate arc_dnode_limit if zfs_arc_dnode_limit_percent changes - correctly set arc_meta_limit and arc_dnode_limit if zfs_arc_max or zfs_arc_meta_min changes Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Dunlop <chris@onthe.net.au> Closes #6269	2017-06-29 09:57:27 -07:00
Tony Hutter	682ce104cd	GCC 7.1 fixes GCC 7.1 with will warn when we're not checking the snprintf() return code in cases where the buffer could be truncated. This patch either checks the snprintf return code (where applicable), or simply disables the warnings (ztest.c). Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #6253	2017-06-28 10:05:16 -07:00
Brian Behlendorf	2d678f779a	Cap maximum aggregate IO size Commit `8542ef8` allowed optional IOs to be aggregated beyond the specified aggregation limit. Since the aggregation limit was also used to enforce the maximum block size, setting `zfs_vdev_aggregation_limit=16777216` could result in an attempt to allocate an ABD larger than 16M. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6259 Closes #6270	2017-06-27 10:09:16 -07:00
Boris Protopopov	58404a73db	Refine use of zv_state_lock. Use zv_state_lock to protect all members of zvol_state structure, add relevant ASSERT()s. Take zv_suspend_lock before zv_state_lock, do not hold zv_state_lock across suspend/resume. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com> Closes #6226	2017-06-27 12:51:44 -04:00
Giuseppe Di Natale	82710e993a	OpenZFS 5220 - L2ARC does not support devices that do not provide 512B access Authored by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Approved by: Dan McDonald <danmcd@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/5220 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/403a8da Closes #6260	2017-06-26 17:32:43 -07:00
Giuseppe Di Natale	d12f91fde3	OpenZFS 8264 - want support for promoting datasets in libzfs_core Authored by: Andrew Stormont <astormont@racktopsystems.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan McDonald <danmcd@kebe.com> Approved by: Dan McDonald <danmcd@kebe.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/8264 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/a4b8c9a Closes #6254	2017-06-26 16:56:09 -07:00
Boris Protopopov	03928896e1	Call cv_signal() with mutex held In bqueue_dequeue(), call cv_signal() with bq_lock held. Re-enable rsend_009_pos to test the fix. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com> Closes #5887	2017-06-26 14:36:49 -07:00
Morgan Jones	d9ad3fea3b	Add kpreempt_disable/enable around CPU_SEQID uses In zfs/dmu_object and icp/core/kcf_sched, the CPU_SEQID macro should be surrounded by `kpreempt_disable` and `kpreempt_enable` calls to avoid a Linux kernel BUG warning. These code paths use the cpuid to minimize lock contention and is is safe to reschedule the process to a different processor at any time. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Morgan Jones <me@numin.it> Closes #6239	2017-06-19 09:43:16 -07:00
Don Brady	0241e491a0	Inject zinject(8) a percentage amount of dev errs In the original form of device error injection, it was an all or nothing situation. To help simulate intermittent error conditions, you can now specify a real number percentage value. This is also very useful for our ZFS fault diagnosis testing and for injecting intermittent errors during load testing. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@intel.com> Closes #6227	2017-06-16 17:21:11 -07:00
Boris Protopopov	ef4be34a64	Avoid 'queue not locked' warning at pool import. Use queue_flag_set_unlocked() in zvol_alloc(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com> Issue #6226	2017-06-15 11:16:19 -07:00
LOLi	97f8d7961e	Fix zvol_state_t->zv_open_count race `5559ba0` added zv_state_lock to protect zvol_state_t internal data: this, however, doesn't guard zv->zv_open_count and zv->zv_disk->private_data in zvol_remove_minors_impl(). Fix this by taking zv->zv_state_lock before we check its zv_open_count. P1 (z_zvol) P2 (systemd-udevd) --- --- zvol_remove_minors_impl() : zv->zv_open_count==0 zvol_open() ->mutex_enter(zv_state_lock) : zv->zv_open_count++ ->mutex_exit(zv_state_lock) ->mutex_enter(zv->zv_state_lock) ->zvol_remove(zv) ->mutex_exit(zv->zv_state_lock) : zv->zv_disk->private_data = NULL ->zvol_free() -->ASSERT(zv->zv_open_count==0) * zvol_release() : zv = disk->private_data ->ASSERT(zv && zv->zv_open_count>0) * --- --- * ASSERT() fails Reviewed by: Boris Protopopov <bprotopopov@hotmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #6213	2017-06-15 11:08:45 -07:00
Richard Yao	8f7933fec9	Fix zvol_init error handling Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@prophetstor.com>	2017-06-13 15:29:21 -04:00
Richard Yao	5228cf0116	Make zvol operations use _by_dnode routines This continues what was started in `0eef1bde31` by fully converting zvols to avoid unnecessary dnode_hold() calls. This saves a small amount of CPU time and slightly improves latencies of operations on zvols. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@prophetstor.com> Closes #6058	2017-06-13 09:18:08 -07:00
DeHackEd	419c80e6dc	Reduce stack usage of dsl_dir_tempreserve_impl Buildbots and zfs-tests regularly see 7 kilobytes of stack usage with this function. Convert self-calls to iterations Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: DHE <git@dehacked.net> Closes #6219	2017-06-12 11:41:03 -07:00
Paul Dagnelie	dd429b46b7	OpenZFS 8056 - zfs send size estimate is inaccurate for some zvols Authored by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Kash Pande <kash@tripleback.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> The send size estimate for a zvol can be too low, if the size of the record headers (dmu_replay_record_t's) is a significant portion of the size. This is typically the case when the data is highly compressible, especially with embedded blocks. The problem is that dmu_adjust_send_estimate_for_indirects() assumes that blocks are the size of the "recordsize" property (128KB). However, for zvols, the blocks are the size of the "volblocksize" property (8KB). Therefore, we estimate that there will be 16x less record headers than there really will be. The fix is to check the type of the object set (whether it is a zvol or not) and pick the appropriate property. In addition, while we are at it, we also add the size of the BEGIN and END records to the estimate. OpenZFS-issue: https://www.illumos.org/issues/8056 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/faf09cd Closes #6205	2017-06-09 09:46:14 -07:00
Matthew Ahrens	38240ebd7a	OpenZFS 8156 - dbuf_evict_notify() does not need dbuf_evict_lock Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> dbuf_evict_notify() holds the dbuf_evict_lock while checking if it should do the eviction itself (because the evict thread is not able to keep up). This can result in massive lock contention. It isn't necessary to hold the lock, because if we make the wrong choice occasionally, nothing bad will happen. This commit results in a ~60% performance improvement for ARC-cached sequential reads. OpenZFS-issue: https://www.illumos.org/issues/8156 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f73e5d9 Closes #6204	2017-06-09 09:45:13 -07:00
Matthew Ahrens	dbeb879699	OpenZFS 8199 - multi-threaded dmu_object_alloc() dmu_object_alloc() is single-threaded, so when multiple threads are creating files in a single filesystem, they spend a lot of time waiting for the os_obj_lock. To improve performance of multi-threaded file creation, we must make dmu_object_alloc() typically not grab any filesystem-wide locks. The solution is to have a "next object to allocate" for each CPU. Each of these "next object"s is in a different block of the dnode object, so that concurrent allocation holds dnodes in different dbufs. When a thread's "next object" reaches the end of a chunk of objects (by default 4 blocks worth -- 128 dnodes), it will be reset to the per-objset os_obj_next, which will be increased by a chunk of objects (128). Only when manipulating the os_obj_next will we need to grab the os_obj_lock. This decreases lock contention dramatically, because each thread only needs to grab the os_obj_lock briefly, once per 128 allocations. This results in a 70% performance improvement to multi-threaded object creation (where each thread is creating objects in its own directory), from 67,000/sec to 115,000/sec, with 8 CPUs. Work sponsored by Intel Corp. Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Ned Bass <bass6@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> OpenZFS-issue: https://www.illumos.org/issues/8199 OpenZFS-commit: https://github.com/openzfs/openzfs/pull/374 Closes #4703 Closes #6117	2017-06-09 09:43:26 -07:00
Giuseppe Di Natale	1b7c1e5ce9	OpenZFS 7578 - Fix/improve some aspects of ZIL writing - After some ZIL changes 6 years ago zil_slog_limit got partially broken due to zl_itx_list_sz not updated when async itx'es upgraded to sync. Actually because of other changes about that time zl_itx_list_sz is not really required to implement the functionality, so this patch removes some unneeded broken code and variables. - Original idea of zil_slog_limit was to reduce chance of SLOG abuse by single heavy logger, that increased latency for other (more latency critical) loggers, by pushing heavy log out into the main pool instead of SLOG. Beside huge latency increase for heavy writers, this implementation caused double write of all data, since the log records were explicitly prepared for SLOG. Since we now have I/O scheduler, I've found it can be much more efficient to reduce priority of heavy logger SLOG writes from ZIO_PRIORITY_SYNC_WRITE to ZIO_PRIORITY_ASYNC_WRITE, while still leave them on SLOG. - Existing ZIL implementation had problem with space efficiency when it has to write large chunks of data into log blocks of limited size. In some cases efficiency stopped to almost as low as 50%. In case of ZIL stored on spinning rust, that also reduced log write speed in half, since head had to uselessly fly over allocated but not written areas. This change improves the situation by offloading problematic operations from z_log_write() to zil_lwb_commit(), which knows real situation of log blocks allocation and can split large requests into pieces much more efficiently. Also as side effect it removes one of two data copy operations done by ZIL code WR_COPIED case. - While there, untangle and unify code of z_log_write() functions. Also zfs_log_write() alike to zvol_log_write() can now handle writes crossing block boundary, that may also improve efficiency if ZPL is made to do that. Sponsored by: iXsystems, Inc. Authored by: Alexander Motin <mav@FreeBSD.org> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Steven Hartland <steven.hartland@multiplay.co.uk> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <ryao@gentoo.org> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/7578 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/aeb13ac Closes #6191	2017-06-09 09:15:37 -07:00
Matthew Ahrens	82644107c4	OpenZFS 8155 - simplify dmu_write_policy handling of pre-compressed buffers Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> When writing pre-compressed buffers, arc_write() requires that the compression algorithm used to compress the buffer matches the compression algorithm requested by the zio_prop_t, which is set by dmu_write_policy(). This makes dmu_write_policy() and its callers a bit more complicated. We simplify this by making arc_write() trust the caller to supply the type of pre-compressed buffer that it wants to write, and override the compression setting in the zio_prop_t. OpenZFS-issue: https://www.illumos.org/issues/8155 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/b55ff58 Closes #6200	2017-06-07 14:16:01 -04:00
LOLi	9f7b066bd9	Linux 4.9 compat: fix zfs_ctldir xattr handling Since torvalds/linux@d0a5b99 IOP_XATTR is used to indicate the inode has xattr support: clear it for the ctldir inodes to avoid EIO errors. Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #6189	2017-06-05 11:26:25 -07:00
LOLi	92aceb2a7e	Fix "snapdev" property issues When inheriting the "snapdev" property to we don't always call zfs_prop_set_special(): this prevents device nodes from being created in certain situations. Because "snapdev" is the only special property that is also inheritable we need to call zfs_prop_set_special() even when we're not reverting it to the received value ('zfs inherit -S'). Additionally, fix a NULL pointer dereference accidentally introduced in `5559ba0` that can be triggered when setting the "snapdev" property to the value "hidden" twice. Finally, add a new test case "zvol_misc_snapdev" to the ZFS Test Suite. Reviewed by: Boris Protopopov <bprotopopov@hotmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #6131 Closes #6175 Closes #6176	2017-06-02 07:17:00 -07:00
Chunwei Chen	b870c4b5d7	Fix import wrong spare/l2 device when path change If, for example, your aux device was /dev/sdc, but now the aux device is removed and /dev/sdc points to other device. zpool import will still use that device and corrupt it. The problem is that the spa_validate_aux in spa_import, rather than validate the on-disk label, it would actually write label to disk. We remove them since spa_load_{spares,l2cache} seems to do everything we need and they would actually validate on-disk label. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #6158	2017-06-01 06:39:42 -07:00
LOLi	3f7d0418dc	Fix memory leak in zvol_set_volsize() Move kmem_free() so it's called for every error path: this is preferred over making `dmu_object_info_t doi` local to accommodate older kernels with limited stacks. Reviewed by: Boris Protopopov <bprotopopov@hotmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #6177	2017-05-31 12:52:12 -07:00
Boris Protopopov	2d82116e80	Fix ida leak in zvol_create_minor_impl Added missing ida_simple_remove() in the error handling path. Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com> Closes #6159 Closes #6172	2017-05-26 17:50:25 -07:00
Alek P	9210e43a16	Don't dirty bpobj if it has no entries In certain cases (dsl_scan_sync() is one), we may end up calling bpobj_iterate() on an empty bpobj. Even though we don't end up modifying the bpobj it still gets dirtied, causing unneeded writes to the pool. This patch adds an early bail from bpobj_iterate_impl() if bpobj is empty to prevent unneeded writes. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Alek Pinchuk <apinchuk@datto.com> Closes #6164	2017-05-26 11:42:10 -07:00
Brian Behlendorf	261c013fbf	Revert "Fix "snapdev" property inheritance behaviour" This reverts commit `959f56b993`. An issue was uncovered by the new zvol_misc_snapdev test case which needs to be investigated and resolved. Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6174 Issue #6131	2017-05-26 11:40:44 -07:00
Alan Somers	0071036526	OpenZFS 8070 - Add some ZFS comments Authored by: Alan Somers <asomers@gmail.com> Reviewed by: Yuri Pankov <yuri.pankov@gmail.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: bunder2015 <omfgbunder@gmail.com> OpenZFS-issue: https://www.illumos.org/issues/8070 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/40713f2 Closes #6160	2017-05-25 17:15:06 -07:00
LOLi	959f56b993	Fix "snapdev" property inheritance behaviour When inheriting the "snapdev" property to we don't always call zfs_prop_set_special(): this prevents device nodes from being created in certain situations. Because "snapdev" is the only special property that is also inheritable we need to call zfs_prop_set_special() even when we're not reverting it to the received value ('zfs inherit -S'). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #6131	2017-05-25 16:43:46 -07:00
LOLi	3e6c943347	Linux 4.12 compat: fix super_setup_bdi_name() call Provide a format parameter to super_setup_bdi_name() so we don't create duplicate names in '/devices/virtual/bdi' sysfs namespace which would prevent us from mounting more than one ZFS filesystem at a time. Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #6147	2017-05-25 09:55:55 -07:00
Feng Sun	f871ab6ea2	Fix LZ4_uncompress_unknownOutputSize caused panic Sync with kernel patches for lz4 https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/log/lib/lz4 4a3a99 lz4: add overrun checks to lz4_uncompress_unknownoutputsize() d5e7ca LZ4 : fix the data abort issue bea2b5 lib/lz4: Pull out constant tables 99b7e9 lz4: fix system halt at boot kernel on x86_64 Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Feng Sun <loyou85@gmail.com> Closes #5975 Closes #5973	2017-05-19 13:45:46 -07:00
Alek P	bec1067d54	Implemented zpool sync command This addition will enable us to sync an open TXG to the main pool on demand. The functionality is similar to 'sync(2)' but 'zpool sync' will return when data has hit the main storage instead of potentially just the ZIL as is the case with the 'sync(2)' cmd. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Alek Pinchuk <apinchuk@datto.com> Closes #6122	2017-05-19 12:33:11 -07:00
Tony Hutter	4a283c7f77	Force fault a vdev with 'zpool offline -f' This patch adds a '-f' option to 'zpool offline' to fault a vdev instead of bringing it offline. Unlike the OFFLINE state, the FAULTED state will trigger the FMA code, allowing for things like autoreplace and triggering the slot fault LED. The -f faults persist across imports, unless they were set with the temporary (-t) flag. Both persistent and temporary faults can be cleared with zpool clear. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #6094	2017-05-19 12:30:16 -07:00
Tom Caputi	a32df59e18	Fixed small memory leak in ereport handling One pre-check in zfs_ereport_start() was being called after the nvlists were being allocated. This simply corrects that issue. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #6140	2017-05-18 17:35:49 -07:00
Boris Protopopov	5559ba094f	Introduce zv_state_lock The lock is designed to protect internal state of zvol_state_t and to avoid taking spa_namespace_lock (e.g. in dmu_objset_own() code path) while holding zvol_stat_lock. Refactor the code accordingly. Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3484 Closes #6065 Closes #6134	2017-05-16 19:44:06 -04:00
Boris Protopopov	07783588bc	Revert commit `1ee159f4` Fix lock order inversion with zvol_open() as it did not account for use of zvols as vdevs. The latter use cases resulted in the lock order inversion deadlocks that involved spa_namespace_lock and bdev->bd_mutex. Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #6065 Issue #6134	2017-05-16 19:42:47 -04:00
Isaac Huang	3d6da72d18	Skip spurious resilver IO on raidz vdev On a raidz vdev, a block that does not span all child vdevs, excluding its skip sectors if any, may not be affected by a child vdev outage or failure. In such cases, the block does not need to be resilvered. However, current resilver algorithm simply resilvers all blocks on a degraded raidz vdev. Such spurious IO is not only wasteful, but also adds the risk of overwriting good data. This patch eliminates such spurious IOs. Reviewed-by: Gvozden Neskovic <neskovic@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Isaac Huang <he.huang@intel.com> Closes #5316	2017-05-12 17:28:03 -07:00
Matthew Ahrens	4747a7d3d4	OpenZFS 8063 - verify that we do not attempt to access inactive txg Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> A standard practice in ZFS is to keep track of "per-txg" state. Any of the 3 active TXG's (open, quiescing, syncing) can have different values for this state. We should assert that we do not attempt to modify other (inactive) TXG's. Porting Notes: - ASSERTV added to txg_sync_waiting() for unused variable. OpenZFS-issue: https://www.illumos.org/issues/8063 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/01acb46 Closes #6109	2017-05-10 13:52:22 -04:00
Matthew Ahrens	335b251ac1	OpenZFS 8166 - zpool scrub thinks it repaired offline device Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Matthew Ahrens <mahrens@delphix.com> If we do a scrub while a leaf device is offline (via "zpool offline"), we will inadvertently clear the DTL (dirty time log) of the offline device, even though it is still damaged. When the device comes back online, we will incompletely resilver it, thinking that the scrub repaired blocks written before the scrub was started. The incomplete resilver can lead to data loss if there is a subsequent failure of a different leaf device. The fix is to never clear the DTL of offline devices. Note that if a device is onlined while a scrub is in progress, the scrub will be restarted. The problem can be worked around by running "zpool scrub" after "zpool online". OpenZFS-issue: https://www.illumos.org/issues/8166 OpenZFS-commit: https://github.com/openzfs/openzfs/pull/372 Closes #5806 Closes #6103	2017-05-10 10:32:39 -07:00
Tom Caputi	f486f58440	Add missing arc_free_cksum() to arc_release() The arc layer tracks checksums of its data in the arc header so that it can ensure that buffers haven't changed when they're not supposed to. This checksum is only maintained while there is an uncompressed buffer still attached to the header. Unfortunately there is a missing call to arc_free_cksum() in arc_release() that can trigger ASSERTs. This has not been a common issue because the checksums are only maintained for debug builds and triggering the bug requires writing a block (and therefore calling arc_release()) while a compressed buffer is still being used on a debug build. This simply corrects the issue. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #6105	2017-05-10 10:25:27 -07:00
Brian Behlendorf	2946a1a15a	Linux 4.12 compat: CURRENT_TIME removed Linux 4.9 added current_time() as the preferred interface to get the filesystem time. CURRENT_TIME was retired in Linux 4.12. Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6114	2017-05-10 09:30:48 -07:00
LOLi	a3eeab2de6	Add property overriding (-o\|-x) to 'zfs receive' This allows users to specify "-o property=value" to override and "-x property" to exclude properties when receiving a zfs send stream. Both native and user properties can be specified. This is useful when using zfs send/receive for periodic backup/replication because it lets users change properties such as canmount, mountpoint, or compression without modifying the source. References: https://www.illumos.org/issues/2745 https://www.illumos.org/issues/3753 Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Alek Pinchuk <apinchuk@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #1350 Closes #5349	2017-05-09 16:21:09 -07:00
Chunwei Chen	e624cd1959	Linux 4.12 compat: PF_FSTRANS was removed zfsonlinux/spl@8f87971 added __spl_pf_fstrans_check for the xfs related check, so we use them accordingly. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #6113	2017-05-09 10:38:46 -07:00
Brian Behlendorf	1eab430af7	Fix unused variable warning Remove the lz4_ac local variable from dmu_write_policy() to resolve the following unused variable warning on non-debug builds. dmu.c: In function ‘dmu_write_policy’: dmu.c:1892:12: warning: unused variable ‘lz4_ac’ [-Wunused-variable] boolean_t lz4_ac = spa_feature_is_active(os->os_spa, Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2017-05-05 10:23:58 -07:00
Gvozden Neskovic	c17486b217	Add missing _destroy/_fini calls The proposed debugging enhancements in zfsonlinux/spl#587 identified the following missing _destroy/_fini calls. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Closes #5428	2017-05-04 19:26:28 -04:00
Brian Behlendorf	8fa5250f5d	Default to zvol_request_async=0 Change the default ZVOL behavior so requests are handled asynchronously. This behavior is functionally the same as in the zfs-0.6.4 release. Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #5902	2017-05-04 18:01:50 -04:00
Richard Yao	bc17f1047a	Enable Linux read-ahead for a single page on ZVOLs Linux has read-ahead logic designed to accelerate sequential workloads. ZFS has its own read-ahead logic called zprefetch that operates on both ZVOLs and datasets. Having two prefetchers active at the same time can cause overprefetching, which unnecessarily reduces IOPS performance on CoW filesystems like ZFS. Testing shows that entirely disabling the Linux prefetch results in a significant performance penalty for reads while commensurate benefits are seen in random writes. It appears that read-ahead benefits are inversely proportional to random write benefits, and so a single page of Linux-layer read-ahead appears to offer the middle ground for both workloads. Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Issue #5902	2017-05-04 18:00:27 -04:00
RageLtMan	5731140eaf	Disable write merging on ZVOLs The current ZVOL implementation does not explicitly set merge options on ZVOL device queues, which results in the default merge behavior. Explicitly set QUEUE_FLAG_NOMERGES on ZVOL queues allowing the ZIO pipeline to do its work. Initial benchmarks (tiotest with no O_DIRECT) show random write performance going up almost 3X on 8K ZVOLs, even after significant rewrites of the logical space allocation. Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: RageLtMan <rageltman@sempervictus> Issue #5902	2017-05-04 17:59:52 -04:00
Olaf Faaland	9d3f7b8791	Write label 2,3 uberblocks when vdev expands When vdev_psize increases, the location of labels 2 and 3 changes because their location is relative to the end of the device. The configs for labels 2 and 3 are written during the next spa_sync() because the vdev is added to the dirty config list. However, the uberblock rings are not re-written in their new location, leaving the device vulnerable to the beginning of the device being overwritten or damaged. This patch copies the uberblock ring from label 0 to labels 2 and 3, in their new locations, at the next sync after vdev_psize increases. Also, add a test zpool_expand_004_pos.ksh to confirm the uberblocks are copied. Reviewed-by: BearBabyLiu <liu.huang@zte.com.cn> Reviewed-by: Andreas Dilger <andreas.dilger@intel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #5108	2017-05-02 13:55:24 -07:00
Debabrata Banerjee	03b60eee78	Allow scaling of arc in proportion to pagecache When multiple filesystems are in use, memory pressure causes arc_cache to collapse to a minimum. Allow arc_cache to maintain proportional size even when hit rates are disproportionate. We do this only via evictable size from the kernel shrinker, thus it's only in effect under memory pressure. AKAMAI: zfs: CR 3695072 Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com> Closes #6035	2017-05-02 15:50:49 -04:00
Debabrata Banerjee	4149bf498a	Correct signed operation Could return the wrong pages value AKAMAI: zfs: CR 3695072 Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com> Issue #6035	2017-05-02 15:50:26 -04:00
Debabrata Banerjee	44813aefad	Don't run the reaper if we didn't shrink the cache Calling it when nothing is evictable will cause extra kswapd cpu. Also if we didn't shrink it's unlikely to have memory to reap because we likely just called it microseconds ago. The exception is if we are in direct reclaim. You can see how hard this is being hit in kswapd with a light test workload: 34.95% [zfs] [k] arc_kmem_reap_now 5.40% [spl] [k] spl_kmem_cache_reap_now 3.79% [kernel] [k] _raw_spin_lock 2.86% [spl] [k] __spl_kmem_cache_generic_shrinker.isra.7 2.70% [kernel] [k] shrink_slab.part.37 1.93% [kernel] [k] isolate_lru_pages.isra.43 1.55% [kernel] [k] __wake_up_bit 1.20% [kernel] [k] super_cache_count 1.20% [kernel] [k] __radix_tree_lookup With ZFS just mounted but only ext4/pagecache memory pressure arc_kmem_reap_now still consumes excessive CPU: 12.69% [kernel] [k] isolate_lru_pages.isra.43 10.76% [kernel] [k] free_pcppages_bulk 7.98% [kernel] [k] drop_buffers 7.31% [kernel] [k] shrink_page_list 6.44% [zfs] [k] arc_kmem_reap_now 4.19% [kernel] [k] free_hot_cold_page 4.00% [kernel] [k] __slab_free 3.95% [kernel] [k] __isolate_lru_page 3.09% [kernel] [k] __radix_tree_lookup Same pagecache only workload as above with this patch series: 11.58% [kernel] [k] isolate_lru_pages.isra.43 11.20% [kernel] [k] drop_buffers 9.67% [kernel] [k] free_pcppages_bulk 8.44% [kernel] [k] shrink_page_list 4.86% [kernel] [k] __isolate_lru_page 4.43% [kernel] [k] free_hot_cold_page 4.00% [kernel] [k] __slab_free 3.44% [kernel] [k] __radix_tree_lookup (arc_kmem_reap_now has 0 samples in perf) AKAMAI: zfs: CR 3695042 Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com> Issue #6035	2017-05-02 15:50:13 -04:00
Debabrata Banerjee	1a31dcf53c	Only wakeup waiters if we've actually done work AKAMAI: zfs: CR 3695072 Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com> Issue #6035	2017-05-02 15:50:02 -04:00
Debabrata Banerjee	2e91c2fb1a	Do not stop kernel shrinker on lock contention Lock contention, by itself, shouldn't indicate a stop condition to the kernel's slab shrinker. Doing so can cause stalls when the kernel is trying to free large parts of the cache such as is done by drop_caches Also, perhaps arc_reclaim_lock should be a spinlock, and this code eliminated. AKAMAI: zfs: CR 3593801 Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com> Issue #6035	2017-05-02 15:49:48 -04:00
Debabrata Banerjee	b855550c33	Stop double reclaiming or not reclaiming at all Move arcstat_need_free increment from all direct calls to when arc_reclaim_lock is busy and we exit wihout doing anything. Data will be reclaimed in reclaim thread. The previous location meant that we both reclaim the memory in this thread, and also schedule the same amount of memory for reclaim in arc_reclaim, effectively doubling the requested reclaim. AKAMAI: zfs: CR 3695072 Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com> Issue #6035	2017-05-02 15:49:36 -04:00

1 2 3 4 5 ...

1581 Commits