mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-06-09 23:46:38 +03:00

Author	SHA1	Message	Date
Joe Stein	e0ab3ab553	OpenZFS 6736 - ZFS per-vdev ZAPs 6736 ZFS per-vdev ZAPs Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Don Brady <don.brady@intel.com> Reviewed by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/6736 https://github.com/openzfs/openzfs/commit/215198a Ported-by: Don Brady <don.brady@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4515	2016-05-02 14:27:45 -07:00
Nikolay Borisov	4cd77889b6	Kill znode->z_gen field This field is a duplicate of the inode->i_generation, so just kill it Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4538	2016-05-02 11:22:31 -07:00
Tim Chase	ddab862d4c	Enable PF_FSTRANS for ioctl secpolicy callbacks (#4571 ) At the very least, the zfs_secpolicy_write_perms ioctl security policy callback, which calls dsl_dataset_hold(), can require freeing memory and, therefore, re-enter ZFS. This patch enables PF_FSTRANS for all of the security policy callbacks similarly to the manner in which it's enabled for the actual ioctl callback. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4554	2016-05-02 10:00:50 -07:00
Vitaut Bajaryn	ef1c27117b	module/.gitignore: Add *.dwo (#4580 ) These files get generated when CONFIG_DEBUG_INFO_DWARF4 is enabled in Linux .config. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4580	2016-05-02 09:07:04 -07:00
Alex Reece	463a8cfe2b	Illumos 6844 - dnode_next_offset can detect fictional holes 6844 dnode_next_offset can detect fictional holes Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> dnode_next_offset is used in a variety of places to iterate over the holes or allocated blocks in a dnode. It operates under the premise that it can iterate over the blockpointers of a dnode in open context while holding only the dn_struct_rwlock as reader. Unfortunately, this premise does not hold. When we create the zio for a dbuf, we pass in the actual block pointer in the indirect block above that dbuf. When we later zero the bp in zio_write_compress, we are directly modifying the bp. The state of the bp is now inconsistent from the perspective of dnode_next_offset: the bp will appear to be a hole until zio_dva_allocate finally finishes filling it in. In the meantime, dnode_next_offset can detect a hole in the dnode when none exists. I was able to experimentally demonstrate this behavior with the following setup: 1. Create a file with 1 million dbufs. 2. Create a thread that randomly dirties L2 blocks by writing to the first L0 block under them. 3. Observe dnode_next_offset, waiting for it to skip over a hole in the middle of a file. 4. Do dnode_next_offset in a loop until we skip over such a non-existent hole. The fix is to ensure that it is valid to iterate over the indirect blocks in a dnode while holding the dn_struct_rwlock by passing the zio a copy of the BP and updating the actual BP in dbuf_write_ready while holding the lock. References: https://www.illumos.org/issues/6844 https://github.com/openzfs/openzfs/pull/82 DLPX-35372 Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4548	2016-04-27 16:24:15 -07:00
Josef 'Jeff' Sipek	8a5fc74880	Illumos 6659 - nvlist_free(NULL) is a no-op 6659 nvlist_free(NULL) is a no-op Reviewed by: Toomas Soome <tsoome@me.com> Reviewed by: Marcel Telka <marcel@telka.sk> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/6659 https://github.com/illumos/illumos-gate/commit/aab83bb Ported-by: David Quigley <dpquigl@davequigley.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4566	2016-04-27 15:58:23 -07:00
Tim Chase	ea2633ad26	Clear PF_FSTRANS over spl_filp_fallocate() The problem described in `2a5d574` also applies to XFS's file or inode fallocate method. Both paths may trigger writeback and expose this issue, see the full stack below. When layered on XFS a warning will be emitted under CentOS7 when entering either the file or inode fallocate method with PF_FSTRANS already set. To avoid triggering this error PF_FSTRANS is cleared and then reset in vn_space(). WARNING: at fs/xfs/xfs_aops.c:982 xfs_vm_writepage+0x58b/0x5d0 Call Trace: [<ffffffff810a1ed5>] warn_slowpath_common+0x95/0xe0 [<ffffffff810a1f3a>] warn_slowpath_null+0x1a/0x20 [<ffffffffa0231fdb>] xfs_vm_writepage+0x58b/0x5d0 [xfs] [<ffffffff81173ed7>] __writepage+0x17/0x40 [<ffffffff81176f81>] write_cache_pages+0x251/0x530 [<ffffffff811772b1>] generic_writepages+0x51/0x80 [<ffffffffa0230cb0>] xfs_vm_writepages+0x60/0x80 [xfs] [<ffffffff81177300>] do_writepages+0x20/0x30 [<ffffffff8116a5f5>] __filemap_fdatawrite_range+0xb5/0x100 [<ffffffff8116a6cb>] filemap_write_and_wait_range+0x8b/0xd0 [<ffffffffa0235bb4>] xfs_free_file_space+0xf4/0x520 [xfs] [<ffffffffa023cbce>] xfs_file_fallocate+0x19e/0x2c0 [xfs] [<ffffffffa036c6fc>] vn_space+0x3c/0x40 [spl] [<ffffffffa0434817>] vdev_file_io_start+0x207/0x260 [zfs] [<ffffffffa047170d>] zio_vdev_io_start+0xad/0x2d0 [zfs] [<ffffffffa0474942>] zio_execute+0x82/0xe0 [zfs] [<ffffffffa036ba7d>] taskq_thread+0x28d/0x5a0 [spl] [<ffffffff810c1777>] kthread+0xd7/0xf0 [<ffffffff8167de2f>] ret_from_fork+0x3f/0x70 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Nikolay Borisov <kernel@kyup.com> Closes zfsonlinux/zfs#4529	2016-04-26 11:22:43 -07:00
Brian Behlendorf	2d82ea8b11	Use udev for partition detection When ZFS partitions a block device it must wait for udev to create both a device node and all the device symlinks. This process takes a variable length of time and depends on factors such how many links must be created, the complexity of the rules, etc. Complicating the situation further it is not uncommon for udev to create and then remove a link multiple times while processing the udev rules. Given the above, the existing scheme of waiting for an expected partition to appear by name isn't 100% reliable. At this point udev may still remove and recreate think link resulting in the kernel modules being unable to open the device. In order to address this the zpool_label_disk_wait() function has been updated to use libudev. Until the registered system device acknowledges that it in fully initialized the function will wait. Once fully initialized all device links are checked and allowed to settle for 50ms. This makes it far more likely that all the device nodes will exist when the kernel modules need to open them. For systems without libudev an alternate zpool_label_disk_wait() was updated to include a settle time. In addition, the kernel modules were updated to include retry logic for this ENOENT case. Due to the improved checks in the utilities it is unlikely this logic will be invoked. However, if the rare event it is needed it will prevent a failure. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #4523 Closes #3708 Closes #4077 Closes #4144 Closes #4214 Closes #4517	2016-04-25 11:13:20 -07:00
Chunwei Chen	232604b58e	Linux 4.5 compat: Use xattr_handler->name for acl Linux 4.5 added member "name" to xattr_handler. xattr_handler which matches to whole name rather than prefix should use "name" instead of "prefix". Otherwise, kernel will return with EINVAL when it tries to resolve handlers. Also, we remove the strcmp checks when xattr_handler has name, because xattr_resolve_name will do the check for us. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4549 Closes #4537	2016-04-25 08:42:08 -07:00
Brian Behlendorf	da5e151f20	Add pn_alloc()/pn_free() functions In order to remove the HAVE_PN_UTILS wrappers the pn_alloc() and pn_free() functions must be implemented. The existing illumos implementation were used for this purpose. The `flags` argument which was used in places wrapped by the HAVE_PN_UTILS condition has beed added back to zfs_remove() and zfs_link() functions. This removes a small point of divergence between the ZoL code and upstream. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4522	2016-04-21 09:49:25 -07:00
Ned Bass	98f03691a4	Fix ZPL miswrite of default POSIX ACL Commit `4967a3e` introduced a typo that caused the ZPL to store the intended default ACL as an access ACL. Due to caching this problem may not become visible until the filesystem is remounted or the inode is evicted from the cache. Fix the typo and add a regression test. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Closes #4520	2016-04-18 11:26:55 -07:00
Colin Ian King	4903926f89	Fix inverted logic on none elevator comparison Commit `d1d7e2689d` ("cstyle: Resolve C style issues") inverted the logic on the none elevator comparison. Fix this and make it cstyle warning clean. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4507	2016-04-15 12:18:08 -07:00
Chunwei Chen	704cd0758a	Enable lazytime semantic for atime Linux 4.0 introduces lazytime. The idea is that when we update the atime, we delay writing it to disk for as long as it is reasonably possible. When lazytime is enabled, dirty_inode will be called with only I_DIRTY_TIME flag whenever i_atime is updated. So under such condition, we will set z_atime_dirty. We will only write it to disk if file is closed, inode is evicted or setattr is called. Ideally, we should also write it whenever SA is going to be updated, but it is left for future improvement. There's one thing that we should take care of now that we allow i_atime to be dirty. In original implementation, whenever SA is modified, zfs_inode_update will be called to overwrite every thing in inode. This will cause dirty i_atime to be discarded. We fix this by don't overwrite i_atime in zfs_inode_update. We only overwrite i_atime when allocating new inode or doing zfs_rezget with zfs_inode_update_new. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4482	2016-04-05 18:55:51 -07:00
Chunwei Chen	0df9673f01	Fix atime handling and relatime The problem for atime: We have 3 places for atime: inode->i_atime, znode->z_atime and SA. And its handling is a mess. A huge part of mess regarding atime comes from zfs_tstamp_update_setup, zfs_inode_update, and zfs_getattr, which behave inconsistently with those three values. zfs_tstamp_update_setup clears z_atime_dirty unconditionally as long as you don't pass ATTR_ATIME. Which means every write(2) operation which only updates ctime and mtime will cause atime changes to not be written to disk. Also zfs_inode_update from write(2) will replace inode->i_atime with what's inside SA(stale). But doesn't touch z_atime. So after read(2) and write(2). You'll have i_atime(stale), z_atime(new), SA(stale) and z_atime_dirty=0. Now, if you do stat(2), zfs_getattr will actually replace i_atime with what's inside, z_atime. So you will have now you'll have i_atime(new), z_atime(new), SA(stale) and z_atime_dirty=0. These will all gone after umount. And you'll leave with a stale atime. The problem for relatime: We do have a relatime config inside ZFS dataset, but how it should interact with the mount flag MS_RELATIME is not well defined. It seems it wanted relatime mount option to override the dataset config by showing it as temporary in `zfs get`. But at the same time, `zfs set relatime=on\|off` would also seems to want to override the mount option. Not to mention that MS_RELATIME flag is actually never passed into ZFS, so it never really worked. How Linux handles atime: The Linux kernel actually handles atime completely in VFS, except for writing it to disk. So if we remove the atime handling in ZFS, things would just work, no matter it's strictatime, relatime, noatime, or even O_NOATIME. And whenever VFS updates the i_atime, it will notify the underlying filesystem via sb->dirty_inode(). And also there's one thing to note about atime flags like MS_RELATIME and other flags like MS_NODEV, etc. They are mount point flags rather than filesystem(sb) flags. Since native linux filesystem can be mounted at multiple places at the same time, they can all have different atime settings. So these flags are never passed down to filesystem drivers. What this patch tries to do: We remove znode->z_atime, since we won't gain anything from it. We remove most of the atime handling and leave it to VFS. The only thing we do with atime is to write it when dirty_inode() or setattr() is called. We also add file_accessed() in zpl_read() since it's not provided in vfs_read(). After this patch, only the MS_RELATIME flag will have effect. The setting in dataset won't do anything. We will make zfstuil to mount ZFS with MS_RELATIME set according to the setting in dataset in future patch. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4482	2016-04-05 18:54:55 -07:00
Brian Behlendorf	8b1899d3fb	Linux 4.6 compat: PAGE_CACHE_SIZE removal As described in torvalds/linux@4a2d057e the macros PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were originally introduced to make it possible to add bigger chunks to the page cache. This never panned out and it has therefore been removed from the kernel. ZFS has been updated to use the PAGE_{SIZE,SHIFT,MASK,ALIGN} macros and calls to page_cache_release() have been replaced with put_page(). There was no need to introduce a configure check for this because these interfaces have existed for a very long time. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Closes #4489	2016-04-05 17:26:56 -07:00
Colin Ian King	f7b939bdbc	Add support 32 bit FS_IOC32_{GET\|SET}FLAGS compat ioctls We need 32 bit userspace FS_IOC32_GETFLAGS and FS_IOC32_SETFLAGS compat ioctls for systems such as powerpc64. We use the normal compat ioctl idiom as used by a variety of file systems to provide this support. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4477	2016-03-31 17:56:12 -07:00
DHE	e3e7cf6026	gcc build error: -Wbool-compare in metaslab.c When debugging is enabled on a very recent version of gcc (tested with 5.3.0), DVA_SET_GANG(dva, !!(flags)) fails because an assertion causes a comparison between what is technically a boolean and an integer. Signed-off-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4465	2016-03-30 09:36:51 -07:00
Brian Behlendorf	c35b188246	Fix zpool_scrub_* test cases The zpool_scrub_002, zpool_scrub_003, zpool_scrub_004 test cases fail reliably when running against small pools or fast storage. This occurs because the scrub/resilver operation completes before subsequent commands can be run. A one second delay has been added to 10% of zio's in order to ensure the scrub/resilver operation will run for at least several seconds. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4450	2016-03-30 09:30:34 -07:00
Tim Chase	7bb5d92de8	Allow spawning a new thread for TQ_NOQUEUE dispatch with dynamic taskq When a TQ_NOQUEUE dispatch is done on a dynamic taskq, allow another thread to be spawned. This will cause TQ_NOQUEUE to behave similarly as it does with non-dynamic taskqs. Add support for TQ_NOQUEUE to taskq_dispatch_ent(). Signed-off-by: Tim Chase <tim@onlight.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #530	2016-03-17 09:52:35 -07:00
Alex Wilson	887d1e60ef	Illumos 6681 - zfs list burning lots of time in dodefault() via dsl_prop_* 6681 zfs list burning lots of time in dodefault() via dsl_prop_* Reviewed by: Patrick Mooney <patrick.mooney@joyent.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Matthew Ahrens <mahrens@delphix.com> References: https://www.illumos.org/issues/6681 https://github.com/illumos/illumos-gate/commit/d09e447 Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4406	2016-03-15 18:46:44 -07:00
Paul Dagnelie	c352ec27d5	Illumos 6370 - ZFS send fails to transmit some holes 6370 ZFS send fails to transmit some holes Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Chris Williamson <chris.williamson@delphix.com> Reviewed by: Stefan Ring <stefanrin@gmail.com> Reviewed by: Steven Burgess <sburgess@datto.com> Reviewed by: Arne Jansen <sensille@gmx.net> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/6370 https://github.com/illumos/illumos-gate/commit/286ef71 In certain circumstances, "zfs send -i" (incremental send) can produce a stream which will result in incorrect sparse file contents on the target. The problem manifests as regions of the received file that should be sparse (and read a zero-filled) actually contain data from a file that was deleted (and which happened to share this file's object ID). Note: this can happen only with filesystems (not zvols, because they do not free (and thus can not reuse) object IDs). Note: This can happen only if, since the incremental source (FromSnap), a file was deleted and then another file was created, and the new file is sparse (i.e. has areas that were never written to and should be implicitly zero-filled). We suspect that this was introduced by 4370 (applies only if hole_birth feature is enabled), and made worse by 5243 (applies if hole_birth feature is disabled, and we never send any holes). The bug is caused by the hole birth feature. When an object is deleted and replaced, all the holes in the object have birth time zero. However, zfs send cannot tell that the holes are new since the file was replaced, so it doesn't send them in an incremental. As a result, you can end up with invalid data when you receive incremental send streams. As a short-term fix, we can always send holes with birth time 0 (unless it's a zvol or a dataset where we can guarantee that no objects have been reused). Ported-by: Steven Burgess <sburgess@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4369 Closes #4050	2016-03-10 14:25:22 -08:00
Brian Behlendorf	a6ae97caed	Add rw_tryupgrade() This implementation of rw_tryupgrade() behaves slightly differently from its counterparts on other platforms. It drops the RW_READER lock and then acquires the RW_WRITER lock leaving a small window where no lock is held. On other platforms the lock is never released during the upgrade process. This is necessary under Linux because the kernel does not provide an upgrade function. There are currently no callers in the ZFS code where this change in behavior is a problem. In fact, in most cases the code is already written such that if the upgrade fails the RW_READER lock is dropped and the caller blocks waiting to acquire the lock as RW_WRITER. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Matthew Thode <prometheanfire@gentoo.org> Closes zfsonlinux/zfs#4388 Closes #534	2016-03-10 13:05:25 -08:00
Boris Protopopov	1ee159f423	Fix lock order inversion with zvol_open() zfsonlinux issue #3681 - lock order inversion between zvol_open() and dsl_pool_sync()...zvol_rename_minors() Remove trylock of spa_namespace_lock as it is no longer needed when zvol minor operations are performed in a separate context with no prior locking state; the spa_namespace_lock is no longer held when bdev->bd_mutex or zfs_state_lock might be taken in the code paths originating from the zvol minor operation callbacks. Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3681	2016-03-10 09:53:36 -08:00
Boris Protopopov	a0bd735adb	Add support for asynchronous zvol minor operations zfsonlinux issue #2217 - zvol minor operations: check snapdev property before traversing snapshots of a dataset zfsonlinux issue #3681 - lock order inversion between zvol_open() and dsl_pool_sync()...zvol_rename_minors() Create a per-pool zvol taskq for asynchronous zvol tasks. There are a few key design decisions to be aware of. * Each taskq must be single threaded to ensure tasks are always processed in the order in which they were dispatched. * There is a taskq per-pool in order to keep the pools independent. This way if one pool is suspended it will not impact another. * The preferred location to dispatch a zvol minor task is a sync task. In this context there is easy access to the spa_t and minimal error handling is required because the sync task must succeed. Support for asynchronous zvol minor operations address issue #3681. Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2217 Closes #3678 Closes #3681	2016-03-10 09:49:22 -08:00
Tim Chase	f764edf016	Change KM_SLEEP to TQ_SLEEP in spa_deadman() Since they both evaluate to zero, this is a semi-cosmetic change but the latter is the proper value to use as an argument to taskq_dispatch_delay(). Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4393	2016-03-09 10:41:31 -08:00
ab-oe	513168abd2	Make zvol update volsize operation synchronous. There is a race condition when new transaction group is added to dp->dp_dirty_datasets list by the zap_update in the zvol_update_volsize. Meanwhile, before these dirty data are synchronized, the receive process can cause that dmu_recv_end_sync is executed. Then finally dirty data are going to be synchronized but the synchronization ends with the NULL pointer dereference error. Signed-off-by: ab-oe <arkadiusz.bubala@open-e.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4116	2016-02-29 09:07:27 -08:00
smh	9f500936c8	FreeBSD r256956: Improve ZFS N-way mirror read performance by using load and locality information. The existing algorithm selects a preferred leaf vdev based on offset of the zio request modulo the number of members in the mirror. It assumes the devices are of equal performance and that spreading the requests randomly over both drives will be sufficient to saturate them. In practice this results in the leaf vdevs being under utilized. The new algorithm takes into the following additional factors: * Load of the vdevs (number outstanding I/O requests) * The locality of last queued I/O vs the new I/O request. Within the locality calculation additional knowledge about the underlying vdev is considered such as; is the device backing the vdev a rotating media device. This results in performance increases across the board as well as significant increases for predominantly streaming loads and for configurations which don't have evenly performing devices. The following are results from a setup with 3 Way Mirror with 2 x HD's and 1 x SSD from a basic test running multiple parrallel dd's. With pre-fetch disabled (vfs.zfs.prefetch_disable=1): == Stripe Balanced (default) == Read 15360MB using bs: 1048576, readers: 3, took 161 seconds @ 95 MB/s == Load Balanced (zfslinux) == Read 15360MB using bs: 1048576, readers: 3, took 297 seconds @ 51 MB/s == Load Balanced (locality freebsd) == Read 15360MB using bs: 1048576, readers: 3, took 54 seconds @ 284 MB/s With pre-fetch enabled (vfs.zfs.prefetch_disable=0): == Stripe Balanced (default) == Read 15360MB using bs: 1048576, readers: 3, took 91 seconds @ 168 MB/s == Load Balanced (zfslinux) == Read 15360MB using bs: 1048576, readers: 3, took 108 seconds @ 142 MB/s == Load Balanced (locality freebsd) == Read 15360MB using bs: 1048576, readers: 3, took 48 seconds @ 320 MB/s In addition to the performance changes the code was also restructured, with the help of Justin Gibbs, to provide a more logical flow which also ensures vdevs loads are only calculated from the set of valid candidates. The following additional sysctls where added to allow the administrator to tune the behaviour of the load algorithm: * vfs.zfs.vdev.mirror.rotating_inc * vfs.zfs.vdev.mirror.rotating_seek_inc * vfs.zfs.vdev.mirror.rotating_seek_offset * vfs.zfs.vdev.mirror.non_rotating_inc * vfs.zfs.vdev.mirror.non_rotating_seek_inc These changes where based on work started by the zfsonlinux developers: https://github.com/zfsonlinux/zfs/pull/1487 Reviewed by: gibbs, mav, will MFC after: 2 weeks Sponsored by: Multiplay References: https://github.com/freebsd/freebsd@5c7a6f5d https://github.com/freebsd/freebsd@31b7f68d https://github.com/freebsd/freebsd@e186f564 Performance Testing: https://github.com/zfsonlinux/zfs/pull/4334#issuecomment-189057141 Porting notes: - The tunables were adjusted to have ZoL-style names. - The code was modified to use ZoL's vd_nonrot. - Fixes were done to make cstyle.pl happy - Merge conflicts were handled manually - freebsd/freebsd@e186f564bc by my collegue Andriy Gapon has been included. It applied perfectly, but added a cstyle regression. - This replaces `556011dbec` entirely. - A typo "IO'a" has been corrected to say "IO's" - Descriptions of new tunables were added to man/man5/zfs-module-parameters.5. Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4334	2016-02-26 11:24:35 -08:00
Brian Behlendorf	8a09d5fd46	Add l2arc_max_block_size tunable Set a limit for the largest compressed block which can be written to an L2ARC device. By default this limit is set to 16M so there is no change in behavior. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Elling <Richard.Elling@RichardElling.com> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #4323	2016-02-25 09:44:00 -08:00
Boris Protopopov	5428dc51fb	Make zvol minor functionality more robust Close the race window in zvol_open() to prevent removal of zvol_state in the 'first open' code path. Move the call to check_disk_change() under zvol_state_lock to make sure the zvol_media_changed() and zvol_revalidate_disk() called by check_disk_change() are invoked with positive zv_open_count. Skip opened zvols when removing minors and set private_data to NULL for zvols that are not in use whose minors are being removed, to indicate to zvol_open() that the state is gone. Skip opened zvols when renaming minors to avoid modifying zv_name that might be in use, e.g. in zvol_ioctl(). Drop zvol_state_lock before calling add_disk() when creating minors to avoid deadlocks with zvol_open(). Wrap dmu_objset_find() with spl_fstran_mark()/unmark(). Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Closes #4344	2016-02-24 11:54:24 -08:00
Richard Yao	19a47cb1c2	Call dmu_read_uio_dbuf() in zvol_read() The difference between `dmu_read_uio()` and `dmu_read_uio_dbuf()` is that the former takes a hold while the latter uses an existing hold. `zfs_read()` in the ZPL will use `dmu_read_uio_dbuf()` while our analogous `zvol_write()` will use `dmu_write_uio_dbuf()`, but for no apparent reason, we inherited a `zvol_read()` function from OpenSolaris that does `dmu_read_uio()`. illumos-gate also still uses `dmu_read_uio()` to this day. Lets switch to `dmu_read_uio_dbuf()`, which is more performant. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Closes #4316	2016-02-18 10:24:33 -08:00
Richard Yao	a765a34a31	Clean up zvol request processing to pass uio and fix porting regressions In illumos-gate, `zvol_read` and `zvol_write` are both passed uio_t rather than bio_t. Since we are translating from bio to uio for both, we might as well unify the logic and have code more similar to its illumos counterpart. At the same time, we can fix some regressions that occurred versus the original code from illumos-gate. We refactor zvol_write to take uio and also correct the following problems: 1. We did `dnode_hold()` on each IO when we already had a hold. 2. We would attempt to send writes that exceeded `DMU_MAX_ACCESS` to the DMU. 3. We could call `zil_commit()` twice. In this case, this is because Linux uses the `->write` function to send flushes and can aggregate the flush with a write. If a synchronous write occurred with the flush, we effectively flushed twice when there is no need to do that. zvol_read also suffers from the first two problems. Other platforms suffer from the first, so we leave that for a second patch so that there is a discrete patch for them to cherry-pick. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Closes #4316	2016-02-18 10:23:30 -08:00
Chunwei Chen	093911f194	Remove wrong ASSERT in annotate_ecksum When using large blocks like 1M, there will be more than UINT16_MAX qwords in one block, so this ASSERT would go off. Also, it is possible for the histogram to overflow. We cap them to UINT16_MAX to prevent this. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4257	2016-02-17 10:43:02 -08:00
Richard Yao	0b43696e66	random_get_pseudo_bytes() need not provide cryptographic strength entropy Perf profiling of dd on a zvol revealed that my system spent 3.16% of its time in random_get_pseudo_bytes(). No SPL consumers need cryptographic strength entropy, so we can reduce our overhead by changing the implementation to utilize a fast PRNG. The Linux kernel did not export a suitable PRNG function until it exported get_random_int() in Linux 3.10. While we could implement an autotools check so that we use it when it is available or even try to access the symbol on older kernels where it is not exported using the fact that it is exported on newer ones as justification, we can instead implement our own pseudo-random data generator. For this purpose, I have written one based on a 128-bit pseudo-random number generator proposed in a paper by Sebastiano Vigna that itself was based on work by the late George Marsaglia. http://vigna.di.unimi.it/ftp/papers/xorshiftplus.pdf Profiling the same benchmark with an earlier variant of this patch that used a slightly different generator (roughly same number of instructions) by the same author showed that time spent in random_get_pseudo_bytes() dropped to 0.06%. That is a factor of 50 improvement. This particular generator algorithm is also well known to be fast: http://xorshift.di.unimi.it/#speed The benchmark numbers there state that it runs at 1.12ns/64-bits or 7.14 GBps of throughput on an Intel Core i7-4770 in what is presumably a single-threaded context. Using it in `random_get_pseudo_bytes()` in the manner I have will probably not reach that level of performance, but it should be fairly high and many times higher than the Linux `get_random_bytes()` function that we use now, which runs at 16.3 MB/s on my Intel Xeon E3-1276v3 processor when measured by using dd on /dev/urandom. Also, putting this generator's seed into per-CPU variables allows us to eliminate overhead from both spin locks and CPU memory barriers, which is NUMA friendly. We could have alternatively modified consumers to use something like `gethrtime() % 3` as suggested by both Matthew Ahrens and Tim Chase, but that has a few potential problems that this approach avoids: 1. Switching to `gethrtime() % 3` in hot code paths today requires diverging from illumos-gate and does nothing about potential future patches from illumos-gate that call our slow `random_get_pseudo_bytes()` in different hot code paths. Reimplementing `random_get_pseudo_bytes()` with a per-CPU PRNG avoids both of those things entirely, which means less work for us in the future. 2. Looking at the code that implements `gethrtime()`, I think it is unlikely to be faster than this per-CPU PRNG implementation of `random_get_pseudo_bytes()`. It would be best to go with something fast now so that there is no point in revisiting this from a performance perspective. 3. `gethrtime() % 3` can vary in behavior from system to system based on kernel version, architecture and clock source. In comparison, this per-CPU PRNG is about ~40 lines of code in `random_get_pseudo_bytes()` that should behave consistently across all systems regardless of kernel version, system architecture or machine clock source. It is unlikely that we would ever need to revisit this per-CPU PRNG while the same cannot be said for `gethrtime() % 3`. 4. `gethrtime()` uses CPU memory barriers and maybe atomic instructions depending on the clock source, so replacing `random_get_pseudo_bytes()` with `gethrtime()` in hot code paths could still require a future person working on NUMA scalability to reimplement it anyway while this per-CPU PRNG would not by virtue of using neither CPU memory barriers nor atomic instructions. Note that I did not check various clock sources for the presence of atomic instructions. There is simply too much code to read and given the drawbacks versus this per-cpu PRNG, there is no point in being certain. 5. I have heard of instances where poor quality pseudo-random numbers caused problems for HPC code in ways that took more than a year to identify and were remedied by switching to a higher quality source of pseudo-random numbers. While filesystems are different than HPC code, I do not think it is impossible for us to have instances where poor quality pseudo-random numbers can cause problems. Opting for a well studied PRNG algorithm that passes tests for statistical randomness over changing callers to use `gethrtime() % 3` bypasses the need to think about both whether poor quality pseudo-random numbers can cause problems and the statistical quality of numbers from `gethrtime() % 3`. 6. `gethrtime()` calls `getrawmonotonic()`, which uses seqlocks. This is probably not a huge issue, but anyone using kgdb would never be able to step through a seqlock critical section, which is not a problem either now or with the per-CPU PRNG: https://en.wikipedia.org/wiki/Seqlock The only downside that I can see is that this code's memory requirement is O(N) where N is NR_CPUS, versus the current code and `gethrtime() % 3`, which are O(1), but that should not be a problem. The seeds will use 64KB of memory at the high end (i.e `NR_CPU == 4096`) and 16 bytes of memory at the low end (i.e. `NR_CPU == 1`). In either case, we should only use a few hundred bytes of code for text, especially since `spl_rand_jump()` should be inlined into `spl_random_init()`, which should be removed during early boot as part of "Freeing unused kernel memory". In either case, the memory requirements are minuscule. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #372	2016-02-17 09:49:09 -08:00
Paul Dagnelie	6b42ea8590	Illumos 5809 - Blowaway full receive in v1 pool causes kernel panic 5809 Blowaway full receive in v1 pool causes kernel panic Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Will Andrews <will@freebsd.org> Approved by: Gordon Ross <gwr@nexenta.com> References: https://www.illumos.org/issues/5809 https://github.com/illumos/illumos-gate/commit/f40b29c Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-02-08 09:37:55 -08:00
Chunwei Chen	8f3b403a73	Allow kicking a taskq to spawn more threads This patch add a module parameter spl_taskq_kick. When writing non-zero value to it, it will scan all the taskq, if a taskq contains a task pending for more than 5 seconds, it will be forced to spawn a new thread. This is use as an emergency recovery from deadlock, not a general solution. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #529	2016-02-05 14:08:31 -08:00
Gary Mills	7c9abfa7ab	Illumos 6537 - Panic on zpool scrub with DEBUG kernel 6537 Panic on zpool scrub with DEBUG kernel Reviewed by: Steve Gonczi <gonczi@comcast.net> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Matthew Ahrens <mahrens@delphix.com> References: https://www.illumos.org/issues/6537 https://github.com/illumos/illumos-gate/commit/8c04a1f Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-02-05 11:29:32 -08:00
Dan McDonald	a4d179efa9	Illumos 6096 - ZFS_SMB_ACL_RENAME needs to cleanup better 6096 ZFS_SMB_ACL_RENAME needs to cleanup better Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Gordon Ross <gordon.w.ross@gmail.com> Reviewed by: George Wilson <gwilson@zfsmail.com> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/6096 https://github.com/illumos/illumos-gate/commit/8f5190a5 Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-02-05 11:28:53 -08:00
Matthew Ahrens	b77222c873	Illumos 6450 - scrub/resilver unnecessarily traverses snapshots 6450 scrub/resilver unnecessarily traverses snapshots created after the scrub started Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/6450 https://github.com/illumos/illumos-gate/commit/38d6103 Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-02-02 14:00:24 -08:00
Richard Sharpe	9d36cdb650	Handling negative dentries in a CI file system. For a Case Insensitive file system we must avoid creating negative entries in the dentry cache. We must also pass the FIGNORECASE into zfs_lookup so that special files are handled correctly. We must also prevent negative dentries from being created when files are unlinked. Tested by running fsstress from LTP (10 loops, 10 processes, 10,000 ops.) Also tested with printks (now removed) to ensure that lookups come to zpl_lookup when negative should not exist. Tests: 1. ls Some-file.txt; touch some-file.txt; ls Some-file.txt and ensure no errors. 2. touch Some-file.txt; rm some-file.txt; ls Some-file.txt and ensure that the last ls shows log messages showing the lookup went all the way to zpl_lookup. Thanks to tuxoko for helping me get this correct. Signed-off-by: Richard Sharpe <realrichardsharpe@gmail.com> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4243	2016-02-02 13:59:00 -08:00
Jorgen Lundman	4b9ed698b4	Illumos 6527 - Possible access beyond end of string in zpool comment 6527 Possible access beyond end of string in zpool comment Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Gordon Ross <gwr@nexenta.com> References: https://www.illumos.org/issues/6527 https://github.com/illumos/illumos-gate/commit/2bd7a8d Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <tuxoko@gmail.com>	2016-01-28 12:46:04 -05:00
Brian Behlendorf	e56766360b	Illumos 6495 - Fix mutex leak in dmu_objset_find_dp 6495 Fix mutex leak in dmu_objset_find_dp Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Albert Lee <trisk@omniti.com> References: https://www.illumos.org/issues/6495 https://github.com/illumos/illumos-gate/commit/2bad225 Ported-by: Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <tuxoko@gmail.com>	2016-01-28 12:45:39 -05:00
Brian Behlendorf	b6fcb792ca	Illumos 6414 - vdev_config_sync could be simpler 6414 vdev_config_sync could be simpler Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/6414 https://github.com/illumos/illumos-gate/commit/eb5bb58 Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <tuxoko@gmail.com>	2016-01-28 12:44:39 -05:00
Simon Klinkert	1a04bab348	llumos 6334 - Cannot unlink files when over quota 6334 Cannot unlink files when over quota Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Toomas Soome <tsoome@me.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/6334 https://github.com/illumos/illumos-gate/commit/6575bca Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-01-26 15:27:08 -08:00
kernelOfTruth	a966c5640e	Reintroduce zfs_remove() synchronous deletes Reintroduce a slightly adapted version of the Illumos logic for synchronous unlinks. The basic idea here is that only files smaller than zfs_delete_blocks (20480) blocks should be deleted synchronously. Unlinking larger files should be handled asynchronously to minimize impact to the caller. To accomplish this iput() which is responsible for calling zfs_znode_delete() on Linux is only called in the delete_now path. Otherwise zfs_async_iput() is used which allows the last reference to be dropped by a taskq thread effectively making the removal asynchronous. Porting notes: - Add zfs_delete_blocks module option for performance analysis. The default value is DMU_MAX_DELETEBLKCNT which is the same as upstream. Reducing this value means that smaller files will be unlinked asynchronously like large files. - All occurrences of zfsvfs changes to zsb. Ported-by: KernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-01-26 15:26:02 -08:00
Dan McDonald	460a021391	Log zvol truncate/discard operations As the comments in zvol_discard() suggested, the discard operation could be logged to the zil. This is a port of the relevant code from Nexenta as it was added in "701 UNMAP support for COMSTAR" and has been attributed to the author of that commit. References: https://github.com/Nexenta/illumos-nexenta/commit/b77b923 https://github.com/zfsonlinux/zfs/blob/089fa91b/module/zfs/zvol.c#L637 Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-01-26 14:16:03 -08:00
George Wilson	ba5ad9a48d	Illumos 6251 - add tunable to disable free_bpobj processing 6251 - add tunable to disable free_bpobj processing Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Simon Klinkert <simon.klinkert@gmail.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed by: Albert Lee <trisk@omniti.com> Reviewed by: Xin Li <delphij@freebsd.org> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/6251 https://github.com/illumos/illumos-gate/commit/139510f Porting notes: - Added as module option declaration. - Added to zfs-module-parameters.5 man page. Ported-by: Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-01-25 13:15:17 -08:00
Tim Chase	0a1f8cd999	Set arc_c_min properly in userland builds Since it's set to arc_c_max / 2, it must be set after arc_c_max is set. Also added protection against it falling below 2 * maxblocksize in userland builds. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4268	2016-01-25 10:25:10 -08:00
Tim Chase	1b8951b319	Prevent arc_c collapse Adjusting arc_c directly is racy because it can happen in the context of multiple threads. It should always be >= 2 * maxblocksize. Set it to a known valid value rather than adjusting it directly. In addition refactor arc_shrink() to a simpler structure, protect against underflow in the calculation of the new arc_c value. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Reverts: `935434ef` Closes: #3904 Closes: #4161	2016-01-25 10:25:04 -08:00
Brian Behlendorf	6b38e7510f	Remove RLIM64_INFINITY assert in vn_rdwr() Previous commit `be29e6a` updated kobj_read_file() so it no longer unconditionally passes RLIM64_INFINITY. The vn_rdwr() function needs to be updated accordingly. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #513	2016-01-23 11:16:23 -08:00
Richard Yao	be29e6a6e6	kobj_read_file: Return -1 on vn_rdwr() error I noticed that the SPL implementation of kobj_read_file is not correct after comparing it with the userland implementation of kobj_read_file() in zfsonlinux/zfs#4104. Note that we no longer pass RLIM64_INFINITY with this, but our vn_rdwr implementation did not support it anyway, so there is no difference. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #513	2016-01-23 10:10:44 -08:00

... 35 36 37 38 39 ...

3346 Commits