mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-05-26 12:12:13 +03:00

Author	SHA1	Message	Date
Clemens Fruhwirth	8e99d66b05	Add support for rw semaphore under PREEMPT_RT_FULL The main complication from the RT patch set is that the RW semaphore locks change such that read locks on an rwsem can be taken only by a single thread. All other threads are locked out. This single thread can take a read lock multiple times though. The underlying implementation changes to a mutex with an additional read_depth count. The implementation can be best understood by inspecting the RT patch. rwsem_rt.h and rt.c give the best insight into how RT rwsem works. My implementation for rwsem_tryupgrade is basically an inversion of rt_downgrade_write found in rt.c. Please see the comments in the code. Unfortunately, I have to drop SPLAT rwlock test4 completely as this test tries to take multiple locks from different threads, which RT rwsems do not support. Otherwise SPLAT, zconfig.sh, zpios-sanity.sh and zfs-tests.sh pass on my Debian-testing VM with the kernel linux-image-4.8.0-1-rt-amd64. Tested-by: kernelOfTruth <kerneloftruth@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Clemens Fruhwirth <clemens@endorphin.org> Closes zfsonlinux/zfs#5491 Closes #589 Closes #308	2016-12-19 12:45:24 -08:00
Clemens Fruhwirth	6d064f7a07	Remove stale comment from rw_tryupgrade() Commit `f58040c0fc` should have removed this comment which is no longer relevant. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Clemens Fruhwirth <clemens@endorphin.org> Issue #589	2016-12-19 11:27:27 -08:00
Chunwei Chen	9c9ad845ef	Refactor some splat macro to function Refactor the code by turning splat_test_{init,fini} and splat_subsystem_{init,fini} into functions. They have no reason to be macros, and inlining every call would bloat the code. Signed-off-by: Chunwei Chen <david.chen@osnexus.com>	2016-12-15 11:30:11 -08:00
Chunwei Chen	71a3c9c45d	Fix splat memleak SPLAT_TEST_FINI didn't call kfree causing memleak. Signed-off-by: Chunwei Chen <david.chen@osnexus.com>	2016-12-15 11:30:11 -08:00
Chunwei Chen	f200b83673	Add system_delay_taskq for long delay Add a dedicated system_delay_taskq for long delays such as spa_deadman and zpl_posix_acl_free. This allows system_taskq to be used to dispatch multiple tasks and then call taskq_wait_outstanding. Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #588	2016-12-08 14:00:20 -07:00
Chunwei Chen	493492559e	Limit number of tasks shown in taskq proc This prevents holding tq_lock for too long. Before zfsonlinux/zfs@8e71ab9, hogging delay tasks and cat /proc/spl/taskq would easily cause a lockup. While that bug has been fixed, it's probably still a good idea to do this in case task lists grow too large. Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #586	2016-12-01 11:06:27 -07:00
Ubuntu	cbba714667	Add TASKQID_INVALID and TASKQID_INITIAL macros Add the TASKQID_INVALID and TASKQID_INITIAL macros and update the taskq implementation and test cases to use them. This is solely for the purposes of readability and introduces no functional change. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-11-02 10:34:19 -07:00
Ubuntu	1b457bcbe5	Fix vmem_size() Add a minimal implementation of vmem_size() which accounts for the virtual memory usage of the SPL's kmem cache. This functionality is only useful on 32-bit systems with a small virtual address space. The following assumptions are made: 1) The major SPL consumer of virtual memory is the kmem cache. 2) Memory allocated with vmem_alloc() is short lived and can be ignored. 3) Allow a 4MB floor as a generous pad given normal consumption. 4) The spl_kmem_cache_sem only contends with cache create/destroy. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-11-02 10:34:19 -07:00
Brian Behlendorf	7b25c48e6e	Tag 0.7.0-rc2 Second release candidate. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-10-25 13:13:49 -07:00
Chunwei Chen	ae7eda1dde	Linux 4.9 compat: group_info changes In Linux 4.9, torvalds/linux@81243ea, group_info changed from 2d array via ->blocks to 1d array via ->gid. We change the spl cred functions accordingly. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #581	2016-10-20 09:33:28 -07:00
Chunwei Chen	87063d7dc3	Fix splat-cred.c cred usage No need to crhold current_cred(), fix possible leak in splat_cred_test2 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #556	2016-10-20 09:33:22 -07:00
Chunwei Chen	9ba3c01923	Fix crgetgroups out-of-bound and misc cred fix init_groups has 0 nblocks, therefore calling the current crgetgroups with init_groups would result in out-of-bound access. We fix this by returning NULL when nblocks is 0. Cap crgetngroups to NGROUPS_PER_BLOCK, since crgetgroups will only return blocks[0]. Also, remove all get_group_info. The cred already holds reference on the group_info, and cred is not mutable. So there's no reason to hold extra reference, if we hold cred. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #556	2016-10-20 09:33:01 -07:00
tuxoko	0d26756665	Fix out-of-bound in per_cpu in spl_random_init When iterating per_cpu values, we need to use for_each_possible_cpu. While NR_CPUS indicates the number of CPUs supported by the kernel, the kernel might not initialize all of them if it decides they cannot be used. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #578	2016-10-07 20:59:46 -07:00
tuxoko	2529b3a80e	Linux 4.8 compat: Fix RW_READ_HELD Linux 4.8, starting from torvalds/linux@19c5d690e, will set owner to 1 when read held instead of leaving it NULL. So we change the condition to `rw_owner(rwp) <= 1` in RW_READ_HELD. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes zfsonlinux/zfs#5233 Closes #577	2016-10-07 20:53:58 -07:00
Brian Behlendorf	341dfdb3fd	Fix p0 initializer Due to changes in the task_struct the following warning occurs when initializing the global p0. Since this structure only exists for its address to be taken, initialize it in a manner which isn't sensitive to internal changes to the structure. module/spl/spl-generic.c:58:1: error: missing braces around initializer [-Werror=missing-braces] Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #576	2016-10-04 17:26:36 -07:00
Brian Behlendorf	6c2a66bfa8	Fix aarch64 type warning Explicitly cast type in splat-rwlock.c test case to silence the following warning. warning: format ‘%ld’ expects argument of type ‘long int’, but argument N has type ‘int’ Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #574	2016-10-01 18:33:01 -07:00
Brian Behlendorf	8acfb2bcc1	Fix automatically generated release number When building from the head of a branch a release number is automatically generated with `git describe` using the last tag on that branch as the base. For this to work the last tag on the branch needs to be predictable given the current META file. This logic was accidentally broken when an -rcX tag was added to the branch. Update it to search for a VERSION or VERSION-RELEASE tag. Reviewed-by: Chris Siebenmann <cks.git01@cs.toronto.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue zfsonlinux/zfs#5105 Closes #572	2016-09-21 13:44:32 -07:00
Brian Behlendorf	cb81c0c588	Increase spl_kmem_alloc_warn limit In order to support ABD with large blocks the spl_kmem_alloc_warn limit needs to be increased to 64K. A 16M block requires that pointers be stored for 4096 4K-pages on an x86_64 system. Each of these pointers is 8 bytes requiring an allocation of 8*4096=32,768 bytes. The addition of a small header to this structure pushes the allocation over the default 32K warning threshold. In addition, fix a small bug where MAX was used instead of MIN when setting the default. This ensures a reasonable limit is still set on systems with page sizes larger than 4K. Reviewed-by: David Quigley <david.quigley@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #571	2016-09-16 17:10:36 -07:00
legend-hua	49fbac3ace	Fix spl check.sh script Update splat_cmd to reference the correct location of the splat utility. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Liu Hua <liu.hua130@zte.com.cn> Closes #570	2016-09-14 17:17:00 -07:00
tuxoko	4329bd5b73	Cleanup in cred.h Remove the code that doesn't make any sense. Reviewed-by: Brian Behlendorf <behlendorf@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #569	2016-09-14 16:59:31 -07:00
Brian Behlendorf	4fd75d35af	Tag 0.7.0-rc1 First release candidate. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-09-07 10:33:21 -07:00
GeLiXin	aeb9baa618	Fix: handle NULL case in spl_kmem_free_track() When DEBUG_KMEM_TRACKING is enabled in SPL, we keep tracking all the buffers alloced by kmem_alloc() and kmem_zalloc(). If a NULL pointer which indicates no track info in SPL is passed to spl_kmem_free_track, we just ignore it. Signed-off-by: GeLiXin <ge.lixin@zte.com.cn> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue zfsonlinux/zfs#4967 Closes #567	2016-08-19 09:14:24 -07:00
Tim Chase	576865be20	Fix HAVE_MUTEX_OWNER test for kernels prior to 4.6 Recent 4.X kernels prior to 4.6 require #include of spinlock.h in order to get the definition of __ARCH_SPIN_LOCK_UNLOCKED which is used by DEFINE_MUTEX(). Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #566	2016-08-01 12:45:08 -07:00
Nikolay Borisov	4b9dddf430	Add handling for kernel 4.7's CONFIG_TRIM_UNUSED_KSYMS Kernel 4.7 added the option to trim the unused exported symbols. In my testing this proved to be problematic since the PDE_DATA function was considered unused and as such was trimmed. This in turn caused the respective test during spl's configure stage to falsely detect that PDE_DATA is not defined, which in turn caused build failures later. Handle this situation by detecting whether CONFIG_TRIM_UNUSED_KSYMS is enabled and refusing to build against a kernel which has it enabled. Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #565	2016-08-01 12:43:01 -07:00
Nikolay Borisov	fb83388387	Add gitignore entry for spl-*.o.d files Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #565	2016-08-01 12:42:55 -07:00
Brian Behlendorf	b7c7008ba2	Linux 4.8 compat: rw_semaphore atomic_long_t count For non-rwsem-spinlocks the "count" member was changed from a "long" to "atomic_long_t" type. A configure check has been added to detect this change along with new versions of the _rwsem_tryupgrade() function and RWSEM_COUNT() macro. See https://github.com/torvalds/linux/commit/8ee62b18 for complete details. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #563	2016-07-29 14:17:53 -07:00
Tom Caputi	d2f97b2a26	Added highbit() and lowbit() macros Signed-off-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #562	2016-07-20 10:28:46 -07:00
Tony Hutter	5ad98ad097	Add _ALIGNMENT_REQUIRED to isa_defs.h for checksums _ALIGNMENT_REQUIRED needs to be #defined in isa_defs.h in order to port the Illumos checksum code to ZoL: 4185 add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R OpenZFS-issue: https://www.illumos.org/issues/4185 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/45818ee Signed-off-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #561	2016-06-21 13:37:04 -07:00
Jinshan Xiong	16fc1ec3ba	Improve spl slab cache alloc The policy is to try to allocate with KM_NOSLEEP, which will lead to memory allocation with GFP_ATOMIC, and if it fails, it will launch a taskq to expand slab space. This way it should be able to get better NUMA memory locality and reduce the overhead of context switches. Signed-off-by: Jinshan Xiong <jinshan.xiong@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #551	2016-06-01 10:26:42 -07:00
Chunwei Chen	ea5f1a200b	Fix use-after-free in splat_taskq_test7 This splat_vprint is using tq_arg->name after tq_arg is freed. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #557	2016-05-31 11:58:42 -07:00
Chunwei Chen	f58040c0fc	Implement a proper rw_tryupgrade The current rw_tryupgrade does rw_exit and then rw_tryenter(RW_WRITER), and then does rw_enter(RW_READER) if it fails. This violates the assumption that rw_tryupgrade should be atomic and could cause extra contention or even lock inversion. This patch implements a proper rw_tryupgrade. For rwsem-spinlock, we take the spinlock to check rwsem->count and rwsem->wait_list. For normal rwsem, we use cmpxchg on rwsem->count to change the value from single reader to single writer. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes zfsonlinux/zfs#4692 Closes #554	2016-05-31 11:44:15 -07:00
YunQiang Su	c60a51b640	Add isa_defs for MIPS GCC for MIPS only defines _LP64 when 64bit, while no _ILP32 defined when 32bit. Signed-off-by: YunQiang Su <syq@debian.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #558	2016-05-31 09:05:56 -07:00
Chunwei Chen	b3a22a0a00	Fix taskq_wait_outstanding re-evaluate tq_next_id wait_event is a macro, so the current implementation will cause re-evaluation of tq_next_id every time it wakes up. This would cause taskq_wait_outstanding(tq, 0) to be equivalent to taskq_wait(tq). Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Issue #553	2016-05-24 13:02:10 -07:00
Chunwei Chen	5ce028b0d4	Fix race between taskq_destroy and dynamic spawning thread While taskq_destroy waits for dynamic_taskq to finish its tasks, that does not imply the thread being spawned is up and running. This can cause the taskq to be freed before the thread can exit. We fix this by using tq_nspawn to indicate how many threads are being spawned before they are inserted into the thread list, and by having taskq_destroy wait for it to drop to zero. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Issue #553 Closes #550	2016-05-24 13:00:17 -07:00
Chunwei Chen	872e0cc9c7	Restore CALLOUT_FLAG_ABSOLUTE in cv_timedwait_hires In `39cd90e`, I mistakenly disabled the ability to use an absolute expire time in cv_timedwait_hires. I'm not quite sure why I did that, so let's restore it. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Issue #553	2016-05-24 12:58:49 -07:00
Chunwei Chen	fdbc1ba99d	Linux 4.7 compat: inode_lock() and friends Linux 4.7 changes i_mutex to i_rwsem, and we should use inode_lock and inode_lock_shared to take exclusive and shared locks respectively. We use spl_inode_lock{,_shared}() to hide the difference. Note that on older kernels you'll always take an exclusive lock. We also add all other inode_lock friends. Nested users should now explicitly call spl_inode_lock_nested with the correct subclass. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue zfsonlinux/zfs#4665 Closes #549	2016-05-20 11:00:14 -07:00
Chunwei Chen	39cd90ef08	Add cv_timedwait_sig_hires to allow interruptible sleep Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #548	2016-05-12 14:54:15 -07:00
David Quigley	5e39e4f0b2	Add a macro to convert seconds to nanoseconds and vice-versa Required infrastructure for zfsonlinux/zfs#4600. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #546	2016-05-05 16:10:46 -07:00
Tim Chase	ea2633ad26	Clear PF_FSTRANS over spl_filp_fallocate() The problem described in `2a5d574` also applies to XFS's file or inode fallocate method. Both paths may trigger writeback and expose this issue, see the full stack below. When layered on XFS a warning will be emitted under CentOS7 when entering either the file or inode fallocate method with PF_FSTRANS already set. To avoid triggering this error PF_FSTRANS is cleared and then reset in vn_space(). WARNING: at fs/xfs/xfs_aops.c:982 xfs_vm_writepage+0x58b/0x5d0 Call Trace: [<ffffffff810a1ed5>] warn_slowpath_common+0x95/0xe0 [<ffffffff810a1f3a>] warn_slowpath_null+0x1a/0x20 [<ffffffffa0231fdb>] xfs_vm_writepage+0x58b/0x5d0 [xfs] [<ffffffff81173ed7>] __writepage+0x17/0x40 [<ffffffff81176f81>] write_cache_pages+0x251/0x530 [<ffffffff811772b1>] generic_writepages+0x51/0x80 [<ffffffffa0230cb0>] xfs_vm_writepages+0x60/0x80 [xfs] [<ffffffff81177300>] do_writepages+0x20/0x30 [<ffffffff8116a5f5>] __filemap_fdatawrite_range+0xb5/0x100 [<ffffffff8116a6cb>] filemap_write_and_wait_range+0x8b/0xd0 [<ffffffffa0235bb4>] xfs_free_file_space+0xf4/0x520 [xfs] [<ffffffffa023cbce>] xfs_file_fallocate+0x19e/0x2c0 [xfs] [<ffffffffa036c6fc>] vn_space+0x3c/0x40 [spl] [<ffffffffa0434817>] vdev_file_io_start+0x207/0x260 [zfs] [<ffffffffa047170d>] zio_vdev_io_start+0xad/0x2d0 [zfs] [<ffffffffa0474942>] zio_execute+0x82/0xe0 [zfs] [<ffffffffa036ba7d>] taskq_thread+0x28d/0x5a0 [spl] [<ffffffff810c1777>] kthread+0xd7/0xf0 [<ffffffff8167de2f>] ret_from_fork+0x3f/0x70 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Nikolay Borisov <kernel@kyup.com> Closes zfsonlinux/zfs#4529	2016-04-26 11:22:43 -07:00
Tim Chase	3bf657b90c	Use vmem_free() in dfl_free() and add dfl_alloc() This change was lost, somehow, in `e5f9a9a`. Since the arrays can be rather large, they need to be allocated with vmem_zalloc() via dfl_alloc() and freed with vmem_free() via dfl_free(). The new dfl_alloc() function should be used to allocate objects of type dkioc_free_list_t so that they're allocated from vmem. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Nikolay Borisov <kernel@kyup.com> Closes #543	2016-04-26 11:20:14 -07:00
Chunwei Chen	cdd39dd245	Use kernel provided mutex owner To reduce mutex footprint, we detect the existence of owner in kernel mutex, and rely on it if it exists. Note that before Linux 3.0, mutex owner is of type thread_info. Also note that, in Linux 3.18, the condition for owner is changed from CONFIG_DEBUG_MUTEXES || CONFIG_SMP to CONFIG_DEBUG_MUTEXES || CONFIG_MUTEX_SPIN_ON_OWNER. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #540	2016-04-25 17:04:07 -07:00
Dimitri John Ledkov	224817e2a8	Add support for s390[x]. Signed-off-by: Dimitri John Ledkov <xnox@ubuntu.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #537	2016-03-17 09:54:49 -07:00
Tim Chase	7bb5d92de8	Allow spawning a new thread for TQ_NOQUEUE dispatch with dynamic taskq When a TQ_NOQUEUE dispatch is done on a dynamic taskq, allow another thread to be spawned. This will cause TQ_NOQUEUE to behave similarly as it does with non-dynamic taskqs. Add support for TQ_NOQUEUE to taskq_dispatch_ent(). Signed-off-by: Tim Chase <tim@onlight.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #530	2016-03-17 09:52:35 -07:00
Brian Behlendorf	a6ae97caed	Add rw_tryupgrade() This implementation of rw_tryupgrade() behaves slightly differently from its counterparts on other platforms. It drops the RW_READER lock and then acquires the RW_WRITER lock leaving a small window where no lock is held. On other platforms the lock is never released during the upgrade process. This is necessary under Linux because the kernel does not provide an upgrade function. There are currently no callers in the ZFS code where this change in behavior is a problem. In fact, in most cases the code is already written such that if the upgrade fails the RW_READER lock is dropped and the caller blocks waiting to acquire the lock as RW_WRITER. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Matthew Thode <prometheanfire@gentoo.org> Closes zfsonlinux/zfs#4388 Closes #534	2016-03-10 13:05:25 -08:00
Brian Behlendorf	47f9824781	Remove RPM package restriction ZFS on Linux is regularly tested on arm, ppc, ppc64, i686 and x86_64 architectures. Given this the artificial architecture restriction in the packaging has been removed. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-03-10 09:19:08 -08:00
Tom Caputi	18d2f56176	Changes to support zfs encryption Unused modlinkage struct removed and ntohll functions added. Signed-off-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #533	2016-02-25 11:42:46 -08:00
Richard Yao	0b43696e66	random_get_pseudo_bytes() need not provide cryptographic strength entropy Perf profiling of dd on a zvol revealed that my system spent 3.16% of its time in random_get_pseudo_bytes(). No SPL consumers need cryptographic strength entropy, so we can reduce our overhead by changing the implementation to utilize a fast PRNG. The Linux kernel did not export a suitable PRNG function until it exported get_random_int() in Linux 3.10. While we could implement an autotools check so that we use it when it is available or even try to access the symbol on older kernels where it is not exported using the fact that it is exported on newer ones as justification, we can instead implement our own pseudo-random data generator. For this purpose, I have written one based on a 128-bit pseudo-random number generator proposed in a paper by Sebastiano Vigna that itself was based on work by the late George Marsaglia. http://vigna.di.unimi.it/ftp/papers/xorshiftplus.pdf Profiling the same benchmark with an earlier variant of this patch that used a slightly different generator (roughly same number of instructions) by the same author showed that time spent in random_get_pseudo_bytes() dropped to 0.06%. That is a factor of 50 improvement. This particular generator algorithm is also well known to be fast: http://xorshift.di.unimi.it/#speed The benchmark numbers there state that it runs at 1.12ns/64-bits or 7.14 GBps of throughput on an Intel Core i7-4770 in what is presumably a single-threaded context. Using it in `random_get_pseudo_bytes()` in the manner I have will probably not reach that level of performance, but it should be fairly high and many times higher than the Linux `get_random_bytes()` function that we use now, which runs at 16.3 MB/s on my Intel Xeon E3-1276v3 processor when measured by using dd on /dev/urandom. 
Also, putting this generator's seed into per-CPU variables allows us to eliminate overhead from both spin locks and CPU memory barriers, which is NUMA friendly. We could have alternatively modified consumers to use something like `gethrtime() % 3` as suggested by both Matthew Ahrens and Tim Chase, but that has a few potential problems that this approach avoids: 1. Switching to `gethrtime() % 3` in hot code paths today requires diverging from illumos-gate and does nothing about potential future patches from illumos-gate that call our slow `random_get_pseudo_bytes()` in different hot code paths. Reimplementing `random_get_pseudo_bytes()` with a per-CPU PRNG avoids both of those things entirely, which means less work for us in the future. 2. Looking at the code that implements `gethrtime()`, I think it is unlikely to be faster than this per-CPU PRNG implementation of `random_get_pseudo_bytes()`. It would be best to go with something fast now so that there is no point in revisiting this from a performance perspective. 3. `gethrtime() % 3` can vary in behavior from system to system based on kernel version, architecture and clock source. In comparison, this per-CPU PRNG is about ~40 lines of code in `random_get_pseudo_bytes()` that should behave consistently across all systems regardless of kernel version, system architecture or machine clock source. It is unlikely that we would ever need to revisit this per-CPU PRNG while the same cannot be said for `gethrtime() % 3`. 4. `gethrtime()` uses CPU memory barriers and maybe atomic instructions depending on the clock source, so replacing `random_get_pseudo_bytes()` with `gethrtime()` in hot code paths could still require a future person working on NUMA scalability to reimplement it anyway while this per-CPU PRNG would not by virtue of using neither CPU memory barriers nor atomic instructions. Note that I did not check various clock sources for the presence of atomic instructions. 
There is simply too much code to read and given the drawbacks versus this per-CPU PRNG, there is no point in being certain. 5. I have heard of instances where poor quality pseudo-random numbers caused problems for HPC code in ways that took more than a year to identify and were remedied by switching to a higher quality source of pseudo-random numbers. While filesystems are different than HPC code, I do not think it is impossible for us to have instances where poor quality pseudo-random numbers can cause problems. Opting for a well studied PRNG algorithm that passes tests for statistical randomness over changing callers to use `gethrtime() % 3` bypasses the need to think about both whether poor quality pseudo-random numbers can cause problems and the statistical quality of numbers from `gethrtime() % 3`. 6. `gethrtime()` calls `getrawmonotonic()`, which uses seqlocks. This is probably not a huge issue, but anyone using kgdb would never be able to step through a seqlock critical section, which is not a problem either now or with the per-CPU PRNG: https://en.wikipedia.org/wiki/Seqlock The only downside that I can see is that this code's memory requirement is O(N) where N is NR_CPUS, versus the current code and `gethrtime() % 3`, which are O(1), but that should not be a problem. The seeds will use 64KB of memory at the high end (i.e. `NR_CPUS == 4096`) and 16 bytes of memory at the low end (i.e. `NR_CPUS == 1`). In either case, we should only use a few hundred bytes of code for text, especially since `spl_rand_jump()` should be inlined into `spl_random_init()`, which should be removed during early boot as part of "Freeing unused kernel memory". In either case, the memory requirements are minuscule. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #372	2016-02-17 09:49:09 -08:00
Chunwei Chen	8f3b403a73	Allow kicking a taskq to spawn more threads This patch adds a module parameter spl_taskq_kick. When a non-zero value is written to it, it will scan all the taskqs; if a taskq contains a task pending for more than 5 seconds, it will be forced to spawn a new thread. This is used as an emergency recovery from deadlock, not a general solution. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #529	2016-02-05 14:08:31 -08:00
Chip Parker	d112232f5e	Ensure spl/ only occurs once in core-y Update copy-builtin so it may be run multiple times against the kernel source tree. This change makes sed more discriminating to ensure spl/ only occurs once in core-y. Signed-off-by: Chip Parker <aparker@enthought.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #526	2016-01-26 11:54:24 -08:00
Brian Behlendorf	6b38e7510f	Remove RLIM64_INFINITY assert in vn_rdwr() Previous commit `be29e6a` updated kobj_read_file() so it no longer unconditionally passes RLIM64_INFINITY. The vn_rdwr() function needs to be updated accordingly. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #513	2016-01-23 11:16:23 -08:00
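
For readers of commit 0b43696e66 above, the kind of generator it describes can be sketched in user-space C. This is a minimal illustration of a 128-bit xorshift+ step in the style of the Vigna paper the commit links to, not the code from the SPL tree. The names `xs128p_state_t`, `xs128p_next` and `xs128p_fill` are hypothetical, and the shift constants 23, 18 and 5 are one published parameter set assumed here. The per-CPU seed storage and the jump function the commit mentions are omitted.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical generator state: 128 bits held as two 64-bit words.
 * The two words must not both be zero, or the output stays zero.
 */
typedef struct {
	uint64_t s[2];
} xs128p_state_t;

/*
 * One xorshift128+ step. The shift constants (23, 18, 5) are an
 * assumption for this sketch, not taken from the commit message.
 */
static uint64_t
xs128p_next(xs128p_state_t *st)
{
	uint64_t s1 = st->s[0];
	const uint64_t s0 = st->s[1];

	st->s[0] = s0;
	s1 ^= s1 << 23;
	st->s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5);
	return (st->s[1] + s0);
}

/*
 * Fill buf with len pseudo-random bytes, consuming up to 8 bytes
 * from each generator step. Not suitable for cryptographic use.
 */
static void
xs128p_fill(xs128p_state_t *st, void *buf, size_t len)
{
	uint8_t *p = buf;

	while (len > 0) {
		uint64_t r = xs128p_next(st);
		size_t n = len < sizeof (r) ? len : sizeof (r);

		memcpy(p, &r, n);
		p += n;
		len -= n;
	}
}
```

A caller would seed one `xs128p_state_t` with non-zero values and then call `xs128p_fill(&st, buf, len)`. In the design the commit describes, each CPU would own its own state, which is why no locking appears in the step function.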


1012 Commits