mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-01-25 10:12:13 +03:00

Author	SHA1	Message	Date
Brian Behlendorf	42d3b990cf	Update incorrect ddt_zap_lookup() assertion When the ddt_zap_lookup() function was updated to dynamically allocate memory for the cbuf variable, to save stack space, the 'csize <= sizeof (cbuf)' assertion was not updated. The result of this was that the size of the pointer was being used in the comparison rather than the buffer size. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Prakash Surya <surya1@llnl.gov>	2012-07-03 15:14:34 -07:00
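The pitfall described in this commit can be reproduced outside ZFS. The following is a minimal sketch, not the actual `ddt_zap_lookup()` code: `CBUF_SIZE`, `pointer_bound()` and `array_bound()` are hypothetical names chosen for illustration. It shows that `sizeof` applied to a heap pointer yields the pointer size, while `sizeof` applied to an array yields the buffer size.

```c
#include <stddef.h>
#include <stdlib.h>

#define CBUF_SIZE 256	/* hypothetical buffer size, for illustration only */

/* Bound that a 'csize <= sizeof (cbuf)' check sees once cbuf is a pointer. */
size_t
pointer_bound(void)
{
	unsigned char *cbuf = malloc(CBUF_SIZE);
	size_t bound = sizeof (cbuf);	/* size of the pointer, not the buffer */

	free(cbuf);
	return (bound);
}

/* Bound that the same check sees while cbuf is still an on-stack array. */
size_t
array_bound(void)
{
	unsigned char cbuf[CBUF_SIZE];

	return (sizeof (cbuf));		/* size of the whole buffer */
}
```

After a buffer is converted from an array to a dynamic allocation, any assertion of this kind has to compare against the allocation size explicitly.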
Etienne Dechamps	b6ad9671ac	Add ZIL statistics. The performance of the ZIL is usually the main bottleneck when dealing with synchronous, write-heavy workloads (e.g. databases). Understanding the behavior of the ZIL is required to diagnose performance issues for these workloads, and to tune ZIL parameters (like zil_slog_limit) accordingly. This commit adds a new kstat page dedicated to the ZIL with some counters which, hopefully, shed some light on what the ZIL is doing, and how it is doing it. Currently, these statistics are available in /proc/spl/kstat/zfs/zil. A description of the fields can be found in zil.h. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #786	2012-06-29 09:56:51 -07:00
Pawel Jakub Dawidek	0cee24064a	Speed up 'zfs list -t snapshot -o name -s name' FreeBSD #xxx: Dramatically optimize listing snapshots when the user requests only snapshot names and wants to sort them by name, i.e. when executing: # zfs list -t snapshot -o name -s name Because only the name is needed we don't have to read all snapshot properties. Below you can find how long it takes to list 34509 snapshots from a single disk pool before and after this change with cold and warm cache: before: # time zfs list -t snapshot -o name -s name > /dev/null cold cache: 525s warm cache: 218s after: # time zfs list -t snapshot -o name -s name > /dev/null cold cache: 1.7s warm cache: 1.1s NOTE: This patch only appears in FreeBSD. If/when Illumos picks up the change we may want to drop this patch and adopt their version. However, for now this addresses a real issue. Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #450	2012-06-14 09:49:04 -07:00
Darik Horn	74497b7ab6	Add zvol_inhibit_dev module option. ZoL can create more zvols at runtime than can be configured during system start, which hangs the init stack at reboot. When a slow system has more than a few hundred zvols, udev will fork bomb during system start and spend too much time in device detection routines, so upstart kills it. The zvol_inhibit_dev option allows an affected system to be rescued by skipping /dev/zd* creation and thereby avoiding the udev overload. All zvols are made inaccessible if this option is set, but the `zfs destroy` and `zfs send` commands still work, and ZFS filesystems can be mounted. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-06-13 17:05:16 -07:00
Etienne Dechamps	ee191e802c	Make zil_slog_limit a tunable module parameter. zil_slog_limit specifies the maximum commit size to be written to the separate log device. Larger commits bypass the separate log device and go directly to the data devices. The optimal value for zil_slog_limit directly depends on the latency and throughput characteristics of both the separate log device and the data disks. Small synchronous writes are faster on low-latency separate log devices (e.g. SSDs) whereas large synchronous writes are faster on high-latency data disks (e.g. spindles) because of higher throughput, especially with a large array. The point is, the line between "small" and "large" synchronous writes in this scenario is heavily dependent on the hardware used. That's why it should be made configurable. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #783	2012-06-12 08:45:53 -07:00
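The routing rule described in this commit message can be sketched as a single comparison. This is an illustrative sketch only: `commit_uses_slog()` and its parameters are hypothetical names, not functions from the ZFS source, and the real decision in the ZIL may take other state into account.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true when a commit of commit_size bytes
 * should be written to the separate log device, i.e. when it does not
 * exceed the configured limit. Larger commits bypass the log device and
 * go directly to the data devices.
 */
bool
commit_uses_slog(uint64_t commit_size, uint64_t slog_limit)
{
	return (commit_size <= slog_limit);
}
```

With a limit of 1 MiB, for example, a 4 KiB commit would be sent to the log device while a 2 MiB commit would go to the data disks.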
Richard Yao	6a0936babc	Linux 3.4 compat, d_make_root() replaces d_alloc_root() torvalds/linux@adc0e91ab1 introduced d_make_root() as a replacement for d_alloc_root(). Further commits appear to have removed d_alloc_root() from the Linux source tree. This causes the following failure: error: implicit declaration of function 'd_alloc_root' [-Werror=implicit-function-declaration] To correct this we update the code to use the current d_make_root() interface for readability. Then we introduce an autotools check to determine if d_make_root() is available. If it isn't then we define some compatibility logic which uses the older d_alloc_root() interface. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #776	2012-06-11 10:04:49 -07:00
Etienne Dechamps	ab85f8455b	Honor logbias when writing to ZVOLs. The logbias option is not taken into account when writing to ZVOLs. We fix that by using the same logic as in the zfs filesystem write code (see zfs_log.c). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #774	2012-06-11 09:43:48 -07:00
Brian Behlendorf	710114089f	Revert "Disable direct reclaim on zvols" This reverts commit `ce90208cf9`. This change was observed to cause problems when using a zvol to back a VM under 2.6.32.59 kernels. This issue was filed as #710. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #342 Issue #710	2012-04-30 14:26:49 -07:00
Brian Behlendorf	b39d3b9f7b	Linux 3.3 compat, iops->create()/mkdir()/mknod() The mode argument of iops->create()/mkdir()/mknod() was changed from an 'int' to a 'umode_t'. To prevent a compiler warning an autoconf check was added to detect the API change and then correctly set a zpl_umode_t typedef. There is no functional change. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #701	2012-04-30 12:52:38 -07:00
Richard Yao	ce90208cf9	Disable direct reclaim on zvols Previously, it was possible for the direct reclaim path to be invoked when a write to a zvol was made. When a zvol is used as a swap device, this often causes swap requests to depend on additional swap requests, which deadlocks. We address this by disabling the direct reclaim path on zvols. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #342	2012-04-30 11:25:36 -07:00
Richard Yao	518b487602	Update ARC memory limits to account for SLUB internal fragmentation `23bdb07d4e` updated the ARC memory limits to be 1/2 of memory or all but 4GB. Unfortunately, these values assume zero internal fragmentation in the SLUB allocator, when in reality, the internal fragmentation could be as high as 50%, effectively doubling memory usage. This poses clear safety issues, because it permits the size of ARC to exceed system memory. This patch changes this so that the default value of arc_c_max is always 1/2 of system memory. This effectively limits the ARC to the memory that the system has physically installed. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #660	2012-04-30 10:04:34 -07:00
Brian Behlendorf	302f753f16	Integrate ARC more tightly with Linux Under Solaris the ARC was designed to stay one step ahead of the VM subsystem. It would attempt to recognize low memory situations before they occurred and evict data from the cache. It would also make assessments about whether there was enough free memory to perform a specific operation. This was all possible because Solaris exposes a fairly decent view of the memory state of the system to other kernel threads. Linux on the other hand does not make this information easily available. To avoid extensive modifications to the ARC the SPL attempts to provide these same interfaces. While this works it is not ideal and problems can arise when the ARC and Linux have different ideas about when you're out of memory. This has manifested itself in the past as a spinning arc_reclaim_thread. This patch abandons the emulated Solaris interfaces in favor of the preferred Linux interface. That means moving the bulk of the memory reclaim logic out of the arc_reclaim_thread and in to the evict driven shrinker callback. The Linux VM will call this function when it needs memory. The ARC is then responsible for attempting to free the requested amount of memory if possible. Several interfaces have been modified to accommodate this approach, however the basic user space implementation remains the same. The following changes almost exclusively just apply to the kernel implementation. * Removed the hdr_recl() reclaim callback which is redundant with the broader arc_shrinker_func(). * Reduced arc_grow_retry to 5 seconds from 60. This is now used internally in the ARC with arc_no_grow to indicate that direct reclaim was recently performed. This typically indicates a rapid change in memory demands which the kswapd threads were unable to keep ahead of. As long as direct reclaim is happening once every 5 seconds arc growth will be paused to avoid further contributing to the existing memory pressure.
The more common indirect reclaim paths will not set arc_no_grow. * arc_shrink() has been extended to take the number of bytes by which arc_c should be reduced. This allows for a more granular reduction of the arc target. Since the kernel provides a reclaim value to the arc_shrinker_func() this value is used instead of 1<<arc_shrink_shift. * arc_reclaim_needed() has been removed. It was used to determine if the system was under memory pressure and relied extensively on Solaris specific VM interfaces. In most cases the new code just checks arc_no_grow which indicates that within the last arc_grow_retry seconds direct memory reclaim occurred. * arc_memory_throttle() has been updated to always include the amount of evictable memory (arc and page cache) in its free space calculations. This space is largely available in most call paths due to direct memory reclaim. * The Solaris pageout code was also removed to avoid confusion. It has always been disabled due to proc_pageout being defined as NULL in the Linux port. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-04-30 10:03:05 -07:00
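The growth-pause rule in the commit above (no ARC growth while direct reclaim has happened within the last arc_grow_retry seconds) can be illustrated with a small time comparison. This is a hedged sketch rather than the ARC implementation: `growth_paused()` and `GROW_RETRY_SECONDS` are hypothetical names, timestamps are plain seconds, and the caller is assumed to pass `now >= last_direct_reclaim`.

```c
#include <stdbool.h>
#include <stdint.h>

#define GROW_RETRY_SECONDS 5	/* the 5 second value from the commit message */

/*
 * Hypothetical helper: returns true while cache growth should stay paused,
 * i.e. when the most recent direct reclaim happened fewer than
 * GROW_RETRY_SECONDS ago. Assumes now >= last_direct_reclaim.
 */
bool
growth_paused(uint64_t now, uint64_t last_direct_reclaim)
{
	return (now - last_direct_reclaim < GROW_RETRY_SECONDS);
}
```

Under this sketch, repeated direct reclaim at intervals shorter than five seconds keeps growth paused indefinitely, which matches the behavior the message describes.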
Brian Behlendorf	afec56b43f	Add zfs_mdcomp_disable module option Expose the zfs_mdcomp_disable variable as a module option. This can be used to disable compression of zfs meta data which is enabled by default. This shouldn't need to be tuned for most workloads; however, there may be very specific instances where it makes sense to trade disk capacity for extra cpu cycles. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-04-27 16:28:02 -07:00
Brian Behlendorf	ebf8e3a237	Illumos #1909 : disk sync write perf regression when slog is used post oi_148 Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Eric Schrock <eric.schrock@delphix.com> Reviewed by: Robert Mustacchi <rm@joyent.com> Reviewed by: Bill Pijewski <wdp@joyent.com> Reviewed by: Richard Elling <richard.elling@richardelling.com> Reviewed by: Steve Gonczi <gonczi@comcast.net> Reviewed by: Garrett D'Amore <garrett.damore@gmail.com> Reviewed by: Dan McDonald <danmcd@nexenta.com> Reviewed by: Albert Lee <trisk@nexenta.com> Approved by: Eric Schrock <eric.schrock@delphix.com> References to Illumos issue: https://www.illumos.org/issues/1909 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #680	2012-04-19 16:26:29 -07:00
Prakash Surya	409dc1a570	Use KM_PUSHPAGE in l2arc_write_buffers There is potential for deadlock in the l2arc_feed thread if KM_PUSHPAGE is not used for the allocations made in l2arc_write_buffers. Specifically, if KM_PUSHPAGE is not used for these allocations, it is possible for reclaim to be triggered which can cause the l2arc_feed thread to deadlock itself on the ARC_mru mutex. An example of this is demonstrated in the following backtrace of the l2arc_feed thread: crash> bt 4123 PID: 4123 TASK: ffff88062f8c1500 CPU: 6 COMMAND: "l2arc_feed" 0 [ffff88062511d610] schedule at ffffffff814eeee0 1 [ffff88062511d6d8] __mutex_lock_slowpath at ffffffff814f057e 2 [ffff88062511d748] mutex_lock at ffffffff814f041b 3 [ffff88062511d768] arc_evict at ffffffffa05130ca [zfs] 4 [ffff88062511d858] arc_adjust at ffffffffa05139a9 [zfs] 5 [ffff88062511d878] arc_shrink at ffffffffa0513a95 [zfs] 6 [ffff88062511d898] arc_kmem_reap_now at ffffffffa0513be8 [zfs] 7 [ffff88062511d8c8] arc_shrinker_func at ffffffffa0513ccc [zfs] 8 [ffff88062511d8f8] shrink_slab at ffffffff8112a17a 9 [ffff88062511d958] do_try_to_free_pages at ffffffff8112bfdf 10 [ffff88062511d9e8] try_to_free_pages at ffffffff8112c3ed 11 [ffff88062511da98] __alloc_pages_nodemask at ffffffff8112431d 12 [ffff88062511dbb8] kmem_getpages at ffffffff8115e632 13 [ffff88062511dbe8] fallback_alloc at ffffffff8115f24a 14 [ffff88062511dc68] ____cache_alloc_node at ffffffff8115efc9 15 [ffff88062511dcc8] __kmalloc at ffffffff8115fbf9 16 [ffff88062511dd18] kmem_alloc_debug at ffffffffa047b8cb [spl] 17 [ffff88062511dda8] l2arc_feed_thread at ffffffffa0511e71 [zfs] 18 [ffff88062511dea8] thread_generic_wrapper at ffffffffa047d1a1 [spl] 19 [ffff88062511dee8] kthread at ffffffff81090a86 20 [ffff88062511df48] kernel_thread at ffffffff8100c14a Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-04-17 11:56:21 -07:00
Martin Matuska	7d5cd71da6	Illumos #1346 : zfs incremental receive may leave behind temporary clones 1356 zfs dataset prefetch code not working Reviewed by: Matthew Ahrens <matt@delphix.com> Reviewed by: Dan McDonald <danmcd@nexenta.com> Approved by: Gordon Ross <gwr@nexenta.com> References to Illumos issue: https://www.illumos.org/issues/1346 https://www.illumos.org/issues/1356 Ported-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #647	2012-04-11 12:02:27 -07:00
Albert Lee	22cd4a4653	Illumos #1475 : zfs spill block hold can access invalid spill blkptr Reviewed by: Dan McDonald <danmcd@nexenta.com> Reviewed by: Gordon Ross <gwr@nexenta.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <gwilson@zfsmail.com> Approved by: Garrett D'Amore <garrett@nexenta.com> References to Illumos issue: https://www.illumos.org/issues/1475 Ported-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #648	2012-04-11 11:46:30 -07:00
George Wilson	5ffb9d1d05	Illumos #1951 : leaking a vdev when removing an l2cache device 1952 memory leak when adding a file-based l2arc device 1954 leak in ZFS from metaslab_group_create and zfs_ereport_checksum Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Eric Schrock <eric.schrock@delphix.com> Reviewed by: Bill Pijewski <wdp@joyent.com> Reviewed by: Dan McDonald <danmcd@nexenta.com> Approved by: Eric Schrock <eric.schrock@delphix.com> References to Illumos issues: https://www.illumos.org/issues/1951 https://www.illumos.org/issues/1952 https://www.illumos.org/issues/1954 Ported-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #650	2012-04-11 11:32:06 -07:00
Martin Matuska	b129c6590e	OS-926: zfs panic in zfs_fill_zplprops_impl() This change appears to be exclusive to SmartOS. It is not present in illumos-gate but it just adds some needed error handling. This is clearly preferable to simply ASSERTING which is what would occur prior to the patch. Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com> Reviewed by: Matt Ahrens <matt@delphix.com> Ported-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #652	2012-04-11 11:29:19 -07:00
Andriy Gapon	3adfc400f5	Illumos #1680 : zfs vdev_file_io_start: validate vdev before using vdev_tsd vdev_tsd can be NULL for certain vdev states. At least in userland testing with ztest. References to Illumos issue: https://www.illumos.org/issues/1680 Ported-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #655	2012-04-11 11:23:18 -07:00
Brian Behlendorf	f0fd83be65	Export additional dsl symbols Principally these symbols were exported to get access to the dsl_prop_register/dsl_prop_unregister functions. They allow us to cleanly register a callback which is called when a dataset property is modified. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-04-11 09:26:55 -07:00
Gunnar Beutner	1f0d8a566f	Fixed a NULL pointer dereference bug in zfs_preumount When zpl_fill_super -> zfs_domount fails (e.g. because the dataset was destroyed before it could be successfully mounted) the subsequent call to zpl_kill_sb -> zfs_preumount would dereference a NULL pointer. This bug can be reproduced using this shell script: #!/bin/sh ( while true; do zfs create -o mountpoint=legacy tank/bar zfs destroy tank/bar done ) & ( while true; do mount -t zfs tank/bar /mnt umount /mnt done ) & Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #639	2012-04-05 11:29:42 -07:00
Brian Behlendorf	fc41c6402b	Properly expose the mfu ghost list kstats Due to a typo the mru ghost lists stats were accidentally being exposed as the mfu ghost list stats. This was harmless but confusing since memory usage could be over reported. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-03-27 15:08:22 -07:00
Brian Behlendorf	1c5de20ae2	Add --enable-debug-dmu-tx configure option Allow rigorous (and expensive) tx validation to be enabled/disabled independently from the standard zfs debugging. When enabled these checks ensure that all txs are constructed properly and that a dbuf is never dirtied without taking the correct tx hold. This checking is particularly helpful when adding new dmu consumers like Lustre. However, for established consumers such as the zpl with no known outstanding tx construction problems this is just overhead. --enable-debug-dmu-tx, --disable-debug-dmu-tx - Enable/disable validation of each tx as it is constructed. By default validation is disabled due to performance concerns. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-03-23 12:25:17 -07:00
Brian Behlendorf	99ea23c583	Enhance a dmu_tx_dirty_buf() assertion The following assertion is good to validate the correctness of new DMU consumers, but it doesn't quite provide enough information. Slightly rework the assertion so that when it is hit the actual offending values will be included in the output. SPLError: 4787:0:(dmu_tx.c:828:dmu_tx_dirty_buf()) ASSERTION(dn == NULL || dn->dn_assigned_txg == tx->tx_txg) failed Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-03-23 12:24:05 -07:00
Brian Behlendorf	4b5d425f14	Add ZFS_META_RELEASE to module load/unload messages Include the ZFS_META_RELEASE in the module load/unload messages to more clearly indicate exactly what version of ZFS has been loaded. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-03-23 12:14:35 -07:00
Brian Behlendorf	9ed86e7cc7	Account for .zfs ctldir inodes Because the .zfs ctldir inodes are not backed by physical storage they use a different create path which was not properly accounting for them as used. This could result in ->nr_cached_objects() returning 0 and cause a divide by zero error in prune_super(). In my opinion there's a kernel bug here too which allows this to happen. They should either be checking for 0 or adding +1 like they correctly do earlier in the function. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #617	2012-03-22 15:43:55 -07:00
Brian Behlendorf	ebe7e575ea	Add .zfs control directory Add support for the .zfs control directory. This was accomplished by leveraging as much of the existing ZFS infrastructure as possible and updating it for Linux as required. The bulk of the core functionality is now all there with the following limitations. *) The .zfs/snapshot directory automount support requires a 2.6.37 or newer kernel. The exception is RHEL6.2 which has backported the d_automount patches. *) Creating/destroying/renaming snapshots with mkdir/rmdir/mv in the .zfs/snapshot directory works as expected. However, this functionality is only available to root until zfs delegations are finished. * mkdir - create a snapshot * rmdir - destroy a snapshot * mv - rename a snapshot The following issues are known deficiencies, but we expect them to be addressed by future commits. *) Add automount support for kernels older than 2.6.37. This should be possible using follow_link() which is what Linux did before. *) Accessing the .zfs/snapshot directory via NFS is not yet possible. The majority of the ground work for this is complete. However, finishing this work will require resolving some lingering integration issues with the Linux NFS kernel server. *) The .zfs/shares directory exists but no further smb functionality has yet been implemented. Contributions-by: Rohan Puri <rohan.puri15@gmail.com> Contributions-by: Andrew Barnes <barnes333@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #173	2012-03-22 13:03:47 -07:00
Brian Behlendorf	49be0ccf1f	Add zio constructor/destructor Add a standard zio constructor and destructor. Normally, this is done to reduce the cost of allocating a new structure by reducing expensive operations such as memory allocations. However, in this case none of the operations moved out of zio_create() were really very expensive. This change was principally made as a debug patch (and workaround) for a zio_destroy() race. There is good evidence that zio_create() is reinitializing a mutex which is really still in use by another thread. This would completely explain the observed symptoms in the issue report. This patch doesn't fix the root cause of the race, but it should make it less likely by only initializing the mutex once in the constructor. Also, this particular flaw might have gone unnoticed in other zfs implementations due to the specific implementation details of Linux ticket spinlocks. Once the real root cause is determined and resolved this change can be safely reverted. Until then this should help work around the issue. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #496	2012-03-21 14:51:44 -07:00
Brian Behlendorf	c8df41538d	Revert "Add zio constructor/destructor" This patch was slightly flawed and allowed for zio->io_logical to potentially not be reinitialized for a new zio. This could lead to assertion failures in specific cases when debugging is enabled (--enable-debug) and I/O errors are encountered. It may also have caused problems when issuing logical I/Os. Since we want to make sure this workaround can be easily removed in the future (when we have the real fix), I'm reverting this change and applying a new version of the patch which includes the zio->io_logical fix. This reverts commit `2c6d0b1e07`. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #602 Issue #604	2012-03-21 14:51:01 -07:00
Brian Behlendorf	77a405ae52	Add missing NULL in zpl_xattr_handlers The xattr_resolve_name() helper function expects the registered list of xattr handlers to be NULL terminated. This NULL was accidentally missing which could result in a NULL dereference. Interestingly this issue only manifested itself on certain 32-bit systems. Presumably on 64-bit kernels we just always happen to get lucky and the memory following the structure is zeroed. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #594	2012-03-15 15:18:29 -07:00
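The role of the terminating NULL can be shown with a small stand-alone lookup loop. This is a sketch under stated assumptions, not the kernel's xattr_resolve_name(): `struct handler`, `handlers` and `resolve()` are hypothetical names. The loop stops only when it reaches the NULL sentinel, so a list without one would be read past its end.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical handler type; not the kernel's struct xattr_handler. */
struct handler {
	const char *prefix;
};

static const struct handler user_handler = { "user." };
static const struct handler trusted_handler = { "trusted." };

/* The trailing NULL is the only thing that tells resolve() where to stop. */
const struct handler *handlers[] = {
	&user_handler,
	&trusted_handler,
	NULL
};

/* Return the first handler whose prefix matches name, or NULL if none. */
const struct handler *
resolve(const struct handler **list, const char *name)
{
	for (; *list != NULL; list++) {
		size_t n = strlen((*list)->prefix);

		if (strncmp(name, (*list)->prefix, n) == 0)
			return (*list);
	}
	return (NULL);
}
```

If the sentinel were missing, the loop would continue into whatever memory follows the array, which is consistent with the failure depending on whether that memory happens to be zeroed.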
Brian Behlendorf	0ece356db5	Add sa_spill_rele() interface Add a SA interface which allows us to release the spill block from a SA handle without destroying the handle. This is useful because we can then ensure that a copy of the dirty spill block is not made at sync time due to the extra hold. Subsequent calls to sa_update() or sa_lookup() will transparently refetch the spill block dbuf from the ARC hash. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-03-07 16:28:00 -08:00
Brian Behlendorf	2c6d0b1e07	Add zio constructor/destructor Add a standard zio constructor and destructor. Normally, this is done to reduce the cost of allocating a new structure by reducing expensive operations such as memory allocations. However, in this case none of the operations moved out of zio_create() were really very expensive. This change was principally made as a debug patch (and workaround) for a zio_destroy() race. There is good evidence that zio_create() is reinitializing a mutex which is really still in use by another thread. This would completely explain the observed symptoms in the issue report. This patch doesn't fix the root cause of the race, but it should make it less likely by only initializing the mutex once in the constructor. Also, this particular flaw might have gone unnoticed in other zfs implementations due to the specific implementation details of Linux ticket spinlocks. Once the real root cause is determined and resolved this change can be safely reverted. Until then this should help work around the issue. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #496	2012-03-07 16:06:23 -08:00
Brian Behlendorf	ec2626ad3f	Use SA_HDL_PRIVATE for SA xattrs A private SA handle must be used to ensure we can drop the dbuf hold on the spill block prior to calling dmu_tx_commit(). If we call dmu_tx_commit() before sa_handle_destroy(), then our hold will trigger a copy of the dbuf to be made. This is done to prevent data from leaking in to the syncing txg. As a result the original dirty spill block will remain cached. Additionally, relying on the shared zp->z_sa_hdl is unsafe in the xattr context because the znode may be asynchronously dropped from the cache. It's far safer and simpler just to use a private handle for xattrs. Plus any additional overhead is offset by the avoidance of the previously mentioned memory copy. These forever dirty buffers can be noticed in the arcstats under the anon_size. On a quiescent system the value should be zero. Without this fix and a SA xattr write workload you will see anon_size increase. Eventually, if enough dirty data builds up, your system will appear to hang. This occurs because the dmu won't allow new txs to be assigned until that dirty data is flushed, and it won't be because it's not part of an assigned tx. As an aside, I typically see anon_size lurk around 16k so I think there is another place in the code which needs a similar fix. However, this value doesn't grow over time so it isn't critical. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #503 Issue #513	2012-03-02 13:20:48 -08:00
Brian Behlendorf	570827e129	Add 'dmu_tx' kstats entry Keep counters for the various reasons that a thread may end up in txg_wait_open() waiting on a new txg. This can be useful when attempting to determine why a particular workload is underperforming. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-02-27 08:59:10 -08:00
Brian Behlendorf	13be560d89	Add arc_state_t stats to arcstats To ensure the arc is behaving properly we need greater visibility in to exactly how it's managing the system's memory. This patch takes one step in that direction by adding the current arc_state_t for the anon, mru, mru_ghost, mfu, and mfu_ghost lists. The l2 arc_state_t is already well represented in the arcstats. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-02-27 08:58:59 -08:00
Alex Zhuravlev	a473d90cee	Export symbols for zero-copy Export additional symbols to make use of the DMU's zero-copy API. This allows external modules to move data in to and out of the ARC without incurring the cost of a memory copy. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-02-17 12:43:02 -08:00
Richard Yao	b41c9906dc	Support ashift=13 for 8KB SSD block sizes New SSDs are now available which use an internal 8k block size. To make sure ZFS can get the maximum performance out of these devices we're increasing the maximum ashift to 13 (8KB). This value is still small enough that we can fit 16 uberblocks in the vdev ring label. However, I don't want to increase this any further or it will limit the ability to safely roll back a pool to recover it. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #565	2012-02-13 12:25:27 -08:00
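The trade-off between ashift and the number of uberblock slots can be worked through numerically. The sketch below assumes a 128 KiB uberblock ring and a 1 KiB minimum slot size; neither figure appears in the commit message, and `uberblock_slots()` is a hypothetical helper, not a ZFS function. Under those assumptions, ashift=13 yields the 16 slots the message mentions.

```c
#include <stdint.h>

/* Assumptions for illustration; not stated in the commit message. */
#define UBERBLOCK_RING_BYTES	(128ULL * 1024)	/* assumed ring size */
#define MIN_UBERBLOCK_SHIFT	10		/* assumed 1 KiB minimum slot */

/* Hypothetical helper: number of uberblock slots for a given ashift. */
uint64_t
uberblock_slots(unsigned ashift)
{
	unsigned shift = ashift > MIN_UBERBLOCK_SHIFT ?
	    ashift : MIN_UBERBLOCK_SHIFT;

	return (UBERBLOCK_RING_BYTES >> shift);
}
```

Each further increment of ashift halves the slot count in this model, which is why a larger maximum would leave fewer past uberblocks available for rolling a pool back.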
Brian Behlendorf	b10c77f70a	Export symbols for zero-copy Exported the required symbols to make use of the DMU's zero-copy API. This allows external modules to move data in to and out of the ARC without incurring the cost of a memory copy. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-02-10 11:56:55 -08:00
Brian Behlendorf	a31acb462d	Use spl_debug_* helpers When configuring the spl debug log support use the provided wrapper functions. This ensures that if --disable-debug-log was used when building the spl the functions will have no effect. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-02-09 16:37:48 -08:00
Etienne Dechamps	30930fba21	Add support for DISCARD to ZVOLs. DISCARD (REQ_DISCARD, BLKDISCARD) is useful for thin provisioning. It allows ZVOL clients to discard (unmap, trim) block ranges from a ZVOL, thus optimizing disk space usage by allowing a ZVOL to shrink instead of just grow. We can't use zfs_space() or zfs_freesp() here, since these functions only work on regular files, not volumes. Fortunately we can use the low-level function dmu_free_long_range() which does exactly what we want. Currently the discard operation is not added to the log. That's not a big deal since losing discard requests cannot result in data corruption. It would however result in disk space usage higher than it should be. Thus adding log support to zvol_discard() is probably a good idea for a future improvement. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-02-09 16:19:38 -08:00
Etienne Dechamps	cb2d19010d	Support the fallocate() file operation. Currently only the (FALLOC_FL_PUNCH_HOLE) flag combination is supported, since it's the only one that matches the behavior of zfs_space(). This makes it pretty much useless in its current form, but it's a start. To support other flag combinations we would need to modify zfs_space() to make it more flexible, or emulate the desired functionality in zpl_fallocate(). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #334	2012-02-09 16:19:32 -08:00
Etienne Dechamps	aec69371a6	Check permissions in zfs_space(). This isn't done on Solaris because on this OS zfs_space() can only be called with an opened file handle. Since the addition of zpl_truncate_range() this isn't the case anymore, so we need to enforce access rights. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #334	2012-02-09 15:20:37 -08:00
Etienne Dechamps	5cb63a57f8	Implement the truncate_range() inode operation. This operation allows "hole punching" in ZFS files. On Solaris this is done via the vop_space() system call, which maps to the zfs_space() function. So we just need to write zpl_truncate_range() as a wrapper around zfs_space(). Note that this only works for regular files, not ZVOLs. This is currently an insecure implementation without permission checking, although this isn't that big of a deal since truncate_range() isn't even callable from userspace. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #334	2012-02-09 15:20:32 -08:00
Etienne Dechamps	dde9380a1b	Use 32 as the default number of zvol threads. Currently, the `zvol_threads` variable, which controls the number of worker threads which process items from the ZVOL queues, is set to the number of available CPUs. This choice seems to be based on the assumption that ZVOL threads are CPU-bound. This is not necessarily true, especially for synchronous writes. Consider the situation described in the comments for `zil_commit()`, which is called inside `zvol_write()` for synchronous writes: > itxs are committed in batches. In a heavily stressed zil there will be a > commit writer thread who is writing out a bunch of itxs to the log for a > set of committing threads (cthreads) in the same batch as the writer. > Those cthreads are all waiting on the same cv for that batch. > > There will also be a different and growing batch of threads that are > waiting to commit (qthreads). When the committing batch completes a > transition occurs such that the cthreads exit and the qthreads become > cthreads. One of the new cthreads becomes the writer thread for the batch. > Any new threads arriving become new qthreads. We can easily deduce that, in the case of ZVOLs, there can be a maximum of `zvol_threads` cthreads and qthreads. The default value for `zvol_threads` is typically between 1 and 8, which is way too low in this case. This means there will be a lot of small commits to the ZIL, which is very inefficient compared to a few big commits, especially since we have to wait for the data to be on stable storage. Increasing the number of threads will increase the amount of data waiting to be committed and thus the size of the individual commits. On my system, in the context of VM disk image storage (lots of small synchronous writes), increasing `zvol_threads` from 8 to 32 results in a 50% increase in sequential synchronous write performance. We should choose a more sensible default for `zvol_threads`.
Unfortunately the optimal value is difficult to determine automatically, since it depends on the synchronous write latency of the underlying storage devices. In any case, a hardcoded value of 32 would probably be better than the current situation. Having a lot of ZVOL threads doesn't seem to have any real downside anyway. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Fixes #392	2012-02-08 13:58:10 -08:00
Etienne Dechamps	34037afe24	Improve ZVOL queue behavior. The Linux block device queue subsystem exposes a number of configurable settings described in Linux block/blk-settings.c. The defaults for these settings are tuned for hard drives, and are not optimized for ZVOLs. Proper configuration of these options would allow upper layers (I/O scheduler) to take better decisions about write merging and ordering. Detailed rationale: - max_hw_sectors is set to unlimited (UINT_MAX). zvol_write() is able to handle writes of any size, so there's no reason to impose a limit. Let the upper layer decide. - max_segments and max_segment_size are set to unlimited. zvol_write() will copy the requests' contents into a dbuf anyway, so the number and size of the segments are irrelevant. Let the upper layer decide. - physical_block_size and io_opt are set to the ZVOL's block size. This has the potential to somewhat alleviate issue #361 for ZVOLs, by warning the upper layers that writes smaller than the volume's block size will be slow. - The NONROT flag is set to indicate this isn't a rotational device. Although the backing zpool might be composed of rotational devices, the resulting ZVOL often doesn't exhibit the same behavior due to the COW mechanisms used by ZFS. Setting this flag will prevent upper layers from making useless decisions (such as reordering writes) based on incorrect assumptions about the behavior of the ZVOL. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-02-07 16:23:06 -08:00
Etienne Dechamps	b18019d2d8	Fix synchronicity for ZVOLs. zvol_write() assumes that the write request must be written to stable storage if rq_is_sync() is true. Unfortunately, this assumption is incorrect. Indeed, "sync" does not mean what we think it means in the context of the Linux block layer. This is well explained in linux/fs.h: WRITE: A normal async write. Device will be plugged. WRITE_SYNC: Synchronous write. Identical to WRITE, but passes down the hint that someone will be waiting on this IO shortly. WRITE_FLUSH: Like WRITE_SYNC but with preceding cache flush. WRITE_FUA: Like WRITE_SYNC but data is guaranteed to be on non-volatile media on completion. In other words, SYNC does not mean that the write must be on stable storage on completion. It just means that someone is waiting on us to complete the write request. Thus triggering a ZIL commit for each SYNC write request on a ZVOL is unnecessary and harmful for performance. To make matters worse, ZVOL users have no way to express that they actually want data to be written to stable storage, which means the ZIL is broken for ZVOLs. The request for stable storage is expressed by the FUA flag, so we must commit the ZIL after the write if the FUA flag is set. In addition, we must commit the ZIL before the write if the FLUSH flag is set. Also, we must inform the block layer that we actually support FLUSH and FUA. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-02-07 16:23:06 -08:00
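The rule described in the commit above (commit the ZIL before the write when FLUSH is set, after the write when FUA is set, and not at all for a bare SYNC hint) can be sketched as a small decision function. This is an illustration only: the flag names, the `commit_plan` struct, and `plan_zil_commits()` are hypothetical stand-ins, not the block layer's or ZFS's actual identifiers.

```c
#include <stdbool.h>

/* Hypothetical request flags standing in for the block layer's bits. */
enum {
	SKETCH_REQ_SYNC  = 1 << 0,	/* hint: someone is waiting on this I/O */
	SKETCH_REQ_FLUSH = 1 << 1,	/* flush the cache before the write */
	SKETCH_REQ_FUA   = 1 << 2,	/* data must be stable on completion */
};

/* Where a ZIL commit is needed relative to the write itself. */
struct commit_plan {
	bool commit_before;
	bool commit_after;
};

/*
 * Decide when to commit the ZIL for one write request.
 * SYNC alone requests neither commit; it is only a latency hint.
 */
static struct commit_plan
plan_zil_commits(unsigned int flags)
{
	struct commit_plan plan;

	plan.commit_before = (flags & SKETCH_REQ_FLUSH) != 0;
	plan.commit_after = (flags & SKETCH_REQ_FUA) != 0;
	return (plan);
}
```

A request carrying only the SYNC hint yields no commits under this sketch, which is the behavior change the commit argues for.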
Etienne Dechamps	56c34bac44	Support "sync=always" for ZVOLs. Currently the "sync=always" property works for regular ZFS datasets, but not for ZVOLs. This patch remedies that. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Fixes #374.	2012-02-07 16:23:06 -08:00
Brian Behlendorf	47621f3d76	Linux 3.3 compat, sops->show_options() The second argument of sops->show_options() was changed from a 'struct vfsmount *' to a 'struct dentry *'. Add an autoconf check to detect the API change and then conditionally define the expected interface. In either case we are only interested in the zfs_sb_t. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #549	2012-02-03 10:02:01 -08:00
Brian Behlendorf	d7e398ce1a	Cleanup ZFS debug infrastructure Historically the internal zfs debug infrastructure has been scattered throughout the code. Since we expect to start making more use of this code this patch performs some cleanup. * Consolidate the zfs debug infrastructure in the zfs_debug.[ch] files. This includes moving the zfs_flags and zfs_recover variables, plus moving the zfs_panic_recover() function. * Remove the existing unused functionality in zfs_debug.c and replace it with code which correctly utilizes the spl logging infrastructure. * Remove the __dprintf() function from zfs_ioctl.c. This is dead code, the dprintf() functionality in the kernel relies on the spl log support. * Remove dprintf() from hdr_recl(). This wasn't particularly useful and was missing the required format specifier anyway. * Subsequent patches should unify the dprintf() and zfs_dbgmsg() functions. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-02-02 11:24:30 -08:00
Brian Behlendorf	0c5dde492f	Allow multiple values per directory entry When using zfs to back a Lustre filesystem it's advantageous to store a fid with the object id in the directory zap. The only technical impediment to doing this is that the zpl code expects a single value in the zap per directory entry. This change relaxes that requirement such that multiple entries are allowed provided the first one is the object id. The zpl code will just ignore additional entries. This allows the ZoL code to mount datasets which are being used as Lustre server backends. Once the upstream feature flags support is merged in this change should be updated to a read-only feature. Until this occurs other zfs implementations will not be able to read the zfs filesystems created by Lustre. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-02-02 11:22:08 -08:00
Brian Behlendorf	e29be02e46	Export symbol zfs_attr_table Export the zfs_attr_table symbol so it may be used by non-zpl consumers which are still interested in writing a zpl compatible dataset (e.g. Lustre). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-01-27 09:23:36 -08:00
Richard Laager	57a4eddc4d	Allow setting bootfs on any pool The vdev_is_bootable() restrictions are no longer necessary with recent GRUB2 code. FreeBSD has implemented the same change, except that I moved the Solaris comment to be inside the #ifdef __sun__ block. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #317	2012-01-17 13:49:07 -08:00
Ned Bass	08d08ebba2	Reduce number of zio free threads As described in Issue #458 and #258, unlinking large amounts of data can cause the threads in the zio free wait queue to start spinning. Reducing the number of z_fr_iss threads from a fixed value of 100 to 1 per cpu significantly reduces contention on the taskq spinlock and improves throughput. Instrumenting the taskq code showed that __taskq_dispatch() can spend a long time holding tq->tq_lock if there are a large number of threads in the queue. It turns out the time spent in wake_up() scales linearly with the number of threads in the queue. When a large number of short work items are dispatched, as seems to be the case with unlink, the worker threads drain the queue faster than the dispatcher can fill it. They then all pile into the work wait queue to wait for new work items. So if 100 threads are in the queue, wake_up() takes about 100 times as long, and the woken threads have to spin until the dispatcher releases the lock. Reducing the number of threads helps with the symptoms, but doesn't get to the root of the problem. It would seem that wake_up() shouldn't scale linearly in time with queue depth, particularly if we are only trying to wake up one thread. In that vein, I tried making all of the waiting processes exclusive to prevent the scheduler from iterating over the entire list, but I still saw the linear time scaling. So further investigation is needed, but in the meantime reducing the thread count is an easy workaround. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #258 Issue #458	2012-01-17 08:54:00 -08:00
Darik Horn	96b91ef0d6	Apply the ZoL coding standard to zpl_xattr.c Make the indenting in the zpl_xattr.c file consistent with the Sun coding standard by removing soft tabs. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-01-12 15:12:03 -08:00
Brian Behlendorf	166dd49de0	Linux 3.2 compat, security_inode_init_security() The security_inode_init_security() API has been changed to include a filesystem specific callback to write security extended attributes. This was done to support the initialization of multiple LSM xattrs and the EVM xattr. This change updates the code to use the new API when it's available. Otherwise it falls back to the previous implementation. In addition, the ZFS_AC_KERNEL_6ARGS_SECURITY_INODE_INIT_SECURITY autoconf test has been made more rigorous by passing the expected types. This is done to ensure we always properly detect the correct form for the security_inode_init_security() API. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #516	2012-01-12 15:06:39 -08:00
Brian Behlendorf	ab26409db7	Linux 3.1 compat, super_block->s_shrink The Linux 3.1 kernel has introduced the concept of per-filesystem shrinkers which are directly associated with a super block. Prior to this change there was one shared global shrinker. The zfs code relied on being able to call the global shrinker when the arc_meta_limit was exceeded. This would cause the VFS to drop references on a fraction of the dentries in the dcache. The ARC could then safely reclaim the memory used by these entries and honor the arc_meta_limit. Unfortunately, when per-filesystem shrinkers were added the old interfaces were made unavailable. This change adds support to use the new per-filesystem shrinker interface so we can continue to honor the arc_meta_limit. The major benefit of the new interface is that we can now target only the zfs filesystem for dentry and inode pruning. Thus we can minimize any impact on the caching of other filesystems. In the context of making this change several other important issues related to managing the ARC were addressed, they include: * The dnlc_reduce_cache() function which was called by the ARC to drop dentries for the Posix layer was replaced with a generic zfs_prune_t callback. The ZPL layer now registers a callback to drop these dentries removing a layering violation which dates back to the Solaris code. This callback can also be used by other ARC consumers such as Lustre. arc_add_prune_callback() arc_remove_prune_callback() * The arc_reduce_dnlc_percent module option has been changed to arc_meta_prune for clarity. The dnlc functions are specific to Solaris's VFS and have already been largely eliminated. The replacement tunable now represents the number of bytes the prune callback will request when invoked. * Less aggressively invoke the prune callback. We used to call this whenever we exceeded the arc_meta_limit, however that's not strictly correct since it results in overzealous reclaim of dentries and inodes. 
It is now only called once the arc_meta_limit is exceeded and every effort has been made to evict other data from the ARC cache. * More promptly manage exceeding the arc_meta_limit. When reading meta data in to the cache, if a buffer was unable to be recycled, notify the arc_reclaim thread to invoke the required prune. * Added arcstat_prune kstat which is incremented when the ARC is forced to request that a consumer prune its cache. Remember this will only occur when the ARC has no other choice. If it can evict buffers safely without invoking the prune callback it will. * This change is also expected to resolve the unexpected collapses of the ARC cache. This would occur because when just the arc_meta_limit was exceeded, reclaim pressure would be exerted on the arc_c value via arc_shrink(). This effectively shrunk the entire cache when really we just needed to reclaim meta data. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #466 Closes #292	2012-01-11 11:46:02 -08:00
Darik Horn	28eb9213d8	Linux 3.2 compat: set_nlink() Directly changing inode->i_nlink is deprecated in Linux 3.2 by commit SHA: bfe8684869601dacfcb2cd69ef8cfd9045f62170 Use the new set_nlink() kernel function instead. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes: #462	2011-12-16 20:02:52 -08:00
Garrett D'Amore	a38718a63d	Illumos #734 : Use taskq_dispatch_ent() interface It has been observed that some of the hottest locks are those of the zio taskqs. Contention on these locks can limit the rate at which zios are dispatched which limits performance. This upstream change from Illumos uses a new interface to the taskqs which allows them to utilize a prealloc'ed taskq_ent_t. This removes the need to perform an allocation at dispatch time while holding the contended lock. This has the effect of improving system performance. Reviewed by: Albert Lee <trisk@nexenta.com> Reviewed by: Richard Lowe <richlowe@richlowe.net> Reviewed by: Alexey Zaytsev <alexey.zaytsev@nexenta.com> Reviewed by: Jason Brian King <jason.brian.king@gmail.com> Reviewed by: George Wilson <gwilson@zfsmail.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Approved by: Gordon Ross <gwr@nexenta.com> References to Illumos issue: https://www.illumos.org/issues/734 Ported-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #482	2011-12-14 09:19:30 -08:00
Brian Behlendorf	30a9524e45	Set zvol_major/zvol_threads permissions The zvol_major and zvol_threads module options were being created with 0 permission bits. This prevented them from being listed in the /sys/module/zfs/parameters/ directory, although they were visible in `modinfo zfs`. This patch fixes the issue by updating the permission bits to 0444. For the moment these options must be read-only because they are used during module initialization. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #392	2011-12-07 09:27:50 -08:00
Brian Behlendorf	23bdb07d4e	Update default ARC memory limits In the upstream OpenSolaris ZFS code the maximum ARC usage is limited to 3/4 of memory or all but 1GB, whichever is larger. Because of how Linux's VM subsystem is organized these defaults have proven to be too large which can lead to stability issues. To avoid making everyone manually tune the ARC the defaults are being changed to 1/2 of memory or all but 4GB. The rationale for this is as follows: * Desktop Systems (less than 8GB of memory) Limiting the ARC to 1/2 of memory is desirable for desktop systems which have highly dynamic memory requirements. For example, launching your web browser can suddenly result in a demand for several gigabytes of memory. This memory must be reclaimed from the ARC cache which can take some time. The user will experience this reclaim time as a sluggish system with poor interactive performance. Thus in this case it is preferable to leave the memory as free and available for immediate use. * Server Systems (more than 8GB of memory) Using all but 4GB of memory for the ARC is preferable for server systems. These systems often run with minimal user interaction and have long running daemons with relatively stable memory demands. These systems will benefit most by having as much data cached in memory as possible. These values should work well for most configurations. However, if you have a desktop system with more than 8GB of memory you may wish to further restrict the ARC. This can still be accomplished by setting the 'zfs_arc_max' module option. Additionally, keep in mind these aren't currently hard limits. The ARC is based on a slab implementation which can suffer from memory fragmentation. Because this fragmentation is not visible from the ARC it may believe it is within the specified limits while actually consuming slightly more memory. How much more memory gets consumed will be determined by how badly fragmented the slabs are. 
In the long term this can be mitigated by slab defragmentation code which was the OpenSolaris solution. Or preferably, using the page cache to back the ARC under Linux would be even better. See issue #75 for the benefits of more tightly integrating with the page cache. This change also fixes an issue where the default ARC max was being set incorrectly for machines with less than 2GB of memory. The constant in the arc_c_max comparison must be explicitly cast to a uint64_t type to prevent overflow and the wrong conditional branch being taken. This failure was typically observed in VMs which are commonly created with less than 2GB of memory. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #75	2011-12-05 12:02:12 -08:00
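As a worked example of the new default described above, the following sketch computes the larger of "1/2 of memory" and "all but 4GB", assuming the same "whichever is larger" rule the message states for the upstream defaults. The function and macro names are hypothetical and this is not the code from arc.c; it only illustrates the arithmetic and why the constant needs 64-bit width.

```c
#include <stdint.h>

#define	SKETCH_GIB	(UINT64_C(1) << 30)

/*
 * Hypothetical helper: default ARC maximum for a machine with
 * physmem_bytes of memory, taken as the larger of half of memory
 * and all of memory except 4GB.
 */
static uint64_t
sketch_arc_max(uint64_t physmem_bytes)
{
	/* Build the 4GB constant in 64 bits; a plain int `4 << 30` overflows. */
	uint64_t reserve = 4 * SKETCH_GIB;
	uint64_t half = physmem_bytes / 2;
	/* Guard the subtraction so small machines do not wrap around. */
	uint64_t all_but_4g =
	    physmem_bytes > reserve ? physmem_bytes - reserve : 0;

	return (half > all_but_4g ? half : all_but_4g);
}
```

Under these assumptions a 2GB machine gets a 1GB limit, an 8GB machine gets 4GB from either rule, and a 16GB machine gets 12GB, which matches the desktop/server split at 8GB described in the message.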
Brian Behlendorf	f31b3ebe6e	Allow xattrs on symlinks The Solaris version of ZFS does not allow xattrs to be set on symlinks due to the way they implemented the attropen() system call. Linux however implements xattrs through the lgetxattr() and lsetxattr() system calls which do not have this limitation. The only reason this hasn't always worked under ZFS on Linux is that the xattr handlers were not registered for symlink type inodes. This was done simply to be consistent with the Solaris behavior. Upon further reflection I believe this should be allowed under Linux. The only ill effect would be that the xattrs on symlinks will not be visible when the pool is imported on a Solaris system. This also has the benefit that it allows for SELinux style security xattr labeling which expects to be able to set xattrs on all inode types. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #272	2011-11-29 10:24:24 -08:00
Brian Behlendorf	82a37189aa	Implement SA based xattrs The current ZFS implementation stores xattrs on disk using a hidden directory. In this directory a file name represents the xattr name and the file contents are the xattr binary data. This approach is very flexible and allows for arbitrarily large xattrs. However, it also suffers from a significant performance penalty. Accessing a single xattr can require up to three disk seeks. 1) Lookup the dnode object. 2) Lookup the dnode's xattr directory object. 3) Lookup the xattr object in the directory. To avoid this performance penalty Linux filesystems such as ext3 and xfs try to store the xattr as part of the inode on disk. When the xattr is too large to store in the inode then a single external block is allocated for them. In practice most xattrs are small and this approach works well. The addition of System Attributes (SA) to zfs provides us a clean way to make this optimization. When the dataset property 'xattr=sa' is set then xattrs will be preferentially stored as System Attributes. This allows tiny xattrs (~100 bytes) to be stored with the dnode and up to 64k of xattrs to be stored in the spill block. If additional xattr space is required, which is unlikely under Linux, they will be stored using the traditional directory approach. This optimization results in roughly a 3x performance improvement when accessing xattrs which brings zfs roughly to parity with ext4 and xfs (see table below). When multiple xattrs are stored per-file the performance improvements are even greater because all of the xattrs stored in the spill block will be cached. However, by default SA based xattrs are disabled in the Linux port to maximize compatibility with other implementations. If you do enable SA based xattrs then they will not be visible on platforms which do not support this feature. 
---------------------------------------------------------------------- Time in seconds to get/set one xattr of N bytes on 100,000 files ------+--------------------------------+------------------------------ \| setxattr \| getxattr bytes \| ext4 xfs zfs-dir zfs-sa \| ext4 xfs zfs-dir zfs-sa ------+--------------------------------+------------------------------ 1 \| 2.33 31.88 21.50 4.57 \| 2.35 2.64 6.29 2.43 32 \| 2.79 30.68 21.98 4.60 \| 2.44 2.59 6.78 2.48 256 \| 3.25 31.99 21.36 5.92 \| 2.32 2.71 6.22 3.14 1024 \| 3.30 32.61 22.83 8.45 \| 2.40 2.79 6.24 3.27 4096 \| 3.57 317.46 22.52 10.73 \| 2.78 28.62 6.90 3.94 16384 \| n/a 2342.39 34.30 19.20 \| n/a 45.44 145.90 7.55 65536 \| n/a 2941.39 128.15 131.32* \| n/a 141.92 256.85 262.12* Legend: * ext4 - Stock RHEL6.1 ext4 mounted with '-o user_xattr'. * xfs - Stock RHEL6.1 xfs mounted with default options. * zfs-dir - Directory based xattrs only. * zfs-sa - Prefer SAs but spill in to directories as needed, a trailing * indicates overflow in to directories occured. NOTE: Ext4 supports 4096 bytes of xattr name/value pairs per file. NOTE: XFS and ZFS have no limit on xattr name/value pairs per file. NOTE: Linux limits individual name/value pairs to 65536 bytes. NOTE: All setattr/getattr's were done after dropping the cache. NOTE: All tests were run against a single hard drive. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #443	2011-11-28 15:45:51 -08:00
Brian Behlendorf	ca5fd24984	Limit maximum ashift value to 12 While we initially allowed you to set your ashift as large as 17 (SPA_MAXBLOCKSIZE) that is actually unsafe. What wasn't considered at the time is that each uberblock written to the vdev label ring buffer will be of this size. Now the buffer is statically sized to 128k and we need to be able to fit several uberblocks in it. With a large ashift that becomes a problem. Therefore I'm reducing the maximum configurable ashift value to 12. This is large enough for the 4k sector drives and small enough that we can still keep the most recent 32 uberblocks in the vdev label ring buffer. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #425	2011-11-11 14:50:48 -08:00
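The arithmetic behind the ashift limit above can be checked with a short sketch: a 128k ring divided into slots of `1 << ashift` bytes. The helper name is hypothetical, and the sketch deliberately ignores any minimum slot size the real label code may impose, so it is only meaningful for the larger ashift values discussed in the message.

```c
#include <stdint.h>

/* Ring buffer size stated in the commit message: 128k. */
#define	SKETCH_UBERBLOCK_RING_BYTES	(UINT64_C(128) * 1024)

/*
 * Hypothetical helper: how many uberblocks of (1 << ashift) bytes
 * fit in the 128k ring. Assumes each uberblock occupies exactly
 * one ashift-sized slot, as the message describes.
 */
static uint64_t
sketch_uberblocks_in_ring(unsigned int ashift)
{
	return (SKETCH_UBERBLOCK_RING_BYTES >> ashift);
}
```

With ashift=12 the ring holds 32 uberblocks, as the message says; with ashift=17 it holds only one, which leaves no history to fall back on.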
Brian Behlendorf	adcd70bd1a	Linux 3.1 compat, fops->fsync() The Linux 3.1 kernel updated the fops->fsync() callback yet again. They now pass the requested range and delegate the responsibility for calling filemap_write_and_wait_range() to the callback. In addition i_mutex is no longer held by the caller and the callback is responsible for taking the lock if required. This commit updates the code to provide a zpl_fsync() function for the updated API. Implementations for the previous two APIs are also maintained for compatibility. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #445	2011-11-10 10:03:08 -08:00
Brian Behlendorf	5547c2f1bf	Simplify BDI integration Update the code to use the bdi_setup_and_register() helper to simplify the bdi integration code. The updated code now just registers the bdi during mount and destroys it during unmount. The only complication is that for 2.6.32 - 2.6.33 kernels the helper wasn't available so in these cases the zfs code must provide it. Luckily the bdi_setup_and_register() function is trivial. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #367	2011-11-08 10:19:03 -08:00
Brian Behlendorf	591fb62f19	Disown dataset in zfs_sb_create() Fix an unlikely failure cause in zfs_sb_create() which could leave the dataset owned on error and thus unavailable until after a reboot. Disown the dataset if SA are expected but are in fact missing. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-11-08 10:18:40 -08:00
Brian Behlendorf	ae6ba3dbe6	Improve meta data performance Profiling the system during meta data intensive workloads, such as creating/removing millions of files, revealed that the system was cpu bound. A large fraction of that cpu time was being spent waiting on the virtual address space spin lock. It turns out this was caused by certain heavily used kmem_caches being backed by virtual memory. By default a kmem_cache will dynamically determine the type of memory used based on the object size. For large objects virtual memory is usually preferable and for small objects physical memory is a better choice. See the spl_slab_alloc() function for a longer discussion on this. However, there is a certain amount of gray area when defining a 'large' object. For the following caches it turns out they were just over the line: * dnode_cache * zio_cache * zio_link_cache * zio_buf_512_cache * zfs_data_buf_512_cache Now because we know there will be a lot of churn in these caches, and because we know the slabs will still be reasonably sized, we can safely request with the KMC_KMEM flag that the caches be backed with physical memory addresses. This entirely avoids the need to serialize on the virtual address space lock. As a bonus this also reduces our vmalloc usage which will be good for 32-bit kernels which have a very small virtual address space. It will also probably be good for interactive performance since unrelated processes could also block on this same global lock. Finally, we may see less cpu time being burned in the arc_reclaim and txg_sync_threads. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #258	2011-11-03 10:19:21 -07:00
Brian Behlendorf	6a95d0b74c	Fix NULL deref in balance_pgdat() Be careful not to unconditionally clear the PF_MEMALLOC bit in the task structure. It may have already been set when entering zpl_putpage() in which case it must remain set on exit. In particular the kswapd thread will have PF_MEMALLOC set in order to prevent it from entering direct reclaim. By clearing it we allow the following NULL deref to potentially occur. BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff8109c7ab>] balance_pgdat+0x25b/0x4ff Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #287	2011-11-03 10:15:39 -07:00
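The fix described above follows a common save-and-restore pattern: remember whether the flag was already set on entry, and clear it on exit only if this code path set it. The sketch below models that in plain user-space C; the struct, the function name, and the bit value are hypothetical placeholders, not the kernel's `task_struct` or the actual `zpl_putpage()` code.

```c
/* Placeholder bit standing in for the kernel's PF_MEMALLOC flag. */
#define	SKETCH_PF_MEMALLOC	0x0800u

/* Minimal stand-in for the per-task flags word. */
struct sketch_task {
	unsigned int flags;
};

/*
 * Hypothetical putpage-style function: set the flag for the duration
 * of the work, then restore the caller's original state.
 */
static void
sketch_putpage(struct sketch_task *task)
{
	/* Remember whether the caller (e.g. kswapd) already had it set. */
	unsigned int was_set = task->flags & SKETCH_PF_MEMALLOC;

	task->flags |= SKETCH_PF_MEMALLOC;

	/* ... page writeback work would happen here ... */

	/* Clear the bit only if this function was the one to set it. */
	if (!was_set)
		task->flags &= ~SKETCH_PF_MEMALLOC;
}
```

A task that entered with the flag set leaves with it still set, which is the property the commit restores for kswapd.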
Gunnar Beutner	a7b125e9a5	Fix a race condition in zfs_getattr_fast() zfs_getattr_fast() was missing a lock on the ZFS superblock which could result in zfs_znode_dmu_fini() clearing the zp->z_sa_hdl member while zfs_getattr_fast() was accessing the znode. The result of this would usually be a panic. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Fixes #431	2011-11-03 10:13:09 -07:00
Xin Li	c475167627	Illumos #1661 : Fix flaw in sa_find_sizes() calculation When calculating space needed for SA_BONUS buffers, hdrsize is always rounded up to next 8-aligned boundary. However, in two places the round up was done against sum of 'total' plus hdrsize. On the other hand, hdrsize increments by 4 each time, which means in certain conditions, we would end up returning with will_spill == 0 and (total + hdrsize) larger than full_space, leading to a failed assertion because it's invalid for dmu_set_bonus. Reviewed by: Matthew Ahrens <matt@delphix.com> Reviewed by: Dan McDonald <danmcd@nexenta.com> Approved by: Gordon Ross <gwr@nexenta.com> References to Illumos issue: https://www.illumos.org/issues/1661 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #426	2011-10-24 09:57:52 -07:00
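To see why it matters which quantity is rounded, the sketch below compares rounding `hdrsize` alone against rounding the sum of `total` and `hdrsize`. It is a simplified illustration with hypothetical helper names and made-up numbers, not the `sa_find_sizes()` code; it only shows that the two calculations can disagree about whether the data fits.

```c
#include <stdbool.h>
#include <stdint.h>

/* Round x up to the next multiple of 8. */
static uint64_t
sketch_roundup8(uint64_t x)
{
	return ((x + 7) & ~UINT64_C(7));
}

/* Fit check that rounds only the header, as the message says is intended. */
static bool
sketch_fits_round_hdr(uint64_t total, uint64_t hdrsize, uint64_t full_space)
{
	return (total + sketch_roundup8(hdrsize) <= full_space);
}

/* Fit check that rounds the sum instead, the inconsistent variant. */
static bool
sketch_fits_round_sum(uint64_t total, uint64_t hdrsize, uint64_t full_space)
{
	return (sketch_roundup8(total + hdrsize) <= full_space);
}
```

For example, with total = 100, hdrsize = 12 and full_space = 114, rounding the sum gives 112 and reports a fit, while rounding the header gives 100 + 16 = 116 and reports a spill.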
Darik Horn	3cee2262a6	Change sun.com URLs to zfsonlinux.org ZFS contains error messages that point to the defunct www.sun.com domain, which is currently offline. Change these error messages to use the zfsonlinux.org mirror instead. This commit depends on: zfsonlinux/zfsonlinux.github.com@8e10ead3dc Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-10-24 09:52:21 -07:00
Brian Behlendorf	6f2255ba8a	Set mtime on symbolic links Register the setattr/getattr callbacks for symlinks. Without these the generic inode_setattr() and generic_fillattr() functions will be used. In the setattr case this will only result in the inode being updated in memory, the dirty_inode callback would also normally run but none is registered for zfs. The straightforward fix is to set the setattr/getattr callbacks for symlinks so they are handled just like files and directories. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #412	2011-10-18 15:49:31 -07:00
Alexander Stetsenko	8d35c1499d	Illumos #755 : dmu_recv_stream builds incomplete guid_to_ds_map An incomplete guid_to_ds_map would cause restore_write_byref() to fail while receiving a de-duplicated backup stream. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Garrett D`Amore <garrett@nexenta.com> Reviewed by: Gordon Ross <gwr@nexenta.com> Approved by: Gordon Ross <gwr@nexenta.com> References to Illumos issue and patch: - https://www.illumos.org/issues/755 - https://github.com/illumos/illumos-gate/commit/ec5cf9d53a Signed-off-by: Gunnar Beutner <gunnar@beutner.name> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #372	2011-10-18 11:18:14 -07:00
Brian Behlendorf	86f35f34f4	Export symbols for the VFS API Export all symbols already marked extern in the zfs_vfsops.h header. Several non-static symbols have also been added to the header and exported. This allows external modules to more easily create and manipulate properly created ZFS filesystem type datasets. Rename zfsvfs_teardown() to zfs_sb_teardown and export it. This is done simply for consistency with the rest of the code base. All other zfsvfs_* functions have already been renamed. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-10-11 10:25:59 -07:00
Brian Behlendorf	e45aa45298	Export symbols for the full SA API Export all the symbols for the system attribute (SA) API. This allows external modules to cleanly manipulate the SAs associated with a dnode. Documentation for the SA API can be found in the module/zfs/sa.c source. This change also removes the zfs_sa_uprade_pre, and zfs_sa_uprade_post prototypes. The functions themselves were dropped some time ago. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-10-05 15:59:56 -07:00
Andreas Dilger	baab063016	zpl: Fix "df -i" to have better free inodes value Due to the confusion in Linux statfs between f_frsize and f_bsize the blocks counts were changed to be in units of z_max_blksize instead of SPA_MINBLOCKSIZE as it is on other platforms. However, the free files calculation in zfs_statvfs() is limited by the free blocks count, since each dnode consumes one block/sector. This provided a reasonable estimate of free inodes, but on Linux this meant that the free inodes count was underestimated by a large amount, since 256 512-byte dnodes can fit into a 128kB block, and more if the max blocksize is increased to 1MB or larger. Also, the use of SPA_MINBLOCKSIZE is semantically incorrect since DNODE_SIZE may change to a value other than SPA_MINBLOCKSIZE and may even change per dataset, and devices with large sectors setting ashift will also use a larger blocksize. Correct the f_ffree calculation to use (availbytes >> DNODE_SHIFT) to more accurately compute the maximum number of dnodes that can be created. Signed-off-by: Andreas Dilger <adilger@whamcloud.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #413 Closes #400	2011-09-28 11:27:10 -07:00
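The difference between the two estimates in the commit above is easy to quantify. The sketch below compares a free-inode count limited by the number of free blocks with the corrected `availbytes >> DNODE_SHIFT` form, assuming 512-byte dnodes (a shift of 9) as the message states. The function and macro names are hypothetical, not the `zfs_statvfs()` source.

```c
#include <stdint.h>

/* 512-byte dnodes, per the commit message. */
#define	SKETCH_DNODE_SHIFT	9

/* Old-style estimate: at most one free inode per free block. */
static uint64_t
sketch_ffree_by_blocks(uint64_t availbytes, uint64_t blksize)
{
	return (availbytes / blksize);
}

/* Corrected estimate: one free inode per dnode-sized chunk. */
static uint64_t
sketch_ffree_by_dnodes(uint64_t availbytes)
{
	return (availbytes >> SKETCH_DNODE_SHIFT);
}
```

For 1GB of available space and a 128kB block size, the block-limited estimate is 8192 inodes while the dnode-based one is 2097152, the factor of 256 mentioned in the message.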
Brian Behlendorf	dee28b0700	Export symbols for the full ZAP API Export all the symbols for the ZAP API. This allows external modules to cleanly interface with ZAP type objects. Previously only a subset of the functionality was exposed. Documentation for the ZAP API can be found in the sys/zap.h header. This change also removes a duplicate zap_increment_int() prototype. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-09-27 16:12:36 -07:00
Brian Behlendorf	fa6e5ced2f	Suppress kmem_alloc() warning in zfs_prop_set_special() Suppress the warning for this large kmem_alloc() because it is not that far over the warning threshold (8k) and it is short lived. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-09-15 20:26:51 -07:00
Brian Behlendorf	2708f716c0	Fix usage of zsb after free Caught by code inspection, the variable zsb was referenced after being freed. Move the kmem_free() to the end of the function. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-09-09 10:29:48 -07:00
Brian Behlendorf	95d9fd028b	Fix incompatible pointer type warning This warning was accidentally introduced by commit `f3ab88d646` which updated the .readpages() implementation. The fix is to simply cast the helper function to the appropriate type when passed. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-08-19 15:16:30 -07:00
Brian Behlendorf	f3ab88d646	Correctly lock pages for .readpages() Unlike the .readpage() callback, which is passed a single locked page to be populated, the .readpages() callback is passed a list of unlocked pages which are all marked for read-ahead (PG_readahead set). It is the responsibility of .readpages() to ensure the pages are properly locked before being populated. Prior to this change the requested read-ahead pages would be updated outside of the page lock, which is unsafe. The unlocked pages would then be unlocked again, which is harmless but should have been immediately detected as a bug. Unfortunately, newer kernels failed to detect this issue because the check is done with a VM_BUG_ON, which is disabled by default. Luckily, the old Debian Lenny 2.6.26 kernel caught this because it simply uses a BUG_ON. The straightforward fix for this is to update the .readpages() callback to use the read_cache_pages() helper function. The helper function will ensure that each page in the list is properly locked before it is passed to the .readpage() callback. In addition to resolving the bug, this results in a nice simplification of the existing code. The downside to this change is that instead of passing one large read request to the dmu, multiple smaller ones are submitted. All of these requests, however, are marked for readahead so the lower layers should issue a large I/O regardless. Thus most of the requests should hit the ARC cache. Further optimization of this code can be done in the future if a performance analysis determines it to be worthwhile. But for the moment, it is preferable that the code be correct and understandable. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #355	2011-08-08 13:24:52 -07:00
Brian Behlendorf	76659dc110	Add backing_device_info per-filesystem For a long time now the kernel has been moving away from using the pdflush daemon to write 'old' dirty pages to disk. The primary reason for this is because the pdflush daemon is single threaded and can be a limiting factor for performance. Since pdflush sequentially walks the dirty inode list for each super block any delay in processing can slow down dirty page writeback for all filesystems. The replacement for pdflush is called bdi (backing device info). The bdi system involves creating a per-filesystem control structure each with its own private sets of queues to manage writeback. The advantage is greater parallelism which improves performance and prevents a single filesystem from slowing writeback to the others. For a long time both systems co-existed in the kernel so it wasn't strictly required to implement the bdi scheme. However, as of Linux 2.6.36 kernels the pdflush functionality has been retired. Since ZFS already bypasses the page cache for most I/O this is only an issue for mmap(2) writes which must go through the page cache. Even then adding this missing support for newer kernels was overlooked because there are other mechanisms which can trigger writeback. However, there is one critical case where not implementing the bdi functionality can cause problems. If an application handles a page fault it can enter the balance_dirty_pages() callpath. This will result in the application hanging until the number of dirty pages in the system drops below the dirty ratio. Without a registered backing_device_info for the filesystem the dirty pages will not get written out. Thus the application will hang. As mentioned above this was less of an issue with older kernels because pdflush would eventually write out the dirty pages. This change adds a backing_device_info structure to the zfs_sb_t which is already allocated per-super block. 
It is then registered when the filesystem is mounted and unregistered on unmount. It will not be registered for mounted snapshots, which are read-only. This change will result in a flush-<pool> thread being dynamically created and destroyed per mounted filesystem for writeback. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #174	2011-08-04 13:37:38 -07:00
Brian Behlendorf	3c0e5c0f45	Cleanup mmap(2) writes While the existing implementation of .writepage()/zpl_putpage() was functional it was not entirely correct. In particular, it would move dirty pages in to a clean state simply after copying them in to the ARC cache. This would result in the pages being lost if the system were to crash even though the Linux VFS believed them to be safe on stable storage. Since at the moment virtually all I/O, except mmap(2), bypasses the page cache this isn't as bad as it sounds. However, as we hopefully start using the page cache more, getting this right becomes more important so it's good to improve this now. This patch takes a big step in that direction by updating the code to correctly move dirty pages through a writeback phase before they are marked clean. When a dirty page is copied in to the ARC it will now be set in writeback and a completion callback is registered with the transaction. The page will stay in writeback until the dmu runs the completion callback indicating the page is on stable storage. At this point the page can be safely marked clean. This process is normally entirely asynchronous and will be repeated for every dirty page. This may initially sound inefficient but most of these pages will end up in a few txgs. That means when they are eventually written to disk they should be nicely batched. However, there is room for improvement. It may still be desirable to batch up the pages in to larger writes for the dmu. This would reduce the number of callbacks and small 4k buffers required by the ARC. Finally, if the caller requires that the I/O be done synchronously by setting WB_SYNC_ALL, or if ZFS_SYNC_ALWAYS is set, then the I/O will trigger a zil_commit() to flush the data to stable storage. At which point the registered callbacks will be run, leaving the data safe on disk and marked clean before returning from .writepage. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-08-02 10:34:55 -07:00
Martin Matuska	cddafdcbc5	Illumos #1313 : Integer overflow in txg_delay() The function txg_delay() is used to delay txg (transaction group) threads in ZFS. The timeout value for this function is calculated using: int timeout = ddi_get_lbolt() + ticks; Later, the actual wait is performed: while (ddi_get_lbolt() < timeout && tx->tx_syncing_txg < txg-1 && !txg_stalled(dp)) (void) cv_timedwait(&tx->tx_quiesce_more_cv, &tx->tx_sync_lock, timeout - ddi_get_lbolt()); The ddi_get_lbolt() function returns current uptime in clock ticks and is typed as clock_t. The clock_t type on 64-bit architectures is int64_t. The "timeout" variable will overflow depending on the tick frequency (e.g. for 1000 it will overflow in 24.855 days). This will make the expression "ddi_get_lbolt() < timeout" always false - txg threads will not be delayed anymore at all. This leads to a slowdown in ZFS writes. The attached patch initializes timeout as clock_t to match the return value of ddi_get_lbolt(). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #352	2011-08-01 12:09:43 -07:00
Alexander Stetsenko	0b7936d5c2	Illumos #278 : get rid zfs of python and pyzfs dependencies Remove all python and pyzfs dependencies for consistency and to ensure full functionality even in a minimalist environment. Reviewed by: gordon.w.ross@gmail.com Reviewed by: trisk@opensolaris.org Reviewed by: alexander.r.eremin@gmail.com Reviewed by: jerry.jelinek@joyent.com Approved by: garrett@nexenta.com References to Illumos issue and patch: - https://www.illumos.org/issues/278 - https://github.com/illumos/illumos-gate/commit/1af68beac3 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #340 Issue #160 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-08-01 12:09:36 -07:00
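The txg_delay() overflow described above can be reproduced in user space with a small sketch. It is not the kernel code: lbolt_t is a hypothetical stand-in for a 64-bit clock_t, the helper names are invented for illustration, and the narrowing cast relies on the wrap-around behaviour that GCC and Clang implement for out-of-range signed conversions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for clock_t on a 64-bit platform. */
typedef int64_t lbolt_t;

/*
 * Buggy form: the deadline is stored in a 32-bit int, mirroring
 * "int timeout = ddi_get_lbolt() + ticks;". Once now + ticks exceeds
 * INT32_MAX the stored value wraps negative (implementation-defined
 * conversion; wraps on GCC and Clang).
 */
static bool
keeps_waiting_int(lbolt_t now, lbolt_t ticks)
{
	int32_t timeout = (int32_t)(uint32_t)(now + ticks);
	return (now < timeout);
}

/* Fixed form: the deadline has the same 64-bit type as the tick counter. */
static bool
keeps_waiting_clock(lbolt_t now, lbolt_t ticks)
{
	lbolt_t timeout = now + ticks;
	return (now < timeout);
}

/* Uptime in days at which a 32-bit tick deadline can first overflow. */
static double
days_until_int_overflow(int hz)
{
	return ((double)INT32_MAX / hz / 86400.0);
}
```

Before the overflow point both helpers agree that the wait should continue. After it, only the 64-bit version still reports a pending deadline, which is the "threads will not be delayed anymore" symptom from the message.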
Martin Matuska	ca5252204a	Illumos #1043 : Recursive zfs snapshot destroy fails Prior to revision 11314 if a user was recursively destroying snapshots of a dataset the target dataset was not required to exist. The zfs_secpolicy_destroy_snaps() function introduced the security check on the target dataset, so since then if the target dataset does not exist, the recursive destroy is not performed. Before 11314, only a delete permission check on the snapshot's master dataset was performed. Steps to reproduce: zfs create pool/a zfs snapshot pool/a@s1 zfs destroy -r pool@s1 Therefore I suggest falling back to the old security check if the target snapshot does not exist, and continuing with the destroy. References to Illumos issue and patch: - https://www.illumos.org/issues/1043 - https://www.illumos.org/attachments/217/recursive_dataset_destroy.patch Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #340	2011-08-01 12:09:11 -07:00
Eric Schrock	3e31d2b080	Illumos #883 : ZIL reuse during remount corruption Moving the zil_free() cleanup to zil_close() prevents this problem from occurring in the first place. There is a very good description of the issue and fix in Illumos #883. Reviewed by: Matt Ahrens <Matt.Ahrens@delphix.com> Reviewed by: Adam Leventhal <Adam.Leventhal@delphix.com> Reviewed by: Albert Lee <trisk@nexenta.com> Reviewed by: Gordon Ross <gwr@nexenta.com> Reviewed by: Garrett D'Amore <garrett@nexenta.com> Reviewed by: Dan McDonald <danmcd@nexenta.com> Approved by: Gordon Ross <gwr@nexenta.com> References to Illumos issue and patch: - https://www.illumos.org/issues/883 - https://github.com/illumos/illumos-gate/commit/c9ba2a43cb Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #340	2011-08-01 12:09:11 -07:00
Matt Ahrens	f5fc4acaa7	Illumos #1092 : zfs refratio property Add a "REFRATIO" property, which is the compression ratio based on data referenced. For snapshots, this is the same as COMPRESSRATIO, but for filesystems/volumes, the COMPRESSRATIO is based on the data "USED" (ie, includes blocks in children, but not blocks shared with the origin). This is needed to figure out how much space a filesystem would use if it were not compressed (ignoring snapshots). Reviewed by: George Wilson <George.Wilson@delphix.com> Reviewed by: Adam Leventhal <Adam.Leventhal@delphix.com> Reviewed by: Dan McDonald <danmcd@nexenta.com> Reviewed by: Richard Elling <richard.elling@richardelling.com> Reviewed by: Mark Musante <Mark.Musante@oracle.com> Reviewed by: Garrett D'Amore <garrett@nexenta.com> Approved by: Garrett D'Amore <garrett@nexenta.com> References to Illumos issue and patch: - https://www.illumos.org/issues/1092 - https://github.com/illumos/illumos-gate/commit/187d6ac08a Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #340	2011-08-01 12:09:11 -07:00
George Wilson	6d974228ef	Illumos #1051 : zfs should handle imbalanced luns Today zfs tries to allocate blocks evenly across all devices. This means when devices are imbalanced zfs will use lots of CPU searching for space on devices which tend to be pretty full. It should instead fail quickly on the full LUNs and move onto devices which have more availability. Reviewed by: Eric Schrock <Eric.Schrock@delphix.com> Reviewed by: Matt Ahrens <Matt.Ahrens@delphix.com> Reviewed by: Adam Leventhal <Adam.Leventhal@delphix.com> Reviewed by: Albert Lee <trisk@nexenta.com> Reviewed by: Gordon Ross <gwr@nexenta.com> Approved by: Garrett D'Amore <garrett@nexenta.com> References to Illumos issue and patch: - https://www.illumos.org/issues/510 - https://github.com/illumos/illumos-gate/commit/5ead3ed965 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #340	2011-08-01 12:09:11 -07:00
Garrett D'Amore	2cc6c8db12	Illumos #175 : zfs vdev cache consumes excessive memory Note that with the current ZFS code, it turns out that the vdev cache is not helpful, and in some cases actually harmful. It is better if we disable this. Once some time has passed, we should actually remove this to simplify the code. For now we just disable it by setting the zfs_vdev_cache_size to zero. Note that Solaris 11 has made these same changes. References to Illumos issue and patch: - https://www.illumos.org/issues/175 - https://github.com/illumos/illumos-gate/commit/b68a40a845 Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Eric Schrock <eric.schrock@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #340	2011-08-01 12:09:11 -07:00
Gordon Ross	ef3c1dea70	Illumos #764 : panic in zfs:dbuf_sync_list Hypothesis about what's going on here. At some time in the past, something, i.e. dnode_reallocate() calls one of: dbuf_rm_spill(dn, tx); These will do: dbuf_rm_spill(dnode_t dn, dmu_tx_t tx) dbuf_free_range(dn, DMU_SPILL_BLKID, DMU_SPILL_BLKID, tx) dbuf_undirty(db, tx) Currently dbuf_undirty can leave a spill block in dn_dirty_records[], (it having been put there previously by dbuf_dirty) and free it. Sometime later, dbuf_sync_list trips over this reference to free'd (and typically reused) memory. Also, dbuf_undirty can call dnode_clear_range with a bogus block ID. It needs to test for DMU_SPILL_BLKID, similar to how dnode_clear_range is called in dbuf_dirty(). References to Illumos issue and patch: - https://www.illumos.org/issues/764 - https://github.com/illumos/illumos-gate/commit/3f2366c2bb Reviewed by: George Wilson <gwilson@zfsmail.com> Reviewed by: Mark.Maybe@oracle.com Reviewed by: Albert Lee <trisk@nexenta.com> Approved by: Garrett D'Amore <garrett@nexenta.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #340	2011-08-01 12:09:11 -07:00
Tim Haley	7b8518cb8d	Illumos #xxx: zdb -vvv broken after zfs diff integration References to Illumos issue and patch: - https://github.com/illumos/illumos-gate/commit/163eb7ff Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #340	2011-08-01 12:09:02 -07:00
Brian Behlendorf	beb9826902	Fix txg_sync_thread deadlock Update two kmem_alloc()'s in dbuf_dirty() to use KM_PUSHPAGE. Because these functions are called from txg_sync_thread we must ensure they don't reenter the zfs filesystem code via the .writepage callback. This would result in a deadlock. This deadlock is rare and has only been observed once under an abusive mmap() write workload. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-07-22 15:24:57 -07:00
Brian Behlendorf	22872ff5da	Use zfs_mknode() to create dataset root Long, long, long ago when the effort to port ZFS was begun the zfs_create_fs() function was heavily modified to remove all of its VFS dependencies. This allowed Lustre to use the dataset without us having to spend the time porting all the required VFS code. Fast-forward several years and we now have all the VFS code in place but are still relying on the modified zfs_create_fs(). This isn't required anymore and we can now use zfs_mknode() to create the root znode for the filesystem. This commit reverts the contents of zfs_create_fs() to largely match the upstream OpenSolaris code. There have been minor modifications to accommodate the Linux VFS but that is all. This code fixes issue #116 by bootstrapping enough of the VFS data structures so we can rely on zfs_mknode() to create the root directory. This ensures it is created properly with support for system attributes. Previously it wasn't, which is why it behaved differently than all other directories when modified. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #116	2011-07-20 19:52:26 -07:00
Brian Behlendorf	9fd91daeef	Honor setgid bit on directories Newly created files were always being created with the fsuid/fsgid in the current user's credentials. This is correct except in the case when the parent directory sets the 'setgid' bit. In this case, according to POSIX, the newly created file/directory should inherit the gid of the parent directory. Additionally, in the case of a subdirectory it should also inherit the 'setgid' bit. Finally, this commit performs a little cleanup of the vattr_t initialization by moving it to a common helper function. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #262	2011-07-20 14:07:13 -07:00
Brian Behlendorf	fe0ed8f910	Fix 'make install' overly broad 'rm' When running 'make install' without DESTDIR set the module install rules would mistakenly destroy the 'modules.*' files for ALL of your installed kernels. This could lead to a non-functional system for the alternate kernels because 'depmod -a' will only be run for the kernel which was compiled against. This issue would not impact anyone using the 'make <deb|rpm|pkg>' build targets to build and install packages. The fix for this issue is to only remove extraneous build products when DESTDIR is set. This almost exclusively indicates we are building packages and installing the build products in to a temporary staging location. Additionally, limit the removal of the unneeded build products to the target kernel version. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #328	2011-07-20 09:38:51 -07:00
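The inheritance rule described in the commit above can be sketched in plain C. This is an illustration of the rule only, not the helper the commit added: the struct and function names are hypothetical, and the caller says explicitly whether the new node is a directory.

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical result type: ownership and mode bits for the new node. */
struct new_node_ids {
	gid_t	gid;
	mode_t	mode;
};

/*
 * Hypothetical helper: by default the new node takes the caller's fsgid.
 * If the parent directory has the setgid bit, the new node inherits the
 * parent's gid, and a new subdirectory also inherits the setgid bit.
 */
static struct new_node_ids
inherit_from_parent(mode_t parent_mode, gid_t parent_gid,
    gid_t caller_fsgid, mode_t requested_mode, bool is_dir)
{
	struct new_node_ids out = { caller_fsgid, requested_mode };

	if (parent_mode & S_ISGID) {
		out.gid = parent_gid;
		if (is_dir)
			out.mode |= S_ISGID;
	}
	return (out);
}
```

A regular file created under a setgid directory therefore gets the directory's gid but keeps its requested mode unchanged, while a subdirectory gets both the gid and the setgid bit.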
Brian Behlendorf	cfc9a5c88f	Fix zpl_writepage() deadlock Disable the normal reclaim path for zpl_putpage(). This ensures that all memory allocations under this call path will never enter direct reclaim. If this were to happen the VM might try to write out additional pages by calling zpl_putpage() again resulting in a deadlock. This situation is typically handled in Linux by marking each offending allocation GFP_NOFS. However, since much of the code used is common it makes more sense to use PF_MEMALLOC to flag the entire call tree. Alternately, the code could be updated to pass the needed allocation flags but that's a more invasive change. The following example of the above described deadlock was triggered by test 074 in the xfstest suite. Call Trace: [<ffffffff814dcdb2>] down_write+0x32/0x40 [<ffffffffa05af6e4>] dnode_new_blkid+0x94/0x2d0 [zfs] [<ffffffffa0597d66>] dbuf_dirty+0x556/0x750 [zfs] [<ffffffffa05987d1>] dmu_buf_will_dirty+0x81/0xd0 [zfs] [<ffffffffa059ee70>] dmu_write+0x90/0x170 [zfs] [<ffffffffa0611afe>] zfs_putpage+0x2ce/0x360 [zfs] [<ffffffffa062875e>] zpl_putpage+0x1e/0x60 [zfs] [<ffffffffa06287b2>] zpl_writepage+0x12/0x20 [zfs] [<ffffffff8115f907>] writeout+0xa7/0xd0 [<ffffffff8115fa6b>] move_to_new_page+0x13b/0x170 [<ffffffff8115fed4>] migrate_pages+0x434/0x4c0 [<ffffffff811559ab>] compact_zone+0x4fb/0x780 [<ffffffff81155ed1>] compact_zone_order+0xa1/0xe0 [<ffffffff8115602c>] try_to_compact_pages+0x11c/0x190 [<ffffffff811200bb>] __alloc_pages_nodemask+0x5eb/0x8b0 [<ffffffff8115464a>] alloc_pages_current+0xaa/0x110 [<ffffffff8111e36e>] __get_free_pages+0xe/0x50 [<ffffffffa03f0e2f>] kv_alloc+0x3f/0xb0 [spl] [<ffffffffa03f11d9>] spl_kmem_cache_alloc+0x339/0x660 [spl] [<ffffffffa05950b3>] dbuf_create+0x43/0x370 [zfs] [<ffffffffa0596fb1>] __dbuf_hold_impl+0x241/0x480 [zfs] [<ffffffffa0597276>] dbuf_hold_impl+0x86/0xc0 [zfs] [<ffffffffa05977ff>] dbuf_hold_level+0x1f/0x30 [zfs] [<ffffffffa05a9dde>] dmu_tx_check_ioerr+0x4e/0x110 [zfs] [<ffffffffa05aa1f9>] 
dmu_tx_count_write+0x359/0x6f0 [zfs] [<ffffffffa05aa5df>] dmu_tx_hold_write+0x4f/0x70 [zfs] [<ffffffffa0611a6d>] zfs_putpage+0x23d/0x360 [zfs] [<ffffffffa062875e>] zpl_putpage+0x1e/0x60 [zfs] [<ffffffff811221f9>] write_cache_pages+0x1c9/0x4a0 [<ffffffffa0628738>] zpl_writepages+0x18/0x20 [zfs] [<ffffffff81122521>] do_writepages+0x21/0x40 [<ffffffff8119bbbd>] writeback_single_inode+0xdd/0x2c0 [<ffffffff8119bfbe>] writeback_sb_inodes+0xce/0x180 [<ffffffff8119c11b>] writeback_inodes_wb+0xab/0x1b0 [<ffffffff8119c4bb>] wb_writeback+0x29b/0x3f0 [<ffffffff8119c6cb>] wb_do_writeback+0xbb/0x240 [<ffffffff811308ea>] bdi_forker_task+0x6a/0x310 [<ffffffff8108ddf6>] kthread+0x96/0xa0 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #327	2011-07-19 16:17:01 -07:00
Brian Behlendorf	abd39a8289	Fix zio_execute() deadlock To avoid deadlocking the system it is crucial that all memory allocations performed in the zio_execute() call path are marked KM_PUSHPAGE (GFP_NOFS). This ensures that while a z_wr_iss thread is processing the syncing transaction group it does not re-enter the filesystem code and deadlock on itself. Call Trace: [<ffffffffa02580e8>] cv_wait_common+0x78/0xe0 [spl] [<ffffffffa0347bab>] txg_wait_open+0x7b/0xa0 [zfs] [<ffffffffa030e73d>] dmu_tx_wait+0xed/0xf0 [zfs] [<ffffffffa0376a49>] zfs_putpage+0x219/0x360 [zfs] [<ffffffffa038d75e>] zpl_putpage+0x1e/0x60 [zfs] [<ffffffffa038d7b2>] zpl_writepage+0x12/0x20 [zfs] [<ffffffff8115f907>] writeout+0xa7/0xd0 [<ffffffff8115fa6b>] move_to_new_page+0x13b/0x170 [<ffffffff8115fed4>] migrate_pages+0x434/0x4c0 [<ffffffff811559ab>] compact_zone+0x4fb/0x780 [<ffffffff81155ed1>] compact_zone_order+0xa1/0xe0 [<ffffffff8115602c>] try_to_compact_pages+0x11c/0x190 [<ffffffff811200bb>] __alloc_pages_nodemask+0x5eb/0x8b0 [<ffffffff81159932>] kmem_getpages+0x62/0x170 [<ffffffff8115a54a>] fallback_alloc+0x1ba/0x270 [<ffffffff8115a2c9>] ____cache_alloc_node+0x99/0x160 [<ffffffff8115b059>] __kmalloc+0x189/0x220 [<ffffffffa02539fb>] kmem_alloc_debug+0xeb/0x130 [spl] [<ffffffffa031454a>] dnode_hold_impl+0x46a/0x550 [zfs] [<ffffffffa0314649>] dnode_hold+0x19/0x20 [zfs] [<ffffffffa03042e3>] dmu_read+0x33/0x180 [zfs] [<ffffffffa034729d>] space_map_load+0xfd/0x320 [zfs] [<ffffffffa03300bc>] metaslab_activate+0x10c/0x170 [zfs] [<ffffffffa0330ad9>] metaslab_alloc+0x469/0x800 [zfs] [<ffffffffa038963c>] zio_dva_allocate+0x6c/0x2f0 [zfs] [<ffffffffa038a249>] zio_execute+0x99/0xf0 [zfs] [<ffffffffa0254b1c>] taskq_thread+0x1cc/0x330 [spl] [<ffffffff8108ddf6>] kthread+0x96/0xa0 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #291	2011-07-19 11:55:42 -07:00
Brian Behlendorf	a140dc5469	Fix mmap(2)/write(2)/read(2) deadlock When modifying overlapping regions of a file using mmap(2) and write(2)/read(2) it is possible to deadlock due to a lock inversion. The zfs_write() and zfs_read() hooks first take the zfs range lock and then lock the individual pages. Conversely, when using mmap'ed I/O the zpl_writepage() hook is called with the individual page locks already taken and then zfs_putpage() takes the zfs range lock. The most straightforward fix is to simply not take the zfs range lock in the mmap(2) case. The individual pages will still be locked thus serializing access. Updating the same region of a file with write(2) and mmap(2) has always been a dodgy thing to do. This change at a minimum ensures we don't deadlock and is consistent with the existing Linux semantics enforced by the VFS. This isn't an issue under Solaris because the only range locking performed will be with the zfs range locks. It's up to each filesystem to perform its own file locking. Under Linux the VFS provides many of these services. It may be possible/desirable at a later date to entirely dump the existing zfs range locking and rely on the Linux VFS page locks. However, for now it's safest to perform both layers of locking until zfs is more tightly integrated with the page cache. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #302	2011-07-19 11:55:42 -07:00


377 Commits