mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-05-15 18:56:59 +03:00

Author	SHA1	Message	Date
Tim Chase	4254acb057	Undirty freed spill blocks. If a spill block's dbuf hasn't yet been written when a spill block is freed, the unwritten version will still be written. This patch handles the case in which a spill block's dbuf is freed and undirties it to prevent it from being written. The most common case in which this could happen is when xattr=sa is being used and a long xattr is immediately replaced by a short xattr as in: setfattr -n user.test -v very_very_very..._long_value <file> setfattr -n user.test -v short_value <file> The first value must be sufficiently long that a spill block is generated and the second value must be short enough to not require a spill block. In practice, this would typically happen due to internal xattr operations as a result of setting acltype=posixacl. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2663 Closes #2700 Closes #2701 Closes #2717 Closes #2863 Closes #2884	2014-11-17 11:25:48 -08:00
Prakash Surya	0b39b9f96f	Swap DTRACE_PROBE* with Linux tracepoints This patch leverages Linux tracepoints from within the ZFS on Linux code base. It also refactors the debug code to bring it back in sync with Illumos. The information exported via tracepoints can be used for a variety of reasons (e.g. debugging, tuning, general exploration/understanding, etc). It is advantageous to use Linux tracepoints as the mechanism to export this kind of information (as opposed to something else) for a number of reasons: * A number of external tools can make use of our tracepoints "automatically" (e.g. perf, systemtap) * Tracepoints are designed to be extremely cheap when disabled * It's one of the "accepted" ways to export this kind of information; many other kernel subsystems use tracepoints too. Unfortunately, though, there are a few caveats as well: * Linux tracepoints appear to only be available to GPL licensed modules due to the way certain kernel functions are exported. Thus, to actually make use of the tracepoints introduced by this patch, one might have to patch and re-compile the kernel; exporting the necessary functions to non-GPL modules. * Prior to upstream kernel version v3.14-rc6-30-g66cc69e, Linux tracepoints are not available for unsigned kernel modules (tracepoints will get disabled due to the module's 'F' taint). Thus, one either has to sign the zfs kernel module prior to loading it, or use a kernel versioned v3.14-rc6-30-g66cc69e or newer. Assuming the above two requirements are satisfied, lets look at an example of how this patch can be used and what information it exposes (all commands run as 'root'): # list all zfs tracepoints available $ ls /sys/kernel/debug/tracing/events/zfs enable filter zfs_arc__delete zfs_arc__evict zfs_arc__hit zfs_arc__miss zfs_l2arc__evict zfs_l2arc__hit zfs_l2arc__iodone zfs_l2arc__miss zfs_l2arc__read zfs_l2arc__write zfs_new_state__mfu zfs_new_state__mru # enable all zfs tracepoints, clear the tracepoint ring buffer $ echo 1 > /sys/kernel/debug/tracing/events/zfs/enable $ echo 0 > /sys/kernel/debug/tracing/trace # import zpool called 'tank', inspect tracepoint data (each line was # truncated, they're too long for a commit message otherwise) $ zpool import tank $ cat /sys/kernel/debug/tracing/trace \| head -n35 # tracer: nop # # entries-in-buffer/entries-written: 1219/1219 #P:8 # # _-----=> irqs-off # / _----=> need-resched # \| / _---=> hardirq/softirq # \|\| / _--=> preempt-depth # \|\|\| / delay # TASK-PID CPU# \|\|\|\| TIMESTAMP FUNCTION # \| \| \| \|\|\|\| \| \| lt-zpool-30132 [003] .... 91344.200050: zfs_arc__miss: hdr... z_rd_int/0-30156 [003] .... 91344.200611: zfs_new_state__mru... lt-zpool-30132 [003] .... 91344.201173: zfs_arc__miss: hdr... z_rd_int/1-30157 [003] .... 91344.201756: zfs_new_state__mru... lt-zpool-30132 [003] .... 91344.201795: zfs_arc__miss: hdr... z_rd_int/2-30158 [003] .... 91344.202099: zfs_new_state__mru... lt-zpool-30132 [003] .... 91344.202126: zfs_arc__hit: hdr ... lt-zpool-30132 [003] .... 91344.202130: zfs_arc__hit: hdr ... lt-zpool-30132 [003] .... 91344.202134: zfs_arc__hit: hdr ... lt-zpool-30132 [003] .... 91344.202146: zfs_arc__miss: hdr... z_rd_int/3-30159 [003] .... 91344.202457: zfs_new_state__mru... lt-zpool-30132 [003] .... 91344.202484: zfs_arc__miss: hdr... z_rd_int/4-30160 [003] .... 91344.202866: zfs_new_state__mru... lt-zpool-30132 [003] .... 91344.202891: zfs_arc__hit: hdr ... lt-zpool-30132 [001] .... 91344.203034: zfs_arc__miss: hdr... z_rd_iss/1-30149 [001] .... 91344.203749: zfs_new_state__mru... lt-zpool-30132 [001] .... 91344.203789: zfs_arc__hit: hdr ... lt-zpool-30132 [001] .... 91344.203878: zfs_arc__miss: hdr... z_rd_iss/3-30151 [001] .... 91344.204315: zfs_new_state__mru... lt-zpool-30132 [001] .... 91344.204332: zfs_arc__hit: hdr ... lt-zpool-30132 [001] .... 91344.204337: zfs_arc__hit: hdr ... lt-zpool-30132 [001] .... 91344.204352: zfs_arc__hit: hdr ... lt-zpool-30132 [001] .... 91344.204356: zfs_arc__hit: hdr ... lt-zpool-30132 [001] .... 91344.204360: zfs_arc__hit: hdr ... To highlight the kind of detailed information that is being exported using this infrastructure, I've taken the first tracepoint line from the output above and reformatted it such that it fits in 80 columns: lt-zpool-30132 [003] .... 91344.200050: zfs_arc__miss: hdr { dva 0x1:0x40082 birth 15491 cksum0 0x163edbff3a flags 0x640 datacnt 1 type 1 size 2048 spa 3133524293419867460 state_type 0 access 0 mru_hits 0 mru_ghost_hits 0 mfu_hits 0 mfu_ghost_hits 0 l2_hits 0 refcount 1 } bp { dva0 0x1:0x40082 dva1 0x1:0x3000e5 dva2 0x1:0x5a006e cksum 0x163edbff3a:0x75af30b3dd6:0x1499263ff5f2b:0x288bd118815e00 lsize 2048 } zb { objset 0 object 0 level -1 blkid 0 } For the specific tracepoint shown here, 'zfs_arc__miss', data is exported detailing the arc_buf_hdr_t (hdr), blkptr_t (bp), and zbookmark_t (zb) that caused the ARC miss (down to the exact DVA!). This kind of precise and detailed information can be extremely valuable when trying to answer certain kinds of questions. For anybody unfamiliar but looking to build on this, I found the XFS source code along with the following three web links to be extremely helpful: * http://lwn.net/Articles/379903/ * http://lwn.net/Articles/381064/ * http://lwn.net/Articles/383362/ I should also node the more "boring" aspects of this patch: * The ZFS_LINUX_COMPILE_IFELSE autoconf macro was modified to support a sixth paramter. This parameter is used to populate the contents of the new conftest.h file. If no sixth parameter is provided, conftest.h will be empty. * The ZFS_LINUX_TRY_COMPILE_HEADER autoconf macro was introduced. This macro is nearly identical to the ZFS_LINUX_TRY_COMPILE macro, except it has support for a fifth option that is then passed as the sixth parameter to ZFS_LINUX_COMPILE_IFELSE. These autoconf changes were needed to test the availability of the Linux tracepoint macros. Due to the odd nature of the Linux tracepoint macro API, a separate ".h" must be created (the path and filename is used internally by the kernel's define_trace.h file). * The HAVE_DECLARE_EVENT_CLASS autoconf macro was introduced. This is to determine if we can safely enable the Linux tracepoint functionality. We need to selectively disable the tracepoint code due to the kernel exporting certain functions as GPL only. Without this check, the build process will fail at link time. In addition, the SET_ERROR macro was modified into a tracepoint as well. To do this, the 'sdt.h' file was moved into the 'include/sys' directory and now contains a userspace portion and a kernel space portion. The dprintf and zfs_dbgmsg* interfaces are now implemented as tracepoint as well. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-11-17 11:13:55 -08:00
Ned Bass	29e57d15c8	Fix dprintf format specifiers Fix a few dprintf format specifiers that disagreed with their argument types. These came to light as compiler errors when converting dprintf to use the Linux trace buffer. Previously this wasn't a problem, presumably because the SPL debug logging uses vsnprintf which must perform automatic type conversion. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-11-17 11:13:45 -08:00
Ned Bass	59ec819a0c	Move a few internal ARC strucutres to arc_impl.h Add a new file named arc_impl.h and move a few internal ARC structure definitions into this file. This is needed in order to allow the Linux tracepoint functions to grub around in the internals of these structures. Signed-off-by: Prakash Surya <prakash.surya@delphix.com> Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-11-17 11:13:27 -08:00
Prakash Surya	fb42a49328	Illumos 5213 - panic in metaslab_init due to space_map_open returning ENXIO 5213 panic in metaslab_init due to space_map_open returning ENXIO Reviewed by: Matthew Ahrens mahrens@delphix.com Reviewed by: George Wilson george.wilson@delphix.com References: https://www.illumos.org/issues/5213 https://reviews.csiden.org/r/110 Porting notes: For the Linux port, KM_SLEEP was replaced with KM_PUSHPAGE. Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2745	2014-11-14 15:37:45 -08:00
Chris Wedgwood	b31d8ea77c	Reduce buf/dbuf mutex contention Due to evidence of contention both the buf_hash_table and the dbuf_hash_table sizes have been increased from 256 to 8192. This increase in hash table size adds approximating 0.5M to our fixed memory footprint. This relatively small increase is not expected to cause problems even on low memory machines. This footprint will also become dynamic when the persistent L2ARC support is finalized. In the meanwhile, this small change significantly reduces contention for certain workloads. Signed-off-by: Chris Wedgwood <cw@f00f.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pavel Snajdr <snajpa@snajpa.net> Closes #1291	2014-11-14 14:59:21 -08:00
Alex Zhuravlev	0f69910833	Export symbols for ZIL interface These symbols are needed by consumers (i.e. Lustre) who wish to integrate with the ZIL. In addition the zil_rollback_destroy() prototype was removed because the implementation of this function was removed long ago. Signed-off-by: Alex Zhuravlev <alexey.zhuravlev@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2892	2014-11-14 14:39:43 -08:00
Alexander Pyhalov	bb9d808c5a	Fix modules installation directory When building zfs modules with kernel, compiled from deb.src, the packaging process ends up installing the modules in the wrong place. Signed-off-by: Alexander Pyhalov <apyhalov@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2822	2014-10-28 09:46:14 -07:00
Tim Chase	ed6e9cc235	Linux 3.12 compat: shrinker semantics The new shrinker API as of Linux 3.12 modifies "struct shrinker" by replacing the @shrink callback with the pair of @count_objects and @scan_objects. It also requires the return value of @count_objects to return the number of objects actually freed whereas the previous @shrink callback returned the number of remaining freeable objects. This patch adds support for the new @scan_objects return value semantics. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #2837	2014-10-28 09:34:51 -07:00
Matthew Ahrens	9635861742	Illumos 5164-5165 - space map fixes 5164 space_map_max_blksz causes panic, does not work 5165 zdb fails assertion when run on pool with recently-enabled space map_histogram feature Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5164 https://www.illumos.org/issues/5165 https://github.com/illumos/illumos-gate/commit/b1be289 Porting Notes: The metaslab_fragmentation() hunk was dropped from this patch because it was already resolved by commit `8b0a084`. The comment modified in metaslab.c was updated to use the correct variable name, space_map_blksz. The upstream commit incorrectly used space_map_blksize. Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2697	2014-10-23 15:30:32 -07:00
Alex Reece	b02fe35d37	Illumos 4958 zdb trips assert on pools with ashift >= 0xe 4958 zdb trips assert on pools with ashift >= 0xe Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Max Grossman <max.grossman@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/4958 https://github.com/illumos/illumos-gate/commit/2a104a5 Porting notes: Keep the ZIO_FLAG_FASTWRITE define. This is for a feature present in Linux but not yet in *BSD. Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2697	2014-10-23 15:30:32 -07:00
Brian Behlendorf	5f6d0b6f5a	Handle block pointers with a corrupt logical size The general strategy used by ZFS to verify that blocks are valid is to checksum everything. This has the advantage of being extremely robust and generically applicable regardless of the contents of the block. If a blocks checksum is valid then its contents are trusted by the higher layers. This system works exceptionally well as long as bad data is never written with a valid checksum. If this does somehow occur due to a software bug or a memory bit-flip on a non-ECC system it may result in kernel panic. One such place where this could occur is if somehow the logical size stored in a block pointer exceeds the maximum block size. This will result in an attempt to allocate a buffer greater than the maximum block size causing a system panic. To prevent this from happening the arc_read() function has been updated to detect this specific case. If a block pointer with an invalid logical size is passed it will treat the block as if it contained a checksum error. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2678	2014-10-23 09:20:52 -07:00
Ned Bass	bc151f7b31	Remove checks for mandatory locks The Linux VFS handles mandatory locks generically so we shouldn't need to check for conflicting locks in zfs_read(), zfs_write(), or zfs_freesp(). Linux 3.18 removed the lock_may_read() and lock_may_write() interfaces which we were relying on for this purpose. Rather than emulating those interfaces we remove the redundant checks. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2804	2014-10-22 11:06:53 -07:00
Matthew Ahrens	88904bb3e3	Illumos 5162 - zfs recv should use loaned arc buffer to avoid copy 5162 zfs recv should use loaned arc buffer to avoid copy Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Bayard Bell <Bayard.Bell@nexenta.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/5162 https://github.com/illumos/illumos-gate/commit/8a90470 Porting notes: Fix spelling error 's/arena/area/' in dmu.c. In restore_write() declare bonus and abuf at the top of the function. Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2696	2014-10-21 16:32:11 -07:00
Matthew Ahrens	4b20a6f509	Illumos 5150 - zfs clone of a defer_destroy snapshot causes strangeness When a clone is created of a snapshot that has been marked for deferred destroy (with "zfs destroy -d"), the clone "inherits" the defer_destroy flag from the origin, and any snapshots of the clone "inherit" the defer_destroy flag from the clone. This causes a strange situation where the clone's snapshots are marked for defer_destroy but they have no holds or clones. If the clone's snapshot gets a hold or clone, which is then deleted, we will honor the incorrectly-set defer_destroy flag and delete the snapshot! Steps to reproduce: * zpool create test c1t1d0 * zfs create test/fs * zfs snapshot test/fs@a * zfs clone test/fs@a test/clone * zfs destroy -d test/fs@a * zfs clone test/fs@a test/clone2 * zfs snapshot test/clone2@a * zfs hold hld test/clone2@a * zfs release hld test/clone2@a * zfs list -r -t all test <test/clone2@a has been destroyed> We noticed that this causes dcenter to get very confused, because it treats snapshots that are marked defer_destroy as not existing. So it won't see any snapshots of the clone that's marked defer_destroy. 5150 - zfs clone of a defer_destroy snapshot causes strangeness Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Max Grossman <max.grossman@delphix.com> Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/projects/illumos-gate//issues/5150 https://github.com/illumos/illumos-gate/commit/42fcb65 Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2690	2014-10-21 15:26:58 -07:00
Matthew Ahrens	6c59307a3c	Illumos 3693 - restore_object uses at least two transactions to restore an object Restore_object should not use two transactions to restore an object: * one transaction is used for dmu_object_claim * another transaction is used to set compression, checksum and most importantly bonus data * furthermore dmu_object_reclaim internally uses multiple transactions * dmu_free_long_range frees chunks in separate transactions * dnode_reallocate is executed in a distinct transaction The fact the dnode_allocate/dnode_reallocate are executed in one transaction and bonus (re-)population is executed in a different transaction may lead to violation of ZFS consistency assertions if the transactions are assigned to different transaction groups. Also, if the first transaction group is successfully written to a permanent storage, but the second transaction is lost, then an invalid dnode may be created on the stable storage. 3693 restore_object uses at least two transactions to restore an object Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Andriy Gapon <andriy.gapon@hybridcluster.com> Approved by: Robert Mustacchi <rm@joyent.com> Original authors: Matthew Ahrens and Andriy Gapon References: https://www.illumos.org/issues/3693 https://github.com/illumos/illumos-gate/commit/e77d42e Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2689	2014-10-21 15:26:50 -07:00
Tim Chase	356d9ed4c8	Don't perform ACL-to-mode translation on empty ACL In zfs_acl_chown_setattr(), the zfs_mode_comput() function is used to create a traditional mode value based on an ACL. If no ACL exists, this processing shouldn't be done. Problems caused by this were most evident on version 4 filesystems which not only don't have system attributes, and also frequently have empty ACLs. On such filesystems, performing a chown() operation could have the effect of dirtying the mode bits in memory but not on the file system as follows: # create a file with typical mode of 664 echo test > test chown anyuser test ls -l test and the mode will show up as all zeroes. Unmounting/mounting and/or exporting/importing the filesystem will reveal the proper mode again. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1264	2014-10-21 09:23:27 -07:00
Daniil Lunev	62bdd5eb7a	Illumos 4924 - LZ4 Compression for metadata Reviewed by Matthew Ahrens <mahrens@delphix.com> Reviewed by Saso Kiselkov <skiselkov.ml@gmail.com> Approved by: Christopher Siden <christopher.siden@delphix.com> References: https://github.com/illumos/illumos-gate/commit/b8289d2 https://www.illumos.org/issues/3756 Porting notes: The static function zfs_prop_activate_feature() was removed because this change removes the only caller. The function was not removed from Illumos but instead left as dead code. However, to keep gcc happy it was removed from Linux and may be easily restored if needed. Ported by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1540	2014-10-20 16:17:49 -07:00
Brian Behlendorf	ba232d8aea	Suppress AIO kmem warnings The new zpl_aio_write() and zpl_aio_read() functions use kmem_alloc() to allocate enough memory to hold the vectorized IO. While this allocation will be small it's been observed in practice to sometimes slightly exceed the 8K warning threshold by a few kilobytes. Therefore, the KM_NODEBUG flag has been added to suppress warning. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Closes #2774	2014-10-20 16:10:25 -07:00
Brian Behlendorf	33074f2254	Handle NULL mirror child vdev When selecting a mirror child it's possible that map allocated by vdev_mirror_map_allc() contains a NULL for the child vdev. In this case the child should be skipped and the read issues to another member of the mirror. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #1744	2014-10-17 14:59:05 -07:00
Brian Behlendorf	f0e324f25d	Update utsname support Modify the code to use the utsname() kernel function rather than a global variable. This results is cleaner more portable code because utsname() is already provided by the kernel and can be easily emulated in user space via uname(2). This means that it will behave consistently in both contexts. This is also has the benefit that it allows the removal of a few _KERNEL pre-processor conditions. And it also is a pre-requisite for a proper FUSE port because we need to provide a valid utsname. Finally, it allows us to remove this functionality from the SPL and all the related compatibility code. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2757	2014-10-17 14:58:57 -07:00
Brian Behlendorf	050d22b068	Remove shrink_dcache_memory() and shrink_icache_memory() This functionality is optional and until Linux 3.0, which provided per-filesystem shinkers, they was never a reasonable interface. Therefore, this functionality is being dropped for earlier kernels. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2757	2014-10-17 14:58:50 -07:00
Brian Behlendorf	60bba62814	Update code to use misc_register()/misc_deregister() When ZPIOS was originally written it was designed to use the device_create() and device_destroy() functions. Unfortunately, these functions changed considerably over the years making them difficult to rely on. As it turns out a better choice would have been to use the misc_register()/misc_deregister() functions. This interface for registering character devices has remained stable, is simple, and provides everything we need. Therefore the code has been reworked to use this interface. The higher level ZFS code has always depended on these same interfaces so this is also as a step towards minimizing our kernel dependencies. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2757	2014-10-17 14:58:44 -07:00
Brian Behlendorf	e6659763c6	Improve VERIFY() error in dmu_write() This is a debug patch designed to ensure an error code is logged to the console when this VERIFY() is hit. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Issue #1440	2014-10-08 09:18:14 -07:00
Brian Behlendorf	8878261fc9	Fix CPU_SEQID use in preemptible context Commit `e022864` introduced a regression for kernels which are built with CONFIG_DEBUG_PREEMPT. The use of CPU_SEQID in a preemptible context causes zio_nowait() to trigger the BUG. Since CPU_SEQID is simply being used as a random index the usage here is safe. To resolve the issue preempt is disable while calling CPU_SEQID. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #2769	2014-10-07 16:40:29 -07:00
Matthew Ahrens	e022864d19	Illumos 5176 - lock contention on godfather zio 5176 lock contention on godfather zio Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Reviewed by: Bayard Bell <Bayard.Bell@nexenta.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/5176 https://github.com/illumos/illumos-gate/commit/6f834bc Porting notes: Under Linux max_ncpus is defined as num_possible_cpus(). This is largest number of cpu ids which might be available during the life time of the system boot. This value can be larger than the number of present cpus if CONFIG_HOTPLUG_CPU is defined. Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2711	2014-10-07 11:24:24 -07:00
Richard Yao	83e9986f6e	Implement -t option to zpool create for temporary pool names Creating virtual machines that have their rootfs on ZFS on hosts that have their rootfs on ZFS causes SPA namespace collisions when the standard name rpool is used. The solution is either to give each guest pool a name unique to the host, which is not always desireable, or boot a VM environment containing an ISO image to install it, which is cumbersome. `26b42f3f9d` introduced `zpool import -t ...` to simplify situations where a host must access a guest's pool when there is a SPA namespace conflict. We build upon that to introduce `zpool import -t tname ...`. That allows us to create a pool whose in-core name is tname, but whose on-disk name is the normal name specified. This simplifies the creation of machine images that use a rootfs on ZFS. That benefits not only real world deployments, but also ZFSOnLinux development by decreasing the time needed to perform rootfs on ZFS experiments. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2417	2014-09-30 10:46:59 -07:00
Tim Chase	cb08f06307	Perform whole-page page truncation for hole-punching under a range lock As an attempt to perform the page truncation more optimally, the hole-punching support added in `223df0161f` truncated performed the operation in two steps: first, sub-page "stubs" were zeroed under the range lock in zfs_free_range() using the new zfs_zero_partial_page() function and then the whole pages were truncated within zfs_freesp(). This left a window of opportunity during which the full pages could be touched. This patch closes the window by moving the whole-page truncation into zfs_free_range() under the range lock. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2733	2014-09-29 09:22:03 -07:00
Max Grossman	36283ca233	Illumos 5138 - add tunable for maximum number of blocks freed in one txg Reviewed by: Adam Leventhal <adam.leventhal@delphix.com> Reviewed by: Mattew Ahrens <mahrens@delphix.com> Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> Reviewed by: Richard Elling <richard.elling@gmail.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5138 https://github.com/illumos/illumos-gate/commit/af3465d Porting notes: Because support for exposing a uint64_t parameter wasn't added until v3.17-rc1 the zfs_free_max_blocks variable has been declared as a unsigned long. This is already far larger than required and it allows us to avoid additional autoconf compatibility code. The default value has been set to 100,000 on Linux instead of ULONG_MAX which is used on Illumos. This was done to limit the number of outstanding IOs in the system when snapshots are destroyed. This helps ensure individual TXG sync times are kept reasonable and memory isn't wasted managing a huge backlog of outstanding IOs. Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2675 Closes #2581	2014-09-23 14:26:34 -07:00
Alex Reece	acbad6ff67	Illumos 4753 - increase number of outstanding async writes when sync task is waiting Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/4753 https://github.com/illumos/illumos-gate/commit/73527f4 Comments by Matt Ahrens from the issue tracker: When a sync task is waiting for a txg to complete, we should hurry it along by increasing the number of outstanding async writes (i.e. make vdev_queue_max_async_writes() return a larger number). Initially we might just have a tunable for "minimum async writes while a synctask is waiting" and set it to 3. Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2716	2014-09-23 13:50:55 -07:00
Matthew Ahrens	d97aa48f7c	Illumos 5139 - SEEK_HOLE failed to report a hole at end of file 5139 SEEK_HOLE failed to report a hole at end of file Reviewed by: Adam Leventhal <adam.leventhal@delphix.com> Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Max Grossman <max.grossman@delphix.com> Reviewed by: Peng Dai <peng.dai@delphix.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5139 https://github.com/illumos/illumos-gate/commit/0fbc0cd Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2714	2014-09-23 10:38:45 -07:00
Richard Yao	485c581c41	Fix function call with uninitialized value in vdev_inuse LLVM's static analyzer reported that we could pass an uninitialized pool_guid to spa_by_guid() in vdev_inuse(). Upon review, it is correct. An attempt to repurpose a spare or L2ARC drive from an exported pool will cause the pool_guid passed to spa_by_guid() to be unintialized information from the stack. This will cause non-deterministic behavior. Since there is no reason why we cannot repurpose such disks, we modify vdev_inuse() to avoid calling spa_by_guid() when they are detected. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2330	2014-09-23 10:32:45 -07:00
Matthew Ahrens	b8bcca18f7	Illumos 5161 - add tunable for number of metaslabs per vdev 5161 add tunable for number of metaslabs per vdev Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/5161 https://github.com/illumos/illumos-gate/commit/bf3e216 Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2698	2014-09-23 10:00:02 -07:00
Matthew Ahrens	ebcf49365a	Illumos 5177 - remove dead code from dsl_scan.c 5177 remove dead code from dsl_scan.c Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Reviewed by: Richard Lowe <richlowe@richlowe.net> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/5177 https://github.com/illumos/illumos-gate/commit/5f37736 Porting notes: The local variable 'buf' was removed from dsl_scan_visitbp(). This wasn't part of the original patch but it should have been. Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2712	2014-09-22 15:52:58 -07:00
Adam Leventhal	64dbba3679	Illumos 5174 - add sdt probe for blocked read in dbuf_read() 5174 add sdt probe for blocked read in dbuf_read() Reviewed by: Basil Crow <basil.crow@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Steven Hartland <killing@multiplay.co.uk> Reviewed by: Richard Elling <richard.elling@gmail.com> Reviewed by: Boris Protopopov <bprotopopov@hotmail.com> Reviewed by: Steven Hartland <killing@multiplay.co.uk> Reviewed by: Garrett D'Amore <garrett@damore.org> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/5174 https://github.com/illumos/illumos-gate/commit/f6164ad Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2710	2014-09-22 14:20:25 -07:00
Matthew Ahrens	6d9036f350	Illumos 5140 - message about "%recv could not be opened" is printed when booting after crash Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Max Grossman <max.grossman@delphix.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/projects/illumos-gate//issues/5140 https://github.com/illumos/illumos-gate/commit/2243853 Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2676	2014-09-18 15:04:59 -07:00
Brian Behlendorf	2d50158343	Fix z_teardown_inactive_lock deadlock When rolling back a mounted filesystem zfs_suspend() is called which acquires the z_teardown_inactive_lock. This lock can not be dropped until the filesystem has been rolled back and resumed in zfs_resume_fs(). Therefore, we must not call iput() under this lock because it may result in the inode->evict() handler being called which also takes this lock. Instead use zfs_iput_async() to ensure dropping the last reference is deferred and runs in a safe context. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2670	2014-09-11 09:11:45 -07:00
Tim Chase	223df0161f	Implement fallocate FALLOC_FL_PUNCH_HOLE Add support for the FALLOC_FL_PUNCH_HOLE \| FALLOC_FL_KEEP_SIZE mode of fallocate(2). Mimic the behavior of other native file systems such as ext4 in cases where the file might be extended. If the offset is beyond the end of the file, return success without changing the file. If the extent of the punched hole would extend the file, only the existing tail of the file is punched. Add the zfs_zero_partial_page() function, modeled after update_page(), to handle zeroing partial pages in a hole-punching operation. It must be used under a range lock for the requested region in order that the ARC and page cache stay in sync. Move the existing page cache truncation via truncate_setsize() into zfs_freesp() for better source structure compatibility with upstream code. Add page cache truncation to zfs_freesp() and zfs_free_range() to handle hole punching. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #2619	2014-09-08 13:52:25 -07:00
George Wilson	4f68d7878f	Illumos 5117 - spacemap reallocation can cause corruption 5117 space map reallocation can cause corruption Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/projects/illumos-gate/issues/5117 https://github.com/illumos/illumos-gate/commit/e503a68 Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2662	2014-09-08 09:42:39 -07:00
Brian Behlendorf	ceb49b0acd	Add object type checking to zap_lockdir() If a non-ZAP object is passed to zap_lockdir() it will be treated as a valid ZAP object. This can result in zap_lockdir() attempting to read what it believes are leaf blocks from invalid disk locations. The SCSI layer will eventually generate errors for these bogus IOs but the caller will hang in zap_get_leaf_byblk(). The good news is that is a situation which can not occur unless the pool has been damaged. The bad news is that there are reports from both FreeBSD and Solaris of damaged pools. Specifically, there are normal files in the filesystem which reference another normal file as their parent. Since pools like this are known to exist the zap_lockdir() function has been updated to verify the type of the object. If a non-ZAP object has been passed it EINVAL will be returned immediately. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2597 Issue #2602	2014-09-08 09:15:38 -07:00
Richard Yao	cd3939c5f0	Linux AIO Support nfsd uses do_readv_writev() to implement fops->read and fops->write. do_readv_writev() will attempt to read/write using fops->aio_read and fops->aio_write, but it will fallback to fops->read and fops->write when AIO is not available. However, the fallback will perform a call for each individual data page. Since our default recordsize is 128KB, sequential operations on NFS will generate 32 DMU transactions where only 1 transaction was needed. That was unnecessary overhead and we implement fops->aio_read and fops->aio_write to eliminate it. ZFS originated in OpenSolaris, where the AIO API is entirely implemented in userland's libc by intelligently mapping them to VOP_WRITE, VOP_READ and VOP_FSYNC. Linux implements AIO inside the kernel itself. Linux filesystems therefore must implement their own AIO logic and nearly all of them implement fops->aio_write synchronously. Consequently, they do not implement aio_fsync(). However, since the ZPL works by mapping Linux's VFS calls to the functions implementing Illumos' VFS operations, we instead implement AIO in the kernel by mapping the operations to the VOP_READ, VOP_WRITE and VOP_FSYNC equivalents. We therefore implement fops->aio_fsync. One might be inclined to make our fops->aio_write implementation synchronous to make software that expects this behavior safe. However, there are several reasons not to do this: 1. Other platforms do not implement aio_write() synchronously and since the majority of userland software using AIO should be cross platform, expectations of synchronous behavior should not be a problem. 2. We would hurt the performance of programs that use POSIX interfaces properly while simultaneously encouraging the creation of more non-compliant software. 3. The broader community concluded that userland software should be patched to properly use POSIX interfaces instead of implementing hacks in filesystems to cater to broken software. This concept is best described as the O_PONIES debate. 4. Making an asynchronous write synchronous is non sequitur. Any software dependent on synchronous aio_write behavior will suffer data loss on ZFSOnLinux in a kernel panic / system failure of at most zfs_txg_timeout seconds, which by default is 5 seconds. This seems like a reasonable consequence of using non-compliant software. It should be noted that this is also a problem in the kernel itself where nfsd does not pass O_SYNC on files opened with it and instead relies on a open()/write()/close() to enforce synchronous behavior when the flush is only guarenteed on last close. Exporting any filesystem that does not implement AIO via NFS risks data loss in the event of a kernel panic / system failure when something else is also accessing the file. Exporting any file system that implements AIO the way this patch does bears similar risk. However, it seems reasonable to forgo crippling our AIO implementation in favor of developing patches to fix this problem in Linux's nfsd for the reasons stated earlier. In the interim, the risk will remain. Failing to implement AIO will not change the problem that nfsd created, so there is no reason for nfsd's mistake to block our implementation of AIO. It also should be noted that `aio_cancel()` will always return `AIO_NOTCANCELED` under this implementation. It is possible to implement aio_cancel by deferring work to taskqs and use `kiocb_set_cancel_fn()` to set a callback function for cancelling work sent to taskqs, but the simpler approach is allowed by the specification: ``` Which operations are cancelable is implementation-defined. ``` http://pubs.opengroup.org/onlinepubs/009695399/functions/aio_cancel.html The only programs on my system that are capable of using `aio_cancel()` are QEMU, beecrypt and fio use it according to a recursive grep of my system's `/usr/src/debug`. That suggests that `aio_cancel()` users are rare. Implementing aio_cancel() is left to a future date when it is clear that there are consumers that benefit from its implementation to justify the work. Lastly, it is important to know that handling of the iovec updates differs between Illumos and Linux in the implementation of read/write. On Linux, it is the VFS' responsibility whle on Illumos, it is the filesystem's responsibility. We take the intermediate solution of copying the iovec so that the ZFS code can update it like on Solaris while leaving the originals alone. This imposes some overhead. We could always revisit this should profiling show that the allocations are a problem. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #223 Closes #2373	2014-09-05 15:11:43 -07:00
Alex Reece	f38dfec3fd	Illumos 5049 - panic when removing log device Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Mattew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Saso Kiselkov <skiselkov@gmail.com> Approved by: Rich Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/5049 https://github.com/illumos/illumos-gate/commit/2986efa Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2636	2014-09-05 09:06:27 -07:00
Stanislav Seletskiy	2078f21015	Fix invalid locking order in rename operation This commit should prevent a deadlock on dp_config_rwlock when running `zfs rename` by ensuring zvol_rename_minors() is not called under this lock. Signed-off-by: Stanislav Seletskiy <s.seletskiy@gmail.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2652. Closes #2525.	2014-09-04 09:50:46 -07:00
Alexey Smirnoff	0dfc732416	Change the default 'zfs_dedup_prefetch' value to '0' This gives a huge performance improvement in operations with deduped datasets especially when the bottleneck is the amount of ram available for zfs. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2639	2014-09-04 09:50:45 -07:00
Dan Swartzendruber	287be44f53	Improve handling of filesystem versions Change mount code to diagnose filesystem versions that are not supported by the current implementation. Change upgrade code to do likewise and refuse to upgrade a pool if any filesystems on it are a version which is not supported by the current implementation. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Dan Swartzendruber <dswartz@druber.com> Closes: #2616	2014-09-03 09:17:14 -07:00
Matthew Ahrens	dea377c0d9	Illumos 4970-4974 - extreme rewind enhancements 4970 need controls on i/o issued by zpool import -XF 4971 zpool import -T should accept hex values 4972 zpool import -T implies extreme rewind, and thus a scrub 4973 spa_load_retry retries the same txg 4974 spa_load_verify() reads all data twice Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/4970 https://www.illumos.org/issues/4971 https://www.illumos.org/issues/4972 https://www.illumos.org/issues/4973 https://www.illumos.org/issues/4974 https://github.com/illumos/illumos-gate/commit/e42d205 Notes: This set of patches adds a set of tunable parameters for the "extreme rewind" mode of pool import which allows control over the traversal performed during such an import. Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2598	2014-08-26 16:29:57 -07:00
Matthew Ahrens	49ddb31506	Illumos 5034 - ARC's buf_hash_table is too small 5034 ARC's buf_hash_table is too small Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Approved by: Gordon Ross <gwr@nexenta.com> References: https://www.illumos.org/issues/5034 https://github.com/illumos/illumos-gate/commit/63e911b Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2615	2014-08-26 16:14:49 -07:00
Isaac Huang	0426c16804	Fixed memory leaks in zevent handling Some nvlist_t could be leaked in error handling paths. Also make sure cb argument to zfs_zevent_post() cannnot be NULL. Signed-off-by: Isaac Huang <he.huang@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2158	2014-08-20 10:45:16 -07:00
Matthew Ahrens	bd089c5477	Illumos 4631 - zvol_get_stats triggering too many reads 4631 zvol_get_stats triggering too many reads Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/4631 https://github.com/illumos/illumos-gate/commit/bbfa8ea Ported-by: Boris Protopopov <bprotopopov@hotmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2612 Closes #2480	2014-08-20 09:17:00 -07:00
Tim Chase	8b0a0840b4	Don't upgrade a metaslab when the pool is not writable Illumos 4982 added code to metaslab_fragmentation() to proactively update space maps when the spacemap_histogram feature is enabled. This should only happen when the pool is writeable. References: https://www.illumos.org/issues/4982 https://github.com/illumos/illumos-gate/commit/2e4c998 Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2595	2014-08-18 08:47:19 -07:00

1 2 3 4 5 ...

795 Commits