mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-04-17 08:54:52 +03:00

Author	SHA1	Message	Date
Brian Behlendorf	2e0358cbca	Sync /dev/zfs ioctl ordering In order to minimize any future disruption caused by the addition and removal /dev/zfs ioctls this patch makes the following changes. 1) Sync ZoL's ioctl ordering such that it matches Illumos. For historic reasons the ZFS_IOC_DESTROY_SNAPS and ZFS_IOC_POOL_REGUID ioctls were out of order. 2) Move Linux and FreeBSD specific ioctls in to their own reserved ranges. This allows us to preserve the existing ordering when new ioctls are added by either Illumos or FreeBSD. When an ioctl is no longer needed it should be retired in place. This change alters the ZFS user/kernel ABI so make sure you rebuild both your user and kernel modules. However, it should allow for a much stabler interface going forward. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #1973	2013-12-16 09:41:39 -08:00
Brian Behlendorf	ba6a24026c	Remove ZFC_IOC__MINOR ioctl()s Early versions of ZFS coordinated the creation and destruction of device minors from userspace. This was inherently racy and in late 2009 these ioctl()s were removed leaving everything up to the kernel. This significantly simplified the code. However, we never picked up these changes in ZoL since we'd already significantly adjusted this code for Linux. This patch aims to rectify that by finally removing ZFC_IOC__MINOR ioctl()s and moving all the functionality down in to the kernel. Since this cleanup will change the kernel/user ABI it's being done in the same tag as the previous libzfs_core ABI changes. This will minimize, but not eliminate, the disruption to end users. Once merged ZoL, Illumos, and FreeBSD will basically be back in sync in regards to handling ZVOLs in the common code. While each platform must have its own custom zvol.c implemenation the interfaces provided are consistent. NOTES: 1) This patch introduces one subtle change in behavior which could not be easily avoided. Prior to this change callers of 'zfs create -V ...' were guaranteed that upon exit the /dev/zvol/ block device link would be created or an error returned. That's no longer the case. The utilities will no longer block waiting for the symlink to be created. Callers are now responsible for blocking, this is why a 'udev_wait' call was added to the 'label' function in scripts/common.sh. 2) The read-only behavior of a ZVOL now solely depends on if the ZVOL_RDONLY bit is set in zv->zv_flags. The redundant policy setting in the gendisk structure was removed. This both simplifies the code and allows us to safely leverage set_disk_ro() to issue a KOBJ_CHANGE uevent. See the comment in the code for futher details on this. 3) Because __zvol_create_minor() and zvol_alloc() may now be called in a sync task they must use KM_PUSHPAGE. References: illumos/illumos-gate@681d9761e8 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #1969	2013-12-16 09:15:57 -08:00
Shen Yan	5cb65efe2c	Fix zstream_t incorrect type The DMU zfetch code organizes streams with lists not avl trees. A avl_node_t was mistakenly used for a list_node_t in the zstream_t type. This is incorrect (but harmless) and when unnoticed because: 1) The list functions explicitly cast the value preventing a warning, 2) sizeof(avl_node_t) >= sizeof(list_node_t) so no overrun occurs, and 3) The calculated offset is the same regardless of the type. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1946	2013-12-10 10:09:27 -08:00
Matthew Ahrens	e8b96c6007	Illumos #4045 write throttle & i/o scheduler performance work 4045 zfs write throttle & i/o scheduler performance work 1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync read, sync write, async read, async write, and scrub/resilver. The scheduler issues a number of concurrent i/os from each class to the device. Once a class has been selected, an i/o is selected from this class using either an elevator algorithem (async, scrub classes) or FIFO (sync classes). The number of concurrent async write i/os is tuned dynamically based on i/o load, to achieve good sync i/o latency when there is not a high load of writes, and good write throughput when there is. See the block comment in vdev_queue.c (reproduced below) for more details. 2. The write throttle (dsl_pool_tempreserve_space() and txg_constrain_throughput()) is rewritten to produce much more consistent delays when under constant load. The new write throttle is based on the amount of dirty data, rather than guesses about future performance of the system. When there is a lot of dirty data, each transaction (e.g. write() syscall) will be delayed by the same small amount. This eliminates the "brick wall of wait" that the old write throttle could hit, causing all transactions to wait several seconds until the next txg opens. One of the keys to the new write throttle is decrementing the amount of dirty data as i/o completes, rather than at the end of spa_sync(). Note that the write throttle is only applied once the i/o scheduler is issuing the maximum number of outstanding async writes. See the block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for more details. This diff has several other effects, including: * the commonly-tuned global variable zfs_vdev_max_pending has been removed; use per-class zfs_vdev__max_active values or zfs_vdev_max_active instead. the size of each txg (meaning the amount of dirty data written, and thus the time it takes to write out) is now controlled differently. There is no longer an explicit time goal; the primary determinant is amount of dirty data. Systems that are under light or medium load will now often see that a txg is always syncing, but the impact to performance (e.g. read latency) is minimal. Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this. * zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression, checksum, etc. This improves latency by not allowing these CPU-intensive tasks to consume all CPU (on machines with at least 4 CPU's; the percentage is rounded up). --matt APPENDIX: problems with the current i/o scheduler The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem with this is that if there are always i/os pending, then certain classes of i/os can see very long delays. For example, if there are always synchronous reads outstanding, then no async writes will be serviced until they become "past due". One symptom of this situation is that each pass of the txg sync takes at least several seconds (typically 3 seconds). If many i/os become "past due" (their deadline is in the past), then we must service all of these overdue i/os before any new i/os. This happens when we enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in the future. If we can't complete all the i/os in 2.5 seconds (e.g. because there were always reads pending), then these i/os will become past due. Now we must service all the "async" writes (which could be hundreds of megabytes) before we service any reads, introducing considerable latency to synchronous i/os (reads or ZIL writes). Notes on porting to ZFS on Linux: - zio_t gained new members io_physdone and io_phys_children. Because object caches in the Linux port call the constructor only once at allocation time, objects may contain residual data when retrieved from the cache. Therefore zio_create() was updated to zero out the two new fields. - vdev_mirror_pending() relied on the depth of the per-vdev pending queue (vq->vq_pending_tree) to select the least-busy leaf vdev to read from. This tree has been replaced by vq->vq_active_tree which is now used for the same purpose. - vdev_queue_init() used the value of zfs_vdev_max_pending to determine the number of vdev I/O buffers to pre-allocate. That global no longer exists, so we instead use the sum of the *_max_active values for each of the five I/O classes described above. - The Illumos implementation of dmu_tx_delay() delays a transaction by sleeping in condition variable embedded in the thread (curthread->t_delay_cv). We do not have an equivalent CV to use in Linux, so this change replaced the delay logic with a wrapper called zfs_sleep_until(). This wrapper could be adopted upstream and in other downstream ports to abstract away operating system-specific delay logic. - These tunables are added as module parameters, and descriptions added to the zfs-module-parameters.5 man page. spa_asize_inflation zfs_deadman_synctime_ms zfs_vdev_max_active zfs_vdev_async_write_active_min_dirty_percent zfs_vdev_async_write_active_max_dirty_percent zfs_vdev_async_read_max_active zfs_vdev_async_read_min_active zfs_vdev_async_write_max_active zfs_vdev_async_write_min_active zfs_vdev_scrub_max_active zfs_vdev_scrub_min_active zfs_vdev_sync_read_max_active zfs_vdev_sync_read_min_active zfs_vdev_sync_write_max_active zfs_vdev_sync_write_min_active zfs_dirty_data_max_percent zfs_delay_min_dirty_percent zfs_dirty_data_max_max_percent zfs_dirty_data_max zfs_dirty_data_max_max zfs_dirty_data_sync zfs_delay_scale The latter four have type unsigned long, whereas they are uint64_t in Illumos. This accommodates Linux's module_param() supported types, but means they may overflow on 32-bit architectures. The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most likely to overflow on 32-bit systems, since they express physical RAM sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to 2^32 which does overflow. To resolve that, this port instead initializes it in arc_init() to 25% of physical RAM, and adds the tunable zfs_dirty_data_max_max_percent to override that percentage. While this solution doesn't completely avoid the overflow issue, it should be a reasonable default for most systems, and the minority of affected systems can work around the issue by overriding the defaults. - Fixed reversed logic in comment above zfs_delay_scale declaration. - Clarified comments in vdev_queue.c regarding when per-queue minimums take effect. - Replaced dmu_tx_write_limit in the dmu_tx kstat file with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts how many times a transaction has been delayed because the pool dirty data has exceeded zfs_delay_min_dirty_percent. The latter counts how many times the pool dirty data has exceeded zfs_dirty_data_max (which we expect to never happen). - The original patch would have regressed the bug fixed in zfsonlinux/zfs@c418410, which prevented users from setting the zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE. A similar fix is added to vdev_queue_aggregate(). - In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the heap instead of the stack. In Linux we can't afford such large structures on the stack. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Ned Bass <bass6@llnl.gov> Reviewed by: Brendan Gregg <brendan.gregg@joyent.com> Approved by: Robert Mustacchi <rm@joyent.com> References: http://www.illumos.org/issues/4045 illumos/illumos-gate@69962b5647 Ported-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1913	2013-12-06 09:32:43 -08:00
Etienne Dechamps	119a394ab0	Only commit the ZIL once in zpl_writepages() (msync() case). Currently, using msync() results in the following code path: sys_msync -> zpl_fsync -> filemap_write_and_wait_range -> zpl_writepages -> write_cache_pages -> zpl_putpage In such a code path, zil_commit() is called as part of zpl_putpage(). This means that for each page, the write is handed to the DMU, the ZIL is committed, and only then do we move on to the next page. As one might imagine, this results in atrocious performance where there is a large number of pages to write: instead of committing a batch of N writes, we do N commits containing one page each. In some extreme cases this can result in msync() being ~700 times slower than it should be, as well as very inefficient use of ZIL resources. This patch fixes this issue by making sure that the requested writes are batched and then committed only once. Unfortunately, the implementation is somewhat non-trivial because there is no way to run write_cache_pages in SYNC mode (so that we get all pages) without making it wait on the writeback tag for each page. The solution implemented here is composed of two parts: - I added a new callback system to the ZIL, which allows the caller to be notified when its ITX gets written to stable storage. One nice thing is that the callback is called not only in zil_commit() but in zil_sync() as well, which means that the caller doesn't have to care whether the write ended up in the ZIL or the DMU: it will get notified as soon as it's safe, period. This is an improvement over dmu_tx_callback_register() that was used previously, which only supports DMU writes. The rationale for this change is to allow zpl_putpage() to be notified when a ZIL commit is completed without having to block on zil_commit() itself. - zpl_writepages() now calls write_cache_pages in non-SYNC mode, which will prevent (1) write_cache_pages from blocking, and (2) zpl_putpage from issuing ZIL commits. zpl_writepages() will issue the commit itself instead of relying on zpl_putpage() to do it, thus nicely batching the writes. Note, however, that we still have to call write_cache_pages() again in SYNC mode because there is an edge case documented in the implementation of write_cache_pages() whereas it will not give us all dirty pages when running in non-SYNC mode. Thus we need to run it at least once in SYNC mode to make sure we honor persistency guarantees. This only happens when the pages are modified at the same time msync() is running, which should be rare. In most cases there won't be any additional pages and this second call will do nothing. Note that this change also fixes a bug related to #907 whereas calling msync() on pages that were already handed over to the DMU in a previous writepages() call would make msync() block until the next TXG sync instead of returning as soon as the ZIL commit is complete. The new callback system fixes that problem. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1849 Closes #907	2013-11-23 15:08:29 -08:00
Brian Behlendorf	e3dc14b861	Add I/O Read/Write Accounting Because ZFS bypasses the page cache we don't inherit per-task I/O accounting for free. However, the Linux kernel does provide helper functions allow us to perform our own accounting. These are most commonly used for direct IO which also bypasses the page cache, but they can be used for the common read/write call paths as well. Signed-off-by: Pavel Snajdr <snajpa@snajpa.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #313 Closes #1275	2013-11-21 08:56:24 -08:00
Massimo Maggi	b695c34ea4	Honor CONFIG_FS_POSIX_ACL kernel option The required Posix ACL interfaces are only available for kernels with CONFIG_FS_POSIX_ACL defined. Therefore, only enable Posix ACL support for these kernels. All major distribution kernels enable CONFIG_FS_POSIX_ACL by default. If your kernel does not support Posix ACLs the following warning will be printed at ZFS module load time. "ZFS: Posix ACLs disabled by kernel" Signed-off-by: Massimo Maggi <me@massimo-maggi.eu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1825	2013-11-05 16:22:05 -08:00
George Wilson	ac72fac3ea	Illumos #3954 , #4080 , #4081 3954 metaslabs continue to load even after hitting zfs_mg_alloc_failure limit 4080 zpool clear fails to clear pool 4081 need zfs_mg_noalloc_threshold Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/3954 https://www.illumos.org/issues/4080 https://www.illumos.org/issues/4081 illumos/illumos-gate@22e30981d8 Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775	2013-11-05 12:25:01 -08:00
Matthew Ahrens	b663a23d36	Illumos #4047 4047 panic from dbuf_free_range() from dmu_free_object() while doing zfs receive Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Dan McDonald <danmcd@nexenta.com> References: https://www.illumos.org/issues/4047 illumos/illumos-gate@713d6c2088 Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775 Porting notes: 1. The exported symbol dmu_free_object() was renamed to dmu_free_long_object() in Illumos.	2013-11-05 12:23:35 -08:00
Matthew Ahrens	46ba1e59d3	Illumos #3996 3996 want a libzfs_core API to rollback to latest snapshot Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Andy Stormont <andyjstormont@gmail.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/3996 illumos/illumos-gate@a7027df17f Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775	2013-11-05 12:23:11 -08:00
George Wilson	5d1f7fb647	Illumos #3956 , #3957 , #3958 , #3959 , #3960 , #3961 , #3962 3956 ::vdev -r should work with pipelines 3957 ztest should update the cachefile before killing itself 3958 multiple scans can lead to partial resilvering 3959 ddt entries are not always resilvered 3960 dsl_scan can skip over dedup-ed blocks if physical birth != logical birth 3961 freed gang blocks are not resilvered and can cause pool to suspend 3962 ztest should print out zfs debug buffer before exiting Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/3956 https://www.illumos.org/issues/3957 https://www.illumos.org/issues/3958 https://www.illumos.org/issues/3959 https://www.illumos.org/issues/3960 https://www.illumos.org/issues/3961 https://www.illumos.org/issues/3962 illumos/illumos-gate@b4952e17e8 Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Porting notes: 1. zfs_dbgmsg_print() is only used in userland. Since we do not have mdb on Linux, it does not make sense to make it available in the kernel. This means that a build failure will occur if any future kernel patch depends on it. However, that is unlikely given that this functionality was added to support zdb. 2. zfs_dbgmsg_print() is only invoked for -VVV or greater log levels. This preserves the existing behavior of minimal noise when running with -V, and -VV. 3. In vdev_config_generate() the call to nvlist_alloc() was not changed to fnvlist_alloc() because we must pass KM_PUSHPAGE in the txg_sync context.	2013-11-05 12:23:05 -08:00
Matthew Ahrens	ea97f8ce35	Illumos #3834 3834 incremental replication of 'holey' file systems is slow Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/3834 illumos/illumos-gate@ca48f36f20 Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775	2013-11-05 12:15:00 -08:00
Matthew Ahrens	2883cad5b7	Illumos #3836 3836 zio_free() can be processed immediately in the common case Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Approved by: Dan McDonald <danmcd@nexenta.com> References: https://www.illumos.org/issues/3836 illumos/illumos-gate@9cb154a3c9 Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775	2013-11-05 12:14:56 -08:00
Matthew Ahrens	498877baf5	Illumos #3112 , #3113 , #3114 3112 ztest does not honor ZFS_DEBUG 3113 ztest should use watchpoints to protect frozen arc bufs 3114 some leaked nvlists in zfsdev_ioctl Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Matt Amdur <Matt.Amdur@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Christopher Siden <chris.siden@delphix.com> Approved by: Eric Schrock <eric.schrock@delphix.com> References: https://www.illumos.org/issues/3112 https://www.illumos.org/issues/3113 https://www.illumos.org/issues/3114 illumos/illumos-gate@cd1c8b85eb The /proc/self/cmd watchpoint interface is specific to Solaris. Therefore, the #3113 implementation was reworked to use the more portable mprotect(2) system call. When the pages are watched they are marked read-only for protection. Any write to the protected address range immediately trigger a SIGSEGV. The pages are marked writable again when they are unwatched. Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1489	2013-11-05 12:14:48 -08:00
George Wilson	03c6040bee	Illumos #3236 3236 zio nop-write Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Christopher Siden <chris.siden@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> References: illumos/illumos-gate@80901aea8e https://www.illumos.org/issues/3236 Porting Notes 1. This patch is being merged dispite an increased instance of https://www.illumos.org/issues/3113 being triggered by ztest. Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1489	2013-11-05 12:14:21 -08:00
Keith M Wesolowski	831baf06ef	Illumos #3875 3875 panic in zfs_root() after failed rollback Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Gordon Ross <gwr@nexenta.com> References: https://www.illumos.org/issues/3875 illumos/illumos-gate@91948b51b8 Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775	2013-11-04 11:27:41 -08:00
Matthew Ahrens	1958067629	Illumos #3888 3888 zfs recv -F should destroy any snapshots created since the incremental source Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Peng Dai <peng.dai@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/3888 illumos/illumos-gate@34f2f8cf94 Porting notes: 1. Commit `1fde1e3720` wrapped a declaration in dsl_dataset_modified_since_lastsnap in ASSERTV(). The ASSERTV() and local variable have been removed to avoid an unused variable warning. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Richard Yao <ryao@gentoo.org> Issue #1775	2013-11-04 11:18:14 -08:00
Keith M Wesolowski	96c2e96193	Illumos #3894 3894 zfs should not allow snapshot of inconsistent dataset Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Gordon Ross <gwr@nexenta.com> References: https://www.illumos.org/issues/3894 illumos/illumos-gate@ca48f36f20 Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775	2013-11-04 11:18:14 -08:00
Steven Hartland	95fd54a1c5	Illumos #3740 3740 Poor ZFS send / receive performance due to snapshot hold / release processing Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Christopher Siden <christopher.siden@delphix.com> References: https://www.illumos.org/issues/3740 illumos/illumos-gate@a7a845e4bf Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775 Porting notes: 1. `13fe019870` introduced a merge conflict in dsl_dataset_user_release_tmp where some variables were moved outside of the preprocessor directive. 2. dea9dfefdd747534b3846845629d2200f0616dad made the previous merge conflict worse by switching KM_SLEEP to KM_PUSHPAGE. This is notable because this commit refactors the code, adding a new KM_SLEEP allocation. It is not clear to me whether this should be converted to KM_PUSHPAGE. 3. We had a merge conflict in libzfs_sendrecv.c because of copyright notices. 4. Several small C99 compatibility fixed were made.	2013-11-04 11:17:48 -08:00
Will Andrews	d09f25dc66	Illumos #3744 3744 zfs shouldn't ignore errors unmounting snapshots Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Christopher Siden <christopher.siden@delphix.com> References: https://www.illumos.org/issues/3744 illumos/illumos-gate@fc7a6e3fef Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775 Porting notes: 1. There is no clear way to distinguish between a failure when we tried to unmount the snapdir of a zvol (which does not exist) and the failure when we try to unmount a snapdir of a dataset, so the changes to zfs_unmount_snap() were dropped in favor of an altered Linux function that unconditionally returns 0.	2013-11-04 10:55:25 -08:00
Will Andrews	d3cc8b152e	Illumos #3742 3742 zfs comments need cleaner, more consistent style Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Eric Schrock <eric.schrock@delphix.com> Approved by: Christopher Siden <christopher.siden@delphix.com> References: https://www.illumos.org/issues/3742 illumos/illumos-gate@f717074149 Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775 Porting notes: 1. The change to zfs_vfsops.c was dropped because it involves zfs_mount_label_policy, which does not exist in the Linux port.	2013-11-04 10:55:25 -08:00
Will Andrews	e49f1e20a0	Illumos #3741 3741 zfs needs better comments Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Eric Schrock <eric.schrock@delphix.com> Approved by: Christopher Siden <christopher.siden@delphix.com> References: https://www.illumos.org/issues/3741 illumos/illumos-gate@3e30c24aee Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775	2013-11-04 10:55:25 -08:00
Adam Leventhal	63fd3c6cfd	Illumos #3582 , #3584 3582 zfs_delay() should support a variable resolution 3584 DTrace sdt probes for ZFS txg states Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Dan McDonald <danmcd@nexenta.com> Reviewed by: Richard Elling <richard.elling@dey-sys.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/3582 illumos/illumos-gate@0689f76 Ported by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775	2013-11-04 10:55:25 -08:00
George Wilson	2696dfafd9	Illumos #3642 , #3643 3642 dsl_scan_active() should not issue I/O to determine if async destroying is active 3643 txg_delay should not hold the tc_lock Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Approved by: Gordon Ross <gwr@nexenta.com> References: https://www.illumos.org/issues/3642 https://www.illumos.org/issues/3643 illumos/illumos-gate@4a92375985 Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775 Porting Notes: 1. The alignment assumptions for the tx_cpu structure assume that a kmutex_t is 8 bytes. This isn't true under Linux but tc_pad[] was adjusted anyway for consistency since this structure was never carefully aligned in ZoL. If careful alignment does impact performance significantly this should be reworked to be portable.	2013-11-01 08:55:12 -07:00
Matthew Ahrens	2e528b49f8	Illumos #3598 3598 want to dtrace when errors are generated in zfs Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/3598 illumos/illumos-gate@be6fd75a69 Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775 Porting notes: 1. include/sys/zfs_context.h has been modified to render some new macros inert until dtrace is available on Linux. 2. Linux-specific changes have been adapted to use SET_ERROR(). 3. I'm NOT happy about this change. It does nothing but ugly up the code under Linux. Unfortunately we need to take it to avoid more merge conflicts in the future. -Brian	2013-10-31 14:58:04 -07:00
Matthew Ahrens	24a64651b4	Illumos #3588 3588 provide zfs properties for logical (uncompressed) space used and referenced Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Dan McDonald <danmcd@nexenta.com> Reviewed by: Richard Elling <richard.elling@dey-sys.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/3588 illumos/illumos-gate@77372cb0f3 Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-10-31 10:16:11 -07:00
Matthew Ahrens	330847ff36	Illumos #3537 3537 want pool io kstats Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Eric Schrock <eric.schrock@delphix.com> Reviewed by: Sa?o Kiselkov <skiselkov.ml@gmail.com> Reviewed by: Garrett D'Amore <garrett@damore.org> Reviewed by: Brendan Gregg <brendan.gregg@joyent.com> Approved by: Gordon Ross <gwr@nexenta.com> References: http://www.illumos.org/issues/3537 illumos/illumos-gate@c3a6601 Ported by: Cyril Plisko <cyril.plisko@mountall.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Porting Notes: 1. The patch was restructured to take advantage of the existing spa statistics infrastructure. To accomplish this the kstat was moved in to spa->io_stats and the init/destroy code moved to spa_stats.c. 2. The I/O kstat was simply named <pool> which conflicted with the pool directory we had already created. Therefore it was renamed to <pool>/io 3. An update handler was added to allow the kstat to be zeroed.	2013-10-31 09:16:03 -07:00
Richard Yao	495b25a91a	Add missing code to zfs_debug.{c,h} This is required to make Illumos 3962 merge. Signed-off-by: Richard Yao <ryao@gentoo.org>	2013-10-29 15:06:18 -07:00
Richard Yao	632a242e83	Add missing copyright notices from Illumos This resolves merge conflicts when merging Illumos #3588 and Illumos #4047. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775	2013-10-29 15:06:18 -07:00
Massimo Maggi	023699cd62	Posix ACL Support This change adds support for Posix ACLs by storing them as an xattr which is common practice for many Linux file systems. Since the Posix ACL is stored as an xattr it will not overwrite any existing ZFS/NFSv4 ACLs which may have been set. The Posix ACL will also be non-functional on other platforms although it may be visible as an xattr if that platform understands SA based xattrs. By default Posix ACLs are disabled but they may be enabled with the new 'aclmode=noacl\|posixacl' property. Set the property to 'posixacl' to enable them. If ZFS/NFSv4 ACL support is ever added an appropriate acltype will be added. This change passes the POSIX Test Suite cleanly with the exception of xacl/00.t test 45 which is incorrect for Linux (Ext4 fails too). http://www.tuxera.com/community/posix-test-suite/ Signed-off-by: Massimo Maggi <me@massimo-maggi.eu> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #170	2013-10-29 14:54:26 -07:00
Brian Behlendorf	e0b0ca983d	Add visibility in to cached dbufs Currently there is no mechanism to inspect which dbufs are being cached by the system. There are some coarse counters in arcstats by they only give a rough idea of what's being cached. This patch aims to improve the current situation by adding a new dbufs kstat. When read this new kstat will walk all cached dbufs linked in to the dbuf_hash. For each dbuf it will dump detailed information about the buffer. It will also dump additional information about the referenced arc buffer and its related dnode. This provides a more complete view in to exactly what is being cached. With this generic infrastructure in place utilities can be written to post-process the data to understand exactly how the caching is working. For example, the data could be processed to show a list of all cached dnodes and how much space they're consuming. Or a similar list could be generated based on dnode type. Many other ways to interpret the data exist based on what kinds of questions you're trying to answer. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Prakash Surya <surya1@llnl.gov>	2013-10-25 13:59:40 -07:00
Brian Behlendorf	2d37239a28	Add visibility in to dmu_tx_assign times This change adds a new kstat to gain some visibility into the amount of time spent in each call to dmu_tx_assign. A histogram is exported via the new dmu_tx_assign file. The information contained in this histogram is the frequency dmu_tx_assign took to complete given an interval range. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-10-25 13:57:25 -07:00
Brian Behlendorf	0b1401ee91	Add visibility in to txg sync behavior This change is an attempt to add visibility in to how txgs are being formed on a system, in real time. To do this, a list was added to the in memory SPA data structure for a pool, with each element on the list corresponding to txg. These entries are then exported through the kstat interface, which can then be interpreted in userspace. For each txg, the following information is exported: * Unique txg number (uint64_t) * The time the txd was born (hrtime_t) (not wall clock time; relative to the other entries on the list) * The current txg state ((O)pen/(Q)uiescing/(S)yncing/(C)ommitted) * The number of reserved bytes for the txg (uint64_t) * The number of bytes read during the txg (uint64_t) * The number of bytes written during the txg (uint64_t) * The number of read operations during the txg (uint64_t) * The number of write operations during the txg (uint64_t) * The time the txg was closed (hrtime_t) * The time the txg was quiesced (hrtime_t) * The time the txg was synced (hrtime_t) Note that while the raw kstat now stores relative hrtimes for the open, quiesce, and sync times. Those relative times are used to calculate how long each state took and these deltas and printed by output handlers. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-10-25 13:57:25 -07:00
Prakash Surya	1421c89142	Add visibility in to arc_read This change is an attempt to add visibility into the arc_read calls occurring on a system, in real time. To do this, a list was added to the in memory SPA data structure for a pool, with each element on the list corresponding to a call to arc_read. These entries are then exported through the kstat interface, which can then be interpreted in userspace. For each arc_read call, the following information is exported: * A unique identifier (uint64_t) * The time the entry was added to the list (hrtime_t) (not wall clock time; relative to the other entries on the list) * The objset ID (uint64_t) * The object number (uint64_t) * The indirection level (uint64_t) * The block ID (uint64_t) * The name of the function originating the arc_read call (char[24]) * The arc_flags from the arc_read call (uint32_t) * The PID of the reading thread (pid_t) * The command or name of thread originating read (char[16]) From this exported information one can see, in real time, exactly what is being read, what function is generating the read, and whether or not the read was found to be already cached. There is still some work to be done, but this should serve as a good starting point. Specifically, dbuf_read's are not accounted for in the currently exported information. Thus, a follow up patch should probably be added to export these calls that never call into arc_read (they only hit the dbuf hash table). In addition, it might be nice to create a utility similar to "arcstat.py" to digest the exported information and display it in a more readable format. Or perhaps, log the information and allow for it to be "replayed" at a later time. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-10-25 13:57:25 -07:00
Brian Behlendorf	76463d4026	Revert "Add txgs-<pool> kstat file" This reverts commit `e95853a331`.	2013-10-25 13:57:25 -07:00
Brian Behlendorf	98ab38d109	Revert "Add new kstat for monitoring time in dmu_tx_assign" This reverts commit `92334b14ec`. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-10-25 13:57:25 -07:00
Richard Yao	b3c49d3df8	Linux 3.11 compat: Rename LZ4 symbols Linus Torvalds merged LZ4 into Linux 3.11. This causes a conflict whenever CONFIG_LZ4_DECOMPRESS=y or CONFIG_LZ4_COMPRESS=y are set in the kernel's .config. We rename the symbols to avoid the conflict. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1789	2013-10-22 10:12:39 -07:00
Matthew Ahrens	13fe019870	Illumos #3464 3464 zfs synctask code needs restructuring Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/3464 illumos/illumos-gate@3b2aab1880 Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1495	2013-09-04 16:01:24 -07:00
Matthew Ahrens	6f1ffb0665	Illumos #2882 , #2883 , #2900 2882 implement libzfs_core 2883 changing "canmount" property to "on" should not always remount dataset 2900 "zfs snapshot" should be able to create multiple, arbitrary snapshots at once Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Chris Siden <christopher.siden@delphix.com> Reviewed by: Garrett D'Amore <garrett@damore.org> Reviewed by: Bill Pijewski <wdp@joyent.com> Reviewed by: Dan Kruchinin <dan.kruchinin@gmail.com> Approved by: Eric Schrock <Eric.Schrock@delphix.com> References: https://www.illumos.org/issues/2882 https://www.illumos.org/issues/2883 https://www.illumos.org/issues/2900 illumos/illumos-gate@4445fffbbb Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1293 Porting notes: WARNING: This patch changes the user/kernel ABI. That means that the zfs/zpool utilities built from master are NOT compatible with the 0.6.2 kernel modules. Ensure you load the matching kernel modules from master after updating the utilities. Otherwise the zfs/zpool commands will be unable to interact with your pool and you will see errors similar to the following: $ zpool list failed to read pool configuration: bad address no pools available $ zfs list no datasets available Add zvol minor device creation to the new zfs_snapshot_nvl function. Remove the logging of the "release" operation in dsl_dataset_user_release_sync(). The logging caused a null dereference because ds->ds_dir is zeroed in dsl_dataset_destroy_sync() and the logging functions try to get the ds name via the dsl_dataset_name() function. I've got no idea why this particular code would have worked in Illumos. This code has subsequently been completely reworked in Illumos commit 3b2aab1 (3464 zfs synctask code needs restructuring). Squash some "may be used uninitialized" warning/erorrs. Fix some printf format warnings for %lld and %llu. Apply a few spa_writeable() changes that were made to Illumos in illumos/illumos-gate.git@cd1c8b8 as part of the 3112, 3113, 3114 and 3115 fixes. Add a missing call to fnvlist_free(nvl) in log_internal() that was added in Illumos to fix issue 3085 but couldn't be ported to ZoL at the time (zfsonlinux/zfs@9e11c73) because it depended on future work.	2013-09-04 15:49:00 -07:00
Richard Yao	0f37d0c8be	Linux 3.11 compat: fops->iterate() Commit torvalds/linux@2233f31aad replaced ->readdir() with ->iterate() in struct file_operations. All filesystems must now use the new ->iterate method. To handle this the code was reworked to use the new ->iterate interface. Care was taken to keep the majority of changes confined to the ZPL layer which is already Linux specific. However, minor changes were required to the common zfs_readdir() function. Compatibility with older kernels was accomplished by adding versions of the trivial dir_emit* helper functions. Also the various _readdir() functions were reworked in to wrappers which create a dir_context structure to pass to the new _iterate() functions. Unfortunately, the new dir_emit* functions prevent us from passing a private pointer to the filldir function. The xattr directory code leveraged this ability through zfs_readdir() to generate the list of xattr names. Since we can no longer use zfs_readdir() a simplified zpl_xattr_readdir() function was added to perform the same task. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1653 Issue #1591	2013-08-15 16:19:07 -07:00
Matthew Ahrens	cb682a173a	Illumos #3618 ::zio dcmd does not show timestamp data 3618 ::zio dcmd does not show timestamp data Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: George Wilson <gwilson@zfsmail.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Garrett D'Amore <garrett@damore.org> Approved by: Dan McDonald <danmcd@nexenta.com> References: http://www.illumos.org/issues/3618 illumos/illumos-gate@c55e05cb35 Notes on porting to ZFS on Linux: The original changeset mostly deals with mdb ::zio dcmd. However, in order to provide the requested functionality it modifies vdev and zio structures to keep the timing data in nanoseconds instead of ticks. It is these changes that are ported over in the commit in hand. One visible change of this commit is that the default value of 'zfs_vdev_time_shift' tunable is changed: zfs_vdev_time_shift = 6 to zfs_vdev_time_shift = 29 The original value of 6 was inherited from OpenSolaris and was subotimal - since it shifted the raw tick value - it didn't compensate for different tick frequencies on Linux and OpenSolaris. The former has HZ=1000, while the latter HZ=100. (Which itself led to other interesting performance anomalies under non-trivial load. The deadline scheduler delays the IO according to its priority - the lower priority the further the deadline is set. The delay is measured in units of "shifted ticks". Since the HZ value was 10 times higher, the delay units were 10 times shorter. Thus really low priority IO like resilver (delay is 10 units) and scrub (delay is 20 units) were scheduled much sooner than intended. The overall effect is that resilver and scrub IO consumed more bandwidth at the expense of the other IO.) Now that the bookkeeping is done is nanoseconds the shift behaves correctly for any tick frequency (HZ). Ported-by: Cyril Plisko <cyril.plisko@mountall.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1643	2013-08-12 16:46:50 -07:00
Saso Kiselkov	4e59f47511	Illumos #3964 L2ARC should always compress metadata buffers 3964 L2ARC should always compress metadata buffers Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> References: https://www.illumos.org/issues/3964 Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1379	2013-08-08 13:37:00 -07:00
Saso Kiselkov	3a17a7a99a	Illumos #3137 L2ARC compression 3137 L2ARC compression Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@nexenta.com> References: illumos/illumos-gate@aad02571bc https://www.illumos.org/issues/3137 http://wiki.illumos.org/display/illumos/L2ARC+Compression Notes for Linux port: A l2arc_nocompress module option was added to prevent the compression of l2arc buffers regardless of how a dataset's compression property is set. This allows the legacy behavior to be preserved. Ported by: James H <james@kagisoft.co.uk> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1379	2013-08-08 13:27:21 -07:00
Richard Yao	3f4058cd15	Remove arc_data_buf_alloc()/arc_data_buf_free() These functions are used in neither Illumos nor ZFSOnLinux. They appear to have been replaced by arc_buf_alloc()/arc_buf_free(), so lets remove them. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1614	2013-08-01 09:48:07 -07:00
Prakash Surya	92334b14ec	Add new kstat for monitoring time in dmu_tx_assign This change adds a new kstat to gain some visibility into the amount of time spent in each call to dmu_tx_assign. A histogram is exported via a new dmu_tx_assign_histogram-$POOLNAME file. The information contained in this histogram is the frequency dmu_tx_assign took to complete given an interval range. For example, given the below histogram file: $ cat /proc/spl/kstat/zfs/dmu_tx_assign_histogram-tank 12 1 0x01 32 1536 19792068076691 20516481514522 name type data 1 us 4 859 2 us 4 252 4 us 4 171 8 us 4 2 16 us 4 0 32 us 4 2 64 us 4 0 128 us 4 0 256 us 4 0 512 us 4 0 1024 us 4 0 2048 us 4 0 4096 us 4 0 8192 us 4 0 16384 us 4 0 32768 us 4 1 65536 us 4 1 131072 us 4 1 262144 us 4 4 524288 us 4 0 1048576 us 4 0 2097152 us 4 0 4194304 us 4 0 8388608 us 4 0 16777216 us 4 0 33554432 us 4 0 67108864 us 4 0 134217728 us 4 0 268435456 us 4 0 536870912 us 4 0 1073741824 us 4 0 2147483648 us 4 0 one can see most calls to dmu_tx_assign completed in 32us or less, but a few outliers did not. Specifically, 4 of the calls took between 262144us and 131072us. This information is difficult, if not impossible, to gather without this change. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1584	2013-07-11 13:53:44 -07:00
Shen Yan	e77aa730bc	Fix the comment in zfs.h The path to code is also changed in zfsonlinux. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issues #1566	2013-07-09 10:41:18 -07:00
George Wilson	294f68063b	Illumos #3498 panic in arc_read() 3498 panic in arc_read(): !refcount_is_zero(&pbuf->b_hdr->b_refcnt) Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: illumos/illumos-gate@1b912ec710 https://www.illumos.org/issues/3498 Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1249	2013-07-02 13:34:31 -07:00
Matthew Ahrens	96b89346c0	Illumos #3122 zfs destroy filesystem should prefetch blocks 3122 zfs destroy filesystem should prefetch blocks Reviewed by: Christopher Siden <chris.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> References: illumos/illumos-gate@b4709335aa https://www.illumos.org/issues/3122 Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1565	2013-07-02 13:34:02 -07:00
Li Dongyang	802e7b5feb	Add SEEK_DATA/SEEK_HOLE to lseek()/llseek() The approach taken was the rework zfs_holey() as little as possible and then just wrap the code as needed to ensure correct locking and error handling. Tested with xfstests 285 and 286. All tests pass except for 7-9 of 285 which try to reserve blocks first via fallocate(2) and fail because fallocate(2) is not yet supported. Note that the filp->f_lock spinlock did not exist prior to Linux 2.6.30, but we avoid the need for autotools check by virtue of the fact that SEEK_DATA/SEEK_HOLE support was not added until Linux 3.1. An autoconf check was added for lseek_execute() which is currently a private function but the expectation is that it will be exported perhaps as early as Linux 3.11. Reviewed-by: Richard Laager <rlaager@wiktel.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1384	2013-07-02 09:24:43 -07:00
Brian Behlendorf	81eaf15107	Register correct handlers in nvlist_alloc() The non-blocking allocation handlers in nvlist_alloc() would be mistakenly assigned if any flags other than KM_SLEEP were passed. This meant that nvlists allocated with KM_PUSHPUSH or other KM_* debug flags were effectively always using atomic allocations. While these failures were unlikely it could lead to assertions because KM_PUSHPAGE allocations in particular are guaranteed to succeed or block. They must never fail. Since the existing API does not allow us to pass allocation flags to the private allocators the cleanest thing to do is to add a KM_PUSHPAGE allocator. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes zfsonlinux/spl#249	2013-06-20 09:58:15 -07:00
Matthew Ahrens	df4474f92d	Illumos #3805 arc shouldn't cache freed blocks 3805 arc shouldn't cache freed blocks Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Richard Elling <richard.elling@dey-sys.com> Reviewed by: Will Andrews <will@firepipe.net> Approved by: Dan McDonald <danmcd@nexenta.com> References: illumos/illumos-gate@6e6d5868f5 https://www.illumos.org/issues/3805 ZFS should proactively evict freed blocks from the cache. On dcenter, we saw that we were caching ~256GB of metadata, while the pool only had <4GB of metadata on disk. We were wasting about half the system's RAM (252GB) on blocks that have been freed. Even though these freed blocks will never be used again, and thus will eventually be evicted, this causes us to use memory inefficiently for 2 reasons: 1. A block that is freed has no chance of being accessed again, but will be kept in memory preferentially to a block that was accessed before it (and is thus older) but has not been freed and thus has at least some chance of being accessed again. 2. We partition the ARC into several buckets: user data that has been accessed only once (MRU) metadata that has been accessed only once (MRU) user data that has been accessed more than once (MFU) metadata that has been accessed more than once (MFU) The user data vs metadata split is somewhat arbitrary, and the primary control on how much memory is used to cache data vs metadata is to simply try to keep the proportion the same as it has been in the past (each bucket "evicts against" itself). The secondary control is to evict data before evicting metadata. Because of this bucketing, we may end up with one bucket mostly containing freed blocks that are very old, while another bucket has more recently accessed, still-allocated blocks. Data in the useful bucket (with still-allocated blocks) may be evicted in preference to data in the useless bucket (with old, freed blocks). On dcenter, we saw that the MFU metadata bucket was 230MB, while the MFU data bucket was 27GB and the MRU metadata bucket was 256GB. However, the vast majority of data in the MRU metadata bucket (256GB) was freed blocks, and thus useless. Meanwhile, the MFU metadata bucket (230MB) was constantly evicting useful blocks that will be soon needed. The problem of cache segmentation is a larger problem that needs more investigation. However, if we stop caching freed blocks, it should reduce the impact of this more fundamental issue. Ported-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1503	2013-06-20 09:55:52 -07:00
George Wilson	e51be06697	Illumos #3552 , #3564 3552 condensing one space map burns 3 seconds of CPU in spa_sync() thread 3564 spa_sync() spends 5-10% of its time in metaslab_sync() (when not condensing) Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: illumos/illumos-gate@16a4a80742 https://www.illumos.org/issues/3552 https://www.illumos.org/issues/3564 Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1513	2013-06-19 16:22:39 -07:00
Madhav Suresh	c99c90015e	Illumos #3006 3006 VERIFY[S,U,P] and ASSERT[S,U,P] frequently check if first argument is zero Reviewed by Matt Ahrens <matthew.ahrens@delphix.com> Reviewed by George Wilson <george.wilson@delphix.com> Approved by Eric Schrock <eric.schrock@delphix.com> References: illumos/illumos-gate@fb09f5aad4 https://illumos.org/issues/3006 Requires: zfsonlinux/spl@1c6d149feb Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1509	2013-06-19 15:14:10 -07:00
Brian Behlendorf	044baf009a	Use taskq for dump_bytes() The vn_rdwr() function performs I/O by calling the vfs_write() or vfs_read() functions. These functions reside just below the system call layer and the expectation is they have almost the entire 8k of stack space to work with. In fact, certain layered configurations such as ext+lvm+md+multipath require the majority of this stack to avoid stack overflows. To avoid this posibility the vn_rdwr() call in dump_bytes() has been moved to the ZIO_TYPE_FREE, taskq. This ensures that all I/O will be performed with the majority of the stack space available. This ends up being very similiar to as if the I/O were issued via sys_write() or sys_read(). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1399 Closes #1423	2013-05-06 14:05:42 -07:00
Adam Leventhal	7ef5e54e2e	Illumos #3581 spa_zio_taskq[ZIO_TYPE_FREE][ZIO_TASKQ_ISSUE]->tq_lock contention 3581 spa_zio_taskq[ZIO_TYPE_FREE][ZIO_TASKQ_ISSUE]->tq_lock is piping hot Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Gordon Ross <gordon.ross@nexenta.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: illumos/illumos-gate@ec94d32 https://illumos.org/issues/3581 Notes for Linux port: Earlier commit `08d08eb` reduced contention on this taskq lock by simply reducing the number of z_fr_iss threads from 100 to one-per-CPU. We also optimized the taskq implementation in zfsonlinux/spl@3c6ed54. These changes significantly improved unlink performance to acceptable levels. This patch further reduces time spent spinning on this lock by randomly dispatching the work items over multiple independent task queues. The Illumos ZFS developers stated that this lock contention only arose after "3329 spa_sync() spends 10-20% of its time in spa_free_sync_cb()" was landed. It's not clear if 3329 affects the Linux port or not. I didn't see spa_free_sync_cb() show up in oprofile sessions while unlinking large files, but I may just not have used the right test case. I tested unlinking a 1 TB of data with and without the patch and didn't observe a meaningful difference in elapsed time. However, oprofile showed that the percent time spent in taskq_thread() was reduced from about 16% to about 5%. Aside from a possible slight performance benefit this may be worth landing if only for the sake of maintaining consistency with upstream. Ported-by: Ned Bass <bass6@llnl.gov> Closes #1327	2013-05-06 14:05:37 -07:00
George Wilson	55d85d5a8c	Illumos #3329 , #3330 , #3331 , #3335 3329 spa_sync() spends 10-20% of its time in spa_free_sync_cb() 3330 space_seg_t should have its own kmem_cache 3331 deferred frees should happen after sync_pass 1 3335 make SYNC_PASS_* constants tunable Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com> Reviewed by: Christopher Siden <chris.siden@delphix.com> Reviewed by: Eric Schrock <eric.schrock@delphix.com> Reviewed by: Richard Lowe <richlowe@richlowe.net> Reviewed by: Dan McDonald <danmcd@nexenta.com> Approved by: Eric Schrock <eric.schrock@delphix.com> References: illumos/illumos-gate@01f55e48fb https://www.illumos.org/issues/3329 https://www.illumos.org/issues/3330 https://www.illumos.org/issues/3331 https://www.illumos.org/issues/3335 Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-05-06 12:39:34 -07:00
George.Wilson	cc92e9d0c3	3246 ZFS I/O deadman thread Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com> Reviewed by: Eric Schrock <eric.schrock@delphix.com> Reviewed by: Christopher Siden <chris.siden@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> NOTES: This patch has been reworked from the original in the following ways to accomidate Linux ZFS implementation ) Usage of the cyclic interface was replaced by the delayed taskq interface. This avoids the need to implement new compatibility code and allows us to rely on the existing taskq implementation. ) An extern for zfs_txg_synctime_ms was added to sys/dsl_pool.h because declaring externs in source files as was done in the original patch is just plain wrong. ) Instead of panicing the system when the deadman triggers a zevent describing the blocked vdev and the first pending I/O is posted. If the panic behavior is desired Linux provides other generic methods to panic the system when threads are observed to hang. ) For reference, to delay zios by 30 seconds for testing you can use zinject as follows: 'zinject -d <vdev> -D30 <pool>' References: illumos/illumos-gate@283b84606b https://www.illumos.org/issues/3246 Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1396	2013-05-01 17:05:52 -07:00
Jan Engelhardt	4e95cc99b0	build: resolve orthographic and other grammatical errors Signed-off-by: Jan Engelhardt <jengelh@inai.de> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-04-02 10:44:52 -07:00
Brian Behlendorf	775f2d34a3	Change zfs-kmod-devel install path Install the common zfs kernel development headers under /usr/src/zfs-<version>/ rather than in a kernel specific directory. The kernel specific build products such as zfs_config.h and Modules.symvers are left installed under /usr/src/zfs-<version>/<kernel>. This was done to be consistent with where dkms expects kernel module source to be packaged. It also allows for a common zfs-kmod-devel package which includes the headers, and per-kernel zfs-kmod-devel-<kernel> packages. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-03-13 13:42:16 -07:00
Eric Dillmann	0b4d1b5853	Add snapdev=[hidden\|visible] dataset property The new snapdev dataset property may be set to control the visibility of zvol snapshot devices. By default this value is set to 'hidden' which will prevent zvol snapshots from appearing under /dev/zvol/ and /dev/<dataset>/. When set to 'visible' all zvol snapshots for the dataset will be visible. This functionality was largely added because when automatic snapshoting is enabled large numbers of read-only zvol snapshots will be created. When creating these devices the kernel will attempt to read their partition tables, and blkid will attempt to identify any filesystems on those partitions. This leads to a variety of issues: 1) The zvol partition tables will be read in the context of the `modprobe zfs` for automatically imported pools. This is undesirable and should be done asynchronously, but for now reducing the number of visible devices helps. 2) Udev expects to be able to complete its work for a new block devices fairly quickly. When many zvol devices are added at the same time this is no longer be true. It can lead to udev timeouts and missing /dev/zvol links. 3) Simply having lots of devices in /dev/ can be aukward from a management standpoint. Hidding the devices your unlikely to ever use helps with this. Any snapshot device which is needed can be made visible by changing the snapdev property. NOTE: This patch changes the default behavior for zvols which was effectively 'snapdev=visible'. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1235 Closes #945 Issue #956 Issue #756	2013-03-05 12:37:54 -08:00
Richard Yao	b01615d5ac	Constify structures containing function pointers The PaX team modified the kernel's modpost to report writeable function pointers as section mismatches because they are potential exploit targets. We could ignore the warnings, but their presence can obscure actual issues. Proper const correctness can also catch programming mistakes. Building the kernel modules against a PaX/GrSecurity patched Linux 3.4.2 kernel reports 133 section mismatches prior to this patch. This patch eliminates 130 of them. The quantity of writeable function pointers eliminated by constifying each structure is as follows: vdev_opts_t 52 zil_replay_func_t 24 zio_compress_info_t 24 zio_checksum_info_t 9 space_map_ops_t 7 arc_byteswap_func_t 5 The remaining 3 writeable function pointers cannot be addressed by this patch. 2 of them are in zpl_fs_type. The kernel's sget function requires that this be non-const. The final writeable function pointer is created by SPL_SHRINKER_DECLARE. The kernel's set_shrinker() and remove_shrinker() functions also require that this be non-const. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1300	2013-03-04 08:49:32 -08:00
Brian Behlendorf	79c6e4c445	Remove NPTL_GUARD_WITHIN_STACK Commit `4b2f65b253` increased the user space stack by 4x to resolve certain stack overflows. As such it no longer makes sense to worry about a single extra page which might or might not be part of the process stack. There is now ample headroom for normal usage. By eliminating this configure check we are also resolving the following segfault which intentionally occurs at configure time and may be logged in dmesg. conftest[22156]: segfault at 7fbf18a47e48 ip 00000000004007fe sp 00007fbf18a4be50 error 6 in conftest[400000+1000] Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-01-29 10:58:20 -08:00
Eric Dillmann	9759c60f1a	Illumos #3035 LZ4 compression support in ZFS and GRUB 3035 LZ4 compression support in ZFS and GRUB Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Christopher Siden <csiden@delphix.com> References: illumos/illumos-gate@a6f561b4ae https://www.illumos.org/issues/3035 http://wiki.illumos.org/display/illumos/LZ4+Compression+In+ZFS This patch has been slightly modified from the upstream Illumos version to be compatible with Linux. Due to the very limited stack space in the kernel a lz4 workspace kmem cache is used. Since we are using gcc we are also able to take advantage of the gcc optimized __builtin_ctz functions. Support for GRUB has been dropped from this patch. That code is available but those changes will need to made to the upstream GRUB package. Lastly, several hunks of dead code were dropped for clarity. They include the functions real_LZ4_uncompress(), LZ4_compressBound() and the Visual Studio specific hunks wrapped in _MSC_VER. Ported-by: Eric Dillmann <eric@jave.fr> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1217	2013-01-29 09:28:20 -08:00
Brian Behlendorf	6772fb679a	Use dsl_dataset_snap_lookup() Retire the dmu_snapshot_id() function which was introduced in the initial .zfs control directory implementation. There is already an existing dsl_dataset_snap_lookup() which does exactly what we need, and the dmu_snapshot_id() function as implemented is racy. https://github.com/zfsonlinux/zfs/issues/1215#issuecomment-12579879 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1238	2013-01-25 15:07:40 -08:00
Brian Behlendorf	7b3e34ba5a	Fix 'zfs rollback' on mounted file systems Rolling back a mounted filesystem with open file handles and cached dentries+inodes never worked properly in ZoL. The major issue was that Linux provides no easy mechanism for modules to invalidate the inode cache for a file system. Because of this it was possible that an inode from the previous filesystem would not get properly dropped from the cache during rolling back. Then a new inode with the same inode number would be create and collide with the existing cached inode. Ideally this would trigger an VERIFY() but in practice the error wasn't handled and it would just NULL reference. Luckily, this issue can be resolved by sprucing up the existing Solaris zfs_rezget() functionality for the Linux VFS. The way it works now is that when a file system is rolled back all the cached inodes will be traversed and refetched from disk. If a version of the cached inode exists on disk the in-core copy will be updated accordingly. If there is no match for that object on disk it will be unhashed from the inode cache and marked as stale. This will effectively make the inode unfindable for lookups allowing the inode number to be immediately recycled. The inode will then only be accessible from the cached dentries. Subsequent dentry lookups which reference a stale inode will result in the dentry being invalidated. Once invalidated the dentry will drop its reference on the inode allowing it to be safely pruned from the cache. Special care is taken for negative dentries since they do not reference any inode. These dentires will be invalidate based on when they were added to the dentry cache. Entries added before the last rollback will be invalidate to prevent them from masking real files in the dataset. Two nice side effects of this fix are: * Removes the dependency on spl_invalidate_inodes(), it can now be safely removed from the SPL when we choose to do so. * zfs_znode_alloc() no longer requires a dentry to be passed. This effectively reverts this portition of the code to its upstream counterpart. The dentry is not instantiated more correctly in the Linux ZPL layer. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #795	2013-01-17 09:51:20 -08:00
Matthew Ahrens	a94addd974	Illumos #3208 cross-endian incorrect user/group accounting 3208 moving zpool cross-endian results in incorrect user/group accounting Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Christopher Siden <chris.siden@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: illumos/illumos-gate@e828a46d29 illumos changeset: 13835:eea81edc4f14 https://www.illumos.org/issues/3208 Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #627 Closes #1136	2013-01-14 09:32:22 -08:00
George Wilson	1eb5bfa3dc	Illumos #3145 , #3212 3145 single-copy arc 3212 ztest: race condition between vdev_online() and spa_vdev_remove() Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Eric Schrock <eric.schrock@delphix.com> Reviewed by: Justin T. Gibbs <gibbs@scsiguy.com> Approved by: Eric Schrock <eric.schrock@delphix.com> References: illumos-gate/commit/9253d63df408bb48584e0b1abfcc24ef2472382e illumos changeset: 13840:97fd5cdf328a https://www.illumos.org/issues/3145 https://www.illumos.org/issues/3212 Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #989 Closes #1137	2013-01-08 10:35:44 -08:00
Matthew Ahrens	753c38392d	Illumos #3104 : eliminate empty bpobjs 3104 eliminate empty bpobjs Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Christopher Siden <chris.siden@delphix.com> Reviewed by: Garrett D'Amore <garrett@damore.org> Approved by: Eric Schrock <eric.schrock@delphix.com> References: illumos/illumos-gate@f174573681 illumos changeset: 13782:8f78aae28a63 https://www.illumos.org/issues/3104 Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-01-08 10:35:43 -08:00
Matthew Ahrens	29809a6cba	Illumos #3086 : unnecessarily setting DS_FLAG_INCONSISTENT on async 3086 unnecessarily setting DS_FLAG_INCONSISTENT on async destroyed datasets Reviewed by: Christopher Siden <chris.siden@delphix.com> Approved by: Eric Schrock <Eric.Schrock@delphix.com> References: illumos/illumos-gate@ce636f8b38 illumos changeset: 13776:cd512c80fd75 https://www.illumos.org/issues/3086 Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-01-08 10:35:43 -08:00
Christopher Siden	b9b24bb4ca	Illumos #2762 : zpool command should have better support for feature flags 2762 zpool command should have better support for feature flags Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Eric Schrock <Eric.Schrock@delphix.com> References: illumos/illumos-gate@57221772c3 https://www.illumos.org/issues/2762 Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-01-08 10:35:43 -08:00
George Wilson	3bc7e0fb0f	Illumos #3090 and #3102 3090 vdev_reopen() during reguid causes vdev to be treated as corrupt 3102 vdev_uberblock_load() and vdev_validate() may read the wrong label Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Christopher Siden <chris.siden@delphix.com> Reviewed by: Garrett D'Amore <garrett@damore.org> Approved by: Eric Schrock <Eric.Schrock@delphix.com> References: illumos/illumos-gate@dfbb943217 illumos changeset: 13777:b1e53580146d https://www.illumos.org/issues/3090 https://www.illumos.org/issues/3102 Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #939	2013-01-08 10:35:42 -08:00
Christopher Siden	9ae529ec5d	Illumos #2619 and #2747 2619 asynchronous destruction of ZFS file systems 2747 SPA versioning with zfs feature flags Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <gwilson@delphix.com> Reviewed by: Richard Lowe <richlowe@richlowe.net> Reviewed by: Dan Kruchinin <dan.kruchinin@gmail.com> Approved by: Eric Schrock <Eric.Schrock@delphix.com> References: illumos/illumos-gate@53089ab7c8 illumos/illumos-gate@ad135b5d64 illumos changeset: 13700:2889e2596bd6 https://www.illumos.org/issues/2619 https://www.illumos.org/issues/2747 NOTE: The grub specific changes were not ported. This change must be made to the Linux grub packages. Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-01-08 10:35:35 -08:00
Matt Johnston	72938d6905	Use cv_wait_io() which will will account for iowait Update zio_wait() to use cv_wait_io() to ensure the iowait time is properly accounted for. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-01-07 10:52:52 -08:00
Brian Behlendorf	d5446cfc52	Revert "Remove TSD zfs_fsyncer_key" This reverts commit `31f2b5abdf` back to the original code until the fsync(2) performance regression can be addressed. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-12-20 09:56:28 -08:00
Brian Behlendorf	31f2b5abdf	Remove TSD zfs_fsyncer_key It's my understanding that the zfs_fsyncer_key TSD was added as a performance omtimization to reduce contention on the zl_lock from zil_commit(). This issue manifested itself as very long (100+ms) fsync() system call times for fsync() heavy workloads. However, under Linux I'm not seeing the same contention that was originally described. Therefore, I'm removing this code in order to ween ourselves off any dependence on TSD. If the original performance issue reappears on Linux we can revisit fixing it without resorting to TSD. This just leaves one small ZFS TSD consumer. If it can be cleanly removed from the code we'll be able to shed the SPL TSD implementation entirely. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes zfsonlinux/spl#174	2012-12-19 09:08:01 -08:00
Jorgen Lundman	6c2856726f	Fix using zvol as slog device During the original ZoL port the vdev_uses_zvols() function was disabled until it could be properly implemented. This prevented a zpool from use a zvol for its slog device. This patch implements that missing functionality by adding a zvol_is_zvol() function to zvol.c. Given the full path to a device it will lookup the device and verify its major number against the registered zvol major number for the system. If they match we know the device is a zvol. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1131	2012-12-18 11:02:28 -08:00
Brian Behlendorf	8780c53961	Update SAs when an inode is dirtied Revert the portion of commit `d3aa3ea` which always resulted in the SAs being update when an mmap()'ed file was closed. That change accidentally resulted in unexpected ctime updates which upset tools like git. That was always a horrible hack and I'm happy it will never make it in to a tagged release. The right fix is something I initially resisted doing because I was worried about the additional overhead. However, in hindsight the overhead isn't as bad as I feared. This patch implemented the sops->dirty_inode() callback which is unsurprisingly called when an inode is dirtied. We leverage this callback to keep the znode SAs strictly in sync with the inode. However, for now we're going to go slowly to avoid introducing any new unexpected issues by only updating the atime, mtime, and ctime. This will cover the callpath of most concern to us. ->filemap_page_mkwrite->file_update_time->update_time-> mark_inode_dirty_sync->__mark_inode_dirty->dirty_inode Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #764 Closes #1140	2012-12-14 12:18:54 -08:00
Brian Behlendorf	2ae1031962	Linux 3.7 compat, schedule_delayed_work() Linux kernel commit d8e794d accidentally broke the delayed work APIs for non-GPL callers. While the APIs to schedule a delayed work item are still available to all callers, it is no longer possible to initialize the delayed work item. I'm cautiously optimistic we could get the delayed_work_timer_fn exported for all callers in the upstream kernel. But frankly the compatibility code to use this kernel interface has always been problematic. Therefore, this patch abandons direct use the of the Linux kernel interface in favor of the new delayed taskq interface. It provides roughly the same functionality as delayed work queues but it's a stable interface under our control. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1053	2012-12-12 10:47:05 -08:00
Brian Behlendorf	e89260a1c8	Directory xattr znodes hold a reference on their parent Unlike normal file or directory znodes, an xattr znode is guaranteed to only have a single parent. Therefore, we can take a refernce on that parent if it is provided at create time and cache it. Additionally, we take care to cache it on any subsequent zfs_zaccess() where the parent is provided as an optimization. This allows us to avoid needing to do a zfs_zget() when setting up the SELinux security xattr in the create path. This is critical because a hash lookup on the directory will deadlock since it is locked. The zpl_xattr_security_init() call has also been moved up to the zpl layer to ensure TXs to create the required xattrs are performed after the create TX. Otherwise we run the risk of deadlocking on the open create TX. Ideally the security xattr should be fully constructed before the new inode is unlocked. However, doing so would require far more extensive changes to ZFS. This change may also have the benefitial side effect of ensuring xattr directory znodes are evicted from the cache before normal file or directory znodes due to the extra reference. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #671	2012-12-03 12:10:46 -08:00
Brian Behlendorf	30315d237b	Increase ZFS_OBJ_MTX_SZ to 256 Increasing this limit costs us 6144 bytes of memory per mounted filesystem, but this is small price to pay for accomplishing the following: * Allows for up to 256-way concurreny when performing lookups which helps performance when there are a large number of processes. * Minimizes the likelyhood of encountering the deadlock described in issue #1101. Because vmalloc() won't strictly honor __GFP_FS there is still a very remote chance of a deadlock. See the zfsonlinux/spl@043f9b57 commit. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1101	2012-11-27 13:46:32 -08:00
George Wilson	32a9872bba	Illumos #2671 : zpool import should not fail if vdev ashift has increased Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Eric Schrock <eric.schrock@delphix.com> Reviewed by: Richard Elling <richard.elling@richardelling.com> Reviewed by: Gordon Ross <gwr@nexenta.com> Reviewed by: Garrett D'Amore <garrett@damore.org> Approved by: Richard Lowe <richlowe@richlowe.net> Refererces to Illumos issue: https://www.illumos.org/issues/2671 This patch has been slightly modified from the upstream Illumos version. In the upstream implementation a warning message is logged to the console. To prevent pointless console noise this notification is now posted as a "ereport.fs.zfs.vdev.bad_ashift" event. The event indicates a non-optimial (but entirely safe) ashift value was used to create the pool. Depending on your workload this may impact pool performance. Unfortunately, the only way to correct the issue is to recreate the pool with a new ashift. NOTE: The unrelated fix to the comment in zpool_main.c appears in the upstream commit and was preserved for consistnecy. Ported-by: Cyril Plisko <cyril.plisko@mountall.com> Reworked-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #955	2012-11-15 11:05:59 -08:00
Brian Behlendorf	9dcb971983	Log I/Os longer than zio_delay_max (30s default) There have been reports of ZFS deadlocking due to what appears to be a lost IO. This patch addes some debugging to determine the exact state of the IO which neither 1) completed, 2) failed, or 3) timed out after zio_delay_max (30) seconds. This information will be logged using the ZFS FMA infrastructure as a 'delay' event and posted to the internal zevent log. By default the last 64 events will be kept in the log but the limit is configurable via the zfs_zevent_len_max module option. To dump the contents of the log use the 'zpool events -v' command and look for the resource.fs.zfs.delay event. It will include various information about the pool, vdev, and zio which may shed some light on the issue. In the context of this change the 120 second kernel blocked thread watchdog has been disabled for synchronous IOs. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #930	2012-11-02 15:45:59 -07:00
Brian Behlendorf	e95853a331	Add txgs-<pool> kstat file Create a kstat file which contains useful statistics about the last N txgs processed. This can be helpful when analyzing pool performance. The new KSTAT_TYPE_TXG type was added for this purpose and it tracks the following statistics per-txg. txg - Unique txg number state - State (O)pen/(Q)uiescing/(S)yncing/(C)ommitted birth; - Creation time nread - Bytes read nwritten; - Bytes written reads - IOPs read writes - IOPs write open_time; - Length in nanoseconds the txg was open quiesce_time - Length in nanoseconds the txg was quiescing sync_time; - Length in nanoseconds the txg was syncing Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-11-02 15:45:56 -07:00
Brian Behlendorf	e8fd45a0f9	Add ddt_object_count() error handling The interface for the ddt_zap_count() function assumes it can never fail. However, internally ddt_zap_count() is implemented with zap_count() which can potentially fail. Now because there was no way to return the error to the caller a VERIFY was used to ensure this case never happens. Unfortunately, it has been observed that pools can be damaged in such a way that zap_count() fails. The result is that the pool can not be imported without hitting the VERIFY and crashing the system. This patch reworks ddt_object_count() so the error can be safely caught and returned to the caller. This allows a pool which has be damaged in this way to be safely rewound for import. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #910	2012-10-29 08:57:45 -07:00
Etienne Dechamps	920dd524fb	Add FASTWRITE algorithm for synchronous writes. Currently, ZIL blocks are spread over vdevs using hint block pointers managed by the ZIL commit code and passed to metaslab_alloc(). Spreading log blocks accross vdevs is important for performance: indeed, using mutliple disks in parallel decreases the ZIL commit latency, which is the main performance metric for synchronous writes. However, the current implementation suffers from the following issues: 1) It would be best if the ZIL module was not aware of such low-level details. They should be handled by the ZIO and metaslab modules; 2) Because the hint block pointer is managed per log, simultaneous commits from multiple logs might use the same vdevs at the same time, which is inefficient; 3) Because dmu_write() does not honor the block pointer hint, indirect writes are not spread. The naive solution of rotating the metaslab rotor each time a block is allocated for the ZIL or dmu_sync() doesn't work in practice because the first ZIL block to be written is actually allocated during the previous commit. Consequently, when metaslab_alloc() decides the vdev for this block, it will do so while a bunch of other allocations are happening at the same time (from dmu_sync() and other ZILs). This means the vdev for this block is chosen more or less at random. When the next commit happens, there is a high chance (especially when the number of blocks per commit is slightly less than the number of the disks) that one disk will have to write two blocks (with a potential seek) while other disks are sitting idle, which defeats spreading and increases the commit latency. This commit introduces a new concept in the metaslab allocator: fastwrites. Basically, each top-level vdev maintains a counter indicating the number of synchronous writes (from dmu_sync() and the ZIL) which have been allocated but not yet completed. When the metaslab is called with the FASTWRITE flag, it will choose the vdev with the least amount of pending synchronous writes. If there are multiple vdevs with the same value, the first matching vdev (starting from the rotor) is used. Once metaslab_alloc() has decided which vdev the block is allocated to, it updates the fastwrite counter for this vdev. The rationale goes like this: when an allocation is done with FASTWRITE, it "reserves" the vdev until the data is written. Until then, all future allocations will naturally avoid this vdev, even after a full rotation of the rotor. As a result, pending synchronous writes at a given point in time will be nicely spread over all vdevs. This contrasts with the previous algorithm, which is based on the implicit assumption that blocks are written instantaneously after they're allocated. metaslab_fastwrite_mark() and metaslab_fastwrite_unmark() are used to manually increase or decrease fastwrite counters, respectively. They should be used with caution, as there is no per-BP tracking of fastwrite information, so leaks and "double-unmarks" are possible. There is, however, an assert in the vdev teardown code which will fire if the fastwrite counters are not zero when the pool is exported or the vdev removed. Note that as stated above, marking is also done implictly by metaslab_alloc(). ZIO also got a new FASTWRITE flag; when it is used, ZIO will pass it to the metaslab when allocating (assuming ZIO does the allocation, which is only true in the case of dmu_sync). This flag will also trigger an unmark when zio_done() fires. A side-effect of the new algorithm is that when a ZIL stops being used, its last block can stay in the pending state (allocated but not yet written) for a long time, polluting the fastwrite counters. To avoid that, I've implemented a somewhat crude but working solution which unmarks these pending blocks in zil_sync(), thus guaranteeing that linguering fastwrites will get pruned at each sync event. The best performance improvements are observed with pools using a large number of top-level vdevs and heavy synchronous write workflows (especially indirect writes and concurrent writes from multiple ZILs). Real-life testing shows a 200% to 300% performance increase with indirect writes and various commit sizes. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1013	2012-10-17 08:56:41 -07:00
Brian Behlendorf	87d98efe9e	Fix zfs_txg_timeout module parameter Allow the zfs_txg_timeout variable to be dynamically tuned at run time. By pulling it down out of the variable declaration it will be evaluted each time through the loop. The zfs_txg_timeout variable is now declared extern in a the common sys/txg.h header rather than locally in dsl_scan.c. This prevents potential type mismatches if the global variable needs to be used elsewhere. Move the module_param() code in to the same source file where zfs_txg_timeout is declared. This is the most logical location. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-10-11 15:07:09 -07:00
Richard Yao	7df05a4266	Fix zfs_write_limit_max integer size mismatch on 32-bit systems Commit `c409e4647f` introduced a number of module parameters. This required several types to be changed to accomidate the required module parameters Linux macros. Unfortunately, arc.c contained its own extern definition of the zfs_write_limit_max variable and its type was not updated to be consistent with its dsl_pool.c counterpart. If the variable had been properly marked extern in a common header, then gcc would have generated a warning and this would not have slipped through. The result of this was that the ARC unconditionally expected zfs_write_limit_max to be 64-bit. Unfortunately, the largest size integer module parameter that Linux supports is unsigned long, which varies in size depending on the host system's native word size. The effect was that on 32-bit systems, ARC incorrectly performed 64-bit operations on a 32-bit value by reading the neighboring 32 bits as the upper 32 bits of the 64-bit value. We correct that by changing the extern declaration to use the unsigned long type and move these extern definitions in to the common arc.h header. This should make ARC correctly treat zfs_write_limit_max as a 32-bit value on 32-bit systems. Reported-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #749	2012-10-11 11:09:25 -07:00
Matthew Ahrens	04434775b7	Illumos #3100 : zvol rename fails with EBUSY when dirty. illumos/illumos-gate@2e2c135528 Illumos changeset: 13780:6da32a929222 3100 zvol rename fails with EBUSY when dirty Reviewed by: Christopher Siden <chris.siden@delphix.com> Reviewed by: Adam H. Leventhal <ahl@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Garrett D'Amore <garrett@damore.org> Approved by: Eric Schrock <eric.schrock@delphix.com> Ported-by: Etienne Dechamps <etienne.dechamps@ovh.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #995	2012-10-03 13:59:02 -07:00
Etienne Dechamps	274091c074	Fix VOP_CLOSE() in userspace. Currently, for unknown reasons, VOP_CLOSE() is a no-op in userspace. This causes file descriptor leaks. This is especially problematic with long ztest runs, since zpool.cache is opened repeatedly and never closed, resulting in resource exhaustion (EMFILE errors). This patch fixes the issue by making VOP_CLOSE() do what it is supposed to do. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #989	2012-10-03 13:32:48 -07:00
Etienne Dechamps	0aebd4f9e3	Create threads in detached state in userspace. Currently, thread_create(), when called in userspace, creates a joinable (i.e. not detached thread). This is the pthread default. Unfortunately, this does not reproduce kthreads behavior (kthreads are always detached). In addition, this contradicts the original Solaris code which creates userspace threads in detached mode. These joinable threads are never joined, which leads to a leakage of pthread thread objects ("zombie threads"). This in turn results in excessive ressource consumption, and possible ressource exhaustion in extreme cases (e.g. long ztest runs). This patch fixes the issue by creating userspace threads in detached mode. The only exception is ztest worker threads which are meant to be joinable. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #989	2012-10-03 13:32:48 -07:00
Bill Pijewski	37abac6d55	Illumos #2703 : add mechanism to report ZFS send progress Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Robert Mustacchi <rm@joyent.com> Reviewed by: Richard Lowe <richlowe@richlowe.net> Approved by: Eric Schrock <Eric.Schrock@delphix.com> References: https://www.illumos.org/issues/2703 Ported by: Martin Matuska <martin@matuska.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-09-19 13:39:06 -07:00
Chris Siden	1bd201e70d	Illumos #1948 : zpool list should show more detailed pool info Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Eric Schrock <eric.schrock@delphix.com> Reviewed by: Richard Lowe <richlowe@richlowe.net> Reviewed by: Albert Lee <trisk@nexenta.com> Reviewed by: Dan McDonald <danmcd@nexenta.com> Reviewed by: Garrett D'Amore <garrett@damore.org> Approved by: Eric Schrock <eric.schrock@delphix.com> References: https://www.illumos.org/issues/1948 Ported by: Martin Matuska <martin@matuska.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #685	2012-09-19 13:39:05 -07:00
Brian Behlendorf	0ef0ff546e	Switch KM_SLEEP to KM_PUSHPAGE This warning indicates the incorrect use of KM_SLEEP in a call path which must use KM_PUSHPAGE to avoid deadlocking in direct reclaim. See commit `b8d06fca08` for additional details. SPL: Fixing allocation for task txg_sync (6093) which used GFP flags 0x297bda7c with PF_NOFS set Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #917	2012-09-04 16:00:06 -07:00
Richard Yao	b8d06fca08	Switch KM_SLEEP to KM_PUSHPAGE Differences between how paging is done on Solaris and Linux can cause deadlocks if KM_SLEEP is used in any the following contexts. * The txg_sync thread * The zvol write/discard threads * The zpl_putpage() VFS callback This is because KM_SLEEP will allow for direct reclaim which may result in the VM calling back in to the filesystem or block layer to write out pages. If a lock is held over this operation the potential exists to deadlock the system. To ensure forward progress all memory allocations in these contexts must us KM_PUSHPAGE which disables performing any I/O to accomplish the memory allocation. Previously, this behavior was acheived by setting PF_MEMALLOC on the thread. However, that resulted in unexpected side effects such as the exhaustion of pages in ZONE_DMA. This approach touchs more of the zfs code, but it is more consistent with the right way to handle these cases under Linux. This is patch lays the ground work for being able to safely revert the following commits which used PF_MEMALLOC: `21ade34` Disable direct reclaim for z_wr_* threads `cfc9a5c` Fix zpl_writepage() deadlock `eec8164` Fix ASSERTION(!dsl_pool_sync_context(tx->tx_pool)) Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #726	2012-08-27 12:01:37 -07:00
Brian Behlendorf	86dd0fd922	Pre-allocate vdev I/O buffers The vdev queue layer may require a small number of buffers when attempting to create aggregate I/O requests. Rather than attempting to allocate them from the global zio buffers, which is slow under memory pressure, it makes sense to pre-allocate them because... 1) These buffers are short lived. They are only required for the life of a single I/O at which point they can be used by the next I/O. 2) The maximum number of concurrent buffers needed by a vdev is small. It's roughly limited by the zfs_vdev_max_pending tunable which defaults to 10. By keeping a small list of these buffer per-vdev we can ensure one is always available when we need it. This significantly reduces contention on the vq->vq_lock, because we no longer need to perform a slow allocation under this lock. This is particularly important when memory is already low on the system. It would probably be wise to extend the use of these buffers beyond aggregate I/O and in to the raidz implementation. The inability to quickly allocate buffer for the parity stripes could result in similiar problems. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-08-27 12:01:37 -07:00
Richard Yao	44f21da41c	Revert Disable direct reclaim for z_wr_* threads This commit used PF_MEMALLOC to prevent a memory reclaim deadlock. However, commit `49be0ccf1f` eliminated the invocation of __cv_init(), which was the cause of the deadlock. PF_MEMALLOC has the side effect of permitting pages from ZONE_DMA to be allocated. The use of PF_MEMALLOC was found to cause stability problems when doing swap on zvols. Since this technique is known to cause problems and no longer fixes anything, we revert it. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #726	2012-08-27 12:01:37 -07:00
Brian Behlendorf	ca8b5af89d	Remove autotools products Remove all of the generated autotools products from the repository and update the .gitignore files accordingly. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #718	2012-08-27 11:47:44 -07:00
Prakash Surya	15a9e03368	Wrap smp_processor_id in kpreempt_[dis\|en]able After surveying the code, the few places where smp_processor_id is used were deemed to be safe to use with a preempt enabled kernel. As such, no core logic had to be changed. These smp_processor_id call sites are simply are wrapped in kpreempt_disable and kpreempt_enabled to prevent the Linux kernel from emitting scary warnings. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Issue #83	2012-08-24 13:19:06 -07:00
Brian Behlendorf	8f576c2321	Export dbuf_* symbols Export these symbols so they may be used by other ZFS consumers besides the ZPL. Remove three stale prototype definites from dbuf.h. The actual implementations of these functions were removed/renamed long ago. It would be good in the long term to remove the existing pragmas we inherited from Solaris and simply use the dbuf_* names. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-08-10 16:45:13 -07:00
Dan McDonald	d96eb2b153	Illumos #1693 : persistent 'comment' field for a zpool Reviewed by: George Wilson <gwilson@zfsmail.com> Reviewed by: Eric Schrock <eric.schrock@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/1693 Ported by: Martin Matuska <martin@matuska.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #678	2012-08-08 11:49:37 -07:00

1 2 3 4 5 ...

254 Commits