mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-05-22 02:27:36 +03:00

Author	SHA1	Message	Date
Garrett D'Amore	2cc6c8db12	Illumos #175 : zfs vdev cache consumes excessive memory Note that with the current ZFS code, it turns out that the vdev cache is not helpful, and in some cases actually harmful. It is better if we disable this. Once some time has passed, we should actually remove this to simplify the code. For now we just disable it by setting the zfs_vdev_cache_size to zero. Note that Solaris 11 has made these same changes. References to Illumos issue and patch: - https://www.illumos.org/issues/175 - https://github.com/illumos/illumos-gate/commit/b68a40a845 Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Eric Schrock <eric.schrock@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #340	2011-08-01 12:09:11 -07:00
Gordon Ross	ef3c1dea70	Illumos #764 : panic in zfs:dbuf_sync_list Hypothesis about what's going on here. At some time in the past, something, i.e. dnode_reallocate() calls one of: dbuf_rm_spill(dn, tx); These will do: dbuf_rm_spill(dnode_t dn, dmu_tx_t tx) dbuf_free_range(dn, DMU_SPILL_BLKID, DMU_SPILL_BLKID, tx) dbuf_undirty(db, tx) Currently dbuf_undirty can leave a spill block in dn_dirty_records[], (it having been put there previously by dbuf_dirty) and free it. Sometime later, dbuf_sync_list trips over this reference to free'd (and typically reused) memory. Also, dbuf_undirty can call dnode_clear_range with a bogus block ID. It needs to test for DMU_SPILL_BLKID, similar to how dnode_clear_range is called in dbuf_dirty(). References to Illumos issue and patch: - https://www.illumos.org/issues/764 - https://github.com/illumos/illumos-gate/commit/3f2366c2bb Reviewed by: George Wilson <gwilson@zfsmail.com> Reviewed by: Mark.Maybe@oracle.com Reviewed by: Albert Lee <trisk@nexenta.com Approved by: Garrett D'Amore <garrett@nexenta.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #340	2011-08-01 12:09:11 -07:00
Tim Haley	7b8518cb8d	Illumos #xxx: zdb -vvv broken after zfs diff integration References to Illumos issue and patch: - https://github.com/illumos/illumos-gate/commit/163eb7ff Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #340	2011-08-01 12:09:02 -07:00
Brian Behlendorf	beb9826902	Fix txg_sync_thread deadlock Update two kmem_alloc()'s in dbuf_dirty() to use KM_PUSHPAGE. Because these functions are called from txg_sync_thread we must ensure they don't reenter the zfs filesystem code via the .writepage callback. This would result in a deadlock. This deadlock is rare and has only been observed once under an abusive mmap() write workload. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-07-22 15:24:57 -07:00
Brian Behlendorf	22872ff5da	Use zfs_mknode() to create dataset root Long, long, long ago when the effort to port ZFS was begun the zfs_create_fs() function was heavily modified to remove all of its VFS dependencies. This allowed Lustre to use the dataset without us having to spend the time porting all the required VFS code. Fast-forward several years and we now have all the VFS code in place but are still relying on the modified zfs_create_fs(). This isn't required anymore and we can now use zfs_mknode() to create the root znode for the filesystem. This commit reverts the contents of zfs_create_fs() to largely match the upstream OpenSolaris code. There have been minor modifications to accomidate the Linux VFS but that is all. This code fixes issue #116 by bootstraping enough of the VFS data structures so we can rely on zfs_mknode() to create the root directory. This ensures it is created properly with support for system attributes. Previously it wasn't which is why it behaved differently that all other directories when modified. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #116	2011-07-20 19:52:26 -07:00
Brian Behlendorf	9fd91daeef	Honor setgit bit on directories Newly created files were always being created with the fsuid/fsgid in the current users credentials. This is correct except in the case when the parent directory sets the 'setgit' bit. In this case according to posix the newly created file/directory should inherit the gid of the parent directory. Additionally, in the case of a subdirectory it should also inherit the 'setgit' bit. Finally, this commit performs a little cleanup of the vattr_t initialization by moving it to a common helper function. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #262	2011-07-20 14:07:13 -07:00
Brian Behlendorf	fe0ed8f910	Fix 'make install' overly broad 'rm' When running 'make install' without DESTDIR set the module install rules would mistakenly destroy the 'modules.*' files for ALL of your installed kernels. This could lead to a non-functional system for the alternate kernels because 'depmod -a' will only be run for the kernel which was compiled against. This issue would not impact anyone using the 'make <deb\|rpm\|pkg>' build targets to build and install packages. The fix for this issue is to only remove extraneous build products when DESTDIR is set. This almost exclusively indicates we are building packages and installed the build products in to a temporary staging location. Additionally, limit the removal the unneeded build products to the target kernel version. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #328	2011-07-20 09:38:51 -07:00
Brian Behlendorf	cfc9a5c88f	Fix zpl_writepage() deadlock Disable the normal reclaim path for zpl_putpage(). This ensures that all memory allocations under this call path will never enter direct reclaim. If this were to happen the VM might try to write out additional pages by calling zpl_putpage() again resulting in a deadlock. This sitution is typically handled in Linux by marking each offending allocation GFP_NOFS. However, since much of the code used is common it makes more sense to use PF_MEMALLOC to flag the entire call tree. Alternately, the code could be updated to pass the needed allocation flags but that's a more invasive change. The following example of the above described deadlock was triggered by test 074 in the xfstest suite. Call Trace: [<ffffffff814dcdb2>] down_write+0x32/0x40 [<ffffffffa05af6e4>] dnode_new_blkid+0x94/0x2d0 [zfs] [<ffffffffa0597d66>] dbuf_dirty+0x556/0x750 [zfs] [<ffffffffa05987d1>] dmu_buf_will_dirty+0x81/0xd0 [zfs] [<ffffffffa059ee70>] dmu_write+0x90/0x170 [zfs] [<ffffffffa0611afe>] zfs_putpage+0x2ce/0x360 [zfs] [<ffffffffa062875e>] zpl_putpage+0x1e/0x60 [zfs] [<ffffffffa06287b2>] zpl_writepage+0x12/0x20 [zfs] [<ffffffff8115f907>] writeout+0xa7/0xd0 [<ffffffff8115fa6b>] move_to_new_page+0x13b/0x170 [<ffffffff8115fed4>] migrate_pages+0x434/0x4c0 [<ffffffff811559ab>] compact_zone+0x4fb/0x780 [<ffffffff81155ed1>] compact_zone_order+0xa1/0xe0 [<ffffffff8115602c>] try_to_compact_pages+0x11c/0x190 [<ffffffff811200bb>] __alloc_pages_nodemask+0x5eb/0x8b0 [<ffffffff8115464a>] alloc_pages_current+0xaa/0x110 [<ffffffff8111e36e>] __get_free_pages+0xe/0x50 [<ffffffffa03f0e2f>] kv_alloc+0x3f/0xb0 [spl] [<ffffffffa03f11d9>] spl_kmem_cache_alloc+0x339/0x660 [spl] [<ffffffffa05950b3>] dbuf_create+0x43/0x370 [zfs] [<ffffffffa0596fb1>] __dbuf_hold_impl+0x241/0x480 [zfs] [<ffffffffa0597276>] dbuf_hold_impl+0x86/0xc0 [zfs] [<ffffffffa05977ff>] dbuf_hold_level+0x1f/0x30 [zfs] [<ffffffffa05a9dde>] dmu_tx_check_ioerr+0x4e/0x110 [zfs] [<ffffffffa05aa1f9>] dmu_tx_count_write+0x359/0x6f0 [zfs] [<ffffffffa05aa5df>] dmu_tx_hold_write+0x4f/0x70 [zfs] [<ffffffffa0611a6d>] zfs_putpage+0x23d/0x360 [zfs] [<ffffffffa062875e>] zpl_putpage+0x1e/0x60 [zfs] [<ffffffff811221f9>] write_cache_pages+0x1c9/0x4a0 [<ffffffffa0628738>] zpl_writepages+0x18/0x20 [zfs] [<ffffffff81122521>] do_writepages+0x21/0x40 [<ffffffff8119bbbd>] writeback_single_inode+0xdd/0x2c0 [<ffffffff8119bfbe>] writeback_sb_inodes+0xce/0x180 [<ffffffff8119c11b>] writeback_inodes_wb+0xab/0x1b0 [<ffffffff8119c4bb>] wb_writeback+0x29b/0x3f0 [<ffffffff8119c6cb>] wb_do_writeback+0xbb/0x240 [<ffffffff811308ea>] bdi_forker_task+0x6a/0x310 [<ffffffff8108ddf6>] kthread+0x96/0xa0 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #327	2011-07-19 16:17:01 -07:00
Brian Behlendorf	abd39a8289	Fix zio_execute() deadlock To avoid deadlocking the system it is crucial that all memory allocations performed in the zio_execute() call path are marked KM_PUSHPAGE (GFP_NOFS). This ensures that while a z_wr_iss thread is processing the syncing transaction group it does not re-enter the filesystem code and deadlock on itself. Call Trace: [<ffffffffa02580e8>] cv_wait_common+0x78/0xe0 [spl] [<ffffffffa0347bab>] txg_wait_open+0x7b/0xa0 [zfs] [<ffffffffa030e73d>] dmu_tx_wait+0xed/0xf0 [zfs] [<ffffffffa0376a49>] zfs_putpage+0x219/0x360 [zfs] [<ffffffffa038d75e>] zpl_putpage+0x1e/0x60 [zfs] [<ffffffffa038d7b2>] zpl_writepage+0x12/0x20 [zfs] [<ffffffff8115f907>] writeout+0xa7/0xd0 [<ffffffff8115fa6b>] move_to_new_page+0x13b/0x170 [<ffffffff8115fed4>] migrate_pages+0x434/0x4c0 [<ffffffff811559ab>] compact_zone+0x4fb/0x780 [<ffffffff81155ed1>] compact_zone_order+0xa1/0xe0 [<ffffffff8115602c>] try_to_compact_pages+0x11c/0x190 [<ffffffff811200bb>] __alloc_pages_nodemask+0x5eb/0x8b0 [<ffffffff81159932>] kmem_getpages+0x62/0x170 [<ffffffff8115a54a>] fallback_alloc+0x1ba/0x270 [<ffffffff8115a2c9>] ____cache_alloc_node+0x99/0x160 [<ffffffff8115b059>] __kmalloc+0x189/0x220 [<ffffffffa02539fb>] kmem_alloc_debug+0xeb/0x130 [spl] [<ffffffffa031454a>] dnode_hold_impl+0x46a/0x550 [zfs] [<ffffffffa0314649>] dnode_hold+0x19/0x20 [zfs] [<ffffffffa03042e3>] dmu_read+0x33/0x180 [zfs] [<ffffffffa034729d>] space_map_load+0xfd/0x320 [zfs] [<ffffffffa03300bc>] metaslab_activate+0x10c/0x170 [zfs] [<ffffffffa0330ad9>] metaslab_alloc+0x469/0x800 [zfs] [<ffffffffa038963c>] zio_dva_allocate+0x6c/0x2f0 [zfs] [<ffffffffa038a249>] zio_execute+0x99/0xf0 [zfs] [<ffffffffa0254b1c>] taskq_thread+0x1cc/0x330 [spl] [<ffffffff8108ddf6>] kthread+0x96/0xa0 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #291	2011-07-19 11:55:42 -07:00
Brian Behlendorf	a140dc5469	Fix mmap(2)/write(2)/read(2) deadlock When modifing overlapping regions of a file using mmap(2) and write(2)/read(2) it is possible to deadlock due to a lock inversion. The zfs_write() and zfs_read() hooks first take the zfs range lock and then lock the individual pages. Conversely, when using mmap'ed I/O the zpl_writepage() hook is called with the individual page locks already taken and then zfs_putpage() takes the zfs range lock. The most straight forward fix is to simply not take the zfs range lock in the mmap(2) case. The individual pages will still be locked thus serializing access. Updating the same region of a file with write(2) and mmap(2) has always been a dodgy thing to do. This change at a minimum ensures we don't deadlock and is consistent with the existing Linux semantics enforced by the VFS. This isn't an issue under Solaris because the only range locking performed will be with the zfs range locks. It's up to each filesystem to perform its own file locking. Under Linux the VFS provides many of these services. It may be possible/desirable at a latter date to entirely dump the existing zfs range locking and rely on the Linux VFS page locks. However, for now its safest to perform both layers of locking until zfs is more tightly integrated with the page cache. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #302	2011-07-19 11:55:42 -07:00
Brian Behlendorf	61f218b090	Fix send/recv 'dataset is busy' errors This commit fixes a regression which was accidentally introduced by the Linux 2.6.39 compatibility chanages. As part of these changes instead of holding an active reference on the namepsace (which is no longer posible) a reference is taken on the super block. This reference ensures the super block remains valid while it is in use. To handle the unlikely race condition of the filesystem being unmounted concurrently with the start of a 'zfs send/recv' the code was updated to only take the super block reference when there was an existing reference. This indicates that the filesystem is active and in use. Unfortunately, in the 'zfs recv' case this is not the case. The newly created dataset will not have a super block without an active reference which results in the 'dataset is busy' error. The most straight forward fix for this is to simply update the code to always take the reference even when it's zero. This may expose us to very very unlikely concurrent umount/send/recv case but the consequences of that are minor. Closes #319	2011-07-15 16:37:19 -07:00
Brian Behlendorf	057e8eee35	Improve fstat(2) performance There is at most a factor of 3x performance improvement to be had by using the Linux generic_fillattr() helper. However, to use it safely we need to ensure the values in a cached inode are kept rigerously up to date. Unfortunately, this isn't the case for the blksize, blocks, and atime fields. At the moment the authoritative values are still stored in the znode. This patch introduces an optimized zfs_getattr_fast() call. The idea is to use the up to date values from the inode and the blksize, block, and atime fields from the znode. At some latter date we should be able to strictly use the inode values and further improve performance. The remaining overhead in the zfs_getattr_fast() call can be attributed to having to take the znode mutex. This overhead is unavoidable until the inode is kept strictly up to date. The the careful reader will notice the we do not use the customary ZFS_ENTER()/ZFS_EXIT() macros. These macro's are designed to ensure the filesystem is not torn down in the middle of an operation. However, in this case the VFS is holding a reference on the active inode so we know this is impossible. =================== Performance Tests ======================== This test calls the fstat(2) system call 10,000,000 times on an open file description in a tight loop. The test results show the zfs stat(2) performance is now only 22% slower than ext4. This is a 2.5x improvement and there is a clear long term plan to get to parity with ext4. filesystem \| test-1 test-2 test-3 \| average \| times-ext4 --------------+-------------------------+---------+----------- ext4 \| 7.785s 7.899s 7.284s \| 7.656s \| 1.000x zfs-0.6.0-rc4 \| 24.052s 22.531s 23.857s \| 23.480s \| 3.066x zfs-faststat \| 9.224s 9.398s 9.485s \| 9.369s \| 1.223x The second test is to run 'du' of a copy of the /usr tree which contains 110514 files. The test is run multiple times both using both a cold cache (/proc/sys/vm/drop_caches) and a hot cache. As expected this change signigicantly improved the zfs hot cache performance and doesn't quite bring zfs to parity with ext4. A little surprisingly the zfs cold cache performance is better than ext4. This can probably be attributed to the zfs allocation policy of co-locating all the meta data on disk which minimizes seek times. By default the ext4 allocator will spread the data over the entire disk only co-locating each directory. filesystem \| cold \| hot --------------+---------+-------- ext4 \| 13.318s \| 1.040s zfs-0.6.0-rc4 \| 4.982s \| 1.762s zfs-faststat \| 4.933s \| 1.345s	2011-07-11 09:11:22 -07:00
Brian Behlendorf	abd8610cd5	Add L2ARC tunables The performance of the L2ARC can be tweaked by a number of tunables, which may be necessary for different workloads: l2arc_write_max max write bytes per interval l2arc_write_boost extra write bytes during device warmup l2arc_noprefetch skip caching prefetched buffers l2arc_headroom number of max device writes to precache l2arc_feed_secs seconds between L2ARC writing l2arc_feed_min_ms min feed interval in milliseconds l2arc_feed_again turbo L2ARC warmup l2arc_norw no reads during writes Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #316	2011-07-08 12:44:11 -07:00
Gunnar Beutner	3c9609b322	Renamed HAVE_SHARE ifdefs to HAVE_SMB_SHARE. The remaining code that is guarded by HAVE_SHARE ifdefs is related to the .zfs/shares functionality which is currently not available on Linux. On Solaris the .zfs/shares directory can be used to set permissions for SMB shares. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-07-06 09:20:28 -07:00
Gunnar Beutner	46e18b3f0f	Implemented sharing datasets via NFS using libshare. The sharenfs and sharesmb properties depend on the libshare library to export datasets via NFS and SMB. This commit implements the base libshare functionality as well as support for managing NFS shares. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-07-06 09:20:28 -07:00
Brian Behlendorf	285226eff3	Always allow non-user xattrs Under Linux you may only disable USER xattrs. The SECURITY, SYSTEM, and TRUSTED xattr namespaces must always be available if xattrs are supported by the filesystem. The enforcement of USER xattrs is performed in the zpl_xattr_user_* handlers. Under Solaris there is only a single xattr namespace which is managed globally.	2011-07-01 13:39:48 -07:00
Rohan Puri	a89c3e0bd5	Support mandatory locks (nbmand) The Linux kernel already has support for mandatory locking. This change just replaces the Solaris mandatory locking calls with the Linux equivilants. In fact, it looks like this code could be removed entirely because this checking is already done generically in the Linux VFS. However, for now we'll leave it in place even if it is redundant just in case we missed something. The original patch to update the code to support mandatory locking was done by Rohan Puri. This patch is an updated version which is compatible with the previous mount option handling changes. Original-Patch-by: Rohan Puri <rohan.puri15@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #222 Closes #253	2011-07-01 13:39:40 -07:00
Brian Behlendorf	2cf7f52bc4	Linux compat 2.6.39: mount_nodev() The .get_sb callback has been replaced by a .mount callback in the file_system_type structure. When using the new interface the caller must now use the mount_nodev() helper. Unfortunately, the new interface no longer passes the vfsmount down to the zfs layers. This poses a problem for the existing implementation because we currently save this pointer in the super block for latter use. It provides our only entry point in to the namespace layer for manipulating certain mount options. This needed to be done originally to allow commands like 'zfs set atime=off tank' to work properly. It also allowed me to keep more of the original Solaris code unmodified. Under Solaris there is a 1-to-1 mapping between a mount point and a file system so this is a fairly natural thing to do. However, under Linux they many be multiple entries in the namespace which reference the same filesystem. Thus keeping a back reference from the filesystem to the namespace is complicated. Rather than introduce some ugly hack to get the vfsmount and continue as before. I'm leveraging this API change to update the ZFS code to do things in a more natural way for Linux. This has the upside that is resolves the compatibility issue for the long term and fixes several other minor bugs which have been reported. This commit updates the code to remove this vfsmount back reference entirely. All modifications to filesystem mount options are now passed in to the kernel via a '-o remount'. This is the expected Linux mechanism and allows the namespace to properly handle any options which apply to it before passing them on to the file system itself. Aside from fixing the compatibility issue, removing the vfsmount has had the benefit of simplifying the code. This change which fairly involved has turned out nicely. Closes #246 Closes #217 Closes #187 Closes #248 Closes #231	2011-07-01 13:36:39 -07:00
Brian Behlendorf	5c03efc379	Linux compat 2.6.39: security_inode_init_security() The security_inode_init_security() function now takes an additional qstr argument which must be passed in from the dentry if available. Passing a NULL is safe when no qstr is available the relevant security checks will just be skipped. Closes #246 Closes #217 Closes #187	2011-07-01 12:40:08 -07:00
Brian Behlendorf	e2e7aa2df8	Add ZFS specific mmap() checks Under Linux the VFS handles virtually all of the mmap() access checks. Filesystem specific checks are left to be handled in the .mmap() hook and normally there arn't any. However, ZFS provides a few attributes which can influence the mmap behavior and should be honored. Note, currently the code to modify these attributes has not been implemented under Linux. * ZFS_IMMUTABLE \| ZFS_READONLY \| ZFS_APPENDONLY: when any of these attributes are set a file may not be mmaped with write access. * ZFS_AV_QUARANTINED: when set a file file may not be mmaped with read or exec access. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-07-01 12:23:46 -07:00
Brian Behlendorf	f0b2486034	Remove unused MMAP functions The following functions were required for the OpenSolaris mmap implementation. Because the Linux VFS does most the most heavy lifting for us they are not required and are being removed to keep the code clean and easy to understand. * zfs_null_putapage() * zfs_frlock() * zfs_no_putpage() Signed-off-by: Brian Behlendorf <behlendorf@llnl.gov>	2011-07-01 12:22:57 -07:00
Prasad Joshi	dde471ef5a	MMAP Optimization Enable zfs_getpage, zfs_fillpage, zfs_putpage, zfs_putapage functions. The functions have been modified to make them Linux friendly. ZFS uses these functions to read/write the mmapped pages. Using them from readpage/writepage results in clear code. The patch also adds readpages and writepages interface functions to read/write list of pages in one function call. The code change handles the first mmap optimization mentioned on https://github.com/behlendorf/zfs/issues/225 Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com> Signed-off-by: Brian Behlendorf <behlendorf@llnl.gov> Issue #255	2011-07-01 12:22:52 -07:00
Prasad Joshi	218b8eafbd	Use truncate_setsize in zfs_setattr According to Linux kernel commit 2c27c65e, using truncate_setsize in setattr simplifies the code. Therefore, the patch replaces the call to vmtruncate() with truncate_setsize(). zfs_setattr uses zfs_freesp to free the disk space belonging to the file. As truncate_setsize may release the page cache and flushing the dirty data to disk, it must be called before the zfs_freesp. Suggested-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com> Closes #255	2011-06-27 09:59:52 -07:00
Prasad Joshi	b312979252	Tear down and flush the mmap region The inode eviction should unmap the pages associated with the inode. These pages should also be flushed to disk to avoid the data loss. Therefore, use truncate_setsize() in evict_inode() to release the pagecache. The API truncate_setsize() was added in 2.6.35 kernel. To ensure compatibility with the old kernel, the patch defines its own truncate_setsize function. Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com> Closes #255	2011-06-27 09:59:19 -07:00
Brian Behlendorf	7e7baecaa3	Linux 3.0 compat, shrinker compatibility To accomindate the updated Linux 3.0 shrinker API the spl shrinker compatibility code was updated. Unfortunately, this couldn't be done cleanly without slightly adjusting the comapt API. See spl commit `a55bcaad18`. This commit updates the ZFS code to use the slightly modified API. You must use the latest SPL if your building ZFS.	2011-06-21 14:36:39 -07:00
Gunnar Beutner	b00131d43c	Fix unlink/xattr deadlock The problem here is that prune_icache() tries to evict/delete both the xattr directory inode as well as at least one xattr inode contained in that directory. Here's what happens: 1. File is created. 2. xattr is created for that file (behind the scenes a xattr directory and a file in that xattr directory are created) 3. File is deleted. 4. Both the xattr directory inode and at least one xattr inode from that directory are evicted by prune_icache(); prune_icache() acquires a lock on both inodes before it calls ->evict() on the inodes When the xattr directory inode is evicted zfs_zinactive attempts to delete the xattr files contained in that directory. While enumerating these files zfs_zget() is called to obtain a reference to the xattr file znode - which tries to lock the xattr inode. However that very same xattr inode was already locked by prune_icache() further up the call stack, thus leading to a deadlock. This can be reliably reproduced like this: $ touch test $ attr -s a -V b test $ rm test $ echo 3 > /proc/sys/vm/drop_caches This patch fixes the deadlock by moving the zfs_purgedir() call to zfs_unlinked_drain(). Instead zfs_rmnode() now checks whether the xattr dir is empty and leaves the xattr dir in the unlinked set if it finds any xattrs. To ensure zfs_unlinked_drain() never accesses a stale super block zfsvfs_teardown() has been update to block until the iput taskq has been drained. This avoids a potential race where a file with an xattr directory is removed and the file system is immediately unmounted. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #266	2011-06-20 13:47:03 -07:00
Gunnar Beutner	6f0cf71e0d	Removed erroneous zfs_inode_destroy() calls from zfs_rmnode(). iput_final() already calls zpl_inode_destroy() -> zfs_inode_destroy() for us after zfs_zinactive(), thus making sure that the inode is properly cleaned up. The zfs_inode_destroy() calls in zfs_rmnode() would lead to a double-free. Fixes #282	2011-06-20 10:30:17 -07:00
Christian Kohlschütter	df30f56639	Add "ashift" property to zpool create Some disks with internal sectors larger than 512 bytes (e.g., 4k) can suffer from bad write performance when ashift is not configured correctly. This is caused by the disk not reporting its actual sector size, but a sector size of 512 bytes. The drive may behave this way for compatibility reasons. For example, the WDC WD20EARS disks are known to exhibit this behavior. When creating a zpool, ZFS takes that wrong sector size and sets the "ashift" property accordingly (to 9: 1<<9=512), whereas it should be set to 12 for 4k sectors (1<<12=4096). This patch allows an adminstrator to manual specify the known correct ashift size at 'zpool create' time. This can significantly improve performance in certain cases. However, it will have an impact on your total pool capacity. See the updated ashift property description in the zpool.8 man page for additional details. Valid values for the ashift property range from 9 to 17 (512B-128KB). Additionally, you may set the ashift to 0 if you wish to auto-detect the sector size based on what the disk reports, this is the default behavior. The most common ashift values are 9 and 12. Example: zpool create -o ashift=12 tank raidz2 sda sdb sdc sdd Closes #280 Original-patch-by: Richard Laager <rlaager@wiktel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-06-17 16:35:49 -07:00
Brian Behlendorf	96801d2906	Linux 2.6.37 compat, WRITE_FLUSH_FUA The WRITE_FLUSH, WRITE_FUA, and WRITE_FLUSH_FUA flags have been introduced as a replacement for WRITE_BARRIER. This was done to allow richer semantics to be expressed to the block layer. It is the block layers responsibility to choose the correct way to implement these semantics. This change simply updates the bio's to use the new kernel API which should be absolutely safe. However, since ZFS depends entirely on this working as designed for correctness we do want to be careful. Closes #281	2011-06-17 14:37:26 -07:00
Brian Behlendorf	e95b3bdcbb	Fix stack ddt_class_contains() Stack usage for ddt_class_contains() reduced from 524 bytes to 68 bytes. This large stack allocation significantly contributed to the likelyhood of a stack overflow when scrubbing/resilvering dedup pools.	2011-05-31 12:17:27 -07:00
Brian Behlendorf	5b8c7bbcea	Fix stack ddt_zap_lookup() Stack usage for ddt_zap_lookup() reduced from 368 bytes to 120 bytes. This large stack allocation significantly contributed to the likelyhood of a stack overflow when scrubbing/resilvering dedup pools.	2011-05-31 12:17:27 -07:00
Brian Behlendorf	c7f8f831a4	Revert "Fix stack traverse_visitbp()" This abomination is no longer required because the zio's issued during this recursive call path will now be handled asynchronously by the taskq thread pool. This reverts commit `6656bf5621`.	2011-05-31 12:17:27 -07:00
Brian Behlendorf	2fac4c2a74	Make tgx_sync_thread zio's async The majority of the recursive operations performed by the dsl are done either in the context of the tgx_sync_thread or during pool import. It is these recursive operations which contribute greatly to the stack depth. When this recursion is coupled with a synchronous I/O in the same context overflow becomes possible. Previously to handle this case I have focused on keeping the individual stack frames as light as possible. This is a good idea as long as it can be done in a way which doesn't overly complicate the code. However, there is a better solution. If we treat all zio's issued by the tgx_sync_thread as async then we can use the tgx_sync_thread stack for the recursive parts, and the zio_* threads for the I/O parts. This effectively doubles our available stack space with the only drawback being a small delay to schedule the I/O. However, in practice the scheduling time is so much smaller than the actual I/O time this isn't an issue. Another benefit of making the zio async is that the zio pipeline is now parallel. That should mean for CPU intensive pipelines such as compression or dedup performance may be improved. With this change in place the worst case stack usage observed so far is 6902 bytes. This is still higher than I'd like but significantly improved. Additional changes to specific functions should improve this further. This change allows us to revent commit `6656bf5` which did some horrible things to the recursive traverse_visitbp() callpath in the name of saving stack.	2011-05-31 12:17:27 -07:00
Brian Behlendorf	f74fae8b30	Fix 4K sector support Yesterday I ran across a 3TB drive which exposed 4K sectors to Linux. While I thought I had gotten this support correct it turns out there were 2 subtle bugs which prevented it from working. sudo ./cmd/zpool/zpool create -f large-sector /dev/sda cannot create 'large-sector': one or more devices is currently unavailable 1) The first issue was that it was possible that bdev_capacity() would return the number of 512 byte sectors rather than the number of 4096 sectors. Internally, certain Linux functions only operate with 512 byte sectors so you need to be careful. To avoid any confusion in the future I've updated bdev_capacity() to simply return the device (or partition) capacity in bytes. The higher levels of ZFS want the value in bytes anyway so this is cleaner. 2) When creating a bio the ->bi_sector count must always be expressed in 512 byte sectors. The existing code would scale the byte offset by the logical sector size. Until now this was always 512 so it never caused problems. Trying a 4K sector drive clearly exposed the issue. The problem has been fixed by hard coding the 512 byte sector which is exactly what the bio code does internally. With these changes I'm now able to create ZFS pools using 4K sector drives. No issues were observed during fairly extensive testing. This is also a low risk change if your using 512b sectors devices because none of the logic changes. Closes #256	2011-05-27 11:38:53 -07:00
Brian Behlendorf	2b8cad6159	Use vmem_alloc() for zfs_ioc_userspace_many() The default buffer size when requesting multiple quota entries is 100 times the zfs_useracct_t size. In practice this works out to exactly 27200 bytes. Since this will be a short lived buffer in a non-performance critical path it is preferable to vmem_alloc() the needed memory.	2011-05-20 14:23:18 -07:00
Brian Behlendorf	f01b360e67	Pass caller's credential in zfsdev_ioctl() Initially when zfsdev_ioctl() was ported to Linux we didn't have any credential support implemented. So at the time we simply passed NULL which wasn't much of a problem since most of the secpolicy code was disabled. However, one exception is quota handling which does require the credential. Now that proper credentials are supported we can safely start passing the callers credential. This is also an initial step towards fully implemented the zfs secpolicy.	2011-05-20 10:12:25 -07:00
Brian Behlendorf	3fd70ee6b0	Fix 'negative objects to delete' warning Normally when the arc_shrinker_func() function is called the return value should be: >=0 - To indicate the number of freeable objects in the cache, or -1 - To indicate this cache should be skipped However, when the shrinker callback is called with 'nr_to_scan' equal to zero. The caller simply wants the number of freeable objects in the cache and we must never return -1. This patch reorders the first two conditionals in arc_shrinker_func() to ensure this behavior. This patch also now explictly casts arc_size and arc_c_min to signed int64_t types so MAX(x, 0) works as expected. As unsigned types we would never see an negative value which defeated the purpose of the MAX() lower bound and broke the shrinker logic. Finally, when nr_to_scan is non-zero we explictly prevent all reclaim below arc_c_min. This is done to prevent the Linux page cache from completely crowding out the ARC. This limit is tunable and some experimentation is likely going to be required to set it exactly right. For now we're sticking with the OpenSolaris defaults. Closes #218 Closes #243	2011-05-18 10:29:22 -07:00
Brian Behlendorf	e814770f2e	Update synchronous open zfs_close() comment The comment in zfs_close() pertaining to decrementing the synchronous open count needs to be updated for Linux. The code was already updated to be correct, but the comment was missed and is now misleading. Under Linux the zfs_close() hook is only called once when the final reference is dropped. This differs from Solaris where zfs_close() is called for each close. Closes #237	2011-05-13 08:20:06 -07:00
Brian Behlendorf	c91d229809	Merge pull request #235 from nedbass/rdev Don't store rdev in SA for FIFOs and sockets	2011-05-09 16:41:28 -07:00
Ned A. Bass	aa6d8c1086	Don't store rdev in SA for FIFOs and sockets Update the handling of named pipes and sockets to be consistent with other platforms with regard to the rdev attribute. While all ZFS ipmlementations store the rdev for device files in a system attribute (SA), this is not the case for FIFOs and sockets. Indeed, Linux always passes rdev=0 to mknod() for FIFOs and sockets, so the value is not needed. Add an ASSERT that rdev==0 for FIFOs and sockets to detect if the expected behavior ever changes. Closes #216	2011-05-09 13:35:07 -07:00
Brian Behlendorf	21ade34764	Disable direct reclaim for z_wr_* threads The direct reclaim path in the z_wr_* threads must be disabled to ensure forward progress is always maintained for txg processing. This ensures that a txg will never get stuck waiting on itself because it entered the following memory reclaim callpath. ->prune_icache()->dispose_list()->zpl_clear_inode()->zfs_inactive() ->dmu_tx_assign()->dmu_tx_wait()->tgx_wait_open() It would be preferable to target this exact code path but the kernel offers no way to do this without custom patches. To avoid this we are forced to disable all reclaim for these threads. It should not be necessary to do this for other other z_* threads because they will not hold a txg open. Closes #232	2011-05-06 15:26:26 -07:00
Brian Behlendorf	3117dd0b90	Handle NULL in nfsd .fsync() hook How nfsd handles .fsync() has been changed a couple of times in the recent kernels. But basically there are three cases we need to consider. Linux 2.6.12 - 2.6.33 * The .fsync() hook takes 3 arguments * The nfsd will call .fsync() with a NULL file struct pointer. Linux 2.6.34 * The .fsync() hook takes 3 arguments * The nfsd no longer calls .fsync() but instead used sync_inode() Linux 2.6.35 - 2.6.x * The .fsync() hook takes 2 arguments * The nfsd no longer calls .fsync() but instead used sync_inode() For once it looks like we've gotten lucky. The first two cases can actually be collased in to one if we stop using the file struct pointer entirely. Since the dentry is still passed in both cases this is possible. The last case can then be safely handled by unconditionally using the dentry in the file struct pointer now that we know the nfsd caller has been removed. Closes #230	2011-05-06 12:33:45 -07:00
Brian Behlendorf	34b84cb831	Use vmem_alloc() for zfs_ioc_pool_get_history() The default buffer size when requesting history is 128k. This is far to large for a kmem_alloc() so instead use the slower vmem_alloc(). This path has no performance concerns and the buffer is immediately free'd after its contents are copied to the user space buffer.	2011-05-06 09:59:52 -07:00
Brian Behlendorf	c409e4647f	Add missing ZFS tunables This commit adds module options for all existing zfs tunables. Ideally the average user should never need to modify any of these values. However, in practice sometimes you do need to tweak these values for one reason or another. In those cases it's nice not to have to resort to rebuilding from source. All tunables are visable to modinfo and the list is as follows: $ modinfo module/zfs/zfs.ko filename: module/zfs/zfs.ko license: CDDL author: Sun Microsystems/Oracle, Lawrence Livermore National Laboratory description: ZFS srcversion: 8EAB1D71DACE05B5AA61567 depends: spl,znvpair,zcommon,zunicode,zavl vermagic: 2.6.32-131.0.5.el6.x86_64 SMP mod_unload modversions parm: zvol_major:Major number for zvol device (uint) parm: zvol_threads:Number of threads for zvol device (uint) parm: zio_injection_enabled:Enable fault injection (int) parm: zio_bulk_flags:Additional flags to pass to bulk buffers (int) parm: zio_delay_max:Max zio millisec delay before posting event (int) parm: zio_requeue_io_start_cut_in_line:Prioritize requeued I/O (bool) parm: zil_replay_disable:Disable intent logging replay (int) parm: zfs_nocacheflush:Disable cache flushes (bool) parm: zfs_read_chunk_size:Bytes to read per chunk (long) parm: zfs_vdev_max_pending:Max pending per-vdev I/Os (int) parm: zfs_vdev_min_pending:Min pending per-vdev I/Os (int) parm: zfs_vdev_aggregation_limit:Max vdev I/O aggregation size (int) parm: zfs_vdev_time_shift:Deadline time shift for vdev I/O (int) parm: zfs_vdev_ramp_rate:Exponential I/O issue ramp-up rate (int) parm: zfs_vdev_read_gap_limit:Aggregate read I/O over gap (int) parm: zfs_vdev_write_gap_limit:Aggregate write I/O over gap (int) parm: zfs_vdev_scheduler:I/O scheduler (charp) parm: zfs_vdev_cache_max:Inflate reads small than max (int) parm: zfs_vdev_cache_size:Total size of the per-disk cache (int) parm: zfs_vdev_cache_bshift:Shift size to inflate reads too (int) parm: zfs_scrub_limit:Max scrub/resilver I/O per leaf vdev (int) parm: zfs_recover:Set to attempt to recover from fatal errors (int) parm: spa_config_path:SPA config file (/etc/zfs/zpool.cache) (charp) parm: zfs_zevent_len_max:Max event queue length (int) parm: zfs_zevent_cols:Max event column width (int) parm: zfs_zevent_console:Log events to the console (int) parm: zfs_top_maxinflight:Max I/Os per top-level (int) parm: zfs_resilver_delay:Number of ticks to delay resilver (int) parm: zfs_scrub_delay:Number of ticks to delay scrub (int) parm: zfs_scan_idle:Idle window in clock ticks (int) parm: zfs_scan_min_time_ms:Min millisecs to scrub per txg (int) parm: zfs_free_min_time_ms:Min millisecs to free per txg (int) parm: zfs_resilver_min_time_ms:Min millisecs to resilver per txg (int) parm: zfs_no_scrub_io:Set to disable scrub I/O (bool) parm: zfs_no_scrub_prefetch:Set to disable scrub prefetching (bool) parm: zfs_txg_timeout:Max seconds worth of delta per txg (int) parm: zfs_no_write_throttle:Disable write throttling (int) parm: zfs_write_limit_shift:log2(fraction of memory) per txg (int) parm: zfs_txg_synctime_ms:Target milliseconds between tgx sync (int) parm: zfs_write_limit_min:Min tgx write limit (ulong) parm: zfs_write_limit_max:Max tgx write limit (ulong) parm: zfs_write_limit_inflated:Inflated tgx write limit (ulong) parm: zfs_write_limit_override:Override tgx write limit (ulong) parm: zfs_prefetch_disable:Disable all ZFS prefetching (int) parm: zfetch_max_streams:Max number of streams per zfetch (uint) parm: zfetch_min_sec_reap:Min time before stream reclaim (uint) parm: zfetch_block_cap:Max number of blocks to fetch at a time (uint) parm: zfetch_array_rd_sz:Number of bytes in a array_read (ulong) parm: zfs_pd_blks_max:Max number of blocks to prefetch (int) parm: zfs_dedup_prefetch:Enable prefetching dedup-ed blks (int) parm: zfs_arc_min:Min arc size (ulong) parm: zfs_arc_max:Max arc size (ulong) parm: zfs_arc_meta_limit:Meta limit for arc size (ulong) parm: zfs_arc_reduce_dnlc_percent:Meta reclaim percentage (int) parm: zfs_arc_grow_retry:Seconds before growing arc size (int) parm: zfs_arc_shrink_shift:log2(fraction of arc to reclaim) (int) parm: zfs_arc_p_min_shift:arc_c shift to calc min/max arc_p (int)	2011-05-04 10:02:37 -07:00
Brian Behlendorf	5f35b19007	Fully update inode when created When a new znode/inode pair is created both the znode and the inode should be immediately updated to the correct values. This was done for the znode and for most of the values in the inode, but not all of them. This normally wasn't a problem because most subsequent operations would cause the inode to be immediately updated. This change ensures the inode is now fully updated before it is inserted in to the inode hash. Closes #116 Closes #146 Closes #164	2011-05-02 14:04:19 -07:00
Brian Behlendorf	df554c148e	Fix 'zfs set volsize=N pool/dataset' This change fixes a kernel panic which would occur when resizing a dataset which was not open. The objset_t stored in the zvol_state_t will be set to NULL when the block device is closed. To avoid this issue we pass the correct objset_t as the third arg. The code has also been updated to correctly notify the kernel when the block device capacity changes. For 2.6.28 and newer kernels the capacity change will be immediately detected. For earlier kernels the capacity change will be detected when the device is next opened. This is a known limitation of older kernels. Online ext3 resize test case passes on 2.6.28+ kernels: $ dd if=/dev/zero of=/tmp/zvol bs=1M count=1 seek=1023 $ zpool create tank /tmp/zvol $ zfs create -V 500M tank/zd0 $ mkfs.ext3 /dev/zd0 $ mkdir /mnt/zd0 $ mount /dev/zd0 /mnt/zd0 $ df -h /mnt/zd0 $ zfs set volsize=800M tank/zd0 $ resize2fs /dev/zd0 $ df -h /mnt/zd0 Original-patch-by: Fajar A. Nugraha <github@fajar.net> Closes #68 Closes #84	2011-05-02 08:54:40 -07:00
Gunnar Beutner	055656d4f4	Implemented NFS export_operations. Implemented the required NFS operations for exporting ZFS datasets using the in-kernel NFS daemon.	2011-04-29 12:36:13 -07:00
Brian Behlendorf	5476e6952c	Suppress 'vdev_metaslab_init' memory warning The vdev_metaslab_init() function has been observed to allocate larger than 8k chunks. However, they are not much larger than 8k and it does this infrequently so it is allowed and the warning is supressed.	2011-04-27 09:35:18 -07:00
Brian Behlendorf	40a39e1103	Conserve stack in dsl_scan_visit() The dsl_scan_visit() function is a little heavy weight taking 464 bytes on the stack. This can be easily reduced for little cost by moving zap_cursor_t and zap_attribute_t off the stack and on to the heap. After this change dsl_scan_visit() has been reduced in size by 320 bytes. This change was made to reduce stack usage in the dsl_scan_sync() callpath which is recursive and has been observed to overflow the stack. Issue #174	2011-04-26 15:48:00 -07:00
Brian Behlendorf	b81c4ac9af	Conserve stack in dsl_scan_visitbp() This function is called recursively so everything possible must be done to limit its stack consumption. The dprintf_bp() debugging function adds 30 bytes of local variables to the function we cannot afford. By commenting out this debugging we save 30 bytes per recursion and depths of 13 are not uncommon. This yeilds a total stack saving of 390 bytes on our 8k stack. Issue #174	2011-04-26 15:48:00 -07:00

1 2 3 4 5

237 Commits