mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-04-17 08:54:52 +03:00

Author	SHA1	Message	Date
Gunnar Beutner	8b0cf399ff	Updated init scripts to enable automatic sharing of ZFS datasets. The relevant init scripts were updated so as to automatically share ZFS datasets using "zfs share -a" at boot time. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-07-06 09:20:28 -07:00
Gunnar Beutner	3c9609b322	Renamed HAVE_SHARE ifdefs to HAVE_SMB_SHARE. The remaining code that is guarded by HAVE_SHARE ifdefs is related to the .zfs/shares functionality which is currently not available on Linux. On Solaris the .zfs/shares directory can be used to set permissions for SMB shares. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-07-06 09:20:28 -07:00
Gunnar Beutner	52e7c3a2e5	Link libshare directly to libzfs Drop usage of dlopen/dlsym for libshare. There is no need to do this because the zfs packages provide libshare. Unlike on Solaris we are guaranteed it will be available. This avoids possible problems with hardcoding the libshare path in the code (e.g. when users specify a different install path via configure options). It additionally simplifies the code which is good for maintainability. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-07-06 09:20:28 -07:00
Gunnar Beutner	46e18b3f0f	Implemented sharing datasets via NFS using libshare. The sharenfs and sharesmb properties depend on the libshare library to export datasets via NFS and SMB. This commit implements the base libshare functionality as well as support for managing NFS shares. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-07-06 09:20:28 -07:00
Zachary Bedell	dc2a4a9136	Document initramfs process Add documentation for Dracut and the initramfs process. This includes detailing the basic boot process and options available. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-07-06 09:20:28 -07:00
Zachary Bedell	fde4ce992d	Update for Dracut-010 Update Dracut module for Dracut-010 and fix race conditions that caused boot to fail on MP systems. Add support for zfs_force flag and parsing of spl_hostid from kernel command line. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-07-06 09:20:28 -07:00
Zachary Bedell	e93ced4847	Update zfs.gentoo/zfs.lsb init script * Update paths to zpool/zfs tools, * Log less for non-error conditions, * Don't be fatal if umount fails at shutdown -- final init remount will take care of it if /usr or / are in use Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-07-06 09:20:14 -07:00
Gunnar Beutner	c8082367cf	Removed erroneous backticks in the zfs.lunar init script. The backticks would cause the output of the zfs commands to be evaluated as input for the if construct rather than their exit status. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-07-05 11:25:48 -07:00
Gunnar Beutner	0f4524cca4	Fixed indentation in the zfs.lunar init script. One of the blocks in the init script wasn't indented properly. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-07-05 11:25:48 -07:00
Prasad Joshi	5a52105925	Use consistent error message in zpool sub-command The zpool sub-commands like iostat, list, and status should display consistent message when a given pool is unavailable or no pool is present. This change unifies the default behavior as follows: root@prasad:~# ./zpool list 1 2 no pools available no pools available root@prasad:~# ./zpool iostat 1 2 no pools available no pools available root@prasad:~# ./zpool status 1 2 no pools available no pools available root@prasad:~# ./zpool list tan 1 2 cannot open 'tan': no such pool root@prasad:~# ./zpool iostat tan 1 2 cannot open 'tan': no such pool root@prasad:~# ./zpool status tan 1 2 cannot open 'tan': no such pool Reported-by: Rajshree Thorat <rthorat@stec-inc.com> Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #306	2011-07-04 19:57:01 -07:00
Andrew Tselischev	b59322a0d8	Fix 'rc_parallel="YES"' error If rc_parallel="YES" zfs starts before localmount, which leads to "No such file or directory" error on systems with /usr on a separate partition. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-07-04 13:54:59 -07:00
Brian Behlendorf	b2f25e00ec	Prep zfs-0.6.0-rc5 tag Create the fifth 0.6.0 release candidate tag (rc5).	2011-07-01 15:24:34 -07:00
Brian Behlendorf	285226eff3	Always allow non-user xattrs Under Linux you may only disable USER xattrs. The SECURITY, SYSTEM, and TRUSTED xattr namespaces must always be available if xattrs are supported by the filesystem. The enforcement of USER xattrs is performed in the zpl_xattr_user_* handlers. Under Solaris there is only a single xattr namespace which is managed globally.	2011-07-01 13:39:48 -07:00
Brian Behlendorf	f2cfee80e3	Fix implicit declaration of 'mkdirp' The lib/libspl/include/libgen.h header file was being mistakenly left out of the 'make dist' tarball. It just happens this doesn't cause a build failure when creating packages because the system libgen.h is included instead. This simply results in the following warning due to the missing forward declaration of mkdirp(). ../../lib/libzfs/libzfs_mount.c:417:3: warning: implicit declaration of function 'mkdirp' [-Wimplicit-function-declaration]	2011-07-01 13:39:47 -07:00
Rohan Puri	a89c3e0bd5	Support mandatory locks (nbmand) The Linux kernel already has support for mandatory locking. This change just replaces the Solaris mandatory locking calls with the Linux equivalents. In fact, it looks like this code could be removed entirely because this checking is already done generically in the Linux VFS. However, for now we'll leave it in place even if it is redundant just in case we missed something. The original patch to update the code to support mandatory locking was done by Rohan Puri. This patch is an updated version which is compatible with the previous mount option handling changes. Original-Patch-by: Rohan Puri <rohan.puri15@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #222 Closes #253	2011-07-01 13:39:40 -07:00
Brian Behlendorf	2cf7f52bc4	Linux compat 2.6.39: mount_nodev() The .get_sb callback has been replaced by a .mount callback in the file_system_type structure. When using the new interface the caller must now use the mount_nodev() helper. Unfortunately, the new interface no longer passes the vfsmount down to the zfs layers. This poses a problem for the existing implementation because we currently save this pointer in the super block for later use. It provides our only entry point in to the namespace layer for manipulating certain mount options. This needed to be done originally to allow commands like 'zfs set atime=off tank' to work properly. It also allowed me to keep more of the original Solaris code unmodified. Under Solaris there is a 1-to-1 mapping between a mount point and a file system so this is a fairly natural thing to do. However, under Linux there may be multiple entries in the namespace which reference the same filesystem. Thus keeping a back reference from the filesystem to the namespace is complicated. Rather than introduce some ugly hack to get the vfsmount and continue as before, I'm leveraging this API change to update the ZFS code to do things in a more natural way for Linux. This has the upside that it resolves the compatibility issue for the long term and fixes several other minor bugs which have been reported. This commit updates the code to remove this vfsmount back reference entirely. All modifications to filesystem mount options are now passed in to the kernel via a '-o remount'. This is the expected Linux mechanism and allows the namespace to properly handle any options which apply to it before passing them on to the file system itself. Aside from fixing the compatibility issue, removing the vfsmount has had the benefit of simplifying the code. This change, while fairly involved, has turned out nicely. Closes #246 Closes #217 Closes #187 Closes #248 Closes #231	2011-07-01 13:36:39 -07:00
Brian Behlendorf	5c03efc379	Linux compat 2.6.39: security_inode_init_security() The security_inode_init_security() function now takes an additional qstr argument which must be passed in from the dentry if available. Passing a NULL is safe when no qstr is available; the relevant security checks will just be skipped. Closes #246 Closes #217 Closes #187	2011-07-01 12:40:08 -07:00
Brian Behlendorf	bd2f5ac97f	Avoid 'rpm -q' bug for 'make pkg' RPM version 4.9.0 has been observed to generate extra debug messages in certain cases. These debug messages prevent us from cleanly acquiring the architecture. This is clearly an upstream RPM bug which will get fixed. But until then a safe solution is to pipe the result through 'tail -1' to just grab the architecture bit we care about. Example 'rpm -qp spl-0.6.0-rc4.src.rpm --qf %{arch}' output: Freeing read locks for locker 0x166: 28031/47480843735008 Freeing read locks for locker 0x168: 28031/47480843735008 x86_64	2011-07-01 12:39:25 -07:00
Brian Behlendorf	e2e7aa2df8	Add ZFS specific mmap() checks Under Linux the VFS handles virtually all of the mmap() access checks. Filesystem specific checks are left to be handled in the .mmap() hook and normally there aren't any. However, ZFS provides a few attributes which can influence the mmap behavior and should be honored. Note, currently the code to modify these attributes has not been implemented under Linux. * ZFS_IMMUTABLE | ZFS_READONLY | ZFS_APPENDONLY: when any of these attributes are set a file may not be mmaped with write access. * ZFS_AV_QUARANTINED: when set a file may not be mmaped with read or exec access. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-07-01 12:23:46 -07:00
Brian Behlendorf	f0b2486034	Remove unused MMAP functions The following functions were required for the OpenSolaris mmap implementation. Because the Linux VFS does most of the heavy lifting for us they are not required and are being removed to keep the code clean and easy to understand. * zfs_null_putapage() * zfs_frlock() * zfs_no_putpage() Signed-off-by: Brian Behlendorf <behlendorf@llnl.gov>	2011-07-01 12:22:57 -07:00
Prasad Joshi	dde471ef5a	MMAP Optimization Enable zfs_getpage, zfs_fillpage, zfs_putpage, zfs_putapage functions. The functions have been modified to make them Linux friendly. ZFS uses these functions to read/write the mmapped pages. Using them from readpage/writepage results in clear code. The patch also adds readpages and writepages interface functions to read/write list of pages in one function call. The code change handles the first mmap optimization mentioned on https://github.com/behlendorf/zfs/issues/225 Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com> Signed-off-by: Brian Behlendorf <behlendorf@llnl.gov> Issue #255	2011-07-01 12:22:52 -07:00
Brian Behlendorf	2a005961a4	Ensure all block devices are available These days most disk drivers will probe for devices asynchronously. This means it's possible that when your zfs init script runs all the required block devices may not yet have been discovered. The result is the pool may fail to cleanly import at boot time. This is particularly common when you have a large number of devices. The fix is for the init script to block until udev settles and we are no longer detecting new devices. Once the system has settled the zfs modules can be loaded and the pool will be automatically imported.	2011-06-30 14:45:33 -07:00
Prasad Joshi	218b8eafbd	Use truncate_setsize in zfs_setattr According to Linux kernel commit 2c27c65e, using truncate_setsize in setattr simplifies the code. Therefore, the patch replaces the call to vmtruncate() with truncate_setsize(). zfs_setattr uses zfs_freesp to free the disk space belonging to the file. As truncate_setsize may release the page cache and flush the dirty data to disk, it must be called before zfs_freesp. Suggested-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com> Closes #255	2011-06-27 09:59:52 -07:00
Prasad Joshi	b312979252	Tear down and flush the mmap region The inode eviction should unmap the pages associated with the inode. These pages should also be flushed to disk to avoid the data loss. Therefore, use truncate_setsize() in evict_inode() to release the pagecache. The API truncate_setsize() was added in 2.6.35 kernel. To ensure compatibility with the old kernel, the patch defines its own truncate_setsize function. Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com> Closes #255	2011-06-27 09:59:19 -07:00
Ned A. Bass	560bcf9d14	Multipath device manageability improvements Update udev helper scripts to deal with device-mapper devices created by multipathd. These enhancements are targeted at a particular storage network topology under evaluation at LLNL consisting of two SAS switches providing redundant connectivity between multiple server nodes and disk enclosures. The key to making these systems manageable is to create shortnames for each disk that convey its physical location in a drawer. In a direct-attached topology we infer a disk's enclosure from the PCI bus number and HBA port number in the by-path name provided by udev. In a switched topology, however, multiple drawers are accessed via a single HBA port. We therefore resort to assigning drawer identifiers based on which switch port a drive's enclosure is connected to. This information is available from sysfs. Add options to zpool_layout to generate an /etc/zfs/zdev.conf using symbolic links in /dev/disk/by-id of the form <label>-<UUID>-switch-port:<X>-slot:<Y>. <label> is a string that depends on the subsystem that created the link and defaults to "dm-uuid-mpath" (this prefix is used by multipathd). <UUID> is a unique identifier for the disk typically obtained from the scsi_id program, and <X> and <Y> denote the switch port and disk slot numbers, respectively. Add a callout script sas_switch_id for use by multipathd to help create symlinks of the form described above. Update zpool_id and the udev zpool rules file to handle both multipath devices and conventional drives.	2011-06-23 10:46:06 -07:00
Brian Behlendorf	7e7baecaa3	Linux 3.0 compat, shrinker compatibility To accommodate the updated Linux 3.0 shrinker API the spl shrinker compatibility code was updated. Unfortunately, this couldn't be done cleanly without slightly adjusting the compat API. See spl commit `a55bcaad18`. This commit updates the ZFS code to use the slightly modified API. You must use the latest SPL if you're building ZFS.	2011-06-21 14:36:39 -07:00
Gunnar Beutner	b00131d43c	Fix unlink/xattr deadlock The problem here is that prune_icache() tries to evict/delete both the xattr directory inode as well as at least one xattr inode contained in that directory. Here's what happens: 1. File is created. 2. xattr is created for that file (behind the scenes a xattr directory and a file in that xattr directory are created) 3. File is deleted. 4. Both the xattr directory inode and at least one xattr inode from that directory are evicted by prune_icache(); prune_icache() acquires a lock on both inodes before it calls ->evict() on the inodes When the xattr directory inode is evicted zfs_zinactive attempts to delete the xattr files contained in that directory. While enumerating these files zfs_zget() is called to obtain a reference to the xattr file znode - which tries to lock the xattr inode. However that very same xattr inode was already locked by prune_icache() further up the call stack, thus leading to a deadlock. This can be reliably reproduced like this: $ touch test $ attr -s a -V b test $ rm test $ echo 3 > /proc/sys/vm/drop_caches This patch fixes the deadlock by moving the zfs_purgedir() call to zfs_unlinked_drain(). Instead zfs_rmnode() now checks whether the xattr dir is empty and leaves the xattr dir in the unlinked set if it finds any xattrs. To ensure zfs_unlinked_drain() never accesses a stale super block zfsvfs_teardown() has been updated to block until the iput taskq has been drained. This avoids a potential race where a file with an xattr directory is removed and the file system is immediately unmounted. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #266	2011-06-20 13:47:03 -07:00
Gunnar Beutner	6f0cf71e0d	Removed erroneous zfs_inode_destroy() calls from zfs_rmnode(). iput_final() already calls zpl_inode_destroy() -> zfs_inode_destroy() for us after zfs_zinactive(), thus making sure that the inode is properly cleaned up. The zfs_inode_destroy() calls in zfs_rmnode() would lead to a double-free. Fixes #282	2011-06-20 10:30:17 -07:00
Christian Kohlschütter	df30f56639	Add "ashift" property to zpool create Some disks with internal sectors larger than 512 bytes (e.g., 4k) can suffer from bad write performance when ashift is not configured correctly. This is caused by the disk not reporting its actual sector size, but a sector size of 512 bytes. The drive may behave this way for compatibility reasons. For example, the WDC WD20EARS disks are known to exhibit this behavior. When creating a zpool, ZFS takes that wrong sector size and sets the "ashift" property accordingly (to 9: 1<<9=512), whereas it should be set to 12 for 4k sectors (1<<12=4096). This patch allows an administrator to manually specify the known correct ashift size at 'zpool create' time. This can significantly improve performance in certain cases. However, it will have an impact on your total pool capacity. See the updated ashift property description in the zpool.8 man page for additional details. Valid values for the ashift property range from 9 to 17 (512B-128KB). Additionally, you may set the ashift to 0 if you wish to auto-detect the sector size based on what the disk reports, this is the default behavior. The most common ashift values are 9 and 12. Example: zpool create -o ashift=12 tank raidz2 sda sdb sdc sdd Closes #280 Original-patch-by: Richard Laager <rlaager@wiktel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-06-17 16:35:49 -07:00
Brian Behlendorf	96801d2906	Linux 2.6.37 compat, WRITE_FLUSH_FUA The WRITE_FLUSH, WRITE_FUA, and WRITE_FLUSH_FUA flags have been introduced as a replacement for WRITE_BARRIER. This was done to allow richer semantics to be expressed to the block layer. It is the block layer's responsibility to choose the correct way to implement these semantics. This change simply updates the bio's to use the new kernel API which should be absolutely safe. However, since ZFS depends entirely on this working as designed for correctness we do want to be careful. Closes #281	2011-06-17 14:37:26 -07:00
Brian Behlendorf	db97f88646	Update rpm/deb packages to be FHS compliant This change is the first step towards updating the default rpm/deb packages to be FHS compliant. It accomplishes this by passing the following options to ./configure to ensure the zfs build products are installed in FHS compliant locations. ./configure --prefix=/ --bindir=/lib/udev \ --libexecdir=/usr/libexec --datadir=/usr/share The core zfs utilities (zfs, zpool, zdb) are now installed in /sbin, the core libraries in /lib, and the udev helpers (zpool_id, zvol_id) are in /lib/udev with the other udev helpers. The remaining files in the zfs package remain in their previous locations under /usr.	2011-06-17 13:36:16 -07:00
Darik Horn	b772aedeec	Autogen refresh. Run autogen.sh using the same autotools versions as upstream: * autoconf-2.63 * automake-1.11.1 * libtool-2.2.6b	2011-06-17 13:24:44 -07:00
Brian Behlendorf	47a2455fbc	Use datadir not datarootdir for dracut The zfs dracut modules should be installed under the --datadir not --datarootdir path. This was just an oversight in the original Makefile.am. After this change %{_datadir} can now be set safely in the zfs.spec file. The 'make install' location is now consistent with the location expected by the spec file.	2011-06-17 13:22:19 -07:00
Darik Horn	b9f27ee765	Fix autoconf variable substitution in udev rules. Change the variable substitution in the udev rule templates according to the method described in the Autoconf manual; Chapter 4.7.2: Installation Directory Variables. The udev rules are improperly generated if the bindir parameter overrides the prefix parameter during configure. For example: # ./configure --prefix=/usr/local --bindir=/opt/zfs/bin The udev helper is installed as /opt/zfs/bin/zpool_id, but the corresponding udev rule has a different path: # /usr/local/etc/udev/rules.d/60-zpool.rules ENV{DEVTYPE}=="disk", IMPORT{program}="/usr/local/bin/zpool_id -d %p" The @bindir@ variable expands to "${exec_prefix}/bin", so it cannot be used instead of @prefix@ directly. This also applies to the zvol_id helper. Closes #283.	2011-06-17 10:11:29 -07:00
Brian Behlendorf	e130330a87	Handle /etc/mtab -> /proc/mounts symlink Under Fedora 15 /etc/mtab is now a symlink to /proc/mounts by default. When /etc/mtab is a symlink the mount.zfs helper should not update it. There was code in place to handle this case but it used stat() which traverses the link and then issues the stat on /proc/mounts. We need to use lstat() to prevent the link traversal and instead stat /etc/mtab. Closes #270	2011-06-14 16:48:38 -07:00
Brian Behlendorf	2e08aedba4	Always check -Wno-unused-but-set-variable gcc support The previous commit `8a7e1ceefa` wasn't quite right. This check applies to both the user and kernel space build and as such we must make sure it runs regardless of what the --with-config option is set to. For example, if --with-config=kernel then the autoconf test does not run and we generate build warnings when compiling the kernel packages.	2011-06-14 16:40:35 -07:00
Brian Behlendorf	8a7e1ceefa	Check for -Wno-unused-but-set-variable gcc support Gcc versions 4.3.2 and earlier do not support the compiler flag -Wno-unused-but-set-variable. This can lead to build failures on older Linux platforms such as Debian Lenny. Since this is an optional build argument this change adds a new autoconf check for the option. If it is supported by the installed version of gcc then it is used otherwise it is omitted. See commits `12c1acde76` and `79713039a2` for the reason the -Wno-unused-but-set-variable option was originally added.	2011-06-14 14:43:22 -07:00
Brian Behlendorf	10715a0187	Add default stack checking When your kernel is built with kernel stack tracing enabled and you have the debugfs filesystem mounted, the zfs.sh script will clear the worst observed kernel stack depth on module load and check the worst case usage on module removal. If the stack depth ever exceeds 7000 bytes the full stack will be printed for debugging. This is dangerously close to overrunning the default 8k stack. This additional advisory debugging is particularly valuable when running the regression tests on a kernel built with 16k stacks. In this case, almost no matter how bad the stack overrun is you will be able to get a clean stack trace for debugging. Since the worst case stack usage can be highly variable it's helpful to always check the worst case usage.	2011-06-13 13:50:21 -07:00
Brian Behlendorf	da88a7fbe8	Pass -f option for import If a pool was not cleanly exported passing the -f flag may be required at 'zpool import' time. Since this test is simply validating that the pool can be successfully imported in the absence of the cache file always pass the -f to ensure it succeeds. This failure was observed under RHEL6.1.	2011-06-10 11:21:31 -07:00
Brian Behlendorf	1b9d8c340f	Fix 'zfs send -D' segfault Sending pools with dedup results in a segfault due to a Solaris portability issue. Under Solaris the pipe(2) library call creates a bidirectional data channel. Unfortunately, on Linux the pipe(2) call creates a unidirectional data channel. The fix is to use the socketpair(2) function to create the expected bidirectional channel. Seth Heeren did the original leg work on this issue for zfs-fuse. We finally just rediscovered the same portability issue and dfurphy was able to point me at the original issue for the fix. Closes #268	2011-06-09 13:58:48 -07:00
Brian Behlendorf	cbc6fab65c	Sanitize zpios-sanity.sh environment Just like zconfig.sh the zpios-sanity.sh tests should run in a sanitized environment. This ensures they never conflict with an installed /etc/zfs/zpool.cache file. This commit additionally improves the -c cleanup option. It now removes the modules stack if loaded and destroys relevant md devices. This behavior is now identical to zconfig.sh.	2011-06-03 15:08:49 -07:00
Brian Behlendorf	608860b6d0	Delay before destroying loopback devices Generally I don't approve of just adding an arbitrary delay to avoid a problem but in this case I'm going to let it slide. We may need to delay briefly after 'zpool destroy' returns to ensure the loopback devices are closed. If they aren't closed then losetup -d will not be able to destroy them. Unfortunately, there's no easy state to check so we'll have to make do with a simple delay.	2011-06-03 14:38:25 -07:00
Brian Behlendorf	36391312af	Always unload zpios.ko on exit We should always unload zpios.ko on exit. This ensures that subsequent calls to 'zfs.sh -u' from other utilities will be able to unload the module stack and properly cleanup. This is important for the --cleanup option which can be passed to zconfig.sh and zfault.sh.	2011-06-02 10:25:35 -07:00
Brian Behlendorf	2ea9dc40f8	Fix zpios-sanity.sh return code The zpios-sanity.sh script should return failure when any of the individual zpios.sh tests fail. The previous code would always return success suppressing real failures.	2011-06-02 10:13:15 -07:00
Brian Behlendorf	e95b3bdcbb	Fix stack ddt_class_contains() Stack usage for ddt_class_contains() reduced from 524 bytes to 68 bytes. This large stack allocation significantly contributed to the likelihood of a stack overflow when scrubbing/resilvering dedup pools.	2011-05-31 12:17:27 -07:00
Brian Behlendorf	5b8c7bbcea	Fix stack ddt_zap_lookup() Stack usage for ddt_zap_lookup() reduced from 368 bytes to 120 bytes. This large stack allocation significantly contributed to the likelihood of a stack overflow when scrubbing/resilvering dedup pools.	2011-05-31 12:17:27 -07:00
Brian Behlendorf	c7f8f831a4	Revert "Fix stack traverse_visitbp()" This abomination is no longer required because the zio's issued during this recursive call path will now be handled asynchronously by the taskq thread pool. This reverts commit `6656bf5621`.	2011-05-31 12:17:27 -07:00
Brian Behlendorf	2fac4c2a74	Make tgx_sync_thread zio's async The majority of the recursive operations performed by the dsl are done either in the context of the tgx_sync_thread or during pool import. It is these recursive operations which contribute greatly to the stack depth. When this recursion is coupled with a synchronous I/O in the same context overflow becomes possible. Previously to handle this case I have focused on keeping the individual stack frames as light as possible. This is a good idea as long as it can be done in a way which doesn't overly complicate the code. However, there is a better solution. If we treat all zio's issued by the tgx_sync_thread as async then we can use the tgx_sync_thread stack for the recursive parts, and the zio_* threads for the I/O parts. This effectively doubles our available stack space with the only drawback being a small delay to schedule the I/O. However, in practice the scheduling time is so much smaller than the actual I/O time this isn't an issue. Another benefit of making the zio async is that the zio pipeline is now parallel. That should mean for CPU intensive pipelines such as compression or dedup performance may be improved. With this change in place the worst case stack usage observed so far is 6902 bytes. This is still higher than I'd like but significantly improved. Additional changes to specific functions should improve this further. This change allows us to revert commit `6656bf5` which did some horrible things to the recursive traverse_visitbp() callpath in the name of saving stack.	2011-05-31 12:17:27 -07:00
Brian Behlendorf	f74fae8b30	Fix 4K sector support Yesterday I ran across a 3TB drive which exposed 4K sectors to Linux. While I thought I had gotten this support correct it turns out there were 2 subtle bugs which prevented it from working. sudo ./cmd/zpool/zpool create -f large-sector /dev/sda cannot create 'large-sector': one or more devices is currently unavailable 1) The first issue was that it was possible that bdev_capacity() would return the number of 512 byte sectors rather than the number of 4096 byte sectors. Internally, certain Linux functions only operate with 512 byte sectors so you need to be careful. To avoid any confusion in the future I've updated bdev_capacity() to simply return the device (or partition) capacity in bytes. The higher levels of ZFS want the value in bytes anyway so this is cleaner. 2) When creating a bio the ->bi_sector count must always be expressed in 512 byte sectors. The existing code would scale the byte offset by the logical sector size. Until now this was always 512 so it never caused problems. Trying a 4K sector drive clearly exposed the issue. The problem has been fixed by hard coding the 512 byte sector which is exactly what the bio code does internally. With these changes I'm now able to create ZFS pools using 4K sector drives. No issues were observed during fairly extensive testing. This is also a low risk change if you're using 512b sector devices because none of the logic changes. Closes #256	2011-05-27 11:38:53 -07:00
Brian Behlendorf	2b8cad6159	Use vmem_alloc() for zfs_ioc_userspace_many() The default buffer size when requesting multiple quota entries is 100 times the zfs_useracct_t size. In practice this works out to exactly 27200 bytes. Since this will be a short lived buffer in a non-performance critical path it is preferable to vmem_alloc() the needed memory.	2011-05-20 14:23:18 -07:00
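
The ashift arithmetic described in commit df30f56639 above (1<<9=512, 1<<12=4096, valid values 9 to 17, 0 for auto-detect) can be illustrated with a short sketch. This is an illustrative Python example only, not code from the repository; the function and constant names are hypothetical.

```python
from typing import Optional

# Range quoted in the commit message: 9 to 17 (512B-128KB).
MIN_ASHIFT = 9
MAX_ASHIFT = 17


def ashift_to_sector_size(ashift: int) -> Optional[int]:
    """Return the sector size in bytes for an ashift value.

    An ashift of 0 means "auto-detect from what the disk reports",
    so no fixed size is returned in that case.
    """
    if ashift == 0:
        return None
    if not MIN_ASHIFT <= ashift <= MAX_ASHIFT:
        raise ValueError(f"ashift must be 0 or {MIN_ASHIFT}..{MAX_ASHIFT}")
    return 1 << ashift


def sector_size_to_ashift(size: int) -> int:
    """Return the ashift for a power-of-two sector size in bytes."""
    if size <= 0 or size & (size - 1):
        raise ValueError("sector size must be a positive power of two")
    ashift = size.bit_length() - 1
    if not MIN_ASHIFT <= ashift <= MAX_ASHIFT:
        raise ValueError("sector size outside the supported ashift range")
    return ashift


if __name__ == "__main__":
    for value in (9, 12, 17):
        print(value, ashift_to_sector_size(value))
```

A drive with 4k internal sectors that reports 512 bytes would be auto-detected as ashift 9; the commit's 'zpool create -o ashift=12' example overrides that with the value sector_size_to_ashift(4096) computes here.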

441 Commits