mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2024-11-17 18:11:00 +03:00

Author	SHA1	Message	Date
Matthew Ahrens	5fbf85c4e2	Linux does not HAVE_DNLC Since Linux does not have the Directory Name Lookup Cache, we don't need the code to manage it. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #8031	2018-10-17 10:30:08 -07:00
ilbsmart	779a6c0bf6	deadlock between mm_sem and tx assign in zfs_write() and page fault The bug time sequence: 1. thread #1, `zfs_write` assign a txg "n". 2. In a same process, thread #2, mmap page fault (which means the `mm_sem` is hold) occurred, `zfs_dirty_inode` open a txg failed, and wait previous txg "n" completed. 3. thread #1 call `uiomove` to write, however page fault is occurred in `uiomove`, which means it need `mm_sem`, but `mm_sem` is hold by thread #2, so it stuck and can't complete, then txg "n" will not complete. So thread #1 and thread #2 are deadlocked. Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Grady Wong <grady.w@xtaotech.com> Closes #7939	2018-10-16 11:11:24 -07:00
Matt Ahrens	5d43cc9a59	OpenZFS 9689 - zfs range lock code should not be zpl-specific The ZFS range locking code in zfs_rlock.c/h depends on ZPL-specific data structures, specifically znode_t. However, it's also used by the ZVOL code, which uses a "dummy" znode_t to pass to the range locking code. We should clean this up so that the range locking code is generic and can be used equally by ZPL and ZVOL, and also can be used by future consumers that may need to run in userland (libzpool) as well as the kernel. Porting notes: * Added missing sys/avl.h include to sys/zfs_rlock.h. * Removed 'dbuf is within the locked range' ASSERTs from dmu_sync(). This was needed because ztest does not yet use a locked_range_t. * Removed "Approved by:" tag requirement from OpenZFS commit check to prevent needless warnings when integrating changes which has not been merged to illumos. * Reverted free_list range lock changes which were originally needed to defer the cv_destroy() which was called immediately after cv_broadcast(). With `d2733258` this should be safe but if not we may need to reintroduce this logic. * Reverts: The following two commits were reverted and squashed in to this change in order to make it easier to apply OpenZFS 9689. - `d88895a0`, which removed the dummy znode from zvol_state - `e3a07cd0`, which updated ztest to use range locks * Preserved optimized rangelock comparison function. Preserved the rangelock free list. The cv_destroy() function will block waiting for all processes in cv_wait() to be scheduled and drop their reference. This is done to ensure it's safe to free the condition variable. However, blocking while holding the rl->rl_lock mutex can result in a deadlock on Linux. A free list is introduced to defer the cv_destroy() and kmem_free() until after the mutex is released. Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://illumos.org/issues/9689 OpenZFS-commit: https://github.com/openzfs/openzfs/pull/680 External-issue: DLPX-58662 Closes #7980	2018-10-11 10:19:33 -07:00
Serapheim Dimitropoulos	a448a2557e	Introduce read/write kstats per dataset The following patch introduces a few statistics on reads and writes grouped by dataset. These statistics are implemented as kstats (backed by aggregate sums for performance) and can be retrieved by using the dataset objset ID number. The motivation for this change is to provide some preliminary analytics on dataset usage/performance. Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #7705	2018-08-20 09:52:37 -07:00
Brian Behlendorf	6413c95fbd	Linux 4.18 compat: inode timespec -> timespec64 Commit torvalds/linux@95582b0 changes the inode i_atime, i_mtime, and i_ctime members form timespec's to timespec64's to make them 2038 safe. As part of this change the current_time() function was also updated to return the timespec64 type. Resolve this issue by introducing a new inode_timespec_t type which is defined to match the timespec type used by the inode. It should be used when working with inode timestamps to ensure matching types. The timestruc_t type under Illumos was used in a similar fashion but was specified to always be a timespec_t. Rather than incorrectly define this type all timespec_t types have been replaced by the new inode_timespec_t type. Finally, the kernel and user space 'sys/time.h' headers were aligned with each other. They define as appropriate for the context several constants as macros and include static inline implementation of gethrestime(), gethrestime_sec(), and gethrtime(). Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7643	2018-06-19 21:51:18 -07:00
Brian Behlendorf	93ce2b4ca5	Update build system and packaging Minimal changes required to integrate the SPL sources in to the ZFS repository build infrastructure and packaging. Build system and packaging: * Renamed SPL_* autoconf m4 macros to ZFS_. Removed redundant SPL_* autoconf m4 macros. * Updated the RPM spec files to remove SPL package dependency. * The zfs package obsoletes the spl package, and the zfs-kmod package obsoletes the spl-kmod package. * The zfs-kmod-devel* packages were updated to add compatibility symlinks under /usr/src/spl-x.y.z until all dependent packages can be updated. They will be removed in a future release. * Updated copy-builtin script for in-kernel builds. * Updated DKMS package to include the spl.ko. * Updated stale AUTHORS file to include all contributors. * Updated stale COPYRIGHT and included the SPL as an exception. * Renamed README.markdown to README.md * Renamed OPENSOLARIS.LICENSE to LICENSE. * Renamed DISCLAIMER to NOTICE. Required code changes: * Removed redundant HAVE_SPL macro. * Removed _BOOT from nvpairs since it doesn't apply for Linux. * Initial header cleanup (removal of empty headers, refactoring). * Remove SPL repository clone/build from zimport.sh. * Use of DEFINE_RATELIMIT_STATE and DEFINE_SPINLOCK removed due to build issues when forcing C99 compilation. * Replaced legacy ACCESS_ONCE with READ_ONCE. * Include needed headers for `current` and `EXPORT_SYMBOL`. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> TEST_ZIMPORT_SKIP="yes" Closes #7556	2018-05-29 16:00:33 -07:00
Brian Behlendorf	9464b9591e	RHEL 7.5 compat: FMODE_KABI_ITERATE As of RHEL 7.5 the mainline fops.iterate() method was added to the file_operations structure and is correctly detected by the configure script. Normally this is what we want, but in order to maintain KABI compatibility the RHEL change additionally does the following: * Requires that callers intending to use this extended interface set the FMODE_KABI_ITERATE flag on the file structure when opening the directory. * Adds the fops.iterate() method to the end of the structure, without removing fops.readdir(). This change updates the configure check to ignore the RHEL 7.5+ variant of fops.iterate() when detected. Instead fallback to the fops.readdir() interface which will be available. Finally, add the 'zpl_' prefix to the directory context wrappers to avoid colliding with the kernel provided symbols when both the fops.iterate() and fops.readdir() are provided by the kernel. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7460 Closes #7463	2018-05-02 15:01:24 -07:00
Chunwei Chen	599b864813	Fix ENOSPC in "Handle zap_add() failures in ..." Commit `cc63068` caused ENOSPC error when copy a large amount of files between two directories. The reason is that the patch limits zap leaf expansion to 2 retries, and return ENOSPC when failed. The intent for limiting retries is to prevent pointlessly growing table to max size when adding a block full of entries with same name in different case in mixed mode. However, it turns out we cannot use any limit on the retry. When we copy files from one directory in readdir order, we are copying in hash order, one leaf block at a time. Which means that if the leaf block in source directory has expanded 6 times, and you copy those entries in that block, by the time you need to expand the leaf in destination directory, you need to expand it 6 times in one go. So any limit on the retry will result in error where it shouldn't. Note that while we do use different salt for different directories, it seems that the salt/hash function doesn't provide enough randomization to the hash distance to prevent this from happening. Since `cc63068` has already been reverted. This patch adds it back and removes the retry limit. Also, as it turn out, failing on zap_add() has a serious side effect for mzap_upgrade(). When upgrading from micro zap to fat zap, it will call zap_add() to transfer entries one at a time. If it hit any error halfway through, the remaining entries will be lost, causing those files to become orphan. This patch add a VERIFY to catch it. Reviewed-by: Sanjeev Bagewadi <sanjeev.bagewadi@gmail.com> Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Albert Lee <trisk@forkgnu.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #7401 Closes #7421	2018-04-18 14:19:50 -07:00
Matthew Ahrens	a1d477c24c	OpenZFS 7614, 9064 - zfs device evacuation/removal OpenZFS 7614 - zfs device evacuation/removal OpenZFS 9064 - remove_mirror should wait for device removal to complete This project allows top-level vdevs to be removed from the storage pool with "zpool remove", reducing the total amount of storage in the pool. This operation copies all allocated regions of the device to be removed onto other devices, recording the mapping from old to new location. After the removal is complete, read and free operations to the removed (now "indirect") vdev must be remapped and performed at the new location on disk. The indirect mapping table is kept in memory whenever the pool is loaded, so there is minimal performance overhead when doing operations on the indirect vdev. The size of the in-memory mapping table will be reduced when its entries become "obsolete" because they are no longer used by any block pointers in the pool. An entry becomes obsolete when all the blocks that use it are freed. An entry can also become obsolete when all the snapshots that reference it are deleted, and the block pointers that reference it have been "remapped" in all filesystems/zvols (and clones). Whenever an indirect block is written, all the block pointers in it will be "remapped" to their new (concrete) locations if possible. This process can be accelerated by using the "zfs remap" command to proactively rewrite all indirect blocks that reference indirect (removed) vdevs. Note that when a device is removed, we do not verify the checksum of the data that is copied. This makes the process much faster, but if it were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be possible to copy the wrong data, when we have the correct data on e.g. the other side of the mirror. At the moment, only mirrors and simple top-level vdevs can be removed and no removal is allowed if any of the top-level vdevs are raidz. Porting Notes: * Avoid zero-sized kmem_alloc() in vdev_compact_children(). The device evacuation code adds a dependency that vdev_compact_children() be able to properly empty the vdev_child array by setting it to NULL and zeroing vdev_children. Under Linux, kmem_alloc() and related functions return a sentinel pointer rather than NULL for zero-sized allocations. * Remove comment regarding "mpt" driver where zfs_remove_max_segment is initialized to SPA_MAXBLOCKSIZE. Change zfs_condense_indirect_commit_entry_delay_ticks to zfs_condense_indirect_commit_entry_delay_ms for consistency with most other tunables in which delays are specified in ms. * ZTS changes: Use set_tunable rather than mdb Use zpool sync as appropriate Use sync_pool instead of sync Kill jobs during test_removal_with_operation to allow unmount/export Don't add non-disk names such as "mirror" or "raidz" to $DISKS Use $TEST_BASE_DIR instead of /tmp Increase HZ from 100 to 1000 which is more common on Linux removal_multiple_indirection.ksh Reduce iterations in order to not time out on the code coverage builders. removal_resume_export: Functionally, the test case is correct but there exists a race where the kernel thread hasn't been fully started yet and is not visible. Wait for up to 1 second for the removal thread to be started before giving up on it. Also, increase the amount of data copied in order that the removal not finish before the export has a chance to fail. * MMP compatibility, the concept of concrete versus non-concrete devices has slightly changed the semantics of vdev_writeable(). Update mmp_random_leaf_impl() accordingly. * Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool feature which is not supported by OpenZFS. * Added support for new vdev removal tracepoints. * Test cases removal_with_zdb and removal_condense_export have been intentionally disabled. When run manually they pass as intended, but when running in the automated test environment they produce unreliable results on the latest Fedora release. They may work better once the upstream pool import refectoring is merged into ZoL at which point they will be re-enabled. Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Alex Reece <alex@delphix.com> Reviewed-by: George Wilson <george.wilson@delphix.com> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Richard Laager <rlaager@wiktel.com> Reviewed by: Tim Chase <tim@chase2k.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Garrett D'Amore <garrett@damore.org> Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Tim Chase <tim@chase2k.com> OpenZFS-issue: https://www.illumos.org/issues/7614 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb Closes #6900	2018-04-14 12:16:17 -07:00
Tony Hutter	4f301661df	Revert "Handle zap_add() failures in mixed ... " This reverts commit `cc63068e95`. Under certain circumstances this change can result in an ENOSPC error when adding new files to a directory. See #7401 for full details. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Issue #7401 Cloes #7416	2018-04-09 14:24:46 -07:00
Brian Behlendorf	b2ab468dde	Fix mmap / libaio deadlock Calling uiomove() in mappedread() under the page lock can result in a deadlock if the user space page needs to be faulted in. Resolve the issue by dropping the page lock before the uiomove(). The inode range lock protects against concurrent updates via zfs_read() and zfs_write(). Reviewed-by: Albert Lee <trisk@forkgnu.org> Reviewed-by: Chunwei Chen <david.chen@nutanix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7335 Closes #7339	2018-03-28 10:19:22 -07:00
Nasf-Fan	2705ebf0a7	Misc fixes and cleanup for project quota 1) The Coverity Scan reports some issues for the project quota patch, including: 1.1) zfs_prop_get_userquota() directly uses the const quota type value as the condition check by wrong. 1.2) dmu_objset_userquota_get_ids() may cause dnode::dn_newgid to be overwritten by dnode::dn->dn_oldprojid. 2) This patch fixes related issues. It also enhances the logic for zfs_project_item_alloc() to avoid buffer overflow. 3) Skip project quota ability check if does not change project quota related things (id or flag). Otherwise, it will cause chattr (for other non project quota flags) operation failed if project quota disabled. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Fan Yong <fan.yong@intel.com> Closes #7251 Closes #7265	2018-03-05 12:56:27 -08:00
Nasf-Fan	9c5167d19f	Project Quota on ZFS Project quota is a new ZFS system space/object usage accounting and enforcement mechanism. Similar as user/group quota, project quota is another dimension of system quota. It bases on the new object attribute - project ID. Project ID is a numerical value to indicate to which project an object belongs. An object only can belong to one project though you (the object owner or privileged user) can change the object project ID via 'chattr -p' or 'zfs project [-s] -p' explicitly. The object also can inherit the project ID from its parent when created if the parent has the project inherit flag (that can be set via 'chattr +P' or 'zfs project -s [-p]'). By accounting the spaces/objects belong to the same project, we can know how many spaces/objects used by the project. And if we set the upper limit then we can control the spaces/objects that are consumed by such project. It is useful when multiple groups and users cooperate for the same project, or a user/group needs to participate in multiple projects. Support the following commands and functionalities: zfs set projectquota@project zfs set projectobjquota@project zfs get projectquota@project zfs get projectobjquota@project zfs get projectused@project zfs get projectobjused@project zfs projectspace zfs allow projectquota zfs allow projectobjquota zfs allow projectused zfs allow projectobjused zfs unallow projectquota zfs unallow projectobjquota zfs unallow projectused zfs unallow projectobjused chattr +/-P chattr -p project_id lsattr -p This patch also supports tree quota based on the project quota via "zfs project" commands set as following: zfs project [-d\|-r] <file\|directory ...> zfs project -C [-k] [-r] <file\|directory ...> zfs project -c [-0] [-d\|-r] [-p id] <file\|directory ...> zfs project [-p id] [-r] [-s] <file\|directory ...> For "df [-i] $DIR" command, if we set INHERIT (project ID) flag on the $DIR, then the proejct [obj]quota and [obj]used values for the $DIR's project ID will be shown as the total/free (avail) resource. Keep the same behavior as EXT4/XFS does. Reviewed-by: Andreas Dilger <andreas.dilger@intel.com> Reviewed-by Ned Bass <bass6@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Fan Yong <fan.yong@intel.com> TEST_ZIMPORT_POOLS="zol-0.6.1 zol-0.6.2 master" Change-Id: Ib4f0544602e03fb61fd46a849d7ba51a6005693c Closes #6290	2018-02-13 14:54:54 -08:00
sanjeevbagewadi	cc63068e95	Handle zap_add() failures in mixed case mode With "casesensitivity=mixed", zap_add() could fail when the number of files/directories with the same name (varying in case) exceed the capacity of the leaf node of a Fatzap. This results in a ASSERT() failure as zfs_link_create() does not expect zap_add() to fail. The fix is to handle these failures and rollback the transactions. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Chunwei Chen <david.chen@nutanix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Sanjeev Bagewadi <sanjeev.bagewadi@gmail.com> Closes #7011 Closes #7054	2018-02-09 10:15:53 -08:00
Prakash Surya	0735ecb334	OpenZFS 8997 - ztest assertion failure in zil_lwb_write_issue PROBLEM ======= When `dmu_tx_assign` is called from `zil_lwb_write_issue`, it's possible for either `ERESTART` or `EIO` to be returned. If `ERESTART` is returned, this will cause an assertion to fail directly in `zil_lwb_write_issue`, where the code assumes the return value is `EIO` if `dmu_tx_assign` returns a non-zero value. This can occur if the SPA is suspended when `dmu_tx_assign` is called, and most often occurs when running `zloop`. If `EIO` is returned, this can cause assertions to fail elsewhere in the ZIL code. For example, `zil_commit_waiter_timeout` contains the following logic: lwb_t *nlwb = zil_lwb_write_issue(zilog, lwb); ASSERT3S(lwb->lwb_state, !=, LWB_STATE_OPENED); In this case, if `dmu_tx_assign` returned `EIO` from within `zil_lwb_write_issue`, the `lwb` variable passed in will not be issued to disk. Thus, it's `lwb_state` field will remain `LWB_STATE_OPENED` and this assertion will fail. `zil_commit_waiter_timeout` assumes that after it calls `zil_lwb_write_issue`, the `lwb` will be issued to disk, and doesn't handle the case where this is not true; i.e. it doesn't handle the case where `dmu_tx_assign` returns `EIO`. SOLUTION ======== This change modifies the `dmu_tx_assign` function such that `txg_how` is a bitmask, rather than of the `txg_how_t` enum type. Now, the previous `TXG_WAITED` semantics can be used via `TXG_NOTHROTTLE`, along with specifying either `TXG_NOWAIT` or `TXG_WAIT` semantics. Previously, when `TXG_WAITED` was specified, `TXG_NOWAIT` semantics was automatically invoked. This was not ideal when using `TXG_WAITED` within `zil_lwb_write_issued`, leading the problem described above. Rather, we want to achieve the semantics of `TXG_WAIT`, while also preventing the `tx` from being penalized via the dirty delay throttling. With this change, `zil_lwb_write_issued` can acheive the semtantics that it requires by passing in the value `TXG_WAIT \| TXG_NOTHROTTLE` to `dmu_tx_assign`. Further, consumers of `dmu_tx_assign` wishing to achieve the old `TXG_WAITED` semantics can pass in the value `TXG_NOWAIT \| TXG_NOTHROTTLE`. Authored by: Prakash Surya <prakash.surya@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Porting Notes: - Additionally updated `zfs_tmpfile` to use `TXG_NOTHROTTLE` OpenZFS-issue: https://www.illumos.org/issues/8997 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/19ea6cb0f9 Closes #7084	2018-01-26 20:19:46 -08:00
Prakash Surya	1ce23dcaff	OpenZFS 8585 - improve batching done in zil_commit() Authored by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Dan McDonald <danmcd@joyent.com> Ported-by: Prakash Surya <prakash.surya@delphix.com> Problem ======= The current implementation of zil_commit() can introduce significant latency, beyond what is inherent due to the latency of the underlying storage. The additional latency comes from two main problems: 1. When there's outstanding ZIL blocks being written (i.e. there's already a "writer thread" in progress), then any new calls to zil_commit() will block waiting for the currently oustanding ZIL blocks to complete. The blocks written for each "writer thread" is coined a "batch", and there can only ever be a single "batch" being written at a time. When a batch is being written, any new ZIL transactions will have to wait for the next batch to be written, which won't occur until the current batch finishes. As a result, the underlying storage may not be used as efficiently as possible. While "new" threads enter zil_commit() and are blocked waiting for the next batch, it's possible that the underlying storage isn't fully utilized by the current batch of ZIL blocks. In that case, it'd be better to allow these new threads to generate (and issue) a new ZIL block, such that it could be serviced by the underlying storage concurrently with the other ZIL blocks that are being serviced. 2. Any call to zil_commit() must wait for all ZIL blocks in its "batch" to complete, prior to zil_commit() returning. The size of any given batch is proportional to the number of ZIL transaction in the queue at the time that the batch starts processing the queue; which doesn't occur until the previous batch completes. Thus, if there's a lot of transactions in the queue, the batch could be composed of many ZIL blocks, and each call to zil_commit() will have to wait for all of these writes to complete (even if the thread calling zil_commit() only cared about one of the transactions in the batch). To further complicate the situation, these two issues result in the following side effect: 3. If a given batch takes longer to complete than normal, this results in larger batch sizes, which then take longer to complete and further drive up the latency of zil_commit(). This can occur for a number of reasons, including (but not limited to): transient changes in the workload, and storage latency irregularites. Solution ======== The solution attempted by this change has the following goals: 1. no on-disk changes; maintain current on-disk format. 2. modify the "batch size" to be equal to the "ZIL block size". 3. allow new batches to be generated and issued to disk, while there's already batches being serviced by the disk. 4. allow zil_commit() to wait for as few ZIL blocks as possible. 5. use as few ZIL blocks as possible, for the same amount of ZIL transactions, without introducing significant latency to any individual ZIL transaction. i.e. use fewer, but larger, ZIL blocks. In theory, with these goals met, the new allgorithm will allow the following improvements: 1. new ZIL blocks can be generated and issued, while there's already oustanding ZIL blocks being serviced by the storage. 2. the latency of zil_commit() should be proportional to the underlying storage latency, rather than the incoming synchronous workload. Porting Notes ============= Due to the changes made in commit `119a394ab0`, the lifetime of an itx structure differs than in OpenZFS. Specifically, the itx structure is kept around until the data associated with the itx is considered to be safe on disk; this is so that the itx's callback can be called after the data is committed to stable storage. Since OpenZFS doesn't have this itx callback mechanism, it's able to destroy the itx structure immediately after the itx is committed to an lwb (before the lwb is written to disk). To support this difference, and to ensure the itx's callbacks can still be called after the itx's data is on disk, a few changes had to be made: * A list of itxs was added to the lwb structure. This list contains all of the itxs that have been committed to the lwb, such that the callbacks for these itxs can be called from zil_lwb_flush_vdevs_done(), after the data for the itxs is committed to disk. * A list of itxs was added on the stack of the zil_process_commit_list() function; the "nolwb_itxs" list. In some circumstances, an itx may not be committed to an lwb (e.g. if allocating the "next" ZIL block on disk fails), so this list is used to keep track of which itxs fall into this state, such that their callbacks can be called after the ZIL's writer pipeline is "stalled". * The logic to actually call the itx's callback was moved into the zil_itx_destroy() function. Since all consumers of zil_itx_destroy() were effectively performing the same logic (i.e. if callback is non-null, call the callback), it seemed like useful code cleanup to consolidate this logic into a single function. Additionally, the existing Linux tracepoint infrastructure dealing with the ZIL's probes and structures had to be updated to reflect these code changes. Specifically: * The "zil__cw1" and "zil__cw2" probes were removed, so they had to be removed from "trace_zil.h" as well. * Some of the zilog structure's fields were removed, which affected the tracepoint definitions of the structure. * New tracepoints had to be added for the following 3 new probes: * zil__process__commit__itx * zil__process__normal__itx * zil__commit__io__error OpenZFS-issue: https://www.illumos.org/issues/8585 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5d95a3a Closes #6566	2017-12-05 09:39:16 -08:00
Brian Behlendorf	94183a9d8a	Update for cppcheck v1.80 Resolve new warnings and errors from cppcheck v1.80. * [lib/libshare/libshare.c:543]: (warning) Possible null pointer dereference: protocol * [lib/libzfs/libzfs_dataset.c:2323]: (warning) Possible null pointer dereference: srctype * [lib/libzfs/libzfs_import.c:318]: (error) Uninitialized variable: link * [module/zfs/abd.c:353]: (error) Uninitialized variable: sg * [module/zfs/abd.c:353]: (error) Uninitialized variable: i * [module/zfs/abd.c:385]: (error) Uninitialized variable: sg * [module/zfs/abd.c:385]: (error) Uninitialized variable: i * [module/zfs/abd.c:553]: (error) Uninitialized variable: i * [module/zfs/abd.c:553]: (error) Uninitialized variable: sg * [module/zfs/abd.c:763]: (error) Uninitialized variable: i * [module/zfs/abd.c:763]: (error) Uninitialized variable: sg * [module/zfs/abd.c:305]: (error) Uninitialized variable: tmp_page * [module/zfs/zpl_xattr.c:342]: (warning) Possible null pointer dereference: value * [module/zfs/zvol.c:208]: (error) Uninitialized variable: p Convert the following suppression to inline. * [module/zfs/zfs_vnops.c:840]: (error) Possible null pointer dereference: aiov Exclude HAVE_UIO_ZEROCOPY and HAVE_DNLC from analysis since these macro's will never be defined until this functionality is implemented. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6879	2017-11-18 14:08:00 -08:00
LOLi	99834d1950	Fix truncate(2) mtime and ctime handling On Linux, ftruncate(2) always changes the file timestamps, even if the file size is not changed. However, in case of a successfull truncate(2), the timestamps are updated only if the file size changes. This translates to the VFS calling the ZFS Posix Layer "setattr" function (zpl_setattr) with ATTR_MTIME and ATTR_CTIME unconditionally set on the iattr mask only when doing a ftruncate(2), while the truncate(2) is left to the filesystem implementation to be dealt with. This behaviour is consistent with POSIX:2004/SUSv3 specifications where there's no explicit requirement for file size changes to update the timestamps only for ftruncate(2): http://pubs.opengroup.org/onlinepubs/009695399/functions/truncate.html http://pubs.opengroup.org/onlinepubs/009695399/functions/ftruncate.html This has been later updated in POSIX:2008/SUSv4 where, for both truncate(2)/ftruncate(2), there's no mention of this size change requirement: http://austingroupbugs.net/view.php?id=489 http://pubs.opengroup.org/onlinepubs/9699919799/functions/truncate.html http://pubs.opengroup.org/onlinepubs/9699919799/functions/ftruncate.html Unfortunately the Linux VFS is still calling into the ZPL without ATTR_MTIME/ATTR_CTIME set in the truncate(2) case: we fix this by explicitly updating the timestamps when detecting the ATTR_SIZE bit, which is always set in do_truncate(), on the iattr mask. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #6811 Closes #6819	2017-11-13 09:24:26 -08:00
Tom Caputi	440a3eb939	Fixes for #6639 Several issues were uncovered by running stress tests with zfs encryption and raw sends in particular. The issues and their associated fixes are as follows: * arc_read_done() has the ability to chain several requests for the same block of data via the arc_callback_t struct. In these cases, the ARC would only use the first request's dsobj from the bookmark to decrypt the data. This is problematic because the first request might be a prefetch zio which is able to handle the key not being loaded, while the second might use a different key that it is sure will work. The fix here is to pass the dsobj with each individual arc_callback_t so that each request can attempt to decrypt the data separately. * DRR_FREE and DRR_FREEOBJECT records in a send file were not having their transactions properly tagged as raw during raw sends, which caused a panic when the dbuf code attempted to decrypt these blocks. * traverse_prefetch_metadata() did not properly set ZIO_FLAG_SPECULATIVE when issuing prefetch IOs. * Added a few asserts and code cleanups to ensure these issues are more detectable in the future. Signed-off-by: Tom Caputi <tcaputi@datto.com>	2017-10-11 16:55:50 -04:00
Tobin Harding	d95a59805f	Remove unnecessary equality check Currently `if` statement includes an assignment (from a function return value) and a equality check. The parenthesis are in the incorrect place, currently the code clobbers the function return value because of this. We can fix this by simplifying the `if` statement. `if (foo != 0)` can be more succinctly expressed as `if (foo)` Remove the equality check, add parenthesis to correct the statement. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Tobin C. Harding <me@tobin.cc> Closes #6685 Close #6719	2017-10-05 19:33:44 -07:00
Giuseppe Di Natale	34d00e7aba	Correct cppcheck errors ZFS buildbot STYLE builder was moved to Ubuntu 17.04 which has a newer version of cppcheck. Handle the new cppcheck errors. uu_* functions removed in this commit were unused and effectively dead code. They are now retired. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov> Closes #6653	2017-09-19 12:17:29 -07:00
LOLi	f763c3d1df	Fix range locking in ZIL commit codepath Since OpenZFS 7578 (`1b7c1e5`) if we have a ZVOL with logbias=throughput we will force WR_INDIRECT itxs in zvol_log_write() setting itx->itx_lr offset and length to the offset and length of the BIO from zvol_write()->zvol_log_write(): these offset and length are later used to take a range lock in zillog->zl_get_data function: zvol_get_data(). Now suppose we have a ZVOL with blocksize=8K and push 4K writes to offset 0: we will only be range-locking 0-4096. This means the ASSERTion we make in dbuf_unoverride() is no longer valid because now dmu_sync() is called from zilog's get_data functions holding a partial lock on the dbuf. Fix this by taking a range lock on the whole block in zvol_get_data(). Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #6238 Closes #6315 Closes #6356 Closes #6477	2017-08-21 08:59:48 -07:00
Chunwei Chen	376994828f	Fix NULL pointer when O_SYNC read in snapshot When doing read on a file open with O_SYNC, it will trigger zil_commit. However for snapshot, there's no zil, so we shouldn't be doing that. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #6478 Closes #6494	2017-08-11 08:57:54 -07:00
Ned Bass	ecb2b7dc7f	Use SET_ERROR for constant non-zero return codes Update many return and assignment statements to follow the convention of using the SET_ERROR macro when returning a hard-coded non-zero value from a function. This aids debugging by recording the error codes in the debug log. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #6441	2017-08-02 21:16:12 -07:00
Matthew Ahrens	02dc43bc46	OpenZFS 8378 - crash due to bp in-memory modification of nopwrite block Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> The problem is that zfs_get_data() supplies a stale zgd_bp to dmu_sync(), which we then nopwrite against. zfs_get_data() doesn't hold any DMU-related locks, so after it copies db_blkptr to zgd_bp, dbuf_write_ready() could change db_blkptr, and dbuf_write_done() could remove the dirty record. dmu_sync() then sees the stale BP and that the dbuf it not dirty, so it is eligible for nop-writing. The fix is for dmu_sync() to copy db_blkptr to zgd_bp after acquiring the db_mtx. We could still see a stale db_blkptr, but if it is stale then the dirty record will still exist and thus we won't attempt to nopwrite. OpenZFS-issue: https://www.illumos.org/issues/8378 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3127742 Closes #6293	2017-07-04 15:41:24 -07:00
dbavatar	6e03ec4fa2	Fix lseek result when dnode is dirty Fixup commit `66aca24`. We should have equivalent return values as generic_file_llseek() and advance to end of file. Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Tested-by: bunder2015 <omfgbunder@gmail.com> Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com> Closes #6050 Closes #6053	2017-04-24 09:38:31 -07:00
Debabrata Banerjee	66aca24730	SEEK_HOLE should not block on txg_wait_synced() Force flushing of txg's can be painfully slow when competing for disk IO, since this is a process meant to execute asynchronously. Optimize this path via allowing data/hole seeking if the file is clean, but if dirty fall back to old logic. This is a compromise to disabling the feature entirely. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com> Closes #4306 Closes #5962	2017-04-13 10:51:20 -07:00
Brian Behlendorf	f298b24ddf	Rename zfs_* functions Several functions were renamed when ZFS was originally ported to Linux. Revert the code to the original names to minimize the delta with upstream OpenZFS. zfs_sb_teardown -> zfsvfs_teardown zfs_sb_create -> zfsvfs_create zfs_sb_setup -> zfsvfs_setup zfs_sb_free -> zfsvfs_free get_zfs_sb -> getzfsvfs zfs_sb_hold -> zfsvfs_hold zfs_sb_rele -> zfsvfs_rele zfs_sb_prune_aliases -> zfs_prune_aliases (Linux-only) zfs_sb_prune -> zfs_prune (Linux only) Align the zfs_vnops.h and zfs_vfsops.h with upstream as much as possible. Several prototypes were removed and those that remain were reordered. Move the EXPORT_SYMBOL lines to the end of the source files for consistency with the other source files. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2017-03-10 09:51:35 -08:00
Brian Behlendorf	0037b49e83	Rename zfs_sb_t -> zfsvfs_t The use of zfs_sb_t instead of zfsvfs_t results in unnecessary conflicts with the upstream source. Change all instances of zfs_sb_t to zfsvfs_t including updating the variables names. Whenever possible the code was updated to be consistent with hope it appears in the upstream OpenZFS source. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2017-03-10 09:51:33 -08:00
Giuseppe Di Natale	589bb918ef	Suppress cppcheck nullPointer error in zfs_write Newer versions of cppcheck find the potential NULL pointer bug in zfs_write(). The function is difficult to refactor without extensive work, so suppress the potential NULL pointer error which cannot occur for now. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov> Closes #5882	2017-03-09 17:40:21 -08:00
Chunwei Chen	9b77d1c958	Fix nfs snapdir automount The current implementation for allowing nfs to access snapdir is very buggy. It uses a special fh for snapdirs, such that the next time nfsd does fh_to_dentry, it actually returns the root inode inside the snapshot. So nfsd never knows it cross a mountpoint. The problem is that nfsd will not hold a reference on the vfsmount of the snapshot. This cause auto unmounter to unmount the snapshot even though nfs is still holding dentries in it. To fix this, we return the inode for the snapdirs themselves. However, we also trigger automount upon fh_to_dentry, and return ESTALE so nfsd will revalidate and see the mountpoint and do crossmnt. Because nfsd will now be aware that these are different filesystems users must add crossmnt to their export options to access snapshot directories. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #3794 Closes #4716 Closes #5810 Closes #5833	2017-03-08 09:26:33 -08:00
Brian Behlendorf	ea7e86d8db	Fix iput() calls within a tx As explicitly stated in section 2 of the 'Programming rules' comments at the top of zfs_vnops.c. If you must call iput() within a tx then use zfs_iput_async(). Move iput() calls after dmu_tx_commit() / dmu_tx_abort when possible. When not possible convert the iput() calls to zfs_iput_async(). Reviewed-by: Don Brady <don.brady@intel.com> Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5758	2017-02-08 17:28:22 -08:00
George Melikov	9b7b9cd370	OpenZFS 1300 - filename normalization doesn't work for removes Authored by: Kevin Crowe <kevin.crowe@nexenta.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/1300 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8f1750d Closes #5725 Porting notes: - zap_micro.c: all `MT_EXACT` are replaced by `0`	2017-02-02 14:13:41 -08:00
George Melikov	61ca48ff38	OpenZFS 7256 - low probability race in zfs_get_data Authored by: Andriy Gapon <andriy.gapon@clusterhq.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7256 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/6ed18a8 Closes #5601	2017-01-17 15:18:59 -08:00
ka7	4e33ba4c38	Fix spelling Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Haakan T Johansson <f96hajo@chalmers.se> Closes #5547 Closes #5543	2017-01-03 11:31:18 -06:00
Brian Behlendorf	02730c333c	Use cstyle -cpP in `make cstyle` check Enable picky cstyle checks and resolve the new warnings. The vast majority of the changes needed were to handle minor issues with whitespace formatting. This patch contains no functional changes. Non-whitespace changes are as follows: * 8 times ; to { } in for/while loop * fix missing ; in cmd/zed/agents/zfs_diagnosis.c * comment (confim -> confirm) * change endline , to ; in cmd/zpool/zpool_main.c * a number of /* BEGIN CSTYLED / / END CSTYLED / blocks /* CSTYLED / markers change == 0 to ! * ulong to unsigned long in module/zfs/dsl_scan.c * rearrangement of module_param lines in module/zfs/metaslab.c * add { } block around statement after for_each_online_node Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Håkan Johansson <f96hajo@chalmers.se> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5465	2016-12-12 10:46:26 -08:00
luozhengzheng	32dec7bd1a	Fix coverity defects: CID 147503 CID 147503: Dereference after null check (FORWARD_NULL) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn> Closes #5326	2016-11-10 08:50:32 -08:00
cao	a36cc8d242	Fix coverity defects: CID 147626, 147628 CID 147626: Type:Dereference before null check CID 147628: Type:Dereference before null check Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5304	2016-11-08 14:28:17 -08:00
Chunwei Chen	ace1eae84c	Add support for O_TMPFILE Linux 3.11 add O_TMPFILE to open(2), which allow creating an unlinked file on supported filesystem. It's basically doing open(2) and unlink(2) atomically. The filesystem support is added through i_op->tmpfile. We basically copy the create operation except we get rid of the link and name related stuff and add the new node to unlinked set. We also add support for linkat(2) to link tmpfile. However, since all previous file operation will skip ZIL, we force a txg_wait_synced to make sure we are sync safe. Signed-off-by: Chunwei Chen <david.chen@osnexus.com>	2016-11-04 10:46:40 -07:00
Brian Behlendorf	48d3eb40c7	Add TASKQID_INVALID Add the TASKQID_INVALID macros and update callers to use the macro instead of testing against 0. There is no functional change even though the functions in zfs_ctldir.c incorrectly used -1 instead of 0. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #5347	2016-11-02 12:14:45 -07:00
cao	5a6765cf8c	Fix coverity defects: CID 147472 CID 147472: Type: 'Constant' variable guards dead code Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5288	2016-10-20 11:24:01 -07:00
luozhengzheng	7c502b0b1d	Fix coverity defects: CID 150926 CID 150926: Unchecked return value (CHECKED_RETURN) - This case cannot occur given the existing taskq implementation and flags passed to task_dispatch(). Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn> Closes #5272	2016-10-18 11:32:59 -07:00
lorddoskias	12fa7f3436	Refactor inode->i_mode management Refactor the code in such a way so that inode->i_mode is being set at the same time zp->z_mode is being changed. This has the effect of keeping both in sync without relying on zfs_inode_update. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Closes #5158	2016-09-27 14:08:52 -07:00
BearBabyLiu	609603a5d3	Fix coverity defects coverity scan CID:147504 Type: Explicit null dereferenced Reason: passing null pointer dl to zfs_dirent_unlock Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: BearBabyLiu <liu.huang@zte.com.cn> Closes #5131	2016-09-20 19:09:22 -07:00
Nikolay Borisov	87f9371aef	Simplify time handling logic in zfs_settattr Simplify time handling in zfs_setattr by mimicking the logic in setattr_copy from the linux kernel. In order to achieve this in the case when ZFS' log is being replayed it is necessary to unconditionally set the ctime in zfs_replay_setattr. Also use the timespec_trunc function when assigning values to the generic inode struct. This is currently a noop since zfs sets s_time_gran to 1, however in the future rules about precision might change. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Closes #4916	2016-09-13 12:00:18 -07:00
Simon Klinkert	db707ad094	OpenZFS 6940 - Cannot unlink directories when over quota From user perspective, I would expect that ZFS is always able to remove files and directories even when the quota is exceeded. Authored by: Simon Klinkert <simon.klinkert@gmail.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6940 OpenZFS-issue: https://www.illumos.org/issues/6334 OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/9918916 Closes #5044	2016-08-30 14:33:04 -07:00
Nikolay Borisov	64aefee1b8	Fix interaction between userns uid/gid and SA * When the uid/gid change is handled in zfs_setattr we want to actually adjust the user passed uid to a KUID and write that to disk. * In trace points use the i_uid member without doing translation, since it has already been performed. * Use kuid in zfs_aclset_common Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4928	2016-08-08 10:47:43 -07:00
Nikolay Borisov	2c6abf15ff	Remove znode's z_uid/z_gid member Remove duplicate z_uid/z_gid member which are also held in the generic vfs inode struct. This is done by first removing the members from struct znode and then using the KUID_TO_SUID/KGID_TO_SGID macros to access the respective member from struct inode. In cases where the uid/gids are being marshalled from/to disk, use the newly introduced zfs_(uid\|gid)_(read\|write) functions to properly save the uids rather than the internal kernel representation. Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4685 Issue #227	2016-07-25 13:21:49 -07:00
Chris Dunlop	dfbc86309f	Use native inode->i_nlink instead of znode->z_links A mostly mechanical change, taking into account i_nlink is 32 bits vs ZFS's 64 bit on-disk link count. We revert "xattr dir doesn't get purged during iput" (`ddae16a`) as this is a more Linux-integrated fix for the same issue. In addition, setting the initial link count on a new node has been changed from setting one less than required in zfs_mknode() then incrementing to the correct count in zfs_link_create() (which was somewhat bizarre in the first place), to setting the correct count in zfs_mknode() and not incrementing it in zfs_link_create(). This both means we no longer set the link count in sa_bulk_update() twice (once for the initial incorrect count then again for the correct count), as well as adhering to the Linux requirement of not incrementing a zero link count without I_LINKABLE (see linux commit f4e0c30c). Signed-off-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #4838 Issue #227	2016-07-14 16:25:34 -07:00
Chunwei Chen	540c392793	Fix out-of-bound access in zfs_fillpage The original code will do an out-of-bound access on pl[] during last iteration. ================================================================== BUG: KASAN: stack-out-of-bounds in zfs_getpage+0x14c/0x2d0 [zfs] Read of size 8 by task tmpfile/7850 page:ffffea00017c6dc0 count:0 mapcount:0 mapping: (null) index:0x0 flags: 0xffff8000000000() page dumped because: kasan: bad access detected CPU: 3 PID: 7850 Comm: tmpfile Tainted: G OE 4.6.0+ #3 ffff88005f1b7678 0000000006dbe035 ffff88005f1b7508 ffffffff81635618 ffff88005f1b7678 ffff88005f1b75a0 ffff88005f1b7590 ffffffff81313ee8 ffffea0001ae8dd0 ffff88005f1b7670 0000000000000246 0000000041b58ab3 Call Trace: [<ffffffff81635618>] dump_stack+0x63/0x8b [<ffffffff81313ee8>] kasan_report_error+0x528/0x560 [<ffffffff81278f20>] ? filemap_map_pages+0x5f0/0x5f0 [<ffffffff813144b8>] kasan_report+0x58/0x60 [<ffffffffc12250dc>] ? zfs_getpage+0x14c/0x2d0 [zfs] [<ffffffff81312e4e>] __asan_load8+0x5e/0x70 [<ffffffffc12250dc>] zfs_getpage+0x14c/0x2d0 [zfs] [<ffffffffc1252131>] zpl_readpage+0xd1/0x180 [zfs] [<ffffffff81353c3a>] SyS_execve+0x3a/0x50 [<ffffffff810058ef>] do_syscall_64+0xef/0x180 [<ffffffff81d0ee25>] entry_SYSCALL64_slow_path+0x25/0x25 Memory state around the buggy address: ffff88005f1b7500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff88005f1b7580: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >ffff88005f1b7600: 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 f4 ^ ffff88005f1b7680: f4 f4 f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00 ffff88005f1b7700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ================================================================== Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4705 Issue #4708	2016-05-31 16:01:27 -07:00

1 2 3 4

153 Commits