mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2024-12-28 03:49:38 +03:00

Author	SHA1	Message	Date
Chunwei Chen	b06f40ea9b	Fix ENOSPC in "Handle zap_add() failures in ..." Commit `cc63068` caused an ENOSPC error when copying a large number of files between two directories. The reason is that the patch limits zap leaf expansion to 2 retries and returns ENOSPC on failure. The intent of limiting retries is to prevent pointlessly growing the table to its maximum size when adding a block full of entries with the same name in different case in mixed mode. However, it turns out we cannot use any limit on the retry. When we copy files from one directory in readdir order, we are copying in hash order, one leaf block at a time. This means that if the leaf block in the source directory has expanded 6 times, and you copy the entries in that block, then by the time you need to expand the leaf in the destination directory, you need to expand it 6 times in one go. So any limit on the retry will result in an error where there shouldn't be one. Note that while we do use a different salt for different directories, it seems that the salt/hash function doesn't provide enough randomization of the hash distance to prevent this from happening. Since `cc63068` has already been reverted, this patch adds it back and removes the retry limit. Also, as it turns out, failing on zap_add() has a serious side effect for mzap_upgrade(). When upgrading from micro zap to fat zap, it will call zap_add() to transfer entries one at a time. If it hits any error halfway through, the remaining entries will be lost, causing those files to become orphans. This patch adds a VERIFY to catch it. Reviewed-by: Sanjeev Bagewadi <sanjeev.bagewadi@gmail.com> Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Albert Lee <trisk@forkgnu.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #7401 Closes #7421	2018-07-06 02:46:51 -07:00
Prakash Surya	ef7a79488a	OpenZFS 8997 - ztest assertion failure in zil_lwb_write_issue PROBLEM ======= When `dmu_tx_assign` is called from `zil_lwb_write_issue`, it's possible for either `ERESTART` or `EIO` to be returned. If `ERESTART` is returned, this will cause an assertion to fail directly in `zil_lwb_write_issue`, where the code assumes the return value is `EIO` if `dmu_tx_assign` returns a non-zero value. This can occur if the SPA is suspended when `dmu_tx_assign` is called, and most often occurs when running `zloop`. If `EIO` is returned, this can cause assertions to fail elsewhere in the ZIL code. For example, `zil_commit_waiter_timeout` contains the following logic: lwb_t *nlwb = zil_lwb_write_issue(zilog, lwb); ASSERT3S(lwb->lwb_state, !=, LWB_STATE_OPENED); In this case, if `dmu_tx_assign` returned `EIO` from within `zil_lwb_write_issue`, the `lwb` variable passed in will not be issued to disk. Thus, its `lwb_state` field will remain `LWB_STATE_OPENED` and this assertion will fail. `zil_commit_waiter_timeout` assumes that after it calls `zil_lwb_write_issue`, the `lwb` will be issued to disk, and doesn't handle the case where this is not true; i.e. it doesn't handle the case where `dmu_tx_assign` returns `EIO`. SOLUTION ======== This change modifies the `dmu_tx_assign` function such that `txg_how` is a bitmask, rather than of the `txg_how_t` enum type. Now, the previous `TXG_WAITED` semantics can be used via `TXG_NOTHROTTLE`, along with specifying either `TXG_NOWAIT` or `TXG_WAIT` semantics. Previously, when `TXG_WAITED` was specified, `TXG_NOWAIT` semantics was automatically invoked. This was not ideal when using `TXG_WAITED` within `zil_lwb_write_issue`, leading to the problem described above. Rather, we want to achieve the semantics of `TXG_WAIT`, while also preventing the `tx` from being penalized via the dirty delay throttling. With this change, `zil_lwb_write_issue` can achieve the semantics that it requires by passing in the value `TXG_WAIT | TXG_NOTHROTTLE` to `dmu_tx_assign`. Further, consumers of `dmu_tx_assign` wishing to achieve the old `TXG_WAITED` semantics can pass in the value `TXG_NOWAIT | TXG_NOTHROTTLE`. Authored by: Prakash Surya <prakash.surya@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Porting Notes: - Additionally updated `zfs_tmpfile` to use `TXG_NOTHROTTLE` OpenZFS-issue: https://www.illumos.org/issues/8997 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/19ea6cb0f9 Closes #7084	2018-07-06 02:46:51 -07:00
Brian Behlendorf	f79c0de208	Linux 4.18 compat: inode timespec -> timespec64 Commit torvalds/linux@95582b0 changes the inode i_atime, i_mtime, and i_ctime members from timespecs to timespec64s to make them 2038 safe. As part of this change the current_time() function was also updated to return the timespec64 type. Resolve this issue by introducing a new inode_timespec_t type which is defined to match the timespec type used by the inode. It should be used when working with inode timestamps to ensure matching types. The timestruc_t type under Illumos was used in a similar fashion but was specified to always be a timespec_t. Rather than incorrectly define this type, all timespec_t types have been replaced by the new inode_timespec_t type. Finally, the kernel and user space 'sys/time.h' headers were aligned with each other. They define, as appropriate for the context, several constants as macros and include static inline implementations of gethrestime(), gethrestime_sec(), and gethrtime(). Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7643 Backported-by: Richard Yao <ryao@gentoo.org>	2018-07-06 02:46:51 -07:00
Brian Behlendorf	0ee129199f	RHEL 7.5 compat: FMODE_KABI_ITERATE As of RHEL 7.5 the mainline fops.iterate() method was added to the file_operations structure and is correctly detected by the configure script. Normally this is what we want, but in order to maintain KABI compatibility the RHEL change additionally does the following: * Requires that callers intending to use this extended interface set the FMODE_KABI_ITERATE flag on the file structure when opening the directory. * Adds the fops.iterate() method to the end of the structure, without removing fops.readdir(). This change updates the configure check to ignore the RHEL 7.5+ variant of fops.iterate() when detected. Instead, fall back to the fops.readdir() interface, which will be available. Finally, add the 'zpl_' prefix to the directory context wrappers to avoid colliding with the kernel provided symbols when both fops.iterate() and fops.readdir() are provided by the kernel. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7460 Closes #7463	2018-05-07 17:19:57 -07:00
Brian Behlendorf	63f3396233	Fix mmap / libaio deadlock Calling uiomove() in mappedread() under the page lock can result in a deadlock if the user space page needs to be faulted in. Resolve the issue by dropping the page lock before the uiomove(). The inode range lock protects against concurrent updates via zfs_read() and zfs_write(). Reviewed-by: Albert Lee <trisk@forkgnu.org> Reviewed-by: Chunwei Chen <david.chen@nutanix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7335 Closes #7339	2018-05-07 17:19:57 -07:00
Tony Hutter	9a2e90c9fc	Revert "Handle zap_add() failures in mixed ... " This reverts commit `cc63068e95`. Under certain circumstances this change can result in an ENOSPC error when adding new files to a directory. See #7401 for full details. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Issue #7401 Closes #7416	2018-04-09 17:29:59 -04:00
sanjeevbagewadi	b3da003ebf	Handle zap_add() failures in mixed case mode With "casesensitivity=mixed", zap_add() could fail when the number of files/directories with the same name (varying in case) exceeds the capacity of the leaf node of a Fatzap. This results in an ASSERT() failure as zfs_link_create() does not expect zap_add() to fail. The fix is to handle these failures and roll back the transactions. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Chunwei Chen <david.chen@nutanix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Sanjeev Bagewadi <sanjeev.bagewadi@gmail.com> Closes #7011 Closes #7054	2018-03-14 16:10:37 -07:00
Brian Behlendorf	aebc5df418	Update for cppcheck v1.80 Resolve new warnings and errors from cppcheck v1.80. * [lib/libshare/libshare.c:543]: (warning) Possible null pointer dereference: protocol * [lib/libzfs/libzfs_dataset.c:2323]: (warning) Possible null pointer dereference: srctype * [lib/libzfs/libzfs_import.c:318]: (error) Uninitialized variable: link * [module/zfs/abd.c:353]: (error) Uninitialized variable: sg * [module/zfs/abd.c:353]: (error) Uninitialized variable: i * [module/zfs/abd.c:385]: (error) Uninitialized variable: sg * [module/zfs/abd.c:385]: (error) Uninitialized variable: i * [module/zfs/abd.c:553]: (error) Uninitialized variable: i * [module/zfs/abd.c:553]: (error) Uninitialized variable: sg * [module/zfs/abd.c:763]: (error) Uninitialized variable: i * [module/zfs/abd.c:763]: (error) Uninitialized variable: sg * [module/zfs/abd.c:305]: (error) Uninitialized variable: tmp_page * [module/zfs/zpl_xattr.c:342]: (warning) Possible null pointer dereference: value * [module/zfs/zvol.c:208]: (error) Uninitialized variable: p Convert the following suppression to inline. * [module/zfs/zfs_vnops.c:840]: (error) Possible null pointer dereference: aiov Exclude HAVE_UIO_ZEROCOPY and HAVE_DNLC from analysis since these macros will never be defined until this functionality is implemented. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6879	2018-01-30 10:27:31 -06:00
LOLi	fedc1d96a8	Fix truncate(2) mtime and ctime handling On Linux, ftruncate(2) always changes the file timestamps, even if the file size is not changed. However, in case of a successful truncate(2), the timestamps are updated only if the file size changes. This translates to the VFS calling the ZFS Posix Layer "setattr" function (zpl_setattr) with ATTR_MTIME and ATTR_CTIME unconditionally set on the iattr mask only when doing an ftruncate(2), while the truncate(2) is left to the filesystem implementation to be dealt with. This behaviour is consistent with POSIX:2004/SUSv3 specifications where there's no explicit requirement for file size changes to update the timestamps only for ftruncate(2): http://pubs.opengroup.org/onlinepubs/009695399/functions/truncate.html http://pubs.opengroup.org/onlinepubs/009695399/functions/ftruncate.html This has been later updated in POSIX:2008/SUSv4 where, for both truncate(2)/ftruncate(2), there's no mention of this size change requirement: http://austingroupbugs.net/view.php?id=489 http://pubs.opengroup.org/onlinepubs/9699919799/functions/truncate.html http://pubs.opengroup.org/onlinepubs/9699919799/functions/ftruncate.html Unfortunately the Linux VFS is still calling into the ZPL without ATTR_MTIME/ATTR_CTIME set in the truncate(2) case: we fix this by explicitly updating the timestamps when detecting the ATTR_SIZE bit, which is always set in do_truncate(), on the iattr mask. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #6811 Closes #6819	2017-11-21 13:11:29 -06:00
Tobin Harding	80cc2f6111	Remove unnecessary equality check Currently the `if` statement includes an assignment (from a function return value) and an equality check. The parentheses are in the incorrect place, so the code currently clobbers the function return value. We can fix this by simplifying the `if` statement. `if (foo != 0)` can be more succinctly expressed as `if (foo)`. Remove the equality check and add parentheses to correct the statement. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Tobin C. Harding <me@tobin.cc> Closes #6685 Close #6719	2017-10-16 10:57:55 -07:00
Giuseppe Di Natale	bef6a8bc3a	Correct cppcheck errors (#6662) ZFS buildbot STYLE builder was moved to Ubuntu 17.04 which has a newer version of cppcheck. Handle the new cppcheck errors. uu_* functions removed in this commit were unused and effectively dead code. They are now retired. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov> Closes #6653	2017-09-20 12:59:21 -07:00
LOLi	ae5b4a05ff	Fix range locking in ZIL commit codepath Since OpenZFS 7578 (`1b7c1e5`) if we have a ZVOL with logbias=throughput we will force WR_INDIRECT itxs in zvol_log_write() setting itx->itx_lr offset and length to the offset and length of the BIO from zvol_write()->zvol_log_write(): these offset and length are later used to take a range lock in zillog->zl_get_data function: zvol_get_data(). Now suppose we have a ZVOL with blocksize=8K and push 4K writes to offset 0: we will only be range-locking 0-4096. This means the ASSERTion we make in dbuf_unoverride() is no longer valid because now dmu_sync() is called from zilog's get_data functions holding a partial lock on the dbuf. Fix this by taking a range lock on the whole block in zvol_get_data(). Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #6238 Closes #6315 Closes #6356 Closes #6477	2017-08-21 16:46:54 -07:00
Chunwei Chen	aec4318870	Fix NULL pointer when O_SYNC read in snapshot When doing a read on a file opened with O_SYNC, it will trigger zil_commit. However, for a snapshot there's no zil, so we shouldn't be doing that. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #6478 Closes #6494	2017-08-21 16:41:22 -07:00
Matthew Ahrens	02dc43bc46	OpenZFS 8378 - crash due to bp in-memory modification of nopwrite block Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> The problem is that zfs_get_data() supplies a stale zgd_bp to dmu_sync(), which we then nopwrite against. zfs_get_data() doesn't hold any DMU-related locks, so after it copies db_blkptr to zgd_bp, dbuf_write_ready() could change db_blkptr, and dbuf_write_done() could remove the dirty record. dmu_sync() then sees the stale BP and that the dbuf is not dirty, so it is eligible for nop-writing. The fix is for dmu_sync() to copy db_blkptr to zgd_bp after acquiring the db_mtx. We could still see a stale db_blkptr, but if it is stale then the dirty record will still exist and thus we won't attempt to nopwrite. OpenZFS-issue: https://www.illumos.org/issues/8378 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3127742 Closes #6293	2017-07-04 15:41:24 -07:00
dbavatar	6e03ec4fa2	Fix lseek result when dnode is dirty Fixup commit `66aca24`. We should have equivalent return values as generic_file_llseek() and advance to end of file. Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Tested-by: bunder2015 <omfgbunder@gmail.com> Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com> Closes #6050 Closes #6053	2017-04-24 09:38:31 -07:00
Debabrata Banerjee	66aca24730	SEEK_HOLE should not block on txg_wait_synced() Force flushing of txg's can be painfully slow when competing for disk IO, since this is a process meant to execute asynchronously. Optimize this path via allowing data/hole seeking if the file is clean, but if dirty fall back to old logic. This is a compromise to disabling the feature entirely. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com> Closes #4306 Closes #5962	2017-04-13 10:51:20 -07:00
Brian Behlendorf	f298b24ddf	Rename zfs_* functions Several functions were renamed when ZFS was originally ported to Linux. Revert the code to the original names to minimize the delta with upstream OpenZFS. zfs_sb_teardown -> zfsvfs_teardown zfs_sb_create -> zfsvfs_create zfs_sb_setup -> zfsvfs_setup zfs_sb_free -> zfsvfs_free get_zfs_sb -> getzfsvfs zfs_sb_hold -> zfsvfs_hold zfs_sb_rele -> zfsvfs_rele zfs_sb_prune_aliases -> zfs_prune_aliases (Linux-only) zfs_sb_prune -> zfs_prune (Linux only) Align the zfs_vnops.h and zfs_vfsops.h with upstream as much as possible. Several prototypes were removed and those that remain were reordered. Move the EXPORT_SYMBOL lines to the end of the source files for consistency with the other source files. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2017-03-10 09:51:35 -08:00
Brian Behlendorf	0037b49e83	Rename zfs_sb_t -> zfsvfs_t The use of zfs_sb_t instead of zfsvfs_t results in unnecessary conflicts with the upstream source. Change all instances of zfs_sb_t to zfsvfs_t including updating the variable names. Whenever possible the code was updated to be consistent with how it appears in the upstream OpenZFS source. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2017-03-10 09:51:33 -08:00
Giuseppe Di Natale	589bb918ef	Suppress cppcheck nullPointer error in zfs_write Newer versions of cppcheck find the potential NULL pointer bug in zfs_write(). The function is difficult to refactor without extensive work, so suppress the potential NULL pointer error which cannot occur for now. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov> Closes #5882	2017-03-09 17:40:21 -08:00
Chunwei Chen	9b77d1c958	Fix nfs snapdir automount The current implementation for allowing nfs to access snapdir is very buggy. It uses a special fh for snapdirs, such that the next time nfsd does fh_to_dentry, it actually returns the root inode inside the snapshot. So nfsd never knows it crosses a mountpoint. The problem is that nfsd will not hold a reference on the vfsmount of the snapshot. This causes the auto unmounter to unmount the snapshot even though nfs is still holding dentries in it. To fix this, we return the inode for the snapdirs themselves. However, we also trigger automount upon fh_to_dentry, and return ESTALE so nfsd will revalidate and see the mountpoint and do crossmnt. Because nfsd will now be aware that these are different filesystems, users must add crossmnt to their export options to access snapshot directories. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #3794 Closes #4716 Closes #5810 Closes #5833	2017-03-08 09:26:33 -08:00
Brian Behlendorf	ea7e86d8db	Fix iput() calls within a tx As explicitly stated in section 2 of the 'Programming rules' comments at the top of zfs_vnops.c, if you must call iput() within a tx then use zfs_iput_async(). Move iput() calls after dmu_tx_commit() / dmu_tx_abort() when possible. When not possible, convert the iput() calls to zfs_iput_async(). Reviewed-by: Don Brady <don.brady@intel.com> Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5758	2017-02-08 17:28:22 -08:00
George Melikov	9b7b9cd370	OpenZFS 1300 - filename normalization doesn't work for removes Authored by: Kevin Crowe <kevin.crowe@nexenta.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/1300 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8f1750d Closes #5725 Porting notes: - zap_micro.c: all `MT_EXACT` are replaced by `0`	2017-02-02 14:13:41 -08:00
George Melikov	61ca48ff38	OpenZFS 7256 - low probability race in zfs_get_data Authored by: Andriy Gapon <andriy.gapon@clusterhq.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7256 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/6ed18a8 Closes #5601	2017-01-17 15:18:59 -08:00
ka7	4e33ba4c38	Fix spelling Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Haakan T Johansson <f96hajo@chalmers.se> Closes #5547 Closes #5543	2017-01-03 11:31:18 -06:00
Brian Behlendorf	02730c333c	Use cstyle -cpP in `make cstyle` check Enable picky cstyle checks and resolve the new warnings. The vast majority of the changes needed were to handle minor issues with whitespace formatting. This patch contains no functional changes. Non-whitespace changes are as follows: * 8 times ; to { } in for/while loop * fix missing ; in cmd/zed/agents/zfs_diagnosis.c * comment (confim -> confirm) * change endline , to ; in cmd/zpool/zpool_main.c * a number of /* BEGIN CSTYLED */ /* END CSTYLED */ blocks * /* CSTYLED */ markers * change == 0 to ! * ulong to unsigned long in module/zfs/dsl_scan.c * rearrangement of module_param lines in module/zfs/metaslab.c * add { } block around statement after for_each_online_node Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Håkan Johansson <f96hajo@chalmers.se> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5465	2016-12-12 10:46:26 -08:00
luozhengzheng	32dec7bd1a	Fix coverity defects: CID 147503 CID 147503: Dereference after null check (FORWARD_NULL) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn> Closes #5326	2016-11-10 08:50:32 -08:00
cao	a36cc8d242	Fix coverity defects: CID 147626, 147628 CID 147626: Type:Dereference before null check CID 147628: Type:Dereference before null check Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5304	2016-11-08 14:28:17 -08:00
Chunwei Chen	ace1eae84c	Add support for O_TMPFILE Linux 3.11 added O_TMPFILE to open(2), which allows creating an unlinked file on a supported filesystem. It's basically doing open(2) and unlink(2) atomically. The filesystem support is added through i_op->tmpfile. We basically copy the create operation, except we get rid of the link and name related stuff and add the new node to the unlinked set. We also add support for linkat(2) to link a tmpfile. However, since all previous file operations will skip the ZIL, we force a txg_wait_synced to make sure we are sync safe. Signed-off-by: Chunwei Chen <david.chen@osnexus.com>	2016-11-04 10:46:40 -07:00
Brian Behlendorf	48d3eb40c7	Add TASKQID_INVALID Add the TASKQID_INVALID macros and update callers to use the macro instead of testing against 0. There is no functional change even though the functions in zfs_ctldir.c incorrectly used -1 instead of 0. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #5347	2016-11-02 12:14:45 -07:00
cao	5a6765cf8c	Fix coverity defects: CID 147472 CID 147472: Type: 'Constant' variable guards dead code Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5288	2016-10-20 11:24:01 -07:00
luozhengzheng	7c502b0b1d	Fix coverity defects: CID 150926 CID 150926: Unchecked return value (CHECKED_RETURN) - This case cannot occur given the existing taskq implementation and flags passed to task_dispatch(). Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn> Closes #5272	2016-10-18 11:32:59 -07:00
lorddoskias	12fa7f3436	Refactor inode->i_mode management Refactor the code in such a way so that inode->i_mode is being set at the same time zp->z_mode is being changed. This has the effect of keeping both in sync without relying on zfs_inode_update. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Closes #5158	2016-09-27 14:08:52 -07:00
BearBabyLiu	609603a5d3	Fix coverity defects coverity scan CID:147504 Type: Explicit null dereferenced Reason: passing null pointer dl to zfs_dirent_unlock Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: BearBabyLiu <liu.huang@zte.com.cn> Closes #5131	2016-09-20 19:09:22 -07:00
Nikolay Borisov	87f9371aef	Simplify time handling logic in zfs_setattr Simplify time handling in zfs_setattr by mimicking the logic in setattr_copy from the Linux kernel. In order to achieve this in the case when ZFS' log is being replayed, it is necessary to unconditionally set the ctime in zfs_replay_setattr. Also use the timespec_trunc function when assigning values to the generic inode struct. This is currently a noop since zfs sets s_time_gran to 1; however, in the future rules about precision might change. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Closes #4916	2016-09-13 12:00:18 -07:00
Simon Klinkert	db707ad094	OpenZFS 6940 - Cannot unlink directories when over quota From user perspective, I would expect that ZFS is always able to remove files and directories even when the quota is exceeded. Authored by: Simon Klinkert <simon.klinkert@gmail.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6940 OpenZFS-issue: https://www.illumos.org/issues/6334 OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/9918916 Closes #5044	2016-08-30 14:33:04 -07:00
Nikolay Borisov	64aefee1b8	Fix interaction between userns uid/gid and SA * When the uid/gid change is handled in zfs_setattr we want to actually adjust the user passed uid to a KUID and write that to disk. * In trace points use the i_uid member without doing translation, since it has already been performed. * Use kuid in zfs_aclset_common Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4928	2016-08-08 10:47:43 -07:00
Nikolay Borisov	2c6abf15ff	Remove znode's z_uid/z_gid member Remove the duplicate z_uid/z_gid members, which are also held in the generic vfs inode struct. This is done by first removing the members from struct znode and then using the KUID_TO_SUID/KGID_TO_SGID macros to access the respective member from struct inode. In cases where the uid/gids are being marshalled from/to disk, use the newly introduced zfs_(uid|gid)_(read|write) functions to properly save the uids rather than the internal kernel representation. Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4685 Issue #227	2016-07-25 13:21:49 -07:00
Chris Dunlop	dfbc86309f	Use native inode->i_nlink instead of znode->z_links A mostly mechanical change, taking into account i_nlink is 32 bits vs ZFS's 64 bit on-disk link count. We revert "xattr dir doesn't get purged during iput" (`ddae16a`) as this is a more Linux-integrated fix for the same issue. In addition, setting the initial link count on a new node has been changed from setting one less than required in zfs_mknode() then incrementing to the correct count in zfs_link_create() (which was somewhat bizarre in the first place), to setting the correct count in zfs_mknode() and not incrementing it in zfs_link_create(). This both means we no longer set the link count in sa_bulk_update() twice (once for the initial incorrect count then again for the correct count), as well as adhering to the Linux requirement of not incrementing a zero link count without I_LINKABLE (see linux commit f4e0c30c). Signed-off-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #4838 Issue #227	2016-07-14 16:25:34 -07:00
Chunwei Chen	540c392793	Fix out-of-bound access in zfs_fillpage The original code will do an out-of-bound access on pl[] during last iteration. ================================================================== BUG: KASAN: stack-out-of-bounds in zfs_getpage+0x14c/0x2d0 [zfs] Read of size 8 by task tmpfile/7850 page:ffffea00017c6dc0 count:0 mapcount:0 mapping: (null) index:0x0 flags: 0xffff8000000000() page dumped because: kasan: bad access detected CPU: 3 PID: 7850 Comm: tmpfile Tainted: G OE 4.6.0+ #3 ffff88005f1b7678 0000000006dbe035 ffff88005f1b7508 ffffffff81635618 ffff88005f1b7678 ffff88005f1b75a0 ffff88005f1b7590 ffffffff81313ee8 ffffea0001ae8dd0 ffff88005f1b7670 0000000000000246 0000000041b58ab3 Call Trace: [<ffffffff81635618>] dump_stack+0x63/0x8b [<ffffffff81313ee8>] kasan_report_error+0x528/0x560 [<ffffffff81278f20>] ? filemap_map_pages+0x5f0/0x5f0 [<ffffffff813144b8>] kasan_report+0x58/0x60 [<ffffffffc12250dc>] ? zfs_getpage+0x14c/0x2d0 [zfs] [<ffffffff81312e4e>] __asan_load8+0x5e/0x70 [<ffffffffc12250dc>] zfs_getpage+0x14c/0x2d0 [zfs] [<ffffffffc1252131>] zpl_readpage+0xd1/0x180 [zfs] [<ffffffff81353c3a>] SyS_execve+0x3a/0x50 [<ffffffff810058ef>] do_syscall_64+0xef/0x180 [<ffffffff81d0ee25>] entry_SYSCALL64_slow_path+0x25/0x25 Memory state around the buggy address: ffff88005f1b7500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff88005f1b7580: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >ffff88005f1b7600: 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 f4 ^ ffff88005f1b7680: f4 f4 f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00 ffff88005f1b7700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ================================================================== Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4705 Issue #4708	2016-05-31 16:01:27 -07:00
Nikolay Borisov	278f223668	Kill znode->z_gen field This field is a duplicate of the inode->i_generation, so just kill it. Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4538 Closes #4654	2016-05-19 13:06:14 -07:00
Chunwei Chen	d88895a069	Remove dummy znode from zvol_state struct zvol_state contains a dummy znode, which is around 1KB on x64, only for zfs_range_lock. But in reality, other than z_range_lock and z_range_avl, zfs_range_lock only needs the znode for regular files, which means we add 1KB to a structure and gain nothing. In this patch, we remove the dummy znode from zvol_state. In order to do that, we also need to refactor zfs_range_lock a bit. We move the z_range_lock and z_range_avl pair out of znode_t to form zfs_rlock_t. This new struct replaces znode_t as the main handle inside the range lock functions. We also add pointers to z_size, z_blksz, and z_max_blksz so the range lock code doesn't depend on znode_t. This allows non-ZPL consumers like Lustre to use the range locks with their equivalent znode_t structure. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4510	2016-05-17 10:29:02 -07:00
Brian Behlendorf	c15706490e	Revert "Kill znode->z_gen field" This reverts commit `4cd77889b6`. The i_generation field in the inode is 32-bit and the SA code expects 64-bit fixed values. Revert this optimization for now until this is cleanly addressed. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4538	2016-05-12 13:36:22 -07:00
Nikolay Borisov	4cd77889b6	Kill znode->z_gen field This field is a duplicate of the inode->i_generation, so just kill it Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4538	2016-05-02 11:22:31 -07:00
Brian Behlendorf	da5e151f20	Add pn_alloc()/pn_free() functions In order to remove the HAVE_PN_UTILS wrappers the pn_alloc() and pn_free() functions must be implemented. The existing illumos implementations were used for this purpose. The `flags` argument which was used in places wrapped by the HAVE_PN_UTILS condition has been added back to the zfs_remove() and zfs_link() functions. This removes a small point of divergence between the ZoL code and upstream. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4522	2016-04-21 09:49:25 -07:00
Chunwei Chen	704cd0758a	Enable lazytime semantic for atime Linux 4.0 introduces lazytime. The idea is that when we update the atime, we delay writing it to disk for as long as it is reasonably possible. When lazytime is enabled, dirty_inode will be called with only the I_DIRTY_TIME flag whenever i_atime is updated. So under such a condition, we will set z_atime_dirty. We will only write it to disk if the file is closed, the inode is evicted or setattr is called. Ideally, we should also write it whenever the SA is going to be updated, but that is left for future improvement. There's one thing that we should take care of now that we allow i_atime to be dirty. In the original implementation, whenever the SA is modified, zfs_inode_update will be called to overwrite everything in the inode. This will cause a dirty i_atime to be discarded. We fix this by not overwriting i_atime in zfs_inode_update. We only overwrite i_atime when allocating a new inode or doing zfs_rezget with zfs_inode_update_new. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4482	2016-04-05 18:55:51 -07:00
Chunwei Chen	0df9673f01	Fix atime handling and relatime The problem for atime: We have 3 places for atime: inode->i_atime, znode->z_atime and SA. And its handling is a mess. A huge part of the mess regarding atime comes from zfs_tstamp_update_setup, zfs_inode_update, and zfs_getattr, which behave inconsistently with those three values. zfs_tstamp_update_setup clears z_atime_dirty unconditionally as long as you don't pass ATTR_ATIME. Which means every write(2) operation which only updates ctime and mtime will cause atime changes to not be written to disk. Also zfs_inode_update from write(2) will replace inode->i_atime with what's inside SA (stale), but doesn't touch z_atime. So after read(2) and write(2), you'll have i_atime(stale), z_atime(new), SA(stale) and z_atime_dirty=0. Now, if you do stat(2), zfs_getattr will actually replace i_atime with what's inside z_atime. So now you'll have i_atime(new), z_atime(new), SA(stale) and z_atime_dirty=0. These will all be gone after umount, and you'll be left with a stale atime. The problem for relatime: We do have a relatime config inside the ZFS dataset, but how it should interact with the mount flag MS_RELATIME is not well defined. It seems it wanted the relatime mount option to override the dataset config by showing it as temporary in `zfs get`. But at the same time, `zfs set relatime=on\|off` would also seem to want to override the mount option. Not to mention that the MS_RELATIME flag is actually never passed into ZFS, so it never really worked. How Linux handles atime: The Linux kernel actually handles atime completely in VFS, except for writing it to disk. So if we remove the atime handling in ZFS, things would just work, no matter whether it's strictatime, relatime, noatime, or even O_NOATIME. And whenever VFS updates the i_atime, it will notify the underlying filesystem via sb->dirty_inode(). And also there's one thing to note about atime flags like MS_RELATIME and other flags like MS_NODEV, etc.
They are mount point flags rather than filesystem (sb) flags. Since a native Linux filesystem can be mounted at multiple places at the same time, they can all have different atime settings. So these flags are never passed down to filesystem drivers. What this patch tries to do: We remove znode->z_atime, since we won't gain anything from it. We remove most of the atime handling and leave it to VFS. The only thing we do with atime is to write it when dirty_inode() or setattr() is called. We also add file_accessed() in zpl_read() since it's not provided in vfs_read(). After this patch, only the MS_RELATIME flag will have an effect. The setting in the dataset won't do anything. We will make zfsutil mount ZFS with MS_RELATIME set according to the setting in the dataset in a future patch. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4482	2016-04-05 18:54:55 -07:00
Brian Behlendorf	8b1899d3fb	Linux 4.6 compat: PAGE_CACHE_SIZE removal As described in torvalds/linux@4a2d057e the macros PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were originally introduced to make it possible to add bigger chunks to the page cache. This never panned out and it has therefore been removed from the kernel. ZFS has been updated to use the PAGE_{SIZE,SHIFT,MASK,ALIGN} macros and calls to page_cache_release() have been replaced with put_page(). There was no need to introduce a configure check for this because these interfaces have existed for a very long time. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Closes #4489	2016-04-05 17:26:56 -07:00
Simon Klinkert	1a04bab348	Illumos 6334 - Cannot unlink files when over quota 6334 Cannot unlink files when over quota Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Toomas Soome <tsoome@me.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/6334 https://github.com/illumos/illumos-gate/commit/6575bca Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-01-26 15:27:08 -08:00
kernelOfTruth	a966c5640e	Reintroduce zfs_remove() synchronous deletes Reintroduce a slightly adapted version of the Illumos logic for synchronous unlinks. The basic idea here is that only files smaller than zfs_delete_blocks (20480) blocks should be deleted synchronously. Unlinking larger files should be handled asynchronously to minimize impact to the caller. To accomplish this iput() which is responsible for calling zfs_znode_delete() on Linux is only called in the delete_now path. Otherwise zfs_async_iput() is used which allows the last reference to be dropped by a taskq thread effectively making the removal asynchronous. Porting notes: - Add zfs_delete_blocks module option for performance analysis. The default value is DMU_MAX_DELETEBLKCNT which is the same as upstream. Reducing this value means that smaller files will be unlinked asynchronously like large files. - All occurrences of zfsvfs changes to zsb. Ported-by: KernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-01-26 15:26:02 -08:00
Matthew Ahrens	19d55079ae	Illumos 4950 - files sometimes can't be removed from a full filesystem 4950 files sometimes can't be removed from a full filesystem Reviewed by: Adam Leventhal <adam.leventhal@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed by: Boris Protopopov <bprotopopov@hotmail.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/4950 https://github.com/illumos/illumos-gate/commit/4bb7380 Porting notes: 1. ZoL currently does not log discards to zvols, so the portion of this patch that modifies the discard logging to mark it as freeing space has been discarded. 2. may_delete_now had been removed from zfs_remove() in ZoL. It has been reintroduced. 3. We do not try to emulate vnodes, so the following lines are not valid on Linux: mutex_enter(&vp->v_lock); may_delete_now = vp->v_count == 1 && !vn_has_cached_data(vp); mutex_exit(&vp->v_lock); This has been replaced with: mutex_enter(&zp->z_lock); may_delete_now = atomic_read(&ip->i_count) == 1 && !(zp->z_is_mapped); mutex_exit(&zp->z_lock); Ported-by: Richard Yao <richard.yao@clusterhq.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-01-21 16:59:30 -08:00


142 Commits