In libzutil, zpool_search_import() and zpool_find_config() use
libpc_handle_t internally, which does not maintain an error code and is
not exposed in the interface. As a result, error information is not
propagated to the caller; instead, an error message is printed on
stderr.
This commit adds an lpc_error field to libpc_handle_t and exposes it in
the interface, so that users of libzutil can retrieve the appropriate
error information and handle it accordingly.
Users of the API can also control whether the error message is printed
on stderr.
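As an aside, a self-contained toy (not the actual libzutil API; the field
names merely mirror the description above) showing the pattern of carrying
an error code in the handle while letting the caller decide about stderr:
```
#include <stdio.h>
#include <errno.h>

typedef struct libpc_handle {
	int lpc_error;     /* last error, readable by the caller */
	int lpc_printerr;  /* nonzero: also print a message on stderr */
} libpc_handle_t;

static int
search_import(libpc_handle_t *hdl)
{
	hdl->lpc_error = ENOENT;	/* pretend no pools were found */
	if (hdl->lpc_printerr)
		fprintf(stderr, "no pools available to import\n");
	return (-1);
}

int
main(void)
{
	libpc_handle_t h = { .lpc_error = 0, .lpc_printerr = 0 };

	if (search_import(&h) != 0)
		printf("import failed: error %d\n", h.lpc_error);
	return (0);
}
```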
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes #13969
Coverity caught unsafe use of `strcpy()` in `ztest_dmu_objset_own()`,
`nfs_init_tmpfile()` and `dump_snapshot()`. It also caught an unsafe use
of `strlcat()` in `nfs_init_tmpfile()`.
Inspired by this, I did an audit of every single usage of `strcpy()` and
`strcat()` in the code. If I could not prove that the usage was safe, I
changed the code to use either `strlcpy()` or `strlcat()`, depending on
which function was originally used. In some cases, `snprintf()` was used
to replace multiple uses of `strcat` because it was cleaner.
Whenever I changed a function, I preferred to use `sizeof(dst)` when the
compiler could determine the buffer size that way. When it could not,
because the buffer was passed in by a caller, I checked the entire call
tree of the function to find out how big the buffer was and hardcoded
it. Hardcoding is less than ideal, but it is safe unless someone shrinks
the buffer sizes being passed.
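As a rough illustration of the conversion pattern described above (not
lifted from the patch; the buffer names and the hardcoded size are
placeholders):
```
#include <string.h>
/* strlcpy()/strlcat(): native on BSD and glibc >= 2.38, else via libbsd */

#define	CALLER_BUFLEN	256	/* placeholder: size found by auditing callers */

void
copy_examples(char *passed_buf, const char *src)
{
	char local_buf[64];

	/* The compiler knows the size here, so sizeof() is preferred. */
	strlcpy(local_buf, src, sizeof (local_buf));

	/* The size is not visible here, so the audited size is hardcoded. */
	strlcat(passed_buf, src, CALLER_BUFLEN);
}
```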
Additionally, Coverity reported more string-related issues:
* It caught a case where we do an overlapping memory copy in a call to
`snprintf()`. We fix that via `kmem_strdup()` and `kmem_strfree()`.
* It caught `sizeof (buf)` being used instead of `buflen` in
`zdb_nicenum()`'s call to `zfs_nicenum()`, which is passed to
`snprintf()`. We change that to pass `buflen`.
* It caught a theoretical unterminated string passed to `strcmp()`.
This one is likely a false positive, but we have the information
needed to do this more safely, so we change it to silence the false
positive not just in Coverity, but potentially in other static
analysis tools too. We switch to `strncmp()`.
* There was a false positive in tests/zfs-tests/cmd/dir_rd_update.c. We
suppress it by switching to `snprintf()` since other static analysis
tools might complain about it too. Interestingly, there is a possible
real bug there too, since it assumes that the passed directory path
ends with '/'. We add a '/' to fix that potential bug.
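One of the items above, the `sizeof (buf)` versus `buflen` mix-up, is easy
to demonstrate in isolation (a standalone snippet, not the zdb code):
```
#include <stdio.h>

static void
fill(char *buf, size_t buflen)
{
	/* WRONG: sizeof (buf) is the size of the pointer, e.g. 8 on LP64. */
	snprintf(buf, sizeof (buf), "a fairly long message");

	/* RIGHT: use the length the caller actually passed in. */
	snprintf(buf, buflen, "a fairly long message");
}

int
main(void)
{
	char out[64];

	fill(out, sizeof (out));
	printf("%s\n", out);
	return (0);
}
```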
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13913
In #13871, the fact that zfs_vdev_aggregation_limit_non_rotating and
zfs_vdev_aggregation_limit are signed was pointed out as a possible
reason not to eliminate an unnecessary MAX(unsigned, 0), since the
unsigned value was assigned from them.
There is no reason for these module parameters to be signed and upon
inspection, it was found that there are a number of other module
parameters that are signed, but should not be, so we make them unsigned.
Making them unsigned made it clear that some other variables in the code
should also be unsigned, so we also make those unsigned. This prevents
users from setting negative values that could potentially cause bad
behaviors. It also makes the code slightly easier to understand.
The module parameters made unsigned by this change mostly deal with
timeouts, limits, bit shifts and percentages. Any that are boolean are
left signed, since whether booleans should be considered signed or
unsigned does not matter.
Making zfs_arc_lotsfree_percent unsigned caused a
`zfs_arc_lotsfree_percent >= 0` check to become redundant, so it was
removed. Removing the check was also necessary to prevent a compiler
error from -Werror=type-limits.
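The redundancy that trips -Werror=type-limits is easy to reproduce in a
standalone snippet (illustrative only, not the patched code):
```
#include <stdio.h>

static unsigned int zfs_arc_lotsfree_percent = 10;

int
main(void)
{
	/*
	 * With an unsigned variable, ">= 0" is always true, so GCC's
	 * -Wtype-limits (an error under -Werror=type-limits) flags the
	 * left-hand comparison as redundant.
	 */
	if (zfs_arc_lotsfree_percent >= 0 && zfs_arc_lotsfree_percent <= 100)
		printf("in range\n");
	return (0);
}
```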
Several end of line comments had to be moved to their own lines because
replacing int with uint_t caused us to exceed the 80 character limit
enforced by cstyle.pl.
The following were kept signed because they are passed to
taskq_create(), which expects signed values and modifying the
OpenSolaris/Illumos DDI is out of scope of this patch:
* metaslab_load_pct
* zfs_sync_taskq_batch_pct
* zfs_zil_clean_taskq_nthr_pct
* zfs_zil_clean_taskq_minalloc
* zfs_zil_clean_taskq_maxalloc
* zfs_arc_prune_task_threads
Also, negative values in those parameters were found to be harmless.
The following were left signed because either negative values make
sense, or more analysis was needed to determine whether negative values
should be disallowed:
* zfs_metaslab_switch_threshold
* zfs_pd_bytes_max
* zfs_livelist_min_percent_shared
zfs_multihost_history was made static to be consistent with other
parameters.
A number of module parameters were marked as signed, but in reality
referenced unsigned variables. upgrade_errlog_limit is one of the
numerous examples. In the case of zfs_vdev_async_read_max_active, it was
already uint32_t, but zdb had an extern int declaration for it.
Interestingly, the documentation in zfs.4 was right for
upgrade_errlog_limit despite the module parameter being wrongly marked,
while the documentation for zfs_vdev_async_read_max_active (and friends)
was wrong. It was also wrong for zstd_abort_size, which was unsigned,
but was documented as signed.
Also, the documentation in zfs.4 incorrectly described the following
parameters as ulong when they were int:
* zfs_arc_meta_adjust_restarts
* zfs_override_estimate_recordsize
They are now uint_t as of this patch and thus the man page has been
updated to describe them as uint.
dbuf_state_index was left alone since it does nothing and perhaps should
be removed in another patch.
If any module parameters were missed, it is because they were not found
by `grep -r 'ZFS_MODULE_PARAM' | grep ', INT'`. I did find a few that
this grep missed, but only because they were in files that already had
other hits.
This patch intentionally did not attempt to address whether some of
these module parameters should be elevated to 64-bit parameters, since
a long is only 32 bits wide on 32-bit platforms.
Lastly, it was pointed out during review that uint_t is a better match
for these variables than uint32_t because FreeBSD kernel parameter
definitions are designed for uint_t, whose bit width can change in
future memory models. As a result, we change the existing parameters
that are uint32_t to use uint_t.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13875
Coverity caught these. With the exception of the file descriptor leak in
tests/zfs-tests/cmd/draid.c, they are all memory leaks.
Also, there is a piece of dead code in zfs_get_enclosure_sysfs_path().
We delete it as cleanup.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13921
Coverity complained about unchecked return values and unused values that
turned out to be unused return values.
Different approaches were used to handle the different cases of
unchecked return values:
* cmd/zdb/zdb.c: VERIFY0 was used in one place since the existing code
had no error handling. An error message was printed in another to
match the rest of the code.
* cmd/zed/agents/zfs_retire.c: We dismiss the return value with `(void)`
because the value is expected to be potentially unset.
* cmd/zpool_influxdb/zpool_influxdb.c: We dismiss the return value with
`(void)` because the values are expected to be potentially unset.
* cmd/ztest.c: VERIFY0 was used since we want failures if something goes
wrong in ztest.
* module/zfs/dsl_dir.c: We dismiss the return value with `(void)`
because there is no guarantee that the zap entry will always be there.
For example, old pools imported readonly would not have it and we do
not want to fail here because of that.
* module/zfs/zfs_fm.c: `fnvlist_add_*()` was used since the
allocations sleep and thus can never fail.
* module/zfs/zvol.c: We dismiss the return value with `(void)` because
we do not need it. This matches what is already done in the analogous
`zfs_replay_write2()`.
* tests/zfs-tests/cmd/draid.c: We suppress one return value with
`(void)` since the code handles errors already. The other return value
is handled by switching to `fnvlist_lookup_uint8_array()`.
* tests/zfs-tests/cmd/file/file_fadvise.c: We add error handling.
* tests/zfs-tests/cmd/mmap_sync.c: We add error handling for munmap, but
ignore failures on remove() with (void) since it is expected to be
able to fail.
* tests/zfs-tests/cmd/mmapwrite.c: We add error handling.
As for unused return values, they were all in places where there was
error handling, so logic was added to handle the return values.
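The two idioms used above, explicitly dismissing a return value versus
asserting on it, look like this in isolation (VERIFY0 below is a local
stand-in for the ZFS macro so the snippet compiles on its own):
```
#include <assert.h>
#include <stdio.h>

/* Local stand-in: ZFS's VERIFY0() aborts when the expression is nonzero. */
#define	VERIFY0(expr)	assert((expr) == 0)

static int
maybe_fail(void)
{
	return (0);
}

int
main(void)
{
	/* The value may legitimately be unset or unneeded: dismiss it. */
	(void) maybe_fail();

	/* A failure here would indicate a bug: make it fatal. */
	VERIFY0(maybe_fail());

	printf("done\n");
	return (0);
}
```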
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13920
Coverity complains about this. It is not a bug as long as we never shift
by more than 31, but it is not terrible to change the constants from 1
to 1ULL as cleanup.
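A standalone illustration of why widening the constant matters once the
shift count can reach 32 or more (not the actual call site):
```
#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	int shift = 40;

	/*
	 * 1 << shift would be evaluated as a 32-bit int: shifting by 32 or
	 * more is undefined behavior and can never set bits above 31.
	 * Promoting the constant makes the shift happen in 64 bits.
	 */
	uint64_t wide = 1ULL << shift;

	printf("%llu\n", (unsigned long long)wide);
	return (0);
}
```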
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13914
I see a few issues in the issue tracker that might be aided by being
able to turn this on. We have no module parameter for it, so I would
like to add one.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13874
Coverity found a file descriptor leak. Eyeballing it showed that we had
no handling for the `open()` call failing either. We can address both of
these at once.
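A minimal sketch of the combined fix pattern, checking open() and closing
the descriptor on every path (generic code, not the actual change):
```
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static int
read_small_file(const char *path)
{
	char buf[16];
	ssize_t n;
	int fd = open(path, O_RDONLY);

	if (fd < 0) {
		fprintf(stderr, "open %s: %s\n", path, strerror(errno));
		return (-1);
	}
	n = read(fd, buf, sizeof (buf) - 1);
	(void) close(fd);		/* closed on every path: no fd leak */
	if (n < 0)
		return (-1);
	buf[n] = '\0';
	return (atoi(buf));
}

int
main(int argc, char **argv)
{
	if (argc > 1)
		printf("%d\n", read_small_file(argv[1]));
	return (0);
}
```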
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13862
Allocate zdb_cb_t on the heap since it is too large for the stack on
Windows, which results in `zdb` crashing immediately.
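The general shape of the fix, moving an oversized structure off the stack
(the struct contents here are placeholders, not the real zdb callback
structure):
```
#include <stdio.h>
#include <stdlib.h>

/* Placeholder for a structure far too large for a thread stack. */
typedef struct big_cb {
	char scratch[4 * 1024 * 1024];
	int count;
} big_cb_t;

int
main(void)
{
	/* "big_cb_t zcb;" as a local would risk overflowing the stack. */
	big_cb_t *zcb = calloc(1, sizeof (big_cb_t));

	if (zcb == NULL)
		return (1);
	zcb->count = 42;
	printf("%d\n", zcb->count);
	free(zcb);
	return (0);
}
```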
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrew Innes <andrew.c12@gmail.com>
Co-authored-by: Jorgen Lundman <lundman@lundman.net>
Closes #13807
The SA attribute containing the symlink target does not include a nul
terminator, so when printing the target zdb would sometimes include
garbage at the end of the string.
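A standalone way to print a length-delimited, non-nul-terminated buffer
safely (illustrative only, not the exact zdb change):
```
#include <stdio.h>

int
main(void)
{
	/* Pretend this came from an SA attribute: length known, no '\0'. */
	const char target[6] = { '/', 't', 'm', 'p', '/', 'x' };
	int len = 6;

	/* %.*s stops after len bytes instead of running into garbage. */
	printf("target: %.*s\n", len, target);
	return (0);
}
```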
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #13482
Even on Illumos it's only available in the 32-bit programming
environment, and, quoth enable_extended_FILE_stdio(3C):
> Historically, 32-bit Solaris applications have been limited to using
> only the file descriptors 0 through 255 with the standard I/O
> functions (see stdio(3C)) in the C library. The extended FILE
> facility allows well-behaved 32-bit applications to use any
> valid file descriptor with the standard I/O functions.
where "well-behaved" means that it
> does not directly access any fields in the FILE structure pointed
> to by the FILE pointer associated with any standard I/O stream,
And the stdio/flush.c implementation reads:
/*
* if this is not an internal extended FILE then check
* if _file is being changed from underneath us.
* It should not be because if
* it is then then we lose our ability to guard against
* silent data corruption.
*/
if (!iop->__xf_nocheck && bad_fd > -1 && iop->_magic != bad_fd) {
(void) fprintf(stderr,
"Application violated extended FILE safety mechanism.\n"
"Please read the man page for extendedFILE.\nAborting\n");
abort();
}
This appears to be an insane workaround for a broken implementation with
exposed FILE internals and _file being a u8, both only on non-LP64;
it's shimmed out on all LP64 targets in Illumos,
and we shim it out as well: just get rid of it
This appears to've been originally fixed in illumos-gate
a5f69788de7ac07553de47f7fec8c05a9a94c105 ("PSARC 2006/162 Extended FILE
space for 32-bit Solaris processes", "1085341 32-bit stdio routines
should support file descriptors >255"), which also bears extendedFILE
and enable_extended_FILE_stdio(3C):
- unsigned char _file; /* UNIX System file descriptor */
+ unsigned char _magic; /* Old home of the file descriptor */
+ /* Only fileno(3C) can retrieve the
value now */
and
+/*
+ * Macros to aid the extended fd FILE work.
+ * This helps isolate the changes to only the 32-bit code
+ * since 64-bit Solaris is not affected by this.
+ */
+#ifdef _LP64
+#define GET_FD(iop) ((iop)->_file)
+#define SET_FILE(iop, fd) ((iop)->_file = (fd))
+#else
+#define GET_FD(iop) \
+ (((iop)->__extendedfd) ? _file_get(iop) : (iop)->_magic)
+#define SET_FILE(iop, fd) (iop)->_magic = (fd); (iop)->__extendedfd = 0
+#endif
Also remove the 1k setrlimit(NOFILE) calls: that's the default on Linux,
with 64k on Illumos and 171k on FreeBSD
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #13411
bcopy() has a confusing argument order and is actually a move, not a
copy; they're all deprecated since POSIX.1-2001 and removed in -2008,
and we shim them out to mem*() on Linux anyway
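For reference, the argument-order difference that makes blind conversion
risky (a standalone snippet, not taken from the patch):
```
#include <stdio.h>
#include <string.h>
#include <strings.h>	/* bcopy() lives here where it is still provided */

int
main(void)
{
	char src[] = "hello";
	char dst[8];

	/* bcopy(src, dst, n): source first, overlap-safe like memmove. */
	bcopy(src, dst, sizeof (src));

	/* memcpy(dst, src, n): destination first, no overlap allowed. */
	memcpy(dst, src, sizeof (src));

	/* memmove(dst, src, n): destination first, overlap-safe. */
	memmove(dst, src, sizeof (src));

	printf("%s\n", dst);
	return (0);
}
```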
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #12996
Unfortunately macOS has the obj-C keyword "fallthrough" in the OS headers.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #13097
`configure` now accepts `--enable-asan` and `--enable-ubsan` switches
which result in passing `-fsanitize=address`
and `-fsanitize=undefined`, respectively, to the compiler. Those
flags are enabled in GitHub workflows for ZTS and zloop. Errors
reported by both instrumentations are corrected, except for:
- Memory leak reporting is (temporarily) suppressed. The cost of
fixing them is relatively high compared to the gains.
- Checksum computing functions in `module/zcommon/zfs_fletcher*`
have UBSan errors suppressed. It is completely impractical
to enforce 64-byte payload alignment there due to performance
impact.
- There's no ASan heap poisoning in `module/zstd/lib/zstd.c`. A custom
memory allocator is used there rendering that measure
unfeasible.
- Memory leak detection has to be suppressed for `cmd/zvol_id`.
`zvol_id` is run by udev with the help of `ptrace(2)`. Tracing is
incompatible with memory leak detection.
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: szubersk <szuberskidamian@gmail.com>
Closes #12928
zdb -d <pool>/<objset ID> does not work when other command line
arguments are included, e.g. zdb -U <cachefile> -d <pool>/<objset ID>.
This change fixes the command line parsing to handle this situation.
Also fix an issue where zdb -r <dataset> <file> does not handle the
root <dataset> of the pool. Introduce a -N option to force
<objset ID> to be interpreted as a numeric objset ID.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #12845
Closes #12944
This change introduces long options for zdb. It updates the usage
message as well to include the long options.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Manoj Joseph <manoj.joseph@delphix.com>
Closes #12818
Sometimes, we'd like to know info about the metaslab groups
on special vdevs too. So let's make -MM do something useful.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #12750
If all label checksums on any vdev are invalid, the pool becomes
unimportable. On the other hand, zdb with the -l option will not
provide any useful information about why that happened. Add
notifications about corrupted label checksums.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes #2509
Closes #12685
Only the zdb utility needs to read metaslab-related data during a
read-only pool import, because of spacemap validation. Add a global
variable which allows zdb to read spacemaps in read-only import mode.
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes #9095
Closes #12687
The calculation of the estimated time remaining in zdb -cc could
overflow, as reported in #10666. This patch fixes that by using
uint64_t instead of int in the calculations.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Teodor Spæren <teodor@sparen.no>
Closes #10666
Closes #12610
As of the Linux 5.9 kernel a fallthrough macro has been added which
should be used to annotate all intentional fallthrough paths. Once
all of the kernel code paths have been updated to use fallthrough,
the -Wimplicit-fallthrough option will become the default. To
avoid warnings in the OpenZFS code base when this happens, apply
the fallthrough macro.
Additional reading: https://lwn.net/Articles/794944/
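A self-contained sketch of the annotation; the macro definition below is a
local stand-in for the kernel/SPL-provided one so the example compiles on
its own:
```
#include <stdio.h>

/* Local stand-in for the fallthrough macro described above. */
#define	fallthrough	__attribute__((__fallthrough__))

static int
classify(int c)
{
	int score = 0;

	switch (c) {
	case 'a':
		score += 1;
		fallthrough;	/* intentional: 'a' also counts as 'b' */
	case 'b':
		score += 1;
		break;
	default:
		break;
	}
	return (score);
}

int
main(void)
{
	printf("%d\n", classify('a'));
	return (0);
}
```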
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #12441
Unfortunately macOS reserves inode ID numbers 0-15, and we cannot
use them. In the macOS port we simply map them to really high IDs.
Normally this is hidden inside the _os implementation, but this is
the one place in the common source files.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #12530
It turns out that layouts of union bitfields are a pain, and the
current code results in an inconsistent layout between BE and LE
systems, leading to zstd-active datasets on one erroring out on
the other.
Switch everyone over to the LE layout, and add compatibility code
to read both.
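A self-contained illustration of why bitfields inside a union are not a
portable on-disk layout (generic fields, not the actual zstd header):
```
#include <stdio.h>
#include <stdint.h>

/*
 * The order in which bitfields are packed into the storage unit is
 * implementation-defined, so the same declaration can map "level" and
 * "version" onto different bits on big- and little-endian ABIs.
 * Printing the raw integer shows which layout this build uses.
 */
typedef union {
	uint32_t raw;
	struct {
		uint32_t level   : 8;
		uint32_t version : 24;
	} fields;
} hdr_t;

int
main(void)
{
	hdr_t h = { .raw = 0 };

	h.fields.level = 3;
	h.fields.version = 10405;
	printf("raw on this machine/ABI: 0x%08x\n", (unsigned int)h.raw);
	return (0);
}
```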
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #12008
Closes #12022
Update the logic to handle the dedup-case of consecutive
FREEs in the livelist code. The logic still ensures that
all the FREE entries are matched up with a respective
ALLOC by keeping a refcount for each FREE blkptr that we
encounter and ensuring that this refcount gets to zero
by the time we are done processing the livelist.
zdb -y no longer panics when encountering double frees
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #11480
Closes #12177
Exporting names this short can easily cause nasty collisions with user code.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #12050
One space is missing from the zdb -h output, causing strings to be concatenated (fixing #11940).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Toomas Soome <tsoome@me.com>
Closes #12098
We only recognize some history records; instead, use the same logic
as print_history_records() in zpool_main.c.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Toomas Soome <tsoome@me.com>
Closes #11940
If zdb is not built in DEBUG mode, the ASSERT macros will be
eliminated. This will leave vim defined but not used (a gcc warning),
and the checkpoint spacemap validation loop will do nothing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Toomas Soome <tsoome@me.com>
Closes #11932
While you can use zdb -R poolname vdev:offset:[<lsize>/]<psize>[:flags]
to extract individual DVAs from a vdev, it would be handy to be able
to copy an entire file out of the pool.
Given a file or object number, add support to copy the contents to a
file. Useful for debugging and recovery.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes #11027
Mixing ZIL and normal allocations has several problems:
1. The ZIL allocations are allocated, written to disk, and then a few
seconds later freed. This leaves behind holes (free segments) where the
ZIL blocks used to be, which increases fragmentation, which negatively
impacts performance.
2. When under moderate load, ZIL allocations are of 128KB. If the pool
is fairly fragmented, there may not be many free chunks of that size.
This causes ZFS to load more metaslabs to locate free segments of 128KB
or more. The loading happens synchronously (from zil_commit()), and can
take around a second even if the metaslab's spacemap is cached in the
ARC. All concurrent synchronous operations on this filesystem must wait
while the metaslab is loading. This can cause a significant performance
impact.
3. If the pool is very fragmented, there may be zero free chunks of
128KB or more. In this case, the ZIL falls back to txg_wait_synced(),
which has an enormous performance impact.
These problems can be eliminated by using a dedicated log device
("slog"), even one with the same performance characteristics as the
normal devices.
This change sets aside one metaslab from each top-level vdev that is
preferentially used for ZIL allocations (vdev_log_mg,
spa_embedded_log_class). From an allocation perspective, this is
similar to having a dedicated log device, and it eliminates the
above-mentioned performance problems.
Log (ZIL) blocks can be allocated from the following locations. Each
one is tried in order until the allocation succeeds:
1. dedicated log vdevs, aka "slog" (spa_log_class)
2. embedded slog metaslabs (spa_embedded_log_class)
3. other metaslabs in normal vdevs (spa_normal_class)
The space required for the embedded slog metaslabs is usually between
0.5% and 1.0% of the pool, and comes out of the existing 3.2% of "slop"
space that is not available for user data.
On an all-ssd system with 4TB storage, 87% fragmentation, 60% capacity,
and recordsize=8k, testing shows a ~50% performance increase on random
8k sync writes. On even more fragmented systems (which hit problem #3
above and call txg_wait_synced()), the performance improvement can be
arbitrarily large (>100x).
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11389
In `zpool_find_config()`, the `pools` nvlist is leaked. Part of it (a
sub-nvlist) is returned in `*configp`, but the callers also leak that.
Additionally, in `zdb.c:main()`, the `searchdirs` is leaked.
The leaks were detected by ASAN (`configure --enable-asan`).
This commit resolves the leaks.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11396
Metaslab rotor and aliquot are used to distribute workload between
vdevs while keeping some locality for logically adjacent blocks. Once
multiple allocators were introduced to separate allocation of different
objects, it does not make much sense for different allocators to write
into different metaslabs of the same metaslab group (vdev) at the same
time, competing for its resources. This change makes each allocator
choose a metaslab group independently, colliding with others only
sporadically.
A test with simultaneous writes into 4 files with a recordsize of 4KB
on a striped pool of 30 disks on a system with 40 logical cores shows a
reduction of vdev queue lock contention from 54% to 27% due to better
load distribution. Unfortunately it won't help ZVOLs much yet, since
only one dataset/ZVOL is synced at a time, and so for the most part
only one allocator is used, but it may improve later.
While there, to reduce the number of pointer dereferences, change
per-allocator storage for metaslab classes and groups from several
separate malloc()'s to variable length arrays at the ends of the
original class and group structures.
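The "variable length array at the end of the structure" pattern mentioned
above, in standalone form (generic names, not the actual metaslab
structures):
```
#include <stdio.h>
#include <stdlib.h>

typedef struct per_allocator {
	unsigned long rotor;
	unsigned long aliquot;
} per_allocator_t;

/* One allocation holds the class and all of its per-allocator slots. */
typedef struct metaslab_class {
	int allocators;
	per_allocator_t alloc[];	/* flexible array member */
} metaslab_class_t;

int
main(void)
{
	int n = 4;
	metaslab_class_t *mc = calloc(1,
	    sizeof (metaslab_class_t) + n * sizeof (per_allocator_t));

	if (mc == NULL)
		return (1);
	mc->allocators = n;
	mc->alloc[2].rotor = 123;	/* no extra pointer dereference */
	printf("%lu\n", mc->alloc[2].rotor);
	free(mc);
	return (0);
}
```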
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #11288
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to a pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[<parity>] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type, the following
options were added to the ztest command. These options are leveraged
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated to provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
zdb interprets byte arrays as textual strings in dump_zap(), but there
are also binary arrays, and we should not output binary data on the
terminal.
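A standalone sketch of the kind of check involved, deciding whether a byte
array is printable before dumping it as text (generic code, not the zdb
implementation):
```
#include <ctype.h>
#include <stdio.h>

static void
dump_bytes(const unsigned char *val, int len)
{
	int printable = 1;

	for (int i = 0; i < len; i++) {
		if (!isprint(val[i])) {
			printable = 0;
			break;
		}
	}
	if (printable) {
		printf("%.*s\n", len, (const char *)val);
	} else {
		for (int i = 0; i < len; i++)
			printf("%02x ", val[i]);
		printf("\n");
	}
}

int
main(void)
{
	unsigned char text[] = { 'z', 'f', 's' };
	unsigned char blob[] = { 0x00, 0xde, 0xad };

	dump_bytes(text, 3);
	dump_bytes(blob, 3);
	return (0);
}
```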
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Toomas Soome <tsoome@me.com>
External-issue: https://www.illumos.org/issues/12012
External-issue: https://www.illumos.org/issues/11713
Closes #11006
When a device removal is in progress, there are 2 locations for the data
that's already been moved: the original location, on the device that's
being removed; and the new location, which is pointed to by the indirect
mapping. When doing leak detection, zdb needs to know about both
locations. To determine what's already been copied, we load the
spacemaps of the removing vdev, omit the blocks that are yet to be
copied, and then use the vdev's remap op to find the new location.
The problem is with an optimization to the spacemap-loading code in zdb.
When processing the log spacemaps, we ignore entries that are not
relevant because they are past the point that's been copied. However,
entries which span the point that's been copied (i.e. they are partly
relevant and partly irrelevant) are processed normally. This can lead
to an illegal spacemap operation, for example if offsets up to 100KB
have been copied, and the spacemap log has the following entries:
ALLOC 50KB-150KB (partly relevant)
FREE 50KB-100KB (entirely relevant)
FREE 100KB-150KB (entirely irrelevant - ignored)
ALLOC 50KB-150KB (partly relevant)
Because the entirely irrelevant entry was ignored, its space remains in
the spacemap. When the last entry is processed, we attempt to add it to
the spacemap, but it partially overlaps with the 100-150KB entry that
was left over.
This problem was discovered by ztest/zloop.
One solution would be to also ignore the irrelevant parts of
partially-irrelevant entries (i.e. when processing the ALLOC 50-150, to
only add 50-100 to the spacemap). However, this commit implements a
simpler solution, which is to remove this optimization entirely. I.e.
to process the entire spacemap log, without regard for the point that's
been copied. After reconstructing the entire allocatable range tree,
there's already code to remove the parts that have not yet been copied.
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-71820
Closes #10920
Currently the ARC state (MFU/MRU) of cached L2ARC buffers and their
content type is unknown. Knowing this information may prove beneficial
in adjusting the L2ARC caching policy.
This commit adds L2ARC arcstats that display the aligned size
(in bytes) of L2ARC buffers according to their content type
(data/metadata) and according to their ARC state (MRU/MFU or
prefetch). It also expands the existing evict_l2_eligible arcstat to
differentiate between MFU and MRU buffers.
L2ARC caches buffers from the MRU and MFU lists of ARC. Upon caching a
buffer, its ARC state (MRU/MFU) is stored in the L2 header
(b_arcs_state). The l2_m{f,r}u_asize arcstats reflect the aligned size
(in bytes) of L2ARC buffers according to their ARC state (based on
b_arcs_state). We also account for the case where an L2ARC and ARC
cached MRU or MRU_ghost buffer transitions to MFU. The l2_prefetch_asize
reflects the aligned size (in bytes) of L2ARC buffers that were cached
while they had the prefetch flag set in ARC. This is dynamically updated
as the prefetch flag of L2ARC buffers changes.
When buffers are evicted from ARC, if they are determined to be L2ARC
eligible then their logical size is recorded in
evict_l2_eligible_m{r,f}u arcstats according to their ARC state upon
eviction.
Persistent L2ARC:
When committing an L2ARC buffer to a log block (L2ARC metadata) its
b_arcs_state and prefetch flag is also stored. If the buffer changes
its arcstate or prefetch flag this is reflected in the above arcstats.
However, the L2ARC metadata cannot currently be updated to reflect this
change.
Example: L2ARC caches an MRU buffer. L2ARC metadata and arcstats count
this as an MRU buffer. The buffer transitions to MFU. The arcstats are
updated to reflect this. Upon pool re-import or on/offlining the L2ARC
device the arcstats are cleared and the buffer will now be counted as an
MRU buffer, as the L2ARC metadata were not updated.
Bug fix:
- If l2arc_noprefetch is set, arc_read_done clears the L2CACHE flag of
an ARC buffer. However, prefetches may be issued in a way that
arc_read_done() is bypassed. Instead, move the related code in
l2arc_write_eligible() to account for those cases too.
Also add a test and update manpages for l2arc_mfuonly module parameter,
and update the manpages and code comments for l2arc_noprefetch.
Move persist_l2arc tests to l2arc.
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10743
This PR adds two new compression types, based on ZStandard:
- zstd: A basic ZStandard compression algorithm. Available compression
levels for zstd are zstd-1 through zstd-19, where the compression
increases with every level, but speed decreases.
- zstd-fast: A faster version of the ZStandard compression algorithm.
zstd-fast is basically a "negative" level of zstd. The compression
decreases with every level, but speed increases.
Available compression levels for zstd-fast:
- zstd-fast-1 through zstd-fast-10
- zstd-fast-20 through zstd-fast-100 (in increments of 10)
- zstd-fast-500 and zstd-fast-1000
For more information check the man page.
Implementation details:
Rather than treat each level of zstd as a different algorithm (as was
done historically with gzip), the block pointer `enum zio_compress`
value is simply zstd for all levels, including zstd-fast, since they all
use the same decompression function.
The compress= property (a 64bit unsigned integer) uses the lower 7 bits
to store the compression algorithm (matching the number of bits used in
a block pointer, as the 8th bit was borrowed for embedded block
pointers). The upper bits are used to store the compression level.
It is necessary to be able to determine what compression level was used
when later reading a block back, so the concept used in LZ4, where the
first 32bits of the on-disk value are the size of the compressed data
(since the allocation is rounded up to the nearest ashift), was
extended, and we store the version of ZSTD and the level as well as the
compressed size. This value is returned when decompressing a block, so
that if the block needs to be recompressed (L2ARC, nop-write, etc), that
the same parameters will be used to result in the matching checksum.
All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`,
`zio_prop_t`, etc.) uses the separated _compress and _complevel
variables. Only the properties ZAP contains the combined/bit-shifted
value. The combined value is split when the compression_changed_cb()
callback is called, and sets both objset members (os_compress and
os_complevel).
The userspace tools all use the combined/bit-shifted value.
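A standalone sketch of the packing scheme described above; the shift value
and the helper are illustrative, matching the "lower 7 bits for the
algorithm, upper bits for the level" description rather than the exact ZFS
constants:
```
#include <stdio.h>
#include <stdint.h>

#define	COMPRESS_MASK	0x7fULL		/* lower 7 bits: algorithm */
#define	LEVEL_SHIFT	7		/* upper bits: level (illustrative) */

static uint64_t
pack_compress(uint64_t algo, uint64_t level)
{
	return ((level << LEVEL_SHIFT) | (algo & COMPRESS_MASK));
}

int
main(void)
{
	uint64_t prop = pack_compress(16, 19);	/* e.g. a zstd-19 setting */

	printf("algorithm = %llu, level = %llu\n",
	    (unsigned long long)(prop & COMPRESS_MASK),
	    (unsigned long long)(prop >> LEVEL_SHIFT));
	return (0);
}
```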
Additional notes:
zdb can now also decode the ZSTD compression header (flag -Z) and
inspect the size, version and compression level saved in that header.
For each record, if it is ZSTD compressed, the parameters of the decoded
compression header get printed.
ZSTD is included with all current tests and new tests are added
as-needed.
Per-dataset feature flags now get activated when the property is set.
If a compression algorithm requires a feature flag, zfs activates the
feature when the property is set, rather than waiting for the first
block to be born. This is currently only used by zstd but can be
extended as needed.
Portions-Sponsored-By: The FreeBSD Foundation
Co-authored-by: Allan Jude <allanjude@freebsd.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Co-authored-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #6247
Closes #9024
Closes #10277
Closes #10278
These tunables were renamed from vfs.zfs.arc_min and
vfs.zfs.arc_max to vfs.zfs.arc.min and vfs.zfs.arc.max.
Add legacy compat tunables for the old names.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10579