torvalds/linux@8077c0d983 added a
__must_check to the bdi_setup_and_register(), which caused our autotools
check to break. zfsonlinux/zfs@729210564a
was intended to correct that, but it depended on -Wno-unused-result,
which is unrecognized in older GCC versions. That commit has been
reverted in favor of a solution that does not require
-Wno-unused-result.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2102Closes#2135
Older GCC versions do not obey -Wno-unused-result. This reverts commit
729210564a in favor of a solution that
does not require -Wno-unused-result.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1906
3120 zinject hangs in zfsdev_ioctl() due to uninitialized zc
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/3120illumos/illumos-gate@f4c46b1eda
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2152
When testing compiler flags, we only need to do compile test. Otherwise,
configure will fail with "configure: error: cannot run test program while
cross compiling" when cross compiling.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2191
Originally, users had to handle spa namespace collisions by either
exporting the already imported pool or by specifying a new name for the
pool with a conflicting name. In the case of root pools from virtual
guests, neither approach to collision resolution is reasonable. This is
addressed by extending the new name syntax with a -t option to specify
that the new name is temporary. When specified, this sets an internal
flag that is passed into the kernel to tell it that all label updates
should refer to the name used in the original label. Consequently, the
original pool name will be retained on export.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2189
Valgrind suggests that the address we are returning is not properly
aligned, so lets add an assertion.
==87740== Address 0x1012a22a is 554 bytes inside a block of size 4,096
alloc'd
==87740== at 0x4C2BBA0: memalign (in
/usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==87740== by 0x4C2BCC7: posix_memalign (in
/usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==87740== by 0x52FA845: zio_buf_alloc (umem.h:101)
==87740== by 0x52F6226: zil_alloc_lwb (zil.c:463)
==87740== by 0x52F8559: zil_commit (zil.c:566)
==87740== by 0x40611D: ztest_freeze (ztest.c:5909)
==87740== by 0x4066A7: ztest_init (ztest.c:6048)
==87740== by 0x407AF4: main (ztest.c:6226)
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2174
Stopping arcstat.py with ^C always ends up with error:
TypeError: sighandler() takes no arguments (2 given)
Since no special signal handling was done in sighandler(),
it's simpler to just set SIGINT handler to SIG_DFL, which
terminates the script.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Isaac Huang <he.huang@intel.com>
Closes#2179
Remove the redundant call to zfs_unmount_snap() which was being
done after char array was freed,
This fixes an upstream regression that was introduced in commit
zfsonlinux/zfs@d09f25dc66, which
ported the Illumos 3744 changes.
Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#2156
This is just a small bit of cleanup to ensure ztest fails early
on systems where mmap(2) is not functioning. For the automated
testing which is the primary consumer of ztest there is no
functional change because debugging is always enabled.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2177
Valgrind complained about this and it's absolutely right. The
props nvlist was not being freed in ztest_init.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#2174
Stopping arcstat.py with ^C always ends up with error:
TypeError: sighandler() takes no arguments (2 given)
This patch corrects the error by updating the signal handler
to take the correct number of arguments.
Signed-off-by: Isaac Huang <he.huang@intel.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2182
Today arcstat.py prints one header every hdr_intr (20 by default)
lines. It would be more consistent with out utilities like vmstat
if hdr_intr defaulted to terminal window size, i.e. one header
per screenful of outputs.
Signed-off-by: Isaac Huang <he.huang@intel.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2183
In Debian GNU/Linux a program called 'linitian' is used to make sure
that packages conforms to the Debian GNU/Linux packaging guide lines.
This fixes the problem reported as:
W: zfsutils: manpage-has-bad-whatis-entry usr/share/man/man1/zhack.1.gz
W: zfsutils: manpage-has-bad-whatis-entry usr/share/man/man8/fsck.zfs.8.gz
Not something that ZoL needs to addhere to, but every other man page
have their NAME section in a special way - why not these two as well?
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2161
4088 use after free in arc_release()
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Dan McDonald <danmcd@nexenta.com>
References:
https://www.illumos.org/issues/4088illumos/illumos-gate@ccc22e1304
From the illumos issue:
A race-induced use after free occurs in arc_release() where the
ARC header is used outside the critical section protected by the
hash_lock.
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#2162
The spa_label_features nvlist is used in the sync context during pool
version upgrade.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2168
These are needed by consumers (i.e. Lustre) who wish to perform a
callback in the syncing context. Both a blocking and non-blocking
version are available to callers.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Some callers of dmu_tx_assign() use the TXG_NOWAIT flag and call
dmu_tx_wait() themselves before retrying if the assignment fails.
The wait times for such callers are not accounted for in the
dmu_tx_assign kstat histogram, because the histogram only records
time spent in dmu_tx_assign(). This change moves the histogram
update to dmu_tx_wait() to properly account for all time spent there.
One downside of this approach is that it is possible to call
dmu_tx_wait() multiple times before successfully assigning a
transaction, in which case the cumulative wait time would not be
recorded. However, this case should not often arise in practice,
because most callers currently use one of these forms:
dmu_tx_assign(tx, TXG_WAIT);
dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
The first form should make just one call to dmu_tx_delay() inside of
dmu_tx_assign(). The second form retries with TXG_WAITED if the first
assignment fails and incurs a delay, in which case no further waiting
is performed. Therefore transaction delays normally occur in one
call to dmu_tx_wait() so the histogram should be fairly accurate.
Another possible downside of this approach is that the histogram will
no longer record overhead outside of dmu_tx_wait() such as in
dmu_tx_try_assign(). While I'm not aware of any reason for concern on
this point, it is conceivable that lock contention, long list
traversal, etc. could cause assignment delays that would not be
reflected in the histogram. Therefore the histogram should strictly
be used for visibility in to the normal delay mechanisms and not as a
profiling tool for code performance.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1915
The nreserved column in the txgs kstat file always contains 0
following the write throttle restructuring of commit
e8b96c6007.
Prior to that commit, the nreserved column showed the number of bytes
temporarily reserved in the pool by a transaction group at sync time.
The new write throttle did away with temporary reservations and uses
the amount of dirty data instead. To approximate the old output of
the txgs kstat, the number of dirty bytes per-txg was passed in as
the nreserved value to spa_txg_history_set_io(). This approach did
not work as intended, because the per-txg dirty value is decremented
as data is written out to disk, so it is zero by the time we call
spa_txg_history_set_io(). To fix this, save the number of dirty
bytes before calling spa_sync(), and pass this value in to
spa_txg_history_set_io().
Also, since the name "nreserved" is now a misnomer, the column
heading is now labeled "ndirty".
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1696
A few counters in the dmu_tx kstats are obsolete or no longer
bumped properly.
- The sync task restructuring commit
13fe019870 removed the code
that bumpted dmu_tx_quota. The counter is now bumped in two
cases, instead of just the one case as before (after the result
of dsl_dataset_check_quota call). The second case is where
we check the requested reservation against the actual pool size,
as this is an implicit quota of sorts.
- The write throttle restructuring commit
e8b96c6007 makes dmu_tx_how and
dmu_tx_inflight obsolete, so they are removed.
Signed-off-by: Kohsuke Kawaguchi <kk@kohsuke.org>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1914
Userland tools such as blkid, grub2-probe and zdb will go through the
buffer cache. However, ZFS uses on submit_bio() to bypass the buffer
cache when performing IO operations on vdevs for efficiency purposes.
This permits the on-disk state and buffer cache to fall out of
synchronization. That causes seemingly random failures when tools
reading stale metadata from the buffer cache try to access references to
data that is no longer there.
A particularly bad failure this causes involves grub2-probe, which is
used by grub2-mkconfig. Ordinarily, a rootfs might be called
rpool/ROOT/gentoo. However, when a failure occurs in grub2-probe,
grub2-mkconfig will generate a configuration file containing
/ROOT/gentoo, which omits the pool name and causes a boot failure.
This is avoidable by calling invalidate_bdev() on each flush, which is a
simple way to ensure that all non-dirty pages are wiped. Since userland
tools rarely access vdevs directly, this should be a fancy noop >99.999%
of the time and have little impact on IO. We could have tried a finer
grained approach for the rare instances in which the vdevs are accessed
frequently by userland. However, that would require consideration of
corner cases and it is not worth the effort.
Memory-wise, it would have been better to use a Linux kernel API hook to
disable the buffer cache on such devices, but it provides us no way of
doing that, so we opt for this approach instead. We should revisit that
idea in the future when higher priority issues have been tackled.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2150
p_l in #zfsonlinux reported that he had issues mounting filesystems that
were resolved by adding rc_need="mtab" to /etc/init.d/zfs. Closer
inspection revealed that we do have a race, but it is not clear how this
race caused mounting to fail. What is clear is that this race should be
fixed, so lets add the proper `use mtab` line to handle it.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2148
4574 get_clones_stat does not call zap_count in non-debug kernel
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Marcel Telka <marcel@telka.sk>
Approved by: Gordon Ross <gwr@nexenta.com>
References:
https://www.illumos.org/issues/4574illumos/illumos-gate@03d1795fa6
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2147
The length (number of integers) argument passed to zap_lookup was wrong;
likely as a result of performing stack-reduction on the function.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2141
Remove recursion from dsl_dir_willuse_space() to reduce stack usage.
Issues with stack overflow were observed in zfs recv of zvols,
likelihood of an overflow is proportional to the depth of the dataset
as dsl_dir_willuse_space() recurses to parent datasets.
Signed-off-by: Andrew Barnes <barnes333@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2069
This stack of patches has been empirically shown to drastically improve
the hit rate of the ARC for certain workloads. As a result, fewer reads
to disk are required, which is generally a good thing and can
drastically improve performance if the workload is disk limited.
For the impatient, I'll summarize the results of the tests performed:
* Test 1 - Creating many empty directories. This test saw 99.9%
fewer reads and 12.8% more inodes created when running
*with* these changes.
* Test 2 - Creating many empty files. This test saw 4% fewer reads
and 0% more inodes created when running *with* these
changes.
* Test 3 - Creating many 4 KiB files. This test saw 96.7% fewer
reads and 4.9% more inodes created when running *with*
these changes.
* Test 4 - Creating many 4096 KiB files. This test saw 99.4% fewer
reads and 0% more inodes created (but took 6.9% fewer
seconds to complete) when running *with* these changes.
* Test 5 - Rsync'ing a dataset with many empty directories. This
test saw 36.2% fewer reads and 66.2% more inodes created
when running *with* these changes.
* Test 6 - Rsync'ing a dataset with many empty files. This test saw
30.9% fewer reads and 0% more inodes created (but took
24.3% fewer seconds to complete) when running *with*
these changes.
* Test 7 - Rsync'ing a dataset with many 4 KiB files. This test saw
30.8% fewer reads and 173.3% more inodes created when
running *with* these changes.
For the patient, the following consists of more a more detailed
description of the tests performed and the results gathered.
All the tests were run using identical machines, each with a pool
consisting of 5 mirror pairs with 2TB 7200 RPM disks. Each test was run
twice, once *without* this set of patches and again *with* this set of
patches to highlight the performance changes introduced. The first four
workloads tested were:
** NOTE: None of these tests were run to completion. They ran for a
set amount of time and then were terminated or hit ENOSPC.
1. Creating many empty directories:
* fdtree -d 10 -l 8 -s 0 -f 0 -C
-> 111,111,111 Directories
-> 0 Files
-> 0 KiB File Data
2. Creating many empty files:
* fdtree -d 10 -l 5 -s 0 -f 10000 -C
-> 111,111 Directories
-> 1,111,110,000 Files
-> 0 KiB File Data
3. Creating many 4 KiB files:
* fdtree -d 10 -l 5 -s 1 -f 10000 -C
-> 111,111 Directories
-> 1,111,110,000 Files
-> 4,444,440,000 KiB File Data
4. Creating many 4096 KiB files:
* fdtree -d 10 -l 5 -s 1024 -f 10000 -C
-> 111,111 Directories
-> 1,111,110,000 Files
-> 4,551,106,560,000 KiB File Data
Results for these first four tests are below:
| Time (s) | inodes | reads | writes |
--+----------+----------+--------+-----------+
Test 1 Before | 65069 | 37845363 | 831975 | 3214646 |
Test 1 After | 65069 | 42703608 | 778 | 3327674 |
--+----------+----------+--------+-----------+
Test 2 Before | 65073 | 54257583 | 208647 | 2413056 |
Test 2 After | 65069 | 54255782 | 200038 | 2533759 |
--+----------+----------+--------+-----------+
Test 3 Before | 65068 | 49857744 | 487130 | 5533348 |
Test 3 After | 65071 | 52294311 | 16078 | 5648354 |
--+----------+----------+--------+-----------+
Test 4 Before | 34854 | 2448329 | 385870 | 162116572 |
Test 4 After | 32419 | 2448329 | 2339 | 162175706 |
--+----------+----------+--------+-----------+
* "Time (s)" - The run time of the test in seconds
* "inodes" - The number of inodes created by the test
* "reads" - The number of reads performed by the test
* "writes" - The number of writes performed by the test
As you can see from the table above, running with this patch stack
*significantly* reduced the number of reads performed in 3 out of the 4
tests (due to an improved ARC hit rate).
In addition to the tests described above, which specifically targeted
creates only, three other workloads were tested. These additional tests
were targeting rsync performance against the datasets created in the
previous tests. A brief description of the workloads and results for
these tests are below:
** NOTE: Aside from (6), these tests didn't run to completion. They
ran for a set amount of time and then were terminated.
5. Rsync the dataset created in Test 1 to a new dataset:
* rsync -a /tank/test-1 /tank/test-5
6. Rsync the dataset created in Test 2 to a new dataset:
* rsync -a /tank/test-2 /tank/test-6
7. Rsync the dataset created in Test 3 to a new dataset:
* rsync -a /tank/test-3 /tank/test-7
Results for Test 5, 6, and 7 are below:
| Time (s) | inodes | reads | writes |
--+----------+----------+----------+---------+
Test 5 Before | 93041 | 17921014 | 47632823 | 4094848 |
Test 5 After | 93029 | 29785847 | 30376206 | 4484459 |
--+----------+----------+----------+---------+
Test 6 Before | 15290 | 54264474 | 6018331 | 733087 |
Test 6 After | 11573 | 54260826 | 4155661 | 617285 |
--+----------+----------+----------+---------+
Test 7 Before | 93057 | 10093749 | 41561635 | 3659098 |
Test 7 After | 93045 | 27587043 | 28773151 | 5612234 |
--+----------+----------+----------+---------+
* "Time (s)" - The run time of the test in seconds
* "inodes" - The number of inodes created by the test
* "reads" - The number of reads performed by the test
* "writes" - The number of writes performed by the test
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2110
Unfortunately, this change is an cheap attempt to work around a
pathological workload for the ARC. A "real" solution still needs to be
fleshed out, so this patch is intended to alleviate the situation in the
meantime. Let me try and describe the problem..
Data buffers residing in the dbuf hash table (dbuf cache) will keep a
hold on their respective dnode, this dnode will in turn keep a hold on
its backing dbuf (the physical block of the dnode object backing it).
Since the dnode has a hold on its backing dbuf, the arc buffer for this
dbuf is unevictable. What this essentially boils down to, "data" buffers
have the potential to pin "metadata" in the arc (as a result of these
dnode object buffers being unevictable).
This scenario becomes a real problem when the workload consists of many
small files (e.g. creating millions of 4K files). With this workload,
the arc's "arc_meta_used" space get filled up with buffers for any
resident directories as well as buffers for the objset's dnode object.
Once the "arc_meta_limit" is reached, the directory buffers will be
evicted and only the unevictable dnode object buffers will reside. If
the workload is simply creating new small files, these dnode object
buffers will never even be needed again, whereas the directory buffers
will be used constantly until the creates move to a new directory.
If "arc_c" and "arc_meta_limit" are sized appropriately, this
situation wont occur. This is because as the data buffers accumulate,
"arc_size" will eventually approach "arc_c" (before "arc_meta_used"
reaches "arc_meta_limit"); at that point the data buffers will be
evicted, which releases the hold on the dnode, which releases the hold
on the dnode object's dbuf, which allows that buffer to be evicted from
the arc in preference to more "useful" metadata.
So, to side step the issue, we simply need to ensure "arc_size" reaches
"arc_c" before "arc_meta_used" reaches "arc_meta_limit". In order to
pick a proper limit, we have to do some math.
To make things a little easier to follow, it is assumed that there will
only be a single data buffer per file (which is probably always the case
for "small" files anyways).
Based on the current internals of the arc, if N files residing in the
dbuf cache all pin a single dnode buffer (i.e. their dnodes all share
the same physical dnode object block), then the following amount of
"arc_meta_used" space will be consumed:
- 16K for the dnode object's block - [ 16384 bytes]
- N * sizeof(dnode_t) -------------- [ N * 928 bytes]
- (N + 1) * sizeof(arc_buf_t) ------ [(N + 1) * 72 bytes]
- (N + 1) * sizeof(arc_buf_hdr_t) -- [(N + 1) * 264 bytes]
- (N + 1) * sizeof(dmu_buf_impl_t) - [(N + 1) * 280 bytes]
To simplify, these N files will pin the following amount of
"arc_meta_used" space as unevictable:
Pinned "arc_meta_used" bytes = 16384 + N * 928 + (N + 1) * (72 + 264 + 280)
Pinned "arc_meta_used" bytes = 17000 + N * 1544
This pinned space is regardless of the size of the files, and is only
dependent on the number of pinned dnodes sharing a physical block
(i.e. N). For example, 32 512b files sharing a single dnode object
block would consume the same "arc_meta_used" space as 32 4K files
sharing a single dnode object block.
Now, given a files size of S, we can determine the total amount of
space that will be consumed in the arc:
Total = 17000 + N * 1544 + S * N
^^^^^^^^^^^^^^^^ ^^^^^
metadata data
So, given these formulas, we can generate a table which states the ratio
of pinned metadata to total arc (meta + data) using different values of
N (number of pinned dnodes per pinned physical dnode block) and S (size
of the file).
File Sizes (S)
| 512 | 1024 | 2048 | 4096 | 8192 | 16384 |
---+----------+----------+----------+----------+----------+----------+
1 | 0.973132 | 0.947670 | 0.900544 | 0.819081 | 0.693597 | 0.530921 |
2 | 0.951497 | 0.907481 | 0.830632 | 0.710325 | 0.550779 | 0.380051 |
N 4 | 0.918807 | 0.849809 | 0.738842 | 0.585844 | 0.414271 | 0.261250 |
8 | 0.877541 | 0.781803 | 0.641770 | 0.472505 | 0.309333 | 0.182965 |
16 | 0.835819 | 0.717945 | 0.559996 | 0.388885 | 0.241376 | 0.137253 |
32 | 0.802106 | 0.669597 | 0.503304 | 0.336277 | 0.202123 | 0.112423 |
As you can see, if we wanted to support the absolute worst case of 1
dnode per physical dnode block and 512b files, we would have to set the
"arc_meta_limit" to something greater than 97.3132% of "arc_c_max". At
that point, it essentially defeats the purpose of having an
"arc_meta_limit" at all.
This patch changes the default value of "arc_meta_limit" to be 75% of
"arc_c_max", which should be good enough for "most" workloads (I think).
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
Previously, the "data_size" field in the arcstats kstat contained the
amount of cached "metadata" and "data" in the ARC. The problem is this
then made it difficult to extract out just the "metadata" size, or just
the "data" size.
To make it easier to distinguish the two values, "data_size" has been
modified to count only buffers of type ARC_BUFC_DATA, and "meta_size"
was added to count only buffers of type ARC_BUFC_METADATA. If one wants
the old "data_size" value, simply sum the new "data_size" and
"meta_size" values.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
When the arc is at it's size limit and a new buffer is added, data will
be evicted (or recycled) from the arc to make room for this new buffer.
As far as I can tell, this is to try and keep the arc from over stepping
it's bounds (i.e. keep it below the size limitation placed on it).
This makes sense conceptually, but there appears to be a subtle flaw in
its current implementation, resulting in metadata buffers being
throttled. When it evicts from the arc's lists, it also passes in a
"type" so as to remove a buffer of the same type that it is adding. The
problem with this is that once the size limit is hit, the ratio of
"metadata" to "data" contained in the arc essentially becomes fixed.
For example, consider the following scenario:
* the size of the arc is capped at 10G
* the meta_limit is capped at 4G
* 9G of the arc contains "data"
* 1G of the arc contains "metadata"
Now, every time a new "metadata" buffer is created and added to the arc,
an older "metadata" buffer(s) will be removed from the arc; preserving
the 9G "data" to 1G "metadata" ratio that was in-place when the size
limit was reached. This occurs even though the amount of "metadata" is
far below the "metadata" limit. This can result in the arc behaving
pathologically for certain workloads.
To fix this, the arc_get_data_buf function was modified to evict "data"
from the arc even when adding a "metadata" buffer; unless it's at the
"metadata" limit. In addition, arc_evict now more closely resembles
arc_evict_ghost; such that when evicting "data" from the arc, it may
make a second pass over the arc lists and evict "metadata" if it cannot
meet the eviction size the first time around.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
Using "arc_meta_used" to determine if the arc's mru list is over it's
target value of "arc_p" doesn't seem correct. The size of the mru list
and the value of "arc_meta_used", although related, are completely
independent. Buffers contained in "arc_meta_used" may not even be
contained in the arc's mru list. As such, this patch removes
"arc_meta_used" from the calculation in arc_adjust.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
To maintain a strict limit on the metadata contained in the arc, while
preventing the arc buffer headers from completely consuming the
"arc_meta_used" space, we need to evict metadata buffers from the arc's
ghost lists along with the regular lists.
This change modifies arc_adjust_meta such that it more closely models
the adjustments made in arc_adjust. "arc_meta_used" is used similarly to
"arc_size", and "arc_meta_limit" is used similarly to "arc_c".
Testing metadata intensive workloads (e.g. creating, copying, and
removing millions of small files and/or directories) has shown this
change to make a dramatic improvement to the hit rate maintained in the
arc. While I think there is still room for improvement, this is a big
step in the right direction.
In addition, zpl_free_cached_objects was made into a no-op as I'm not
yet sure how to properly implement that function.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
This reverts commit c11a12bc3b.
Out of memory events were fixed by reverting this patch.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
It's unclear why adjustments to arc_p need to be dampened as they are in
arc_adjust. With that said, it's removal significantly improves the arc's
ability to "warm up" to a given workload. Thus, I'm disabling by default
until its usefulness is better understood.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
Setting a limit on the minimum value of "arc_p" has been shown to have
detrimental effects on the arc hit rate for certain "metadata" intensive
workloads. Specifically, this has been exhibited with a workload that
constantly dirties new "metadata" but also frequently touches a "small"
amount of mfu data (e.g. mkdir's).
What is seen is that the new anon data throttles the mfu list to a
negligible size (because arc_p > anon + mru in arc_get_data_buf), even
though the mfu ghost list receives a constant stream of hits. To remedy
this, arc_p is now allowed to drop to zero if the algorithm deems it
necessary.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
For specific workloads consisting mainly of mfu data and new anon data
buffers, the aggressive growth of arc_p found in the arc_get_data_buf()
function can have detrimental effects on the mfu list size and ghost
list hit rate.
Running a workload consisting of two processes:
* Process 1 is creating many small files
* Process 2 is tar'ing a directory consisting of many small files
I've seen arc_p and the mru grow to their maximum size, while the mru
ghost list receives 100K times fewer hits than the mfu ghost list.
Ideally, as the mfu ghost list receives hits, arc_p should be driven
down and the size of the mfu should increase. Given the specific
workload I was testing with, the mfu list size should grow to a point
where almost no mfu ghost list hits would occur. Unfortunately, this
does not happen because the newly dirtied anon buffers constancy drive
arc_p to its maximum value and keep it there (effectively prioritizing
the mru list and starving the mfu list down to a negligible size).
The logic to increment arc_p from within the arc_get_data_buf() function
was introduced many years ago in this upstream commit:
commit 641fbdae3a027d12b3c3dcd18927ccafae6d58bc
Author: maybee <none@none>
Date: Wed Dec 20 15:46:12 2006 -0800
6505658 target MRU size (arc.p) needs to be adjusted more aggressively
and since I don't fully understand the motivation for the change, I am
reluctant to completely remove it.
As a way to test out how it's removal might affect performance, I've
disabled that code by default, but left it tunable via a module option.
Thus, if its removal is found to be grossly detrimental for certain
workloads, it can be re-enabled on the fly, without a code change.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
In an attempt to prevent arc_c from collapsing "too fast", the
arc_shrink() function was updated to take a "bytes" parameter by this
change:
commit 302f753f16
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date: Tue Mar 13 14:29:16 2012 -0700
Integrate ARC more tightly with Linux
Unfortunately, that change failed to make a similar change to the way
that arc_p was updated. So, there still exists the possibility for arc_p
to collapse to near 0 when the kernel start calling the arc's shrinkers.
This change attempts to fix this, by decrementing arc_p by the "bytes"
parameter in the same way that arc_c is updated.
In addition, a minimum value of arc_p is attempted to be maintained,
similar to the way a minimum arc_p value is maintained in arc_adapt().
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
Decrease the mimimum ARC size from 1/32 of total system memory
(or 64MB) to a much smaller 4MB.
1) Large systems with over a 1TB of memory are being deployed
and reserving 1/32 of this memory (32GB) as the mimimum
requirement is overkill.
2) Tiny systems like the raspberry pi may only have 256MB of
memory in which case 64MB is far too large.
The ARC should be reclaimable if the VFS determines it needs
the memory for some other purpose. If you want to ensure the
ARC is never completely reclaimed due to memory pressure you
may still set a larger value with zfs_arc_min.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Issue #2110
This patch stack was designed to accomplish the following:
* Cleanly address the pool incompatibility which was accidentally
introduced post-0.6.2 but pre-0.6.3. This required adding
infrastructure which can be used from now on to notify system
administrators of errata which affect their pool.
* Add additional automated regression testing to ensure this type
of compatibility issue is caught prior to the change being merged.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2094
ZoL commit 1421c89 unintentionally changed the disk format in a forward-
compatible, but not backward compatible way. This was accomplished by
adding an entry to zbookmark_t, which is included in a couple of
on-disk structures. That lead to the creation of pools with incorrect
dsl_scan_phys_t objects that could only be imported by versions of ZoL
containing that commit. Such pools cannot be imported by other versions
of ZFS or past versions of ZoL.
The additional field has been removed by the previous commit. However,
affected pools must be imported and scrubbed using a version of ZoL with
this commit applied. This will return the pools to a state in which they
may be imported by other implementations.
The 'zpool import' or 'zpool status' command can be used to determine if
a pool is impacted. A message similar to one of the following means your
pool must be scrubbed to restore compatibility.
$ zpool import
pool: zol-0.6.2-173
id: 1165955789558693437
state: ONLINE
status: Errata #1 detected.
action: The pool can be imported using its name or numeric identifier,
however there is a compatibility issue which should be corrected
by running 'zpool scrub'
see: http://zfsonlinux.org/msg/ZFS-8000-ER
config:
...
$ zpool status
pool: zol-0.6.2-173
state: ONLINE
scan: pool compatibility issue detected.
see: https://github.com/zfsonlinux/zfs/issues/2094
action: To correct the issue run 'zpool scrub'.
config:
...
If there was an async destroy in progress 'zpool import' will prevent
the pool from being imported. Further advice on how to proceed will be
provided by the error message as follows.
$ zpool import
pool: zol-0.6.2-173
id: 1165955789558693437
state: ONLINE
status: Errata #2 detected.
action: The pool can not be imported with this version of ZFS due to an
active asynchronous destroy. Revert to an earlier version and
allow the destroy to complete before updating.
see: http://zfsonlinux.org/msg/ZFS-8000-ER
config:
...
Pools affected by the damaged dsl_scan_phys_t can be detected prior to
an upgrade by running the following command as root:
zdb -dddd poolname 1 | grep -P '^\t\tscan = ' | sed -e 's;scan = ;;' | wc -w
Note that `poolname` must be replaced with the name of the pool you wish
to check. A value of 25 indicates the dsl_scan_phys_t has been damaged.
A value of 24 indicates that the dsl_scan_phys_t is normal. A value of 0
indicates that there has never been a scrub run on the pool.
The regression caused by the change to zbookmark_t never made it into a
tagged release, Gentoo backports, Ubuntu, Debian, Fedora, or EPEL
stable respositorys. Only those using the HEAD version directly from
Github after the 0.6.2 but before the 0.6.3 tag are affected.
This patch does have one limitation that should be mentioned. It will not
detect errata #2 on a pool unless errata #1 is also present. It expected
this will not be a significant problem because pools impacted by errata #2
have a high probably of being impacted by errata #1.
End users can ensure they do no hit this unlikely case by waiting for all
asynchronous destroy operations to complete before updating ZoL. The
presence of any background destroys on any imported pools can be checked
by running `zpool get freeing` as root. This will display a non-zero
value for any pool with an active asynchronous destroy.
Lastly, it is expected that no user data has been lost as a result of
this erratum.
Original-patch-by: Tim Chase <tim@chase2k.com>
Reworked-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2094
From time to time it may be necessary to inform the pool administrator
about an errata which impacts their pool. These errata will by shown
to the administrator through the 'zpool status' and 'zpool import'
output as appropriate. The errata must clearly describe the issue
detected, how the pool is impacted, and what action should be taken
to resolve the situation. Additional information for each errata will
be provided at http://zfsonlinux.org/msg/ZFS-8000-ER.
To accomplish the above this patch adds the required infrastructure to
allow the kernel modules to notify the utilities that an errata has
been detected. This is done through the ZPOOL_CONFIG_ERRATA uint64_t
which has been added to the pool configuration nvlist.
To add a new errata the following changes must be made:
* A new errata identifier must be assigned by adding a new enum value
to the zpool_errata_t type. New enums must be added to the end to
preserve the existing ordering.
* Code must be added to detect the issue. This does not strictly
need to be done at pool import time but doing so will make the
errata visible in 'zpool import' as well as 'zpool status'. Once
detected the spa->spa_errata member should be set to the new enum.
* If possible code should be added to clear the spa->spa_errata member
once the errata has been resolved.
* The show_import() and status_callback() functions must be updated
to include an informational message describing the errata. This
should include an action message describing what an administrator
should do to address the errata.
* The documentation at http://zfsonlinux.org/msg/ZFS-8000-ER must be
updated to describe the errata. This space can be used to provide
as much additional information as needed to fully describe the errata.
A link to this documentation will be automatically generated in the
output of 'zpool import' and 'zpool status'.
Original-idea-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <ryao@gentoo.or
Issue #2094
Both the zpool_import_status() and zpool_get_status() functions
return the zpool_status_t enum. This explicit type should be used
rather than the more generic int type.
This patch makes no functional change and should only be considered
code cleanup. It happens to have been done in the context of #2094
because that's when I noticed this issue.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <ryao@gentoo.or
Issue #2094
Commit 1421c89142 added a field to
zbookmark_t that unintentinoally caused a disk format change. This
negatively affected backward compatibility and platform portability.
Therefore, this field is being removed.
The function that field permitted is left unimplemented until a later
patch that will reimplement the field in a way that does not affect the
disk format.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2094
Verify that an assortment of known good reference pools can be imported
using different versions of the ZoL code.
By default references pools for the major ZFS implementation will be
checked against the most recent ZoL tags and the master development branch.
Alternate tags or branches may be verified with the '-s <src-tag> option.
Passing the keyword "installed" will instruct the script to test whatever
version is installed.
Preferentially a reference pool is used for all tests. However, if one
does not exist and the pool-tag matches one of the src-tags then a new
reference pool will be created using binaries from that source build.
This is particularly useful when you need to test your changes before
opening a pull request.
New reference pools may be added by placing a bzip2 compressed tarball
of the pool in the scripts/zpool-example directory and then passing
the -p <pool-tag> option. To increase the test coverage reference pools
should be collected for all the major ZFS implementations. Having these
pools easily available is also helpful to the developers.
Care should be taken to run these tests with a kernel supported by all
the listed tags. Otherwise build failure will cause false positives.
EXAMPLES:
The following example will verify the zfs-0.6.2 tag, the master branch,
and the installed zfs version can correctly import the listed pools.
Note there is no reference pool available for master and installed but
because binaries are available one is automatically constructed. The
working directory is also preserved between runs (-k) preventing the
need to rebuild from source for multiple runs.
zimport.sh -k -f /var/tmp/zimport \
-s "zfs-0.6.1 zfs-0.6.2 master installed" \
-p "all master installed"
--------------------- ZFS on Linux Source Versions --------------
zfs-0.6.1 zfs-0.6.2 master 0.6.2-180
-----------------------------------------------------------------
Clone SPL Skip Skip Skip Skip
Clone ZFS Skip Skip Skip Skip
Build SPL Skip Skip Skip Skip
Build ZFS Skip Skip Skip Skip
-----------------------------------------------------------------
zevo-1.1.1 Pass Pass Pass Pass
zol-0.6.1 Pass Pass Pass Pass
zol-0.6.2-173 Fail Fail Pass Pass
zol-0.6.2 Pass Pass Pass Pass
master Fail Fail Pass Pass
installed Pass Pass Pass Pass
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Issue #2094
Various errors can occur when registering property callbacks. As the
author's comments indicate, the code is very paranoid about preserving
the first-seen error when registering callbacks. This patch causes an
error seen while registering the "relatime" callback to not clobber a
previously-seen error.
Reported-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2117
Allow the caller of the zpool-create.sh script to override
the default path where file vdevs are created. This allows
for greated flexibilty when scripting.
Additionally, update the default path from /tmp/ to /var/tmp/
because these days /tmp/ is likely a ramdisk. Even though
these files are sparse they may grow large in which case they
should be backed by a physical device.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2120
zfs-fuse provides the same commands and man page names as ZoL.
Changing the names on either side would make each incompatible with
all existing documentation about ZFS. Providing bit identical files
is not possible due to differing codebases.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1866
Commit ba6a240 adjusted the behavior of 'zfs create -V'. The
caller is no longer guaranteed that udev will have finished
creating the /dev/ entries by the time to command exits. It
is therefore required that we explicitly block waiting for
udev to settle for this test to run reliably.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>