In the upstream OpenSolaris ZFS code the maximum ARC usage is
limited to 3/4 of memory or all but 1GB, whichever is larger.
Because of how Linux's VM subsystem is organized these defaults
have proven to be too large which can lead to stability issues.
To avoid making everyone manually tune the ARC the defaults are
being changed to 1/2 of memory or all but 4GB. The rationale for
this is as follows:
* Desktop Systems (less than 8GB of memory)
Limiting the ARC to 1/2 of memory is desirable for desktop
systems which have highly dynamic memory requirements. For
example, launching your web browser can suddenly result in a
demand for several gigabytes of memory. This memory must be
reclaimed from the ARC cache which can take some time. The
user will experience this reclaim time as a sluggish system
with poor interactive performance. Thus in this case it is
preferable to leave the memory as free and available for
immediate use.
* Server Systems (more than 8GB of memory)
Using all but 4GB of memory for the ARC is preferable for
server systems. These systems often run with minimal user
interaction and have long running daemons with relatively
stable memory demands. These systems will benefit most by
having as much data cached in memory as possible.
These values should work well for most configurations. However,
if you have a desktop system with more than 8GB of memory you may
wish to further restrict the ARC. This can still be accomplished
by setting the 'zfs_arc_max' module option.
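
The default described above can be sketched as follows. This is a
minimal illustration, assuming sizes are given in bytes and that the
larger of the two limits is used; arc_default_max() is a hypothetical
helper name, not the function used in the ARC code.

```c
#include <stdint.h>

#define GiB ((uint64_t)1 << 30)

/*
 * Hypothetical helper: default ARC ceiling for a machine with
 * physmem bytes of memory.  Returns the larger of 1/2 of memory
 * and all but 4GB.
 */
static uint64_t
arc_default_max(uint64_t physmem)
{
    uint64_t half = physmem / 2;
    /* Guard the subtraction so small machines cannot underflow. */
    uint64_t all_but_4g = (physmem > 4 * GiB) ? physmem - 4 * GiB : 0;

    return (half > all_but_4g ? half : all_but_4g);
}
```

Under these assumptions the two limits meet at 8GB, which is why that
size separates the desktop and server cases.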
Additionally, keep in mind these aren't currently hard limits.
The ARC is based on a slab implementation which can suffer from
memory fragmentation. Because this fragmentation is not visible
from the ARC it may believe it is within the specified limits while
actually consuming slightly more memory. How much more memory gets
consumed will be determined by how badly fragmented the slabs are.
In the long term this can be mitigated by slab defragmentation code
which was the OpenSolaris solution. Preferably, the page cache would
be used to back the ARC under Linux. See issue #75
for the benefits of more tightly integrating with the page cache.
This change also fixes an issue where the default ARC max was being
set incorrectly for machines with less than 2GB of memory. The
constant in the arc_c_max comparison must be explicitly cast to
a uint64_t type to prevent overflow and the wrong conditional
branch being taken. This failure was typically observed in VMs
which are commonly created with less than 2GB of memory.
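
The effect of the missing cast can be sketched in isolation. The
helpers below are hypothetical and use uint32_t so the truncation is
well defined in C (shifting a plain int this far is undefined); they
are not the arc_c_max code itself.

```c
#include <stdint.h>

/* 4GB computed in 32-bit arithmetic: the high bits are lost. */
static uint64_t
four_gb_narrow(void)
{
    uint32_t v = (uint32_t)4 << 30;    /* wraps to 0 */

    return (v);
}

/* 4GB computed after widening the constant to 64 bits. */
static uint64_t
four_gb_wide(void)
{
    return ((uint64_t)4 << 30);
}

/* Hypothetical stand-in for the comparison guarding "all but 4GB". */
static int
has_at_least(uint64_t physmem, uint64_t threshold)
{
    return (physmem >= threshold);
}
```

With the narrow constant a 1GB machine appears to have at least 4GB,
so the wrong branch is taken; with the widened constant it does not.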
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #75
The Solaris version of ZFS does not allow xattrs to be set on
symlinks due to the way they implemented the attropen() system
call. Linux however implements xattrs through the lgetxattr()
and lsetxattr() system calls which do not have this limitation.
The only reason this hasn't always worked under ZFS on Linux
is that the xattr handlers were not registered for symlink type
inodes. This was done simply to be consistent with the Solaris
behavior.
Upon further reflection I believe this should be allowed under
Linux. The only ill effect would be that the xattrs on symlinks
will not be visible when the pool is imported on a Solaris
system. This also has the benefit that it allows for SELinux
style security xattr labeling which expects to be able to set
xattrs on all inode types.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #272
The current ZFS implementation stores xattrs on disk using a hidden
directory. In this directory a file name represents the xattr name
and the file contents are the xattr binary data. This approach is
very flexible and allows for arbitrarily large xattrs. However,
it also suffers from a significant performance penalty. Accessing
a single xattr can require up to three disk seeks.
1) Lookup the dnode object.
2) Lookup the dnode's xattr directory object.
3) Lookup the xattr object in the directory.
To avoid this performance penalty Linux filesystems such as ext3
and xfs try to store the xattr as part of the inode on disk. When
the xattr is too large to store in the inode then a single external
block is allocated for them. In practice most xattrs are small
and this approach works well.
The addition of System Attributes (SA) to zfs provides us a clean
way to make this optimization. When the dataset property 'xattr=sa'
is set then xattrs will be preferentially stored as System Attributes.
This allows tiny xattrs (~100 bytes) to be stored with the dnode and
up to 64k of xattrs to be stored in the spill block. If additional
xattr space is required, which is unlikely under Linux, they will be
stored using the traditional directory approach.
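
The three storage tiers can be pictured with a small sketch. The
constants are taken loosely from the figures above and the helper
name is hypothetical; the real limits also have to cover xattr names
and encoding overhead, so treat the thresholds as illustrative only.

```c
#include <stddef.h>

/* Illustrative limits taken from the text, not from the source. */
#define XATTR_BONUS_MAX 100          /* "~100 bytes" with the dnode */
#define XATTR_SPILL_MAX (64 * 1024)  /* "up to 64k" in the spill block */

typedef enum {
    XATTR_IN_DNODE,
    XATTR_IN_SPILL,
    XATTR_IN_DIR
} xattr_loc_t;

/*
 * Hypothetical helper: where the combined xattrs of one file would
 * land with xattr=sa, given their total encoded size in bytes.
 */
static xattr_loc_t
xattr_sa_placement(size_t total_bytes)
{
    if (total_bytes <= XATTR_BONUS_MAX)
        return (XATTR_IN_DNODE);
    if (total_bytes <= XATTR_SPILL_MAX)
        return (XATTR_IN_SPILL);
    return (XATTR_IN_DIR);
}
```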
This optimization results in roughly a 3x performance improvement
when accessing xattrs which brings zfs roughly to parity with ext4
and xfs (see table below). When multiple xattrs are stored per-file
the performance improvements are even greater because all of the
xattrs stored in the spill block will be cached.
However, by default SA based xattrs are disabled in the Linux port
to maximize compatibility with other implementations. If you do
enable SA based xattrs then they will not be visible on platforms
which do not support this feature.
-------------------------------------------------------------------------
Time in seconds to get/set one xattr of N bytes on 100,000 files
------+---------------------------------+--------------------------------
      |            setxattr             |            getxattr
bytes |    ext4     xfs zfs-dir  zfs-sa |    ext4     xfs zfs-dir  zfs-sa
------+---------------------------------+--------------------------------
1     |    2.33   31.88   21.50    4.57 |    2.35    2.64    6.29    2.43
32    |    2.79   30.68   21.98    4.60 |    2.44    2.59    6.78    2.48
256   |    3.25   31.99   21.36    5.92 |    2.32    2.71    6.22    3.14
1024  |    3.30   32.61   22.83    8.45 |    2.40    2.79    6.24    3.27
4096  |    3.57  317.46   22.52   10.73 |    2.78   28.62    6.90    3.94
16384 |     n/a 2342.39   34.30   19.20 |     n/a   45.44  145.90    7.55
65536 |     n/a 2941.39  128.15 131.32* |     n/a  141.92  256.85 262.12*
Legend:
* ext4 - Stock RHEL6.1 ext4 mounted with '-o user_xattr'.
* xfs - Stock RHEL6.1 xfs mounted with default options.
* zfs-dir - Directory based xattrs only.
* zfs-sa - Prefer SAs but spill in to directories as needed, a
  trailing * indicates overflow in to directories occurred.
NOTE: Ext4 supports 4096 bytes of xattr name/value pairs per file.
NOTE: XFS and ZFS have no limit on xattr name/value pairs per file.
NOTE: Linux limits individual name/value pairs to 65536 bytes.
NOTE: All setxattr/getxattr calls were done after dropping the cache.
NOTE: All tests were run against a single hard drive.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #443
While we initially allowed you to set your ashift as large as 17
(SPA_MAXBLOCKSHIFT) that is actually unsafe. What wasn't considered
at the time is that each uberblock written to the vdev label ring
buffer will be of this size. Now the buffer is statically sized
to 128k and we need to be able to fit several uberblocks in it.
With a large ashift that becomes a problem.
Therefore I'm reducing the maximum configurable ashift value to 12.
This is large enough for the 4k sector drives and small enough that
we can still keep the most recent 32 uberblocks in the vdev label
ring buffer.
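
The arithmetic behind that limit can be sketched as below. The 128k
ring size comes from the text; the 1k minimum uberblock size and the
helper name are assumptions made for illustration.

```c
#define UBERBLOCK_RING_SIZE (128 * 1024)  /* 128k, per the text */
#define UBERBLOCK_MIN_SHIFT 10            /* assumed 1k minimum uberblock */

/*
 * Number of uberblocks that fit in the label ring for a given
 * ashift.  Only meaningful for ashift values up to 17.
 */
static unsigned
uberblocks_in_ring(unsigned ashift)
{
    unsigned shift = ashift > UBERBLOCK_MIN_SHIFT ?
        ashift : UBERBLOCK_MIN_SHIFT;

    return (UBERBLOCK_RING_SIZE >> shift);
}
```

Under these assumptions an ashift of 12 leaves room for 32 uberblocks,
while an ashift of 17 leaves room for exactly one.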
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #425
The Linux 3.1 kernel updated the fops->fsync() callback yet again.
They now pass the requested range and delegate the responsibility
for calling filemap_write_and_wait_range() to the callback. In
addition i_mutex is no longer held by the caller and the callback
is responsible for taking the lock if required.
This commit updates the code to provide a zpl_fsync() function
for the updated API. Implementations for the previous two APIs
are also maintained for compatibility.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #445
Update the code to use the bdi_setup_and_register() helper to
simplify the bdi integration code. The updated code now just
registers the bdi during mount and destroys it during unmount.
The only complication is that for 2.6.32 - 2.6.33 kernels the
helper wasn't available so in these cases the zfs code must
provide it. Luckily the bdi_setup_and_register() function
is trivial.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #367
Fix an unlikely failure cause in zfs_sb_create() which could
leave the dataset owned on error and thus unavailable until
after a reboot. Disown the dataset if SAs are expected but
are in fact missing.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Profiling the system during metadata intensive workloads such
as creating/removing millions of files, revealed that the system
was cpu bound. A large fraction of that cpu time was being spent
waiting on the virtual address space spin lock.
It turns out this was caused by certain heavily used kmem_caches
being backed by virtual memory. By default a kmem_cache will
dynamically determine the type of memory used based on the object
size. For large objects virtual memory is usually preferable
and for small objects physical memory is a better choice. See
the spl_slab_alloc() function for a longer discussion on this.
However, there is a certain amount of gray area when defining a
'large' object. For the following caches it turns out they were
just over the line:
* dnode_cache
* zio_cache
* zio_link_cache
* zio_buf_512_cache
* zfs_data_buf_512_cache
Because we know there will be a lot of churn in these caches,
and because we know the slabs will still be reasonably sized,
we can safely request with the KMC_KMEM flag that the caches be
backed with physical memory addresses. This entirely avoids the
need to serialize on the virtual address space lock.
As a bonus this also reduces our vmalloc usage which will be good
for 32-bit kernels which have a very small virtual address space.
It will also probably be good for interactive performance since
unrelated processes could also block on this same global lock.
Finally, we may see less cpu time being burned in the arc_reclaim
and txg_sync_threads.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #258
Be careful not to unconditionally clear the PF_MEMALLOC bit in
the task structure. It may have already been set when entering
zpl_putpage() in which case it must remain set on exit. In
particular the kswapd thread will have PF_MEMALLOC set in
order to prevent it from entering direct reclaim. By clearing
it we allow the following NULL deref to potentially occur.
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff8109c7ab>] balance_pgdat+0x25b/0x4ff
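
The save-and-restore pattern described above can be modeled outside
the kernel. The struct, flag value and helper below are stand-ins
invented for illustration; they are not the kernel's task_struct or
the zpl_putpage() code.

```c
#include <stddef.h>

/* Stand-in for the kernel's task flag; the value is illustrative. */
#define MODEL_PF_MEMALLOC 0x0800u

struct model_task {
    unsigned int flags;
};

/*
 * Run fn() with MODEL_PF_MEMALLOC set, then restore the bit to
 * whatever state the caller had on entry.
 */
static void
with_memalloc(struct model_task *t, void (*fn)(struct model_task *))
{
    unsigned int was_set = t->flags & MODEL_PF_MEMALLOC;

    t->flags |= MODEL_PF_MEMALLOC;
    if (fn != NULL)
        fn(t);
    /* Clear only if we were the ones who set it. */
    if (!was_set)
        t->flags &= ~MODEL_PF_MEMALLOC;
}
```

A caller that entered with the bit already set, as kswapd does, leaves
with it still set.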
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #287
zfs_getattr_fast() was missing a lock on the ZFS superblock which
could result in zfs_znode_dmu_fini() clearing the zp->z_sa_hdl member
while zfs_getattr_fast() was accessing the znode. The result of this
would usually be a panic.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fixes #431
When calculating space needed for SA_BONUS buffers, hdrsize is
always rounded up to next 8-aligned boundary. However, in two places
the round up was done against sum of 'total' plus hdrsize. On the
other hand, hdrsize increments by 4 each time, which means in certain
conditions, we would end up returning with will_spill == 0 and
(total + hdrsize) larger than full_space, leading to a failed
assertion because it's invalid for dmu_set_bonus.
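
The difference between the two roundings can be shown with a small
sketch. The macro and function names are hypothetical and the numbers
in the note are made up for illustration; this is not the SA code.

```c
#include <stdint.h>

/* Round x up to the next multiple of align (a power of two). */
#define ROUNDUP_P2(x, align) \
    (((x) + ((align) - 1)) & ~((uint64_t)(align) - 1))

/* Space needed when only the header size is rounded (intended). */
static uint64_t
sa_space_hdr_rounded(uint64_t total, uint64_t hdrsize)
{
    return (total + ROUNDUP_P2(hdrsize, 8));
}

/* Space computed when the sum is rounded instead (the bug). */
static uint64_t
sa_space_sum_rounded(uint64_t total, uint64_t hdrsize)
{
    return (ROUNDUP_P2(total + hdrsize, 8));
}
```

For example, with total = 100 and hdrsize = 12 the first form needs
116 bytes while the second reports 112, so a buffer with 112 bytes of
space would wrongly be judged large enough.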
Reviewed by: Matthew Ahrens <matt@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Gordon Ross <gwr@nexenta.com>
References to Illumos issue:
https://www.illumos.org/issues/1661
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #426
ZFS contains error messages that point to the defunct www.sun.com
domain, which is currently offline. Change these error messages
to use the zfsonlinux.org mirror instead.
This commit depends on:
zfsonlinux/zfsonlinux.github.com@8e10ead3dc
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Register the setattr/getattr callbacks for symlinks. Without these
the generic inode_setattr() and generic_fillattr() functions will
be used. In the setattr case this will only result in the inode being
updated in memory, the dirty_inode callback would also normally run
but none is registered for zfs.
The straightforward fix is to set the setattr/getattr callbacks
for symlinks so they are handled just like files and directories.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #412
An incomplete guid_to_ds_map would cause restore_write_byref() to fail
while receiving a de-duplicated backup stream.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Garrett D'Amore <garrett@nexenta.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Approved by: Gordon Ross <gwr@nexenta.com>
References to Illumos issue and patch:
- https://www.illumos.org/issues/755
- https://github.com/illumos/illumos-gate/commit/ec5cf9d53a
Signed-off-by: Gunnar Beutner <gunnar@beutner.name>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #372
Export all symbols already marked extern in the zfs_vfsops.h
header. Several non-static symbols have also been added to
the header and exported. This allows external modules to
more easily create and manipulate properly created ZFS
filesystem type datasets.
Rename zfsvfs_teardown() to zfs_sb_teardown() and export it.
This is done simply for consistency with the rest of the code
base. All other zfsvfs_* functions have already been renamed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Export all the symbols for the system attribute (SA) API. This
allows external modules to cleanly manipulate the SAs associated
with a dnode. Documentation for the SA API can be found in the
module/zfs/sa.c source.
This change also removes the zfs_sa_uprade_pre, and
zfs_sa_uprade_post prototypes. The functions themselves were
dropped some time ago.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Due to the confusion in Linux statfs between f_frsize and f_bsize
the block counts were changed to be in units of z_max_blksize
instead of SPA_MINBLOCKSIZE as it is on other platforms.
However, the free files calculation in zfs_statvfs() is limited by
the free blocks count, since each dnode consumes one block/sector.
This provided a reasonable estimate of free inodes, but on Linux
this meant that the free inodes count was underestimated by a large
amount, since 256 512-byte dnodes can fit into a 128kB block, and
more if the max blocksize is increased to 1MB or larger.
Also, the use of SPA_MINBLOCKSIZE is semantically incorrect since
DNODE_SIZE may change to a value other than SPA_MINBLOCKSIZE and
may even change per dataset, and devices with large sectors setting
ashift will also use a larger blocksize.
Correct the f_ffree calculation to use (availbytes >> DNODE_SHIFT)
to more accurately compute the maximum number of dnodes that can
be created.
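
The size of the underestimate can be sketched as follows. The shift
values match the 512-byte dnode and 128kB block sizes mentioned above;
the helper names are hypothetical and this is not zfs_statvfs().

```c
#include <stdint.h>

#define DNODE_SHIFT  9   /* 512-byte dnodes, as stated above */
#define MAX_BLKSHIFT 17  /* 128kB z_max_blksize, as stated above */

/* Old estimate: one free file per free block of z_max_blksize bytes. */
static uint64_t
ffree_by_blocks(uint64_t availbytes)
{
    return (availbytes >> MAX_BLKSHIFT);
}

/* New estimate: one free file per dnode-sized chunk of free space. */
static uint64_t
ffree_by_dnodes(uint64_t availbytes)
{
    return (availbytes >> DNODE_SHIFT);
}
```

With 1GB available the old estimate gives 8192 free files and the new
one 2097152, a factor of 256.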
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #413
Closes #400
Export all the symbols for the ZAP API. This allows external modules
to cleanly interface with ZAP type objects. Previously only a subset
of the functionality was exposed. Documentation for the ZAP API can be
found in the sys/zap.h header.
This change also removes a duplicate zap_increment_int() prototype.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Suppress the warning for this large kmem_alloc() because it is not
that far over the warning threshold (8k) and it is short lived.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Caught by code inspection, the variable zsb was referenced after
being freed. Move the kmem_free() to the end of the function.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This warning was accidentally introduced by commit
f3ab88d646 which updated the
.readpages() implementation. The fix is to simply cast
the helper function to the appropriate type when passed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Unlike the .readpage() callback, which is passed a single locked page
to be populated, the .readpages() callback is passed a list of unlocked
pages which are all marked for read-ahead (PG_readahead set). It is
the responsibility of .readpages() to ensure the pages are properly
locked before being populated.
Prior to this change the requested read-ahead pages would be updated
outside of the page lock which is unsafe. The unlocked pages would then
be unlocked again which is harmless but should have been immediately
detected as a bug. Unfortunately, newer kernels failed to detect this issue
because the check is done with a VM_BUG_ON which is disabled by default.
Luckily, the old Debian Lenny 2.6.26 kernel caught this because it
simply uses a BUG_ON.
The straightforward fix for this is to update the .readpages() callback
to use the read_cache_pages() helper function. The helper function will
ensure that each page in the list is properly locked before it is passed
to the .readpage() callback. In addition to resolving the bug, this results
in a nice simplification of the existing code.
The downside to this change is that instead of passing one large read
request to the dmu multiple smaller ones are submitted. All of these
requests however are marked for readahead so the lower layers should
issue a large I/O regardless. Thus most of the requests should hit the
ARC cache.
Further optimization of this code can be done in the future if a
performance analysis determines it to be worthwhile. But for the moment, it is
preferable that code be correct and understandable.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #355
For a long time now the kernel has been moving away from using the
pdflush daemon to write 'old' dirty pages to disk. The primary reason
for this is because the pdflush daemon is single threaded and can be
a limiting factor for performance. Since pdflush sequentially walks
the dirty inode list for each super block any delay in processing can
slow down dirty page writeback for all filesystems.
The replacement for pdflush is called bdi (backing device info). The
bdi system involves creating a per-filesystem control structure each
with its own private sets of queues to manage writeback. The advantage
is greater parallelism which improves performance and prevents a single
filesystem from slowing writeback to the others.
For a long time both systems co-existed in the kernel so it wasn't
strictly required to implement the bdi scheme. However, as of
Linux 2.6.36 kernels the pdflush functionality has been retired.
Since ZFS already bypasses the page cache for most I/O this is only
an issue for mmap(2) writes which must go through the page cache.
Even then adding this missing support for newer kernels was overlooked
because there are other mechanisms which can trigger writeback.
However, there is one critical case where not implementing the bdi
functionality can cause problems. If an application handles a page
fault it can enter the balance_dirty_pages() callpath. This will
result in the application hanging until the number of dirty pages in
the system drops below the dirty ratio.
Without a registered backing_device_info for the filesystem the
dirty pages will not get written out. Thus the application will hang.
As mentioned above this was less of an issue with older kernels because
pdflush would eventually write out the dirty pages.
This change adds a backing_device_info structure to the zfs_sb_t
which is already allocated per-super block. It is then registered
when the filesystem is mounted and unregistered on unmount. It will
not be registered for mounted snapshots which are read-only. This
change will result in a flush-<pool> thread being dynamically created
and destroyed per-mounted filesystem for writeback.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #174
While the existing implementation of .writepage()/zpl_putpage() was
functional it was not entirely correct. In particular, it would move
dirty pages in to a clean state simply after copying them in to the
ARC cache. This would result in the pages being lost if the system
were to crash even though the Linux VFS believed them to be safe on
stable storage.
Since at the moment virtually all I/O, except mmap(2), bypasses the
page cache this isn't as bad as it sounds. However, as we hopefully
start using the page cache more getting this right becomes more
important so it's good to improve this now.
This patch takes a big step in that direction by updating the code
to correctly move dirty pages through a writeback phase before they
are marked clean. When a dirty page is copied in to the ARC it will
now be set in writeback and a completion callback is registered with
the transaction. The page will stay in writeback until the dmu runs
the completion callback indicating the page is on stable storage.
At this point the page can be safely marked clean.
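
The page life cycle described above can be pictured with a toy model.
Everything here (types, names, the fixed-size txg) is invented for
illustration and does not reflect the real zpl_putpage() or DMU
callback interfaces.

```c
#include <stddef.h>

typedef enum { PG_DIRTY, PG_WRITEBACK, PG_CLEAN } pg_state_t;

struct model_page {
    pg_state_t state;
};

#define MODEL_TXG_MAX 8

/* A toy transaction group holding pages awaiting stable storage. */
struct model_txg {
    struct model_page *pages[MODEL_TXG_MAX];
    size_t count;
};

/* Copy-in step: the page enters writeback and joins the txg. */
static int
model_putpage(struct model_txg *txg, struct model_page *pg)
{
    if (txg->count == MODEL_TXG_MAX)
        return (-1);
    pg->state = PG_WRITEBACK;
    txg->pages[txg->count++] = pg;
    return (0);
}

/* Commit step: completion callbacks run and pages become clean. */
static void
model_txg_commit(struct model_txg *txg)
{
    for (size_t i = 0; i < txg->count; i++)
        txg->pages[i]->state = PG_CLEAN;
    txg->count = 0;
}
```

The point of the model is that a page is never marked clean at copy-in
time, only once the group it belongs to has been committed.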
This process is normally entirely asynchronous and will be repeated
for every dirty page. This may initially sound inefficient but most
of these pages will end up in a few txgs. That means when they are
eventually written to disk they should be nicely batched. However,
there is room for improvement. It may still be desirable to batch
up the pages in to larger writes for the dmu. This would reduce
the number of callbacks and small 4k buffers required by the ARC.
Finally, if the caller requires that the I/O be done synchronously
by setting WB_SYNC_ALL, or if ZFS_SYNC_ALWAYS is set, then the I/O
will trigger a zil_commit() to flush the data to stable storage.
At that point the registered callbacks will be run, leaving the
data safe on disk and marked clean before returning from .writepage.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The function txg_delay() is used to delay txg (transaction group)
threads in ZFS. The timeout value for this function is calculated
using:
    int timeout = ddi_get_lbolt() + ticks;
Later, the actual wait is performed:
    while (ddi_get_lbolt() < timeout &&
        tx->tx_syncing_txg < txg-1 && !txg_stalled(dp))
            (void) cv_timedwait(&tx->tx_quiesce_more_cv, &tx->tx_sync_lock,
                timeout - ddi_get_lbolt());
The ddi_get_lbolt() function returns current uptime in clock ticks
and is typed as clock_t. The clock_t type on 64-bit architectures
is int64_t.
The "timeout" variable will overflow depending on the tick frequency
(e.g. for 1000 it will overflow in 24.855 days). This will make the
expression "ddi_get_lbolt() < timeout" always false - txg threads will
not be delayed anymore at all. This leads to a slowdown in ZFS writes.
The attached patch initializes timeout as clock_t to match the return
value of ddi_get_lbolt().
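
The failure mode can be reproduced in user space with a short sketch.
The helper names are hypothetical, and the narrowing conversion is
implementation-defined in C; the comments assume the usual
two's-complement wraparound seen on common compilers.

```c
#include <stdint.h>

/*
 * Timeout stored in a 32-bit int, as in the original code.  Once
 * now + ticks exceeds INT32_MAX the stored value goes negative.
 */
static int
should_wait_int(int64_t now, int64_t ticks)
{
    int32_t timeout = (int32_t)(now + ticks);

    return (now < timeout);
}

/* Timeout stored in a 64-bit type matching the clock. */
static int
should_wait_clock(int64_t now, int64_t ticks)
{
    int64_t timeout = now + ticks;

    return (now < timeout);
}

/* Days until a 32-bit tick counter overflows at hz ticks per second. */
static double
days_until_overflow(int64_t hz)
{
    return ((double)INT32_MAX / (double)hz / 86400.0);
}
```

Before the overflow point both variants agree; after it only the
64-bit variant still asks the caller to wait.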
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #352
Prior to revision 11314 if a user was recursively destroying
snapshots of a dataset the target dataset was not required to
exist. The zfs_secpolicy_destroy_snaps() function introduced
the security check on the target dataset, so since then if the
target dataset does not exist, the recursive destroy is not
performed. Before 11314, only a delete permission check on
the snapshot's master dataset was performed.
Steps to reproduce:
    zfs create pool/a
    zfs snapshot pool/a@s1
    zfs destroy -r pool@s1
Therefore I suggest falling back to the old security check, if
the target snapshot does not exist and continue with the destroy.
References to Illumos issue and patch:
- https://www.illumos.org/issues/1043
- https://www.illumos.org/attachments/217/recursive_dataset_destroy.patch
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
Moving the zil_free() cleanup to zil_close() prevents this
problem from occurring in the first place. There is a very
good description of the issue and fix in Illumos #883.
Reviewed by: Matt Ahrens <Matt.Ahrens@delphix.com>
Reviewed by: Adam Leventhal <Adam.Leventhal@delphix.com>
Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Garrett D'Amore <garrett@nexenta.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Gordon Ross <gwr@nexenta.com>
References to Illumos issue and patch:
- https://www.illumos.org/issues/883
- https://github.com/illumos/illumos-gate/commit/c9ba2a43cb
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
Add a "REFRATIO" property, which is the compression ratio based on
data referenced. For snapshots, this is the same as COMPRESSRATIO,
but for filesystems/volumes, the COMPRESSRATIO is based on the
data "USED" (i.e., includes blocks in children, but not blocks
shared with the origin).
This is needed to figure out how much space a filesystem would
use if it were not compressed (ignoring snapshots).
Reviewed by: George Wilson <George.Wilson@delphix.com>
Reviewed by: Adam Leventhal <Adam.Leventhal@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Mark Musante <Mark.Musante@oracle.com>
Reviewed by: Garrett D'Amore <garrett@nexenta.com>
Approved by: Garrett D'Amore <garrett@nexenta.com>
References to Illumos issue and patch:
- https://www.illumos.org/issues/1092
- https://github.com/illumos/illumos-gate/commit/187d6ac08a
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
Today zfs tries to allocate blocks evenly across all devices.
This means when devices are imbalanced zfs will use lots of
CPU searching for space on devices which tend to be pretty
full. It should instead fail quickly on the full LUNs and
move onto devices which have more availability.
Reviewed by: Eric Schrock <Eric.Schrock@delphix.com>
Reviewed by: Matt Ahrens <Matt.Ahrens@delphix.com>
Reviewed by: Adam Leventhal <Adam.Leventhal@delphix.com>
Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Approved by: Garrett D'Amore <garrett@nexenta.com>
References to Illumos issue and patch:
- https://www.illumos.org/issues/510
- https://github.com/illumos/illumos-gate/commit/5ead3ed965
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
Note that with the current ZFS code, it turns out that the vdev
cache is not helpful, and in some cases actually harmful. It
is better if we disable this. Once some time has passed, we
should actually remove this to simplify the code. For now we
just disable it by setting the zfs_vdev_cache_size to zero.
Note that Solaris 11 has made these same changes.
References to Illumos issue and patch:
- https://www.illumos.org/issues/175
- https://github.com/illumos/illumos-gate/commit/b68a40a845
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
Hypothesis about what's going on here.
At some time in the past, something, i.e. dnode_reallocate()
calls one of:
dbuf_rm_spill(dn, tx);
These will do:
dbuf_rm_spill(dnode_t *dn, dmu_tx_t *tx)
dbuf_free_range(dn, DMU_SPILL_BLKID, DMU_SPILL_BLKID, tx)
dbuf_undirty(db, tx)
Currently dbuf_undirty can leave a spill block in dn_dirty_records[],
(it having been put there previously by dbuf_dirty) and free it.
Sometime later, dbuf_sync_list trips over this reference to free'd
(and typically reused) memory.
Also, dbuf_undirty can call dnode_clear_range with a bogus
block ID. It needs to test for DMU_SPILL_BLKID, similar to
how dnode_clear_range is called in dbuf_dirty().
References to Illumos issue and patch:
- https://www.illumos.org/issues/764
- https://github.com/illumos/illumos-gate/commit/3f2366c2bb
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Mark.Maybe@oracle.com
Reviewed by: Albert Lee <trisk@nexenta.com>
Approved by: Garrett D'Amore <garrett@nexenta.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
Update two kmem_alloc()'s in dbuf_dirty() to use KM_PUSHPAGE.
Because these functions are called from txg_sync_thread we
must ensure they don't reenter the zfs filesystem code via
the .writepage callback. This would result in a deadlock.
This deadlock is rare and has only been observed once under
an abusive mmap() write workload.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Long, long, long ago when the effort to port ZFS was begun
the zfs_create_fs() function was heavily modified to remove
all of its VFS dependencies. This allowed Lustre to use
the dataset without us having to spend the time porting all
the required VFS code.
Fast-forward several years and we now have all the VFS code
in place but are still relying on the modified zfs_create_fs().
This isn't required anymore and we can now use zfs_mknode()
to create the root znode for the filesystem.
This commit reverts the contents of zfs_create_fs() to largely
match the upstream OpenSolaris code. There have been minor
modifications to accommodate the Linux VFS but that is all.
This code fixes issue #116 by bootstrapping enough of the VFS
data structures so we can rely on zfs_mknode() to create the
root directory. This ensures it is created properly with
support for system attributes. Previously it wasn't which
is why it behaved differently than all other directories
when modified.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #116
Newly created files were always being created with the fsuid/fsgid
in the current user's credentials. This is correct except in the
case when the parent directory sets the 'setgid' bit. In this
case according to POSIX the newly created file/directory should
inherit the gid of the parent directory. Additionally, in the
case of a subdirectory it should also inherit the 'setgid' bit.
Finally, this commit performs a little cleanup of the vattr_t
initialization by moving it to a common helper function.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #262
When running 'make install' without DESTDIR set the module install
rules would mistakenly destroy the 'modules.*' files for ALL of
your installed kernels. This could lead to a non-functional system
for the alternate kernels because 'depmod -a' will only be run for
the kernel which was compiled against. This issue would not impact
anyone using the 'make <deb|rpm|pkg>' build targets to build and
install packages.
The fix for this issue is to only remove extraneous build products
when DESTDIR is set. This almost exclusively indicates we are
building packages and installing the build products in to a temporary
staging location. Additionally, limit the removal of the unneeded
build products to the target kernel version.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #328
Disable the normal reclaim path for zpl_putpage(). This ensures that
all memory allocations under this call path will never enter direct
reclaim. If this were to happen the VM might try to write out
additional pages by calling zpl_putpage() again resulting in a
deadlock.
This situation is typically handled in Linux by marking each offending
allocation GFP_NOFS. However, since much of the code used is common
it makes more sense to use PF_MEMALLOC to flag the entire call tree.
Alternately, the code could be updated to pass the needed allocation
flags but that's a more invasive change.
The following example of the above described deadlock was triggered
by test 074 in the xfstests suite.
Call Trace:
[<ffffffff814dcdb2>] down_write+0x32/0x40
[<ffffffffa05af6e4>] dnode_new_blkid+0x94/0x2d0 [zfs]
[<ffffffffa0597d66>] dbuf_dirty+0x556/0x750 [zfs]
[<ffffffffa05987d1>] dmu_buf_will_dirty+0x81/0xd0 [zfs]
[<ffffffffa059ee70>] dmu_write+0x90/0x170 [zfs]
[<ffffffffa0611afe>] zfs_putpage+0x2ce/0x360 [zfs]
[<ffffffffa062875e>] zpl_putpage+0x1e/0x60 [zfs]
[<ffffffffa06287b2>] zpl_writepage+0x12/0x20 [zfs]
[<ffffffff8115f907>] writeout+0xa7/0xd0
[<ffffffff8115fa6b>] move_to_new_page+0x13b/0x170
[<ffffffff8115fed4>] migrate_pages+0x434/0x4c0
[<ffffffff811559ab>] compact_zone+0x4fb/0x780
[<ffffffff81155ed1>] compact_zone_order+0xa1/0xe0
[<ffffffff8115602c>] try_to_compact_pages+0x11c/0x190
[<ffffffff811200bb>] __alloc_pages_nodemask+0x5eb/0x8b0
[<ffffffff8115464a>] alloc_pages_current+0xaa/0x110
[<ffffffff8111e36e>] __get_free_pages+0xe/0x50
[<ffffffffa03f0e2f>] kv_alloc+0x3f/0xb0 [spl]
[<ffffffffa03f11d9>] spl_kmem_cache_alloc+0x339/0x660 [spl]
[<ffffffffa05950b3>] dbuf_create+0x43/0x370 [zfs]
[<ffffffffa0596fb1>] __dbuf_hold_impl+0x241/0x480 [zfs]
[<ffffffffa0597276>] dbuf_hold_impl+0x86/0xc0 [zfs]
[<ffffffffa05977ff>] dbuf_hold_level+0x1f/0x30 [zfs]
[<ffffffffa05a9dde>] dmu_tx_check_ioerr+0x4e/0x110 [zfs]
[<ffffffffa05aa1f9>] dmu_tx_count_write+0x359/0x6f0 [zfs]
[<ffffffffa05aa5df>] dmu_tx_hold_write+0x4f/0x70 [zfs]
[<ffffffffa0611a6d>] zfs_putpage+0x23d/0x360 [zfs]
[<ffffffffa062875e>] zpl_putpage+0x1e/0x60 [zfs]
[<ffffffff811221f9>] write_cache_pages+0x1c9/0x4a0
[<ffffffffa0628738>] zpl_writepages+0x18/0x20 [zfs]
[<ffffffff81122521>] do_writepages+0x21/0x40
[<ffffffff8119bbbd>] writeback_single_inode+0xdd/0x2c0
[<ffffffff8119bfbe>] writeback_sb_inodes+0xce/0x180
[<ffffffff8119c11b>] writeback_inodes_wb+0xab/0x1b0
[<ffffffff8119c4bb>] wb_writeback+0x29b/0x3f0
[<ffffffff8119c6cb>] wb_do_writeback+0xbb/0x240
[<ffffffff811308ea>] bdi_forker_task+0x6a/0x310
[<ffffffff8108ddf6>] kthread+0x96/0xa0
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #327
When modifying overlapping regions of a file using mmap(2) and
write(2)/read(2) it is possible to deadlock due to a lock inversion.
The zfs_write() and zfs_read() hooks first take the zfs range lock
and then lock the individual pages. Conversely, when using mmap'ed
I/O the zpl_writepage() hook is called with the individual page
locks already taken and then zfs_putpage() takes the zfs range lock.
The most straightforward fix is to simply not take the zfs range
lock in the mmap(2) case. The individual pages will still be locked
thus serializing access. Updating the same region of a file with
write(2) and mmap(2) has always been a dodgy thing to do. This change
at a minimum ensures we don't deadlock and is consistent with the
existing Linux semantics enforced by the VFS.
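The inversion and the fix can be illustrated with a tiny lock-order
checker. This is a hypothetical userspace model, not ZFS code: the
lock identifiers, model_lock(), model_write() and model_writepage()
are invented for the sketch, and it only records acquisition order
rather than blocking.

```c
#include <stdbool.h>

enum lock_id { RANGE_LOCK, PAGE_LOCK, NUM_LOCKS };

/* order_seen[a][b] is true once lock b was taken while a was held. */
bool order_seen[NUM_LOCKS][NUM_LOCKS];
bool held[NUM_LOCKS];
int inversions;

void model_lock(enum lock_id id)
{
	for (int h = 0; h < NUM_LOCKS; h++) {
		if (!held[h] || h == (int)id)
			continue;
		if (order_seen[id][h])
			inversions++;	/* opposite order was seen before */
		order_seen[h][id] = true;
	}
	held[id] = true;
}

void model_unlock(enum lock_id id)
{
	held[id] = false;
}

/* write(2)/read(2) path: range lock first, then the page lock. */
void model_write(void)
{
	model_lock(RANGE_LOCK);
	model_lock(PAGE_LOCK);
	model_unlock(PAGE_LOCK);
	model_unlock(RANGE_LOCK);
}

/* mmap path: the page is already locked when the hook runs. */
void model_writepage(bool take_range_lock)
{
	model_lock(PAGE_LOCK);
	if (take_range_lock) {		/* behaviour before the fix */
		model_lock(RANGE_LOCK);
		model_unlock(RANGE_LOCK);
	}
	model_unlock(PAGE_LOCK);
}
```

Calling model_write() followed by model_writepage(true) records an
inversion, while model_writepage(false) never takes the second lock
and so cannot invert the order.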
This isn't an issue under Solaris because the only range locking
performed will be with the zfs range locks. It's up to each filesystem
to perform its own file locking. Under Linux the VFS provides many
of these services.
It may be possible/desirable at a later date to entirely dump the
existing zfs range locking and rely on the Linux VFS page locks.
However, for now it's safest to perform both layers of locking until
zfs is more tightly integrated with the page cache.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #302
This commit fixes a regression which was accidentally introduced by
the Linux 2.6.39 compatibility changes. As part of these changes
instead of holding an active reference on the namespace (which is
no longer possible) a reference is taken on the super block. This
reference ensures the super block remains valid while it is in use.
To handle the unlikely race condition of the filesystem being
unmounted concurrently with the start of a 'zfs send/recv' the
code was updated to only take the super block reference when there
was an existing reference. This indicates that the filesystem is
active and in use.
Unfortunately, this assumption does not hold for 'zfs recv'. The
newly created dataset will not have a super block without an
active reference which results in the 'dataset is busy' error.
The most straightforward fix for this is to simply update the
code to always take the reference even when it's zero. This
may expose us to a very unlikely concurrent umount/send/recv
race but the consequences of that are minor.
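The difference between the two reference-taking policies can be
sketched with C11 atomics. This is a userspace illustration only:
struct model_sb, model_get_if_active() and model_get() are
hypothetical names, standing for an increment-unless-zero operation
and an unconditional increment respectively.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the super block's active reference count. */
struct model_sb {
	atomic_int active;
};

/* Old behaviour: take a reference only if one already exists. */
bool model_get_if_active(struct model_sb *sb)
{
	int cur = atomic_load(&sb->active);

	while (cur != 0) {
		/* On failure cur is reloaded and the zero test repeats. */
		if (atomic_compare_exchange_weak(&sb->active, &cur, cur + 1))
			return true;
	}
	return false;	/* surfaces to the user as 'dataset is busy' */
}

/* New behaviour: always take the reference, even from zero. */
void model_get(struct model_sb *sb)
{
	atomic_fetch_add(&sb->active, 1);
}
```

With a freshly created dataset the count starts at zero, so the
conditional variant fails where the unconditional one succeeds.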
Closes #319
There is at most a factor of 3x performance improvement to be
had by using the Linux generic_fillattr() helper. However, to
use it safely we need to ensure the values in a cached inode
are kept rigorously up to date. Unfortunately, this isn't
the case for the blksize, blocks, and atime fields. At the
moment the authoritative values are still stored in the znode.
This patch introduces an optimized zfs_getattr_fast() call.
The idea is to use the up to date values from the inode and
the blksize, blocks, and atime fields from the znode. At some
later date we should be able to strictly use the inode values
and further improve performance.
The remaining overhead in the zfs_getattr_fast() call can be
attributed to having to take the znode mutex. This overhead is
unavoidable until the inode is kept strictly up to date. The
careful reader will notice that we do not use the customary
ZFS_ENTER()/ZFS_EXIT() macros. These macros are designed to
ensure the filesystem is not torn down in the middle of an
operation. However, in this case the VFS is holding a
reference on the active inode so we know this is impossible.
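The idea of mixing the two sources can be sketched as follows. This
is not the actual zfs_getattr_fast() implementation: the structures
are heavily reduced hypothetical stand-ins, and a pthread mutex
models the znode mutex.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for the cached inode. */
struct model_inode {
	uint32_t mode;
	uint64_t size;
	int64_t mtime;
};

/* Hypothetical, reduced stand-in for the znode. */
struct model_znode {
	pthread_mutex_t lock;	/* models the znode mutex */
	uint32_t blksize;
	uint64_t blocks;
	int64_t atime;
	struct model_inode inode;
};

struct model_stat {
	uint32_t mode;
	uint32_t blksize;
	uint64_t size;
	uint64_t blocks;
	int64_t atime;
	int64_t mtime;
};

void model_getattr_fast(struct model_znode *zp, struct model_stat *st)
{
	pthread_mutex_lock(&zp->lock);

	/* Fields the cached inode already keeps up to date. */
	st->mode = zp->inode.mode;
	st->size = zp->inode.size;
	st->mtime = zp->inode.mtime;

	/* Fields whose authoritative copy still lives in the znode. */
	st->blksize = zp->blksize;
	st->blocks = zp->blocks;
	st->atime = zp->atime;

	pthread_mutex_unlock(&zp->lock);
}
```

Once the inode copies of the last three fields are authoritative, the
second group and the mutex around it could be dropped from the sketch.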
=================== Performance Tests ========================
This test calls the fstat(2) system call 10,000,000 times on
an open file description in a tight loop. The test results
show the zfs stat(2) performance is now only 22% slower than
ext4. This is a 2.5x improvement and there is a clear long
term plan to get to parity with ext4.
filesystem    | test-1  test-2  test-3  | average | times-ext4
--------------+-------------------------+---------+-----------
ext4          |  7.785s  7.899s  7.284s |  7.656s | 1.000x
zfs-0.6.0-rc4 | 24.052s 22.531s 23.857s | 23.480s | 3.066x
zfs-faststat  |  9.224s  9.398s  9.485s |  9.369s | 1.223x
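The test program itself is not included here. A minimal sketch of an
equivalent timing loop, assuming a POSIX system, might look like the
following; time_fstat_loop() and its parameters are hypothetical
names chosen for this example.

```c
#define _POSIX_C_SOURCE 200809L

#include <fcntl.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/*
 * Call fstat(2) 'iterations' times on one open descriptor and return
 * the elapsed wall-clock time in seconds, or -1.0 on error.
 */
double time_fstat_loop(const char *path, long iterations)
{
	struct timespec start, end;
	struct stat st;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1.0;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (long i = 0; i < iterations; i++) {
		if (fstat(fd, &st) != 0) {
			close(fd);
			return -1.0;
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &end);
	close(fd);

	return (double)(end.tv_sec - start.tv_sec) +
	    (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Running it with 10,000,000 iterations against a file on each
filesystem would give numbers comparable in kind to the table above,
though absolute values depend on the hardware and kernel.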
The second test is to run 'du' of a copy of the /usr tree
which contains 110514 files. The test is run multiple times
using both a cold cache (/proc/sys/vm/drop_caches) and
a hot cache. As expected this change significantly improved
the zfs hot cache performance but doesn't quite bring zfs to
parity with ext4.
A little surprisingly the zfs cold cache performance is better
than ext4. This can probably be attributed to the zfs allocation
policy of co-locating all the metadata on disk which minimizes
seek times. By default the ext4 allocator will spread the data
over the entire disk only co-locating each directory.
filesystem    |  cold   |  hot
--------------+---------+--------
ext4          | 13.318s | 1.040s
zfs-0.6.0-rc4 |  4.982s | 1.762s
zfs-faststat  |  4.933s | 1.345s
The performance of the L2ARC can be tweaked by a number of tunables, which
may be necessary for different workloads:
l2arc_write_max     max write bytes per interval
l2arc_write_boost   extra write bytes during device warmup
l2arc_noprefetch    skip caching prefetched buffers
l2arc_headroom      number of max device writes to precache
l2arc_feed_secs     seconds between L2ARC writing
l2arc_feed_min_ms   min feed interval in milliseconds
l2arc_feed_again    turbo L2ARC warmup
l2arc_norw          no reads during writes
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #316
The remaining code that is guarded by HAVE_SHARE ifdefs is related to the
.zfs/shares functionality which is currently not available on Linux.
On Solaris the .zfs/shares directory can be used to set permissions for
SMB shares.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The sharenfs and sharesmb properties depend on the libshare library
to export datasets via NFS and SMB. This commit implements the base
libshare functionality as well as support for managing NFS shares.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Under Linux you may only disable USER xattrs. The SECURITY,
SYSTEM, and TRUSTED xattr namespaces must always be available
if xattrs are supported by the filesystem. The enforcement
of USER xattrs is performed in the zpl_xattr_user_* handlers.
Under Solaris there is only a single xattr namespace which
is managed globally.
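The namespace rule can be sketched as a small check. This is not the
zpl_xattr_user_* handler code: model_xattr_permitted() is a
hypothetical helper, and returning -EOPNOTSUPP for a disabled USER
namespace is an assumption made for the example.

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Returns 0 if access to 'name' is allowed, or a negative errno. */
int model_xattr_permitted(const char *name, bool user_xattrs_enabled)
{
	static const char user_prefix[] = "user.";

	/* Only the USER namespace can be switched off. */
	if (strncmp(name, user_prefix, sizeof(user_prefix) - 1) == 0 &&
	    !user_xattrs_enabled)
		return -EOPNOTSUPP;	/* assumed error code */

	/* security.*, system.* and trusted.* always stay available. */
	return 0;
}
```

For example, "user.comment" is rejected while USER xattrs are
disabled, whereas "security.selinux" is allowed either way.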
The Linux kernel already has support for mandatory locking. This
change just replaces the Solaris mandatory locking calls with the
Linux equivalents. In fact, it looks like this code could be
removed entirely because this checking is already done generically
in the Linux VFS. However, for now we'll leave it in place even
if it is redundant just in case we missed something.
The original patch to update the code to support mandatory locking
was done by Rohan Puri. This patch is an updated version which is
compatible with the previous mount option handling changes.
Original-Patch-by: Rohan Puri <rohan.puri15@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #222
Closes #253
The .get_sb callback has been replaced by a .mount callback
in the file_system_type structure. When using the new
interface the caller must now use the mount_nodev() helper.
Unfortunately, the new interface no longer passes the vfsmount
down to the zfs layers. This poses a problem for the existing
implementation because we currently save this pointer in the
super block for later use. It provides our only entry point
into the namespace layer for manipulating certain mount options.
This needed to be done originally to allow commands like
'zfs set atime=off tank' to work properly. It also allowed me
to keep more of the original Solaris code unmodified. Under
Solaris there is a 1-to-1 mapping between a mount point and a
file system so this is a fairly natural thing to do. However,
under Linux there may be multiple entries in the namespace
which reference the same filesystem. Thus keeping a back
reference from the filesystem to the namespace is complicated.
Rather than introduce some ugly hack to get the vfsmount and
continue as before, I'm leveraging this API change to update
the ZFS code to do things in a more natural way for Linux.
This has the upside that it resolves the compatibility issue
for the long term and fixes several other minor bugs which
have been reported.
This commit updates the code to remove this vfsmount back
reference entirely. All modifications to filesystem mount
options are now passed in to the kernel via a '-o remount'.
This is the expected Linux mechanism and allows the namespace
to properly handle any options which apply to it before passing
them on to the file system itself.
Aside from fixing the compatibility issue, removing the
vfsmount has had the benefit of simplifying the code. This
change, while fairly involved, has turned out nicely.
Closes #246
Closes #217
Closes #187
Closes #248
Closes #231
The security_inode_init_security() function now takes an additional
qstr argument which must be passed in from the dentry if available.
Passing NULL is safe when no qstr is available; the relevant
security checks will just be skipped.
Closes #246
Closes #217
Closes #187
Under Linux the VFS handles virtually all of the mmap() access
checks. Filesystem specific checks are left to be handled in
the .mmap() hook and normally there aren't any.
However, ZFS provides a few attributes which can influence the
mmap behavior and should be honored. Note, currently the code
to modify these attributes has not been implemented under Linux.
* ZFS_IMMUTABLE | ZFS_READONLY | ZFS_APPENDONLY: when any of these
attributes are set a file may not be mmaped with write access.
* ZFS_AV_QUARANTINED: when set a file may not be mmaped with
read or exec access.
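The two rules above can be sketched as a single check. This is a
userspace illustration under stated assumptions: the MODEL_* bit
values are hypothetical and do not match the real ZFS_* or VM_*
constants, and the error codes returned (-EPERM and -EACCES) are
assumed rather than taken from this change.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical bit values for illustration; the real constants differ. */
#define MODEL_VM_READ			0x1u
#define MODEL_VM_WRITE			0x2u
#define MODEL_VM_EXEC			0x4u

#define MODEL_ZFS_READONLY		0x1u
#define MODEL_ZFS_IMMUTABLE		0x2u
#define MODEL_ZFS_APPENDONLY		0x4u
#define MODEL_ZFS_AV_QUARANTINED	0x8u

/* Returns 0 if the mapping is allowed, or a negative errno (assumed). */
int model_mmap_check(uint32_t vm_flags, uint32_t zfs_flags)
{
	/* No writable mappings of immutable/readonly/appendonly files. */
	if ((vm_flags & MODEL_VM_WRITE) &&
	    (zfs_flags & (MODEL_ZFS_IMMUTABLE | MODEL_ZFS_READONLY |
	    MODEL_ZFS_APPENDONLY)))
		return -EPERM;

	/* No readable or executable mappings of quarantined files. */
	if ((vm_flags & (MODEL_VM_READ | MODEL_VM_EXEC)) &&
	    (zfs_flags & MODEL_ZFS_AV_QUARANTINED))
		return -EACCES;

	return 0;
}
```

A read-only mapping of a ZFS_READONLY file passes the check, since
only write access is restricted by the first group of attributes.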
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>