Currently, using msync() results in the following code path:
sys_msync -> zpl_fsync -> filemap_write_and_wait_range -> zpl_writepages -> write_cache_pages -> zpl_putpage
In such a code path, zil_commit() is called as part of zpl_putpage().
This means that for each page, the write is handed to the DMU, the ZIL
is committed, and only then do we move on to the next page. As one might
imagine, this results in atrocious performance where there is a large
number of pages to write: instead of committing a batch of N writes,
we do N commits containing one page each. In some extreme cases this
can result in msync() being ~700 times slower than it should be, as well
as very inefficient use of ZIL resources.
This patch fixes this issue by making sure that the requested writes
are batched and then committed only once. Unfortunately, the
implementation is somewhat non-trivial because there is no way to run
write_cache_pages in SYNC mode (so that we get all pages) without
making it wait on the writeback tag for each page.
The solution implemented here is composed of two parts:
- I added a new callback system to the ZIL, which allows the caller to
be notified when its ITX gets written to stable storage. One nice
thing is that the callback is called not only in zil_commit() but
in zil_sync() as well, which means that the caller doesn't have to
care whether the write ended up in the ZIL or the DMU: it will get
notified as soon as it's safe, period. This is an improvement over
dmu_tx_callback_register() that was used previously, which only
supports DMU writes. The rationale for this change is to allow
zpl_putpage() to be notified when a ZIL commit is completed without
having to block on zil_commit() itself.
- zpl_writepages() now calls write_cache_pages in non-SYNC mode, which
will prevent (1) write_cache_pages from blocking, and (2) zpl_putpage
from issuing ZIL commits. zpl_writepages() will issue the commit
itself instead of relying on zpl_putpage() to do it, thus nicely
batching the writes. Note, however, that we still have to call
write_cache_pages() again in SYNC mode because there is an edge case
documented in the implementation of write_cache_pages() whereas it
will not give us all dirty pages when running in non-SYNC mode. Thus
we need to run it at least once in SYNC mode to make sure we honor
persistency guarantees. This only happens when the pages are
modified at the same time msync() is running, which should be rare.
In most cases there won't be any additional pages and this second
call will do nothing.
Note that this change also fixes a bug related to #907 whereas calling
msync() on pages that were already handed over to the DMU in a previous
writepages() call would make msync() block until the next TXG sync
instead of returning as soon as the ZIL commit is complete. The new
callback system fixes that problem.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1849Closes#907
3742 zfs comments need cleaner, more consistent style
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>
References:
https://www.illumos.org/issues/3742illumos/illumos-gate@f717074149
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
Porting notes:
1. The change to zfs_vfsops.c was dropped because it involves
zfs_mount_label_policy, which does not exist in the Linux port.
The PaX team modified the kernel's modpost to report writeable function
pointers as section mismatches because they are potential exploit
targets. We could ignore the warnings, but their presence can obscure
actual issues. Proper const correctness can also catch programming
mistakes.
Building the kernel modules against a PaX/GrSecurity patched Linux 3.4.2
kernel reports 133 section mismatches prior to this patch. This patch
eliminates 130 of them. The quantity of writeable function pointers
eliminated by constifying each structure is as follows:
vdev_opts_t 52
zil_replay_func_t 24
zio_compress_info_t 24
zio_checksum_info_t 9
space_map_ops_t 7
arc_byteswap_func_t 5
The remaining 3 writeable function pointers cannot be addressed by this
patch. 2 of them are in zpl_fs_type. The kernel's sget function requires
that this be non-const. The final writeable function pointer is created
by SPL_SHRINKER_DECLARE. The kernel's set_shrinker() and
remove_shrinker() functions also require that this be non-const.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1300
Rolling back a mounted filesystem with open file handles and
cached dentries+inodes never worked properly in ZoL. The
major issue was that Linux provides no easy mechanism for
modules to invalidate the inode cache for a file system.
Because of this it was possible that an inode from the previous
filesystem would not get properly dropped from the cache during
rolling back. Then a new inode with the same inode number would
be create and collide with the existing cached inode. Ideally
this would trigger an VERIFY() but in practice the error wasn't
handled and it would just NULL reference.
Luckily, this issue can be resolved by sprucing up the existing
Solaris zfs_rezget() functionality for the Linux VFS.
The way it works now is that when a file system is rolled back
all the cached inodes will be traversed and refetched from disk.
If a version of the cached inode exists on disk the in-core
copy will be updated accordingly. If there is no match for that
object on disk it will be unhashed from the inode cache and
marked as stale.
This will effectively make the inode unfindable for lookups
allowing the inode number to be immediately recycled. The inode
will then only be accessible from the cached dentries. Subsequent
dentry lookups which reference a stale inode will result in the
dentry being invalidated. Once invalidated the dentry will drop
its reference on the inode allowing it to be safely pruned from
the cache.
Special care is taken for negative dentries since they do not
reference any inode. These dentires will be invalidate based
on when they were added to the dentry cache. Entries added
before the last rollback will be invalidate to prevent them
from masking real files in the dataset.
Two nice side effects of this fix are:
* Removes the dependency on spl_invalidate_inodes(), it can now
be safely removed from the SPL when we choose to do so.
* zfs_znode_alloc() no longer requires a dentry to be passed.
This effectively reverts this portition of the code to its
upstream counterpart. The dentry is not instantiated more
correctly in the Linux ZPL layer.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#795
Unlike normal file or directory znodes, an xattr znode is
guaranteed to only have a single parent. Therefore, we can
take a refernce on that parent if it is provided at create
time and cache it. Additionally, we take care to cache it
on any subsequent zfs_zaccess() where the parent is provided
as an optimization.
This allows us to avoid needing to do a zfs_zget() when
setting up the SELinux security xattr in the create path.
This is critical because a hash lookup on the directory
will deadlock since it is locked.
The zpl_xattr_security_init() call has also been moved up
to the zpl layer to ensure TXs to create the required
xattrs are performed after the create TX. Otherwise we
run the risk of deadlocking on the open create TX.
Ideally the security xattr should be fully constructed
before the new inode is unlocked. However, doing so would
require far more extensive changes to ZFS.
This change may also have the benefitial side effect of
ensuring xattr directory znodes are evicted from the cache
before normal file or directory znodes due to the extra
reference.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#671
Add support for the .zfs control directory. This was accomplished
by leveraging as much of the existing ZFS infrastructure as posible
and updating it for Linux as required. The bulk of the core
functionality is now all there with the following limitations.
*) The .zfs/snapshot directory automount support requires a 2.6.37
or newer kernel. The exception is RHEL6.2 which has backported
the d_automount patches.
*) Creating/destroying/renaming snapshots with mkdir/rmdir/mv
in the .zfs/snapshot directory works as expected. However,
this functionality is only available to root until zfs
delegations are finished.
* mkdir - create a snapshot
* rmdir - destroy a snapshot
* mv - rename a snapshot
The following issues are known defeciences, but we expect them to
be addressed by future commits.
*) Add automount support for kernels older the 2.6.37. This should
be possible using follow_link() which is what Linux did before.
*) Accessing the .zfs/snapshot directory via NFS is not yet possible.
The majority of the ground work for this is complete. However,
finishing this work will require resolving some lingering
integration issues with the Linux NFS kernel server.
*) The .zfs/shares directory exists but no futher smb functionality
has yet been implemented.
Contributions-by: Rohan Puri <rohan.puri15@gmail.com>
Contributiobs-by: Andrew Barnes <barnes333@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#173
The current ZFS implementation stores xattrs on disk using a hidden
directory. In this directory a file name represents the xattr name
and the file contexts are the xattr binary data. This approach is
very flexible and allows for arbitrarily large xattrs. However,
it also suffers from a significant performance penalty. Accessing
a single xattr can requires up to three disk seeks.
1) Lookup the dnode object.
2) Lookup the dnodes's xattr directory object.
3) Lookup the xattr object in the directory.
To avoid this performance penalty Linux filesystems such as ext3
and xfs try to store the xattr as part of the inode on disk. When
the xattr is to large to store in the inode then a single external
block is allocated for them. In practice most xattrs are small
and this approach works well.
The addition of System Attributes (SA) to zfs provides us a clean
way to make this optimization. When the dataset property 'xattr=sa'
is set then xattrs will be preferentially stored as System Attributes.
This allows tiny xattrs (~100 bytes) to be stored with the dnode and
up to 64k of xattrs to be stored in the spill block. If additional
xattr space is required, which is unlikely under Linux, they will be
stored using the traditional directory approach.
This optimization results in roughly a 3x performance improvement
when accessing xattrs which brings zfs roughly to parity with ext4
and xfs (see table below). When multiple xattrs are stored per-file
the performance improvements are even greater because all of the
xattrs stored in the spill block will be cached.
However, by default SA based xattrs are disabled in the Linux port
to maximize compatibility with other implementations. If you do
enable SA based xattrs then they will not be visible on platforms
which do not support this feature.
----------------------------------------------------------------------
Time in seconds to get/set one xattr of N bytes on 100,000 files
------+--------------------------------+------------------------------
| setxattr | getxattr
bytes | ext4 xfs zfs-dir zfs-sa | ext4 xfs zfs-dir zfs-sa
------+--------------------------------+------------------------------
1 | 2.33 31.88 21.50 4.57 | 2.35 2.64 6.29 2.43
32 | 2.79 30.68 21.98 4.60 | 2.44 2.59 6.78 2.48
256 | 3.25 31.99 21.36 5.92 | 2.32 2.71 6.22 3.14
1024 | 3.30 32.61 22.83 8.45 | 2.40 2.79 6.24 3.27
4096 | 3.57 317.46 22.52 10.73 | 2.78 28.62 6.90 3.94
16384 | n/a 2342.39 34.30 19.20 | n/a 45.44 145.90 7.55
65536 | n/a 2941.39 128.15 131.32* | n/a 141.92 256.85 262.12*
Legend:
* ext4 - Stock RHEL6.1 ext4 mounted with '-o user_xattr'.
* xfs - Stock RHEL6.1 xfs mounted with default options.
* zfs-dir - Directory based xattrs only.
* zfs-sa - Prefer SAs but spill in to directories as needed, a
trailing * indicates overflow in to directories occured.
NOTE: Ext4 supports 4096 bytes of xattr name/value pairs per file.
NOTE: XFS and ZFS have no limit on xattr name/value pairs per file.
NOTE: Linux limits individual name/value pairs to 65536 bytes.
NOTE: All setattr/getattr's were done after dropping the cache.
NOTE: All tests were run against a single hard drive.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #443
The .get_sb callback has been replaced by a .mount callback
in the file_system_type structure. When using the new
interface the caller must now use the mount_nodev() helper.
Unfortunately, the new interface no longer passes the vfsmount
down to the zfs layers. This poses a problem for the existing
implementation because we currently save this pointer in the
super block for latter use. It provides our only entry point
in to the namespace layer for manipulating certain mount options.
This needed to be done originally to allow commands like
'zfs set atime=off tank' to work properly. It also allowed me
to keep more of the original Solaris code unmodified. Under
Solaris there is a 1-to-1 mapping between a mount point and a
file system so this is a fairly natural thing to do. However,
under Linux they many be multiple entries in the namespace
which reference the same filesystem. Thus keeping a back
reference from the filesystem to the namespace is complicated.
Rather than introduce some ugly hack to get the vfsmount and
continue as before. I'm leveraging this API change to update
the ZFS code to do things in a more natural way for Linux.
This has the upside that is resolves the compatibility issue
for the long term and fixes several other minor bugs which
have been reported.
This commit updates the code to remove this vfsmount back
reference entirely. All modifications to filesystem mount
options are now passed in to the kernel via a '-o remount'.
This is the expected Linux mechanism and allows the namespace
to properly handle any options which apply to it before passing
them on to the file system itself.
Aside from fixing the compatibility issue, removing the
vfsmount has had the benefit of simplifying the code. This
change which fairly involved has turned out nicely.
Closes#246Closes#217Closes#187Closes#248Closes#231
Now that KM_SLEEP is not defined as GFP_NOFS there is the possibility
of synchronous reclaim deadlocks. These deadlocks never existed in the
original OpenSolaris code because all memory reclaim on Solaris is done
asyncronously. Linux does both synchronous (direct) and asynchronous
(indirect) reclaim.
This commit addresses a deadlock caused by inode eviction. A KM_SLEEP
allocation may trigger direct memory reclaim and shrink the inode cache.
This can occur while a mutex in the array of ZFS_OBJ_HOLD mutexes is
held. Through the ->shrink_icache_memory()->evict()->zfs_inactive()->
zfs_zinactive() call path the same mutex may be reacquired resulting
in a deadlock. To avoid this deadlock the process must not reacquire
the mutex when it is already holding it.
This is a reasonable fix for now but longer term the ZFS_OBJ_HOLD
mutex locking should be reevaluated. This infrastructure already
prevents us from ever using the Linux lock dependency analysis tools,
and it may limit scalability.
Register the missing .sync_fs handler. This is a noop in most cases
because the usual requirement is that sync just be initiated. As part
of the DMU's normal transaction processing txgs will be frequently
synced. However, when the 'wait' flag is set the requirement is that
.sync_fs must not return until the data is safe on disk. With the
addition of the .sync_fs handler this is now properly implemented.
When I began work on the Posix layer it immediately became clear to
me that to integrate cleanly with the Linux VFS certain Solaris
specific things would have to go. One of these things was to elimate
as many Solaris specific types from the ZPL layer as possible. They
would be replaced with their Linux equivalents. This would not only
be good for performance, but for the general readability and health of
the code. The Solaris and Linux VFS are different beasts and should
be treated as such. Most of the code remains common for constructing
transactions and such, but there are subtle and important differenced
which need to be repsected.
This policy went quite for for certain types such as the vnode_t,
and it initially seemed to be working out well for the vattr_t. There
was a relatively small amount of related xvattr_t code I was forced to
comment out with HAVE_XVATTR. But it didn't look that hard to come
back soon and replace it all with a native Linux type.
However, after going doing this path with xvattr some distance it
clear that this code was woven in the ZPL more deeply than I thought.
In particular its hooks went very deep in to the ZPL replay code
and replacing it would not be as easy as I originally thought.
Rather than continue persuing replacing and removing this code I've
taken a step back and reevaluted things. This commit reverts many of
my previous commits which removed xvattr related code. It restores
much of the code to its original upstream state and now relies on
improved xvattr_t support in the zfs package itself.
The result of this is that much of the code which I had commented
out, which accidentally broke things like replay, is now back in
place and working. However, there may be a small performance
impact for getattr/setattr operations because they now require
a translation from native Linux to Solaris types. For now that's
a price I'm willing to pay. Once everything is completely functional
we can revisting the issue of removing the vattr_t/xvattr_t types.
Closes#111
It's worth taking a moment to describe how mmap is implemented
for zfs because it differs considerably from other Linux filesystems.
However, this issue is handled the same way under OpenSolaris.
The issue is that by design zfs bypasses the Linux page cache and
leaves all caching up to the ARC. This has been shown to work
well for the common read(2)/write(2) case. However, mmap(2)
is problem because it relies on being tightly integrated with the
page cache. To handle this we cache mmap'ed files twice, once in
the ARC and a second time in the page cache. The code is careful
to keep both copies synchronized.
When a file with an mmap'ed region is written to using write(2)
both the data in the ARC and existing pages in the page cache
are updated. For a read(2) data will be read first from the page
cache then the ARC if needed. Neither a write(2) or read(2) will
will ever result in new pages being added to the page cache.
New pages are added to the page cache only via .readpage() which
is called when the vfs needs to read a page off disk to back the
virtual memory region. These pages may be modified without
notifying the ARC and will be written out periodically via
.writepage(). This will occur due to either a sync or the usual
page aging behavior. Note because a read(2) of a mmap'ed file
will always check the page cache first even when the ARC is out
of date correct data will still be returned.
While this implementation ensures correct behavior it does have
have some drawbacks. The most obvious of which is that it
increases the required memory footprint when access mmap'ed
files. It also adds additional complexity to the code keeping
both caches synchronized.
Longer term it may be possible to cleanly resolve this wart by
mapping page cache pages directly on to the ARC buffers. The
Linux address space operations are flexible enough to allow
selection of which pages back a particular index. The trick
would be working out the details of which subsystem is in
charge, the ARC, the page cache, or both. It may also prove
helpful to move the ARC buffers to a scatter-gather lists
rather than a vmalloc'ed region.
Additionally, zfs_write/read_common() were used in the readpage
and writepage hooks because it was fairly easy. However, it
would be better to update zfs_fillpage and zfs_putapage to be
Linux friendly and use them instead.
A new flag is required for the zfs_rlock code to determine if
it is operation of the zvol of zpl dataset. This used to be
keyed off the zp->z_vnode, which was a hack to begin with, but
with the removal of vnodes we needed a dedicated flag.
I appologize in advance why to many things ended up in this commit.
When it could be seperated in to a whole series of commits teasing
that all apart now would take considerable time and I'm not sure
there's much merrit in it. As such I'll just summerize the intent
of the changes which are all (or partly) in this commit. Broadly
the intent is to remove as much Solaris specific code as possible
and replace it with native Linux equivilants. More specifically:
1) Replace all instances of zfsvfs_t with zfs_sb_t. While the
type is largely the same calling it private super block data
rather than a zfsvfs is more consistent with how Linux names
this. While non critical it makes the code easier to read when
your thinking in Linux friendly VFS terms.
2) Replace vnode_t with struct inode. The Linux VFS doesn't have
the notion of a vnode and there's absolutely no good reason to
create one. There are in fact several good reasons to remove it.
It just adds overhead on Linux if we were to manage one, it
conplicates the code, and it likely will lead to bugs so there's
a good change it will be out of date. The code has been updated
to remove all need for this type.
3) Replace all vtype_t's with umode types. Along with this shift
all uses of types to mode bits. The Solaris code would pass a
vtype which is redundant with the Linux mode. Just update all the
code to use the Linux mode macros and remove this redundancy.
4) Remove using of vn_* helpers and replace where needed with
inode helpers. The big example here is creating iput_aync to
replace vn_rele_async. Other vn helpers will be addressed as
needed but they should be be emulated. They are a Solaris VFS'ism
and should simply be replaced with Linux equivilants.
5) Update znode alloc/free code. Under Linux it's common to
embed the inode specific data with the inode itself. This removes
the need for an extra memory allocation. In zfs this information
is called a znode and it now embeds the inode with it. Allocators
have been updated accordingly.
6) Minimal integration with the vfs flags for setting up the
super block and handling mount options has been added this
code will need to be refined but functionally it's all there.
This will be the first and last of these to large to review commits.
For the moment we have left ZFS unchanged and it updates many values
as part of the znode. However, some of these values should be set
in the inode. For the moment this is handled by adding a function
called zfs_inode_update() which updates the inode based on the znode.
This is considered a workaround until we can systematically go
through the ZFS code and have it directly update the inode. At
which point zfs_update_inode() can be dropped entirely. Keeping
two copies of the same data isn't only inefficient it's a breeding
ground for bugs.
Under Linux the convention for filesystem specific data structure is
to embed it along with the generic vfs data structure. This differs
significantly from Solaris.
Since we want to integrates as cleanly with the Linux VFS as possible.
This changes modifies zfs_znode_alloc() to allocate a znode with an
embedded inode for use with the generic VFS. This is done by calling
iget_locked() which will allocate a new inode if needed by calling
sb->alloc_inode(). This function allocates enough memory for a
znode_t by returns a pointer to the inode structure for Linux's VFS.
This function is also responsible for setting the callback
znode->z_set_ops_inodes() which is used to register the correct
handlers for the inode.
One of the neat tricks an autoconf style project is capable of
is allow configurion/building in a directory other than the
source directory. The major advantage to this is that you can
build the project various different ways while making changes
in a single source tree.
For example, this project is designed to work on various different
Linux distributions each of which work slightly differently. This
means that changes need to verified on each of those supported
distributions perferably before the change is committed to the
public git repo.
Using nfs and custom build directories makes this much easier.
I now have a single source tree in nfs mounted on several different
systems each running a supported distribution. When I make a
change to the source base I suspect may break things I can
concurrently build from the same source on all the systems each
in their own subdirectory.
wget -c http://github.com/downloads/behlendorf/zfs/zfs-x.y.z.tar.gz
tar -xzf zfs-x.y.z.tar.gz
cd zfs-x-y-z
------------------------- run concurrently ----------------------
<ubuntu system> <fedora system> <debian system> <rhel6 system>
mkdir ubuntu mkdir fedora mkdir debian mkdir rhel6
cd ubuntu cd fedora cd debian cd rhel6
../configure ../configure ../configure ../configure
make make make make
make check make check make check make check
This change also moves many of the include headers from individual
incude/sys directories under the modules directory in to a single
top level include directory. This has the advantage of making
the build rules cleaner and logically it makes a bit more sense.