A preallocated taskq_ent_t's tqent_flags must be checked prior to
servicing the taskq_ent_t. Once a preallocated taskq entry is serviced,
the ownership of the entry is handed back to the caller of
taskq_dispatch, thus the entry's contents can potentially be mangled.
In particular, this is a problem in the case where a preallocated taskq
entry is serviced, and the caller clears it's tqent_flags field. Thus,
when the function returns and task_done is called, it looks as though
the entry is **not** a preallocated task (when in fact it **is** a
preallocated task).
In this situation, task_done will place the preallocated taskq_ent_t
structure onto the taskq_t's free list. This is a **huge** mistake. If
the taskq_ent_t is then freed by the caller of taskq_dispatch, the
taskq_t's free list will hold a pointer to garbage data. Even worse, if
nothing has over written the freed memory before the pointer is
dereferenced, it may still look as though it points to a valid list_head
belonging to a taskq_ent_t structure.
Thus, the task entry's flags are now copied prior to servicing the task.
This copy is then checked to see if it is a preallocated task, and
determine if the entry needs to be passed down to the task_done
function.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#71
The taskq_t's active thread list is sorted based on its
tqt_ent->tqent_id field. The list is kept sorted solely by inserting
new taskq_thread_t's in their correct sorted location; no other
means is used. This means that once inserted, if a taskq_thread_t's
tqt_ent->tqent_id field changes, the list runs the risk of no
longer being sorted.
Prior to the introduction of the taskq_dispatch_prealloc() interface,
this was not a problem as a taskq_ent_t actively being serviced under
the old interface should always have a static tqent_id field. Thus,
once the taskq_thread_t is added to the taskq_t's active thread list,
the taskq_thread_t's tqt_ent->tqent_id field would remain constant.
Now, this is no longer the case. Currently, if using the
taskq_dispatch_prealloc() interface, any given taskq_ent_t actively
being serviced _may_ have its tqent_id value incremented. This happens
when the preallocated taskq_ent_t structure is recursively dispatched.
Thus, a taskq_thread_t could potentially have its tqt_ent->tqent_id
field silently modified from under its feet. If this were to happen
to a taskq_thread_t on a taskq_t's active thread list, this would
compromise the integrity of the order of the list (as the list
_may_ no longer be sorted).
To get around this, the taskq_thread_t's taskq_ent_t pointer was
replaced with its own static copy of the tqent_id. So, as a taskq_ent_t
is pulled off of the taskq_t's pending list, a static copy of its
tqent_id is made and this copy is used to sort the active thread
list. Using a static copy is key in ensuring the integrity of the
order of the active thread list. Even if the underlying taskq_ent_t
is recursively dispatched (as has its tqent_id modified), this
static copy stored inside the taskq_thread_t will remain constant.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #71
Added the necessary build infrastructure for building packages
compatible with the Arch Linux distribution. As such, one can now run:
$ ./configure
$ make pkg # Alternatively, one can run 'make arch' as well
on the Arch Linux machine to create two binary packages compatible with
the pacman package manager, one for the zfs userland utilities and
another for the zfs kernel modules. The new packages can then be
installed by running:
# pacman -U $package.pkg.tar.xz
In addition, source-only packages suitable for an Arch Linux chroot
environment or remote builder can also be build using the 'sarch' make
rule.
NOTE: Since the source dist tarball is created on the fly from the head
of the build tree, it's MD5 hash signature will be continually influx.
As a result, the md5sum variable was intentionally omitted from the
PKGBUILD files, and the '--skipinteg' makepkg option is used. This may
or may not have any serious security implications, as the source tarball
is not being downloaded from an outside source.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#491
Added the necessary build infrastructure for building packages
compatible with the Arch Linux distribution. As such, one can now run:
$ ./configure
$ make pkg # Alternatively, one can run 'make arch' as well
on an Arch Linux machine to create two binary packages compatible with
the pacman package manager, one for the spl userland utilties and
another for the spl kernel modules. The new packages can then be
installed by running:
# pacman -U $package.pkg.tar.xz
In addition, source-only packages suitable for an Arch Linux chroot
environment or remote builder can also be built using the 'sarch' make
rule.
NOTE: Since the source dist tarball is created on the fly from the head
of the build tree, it's MD5 hash signature will be continually influx.
As a result, the md5sum variable was intentionally omitted from the
PKGBUILD files, and the '--skipinteg' makepkg option is used. This may
or may not have any serious security implications, as the source tarball
is not being downloaded from an outside source.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #68
It has been observed that some of the hottest locks are those
of the zio taskqs. Contention on these locks can limit the
rate at which zios are dispatched which limits performance.
This upstream change from Illumos uses new interface to the
taskqs which allow them to utilize a prealloc'ed taskq_ent_t.
This removes the need to perform an allocation at dispatch
time while holding the contended lock. This has the effect
of improving system performance.
Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Alexey Zaytsev <alexey.zaytsev@nexenta.com>
Reviewed by: Jason Brian King <jason.brian.king@gmail.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>
References to Illumos issue:
https://www.illumos.org/issues/734
Ported-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#482
This patch implements the taskq_dispatch_prealloc() interface which
was introduced by the following illumos-gate commit. It allows for
a preallocated taskq_ent_t to be used when dispatching items to a
taskq. This eliminates a memory allocation which helps minimize
lock contention in the taskq when dispatching functions.
commit 5aeb94743e3be0c51e86f73096334611ae3a058e
Author: Garrett D'Amore <garrett@nexenta.com>
Date: Wed Jul 27 07:13:44 2011 -0700
734 taskq_dispatch_prealloc() desired
943 zio_interrupt ends up calling taskq_dispatch with TQ_SLEEP
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #65
To lay the ground work for introducing the taskq_dispatch_prealloc()
interface, the tq_work_list and tq_threads fields had to be replaced
with new alternatives in the taskq_t structure.
The tq_threads field was replaced with tq_thread_list. Rather than
storing the pointers to the taskq's kernel threads in an array, they are
now stored as a list. In addition to laying the ground work for the
taskq_dispatch_prealloc() interface, this change could also enable taskq
threads to be dynamically created and destroyed as threads can now be
added and removed to this list relatively easily.
The tq_work_list field was replaced with tq_active_list. Instead of
keeping a list of taskq_ent_t's which are currently being serviced, a
list of taskq_threads currently servicing a taskq_ent_t is kept. This
frees up the taskq_ent_t's tqent_list field when it is being serviced
(i.e. now when a taskq_ent_t is being serviced, it's tqent_list field
will be empty).
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #65
A call site of the MUTEX macro had incorrectly placed its closing
parenthesis, causing two parameters to be passed rather than one. This
change moves the misplaced parenthesis to fix the typographical error.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#70
ZFS and 64-bit linux are perfectly capable of dealing with 64-bit
timestamps, but ZFS deliberately prevents setting them. Adjust
the SPL such that TIMESPEC_OVERFLOW will not always assume 32-bit
values and instead use the correct values for your kernel build.
This effectively allows 64-bit timestamps on 64-bit systems.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes ZFS issue #487
The current ZFS implementation stores xattrs on disk using a hidden
directory. In this directory a file name represents the xattr name
and the file contexts are the xattr binary data. This approach is
very flexible and allows for arbitrarily large xattrs. However,
it also suffers from a significant performance penalty. Accessing
a single xattr can requires up to three disk seeks.
1) Lookup the dnode object.
2) Lookup the dnodes's xattr directory object.
3) Lookup the xattr object in the directory.
To avoid this performance penalty Linux filesystems such as ext3
and xfs try to store the xattr as part of the inode on disk. When
the xattr is to large to store in the inode then a single external
block is allocated for them. In practice most xattrs are small
and this approach works well.
The addition of System Attributes (SA) to zfs provides us a clean
way to make this optimization. When the dataset property 'xattr=sa'
is set then xattrs will be preferentially stored as System Attributes.
This allows tiny xattrs (~100 bytes) to be stored with the dnode and
up to 64k of xattrs to be stored in the spill block. If additional
xattr space is required, which is unlikely under Linux, they will be
stored using the traditional directory approach.
This optimization results in roughly a 3x performance improvement
when accessing xattrs which brings zfs roughly to parity with ext4
and xfs (see table below). When multiple xattrs are stored per-file
the performance improvements are even greater because all of the
xattrs stored in the spill block will be cached.
However, by default SA based xattrs are disabled in the Linux port
to maximize compatibility with other implementations. If you do
enable SA based xattrs then they will not be visible on platforms
which do not support this feature.
----------------------------------------------------------------------
Time in seconds to get/set one xattr of N bytes on 100,000 files
------+--------------------------------+------------------------------
| setxattr | getxattr
bytes | ext4 xfs zfs-dir zfs-sa | ext4 xfs zfs-dir zfs-sa
------+--------------------------------+------------------------------
1 | 2.33 31.88 21.50 4.57 | 2.35 2.64 6.29 2.43
32 | 2.79 30.68 21.98 4.60 | 2.44 2.59 6.78 2.48
256 | 3.25 31.99 21.36 5.92 | 2.32 2.71 6.22 3.14
1024 | 3.30 32.61 22.83 8.45 | 2.40 2.79 6.24 3.27
4096 | 3.57 317.46 22.52 10.73 | 2.78 28.62 6.90 3.94
16384 | n/a 2342.39 34.30 19.20 | n/a 45.44 145.90 7.55
65536 | n/a 2941.39 128.15 131.32* | n/a 141.92 256.85 262.12*
Legend:
* ext4 - Stock RHEL6.1 ext4 mounted with '-o user_xattr'.
* xfs - Stock RHEL6.1 xfs mounted with default options.
* zfs-dir - Directory based xattrs only.
* zfs-sa - Prefer SAs but spill in to directories as needed, a
trailing * indicates overflow in to directories occured.
NOTE: Ext4 supports 4096 bytes of xattr name/value pairs per file.
NOTE: XFS and ZFS have no limit on xattr name/value pairs per file.
NOTE: Linux limits individual name/value pairs to 65536 bytes.
NOTE: All setattr/getattr's were done after dropping the cache.
NOTE: All tests were run against a single hard drive.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #443
This is a bit of cleanup I'd been meaning to get to for a while
to reduce the chance of a type conflict. Well that conflict
finally occurred with the kstat_init() function which conflicts
with a function in the 2.6.32-6-pve kernel.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#56
As of Linux 3.1 the shrink_dcache_memory and shrink_icache_memory
functions have been removed. This same task is now accomplished
more cleanly with per super block shrinkers. This unfortunately
leaves us no easy way to support the dnlc_reduce_cache() function.
This support has always been entirely optional. So when no
reasonable interface is available allow the dnlc_reduce_cache()
function to effectively become a no-op.
The downside of this change is that it will prevent the zfs arc
meta data limts from being enforced. However, the current zfs
implementation in this regard is already flawed and needs to
be reworked. If the arc needs to enfore a meta data limit it
will need to be extended to coordinate directly with the zpl.
This will allow us to drop all this compatibility code and get
more fine grained control over the cache management.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #52
Preferentially use the vfs_fsync() function. This function was
initially introduced in 2.6.29 and took three arguments. As
of 2.6.35 the dentry argument was dropped from the function.
For older kernels fall back to using file_fsync() which also
took three arguments including the dentry.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #52
Prior to Linux 3.1 the kern_path_parent symbol was exported for
use by kernel modules. As of Linux 3.1 it is now longer easily
available. To handle this case the spl will now dynamically
look up address of the missing symbol at module load time.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #52
Update the code to use the bdi_setup_and_register() helper to
simplify the bdi integration code. The updated code now just
registers the bdi during mount and destroys it during unmount.
The only complication is that for 2.6.32 - 2.6.33 kernels the
helper wasn't available so in these cases the zfs code must
provide it. Luckily the bdi_setup_and_register() function
is trivial.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#367
Profiling the system during meta data intensive workloads such
as creating/removing millions of files, revealed that the system
was cpu bound. A large fraction of that cpu time was being spent
waiting on the virtual address space spin lock.
It turns out this was caused by certain heavily used kmem_caches
being backed by virtual memory. By default a kmem_cache will
dynamically determine the type of memory used based on the object
size. For large objects virtual memory is usually preferable
and for small object physical memory is a better choice. See
the spl_slab_alloc() function for a longer discussion on this.
However, there is a certain amount of gray area when defining a
'large' object. For the following caches it turns out they were
just over the line:
* dnode_cache
* zio_cache
* zio_link_cache
* zio_buf_512_cache
* zfs_data_buf_512_cache
Now because we know there will be a lot of churn in these caches,
and because we know the slabs will still be reasonably sized.
We can safely request with the KMC_KMEM flag that the caches be
backed with physical memory addresses. This entirely avoids the
need to serialize on the virtual address space lock.
As a bonus this also reduces our vmalloc usage which will be good
for 32-bit kernels which have a very small virtual address space.
It will also probably be good for interactive performance since
unrelated processes could also block of this same global lock.
Finally, we may see less cpu time being burned in the arc_reclaim
and txg_sync_threads.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #258
The old define assumed a specific layout of the kmutex_t struct. This
patch makes the macro independent from the actual struct layout.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The m_owner variable is protected by the mutex itself. Reading the variable
is guaranteed to be atomic (due to it being a word-sized reference) and
ACCESS_ONCE() takes care of read cache effects.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
On kernels with CONFIG_DEBUG_MUTEXES mutex_exit() clears the mutex
owner after releasing the mutex. This would cause mutex_owner()
to return an incorrect owner if another thread managed to lock the
mutex before mutex_exit() had a chance to clear the owner.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes ZFS issue #167
Export all symbols already marked extern in the zfs_vfsops.h
header. Several non-static symbols have also been added to
the header and exportewd. This allows external modules to
more easily create and manipulate properly created ZFS
filesystem type datasets.
Rename zfsvfs_teardown() to zfs_sb_teardown and export it.
This is done simply for consistency with the rest of the code
base. All other zfsvfs_* functions have already been renamed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
File descriptors are a per-process resource. The same descriptor
in different processes can refer to different files. find_file()
incorrectly assumed that file descriptors are globally unique.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes ZFS issue #386
Export all the symbols for the system attribute (SA) API. This
allows external module to cleanly manipulate the SAs associated
with a dnode. Documention for the SA API can be found in the
module/zfs/sa.c source.
This change also removes the zfs_sa_uprade_pre, and
zfs_sa_uprade_post prototypes. The functions themselves were
dropped some time ago.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Export all the symbols for the ZAP API. This allows external modules
to cleanly interface with ZAP type objects. Previously only a subset
of the functionality was exposed. Documention for the ZAP API can be
found in the sys/zap.h header.
This change also removes a duplicate zap_increment_int() prototype.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
GPT's created by libefi set the HeaderSize attribute in the GPT
header to 512 -- size of the GPT header INCLUDING the 420 padding
bytes at the end. Most other tools set the size to 92 -- size of
the actual header itself excluding the padding. Most tools check
the recorded HeaderSize when verifying CRC, but gptfdisk hardcodes
92 and thus reports CRC verification problems for full-disk vdevs
created IE with `zpool create pool sdc`.
This commit changes libefi's behavior for GPT creation and also
fixes several edge cases where libefi's behavior was similar
(though in an incompatible manner) to gptfdisk. Libefi assumed
HeaderSize was always 512 even if the GPT recorded a different
value. Sanity checks of the GPT headersize read from disk were
added before applying checksum calculation -- this will prevent
segfault in cases of bogus on-disk values.
Zpools created with the resuling libefi are verified as correct
both by parted and gptfdisk. Also pool have been tested to
import correctly on ZFS on Linux as well as Solaris Express 11
livecd.
Signed-off-by: Zachary Bedell <zac@thebedells.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#344
For a long time now the kernel has been moving away from using the
pdflush daemon to write 'old' dirty pages to disk. The primary reason
for this is because the pdflush daemon is single threaded and can be
a limiting factor for performance. Since pdflush sequentially walks
the dirty inode list for each super block any delay in processing can
slow down dirty page writeback for all filesystems.
The replacement for pdflush is called bdi (backing device info). The
bdi system involves creating a per-filesystem control structure each
with its own private sets of queues to manage writeback. The advantage
is greater parallelism which improves performance and prevents a single
filesystem from slowing writeback to the others.
For a long time both systems co-existed in the kernel so it wasn't
strictly required to implement the bdi scheme. However, as of
Linux 2.6.36 kernels the pdflush functionality has been retired.
Since ZFS already bypasses the page cache for most I/O this is only
an issue for mmap(2) writes which must go through the page cache.
Even then adding this missing support for newer kernels was overlooked
because there are other mechanisms which can trigger writeback.
However, there is one critical case where not implementing the bdi
functionality can cause problems. If an application handles a page
fault it can enter the balance_dirty_pages() callpath. This will
result in the application hanging until the number of dirty pages in
the system drops below the dirty ratio.
Without a registered backing_device_info for the filesystem the
dirty pages will not get written out. Thus the application will hang.
As mentioned above this was less of an issue with older kernels because
pdflush would eventually write out the dirty pages.
This change adds a backing_device_info structure to the zfs_sb_t
which is already allocated per-super block. It is then registered
when the filesystem mounted and unregistered on unmount. It will
not be registered for mounted snapshots which are read-only. This
change will result in flush-<pool> thread being dynamically created
and destroyed per-mounted filesystem for writeback.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#174
While the existing implementation of .writepage()/zpl_putpage() was
functional it was not entirely correct. In particular, it would move
dirty pages in to a clean state simply after copying them in to the
ARC cache. This would result in the pages being lost if the system
were to crash enough though the Linux VFS believed them to be safe on
stable storage.
Since at the moment virtually all I/O, except mmap(2), bypasses the
page cache this isn't as bad as it sounds. However, as hopefully
start using the page cache more getting this right becomes more
important so it's good to improve this now.
This patch takes a big step in that direction by updating the code
to correctly move dirty pages through a writeback phase before they
are marked clean. When a dirty page is copied in to the ARC it will
now be set in writeback and a completion callback is registered with
the transaction. The page will stay in writeback until the dmu runs
the completion callback indicating the page is on stable storage.
At this point the page can be safely marked clean.
This process is normally entirely asynchronous and will be repeated
for every dirty page. This may initially sound inefficient but most
of these pages will end up in a few txgs. That means when they are
eventually written to disk they should be nicely batched. However,
there is room for improvement. It may still be desirable to batch
up the pages in to larger writes for the dmu. This would reduce
the number of callbacks and small 4k buffer required by the ARC.
Finally, if the caller requires that the I/O be done synchronously
by setting WB_SYNC_ALL or if ZFS_SYNC_ALWAYS is set. Then the I/O
will trigger a zil_commit() to flush the data to stable storage.
At which point the registered callbacks will be run leaving the
date safe of disk and marked clean before returning from .writepage.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Unlike most other Linux distributions archlinux installs its
init scripts in /etc/rc.d insead of /etc/init.d. This commit
provides an archlinux rc.d script for zfs and extends the
build infrastructure to ensure it get's installed in the
correct place.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#322
There is at most a factor of 3x performance improvement to be
had by using the Linux generic_fillattr() helper. However, to
use it safely we need to ensure the values in a cached inode
are kept rigerously up to date. Unfortunately, this isn't
the case for the blksize, blocks, and atime fields. At the
moment the authoritative values are still stored in the znode.
This patch introduces an optimized zfs_getattr_fast() call.
The idea is to use the up to date values from the inode and
the blksize, block, and atime fields from the znode. At some
latter date we should be able to strictly use the inode values
and further improve performance.
The remaining overhead in the zfs_getattr_fast() call can be
attributed to having to take the znode mutex. This overhead is
unavoidable until the inode is kept strictly up to date. The
the careful reader will notice the we do not use the customary
ZFS_ENTER()/ZFS_EXIT() macros. These macro's are designed to
ensure the filesystem is not torn down in the middle of an
operation. However, in this case the VFS is holding a
reference on the active inode so we know this is impossible.
=================== Performance Tests ========================
This test calls the fstat(2) system call 10,000,000 times on
an open file description in a tight loop. The test results
show the zfs stat(2) performance is now only 22% slower than
ext4. This is a 2.5x improvement and there is a clear long
term plan to get to parity with ext4.
filesystem | test-1 test-2 test-3 | average | times-ext4
--------------+-------------------------+---------+-----------
ext4 | 7.785s 7.899s 7.284s | 7.656s | 1.000x
zfs-0.6.0-rc4 | 24.052s 22.531s 23.857s | 23.480s | 3.066x
zfs-faststat | 9.224s 9.398s 9.485s | 9.369s | 1.223x
The second test is to run 'du' of a copy of the /usr tree
which contains 110514 files. The test is run multiple times
both using both a cold cache (/proc/sys/vm/drop_caches) and
a hot cache. As expected this change signigicantly improved
the zfs hot cache performance and doesn't quite bring zfs to
parity with ext4.
A little surprisingly the zfs cold cache performance is better
than ext4. This can probably be attributed to the zfs allocation
policy of co-locating all the meta data on disk which minimizes
seek times. By default the ext4 allocator will spread the data
over the entire disk only co-locating each directory.
filesystem | cold | hot
--------------+---------+--------
ext4 | 13.318s | 1.040s
zfs-0.6.0-rc4 | 4.982s | 1.762s
zfs-faststat | 4.933s | 1.345s
The .get_sb callback has been replaced by a .mount callback
in the file_system_type structure. When using the new
interface the caller must now use the mount_nodev() helper.
Unfortunately, the new interface no longer passes the vfsmount
down to the zfs layers. This poses a problem for the existing
implementation because we currently save this pointer in the
super block for latter use. It provides our only entry point
in to the namespace layer for manipulating certain mount options.
This needed to be done originally to allow commands like
'zfs set atime=off tank' to work properly. It also allowed me
to keep more of the original Solaris code unmodified. Under
Solaris there is a 1-to-1 mapping between a mount point and a
file system so this is a fairly natural thing to do. However,
under Linux they many be multiple entries in the namespace
which reference the same filesystem. Thus keeping a back
reference from the filesystem to the namespace is complicated.
Rather than introduce some ugly hack to get the vfsmount and
continue as before. I'm leveraging this API change to update
the ZFS code to do things in a more natural way for Linux.
This has the upside that is resolves the compatibility issue
for the long term and fixes several other minor bugs which
have been reported.
This commit updates the code to remove this vfsmount back
reference entirely. All modifications to filesystem mount
options are now passed in to the kernel via a '-o remount'.
This is the expected Linux mechanism and allows the namespace
to properly handle any options which apply to it before passing
them on to the file system itself.
Aside from fixing the compatibility issue, removing the
vfsmount has had the benefit of simplifying the code. This
change which fairly involved has turned out nicely.
Closes#246Closes#217Closes#187Closes#248Closes#231
The security_inode_init_security() function now takes an additional
qstr argument which must be passed in from the dentry if available.
Passing a NULL is safe when no qstr is available the relevant
security checks will just be skipped.
Closes#246Closes#217Closes#187
Under Linux the VFS handles virtually all of the mmap() access
checks. Filesystem specific checks are left to be handled in
the .mmap() hook and normally there arn't any.
However, ZFS provides a few attributes which can influence the
mmap behavior and should be honored. Note, currently the code
to modify these attributes has not been implemented under Linux.
* ZFS_IMMUTABLE | ZFS_READONLY | ZFS_APPENDONLY: when any of these
attributes are set a file may not be mmaped with write access.
* ZFS_AV_QUARANTINED: when set a file file may not be mmaped with
read or exec access.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Enable zfs_getpage, zfs_fillpage, zfs_putpage, zfs_putapage functions.
The functions have been modified to make them Linux friendly.
ZFS uses these functions to read/write the mmapped pages. Using them
from readpage/writepage results in clear code. The patch also adds
readpages and writepages interface functions to read/write list of
pages in one function call.
The code change handles the first mmap optimization mentioned on
https://github.com/behlendorf/zfs/issues/225
Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com>
Signed-off-by: Brian Behlendorf <behlendorf@llnl.gov>
Issue #255
The inode eviction should unmap the pages associated with the inode.
These pages should also be flushed to disk to avoid the data loss.
Therefore, use truncate_setsize() in evict_inode() to release the
pagecache.
The API truncate_setsize() was added in 2.6.35 kernel. To ensure
compatibility with the old kernel, the patch defines its own
truncate_setsize function.
Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com>
Closes#255
Prior to Linux 2.6.39 when CONFIG_DEBUG_MUTEXES was defined
the kernel stored a thread_info pointer as the mutex owner.
From this you could get the pointer of the current task_struct
to compare with get_current().
As of Linux 2.6.39 this behavior has changed and now the mutex
stores a pointer to the task_struct. This commit detects the
type of pointer stored in the mutex and adjusts the mutex_owner()
and mutex_owned() functions to perform the correct comparision.
Deprecate the /usr/bin/hostid call by reading the /etc/hostid file
directly. Add the spl_hostid_path parameter to override the default
/etc/hostid path.
Rename the set_hostid() function to hostid_exec() to better reflect
actual behavior and complement the new hostid_read() function.
Use HW_INVALID_HOSTID as the spl_hostid sentinel value because
zero seems to be a valid gethostid() result on Linux.
While the splat tests were originally designed to stress test
the Solaris primatives. I am extending them to include some kernel
compatibility tests. Certain linux APIs have changed frequently.
These tests ensure that added compatibility is working properly
and no unnoticed regression have slipped in.
Test 1 and 2 add basic regression tests for shrink_icache_memory
and shrink_dcache_memory. These are simply functional tests to
ensure we can call these functions safely. Checking for correct
behavior is more difficult since other running processes will
influence the behavior. However, these functions are provided
by the kernel so if we can successfully call them we assume they
are working correctly.
Test 3 checks that shrinker functions are being registered and
called correctly. As of Linux 3.0 the shrinker API has changed
four different times so I felt the need to add a trivial test
case to ensure each variant works as expected.
Update the the wrapper macros for the memory shrinker to handle
this 4th API change. The callback function now takes a
shrink_control structure. This is certainly a step in the
right direction but it's annoying to have to accomidate yet
another version of the API.
Some disks with internal sectors larger than 512 bytes (e.g., 4k) can
suffer from bad write performance when ashift is not configured
correctly. This is caused by the disk not reporting its actual sector
size, but a sector size of 512 bytes. The drive may behave this way
for compatibility reasons. For example, the WDC WD20EARS disks are
known to exhibit this behavior.
When creating a zpool, ZFS takes that wrong sector size and sets the
"ashift" property accordingly (to 9: 1<<9=512), whereas it should be
set to 12 for 4k sectors (1<<12=4096).
This patch allows an adminstrator to manual specify the known correct
ashift size at 'zpool create' time. This can significantly improve
performance in certain cases. However, it will have an impact on your
total pool capacity. See the updated ashift property description
in the zpool.8 man page for additional details.
Valid values for the ashift property range from 9 to 17 (512B-128KB).
Additionally, you may set the ashift to 0 if you wish to auto-detect
the sector size based on what the disk reports, this is the default
behavior. The most common ashift values are 9 and 12.
Example:
zpool create -o ashift=12 tank raidz2 sda sdb sdc sdd
Closes#280
Original-patch-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The WRITE_FLUSH, WRITE_FUA, and WRITE_FLUSH_FUA flags have been
introduced as a replacement for WRITE_BARRIER. This was done
to allow richer semantics to be expressed to the block layer.
It is the block layers responsibility to choose the correct way
to implement these semantics.
This change simply updates the bio's to use the new kernel API
which should be absolutely safe. However, since ZFS depends
entirely on this working as designed for correctness we do
want to be careful.
Closes#281
The previous commit 8a7e1ceefa wasn't
quite right. This check applies to both the user and kernel space
build and as such we must make sure it runs regardless of what
the --with-config option is set too.
For example, if --with-config=kernel then the autoconf test does
not run and we generate build warnings when compiling the kernel
packages.
Gcc versions 4.3.2 and earlier do not support the compiler flag
-Wno-unused-but-set-variable. This can lead to build failures
on older Linux platforms such as Debian Lenny. Since this is
an optional build argument this changes add a new autoconf check
for the option. If it is supported by the installed version of
gcc then it is used otherwise it is omited.
See commit's 12c1acde76 and
79713039a2 for the reason the
-Wno-unused-but-set-variable options was originally added.
The direct reclaim path in the z_wr_* threads must be disabled
to ensure forward progress is always maintained for txg processing.
This ensures that a txg will never get stuck waiting on itself
because it entered the following memory reclaim callpath.
->prune_icache()->dispose_list()->zpl_clear_inode()->zfs_inactive()
->dmu_tx_assign()->dmu_tx_wait()->tgx_wait_open()
It would be preferable to target this exact code path but the
kernel offers no way to do this without custom patches. To avoid
this we are forced to disable all reclaim for these threads. It
should not be necessary to do this for other other z_* threads
because they will not hold a txg open.
Closes#232
It has become necessary to be able to optionally disable
direct memory reclaim for certain taskqs. To support
this the TASKQ_NORECLAIM flags has been added which sets
the PF_MEMALLOC bit for all threads in the taskq.
How nfsd handles .fsync() has been changed a couple of times in the
recent kernels. But basically there are three cases we need to
consider.
Linux 2.6.12 - 2.6.33
* The .fsync() hook takes 3 arguments
* The nfsd will call .fsync() with a NULL file struct pointer.
Linux 2.6.34
* The .fsync() hook takes 3 arguments
* The nfsd no longer calls .fsync() but instead used sync_inode()
Linux 2.6.35 - 2.6.x
* The .fsync() hook takes 2 arguments
* The nfsd no longer calls .fsync() but instead used sync_inode()
For once it looks like we've gotten lucky. The first two cases can
actually be collased in to one if we stop using the file struct
pointer entirely. Since the dentry is still passed in both cases
this is possible. The last case can then be safely handled by
unconditionally using the dentry in the file struct pointer now
that we know the nfsd caller has been removed.
Closes#230
This commit adds module options for all existing zfs tunables.
Ideally the average user should never need to modify any of these
values. However, in practice sometimes you do need to tweak these
values for one reason or another. In those cases it's nice not to
have to resort to rebuilding from source. All tunables are visable
to modinfo and the list is as follows:
$ modinfo module/zfs/zfs.ko
filename: module/zfs/zfs.ko
license: CDDL
author: Sun Microsystems/Oracle, Lawrence Livermore National Laboratory
description: ZFS
srcversion: 8EAB1D71DACE05B5AA61567
depends: spl,znvpair,zcommon,zunicode,zavl
vermagic: 2.6.32-131.0.5.el6.x86_64 SMP mod_unload modversions
parm: zvol_major:Major number for zvol device (uint)
parm: zvol_threads:Number of threads for zvol device (uint)
parm: zio_injection_enabled:Enable fault injection (int)
parm: zio_bulk_flags:Additional flags to pass to bulk buffers (int)
parm: zio_delay_max:Max zio millisec delay before posting event (int)
parm: zio_requeue_io_start_cut_in_line:Prioritize requeued I/O (bool)
parm: zil_replay_disable:Disable intent logging replay (int)
parm: zfs_nocacheflush:Disable cache flushes (bool)
parm: zfs_read_chunk_size:Bytes to read per chunk (long)
parm: zfs_vdev_max_pending:Max pending per-vdev I/Os (int)
parm: zfs_vdev_min_pending:Min pending per-vdev I/Os (int)
parm: zfs_vdev_aggregation_limit:Max vdev I/O aggregation size (int)
parm: zfs_vdev_time_shift:Deadline time shift for vdev I/O (int)
parm: zfs_vdev_ramp_rate:Exponential I/O issue ramp-up rate (int)
parm: zfs_vdev_read_gap_limit:Aggregate read I/O over gap (int)
parm: zfs_vdev_write_gap_limit:Aggregate write I/O over gap (int)
parm: zfs_vdev_scheduler:I/O scheduler (charp)
parm: zfs_vdev_cache_max:Inflate reads small than max (int)
parm: zfs_vdev_cache_size:Total size of the per-disk cache (int)
parm: zfs_vdev_cache_bshift:Shift size to inflate reads too (int)
parm: zfs_scrub_limit:Max scrub/resilver I/O per leaf vdev (int)
parm: zfs_recover:Set to attempt to recover from fatal errors (int)
parm: spa_config_path:SPA config file (/etc/zfs/zpool.cache) (charp)
parm: zfs_zevent_len_max:Max event queue length (int)
parm: zfs_zevent_cols:Max event column width (int)
parm: zfs_zevent_console:Log events to the console (int)
parm: zfs_top_maxinflight:Max I/Os per top-level (int)
parm: zfs_resilver_delay:Number of ticks to delay resilver (int)
parm: zfs_scrub_delay:Number of ticks to delay scrub (int)
parm: zfs_scan_idle:Idle window in clock ticks (int)
parm: zfs_scan_min_time_ms:Min millisecs to scrub per txg (int)
parm: zfs_free_min_time_ms:Min millisecs to free per txg (int)
parm: zfs_resilver_min_time_ms:Min millisecs to resilver per txg (int)
parm: zfs_no_scrub_io:Set to disable scrub I/O (bool)
parm: zfs_no_scrub_prefetch:Set to disable scrub prefetching (bool)
parm: zfs_txg_timeout:Max seconds worth of delta per txg (int)
parm: zfs_no_write_throttle:Disable write throttling (int)
parm: zfs_write_limit_shift:log2(fraction of memory) per txg (int)
parm: zfs_txg_synctime_ms:Target milliseconds between tgx sync (int)
parm: zfs_write_limit_min:Min tgx write limit (ulong)
parm: zfs_write_limit_max:Max tgx write limit (ulong)
parm: zfs_write_limit_inflated:Inflated tgx write limit (ulong)
parm: zfs_write_limit_override:Override tgx write limit (ulong)
parm: zfs_prefetch_disable:Disable all ZFS prefetching (int)
parm: zfetch_max_streams:Max number of streams per zfetch (uint)
parm: zfetch_min_sec_reap:Min time before stream reclaim (uint)
parm: zfetch_block_cap:Max number of blocks to fetch at a time (uint)
parm: zfetch_array_rd_sz:Number of bytes in a array_read (ulong)
parm: zfs_pd_blks_max:Max number of blocks to prefetch (int)
parm: zfs_dedup_prefetch:Enable prefetching dedup-ed blks (int)
parm: zfs_arc_min:Min arc size (ulong)
parm: zfs_arc_max:Max arc size (ulong)
parm: zfs_arc_meta_limit:Meta limit for arc size (ulong)
parm: zfs_arc_reduce_dnlc_percent:Meta reclaim percentage (int)
parm: zfs_arc_grow_retry:Seconds before growing arc size (int)
parm: zfs_arc_shrink_shift:log2(fraction of arc to reclaim) (int)
parm: zfs_arc_p_min_shift:arc_c shift to calc min/max arc_p (int)