Under Linux the VFS handles virtually all of the mmap() access
checks. Filesystem specific checks are left to be handled in
the .mmap() hook and normally there arn't any.
However, ZFS provides a few attributes which can influence the
mmap behavior and should be honored. Note, currently the code
to modify these attributes has not been implemented under Linux.
* ZFS_IMMUTABLE | ZFS_READONLY | ZFS_APPENDONLY: when any of these
attributes are set a file may not be mmaped with write access.
* ZFS_AV_QUARANTINED: when set a file file may not be mmaped with
read or exec access.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The following functions were required for the OpenSolaris mmap
implementation. Because the Linux VFS does most the most heavy
lifting for us they are not required and are being removed to
keep the code clean and easy to understand.
* zfs_null_putapage()
* zfs_frlock()
* zfs_no_putpage()
Signed-off-by: Brian Behlendorf <behlendorf@llnl.gov>
Enable zfs_getpage, zfs_fillpage, zfs_putpage, zfs_putapage functions.
The functions have been modified to make them Linux friendly.
ZFS uses these functions to read/write the mmapped pages. Using them
from readpage/writepage results in clear code. The patch also adds
readpages and writepages interface functions to read/write list of
pages in one function call.
The code change handles the first mmap optimization mentioned on
https://github.com/behlendorf/zfs/issues/225
Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com>
Signed-off-by: Brian Behlendorf <behlendorf@llnl.gov>
Issue #255
These days most disk drivers will probe for devices asynchronously.
This means it's possible that when you zfs init script runs all the
required block devices may not yet have been discovered. The result
is the pool may fail to cleanly import at boot time. This is
particularly common when you have a large number of devices.
The fix is for the init script to block until udev settles and we
are no longer detecting new devices. Once the system has settled
the zfs modules can be loaded and the pool with be automatically
imported.
According to Linux kernel commit 2c27c65e, using truncate_setsize in
setattr simplifies the code. Therefore, the patch replaces the call
to vmtruncate() with truncate_setsize().
zfs_setattr uses zfs_freesp to free the disk space belonging to the
file. As truncate_setsize may release the page cache and flushing
the dirty data to disk, it must be called before the zfs_freesp.
Suggested-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com>
Closes#255
The inode eviction should unmap the pages associated with the inode.
These pages should also be flushed to disk to avoid the data loss.
Therefore, use truncate_setsize() in evict_inode() to release the
pagecache.
The API truncate_setsize() was added in 2.6.35 kernel. To ensure
compatibility with the old kernel, the patch defines its own
truncate_setsize function.
Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com>
Closes#255
Prior to Linux 2.6.39 when CONFIG_DEBUG_MUTEXES was defined
the kernel stored a thread_info pointer as the mutex owner.
From this you could get the pointer of the current task_struct
to compare with get_current().
As of Linux 2.6.39 this behavior has changed and now the mutex
stores a pointer to the task_struct. This commit detects the
type of pointer stored in the mutex and adjusts the mutex_owner()
and mutex_owned() functions to perform the correct comparision.
ZFS requires a stable hostid to recognize foreign pool imports,
but the hostid of a Linux system can change if the /etc/hostid
file is missing, particularly during DHCP lease updates.
Ensure that the system hostid is stable by creating the
/etc/hostid file from the output of the /usr/bin/hostid utility.
The /sbin/genhostid utility that is provided by the initscripts
package is not used because it creates a random hostid, which
breaks upgrades on systems that already have the SPL module
installed.
The external `printf` is used because the dash builtin lacks
the byte format. Conveniences like a ${HOSTID:$ii:2} substring
range or a `sed` one-liner are similarly avoided.
Deprecate the /usr/bin/hostid call by reading the /etc/hostid file
directly. Add the spl_hostid_path parameter to override the default
/etc/hostid path.
Rename the set_hostid() function to hostid_exec() to better reflect
actual behavior and complement the new hostid_read() function.
Use HW_INVALID_HOSTID as the spl_hostid sentinel value because
zero seems to be a valid gethostid() result on Linux.
Update udev helper scripts to deal with device-mapper devices created
by multipathd. These enhancements are targeted at a particular
storage network topology under evaluation at LLNL consisting of two
SAS switches providing redundant connectivity between multiple server
nodes and disk enclosures.
The key to making these systems manageable is to create shortnames for
each disk that conveys its physical location in a drawer. In a
direct-attached topology we infer a disk's enclosure from the PCI bus
number and HBA port number in the by-path name provided by udev. In a
switched topology, however, multiple drawers are accessed via a single
HBA port. We therefore resort to assigning drawer identifiers based
on which switch port a drive's enclosure is connected to. This
information is available from sysfs.
Add options to zpool_layout to generate an /etc/zfs/zdev.conf using
symbolic links in /dev/disk/by-id of the form
<label>-<UUID>-switch-port:<X>-slot:<Y>. <label> is a string that
depends on the subsystem that created the link and defaults to
"dm-uuid-mpath" (this prefix is used by multipathd). <UUID> is a
unique identifier for the disk typically obtained from the scsi_id
program, and <X> and <Y> denote the switch port and disk slot numbers,
respectively.
Add a callout script sas_switch_id for use by multipathd to help
create symlinks of the form described above. Update zpool_id and the
udev zpool rules file to handle both multipath devices and
conventional drives.
To accomindate the updated Linux 3.0 shrinker API the spl
shrinker compatibility code was updated. Unfortunately, this
couldn't be done cleanly without slightly adjusting the comapt
API. See spl commit a55bcaad18.
This commit updates the ZFS code to use the slightly modified
API. You must use the latest SPL if your building ZFS.
While the splat tests were originally designed to stress test
the Solaris primatives. I am extending them to include some kernel
compatibility tests. Certain linux APIs have changed frequently.
These tests ensure that added compatibility is working properly
and no unnoticed regression have slipped in.
Test 1 and 2 add basic regression tests for shrink_icache_memory
and shrink_dcache_memory. These are simply functional tests to
ensure we can call these functions safely. Checking for correct
behavior is more difficult since other running processes will
influence the behavior. However, these functions are provided
by the kernel so if we can successfully call them we assume they
are working correctly.
Test 3 checks that shrinker functions are being registered and
called correctly. As of Linux 3.0 the shrinker API has changed
four different times so I felt the need to add a trivial test
case to ensure each variant works as expected.
Update the the wrapper macros for the memory shrinker to handle
this 4th API change. The callback function now takes a
shrink_control structure. This is certainly a step in the
right direction but it's annoying to have to accomidate yet
another version of the API.
The problem here is that prune_icache() tries to evict/delete
both the xattr directory inode as well as at least one xattr
inode contained in that directory. Here's what happens:
1. File is created.
2. xattr is created for that file (behind the scenes a xattr
directory and a file in that xattr directory are created)
3. File is deleted.
4. Both the xattr directory inode and at least one xattr
inode from that directory are evicted by prune_icache();
prune_icache() acquires a lock on both inodes before it
calls ->evict() on the inodes
When the xattr directory inode is evicted zfs_zinactive attempts
to delete the xattr files contained in that directory. While
enumerating these files zfs_zget() is called to obtain a reference
to the xattr file znode - which tries to lock the xattr inode.
However that very same xattr inode was already locked by
prune_icache() further up the call stack, thus leading to a
deadlock.
This can be reliably reproduced like this:
$ touch test
$ attr -s a -V b test
$ rm test
$ echo 3 > /proc/sys/vm/drop_caches
This patch fixes the deadlock by moving the zfs_purgedir() call to
zfs_unlinked_drain(). Instead zfs_rmnode() now checks whether the
xattr dir is empty and leaves the xattr dir in the unlinked set if
it finds any xattrs.
To ensure zfs_unlinked_drain() never accesses a stale super block
zfsvfs_teardown() has been update to block until the iput taskq
has been drained. This avoids a potential race where a file with
an xattr directory is removed and the file system is immediately
unmounted.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#266
iput_final() already calls zpl_inode_destroy() -> zfs_inode_destroy()
for us after zfs_zinactive(), thus making sure that the inode is
properly cleaned up.
The zfs_inode_destroy() calls in zfs_rmnode() would lead to a
double-free.
Fixes#282
Some disks with internal sectors larger than 512 bytes (e.g., 4k) can
suffer from bad write performance when ashift is not configured
correctly. This is caused by the disk not reporting its actual sector
size, but a sector size of 512 bytes. The drive may behave this way
for compatibility reasons. For example, the WDC WD20EARS disks are
known to exhibit this behavior.
When creating a zpool, ZFS takes that wrong sector size and sets the
"ashift" property accordingly (to 9: 1<<9=512), whereas it should be
set to 12 for 4k sectors (1<<12=4096).
This patch allows an adminstrator to manual specify the known correct
ashift size at 'zpool create' time. This can significantly improve
performance in certain cases. However, it will have an impact on your
total pool capacity. See the updated ashift property description
in the zpool.8 man page for additional details.
Valid values for the ashift property range from 9 to 17 (512B-128KB).
Additionally, you may set the ashift to 0 if you wish to auto-detect
the sector size based on what the disk reports, this is the default
behavior. The most common ashift values are 9 and 12.
Example:
zpool create -o ashift=12 tank raidz2 sda sdb sdc sdd
Closes#280
Original-patch-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The WRITE_FLUSH, WRITE_FUA, and WRITE_FLUSH_FUA flags have been
introduced as a replacement for WRITE_BARRIER. This was done
to allow richer semantics to be expressed to the block layer.
It is the block layers responsibility to choose the correct way
to implement these semantics.
This change simply updates the bio's to use the new kernel API
which should be absolutely safe. However, since ZFS depends
entirely on this working as designed for correctness we do
want to be careful.
Closes#281
This change is the first step towards updating the default
rpm/deb packages to be FHS compliant. It accomplishes this
by passing the following options to ./configure to ensure the
zfs build products are installed in FHS compliant locations.
./configure --prefix=/ --bindir=/lib/udev \
--libexecdir=/usr/libexec --datadir=/usr/share
The core zfs utilities (zfs, zpool, zdb) are now be installed
in /sbin, the core libraries in /lib, and the udev helpers
(zpool_id, zvol_id) are in /lib/udev with the other udev
helpers.
The remaining files in the zfs package remain in their
previous locations under /usr.
The zfs dracut modules should be installed under the --datadir
not --datarootdir path. This was just an oversight in the
original Makefile.am.
After this change %{_datadir} can now be set safely in the
zfs.spec file. The 'make install' location is now consistent
with the location expected by the spec file.
Change the variable substitution in the udev rule templates
according to the method described in the Autoconf manual;
Chapter 4.7.2: Installation Directory Variables.
The udev rules are improperly generated if the bindir parameter
overrides the prefix parameter during configure. For example:
# ./configure --prefix=/usr/local --bindir=/opt/zfs/bin
The udev helper is installed as /opt/zfs/bin/zpool_id, but the
corresponding udev rule has a different path:
# /usr/local/etc/udev/rules.d/60-zpool.rules
ENV{DEVTYPE}=="disk", IMPORT{program}="/usr/local/bin/zpool_id -d %p"
The @bindir@ variable expands to "${exec_prefix}/bin", so it cannot
be used instead of @prefix@ directly.
This also applies to the zvol_id helper.
Closes#283.
RPM version 4.9.0 has been observed to generate extra debug
messages in certain cases. These debug messages prevent us
from cleanly acquiring the architecture. This is clearly
an upstream RPM bug which will get fixed. But until then
a safe solution is to pipe the result through 'tail -1'
to just grab the architecture bit we care about.
Example 'rpm -qp spl-0.6.0-rc4.src.rpm --qf %{arch}' output:
Freeing read locks for locker 0x166: 28031/47480843735008
Freeing read locks for locker 0x168: 28031/47480843735008
x86_64
Under Fedora 15 /etc/mtab is now a symlink to /proc/mounts by
default. When /etc/mtab is a symlink the mount.zfs helper
should not update it. There was code in place to handle this
case but it used stat() which traverses the link and then issues
the stat on /proc/mounts. We need to use lstat() to prevent the
link traversal and instead stat /etc/mtab.
Closes#270
The previous commit 8a7e1ceefa wasn't
quite right. This check applies to both the user and kernel space
build and as such we must make sure it runs regardless of what
the --with-config option is set too.
For example, if --with-config=kernel then the autoconf test does
not run and we generate build warnings when compiling the kernel
packages.
Gcc versions 4.3.2 and earlier do not support the compiler flag
-Wno-unused-but-set-variable. This can lead to build failures
on older Linux platforms such as Debian Lenny. Since this is
an optional build argument this changes add a new autoconf check
for the option. If it is supported by the installed version of
gcc then it is used otherwise it is omited.
See commit's 12c1acde76 and
79713039a2 for the reason the
-Wno-unused-but-set-variable options was originally added.
When your kernel is built with kernel stack tracing enabled and you
have the debugfs filesystem mounted. Then the zfs.sh script will clear
the worst observed kernel stack depth on module load and check the worst
case usage on module removal. If the stack depth ever exceeds 7000
bytes the full stack will be printed for debugging. This is dangerously
close to overrunning the default 8k stack.
This additional advisory debugging is particularly valuable when running
the regression tests on a kernel built with 16k stacks. In this case,
almost no matter how bad the stack overrun is you will see be able to
get a clean stack trace for debugging. Since the worst case stack usage
can be highly variable it's helpful to always check the worst case usage.
If a pool was not cleanly exported passing the -f flag may be required
at 'zpool import' time. Since this test is simply validating that the
pool can be successfully imported in the absense of the cache file
always pass the -f to ensure it succeeds. This failure was observed
under RHEL6.1.
Sending pools with dedup results in a segfault due to a Solaris
portability issue. Under Solaris the pipe(2) library call
creates a bidirectional data channel. Unfortunately, on Linux
pipe(2) call creates unidirection data channel. The fix is to
use the socketpair(2) function to create the expected
bidirectional channel.
Seth Heeren did the original leg work on this issue for zfs-fuse.
We finally just rediscovered the same portability issue and
dfurphy was able to point me at the original issue for the fix.
Closes#268
Just like zconfig.sh the zpios-sanity.sh tests should run in a
sanatized environment. This ensures they never conflict with an
installed /etc/zfs/zpool.cache file.
This commit additionally improves the -c cleanup option. It now
removes the modules stack if loaded and destroys relevant md devices.
This behavior is now identical to zconfig.sh.
Generally I don't approve of just adding an arbitrary delay to
avoid a problem but in this case I'm going to let it slide. We
may need to delay briefly after 'zpool destroy' returns to ensure
the loopback devices are closed. If they aren't closed than
losetup -d will not be able to destroy them. Unfortunately,
there's no easy state the check so we'll have to make due with
a simple delay.
We should always unload zpios.ko on exit. This ensures
that subsequent calls to 'zfs.sh -u' from other utilities
will be able to unload the module stack and properly
cleanup. This is important for the the --cleanup option
which can be passed to zconfig.sh and zfault.sh.
The zpios-sanity.sh script should return failure when any
of the individual zpios.sh tests fail. The previous code
would always return success suppressing real failures.
Stack usage for ddt_class_contains() reduced from 524 bytes to 68
bytes. This large stack allocation significantly contributed to
the likelyhood of a stack overflow when scrubbing/resilvering
dedup pools.
Stack usage for ddt_zap_lookup() reduced from 368 bytes to 120
bytes. This large stack allocation significantly contributed to
the likelyhood of a stack overflow when scrubbing/resilvering
dedup pools.
This abomination is no longer required because the zio's issued
during this recursive call path will now be handled asynchronously
by the taskq thread pool.
This reverts commit 6656bf5621.
The majority of the recursive operations performed by the dsl
are done either in the context of the tgx_sync_thread or during
pool import. It is these recursive operations which contribute
greatly to the stack depth. When this recursion is coupled with
a synchronous I/O in the same context overflow becomes possible.
Previously to handle this case I have focused on keeping the
individual stack frames as light as possible. This is a good
idea as long as it can be done in a way which doesn't overly
complicate the code. However, there is a better solution.
If we treat all zio's issued by the tgx_sync_thread as async then
we can use the tgx_sync_thread stack for the recursive parts, and
the zio_* threads for the I/O parts. This effectively doubles our
available stack space with the only drawback being a small delay
to schedule the I/O. However, in practice the scheduling time
is so much smaller than the actual I/O time this isn't an issue.
Another benefit of making the zio async is that the zio pipeline
is now parallel. That should mean for CPU intensive pipelines
such as compression or dedup performance may be improved.
With this change in place the worst case stack usage observed so
far is 6902 bytes. This is still higher than I'd like but
significantly improved. Additional changes to specific functions
should improve this further. This change allows us to revent
commit 6656bf5 which did some horrible things to the recursive
traverse_visitbp() callpath in the name of saving stack.
Yesterday I ran across a 3TB drive which exposed 4K sectors to
Linux. While I thought I had gotten this support correct it
turns out there were 2 subtle bugs which prevented it from
working.
sudo ./cmd/zpool/zpool create -f large-sector /dev/sda
cannot create 'large-sector': one or more devices is currently unavailable
1) The first issue was that it was possible that bdev_capacity()
would return the number of 512 byte sectors rather than the number
of 4096 sectors. Internally, certain Linux functions only operate
with 512 byte sectors so you need to be careful. To avoid any
confusion in the future I've updated bdev_capacity() to simply
return the device (or partition) capacity in bytes. The higher
levels of ZFS want the value in bytes anyway so this is cleaner.
2) When creating a bio the ->bi_sector count must always be
expressed in 512 byte sectors. The existing code would scale
the byte offset by the logical sector size. Until now this was
always 512 so it never caused problems. Trying a 4K sector drive
clearly exposed the issue. The problem has been fixed by
hard coding the 512 byte sector which is exactly what the bio
code does internally.
With these changes I'm now able to create ZFS pools using 4K
sector drives. No issues were observed during fairly extensive
testing. This is also a low risk change if your using 512b
sectors devices because none of the logic changes.
Closes#256
The default buffer size when requesting multiple quota entries
is 100 times the zfs_useracct_t size. In practice this works out
to exactly 27200 bytes. Since this will be a short lived buffer
in a non-performance critical path it is preferable to vmem_alloc()
the needed memory.
We will never bring over the pyzfs.py helper script from Solaris
to Linux. Instead the missing functionality will be directly
integrated in to the zfs commands and libraries. To avoid
confusion remove the warning about the missing pyzfs.py utility
and simply use the default internal support.
The Illumous developers are of the same mind and have proposed an
initial patch to do this which has been integrated in to the 'allow'
development branch. After some additional testing this code
can be merged in to master as the right long term solution.
Initially when zfsdev_ioctl() was ported to Linux we didn't have
any credential support implemented. So at the time we simply
passed NULL which wasn't much of a problem since most of the
secpolicy code was disabled.
However, one exception is quota handling which does require the
credential. Now that proper credentials are supported we can
safely start passing the callers credential. This is also an
initial step towards fully implemented the zfs secpolicy.
Normally when the arc_shrinker_func() function is called the return
value should be:
>=0 - To indicate the number of freeable objects in the cache, or
-1 - To indicate this cache should be skipped
However, when the shrinker callback is called with 'nr_to_scan' equal
to zero. The caller simply wants the number of freeable objects in
the cache and we must never return -1. This patch reorders the
first two conditionals in arc_shrinker_func() to ensure this behavior.
This patch also now explictly casts arc_size and arc_c_min to signed
int64_t types so MAX(x, 0) works as expected. As unsigned types
we would never see an negative value which defeated the purpose of
the MAX() lower bound and broke the shrinker logic.
Finally, when nr_to_scan is non-zero we explictly prevent all reclaim
below arc_c_min. This is done to prevent the Linux page cache from
completely crowding out the ARC. This limit is tunable and some
experimentation is likely going to be required to set it exactly right.
For now we're sticking with the OpenSolaris defaults.
Closes#218Closes#243
The comment in zfs_close() pertaining to decrementing the synchronous
open count needs to be updated for Linux. The code was already
updated to be correct, but the comment was missed and is now misleading.
Under Linux the zfs_close() hook is only called once when the final
reference is dropped. This differs from Solaris where zfs_close()
is called for each close.
Closes#237
This workaround was introduced to workaround issue #164. This
issue was fixed by commit 5f35b19 so the workaround can be safely
dropped from both the zfs.fedora and zfs.gentoo init scripts.
Update the handling of named pipes and sockets to be consistent with
other platforms with regard to the rdev attribute. While all ZFS
ipmlementations store the rdev for device files in a system attribute
(SA), this is not the case for FIFOs and sockets. Indeed, Linux always
passes rdev=0 to mknod() for FIFOs and sockets, so the value is not
needed. Add an ASSERT that rdev==0 for FIFOs and sockets to detect if
the expected behavior ever changes.
Closes#216
The direct reclaim path in the z_wr_* threads must be disabled
to ensure forward progress is always maintained for txg processing.
This ensures that a txg will never get stuck waiting on itself
because it entered the following memory reclaim callpath.
->prune_icache()->dispose_list()->zpl_clear_inode()->zfs_inactive()
->dmu_tx_assign()->dmu_tx_wait()->tgx_wait_open()
It would be preferable to target this exact code path but the
kernel offers no way to do this without custom patches. To avoid
this we are forced to disable all reclaim for these threads. It
should not be necessary to do this for other other z_* threads
because they will not hold a txg open.
Closes#232
It has become necessary to be able to optionally disable
direct memory reclaim for certain taskqs. To support
this the TASKQ_NORECLAIM flags has been added which sets
the PF_MEMALLOC bit for all threads in the taskq.