For a long time now the kernel has been moving away from using the
pdflush daemon to write 'old' dirty pages to disk. The primary reason
for this is because the pdflush daemon is single threaded and can be
a limiting factor for performance. Since pdflush sequentially walks
the dirty inode list for each super block any delay in processing can
slow down dirty page writeback for all filesystems.
The replacement for pdflush is called bdi (backing device info). The
bdi system involves creating a per-filesystem control structure each
with its own private sets of queues to manage writeback. The advantage
is greater parallelism which improves performance and prevents a single
filesystem from slowing writeback to the others.
For a long time both systems co-existed in the kernel so it wasn't
strictly required to implement the bdi scheme. However, as of
Linux 2.6.36 kernels the pdflush functionality has been retired.
Since ZFS already bypasses the page cache for most I/O this is only
an issue for mmap(2) writes which must go through the page cache.
Even then adding this missing support for newer kernels was overlooked
because there are other mechanisms which can trigger writeback.
However, there is one critical case where not implementing the bdi
functionality can cause problems. If an application handles a page
fault it can enter the balance_dirty_pages() callpath. This will
result in the application hanging until the number of dirty pages in
the system drops below the dirty ratio.
Without a registered backing_device_info for the filesystem the
dirty pages will not get written out. Thus the application will hang.
As mentioned above this was less of an issue with older kernels because
pdflush would eventually write out the dirty pages.
This change adds a backing_device_info structure to the zfs_sb_t
which is already allocated per-super block. It is then registered
when the filesystem mounted and unregistered on unmount. It will
not be registered for mounted snapshots which are read-only. This
change will result in flush-<pool> thread being dynamically created
and destroyed per-mounted filesystem for writeback.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#174
Unlike most other Linux distributions archlinux installs its
init scripts in /etc/rc.d insead of /etc/init.d. This commit
provides an archlinux rc.d script for zfs and extends the
build infrastructure to ensure it get's installed in the
correct place.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#322
The zpool sub-commands like iostat, list, and status should
display consistent message when a given pool is unavailable or
no pool is present. This change unifies the default behavior
as follows:
root@prasad:~# ./zpool list 1 2
no pools available
no pools available
root@prasad:~# ./zpool iostat 1 2
no pools available
no pools available
root@prasad:~# ./zpool status 1 2
no pools available
no pools available
root@prasad:~# ./zpool list tan 1 2
cannot open 'tan': no such pool
root@prasad:~# ./zpool iostat tan 1 2
cannot open 'tan': no such pool
root@prasad:~# ./zpool status tan 1 2
cannot open 'tan': no such pool
Reported-by: Rajshree Thorat <rthorat@stec-inc.com>
Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#306
The .get_sb callback has been replaced by a .mount callback
in the file_system_type structure. When using the new
interface the caller must now use the mount_nodev() helper.
Unfortunately, the new interface no longer passes the vfsmount
down to the zfs layers. This poses a problem for the existing
implementation because we currently save this pointer in the
super block for latter use. It provides our only entry point
in to the namespace layer for manipulating certain mount options.
This needed to be done originally to allow commands like
'zfs set atime=off tank' to work properly. It also allowed me
to keep more of the original Solaris code unmodified. Under
Solaris there is a 1-to-1 mapping between a mount point and a
file system so this is a fairly natural thing to do. However,
under Linux they many be multiple entries in the namespace
which reference the same filesystem. Thus keeping a back
reference from the filesystem to the namespace is complicated.
Rather than introduce some ugly hack to get the vfsmount and
continue as before. I'm leveraging this API change to update
the ZFS code to do things in a more natural way for Linux.
This has the upside that is resolves the compatibility issue
for the long term and fixes several other minor bugs which
have been reported.
This commit updates the code to remove this vfsmount back
reference entirely. All modifications to filesystem mount
options are now passed in to the kernel via a '-o remount'.
This is the expected Linux mechanism and allows the namespace
to properly handle any options which apply to it before passing
them on to the file system itself.
Aside from fixing the compatibility issue, removing the
vfsmount has had the benefit of simplifying the code. This
change which fairly involved has turned out nicely.
Closes#246Closes#217Closes#187Closes#248Closes#231
The security_inode_init_security() function now takes an additional
qstr argument which must be passed in from the dentry if available.
Passing a NULL is safe when no qstr is available the relevant
security checks will just be skipped.
Closes#246Closes#217Closes#187
The inode eviction should unmap the pages associated with the inode.
These pages should also be flushed to disk to avoid the data loss.
Therefore, use truncate_setsize() in evict_inode() to release the
pagecache.
The API truncate_setsize() was added in 2.6.35 kernel. To ensure
compatibility with the old kernel, the patch defines its own
truncate_setsize function.
Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com>
Closes#255
Some disks with internal sectors larger than 512 bytes (e.g., 4k) can
suffer from bad write performance when ashift is not configured
correctly. This is caused by the disk not reporting its actual sector
size, but a sector size of 512 bytes. The drive may behave this way
for compatibility reasons. For example, the WDC WD20EARS disks are
known to exhibit this behavior.
When creating a zpool, ZFS takes that wrong sector size and sets the
"ashift" property accordingly (to 9: 1<<9=512), whereas it should be
set to 12 for 4k sectors (1<<12=4096).
This patch allows an adminstrator to manual specify the known correct
ashift size at 'zpool create' time. This can significantly improve
performance in certain cases. However, it will have an impact on your
total pool capacity. See the updated ashift property description
in the zpool.8 man page for additional details.
Valid values for the ashift property range from 9 to 17 (512B-128KB).
Additionally, you may set the ashift to 0 if you wish to auto-detect
the sector size based on what the disk reports, this is the default
behavior. The most common ashift values are 9 and 12.
Example:
zpool create -o ashift=12 tank raidz2 sda sdb sdc sdd
Closes#280
Original-patch-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The previous commit 8a7e1ceefa wasn't
quite right. This check applies to both the user and kernel space
build and as such we must make sure it runs regardless of what
the --with-config option is set too.
For example, if --with-config=kernel then the autoconf test does
not run and we generate build warnings when compiling the kernel
packages.
Gcc versions 4.3.2 and earlier do not support the compiler flag
-Wno-unused-but-set-variable. This can lead to build failures
on older Linux platforms such as Debian Lenny. Since this is
an optional build argument this changes add a new autoconf check
for the option. If it is supported by the installed version of
gcc then it is used otherwise it is omited.
See commit's 12c1acde76 and
79713039a2 for the reason the
-Wno-unused-but-set-variable options was originally added.
This change fixes a kernel panic which would occur when resizing
a dataset which was not open. The objset_t stored in the
zvol_state_t will be set to NULL when the block device is closed.
To avoid this issue we pass the correct objset_t as the third arg.
The code has also been updated to correctly notify the kernel
when the block device capacity changes. For 2.6.28 and newer
kernels the capacity change will be immediately detected. For
earlier kernels the capacity change will be detected when the
device is next opened. This is a known limitation of older
kernels.
Online ext3 resize test case passes on 2.6.28+ kernels:
$ dd if=/dev/zero of=/tmp/zvol bs=1M count=1 seek=1023
$ zpool create tank /tmp/zvol
$ zfs create -V 500M tank/zd0
$ mkfs.ext3 /dev/zd0
$ mkdir /mnt/zd0
$ mount /dev/zd0 /mnt/zd0
$ df -h /mnt/zd0
$ zfs set volsize=800M tank/zd0
$ resize2fs /dev/zd0
$ df -h /mnt/zd0
Original-patch-by: Fajar A. Nugraha <github@fajar.net>
Closes#68Closes#84
Disable the gethostid() override for Solaris behavior because Linux systems
implement the POSIX standard in a way that allows a negative result.
Mask the gethostid() result to the lower four bytes, like coreutils does in
/usr/bin/hostid, to prevent junk bits or sign-extension on systems that have an
eight byte long type. This can cause a spurious hostid mismatch that prevents
zpool import on 64-bit systems.
As of gcc-4.6 the option -Wunused-but-set-variable is enabled by
default. While this is a useful warning there are numerous places
in the ZFS code when a variable is set and then only checked in an
ASSERT(). To avoid having to update every instance of this in the
code we now set -Wno-unused-but-set-variable to suppress the warning.
Additionally, when building with --enable-debug and -Werror set these
warning also become fatal. We can reevaluate the suppression of these
error at a later time if it becomes an issue. For now we are basically
just reverting to the previous gcc behavior.
Added insert_inode_locked() helper function, prior to this most callers
used insert_inode_hash(). The older method doesn't check for collisions
in the inode_hashtable but it still acceptible for use. Fallback to
using insert_inode_hash() when insert_inode_locked() is unavailable.
To support automatically mounting your zfs on filesystem on boot
a basic init script is needed. Unfortunately, every distribution
has their own idea of the _right_ way to do things. Rather than
write one very complicated portable init script, which would be
invariably replaced by the distributions own anyway. I have
instead added support to provide multiple distribution specific
init scripts.
The correct init script for your distribution will be selected
by ZFS_AC_DEFAULT_PACKAGE which will set DEFAULT_INIT_SCRIPT.
During 'make install' the correct script for your system will
be installed from zfs/etc/init.d/zfs.DEFAULT_INIT_SCRIPT to the
usual /etc/init.d/zfs location.
Currently, there is zfs.fedora and a more generic zfs.lsb init
script. Hopefully, the distribution maintainers who know best
how they want their init scripts to function will feedback their
approved versions to be included in the project.
This change does not consider upstart jobs but I'm not at all
opposed to add that sort of thing.
The open_bdev_exclusive() function has been replaced (again) by the
more generic blkdev_get_by_path() function. Additionally, the
counterpart function close_bdev_exclusive() has been replaced by
blkdev_put(). Because these functions are more generic versions
of the functions they replaced the compatibility macro must add
the FMODE_EXCL mask to ensure they are exclusive.
Closes#114
The new prefered inteface for evicting an inode from the inode cache
is the ->evict_inode() callback. It replaces both the ->delete_inode()
and ->clear_inode() callbacks which were previously used for this.
The fsync() callback in the file_operations structure used to take
3 arguments. The callback now only takes 2 arguments because the
dentry argument was determined to be unused by all consumers. To
handle this a compatibility prototype was added to ensure the right
prototype is used. Our implementation never used the dentry argument
either so it's just a matter of using the right prototype.
The const keyword was added to the 'struct xattr_handler' in the
generic Linux super_block structure. To handle this we define an
appropriate xattr_handler_t typedef which can be used. This was
the preferred solution because it keeps the code clean and readable.
The inclusion on dlsym(), dlopen(), and dlclose() symbols require
us to link against the dl library. Be careful to add the flag to
both the libzfs library and the commands which depend on the library.
ZFS even under Solaris does not strictly require libshare to be
available. The current implementation attempts to dlopen() the
library to access the needed symbols. If this fails libshare
support is simply disabled.
This means that on Linux we only need the most minimal libshare
implementation. In fact just enough to prevent the build from
failing. Longer term we can decide if we want to implement a
libshare library like Solaris. At best this would be an abstraction
layer between ZFS and NFS/SMB. Alternately, we can drop libshare
entirely and directly integrate ZFS with Linux's NFS/SMB.
Finally the bare bones user-libshare.m4 test was dropped. If we
do decide to implement libshare at some point it will surely be
as part of this package so the check is not needed.
If libselinux is detected on your system at configure time link
against it. This allows us to use a library call to detect if
selinux is enabled and if it is to pass the mount option:
"context=\"system_u:object_r:file_t:s0"
For now this is required because none of the existing selinux
policies are aware of the zfs filesystem type. Because of this
they do not properly enable xattr based labeling even though
zfs supports all of the required hooks.
Until distro's add zfs as a known xattr friendly fs type we
must use mntpoint labeling. Alternately, end users could modify
their existing selinux policy with a little guidance.
Creating whole-disk vdevs can intermittently fail if a udev-managed symlink to
the disk partition is already in place. To avoid this, we now remove any such
symlink before partitioning the disk. This makes zpool_label_disk_wait() truly
wait for the new link to show up instead of returning if it finds an old link
still in place. Otherwise there is a window between when udev deletes and
recreates the link during which access attempts will fail with ENOENT.
Also, clean up a comment about waiting for udev to create symlinks. It no
longer needs to describe the special cases for the link names, since that is
now handled in a separate helper function.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Portability between Solaris and Linux isn't really an issue for us anymore, and
removing sections like this one helps simplify the code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This change adds two helper functions for working with vdev names and paths.
zfs_resolve_shortname() resolves a shorthand vdev name to an absolute path
of a file in /dev, /dev/disk/by-id, /dev/disk/by-label, /dev/disk/by-path,
/dev/disk/by-uuid, /dev/disk/zpool. This was previously done only in the
function is_shorthand_path(), but we need a general helper function to
implement shorthand names for additional zpool subcommands like remove.
is_shorthand_path() is accordingly updated to call the helper function.
There is a minor change in the way zfs_resolve_shortname() tests if a file
exists. is_shorthand_path() effectively used open() and stat64() to test for
file existence, since its scope includes testing if a device is a whole disk
and collecting file status information. zfs_resolve_shortname(), on the other
hand, only uses access() to test for existence and leaves it to the caller to
perform any additional file operations. This seemed like the most general and
lightweight approach, and still preserves the semantics of is_shorthand_path().
zfs_append_partition() appends a partition suffix to a device path. This
should be used to generate the name of a whole disk as it is stored in the vdev
label. The user-visible names of whole disks do not contain the partition
information, while the name in the vdev label does. The code was lifted from
the function make_disks(), which now just calls the helper function. Again,
having a helper function to do this supports general handling of shorthand
names in the user interface.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
ZFS works best when it is notified as soon as possible when a device
failure occurs. This allows it to immediately start any recovery
actions which may be needed. In theory Linux supports a flag which
can be set on bio's called FAILFAST which provides this quick
notification by disabling the retry logic in the lower scsi layers.
That's the theory at least. In practice is turns out that while the
flag exists you oddly have to set it with the BIO_RW_AHEAD flag.
And even when it's set it you may get retries in the low level
drivers decides that's the right behavior, or if you don't get the
right error codes reported to the scsi midlayer.
Unfortunately, without additional kernels patchs there's not much
which can be done to improve this. Basically, this just means that
it may take 2-3 minutes before a ZFS is notified properly that a
device has failed. This can be improved and I suspect I'll be
submitting patches upstream to handle this.
To make the 'zpool events' output simple to parse with awk the extra
newline after embedded nvlists has been dropped. This allows the
entire event to be parsed as a single whitespace seperated record.
The -H option has been added to operate in scripted mode. For the
'zpool events' command this means don't print the header. The usage
of -H is consistent with scripted mode for other zpool commands.
The string array 'char dirs[5][8]' was too small to accomodate the terminating
NUL character in "by-label". This change adds the needed additional byte.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
One of the neat tricks an autoconf style project is capable of
is allow configurion/building in a directory other than the
source directory. The major advantage to this is that you can
build the project various different ways while making changes
in a single source tree.
For example, this project is designed to work on various different
Linux distributions each of which work slightly differently. This
means that changes need to verified on each of those supported
distributions perferably before the change is committed to the
public git repo.
Using nfs and custom build directories makes this much easier.
I now have a single source tree in nfs mounted on several different
systems each running a supported distribution. When I make a
change to the source base I suspect may break things I can
concurrently build from the same source on all the systems each
in their own subdirectory.
wget -c http://github.com/downloads/behlendorf/zfs/zfs-x.y.z.tar.gz
tar -xzf zfs-x.y.z.tar.gz
cd zfs-x-y-z
------------------------- run concurrently ----------------------
<ubuntu system> <fedora system> <debian system> <rhel6 system>
mkdir ubuntu mkdir fedora mkdir debian mkdir rhel6
cd ubuntu cd fedora cd debian cd rhel6
../configure ../configure ../configure ../configure
make make make make
make check make check make check make check
This change also moves many of the include headers from individual
incude/sys directories under the modules directory in to a single
top level include directory. This has the advantage of making
the build rules cleaner and logically it makes a bit more sense.
Add the initial products from autogen.sh. These products will
be updated incrementally after this point as development occurs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This topic branch contains required changes to the user space
utilities to allow them to integrate cleanly with Linux.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This topic branch contains all the changes needed to integrate the user
side zfs tools with Linux style devices. Primarily this includes fixing
up the Solaris libefi library to be Linux friendly, and integrating with
the libblkid library which is provided by e2fsprogs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This topic branch leverages the Solaris style FMA call points
in ZFS to create a user space visible event notification system
under Linux. This new system is called zevent and it unifies
all previous Solaris style ereports and sysevent notifications.
Under this Linux specific scheme when a sysevent or ereport event
occurs an nvlist describing the event is created which looks almost
exactly like a Solaris ereport. These events are queued up in the
kernel when they occur and conditionally logged to the console.
It is then up to a user space application to consume the events
and do whatever it likes with them.
To make this possible the existing /dev/zfs ABI has been extended
with two new ioctls which behave as follows.
* ZFS_IOC_EVENTS_NEXT
Get the next pending event. The kernel will keep track of the last
event consumed by the file descriptor and provide the next one if
available. If no new events are available the ioctl() will block
waiting for the next event. This ioctl may also be called in a
non-blocking mode by setting zc.zc_guid = ZEVENT_NONBLOCK. In the
non-blocking case if no events are available ENOENT will be returned.
It is possible that ESHUTDOWN will be returned if the ioctl() is
called while module unloading is in progress. And finally ENOMEM
may occur if the provided nvlist buffer is not large enough to
contain the entire event.
* ZFS_IOC_EVENTS_CLEAR
Clear are events queued by the kernel. The kernel will keep a fairly
large number of recent events queued, use this ioctl to clear the
in kernel list. This will effect all user space processes consuming
events.
The zpool command has been extended to use this events ABI with the
'events' subcommand. You may run 'zpool events -v' to output a
verbose log of all recent events. This is very similar to the
Solaris 'fmdump -ev' command with the key difference being it also
includes what would be considered sysevents under Solaris. You
may also run in follow mode with the '-f' option. To clear the
in kernel event queue use the '-c' option.
$ sudo cmd/zpool/zpool events -fv
TIME CLASS
May 13 2010 16:31:15.777711000 ereport.fs.zfs.config.sync
class = "ereport.fs.zfs.config.sync"
ena = 0x40982b7897700001
detector = (embedded nvlist)
version = 0x0
scheme = "zfs"
pool = 0xed976600de75dfa6
(end detector)
time = 0x4bec8bc3 0x2e5aed98
pool = "zpios"
pool_guid = 0xed976600de75dfa6
pool_context = 0x0
While the 'zpool events' command is handy for interactive debugging
it is not expected to be the primary consumer of zevents. This ABI
was primarily added to facilitate the addition of a user space
monitoring daemon. This daemon would consume all events posted by
the kernel and based on the type of event perform an action. For
most events simply forwarding them on to syslog is likely enough.
But this interface also cleanly allows for more sophisticated
actions to be taken such as generating an email for a failed drive.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Add autoconf style build infrastructure to the ZFS tree. This
includes autogen.sh, configure.ac, m4 macros, some scripts/*,
and makefiles for all the core ZFS components.
Remove deadcode. It's possible the code should be in use
somewhere, but as the source code is laid out it currently
is not.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>