Add support for the .zfs control directory. This was accomplished
by leveraging as much of the existing ZFS infrastructure as posible
and updating it for Linux as required. The bulk of the core
functionality is now all there with the following limitations.
*) The .zfs/snapshot directory automount support requires a 2.6.37
or newer kernel. The exception is RHEL6.2 which has backported
the d_automount patches.
*) Creating/destroying/renaming snapshots with mkdir/rmdir/mv
in the .zfs/snapshot directory works as expected. However,
this functionality is only available to root until zfs
delegations are finished.
* mkdir - create a snapshot
* rmdir - destroy a snapshot
* mv - rename a snapshot
The following issues are known defeciences, but we expect them to
be addressed by future commits.
*) Add automount support for kernels older the 2.6.37. This should
be possible using follow_link() which is what Linux did before.
*) Accessing the .zfs/snapshot directory via NFS is not yet possible.
The majority of the ground work for this is complete. However,
finishing this work will require resolving some lingering
integration issues with the Linux NFS kernel server.
*) The .zfs/shares directory exists but no futher smb functionality
has yet been implemented.
Contributions-by: Rohan Puri <rohan.puri15@gmail.com>
Contributiobs-by: Andrew Barnes <barnes333@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#173
The 'zfs list' and 'zpool list' commands output the message
'no datasets/pools available' to stdout. This should go to
stderr and only the available datasets/pools should go to
stdout. Returning nothing to stdout is expected behavior
when there is nothing to list.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#581
Allow a source rpm to be rebuilt with debugging enabled. This
avoids the need to have to manually modify the spec file. By
default debugging is still largely disabled. To enable specific
debugging features use the following options with rpmbuild.
'--with debug' - Enables ASSERTs
# For example:
$ rpmbuild --rebuild --with debug zfs-modules-0.6.0-rc6.src.rpm
Additionally, ZFS_CONFIG has been added to zfs_config.h for
packages which build against these headers. This is critical
to ensure both zfs and the dependant package are using the same
prototype and structure definitions.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When creating a new pool, make_root_vdev() calls check_in_use() to
ensure that none of the consituent disks are in use. If the disk
contains a valid vdev label it is read to retrieve the list of its
child vdevs and these are checked recursively. However, the
partitions stored in the vdev label my no longer exist, for example
if the partition table has since been altered. In any such case we
would want the pool creation to proceed, so this change removes the
check from check_slice() that returns an error if the device doesn't
exist. As an added assurance, the Solaris implementation also
returns sucess on ENOENT.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
DISCARD (REQ_DISCARD, BLKDISCARD) is useful for thin provisioning.
It allows ZVOL clients to discard (unmap, trim) block ranges from
a ZVOL, thus optimizing disk space usage by allowing a ZVOL to
shrink instead of just grow.
We can't use zfs_space() or zfs_freesp() here, since these functions
only work on regular files, not volumes. Fortunately we can use the
low-level function dmu_free_long_range() which does exactly what we
want.
Currently the discard operation is not added to the log. That's not
a big deal since losing discard requests cannot result in data
corruption. It would however result in disk space usage higher than
it should be. Thus adding log support to zvol_discard() is probably
a good idea for a future improvement.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Currently only the (FALLOC_FL_PUNCH_HOLE) flag combination is
supported, since it's the only one that matches the behavior of
zfs_space(). This makes it pretty much useless in its current
form, but it's a start.
To support other flag combinations we would need to modify
zfs_space() to make it more flexible, or emulate the desired
functionality in zpl_fallocate().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #334
The Linux block device queue subsystem exposes a number of configurable
settings described in Linux block/blk-settings.c. The defaults for these
settings are tuned for hard drives, and are not optimized for ZVOLs. Proper
configuration of these options would allow upper layers (I/O scheduler) to
take better decisions about write merging and ordering.
Detailed rationale:
- max_hw_sectors is set to unlimited (UINT_MAX). zvol_write() is able to
handle writes of any size, so there's no reason to impose a limit. Let the
upper layer decide.
- max_segments and max_segment_size are set to unlimited. zvol_write() will
copy the requests' contents into a dbuf anyway, so the number and size of
the segments are irrelevant. Let the upper layer decide.
- physical_block_size and io_opt are set to the ZVOL's block size. This
has the potential to somewhat alleviate issue #361 for ZVOLs, by warning
the upper layers that writes smaller than the volume's block size will be
slow.
- The NONROT flag is set to indicate this isn't a rotational device.
Although the backing zpool might be composed of rotational devices, the
resulting ZVOL often doesn't exhibit the same behavior due to the COW
mechanisms used by ZFS. Setting this flag will prevent upper layers from
making useless decisions (such as reordering writes) based on incorrect
assumptions about the behavior of the ZVOL.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
zvol_write() assumes that the write request must be written to stable storage
if rq_is_sync() is true. Unfortunately, this assumption is incorrect. Indeed,
"sync" does *not* mean what we think it means in the context of the Linux
block layer. This is well explained in linux/fs.h:
WRITE: A normal async write. Device will be plugged.
WRITE_SYNC: Synchronous write. Identical to WRITE, but passes down
the hint that someone will be waiting on this IO
shortly.
WRITE_FLUSH: Like WRITE_SYNC but with preceding cache flush.
WRITE_FUA: Like WRITE_SYNC but data is guaranteed to be on
non-volatile media on completion.
In other words, SYNC does not *mean* that the write must be on stable storage
on completion. It just means that someone is waiting on us to complete the
write request. Thus triggering a ZIL commit for each SYNC write request on a
ZVOL is unnecessary and harmful for performance. To make matters worse, ZVOL
users have no way to express that they actually want data to be written to
stable storage, which means the ZIL is broken for ZVOLs.
The request for stable storage is expressed by the FUA flag, so we must
commit the ZIL after the write if the FUA flag is set. In addition, we must
commit the ZIL before the write if the FLUSH flag is set.
Also, we must inform the block layer that we actually support FLUSH and FUA.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The second argument of sops->show_options() was changed from a
'struct vfsmount *' to a 'struct dentry *'. Add an autoconf check
to detect the API change and then conditionally define the expected
interface. In either case we are only interested in the zfs_sb_t.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#549
These libraries, which are an artifact of the ZoL development
process, conflict with packages that are already in distribution:
* libspl: SPL Programming Language
* libavl: AVL for Linux
* libefi: GRUB
And these libraries are potential conflicts:
* libshare: the Linux Mount Manager
* libunicode: Perl and Python
Recompose these five ZoL components into the four libraries that are
conventionally provided by Solaris and FreeBSD systems:
+ libnvpair
+ libuutil
+ libzpool
+ libzfs
This change resolves the name conflict, makes ZoL more compatible
with existing software that uses autotools to detect ZFS, and allows
pkg-zfs to better reflect the official Debian kFreeBSD packaging.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #430
Linux supports mounting over non-empty directories by default.
In Solaris this is not the case and -O option is required for
zfs mount to mount a zfs filesystem over a non-empty directory.
For compatibility, I've added support for -O option to mount
zfs filesystems over non-empty directories if the user wants
to, just like in Solaris.
I've defined MS_OVERLAY to record it in the flags variable if
the -O option is supplied. The flags variable passes through
a few functions and its checked before performing the empty
directory check in zfs_mount function. If -O is given, the
check is not performed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#473
Some implementations of `awk` incorrectly parse the \< and \> regex
symbols, so use a `while read` loop and regular globbing instead.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #259
The Linux 3.1 kernel has introduced the concept of per-filesystem
shrinkers which are directly assoicated with a super block. Prior
to this change there was one shared global shrinker.
The zfs code relied on being able to call the global shrinker when
the arc_meta_limit was exceeded. This would cause the VFS to drop
references on a fraction of the dentries in the dcache. The ARC
could then safely reclaim the memory used by these entries and
honor the arc_meta_limit. Unfortunately, when per-filesystem
shrinkers were added the old interfaces were made unavailable.
This change adds support to use the new per-filesystem shrinker
interface so we can continue to honor the arc_meta_limit. The
major benefit of the new interface is that we can now target
only the zfs filesystem for dentry and inode pruning. Thus we
can minimize any impact on the caching of other filesystems.
In the context of making this change several other important
issues related to managing the ARC were addressed, they include:
* The dnlc_reduce_cache() function which was called by the ARC
to drop dentries for the Posix layer was replaced with a generic
zfs_prune_t callback. The ZPL layer now registers a callback to
drop these dentries removing a layering violation which dates
back to the Solaris code. This callback can also be used by
other ARC consumers such as Lustre.
arc_add_prune_callback()
arc_remove_prune_callback()
* The arc_reduce_dnlc_percent module option has been changed to
arc_meta_prune for clarity. The dnlc functions are specific to
Solaris's VFS and have already been largely eliminated already.
The replacement tunable now represents the number of bytes the
prune callback will request when invoked.
* Less aggressively invoke the prune callback. We used to call
this whenever we exceeded the arc_meta_limit however that's not
strictly correct since it results in over zeleous reclaim of
dentries and inodes. It is now only called once the arc_meta_limit
is exceeded and every effort has been made to evict other data from
the ARC cache.
* More promptly manage exceeding the arc_meta_limit. When reading
meta data in to the cache if a buffer was unable to be recycled
notify the arc_reclaim thread to invoke the required prune.
* Added arcstat_prune kstat which is incremented when the ARC
is forced to request that a consumer prune its cache. Remember
this will only occur when the ARC has no other choice. If it
can evict buffers safely without invoking the prune callback
it will.
* This change is also expected to resolve the unexpect collapses
of the ARC cache. This would occur because when exceeded just the
arc_meta_limit reclaim presure would be excerted on the arc_c
value via arc_shrink(). This effectively shrunk the entire cache
when really we just needed to reclaim meta data.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#466Closes#292
Regenerating the autotools configuration on Debian and Ubuntu systems
causes compilation to fail with this error message:
cmd/mount_zfs/../../cmd/mount_zfs/mount_zfs.c:403:
undefined reference to `is_selinux_enabled'
In the automake template, set "mount_zfs_LDFLAGS = ... $(LIBSELINUX)"
so that the /sbin/mount.zfs utility is linked to libselinux.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Directly changing inode->i_nlink is deprecated in Linux 3.2 by commit
SHA: bfe8684869601dacfcb2cd69ef8cfd9045f62170
Use the new set_nlink() kernel function instead.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #462
Added the necessary build infrastructure for building packages
compatible with the Arch Linux distribution. As such, one can now run:
$ ./configure
$ make pkg # Alternatively, one can run 'make arch' as well
on the Arch Linux machine to create two binary packages compatible with
the pacman package manager, one for the zfs userland utilities and
another for the zfs kernel modules. The new packages can then be
installed by running:
# pacman -U $package.pkg.tar.xz
In addition, source-only packages suitable for an Arch Linux chroot
environment or remote builder can also be build using the 'sarch' make
rule.
NOTE: Since the source dist tarball is created on the fly from the head
of the build tree, it's MD5 hash signature will be continually influx.
As a result, the md5sum variable was intentionally omitted from the
PKGBUILD files, and the '--skipinteg' makepkg option is used. This may
or may not have any serious security implications, as the source tarball
is not being downloaded from an outside source.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#491
The zpool_id script is posixly correct and does not use bash
features, so change its whackbang from /bin/bash to /bin/sh.
Debian policy also stipulates that system scripts be dash compatible.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Direct invocation of GNU egrep is deprecated by its man page, and the
its argument in the zpool_id script is not an extended expression.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
For consistency and safety, quote all variables in the zpool_id
script. This accomodates a `-c CONFIG` parameter value with
whitespace in the path name.
Also fix a typo in the usage synopsis for `-h`.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #439
The /lib/udev/path_id helper became a builtin command in the udev 174
release, so test whether path_id is external in the zpool_id script.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #429
Update the code to use the bdi_setup_and_register() helper to
simplify the bdi integration code. The updated code now just
registers the bdi during mount and destroys it during unmount.
The only complication is that for 2.6.32 - 2.6.33 kernels the
helper wasn't available so in these cases the zfs code must
provide it. Luckily the bdi_setup_and_register() function
is trivial.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#367
ZFS contains error messages that point to the defunct www.sun.com
domain, which is currently offline. Change these error messages
to use the zfsonlinux.org mirror instead.
This commit depends on:
zfsonlinux/zfsonlinux.github.com@8e10ead3dc
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Export all symbols already marked extern in the zfs_vfsops.h
header. Several non-static symbols have also been added to
the header and exportewd. This allows external modules to
more easily create and manipulate properly created ZFS
filesystem type datasets.
Rename zfsvfs_teardown() to zfs_sb_teardown and export it.
This is done simply for consistency with the rest of the code
base. All other zfsvfs_* functions have already been renamed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When compiling under Debian Lenny with gcc version 4.3.2
(Debian 4.3.2-1.1) the following warning occurs. To quiet
the warning initialize 'error' to zero. Newer versions of
gcc correctly determine that this uninitialized varible is
impossible because ZFS_NUM_USERQUOTA_PROPS is known to be
greater than zero.
cmd/zfs/zfs_main.c:2377: warning: "error" may be
used uninitialized in this function
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This warning was accidentally introduced by commit
b7936d5c2337bc976ac831c1c38de563844c36b. The fix is to
simply initialize the variable to ZFS_DELEG_WHO_UNKNOWN.
cmd/zfs/zfs_main.c:4460:25: warning: 'who_type' may be
used uninitialized in this function
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
These warnings were accidentally introduced by commit
b7936d5c2337bc976ac831c1c38de563844c36b. The fix is to
simply add the missing format specifier.
cmd/zfs/zfs_main.c:4565: warning: format not a string
literal and no format arguments
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Completely disable the zfs binary from attempting to directly update
/etc/mtab. The Linux port relies entirely on the mount.zfs helper
to safely update /etc/mtab. If we left the /etc/mtab updates to
the zfs binary then they could race with concurrent non-zfs mounts.
Routing everything through the system mount command ensures the
/etc/mtab updates are locked properly.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #329
This change moves the default install location for the zfs udev
rules from /etc/udev/ to /lib/udev/. The correct convention is
for rules provided by a package to be installed in /lib/udev/.
The /etc/udev/ directory is reserved for custom rules or local
overrides.
Additionally, this patch cleans up some abuse of the bindir install
location by adding a udevdir and udevruledir install directories.
This allows us to revert to the default bin install location. The
udev install directories can be set with the following new options.
--with-udevdir=DIR install udev helpers [EPREFIX/lib/udev]
--with-udevruledir=DIR install udev rules [UDEVDIR/rules.d]
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#356
For a long time now the kernel has been moving away from using the
pdflush daemon to write 'old' dirty pages to disk. The primary reason
for this is because the pdflush daemon is single threaded and can be
a limiting factor for performance. Since pdflush sequentially walks
the dirty inode list for each super block any delay in processing can
slow down dirty page writeback for all filesystems.
The replacement for pdflush is called bdi (backing device info). The
bdi system involves creating a per-filesystem control structure each
with its own private sets of queues to manage writeback. The advantage
is greater parallelism which improves performance and prevents a single
filesystem from slowing writeback to the others.
For a long time both systems co-existed in the kernel so it wasn't
strictly required to implement the bdi scheme. However, as of
Linux 2.6.36 kernels the pdflush functionality has been retired.
Since ZFS already bypasses the page cache for most I/O this is only
an issue for mmap(2) writes which must go through the page cache.
Even then adding this missing support for newer kernels was overlooked
because there are other mechanisms which can trigger writeback.
However, there is one critical case where not implementing the bdi
functionality can cause problems. If an application handles a page
fault it can enter the balance_dirty_pages() callpath. This will
result in the application hanging until the number of dirty pages in
the system drops below the dirty ratio.
Without a registered backing_device_info for the filesystem the
dirty pages will not get written out. Thus the application will hang.
As mentioned above this was less of an issue with older kernels because
pdflush would eventually write out the dirty pages.
This change adds a backing_device_info structure to the zfs_sb_t
which is already allocated per-super block. It is then registered
when the filesystem mounted and unregistered on unmount. It will
not be registered for mounted snapshots which are read-only. This
change will result in flush-<pool> thread being dynamically created
and destroyed per-mounted filesystem for writeback.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#174
Moving the zil_free() cleanup to zil_close() prevents this
problem from occurring in the first place. There is a very
good description of the issue and fix in Illumus #883.
Reviewed by: Matt Ahrens <Matt.Ahrens@delphix.com>
Reviewed by: Adam Leventhal <Adam.Leventhal@delphix.com>
Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Garrett D'Amore <garrett@nexenta.com>
Reivewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Gordon Ross <gwr@nexenta.com>
References to Illumos issue and patch:
- https://www.illumos.org/issues/883
- https://github.com/illumos/illumos-gate/commit/c9ba2a43cb
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
Today zfs tries to allocate blocks evenly across all devices.
This means when devices are imbalanced zfs will use lots of
CPU searching for space on devices which tend to be pretty
full. It should instead fail quickly on the full LUNs and
move onto devices which have more availability.
Reviewed by: Eric Schrock <Eric.Schrock@delphix.com>
Reviewed by: Matt Ahrens <Matt.Ahrens@delphix.com>
Reviewed by: Adam Leventhal <Adam.Leventhal@delphix.com>
Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Approved by: Garrett D'Amore <garrett@nexenta.com>
References to Illumos issue and patch:
- https://www.illumos.org/issues/510
- https://github.com/illumos/illumos-gate/commit/5ead3ed965
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
The 'zfs get' command should be able to deal with mountpoint
as an argument. It already works with 'zfs list' command:
# zfs list /export/home/estibi
NAME USED AVAIL REFER MOUNTPOINT
rpool/export/home/estibi 1.14G 3.86G 1.14G /export/home/estibi
but it fails with 'zfs get':
# zfs get all /export/home/estibi
cannot open '/export/home/estibi': invalid dataset name
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Deano <deano@rattie.demon.co.uk>
Reviewed by: Garrett D'Amore <garrett@nexenta.com>
Approved by: Garrett D'Amore <garrett@nexenta.com>
References to Illumos issue and patch:
- https://www.illumos.org/issues/510
- https://github.com/illumos/illumos-gate/commit/5ead3ed965
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
Unlike most other Linux distributions archlinux installs its
init scripts in /etc/rc.d insead of /etc/init.d. This commit
provides an archlinux rc.d script for zfs and extends the
build infrastructure to ensure it get's installed in the
correct place.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#322
Unfortunately, ztest is hard coded to export the zdb utility to
be installed in a certain location. When the packaging was updated
to install zdb in /sbin/ ztest was broken. To fix this I'm updating
ztest to check both common install paths.
The zpool sub-commands like iostat, list, and status should
display consistent message when a given pool is unavailable or
no pool is present. This change unifies the default behavior
as follows:
root@prasad:~# ./zpool list 1 2
no pools available
no pools available
root@prasad:~# ./zpool iostat 1 2
no pools available
no pools available
root@prasad:~# ./zpool status 1 2
no pools available
no pools available
root@prasad:~# ./zpool list tan 1 2
cannot open 'tan': no such pool
root@prasad:~# ./zpool iostat tan 1 2
cannot open 'tan': no such pool
root@prasad:~# ./zpool status tan 1 2
cannot open 'tan': no such pool
Reported-by: Rajshree Thorat <rthorat@stec-inc.com>
Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#306
The .get_sb callback has been replaced by a .mount callback
in the file_system_type structure. When using the new
interface the caller must now use the mount_nodev() helper.
Unfortunately, the new interface no longer passes the vfsmount
down to the zfs layers. This poses a problem for the existing
implementation because we currently save this pointer in the
super block for latter use. It provides our only entry point
in to the namespace layer for manipulating certain mount options.
This needed to be done originally to allow commands like
'zfs set atime=off tank' to work properly. It also allowed me
to keep more of the original Solaris code unmodified. Under
Solaris there is a 1-to-1 mapping between a mount point and a
file system so this is a fairly natural thing to do. However,
under Linux they many be multiple entries in the namespace
which reference the same filesystem. Thus keeping a back
reference from the filesystem to the namespace is complicated.
Rather than introduce some ugly hack to get the vfsmount and
continue as before. I'm leveraging this API change to update
the ZFS code to do things in a more natural way for Linux.
This has the upside that is resolves the compatibility issue
for the long term and fixes several other minor bugs which
have been reported.
This commit updates the code to remove this vfsmount back
reference entirely. All modifications to filesystem mount
options are now passed in to the kernel via a '-o remount'.
This is the expected Linux mechanism and allows the namespace
to properly handle any options which apply to it before passing
them on to the file system itself.
Aside from fixing the compatibility issue, removing the
vfsmount has had the benefit of simplifying the code. This
change which fairly involved has turned out nicely.
Closes#246Closes#217Closes#187Closes#248Closes#231
The security_inode_init_security() function now takes an additional
qstr argument which must be passed in from the dentry if available.
Passing a NULL is safe when no qstr is available the relevant
security checks will just be skipped.
Closes#246Closes#217Closes#187
The inode eviction should unmap the pages associated with the inode.
These pages should also be flushed to disk to avoid the data loss.
Therefore, use truncate_setsize() in evict_inode() to release the
pagecache.
The API truncate_setsize() was added in 2.6.35 kernel. To ensure
compatibility with the old kernel, the patch defines its own
truncate_setsize function.
Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com>
Closes#255
Update udev helper scripts to deal with device-mapper devices created
by multipathd. These enhancements are targeted at a particular
storage network topology under evaluation at LLNL consisting of two
SAS switches providing redundant connectivity between multiple server
nodes and disk enclosures.
The key to making these systems manageable is to create shortnames for
each disk that conveys its physical location in a drawer. In a
direct-attached topology we infer a disk's enclosure from the PCI bus
number and HBA port number in the by-path name provided by udev. In a
switched topology, however, multiple drawers are accessed via a single
HBA port. We therefore resort to assigning drawer identifiers based
on which switch port a drive's enclosure is connected to. This
information is available from sysfs.
Add options to zpool_layout to generate an /etc/zfs/zdev.conf using
symbolic links in /dev/disk/by-id of the form
<label>-<UUID>-switch-port:<X>-slot:<Y>. <label> is a string that
depends on the subsystem that created the link and defaults to
"dm-uuid-mpath" (this prefix is used by multipathd). <UUID> is a
unique identifier for the disk typically obtained from the scsi_id
program, and <X> and <Y> denote the switch port and disk slot numbers,
respectively.
Add a callout script sas_switch_id for use by multipathd to help
create symlinks of the form described above. Update zpool_id and the
udev zpool rules file to handle both multipath devices and
conventional drives.
Some disks with internal sectors larger than 512 bytes (e.g., 4k) can
suffer from bad write performance when ashift is not configured
correctly. This is caused by the disk not reporting its actual sector
size, but a sector size of 512 bytes. The drive may behave this way
for compatibility reasons. For example, the WDC WD20EARS disks are
known to exhibit this behavior.
When creating a zpool, ZFS takes that wrong sector size and sets the
"ashift" property accordingly (to 9: 1<<9=512), whereas it should be
set to 12 for 4k sectors (1<<12=4096).
This patch allows an adminstrator to manual specify the known correct
ashift size at 'zpool create' time. This can significantly improve
performance in certain cases. However, it will have an impact on your
total pool capacity. See the updated ashift property description
in the zpool.8 man page for additional details.
Valid values for the ashift property range from 9 to 17 (512B-128KB).
Additionally, you may set the ashift to 0 if you wish to auto-detect
the sector size based on what the disk reports, this is the default
behavior. The most common ashift values are 9 and 12.
Example:
zpool create -o ashift=12 tank raidz2 sda sdb sdc sdd
Closes#280
Original-patch-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Under Fedora 15 /etc/mtab is now a symlink to /proc/mounts by
default. When /etc/mtab is a symlink the mount.zfs helper
should not update it. There was code in place to handle this
case but it used stat() which traverses the link and then issues
the stat on /proc/mounts. We need to use lstat() to prevent the
link traversal and instead stat /etc/mtab.
Closes#270
The previous commit 8a7e1ceefa wasn't
quite right. This check applies to both the user and kernel space
build and as such we must make sure it runs regardless of what
the --with-config option is set too.
For example, if --with-config=kernel then the autoconf test does
not run and we generate build warnings when compiling the kernel
packages.
Gcc versions 4.3.2 and earlier do not support the compiler flag
-Wno-unused-but-set-variable. This can lead to build failures
on older Linux platforms such as Debian Lenny. Since this is
an optional build argument this changes add a new autoconf check
for the option. If it is supported by the installed version of
gcc then it is used otherwise it is omited.
See commit's 12c1acde76 and
79713039a2 for the reason the
-Wno-unused-but-set-variable options was originally added.
We will never bring over the pyzfs.py helper script from Solaris
to Linux. Instead the missing functionality will be directly
integrated in to the zfs commands and libraries. To avoid
confusion remove the warning about the missing pyzfs.py utility
and simply use the default internal support.
The Illumous developers are of the same mind and have proposed an
initial patch to do this which has been integrated in to the 'allow'
development branch. After some additional testing this code
can be merged in to master as the right long term solution.
The zpool_id and zpool_layout helper scripts have been updated to
use the more common /usr/bin/awk symlink. On Fedora/Redhat systems
there are both /bin/awk and /usr/bin/awk symlinks to your installed
version of awk. On Debian/Ubuntu systems only the /usr/bin/awk
symlink exists.
Additionally, add the '\<' token to the beginning of the regex
pattern to prevent partial matches. This pattern only appears to
work with gawk despite the mawk man page claiming to support this
extended regex. Thus you will need to have gawk installed to use
these optional helper scripts. A comment has been added to the
script to reflect this reality.
With the addition of the mount helper we accidentally regressed
the ability to manually mount snapshots. This commit updates
the mount helper to expect the possibility of a ZFS_TYPE_SNAPSHOT.
All snapshot will be automatically treated as 'legacy' type mounts
so they can be mounted manually.
This change fixes a kernel panic which would occur when resizing
a dataset which was not open. The objset_t stored in the
zvol_state_t will be set to NULL when the block device is closed.
To avoid this issue we pass the correct objset_t as the third arg.
The code has also been updated to correctly notify the kernel
when the block device capacity changes. For 2.6.28 and newer
kernels the capacity change will be immediately detected. For
earlier kernels the capacity change will be detected when the
device is next opened. This is a known limitation of older
kernels.
Online ext3 resize test case passes on 2.6.28+ kernels:
$ dd if=/dev/zero of=/tmp/zvol bs=1M count=1 seek=1023
$ zpool create tank /tmp/zvol
$ zfs create -V 500M tank/zd0
$ mkfs.ext3 /dev/zd0
$ mkdir /mnt/zd0
$ mount /dev/zd0 /mnt/zd0
$ df -h /mnt/zd0
$ zfs set volsize=800M tank/zd0
$ resize2fs /dev/zd0
$ df -h /mnt/zd0
Original-patch-by: Fajar A. Nugraha <github@fajar.net>
Closes#68Closes#84
Disable the gethostid() override for Solaris behavior because Linux systems
implement the POSIX standard in a way that allows a negative result.
Mask the gethostid() result to the lower four bytes, like coreutils does in
/usr/bin/hostid, to prevent junk bits or sign-extension on systems that have an
eight byte long type. This can cause a spurious hostid mismatch that prevents
zpool import on 64-bit systems.
As of gcc-4.6 the option -Wunused-but-set-variable is enabled by
default. While this is a useful warning there are numerous places
in the ZFS code when a variable is set and then only checked in an
ASSERT(). To avoid having to update every instance of this in the
code we now set -Wno-unused-but-set-variable to suppress the warning.
Additionally, when building with --enable-debug and -Werror set these
warning also become fatal. We can reevaluate the suppression of these
error at a later time if it becomes an issue. For now we are basically
just reverting to the previous gcc behavior.
When compiling ZFS in user space gcc-4.6.0 correctly identifies
the variable 'value' as being set but never used. This generates a
warning and a build failure when using --enable-debug. Once again
this is correct but I'm reluctant to remove 'value' because we are
breaking the string in to name/value pairs. While it is not used
now there's a good chance it will be soon and I'd rather not have
to reinvent this. To suppress the warning with just as a VERIFY().
This was observed under Fedora 15.
cmd/mount_zfs/mount_zfs.c: In function ‘parse_option’:
cmd/mount_zfs/mount_zfs.c:112:21: error: variable ‘value’ set but not
used [-Werror=unused-but-set-variable]
Some udev hooks are not designed to be idempotent, so calling udevadm
trigger outside of the distribution's initialization scripts can have
unexpected (and potentially dangerous) side effects. For example, the
system time may change or devices may appear multiple times. See Ubuntu
launchpad bug 320200 and this mailing list post for more details:
https://lists.ubuntu.com/archives/ubuntu-devel/2009-January/027260.html
To avoid these problems we call udevadm trigger with --action=change
--subsystem-match=block. The first argument tells udev just to refresh
devices, and make sure everything's as it should be. The second
argument limits the scope to block devices, so devices belonging to
other subsystems cannot be affected.
This doesn't fix the problem on older udev implementations that don't
provide udevadm but instead have udevtrigger as a standalone program.
In this case the above options aren't available so there's no way to
call call udevtrigger safely. But we can live with that since this
issue only exists in optional test and helper scripts, and most
zfs-on-linux users are running newer systems anyways.
This commit fixes issue on
https://github.com/behlendorf/zfs/issues/#issue/172
Changes:
- update BLKZNAME to use _IOR instead of _IO. Kernel 2.6.32 allows
read parameters (copy_to_user) with _IO, while newer kernels (tested
Archlinux's 2.6.37 kernel) enforces _IOR (which is correct)
- fix return code and message on error
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Added insert_inode_locked() helper function, prior to this most callers
used insert_inode_hash(). The older method doesn't check for collisions
in the inode_hashtable but it still acceptible for use. Fallback to
using insert_inode_hash() when insert_inode_locked() is unavailable.
Compiling with 'LDFLAGS=-Wl,--as-needed' exposed the fact that
there were some library linking problems introduced by mount_zfs.
In particular, the libzfs library does use nvpair symbols, and
mount_zfs contains no dependencies on libzpool.
Closes#161Closes#162
New versions glibc declare getcwd() with the warn_unused_result attribute.
This results in a warning because the updated mount helper was not
checking this return value. This issue was fixed by checking the return
type and in the case of an error simply returning the passed dataset.
One possible, but unlikely, error would be having your cwd directory
unlinked while the mount command was running.
cmd/mount_zfs/mount_zfs.c: In function ‘parse_dataset’:
cmd/mount_zfs/mount_zfs.c:223:2: error: ignoring return value of
‘getcwd’, declared with attribute warn_unused_result
To support automatically mounting your zfs on filesystem on boot
a basic init script is needed. Unfortunately, every distribution
has their own idea of the _right_ way to do things. Rather than
write one very complicated portable init script, which would be
invariably replaced by the distributions own anyway. I have
instead added support to provide multiple distribution specific
init scripts.
The correct init script for your distribution will be selected
by ZFS_AC_DEFAULT_PACKAGE which will set DEFAULT_INIT_SCRIPT.
During 'make install' the correct script for your system will
be installed from zfs/etc/init.d/zfs.DEFAULT_INIT_SCRIPT to the
usual /etc/init.d/zfs location.
Currently, there is zfs.fedora and a more generic zfs.lsb init
script. Hopefully, the distribution maintainers who know best
how they want their init scripts to function will feedback their
approved versions to be included in the project.
This change does not consider upstart jobs but I'm not at all
opposed to add that sort of thing.
When updating /etc/mtab we should be careful and strip certain
options. In particular, we need to strip 'zfsutil' because if
we don't the mount utility will helpfull provide it to the
mount helper when we issue mount(8) again. This subverts the
check that the caller is zfs(8) and not mount(8).
Allow the mount(8) utility to always operate on all datasets when
remounting them read-only. This critical for rc.sysinit/umountroot
which remounts the root filesystem read-only during shutdown to
ensure everything is correctly flushed to disk.
Fix minor typo, the check to set zfsutil should use the bitwise
'&'. I must have accidentally hit the adjacent '*' and obviously
neither the compiler or my code review caught this. Fix it now.
When run with a root '/' cwd the mount.zfs helper would strip not
only the '/' but also the next character from the dataset name.
For example, '/tank' was changed to 'ank' instead of just 'tank'.
Originally, this was done for the '/tmp' cwd case where we needed
to strip the '/' following the cwd. For example '/tmp/tank' needed
to remove the '/tmp' cwd plus 1 character for the '/'.
This change fixes the problem by checking the cwd and if it ends in
a '/' it does not strip and extra character. Otherwise it will strip
the next character. I believe this should only ever be true for the
root directory.
Closes#148
Several issues related to strange mount/umount behavior were reported
and this commit should address most of them. The original idea was
to put in place a zfs mount helper (mount.zfs). This helper is used
to enforce 'legacy' mount behavior, and perform any extra mount argument
processing (selinux, zfsutil, etc). This helper wasn't ready for the
0.6.0-rc1 release but with this change it's functional but needs to
extensively tested.
This change addresses the following open issues.
Closes#101Closes#107Closes#113Closes#115Closes#119
With the removal of the minimal xvattr support from the spl this
support needs to be replaced in the zfs package. This is fairly
easily accomplished by directly adding portions of the sys/vnode.h
header from OpenSolaris. These xvattr additions have been placed
in the sys/xvattr.h header file and included as needed where simply
a sys/vnode.h was included before.
In additon to the xvattr types and helper macros two functions
were also included. The xva_init() and xva_getxoptattr() functions
were included as static inline functions in xvattr.h. They are
simple enough and it was simpler to place them here rather than
in their own .c file.
This commit allows zvols with names longer than 32 characters, which
fixes issue on https://github.com/behlendorf/zfs/issues/#issue/102.
Changes include:
- use /dev/zd* device names for zvol, where * is the device minor
(include/sys/fs/zfs.h, module/zfs/zvol.c).
- add BLKZNAME ioctl to get dataset name from userland
(include/sys/fs/zfs.h, module/zfs/zvol.c, cmd/zvol_id).
- add udev rule to create /dev/zvol/[dataset_name] and the legacy
/dev/[dataset_name] symlink. For partitions on zvol, it will create
/dev/zvol/[dataset_name]-part* (etc/udev/rules.d/60-zvol.rules,
cmd/zvol_id).
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The open_bdev_exclusive() function has been replaced (again) by the
more generic blkdev_get_by_path() function. Additionally, the
counterpart function close_bdev_exclusive() has been replaced by
blkdev_put(). Because these functions are more generic versions
of the functions they replaced the compatibility macro must add
the FMODE_EXCL mask to ensure they are exclusive.
Closes#114
The new prefered inteface for evicting an inode from the inode cache
is the ->evict_inode() callback. It replaces both the ->delete_inode()
and ->clear_inode() callbacks which were previously used for this.
The fsync() callback in the file_operations structure used to take
3 arguments. The callback now only takes 2 arguments because the
dentry argument was determined to be unused by all consumers. To
handle this a compatibility prototype was added to ensure the right
prototype is used. Our implementation never used the dentry argument
either so it's just a matter of using the right prototype.
The const keyword was added to the 'struct xattr_handler' in the
generic Linux super_block structure. To handle this we define an
appropriate xattr_handler_t typedef which can be used. This was
the preferred solution because it keeps the code clean and readable.
It turns out that older versions of the glibc headers do not
properly define MS_DIRSYNC despite it being explicitly mentioned
in the man pages. They instead call it S_WRITE, so for system
where this is not correct defined map MS_DIRSYNC to S_WRITE.
At the time of this commit both Ubuntu Lucid, and Debian Squeeze
both use the out of date glibc headers.
As for MS_REC this field is also not available in the older headers.
Since there is no obvious mapping in this case we simply disable
the recursive mount option which used it.
The inclusion on dlsym(), dlopen(), and dlclose() symbols require
us to link against the dl library. Be careful to add the flag to
both the libzfs library and the commands which depend on the library.
ZFS even under Solaris does not strictly require libshare to be
available. The current implementation attempts to dlopen() the
library to access the needed symbols. If this fails libshare
support is simply disabled.
This means that on Linux we only need the most minimal libshare
implementation. In fact just enough to prevent the build from
failing. Longer term we can decide if we want to implement a
libshare library like Solaris. At best this would be an abstraction
layer between ZFS and NFS/SMB. Alternately, we can drop libshare
entirely and directly integrate ZFS with Linux's NFS/SMB.
Finally the bare bones user-libshare.m4 test was dropped. If we
do decide to implement libshare at some point it will surely be
as part of this package so the check is not needed.
By design the zfs utility is supposed to handle mounting and unmounting
a zfs filesystem. We could allow zfs to do this directly. There are
system calls available to mount/umount a filesystem. And there are
library calls available to manipulate /etc/mtab. But there are a
couple very good reasons not to take this appraoch... for now.
Instead of directly calling the system and library calls to (u)mount
the filesystem we fork and exec a (u)mount process. The principle
reason for this is to delegate the responsibility for locking and
updating /etc/mtab to (u)mount(8). This ensures maximum portability
and ensures the right locking scheme for your version of (u)mount
will be used. If we didn't do this we would have to resort to an
autoconf test to determine what locking mechanism is used.
The downside to using mount(8) instead of mount(2) is that we lose
the exact errno which was returned by the kernel. The return code
from mount(8) provides some insight in to what went wrong but it
not quite as good. For the moment this is translated as a best
guess in to a errno for the higher layers of zfs.
In the long term a shared library called libmount is under development
which provides a common API to address the locking and errno issues.
Once the standard mount utility has been updated to use this library
we can then leverage it. Until then this is the only safe solution.
http://www.kernel.org/pub/linux/utils/util-linux/libmount-docs/index.html
For the moment, the only advantage in registering a umount helper
would be to automatically unshare a zfs filesystem. Since under
Linux this would be unexpected (but nice) behavior there is no
harm in disabling it.
This is desirable because the 'zfs unmount' path invokes the system
umount. This is done to ensure correct mtab locking but has the
side effect that the umount.zfs helper would be called if it exists.
By default this helper calls back in to zfs to do the unmount on
Solaris which we don't want under Linux.
Once libmount is available and we have a safe way to correctly
lock and update the /etc/mtab file we can reconsider the need
for a umount helper. Using libmount is the prefered solution.
While not strictly required to mount a zfs filesystem using a
mount helper has certain advantages.
First, we need it if we want to honor the mount behavior as found
on Solaris. As part of the mount we need to validate that the
dataset has the legacy mount property set if we are using 'mount'
instead of 'zfs mount'.
Secondly, by using a mount helper we can automatically load the
zpl kernel module. This way you can just issue a 'mount' or
'zfs mount' and it will just work.
Finally, it gives us common hook in user space to add any zfs
specific mount options we might want. At the moment we don't
have any but now the infrastructure is at least in place.
If libselinux is detected on your system at configure time link
against it. This allows us to use a library call to detect if
selinux is enabled and if it is to pass the mount option:
"context=\"system_u:object_r:file_t:s0"
For now this is required because none of the existing selinux
policies are aware of the zfs filesystem type. Because of this
they do not properly enable xattr based labeling even though
zfs supports all of the required hooks.
Until distro's add zfs as a known xattr friendly fs type we
must use mntpoint labeling. Alternately, end users could modify
their existing selinux policy with a little guidance.
These compiler warnings were introduced when code which was
previously #ifdef'ed out by HAVE_ZPL was re-added for use
by the posix layer. All of the following changes should be
obviously correct and will cause no semantic changes.
Specifically, this fixes the two following errors in zdb when a pool
is composed of block devices:
1) 'Value too large for defined data type' when running 'zdb <dataset>'.
2) 'character device required' when running 'zdb -l <block-device>'.
Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Support for rolling back datasets require a functional ZPL, which we currently
do not have. The zfs command does not check for ZPL support before attempting
a rollback, and in preparation for rolling back a zvol it removes the minor
node of the device. To prevent the zvol device node from disappearing after a
failed rollback operation, this change wraps the zfs_do_rollback() function in
an #ifdef HAVE_ZPL and returns ENOSYS in the absence of a ZPL. This is
consistent with the behavior of other ZPL dependent commands such as mount.
The orginal error message observed with this bug was rather confusing:
internal error: Unknown error 524
Aborted
This was because zfs_ioc_rollback() returns ENOTSUP if we don't HAVE_ZPL, but
Linux actually has no such error code. It should instead return EOPNOTSUPP, as
that is how ENOTSUP is defined in user space. With that we would have gotten
the somewhat more helpful message
cannot rollback 'tank/fish': unsupported version
This is rather a moot point with the above changes since we will no longer make
that ioctl call without a ZPL. But, this change updates the error code just in
case.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Creating whole-disk vdevs can intermittently fail if a udev-managed symlink to
the disk partition is already in place. To avoid this, we now remove any such
symlink before partitioning the disk. This makes zpool_label_disk_wait() truly
wait for the new link to show up instead of returning if it finds an old link
still in place. Otherwise there is a window between when udev deletes and
recreates the link during which access attempts will fail with ENOENT.
Also, clean up a comment about waiting for udev to create symlinks. It no
longer needs to describe the special cases for the link names, since that is
now handled in a separate helper function.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Portability between Solaris and Linux isn't really an issue for us anymore, and
removing sections like this one helps simplify the code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This change adds two helper functions for working with vdev names and paths.
zfs_resolve_shortname() resolves a shorthand vdev name to an absolute path
of a file in /dev, /dev/disk/by-id, /dev/disk/by-label, /dev/disk/by-path,
/dev/disk/by-uuid, /dev/disk/zpool. This was previously done only in the
function is_shorthand_path(), but we need a general helper function to
implement shorthand names for additional zpool subcommands like remove.
is_shorthand_path() is accordingly updated to call the helper function.
There is a minor change in the way zfs_resolve_shortname() tests if a file
exists. is_shorthand_path() effectively used open() and stat64() to test for
file existence, since its scope includes testing if a device is a whole disk
and collecting file status information. zfs_resolve_shortname(), on the other
hand, only uses access() to test for existence and leaves it to the caller to
perform any additional file operations. This seemed like the most general and
lightweight approach, and still preserves the semantics of is_shorthand_path().
zfs_append_partition() appends a partition suffix to a device path. This
should be used to generate the name of a whole disk as it is stored in the vdev
label. The user-visible names of whole disks do not contain the partition
information, while the name in the vdev label does. The code was lifted from
the function make_disks(), which now just calls the helper function. Again,
having a helper function to do this supports general handling of shorthand
names in the user interface.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
ZFS works best when it is notified as soon as possible when a device
failure occurs. This allows it to immediately start any recovery
actions which may be needed. In theory Linux supports a flag which
can be set on bio's called FAILFAST which provides this quick
notification by disabling the retry logic in the lower scsi layers.
That's the theory at least. In practice is turns out that while the
flag exists you oddly have to set it with the BIO_RW_AHEAD flag.
And even when it's set it you may get retries in the low level
drivers decides that's the right behavior, or if you don't get the
right error codes reported to the scsi midlayer.
Unfortunately, without additional kernels patchs there's not much
which can be done to improve this. Basically, this just means that
it may take 2-3 minutes before a ZFS is notified properly that a
device has failed. This can be improved and I suspect I'll be
submitting patches upstream to handle this.
To make the 'zpool events' output simple to parse with awk the extra
newline after embedded nvlists has been dropped. This allows the
entire event to be parsed as a single whitespace seperated record.
The -H option has been added to operate in scripted mode. For the
'zpool events' command this means don't print the header. The usage
of -H is consistent with scripted mode for other zpool commands.
The string array 'char dirs[5][8]' was too small to accomodate the terminating
NUL character in "by-label". This change adds the needed additional byte.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
By default the zpool_layout command would always use the slot
number assigned by Linux when generating the zdev.conf file.
This is a reasonable default there are cases when it makes
sense to remap the slot id assigned by Linux using your own
custom mapping.
This commit adds support to zpool_layout to provide a custom
slot mapping file. The file contains in the first column the
Linux slot it and in the second column the custom slot mapping.
By passing this map file with '-m map' to zpool_config the
mapping will be applied when generating zdev.conf.
Additionally, two sample mapping have been added which reflect
different ways to map the slots in the dragon drawers.
One of the neat tricks an autoconf style project is capable of
is allow configurion/building in a directory other than the
source directory. The major advantage to this is that you can
build the project various different ways while making changes
in a single source tree.
For example, this project is designed to work on various different
Linux distributions each of which work slightly differently. This
means that changes need to verified on each of those supported
distributions perferably before the change is committed to the
public git repo.
Using nfs and custom build directories makes this much easier.
I now have a single source tree in nfs mounted on several different
systems each running a supported distribution. When I make a
change to the source base I suspect may break things I can
concurrently build from the same source on all the systems each
in their own subdirectory.
wget -c http://github.com/downloads/behlendorf/zfs/zfs-x.y.z.tar.gz
tar -xzf zfs-x.y.z.tar.gz
cd zfs-x-y-z
------------------------- run concurrently ----------------------
<ubuntu system> <fedora system> <debian system> <rhel6 system>
mkdir ubuntu mkdir fedora mkdir debian mkdir rhel6
cd ubuntu cd fedora cd debian cd rhel6
../configure ../configure ../configure ../configure
make make make make
make check make check make check make check
This change also moves many of the include headers from individual
incude/sys directories under the modules directory in to a single
top level include directory. This has the advantage of making
the build rules cleaner and logically it makes a bit more sense.
Add the initial products from autogen.sh. These products will
be updated incrementally after this point as development occurs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Minor changes to ztest for this environment. These including
updating ztest to run in the local development tree, as well
as relocating some local variables in this function to the heap.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This topic branch contains required changes to the user space
utilities to allow them to integrate cleanly with Linux.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This topic branch contains all the changes needed to integrate the user
side zfs tools with Linux style devices. Primarily this includes fixing
up the Solaris libefi library to be Linux friendly, and integrating with
the libblkid library which is provided by e2fsprogs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This topic branch leverages the Solaris style FMA call points
in ZFS to create a user space visible event notification system
under Linux. This new system is called zevent and it unifies
all previous Solaris style ereports and sysevent notifications.
Under this Linux specific scheme when a sysevent or ereport event
occurs an nvlist describing the event is created which looks almost
exactly like a Solaris ereport. These events are queued up in the
kernel when they occur and conditionally logged to the console.
It is then up to a user space application to consume the events
and do whatever it likes with them.
To make this possible the existing /dev/zfs ABI has been extended
with two new ioctls which behave as follows.
* ZFS_IOC_EVENTS_NEXT
Get the next pending event. The kernel will keep track of the last
event consumed by the file descriptor and provide the next one if
available. If no new events are available the ioctl() will block
waiting for the next event. This ioctl may also be called in a
non-blocking mode by setting zc.zc_guid = ZEVENT_NONBLOCK. In the
non-blocking case if no events are available ENOENT will be returned.
It is possible that ESHUTDOWN will be returned if the ioctl() is
called while module unloading is in progress. And finally ENOMEM
may occur if the provided nvlist buffer is not large enough to
contain the entire event.
* ZFS_IOC_EVENTS_CLEAR
Clear are events queued by the kernel. The kernel will keep a fairly
large number of recent events queued, use this ioctl to clear the
in kernel list. This will effect all user space processes consuming
events.
The zpool command has been extended to use this events ABI with the
'events' subcommand. You may run 'zpool events -v' to output a
verbose log of all recent events. This is very similar to the
Solaris 'fmdump -ev' command with the key difference being it also
includes what would be considered sysevents under Solaris. You
may also run in follow mode with the '-f' option. To clear the
in kernel event queue use the '-c' option.
$ sudo cmd/zpool/zpool events -fv
TIME CLASS
May 13 2010 16:31:15.777711000 ereport.fs.zfs.config.sync
class = "ereport.fs.zfs.config.sync"
ena = 0x40982b7897700001
detector = (embedded nvlist)
version = 0x0
scheme = "zfs"
pool = 0xed976600de75dfa6
(end detector)
time = 0x4bec8bc3 0x2e5aed98
pool = "zpios"
pool_guid = 0xed976600de75dfa6
pool_context = 0x0
While the 'zpool events' command is handy for interactive debugging
it is not expected to be the primary consumer of zevents. This ABI
was primarily added to facilitate the addition of a user space
monitoring daemon. This daemon would consume all events posted by
the kernel and based on the type of event perform an action. For
most events simply forwarding them on to syslog is likely enough.
But this interface also cleanly allows for more sophisticated
actions to be taken such as generating an email for a failed drive.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Add autoconf style build infrastructure to the ZFS tree. This
includes autogen.sh, configure.ac, m4 macros, some scripts/*,
and makefiles for all the core ZFS components.
While ztest does run in user space we run it with the same stack
restrictions it would have in kernel space. This ensures that any
stack related issues which would be hit in the kernel can be caught
and debugged in user space instead.
This patch is a first pass to limit the stack usage of every ztest
function to 1024 bytes. Subsequent updates can further reduce this.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This is a portability change which removes the dependence of the Solaris
thread library. All locations where Solaris thread API was used before
have been replaced with equivilant Solaris kernel style thread calls.
In user space the kernel style threading API is implemented in term of
the portable pthreads library. This includes all threads, mutexs,
condition variables, reader/writer locks, and taskqs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Remove deadcode. It's possible the code should be in use
somewhere, but as the source code is laid out it currently
is not.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The upstream commit cb code had a few bugs:
1) The arguments of the list_move_tail() call in txg_dispatch_callbacks()
were reversed by mistake. This caused the commit callbacks to not be
called at all.
2) ztest had a bug in ztest_dmu_commit_callbacks() where "error" was not
initialized correctly. This seems to have caused the test to always take
the simulated error code path, which made ztest unable to detect whether
commit cbs were being called for transactions that successfuly complete.
3) ztest had another bug in ztest_dmu_commit_callbacks() where the commit
cb threshold was not being compared correctly.
4) The commit cb taskq was using 'max_ncpus * 2' as the maxalloc argument
of taskq_create(), which could have caused unnecessary delays in the txg
sync thread.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Resolve issues uncovered by -D_FORTIFY_SOURCE=2, the default redhat
macro's file adds this option to the cflags. This causes warnings
of the following type designed to keep the developer honest:
warning: ignoring return value of 'foo', declared
with attribute warn_unused_result
The short term fix is to wrap these calls in VERIFY() to check the
return code. The code was already assusing these would never fail.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fix non-c90 compliant code, for the most part these changes
simply deal with where a particular variable is declared.
Under c90 it must alway be done at the very start of a block.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>