Commit Graph

277 Commits

Author SHA1 Message Date
Ralf Ertzinger
8b921f667a Introduce zpool_get_prop_literal interface
This change introduces zpool_get_prop_literal. It's an expanded version
of zpool_get_prop taking one additional boolean parameter. With this
parameter set to B_FALSE it will behave identically to zpool_get_prop.
Setting it to B_TRUE will return full precision numbers for the
following properties:

ZPOOL_PROP_SIZE
ZPOOL_PROP_ALLOCATED
ZPOOL_PROP_FREE
ZPOOL_PROP_FREEING
ZPOOL_PROP_EXPANDSZ
ZPOOL_PROP_ASHIFT

Also introduced is a wrapper function for zpool_get_prop making it
use zpool_get_prop_literal in the background.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1813
2013-10-28 14:27:53 -07:00
Brian Behlendorf
e0b0ca983d Add visibility in to cached dbufs
Currently there is no mechanism to inspect which dbufs are being
cached by the system.  There are some coarse counters in arcstats
by they only give a rough idea of what's being cached.  This patch
aims to improve the current situation by adding a new dbufs kstat.

When read this new kstat will walk all cached dbufs linked in to
the dbuf_hash.  For each dbuf it will dump detailed information
about the buffer.  It will also dump additional information about
the referenced arc buffer and its related dnode.  This provides a
more complete view in to exactly what is being cached.

With this generic infrastructure in place utilities can be written
to post-process the data to understand exactly how the caching is
working.  For example, the data could be processed to show a list
of all cached dnodes and how much space they're consuming.  Or a
similar list could be generated based on dnode type.  Many other
ways to interpret the data exist based on what kinds of questions
you're trying to answer.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
2013-10-25 13:59:40 -07:00
Prakash Surya
1421c89142 Add visibility in to arc_read
This change is an attempt to add visibility into the arc_read calls
occurring on a system, in real time. To do this, a list was added to the
in memory SPA data structure for a pool, with each element on the list
corresponding to a call to arc_read. These entries are then exported
through the kstat interface, which can then be interpreted in userspace.

For each arc_read call, the following information is exported:

 * A unique identifier (uint64_t)
 * The time the entry was added to the list (hrtime_t)
   (*not* wall clock time; relative to the other entries on the list)
 * The objset ID (uint64_t)
 * The object number (uint64_t)
 * The indirection level (uint64_t)
 * The block ID (uint64_t)
 * The name of the function originating the arc_read call (char[24])
 * The arc_flags from the arc_read call (uint32_t)
 * The PID of the reading thread (pid_t)
 * The command or name of thread originating read (char[16])

From this exported information one can see, in real time, exactly what
is being read, what function is generating the read, and whether or not
the read was found to be already cached.

There is still some work to be done, but this should serve as a good
starting point.

Specifically, dbuf_read's are not accounted for in the currently
exported information. Thus, a follow up patch should probably be added
to export these calls that never call into arc_read (they only hit the
dbuf hash table). In addition, it might be nice to create a utility
similar to "arcstat.py" to digest the exported information and display
it in a more readable format. Or perhaps, log the information and allow
for it to be "replayed" at a later time.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-10-25 13:57:25 -07:00
Brian Behlendorf
76463d4026 Revert "Add txgs-<pool> kstat file"
This reverts commit e95853a331.
2013-10-25 13:57:25 -07:00
Brian Behlendorf
11cb9d773f Increase default udev wait time
When creating a new pool, or adding/replacing a disk in an existing
pool, partition tables will be automatically created on the devices.
Under normal circumstances it will take less than a second for udev
to create the expected device files under /dev/.  However, it has
been observed that if the system is doing heavy IO concurrently udev
may take far longer.  If you also throw in some cheap dodgy hardware
it may take even longer.

To prevent zpool commands from failing due to this the default wait
time for udev is being increased to 30 seconds.  This will have no
impact on normal usage, the increase timeout should only be noticed
if your udev rules are incorrectly configured.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1646
2013-10-22 10:25:51 -07:00
Richard Yao
a6ce1eae54 Fix libzfs_core changes to follow GNU libtool guidelines
The GNU libtool documentation states to start with a version of 0:0:0,
rather than 1:1:0. Illumos uses the name libzfs_core.so.1, so to be
consistent, we should go with 1:0:0.

http://www.gnu.org/software/libtool/manual/libtool.html#Updating-version-info

The GNU libtool documentation also provides guidence on how the version
information should be incremented. Doing this does a SONAME bump of the
libzfs and libzpool libraries. This is particularly important on Gentoo
because a SONAME bump enables portage to retain the older libraries
until any packages that link to them are rebuilt. The main example of
this is GRUB2's grub2-mkconfig, which will break unless it is rebuilt
against the new libraries.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1751
2013-10-10 16:56:51 -07:00
Richard Yao
31fc19399e Generate libraries with correct DT_NEEDED entries
Libraries that depend on other libraries should list them in ELF's
DT_NEEDED field so that programs linking to them do not need to specify
those libraries unless they depend on them as well. This is not the case
in the current code and the consequence is that anything that needs a
library must know its dependencies. This is fragile and caused GRUB2's
configure script to break when a dependency was added on libblkid in
libzfs.

This resolves that problem by using LIBADD/LDADD to specify libraries in
Makefile.am instead of LDFLAGS. This ensures that proper DT_NEEDED
entries are generated and prevents GRUB2's configure script from
breaking in the presence of a libblkid dependency. This also removes
unneeded dependencies from various files.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1751
2013-10-10 16:56:51 -07:00
Richard Yao
1db7b9be75 Fix libblkid support
libblkid support is dormant because the autotools check is broken and
liblkid identifies ZFS vdevs as "zfs_member", not "zfs". We fix that
with a few changes:

First, we fix the libblkid autotools check to do a few things:

1. Make a 64MB file, which is the minimum size ZFS permits.
2. Make 4 fake uberblock entries to make libblkid's check succeed.
3. Return 0 upon success to make autotools use the success case.
4. Include stdlib.h to avoid implicit declration of free().
5. Check for "zfs_member", not "zfs"
6. Make --with-blkid disable autotools check (avoids Gentoo sandbox violation)
7. Pass '-lblkid' correctly using LIBS not LDFLAGS.

Second, we change the libblkid support to scan for "zfs_member", not
"zfs".

This makes --with-blkid work on Gentoo.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1751
2013-10-10 16:56:51 -07:00
Matthew Ahrens
13fe019870 Illumos #3464
3464 zfs synctask code needs restructuring
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/3464
  illumos/illumos-gate@3b2aab1880

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1495
2013-09-04 16:01:24 -07:00
Matthew Ahrens
6f1ffb0665 Illumos #2882, #2883, #2900
2882 implement libzfs_core
2883 changing "canmount" property to "on" should not always remount dataset
2900 "zfs snapshot" should be able to create multiple, arbitrary snapshots at once

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Chris Siden <christopher.siden@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Dan Kruchinin <dan.kruchinin@gmail.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>

References:
  https://www.illumos.org/issues/2882
  https://www.illumos.org/issues/2883
  https://www.illumos.org/issues/2900
  illumos/illumos-gate@4445fffbbb

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1293

Porting notes:

WARNING: This patch changes the user/kernel ABI.  That means that
the zfs/zpool utilities built from master are NOT compatible with
the 0.6.2 kernel modules.  Ensure you load the matching kernel
modules from master after updating the utilities.  Otherwise the
zfs/zpool commands will be unable to interact with your pool and
you will see errors similar to the following:

  $ zpool list
  failed to read pool configuration: bad address
  no pools available

  $ zfs list
  no datasets available

Add zvol minor device creation to the new zfs_snapshot_nvl function.

Remove the logging of the "release" operation in
dsl_dataset_user_release_sync().  The logging caused a null dereference
because ds->ds_dir is zeroed in dsl_dataset_destroy_sync() and the
logging functions try to get the ds name via the dsl_dataset_name()
function. I've got no idea why this particular code would have worked
in Illumos.  This code has subsequently been completely reworked in
Illumos commit 3b2aab1 (3464 zfs synctask code needs restructuring).

Squash some "may be used uninitialized" warning/erorrs.

Fix some printf format warnings for %lld and %llu.

Apply a few spa_writeable() changes that were made to Illumos in
illumos/illumos-gate.git@cd1c8b8 as part of the 3112, 3113, 3114 and
3115 fixes.

Add a missing call to fnvlist_free(nvl) in log_internal() that was added
in Illumos to fix issue 3085 but couldn't be ported to ZoL at the time
(zfsonlinux/zfs@9e11c73) because it depended on future work.
2013-09-04 15:49:00 -07:00
Turbo Fredriksson
abbfdca483 No point in rewind() mtab in zfs_unshare_proto(). We're not really
reading the file, but instead use libzfs_mnttab_find() which does
the nessesary freopen() for us.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1498
2013-08-15 10:18:31 -07:00
John Layman
fb5c53ea65 Fix for re-reading /etc/mtab in zfs_is_mounted()
When /etc/mtab is updated on Linux it's done atomically with
rename(2).  A new mtab is written, the existing mtab is unlinked,
and the new mtab is renamed to /etc/mtab.  This means that we
must close the old file and open the new file to get the updated
contents.  Using rewind(3) will just move the file pointer back
to the start of the file, freopen(3) will close and open the file.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1611
2013-08-14 11:37:06 -07:00
Yuri Pankov
105afebb15 Illumos #3098 zfs userspace/groupspace fail
3098 zfs userspace/groupspace fail without saying why when run as non-root
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/3098
  illumos/illumos-gate@70f56fa693

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1596
2013-08-14 09:28:34 -07:00
Brian Behlendorf
ff3510c1a5 Fix zpool_read_label()
The zpool_read_label() function was subtly broken due to a
difference of behavior in fstat64(2) on Solaris vs Linux.

Under Solaris when a block device is stat'ed the st_size
field will contain the size of the device in bytes.  Under
Linux this is only true for regular file and symlinks.  A
compatibility function called fstat64_blk(2) was added
which can be used when the Solaris behavior is required.

This flaw was never noticed because the only time we would
need to use the device size is when the first two labels
are damaged.  I noticed this issue while adding the
zpool_clear_label() function which is similar in design
and does require us to write all the labels.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-07-09 16:02:04 -07:00
Dmitry Khasanov
131cc95ca7 Add FreeBSD 'zpool labelclear' command
The FreeBSD implementation of zfs adds the 'zpool labelclear'
command.  Since this functionality is helpful and straight
forward to add it is being included in ZoL.

References:
  freebsd/freebsd@119a041dc9

Ported-by: Dmitry Khasanov <pik4ez@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1126
2013-07-09 15:58:05 -07:00
Dmitry Khasanov
51a3ae72d2 Readd zpool_clear_label() from OpenSolaris
This patch restores the zpool_clear_label() function from
OpenSolaris.  This was removed by commit d603ed6 because
it wasn't clear we had a use for it in ZoL.  However, this
functionality is a prerequisite for adding the 'zpool labelclear'
command from FreeBSD.

As part of bringing this change in the zpool_clear_label()
function was changed to use fstat64_blk(2) for compatibility
with Linux.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1126
2013-07-09 15:42:27 -07:00
Aaron Fineman
bbb75c1190 Add error message for missing /etc/mtab
The zpool command should not silently fail when the /etc/mtab
file does not exist.  This can occur in an initramfs environment
when the /etc/mtab file hasn't yet been generated.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1541
2013-06-27 14:43:37 -07:00
Madhav Suresh
c99c90015e Illumos #3006
3006 VERIFY[S,U,P] and ASSERT[S,U,P] frequently check if first
     argument is zero

Reviewed by Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by George Wilson <george.wilson@delphix.com>
Approved by Eric Schrock <eric.schrock@delphix.com>

References:
  illumos/illumos-gate@fb09f5aad4
  https://illumos.org/issues/3006

Requires:
  zfsonlinux/spl@1c6d149feb

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1509
2013-06-19 15:14:10 -07:00
Ned Bass
da29fe63f0 Don't leak mount flags into kernel
When calling mount(), care must be taken to avoid passing in flags
that are used only by the user space utilities.  Otherwise we may
stomp on flags that are reserved for other purposes in the kernel.

In particular, openSUSE 12.3 kernels have added a new MS_RICHACL
super-block flag whose value conflicts with our MS_COMMENT flag. This
causes incorrect behavior such as the umask being ignored.  The
MS_COMMENT flag essentially serves as a placeholder in the option_map
data structure of zfs_mount.c, but its value is never used. Therefore
we can avoid the conflict by defining it to 0.

The MS_USERS, MS_OWNER, and MS_GROUP flags also conflict with reserved
flags in the kernel. While this is not known to have caused any
problems, it is nevertheless incorrect.  For the purposes of the
mount.zfs helper, the "users", "owner", and "group" options just serve
as hints to set additional implied options.  Therefore we now define
their associated mount flags in terms of the options that they imply
rather than giving them unique values.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1457
2013-06-18 15:30:08 -07:00
Brian Behlendorf
044baf009a Use taskq for dump_bytes()
The vn_rdwr() function performs I/O by calling the vfs_write() or
vfs_read() functions.  These functions reside just below the system
call layer and the expectation is they have almost the entire 8k of
stack space to work with.  In fact, certain layered configurations
such as ext+lvm+md+multipath require the majority of this stack to
avoid stack overflows.

To avoid this posibility the vn_rdwr() call in dump_bytes() has been
moved to the ZIO_TYPE_FREE, taskq.  This ensures that all I/O will be
performed with the majority of the stack space available.  This ends
up being very similiar to as if the I/O were issued via sys_write()
or sys_read().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1399
Closes #1423
2013-05-06 14:05:42 -07:00
Brian Behlendorf
a4914d38a7 Silence 'old_umask' uninit variable warning
Recent changes have caused older versions of gcc to mistakenly
flag 'old_umask' in vn_open() as an unitialized variable.  To
silence the warning initialize it.

  kernel.c: In function 'vn_open':
  kernel.c:525:6: error: 'old_umask' may be used uninitialized
  in this function

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-05-01 17:05:58 -07:00
George.Wilson
cc92e9d0c3 3246 ZFS I/O deadman thread
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

NOTES: This patch has been reworked from the original in the
following ways to accomidate Linux ZFS implementation

*) Usage of the cyclic interface was replaced by the delayed taskq
   interface.  This avoids the need to implement new compatibility
   code and allows us to rely on the existing taskq implementation.

*) An extern for zfs_txg_synctime_ms was added to sys/dsl_pool.h
   because declaring externs in source files as was done in the
   original patch is just plain wrong.

*) Instead of panicing the system when the deadman triggers a
   zevent describing the blocked vdev and the first pending I/O
   is posted.  If the panic behavior is desired Linux provides
   other generic methods to panic the system when threads are
   observed to hang.

*) For reference, to delay zios by 30 seconds for testing you can
   use zinject as follows: 'zinject -d <vdev> -D30 <pool>'

References:
  illumos/illumos-gate@283b84606b
  https://www.illumos.org/issues/3246

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1396
2013-05-01 17:05:52 -07:00
George Wilson
295304bed6 Illumos #3422, #3425
3422 zpool create/syseventd race yield non-importable pool
3425 first write to a new zvol can fail with EFBIG

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  illumos/illumos-gate@bda8819455
  https://www.illumos.org/issues/3422
  https://www.illumos.org/issues/3425

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1390
2013-04-12 09:01:36 -07:00
Jan Engelhardt
4e95cc99b0 build: resolve orthographic and other grammatical errors
Signed-off-by: Jan Engelhardt <jengelh@inai.de>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-04-02 10:44:52 -07:00
Turbo Fredriksson
be8bc8c0d3 Add smb_available() sanity checks
Do basic sanity checks in smb_available() to verify that the 'net'
command and the usershare directory exists.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1124
2013-04-02 09:27:52 -07:00
Eric Dillmann
0b4d1b5853 Add snapdev=[hidden|visible] dataset property
The new snapdev dataset property may be set to control the
visibility of zvol snapshot devices.  By default this value
is set to 'hidden' which will prevent zvol snapshots from
appearing under /dev/zvol/ and /dev/<dataset>/.  When set to
'visible' all zvol snapshots for the dataset will be visible.

This functionality was largely added because when automatic
snapshoting is enabled large numbers of read-only zvol snapshots
will be created.  When creating these devices the kernel will
attempt to read their partition tables, and blkid will attempt
to identify any filesystems on those partitions.  This leads
to a variety of issues:

1) The zvol partition tables will be read in the context of
   the `modprobe zfs` for automatically imported pools.  This
   is undesirable and should be done asynchronously, but for
   now reducing the number of visible devices helps.

2) Udev expects to be able to complete its work for a new
   block devices fairly quickly.  When many zvol devices are
   added at the same time this is no longer be true.  It can
   lead to udev timeouts and missing /dev/zvol links.

3) Simply having lots of devices in /dev/ can be aukward from
   a management standpoint.  Hidding the devices your unlikely
   to ever use helps with this.  Any snapshot device which is
   needed can be made visible by changing the snapdev property.

NOTE: This patch changes the default behavior for zvols which
      was effectively 'snapdev=visible'.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1235
Closes #945
Issue #956
Issue #756
2013-03-05 12:37:54 -08:00
Brian Behlendorf
02a946ce10 Remove unused machelf.h header
The machelf.h header is never included by anything in the zfs
build process.  It is all effectively dead code which can be
safely removed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1265
2013-02-05 15:34:50 -08:00
Richard Yao
399f60c8b4 Fix function relocations in libzpool
binutils 2.23.1 fails in situations that generate function relocations
on PowerPC and possibly other architectures. This causes linking of
libzpool to fail because it depends on libnvpair. We add a dependency on
libnvpair to lib/libzpool/Makefile.am to correct that.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1267
2013-02-05 15:33:32 -08:00
Brian Behlendorf
dbf763b39b Retire zpool_id infrastructure
In the interest of maintaining only one udev helper to give vdevs
user friendly names, the zpool_id and zpool_layout infrastructure
is being retired.  They are superseded by vdev_id which incorporates
all the previous functionality.

Documentation for the new vdev_id(8) helper and its configuration
file, vdev_id.conf(5), can be found in their respective man pages.
Several useful example files are installed under /etc/zfs/.

  /etc/zfs/vdev_id.conf.alias.example
  /etc/zfs/vdev_id.conf.multipath.example
  /etc/zfs/vdev_id.conf.sas_direct.example
  /etc/zfs/vdev_id.conf.sas_switch.example

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #981
2013-01-29 12:23:17 -08:00
Brian Behlendorf
79c6e4c445 Remove NPTL_GUARD_WITHIN_STACK
Commit 4b2f65b253 increased the user
space stack by 4x to resolve certain stack overflows.  As such it
no longer makes sense to worry about a single extra page which
might or might not be part of the process stack.  There is now
ample headroom for normal usage.

By eliminating this configure check we are also resolving the
following segfault which intentionally occurs at configure time
and may be logged in dmesg.

  conftest[22156]: segfault at 7fbf18a47e48 ip 00000000004007fe
  sp 00007fbf18a4be50 error 6 in conftest[400000+1000]

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-29 10:58:20 -08:00
Eric Dillmann
9759c60f1a Illumos #3035 LZ4 compression support in ZFS and GRUB
3035 LZ4 compression support in ZFS and GRUB

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Christopher Siden <csiden@delphix.com>

References:
  illumos/illumos-gate@a6f561b4ae
  https://www.illumos.org/issues/3035
  http://wiki.illumos.org/display/illumos/LZ4+Compression+In+ZFS

This patch has been slightly modified from the upstream Illumos
version to be compatible with Linux.  Due to the very limited
stack space in the kernel a lz4 workspace kmem cache is used.
Since we are using gcc we are also able to take advantage of the
gcc optimized __builtin_ctz functions.

Support for GRUB has been dropped from this patch.  That code
is available but those changes will need to made to the upstream
GRUB package.

Lastly, several hunks of dead code were dropped for clarity.  They
include the functions real_LZ4_uncompress(), LZ4_compressBound()
and the Visual Studio specific hunks wrapped in _MSC_VER.

Ported-by: Eric Dillmann <eric@jave.fr>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1217
2013-01-29 09:28:20 -08:00
Brian Behlendorf
14ee71efbc Use strerror() not strerror_r()
The differ() function used strerror_r() instead of strerror() because
it allowed the error message to be directly copied in to a buffer.
This causes two issues under Linux.

* There are two versions of strerror_r() available an XSI-compliant
  version which returns an 'int' error code.  And a GNU-specific
  version which return a 'char *' to the resulting error string.

    int strerror_r(int errnum, char *buf, size_t buflen);   /* XSI */
    char *strerror_r(int errnum, char *buf, size_t buflen); /* GNU */

* The most recent versions of strerror_r() are annotated with the
  warn_unused_result attribute.  This causes the following warning
  since the upstream implementation casts the result to void.

    warning: ignoring return value of 'strerror_r', declared with
    attribute warn_unused_result [-Wunused-result]

The cleanest way to resolve both of these problems is just to use
strerror() and make a copy of the result in to the buffer.  This
resolves both issues and this is the only instance of strerror_r()
in the code base.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1231
2013-01-28 10:02:38 -08:00
Darik Horn
38145d6129 Ensure that zfs diff prints unicode safely.
In the stream_bytes() library function used by `zfs diff`, explicitly
cast each byte in the input string to an unsigned character so that the
Linux fprintf() correctly escapes to octal and does not mangle the output.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1172
2013-01-16 10:15:57 -08:00
Garrett D'Amore
844793c3cc Illumos #1557 assertion failed in userland taskq_destroy()
1557 assertion failed in userland taskq_destroy()

Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Approved by: Eric Schrock <eric.schrock@delphix.com>

References:
  illumos/illumos-gate@aa846ad9bc
  illumos changeset: 13597:3eac1e8e0f4c
  https://www.illumos.org/issues/1557

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-11 09:17:06 -08:00
Christopher Siden
b9b24bb4ca Illumos #2762: zpool command should have better support for feature flags
2762 zpool command should have better support for feature flags
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>

References:
  illumos/illumos-gate@57221772c3
  https://www.illumos.org/issues/2762

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-08 10:35:43 -08:00
George Wilson
3bc7e0fb0f Illumos #3090 and #3102
3090 vdev_reopen() during reguid causes vdev to be treated as corrupt
3102 vdev_uberblock_load() and vdev_validate() may read the wrong label

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>

References:
  illumos/illumos-gate@dfbb943217
  illumos changeset: 13777:b1e53580146d
  https://www.illumos.org/issues/3090
  https://www.illumos.org/issues/3102

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #939
2013-01-08 10:35:42 -08:00
Christopher Siden
9ae529ec5d Illumos #2619 and #2747
2619 asynchronous destruction of ZFS file systems
2747 SPA versioning with zfs feature flags
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Dan Kruchinin <dan.kruchinin@gmail.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>

References:
  illumos/illumos-gate@53089ab7c8
  illumos/illumos-gate@ad135b5d64
  illumos changeset: 13700:2889e2596bd6
  https://www.illumos.org/issues/2619
  https://www.illumos.org/issues/2747

NOTE: The grub specific changes were not ported.  This change
must be made to the Linux grub packages.

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-08 10:35:35 -08:00
Massimo Maggi
5e6320cd12 Fix get/set users/groups in quota props via numeric id
Fix setting/getting users/groups in quota properties through
numeric identifier.  This support was accidentally disabled
in the original port by applying the HAVE_IDMAP wrapper macro
too broadly.

Fix obtained by moving #ifdef HAVE_IDMAP to exclude only
the part of code that really needs IDMAP.  Now zfs (get|set)
(user|group)quota@1000 works as expected.

Signed-off-by: Massimo Maggi <massimo@mmmm.it>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1147
2012-12-17 09:52:58 -08:00
Jorgen Lundman
53c2ec1d1b Fix 'zpool create' segfault due to bad syntax
Incorrect syntax should never cause a segfault.  In this case
listing multiple comma delimited options after '-o' triggered
the problem.  For example:

  zpool create -o ashift=12,listsnaps=on

This patch resolves the issue by wrapping the calls which use
hdr with a NULL test.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1118
2012-12-04 11:15:25 -08:00
Turbo Fredriksson
645fb9cc21 Implemented sharing datasets via SMB using libshare
Add the initial support for the 'smbshare' option using the
existing libshare infrastructure.  Because this implementation
relies on usershares samba version 3.0.23 is required.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #493
2012-12-03 09:42:15 -08:00
Brian Behlendorf
c372b36e3e Allow GPT+EFI vdevs for root pools
Commit 57a4edd allows the bootfs property to be set on any pool.
However, many of the zpool commands still prevent you from using
EFI labeled devices for the root pool.  For example:

    # zpool attach rpool /dev/sda /dev/sdb
    cannot label 'sdb': EFI labeled devices are not supported on
    root pools. on root devices.

For non-Solaris builds such as Linux disable this error.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1077
2012-11-30 13:45:14 -08:00
Brian Behlendorf
0e20a31b4b Recreate minors when renaming zvols
When a zvol with snapshots is renamed the device files under
/dev/zvol/ are not renamed.  This patch resolves the problem
by destroying and recreating the minors with the new name so
the links can be recreated bu udev.

Original-patch-by: Suman Chakravartula <schakrava@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #408
2012-11-19 16:59:44 -08:00
Brian Behlendorf
e95853a331 Add txgs-<pool> kstat file
Create a kstat file which contains useful statistics about the
last N txgs processed.  This can be helpful when analyzing pool
performance.  The new KSTAT_TYPE_TXG type was added for this
purpose and it tracks the following statistics per-txg.

  txg          - Unique txg number
  state        - State (O)pen/(Q)uiescing/(S)yncing/(C)ommitted
  birth;       - Creation time
  nread        - Bytes read
  nwritten;    - Bytes written
  reads        - IOPs read
  writes       - IOPs write
  open_time;   - Length in nanoseconds the txg was open
  quiesce_time - Length in nanoseconds the txg was quiescing
  sync_time;   - Length in nanoseconds the txg was syncing

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-11-02 15:45:56 -07:00
Brian Behlendorf
30b937ee15 Update spare and cache device names on import
During 'zpool import' all ZPOOL_CONFIG_PATH names are supposed
to be updated by fix_paths().  This was not happening for spare
and cache devices because the proper names were getting filtered
out of the pool_list_t->names.  Interestingly, the names were
being filtered because the spare and cache devices do not
contain the pool name in their vdev label.

The fix is to exclude the device path from the list only if:

  1) has a valid ZPOOL_CONFIG_POOL_NAME key in the label, and
  2) that pool name does not match the specified pool name.

Since the label is valid and because it does properly store the
vdev guid it will be correctly assembled without the pool name.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #725
2012-10-22 08:46:02 -07:00
Brian Behlendorf
eac4720465 Allow 'zpool replace' to use short device names
The 'zpool replace' command would fail when given a short name
because unlike on other platforms the short name cannot be
deterministically expanded to a single path.  Multiple path
prefixes must be checked and in addition the partition suffix
for whole disks is determined by the prefix.

To handle this complexity a zfs_strcmp_pathname() function was
added which takes either a short or fully qualified device name.
Short names will be expanded using the prefixes in the default
import search path, or the ZPOOL_IMPORT_PATH environment variable
if it's defined.  All posible expansions are then compared against
the comparison path.  Care is taken to strip redundant slashes to
ensure legitimate matches are not missed.

In the context of this work the existing zfs_resolve_shortname()
function was extended to consider the ZPOOL_IMPORT_PATH when set.
The zfs_append_partition() interface was also simplified to take
only a single buffer.

The vast majority of these changes rework existing Linux specific
code which was originally written to accomidate udev.  However,
there is some minimal cleanup which removes Illumos specific code.
This was done to improve readability but the basic flow and intent
of the upstream code was maintained.

These changes are the logical conclusion of the previos work to
adjust the 'zpool import' search behavior, see commit 44867b6a.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #544
Closes #976
2012-10-22 08:45:58 -07:00
Etienne Dechamps
142e6dd100 Add atomic_sub_* functions to libspl.
Both the SPL and the ZFS libspl export most of the atomic_* functions,
except atomic_sub_* functions which are only exported by the SPL, not by
libspl. This patch remedies that by implementing atomic_sub_* functions
in libspl.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1013
2012-10-17 08:56:37 -07:00
Matthew Ahrens
04434775b7 Illumos #3100: zvol rename fails with EBUSY when dirty.
illumos/illumos-gate@2e2c135528
Illumos changeset: 13780:6da32a929222

3100 zvol rename fails with EBUSY when dirty

Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: Adam H. Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Eric Schrock <eric.schrock@delphix.com>

Ported-by: Etienne Dechamps <etienne.dechamps@ovh.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #995
2012-10-03 13:59:02 -07:00
Etienne Dechamps
0aebd4f9e3 Create threads in detached state in userspace.
Currently, thread_create(), when called in userspace, creates a
joinable (i.e. not detached thread). This is the pthread default.

Unfortunately, this does not reproduce kthreads behavior (kthreads
are always detached). In addition, this contradicts the original
Solaris code which creates userspace threads in detached mode.

These joinable threads are never joined, which leads to a leakage of
pthread thread objects ("zombie threads"). This in turn results in
excessive ressource consumption, and possible ressource exhaustion in
extreme cases (e.g. long ztest runs).

This patch fixes the issue by creating userspace threads in detached
mode. The only exception is ztest worker threads which are meant to be
joinable.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #989
2012-10-03 13:32:48 -07:00
Bill Pijewski
37abac6d55 Illumos #2703: add mechanism to report ZFS send progress
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>

References:
  https://www.illumos.org/issues/2703

Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-09-19 13:39:06 -07:00
Chris Siden
1bd201e70d Illumos #1948: zpool list should show more detailed pool info
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Eric Schrock <eric.schrock@delphix.com>

References:
  https://www.illumos.org/issues/1948

Ported by:	Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #685
2012-09-19 13:39:05 -07:00
Brian Behlendorf
0a2f7b3662 Seg fault 'zpool import -d /dev/disk/by-id -a'
Introduced by commit 44867b6d6e.
We should of course check to ensure best isn't NULL before
attempting to dereference it.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #974
2012-09-18 12:33:37 -07:00
Brian Behlendorf
44867b6d6e Improve zpool import search behavior
The goal of this change is to make 'zpool import' prefer to use
the peristent /dev/mapper or /dev/disk/by-* paths.  These are far
preferable to the devices in /dev/ whos names are not persistent
and are determined by the order in which a device is detected.

This patch improves things by changing the default search path from
just to the top level /dev/ directory to (in order):

  /dev/disk/by-vdev   - Custom rules, use first if they exist
  /dev/disk/zpool     - Custom rules, use first if they exist
  /dev/mapper         - Use multipath devices before components
  /dev/disk/by-uuid   - Single unique entry and persistent
  /dev/disk/by-id     - May be multiple entries and persistent
  /dev/disk/by-path   - Encodes physical location and persistent
  /dev/disk/by-label  - Custom persistent labels
  /dev                - UNSAFE device names will change

The default search path can be overriden by setting the
ZPOOL_IMPORT_PATH environment variable.  This must be a colon
delimited list of paths which are searched for vdevs.  If the
'zpool import -d' option is specified only those listed paths
will be searched.

Finally, when multiple paths to the same device are found.  If one
of the paths is an exact match for the path used last time to import
the pool it will be used.  When there are no exact matches the
prefered path will be determined by the provided search order.

This means you can still import a pool and force specific names by
providing the -d <path> option.  And the prefered names will persist
as long as those paths exist on your system.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #965
2012-09-17 13:49:07 -07:00
Cyril Plisko
27ccd4147b Avoid running exportfs on each zfs/zpool command invocation
Delay executing exportfs command until its results are actually
required.

Signed-off-by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Gunnar Beutner <gunnar@beutner.name>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-09-11 10:21:49 -07:00
Etienne Dechamps
4b2f65b253 Increase the stack space in userspace.
In 1e33ac1e26, the maximum stack size for
userspace tools was set to 8k to mimic the available kernel stack size.

Unfortunately, due to differences in how the stack is used in userspace
vs kernel space, spurious stack overflows could occur in userspace
tools due to the limited stack size. This is especially true in ztest
when debugging is enabled.

This patch multiplies the userspace stack size by 4, which fixes the
stack overflow issues. This comes at the price of not being able to
catch stack size issues in userspace, but the previous solution proved
unreliable anyway.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fixes #934.
2012-09-06 11:59:59 -07:00
Michael Martin
fc24f7c887 Fix missing vdev names in zpool status output
Commit 858219c makes more sense down below in the 'if (verbose)'
section of the code.  Initially, buf and path will never point
to the same location.  Once 'path = buf' is set on a raidz vdev,
the code may drop into the verbose section depending on the
verbose flag.  In here, using a tmpbuf makes sense since now
'buf == path'.

This issue does not occur in the upstream Solaris code because
their implementations of snprintf() allow for buf and path to
be the same address.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #57
2012-09-05 22:09:12 -07:00
Brian Behlendorf
ca8b5af89d Remove autotools products
Remove all of the generated autotools products from the repository
and update the .gitignore files accordingly.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #718
2012-08-27 11:47:44 -07:00
Garrett D'Amore
08b1b21d58 Illumos #2803: zfs get guid pretty-prints the output
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: Alexander Eremin <alexander.eremin@nexenta.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  https://www.illumos.org/issues/2803

Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-23 10:40:14 -07:00
Christopher Siden
e956d65106 Illumos #1796, #2871, #2903, #2957
1796 "ZFS HOLD" should not be used when doing "ZFS SEND" from a read-only pool
2871 support for __ZFS_POOL_RESTRICT used by ZFS test suite
2903 zfs destroy -d does not work
2957 zfs destroy -R/r sometimes fails when removing defer-destroyed snapshot
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>

References:
  https://www.illumos.org/issues/1796
  https://www.illumos.org/issues/2871
  https://www.illumos.org/issues/2903
  https://www.illumos.org/issues/2957

Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-23 10:40:02 -07:00
Eric Schrock
db49968e5c Illumos #2635: 'zfs rename -f' to perform force unmount
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <George.Wilson@delphix.com>
Reviewed by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/2635

Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #717
2012-08-23 10:39:43 -07:00
Martin Matuska
cf997d797b Properly initialize and free destroydata
This regression was accidentally introduced by commit
330d06f90d due to ZoL
specific code.  The fix is to simply ensure the passed
nvlist is initialized and freed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #876
2012-08-23 09:42:21 -07:00
Dan McDonald
d96eb2b153 Illumos #1693: persistent 'comment' field for a zpool
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/1693

Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #678
2012-08-08 11:49:37 -07:00
Etienne Dechamps
ee5fd0bb80 Set zvol discard_granularity to the volblocksize.
Currently, zvols have a discard granularity set to 0, which suggests to
the upper layer that discard requests of arbirarily small size and
alignment can be made efficiently.

In practice however, ZFS does not handle unaligned discard requests
efficiently: indeed, it is unable to free a part of a block. It will
write zeros to the specified range instead, which is both useless and
inefficient (see dnode_free_range).

With this patch, zvol block devices expose volblocksize as their discard
granularity, so the upper layer is aware that it's not supposed to send
discard requests smaller than volblocksize.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #862
2012-08-07 14:55:31 -07:00
Matthew Ahrens
330d06f90d Illumos #1644, #1645, #1646, #1647, #1708
1644 add ZFS "clones" property
1645 add ZFS "written" and "written@..." properties
1646 "zfs send" should estimate size of stream
1647 "zfs destroy" should determine space reclaimed by
     destroying multiple snapshots
1708 adjust size of zpool history data

References:
  https://www.illumos.org/issues/1644
  https://www.illumos.org/issues/1645
  https://www.illumos.org/issues/1646
  https://www.illumos.org/issues/1647
  https://www.illumos.org/issues/1708

This commit modifies the user to kernel space ioctl ABI.  Extra
care should be taken when updating to ensure both the kernel
modules and utilities are updated.  This change has reordered
all of the new ioctl()s to the end of the list.  This should
help minimize this issue in the future.

Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Albert Lee <trisk@opensolaris.org>
Approved by: Garrett D'Amore <garret@nexenta.com>

Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #826
Closes #664
2012-07-31 09:25:30 -07:00
Etienne Dechamps
f09398cec6 Use /sys/module instead of /proc/modules.
When libzfs checks if the module is loaded or not, it currently reads
/proc/modules and searches for a line matching the module name.

Unfortunately, if the module is included in the kernel itself (built-in
module), then /proc/modules won't list it, so libzfs will wrongly conclude
that the module is not loaded, thus making all ZFS userspace tools unusable.

Fortunately, all loaded modules appear as directories in /sys/module, even
built-in ones. Thus we can use /sys/module in lieu of /proc/modules to fix
the issue.

As a bonus, the code for checking becomes much simpler.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #851
2012-07-26 13:45:33 -07:00
Richard Yao
739a1a82e0 Linux 3.5 compat, end_writeback() changed to clear_inode()
The end_writeback() function was changed by moving the call to
inode_sync_wait() earlier in to evict().   This effecitvely changes
the ordering of the sync but it does not impact the details of
the zfs implementation.

However, as part of this change end_writeback() was renamed to
clear_inode() to reflect the new semantics.  This change does
impact us and clear_inode() now maps to end_writeback() for
kernels prior to 3.5.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #784
2012-07-23 12:29:36 -07:00
Richard Yao
ea1fdf46e2 Linux 3.5 compat, iops->truncate_range() removed
The vmtruncate_range() support has been removed from the kernel in
favor of using the fallocate method in the file_operations table.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #784
2012-07-23 12:29:32 -07:00
Richard Yao
756c3e5a9c Linux 3.5 compat, eops->encode_fh() takes inodes
The export_operations member ->encode_fh() has been updated to
take both the child and parent inodes.  This interface used to
take the child dentry and a bool describing if the parent is needed.

NOTE: While updating this code I noticed that we do not currently
cleanly handle the case where we're passed a connectable parent.
This code should be audited to make sure we're doing the right thing.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #784
2012-07-23 12:29:23 -07:00
Etienne Dechamps
b5a28807cd Move partition scanning from userspace to module.
Currently, zpool online -e (dynamic vdev expansion) doesn't work on
whole disks because we're invoking ioctl(BLKRRPART) from userspace
while ZFS still has a partition open on the disk, which results in
EBUSY.

This patch moves the BLKRRPART invocation from the zpool utility to the
module. Specifically, this is done just before opening the device in
vdev_disk_open() which is called inside vdev_reopen(). This requires
jumping through some hoops to get to the disk device from the partition
device, and to make sure we can still open the partition after the
BLKRRPART call.

Note that this new code path is triggered on dynamic vdev expansion
only; other actions, like creating a new pool, are unchanged and still
call BLKRRPART from userspace.

This change also depends on API changes which are available in 2.6.37
and latter kernels.  The build system has been updated to detect this,
but there is no compatibility mode for older kernels.  This means that
online expansion will NOT be available in older kernels.  However, it
will still be possible to expand the vdev offline.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #808
2012-07-17 09:17:31 -07:00
Brian Behlendorf
7535251dcf Add PowerPC to supported VTOCs
This code was was inherited from Solaris which was careful to define
the expected VTOC for various supported architectures.  While this
check may have made sense there it's something we should be able to
safely drop under Linux.

However, I'm not quite ready to do that yet.  So for the moment I'm
just doing the very safe thing of adding PowerPC as a supported type.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-07-12 11:52:34 -07:00
Etienne Dechamps
cee43a7477 Fix efi_use_whole_disk() when efi_nparts == 128.
Commit e5dc681a changed EFI_NUMPAR from 9 to 128. This means that the
on-disk EFI label has efi_nparts = 128 instead of 9. The index of the
reserved partition, however, is still 8. This breaks
efi_use_whole_disk(), which uses efi_nparts-1 as the index of the
reserved partition.

This commit fixes efi_use_whole_disk() when the index of the reserved
partition is not efi_nparts-1. It rewrites the algorithm and makes it
more robust by using the order of the partitions instead of their
numbering. It assumes that the last non-empty partition is the reserved
partition, and that the non-empty partition before that is the data
partition.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #808
2012-07-12 08:59:22 -07:00
Etienne Dechamps
7608bd0dd0 Use the right device path when relabeling.
Currently, zpool_vdev_online() calls zpool_relabel_disk() with a short
partition device name, which is obviously wrong because (1)
zpool_relabel_disk() expects a full, absolute path to use with open()
and (2) efi_write() must be called on an opened disk device, not a
partition device.

With this patch, zpool_relabel_disk() gets called with a full disk
device path. The path is determined using the same algorithm as
zpool_find_vdev().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #808
2012-07-12 08:59:16 -07:00
Etienne Dechamps
8adf486422 Fix error handling for "zpool online -e".
The error handling code around zpool_relabel_disk() is either inexistent
or wrong. The function call itself is not checked, and
zpool_relabel_disk() is generating error messages from an unitialized
buffer.

Before:

    # zpool online -e homez sdb; echo $?
    `: cannot relabel 'sdb1': unable to open device: 2
    0

After:

    # zpool online -e homez sdb; echo $?
    cannot expand sdb: cannot relabel 'sdb1': unable to open device: 2
    1

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #808
2012-07-12 08:58:19 -07:00
George Wilson
c7f2d69de3 Illumos #1949, #1953
1949 crash during reguid causes stale config
1953 allow and unallow missing from zpool history since removal of pyzfs

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Garrett D'Amore <garrett.damore@gmail.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Steve Gonczi <gonczi@comcast.net>
Approved by: Eric Schrock <eric.schrock@delphix.com>

References:
  https://www.illumos.org/issues/1949
  https://www.illumos.org/issues/1953

Ported by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #665
2012-07-11 13:33:31 -07:00
Garrett D'Amore
3541dc6d02 Illumos #1748: desire support for reguid in zfs
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Alexander Eremin <alexander.eremin@nexenta.com>
Reviewed by: Alexander Stetsenko <ams@nexenta.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/1748

This commit modifies the user to kernel space ioctl ABI.  Extra
care should be taken when updating to ensure both the kernel
modules and utilities are updated.  If only the user space
component is updated both the 'zpool events' command and the
'zpool reguid' command will not work until the kernel modules
are updated.

Ported by:     Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #665
2012-07-11 13:08:56 -07:00
Pawel Jakub Dawidek
0cee24064a Speed up 'zfs list -t snapshot -o name -s name'
FreeBSD #xxx:  Dramatically optimize listing snapshots when user
requests only snapshot names and wants to sort them by name, ie.
when executes:

  # zfs list -t snapshot -o name -s name

Because only name is needed we don't have to read all snapshot
properties.

Below you can find how long does it take to list 34509 snapshots
from a single disk pool before and after this change with cold and
warm cache:

    before:

        # time zfs list -t snapshot -o name -s name > /dev/null
        cold cache: 525s
        warm cache: 218s

    after:

        # time zfs list -t snapshot -o name -s name > /dev/null
        cold cache: 1.7s
        warm cache: 1.1s

NOTE: This patch only appears in FreeBSD.  If/when Illumos picks up
the change we may want to drop this patch and adopt their version.
However, for now this addresses a real issue.

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #450
2012-06-14 09:49:04 -07:00
Richard Yao
bc98d6c809 Make zvol_remove_link() print a more useful error message
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-06-13 16:27:19 -07:00
Daniel Verite
c6327b63e6 Retry removal of busy minors
When failing to remove a zvol device link because it's busy, wait
a bit and retry in a loop instead of giving up immediately.  This
technique is similar to the loop in zpool_label_disk_wait(), with
the same goal: waiting for the asynchronous udev processes to finish
their work.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #692
2012-06-11 10:50:20 -07:00
Richard Yao
6a0936babc Linux 3.4 compat, d_make_root() replaces d_alloc_root()
torvalds/linux@adc0e91ab1 introduced
introduced d_make_root() as a replacement for d_alloc_root(). Further
commits appear to have removed d_alloc_root() from the Linux source
tree. This causes the following failure:

  error: implicit declaration of function 'd_alloc_root'
  [-Werror=implicit-function-declaration]

To correct this we update the code to use the current d_make_root()
interface for readability.  Then we introduce an autotools check
to determine if d_make_root() is available.  If it isn't then we
define some compatibility logic which used the older d_alloc_root()
interface.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #776
2012-06-11 10:04:49 -07:00
Brian Behlendorf
abe5b8fb66 Improve 'zpool import' EBUSY error message
When a device is already open O_EXCL by another process the
`zpool import` will correctly fail.  However, the default failure
message isn't very helpful.  It may in fact be harmful if you
take its advise and destroy your pool.

    cannot import 'tank': pool is busy
            Destroy and re-create the pool from
            a backup source.

Improve the error message in the EBUSY case to simply print a
message indicating that the devices are current in use.  The user
will need to manually identify which process has the device open
exclusively and why.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-06-01 08:55:24 -07:00
Brian Behlendorf
b04c9fc009 Add /dev/mapper/ to search path
When creating pools short device names may be used when those
devices appear in certain well known locations under /dev/.
This change adds /dev/mapper/ to that list.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-06-01 08:55:24 -07:00
Ned A. Bass
821b683436 Add vdev_id for JBOD-friendly udev aliases
vdev_id parses the file /etc/zfs/vdev_id.conf to map a physical path
in a storage topology to a channel name.  The channel name is combined
with a disk enclosure slot number to create an alias that reflects the
physical location of the drive.  This is particularly helpful when it
comes to tasks like replacing failed drives.  Slot numbers may also be
re-mapped in case the default numbering is unsatisfactory.  The drive
aliases will be created as symbolic links in /dev/disk/by-vdev.

The only currently supported topologies are sas_direct and sas_switch:

o  sas_direct - a channel is uniquely identified by a PCI slot and a
   HBA port

o  sas_switch - a channel is uniquely identified by a SAS switch port

A multipath mode is supported in which dm-mpath devices are handled by
examining the first running component disk, as reported by 'multipath
-l'.  In multipath mode the configuration file should contain a
channel definition with the same name for each path to a given
enclosure.

vdev_id can replace the existing zpool_id script on systems where the
storage topology conforms to sas_direct or sas_switch.  The script
could be extended to support other topologies as well.  The advantage
of vdev_id is that it is driven by a single static input file that can
be shared across multiple nodes having a common storage toplogy.
zpool_id, on the other hand, requires a unique /etc/zfs/zdev.conf per
node and a separate slot-mapping file.  However, zpool_id provides the
flexibility of using any device names that show up in
/dev/disk/by-path, so it may still be needed on some systems.

vdev_id's functionality subsumes that of the sas_switch_id script, and
it is unlikely that anyone is using it, so sas_switch_id is removed.

Finally, /dev/disk/by-vdev is added to the list of directories that
'zpool import' will scan.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #713
2012-06-01 08:55:14 -07:00
Jorgen Lundman
c421831192 Define the needed ISA types for ARM
Add the minimum required ISA types to support the ARM architecture.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-05-03 11:18:54 -07:00
Brian Behlendorf
b39d3b9f7b Linux 3.3 compat, iops->create()/mkdir()/mknod()
The mode argument of iops->create()/mkdir()/mknod() was changed from
an 'int' to a 'umode_t'.  To prevent a compiler warning an autoconf
check was added to detect the API change and then correctly set a
zpl_umode_t typedef.  There is no functional change.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #701
2012-04-30 12:52:38 -07:00
Richard Laager
109491a897 Improve error message consistency
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-04-11 10:43:17 -07:00
Brian Behlendorf
1c5de20ae2 Add --enable-debug-dmu-tx configure option
Allow rigorous (and expensive) tx validation to be enabled/disabled
indepentantly from the standard zfs debugging.  When enabled these
checks ensure that all txs are constructed properly and that a dbuf
is never dirtied without taking the correct tx hold.

This checking is particularly helpful when adding new dmu consumers
like Lustre.  However, for established consumers such as the zpl
with no known outstanding tx construction problems this is just
overhead.

--enable-debug-dmu-tx  - Enable/disable validation of each tx as
--disable-debug-dmu-tx   it is constructed.  By default validation
                         is disabled due to performance concerns.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-03-23 12:25:17 -07:00
Brian Behlendorf
ebe7e575ea Add .zfs control directory
Add support for the .zfs control directory.  This was accomplished
by leveraging as much of the existing ZFS infrastructure as posible
and updating it for Linux as required.  The bulk of the core
functionality is now all there with the following limitations.

*) The .zfs/snapshot directory automount support requires a 2.6.37
   or newer kernel.  The exception is RHEL6.2 which has backported
   the d_automount patches.

*) Creating/destroying/renaming snapshots with mkdir/rmdir/mv
   in the .zfs/snapshot directory works as expected.  However,
   this functionality is only available to root until zfs
   delegations are finished.

      * mkdir - create a snapshot
      * rmdir - destroy a snapshot
      * mv    - rename a snapshot

The following issues are known defeciences, but we expect them to
be addressed by future commits.

*) Add automount support for kernels older the 2.6.37.  This should
   be possible using follow_link() which is what Linux did before.

*) Accessing the .zfs/snapshot directory via NFS is not yet possible.
   The majority of the ground work for this is complete.  However,
   finishing this work will require resolving some lingering
   integration issues with the Linux NFS kernel server.

*) The .zfs/shares directory exists but no futher smb functionality
   has yet been implemented.

Contributions-by: Rohan Puri <rohan.puri15@gmail.com>
Contributiobs-by: Andrew Barnes <barnes333@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #173
2012-03-22 13:03:47 -07:00
Ned Bass
613d88eda8 Align parition end on 1 MiB boundary
Some devices have exhibited sensitivity to the ending alignment of
partitions.  In particular, even if the first partition begins at 1
MiB, we have seen many sd driver task abort errors with certain SSDs
if the first partition doesn't end on a 1 MiB boundary.  This occurs
when the vdev label is read during pool creation or importation and
causes a delay of about 30 seconds per device.  It can also be
simulated with dd when the pool isn't imported:

  dd if=/dev/sda1 of=/dev/null bs=262144 count=1

For the record, this problem was observed with SMARTMOD
SG9XCA2E200GE01 200GB SSDs.  Unfortunately I don't have a good
explanation for this behavior. It seems to have something to do with
highly fragmented single-sector requests being issued to the device,
which it may not support.  With end-aligned partitions at least
page-sized requests were queued and issued to the driver according
to blktrace. In any case, aligning the partition end is a fairly
innocuous work-around, wasting at most 1 MiB of space.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #574
2012-03-05 09:49:50 -08:00
Brian Behlendorf
4b787d75c8 Cleanly support debug packages
Allow a source rpm to be rebuilt with debugging enabled.  This
avoids the need to have to manually modify the spec file.  By
default debugging is still largely disabled.  To enable specific
debugging features use the following options with rpmbuild.

  '--with debug'               - Enables ASSERTs

  # For example:
  $ rpmbuild --rebuild --with debug zfs-modules-0.6.0-rc6.src.rpm

Additionally, ZFS_CONFIG has been added to zfs_config.h for
packages which build against these headers.  This is critical
to ensure both zfs and the dependant package are using the same
prototype and structure definitions.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-27 14:08:17 -08:00
Richard Yao
b41c9906dc Support ashift=13 for 8KB SSD block sizes
New SSDs are now available which use an internal 8k block size.
To make sure ZFS can get the maximum performance out of these
devices we're increasing the maximum ashift to 13 (8KB).

This value is still small enough that we can fit 16 uberblocks
in the vdev ring label.  However, I don't want to increase this
any futher or it will limit the ability the safely roll back a
pool to recover it.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #565
2012-02-13 12:25:27 -08:00
Turbo Fredriksson
d2e032ca9c Add 'fsid' mount option to allowed options.
Resolves nfs-utils-1.0.x compatibility issue which requires
that the fsid be set in the export options.

  exportfs: Warning: /tank/dir requires fsid= for NFS export

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #570
2012-02-13 09:43:57 -08:00
Etienne Dechamps
30930fba21 Add support for DISCARD to ZVOLs.
DISCARD (REQ_DISCARD, BLKDISCARD) is useful for thin provisioning.
It allows ZVOL clients to discard (unmap, trim) block ranges from
a ZVOL, thus optimizing disk space usage by allowing a ZVOL to
shrink instead of just grow.

We can't use zfs_space() or zfs_freesp() here, since these functions
only work on regular files, not volumes. Fortunately we can use the
low-level function dmu_free_long_range() which does exactly what we
want.

Currently the discard operation is not added to the log. That's not
a big deal since losing discard requests cannot result in data
corruption. It would however result in disk space usage higher than
it should be. Thus adding log support to zvol_discard() is probably
a good idea for a future improvement.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-09 16:19:38 -08:00
Etienne Dechamps
cb2d19010d Support the fallocate() file operation.
Currently only the (FALLOC_FL_PUNCH_HOLE) flag combination is
supported, since it's the only one that matches the behavior of
zfs_space(). This makes it pretty much useless in its current
form, but it's a start.

To support other flag combinations we would need to modify
zfs_space() to make it more flexible, or emulate the desired
functionality in zpl_fallocate().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #334
2012-02-09 16:19:32 -08:00
Etienne Dechamps
34037afe24 Improve ZVOL queue behavior.
The Linux block device queue subsystem exposes a number of configurable
settings described in Linux block/blk-settings.c. The defaults for these
settings are tuned for hard drives, and are not optimized for ZVOLs. Proper
configuration of these options would allow upper layers (I/O scheduler) to
take better decisions about write merging and ordering.

Detailed rationale:

 - max_hw_sectors is set to unlimited (UINT_MAX). zvol_write() is able to
   handle writes of any size, so there's no reason to impose a limit. Let the
   upper layer decide.

 - max_segments and max_segment_size are set to unlimited. zvol_write() will
   copy the requests' contents into a dbuf anyway, so the number and size of
   the segments are irrelevant. Let the upper layer decide.

 - physical_block_size and io_opt are set to the ZVOL's block size. This
   has the potential to somewhat alleviate issue #361 for ZVOLs, by warning
   the upper layers that writes smaller than the volume's block size will be
   slow.

 - The NONROT flag is set to indicate this isn't a rotational device.
   Although the backing zpool might be composed of rotational devices, the
   resulting ZVOL often doesn't exhibit the same behavior due to the COW
   mechanisms used by ZFS. Setting this flag will prevent upper layers from
   making useless decisions (such as reordering writes) based on incorrect
   assumptions about the behavior of the ZVOL.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-07 16:23:06 -08:00
Etienne Dechamps
b18019d2d8 Fix synchronicity for ZVOLs.
zvol_write() assumes that the write request must be written to stable storage
if rq_is_sync() is true. Unfortunately, this assumption is incorrect. Indeed,
"sync" does *not* mean what we think it means in the context of the Linux
block layer. This is well explained in linux/fs.h:

    WRITE:       A normal async write. Device will be plugged.
    WRITE_SYNC:  Synchronous write. Identical to WRITE, but passes down
                 the hint that someone will be waiting on this IO
                 shortly.
    WRITE_FLUSH: Like WRITE_SYNC but with preceding cache flush.
    WRITE_FUA:   Like WRITE_SYNC but data is guaranteed to be on
                 non-volatile media on completion.

In other words, SYNC does not *mean* that the write must be on stable storage
on completion. It just means that someone is waiting on us to complete the
write request. Thus triggering a ZIL commit for each SYNC write request on a
ZVOL is unnecessary and harmful for performance. To make matters worse, ZVOL
users have no way to express that they actually want data to be written to
stable storage, which means the ZIL is broken for ZVOLs.

The request for stable storage is expressed by the FUA flag, so we must
commit the ZIL after the write if the FUA flag is set. In addition, we must
commit the ZIL before the write if the FLUSH flag is set.

Also, we must inform the block layer that we actually support FLUSH and FUA.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-07 16:23:06 -08:00
Darik Horn
e67329d8e0 Let libnvpair be linked independently of libzfs.
Autoconf will fail to detect the ZoL libnvpair on systems that do not
implicitly link library runtime dependencies, which is anything that
has the GCC 4.5 DCO update.

Build libuutil before libnvpair, and put it on the the LDADD line of
the libnvpair automake template.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #560
2012-02-07 11:37:15 -08:00
Brian Behlendorf
47621f3d76 Linux 3.3 compat, sops->show_options()
The second argument of sops->show_options() was changed from a
'struct vfsmount *' to a 'struct dentry *'.  Add an autoconf check
to detect the API change and then conditionally define the expected
interface.  In either case we are only interested in the zfs_sb_t.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #549
2012-02-03 10:02:01 -08:00
Brian Behlendorf
d7e398ce1a Cleanup ZFS debug infrastructure
Historically the internal zfs debug infrastructure has been
scattered throughout the code.  Since we expect to start making
more use of this code this patch performs some cleanup.

* Consolidate the zfs debug infrastructure in the zfs_debug.[ch]
  files.  This includes moving the zfs_flags and zfs_recover
  variables, plus moving the zfs_panic_recover() function.

* Remove the existing unused functionality in zfs_debug.c and
  replace it with code which correctly utilized the spl logging
  infrastructure.

* Remove the __dprintf() function from zfs_ioctl.c.  This is
  dead code, the dprintf() functionality in the kernel relies
  on the spl log support.

* Remove dprintf() from hdr_recl().  This wasn't particularly
  useful and was missing the required format specifier anyway.

* Subsequent patches should unify the dprintf() and zfs_dbgmsg()
  functions.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-02 11:24:30 -08:00
Prakash Surya
ff998d804f Ignore dataset if the dds_type is DMU_OST_OTHER
Since the zpios and potentially other ZFS tests use the
DMU_OST_OTHER type to label their datasets, the zpool and
zfs commands should gracefully handle this type when it is
encountered.  This patch modifies the commands' behavior
to ignore any datasets with a dds_type of DMU_OST_OTHER.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #536
2012-01-19 09:29:48 -08:00
Darik Horn
f783130a1f Allow GPT+EFI vdev replacement in boot pools.
Commit zfsonlinux/zfs@57a4eddc4d
allows the bootfs property to be set on any pool, but does not
accommodate subsequent vdev changes. For example:

	# zpool replace rpool /dev/sda /dev/sdb
	operation not supported on this type of pool
	property 'bootfs' is not supported on EFI labeled devices

For non-Solaris builds, disable the check that emits this error.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-01-18 11:05:24 -08:00
Darik Horn
750562833f Combine libraries: spl, avl, efi, share, unicode.
These libraries, which are an artifact of the ZoL development
process, conflict with packages that are already in distribution:

  * libspl: SPL Programming Language
  * libavl: AVL for Linux
  * libefi: GRUB

And these libraries are potential conflicts:

  * libshare: the Linux Mount Manager
  * libunicode: Perl and Python

Recompose these five ZoL components into the four libraries that are
conventionally provided by Solaris and FreeBSD systems:

  + libnvpair
  + libuutil
  + libzpool
  + libzfs

This change resolves the name conflict, makes ZoL more compatible
with existing software that uses autotools to detect ZFS, and allows
pkg-zfs to better reflect the official Debian kFreeBSD packaging.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #430
2012-01-17 15:19:50 -08:00