Commit Graph

181 Commits

Author SHA1 Message Date
George Wilson
a05dfd0028 Illumos 5147 - zpool list -v should show individual disk capacity
The 'zpool list -v' command displays lots of info but excludes the
capacity of each disk. This should be added.

5147 zpool list -v should show individual disk capacity
Reviewed by: Adam Leventhal <adam.leventhal@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Matthew Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5147
  https://github.com/illumos/illumos-gate/commit/7a09f97

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2688
2014-09-23 10:10:42 -07:00
Dan Swartzendruber
287be44f53 Improve handling of filesystem versions
Change mount code to diagnose filesystem versions that
are not supported by the current implementation.

Change upgrade code to do likewise and refuse to upgrade
a pool if any filesystems on it are a version which is
not supported by the current implementation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Dan Swartzendruber <dswartz@druber.com>
Closes: #2616
2014-09-03 09:17:14 -07:00
Matthew Ahrens
dea377c0d9 Illumos 4970-4974 - extreme rewind enhancements
4970 need controls on i/o issued by zpool import -XF
4971 zpool import -T should accept hex values
4972 zpool import -T implies extreme rewind, and thus a scrub
4973 spa_load_retry retries the same txg
4974 spa_load_verify() reads all data twice
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/4970
  https://www.illumos.org/issues/4971
  https://www.illumos.org/issues/4972
  https://www.illumos.org/issues/4973
  https://www.illumos.org/issues/4974
  https://github.com/illumos/illumos-gate/commit/e42d205

Notes:
    This set of patches adds a set of tunable parameters for the
    "extreme rewind" mode of pool import which allows control over
    the traversal performed during such an import.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2598
2014-08-26 16:29:57 -07:00
Richard Yao
2fe5011008 Drive database update
The Intel DC S3500 and Intel DC S3700 are optimized to handle 4KB
sectors well despite of their 8KB page sizes, so we move them to a new
category for enterprise drives where they will receive ashift=12. They
are joined by the Intel 730 series, which uses the same disk controller,
as well as a San Disk enterprise drive. The drive IDs for these two were
obtained by myself with the drive_id utility. The drive ID for the 240GB
Intel 730 model was extrapolated from the drive ID for the 480GB model.

Lastly, we also add some Western Digital mobile drives.  ryuo in
\#zfsonlinux on freenode obtained "ATA     WDC WD2500BEVT-0" from
running drive_id on his own hardware. The additional drives in that
family were extrapolated from that identifer.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2601
2014-08-18 10:09:03 -07:00
George Wilson
f3a7f6610f Illumos 4976-4984 - metaslab improvements
4976 zfs should only avoid writing to a failing non-redundant top-level vdev
4978 ztest fails in get_metaslab_refcount()
4979 extend free space histogram to device and pool
4980 metaslabs should have a fragmentation metric
4981 remove fragmented ops vector from block allocator
4982 space_map object should proactively upgrade when feature is enabled
4983 need to collect metaslab information via mdb
4984 device selection should use fragmentation metric
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <adam.leventhal@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/4976
  https://www.illumos.org/issues/4978
  https://www.illumos.org/issues/4979
  https://www.illumos.org/issues/4980
  https://www.illumos.org/issues/4981
  https://www.illumos.org/issues/4982
  https://www.illumos.org/issues/4983
  https://www.illumos.org/issues/4984
  https://github.com/illumos/illumos-gate/commit/2e4c998

Notes:
    The "zdb -M" option has been re-tasked to display the new metaslab
    fragmentation metric and the new "zdb -I" option is used to control
    the maximum number of in-flight I/Os.

    The new fragmentation metric is derived from the space map histogram
    which has been rolled up to the vdev and pool level and is presented
    to the user via "zpool list".

    Add a number of module parameters related to the new metaslab weighting
    logic.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2595
2014-08-18 08:40:49 -07:00
Matthew Ahrens
9bd274ddd8 Illumos #4374
4374 dn_free_ranges should use range_tree_t

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/4374
  https://github.com/illumos/illumos-gate/commit/bf16b11

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2531
2014-07-30 09:20:35 -07:00
Matthew Ahrens
fa86b5dbb6 Illumos 4171, 4172
4171 clean up spa_feature_*() interfaces
4172 implement extensible_dataset feature for use by other zpool features

Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: Garrett D'Amore <garrett@damore.org>a

References:
  https://www.illumos.org/issues/4171
  https://www.illumos.org/issues/4172
  https://github.com/illumos/illumos-gate/commit/2acef22

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2528
2014-07-25 16:40:07 -07:00
Turbo Fredriksson
79eb71dc6c Support '-H' (scripted mode) to 'zpool get'
This functionality is already available in 'zfs get'.  Providing
it for 'zpool get' is useful and good for consistency.

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #2522
2014-07-25 11:58:36 -07:00
Turbo Fredriksson
628668a39f Add information about the -o option to zpool replace
Users need to be aware that when replacing devices in an existing
pool they may need to override automatically detected ashift value.
This will all depend on the exact hardware they are using.

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2024
2014-06-27 08:31:07 -07:00
John Albietz
5f3c101b8f Added INTEL SSD 530 Series
INTEL SSD 530 Series... SSDSC2BW24

Signed-off-by: John Albietz <inthecloud247@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2184
2014-05-19 16:57:14 -07:00
Richard Yao
1cbae971c5 Handle ZPOOL_STATUS_HOSTID_MISMATCH in zpool status
Verbatim imports can cause hostid mismatches, but things otherwise work. `zpool
status` does not handle this and will fail when assertions are enabled:

```
zpool: ../../cmd/zpool/zpool_main.c:4418: status_callback: Assertion `reason == ZPOOL_STATUS_OK' failed.

Program received signal SIGABRT, Aborted.
```

Lets instead add a case to display an informative message such as this:

```
  pool: rpool
 state: ONLINE
status: Mismatch between pool hostid and system hostid on imported pool.
        This pool was previously imported into a system with a different hostid,
        and then was verbatim imported into this system.
action: Export this pool on all systems on which it is imported.
        Then import it to correct the mismatch.
   see: http://zfsonlinux.org/msg/ZFS-8000-EY
  scan: scrub repaired 0 in 0h8m with 0 errors on Thu Apr 17 19:43:57 2014
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          sda       ONLINE       0     0     0

errors: No known data errors
```

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2342
2014-05-19 10:16:29 -07:00
Richard Yao
c6e924fea8 Fix libblkid ZFS detection when making new pools
zfsonlinux/zfs@1db7b9be75 should have
fixed this, but this particular string was overlooked.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2288
2014-05-01 13:26:33 -07:00
Brian Behlendorf
d21705eab9 Add missing DATA_TYPE_STRING_ARRAY output
This functionality has always been missing.  But until now there
were no zevents which included an array of strings so it wasn't
missed.  However, that's now changed so to ensure this information
is output correctly by 'zpool events -v' the DATA_TYPE_STRING_ARRAY
has been implemented.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Issue #2
2014-04-02 13:10:08 -07:00
Chris Dunlap
8c7aa0cfc4 Replace zpool_events_next() "block" parm w/ "flags"
zpool_events_next() can be called in blocking mode by specifying a
non-zero value for the "block" parameter.  However, the design of
the ZFS Event Daemon (zed) requires additional functionality from
zpool_events_next().  Instead of adding additional arguments to the
function, it makes more sense to use flags that can be bitwise-or'd
together.

This commit replaces the zpool_events_next() int "block" parameter with
an unsigned bitwise "flags" parameter.  It also defines ZEVENT_NONE
to specify the default behavior.  Since non-blocking mode can be
specified with the existing ZEVENT_NONBLOCK flag, the default behavior
becomes blocking mode.  This, in effect, inverts the previous use
of the "block" parameter.  Existing callers of zpool_events_next()
have been modified to check for the ZEVENT_NONBLOCK flag.

Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2
2014-03-31 16:11:21 -07:00
Brian Behlendorf
9b101a7320 Clarify zpool_events_next() comment
Due to the very poorly chosen argument name 'cleanup_fd' it was
completely unclear that this file descriptor is used to track the
current cursor location.  When the file descriptor is created by
opening ZFS_DEV a private cursor is created in the kernel for the
returned file descriptor.  Subsequent calls to zpool_events_next()
and zpool_events_seek() then require the file descriptor as an
argument to reposition the cursor.  When the file descriptor is
closed the kernel state tracking the cursor is destroyed.

This patch contains no functional change, it just changes a
few variable names and clarifies the documentation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Issue #2
2014-03-31 16:11:08 -07:00
Richard Yao
26b42f3f9d Implement -t option to zpool import for temporary pool names
Originally, users had to handle spa namespace collisions by either
exporting the already imported pool or by specifying a new name for the
pool with a conflicting name. In the case of root pools from virtual
guests, neither approach to collision resolution is reasonable. This is
addressed by extending the new name syntax with a -t option to specify
that the new name is temporary. When specified, this sets an internal
flag that is passed into the kernel to tell it that all label updates
should refer to the name used in the original label. Consequently, the
original pool name will be retained on export.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2189
2014-03-20 12:05:30 -07:00
Richard Yao
e2282ef57e Remove unused option -r from 'zpool import'
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2178
2014-03-12 09:09:52 -07:00
Richard Yao
4f2dcb3eee Add erratum for issue #2094
ZoL commit 1421c89 unintentionally changed the disk format in a forward-
compatible, but not backward compatible way. This was accomplished by
adding an entry to zbookmark_t, which is included in a couple of
on-disk structures. That lead to the creation of pools with incorrect
dsl_scan_phys_t objects that could only be imported by versions of ZoL
containing that commit.  Such pools cannot be imported by other versions
of ZFS or past versions of ZoL.

The additional field has been removed by the previous commit.  However,
affected pools must be imported and scrubbed using a version of ZoL with
this commit applied.  This will return the pools to a state in which they
may be imported by other implementations.

The 'zpool import' or 'zpool status' command can be used to determine if
a pool is impacted.  A message similar to one of the following means your
pool must be scrubbed to restore compatibility.

$ zpool import
   pool: zol-0.6.2-173
     id: 1165955789558693437
  state: ONLINE
 status: Errata #1 detected.
 action: The pool can be imported using its name or numeric identifier,
         however there is a compatibility issue which should be corrected
         by running 'zpool scrub'
    see: http://zfsonlinux.org/msg/ZFS-8000-ER
 config:
 ...

$ zpool status
  pool: zol-0.6.2-173
 state: ONLINE
  scan: pool compatibility issue detected.
   see: https://github.com/zfsonlinux/zfs/issues/2094
action: To correct the issue run 'zpool scrub'.
config:
...

If there was an async destroy in progress 'zpool import' will prevent
the pool from being imported.  Further advice on how to proceed will be
provided by the error message as follows.

$ zpool import
   pool: zol-0.6.2-173
     id: 1165955789558693437
  state: ONLINE
 status: Errata #2 detected.
 action: The pool can not be imported with this version of ZFS due to an
         active asynchronous destroy.  Revert to an earlier version and
         allow the destroy to complete before updating.
         see: http://zfsonlinux.org/msg/ZFS-8000-ER
 config:
 ...

Pools affected by the damaged dsl_scan_phys_t can be detected prior to
an upgrade by running the following command as root:

zdb -dddd poolname 1 | grep -P '^\t\tscan = ' | sed -e 's;scan = ;;' | wc -w

Note that `poolname` must be replaced with the name of the pool you wish
to check. A value of 25 indicates the dsl_scan_phys_t has been damaged.
A value of 24 indicates that the dsl_scan_phys_t is normal. A value of 0
indicates that there has never been a scrub run on the pool.

The regression caused by the change to zbookmark_t never made it into a
tagged release, Gentoo backports, Ubuntu, Debian, Fedora, or EPEL
stable respositorys.  Only those using the HEAD version directly from
Github after the 0.6.2 but before the 0.6.3 tag are affected.

This patch does have one limitation that should be mentioned.  It will not
detect errata #2 on a pool unless errata #1 is also present.  It expected
this will not be a significant problem because pools impacted by errata #2
have a high probably of being impacted by errata #1.

End users can ensure they do no hit this unlikely case by waiting for all
asynchronous destroy operations to complete before updating ZoL.  The
presence of any background destroys on any imported pools can be checked
by running `zpool get freeing` as root.  This will display a non-zero
value for any pool with an active asynchronous destroy.

Lastly, it is expected that no user data has been lost as a result of
this erratum.

Original-patch-by: Tim Chase <tim@chase2k.com>
Reworked-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2094
2014-02-21 12:10:40 -08:00
Brian Behlendorf
ffe9d38275 Add generic errata infrastructure
From time to time it may be necessary to inform the pool administrator
about an errata which impacts their pool.  These errata will by shown
to the administrator through the 'zpool status' and 'zpool import'
output as appropriate.  The errata must clearly describe the issue
detected, how the pool is impacted, and what action should be taken
to resolve the situation.  Additional information for each errata will
be provided at http://zfsonlinux.org/msg/ZFS-8000-ER.

To accomplish the above this patch adds the required infrastructure to
allow the kernel modules to notify the utilities that an errata has
been detected.  This is done through the ZPOOL_CONFIG_ERRATA uint64_t
which has been added to the pool configuration nvlist.

To add a new errata the following changes must be made:

* A new errata identifier must be assigned by adding a new enum value
  to the zpool_errata_t type.  New enums must be added to the end to
  preserve the existing ordering.

* Code must be added to detect the issue.  This does not strictly
  need to be done at pool import time but doing so will make the
  errata visible in 'zpool import' as well as 'zpool status'.  Once
  detected the spa->spa_errata member should be set to the new enum.

* If possible code should be added to clear the spa->spa_errata member
  once the errata has been resolved.

* The show_import() and status_callback() functions must be updated
  to include an informational message describing the errata.  This
  should include an action message describing what an administrator
  should do to address the errata.

* The documentation at http://zfsonlinux.org/msg/ZFS-8000-ER must be
  updated to describe the errata.  This space can be used to provide
  as much additional information as needed to fully describe the errata.
  A link to this documentation will be automatically generated in the
  output of 'zpool import' and 'zpool status'.

Original-idea-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <ryao@gentoo.or
Issue #2094
2014-02-21 12:10:40 -08:00
Brian Behlendorf
731782ec31 Use expected zpool_status_t type
Both the zpool_import_status() and zpool_get_status() functions
return the zpool_status_t enum.  This explicit type should be used
rather than the more generic int type.

This patch makes no functional change and should only be considered
code cleanup.  It happens to have been done in the context of #2094
because that's when I noticed this issue.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <ryao@gentoo.or
Issue #2094
2014-02-21 12:10:40 -08:00
Richard Yao
cbe8e6198c Properly link zpool command to libblkid
31fc19399e incorrectly removed $(LIBBLKID)
from cmd/zpool/Makefile.am. This meant that the toolchain was not given
-lblkid, which resulted in the following build failure on Ubuntu 13.10:

/usr/bin/ld: zpool_vdev.o: undefined reference to symbol
'blkid_put_cache@@BLKID_1.0'
/lib/x86_64-linux-gnu/libblkid.so.1: error adding symbols: DSO missing
from command line
collect2: error: ld returned 1 exit status

That commit reworked various Makefile.am to follow best practices, so we
reintroduce $(LIBBLKID) in a manner consistent with that, rather than
explicitly reverting the change.

Reproduction of this issue was done on a Gentoo Linux system by
executing the following commands:

zfs create -o mountpoint=/mnt/ubuntu-13.10 rpool/ROOT/ubuntu-13.10
debootstrap --variant=buildd --arch amd64 saucy /mnt/ubuntu-13.10 http://archive.ubuntu.com/ubuntu/
mount -o bind /dev /mnt/ubuntu-13.10/dev/
mount -o bind /proc/ /mnt/ubuntu-13.10/proc/
mount -o bind /sys/ /mnt/ubuntu-13.10/sys/
cp /etc/resolv.conf /mnt/ubuntu-13.10/etc/
(cd /mnt/ubuntu-13.10/root/ && git clone git://github.com/zfsonlinux/zfs.git)
chroot /mnt/ubuntu-13.10/
apt-get install git autoconf libtool zlib1g-dev uuid-dev libblkid-dev
\#apt-get install alien fakeroot vim
cd /root/zfs
./autogen.sh
./configure --with-config=user --prefix=/usr
make

That will create a Ubuntu 13.10 chroot, fetch the sources and build
test. At this point, cmd/zpool/Makefile.am was modified and the
following commands were run to verify that the build issue was resolved:

git clean -xdf
./autogen.sh
./configure --with-config=user --prefix=/usr
make

Although it is not shown here, the absence of libblkid-dev enables ZFS
to build successfully without the patch. This could explain how this
escaped detection until recently. A test without libblkid-dev was done
to verify that the patch did not cause a regression in the absence of
libblkid:

apt-get remove libblkid-dev
git clean -xdf
./autogen.sh
./configure --with-config=user --prefix=/usr
make

Additionally, the commands themselves were tested against my live system
from within the chroot to ensure basic functionality. My live system had
corresponding kernel modules already installed and basic commands such
as `zpool list` and `zfs list` worked without incident. Lastly, this
patch was also build tested on Gentoo Linux, where it caused no
problems.

At time of writing, these steps can be used to reproduce these results
on any modern Linux system that has debootstrap installed. On Gentoo,
installing debootstrap can be done with `emerge dev-util/debootstrap`.
The current ZFSOnLinux HEAD revision as of writing is
fd23720ae1.  Once this is fixed in HEAD,
either that revision or another before this fix and after
31fc19399e will be needed to reproduce
this issue.

Lastly, it remains to be seen why the toolchains on the systems
performing regression tests did not catch this. This is not a
ZFS-specific issue, but it is something that we will want to explore in
the future.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2038
2014-01-14 10:27:17 -08:00
Michael Kjorling
d1d7e2689d cstyle: Resolve C style issues
The vast majority of these changes are in Linux specific code.
They are the result of not having an automated style checker to
validate the code when it was originally written.  Others were
caused when the common code was slightly adjusted for Linux.

This patch contains no functional changes.  It only refreshes
the code to conform to style guide.

Everyone submitting patches for inclusion upstream should now
run 'make checkstyle' and resolve any warning prior to opening
a pull request.  The automated builders have been updated to
fail a build if when 'make checkstyle' detects an issue.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1821
2013-12-18 16:46:35 -08:00
Richard Yao
c8c8d1e7e5 Drive database update
Added:

Adata S396 (obtained from drive_id)
Apple MacBookAir3,1 SSD (obtained from drive_id)
Apple MacBookPro10,1 SSD (obtained from drive_id)
Intel 510 (obtained from drive_id)
Intel 710 (obtained from drive_id)
Intel DC S3500 (obtained from drive_id)
Netapp LUN (obtained from illumos user's sd.conf)
OCZ Agility 3 (obtained from drive_id)
OCZ Vertex (obtained from drive_id)
Samsung PM800 (obtained from drive_id)
Sandisk U100 (obtained from drive_id)
Sun Comstar (obtained from illumos user's sd.conf)

Notes:

1. The entries for the Intel DC S3500 were extrapolated from the 800GB
model's entry, which is "ATA     INTEL SSDSC2BB80".

2. The entires for the Intel 710 were extrapolated from the 120GG
model's entry, which is "ATA     INTEL SSDSA2BZ12".

3. The entires for the Intel 510 were extrapolated from the 250GB
model's entry, which is "ATA     INTEL SSDSC2MH25".

4. The entires for the Apple MacBookPro10,1 SSD were extrapolated from
the 512GB model's entry, which is "ATA     APPLE SSD SM512E". Google
searches suggest that this is a rebadged Samsung 830.

5. The entires for the Apple MacBookAir3,1 SSD were extrapolated from
the 128GB model's entry, which is "ATA     APPLE SSD TS128C". Google
searches suggest that this is a rebadged Kingston SSDNow V+ 100 (based
on Toshiba).

6. Sun Comstar is an iSCSI Target, so we cannot tell what the correct
sector size is through this method. We list it only for reference
purposes, but it is commented out. Similarly, it is not clear what the
right thing to do for Netapp is, so we comment it out.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1907
2013-12-02 14:07:26 -08:00
Maximilian Mehnert
539defc873 Add missing libzfs_core to Makefiles
On some platforms symbols provided by libzfs_core and used by
libzfs were not available to the linker.  To avoid this issue
libzfs_core has been added to the list of required libraries
when building utilities which depend on libzfs.  This should
have been handled properly by libtool and it's still not
entirely clear why it wasn't on all platforms.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1841
2013-11-20 15:44:15 -08:00
Will Andrews
7bc7f25040 Illumos #3745, #3811
3745 zpool create should treat -O mountpoint and -m the same
3811 zpool create -o altroot=/xyz -O mountpoint=/mnt ignores
     the mountpoint option
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>

References:
  https://www.illumos.org/issues/3745
  https://www.illumos.org/issues/3811
  illumos/illumos-gate@8b71377531

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-11-04 10:55:25 -08:00
Ralf Ertzinger
d65e738109 Add -p switch to "zpool get"
This works the same as the -p switch to "zfs get", displaying full
resolution values for appropriate attributes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1813
2013-10-28 15:40:12 -07:00
Brian Behlendorf
11cb9d773f Increase default udev wait time
When creating a new pool, or adding/replacing a disk in an existing
pool, partition tables will be automatically created on the devices.
Under normal circumstances it will take less than a second for udev
to create the expected device files under /dev/.  However, it has
been observed that if the system is doing heavy IO concurrently udev
may take far longer.  If you also throw in some cheap dodgy hardware
it may take even longer.

To prevent zpool commands from failing due to this the default wait
time for udev is being increased to 30 seconds.  This will have no
impact on normal usage, the increase timeout should only be noticed
if your udev rules are incorrectly configured.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1646
2013-10-22 10:25:51 -07:00
Tim Chase
2e2ddc30b4 Dedup-related documentation additions for zpool and zdb.
Document the "-D" and "-T" options and the optional interval
and count or "zpool status".

Also for zpool's man page, use a consistent order for the
various "-T" options to match the program's help output.

Document the effect of additional "-D" options for zdb.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1786
2013-10-22 10:08:51 -07:00
Richard Yao
31fc19399e Generate libraries with correct DT_NEEDED entries
Libraries that depend on other libraries should list them in ELF's
DT_NEEDED field so that programs linking to them do not need to specify
those libraries unless they depend on them as well. This is not the case
in the current code and the consequence is that anything that needs a
library must know its dependencies. This is fragile and caused GRUB2's
configure script to break when a dependency was added on libblkid in
libzfs.

This resolves that problem by using LIBADD/LDADD to specify libraries in
Makefile.am instead of LDFLAGS. This ensures that proper DT_NEEDED
entries are generated and prevents GRUB2's configure script from
breaking in the presence of a libblkid dependency. This also removes
unneeded dependencies from various files.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1751
2013-10-10 16:56:51 -07:00
Richard Yao
3549721c9e Update drive database
Add Corsair Force GS drive (obtained from drive_id)
Add Kingston HyperX 3K (obtained from drive_id)
Add OCZ Vertex 4 drive (obtained from drive_id)
Add Samsung SM843T enterprise drive (obtained from drive_id)
Add entries for additional sizes of Intel 320/330/335/520 series
Add Cruical C400 (obtained from Illumos user's sd.conf)
Add Toshiba SSD (obtained from Illumos user's sd.conf)
Add Samsung's first SLC SSD (obtained from drive_id)
Add OCZ Core Series (obtained from drive_id)
Add Intel DC S3700 (obtained from drive_id)

Notes:

1. The drive identifer obtained for the Samsung SM843T was MZ7WD480. The
rest were extrapolated. The additional entries were checked with Google
to verify that such drives exist in the wild.

2. The additional entries for Intel drives were extrapolated from
existing entries. The additional entries were checked with Google to
verify that such drives exist in the wild.

3. The "ATA     C400-MTFDDAC512M" and "ATA     TOSHIBA THNSNH51" entries
are from the sd.conf of gcbirzan on freenode. Additional entries were
extrapolated from them and checked with Google.

4. I obtained the Samsung MCCOE64G entry from an actual drive. The
Samsung MCCOE32G entry was extrapolated from it and checked with
Google.

5. I obtained the SSDSC2BA10 from a 100GB Intel DC S3700 drive and
extrapolated the entries for the additional models.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1752
2013-10-09 09:16:23 -07:00
Matthew Ahrens
6f1ffb0665 Illumos #2882, #2883, #2900
2882 implement libzfs_core
2883 changing "canmount" property to "on" should not always remount dataset
2900 "zfs snapshot" should be able to create multiple, arbitrary snapshots at once

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Chris Siden <christopher.siden@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Dan Kruchinin <dan.kruchinin@gmail.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>

References:
  https://www.illumos.org/issues/2882
  https://www.illumos.org/issues/2883
  https://www.illumos.org/issues/2900
  illumos/illumos-gate@4445fffbbb

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1293

Porting notes:

WARNING: This patch changes the user/kernel ABI.  That means that
the zfs/zpool utilities built from master are NOT compatible with
the 0.6.2 kernel modules.  Ensure you load the matching kernel
modules from master after updating the utilities.  Otherwise the
zfs/zpool commands will be unable to interact with your pool and
you will see errors similar to the following:

  $ zpool list
  failed to read pool configuration: bad address
  no pools available

  $ zfs list
  no datasets available

Add zvol minor device creation to the new zfs_snapshot_nvl function.

Remove the logging of the "release" operation in
dsl_dataset_user_release_sync().  The logging caused a null dereference
because ds->ds_dir is zeroed in dsl_dataset_destroy_sync() and the
logging functions try to get the ds name via the dsl_dataset_name()
function. I've got no idea why this particular code would have worked
in Illumos.  This code has subsequently been completely reworked in
Illumos commit 3b2aab1 (3464 zfs synctask code needs restructuring).

Squash some "may be used uninitialized" warning/erorrs.

Fix some printf format warnings for %lld and %llu.

Apply a few spa_writeable() changes that were made to Illumos in
illumos/illumos-gate.git@cd1c8b8 as part of the 3112, 3113, 3114 and
3115 fixes.

Add a missing call to fnvlist_free(nvl) in log_internal() that was added
in Illumos to fix issue 3085 but couldn't be ported to ZoL at the time
(zfsonlinux/zfs@9e11c73) because it depended on future work.
2013-09-04 15:49:00 -07:00
Richard Yao
bff32e0972 Implement database to workaround misreported physical sector sizes
This implements vdev_bdev_database_check(). It alters the detected
sector size of any device listed in a database of drives known to lie
about their physical sector sizes.

This is based on "6931570 Add flash devices' VID/PID to disk table to
advertising 4K physical sector size" from Open Solaris and on
sg_simple4.c from sg3_utils. About two dozen lines are taken from
sg_simple4.c, which is GPLv2 licensed. However, sg_simple4.c is
analogous to a Hello World program and is safe for us to use. We
requested that Douglas Gilbert, the author of sg_simple4.c, confirm that
this is the case. A cutdown version of his response is as follows:

```
I would consider a SCSI INQUIRY example using the Linux sg
driver interface (also written by me) as the equivalent of an
"hello world" program in C.
```

The database was created with the help of the freenode and ZFSOnLinux
communities.

Some notes:

1. The following drives both were confirmed to lie via reports in IRC
and they contain capacity information in their identifiers:

INTEL SSDSA2M080
INTEL SSDSA2M160
M4-CT256M4SSD2
WDC WD15EARS-00S
WDC WD15EARS-00Z
WDC WD20EARS-00M

The identifiers for different capacity models were extrapolated and
added under the assumption that those models also lie. Google was used
to verify that the extrapolated drive identifiers existed prior to their
inclusion.

2. The OCZ-VERTEX2 3.5 identifer applies to two drives that differ
solely in page size (and slightly in capacity). One uses 4096-byte pages
and the other uses 8192-byte pages. Both are set to use 8192-byte pages.
We could detect the page size by checking the capacity, but that would
unnecessarily complicate the code.

3. It is possible for updated drive firmware to correctly report the
sector size. There were reports of a few advanced format drives doing
that. One report stated that the vendor changed the identification
string while another was unclear on this. Both reports involved WDC
models.

4. Google was used to determine the size of pages in the listed flash
devices. Reports of 8192-byte pages took precedence over reports of
4096-byte pages.

5. Devices behind USB adapters can have their identification strings
altered. Identification strings obtained across USB adapters are
omitted and no attempt is made to correct for alterations made by USB
adapters when doing comparisons against the database. Two entries in the
Open Solaris database that appear to have been altered by a USB
adapter were omitted.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1652
2013-08-22 11:24:06 -07:00
Brian Behlendorf
cd72af9c68 Fix 'zpool list -H' error code
Due to an uninitialized variable it was possible for the command
'zpool list -H' to return a non-zero error when there are no pools.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1605
2013-07-23 12:39:05 -07:00
Christer Ekholm
da91c90154 Add missing -v to usage help for zpool list.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-07-22 13:39:01 -07:00
Dmitry Khasanov
131cc95ca7 Add FreeBSD 'zpool labelclear' command
The FreeBSD implementation of zfs adds the 'zpool labelclear'
command.  Since this functionality is helpful and straight
forward to add it is being included in ZoL.

References:
  freebsd/freebsd@119a041dc9

Ported-by: Dmitry Khasanov <pik4ez@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1126
2013-07-09 15:58:05 -07:00
Craig Loomis
50fe577d1f Explicitly flush output at end of each zevent
For "zpool events -f" flush stdout to ensure the last zevent
is always printed immediately.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1568
2013-07-08 17:01:16 -07:00
George Wilson
5853fe790d Illumos #3306, #3321
3306 zdb should be able to issue reads in parallel
3321 'zpool reopen' command should be documented in the man
     page and help

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  illumos/illumos-gate@31d7e8fa33
  https://www.illumos.org/issues/3306
  https://www.illumos.org/issues/3321

The vdev_file.c implementation in this patch diverges significantly
from the upstream version.  For consistenty with the vdev_disk.c
code the upstream version leverages the Illumos bio interfaces.
This makes sense for Illumos but not for ZoL for two reasons.

1) The vdev_disk.c code in ZoL has been rewritten to use the
   Linux block device interfaces which differ significantly
   from those in Illumos.  Therefore, updating the vdev_file.c
   to use the Illumos interfaces doesn't get you consistency
   with vdev_disk.c.

2) Using the upstream patch as is would requiring implementing
   compatibility code for those Solaris block device interfaces
   in user and kernel space.  That additional complexity could
   lead to confusion and doesn't buy us anything.

For these reasons I've opted to simply move the existing vn_rdwr()
as is in to the taskq function.  This has the advantage of being
low risk and easy to understand.  Moving the vn_rdwr() function
in to its own taskq thread also neatly avoids the possibility of
a stack overflow.

Finally, because of the additional work which is being handled by
the free taskq the number of threads has been increased.  The
thread count under Illumos defaults to 100 but was decreased to 2
in commit 08d08e due to contention.  We increase it to 8 until
the contention can be address by porting Illumos #3581.

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1354
2013-05-03 16:53:52 -07:00
Brian Behlendorf
8128bd89fb Fix hot spares
The issue with hot spares in ZoL is because it opens all leaf
vdevs exclusively (O_EXCL).  On Linux, exclusive opens cause
subsequent exclusive opens to fail with EBUSY.

This could be resolved by not opening any of the devices
exclusively, which is what Illumos does, but the additional
protection offered by exclusive opens is desirable.  It cleanly
prevents you from accidentally adding an in-use non-ZFS device
to your pool.

To fix this we very slightly relaxed the usage of O_EXCL in
the following ways.

1) Functions which open the device but only read had the
   O_EXCL flag removed and were updated to use O_RDONLY.

2) A common holder was added to the vdev disk code.  This
   allow the ZFS code to internally open the device multiple
   times but non-ZFS callers may not.

3) An exception was added to make_disks() for hot spare when
   creating partition tables.  For hot spare devices which
   are already opened exclusively we skip creating the partition
   table because this must already have been done when the disk
   was originally added as a hot spare.

Additional minor changes include fixing check_in_use() to use
a partition instead of a slice suffix.  And is_spare() was moved
above make_disks() to avoid adding a forward reference.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #250
2013-03-01 13:31:02 -08:00
Tim Connors
c5b247f335 -x shouldn't warn about old on-disk format or unavailable features
`zpool status -x` should only flag errors or where the pool is
unavailable.  If it imported fine but isn't using the latest features
available in the code, that's not an error.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1319
2013-02-28 09:17:09 -08:00
Brian Behlendorf
f52b31eaf0 Honor 80 character limit in 'zpool status'
This is a minor nit, but the second line of the 'action' message
when you need to upgrade your pool to support feature flags exceeds
the standard 80 character limit.  Fix it by moving the word
'feature' on to the third line.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-31 11:06:57 -08:00
Yuri Pankov
240245896a Illumos #1377 `zpool status -D' should tell if there are no DDT entries
1337 `zpool status -D' should tell if there are no DDT entries

Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Approved by: Albert Lee <trisk@nexenta.com>

References:
  illumos/illumos-gate@ce72e614c1
  illumos changeset: 13432:d1ad8d106d64
  https://www.illumos.org/issues/1337

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-11 09:17:13 -08:00
Christopher Siden
b9b24bb4ca Illumos #2762: zpool command should have better support for feature flags
2762 zpool command should have better support for feature flags
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>

References:
  illumos/illumos-gate@57221772c3
  https://www.illumos.org/issues/2762

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-08 10:35:43 -08:00
Christopher Siden
9ae529ec5d Illumos #2619 and #2747
2619 asynchronous destruction of ZFS file systems
2747 SPA versioning with zfs feature flags
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Dan Kruchinin <dan.kruchinin@gmail.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>

References:
  illumos/illumos-gate@53089ab7c8
  illumos/illumos-gate@ad135b5d64
  illumos changeset: 13700:2889e2596bd6
  https://www.illumos.org/issues/2619
  https://www.illumos.org/issues/2747

NOTE: The grub specific changes were not ported.  This change
must be made to the Linux grub packages.

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-08 10:35:35 -08:00
Cyril Plisko
4588bf5701 Make zpool attach -o ashift=... actually work
Commit df83110856 missed update to
getopt() call, while delivering all the rest. This commit adds
"o" to getopt().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #566
2012-11-30 13:50:26 -08:00
Cyril Plisko
df83110856 Add "-o ashift" to zpool add and zpool attach
When adding devices to an existing pool "ashift" property is
auto-detected.  However, if this property was overridden at
the pool creation time (i.e. zpool create -o ashift=12 tank ...)
this may not be what the user wants.  This commit lets the user
specify the value of "ashift" property to be used with newly
added drives. For example,

    zpool add -o ashift=12 tank disk1
    zpool attach -o ashift=12 tank disk1 disk2

Signed-off-by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #566
2012-11-15 11:06:14 -08:00
George Wilson
32a9872bba Illumos #2671: zpool import should not fail if vdev ashift has increased
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Richard Lowe <richlowe@richlowe.net>

Refererces to Illumos issue:
      https://www.illumos.org/issues/2671

This patch has been slightly modified from the upstream Illumos
version.  In the upstream implementation a warning message is
logged to the console.  To prevent pointless console noise this
notification is now posted as a "ereport.fs.zfs.vdev.bad_ashift"
event.

The event indicates a non-optimial (but entirely safe) ashift
value was used to create the pool.  Depending on your workload
this may impact pool performance.  Unfortunately, the only way
to correct the issue is to recreate the pool with a new ashift.

NOTE: The unrelated fix to the comment in zpool_main.c appears
in the upstream commit and was preserved for consistnecy.

Ported-by: Cyril Plisko <cyril.plisko@mountall.com>
Reworked-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #955
2012-11-15 11:05:59 -08:00
Brian Behlendorf
eac4720465 Allow 'zpool replace' to use short device names
The 'zpool replace' command would fail when given a short name
because unlike on other platforms the short name cannot be
deterministically expanded to a single path.  Multiple path
prefixes must be checked and in addition the partition suffix
for whole disks is determined by the prefix.

To handle this complexity a zfs_strcmp_pathname() function was
added which takes either a short or fully qualified device name.
Short names will be expanded using the prefixes in the default
import search path, or the ZPOOL_IMPORT_PATH environment variable
if it's defined.  All posible expansions are then compared against
the comparison path.  Care is taken to strip redundant slashes to
ensure legitimate matches are not missed.

In the context of this work the existing zfs_resolve_shortname()
function was extended to consider the ZPOOL_IMPORT_PATH when set.
The zfs_append_partition() interface was also simplified to take
only a single buffer.

The vast majority of these changes rework existing Linux specific
code which was originally written to accomidate udev.  However,
there is some minimal cleanup which removes Illumos specific code.
This was done to improve readability but the basic flow and intent
of the upstream code was maintained.

These changes are the logical conclusion of the previos work to
adjust the 'zpool import' search behavior, see commit 44867b6a.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #544
Closes #976
2012-10-22 08:45:58 -07:00
Chris Siden
1bd201e70d Illumos #1948: zpool list should show more detailed pool info
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Eric Schrock <eric.schrock@delphix.com>

References:
  https://www.illumos.org/issues/1948

Ported by:	Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #685
2012-09-19 13:39:05 -07:00
Brian Behlendorf
44867b6d6e Improve zpool import search behavior
The goal of this change is to make 'zpool import' prefer to use
the peristent /dev/mapper or /dev/disk/by-* paths.  These are far
preferable to the devices in /dev/ whos names are not persistent
and are determined by the order in which a device is detected.

This patch improves things by changing the default search path from
just to the top level /dev/ directory to (in order):

  /dev/disk/by-vdev   - Custom rules, use first if they exist
  /dev/disk/zpool     - Custom rules, use first if they exist
  /dev/mapper         - Use multipath devices before components
  /dev/disk/by-uuid   - Single unique entry and persistent
  /dev/disk/by-id     - May be multiple entries and persistent
  /dev/disk/by-path   - Encodes physical location and persistent
  /dev/disk/by-label  - Custom persistent labels
  /dev                - UNSAFE device names will change

The default search path can be overriden by setting the
ZPOOL_IMPORT_PATH environment variable.  This must be a colon
delimited list of paths which are searched for vdevs.  If the
'zpool import -d' option is specified only those listed paths
will be searched.

Finally, when multiple paths to the same device are found.  If one
of the paths is an exact match for the path used last time to import
the pool it will be used.  When there are no exact matches the
prefered path will be determined by the provided search order.

This means you can still import a pool and force specific names by
providing the -d <path> option.  And the prefered names will persist
as long as those paths exist on your system.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #965
2012-09-17 13:49:07 -07:00
Cyril Plisko
af909a1089 Illumos #3064: usr/src/cmd/zpool/zpool_main.c misspells "successful"
Reviewed by: Andrew Stormont <Andrew.Stormont@nexenta.com>
Reviewed by: Kartik Mistry <kartik.mistry@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>

References:
      https://www.illumos.org/issues/3064

Signed-off-by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-09-11 10:19:50 -07:00
Brian Behlendorf
ca8b5af89d Remove autotools products
Remove all of the generated autotools products from the repository
and update the .gitignore files accordingly.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #718
2012-08-27 11:47:44 -07:00
Dan McDonald
d96eb2b153 Illumos #1693: persistent 'comment' field for a zpool
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/1693

Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #678
2012-08-08 11:49:37 -07:00
Etienne Dechamps
ee5fd0bb80 Set zvol discard_granularity to the volblocksize.
Currently, zvols have a discard granularity set to 0, which suggests to
the upper layer that discard requests of arbirarily small size and
alignment can be made efficiently.

In practice however, ZFS does not handle unaligned discard requests
efficiently: indeed, it is unable to free a part of a block. It will
write zeros to the specified range instead, which is both useless and
inefficient (see dnode_free_range).

With this patch, zvol block devices expose volblocksize as their discard
granularity, so the upper layer is aware that it's not supposed to send
discard requests smaller than volblocksize.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #862
2012-08-07 14:55:31 -07:00
Richard Yao
739a1a82e0 Linux 3.5 compat, end_writeback() changed to clear_inode()
The end_writeback() function was changed by moving the call to
inode_sync_wait() earlier in to evict().   This effecitvely changes
the ordering of the sync but it does not impact the details of
the zfs implementation.

However, as part of this change end_writeback() was renamed to
clear_inode() to reflect the new semantics.  This change does
impact us and clear_inode() now maps to end_writeback() for
kernels prior to 3.5.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #784
2012-07-23 12:29:36 -07:00
Richard Yao
ea1fdf46e2 Linux 3.5 compat, iops->truncate_range() removed
The vmtruncate_range() support has been removed from the kernel in
favor of using the fallocate method in the file_operations table.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #784
2012-07-23 12:29:32 -07:00
Richard Yao
756c3e5a9c Linux 3.5 compat, eops->encode_fh() takes inodes
The export_operations member ->encode_fh() has been updated to
take both the child and parent inodes.  This interface used to
take the child dentry and a bool describing if the parent is needed.

NOTE: While updating this code I noticed that we do not currently
cleanly handle the case where we're passed a connectable parent.
This code should be audited to make sure we're doing the right thing.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #784
2012-07-23 12:29:23 -07:00
Etienne Dechamps
b5a28807cd Move partition scanning from userspace to module.
Currently, zpool online -e (dynamic vdev expansion) doesn't work on
whole disks because we're invoking ioctl(BLKRRPART) from userspace
while ZFS still has a partition open on the disk, which results in
EBUSY.

This patch moves the BLKRRPART invocation from the zpool utility to the
module. Specifically, this is done just before opening the device in
vdev_disk_open() which is called inside vdev_reopen(). This requires
jumping through some hoops to get to the disk device from the partition
device, and to make sure we can still open the partition after the
BLKRRPART call.

Note that this new code path is triggered on dynamic vdev expansion
only; other actions, like creating a new pool, are unchanged and still
call BLKRRPART from userspace.

This change also depends on API changes which are available in 2.6.37
and latter kernels.  The build system has been updated to detect this,
but there is no compatibility mode for older kernels.  This means that
online expansion will NOT be available in older kernels.  However, it
will still be possible to expand the vdev offline.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #808
2012-07-17 09:17:31 -07:00
Garrett D'Amore
3541dc6d02 Illumos #1748: desire support for reguid in zfs
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Alexander Eremin <alexander.eremin@nexenta.com>
Reviewed by: Alexander Stetsenko <ams@nexenta.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/1748

This commit modifies the user to kernel space ioctl ABI.  Extra
care should be taken when updating to ensure both the kernel
modules and utilities are updated.  If only the user space
component is updated both the 'zpool events' command and the
'zpool reguid' command will not work until the kernel modules
are updated.

Ported by:     Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #665
2012-07-11 13:08:56 -07:00
Richard Yao
6a0936babc Linux 3.4 compat, d_make_root() replaces d_alloc_root()
torvalds/linux@adc0e91ab1 introduced
introduced d_make_root() as a replacement for d_alloc_root(). Further
commits appear to have removed d_alloc_root() from the Linux source
tree. This causes the following failure:

  error: implicit declaration of function 'd_alloc_root'
  [-Werror=implicit-function-declaration]

To correct this we update the code to use the current d_make_root()
interface for readability.  Then we introduce an autotools check
to determine if d_make_root() is available.  If it isn't then we
define some compatibility logic which used the older d_alloc_root()
interface.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #776
2012-06-11 10:04:49 -07:00
Ned A. Bass
821b683436 Add vdev_id for JBOD-friendly udev aliases
vdev_id parses the file /etc/zfs/vdev_id.conf to map a physical path
in a storage topology to a channel name.  The channel name is combined
with a disk enclosure slot number to create an alias that reflects the
physical location of the drive.  This is particularly helpful when it
comes to tasks like replacing failed drives.  Slot numbers may also be
re-mapped in case the default numbering is unsatisfactory.  The drive
aliases will be created as symbolic links in /dev/disk/by-vdev.

The only currently supported topologies are sas_direct and sas_switch:

o  sas_direct - a channel is uniquely identified by a PCI slot and a
   HBA port

o  sas_switch - a channel is uniquely identified by a SAS switch port

A multipath mode is supported in which dm-mpath devices are handled by
examining the first running component disk, as reported by 'multipath
-l'.  In multipath mode the configuration file should contain a
channel definition with the same name for each path to a given
enclosure.

vdev_id can replace the existing zpool_id script on systems where the
storage topology conforms to sas_direct or sas_switch.  The script
could be extended to support other topologies as well.  The advantage
of vdev_id is that it is driven by a single static input file that can
be shared across multiple nodes having a common storage toplogy.
zpool_id, on the other hand, requires a unique /etc/zfs/zdev.conf per
node and a separate slot-mapping file.  However, zpool_id provides the
flexibility of using any device names that show up in
/dev/disk/by-path, so it may still be needed on some systems.

vdev_id's functionality subsumes that of the sas_switch_id script, and
it is unlikely that anyone is using it, so sas_switch_id is removed.

Finally, /dev/disk/by-vdev is added to the list of directories that
'zpool import' will scan.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #713
2012-06-01 08:55:14 -07:00
Brian Behlendorf
b39d3b9f7b Linux 3.3 compat, iops->create()/mkdir()/mknod()
The mode argument of iops->create()/mkdir()/mknod() was changed from
an 'int' to a 'umode_t'.  To prevent a compiler warning an autoconf
check was added to detect the API change and then correctly set a
zpl_umode_t typedef.  There is no functional change.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #701
2012-04-30 12:52:38 -07:00
Frederik Wessels
95bcd51ecc Illumos #1946: incorrect formatting when listing output of multiple pools with zpool iostat -v
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Joshua M. Clulow <josh@sysmgr.org>
Approved by: Richard Lowe <richlowe@richlowe.net>

Reference to Illumos issue:
  https://www.illumos.org/issues/1946

Ported by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-04-27 15:15:37 -07:00
Mike Harsch
187632dcef Illumos #952: separate intent logs should be obvious in 'zpool iostat' output
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Garrett D'Amore <garrett@nexenta.com>
Approved by: Eric Schrock <eric.schrock@delphix.com>

Refererce to Illumos issue:
  https://www.illumos.org/issues/952

Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #607
2012-04-27 15:12:43 -07:00
Craig Sanders
9fc60702c6 Remove hard-coded 80 column output
When stdout is detected to be a tty use the number of columns
specified by the terminal.  If that fails fall back to a default
80 column width.  In the non-tty case allow for 999 column lines.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-03-27 15:01:08 -07:00
Brian Behlendorf
f47e1351db Fix executable permissions
Caught by lint, this permission change was accidentally introduced
by commit 42cb3819f1.  Restore the
correct permissions and while I'm at it add a missing whack-bang
to config/ltmain.sh.

  lint: executable-not-elf-or-script: zpool_main.c zfs_main.c

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #620
2012-03-26 11:52:44 -07:00
Brian Behlendorf
1c5de20ae2 Add --enable-debug-dmu-tx configure option
Allow rigorous (and expensive) tx validation to be enabled/disabled
indepentantly from the standard zfs debugging.  When enabled these
checks ensure that all txs are constructed properly and that a dbuf
is never dirtied without taking the correct tx hold.

This checking is particularly helpful when adding new dmu consumers
like Lustre.  However, for established consumers such as the zpl
with no known outstanding tx construction problems this is just
overhead.

--enable-debug-dmu-tx  - Enable/disable validation of each tx as
--disable-debug-dmu-tx   it is constructed.  By default validation
                         is disabled due to performance concerns.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-03-23 12:25:17 -07:00
Brian Behlendorf
ebe7e575ea Add .zfs control directory
Add support for the .zfs control directory.  This was accomplished
by leveraging as much of the existing ZFS infrastructure as posible
and updating it for Linux as required.  The bulk of the core
functionality is now all there with the following limitations.

*) The .zfs/snapshot directory automount support requires a 2.6.37
   or newer kernel.  The exception is RHEL6.2 which has backported
   the d_automount patches.

*) Creating/destroying/renaming snapshots with mkdir/rmdir/mv
   in the .zfs/snapshot directory works as expected.  However,
   this functionality is only available to root until zfs
   delegations are finished.

      * mkdir - create a snapshot
      * rmdir - destroy a snapshot
      * mv    - rename a snapshot

The following issues are known defeciences, but we expect them to
be addressed by future commits.

*) Add automount support for kernels older the 2.6.37.  This should
   be possible using follow_link() which is what Linux did before.

*) Accessing the .zfs/snapshot directory via NFS is not yet possible.
   The majority of the ground work for this is complete.  However,
   finishing this work will require resolving some lingering
   integration issues with the Linux NFS kernel server.

*) The .zfs/shares directory exists but no futher smb functionality
   has yet been implemented.

Contributions-by: Rohan Puri <rohan.puri15@gmail.com>
Contributiobs-by: Andrew Barnes <barnes333@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #173
2012-03-22 13:03:47 -07:00
Gregor Kopka
42cb3819f1 Use stderr for 'no pools/datasets available' error
The 'zfs list' and 'zpool list' commands output the message
'no datasets/pools available' to stdout.  This should go to
stderr and only the available datasets/pools should go to
stdout.  Returning nothing to stdout is expected behavior
when there is nothing to list.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #581
2012-03-15 10:24:00 -07:00
Brian Behlendorf
4b787d75c8 Cleanly support debug packages
Allow a source rpm to be rebuilt with debugging enabled.  This
avoids the need to have to manually modify the spec file.  By
default debugging is still largely disabled.  To enable specific
debugging features use the following options with rpmbuild.

  '--with debug'               - Enables ASSERTs

  # For example:
  $ rpmbuild --rebuild --with debug zfs-modules-0.6.0-rc6.src.rpm

Additionally, ZFS_CONFIG has been added to zfs_config.h for
packages which build against these headers.  This is critical
to ensure both zfs and the dependant package are using the same
prototype and structure definitions.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-27 14:08:17 -08:00
Ned Bass
3a4f6caf08 Return success from check_slice() if device doesn't exist
When creating a new pool, make_root_vdev() calls check_in_use() to
ensure that none of the consituent disks are in use.  If the disk
contains a valid vdev label it is read to retrieve the list of its
child vdevs and these are checked recursively.  However, the
partitions stored in the vdev label my no longer exist, for example
if the partition table has since been altered.  In any such case we
would want the pool creation to proceed, so this change removes the
check from check_slice() that returns an error if the device doesn't
exist.  As an added assurance, the Solaris implementation also
returns sucess on ENOENT.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-27 08:52:38 -08:00
Etienne Dechamps
30930fba21 Add support for DISCARD to ZVOLs.
DISCARD (REQ_DISCARD, BLKDISCARD) is useful for thin provisioning.
It allows ZVOL clients to discard (unmap, trim) block ranges from
a ZVOL, thus optimizing disk space usage by allowing a ZVOL to
shrink instead of just grow.

We can't use zfs_space() or zfs_freesp() here, since these functions
only work on regular files, not volumes. Fortunately we can use the
low-level function dmu_free_long_range() which does exactly what we
want.

Currently the discard operation is not added to the log. That's not
a big deal since losing discard requests cannot result in data
corruption. It would however result in disk space usage higher than
it should be. Thus adding log support to zvol_discard() is probably
a good idea for a future improvement.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-09 16:19:38 -08:00
Etienne Dechamps
cb2d19010d Support the fallocate() file operation.
Currently only the (FALLOC_FL_PUNCH_HOLE) flag combination is
supported, since it's the only one that matches the behavior of
zfs_space(). This makes it pretty much useless in its current
form, but it's a start.

To support other flag combinations we would need to modify
zfs_space() to make it more flexible, or emulate the desired
functionality in zpl_fallocate().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #334
2012-02-09 16:19:32 -08:00
Etienne Dechamps
34037afe24 Improve ZVOL queue behavior.
The Linux block device queue subsystem exposes a number of configurable
settings described in Linux block/blk-settings.c. The defaults for these
settings are tuned for hard drives, and are not optimized for ZVOLs. Proper
configuration of these options would allow upper layers (I/O scheduler) to
take better decisions about write merging and ordering.

Detailed rationale:

 - max_hw_sectors is set to unlimited (UINT_MAX). zvol_write() is able to
   handle writes of any size, so there's no reason to impose a limit. Let the
   upper layer decide.

 - max_segments and max_segment_size are set to unlimited. zvol_write() will
   copy the requests' contents into a dbuf anyway, so the number and size of
   the segments are irrelevant. Let the upper layer decide.

 - physical_block_size and io_opt are set to the ZVOL's block size. This
   has the potential to somewhat alleviate issue #361 for ZVOLs, by warning
   the upper layers that writes smaller than the volume's block size will be
   slow.

 - The NONROT flag is set to indicate this isn't a rotational device.
   Although the backing zpool might be composed of rotational devices, the
   resulting ZVOL often doesn't exhibit the same behavior due to the COW
   mechanisms used by ZFS. Setting this flag will prevent upper layers from
   making useless decisions (such as reordering writes) based on incorrect
   assumptions about the behavior of the ZVOL.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-07 16:23:06 -08:00
Etienne Dechamps
b18019d2d8 Fix synchronicity for ZVOLs.
zvol_write() assumes that the write request must be written to stable storage
if rq_is_sync() is true. Unfortunately, this assumption is incorrect. Indeed,
"sync" does *not* mean what we think it means in the context of the Linux
block layer. This is well explained in linux/fs.h:

    WRITE:       A normal async write. Device will be plugged.
    WRITE_SYNC:  Synchronous write. Identical to WRITE, but passes down
                 the hint that someone will be waiting on this IO
                 shortly.
    WRITE_FLUSH: Like WRITE_SYNC but with preceding cache flush.
    WRITE_FUA:   Like WRITE_SYNC but data is guaranteed to be on
                 non-volatile media on completion.

In other words, SYNC does not *mean* that the write must be on stable storage
on completion. It just means that someone is waiting on us to complete the
write request. Thus triggering a ZIL commit for each SYNC write request on a
ZVOL is unnecessary and harmful for performance. To make matters worse, ZVOL
users have no way to express that they actually want data to be written to
stable storage, which means the ZIL is broken for ZVOLs.

The request for stable storage is expressed by the FUA flag, so we must
commit the ZIL after the write if the FUA flag is set. In addition, we must
commit the ZIL before the write if the FLUSH flag is set.

Also, we must inform the block layer that we actually support FLUSH and FUA.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-07 16:23:06 -08:00
Brian Behlendorf
47621f3d76 Linux 3.3 compat, sops->show_options()
The second argument of sops->show_options() was changed from a
'struct vfsmount *' to a 'struct dentry *'.  Add an autoconf check
to detect the API change and then conditionally define the expected
interface.  In either case we are only interested in the zfs_sb_t.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #549
2012-02-03 10:02:01 -08:00
Darik Horn
750562833f Combine libraries: spl, avl, efi, share, unicode.
These libraries, which are an artifact of the ZoL development
process, conflict with packages that are already in distribution:

  * libspl: SPL Programming Language
  * libavl: AVL for Linux
  * libefi: GRUB

And these libraries are potential conflicts:

  * libshare: the Linux Mount Manager
  * libunicode: Perl and Python

Recompose these five ZoL components into the four libraries that are
conventionally provided by Solaris and FreeBSD systems:

  + libnvpair
  + libuutil
  + libzpool
  + libzfs

This change resolves the name conflict, makes ZoL more compatible
with existing software that uses autotools to detect ZFS, and allows
pkg-zfs to better reflect the official Debian kFreeBSD packaging.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #430
2012-01-17 15:19:50 -08:00
Brian Behlendorf
ab26409db7 Linux 3.1 compat, super_block->s_shrink
The Linux 3.1 kernel has introduced the concept of per-filesystem
shrinkers which are directly assoicated with a super block.  Prior
to this change there was one shared global shrinker.

The zfs code relied on being able to call the global shrinker when
the arc_meta_limit was exceeded.  This would cause the VFS to drop
references on a fraction of the dentries in the dcache.  The ARC
could then safely reclaim the memory used by these entries and
honor the arc_meta_limit.  Unfortunately, when per-filesystem
shrinkers were added the old interfaces were made unavailable.

This change adds support to use the new per-filesystem shrinker
interface so we can continue to honor the arc_meta_limit.  The
major benefit of the new interface is that we can now target
only the zfs filesystem for dentry and inode pruning.  Thus we
can minimize any impact on the caching of other filesystems.

In the context of making this change several other important
issues related to managing the ARC were addressed, they include:

* The dnlc_reduce_cache() function which was called by the ARC
to drop dentries for the Posix layer was replaced with a generic
zfs_prune_t callback.  The ZPL layer now registers a callback to
drop these dentries removing a layering violation which dates
back to the Solaris code.  This callback can also be used by
other ARC consumers such as Lustre.

  arc_add_prune_callback()
  arc_remove_prune_callback()

* The arc_reduce_dnlc_percent module option has been changed to
arc_meta_prune for clarity.  The dnlc functions are specific to
Solaris's VFS and have already been largely eliminated already.
The replacement tunable now represents the number of bytes the
prune callback will request when invoked.

* Less aggressively invoke the prune callback.  We used to call
this whenever we exceeded the arc_meta_limit however that's not
strictly correct since it results in over zeleous reclaim of
dentries and inodes.  It is now only called once the arc_meta_limit
is exceeded and every effort has been made to evict other data from
the ARC cache.

* More promptly manage exceeding the arc_meta_limit.  When reading
meta data in to the cache if a buffer was unable to be recycled
notify the arc_reclaim thread to invoke the required prune.

* Added arcstat_prune kstat which is incremented when the ARC
is forced to request that a consumer prune its cache.  Remember
this will only occur when the ARC has no other choice.  If it
can evict buffers safely without invoking the prune callback
it will.

* This change is also expected to resolve the unexpect collapses
of the ARC cache.  This would occur because when exceeded just the
arc_meta_limit reclaim presure would be excerted on the arc_c
value via arc_shrink().  This effectively shrunk the entire cache
when really we just needed to reclaim meta data.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #466
Closes #292
2012-01-11 11:46:02 -08:00
Darik Horn
28eb9213d8 Linux 3.2 compat: set_nlink()
Directly changing inode->i_nlink is deprecated in Linux 3.2 by commit

  SHA: bfe8684869601dacfcb2cd69ef8cfd9045f62170

Use the new set_nlink() kernel function instead.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #462
2011-12-16 20:02:52 -08:00
Prakash Surya
6ba3b44614 Add make rule for building Arch Linux packages
Added the necessary build infrastructure for building packages
compatible with the Arch Linux distribution. As such, one can now run:

    $ ./configure
    $ make pkg     # Alternatively, one can run 'make arch' as well

on the Arch Linux machine to create two binary packages compatible with
the pacman package manager, one for the zfs userland utilities and
another for the zfs kernel modules. The new packages can then be
installed by running:

    # pacman -U $package.pkg.tar.xz

In addition, source-only packages suitable for an Arch Linux chroot
environment or remote builder can also be build using the 'sarch' make
rule.

NOTE: Since the source dist tarball is created on the fly from the head
of the build tree, it's MD5 hash signature will be continually influx.
As a result, the md5sum variable was intentionally omitted from the
PKGBUILD files, and the '--skipinteg' makepkg option is used. This may
or may not have any serious security implications, as the source tarball
is not being downloaded from an outside source.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #491
2011-12-14 19:14:23 -08:00
Brian Behlendorf
5547c2f1bf Simplify BDI integration
Update the code to use the bdi_setup_and_register() helper to
simplify the bdi integration code.  The updated code now just
registers the bdi during mount and destroys it during unmount.

The only complication is that for 2.6.32 - 2.6.33 kernels the
helper wasn't available so in these cases the zfs code must
provide it.  Luckily the bdi_setup_and_register() function
is trivial.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #367
2011-11-08 10:19:03 -08:00
Darik Horn
3cee2262a6 Change sun.com URLs to zfsonlinux.org
ZFS contains error messages that point to the defunct www.sun.com
domain, which is currently offline.  Change these error messages
to use the zfsonlinux.org mirror instead.

This commit depends on:

  zfsonlinux/zfsonlinux.github.com@8e10ead3dc

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-10-24 09:52:21 -07:00
Brian Behlendorf
de0a1c099b Autogen refresh for udev changes
Run autogen.sh using the same autotools versions as upstream:

 * autoconf-2.63
 * automake-1.11.1
 * libtool-2.2.6b
2011-08-08 16:30:27 -07:00
Brian Behlendorf
76659dc110 Add backing_device_info per-filesystem
For a long time now the kernel has been moving away from using the
pdflush daemon to write 'old' dirty pages to disk.  The primary reason
for this is because the pdflush daemon is single threaded and can be
a limiting factor for performance.  Since pdflush sequentially walks
the dirty inode list for each super block any delay in processing can
slow down dirty page writeback for all filesystems.

The replacement for pdflush is called bdi (backing device info).  The
bdi system involves creating a per-filesystem control structure each
with its own private sets of queues to manage writeback.  The advantage
is greater parallelism which improves performance and prevents a single
filesystem from slowing writeback to the others.

For a long time both systems co-existed in the kernel so it wasn't
strictly required to implement the bdi scheme.  However, as of
Linux 2.6.36 kernels the pdflush functionality has been retired.

Since ZFS already bypasses the page cache for most I/O this is only
an issue for mmap(2) writes which must go through the page cache.
Even then adding this missing support for newer kernels was overlooked
because there are other mechanisms which can trigger writeback.

However, there is one critical case where not implementing the bdi
functionality can cause problems.  If an application handles a page
fault it can enter the balance_dirty_pages() callpath.  This will
result in the application hanging until the number of dirty pages in
the system drops below the dirty ratio.

Without a registered backing_device_info for the filesystem the
dirty pages will not get written out.  Thus the application will hang.
As mentioned above this was less of an issue with older kernels because
pdflush would eventually write out the dirty pages.

This change adds a backing_device_info structure to the zfs_sb_t
which is already allocated per-super block.  It is then registered
when the filesystem mounted and unregistered on unmount.  It will
not be registered for mounted snapshots which are read-only.  This
change will result in flush-<pool> thread being dynamically created
and destroyed per-mounted filesystem for writeback.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #174
2011-08-04 13:37:38 -07:00
Kyle Fuller
615ab66d18 Provide a rc.d script for archlinux
Unlike most other Linux distributions archlinux installs its
init scripts in /etc/rc.d insead of /etc/init.d.  This commit
provides an archlinux rc.d script for zfs and extends the
build infrastructure to ensure it get's installed in the
correct place.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #322
2011-07-11 14:12:23 -07:00
Prasad Joshi
5a52105925 Use consistent error message in zpool sub-command
The zpool sub-commands like iostat, list, and status should
display consistent message when a given pool is unavailable or
no pool is present.  This change unifies the default behavior
as follows:

  root@prasad:~# ./zpool list 1 2
  no pools available
  no pools available

  root@prasad:~# ./zpool iostat  1 2
  no pools available
  no pools available

  root@prasad:~# ./zpool status 1 2
  no pools available
  no pools available

  root@prasad:~# ./zpool list tan 1 2
  cannot open 'tan': no such pool

  root@prasad:~# ./zpool iostat tan 1 2
  cannot open 'tan': no such pool

  root@prasad:~# ./zpool status tan 1 2
  cannot open 'tan': no such pool

Reported-by: Rajshree Thorat <rthorat@stec-inc.com>
Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Closes #306
2011-07-04 19:57:01 -07:00
Brian Behlendorf
2cf7f52bc4 Linux compat 2.6.39: mount_nodev()
The .get_sb callback has been replaced by a .mount callback
in the file_system_type structure.  When using the new
interface the caller must now use the mount_nodev() helper.

Unfortunately, the new interface no longer passes the vfsmount
down to the zfs layers.  This poses a problem for the existing
implementation because we currently save this pointer in the
super block for latter use.  It provides our only entry point
in to the namespace layer for manipulating certain mount options.

This needed to be done originally to allow commands like
'zfs set atime=off tank' to work properly.  It also allowed me
to keep more of the original Solaris code unmodified.  Under
Solaris there is a 1-to-1 mapping between a mount point and a
file system so this is a fairly natural thing to do.  However,
under Linux they many be multiple entries in the namespace
which reference the same filesystem.  Thus keeping a back
reference from the filesystem to the namespace is complicated.

Rather than introduce some ugly hack to get the vfsmount and
continue as before.  I'm leveraging this API change to update
the ZFS code to do things in a more natural way for Linux.
This has the upside that is resolves the compatibility issue
for the long term and fixes several other minor bugs which
have been reported.

This commit updates the code to remove this vfsmount back
reference entirely.  All modifications to filesystem mount
options are now passed in to the kernel via a '-o remount'.
This is the expected Linux mechanism and allows the namespace
to properly handle any options which apply to it before passing
them on to the file system itself.

Aside from fixing the compatibility issue, removing the
vfsmount has had the benefit of simplifying the code.  This
change which fairly involved has turned out nicely.

Closes #246
Closes #217
Closes #187
Closes #248
Closes #231
2011-07-01 13:36:39 -07:00
Brian Behlendorf
5c03efc379 Linux compat 2.6.39: security_inode_init_security()
The security_inode_init_security() function now takes an additional
qstr argument which must be passed in from the dentry if available.
Passing a NULL is safe when no qstr is available the relevant
security checks will just be skipped.

Closes #246
Closes #217
Closes #187
2011-07-01 12:40:08 -07:00
Prasad Joshi
b312979252 Tear down and flush the mmap region
The inode eviction should unmap the pages associated with the inode.
These pages should also be flushed to disk to avoid the data loss.
Therefore, use truncate_setsize() in evict_inode() to release the
pagecache.

The API truncate_setsize() was added in 2.6.35 kernel. To ensure
compatibility with the old kernel, the patch defines its own
truncate_setsize function.

Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com>
Closes #255
2011-06-27 09:59:19 -07:00
Christian Kohlschütter
df30f56639 Add "ashift" property to zpool create
Some disks with internal sectors larger than 512 bytes (e.g., 4k) can
suffer from bad write performance when ashift is not configured
correctly.  This is caused by the disk not reporting its actual sector
size, but a sector size of 512 bytes.  The drive may behave this way
for compatibility reasons.  For example, the WDC WD20EARS disks are
known to exhibit this behavior.

When creating a zpool, ZFS takes that wrong sector size and sets the
"ashift" property accordingly (to 9: 1<<9=512), whereas it should be
set to 12 for 4k sectors (1<<12=4096).

This patch allows an adminstrator to manual specify the known correct
ashift size at 'zpool create' time.  This can significantly improve
performance in certain cases.  However, it will have an impact on your
total pool capacity.  See the updated ashift property description
in the zpool.8 man page for additional details.

Valid values for the ashift property range from 9 to 17 (512B-128KB).
Additionally, you may set the ashift to 0 if you wish to auto-detect
the sector size based on what the disk reports, this is the default
behavior.  The most common ashift values are 9 and 12.

  Example:
  zpool create -o ashift=12 tank raidz2 sda sdb sdc sdd

Closes #280

Original-patch-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-06-17 16:35:49 -07:00
Brian Behlendorf
2e08aedba4 Always check -Wno-unused-but-set-variable gcc support
The previous commit 8a7e1ceefa wasn't
quite right.  This check applies to both the user and kernel space
build and as such we must make sure it runs regardless of what
the --with-config option is set too.

For example, if --with-config=kernel then the autoconf test does
not run and we generate build warnings when compiling the kernel
packages.
2011-06-14 16:40:35 -07:00
Brian Behlendorf
8a7e1ceefa Check for -Wno-unused-but-set-variable gcc support
Gcc versions 4.3.2 and earlier do not support the compiler flag
-Wno-unused-but-set-variable.  This can lead to build failures
on older Linux platforms such as Debian Lenny.  Since this is
an optional build argument this changes add a new autoconf check
for the option.  If it is supported by the installed version of
gcc then it is used otherwise it is omited.

See commit's 12c1acde76 and
79713039a2 for the reason the
-Wno-unused-but-set-variable options was originally added.
2011-06-14 14:43:22 -07:00
Brian Behlendorf
df554c148e Fix 'zfs set volsize=N pool/dataset'
This change fixes a kernel panic which would occur when resizing
a dataset which was not open.  The objset_t stored in the
zvol_state_t will be set to NULL when the block device is closed.
To avoid this issue we pass the correct objset_t as the third arg.

The code has also been updated to correctly notify the kernel
when the block device capacity changes.  For 2.6.28 and newer
kernels the capacity change will be immediately detected.  For
earlier kernels the capacity change will be detected when the
device is next opened.  This is a known limitation of older
kernels.

Online ext3 resize test case passes on 2.6.28+ kernels:
$ dd if=/dev/zero of=/tmp/zvol bs=1M count=1 seek=1023
$ zpool create tank /tmp/zvol
$ zfs create -V 500M tank/zd0
$ mkfs.ext3 /dev/zd0
$ mkdir /mnt/zd0
$ mount /dev/zd0 /mnt/zd0
$ df -h /mnt/zd0
$ zfs set volsize=800M tank/zd0
$ resize2fs /dev/zd0
$ df -h /mnt/zd0

Original-patch-by: Fajar A. Nugraha <github@fajar.net>
Closes #68
Closes #84
2011-05-02 08:54:40 -07:00
Gunnar Beutner
055656d4f4 Implemented NFS export_operations.
Implemented the required NFS operations for exporting ZFS datasets
using the in-kernel NFS daemon.
2011-04-29 12:36:13 -07:00
Darik Horn
492b8e9e7b Use gethostid in the Linux convention.
Disable the gethostid() override for Solaris behavior because Linux systems
implement the POSIX standard in a way that allows a negative result.

Mask the gethostid() result to the lower four bytes, like coreutils does in
/usr/bin/hostid, to prevent junk bits or sign-extension on systems that have an
eight byte long type. This can cause a spurious hostid mismatch that prevents
zpool import on 64-bit systems.
2011-04-25 10:36:17 -05:00
Brian Behlendorf
12c1acde76 Set -Wno-unused-but-set-variable globally
As of gcc-4.6 the option -Wunused-but-set-variable is enabled by
default.  While this is a useful warning there are numerous places
in the ZFS code when a variable is set and then only checked in an
ASSERT().  To avoid having to update every instance of this in the
code we now set -Wno-unused-but-set-variable to suppress the warning.

Additionally, when building with --enable-debug and -Werror set these
warning also become fatal.  We can reevaluate the suppression of these
error at a later time if it becomes an issue.  For now we are basically
just reverting to the previous gcc behavior.
2011-04-19 10:44:10 -07:00
Brian Behlendorf
bdf4328b04 Linux 2.6.28 compat, insert_inode_locked()
Added insert_inode_locked() helper function, prior to this most callers
used insert_inode_hash().  The older method doesn't check for collisions
in the inode_hashtable but it still acceptible for use.  Fallback to
using insert_inode_hash() when insert_inode_locked() is unavailable.
2011-03-22 12:15:54 -07:00
Brian Behlendorf
01c0e61da0 Add init scripts
To support automatically mounting your zfs on filesystem on boot
a basic init script is needed.  Unfortunately, every distribution
has their own idea of the _right_ way to do things.  Rather than
write one very complicated portable init script, which would be
invariably replaced by the distributions own anyway.  I have
instead added support to provide multiple distribution specific
init scripts.

The correct init script for your distribution will be selected
by ZFS_AC_DEFAULT_PACKAGE which will set DEFAULT_INIT_SCRIPT.
During 'make install' the correct script for your system will
be installed from zfs/etc/init.d/zfs.DEFAULT_INIT_SCRIPT to the
usual /etc/init.d/zfs location.

Currently, there is zfs.fedora and a more generic zfs.lsb init
script.  Hopefully, the distribution maintainers who know best
how they want their init scripts to function will feedback their
approved versions to be included in the project.

This change does not consider upstart jobs but I'm not at all
opposed to add that sort of thing.
2011-03-17 16:51:54 -07:00
Brian Behlendorf
45066d1f20 Linux 2.6.38 compat, blkdev_get_by_path()
The open_bdev_exclusive() function has been replaced (again) by the
more generic blkdev_get_by_path() function.  Additionally, the
counterpart function close_bdev_exclusive() has been replaced by
blkdev_put().  Because these functions are more generic versions
of the functions they replaced the compatibility macro must add
the FMODE_EXCL mask to ensure they are exclusive.

Closes #114
2011-02-23 12:29:38 -08:00
Brian Behlendorf
2c395def27 Linux 2.6.36 compat, sops->evict_inode()
The new prefered inteface for evicting an inode from the inode cache
is the ->evict_inode() callback.  It replaces both the ->delete_inode()
and ->clear_inode() callbacks which were previously used for this.
2011-02-11 13:47:51 -08:00
Brian Behlendorf
7268e1bec8 Linux 2.6.35 compat, fops->fsync()
The fsync() callback in the file_operations structure used to take
3 arguments.  The callback now only takes 2 arguments because the
dentry argument was determined to be unused by all consumers.  To
handle this a compatibility prototype was added to ensure the right
prototype is used.  Our implementation never used the dentry argument
either so it's just a matter of using the right prototype.
2011-02-11 09:05:51 -08:00