Commit Graph

582 Commits

Author SHA1 Message Date
Don Brady
4e21fd060a OpenZFS 7303 - dynamic metaslab selection
This change introduces a new weighting algorithm to improve
metaslab selection. The new weighting algorithm relies on the
SPACEMAP_HISTOGRAM feature. As a result, the metaslab weight
now encodes the type of weighting algorithm used (size-based
vs segment-based).

Porting Notes: The metaslab allocation tracing code is conditionally
removed on linux (dependent on mdb debugger).

Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Chris Siden <christopher.siden@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Pavel Zakharov pavel.zakharov@delphix.com
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Don Brady <don.brady@intel.com>

OpenZFS-issue: https://www.illumos.org/issues/7303
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d5190931bd
Closes #5404
2017-01-12 11:52:56 -08:00
George Melikov
5727b00e06 OpenZFS 6637 - replacing "dontclose" with "should_close"
Authored by: David Schwartz <dschwartz783@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

I find that this is a lot easier to read. "not don't close" is somewhat tough on the eyes.

OpenZFS-issue: https://www.illumos.org/issues/6637
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d189620
Closes #5572
2017-01-12 11:25:27 -08:00
George Melikov
e9aa730c49 OpenZFS 6328 - Fix cstyle errors in zfs codebase
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Jorgen Lundman <lundman@lundman.net>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/6328
OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/9a686fb
Closes #5579
2017-01-12 09:42:11 -08:00
Johnny Stenback
5eac94bffd Fix TypeError: unorderable types: str() > int() in arc_summary.py
Running arc_summary.py with a l2arc cache device around produces
the following error:

  Traceback (most recent call last):
    File "/usr/bin/arc_summary.py", line 1148, in <module>
      main()
    File "/usr/bin/arc_summary.py", line 1144, in main
      page(Kstat)
    File "/usr/bin/arc_summary.py", line 724, in _l2arc_summary
      arc["l2_arc_evicts"]["reading"] > 0:
  TypeError: unorderable types: str() > int()

This is due to arc["l2_arc_evicts"]['lock_retries'] and
arc["l2_arc_evicts"]["reading"] both being strings, returned
from fHits() earlier. Rather than adding them up and checking
if the result is > 0, this checks if either string is != '0'.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #5538
2017-01-03 12:29:23 -06:00
ka7
4e33ba4c38 Fix spelling
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Haakan T Johansson <f96hajo@chalmers.se>
Closes #5547 
Closes #5543
2017-01-03 11:31:18 -06:00
GeLiXin
547c593661 Fix coverity defects: CID 147587
CID 147587: Out-of-bounds read

Future changes may cause an array overrun of 4096 bytes at byte
offset 4096 by dereferencing pointer dstp.  Adding this additional
check ensures correctness.

Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: GeLiXin <ge.lixin@zte.com.cn>
Closes #5297
2016-12-21 11:27:24 -08:00
cao
58bf046ab3 Fix coverity defects: CID 155008
CID 155008:  Resource leaks  (RESOURCE_LEAK)

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn>
Closes #5500
2016-12-19 10:26:15 -08:00
Tony Hutter
1528bfdb14 Don't run 'zpool iostat -c CMD' command on all vdevs, if vdevs specified
zpool iostat allows you to specify only certain vdevs to display.
Currently, if you run 'zpool iostat -c CMD vdev1 vdev2 ...'
on specific vdevs, it will actually run the command on *all* vdevs,
and just display the results for the vdevs you specify.  This patch
corrects the behavior to only run the command on the specified vdevs,
and also enables the zpool_iostat_005_pos.ksh tests.
    
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #5443
2016-12-16 16:10:45 -08:00
cao
81eb8a1fbb Fix coverity defects: CID 147534
CID 147534: Negative array index read

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn>
Closes #5467
2016-12-16 09:11:17 -08:00
Brian Behlendorf
02730c333c Use cstyle -cpP in make cstyle check
Enable picky cstyle checks and resolve the new warnings.  The vast
majority of the changes needed were to handle minor issues with
whitespace formatting.  This patch contains no functional changes.

Non-whitespace changes are as follows:

* 8 times ; to { } in for/while loop
* fix missing ; in cmd/zed/agents/zfs_diagnosis.c
* comment (confim -> confirm)
* change endline , to ; in cmd/zpool/zpool_main.c
* a number of /* BEGIN CSTYLED */ /* END CSTYLED */ blocks
* /* CSTYLED */ markers
* change == 0 to !
* ulong to unsigned long in module/zfs/dsl_scan.c
* rearrangement of module_param lines in module/zfs/metaslab.c
* add { } block around statement after for_each_online_node

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Håkan Johansson <f96hajo@chalmers.se>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5465
2016-12-12 10:46:26 -08:00
Håkan Johansson
5f20c145df Do not force VDEV_NAME_TYPE_ID in max_width()
Do not force VDEV_NAME_TYPE_ID in max_width(), instead add it
in the relevant calls to max_width().

The first location of max_width() where VDEV_NAME_TYPE_ID is
now added in show_import() is followed by print_import_config() and
print_logs().  Both these print children vdev names that have been
retrieved using an explicit VDEV_NAME_TYPE_ID added.

The second location is in status_callback().  This is followed by
print_status_config(), print_logs(), print_l2cache(), and
print_spares(). For l2cache and spares it should not matter as there
are no mirror-X or raidz-X involved.  print_status_config() as above
retrieves the name using explicit VDEV_NAME_TYPE_ID before
calling itself to print children.

The call of max_width() in get_namewidth() is not changed, as this is
used by zpool_do_iostat(), followed by print_iostat(), which does not
add VDEV_NAME_TYPE_ID.

Overall, we should consider adding VDEV_NAME_TYPE_ID to the
relevant name_flags / cb_name_flags fields, and remove the explicit
adding in called routines.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Haakan T Johansson <f96hajo@chalmers.se>
Closes #5401
2016-11-30 17:46:16 -07:00
Brian Behlendorf
7657defc48 Introduce ARC Buffer Data (ABD)
ZFS currently uses ARC buffers which are backed by virtual memory.
While functional, there are some major problems with this approach
which can be observed on all OpenZFS platforms.  ABD was designed
to address these issues and includes contributions from OpenZFS
developers from multiple platforms.

While all OpenZFS platforms will benefit from ABD this functionality
is critical for Linux.  Unlike the other OpenZFS platforms the Linux
kernel discourages extensive use of virtual memory.  The provided
interfaces are not optimized for frequent allocations from the virtual
address space.  To maintain good performance a kmem cache is
used which contains relatively long lived slabs backed by virtual
memory.  The downside to the approach is that those slabs can
become highly fragmented resulting in an inefficient use of memory.

Another issue is that on 32-bit systems the available virtual
address space in the kernel is only a small fraction of total
system memory.  This means the ARC size is highly constrained
which hurts performance and make allocating memory difficult
and OOMs more likely.

ABD is designed to address these issues by using scatter lists
of pages for data buffers.  This removes the need for slabs
which resolves the fragmentation issue.  It also allows high
memory pages to be allocated which alleviates the virtual
address space pressure on 32-bit systems.

For metadata buffers, which are small, linear ABDs are allocated
from the slab.  This is preferable because there are many places
in the code which expect to be able to read from a given offset
in the buffer.  Using linear ABDs means none of that code needs
to be modified.  The majority of these buffers are allocated with
kmalloc so there's minimal impact of the virtual address space.

Tested-by: Kash Pande <kash@tripleback.net>
Tested-by: kernelOfTruth <kerneloftruth@gmail.com>
Tested-by: RageLtMan <rageltman@sempervictus>
Tested-by: DHE <git@dehacked.net>
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: David Quigley <david.quigley@intel.com>
Reviewed-by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Isaac Huang <he.huang@intel.com>
Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3441 
Closes #5135
2016-11-30 14:48:16 -07:00
luozhengzheng
2d44b04b79 Fix coverity defects: CID 154591
CID 154591: Incorrect expression (SIZEOF_MISMATCH)

Reviewed-by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn>
Closes #5435
2016-11-30 10:48:01 -07:00
Gvozden Neskovic
65d71d4212 ABD raidz avx512f support
Implement shift based multiplication for 512f. Higher IPC over lookup based
methods yields up to 40% better performance on the current hardware.

Results on Xeon Phi(TM) CPU 7210:
implementation   gen_p           gen_pq          gen_pqr         rec_p           rec_q           rec_r           rec_pq          rec_pr          rec_qr          rec_pqr
original         142232671       24411492        12948205        283053705       22348167        4215911         9171609         2265548         2378370         1648495
scalar           295711162       49851491        33253815        293198109       88179448        61866752        27941684        25764416        17384442        12138153
sse2             410055998       199642658       117973654       406240463       152688682       121092250       84968180        79291076        47473657        20779719
ssse3            411641595       199669571       117937647       406211024       137638508       117050346       81263322        76120405        46281559        32696722
avx2             616485806       311515332       188595628       605455115       260602390       230554476       148198817       138800254       92273356        62937819
avx512f          832191523       408509425       253599522       810094481       404325734       317590971       218235687       197204920       133101937       94001219
fastest          avx512f         avx512f         avx512f         avx512f         avx512f         avx512f         avx512f         avx512f         avx512f         avx512f

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
2016-11-29 14:34:33 -08:00
Gvozden Neskovic
cbf484f8ad ABD Vectorized raidz
Enable vectorized raidz code on ABD buffers.  The avx512f,
avx512bw, neon and aarch64_neonx2 are disabled in this commit.
With the exception of avx512bw these implementations are
updated for ABD in the subsequent commits.

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
2016-11-29 14:34:33 -08:00
David Quigley
a6255b7fce DLPX-44812 integrate EP-220 large memory scalability 2016-11-29 14:34:27 -08:00
Tim Chase
616fa7c02b zstreamdump needs to initialize fletcher 4 support
Otherwise, the checksum function pointer isn't initialized.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #5411
2016-11-29 14:47:05 -07:00
Tony Hutter
8720e9e748 Add -c to zpool iostat & status to run command
This patch adds a command (-c) option to zpool status and zpool iostat.  The
-c option allows you to run an arbitrary command on each vdev and display
the first line of output in zpool status/iostat.  The environment vars
VDEV_PATH and VDEV_UPATH are set to the vdev's path and "underlying path"
before running the command.  For device mapper, multipath, or partitioned
vdevs, VDEV_UPATH is the actual underlying /dev/sd* disk.  This can be useful
if the command you're running requires a /dev/sd* device.

The patch also uses /sys/block/<dev>/slaves/ to lookup the underlying device
instead of using libdevmapper.  This not only removes the libdevmapper
requirement at build time, but also allows you to resolve device mapper
devices without being root.  This means that UDEV_UPATH get set correctly
when running zpool status/iostat as an unprivileged user.

Example:

$ zpool status -c 'echo I am $VDEV_PATH, $VDEV_UPATH'

NAME        STATE     READ WRITE CKSUM
mypool      ONLINE       0     0     0
  mirror-0  ONLINE       0     0     0
    mpatha  ONLINE       0     0     0  I am /dev/mapper/mpatha, /dev/sdc
    sdb     ONLINE       0     0     0  I am /dev/sdb1, /dev/sdb

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #5368
2016-11-29 14:45:38 -07:00
LOLi
2f71caf2d9 Allow zfs unshare <protocol> -a
Allow `zfs unshare <protocol> -a` command to share or unshare all datasets
of a given protocol, nfs or smb.

Additionally, enable most of ZFS Test Suite zfs_share/zfs_unshare test cases.
To work around some Illumos-specific functionalities ($SHARE/$UNSHARE) some
function wrappers were added around them.

Finally, fix and issue in smb_is_share_active() that would leave SMB shares
exported when invoking 'zfs unshare -a'

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #3238 
Closes #5367
2016-11-29 12:22:38 -07:00
Don Brady
0df15db98f Add a statechange notify zedlet
Now that ZED has internal fault diagnosis and the statechange event
is generated for faulted states, we can replace the io-notify and
checksum-notify zedlets with one based on statechange.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@intel.com>
Closes #5383
2016-11-10 13:52:59 -08:00
Olaf Faaland
66f801f00a Fix symlinks for {vdev_clear,statechange}-led.sh
These were named in the zed/Makefile.am as vdev_clear-blinkled.sh
and statechange-blinkled.sh causing bad symlinks to be created.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5384
2016-11-09 10:19:43 -08:00
cao
010782be91 Fix coverity defects: CID 147586
CID 147586: function:allow_usage Type:out-of-bounds read

Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn>
Closes #5364
2016-11-08 17:33:23 -08:00
luozhengzheng
70df4c21c0 Fix coverity defects: 154021
CID 154021: Null pointer dereference

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn>
Closes #5380
2016-11-08 14:34:52 -08:00
Don Brady
976246fadd Add illumos FMD ZFS logic to ZED -- phase 2
The phase 2 work primarily entails the Diagnosis Engine and
the Retire Agent modules. It also includes infrastructure
to support a crude FMD environment to host these modules.

The Diagnosis Engine consumes I/O and checksum ereports and
feeds them into a SERD engine which will generate a corres-
ponding fault diagnosis when the SERD engine fires. All the
diagnosis state data is collected into cases, one case per
vdev being tracked.

The Retire Agent responds to diagnosed faults by isolating
the faulty VDEV. It will notify the ZFS kernel module of
the new VDEV state (degraded or faulted). This agent is
also responsible for managing hot spares across pools.
When it encounters a device fault or a device removal it
replaces the device with an appropriate spare if available.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@intel.com>
Closes #5343
2016-11-07 15:01:38 -08:00
Tony Hutter
1ad9de6d08 Allow autoreplace even when enclosure LED sysfs entries don't exist
The previous autoreplace code assumed that if you were using autoreplace, then
you also had the enclosure SES driver loaded.  This could lead to autoreplace
not working if the SES driver wasn't loaded, or if it wasn't creating the
proper enclosure_device symlinks (which has happened).  This patch removes
that assumption.

Reviewed by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #5363
2016-11-04 13:34:13 -07:00
Romain Dolbeau
7f547f85fe Add parity generation/rebuild using AVX-512 for x86-64
avx512f should work on all AVX512 hardware, since it only uses
Foundation instructions.

avx512bw should be faster on hardware supporting the AVW512BW
extension. We can use full-width pshufb (instead of relying on the 256
bits AVX2 pshufb). As a side-effect, the code is also unrolled more.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Romain Dolbeau <romain.github@dolbeau.name>
Closes #5219
2016-11-02 12:40:23 -07:00
LOLi
e4010f2719 Allow for '-o feature@<feature>=disabled' on the command line
Sometimes it is desirable to specifically disable one or several
features directly on the 'zpool create' command line.

$ zpool create -o feature@<feature>=disabled ...

Original-patch-by: Turbo Fredriksson <turbo@bayour.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #3460 
Closes #5142 
Closes #5324
2016-10-25 16:17:47 -07:00
Tony Hutter
6568379eea Fix statechange-led.sh & unnecessary libdevmapper warning
- Fix autoreplace behaviour on statechange-led.sh script.

ZED sends the following events on an auto-replace:

1. statechange: Disk goes UNAVAIL->ONLINE
2. statechange: Disk goes ONLINE->UNAVAIL
3. vdev_attach: Disk goes ONLINE

Events 1-2 happen when ZED first attempts to do an auto-online.  When that
fails, ZED then tries an auto-replace, generating the vdev_attach event in #3.

In the previous code, statechange-led was only looking at the UNAVAIL->ONLINE
transition to turn off the LED.  It ignored the #2 ONLINE->UNAVAIL transition,
assuming it was just the "old" VDEV going offline.  This is problematic, as
a drive can go from ONLINE->UNAVAIL when it's malfunctioning, and we don't want
to ignore that.

This new patch correctly turns on the fault LED every time a drive becomes
UNAVAIL.  It also monitors vdev_attach events to trigger turning off the LED
when an auto-replaced disk comes online.

- Remove unnecessary libdevmapper warning with --with-config=kernel

This fixes an unnecessary libdevmapper warning when building
--with-config=kernel.  Kernel code does not use libdevmapper, so the warning
is not needed.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #2375 
Closes #5312 
Closes #5331
2016-10-25 11:05:30 -07:00
cao
aed0e9f3e4 Fix coverity defects: CID 147511, 147513
CID 147511: Type:Dereference before null check
CID 147513: Type:Dereference before null check

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn>
Closes #5306
2016-10-24 13:37:38 -07:00
Tony Hutter
1bbd877049 Turn on/off enclosure slot fault LED even when disk isn't present
Previously when a drive faulted, the statechange-led.sh script would lookup
the drive's LED sysfs entry in /sys/block/sd*/device/enclosure_device, and
turn it on.  During testing we noticed that if you pulled out a drive, or if
the drive was so badly broken that it no longer appeared to Linux, that the
/sys/block/sd* path would be removed, and the script could not lookup the
LED entry.

To fix this, this patch looks up the disks's more persistent
"/sys/class/enclosure/X:X:X:X/Slot N" LED sysfs path at pool import.  It then
passes that path to the statechange-led script to use, rather than having the
script look it up on the fly.  This allows the script to turn on/off the slot
LEDs even when the drive is missing.

Closes #5309 
Closes #2375
2016-10-24 10:45:59 -07:00
Tony Hutter
6078881aa1 Multipath autoreplace, control enclosure LEDs, event rate limiting
1. Enable multipath autoreplace support for FMA.

This extends FMA autoreplace to work with multipath disks.  This
requires libdevmapper to be installed at build time.

2. Turn on/off fault LEDs when VDEVs become degraded/faulted/online

Set ZED_USE_ENCLOSURE_LEDS=1 in zed.rc to have ZED turn on/off the enclosure
LED for a drive when a drive becomes FAULTED/DEGRADED.  Your enclosure must
be supported by the Linux SES driver for this to work.  The enclosure LED
scripts work for multipath devices as well.  The scripts will clear the LED
when the fault is cleared.

3. Rate limit ZIO delay and checksum events so as not to flood ZED

ZIO delay and checksum events are rate limited to 5/sec in the zfs module.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #2449 
Closes #3017 
Closes #5159
2016-10-19 12:55:59 -07:00
GeLiXin
66826e2285 Fix coverity defects: CID 147643, 152204, 49339
CID 147643: Type: String not null terminated
- make sure that the string is null terminated before strlen
  and fprintf.

CID 152204: Type: Copy into fixed size buffer
- since strlcpy isn't availabe here, use strncpy and terminate
  the string manually.

CID 49339: Type: Buffer not null terminated
- since strlcpy isn't availabe here, terminate the string
  manually before fprintf.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: GeLiXin <ge.lixin@zte.com.cn>
Closes #5283
2016-10-18 10:43:22 -07:00
Håkan Johansson
fea33e4e50 Pass status_cbdata_t to print_status_config() and friends
First rename spare_cbdata_t cb -> spare_cb in print_status_config(),
to free up cb.

Using the structure removes the explicit parameters namewidth
and name_flags from several functions.  Also use status_cbdata_t
for print_import_config().  This simplifies print_logs().

Remove the parameter 'verbose' for print_logs().  It does not really
mean verbose, it selected between the print_status_config and
print_import_config() paths.  This selection is now done by
cb_print_config of spare_cbdata_t.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Håkan Johansson <f96hajo@chalmers.se>
Closes #5259
2016-10-17 11:46:35 -07:00
cao
06cf4d9890 Fix coverity defects: CID 147606, 147609
coverity scan CID:147606, Type:resource leak
coverity scan CID:147609, Type:resource leak

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn>
Closes #5245
2016-10-12 11:16:47 -07:00
GeLiXin
8c8cf8a2d8 Fix coverity defects: CID 147639
When array is passed as a parameter it degenerates into a
pointer so the sizeof(path) in is_shorthand_path() and always
get return value of 8, instead of the string length we want.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: GeLiXin <ge.lixin@zte.com.cn>
Closes #5198
2016-10-10 15:30:22 -07:00
Brian Behlendorf
7515f8f63d Fix file permissions
The following new test cases need to have execute permissions set:

  userquota/groupspace_003_pos.ksh
  userquota/userquota_013_pos.ksh
  userquota/userspace_003_pos.ksh
  upgrade/upgrade_userobj_001_pos.ksh
  upgrade/setup.ksh
  upgrade/cleanup.ksh

The following source files accidentally were marked executable:

  lib/libzpool/kernel.c
  lib/libshare/nfs.c
  lib/libzfs/libzfs_dataset.c
  lib/libzfs/libzfs_util.c
  tests/zfs-tests/cmd/rm_lnkcnt_zero_file/rm_lnkcnt_zero_file.c
  tests/zfs-tests/cmd/dir_rd_update/dir_rd_update.c
  cmd/zed/zed_exec.c
  module/icp/core/kcf_sched.c
  module/zfs/dsl_pool.c
  module/zfs/arc.c
  module/nvpair/nvpair.c
  man/man5/zfs-module-parameters.5

Reviewed-by: GeLiXin <ge.lixin@zte.com.cn>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5241
2016-10-08 14:57:56 -07:00
Brian Behlendorf
482cd9ee69 Fletcher4: Incremental updates and ctx calculation
Fixes ABI issues with fletcher4 code, adds support for
incremental updates, and adds ztest method for testing.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Closes #5164
2016-10-07 12:44:12 -07:00
Brian Behlendorf
910a571578 Add python style checking
Introduce a make recipe for flake8 to enable python
style checking. Ensure all python scripts pass flake8.
Return an error code of 0 for arcstat.py -v and
dbufstat.py -v.  Add test cases for python scripts.

Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ian Lee <IanLee1521@gmail.com>
Closes #5230
2016-10-07 09:54:02 -07:00
Jinshan Xiong
1de321e626 Add support for user/group dnode accounting & quota
This patch tracks dnode usage for each user/group in the
DMU_USER/GROUPUSED_OBJECT ZAPs. ZAP entries dedicated to dnode
accounting have the key prefixed with "obj-" followed by the UID/GID
in string format (as done for the block accounting).
A new SPA feature has been added for dnode accounting as well as
a new ZPL version. The SPA feature must be enabled in the pool
before upgrading the zfs filesystem. During the zfs version upgrade,
a "quotacheck" will be executed by marking all dnode as dirty.

ZoL-bug-id: https://github.com/zfsonlinux/zfs/issues/3500

Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>
Signed-off-by: Johann Lombardi <johann.lombardi@intel.com>
2016-10-07 09:45:13 -07:00
Giuseppe Di Natale
e169749fc0 Correct exit code for dbufstat -v and arcstat -v
Both scripts were returning an error code of 1
when using the -v argument. -v should exit with
an error code of 0.

Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
2016-10-06 10:52:21 -07:00
Giuseppe Di Natale
0bde1f7cb5 Correct style in arcstat and arc_summary
Fix arcstat and arc_summary so they pass
flake8 python code style checks.

Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
2016-10-06 10:04:54 -07:00
Gvozden Neskovic
37f520db2d Fletcher4: Incremental using SIMD
Combine incrementally computed fletcher4 checksums. Checksums are combined
a posteriori, allowing for parallel computation on chunks to be implemented if
required. The algorithm is general, and does not add changes in each SIMD
implementation.
New test in ztest verifies incremental fletcher computations.

Checksum combining matrix for two buffers `a` and `b`, where `Ca` and `Cb` are
respective fletcher4 checksums, `Cab` is combined checksum, `s` is size of buffer
`b` (divided by sizeof(uint32_t)) is:

Cab[A] = Cb[A] + Ca[A]
Cab[B] = Cb[B] + Ca[B] + s * Ca[A]
Cab[C] = Cb[C] + Ca[C] + s * Ca[B] + s(s+1)/2 * Ca[A]
Cab[D] = Cb[D] + Ca[D] + s * Ca[C] + s(s+1)/2 * Ca[B] + s(s+1)(s+2)/6 * Ca[A]

NOTE: this calculation overflows for larger buffers. Thus, internally, the calculation
is performed on 8MiB chunks.

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
2016-10-05 16:41:46 +02:00
luozhengzheng
e2c292bbfc Fix coverity defects: CID 150953, 147603, 147610
coverity scan CID:150953,type: uninitialized scalar variable
coverity scan CID:147603,type: Resource leak
coverity scan CID:147610,type: Resource leak

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn>
Closes #5209
2016-10-04 18:15:57 -07:00
Tony Hutter
3c67d83a8a OpenZFS 4185 - add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported by: Tony Hutter <hutter2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/4185
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/45818ee

Porting Notes:
This code is ported on top of the Illumos Crypto Framework code:

    b5e030c8db

The list of porting changes includes:

- Copied module/icp/include/sha2/sha2.h directly from illumos

- Removed from module/icp/algs/sha2/sha2.c:
	#pragma inline(SHA256Init, SHA384Init, SHA512Init)

- Added 'ctx' to lib/libzfs/libzfs_sendrecv.c:zio_checksum_SHA256() since
  it now takes in an extra parameter.

- Added CTASSERT() to assert.h from for module/zfs/edonr_zfs.c

- Added skein & edonr to libicp/Makefile.am

- Added sha512.S.  It was generated from sha512-x86_64.pl in Illumos.

- Updated ztest.c with new fletcher_4_*() args; used NULL for new CTX argument.

- In icp/algs/edonr/edonr_byteorder.h, Removed the #if defined(__linux) section
  to not #include the non-existant endian.h.

- In skein_test.c, renane NULL to 0 in "no test vector" array entries to get
  around a compiler warning.

- Fixup test files:
	- Rename <sys/varargs.h> -> <varargs.h>, <strings.h> -> <string.h>,
	- Remove <note.h> and define NOTE() as NOP.
	- Define u_longlong_t
	- Rename "#!/usr/bin/ksh" -> "#!/bin/ksh -p"
	- Rename NULL to 0 in "no test vector" array entries to get around a
	  compiler warning.
	- Remove "for isa in $($ISAINFO); do" stuff
	- Add/update Makefiles
	- Add some userspace headers like stdio.h/stdlib.h in places of
	  sys/types.h.

- EXPORT_SYMBOL *_Init/*_Update/*_Final... routines in ICP modules.

- Update scripts/zfs2zol-patch.sed

- include <sys/sha2.h> in sha2_impl.h

- Add sha2.h to include/sys/Makefile.am

- Add skein and edonr dirs to icp Makefile

- Add new checksums to zpool_get.cfg

- Move checksum switch block from zfs_secpolicy_setprop() to
  zfs_check_settable()

- Fix -Wuninitialized error in edonr_byteorder.h on PPC

- Fix stack frame size errors on ARM32
  	- Don't unroll loops in Skein on 32-bit to save stack space
  	- Add memory barriers in sha2.c on 32-bit to save stack space

- Add filetest_001_pos.ksh checksum sanity test

- Add option to write psudorandom data in file_write utility
2016-10-03 14:51:15 -07:00
Romain Dolbeau
62a65a654e Add parity generation/rebuild using 128-bits NEON for Aarch64
This re-use the framework established for SSE2, SSSE3 and
AVX2. However, GCC is using FP registers on Aarch64, so
unlike SSE/AVX2 we can't rely on the registers being left alone
between ASM statements. So instead, the NEON code uses
C variables and GCC extended ASM syntax. Note that since
the kernel explicitly disable vector registers, they
have to be locally re-enabled explicitly.

As we use the variable's number to define the symbolic
name, and GCC won't allow duplicate symbolic names,
numbers have to be unique. Even when the code is not
going to be used (e.g. the case for 4 registers when
using the macro with only 2). Only the actually used
variables should be declared, otherwise the build
will fails in debug mode.

This requires the replacement of the XOR(X,X) syntax
by a new ZERO(X) macro, which does the same thing but
without repeating the argument. And perhaps someday
there will be a machine where there is a more efficient
way to zero a register than XOR with itself. This affects
scalar, SSE2, SSSE3 and AVX2 as they need the new macro.

It's possible to write faster implementations (different
scheduling, different unrolling, interleaving NEON and
scalar, ...) for various cores, but this one has the
advantage of fitting in the current state of the code,
and thus is likely easier to review/check/merge.

The only difference between aarch64-neon and aarch64-neonx2
is that aarch64-neonx2 unroll some functions some more.

Reviewed-by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Romain Dolbeau <romain.dolbeau@atos.net>
Closes #4801
2016-10-03 09:44:00 -07:00
luozhengzheng
aecdc70604 Fix coverity defects: CID 147448, 147449, 147450, 147453, 147454
coverity scan CID:147448,type: unchecked return value
coverity scan CID:147449,type: unchecked return value
coverity scan CID:147450,type: unchecked return value
coverity scan CID:147453,type: unchecked return value
coverity scan CID:147454,type: unchecked return value

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn>
Closes #5206
2016-10-02 11:24:54 -07:00
GeLiXin
ed3ea30fb9 Fix coverity defects: CID 147536, 147537, 147538
coverity scan CID:147536, type: Argument cannot be negative
- may write or close fd which is negative
coverity scan CID:147537, type: Argument cannot be negative
- may call dup2 with a negative fd
coverity scan CID:147538, type: Argument cannot be negative
- may read or fchown with a negative fd

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: GeLiXin <ge.lixin@zte.com.cn>
Closes #5185
2016-09-30 15:40:07 -07:00
Gvozden Neskovic
292d573e70 raidz_test: respect wall time
When timeout is specified (-t), stop worker threads in the middle of work units.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Issue #5180 
Closes #5190
2016-09-30 15:19:51 -07:00
BearBabyLiu
0b78aeae92 Fix coverity defects: CID 147443, 147656, 147655, 147441, 147653
coverity scan CID:147443, Type: Buffer not null terminated
coverity scan CID:147656, Type: Copy into fixed size buffer
coverity scan CID:147655, Type: Copy into fixed size buffer
coverity scan CID:147441, Type: Buffer not null terminated
coverity scan CID:147653, Type: Copy into fixed size buffer

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: liuhuang <liu.huang@zte.com.cn>
Closes #5165
2016-09-29 13:33:09 -07:00
cao
9ec0403d6a Fix coverity defects: CID 147610, 147608, 147607
coverity scan CID:147610, Type: Resource leak.
coverity scan CID:147608, Type: Resource leak.
coverity scan CID:147607, Type: Resource leak.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn>
Closes #5143
2016-09-29 12:11:44 -07:00
cao
92bc927868 Fix coverity defects: CID 147602 147604
coverity scan CID:147604, Type: Resource leak.
coverity scan CID:147602, Type: Resource leak.
reason: safe_malloc calcvs, goto children but not free calcvs.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn>
Closes #5155
2016-09-23 15:43:46 -07:00
luozhengzheng
d0662a1beb Fix coverity defects: CID 147613 147614 147616 147617
coverity scan CID:147617,type: resource leaks
coverity scan CID:147616,type: resource leaks
coverity scan CID:147614,type: resource leaks
coverity scan CID:147613,type: resource leaks

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn>
Closes #5150
2016-09-23 09:10:50 -07:00
luozhengzheng
5df39c1e43 Fix coverity defects
1.coverity scan CID:147445 function zfs_do_send in zfs_main.c
Buffer not null terminated (BUFFER_SIZE_WARNING)

2.coverity scan CID:147443 function zfs_do_bookmark in zfs_main.c
Buffer not null terminated (BUFFER_SIZE_WARNING)

3.coverity scan CID:147660 function main in zinject.c
Passing string argv[0] of unknown size to strcpy
By the way, the leak of g_zfs is fixed.

4.coverity scan CID: 147442 function make_disks in zpool_vdev.c
Buffer not null terminated (BUFFER_SIZE_WARNING)

5.coverity scan CID: 147661 function main in dir_rd_update.c
passing string cp1 of unknown size to strcpy

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn>
Closes #5130
2016-09-22 15:55:41 -07:00
cao
884385a0b2 Fix coverity defects
Fix coverity defects:
coverity scan CID:147623, Type: Resource leak.
coverity scan CID:147622, Type: Resource leak.
reason: zpool_open zhp, but not zpool_close zhp. so resource leak.

coverity scan CID:147621, Type: Resource fd leak.
coverity scan CID:147620, Type: Resource fd leak.
reason: do_write do_read open file fd,but exception not close fd.

delete unuse definition DMU_OS_IS_L2COMPRESSIBLE.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn>
Closes #5137
2016-09-20 17:45:45 -07:00
slashdd
792517389f Change /etc/mtab to /proc/self/mounts
Fix misleading error message:

 "The /dev/zfs device is missing and must be created.", if /etc/mtab is missing.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Eric Desrochers <eric.desrochers@canonical.com>
Closes #4680 
Closes #5029
2016-09-20 10:07:58 -07:00
luozhengzheng
30f3f2e13c Fix Coverity defects
CID 147659, 150952 and 147645

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn>
Closes #5103
2016-09-17 15:08:54 -07:00
Dan Kimmel
2aa34383b9 DLPX-40252 integrate EP-476 compressed zfs send/receive
Authored by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported by: David Quigley <david.quigley@intel.com>
Issue #5078
2016-09-13 09:58:58 -07:00
George Wilson
d3c2ae1c08 OpenZFS 6950 - ARC should cache compressed data
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported by: David Quigley <david.quigley@intel.com>

This review covers the reading and writing of compressed arc headers, sharing
data between the arc_hdr_t and the arc_buf_t, and the implementation of a new
dbuf cache to keep frequently access data uncompressed.

I've added a new member to l1 arc hdr called b_pdata. The b_pdata always hangs
off the arc_buf_hdr_t (if an L1 hdr is in use) and points to the physical block
for that DVA. The physical block may or may not be compressed. If compressed
arc is enabled and the block on-disk is compressed, then the b_pdata will match
the block on-disk and remain compressed in memory. If the block on disk is not
compressed, then neither will the b_pdata. Lastly, if compressed arc is
disabled, then b_pdata will always be an uncompressed version of the on-disk
block.

Typically the arc will cache only the arc_buf_hdr_t and will aggressively evict
any arc_buf_t's that are no longer referenced. This means that the arc will
primarily have compressed blocks as the arc_buf_t's are considered overhead and
are always uncompressed. When a consumer reads a block we first look to see if
the arc_buf_hdr_t is cached. If the hdr is cached then we allocate a new
arc_buf_t and decompress the b_pdata contents into the arc_buf_t's b_data. If
the hdr already has a arc_buf_t, then we will allocate an additional arc_buf_t
and bcopy the uncompressed contents from the first arc_buf_t to the new one.

Writing to the compressed arc requires that we first discard the b_pdata since
the physical block is about to be rewritten. The new data contents will be
passed in via an arc_buf_t (uncompressed) and during the I/O pipeline stages we
will copy the physical block contents to a newly allocated b_pdata.

When an l2arc is inuse it will also take advantage of the b_pdata. Now the
l2arc will always write the contents of b_pdata to the l2arc. This means that
when compressed arc is enabled that the l2arc blocks are identical to those
stored in the main data pool. This provides a significant advantage since we
can leverage the bp's checksum when reading from the l2arc to determine if the
contents are valid. If the compressed arc is disabled, then we must first
transform the read block to look like the physical block in the main data pool
before comparing the checksum and determining it's valid.

OpenZFS-issue: https://www.illumos.org/issues/6950
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7fc10f0
Issue #5078
2016-09-13 09:58:33 -07:00
luozhengzheng
a425f5bff9 Fix memleak in zfs_do_* and zpool_do_*
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn>
Closes #5056
2016-09-12 13:02:47 -07:00
loli10K
e33da554c5 Allow ZVOL bookmarks to be listed recursively
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #4503
Closes #5072
2016-09-12 12:40:53 -07:00
liuhuang
bd296705d5 Fix memory/fd leak in check_file() and is_spare()
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: liuhuang <liu.huang@zte.com.cn>
Closes #5085
2016-09-12 09:44:26 -07:00
Don Brady
d02ca37979 Bring over illumos ZFS FMA logic -- phase 1
This first phase brings over the ZFS SLM module, zfs_mod.c, to handle
auto operations in response to disk events. Disk event monitoring is
provided from libudev and generates the expected payload schema for
zfs_mod. This work leverages the recently added devid and phys_path
strings in the vdev label.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@intel.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #4673
2016-09-01 11:39:45 -07:00
Brian Behlendorf
9d69e9b268 Fix zhack argument processing
The argument processing is zhack makes the assumption that getopt()
will not permute argv.  This isn't true for the GNU implementation of
getopt() unless the optstring is prefixed with a '+'.  In which case
this is equivalent to setting the POSIXLY_CORRECT environment variable

In addition, update the usage() and optstrings to reflect the existing
supported options.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: liaoyuxiangqin <guo.yong33@zte.com.cn>
Closes #5047
2016-08-31 14:32:46 -07:00
Gvozden Neskovic
f0c26069bd zdb: fencepost error at zdb_cb.zcb_embedded_histogram[][]
Erroneous access detected by gcc UndefinedBehaviorSanitizer:
`zdb.c:2424:7: runtime error: index 112 out of bounds for type 'uint64_t [112]'`

Fix: increase histogram size by 1 to accommodate all possible sizes.

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4934
Issue #4883
2016-08-16 14:53:09 -07:00
Hans Rosenfeld
fb390aafc8 OpenZFS 5997 - FRU field not set during pool creation and never updated
Authored by: Hans Rosenfeld <hans.rosenfeld@nexenta.com>
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Signed-off-by: Don Brady <don.brady@intel.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/5997
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1437283

Porting Notes:

In addition to the OpenZFS changes this patch realigns the events
with those found in OpenZFS.

Events which would be logged as sysevents on illumos have been
been mapped to the 'sysevent' class for Linux.  In addition, several
subclass names have been changed to match what is used in OpenZFS.
In all cases this means a '.' was changed to an '_' in the subclass.

The scripts provided by ZoL have been updated, however users which
provide scripts for any of the following events will need to rename
them based on the new subclass names.

  ereport.fs.zfs.config.sync         sysevent.fs.zfs.config_sync
  ereport.fs.zfs.zpool.destroy       sysevent.fs.zfs.pool_destroy
  ereport.fs.zfs.zpool.reguid        sysevent.fs.zfs.pool_reguid
  ereport.fs.zfs.vdev.remove         sysevent.fs.zfs.vdev_remove
  ereport.fs.zfs.vdev.clear          sysevent.fs.zfs.vdev_clear
  ereport.fs.zfs.vdev.check          sysevent.fs.zfs.vdev_check
  ereport.fs.zfs.vdev.spare          sysevent.fs.zfs.vdev_spare
  ereport.fs.zfs.vdev.autoexpand     sysevent.fs.zfs.vdev_autoexpand
  ereport.fs.zfs.resilver.start      sysevent.fs.zfs.resilver_start
  ereport.fs.zfs.resilver.finish     sysevent.fs.zfs.resilver_finish
  ereport.fs.zfs.scrub.start         sysevent.fs.zfs.scrub_start
  ereport.fs.zfs.scrub.finish        sysevent.fs.zfs.scrub_finish
  ereport.fs.zfs.bootfs.vdev.attach  sysevent.fs.zfs.bootfs_vdev_attach
2016-08-12 13:06:48 -07:00
Chunwei Chen
b320dd91a9 Fix infinite loop when zdb -R with d flag
Also print decompress progress to stderr so it wouldn't pollute raw output
with r flag.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4956
2016-08-11 15:21:32 -07:00
Brian Behlendorf
f3c9cac143 Fix gcc -Warray-bounds check for dump_object() in zdb
As of gcc 6.1.1 20160621 (Red Hat 6.1.1-3) an array bounds warnings
is detected in the zdb the dump_object() function.  The analysis is
correct but difficult to interpret because this is implemented as a
macro.  Rework the ZDB_OT_NAME in to a function and remove the case
detected by gcc which is a side effect of the DMU_OT_IS_VALID() macro.

  zdb.c: In function ‘dump_object’:
  zdb.c:1931:288: error: array subscript is outside array bounds
      [-Werror=array-bounds]

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Closes #4907
2016-08-02 13:14:47 -07:00
Gvozden Neskovic
df053d67a9 ztest: memory leaks reported by AddressSanitizer
Leaks reported by using AddressSanitizer, GCC 6.1.0

Direct leak of 4097 byte(s) in 1 object(s) allocated from:
    #1 0x414f73 in process_options cmd/ztest/ztest.c:721

Direct leak of 5440 byte(s) in 17 object(s) allocated from:
    #1 0x41bfd5 in umem_alloc ../../lib/libspl/include/umem.h:88
    #2 0x41bfd5 in ztest_zap_parallel cmd/ztest/ztest.c:4659
    #3 0x4163a8 in ztest_execute cmd/ztest/ztest.c:5907

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4896
2016-07-29 15:34:12 -07:00
Brian Behlendorf
fcf64f45d9 Fix zdb crash with 4K-only devices
Here's the problem - on 4K native devices in userland on
Linux using O_DIRECT, buffers must be 4K aligned or I/O
will fail with EINVAL, causing zdb (and others) to coredump.
Since userland probably doesn't need optimized buffer caches,
we just force 4K alignment on everything.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Closes #4479
2016-07-27 13:38:46 -07:00
Brian Behlendorf
8a39abaafa Multi-thread 'zpool import' for blkid
Commit 519129f added support to multi-thread 'zpool import' for
the case where block devices are scanned for under /dev/.  This
commit generalizes that logic and applies it to the case where
device names are acquired from libblkid.

The zpool_find_import_scan() and zpool_find_import_blkid()
functions create an AVL tree containing each device name.  Each
entry in this tree is dispatched to a taskq where the function
zpool_open_func() validates the device by opening it and reading
the label.  This may result in additional entries being added
to the tree and those device paths being verified.

This is largely how the upstream OpenZFS code behaves but due to
significant differences the non-Linux code has been dropped for
readability.  Additionally, this code makes use of taskqs and
kmutexs which are normally not available to the command line tools.
Special care has been taken to allow their use in the import
functions.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #4794
2016-07-27 13:38:46 -07:00
Gvozden Neskovic
a64f903b06 Fixes for issues found with cppcheck tool
The patch fixes small number of errors/false positives reported by `cppcheck`,
static analysis tool for C/C++.

cppcheck 1.72

$ cppcheck . --force --quiet
[cmd/zfs/zfs_main.c:4444]: (error) Possible null pointer dereference: who_perm
[cmd/zfs/zfs_main.c:4445]: (error) Possible null pointer dereference: who_perm
[cmd/zfs/zfs_main.c:4446]: (error) Possible null pointer dereference: who_perm
[cmd/zpool/zpool_iter.c:317]: (error) Uninitialized variable: nvroot
[cmd/zpool/zpool_vdev.c:1526]: (error) Memory leak: child
[lib/libefi/rdwr_efi.c:1118]: (error) Memory leak: efi_label
[lib/libuutil/uu_misc.c:207]: (error) va_list 'args' was opened but not closed by va_end().
[lib/libzfs/libzfs_import.c:1554]: (error) Dangerous usage of 'diskname' (strncpy doesn't always null-terminate it).
[lib/libzfs/libzfs_sendrecv.c:3279]: (error) Dereferencing 'cp' after it is deallocated / released
[tests/zfs-tests/cmd/file_write/file_write.c:154]: (error) Possible null pointer dereference: operation
[tests/zfs-tests/cmd/randfree_file/randfree_file.c:90]: (error) Memory leak: buf
[cmd/zinject/zinject.c:1068]: (error) Uninitialized variable: dataset
[module/icp/io/sha2_mod.c:698]: (error) Uninitialized variable: blocks_per_int64

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1392
2016-07-27 13:31:22 -07:00
Gvozden Neskovic
c9187d867f Fixes and enhancements of SIMD raidz parity
- Implementation lock replaced with atomic variable

- Trailing whitespace is removed from user specified parameter, to enhance
experience when using commands that add newline, e.g. `echo`

- raidz_test: remove dependency on `getrusage()` and RUSAGE_THREAD, Issue #4813

- silence `cppcheck` in vdev_raidz, partial solution of Issue #1392

- Minor fixes and cleanups

- Enable use of original parity methods in [fastest] configuration.
New opaque original ops structure, representing native methods, is added
to supported raidz methods. Original parity methods are executed if selected
implementation has NULL fn pointer.

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4813
Issue #1392
2016-07-19 16:43:07 -07:00
Brian Behlendorf
b756ff2445 Update arc_summary.py for prefetch changes
Commit 7f60329 removed several kstats which arc_summary.py read.
Remove these kstats from arc_summary.py in the same way this was
handled in FreeNAS.

FreeNAS-commit: https://github.com/freenas/freenas/commit/3901f73

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4695
2016-07-19 09:12:41 -07:00
Gvozden Neskovic
ae25d22235 Add RAID-Z routines for SSE2 instruction set, in x86_64 mode.
The patch covers low-end and older x86 CPUs.  Parity generation is
equivalent to SSSE3 implementation, but reconstruction is somewhat
slower.  Previous 'sse' implementation is renamed to 'ssse3' to
indicate highest instruction set used.

Benchmark results:
scalar_rec_p                    4    720476442
scalar_rec_q                    4    187462804
scalar_rec_r                    4    138996096
scalar_rec_pq                   4    140834951
scalar_rec_pr                   4    129332035
scalar_rec_qr                   4    81619194
scalar_rec_pqr                  4    53376668

sse2_rec_p                      4    2427757064
sse2_rec_q                      4    747120861
sse2_rec_r                      4    499871637
sse2_rec_pq                     4    522403710
sse2_rec_pr                     4    464632780
sse2_rec_qr                     4    319124434
sse2_rec_pqr                    4    205794190

ssse3_rec_p                     4    2519939444
ssse3_rec_q                     4    1003019289
ssse3_rec_r                     4    616428767
ssse3_rec_pq                    4    706326396
ssse3_rec_pr                    4    570493618
ssse3_rec_qr                    4    400185250
ssse3_rec_pqr                   4    377541245

original_rec_p                  4    691658568
original_rec_q                  4    195510948
original_rec_r                  4    26075538
original_rec_pq                 4    103087368
original_rec_pr                 4    15767058
original_rec_qr                 4    15513175
original_rec_pqr                4    10746357

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4783
2016-07-13 10:24:55 -07:00
Igor Kozhukhov
eca7b76001 OpenZFS 6314 - buffer overflow in dsl_dataset_name
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6314
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d6160ee
2016-06-28 13:47:03 -07:00
Matthew Ahrens
47dfff3b86 OpenZFS 2605, 6980, 6902
2605 want to resume interrupted zfs send
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Xin Li <delphij@freebsd.org>
Reviewed by: Arne Jansen <sensille@gmx.net>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: kernelOfTruth <kerneloftruth@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/2605
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/9c3fd12

6980 6902 causes zfs send to break due to 32-bit/64-bit struct mismatch
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6980
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ea4a67f

Porting notes:
- All rsend and snapshop tests enabled and updated for Linux.
- Fix misuse of input argument in traverse_visitbp().
- Fix ISO C90 warnings and errors.
- Fix gcc 'missing braces around initializer' in
  'struct send_thread_arg to_arg =' warning.
- Replace 4 argument fletcher_4_native() with 3 argument version,
  this change was made in OpenZFS 4185 which has not been ported.
- Part of the sections for 'zfs receive' and 'zfs send' was
  rewritten and reordered to approximate upstream.
- Fix mktree xattr creation, 'user.' prefix required.
- Minor fixes to newly enabled test cases
- Long holds for volumes allowed during receive for minor registration.
2016-06-28 13:47:02 -07:00
Ned Bass
50c957f702 Implement large_dnode pool feature
Justification
-------------

This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks.  Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided.  Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks.  Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.

ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.

Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.

Implementation
--------------

The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.

Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.

The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run

 # zfs set dnodesize=auto tank/fish

The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.

The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.

New DMU interfaces:
  dmu_object_alloc_dnsize()
  dmu_object_claim_dnsize()
  dmu_object_reclaim_dnsize()

New ZAP interfaces:
  zap_create_dnsize()
  zap_create_norm_dnsize()
  zap_create_flags_dnsize()
  zap_create_claim_norm_dnsize()
  zap_create_link_dnsize()

The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.

These are a few noteworthy changes to key functions:

* The prototype for dnode_hold_impl() now takes a "slots" parameter.
  When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
  ensure the hole at the specified object offset is large enough to
  hold the dnode being created. The slots parameter is also used
  to ensure a dnode does not span multiple dnode blocks. In both of
  these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
  these failure cases are only possible when using DNODE_MUST_BE_FREE.

  If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
  dnode_hold_impl() will check if the requested dnode is already
  consumed as an extra dnode slot by an large dnode, in which case
  it returns ENOENT.

* The function dmu_object_alloc() advances to the next dnode block
  if dnode_hold_impl() returns an error for a requested object.
  This is because the beginning of the next dnode block is the only
  location it can safely assume to either be a hole or a valid
  starting point for a dnode.

* dnode_next_offset_level() and other functions that iterate
  through dnode blocks may no longer use a simple array indexing
  scheme. These now use the current dnode's dn_num_slots field to
  advance to the next dnode in the block. This is to ensure we
  properly skip the current dnode's bonus area and don't interpret it
  as a valid dnode.

zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.

For ZIL create log records, zdb will now display the slot count for
the object.

ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.

Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number.  This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.

ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.

Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.

While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.

For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.

ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.

Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.

Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-06-24 13:13:21 -07:00
Brian Behlendorf
391bba1d28 Revert "Add a test case for dmu_free_long_range() to ztest"
This reverts commit d0de2e82df which
introduced a new test case to ztest which is failing occasionally
during automated testing.  The change is being reverted until
the issue can be fully investigated.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4754
2016-06-24 12:48:08 -07:00
Boris Protopopov
d0de2e82df Add a test case for dmu_free_long_range() to ztest
Signed-off-by: Boris Protopopov <bprotopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4754
2016-06-21 10:59:28 -07:00
Gvozden Neskovic
ab9f4b0b82 SIMD implementation of vdev_raidz generate and reconstruct routines
This is a new implementation of RAIDZ1/2/3 routines using x86_64
scalar, SSE, and AVX2 instruction sets. Included are 3 parity
generation routines (P, PQ, and PQR) and 7 reconstruction routines,
for all RAIDZ level. On module load, a quick benchmark of supported
routines will select the fastest for each operation and they will
be used at runtime. Original implementation is still present and
can be selected via module parameter.

Patch contains:
- specialized gen/rec routines for all RAIDZ levels,
- new scalar raidz implementation (unrolled),
- two x86_64 SIMD implementations (SSE and AVX2 instructions sets),
- fastest routines selected on module load (benchmark).
- cmd/raidz_test - verify and benchmark all implementations
- added raidz_test to the ZFS Test Suite

New zfs module parameters:
- zfs_vdev_raidz_impl (str): selects the implementation to use. On
  module load, the parameter will only accept first 3 options, and
  the other implementations can be set once module is finished
  loading. Possible values for this option are:
    "fastest" - use the fastest math available
    "original" - use the original raidz code
    "scalar" - new scalar impl
    "sse" - new SSE impl if available
    "avx2" - new AVX2 impl if available

See contents of `/sys/module/zfs/parameters/zfs_vdev_raidz_impl` to
get the list of supported values. If an implementation is not supported
on the system, it will not be shown. Currently selected option is
enclosed in `[]`.

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4328
2016-06-21 09:27:26 -07:00
Brian Behlendorf
f74b821a66 Add zfs allow and zfs unallow support
ZFS allows for specific permissions to be delegated to normal users
with the `zfs allow` and `zfs unallow` commands.  In addition, non-
privileged users should be able to run all of the following commands:

  * zpool [list | iostat | status | get]
  * zfs [list | get]

Historically this functionality was not available on Linux.  In order
to add it the secpolicy_* functions needed to be implemented and mapped
to the equivalent Linux capability.  Only then could the permissions on
the `/dev/zfs` be relaxed and the internal ZFS permission checks used.

Even with this change some limitations remain.  Under Linux only the
root user is allowed to modify the namespace (unless it's a private
namespace).  This means the mount, mountpoint, canmount, unmount,
and remount delegations cannot be supported with the existing code.  It
may be possible to add this functionality in the future.

This functionality was validated with the cli_user and delegation test
cases from the ZFS Test Suite.  These tests exhaustively verify each
of the supported permissions which can be delegated and ensures only
an authorized user can perform it.

Two minor bug fixes were required for test-running.py.  First, the
Timer() object cannot be safely created in a `try:` block when there
is an unconditional `finally` block which references it.  Second,
when running as a normal user also check for scripts using the
both the .ksh and .sh suffixes.

Finally, existing users who are simulating delegations by setting
group permissions on the /dev/zfs device should revert that
customization when updating to a version with this change.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #362 
Closes #434 
Closes #4100
Closes #4394 
Closes #4410 
Closes #4487
2016-06-07 09:16:52 -07:00
Colin Ian King
2627e75245 Fix minor spelling mistakes
Trivial spelling mistake fix in error message text.

* Fix spelling mistake "adminstrator" -> "administrator"
* Fix spelling mistake "specificed" -> "specified"
* Fix spelling mistake "interperted" -> "interpreted"

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4728
2016-06-06 14:45:26 -07:00
Jinshan Xiong
1eeb4562a7 Implementation of AVX2 optimized Fletcher-4
New functionality:
- Preserves existing scalar implementation.
- Adds AVX2 optimized Fletcher-4 computation.
- Fastest routines selected on module load (benchmark).
- Test case for Fletcher-4 added to ztest.

New zcommon module parameters:
-  zfs_fletcher_4_impl (str): selects the implementation to use.
    "fastest" - use the fastest version available
    "cycle"   - cycle trough all available impl for ztest
    "scalar"  - use the original version
    "avx2"    - new AVX2 implementation if available

Performance comparison (Intel i7 CPU, 1MB data buffers):
- Scalar:  4216 MB/s
- AVX2:   14499 MB/s

See contents of `/sys/module/zcommon/parameters/zfs_fletcher_4_impl`
to get list of supported values. If an implementation is not supported
on the system, it will not be shown. Currently selected option is
enclosed in `[]`.

Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>
Signed-off-by: Andreas Dilger <andreas.dilger@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4330
2016-06-02 14:30:51 -07:00
Tony Hutter
26ef0cc7db OpenZFS 6531 - Provide mechanism to artificially limit disk performance
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6531
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/97e8130

Porting notes:
- Added new IO delay tracepoints, and moved common ZIO tracepoint macros
  to a new trace_common.h file.
- Used zio_delay_taskq() in place of OpenZFS's timeout_generic() function.
- Updated zinject man page
- Updated zpool_scrub test files
2016-05-26 10:11:51 -07:00
Tony Hutter
7e945072d1 Add request size histograms (-r) to zpool iostat, minor man page fix
Add -r option to "zpool iostat" to print request size histograms for the leaf
ZIOs. This includes histograms of individual ZIOs ("ind") and aggregate ZIOs
("agg"). These stats can be useful for seeing how well the ZFS IO aggregator
is working.

$ zpool iostat -r
mypool        sync_read    sync_write    async_read    async_write      scrub
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512             0      0      0      0      0      0    530      0      0      0
1K              0      0    260      0      0      0    116    246      0      0
2K              0      0      0      0      0      0      0    431      0      0
4K              0      0      0      0      0      0      3    107      0      0
8K             15      0     35      0      0      0      0      6      0      0
16K             0      0      0      0      0      0      0     39      0      0
32K             0      0      0      0      0      0      0      0      0      0
64K            20      0     40      0      0      0      0      0      0      0
128K            0      0     20      0      0      0      0      0      0      0
256K            0      0      0      0      0      0      0      0      0      0
512K            0      0      0      0      0      0      0      0      0      0
1M              0      0      0      0      0      0      0      0      0      0
2M              0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0    155     19      0      0
8M              0      0      0      0      0      0      0    811      0      0
16M             0      0      0      0      0      0      0     68      0      0
--------------------------------------------------------------------------------

Also rename the stray "-G" in the man page to be "-w" for latency histograms.

Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #4659
2016-05-25 15:49:35 -07:00
Christer Ekholm
3491d6eb06 Consistently use parsable instead of parseable
This is a purely cosmetical change, to consistently prefer one of
two (both acceptable) choises for the word parsable in documentation and
code. I don't really care which to use, but acording to wiktionary
https://en.wiktionary.org/wiki/parsable#English parsable is preferred.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4682
2016-05-23 10:20:42 -07:00
Christer Ekholm
bc2d809387 Make zpool list -vp print individual vdev sizes parsable.
Add argument format to print_one_column(), and use it to call
zfs_nicenum_format with, instead of just zfs_nicenum. Don't print "%"
for fragmentation or capacity percent values.

The calls to print_one_colum is made with ZFS_NICENUM_RAW if
cb->cb_literal (zpool list called with -p), and ZFS_NICENUM_1024 if not.

Also zpool_get_prop is modified to don't add "%" or "x" if literal.

Signed-off-by: Christer Ekholm <che@chrekh.se>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov
Closes #4657
2016-05-18 10:15:32 -07:00
Brian Behlendorf
ada8258141 Revert "zhack: Add 'feature disable' command"
This reverts commit 8302528617 and
ebecfcd699 which broke the build.
While these patches do apply cleanly and passed previous test
runs they need to be updated to account for the changes made in
commit 241b541574.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3878
2016-05-17 11:52:07 -07:00
Brian Behlendorf
ebecfcd699 zhack: Add 'feature enable' force option
Add a force option to allow zhack to add features which are
part of the known set of supported features.  By default
this is disabled.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3878
2016-05-17 11:00:42 -07:00
Brian Behlendorf
8302528617 zhack: Add 'feature disable' command
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3878
2016-05-17 11:00:21 -07:00
Brian Behlendorf
72c407f827 zhack: Fix 'feature ref' comparison check
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3878
2016-05-17 10:59:37 -07:00
Boris Protopopov
e3a07cd033 Use zfs range locks in ztest
The zfs range lock interface no longer tightly depends on a
znode_t and therefore can be used in ztest.  This allows the
previous ztest specific implementation to be removed, and for
additional test coverage of the shared version.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4023
Issue #4024
2016-05-17 10:40:30 -07:00
Tony Hutter
193a37cb24 Add -lhHpw options to "zpool iostat" for avg latency, histograms, & queues
Update the zfs module to collect statistics on average latencies, queue sizes,
and keep an internal histogram of all IO latencies.  Along with this, update
"zpool iostat" with some new options to print out the stats:

-l: Include average IO latencies stats:

 total_wait     disk_wait    syncq_wait    asyncq_wait  scrub
 read  write   read  write   read  write   read  write   wait
-----  -----  -----  -----  -----  -----  -----  -----  -----
    -   41ms      -    2ms      -   46ms      -    4ms      -
    -    5ms      -    1ms      -    1us      -    4ms      -
    -    5ms      -    1ms      -    1us      -    4ms      -
    -      -      -      -      -      -      -      -      -
    -   49ms      -    2ms      -   47ms      -      -      -
    -      -      -      -      -      -      -      -      -
    -    2ms      -    1ms      -      -      -    1ms      -
-----  -----  -----  -----  -----  -----  -----  -----  -----
  1ms    1ms    1ms  413us   16us   25us      -    5ms      -
  1ms    1ms    1ms  413us   16us   25us      -    5ms      -
  2ms    1ms    2ms  412us   26us   25us      -    5ms      -
    -    1ms      -  413us      -   25us      -    5ms      -
    -    1ms      -  460us      -   29us      -    5ms      -
196us    1ms  196us  370us    7us   23us      -    5ms      -
-----  -----  -----  -----  -----  -----  -----  -----  -----

-w: Print out latency histograms:

sdb           total           disk         sync_queue      async_queue
latency    read   write    read   write    read   write    read   write   scrub
-------  ------  ------  ------  ------  ------  ------  ------  ------  ------
1ns           0       0       0       0       0       0       0       0       0
...
33us          0       0       0       0       0       0       0       0       0
66us          0       0     107    2486       2     788      12      12       0
131us         2     797     359    4499      10     558     184     184       6
262us        22     801     264    1563      10     286     287     287      24
524us        87     575      71   52086      15    1063     136     136      92
1ms         152    1190       5   41292       4    1693     252     252     141
2ms         245    2018       0   50007       0    2322     371     371     220
4ms         189    7455      22  162957       0    3912    6726    6726     199
8ms         108    9461       0  102320       0    5775    2526    2526      86
17ms         23   11287       0   37142       0    8043    1813    1813      19
34ms          0   14725       0   24015       0   11732    3071    3071       0
67ms          0   23597       0    7914       0   18113    5025    5025       0
134ms         0   33798       0     254       0   25755    7326    7326       0
268ms         0   51780       0      12       0   41593   10002   10002       0
537ms         0   77808       0       0       0   64255   13120   13120       0
1s            0  105281       0       0       0   83805   20841   20841       0
2s            0   88248       0       0       0   73772   14006   14006       0
4s            0   47266       0       0       0   29783   17176   17176       0
9s            0   10460       0       0       0    4130    6295    6295       0
17s           0       0       0       0       0       0       0       0       0
34s           0       0       0       0       0       0       0       0       0
69s           0       0       0       0       0       0       0       0       0
137s          0       0       0       0       0       0       0       0       0
-------------------------------------------------------------------------------

-h: Help

-H: Scripted mode. Do not display headers, and separate fields by a single
    tab instead of arbitrary space.

-q: Include current number of entries in sync & async read/write queues,
    and scrub queue:

 syncq_read    syncq_write   asyncq_read  asyncq_write   scrubq_read
 pend  activ   pend  activ   pend  activ   pend  activ   pend  activ
-----  -----  -----  -----  -----  -----  -----  -----  -----  -----
    0      0      0      0     78     29      0      0      0      0
    0      0      0      0     78     29      0      0      0      0
    0      0      0      0      0      0      0      0      0      0
    -      -      -      -      -      -      -      -      -      -
    0      0      0      0      0      0      0      0      0      0
    -      -      -      -      -      -      -      -      -      -
    0      0      0      0      0      0      0      0      0      0
-----  -----  -----  -----  -----  -----  -----  -----  -----  -----
    0      0    227    394      0     19      0      0      0      0
    0      0    227    394      0     19      0      0      0      0
    0      0    108     98      0     19      0      0      0      0
    0      0     19     98      0      0      0      0      0      0
    0      0     78     98      0      0      0      0      0      0
    0      0     19     88      0      0      0      0      0      0
-----  -----  -----  -----  -----  -----  -----  -----  -----  -----

-p: Display numbers in parseable (exact) values.

Also, update iostat syntax to allow the user to specify specific vdevs
to show statistics for.  The three options for choosing pools/vdevs are:

Display a list of pools:
    zpool iostat ... [pool ...]

Display a list of vdevs from a specific pool:
    zpool iostat ... [pool vdev ...]

Display a list of vdevs from any pools:
    zpool iostat ... [vdev ...]

Lastly, allow zpool command "interval" value to be floating point:
    zpool iostat -v 0.5

Signed-off-by: Tony Hutter <hutter2@llnl.gov
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4433
2016-05-12 12:36:32 -07:00
Adam Stevko
2a8b84b747 OpenZFS 3993, 4700
3993 zpool(1M) and zfs(1M) should support -p for "list" and "get"
4700 "zpool get" doesn't support -H or -o options

Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/3993
OpenZFS-issue: https://www.illumos.org/issues/4700
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c58b352

Porting notes:
I removed ZoL's zpool_get_prop_literal() in favor of
zpool_get_prop(..., boolean_t literal) since that's what OpenZFS
uses.  The functionality is the same.
2016-05-11 11:49:37 -07:00
Brian Behlendorf
3ac2794cbf OpenZFS 5669 - altroot not set in zpool create
5669 altroot not set in zpool create when specified with -o
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/5669
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c423721
Closes #4594
2016-05-05 09:27:55 -07:00
Josef 'Jeff' Sipek
8a5fc74880 Illumos 6659 - nvlist_free(NULL) is a no-op
6659 nvlist_free(NULL) is a no-op
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: Marcel Telka <marcel@telka.sk>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/6659
  https://github.com/illumos/illumos-gate/commit/aab83bb

Ported-by: David Quigley <dpquigl@davequigley.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4566
2016-04-27 15:58:23 -07:00
Brian Behlendorf
2d82ea8b11 Use udev for partition detection
When ZFS partitions a block device it must wait for udev to create
both a device node and all the device symlinks.  This process takes
a variable length of time and depends on factors such how many links
must be created, the complexity of the rules, etc.  Complicating
the situation further it is not uncommon for udev to create and
then remove a link multiple times while processing the udev rules.

Given the above, the existing scheme of waiting for an expected
partition to appear by name isn't 100% reliable.  At this point
udev may still remove and recreate think link resulting in the
kernel modules being unable to open the device.

In order to address this the zpool_label_disk_wait() function
has been updated to use libudev.  Until the registered system
device acknowledges that it in fully initialized the function
will wait.  Once fully initialized all device links are checked
and allowed to settle for 50ms.  This makes it far more likely
that all the device nodes will exist when the kernel modules
need to open them.

For systems without libudev an alternate zpool_label_disk_wait()
was updated to include a settle time.  In addition, the kernel
modules were updated to include retry logic for this ENOENT case.
Due to the improved checks in the utilities it is unlikely this
logic will be invoked.  However, if the rare event it is needed
it will prevent a failure.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #4523
Closes #3708
Closes #4077
Closes #4144
Closes #4214
Closes #4517
2016-04-25 11:13:20 -07:00
Brian Behlendorf
5b4136bd49 Create unique partition labels
When partitioning a device a name may be specified for each partition.
Internally zfs doesn't use this partition name for anything so it
has always just been set to "zfs".

However this isn't optimal because udev will create symlinks using
this name in /dev/disk/by-partlabel/.  If the name isn't unique
then all the links cannot be created.

Therefore a random 64-bit value has been added to the partition
label, i.e "zfs-1234567890abcdef".  Additional information could
be encoded here but since partitions may be reused that might
result in confusion and it was decided against.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #4517
2016-04-25 11:13:09 -07:00
Chunwei Chen
6760077194 Make zfs mount according to relatime config in dataset
Also enable lazytime in mount.zfs

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4482
2016-04-05 18:55:59 -07:00
Don Brady
39fc0cb557 Add support for devid and phys_path keys in vdev disk labels
This is foundational work for ZED.

Updates a leaf vdev's persistent device strings on Linux platform

* only applies for a dedicated leaf vdev (aka whole disk)
* updated during pool create|add|attach|import
* used for matching device matching during auto-{online,expand,replace}
* stored in a leaf disk config label (i.e. alongside 'path' NVP)
* can opt-out using env var ZFS_VDEV_DEVID_OPT_OUT=YES

Some examples:

    path: '/dev/sdb1'
    devid: 'scsi-350000394a8ca4fbc-part1'
    phys_path: 'pci-0000:04:00.0-sas-0x50000394a8ca4fbf-lun-0'

    path: '/dev/mapper/mpatha'
    devid: 'dm-uuid-mpath-35000c5006304de3f'

Signed-off-by: Don Brady <don.brady@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2856
Closes #3978
Closes #4416
2016-03-31 13:45:53 -07:00