This is foundational work for ZED.
Updates a leaf vdev's persistent device strings on Linux platform
* only applies for a dedicated leaf vdev (aka whole disk)
* updated during pool create|add|attach|import
* used for matching device matching during auto-{online,expand,replace}
* stored in a leaf disk config label (i.e. alongside 'path' NVP)
* can opt-out using env var ZFS_VDEV_DEVID_OPT_OUT=YES
Some examples:
path: '/dev/sdb1'
devid: 'scsi-350000394a8ca4fbc-part1'
phys_path: 'pci-0000:04:00.0-sas-0x50000394a8ca4fbf-lun-0'
path: '/dev/mapper/mpatha'
devid: 'dm-uuid-mpath-35000c5006304de3f'
Signed-off-by: Don Brady <don.brady@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2856Closes#3978Closes#4416
When checking a whole disk to see if it can be safely added to
the pool a variety of checks are done. One of those checks is
to attempt to determine the partition information and scan all
the partitions for existing filesystems.
Since ZoL contains a EFI library this partition scanning is
easy to do for GPT partitioned disks. However, for non-GPT
partitioned disks (MBR/EBR) things are a bit harder. The lack of
a convenient library means non-GPT partitioned disks will not
have all their partitions checked. For this reason, the default
behavior was to require the force option. For example:
invalid vdev specification
use '-f' to override the following errors:
/dev/vdb does not contain an GPT label but it may contain partition
information in the MBR.
However in practice requiring the force option for this case is
counter-intuitively less safe. The reason is because only the first
error is returned. By passing the force option it will suppress
this first warning and potentially others you were not aware of.
Therefore this patch inverts the default behavior for non-GPT
formated disks (unformatted, MBR/EBR, etc). If no GPT table is
detected and there is no file system detected on the provided
block device. Then it will be assumed that block device is safe
to use.
Longer term it would be nice to see MBR/EBR scanning added to
the utilities. This should be fairly straight forward to do.
However these days it's somewhat less critical because Linux
defaults to GPT partition tables for devices 2TB or larger.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2660Closes#2274
Historically libblkid support was detected as part of configure
and optionally enabled. This was done because at the time support
for detecting ZFS pool vdevs had just be added to libblkid and
those updated packages were not yet part of many distributions.
This is no longer the case and any reasonably current distribution
will ship a version of libblkid which can detect ZFS pool vdevs.
This patch makes libblkid mandatory at build time and libblkid
the preferred method of scanning for ZFS pools. For distributions
which include a modern version of libblkid there is no change in
behavior. Explicitly scanning the default search paths is still
supported and can be enabled with the '-s' command line option.
Additionally making libblkid mandatory means that the 'zpool create'
command can reliably detect if a specified device has an existing
non-ZFS filesystem (ext4, xfs) and print a warning.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2448
5118 When verifying or creating a storage pool, error messages
only show one device
Reviewed by: Adam Leventhal <adam.leventhal@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Boris Protopopov <boris.protopopov@me.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://github.com/illumos/illumos-gate/commit/75fbdf9https://www.illumos.org/issues/5118
Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3567
The newroot nvlist should be freed before returning.
Signed-off-by: Isaac Huang <he.huang@intel.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3264
When using 'zpool import' to scan for available pools prefer vdev names
which reference vdevs with more valid labels. There should be two labels
at the start of the device and two labels at the end of the device. If
labels are missing then the device has been damaged or is in some other
way incomplete. Preferring names with fully intact labels helps weed out
bad paths and improves the likelihood of being able to import the pool.
This behavior only applies when scanning /dev/ for valid pools. If a
cache file exists the pools described by the cache file will be used.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Closes#3145Closes#2844Closes#3107
The Intel DC S3500 and Intel DC S3700 are optimized to handle 4KB
sectors well despite of their 8KB page sizes, so we move them to a new
category for enterprise drives where they will receive ashift=12. They
are joined by the Intel 730 series, which uses the same disk controller,
as well as a San Disk enterprise drive. The drive IDs for these two were
obtained by myself with the drive_id utility. The drive ID for the 240GB
Intel 730 model was extrapolated from the drive ID for the 480GB model.
Lastly, we also add some Western Digital mobile drives. ryuo in
\#zfsonlinux on freenode obtained "ATA WDC WD2500BEVT-0" from
running drive_id on his own hardware. The additional drives in that
family were extrapolated from that identifer.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2601
4374 dn_free_ranges should use range_tree_t
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/4374https://github.com/illumos/illumos-gate/commit/bf16b11
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2531
zfsonlinux/zfs@1db7b9be75 should have
fixed this, but this particular string was overlooked.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2288
The vast majority of these changes are in Linux specific code.
They are the result of not having an automated style checker to
validate the code when it was originally written. Others were
caused when the common code was slightly adjusted for Linux.
This patch contains no functional changes. It only refreshes
the code to conform to style guide.
Everyone submitting patches for inclusion upstream should now
run 'make checkstyle' and resolve any warning prior to opening
a pull request. The automated builders have been updated to
fail a build if when 'make checkstyle' detects an issue.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1821
Added:
Adata S396 (obtained from drive_id)
Apple MacBookAir3,1 SSD (obtained from drive_id)
Apple MacBookPro10,1 SSD (obtained from drive_id)
Intel 510 (obtained from drive_id)
Intel 710 (obtained from drive_id)
Intel DC S3500 (obtained from drive_id)
Netapp LUN (obtained from illumos user's sd.conf)
OCZ Agility 3 (obtained from drive_id)
OCZ Vertex (obtained from drive_id)
Samsung PM800 (obtained from drive_id)
Sandisk U100 (obtained from drive_id)
Sun Comstar (obtained from illumos user's sd.conf)
Notes:
1. The entries for the Intel DC S3500 were extrapolated from the 800GB
model's entry, which is "ATA INTEL SSDSC2BB80".
2. The entires for the Intel 710 were extrapolated from the 120GG
model's entry, which is "ATA INTEL SSDSA2BZ12".
3. The entires for the Intel 510 were extrapolated from the 250GB
model's entry, which is "ATA INTEL SSDSC2MH25".
4. The entires for the Apple MacBookPro10,1 SSD were extrapolated from
the 512GB model's entry, which is "ATA APPLE SSD SM512E". Google
searches suggest that this is a rebadged Samsung 830.
5. The entires for the Apple MacBookAir3,1 SSD were extrapolated from
the 128GB model's entry, which is "ATA APPLE SSD TS128C". Google
searches suggest that this is a rebadged Kingston SSDNow V+ 100 (based
on Toshiba).
6. Sun Comstar is an iSCSI Target, so we cannot tell what the correct
sector size is through this method. We list it only for reference
purposes, but it is commented out. Similarly, it is not clear what the
right thing to do for Netapp is, so we comment it out.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1907
When creating a new pool, or adding/replacing a disk in an existing
pool, partition tables will be automatically created on the devices.
Under normal circumstances it will take less than a second for udev
to create the expected device files under /dev/. However, it has
been observed that if the system is doing heavy IO concurrently udev
may take far longer. If you also throw in some cheap dodgy hardware
it may take even longer.
To prevent zpool commands from failing due to this the default wait
time for udev is being increased to 30 seconds. This will have no
impact on normal usage, the increase timeout should only be noticed
if your udev rules are incorrectly configured.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1646
Add Corsair Force GS drive (obtained from drive_id)
Add Kingston HyperX 3K (obtained from drive_id)
Add OCZ Vertex 4 drive (obtained from drive_id)
Add Samsung SM843T enterprise drive (obtained from drive_id)
Add entries for additional sizes of Intel 320/330/335/520 series
Add Cruical C400 (obtained from Illumos user's sd.conf)
Add Toshiba SSD (obtained from Illumos user's sd.conf)
Add Samsung's first SLC SSD (obtained from drive_id)
Add OCZ Core Series (obtained from drive_id)
Add Intel DC S3700 (obtained from drive_id)
Notes:
1. The drive identifer obtained for the Samsung SM843T was MZ7WD480. The
rest were extrapolated. The additional entries were checked with Google
to verify that such drives exist in the wild.
2. The additional entries for Intel drives were extrapolated from
existing entries. The additional entries were checked with Google to
verify that such drives exist in the wild.
3. The "ATA C400-MTFDDAC512M" and "ATA TOSHIBA THNSNH51" entries
are from the sd.conf of gcbirzan on freenode. Additional entries were
extrapolated from them and checked with Google.
4. I obtained the Samsung MCCOE64G entry from an actual drive. The
Samsung MCCOE32G entry was extrapolated from it and checked with
Google.
5. I obtained the SSDSC2BA10 from a 100GB Intel DC S3700 drive and
extrapolated the entries for the additional models.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1752
This implements vdev_bdev_database_check(). It alters the detected
sector size of any device listed in a database of drives known to lie
about their physical sector sizes.
This is based on "6931570 Add flash devices' VID/PID to disk table to
advertising 4K physical sector size" from Open Solaris and on
sg_simple4.c from sg3_utils. About two dozen lines are taken from
sg_simple4.c, which is GPLv2 licensed. However, sg_simple4.c is
analogous to a Hello World program and is safe for us to use. We
requested that Douglas Gilbert, the author of sg_simple4.c, confirm that
this is the case. A cutdown version of his response is as follows:
```
I would consider a SCSI INQUIRY example using the Linux sg
driver interface (also written by me) as the equivalent of an
"hello world" program in C.
```
The database was created with the help of the freenode and ZFSOnLinux
communities.
Some notes:
1. The following drives both were confirmed to lie via reports in IRC
and they contain capacity information in their identifiers:
INTEL SSDSA2M080
INTEL SSDSA2M160
M4-CT256M4SSD2
WDC WD15EARS-00S
WDC WD15EARS-00Z
WDC WD20EARS-00M
The identifiers for different capacity models were extrapolated and
added under the assumption that those models also lie. Google was used
to verify that the extrapolated drive identifiers existed prior to their
inclusion.
2. The OCZ-VERTEX2 3.5 identifer applies to two drives that differ
solely in page size (and slightly in capacity). One uses 4096-byte pages
and the other uses 8192-byte pages. Both are set to use 8192-byte pages.
We could detect the page size by checking the capacity, but that would
unnecessarily complicate the code.
3. It is possible for updated drive firmware to correctly report the
sector size. There were reports of a few advanced format drives doing
that. One report stated that the vendor changed the identification
string while another was unclear on this. Both reports involved WDC
models.
4. Google was used to determine the size of pages in the listed flash
devices. Reports of 8192-byte pages took precedence over reports of
4096-byte pages.
5. Devices behind USB adapters can have their identification strings
altered. Identification strings obtained across USB adapters are
omitted and no attempt is made to correct for alterations made by USB
adapters when doing comparisons against the database. Two entries in the
Open Solaris database that appear to have been altered by a USB
adapter were omitted.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1652
The issue with hot spares in ZoL is because it opens all leaf
vdevs exclusively (O_EXCL). On Linux, exclusive opens cause
subsequent exclusive opens to fail with EBUSY.
This could be resolved by not opening any of the devices
exclusively, which is what Illumos does, but the additional
protection offered by exclusive opens is desirable. It cleanly
prevents you from accidentally adding an in-use non-ZFS device
to your pool.
To fix this we very slightly relaxed the usage of O_EXCL in
the following ways.
1) Functions which open the device but only read had the
O_EXCL flag removed and were updated to use O_RDONLY.
2) A common holder was added to the vdev disk code. This
allow the ZFS code to internally open the device multiple
times but non-ZFS callers may not.
3) An exception was added to make_disks() for hot spare when
creating partition tables. For hot spare devices which
are already opened exclusively we skip creating the partition
table because this must already have been done when the disk
was originally added as a hot spare.
Additional minor changes include fixing check_in_use() to use
a partition instead of a slice suffix. And is_spare() was moved
above make_disks() to avoid adding a forward reference.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#250
The 'zpool replace' command would fail when given a short name
because unlike on other platforms the short name cannot be
deterministically expanded to a single path. Multiple path
prefixes must be checked and in addition the partition suffix
for whole disks is determined by the prefix.
To handle this complexity a zfs_strcmp_pathname() function was
added which takes either a short or fully qualified device name.
Short names will be expanded using the prefixes in the default
import search path, or the ZPOOL_IMPORT_PATH environment variable
if it's defined. All posible expansions are then compared against
the comparison path. Care is taken to strip redundant slashes to
ensure legitimate matches are not missed.
In the context of this work the existing zfs_resolve_shortname()
function was extended to consider the ZPOOL_IMPORT_PATH when set.
The zfs_append_partition() interface was also simplified to take
only a single buffer.
The vast majority of these changes rework existing Linux specific
code which was originally written to accomidate udev. However,
there is some minimal cleanup which removes Illumos specific code.
This was done to improve readability but the basic flow and intent
of the upstream code was maintained.
These changes are the logical conclusion of the previos work to
adjust the 'zpool import' search behavior, see commit 44867b6a.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#544Closes#976
vdev_id parses the file /etc/zfs/vdev_id.conf to map a physical path
in a storage topology to a channel name. The channel name is combined
with a disk enclosure slot number to create an alias that reflects the
physical location of the drive. This is particularly helpful when it
comes to tasks like replacing failed drives. Slot numbers may also be
re-mapped in case the default numbering is unsatisfactory. The drive
aliases will be created as symbolic links in /dev/disk/by-vdev.
The only currently supported topologies are sas_direct and sas_switch:
o sas_direct - a channel is uniquely identified by a PCI slot and a
HBA port
o sas_switch - a channel is uniquely identified by a SAS switch port
A multipath mode is supported in which dm-mpath devices are handled by
examining the first running component disk, as reported by 'multipath
-l'. In multipath mode the configuration file should contain a
channel definition with the same name for each path to a given
enclosure.
vdev_id can replace the existing zpool_id script on systems where the
storage topology conforms to sas_direct or sas_switch. The script
could be extended to support other topologies as well. The advantage
of vdev_id is that it is driven by a single static input file that can
be shared across multiple nodes having a common storage toplogy.
zpool_id, on the other hand, requires a unique /etc/zfs/zdev.conf per
node and a separate slot-mapping file. However, zpool_id provides the
flexibility of using any device names that show up in
/dev/disk/by-path, so it may still be needed on some systems.
vdev_id's functionality subsumes that of the sas_switch_id script, and
it is unlikely that anyone is using it, so sas_switch_id is removed.
Finally, /dev/disk/by-vdev is added to the list of directories that
'zpool import' will scan.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#713
When creating a new pool, make_root_vdev() calls check_in_use() to
ensure that none of the consituent disks are in use. If the disk
contains a valid vdev label it is read to retrieve the list of its
child vdevs and these are checked recursively. However, the
partitions stored in the vdev label my no longer exist, for example
if the partition table has since been altered. In any such case we
would want the pool creation to proceed, so this change removes the
check from check_slice() that returns an error if the device doesn't
exist. As an added assurance, the Solaris implementation also
returns sucess on ENOENT.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Some disks with internal sectors larger than 512 bytes (e.g., 4k) can
suffer from bad write performance when ashift is not configured
correctly. This is caused by the disk not reporting its actual sector
size, but a sector size of 512 bytes. The drive may behave this way
for compatibility reasons. For example, the WDC WD20EARS disks are
known to exhibit this behavior.
When creating a zpool, ZFS takes that wrong sector size and sets the
"ashift" property accordingly (to 9: 1<<9=512), whereas it should be
set to 12 for 4k sectors (1<<12=4096).
This patch allows an adminstrator to manual specify the known correct
ashift size at 'zpool create' time. This can significantly improve
performance in certain cases. However, it will have an impact on your
total pool capacity. See the updated ashift property description
in the zpool.8 man page for additional details.
Valid values for the ashift property range from 9 to 17 (512B-128KB).
Additionally, you may set the ashift to 0 if you wish to auto-detect
the sector size based on what the disk reports, this is the default
behavior. The most common ashift values are 9 and 12.
Example:
zpool create -o ashift=12 tank raidz2 sda sdb sdc sdd
Closes#280
Original-patch-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Creating whole-disk vdevs can intermittently fail if a udev-managed symlink to
the disk partition is already in place. To avoid this, we now remove any such
symlink before partitioning the disk. This makes zpool_label_disk_wait() truly
wait for the new link to show up instead of returning if it finds an old link
still in place. Otherwise there is a window between when udev deletes and
recreates the link during which access attempts will fail with ENOENT.
Also, clean up a comment about waiting for udev to create symlinks. It no
longer needs to describe the special cases for the link names, since that is
now handled in a separate helper function.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Portability between Solaris and Linux isn't really an issue for us anymore, and
removing sections like this one helps simplify the code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This change adds two helper functions for working with vdev names and paths.
zfs_resolve_shortname() resolves a shorthand vdev name to an absolute path
of a file in /dev, /dev/disk/by-id, /dev/disk/by-label, /dev/disk/by-path,
/dev/disk/by-uuid, /dev/disk/zpool. This was previously done only in the
function is_shorthand_path(), but we need a general helper function to
implement shorthand names for additional zpool subcommands like remove.
is_shorthand_path() is accordingly updated to call the helper function.
There is a minor change in the way zfs_resolve_shortname() tests if a file
exists. is_shorthand_path() effectively used open() and stat64() to test for
file existence, since its scope includes testing if a device is a whole disk
and collecting file status information. zfs_resolve_shortname(), on the other
hand, only uses access() to test for existence and leaves it to the caller to
perform any additional file operations. This seemed like the most general and
lightweight approach, and still preserves the semantics of is_shorthand_path().
zfs_append_partition() appends a partition suffix to a device path. This
should be used to generate the name of a whole disk as it is stored in the vdev
label. The user-visible names of whole disks do not contain the partition
information, while the name in the vdev label does. The code was lifted from
the function make_disks(), which now just calls the helper function. Again,
having a helper function to do this supports general handling of shorthand
names in the user interface.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The string array 'char dirs[5][8]' was too small to accomodate the terminating
NUL character in "by-label". This change adds the needed additional byte.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This topic branch contains all the changes needed to integrate the user
side zfs tools with Linux style devices. Primarily this includes fixing
up the Solaris libefi library to be Linux friendly, and integrating with
the libblkid library which is provided by e2fsprogs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>