Illumos 5027 - zfs large block support

5027 zfs large block support
Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5027
  https://github.com/illumos/illumos-gate/commit/b515258

Porting Notes:

* Included in this patch is a tiny ISP2() cleanup in zio_init() from
Illumos 5255.

* Unlike the upstream Illumos commit this patch does not impose an
arbitrary 128K block size limit on volumes.  Volumes, like filesystems,
are limited by the zfs_max_recordsize=1M module option.

* By default the maximum record size is limited to 1M by the module
option zfs_max_recordsize.  This value may be safely increased up to
16M which is the largest block size supported by the on-disk format.
At the moment, 1M blocks clearly offer a significant performance
improvement but the benefits of going beyond this for the majority
of workloads are less clear.

* The Illumos version of this patch increased DMU_MAX_ACCESS to 32M.
This was determined not to be large enough when using 16M blocks
because the zfs_make_xattrdir() function will fail (EFBIG) when
assigning a TX.  This was immediately observed under Linux because
all newly created files must have a security xattr created and
that was failing.  Therefore, we've set DMU_MAX_ACCESS to 64M.

* On 32-bit platforms a hard limit of 1M is set for blocks due
to the limited virtual address space.  We should be able to relax
this once the ABD patches are merged.
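
The limits described in the notes above can be summarized in a short
sketch. This is illustrative only and is not code from the patch: the
names recordsize_ok(), MIN_BLOCKSIZE, MAX_BLOCKSIZE and LIMIT_32BIT are
hypothetical, and the power-of-two requirement is an assumption about
how ZFS block sizes behave rather than something stated in these notes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constants mirroring the limits quoted in the notes. */
#define MIN_BLOCKSIZE (512ULL)       /* smallest supported block size */
#define MAX_BLOCKSIZE (16ULL << 20)  /* 16M: largest on-disk block size */
#define LIMIT_32BIT   (1ULL << 20)   /* 1M hard limit on 32-bit platforms */

/* True when x is a non-zero power of two (assumed requirement). */
static bool
is_power_of_two(uint64_t x)
{
	return (x != 0 && (x & (x - 1)) == 0);
}

/*
 * Hypothetical check: is `size` an acceptable record size given the
 * current value of the zfs_max_recordsize tunable?
 */
static bool
recordsize_ok(uint64_t size, uint64_t max_recordsize, bool is_32bit)
{
	uint64_t cap = max_recordsize;

	/* The tunable can never raise the cap above the on-disk limit. */
	if (cap > MAX_BLOCKSIZE)
		cap = MAX_BLOCKSIZE;

	/* 32-bit platforms are held to 1M regardless of the tunable. */
	if (is_32bit && cap > LIMIT_32BIT)
		cap = LIMIT_32BIT;

	return (is_power_of_two(size) && size >= MIN_BLOCKSIZE &&
	    size <= cap);
}
```

With the default tunable of 1M, this sketch accepts a 1M record size and
rejects 2M; raising the tunable to 16M accepts 16M on 64-bit platforms
but still rejects it on 32-bit platforms.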

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #354
Matthew Ahrens
2014-11-03 12:15:08 -08:00
committed by Brian Behlendorf
parent 3df293404a
commit f1512ee61e
55 changed files with 613 additions and 155 deletions
+18
@@ -945,6 +945,24 @@ Largest data block to write to zil
Default value: \fB32,768\fR.
.RE
.sp
.ne 2
.na
\fBzfs_max_recordsize\fR (int)
.ad
.RS 12n
We currently support block sizes from 512 bytes to 16MB. The benefits of
larger blocks, and thus larger I/O, need to be weighed against the cost of
COWing a giant block to modify one byte. Additionally, very large blocks
can have an impact on I/O latency, and also potentially on the memory
allocator. Therefore, we do not allow the recordsize to be set larger than
zfs_max_recordsize (default 1MB). Larger blocks can be created by changing
this tunable, and pools with larger blocks can always be imported and used,
regardless of this setting.
.sp
Default value: \fB1,048,576\fR.
.RE
.sp
.ne 2
.na
+21
@@ -411,5 +411,26 @@ never return to being \fBenabled\fR.
.RE
.sp
.ne 2
.na
\fB\fBlarge_blocks\fR\fR
.ad
.RS 4n
.TS
l l .
GUID org.open-zfs:large_block
READ\-ONLY COMPATIBLE no
DEPENDENCIES extensible_dataset
.TE
The \fBlarge_blocks\fR feature allows the record size on a dataset to be
set larger than 128KB.
This feature becomes \fBactive\fR once a \fBrecordsize\fR property has been
set larger than 128KB, and will return to being \fBenabled\fR once all
filesystems that have ever had their recordsize larger than 128KB are destroyed.
.RE
.SH "SEE ALSO"
\fBzpool\fR(8)
+36 -5
@@ -174,12 +174,12 @@ zfs \- configures ZFS file systems
.LP
.nf
\fBzfs\fR \fBsend\fR [\fB-DnPpRve\fR] [\fB-\fR[\fBiI\fR] \fIsnapshot\fR] \fIsnapshot\fR
\fBzfs\fR \fBsend\fR [\fB-DnPpRveL\fR] [\fB-\fR[\fBiI\fR] \fIsnapshot\fR] \fIsnapshot\fR
.fi
.LP
.nf
\fBzfs\fR \fBsend\fR [\fB-e\fR] [\fB-i \fIsnapshot\fR|\fIbookmark\fR]\fR \fIfilesystem\fR|\fIvolume\fR|\fIsnapshot\fR
\fBzfs\fR \fBsend\fR [\fB-eL\fR] [\fB-i \fIsnapshot\fR|\fIbookmark\fR]\fR \fIfilesystem\fR|\fIvolume\fR|\fIsnapshot\fR
.fi
.LP
@@ -2706,7 +2706,7 @@ See \fBzpool-features\fR(5) for details on ZFS feature flags and the
.sp
.ne 2
.na
\fBzfs send\fR [\fB-DnPpRve\fR] [\fB-\fR[\fBiI\fR] \fIsnapshot\fR] \fIsnapshot\fR
\fBzfs send\fR [\fB-DnPpRveL\fR] [\fB-\fR[\fBiI\fR] \fIsnapshot\fR] \fIsnapshot\fR
.ad
.sp .6
.RS 4n
@@ -2759,6 +2759,22 @@ If the \fB-i\fR or \fB-I\fR flags are used in conjunction with the \fB-R\fR flag
Generate a deduplicated stream. Blocks which would have been sent multiple times in the send stream will only be sent once. The receiving system must also support this feature to receive a deduplicated stream. This flag can be used regardless of the dataset's dedup property, but performance will be much better if the filesystem uses a dedup-capable checksum (e.g. sha256).
.RE
.sp
.ne 2
.mk
.na
\fB\fB-L\fR\fR
.ad
.sp .6
.RS 4n
Generate a stream which may contain blocks larger than 128KB. This flag
has no effect if the \fBlarge_blocks\fR pool feature is disabled, or if
the \fBrecordsize\fR property of this filesystem has never been set above
128KB. The receiving system must have the \fBlarge_blocks\fR pool feature
enabled as well. See \fBzpool-features\fR(5) for details on ZFS feature
flags and the \fBlarge_blocks\fR feature.
.RE
.sp
.ne 2
.mk
@@ -2828,7 +2844,7 @@ The format of the stream is committed. You will be able to receive your streams
.sp
.ne 2
.na
\fBzfs send\fR [\fB-e\fR] [\fB-i\fR \fIsnapshot\fR|\fIbookmark\fR] \fIfilesystem\fR|\fIvolume\fR|\fIsnapshot\fR
\fBzfs send\fR [\fB-eL\fR] [\fB-i\fR \fIsnapshot\fR|\fIbookmark\fR] \fIfilesystem\fR|\fIvolume\fR|\fIsnapshot\fR
.ad
.sp .6
.RS 4n
@@ -2856,6 +2872,22 @@ be the origin snapshot, or an earlier snapshot in the origin's filesystem,
or the origin's origin, etc.
.RE
.sp
.ne 2
.mk
.na
\fB\fB-L\fR\fR
.ad
.sp .6
.RS 4n
Generate a stream which may contain blocks larger than 128KB. This flag
has no effect if the \fBlarge_blocks\fR pool feature is disabled, or if
the \fBrecordsize\fR property of this filesystem has never been set above
128KB. The receiving system must have the \fBlarge_blocks\fR pool feature
enabled as well. See \fBzpool-features\fR(5) for details on ZFS feature
flags and the \fBlarge_blocks\fR feature.
.RE
.sp
.ne 2
.mk
@@ -2909,7 +2941,6 @@ The \fB-d\fR and \fB-e\fR options cause the file system name of the target snaps
Discard the first element of the sent snapshot's file system name, using the remaining elements to determine the name of the target file system for the new snapshot as described in the paragraph above.
.RE
.sp
.ne 2
.na