mirror_zfs/module/zfs
Prakash Surya 2b13331d62 Set "arc_meta_limit" to 3/4 arc_c_max by default
Unfortunately, this change is an cheap attempt to work around a
pathological workload for the ARC. A "real" solution still needs to be
fleshed out, so this patch is intended to alleviate the situation in the
meantime. Let me try and describe the problem..

Data buffers residing in the dbuf hash table (dbuf cache) will keep a
hold on their respective dnode, this dnode will in turn keep a hold on
its backing dbuf (the physical block of the dnode object backing it).
Since the dnode has a hold on its backing dbuf, the arc buffer for this
dbuf is unevictable. What this essentially boils down to, "data" buffers
have the potential to pin "metadata" in the arc (as a result of these
dnode object buffers being unevictable).

This scenario becomes a real problem when the workload consists of many
small files (e.g. creating millions of 4K files). With this workload,
the arc's "arc_meta_used" space get filled up with buffers for any
resident directories as well as buffers for the objset's dnode object.
Once the "arc_meta_limit" is reached, the directory buffers will be
evicted and only the unevictable dnode object buffers will reside. If
the workload is simply creating new small files, these dnode object
buffers will never even be needed again, whereas the directory buffers
will be used constantly until the creates move to a new directory.

If "arc_c" and "arc_meta_limit" are sized appropriately, this
situation wont occur. This is because as the data buffers accumulate,
"arc_size" will eventually approach "arc_c" (before "arc_meta_used"
reaches "arc_meta_limit"); at that point the data buffers will be
evicted, which releases the hold on the dnode, which releases the hold
on the dnode object's dbuf, which allows that buffer to be evicted from
the arc in preference to more "useful" metadata.

So, to side step the issue, we simply need to ensure "arc_size" reaches
"arc_c" before "arc_meta_used" reaches "arc_meta_limit". In order to
pick a proper limit, we have to do some math.

To make things a little easier to follow, it is assumed that there will
only be a single data buffer per file (which is probably always the case
for "small" files anyways).

Based on the current internals of the arc, if N files residing in the
dbuf cache all pin a single dnode buffer (i.e. their dnodes all share
the same physical dnode object block), then the following amount of
"arc_meta_used" space will be consumed:

    - 16K for the dnode object's block - [        16384 bytes]
    - N * sizeof(dnode_t) -------------- [      N * 928 bytes]
    - (N + 1) * sizeof(arc_buf_t) ------ [(N + 1) *  72 bytes]
    - (N + 1) * sizeof(arc_buf_hdr_t) -- [(N + 1) * 264 bytes]
    - (N + 1) * sizeof(dmu_buf_impl_t) - [(N + 1) * 280 bytes]

To simplify, these N files will pin the following amount of
"arc_meta_used" space as unevictable:

    Pinned "arc_meta_used" bytes = 16384 + N * 928 + (N + 1) * (72 + 264 + 280)
    Pinned "arc_meta_used" bytes = 17000 + N * 1544

This pinned space is regardless of the size of the files, and is only
dependent on the number of pinned dnodes sharing a physical block
(i.e. N). For example, 32 512b files sharing a single dnode object
block would consume the same "arc_meta_used" space as 32 4K files
sharing a single dnode object block.

Now, given a files size of S, we can determine the total amount of
space that will be consumed in the arc:

    Total = 17000 + N * 1544 + S * N
            ^^^^^^^^^^^^^^^^   ^^^^^
                metadata        data

So, given these formulas, we can generate a table which states the ratio
of pinned metadata to total arc (meta + data) using different values of
N (number of pinned dnodes per pinned physical dnode block) and S (size
of the file).

                                  File Sizes (S)
       |    512   |   1024   |   2048   |   4096   |   8192   |   16384  |
    ---+----------+----------+----------+----------+----------+----------+
     1 | 0.973132 | 0.947670 | 0.900544 | 0.819081 | 0.693597 | 0.530921 |
     2 | 0.951497 | 0.907481 | 0.830632 | 0.710325 | 0.550779 | 0.380051 |
 N   4 | 0.918807 | 0.849809 | 0.738842 | 0.585844 | 0.414271 | 0.261250 |
     8 | 0.877541 | 0.781803 | 0.641770 | 0.472505 | 0.309333 | 0.182965 |
    16 | 0.835819 | 0.717945 | 0.559996 | 0.388885 | 0.241376 | 0.137253 |
    32 | 0.802106 | 0.669597 | 0.503304 | 0.336277 | 0.202123 | 0.112423 |

As you can see, if we wanted to support the absolute worst case of 1
dnode per physical dnode block and 512b files, we would have to set the
"arc_meta_limit" to something greater than 97.3132% of "arc_c_max". At
that point, it essentially defeats the purpose of having an
"arc_meta_limit" at all.

This patch changes the default value of "arc_meta_limit" to be 75% of
"arc_c_max", which should be good enough for "most" workloads (I think).

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
2014-02-21 16:10:49 -08:00
..
arc.c Set "arc_meta_limit" to 3/4 arc_c_max by default 2014-02-21 16:10:49 -08:00
bplist.c Illumos #3464 2013-09-04 16:01:24 -07:00
bpobj.c Illumos #3603, #3604: bobj improvements 2013-10-31 14:57:51 -07:00
bptree.c 26126 panic system rather than corrupting pool if we hit bug 26100 2013-11-05 13:18:26 -08:00
dbuf_stats.c cstyle: Resolve C style issues 2013-12-18 16:46:35 -08:00
dbuf.c cstyle: Resolve C style issues 2013-12-18 16:46:35 -08:00
ddt_zap.c cstyle: Resolve C style issues 2013-12-18 16:46:35 -08:00
ddt.c Add ddt, ddt_entry, and l2arc_hdr caches 2014-01-07 10:33:11 -08:00
dmu_diff.c Illumos #3598 2013-10-31 14:58:04 -07:00
dmu_object.c Illumos #3598 2013-10-31 14:58:04 -07:00
dmu_objset.c cstyle: Resolve C style issues 2013-12-18 16:46:35 -08:00
dmu_send.c Add zfs_send_corrupt_data module option 2013-12-18 16:46:35 -08:00
dmu_traverse.c Illumos 4504 traverse_visitbp: visit group before user 2014-01-29 15:50:49 -08:00
dmu_tx.c 4188 assertion failed in dmu_tx_hold_free(): dn_datablkshift != 0 2014-01-31 10:49:34 -08:00
dmu_zfetch.c Use enum type(zfetch_dirn_t) instead 2014-01-23 12:56:33 -08:00
dmu.c cstyle: Resolve C style issues 2013-12-18 16:46:35 -08:00
dnode_sync.c Illumos #3742 2013-11-04 10:55:25 -08:00
dnode.c Illumos #4045 write throttle & i/o scheduler performance work 2013-12-06 09:32:43 -08:00
dsl_dataset.c cstyle: Resolve C style issues 2013-12-18 16:46:35 -08:00
dsl_deadlist.c Illumos #3104: eliminate empty bpobjs 2013-01-08 10:35:43 -08:00
dsl_deleg.c cstyle: Resolve C style issues 2013-12-18 16:46:35 -08:00
dsl_destroy.c cstyle: Resolve C style issues 2013-12-18 16:46:35 -08:00
dsl_dir.c cstyle: Resolve C style issues 2013-12-18 16:46:35 -08:00
dsl_pool.c cstyle: Resolve C style issues 2013-12-18 16:46:35 -08:00
dsl_prop.c Illumos #3742 2013-11-04 10:55:25 -08:00
dsl_scan.c Add erratum for issue #2094 2014-02-21 12:10:40 -08:00
dsl_synctask.c Illumos #3598 2013-10-31 14:58:04 -07:00
dsl_userhold.c Some nvlist allocations in hold processing need to use KM_PUSHPAGE. 2013-12-02 14:02:46 -08:00
fm.c cstyle: Resolve C style issues 2013-12-18 16:46:35 -08:00
gzip.c cstyle: Resolve C style issues 2013-12-18 16:46:35 -08:00
lz4.c Force LZ4_FORCE_SW_BITCOUNT for Sparc 2014-01-09 15:54:03 -08:00
lzjb.c cstyle: Resolve C style issues 2013-12-18 16:46:35 -08:00
Makefile.in Add visibility in to cached dbufs 2013-10-25 13:59:40 -07:00
metaslab.c cstyle: Resolve C style issues 2013-12-18 16:46:35 -08:00
refcount.c Illumos #3464 2013-09-04 16:01:24 -07:00
rrwlock.c Fix several new KM_SLEEP warnings 2013-09-25 15:44:22 -07:00
sa.c Properly handle updates of variably-sized SA entries. 2013-12-20 13:52:33 -08:00
sha256.c Add linux sha2 support 2010-08-31 13:41:59 -07:00
spa_boot.c Add linux kernel module support 2010-08-31 13:41:58 -07:00
spa_config.c Add generic errata infrastructure 2014-02-21 12:10:40 -08:00
spa_errlog.c Illumos #3743 2013-11-04 10:55:25 -08:00
spa_history.c cstyle: Resolve C style issues 2013-12-18 16:46:35 -08:00
spa_misc.c Add ddt, ddt_entry, and l2arc_hdr caches 2014-01-07 10:33:11 -08:00
spa_stats.c Revert changes to zbookmark_t 2014-02-21 12:10:39 -08:00
spa.c Add generic errata infrastructure 2014-02-21 12:10:40 -08:00
space_map.c Illumos #3464 2013-09-04 16:01:24 -07:00
txg.c Call gethrtime() only once per new txg creation 2014-01-23 13:31:51 -08:00
uberblock.c Illumos #3598 2013-10-31 14:58:04 -07:00
unique.c Switch KM_SLEEP to KM_PUSHPAGE 2012-08-27 12:01:37 -07:00
vdev_cache.c cstyle: Resolve C style issues 2013-12-18 16:46:35 -08:00
vdev_disk.c cstyle: Resolve C style issues 2013-12-18 16:46:35 -08:00
vdev_file.c vdev_file_io_start() to use taskq_dispatch(TQ_PUSHPAGE) 2014-01-23 09:58:07 -08:00
vdev_label.c cstyle: Resolve C style issues 2013-12-18 16:46:35 -08:00
vdev_mirror.c Illumos #4045 write throttle & i/o scheduler performance work 2013-12-06 09:32:43 -08:00
vdev_missing.c Illumos #3598 2013-10-31 14:58:04 -07:00
vdev_queue.c cstyle: Resolve C style issues 2013-12-18 16:46:35 -08:00
vdev_raidz.c Illumos #4045 write throttle & i/o scheduler performance work 2013-12-06 09:32:43 -08:00
vdev_root.c Illumos #3598 2013-10-31 14:58:04 -07:00
vdev.c Illumos #4045 write throttle & i/o scheduler performance work 2013-12-06 09:32:43 -08:00
zap_leaf.c Illumos #3598 2013-10-31 14:58:04 -07:00
zap_micro.c cstyle: Resolve C style issues 2013-12-18 16:46:35 -08:00
zap.c Illumos #3743 2013-11-04 10:55:25 -08:00
zfeature_common.c Illumos #3035 LZ4 compression support in ZFS and GRUB 2013-01-29 09:28:20 -08:00
zfeature.c cstyle: Resolve C style issues 2013-12-18 16:46:35 -08:00
zfs_acl.c Allow chown/chgrp when no ACL SAs exist. 2014-01-23 11:07:29 -08:00
zfs_byteswap.c Add linux kernel module support 2010-08-31 13:41:58 -07:00
zfs_ctldir.c cstyle: Resolve C style issues 2013-12-18 16:46:35 -08:00
zfs_debug.c cstyle: Resolve C style issues 2013-12-18 16:46:35 -08:00
zfs_dir.c Illumos #4347 ZPL can use dmu_tx_assign(TXG_WAIT) 2013-12-06 09:30:51 -08:00
zfs_fm.c Illumos #4045 write throttle & i/o scheduler performance work 2013-12-06 09:32:43 -08:00
zfs_fuid.c Illumos #3522 2013-10-30 14:51:27 -07:00
zfs_ioctl.c Fix the creation of ZPOOL_HIST_CMD pool history entries. 2014-01-07 09:00:26 -08:00
zfs_log.c Only commit the ZIL once in zpl_writepages() (msync() case). 2013-11-23 15:08:29 -08:00
zfs_onexit.c Illumos #3598 2013-10-31 14:58:04 -07:00
zfs_replay.c Illumos #3598 2013-10-31 14:58:04 -07:00
zfs_rlock.c cstyle: Resolve C style issues 2013-12-18 16:46:35 -08:00
zfs_sa.c cstyle: Resolve C style issues 2013-12-18 16:46:35 -08:00
zfs_vfsops.c Propagate errors when registering "relatime" property callback. 2014-02-12 09:38:28 -08:00
zfs_vnops.c Fix zfs_getattr_fast types 2014-01-09 15:50:23 -08:00
zfs_znode.c Implement relatime. 2014-01-29 15:50:44 -08:00
zil.c cstyle: Resolve C style issues 2013-12-18 16:46:35 -08:00
zio_checksum.c Illumos #3598 2013-10-31 14:58:04 -07:00
zio_compress.c Illumos #3598 2013-10-31 14:58:04 -07:00
zio_inject.c Illumos #3598 2013-10-31 14:58:04 -07:00
zio.c cstyle: Resolve C style issues 2013-12-18 16:46:35 -08:00
zle.c Update core ZFS code from build 121 to build 141. 2010-05-28 13:45:14 -07:00
zpl_ctldir.c cstyle: Resolve C style issues 2013-12-18 16:46:35 -08:00
zpl_export.c cstyle: Resolve C style issues 2013-12-18 16:46:35 -08:00
zpl_file.c cstyle: Resolve C style issues 2013-12-18 16:46:35 -08:00
zpl_inode.c cstyle: Resolve C style issues 2013-12-18 16:46:35 -08:00
zpl_super.c Prune metadata from ghost lists in arc_adjust_meta 2014-02-21 16:10:49 -08:00
zpl_xattr.c cstyle: Resolve C style issues 2013-12-18 16:46:35 -08:00
zrlock.c Export ZFS symbols needed by Lustre. 2010-09-17 16:24:15 -07:00
zvol.c Use long holds in zvol_set_volsize() 2014-01-14 14:46:12 -08:00