mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-05-22 18:40:43 +03:00

Author	SHA1	Message	Date
Darik Horn	28eb9213d8	Linux 3.2 compat: set_nlink() Directly changing inode->i_nlink is deprecated in Linux 3.2 by commit SHA: bfe8684869601dacfcb2cd69ef8cfd9045f62170 Use the new set_nlink() kernel function instead. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes: #462	2011-12-16 20:02:52 -08:00
Prakash Surya	8f2503e0af	Store copy of tqent_flags prior to servicing task A preallocated taskq_ent_t's tqent_flags must be checked prior to servicing the taskq_ent_t. Once a preallocated taskq entry is serviced, the ownership of the entry is handed back to the caller of taskq_dispatch, thus the entry's contents can potentially be mangled. In particular, this is a problem in the case where a preallocated taskq entry is serviced, and the caller clears it's tqent_flags field. Thus, when the function returns and task_done is called, it looks as though the entry is not a preallocated task (when in fact it is a preallocated task). In this situation, task_done will place the preallocated taskq_ent_t structure onto the taskq_t's free list. This is a huge mistake. If the taskq_ent_t is then freed by the caller of taskq_dispatch, the taskq_t's free list will hold a pointer to garbage data. Even worse, if nothing has over written the freed memory before the pointer is dereferenced, it may still look as though it points to a valid list_head belonging to a taskq_ent_t structure. Thus, the task entry's flags are now copied prior to servicing the task. This copy is then checked to see if it is a preallocated task, and determine if the entry needs to be passed down to the task_done function. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #71	2011-12-16 16:54:00 -08:00
Prakash Surya	e7e5f78e7b	Swap taskq_ent_t with taskqid_t in taskq_thread_t The taskq_t's active thread list is sorted based on its tqt_ent->tqent_id field. The list is kept sorted solely by inserting new taskq_thread_t's in their correct sorted location; no other means is used. This means that once inserted, if a taskq_thread_t's tqt_ent->tqent_id field changes, the list runs the risk of no longer being sorted. Prior to the introduction of the taskq_dispatch_prealloc() interface, this was not a problem as a taskq_ent_t actively being serviced under the old interface should always have a static tqent_id field. Thus, once the taskq_thread_t is added to the taskq_t's active thread list, the taskq_thread_t's tqt_ent->tqent_id field would remain constant. Now, this is no longer the case. Currently, if using the taskq_dispatch_prealloc() interface, any given taskq_ent_t actively being serviced _may_ have its tqent_id value incremented. This happens when the preallocated taskq_ent_t structure is recursively dispatched. Thus, a taskq_thread_t could potentially have its tqt_ent->tqent_id field silently modified from under its feet. If this were to happen to a taskq_thread_t on a taskq_t's active thread list, this would compromise the integrity of the order of the list (as the list _may_ no longer be sorted). To get around this, the taskq_thread_t's taskq_ent_t pointer was replaced with its own static copy of the tqent_id. So, as a taskq_ent_t is pulled off of the taskq_t's pending list, a static copy of its tqent_id is made and this copy is used to sort the active thread list. Using a static copy is key in ensuring the integrity of the order of the active thread list. Even if the underlying taskq_ent_t is recursively dispatched (as has its tqent_id modified), this static copy stored inside the taskq_thread_t will remain constant. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #71	2011-12-16 13:26:54 -08:00
Garrett D'Amore	a38718a63d	Illumos #734 : Use taskq_dispatch_ent() interface It has been observed that some of the hottest locks are those of the zio taskqs. Contention on these locks can limit the rate at which zios are dispatched which limits performance. This upstream change from Illumos uses new interface to the taskqs which allow them to utilize a prealloc'ed taskq_ent_t. This removes the need to perform an allocation at dispatch time while holding the contended lock. This has the effect of improving system performance. Reviewed by: Albert Lee <trisk@nexenta.com> Reviewed by: Richard Lowe <richlowe@richlowe.net> Reviewed by: Alexey Zaytsev <alexey.zaytsev@nexenta.com> Reviewed by: Jason Brian King <jason.brian.king@gmail.com> Reviewed by: George Wilson <gwilson@zfsmail.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Approved by: Gordon Ross <gwr@nexenta.com> References to Illumos issue: https://www.illumos.org/issues/734 Ported-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #482	2011-12-14 09:19:30 -08:00
Prakash Surya	699d5ee8a9	Exercise new taskq interface in splat-taskq tests The splat-taskq test functions were slightly modified to exercise the new taskq interface in addition to the old interface. If the old interface passes each of its tests, the new interface is exercised. Both sub tests (old interface and new interface) must pass for each test as a whole to pass. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #65	2011-12-13 16:10:57 -08:00
Prakash Surya	44217f7aad	Implement taskq_dispatch_prealloc() interface This patch implements the taskq_dispatch_prealloc() interface which was introduced by the following illumos-gate commit. It allows for a preallocated taskq_ent_t to be used when dispatching items to a taskq. This eliminates a memory allocation which helps minimize lock contention in the taskq when dispatching functions. commit 5aeb94743e3be0c51e86f73096334611ae3a058e Author: Garrett D'Amore <garrett@nexenta.com> Date: Wed Jul 27 07:13:44 2011 -0700 734 taskq_dispatch_prealloc() desired 943 zio_interrupt ends up calling taskq_dispatch with TQ_SLEEP Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #65	2011-12-13 16:10:57 -08:00
Prakash Surya	ac1e5b6033	Add Test: "Single task queue, recursive dispatch" Added another splat taskq test to ensure tasks can be recursively submitted to a single task queue without issue. When the taskq_dispatch_prealloc() interface is introduced, this use case can potentially cause a deadlock if a taskq_ent_t is dispatched while its tqent_list field is not empty. This _should_ never be a problem with the existing taskq_dispatch() interface. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #65	2011-12-13 16:10:57 -08:00
Prakash Surya	2c02b71b14	Replace tq_work_list and tq_threads in taskq_t To lay the ground work for introducing the taskq_dispatch_prealloc() interface, the tq_work_list and tq_threads fields had to be replaced with new alternatives in the taskq_t structure. The tq_threads field was replaced with tq_thread_list. Rather than storing the pointers to the taskq's kernel threads in an array, they are now stored as a list. In addition to laying the ground work for the taskq_dispatch_prealloc() interface, this change could also enable taskq threads to be dynamically created and destroyed as threads can now be added and removed to this list relatively easily. The tq_work_list field was replaced with tq_active_list. Instead of keeping a list of taskq_ent_t's which are currently being serviced, a list of taskq_threads currently servicing a taskq_ent_t is kept. This frees up the taskq_ent_t's tqent_list field when it is being serviced (i.e. now when a taskq_ent_t is being serviced, it's tqent_list field will be empty). Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #65	2011-12-13 16:10:50 -08:00
Prakash Surya	046a70c93b	Replace struct spl_task with struct taskq_ent The spl_task structure was renamed to taskq_ent, and all of its fields were renamed to have a prefix of 'tqent' rather than 't'. This was to align with the naming convention which the ZFS code assumes. Previously these fields were private so the name never mattered. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #65	2011-12-13 12:28:09 -08:00
Prakash Surya	ed948fa72b	Add SPLAT_TEST_FINI call for SPLAT_TASKQ_TEST6_ID This change adds the neglected SPLAT_TEST_FINI call for the SPLAT_TASKQ_TEST6_ID, just as is done for the other 5 SPLAT_TASKQ_* tests. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #64	2011-12-13 12:26:16 -08:00
Brian Behlendorf	30a9524e45	Set zvol_major/zvol_threads permissions The zvol_major and zvol_threads module options were being created with 0 permission bits. This prevented them from being listed in the /sys/module/zfs/parameters/ directory, although they were visible in `modinfo zfs`. This patch fixes the issue by updating the permission bits to 0444. For the moment these options must be read-only because they are used during module initialization. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #392	2011-12-07 09:27:50 -08:00
Brian Behlendorf	23bdb07d4e	Update default ARC memory limits In the upstream OpenSolaris ZFS code the maximum ARC usage is limited to 3/4 of memory or all but 1GB, whichever is larger. Because of how Linux's VM subsystem is organized these defaults have proven to be too large which can lead to stability issues. To avoid making everyone manually tune the ARC the defaults are being changed to 1/2 of memory or all but 4GB. The rational for this is as follows: * Desktop Systems (less than 8GB of memory) Limiting the ARC to 1/2 of memory is desirable for desktop systems which have highly dynamic memory requirements. For example, launching your web browser can suddenly result in a demand for several gigabytes of memory. This memory must be reclaimed from the ARC cache which can take some time. The user will experience this reclaim time as a sluggish system with poor interactive performance. Thus in this case it is preferable to leave the memory as free and available for immediate use. * Server Systems (more than 8GB of memory) Using all but 4GB of memory for the ARC is preferable for server systems. These systems often run with minimal user interaction and have long running daemons with relatively stable memory demands. These systems will benefit most by having as much data cached in memory as possible. These values should work well for most configurations. However, if you have a desktop system with more than 8GB of memory you may wish to further restrict the ARC. This can still be accomplished by setting the 'zfs_arc_max' module option. Additionally, keep in mind these aren't currently hard limits. The ARC is based on a slab implementation which can suffer from memory fragmentation. Because this fragmentation is not visible from the ARC it may believe it is within the specified limits while actually consuming slightly more memory. How much more memory get's consumed will be determined by how badly fragmented the slabs are. In the long term this can be mitigated by slab defragmentation code which was OpenSolaris solution. Or preferably, using the page cache to back the ARC under Linux would be even better. See issue #75 for the benefits of more tightly integrating with the page cache. This change also fixes a issue where the default ARC max was being set incorrectly for machines with less than 2GB of memory. The constant in the arc_c_max comparison must be explicitly cast to a uint64_t type to prevent overflow and the wrong conditional branch being taken. This failure was typically observed in VMs which are commonly created with less than 2GB of memory. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #75	2011-12-05 12:02:12 -08:00
Brian Behlendorf	f31b3ebe6e	Allow xattrs on symlinks The Solaris version of ZFS does not allow xattrs to be set on symlinks due to the way they implemented the attropen() system call. Linux however implements xattrs through the lgetxattr() and lsetxattr() system calls which do not have this limitation. The only reason this hasn't always worked under ZFS on Linux is that the xattr handlers were not registered for symlink type inodes. This was done simply to be consistent with the Solaris behavior. Upon futher reflection I believe this should be allowed under Linux. The only ill effect would be that the xattrs on symlinks will not be visible when the pool is imported on a Solaris system. This also has the benefit that it allows for SELinux style security xattr labeling which expects to be able to set xattrs on all inode types. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #272	2011-11-29 10:24:24 -08:00
Brian Behlendorf	82a37189aa	Implement SA based xattrs The current ZFS implementation stores xattrs on disk using a hidden directory. In this directory a file name represents the xattr name and the file contexts are the xattr binary data. This approach is very flexible and allows for arbitrarily large xattrs. However, it also suffers from a significant performance penalty. Accessing a single xattr can requires up to three disk seeks. 1) Lookup the dnode object. 2) Lookup the dnodes's xattr directory object. 3) Lookup the xattr object in the directory. To avoid this performance penalty Linux filesystems such as ext3 and xfs try to store the xattr as part of the inode on disk. When the xattr is to large to store in the inode then a single external block is allocated for them. In practice most xattrs are small and this approach works well. The addition of System Attributes (SA) to zfs provides us a clean way to make this optimization. When the dataset property 'xattr=sa' is set then xattrs will be preferentially stored as System Attributes. This allows tiny xattrs (~100 bytes) to be stored with the dnode and up to 64k of xattrs to be stored in the spill block. If additional xattr space is required, which is unlikely under Linux, they will be stored using the traditional directory approach. This optimization results in roughly a 3x performance improvement when accessing xattrs which brings zfs roughly to parity with ext4 and xfs (see table below). When multiple xattrs are stored per-file the performance improvements are even greater because all of the xattrs stored in the spill block will be cached. However, by default SA based xattrs are disabled in the Linux port to maximize compatibility with other implementations. If you do enable SA based xattrs then they will not be visible on platforms which do not support this feature. ---------------------------------------------------------------------- Time in seconds to get/set one xattr of N bytes on 100,000 files ------+--------------------------------+------------------------------ \| setxattr \| getxattr bytes \| ext4 xfs zfs-dir zfs-sa \| ext4 xfs zfs-dir zfs-sa ------+--------------------------------+------------------------------ 1 \| 2.33 31.88 21.50 4.57 \| 2.35 2.64 6.29 2.43 32 \| 2.79 30.68 21.98 4.60 \| 2.44 2.59 6.78 2.48 256 \| 3.25 31.99 21.36 5.92 \| 2.32 2.71 6.22 3.14 1024 \| 3.30 32.61 22.83 8.45 \| 2.40 2.79 6.24 3.27 4096 \| 3.57 317.46 22.52 10.73 \| 2.78 28.62 6.90 3.94 16384 \| n/a 2342.39 34.30 19.20 \| n/a 45.44 145.90 7.55 65536 \| n/a 2941.39 128.15 131.32* \| n/a 141.92 256.85 262.12* Legend: * ext4 - Stock RHEL6.1 ext4 mounted with '-o user_xattr'. * xfs - Stock RHEL6.1 xfs mounted with default options. * zfs-dir - Directory based xattrs only. * zfs-sa - Prefer SAs but spill in to directories as needed, a trailing * indicates overflow in to directories occured. NOTE: Ext4 supports 4096 bytes of xattr name/value pairs per file. NOTE: XFS and ZFS have no limit on xattr name/value pairs per file. NOTE: Linux limits individual name/value pairs to 65536 bytes. NOTE: All setattr/getattr's were done after dropping the cache. NOTE: All tests were run against a single hard drive. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #443	2011-11-28 15:45:51 -08:00
Prakash Surya	e05bec805b	Fix a typo referencing an incorrect symbol The splat_taskq_test4_common function was incorrectly referencing the splat_taskq-test13_func symbol, when it meant to be using the splat_taskq_test4_func symbol. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #61	2011-11-21 16:52:36 -08:00
Brian Behlendorf	ca5fd24984	Limit maximum ashift value to 12 While we initially allowed you to set your ashift as large as 17 (SPA_MAXBLOCKSIZE) that is actually unsafe. What wasn't considered at the time is that each uberblock written to the vdev label ring buffer will be of this size. Now the buffer is statically sized to 128k and we need to be able to fit several uberblocks in it. With a large ashift that becomes a problem. Therefore I'm reducing the maximum configurable ashift value to 12. This is large enough for the 4k sector drives and small enough that we can still keep the most recent 32 uberblock in the vdev label ring buffer. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #425	2011-11-11 14:50:48 -08:00
Brian Behlendorf	1114ae6ae7	Prepend spl_ to all init/fini functions This is a bit of cleanup I'd been meaning to get to for a while to reduce the chance of a type conflict. Well that conflict finally occurred with the kstat_init() function which conflicts with a function in the 2.6.32-6-pve kernel. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #56	2011-11-11 09:18:28 -08:00
Brian Behlendorf	adcd70bd1a	Linux 3.1 compat, fops->fsync() The Linux 3.1 kernel updated the fops->fsync() callback yet again. They now pass the requested range and delegate the responsibility for calling filemap_write_and_wait_range() to the callback. In addition imutex is no longer held by the caller and the callback is responsible for taking the lock if required. This commit updates the code to provide a zpl_fsync() function for the updated API. Implementations for the previous two APIs are also maintained for compatibility. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #445	2011-11-10 10:03:08 -08:00
Brian Behlendorf	fe71c0e567	Linux 3.1 compat, shrink_*cache_memory As of Linux 3.1 the shrink_dcache_memory and shrink_icache_memory functions have been removed. This same task is now accomplished more cleanly with per super block shrinkers. This unfortunately leaves us no easy way to support the dnlc_reduce_cache() function. This support has always been entirely optional. So when no reasonable interface is available allow the dnlc_reduce_cache() function to effectively become a no-op. The downside of this change is that it will prevent the zfs arc meta data limts from being enforced. However, the current zfs implementation in this regard is already flawed and needs to be reworked. If the arc needs to enfore a meta data limit it will need to be extended to coordinate directly with the zpl. This will allow us to drop all this compatibility code and get more fine grained control over the cache management. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #52	2011-11-09 19:36:30 -08:00
Brian Behlendorf	12ff95ff57	Linux 3.1 compat, kern_path_parent() Prior to Linux 3.1 the kern_path_parent symbol was exported for use by kernel modules. As of Linux 3.1 it is now longer easily available. To handle this case the spl will now dynamically look up address of the missing symbol at module load time. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #52	2011-11-09 16:51:25 -08:00
Brian Behlendorf	5547c2f1bf	Simplify BDI integration Update the code to use the bdi_setup_and_register() helper to simplify the bdi integration code. The updated code now just registers the bdi during mount and destroys it during unmount. The only complication is that for 2.6.32 - 2.6.33 kernels the helper wasn't available so in these cases the zfs code must provide it. Luckily the bdi_setup_and_register() function is trivial. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #367	2011-11-08 10:19:03 -08:00
Brian Behlendorf	591fb62f19	Disown dataset in zfs_sb_create() Fix an unlikely failure cause in zfs_sb_create() which could leave the dataset owned on error and thus unavailable until after a reboot. Disown the dataset if SA are expected but are in fact missing. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-11-08 10:18:40 -08:00
Brian Behlendorf	ae6ba3dbe6	Improve meta data performance Profiling the system during meta data intensive workloads such as creating/removing millions of files, revealed that the system was cpu bound. A large fraction of that cpu time was being spent waiting on the virtual address space spin lock. It turns out this was caused by certain heavily used kmem_caches being backed by virtual memory. By default a kmem_cache will dynamically determine the type of memory used based on the object size. For large objects virtual memory is usually preferable and for small object physical memory is a better choice. See the spl_slab_alloc() function for a longer discussion on this. However, there is a certain amount of gray area when defining a 'large' object. For the following caches it turns out they were just over the line: * dnode_cache * zio_cache * zio_link_cache * zio_buf_512_cache * zfs_data_buf_512_cache Now because we know there will be a lot of churn in these caches, and because we know the slabs will still be reasonably sized. We can safely request with the KMC_KMEM flag that the caches be backed with physical memory addresses. This entirely avoids the need to serialize on the virtual address space lock. As a bonus this also reduces our vmalloc usage which will be good for 32-bit kernels which have a very small virtual address space. It will also probably be good for interactive performance since unrelated processes could also block of this same global lock. Finally, we may see less cpu time being burned in the arc_reclaim and txg_sync_threads. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #258	2011-11-03 10:19:21 -07:00
Brian Behlendorf	6a95d0b74c	Fix NULL deref in balance_pgdat() Be careful not to unconditionally clear the PF_MEMALLOC bit in the task structure. It may have already been set when entering zpl_putpage() in which case it must remain set on exit. In particular the kswapd thread will have PF_MEMALLOC set in order to prevent it from entering direct reclaim. By clearing it we allow the following NULL deref to potentially occur. BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff8109c7ab>] balance_pgdat+0x25b/0x4ff Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #287	2011-11-03 10:15:39 -07:00
Gunnar Beutner	a7b125e9a5	Fix a race condition in zfs_getattr_fast() zfs_getattr_fast() was missing a lock on the ZFS superblock which could result in zfs_znode_dmu_fini() clearing the zp->z_sa_hdl member while zfs_getattr_fast() was accessing the znode. The result of this would usually be a panic. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Fixes #431	2011-11-03 10:13:09 -07:00
Brian Behlendorf	b8b6e4c453	Fix NULL deref in balance_pgdat() Be careful not to unconditionally clear the PF_MEMALLOC bit in the task structure. It may have already been set when entering kv_alloc() in which case it must remain set on exit. In particular the kswapd thread will have PF_MEMALLOC set in order to prevent it from entering direct reclaim. By clearing it we allow the following NULL deref to potentially occur. BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff8109c7ab>] balance_pgdat+0x25b/0x4ff Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes ZFS issue #287	2011-11-03 09:50:22 -07:00
Xin Li	c475167627	Illumos #1661 : Fix flaw in sa_find_sizes() calculation When calculating space needed for SA_BONUS buffers, hdrsize is always rounded up to next 8-aligned boundary. However, in two places the round up was done against sum of 'total' plus hdrsize. On the other hand, hdrsize increments by 4 each time, which means in certain conditions, we would end up returning with will_spill == 0 and (total + hdrsize) larger than full_space, leading to a failed assertion because it's invalid for dmu_set_bonus. Reviewed by: Matthew Ahrens <matt@delphix.com> Reviewed by: Dan McDonald <danmcd@nexenta.com> Approved by: Gordon Ross <gwr@nexenta.com> References to Illumos issue: https://www.illumos.org/issues/1661 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #426	2011-10-24 09:57:52 -07:00
Darik Horn	3cee2262a6	Change sun.com URLs to zfsonlinux.org ZFS contains error messages that point to the defunct www.sun.com domain, which is currently offline. Change these error messages to use the zfsonlinux.org mirror instead. This commit depends on: zfsonlinux/zfsonlinux.github.com@8e10ead3dc Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-10-24 09:52:21 -07:00
Gunnar Beutner	f3989ed322	vn_rdwr() didn't properly advance the file position This would cause problems when using 'zfs send' with a file as the target (rather than a pipe or a socket as is usually the case) as for each write the destination offset in the file would be 0. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes ZFS issue #391	2011-10-18 16:51:35 -07:00
Brian Behlendorf	6f2255ba8a	Set mtime on symbolic links Register the setattr/getattr callbacks for symlinks. Without these the generic inode_setattr() and generic_fillattr() functions will be used. In the setattr case this will only result in the inode being updated in memory, the dirty_inode callback would also normally run but none is registered for zfs. The straight forward fix is to set the setattr/getattr callbacks for symlinks so they are handled just like files and directories. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #412	2011-10-18 15:49:31 -07:00
Alexander Stetsenko	8d35c1499d	Illumos #755 : dmu_recv_stream builds incomplete guid_to_ds_map An incomplete guid_to_ds_map would cause restore_write_byref() to fail while receiving a de-duplicated backup stream. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Garrett D`Amore <garrett@nexenta.com> Reviewed by: Gordon Ross <gwr@nexenta.com> Approved by: Gordon Ross <gwr@nexenta.com> References to Illumos issue and patch: - https://www.illumos.org/issues/755 - https://github.com/illumos/illumos-gate/commit/ec5cf9d53a Signed-off-by: Gunnar Beutner <gunnar@beutner.name> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #372	2011-10-18 11:18:14 -07:00
Brian Behlendorf	ecc3981007	Fix various typos in comments Just clean up some of the typos and spelling mistakes in the comments of spl-kmem.c. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-10-11 10:32:49 -07:00
Brian Behlendorf	86f35f34f4	Export symbols for the VFS API Export all symbols already marked extern in the zfs_vfsops.h header. Several non-static symbols have also been added to the header and exportewd. This allows external modules to more easily create and manipulate properly created ZFS filesystem type datasets. Rename zfsvfs_teardown() to zfs_sb_teardown and export it. This is done simply for consistency with the rest of the code base. All other zfsvfs_* functions have already been renamed. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-10-11 10:25:59 -07:00
Gunnar Beutner	8d177c181f	Fixed typo in spl_slab_alloc() The typo did not have any effect (apart from a negligible performance impact) because skc->skc_flags * KMC_OFFSLAB is always non-null when at least one bit in skc->skc_flags is set. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-10-11 10:03:43 -07:00
Gunnar Beutner	64c075c3f4	Properly destroy work items in spl_kmem_cache_destroy() In a non-debug build the ASSERT() would be optimized away which could cause pending work items to not be cancelled. We must also use cancel_delayed_work_sync() rather than just cancel_delayed_work() to actually wait until work items have completed. Otherwise they might accidentally access free'd memory. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes ZFS bugs #279, #62, #363, #418	2011-10-11 09:59:19 -07:00
Gunnar Beutner	763b2f3b57	Fixed invalid resource re-use in file_find() File descriptors are a per-process resource. The same descriptor in different processes can refer to different files. find_file() incorrectly assumed that file descriptors are globally unique. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes ZFS issue #386	2011-10-11 09:51:51 -07:00
Brian Behlendorf	6b3b569df3	Remove /etc/hostid missing warning No longer print the following warning to the console when the /etc/hostid file is missing. This is the expected default behavior. Keeping the hostid in sync with the initramfs is now accomplished by creating the /etc/hostid in the initramfs not on the system. SPL: The /etc/hostid file is not found. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-10-06 14:58:09 -07:00
Brian Behlendorf	e45aa45298	Export symbols for the full SA API Export all the symbols for the system attribute (SA) API. This allows external module to cleanly manipulate the SAs associated with a dnode. Documention for the SA API can be found in the module/zfs/sa.c source. This change also removes the zfs_sa_uprade_pre, and zfs_sa_uprade_post prototypes. The functions themselves were dropped some time ago. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-10-05 15:59:56 -07:00
Andreas Dilger	baab063016	zpl: Fix "df -i" to have better free inodes value Due to the confusion in Linux statfs between f_frsize and f_bsize the blocks counts were changed to be in units of z_max_blksize instead of SPA_MINBLOCKSIZE as it is on other platforms. However, the free files calculation in zfs_statvfs() is limited by the free blocks count, since each dnode consumes one block/sector. This provided a reasonable estimate of free inodes, but on Linux this meant that the free inodes count was underestimated by a large amount, since 256 512-byte dnodes can fit into a 128kB block, and more if the max blocksize is increased to 1MB or larger. Also, the use of SPA_MINBLOCKSIZE is semantically incorrect since DNODE_SIZE may change to a value other than SPA_MINBLOCKSIZE and may even change per dataset, and devices with large sectors setting ashift will also use a larger blocksize. Correct the f_ffree calculation to use (availbytes >> DNODE_SHIFT) to more accurately compute the maximum number of dnodes that can be created. Signed-off-by: Andreas Dilger <adilger@whamcloud.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #413 Closes #400	2011-09-28 11:27:10 -07:00
Brian Behlendorf	dee28b0700	Export symbols for the full ZAP API Export all the symbols for the ZAP API. This allows external modules to cleanly interface with ZAP type objects. Previously only a subset of the functionality was exposed. Documention for the ZAP API can be found in the sys/zap.h header. This change also removes a duplicate zap_increment_int() prototype. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-09-27 16:12:36 -07:00
Brian Behlendorf	fa6e5ced2f	Suppress kmem_alloc() warning in zfs_prop_set_special() Suppress the warning for this large kmem_alloc() because it is not that far over the warning threshhold (8k) and it is short lived. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-09-15 20:26:51 -07:00
Brian Behlendorf	2708f716c0	Fix usage of zsb after free Caught by code inspection, the variable zsb was referenced after being freed. Move the kmem_free() to the end of the function. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-09-09 10:29:48 -07:00
Brian Behlendorf	95d9fd028b	Fix incompatible pointer type warning This warning was accidentally introduced by commit `f3ab88d646` which updated the .readpages() implementation. The fix is to simply cast the helper function to the appropriate type when passed. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-08-19 15:16:30 -07:00
Brian Behlendorf	f3ab88d646	Correctly lock pages for .readpages() Unlike the .readpage() callback which is passed a single locked page to be populated. The .readpages() callback is passed a list of unlocked pages which are all marked for read-ahead (PG_readahead set). It is the responsibly of .readpages() to ensure to pages are properly locked before being populated. Prior to this change the requested read-ahead pages would be updated outside of the page lock which is unsafe. The unlocked pages would then be unlocked again which is harmless but should have been immediately detected as bug. Unfortunately, newer kernels failed detect this issue because the check is done with a VM_BUG_ON which is disabled by default. Luckily, the old Debian Lenny 2.6.26 kernel caught this because it simply uses a BUG_ON. The straight forward fix for this is to update the .readpages() callback to use the read_cache_pages() helper function. The helper function will ensure that each page in the list is properly locked before it is passed to the .readpage() callback. In addition resolving the bug, this results in a nice simplification of the existing code. The downside to this change is that instead of passing one large read request to the dmu multiple smaller ones are submitted. All of these requests however are marked for readahead so the lower layers should issue a large I/O regardless. Thus most of the request should hit the ARC cache. Futher optimization of this code can be done in the future is a perform analysis determines it to be worthwhile. But for the moment, it is preferable that code be correct and understandable. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #355	2011-08-08 13:24:52 -07:00
Brian Behlendorf	76659dc110	Add backing_device_info per-filesystem For a long time now the kernel has been moving away from using the pdflush daemon to write 'old' dirty pages to disk. The primary reason for this is because the pdflush daemon is single threaded and can be a limiting factor for performance. Since pdflush sequentially walks the dirty inode list for each super block any delay in processing can slow down dirty page writeback for all filesystems. The replacement for pdflush is called bdi (backing device info). The bdi system involves creating a per-filesystem control structure each with its own private sets of queues to manage writeback. The advantage is greater parallelism which improves performance and prevents a single filesystem from slowing writeback to the others. For a long time both systems co-existed in the kernel so it wasn't strictly required to implement the bdi scheme. However, as of Linux 2.6.36 kernels the pdflush functionality has been retired. Since ZFS already bypasses the page cache for most I/O this is only an issue for mmap(2) writes which must go through the page cache. Even then adding this missing support for newer kernels was overlooked because there are other mechanisms which can trigger writeback. However, there is one critical case where not implementing the bdi functionality can cause problems. If an application handles a page fault it can enter the balance_dirty_pages() callpath. This will result in the application hanging until the number of dirty pages in the system drops below the dirty ratio. Without a registered backing_device_info for the filesystem the dirty pages will not get written out. Thus the application will hang. As mentioned above this was less of an issue with older kernels because pdflush would eventually write out the dirty pages. This change adds a backing_device_info structure to the zfs_sb_t which is already allocated per-super block. It is then registered when the filesystem mounted and unregistered on unmount. It will not be registered for mounted snapshots which are read-only. This change will result in flush-<pool> thread being dynamically created and destroyed per-mounted filesystem for writeback. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #174	2011-08-04 13:37:38 -07:00
Brian Behlendorf	3c0e5c0f45	Cleanup mmap(2) writes While the existing implementation of .writepage()/zpl_putpage() was functional it was not entirely correct. In particular, it would move dirty pages in to a clean state simply after copying them in to the ARC cache. This would result in the pages being lost if the system were to crash enough though the Linux VFS believed them to be safe on stable storage. Since at the moment virtually all I/O, except mmap(2), bypasses the page cache this isn't as bad as it sounds. However, as hopefully start using the page cache more getting this right becomes more important so it's good to improve this now. This patch takes a big step in that direction by updating the code to correctly move dirty pages through a writeback phase before they are marked clean. When a dirty page is copied in to the ARC it will now be set in writeback and a completion callback is registered with the transaction. The page will stay in writeback until the dmu runs the completion callback indicating the page is on stable storage. At this point the page can be safely marked clean. This process is normally entirely asynchronous and will be repeated for every dirty page. This may initially sound inefficient but most of these pages will end up in a few txgs. That means when they are eventually written to disk they should be nicely batched. However, there is room for improvement. It may still be desirable to batch up the pages in to larger writes for the dmu. This would reduce the number of callbacks and small 4k buffer required by the ARC. Finally, if the caller requires that the I/O be done synchronously by setting WB_SYNC_ALL or if ZFS_SYNC_ALWAYS is set. Then the I/O will trigger a zil_commit() to flush the data to stable storage. At which point the registered callbacks will be run leaving the date safe of disk and marked clean before returning from .writepage. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-08-02 10:34:55 -07:00
Martin Matuska	cddafdcbc5	Illumos #1313 : Integer overflow in txg_delay() The function txg_delay() is used to delay txg (transaction group) threads in ZFS. The timeout value for this function is calculated using: int timeout = ddi_get_lbolt() + ticks; Later, the actual wait is performed: while (ddi_get_lbolt() < timeout && tx->tx_syncing_txg < txg-1 && !txg_stalled(dp)) (void) cv_timedwait(&tx->tx_quiesce_more_cv, &tx->tx_sync_lock, timeout - ddi_get_lbolt()); The ddi_get_lbolt() function returns current uptime in clock ticks and is typed as clock_t. The clock_t type on 64-bit architectures is int64_t. The "timeout" variable will overflow depending on the tick frequency (e.g. for 1000 it will overflow in 28.855 days). This will make the expression "ddi_get_lbolt() < timeout" always false - txg threads will not be delayed anymore at all. This leads to a slowdown in ZFS writes. The attached patch initializes timeout as clock_t to match the return value of ddi_get_lbolt(). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #352	2011-08-01 12:09:43 -07:00
Alexander Stetsenko	0b7936d5c2	Illumos #278 : get rid zfs of python and pyzfs dependencies Remove all python and pyzfs dependencies for consistency and to ensure full functionality even in a mimimalist environment. Reviewed by: gordon.w.ross@gmail.com Reviewed by: trisk@opensolaris.org Reviewed by: alexander.r.eremin@gmail.com Reviewed by: jerry.jelinek@joyent.com Approved by: garrett@nexenta.com References to Illumos issue and patch: - https://www.illumos.org/issues/278 - https://github.com/illumos/illumos-gate/commit/1af68beac3 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #340 Issue #160 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-08-01 12:09:36 -07:00
Martin Matuska	ca5252204a	Illumos #1043 : Recursive zfs snapshot destroy fails Prior to revision 11314 if a user was recursively destroying snapshots of a dataset the target dataset was not required to exist. The zfs_secpolicy_destroy_snaps() function introduced the security check on the target dataset, so since then if the target dataset does not exist, the recursive destroy is not performed. Before 11314, only a delete permission check on the snapshot's master dataset was performed. Steps to reproduce: zfs create pool/a zfs snapshot pool/a@s1 zfs destroy -r pool@s1 Therefore I suggest to fallback to the old security check, if the target snapshot does not exist and continue with the destroy. References to Illumos issue and patch: - https://www.illumos.org/issues/1043 - https://www.illumos.org/attachments/217/recursive_dataset_destroy.patch Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #340	2011-08-01 12:09:11 -07:00
Eric Schrock	3e31d2b080	Illumos #883 : ZIL reuse during remount corruption Moving the zil_free() cleanup to zil_close() prevents this problem from occurring in the first place. There is a very good description of the issue and fix in Illumus #883. Reviewed by: Matt Ahrens <Matt.Ahrens@delphix.com> Reviewed by: Adam Leventhal <Adam.Leventhal@delphix.com> Reviewed by: Albert Lee <trisk@nexenta.com> Reviewed by: Gordon Ross <gwr@nexenta.com> Reviewed by: Garrett D'Amore <garrett@nexenta.com> Reivewed by: Dan McDonald <danmcd@nexenta.com> Approved by: Gordon Ross <gwr@nexenta.com> References to Illumos issue and patch: - https://www.illumos.org/issues/883 - https://github.com/illumos/illumos-gate/commit/c9ba2a43cb Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #340	2011-08-01 12:09:11 -07:00

... 57 58 59 60 61 ...

3355 Commits