Commit Graph

140 Commits

Author SHA1 Message Date
Brian Behlendorf
f6dcdf13f8 Fix 'statement with no effect' warning
Because the secpolicy_* macros are all currently defined to (0).
And because the caller of this function does not check the return
code.  The compiler complains that this statement has no effect
which is correct and OK.  To suppress the warning explictly cast
the result to (void).
2011-02-23 13:03:19 -08:00
Brian Behlendorf
a31a70bbd1 Fix enum compiler warning
Generally it's a good idea to use enums for switch statements,
but in this case it causes warning because the enum is really a
set of flags.  These flags are OR'ed together in some cases
resulting in values which are not part of the original enum.
This causes compiler warning such as this about invalid cases.

  error: case value ‘33’ not in enumerated type ‘zprop_source_t’

To handle this we simply case the enum to an int for the switch
statement.  This leaves all other enum type checking in place
and effectively disabled these warnings.
2011-02-23 12:52:51 -08:00
Brian Behlendorf
61e909608d Linux 2.6.x compat, blkdev_compat.h
For legacy reasons the zvol.c and vdev_disk.c Linux compatibility
code ended up in sys/blkdev.h and sys/vdev_disk.h headers.  While
there are worse places for this code to live it should be in a
linux/blkdev_compat.h header.  This change moves this block device
Linux compatibility code in to the linux/blkdev_compat.h header
and updates all the correct #include locations.  This is not a
functional change or bug fix, it is just code cleanup.
2011-02-23 12:29:38 -08:00
Brian Behlendorf
5d0265c0dd Merge branch 'zpl' 2011-02-18 09:31:25 -08:00
Brian Behlendorf
037849f854 Use provided uid/gid for setattr
When changing the uid/gid of a file via zfs_setattr() use the
Posix id passed in iattr->ia_uid/gid.  While the zfs_fuid_create()
code already had the fuid support disabled for Linux it was
returning the uid/gid from the credential.  With this change
the 'chown' command which relies on setxattr is now working
properly.

Also remove a little stray white space which was in front of
zfs_update_inode() call and the end of zfs_setattr().
2011-02-17 14:23:48 -08:00
Brian Behlendorf
efd1832bc6 Fix symlink(2) inode reference count
Under Linux sys_symlink(2) should result in a inode being created
with one reference for the inode itself, and a second reference on
the inode which is held by the new dentry.  Under Solaris this
appears not to be the case.  Their zfs_symlink() handler drops
the inode reference before returning.

The result of this under Linux is that the reference count for
symlinks is always one smaller than it should have been. This
results in a BUG() when the symlink is unlinked.  To handle this
the Linux port now keeps the inode reference which differs from
the Solaris behavior.  This results in correct reference counts.

Closes #96
2011-02-17 11:34:47 -08:00
Brian Behlendorf
5095000169 Use -zfs_readlink() error
The zfs_readlink() function returns a Solaris positive error value
and that needs to be converted to a Linux negative error value.
While in this case nothing would actually go wrong, it's still
incorrect and should be fixed if for no other reason than clarity.
2011-02-17 09:48:06 -08:00
Brian Behlendorf
8b4f9a2d55 Fix readlink(2)
This patch addresses three issues related to symlinks.

1) Revert the zfs_follow_link() function to a modified version
of the original zfs_readlink().  The only changes from the
original OpenSolaris version relate to using Linux types.
For the moment this means no vnode's and no zfsvfs_t.  The
caller zpl_follow_link() was also updated accordingly.  This
change was reverted because it was slightly gratuitious.

2) Update zpl_follow_link() to use local variables for the
link buffer.  I'd forgotten that iov.iov_base is updated by
uiomove() so after the call to zfs_readlink() it can not longer
be used.  We need our own private copy of the link pointer.

3) Allocate MAXPATHLEN instead of MAXPATHLEN+1.  By default
MAXPATHLEN is 4096 bytes which is a full page, adding one to
it pushes it slightly over a page.  That means you'll likely
end up allocating 2 pages which is wasteful of memory and
possibly slightly slower.
2011-02-16 15:54:55 -08:00
Ricardo M. Correia
54a179e7b8 Add API to wait for pending commit callbacks
This adds an API to wait for pending commit callbacks of already-synced
transactions to finish processing.  This is needed by the DMU-OSD in
Lustre during device finalization when some callbacks may still not be
called, this leads to non-zero reference count errors.  See lustre.org
bug 23931.
2011-02-16 11:20:06 -08:00
Brian Behlendorf
a6695d83b7 Add get/setattr, get/setxattr hooks
While the attr/xattr hooks were already in place for regular
files this hooks can also apply to directories and special files.
While they aren't typically used in this way, it should be
supported.  This patch registers these additional callbacks
for both directory and special inode types.
2011-02-16 09:55:53 -08:00
Brian Behlendorf
d8fd10545b Fix FIFO and socket handling
Under Linux when creating a fifo or socket type device in the ZFS
filesystem it's critical that the rdev is stored in a SA.  This
was already being correctly done for character and block devices,
but that logic needed to be extended to include FIFOs and sockets.

This patch takes care of device creation but a follow on patch
may still be required to verify that the dev_t is being correctly
packed/unpacked from the SA.
2011-02-16 09:51:44 -08:00
Brian Behlendorf
d567444809 Create minors for all zvols
It was noticed that when you have zvols in multiple datasets
not all of the zvol devices are created at module load time.
Fajarnugraha did the leg work to identify that the root cause of
this bug is a non-zero return value from zvol_create_minors_cb().

Returning a non-zero value from the dmu_objset_find_spa() callback
function results in aborting processing the remaining children in
a dataset.  Since we want to ensure that the callback in run on
all children regardless of error simply unconditionally return
zero from the zvol_create_minors_cb().  This callback function
is solely used for this purpose so surpressing the error is safe.

Closes #96
2011-02-16 09:50:06 -08:00
Brian Behlendorf
2c395def27 Linux 2.6.36 compat, sops->evict_inode()
The new prefered inteface for evicting an inode from the inode cache
is the ->evict_inode() callback.  It replaces both the ->delete_inode()
and ->clear_inode() callbacks which were previously used for this.
2011-02-11 13:47:51 -08:00
Brian Behlendorf
f9637c6c8b Linux 2.6.33 compat, get/set xattr callbacks
The xattr handler prototypes were sanitized with the idea being that
the same handlers could be used for multiple methods.  The result of
this was the inode type was changes to a dentry, and both the get()
and set() hooks had a handler_flags argument added.  The list()
callback was similiarly effected but no autoconf check was added
because we do not use the list() callback.
2011-02-11 10:41:00 -08:00
Brian Behlendorf
7268e1bec8 Linux 2.6.35 compat, fops->fsync()
The fsync() callback in the file_operations structure used to take
3 arguments.  The callback now only takes 2 arguments because the
dentry argument was determined to be unused by all consumers.  To
handle this a compatibility prototype was added to ensure the right
prototype is used.  Our implementation never used the dentry argument
either so it's just a matter of using the right prototype.
2011-02-11 09:05:51 -08:00
Brian Behlendorf
777d4af891 Linux 2.6.35 compat, const struct xattr_handler
The const keyword was added to the 'struct xattr_handler' in the
generic Linux super_block structure.  To handle this we define an
appropriate xattr_handler_t typedef which can be used.  This was
the preferred solution because it keeps the code clean and readable.
2011-02-10 16:29:00 -08:00
Brian Behlendorf
6839eed23e Use 'noop' IO Scheduler
Initial testing has shown the the right IO scheduler to use under Linux
is noop.  This strikes the ideal balance by allowing the zfs elevator
to do all request ordering and prioritization.  While allowing the
Linux elevator to do the maximum front/back merging allowed by the
physical device.  This yields the largest possible requests for the
device with the lowest total overhead.

While 'noop' should be right for your system you can choose a different
IO scheduler with the 'zfs_vdev_scheduler' option.  You may set this
value to any of the standard Linux schedulers: noop, cfq, deadline,
anticipatory.  In addition, if you choose 'none' zfs will not attempt
to change the IO scheduler for the block device.
2011-02-10 09:27:22 -08:00
Brian Behlendorf
4db77a74a6 Suppress large kmem_alloc() warning
The following warning was observed under normal operation.  It's
not fatal but it's something to be addressed long term.  Flag the
offending allocation with KM_NODEBUG to suppress the warning and
flag the call site.

SPL: Showing stack for process 21761
Pid: 21761, comm: iozone Tainted: P           ----------------
2.6.32-71.14.1.el6.x86_64 #1
Call Trace:
 [<ffffffffa05465a7>] spl_debug_dumpstack+0x27/0x40 [spl]
 [<ffffffffa054a84d>] kmem_alloc_debug+0x11d/0x130 [spl]
 [<ffffffffa05de166>] dmu_buf_hold_array_by_dnode+0xa6/0x4e0 [zfs]
 [<ffffffffa05de825>] dmu_buf_hold_array+0x65/0x90 [zfs]
 [<ffffffffa05de891>] dmu_read_uio+0x41/0xd0 [zfs]
 [<ffffffffa0654827>] zfs_read+0x147/0x470 [zfs]
 [<ffffffffa06644a2>] zpl_read_common+0x52/0x70 [zfs]
 [<ffffffffa0664503>] zpl_read+0x43/0x70 [zfs]
 [<ffffffff8116d905>] vfs_read+0xb5/0x1a0
 [<ffffffff8116da41>] sys_read+0x51/0x90
 [<ffffffff81013172>] system_call_fastpath+0x16/0x1b
2011-02-10 09:27:22 -08:00
Brian Behlendorf
ceb43b935d Invalidate dcache and inode cache
When performing a 'zfs rollback' it's critical to invalidate
the previous dcache and inode cache.  If we don't there will
stale cache entries which when accessed will result in EIOs.
2011-02-10 09:27:22 -08:00
Brian Behlendorf
8926ab7a50 Move cv_destroy() outside zp->z_range_lock()
With the recent SPL change (d599e4fa) that forces cv_destroy()
to block until all waiters have been woken.  It is now unsafe
to call cv_destroy() under the zp->z_range_lock() because it
is used as the condition variable mutex.  If there are waiters
cv_destroy() will block until they wake up and aquire the mutex.
However, they will never aquire the mutex because cv_destroy()
will not return allowing it's caller to drop the lock.  Deadlock.

To avoid this cv_destroy() is now run asynchronously in a taskq.
This solves two problems:

1) It is no longer run under the zp->z_range_lock so no deadlock.
2) Since cv_destroy() may now block we don't want this slowing
   down zfs_range_unlock() and throttling the system.

This was not as much of an issue under OpenSolaris because their
cv_destroy() implementation does not do anything.  They do however
risk a bad paging request if cv_destroy() returns, the memory holding
the condition variable is free'd, and then the waiters wake up and
try to reference it.  It's a very small unlikely race, but it is
possible.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
c0d35759c5 Add mmap(2) support
It's worth taking a moment to describe how mmap is implemented
for zfs because it differs considerably from other Linux filesystems.
However, this issue is handled the same way under OpenSolaris.

The issue is that by design zfs bypasses the Linux page cache and
leaves all caching up to the ARC.  This has been shown to work
well for the common read(2)/write(2) case.  However, mmap(2)
is problem because it relies on being tightly integrated with the
page cache.  To handle this we cache mmap'ed files twice, once in
the ARC and a second time in the page cache.  The code is careful
to keep both copies synchronized.

When a file with an mmap'ed region is written to using write(2)
both the data in the ARC and existing pages in the page cache
are updated.  For a read(2) data will be read first from the page
cache then the ARC if needed.  Neither a write(2) or read(2) will
will ever result in new pages being added to the page cache.

New pages are added to the page cache only via .readpage() which
is called when the vfs needs to read a page off disk to back the
virtual memory region.  These pages may be modified without
notifying the ARC and will be written out periodically via
.writepage().  This will occur due to either a sync or the usual
page aging behavior.  Note because a read(2) of a mmap'ed file
will always check the page cache first even when the ARC is out
of date correct data will still be returned.

While this implementation ensures correct behavior it does have
have some drawbacks.  The most obvious of which is that it
increases the required memory footprint when access mmap'ed
files.  It also adds additional complexity to the code keeping
both caches synchronized.

Longer term it may be possible to cleanly resolve this wart by
mapping page cache pages directly on to the ARC buffers.  The
Linux address space operations are flexible enough to allow
selection of which pages back a particular index.  The trick
would be working out the details of which subsystem is in
charge, the ARC, the page cache, or both.  It may also prove
helpful to move the ARC buffers to a scatter-gather lists
rather than a vmalloc'ed region.

Additionally, zfs_write/read_common() were used in the readpage
and writepage hooks because it was fairly easy.  However, it
would be better to update zfs_fillpage and zfs_putapage to be
Linux friendly and use them instead.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
cc5f931cfd Add Hooks for Linux Xattr Operations
The Linux specific xattr operations have all been located in the
file zpl_xattr.c.  These functions primarily rely on the reworked
zfs_* functions to do their job.  They are also responsible for
converting the possible Solaris style error codes to negative
Linux errors.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
51f0bbe425 Add Hooks for Linux Super Block Operations
The Linux specific super block operations have all been located in the
file zpl_super.c.  These functions primarily rely on the reworked
zfs_* functions to do their job.  They are also responsible for
converting the possible Solaris style error codes to negative
Linux errors.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
ee154f01bf Add Hooks for Linux Inode Operations
The Linux specific inode operations have all been located in the
file zpl_inode.c.  These functions primarily rely on the reworked
zfs_* functions to do their job.  They are also responsible for
converting the possible Solaris style error codes to negative
Linux errors.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
1efb473f89 Add Hooks for Linux File Operations
The Linux specific file operations have all been located in the
file zpl_file.c.  These functions primarily rely on the reworked
zfs_* functions to do their job.  They are also responsible for
converting the possible Solaris style error codes to negative
Linux errors.

This first zpl_* commit also includes a common zpl.h header with
minimal entries to register the Linux specific hooks.  In also
adds all the new zpl_* file to the Makefile.in.  This is not a
standalone commit, you required the following zpl_* commits.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
633e8030b3 Wrap with HAVE_XVATTR
For the moment exactly how to handle xvattr is not clear.  This
change largely consists of the code to comment out the offending
bits until something reasonable can be done.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
3c4988c83e Add zp->z_is_zvol flag
A new flag is required for the zfs_rlock code to determine if
it is operation of the zvol of zpl dataset.  This used to be
keyed off the zp->z_vnode, which was a hack to begin with, but
with the removal of vnodes we needed a dedicated flag.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
3558fd73b5 Prototype/structure update for Linux
I appologize in advance why to many things ended up in this commit.
When it could be seperated in to a whole series of commits teasing
that all apart now would take considerable time and I'm not sure
there's much merrit in it.  As such I'll just summerize the intent
of the changes which are all (or partly) in this commit.  Broadly
the intent is to remove as much Solaris specific code as possible
and replace it with native Linux equivilants.  More specifically:

1) Replace all instances of zfsvfs_t with zfs_sb_t.  While the
type is largely the same calling it private super block data
rather than a zfsvfs is more consistent with how Linux names
this.  While non critical it makes the code easier to read when
your thinking in Linux friendly VFS terms.

2) Replace vnode_t with struct inode.  The Linux VFS doesn't have
the notion of a vnode and there's absolutely no good reason to
create one.  There are in fact several good reasons to remove it.
It just adds overhead on Linux if we were to manage one, it
conplicates the code, and it likely will lead to bugs so there's
a good change it will be out of date.  The code has been updated
to remove all need for this type.

3) Replace all vtype_t's with umode types.  Along with this shift
all uses of types to mode bits.  The Solaris code would pass a
vtype which is redundant with the Linux mode.  Just update all the
code to use the Linux mode macros and remove this redundancy.

4) Remove using of vn_* helpers and replace where needed with
inode helpers.  The big example here is creating iput_aync to
replace vn_rele_async.  Other vn helpers will be addressed as
needed but they should be be emulated.  They are a Solaris VFS'ism
and should simply be replaced with Linux equivilants.

5) Update znode alloc/free code.  Under Linux it's common to
embed the inode specific data with the inode itself.  This removes
the need for an extra memory allocation.  In zfs this information
is called a znode and it now embeds the inode with it.  Allocators
have been updated accordingly.

6) Minimal integration with the vfs flags for setting up the
super block and handling mount options has been added this
code will need to be refined but functionally it's all there.

This will be the first and last of these to large to review commits.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
6149f4c45f Remove dmu_write_pages() support
For the moment we do not use dmu_write_pages() to write pages
directly in to a dmu object.  It may be required at some point
in the future, but for now is simplest and cleanest to drop it.
It can be easily readded if/when needed.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
eb28321e2d Create a root znode without VFS dependencies
For portability reasons it's handy to be able to create a root
znode and basic filesystem components without requiring the full
cooperation of the VFS.  We are committing to this to simply the
filesystem creations code.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
bcf308227c Remove zfs_ctldir.[ch]
This code is used for snapshot and heavily leverages Solaris
functionality we do not want to reimplement.  These files have
been removed, including references to them, and will be replaced
by a zfs_snap.c/zpl_snap.c implementation which handles snapshots.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
b516a07b99 Disable fuid features
These features should probably be enabled in the Linux zpl code.
For now I'm disabling them until it's clear what needs to be done.
2011-02-10 09:27:20 -08:00
Brian Behlendorf
d5e53f9d06 Disable zfs_sync during oops/panic
Minor update to ensure zfs_sync() is disabled if a kernel oops/panic
is triggered.  As the comment says 'data integrity is job one'.  This
change could have been done by defining panicstr to oops_in_progress
in the SPL.  But I felt it was better to use the native Linux API
here since to be clear.
2011-02-10 09:27:20 -08:00
Brian Behlendorf
acb5376940 Disable Shutdown/Reboot
This support has been disable with HAVE_SHUTDOWN.  We can support
this at some point by adding the needed reboot notifiers.
2011-02-10 09:27:20 -08:00
Brian Behlendorf
cb28b3494e Remove SYNC_ATTR check
This flag does not need to be support under Linux.  As the comment
says it was only there to support fsflush() for old filesystem like
UFS.  This is not needed under Linux.
2011-02-10 09:27:20 -08:00
Brian Behlendorf
e15c023014 Remove mount options
Mount option parsing is still very Linux specific and will be
handled above this zfs filesystem layer.  Honoring those mount
options once set if of course the responsibility of the lower
layers.
2011-02-10 09:27:20 -08:00
Brian Behlendorf
d7cafa8e3e Remove zfs_active_fs_count
This variable was used to ensure that the ZFS module is never
removed while the filesystem is mounted.  Once again the generic
Linux VFS handles this case for us so it can be removed.
2011-02-10 09:27:20 -08:00
Brian Behlendorf
42ab36aa36 Remove unused mount functions
The functions zfs_mount_label_policy(), zfs_mountroot(), zfs_mount()
will not be needed because most of what they do is already handled
by the generic Linux VFS layer.  They all call zfs_domount() which
creates the actual dataset, the caller of this library call which
will be in the zpl layer is responsible for what's left.
2011-02-10 09:27:20 -08:00
Brian Behlendorf
c0b3dc7d07 Remove zfs_major/zfs_minor/zfsfstype
Under Linux we don't need to reserve a major or minor number for
the filesystem.  We can rely on the VFS to handle colisions without
this being handled by the lower ZFS layers.

Additionally, there is no need to keep a zfsfstype around.  We are
not limited on Linux by the OpenSolaris infrastructure which needed
this.  The upper zpl layer can specify the filesystem type.
2011-02-10 09:27:20 -08:00
Brian Behlendorf
4b3f12ecd5 Remove Solaris VFS Hooks
The ZFS code is being restructured to act as a library and a stand
alone module.  This allows us to leverage most of the existing code
with minimal modification.  It also means we need to drop the Solaris
vfs/vnode functions they will be replaced by Linux equivilants and
updated to be Linux friendly.
2011-02-10 09:27:20 -08:00
Brian Behlendorf
960e08fe3e VFS: Add zfs_inode_update() helper
For the moment we have left ZFS unchanged and it updates many values
as part of the znode.  However, some of these values should be set
in the inode.  For the moment this is handled by adding a function
called zfs_inode_update() which updates the inode based on the znode.

This is considered a workaround until we can systematically go
through the ZFS code and have it directly update the inode.  At
which point zfs_update_inode() can be dropped entirely.  Keeping
two copies of the same data isn't only inefficient it's a breeding
ground for bugs.
2011-02-10 09:27:20 -08:00
Brian Behlendorf
7304b6e50f VFS: Integrate zfs_znode_alloc()
Under Linux the convention for filesystem specific data structure is
to embed it along with the generic vfs data structure.  This differs
significantly from Solaris.

Since we want to integrates as cleanly with the Linux VFS as possible.
This changes modifies zfs_znode_alloc() to allocate a znode with an
embedded inode for use with the generic VFS.  This is done by calling
iget_locked() which will allocate a new inode if needed by calling
sb->alloc_inode().  This function allocates enough memory for a
znode_t by returns a pointer to the inode structure for Linux's VFS.
This function is also responsible for setting the callback
znode->z_set_ops_inodes() which is used to register the correct
handlers for the inode.
2011-02-10 09:27:20 -08:00
Brian Behlendorf
10c6047ea5 Enable zfs_znode compilation
Basic compilation of the bulk of zfs_znode.c has been enabled.  After
much consideration it was decided to convert the existing vnode based
interfaces to more friendly Linux interfaces.  The following commits
will systematically replace update the requiter interfaces.  There
are of course pros and cons to this decision.

Pros:
* This simplifies intergration with Linux in the long term.  There is
  no longer any need to manage vnodes which are a foreign concept to
  the Linux VFS.
* Improved long term maintainability.
* Minor performance improvements by removing vnode overhead.

Cons:
* Added work in the short term to modify multiple ZFS interfaces.
* Harder to pull in changes if we ever see any new code from Solaris.
* Mixed Solaris and Linux interfaces in some ZFS code.
2011-02-10 09:27:20 -08:00
Brian Behlendorf
a405c8a665 ACL related changes
A small collection of ACL related changes related to not
supporting fuid mapping.  This whole are will need to be
closely investigated.
2011-02-10 09:26:26 -08:00
Brian Behlendorf
3fc050aaf2 Init/destroy tsd
Add missing tsd_destroy() call for rrw_tsd_key to avoid a leak.
2011-02-10 09:25:38 -08:00
Brian Behlendorf
ab892c5f0a Replace VOP_* calls with direct zfs_* calls
These generic Solaris wrappers are no longer required.  Simply
directly call the correct zfs functions for clarity.
2011-02-10 09:21:43 -08:00
Brian Behlendorf
538f669f63 Add trivial acl helpers
The zfs acl code makes use of the two OpenSolaris helper functions
acl_trivial_access_masks() and ace_trivial_common().  Since they are
only called from zfs_acl.c I've brought them over from OpenSolaris
and added them as static function to this file.  This way I don't
need to reimplement this functionality from scratch in the SPL.

Long term once I take a more careful look at the acl implementation
it may be the case that these functions really aren't needed.  If
that turns out to be the case they can then be removed.
2011-02-10 09:21:43 -08:00
Brian Behlendorf
c60bc1fbf0 Remove dead ACL code
The following code was unused which caused gcc to complain.
Since it was deadcode it has simply been removed.
2011-02-10 09:21:43 -08:00
Brian Behlendorf
4e1b54fdde Remove zfs_parse_bootfs() support
Remove unneeded bootfs functions.  This support shouldn't be required
for the Linux port, and even if it is it would need to be reworked
to integrate cleanly with Linux.
2011-02-10 09:21:43 -08:00
Brian Behlendorf
9ee7fac531 VFS: Wrap with HAVE_SHARE
Certain NFS/SMB share functionality is not yet in place.  These
functions used to be wrapped with the generic HAVE_ZPL to prevent
them from being compiled.  I still don't want them compiled but
I'm working toward eliminating the use of HAVE_ZPL.  So I'm just
renaming the wrapper here to HAVE_SHARE.  They still won't be
compiled until all the share issues are worked through.  Share
support is the last missing piece from zfs_ioctl.c.
2011-02-10 09:21:43 -08:00