Commit Graph

2445 Commits

Author SHA1 Message Date
Brian Behlendorf
01c0e61da0 Add init scripts
To support automatically mounting your zfs on filesystem on boot
a basic init script is needed.  Unfortunately, every distribution
has their own idea of the _right_ way to do things.  Rather than
write one very complicated portable init script, which would be
invariably replaced by the distributions own anyway.  I have
instead added support to provide multiple distribution specific
init scripts.

The correct init script for your distribution will be selected
by ZFS_AC_DEFAULT_PACKAGE which will set DEFAULT_INIT_SCRIPT.
During 'make install' the correct script for your system will
be installed from zfs/etc/init.d/zfs.DEFAULT_INIT_SCRIPT to the
usual /etc/init.d/zfs location.

Currently, there is zfs.fedora and a more generic zfs.lsb init
script.  Hopefully, the distribution maintainers who know best
how they want their init scripts to function will feedback their
approved versions to be included in the project.

This change does not consider upstart jobs but I'm not at all
opposed to add that sort of thing.
2011-03-17 16:51:54 -07:00
Brian Behlendorf
0de19dad9c Register .remount_fs handler
Register the missing .remount_fs handler.  This handler isn't strictly
required because the VFS does a pretty good job updating most of the
MS_* flags.  However, there's no harm in using the hook to call the
registered zpl callback for various MS_* flags.  Additionaly, this
allows us to lay the ground work for more complicated argument parsing
in the future.
2011-03-15 13:33:29 -07:00
Brian Behlendorf
03f9ba9d99 Register .sync_fs handler
Register the missing .sync_fs handler.  This is a noop in most cases
because the usual requirement is that sync just be initiated.  As part
of the DMU's normal transaction processing txgs will be frequently
synced.  However, when the 'wait' flag is set the requirement is that
.sync_fs must not return until the data is safe on disk.  With the
addition of the .sync_fs handler this is now properly implemented.
2011-03-15 13:33:29 -07:00
Brian Behlendorf
9ac97c2a93 Print mount/umount errors
Because we are dependent of the system mount/umount utilities to
ensure correct mtab locking, we should not suppress their error
output.  During a successful mount/umount they will be silent,
but during a failure the error message they print is the only sure
way to know why a mount failed.  This is because the (u)mount(8)
return code does not contain the result of the system call issued.
The only way to clearly idenify why thing failed is to rely on
the error message printed by the tool.

Longer term once libmount is available we can issue the mount/umount
system calls within the tool and still be ensured correct mtab locking.

Closed #107
2011-03-09 15:26:48 -08:00
Brian Behlendorf
450dc149bd Range lock performance improvements
The original range lock implementation had to be modified by commit
8926ab7 because it was unsafe on Linux.  In particular, calling
cv_destroy() immediately after cv_broadcast() is dangerous because
the waiters may still be asleep.  Thus the following cv_destroy()
will free memory which may still be in use.

This was fixed by updating cv_destroy() to block on waiters but
this in turn introduced a deadlock.  The deadlock was resolved
with the use of a taskq to move the offending free outside the
range lock.  This worked well but using the taskq for the free
resulted in a serious performace hit.  This is somewhat ironic
because at the time I felt using the taskq might improve things
by making the free asynchronous.

This patch refines the original fix and moves the free from the
taskq to a private free list.  Then items which must be free'd
are simply inserted in to the list.  When the range lock is dropped
it's safe to free the items.  The list is walked and all rl_t
entries are freed.

This change improves small cached read performance by 26x.  This
was expected because for small reads the number of locking calls
goes up significantly.  More surprisingly this change significantly
improves large cache read performance.  This probably attributable
to better cpu/memory locality.  Very likely the same processor
which allocated the memory is now freeing it.

bs	ext3	zfs	zfs+fix		faster
----------------------------------------------
512     435     3       79      	26x
1k      820     7       160     	22x
2k      1536    14      305     	21x
4k      2764    28      572     	20x
8k      3788    50      1024    	20x
16k     4300    86      1843    	21x
32k     4505    138     2560    	18x
64k     5324    252     3891    	15x
128k    5427    276     4710    	17x
256k    5427    413     5017    	12x
512k    5427    497     5324    	10x
1m      5427    521     5632    	10x

Closes #142
2011-03-08 12:44:06 -08:00
Brian Behlendorf
126400a1ca Add zfs_open()/zfs_close()
In the original implementation the zfs_open()/zfs_close() hooks
were dropped for simplicity.  This was functional but not 100%
correct with the expected ZFS sematics.  Updating and re-adding the
zfs_open()/zfs_close() hooks resolves the following issues.

1) The ZFS_APPENDONLY file attribute is once again honored.  While
there are still no Linux tools to set/clear these attributes once
there are it should behave correctly.

2) Minimal virus scan file attribute hooks were added.  Once again
this support in disabled but the infrastructure is back in place.

3) Most importantly correctly handle assigning files which were
opened syncronously to the intent log.  Without this change O_SYNC
modifications could be lost during a system crash even though they
were marked synchronous.
2011-03-08 11:04:51 -08:00
Brian Behlendorf
6788762766 Linux 2.6.31 compat, include linux/seq_file.h
Explicitly include the linux/seq_file.h header in vfs.h.  This header
is required for the sequence handlers and is included indirectly in
newer kernels.
2011-03-07 13:52:00 -08:00
Brian Behlendorf
5484965ab6 Drop HAVE_XVATTR macros
When I began work on the Posix layer it immediately became clear to
me that to integrate cleanly with the Linux VFS certain Solaris
specific things would have to go.  One of these things was to elimate
as many Solaris specific types from the ZPL layer as possible.  They
would be replaced with their Linux equivalents.  This would not only
be good for performance, but for the general readability and health of
the code.  The Solaris and Linux VFS are different beasts and should
be treated as such.  Most of the code remains common for constructing
transactions and such, but there are subtle and important differenced
which need to be repsected.

This policy went quite for for certain types such as the vnode_t,
and it initially seemed to be working out well for the vattr_t.  There
was a relatively small amount of related xvattr_t code I was forced to
comment out with HAVE_XVATTR.  But it didn't look that hard to come
back soon and replace it all with a native Linux type.

However, after going doing this path with xvattr some distance it
clear that this code was woven in the ZPL more deeply than I thought.
In particular its hooks went very deep in to the ZPL replay code
and replacing it would not be as easy as I originally thought.

Rather than continue persuing replacing and removing this code I've
taken a step back and reevaluted things.  This commit reverts many of
my previous commits which removed xvattr related code.  It restores
much of the code to its original upstream state and now relies on
improved xvattr_t support in the zfs package itself.

The result of this is that much of the code which I had commented
out, which accidentally broke things like replay, is now back in
place and working.  However, there may be a small performance
impact for getattr/setattr operations because they now require
a translation from native Linux to Solaris types.  For now that's
a price I'm willing to pay.  Once everything is completely functional
we can revisting the issue of removing the vattr_t/xvattr_t types.

Closes #111
2011-03-02 11:44:34 -08:00
Brian Behlendorf
321a498b95 Add xvattr support
With the removal of the minimal xvattr support from the spl this
support needs to be replaced in the zfs package.  This is fairly
easily accomplished by directly adding portions of the sys/vnode.h
header from OpenSolaris.  These xvattr additions have been placed
in the sys/xvattr.h header file and included as needed where simply
a sys/vnode.h was included before.

In additon to the xvattr types and helper macros two functions
were also included.  The xva_init() and xva_getxoptattr() functions
were included as static inline functions in xvattr.h.  They are
simple enough and it was simpler to place them here rather than
in their own .c file.
2011-03-02 11:43:50 -08:00
Brian Behlendorf
47995fa691 Remove xvattr support
The xvattr support in the spl has always simply consisted of
defining a couple structures and a few #defines.  This was enough
to enable compilation of code which just passed xvattr types
around but not enough to effectively manipulate them.

This change removes even this minimal support leaving it up
to packages which leverage the spl to prove the full xvattr
support.  By removing it from the spl we ensure not conflict
with the higher level packages.

This just leaves minimal vnode support for basical manipulation
of files.  This code is does have the proper support functions
in the spl and a set of regression tests.

Additionally, this change removed the unused 'caller_context_t *'
type and replaces it with a 'void *'.
2011-03-02 11:34:46 -08:00
Brian Behlendorf
a4a1e1ecb4 Add TIMESPEC_OVERFLOW helper
Add the TIMESPEC_OVERFLOW helper macro to allow easy checking
of timespec overflow.
2011-03-02 11:34:43 -08:00
Brian Behlendorf
19c1eb829d Add zlib regression test
A zlib regression test has been added to verify the correct behavior
of z_compress_level() and z_uncompress.  The test case simply takes
a 128k buffer, it compresses the buffer, it them uncompresses the
buffer, and finally it compares the buffers after the transform.
If the buffers match then everything is fine and no data was lost.
It performs this test for all 9 zlib compression levels.
2011-02-25 16:56:46 -08:00
Brian Behlendorf
5c1967ebe2 Fix zlib compression
While portions of the code needed to support z_compress_level() and
z_uncompress() where in place.  In reality the current implementation
was non-functional, it just was compilable.

The critical missing component was to setup a workspace for the
compress/uncompress stream structures to use.  A kmem_cache was
added for the workspace area because we require a large chunk
of memory.  This avoids to need to continually alloc/free this
memory and vmap() the pages which is very slow.  Several objects
will reside in the per-cpu kmem_cache making them quick to acquire
and release.  A further optimization would be to adjust the
implementation to additional ensure the memory is local to the cpu.
Currently that may not be the case.
2011-02-25 16:56:22 -08:00
Fajar A. Nugraha
4c0d8e50b9 Use udev to create /dev/zvol/[dataset_name] links
This commit allows zvols with names longer than 32 characters, which
fixes issue on https://github.com/behlendorf/zfs/issues/#issue/102.

Changes include:
- use /dev/zd* device names for zvol, where * is the device minor
  (include/sys/fs/zfs.h, module/zfs/zvol.c).
- add BLKZNAME ioctl to get dataset name from userland
  (include/sys/fs/zfs.h, module/zfs/zvol.c, cmd/zvol_id).
- add udev rule to create /dev/zvol/[dataset_name] and the legacy
  /dev/[dataset_name] symlink. For partitions on zvol, it will create
  /dev/zvol/[dataset_name]-part* (etc/udev/rules.d/60-zvol.rules,
  cmd/zvol_id).

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-02-25 09:43:19 -08:00
Darik Horn
61da501f9d Add the new blkdev_compat.h header to the DIST target.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-02-24 09:40:06 -08:00
Brian Behlendorf
5a52a782a0 Use Linux flock struct
Rather than defining our own structure which will conflict with
Linux's version when building 32-bit.  Simply setup a typedef
to always use the correct Linux version for both 32 ad 64-bit
builds.
2011-02-23 14:32:15 -08:00
Brian Behlendorf
914b063133 Linux compat 2.6.37, invalidate_inodes()
In the 2.6.37 kernel the function invalidate_inodes() is no longer
exported for use by modules.  This memory management functionality
is needed to invalidate the inodes attached to a super block without
unmounting the filesystem.

Because this function still exists in the kernel and the prototype
is available is a common header all we strictly need is the symbol
address.  The address is obtained using spl_kallsyms_lookup_name()
and assigned to the variable invalidate_inodes_fn.  Then a #define
is used to replace all instances of invalidate_inodes() with a
call to the acquired address.  All the complexity is hidden behind
HAVE_INVALIDATE_INODES and invalidate_inodes() can be used as usual.

Long term we should try to get this, or another, interface made
available to modules again.
2011-02-23 12:44:32 -08:00
Brian Behlendorf
45066d1f20 Linux 2.6.38 compat, blkdev_get_by_path()
The open_bdev_exclusive() function has been replaced (again) by the
more generic blkdev_get_by_path() function.  Additionally, the
counterpart function close_bdev_exclusive() has been replaced by
blkdev_put().  Because these functions are more generic versions
of the functions they replaced the compatibility macro must add
the FMODE_EXCL mask to ensure they are exclusive.

Closes #114
2011-02-23 12:29:38 -08:00
Brian Behlendorf
61e909608d Linux 2.6.x compat, blkdev_compat.h
For legacy reasons the zvol.c and vdev_disk.c Linux compatibility
code ended up in sys/blkdev.h and sys/vdev_disk.h headers.  While
there are worse places for this code to live it should be in a
linux/blkdev_compat.h header.  This change moves this block device
Linux compatibility code in to the linux/blkdev_compat.h header
and updates all the correct #include locations.  This is not a
functional change or bug fix, it is just code cleanup.
2011-02-23 12:29:38 -08:00
Brian Behlendorf
5d0265c0dd Merge branch 'zpl' 2011-02-18 09:31:25 -08:00
Ricardo M. Correia
54a179e7b8 Add API to wait for pending commit callbacks
This adds an API to wait for pending commit callbacks of already-synced
transactions to finish processing.  This is needed by the DMU-OSD in
Lustre during device finalization when some callbacks may still not be
called, this leads to non-zero reference count errors.  See lustre.org
bug 23931.
2011-02-16 11:20:06 -08:00
Brian Behlendorf
2c395def27 Linux 2.6.36 compat, sops->evict_inode()
The new prefered inteface for evicting an inode from the inode cache
is the ->evict_inode() callback.  It replaces both the ->delete_inode()
and ->clear_inode() callbacks which were previously used for this.
2011-02-11 13:47:51 -08:00
Brian Behlendorf
f9637c6c8b Linux 2.6.33 compat, get/set xattr callbacks
The xattr handler prototypes were sanitized with the idea being that
the same handlers could be used for multiple methods.  The result of
this was the inode type was changes to a dentry, and both the get()
and set() hooks had a handler_flags argument added.  The list()
callback was similiarly effected but no autoconf check was added
because we do not use the list() callback.
2011-02-11 10:41:00 -08:00
Brian Behlendorf
7268e1bec8 Linux 2.6.35 compat, fops->fsync()
The fsync() callback in the file_operations structure used to take
3 arguments.  The callback now only takes 2 arguments because the
dentry argument was determined to be unused by all consumers.  To
handle this a compatibility prototype was added to ensure the right
prototype is used.  Our implementation never used the dentry argument
either so it's just a matter of using the right prototype.
2011-02-11 09:05:51 -08:00
Brian Behlendorf
777d4af891 Linux 2.6.35 compat, const struct xattr_handler
The const keyword was added to the 'struct xattr_handler' in the
generic Linux super_block structure.  To handle this we define an
appropriate xattr_handler_t typedef which can be used.  This was
the preferred solution because it keeps the code clean and readable.
2011-02-10 16:29:00 -08:00
Brian Behlendorf
6839eed23e Use 'noop' IO Scheduler
Initial testing has shown the the right IO scheduler to use under Linux
is noop.  This strikes the ideal balance by allowing the zfs elevator
to do all request ordering and prioritization.  While allowing the
Linux elevator to do the maximum front/back merging allowed by the
physical device.  This yields the largest possible requests for the
device with the lowest total overhead.

While 'noop' should be right for your system you can choose a different
IO scheduler with the 'zfs_vdev_scheduler' option.  You may set this
value to any of the standard Linux schedulers: noop, cfq, deadline,
anticipatory.  In addition, if you choose 'none' zfs will not attempt
to change the IO scheduler for the block device.
2011-02-10 09:27:22 -08:00
Brian Behlendorf
c0d35759c5 Add mmap(2) support
It's worth taking a moment to describe how mmap is implemented
for zfs because it differs considerably from other Linux filesystems.
However, this issue is handled the same way under OpenSolaris.

The issue is that by design zfs bypasses the Linux page cache and
leaves all caching up to the ARC.  This has been shown to work
well for the common read(2)/write(2) case.  However, mmap(2)
is problem because it relies on being tightly integrated with the
page cache.  To handle this we cache mmap'ed files twice, once in
the ARC and a second time in the page cache.  The code is careful
to keep both copies synchronized.

When a file with an mmap'ed region is written to using write(2)
both the data in the ARC and existing pages in the page cache
are updated.  For a read(2) data will be read first from the page
cache then the ARC if needed.  Neither a write(2) or read(2) will
will ever result in new pages being added to the page cache.

New pages are added to the page cache only via .readpage() which
is called when the vfs needs to read a page off disk to back the
virtual memory region.  These pages may be modified without
notifying the ARC and will be written out periodically via
.writepage().  This will occur due to either a sync or the usual
page aging behavior.  Note because a read(2) of a mmap'ed file
will always check the page cache first even when the ARC is out
of date correct data will still be returned.

While this implementation ensures correct behavior it does have
have some drawbacks.  The most obvious of which is that it
increases the required memory footprint when access mmap'ed
files.  It also adds additional complexity to the code keeping
both caches synchronized.

Longer term it may be possible to cleanly resolve this wart by
mapping page cache pages directly on to the ARC buffers.  The
Linux address space operations are flexible enough to allow
selection of which pages back a particular index.  The trick
would be working out the details of which subsystem is in
charge, the ARC, the page cache, or both.  It may also prove
helpful to move the ARC buffers to a scatter-gather lists
rather than a vmalloc'ed region.

Additionally, zfs_write/read_common() were used in the readpage
and writepage hooks because it was fairly easy.  However, it
would be better to update zfs_fillpage and zfs_putapage to be
Linux friendly and use them instead.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
1efb473f89 Add Hooks for Linux File Operations
The Linux specific file operations have all been located in the
file zpl_file.c.  These functions primarily rely on the reworked
zfs_* functions to do their job.  They are also responsible for
converting the possible Solaris style error codes to negative
Linux errors.

This first zpl_* commit also includes a common zpl.h header with
minimal entries to register the Linux specific hooks.  In also
adds all the new zpl_* file to the Makefile.in.  This is not a
standalone commit, you required the following zpl_* commits.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
3c4988c83e Add zp->z_is_zvol flag
A new flag is required for the zfs_rlock code to determine if
it is operation of the zvol of zpl dataset.  This used to be
keyed off the zp->z_vnode, which was a hack to begin with, but
with the removal of vnodes we needed a dedicated flag.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
3558fd73b5 Prototype/structure update for Linux
I appologize in advance why to many things ended up in this commit.
When it could be seperated in to a whole series of commits teasing
that all apart now would take considerable time and I'm not sure
there's much merrit in it.  As such I'll just summerize the intent
of the changes which are all (or partly) in this commit.  Broadly
the intent is to remove as much Solaris specific code as possible
and replace it with native Linux equivilants.  More specifically:

1) Replace all instances of zfsvfs_t with zfs_sb_t.  While the
type is largely the same calling it private super block data
rather than a zfsvfs is more consistent with how Linux names
this.  While non critical it makes the code easier to read when
your thinking in Linux friendly VFS terms.

2) Replace vnode_t with struct inode.  The Linux VFS doesn't have
the notion of a vnode and there's absolutely no good reason to
create one.  There are in fact several good reasons to remove it.
It just adds overhead on Linux if we were to manage one, it
conplicates the code, and it likely will lead to bugs so there's
a good change it will be out of date.  The code has been updated
to remove all need for this type.

3) Replace all vtype_t's with umode types.  Along with this shift
all uses of types to mode bits.  The Solaris code would pass a
vtype which is redundant with the Linux mode.  Just update all the
code to use the Linux mode macros and remove this redundancy.

4) Remove using of vn_* helpers and replace where needed with
inode helpers.  The big example here is creating iput_aync to
replace vn_rele_async.  Other vn helpers will be addressed as
needed but they should be be emulated.  They are a Solaris VFS'ism
and should simply be replaced with Linux equivilants.

5) Update znode alloc/free code.  Under Linux it's common to
embed the inode specific data with the inode itself.  This removes
the need for an extra memory allocation.  In zfs this information
is called a znode and it now embeds the inode with it.  Allocators
have been updated accordingly.

6) Minimal integration with the vfs flags for setting up the
super block and handling mount options has been added this
code will need to be refined but functionally it's all there.

This will be the first and last of these to large to review commits.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
6149f4c45f Remove dmu_write_pages() support
For the moment we do not use dmu_write_pages() to write pages
directly in to a dmu object.  It may be required at some point
in the future, but for now is simplest and cleanest to drop it.
It can be easily readded if/when needed.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
bcf308227c Remove zfs_ctldir.[ch]
This code is used for snapshot and heavily leverages Solaris
functionality we do not want to reimplement.  These files have
been removed, including references to them, and will be replaced
by a zfs_snap.c/zpl_snap.c implementation which handles snapshots.
2011-02-10 09:27:21 -08:00
Brian Behlendorf
960e08fe3e VFS: Add zfs_inode_update() helper
For the moment we have left ZFS unchanged and it updates many values
as part of the znode.  However, some of these values should be set
in the inode.  For the moment this is handled by adding a function
called zfs_inode_update() which updates the inode based on the znode.

This is considered a workaround until we can systematically go
through the ZFS code and have it directly update the inode.  At
which point zfs_update_inode() can be dropped entirely.  Keeping
two copies of the same data isn't only inefficient it's a breeding
ground for bugs.
2011-02-10 09:27:20 -08:00
Brian Behlendorf
7304b6e50f VFS: Integrate zfs_znode_alloc()
Under Linux the convention for filesystem specific data structure is
to embed it along with the generic vfs data structure.  This differs
significantly from Solaris.

Since we want to integrates as cleanly with the Linux VFS as possible.
This changes modifies zfs_znode_alloc() to allocate a znode with an
embedded inode for use with the generic VFS.  This is done by calling
iget_locked() which will allocate a new inode if needed by calling
sb->alloc_inode().  This function allocates enough memory for a
znode_t by returns a pointer to the inode structure for Linux's VFS.
This function is also responsible for setting the callback
znode->z_set_ops_inodes() which is used to register the correct
handlers for the inode.
2011-02-10 09:27:20 -08:00
Brian Behlendorf
a405c8a665 ACL related changes
A small collection of ACL related changes related to not
supporting fuid mapping.  This whole are will need to be
closely investigated.
2011-02-10 09:26:26 -08:00
Brian Behlendorf
8299a1f41e Add Linux Compat Infrastructure
Lay the initial ground work for a include/linux/ compatibility
directory.  This was less critical in the past because the bulk
of the ZFS code consumes the Solaris API via the SPL.  This API
was stable and the bulk Linux API differences were handled in
the SPL.

However, with the addition of a full Posix layer written directly
against the Linux APIs we are going to need more compatibility
code.  It makes sense that all this code should be cleanly located
in one place.  Subsequent patches should move the existing zvol
and vdev_disk compatibility code in to this directory.
2011-02-10 09:25:10 -08:00
Brian Behlendorf
590329b50c Add basic uio support
This code originates in OpenSolaris and was modified by KQ Infotech
to be compatible with Linux.  While supporting uios in the short
term is useful to get something working this is not an abstraction
we want to keep.  This code is expected to be short lived and
removed as soon as all the remaining uio based APIs and updated.
2011-02-10 09:21:43 -08:00
Brian Behlendorf
e5c39b95a7 Export required vfs/vn symbols 2011-02-10 09:21:42 -08:00
Brian Behlendorf
872e8d2697 Add initial rw_uio functions to the dmu
These functions were dropped originally because I felt they would
need to be rewritten anyway to avoid using uios.  However, this
patch readds then with they dea they can just be reworked and
the uio bits dropped.
2011-02-04 16:14:34 -08:00
Brian Behlendorf
c5d915f423 Minimal libshare infrastructure
ZFS even under Solaris does not strictly require libshare to be
available.  The current implementation attempts to dlopen() the
library to access the needed symbols.  If this fails libshare
support is simply disabled.

This means that on Linux we only need the most minimal libshare
implementation.  In fact just enough to prevent the build from
failing.  Longer term we can decide if we want to implement a
libshare library like Solaris.  At best this would be an abstraction
layer between ZFS and NFS/SMB.  Alternately, we can drop libshare
entirely and directly integrate ZFS with Linux's NFS/SMB.

Finally the bare bones user-libshare.m4 test was dropped.  If we
do decide to implement libshare at some point it will surely be
as part of this package so the check is not needed.
2011-02-04 16:14:29 -08:00
Brian Behlendorf
d599e4fa79 Block in cv_destroy() on all waiters
Previously we would ASSERT in cv_destroy() if it was ever called
with active waiters.  However, I've now seen several instances in
OpenSolaris code where they do the following:

  cv_broadcast();
  cv_destroy();

This leaves no time for active waiters to be woken up and scheduled
and we trip the ASSERT.  This has not been observed to be an issue
on OpenSolaris because their cv_destroy() basically does nothing.
They still do run the risk of the memory being free'd after the
cv_destroy() and hitting a bad paging request.  But in practice
this race is so small and unlikely it either doesn't happen, or
is so unlikely when it does happen the root cause has not yet been
identified.

Rather than risk the same issue in our code this change updates
cv_destroy() to block until all waiters have been woken and
scheduled.  This may take some time because each waiter must
acquire the mutex.

This change may have an impact on performance for frequently
created and destroyed condition variables.  That however is a price
worth paying it avoid crashing your system.  If performance issues
are observed they can be addressed by the caller.
2011-02-04 14:09:08 -08:00
Brian Behlendorf
feb46b92a7 Open up libzfs_run_process/libzfs_load_module
Recently helper functions were added to libzfs_util to load a kernel
module or execute a process.  Initially this functionality was limited
to libzfs but it has become clear there will be other consumers.  This
change opens up the interface so it may be used where appropriate.
2011-01-28 12:47:57 -08:00
Brian Behlendorf
b3259b6a2b Autoconf selinux support
If libselinux is detected on your system at configure time link
against it.  This allows us to use a library call to detect if
selinux is enabled and if it is to pass the mount option:

  "context=\"system_u:object_r:file_t:s0"

For now this is required because none of the existing selinux
policies are aware of the zfs filesystem type.  Because of this
they do not properly enable xattr based labeling even though
zfs supports all of the required hooks.

Until distro's add zfs as a known xattr friendly fs type we
must use mntpoint labeling.  Alternately, end users could modify
their existing selinux policy with a little guidance.
2011-01-28 12:45:19 -08:00
Brian Behlendorf
0aff071d18 Minor policy interface
Simply add the policy function wrappers.  They are completely
non-functional and always return that everything is OK, but once
again they simplify compilation of dependent packages for now.
These can/should be removed once the security policy of the
dependent application is completely understood and intergrade
as appropriate with Linux.
2011-01-27 16:06:09 -08:00
Brian Behlendorf
ef57fb98e4 Add missing headers
Dependent packages require the following missing headers to
simplify compilation.  The headers are basically just stubbed
out with minimal content required.
2011-01-27 16:06:09 -08:00
Brian Behlendorf
3fc97f9335 Add VSA_ACE_* and MAX_ACL_ENTRIES defines
The following flags are use to get the proper mask when getting
and setting ACLs.  I'm hopeful this can all largely go away at
some point.

We also add a define for the maximum number of ACL entries.
MAX_ACL_ENTRIES is used as the maximum number of entries for
each type.
2011-01-27 16:06:09 -08:00
Brian Behlendorf
e2b25f698c Add MAXUID define
For Linux the maximum uid can vary depending on how your kernel
is built.  The Linux kernel still can be compiled with 16 but uids
and gids, although I'm not aware of a major distribution which does
this (maybe an embedded one?).  Given that caviot it is reasonably
safe to define the MAXUID as 2147483647.
2011-01-27 16:06:09 -08:00
Brian Behlendorf
5f46a517f1 Add FIGNORECASE define
The FIGNORECASE case define is now needed, place it with the
related flags.
2011-01-27 16:06:09 -08:00
Brian Behlendorf
3e5d3d3285 Add ksid_index_t and ksid_t types
Add the ksid_index_t enum and ksid_t type for use.  These types
are now used by packages which depend on the SPL.
2011-01-27 16:06:09 -08:00
Brian Behlendorf
d700637207 Minimal VFS additions
This patch simply removes the place holder vfs_t type and includes
some generic Linux VFS headers.  It also makes some minor fid_t
additions for compatibility.
2011-01-27 16:06:04 -08:00
Brian Behlendorf
647fa73cf3 Remove VN_HOLD/VN_RELE/VOP_PUTPAGE
Previously these were defined to noops but rather than give
the misleading impression that these are actually implemented
I'm removing the type entirely for clarity.
2011-01-12 11:38:05 -08:00
Brian Behlendorf
bd6ac72b03 Add a few additional vnode #defines
These additional constants now have users in dependant packages.
2011-01-12 11:38:05 -08:00
Brian Behlendorf
dcd9cb5a17 Clean vattr_t and vsecattr_t types
Minor cleanup for the vattr_t and vsecattr_t types.
2011-01-12 11:38:04 -08:00
Brian Behlendorf
1b439713f1 FRSYNC Should Use O_SYNC
The Solaris FRSYNC maps most logically to the Linux O_SYNC.  There
is no O_RSYNC on Linux but this wasn't noticed until just recently.
2011-01-12 11:38:04 -08:00
Brian Behlendorf
4295b530ee Add vn_mode_to_vtype/vn_vtype to_mode helpers
Add simple helpers to convert a vnode->v_type to a inode->i_mode.
These should be used sparingly but they are handy to have.
2011-01-12 11:38:04 -08:00
Neependra Khare
3f688a8c38 Add cv_timedwait_interruptible() function
The cv_timedwait() function by definition must wait unconditionally
for cv_signal()/cv_broadcast() before waking.  This causes processes
to go in the D state which increases the load average.  The load
average is the summation of processes in D state and run queue.

To avoid this it can be desirable to sleep interruptibly.  These
processes do not count against the load average but may be woken by
a signal.  It is up to the caller to determine why the process
was woken it may be for one of three reasons.

  1) cv_signal()/cv_broadcast()
  2) the timeout expired
  3) a signal was received

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-01-11 12:14:48 -08:00
Brian Behlendorf
6bf4d76f47 Linux Compat: inode->i_mutex/i_sem
Create spl_inode_lock/spl_inode_unlock compability macros to simply
access to the inode mutex/sem.  This avoids the need to have to ugly
up the code with the required #define's at every call site.  At the
moment the SPL only uses this in one place but higher layers can
benefit from the macro.
2011-01-11 12:14:48 -08:00
Brian Behlendorf
5b63b3eb6f Use cv_timedwait_interruptible in arc
The issue is that cv_timedwait() sleeps uninterruptibly to block signals
and avoid waking up early.  Under Linux this counts against the load
average keeping it artificially high.  This change allows the arc to
sleep interruptibly which mean it may be woken up early due to a signal.

Normally this means some extra care must be taken to handle a potential
signal.  But for the arcs usage of cv_timedwait() there is no harm in
waking up before the timeout expires so no extra handling is required.
2010-12-14 10:06:44 -08:00
Brian Behlendorf
9fe45dc1ac Add Thread Specific Data (TSD) Implementation
Thread specific data has implemented using a hash table, this avoids
the need to add a member to the task structure and allows maximum
portability between kernels.  This implementation has been optimized
to keep the tsd_set() and tsd_get() times as small as possible.

The majority of the entries in the hash table are for specific tsd
entries.  These entries are hashed by the product of their key and
pid because by design the key and pid are guaranteed to be unique.
Their product also has the desirable properly that it will be uniformly
distributed over the hash bins providing neither the pid nor key is zero.
Under linux the zero pid is always the init process and thus won't be
used, and this implementation is careful to never to assign a zero key.
By default the hash table is sized to 512 bins which is expected to
be sufficient for light to moderate usage of thread specific data.

The hash table contains two additional type of entries.  They first
type is entry is called a 'key' entry and it is added to the hash during
tsd_create().  It is used to store the address of the destructor function
and it is used as an anchor point.  All tsd entries which use the same
key will be linked to this entry.  This is used during tsd_destory() to
quickly call the destructor function for all tsd associated with the key.
The 'key' entry may be looked up with tsd_hash_search() by passing the
key you wish to lookup and DTOR_PID constant as the pid.

The second type of entry is called a 'pid' entry and it is added to the
hash the first time a process set a key.  The 'pid' entry is also used
as an anchor and all tsd for the process will be linked to it.  This
list is using during tsd_exit() to ensure all registered destructors
are run for the process.  The 'pid' entry may be looked up with
tsd_hash_search() by passing the PID_KEY constant as the key, and
the process pid.  Note that tsd_exit() is called by thread_exit()
so if your using the Solaris thread API you should not need to call
tsd_exit() directly.
2010-12-07 10:02:32 -08:00
Ricardo M. Correia
c2f997b0b3 Make kmutex_t typesafe in all cases.
When HAVE_MUTEX_OWNER and CONFIG_SMP are defined, kmutex_t is just
a typedef for struct mutex.

This is generally OK but has the downside that it can make mistakes
such as mutex_lock(&kmutex_var) to pass by unnoticed until someone
compiles the code without HAVE_MUTEX_OWNER or CONFIG_SMP (in which
case kmutex_t is a real struct). Note that the correct API to call
should have been mutex_enter() rather than mutex_lock().

We prevent these kind of mistakes by making kmutex_t a real structure
with only one field. This makes kmutex_t typesafe and it shouldn't
have any impact on the generated assembly code.

Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-11-29 11:25:32 -08:00
Brian Behlendorf
058de03caa Clear cv->cv_mutex when not in use
For debugging purposes the condition varaibles keep track of the
mutex used during a wait.  The idea is to validate that all callers
always use the same mutex.  Unfortunately, we have seen cases where
the caller reuses the condition variable with a different mutex but
in a way which is known to be safe.  My reading of the man pages
suggests you should not do this and always cv_destroy()/cv_init()
a new mutex.  However, there is overhead in doing this and it does
appear to be allowed under Solaris.

To accomidate this behavior cv_wait_common() and __cv_timedwait()
have been modified to clear the associated mutex when the last
waiter is dropped.  This ensures that while the condition variable
is in use the incorrect mutex case is detected.  It also allows the
condition variable to be safely recycled without requiring the
overhead of a cv_destroy()/cv_init() as long as it isn't currently
in use.

Finally, spin lock cv->cv_lock was removed because it is not required.
When the condition variable is used properly the caller will always
be holding the mutex so the spin lock is redundant.  The lock was
originally added because I expected to need to protect more than
just the cv->cv_mutex.  It turns out that was not the case.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-11-29 11:02:34 -08:00
Brian Behlendorf
8326eb4605 Linux 2.6.36 compat, blk_* macros removed
Most of the blk_* macros were removed in 2.6.36.  Ostensibly this was
done to improve readability and allow easier grepping.  However, from
a portability stand point the macros are helpful.  Therefore the needed
macros are redefined here if they are missing from the kernel.
2010-11-10 17:00:40 -08:00
Brian Behlendorf
675de5aa37 Linux 2.6.36 compat, synchronous bio flag
The name of the flag used to mark a bio as synchronous has changed
again in the 2.6.36 kernel due to the unification of the BIO_RW_*
and REQ_* flags.  The new flag is called REQ_SYNC.  To simplify
checking this flag I have introduced the vdev_disk_dio_is_sync()
helper function.  Based on the results of several new autoconf
tests it uses the correct mask to check for a synchronous bio.

Preferred interface for flagging a synchronous bio:
  2.6.12-2.6.29: BIO_RW_SYNC
  2.6.30-2.6.35: BIO_RW_SYNCIO
  2.6.36-2.6.xx: REQ_SYNC
2010-11-10 17:00:33 -08:00
Brian Behlendorf
f4af6bb783 Linux 2.6.36 compat, use REQ_FAILFAST_MASK
As of linux-2.6.36 the BIO_RW_FAILFAST and REQ_FAILFAST flags
have been unified under the REQ_* names.  These flags always had
to be kept in-sync so this is a nice step forward, unfortunately
it means we need to be careful to only use the new unified flags
when the BIO_RW_* flags are not defined.  Additional autoconf
checks were added for this and if it is ever unclear which method
to use no flags are set.  This is safe but may result in longer
delays before a disk is failed.

Perferred interface for setting FAILFAST on a bio:
  2.6.12-2.6.27: BIO_RW_FAILFAST
  2.6.28-2.6.35: BIO_RW_FAILFAST_{DEV|TRANSPORT|DRIVER}
  2.6.36-2.6.xx: REQ_FAILFAST_{DEV|TRANSPORT|DRIVER}
2010-11-10 16:59:49 -08:00
Ned Bass
00ba7ef900 Give ENOTSUP a valid user space error value
The ZFS module returns ENOTSUP for several error conditions where an operation
is not (yet) supported.  The SPL defined ENOTSUP in terms of ENOTSUPP, but that
is an internal Linux kernel error code that should not be seen by user
programs.  As a result the zfs utilities print a confusing error message if an
unsupported operation is attempted:

    internal error: Unknown error 524
    Aborted

This change defines ENOTSUP in terms of EOPNOTSUPP which is consistent with
user space.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-11-10 13:25:49 -08:00
Brian Behlendorf
a50cede388 Linux 2.6.36 compat, wrap RLIM64_INFINITY
As of linux-2.6.36 RLIM64_INFINITY is defined in linux/resource.h.
This is handled by conditionally defining RLIM64_INFINITY in the
SPL only when the kernel does not provide it.
2010-11-09 13:28:55 -08:00
Brian Behlendorf
8294c69bb7 Clear owner after dropping mutex
It's important to clear mp->owner after calling mutex_unlock()
because when CONFIG_DEBUG_MUTEXES is defined the mutex owner
is verified in mutex_unlock().  If we set it to NULL this check
fails and the lockdep support is immediately disabled.
2010-11-05 11:52:30 -07:00
Ned Bass
79e7242a91 Add helper functions for manipulating device names
This change adds two helper functions for working with vdev names and paths.
zfs_resolve_shortname() resolves a shorthand vdev name to an absolute path
of a file in /dev, /dev/disk/by-id, /dev/disk/by-label, /dev/disk/by-path,
/dev/disk/by-uuid, /dev/disk/zpool.  This was previously done only in the
function is_shorthand_path(), but we need a general helper function to
implement shorthand names for additional zpool subcommands like remove.
is_shorthand_path() is accordingly updated to call the helper function.

There is a minor change in the way zfs_resolve_shortname() tests if a file
exists.  is_shorthand_path() effectively used open() and stat64() to test for
file existence, since its scope includes testing if a device is a whole disk
and collecting file status information.  zfs_resolve_shortname(), on the other
hand, only uses access() to test for existence and leaves it to the caller to
perform any additional file operations.  This seemed like the most general and
lightweight approach, and still preserves the semantics of is_shorthand_path().

zfs_append_partition() appends a partition suffix to a device path.  This
should be used to generate the name of a whole disk as it is stored in the vdev
label. The user-visible names of whole disks do not contain the partition
information, while the name in the vdev label does.   The code was lifted from
the function make_disks(), which now just calls the helper function.  Again,
having a helper function to do this supports general handling of shorthand
names in the user interface.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-10-22 12:25:30 -07:00
Brian Behlendorf
baa40d45cb Fix missing 'zpool events'
It turns out that 'zpool events' over 1024 bytes in size where being
silently dropped.  This was discovered while writing the zfault.sh
tests to validate common failure modes.

This could occur because the zfs interface for passing an arbitrary
size nvlist_t over an ioctl() is to provide a buffer for the packed
nvlist which is usually big enough.  In this case 1024 byte is the
default.  If the kernel determines the buffer is to small it returns
ENOMEM and the minimum required size of the nvlist_t.  This was
working properly but in the case of 'zpool events' the event stream
was advanced dispite the error.  Thus the retry with the bigger
buffer would succeed but it would skip over the previous event.

The fix is to pass this size to zfs_zevent_next() and determine
before removing the event from the list if it will fit.  This was
preferable to checking after the event was returned because this
avoids the need to rewind the stream.
2010-10-12 14:55:03 -07:00
Brian Behlendorf
a69052be7f Initial zio delay timing
While there is no right maximum timeout for a disk IO we can start
laying the ground work to measure how long they do take in practice.
This change simply measures the IO time and if it exceeds 30s an
event is posted for 'zpool events'.

This value was carefully selected because for sd devices it implies
that at least one timeout (SD_TIMEOUT) has occured.  Unfortunately,
even with FAILFAST set we may retry and request and not get an
error.  This behavior is strongly dependant on the device driver
and how it is hooked in to the scsi error handling stack.  However
by setting the limit at 30s we can log the event even if no error
was returned.

Slightly longer term we can start recording these delays perhaps
as a simple power-of-two histrogram.  This histogram can then be
reported as part of the 'zpool status' command when given an command
line option.

None of this code changes the internal behavior of ZFS.  Currently
it is simply for reporting excessively long delays.
2010-10-12 14:55:02 -07:00
Brian Behlendorf
2959d94a0a Add FAILFAST support
ZFS works best when it is notified as soon as possible when a device
failure occurs.  This allows it to immediately start any recovery
actions which may be needed.  In theory Linux supports a flag which
can be set on bio's called FAILFAST which provides this quick
notification by disabling the retry logic in the lower scsi layers.

That's the theory at least.  In practice is turns out that while the
flag exists you oddly have to set it with the BIO_RW_AHEAD flag.
And even when it's set it you may get retries in the low level
drivers decides that's the right behavior, or if you don't get the
right error codes reported to the scsi midlayer.

Unfortunately, without additional kernels patchs there's not much
which can be done to improve this.  Basically, this just means that
it may take 2-3 minutes before a ZFS is notified properly that a
device has failed.  This can be improved and I suspect I'll be
submitting patches upstream to handle this.
2010-10-12 14:55:02 -07:00
Brian Behlendorf
312c07edfd Generate zevents for speculative and soft errors
By default the Solaris code does not log speculative or soft io errors
in either 'zpool status' or post an event.  Under Linux we don't want
to change the expected behavior of 'zpool status' so these io errors
are still suppressed there.

However, since we do need to know about these events for Linux FMA and
the 'zpool events' interface is new we do post the events.  With the
addition of the zio_flags field the posted events now contain enough
information that a user space consumer can identify and discard these
events if it sees fit.
2010-10-12 14:55:00 -07:00
Ricardo M. Correia
a68d91d770 atomic_*_*_nv() functions need to return the new value atomically.
A local variable must be used for the return value to avoid a
potential race once the spin lock is dropped.

Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-09-17 16:03:25 -07:00
Brian Behlendorf
e37e1d3040 Use linux __KERNEL__ define
Previously the project contained who zfs_context.h files,
one for user space builds and one for kernel space builds.
It was the responsibility of the source including the file
to ensure the right one was included based on the order of
the include paths.

This was the way it was done in OpenSolaris but for our
purposes I felt it was overly obscure.  The user and kernel
zfs_context.h files have been combined in to a single file
and a #define determines if you get the user or kernel
context.

The issue here was that I used the _KERNEL macro which is
defined as part of the spl which will only be defined for
most builds after you include the right zfs_context.  It is
safer to use the __KERNEL__ macro which is automatically
defined as part of the kernel build process and passed as
a command line compiler option.  It will always be defined
if your building in the kernel and never for user space.
2010-09-10 09:36:39 -07:00
Brian Behlendorf
6283f55ea1 Support custom build directories and move includes
One of the neat tricks an autoconf style project is capable of
is allow configurion/building in a directory other than the
source directory.  The major advantage to this is that you can
build the project various different ways while making changes
in a single source tree.

For example, this project is designed to work on various different
Linux distributions each of which work slightly differently.  This
means that changes need to verified on each of those supported
distributions perferably before the change is committed to the
public git repo.

Using nfs and custom build directories makes this much easier.
I now have a single source tree in nfs mounted on several different
systems each running a supported distribution.  When I make a
change to the source base I suspect may break things I can
concurrently build from the same source on all the systems each
in their own subdirectory.

wget -c http://github.com/downloads/behlendorf/zfs/zfs-x.y.z.tar.gz
tar -xzf zfs-x.y.z.tar.gz
cd zfs-x-y-z

------------------------- run concurrently ----------------------
<ubuntu system>  <fedora system>  <debian system>  <rhel6 system>
mkdir ubuntu     mkdir fedora     mkdir debian     mkdir rhel6
cd ubuntu        cd fedora        cd debian        cd rhel6
../configure     ../configure     ../configure     ../configure
make             make             make             make
make check       make check       make check       make check

This change also moves many of the include headers from individual
incude/sys directories under the modules directory in to a single
top level include directory.  This has the advantage of making
the build rules cleaner and logically it makes a bit more sense.
2010-09-08 12:38:56 -07:00
Brian Behlendorf
a7958f7eef Support custom build directories
One of the neat tricks an autoconf style project is capable of
is allow configurion/building in a directory other than the
source directory.  The major advantage to this is that you can
build the project various different ways while making changes
in a single source tree.

For example, this project is designed to work on various different
Linux distributions each of which work slightly differently.  This
means that changes need to verified on each of those supported
distributions perferably before the change is committed to the
public git repo.

Using nfs and custom build directories makes this much easier.
I now have a single source tree in nfs mounted on several different
systems each running a supported distribution.  When I make a
change to the source base I suspect may break things I can
concurrently build from the same source on all the systems each
in their own subdirectory.

wget -c http://github.com/downloads/behlendorf/spl/spl-x.y.z.tar.gz
tar -xzf spl-x.y.z.tar.gz
cd spl-x-y-z

------------------------- run concurrently ----------------------
<ubuntu system>  <fedora system>  <debian system>  <rhel6 system>
mkdir ubuntu     mkdir fedora     mkdir debian     mkdir rhel6
cd ubuntu        cd fedora        cd debian        cd rhel6
../configure     ../configure     ../configure     ../configure
make             make             make             make
make check       make check       make check       make check

This is something the project has almost supported for a long time
but finishing this support should save me lots of time.
2010-09-05 21:49:05 -07:00
Brian Behlendorf
73fc084e92 Move vendor check to spl-build.m4
This check was previously done with a hack in config.guess.
However, since a new config.guess is copied in to place when
forcing a full autoreconf this change was easily lost and
never a good idea.  This commit also updates all of the
autoconf style support scripts in config.
2010-09-02 16:12:02 -07:00
Brian Behlendorf
8371f981f1 Add list_link_replace() function
The list_link_replace() function with swap a new item it to the place
of an old item in a list.  It is the callers responsibility to ensure
all lists involved are locked properly.
2010-08-27 14:23:48 -07:00
Brian Behlendorf
d85e28ad69 Add MUTEX_NOT_HELD() function
Simply implement the missing MUTEX_NOT_HELD() function using
the !MUTEX_HELD construct.
2010-08-27 14:23:48 -07:00
Brian Behlendorf
2b3543025c Stub out kmem cache defrag API
At some point we are going to need to implement the kmem cache
move callbacks to allow for kmem cache defragmentation.  This
commit simply lays a small part of the API ground work, it does
not actually implement any of this feature.  This is safe for
now because the move callbacks are just an optimization.  Even
if they are registered we don't ever really have to call them.
2010-08-27 14:23:42 -07:00
Brian Behlendorf
8dbd3fbd5e Add missing atomic functions
These functions were not previous needed so they were not added.
Now they are so add the full set.

atomic_inc_32_nv()
atomic_dec_32_nv()
atomic_inc_64_nv()
atomic_dec_64_nv()
2010-08-27 13:02:55 -07:00
Li Wei
4be55565fe Fix stack overflow in vn_rdwr() due to memory reclaim
Unless __GFP_IO and __GFP_FS are removed from the file mapping gfp
mask we may enter memory reclaim during IO.  In this case shrink_slab()
entered another file system which is notoriously hungry for stack.
This additional stack usage may cause a stack overflow.  This patch
removes __GFP_IO and __GFP_FS from the mapping gfp mask of each file
during vn_open() to avoid any reclaim in the vn_rdwr() IO path.  The
original mask is then restored at vn_close() time.  Hats off to the
loop driver which does something similiar for the same reason.

  [...]
  shrink_slab+0xdc/0x153
  try_to_free_pages+0x1da/0x2d7
  __alloc_pages+0x1d7/0x2da
  do_generic_mapping_read+0x2c9/0x36f
  file_read_actor+0x0/0x145
  __generic_file_aio_read+0x14f/0x19b
  generic_file_aio_read+0x34/0x39
  do_sync_read+0xc7/0x104
  vfs_read+0xcb/0x171
  :spl:vn_rdwr+0x2b8/0x402
  :zfs:vdev_file_io_start+0xad/0xe1
  [...]

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-12 09:34:33 -07:00
Ned Bass
46aa7b3939 Correctly handle rwsem_is_locked() behavior
A race condition in rwsem_is_locked() was fixed in Linux 2.6.33 and the fix was
backported to RHEL5 as of kernel 2.6.18-190.el5.  Details can be found here:

https://bugzilla.redhat.com/show_bug.cgi?id=526092

The race condition was fixed in the kernel by acquiring the semaphore's
wait_lock inside rwsem_is_locked().  The SPL worked around the race condition
by acquiring the wait_lock before calling that function, but with the fix in
place it must not do that.

This commit implements an autoconf test to detect whether the fixed version of
rwsem_is_locked() is present.  The previous version of rwsem_is_locked() was an
inline static function while the new version is exported as a symbol which we
can check for in module.symvers.  Depending on the result we correctly
implement the needed compatibility macros for proper spinlock handling.

Finally, we do the right thing with spin locks in RW_*_HELD() by using the
new compatibility macros.  We only only acquire the semaphore's wait_lock if
it is calling a rwsem_is_locked() that does not itself try to acquire the lock.

Some new overhead and a small harmless race is introduced by this change.
This is because RW_READ_HELD() and RW_WRITE_HELD() now acquire and release
the wait_lock twice: once for the call to rwsem_is_locked() and once for
the call to rw_owner().  This can't be avoided if calling a rwsem_is_locked()
that takes the wait_lock, as it will in more recent kernels.

The other case which only occurs in legacy kernels could be optimized by
taking the lock only once, as was done prior to this commit.  However, I
decided that the performance gain probably wasn't significant enough to
justify the messy special cases required.

The function spl_rw_get_owner() was only used to enable the afore-mentioned
optimization.  Since it is no longer used, I removed it.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-10 16:43:00 -07:00
Brian Behlendorf
099dc9c2d2 Add uninstall Makefile targets
Extend the Makefiles with an uninstall target to cleanly
remove a package which was installed with 'make install'.

Additionally, ensure a 'depmod -a' is run as part of the
install to update the module dependency information.
2010-07-28 14:55:32 -07:00
Brian Behlendorf
287b2fb117 Add Debian and Slackware style packaging via alien
The long term fix for Debian and Slackware style packaging is
to add native support for building these packages.  Unfortunately,
that is a large chunk of work I don't have time for right now.
That said it would be nice to have at least basic packages for
these distributions.

As a quick short/medium term solution I've settled on using alien
to convert the RPM packages to DEB or TGZ style packages.  The
build system has been updated with the following build targets
which will first build RPM packages and then convert them as
needed to the target package type:

  make rpm: Create .rpm packages
  make deb: Create .deb packages
  make tgz: Create .tgz packages
  make pkg: Create the right package type for your distribution

The solution comes with lot of caveats and your mileage may vary.
But basically the big limitations are that the resulting packages:

  1) Will not have the correct dependency information.
  2) Will not not include the kernel version in the release.
  3) Will not handle all differences between distributions.

But the resulting packages should be easy to install and remove
from your system and take care of running 'depmod -a' and such.
As I said at the top this is not the right long term solution.
If any of the upstream distribution maintainers want to jump in
and help do this right for their distribution I'd love the help.
2010-07-27 15:52:34 -07:00
Brian Behlendorf
10129680f8 Ensure kmem_alloc() and vmem_alloc() never fail
The Solaris semantics for kmem_alloc() and vmem_alloc() are that they
must never fail when called with KM_SLEEP.  They may only fail if
called with KM_NOSLEEP otherwise they must block until memory is
available.  This is quite different from how the Linux memory
allocators work, under Linux a memory allocation failure is always
possible and must be dealt with.

At one point in the past the kmem code did properly implement this
behavior, however as the code evolved this behavior was overlooked
in places.  This patch goes through all three implementations of
the kmem/vmem allocation functions and ensures that they will all
block in the KM_SLEEP case when memory is not available.  They
may still fail in the KM_NOSLEEP case in which case the caller
is responsible for handling the failure.

Special care is taken in vmalloc_nofail() to avoid thrashing the
system on the virtual address space spin lock.  The down side of
course is if you do see a failure here, which is unlikely for
64-bit systems, your allocation will delay for an entire second.
Still this is preferable to locking up your system and it is the
best we can do given the constraints.

Additionally, the code was cleaned up to be much more readable
and comments were added to describe the various kmem-debug-*
configure options.  The default configure options remain:
"--enable-debug-kmem --disable-debug-kmem-tracking"
2010-07-26 15:47:55 -07:00
Ricardo M. Correia
15b52c083e Fix max_ncpus definition.
It was being defined as the constant 64 and at first I changed it to be
NR_CPUS instead.

However, NR_CPUS can be a large value on recent kernels (4096), and this
may cause too large kmem allocations to happen.

Therefore, now we use num_possible_cpus(), which should return a (typically)
small value which represents the maximum number of CPUs than can be brought
online in the running hardware (this value is determined at boot time by
arch-specific kernel code).

Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-07-20 15:49:25 -07:00
Ricardo M. Correia
81672c0122 Display DEBUG keyword during module load when --enable-debug is used.
Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-07-20 15:31:03 -07:00
Ricardo M. Correia
9dd5d138b2 Fix bcopy() to allow memory area overlap
Under Solaris bcopy() allows overlapping memory areas so we
must use memmove() instead of memcpy().

Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-07-20 13:48:53 -07:00
Ricardo M. Correia
22cd0f19b1 Fix compilation error due to undefined ACCESS_ONCE macro.
When CONFIG_DEBUG_MUTEXES is turned on in RHEL5's kernel config, the mutexes
store the owner for debugging purposes, therefore the SPL will enable
HAVE_MUTEX_OWNER. However, the SPL code uses ACCESS_ONCE() to access the
owner, and this macro is not defined in the RHEL5 kernel, therefore we define it
ourselves in include/linux/compiler_compat.h.

Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-07-20 13:47:52 -07:00
Brian Behlendorf
b17edc10a9 Prefix all SPL debug macros with 'S'
To avoid conflicts with symbols defined by dependent packages
all debugging symbols have been prefixed with a 'S' for SPL.
Any dependent package needing to integrate with the SPL debug
should include the spl-debug.h header and use the 'S' prefixed
macros.  They must also build with DEBUG defined.
2010-07-20 13:30:40 -07:00
Brian Behlendorf
55abb0929e Split <sys/debug.h> header
To avoid symbol conflicts with dependent packages the debug
header must be split in to several parts.  The <sys/debug.h>
header now only contains the Solaris macro's such as ASSERT
and VERIFY.  The spl-debug.h header contain the spl specific
debugging infrastructure and should be included by any package
which needs to use the spl logging.  Finally the spl-trace.h
header contains internal data structures only used for the log
facility and should not be included by anythign by spl-debug.c.

This way dependent packages can include the standard Solaris
headers without picking up any SPL debug macros.  However, if
the dependant package want to integrate with the SPL debugging
subsystem they can then explicitly include spl-debug.h.

Along with this change I have dropped the CHECK_STACK macros
because the upstream Linux kernel now has much better stack
depth checking built in and we don't need this complexity.

Additionally SBUG has been replaced with PANIC and provided as
part of the Solaris macro set.  While the Solaris version is
really panic() that conflicts with the Linux kernel so we'll
just have to make due to PANIC.  It should rarely be called
directly, the prefered usage would be an ASSERT or VERIFY.

There's lots of change here but this cleanup was overdue.
2010-07-20 13:29:35 -07:00
Brian Behlendorf
f0ff89fc86 Linux 2.6.35 compat: filp_fsync() dropped 'stuct dentry *'
The prototype for filp_fsync() drop the unused argument 'stuct dentry *'.
I've fixed this by adding the needed autoconf check and moving all of
those filp related functions to file_compat.h.  This will simplify
handling any further API changes in the future.
2010-07-14 11:40:55 -07:00
Brian Behlendorf
82b8c8fa64 Proposed fix for low memory ZFS deadlocks
Deadlocks in the zvol were observed when one of the ZFS threads
performing IO trys to allocate memory while the system is low
on memory.  The low memory condition causes dirty pages to be
synced to the zvol but this can't progress because the original
thread is blocked waiting on a memory allocation.  Thus we end
up deadlocking.

A proper solution proposed by Wizeman is to change KM_SLEEP from
GFP_KERNEL top GFP_NOFS.  This will prevent the memory allocation
which is trying to allocate memory from forcing a sync to the
zvol in shrink_page_list()->pageout().

The down side to all of this is that we are using a pretty big
hammer by changing KM_SLEEP.  This change means ALL of the zfs
memory allocations will be until to trigger dirty data to be
synced.  The caller still should be able to reclaim memory from
the various slab caches.  We will be totally dependent of other
kernel processes which happen to be running and a small number
of asynchronous reclaim threads to trigger the reclaim of dirty
data pages.  This should be OK but I think we may see some
slightly longer allocation times when under memory pressure.

We shall see.
2010-07-13 21:30:56 -07:00
Brian Behlendorf
a4bfd8ea1b Add __divdi3(), remove __udivdi3() kernel dependency
Up until now no SPL consumer attempted to perform signed 64-bit
division so there was no need to support this.  That has now
changed so I adding 64-bit division support for 32-bit platforms.
The signed implementation is based on the unsigned version.

Since the have been several bug reports in the past concerning
correct 64-bit division on 32-bit platforms I added some long
over due regression tests.  Much to my surprise the unsigned
64-bit division regression tests failed.

This was surprising because __udivdi3() was implemented by simply
calling div64_u64() which is provided by the kernel.  This meant
that the linux kernels 64-bit division algorithm on 32-bit platforms
was flawed.  After some investigation this turned out to be exactly
the case.

Because of this I was forced to abandon the kernel helper and
instead to fully implement 64-bit division in the spl.  There are
several published implementation out there on how to do this
properly and I settled on one proposed in the book Hacker's Delight.
Their proposed algoritm is freely available without restriction
and I have just modified it to be linux kernel friendly.

The update implementation now passed all the unsigned and signed
regression tests.  This should be functional, but not fast, which is
good enough for out purposes.  If you want fast too I'd strongly
suggest you upgrade to a 64-bit platform.  I have also reported the
kernel bug and we'll see if we can't get it fixed up stream.
2010-07-13 16:44:02 -07:00
Ned Bass
f0d8bb26b4 Implementation of the TQ_FRONT flag.
Adds a task queue to receive tasks dispatched with TQ_FRONT.  Worker
threads pull tasks from this high priority queue before the default
pending queue.

Executing tasks out of FIFO order potentially breaks taskq_lowest_id()
if we do not preserve the ordering of the work list by taskqid.
Therefore, instead of always appending to the work list, we search for
the appropriate place to insert a task.  The common case is to append
to the list, so we make this operation efficient by searching the work
list in reverse order.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-07-01 10:59:38 -07:00
Brian Behlendorf
c950d1480d Only make compiler warnings fatal with --enable-debug
While in theory I like the idea of compiler warnings always being
fatal.  In practice this causes problems when small harmless errors
cause build failures for end users.  To handle this I've updated
the build system such that -Werror is only used when --enable-debug
is passed to configure.  This is how I always build when developing
so I'll catch all build warnings and end users will not get stuck
by minor issues.
2010-06-30 17:05:36 -07:00
Brian Behlendorf
6801b7154c Linux-2.6.33 compat, O_DSYNC flag added
Prior to linux-2.6.33 only O_DSYNC semantics were implemented and
they used the O_SYNC flag.  As of linux-2.6.33 this behavior was
properly split in to O_SYNC and O_DSYNC respectively.
2010-06-30 12:49:39 -07:00
Brian Behlendorf
79a3bf130b Linux-2.6.33 compat, .ctl_name removed from struct ctl_table
As of linux-2.6.33 the ctl_name member of the ctl_table struct
has been entirely removed.  The upstream code has been updated
to depend entirely on the the procname member.  To handle this
all references to ctl_name are wrapped in a CTL_NAME macro which
simply expands to nothing for newer kernels.  Older kernels are
supported by having it expand to .ctl_name = X just as before.
2010-06-30 12:49:12 -07:00
Brian Behlendorf
ede0bdffb6 Treat mutex->owner as volatile
When HAVE_MUTEX_OWNER is defined and we are directly accessing
mutex->owner treat is as volative with the ACCESS_ONCE() helper.
Without this you may get a stale cached value when accessing it
from different cpus.  This can result in incorrect behavior from
mutex_owned() and mutex_owner().  This is not a problem for the
!HAVE_MUTEX_OWNER case because in this case all the accesses
are covered by a spin lock which similarly gaurentees we will
not be accessing stale data.

Secondly, check CONFIG_SMP before allowing access to mutex->owner.
I see that for non-SMP setups the kernel does not track the owner
so we cannot rely on it.

Thirdly, check CONFIG_MUTEX_DEBUG when this is defined and the
HAVE_MUTEX_OWNER is defined surprisingly the mutex->owner will
not be cleared on mutex_exit().  When this is the case the SPL
needs to make sure to do it to ensure MUTEX_HELD() behaves as
expected or you will certainly assert in mutex_destroy().

Finally, improve the mutex regression tests.  For mutex_owned() we
now minimally check that it behaves correctly when checked from the
owner thread or the non-owner thread.  This subtle behaviour has bit
me before and I'd like to catch it early next time if it reappears.

As for mutex_owned() regression test additonally verify that
mutex->owner is always cleared on mutex_exit().
2010-06-28 16:02:57 -07:00
Brian Behlendorf
5be4767ae1 Accept but ignore TASKQ_DC_BATCH and TQ_FRONT
For the moment the SPL accepts the TASKQ_DC_BATCH and TQ_FRONT
flags however they get silently ignored.  This is harmless for
the moment but it does need to be implemented at some point.
2010-06-28 11:39:43 -07:00
Brian Behlendorf
e6de04b73c Add kmem_vasprintf function
We might as well have both asprintf() variants.  This allows us
to safely pass a va_list through several levels of the stack
using va_copy() instead of va_start().
2010-06-24 09:41:59 -07:00
Brian Behlendorf
438683c0a9 Revert "Support TQ_FRONT flag used by taskq_dispatch()"
This reverts commit eb12b3782c.
2010-06-21 10:19:44 -07:00
Brian Behlendorf
8ffef449ef Add missing header util/sscanf.h 2010-06-14 14:20:31 -07:00
Brian Behlendorf
def465ad4b Include kstat.h from kmem.h
It turns out Solaris incidentally includes kstat.h from kmem.h.  As
a side effect of this certain higher level .c files which should
explicitly include kstat.h don't because they happen to get it
via kmem.h.  To make like easier for everyone I do the same.
2010-06-14 14:18:48 -07:00
Brian Behlendorf
eb12b3782c Support TQ_FRONT flag used by taskq_dispatch()
Allow taskq_dispatch() to insert work items at the head of the
queue instead of just the tail by passing the TQ_FRONT flag.
2010-06-11 15:57:25 -07:00
Brian Behlendorf
32c6147dee Minor cleanup and Solaris API additions.
Minor formatting cleanups.

API additions:
* {U}INT8_{MIN,MAX}, {U}INT16_{MIN,MAX} macros.
* id_t typedef
* ddi_get_lbolt(), ddi_get_lbolt64() functions.
2010-06-11 15:57:25 -07:00
Brian Behlendorf
b868e22f05 Add kmem_asprintf(), strfree(), strdup(), and minor cleanup.
This patch adds three missing Solaris functions: kmem_asprintf(), strfree(),
and strdup().  They are all implemented as a thin layer which just calls
their Linux counterparts.  As part of this an autoconf check for kvasprintf
was added because it does not appear in older kernels.  If the kernel does
not provide it then spl-generic implements it.

Additionally the dead DEBUG_KMEM_UNIMPLEMENTED code was removed to clean
things up and make the kmem.h a little more readable.
2010-06-11 15:57:25 -07:00
Brian Behlendorf
bb1bb2c4c4 Add xuio_* structures and typedefs.
Add the basic xuio structure and typedefs for Solaris style zero copy.
There's a decent chance this will not be the way I handle this on Linux
but providing the basic types simplifies things for now.
2010-06-11 15:57:25 -07:00
Brian Behlendorf
750a7101f8 Stub out additional missing headers 2010-06-11 15:57:25 -07:00
Brian Behlendorf
ae4c36adce Cleanly split Linux proc.h (fs) from conflicting Solaris proc.h (process)
Under linux the proc.h header is for the /proc filesystem, and under
Solaris the proc/h header if for processes.  This patch correctly
moves the Linux proc functionality in a linux/proc_compat.h header
and leaves the sys/proc.h for use by Solaris.  Minor updates were
required to all the call sites where it was included of course.
2010-06-11 15:57:25 -07:00
Brian Behlendorf
49638d8388 Refresh autogen.sh products with automake 1.11.1. 2010-05-21 15:52:06 -07:00
Brian Behlendorf
32f5faff69 Simplify rwlock implementation.
Remove RW_COUNT() from the rwlock implementation.  The idea was that it
could be used as a generic wrapper for getting at the internal state
of a rwlock.  While a good idea it's proven problematic to keep it
correct for multiple archs and internal implementation changes.  In
short it hasn't been worth the trouble.

With that and simplicity in mind things have been updated to use the
rwsem_is_locked() function instead of RW_COUNT for the RW_*_HELD()
functions.  As for rw_upgrade() it remains only implemented for
the generic rwsem implemenation.  It remains to be determined if its
worth the effort of adding a custom implementation for each arch.
2010-05-20 14:20:34 -07:00
Brian Behlendorf
23d91792ef Use KM_NODEBUG macro in preference to __GFP_NOWARN. 2010-05-20 14:16:59 -07:00
Brian Behlendorf
716154c592 Public Release Prep
Updated AUTHORS, COPYING, DISCLAIMER, and INSTALL files.  Added
standardized headers to all source file to clearly indicate the
copyright, license, and to give credit where credit is due.
2010-05-17 15:18:00 -07:00
Brian Behlendorf
8e2140b770 Add 3 missing typedefs.
Add processorid_t, pc_t, index_t.
2010-05-14 09:42:53 -07:00
Brian Behlendorf
a76df2dc0f Add console_*printf() functions.
Add support for the missing console_vprintf() and console_printf()
functions.
2010-05-14 09:40:52 -07:00
Brian Behlendorf
f752b46eb3 Add cv_wait_interruptible() function.
This is a minor extension to the condition variable API to allow
for reasonable signal handling on Linux.  The cv_wait() function by
definition must wait unconditionally for cv_signal()/cv_broadcast()
before waking it.  This makes it impossible to woken by a signal
such as SIGTERM.  The cv_wait_interruptible() function was added
to handle this case.  It behaves identically to cv_wait() with the
exception that it waits interruptibly allowing a signal to wake it
up.  This means you do need to be careful and check issig() after
waking.
2010-05-14 09:24:51 -07:00
Brian Behlendorf
ef6c136884 Disable rw_tryupgrade() for newer kernels
For kernels using the CONFIG_RWSEM_GENERIC_SPINLOCK implementation
nothing has changed.  But if your kernel is building with arch
specific rwsems rw_tryupgrade() has been disabled until it can
be implemented correctly.  In particular, the x86 implementation
now leverages atomic primatives for serialization rather than
spinlocks.  So to get this working again it will need to be
implemented as a cmpxchg for x86 and likely something similiar
for other arches we are interested in.  For now it's safest
to simply disable it.
2010-04-22 12:28:19 -07:00
Brian Behlendorf
8934764e60 Add support for 'make -s' silent builds
The cleanest way to do this is to set AM_LIBTOOLFLAGS = --silent.  However,
AM_LIBTOOLFLAGS is not honored by automake-1.9.6-2.1 which is what I have
been using.  To cleanly handle this I am updating to automake-1.11-3 which
is why it looks like there is a lot of churn in the Makefiles.
2010-03-26 15:41:17 -07:00
Brian Behlendorf
16b719f006 Allow spl_config.h to be included by dependant packages (updated)
We need dependent packages to be able to include spl_config.h to
build properly.  This was partially solved in commit 0cbaeb1 by using
AH_BOTTOM to #undef common #defines (PACKAGE, VERSION, etc) which
autoconf always adds and cannot be easily removed.  This solution
works as long as the spl_config.h is included before your projects
config.h.  That turns out to be easier said than done.  In particular,
this is a problem when your package includes its config.h using the
-include gcc option which ensures the first thing included is your
config.h.

To handle all cases cleanly I have removed the AH_BOTTOM hack and
replaced it with an AC_CONFIG_HEADERS command.  This command runs
immediately after spl_config.h is written and with a little awk-foo
it strips the offending #defines from the file.  This eliminates
the problem entirely and makes header safe for inclusion.

Also in this change I have removed the few places in the code where
spl_config.h is included.  It is now added to the gcc compile line
to ensure the config results are always available.

Finally, I have also disabled the verbose kernel builds.  If you
want them back you can always build with 'make V=1'.  Since things
are working now they don't need to be on by default.
2010-03-22 14:45:33 -07:00
Brian Behlendorf
3977f8370f Linux 2.6.32 compat, proc_handler() API change
As of linux-2.6.32 the 'struct file *filp' argument was dropped from
the proc_handle() prototype.  It was apparently unused _almost_
everywhere in the kernel and this was simply cleanup.

I've added a new SPL_AC_5ARGS_PROC_HANDLER autoconf check for this and
the proper compat macros to correctly define the prototypes and some
helper functions.  It's not pretty but API compat changes rarely are.
2010-03-04 12:14:56 -08:00
Ricardo M. Correia
694921bc49 sun-misc-gitignore
Add .gitignore files.

Signed-off-by: Ricardo M. Correia <Ricardo.M.Correia@Sun.COM>
2010-01-08 09:37:54 -08:00
Ricardo M. Correia
f7e8739c94 sun-fix-whitespace
Whitespace fixes.

Signed-off-by: Ricardo M. Correia <Ricardo.M.Correia@Sun.COM>
2010-01-08 09:37:54 -08:00
Brian Behlendorf
3a03ce5cbf Check for changed gaurd macro in 2.6.28+ for rwsem implementation.
As part of the 2.6.28 cleanup which moved all the linux/include/asm/
headers in to linux/arch, the guard headers for many header files
changed.  The i386 rwsem implementation keys off this header to
ensure the internal members of the rwsem structure are interpreted
correctly.  This change checks for the new guard macro in addition
to the only one, the implementation of the rwsem has not changed
for i386 so this is safe and correct.
2009-12-17 11:57:44 -08:00
Brian Behlendorf
d04c8a563c Atomic64 compatibility for 32-bit systems without kernel support.
This patch is another step towards updating the code to handle the
32-bit kernels which I have not been regularly testing.  This changes
do not really impact the common case I'm expected which is the latest
kernel running on an x86_64 arch.

Until the linux-2.6.31 kernel the x86 arch did not have support for
64-bit atomic operations.  Additionally, the new atomic_compat.h support
for this case was wrong because it embedded a spinlock in the atomic
variable which must always and only be 64-bits total.  To handle these
32-bit issues we now simply fall back to the --enable-atomic-spinlock
implementation if the kernel does not provide the 64-bit atomic funcs.

The second issue this patch addresses is the DEBUG_KMEM assumption that
there will always be atomic64 funcs available.  On 32-bit archs this may
not be true, and actually that's just fine.  In that case the kernel will
will never be able to allocate more the 32-bits worth anyway.  So just
check if atomic64 funcs are available, if they are not it means this
is a 32-bit machine and we can safely use atomic_t's instead.
2009-12-04 15:54:12 -08:00
Brian Behlendorf
5652e7b497 When using x86 specific rwsem correctly intepret rwsem->count. 2009-12-01 15:47:27 -08:00
Brian Behlendorf
a5d6f6020a Add missing atomic64 compat helpers for 32-bit systems.
The use of these functions was added with the recent atomic work
and not tested on 32-bit systems.  Add the missing compat functions:
atomic64_inc, atomic64_dec, atomic64_add_return, atomic64_sub_return,
atomic64_inc_return, atomic64_dec_return.
2009-12-01 10:15:27 -08:00
Brian Behlendorf
1273cf284b Always use the generic mutex_destroy(). 2009-11-15 15:04:02 -08:00
Brian Behlendorf
05b48408fb Add mutex_enter_nested() as wrapper for mutex_lock_nested().
This symbol can be used by GPL modules which use the SPL to handle
cases where a call path takes a two different locks by the same
name.  This is needed to avoid a false positive in the lock checker.
2009-11-15 14:27:15 -08:00
Brian Behlendorf
8b45dda2bc Linux 2.6.31 kmem cache alignment fixes and cleanup.
The big fix here is the removal of kmalloc() in kv_alloc().  It used
to be true in previous kernels that kmallocs over PAGE_SIZE would
always be pages aligned.  This is no longer true atleast in 2.6.31
there are no longer any alignment expectations.  Since kv_alloc()
requires the resulting address to be page align we no only either
directly allocate pages in the KMC_KMEM case, or directly call
__vmalloc() both of which will always return a page aligned address.
Additionally, to avoid wasting memory size is always a power of two.

As for cleanup several helper functions were introduced to calculate
the aligned sizes of various data structures.  This helps ensure no
case is accidentally missed where the alignment needs to be taken in
to account.  The helpers now use P2ROUNDUP_TYPE instead of P2ROUNDUP
which is safer since the type will be explict and we no longer count
on the compiler to auto promote types hopefully as we expected.

Always wnforce minimum (SPL_KMEM_CACHE_ALIGN) and maximum (PAGE_SIZE)
alignment restrictions at cache creation time.

Use SPL_KMEM_CACHE_ALIGN in splat alignment test.
2009-11-13 11:12:43 -08:00
Brian Behlendorf
c89fdee4d3 Remove __GFP_NOFAIL in kmem and retry internally.
As of 2.6.31 it's clear __GFP_NOFAIL should no longer be used and it
may disappear from the kernel at any time.  To handle this I have simply
added *_nofail wrappers in the kmem implementation which perform the
retry for non-atomic allocations.

From linux-2.6.31 mm/page_alloc.c:1166
/*
 * __GFP_NOFAIL is not to be used in new code.
 *
 * All __GFP_NOFAIL callers should be fixed so that they
 * properly detect and handle allocation failures.
 *
 * We most definitely don't want callers attempting to
 * allocate greater than order-1 page units with
 * __GFP_NOFAIL.
 */
WARN_ON_ONCE(order > 1);
2009-11-12 15:11:24 -08:00
Brian Behlendorf
baf2979ed3 Linux 2.6.31 Compatibility Updates
SPL_AC_2ARGS_SET_FS_PWD macro updated to explicitly include
linux/fs_struct.h which was dropped from linux/sched.h.

min_wmark_pages, low_wmark_pages, high_wmark_pages macros
introduced in newer kernels.  For older kernels mm_compat.h
was introduced to define them as needed as direct mappings
to per zone min_pages, low_pages, max_pages.
2009-11-10 14:06:57 -08:00
Brian Behlendorf
055ffd98cf Autoconf --enable-debug-* cleanup
Cleanup the --enable-debug-* configure options, this has been pending
for quite some time and I am glad I finally got to it.  To summerize:

1) All SPL_AC_DEBUG_* macros were updated to be a more autoconf
friendly.  This mainly involved shift to the GNU approved usage of
AC_ARG_ENABLE and ensuring AS_IF is used rather than directly using
an if [ test ] construct.

2) --enable-debug-kmem=yes by default.  This simply enabled keeping
a running tally of total memory allocated and freed and reporting a
memory leak if there was one at module unload.  Additionally, it
ensure /proc/spl/kmem/slab will exist by default which is handy.
The overhead is low for this and it should not impact performance.

3) --enable-debug-kmem-tracking=no by default.  This option was added
to provide a configure option to enable to detailed memory allocation
tracking.  This support was always there but you had to know where to
turn it on.  By default this support is disabled because it is known
to badly hurt performence, however it is invaluable when chasing a
memory leak.

4) --enable-debug-kstat removed.  After further reflection I can't see
why you would ever really want to turn this support off.  It is now
always on which had the nice side effect of simplifying the proc handling
code in spl-proc.c.  We can now always assume the top level directory
will be there.

5) --enable-debug-callb removed.  This never really did anything, it was
put in provisionally because it might have been needed.  It turns out
it was not so I am just removing it to prevent confusion.
2009-10-30 13:58:51 -07:00
Brian Behlendorf
302b88e6ab Add autoconf checks for atomic64_cmpxchg + atomic64_xchg
These functions didn't exist for all archs prior to 2.6.24.  This
patch addes an autoconf test to detect this and add them when needed.
The autoconf check is needed instead of just an #ifndef because in
the most modern kernels atomic64_{cmp}xchg are implemented as in
inline function and not a #define.
2009-10-30 13:53:17 -07:00
Brian Behlendorf
5e9b5d832b Use Linux atomic primitives by default.
Previously Solaris style atomic primitives were implemented simply by
wrapping the desired operation in a global spinlock.  This was easy to
implement at the time when I wasn't 100% sure I could safely layer the
Solaris atomic primatives on the Linux counterparts.  It however was
likely not good for performance.

After more investigation however it does appear the Solaris primitives
can be layered on Linux's fairly safely.  The Linux atomic_t type really
just wraps a long so we can simply cast the Solaris unsigned value to
either a atomic_t or atomic64_t.  The only lingering problem for both
implementations is that Solaris provides no atomic read function.  This
means reading a 64-bit value on a 32-bit arch can (and will) result in
word breaking.  I was very concerned about this initially, but upon
further reflection it is a limitation of the Solaris API.  So really
we are just being bug-for-bug compatible here.

With this change the default implementation is layered on top of Linux
atomic types.  However, because we're assuming a lot about the internal
implementation of those types I've made it easy to fall-back to the
generic approach.  Simply build with --enable-atomic_spinlocks if
issues are encountered with the new implementation.
2009-10-30 10:55:25 -07:00
Brian Behlendorf
51a727e90f Set cwd to '/' for the process executing insmod.
Ricardo has pointed out that under Solaris the cwd is set to '/'
during module load, while under Linux it is set to the callers cwd.
To handle this cleanly I've reworked the module *_init()/_exit()
macros so they call a *_setup()/_cleanup() function when any SPL
dependent module is loaded or unloaded.  This gives us a chance to
perform any needed modification of the process, in this case changing
the cwd.  It also handily provides a way to avoid creating wrapper
init()/exit() functions because the Solaris and Linux prototypes
differ slightly.  All dependent modules should now call the spl
helper macros spl_module_{init,exit}() instead of the native linux
versions.

Unfortunately, it appears that under Linux there has been no consistent
API in the kernel to set the cwd in a module.  Because of this I have
had to add more autoconf magic than I'd like.  However, what I have
done is correct and has been tested on RHEL5, SLES11, FC11, and CHAOS
kernels.

In addition, I have change the rootdir type from a 'void *' to the
correct 'vnode_t *' type.  And I've set rootdir to a non-NULL value.
2009-10-01 16:06:15 -07:00
Brian Behlendorf
0e77fc118e Expand SEM() outside init_rwsem and directly call __init_rwsem().
We need to directly call __init_rwsem() or the name gets expanded
to SEM(lock-name).  This is safe and correct for the support arches
x86/x86_64/ppc/ppc64.
2009-09-29 03:19:09 -07:00
Brian Behlendorf
4d54fdee1d Reimplement mutexs for Linux lock profiling/analysis
For a generic explanation of why mutexs needed to be reimplemented
to work with the kernel lock profiling see commits:
  e811949a57 and
  d28db80fd0

The specific changes made to the mutex implemetation are as follows.
The Linux mutex structure is now directly embedded in the kmutex_t.
This allows a kmutex_t to be directly case to a mutex struct and
passed directly to the Linux primative.

Just like with the rwlocks it is critical that these functions be
implemented as '#defines to ensure the location information is
preserved.  The preprocessor can then do a direct replacement of
the Solaris primative with the linux primative.

Just as with the rwlocks we need to track the lock owner.  Here
things get a little more interesting because depending on your
kernel version, and how you've built your kernel Linux may already
do this for you.  If your running a 2.6.29 or newer kernel on a
SMP system the lock owner will be tracked.  This was added to Linux
to support adaptive mutexs, more on that shortly.  Alternately, your
kernel might track the lock owner if you've set CONFIG_DEBUG_MUTEXES
in the kernel build.  If neither of the above things is true for
your kernel the kmutex_t type will include and track the lock owner
to ensure correct behavior.  This is all handled by a new autoconf
check called SPL_AC_MUTEX_OWNER.

Concerning adaptive mutexs these are a very recent development and
they did not make it in to either the latest FC11 of SLES11 kernels.
Ideally, I'd love to see this kernel change appear in one of these
distros because it does help performance.  From Linux kernel commit:
  0d66bf6d3514b35eb6897629059443132992dbd7
  "Testing with Ingo's test-mutex application...
  gave a 345% boost for VFS scalability on my testbox"
However, if you don't want to backport this change yourself you
can still simply export the task_curr() symbol.  The kmutex_t
implementation will use this symbol when it's available to
provide it's own adaptive mutexs.

Finally, DEBUG_MUTEX support was removed including the proc handlers.
This was done because now that we are cleanly integrated with the
kernel profiling all this information and much much more is available
in debug kernel builds.  This code was now redundant.

Update mutexs validated on:
    - SLES10   (ppc64)
    - SLES11   (x86_64)
    - CHAOS4.2 (x86_64)
    - RHEL5.3  (x86_64)
    - RHEL6    (x86_64)
    - FC11     (x86_64)
2009-09-25 14:47:01 -07:00
Brian Behlendorf
d28db80fd0 Update rwlocks to track owner to ensure correct semantics
The behavior of RW_*_HELD was updated because it was not quite right.
It is not sufficient to return non-zero when the lock is help, we must
only do this when the current task in the holder.

This means we need to track the lock owner which is not something
tracked in a Linux semaphore.  After some experimentation the
solution I settled on was to embed the Linux semaphore at the start
of a larger krwlock_t structure which includes the owner field.
This maintains good performance and allows us to cleanly intergrate
with the kernel lock analysis tools.  My reasons:

1) By placing the Linux semaphore at the start of krwlock_t we can
then simply cast krwlock_t to a rw_semaphore and pass that on to
the linux kernel.  This allows us to use '#defines so the preprocessor
can do direct replacement of the Solaris primative with the linux
equivilant.  This is important because it then maintains the location
information for each rw_* call point.

2) Additionally, by adding the owner to krwlock_t we can keep this
needed extra information adjacent to the lock itself.  This removes
the need for a fancy lookup to get the owner which is optimal for
performance.  We can also leverage the existing spin lock in the
semaphore to ensure owner is updated correctly.

3) All helper functions which do not need to strictly be implemented
as a define to preserve location information can be done as a static
inline function.

4) Adding the owner to krwlock_t allows us to remove all memory
allocations done during lock initialization.  This is good for all
the obvious reasons, we do give up the ability to specific the lock
name.  The Linux profiling tools will stringify the lock name used
in the code via the preprocessor and use that.

Update rwlocks validated on:
- SLES10   (ppc64)
- SLES11   (x86_64)
- CHAOS4.2 (x86_64)
- RHEL5.3  (x86_64)
- RHEL6    (x86_64)
- FC11     (x86_64)
2009-09-25 14:14:35 -07:00
Brian Behlendorf
e811949a57 Reimplement rwlocks for Linux lock profiling/analysis.
It turns out that the previous rwlock implementation worked well but
did not integrate properly with the upstream kernel lock profiling/
analysis tools.  This is a major problem since it would be awfully
nice to be able to use the automatic lock checker and profiler.

The problem is that the upstream lock tools use the pre-processor
to create a lock class for each uniquely named locked.  Since the
rwsem was embedded in a wrapper structure the name was always the
same.  The effect was that we only ended up with one lock class for
the entire SPL which caused the lock dependency checker to flag
nearly everything as a possible deadlock.

The solution was to directly map a krwlock to a Linux rwsem using
a typedef there by eliminating the wrapper structure.  This was not
done initially because the rwsem implementation is specific to the arch.
To fully implement the Solaris krwlock API using only the provided rwsem
API is not possible.  It can only be done by directly accessing some of
the internal data member of the rwsem structure.

For example, the Linux API provides a different function for dropping
a reader vs writer lock.  Whereas the Solaris API uses the same function
and the caller does not pass in what type of lock it is.  This means to
properly drop the lock we need to determine if the lock is currently a
reader or writer lock.  Then we need to call the proper Linux API function.
Unfortunately, there is no provided API for this so we must extracted this
information directly from arch specific lock implementation.  This is
all do able, and what I did, but it does complicate things considerably.

The good news is that in addition to the profiling benefits of this
change.  We may see performance improvements due to slightly reduced
overhead when creating rwlocks and manipulating them.

The only function I was forced to sacrafice was rw_owner() because this
information is simply not stored anywhere in the rwsem.  Luckily this
appears not to be a commonly used function on Solaris, and it is my
understanding it is mainly used for debugging anyway.

In addition to the core rwlock changes, extensive updates were made to
the rwlock regression tests.  Each class of test was extended to provide
more API coverage and to be more rigerous in checking for misbehavior.

This is a pretty significant change and with that in mind I have been
careful to validate it on several platforms before committing.  The full
SPLAT regression test suite was run numberous times on all of the following
platforms.  This includes various kernels ranging from 2.6.16 to 2.6.29.

- SLES10   (ppc64)
- SLES11   (x86_64)
- CHAOS4.2 (x86_64)
- RHEL5.3  (x86_64)
- RHEL6    (x86_64)
- FC11     (x86_64)
2009-09-18 16:09:47 -07:00
Brian Behlendorf
c65d62d8bf Disable stack overflow checking by default.
The run time stack overflow checking is being disabled by default
because it is not safe for use with 2.6.29 and latter kernels.  These
kernels do now have their own stack overflow checking so this support
has become redundant anyway.  It can be re-enabled for older kernels or
arches without stack overflow checking by redefining CHECK_STACK().
2009-07-30 13:52:11 -07:00
Brian Behlendorf
6ae7fef5b9 Update global_page_state() support for 2.6.29 kernels.
Basically everything we need to monitor the global memory state of
the system is now cleanly available via global_page_state().  The
problem is that this interface is still fairly recent, and there
has been one change in the page state enum which we need to handle.
These changes basically boil down to the following:
- If global_page_state() is available we should use it.  Several
  autoconf checks have been added to detect the correct enum names.
- If global_page_state() is not available check to see if
  get_zone_counts() symbol is available and use that.
- If the get_zone_counts() symbol is not exported we have no choice
  be to dynamically aquire it at load time.  This is an absolute
  last resort for old kernel which we don't want to patch to
  cleanly export the symbol.
2009-07-28 15:06:42 -07:00
Brian Behlendorf
ec7d53e99a Add basic credential support and splat tests.
The previous credential implementation simply provided the needed types and
a couple of dummy functions needed.  This update correctly ties the basic
Solaris credential API in to one of two Linux kernel APIs.

Prior to 2.6.29 the linux kernel embeded all credentials in the task
structure.  For these kernels, we pass around the entire task struct as if
it were the credential, then we use the helper functions to extract the
credential related bits.

As of 2.6.29 a new credential type was added which we can and do fairly
cleanly layer on top of.  Once again the helper functions nicely hide
the implementation details from all callers.

Three tests were added to the splat test framework to verify basic
correctness.  They should be extended as needed when need credential
functions are added.
2009-07-27 17:18:59 -07:00
Ricardo M. Correia
ac95d0974b Fixed NULL dereference by tcd_for_each() when the kmalloc() call in module/spl/spl-debug.c:1163 returns NULL.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2009-07-14 15:24:59 -07:00
Brian Behlendorf
b11b08ed64 Add a little paranoia here to ensure endianess is set correctly. 2009-07-14 14:28:04 -07:00
Brian Behlendorf
06dea10380 Add basic groupmember() function, not sup groups. 2009-07-10 10:58:06 -07:00
Brian Behlendorf
d3126abe75 Add ddi_copyin/ddi_copyout support for fake kernel originated ioctls. 2009-07-10 10:56:32 -07:00
Brian Behlendorf
2a734e9c26 Define ACE_ALL_PERMS for use by ACLs 2009-07-09 15:00:25 -07:00
Brian Behlendorf
c18cbcfe66 Define FKIOCTL which is used on Solaris to mark an in-kernel ioctl. 2009-07-09 14:59:41 -07:00
Brian Behlendorf
3a68dc5374 Add ASSERTV macro to simplify removing variables (the V in ASSERTV)
which are only used in ASSERT().
2009-07-09 12:15:23 -07:00
Brian Behlendorf
915404bd50 Add basic support for TASKQ_THREADS_CPU_PCT taskq flag which is
used to scale the number of threads based on the number of online
CPUs.  As CPUs are added/removed we should rescale the thread
count appropriately, but currently this is only done at create.
2009-07-09 10:07:52 -07:00
Brian Behlendorf
86933a6e51 Simplify rpm build rules, added config/rpm.am.
Distro friendly changes such that the kernel modules are packaged seperately.
2009-07-01 14:37:44 -07:00
Brian Behlendorf
f4f9cd75a1 Install spl-devel products in /usr/src/spl-SPL_VERSION/LINUX_VERSION/
Remove the spl symlink, it's just confusing
2009-06-26 16:30:44 -07:00
Brian Behlendorf
2e0e7e6976 Packaging improvements for RHEL and SLES (part 2)
- Allow checking for exported symbols in both Module.symvers
  and Module.symvers.  My stock SLES kernel ships an objects
  directory with Module.symvers, yet produces a Module.symvers
  in the local build directory.
2009-06-16 11:34:28 -07:00
Brian Behlendorf
39a3d2a421 Packaging improvements for RHEL and SLES
- Properly honor --prefix in build system and rpm spec file.
- Add '--define require_kdir' to spec file to support building
  rpms against kernel sources installed in non-default locations.
- Add '--define require_kobj' to spec file to support building
  rpms against kernel object installed in non-default locations.
- Stop suppressing errors in autogen.sh script.
- Improved logic to detect missing kernel objects when they are
  not located with the source.  This is the common case for SLES
  as well as in-tree chaos kernel builds and is done to simply
  support for multiple arches.
- Moved spl-devel build products to /usr/src/spl-<version>, a
  spl symlink is created to reference the last installed version.
2009-06-16 10:44:59 -07:00
Brian Behlendorf
e554dffa60 SLES10 Fixes (part 9)
- Proper ioctl() 32/64-bit binary compatibility.  We need to ensure the
  ioctl data itself is always packed the same for 32/64-bit binaries.
  Additionally, the correct thing to do is encode this size in bytes
  as part of the command using _IOC_SIZE().
- Minor formatting changes to respect the 80 character limit.
- Move all SPLAT_SUBSYSTEM_* defines in to splat-ctl.h.
- Increase SPLAT_SUBSYSTEM_UNKNOWN because we were getting close
  to accidentally using it for a real registered subsystem.
2009-05-21 10:56:11 -07:00
Brian Behlendorf
124ca8a5a9 SLES10 Fixes (part 7)
- Initial SLES testing uncovered a long standing bug in the debug
  tracing.  The tcd_for_each() macro expected a NULL to terminate
  the trace_data[i] array but this was only ever true due to luck.
  All trace_data[] iterators are now properly capped by TCD_TYPE_MAX.
- SPLAT_MAJOR 229 conflicted with a 'hvc' device on my SLES system.
  Since this was always an arbitrary choice I picked something else.
- The HAVE_PGDAT_LIST case should set pgdat_list_addr to the value stored
  at the address of the memory location returned by kallsyms_lookup_name().
2009-05-20 15:30:13 -07:00
Brian Behlendorf
5232d256b4 SLES10 Fixes (part 6)
- Prior to 2.6.17 there were no *_pgdat helper functions in mm/mmzone.c.
  Instead for_each_zone() operated directly on pgdat_list which may or
  may not have been exported depending on how your kernel was compiled.
  Now new configure checks determine if you have the helpers or not, and
  if the needed symbols are exported.  If they are not exported then they
  are dynamically aquired at runtime by kallsyms_lookup_name().
2009-05-20 14:23:13 -07:00
Brian Behlendorf
3731931529 Powerpc Fixes (part 1):
- Enable builds for powerpc ISA type.
- Add DIV_ROUND_UP and roundup macros if unavailable.
- Cast 64-bit values for %lld format string to (long long) to
  quiet compile warning.
2009-05-20 12:23:24 -07:00
Brian Behlendorf
fe4573928f SLES10 Fixes (part 5):
- Fix incorrect mapping for spl_device_create()->class_device_create()
  which is the prefered API for 2.6.13 to 2.6.17 based kernels.
2009-05-20 11:54:40 -07:00
Brian Behlendorf
6c9433c150 SLES10 Fixes (part 3):
- Configure check for mutex_lock_nested().  This function was introduced
  as part of the mutex validator in 2.6.18, but if it's unavailable then
  it's safe to fallback to a plain mutex_lock().
2009-05-20 11:00:39 -07:00
Brian Behlendorf
96dded3844 SLES10 Fixes (part 2):
- Configure check, the div64_64() function was renamed to
  div64_u64() as of 2.6.26.
- Configure check, the global_page_state() fuction was introduced
  in 2.6.18 kernels.  The earlier 2.6.16 based SLES10 must not try
  and use it, thankfully get_zone_counts() is still available.
- To simplify debugging poison all symbols aquired dynamically
  using spl_kallsyms_lookup_name() with SYMBOL_POISON.
- Add console messages when the user mode helpers fail.
- spl_kmem_init_globals() use bit shifts instead of division.
- When the monotonic clock is unavailable __gethrtime() must perform
  the HZ division as an 'unsigned long long' because the SPL only
  implements __udivdi3(), and not __divdi3() for 'long long' division
  on 32-bit arches.
2009-05-20 10:08:37 -07:00
Brian Behlendorf
759dfe7d43 Add list_move_tail() function. 2009-03-19 21:40:07 -07:00
Brian Behlendorf
0cbaeb117a Allow spl_config.h to be included by dependant packages
We need dependent packages to be able to include spl_config.h so they
can leverage the configure checks the SPL has done.  This is important
because several of the spl headers need the results of these checks to
work properly.  Unfortunately, the autoheader build product is always
private to a particular build and defined certain common things.
(PACKAGE, VERSION, etc).  This prevents other packages which also use
autoheader from being include because the definitions conflict.  To
avoid this problem the SPL build system leverage AH_BOTTOM to include
a spl_unconfig.h at the botton of the autoheader build product.  This
custom include undefs all known shared symbols to prevent the confict.
This does however mean that those definition are also not availble
to the SPL package either.  The SPL package therefore uses the
equivilant SPL_META_* definitions.
2009-03-17 14:55:59 -07:00
Brian Behlendorf
e11d6c5f50 FC10/i686 Compatibility Update (2.6.27.19-170.2.35.fc10.i686)
In the interests of portability I have added a FC10/i686 box to
my list of development platforms.  The hope is this will allow me
to keep current with upstream kernel API changes, and at the same
time ensure I don't accidentally break x86 support.  This patch
resolves all remaining issues observed under that environment.

1) SPL_AC_ZONE_STAT_ITEM_FIA autoconf check added.  As of 2.6.21
the kernel added a clean API for modules to get the global count
for free, inactive, and active pages.  The SPL attempts to detect
if this API is available and directly map spl_global_page_state()
to global_page_state().  If the full API is not available then
spl_global_page_state() is implemented as a thin layer to get
these values via get_zone_counts() if that symbol is available.

2) New kmem:vmem_size regression test added to validate correct
vmem_size() functionality.  The test case acquires the current
global vmem state, allocates from the vmem region, then verifies
the allocation is correctly reflected in the vmem_size() stats.

3) Change splat_kmem_cache_thread_test() to always use KMC_KMEM
based memory.  On x86 systems with limited virtual address space
failures resulted due to exhaustig the address space.  The tests
really need to problem exhausting all memory on the system thus
we need to use the physical address space.

4) Change kmem:slab_lock to cap it's memory usage at availrmem
instead of using the native linux nr_free_pages().  This provides
additional test coverage of the SPL Linux VM integration.

5) Change kmem:slab_overcommit to perform allocation of 256K
instead of 1M.  On x86 based systems it is not possible to create
a kmem backed slab with entires of that size.  To compensate for
this the number of allocations performed in increased by 4x.

6) Additional autoconf documentation for proposed upstream API
changes to make additional symbols available to modules.

7) Console error messages added when spl_kallsyms_lookup_name()
fails to locate an expected symbol.  This causes the module to fail
to load and we need to know exactly which symbol was not available.
2009-03-17 12:16:31 -07:00
Brian Behlendorf
7257ec4185 Fix taskq_wait() not waiting bug
I'm very surprised this has not surfaced until now.  But the taskq_wait()
implementation work only wait successfully the first time it was called.
Subsequent usage of taskq_wait() on the taskq would not wait.

The issue was caused by tq->tq_lowest_id being set to MAX_INT after the
first wait completed.  This caused subsequent waits which check that the
waiting id is less than the lowest taskq id to always succeed.  The fix
is to ensure that tq->tq_lowest_id is never set larger than tq->tq_next.id.

Additional fixes which were added to this patch include:
1) Fix a race by placing the taskq_wait_check() in the tq->tq_lock spinlock.
2) taskq_wait() should wait for the largest outstanding id.
3) Multiple spelling corrections.
4) Added taskq wait regression test to validate correct behavior.
2009-03-15 15:13:49 -07:00
Brian Behlendorf
8123ac4f0d Added SPL_AC_5ARGS_DEVICE_CREATE autoconf configure check
As of 2.6.27 kernels the device_create() API changed to include
a private data argument.  This check detects which version of
device_create() function the kernel has and properly defines
spl_device_create() to use the correct prototype.
2009-03-13 13:38:43 -07:00
Ricardo M. Correia
6c33eb8162 Minor bug fix in XDR code introduced in last minute change before landing.
1) Removed xdr_bytesrec typedef which has no consumers.  If we re-add
   it should also probably be xdr_bytesrec_t.
2009-03-11 16:27:35 -07:00
Ricardo M. Correia
f48b61938a Add XDR implementation
Added proper XDR implementation (Lustre bug 17662), needed for on-disk
compatibility between platforms of different endianness.
2009-03-11 13:00:26 -07:00
Brian Behlendorf
0c617c9a63 Build system cleanup
1) Undefine non-unique entries in spl_config.h
2) Minor Makefile cleanup
3) Don't use includedir for proper kernel header install
2009-03-11 12:37:34 -07:00
Brian Behlendorf
c5f704607b Build system and packaging (RPM support)
An update to the build system to properly support all commonly
used Makefile targets these include:

  make all        # Build everything
  make install    # Install everything
  make clean	  # Clean up build products
  make distclean  # Clean up everything
  make dist       # Create package tarball
  make srpm       # Create package source RPM
  make rpm        # Create package binary RPMs
  make tags       # Create ctags and etags for everything

Extra care was taken to ensure that the source RPMs are fully
rebuildable against Fedora/RHEL/Chaos kernels.  To build binary
RPMs from the source RPM for your system simply run:

  rpmbuild --rebuild spl-x.y.z-1.src.rpm

This will produce two binary RPMs with correct 'requires'
dependencies for your kernel.  One will contain all spl modules
and support utilities, the other is a devel package for compiling
additional kernel modules which are dependant on the spl.

  spl-x.y.z-1_<kernel version>.x86_64.rpm
  spl-devel-x.y.2-1_<kernel version>.x86_64.rpm
2009-03-09 15:56:55 -07:00
Ricardo M. Correia
32f74c5280 XXX: Temporarily disable vmem_size(). 2009-03-05 10:13:59 -08:00
Brian Behlendorf
04fa349d69 Merge branch 'kallsyms' 2009-03-04 10:19:41 -08:00
Brian Behlendorf
d1ff2312b0 Linux VM Integration Cleanup
Remove all instances of functions being reimplemented in the SPL.
When the prototypes are available in the linux headers but the
function address itself is not exported use kallsyms_lookup_name()
to find the address.  The function name itself can them become a
define which calls a function pointer.  This is preferable to
reimplementing the function in the SPL because it ensures we get
the correct version of the function for the running kernel.  This
is actually pretty safe because the prototype is defined in the
headers so we know we are calling the function properly.

This patch also includes a rhel5 kernel patch we exports the needed
symbols so we don't need to use kallsyms_lookup_name().  There are
autoconf checks to detect if the symbol is exported and if so to
use it directly.  We should add patches for stock upstream kernels
as needed if for no other reason than so we can easily track which
additional symbols we needed exported.  Those patches can also be
used by anyone willing to rebuild their kernel, but this should
not be a requirement.  The rhel5 version of the export-symbols
patch has been applied to the chaos kernel.

Additional fixes:
1) Implement vmem_size() function using get_vmalloc_info()
2) SPL_CHECK_SYMBOL_EXPORT macro updated to use $LINUX_OBJ instead
   of $LINUX because Module.symvers is a build product.  When
   $LINUX_OBJ != $LINUX we will not properly detect exported symbols.
3) SPL_LINUX_COMPILE_IFELSE macro updated to add include2 and
   $LINUX/include search paths to allow proper compilation when
   the kernel target build directory is not the source directory.
2009-03-04 10:04:15 -08:00
Ricardo M. Correia
eb7c7f44e8 Changed ptob()/btop() mult/div into bit shifts.
Added necessary include for PAGE_SHIFT.
2009-02-25 15:50:58 -08:00
Ricardo M. Correia
7819a92a9b Added btop() and moved ptob() to include/sys/param.h. 2009-02-25 15:50:50 -08:00
Ricardo M. Correia
4327ac3ff9 Changed z_compress_level() and z_uncompress() prototypes to match the ones in Solaris.
Fixes compilation warning.
2009-02-23 11:45:59 -08:00
Brian Behlendorf
a1cf80b493 Matching kmem_free() fix for use after free case.
See commit bb01879ebe for a full
description.  This issue should have been addressed in the same
commit but it slipped my mind.
2009-02-19 12:28:10 -08:00
Brian Behlendorf
99639e4a13 Add zone_get_hostid() function
Minimal support added for the zone_get_hostid() function.  Only
global zones are supported therefore this function must be called
with a NULL argumment.  Additionally, I've added the HW_HOSTID_LEN
define and updated all instances where a hard coded magic value
of 11 was used; "A good riddance of bad rubbish!"
2009-02-19 11:26:17 -08:00
Brian Behlendorf
bb01879ebe Coverity 9654, 9654: Use After Free
Because vmem_free() was implemented as a macro using the ','
operator to evaluate both arguments and we performed the free
before evaluating size we would deference the free'd pointer.
To resolve the problem we just invert the ordering and evaluate
size first just as if it was evaluated by the caller when being
passed to this function.  This ensure that if the caller is
doing something reckless like performing an assignment as
part of the size argument we still perform it and it simply
doesn't get removed by the macro.  Oh course nobody should
be doing this sort of thing, but just in case.
2009-02-17 16:51:19 -08:00
Brian Behlendorf
15dc8b072e Coverity 9652, 9653: No Effect
Removed 2 ASSERT()s which had no effect because by definition
size_t is always an unsigned type thus is always >= 0.
2009-02-17 16:30:58 -08:00
Brian Behlendorf
014b1d6f54 Coverity 9641: Buffer Size
When SPLAT_TEST_INIT() initialized SPLAT_KMEM_TEST11_NAME the short
short test name overran the static length buffer of SPLAT_NAME_SIZE.
This was fixed by increasing the buffer length from 16 to 20 bytes.
2009-02-17 16:24:26 -08:00
Brian Behlendorf
9b1b8e4c24 kmem slab magazine ageing deadlock
- The previous magazine ageing sceme relied on the on_each_cpu()
  function to call spl_magazine_age() on each cpu.  It turns out
  this could deadlock with do_flush_tlb_all() which also relies
  on the IPI based on_each_cpu().  To avoid this problem a per-
  magazine delayed work item is created and indepentantly
  scheduled to the correct cpu removing the need for on_each_cpu().
- Additionally two unused fields were removed from the type
  spl_kmem_cache_t, they were hold overs from previous cleanup.
    - struct work_struct work
    - struct timer_list timer
2009-02-17 15:52:18 -08:00
Brian Behlendorf
f6c5d4ff88 Build system update
- Added default build flags:
  -Wall -Wstrict-prototypes -Werror -Wshadow
- Added missing Makefile's for include/ subdirectories.
2009-02-12 14:45:22 -08:00
Brian Behlendorf
37db7d8cf9 kmem slab fixes
- Default SPL_KMEM_CACHE_DELAY changed to 15 to match Solaris.
- Aged out slab checking occurs every SPL_KMEM_CACHE_DELAY / 3.
- skc->skc_reap tunable added whichs allows callers of
  spl_slab_reclaim() to cap the number of slabs reclaimed.
  On Solaris all eligible slabs are always reclaimed, and this
  is still the default behavior.  However, I suspect that is
  not always wise for reasons such as in the next comment.
- spl_slab_reclaim() added cond_resched() while walking the
  slab/object free lists.  Soft lockups were observed when
  freeing large numbers of vmalloc'd slabs/objets.
- spl_slab_reclaim() 'sks->sks_ref > 0' check changes from
  incorrect 'break' to 'continue' to ensure all slabs are
  checked.
- spl_cache_age() reworked to avoid a deadlock with
  do_flush_tlb_all() which occured because we slept waiting
  for completion in spl_cache_age().  To waiting for magazine
  reclamation to finish is not required so we no longer wait.
- spl_magazine_create() and spl_magazine_destroy() shifted
  back to using for_each_online_cpu() instead of the
  spl_on_each_cpu() approach which was of course a bad idea
  due to memory allocations which Ricardo pointed out.
2009-02-12 13:32:10 -08:00
Ricardo M. Correia
f500ccff35 Minor bug fix due to MAXOFFSET_T constant being too large on 32-bit systems. 2009-02-07 00:53:39 +00:00
Brian Behlendorf
4ab13d3b5c Additional Linux VM integration
Added support for Solaris swapfs_minfree, and swapfs_reserve tunables.
In additional availrmem is now available and return a reasonable value
which is reasonably analogous to the Solaris meaning.  On linux we
return the sun of free and inactive pages since these are all easily
reclaimable.

All tunables are available in /proc/sys/kernel/spl/vm/* and they may
need a little adjusting once we observe the real behavior.  Some of
the defaults are mapped to similar linux counterparts, others are
straight from the OpenSolaris defaults.
2009-02-05 12:26:34 -08:00
Brian Behlendorf
36b313dacf Linux VM integration / device special files
Support added to provide reasonable values for the global Solaris
VM variables: minfree, desfree, lotsfree, needfree.  These values
are set to the sum of their per-zone linux counterparts which
should be close enough for Solaris consumers.

When a non-GPL app links against the SPL we cannot use the udev
interfaces, which means non of the device special files are created.
Because of this I had added a poor mans udev which cause the SPL
to invoke an upcall and create the basic devices when a minor
is registered.  When a minor is unregistered we use the vnode
interface to unlink the special file.
2009-02-04 15:15:41 -08:00
Brian Behlendorf
31a033ecd4 2.6.27+ portability changes
- Added SPL_AC_3ARGS_ON_EACH_CPU configure check to determine
  if the older 4 argument version of on_each_cpu() should be
  used or the new 3 argument version.  The retry argument was
  dropped in the new API which was never used anyway.
- Updated work queue compatibility wrappers.  The old way this
  worked was to pass a data point when initialized the workqueue.
  The new API assumed the work item is embedding in a structure
  and we us container_of() to find that data pointer.
- Updated skc->skc_flags to be an unsigned long which is now
  type checked in the bit operations.  This silences the warnings.
- Updated autogen products and splat tests accordingly
2009-02-02 15:12:30 -08:00
Brian Behlendorf
416bae036b Add new workqueue header 2009-01-30 21:11:42 -08:00
Brian Behlendorf
ea3e6ca9e5 kmem_cache hardening and performance improvements
- Added slab work queue task which gradually ages and free's slabs
  from the cache which have not been used recently.
- Optimized slab packing algorithm to ensure each slab contains the
  maximum number of objects without create to large a slab.
- Fix deadlock, we can never call kv_free() under the skc_lock.  We
  now unlink the objects and slabs from the cache itself and attach
  them to a private work list.  The contents of the list are then
  subsequently freed outside the spin lock.
- Move magazine create/destroy operation on to local cpu.
- Further performace optimizations by minimize the usage of the large
  per-cache skc_lock.  This includes the addition of KMC_BIT_REAPING
  bit mask which is used to prevent concurrent reaping, and to defer
  new slab creation when reaping is occuring.
- Add KMC_BIT_DESTROYING bit mask which is set when the cache is being
  destroyed, this is used to catch any task accessing the cache while
  it is being destroyed.
- Add comments to all the functions and additional comments to try
  and make everything as clear as possible.
- Major cleanup and additions to the SPLAT kmem tests to more
  rigerously stress the cache implementation and look for any problems.
  This includes correctness and performance tests.
- Updated portable work queue interfaces
2009-01-30 20:54:49 -08:00
Brian Behlendorf
0f233eac33 Pull the blkdev header in to the sunldi for some useful structure definitions and helper functions 2009-01-26 16:47:49 -08:00
Brian Behlendorf
48e0606a52 Implement kmem cache alignment argument 2009-01-26 09:02:04 -08:00
Brian Behlendorf
e4f3ea278e Remove stray ` from macro 2009-01-23 08:59:11 -08:00
Brian Behlendorf
511176398c Update debug.h to standardize VERIFY3_IMPL error messages in debug and non-debug mode 2009-01-22 09:41:47 -08:00
Brian Behlendorf
1e4ed6c990 Add missing stub headers 2009-01-09 16:04:44 -08:00
Brian Behlendorf
121d48c97d Add basic ksid_lookupdomain and ksiddomain_rele support, just allocations 2009-01-09 15:30:53 -08:00
Brian Behlendorf
0e41414946 Add two new stub headers 2009-01-09 14:04:13 -08:00
Brian Behlendorf
97735c39e3 Add VOP_SEEK 2009-01-09 13:59:39 -08:00
Brian Behlendorf
d83ba26e18 Add missing policy includes, add missing sun ddi bits 2009-01-09 10:49:47 -08:00
Brian Behlendorf
70997fb4b1 Add share.h stub 2009-01-09 10:06:18 -08:00
Brian Behlendorf
71c8ab9c68 Drat fix missing ; 2009-01-09 10:05:03 -08:00
Brian Behlendorf
23f5c4c281 Add missing callback_context_t and fid_t types 2009-01-09 10:03:37 -08:00
Brian Behlendorf
703e7a3cf4 Add stubs for three more includes 2009-01-09 09:47:27 -08:00
Brian Behlendorf
d702c04ff1 Add 5 splat tests for list handling 2009-01-07 12:54:03 -08:00
Brian Behlendorf
4c18c39ecb Add include/sys/compress.h header 2009-01-06 09:47:00 -08:00
Brian Behlendorf
160c63ab76 Add P2BOUNDARY macro 2009-01-06 09:23:13 -08:00
Brian Behlendorf
7adbea4141 Pull in some default page typedefs 2009-01-05 16:14:38 -08:00
Brian Behlendorf
0f37204417 Add DTRACE_PROBE(a) 2009-01-05 16:09:21 -08:00
Brian Behlendorf
b53c565e65 Stub u8_textprep.h for inclusion purposes 2009-01-05 15:37:07 -08:00
Brian Behlendorf
e9cb2b4f64 Add system taskq support 2009-01-05 15:08:03 -08:00
Brian Behlendorf
8a2b328b18 Remove u8_textprep, we will not be implementing this nightmare yet 2009-01-05 11:32:08 -08:00
Brian Behlendorf
f3fc90c249 Include the header 2008-12-23 16:48:15 -08:00
Brian Behlendorf
925ca8cc01 Add sys/thread.h 2008-12-23 16:27:36 -08:00
Brian Behlendorf
bb9cfc6cc3 Define needfree 2008-12-23 15:59:36 -08:00
Brian Behlendorf
2b88beb74f Add timer.h header 2008-12-23 15:40:20 -08:00
Brian Behlendorf
bbdec3be06 Add u8 stub 2008-12-23 15:38:15 -08:00
Brian Behlendorf
de79fdd3a8 Move sunddi include 2008-12-23 13:32:07 -08:00
Brian Behlendorf
9d457afd1b Add sunddi to uio 2008-12-23 13:30:04 -08:00
Brian Behlendorf
dc0f920710 Minor updates 2008-12-23 13:25:52 -08:00
Brian Behlendorf
926e2b6058 Pull in lock types 2008-12-23 13:18:39 -08:00
Brian Behlendorf
c1d42c2f1d Add header 2008-12-23 13:05:50 -08:00
Brian Behlendorf
f5b92a66ad Add a few more missing header which the upstream stock kernel context expects 2008-12-23 13:03:09 -08:00
Brian Behlendorf
2ee63a549a Add struct ddi_strtox functions 2008-12-05 16:23:57 -08:00
Brian Behlendorf
72e7de6026 Prefix META_ALIAS with SPL_ 2008-11-26 13:26:05 -08:00
Brian Behlendorf
abc3ca149d Prefix all META_* #defines with SPL to prevent colisions which include our spl_config.h. Dependent packages may do this to leverage the autoconf check we have already run aganst the kernel. 2008-11-26 13:09:37 -08:00
behlendo
7212e2cd27 Add missing autogen products
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@182 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-11-26 17:07:59 +00:00
behlendo
6a1c3d418a * include/sys/sunddi.h, modules/spl/spl-module.c : Removed default
udev support from sunddi implementation because it uses GPL-only
symbols.  This support is optionally available for SPL consumers
if they define HAVE_GPL_ONLY_SYMBOLS and license their module as
GPL using the MODULE_LICENSE("GPL") macro.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@179 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-11-13 21:43:30 +00:00
behlendo
36833ea4e4 Slightly increase SPLAT_NAME_SIZE to ensure string is always
NULL terminated.


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@174 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-11-05 21:27:31 +00:00
behlendo
0498e6c585 Removed useless check
Fix forward NULL in splat kmem_cache test ctors/dtors



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@171 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-11-04 23:18:31 +00:00
behlendo
c8e60837b7 * spl-09-fix-kmem-track-oops.patch
This fixes an oops when unloading the modules, in the case where memory
tracking was enabled and there were memory leaks. The comment in the
code explains what was the problem.

* spl-10-fix-assert-verify-ndebug.patch
This fixes ASSERT*() and VERIFY*() macros in non-debug builds. VERIFY*()
macros are supposed to check the condition and panic even in production
builds, and ASSERT*() macros don't need to evaluate the arguments.
Also some 32-bit fixes.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@165 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-11-03 22:02:15 +00:00
behlendo
c22e7a427b Under Solaris KM_SLEEP ensures success (or at least you hang forever).
That said when working with a finite resource like memory failure really
is always a possibility.  It would be far better longer term if the ZFS
code could be weened off this assumption and properly handle the cases
where an allocation fails.  Still I've applied the patch to spl-0.3.4
since this layer is supposed to emulate Solaris as closely as possible.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@164 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-11-03 21:51:33 +00:00
behlendo
a0f6da3d95 Add a SPL_AC_TYPE_ATOMIC64_T test to configure for systems which do
already supprt atomic64_t types.

* spl-07-kmem-cleanup.patch
This moves all the debugging code from sys/kmem.h to spl-kmem.c, because
the huge macros were hard to debug and were bloating functions that
allocated memory. I also fixed some other minor problems, including
32-bit fixes and a reported memory leak which was just due to using the
wrong free function.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@163 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-11-03 21:06:04 +00:00
behlendo
550f170525 Apply two nice improvements caught by Ricardo,
spl-05-div64.patch
This is a much less intrusive fix for undefined 64-bit division symbols
when compiling the DMU in 32-bit kernels.

* spl-06-atomic64.patch
This is a workaround for 32-bit kernels that don't have atomic64_t.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@162 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-11-03 20:34:17 +00:00
behlendo
749045bbfa Apply a nice fix caught by Ricardo,
* spl-04-fix-taskq-spinlock-lockup.patch
Fixes a deadlock in the BIO completion handler, due to the taskq code
prematurely re-enabling interrupts when another spinlock had disabled
them in the IDE IRQ handler.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@161 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-11-03 20:21:08 +00:00
behlendo
f6c81c5ea7 Reviewed and applied spl-01-rm-gpl-symbol-set_cpus_allowed.patch
from Ricardo which removes a dependency on the GPL-only symbol
set_cpus_allowed().  Using this symbol is simpler but in the name
of portability we are adopting a spinlock based solution here
to remove this dependency.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@160 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-11-03 20:07:20 +00:00
behlendo
25557fd884 Sigh more compat fixes, this is almost right for 2.6.9 - 2.6.26 kernels.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@157 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-08-11 23:47:44 +00:00
behlendo
b61a6e8bdc Pull in initial 32-bit support patches.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@156 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-08-11 22:42:04 +00:00
behlendo
3d061e9d10 Commit bulk of remaining 2.6.9 and 2.6.26 compat changes.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@155 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-08-11 22:13:47 +00:00
behlendo
322640b7b5 Include linux/uaccess.h compat changes.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@154 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-08-11 19:10:14 +00:00
behlendo
86de8532a9 More 2.6.26 compat changes
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@153 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-08-11 17:56:40 +00:00
behlendo
6a6cafbe8d Pull in timespec, list, and type compat changes to support
building against a wider range of kernels.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@152 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-08-11 17:20:11 +00:00
behlendo
46c685d0c4 Add class / device portability code. Two autoconf tests
were added to cover the 3 possible APIs from 2.6.9 to
2.6.26.  We attempt to use the newest interfaces and if
not available fallback to the oldest.  This a rework of
some changes proposed by Ricardo for RHEL4.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@150 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-08-10 03:50:36 +00:00
behlendo
877a32e91e Pull in fls64 compat changes from spl-00-rhel4-compat.patch,
to allow greater compatibility with kernels pre 2.6.16.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@149 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-08-06 04:52:39 +00:00
behlendo
7afde631f6 Start bringing in Ricardo's spl-00-rhel4-compat.patch, a few chunks
at a time as I audit it.  This chunk finishes moving the SPL entirely
off the linux slab on to the SPL implementation.  It differs slightly
from the proposed version in that the spl continues to export to
all the Solaris types and functions.  These do conflict with the
Linux slab so a module usings these interfaces must not include the
SPL slab if they also intend to use the linux slab.  Or they must
explcitly #undef the macros which remap the functioin to their
spl_* equivilants.

A nice side of effect of dropping the entire linux slab is we
don't need to autoconf checks anymore.  They kept messing with
the slab API endlessly!



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@148 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-08-05 04:16:09 +00:00
behlendo
a1502d76ae - Remove hash functionality from slab in favor of direct lookups
based of the spl_kmem_obj_t tacked on the end of each object.
  This actually isn't so back because we are now allocing large
  chunks for the slab and partitioning it ourselves.  So there's
  not a ton of wasted space.  We may suffer a performance hit
  however due to alignment issues.

- Remove remaining depenancies on the linux slab implementation.
  We're standing on our own now for better or worse.

- Rework slabs to be either kmem or vmem based.  If neither
  KMC_VMEM of KMC_KMEM are specified we make a decent guess
  about what will work best for their based on the object 
  size.  Additionally we provide a kmem_virt() function caller
  can use to see if they have a virtual or physical address.

- Minor fixups in the test suite.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@141 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-07-01 03:28:54 +00:00
behlendo
1c3832576d Remove stray call to spl_cache_free() and remove all the
cycle count which was costing me overhead.  It was hurting
performance pretty badly for heavily used caches.  I'm also
thinking the hash may be hurting me as well and it might
be worth sticking a pointer in to a little space after the
alloced object.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@140 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-06-28 20:03:11 +00:00
behlendo
fece7c99bf Victory! I've reworked caches with large objects which are
based by vmalloc()'ed memory.  I now alloc a slab which is
roughly 32*spl_obj_size and in this block of memory I place
the slab descriptor, slab object descriptors, and objects
themselves.  This greatly reduces vmalloc lock contention.

Still some minor cleanup remains and fine tuning but
it's working pretty well.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@139 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-06-28 05:04:46 +00:00
behlendo
ff449ac406 Further slab improvements, I'm getting close to something which works
well for the expected workloads.  Improvement in this commit include:

- Added DEBUG_KMEM_TRACKING #define which can optionally be set
  when DEBUG_KMEM is defined to do per allocation tracking.  This
  allows us to get all the lightweight kmem debugging enabled by
  default which is pretty light weight, and only when looking 
  for a memory leak we can briefly enable the per alloc tracking.

- Added set_normalized_timespec() in to SPL to simply using
  the timespec() primatives from within a module.

- Added per-spinlock cycle counters to the slab in an attempt
  to run down a lock contention issue.  The contended lock 
  was in vmalloc() but I'm going to leave the cycle counters
  in place for a little while until I'm convinced there arn't
  other locking improvement possible in the slab.

- Added a proc interface to the slab to export per slab
  cache statistics to /proc/spl/kmem/slab for analysis.

- Reworked spl_slab_alloc() function to allocate from kmem for
  small allocation and vmem for large allocations.  This improved
  things considerably but futher work is needed.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@138 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-06-27 21:40:11 +00:00
behlendo
e9d7a2bef5 Fix for memory corruption caused by overruning the magazine
when repopulating it.  Plus I fixed a few more suble races in
that part of the code which were catching me.  Finally I fixed
a small race in kmem_test8.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@137 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-06-26 19:49:42 +00:00
behlendo
4afaaefa05 Implement per-cpu local caches. This seems to have bough me another
factor of 10x improvement on SMP system due to reduced lock contention.
This may put me in the ballpark of what is needed.  We can still further
improve things on NUMA systems by creating an additional L3 cache per 
memory node instead of the current global pool.  With luck this won't
be needed.  I should also take another look at the locking now that
everything is working.  There's a good chance I can tighten it up a
little bit and improve things a little more.

   kmem_lock: time (sec)        slabs           objs            hash
   kmem_lock:                   tot/max/calc    tot/max/calc    size/depth
   kmem_lock:  0.000999926      6/6/1           192/192/32      32768/0
   kmem_lock:  0.000999926      4/4/2           128/128/64      32768/0
   kmem_lock:  0.000999926      4/4/4           128/128/128     32768/0
   kmem_lock:  0.000999926      4/4/8           128/128/256     32768/0
   kmem_lock:  0.000999926      4/4/16          128/128/512     32768/0
   kmem_lock:  0.000999926      4/4/32          128/128/1024    32768/0
   kmem_lock:  0.000999926      4/4/64          128/128/2048    32768/0
   kmem_lock:  0.000999926      8/8/128         256/256/4096    32768/0
   kmem_lock:  0.003999704      24/23/256       768/736/8192    32768/1
   kmem_lock:  0.012999038      44/41/512       1408/1312/16384 32768/1
   kmem_lock:  0.051996153      96/93/1024      3072/2976/32768 32768/2
   kmem_lock:  0.181986536      187/184/2048    5984/5888/65536 32768/3
   kmem_lock:  0.655951469      342/339/4096    10944/10848/131072 32768/4



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@136 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-06-25 20:57:45 +00:00
behlendo
d46630e0f3 The first locking issue was due to the semaphore I used. I was trying
to be overly clever and the context switch when the semaphore was busy
was destroying performance.  Converting to a simple spin lock bough me
a factor of 50 or so.  That said it's still not good enough.  Tests
show bad performance and we are still CPU bound.  The logical fix is
I need to implement per-cpu hot caches to minimize the SMP contention.
Linux and Solaris both have this, I was hoping to do without but it
looks like that's not to be.

   kmem_lock: time (sec)        slabs           objs            hash
   kmem_lock:                   tot/max/calc    tot/max/calc    size/depth
   kmem_lock:  0.022000000      7/6/64  224/177/2048    32768/1
   kmem_lock:  0.039000000      13/13/128       416/404/4096    32768/1
   kmem_lock:  0.079000000      23/21/256       736/672/8192    32768/1
   kmem_lock:  0.158000000      48/47/512       1536/1504/16384 32768/1
   kmem_lock:  0.345000000      105/105/1024    3360/3358/32768 32768/2
   kmem_lock:  0.760000000      202/200/2048    6464/6400/65536 32768/3



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@135 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-06-24 17:18:15 +00:00
behlendo
5cbd57fa91 Fix minor chaos/fc9 kernel discrepencies in build
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@133 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-06-13 23:56:26 +00:00
behlendo
2fb9b26a85 * : modules/sys/kmem-slab.c : Re-implemented the slab to no
longer be based on the linux slab but to be its own complete
implementation.  The new slab behaves much more like the
Solaris slab than the Linux slab.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@132 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-06-13 23:41:06 +00:00
behlendo
41cf38df92 Add missing () to quiet warnings in NDEBUG case
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@128 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-06-04 22:52:13 +00:00
behlendo
475cdc788e Just use CONFIG_SLUB to detect SLUB use
Add ASSERTF to the NDEBUG build
Fix minor issue with various debug build flags



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@126 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-06-04 21:09:25 +00:00
behlendo
c30df9c863 Fixes:
1) Ensure mutex_init() never fails in the case of ENOMEM by retrying
   forever.  I don't think I've ever seen this happen but it was clear
   after code inspection that if it did we would immediately crash.

2) Enable full debugging in check.sh for sanity tests.  Might as well
   get as much debug as we can in the case of a failure.

3) Reworked list of kmem caches tracked by SPL in to a hash with the
   key based on the address of the kmem_cache_t.  This should speed
   up the constructor/destructor/shrinker lookup needed now for newer
   kernel which removed the destructor support.

4) Updated kmem_cache_create to handle the case where CONFIG_SLUB
   is defined.  The slub would occasionally merge slab caches which
   resulted in non-unique keys for our hash lookup in 3).  To fix this
   we detect if the slub is enabled and then set the needed flag
   to prevent this merging from ever occuring.

5) New kernels removed the proc_dir_entry pointer from items
   registered by sysctl.  This means we can no long be sneaky and
   manually insert things in to the sysctl tree simply by walking
   the proc tree.  So I'm forced to create a seperate tree for
   all the things I can't easily support via sysctl interface.
   I don't like it but it will do for now.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@124 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-06-04 06:00:46 +00:00
behlendo
691d2bd733 Update utsname to use proper compatible interface to avoid API issues.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@123 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-06-03 21:20:18 +00:00
behlendo
57d862349b Breaking the world for a little bit. If anyone is going to continue
working on this branch for the next few days I suggested you work
off of the 0.3.1 tag.  The following changes are fairly extensive
and are designed to make the SPL compatible with all kernels in
the range of 2.6.18-2.6.25.  There were 13 relevant API changes
between these releases and I have added the needed autoconf tests
to check for them.  However, this has not all been tested extensively.
I'll sort of the breakage on Fedora Core 9 and RHEL5 this week.

SPL_AC_TYPE_UINTPTR_T
SPL_AC_TYPE_KMEM_CACHE_T
SPL_AC_KMEM_CACHE_DESTROY_INT
SPL_AC_ATOMIC_PANIC_NOTIFIER
SPL_AC_3ARGS_INIT_WORK
SPL_AC_2ARGS_REGISTER_SYSCTL
SPL_AC_KMEM_CACHE_T
SPL_AC_KMEM_CACHE_CREATE_DTOR
SPL_AC_3ARG_KMEM_CACHE_CREATE_CTOR
SPL_AC_SET_SHRINKER
SPL_AC_PATH_IN_NAMEIDATA
SPL_AC_TASK_CURR
SPL_AC_CTL_UNNUMBERED



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@119 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-06-02 17:28:49 +00:00
behlendo
715f625146 Go through and add a header with the proper UCRL number.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@114 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-05-26 04:38:26 +00:00
behlendo
cc7449ccd6 - Properly fix the debug support for all the ASSERT's, VERIFIES, etc can be
compiled out when doing performance runs.
- Bite the bullet and fully autoconfize the debug options in the configure
  time parameters.  By default all the debug support is disable in the core
  SPL build, but available to modules which enable it when building against
  the SPL.  To enable particular SPL debug support use the follow configure
  options:

  --enable-debug		Internal ASSERTs
  --enable-debug-kmem		Detailed memory accounting
  --enable-debug-mutex		Detailed mutex tracking
  --enable-debug_kstat          Kstat info exported to /proc
  --enable-debug-callb		Additional callb debug



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@111 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-05-19 02:49:12 +00:00
behlendo
6ab69573ff SPL additions to increase support for updated ZFS build
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@110 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-05-15 23:39:19 +00:00
behlendo
4efd41189a Rework condition variable implementation to be consistent with
other primitive implementations.  Additionally ensure that GFP_ATOMIC
is use for allocations when in interrupt context.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@108 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-05-15 17:10:30 +00:00
behlendo
8464443f8d Add a comment so I remember to fix this.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@106 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-05-12 16:53:41 +00:00
behlendo
c6dc93d6a8 By default disable extra KMEM and MUTEX debugging to aid performance.
They can easily be re-enabled when new stability issues are uncovered.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@105 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-05-09 22:53:20 +00:00
behlendo
5c2bb9b2c3 Stability hack. Under Solaris when KM_SLEEP is set kmem_cache_alloc()
may not fail.  To get this behavior I'd added a retry to the shim layer
even though it is abusive to the VM, at least it should prevent the crash.
Additionally I added a proc counter so I can easily check how often this
is happening.  It should be fairly rare, but likely will get worse and
worse the longer the machine has been up.


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@104 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-05-09 21:21:33 +00:00
behlendo
04a479f706 Add an almost feature complete implemenation of kstat. I chose
not to support a few flags (we assert if they are used), and I
did not add the libkstat interface and instead exported everything
to proc for easy access.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@103 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-05-08 23:21:47 +00:00
behlendo
427a782d7d Decrease of kmem warnign threshold back to 2 pages, no worse than a stack.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@100 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-05-07 19:33:01 +00:00
behlendo
13cdca65ec Add vmem memory accounting
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@99 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-05-07 18:54:32 +00:00
behlendo
404992e31a - Relocate 'stats_per' in to proper /proc/sys/spl/mutex/ directory
- Shift to spinlock for mutex list addition and removal



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@98 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-05-07 17:58:22 +00:00
behlendo
4f86a887d8 Remaining issues fixed after reenabled mutex debugging.
- Ensure the mutex_stats_sem and mutex_stats_list are initialized
- Only spin if you have to in mutex_init



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@97 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-05-06 23:19:27 +00:00
behlendo
e8b31e8482 - Updated rwlock's to reside in a .c file instead of a static inline
- Updated rwlock's so they can be safely initialized in ctors.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@96 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-05-06 23:00:49 +00:00
behlendo
d6a26c6a32 Lots of fixes here:
- Detailed kmem memory allocation tracking.  We can now get on
  spl module unload a list of all memory allocations which were
  not free'd and where the original alloc was.  E.g.

SPL: 15554:632:(spl-kmem.c:442:kmem_fini()) kmem leaked 90/319332 bytes
SPL: 15554:648:(spl-kmem.c:451:kmem_fini()) address          size  data             func:line
SPL: 15554:648:(spl-kmem.c:457:kmem_fini()) ffff8100734b68b8 32    0100000001005a5a __spl_mutex_init:70
SPL: 15554:648:(spl-kmem.c:457:kmem_fini()) ffff8100734b6148 13    &tl->tl_lock     __spl_mutex_init:74
SPL: 15554:648:(spl-kmem.c:457:kmem_fini()) ffff81007ac43730 32    0100000001005a5a __spl_mutex_init:70
SPL: 15554:648:(spl-kmem.c:457:kmem_fini()) ffff81007ac437d8 13    &tl->tl_lock     __spl_mutex_init:74

- Shift to using rwsems in kmem implmentation, to simply locking and
  improve concurency.

- Shift to using rwsems in mutex implementation, additionally ensure we
  never sleep in the init function if non-zero preempt_count or 
  interrupts are disabled as can happen in a slab cache ctor/dtor.

- Other minor formating fixes and such.

TODO:

- Finish the vmem memory allocation tracking

- Vet all other SPL primatives for potential sleeping during *_init.  I
suspect the rwlock implemenation does this and should be fixes just
like the mutex implemenation.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@95 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-05-06 20:38:28 +00:00
behlendo
9ab1ac14ad Commit adaptive mutexes. This seems to have introduced some new
crashes but it's not clear to me yet if these are a problem with
the mutex implementation or ZFSs usage of it.

Minor taskq fixes to add new tasks to the end of the pending list.

Minor enhansements to the debug infrastructure.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@94 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-05-05 20:18:49 +00:00
behlendo
bcd68186d8 New an improved taskq implementation for the SPL. It allows a
configurable number of threads like the Solaris version and almost
all of the options are supported.  Unfortunately, it appears to have
made absolutely no difference to our performance numbers.  I need
to keep looking for where we are bottle necking.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@93 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-25 22:10:47 +00:00
behlendo
839d8b438e Update kmem.h to properly use new debug subsystem.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@92 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-24 20:21:07 +00:00
behlendo
3561541c24 Prep for 0.2.1 tag
Minor fixes to headers to use debug macros
Added /proc/sys/spl/version



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@90 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-24 17:41:23 +00:00
wartens2
8100fe56f1 Make sure that when calling __vmem_alloc that we
do not have __GFP_ZERO set.  Once the memory is allocated
then zero out the memory if __GFP_ZERO is passed to
__vmem_alloc.


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@88 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-24 17:07:56 +00:00
behlendo
6e605b6e58 Minor improvement to taskq handling. This is a small step towards
dynamic taskqs which still need to be fully implemented.


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@87 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-23 21:19:47 +00:00
behlendo
b831734a43 Stack usage is my enemy. Trade cpu cycles in the debug code to
ensure I never add anything to the stack I don't absolutely need.
All this debug code could be removed from a production build
anyway so I'm not so worried about the performance impact.  We
may also consider revisting the mutex and condvar implementation
to ensure no additional stack is used there.

Initial indications are I have reduced the worst case stack
usage to 9080 bytes.  Still to large for the default 8k stacks
so I have been forced to run with 16k stacks until I can
reduce the worst offenders.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@83 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-22 16:55:26 +00:00
behlendo
7fea96c04f More fixes to ensure we get good debug logs even if we're in the
process of destroying the stacks.  Threshhold set fairly aggressively
top 80% of stack usage.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@82 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-21 22:44:11 +00:00
behlendo
892d51061e Handful of minor stack checking fixes
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@79 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-21 18:08:33 +00:00
behlendo
937879f11d Update SPL to use new debug infrastructure. This means:
- Replacing all BUG_ON()'s with proper ASSERT()'s
- Using ENTRY,EXIT,GOTO, and RETURN macro to instument call paths



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@78 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-21 17:29:47 +00:00
behlendo
2fae1b3d0a Frist minor batch of fixes. Catch a dropped ;, and use SBUG instead of BUG.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@77 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-19 00:02:11 +00:00
behlendo
ce86265693 Whoops need this!
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@76 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-18 23:42:45 +00:00
behlendo
57d1b18858 First commit of lustre style internal debug support. These
changes bring over everything lustre had for debugging with
two exceptions.  I dropped by the debug daemon and upcalls
just because it made things a little easier.  They can be
readded easily enough if we feel they are needed.

Everything compiles and seems to work on first inspection
but I suspect there are a handful of issues still lingering
which I'll be sorting out right away.  I just wanted to get
all these changes commited and safe.  I'm getting a little
paranoid about losing them.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@75 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-18 23:39:58 +00:00
behlendo
d61e12af5a - Add some spinlocks to cover all the private data in the mutex. I don't
think this should fix anything but it's a good idea regardless.

- Drop the lock before calling the construct/destructor for the slab
otherwise we can't sleep in a constructor/destructor and for long running
functions we may NMI.

- Do something braindead, but safe for the console debug logs for now.


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@73 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-15 20:53:36 +00:00
behlendo
c5fd77fcbf Just cleanup up an error case to avoid overspamming the console.
We get the stack once from the BUG() no reason to dump it twice.


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@72 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-14 18:37:20 +00:00
behlendo
12ea923056 Adjust the condition variables to simply sleep uninteruptibly.
This way we don't have to contend with superious wakeups which
it appears ZFS is not so careful to handle anyway.  So this is
probably for the best.


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@70 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-11 22:49:48 +00:00
behlendo
115aed0dd8 - Add more strict in_atomic() checking to the mutex entry
function just to be extra safety and paranoid.

- Rewrite the thread shim to take full advantage of the
new kernel kthread API.  This greatly simplifies things.

- Add a new regression test for thread_exit() to ensure
it properly terminates a thread immediately without
allowing futher execution of the thread.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@69 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-11 17:03:57 +00:00
behlendo
79f92663e3 Fix race in rwlock implementation which can occur when
your task is rescheduled to a different cpu after you've
taken the lock but before calling RW_LOCK_HELD is called.
We need the spinlock to ensure there is a wmb() there.


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@68 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-07 23:54:34 +00:00
behlendo
968eccd1d1 Update the thread shim to use the current kernel threading API.
We need to use kthread_create() here for a few reasons.  First
off to old kernel_thread() API functioin will be going away.
Secondly, and more importantly if I use kthread_create() we can
then properly implement a thread_exit() function which terminates
the kernel thread at any point with do_exit().  This fixes our
cleanup bug which was caused by dropping a mutex twice after
thread_exit() didn't really exit.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@66 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-04 04:44:16 +00:00
behlendo
996faa6869 Correctly implement atomic_cas_ptr() function. Ideally all of these
atomic operations will be rewritten anyway with the correct arch
specific assembly.  But not today.


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@65 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-03 21:48:57 +00:00
behlendo
0a6fd143fd - Remapped ldi_handle_t to struct block_device * which is much more useful
- Added liunx block device headers to sunldi.h
- Made __taskq_dispatch safe for interrupt context where it turns out we
  need to be useing it.
- Fixed NULL const/dest bug for kmem slab caches
- Places debug __dprintf debugging messages under a spin_lock_irqsave
  so it's safe to use then in interrupt handlers.  For debugging only!



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@64 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-03 16:33:31 +00:00
behlendo
0998fdd6db Apparently it's OK for done to be NULL, which was not clear in the
Solaris man page.  Anyway, since apparently this usage is accectable
I've updated the function to handle it.


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@63 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-01 17:00:06 +00:00
behlendo
6a585c61de Double large kmalloc warning size to 4 pages. It was 2 pages, and ideally
it should be dropped to one page but in the short term we should be able
to easily live with 4 page allocations.

Fix the nvlist bug, it turns out the user space side of things were
packing the nvlists correctly as little endian, and the kernel space
side of things due to a missing #define were unpacking them as big endian.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@62 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-01 16:09:18 +00:00
behlendo
4fd2f7eea5 Add vmem_zalloc support.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@60 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-31 23:04:07 +00:00
behlendo
8d0f1ee907 Add some crude debugging support. It leaves alot to be
desired, but it should allow more easy kernel debugging for now.


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@59 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-31 20:42:36 +00:00
behlendo
9f4c835a0e Correctly functioning 64-bit atomic shim layer. It's not
what I would call effecient but it does have the advantage
of being correct which is all I need right now.  I added
a regression test as well.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@57 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-28 18:21:09 +00:00
behlendo
4a4295b267 Remove minor lingering CDDL tait of copied headers. Required
headers rewritten to include minimally what we need.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@56 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-27 23:40:09 +00:00
behlendo
d429b03d85 - Thinko fix to the SPL module interface
- Enhanse the VERIFY() support to output the values which
  failed to compare as expected before crashing.  This make
  debugging much much much easier.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@55 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-27 22:06:59 +00:00
behlendo
8ac547ec4c Relocated to zfs repo
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@54 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-27 17:58:10 +00:00
behlendo
336bb0c0c1 Two fixes to the module interface. Could be worse!
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@53 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-21 19:16:25 +00:00
behlendo
4e62fd4104 OK, a first reasonable attempt at a solaris module/chdev shim layer.
This should handle the absolute minimum I need for ZFS.  It will 
register the chdev with the right callbacks.  Then the generic 
registered linux callback will find the right registered solaris
callback for the function and munge the args just right before
passing it on.  Should work, but untested (just compiled), so I
expect bugs.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@52 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-20 23:30:15 +00:00
behlendo
e4f1d29f89 OK, some pretty substantial rework here. I've merged the spl-file
stuff which only inclused the getf()/releasef() in to the vnode area
where it will only really be used.  These calls allow a user to
grab an open file struct given only the known open fd for a particular
user context.  ZFS makes use of these, but they're a bit tricky to
test from within the kernel since you already need the file open
and know the fd.  So we basically spook the system calls to setup
the environment we need for the splat test case and verify given
just the know fd we can get the file, create the needed vnode, and
then use the vnode interface as usual to read and write from it.

While I was hacking away I also noticed a NULL termination issue
in the second kobj test case so I fixed that too.  In fact, I fixed
a few other things as well but all for the best!



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@51 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-18 23:20:30 +00:00
behlendo
5d86345d37 Initial pass at a file API getf/releasef hooks
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@50 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-18 04:56:43 +00:00
behlendo
1ec74a114c Minimal signal handling interface.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@49 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-17 18:29:57 +00:00
behlendo
2bdb28fbe0 Missing headers, more minor fixes
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@48 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-15 00:05:38 +00:00
behlendo
c19c06f3b0 Fix kmem memory accounting
Adjust kmem slab interface to make a copy of the slab name before
passing it on to the linux slab (we free it latter too)



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@47 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-14 20:56:26 +00:00
behlendo
79b31f3601 Fix KMEM_DEBUG support (enable by default)
Add vmem_alloc/vmem_free support (and test case)
Add missing time functions



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@46 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-14 19:04:41 +00:00
behlendo
af828292e5 Add missing headers
Rework vnodes to be based on the slab cache, just like on Solaris.


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@45 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-14 00:04:01 +00:00
behlendo
ea19fbed05 Add missing headers
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@44 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-13 22:52:23 +00:00
behlendo
8ddd0ee415 Add two more missing headers
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@43 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-13 20:41:29 +00:00
behlendo
73e540a0d1 Drop unicode support, provided in ZFS tree libport
Update uio support


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@42 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-13 19:49:09 +00:00
behlendo
36e6f86146 - Add some more missing headers
- Map the LE/BE_* byteorder macros to the linux versions
- More minor vnodes fixes


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@41 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-12 23:48:28 +00:00
behlendo
2f5d55aac5 Add copyin/copyout mapping
Fix some vnode types



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@40 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-12 21:33:28 +00:00
behlendo
4b17158506 - Implemented vnode interfaces and 6 test cases to the test suite.
- Re-implmented kobj support based on the vnode support.
- Add TESTS option to check.sh, and removed delay after module load.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@39 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-12 20:52:46 +00:00
behlendo
9490c14835 Apply fix from bug239 for rwlock deadlock.
Update check.sh script to take V=1 env var so you can run it verbosely as
follows if your chasing something: sudo make check V=1

Add new kobj api and needed regression tests to allow reading of files from
within the kernel.  Normally thats not something I support but the spa layer
needs the support for its config file.

Add some more missing stub headers



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@38 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-11 20:54:40 +00:00
behlendo
b123971fc2 Two more GPL only symbols moved to helper functions in the spl module.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@37 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-11 02:08:57 +00:00
behlendo
ee4766827a Remap gethrestime() with #define to new symbol and export that new
symbol to avoid direct use of GPL only symbol.


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@36 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-10 21:38:39 +00:00
behlendo
51f443a074 Add some typedefs to make it clearer when we passing a function,
Add fm_panic define
Add another bad atomic hack (need to do this right)


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@35 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-10 19:25:20 +00:00
behlendo
4098c921b6 Fix systemic naming mistake
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@34 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-10 19:04:14 +00:00
behlendo
6adf99e7d6 Add missing headers
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@33 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-10 17:05:34 +00:00
behlendo
12472b242d Just filling in more of the env.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@32 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-08 00:58:32 +00:00
behlendo
05ae387b50 Add somre debugging support
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@31 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-08 00:18:21 +00:00
behlendo
0b3cf046cb Add the initial vestigates of vnode support
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@30 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-07 23:07:02 +00:00
behlendo
3b3ba48fe9 Add missing cred.h functions
Resolve compiler warning with kmem_free (unused len)
Add stub for byteorder.h
Add zlib shim for compress2 and uncompress functions



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@29 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-07 20:48:44 +00:00
behlendo
b0dd3380aa Minor atomic cleanup, this needs to be done right.
Fixed a bug in the timer code
Added missing macros



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@28 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-07 00:28:32 +00:00
behlendo
ed61a7d05e Add some missing rw_lock symbols
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@27 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-06 23:42:37 +00:00
behlendo
77b1fe8fa8 Add highbit func,
Add sloopy atomic declaration which will need to be fixed (eventually)
Fill out more of the Solaris VM hooks
Adjust the create_thread function



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@26 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-06 23:12:55 +00:00
behlendo
a713518f5d Checkpoint for the night,
added a few more stub headers,
fleshed out a few stub headers,
added a FIXME file,
added various compatibility macros


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@25 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-05 00:58:54 +00:00
behlendo
23f28c4f75 Remove spl.h, just include the headers you need.
Add a few more stubs.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@24 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-04 20:06:29 +00:00
behlendo
48f940b943 Fix type
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@23 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-04 19:38:27 +00:00
behlendo
14c5326ccd More stub headers,
moved generic to sysmacros,
added some more macros for kernel compatibility


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@22 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-04 18:22:31 +00:00
behlendo
dbb484ec60 Stub out some missing headers which are expected. I'll fill
in what the contents need to be as I encounter the warnings
about missing prototypes, symbols, and such.


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@21 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-01 18:30:12 +00:00
behlendo
ea70970ff5 Almost dropped this!
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@19 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-01 00:47:23 +00:00
behlendo
f4b377415b Reorganize /include/ to add a /sys/, this way we don't need to
muck with #includes in existing Solaris style source to get it
to find the right stuff.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@18 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-01 00:45:59 +00:00
behlendo
09b414e880 Minor nit, SOLARIS should be SPL
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@17 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-02-28 00:52:31 +00:00
behlendo
596e65b4e8 OK, I think this is the last of major cleanup and restructuring.
We've dropped all the linux- prefixes on the file in favor of spl-
which makes more sense.  And we've cleaned up some of the includes
so everybody should be including their own dependencies properly.
All a module which wants to use the spl support needs to do in
include spl.h and ensure it has access to Module.symvers.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@16 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-02-28 00:48:31 +00:00
behlendo
7c50328b40 More cleanup.
- Removed all references to kzt and replaced with splat
- Moved portions of include files which do not need to be
  available to all source files in to local.h files in 
  proper source subdirs.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@14 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-02-27 23:42:31 +00:00
behlendo
f1b59d2620 Lots of build fixes. This is turning out to be a very good
idea since it forcefully codifing the ABI.  Since the shim
layer is no longer linked at build time in to the test suite
we can;'t cut any corners and get away with it.

Everything is working now with the exception of sorting
setting Module.symvers properly.  This may take a little
Makefile reorg.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@5 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-02-27 19:09:51 +00:00
behlendo
3d4ea0ced6 More build fixes, I have the kernel module almost building and its
feeling a lot more sane, cleaner, and linuxy.  I may finish this tonight.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@4 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-02-27 00:59:48 +00:00
behlendo
564f6d1509 User space build fixes:
- Add list handling compatibility library
- Drop uu_* list handling in favor of local list implementation
- libtoolize
- generic makefile cleanup



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@3 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-02-26 23:20:41 +00:00
behlendo
f1ca4da6f7 Initial commit. All spl source written up to this point wrapped
in an initial reasonable autoconf style build system.  This does
not yet build but the configure system does appear to work properly
and integrate with the kernel.  Hopefully the next commit gets
us back to a buildable version we can run the test suite against.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@1 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-02-26 20:36:04 +00:00