Commit Graph

277 Commits

Author SHA1 Message Date
Chris Dunlop
791dc876eb Allow 64-bit timestamps to be set on 64-bit kernels
ZFS and 64-bit linux are perfectly capable of dealing with 64-bit
timestamps, but ZFS deliberately prevents setting them.  Adjust
the SPL such that TIMESPEC_OVERFLOW will not always assume 32-bit
values and instead use the correct values for your kernel build.
This effectively allows 64-bit timestamps on 64-bit systems.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes ZFS issue #487
2011-12-12 11:06:03 -08:00
Brian Behlendorf
1114ae6ae7 Prepend spl_ to all init/fini functions
This is a bit of cleanup I'd been meaning to get to for a while
to reduce the chance of a type conflict.  Well that conflict
finally occurred with the kstat_init() function which conflicts
with a function in the 2.6.32-6-pve kernel.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #56
2011-11-11 09:18:28 -08:00
Brian Behlendorf
12ff95ff57 Linux 3.1 compat, kern_path_parent()
Prior to Linux 3.1 the kern_path_parent symbol was exported for
use by kernel modules.  As of Linux 3.1 it is now longer easily
available.  To handle this case the spl will now dynamically
look up address of the missing symbol at module load time.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #52
2011-11-09 16:51:25 -08:00
Gunnar Beutner
f5e76dea03 Cleaned up MUTEX() #define
The old define assumed a specific layout of the kmutex_t struct. This
patch makes the macro independent from the actual struct layout.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-10-19 09:59:32 -07:00
Gunnar Beutner
66cdc93b8c Remove the spinlocks for mutex_enter()/mutex_exit()
The m_owner variable is protected by the mutex itself. Reading the variable
is guaranteed to be atomic (due to it being a word-sized reference) and
ACCESS_ONCE() takes care of read cache effects.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-10-19 09:58:57 -07:00
Gunnar Beutner
3160d4f56b Fix race condition in mutex_exit()
On kernels with CONFIG_DEBUG_MUTEXES mutex_exit() clears the mutex
owner after releasing the mutex. This would cause mutex_owner()
to return an incorrect owner if another thread managed to lock the
mutex before mutex_exit() had a chance to clear the owner.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes ZFS issue #167
2011-10-19 09:58:41 -07:00
Gunnar Beutner
763b2f3b57 Fixed invalid resource re-use in file_find()
File descriptors are a per-process resource. The same descriptor
in different processes can refer to different files. find_file()
incorrectly assumed that file descriptors are globally unique.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes ZFS issue #386
2011-10-11 09:51:51 -07:00
Brian Behlendorf
86fd39f354 Linux 2.6.39 compat, mutex owner
Prior to Linux 2.6.39 when CONFIG_DEBUG_MUTEXES was defined
the kernel stored a thread_info pointer as the mutex owner.
From this you could get the pointer of the current task_struct
to compare with get_current().

As of Linux 2.6.39 this behavior has changed and now the mutex
stores a pointer to the task_struct.  This commit detects the
type of pointer stored in the mutex and adjusts the mutex_owner()
and mutex_owned() functions to perform the correct comparision.
2011-06-24 13:00:08 -07:00
Darik Horn
0d54dcb566 Read the /etc/hostid file directly.
Deprecate the /usr/bin/hostid call by reading the /etc/hostid file
directly. Add the spl_hostid_path parameter to override the default
/etc/hostid path.

Rename the set_hostid() function to hostid_exec() to better reflect
actual behavior and complement the new hostid_read() function.

Use HW_INVALID_HOSTID as the spl_hostid sentinel value because
zero seems to be a valid gethostid() result on Linux.
2011-06-24 09:58:03 -07:00
Brian Behlendorf
372c257233 Add TASKQ_NORECLAIM flag
It has become necessary to be able to optionally disable
direct memory reclaim for certain taskqs.  To support
this the TASKQ_NORECLAIM flags has been added which sets
the PF_MEMALLOC bit for all threads in the taskq.
2011-05-06 15:23:58 -07:00
Brian Behlendorf
c1f95c2b94 Correct MAXUID
The uid_t on most systems is in fact and unsigned 32-bit value.
This is almost always correct, however you could compile your
kernel to use an unsigned 16-bit value for uid_t.  In practice
I've never encountered a distribution which does this so I'm
willing to overlook this corner case for now.
2011-04-29 13:58:45 -07:00
Gunnar Beutner
9d4b7c17a0 Renamed 'struct fid' for NFS
Renamed 'struct fid' because its name conflicts with another
struct in the Linux kernel headers.  The fid_t typedef remains
unchanged intentionally.
2011-04-29 12:10:54 -07:00
Brian Behlendorf
d837ae395b Fix 32-bit MAXOFFSET_T definition
The correct definition of MAXOFFSET_T under Solaris is in reality
tied to the maximum size of a 'long long' type.  With this in mind
MAXOFFSET_T is now defined as LLONG_MAX which ensures the correct
value is used on both 32-bit and 64-bit systems.
2011-04-22 16:17:13 -07:00
Darik Horn
fa6f7d8f9d Import spl_hostid as a module parameter.
Provide a call_usermodehelper() alternative by letting the hostid be passed as
a module parameter like this:

  $ modprobe spl spl_hostid=0x12345678

Internally change the spl_hostid variable to unsigned long because that is the
type that the coreutils /usr/bin/hostid returns.

Move the hostid command into GET_HOSTID_CMD for consistency with the similar
GET_KALLSYMS_ADDR_CMD invocation.

Use argv[0] instead of sh_path for consistency internally and with other Linux
drivers.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-04-21 09:41:01 -07:00
Brian Behlendorf
3dfc591ac4 Linux 2.6.39 compat, zlib_deflate_workspacesize()
The function zlib_deflate_workspacesize() now take 2 arguments.
This was done to avoid always having to allocate the maximum size
workspace (268K).  The caller can now specific the windowBits and
memLevel compression parameters to get a smaller workspace.

For our purposes we introduce a spl_zlib_deflate_workspacesize()
wrapper which accepts both arguments.  When the two argument
version of zlib_deflate_workspacesize() is available the arguments
are passed through.  When it's not we assume the worst case and
a maximally sized workspace is used.
2011-04-20 14:39:15 -07:00
Brian Behlendorf
e76f4bf11d Add dnlc_reduce_cache() support
Provide the dnlc_reduce_cache() function which attempts to prune
cached entries from the dcache and icache.  After the entries are
pruned any slabs which they may have been using are reaped.

Note the API takes a reclaim percentage but we don't have easy
access to the total number of cache entries to calculate the
reclaim count.  However, in practice this doesn't need to be
exactly correct.  We simply need to reclaim some useful fraction
(but not all) of the cache.  The caller can determine if more
needs to be done.
2011-04-06 20:06:03 -07:00
Brian Behlendorf
83150861e6 Decrease target objects per slab
By decreasing the number of target objects per slab we increase
the likelyhood that a slab can be freed.  This reduces the level
of fragmentation in the slab which has been observed to be a
problem for certain workloads.  The penalty for this is that we
also decrease the speed which need objects can be allocated.
2011-04-06 20:06:03 -07:00
Brian Behlendorf
3336e29cc2 Add slab usage summeries to /proc
One of the most common things you want to know when looking at
the slab is how much memory is being used.  This information was
available in /proc/spl/kmem/slab but only on a per-slab basis.
This commit adds the following /proc/sys/kernel/spl/kmem/slab*
entries to make total slab usage easily available at a glance.

  slab_kmem_total - Total kmem slab size
  slab_kmem_avail - Alloc'd kmem slab size
  slab_kmem_max   - Max observed kmem slab size
  slab_vmem_total - Total vmem slab size
  slab_vmem_avail - Alloc'd vmem slab size
  slab_vmem_max   - Max observed vmem slab size

NOTE: The slab_*_max values are expected to over report because
they show maximum values since boot, not current values.
2011-04-06 20:06:03 -07:00
Brian Behlendorf
91cb1d91a4 Add .va_dentry helper
While this extra structure memory does not exist under Solaris
it is needed under Linux to pass the dentry.  This allows the
dentry to be easily instantiated before the inode is unlocked.
2011-04-06 20:06:03 -07:00
Brian Behlendorf
734fcac78d Add crgetfsuid()/crgetfsgid() helpers
Solaris credentials don't have an fsuid/fsguid field but Linux
credentials do.  To handle this case the Solaris API is being
modestly extended to include the crgetfsuid()/crgetfsgid()
helper functions.

Addititionally, because the crget*() helpers are implemented
identically regardless of HAVE_CRED_STRUCT they have been
moved outside the #ifdef to common code.  This simplification
means we only have one version of the helper to keep to to date.
2011-03-22 12:18:44 -07:00
Brian Behlendorf
cb255ae572 Remove default GFP_NOFS allocations
As originally described in commit 82b8c8fa64
this was done to prevent certain deadlocks from occuring in the system.
However, as suspected the price for doing this proved to be too high.
The VM is having a hard time effectively reclaiming memory thus we are
reverting this change.

However, we still need to fundamentally handle the issue.  Under
Solaris the KM_PUSHPAGE mask is used commonly in I/O paths to ensure
a memory allocations will succeed.  We leverage this fact and redefine
KM_PUSHPAGE to include GFP_NOFS.  This ensures that in these common
I/O path we don't trigger additional reclaim.  This minimizes the
change to the Solaris code.
2011-03-19 14:50:39 -07:00
Brian Behlendorf
6788762766 Linux 2.6.31 compat, include linux/seq_file.h
Explicitly include the linux/seq_file.h header in vfs.h.  This header
is required for the sequence handlers and is included indirectly in
newer kernels.
2011-03-07 13:52:00 -08:00
Brian Behlendorf
47995fa691 Remove xvattr support
The xvattr support in the spl has always simply consisted of
defining a couple structures and a few #defines.  This was enough
to enable compilation of code which just passed xvattr types
around but not enough to effectively manipulate them.

This change removes even this minimal support leaving it up
to packages which leverage the spl to prove the full xvattr
support.  By removing it from the spl we ensure not conflict
with the higher level packages.

This just leaves minimal vnode support for basical manipulation
of files.  This code is does have the proper support functions
in the spl and a set of regression tests.

Additionally, this change removed the unused 'caller_context_t *'
type and replaces it with a 'void *'.
2011-03-02 11:34:46 -08:00
Brian Behlendorf
a4a1e1ecb4 Add TIMESPEC_OVERFLOW helper
Add the TIMESPEC_OVERFLOW helper macro to allow easy checking
of timespec overflow.
2011-03-02 11:34:43 -08:00
Brian Behlendorf
5c1967ebe2 Fix zlib compression
While portions of the code needed to support z_compress_level() and
z_uncompress() where in place.  In reality the current implementation
was non-functional, it just was compilable.

The critical missing component was to setup a workspace for the
compress/uncompress stream structures to use.  A kmem_cache was
added for the workspace area because we require a large chunk
of memory.  This avoids to need to continually alloc/free this
memory and vmap() the pages which is very slow.  Several objects
will reside in the per-cpu kmem_cache making them quick to acquire
and release.  A further optimization would be to adjust the
implementation to additional ensure the memory is local to the cpu.
Currently that may not be the case.
2011-02-25 16:56:22 -08:00
Brian Behlendorf
5a52a782a0 Use Linux flock struct
Rather than defining our own structure which will conflict with
Linux's version when building 32-bit.  Simply setup a typedef
to always use the correct Linux version for both 32 ad 64-bit
builds.
2011-02-23 14:32:15 -08:00
Brian Behlendorf
d599e4fa79 Block in cv_destroy() on all waiters
Previously we would ASSERT in cv_destroy() if it was ever called
with active waiters.  However, I've now seen several instances in
OpenSolaris code where they do the following:

  cv_broadcast();
  cv_destroy();

This leaves no time for active waiters to be woken up and scheduled
and we trip the ASSERT.  This has not been observed to be an issue
on OpenSolaris because their cv_destroy() basically does nothing.
They still do run the risk of the memory being free'd after the
cv_destroy() and hitting a bad paging request.  But in practice
this race is so small and unlikely it either doesn't happen, or
is so unlikely when it does happen the root cause has not yet been
identified.

Rather than risk the same issue in our code this change updates
cv_destroy() to block until all waiters have been woken and
scheduled.  This may take some time because each waiter must
acquire the mutex.

This change may have an impact on performance for frequently
created and destroyed condition variables.  That however is a price
worth paying it avoid crashing your system.  If performance issues
are observed they can be addressed by the caller.
2011-02-04 14:09:08 -08:00
Brian Behlendorf
0aff071d18 Minor policy interface
Simply add the policy function wrappers.  They are completely
non-functional and always return that everything is OK, but once
again they simplify compilation of dependent packages for now.
These can/should be removed once the security policy of the
dependent application is completely understood and intergrade
as appropriate with Linux.
2011-01-27 16:06:09 -08:00
Brian Behlendorf
ef57fb98e4 Add missing headers
Dependent packages require the following missing headers to
simplify compilation.  The headers are basically just stubbed
out with minimal content required.
2011-01-27 16:06:09 -08:00
Brian Behlendorf
3fc97f9335 Add VSA_ACE_* and MAX_ACL_ENTRIES defines
The following flags are use to get the proper mask when getting
and setting ACLs.  I'm hopeful this can all largely go away at
some point.

We also add a define for the maximum number of ACL entries.
MAX_ACL_ENTRIES is used as the maximum number of entries for
each type.
2011-01-27 16:06:09 -08:00
Brian Behlendorf
e2b25f698c Add MAXUID define
For Linux the maximum uid can vary depending on how your kernel
is built.  The Linux kernel still can be compiled with 16 but uids
and gids, although I'm not aware of a major distribution which does
this (maybe an embedded one?).  Given that caviot it is reasonably
safe to define the MAXUID as 2147483647.
2011-01-27 16:06:09 -08:00
Brian Behlendorf
5f46a517f1 Add FIGNORECASE define
The FIGNORECASE case define is now needed, place it with the
related flags.
2011-01-27 16:06:09 -08:00
Brian Behlendorf
3e5d3d3285 Add ksid_index_t and ksid_t types
Add the ksid_index_t enum and ksid_t type for use.  These types
are now used by packages which depend on the SPL.
2011-01-27 16:06:09 -08:00
Brian Behlendorf
d700637207 Minimal VFS additions
This patch simply removes the place holder vfs_t type and includes
some generic Linux VFS headers.  It also makes some minor fid_t
additions for compatibility.
2011-01-27 16:06:04 -08:00
Brian Behlendorf
647fa73cf3 Remove VN_HOLD/VN_RELE/VOP_PUTPAGE
Previously these were defined to noops but rather than give
the misleading impression that these are actually implemented
I'm removing the type entirely for clarity.
2011-01-12 11:38:05 -08:00
Brian Behlendorf
bd6ac72b03 Add a few additional vnode #defines
These additional constants now have users in dependant packages.
2011-01-12 11:38:05 -08:00
Brian Behlendorf
dcd9cb5a17 Clean vattr_t and vsecattr_t types
Minor cleanup for the vattr_t and vsecattr_t types.
2011-01-12 11:38:04 -08:00
Brian Behlendorf
1b439713f1 FRSYNC Should Use O_SYNC
The Solaris FRSYNC maps most logically to the Linux O_SYNC.  There
is no O_RSYNC on Linux but this wasn't noticed until just recently.
2011-01-12 11:38:04 -08:00
Brian Behlendorf
4295b530ee Add vn_mode_to_vtype/vn_vtype to_mode helpers
Add simple helpers to convert a vnode->v_type to a inode->i_mode.
These should be used sparingly but they are handy to have.
2011-01-12 11:38:04 -08:00
Neependra Khare
3f688a8c38 Add cv_timedwait_interruptible() function
The cv_timedwait() function by definition must wait unconditionally
for cv_signal()/cv_broadcast() before waking.  This causes processes
to go in the D state which increases the load average.  The load
average is the summation of processes in D state and run queue.

To avoid this it can be desirable to sleep interruptibly.  These
processes do not count against the load average but may be woken by
a signal.  It is up to the caller to determine why the process
was woken it may be for one of three reasons.

  1) cv_signal()/cv_broadcast()
  2) the timeout expired
  3) a signal was received

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-01-11 12:14:48 -08:00
Brian Behlendorf
9fe45dc1ac Add Thread Specific Data (TSD) Implementation
Thread specific data has implemented using a hash table, this avoids
the need to add a member to the task structure and allows maximum
portability between kernels.  This implementation has been optimized
to keep the tsd_set() and tsd_get() times as small as possible.

The majority of the entries in the hash table are for specific tsd
entries.  These entries are hashed by the product of their key and
pid because by design the key and pid are guaranteed to be unique.
Their product also has the desirable properly that it will be uniformly
distributed over the hash bins providing neither the pid nor key is zero.
Under linux the zero pid is always the init process and thus won't be
used, and this implementation is careful to never to assign a zero key.
By default the hash table is sized to 512 bins which is expected to
be sufficient for light to moderate usage of thread specific data.

The hash table contains two additional type of entries.  They first
type is entry is called a 'key' entry and it is added to the hash during
tsd_create().  It is used to store the address of the destructor function
and it is used as an anchor point.  All tsd entries which use the same
key will be linked to this entry.  This is used during tsd_destory() to
quickly call the destructor function for all tsd associated with the key.
The 'key' entry may be looked up with tsd_hash_search() by passing the
key you wish to lookup and DTOR_PID constant as the pid.

The second type of entry is called a 'pid' entry and it is added to the
hash the first time a process set a key.  The 'pid' entry is also used
as an anchor and all tsd for the process will be linked to it.  This
list is using during tsd_exit() to ensure all registered destructors
are run for the process.  The 'pid' entry may be looked up with
tsd_hash_search() by passing the PID_KEY constant as the key, and
the process pid.  Note that tsd_exit() is called by thread_exit()
so if your using the Solaris thread API you should not need to call
tsd_exit() directly.
2010-12-07 10:02:32 -08:00
Ricardo M. Correia
c2f997b0b3 Make kmutex_t typesafe in all cases.
When HAVE_MUTEX_OWNER and CONFIG_SMP are defined, kmutex_t is just
a typedef for struct mutex.

This is generally OK but has the downside that it can make mistakes
such as mutex_lock(&kmutex_var) to pass by unnoticed until someone
compiles the code without HAVE_MUTEX_OWNER or CONFIG_SMP (in which
case kmutex_t is a real struct). Note that the correct API to call
should have been mutex_enter() rather than mutex_lock().

We prevent these kind of mistakes by making kmutex_t a real structure
with only one field. This makes kmutex_t typesafe and it shouldn't
have any impact on the generated assembly code.

Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-11-29 11:25:32 -08:00
Brian Behlendorf
058de03caa Clear cv->cv_mutex when not in use
For debugging purposes the condition varaibles keep track of the
mutex used during a wait.  The idea is to validate that all callers
always use the same mutex.  Unfortunately, we have seen cases where
the caller reuses the condition variable with a different mutex but
in a way which is known to be safe.  My reading of the man pages
suggests you should not do this and always cv_destroy()/cv_init()
a new mutex.  However, there is overhead in doing this and it does
appear to be allowed under Solaris.

To accomidate this behavior cv_wait_common() and __cv_timedwait()
have been modified to clear the associated mutex when the last
waiter is dropped.  This ensures that while the condition variable
is in use the incorrect mutex case is detected.  It also allows the
condition variable to be safely recycled without requiring the
overhead of a cv_destroy()/cv_init() as long as it isn't currently
in use.

Finally, spin lock cv->cv_lock was removed because it is not required.
When the condition variable is used properly the caller will always
be holding the mutex so the spin lock is redundant.  The lock was
originally added because I expected to need to protect more than
just the cv->cv_mutex.  It turns out that was not the case.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-11-29 11:02:34 -08:00
Ned Bass
00ba7ef900 Give ENOTSUP a valid user space error value
The ZFS module returns ENOTSUP for several error conditions where an operation
is not (yet) supported.  The SPL defined ENOTSUP in terms of ENOTSUPP, but that
is an internal Linux kernel error code that should not be seen by user
programs.  As a result the zfs utilities print a confusing error message if an
unsupported operation is attempted:

    internal error: Unknown error 524
    Aborted

This change defines ENOTSUP in terms of EOPNOTSUPP which is consistent with
user space.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-11-10 13:25:49 -08:00
Brian Behlendorf
a50cede388 Linux 2.6.36 compat, wrap RLIM64_INFINITY
As of linux-2.6.36 RLIM64_INFINITY is defined in linux/resource.h.
This is handled by conditionally defining RLIM64_INFINITY in the
SPL only when the kernel does not provide it.
2010-11-09 13:28:55 -08:00
Brian Behlendorf
8294c69bb7 Clear owner after dropping mutex
It's important to clear mp->owner after calling mutex_unlock()
because when CONFIG_DEBUG_MUTEXES is defined the mutex owner
is verified in mutex_unlock().  If we set it to NULL this check
fails and the lockdep support is immediately disabled.
2010-11-05 11:52:30 -07:00
Ricardo M. Correia
a68d91d770 atomic_*_*_nv() functions need to return the new value atomically.
A local variable must be used for the return value to avoid a
potential race once the spin lock is dropped.

Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-09-17 16:03:25 -07:00
Brian Behlendorf
8371f981f1 Add list_link_replace() function
The list_link_replace() function with swap a new item it to the place
of an old item in a list.  It is the callers responsibility to ensure
all lists involved are locked properly.
2010-08-27 14:23:48 -07:00
Brian Behlendorf
d85e28ad69 Add MUTEX_NOT_HELD() function
Simply implement the missing MUTEX_NOT_HELD() function using
the !MUTEX_HELD construct.
2010-08-27 14:23:48 -07:00
Brian Behlendorf
2b3543025c Stub out kmem cache defrag API
At some point we are going to need to implement the kmem cache
move callbacks to allow for kmem cache defragmentation.  This
commit simply lays a small part of the API ground work, it does
not actually implement any of this feature.  This is safe for
now because the move callbacks are just an optimization.  Even
if they are registered we don't ever really have to call them.
2010-08-27 14:23:42 -07:00