mirror_zfs/include/sys/zfs_context.h

685 lines
18 KiB
C
Raw Normal View History

2008-12-11 22:14:49 +03:00
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
Support custom build directories and move includes One of the neat tricks an autoconf style project is capable of is allow configurion/building in a directory other than the source directory. The major advantage to this is that you can build the project various different ways while making changes in a single source tree. For example, this project is designed to work on various different Linux distributions each of which work slightly differently. This means that changes need to verified on each of those supported distributions perferably before the change is committed to the public git repo. Using nfs and custom build directories makes this much easier. I now have a single source tree in nfs mounted on several different systems each running a supported distribution. When I make a change to the source base I suspect may break things I can concurrently build from the same source on all the systems each in their own subdirectory. wget -c http://github.com/downloads/behlendorf/zfs/zfs-x.y.z.tar.gz tar -xzf zfs-x.y.z.tar.gz cd zfs-x-y-z ------------------------- run concurrently ---------------------- <ubuntu system> <fedora system> <debian system> <rhel6 system> mkdir ubuntu mkdir fedora mkdir debian mkdir rhel6 cd ubuntu cd fedora cd debian cd rhel6 ../configure ../configure ../configure ../configure make make make make make check make check make check make check This change also moves many of the include headers from individual incude/sys directories under the modules directory in to a single top level include directory. This has the advantage of making the build rules cleaner and logically it makes a bit more sense.
2010-09-05 00:26:23 +04:00
* Copyright 2009 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
2008-12-11 22:14:49 +03:00
*/
/*
* Copyright 2011 Nexenta Systems, Inc. All rights reserved.
* Copyright (c) 2012, Joyent, Inc. All rights reserved.
* Copyright (c) 2012 by Delphix. All rights reserved.
*/
2008-12-11 22:14:49 +03:00
#ifndef _SYS_ZFS_CONTEXT_H
#define _SYS_ZFS_CONTEXT_H
#ifdef __KERNEL__
Support custom build directories and move includes One of the neat tricks an autoconf style project is capable of is allow configurion/building in a directory other than the source directory. The major advantage to this is that you can build the project various different ways while making changes in a single source tree. For example, this project is designed to work on various different Linux distributions each of which work slightly differently. This means that changes need to verified on each of those supported distributions perferably before the change is committed to the public git repo. Using nfs and custom build directories makes this much easier. I now have a single source tree in nfs mounted on several different systems each running a supported distribution. When I make a change to the source base I suspect may break things I can concurrently build from the same source on all the systems each in their own subdirectory. wget -c http://github.com/downloads/behlendorf/zfs/zfs-x.y.z.tar.gz tar -xzf zfs-x.y.z.tar.gz cd zfs-x-y-z ------------------------- run concurrently ---------------------- <ubuntu system> <fedora system> <debian system> <rhel6 system> mkdir ubuntu mkdir fedora mkdir debian mkdir rhel6 cd ubuntu cd fedora cd debian cd rhel6 ../configure ../configure ../configure ../configure make make make make make check make check make check make check This change also moves many of the include headers from individual incude/sys directories under the modules directory in to a single top level include directory. This has the advantage of making the build rules cleaner and logically it makes a bit more sense.
2010-09-05 00:26:23 +04:00
#include <sys/note.h>
#include <sys/types.h>
#include <sys/t_lock.h>
#include <sys/atomic.h>
#include <sys/sysmacros.h>
#include <sys/bitmap.h>
#include <sys/cmn_err.h>
#include <sys/kmem.h>
#include <sys/taskq.h>
#include <sys/buf.h>
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/cpuvar.h>
#include <sys/kobj.h>
#include <sys/conf.h>
#include <sys/disp.h>
#include <sys/debug.h>
#include <sys/random.h>
#include <sys/byteorder.h>
#include <sys/systm.h>
#include <sys/list.h>
#include <sys/uio_impl.h>
Support custom build directories and move includes One of the neat tricks an autoconf style project is capable of is allow configurion/building in a directory other than the source directory. The major advantage to this is that you can build the project various different ways while making changes in a single source tree. For example, this project is designed to work on various different Linux distributions each of which work slightly differently. This means that changes need to verified on each of those supported distributions perferably before the change is committed to the public git repo. Using nfs and custom build directories makes this much easier. I now have a single source tree in nfs mounted on several different systems each running a supported distribution. When I make a change to the source base I suspect may break things I can concurrently build from the same source on all the systems each in their own subdirectory. wget -c http://github.com/downloads/behlendorf/zfs/zfs-x.y.z.tar.gz tar -xzf zfs-x.y.z.tar.gz cd zfs-x-y-z ------------------------- run concurrently ---------------------- <ubuntu system> <fedora system> <debian system> <rhel6 system> mkdir ubuntu mkdir fedora mkdir debian mkdir rhel6 cd ubuntu cd fedora cd debian cd rhel6 ../configure ../configure ../configure ../configure make make make make make check make check make check make check This change also moves many of the include headers from individual incude/sys directories under the modules directory in to a single top level include directory. This has the advantage of making the build rules cleaner and logically it makes a bit more sense.
2010-09-05 00:26:23 +04:00
#include <sys/dirent.h>
#include <sys/time.h>
#include <vm/seg_kmem.h>
#include <sys/zone.h>
#include <sys/zfs_debug.h>
#include <sys/fm/fs/zfs.h>
#include <sys/sunddi.h>
#include <sys/ctype.h>
#include <sys/disp.h>
#include <linux/dcache_compat.h>
Support custom build directories and move includes One of the neat tricks an autoconf style project is capable of is allow configurion/building in a directory other than the source directory. The major advantage to this is that you can build the project various different ways while making changes in a single source tree. For example, this project is designed to work on various different Linux distributions each of which work slightly differently. This means that changes need to verified on each of those supported distributions perferably before the change is committed to the public git repo. Using nfs and custom build directories makes this much easier. I now have a single source tree in nfs mounted on several different systems each running a supported distribution. When I make a change to the source base I suspect may break things I can concurrently build from the same source on all the systems each in their own subdirectory. wget -c http://github.com/downloads/behlendorf/zfs/zfs-x.y.z.tar.gz tar -xzf zfs-x.y.z.tar.gz cd zfs-x-y-z ------------------------- run concurrently ---------------------- <ubuntu system> <fedora system> <debian system> <rhel6 system> mkdir ubuntu mkdir fedora mkdir debian mkdir rhel6 cd ubuntu cd fedora cd debian cd rhel6 ../configure ../configure ../configure ../configure make make make make make check make check make check make check This change also moves many of the include headers from individual incude/sys directories under the modules directory in to a single top level include directory. This has the advantage of making the build rules cleaner and logically it makes a bit more sense.
2010-09-05 00:26:23 +04:00
#else /* _KERNEL */
2008-12-11 22:14:49 +03:00
#define _SYS_MUTEX_H
#define _SYS_RWLOCK_H
#define _SYS_CONDVAR_H
#define _SYS_SYSTM_H
#define _SYS_T_LOCK_H
#define _SYS_VNODE_H
#define _SYS_VFS_H
#define _SYS_SUNDDI_H
#define _SYS_CALLB_H
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>
#include <stdarg.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
#include <strings.h>
#include <pthread.h>
2008-12-11 22:14:49 +03:00
#include <assert.h>
#include <alloca.h>
#include <umem.h>
#include <limits.h>
#include <atomic.h>
#include <dirent.h>
#include <time.h>
#include <ctype.h>
2008-12-11 22:14:49 +03:00
#include <sys/note.h>
#include <sys/types.h>
#include <sys/cred.h>
#include <sys/sysmacros.h>
#include <sys/bitmap.h>
#include <sys/resource.h>
#include <sys/byteorder.h>
#include <sys/list.h>
#include <sys/uio.h>
#include <sys/zfs_debug.h>
#include <sys/sdt.h>
#include <sys/kstat.h>
#include <sys/u8_textprep.h>
Add linux events This topic branch leverages the Solaris style FMA call points in ZFS to create a user space visible event notification system under Linux. This new system is called zevent and it unifies all previous Solaris style ereports and sysevent notifications. Under this Linux specific scheme when a sysevent or ereport event occurs an nvlist describing the event is created which looks almost exactly like a Solaris ereport. These events are queued up in the kernel when they occur and conditionally logged to the console. It is then up to a user space application to consume the events and do whatever it likes with them. To make this possible the existing /dev/zfs ABI has been extended with two new ioctls which behave as follows. * ZFS_IOC_EVENTS_NEXT Get the next pending event. The kernel will keep track of the last event consumed by the file descriptor and provide the next one if available. If no new events are available the ioctl() will block waiting for the next event. This ioctl may also be called in a non-blocking mode by setting zc.zc_guid = ZEVENT_NONBLOCK. In the non-blocking case if no events are available ENOENT will be returned. It is possible that ESHUTDOWN will be returned if the ioctl() is called while module unloading is in progress. And finally ENOMEM may occur if the provided nvlist buffer is not large enough to contain the entire event. * ZFS_IOC_EVENTS_CLEAR Clear are events queued by the kernel. The kernel will keep a fairly large number of recent events queued, use this ioctl to clear the in kernel list. This will effect all user space processes consuming events. The zpool command has been extended to use this events ABI with the 'events' subcommand. You may run 'zpool events -v' to output a verbose log of all recent events. This is very similar to the Solaris 'fmdump -ev' command with the key difference being it also includes what would be considered sysevents under Solaris. You may also run in follow mode with the '-f' option. To clear the in kernel event queue use the '-c' option. $ sudo cmd/zpool/zpool events -fv TIME CLASS May 13 2010 16:31:15.777711000 ereport.fs.zfs.config.sync class = "ereport.fs.zfs.config.sync" ena = 0x40982b7897700001 detector = (embedded nvlist) version = 0x0 scheme = "zfs" pool = 0xed976600de75dfa6 (end detector) time = 0x4bec8bc3 0x2e5aed98 pool = "zpios" pool_guid = 0xed976600de75dfa6 pool_context = 0x0 While the 'zpool events' command is handy for interactive debugging it is not expected to be the primary consumer of zevents. This ABI was primarily added to facilitate the addition of a user space monitoring daemon. This daemon would consume all events posted by the kernel and based on the type of event perform an action. For most events simply forwarding them on to syslog is likely enough. But this interface also cleanly allows for more sophisticated actions to be taken such as generating an email for a failed drive. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-26 22:42:43 +04:00
#include <sys/fm/fs/zfs.h>
#include <sys/sunddi.h>
#include <sys/debug.h>
2008-12-11 22:14:49 +03:00
/*
* Stack
*/
#define noinline __attribute__((noinline))
2008-12-11 22:14:49 +03:00
/*
* Debugging
*/
/*
* Note that we are not using the debugging levels.
*/
#define CE_CONT 0 /* continuation */
#define CE_NOTE 1 /* notice */
#define CE_WARN 2 /* warning */
#define CE_PANIC 3 /* panic */
#define CE_IGNORE 4 /* print nothing */
extern int aok;
2008-12-11 22:14:49 +03:00
/*
* ZFS debugging
*/
extern void dprintf_setup(int *argc, char **argv);
extern void __dprintf(const char *file, const char *func,
int line, const char *fmt, ...);
2008-12-11 22:14:49 +03:00
extern void cmn_err(int, const char *, ...);
extern void vcmn_err(int, const char *, __va_list);
extern void panic(const char *, ...);
extern void vpanic(const char *, __va_list);
#define fm_panic panic
/*
* DTrace SDT probes have different signatures in userland than they do in
* kernel. If they're being used in kernel code, re-define them out of
* existence for their counterparts in libzpool.
*/
#ifdef DTRACE_PROBE
#undef DTRACE_PROBE
#define DTRACE_PROBE(a) ((void)0)
#endif /* DTRACE_PROBE */
#ifdef DTRACE_PROBE1
#undef DTRACE_PROBE1
#define DTRACE_PROBE1(a, b, c) ((void)0)
#endif /* DTRACE_PROBE1 */
#ifdef DTRACE_PROBE2
#undef DTRACE_PROBE2
#define DTRACE_PROBE2(a, b, c, d, e) ((void)0)
#endif /* DTRACE_PROBE2 */
#ifdef DTRACE_PROBE3
#undef DTRACE_PROBE3
#define DTRACE_PROBE3(a, b, c, d, e, f, g) ((void)0)
#endif /* DTRACE_PROBE3 */
#ifdef DTRACE_PROBE4
#undef DTRACE_PROBE4
#define DTRACE_PROBE4(a, b, c, d, e, f, g, h, i) ((void)0)
#endif /* DTRACE_PROBE4 */
/*
* Threads
*/
#define TS_MAGIC 0x72f158ab4261e538ull
#define TS_RUN 0x00000002
#ifdef __linux__
#define STACK_SIZE 8192 /* Linux x86 and amd64 */
#else
#define STACK_SIZE 24576 /* Solaris */
#endif
/* in libzpool, p0 exists only to have its address taken */
typedef struct proc {
uintptr_t this_is_never_used_dont_dereference_it;
} proc_t;
extern struct proc p0;
#define curproc (&p0)
typedef void (*thread_func_t)(void *);
typedef void (*thread_func_arg_t)(void *);
typedef pthread_t kt_did_t;
typedef struct kthread {
kt_did_t t_tid;
thread_func_t t_func;
void * t_arg;
} kthread_t;
2008-12-11 22:14:49 +03:00
#define curthread zk_thread_current()
Add visibility in to arc_read This change is an attempt to add visibility into the arc_read calls occurring on a system, in real time. To do this, a list was added to the in memory SPA data structure for a pool, with each element on the list corresponding to a call to arc_read. These entries are then exported through the kstat interface, which can then be interpreted in userspace. For each arc_read call, the following information is exported: * A unique identifier (uint64_t) * The time the entry was added to the list (hrtime_t) (*not* wall clock time; relative to the other entries on the list) * The objset ID (uint64_t) * The object number (uint64_t) * The indirection level (uint64_t) * The block ID (uint64_t) * The name of the function originating the arc_read call (char[24]) * The arc_flags from the arc_read call (uint32_t) * The PID of the reading thread (pid_t) * The command or name of thread originating read (char[16]) From this exported information one can see, in real time, exactly what is being read, what function is generating the read, and whether or not the read was found to be already cached. There is still some work to be done, but this should serve as a good starting point. Specifically, dbuf_read's are not accounted for in the currently exported information. Thus, a follow up patch should probably be added to export these calls that never call into arc_read (they only hit the dbuf hash table). In addition, it might be nice to create a utility similar to "arcstat.py" to digest the exported information and display it in a more readable format. Or perhaps, log the information and allow for it to be "replayed" at a later time. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-09-07 03:09:05 +04:00
#define getcomm() "unknown"
#define thread_exit zk_thread_exit
#define thread_create(stk, stksize, func, arg, len, pp, state, pri) \
zk_thread_create(stk, stksize, (thread_func_t)func, arg, \
len, NULL, state, pri, PTHREAD_CREATE_DETACHED)
#define thread_join(t) zk_thread_join(t)
#define newproc(f,a,cid,pri,ctp,pid) (ENOSYS)
extern kthread_t *zk_thread_current(void);
extern void zk_thread_exit(void);
extern kthread_t *zk_thread_create(caddr_t stk, size_t stksize,
thread_func_t func, void *arg, size_t len,
proc_t *pp, int state, pri_t pri, int detachstate);
extern void zk_thread_join(kt_did_t tid);
#define kpreempt_disable() ((void)0)
#define kpreempt_enable() ((void)0)
#define PS_NONE -1
2008-12-11 22:14:49 +03:00
#define issig(why) (FALSE)
#define ISSIG(thr, why) (FALSE)
/*
* Mutexes
*/
#define MTX_MAGIC 0x9522f51362a6e326ull
#define MTX_INIT ((void *)NULL)
#define MTX_DEST ((void *)-1UL)
2008-12-11 22:14:49 +03:00
typedef struct kmutex {
void *m_owner;
uint64_t m_magic;
pthread_mutex_t m_lock;
2008-12-11 22:14:49 +03:00
} kmutex_t;
#define MUTEX_DEFAULT 0
#define MUTEX_HELD(m) ((m)->m_owner == curthread)
#define MUTEX_NOT_HELD(m) (!MUTEX_HELD(m))
2008-12-11 22:14:49 +03:00
extern void mutex_init(kmutex_t *mp, char *name, int type, void *cookie);
extern void mutex_destroy(kmutex_t *mp);
2008-12-11 22:14:49 +03:00
extern void mutex_enter(kmutex_t *mp);
extern void mutex_exit(kmutex_t *mp);
extern int mutex_tryenter(kmutex_t *mp);
extern void *mutex_owner(kmutex_t *mp);
extern int mutex_held(kmutex_t *mp);
2008-12-11 22:14:49 +03:00
/*
* RW locks
*/
#define RW_MAGIC 0x4d31fb123648e78aull
#define RW_INIT ((void *)NULL)
#define RW_DEST ((void *)-1UL)
2008-12-11 22:14:49 +03:00
typedef struct krwlock {
void *rw_owner;
void *rw_wr_owner;
uint64_t rw_magic;
pthread_rwlock_t rw_lock;
uint_t rw_readers;
2008-12-11 22:14:49 +03:00
} krwlock_t;
typedef int krw_t;
#define RW_READER 0
#define RW_WRITER 1
#define RW_DEFAULT RW_READER
2008-12-11 22:14:49 +03:00
#define RW_READ_HELD(x) ((x)->rw_readers > 0)
#define RW_WRITE_HELD(x) ((x)->rw_wr_owner == curthread)
#define RW_LOCK_HELD(x) (RW_READ_HELD(x) || RW_WRITE_HELD(x))
2008-12-11 22:14:49 +03:00
#undef RW_LOCK_HELD
#define RW_LOCK_HELD(x) (RW_READ_HELD(x) || RW_WRITE_HELD(x))
#undef RW_LOCK_HELD
#define RW_LOCK_HELD(x) (RW_READ_HELD(x) || RW_WRITE_HELD(x))
2008-12-11 22:14:49 +03:00
extern void rw_init(krwlock_t *rwlp, char *name, int type, void *arg);
extern void rw_destroy(krwlock_t *rwlp);
extern void rw_enter(krwlock_t *rwlp, krw_t rw);
extern int rw_tryenter(krwlock_t *rwlp, krw_t rw);
extern int rw_tryupgrade(krwlock_t *rwlp);
extern void rw_exit(krwlock_t *rwlp);
#define rw_downgrade(rwlp) do { } while (0)
extern uid_t crgetuid(cred_t *cr);
Illumos #2882, #2883, #2900 2882 implement libzfs_core 2883 changing "canmount" property to "on" should not always remount dataset 2900 "zfs snapshot" should be able to create multiple, arbitrary snapshots at once Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Chris Siden <christopher.siden@delphix.com> Reviewed by: Garrett D'Amore <garrett@damore.org> Reviewed by: Bill Pijewski <wdp@joyent.com> Reviewed by: Dan Kruchinin <dan.kruchinin@gmail.com> Approved by: Eric Schrock <Eric.Schrock@delphix.com> References: https://www.illumos.org/issues/2882 https://www.illumos.org/issues/2883 https://www.illumos.org/issues/2900 illumos/illumos-gate@4445fffbbb1ea25fd0e9ea68b9380dd7a6709025 Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1293 Porting notes: WARNING: This patch changes the user/kernel ABI. That means that the zfs/zpool utilities built from master are NOT compatible with the 0.6.2 kernel modules. Ensure you load the matching kernel modules from master after updating the utilities. Otherwise the zfs/zpool commands will be unable to interact with your pool and you will see errors similar to the following: $ zpool list failed to read pool configuration: bad address no pools available $ zfs list no datasets available Add zvol minor device creation to the new zfs_snapshot_nvl function. Remove the logging of the "release" operation in dsl_dataset_user_release_sync(). The logging caused a null dereference because ds->ds_dir is zeroed in dsl_dataset_destroy_sync() and the logging functions try to get the ds name via the dsl_dataset_name() function. I've got no idea why this particular code would have worked in Illumos. This code has subsequently been completely reworked in Illumos commit 3b2aab1 (3464 zfs synctask code needs restructuring). Squash some "may be used uninitialized" warning/erorrs. Fix some printf format warnings for %lld and %llu. Apply a few spa_writeable() changes that were made to Illumos in illumos/illumos-gate.git@cd1c8b8 as part of the 3112, 3113, 3114 and 3115 fixes. Add a missing call to fnvlist_free(nvl) in log_internal() that was added in Illumos to fix issue 3085 but couldn't be ported to ZoL at the time (zfsonlinux/zfs@9e11c73) because it depended on future work.
2013-08-28 15:45:09 +04:00
extern uid_t crgetruid(cred_t *cr);
2008-12-11 22:14:49 +03:00
extern gid_t crgetgid(cred_t *cr);
extern int crgetngroups(cred_t *cr);
extern gid_t *crgetgroups(cred_t *cr);
/*
* Condition variables
*/
#define CV_MAGIC 0xd31ea9a83b1b30c4ull
typedef struct kcondvar {
uint64_t cv_magic;
pthread_cond_t cv;
} kcondvar_t;
2008-12-11 22:14:49 +03:00
#define CV_DEFAULT 0
2008-12-11 22:14:49 +03:00
extern void cv_init(kcondvar_t *cv, char *name, int type, void *arg);
extern void cv_destroy(kcondvar_t *cv);
extern void cv_wait(kcondvar_t *cv, kmutex_t *mp);
extern clock_t cv_timedwait(kcondvar_t *cv, kmutex_t *mp, clock_t abstime);
extern void cv_signal(kcondvar_t *cv);
extern void cv_broadcast(kcondvar_t *cv);
#define cv_timedwait_interruptible(cv, mp, at) cv_timedwait(cv, mp, at)
#define cv_wait_interruptible(cv, mp) cv_wait(cv, mp)
#define cv_wait_io(cv, mp) cv_wait(cv, mp)
2008-12-11 22:14:49 +03:00
/*
* Thread-specific data
*/
#define tsd_get(k) pthread_getspecific(k)
#define tsd_set(k, v) pthread_setspecific(k, v)
#define tsd_create(kp, d) pthread_key_create(kp, d)
#define tsd_destroy(kp) /* nothing */
/*
* Thread-specific data
*/
#define tsd_get(k) pthread_getspecific(k)
#define tsd_set(k, v) pthread_setspecific(k, v)
#define tsd_create(kp, d) pthread_key_create(kp, d)
#define tsd_destroy(kp) /* nothing */
2008-12-11 22:14:49 +03:00
/*
* kstat creation, installation and deletion
*/
extern kstat_t *kstat_create(char *, int,
char *, char *, uchar_t, ulong_t, uchar_t);
extern void kstat_install(kstat_t *);
extern void kstat_delete(kstat_t *);
Add visibility in to arc_read This change is an attempt to add visibility into the arc_read calls occurring on a system, in real time. To do this, a list was added to the in memory SPA data structure for a pool, with each element on the list corresponding to a call to arc_read. These entries are then exported through the kstat interface, which can then be interpreted in userspace. For each arc_read call, the following information is exported: * A unique identifier (uint64_t) * The time the entry was added to the list (hrtime_t) (*not* wall clock time; relative to the other entries on the list) * The objset ID (uint64_t) * The object number (uint64_t) * The indirection level (uint64_t) * The block ID (uint64_t) * The name of the function originating the arc_read call (char[24]) * The arc_flags from the arc_read call (uint32_t) * The PID of the reading thread (pid_t) * The command or name of thread originating read (char[16]) From this exported information one can see, in real time, exactly what is being read, what function is generating the read, and whether or not the read was found to be already cached. There is still some work to be done, but this should serve as a good starting point. Specifically, dbuf_read's are not accounted for in the currently exported information. Thus, a follow up patch should probably be added to export these calls that never call into arc_read (they only hit the dbuf hash table). In addition, it might be nice to create a utility similar to "arcstat.py" to digest the exported information and display it in a more readable format. Or perhaps, log the information and allow for it to be "replayed" at a later time. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-09-07 03:09:05 +04:00
extern void kstat_set_raw_ops(kstat_t *ksp,
int (*headers)(char *buf, size_t size),
int (*data)(char *buf, size_t size, void *data),
void *(*addr)(kstat_t *ksp, loff_t index));
2008-12-11 22:14:49 +03:00
/*
* Kernel memory
*/
#define KM_SLEEP UMEM_NOFAIL
#define KM_PUSHPAGE KM_SLEEP
#define KM_NOSLEEP UMEM_DEFAULT
#define KM_NODEBUG 0x0
2008-12-11 22:14:49 +03:00
#define KMC_NODEBUG UMC_NODEBUG
Improve meta data performance Profiling the system during meta data intensive workloads such as creating/removing millions of files, revealed that the system was cpu bound. A large fraction of that cpu time was being spent waiting on the virtual address space spin lock. It turns out this was caused by certain heavily used kmem_caches being backed by virtual memory. By default a kmem_cache will dynamically determine the type of memory used based on the object size. For large objects virtual memory is usually preferable and for small object physical memory is a better choice. See the spl_slab_alloc() function for a longer discussion on this. However, there is a certain amount of gray area when defining a 'large' object. For the following caches it turns out they were just over the line: * dnode_cache * zio_cache * zio_link_cache * zio_buf_512_cache * zfs_data_buf_512_cache Now because we know there will be a lot of churn in these caches, and because we know the slabs will still be reasonably sized. We can safely request with the KMC_KMEM flag that the caches be backed with physical memory addresses. This entirely avoids the need to serialize on the virtual address space lock. As a bonus this also reduces our vmalloc usage which will be good for 32-bit kernels which have a very small virtual address space. It will also probably be good for interactive performance since unrelated processes could also block of this same global lock. Finally, we may see less cpu time being burned in the arc_reclaim and txg_sync_threads. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #258
2011-11-02 03:56:48 +04:00
#define KMC_KMEM 0x0
#define KMC_VMEM 0x0
2008-12-11 22:14:49 +03:00
#define kmem_alloc(_s, _f) umem_alloc(_s, _f)
#define kmem_zalloc(_s, _f) umem_zalloc(_s, _f)
#define kmem_free(_b, _s) umem_free(_b, _s)
#define vmem_alloc(_s, _f) kmem_alloc(_s, _f)
#define vmem_zalloc(_s, _f) kmem_zalloc(_s, _f)
#define vmem_free(_b, _s) kmem_free(_b, _s)
2008-12-11 22:14:49 +03:00
#define kmem_cache_create(_a, _b, _c, _d, _e, _f, _g, _h, _i) \
umem_cache_create(_a, _b, _c, _d, _e, _f, _g, _h, _i)
#define kmem_cache_destroy(_c) umem_cache_destroy(_c)
#define kmem_cache_alloc(_c, _f) umem_cache_alloc(_c, _f)
#define kmem_cache_free(_c, _b) umem_cache_free(_c, _b)
#define kmem_debugging() 0
#define kmem_cache_reap_now(_c) /* nothing */
#define kmem_cache_set_move(_c, _cb) /* nothing */
#define POINTER_INVALIDATE(_pp) /* nothing */
#define POINTER_IS_VALID(_p) 0
2008-12-11 22:14:49 +03:00
typedef umem_cache_t kmem_cache_t;
typedef enum kmem_cbrc {
KMEM_CBRC_YES,
KMEM_CBRC_NO,
KMEM_CBRC_LATER,
KMEM_CBRC_DONT_NEED,
KMEM_CBRC_DONT_KNOW
} kmem_cbrc_t;
2008-12-11 22:14:49 +03:00
/*
* Task queues
*/
typedef struct taskq taskq_t;
typedef uintptr_t taskqid_t;
typedef void (task_func_t)(void *);
typedef struct taskq_ent {
struct taskq_ent *tqent_next;
struct taskq_ent *tqent_prev;
task_func_t *tqent_func;
void *tqent_arg;
uintptr_t tqent_flags;
} taskq_ent_t;
#define TQENT_FLAG_PREALLOC 0x1 /* taskq_dispatch_ent used */
2008-12-11 22:14:49 +03:00
#define TASKQ_PREPOPULATE 0x0001
#define TASKQ_CPR_SAFE 0x0002 /* Use CPR safe protocol */
#define TASKQ_DYNAMIC 0x0004 /* Use dynamic thread scheduling */
#define TASKQ_THREADS_CPU_PCT 0x0008 /* Scale # threads by # cpus */
#define TASKQ_DC_BATCH 0x0010 /* Mark threads as batch */
2008-12-11 22:14:49 +03:00
#define TQ_SLEEP KM_SLEEP /* Can block for memory */
#define TQ_NOSLEEP KM_NOSLEEP /* cannot block for memory; may fail */
#define TQ_PUSHPAGE KM_PUSHPAGE /* Cannot perform I/O */
#define TQ_NOQUEUE 0x02 /* Do not enqueue if can't dispatch */
#define TQ_FRONT 0x08 /* Queue in front */
2008-12-11 22:14:49 +03:00
extern taskq_t *system_taskq;
extern taskq_t *taskq_create(const char *, int, pri_t, int, int, uint_t);
#define taskq_create_proc(a, b, c, d, e, p, f) \
(taskq_create(a, b, c, d, e, f))
#define taskq_create_sysdc(a, b, d, e, p, dc, f) \
(taskq_create(a, b, maxclsyspri, d, e, f))
2008-12-11 22:14:49 +03:00
extern taskqid_t taskq_dispatch(taskq_t *, task_func_t, void *, uint_t);
extern taskqid_t taskq_dispatch_delay(taskq_t *, task_func_t, void *, uint_t,
clock_t);
extern void taskq_dispatch_ent(taskq_t *, task_func_t, void *, uint_t,
taskq_ent_t *);
extern int taskq_empty_ent(taskq_ent_t *);
extern void taskq_init_ent(taskq_ent_t *);
2008-12-11 22:14:49 +03:00
extern void taskq_destroy(taskq_t *);
extern void taskq_wait(taskq_t *);
extern void taskq_wait_id(taskq_t *, taskqid_t);
extern int taskq_member(taskq_t *, kthread_t *);
extern int taskq_cancel_id(taskq_t *, taskqid_t);
2008-12-11 22:14:49 +03:00
extern void system_taskq_init(void);
extern void system_taskq_fini(void);
2008-12-11 22:14:49 +03:00
#define XVA_MAPSIZE 3
#define XVA_MAGIC 0x78766174
/*
* vnodes
*/
typedef struct vnode {
uint64_t v_size;
int v_fd;
char *v_path;
} vnode_t;
#define AV_SCANSTAMP_SZ 32 /* length of anti-virus scanstamp */
2008-12-11 22:14:49 +03:00
typedef struct xoptattr {
timestruc_t xoa_createtime; /* Create time of file */
uint8_t xoa_archive;
uint8_t xoa_system;
uint8_t xoa_readonly;
uint8_t xoa_hidden;
uint8_t xoa_nounlink;
uint8_t xoa_immutable;
uint8_t xoa_appendonly;
uint8_t xoa_nodump;
uint8_t xoa_settable;
uint8_t xoa_opaque;
uint8_t xoa_av_quarantined;
uint8_t xoa_av_modified;
uint8_t xoa_av_scanstamp[AV_SCANSTAMP_SZ];
uint8_t xoa_reparse;
uint8_t xoa_offline;
uint8_t xoa_sparse;
2008-12-11 22:14:49 +03:00
} xoptattr_t;
typedef struct vattr {
uint_t va_mask; /* bit-mask of attributes */
u_offset_t va_size; /* file size in bytes */
} vattr_t;
typedef struct xvattr {
vattr_t xva_vattr; /* Embedded vattr structure */
uint32_t xva_magic; /* Magic Number */
uint32_t xva_mapsize; /* Size of attr bitmap (32-bit words) */
uint32_t *xva_rtnattrmapp; /* Ptr to xva_rtnattrmap[] */
uint32_t xva_reqattrmap[XVA_MAPSIZE]; /* Requested attrs */
uint32_t xva_rtnattrmap[XVA_MAPSIZE]; /* Returned attrs */
xoptattr_t xva_xoptattrs; /* Optional attributes */
} xvattr_t;
typedef struct vsecattr {
uint_t vsa_mask; /* See below */
int vsa_aclcnt; /* ACL entry count */
void *vsa_aclentp; /* pointer to ACL entries */
int vsa_dfaclcnt; /* default ACL entry count */
void *vsa_dfaclentp; /* pointer to default ACL entries */
size_t vsa_aclentsz; /* ACE size in bytes of vsa_aclentp */
} vsecattr_t;
#define AT_TYPE 0x00001
#define AT_MODE 0x00002
#define AT_UID 0x00004
#define AT_GID 0x00008
#define AT_FSID 0x00010
#define AT_NODEID 0x00020
#define AT_NLINK 0x00040
#define AT_SIZE 0x00080
#define AT_ATIME 0x00100
#define AT_MTIME 0x00200
#define AT_CTIME 0x00400
#define AT_RDEV 0x00800
#define AT_BLKSIZE 0x01000
#define AT_NBLOCKS 0x02000
#define AT_SEQ 0x08000
#define AT_XVATTR 0x10000
#define CRCREAT 0
extern int fop_getattr(vnode_t *vp, vattr_t *vap);
#define VOP_CLOSE(vp, f, c, o, cr, ct) vn_close(vp)
2008-12-11 22:14:49 +03:00
#define VOP_PUTPAGE(vp, of, sz, fl, cr, ct) 0
#define VOP_GETATTR(vp, vap, fl, cr, ct) fop_getattr((vp), (vap));
2008-12-11 22:14:49 +03:00
#define VOP_FSYNC(vp, f, cr, ct) fsync((vp)->v_fd)
#define VN_RELE(vp) vn_close(vp)
extern int vn_open(char *path, int x1, int oflags, int mode, vnode_t **vpp,
int x2, int x3);
extern int vn_openat(char *path, int x1, int oflags, int mode, vnode_t **vpp,
int x2, int x3, vnode_t *vp, int fd);
extern int vn_rdwr(int uio, vnode_t *vp, void *addr, ssize_t len,
offset_t offset, int x1, int x2, rlim64_t x3, void *x4, ssize_t *residp);
extern void vn_close(vnode_t *vp);
#define vn_remove(path, x1, x2) remove(path)
#define vn_rename(from, to, seg) rename((from), (to))
#define vn_is_readonly(vp) B_FALSE
extern vnode_t *rootdir;
#include <sys/file.h> /* for FREAD, FWRITE, etc */
/*
* Random stuff
*/
#define ddi_get_lbolt() (gethrtime() >> 23)
#define ddi_get_lbolt64() (gethrtime() >> 23)
2008-12-11 22:14:49 +03:00
#define hz 119 /* frequency when using gethrtime() >> 23 for lbolt */
extern void delay(clock_t ticks);
#define SEC_TO_TICK(sec) ((sec) * hz)
#define MSEC_TO_TICK(msec) ((msec) / (MILLISEC / hz))
#define USEC_TO_TICK(usec) ((usec) / (MICROSEC / hz))
#define NSEC_TO_TICK(usec) ((usec) / (NANOSEC / hz))
2008-12-11 22:14:49 +03:00
#define gethrestime_sec() time(NULL)
#define gethrestime(t) \
do {\
(t)->tv_sec = gethrestime_sec();\
(t)->tv_nsec = 0;\
} while (0);
2008-12-11 22:14:49 +03:00
#define max_ncpus 64
#define minclsyspri 60
#define maxclsyspri 99
#define CPU_SEQID (pthread_self() & (max_ncpus - 1))
2008-12-11 22:14:49 +03:00
#define kcred NULL
#define CRED() NULL
#define ptob(x) ((x) * PAGESIZE)
extern uint64_t physmem;
extern int highbit(ulong_t i);
extern int random_get_bytes(uint8_t *ptr, size_t len);
extern int random_get_pseudo_bytes(uint8_t *ptr, size_t len);
extern void kernel_init(int);
extern void kernel_fini(void);
struct spa;
extern void nicenum(uint64_t num, char *buf);
extern void show_pool_stats(struct spa *);
typedef struct callb_cpr {
kmutex_t *cc_lockp;
} callb_cpr_t;
#define CALLB_CPR_INIT(cp, lockp, func, name) { \
(cp)->cc_lockp = lockp; \
}
#define CALLB_CPR_SAFE_BEGIN(cp) { \
ASSERT(MUTEX_HELD((cp)->cc_lockp)); \
}
#define CALLB_CPR_SAFE_END(cp, lockp) { \
ASSERT(MUTEX_HELD((cp)->cc_lockp)); \
}
#define CALLB_CPR_EXIT(cp) { \
ASSERT(MUTEX_HELD((cp)->cc_lockp)); \
mutex_exit((cp)->cc_lockp); \
}
#define zone_dataset_visible(x, y) (1)
#define INGLOBALZONE(z) (1)
extern char *kmem_vasprintf(const char *fmt, va_list adx);
extern char *kmem_asprintf(const char *fmt, ...);
#define strfree(str) kmem_free((str), strlen(str) + 1)
2008-12-11 22:14:49 +03:00
/*
* Hostname information
*/
2009-02-18 23:51:31 +03:00
extern char hw_serial[]; /* for userland-emulated hostid access */
2008-12-11 22:14:49 +03:00
extern int ddi_strtoul(const char *str, char **nptr, int base,
unsigned long *result);
extern int ddi_strtoull(const char *str, char **nptr, int base,
u_longlong_t *result);
2008-12-11 22:14:49 +03:00
/* ZFS Boot Related stuff. */
struct _buf {
intptr_t _fd;
};
struct bootstat {
uint64_t st_size;
};
typedef struct ace_object {
uid_t a_who;
uint32_t a_access_mask;
uint16_t a_flags;
uint16_t a_type;
uint8_t a_obj_type[16];
uint8_t a_inherit_obj_type[16];
} ace_object_t;
#define ACE_ACCESS_ALLOWED_OBJECT_ACE_TYPE 0x05
#define ACE_ACCESS_DENIED_OBJECT_ACE_TYPE 0x06
#define ACE_SYSTEM_AUDIT_OBJECT_ACE_TYPE 0x07
#define ACE_SYSTEM_ALARM_OBJECT_ACE_TYPE 0x08
extern struct _buf *kobj_open_file(char *name);
extern int kobj_read_file(struct _buf *file, char *buf, unsigned size,
unsigned off);
extern void kobj_close_file(struct _buf *file);
extern int kobj_get_filesize(struct _buf *file, uint64_t *size);
extern int zfs_secpolicy_snapshot_perms(const char *name, cred_t *cr);
extern int zfs_secpolicy_rename_perms(const char *from, const char *to,
cred_t *cr);
extern int zfs_secpolicy_destroy_perms(const char *name, cred_t *cr);
extern zoneid_t getzoneid(void);
/* SID stuff */
typedef struct ksiddomain {
uint_t kd_ref;
uint_t kd_len;
char *kd_name;
} ksiddomain_t;
ksiddomain_t *ksid_lookupdomain(const char *);
void ksiddomain_rele(ksiddomain_t *);
2009-07-03 02:44:48 +04:00
#define DDI_SLEEP KM_SLEEP
#define ddi_log_sysevent(_a, _b, _c, _d, _e, _f, _g) \
sysevent_post_event(_c, _d, _b, "libzpool", _e, _f)
Support custom build directories and move includes One of the neat tricks an autoconf style project is capable of is allow configurion/building in a directory other than the source directory. The major advantage to this is that you can build the project various different ways while making changes in a single source tree. For example, this project is designed to work on various different Linux distributions each of which work slightly differently. This means that changes need to verified on each of those supported distributions perferably before the change is committed to the public git repo. Using nfs and custom build directories makes this much easier. I now have a single source tree in nfs mounted on several different systems each running a supported distribution. When I make a change to the source base I suspect may break things I can concurrently build from the same source on all the systems each in their own subdirectory. wget -c http://github.com/downloads/behlendorf/zfs/zfs-x.y.z.tar.gz tar -xzf zfs-x.y.z.tar.gz cd zfs-x-y-z ------------------------- run concurrently ---------------------- <ubuntu system> <fedora system> <debian system> <rhel6 system> mkdir ubuntu mkdir fedora mkdir debian mkdir rhel6 cd ubuntu cd fedora cd debian cd rhel6 ../configure ../configure ../configure ../configure make make make make make check make check make check make check This change also moves many of the include headers from individual incude/sys directories under the modules directory in to a single top level include directory. This has the advantage of making the build rules cleaner and logically it makes a bit more sense.
2010-09-05 00:26:23 +04:00
#endif /* _KERNEL */
2008-12-11 22:14:49 +03:00
#endif /* _SYS_ZFS_CONTEXT_H */