836 lines
32 KiB
Plaintext
836 lines
32 KiB
Plaintext
|
perf-record(1)
|
||
|
==============
|
||
|
|
||
|
NAME
|
||
|
----
|
||
|
perf-record - Run a command and record its profile into perf.data
|
||
|
|
||
|
SYNOPSIS
|
||
|
--------
|
||
|
[verse]
|
||
|
'perf record' [-e <EVENT> | --event=EVENT] [-a] <command>
|
||
|
'perf record' [-e <EVENT> | --event=EVENT] [-a] \-- <command> [<options>]
|
||
|
|
||
|
DESCRIPTION
|
||
|
-----------
|
||
|
This command runs a command and gathers a performance counter profile
|
||
|
from it, into perf.data - without displaying anything.
|
||
|
|
||
|
This file can then be inspected later on, using 'perf report'.
|
||
|
|
||
|
|
||
|
OPTIONS
|
||
|
-------
|
||
|
<command>...::
|
||
|
Any command you can specify in a shell.
|
||
|
|
||
|
-e::
|
||
|
--event=::
|
||
|
Select the PMU event. Selection can be:
|
||
|
|
||
|
- a symbolic event name (use 'perf list' to list all events)
|
||
|
|
||
|
- a raw PMU event in the form of rN where N is a hexadecimal value
|
||
|
that represents the raw register encoding with the layout of the
|
||
|
event control registers as described by entries in
|
||
|
/sys/bus/event_source/devices/cpu/format/*.
|
||
|
|
||
|
- a symbolic or raw PMU event followed by an optional colon
|
||
|
and a list of event modifiers, e.g., cpu-cycles:p. See the
|
||
|
linkperf:perf-list[1] man page for details on event modifiers.
|
||
|
|
||
|
- a symbolically formed PMU event like 'pmu/param1=0x3,param2/' where
|
||
|
'param1', 'param2', etc are defined as formats for the PMU in
|
||
|
/sys/bus/event_source/devices/<pmu>/format/*.
|
||
|
|
||
|
- a symbolically formed event like 'pmu/config=M,config1=N,config3=K/'
|
||
|
|
||
|
where M, N, K are numbers (in decimal, hex, octal format). Acceptable
|
||
|
values for each of 'config', 'config1' and 'config2' are defined by
|
||
|
corresponding entries in /sys/bus/event_source/devices/<pmu>/format/*
|
||
|
param1 and param2 are defined as formats for the PMU in:
|
||
|
/sys/bus/event_source/devices/<pmu>/format/*
|
||
|
|
||
|
There are also some parameters which are not defined in .../<pmu>/format/*.
|
||
|
These params can be used to overload default config values per event.
|
||
|
Here are some common parameters:
|
||
|
- 'period': Set event sampling period
|
||
|
- 'freq': Set event sampling frequency
|
||
|
- 'time': Disable/enable time stamping. Acceptable values are 1 for
|
||
|
enabling time stamping. 0 for disabling time stamping.
|
||
|
The default is 1.
|
||
|
- 'call-graph': Disable/enable callgraph. Acceptable str are "fp" for
|
||
|
FP mode, "dwarf" for DWARF mode, "lbr" for LBR mode and
|
||
|
"no" for disable callgraph.
|
||
|
- 'stack-size': user stack size for dwarf mode
|
||
|
- 'name' : User defined event name. Single quotes (') may be used to
|
||
|
escape symbols in the name from parsing by shell and tool
|
||
|
like this: name=\'CPU_CLK_UNHALTED.THREAD:cmask=0x1\'.
|
||
|
- 'aux-output': Generate AUX records instead of events. This requires
|
||
|
that an AUX area event is also provided.
|
||
|
- 'aux-sample-size': Set sample size for AUX area sampling. If the
|
||
|
'--aux-sample' option has been used, set aux-sample-size=0 to disable
|
||
|
AUX area sampling for the event.
|
||
|
|
||
|
See the linkperf:perf-list[1] man page for more parameters.
|
||
|
|
||
|
Note: If user explicitly sets options which conflict with the params,
|
||
|
the value set by the parameters will be overridden.
|
||
|
|
||
|
Also not defined in .../<pmu>/format/* are PMU driver specific
|
||
|
configuration parameters. Any configuration parameter preceded by
|
||
|
the letter '@' is not interpreted in user space and sent down directly
|
||
|
to the PMU driver. For example:
|
||
|
|
||
|
perf record -e some_event/@cfg1,@cfg2=config/ ...
|
||
|
|
||
|
will see 'cfg1' and 'cfg2=config' pushed to the PMU driver associated
|
||
|
with the event for further processing. There is no restriction on
|
||
|
what the configuration parameters are, as long as their semantic is
|
||
|
understood and supported by the PMU driver.
|
||
|
|
||
|
- a hardware breakpoint event in the form of '\mem:addr[/len][:access]'
|
||
|
where addr is the address in memory you want to break in.
|
||
|
Access is the memory access type (read, write, execute) it can
|
||
|
be passed as follows: '\mem:addr[:[r][w][x]]'. len is the range,
|
||
|
number of bytes from specified addr, which the breakpoint will cover.
|
||
|
If you want to profile read-write accesses in 0x1000, just set
|
||
|
'mem:0x1000:rw'.
|
||
|
If you want to profile write accesses in [0x1000~1008), just set
|
||
|
'mem:0x1000/8:w'.
|
||
|
|
||
|
- a group of events surrounded by a pair of brace ("{event1,event2,...}").
|
||
|
Each event is separated by commas and the group should be quoted to
|
||
|
prevent the shell interpretation. You also need to use --group on
|
||
|
"perf report" to view group events together.
|
||
|
|
||
|
--filter=<filter>::
|
||
|
Event filter. This option should follow an event selector (-e).
|
||
|
If the event is a tracepoint, the filter string will be parsed by
|
||
|
the kernel. If the event is a hardware trace PMU (e.g. Intel PT
|
||
|
or CoreSight), it'll be processed as an address filter. Otherwise
|
||
|
it means a general filter using BPF which can be applied for any
|
||
|
kind of event.
|
||
|
|
||
|
- tracepoint filters
|
||
|
|
||
|
In the case of tracepoints, multiple '--filter' options are combined
|
||
|
using '&&'.
|
||
|
|
||
|
- address filters
|
||
|
|
||
|
A hardware trace PMU advertises its ability to accept a number of
|
||
|
address filters by specifying a non-zero value in
|
||
|
/sys/bus/event_source/devices/<pmu>/nr_addr_filters.
|
||
|
|
||
|
Address filters have the format:
|
||
|
|
||
|
filter|start|stop|tracestop <start> [/ <size>] [@<file name>]
|
||
|
|
||
|
Where:
|
||
|
- 'filter': defines a region that will be traced.
|
||
|
- 'start': defines an address at which tracing will begin.
|
||
|
- 'stop': defines an address at which tracing will stop.
|
||
|
- 'tracestop': defines a region in which tracing will stop.
|
||
|
|
||
|
<file name> is the name of the object file, <start> is the offset to the
|
||
|
code to trace in that file, and <size> is the size of the region to
|
||
|
trace. 'start' and 'stop' filters need not specify a <size>.
|
||
|
|
||
|
If no object file is specified then the kernel is assumed, in which case
|
||
|
the start address must be a current kernel memory address.
|
||
|
|
||
|
<start> can also be specified by providing the name of a symbol. If the
|
||
|
symbol name is not unique, it can be disambiguated by inserting #n where
|
||
|
'n' selects the n'th symbol in address order. Alternately #0, #g or #G
|
||
|
select only a global symbol. <size> can also be specified by providing
|
||
|
the name of a symbol, in which case the size is calculated to the end
|
||
|
of that symbol. For 'filter' and 'tracestop' filters, if <size> is
|
||
|
omitted and <start> is a symbol, then the size is calculated to the end
|
||
|
of that symbol.
|
||
|
|
||
|
If <size> is omitted and <start> is '*', then the start and size will
|
||
|
be calculated from the first and last symbols, i.e. to trace the whole
|
||
|
file.
|
||
|
|
||
|
If symbol names (or '*') are provided, they must be surrounded by white
|
||
|
space.
|
||
|
|
||
|
The filter passed to the kernel is not necessarily the same as entered.
|
||
|
To see the filter that is passed, use the -v option.
|
||
|
|
||
|
The kernel may not be able to configure a trace region if it is not
|
||
|
within a single mapping. MMAP events (or /proc/<pid>/maps) can be
|
||
|
examined to determine if that is a possibility.
|
||
|
|
||
|
Multiple filters can be separated with space or comma.
|
||
|
|
||
|
- bpf filters
|
||
|
|
||
|
A BPF filter can access the sample data and make a decision based on the
|
||
|
data. Users need to set an appropriate sample type to use the BPF
|
||
|
filter. BPF filters need root privilege.
|
||
|
|
||
|
The sample data field can be specified in lower case letter. Multiple
|
||
|
filters can be separated with comma. For example,
|
||
|
|
||
|
--filter 'period > 1000, cpu == 1'
|
||
|
or
|
||
|
--filter 'mem_op == load || mem_op == store, mem_lvl > l1'
|
||
|
|
||
|
The former filter only accept samples with period greater than 1000 AND
|
||
|
CPU number is 1. The latter one accepts either load and store memory
|
||
|
operations but it should have memory level above the L1. Since the
|
||
|
mem_op and mem_lvl fields come from the (memory) data_source, it'd only
|
||
|
work with some events which set the data_source field.
|
||
|
|
||
|
Also user should request to collect that information (with -d option in
|
||
|
the above case). Otherwise, the following message will be shown.
|
||
|
|
||
|
$ sudo perf record -e cycles --filter 'mem_op == load'
|
||
|
Error: cycles event does not have PERF_SAMPLE_DATA_SRC
|
||
|
Hint: please add -d option to perf record.
|
||
|
failed to set filter "BPF" on event cycles with 22 (Invalid argument)
|
||
|
|
||
|
Essentially the BPF filter expression is:
|
||
|
|
||
|
<term> <operator> <value> (("," | "||") <term> <operator> <value>)*
|
||
|
|
||
|
The <term> can be one of:
|
||
|
ip, id, tid, pid, cpu, time, addr, period, txn, weight, phys_addr,
|
||
|
code_pgsz, data_pgsz, weight1, weight2, weight3, ins_lat, retire_lat,
|
||
|
p_stage_cyc, mem_op, mem_lvl, mem_snoop, mem_remote, mem_lock,
|
||
|
mem_dtlb, mem_blk, mem_hops
|
||
|
|
||
|
The <operator> can be one of:
|
||
|
==, !=, >, >=, <, <=, &
|
||
|
|
||
|
The <value> can be one of:
|
||
|
<number> (for any term)
|
||
|
na, load, store, pfetch, exec (for mem_op)
|
||
|
l1, l2, l3, l4, cxl, io, any_cache, lfb, ram, pmem (for mem_lvl)
|
||
|
na, none, hit, miss, hitm, fwd, peer (for mem_snoop)
|
||
|
remote (for mem_remote)
|
||
|
na, locked (for mem_locked)
|
||
|
na, l1_hit, l1_miss, l2_hit, l2_miss, any_hit, any_miss, walk, fault (for mem_dtlb)
|
||
|
na, by_data, by_addr (for mem_blk)
|
||
|
hops0, hops1, hops2, hops3 (for mem_hops)
|
||
|
|
||
|
--exclude-perf::
|
||
|
Don't record events issued by perf itself. This option should follow
|
||
|
an event selector (-e) which selects tracepoint event(s). It adds a
|
||
|
filter expression 'common_pid != $PERFPID' to filters. If other
|
||
|
'--filter' exists, the new filter expression will be combined with
|
||
|
them by '&&'.
|
||
|
|
||
|
-a::
|
||
|
--all-cpus::
|
||
|
System-wide collection from all CPUs (default if no target is specified).
|
||
|
|
||
|
-p::
|
||
|
--pid=::
|
||
|
Record events on existing process ID (comma separated list).
|
||
|
|
||
|
-t::
|
||
|
--tid=::
|
||
|
Record events on existing thread ID (comma separated list).
|
||
|
This option also disables inheritance by default. Enable it by adding
|
||
|
--inherit.
|
||
|
|
||
|
-u::
|
||
|
--uid=::
|
||
|
Record events in threads owned by uid. Name or number.
|
||
|
|
||
|
-r::
|
||
|
--realtime=::
|
||
|
Collect data with this RT SCHED_FIFO priority.
|
||
|
|
||
|
--no-buffering::
|
||
|
Collect data without buffering.
|
||
|
|
||
|
-c::
|
||
|
--count=::
|
||
|
Event period to sample.
|
||
|
|
||
|
-o::
|
||
|
--output=::
|
||
|
Output file name.
|
||
|
|
||
|
-i::
|
||
|
--no-inherit::
|
||
|
Child tasks do not inherit counters.
|
||
|
|
||
|
-F::
|
||
|
--freq=::
|
||
|
Profile at this frequency. Use 'max' to use the currently maximum
|
||
|
allowed frequency, i.e. the value in the kernel.perf_event_max_sample_rate
|
||
|
sysctl. Will throttle down to the currently maximum allowed frequency.
|
||
|
See --strict-freq.
|
||
|
|
||
|
--strict-freq::
|
||
|
Fail if the specified frequency can't be used.
|
||
|
|
||
|
-m::
|
||
|
--mmap-pages=::
|
||
|
Number of mmap data pages (must be a power of two) or size
|
||
|
specification with appended unit character - B/K/M/G. The
|
||
|
size is rounded up to have nearest pages power of two value.
|
||
|
Also, by adding a comma, the number of mmap pages for AUX
|
||
|
area tracing can be specified.
|
||
|
|
||
|
-g::
|
||
|
Enables call-graph (stack chain/backtrace) recording for both
|
||
|
kernel space and user space.
|
||
|
|
||
|
--call-graph::
|
||
|
Setup and enable call-graph (stack chain/backtrace) recording,
|
||
|
implies -g. Default is "fp" (for user space).
|
||
|
|
||
|
The unwinding method used for kernel space is dependent on the
|
||
|
unwinder used by the active kernel configuration, i.e
|
||
|
CONFIG_UNWINDER_FRAME_POINTER (fp) or CONFIG_UNWINDER_ORC (orc)
|
||
|
|
||
|
Any option specified here controls the method used for user space.
|
||
|
|
||
|
Valid options are "fp" (frame pointer), "dwarf" (DWARF's CFI -
|
||
|
Call Frame Information) or "lbr" (Hardware Last Branch Record
|
||
|
facility).
|
||
|
|
||
|
In some systems, where binaries are build with gcc
|
||
|
--fomit-frame-pointer, using the "fp" method will produce bogus
|
||
|
call graphs, using "dwarf", if available (perf tools linked to
|
||
|
the libunwind or libdw library) should be used instead.
|
||
|
Using the "lbr" method doesn't require any compiler options. It
|
||
|
will produce call graphs from the hardware LBR registers. The
|
||
|
main limitation is that it is only available on new Intel
|
||
|
platforms, such as Haswell. It can only get user call chain. It
|
||
|
doesn't work with branch stack sampling at the same time.
|
||
|
|
||
|
When "dwarf" recording is used, perf also records (user) stack dump
|
||
|
when sampled. Default size of the stack dump is 8192 (bytes).
|
||
|
User can change the size by passing the size after comma like
|
||
|
"--call-graph dwarf,4096".
|
||
|
|
||
|
When "fp" recording is used, perf tries to save stack enties
|
||
|
up to the number specified in sysctl.kernel.perf_event_max_stack
|
||
|
by default. User can change the number by passing it after comma
|
||
|
like "--call-graph fp,32".
|
||
|
|
||
|
-q::
|
||
|
--quiet::
|
||
|
Don't print any warnings or messages, useful for scripting.
|
||
|
|
||
|
-v::
|
||
|
--verbose::
|
||
|
Be more verbose (show counter open errors, etc).
|
||
|
|
||
|
-s::
|
||
|
--stat::
|
||
|
Record per-thread event counts. Use it with 'perf report -T' to see
|
||
|
the values.
|
||
|
|
||
|
-d::
|
||
|
--data::
|
||
|
Record the sample virtual addresses.
|
||
|
|
||
|
--phys-data::
|
||
|
Record the sample physical addresses.
|
||
|
|
||
|
--data-page-size::
|
||
|
Record the sampled data address data page size.
|
||
|
|
||
|
--code-page-size::
|
||
|
Record the sampled code address (ip) page size
|
||
|
|
||
|
-T::
|
||
|
--timestamp::
|
||
|
Record the sample timestamps. Use it with 'perf report -D' to see the
|
||
|
timestamps, for instance.
|
||
|
|
||
|
-P::
|
||
|
--period::
|
||
|
Record the sample period.
|
||
|
|
||
|
--sample-cpu::
|
||
|
Record the sample cpu.
|
||
|
|
||
|
--sample-identifier::
|
||
|
Record the sample identifier i.e. PERF_SAMPLE_IDENTIFIER bit set in
|
||
|
the sample_type member of the struct perf_event_attr argument to the
|
||
|
perf_event_open system call.
|
||
|
|
||
|
-n::
|
||
|
--no-samples::
|
||
|
Don't sample.
|
||
|
|
||
|
-R::
|
||
|
--raw-samples::
|
||
|
Collect raw sample records from all opened counters (default for tracepoint counters).
|
||
|
|
||
|
-C::
|
||
|
--cpu::
|
||
|
Collect samples only on the list of CPUs provided. Multiple CPUs can be provided as a
|
||
|
comma-separated list with no space: 0,1. Ranges of CPUs are specified with -: 0-2.
|
||
|
In per-thread mode with inheritance mode on (default), samples are captured only when
|
||
|
the thread executes on the designated CPUs. Default is to monitor all CPUs.
|
||
|
|
||
|
User space tasks can migrate between CPUs, so when tracing selected CPUs,
|
||
|
a dummy event is created to track sideband for all CPUs.
|
||
|
|
||
|
-B::
|
||
|
--no-buildid::
|
||
|
Do not save the build ids of binaries in the perf.data files. This skips
|
||
|
post processing after recording, which sometimes makes the final step in
|
||
|
the recording process to take a long time, as it needs to process all
|
||
|
events looking for mmap records. The downside is that it can misresolve
|
||
|
symbols if the workload binaries used when recording get locally rebuilt
|
||
|
or upgraded, because the only key available in this case is the
|
||
|
pathname. You can also set the "record.build-id" config variable to
|
||
|
'skip to have this behaviour permanently.
|
||
|
|
||
|
-N::
|
||
|
--no-buildid-cache::
|
||
|
Do not update the buildid cache. This saves some overhead in situations
|
||
|
where the information in the perf.data file (which includes buildids)
|
||
|
is sufficient. You can also set the "record.build-id" config variable to
|
||
|
'no-cache' to have the same effect.
|
||
|
|
||
|
-G name,...::
|
||
|
--cgroup name,...::
|
||
|
monitor only in the container (cgroup) called "name". This option is available only
|
||
|
in per-cpu mode. The cgroup filesystem must be mounted. All threads belonging to
|
||
|
container "name" are monitored when they run on the monitored CPUs. Multiple cgroups
|
||
|
can be provided. Each cgroup is applied to the corresponding event, i.e., first cgroup
|
||
|
to first event, second cgroup to second event and so on. It is possible to provide
|
||
|
an empty cgroup (monitor all the time) using, e.g., -G foo,,bar. Cgroups must have
|
||
|
corresponding events, i.e., they always refer to events defined earlier on the command
|
||
|
line. If the user wants to track multiple events for a specific cgroup, the user can
|
||
|
use '-e e1 -e e2 -G foo,foo' or just use '-e e1 -e e2 -G foo'.
|
||
|
|
||
|
If wanting to monitor, say, 'cycles' for a cgroup and also for system wide, this
|
||
|
command line can be used: 'perf stat -e cycles -G cgroup_name -a -e cycles'.
|
||
|
|
||
|
-b::
|
||
|
--branch-any::
|
||
|
Enable taken branch stack sampling. Any type of taken branch may be sampled.
|
||
|
This is a shortcut for --branch-filter any. See --branch-filter for more infos.
|
||
|
|
||
|
-j::
|
||
|
--branch-filter::
|
||
|
Enable taken branch stack sampling. Each sample captures a series of consecutive
|
||
|
taken branches. The number of branches captured with each sample depends on the
|
||
|
underlying hardware, the type of branches of interest, and the executed code.
|
||
|
It is possible to select the types of branches captured by enabling filters. The
|
||
|
following filters are defined:
|
||
|
|
||
|
- any: any type of branches
|
||
|
- any_call: any function call or system call
|
||
|
- any_ret: any function return or system call return
|
||
|
- ind_call: any indirect branch
|
||
|
- ind_jmp: any indirect jump
|
||
|
- call: direct calls, including far (to/from kernel) calls
|
||
|
- u: only when the branch target is at the user level
|
||
|
- k: only when the branch target is in the kernel
|
||
|
- hv: only when the target is at the hypervisor level
|
||
|
- in_tx: only when the target is in a hardware transaction
|
||
|
- no_tx: only when the target is not in a hardware transaction
|
||
|
- abort_tx: only when the target is a hardware transaction abort
|
||
|
- cond: conditional branches
|
||
|
- call_stack: save call stack
|
||
|
- no_flags: don't save branch flags e.g prediction, misprediction etc
|
||
|
- no_cycles: don't save branch cycles
|
||
|
- hw_index: save branch hardware index
|
||
|
- save_type: save branch type during sampling in case binary is not available later
|
||
|
For the platforms with Intel Arch LBR support (12th-Gen+ client or
|
||
|
4th-Gen Xeon+ server), the save branch type is unconditionally enabled
|
||
|
when the taken branch stack sampling is enabled.
|
||
|
- priv: save privilege state during sampling in case binary is not available later
|
||
|
- counter: save occurrences of the event since the last branch entry. Currently, the
|
||
|
feature is only supported by a newer CPU, e.g., Intel Sierra Forest and
|
||
|
later platforms. An error out is expected if it's used on the unsupported
|
||
|
kernel or CPUs.
|
||
|
|
||
|
+
|
||
|
The option requires at least one branch type among any, any_call, any_ret, ind_call, cond.
|
||
|
The privilege levels may be omitted, in which case, the privilege levels of the associated
|
||
|
event are applied to the branch filter. Both kernel (k) and hypervisor (hv) privilege
|
||
|
levels are subject to permissions. When sampling on multiple events, branch stack sampling
|
||
|
is enabled for all the sampling events. The sampled branch type is the same for all events.
|
||
|
The various filters must be specified as a comma separated list: --branch-filter any_ret,u,k
|
||
|
Note that this feature may not be available on all processors.
|
||
|
|
||
|
-W::
|
||
|
--weight::
|
||
|
Enable weightened sampling. An additional weight is recorded per sample and can be
|
||
|
displayed with the weight and local_weight sort keys. This currently works for TSX
|
||
|
abort events and some memory events in precise mode on modern Intel CPUs.
|
||
|
|
||
|
--namespaces::
|
||
|
Record events of type PERF_RECORD_NAMESPACES. This enables 'cgroup_id' sort key.
|
||
|
|
||
|
--all-cgroups::
|
||
|
Record events of type PERF_RECORD_CGROUP. This enables 'cgroup' sort key.
|
||
|
|
||
|
--transaction::
|
||
|
Record transaction flags for transaction related events.
|
||
|
|
||
|
--per-thread::
|
||
|
Use per-thread mmaps. By default per-cpu mmaps are created. This option
|
||
|
overrides that and uses per-thread mmaps. A side-effect of that is that
|
||
|
inheritance is automatically disabled. --per-thread is ignored with a warning
|
||
|
if combined with -a or -C options.
|
||
|
|
||
|
-D::
|
||
|
--delay=::
|
||
|
After starting the program, wait msecs before measuring (-1: start with events
|
||
|
disabled), or enable events only for specified ranges of msecs (e.g.
|
||
|
-D 10-20,30-40 means wait 10 msecs, enable for 10 msecs, wait 10 msecs, enable
|
||
|
for 10 msecs, then stop). Note, delaying enabling of events is useful to filter
|
||
|
out the startup phase of the program, which is often very different.
|
||
|
|
||
|
-I::
|
||
|
--intr-regs::
|
||
|
Capture machine state (registers) at interrupt, i.e., on counter overflows for
|
||
|
each sample. List of captured registers depends on the architecture. This option
|
||
|
is off by default. It is possible to select the registers to sample using their
|
||
|
symbolic names, e.g. on x86, ax, si. To list the available registers use
|
||
|
--intr-regs=\?. To name registers, pass a comma separated list such as
|
||
|
--intr-regs=ax,bx. The list of register is architecture dependent.
|
||
|
|
||
|
--user-regs::
|
||
|
Similar to -I, but capture user registers at sample time. To list the available
|
||
|
user registers use --user-regs=\?.
|
||
|
|
||
|
--running-time::
|
||
|
Record running and enabled time for read events (:S)
|
||
|
|
||
|
-k::
|
||
|
--clockid::
|
||
|
Sets the clock id to use for the various time fields in the perf_event_type
|
||
|
records. See clock_gettime(). In particular CLOCK_MONOTONIC and
|
||
|
CLOCK_MONOTONIC_RAW are supported, some events might also allow
|
||
|
CLOCK_BOOTTIME, CLOCK_REALTIME and CLOCK_TAI.
|
||
|
|
||
|
-S::
|
||
|
--snapshot::
|
||
|
Select AUX area tracing Snapshot Mode. This option is valid only with an
|
||
|
AUX area tracing event. Optionally, certain snapshot capturing parameters
|
||
|
can be specified in a string that follows this option:
|
||
|
|
||
|
- 'e': take one last snapshot on exit; guarantees that there is at least one
|
||
|
snapshot in the output file;
|
||
|
- <size>: if the PMU supports this, specify the desired snapshot size.
|
||
|
|
||
|
In Snapshot Mode trace data is captured only when signal SIGUSR2 is received
|
||
|
and on exit if the above 'e' option is given.
|
||
|
|
||
|
--aux-sample[=OPTIONS]::
|
||
|
Select AUX area sampling. At least one of the events selected by the -e option
|
||
|
must be an AUX area event. Samples on other events will be created containing
|
||
|
data from the AUX area. Optionally sample size may be specified, otherwise it
|
||
|
defaults to 4KiB.
|
||
|
|
||
|
--proc-map-timeout::
|
||
|
When processing pre-existing threads /proc/XXX/mmap, it may take a long time,
|
||
|
because the file may be huge. A time out is needed in such cases.
|
||
|
This option sets the time out limit. The default value is 500 ms.
|
||
|
|
||
|
--switch-events::
|
||
|
Record context switch events i.e. events of type PERF_RECORD_SWITCH or
|
||
|
PERF_RECORD_SWITCH_CPU_WIDE. In some cases (e.g. Intel PT, CoreSight or Arm SPE)
|
||
|
switch events will be enabled automatically, which can be suppressed by
|
||
|
by the option --no-switch-events.
|
||
|
|
||
|
--vmlinux=PATH::
|
||
|
Specify vmlinux path which has debuginfo.
|
||
|
(enabled when BPF prologue is on)
|
||
|
|
||
|
--buildid-all::
|
||
|
Record build-id of all DSOs regardless whether it's actually hit or not.
|
||
|
|
||
|
--buildid-mmap::
|
||
|
Record build ids in mmap2 events, disables build id cache (implies --no-buildid).
|
||
|
|
||
|
--aio[=n]::
|
||
|
Use <n> control blocks in asynchronous (Posix AIO) trace writing mode (default: 1, max: 4).
|
||
|
Asynchronous mode is supported only when linking Perf tool with libc library
|
||
|
providing implementation for Posix AIO API.
|
||
|
|
||
|
--affinity=mode::
|
||
|
Set affinity mask of trace reading thread according to the policy defined by 'mode' value:
|
||
|
|
||
|
- node - thread affinity mask is set to NUMA node cpu mask of the processed mmap buffer
|
||
|
- cpu - thread affinity mask is set to cpu of the processed mmap buffer
|
||
|
|
||
|
--mmap-flush=number::
|
||
|
|
||
|
Specify minimal number of bytes that is extracted from mmap data pages and
|
||
|
processed for output. One can specify the number using B/K/M/G suffixes.
|
||
|
|
||
|
The maximal allowed value is a quarter of the size of mmaped data pages.
|
||
|
|
||
|
The default option value is 1 byte which means that every time that the output
|
||
|
writing thread finds some new data in the mmaped buffer the data is extracted,
|
||
|
possibly compressed (-z) and written to the output, perf.data or pipe.
|
||
|
|
||
|
Larger data chunks are compressed more effectively in comparison to smaller
|
||
|
chunks so extraction of larger chunks from the mmap data pages is preferable
|
||
|
from the perspective of output size reduction.
|
||
|
|
||
|
Also at some cases executing less output write syscalls with bigger data size
|
||
|
can take less time than executing more output write syscalls with smaller data
|
||
|
size thus lowering runtime profiling overhead.
|
||
|
|
||
|
-z::
|
||
|
--compression-level[=n]::
|
||
|
Produce compressed trace using specified level n (default: 1 - fastest compression,
|
||
|
22 - smallest trace)
|
||
|
|
||
|
--all-kernel::
|
||
|
Configure all used events to run in kernel space.
|
||
|
|
||
|
--all-user::
|
||
|
Configure all used events to run in user space.
|
||
|
|
||
|
--kernel-callchains::
|
||
|
Collect callchains only from kernel space. I.e. this option sets
|
||
|
perf_event_attr.exclude_callchain_user to 1.
|
||
|
|
||
|
--user-callchains::
|
||
|
Collect callchains only from user space. I.e. this option sets
|
||
|
perf_event_attr.exclude_callchain_kernel to 1.
|
||
|
|
||
|
Don't use both --kernel-callchains and --user-callchains at the same time or no
|
||
|
callchains will be collected.
|
||
|
|
||
|
--timestamp-filename
|
||
|
Append timestamp to output file name.
|
||
|
|
||
|
--timestamp-boundary::
|
||
|
Record timestamp boundary (time of first/last samples).
|
||
|
|
||
|
--switch-output[=mode]::
|
||
|
Generate multiple perf.data files, timestamp prefixed, switching to a new one
|
||
|
based on 'mode' value:
|
||
|
|
||
|
- "signal" - when receiving a SIGUSR2 (default value) or
|
||
|
- <size> - when reaching the size threshold, size is expected to
|
||
|
be a number with appended unit character - B/K/M/G
|
||
|
- <time> - when reaching the time threshold, size is expected to
|
||
|
be a number with appended unit character - s/m/h/d
|
||
|
|
||
|
Note: the precision of the size threshold hugely depends
|
||
|
on your configuration - the number and size of your ring
|
||
|
buffers (-m). It is generally more precise for higher sizes
|
||
|
(like >5M), for lower values expect different sizes.
|
||
|
|
||
|
A possible use case is to, given an external event, slice the perf.data file
|
||
|
that gets then processed, possibly via a perf script, to decide if that
|
||
|
particular perf.data snapshot should be kept or not.
|
||
|
|
||
|
Implies --timestamp-filename, --no-buildid and --no-buildid-cache.
|
||
|
The reason for the latter two is to reduce the data file switching
|
||
|
overhead. You can still switch them on with:
|
||
|
|
||
|
--switch-output --no-no-buildid --no-no-buildid-cache
|
||
|
|
||
|
--switch-output-event::
|
||
|
Events that will cause the switch of the perf.data file, auto-selecting
|
||
|
--switch-output=signal, the results are similar as internally the side band
|
||
|
thread will also send a SIGUSR2 to the main one.
|
||
|
|
||
|
Uses the same syntax as --event, it will just not be recorded, serving only to
|
||
|
switch the perf.data file as soon as the --switch-output event is processed by
|
||
|
a separate sideband thread.
|
||
|
|
||
|
This sideband thread is also used to other purposes, like processing the
|
||
|
PERF_RECORD_BPF_EVENT records as they happen, asking the kernel for extra BPF
|
||
|
information, etc.
|
||
|
|
||
|
--switch-max-files=N::
|
||
|
|
||
|
When rotating perf.data with --switch-output, only keep N files.
|
||
|
|
||
|
--dry-run::
|
||
|
Parse options then exit. --dry-run can be used to detect errors in cmdline
|
||
|
options.
|
||
|
|
||
|
'perf record --dry-run -e' can act as a BPF script compiler if llvm.dump-obj
|
||
|
in config file is set to true.
|
||
|
|
||
|
--synth=TYPE::
|
||
|
Collect and synthesize given type of events (comma separated). Note that
|
||
|
this option controls the synthesis from the /proc filesystem which represent
|
||
|
task status for pre-existing threads.
|
||
|
|
||
|
Kernel (and some other) events are recorded regardless of the
|
||
|
choice in this option. For example, --synth=no would have MMAP events for
|
||
|
kernel and modules.
|
||
|
|
||
|
Available types are:
|
||
|
|
||
|
- 'task' - synthesize FORK and COMM events for each task
|
||
|
- 'mmap' - synthesize MMAP events for each process (implies 'task')
|
||
|
- 'cgroup' - synthesize CGROUP events for each cgroup
|
||
|
- 'all' - synthesize all events (default)
|
||
|
- 'no' - do not synthesize any of the above events
|
||
|
|
||
|
--tail-synthesize::
|
||
|
Instead of collecting non-sample events (for example, fork, comm, mmap) at
|
||
|
the beginning of record, collect them during finalizing an output file.
|
||
|
The collected non-sample events reflects the status of the system when
|
||
|
record is finished.
|
||
|
|
||
|
--overwrite::
|
||
|
Makes all events use an overwritable ring buffer. An overwritable ring
|
||
|
buffer works like a flight recorder: when it gets full, the kernel will
|
||
|
overwrite the oldest records, that thus will never make it to the
|
||
|
perf.data file.
|
||
|
|
||
|
When '--overwrite' and '--switch-output' are used perf records and drops
|
||
|
events until it receives a signal, meaning that something unusual was
|
||
|
detected that warrants taking a snapshot of the most current events,
|
||
|
those fitting in the ring buffer at that moment.
|
||
|
|
||
|
'overwrite' attribute can also be set or canceled for an event using
|
||
|
config terms. For example: 'cycles/overwrite/' and 'instructions/no-overwrite/'.
|
||
|
|
||
|
Implies --tail-synthesize.
|
||
|
|
||
|
--kcore::
|
||
|
Make a copy of /proc/kcore and place it into a directory with the perf data file.
|
||
|
|
||
|
--max-size=<size>::
|
||
|
Limit the sample data max size, <size> is expected to be a number with
|
||
|
appended unit character - B/K/M/G
|
||
|
|
||
|
--num-thread-synthesize::
|
||
|
The number of threads to run when synthesizing events for existing processes.
|
||
|
By default, the number of threads equals 1.
|
||
|
|
||
|
ifdef::HAVE_LIBPFM[]
|
||
|
--pfm-events events::
|
||
|
Select a PMU event using libpfm4 syntax (see http://perfmon2.sf.net)
|
||
|
including support for event filters. For example '--pfm-events
|
||
|
inst_retired:any_p:u:c=1:i'. More than one event can be passed to the
|
||
|
option using the comma separator. Hardware events and generic hardware
|
||
|
events cannot be mixed together. The latter must be used with the -e
|
||
|
option. The -e option and this one can be mixed and matched. Events
|
||
|
can be grouped using the {} notation.
|
||
|
endif::HAVE_LIBPFM[]
|
||
|
|
||
|
--control=fifo:ctl-fifo[,ack-fifo]::
|
||
|
--control=fd:ctl-fd[,ack-fd]::
|
||
|
ctl-fifo / ack-fifo are opened and used as ctl-fd / ack-fd as follows.
|
||
|
Listen on ctl-fd descriptor for command to control measurement.
|
||
|
|
||
|
Available commands:
|
||
|
|
||
|
- 'enable' : enable events
|
||
|
- 'disable' : disable events
|
||
|
- 'enable name' : enable event 'name'
|
||
|
- 'disable name' : disable event 'name'
|
||
|
- 'snapshot' : AUX area tracing snapshot).
|
||
|
- 'stop' : stop perf record
|
||
|
- 'ping' : ping
|
||
|
- 'evlist [-v|-g|-F] : display all events
|
||
|
|
||
|
-F Show just the sample frequency used for each event.
|
||
|
-v Show all fields.
|
||
|
-g Show event group information.
|
||
|
|
||
|
Measurements can be started with events disabled using --delay=-1 option. Optionally
|
||
|
send control command completion ('ack\n') to ack-fd descriptor to synchronize with the
|
||
|
controlling process. Example of bash shell script to enable and disable events during
|
||
|
measurements:
|
||
|
|
||
|
#!/bin/bash
|
||
|
|
||
|
ctl_dir=/tmp/
|
||
|
|
||
|
ctl_fifo=${ctl_dir}perf_ctl.fifo
|
||
|
test -p ${ctl_fifo} && unlink ${ctl_fifo}
|
||
|
mkfifo ${ctl_fifo}
|
||
|
exec {ctl_fd}<>${ctl_fifo}
|
||
|
|
||
|
ctl_ack_fifo=${ctl_dir}perf_ctl_ack.fifo
|
||
|
test -p ${ctl_ack_fifo} && unlink ${ctl_ack_fifo}
|
||
|
mkfifo ${ctl_ack_fifo}
|
||
|
exec {ctl_fd_ack}<>${ctl_ack_fifo}
|
||
|
|
||
|
perf record -D -1 -e cpu-cycles -a \
|
||
|
--control fd:${ctl_fd},${ctl_fd_ack} \
|
||
|
-- sleep 30 &
|
||
|
perf_pid=$!
|
||
|
|
||
|
sleep 5 && echo 'enable' >&${ctl_fd} && read -u ${ctl_fd_ack} e1 && echo "enabled(${e1})"
|
||
|
sleep 10 && echo 'disable' >&${ctl_fd} && read -u ${ctl_fd_ack} d1 && echo "disabled(${d1})"
|
||
|
|
||
|
exec {ctl_fd_ack}>&-
|
||
|
unlink ${ctl_ack_fifo}
|
||
|
|
||
|
exec {ctl_fd}>&-
|
||
|
unlink ${ctl_fifo}
|
||
|
|
||
|
wait -n ${perf_pid}
|
||
|
exit $?
|
||
|
|
||
|
--threads=<spec>::
|
||
|
Write collected trace data into several data files using parallel threads.
|
||
|
<spec> value can be user defined list of masks. Masks separated by colon
|
||
|
define CPUs to be monitored by a thread and affinity mask of that thread
|
||
|
is separated by slash:
|
||
|
|
||
|
<cpus mask 1>/<affinity mask 1>:<cpus mask 2>/<affinity mask 2>:...
|
||
|
|
||
|
CPUs or affinity masks must not overlap with other corresponding masks.
|
||
|
Invalid CPUs are ignored, but masks containing only invalid CPUs are not
|
||
|
allowed.
|
||
|
|
||
|
For example user specification like the following:
|
||
|
|
||
|
0,2-4/2-4:1,5-7/5-7
|
||
|
|
||
|
specifies parallel threads layout that consists of two threads,
|
||
|
the first thread monitors CPUs 0 and 2-4 with the affinity mask 2-4,
|
||
|
the second monitors CPUs 1 and 5-7 with the affinity mask 5-7.
|
||
|
|
||
|
<spec> value can also be a string meaning predefined parallel threads
|
||
|
layout:
|
||
|
|
||
|
- cpu - create new data streaming thread for every monitored cpu
|
||
|
- core - create new thread to monitor CPUs grouped by a core
|
||
|
- package - create new thread to monitor CPUs grouped by a package
|
||
|
- numa - create new threed to monitor CPUs grouped by a NUMA domain
|
||
|
|
||
|
Predefined layouts can be used on systems with large number of CPUs in
|
||
|
order not to spawn multiple per-cpu streaming threads but still avoid LOST
|
||
|
events in data directory files. Option specified with no or empty value
|
||
|
defaults to CPU layout. Masks defined or provided by the option value are
|
||
|
filtered through the mask provided by -C option.
|
||
|
|
||
|
--debuginfod[=URLs]::
|
||
|
Specify debuginfod URL to be used when cacheing perf.data binaries,
|
||
|
it follows the same syntax as the DEBUGINFOD_URLS variable, like:
|
||
|
|
||
|
http://192.168.122.174:8002
|
||
|
|
||
|
If the URLs is not specified, the value of DEBUGINFOD_URLS
|
||
|
system environment variable is used.
|
||
|
|
||
|
--off-cpu::
|
||
|
Enable off-cpu profiling with BPF. The BPF program will collect
|
||
|
task scheduling information with (user) stacktrace and save them
|
||
|
as sample data of a software event named "offcpu-time". The
|
||
|
sample period will have the time the task slept in nanoseconds.
|
||
|
|
||
|
Note that BPF can collect stack traces using frame pointer ("fp")
|
||
|
only, as of now. So the applications built without the frame
|
||
|
pointer might see bogus addresses.
|
||
|
|
||
|
include::intel-hybrid.txt[]
|
||
|
|
||
|
SEE ALSO
|
||
|
--------
|
||
|
linkperf:perf-stat[1], linkperf:perf-list[1], linkperf:perf-intel-pt[1]
|