hpcrun-flat:
Statistical Profiling
The HPCToolkit Performance Tools
2011/02/22
Version 5.3.2-r4197
hpcrun-flat
is a flat statistical sampling-based profiler.
It supports multiple sample sources during one execution and creates an IP (instruction pointer) histogram, or flat profile, for each sample source.
hpcrun-flat
profiles complex applications (forks, execs, threads and dynamically loaded libraries) and may be used in conjunction with parallel process launchers such as MPICH's mpiexec
and SLURM's srun.
See hpctoolkit(1)
for an overview of HPCToolkit.
Table of Contents
hpcrun-flat
[profiling-options]
[--]
command
[command-arguments]
hpcrun-flat
[info-options]
hpcrun-flat
profiles the execution of an arbitrary command command
using statistical sampling (rather than instrumentation).
It collects per-thread flat profiles, also known as IP (instruction pointer) histograms.
Sample points may be generated from multiple simultaneous sampling sources.
hpcrun-flat
profiles complex applications that use forks, execs, and threads (but not dynamic linking/unlinking); it may be used in conjuction with parallel process launchers such as MPICH's mpiexec
and SLURM's srun.
To configure hpcrun-flat's
sampling sources, specify events and periods using the -e/--event
option.
For an event e
and period p,
after every p
instances of e,
a sample is generated that causes hpcrun
to inspect the and record information about the monitored command.
When command
terminates, per-thread profiles are written to files with the names of the form:
command.hpcrun-flat.hostname.pid.tid
hpcrun-flat
enables a user to abort a process and write the partial profiling data to disk by sending the Interrupt signal (INT or Ctrl-C).
This can be extremely useful on long-running or misbehaving applications.
The special option --
can be used to stop hpcrun-flat
option parsing; this is especially useful when command
takes arguments of its own.
- command
- The command to profile.
- command-arguments
- Arguments to the command to profile.
Default values for an option's optional arguments are shown in {}.
- -l, --list-events-short
- List available events. (N.B.: some may not be profilable)
- -L, ---list-events-long
- Similar to above but with more information.
- --paths
- Print paths for external PAPI and MONITOR.
- -V, --version
- Print version information.
- -h, --help
- Print help.
- --debug [n]
- Debug: use debug level n.
{1}
- -r [yes |no], --recursive [yes |no]
-
Profile processes spawned by command.
{no}. (Each process will receive its own output file.)
- -t mode, --threads mode
-
Select thread profiling mode {each}:
- each Create separate profiles for each thread.
- all Create one combined profile of all threads.
Note that only POSIX threads are supported.
Also note that the WALLCLOCK event cannot be used in a multithreaded process.
- -e event[:period], --event event[:period]
-
An event to profile and its corresponding sample period. event
may be either a PAPI or native processor event. May pass multiple times. {PAPI_TOT_CYC:999999}
- It is recommended to always specify the sampling period for each profiling event.
- The special event WALLCLOCK may be used to profile the `wall clock.' It may be used only once and cannot be used with another event. It is an error to specify a period.
- Hardware and drivers 1) limit the maximum number of events that can be monitored simultaneously, and 2) forbid certain combinations of events. Check your documentation.
- See the ``Sample sources'' under NOTES for additional details.
- -o [outpath], --output [outpath]
- Directory for output data. {.}
- --papi-flag flag
- Profile style flag. {PAPI_POSIX_PROFIL}
Assume we wish to profile the application zoo.
The following examples lists some useful events for different processor architectures.
In each case, the special option --
is used to clearly demarcate the end of hpcrun-flat
options.
- hpcrun-flat -e WALLCLOCK -- zoo
- Opteron, (Rev B-F)
- hpcrun-flat -e DC_L2_REFILL:1300013 -e PAPI_L2_DCM:510011 -e PAPI_STL_ICY:5300013 -e PAPI_TOT_CYC:13000019 -- zoo (DC_L2_REFILL is an approximation of L1 D-cache misses).
- hpcrun-flat -e PAPI_L2_DCM:510011 -e PAPI_TLB_DM:510013 -e PAPI_STL_ICY:5300013 -e PAPI_TOT_CYC:13000019 -- zoo
- Pentium IV
- hpcrun-flat -e PAPI_TOT_CYC:30000001 -e PAPI_TOT_INS:3000001 -e PAPI_FP_INS:1000001 -e PAPI_LD_INS:1000001 -e PAPI_TLB_TL:32767 -e PAPI_L2_TCM:32767 -e PAPI_RES_STL:1000001 -e BSQ_cache_reference_RD_3rdL_MISS -- zoo
- hpcrun-flat -e PAPI_SR_INS:1000001 -e PAPI_L1_DCM:32767 -e resource_stall_SBFULL:32767 -- zoo
- hpcrun-flat -e PAPI_FP_OPS:32767 -e PAPI_BR_MSP:32767 -- zoo
- Itanium 2
- hpcrun-flat -e BE_EXE_BUBBLE_ALL:344221 -e BE_L1D_FPU_BUBBLE_ALL:344221 -e FE_BUBBLE_ALL:144221 -e PAPI_TOT_CYC:344221 -- zoo
- hpcrun-flat -e PAPI_L1_DCM:144221 -e PAPI_FP_OPS:344221 -e PAPI_TOT_CYC:1044221 -- zoo
- Cycle accounting events: BACK_END_BUBBLE_ALL, BE_EXE_BUBBLE_ALL, BE_L1D_FPU_BUBBLE_ALL, BE_FLUSH_BUBBLE_ALL, BE_RSE_BUBBLE_ALL, FE_BUBBLE_ALL
hpcrun-flat
uses the PAPI library to provide access to hardware performance counter events.
If you have not configured HPCToolkit to use the PAPI library, you will be unable to measure hardware performance counter events.
The PAPI library supports a large collection of hardware counter events.
Some events have standard names across all platforms, e.g. PAPI_TOT_CYC, the event that measures total cycles.
In addition to events whose names begin with the PAPI_ prefix, platforms also provide access to a set of native events with names that are specific to the platform's processor.
A complete list of events supported by the PAPI library for your platform may be obtained by using the --list-events
option.
Any event whose name begins with the PAPI_ prefix that is listed as "Profilable" can be used as an event in a sampling source --- provided it does not conflict with another event.
The precise rules for selecting good events and periods are complex.
- Choosing sampling events.
We recommend using PAPI events in general.
However, some PAPI events are not profilable because of PAPI implementation details.
Also, PAPI's standard event list may not cover an architectural feature you are interested in.
In such cases, it is necessary to resort to native events.
In many cases, you will have to consult the architecture's manual to fully understand what the event means: there is no standard event list or naming scheme and events sometimes have unusual meanings.
- Number of sampling events.
Currently, hpcrun does not multiplex hardware counters.
This means that the number of events that you may concurrently profile against is limited by your architecture's performance monitoring unit.
Note that some architectures hard-wire one or more counters to a specific event (such as cycles).
- Choosing sampling periods.
The key requirement in choosing sampling periods is that you obtain enough samples to provide statistical significance.
We usually recommend a sampling rate between 100s-1000s of samples per second.
This usually only produces 1-5% execution time overhead.
Choosing sampling rates depends on the architecture and sometimes the application.
Choosing periods for cycle and instruction-related events are usually easy.
Cycles directly relates to the clock speed.
Instruction-related events relates to the issue rate and width.
Choosing periods for other events seems harder because different applications uses resources differently.
For example, some applications are memory intensive and others are not.
However, if the goal is to identify rate-limiting factors of the architecture, then it is not necessary to consider the application.
For example, if the goal is to determine whether L2 D-cache latency is a limiting factor, then it is only necessary to work backward from the architecture's specifications to determine what number of L2 D-cache misses per second would be problematic.
- Architectural event conflicts.
With some performance monitoring units, certain events may not be concurrently used with other events.
- Architectural interrupt delay.
On most out-of-order pipelined architectures, a hardware counter interrupt is not precisely attributed to the instruction that induced the counter to overflow.
The gap is commonly 50-70 instructions.
This means that one should not assume that aggregation at the source line level is fully precise.
(E.g., if a L1 D-cache miss is attributed to a statement that has been compiled to register-only operations, assume the miss is attributed to a nearby load.)
However, aggregation at the procedure and loop level is reliable.
- System itimer (WALLCLOCK).
On Linux systems, the kernel will not deliver itimer interrupts faster than the unit of a jiffy, which defaults to 4 milliseconds.
One can configure the kernel to use a value as small as 1 millisecond, but it is unlikely the kernel will actually deliver itimer signals at that rate when a period of 1000 microseconds is requested.
However, on Linux one can get quite close to the kernel Hz rate by setting the itimer interval to something less than the Hz rate.
For example, if the Hz rate is 1000 microseconds, one can use 500 microseconds (or just 1) and obtain about 999 interrupts per second.
- hpcrun-flat uses preloaded shared libraries to initiate profiling. For this reason, it cannot be used to profile setuid programs.
- For the same reason, it cannot profile statically linked applications.
- Bug: hpcrun-flat cannot currently profile programs that themselves use preloading.
hpctoolkit(1)
.
Version: 5.3.2-r4197 of 2011/02/22.
- Copyright
- © 2002-2013, Rice University.
- License
- See README.License.
Nathan Tallent
John Mellor-Crummey
Rob Fowler
Rice HPCToolkit Research Group
Email: hpctoolkit-forum =at= rice.edu
WWW: http://hpctoolkit.org.