nvprof -help - 全栈程序员必看

大家好，又见面了，我是你们的朋友全栈君。
Usage: nvprof [options] [application] [application-arguments]

Options:

--aggregate-mode <on|off>

Turn on/off aggregate mode for events and metrics specified

by subsequent "--events" and "--metrics" options. Those

event/metric values will be collected for each domain instance,

instead of the whole device. Allowed values:

on - turn on aggregate mode (default)

off - turn off aggregate mode



--analysis-metrics

Collect profiling data that can be imported to Visual Profiler's

"analysis" mode. Note: Use "--export-profile" to specify

an export file.



--concurrent-kernels <on|off>

Turn on/off concurrent kernel execution. If concurrent kernel

execution is off, all kernels running on one device will

be serialized. Allowed values:

on - turn on concurrent kernel execution (default)

off - turn off concurrent kernel execution



--continuous-sampling-interval <interval>

Set the continuous mode sampling interval in milliseconds.

Minimum is 1 ms. Default is 2 ms.



--dependency-analysis

Generate event dependency graph for host and device activities

and run dependency analysis.



--device-buffer-size <size in MBs>

Set the device memory size (in MBs) reserved for storing

profiling data for non-CDP operations, especially for concurrent

kernel tracing, for each buffer on a context. The default

value is 8MB. The size should be a positive integer.



--device-cdp-buffer-size <size in MBs>

Set the device memory size (in MBs) reserved for storing

profiling data for CDP operations for each buffer on a context.

The default value is 8MB. The size should be a positive

integer.



--devices <device ids>

Change the scope of subsequent "--events", "--metrics", "--query-events"

and "--query-metrics" options.

Allowed values:

all - change scope to all valid devices

comma-separated device IDs - change scope to specified

devices



--event-collection-mode <mode>

Choose event collection mode for all events/metrics Allowed

values:

kernel - events/metrics are collected only for durations

of kernel executions (default)

continuous - events/metrics are collected for duration

of application. This is not applicable for non-tesla devices.

This mode is compatible only with NVLink events/metrics.

This modeis incompatible with "--profile-all-processes"

or "--profile-child-processes" or "--replay-mode kernel"

or "--replay-mode application".



-e, --events <event names>

Specify the events to be profiled on certain device(s). Multiple

event names separated by comma can be specified. Which device(s)

are profiled is controlled by the "--devices" option. Otherwise

events will be collected on all devices.

For a list of available events, use "--query-events".

Use "--events all" to profile all events available for each

device.

Use "--devices" and "--kernels" to select a specific kernel

invocation.



--kernel-latency-timestamps <on|off>

Turn on/off collection of kernel latency timestamps, namely

queued and submitted. The queued timestamp is captured when

a kernel launch command was queued into the CPU command

buffer. The submitted timestamp denotes when the CPU command

buffer containing this kernel launch was submitted to the

GPU. Turning this option on may incur an overhead during

profiling. Allowed values:

on - turn on collection of kernel latency timestamps

off - turn off collection of kernel latency timestamps

(default)



--kernels <kernel path syntax>

Change the scope of subsequent "--events", "--metrics" options.

The syntax is as follows:

<kernel name>

Limit scope to given kernel name.

or

<context id/name>:<stream id/name>:<kernel name>:<invocation>

The context/stream IDs, names, kernel name and invocation

can be regular expressions. Empty string matches any number

or characters. If <context id/name> or <stream id/name>

is a positive number, it's strictly matched against the

CUDA context/stream ID. Otherwise it's treated as a regular

expression and matched against the context/stream name specified

by the NVTX library. If the invocation count is a positive

number, it's strictly matched against the invocation of

the kernel. Otherwise it's treated as a regular expression.

Example: --kernels "1:foo:bar:2" will profile any kernel

whose name contains "bar" and is the 2nd instance on context

1 and on stream named "foo".



-m, --metrics <metric names>

Specify the metrics to be profiled on certain device(s).

Multiple metric names separated by comma can be specified.

Which device(s) are profiled is controlled by the "--devices"

option. Otherwise metrics will be collected on all devices.

For a list of available metrics, use "--query-metrics".

Use "--metrics all" to profile all metrics available for

each device.

Use "--devices" and "--kernels" to select a specific kernel

invocation.

Note: "--metrics all" does not include some metrics which

are needed for Visual Profiler's source level analysis.

For that, use "--analysis-metrics".



--pc-sampling-period <period>

Specify PC Sampling period in cycles, at which the sampling

records will be dumped. Allowed values for the period are

integers between 5 to 31 both inclusive.

This will set the sampling period to (2^period) cycles

Default value is a number between 5 and 12 based on the setup.Note:

Only available for GM20X+.





--profile-all-processes

Profile all processes launched by the same user who launched

this nvprof instance. Note: Only one instance of nvprof

can run with this option at the same time. Under this mode,

there's no need to specify an application to run.



--profile-api-trace <none|runtime|driver|all>

Turn on/off CUDA runtime/driver API tracing. Allowed values:

none - turn off API tracing

runtime - only turn on CUDA runtime API tracing

driver - only turn on CUDA driver API tracing

all - turn on all API tracing (default)



--profile-child-processes

Profile the application and all child processes launched

by it.



--profile-from-start <on|off>

Enable/disable profiling from the start of the application.

If it's disabled, the application can use {cu,cuda}Profiler{Start,Stop}

to turn on/off profiling. Allowed values:

on - enable profiling from start (default)

off - disable profiling from start



--profiling-semaphore-pool-size <count>

Set the profiling semaphore pool size reserved for storing

profiling data for serialized kernels and memory operations

for each context. The default value is 65536. The size should

be a positive integer.



--query-events

List all the events available on the device(s). Device(s)

queried can be controlled by the "--devices" option.



--query-metrics

List all the metrics available on the device(s). Device(s)

queried can be controlled by the "--devices" option.



--replay-mode <mode>

Choose replay mode used when not all events/metrics can be

collected in a single run. Allowed values:

disabled - replay is disabled, events/metrics couldn't

be profiled will be dropped

kernel - each kernel invocation is replayed (default)

application - the entire application is replayed.

This modeis incompatible with "--profile-all-processes"

or "profile-child-processes".



-a, --source-level-analysis <source level analysis names>

Specify the source level metrics to be profiled on a certain

kernel invocation. Use "--devices" and "--kernels" to select

a specific kernel invocation. Allowed values: one or more

of the following, separated by commas

global_access: global access

shared_access: shared access

branch: divergent branch

instruction_execution: instruction execution

pc_sampling: pc sampling, available only for GM20X+

Note: Use "--export-profile" to specify an export file.



--system-profiling <on|off>

Turn on/off power, clock, and thermal profiling. Allowed

values:

on - turn on system profiling

off - turn off system profiling (default)



-t, --timeout <seconds>

Set an execution timeout (in seconds) for the CUDA application.

Note: Timeout starts counting from the moment the CUDA driver

is initialized. If the application doesn't call any CUDA

APIs, timeout won't be triggered.



--track-memory-allocations <on|off>

Turn on/off tracking of memory operations, which involves

recording timestamps, memory size, memory type and program

counters of the memory allocations and frees. Turning this

option on may incur an overhead during profiling. Allowed

values:

on - turn on tracking of memory allocations and

free

off - turn off tracking of memory allocations and

free (default)



--unified-memory-profiling <per-process-device|off>

Configure unified memory profiling. Allowed values:

per-process-device - collect counts for each process

and each device (default)

off - turn off unified memory profiling



--cpu-profiling <on|off>

Turn on CPU profiling. Note: CPU profiling is not supported

in multi-process mode.



--cpu-profiling-explain-ccff <filename>

Path to a PGI pgexplain.xml file that should be used to interpret

Common Compiler Feedback Format (CCFF) messages.



--cpu-profiling-frequency <frequency>

Set the CPU profiling frequency in samples per second. Default

is 25Hz. Maximum is 500Hz.



--cpu-profiling-max-depth <depth>

Set the maximum depth of each call stack. Zero means no limit.

Default is zero.



--cpu-profiling-mode <flat|top-down|bottom-up>

Set the output mode of CPU profiling. Allowed values:

flat - Show flat profile

top-down - Show parent functions at the top

bottom-up - Show parent functions at the bottom

(default)



--cpu-profiling-percentage-threshold <threshold>

Filter out the entries that are below the set percentage

threshold. The limit should be an integer between 0 and

100, inclusive. Zero means no limit. Default is zero.



--cpu-profiling-scope <function|instruction>

Choose the profiling scope. Allowed values:

function - Each level in the stack trace represents

a distinct function (default)

instruction - Each level in the stack trace represents

a distinct instruction address



--cpu-profiling-show-ccff <on|off>

Choose whether to print Common Compiler Feedback Format (CCFF)

messages embedded in the binary. Note: this option implies

"--cpu-profiling-scope instruction".Default is off.



--cpu-profiling-show-library <on|off>

Choose whether to print the library name for each sample.



--cpu-profiling-thread-mode <separated|aggregated>

Set the thread mode of CPU profiling. Allowed values:

separated - Show separate profile for each thread

aggregated - Aggregate data from all threads (default)



--cpu-profiling-unwind-stack <on|off>

Choose whether to unwind the CPU call-stack at each sample

point. Default is on.



--openacc-profiling <on|off>

Enable/disable recording information from the OpenACC profiling

interface. Note: if the OpenACC profiling interface is available

depends on the OpenACC runtime. Default is on.



--context-name <name>

Name of the CUDA context.

"%i" in the context name string is replaced with

the ID of the context.

"%p" in the context name string is replaced with

the process ID of the application being profiled.

"%q{<ENV>}" in the context name string is replaced

with the value of the environment variable "<ENV>". If the

environment variable is not set it's an error.

"%h" in the context name string is replaced with

the hostname of the system.

"%%" in the context name string is replaced with

"%". Any other character following "%" is illegal.



--csv

Use comma-separated values in the output.



--demangling <on|off>

Turn on/off C++ name demangling of function names. Allowed

values:

on - turn on demangling (default)

off - turn off demangling



-u, --normalized-time-unit <s|ms|us|ns|col|auto>

Specify the unit of time that will be used in the output.

Allowed values:

s - second, ms - millisecond, us - microsecond,

ns - nanosecond

col - a fixed unit for each column

auto (default) - the scale is chosen for each value

based on its length.



--openacc-summary-mode <mode>

Set how durations are computed in the OpenACC summary. Allowed

values:

exclusive: show exclusive times (default)

inclusive: show inclusive times



--print-api-summary

Print a summary of CUDA runtime/driver API calls.



--print-api-trace

Print CUDA runtime/driver API trace.



--print-dependency-analysis-trace

Print dependency analysis trace.



--print-gpu-summary

Print a summary of the activities on the GPU (including CUDA

kernels and memcpy's/memset's).



--print-gpu-trace

Print individual kernel invocations (including CUDA memcpy's/memset's)

and sort them in chronological order. In event/metric profiling

mode, show events/metrics for each kernel invocation.



--print-openacc-constructs

Include parent construct names in OpenACC profile.



--print-openacc-summary

Print a summary of the OpenACC profile.



--print-openacc-trace

Print a trace of the OpenACC profile.



-s, --print-summary

Print a summary of the profiling result on screen. Note:

This is the default unless "--export-profile" or other print

options are used.



--print-summary-per-gpu

Print a summary of the profiling result for each GPU.



--process-name <name>

Name of the process.

"%p" in the process name string is replaced with

the process ID of the application being profiled.

"%q{<ENV>}" in the process name string is replaced

with the value of the environment variable "<ENV>". If the

environment variable is not set it's an error.

"%h" in the process name string is replaced with

the hostname of the system.

"%%" in the process name string is replaced with

"%". Any other character following "%" is illegal.



--quiet

Suppress all nvprof output.



--stream-name <name>

Name of the CUDA stream.

"%i" in the stream name string is replaced with the

ID of the stream.

"%p" in the stream name string is replaced with

the process ID of the application being profiled.

"%q{<ENV>}" in the stream name string is replaced

with the value of the environment variable "<ENV>". If the

environment variable is not set it's an error.

"%h" in the stream name string is replaced with

the hostname of the system.

"%%" in the stream name string is replaced with

"%". Any other character following "%" is illegal.



-o, --export-profile <filename>

Export the result file which can be imported later or opened

by the NVIDIA Visual Profiler.

"%p" in the file name string is replaced with the

process ID of the application being profiled.

"%q{<ENV>}" in the file name string is replaced

with the value of the environment variable "<ENV>". If the

environment variable is not set it's an error.

"%h" in the file name string is replaced with the

hostname of the system.

"%%" in the file name string is replaced with "%".

Any other character following "%" is illegal.

By default, this option disables the summary output. Note:

If the application being profiled creates child processes,

or if '--profile-all-processes' is used, the "%p" format

is needed to get correct export files for each process.



-f, --force-overwrite

Force overwriting all output files (any existing files will

be overwritten).



-i, --import-profile <filename>

Import a result profile from a previous run.



--log-file <filename>

Make nvprof send all its output to the specified file, or

one of the standard channels. The file will be overwritten.

If the file doesn't exist, a new one will be created.

"%1" as the whole file name indicates standard output

channel (stdout).

"%2" as the whole file name indicates standard error

channel (stderr). Note: This is the default.

"%p" in the file name string is replaced with the

process ID of the application being profiled.

"%q{<ENV>}" in the file name string is replaced

with the value of the environment variable "<ENV>". If the

environment variable is not set it's an error.

"%h" in the file name string is replaced with the

hostname of the system.

"%%" in the file name is replaced with "%".

Any other character following "%" is illegal.



--print-nvlink-topology

Print nvlink topology



-h, --help

Print this help information.



-V, --version

Print version information of this tool.
发布者：全栈程序员-站长，转载请注明出处：https://javaforall.net/135846.html原文链接：https://javaforall.net