
Upgrade gperftools to 2.9 for GCP/GMC/GMP/GMD. (#2247)

This commit is contained in:
winlin 2021-12-12 15:38:30 +08:00
parent 63da0dca92
commit 44e9dc83e9
346 changed files with 169666 additions and 78 deletions

@@ -0,0 +1,264 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML>
<HEAD>
<link rel="stylesheet" href="designstyle.css">
<title>Google CPU Profiler Binary Data File Format</title>
</HEAD>
<BODY>
<h1>Google CPU Profiler Binary Data File Format</h1>
<p align=right>
<i>Last modified
<script type=text/javascript>
var lm = new Date(document.lastModified);
document.write(lm.toDateString());
</script></i>
</p>
<p>This file documents the binary data file format produced by the
Google CPU Profiler. For information about using the CPU Profiler,
see <a href="cpuprofile.html">its user guide</a>.
<p>The profiler source code, which generates files using this format, is at
<code>src/profiler.cc</code>.
<h2>CPU Profile Data File Structure</h2>
<p>CPU profile data files each consist of four parts, in order:
<ul>
<li> Binary header
<li> Binary profile records
<li> Binary trailer
<li> Text list of mapped objects
</ul>
<p>The binary data is expressed in terms of "slots." These are words
large enough to hold the program's pointer type, i.e., for 32-bit
programs they are 4 bytes in size, and for 64-bit programs they are 8
bytes. They are stored in the profile data file in the native byte
order (i.e., little-endian for x86 and x86_64).
<h2>Binary Header</h2>
<p>The binary header format is shown below.  Values written by the
profiler, along with requirements currently enforced by the analysis
tools, are shown in parentheses.
<p>
<table summary="Header Format"
frame="box" rules="sides" cellpadding="5" width="50%">
<tr>
<th width="30%">slot</th>
<th width="70%">data</th>
</tr>
<tr>
<td>0</td>
<td>header count (0; must be 0)</td>
</tr>
<tr>
<td>1</td>
<td>header slots after this one (3; must be &gt;= 3)</td>
</tr>
<tr>
<td>2</td>
<td>format version (0; must be 0)</td>
</tr>
<tr>
<td>3</td>
<td>sampling period, in microseconds</td>
</tr>
<tr>
<td>4</td>
<td>padding (0)</td>
</tr>
</table>
<p>The headers currently generated for 32-bit and 64-bit little-endian
(x86 and x86_64) profiles are shown below, for comparison.
<p>
<table summary="Header Example" frame="box" rules="sides" cellpadding="5">
<tr>
<th></th>
<th>hdr count</th>
<th>hdr words</th>
<th>version</th>
<th>sampling period</th>
<th>pad</th>
</tr>
<tr>
<td>32-bit or 64-bit (slots)</td>
<td>0</td>
<td>3</td>
<td>0</td>
<td>10000</td>
<td>0</td>
</tr>
<tr>
<td>32-bit (4-byte words in file)</td>
<td><tt>0x00000</tt></td>
<td><tt>0x00003</tt></td>
<td><tt>0x00000</tt></td>
<td><tt>0x02710</tt></td>
<td><tt>0x00000</tt></td>
</tr>
<tr>
<td>64-bit LE (4-byte words in file)</td>
<td><tt>0x00000&nbsp;0x00000</tt></td>
<td><tt>0x00003&nbsp;0x00000</tt></td>
<td><tt>0x00000&nbsp;0x00000</tt></td>
<td><tt>0x02710&nbsp;0x00000</tt></td>
<td><tt>0x00000&nbsp;0x00000</tt></td>
</tr>
</table>
<p>The contents are shown in terms of slots, and in terms of 4-byte
words in the profile data file. The slot contents for 32-bit and
64-bit headers are identical. For 32-bit profiles, the 4-byte word
view matches the slot view. For 64-bit profiles, each (8-byte) slot
is shown as two 4-byte words, ordered as they would appear in the
file.
<p>The profiling tools examine the contents of the file and use the
expected locations and values of the header words to detect
whether the file is 32-bit or 64-bit.
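<p>For illustration, here is a minimal sketch of that detection logic.  It
is not gperftools code; it assumes the profile is already loaded into
memory and a little-endian host, as described above:</p>
<pre>
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

// Returns the detected slot size in bytes (4 or 8), or 0 if the data does
// not start with a valid profile header.
static size_t DetectSlotSize(const unsigned char* data, size_t len) {
  static const size_t kCandidates[] = {4, 8};
  for (int c = 0; c != 2; ++c) {
    size_t slot_size = kCandidates[c];
    if (len &lt; 5 * slot_size) continue;     // the header is 5 slots long
    uint64_t slot[3] = {0, 0, 0};
    for (int i = 0; i != 3; ++i)              // read slots 0..2 (little-endian)
      memcpy(&slot[i], data + i * slot_size, slot_size);
    // slot 0: header count (must be 0); slot 1: header slots after this one
    // (must be &gt;= 3); slot 2: format version (must be 0).
    if (slot[0] == 0 && slot[1] &gt;= 3 && slot[2] == 0)
      return slot_size;
  }
  return 0;
}
</pre>
<p>Only one of the two candidate sizes can pass: a 64-bit header read with
4-byte slots yields 0 in slot 1, while a 32-bit header read with 8-byte
slots yields a nonzero slot 0.</p>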
<h2>Binary Profile Records</h2>
<p>The binary profile record format is shown below.
<p>
<table summary="Profile Record Format"
frame="box" rules="sides" cellpadding="5" width="50%">
<tr>
<th width="30%">slot</th>
<th width="70%">data</th>
</tr>
<tr>
<td>0</td>
<td>sample count, must be &gt;= 1</td>
</tr>
<tr>
<td>1</td>
<td>number of call chain PCs (num_pcs), must be &gt;= 1</td>
</tr>
<tr>
<td>2 .. (num_pcs + 1)</td>
<td>call chain PCs, most-recently-called function first.</td>
</tr>
</table>
<p>The total length of a given record is 2 + num_pcs slots.
<p>Note that multiple profile records can be emitted by the profiler
having an identical call chain. In that case, analysis tools should
sum the counts of all records having identical call chains.
<p><b>Note:</b> Some profile analysis tools terminate if they see
<em>any</em> profile record whose call chain's first entry
has the address 0.  (This is similar to the binary trailer.)
<h3>Example</h3>
This example shows the slots contained in a sample profile record.
<p>
<table summary="Profile Record Example"
frame="box" rules="sides" cellpadding="5">
<tr>
<td>5</td>
<td>3</td>
<td>0xa0000</td>
<td>0xc0000</td>
<td>0xe0000</td>
</tr>
</table>
<p>In this example, 5 ticks were received at PC 0xa0000, whose
function had been called by the function containing 0xc0000, which had
been called from the function containing 0xe0000.
<h2>Binary Trailer</h2>
<p>The binary trailer consists of three slots of data with fixed
values, shown below.
<p>
<table summary="Trailer Format"
frame="box" rules="sides" cellpadding="5" width="50%">
<tr>
<th width="30%">slot</th>
<th width="70%">value</th>
</tr>
<tr>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>2</td>
<td>0</td>
</tr>
</table>
<p>Note that this is the same data that would be contained in a profile
record with sample count = 0, num_pcs = 1, and a one-element call
chain containing the address 0.
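<p>Putting the record and trailer formats together, a reader's main loop
might look like the following sketch (illustrative only, not gperftools
code; <code>ReadSlot()</code> and <code>HandlePC()</code> are hypothetical
helpers, the former returning the given slot using the slot size detected
from the header):</p>
<pre>
size_t pos = 5;                            // first slot after the 5-slot header
for (;;) {
  uint64_t count   = ReadSlot(pos);        // sample count
  uint64_t num_pcs = ReadSlot(pos + 1);    // call-chain length
  if (count == 0 && num_pcs == 1 && ReadSlot(pos + 2) == 0)
    break;                                 // binary trailer: 0, 1, 0
  for (uint64_t i = 0; i != num_pcs; ++i)
    HandlePC(count, ReadSlot(pos + 2 + i));  // e.g. accumulate per call chain;
                                             // identical chains must have
                                             // their counts summed, as above
  pos += 2 + num_pcs;                      // each record is 2 + num_pcs slots
}
</pre>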
<h2>Text List of Mapped Objects</h2>
<p>The binary data in the file is followed immediately by a list of
mapped objects. This list consists of lines of text separated by
newline characters.
<p>Each line is one of the following types:
<ul>
<li>Build specifier, starting with "<tt>build=</tt>". For example:
<pre> build=/path/to/binary</pre>
Leading spaces on the line are ignored.
<li>Mapping line from ProcMapsIterator::FormatLine. For example:
<pre> 40000000-40015000 r-xp 00000000 03:01 12845071 /lib/ld-2.3.2.so</pre>
The first address must start at the beginning of the line.
</ul>
<p>Unrecognized lines should be ignored by analysis tools.
<p>When processing the paths seen in mapping lines, occurrences of
<tt>$build</tt> followed by a non-word character (i.e., characters
other than underscore or alphanumeric characters) should be replaced
by the path given on the last build specifier line.  For example, after
a <tt>build=/usr/local/obj</tt> line, the path <tt>$build/lib/foo.so</tt>
denotes <tt>/usr/local/obj/lib/foo.so</tt>.
<hr>
<address>Chris Demetriou<br>
<!-- Created: Mon Aug 27 12:18:26 PDT 2007 -->
<!-- hhmts start -->
Last modified: Mon Aug 27 12:18:26 PDT 2007 (cgd)
<!-- hhmts end -->
</address>
</BODY>
</HTML>

@@ -0,0 +1,536 @@
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<HTML>
<HEAD>
<link rel="stylesheet" href="designstyle.css">
<title>Gperftools CPU Profiler</title>
</HEAD>
<BODY>
<p align=right>
<i>Last modified
<script type=text/javascript>
var lm = new Date(document.lastModified);
document.write(lm.toDateString());
</script></i>
</p>
<p>This is the CPU profiler we use at Google. There are three parts
to using it: linking the library into an application, running the
code, and analyzing the output.</p>
<p>On the off-chance that you should need to understand it, the CPU
profiler data file format is documented separately,
<a href="cpuprofile-fileformat.html">here</a>.
<H1>Linking in the Library</H1>
<p>To install the CPU profiler into your executable, add
<code>-lprofiler</code> to the link-time step for your executable.
(It's also probably possible to add in the profiler at run-time using
<code>LD_PRELOAD</code>, e.g.
<code>% env LD_PRELOAD="/usr/lib/libprofiler.so" &lt;binary&gt;</code>,
but this isn't necessarily recommended.)</p>
<p>This does <i>not</i> turn on CPU profiling; it just inserts the
code. For that reason, it's practical to just always link
<code>-lprofiler</code> into a binary while developing; that's what we
do at Google. (However, since any user can turn on the profiler by
setting an environment variable, it's not necessarily recommended to
install profiler-linked binaries into a production, running
system.)</p>
<H1>Running the Code</H1>
<p>There are several alternatives to actually turn on CPU profiling
for a given run of an executable:</p>
<ol>
<li> <p>Define the environment variable CPUPROFILE to the filename
to dump the profile to. For instance, if you had a version of
<code>/bin/ls</code> that had been linked against libprofiler,
you could run:</p>
<pre>% env CPUPROFILE=ls.prof /bin/ls</pre>
</li>
<li> <p>In addition to defining the environment variable CPUPROFILE,
you can also define CPUPROFILESIGNAL.  This allows profiling to be
controlled via the signal number that you specify. The signal number
must be unused by the program under normal operation. Internally it
acts as a switch, triggered by the signal, which is off by default.
For instance, if you had a copy of <code>/bin/chrome</code> that had
been linked against libprofiler, you could run:</p>
<pre>% env CPUPROFILE=chrome.prof CPUPROFILESIGNAL=12 /bin/chrome &</pre>
<p>You can then trigger profiling to start:</p>
<pre>% killall -12 chrome</pre>
<p>Then, after a period of time, you can tell it to stop, which will
generate the profile:</p>
<pre>% killall -12 chrome</pre>
</li>
<li> <p>In your code, bracket the code you want profiled in calls to
<code>ProfilerStart()</code> and <code>ProfilerStop()</code>.
(These functions are declared in <code>&lt;gperftools/profiler.h&gt;</code>.)
<code>ProfilerStart()</code> will take
the profile-filename as an argument.  (A short sketch follows this
list.)</p>
</li>
</ol>
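<p>For example, the third alternative might look like the following
minimal sketch (the workload function is a hypothetical stand-in for the
code you want profiled):</p>
<pre>
#include &lt;gperftools/profiler.h&gt;

static void DoExpensiveWork() {            // hypothetical workload
  volatile double x = 0;
  for (long i = 0; i != 100000000; ++i) x += i * 0.5;
}

int main() {
  ProfilerStart("myphase.prof");           // begin sampling into this file
  DoExpensiveWork();                       // only this region is profiled
  ProfilerStop();                          // flush samples and close the file
  return 0;
}
</pre>
<p>Build with <code>-lprofiler</code> as described above, then analyze the
result with, e.g., <code>pprof --text ./a.out myphase.prof</code>.</p>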
<p>In Linux 2.6 and above, profiling works correctly with threads,
automatically profiling all threads. In Linux 2.4, profiling only
profiles the main thread (due to a kernel bug involving itimers and
threads). Profiling works correctly with sub-processes: each child
process gets its own profile with its own name (generated by combining
CPUPROFILE with the child's process id).</p>
<p>For security reasons, CPU profiling will not write to a file -- and
is thus not usable -- for setuid programs.</p>
<p>See the include-file <code>gperftools/profiler.h</code> for
advanced-use functions, including <code>ProfilerFlush()</code> and
<code>ProfilerStartWithOptions()</code>.</p>
<H2>Modifying Runtime Behavior</H2>
<p>You can more finely control the behavior of the CPU profiler via
environment variables.</p>
<table frame=box rules=sides cellpadding=5 width=100%>
<tr valign=top>
<td><code>CPUPROFILE_FREQUENCY=<i>x</i></code></td>
<td>default: 100</td>
<td>
How many interrupts/second the cpu-profiler samples.
</td>
</tr>
<tr valign=top>
<td><code>CPUPROFILE_REALTIME=1</code></td>
<td>default: [not set]</td>
<td>
If set to any value (including 0 or the empty string), use
ITIMER_REAL instead of ITIMER_PROF to gather profiles. In
general, ITIMER_REAL is not as accurate as ITIMER_PROF, and also
interacts badly with use of alarm(), so prefer ITIMER_PROF unless
you have a reason to prefer ITIMER_REAL.
</td>
</tr>
</table>
<h1><a name="pprof">Analyzing the Output</a></h1>
<p><code>pprof</code> is the script used to analyze a profile. It has
many output modes, both textual and graphical. Some give just raw
numbers, much like the <code>-pg</code> output of <code>gcc</code>,
and others show the data in the form of a dependency graph.</p>
<p>pprof <b>requires</b> <code>perl5</code> to be installed to run.
It also requires <code>dot</code> to be installed for any of the
graphical output routines, and <code>gv</code> to be installed for
<code>--gv</code> mode (described below).
</p>
<p>Here are some ways to call pprof. These are described in more
detail below.</p>
<pre>
% pprof /bin/ls ls.prof
Enters "interactive" mode
% pprof --text /bin/ls ls.prof
Outputs one line per procedure
% pprof --gv /bin/ls ls.prof
Displays annotated call-graph via 'gv'
% pprof --gv --focus=Mutex /bin/ls ls.prof
Restricts to code paths including a .*Mutex.* entry
% pprof --gv --focus=Mutex --ignore=string /bin/ls ls.prof
Code paths including Mutex but not string
% pprof --list=getdir /bin/ls ls.prof
(Per-line) annotated source listing for getdir()
% pprof --disasm=getdir /bin/ls ls.prof
(Per-PC) annotated disassembly for getdir()
% pprof --text localhost:1234
Outputs one line per procedure for localhost:1234
% pprof --callgrind /bin/ls ls.prof
Outputs the call information in callgrind format
</pre>
<h3>Analyzing Text Output</h3>
<p>Text mode has lines of output that look like this:</p>
<pre>
14 2.1% 17.2% 58 8.7% std::_Rb_tree::find
</pre>
<p>Here is how to interpret the columns:</p>
<ol>
<li> Number of profiling samples in this function
<li> Percentage of profiling samples in this function
<li> Percentage of profiling samples in the functions printed so far
<li> Number of profiling samples in this function and its callees
<li> Percentage of profiling samples in this function and its callees
<li> Function name
</ol>
<h3>Analyzing Callgrind Output</h3>
<p>Use <a href="http://kcachegrind.sourceforge.net">kcachegrind</a> to
analyze your callgrind output:</p>
<pre>
% pprof --callgrind /bin/ls ls.prof > ls.callgrind
% kcachegrind ls.callgrind
</pre>
<p>The cost is specified in 'hits', i.e. how many times a function
appears in the recorded call stack information. The 'calls' from
function a to b record how many times function b was found in the
stack traces directly below function a.</p>
<p>Tip: if you use a debug build the output will include file and line
number information and kcachegrind will show an annotated source
code view.</p>
<h3>Node Information</h3>
<p>In the various graphical modes of pprof, the output is a call graph
annotated with timing information, like so:</p>
<A HREF="pprof-test-big.gif">
<center><table><tr><td>
<img src="pprof-test.gif">
</td></tr></table></center>
</A>
<p>Each node represents a procedure. The directed edges indicate
caller to callee relations. Each node is formatted as follows:</p>
<center><pre>
Class Name
Method Name
local (percentage)
<b>of</b> cumulative (percentage)
</pre></center>
<p>The last one or two lines contains the timing information. (The
profiling is done via a sampling method, where by default we take 100
samples a second.  Therefore, one unit of time in the output corresponds
to about 10 milliseconds of execution time.) The "local" time is the
time spent executing the instructions directly contained in the
procedure (and in any other procedures that were inlined into the
procedure). The "cumulative" time is the sum of the "local" time and
the time spent in any callees. If the cumulative time is the same as
the local time, it is not printed.</p>
<p>For instance, the timing information for test_main_thread()
indicates that 155 units (about 1.55 seconds) were spent executing the
code in <code>test_main_thread()</code> and 200 units were spent while
executing <code>test_main_thread()</code> and its callees such as
<code>snprintf()</code>.</p>
<p>The size of the node is proportional to the local count. The
percentage displayed in the node corresponds to the count divided by
the total run time of the program (that is, the cumulative count for
<code>main()</code>).</p>
<h3>Edge Information</h3>
<p>An edge from one node to another indicates a caller to callee
relationship. Each edge is labelled with the time spent by the callee
on behalf of the caller.  E.g., the edge from
<code>test_main_thread()</code> to <code>snprintf()</code> indicates
that of the 200 samples in <code>test_main_thread()</code>, 37 are
because of calls to <code>snprintf()</code>.</p>
<p>Note that <code>test_main_thread()</code> has an edge to
<code>vsnprintf()</code>, even though <code>test_main_thread()</code>
doesn't call that function directly. This is because the code was
compiled with <code>-O2</code>; the profile reflects the optimized
control flow.</p>
<h3>Meta Information</h3>
<p>The top of the display should contain some meta information
like:</p>
<pre>
/tmp/profiler2_unittest
Total samples: 202
Focusing on: 202
Dropped nodes with &lt;= 1 abs(samples)
Dropped edges with &lt;= 0 samples
</pre>
<p>This section contains the name of the program, and the total
samples collected during the profiling run. If the
<code>--focus</code> option is on (see the <a href="#focus">Focus</a>
section below), the legend also contains the number of samples being
shown in the focused display. Furthermore, some unimportant nodes and
edges are dropped to reduce clutter. The characteristics of the
dropped nodes and edges are also displayed in the legend.</p>
<h3><a name=focus>Focus and Ignore</a></h3>
<p>You can ask pprof to generate a display focused on a particular
piece of the program. You specify a regular expression. Any portion
of the call-graph that is on a path which contains at least one node
matching the regular expression is preserved. The rest of the
call-graph is dropped on the floor. For example, you can focus on the
<code>vsnprintf()</code> libc call in <code>profiler2_unittest</code>
as follows:</p>
<pre>
% pprof --gv --focus=vsnprintf /tmp/profiler2_unittest test.prof
</pre>
<A HREF="pprof-vsnprintf-big.gif">
<center><table><tr><td>
<img src="pprof-vsnprintf.gif">
</td></tr></table></center>
</A>
<p>Similarly, you can supply the <code>--ignore</code> option to
ignore samples that match a specified regular expression. E.g., if
you are interested in everything except calls to
<code>snprintf()</code>, you can say:</p>
<pre>
% pprof --gv --ignore=snprintf /tmp/profiler2_unittest test.prof
</pre>
<h3>Interactive mode</h3>
<p>By default -- if you don't specify any flags to the contrary --
pprof runs in interactive mode. At the <code>(pprof)</code> prompt,
you can run many of the commands described above. You can type
<code>help</code> for a list of what commands are available in
interactive mode.</p>
<h3><a name=options>pprof Options</a></h3>
For a complete list of pprof options, you can run <code>pprof
--help</code>.
<h4>Output Type</h4>
<p>
<center>
<table frame=box rules=sides cellpadding=5 width=100%>
<tr valign=top>
<td><code>--text</code></td>
<td>
Produces a textual listing. (Note: If you have an X display, and
<code>dot</code> and <code>gv</code> installed, you will probably
be happier with the <code>--gv</code> output.)
</td>
</tr>
<tr valign=top>
<td><code>--gv</code></td>
<td>
Generates annotated call-graph, converts to postscript, and
displays via gv (requires <code>dot</code> and <code>gv</code> be
installed).
</td>
</tr>
<tr valign=top>
<td><code>--dot</code></td>
<td>
Generates the annotated call-graph in dot format and
emits to stdout (requires <code>dot</code> be installed).
</td>
</tr>
<tr valign=top>
<td><code>--ps</code></td>
<td>
Generates the annotated call-graph in Postscript format and
emits to stdout (requires <code>dot</code> be installed).
</td>
</tr>
<tr valign=top>
<td><code>--pdf</code></td>
<td>
Generates the annotated call-graph in PDF format and emits to
stdout (requires <code>dot</code> and <code>ps2pdf</code> be
installed).
</td>
</tr>
<tr valign=top>
<td><code>--gif</code></td>
<td>
Generates the annotated call-graph in GIF format and
emits to stdout (requires <code>dot</code> be installed).
</td>
</tr>
<tr valign=top>
<td><code>--list=&lt;<i>regexp</i>&gt;</code></td>
<td>
<p>Outputs source-code listing of routines whose
name matches &lt;regexp&gt;. Each line
in the listing is annotated with flat and cumulative
sample counts.</p>
<p>In the presence of inlined calls, the samples
associated with inlined code tend to get assigned
to a line that follows the location of the
inlined call. A more precise accounting can be
obtained by disassembling the routine using the
--disasm flag.</p>
</td>
</tr>
<tr valign=top>
<td><code>--disasm=&lt;<i>regexp</i>&gt;</code></td>
<td>
Generates disassembly of routines that match
&lt;regexp&gt;, annotated with flat and
cumulative sample counts and emits to stdout.
</td>
</tr>
</table>
</center>
<h4>Reporting Granularity</h4>
<p>By default, pprof produces one entry per procedure. However you can
use one of the following options to change the granularity of the
output. The <code>--files</code> option seems to be particularly
useless, and may be removed eventually.</p>
<center>
<table frame=box rules=sides cellpadding=5 width=100%>
<tr valign=top>
<td><code>--addresses</code></td>
<td>
Produce one node per program address.
</td>
</tr>
<tr valign=top>
<td><code>--lines</code></td>
<td>
Produce one node per source line.
</td>
</tr>
<tr valign=top>
<td><code>--functions</code></td>
<td>
Produce one node per function (this is the default).
</td>
</tr>
<tr valign=top>
<td><code>--files</code></td>
<td>
Produce one node per source file.
</td>
</tr>
</table>
</center>
<h4>Controlling the Call Graph Display</h4>
<p>Some nodes and edges are dropped to reduce clutter in the output
display. The following options control this effect:</p>
<center>
<table frame=box rules=sides cellpadding=5 width=100%>
<tr valign=top>
<td><code>--nodecount=&lt;n&gt;</code></td>
<td>
This option controls the number of displayed nodes. The nodes
are first sorted by decreasing cumulative count, and then only
the top N nodes are kept. The default value is 80.
</td>
</tr>
<tr valign=top>
<td><code>--nodefraction=&lt;f&gt;</code></td>
<td>
This option provides another mechanism for discarding nodes
from the display. If the cumulative count for a node is
less than this option's value multiplied by the total count
for the profile, the node is dropped. The default value
is 0.005; i.e. nodes that account for less than
half a percent of the total time are dropped. A node
is dropped if either this condition is satisfied, or the
--nodecount condition is satisfied.
</td>
</tr>
<tr valign=top>
<td><code>--edgefraction=&lt;f&gt;</code></td>
<td>
This option controls the number of displayed edges. First of all,
an edge is dropped if either its source or destination node is
dropped. Otherwise, the edge is dropped if the sample
count along the edge is less than this option's value multiplied
by the total count for the profile. The default value is
0.001; i.e., edges that account for less than
0.1% of the total time are dropped.
</td>
</tr>
<tr valign=top>
<td><code>--focus=&lt;re&gt;</code></td>
<td>
This option controls what region of the graph is displayed
based on the regular expression supplied with the option.
For any path in the callgraph, we check all nodes in the path
against the supplied regular expression. If none of the nodes
match, the path is dropped from the output.
</td>
</tr>
<tr valign=top>
<td><code>--ignore=&lt;re&gt;</code></td>
<td>
This option controls what region of the graph is displayed
based on the regular expression supplied with the option.
For any path in the callgraph, we check all nodes in the path
against the supplied regular expression. If any of the nodes
match, the path is dropped from the output.
</td>
</tr>
</table>
</center>
<p>The dropped edges and nodes account for some count mismatches in
the display. For example, the cumulative count for
<code>snprintf()</code> in the first diagram above was 41. However
the local count (1) and the count along the outgoing edges (12+1+20+6)
add up to only 40.</p>
<h1>Caveats</h1>
<ul>
<li> If the program exits because of a signal, the generated profile
will be <font color=red>incomplete, and may perhaps be
completely empty</font>.
<li> The displayed graph may have disconnected regions because
of the edge-dropping heuristics described above.
<li> If the program linked in a library that was not compiled
with enough symbolic information, all samples associated
with the library may be charged to the last symbol found
in the program before the library. This will artificially
inflate the count for that symbol.
<li> If you run the program on one machine, and profile it on
another, and the shared libraries are different on the two
machines, the profiling output may be confusing: samples that
fall within shared libraries may be assigned to arbitrary
procedures.
<li> If your program forks, the children will also be profiled
(since they inherit the same CPUPROFILE setting). Each process
is profiled separately; to distinguish the child profiles from
the parent profile and from each other, all children will have
their process-id appended to the CPUPROFILE name.
<li> Due to a hack we make to work around a possible gcc bug, your
profiles may end up named strangely if the first character of
your CPUPROFILE variable has ascii value greater than 127.
This should be exceedingly rare, but if you need to use such a
name, just prepend <code>./</code> to your filename:
<code>CPUPROFILE=./&Auml;gypten</code>.
</ul>
<hr>
<address>Sanjay Ghemawat<br>
<!-- Created: Tue Dec 19 10:43:14 PST 2000 -->
<!-- hhmts start -->
Last modified: Fri May 9 14:41:29 PDT 2008
<!-- hhmts end -->
</address>
</BODY>
</HTML>

@@ -0,0 +1,109 @@
body {
background-color: #ffffff;
color: black;
margin-right: 1in;
margin-left: 1in;
}
h1, h2, h3, h4, h5, h6 {
color: #3366ff;
font-family: sans-serif;
}
@media print {
/* Darker version for printing */
h1, h2, h3, h4, h5, h6 {
color: #000080;
font-family: helvetica, sans-serif;
}
}
h1 {
text-align: center;
font-size: 18pt;
}
h2 {
margin-left: -0.5in;
}
h3 {
margin-left: -0.25in;
}
h4 {
margin-left: -0.125in;
}
hr {
margin-left: -1in;
}
/* Definition lists: definition term bold */
dt {
font-weight: bold;
}
address {
text-align: right;
}
/* Use the <code> tag for bits of code and <var> for variables and objects. */
code,pre,samp,var {
color: #006000;
}
/* Use the <file> tag for file and directory paths and names. */
file {
color: #905050;
font-family: monospace;
}
/* Use the <kbd> tag for stuff the user should type. */
kbd {
color: #600000;
}
div.note p {
float: right;
width: 3in;
margin-right: 0%;
padding: 1px;
border: 2px solid #6060a0;
background-color: #fffff0;
}
UL.nobullets {
list-style-type: none;
list-style-image: none;
margin-left: -1em;
}
/* pretty printing styles. See prettify.js */
.str { color: #080; }
.kwd { color: #008; }
.com { color: #800; }
.typ { color: #606; }
.lit { color: #066; }
.pun { color: #660; }
.pln { color: #000; }
.tag { color: #008; }
.atn { color: #606; }
.atv { color: #080; }
pre.prettyprint { padding: 2px; border: 1px solid #888; }
.embsrc { background: #eee; }
@media print {
.str { color: #060; }
.kwd { color: #006; font-weight: bold; }
.com { color: #600; font-style: italic; }
.typ { color: #404; font-weight: bold; }
.lit { color: #044; }
.pun { color: #440; }
.pln { color: #000; }
.tag { color: #006; font-weight: bold; }
.atn { color: #404; }
.atv { color: #060; }
}
/* Table Column Headers */
.hdr {
color: #006;
font-weight: bold;
background-color: #dddddd; }
.hdr2 {
color: #006;
background-color: #eeeeee; }

Binary file not shown. (New image, 37 KiB.)

@@ -0,0 +1,534 @@
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<HTML>
<HEAD>
<link rel="stylesheet" href="designstyle.css">
<title>Gperftools Heap Leak Checker</title>
</HEAD>
<BODY>
<p align=right>
<i>Last modified
<script type=text/javascript>
var lm = new Date(document.lastModified);
document.write(lm.toDateString());
</script></i>
</p>
<p>This is the heap checker we use at Google to detect memory leaks in
C++ programs. There are three parts to using it: linking the library
into an application, running the code, and analyzing the output.</p>
<H1>Linking in the Library</H1>
<p>The heap-checker is part of tcmalloc, so to install the heap
checker into your executable, add <code>-ltcmalloc</code> to the
link-time step for your executable. Also, while we don't necessarily
recommend this form of usage, it's possible to add in the profiler at
run-time using <code>LD_PRELOAD</code>:</p>
<pre>% env LD_PRELOAD="/usr/lib/libtcmalloc.so" &lt;binary&gt;</pre>
<p>This does <i>not</i> turn on heap checking; it just inserts the
code. For that reason, it's practical to just always link
<code>-ltcmalloc</code> into a binary while developing; that's what we
do at Google. (However, since any user can turn on the profiler by
setting an environment variable, it's not necessarily recommended to
install heapchecker-linked binaries into a production, running
system.) Note that if you wish to use the heap checker, you must
also use the tcmalloc memory-allocation library. There is no way
currently to use the heap checker separate from tcmalloc.</p>
<h1>Running the Code</h1>
<p>Note: For security reasons, heap profiling will not write to a file
-- and is thus not usable -- for setuid programs.</p>
<h2><a name="whole_program">Whole-program Heap Leak Checking</a></h2>
<p>The recommended way to use the heap checker is in "whole program"
mode. In this case, the heap-checker starts tracking memory
allocations before the start of <code>main()</code>, and checks again
at program-exit. If it finds any memory leaks -- that is, any memory
not pointed to by objects that are still "live" at program-exit -- it
aborts the program (via <code>exit(1)</code>) and prints a message
describing how to track down the memory leak (using <A
HREF="heapprofile.html#pprof">pprof</A>).</p>
<p>The heap-checker records the stack trace for each allocation while
it is active. This causes a significant increase in memory usage, in
addition to slowing your program down.</p>
<p>Here's how to run a program with whole-program heap checking:</p>
<ol>
<li> <p>Define the environment variable HEAPCHECK to the <A
HREF="#types">type of heap-checking</A> to do. For instance,
to heap-check
<code>/usr/local/bin/my_binary_compiled_with_tcmalloc</code>:</p>
<pre>% env HEAPCHECK=normal /usr/local/bin/my_binary_compiled_with_tcmalloc</pre>
</ol>
<p>No other action is required.</p>
<p>Note that since the heap-checker uses the heap-profiling framework
internally, it is not possible to run both the heap-checker and <A
HREF="heapprofile.html">heap profiler</A> at the same time.</p>
<h3><a name="types">Flavors of Heap Checking</a></h3>
<p>These are the legal values when running a whole-program heap
check:</p>
<ol>
<li> <code>minimal</code>
<li> <code>normal</code>
<li> <code>strict</code>
<li> <code>draconian</code>
</ol>
<p>"Minimal" heap-checking starts as late as possible in a
initialization, meaning you can leak some memory in your
initialization routines (that run before <code>main()</code>, say),
and not trigger a leak message. If you frequently (and purposefully)
leak data in one-time global initializers, "minimal" mode is useful
for you. Otherwise, you should avoid it for stricter modes.</p>
<p>"Normal" heap-checking tracks <A HREF="#live">live objects</A> and
reports a leak for any data that is not reachable via a live object
when the program exits.</p>
<p>"Strict" heap-checking is much like "normal" but has a few extra
checks that memory isn't lost in global destructors. In particular,
if you have a global variable that allocates memory during program
execution, and then "forgets" about the memory in the global
destructor (say, by setting the pointer to it to NULL) without freeing
it, that will prompt a leak message in "strict" mode, though not in
"normal" mode.</p>
<p>"Draconian" heap-checking is appropriate for those who like to be
very precise about their memory management, and want the heap-checker
to help them enforce it. In "draconian" mode, the heap-checker does
not do "live object" checking at all, so it reports a leak unless
<i>all</i> allocated memory is freed before program exit. (However,
you can use <A HREF="#disable">IgnoreObject()</A> to re-enable
liveness-checking on an object-by-object basis.)</p>
<p>"Normal" mode, as the name implies, is the one used most often at
Google. It's appropriate for everyday heap-checking use.</p>
<p>In addition, there are two other possible modes:</p>
<ul>
<li> <code>as-is</code>
<li> <code>local</code>
</ul>
<p><code>as-is</code> is the most flexible mode; it allows you to
specify the various <A HREF="#options">knobs</A> of the heap checker
explicitly. <code>local</code> activates the <A
HREF="#explicit">explicit heap-check instrumentation</A>, but does not
turn on any whole-program leak checking.</p>
<h3><A NAME="tweaking">Tweaking whole-program checking</A></h3>
<p>In some cases you want to check the whole program for memory leaks,
but waiting until after <code>main()</code> exits to do the first
whole-program leak check is too long to wait: e.g., in a long-running
server one might wish to simply check for leaks periodically while the
server is running.  In this case, you can call the static method
<code>HeapLeakChecker::NoGlobalLeaks()</code> to verify that no global
leaks have happened as of that point in the program.</p>
<p>Alternately, doing the check after <code>main()</code> exits might
be too late. Perhaps you have some objects that are known not to
clean up properly at exit. You'd like to do the "at exit" check
before those objects are destroyed (since while they're live, any
memory they point to will not be considered a leak). In that case,
you can call <code>HeapLeakChecker::NoGlobalLeaks()</code> manually, near the end of
<code>main()</code>, and then call <code>HeapLeakChecker::CancelGlobalCheck()</code> to
turn off the automatic post-<code>main()</code> check.</p>
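<p>A minimal sketch of that second approach (the server loop is a
hypothetical stand-in):</p>
<pre>
#include &lt;assert.h&gt;
#include &lt;gperftools/heap-checker.h&gt;

static void RunServer() { /* hypothetical long-running work */ }

int main() {
  RunServer();  // objects that never clean up at exit are still live here
  // Check now, while they are live, so they are not counted as leaks...
  assert(HeapLeakChecker::NoGlobalLeaks());
  // ...and turn off the automatic check that would run after main() exits.
  HeapLeakChecker::CancelGlobalCheck();
  return 0;
}
</pre>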
<p>Finally, there's a helper macro for "strict" and "draconian" modes,
which require all global memory to be freed before program exit. This
freeing can be time-consuming and is often unnecessary, since libc
cleans up all memory at program-exit for you. If you want the
benefits of "strict"/"draconian" modes without the cost of all that
freeing, look at <code>REGISTER_HEAPCHECK_CLEANUP</code> (in
<code>heap-checker.h</code>). This macro allows you to mark specific
cleanup code as active only when the heap-checker is turned on.</p>
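<p>A sketch of the macro in use, assuming its two-argument
(<i>name</i>, <i>body</i>) form declared in <code>heap-checker.h</code>
(the global cache here is hypothetical):</p>
<pre>
#include &lt;vector&gt;
#include &lt;gperftools/heap-checker.h&gt;

// A global cache that normal runs never bother to free.
static std::vector&lt;int&gt;* g_cache = new std::vector&lt;int&gt;;

// The body runs only when heap-checking is active, so production runs skip
// the teardown while "strict"/"draconian" checks still come out clean.
REGISTER_HEAPCHECK_CLEANUP(free_cache, delete g_cache);
</pre>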
<h2><a name="explicit">Explicit (Partial-program) Heap Leak Checking</h2>
<p>Instead of whole-program checking, you can check certain parts of your
code to verify they do not have memory leaks. This check verifies that
between two parts of a program, no memory is allocated without being freed.</p>
<p>To use this kind of checking code, bracket the code you want
checked by creating a <code>HeapLeakChecker</code> object at the
beginning of the code segment, and call
<code>NoLeaks()</code> at the end. These functions, and all others
referred to in this file, are declared in
<code>&lt;gperftools/heap-checker.h&gt;</code>.
</p>
<p>Here's an example:</p>
<pre>
  HeapLeakChecker heap_checker("test_foo");
  {
    // Code that exercises some foo functionality;
    // this code should not leak memory.
  }
  if (!heap_checker.NoLeaks()) assert(NULL == "heap memory leak");
</pre>
<p>Note that adding in the <code>HeapLeakChecker</code> object merely
instruments the code for leak-checking. To actually turn on this
leak-checking on a particular run of the executable, you must still
run with the heap-checker turned on:</p>
<pre>% env HEAPCHECK=local /usr/local/bin/my_binary_compiled_with_tcmalloc</pre>
<p>If you want to do whole-program leak checking in addition to this
manual leak checking, you can run in <code>normal</code> or some other
mode instead: they'll run the "local" checks in addition to the
whole-program check.</p>
<h2><a name="disable">Disabling Heap-checking of Known Leaks</a></h2>
<p>Sometimes your code has leaks that you know about and are willing
to accept. You would like the heap checker to ignore them when
checking your program. You can do this by bracketing the code in
question with an appropriate heap-checking construct:</p>
<pre>
...
{
HeapLeakChecker::Disabler disabler;
&lt;leaky code&gt;
}
...
</pre>
Any objects allocated by <code>leaky code</code> (including inside any
routines called by <code>leaky code</code>) and any objects reachable
from such objects are not reported as leaks.
<p>Alternately, you can use <code>IgnoreObject()</code>, which takes a
pointer to an object to ignore. That memory, and everything reachable
from it (by following pointers), is ignored for the purposes of leak
checking. You can call <code>UnIgnoreObject()</code> to undo the
effects of <code>IgnoreObject()</code>.</p>
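<p>A small sketch of <code>IgnoreObject()</code> (the one-time table is a
hypothetical example of an acceptable leak):</p>
<pre>
#include &lt;gperftools/heap-checker.h&gt;

void InitLookupTable() {
  static int* table = new int[256];      // deliberate one-time "leak"
  HeapLeakChecker::IgnoreObject(table);  // table, and anything reachable
                                         // from it, is exempt from reports
}
</pre>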
<h2><a name="options">Tuning the Heap Checker</h2>
<p>The heap leak checker has many options, some that trade off running
time and accuracy, and others that increase the sensitivity at the
risk of returning false positives. For most uses, the range covered
by the <A HREF="#types">heap-check flavors</A> is enough, but in
specialized cases more control can be helpful.</p>
<p>
These options are specified via environment variables.
</p>
<p>This first set of options controls sensitivity and accuracy. These
options are ignored unless you run the heap checker in <A
HREF="#types">as-is</A> mode.
<table frame=box rules=sides cellpadding=5 width=100%>
<tr valign=top>
<td><code>HEAP_CHECK_AFTER_DESTRUCTORS</code></td>
<td>Default: false</td>
<td>
When true, do the final leak check after all other global
destructors have run. When false, do it after all
<code>REGISTER_HEAPCHECK_CLEANUP</code>, typically much earlier in
the global-destructor process.
</td>
</tr>
<tr valign=top>
<td><code>HEAP_CHECK_IGNORE_THREAD_LIVE</code></td>
<td>Default: true</td>
<td>
If true, ignore objects reachable from thread stacks and registers
(that is, do not report them as leaks).
</td>
</tr>
<tr valign=top>
<td><code>HEAP_CHECK_IGNORE_GLOBAL_LIVE</code></td>
<td>Default: true</td>
<td>
If true, ignore objects reachable from global variables and data
(that is, do not report them as leaks).
</td>
</tr>
</table>
<p>These options modify the behavior of whole-program leak
checking.</p>
<table frame=box rules=sides cellpadding=5 width=100%>
<tr valign=top>
<td><code>HEAP_CHECK_MAX_LEAKS</code></td>
<td>Default: 20</td>
<td>
The maximum number of leaks to be printed to stderr (all leaks are still
emitted to file output for pprof to visualize). If negative or zero,
print all the leaks found.
</td>
</tr>
</table>
<p>These options apply to all types of leak checking.</p>
<table frame=box rules=sides cellpadding=5 width=100%>
<tr valign=top>
<td><code>HEAP_CHECK_IDENTIFY_LEAKS</code></td>
<td>Default: false</td>
<td>
If true, generate the addresses of the leaked objects in the
generated memory leak profile files.
</td>
</tr>
<tr valign=top>
<td><code>HEAP_CHECK_TEST_POINTER_ALIGNMENT</code></td>
<td>Default: false</td>
<td>
If true, check all leaks to see if they might be due to the use
of unaligned pointers.
</td>
</tr>
<tr valign=top>
<td><code>HEAP_CHECK_POINTER_SOURCE_ALIGNMENT</code></td>
<td>Default: sizeof(void*)</td>
<td>
Alignment at which all pointers in memory are supposed to be located.
Use 1 if any alignment is ok.
</td>
</tr>
<tr valign=top>
<td><code>PPROF_PATH</code></td>
<td>Default: pprof</td>
<td>
The location of the <code>pprof</code> executable.
</td>
</tr>
<tr valign=top>
<td><code>HEAP_CHECK_DUMP_DIRECTORY</code></td>
<td>Default: /tmp</td>
<td>
Where the heap-profile files are kept while the program is running.
</td>
</tr>
</table>
<h2>Tips for Handling Detected Leaks</h2>
<p>What do you do when the heap leak checker detects a memory leak?
First, you should run the reported <code>pprof</code> command;
hopefully, that is enough to track down the location where the leak
occurs.</p>
<p>If the leak is a real leak, you should fix it!</p>
<p>If you are sure that the reported leaks are not dangerous and there
is no good way to fix them, then you can use
<code>HeapLeakChecker::Disabler</code> and/or
<code>HeapLeakChecker::IgnoreObject()</code> to disable heap-checking
for certain parts of the codebase.</p>
<p>In "strict" or "draconian" mode, leaks may be due to incomplete
cleanup in the destructors of global variables. If you don't wish to
augment the cleanup routines, but still want to run in "strict" or
"draconian" mode, consider using <A
HREF="#tweaking"><code>REGISTER_HEAPCHECK_CLEANUP</code></A>.</p>
<h2>Hints for Debugging Detected Leaks</h2>
<p>Sometimes it can be useful to not only know the exact code that
allocates the leaked objects, but also the addresses of the leaked objects.
Combining this with, e.g., additional logging in the program,
one can then track which subset of the allocations
made at a certain spot in the code are leaked.
<br/>
To get the addresses of all leaked objects
define the environment variable <code>HEAP_CHECK_IDENTIFY_LEAKS</code>
to be <code>1</code>.
The object addresses will be reported in the form of addresses
of fake immediate callers of the memory allocation routines.
Note that the performance of doing leak-checking in this mode
can be noticeably worse than the default mode.
</p>
<p>One relatively common class of leaks that don't look real
is the case of multiple initialization.
In such cases the reported leaks are typically things that are
linked from some global objects,
which are initialized and, say, never modified again.
The non-obvious cause of the leak is frequently the fact that
the initialization code for these objects executes more than once.
<br/>
E.g. if the code of some <code>.cc</code> file is made to be included twice
into the binary, then the constructors for global objects defined in that file
will execute twice thus leaking the things allocated on the first run.
<br/>
Similar problems can occur if object initialization is done more explicitly
e.g. on demand by a slightly buggy code
that does not always ensure only-once initialization.
</p>
<p>
A rarer but even more puzzling problem can be the use of improperly
aligned pointers (perhaps inside improperly aligned objects).
Normally such pointers are not followed by the leak checker,
hence the objects reachable only via such pointers are reported as leaks.
If you suspect this case,
define the environment variable <code>HEAP_CHECK_TEST_POINTER_ALIGNMENT</code>
to be <code>1</code>
and then look closely at the generated leak report messages.
</p>
<h1>How It Works</h1>
<p>When a <code>HeapLeakChecker</code> object is constructed, it dumps
a memory-usage profile named
<code>&lt;prefix&gt;.&lt;name&gt;-beg.heap</code> to a temporary
directory. When <code>NoLeaks()</code>
is called (for whole-program checking, this happens automatically at
program-exit), it dumps another profile, named
<code>&lt;prefix&gt;.&lt;name&gt;-end.heap</code>.
(<code>&lt;prefix&gt;</code> is typically determined automatically,
and <code>&lt;name&gt;</code> is typically <code>argv[0]</code>.) It
then compares the two profiles. If the second profile shows
more memory use than the first, the
<code>NoLeaks()</code> function will
return false. For "whole program" profiling, this will cause the
executable to abort (via <code>exit(1)</code>). In all cases, it will
print a message on how to process the dumped profiles to locate
leaks.</p>
<h3><A name=live>Detecting Live Objects</A></h3>
<p>At any point during a program's execution, all memory that is
accessible at that time is considered "live." This includes global
variables, and also any memory that is reachable by following pointers
from a global variable. It also includes all memory reachable from
the current stack frame and from current CPU registers (this captures
local variables). Finally, it includes the thread equivalents of
these: thread-local storage and thread heaps, memory reachable from
thread-local storage and thread heaps, and memory reachable from
thread CPU registers.</p>
<p>In all modes except "draconian," live memory is not
considered to be a leak. We detect this by doing a liveness flood,
traversing pointers to heap objects starting from some initial memory
regions we know to potentially contain live pointer data. Note that
this flood might potentially not find some (global) live data region
to start the flood from. If you find such, please file a bug.</p>
<p>The liveness flood attempts to treat any properly aligned byte
sequence as a pointer to a heap object, and considers it a valid
pointer whenever the current heap memory map contains an object at
the address whose byte representation was found.  Some pointers into
the interior of an object (rather than its start) will also work here.</p>
<p>As a result of this simple approach, it's possible (though
unlikely) for the flood to be inexact and occasionally result in
leaked objects being erroneously determined to be live. For instance,
random bit patterns can happen to look like pointers to leaked heap
objects. More likely, stale pointer data not corresponding to any
live program variables can be still present in memory regions,
especially in thread stacks. For instance, depending on how the local
<code>malloc</code> is implemented, it may reuse a heap object
address:</p>
<pre>
char* p = new char[1]; // new might return 0x80000000, say.
delete p;
new char[1]; // new might return 0x80000000 again
// This last new is a leak, but doesn't seem it: p looks like it points to it
</pre>
<p>In other words, imprecisions in the liveness flood mean that for
any heap leak check we might miss some memory leaks. This means that
for local leak checks, we might report a memory leak in the local
area, even though the leak actually happened before the
<code>HeapLeakChecker</code> object was constructed. Note that for
whole-program checks, a leak report <i>does</i> always correspond to a
real leak (since there's no "before" to have created a false-live
object).</p>
<p>While this liveness flood approach is not very portable and not
100% accurate, it works in most cases and saves us from writing a lot
of explicit clean up code and other hassles when dealing with thread
data.</p>
<h3>Visualizing Leaks with <code>pprof</code></h3>
<p>
The heap checker automatically prints basic leak info with stack traces of
leaked objects' allocation sites, as well as a pprof command line that can be
used to visualize the call-graph involved in these allocations.
The latter can be much more useful for a human
to see where/why the leaks happened, especially if the leaks are numerous.
</p>
<h3>Leak-checking and Threads</h3>
<p>At the time of HeapLeakChecker's construction and during
<code>NoLeaks()</code> calls, we grab a lock
and then pause all other threads so other threads do not interfere
with recording or analyzing the state of the heap.</p>
<p>In general, leak checking works correctly in the presence of
threads.  However, thread stack data liveness determination (via
<code>base/thread_lister.h</code>) does not work when the program is
running under GDB, because the ptrace functionality needed for finding
threads is already in use by GDB.  Conversely, the leak checker's own
ptrace attempts might interfere with GDB.  As a result, running under
GDB can produce false leak reports.  For this reason, the heap-checker
turns itself off when running under GDB.</p>
<p>Also, <code>thread_lister</code> only works for Linux pthreads;
leak checking is unlikely to handle other thread implementations
correctly.</p>
<p>As mentioned in the discussion of liveness flooding, thread-stack
liveness determination might mis-classify as reachable objects that
very recently became unreachable (leaked). This can happen when the
pointers to now-logically-unreachable objects are present in the
active thread stack frame. In other words, trivial code like the
following might not produce the expected leak checking outcome
depending on how the compiled code works with the stack:</p>
<pre>
int* foo = new int [20];
HeapLeakChecker check("a_check");
foo = NULL;
// May fail to trigger.
if (!check.NoLeaks()) assert(NULL == "heap memory leak");
</pre>
<hr>
<address>Maxim Lifantsev<br>
<!-- Created: Tue Dec 19 10:43:14 PST 2000 -->
<!-- hhmts start -->
Last modified: Fri Jul 13 13:14:33 PDT 2007
<!-- hhmts end -->
</address>
</body>
</html>

@@ -0,0 +1,391 @@
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<HTML>
<HEAD>
<link rel="stylesheet" href="designstyle.css">
<title>Gperftools Heap Profiler</title>
</HEAD>
<BODY>
<p align=right>
<i>Last modified
<script type=text/javascript>
var lm = new Date(document.lastModified);
document.write(lm.toDateString());
</script></i>
</p>
<p>This is the heap profiler we use at Google, to explore how C++
programs manage memory. This facility can be useful for</p>
<ul>
<li> Figuring out what is in the program heap at any given time
<li> Locating memory leaks
<li> Finding places that do a lot of allocation
</ul>
<p>The profiling system instruments all allocations and frees. It
keeps track of various pieces of information per allocation site. An
allocation site is defined as the active stack trace at the call to
<code>malloc</code>, <code>calloc</code>, <code>realloc</code>, or
<code>new</code>.</p>
<p>There are three parts to using it: linking the library into an
application, running the code, and analyzing the output.</p>
<h1>Linking in the Library</h1>
<p>To install the heap profiler into your executable, add
<code>-ltcmalloc</code> to the link-time step for your executable.
Also, while we don't necessarily recommend this form of usage, it's
possible to add in the profiler at run-time using
<code>LD_PRELOAD</code>:
<pre>% env LD_PRELOAD="/usr/lib/libtcmalloc.so" &lt;binary&gt;</pre>
<p>This does <i>not</i> turn on heap profiling; it just inserts the
code. For that reason, it's practical to just always link
<code>-ltcmalloc</code> into a binary while developing; that's what we
do at Google. (However, since any user can turn on the profiler by
setting an environment variable, it's not necessarily recommended to
install profiler-linked binaries into a production, running
system.) Note that if you wish to use the heap profiler, you must
also use the tcmalloc memory-allocation library. There is no way
currently to use the heap profiler separate from tcmalloc.</p>
<h1>Running the Code</h1>
<p>There are several alternatives to actually turn on heap profiling
for a given run of an executable:</p>
<ol>
<li> <p>Define the environment variable HEAPPROFILE to the filename
to dump the profile to. For instance, to profile
<code>/usr/local/bin/my_binary_compiled_with_tcmalloc</code>:</p>
<pre>% env HEAPPROFILE=/tmp/mybin.hprof /usr/local/bin/my_binary_compiled_with_tcmalloc</pre>
<li> <p>In your code, bracket the code you want profiled in calls to
<code>HeapProfilerStart()</code> and <code>HeapProfilerStop()</code>.
(These functions are declared in <code>&lt;gperftools/heap-profiler.h&gt;</code>.)
<code>HeapProfilerStart()</code> will take the
profile-filename-prefix as an argument. Then, as often as
you'd like before calling <code>HeapProfilerStop()</code>, you
can use <code>HeapProfilerDump()</code> or
<code>GetHeapProfile()</code> to examine the profile. In case
it's useful, <code>IsHeapProfilerRunning()</code> will tell you
whether you've already called HeapProfilerStart() or not.  (A short
sketch follows this list.)</p>
</ol>
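<p>For example, the second alternative might look like the following
minimal sketch (the allocation-heavy function is a hypothetical
stand-in):</p>
<pre>
#include &lt;gperftools/heap-profiler.h&gt;

static void BuildBigDataStructure() { /* hypothetical allocations */ }

int main() {
  HeapProfilerStart("/tmp/myphase");  // prefix: /tmp/myphase.0001.heap, ...
  BuildBigDataStructure();
  HeapProfilerDump("after build");    // write an intermediate profile now
  HeapProfilerStop();                 // stop profiling
  return 0;
}
</pre>
<p>Link with <code>-ltcmalloc</code> and analyze the dumped files with
<code>pprof</code> as described below.</p>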
<p>For security reasons, heap profiling will not write to a file --
and is thus not usable -- for setuid programs.</p>
<H2>Modifying Runtime Behavior</H2>
<p>You can more finely control the behavior of the heap profiler via
environment variables.</p>
<table frame=box rules=sides cellpadding=5 width=100%>
<tr valign=top>
<td><code>HEAP_PROFILE_ALLOCATION_INTERVAL</code></td>
<td>default: 1073741824 (1 Gb)</td>
<td>
Dump heap profiling information each time the specified number of
bytes has been allocated by the program.
</td>
</tr>
<tr valign=top>
<td><code>HEAP_PROFILE_INUSE_INTERVAL</code></td>
<td>default: 104857600 (100 Mb)</td>
<td>
Dump heap profiling information whenever the high-water memory
usage mark increases by the specified number of bytes.
</td>
</tr>
<tr valign=top>
<td><code>HEAP_PROFILE_TIME_INTERVAL</code></td>
<td>default: 0</td>
<td>
Dump heap profiling information each time the specified
number of seconds has elapsed.
</td>
</tr>
<tr valign=top>
<td><code>HEAPPROFILESIGNAL</code></td>
<td>default: disabled</td>
<td>
Dump heap profiling information whenever the specified signal is sent to the
process.
</td>
</tr>
<tr valign=top>
<td><code>HEAP_PROFILE_MMAP</code></td>
<td>default: false</td>
<td>
Profile <code>mmap</code>, <code>mremap</code> and <code>sbrk</code>
calls in addition
to <code>malloc</code>, <code>calloc</code>, <code>realloc</code>,
and <code>new</code>. <b>NOTE:</b> this causes the profiler to
profile calls internal to tcmalloc, since tcmalloc and friends use
mmap and sbrk internally for allocations. One partial solution is
to filter these allocations out when running <code>pprof</code>,
with something like
<code>pprof --ignore='DoAllocWithArena|SbrkSysAllocator::Alloc|MmapSysAllocator::Alloc'</code>.
</td>
</tr>
<tr valign=top>
<td><code>HEAP_PROFILE_ONLY_MMAP</code></td>
<td>default: false</td>
<td>
Only profile <code>mmap</code>, <code>mremap</code>, and <code>sbrk</code>
calls; do not profile
<code>malloc</code>, <code>calloc</code>, <code>realloc</code>,
or <code>new</code>.
</td>
</tr>
<tr valign=top>
<td><code>HEAP_PROFILE_MMAP_LOG</code></td>
<td>default: false</td>
<td>
Log <code>mmap</code>/<code>munmap</code> calls.
</td>
</tr>
</table>
<H2>Checking for Leaks</H2>
<p>You can use the heap profiler to manually check for leaks, for
instance by reading the profiler output and looking for large
allocations. However, for that task, it's easier to use the <A
HREF="heap_checker.html">automatic heap-checking facility</A> built
into tcmalloc.</p>
<h1><a name="pprof">Analyzing the Output</a></h1>
<p>If heap-profiling is turned on in a program, the program will
periodically write profiles to the filesystem. The sequence of
profiles will be named:</p>
<pre>
&lt;prefix&gt;.0000.heap
&lt;prefix&gt;.0001.heap
&lt;prefix&gt;.0002.heap
...
</pre>
<p>where <code>&lt;prefix&gt;</code> is the filename-prefix supplied
when running the code (e.g. via the <code>HEAPPROFILE</code>
environment variable). Note that if the supplied prefix
does not start with a <code>/</code>, the profile files will be
written to the program's working directory.</p>
<p>The profile output can be viewed by passing it to the
<code>pprof</code> tool -- the same tool that's used to analyze <A
HREF="cpuprofile.html">CPU profiles</A>.
<p>Here are some examples. These examples assume the binary is named
<code>gfs_master</code>, and a sequence of heap profile files can be
found in files named:</p>
<pre>
/tmp/profile.0001.heap
/tmp/profile.0002.heap
...
/tmp/profile.0100.heap
</pre>
<h3>Why is a process so big?</h3>
<pre>
% pprof --gv gfs_master /tmp/profile.0100.heap
</pre>
<p>This command will pop up a <code>gv</code> window that displays
the profile information as a directed graph. Here is a portion
of the resulting output:</p>
<p><center>
<img src="heap-example1.png">
</center></p>
A few explanations:
<ul>
<li> <code>GFS_MasterChunk::AddServer</code> accounts for 255.6 MB
of the live memory, which is 25% of the total live memory.
<li> <code>GFS_MasterChunkTable::UpdateState</code> is directly
accountable for 176.2 MB of the live memory (i.e., it directly
allocated 176.2 MB that has not been freed yet). Furthermore,
it and its callees are responsible for 729.9 MB. The
labels on the outgoing edges give a good indication of the
amount allocated by each callee.
</ul>
<h3>Comparing Profiles</h3>
<p>You often want to skip allocations during the initialization phase
of a program so you can find gradual memory leaks. One simple way to
do this is to compare two profiles -- both collected after the program
has been running for a while. Specify the name of the first profile
using the <code>--base</code> option. For example:</p>
<pre>
% pprof --base=/tmp/profile.0004.heap gfs_master /tmp/profile.0100.heap
</pre>
<p>The memory-usage in <code>/tmp/profile.0004.heap</code> will be
subtracted from the memory-usage in
<code>/tmp/profile.0100.heap</code> and the result will be
displayed.</p>
<h3>Text display</h3>
<pre>
% pprof --text gfs_master /tmp/profile.0100.heap
255.6 24.7% 24.7% 255.6 24.7% GFS_MasterChunk::AddServer
184.6 17.8% 42.5% 298.8 28.8% GFS_MasterChunkTable::Create
176.2 17.0% 59.5% 729.9 70.5% GFS_MasterChunkTable::UpdateState
169.8 16.4% 75.9% 169.8 16.4% PendingClone::PendingClone
76.3 7.4% 83.3% 76.3 7.4% __default_alloc_template::_S_chunk_alloc
49.5 4.8% 88.0% 49.5 4.8% hashtable::resize
...
</pre>
<p>
<ul>
<li> The first column contains the direct memory use in MB.
<li> The fourth column contains memory use by the procedure
and all of its callees.
<li> The second and fifth columns are just percentage
representations of the numbers in the first and fourth columns.
<li> The third column is a cumulative sum of the second column
(i.e., the <code>k</code>th entry in the third column is the
sum of the first <code>k</code> entries in the second column.)
</ul>
<h3>Ignoring or focusing on specific regions</h3>
<p>The following command will give a graphical display of a subset of
the call-graph. Only paths in the call-graph that match the regular
expression <code>DataBuffer</code> are included:</p>
<pre>
% pprof --gv --focus=DataBuffer gfs_master /tmp/profile.0100.heap
</pre>
<p>Similarly, the following command will omit a subset of the
call-graph.  All paths in the call-graph that match the regular
expression <code>DataBuffer</code> are discarded:</p>
<pre>
% pprof --gv --ignore=DataBuffer gfs_master /tmp/profile.0100.heap
</pre>
<h3>Total allocations + object-level information</h3>
<p>All of the previous examples have displayed the amount of in-use
space, i.e., the number of bytes that have been allocated but not
freed.  You can also get other types of information by supplying a
flag to <code>pprof</code>:</p>
<center>
<table frame=box rules=sides cellpadding=5 width=100%>
<tr valign=top>
<td><code>--inuse_space</code></td>
<td>
Display the number of in-use megabytes (i.e. space that has
been allocated but not freed). This is the default.
</td>
</tr>
<tr valign=top>
<td><code>--inuse_objects</code></td>
<td>
Display the number of in-use objects (i.e. number of
objects that have been allocated but not freed).
</td>
</tr>
<tr valign=top>
<td><code>--alloc_space</code></td>
<td>
Display the number of allocated megabytes. This includes
the space that has since been de-allocated. Use this
if you want to find the main allocation sites in the
program.
</td>
</tr>
<tr valign=top>
<td><code>--alloc_objects</code></td>
<td>
Display the number of allocated objects. This includes
the objects that have since been de-allocated. Use this
if you want to find the main allocation sites in the
program.
</td>
</tr>
</table>
</center>
<h3>Interactive mode</h3>
<p>By default -- if you don't specify any flags to the contrary --
pprof runs in interactive mode. At the <code>(pprof)</code> prompt,
you can run many of the commands described above. You can type
<code>help</code> for a list of what commands are available in
interactive mode.</p>
<h1>Caveats</h1>
<ul>
<li> Heap profiling requires the use of libtcmalloc. This
requirement may be removed in a future version of the heap
profiler, and the heap profiler separated out into its own
library.
<li> If the program linked in a library that was not compiled
with enough symbolic information, all samples associated
with the library may be charged to the last symbol found
in the program before the library. This will artificially
inflate the count for that symbol.
<li> If you run the program on one machine, and profile it on
another, and the shared libraries are different on the two
machines, the profiling output may be confusing: samples that
fall within the shared libraries may be assigned to arbitrary
procedures.
<li> Several libraries, such as some STL implementations, do their
own memory management. This may cause strange profiling
results. We have code in libtcmalloc to cause STL to use
tcmalloc for memory management (which in our tests is better
than STL's internal management), though it only works for some
STL implementations.
<li> If your program forks, the children will also be profiled
(since they inherit the same HEAPPROFILE setting). Each
process is profiled separately; to distinguish the child
profiles from the parent profile and from each other, all
children will have their process-id attached to the HEAPPROFILE
name.
<li> Due to a hack we make to work around a possible gcc bug, your
profiles may end up named strangely if the first character of
your HEAPPROFILE variable has an ASCII value greater than 127.
This should be exceedingly rare, but if you need to use such a
name, just prepend <code>./</code> to your filename:
<code>HEAPPROFILE=./&Auml;gypten</code>.
</ul>
<hr>
<address>Sanjay Ghemawat
<!-- Created: Tue Dec 19 10:43:14 PST 2000 -->
</address>
</body>
</html>

View file

@ -0,0 +1,20 @@
<HTML>
<HEAD>
<title>Gperftools</title>
</HEAD>
<BODY>
<ul>
<li> <A HREF="tcmalloc.html">thread-caching malloc</A>
<li> <A HREF="heap_checker.html">heap-checking using tcmalloc</A>
<li> <A HREF="heapprofile.html">heap-profiling using tcmalloc</A>
<li> <A HREF="cpuprofile.html">CPU profiler</A>
</ul>
<hr>
Last modified: Thu Feb 2 14:40:47 PST 2012
</BODY>
</HTML>

View file

@ -0,0 +1,15 @@
digraph Overview {
node [shape = box]
{rank=same
T1 [label="Thread Cache"]
Tsep [label="...", shape=plaintext]
Tn [label="Thread Cache"]
T1 -> Tsep -> Tn [style=invis]
}
C [label="Central\nHeap"]
T1 -> C [dir=both]
Tn -> C [dir=both]
}

Binary file not shown.


View file

@ -0,0 +1,25 @@
digraph PageHeap {
rankdir=LR
node [shape=box, width=0.3, height=0.3]
nodesep=.05
heap [shape=record, height=3, label="<f0>1 page|<f1>2 pages|<f2>3 pages|...|<f128>128 pages"]
O0 [shape=record, label=""]
O1 [shape=record, label=""]
O2 [shape=record, label="{|}"]
O3 [shape=record, label="{|}"]
O4 [shape=record, label="{||}"]
O5 [shape=record, label="{||}"]
O6 [shape=record, label="{|...|}"]
O7 [shape=record, label="{|...|}"]
sep1 [shape=plaintext, label="..."]
sep2 [shape=plaintext, label="..."]
sep3 [shape=plaintext, label="..."]
sep4 [shape=plaintext, label="..."]
heap:f0 -> O0 -> O1 -> sep1
heap:f1 -> O2 -> O3 -> sep2
heap:f2 -> O4 -> O5 -> sep3
heap:f128 -> O6 -> O7 -> sep4
}

Binary files not shown.

View file

@ -0,0 +1,131 @@
.\" DO NOT MODIFY THIS FILE! It was generated by help2man 1.23.
.TH PPROF "1" "February 2005" "pprof (part of gperftools)" Google
.SH NAME
pprof \- manual page for pprof (part of gperftools)
.SH SYNOPSIS
.B pprof
[\fIoptions\fR] \fI<program> <profile>\fR
.SH DESCRIPTION
.IP
Prints specified cpu- or heap-profile
.SH OPTIONS
.TP
\fB\-\-cum\fR
Sort by cumulative data
.TP
\fB\-\-base=\fR<base>
Subtract <base> from <profile> before display
.SS "Reporting Granularity:"
.TP
\fB\-\-addresses\fR
Report at address level
.TP
\fB\-\-lines\fR
Report at source line level
.TP
\fB\-\-functions\fR
Report at function level [default]
.TP
\fB\-\-files\fR
Report at source file level
.SS "Output type:"
.TP
\fB\-\-text\fR
Generate text report [default]
.TP
\fB\-\-gv\fR
Generate Postscript and display
.TP
\fB\-\-list=\fR<regexp>
Generate source listing of matching routines
.TP
\fB\-\-disasm=\fR<regexp>
Generate disassembly of matching routines
.TP
\fB\-\-dot\fR
Generate DOT file to stdout
.TP
\fB\-\-ps\fR
Generate Postscript to stdout
.TP
\fB\-\-pdf\fR
Generate PDF to stdout
.TP
\fB\-\-gif\fR
Generate GIF to stdout
.SS "Heap-Profile Options:"
.TP
\fB\-\-inuse_space\fR
Display in-use (mega)bytes [default]
.TP
\fB\-\-inuse_objects\fR
Display in-use objects
.TP
\fB\-\-alloc_space\fR
Display allocated (mega)bytes
.TP
\fB\-\-alloc_objects\fR
Display allocated objects
.TP
\fB\-\-show_bytes\fR
Display space in bytes
.TP
\fB\-\-drop_negative\fR
Ignore negative differences
.SS "Call-graph Options:"
.TP
\fB\-\-nodecount=\fR<n>
Show at most so many nodes [default=80]
.TP
\fB\-\-nodefraction=\fR<f>
Hide nodes below <f>*total [default=.005]
.TP
\fB\-\-edgefraction=\fR<f>
Hide edges below <f>*total [default=.001]
.TP
\fB\-\-focus=\fR<regexp>
Focus on nodes matching <regexp>
.TP
\fB\-\-ignore=\fR<regexp>
Ignore nodes matching <regexp>
.TP
\fB\-\-scale=\fR<n>
Set GV scaling [default=0]
.SH EXAMPLES
pprof /bin/ls ls.prof
.IP
Outputs one line per procedure
.PP
pprof \fB\-\-gv\fR /bin/ls ls.prof
.IP
Displays annotated call-graph via 'gv'
.PP
pprof \fB\-\-gv\fR \fB\-\-focus\fR=\fIMutex\fR /bin/ls ls.prof
.IP
Restricts to code paths including a .*Mutex.* entry
.PP
pprof \fB\-\-gv\fR \fB\-\-focus\fR=\fIMutex\fR \fB\-\-ignore\fR=\fIstring\fR /bin/ls ls.prof
.IP
Code paths including Mutex but not string
.PP
pprof \fB\-\-list\fR=\fIgetdir\fR /bin/ls ls.prof
.IP
Source listing (with per-line annotations) for getdir()
.PP
pprof \fB\-\-disasm\fR=\fIgetdir\fR /bin/ls ls.prof
.IP
Disassembly (with per-PC annotations) for getdir()
.SH COPYRIGHT
Copyright \(co 2005 Google Inc.
.SH "SEE ALSO"
Further documentation for
.B pprof
is maintained as a web page called
.B cpu_profiler.html
and is likely installed at one of the following locations:
.IP
.B /usr/share/gperftools/cpu_profiler.html
.br
.B /usr/local/share/gperftools/cpu_profiler.html
.PP

View file

@ -0,0 +1,11 @@
[see also]
Further documentation for
.B pprof
is maintained as a web page called
.B cpu_profiler.html
and is likely installed at one of the following locations:
.IP
.B /usr/share/gperftools/cpu_profiler.html
.br
.B /usr/local/share/gperftools/cpu_profiler.html
.PP

View file

@ -0,0 +1,260 @@
<HTML>
<HEAD>
<title>pprof and Remote Servers</title>
</HEAD>
<BODY>
<h1><code>pprof</code> and Remote Servers</h1>
<p>In mid-2006, we added an experimental facility to <A
HREF="cpu_profiler.html">pprof</A>, the tool that analyzes CPU and
heap profiles. This facility allows you to collect profile
information from running applications. It makes it easy to collect
profile information without having to stop the program first, and
without having to log into the machine where the application is
running. This is meant to be used on webservers, but will work on any
application that can be modified to accept TCP connections on a port
of its choosing, and to respond to HTTP requests on that port.</p>
<p>We do not currently have infrastructure, such as apache modules,
that you can pop into a webserver or other application to get the
necessary functionality "for free." However, it's easy to generate
the necessary data, which should allow the interested developer to add
the necessary support into his or her applications.</p>
<p>To use <code>pprof</code> in this experimental "server" mode, you
give the script a host and port it should query, replacing the normal
commandline arguments of application + profile file:</p>
<pre>
% pprof internalweb.mycompany.com:80
</pre>
<p>The host must be listening on that port, and be able to accept HTTP/1.0
requests -- sent via <code>wget</code> and <code>curl</code> -- for
several urls. The following sections list the urls that
<code>pprof</code> can send, and the responses it expects in
return.</p>
<p>Here are examples of urls that pprof will recognize when you give
them on the commandline.  In general, you
specify the host and a port (the port-number is required), and put
the service-name at the end of the url:</p>
<blockquote><pre>
http://myhost:80/pprof/heap # retrieves a heap profile
http://myhost:8008/pprof/profile # retrieves a CPU profile
http://myhost:80 # retrieves a CPU profile (the default)
http://myhost:8080/ # retrieves a CPU profile (the default)
myhost:8088/pprof/growth # "http://" is optional, but port is not
http://myhost:80/myservice/pprof/heap # /pprof/heap just has to come at the end
http://myhost:80/pprof/pmuprofile # CPU profile using performance counters
</pre></blockquote>
<h2> <code><b>/pprof/heap</b></code> </h2>
<p><code>pprof</code> asks for the url <code>/pprof/heap</code> to
get heap information. The actual url is controlled via the variable
<code>HEAP_PAGE</code> in the <code>pprof</code> script, so you
can change it if you'd like.</p>
<p>There are two ways to get this data. The first is to call</p>
<pre>
MallocExtension::instance()->GetHeapSample(&output);
</pre>
<p>and have the server send <code>output</code> back as an HTTP
response to <code>pprof</code>. <code>MallocExtension</code> is
defined in the header file <code>gperftools/malloc_extension.h</code>.</p>
<p>Note this will only work if the binary is being run with
sampling turned on (which is not the default). To do this, set the
environment variable <code>TCMALLOC_SAMPLE_PARAMETER</code> to a
positive value, such as 524288, before running.</p>
<p>The other way is to call <code>HeapProfilerStart(filename)</code>
(from <code>heap-profiler.h</code>), continue to do work, and then,
some number of seconds later, call <code>GetHeapProfile()</code>
(followed by <code>HeapProfilerStop()</code>). The server can send
the output of <code>GetHeapProfile</code> back as the HTTP response to
pprof. (Note you must <code>free()</code> this data after using it.)
This is similar to how <A HREF="#profile">profile requests</A> are
handled, below. This technique does not require the application to
run with sampling turned on.</p>
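<p>For concreteness, here is a minimal sketch of this second approach
in C++.  It assumes the surrounding server supplies the HTTP plumbing;
the <code>HandleHeapRequest</code> name, the filename-prefix, and the
fixed delay are invented for illustration:</p>
<pre>
#include &lt;cstdlib&gt;
#include &lt;string&gt;
#include &lt;unistd.h&gt;
#include &lt;gperftools/heap-profiler.h&gt;

// Hypothetical handler: the server should send the returned string
// back as the body of the HTTP response.
std::string HandleHeapRequest() {
  HeapProfilerStart("/tmp/myserver");   // any writable filename-prefix
  sleep(5);                             // let the server do real work meanwhile
  char* profile = GetHeapProfile();     // snapshot of live allocations
  HeapProfilerStop();
  std::string response(profile);
  free(profile);                        // GetHeapProfile() output must be free()d
  return response;
}
</pre>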
<p>Here's an example of what the output should look like:</p>
<pre>
heap profile: 1923: 127923432 [ 1923: 127923432] @ heap_v2/524288
1: 312 [ 1: 312] @ 0x2aaaabaf5ccc 0x2aaaaba4cd2c 0x2aaaac08c09a
928: 122586016 [ 928: 122586016] @ 0x2aaaabaf682c 0x400680 0x400bdd 0x2aaaab1c368a 0x2aaaab1c8f77 0x2aaaab1c0396 0x2aaaab1c86ed 0x4007ff 0x2aaaaca62afa
1: 16 [ 1: 16] @ 0x2aaaabaf5ccc 0x2aaaabb04bac 0x2aaaabc1b262 0x2aaaabc21496 0x2aaaabc214bb
[...]
</pre>
<p>Older code may produce "version 1" heap profiles which look like this:</p>
<pre>
heap profile: 14933: 791700132 [ 14933: 791700132] @ heap
1: 848688 [ 1: 848688] @ 0xa4b142 0x7f5bfc 0x87065e 0x4056e9 0x4125f8 0x42b4f1 0x45b1ba 0x463248 0x460871 0x45cb7c 0x5f1744 0x607cee 0x5f4a5e 0x40080f 0x2aaaabad7afa
1: 1048576 [ 1: 1048576] @ 0xa4a9b2 0x7fd025 0x4ca6d8 0x4ca814 0x4caa88 0x2aaaab104cf0 0x404e20 0x4125f8 0x42b4f1 0x45b1ba 0x463248 0x460871 0x45cb7c 0x5f1744 0x607cee 0x5f4a5e 0x40080f 0x2aaaabad7afa
2942: 388629374 [ 2942: 388629374] @ 0xa4b142 0x4006a0 0x400bed 0x5f0cfa 0x5f1744 0x607cee 0x5f4a5e 0x40080f 0x2aaaabad7afa
[...]
</pre>
<p>pprof accepts both old and new heap profiles and automatically
detects which one you are using.</p>
<h2> <code><b>/pprof/growth</b></code> </h2>
<p><code>pprof</code> asks for the url <code>/pprof/growth</code> to
get heap-profiling delta (growth) information. The actual url is
controlled via the variable <code>GROWTH_PAGE</code> in the
<code>pprof</code> script, so you can change it if you'd like.</p>
<p>The server should respond by calling</p>
<pre>
MallocExtension::instance()->GetHeapGrowthStacks(&output);
</pre>
<p>and sending <code>output</code> back as an HTTP response to
<code>pprof</code>. <code>MallocExtension</code> is defined in the
header file <code>gperftools/malloc_extension.h</code>.</p>
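<p>A hedged sketch of such a handler (the
<code>HandleGrowthRequest</code> name is invented; the same shape also
works for the <code>GetHeapSample</code> call described earlier):</p>
<pre>
#include &lt;string&gt;
#include &lt;gperftools/malloc_extension.h&gt;

// Hypothetical handler: the server should send the returned string
// back as the body of the HTTP response.
std::string HandleGrowthRequest() {
  std::string output;
  MallocExtension::instance()->GetHeapGrowthStacks(&output);
  return output;
}
</pre>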
<p>Here's an example, from an actual Google webserver, of what the
output should look like:</p>
<pre>
heap profile: 741: 812122112 [ 741: 812122112] @ growth
1: 1572864 [ 1: 1572864] @ 0x87da564 0x87db8a3 0x84787a4 0x846e851 0x836d12f 0x834cd1c 0x8349ba5 0x10a3177 0x8349961
1: 1048576 [ 1: 1048576] @ 0x87d92e8 0x87d9213 0x87d9178 0x87d94d3 0x87da9da 0x8a364ff 0x8a437e7 0x8ab7d23 0x8ab7da9 0x8ac7454 0x8348465 0x10a3161 0x8349961
[...]
</pre>
<h2> <A NAME="profile"><code><b>/pprof/profile</b></code></A> </h2>
<p><code>pprof</code> asks for the url
<code>/pprof/profile?seconds=XX</code> to get cpu-profiling
information. The actual url is controlled via the variable
<code>PROFILE_PAGE</code> in the <code>pprof</code> script, so you can
change it if you'd like.</p>
<p>The server should respond by calling
<code>ProfilerStart(filename)</code>, continuing to do its work, and
then, XX seconds later, calling <code>ProfilerStop()</code>. (These
functions are declared in <code>gperftools/profiler.h</code>.) The
application is responsible for picking a unique filename for
<code>ProfilerStart()</code>. After calling
<code>ProfilerStop()</code>, the server should read the contents of
<code>filename</code> and send them back as an HTTP response to
<code>pprof</code>.</p>
<p>Obviously, to get useful profile information the application must
continue to run in the XX seconds that the profiler is running. Thus,
the profile start-stop calls should be done in a separate thread, or
be otherwise non-blocking.</p>
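<p>As a sketch (with an invented handler name and filename; a real
server would do the start/stop asynchronously rather than sleeping, as
just noted):</p>
<pre>
#include &lt;cstdio&gt;
#include &lt;string&gt;
#include &lt;unistd.h&gt;
#include &lt;gperftools/profiler.h&gt;

// Hypothetical handler for /pprof/profile?seconds=XX.
std::string HandleProfileRequest(int seconds) {
  const char* filename = "/tmp/myserver.prof";  // should be unique per request
  ProfilerStart(filename);
  sleep(seconds);                // simplification: real servers keep serving here
  ProfilerStop();
  std::string body;              // read the binary profile back into memory
  FILE* f = fopen(filename, "rb");
  if (f != NULL) {
    char buf[4096];
    size_t n;
    while ((n = fread(buf, 1, sizeof(buf), f)) > 0) body.append(buf, n);
    fclose(f);
  }
  return body;                   // send back as the HTTP response body
}
</pre>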
<p>The profiler output file is binary, but near the end of it, it
should have lines of text somewhat like this:</p>
<pre>
01016000-01017000 rw-p 00015000 03:01 59314 /lib/ld-2.2.2.so
</pre>
<h2> <code><b>/pprof/pmuprofile</b></code> </h2>
<p><code>pprof</code> asks for a url of the form
<code>/pprof/pmuprofile?event=hw_event:unit_mask&period=nnn&seconds=xxx</code>
to get cpu-profiling information. The actual url is controlled via the variable
<code>PMUPROFILE_PAGE</code> in the <code>pprof</code> script, so you can
change it if you'd like.</p>
<p>
This is similar to <code>/pprof/profile</code>, but is meant to be used with your CPU's hardware
performance counters. The server could be implemented on top of a library
such as <a href="http://perfmon2.sourceforge.net/">
<code>libpfm</code></a>. It should collect a sample every nnn occurrences
of the event and stop the sampling after xxx seconds. Much of the code
for <code>/pprof/profile</code> can be reused for this purpose.
</p>
<p>The server side routines (the equivalent of
ProfilerStart/ProfilerStop) are not available as part of perftools,
so this URL is unlikely to be that useful.</p>
<h2> <code><b>/pprof/contention</b></code> </h2>
<p>This is intended to be able to profile (thread) lock contention in
addition to CPU and memory use. It's not yet usable.</p>
<h2> <code><b>/pprof/cmdline</b></code> </h2>
<p><code>pprof</code> asks for the url <code>/pprof/cmdline</code> to
figure out what application it's profiling. The actual url is
controlled via the variable <code>PROGRAM_NAME_PAGE</code> in the
<code>pprof</code> script, so you can change it if you'd like.</p>
<p>The server should respond by reading the contents of
<code>/proc/self/cmdline</code>, converting all internal NUL (\0)
characters to newlines, and sending the result back as an HTTP
response to <code>pprof</code>.</p>
<p>Here's an example return value:</p>
<pre>
/root/server/custom_webserver
80
--configfile=/root/server/ws.config
</pre>
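<p>A minimal sketch of such a handler (the name is invented for
illustration):</p>
<pre>
#include &lt;fstream&gt;
#include &lt;iterator&gt;
#include &lt;string&gt;

// Hypothetical handler: read /proc/self/cmdline and replace the NUL
// separators with newlines before returning it as the response body.
std::string HandleCmdlineRequest() {
  std::ifstream in("/proc/self/cmdline", std::ios::binary);
  std::string data((std::istreambuf_iterator&lt;char&gt;(in)),
                   std::istreambuf_iterator&lt;char&gt;());
  for (std::string::size_type i = 0; i &lt; data.size(); ++i) {
    if (data[i] == '\0') data[i] = '\n';
  }
  return data;
}
</pre>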
<h2> <code><b>/pprof/symbol</b></code> </h2>
<p><code>pprof</code> asks for the url <code>/pprof/symbol</code> to
map from hex addresses to variable names. The actual url is
controlled via the variable <code>SYMBOL_PAGE</code> in the
<code>pprof</code> script, so you can change it if you'd like.</p>
<p>When the server receives a GET request for
<code>/pprof/symbol</code>, it should return a line formatted like
so:</p>
<pre>
num_symbols: ###
</pre>
<p>where <code>###</code> is the number of symbols found in the
binary. (For now, the only important distinction is whether the value
is 0, which it is for executables that lack debug information, or
not-0).</p>
<p>This is perhaps the hardest request to write code for, because in
addition to the GET request for this url, the server must accept POST
requests. This means that after the HTTP headers, pprof will pass in
a list of hex addresses connected by <code>+</code>, like so:</p>
<pre>
curl -d '0x0824d061+0x0824d1cf' http://remote_host:80/pprof/symbol
</pre>
<p>The server should read the POST data, which will be in one line,
and for each hex value, should write one line of output to the output
stream, like so:</p>
<pre>
&lt;hex address&gt;&lt;tab&gt;&lt;function name&gt;
</pre>
<p>For instance:</p>
<pre>
0x08b2dabd _Update
</pre>
<p>The other reason this is the most difficult request to implement,
is that the application will have to figure out for itself how to map
from address to function name. One possibility is to run <code>nm -C
-n &lt;program name&gt;</code> to get the mappings at
program-compile-time. Another, at least on Linux, is to call out to
addr2line for every <code>pprof/symbol</code> call, for instance
<code>addr2line -Cfse /proc/&lt;getpid&gt;/exe 0x12345678 0x876543210</code>
(presumably with some caching!).</p>
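<p>Whichever lookup mechanism is used, formatting the POST response is
straightforward.  A sketch, assuming a hypothetical
<code>SymbolForAddress()</code> lookup (backed by <code>nm</code>,
<code>addr2line</code>, or a prebuilt table):</p>
<pre>
#include &lt;cstdlib&gt;
#include &lt;sstream&gt;
#include &lt;string&gt;

// Hypothetical: map an address to a function name, or "" if unknown.
std::string SymbolForAddress(unsigned long long addr);

std::string HandleSymbolPost(const std::string& post_body) {
  std::istringstream in(post_body);   // e.g. "0x0824d061+0x0824d1cf"
  std::ostringstream out;
  std::string tok;
  while (std::getline(in, tok, '+')) {
    unsigned long long addr = std::strtoull(tok.c_str(), NULL, 16);
    // One "&lt;hex address&gt;&lt;tab&gt;&lt;function name&gt;" line per address:
    out &lt;&lt; tok &lt;&lt; "\t" &lt;&lt; SymbolForAddress(addr) &lt;&lt; "\n";
  }
  return out.str();
}
</pre>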
<p><code>pprof</code> itself does just this for local profiles (not
ones that talk to remote servers); look at the subroutine
<code>GetProcedureBoundaries</code>.</p>
<hr>
Last modified: Mon Jun 12 21:30:14 PDT 2006
</body>
</html>

View file

@ -0,0 +1,22 @@
digraph SpanMap {
node [shape=box, width=0.3, height=0.3]
nodesep=.05
map [shape=record, width=6, label="<f0>|<f1>|<f2>|<f3>|<f4>|<f5>|<f6>|<f7>|<f8>|<f9>|<f10>"]
S0 [label="a"]
S1 [label="b"]
S2 [label="c"]
S3 [label="d"]
map:f0 -> S0
map:f1 -> S0
map:f2 -> S1
map:f3 -> S2
map:f4 -> S2
map:f5 -> S2
map:f6 -> S2
map:f7 -> S2
map:f8 -> S3
map:f9 -> S3
map:f10 -> S3
}

Binary file not shown.


View file

@ -0,0 +1,480 @@
time.1.ptmalloc.64:0.56 user 0.02 system 0.57 elapsed 100% CPU
time.1.tcmalloc.64:0.38 user 0.02 system 0.40 elapsed 98% CPU
time.1.ptmalloc.128:0.61 user 0.01 system 0.61 elapsed 101% CPU
time.1.tcmalloc.128:0.35 user 0.00 system 0.35 elapsed 99% CPU
time.1.ptmalloc.256:0.59 user 0.01 system 0.60 elapsed 100% CPU
time.1.tcmalloc.256:0.27 user 0.02 system 0.28 elapsed 102% CPU
time.1.ptmalloc.512:0.57 user 0.00 system 0.57 elapsed 100% CPU
time.1.tcmalloc.512:0.25 user 0.01 system 0.25 elapsed 101% CPU
time.1.ptmalloc.1024:0.52 user 0.00 system 0.52 elapsed 99% CPU
time.1.tcmalloc.1024:0.22 user 0.02 system 0.24 elapsed 97% CPU
time.1.ptmalloc.2048:0.47 user 0.00 system 0.47 elapsed 99% CPU
time.1.tcmalloc.2048:0.22 user 0.02 system 0.25 elapsed 95% CPU
time.1.ptmalloc.4096:0.48 user 0.01 system 0.48 elapsed 100% CPU
time.1.tcmalloc.4096:0.25 user 0.01 system 0.25 elapsed 100% CPU
time.1.ptmalloc.8192:0.49 user 0.02 system 0.49 elapsed 102% CPU
time.1.tcmalloc.8192:0.27 user 0.02 system 0.28 elapsed 101% CPU
time.1.ptmalloc.16384:0.51 user 0.04 system 0.55 elapsed 99% CPU
time.1.tcmalloc.16384:0.35 user 0.02 system 0.37 elapsed 100% CPU
time.1.ptmalloc.32768:0.53 user 0.14 system 0.66 elapsed 100% CPU
time.1.tcmalloc.32768:0.67 user 0.02 system 0.69 elapsed 99% CPU
time.1.ptmalloc.65536:0.68 user 0.31 system 0.98 elapsed 100% CPU
time.1.tcmalloc.65536:0.71 user 0.01 system 0.72 elapsed 99% CPU
time.1.ptmalloc.131072:0.90 user 0.72 system 1.62 elapsed 99% CPU
time.1.tcmalloc.131072:0.94 user 0.03 system 0.97 elapsed 99% CPU
time.2.ptmalloc.64:1.05 user 0.00 system 0.53 elapsed 196% CPU
time.2.tcmalloc.64:0.66 user 0.03 system 0.37 elapsed 185% CPU
time.2.ptmalloc.128:1.77 user 0.01 system 0.89 elapsed 198% CPU
time.2.tcmalloc.128:0.53 user 0.01 system 0.29 elapsed 184% CPU
time.2.ptmalloc.256:1.14 user 0.01 system 0.62 elapsed 182% CPU
time.2.tcmalloc.256:0.45 user 0.02 system 0.26 elapsed 180% CPU
time.2.ptmalloc.512:1.26 user 0.40 system 1.79 elapsed 92% CPU
time.2.tcmalloc.512:0.43 user 0.02 system 0.27 elapsed 166% CPU
time.2.ptmalloc.1024:0.98 user 0.03 system 0.56 elapsed 179% CPU
time.2.tcmalloc.1024:0.44 user 0.02 system 0.34 elapsed 134% CPU
time.2.ptmalloc.2048:0.87 user 0.02 system 0.44 elapsed 199% CPU
time.2.tcmalloc.2048:0.49 user 0.02 system 0.34 elapsed 148% CPU
time.2.ptmalloc.4096:0.92 user 0.03 system 0.48 elapsed 196% CPU
time.2.tcmalloc.4096:0.50 user 0.02 system 0.49 elapsed 105% CPU
time.2.ptmalloc.8192:1.05 user 0.04 system 0.55 elapsed 196% CPU
time.2.tcmalloc.8192:0.59 user 0.01 system 0.51 elapsed 116% CPU
time.2.ptmalloc.16384:1.30 user 0.14 system 0.72 elapsed 198% CPU
time.2.tcmalloc.16384:0.63 user 0.03 system 0.68 elapsed 96% CPU
time.2.ptmalloc.32768:1.33 user 0.56 system 1.00 elapsed 189% CPU
time.2.tcmalloc.32768:1.16 user 0.01 system 1.17 elapsed 99% CPU
time.2.ptmalloc.65536:1.86 user 1.79 system 2.01 elapsed 181% CPU
time.2.tcmalloc.65536:1.35 user 0.01 system 1.35 elapsed 100% CPU
time.2.ptmalloc.131072:2.61 user 5.19 system 4.81 elapsed 162% CPU
time.2.tcmalloc.131072:1.86 user 0.04 system 1.90 elapsed 100% CPU
time.3.ptmalloc.64:1.79 user 0.03 system 0.67 elapsed 268% CPU
time.3.tcmalloc.64:1.58 user 0.04 system 0.62 elapsed 260% CPU
time.3.ptmalloc.128:2.77 user 1.34 system 3.07 elapsed 133% CPU
time.3.tcmalloc.128:1.19 user 0.01 system 0.50 elapsed 236% CPU
time.3.ptmalloc.256:2.14 user 0.02 system 0.85 elapsed 252% CPU
time.3.tcmalloc.256:0.96 user 0.01 system 0.41 elapsed 236% CPU
time.3.ptmalloc.512:3.37 user 1.31 system 3.33 elapsed 140% CPU
time.3.tcmalloc.512:0.93 user 0.04 system 0.39 elapsed 243% CPU
time.3.ptmalloc.1024:1.66 user 0.01 system 0.64 elapsed 260% CPU
time.3.tcmalloc.1024:0.81 user 0.02 system 0.44 elapsed 187% CPU
time.3.ptmalloc.2048:2.07 user 0.01 system 0.82 elapsed 252% CPU
time.3.tcmalloc.2048:1.10 user 0.04 system 0.59 elapsed 191% CPU
time.3.ptmalloc.4096:2.01 user 0.03 system 0.79 elapsed 258% CPU
time.3.tcmalloc.4096:0.87 user 0.03 system 0.65 elapsed 137% CPU
time.3.ptmalloc.8192:2.22 user 0.11 system 0.83 elapsed 280% CPU
time.3.tcmalloc.8192:0.96 user 0.06 system 0.75 elapsed 135% CPU
time.3.ptmalloc.16384:2.56 user 0.47 system 1.02 elapsed 295% CPU
time.3.tcmalloc.16384:0.99 user 0.04 system 1.03 elapsed 99% CPU
time.3.ptmalloc.32768:3.29 user 1.75 system 1.96 elapsed 256% CPU
time.3.tcmalloc.32768:1.67 user 0.02 system 1.69 elapsed 99% CPU
time.3.ptmalloc.65536:4.04 user 6.62 system 4.92 elapsed 216% CPU
time.3.tcmalloc.65536:1.91 user 0.02 system 1.98 elapsed 97% CPU
time.3.ptmalloc.131072:5.55 user 17.86 system 12.44 elapsed 188% CPU
time.3.tcmalloc.131072:2.78 user 0.02 system 2.82 elapsed 99% CPU
time.4.ptmalloc.64:3.42 user 1.36 system 3.20 elapsed 149% CPU
time.4.tcmalloc.64:2.42 user 0.02 system 0.71 elapsed 341% CPU
time.4.ptmalloc.128:3.98 user 1.79 system 3.89 elapsed 148% CPU
time.4.tcmalloc.128:1.87 user 0.02 system 0.58 elapsed 325% CPU
time.4.ptmalloc.256:4.06 user 2.14 system 4.12 elapsed 150% CPU
time.4.tcmalloc.256:1.69 user 0.02 system 0.51 elapsed 331% CPU
time.4.ptmalloc.512:4.48 user 2.15 system 4.39 elapsed 150% CPU
time.4.tcmalloc.512:1.62 user 0.03 system 0.52 elapsed 314% CPU
time.4.ptmalloc.1024:3.18 user 0.03 system 0.84 elapsed 381% CPU
time.4.tcmalloc.1024:1.53 user 0.02 system 0.56 elapsed 274% CPU
time.4.ptmalloc.2048:3.24 user 0.02 system 0.84 elapsed 384% CPU
time.4.tcmalloc.2048:1.44 user 0.04 system 0.66 elapsed 221% CPU
time.4.ptmalloc.4096:3.50 user 0.04 system 0.91 elapsed 389% CPU
time.4.tcmalloc.4096:1.31 user 0.01 system 0.89 elapsed 148% CPU
time.4.ptmalloc.8192:6.77 user 3.85 system 4.14 elapsed 256% CPU
time.4.tcmalloc.8192:1.20 user 0.05 system 0.97 elapsed 127% CPU
time.4.ptmalloc.16384:7.08 user 5.06 system 4.63 elapsed 262% CPU
time.4.tcmalloc.16384:1.27 user 0.03 system 1.25 elapsed 103% CPU
time.4.ptmalloc.32768:5.57 user 4.22 system 3.31 elapsed 295% CPU
time.4.tcmalloc.32768:2.17 user 0.03 system 2.25 elapsed 97% CPU
time.4.ptmalloc.65536:6.11 user 15.05 system 9.19 elapsed 230% CPU
time.4.tcmalloc.65536:2.51 user 0.02 system 2.57 elapsed 98% CPU
time.4.ptmalloc.131072:7.58 user 33.15 system 21.28 elapsed 191% CPU
time.4.tcmalloc.131072:3.57 user 0.07 system 3.66 elapsed 99% CPU
time.5.ptmalloc.64:4.44 user 2.08 system 4.37 elapsed 148% CPU
time.5.tcmalloc.64:2.87 user 0.02 system 0.79 elapsed 361% CPU
time.5.ptmalloc.128:4.77 user 2.77 system 5.14 elapsed 146% CPU
time.5.tcmalloc.128:2.65 user 0.03 system 0.72 elapsed 367% CPU
time.5.ptmalloc.256:5.82 user 2.88 system 5.49 elapsed 158% CPU
time.5.tcmalloc.256:2.33 user 0.01 system 0.66 elapsed 352% CPU
time.5.ptmalloc.512:6.27 user 3.11 system 5.34 elapsed 175% CPU
time.5.tcmalloc.512:2.14 user 0.03 system 0.70 elapsed 307% CPU
time.5.ptmalloc.1024:6.82 user 3.18 system 5.23 elapsed 191% CPU
time.5.tcmalloc.1024:2.20 user 0.02 system 0.70 elapsed 313% CPU
time.5.ptmalloc.2048:6.57 user 3.46 system 5.22 elapsed 192% CPU
time.5.tcmalloc.2048:2.15 user 0.03 system 0.82 elapsed 264% CPU
time.5.ptmalloc.4096:8.75 user 5.09 system 5.26 elapsed 263% CPU
time.5.tcmalloc.4096:1.68 user 0.03 system 1.08 elapsed 158% CPU
time.5.ptmalloc.8192:4.48 user 0.61 system 1.51 elapsed 335% CPU
time.5.tcmalloc.8192:1.47 user 0.07 system 1.18 elapsed 129% CPU
time.5.ptmalloc.16384:5.71 user 1.98 system 2.14 elapsed 358% CPU
time.5.tcmalloc.16384:1.58 user 0.03 system 1.52 elapsed 105% CPU
time.5.ptmalloc.32768:7.19 user 7.81 system 5.53 elapsed 270% CPU
time.5.tcmalloc.32768:2.63 user 0.05 system 2.72 elapsed 98% CPU
time.5.ptmalloc.65536:8.45 user 23.51 system 14.30 elapsed 223% CPU
time.5.tcmalloc.65536:3.12 user 0.05 system 3.21 elapsed 98% CPU
time.5.ptmalloc.131072:10.22 user 43.63 system 27.84 elapsed 193% CPU
time.5.tcmalloc.131072:4.42 user 0.07 system 4.51 elapsed 99% CPU
time.6.ptmalloc.64:5.57 user 2.56 system 5.08 elapsed 159% CPU
time.6.tcmalloc.64:3.20 user 0.01 system 0.89 elapsed 360% CPU
time.6.ptmalloc.128:5.98 user 3.52 system 5.71 elapsed 166% CPU
time.6.tcmalloc.128:2.76 user 0.02 system 0.78 elapsed 355% CPU
time.6.ptmalloc.256:4.61 user 0.02 system 1.19 elapsed 389% CPU
time.6.tcmalloc.256:2.65 user 0.02 system 0.74 elapsed 356% CPU
time.6.ptmalloc.512:8.28 user 3.88 system 6.61 elapsed 183% CPU
time.6.tcmalloc.512:2.60 user 0.02 system 0.72 elapsed 362% CPU
time.6.ptmalloc.1024:4.75 user 0.00 system 1.22 elapsed 387% CPU
time.6.tcmalloc.1024:2.56 user 0.02 system 0.79 elapsed 325% CPU
time.6.ptmalloc.2048:8.90 user 4.59 system 6.15 elapsed 219% CPU
time.6.tcmalloc.2048:2.37 user 0.06 system 0.96 elapsed 250% CPU
time.6.ptmalloc.4096:11.41 user 7.02 system 6.31 elapsed 291% CPU
time.6.tcmalloc.4096:1.82 user 0.03 system 1.19 elapsed 154% CPU
time.6.ptmalloc.8192:11.64 user 8.25 system 5.97 elapsed 332% CPU
time.6.tcmalloc.8192:1.83 user 0.07 system 1.38 elapsed 136% CPU
time.6.ptmalloc.16384:7.44 user 2.98 system 3.01 elapsed 345% CPU
time.6.tcmalloc.16384:1.83 user 0.08 system 1.80 elapsed 105% CPU
time.6.ptmalloc.32768:8.69 user 12.35 system 8.04 elapsed 261% CPU
time.6.tcmalloc.32768:3.14 user 0.06 system 3.24 elapsed 98% CPU
time.6.ptmalloc.65536:10.52 user 35.43 system 20.75 elapsed 221% CPU
time.6.tcmalloc.65536:3.62 user 0.03 system 3.72 elapsed 98% CPU
time.6.ptmalloc.131072:11.74 user 59.00 system 36.93 elapsed 191% CPU
time.6.tcmalloc.131072:5.33 user 0.04 system 5.42 elapsed 98% CPU
time.7.ptmalloc.64:6.60 user 3.45 system 6.01 elapsed 167% CPU
time.7.tcmalloc.64:3.50 user 0.04 system 0.94 elapsed 376% CPU
time.7.ptmalloc.128:7.09 user 4.25 system 6.69 elapsed 169% CPU
time.7.tcmalloc.128:3.13 user 0.03 system 0.84 elapsed 374% CPU
time.7.ptmalloc.256:9.28 user 4.85 system 7.20 elapsed 196% CPU
time.7.tcmalloc.256:3.06 user 0.02 system 0.82 elapsed 375% CPU
time.7.ptmalloc.512:9.13 user 4.78 system 6.79 elapsed 204% CPU
time.7.tcmalloc.512:2.99 user 0.03 system 0.83 elapsed 359% CPU
time.7.ptmalloc.1024:10.85 user 6.41 system 7.52 elapsed 229% CPU
time.7.tcmalloc.1024:3.05 user 0.04 system 0.89 elapsed 345% CPU
time.7.ptmalloc.2048:5.65 user 0.08 system 1.47 elapsed 388% CPU
time.7.tcmalloc.2048:3.01 user 0.01 system 0.98 elapsed 306% CPU
time.7.ptmalloc.4096:6.09 user 0.08 system 1.58 elapsed 389% CPU
time.7.tcmalloc.4096:2.25 user 0.03 system 1.32 elapsed 171% CPU
time.7.ptmalloc.8192:6.73 user 0.85 system 1.99 elapsed 379% CPU
time.7.tcmalloc.8192:2.22 user 0.08 system 1.61 elapsed 142% CPU
time.7.ptmalloc.16384:8.87 user 4.66 system 4.04 elapsed 334% CPU
time.7.tcmalloc.16384:2.07 user 0.07 system 2.07 elapsed 103% CPU
time.7.ptmalloc.32768:10.61 user 17.85 system 11.22 elapsed 253% CPU
time.7.tcmalloc.32768:3.68 user 0.06 system 3.79 elapsed 98% CPU
time.7.ptmalloc.65536:13.05 user 45.97 system 27.28 elapsed 216% CPU
time.7.tcmalloc.65536:4.16 user 0.07 system 4.31 elapsed 98% CPU
time.7.ptmalloc.131072:13.22 user 62.67 system 41.33 elapsed 183% CPU
time.7.tcmalloc.131072:6.10 user 0.06 system 6.25 elapsed 98% CPU
time.8.ptmalloc.64:7.31 user 3.92 system 6.39 elapsed 175% CPU
time.8.tcmalloc.64:4.00 user 0.01 system 1.04 elapsed 383% CPU
time.8.ptmalloc.128:9.40 user 5.41 system 7.67 elapsed 192% CPU
time.8.tcmalloc.128:3.61 user 0.02 system 0.94 elapsed 386% CPU
time.8.ptmalloc.256:10.61 user 6.35 system 7.96 elapsed 212% CPU
time.8.tcmalloc.256:3.30 user 0.02 system 0.99 elapsed 335% CPU
time.8.ptmalloc.512:12.42 user 7.10 system 8.79 elapsed 221% CPU
time.8.tcmalloc.512:3.35 user 0.04 system 0.94 elapsed 358% CPU
time.8.ptmalloc.1024:13.63 user 8.54 system 8.95 elapsed 247% CPU
time.8.tcmalloc.1024:3.44 user 0.02 system 0.96 elapsed 359% CPU
time.8.ptmalloc.2048:6.45 user 0.03 system 1.67 elapsed 386% CPU
time.8.tcmalloc.2048:3.55 user 0.05 system 1.09 elapsed 328% CPU
time.8.ptmalloc.4096:6.83 user 0.26 system 1.80 elapsed 393% CPU
time.8.tcmalloc.4096:2.78 user 0.06 system 1.53 elapsed 185% CPU
time.8.ptmalloc.8192:7.59 user 1.29 system 2.36 elapsed 376% CPU
time.8.tcmalloc.8192:2.57 user 0.07 system 1.84 elapsed 142% CPU
time.8.ptmalloc.16384:10.15 user 6.20 system 5.20 elapsed 314% CPU
time.8.tcmalloc.16384:2.40 user 0.05 system 2.42 elapsed 101% CPU
time.8.ptmalloc.32768:11.82 user 24.48 system 14.60 elapsed 248% CPU
time.8.tcmalloc.32768:4.37 user 0.05 system 4.47 elapsed 98% CPU
time.8.ptmalloc.65536:15.41 user 58.94 system 34.42 elapsed 215% CPU
time.8.tcmalloc.65536:4.90 user 0.04 system 4.96 elapsed 99% CPU
time.8.ptmalloc.131072:16.07 user 82.93 system 52.51 elapsed 188% CPU
time.8.tcmalloc.131072:7.13 user 0.04 system 7.19 elapsed 99% CPU
time.9.ptmalloc.64:8.44 user 4.59 system 6.92 elapsed 188% CPU
time.9.tcmalloc.64:4.00 user 0.02 system 1.05 elapsed 382% CPU
time.9.ptmalloc.128:10.92 user 6.14 system 8.31 elapsed 205% CPU
time.9.tcmalloc.128:3.88 user 0.02 system 1.01 elapsed 382% CPU
time.9.ptmalloc.256:13.01 user 7.75 system 9.12 elapsed 227% CPU
time.9.tcmalloc.256:3.89 user 0.01 system 1.00 elapsed 386% CPU
time.9.ptmalloc.512:14.96 user 8.89 system 9.73 elapsed 244% CPU
time.9.tcmalloc.512:3.80 user 0.03 system 1.01 elapsed 377% CPU
time.9.ptmalloc.1024:15.42 user 10.20 system 9.80 elapsed 261% CPU
time.9.tcmalloc.1024:3.86 user 0.03 system 1.19 elapsed 325% CPU
time.9.ptmalloc.2048:7.24 user 0.02 system 1.87 elapsed 388% CPU
time.9.tcmalloc.2048:3.98 user 0.05 system 1.26 elapsed 319% CPU
time.9.ptmalloc.4096:7.96 user 0.18 system 2.06 elapsed 394% CPU
time.9.tcmalloc.4096:3.27 user 0.04 system 1.69 elapsed 195% CPU
time.9.ptmalloc.8192:9.00 user 1.63 system 2.79 elapsed 380% CPU
time.9.tcmalloc.8192:3.00 user 0.06 system 2.05 elapsed 148% CPU
time.9.ptmalloc.16384:12.07 user 8.13 system 6.55 elapsed 308% CPU
time.9.tcmalloc.16384:2.85 user 0.05 system 2.75 elapsed 105% CPU
time.9.ptmalloc.32768:13.99 user 29.65 system 18.02 elapsed 242% CPU
time.9.tcmalloc.32768:4.98 user 0.06 system 5.13 elapsed 98% CPU
time.9.ptmalloc.65536:16.89 user 70.42 system 42.11 elapsed 207% CPU
time.9.tcmalloc.65536:5.55 user 0.04 system 5.65 elapsed 98% CPU
time.9.ptmalloc.131072:18.53 user 94.11 system 61.17 elapsed 184% CPU
time.9.tcmalloc.131072:8.06 user 0.04 system 8.16 elapsed 99% CPU
time.10.ptmalloc.64:9.81 user 5.70 system 7.42 elapsed 208% CPU
time.10.tcmalloc.64:4.43 user 0.03 system 1.20 elapsed 370% CPU
time.10.ptmalloc.128:12.69 user 7.81 system 9.02 elapsed 227% CPU
time.10.tcmalloc.128:4.27 user 0.02 system 1.13 elapsed 378% CPU
time.10.ptmalloc.256:15.04 user 9.53 system 9.92 elapsed 247% CPU
time.10.tcmalloc.256:4.23 user 0.02 system 1.09 elapsed 388% CPU
time.10.ptmalloc.512:17.30 user 10.46 system 10.61 elapsed 261% CPU
time.10.tcmalloc.512:4.14 user 0.05 system 1.10 elapsed 379% CPU
time.10.ptmalloc.1024:16.96 user 9.38 system 9.30 elapsed 283% CPU
time.10.tcmalloc.1024:4.27 user 0.06 system 1.18 elapsed 366% CPU
time.10.ptmalloc.2048:8.07 user 0.03 system 2.06 elapsed 393% CPU
time.10.tcmalloc.2048:4.49 user 0.07 system 1.33 elapsed 342% CPU
time.10.ptmalloc.4096:8.66 user 0.25 system 2.25 elapsed 394% CPU
time.10.tcmalloc.4096:3.61 user 0.05 system 1.78 elapsed 205% CPU
time.10.ptmalloc.8192:21.52 user 17.43 system 10.41 elapsed 374% CPU
time.10.tcmalloc.8192:3.59 user 0.10 system 2.33 elapsed 158% CPU
time.10.ptmalloc.16384:20.55 user 24.85 system 12.55 elapsed 361% CPU
time.10.tcmalloc.16384:3.29 user 0.04 system 3.22 elapsed 103% CPU
time.10.ptmalloc.32768:15.23 user 38.13 system 22.49 elapsed 237% CPU
time.10.tcmalloc.32768:5.62 user 0.05 system 5.72 elapsed 99% CPU
time.10.ptmalloc.65536:19.80 user 85.42 system 49.98 elapsed 210% CPU
time.10.tcmalloc.65536:6.23 user 0.09 system 6.36 elapsed 99% CPU
time.10.ptmalloc.131072:20.91 user 106.97 system 69.08 elapsed 185% CPU
time.10.tcmalloc.131072:8.94 user 0.09 system 9.09 elapsed 99% CPU
time.11.ptmalloc.64:10.82 user 6.34 system 7.92 elapsed 216% CPU
time.11.tcmalloc.64:4.80 user 0.03 system 1.24 elapsed 387% CPU
time.11.ptmalloc.128:14.58 user 8.61 system 9.81 elapsed 236% CPU
time.11.tcmalloc.128:4.65 user 0.03 system 1.21 elapsed 384% CPU
time.11.ptmalloc.256:17.38 user 10.98 system 10.75 elapsed 263% CPU
time.11.tcmalloc.256:4.51 user 0.03 system 1.18 elapsed 384% CPU
time.11.ptmalloc.512:19.18 user 11.71 system 10.95 elapsed 282% CPU
time.11.tcmalloc.512:4.57 user 0.02 system 1.19 elapsed 384% CPU
time.11.ptmalloc.1024:19.94 user 12.41 system 10.48 elapsed 308% CPU
time.11.tcmalloc.1024:4.71 user 0.05 system 1.29 elapsed 367% CPU
time.11.ptmalloc.2048:8.70 user 0.04 system 2.35 elapsed 371% CPU
time.11.tcmalloc.2048:4.97 user 0.07 system 1.43 elapsed 350% CPU
time.11.ptmalloc.4096:22.47 user 18.43 system 10.82 elapsed 377% CPU
time.11.tcmalloc.4096:4.22 user 0.03 system 1.91 elapsed 221% CPU
time.11.ptmalloc.8192:11.61 user 2.38 system 3.73 elapsed 374% CPU
time.11.tcmalloc.8192:3.74 user 0.09 system 2.46 elapsed 155% CPU
time.11.ptmalloc.16384:14.13 user 13.38 system 9.60 elapsed 286% CPU
time.11.tcmalloc.16384:3.61 user 0.03 system 3.63 elapsed 100% CPU
time.11.ptmalloc.32768:17.92 user 43.84 system 26.74 elapsed 230% CPU
time.11.tcmalloc.32768:6.31 user 0.03 system 6.45 elapsed 98% CPU
time.11.ptmalloc.65536:22.40 user 96.38 system 58.30 elapsed 203% CPU
time.11.tcmalloc.65536:6.92 user 0.12 system 6.98 elapsed 100% CPU
time.11.ptmalloc.131072:21.03 user 108.04 system 72.78 elapsed 177% CPU
time.11.tcmalloc.131072:9.79 user 0.08 system 9.94 elapsed 99% CPU
time.12.ptmalloc.64:12.23 user 7.16 system 8.38 elapsed 231% CPU
time.12.tcmalloc.64:5.21 user 0.05 system 1.41 elapsed 371% CPU
time.12.ptmalloc.128:16.97 user 10.19 system 10.47 elapsed 259% CPU
time.12.tcmalloc.128:5.10 user 0.02 system 1.31 elapsed 390% CPU
time.12.ptmalloc.256:19.99 user 12.10 system 11.57 elapsed 277% CPU
time.12.tcmalloc.256:5.01 user 0.03 system 1.29 elapsed 390% CPU
time.12.ptmalloc.512:21.85 user 12.66 system 11.46 elapsed 300% CPU
time.12.tcmalloc.512:5.05 user 0.00 system 1.32 elapsed 379% CPU
time.12.ptmalloc.1024:9.40 user 0.04 system 2.40 elapsed 393% CPU
time.12.tcmalloc.1024:5.14 user 0.02 system 1.39 elapsed 369% CPU
time.12.ptmalloc.2048:9.72 user 0.04 system 2.49 elapsed 391% CPU
time.12.tcmalloc.2048:5.74 user 0.05 system 1.62 elapsed 355% CPU
time.12.ptmalloc.4096:10.64 user 0.20 system 2.75 elapsed 393% CPU
time.12.tcmalloc.4096:4.45 user 0.03 system 2.04 elapsed 218% CPU
time.12.ptmalloc.8192:12.66 user 3.30 system 4.30 elapsed 371% CPU
time.12.tcmalloc.8192:4.21 user 0.13 system 2.65 elapsed 163% CPU
time.12.ptmalloc.16384:15.73 user 15.68 system 11.14 elapsed 281% CPU
time.12.tcmalloc.16384:4.17 user 0.06 system 4.10 elapsed 102% CPU
time.12.ptmalloc.32768:19.45 user 56.00 system 32.74 elapsed 230% CPU
time.12.tcmalloc.32768:6.96 user 0.08 system 7.14 elapsed 98% CPU
time.12.ptmalloc.65536:23.33 user 110.45 system 65.06 elapsed 205% CPU
time.12.tcmalloc.65536:7.77 user 0.15 system 7.72 elapsed 102% CPU
time.12.ptmalloc.131072:24.03 user 124.74 system 82.94 elapsed 179% CPU
time.12.tcmalloc.131072:10.81 user 0.06 system 10.94 elapsed 99% CPU
time.13.ptmalloc.64:14.08 user 7.60 system 8.85 elapsed 244% CPU
time.13.tcmalloc.64:5.51 user 0.01 system 1.47 elapsed 375% CPU
time.13.ptmalloc.128:18.20 user 10.98 system 10.99 elapsed 265% CPU
time.13.tcmalloc.128:5.34 user 0.01 system 1.39 elapsed 382% CPU
time.13.ptmalloc.256:21.48 user 13.94 system 12.25 elapsed 289% CPU
time.13.tcmalloc.256:5.33 user 0.01 system 1.39 elapsed 381% CPU
time.13.ptmalloc.512:24.22 user 14.84 system 12.97 elapsed 301% CPU
time.13.tcmalloc.512:5.49 user 0.02 system 1.41 elapsed 389% CPU
time.13.ptmalloc.1024:25.26 user 17.03 system 12.85 elapsed 328% CPU
time.13.tcmalloc.1024:5.65 user 0.04 system 1.50 elapsed 378% CPU
time.13.ptmalloc.2048:10.41 user 0.03 system 2.69 elapsed 387% CPU
time.13.tcmalloc.2048:5.93 user 0.10 system 1.77 elapsed 339% CPU
time.13.ptmalloc.4096:11.37 user 0.52 system 3.04 elapsed 391% CPU
time.13.tcmalloc.4096:5.08 user 0.11 system 2.22 elapsed 233% CPU
time.13.ptmalloc.8192:21.76 user 18.54 system 10.58 elapsed 380% CPU
time.13.tcmalloc.8192:5.04 user 0.16 system 2.93 elapsed 177% CPU
time.13.ptmalloc.16384:26.35 user 34.47 system 17.01 elapsed 357% CPU
time.13.tcmalloc.16384:4.66 user 0.04 system 4.66 elapsed 100% CPU
time.13.ptmalloc.32768:21.41 user 63.59 system 38.14 elapsed 222% CPU
time.13.tcmalloc.32768:7.71 user 0.03 system 7.83 elapsed 98% CPU
time.13.ptmalloc.65536:24.99 user 120.80 system 71.59 elapsed 203% CPU
time.13.tcmalloc.65536:8.87 user 0.64 system 8.37 elapsed 113% CPU
time.13.ptmalloc.131072:25.97 user 142.27 system 96.00 elapsed 175% CPU
time.13.tcmalloc.131072:11.48 user 0.06 system 11.67 elapsed 98% CPU
time.14.ptmalloc.64:15.01 user 9.11 system 9.41 elapsed 256% CPU
time.14.tcmalloc.64:5.98 user 0.02 system 1.58 elapsed 378% CPU
time.14.ptmalloc.128:20.34 user 12.72 system 11.62 elapsed 284% CPU
time.14.tcmalloc.128:5.88 user 0.04 system 1.51 elapsed 392% CPU
time.14.ptmalloc.256:24.26 user 14.95 system 12.92 elapsed 303% CPU
time.14.tcmalloc.256:5.72 user 0.02 system 1.50 elapsed 381% CPU
time.14.ptmalloc.512:27.28 user 16.45 system 13.89 elapsed 314% CPU
time.14.tcmalloc.512:5.99 user 0.02 system 1.54 elapsed 388% CPU
time.14.ptmalloc.1024:25.84 user 16.99 system 12.61 elapsed 339% CPU
time.14.tcmalloc.1024:5.94 user 0.06 system 1.59 elapsed 375% CPU
time.14.ptmalloc.2048:11.96 user 0.01 system 3.12 elapsed 382% CPU
time.14.tcmalloc.2048:6.39 user 0.07 system 1.79 elapsed 359% CPU
time.14.ptmalloc.4096:20.19 user 11.77 system 8.26 elapsed 386% CPU
time.14.tcmalloc.4096:5.65 user 0.05 system 2.32 elapsed 244% CPU
time.14.ptmalloc.8192:22.01 user 16.39 system 9.89 elapsed 387% CPU
time.14.tcmalloc.8192:5.44 user 0.11 system 3.07 elapsed 180% CPU
time.14.ptmalloc.16384:18.15 user 22.40 system 15.02 elapsed 269% CPU
time.14.tcmalloc.16384:5.29 user 0.08 system 5.34 elapsed 100% CPU
time.14.ptmalloc.32768:24.29 user 72.07 system 42.63 elapsed 225% CPU
time.14.tcmalloc.32768:8.47 user 0.02 system 8.62 elapsed 98% CPU
time.14.ptmalloc.65536:27.63 user 130.56 system 78.64 elapsed 201% CPU
time.14.tcmalloc.65536:9.85 user 1.61 system 9.04 elapsed 126% CPU
time.14.ptmalloc.131072:28.87 user 146.38 system 100.54 elapsed 174% CPU
time.14.tcmalloc.131072:12.46 user 0.11 system 12.71 elapsed 98% CPU
time.15.ptmalloc.64:16.25 user 10.05 system 9.82 elapsed 267% CPU
time.15.tcmalloc.64:6.30 user 0.02 system 1.64 elapsed 385% CPU
time.15.ptmalloc.128:22.33 user 13.23 system 12.24 elapsed 290% CPU
time.15.tcmalloc.128:6.08 user 0.03 system 1.59 elapsed 384% CPU
time.15.ptmalloc.256:26.56 user 16.57 system 13.70 elapsed 314% CPU
time.15.tcmalloc.256:6.14 user 0.03 system 1.61 elapsed 382% CPU
time.15.ptmalloc.512:29.68 user 18.08 system 14.56 elapsed 327% CPU
time.15.tcmalloc.512:6.12 user 0.04 system 1.68 elapsed 364% CPU
time.15.ptmalloc.1024:17.07 user 6.22 system 6.26 elapsed 371% CPU
time.15.tcmalloc.1024:6.38 user 0.02 system 1.75 elapsed 364% CPU
time.15.ptmalloc.2048:26.64 user 17.25 system 11.51 elapsed 381% CPU
time.15.tcmalloc.2048:6.77 user 0.18 system 1.92 elapsed 361% CPU
time.15.ptmalloc.4096:13.21 user 0.74 system 3.57 elapsed 390% CPU
time.15.tcmalloc.4096:6.03 user 0.09 system 2.36 elapsed 258% CPU
time.15.ptmalloc.8192:22.92 user 17.51 system 10.50 elapsed 385% CPU
time.15.tcmalloc.8192:5.96 user 0.12 system 3.36 elapsed 180% CPU
time.15.ptmalloc.16384:19.37 user 24.87 system 16.69 elapsed 264% CPU
time.15.tcmalloc.16384:5.88 user 0.07 system 5.84 elapsed 101% CPU
time.15.ptmalloc.32768:25.43 user 82.30 system 48.98 elapsed 219% CPU
time.15.tcmalloc.32768:9.11 user 0.05 system 9.30 elapsed 98% CPU
time.15.ptmalloc.65536:29.31 user 140.07 system 83.78 elapsed 202% CPU
time.15.tcmalloc.65536:8.51 user 1.59 system 9.75 elapsed 103% CPU
time.15.ptmalloc.131072:30.22 user 163.15 system 109.50 elapsed 176% CPU
time.15.tcmalloc.131072:13.35 user 0.10 system 13.54 elapsed 99% CPU
time.16.ptmalloc.64:17.69 user 10.11 system 10.11 elapsed 274% CPU
time.16.tcmalloc.64:6.63 user 0.04 system 1.72 elapsed 387% CPU
time.16.ptmalloc.128:23.05 user 14.37 system 12.75 elapsed 293% CPU
time.16.tcmalloc.128:6.61 user 0.02 system 1.71 elapsed 387% CPU
time.16.ptmalloc.256:29.11 user 19.35 system 14.57 elapsed 332% CPU
time.16.tcmalloc.256:6.62 user 0.03 system 1.73 elapsed 382% CPU
time.16.ptmalloc.512:31.65 user 18.71 system 14.71 elapsed 342% CPU
time.16.tcmalloc.512:6.63 user 0.04 system 1.73 elapsed 383% CPU
time.16.ptmalloc.1024:31.99 user 21.22 system 14.87 elapsed 357% CPU
time.16.tcmalloc.1024:6.81 user 0.04 system 1.79 elapsed 382% CPU
time.16.ptmalloc.2048:30.35 user 21.36 system 13.30 elapsed 388% CPU
time.16.tcmalloc.2048:6.91 user 0.50 system 2.01 elapsed 367% CPU
time.16.ptmalloc.4096:18.85 user 7.18 system 6.61 elapsed 393% CPU
time.16.tcmalloc.4096:6.70 user 0.10 system 2.62 elapsed 259% CPU
time.16.ptmalloc.8192:22.19 user 14.30 system 9.37 elapsed 389% CPU
time.16.tcmalloc.8192:6.18 user 0.19 system 3.58 elapsed 177% CPU
time.16.ptmalloc.16384:31.22 user 46.78 system 22.92 elapsed 340% CPU
time.16.tcmalloc.16384:6.79 user 0.07 system 6.86 elapsed 99% CPU
time.16.ptmalloc.32768:27.31 user 87.32 system 52.00 elapsed 220% CPU
time.16.tcmalloc.32768:9.85 user 0.06 system 10.07 elapsed 98% CPU
time.16.ptmalloc.65536:32.83 user 160.62 system 95.67 elapsed 202% CPU
time.16.tcmalloc.65536:10.18 user 0.09 system 10.41 elapsed 98% CPU
time.16.ptmalloc.131072:31.99 user 173.41 system 115.98 elapsed 177% CPU
time.16.tcmalloc.131072:14.52 user 0.05 system 14.67 elapsed 99% CPU
time.17.ptmalloc.64:19.38 user 11.61 system 10.61 elapsed 291% CPU
time.17.tcmalloc.64:7.11 user 0.02 system 1.84 elapsed 386% CPU
time.17.ptmalloc.128:26.25 user 16.15 system 13.53 elapsed 313% CPU
time.17.tcmalloc.128:6.97 user 0.02 system 1.78 elapsed 390% CPU
time.17.ptmalloc.256:30.66 user 18.36 system 14.97 elapsed 327% CPU
time.17.tcmalloc.256:6.94 user 0.04 system 1.80 elapsed 387% CPU
time.17.ptmalloc.512:33.71 user 22.79 system 15.95 elapsed 354% CPU
time.17.tcmalloc.512:7.00 user 0.02 system 1.83 elapsed 381% CPU
time.17.ptmalloc.1024:33.49 user 22.47 system 15.00 elapsed 373% CPU
time.17.tcmalloc.1024:7.20 user 0.03 system 1.90 elapsed 380% CPU
time.17.ptmalloc.2048:23.87 user 11.92 system 9.26 elapsed 386% CPU
time.17.tcmalloc.2048:6.01 user 1.83 system 2.15 elapsed 363% CPU
time.17.ptmalloc.4096:14.69 user 0.95 system 3.98 elapsed 392% CPU
time.17.tcmalloc.4096:7.25 user 0.10 system 2.62 elapsed 279% CPU
time.17.ptmalloc.8192:22.44 user 13.52 system 9.39 elapsed 382% CPU
time.17.tcmalloc.8192:7.21 user 0.24 system 3.95 elapsed 188% CPU
time.17.ptmalloc.16384:23.33 user 33.67 system 21.89 elapsed 260% CPU
time.17.tcmalloc.16384:7.28 user 0.06 system 7.10 elapsed 103% CPU
time.17.ptmalloc.32768:29.35 user 103.11 system 60.36 elapsed 219% CPU
time.17.tcmalloc.32768:10.53 user 0.07 system 10.71 elapsed 98% CPU
time.17.ptmalloc.65536:33.21 user 170.89 system 100.84 elapsed 202% CPU
time.17.tcmalloc.65536:10.85 user 0.05 system 11.04 elapsed 98% CPU
time.17.ptmalloc.131072:34.98 user 182.87 system 122.05 elapsed 178% CPU
time.17.tcmalloc.131072:15.27 user 0.09 system 15.49 elapsed 99% CPU
time.18.ptmalloc.64:21.08 user 12.15 system 11.43 elapsed 290% CPU
time.18.tcmalloc.64:7.45 user 0.03 system 1.95 elapsed 383% CPU
time.18.ptmalloc.128:27.65 user 17.26 system 14.03 elapsed 320% CPU
time.18.tcmalloc.128:7.46 user 0.03 system 1.92 elapsed 389% CPU
time.18.ptmalloc.256:32.78 user 20.55 system 15.70 elapsed 339% CPU
time.18.tcmalloc.256:7.31 user 0.02 system 1.88 elapsed 389% CPU
time.18.ptmalloc.512:33.31 user 20.06 system 15.05 elapsed 354% CPU
time.18.tcmalloc.512:7.33 user 0.02 system 1.91 elapsed 383% CPU
time.18.ptmalloc.1024:35.46 user 24.83 system 16.30 elapsed 369% CPU
time.18.tcmalloc.1024:7.60 user 0.06 system 2.05 elapsed 373% CPU
time.18.ptmalloc.2048:19.98 user 6.80 system 6.76 elapsed 395% CPU
time.18.tcmalloc.2048:6.89 user 1.29 system 2.28 elapsed 357% CPU
time.18.ptmalloc.4096:15.99 user 0.93 system 4.32 elapsed 391% CPU
time.18.tcmalloc.4096:7.70 user 0.10 system 2.77 elapsed 280% CPU
time.18.ptmalloc.8192:23.51 user 14.84 system 9.97 elapsed 384% CPU
time.18.tcmalloc.8192:8.16 user 0.27 system 4.25 elapsed 197% CPU
time.18.ptmalloc.16384:35.79 user 52.41 system 26.47 elapsed 333% CPU
time.18.tcmalloc.16384:7.81 user 0.07 system 7.61 elapsed 103% CPU
time.18.ptmalloc.32768:33.17 user 116.07 system 68.64 elapsed 217% CPU
time.18.tcmalloc.32768:11.34 user 0.13 system 11.57 elapsed 99% CPU
time.18.ptmalloc.65536:35.91 user 177.82 system 106.75 elapsed 200% CPU
time.18.tcmalloc.65536:11.54 user 0.06 system 11.74 elapsed 98% CPU
time.18.ptmalloc.131072:36.38 user 187.18 system 126.91 elapsed 176% CPU
time.18.tcmalloc.131072:16.34 user 0.05 system 16.43 elapsed 99% CPU
time.19.ptmalloc.64:22.90 user 13.23 system 11.82 elapsed 305% CPU
time.19.tcmalloc.64:7.81 user 0.02 system 2.01 elapsed 388% CPU
time.19.ptmalloc.128:30.13 user 18.58 system 14.77 elapsed 329% CPU
time.19.tcmalloc.128:7.74 user 0.02 system 2.01 elapsed 386% CPU
time.19.ptmalloc.256:35.33 user 21.41 system 16.35 elapsed 347% CPU
time.19.tcmalloc.256:7.79 user 0.04 system 2.04 elapsed 382% CPU
time.19.ptmalloc.512:39.30 user 26.22 system 17.84 elapsed 367% CPU
time.19.tcmalloc.512:7.80 user 0.06 system 2.05 elapsed 381% CPU
time.19.ptmalloc.1024:35.70 user 23.90 system 15.66 elapsed 380% CPU
time.19.tcmalloc.1024:8.08 user 0.06 system 2.16 elapsed 376% CPU
time.19.ptmalloc.2048:18.33 user 3.28 system 5.47 elapsed 394% CPU
time.19.tcmalloc.2048:8.71 user 0.05 system 2.40 elapsed 363% CPU
time.19.ptmalloc.4096:16.94 user 0.89 system 4.64 elapsed 383% CPU
time.19.tcmalloc.4096:8.21 user 0.07 system 2.85 elapsed 289% CPU
time.19.ptmalloc.8192:25.61 user 17.15 system 11.33 elapsed 377% CPU
time.19.tcmalloc.8192:8.79 user 0.30 system 4.58 elapsed 198% CPU
time.19.ptmalloc.16384:27.11 user 46.66 system 29.67 elapsed 248% CPU
time.19.tcmalloc.16384:8.64 user 0.05 system 8.58 elapsed 101% CPU
time.19.ptmalloc.32768:33.80 user 117.69 system 70.65 elapsed 214% CPU
time.19.tcmalloc.32768:11.88 user 0.07 system 12.04 elapsed 99% CPU
time.19.ptmalloc.65536:36.90 user 180.21 system 109.01 elapsed 199% CPU
time.19.tcmalloc.65536:12.17 user 0.07 system 12.40 elapsed 98% CPU
time.19.ptmalloc.131072:38.50 user 195.15 system 132.81 elapsed 175% CPU
time.19.tcmalloc.131072:17.44 user 0.10 system 17.65 elapsed 99% CPU
time.20.ptmalloc.64:23.37 user 13.74 system 11.86 elapsed 312% CPU
time.20.tcmalloc.64:8.18 user 0.02 system 2.10 elapsed 389% CPU
time.20.ptmalloc.128:31.29 user 19.97 system 15.53 elapsed 329% CPU
time.20.tcmalloc.128:8.03 user 0.02 system 2.12 elapsed 378% CPU
time.20.ptmalloc.256:38.40 user 25.65 system 18.25 elapsed 350% CPU
time.20.tcmalloc.256:8.05 user 0.05 system 2.12 elapsed 380% CPU
time.20.ptmalloc.512:40.60 user 27.70 system 18.46 elapsed 369% CPU
time.20.tcmalloc.512:8.22 user 0.08 system 2.20 elapsed 375% CPU
time.20.ptmalloc.1024:40.02 user 28.52 system 17.56 elapsed 390% CPU
time.20.tcmalloc.1024:8.50 user 0.07 system 2.19 elapsed 391% CPU
time.20.ptmalloc.2048:16.13 user 0.23 system 4.23 elapsed 386% CPU
time.20.tcmalloc.2048:8.98 user 0.03 system 2.45 elapsed 367% CPU
time.20.ptmalloc.4096:17.14 user 0.87 system 4.60 elapsed 391% CPU
time.20.tcmalloc.4096:8.93 user 0.20 system 2.97 elapsed 306% CPU
time.20.ptmalloc.8192:25.24 user 17.16 system 11.14 elapsed 380% CPU
time.20.tcmalloc.8192:9.78 user 0.30 system 5.14 elapsed 195% CPU
time.20.ptmalloc.16384:39.93 user 60.36 system 30.24 elapsed 331% CPU
time.20.tcmalloc.16384:9.57 user 0.09 system 9.43 elapsed 102% CPU
time.20.ptmalloc.32768:36.44 user 130.23 system 76.79 elapsed 217% CPU
time.20.tcmalloc.32768:12.71 user 0.09 system 12.97 elapsed 98% CPU
time.20.ptmalloc.65536:39.79 user 202.09 system 120.34 elapsed 200% CPU
time.20.tcmalloc.65536:12.93 user 0.06 system 13.15 elapsed 98% CPU
time.20.ptmalloc.131072:41.91 user 202.76 system 138.51 elapsed 176% CPU
time.20.tcmalloc.131072:18.23 user 0.07 system 18.42 elapsed 99% CPU

Binary files not shown.

View file

@ -0,0 +1,778 @@
<!doctype html public "-//w3c//dtd html 4.01 transitional//en">
<!-- $Id: $ -->
<html>
<head>
<title>TCMalloc : Thread-Caching Malloc</title>
<link rel="stylesheet" href="designstyle.css">
<style type="text/css">
em {
color: red;
font-style: normal;
}
</style>
</head>
<body>
<h1>TCMalloc : Thread-Caching Malloc</h1>
<address>Sanjay Ghemawat</address>
<h2><A name=motivation>Motivation</A></h2>
<p>TCMalloc is faster than the glibc 2.3 malloc (available as a
separate library called ptmalloc2) and other mallocs that I have
tested. ptmalloc2 takes approximately 300 nanoseconds to execute a
malloc/free pair on a 2.8 GHz P4 (for small objects). The TCMalloc
implementation takes approximately 50 nanoseconds for the same
operation pair. Speed is important for a malloc implementation
because if malloc is not fast enough, application writers are inclined
to write their own custom free lists on top of malloc. This can lead
to extra complexity, and more memory usage unless the application
writer is very careful to appropriately size the free lists and
scavenge idle objects out of the free list.</p>
<p>TCMalloc also reduces lock contention for multi-threaded programs.
For small objects, there is virtually zero contention. For large
objects, TCMalloc tries to use fine grained and efficient spinlocks.
ptmalloc2 also reduces lock contention by using per-thread arenas but
there is a big problem with ptmalloc2's use of per-thread arenas. In
ptmalloc2 memory can never move from one arena to another. This can
lead to huge amounts of wasted space. For example, in one Google
application, the first phase would allocate approximately 300MB of
memory for its URL canonicalization data structures. When the first
phase finished, a second phase would be started in the same address
space. If this second phase was assigned a different arena than the
one used by the first phase, this phase would not reuse any of the
memory left after the first phase and would add another 300MB to the
address space. Similar memory blowup problems were also noticed in
other applications.</p>
<p>Another benefit of TCMalloc is space-efficient representation of
small objects. For example, N 8-byte objects can be allocated while
using space approximately <code>8N * 1.01</code> bytes. I.e., a
one-percent space overhead. ptmalloc2 uses a four-byte header for
each object and (I think) rounds up the size to a multiple of 8 bytes
and ends up using <code>16N</code> bytes.</p>
<h2><A NAME="Usage">Usage</A></h2>
<p>To use TCMalloc, just link TCMalloc into your application via the
"-ltcmalloc" linker flag.</p>
<p>You can use TCMalloc in applications you didn't compile yourself,
by using LD_PRELOAD:</p>
<pre>
$ LD_PRELOAD="/usr/lib/libtcmalloc.so" &lt;binary&gt;
</pre>
<p>LD_PRELOAD is tricky, and we don't necessarily recommend this mode
of usage.</p>
<p>TCMalloc includes a <A HREF="heap_checker.html">heap checker</A>
and <A HREF="heapprofile.html">heap profiler</A> as well.</p>
<p>If you'd rather link in a version of TCMalloc that does not include
the heap profiler and checker (perhaps to reduce binary size for a
static binary), you can link in <code>libtcmalloc_minimal</code>
instead.</p>
<h2><A NAME="Overview">Overview</A></h2>
<p>TCMalloc assigns each thread a thread-local cache. Small
allocations are satisfied from the thread-local cache. Objects are
moved from central data structures into a thread-local cache as
needed, and periodic garbage collections are used to migrate memory
back from a thread-local cache into the central data structures.</p>
<center><img src="overview.gif"></center>
<p>TCMalloc treats objects with size &lt;= 256K ("small" objects)
differently from larger objects.  Large objects are allocated directly
from the central heap using a page-level allocator (a page is an
8K-aligned region of memory).  I.e., a large object is always
page-aligned and occupies an integral number of pages.</p>
<p>A run of pages can be carved up into a sequence of small objects,
each equally sized.  For example a run of one page (8K) can be carved
up into 64 objects of size 128 bytes each.</p>
<h2><A NAME="Small_Object_Allocation">Small Object Allocation</A></h2>
<p>Each small object size maps to one of approximately 88 allocatable
size-classes. For example, all allocations in the range 961 to 1024
bytes are rounded up to 1024. The size-classes are spaced so that
small sizes are separated by 8 bytes, larger sizes by 16 bytes, even
larger sizes by 32 bytes, and so forth. The maximal spacing is
controlled so that not too much space is wasted when an allocation
request falls just past the end of a size class and has to be rounded
up to the next class.</p>
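<p>As a purely hypothetical illustration of this spacing scheme, the
sketch below rounds a request up to a class boundary. The exact cutoffs
and the number of classes are computed by TCMalloc at startup, so the
constants here are illustrative only:</p>
<pre>
#include &lt;cstddef>  // size_t

// Hypothetical sketch of size-class rounding; TCMalloc's real class
// table is computed at startup and uses different cutoffs.
size_t RoundToSizeClass(size_t size) {
  if (size &lt;= 64)   return (size + 7)  & ~size_t(7);   // 8-byte spacing
  if (size &lt;= 256)  return (size + 15) & ~size_t(15);  // 16-byte spacing
  if (size &lt;= 512)  return (size + 31) & ~size_t(31);  // 32-byte spacing
  return (size + 63) & ~size_t(63);  // e.g. 961..1024 rounds up to 1024
                                     // (further tiers omitted)
}
</pre>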
<p>A thread cache contains a singly linked list of free objects per
size-class.</p>
<center><img src="threadheap.gif"></center>
<p>When allocating a small object: (1) We map its size to the
corresponding size-class. (2) Look in the corresponding free list in
the thread cache for the current thread. (3) If the free list is not
empty, we remove the first object from the list and return it. When
following this fast path, TCMalloc acquires no locks at all. This
helps speed up allocation significantly because a lock/unlock pair
takes approximately 100 nanoseconds on a 2.8 GHz Xeon.</p>
<p>If the free list is empty: (1) We fetch a bunch of objects from a
central free list for this size-class (the central free list is shared
by all threads). (2) Place them in the thread-local free list. (3)
Return one of the newly fetched objects to the application.</p>
<p>If the central free list is also empty: (1) We allocate a run of
pages from the central page allocator. (2) Split the run into a set
of objects of this size-class. (3) Place the new objects on the
central free list. (4) As before, move some of these objects to the
thread-local free list.</p>
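<p>Putting these three cases together, a minimal sketch of the
small-object allocation path might look like the following. All names
here (<code>FreeList</code>, <code>CurrentThreadCache</code>, and so on)
are hypothetical stand-ins for TCMalloc's internal structures:</p>
<pre>
#include &lt;cstddef>  // size_t

// Hypothetical stand-ins for TCMalloc internals.
struct FreeList { bool empty() const; void* Pop(); };
struct ThreadCache { FreeList* list(int cl); };
struct CentralFreeList { void FetchBatch(FreeList* dst, int n); };
ThreadCache* CurrentThreadCache();
CentralFreeList* CentralListFor(int cl);
int SizeClassOf(size_t size);   // size -> size-class index
int BatchSizeFor(int cl);       // == num_objects_to_move for this class

void* AllocateSmall(size_t size) {
  int cl = SizeClassOf(size);                 // (1) map to a size-class
  FreeList* list = CurrentThreadCache()->list(cl);
  if (!list->empty())
    return list->Pop();                       // (2)+(3) fast path: no locks
  // Empty: refill a batch from the shared central list, which may in
  // turn carve new objects out of a fresh run of pages.
  CentralListFor(cl)->FetchBatch(list, BatchSizeFor(cl));
  return list->Pop();
}
</pre>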
<h3><A NAME="Sizing_Thread_Cache_Free_Lists">
Sizing Thread Cache Free Lists</A></h3>
<p>It is important to size the thread cache free lists correctly. If
the free list is too small, we'll need to go to the central free list
too often. If the free list is too big, we'll waste memory as objects
sit idle in the free list.</p>
<p>Note that the thread caches are just as important for deallocation
as they are for allocation. Without a cache, each deallocation would
require moving the memory to the central free list. Also, some threads
have asymmetric alloc/free behavior (e.g. producer and consumer threads),
so sizing the free list correctly gets trickier.</p>
<p>To size the free lists appropriately, we use a slow-start algorithm
to determine the maximum length of each individual free list. As the
free list is used more frequently, its maximum length grows. However,
if a free list is used more for deallocation than allocation, its
maximum length will grow only up to a point where the whole list can
be efficiently moved to the central free list at once.</p>
<p>The pseudo-code below illustrates this slow-start algorithm. Note
that <code>num_objects_to_move</code> is specific to each size class.
By moving a list of objects with a well-known length, the central
cache can efficiently pass these lists between thread caches. If
a thread cache wants fewer than <code>num_objects_to_move</code>,
the operation on the central free list has linear time complexity.
The downside of always using <code>num_objects_to_move</code> as
the number of objects to transfer to and from the central cache is
that it wastes memory in threads that don't need all of those objects.</p>
<pre>
Start each freelist max_length at 1.

Allocation
  if freelist empty {
    fetch min(max_length, num_objects_to_move) from central list;
    if max_length < num_objects_to_move {  // slow-start
      max_length++;
    } else {
      max_length += num_objects_to_move;
    }
  }

Deallocation
  if length > max_length {
    // Don't try to release num_objects_to_move if we don't have that many.
    release min(max_length, num_objects_to_move) objects to central list
    if max_length < num_objects_to_move {
      // Slow-start up to num_objects_to_move.
      max_length++;
    } else if max_length > num_objects_to_move {
      // If we consistently go over max_length, shrink max_length.
      overages++;
      if overages > kMaxOverages {
        max_length -= num_objects_to_move;
        overages = 0;
      }
    }
  }
</pre>
<p>See also the section on <a href="#Garbage_Collection">Garbage Collection</a>
to see how it affects <code>max_length</code>.</p>
<h2><A NAME="Medium_Object_Allocation">Medium Object Allocation</A></h2>
<p>A medium object size (256K &lt; size &le; 1MB) is rounded up to a page
size (8K) and is handled by a central page heap. The central page heap
includes an array of 128 free lists. The <code>k</code>th entry is a
free list of runs that consist of <code>k + 1</code> pages:</p>
<center><img src="pageheap.gif"></center>
<p>An allocation for <code>k</code> pages is satisfied by looking in
the <code>k</code>th free list. If that free list is empty, we look
in the next free list, and so forth. If no medium-object free list
can satisfy the allocation, the allocation is treated as a large object.</p>
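<p>In sketch form, the search over the page heap's free-list array might
look like this; <code>Span</code>, <code>CarveFrom</code>, and
<code>AllocateLarge</code> are hypothetical stand-ins:</p>
<pre>
#include &lt;cstddef>  // size_t

struct Span;                             // a run of contiguous pages
Span* CarveFrom(Span** list, size_t k);  // split off k pages (hypothetical)
Span* AllocateLarge(size_t k);           // large-object path (hypothetical)

constexpr size_t kMaxPages = 128;
Span* free_lists[kMaxPages];             // entry i holds runs of i+1 pages

Span* AllocateRun(size_t k) {            // k pages wanted, 1 &lt;= k &lt;= kMaxPages
  for (size_t i = k - 1; i &lt; kMaxPages; ++i)
    if (free_lists[i] != nullptr)
      return CarveFrom(&free_lists[i], k);
  return AllocateLarge(k);               // no medium free list can satisfy it
}
</pre>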
<h2><A NAME="Large_Object_Allocation">Large Object Allocation</A></h2>
<p>Allocations of 1MB or more are considered large allocations. Spans
of free memory which can satisfy these allocations are tracked in
a red-black tree sorted by size. Allocations follow the <em>best-fit</em>
algorithm: the tree is searched to find the smallest span of free
space which is larger than the requested allocation. The allocation
is carved out of that span, and the remaining space is reinserted
either into the large object tree or possibly into one of the smaller
free-lists as appropriate.
If no span of free memory can satisfy the requested
allocation, we fetch memory from the system (using <code>sbrk</code>,
<code>mmap</code>, or by mapping in portions of
<code>/dev/mem</code>).</p>
<p>If an allocation for <code>k</code> pages is satisfied by a run
of pages of length &gt; <code>k</code>, the remainder of the
run is re-inserted back into the appropriate free list in the
page heap.</p>
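<p>A minimal sketch of this best-fit search, using
<code>std::multimap</code> (typically itself a red-black tree) in place of
TCMalloc's internal tree; the carving of the remainder is left
abstract:</p>
<pre>
#include &lt;cstddef>
#include &lt;map>

struct Span { size_t pages; };
std::multimap&lt;size_t, Span*> large_spans;  // free spans keyed by length

Span* BestFit(size_t k) {
  auto it = large_spans.lower_bound(k);    // smallest span with >= k pages
  if (it == large_spans.end())
    return nullptr;                        // nothing fits: ask the OS
  Span* span = it->second;
  large_spans.erase(it);
  // The first k pages are handed out; the remaining span->pages - k pages
  // are reinserted into the tree or a smaller free list (not shown).
  return span;
}
</pre>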
<h2><A NAME="Spans">Spans</A></h2>
<p>The heap managed by TCMalloc consists of a set of pages. A run of
contiguous pages is represented by a <code>Span</code> object. A span
can either be <em>allocated</em>, or <em>free</em>. If free, the span
is one of the entries in a page heap linked-list. If allocated, it is
either a large object that has been handed off to the application, or
a run of pages that have been split up into a sequence of small
objects. If split into small objects, the size-class of the objects
is recorded in the span.</p>
<p>A central array indexed by page number can be used to find the span to
which a page belongs. For example, span <em>a</em> below occupies 2
pages, span <em>b</em> occupies 1 page, span <em>c</em> occupies 5
pages and span <em>d</em> occupies 3 pages.</p>
<center><img src="spanmap.gif"></center>
<p>In a 32-bit address space, the central array is represented by a
2-level radix tree where the root contains 32 entries and each leaf
contains 2^14 entries (a 32-bit address space has 2^19 8K pages, and
the first level of the tree divides the 2^19 pages by 2^5). This leads to
a starting memory usage of 64KB of space (2^14*4 bytes) for the
central array, which seems acceptable.</p>
<p>On 64-bit machines, we use a 3-level radix tree.</p>
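<p>A sketch of the 32-bit two-level lookup under those numbers (8K pages,
so a page number is the address shifted right by 13):</p>
<pre>
#include &lt;cstdint>  // uintptr_t

struct Span;
struct Leaf { Span* spans[1 &lt;&lt; 14]; };  // one leaf covers 2^14 pages
Leaf* root[1 &lt;&lt; 5];                     // 32 root entries, leaves on demand

Span* SpanOfPage(uintptr_t page) {      // page = address >> 13
  Leaf* leaf = root[page >> 14];
  return leaf ? leaf->spans[page & ((1 &lt;&lt; 14) - 1)] : nullptr;
}
</pre>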
<h2><A NAME="Deallocation">Deallocation</A></h2>
<p>When an object is deallocated, we compute its page number and look
it up in the central array to find the corresponding span object. The
span tells us whether or not the object is small, and its size-class
if it is small. If the object is small, we insert it into the
appropriate free list in the current thread's thread cache. If the
thread cache now exceeds a predetermined size (2MB by default), we run
a garbage collector that moves unused objects from the thread cache
into central free lists.</p>
<p>If the object is large, the span tells us the range of pages covered
by the object. Suppose this range is <code>[p,q]</code>. We also
lookup the spans for pages <code>p-1</code> and <code>q+1</code>. If
either of these neighboring spans are free, we coalesce them with the
<code>[p,q]</code> span. The resulting span is inserted into the
appropriate free list in the page heap.</p>
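<p>Putting the pieces together, a hypothetical sketch of this
deallocation path (field and helper names are illustrative):</p>
<pre>
#include &lt;cstdint>  // uintptr_t

struct Span {
  uintptr_t first_page, num_pages;
  int size_class;                 // -1 when the span holds one large object
  bool free;
};
Span* SpanOfPage(uintptr_t page);           // via the central page map
Span* Merge(Span* a, Span* b);              // coalesce two free spans
void PushToThreadCache(int cl, void* p);    // may trigger a GC pass
void InsertIntoPageHeap(Span* span);

void Deallocate(void* ptr) {
  uintptr_t page = reinterpret_cast&lt;uintptr_t>(ptr) >> 13;  // 8K pages
  Span* span = SpanOfPage(page);
  if (span->size_class >= 0) {              // small object
    PushToThreadCache(span->size_class, ptr);
  } else {                                  // large: coalesce [p-1], [q+1]
    Span* prev = SpanOfPage(span->first_page - 1);
    Span* next = SpanOfPage(span->first_page + span->num_pages);
    if (prev && prev->free) span = Merge(prev, span);
    if (next && next->free) span = Merge(span, next);
    InsertIntoPageHeap(span);
  }
}
</pre>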
<h2>Central Free Lists for Small Objects</h2>
<p>As mentioned before, we keep a central free list for each
size-class. Each central free list is organized as a two-level data
structure: a set of spans, and a linked list of free objects per
span.</p>
<p>An object is allocated from a central free list by removing the
first entry from the linked list of some span. (If all spans have
empty linked lists, a suitably sized span is first allocated from the
central page heap.)</p>
<p>An object is returned to a central free list by adding it to the
linked list of its containing span. If the linked list length now
equals the total number of small objects in the span, this span is now
completely free and is returned to the page heap.</p>
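<p>A minimal sketch of that bookkeeping, with hypothetical names:</p>
<pre>
// Return one object to its containing span in a central free list.
struct Span {
  void* free_objects;             // linked list of this span's free objects
  int free_count, total_objects;
};
Span* SpanOfObject(void* obj);    // via the central page map (hypothetical)
void ReturnSpanToPageHeap(Span* span);

void CentralRelease(void* obj) {
  Span* span = SpanOfObject(obj);
  *static_cast&lt;void**>(obj) = span->free_objects;  // push onto span's list
  span->free_objects = obj;
  if (++span->free_count == span->total_objects)
    ReturnSpanToPageHeap(span);   // span is completely free again
}
</pre>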
<h2><A NAME="Garbage_Collection">Garbage Collection of Thread Caches</A></h2>
<p>Garbage collecting objects from a thread cache keeps the size of
the cache under control and returns unused objects to the central free
lists. Some threads need large caches to perform well while others
can get by with little or no cache at all. When a thread cache goes
over its <code>max_size</code>, garbage collection kicks in and then the
thread competes with the other threads for a larger cache.</p>
<p>Garbage collection is run only during a deallocation. We walk over
all free lists in the cache and move some number of objects from the
free list to the corresponding central list.</p>
<p>The number of objects to be moved from a free list is determined
using a per-list low-water-mark <code>L</code>. <code>L</code>
records the minimum length of the list since the last garbage
collection. Note that we could have shortened the list by
<code>L</code> objects at the last garbage collection without
requiring any extra accesses to the central list. We use this past
history as a predictor of future accesses and move <code>L/2</code>
objects from the thread cache free list to the corresponding central
free list. This algorithm has the nice property that if a thread
stops using a particular size, all objects of that size will quickly
move from the thread cache to the central free list where they can be
used by other threads.</p>
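<p>In sketch form (with hypothetical fields), the rule is simply:</p>
<pre>
// low_water is the minimum length this list reached since the last pass.
struct FreeList {
  int length, low_water;
  void* Pop();
};
void MoveToCentralList(int cl, void* obj);  // hypothetical helper

void GarbageCollectList(int cl, FreeList* list) {
  int to_move = list->low_water / 2;        // predicted surplus
  for (int i = 0; i &lt; to_move; ++i) {
    MoveToCentralList(cl, list->Pop());
    list->length--;
  }
  list->low_water = list->length;           // reset for the next period
}
</pre>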
<p>If a thread consistently deallocates more objects of a certain size
than it allocates, this <code>L/2</code> behavior will cause at least
<code>L/2</code> objects to always sit in the free list. To avoid
wasting memory this way, we shrink the maximum length of the freelist
to converge on <code>num_objects_to_move</code> (see also
<a href="#Sizing_Thread_Cache_Free_Lists">Sizing Thread Cache Free Lists</a>).</p>
<pre>
Garbage Collection
  if (L != 0 && max_length > num_objects_to_move) {
    max_length = max(max_length - num_objects_to_move, num_objects_to_move)
  }
</pre>
<p>The fact that the thread cache went over its <code>max_size</code> is
an indication that the thread would benefit from a larger cache. Simply
increasing <code>max_size</code> would use an inordinate amount of memory
in programs that have lots of active threads. Developers can bound the
memory used with the flag --tcmalloc_max_total_thread_cache_bytes.</p>
<p>Each thread cache starts with a small <code>max_size</code>
(e.g. 64KB) so that idle threads won't pre-allocate memory they don't
need. Each time the cache runs a garbage collection, it will also try
to grow its <code>max_size</code>. If the sum of the thread cache
sizes is less than --tcmalloc_max_total_thread_cache_bytes,
<code>max_size</code> grows easily. If not, thread cache 1 will try
to steal from thread cache 2 (picked round-robin) by decreasing thread
cache 2's <code>max_size</code>. In this way, threads that are more
active will steal memory from other threads more often than they have
memory stolen from them. Mostly idle threads end up with
small caches and active threads end up with big caches. Note that
this stealing can cause the sum of the thread cache sizes to be
greater than --tcmalloc_max_total_thread_cache_bytes until thread
cache 2 deallocates some memory to trigger a garbage collection.</p>
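<p>A hedged sketch of that grow-or-steal decision (all names are
hypothetical):</p>
<pre>
#include &lt;cstddef>  // size_t

struct ThreadCache { size_t max_size; ThreadCache* next; };
size_t total_cache_size;   // sum of all max_size values
size_t kTotalBudget;       // --tcmalloc_max_total_thread_cache_bytes
ThreadCache* next_victim;  // round-robin steal pointer

void GrowMaxSize(ThreadCache* cache, size_t delta) {
  if (total_cache_size + delta &lt;= kTotalBudget) {
    cache->max_size += delta;              // under budget: grow easily
    total_cache_size += delta;
  } else {                                 // over budget: steal instead
    size_t stolen =
        delta &lt; next_victim->max_size ? delta : next_victim->max_size;
    next_victim->max_size -= stolen;
    cache->max_size += stolen;
    next_victim = next_victim->next;       // move on round-robin
  }
}
</pre>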
<h2><A NAME="performance">Performance Notes</A></h2>
<h3>PTMalloc2 unittest</h3>
<p>The PTMalloc2 package (now part of glibc) contains a unittest
program <code>t-test1.c</code>. This forks a number of threads and
performs a series of allocations and deallocations in each thread; the
threads do not communicate other than by synchronization in the memory
allocator.</p>
<p><code>t-test1</code> (included in
<code>tests/tcmalloc/</code>, and compiled as
<code>ptmalloc_unittest1</code>) was run with varying numbers of
threads (1-20) and maximum allocation sizes (64 bytes -
32Kbytes). These tests were run on a 2.4GHz dual Xeon system with
hyper-threading enabled, using Linux glibc-2.3.2 from RedHat 9, with
one million operations per thread in each test. In each case, the test
was run once normally, and once with
<code>LD_PRELOAD=libtcmalloc.so</code>.</p>
<p>The graphs below show the performance of TCMalloc vs PTMalloc2 for
several different metrics. Firstly, total operations (millions) per
elapsed second vs max allocation size, for varying numbers of
threads. The raw data used to generate these graphs (the output of the
<code>time</code> utility) is available in
<code>t-test1.times.txt</code>.</p>
<table>
<tr>
<td><img src="tcmalloc-opspersec.vs.size.1.threads.png"></td>
<td><img src="tcmalloc-opspersec.vs.size.2.threads.png"></td>
<td><img src="tcmalloc-opspersec.vs.size.3.threads.png"></td>
</tr>
<tr>
<td><img src="tcmalloc-opspersec.vs.size.4.threads.png"></td>
<td><img src="tcmalloc-opspersec.vs.size.5.threads.png"></td>
<td><img src="tcmalloc-opspersec.vs.size.8.threads.png"></td>
</tr>
<tr>
<td><img src="tcmalloc-opspersec.vs.size.12.threads.png"></td>
<td><img src="tcmalloc-opspersec.vs.size.16.threads.png"></td>
<td><img src="tcmalloc-opspersec.vs.size.20.threads.png"></td>
</tr>
</table>
<ul>
<li> TCMalloc is much more consistently scalable than PTMalloc2 - for
all thread counts &gt;1 it achieves ~7-9 million ops/sec for small
allocations, falling to ~2 million ops/sec for larger
allocations. The single-thread case is an obvious outlier,
since it is only able to keep a single processor busy and hence
can achieve fewer ops/sec. PTMalloc2 has a much higher variance
on operations/sec - peaking somewhere around 4 million ops/sec
for small allocations and falling to &lt;1 million ops/sec for
larger allocations.
<li> TCMalloc is faster than PTMalloc2 in the vast majority of
cases, and particularly for small allocations. Contention
between threads is less of a problem in TCMalloc.
<li> TCMalloc's performance drops off as the allocation size
increases. This is because the per-thread cache is
garbage-collected when it hits a threshold (defaulting to
2MB). With larger allocation sizes, fewer objects can be stored
in the cache before it is garbage-collected.
<li> There is a noticeable drop in TCMalloc's performance at ~32K
maximum allocation size; at larger sizes performance drops less
quickly. This is due to the 32K maximum size of objects in the
per-thread caches of the version tested (newer versions cache
objects up to 256K, as described above); for objects larger than
this TCMalloc allocates from the central page heap.
</ul>
<p>Next, operations (millions) per second of CPU time vs number of
threads, for max allocation size 64 bytes - 128 Kbytes.</p>
<table>
<tr>
<td><img src="tcmalloc-opspercpusec.vs.threads.64.bytes.png"></td>
<td><img src="tcmalloc-opspercpusec.vs.threads.256.bytes.png"></td>
<td><img src="tcmalloc-opspercpusec.vs.threads.1024.bytes.png"></td>
</tr>
<tr>
<td><img src="tcmalloc-opspercpusec.vs.threads.4096.bytes.png"></td>
<td><img src="tcmalloc-opspercpusec.vs.threads.8192.bytes.png"></td>
<td><img src="tcmalloc-opspercpusec.vs.threads.16384.bytes.png"></td>
</tr>
<tr>
<td><img src="tcmalloc-opspercpusec.vs.threads.32768.bytes.png"></td>
<td><img src="tcmalloc-opspercpusec.vs.threads.65536.bytes.png"></td>
<td><img src="tcmalloc-opspercpusec.vs.threads.131072.bytes.png"></td>
</tr>
</table>
<p>Here we see again that TCMalloc is both more consistent and more
efficient than PTMalloc2. For max allocation sizes &lt;32K, TCMalloc
typically achieves ~2-2.5 million ops per second of CPU time with a
large number of threads, whereas PTMalloc achieves generally 0.5-1
million ops per second of CPU time, with a lot of cases achieving much
less than this figure. Above 32K max allocation size, TCMalloc drops
to 1-1.5 million ops per second of CPU time, and PTMalloc drops almost
to zero for large numbers of threads (i.e. with PTMalloc, lots of CPU
time is being burned spinning waiting for locks in the heavily
multi-threaded case).</p>
<H2><A NAME="runtime">Modifying Runtime Behavior</A></H2>
<p>You can more finely control the behavior of tcmalloc via
environment variables.</p>
<p>Generally useful flags:</p>
<table frame=box rules=sides cellpadding=5 width=100%>
<tr valign=top>
<td><code>TCMALLOC_SAMPLE_PARAMETER</code></td>
<td>default: 0</td>
<td>
The approximate gap between sampling actions. That is, we
take one sample approximately once every
<code>TCMALLOC_SAMPLE_PARAMETER</code> bytes of allocation.
This sampled heap information is available via
<code>MallocExtension::GetHeapSample()</code> or
<code>MallocExtension::ReadStackTraces()</code>. A reasonable
value is 524288.
</td>
</tr>
<tr valign=top>
<td><code>TCMALLOC_RELEASE_RATE</code></td>
<td>default: 1.0</td>
<td>
Rate at which we release unused memory to the system, via
<code>madvise(MADV_DONTNEED)</code>, on systems that support
it. Zero means we never release memory back to the system.
Increase this flag to return memory faster; decrease it
to return memory slower. Reasonable rates are in the
range [0,10].
</td>
</tr>
<tr valign=top>
<td><code>TCMALLOC_LARGE_ALLOC_REPORT_THRESHOLD</code></td>
<td>default: 1073741824</td>
<td>
Allocations larger than this value cause a stack trace to be
dumped to stderr. The threshold for dumping stack traces is
increased by a factor of 1.125 every time we print a message so
that the threshold automatically goes up by a factor of ~1000
every 60 messages. This bounds the amount of extra logging
generated by this flag. The default value of this flag is very large,
so you should see no extra logging unless the flag is
overridden.
</td>
</tr>
<tr valign=top>
<td><code>TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES</code></td>
<td>default: 16777216</td>
<td>
Bound on the total amount of bytes allocated to thread caches. This
bound is not strict, so it is possible for the cache to go over this
bound in certain circumstances. This value defaults to 16MB. For
applications with many threads, this may not be a large enough cache,
which can affect performance. If you suspect your application is not
scaling to many threads due to lock contention in TCMalloc, you can
try increasing this value. This may improve performance, at a cost
of extra memory use by TCMalloc. See <a href="#Garbage_Collection">
Garbage Collection</a> for more details.
</td>
</tr>
</table>
<p>Advanced "tweaking" flags, that control more precisely how tcmalloc
tries to allocate memory from the kernel.</p>
<table frame=box rules=sides cellpadding=5 width=100%>
<tr valign=top>
<td><code>TCMALLOC_SKIP_MMAP</code></td>
<td>default: false</td>
<td>
If true, do not try to use <code>mmap</code> to obtain memory
from the kernel.
</td>
</tr>
<tr valign=top>
<td><code>TCMALLOC_SKIP_SBRK</code></td>
<td>default: false</td>
<td>
If true, do not try to use <code>sbrk</code> to obtain memory
from the kernel.
</td>
</tr>
<tr valign=top>
<td><code>TCMALLOC_DEVMEM_START</code></td>
<td>default: 0</td>
<td>
Physical memory starting location in MB for <code>/dev/mem</code>
allocation. Setting this to 0 disables <code>/dev/mem</code>
allocation.
</td>
</tr>
<tr valign=top>
<td><code>TCMALLOC_DEVMEM_LIMIT</code></td>
<td>default: 0</td>
<td>
Physical memory limit location in MB for <code>/dev/mem</code>
allocation. Setting this to 0 means no limit.
</td>
</tr>
<tr valign=top>
<td><code>TCMALLOC_DEVMEM_DEVICE</code></td>
<td>default: /dev/mem</td>
<td>
Device to use for allocating unmanaged memory.
</td>
</tr>
<tr valign=top>
<td><code>TCMALLOC_MEMFS_MALLOC_PATH</code></td>
<td>default: ""</td>
<td>
If set, specify a path where hugetlbfs or tmpfs is mounted.
This may allow for speedier allocations.
</td>
</tr>
<tr valign=top>
<td><code>TCMALLOC_MEMFS_LIMIT_MB</code></td>
<td>default: 0</td>
<td>
Limit total memfs allocation size to specified number of MB.
0 means "no limit".
</td>
</tr>
<tr valign=top>
<td><code>TCMALLOC_MEMFS_ABORT_ON_FAIL</code></td>
<td>default: false</td>
<td>
If true, abort() whenever memfs_malloc fails to satisfy an allocation.
</td>
</tr>
<tr valign=top>
<td><code>TCMALLOC_MEMFS_IGNORE_MMAP_FAIL</code></td>
<td>default: false</td>
<td>
If true, ignore failures from mmap.
</td>
</tr>
<tr valign=top>
<td><code>TCMALLOC_MEMFS_MAP_PRIVATE</code></td>
<td>default: false</td>
<td>
If true, use MAP_PRIVATE when mapping via memfs, not MAP_SHARED.
</td>
</tr>
</table>
<H2><A NAME="compiletime">Modifying Behavior In Code</A></H2>
<p>The <code>MallocExtension</code> class, in
<code>malloc_extension.h</code>, provides a few knobs that you can
tweak in your program, to affect tcmalloc's behavior.</p>
<h3>Releasing Memory Back to the System</h3>
<p>By default, tcmalloc will release no-longer-used memory back to the
kernel gradually, over time. The <a
href="#runtime">tcmalloc_release_rate</a> flag controls how quickly
this happens. You can also force a release at a given point in the
program execution like so:</p>
<pre>
MallocExtension::instance()->ReleaseFreeMemory();
</pre>
<p>You can also call <code>SetMemoryReleaseRate()</code> to change the
<code>tcmalloc_release_rate</code> value at runtime, or
<code>GetMemoryReleaseRate()</code> to see what the current release rate
is.</p>
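<p>For example, to release memory back to the system more aggressively
(the header path shown follows gperftools 2.x):</p>
<pre>
#include &lt;gperftools/malloc_extension.h>

void ReleaseMemoryFaster() {
  double old_rate = MallocExtension::instance()->GetMemoryReleaseRate();
  MallocExtension::instance()->SetMemoryReleaseRate(5.0);  // rates: [0,10]
  // ... old_rate can be passed back in later to restore the previous rate.
}
</pre>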
<h3>Memory Introspection</h3>
<p>There are several routines for getting a human-readable form of the
current memory usage:</p>
<pre>
MallocExtension::instance()->GetStats(buffer, buffer_length);
MallocExtension::instance()->GetHeapSample(&string);
MallocExtension::instance()->GetHeapGrowthStacks(&string);
</pre>
<p>The last two create files in the same format as the heap-profiler,
and can be passed as data files to pprof. The first is human-readable
and is meant for debugging.</p>
<h3>Generic Tcmalloc Status</h3>
<p>TCMalloc has support for setting and retrieving arbitrary
'properties':</p>
<pre>
MallocExtension::instance()->SetNumericProperty(property_name, value);
MallocExtension::instance()->GetNumericProperty(property_name, &value);
</pre>
<p>It is possible for an application to set and get these properties,
but the most useful is when a library sets the properties so the
application can read them. Here are the properties TCMalloc defines;
you can access them with a call like
<code>MallocExtension::instance()->GetNumericProperty("generic.heap_size",
&value);</code>:</p>
<table frame=box rules=sides cellpadding=5 width=100%>
<tr valign=top>
<td><code>generic.current_allocated_bytes</code></td>
<td>
Number of bytes used by the application. This will not typically
match the memory use reported by the OS, because it does not
include TCMalloc overhead or memory fragmentation.
</td>
</tr>
<tr valign=top>
<td><code>generic.heap_size</code></td>
<td>
Bytes of system memory reserved by TCMalloc.
</td>
</tr>
<tr valign=top>
<td><code>tcmalloc.pageheap_free_bytes</code></td>
<td>
Number of bytes in free, mapped pages in page heap. These bytes
can be used to fulfill allocation requests. They always count
towards virtual memory usage, and unless the underlying memory is
swapped out by the OS, they also count towards physical memory
usage.
</td>
</tr>
<tr valign=top>
<td><code>tcmalloc.pageheap_unmapped_bytes</code></td>
<td>
Number of bytes in free, unmapped pages in page heap. These are
bytes that have been released back to the OS, possibly by one of
the MallocExtension "Release" calls. They can be used to fulfill
allocation requests, but typically incur a page fault. They
always count towards virtual memory usage, and depending on the
OS, typically do not count towards physical memory usage.
</td>
</tr>
<tr valign=top>
<td><code>tcmalloc.slack_bytes</code></td>
<td>
Sum of pageheap_free_bytes and pageheap_unmapped_bytes. Provided
for backwards compatibility only. Do not use.
</td>
</tr>
<tr valign=top>
<td><code>tcmalloc.max_total_thread_cache_bytes</code></td>
<td>
A limit to how much memory TCMalloc dedicates for small objects.
Higher numbers trade off more memory use for -- in some situations
-- improved efficiency.
</td>
</tr>
<tr valign=top>
<td><code>tcmalloc.current_total_thread_cache_bytes</code></td>
<td>
A measure of some of the memory TCMalloc is using (for
small objects).
</td>
</tr>
</table>
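<p>For instance, a small routine that reports heap usage via the
properties above might look like this (header path per gperftools 2.x):</p>
<pre>
#include &lt;cstdio>
#include &lt;gperftools/malloc_extension.h>

void PrintHeapUsage() {
  size_t allocated = 0, heap_size = 0;
  MallocExtension* ext = MallocExtension::instance();
  if (ext->GetNumericProperty("generic.current_allocated_bytes", &allocated) &&
      ext->GetNumericProperty("generic.heap_size", &heap_size)) {
    std::printf("application bytes: %zu, reserved heap bytes: %zu\n",
                allocated, heap_size);
  }
}
</pre>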
<h2><A NAME="caveats">Caveats</A></h2>
<p>For some systems, TCMalloc may not work correctly with
applications that aren't linked against <code>libpthread.so</code> (or
the equivalent on your OS). It should work on Linux using glibc 2.3,
but other OS/libc combinations have not been tested.</p>
<p>TCMalloc may be somewhat more memory hungry than other mallocs
(but tends not to have the huge blowups that can happen with other
mallocs). In particular, at startup TCMalloc allocates approximately
240KB of internal memory.</p>
<p>Don't try to load TCMalloc into a running binary (e.g., using JNI
in Java programs). The binary will have allocated some objects using
the system malloc, and may try to pass them to TCMalloc for
deallocation. TCMalloc will not be able to handle such objects.</p>
<hr>
<address>Sanjay Ghemawat, Paul Menage<br>
<!-- Created: Tue Dec 19 10:43:14 PST 2000 -->
<!-- hhmts start -->
Last modified: Sat Feb 24 13:11:38 PST 2007 (csilvers)
<!-- hhmts end -->
</address>
</body>
</html>

View file

@ -0,0 +1,21 @@
digraph ThreadHeap {
rankdir=LR
node [shape=box, width=0.3, height=0.3]
nodesep=.05
heap [shape=record, height=2, label="<f0>class 0|<f1>class 1|<f2>class 2|..."]
O0 [label=""]
O1 [label=""]
O2 [label=""]
O3 [label=""]
O4 [label=""]
O5 [label=""]
sep1 [shape=plaintext, label="..."]
sep2 [shape=plaintext, label="..."]
sep3 [shape=plaintext, label="..."]
heap:f0 -> O0 -> O1 -> sep1
heap:f1 -> O2 -> O3 -> sep2
heap:f2 -> O4 -> O5 -> sep3
}
