<!doctype html public "-//w3c//dtd html 4.01 transitional//en">
<!-- $Id: $ -->
<html>
<head>
<title>TCMalloc : Thread-Caching Malloc</title>
<link rel="stylesheet" href="designstyle.css">
<style type="text/css">
  em {
    color: red;
    font-style: normal;
  }
</style>
</head>
<body>

<h1>TCMalloc : Thread-Caching Malloc</h1>

<address>Sanjay Ghemawat</address>

<h2><A name=motivation>Motivation</A></h2>

<p>TCMalloc is faster than the glibc 2.3 malloc (available as a
separate library called ptmalloc2) and other mallocs that I have
tested.  ptmalloc2 takes approximately 300 nanoseconds to execute a
malloc/free pair on a 2.8 GHz P4 (for small objects).  The TCMalloc
implementation takes approximately 50 nanoseconds for the same
operation pair.  Speed is important for a malloc implementation
because if malloc is not fast enough, application writers are inclined
to write their own custom free lists on top of malloc.  This can lead
to extra complexity and more memory usage unless the application
writer is very careful to appropriately size the free lists and
scavenge idle objects out of the free list.</p>

<p>TCMalloc also reduces lock contention for multi-threaded programs.
For small objects, there is virtually zero contention.  For large
objects, TCMalloc tries to use fine-grained and efficient spinlocks.
ptmalloc2 also reduces lock contention by using per-thread arenas, but
there is a big problem with ptmalloc2's arenas: memory can never move
from one arena to another.  This can
lead to huge amounts of wasted space.  For example, in one Google
application, the first phase would allocate approximately 300MB of
memory for its URL canonicalization data structures.  When the first
phase finished, a second phase would be started in the same address
space.  If this second phase was assigned a different arena than the
one used by the first phase, this phase would not reuse any of the
memory left after the first phase and would add another 300MB to the
address space.  Similar memory blowup problems were also noticed in
other applications.</p>

<p>Another benefit of TCMalloc is space-efficient representation of
small objects.  For example, N 8-byte objects can be allocated while
using space of approximately <code>8N * 1.01</code> bytes, i.e., a
one-percent space overhead.  ptmalloc2 uses a four-byte header for
each object and (I think) rounds up the size to a multiple of 8 bytes,
ending up using <code>16N</code> bytes.</p>


<h2><A NAME="Usage">Usage</A></h2>

<p>To use TCMalloc, just link TCMalloc into your application via the
"-ltcmalloc" linker flag.</p>

<p>You can use TCMalloc in applications you didn't compile yourself,
by using LD_PRELOAD:</p>
<pre>
   $ LD_PRELOAD="/usr/lib/libtcmalloc.so" &lt;binary&gt;
</pre>
<p>LD_PRELOAD is tricky, and we don't necessarily recommend this mode
of usage.</p>

<p>TCMalloc includes a <A HREF="heap_checker.html">heap checker</A>
and <A HREF="heapprofile.html">heap profiler</A> as well.</p>

<p>If you'd rather link in a version of TCMalloc that does not include
the heap profiler and checker (perhaps to reduce binary size for a
static binary), you can link in <code>libtcmalloc_minimal</code>
instead.</p>


<h2><A NAME="Overview">Overview</A></h2>

<p>TCMalloc assigns each thread a thread-local cache.  Small
allocations are satisfied from the thread-local cache.  Objects are
moved from central data structures into a thread-local cache as
needed, and periodic garbage collections are used to migrate memory
back from a thread-local cache into the central data structures.</p>
<center><img src="overview.gif"></center>

<p>TCMalloc treats objects with size &lt;= 256K ("small" objects)
differently from larger objects.  Large objects are allocated directly
from the central heap using a page-level allocator (a page is an 8K
aligned region of memory).  I.e., a large object is always
page-aligned and occupies an integral number of pages.</p>

<p>A run of pages can be carved up into a sequence of small objects,
each equally sized.  For example, a run of one page (8K) can be carved
up into 64 objects of 128 bytes each.</p>



<h2><A NAME="Small_Object_Allocation">Small Object Allocation</A></h2>

<p>Each small object size maps to one of approximately 88 allocatable
size-classes.  For example, all allocations in the range 961 to 1024
bytes are rounded up to 1024.  The size-classes are spaced so that
small sizes are separated by 8 bytes, larger sizes by 16 bytes, even
larger sizes by 32 bytes, and so forth.  The maximal spacing is
controlled so that not too much space is wasted when an allocation
request falls just past the end of a size class and has to be rounded
up to the next class.</p>
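
<p>The following sketch illustrates this rounding scheme.  The
breakpoints below are invented for illustration; TCMalloc's real class
list is precomputed, and these are not its actual boundaries:</p>
<pre>
#include &lt;cstddef&gt;

// Illustrative only: not TCMalloc's actual size-class table.
size_t RoundUp(size_t size, size_t spacing) {
  return (size + spacing - 1) / spacing * spacing;
}

size_t SizeClassFor(size_t size) {
  if (size <= 64)  return RoundUp(size, 8);   // small sizes: 8-byte spacing
  if (size <= 128) return RoundUp(size, 16);  // larger sizes: 16-byte spacing
  if (size <= 256) return RoundUp(size, 32);  // even larger: 32-byte spacing
  return RoundUp(size, 64);                   // and so forth; here
                                              // SizeClassFor(961) == 1024
}
</pre>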

<p>A thread cache contains a singly linked list of free objects per
size-class.</p>
<center><img src="threadheap.gif"></center>

<p>When allocating a small object: (1) We map its size to the
corresponding size-class.  (2) Look in the corresponding free list in
the thread cache for the current thread.  (3) If the free list is not
empty, we remove the first object from the list and return it.  When
following this fast path, TCMalloc acquires no locks at all.  This
helps speed up allocation significantly because a lock/unlock pair
takes approximately 100 nanoseconds on a 2.8 GHz Xeon.</p>
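
<p>A minimal sketch of this fast path follows.  The names
(<code>FreeObject</code>, <code>ThreadCache</code>) are illustrative,
not TCMalloc's actual internals; the point is that the hot path is a
plain pointer pop with no locking:</p>
<pre>
// Hypothetical sketch; each thread owns one ThreadCache
// (e.g. reached via a thread-local pointer).
struct FreeObject { FreeObject* next; };

struct ThreadCache {
  FreeObject* free_list[88] = {};   // one singly linked list per size-class

  void* Allocate(int size_class) {
    FreeObject* obj = free_list[size_class];
    if (obj != nullptr) {                      // fast path: no locks
      free_list[size_class] = obj->next;
      return obj;
    }
    return FetchFromCentralList(size_class);   // slow path, described below
  }

  void* FetchFromCentralList(int size_class);  // declared only
};
</pre>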

<p>If the free list is empty: (1) We fetch a bunch of objects from a
central free list for this size-class (the central free list is shared
by all threads).  (2) Place them in the thread-local free list.  (3)
Return one of the newly fetched objects to the application.</p>

<p>If the central free list is also empty: (1) We allocate a run of
pages from the central page allocator.  (2) Split the run into a set
of objects of this size-class.  (3) Place the new objects on the
central free list.  (4) As before, move some of these objects to the
thread-local free list.</p>

<h3><A NAME="Sizing_Thread_Cache_Free_Lists">
  Sizing Thread Cache Free Lists</A></h3>

<p>It is important to size the thread cache free lists correctly.  If
the free list is too small, we'll need to go to the central free list
too often.  If the free list is too big, we'll waste memory as objects
sit idle in the free list.</p>

<p>Note that the thread caches are just as important for deallocation
as they are for allocation.  Without a cache, each deallocation would
require moving the memory to the central free list.  Also, some threads
have asymmetric alloc/free behavior (e.g. producer and consumer threads),
so sizing the free list correctly gets trickier.</p>

<p>To size the free lists appropriately, we use a slow-start algorithm
to determine the maximum length of each individual free list.  As the
free list is used more frequently, its maximum length grows.  However,
if a free list is used more for deallocation than allocation, its
maximum length will grow only up to a point where the whole list can
be efficiently moved to the central free list at once.</p>

<p>The pseudo-code below illustrates this slow-start algorithm.  Note
that <code>num_objects_to_move</code> is specific to each size class.
By moving a list of objects with a well-known length, the central
cache can efficiently pass these lists between thread caches.  If
a thread cache wants fewer than <code>num_objects_to_move</code>,
the operation on the central free list has linear time complexity.
The downside of always using <code>num_objects_to_move</code> as
the number of objects to transfer to and from the central cache is
that it wastes memory in threads that don't need all of those
objects.</p>

<pre>
Start each freelist max_length at 1.

Allocation
  if freelist empty {
    fetch min(max_length, num_objects_to_move) from central list;
    if max_length < num_objects_to_move {  // slow-start
      max_length++;
    } else {
      max_length += num_objects_to_move;
    }
  }

Deallocation
  if length > max_length {
    // Don't try to release num_objects_to_move if we don't have that many.
    release min(max_length, num_objects_to_move) objects to central list
    if max_length < num_objects_to_move {
      // Slow-start up to num_objects_to_move.
      max_length++;
    } else if max_length > num_objects_to_move {
      // If we consistently go over max_length, shrink max_length.
      overages++;
      if overages > kMaxOverages {
        max_length -= num_objects_to_move;
        overages = 0;
      }
    }
  }
</pre>

<p>See also the section on <a href="#Garbage_Collection">Garbage
Collection</a> for how it affects <code>max_length</code>.</p>

<h2><A NAME="Medium_Object_Allocation">Medium Object Allocation</A></h2>

<p>A medium object size (256K &lt; size &lt; 1MB) is rounded up to a page
size (8K) and is handled by the central page heap.  The central page heap
includes an array of 128 free lists.  The <code>k</code>th entry is a
free list of runs that consist of <code>k + 1</code> pages:</p>
<center><img src="pageheap.gif"></center>

<p>An allocation for <code>k</code> pages is satisfied by looking in
the <code>k</code>th free list.  If that free list is empty, we look
in the next free list, and so forth.  If no medium-object free list
can satisfy the allocation, the allocation is treated as a large
object.</p>
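
<p>A self-contained model of this lookup is sketched below
(<code>Run</code> stands in for TCMalloc's span type; the names are
illustrative).  Note how the remainder of an oversized run is
reinserted into the appropriate shorter list:</p>
<pre>
#include &lt;cstddef&gt;
#include &lt;list&gt;

struct Run { size_t start_page, num_pages; };

const size_t kMaxPages = 128;
std::list&lt;Run&gt; free_lists[kMaxPages];  // free_lists[k]: runs of k+1 pages

bool AllocatePages(size_t num_pages, Run* out) {
  for (size_t k = num_pages - 1; k < kMaxPages; ++k) {
    if (!free_lists[k].empty()) {
      Run run = free_lists[k].front();
      free_lists[k].pop_front();
      if (run.num_pages > num_pages) {   // split off and reinsert remainder
        Run rest = { run.start_page + num_pages, run.num_pages - num_pages };
        free_lists[rest.num_pages - 1].push_back(rest);
        run.num_pages = num_pages;
      }
      *out = run;
      return true;
    }
  }
  return false;   // fall through to the large-object path
}
</pre>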


<h2><A NAME="Large_Object_Allocation">Large Object Allocation</A></h2>

<p>Allocations of 1MB or more are considered large allocations.  Spans
of free memory which can satisfy these allocations are tracked in
a red-black tree sorted by size.  Allocations follow the <em>best-fit</em>
algorithm: the tree is searched to find the smallest span of free
space which is at least as large as the requested allocation.  The
allocation
is carved out of that span, and the remaining space is reinserted
either into the large object tree or possibly into one of the smaller
free-lists as appropriate.</p>

<p>If no span of free memory is located that can fit the requested
allocation, we fetch memory from the system (using <code>sbrk</code>,
<code>mmap</code>, or by mapping in portions of
<code>/dev/mem</code>).</p>
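
<p>The search can be modeled with a sorted standard container, as in
the sketch below.  <code>std::multimap</code> here stands in for the
red-black tree TCMalloc maintains itself (most implementations of
<code>std::multimap</code> are in fact red-black trees); the names are
illustrative:</p>
<pre>
#include &lt;cstddef&gt;
#include &lt;map&gt;
#include &lt;utility&gt;

struct Run { size_t start_page, num_pages; };

std::multimap&lt;size_t, Run&gt; large_free;  // keyed by span length in pages

bool AllocateLarge(size_t num_pages, Run* out) {
  // Best fit: the smallest free span that is at least num_pages long.
  std::multimap&lt;size_t, Run&gt;::iterator it = large_free.lower_bound(num_pages);
  if (it == large_free.end()) return false;   // must grow the heap instead
  Run run = it->second;
  large_free.erase(it);
  if (run.num_pages > num_pages) {            // reinsert the remainder
    Run rest = { run.start_page + num_pages, run.num_pages - num_pages };
    large_free.insert(std::make_pair(rest.num_pages, rest));
  }
  run.num_pages = num_pages;
  *out = run;
  return true;
}
</pre>
<p>(In TCMalloc the remainder may instead go onto one of the smaller
free lists, as described above.)</p>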

<p>If an allocation for <code>k</code> pages is satisfied by a run
of pages of length > <code>k</code>, the remainder of the
run is re-inserted into the appropriate free list in the
page heap.</p>


<h2><A NAME="Spans">Spans</A></h2>

<p>The heap managed by TCMalloc consists of a set of pages.  A run of
contiguous pages is represented by a <code>Span</code> object.  A span
can either be <em>allocated</em> or <em>free</em>.  If free, the span
is one of the entries in a page heap linked-list.  If allocated, it is
either a large object that has been handed off to the application, or
a run of pages that have been split up into a sequence of small
objects.  If split into small objects, the size-class of the objects
is recorded in the span.</p>
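
<p>A hypothetical rendering of this bookkeeping is shown below; the
field names are illustrative, not TCMalloc's actual definition:</p>
<pre>
#include &lt;cstddef&gt;

typedef unsigned long PageID;   // page number: address >> 13 for 8K pages

struct Span {
  PageID first_page;   // first page covered by this span
  size_t num_pages;    // length of the run in pages
  Span*  next;         // links used while on a page heap free list
  Span*  prev;
  bool   free;         // free vs. handed out
  int    size_class;   // set only when split into small objects
};
</pre>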

<p>A central array indexed by page number can be used to find the span to
which a page belongs.  For example, span <em>a</em> below occupies 2
pages, span <em>b</em> occupies 1 page, span <em>c</em> occupies 5
pages and span <em>d</em> occupies 3 pages.</p>
<center><img src="spanmap.gif"></center>

<p>In a 32-bit address space, the central array is represented by a
2-level radix tree where the root contains 32 entries and each leaf
contains 2^14 entries (a 32-bit address space has 2^19 8K pages, and
the first level of tree divides the 2^19 pages by 2^5).  This leads to
a starting memory usage of 64KB of space (2^14*4 bytes) for the
central array, which seems acceptable.</p>
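
<p>The sketch below models this 2-level lookup for a 32-bit address
space (5 root bits, 14 leaf bits); in practice leaves would be
allocated on demand, so the starting footprint is one 2^14-entry
leaf plus the 32-entry root:</p>
<pre>
struct Span;   // as above

struct PageMap2 {
  static const int kLeafBits = 14;                  // 2^14 entries per leaf
  struct Leaf { Span* spans[1 &lt;&lt; kLeafBits]; };
  Leaf* root[1 &lt;&lt; 5];                               // 32 root entries

  Span* Get(unsigned long page_id) const {
    unsigned long i = page_id >> kLeafBits;              // top 5 bits
    unsigned long j = page_id & ((1 &lt;&lt; kLeafBits) - 1);  // low 14 bits
    Leaf* leaf = root[i];
    return (leaf == nullptr) ? nullptr : leaf->spans[j];
  }
};
</pre>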

<p>On 64-bit machines, we use a 3-level radix tree.</p>


<h2><A NAME="Deallocation">Deallocation</A></h2>

<p>When an object is deallocated, we compute its page number and look
it up in the central array to find the corresponding span object.  The
span tells us whether or not the object is small, and its size-class
if it is small.  If the object is small, we insert it into the
appropriate free list in the current thread's thread cache.  If the
thread cache now exceeds a predetermined size (2MB by default), we run
a garbage collector that moves unused objects from the thread cache
into central free lists.</p>

<p>If the object is large, the span tells us the range of pages covered
by the object.  Suppose this range is <code>[p,q]</code>.  We also
look up the spans for pages <code>p-1</code> and <code>q+1</code>.  If
either of these neighboring spans is free, we coalesce it with the
<code>[p,q]</code> span.  The resulting span is inserted into the
appropriate free list in the page heap.</p>
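
<p>A sketch of this coalescing step follows.  The helpers are declared
but not defined here; <code>GetSpan</code> stands for the central-array
lookup and the other names are illustrative:</p>
<pre>
struct Span;                          // as above
Span* GetSpan(unsigned long page);    // central-array lookup (assumed)
Span* Merge(Span* a, Span* b);        // combine two adjacent spans (assumed)
void  InsertIntoPageHeap(Span* s);    // pick a free list by length (assumed)
unsigned long FirstPage(Span* s);     // accessors (assumed)
unsigned long NumPages(Span* s);
bool IsFree(Span* s);

void DeleteLargeObject(Span* span) {
  Span* before = GetSpan(FirstPage(span) - 1);
  if (before != nullptr && IsFree(before)) span = Merge(before, span);
  Span* after = GetSpan(FirstPage(span) + NumPages(span));
  if (after != nullptr && IsFree(after)) span = Merge(span, after);
  InsertIntoPageHeap(span);           // now marked free in the page heap
}
</pre>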


<h2>Central Free Lists for Small Objects</h2>

<p>As mentioned before, we keep a central free list for each
size-class.  Each central free list is organized as a two-level data
structure: a set of spans, and a linked list of free objects per
span.</p>

<p>An object is allocated from a central free list by removing the
first entry from the linked list of some span.  (If all spans have
empty linked lists, a suitably sized span is first allocated from the
central page heap.)</p>

<p>An object is returned to a central free list by adding it to the
linked list of its containing span.  If the linked list length now
equals the total number of small objects in the span, this span is now
completely free and is returned to the page heap.</p>
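
<p>The two-level structure can be modeled as below; the names are
illustrative, and locking is omitted (a real central list is shared by
all threads and guarded by a lock):</p>
<pre>
#include &lt;cstddef&gt;
#include &lt;vector&gt;

struct FreeObject { FreeObject* next; };

struct SpanEntry {                 // one span owned by this central list
  FreeObject* objects;             // free objects carved from the span
  size_t free_count, total_count;  // span is fully free when these match
};

struct CentralFreeList {
  std::vector&lt;SpanEntry*&gt; spans;

  FreeObject* Remove() {           // allocate: pop from some span's list
    for (size_t i = 0; i < spans.size(); ++i) {
      SpanEntry* s = spans[i];
      if (s->objects != nullptr) {
        FreeObject* obj = s->objects;
        s->objects = obj->next;
        s->free_count--;
        return obj;
      }
    }
    return nullptr;  // grab a fresh span from the page heap and split it
  }

  void Insert(SpanEntry* s, FreeObject* obj) {  // deallocate into its span
    obj->next = s->objects;
    s->objects = obj;
    if (++s->free_count == s->total_count) {
      // Span is completely free: return it to the page heap.
    }
  }
};
</pre>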


<h2><A NAME="Garbage_Collection">Garbage Collection of Thread Caches</A></h2>

<p>Garbage collecting objects from a thread cache keeps the size of
the cache under control and returns unused objects to the central free
lists.  Some threads need large caches to perform well while others
can get by with little or no cache at all.  When a thread cache goes
over its <code>max_size</code>, garbage collection kicks in and then the
thread competes with the other threads for a larger cache.</p>

<p>Garbage collection is run only during a deallocation.  We walk over
all free lists in the cache and move some number of objects from the
free list to the corresponding central list.</p>

<p>The number of objects to be moved from a free list is determined
using a per-list low-water-mark <code>L</code>.  <code>L</code>
records the minimum length of the list since the last garbage
collection.  Note that we could have shortened the list by
<code>L</code> objects at the last garbage collection without
requiring any extra accesses to the central list.  We use this past
history as a predictor of future accesses and move <code>L/2</code>
objects from the thread cache free list to the corresponding central
free list.  This algorithm has the nice property that if a thread
stops using a particular size, all objects of that size will quickly
move from the thread cache to the central free list where they can be
used by other threads.</p>
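
<p>A self-contained model of the low-water-mark rule is sketched
below; the field names are illustrative:</p>
<pre>
const int kNumClasses = 88;

struct ThreadCacheModel {
  int length[kNumClasses];      // current free-list length per size-class
  int low_water[kNumClasses];   // minimum length seen since the last GC
};

void GarbageCollect(ThreadCacheModel* tc) {
  for (int c = 0; c < kNumClasses; ++c) {
    // low_water[c] objects sat idle for the whole interval, so
    // releasing half of them is a conservative bet on future use.
    int n = tc->low_water[c] / 2;
    tc->length[c] -= n;                 // move n objects to the central list
    tc->low_water[c] = tc->length[c];   // restart tracking
  }
}
</pre>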

<p>If a thread consistently deallocates more objects of a certain size
than it allocates, this <code>L/2</code> behavior will cause at least
<code>L/2</code> objects to always sit in the free list.  To avoid
wasting memory this way, we shrink the maximum length of the freelist
to converge on <code>num_objects_to_move</code> (see also
<a href="#Sizing_Thread_Cache_Free_Lists">Sizing Thread Cache Free Lists</a>).</p>

<pre>
Garbage Collection
  if (L != 0 && max_length > num_objects_to_move) {
    max_length = max(max_length - num_objects_to_move, num_objects_to_move)
  }
</pre>

<p>The fact that the thread cache went over its <code>max_size</code> is
an indication that the thread would benefit from a larger cache.  Simply
increasing <code>max_size</code> would use an inordinate amount of memory
in programs that have lots of active threads.  Developers can bound the
memory used with the flag <code>--tcmalloc_max_total_thread_cache_bytes</code>.</p>

<p>Each thread cache starts with a small <code>max_size</code>
(e.g. 64KB) so that idle threads won't pre-allocate memory they don't
need.  Each time the cache runs a garbage collection, it will also try
to grow its <code>max_size</code>.  If the sum of the thread cache
sizes is less than <code>--tcmalloc_max_total_thread_cache_bytes</code>,
<code>max_size</code> grows easily.  If not, thread cache 1 will try
to steal from thread cache 2 (picked round-robin) by decreasing thread
cache 2's <code>max_size</code>.  In this way, threads that are more
active will steal memory from other threads more often than they have
memory stolen from them.  Mostly idle threads end up with
small caches and active threads end up with big caches.  Note that
this stealing can cause the sum of the thread cache sizes to be
greater than <code>--tcmalloc_max_total_thread_cache_bytes</code> until thread
cache 2 deallocates some memory to trigger a garbage collection.</p>

<h2><A NAME="performance">Performance Notes</A></h2>

<h3>PTMalloc2 unittest</h3>

<p>The PTMalloc2 package (now part of glibc) contains a unittest
program <code>t-test1.c</code>. This forks a number of threads and
performs a series of allocations and deallocations in each thread; the
threads do not communicate other than by synchronization in the memory
allocator.</p>

<p><code>t-test1</code> (included in
<code>tests/tcmalloc/</code>, and compiled as
<code>ptmalloc_unittest1</code>) was run with varying numbers of
threads (1-20) and maximum allocation sizes (64 bytes -
32Kbytes). These tests were run on a 2.4GHz dual Xeon system with
hyper-threading enabled, using Linux glibc-2.3.2 from RedHat 9, with
one million operations per thread in each test. In each case, the test
was run once normally, and once with
<code>LD_PRELOAD=libtcmalloc.so</code>.</p>

<p>The graphs below show the performance of TCMalloc vs PTMalloc2 for
several different metrics. Firstly, total operations (millions) per
elapsed second vs max allocation size, for varying numbers of
threads. The raw data used to generate these graphs (the output of the
<code>time</code> utility) is available in
<code>t-test1.times.txt</code>.</p>

<table>
<tr>
  <td><img src="tcmalloc-opspersec.vs.size.1.threads.png"></td>
  <td><img src="tcmalloc-opspersec.vs.size.2.threads.png"></td>
  <td><img src="tcmalloc-opspersec.vs.size.3.threads.png"></td>
</tr>
<tr>
  <td><img src="tcmalloc-opspersec.vs.size.4.threads.png"></td>
  <td><img src="tcmalloc-opspersec.vs.size.5.threads.png"></td>
  <td><img src="tcmalloc-opspersec.vs.size.8.threads.png"></td>
</tr>
<tr>
  <td><img src="tcmalloc-opspersec.vs.size.12.threads.png"></td>
  <td><img src="tcmalloc-opspersec.vs.size.16.threads.png"></td>
  <td><img src="tcmalloc-opspersec.vs.size.20.threads.png"></td>
</tr>
</table>


<ul>
  <li> TCMalloc is much more consistently scalable than PTMalloc2 - for
       all thread counts >1 it achieves ~7-9 million ops/sec for small
       allocations, falling to ~2 million ops/sec for larger
       allocations. The single-thread case is an obvious outlier,
       since it is only able to keep a single processor busy and hence
       can achieve fewer ops/sec. PTMalloc2 has a much higher variance
       on operations/sec - peaking somewhere around 4 million ops/sec
       for small allocations and falling to &lt;1 million ops/sec for
       larger allocations.

  <li> TCMalloc is faster than PTMalloc2 in the vast majority of
       cases, and particularly for small allocations. Contention
       between threads is less of a problem in TCMalloc.

  <li> TCMalloc's performance drops off as the allocation size
       increases. This is because the per-thread cache is
       garbage-collected when it hits a threshold (defaulting to
       2MB). With larger allocation sizes, fewer objects can be stored
       in the cache before it is garbage-collected.

  <li> There is a noticeable drop in TCMalloc's performance at ~32K
       maximum allocation size; at larger sizes performance drops less
       quickly. This is due to the 32K maximum size of objects in the
       per-thread caches; for objects larger than this TCMalloc
       allocates from the central page heap.
</ul>

<p>Next, operations (millions) per second of CPU time vs number of
threads, for max allocation size 64 bytes - 128 Kbytes.</p>

<table>
<tr>
  <td><img src="tcmalloc-opspercpusec.vs.threads.64.bytes.png"></td>
  <td><img src="tcmalloc-opspercpusec.vs.threads.256.bytes.png"></td>
  <td><img src="tcmalloc-opspercpusec.vs.threads.1024.bytes.png"></td>
</tr>
<tr>
  <td><img src="tcmalloc-opspercpusec.vs.threads.4096.bytes.png"></td>
  <td><img src="tcmalloc-opspercpusec.vs.threads.8192.bytes.png"></td>
  <td><img src="tcmalloc-opspercpusec.vs.threads.16384.bytes.png"></td>
</tr>
<tr>
  <td><img src="tcmalloc-opspercpusec.vs.threads.32768.bytes.png"></td>
  <td><img src="tcmalloc-opspercpusec.vs.threads.65536.bytes.png"></td>
  <td><img src="tcmalloc-opspercpusec.vs.threads.131072.bytes.png"></td>
</tr>
</table>

<p>Here we see again that TCMalloc is both more consistent and more
efficient than PTMalloc2. For max allocation sizes &lt;32K, TCMalloc
typically achieves ~2-2.5 million ops per second of CPU time with a
large number of threads, whereas PTMalloc generally achieves 0.5-1
million ops per second of CPU time, with a lot of cases achieving much
less than this figure. Above 32K max allocation size, TCMalloc drops
to 1-1.5 million ops per second of CPU time, and PTMalloc drops almost
to zero for large numbers of threads (i.e. with PTMalloc, lots of CPU
time is being burned spinning waiting for locks in the heavily
multi-threaded case).</p>


<H2><A NAME="runtime">Modifying Runtime Behavior</A></H2>

<p>You can more finely control the behavior of TCMalloc via
environment variables.</p>

<p>Generally useful flags:</p>

<table frame=box rules=sides cellpadding=5 width=100%>

<tr valign=top>
  <td><code>TCMALLOC_SAMPLE_PARAMETER</code></td>
  <td>default: 0</td>
  <td>
    The approximate gap between sampling actions.  That is, we
    take one sample approximately once every
    <code>TCMALLOC_SAMPLE_PARAMETER</code> bytes of allocation.
    This sampled heap information is available via
    <code>MallocExtension::GetHeapSample()</code> or
    <code>MallocExtension::ReadStackTraces()</code>.  A reasonable
    value is 524288.
  </td>
</tr>

<tr valign=top>
  <td><code>TCMALLOC_RELEASE_RATE</code></td>
  <td>default: 1.0</td>
  <td>
    Rate at which we release unused memory to the system, via
    <code>madvise(MADV_DONTNEED)</code>, on systems that support
    it.  Zero means we never release memory back to the system.
    Increase this flag to return memory faster; decrease it
    to return memory slower.  Reasonable rates are in the
    range [0,10].
  </td>
</tr>

<tr valign=top>
  <td><code>TCMALLOC_LARGE_ALLOC_REPORT_THRESHOLD</code></td>
  <td>default: 1073741824</td>
  <td>
    Allocations larger than this value cause a stack trace to be
    dumped to stderr.  The threshold for dumping stack traces is
    increased by a factor of 1.125 every time we print a message so
    that the threshold automatically goes up by a factor of ~1000
    every 60 messages.  This bounds the amount of extra logging
    generated by this flag.  The default value of this flag is very
    large, so you should see no extra logging unless the flag is
    overridden.
  </td>
</tr>

<tr valign=top>
  <td><code>TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES</code></td>
  <td>default: 16777216</td>
  <td>
    Bound on the total number of bytes allocated to thread caches.  This
    bound is not strict, so it is possible for the cache to go over this
    bound in certain circumstances.  This value defaults to 16MB.  For
    applications with many threads, this may not be a large enough cache,
    which can affect performance.  If you suspect your application is not
    scaling to many threads due to lock contention in TCMalloc, you can
    try increasing this value.  This may improve performance, at a cost
    of extra memory use by TCMalloc.  See <a href="#Garbage_Collection">
    Garbage Collection</a> for more details.
  </td>
</tr>

</table>

<p>Advanced "tweaking" flags that control more precisely how TCMalloc
tries to allocate memory from the kernel:</p>

<table frame=box rules=sides cellpadding=5 width=100%>

<tr valign=top>
  <td><code>TCMALLOC_SKIP_MMAP</code></td>
  <td>default: false</td>
  <td>
     If true, do not try to use <code>mmap</code> to obtain memory
     from the kernel.
  </td>
</tr>

<tr valign=top>
  <td><code>TCMALLOC_SKIP_SBRK</code></td>
  <td>default: false</td>
  <td>
     If true, do not try to use <code>sbrk</code> to obtain memory
     from the kernel.
  </td>
</tr>

<tr valign=top>
  <td><code>TCMALLOC_DEVMEM_START</code></td>
  <td>default: 0</td>
  <td>
    Physical memory starting location in MB for <code>/dev/mem</code>
    allocation.  Setting this to 0 disables <code>/dev/mem</code>
    allocation.
  </td>
</tr>

<tr valign=top>
  <td><code>TCMALLOC_DEVMEM_LIMIT</code></td>
  <td>default: 0</td>
  <td>
     Physical memory limit location in MB for <code>/dev/mem</code>
     allocation.  Setting this to 0 means no limit.
  </td>
</tr>

<tr valign=top>
  <td><code>TCMALLOC_DEVMEM_DEVICE</code></td>
  <td>default: /dev/mem</td>
  <td>
     Device to use for allocating unmanaged memory.
  </td>
</tr>

<tr valign=top>
  <td><code>TCMALLOC_MEMFS_MALLOC_PATH</code></td>
  <td>default: ""</td>
  <td>
     If set, specify a path where hugetlbfs or tmpfs is mounted.
     This may allow for speedier allocations.
  </td>
</tr>

<tr valign=top>
  <td><code>TCMALLOC_MEMFS_LIMIT_MB</code></td>
  <td>default: 0</td>
  <td>
     Limit total memfs allocation size to specified number of MB.
     0 means "no limit".
  </td>
</tr>

<tr valign=top>
  <td><code>TCMALLOC_MEMFS_ABORT_ON_FAIL</code></td>
  <td>default: false</td>
  <td>
     If true, abort() whenever memfs_malloc fails to satisfy an allocation.
  </td>
</tr>

<tr valign=top>
  <td><code>TCMALLOC_MEMFS_IGNORE_MMAP_FAIL</code></td>
  <td>default: false</td>
  <td>
     If true, ignore failures from mmap.
  </td>
</tr>

<tr valign=top>
  <td><code>TCMALLOC_MEMFS_MAP_PRIVATE</code></td>
  <td>default: false</td>
  <td>
     If true, use MAP_PRIVATE when mapping via memfs, not MAP_SHARED.
  </td>
</tr>

</table>


<H2><A NAME="compiletime">Modifying Behavior In Code</A></H2>

<p>The <code>MallocExtension</code> class, in
<code>malloc_extension.h</code>, provides a few knobs that you can
tweak in your program to affect tcmalloc's behavior.</p>

<h3>Releasing Memory Back to the System</h3>

<p>By default, tcmalloc will release no-longer-used memory back to the
kernel gradually, over time.  The <a
href="#runtime">tcmalloc_release_rate</a> flag controls how quickly
this happens.  You can also force a release at a given point in the
program execution like so:</p>
<pre>
   MallocExtension::instance()->ReleaseFreeMemory();
</pre>

<p>You can also call <code>SetMemoryReleaseRate()</code> to change the
<code>tcmalloc_release_rate</code> value at runtime, or
<code>GetMemoryReleaseRate</code> to see what the current release rate
is.</p>
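<pre>
   MallocExtension::instance()->SetMemoryReleaseRate(5.0);  // release faster
   double rate = MallocExtension::instance()->GetMemoryReleaseRate();
</pre>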

<h3>Memory Introspection</h3>

<p>There are several routines for getting a human-readable form of the
current memory usage:</p>
<pre>
   MallocExtension::instance()->GetStats(buffer, buffer_length);
   MallocExtension::instance()->GetHeapSample(&string);
   MallocExtension::instance()->GetHeapGrowthStacks(&string);
</pre>

<p>The last two create files in the same format as the heap profiler,
and can be passed as data files to pprof.  The first is human-readable
and is meant for debugging.</p>

<h3>Generic TCMalloc Status</h3>

<p>TCMalloc has support for setting and retrieving arbitrary
'properties':</p>
<pre>
   MallocExtension::instance()->SetNumericProperty(property_name, value);
   MallocExtension::instance()->GetNumericProperty(property_name, &value);
</pre>

<p>It is possible for an application to set and get these properties,
but the most useful case is when a library sets the properties so the
application can read them.  Here are the properties TCMalloc defines;
you can access them with a call like
<code>MallocExtension::instance()->GetNumericProperty("generic.heap_size",
&value);</code>:</p>

<table frame=box rules=sides cellpadding=5 width=100%>

<tr valign=top>
  <td><code>generic.current_allocated_bytes</code></td>
  <td>
    Number of bytes used by the application.  This will not typically
    match the memory use reported by the OS, because it does not
    include TCMalloc overhead or memory fragmentation.
  </td>
</tr>

<tr valign=top>
  <td><code>generic.heap_size</code></td>
  <td>
    Bytes of system memory reserved by TCMalloc.
  </td>
</tr>

<tr valign=top>
  <td><code>tcmalloc.pageheap_free_bytes</code></td>
  <td>
    Number of bytes in free, mapped pages in page heap.  These bytes
    can be used to fulfill allocation requests.  They always count
    towards virtual memory usage, and unless the underlying memory is
    swapped out by the OS, they also count towards physical memory
    usage.
  </td>
</tr>

<tr valign=top>
  <td><code>tcmalloc.pageheap_unmapped_bytes</code></td>
  <td>
    Number of bytes in free, unmapped pages in page heap.  These are
    bytes that have been released back to the OS, possibly by one of
    the MallocExtension "Release" calls.  They can be used to fulfill
    allocation requests, but typically incur a page fault.  They
    always count towards virtual memory usage, and depending on the
    OS, typically do not count towards physical memory usage.
  </td>
</tr>

<tr valign=top>
  <td><code>tcmalloc.slack_bytes</code></td>
  <td>
    Sum of pageheap_free_bytes and pageheap_unmapped_bytes.  Provided
    for backwards compatibility only.  Do not use.
  </td>
</tr>

<tr valign=top>
  <td><code>tcmalloc.max_total_thread_cache_bytes</code></td>
  <td>
    A limit to how much memory TCMalloc dedicates for small objects.
    Higher numbers trade off more memory use for -- in some situations
    -- improved efficiency.
  </td>
</tr>

<tr valign=top>
  <td><code>tcmalloc.current_total_thread_cache_bytes</code></td>
  <td>
    A measure of some of the memory TCMalloc is using (for
    small objects).
  </td>
</tr>

</table>

<h2><A NAME="caveats">Caveats</A></h2>

<p>For some systems, TCMalloc may not work correctly with
applications that aren't linked against <code>libpthread.so</code> (or
the equivalent on your OS). It should work on Linux using glibc 2.3,
but other OS/libc combinations have not been tested.</p>

<p>TCMalloc may be somewhat more memory hungry than other mallocs
(though it tends not to have the huge blowups that can happen with
other mallocs).  In particular, at startup TCMalloc allocates
approximately 240KB of internal memory.</p>

<p>Don't try to load TCMalloc into a running binary (e.g., using JNI
in Java programs).  The binary will have allocated some objects using
the system malloc, and may try to pass them to TCMalloc for
deallocation.  TCMalloc will not be able to handle such objects.</p>

<hr>

<address>Sanjay Ghemawat, Paul Menage<br>
<!-- Created: Tue Dec 19 10:43:14 PST 2000 -->
<!-- hhmts start -->
Last modified: Sat Feb 24 13:11:38 PST 2007  (csilvers)
<!-- hhmts end -->
</address>

</body>
</html>