534 lines
24 KiB
HTML
534 lines
24 KiB
HTML
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
|
|
<html> <head> <title>A Memory Allocator</title> </head>
|
|
<body bgcolor="#ffffee" vlink="#0000aa" link="#cc0000">
|
|
|
|
<h1>A Memory Allocator</h1>
|
|
|
|
<p> by <a href="http://g.oswego.edu">Doug Lea</a>
|
|
<p>
|
|
[A German adaptation and translation of this article appears
|
|
in <b>unix/mail</b> December, 1996.]
|
|
|
|
<h2>Introduction</h2>
|
|
<p>
|
|
Memory allocators form interesting case studies in the engineering
|
|
of infrastructure software. I started writing one in 1987, and
|
|
have maintained and evolved it (with the help of many volunteer
|
|
contributors) ever since. This allocator provides implementations
|
|
of the the standard C routines <code>malloc()</code>,
|
|
<code>free()</code>, and <code>realloc()</code>, as well as a few
|
|
auxiliary utility routines. The allocator has never been given a
|
|
specific name. Most people just call it <em>Doug Lea's
|
|
Malloc</em>, or <em>dlmalloc</em> for short.
|
|
|
|
<p>
|
|
The code for this allocator
|
|
has been placed in the public domain (available from
|
|
<a href="ftp://g.oswego.edu/pub/misc/malloc.c">
|
|
ftp://g.oswego.edu/pub/misc/malloc.c</a>), and is apparently
|
|
widely used: It serves as the default native version of malloc in
|
|
some versions of Linux; it is compiled into several commonly
|
|
available software packages (overriding the native malloc), and
|
|
has been used in various PC environments as well as in embedded
|
|
systems, and surely many other places I don't even know about.
|
|
|
|
<p>
|
|
I wrote the first version of the allocator after writing some C++
|
|
programs that almost exclusively relied on allocating dynamic
|
|
memory. I found that they ran much more slowly and/or with much
|
|
more total memory consumption than I expected them to. This was
|
|
due to characteristics of the memory allocators on the systems I
|
|
was running on (mainly the then-current versions of SunOs and BSD
|
|
). To counter this, at first I wrote a number of special-purpose
|
|
allocators in C++, normally by overloading <code>operator
|
|
new</code> for various classes. Some of these are described in a
|
|
paper on C++ allocation techniques that was adapted into the 1989
|
|
<em>C++ Report</em> article <a
|
|
href="ftp://g.oswego.edu/pub/papers/C++Report89.txt"> <em>Some
|
|
storage allocation techniques for container classes</em></a>.
|
|
|
|
<p>
|
|
However, I soon realized that building a special allocator for
|
|
each new class that tended to be dynamically allocated and heavily
|
|
used was not a good strategy when building kinds of
|
|
general-purpose programming support classes I was writing at the
|
|
time. (From 1986 to 1991, I was the the primary author of <A
|
|
HREF="http://g.oswego.edu/dl/libg++paper/libg++/libg++.html">
|
|
libg++ </A>, the GNU C++ library.) A broader solution was needed --
|
|
to write an allocator that was good enough under normal C++ and C
|
|
loads so that programmers would not be tempted to write
|
|
special-purpose allocators except under very special conditions.
|
|
<p>
|
|
This article presents a description of some of the main design
|
|
goals, algorithms, and implementation considerations for this
|
|
allocator. More detailed documentation can be found with the code
|
|
distribution.
|
|
|
|
|
|
<h2>Goals</h2>
|
|
|
|
A good memory allocator needs to balance a number of goals:
|
|
|
|
<dl>
|
|
<dt>Maximizing Compatibility
|
|
<dd>An allocator should be plug-compatible with others; in particular
|
|
it should obey ANSI/POSIX conventions.
|
|
|
|
<dt> Maximizing Portability
|
|
<dd> Reliance on as few system-dependent features (such as system calls)
|
|
as possible, while still providing optional support for other useful
|
|
features found only one some systems; conformance
|
|
to all known system constraints on alignment and addressing rules.
|
|
|
|
<dt> Minimizing Space
|
|
<dd> The allocator should not waste space: It should obtain as little
|
|
memory from the system as possible, and should maintain memory in ways
|
|
that minimize <em>fragmentation</em> -- ``holes''in contiguous chunks
|
|
of memory that are not used by the program.
|
|
|
|
<dt> Minimizing Time
|
|
<dd> The <code>malloc()</code>, <code>free()</code> and <code>realloc</code>
|
|
routines should be as fast as possible in the average case.
|
|
|
|
<dt> Maximizing Tunability
|
|
<dd> Optional features and behavior should be controllable by users
|
|
either statically (via <code>#define</code> and the like) or
|
|
dynamically (via control commands such as <code>mallopt</code>).
|
|
|
|
<dt> Maximizing Locality
|
|
<dd> Allocating chunks of memory that are typically
|
|
used together near each other. This helps minimize page and cache misses
|
|
during program execution.
|
|
|
|
<dt> Maximizing Error Detection
|
|
<dd> It does not seem possible for a general-purpose allocator to
|
|
also serve as general-purpose memory error testing tool
|
|
such as <em>Purify</em>. However,
|
|
allocators should provide some means for detecting corruption due
|
|
to overwriting memory, multiple frees, and so on.
|
|
|
|
<dt>Minimizing Anomalies
|
|
<dd>An allocator configured using default settings should perform well
|
|
across a wide range of real loads that depend heavily on
|
|
dynamic allocation -- windowing toolkits, GUI applications, compilers,
|
|
interpretors, development tools, network (packet)-intensive programs,
|
|
graphics-intensive packages, web browsers,
|
|
string-processing applications, and so on.
|
|
</dl>
|
|
|
|
<p>
|
|
Paul Wilson and colleagues have written an excellent survey
|
|
paper on allocation techniques that discusses some of these goals
|
|
in more detail. See Paul R. Wilson, Mark S. Johnstone, Michael
|
|
Neely, and David Boles, ``Dynamic Storage Allocation: A Survey
|
|
and Critical Review'' in <em>International Workshop on Memory
|
|
Management</em>, September 1995 (also
|
|
available via <a href=
|
|
"ftp://ftp.cs.utexas.edu/pub/garbage/allocsrv.ps"> ftp</a>).
|
|
(Note that the version of my allocator they describe is
|
|
<em>not</em> the most current one however.)
|
|
<p>
|
|
As they discuss,
|
|
minimizing space by minimizing wastage (generally due to
|
|
fragmentation) must be the primary goal in any allocator.
|
|
|
|
<p>
|
|
For an extreme example, among the fastest possible versions of
|
|
<code>malloc()</code> is one that always allocates the next
|
|
sequential memory location available on the system, and the
|
|
corresponding fastest version of <code>free()</code> is a no-op.
|
|
However, such an implementation is hardly ever acceptable: it will
|
|
cause a program to run out of memory quickly since it never
|
|
reclaims unused space. Wastages seen in some allocators used in
|
|
practice can be almost this extreme under some loads. As Wilson
|
|
also notes, wastage can be measured monetarily: Considered
|
|
globally, poor allocation schemes cost people perhaps even
|
|
billions of dollars in memory chips.
|
|
|
|
<p>
|
|
While time-space issues dominate, the set of trade-offs and compromises
|
|
is nearly endless. Here are just a few of the many examples:
|
|
|
|
<ul>
|
|
<li> Accommodating worst-case alignment requirements increases
|
|
wastage by forcing the allocator to skip over bytes in order
|
|
to align chunks.
|
|
|
|
<li> Most provisions for dynamic tunability (such as setting
|
|
a <em>debug</em> mode) can seriously impact time efficiency
|
|
by adding levels of indirection and increasing numbers of branches.
|
|
|
|
<li> Some provisions designed to catch errors limit range of
|
|
applicability. For example, regardless of platform, the
|
|
current malloc internally handles allocation size arguments as if they
|
|
were signed 32-bit integers, and treats nonpositive arguments
|
|
as if they were requests for a size of zero. This is considered
|
|
by nearly all users as a feature rather than a bug: A negative
|
|
32-bit argument or a huge 64-bit argument is essentially always
|
|
a programming mistake. Returning a minimally-sized chunk will
|
|
help catch this error.
|
|
|
|
<li> Accommodating the oddities of other allocators to remain
|
|
plug-compatible with them can reduce flexibility and performance.
|
|
For the oddest example, some early versions of Unix allocators
|
|
allowed programmers to <code>realloc</code>
|
|
memory that had already been <code>freed</code>. Until 1993,
|
|
I allowed this for the sake of compatibility.
|
|
(However, no one at all complained when this ``feature'' was dropped.)
|
|
|
|
<li> Some (but by no means all) heuristics that improve time and/or
|
|
space for small programs cause unacceptably
|
|
worse time and/or space characteristics for larger programs that
|
|
dominate the load on typical systems these days.
|
|
|
|
</ul>
|
|
|
|
<p>
|
|
No set of compromises along these lines can be
|
|
perfect. However, over the years, the allocator has
|
|
evolved to make trade-offs that the majority of users find to
|
|
be acceptable. The driving forces that continue to impact the
|
|
evolution of this malloc include:
|
|
|
|
<ol>
|
|
<li> Empirical studies of malloc performance by others
|
|
(including the above-mentioned paper by Wilson et al, as well
|
|
as others that it in turn cites). These papers find that
|
|
versions of this malloc increasingly rank as simultaneously
|
|
among the most time- and space-efficient memory allocators
|
|
available. However, each reveals weaknesses or opportunities
|
|
for further improvements.
|
|
|
|
<li> Changes in target workloads. The nature of the kinds of
|
|
programs that are most sensitive to malloc implementations
|
|
continually change. For perhaps the primary example, the
|
|
memory characteristics of <em>X</em> and other windowing
|
|
systems increasingly dominate.
|
|
|
|
<li> Changes in systems and processors. Implementation details
|
|
and fine-tunings that try to make code readily optimizable for
|
|
typical processors change across time. Additionally, operating
|
|
systems (including Linux and Solaris) have themselves evolved,
|
|
for example to make memory mapping an occasionally-wise choice
|
|
for system-level allocation.
|
|
|
|
<li> Suggestions, experience reports, and code from users and
|
|
contributors. The code has evolved with the help of
|
|
several regular volunteer contributors.
|
|
The majority of recent changes were instigated
|
|
by people using the version supplied in Linux, and were
|
|
implemented in large part by Wolfram Gloger for the Linux
|
|
version and then integrated by me.
|
|
</ol>
|
|
|
|
|
|
<h2>Algorithms</h2>
|
|
|
|
|
|
The two core elements of the malloc algorithm have remained
|
|
unchanged since the earliest versions:
|
|
<p>
|
|
|
|
<dl>
|
|
<dt> Boundary Tags
|
|
<dd> Chunks of memory carry around with them size information
|
|
fields both before and after the chunk. This allows for
|
|
two important capabilities:
|
|
<ul>
|
|
|
|
<li> Two bordering unused chunks can be coalesced into
|
|
one larger chunk. This minimizes the number of unusable
|
|
small chunks.
|
|
|
|
<li> All chunks can be traversed starting from any known
|
|
chunk in either a forward or backward direction.
|
|
</ul>
|
|
<p>
|
|
<img src="malloc1.gif">
|
|
|
|
<p>
|
|
The original versions implemented boundary tags exactly in
|
|
this fashion. More recent versions omit trailer
|
|
fields on chunks that are in use by the program. This
|
|
is itself a minor trade-off: The fields are not ever used
|
|
while chunks are active so need not be present. Eliminating them decreases
|
|
overhead and wastage. However,
|
|
lack of these fields weakens error detection a bit by
|
|
making it impossible to check if users mistakenly overwrite
|
|
fields that should have known values.
|
|
|
|
<dt>Binning
|
|
<dd> Available chunks are maintained in bins, grouped by size.
|
|
There are a surprisingly large number (128) of fixed-width
|
|
bins, approximately logarithmically spaced in size. Bins for
|
|
sizes less than 512 bytes each hold only exactly one size
|
|
(spaced 8 bytes apart, simplifying enforcement of 8-byte alignment).
|
|
Searches for available chunks are processed in smallest-first,
|
|
<em>best-fit</em> order. As shown by Wilson et al, best-fit
|
|
schemes (of various kinds and approximations) tend to produce
|
|
the least fragmentation on real loads
|
|
compared to other general approaches such as first-fit.
|
|
<p>
|
|
<img src="malloc2.gif">
|
|
<p>
|
|
|
|
Until the versions released in 1995, chunks were left unsorted
|
|
within bins, so that the best-fit strategy was only approximate.
|
|
More recent versions instead sort chunks by size within bins, with
|
|
ties broken by an oldest-first rule. (This was done after finding that
|
|
the minor time investment was worth it to avoid observed bad cases.)
|
|
|
|
|
|
</dl>
|
|
<p>
|
|
|
|
Thus, the general categorization of this algorithm is
|
|
<em>best-first with coalescing</em>: Freed chunks are
|
|
coalesced with neighboring ones, and held in bins that are
|
|
searched in size order.
|
|
<p>
|
|
|
|
This approach leads to fixed
|
|
bookkeeping overhead per chunk. Because both size information
|
|
and bin links must be held in each available chunk, the
|
|
smallest allocatable chunk is 16 bytes in systems with 32-bit
|
|
pointers and 24 bytes in systems with 64-bit pointers. These
|
|
minimum sizes are larger than most people would like to see --
|
|
they can lead to significant wastage for example in
|
|
applications allocating many tiny linked-list nodes. However,
|
|
the 16 bytes minimum at least is characteristic of
|
|
<em>any</em> system requiring 8-byte alignment in which there
|
|
is <em>any</em> malloc bookkeeping overhead.
|
|
|
|
<p>
|
|
This basic algorithm can be made to be very fast. Even though
|
|
it rests upon a search mechanism to find best fits, the use
|
|
of indexing techniques, exploitation of special cases, and
|
|
careful coding lead to average cases requiring only a few
|
|
dozen instructions, depending of course on the machine and the
|
|
allocation pattern.
|
|
|
|
<p>
|
|
While coalescing via boundary tags and best-fit via binning
|
|
represent the main ideas of the algorithm, further
|
|
considerations lead to a number of heuristic
|
|
improvements. They include locality preservation, wilderness
|
|
preservation, memory mapping, and caching.
|
|
|
|
<h3>Locality preservation</h3>
|
|
|
|
Chunks allocated at about the same time by a program tend to have
|
|
similar reference patterns and coexistent lifetimes. Maintaining
|
|
locality minimizes page faults and cache misses, which can have
|
|
a dramatic effect on performance on modern processors.
|
|
If locality
|
|
were the <em>only</em> goal, an allocator might always allocate
|
|
each successive chunk as close to the previous one as possible.
|
|
However, this <em>nearest-fit</em> (often approximated by <em>next-fit</em>)
|
|
strategy can lead to very bad fragmentation. In the current
|
|
version of malloc, a version of next-fit is used only in a
|
|
restricted context that maintains locality in those cases where
|
|
it conflicts the least with other goals: If a chunk of the
|
|
exact desired size is not available, the most recently split-off
|
|
space is used (and resplit) if it is big enough; otherwise
|
|
best-fit is used. This restricted use eliminates cases where
|
|
a perfectly usable existing chunk fails to be allocated; thus
|
|
eliminating at least this form of fragmentation. And, because this form
|
|
of next-fit is faster than best-fit bin-search, it speeds up
|
|
the average <code>malloc</code>.
|
|
|
|
<h3>Wilderness Preservation</h3>
|
|
|
|
The ``wilderness'' (so named by Kiem-Phong Vo) chunk represents
|
|
the space bordering the topmost address allocated from the
|
|
system. Because it is at the border, it is the only chunk that
|
|
can be arbitrarily extended
|
|
(via <code>sbrk</code> in Unix) to be bigger than it is (unless
|
|
of course <code>sbrk</code> fails because all memory has been
|
|
exhausted).
|
|
<p>
|
|
|
|
One way to deal with the wilderness chunk is to
|
|
handle it about the same way as any other chunk. (This
|
|
technique was used in most versions of this malloc until 1994).
|
|
While this simplifies and speeds up implementation, without care
|
|
it can lead to some very bad worst-case space characteristics:
|
|
Among other problems, if the wilderness chunk is used when
|
|
another available chunk exists, you increase the chances that a
|
|
later request will cause an otherwise preventable
|
|
<code>sbrk</code>.
|
|
|
|
<p>
|
|
A better strategy is currently used: treat the wilderness
|
|
chunk as ``bigger'' than all others, since it can be made so
|
|
(up to system limitations) and use it as such in a best-first
|
|
scan. This results in the wilderness chunk always being used
|
|
only if no other chunk exists, further avoiding preventable
|
|
fragmentation.
|
|
|
|
<h3>Memory Mapping</h3>
|
|
|
|
<p>
|
|
In addition to extending general-purpose allocation regions
|
|
via <code>sbrk</code>, most versions of Unix support system
|
|
calls such as <code>mmap</code> that allocate a separate
|
|
non-contiguous region of memory for use by a program. This
|
|
provides a second option within <code>malloc</code> for
|
|
satisfying a memory request. Requesting and returning a
|
|
<code>mmap</code>ed chunk can further reduce downstream
|
|
fragmentation, since a released memory map does not create a
|
|
``hole'' that would need to be managed. However, because of
|
|
built-in limitations and overheads associated with
|
|
<code>mmap</code>, it is only worth doing this in very
|
|
restricted situations. For example, in all current systems,
|
|
mapped regions must be page-aligned. Also, invoking
|
|
<code>mmap</code> and <code>mfree</code> is much slower than
|
|
carving out an existing chunk of memory. For these reasons,
|
|
the current version of malloc relies on <code>mmap</code> only
|
|
if (1) the request is greater than a (dynamically adjustable)
|
|
threshold size (currently by default 1MB) and (2) the space
|
|
requested is not already available in the existing arena so
|
|
would have to be obtained via <code>sbrk</code>.
|
|
|
|
<p>
|
|
In part because <code>mmap</code> is not always applicable in most
|
|
programs, the current version of malloc also supports
|
|
<em>trimming</em> of the main arena, which achieves one of the effects
|
|
of memory mapping -- releasing unused space back to the system. When
|
|
long-lived programs contain brief peaks where they allocate large
|
|
amounts of memory, followed by longer valleys where the have more
|
|
modest requirements, system performance as a whole can be improved
|
|
by releasing unused parts of the <em>wilderness</em> chunk back to
|
|
the system. (In nearly all versions of Unix, <code>sbrk</code> can
|
|
be used with negative arguments to achieve this effect.) Releasing
|
|
space allows the underlying operating system to cut down on swap
|
|
space requirements and reuse memory mapping tables. However, as with
|
|
<code>mmap</code>, the call itself can be expensive, so is only attempted
|
|
if trailing unused memory exceeds a tunable threshold.
|
|
|
|
|
|
<h3>Caching</h3>
|
|
|
|
<p>
|
|
In the most straightforward version of the basic algorithm,
|
|
each freed chunk is immediately coalesced with neighbors to
|
|
form the largest possible unused chunk. Similarly, chunks
|
|
are created (by splitting larger chunks) only when
|
|
explicitly requested.
|
|
|
|
<p>
|
|
Operations to split and to coalesce chunks take time. This time
|
|
overhead can sometimes be avoided by using either of both of
|
|
two <em>caching</em> strategies:
|
|
|
|
<dl>
|
|
<dt> Deferred Coalescing
|
|
<dd> Rather than coalescing freed chunks, leave them at their
|
|
current sizes in hopes that another request for the same size
|
|
will come along soon. This saves a coalesce, a later split,
|
|
and the time it would take to find a non-exactly-matching chunk
|
|
to split.
|
|
|
|
<dt> Preallocation
|
|
<dd> Rather than splitting out new chunks one-by one, pre-split
|
|
many at once. This is normally faster than doing it one-at-a-time.
|
|
</dl>
|
|
|
|
Because the basic data structures in the allocator permit
|
|
coalescing at any time, in any of <code>malloc</code>,
|
|
<code>free</code>, or <code>realloc</code>, corresponding caching
|
|
heuristics are easy to apply.
|
|
|
|
<p>
|
|
The effectiveness of caching obviously depends on the costs of
|
|
splitting, coalescing, and searching relative to the work
|
|
needed to track cached chunks. Additionally, effectiveness
|
|
less obviously depends on the policy used in deciding when
|
|
to cache versus coalesce them. .
|
|
|
|
<p>
|
|
Caching can be a good idea in programs that continuously
|
|
allocate and release chunks of only a few sizes.
|
|
For example, if you write a program that
|
|
allocates and frees many tree nodes, you might decide that is
|
|
worth it to cache some nodes, assuming you know of a fast way
|
|
to do this. However, without knowledge of the program,
|
|
<code>malloc</code> cannot know whether it would be a good
|
|
idea to coalesce cached small chunks in order to satisfy a
|
|
larger request, or whether that larger request should be taken
|
|
from somewhere else. And it is difficult for the allocator to
|
|
make more informed guesses about this matter. For example, it
|
|
is just as costly for an allocator to determine how much total
|
|
contiguous space would be gained by coalescing chunks as it
|
|
would be to just coalesce them and then resplit them.
|
|
|
|
<p>
|
|
Previous versions of the allocator used a few
|
|
search-ordering heuristics that made adequate guesses about
|
|
caching, although with occasionally bad worst-case
|
|
results. But across time, these heuristics appear to be
|
|
decreasingly effective under real loads. This is probably because
|
|
actual programs that rely heavily on malloc increasingly tend
|
|
to use a larger variety of chunk sizes. For example, in C++
|
|
programs, this probably corresponds to a trend for programs to
|
|
use an increasing number of classes. Different classes tend to
|
|
have different sizes.
|
|
|
|
<p>
|
|
As a consequence, the current version <em>never</em> caches
|
|
chunks. It appears to be more effective to concentrate
|
|
efforts on further reducing the costs of handling non-cached
|
|
chunks than to rely on policies and heuristics that are of
|
|
decreasing utility. However, the issue is still open for further
|
|
experimentation.
|
|
|
|
|
|
<h3>Lookasides</h3>
|
|
|
|
<p>
|
|
There remains one kind of caching that is highly desirable in
|
|
some applications but not implemented in this allocator --
|
|
lookasides for very small chunks. As mentioned above, the
|
|
basic algorithm imposes a minimum chunk size that can be
|
|
very wasteful for very small requests. For example, a linked
|
|
list on a system with 4-byte pointers might allocate nodes
|
|
holding only, say, two pointers, requiring only 8 bytes.
|
|
Since the minimum chunk size is 16 bytes, user programs
|
|
allocating only list nodes suffer 100% overhead.
|
|
|
|
<p>
|
|
Eliminating this problem while still maintaining portable
|
|
alignment would require that the allocator not impose
|
|
<em>any</em> overhead. Techniques for carrying this out
|
|
exist. For example, chunks could be checked to see if they
|
|
belong to a larger aggregated space via address
|
|
comparisons. However, doing so can impose significant costs;
|
|
in fact the cost would be unacceptable in this allocator.
|
|
Chunks are not otherwise tracked by address, so unless
|
|
arbitrarily limited, checking might lead to random searches
|
|
through memory. Additionally, support requires the adoption of
|
|
one or more policies controlling whether and how to ever
|
|
coalesce small chunks.
|
|
|
|
<p>
|
|
Such issues and limitations lead to one of the very few kinds
|
|
of situations in which programmers should routinely write their
|
|
own special purpose memory management routines (by, for example
|
|
in C++ overloading <code>operator new()</code>). Programs relying
|
|
on large but approximately known numbers of very small chunks
|
|
may find it profitable to build very simple allocators. For
|
|
example, chunks can be allocated out of a fixed array with
|
|
an embedded freelist, along with a provision to rely on
|
|
<code>malloc</code> as a backup if the array becomes exhausted.
|
|
Somewhat more flexibly, these can be based on the C or C++
|
|
versions of <em>obstack</em> available with GNU gcc and libg++.
|
|
|
|
<hr>
|
|
<address><a href="mailto:dl@gee.cs.oswego.edu">Doug Lea</a></address>
|
|
<!-- Created: Fri Oct 25 19:07:46 EDT 1996 -->
|
|
<!-- hhmts start -->
|
|
Last modified: Wed Dec 4 12:20:31 EST
|
|
<!-- hhmts end -->
|
|
</body>
|
|
</html>
|