mirror of
https://github.com/ossrs/srs.git
synced 2025-03-09 15:49:59 +00:00
use libco instead of state-threads (st); still has some bugs
This commit is contained in:
parent
51d6c367f5
commit
7c8a35aea9
88 changed files with 4836 additions and 19273 deletions
BIN
trunk/3rdparty/st-srs/docs/fig.gif
vendored
Binary file not shown.
Before Width: | Height: | Size: 5.2 KiB
434
trunk/3rdparty/st-srs/docs/notes.html
vendored
@@ -1,434 +0,0 @@
<HTML>
<HEAD>
<TITLE>State Threads Library Programming Notes</TITLE>
</HEAD>
<BODY BGCOLOR=#FFFFFF>
<H2>Programming Notes</H2>
<P>
<B>
<UL>
<LI><A HREF=#porting>Porting</A></LI>
<LI><A HREF=#signals>Signals</A></LI>
<LI><A HREF=#intra>Intra-Process Synchronization</A></LI>
<LI><A HREF=#inter>Inter-Process Synchronization</A></LI>
<LI><A HREF=#nonnet>Non-Network I/O</A></LI>
<LI><A HREF=#timeouts>Timeouts</A></LI>
</UL>
</B>
<P>
<HR>
<P>
<A NAME="porting">
<H3>Porting</H3>
The State Threads library uses OS concepts that are available in some
form on most UNIX platforms, making the library very portable across
many flavors of UNIX. However, there are several parts of the library
that rely on platform-specific features. Here is the list of such parts:
<P>
<UL>
<LI><I>Thread context initialization</I>: Two ingredients of the
<TT>jmp_buf</TT>
data structure (the program counter and the stack pointer) have to be
manually set in the thread creation routine. The <TT>jmp_buf</TT> data
structure is defined in the <TT>setjmp.h</TT> header file and differs from
platform to platform. Usually the program counter is a structure member
with <TT>PC</TT> in the name and the stack pointer is a structure member
with <TT>SP</TT> in the name. One can also look in the
<A HREF="http://www.mozilla.org/source.html">Netscape's NSPR library source</A>,
which already has this code for many UNIX-like platforms
(<TT>mozilla/nsprpub/pr/include/md/*.h</TT> files).
<P>
Note that on some BSD-derived platforms the <TT>_setjmp(3)/_longjmp(3)</TT>
calls should be used instead of <TT>setjmp(3)/longjmp(3)</TT> (that is,
the calls that manipulate only the stack and registers and do <I>not</I>
save and restore the process's signal mask).
<P>
Starting with glibc 2.4 on Linux, the opacity of the <TT>jmp_buf</TT> data
structure is enforced by <TT>setjmp(3)/longjmp(3)</TT>, so the
<TT>jmp_buf</TT> ingredients cannot be accessed directly anymore (unless
the special environment variable <TT>LD_POINTER_GUARD</TT> is set before
application execution). To avoid a dependency on a custom environment, the
State Threads library provides <TT>setjmp/longjmp</TT> replacement functions
for all Intel CPU architectures. Other CPU architectures can also be easily
supported (the <TT>setjmp/longjmp</TT> source code is widely available for
many CPU architectures).</LI>
<P>
<LI><I>High resolution time function</I>: Some platforms (IRIX, Solaris)
provide a high resolution time function based on a free-running hardware
counter. This function returns the time counted since some arbitrary
moment in the past (usually machine power-up time). It is not correlated in
any way to the time of day, and thus is not subject to resetting,
drifting, etc. This type of time is ideal for tasks where cheap, accurate
interval timing is required. If such a function is not available on a
particular platform, the <TT>gettimeofday(3)</TT> function can be used
(though on some platforms it involves a system call).</LI>
<P>
<LI><I>The stack growth direction</I>: The library needs to know whether the
stack grows toward lower (down) or higher (up) memory addresses.
One can write a simple test program that detects the stack growth direction
on a particular platform.</LI>
<P>
<LI><I>Non-blocking attribute inheritance</I>: On some platforms (e.g., IRIX)
the socket created as a result of the <TT>accept(2)</TT> call inherits the
non-blocking attribute of the listening socket. One needs to consult the
manual pages or write a simple test program to see if this applies to a
specific platform.</LI>
<P>
<LI><I>Anonymous memory mapping</I>: The library allocates memory segments
for thread stacks by doing anonymous memory mapping (<TT>mmap(2)</TT>). This
mapping is somewhat different on SVR4 and BSD4.3 derived platforms.
<P>
The memory mapping can be avoided altogether by using <TT>malloc(3)</TT> for
stack allocation. In this case the <TT>MALLOC_STACK</TT> macro should be
defined.</LI>
</UL>
<P>
All machine-dependent feature test macros should be defined in the
<TT>md.h</TT> header file. The assembly code for the <TT>setjmp/longjmp</TT>
replacement functions for all CPU architectures should be placed in
the <TT>md.S</TT> file.
<P>
The current version of the library is ported to:
<UL>
<LI>IRIX 6.x (both 32 and 64 bit)</LI>
<LI>Linux (kernel 2.x and glibc 2.x) on x86, Alpha, MIPS and MIPSEL,
SPARC, ARM, PowerPC, 68k, HPPA, S390, IA-64, and Opteron (AMD-64)</LI>
<LI>Solaris 2.x (SunOS 5.x) on x86, AMD64, SPARC, and SPARC-64</LI>
<LI>AIX 4.x</LI>
<LI>HP-UX 11 (both 32 and 64 bit)</LI>
<LI>Tru64/OSF1</LI>
<LI>FreeBSD on x86, AMD64, and Alpha</LI>
<LI>OpenBSD on x86, AMD64, Alpha, and SPARC</LI>
<LI>NetBSD on x86, Alpha, SPARC, and VAX</LI>
<LI>MacOS X (Darwin) on PowerPC (32 bit) and Intel (both 32 and 64 bit) [universal]</LI>
<LI>Cygwin</LI>
</UL>
<P>

<A NAME="signals">
<H3>Signals</H3>
Signal handling in an application using State Threads should be treated the
same way as in a classical UNIX process application. There is no such
thing as a per-thread signal mask: all threads share the same signal
handlers, and only async-signal-safe functions can be used in signal
handlers. However, there is a way to process signals synchronously by
converting a signal event to an I/O event: a signal-catching function writes
the signal number to a pipe, and a dedicated signal-handling thread reads
from that pipe and processes each signal synchronously. The following code
demonstrates this technique (error handling is omitted for clarity):
<PRE>

/* Per-process pipe which is used as a signal queue. */
/* Up to PIPE_BUF/sizeof(int) signals can be queued up. */
int sig_pipe[2];

/* Signal catching function. */
/* Converts signal event to I/O event. */
void sig_catcher(int signo)
{
    int err;

    /* Save errno to restore it after the write() */
    err = errno;
    /* write() is reentrant/async-safe */
    write(sig_pipe[1], &signo, sizeof(int));
    errno = err;
}

/* Signal processing function. */
/* This is the "main" function of the signal processing thread. */
void *sig_process(void *arg)
{
    st_netfd_t nfd;
    int signo;

    nfd = st_netfd_open(sig_pipe[0]);

    for ( ; ; ) {
        /* Read the next signal from the pipe */
        st_read(nfd, &signo, sizeof(int), ST_UTIME_NO_TIMEOUT);

        /* Process signal synchronously */
        switch (signo) {
        case SIGHUP:
            /* do something here - reread config files, etc. */
            break;
        case SIGTERM:
            /* do something here - cleanup, etc. */
            break;
        /* .
           .
           Other signals
           .
           .
         */
        }
    }

    return NULL;
}

int main(int argc, char *argv[])
{
    struct sigaction sa;
    .
    .
    .

    /* Create signal pipe */
    pipe(sig_pipe);

    /* Create signal processing thread */
    st_thread_create(sig_process, NULL, 0, 0);

    /* Install sig_catcher() as a signal handler */
    sa.sa_handler = sig_catcher;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGHUP, &sa, NULL);

    sa.sa_handler = sig_catcher;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGTERM, &sa, NULL);

    .
    .
    .

}
</PRE>
<P>
Note that if multiple processes are used (see below), the signal pipe should
be initialized after the <TT>fork(2)</TT> call so that each process has its
own private pipe.
<P>

<A NAME="intra">
<H3>Intra-Process Synchronization</H3>
Due to the event-driven nature of the library scheduler, a thread context
switch (process state change) can happen only in a well-known set of
library functions. This set includes functions in which a thread may
"block": I/O functions (<TT>st_read(), st_write(),</TT> etc.),
sleep functions (<TT>st_sleep(),</TT> etc.), and thread synchronization
functions (<TT>st_thread_join(), st_cond_wait(),</TT> etc.). As a result,
process-specific global data need not be protected by locks, since a thread
cannot be rescheduled while in a critical section (and only one thread at a
time can access the same memory location). By the same token, functions that
are not thread-safe in the traditional sense can be safely used with
State Threads. The library's mutex facilities are practically useless
for a correctly written application (no blocking functions in critical
sections) and are provided mostly for completeness. This absence of locking
greatly simplifies application design and provides a foundation for
scalability.
<P>

<A NAME="inter">
<H3>Inter-Process Synchronization</H3>
The State Threads library makes it possible to multiplex a large number
of simultaneous connections onto a much smaller number of separate
processes, where each process uses a many-to-one user-level threading
implementation (<B>N</B> separate <B>M:1</B> mappings rather than the one
<B>M:N</B> mapping used in native threading libraries on some platforms).
This design is key to the application's scalability. One can think about it
as if the set of all threads were partitioned into separate groups
(processes), where each group has a separate pool of resources (virtual
address space, file descriptors, etc.). An application designer has full
control of how many groups (processes) an application creates and what
resources, if any, are shared among different groups via standard UNIX
inter-process communication (IPC) facilities.<P>
There are several reasons for creating multiple processes:
<P>
<UL>
<LI>To take advantage of multiple hardware entities (CPUs, disks, etc.)
available in the system (hardware parallelism).</LI>
<P>
<LI>To reduce the risk of losing a large number of user connections when one
of the processes crashes. For example, if <B>C</B> user connections (threads)
are multiplexed onto <B>P</B> processes and one of the processes crashes,
only a fraction (<B>C/P</B>) of all connections will be lost.</LI>
<P>
<LI>To overcome per-process resource limitations imposed by the OS. For
example, if <TT>select(2)</TT> is used for event polling, the number of
simultaneous connections (threads) per process is
limited by the <TT>FD_SETSIZE</TT> parameter (see <TT>select(2)</TT>).
If <TT>FD_SETSIZE</TT> is equal to 1024 and each connection needs one file
descriptor, then an application should create 10 processes to support 10,000
simultaneous connections.</LI>
</UL>
<P>
Ideally all user sessions are completely independent, so there is no need for
inter-process communication. It is always better to have several separate
smaller process-specific resources (e.g., data caches) than to have one large
resource shared (and modified) by all processes. Sometimes, however, there
is a need to share a common resource among different processes. In that case,
standard UNIX IPC facilities can be used. In addition, there is a way
to synchronize different processes so that only the thread accessing the
shared resource will be suspended (but not the entire process) if that
resource is unavailable. In the following code fragment a pipe is used as a
counting semaphore for inter-process synchronization:
<PRE>
#ifndef PIPE_BUF
#define PIPE_BUF 512  /* POSIX */
#endif

/* Semaphore data structure */
typedef struct ipc_sem {
    st_netfd_t rdfd;  /* read descriptor */
    st_netfd_t wrfd;  /* write descriptor */
} ipc_sem_t;

/* Create and initialize the semaphore. Should be called before fork(2). */
/* 'value' must be less than PIPE_BUF. */
/* If 'value' is 1, the semaphore works as a mutex. */
ipc_sem_t *ipc_sem_create(int value)
{
    ipc_sem_t *sem;
    int p[2];
    char b[PIPE_BUF];

    /* Error checking is omitted for clarity */
    sem = malloc(sizeof(ipc_sem_t));

    /* Create the pipe */
    pipe(p);
    sem->rdfd = st_netfd_open(p[0]);
    sem->wrfd = st_netfd_open(p[1]);

    /* Initialize the semaphore: put 'value' bytes into the pipe */
    write(p[1], b, value);

    return sem;
}

/* Try to decrement the "value" of the semaphore. */
/* If "value" is 0, the calling thread blocks on the semaphore. */
int ipc_sem_wait(ipc_sem_t *sem)
{
    char c;

    /* Read one byte from the pipe */
    if (st_read(sem->rdfd, &c, 1, ST_UTIME_NO_TIMEOUT) != 1)
        return -1;

    return 0;
}

/* Increment the "value" of the semaphore. */
int ipc_sem_post(ipc_sem_t *sem)
{
    char c;

    if (st_write(sem->wrfd, &c, 1, ST_UTIME_NO_TIMEOUT) != 1)
        return -1;

    return 0;
}
</PRE>
<P>
Generally, the following steps should be followed when writing an application
using the State Threads library:
<P>
<OL>
<LI>Initialize the library (<TT>st_init()</TT>).</LI>
<P>
<LI>Create resources that will be shared among different processes:
create and bind listening sockets, create shared memory segments, IPC
channels, synchronization primitives, etc.</LI>
<P>
<LI>Create several processes (<TT>fork(2)</TT>). The parent process should
either exit or become a "watchdog" (e.g., it starts a new process when
an existing one crashes, does a cleanup upon application termination,
etc.).</LI>
<P>
<LI>In each child process create a pool of threads
(<TT>st_thread_create()</TT>) to handle user connections.</LI>
</OL>
<P>

<A NAME="nonnet">
<H3>Non-Network I/O</H3>

The State Threads architecture uses non-blocking I/O on
<TT>st_netfd_t</TT> objects for concurrent processing of multiple user
connections. This architecture has a drawback: the entire process and
all its threads may block for the duration of a <I>disk</I> or other
non-network I/O operation, whether through State Threads I/O functions,
direct system calls, or standard I/O functions. (This applies
mostly to disk <I>reads</I>; disk <I>writes</I> are usually performed
asynchronously -- data goes to the buffer cache to be written to disk
later.) Fortunately, disk I/O (unlike network I/O) usually takes a
finite and predictable amount of time, but this may not be true for
special devices or user input devices (including stdin). Nevertheless,
such I/O reduces the throughput of the system and increases response times.
There are several ways to design an application to overcome this
drawback:

<P>
<UL>
<LI>Create several identical main processes as described above (symmetric
architecture). This will improve CPU utilization and thus improve the
overall throughput of the system.</LI>
<P>
<LI>Create multiple "helper" processes in addition to the main process that
will handle blocking I/O operations (asymmetric architecture).
This approach was suggested for Web servers in a
<A HREF="http://www.cs.rice.edu/~vivek/flash99/">paper</A> by Peter
Druschel et al. In this architecture the main process communicates with
a helper process via an IPC channel (<TT>pipe(2)</TT>,
<TT>socketpair(2)</TT>). The main process instructs a helper to perform the
potentially blocking operation. Once the operation completes, the helper
returns a notification via IPC.</LI>
</UL>
<P>

<A NAME="timeouts">
<H3>Timeouts</H3>

The <TT>timeout</TT> parameter to <TT>st_cond_timedwait()</TT> and the
I/O functions, and the arguments to <TT>st_sleep()</TT> and
<TT>st_usleep()</TT>, specify a maximum time to wait <I>since the last
context switch</I>, not since the beginning of the function call.

<P>The State Threads' time resolution is actually the time interval
between context switches. That time interval may be large in some
situations, for example, when a single thread does a lot of work
continuously. Note that a steady, uninterrupted stream of network I/O
qualifies for this description; a context switch occurs only when a
thread blocks.

<P>If a specified I/O timeout is less than the time interval between
context switches, the function may return with a timeout error before
that amount of time has elapsed since the beginning of the function
call. For example, if eight milliseconds have passed since the last
context switch and an I/O function with a timeout of 10 milliseconds
blocks, causing a switch, the call may return with a timeout error as
little as two milliseconds after it was called. (On Linux,
<TT>select()</TT>'s timeout is an <I>upper</I> bound on the amount of
time elapsed before select returns.) Similarly, if 12 ms have passed
already, the function may return immediately.

<P>In almost all cases I/O timeouts should be used only for detecting a
broken network connection or for preventing a peer from holding an idle
connection for too long. Therefore for most applications realistic I/O
timeouts should be on the order of seconds. Furthermore, there is
probably no point in retrying operations that time out. Rather than
retrying, simply use a larger timeout in the first place.

<P>The largest valid timeout value is platform-dependent and may be
significantly less than <TT>INT_MAX</TT> seconds for <TT>select()</TT>
or <TT>INT_MAX</TT> milliseconds for <TT>poll()</TT>. Generally, you
should not use timeouts exceeding several hours. Use
<tt>ST_UTIME_NO_TIMEOUT</tt> (<tt>-1</tt>) as a special value to
indicate infinite timeout or indefinite sleep. Use
<tt>ST_UTIME_NO_WAIT</tt> (<tt>0</tt>) to indicate no waiting at all.

<P>
<HR>
<P>
</BODY>
</HTML>
3120
trunk/3rdparty/st-srs/docs/reference.html
vendored
File diff suppressed because it is too large
504
trunk/3rdparty/st-srs/docs/st.html
vendored
@@ -1,504 +0,0 @@
<HTML>
<HEAD>
<TITLE>State Threads for Internet Applications</TITLE>
</HEAD>
<BODY BGCOLOR=#FFFFFF>
<H2>State Threads for Internet Applications</H2>
<H3>Introduction</H3>
<P>
State Threads is an application library which provides a
foundation for writing fast and highly scalable Internet applications
on UNIX-like platforms. It combines the simplicity of the multithreaded
programming paradigm, in which one thread supports each simultaneous
connection, with the performance and scalability of an event-driven
state machine architecture.</P>

<H3>1. Definitions</H3>
<P>
<A NAME="IA">
<H4>1.1 Internet Applications</H4>
</A>
<P>
An <I>Internet Application</I> (IA) is either a server or client network
application that accepts connections from clients and may or may not
connect to servers. In an IA the arrival or departure of network data
often controls processing (that is, an IA is a <I>data-driven</I>
application). For each connection, an IA does some finite amount of work
involving data exchange with its peer, where its peer may be either
a client or a server.
The typical transaction steps of an IA are to accept a connection,
read a request, do some finite and predictable amount of work to
process the request, then write a response to the peer that sent the
request. One example of an IA is a Web server;
the most general example of an IA is a proxy server, because it both
accepts connections from clients and connects to other servers.</P>
<P>
We assume that the performance of an IA is constrained by available CPU
cycles rather than network bandwidth or disk I/O (that is, the CPU
is the bottleneck resource).
<P>

<A NAME="PS">
<H4>1.2 Performance and Scalability</H4>
</A>
<P>
The <I>performance</I> of an IA is usually evaluated as its
throughput measured in transactions per second or bytes per second (one
can be converted to the other, given the average transaction size). There
are several benchmarks that can be used to measure the throughput of
Web-serving applications for specific workloads (such as
<A HREF="http://www.spec.org/osg/web96/">SPECweb96</A>,
<A HREF="http://www.mindcraft.com/webstone/">WebStone</A>, and
<A HREF="http://www.zdnet.com/zdbop/webbench/">WebBench</A>).
Although there is no common definition for <I>scalability</I>, in general it
expresses the ability of an application to sustain its performance when some
external condition changes. For IAs this external condition is either the
number of clients (also known as "users," "simultaneous connections," or
"load generators") or the underlying hardware system size (number of CPUs,
memory size, and so on). Thus there are two types of scalability: <I>load
scalability</I> and <I>system scalability</I>, respectively.
<P>
The figure below shows how the throughput of an idealized IA changes with
the increasing number of clients (solid blue line). Initially the throughput
grows linearly (the slope represents the maximal throughput that one client
can provide). Within this initial range, the IA is underutilized and the
CPUs are partially idle. A further increase in the number of clients leads
to system saturation, and the throughput gradually stops growing as all CPUs
become fully utilized. After that point, the throughput stays flat because
there are no more CPU cycles available.
In the real world, however, each simultaneous connection
consumes some computational and memory resources, even when idle, and this
overhead grows with the number of clients. Therefore, the throughput of a
real-world IA starts dropping after some point (dashed blue line in the
figure below). The rate at which the throughput drops depends, among other
things, on the application design.
<P>
We say that an application has good <I>load scalability</I> if it can
sustain its throughput over a wide range of loads.
Interestingly, the <A HREF="http://www.spec.org/osg/web99/">SPECweb99</A>
benchmark somewhat reflects a Web server's load scalability because it
measures the number of clients (load generators) given a mandatory minimal
throughput per client (that is, it measures the server's <I>capacity</I>).
This is unlike <A HREF="http://www.spec.org/osg/web96/">SPECweb96</A> and
other benchmarks that use throughput as their main metric (see the figure
below).
<P>
<CENTER><IMG SRC="fig.gif" ALT="Figure: Throughput vs. Number of clients">
</CENTER>
<P>
<I>System scalability</I> is the ability of an application to sustain its
performance per hardware unit (such as a CPU) with the increasing number of
these units. In other words, good system scalability means that doubling the
number of processors will roughly double the application's throughput
(dashed green line). We assume here that the underlying operating system
also scales well. Good system scalability allows you to initially run an
application on the smallest system possible, while retaining the ability to
move that application to a larger system if necessary, without excessive
effort or expense. That is, an application need not be rewritten or even
undergo a major porting effort when the system size changes.
<P>
Although scalability and performance are more important in the case of
server IAs, they should also be considered for some client applications
(such as benchmark load generators).
<P>

<A NAME="CONC">
<H4>1.3 Concurrency</H4>
</A>
<P>
Concurrency reflects the parallelism in a system. The two unrelated types
are <I>virtual</I> concurrency and <I>real</I> concurrency.
<UL>
<LI>Virtual (or apparent) concurrency is the number of simultaneous
connections that a system supports.</LI>
<BR><BR>
<LI>Real concurrency is the number of hardware devices, including
CPUs, network cards, and disks, that actually allow a system to perform
tasks in parallel.</LI>
</UL>
<P>
An IA must provide virtual concurrency in order to serve many users
simultaneously.
To achieve maximum performance and scalability in doing so, the number of
programming entities that an IA creates to be scheduled by the OS kernel
should be kept close to (within an order of magnitude of) the real
concurrency found on the system. These programming entities scheduled by the
kernel are known as <I>kernel execution vehicles</I>. Examples of kernel
execution vehicles include Solaris lightweight processes and IRIX kernel
threads. In other words, the number of kernel execution vehicles should be
dictated by the system size and not by the number of simultaneous
connections.
<P>

<H3>2. Existing Architectures</H3>
<P>
There are a few different architectures that are commonly used by IAs.
These include the <I>Multi-Process</I>,
<I>Multi-Threaded</I>, and <I>Event-Driven State Machine</I>
architectures.
<P>
<A NAME="MP">
<H4>2.1 Multi-Process Architecture</H4>
</A>
<P>
In the Multi-Process (MP) architecture, an individual process is
dedicated to each simultaneous connection.
A process performs all of a transaction's initialization steps
and services a connection completely before moving on to service
a new connection.
<P>
User sessions in IAs are relatively independent; therefore, no
synchronization between processes handling different connections is
necessary. Because each process has its own private address space,
this architecture is very robust. If a process serving one of the
connections crashes, the other sessions will not be affected. However, to
serve many concurrent connections, an equal number of processes must be
employed. Because processes are kernel entities (and are in fact the
heaviest ones), the number of kernel entities will be at least as large as
the number of concurrent sessions. On most systems, good performance will
not be achieved when more than a few hundred processes are created because
of the high context-switching overhead. In other words, MP applications
have poor load scalability.
<P>
On the other hand, MP applications have very good system scalability,
because no resources are shared among different processes and there is no
synchronization overhead.
<P>
The Apache Web Server 1.x (<A HREF=#refs1>[Reference 1]</A>) uses the MP
architecture on UNIX systems.
<P>
<A NAME="MT">
|
||||
<H4>2.2 Multi-Threaded Architecture</H4>
|
||||
</A>
|
||||
<P>
|
||||
In the Multi-Threaded (MT) architecture, multiple independent threads
|
||||
of control are employed within a single shared address space. Like a
|
||||
process in the MP architecture, each thread performs all of a
|
||||
transaction's initialization steps and services a connection completely
|
||||
before moving on to service a new connection.
|
||||
<P>
|
||||
Many modern UNIX operating systems implement a <I>many-to-few</I> model when
|
||||
mapping user-level threads to kernel entities. In this model, an
|
||||
arbitrarily large number of user-level threads is multiplexed onto a
|
||||
lesser number of kernel execution vehicles. Kernel execution
|
||||
vehicles are also known as <I>virtual processors</I>. Whenever a user-level
|
||||
thread makes a blocking system call, the kernel execution vehicle it is using
|
||||
will become blocked in the kernel. If there are no other non-blocked kernel
|
||||
execution vehicles and there are other runnable user-level threads, a new
|
||||
kernel execution vehicle will be created automatically. This prevents the
|
||||
application from blocking when it can continue to make useful forward
|
||||
progress.
|
||||
<P>
|
||||
Because IAs are by nature network I/O driven, all concurrent sessions block on
|
||||
network I/O at various points. As a result, the number of virtual processors
|
||||
created in the kernel grows close to the number of user-level threads
|
||||
(or simultaneous connections). When this occurs, the many-to-few model
|
||||
effectively degenerates to a <I>one-to-one</I> model. Again, like in
|
||||
the MP architecture, the number of kernel execution vehicles is dictated by
|
||||
the number of simultaneous connections rather than by number of CPUs. This
|
||||
reduces an application's load scalability. However, because kernel threads
|
||||
(lightweight processes) use fewer resources and are more light-weight than
|
||||
traditional UNIX processes, an MT application should scale better with load
|
||||
than an MP application.
|
||||
<P>
|
||||
Unfortunately, the large number of virtual processors sharing the same address
space in the MT architecture hurts an application's system scalability
because of contention among the threads on various locks. Even if the
application itself is carefully
optimized to avoid lock contention around its own global data (a non-trivial
task), there are still standard library functions and system calls
that use common resources hidden from the application. For example,
on many platforms thread safety of memory allocation routines
(<TT>malloc(3)</TT>, <TT>free(3)</TT>, and so on) is achieved by using a single
global lock. Another example is the per-process file descriptor table.
This common resource table is shared by all kernel execution vehicles within
the same process and must be protected when it is modified via
certain system calls (such as <TT>open(2)</TT>, <TT>close(2)</TT>, and so on).
In addition, keeping the caches coherent
among CPUs on multiprocessor systems hurts performance when different threads
running on different CPUs modify data items on the same cache line.
<P>
In order to improve load scalability, some applications employ a different
type of MT architecture: they create one or more threads <I>per task</I>
rather than one thread <I>per connection</I>. For example, one small group
of threads may be responsible for accepting client connections, another
for request processing, and yet another for serving responses. The main
advantage of this architecture is that it eliminates the tight coupling
between the number of threads and the number of simultaneous connections. However,
in this architecture, the different task-specific thread groups must share common
work queues that must be protected by mutual exclusion locks (a typical
producer-consumer problem). This adds synchronization overhead that causes an
application to perform poorly on multiprocessor systems. In other words, in
this architecture, the application's system scalability is sacrificed for the
sake of load scalability.
<P>
Of course, the usual nightmares of threaded programming, including data
corruption, deadlocks, and race conditions, also make the MT architecture (in any
form) difficult to use.
<P>
<A NAME="EDSM">
<H4>2.3 Event-Driven State Machine Architecture</H4>
</A>
<P>
In the Event-Driven State Machine (EDSM) architecture, a single process
is employed to concurrently process multiple connections. The basics of this
architecture are described in Comer and Stevens
<A HREF=#refs2>[Reference 2]</A>.
The EDSM architecture performs one basic data-driven step associated with
a particular connection at a time, thus multiplexing many concurrent
connections. The process operates as a state machine that receives an event
and then reacts to it.
<P>
In the idle state the EDSM calls <TT>select(2)</TT> or <TT>poll(2)</TT> to
wait for network I/O events. When a particular file descriptor is ready for
I/O, the EDSM completes the corresponding basic step (usually by invoking a
handler function) and starts the next one. This architecture uses
non-blocking system calls to perform asynchronous network I/O operations.
For more details on non-blocking I/O see Stevens
<A HREF=#refs3>[Reference 3]</A>.
<P>
To take advantage of hardware parallelism (real concurrency), multiple
identical processes may be created. This is called Symmetric Multi-Process
EDSM and is used, for example, in the Zeus Web Server
(<A HREF=#refs4>[Reference 4]</A>). To more efficiently multiplex disk I/O,
special "helper" processes may be created. This is called Asymmetric
Multi-Process EDSM and was proposed for Web servers by Druschel
and others <A HREF=#refs5>[Reference 5]</A>.
<P>
EDSM is probably the most scalable architecture for IAs.
Because the number of simultaneous connections (virtual concurrency) is
completely decoupled from the number of kernel execution vehicles (processes),
this architecture has very good load scalability. It requires only minimal
user-level resources to create and maintain an additional connection.
<P>
Like MP applications, Multi-Process EDSM has very good system scalability
because no resources are shared among different processes and there is no
synchronization overhead.
<P>
Unfortunately, the EDSM architecture is monolithic rather than based on the
concept of threads, so new applications generally need to be implemented from
the ground up. In effect, the EDSM architecture simulates threads and their
stacks the hard way.
<P>
<A NAME="ST">
<H3>3. State Threads Library</H3>
</A>
<P>
The State Threads library combines the advantages of all of the above
architectures. The interface preserves the programming simplicity of the thread
abstraction, allowing each simultaneous connection to be treated as a separate
thread of execution within a single process. The underlying implementation is
close to the EDSM architecture, as the state of each particular concurrent
session is saved in a separate memory segment.
<P>
<H4>3.1 State Changes and Scheduling</H4>
<P>
The state of each concurrent session includes its stack environment
(stack pointer, program counter, CPU registers) and its stack. Conceptually,
a thread context switch can be viewed as a process changing its state. There
are no kernel entities involved other than processes.
Unlike other general-purpose threading libraries, the State Threads library
is fully deterministic. The thread context switch (process state change) can
only happen in a well-known set of functions (at I/O points or at explicit
synchronization points). As a result, process-specific global data does not
have to be protected by mutual exclusion locks in most cases. The entire
application is free to use all the static variables and non-reentrant library
functions it wants, greatly simplifying programming and debugging while
increasing performance. This is somewhat similar to a <I>co-routine</I> model
(co-operatively multitasked threads), except that no explicit yield is needed --
sooner or later, a thread performs a blocking I/O operation and thus surrenders
control. All threads of execution (simultaneous connections) have the
same priority, so scheduling is non-preemptive, as in the EDSM architecture.
Because IAs are data-driven (processing is limited by the size of network
buffers and data arrival rates), scheduling is non-time-slicing.
<P>
Only two types of external events are handled by the library's
scheduler, because only these events can be detected by
<TT>select(2)</TT> or <TT>poll(2)</TT>: I/O events (a file descriptor is ready
for I/O) and time events
(some timeout has expired). However, other types of events (such as
a signal sent to a process) can also be handled by converting them to I/O
events. For example, a signal handling function can perform a write to a pipe
(<TT>write(2)</TT> is reentrant/asynchronous-safe), thus converting a signal
event to an I/O event.
<P>
To take advantage of hardware parallelism, as in the EDSM architecture,
multiple processes can be created in either a symmetric or asymmetric manner.
Process management is not in the library's scope but instead is left up to the
application.
<P>
There are several general-purpose threading libraries that implement a
<I>many-to-one</I> model (many user-level threads to one kernel execution
vehicle), using the same basic techniques as the State Threads library
(non-blocking I/O, event-driven scheduler, and so on). For an example, see GNU
Portable Threads (<A HREF=#refs6>[Reference 6]</A>). Because they are
general-purpose, these libraries have different objectives than the State
Threads library. The State Threads library is <I>not</I> a general-purpose
threading library,
but rather an application library that targets only certain types of
applications (IAs) in order to achieve the highest possible performance and
scalability for those applications.
<P>
<H4>3.2 Scalability</H4>
<P>
State threads are very lightweight user-level entities, and therefore creating
and maintaining user connections requires minimal resources. An application
using the State Threads library scales very well with an increasing number
of connections.
<P>
On multiprocessor systems an application should create multiple processes
to take advantage of hardware parallelism. Using multiple separate processes
is the <I>only</I> way to achieve the highest possible system scalability.
This is because duplicating per-process resources is the only way to avoid
significant synchronization overhead on multiprocessor systems. Creating
separate UNIX processes naturally offers resource duplication. Again,
as in the EDSM architecture, there is no connection between the number of
simultaneous connections (which may be very large and change within a wide
range) and the number of kernel entities (which is usually small and constant).
In other words, the State Threads library makes it possible to multiplex a
large number of simultaneous connections onto a much smaller number of
separate processes, thus allowing an application to scale well with both
the load and system size.
<P>
<H4>3.3 Performance</H4>
<P>
Performance is one of the library's main objectives. The State Threads
library is implemented to minimize the number of system calls and
to make thread creation and context switching as fast as possible.
For example, there is no per-thread signal mask (unlike in
POSIX threads), so there is no need to save and restore a process's
signal mask on every thread context switch. This eliminates two system
calls per context switch. Signal events can be handled much more
efficiently by converting them to I/O events (see above).
<P>
<H4>3.4 Portability</H4>
<P>
The library uses the same general, underlying concepts as the EDSM
architecture, including non-blocking I/O, file descriptors, and
I/O multiplexing. These concepts are available in some form on most
UNIX platforms, making the library very portable across many
flavors of UNIX. There are only a few platform-dependent sections in the
source.
<P>
<H4>3.5 State Threads and NSPR</H4>
<P>
The State Threads library is a derivative of the Netscape Portable
Runtime library (NSPR) <A HREF=#refs7>[Reference 7]</A>. The primary goal of
NSPR is to provide a platform-independent layer for system facilities,
where system facilities include threads, thread synchronization, and I/O.
Performance and scalability are not the main concern of NSPR. The
State Threads library addresses performance and scalability while
remaining much smaller than NSPR. It is contained in 8 source files
as opposed to more than 400, but provides all the functionality that
is needed to write efficient IAs on UNIX-like platforms.
<P>
<TABLE CELLPADDING=3>
<TR>
<TD></TD>
<TH>NSPR</TH>
<TH>State Threads</TH>
</TR>
<TR>
<TD><B>Lines of code</B></TD>
<TD ALIGN=RIGHT>~150,000</TD>
<TD ALIGN=RIGHT>~3000</TD>
</TR>
<TR>
<TD><B>Dynamic library size <BR>(debug version)</B></TD>
<TD></TD>
<TD></TD>
</TR>
<TR>
<TD>IRIX</TD>
<TD ALIGN=RIGHT>~700 KB</TD>
<TD ALIGN=RIGHT>~60 KB</TD>
</TR>
<TR>
<TD>Linux</TD>
<TD ALIGN=RIGHT>~900 KB</TD>
<TD ALIGN=RIGHT>~70 KB</TD>
</TR>
</TABLE>
<P>
<H3>Conclusion</H3>
<P>
State Threads is an application library which provides a foundation for
writing <A HREF=#IA>Internet Applications</A>. To summarize, it has the
following <I>advantages</I>:
<P>
<UL>
<LI>It allows the design of fast and highly scalable applications. An
application will scale well with both load and number of CPUs.
<P>
<LI>It greatly simplifies application programming and debugging because, as a
rule, no mutual exclusion locking is necessary and the entire application is
free to use static variables and non-reentrant library functions.
</UL>
<P>
The library's main <I>limitation</I>:
<P>
<UL>
<LI>All I/O operations on sockets must use the State Threads library's I/O
functions because only those functions perform thread scheduling and prevent
the application's processes from blocking.
</UL>
<P>
<H3>References</H3>
<OL>
<A NAME="refs1">
<LI> Apache Software Foundation,
<A HREF="http://www.apache.org">http://www.apache.org</A>.
<A NAME="refs2">
<LI> Douglas E. Comer, David L. Stevens, <I>Internetworking With TCP/IP,
Vol. III: Client-Server Programming And Applications</I>, Second Edition,
Ch. 8, 12.
<A NAME="refs3">
<LI> W. Richard Stevens, <I>UNIX Network Programming</I>, Second Edition,
Vol. 1, Ch. 15.
<A NAME="refs4">
<LI> Zeus Technology Limited,
<A HREF="http://www.zeus.co.uk/">http://www.zeus.co.uk</A>.
<A NAME="refs5">
<LI> Peter Druschel, Vivek S. Pai, Willy Zwaenepoel,
<A HREF="http://www.cs.rice.edu/~druschel/usenix99flash.ps.gz">
Flash: An Efficient and Portable Web Server</A>. In <I>Proceedings of the
USENIX 1999 Annual Technical Conference</I>, Monterey, CA, June 1999.
<A NAME="refs6">
<LI> GNU Portable Threads,
<A HREF="http://www.gnu.org/software/pth/">http://www.gnu.org/software/pth/</A>.
<A NAME="refs7">
<LI> Netscape Portable Runtime,
<A HREF="http://www.mozilla.org/docs/refList/refNSPR/">http://www.mozilla.org/docs/refList/refNSPR/</A>.
</OL>
<H3>Other resources covering various architectural issues in IAs</H3>
<OL START=8>
<LI> Dan Kegel, <I>The C10K problem</I>,
<A HREF="http://www.kegel.com/c10k.html">http://www.kegel.com/c10k.html</A>.
</LI>
<LI> James C. Hu, Douglas C. Schmidt, Irfan Pyarali, <I>JAWS: Understanding
High Performance Web Systems</I>,
<A HREF="http://www.cs.wustl.edu/~jxh/research/research.html">http://www.cs.wustl.edu/~jxh/research/research.html</A>.</LI>
</OL>
<P>
<HR>
<P>
<CENTER><FONT SIZE=-1>Portions created by SGI are Copyright &copy; 2000
Silicon Graphics, Inc. All rights reserved.</FONT></CENTER>
<P>
</BODY>
</HTML>
trunk/3rdparty/st-srs/docs/timeout_heap.txt
How the timeout heap works

As of version 1.5, the State Threads library represents the queue of
sleeping threads using a heap data structure rather than a sorted
linked list. This improves performance when there is a large number
of sleeping threads, since insertion into a heap takes O(log N) time
while insertion into a sorted list takes O(N) time. For example, in
one test 1000 threads were created, each thread called st_usleep()
with a random time interval, and then all the threads were
immediately interrupted and joined before the sleeps had a chance to
finish. The whole process was repeated 1000 times, for a total of a
million sleep queue insertions and removals. With the old list-based
sleep queue, this test took 100 seconds; now it takes only 12 seconds.

Heap data structures are typically based on dynamically resized
arrays. However, since the existing ST code base was very nicely
structured around linking the thread objects into pointer-based lists
without the need for any auxiliary data structures, implementing the
heap using a similar nodes-and-pointers based approach seemed more
appropriate for ST than introducing a separate array.

Thus, the new ST timeout heap works by organizing the existing
_st_thread_t objects in a balanced binary tree, just as they were
previously organized into a doubly-linked, sorted list. The global
_ST_SLEEPQ variable, formerly a linked list head, is now simply a
pointer to the root of this tree, and the root node of the tree is the
thread with the earliest timeout. Each thread object has two child
pointers, "left" and "right", pointing to threads with later timeouts.

Each node in the tree is numbered with an integer index, corresponding
to the array index in an array-based heap, and the tree is kept fully
balanced and left-adjusted at all times. In other words, the tree
consists of any number of fully populated top levels, followed by a
single bottom level which may be partially populated, such that any
existing nodes form a contiguous block to the left and the spaces for
missing nodes form a contiguous block to the right. For example, if
there are nine threads waiting for a timeout, they are numbered and
arranged in a tree exactly as follows:

            1
          /   \
        2       3
       / \     / \
      4   5   6   7
     / \
    8   9
Each node has either no children, only a left child, or both a left
and a right child. Children always time out later than their parents
(this is called the "heap invariant"), but when a node has two
children, their mutual order is unspecified - the left child may time
out before or after the right child. If a node is numbered N, its
left child is numbered 2N, and its right child is numbered 2N+1.

There is no pointer from a child to its parent; all pointers point
downward. Additions and deletions both work by starting at the root
and traversing the tree towards the leaves, going left or right
according to the binary digits forming the index of the destination
node. As nodes are added or deleted, existing nodes are rearranged to
maintain the heap invariant.