mirror of https://github.com/ossrs/srs.git synced 2025-03-09 15:49:59 +00:00

use libco instead of state-thread(st), still have some bug

xiaozhihong 2020-02-16 21:07:54 +08:00
parent 51d6c367f5
commit 7c8a35aea9
88 changed files with 4836 additions and 19273 deletions


@ -1,434 +0,0 @@
<HTML>
<HEAD>
<TITLE>State Threads Library Programming Notes</TITLE>
</HEAD>
<BODY BGCOLOR=#FFFFFF>
<H2>Programming Notes</H2>
<P>
<B>
<UL>
<LI><A HREF=#porting>Porting</A></LI>
<LI><A HREF=#signals>Signals</A></LI>
<LI><A HREF=#intra>Intra-Process Synchronization</A></LI>
<LI><A HREF=#inter>Inter-Process Synchronization</A></LI>
<LI><A HREF=#nonnet>Non-Network I/O</A></LI>
<LI><A HREF=#timeouts>Timeouts</A></LI>
</UL>
</B>
<P>
<HR>
<P>
<A NAME="porting">
<H3>Porting</H3>
The State Threads library uses OS concepts that are available in some
form on most UNIX platforms, making the library very portable across
many flavors of UNIX. However, there are several parts of the library
that rely on platform-specific features. Here is the list of such parts:
<P>
<UL>
<LI><I>Thread context initialization</I>: Two ingredients of the
<TT>jmp_buf</TT>
data structure (the program counter and the stack pointer) have to be
manually set in the thread creation routine. The <TT>jmp_buf</TT> data
structure is defined in the <TT>setjmp.h</TT> header file and differs from
platform to platform. Usually the program counter is a structure member
with <TT>PC</TT> in the name and the stack pointer is a structure member
with <TT>SP</TT> in the name. One can also look in the
<A HREF="http://www.mozilla.org/source.html">Netscape's NSPR library source</A>
which already has this code for many UNIX-like platforms
(<TT>mozilla/nsprpub/pr/include/md/*.h</TT> files).
<P>
Note that on some BSD-derived platforms <TT>_setjmp(3)/_longjmp(3)</TT>
calls should be used instead of <TT>setjmp(3)/longjmp(3)</TT> (that is,
the calls that manipulate only the stack and registers and do <I>not</I>
save and restore the process's signal mask).
<P>
Starting with glibc 2.4 on Linux, the opacity of the <TT>jmp_buf</TT> data
structure is enforced by <TT>setjmp(3)/longjmp(3)</TT>, so the
<TT>jmp_buf</TT> ingredients cannot be accessed directly anymore (unless the
special environment variable <TT>LD_POINTER_GUARD</TT> is set before application
execution). To avoid depending on a custom environment, the State Threads
library provides <TT>setjmp/longjmp</TT> replacement functions for
all Intel CPU architectures. Other CPU architectures can also be easily
supported (the <TT>setjmp/longjmp</TT> source code is widely available for
many CPU architectures).</LI>
<P>
<LI><I>High resolution time function</I>: Some platforms (IRIX, Solaris)
provide a high resolution time function based on the free running hardware
counter. This function returns the time counted since some arbitrary
moment in the past (usually machine power up time). It is not correlated in
any way to the time of day, and thus is not subject to resetting,
drifting, etc. This type of time is ideal for tasks where cheap, accurate
interval timing is required. If such a function is not available on a
particular platform, the <TT>gettimeofday(3)</TT> function can be used
(though on some platforms it involves a system call).
<P>
<LI><I>The stack growth direction</I>: The library needs to know whether the
stack grows toward lower (down) or higher (up) memory addresses.
One can write a simple test program that detects the stack growth direction
on a particular platform.</LI>
<P>
<LI><I>Non-blocking attribute inheritance</I>: On some platforms (e.g. IRIX)
the socket created as a result of the <TT>accept(2)</TT> call inherits the
non-blocking attribute of the listening socket. One needs to consult the manual
pages or write a simple test program to see if this applies to a specific
platform.</LI>
<P>
<LI><I>Anonymous memory mapping</I>: The library allocates memory segments
for thread stacks by doing anonymous memory mapping (<TT>mmap(2)</TT>). This
mapping is somewhat different on SVR4 and BSD4.3 derived platforms.
<P>
The memory mapping can be avoided altogether by using <TT>malloc(3)</TT> for
stack allocation. In this case the <TT>MALLOC_STACK</TT> macro should be
defined.</LI>
</UL>
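<P>
The "simple test program" for the stack growth direction mentioned in the list above could look roughly like the following sketch. It is not taken from the library; the function names are hypothetical, and it assumes the compiler keeps the two locals in separate stack frames (the call goes through a <TT>volatile</TT> function pointer to discourage inlining).

```c
#include <stdint.h>

/* Hypothetical helper: compares the address of one of its own locals with
 * the address of a local that lives in the caller's frame. */
static int callee_is_lower(volatile char *caller_local)
{
    volatile char callee_local = 0;
    return (uintptr_t)&callee_local < (uintptr_t)caller_local;
}

/* Returns 1 if a nested call's frame sits at a lower address than the
 * caller's frame (stack grows down), 0 if it sits higher (stack grows up). */
int stack_grows_down(void)
{
    volatile char caller_local = 0;
    /* Calling through a volatile pointer keeps the compiler from inlining
     * the callee, which would put both locals in the same frame. */
    int (*volatile probe)(volatile char *) = callee_is_lower;
    return probe(&caller_local);
}
```

<P>
The result would then be recorded as a feature test macro for the platform rather than computed at run time.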
<P>
All machine-dependent feature test macros should be defined in the
<TT>md.h</TT> header file. The assembly code for <TT>setjmp/longjmp</TT>
replacement functions for all CPU architectures should be placed in
the <TT>md.S</TT> file.
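<P>
For the high resolution time function discussed in the porting list above, one option on POSIX platforms is <TT>clock_gettime(2)</TT> with <TT>CLOCK_MONOTONIC</TT>, which is not affected by time-of-day resets. The sketch below is an assumption about how such a function could be written; it is not the library's own implementation, and the name <TT>monotonic_usec</TT> is hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <time.h>

/* Returns microseconds elapsed since an arbitrary fixed point in the past,
 * or -1 if the monotonic clock cannot be read. */
long long monotonic_usec(void)
{
    struct timespec ts;

    if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0)
        return -1;
    return (long long)ts.tv_sec * 1000000LL + ts.tv_nsec / 1000;
}
```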
<P>
The current version of the library is ported to:
<UL>
<LI>IRIX 6.x (both 32 and 64 bit)</LI>
<LI>Linux (kernel 2.x and glibc 2.x) on x86, Alpha, MIPS and MIPSEL,
SPARC, ARM, PowerPC, 68k, HPPA, S390, IA-64, and Opteron (AMD-64)</LI>
<LI>Solaris 2.x (SunOS 5.x) on x86, AMD64, SPARC, and SPARC-64</LI>
<LI>AIX 4.x</LI>
<LI>HP-UX 11 (both 32 and 64 bit)</LI>
<LI>Tru64/OSF1</LI>
<LI>FreeBSD on x86, AMD64, and Alpha</LI>
<LI>OpenBSD on x86, AMD64, Alpha, and SPARC</LI>
<LI>NetBSD on x86, Alpha, SPARC, and VAX</LI>
<LI>MacOS X (Darwin) on PowerPC (32 bit) and Intel (both 32 and 64 bit) [universal]</LI>
<LI>Cygwin</LI>
</UL>
<P>
<A NAME="signals">
<H3>Signals</H3>
Signal handling in an application using State Threads should be treated the
same way as in a classical UNIX process application. There is no such
thing as a per-thread signal mask; all threads share the same signal handlers,
and only async-signal-safe functions can be used in signal handlers.
However, there is a way to process signals synchronously by converting a
signal event to an I/O event: a signal catching function does a write to
a pipe which will be processed synchronously by a dedicated signal handling
thread. The following code demonstrates this technique (error handling is
omitted for clarity):
<PRE>
/* Per-process pipe which is used as a signal queue. */
/* Up to PIPE_BUF/sizeof(int) signals can be queued up. */
int sig_pipe[2];

/* Signal catching function. */
/* Converts signal event to I/O event. */
void sig_catcher(int signo)
{
    int err;

    /* Save errno to restore it after the write() */
    err = errno;
    /* write() is reentrant/async-safe */
    write(sig_pipe[1], &signo, sizeof(int));
    errno = err;
}

/* Signal processing function. */
/* This is the "main" function of the signal processing thread. */
void *sig_process(void *arg)
{
    st_netfd_t nfd;
    int signo;

    nfd = st_netfd_open(sig_pipe[0]);
    for ( ; ; ) {
        /* Read the next signal from the pipe */
        st_read(nfd, &signo, sizeof(int), ST_UTIME_NO_TIMEOUT);
        /* Process signal synchronously */
        switch (signo) {
        case SIGHUP:
            /* do something here - reread config files, etc. */
            break;
        case SIGTERM:
            /* do something here - cleanup, etc. */
            break;
        /* .
           .
           Other signals
           .
           .
        */
        }
    }
    return NULL;
}

int main(int argc, char *argv[])
{
    struct sigaction sa;
    .
    .
    .
    /* Create signal pipe */
    pipe(sig_pipe);
    /* Create signal processing thread */
    st_thread_create(sig_process, NULL, 0, 0);
    /* Install sig_catcher() as a signal handler */
    sa.sa_handler = sig_catcher;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGHUP, &sa, NULL);
    sa.sa_handler = sig_catcher;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGTERM, &sa, NULL);
    .
    .
    .
}
</PRE>
<P>
Note that if multiple processes are used (see below), the signal pipe should
be initialized after the <TT>fork(2)</TT> call so that each process has its
own private pipe.
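<P>
The signal-to-I/O conversion described above can be tried in isolation, without the State Threads scheduler. The following sketch uses only POSIX calls; the names <TT>demo_pipe</TT>, <TT>demo_catcher</TT> and <TT>signal_roundtrip</TT> are hypothetical, and a plain blocking <TT>read(2)</TT> stands in for the <TT>st_read()</TT> call that a signal processing thread would use.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static int demo_pipe[2];

/* Signal handler: forwards the signal number into the pipe. */
static void demo_catcher(int signo)
{
    int saved = errno;                       /* write() may clobber errno */
    ssize_t n = write(demo_pipe[1], &signo, sizeof signo);
    (void)n;
    errno = saved;
}

/* Raises 'signo' in the calling process and returns the signal number read
 * back from the pipe, or -1 on failure. */
int signal_roundtrip(int signo)
{
    struct sigaction sa, old;
    int got = -1;

    if (pipe(demo_pipe) < 0)
        return -1;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = demo_catcher;
    sigemptyset(&sa.sa_mask);
    if (sigaction(signo, &sa, &old) == 0) {
        raise(signo);                        /* handler runs before raise() returns */
        if (read(demo_pipe[0], &got, sizeof got) != (ssize_t)sizeof got)
            got = -1;
        sigaction(signo, &old, NULL);        /* restore the previous disposition */
    }
    close(demo_pipe[0]);
    close(demo_pipe[1]);
    return got;
}
```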
<P>
<A NAME="intra">
<H3>Intra-Process Synchronization</H3>
Due to the event-driven nature of the library scheduler, the thread context
switch (process state change) can only happen in a well-known set of
library functions. This set includes functions in which a thread may
"block":<TT> </TT>I/O functions (<TT>st_read(), st_write(), </TT>etc.),
sleep functions (<TT>st_sleep(), </TT>etc.), and thread synchronization
functions (<TT>st_thread_join(), st_cond_wait(), </TT>etc.). As a result,
process-specific global data need not be protected by locks, since a thread
cannot be rescheduled while in a critical section (and only one thread at a
time can access the same memory location). By the same token,
non-thread-safe functions (in the traditional sense) can be safely used with
State Threads. The library's mutex facilities are practically useless
for a correctly written application (no blocking functions in critical
section) and are provided mostly for completeness. This absence of locking
greatly simplifies an application design and provides a foundation for
scalability.
<P>
<A NAME="inter">
<H3>Inter-Process Synchronization</H3>
The State Threads library makes it possible to multiplex a large number
of simultaneous connections onto a much smaller number of separate
processes, where each process uses a many-to-one user-level threading
implementation (<B>N</B> of <B>M:1</B> mappings rather than one <B>M:N</B>
mapping used in native threading libraries on some platforms). This design
is key to the application's scalability. One can think about it as if a
set of all threads is partitioned into separate groups (processes) where
each group has a separate pool of resources (virtual address space, file
descriptors, etc.). An application designer has full control of how many
groups (processes) an application creates and what resources, if any,
are shared among different groups via standard UNIX inter-process
communication (IPC) facilities.<P>
There are several reasons for creating multiple processes:
<P>
<UL>
<LI>To take advantage of multiple hardware entities (CPUs, disks, etc.)
available in the system (hardware parallelism).</LI>
<P>
<LI>To reduce risk of losing a large number of user connections when one of
the processes crashes. For example, if <B>C</B> user connections (threads)
are multiplexed onto <B>P</B> processes and one of the processes crashes,
only a fraction (<B>C/P</B>) of all connections will be lost.</LI>
<P>
<LI>To overcome per-process resource limitations imposed by the OS. For
example, if <TT>select(2)</TT> is used for event polling, the number of
simultaneous connections (threads) per process is
limited by the <TT>FD_SETSIZE</TT> parameter (see <TT>select(2)</TT>).
If <TT>FD_SETSIZE</TT> is equal to 1024 and each connection needs one file
descriptor, then an application should create 10 processes to support 10,000
simultaneous connections.</LI>
</UL>
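<P>
The process count in the last item above is a ceiling division. A minimal sketch of that arithmetic follows; the function name and parameters are hypothetical, and it ignores descriptors used for listening sockets, pipes and log files, so a real deployment would need some headroom.

```c
/* Returns the number of processes needed so that each one stays within
 * 'fd_limit' descriptors, or -1 for invalid arguments. */
int processes_needed(long connections, long fds_per_connection, long fd_limit)
{
    long per_process;

    if (connections < 0 || fds_per_connection <= 0 ||
        fd_limit < fds_per_connection)
        return -1;
    /* Connections that a single process can hold */
    per_process = fd_limit / fds_per_connection;
    /* Round up: a partially filled process is still a whole process */
    return (int)((connections + per_process - 1) / per_process);
}
```

<P>
With 10,000 connections, one descriptor per connection and a limit of 1024, this yields 10 processes, matching the example in the text.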
<P>
Ideally all user sessions are completely independent, so there is no need for
inter-process communication. It is always better to have several separate
smaller process-specific resources (e.g., data caches) than to have one large
resource shared (and modified) by all processes. Sometimes, however, there
is a need to share a common resource among different processes. In that case,
standard UNIX IPC facilities can be used. In addition to that, there is a way
to synchronize different processes so that only the thread accessing the
shared resource will be suspended (but not the entire process) if that resource
is unavailable. In the following code fragment a pipe is used as a counting
semaphore for inter-process synchronization:
<PRE>
#ifndef PIPE_BUF
#define PIPE_BUF 512 /* POSIX */
#endif

/* Semaphore data structure */
typedef struct ipc_sem {
    st_netfd_t rdfd; /* read descriptor */
    st_netfd_t wrfd; /* write descriptor */
} ipc_sem_t;

/* Create and initialize the semaphore. Should be called before fork(2). */
/* 'value' must be less than PIPE_BUF. */
/* If 'value' is 1, the semaphore works as mutex. */
ipc_sem_t *ipc_sem_create(int value)
{
    ipc_sem_t *sem;
    int p[2];
    char b[PIPE_BUF];

    /* Error checking is omitted for clarity */
    sem = malloc(sizeof(ipc_sem_t));
    /* Create the pipe */
    pipe(p);
    sem->rdfd = st_netfd_open(p[0]);
    sem->wrfd = st_netfd_open(p[1]);
    /* Initialize the semaphore: put 'value' bytes into the pipe */
    write(p[1], b, value);
    return sem;
}

/* Try to decrement the "value" of the semaphore. */
/* If "value" is 0, the calling thread blocks on the semaphore. */
int ipc_sem_wait(ipc_sem_t *sem)
{
    char c;

    /* Read one byte from the pipe */
    if (st_read(sem->rdfd, &c, 1, ST_UTIME_NO_TIMEOUT) != 1)
        return -1;
    return 0;
}

/* Increment the "value" of the semaphore. */
int ipc_sem_post(ipc_sem_t *sem)
{
    char c;

    if (st_write(sem->wrfd, &c, 1, ST_UTIME_NO_TIMEOUT) != 1)
        return -1;
    return 0;
}
</PRE>
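<P>
The counting behaviour of the pipe can be checked on its own with plain POSIX calls. The sketch below is a stand-in, not the State Threads version: the <TT>pipe_sem</TT> names are hypothetical, and a non-blocking read end is used so that "would block" is reported as a return value instead of suspending the caller.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

typedef struct pipe_sem {
    int rd;   /* read end: one byte is consumed per "wait" */
    int wr;   /* write end: one byte is produced per "post" */
} pipe_sem;

/* Creates the pipe and preloads 'value' tokens (0..512). Returns 0 or -1. */
int pipe_sem_init(pipe_sem *s, int value)
{
    int p[2], flags;
    char tokens[512];   /* 512 is the POSIX minimum for PIPE_BUF */

    if (value < 0 || value > (int)sizeof tokens || pipe(p) < 0)
        return -1;
    memset(tokens, 0, sizeof tokens);
    flags = fcntl(p[0], F_GETFL);
    if (flags < 0 || fcntl(p[0], F_SETFL, flags | O_NONBLOCK) < 0 ||
        (value > 0 && write(p[1], tokens, (size_t)value) != value)) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    s->rd = p[0];
    s->wr = p[1];
    return 0;
}

/* Returns 0 if a token was taken, 1 if none is available, -1 on error. */
int pipe_sem_trywait(pipe_sem *s)
{
    char c;
    ssize_t n = read(s->rd, &c, 1);

    if (n == 1)
        return 0;
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return 1;
    return -1;
}

/* Returns one token to the semaphore. Returns 0 or -1. */
int pipe_sem_post(pipe_sem *s)
{
    char c = 0;
    return write(s->wr, &c, 1) == 1 ? 0 : -1;
}
```

<P>
In the State Threads version, the case that returns 1 here is the point where only the calling thread would be suspended inside <TT>st_read()</TT>.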
<P>
Generally, the following steps should be followed when writing an application
using the State Threads library:
<P>
<OL>
<LI>Initialize the library (<TT>st_init()</TT>).</LI>
<P>
<LI>Create resources that will be shared among different processes:
create and bind listening sockets, create shared memory segments, IPC
channels, synchronization primitives, etc.</LI>
<P>
<LI>Create several processes (<TT>fork(2)</TT>). The parent process should
either exit or become a "watchdog" (e.g., it starts a new process when
an existing one crashes, does a cleanup upon application termination,
etc.).</LI>
<P>
<LI>In each child process create a pool of threads
(<TT>st_thread_create()</TT>) to handle user connections.</LI>
</OL>
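<P>
Step 3 of the list above (forking workers while the parent stays behind as a minimal watchdog) might be sketched as follows. Only POSIX calls are used; <TT>run_workers</TT>, <TT>demo_worker</TT> and <TT>MAX_WORKERS</TT> are hypothetical names, and the worker body is a placeholder for the place where a real child would create its thread pool and serve connections.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_WORKERS 64

typedef int (*worker_fn)(int index);

/* Placeholder worker body. A real child would create its threads here
 * (step 4) and run until shutdown. Returns 0 on success. */
int demo_worker(int index)
{
    (void)index;
    return 0;
}

/* Forks 'nworkers' children running 'fn', waits for all of them, and returns
 * how many exited cleanly, or -1 if the workers could not all be started. */
int run_workers(int nworkers, worker_fn fn)
{
    pid_t pids[MAX_WORKERS];
    int i, started = 0, ok = 0;

    if (nworkers < 1 || nworkers > MAX_WORKERS)
        return -1;
    for (i = 0; i < nworkers; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break;
        if (pid == 0)
            _exit(fn(i) == 0 ? 0 : 1);   /* child never returns to the caller */
        pids[started++] = pid;
    }
    /* Parent: reap every child that was started and count clean exits */
    for (i = 0; i < started; i++) {
        int status;
        if (waitpid(pids[i], &status, 0) == pids[i] &&
            WIFEXITED(status) && WEXITSTATUS(status) == 0)
            ok++;
    }
    return started == nworkers ? ok : -1;
}
```

<P>
A fuller watchdog would restart a worker when it exits abnormally instead of only counting exits.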
<P>
<A NAME="nonnet">
<H3>Non-Network I/O</H3>
The State Threads architecture uses non-blocking I/O on
<TT>st_netfd_t</TT> objects for concurrent processing of multiple user
connections. This architecture has a drawback: the entire process and
all its threads may block for the duration of a <I>disk</I> or other
non-network I/O operation, whether through State Threads I/O functions,
direct system calls, or standard I/O functions. (This is applicable
mostly to disk <I>reads</I>; disk <I>writes</I> are usually performed
asynchronously -- data goes to the buffer cache to be written to disk
later.) Fortunately, disk I/O (unlike network I/O) usually takes a
finite and predictable amount of time, but this may not be true for
special devices or user input devices (including stdin). Nevertheless,
such I/O reduces throughput of the system and increases response times.
There are several ways to design an application to overcome this
drawback:
<P>
<UL>
<LI>Create several identical main processes as described above (symmetric
architecture). This will improve CPU utilization and thus improve the
overall throughput of the system.</LI>
<P>
<LI>Create multiple "helper" processes in addition to the main process that
will handle blocking I/O operations (asymmetric architecture).
This approach was suggested for Web servers in a
<A HREF="http://www.cs.rice.edu/~vivek/flash99/">paper</A> by Peter
Druschel et al. In this architecture the main process communicates with
a helper process via an IPC channel (<TT>pipe(2), socketpair(2)</TT>).
The main process instructs a helper to perform the potentially blocking
operation. Once the operation completes, the helper returns a
notification via IPC.
</UL>
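<P>
The asymmetric (helper process) arrangement described above can be sketched with <TT>socketpair(2)</TT> and <TT>fork(2)</TT>. Everything here is illustrative: the <TT>helper_*</TT> names are hypothetical, the "potentially blocking operation" is simply reading a file to count its bytes, and only one request is outstanding at a time. In a State Threads application the main side would presumably wrap its descriptor with <TT>st_netfd_open()</TT> and use <TT>st_write()</TT>/<TT>st_read()</TT>, so that only the requesting thread waits.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Helper side: receive a path, read the file (may block on disk), reply. */
static void helper_loop(int fd)
{
    char path[256];
    ssize_t n;

    while ((n = read(fd, path, sizeof path - 1)) > 0) {
        long total = -1;             /* -1 tells the main process "failed" */
        int file;

        path[n] = '\0';
        file = open(path, O_RDONLY);
        if (file >= 0) {
            char buf[4096];
            ssize_t got;
            total = 0;
            while ((got = read(file, buf, sizeof buf)) > 0)
                total += got;
            close(file);
        }
        if (write(fd, &total, sizeof total) != (ssize_t)sizeof total)
            break;
    }
}

/* Starts a helper. Returns the main-side descriptor, or -1 on failure. */
int helper_start(pid_t *pid_out)
{
    int sv[2];
    pid_t pid;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    pid = fork();
    if (pid < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (pid == 0) {                  /* child: serve requests until EOF */
        close(sv[0]);
        helper_loop(sv[1]);
        _exit(0);
    }
    close(sv[1]);
    *pid_out = pid;
    return sv[0];
}

/* Main side: ask the helper to read 'path'; returns byte count or -1. */
long helper_count_bytes(int fd, const char *path)
{
    long total = -1;
    size_t len = strlen(path);

    if (len == 0 || len > 255)
        return -1;
    if (write(fd, path, len) != (ssize_t)len)
        return -1;
    if (read(fd, &total, sizeof total) != (ssize_t)sizeof total)
        return -1;
    return total;
}

/* Closing the descriptor makes the helper see EOF and exit. */
void helper_stop(int fd, pid_t pid)
{
    close(fd);
    waitpid(pid, NULL, 0);
}
```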
<P>
<A NAME="timeouts">
<H3>Timeouts</H3>
The <TT>timeout</TT> parameter to <TT>st_cond_timedwait()</TT> and the
I/O functions, and the arguments to <TT>st_sleep()</TT> and
<TT>st_usleep()</TT> specify a maximum time to wait <I>since the last
context switch</I>, not since the beginning of the function call.
<P>The State Threads' time resolution is actually the time interval
between context switches. That time interval may be large in some
situations, for example, when a single thread does a lot of work
continuously. Note that a steady, uninterrupted stream of network I/O
qualifies for this description; a context switch occurs only when a
thread blocks.
<P>If a specified I/O timeout is less than the time interval between
context switches, the function may return with a timeout error before
that amount of time has elapsed since the beginning of the function
call. For example, if eight milliseconds have passed since the last
context switch and an I/O function with a timeout of 10 milliseconds
blocks, causing a switch, the call may return with a timeout error as
little as two milliseconds after it was called. (On Linux,
<TT>select()</TT>'s timeout is an <I>upper</I> bound on the amount of
time elapsed before select returns.) Similarly, if 12 ms have passed
already, the function may return immediately.
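<P>
The arithmetic in the example above can be written out as a small function. This is only a model of the behaviour described in the text, not library code; the function name and the millisecond units are assumptions made for illustration.

```c
/* Given a timeout and the time already elapsed since the last context
 * switch (both in milliseconds), returns the shortest delay after the call
 * at which a timeout error may be reported. */
long earliest_timeout_ms(long timeout_ms, long since_switch_ms)
{
    long remaining = timeout_ms - since_switch_ms;

    /* If more time has already passed than the timeout allows,
     * the call may time out immediately. */
    return remaining > 0 ? remaining : 0;
}
```

<P>
For a 10 ms timeout, 8 ms since the last switch gives 2 ms, and 12 ms gives 0, as in the two cases above.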
<P>In almost all cases I/O timeouts should be used only for detecting a
broken network connection or for preventing a peer from holding an idle
connection for too long. Therefore, for most applications realistic I/O
timeouts should be on the order of seconds. Furthermore, there's
probably no point in retrying operations that time out. Rather than
retrying, simply use a larger timeout in the first place.
<P>The largest valid timeout value is platform-dependent and may be
significantly less than <TT>INT_MAX</TT> seconds for <TT>select()</TT>
or <TT>INT_MAX</TT> milliseconds for <TT>poll()</TT>. Generally, you
should not use timeouts exceeding several hours. Use
<tt>ST_UTIME_NO_TIMEOUT</tt> (<tt>-1</tt>) as a special value to
indicate infinite timeout or indefinite sleep. Use
<tt>ST_UTIME_NO_WAIT</tt> (<tt>0</tt>) to indicate no waiting at all.
<P>
<HR>
<P>
</BODY>
</HTML>


@ -1,504 +0,0 @@
<HTML>
<HEAD>
<TITLE>State Threads for Internet Applications</TITLE>
</HEAD>
<BODY BGCOLOR=#FFFFFF>
<H2>State Threads for Internet Applications</H2>
<H3>Introduction</H3>
<P>
State Threads is an application library which provides a
foundation for writing fast and highly scalable Internet Applications
on UNIX-like platforms. It combines the simplicity of the multithreaded
programming paradigm, in which one thread supports each simultaneous
connection, with the performance and scalability of an event-driven
state machine architecture.</P>
<H3>1. Definitions</H3>
<P>
<A NAME="IA">
<H4>1.1 Internet Applications</H4>
</A>
<P>
An <I>Internet Application</I> (IA) is either a server or client network
application that accepts connections from clients and may or may not
connect to servers. In an IA the arrival or departure of network data
often controls processing (that is, an IA is a <I>data-driven</I> application).
For each connection, an IA does some finite amount of work
involving data exchange with its peer, where its peer may be either
a client or a server.
The typical transaction steps of an IA are to accept a connection,
read a request, do some finite and predictable amount of work to
process the request, then write a response to the peer that sent the
request. One example of an IA is a Web server;
the most general example of an IA is a proxy server, because it both
accepts connections from clients and connects to other servers.</P>
<P>
We assume that the performance of an IA is constrained by available CPU
cycles rather than network bandwidth or disk I/O (that is, CPU
is a bottleneck resource).
<P>
<A NAME="PS">
<H4>1.2 Performance and Scalability</H4>
</A>
<P>
The <I>performance</I> of an IA is usually evaluated as its
throughput measured in transactions per second or bytes per second (one
can be converted to the other, given the average transaction size). There are
several benchmarks that can be used to measure throughput of Web serving
applications for specific workloads (such as
<A HREF="http://www.spec.org/osg/web96/">SPECweb96</A>,
<A HREF="http://www.mindcraft.com/webstone/">WebStone</A>,
<A HREF="http://www.zdnet.com/zdbop/webbench/">WebBench</A>).
Although there is no common definition for <I>scalability</I>, in general it
expresses the ability of an application to sustain its performance when some
external condition changes. For IAs this external condition is either the
number of clients (also known as "users," "simultaneous connections," or "load
generators") or the underlying hardware system size (number of CPUs, memory
size, and so on). Thus there are two types of scalability: <I>load
scalability</I> and <I>system scalability</I>, respectively.
<P>
The figure below shows how the throughput of an idealized IA changes with
the increasing number of clients (solid blue line). Initially the throughput
grows linearly (the slope represents the maximal throughput that one client
can provide). Within this initial range, the IA is underutilized and CPUs are
partially idle. Further increase in the number of clients leads to a system
saturation, and the throughput gradually stops growing as all CPUs become fully
utilized. After that point, the throughput stays flat because there are no
more CPU cycles available.
In the real world, however, each simultaneous connection
consumes some computational and memory resources, even when idle, and this
overhead grows with the number of clients. Therefore, the throughput of the
real world IA starts dropping after some point (dashed blue line in the figure
below). The rate at which the throughput drops depends, among other things, on
application design.
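<P>
The idealized curve described above (linear growth followed by a flat plateau) can be modelled as the smaller of two quantities. The sketch below is a simplification with hypothetical names: it uses a sharp corner instead of a gradual saturation, and it deliberately leaves out the per-connection overhead that makes real throughput drop.

```c
/* Idealized throughput: each client contributes 'per_client_rate' until the
 * server's 'saturation_rate' is reached, after which the curve stays flat. */
double ideal_throughput(double clients, double per_client_rate,
                        double saturation_rate)
{
    double linear = clients * per_client_rate;

    return linear < saturation_rate ? linear : saturation_rate;
}
```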
<P>
We say that an application has a good <I>load scalability</I> if it can
sustain its throughput over a wide range of loads.
Interestingly, the <A HREF="http://www.spec.org/osg/web99/">SPECweb99</A>
benchmark somewhat reflects the Web server's load scalability because it
measures the number of clients (load generators) given a mandatory minimal
throughput per client (that is, it measures the server's <I>capacity</I>).
This is unlike <A HREF="http://www.spec.org/osg/web96/">SPECweb96</A> and
other benchmarks that use the throughput as their main metric (see the figure
below).
<P>
<CENTER><IMG SRC="fig.gif" ALT="Figure: Throughput vs. Number of clients">
</CENTER>
<P>
<I>System scalability</I> is the ability of an application to sustain its
performance per hardware unit (such as a CPU) with the increasing number of
these units. In other words, good system scalability means that doubling the
number of processors will roughly double the application's throughput (dashed
green line). We assume here that the underlying operating system also scales
well. Good system scalability allows you to initially run an application on
the smallest system possible, while retaining the ability to move that
application to a larger system if necessary, without excessive effort or
expense. That is, an application need not be rewritten or even undergo a
major porting effort when changing system size.
<P>
Although scalability and performance are more important in the case of server
IAs, they should also be considered for some client applications (such as
benchmark load generators).
<P>
<A NAME="CONC">
<H4>1.3 Concurrency</H4>
</A>
<P>
Concurrency reflects the parallelism in a system. The two unrelated types
are <I>virtual</I> concurrency and <I>real</I> concurrency.
<UL>
<LI>Virtual (or apparent) concurrency is the number of simultaneous
connections that a system supports.
<BR><BR>
<LI>Real concurrency is the number of hardware devices, including
CPUs, network cards, and disks, that actually allow a system to perform
tasks in parallel.
</UL>
<P>
An IA must provide virtual concurrency in order to serve many users
simultaneously.
To achieve maximum performance and scalability in doing so, the number of
programming entities that an IA creates to be scheduled by the OS kernel
should be
kept close to (within an order of magnitude of) the real concurrency found on
the system. These programming entities scheduled by the kernel are known as
<I>kernel execution vehicles</I>. Examples of kernel execution vehicles
include Solaris lightweight processes and IRIX kernel threads.
In other words, the number of kernel execution vehicles should be dictated by
the system size and not by the number of simultaneous connections.
<P>
<H3>2. Existing Architectures</H3>
<P>
There are a few different architectures that are commonly used by IAs.
These include the <I>Multi-Process</I>,
<I>Multi-Threaded</I>, and <I>Event-Driven State Machine</I>
architectures.
<P>
<A NAME="MP">
<H4>2.1 Multi-Process Architecture</H4>
</A>
<P>
In the Multi-Process (MP) architecture, an individual process is
dedicated to each simultaneous connection.
A process performs all of a transaction's initialization steps
and services a connection completely before moving on to service
a new connection.
<P>
User sessions in IAs are relatively independent; therefore, no
synchronization between processes handling different connections is
necessary. Because each process has its own private address space,
this architecture is very robust. If a process serving one of the connections
crashes, the other sessions will not be affected. However, to serve many
concurrent connections, an equal number of processes must be employed.
Because processes are kernel entities (and are in fact the heaviest ones),
the number of kernel entities will be at least as large as the number of
concurrent sessions. On most systems, good performance will not be achieved
when more than a few hundred processes are created because of the high
context-switching overhead. In other words, MP applications have poor load
scalability.
<P>
On the other hand, MP applications have very good system scalability, because
no resources are shared among different processes and there is no
synchronization overhead.
<P>
The Apache Web Server 1.x (<A HREF=#refs1>[Reference 1]</A>) uses the MP
architecture on UNIX systems.
<P>
<A NAME="MT">
<H4>2.2 Multi-Threaded Architecture</H4>
</A>
<P>
In the Multi-Threaded (MT) architecture, multiple independent threads
of control are employed within a single shared address space. Like a
process in the MP architecture, each thread performs all of a
transaction's initialization steps and services a connection completely
before moving on to service a new connection.
<P>
Many modern UNIX operating systems implement a <I>many-to-few</I> model when
mapping user-level threads to kernel entities. In this model, an
arbitrarily large number of user-level threads is multiplexed onto a
lesser number of kernel execution vehicles. Kernel execution
vehicles are also known as <I>virtual processors</I>. Whenever a user-level
thread makes a blocking system call, the kernel execution vehicle it is using
will become blocked in the kernel. If there are no other non-blocked kernel
execution vehicles and there are other runnable user-level threads, a new
kernel execution vehicle will be created automatically. This prevents the
application from blocking when it can continue to make useful forward
progress.
<P>
Because IAs are by nature network I/O driven, all concurrent sessions block on
network I/O at various points. As a result, the number of virtual processors
created in the kernel grows close to the number of user-level threads
(or simultaneous connections). When this occurs, the many-to-few model
effectively degenerates to a <I>one-to-one</I> model. Again, as in
the MP architecture, the number of kernel execution vehicles is dictated by
the number of simultaneous connections rather than by the number of CPUs. This
reduces an application's load scalability. However, because kernel threads
(lightweight processes) use fewer resources and are lighter than
traditional UNIX processes, an MT application should scale better with load
than an MP application.
<P>
Unexpectedly, the small number of virtual processors sharing the same address
space in the MT architecture destroys an application's system scalability
because of contention among the threads on various locks. Even if an
application itself is carefully
optimized to avoid lock contention around its own global data (a non-trivial
task), there are still standard library functions and system calls
that use common resources hidden from the application. For example,
on many platforms thread safety of memory allocation routines
(<TT>malloc(3)</TT>, <TT>free(3)</TT>, and so on) is achieved by using a single
global lock. Another example is a per-process file descriptor table.
This common resource table is shared by all kernel execution vehicles within
the same process and must be protected when one modifies it via
certain system calls (such as <TT>open(2)</TT>, <TT>close(2)</TT>, and so on).
In addition, keeping caches coherent
among CPUs on multiprocessor systems hurts performance when different threads
running on different CPUs modify data items on the same cache line.
<P>
In order to improve load scalability, some applications employ a different
type of MT architecture: they create one or more thread(s) <I>per task</I>
rather than one thread <I>per connection</I>. For example, one small group
of threads may be responsible for accepting client connections, another
for request processing, and yet another for serving responses. The main
advantage of this architecture is that it eliminates the tight coupling
between the number of threads and number of simultaneous connections. However,
in this architecture, different task-specific thread groups must share common
work queues that must be protected by mutual exclusion locks (a typical
producer-consumer problem). This adds synchronization overhead that causes an
application to perform badly on multiprocessor systems. In other words, in
this architecture, the application's system scalability is sacrificed for the
sake of load scalability.
<P>
Of course, the usual nightmares of threaded programming, including data
corruption, deadlocks, and race conditions, also make MT architecture (in any
form) far from simple to use.
<P>
<A NAME="EDSM">
<H4>2.3 Event-Driven State Machine Architecture</H4>
</A>
<P>
In the Event-Driven State Machine (EDSM) architecture, a single process
is employed to concurrently process multiple connections. The basics of this
architecture are described in Comer and Stevens
<A HREF=#refs2>[Reference 2]</A>.
The EDSM architecture performs one basic data-driven step associated with
a particular connection at a time, thus multiplexing many concurrent
connections. The process operates as a state machine that receives an event
and then reacts to it.
<P>
In the idle state the EDSM calls <TT>select(2)</TT> or <TT>poll(2)</TT> to
wait for network I/O events. When a particular file descriptor is ready for
I/O, the EDSM completes the corresponding basic step (usually by invoking a
handler function) and starts the next one. This architecture uses
non-blocking system calls to perform asynchronous network I/O operations.
For more details on non-blocking I/O see Stevens
<A HREF=#refs3>[Reference 3]</A>.
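<P>
A very small event loop in this style might look like the following sketch, which uses <TT>poll(2)</TT>. The <TT>conn_state</TT> structure, the handler and the 16-connection limit are hypothetical. To keep it short, the descriptors are left in blocking mode and each is read only after <TT>poll(2)</TT> reports it ready; a real EDSM would use non-blocking descriptors as described above.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <poll.h>
#include <unistd.h>

#define MAX_CONNS 16

/* Per-connection state kept by the state machine */
typedef struct conn_state {
    int fd;
    long bytes_seen;
    int done;
} conn_state;

/* One basic step for one connection: consume whatever data is available. */
static void on_readable(conn_state *c)
{
    char buf[256];
    ssize_t n = read(c->fd, buf, sizeof buf);

    if (n > 0)
        c->bytes_seen += n;
    else
        c->done = 1;                 /* EOF or error ends this connection */
}

/* Runs until every connection is done. Returns the number of handler
 * invocations, or -1 on error. */
int edsm_run(conn_state *conns, int nconns)
{
    struct pollfd pfds[MAX_CONNS];
    int steps = 0;

    if (nconns < 0 || nconns > MAX_CONNS)
        return -1;
    for (;;) {
        int i, active = 0;

        for (i = 0; i < nconns; i++) {
            /* poll() ignores entries whose fd is negative */
            pfds[i].fd = conns[i].done ? -1 : conns[i].fd;
            pfds[i].events = POLLIN;
            pfds[i].revents = 0;
            if (!conns[i].done)
                active++;
        }
        if (active == 0)
            return steps;
        /* Idle state: wait for an I/O event on any connection */
        if (poll(pfds, (nfds_t)nconns, -1) < 0)
            return -1;
        for (i = 0; i < nconns; i++) {
            if (pfds[i].revents & (POLLIN | POLLHUP | POLLERR)) {
                on_readable(&conns[i]);
                steps++;
            }
        }
    }
}
```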
<P>
To take advantage of hardware parallelism (real concurrency), multiple
identical processes may be created. This is called Symmetric Multi-Process
EDSM and is used, for example, in the Zeus Web Server
(<A HREF=#refs4>[Reference 4]</A>). To more efficiently multiplex disk I/O,
special "helper" processes may be created. This is called Asymmetric
Multi-Process EDSM and was proposed for Web servers by Druschel
and others <A HREF=#refs5>[Reference 5]</A>.
<P>
EDSM is probably the most scalable architecture for IAs.
Because the number of simultaneous connections (virtual concurrency) is
completely decoupled from the number of kernel execution vehicles (processes),
this architecture has very good load scalability. It requires only minimal
user-level resources to create and maintain each additional connection.
<P>
Like MP applications, Multi-Process EDSM has very good system scalability
because no resources are shared among different processes and there is no
synchronization overhead.
<P>
Unfortunately, the EDSM architecture is monolithic rather than based on the
concept of threads, so new applications generally need to be implemented from
the ground up. In effect, the EDSM architecture simulates threads and their
stacks the hard way.
<P>
<A NAME="ST">
<H3>3. State Threads Library</H3>
</A>
<P>
The State Threads library combines the advantages of all of the above
architectures. The interface preserves the programming simplicity of thread
abstraction, allowing each simultaneous connection to be treated as a separate
thread of execution within a single process. The underlying implementation is
close to the EDSM architecture as the state of each particular concurrent
session is saved in a separate memory segment.
<P>
<H4>3.1 State Changes and Scheduling</H4>
<P>
The state of each concurrent session includes its stack environment
(stack pointer, program counter, CPU registers) and its stack. Conceptually,
a thread context switch can be viewed as a process changing its state. There
are no kernel entities involved other than processes.
Unlike other general-purpose threading libraries, the State Threads library
is fully deterministic. The thread context switch (process state change) can
only happen in a well-known set of functions (at I/O points or at explicit
synchronization points). As a result, process-specific global data does not
have to be protected by mutual exclusion locks in most cases. The entire
application is free to use all the static variables and non-reentrant library
functions it wants, greatly simplifying programming and debugging while
increasing performance. This is somewhat similar to a <I>co-routine</I> model
(co-operatively multitasked threads), except that no explicit yield is needed
--
sooner or later, a thread performs a blocking I/O operation and thus surrenders
control. All threads of execution (simultaneous connections) have the
same priority, so scheduling is non-preemptive, as in the EDSM architecture.
Because IAs are data-driven (processing is limited by the size of network
buffers and data arrival rates), scheduling is non-time-slicing.
<P>
Only two types of external events are handled by the library's
scheduler, because only these events can be detected by
<TT>select(2)</TT> or <TT>poll(2)</TT>: I/O events (a file descriptor is ready
for I/O) and time events
(some timeout has expired). However, other types of events (such as
a signal sent to a process) can also be handled by converting them to I/O
events. For example, a signal handling function can perform a write to a pipe
(<TT>write(2)</TT> is async-signal-safe), thus converting a signal
event to an I/O event.
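<P>
The signal-to-I/O conversion described above can be sketched as follows. This
is a minimal illustration that uses only POSIX calls and plain
<TT>poll(2)</TT> rather than the library's own scheduler. The names
<TT>sig_pipe</TT>, <TT>on_signal</TT>, <TT>signal_pipe_init</TT> and
<TT>signal_pipe_wait</TT> are hypothetical and are not part of the State
Threads API.
<PRE>
```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical names: [0] is the read end that gets polled,
 * [1] is the write end used by the signal handler. */
static int sig_pipe[2];

/* Signal handler: it only calls write(2), which is async-signal-safe. */
static void on_signal(int signo)
{
    int saved_errno = errno;
    unsigned char b = (unsigned char)signo;

    if (write(sig_pipe[1], &b, 1) < 0) {
        /* Pipe full: a wakeup is already pending, so the byte can be dropped. */
    }
    errno = saved_errno;
}

/* Create the pipe and install the handler. Returns 0 on success, -1 on error. */
static int signal_pipe_init(int signo)
{
    struct sigaction sa;
    int i;

    assert(signo > 0 && signo < 256);   /* the signal number must fit in one byte */
    if (pipe(sig_pipe) < 0)
        return -1;
    for (i = 0; i < 2; i++) {
        int fl = fcntl(sig_pipe[i], F_GETFL);
        if (fl < 0 || fcntl(sig_pipe[i], F_SETFL, fl | O_NONBLOCK) < 0)
            return -1;
    }
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(signo, &sa, NULL);
}

/* Wait up to timeout_ms for a signal to arrive as an I/O event.
 * Returns the signal number, 0 on timeout, or -1 on error. */
static int signal_pipe_wait(int timeout_ms)
{
    struct pollfd pfd;
    unsigned char b;
    int n;

    pfd.fd = sig_pipe[0];
    pfd.events = POLLIN;
    pfd.revents = 0;
    do {
        n = poll(&pfd, 1, timeout_ms);
    } while (n < 0 && errno == EINTR);   /* the handler itself may interrupt poll */
    if (n <= 0)
        return n;
    return (read(sig_pipe[0], &b, 1) == 1) ? (int)b : -1;
}
```
</PRE>
In an application built on the library, the read end of the pipe would
presumably be read through the library's I/O functions instead, so that a
dedicated thread waits on it like on any other descriptor.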
<P>
To take advantage of hardware parallelism, as in the EDSM architecture,
multiple processes can be created in either a symmetric or asymmetric manner.
Process management is not in the library's scope but instead is left up to the
application.
<P>
There are several general-purpose threading libraries that implement a
<I>many-to-one</I> model (many user-level threads to one kernel execution
vehicle), using the same basic techniques as the State Threads library
(non-blocking I/O, event-driven scheduler, and so on). For an example, see GNU
Portable Threads (<A HREF=#refs6>[Reference 6]</A>). Because they are
general-purpose, these libraries have different objectives than the State
Threads library. The State Threads library is <I>not</I> a general-purpose
threading library,
but rather an application library that targets only certain types of
applications (IAs) in order to achieve the highest possible performance and
scalability for those applications.
<P>
<H4>3.2 Scalability</H4>
<P>
State threads are very lightweight user-level entities, and therefore creating
and maintaining user connections requires minimal resources. An application
using the State Threads library scales very well with an increasing number
of connections.
<P>
On multiprocessor systems an application should create multiple processes
to take advantage of hardware parallelism. Using multiple separate processes
is the <I>only</I> way to achieve the highest possible system scalability.
This is because duplicating per-process resources is the only way to avoid
significant synchronization overhead on multiprocessor systems. Creating
separate UNIX processes naturally offers resource duplication. Again,
as in the EDSM architecture, there is no connection between the number of
simultaneous connections (which may be very large and changes within a wide
range) and the number of kernel entities (which is usually small and constant).
In other words, the State Threads library makes it possible to multiplex a
large number of simultaneous connections onto a much smaller number of
separate processes, thus allowing an application to scale well with both
the load and system size.
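<P>
Because process management is left to the application, the symmetric
multi-process setup has to be written by hand. The following is a minimal
sketch of that idea using only <TT>fork(2)</TT> and <TT>wait(2)</TT>. The
names <TT>run_workers</TT> and <TT>demo_worker</TT> are hypothetical, and the
worker body is a placeholder for whatever per-process event loop a real
server would run.
<PRE>
```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork nprocs identical workers. Each child runs worker(id) and exits with
 * status 0 if the worker returned 0, and 1 otherwise.
 * Returns the number of workers that exited with status 0,
 * or -1 if not all workers could be started. */
static int run_workers(int nprocs, int (*worker)(int id))
{
    int i, ok = 0, started = 0;

    assert(nprocs > 0 && worker != NULL);
    for (i = 0; i < nprocs; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break;
        if (pid == 0) {
            /* Child: a real server would run its own event loop here,
             * multiplexing many connections inside this one process. */
            _exit(worker(i) == 0 ? 0 : 1);
        }
        started++;
    }
    /* Parent: reap every child that was started. */
    for (i = 0; i < started; i++) {
        int status;
        if (wait(&status) > 0 && WIFEXITED(status) && WEXITSTATUS(status) == 0)
            ok++;
    }
    return (started == nprocs) ? ok : -1;
}

/* Placeholder worker used for illustration only. */
static int demo_worker(int id)
{
    (void)id;
    return 0;
}
```
</PRE>
The number of workers would typically be chosen close to the number of CPUs,
independent of how many connections each worker handles.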
<P>
<H4>3.3 Performance</H4>
<P>
Performance is one of the library's main objectives. The State Threads
library is implemented to minimize the number of system calls and
to make thread creation and context switching as fast as possible.
For example, there is no per-thread signal mask (unlike
POSIX threads), so there is no need to save and restore a process's
signal mask on every thread context switch. This eliminates two system
calls per context switch. Signal events can be handled much more
efficiently by converting them to I/O events (see above).
<P>
<H4>3.4 Portability</H4>
<P>
The library uses the same general, underlying concepts as the EDSM
architecture, including non-blocking I/O, file descriptors, and
I/O multiplexing. These concepts are available in some form on most
UNIX platforms, making the library very portable across many
flavors of UNIX. There are only a few platform-dependent sections in the
source.
<P>
<H4>3.5 State Threads and NSPR</H4>
<P>
The State Threads library is a derivative of the Netscape Portable
Runtime library (NSPR) <A HREF=#refs7>[Reference 7]</A>. The primary goal of
NSPR is to provide a platform-independent layer for system facilities,
where system facilities include threads, thread synchronization, and I/O.
Performance and scalability are not the main concern of NSPR. The
State Threads library addresses performance and scalability while
remaining much smaller than NSPR. It is contained in 8 source files
as opposed to more than 400, but provides all the functionality that
is needed to write efficient IAs on UNIX-like platforms.
<P>
<TABLE CELLPADDING=3>
<TR>
<TD></TD>
<TH>NSPR</TH>
<TH>State Threads</TH>
</TR>
<TR>
<TD><B>Lines of code</B></TD>
<TD ALIGN=RIGHT>~150,000</TD>
<TD ALIGN=RIGHT>~3000</TD>
</TR>
<TR>
<TD><B>Dynamic library size&nbsp;&nbsp;<BR>(debug version)</B></TD>
<TD></TD>
<TD></TD>
</TR>
<TR>
<TD>IRIX</TD>
<TD ALIGN=RIGHT>~700 KB</TD>
<TD ALIGN=RIGHT>~60 KB</TD>
</TR>
<TR>
<TD>Linux</TD>
<TD ALIGN=RIGHT>~900 KB</TD>
<TD ALIGN=RIGHT>~70 KB</TD>
</TR>
</TABLE>
<P>
<H3>Conclusion</H3>
<P>
State Threads is an application library which provides a foundation for
writing <A HREF=#IA>Internet Applications</A>. To summarize, it has the
following <I>advantages</I>:
<P>
<UL>
<LI>It allows the design of fast and highly scalable applications. An
application will scale well with both load and number of CPUs.
<P>
<LI>It greatly simplifies application programming and debugging because, as a
rule, no mutual exclusion locking is necessary and the entire application is
free to use static variables and non-reentrant library functions.
</UL>
<P>
The library's main <I>limitation</I>:
<P>
<UL>
<LI>All I/O operations on sockets must use the State Threads library's I/O
functions because only those functions perform thread scheduling and prevent
the application's processes from blocking.
</UL>
<P>
<H3>References</H3>
<OL>
<A NAME="refs1">
<LI> Apache Software Foundation,
<A HREF="http://www.apache.org">http://www.apache.org</A>.
<A NAME="refs2">
<LI> Douglas E. Comer, David L. Stevens, <I>Internetworking With TCP/IP,
Vol. III: Client-Server Programming And Applications</I>, Second Edition,
Ch. 8, 12.
<A NAME="refs3">
<LI> W. Richard Stevens, <I>UNIX Network Programming</I>, Second Edition,
Vol. 1, Ch. 15.
<A NAME="refs4">
<LI> Zeus Technology Limited,
<A HREF="http://www.zeus.co.uk/">http://www.zeus.co.uk</A>.
<A NAME="refs5">
<LI> Peter Druschel, Vivek S. Pai, Willy Zwaenepoel,
<A HREF="http://www.cs.rice.edu/~druschel/usenix99flash.ps.gz">
Flash: An Efficient and Portable Web Server</A>. In <I>Proceedings of the
USENIX 1999 Annual Technical Conference</I>, Monterey, CA, June 1999.
<A NAME="refs6">
<LI> GNU Portable Threads,
<A HREF="http://www.gnu.org/software/pth/">http://www.gnu.org/software/pth/</A>.
<A NAME="refs7">
<LI> Netscape Portable Runtime,
<A HREF="http://www.mozilla.org/docs/refList/refNSPR/">http://www.mozilla.org/docs/refList/refNSPR/</A>.
</OL>
<H3>Other resources covering various architectural issues in IAs</H3>
<OL START=8>
<LI> Dan Kegel, <I>The C10K problem</I>,
<A HREF="http://www.kegel.com/c10k.html">http://www.kegel.com/c10k.html</A>.
</LI>
<LI> James C. Hu, Douglas C. Schmidt, Irfan Pyarali, <I>JAWS: Understanding
High Performance Web Systems</I>,
<A HREF="http://www.cs.wustl.edu/~jxh/research/research.html">http://www.cs.wustl.edu/~jxh/research/research.html</A>.</LI>
</OL>
<P>
<HR>
<P>
<CENTER><FONT SIZE=-1>Portions created by SGI are Copyright &copy; 2000
Silicon Graphics, Inc. All rights reserved.</FONT></CENTER>
<P>
</BODY>
</HTML>


@ -1,60 +0,0 @@
How the timeout heap works
As of version 1.5, the State Threads Library represents the queue of
sleeping threads using a heap data structure rather than a sorted
linked list. This improves performance when there is a large number
of sleeping threads, since insertion into a heap takes O(log N) time
while insertion into a sorted list takes O(N) time. For example, in
one test 1000 threads were created, each thread called st_usleep()
with a random time interval, and then all the threads were
immediately interrupted and joined before the sleeps had a chance to
finish. The whole process was repeated 1000 times, for a total of a
million sleep queue insertions and removals. With the old list-based
sleep queue, this test took 100 seconds; now it takes only 12 seconds.
Heap data structures are typically based on dynamically resized
arrays. However, since the existing ST code base was very nicely
structured around linking the thread objects into pointer-based lists
without the need for any auxiliary data structures, implementing the
heap using a similar nodes-and-pointers based approach seemed more
appropriate for ST than introducing a separate array.
Thus, the new ST timeout heap works by organizing the existing
_st_thread_t objects in a balanced binary tree, just as they were
previously organized into a doubly-linked, sorted list. The global
_ST_SLEEPQ variable, formerly a linked list head, is now simply a
pointer to the root of this tree, and the root node of the tree is the
thread with the earliest timeout. Each thread object has two child
pointers, "left" and "right", pointing to threads with later timeouts.
Each node in the tree is numbered with an integer index, corresponding
to the array index in an array-based heap, and the tree is kept fully
balanced and left-adjusted at all times. In other words, the tree
consists of any number of fully populated top levels, followed by a
single bottom level which may be partially populated, such that any
existing nodes form a contiguous block to the left and the spaces for
missing nodes form a contiguous block to the right. For example, if
there are nine threads waiting for a timeout, they are numbered and
arranged in a tree exactly as follows:

                    1
                  /   \
                2       3
               / \     / \
              4   5   6   7
             / \
            8   9

Each node has either no children, only a left child, or both a left
and a right child. Children always time out later than their parents
(this is called the "heap invariant"), but when a node has two
children, their mutual order is unspecified - the left child may time
out before or after the right child. If a node is numbered N, its
left child is numbered 2N, and its right child is numbered 2N+1.
There is no pointer from a child to its parent; all pointers point
downward. Additions and deletions both work by starting at the root
and traversing the tree towards the leaves, going left or right
according to the binary digits forming the index of the destination
node. As nodes are added or deleted, existing nodes are rearranged to
maintain the heap invariant.
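
The index-driven traversal described above can be sketched in C as
follows. This is only an illustration under stated assumptions, not the
library's actual code: the names struct tnode, struct theap, path_bits,
theap_find and theap_insert are hypothetical, and the sketch keeps the
node count in a small wrapper structure instead of using _st_thread_t
and _ST_SLEEPQ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a sleeping thread object. */
struct tnode {
    long due;                     /* absolute timeout */
    struct tnode *left, *right;   /* children time out no earlier than this node */
};

struct theap {
    struct tnode *root;           /* node with the earliest timeout */
    unsigned count;               /* number of nodes; also the index of the last node */
};

/* Number of binary digits below the leading 1 bit of index (index >= 1). */
static int path_bits(unsigned index)
{
    int bits = 0;
    while (index > 1) {
        index >>= 1;
        bits++;
    }
    return bits;
}

/* Find node number `index` by walking down from the root:
 * each bit after the leading 1 selects left (0) or right (1). */
static struct tnode *theap_find(const struct theap *h, unsigned index)
{
    struct tnode *n = h->root;
    int bit;

    if (index == 0 || index > h->count)
        return NULL;
    for (bit = path_bits(index) - 1; bit >= 0 && n != NULL; bit--)
        n = ((index >> bit) & 1u) ? n->right : n->left;
    return n;
}

/* Insert `node` at index count+1. While walking towards that slot, the node
 * being carried takes over any position whose occupant times out later, and
 * the displaced occupant is carried further down instead. */
static void theap_insert(struct theap *h, struct tnode *node)
{
    unsigned target = h->count + 1;
    struct tnode **link = &h->root;
    int bit;

    for (bit = path_bits(target) - 1; bit >= 0; bit--) {
        struct tnode *cur = *link;
        assert(cur != NULL);
        if (node->due < cur->due) {
            node->left = cur->left;
            node->right = cur->right;
            *link = node;
            node = cur;
        }
        link = ((target >> bit) & 1u) ? &(*link)->right : &(*link)->left;
    }
    node->left = node->right = NULL;
    *link = node;
    h->count = target;
}
```

For index 9 (binary 1001) the bits after the leading 1 are 0, 0, 1, so
the walk goes left, left, right, which matches the position of node 9
in the diagram above.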