How to optimize code for MMX processors.




JPEG compression has done bad things to this image; click on it to see a PNG of the same scene (288K).

Table of Contents:

  1. Caveat
  2. Who this document is for
  3. When to use MMX
  4. When not to use MMX
  5. Assumptions
  6. MMX registers and instructions
  7. Getting started
  8. Converting to MMX
  9. Optimization
  10. Further possibilities for improvement
  11. Troubleshooting
  12. Contact info

0.

Caveat

I wrote this document in 2000. Since then, everything has probably changed. So, if this document conflics with what you have learned elsewhere, trust them, not me.

1.

Who this document is for:

This document is for you if you are a programmer who needs speed in your application.

I will assume that you have a basic knowledge of x86 assembly language. (skip this) Many people have not used the GNU assembler; I will discuss briefly how its syntax differs from traditional Intel style assemblers:

Intel:

	add  (edx), eax
GNU (AT&T):
	addl %eax, (%edx)

The order of arguments is reversed; registers have a % in front of them. Constants start with $. All instructions end with a single letter representing their type: l = 32 bits, w = 16 bits, b = 8 bits

Intel:

	add ebx:[ecx * 4 - 4], $8
GNU (AT&T)
addl $8, -4(%ebx, %ecx, 1)
GNU syntax is easy to get used to, and after a few minutes, you won't even notice that you are using it. To assemble a source file, just use GCC on it as if it were a C program:
	gcc -l baz -o foo foo.s foobar.c



2.

When to use MMX

MMX is good for low-level pixel manipulation, digital signal processing, 3D geometry, and much more. Since I am a 2d graphics hacker, this document will discuss only the first. For specific algorithms in other areas, see Intel's web site1.

Use MMX only when you need the speed. To turn a function into an fully optimized MMX function will take a few hours. It is not worthwhile for code that will only be called occasionally. Make sure that you are using the best algorithm before you convert a function to MMX; if you change the algorithm, you'll have to start the conversion more or less from scratch.




3.

When not to use MMX

MMX is integer only. If you need floating point accuracy, you can use SSE (available only on Pentium III processors, and not discussed in this document).

MMX will not speed up memory accesses. If your algorithm requires frequent random-access memory lookups, it will likely not benefit from MMX.

MMX cannot be mixed with floating point code. It borrows the floating point registers for itself.



4.

Assumptions in this document:

a. 32 bit color will be used. 32 bit color is in the format ABGR. The A is not used for anything by default (although it can be).

b. All data will be aligned properly. GCC will do this for you with -O.

c. Images will have an even number of pixels in the X direction.

d. For performance discussions, all data is assumed to be in the L1 cache.



5.

MMX registers and instructions:

MMX operates on 64 bit registers (mm0 - mm7). The MMX registers are actually the floating point registers. To avoid horrible bugs, use the emms instruction at the end of your MMX code.

The registers can each be divided into 8 bytes, 4 words, 2 doublewords, or left undivided as 1 quadword. The MMX instruction set consists of 56 instructions which operate on MMX registers, plus emms. A good reference for the MMX instruction set is available on Intel's web site2.

MMX instructions (except for movd, movq, and emms) start with p and end with a letter representing the data type they operate on. For example, paddw adds words. Some MMX instructions convert types; they have a longer type description. Packsswb and punpckldw are examples of this. Packsswb converts words to bytes, and punpckldw converts doublewords to words. Some instructions end with ss or us; these mean signed or unsigned saturation. Saturation means that values are clamped to a max rather than overflowing, so that if a and b are bytes containing 100 and 200, adding with unsigned saturation will produce 255, not 44.




6.

How to begin:

Start by writing your function in C. Compile it to assembly language source using gcc -S -O3 foo.c -o foo.s

For this example, put just the function to be optimized in foo.c.

Mostly MMX is used for texturing, lighting, or blending effects. This means that you'll have a loop that looks like this:

  for (y = 0; y < ysize; y ++) {
    for (x = 0; x < xsize; x ++) {

	....

    }
   }

Locate the inner loop in the assembly language output and try to identify what gcc has done with the parameters and local variables. An inner loop might look like this:

.L10:
	movl 32(%ebp),%ecx
	movl (%ebx,%ecx,4),%eax

	...

	movl (%edx), %eax

	...  really long section of code with 
	          lots of shifts and the like ...

	movl %eax,(%edx)

	addl $4,%edx
	incl %ecx

	cmpl %edi,%ecx
	jl .L10

Let's look at the last 4 lines first:

First, edx, which is a pointer into the destination buffer, is moved ahead by one pixel. Similarly, ecx is incremented. The jump to the top is not at all out of the ordinary.

The last line before the loop control stuff usually puts the final product into the destination buffer. This shows us what register is used to index that buffer. Likewise, one of the first lines usually grabs data from one of the source buffers. This example uses 2 source buffers, one of which is also the destination.



7.

MMX conversion:

Once you have located these buffers, delete every line but those that get data from the source buffers or put it into the destination buffers (also save any instructions they need to set up pointers to these buffers). You will be re-writing the algorithm to operate in parallel. Alter the remaining instructions to use MMX registers.

For the code listing presented above:

.before_loop:
	movl 32(%ebp),%ecx

.L10:

	movd (%ebx,%ecx,4),%mm0

	movd (%edx), %mm1

	movd %mm1,(%edx)

	addl $4,%edx
	incl %ecx

	cmpl %edi,%ecx
	jl .L10

At this point, it is worthwhile to consider how to work with your data in parallel. Some algorithms can operate on 2 pixels with each MMX instruction. Usually these are algorithms that do not involve multiplication. One example of this is simple sprite transparency (not translucency). Mostly, algorithms will operate on one pixel per instruction. MMX has instructions to pack and unpack data; these are critical to one-pixel-per-instruction applications.

An example of a one-pixel-per-instruction algorithm is tinted lighting. This takes a light buffer, a tint color, and a canvas, and writes back to the canvas. The image at the top of this document is from a program which uses this effect. The effect shown in the image is based on an effect from the game Nox (but improved). I don't know for sure how they do it, but the listing it 8 (h) is how I did it.

Near the beginning of the function, the tint should be unpacked:

	movd 16(%ebp), %mm6
	punpcklbw %mm7, %mm6

Mm7 is zero in this case; often it is useful to keep one register as zero for unpack operations. The result of this for an orange tint is :

(before)	16(%ebp) = |00000000|11111111|10000000|01000000|

 (after)	%mm6 = 
|00000000|00000000|00000000|11111111|00000000|10000000|00000000|01000000|

Our listing with unpacking and repacking will be:

.L10:

	movd (%ebx,%ecx,4),%mm0
	punpcklbw %mm7, %mm0

	movd (%edx),%mm1
	punpcklbw %mm7, %mm1



	packuswb %mm7, %mm1
	movd %mm1,(%edx)

	addl $4,%edx
	incl %ecx

	cmpl %edi,%ecx
	jl .L10

Notice that this still doesn't have code to do anything. Here is a version with such code:

.L10:

	movd (%ebx,%ecx,4),%mm0
	punpcklbw %mm7, %mm0

	movd (%edx,4),%mm1

	pmullw %mm6, %mm0
	psrlw $8, %mm0
	packuswb %mm7, %mm0
	paddusb %mm0, %mm1

	movd %mm1,(%edx)

	addl $4,%edx
	incl %ecx

	cmpl %edi,%ecx
	jl .L10

In this case, it seems that mm1 doesn't need to be unpacked or repacked (since it isn't being multiplied), so those instructions were removed.

I will now discuss briefly how the algorithm works (skip this).

	movd (%ebx,%ecx,4),%mm0
	punpcklbw %mm7, %mm0

(before)	%mm0 = 
|00000000|00000000|00000000|00000000|00000000|11111111|10000000|01000000|

 (after)	%mm0 = 
|00000000|00000000|00000000|11111111|00000000|10000000|00000000|01000000|
mm0 (in this case, representing a brownish color) is unpacked so that it can be treated as 4 16-bit words.
	movd (%edx,4),%mm1

 (after)	%mm1 = 
|00000000|00000000|00000000|00000000|00000000|11111111|11110000|01000000|
mm1 gets the old value of this pixel on the canvas (cyan).
	pmullw %mm6, %mm0

(before)	%mm6 = 
|00000000|00000000|00000000|01000000|00000000|01000000|00000000|01000000|
		%mm0 = 
|00000000|00000000|00000000|00000000|00000000|11111111|00000000|01000000|


 (after)	%mm0 = 
|00000000|00000000|00000000|00000000|01111111|10000000|00010000|00000000|

mm6 (the tint) is multiplied by mm0, the lightfield. In this case, the tint is a dark gray.
	psrlw $8, %mm0

(before)	%mm0 = 
|00000000|00000000|00000000|00000000|01111111|10000000|00010000|00000000|

 (after)	%mm0 = 
|00000000|00000000|00000000|00000000|00000000|01111111|10000000|00010000|
mm0 is shifted down 8 places. This is because each channel (R, G, B) of the tint ranged from 0 to 255, as did the light field. This gave us a total range of 16 bits. Only the most significant 8 are needed, so mm0 is shifted.
	packuswb %mm7, %mm0


(before)	%mm0 = 
|00000000|00000000|00000000|00000000|00000000|01111111|10000000|00010000|

 (after)	%mm0 = 
|00000000|00000000|00000000|00000000|00000000|00000000|01111111|00010000|
mm0 is repacked. Now the top 4 bytes are 0 and the bottom 4 are ABGR again.
	paddusb %mm0, %mm1


 (before)	%mm0 = 
|00000000|00000000|00000000|00000000|00000000|00000000|01111111|00010000|
		%mm1 =
|00000000|00000000|00000000|00000000|00000000|11111111|11110000|01000000|


 (after)	%mm1 =
|00000000|00000000|00000000|00000000|00000000|11111111|11111111|01010000|

mm0 and mm1 are added with unsigned saturation. This means that values over 256, are clamped to 256. You can see this in the green channel.
	movd %mm1,(%edx)
This line puts the result of the addition back onto the canvas. The last four lines increment the index registers and loop.

This now does exactly the same thing as the 30 instructions that it replaces. It is slightly faster than the non-MMX version, but can still be improved.



8.

Optimization

a.

Analyze the running time of the inner loop; consider the time taken by each instruction. The Pentium processors have 2 pipelines; certain instructions can only be issued in the U pipeline, and certain instructions can only be issued in the V pipeline. Most MMX instructions can be issued in either. Some instructions cannot be executed at the same time as others; for a full list of pairing rules, see:

http://fatphil.org/x86/pentopt/28.html

Listing with instruction timing: Instructions that pair are shown together.

.L10:

	movd (%ebx,%ecx,4),%mm0	;1 and used.  This doesn't pair since

	punpcklbw %mm7, %mm0	;2 mm0 is assigned and used

	movd (%edx),%mm1	;3 memory accesses must be in the U pipe
	pmullw %mm6, %mm0	;   MMX multiplies take 3 clocks

				;4 It would be good to fill up this 
				;5 time somehow.

	psrlw $8, %mm0		;6  this and each of the next 3 lines

	packuswb %mm7, %mm0	;7  depend on the last;  
 
	paddusb %mm0, %mm1	;8  no pairing can happen here.

				;9 stall:  store must wait 2 cycles 
				;  after update

	movd %mm1,(%edx)	;10

	addl $4,%edx		;11
	incl %ecx		; 

	cmpl %edi,%ecx		;12 
	jl .L10			; 

This listing takes 12 clocks per pixel.

b.

Some code turns out to be redundant:

	movd (%edx,4),%mm1
	...
	paddusb %mm0,%mm1

can be replaced with:

	paddusb (%edx,4),%mm0

This accomplishes the same thing with fewer cycles and fewer registers

Simple rearrangement yields:

.L10:
	movd (%ebx,%ecx,4),%mm0	;1

	punpcklbw %mm7, %mm0	;2

	pmullw %mm6, %mm0	;3

	incl %ecx		;4
 	addl $4,%edx		;

				;5

	psrlw $8, %mm0		;6

	packuswb %mm7,%mm0	;7

	paddusb (%edx,4),%mm0	;8

				;9

	movd %mm0,-4(%edx)	;10

	cmpl %edi,%ecx		;11
	jl .L10			; 

This saves only one instruction. One reason that this gives such a small change is that the algorithm is very linear; each step depends on the last.

c.

Since all of our data has an even number of pixels in the x dimension, unrolling the loop will be simple, and could give better performance e: Simple unroll:

.L10:

	movd (%ebx,%ecx,4),%mm0	;1

	punpcklbw %mm7, %mm0	;2

	pmullw %mm6, %mm0	;3

	incl %ecx		;4
 	addl $4,%edx		;

				;5

	psrlw $8, %mm0		;6

	packuswb %mm7,%mm0	;7

	paddusb (%edx,4),%mm0	;8

				;9

	movd %mm0,-4(%edx)	;10


	movd (%ebx,%ecx,4),%mm0	;11

	punpcklbw %mm7, %mm0	;12

	pmullw %mm6, %mm0	;13

	incl %ecx		;14
 	addl $4,%edx		;

				;15

	psrlw $8, %mm0		;16

	packuswb %mm7,%mm0	;17

	paddusb (%edx,4),%mm0	;18

				;19

	movd %mm0,-4(%edx)	;20

	cmpl %edi,%ecx		;21
	jl .L10			; 
(10.5 clocks per pixel). The only difference is that the loop instructions are only done once every other pixel, rather than every pixel. It would be simple to make the index increments do twice as much, and reduce the number of clocks to 10 per pixel. This will be done later.

d.

The next step is to interleave the calculation of the first pixel with the calculation of the second. Generally, a good strategy is to move the first instruction of the second pixel up to below the first instruction it can pair with, and continue with each of the instructions of the second pixel in a linear manner. Note how the instructions of the second pixel fill up the delay that the first multiply causes. Be careful when moving lines with memory accesses; If you move one across an increment, you will need to change the index of your access.
.L10:
	movd (%ebx,%ecx,4),%mm0	;1

	movd 4(%ebx,%ecx,4),%mm1;2
	punpcklbw %mm7, %mm0	;

	pmullw %mm6, %mm0	;3  mm0 is next used in clock 6
	punpcklbw %mm7, %mm1	;

 	addl $4,%edx		;4
	pmullw %mm6, %mm1	;  mm1 is next used in clock 8

	incl %ecx		;5
	addl $4,%edx		;

	psrlw $8, %mm0		;6
 	incl %ecx		;

	packuswb %mm7,%mm0	;7

	paddusb -8(%edx),%mm0	;8
	packuswb %mm7,%mm1	;

	psrlw $8, %mm1		;9

	movd %mm0,-8(%edx)	;10
	
	paddusb -4(%edx),%mm1	;11

				;12

	movd %mm1,-4(%edx)	;13

	cmpl %edi,%ecx		;14
	jl .L10			; 

This is certainly an improvement, at 7 clocks per pixel, but the tail end of the function has a bunch of non-paired instructions, and there are still 4 index increment commands when there should be 2. There is also a space to fill between the multiply of mm0 and its next use, as well as 2 gaps between writes to registers and their stores to memory.

e.

Our next step is to remove the extra increment commands and move the first 5 lines of the loop to the end:

.before_loop:
	movd (%ebx,%ecx,4),%mm0	

	movd 4(%ebx,%ecx,4),%mm1
	punpcklbw %mm7, %mm0	

	pmullw %mm6, %mm0
	punpcklbw %mm7, %mm1
.L10


	pmullw %mm6, %mm1	;1
	psrlw $8, %mm0		;

	packuswb %mm7,%mm0	;2
 	addl $8,%edx		;

	paddusb -8(%edx),%mm0	;3
	addl $2, %ecx		;  

	psrlw $8, %mm1		;4

	movd %mm0,-8(%edx)	;5
	packuswb %mm7,%mm1	;

	movd (%ebx,%ecx,4),%mm0	;6   Moved from top

	paddusb -4(%edx),%mm1	;7
	punpcklbw %mm7, %mm0	;    Moved from top

	pmullw %mm6, %mm0	;8   Moved from top

	movd %mm1,-4(%edx)	;9

	movd 4(%ebx,%ecx,4),%mm1;10  Moved from top

	punpcklbw %mm7, %mm1	;11  Moved from top

	cmpl %edi,%ecx		;12
	jl .L10			; 

This takes us down to 6 clocks per pixel. There are now 6 un-paired instructions, but no stalls.

Continuing to move instructions from the top to the bottom will not be productive; there appears to be no way to rearrange the instructions to fill the holes left by the stalls. One reason for this is the high percentage of memory accesses. Since MMX instructions that access memory can only be issued in the U pipe, only one may be issued per cycle. This severely limits pairing opportunities.

f.

Another unrolling (4 pixels per iteration) would make the loop almost unmanageable. Nonetheless, to get to the optimal performance level, it is necessary. I will start with the code from d, since it is doesn't require priming (code before the loop).

Basic unrolling:

.L10:
	movd (%ebx,%ecx,4),%mm0	

	movd 4(%ebx,%ecx,4),%mm1
	punpcklbw %mm7, %mm0	

	pmullw %mm6, %mm0	
	punpcklbw %mm7, %mm1	

 	addl $16,%edx		
	pmullw %mm6, %mm1	

	psrlw $8, %mm0		
 	addl $4, %ecx		

	packuswb %mm7,%mm0	

	paddusb -16(%edx),%mm0	
	packuswb %mm7,%mm1	

	psrlw $8, %mm1		

	movd %mm0,-16(%edx)	
	
	paddusb -12(%edx),%mm1	

	movd %mm1,-12(%edx)

	movd -8(%ebx,%ecx,4),%mm2

	movd -4(%ebx,%ecx,4),%mm3
	punpcklbw %mm7, %mm2	

	pmullw %mm6, %mm2	
	punpcklbw %mm7, %mm3	

	pmullw %mm6, %mm3	


	psrlw $8, %mm2		

	packuswb %mm7,%mm2	

	paddusb -8(%edx),%mm2	
	packuswb %mm7,%mm3	

	psrlw $8, %mm3		

	movd %mm2,-8(%edx)	
	
	paddusb -4(%edx),%mm3	

	movd %mm3,-4(%edx)	

	cmpl %edi,%ecx		
	jl .L10			

g.

Interleaving:

When the loop was unrolled the first time, code was interleaved from the first and second pixels. This time, interleaving will be easier, since there are more choices of instructions. The strict form of this strategy, where instructions from each concurrent instruction stream (in this case, each of the four pixels) are interleaved per-instruction, is called pipelining. I chose not to follow this strategy precisely, since it tends to leave the top and bottom of the loop unpaired. Instead, I just moved instructions from the second unrolling to fill out places that looked a little loose.

.L10:
	movd (%ebx,%ecx,4),%mm0	 ;1

	movd 4(%ebx,%ecx,4),%mm1 ;2
	punpcklbw %mm7, %mm0	

	movd 8(%ebx,%ecx,4),%mm2 ;3
	pmullw %mm6, %mm0	

	movd 12(%ebx,%ecx,4),%mm3;4
	punpcklbw %mm7, %mm1	

 	addl $16,%edx		 ;5
	punpcklbw %mm7, %mm2	

	pmullw %mm6, %mm1	 ;6
	psrlw $8, %mm0		

	pmullw %mm6, %mm2	 ;7
	punpcklbw %mm7, %mm3

 	addl $4, %ecx		 ;8
	packuswb %mm7,%mm0

	paddusb -16(%edx),%mm0	 ;9
	packuswb %mm7,%mm1

	pmullw %mm6, %mm3	 ;10
	psrlw $8, %mm1		

	movd %mm0,-16(%edx)	 ;11
	psrlw $8, %mm2
	
	paddusb -12(%edx),%mm1	 ;12
	packuswb %mm7,%mm2	

	movd %mm1,-12(%edx)	 ;13
	packuswb %mm7,%mm3

	paddusb -8(%edx),%mm2	 ;14
	psrlw $8, %mm3	

	paddusb -4(%edx),%mm3	 ;15				 ;

	movd %mm2,-8(%edx)	 ;16
	

				 ;17

	movd %mm3,-4(%edx)	 ;18

	cmpl %edi,%ecx		 ;19
	jl .L10		

(4.75 clocks per pixel)

Because the interleaving was performed from the top down, there are still a few unpaired lines at the end. The index increments pair with most things; in this listing they sit wasted at the center.


h.

Perfect pairing: Compared to the convolutions needed to shave 1 cycle off d to get e, this was a piece of cake.
.before_loop:
	movd (%ebx,%ecx,4), %mm0

	movd 4(%ebx,%ecx,4), %mm1
	punpcklbw %mm7, %mm0

	pmullw %mm6, %mm0
	punpcklbw %mm7, %mm1

.L10
	movd 8(%ebx,%ecx,4), %mm2
	pmullw %mm6, %mm1
	
	movd 12(%ebx,%ecx,4), %mm3
	punpcklbw %mm7, %mm2

	psrlw $8, %mm0
	punpcklbw %mm7, %mm3
		
	packuswb %mm7, %mm0
	pmullw %mm6, %mm2
	
	paddusb (%edx), %mm0	
	psrlw $8, %mm1	
	
	packuswb %mm7, %mm1
	pmullw %mm6, %mm3
	
	paddusb 4(%edx), %mm1
	psrlw $8, %mm2
	
	movd %mm0, (%edx)
	packuswb %mm7, %mm2
	
	movd %mm1, 4(%edx)
	psrlw $8, %mm3
	
	paddusb 8(%edx), %mm2
	packuswb %mm7, %mm3
	
	paddusb 12(%edx), %mm3
	addl $4, %ecx
	
	movd (%ebx,%ecx,4), %mm0
	addl $16, %edx
	
	movd %mm2, -8(%edx)
	punpcklbw %mm7, %mm0	
	
	movd 4(%ebx,%ecx,4), %mm1
	pmullw %mm6, %mm0
	
	movd %mm3, -4(%edx)
	punpcklbw %mm7, %mm1
	
	cmpl %edi, %ecx
	jl .L10

	(16 clocks = 4 clocks per pixel).

The pairing on this listing is perfect. Except by using a better algorithm, this listing will be very difficult to improve.




9.

Further possibilities for improvement:

In one of my applications, the width of a light field is 8 pixels. I can unroll the loop one level further, re-shuffle a little, and eliminate the loop entirely.

If the tint has only a few possible values, the light buffer can be pre-tinted in each of these, and stored. Then there is no further need for multiplies, and thus no need for unpacking. Now 2 pixels can be done at once:

.L10:

	movq (%ebx,%ecx,4),%mm0
	addl $8,%edx

	paddusb -8(%edx,4), %mm0
	addl $2, %ecx

	;stall

	movq %mm0, -8(%edx,4)

	cmpl %edi,%ecx
	jl .L10
(2.5 clocks per pixel).

Note that the above algorithm will not benefit much from unrolling. Since all of the instruction but the loop instructions require memory accesses, they cannot be paired. This eliminates most of our optimization opportunities. Still, the stall can be removed:

.before_loop:
	movq (%ebx,%ecx,4),%mm0
	addl $16,%edx

.L10:

	paddusb (%edx,4), %mm0
	addl $4, %ecx

	movq %mm0, 8(%edx,4)

	movq (%ebx,%ecx,4),%mm1

	paddusb 8(%edx,4), %mm1

	movq (%ebx,%ecx,4),%mm0
	addl $16,%edx

	movq %mm1, -8(%edx,4)
	
	cmpl %edi,%ecx
	jl .L10
(1.75 clocks per pixel)


10.

Common optimization errors:

When moving instructions, be careful. You may move an instruction before one it depends on. If you are moving an instruction with a memory access and you pass an instruction that changes its index, don't forget to alter the memory access. Example: Before:


	addl %8, %edx
	movl %eax, (%edx)
After:

	movl %eax, 8(%edx)
	addl %8, %edx

Do not forget emms. It is crucial, and omitting it causes subtle floating point bugs that will cause you to tear your hair out.



Well, you've made it to the end. Hopefully, you'll make some great MMX routines and share them with the world. If I can improve this document in any way, feel free to write to me. Let me know about broken links, especially if you can suggest resources to replace them. Better yet, write your own replacements and license them under the FDL.


Notes:

1 It used to be at http://developer.intel.com/drg/mmx/AppNotes/, but Intel has rearranged their site around several times since then. Anyway, their (new?) license agreement sucks.

2 It used to be at http://developer.intel.com/drg/mmx/Manuals/prm/prm_chp5.htm,




Copyright 2000, 2001, 2002 David M. Turner.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation; with Invariant Sections being The GNU Free Documentation License, no Front-Cover Texts, and no Back-Cover texts. A copy of the license is included in the section entitled "GNU Free Documentation License".

Code snippets (included within <pre> tags) are released to the public domain (they are probably too small to copyright anyway)

A copy of the GNU Free Documentation License is available at http://novalis.org/documents/fdl.txt

Good luck!

- David "Novalis" Turner

novalis@novalis.org

www.novalis.org

Novalis on irc.worldforge.org