I wrote this document in 2000. Since then, everything has probably changed. So, if this document conflics with what you have learned elsewhere, trust them, not me.
This document is for you if you are a programmer who needs speed in your application.
I will assume that you have a basic knowledge of x86 assembly language. (skip this) Many people have not used the GNU assembler; I will discuss briefly how its syntax differs from traditional Intel style assemblers:
Intel:
add (edx), eaxGNU (AT&T):
addl %eax, (%edx)
The order of arguments is reversed; registers have a % in front of them. Constants start with $. All instructions end with a single letter representing their type: l = 32 bits, w = 16 bits, b = 8 bits
Intel:
add ebx:[ecx * 4 - 4], $8GNU (AT&T)
addl $8, -4(%ebx, %ecx, 1)GNU syntax is easy to get used to, and after a few minutes, you won't even notice that you are using it. To assemble a source file, just use GCC on it as if it were a C program:
gcc -l baz -o foo foo.s foobar.c
MMX is good for low-level pixel manipulation, digital signal processing, 3D geometry, and much more. Since I am a 2d graphics hacker, this document will discuss only the first. For specific algorithms in other areas, see Intel's web site1.
Use MMX only when you need the speed. To turn a function into an fully optimized MMX function will take a few hours. It is not worthwhile for code that will only be called occasionally. Make sure that you are using the best algorithm before you convert a function to MMX; if you change the algorithm, you'll have to start the conversion more or less from scratch.
MMX is integer only. If you need floating point accuracy, you can use SSE (available only on Pentium III processors, and not discussed in this document).
MMX will not speed up memory accesses. If your algorithm requires frequent random-access memory lookups, it will likely not benefit from MMX.
MMX cannot be mixed with floating point code. It borrows
the floating point registers for itself.
a. 32 bit color will be used. 32 bit color is in the format ABGR. The A is not used for anything by default (although it can be).
b. All data will be aligned properly. GCC will do this for you with -O.
c. Images will have an even number of pixels in the X direction.
d. For performance discussions, all data is assumed to be
in the L1 cache.
MMX operates on 64 bit registers (mm0 - mm7). The MMX registers are actually the floating point registers. To avoid horrible bugs, use the emms instruction at the end of your MMX code.
The registers can each be divided into 8 bytes, 4 words, 2 doublewords, or left undivided as 1 quadword. The MMX instruction set consists of 56 instructions which operate on MMX registers, plus emms. A good reference for the MMX instruction set is available on Intel's web site2.
MMX instructions (except for movd, movq, and emms) start with p and end with a letter representing the data type they operate on. For example, paddw adds words. Some MMX instructions convert types; they have a longer type description. Packsswb and punpckldw are examples of this. Packsswb converts words to bytes, and punpckldw converts doublewords to words. Some instructions end with ss or us; these mean signed or unsigned saturation. Saturation means that values are clamped to a max rather than overflowing, so that if a and b are bytes containing 100 and 200, adding with unsigned saturation will produce 255, not 44.
Start by writing your function in C. Compile it to assembly language source using gcc -S -O3 foo.c -o foo.s
For this example, put just the function to be optimized in foo.c.
Mostly MMX is used for texturing, lighting, or blending effects. This means that you'll have a loop that looks like this:
for (y = 0; y < ysize; y ++) { for (x = 0; x < xsize; x ++) { .... } }
Locate the inner loop in the assembly language output and try to identify what gcc has done with the parameters and local variables. An inner loop might look like this:
.L10: movl 32(%ebp),%ecx movl (%ebx,%ecx,4),%eax ... movl (%edx), %eax ... really long section of code with lots of shifts and the like ... movl %eax,(%edx) addl $4,%edx incl %ecx cmpl %edi,%ecx jl .L10
Let's look at the last 4 lines first:
First, edx, which is a pointer into the destination buffer, is moved ahead by one pixel. Similarly, ecx is incremented. The jump to the top is not at all out of the ordinary.
The last line before the loop control stuff usually puts the final product into the destination buffer. This shows us what register is used to index that buffer. Likewise, one of the first lines usually grabs data from one of the source buffers. This example uses 2 source buffers, one of which is also the destination.
Once you have located these buffers, delete every line but those that get data from the source buffers or put it into the destination buffers (also save any instructions they need to set up pointers to these buffers). You will be re-writing the algorithm to operate in parallel. Alter the remaining instructions to use MMX registers.
For the code listing presented above:
.before_loop: movl 32(%ebp),%ecx .L10: movd (%ebx,%ecx,4),%mm0 movd (%edx), %mm1 movd %mm1,(%edx) addl $4,%edx incl %ecx cmpl %edi,%ecx jl .L10
At this point, it is worthwhile to consider how to work with your data in parallel. Some algorithms can operate on 2 pixels with each MMX instruction. Usually these are algorithms that do not involve multiplication. One example of this is simple sprite transparency (not translucency). Mostly, algorithms will operate on one pixel per instruction. MMX has instructions to pack and unpack data; these are critical to one-pixel-per-instruction applications.
An example of a one-pixel-per-instruction algorithm is tinted lighting. This takes a light buffer, a tint color, and a canvas, and writes back to the canvas. The image at the top of this document is from a program which uses this effect. The effect shown in the image is based on an effect from the game Nox (but improved). I don't know for sure how they do it, but the listing it 8 (h) is how I did it.
Near the beginning of the function, the tint should be unpacked:
movd 16(%ebp), %mm6 punpcklbw %mm7, %mm6
Mm7 is zero in this case; often it is useful to keep one register as zero for unpack operations. The result of this for an orange tint is :
(before) 16(%ebp) = |00000000|11111111|10000000|01000000| (after) %mm6 = |00000000|00000000|00000000|11111111|00000000|10000000|00000000|01000000|
Our listing with unpacking and repacking will be:
.L10: movd (%ebx,%ecx,4),%mm0 punpcklbw %mm7, %mm0 movd (%edx),%mm1 punpcklbw %mm7, %mm1 packuswb %mm7, %mm1 movd %mm1,(%edx) addl $4,%edx incl %ecx cmpl %edi,%ecx jl .L10
Notice that this still doesn't have code to do anything. Here is a version with such code:
.L10: movd (%ebx,%ecx,4),%mm0 punpcklbw %mm7, %mm0 movd (%edx,4),%mm1 pmullw %mm6, %mm0 psrlw $8, %mm0 packuswb %mm7, %mm0 paddusb %mm0, %mm1 movd %mm1,(%edx) addl $4,%edx incl %ecx cmpl %edi,%ecx jl .L10
In this case, it seems that mm1 doesn't need to be unpacked or repacked (since it isn't being multiplied), so those instructions were removed.
I will now discuss briefly how the algorithm works (skip this).
movd (%ebx,%ecx,4),%mm0 punpcklbw %mm7, %mm0 (before) %mm0 = |00000000|00000000|00000000|00000000|00000000|11111111|10000000|01000000| (after) %mm0 = |00000000|00000000|00000000|11111111|00000000|10000000|00000000|01000000|mm0 (in this case, representing a brownish color) is unpacked so that it can be treated as 4 16-bit words.
movd (%edx,4),%mm1 (after) %mm1 = |00000000|00000000|00000000|00000000|00000000|11111111|11110000|01000000|mm1 gets the old value of this pixel on the canvas (cyan).
pmullw %mm6, %mm0 (before) %mm6 = |00000000|00000000|00000000|01000000|00000000|01000000|00000000|01000000| %mm0 = |00000000|00000000|00000000|00000000|00000000|11111111|00000000|01000000| (after) %mm0 = |00000000|00000000|00000000|00000000|01111111|10000000|00010000|00000000|mm6 (the tint) is multiplied by mm0, the lightfield. In this case, the tint is a dark gray.
psrlw $8, %mm0 (before) %mm0 = |00000000|00000000|00000000|00000000|01111111|10000000|00010000|00000000| (after) %mm0 = |00000000|00000000|00000000|00000000|00000000|01111111|10000000|00010000|mm0 is shifted down 8 places. This is because each channel (R, G, B) of the tint ranged from 0 to 255, as did the light field. This gave us a total range of 16 bits. Only the most significant 8 are needed, so mm0 is shifted.
packuswb %mm7, %mm0 (before) %mm0 = |00000000|00000000|00000000|00000000|00000000|01111111|10000000|00010000| (after) %mm0 = |00000000|00000000|00000000|00000000|00000000|00000000|01111111|00010000|mm0 is repacked. Now the top 4 bytes are 0 and the bottom 4 are ABGR again.
paddusb %mm0, %mm1 (before) %mm0 = |00000000|00000000|00000000|00000000|00000000|00000000|01111111|00010000| %mm1 = |00000000|00000000|00000000|00000000|00000000|11111111|11110000|01000000| (after) %mm1 = |00000000|00000000|00000000|00000000|00000000|11111111|11111111|01010000|mm0 and mm1 are added with unsigned saturation. This means that values over 256, are clamped to 256. You can see this in the green channel.
movd %mm1,(%edx)This line puts the result of the addition back onto the canvas. The last four lines increment the index registers and loop.
This now does exactly the same thing as the 30 instructions that it replaces. It is slightly faster than the non-MMX version, but can still be improved.
Analyze the running time of the inner loop; consider the time taken by each instruction. The Pentium processors have 2 pipelines; certain instructions can only be issued in the U pipeline, and certain instructions can only be issued in the V pipeline. Most MMX instructions can be issued in either. Some instructions cannot be executed at the same time as others; for a full list of pairing rules, see:
http://fatphil.org/x86/pentopt/28.html
Listing with instruction timing: Instructions that pair are shown together.
.L10: movd (%ebx,%ecx,4),%mm0 ;1 and used. This doesn't pair since punpcklbw %mm7, %mm0 ;2 mm0 is assigned and used movd (%edx),%mm1 ;3 memory accesses must be in the U pipe pmullw %mm6, %mm0 ; MMX multiplies take 3 clocks ;4 It would be good to fill up this ;5 time somehow. psrlw $8, %mm0 ;6 this and each of the next 3 lines packuswb %mm7, %mm0 ;7 depend on the last; paddusb %mm0, %mm1 ;8 no pairing can happen here. ;9 stall: store must wait 2 cycles ; after update movd %mm1,(%edx) ;10 addl $4,%edx ;11 incl %ecx ; cmpl %edi,%ecx ;12 jl .L10 ;
This listing takes 12 clocks per pixel. Some code turns out to be redundant:
can be replaced with:
This accomplishes the same thing with fewer cycles and fewer registers
Simple rearrangement yields:
This saves only one instruction. One reason that this gives such a small change is that the algorithm is very linear; each step depends on the last.
Since all of our data has an even number of pixels in the x dimension, unrolling the loop will be simple, and could give better performance
e:
Simple unroll:
Our next step is to remove the extra increment commands and move the first 5 lines of the loop to the end:
This takes us down to 6 clocks per pixel. There are now 6 un-paired instructions, but no stalls.
Continuing to move instructions from the top to the bottom will not be productive; there appears to be no way to rearrange the instructions to fill the holes left by the stalls. One reason for this is the high percentage of memory accesses. Since MMX instructions that access memory can only be issued in the U pipe, only one may be issued per cycle. This severely limits pairing opportunities.
Another unrolling (4 pixels per iteration) would make the loop almost unmanageable. Nonetheless, to get to the optimal performance level, it is necessary. I will start with the code from d, since it is doesn't require priming (code before the loop).
Basic unrolling:
When the loop was unrolled the first time, code was
interleaved from the first and second pixels. This time,
interleaving will be easier, since there are more choices of
instructions. The strict form of this strategy, where
instructions from each concurrent instruction stream (in
this case, each of the four pixels) are interleaved
per-instruction, is called pipelining. I chose not to follow
this strategy precisely, since it tends to leave the top and
bottom of the loop unpaired. Instead, I just moved
instructions from the second unrolling to fill out places
that looked a little loose. (4.75 clocks per pixel) Because the interleaving was performed from the top down, there are still a few unpaired lines at the end. The index increments pair with most things; in this listing they sit wasted at the center. The pairing on this listing is perfect. Except by using a better algorithm, this listing will be very difficult to improve. In one of my applications, the width of a light field is
8 pixels. I can unroll the loop one level further,
re-shuffle a little, and eliminate the loop entirely. If the tint has only a few possible values, the light
buffer can be pre-tinted in each of these, and stored. Then
there is no further need for multiplies, and thus no need
for unpacking. Now 2 pixels can be done at once: Note that the above algorithm will not benefit much from
unrolling. Since all of the instruction but the loop
instructions require memory accesses, they cannot be paired.
This eliminates most of our optimization opportunities.
Still, the stall can be removed:
When moving instructions, be careful. You may move an
instruction before one it depends on. If you are moving an
instruction with a memory access and you pass an instruction
that changes its index, don't forget to alter the memory
access. Example:
Before:
Do not forget emms. It is crucial, and omitting it
causes subtle floating point bugs that will cause you to
tear your hair out.
Well, you've made it to the end. Hopefully, you'll make
some great MMX routines and share them with the world. If I
can improve this document in any way, feel free to write to
me. Let me know about broken links, especially if you can
suggest resources to replace them. Better yet, write your
own replacements and license them under the FDL. 1 It used to be at
http://developer.intel.com/drg/mmx/AppNotes/, but Intel
has rearranged their site around several times since then.
Anyway, their (new?) license agreement sucks. 2 It used to be at
http://developer.intel.com/drg/mmx/Manuals/prm/prm_chp5.htm,
Copyright 2000, 2001, 2002 David M. Turner.
Code snippets (included within <pre> tags) are
released to the public domain (they are probably too small
to copyright anyway) A copy of the GNU Free Documentation License is available at
http://novalis.org/documents/fdl.txt
Good luck! - David "Novalis" Turner
Novalis on irc.worldforge.org
b.
movd (%edx,4),%mm1
...
paddusb %mm0,%mm1
paddusb (%edx,4),%mm0
.L10:
movd (%ebx,%ecx,4),%mm0 ;1
punpcklbw %mm7, %mm0 ;2
pmullw %mm6, %mm0 ;3
incl %ecx ;4
addl $4,%edx ;
;5
psrlw $8, %mm0 ;6
packuswb %mm7,%mm0 ;7
paddusb (%edx,4),%mm0 ;8
;9
movd %mm0,-4(%edx) ;10
cmpl %edi,%ecx ;11
jl .L10 ;
c.
.L10:
movd (%ebx,%ecx,4),%mm0 ;1
punpcklbw %mm7, %mm0 ;2
pmullw %mm6, %mm0 ;3
incl %ecx ;4
addl $4,%edx ;
;5
psrlw $8, %mm0 ;6
packuswb %mm7,%mm0 ;7
paddusb (%edx,4),%mm0 ;8
;9
movd %mm0,-4(%edx) ;10
movd (%ebx,%ecx,4),%mm0 ;11
punpcklbw %mm7, %mm0 ;12
pmullw %mm6, %mm0 ;13
incl %ecx ;14
addl $4,%edx ;
;15
psrlw $8, %mm0 ;16
packuswb %mm7,%mm0 ;17
paddusb (%edx,4),%mm0 ;18
;19
movd %mm0,-4(%edx) ;20
cmpl %edi,%ecx ;21
jl .L10 ;
(10.5 clocks per pixel). The only difference is that the loop instructions are only done once every other pixel, rather than every pixel. It would be simple to make the index increments do twice as much, and reduce the number of clocks to 10 per pixel. This will be done later.
d.
The next step is to interleave the calculation of the first pixel with the calculation of the second. Generally, a good strategy is to move the first instruction of the second pixel up to below the first instruction it can pair with, and continue with each of the instructions of the second pixel in a linear manner. Note how the instructions of the second pixel fill up the delay that the first multiply causes. Be careful when moving lines with memory accesses; If you move one across an increment, you will need to change the index of your access.
.L10:
movd (%ebx,%ecx,4),%mm0 ;1
movd 4(%ebx,%ecx,4),%mm1;2
punpcklbw %mm7, %mm0 ;
pmullw %mm6, %mm0 ;3 mm0 is next used in clock 6
punpcklbw %mm7, %mm1 ;
addl $4,%edx ;4
pmullw %mm6, %mm1 ; mm1 is next used in clock 8
incl %ecx ;5
addl $4,%edx ;
psrlw $8, %mm0 ;6
incl %ecx ;
packuswb %mm7,%mm0 ;7
paddusb -8(%edx),%mm0 ;8
packuswb %mm7,%mm1 ;
psrlw $8, %mm1 ;9
movd %mm0,-8(%edx) ;10
paddusb -4(%edx),%mm1 ;11
;12
movd %mm1,-4(%edx) ;13
cmpl %edi,%ecx ;14
jl .L10 ;
This is certainly an improvement, at 7 clocks per pixel, but the tail end of the function has a bunch of non-paired instructions, and there are still 4 index increment commands when there should be 2. There is also a space to fill between the multiply of mm0 and its next use, as well as 2 gaps between writes to registers and their stores to memory.
e.
.before_loop:
movd (%ebx,%ecx,4),%mm0
movd 4(%ebx,%ecx,4),%mm1
punpcklbw %mm7, %mm0
pmullw %mm6, %mm0
punpcklbw %mm7, %mm1
.L10
pmullw %mm6, %mm1 ;1
psrlw $8, %mm0 ;
packuswb %mm7,%mm0 ;2
addl $8,%edx ;
paddusb -8(%edx),%mm0 ;3
addl $2, %ecx ;
psrlw $8, %mm1 ;4
movd %mm0,-8(%edx) ;5
packuswb %mm7,%mm1 ;
movd (%ebx,%ecx,4),%mm0 ;6 Moved from top
paddusb -4(%edx),%mm1 ;7
punpcklbw %mm7, %mm0 ; Moved from top
pmullw %mm6, %mm0 ;8 Moved from top
movd %mm1,-4(%edx) ;9
movd 4(%ebx,%ecx,4),%mm1;10 Moved from top
punpcklbw %mm7, %mm1 ;11 Moved from top
cmpl %edi,%ecx ;12
jl .L10 ;
f.
.L10:
movd (%ebx,%ecx,4),%mm0
movd 4(%ebx,%ecx,4),%mm1
punpcklbw %mm7, %mm0
pmullw %mm6, %mm0
punpcklbw %mm7, %mm1
addl $16,%edx
pmullw %mm6, %mm1
psrlw $8, %mm0
addl $4, %ecx
packuswb %mm7,%mm0
paddusb -16(%edx),%mm0
packuswb %mm7,%mm1
psrlw $8, %mm1
movd %mm0,-16(%edx)
paddusb -12(%edx),%mm1
movd %mm1,-12(%edx)
movd -8(%ebx,%ecx,4),%mm2
movd -4(%ebx,%ecx,4),%mm3
punpcklbw %mm7, %mm2
pmullw %mm6, %mm2
punpcklbw %mm7, %mm3
pmullw %mm6, %mm3
psrlw $8, %mm2
packuswb %mm7,%mm2
paddusb -8(%edx),%mm2
packuswb %mm7,%mm3
psrlw $8, %mm3
movd %mm2,-8(%edx)
paddusb -4(%edx),%mm3
movd %mm3,-4(%edx)
cmpl %edi,%ecx
jl .L10
g.
Interleaving:
.L10:
movd (%ebx,%ecx,4),%mm0 ;1
movd 4(%ebx,%ecx,4),%mm1 ;2
punpcklbw %mm7, %mm0
movd 8(%ebx,%ecx,4),%mm2 ;3
pmullw %mm6, %mm0
movd 12(%ebx,%ecx,4),%mm3;4
punpcklbw %mm7, %mm1
addl $16,%edx ;5
punpcklbw %mm7, %mm2
pmullw %mm6, %mm1 ;6
psrlw $8, %mm0
pmullw %mm6, %mm2 ;7
punpcklbw %mm7, %mm3
addl $4, %ecx ;8
packuswb %mm7,%mm0
paddusb -16(%edx),%mm0 ;9
packuswb %mm7,%mm1
pmullw %mm6, %mm3 ;10
psrlw $8, %mm1
movd %mm0,-16(%edx) ;11
psrlw $8, %mm2
paddusb -12(%edx),%mm1 ;12
packuswb %mm7,%mm2
movd %mm1,-12(%edx) ;13
packuswb %mm7,%mm3
paddusb -8(%edx),%mm2 ;14
psrlw $8, %mm3
paddusb -4(%edx),%mm3 ;15 ;
movd %mm2,-8(%edx) ;16
;17
movd %mm3,-4(%edx) ;18
cmpl %edi,%ecx ;19
jl .L10
h.
Perfect pairing:
Compared to the convolutions needed to shave 1 cycle off d to get e, this was a piece of cake.
.before_loop:
movd (%ebx,%ecx,4), %mm0
movd 4(%ebx,%ecx,4), %mm1
punpcklbw %mm7, %mm0
pmullw %mm6, %mm0
punpcklbw %mm7, %mm1
.L10
movd 8(%ebx,%ecx,4), %mm2
pmullw %mm6, %mm1
movd 12(%ebx,%ecx,4), %mm3
punpcklbw %mm7, %mm2
psrlw $8, %mm0
punpcklbw %mm7, %mm3
packuswb %mm7, %mm0
pmullw %mm6, %mm2
paddusb (%edx), %mm0
psrlw $8, %mm1
packuswb %mm7, %mm1
pmullw %mm6, %mm3
paddusb 4(%edx), %mm1
psrlw $8, %mm2
movd %mm0, (%edx)
packuswb %mm7, %mm2
movd %mm1, 4(%edx)
psrlw $8, %mm3
paddusb 8(%edx), %mm2
packuswb %mm7, %mm3
paddusb 12(%edx), %mm3
addl $4, %ecx
movd (%ebx,%ecx,4), %mm0
addl $16, %edx
movd %mm2, -8(%edx)
punpcklbw %mm7, %mm0
movd 4(%ebx,%ecx,4), %mm1
pmullw %mm6, %mm0
movd %mm3, -4(%edx)
punpcklbw %mm7, %mm1
cmpl %edi, %ecx
jl .L10
(16 clocks = 4 clocks per pixel).
9.
Further possibilities for improvement:
.L10:
movq (%ebx,%ecx,4),%mm0
addl $8,%edx
paddusb -8(%edx,4), %mm0
addl $2, %ecx
;stall
movq %mm0, -8(%edx,4)
cmpl %edi,%ecx
jl .L10
(2.5 clocks per pixel).
.before_loop:
movq (%ebx,%ecx,4),%mm0
addl $16,%edx
.L10:
paddusb (%edx,4), %mm0
addl $4, %ecx
movq %mm0, 8(%edx,4)
movq (%ebx,%ecx,4),%mm1
paddusb 8(%edx,4), %mm1
movq (%ebx,%ecx,4),%mm0
addl $16,%edx
movq %mm1, -8(%edx,4)
cmpl %edi,%ecx
jl .L10
(1.75 clocks per pixel)
10.
Common optimization errors:
addl %8, %edx
movl %eax, (%edx)
After:
movl %eax, 8(%edx)
addl %8, %edx
Notes:
Permission is granted to copy, distribute and/or modify this
document under the terms of the GNU Free Documentation
License, Version 1.1 or any later version published by the
Free Software Foundation; with Invariant Sections being The GNU Free Documentation License, no
Front-Cover Texts, and no Back-Cover texts. A copy of the
license is included in the section entitled "GNU Free
Documentation License".