DXT Decompression
DXT decompression written in assembly & C++
Matej Tomčík

DOWNLOAD

DXT Decompression library

This is a binary x86/x64 library with a header and lib file which targets the Visual C++ Redistributable for Visual Studio 2012.
You can also build the library from source: download the source code and read How to build the library.

Also, if you download this library and want to use it in your project, please send me an email, so I can provide you with an updated version once I implement non-backscan image decompression and optimized alpha premultiplication.

Abstract

This document describes how to decompress images compressed with the DXT (S3TC) algorithm.

DXT is a lossy texture compression algorithm, formerly known as S3TC. There are five versions of DXT, divided into three primary groups: DXT1, DXT2/3 and DXT4/5. The differences between them are explained later. The majority of today's games use DXT to store textures. It is a fast, easy to implement technique with compression ratios of 8:1 for DXT1 and 4:1 for DXT2-5 (relative to 32-bit ARGB pixels).
The reason I decided to make this library was to provide an alternative to libsquish which would be noticeably faster at decompression. My library does not provide compression algorithms, only decompression. If you need to compress raw ARGB data using DXT, you can still use libsquish or FastDXT, although FastDXT supports only DXT1 and DXT5.

The library is mainly written in assembly, with a subset of routines written in C++. Some functions are optimized with SSE instructions, but it is safe to use the library on a CPU which does not support SSE: when the Dll loads, it checks whether the CPU is SSE capable and, if not, it chooses the pure (non-SSE) implementation.

To illustrate the performance difference between my implementation and libsquish, I ran a few tests on an image compressed with DXT1 (128 x 128 px). The result is shown in the image below; the numbers on the X axis represent how much time in milliseconds the libraries needed to decompress the image 10 000 times. As you can see, my library needs only 1/3 of the time.

DXT Decompression performance

How DXT decompression works

All DXT versions take 16 input pixels (a 4x4 block) and encode them into 8 bytes (DXT1) or 16 bytes (DXT2-5) of output. Since we are not concerned with compression here, we only look at that output and how to decompress it. Each block uses four colors and a table of indices. The first two colors are packed in the RGB565 format, 2 bytes each. The remaining two colors are calculated; for DXT1 the formula is as follows:

if (color0 > color1)
 c2 = (2 * c0 + c1) / 3
 c3 = (2 * c1 + c0) / 3
else
 c2 = (c0 + c1) / 2
 c3 = 0

This means that the first two colors are first unpacked from the block into RGB24 format, and then each channel (red, green and blue) of the remaining two colors is calculated using that formula. For DXT2-5, the formula is simplified:

c2 = (2 * c0 + c1) / 3
c3 = (2 * c1 + c0) / 3
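The two formulas above can be sketched in C++ as a single routine that builds the four-entry color table. This is only an illustration, not the library's code: the names Rgba, Expand565 and BuildColorTable are hypothetical, and the sketch assumes that the DXT1 comparison is done on the packed 16-bit values.

```cpp
#include <array>
#include <cstdint>

// One decoded color, 8 bits per channel (hypothetical helper type).
struct Rgba { uint8_t r, g, b, a; };

// Expands a packed RGB565 value to 8 bits per channel with opaque alpha.
inline Rgba Expand565(uint16_t packed)
{
  const uint32_t r = (packed >> 11) & 0x1F;
  const uint32_t g = (packed >> 5) & 0x3F;
  const uint32_t b = packed & 0x1F;
  return { uint8_t((r << 3) | (r >> 2)),
           uint8_t((g << 2) | (g >> 4)),
           uint8_t((b << 3) | (b >> 2)), 0xFF };
}

// Builds the four colors of a block from its two packed endpoints.
// isDxt1 selects the DXT1 rule; DXT2-5 always use the two-thirds blend
// (their real alpha comes from the separate alpha data, not from this table).
inline std::array<Rgba, 4> BuildColorTable(uint16_t color0, uint16_t color1, bool isDxt1)
{
  const Rgba c0 = Expand565(color0);
  const Rgba c1 = Expand565(color1);
  std::array<Rgba, 4> table = {{ c0, c1, Rgba{}, Rgba{} }};

  if (!isDxt1 || color0 > color1)
  {
    auto blend = [](uint8_t a, uint8_t b) { return uint8_t((2 * a + b) / 3); };
    table[2] = { blend(c0.r, c1.r), blend(c0.g, c1.g), blend(c0.b, c1.b), 0xFF };
    table[3] = { blend(c1.r, c0.r), blend(c1.g, c0.g), blend(c1.b, c0.b), 0xFF };
  }
  else
  {
    auto mid = [](uint8_t a, uint8_t b) { return uint8_t((a + b) / 2); };
    table[2] = { mid(c0.r, c1.r), mid(c0.g, c1.g), mid(c0.b, c1.b), 0xFF };
    table[3] = { 0, 0, 0, 0 };  // c3 = 0: transparent black
  }
  return table;
}
```

For example, with white (0xFFFF) as the first color and black (0x0000) as the second, a DXT1 block gets two gray levels in between; swapping the endpoints gives one mid gray and a transparent entry.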

Since DXT1 supports only fully opaque or fully transparent pixels, no additional data needs to be packed in the block. DXT2-5, on the other hand, provide more precise alpha values. DXT2/3 stores an additional 64 bits at the beginning of the block, where each pixel is assigned a 4-bit alpha value. DXT4/5 stores two 8-bit alpha values at the beginning of the block, followed by a 48-bit table of 16 indices (3 bits per index can address 8 alpha levels). The remaining 6 alpha levels are calculated using this formula:

if (alpha0 > alpha1)
 alpha2 = (6 * alpha0 + 1 * alpha1) / 7
 alpha3 = (5 * alpha0 + 2 * alpha1) / 7
 alpha4 = (4 * alpha0 + 3 * alpha1) / 7
 alpha5 = (3 * alpha0 + 4 * alpha1) / 7
 alpha6 = (2 * alpha0 + 5 * alpha1) / 7
 alpha7 = (1 * alpha0 + 6 * alpha1) / 7
else
 alpha2 = (4 * alpha0 + 1 * alpha1) / 5
 alpha3 = (3 * alpha0 + 2 * alpha1) / 5
 alpha4 = (2 * alpha0 + 3 * alpha1) / 5
 alpha5 = (1 * alpha0 + 4 * alpha1) / 5
 alpha6 = 0
 alpha7 = 255
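The alpha formula above translates almost directly into C++. The sketch below is an illustration only; BuildAlphaTable and AlphaIndex are hypothetical names, and AlphaIndex assumes the usual layout in which the 48 index bits are stored little endian with the first pixel in the lowest 3 bits.

```cpp
#include <array>
#include <cstdint>

// Builds the eight alpha levels of a DXT4/5 block from its two stored values.
inline std::array<uint8_t, 8> BuildAlphaTable(uint8_t alpha0, uint8_t alpha1)
{
  std::array<uint8_t, 8> table{};
  table[0] = alpha0;
  table[1] = alpha1;

  if (alpha0 > alpha1)
  {
    // Six interpolated levels: weights 6:1, 5:2, ... 1:6
    for (int i = 1; i <= 6; ++i)
      table[i + 1] = uint8_t(((7 - i) * alpha0 + i * alpha1) / 7);
  }
  else
  {
    // Four interpolated levels, then fully transparent and fully opaque
    for (int i = 1; i <= 4; ++i)
      table[i + 1] = uint8_t(((5 - i) * alpha0 + i * alpha1) / 5);
    table[6] = 0;
    table[7] = 255;
  }
  return table;
}

// Returns the 3-bit alpha index of the given pixel (0..15).
// Assumption: byte 0 holds the least significant bits of the 48-bit table.
inline unsigned AlphaIndex(const uint8_t indices[6], unsigned pixel)
{
  uint64_t bits = 0;
  for (int i = 5; i >= 0; --i)
    bits = (bits << 8) | indices[i];
  return unsigned((bits >> (3 * pixel)) & 0x7);
}
```

The final alpha of a pixel is then table[AlphaIndex(indices, pixel)].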

The difference between DXT2 and DXT3, as well as between DXT4 and DXT5, is that DXT2 and DXT4 consider pixel colors to be premultiplied by alpha, while DXT3 and DXT5 do not. If you want to display a DXT3 or DXT5 image in a Windows application using standard GDI functions and alpha blend it over a chessboard pattern, or any other image for that matter, you need to premultiply the pixels by alpha manually. Otherwise the AlphaBlend GDI function will not work properly.
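Premultiplying by alpha can be sketched in a few lines of C++. This is a plain, unoptimized illustration and not the library's routine; PremultiplyPixel and PremultiplyImage are hypothetical names, and the rounding (+127) is a choice made for this sketch.

```cpp
#include <cstddef>
#include <cstdint>

// Premultiplies one ARGB pixel (A in the top byte, B in the low byte) by its alpha.
inline uint32_t PremultiplyPixel(uint32_t argb)
{
  const uint32_t a = argb >> 24;
  // channel * alpha / 255, with +127 to round to nearest instead of truncating
  const uint32_t r = (((argb >> 16) & 0xFF) * a + 127) / 255;
  const uint32_t g = (((argb >> 8) & 0xFF) * a + 127) / 255;
  const uint32_t b = ((argb & 0xFF) * a + 127) / 255;
  return (a << 24) | (r << 16) | (g << 8) | b;
}

// Premultiplies a whole buffer of decompressed DXT3/DXT5 pixels in place.
inline void PremultiplyImage(uint32_t* pixels, size_t count)
{
  for (size_t i = 0; i < count; ++i)
    pixels[i] = PremultiplyPixel(pixels[i]);
}
```

Fully opaque pixels come out unchanged and fully transparent pixels become zero, which is what AlphaBlend expects from a premultiplied source.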

DXT blocks

Convert RGB565 to RGB888 (RGB24)

To get RGB24 from RGB565, we first extract R, G and B separately and then expand each one. For red, we shift the packed color 11 bits right and AND it with a mask of 0x1F. For green, we shift the packed color 5 bits right and AND it with a mask of 0x3F. Blue is obtained simply by AND-ing the packed value with 0x1F.

Then, to expand the separate channels to 8 bits, we shift each one left and right and combine the results with a logical OR. This copies the top bits into the vacated low bits, so the maximum value maps to 255. Red and blue are 5 bits wide, so we shift 3 bits left and 2 bits right. Green is 6 bits wide, so we shift 2 bits left and 4 bits right. See below:

// Packed value must be in little endian order (Windows)
 
R = (packed >> 11) & 0x1F;
G =  (packed >> 5) & 0x3F;
B =       (packed) & 0x1F;
 
R = (R << 3) | (R >> 2);
G = (G << 2) | (G >> 4);
B = (B << 3) | (B >> 2);

Simplified algorithm in C++

#include <cstdint>
 
// ARGB quad
union ARGB
{
  // Packed ARGB quad for image processing
  uint32_t quad;
  struct
  {
    // Little endian order, A is MSB, B is LSB
    uint8_t b, g, r, a;
  } argb;
};
 
// Unpacks RGB565 to RGB24
ARGB Unpack565(const uint16_t packed)
{
  ARGB color;
  color.argb.r = (packed >> 11) & 0x1F;
  color.argb.g =  (packed >> 5) & 0x3F;
  color.argb.b =       (packed) & 0x1F;
  color.argb.a = 0xFF;
 
  color.argb.r = (color.argb.r << 3) | (color.argb.r >> 2);
  color.argb.g = (color.argb.g << 2) | (color.argb.g >> 4);
  color.argb.b = (color.argb.b << 3) | (color.argb.b >> 2);
 
  return color;
}

Optimized algorithm in Assembly

Example procedure written in ASM to unpack RGB565 into RGB24:

; Unpacks RGB565 to RGB24; expects packed RGB565 in EDX
Unpack565 proc
  mov eax, edx      ; Create copy of the packed RGB word
  shl eax, 5        ; Shift red
  and eax, 01F0000h ; Keep red only
  or  eax, edx      ; Add green and blue
  and eax, 01F001Fh ; Keep only red and blue
 
  mov ecx, eax      ; Create copy of EAX (red and blue)
  shl ecx, 2        ; ecx will shift 4 bits right, so shift red and blue 2 bits left
  shl edx, 3        ; Align green in EDX
  and edx, 03F00h   ; EDX contains only the green
  or  ecx, edx      ; Add green to the ecx
  shr edx, 1        ; Shift EDX 1 bit right, so the green is at the right offset when added to EAX
  or  eax, edx      ; EAX now contains RGB, green is shifted 1 bit right because EAX shifts 3 bits left
 
  shl eax, 3        ; Shift EAX 3 bits left, this will shift RB 3 bits left and G 2 bits left
  shr ecx, 4        ; Shift ecx 4 bits right, this will shift G 4 bits right and RB 2 bits right
  and ecx, 070707h  ; Remove overlapping bits
 
  or  eax, ecx      ; Combine shifted colors
  or  eax, 0FF000000h  ; Set alpha to fully opaque
 
  ret
Unpack565 endp

Intermediate solution

To speed up the process, we can use lookup tables for each color channel. Since red and blue are 5 bits wide, we create an array of 32 entries for each of them, and for the 6-bit green channel we create an array of 64 entries.
Example code to convert RGB565 to RGB24 using lookup tables:

.data
aLookupB  DD 00h,08h,010h,018h,021h,029h,031h,039h
          DD 042h,04Ah,052h,05Ah,063h,06Bh,073h,07Bh
          DD 084h,08Ch,094h,09Ch,0A5h,0ADh,0B5h,0BDh
          DD 0C6h,0CEh,0D6h,0DEh,0E7h,0EFh,0F7h,0FFh
aLookupR  DD 000000h,080000h,0100000h,0180000h,0210000h,0290000h,0310000h,0390000h
          DD 0420000h,04A0000h,0520000h,05A0000h,0630000h,06B0000h,0730000h,07B0000h
          DD 0840000h,08C0000h,0940000h,09C0000h,0A50000h,0AD0000h,0B50000h,0BD0000h
          DD 0C60000h,0CE0000h,0D60000h,0DE0000h,0E70000h,0EF0000h,0F70000h,0FF0000h
aLookupG  DD 0000h,0400h,0800h,0C00h,01000h,01400h,01800h,01C00h
          DD 02000h,02400h,02800h,02C00h,03000h,03400h,03800h,03C00h
          DD 04100h,04500h,04900h,04D00h,05100h,05500h,05900h,05D00h
          DD 06100h,06500h,06900h,06D00h,07100h,07500h,07900h,07D00h
          DD 08200h,08600h,08A00h,08E00h,09200h,09600h,09A00h,09E00h
          DD 0A200h,0A600h,0AA00h,0AE00h,0B200h,0B600h,0BA00h,0BE00h
          DD 0C300h,0C700h,0CB00h,0CF00h,0D300h,0D700h,0DB00h,0DF00h
          DD 0E300h,0E700h,0EB00h,0EF00h,0F300h,0F700h,0FB00h,0FF00h
.code
 
; Unpacks RGB565 to RGB24; expects packed RGB565 in EDX
Unpack565 proc
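  mov ecx, edx      ; Copy the packed RGB word
  and ecx, 01Fh     ; Keep the 5 blue bits
  mov eax, dword ptr [offset aLookupB + ecx * 4] ; Start the result with expanded blue
  shr edx, 5        ; Move green to the lowest bits
  mov ecx, edx
  and ecx, 03Fh     ; Keep the 6 green bits
  or  eax, dword ptr [offset aLookupG + ecx * 4] ; Add expanded green
  shr edx, 6        ; Move red to the lowest bits
  or  eax, dword ptr [offset aLookupR + edx * 4] ; Add expanded red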
  ret
Unpack565 endp

Final solution

To speed things up even more, one can create a lookup table for all RGB565 combinations. That results in an additional 256 KB lookup table (65 536 entries of 4 bytes each). Since an additional 256 KB in memory does not hurt today's PCs, my implementation uses this huge table. The packed RGB565 color becomes the index, and the corresponding RGB24 value can be accessed via offset aRGB565Lookup + eax * 4, where eax contains the RGB565 value. See below:

; Assuming "block" points to a DXT block
; Assuming aRGB565Lookup is declared as EXTERNDEF aRGB565Lookup:DWORD
 
mov   esi, block          ; Move to the first color
movzx eax, word ptr [esi] ; Get first color word
mov   eax, dword ptr [offset aRGB565Lookup + eax * 4]
mov   [ebp - 4], eax      ; Save RGB24 to the color table
movzx eax, word ptr [esi + 2] ; Get second color word
mov   eax, dword ptr [offset aRGB565Lookup + eax * 4]
mov   [ebp - 8], eax      ; Save RGB24 to the color table
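How such a table could be filled is not shown above, so here is a minimal C++ sketch. BuildRgb565Lookup is a hypothetical name, and storing an opaque alpha byte in every entry is an assumption of this sketch (it mirrors the shift-based Unpack565 routine); the library's aRGB565Lookup may be generated differently.

```cpp
#include <cstdint>
#include <vector>

// Builds a 65536-entry table that maps every RGB565 value to a 32-bit ARGB value.
// Assumption: alpha is stored as 0xFF in every entry.
std::vector<uint32_t> BuildRgb565Lookup()
{
  std::vector<uint32_t> table(65536);
  for (uint32_t packed = 0; packed < 65536; ++packed)
  {
    const uint32_t r = (packed >> 11) & 0x1F;
    const uint32_t g = (packed >> 5) & 0x3F;
    const uint32_t b = packed & 0x1F;

    // Same bit replication as in the per-channel code
    const uint32_t r8 = (r << 3) | (r >> 2);
    const uint32_t g8 = (g << 2) | (g >> 4);
    const uint32_t b8 = (b << 3) | (b >> 2);

    table[packed] = 0xFF000000u | (r8 << 16) | (g8 << 8) | b8;
  }
  return table;
}
```

The table costs 65536 * 4 = 262144 bytes and turns every color unpack into a single indexed load.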

Decompressing single DXT1 block

To get an idea of how the library works, let us look at how a single DXT1 block is decompressed. Such a block consists of two packed colors, each 2 bytes long, and a table of color indices. First, we unpack the first two colors into RGB24 as described above. Then we compare these two colors and, depending on whether the first color is greater than the second, compute the remaining two colors. I store these four colors on the stack so that it is faster and easier to access them when building output pixels.

Once the colors are unpacked, we load the indices and build the output pixels. The indices are packed into a 32-bit table, so I load this value into a register, invert it (bitwise NOT), and mask it with 0x3 to get a 2-bit index. I invert the indices because I want to fetch a color from the stack with a single instruction, mov edx, [ebp + edx * 4], which loads EDX with the color stored at EBP + index * 4 (each color is 4 bytes long). The color table lies at lower addresses than EBP, since allocating space on the stack decrements the stack pointer (it moves down in memory). With four colors of 4 bytes each, the table takes up 16 bytes: it starts at EBP - 16, but the first color is at EBP - 4 and the fourth at EBP - 16. To address the second color directly, I would have to subtract 4 bytes from EBP - 4, but the addressing mode can only add a scaled index. Inverting the indices turns that subtraction into an addition: I temporarily decrement EBP by 16 so that it points to the fourth color, and an index of 3 becomes 0 after inversion, so (EBP - 16) + 0 yields the fourth color. Likewise, an index of 1 becomes 2, and (EBP - 16) + 2 * 4 = EBP - 8 is the second color.
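The inverted-index trick is easier to see in C++. The sketch below (hypothetical names, not library code) decodes the 16 indices twice: once the obvious way, and once with the color table stored in reverse order and the indices inverted, which is what the stack layout amounts to. Both produce the same pixels.

```cpp
#include <array>
#include <cstdint>

// Straightforward decode: a 2-bit index i selects colors[i].
inline std::array<uint32_t, 16> DecodeIndices(const std::array<uint32_t, 4>& colors,
                                              uint32_t indices)
{
  std::array<uint32_t, 16> pixels{};
  for (int i = 0; i < 16; ++i)
  {
    pixels[i] = colors[indices & 0x3];
    indices >>= 2;
  }
  return pixels;
}

// Same decode with the table reversed and the indices inverted.
// stack[0] plays the role of EBP - 16 (fourth color),
// stack[3] plays the role of EBP - 4 (first color).
inline std::array<uint32_t, 16> DecodeIndicesInverted(const std::array<uint32_t, 4>& colors,
                                                      uint32_t indices)
{
  const std::array<uint32_t, 4> stack = {{ colors[3], colors[2], colors[1], colors[0] }};
  indices = ~indices;  // 0 -> 3, 1 -> 2, 2 -> 1, 3 -> 0 in every 2-bit field

  std::array<uint32_t, 16> pixels{};
  for (int i = 0; i < 16; ++i)
  {
    pixels[i] = stack[indices & 0x3];  // only ever adds to the base address
    indices >>= 2;
  }
  return pixels;
}
```

Inverting a 2-bit field maps index i to 3 - i, so looking up a reversed table with inverted indices lands on the same color.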

The following is the assembly code which decompresses single DXT1 block into 16 output ARGB pixels:

; Decompresses DXT1 block
DXTDBlockDxt1 PROC C, pbBlock:PTR BYTE, pdwPixels:PTR DWORD
  ; Allocate space for color table
  sub esp, 16
  ; Save registers
  push esi
  push edi
 
  ; Unpack first two colors
  mov esi, pbBlock        ; Move to the first color
  mov edi, pdwPixels      ; Setup destination where we generate pixels
  movzx eax, word ptr [esi] ; Get first color word
  mov eax, dword ptr [offset aRGB565Lookup + eax * 4]
  mov [ebp - 4], eax      ; Save RGB24 to color table
  movzx eax, word ptr [esi + 2]  ; Get second color word
  mov eax, dword ptr [offset aRGB565Lookup + eax * 4]
  mov [ebp - 8], eax      ; Save RGB24 to color table
 
  ; Calculate midpoint colors
  cmp [ebp - 4], eax
  ja fcgts                ; Use the four-color formula only when color0 > color1
  ; First color is less than or equal to the second
  ; Calculate third color
  mov dword ptr [ebp - 12], 0FF000000h
  movzx eax, byte ptr [ebp - 2]
  movzx edx, byte ptr [ebp - 6]
  add eax, edx
  shr eax, 1
  shl eax, 16
  or dword ptr [ebp - 12], eax
  movzx eax, byte ptr [ebp - 3]
  movzx edx, byte ptr [ebp - 7]
  add eax, edx
  shr eax, 1
  shl eax, 8
  or dword ptr [ebp - 12], eax
  movzx eax, byte ptr [ebp - 4]
  movzx edx, byte ptr [ebp - 8]
  add eax, edx
  shr eax, 1
  or dword ptr [ebp - 12], eax
  ; Calculate fourth color
  mov dword ptr [ebp - 16], 0h ; Set fourth color to transparent
  jmp copy
fcgts:
  ; First color is greater than the second
  ; Calculate third color
  mov dword ptr [ebp - 12], 0FF000000h
  movzx eax, byte ptr [ebp - 2]
  movzx ecx, byte ptr [ebp - 6]
  lea ecx, [ecx + eax * 2]
  mov eax, 0AAAAAAABh
  mul ecx
  shr edx, 1
  shl edx, 16
  or dword ptr [ebp - 12], edx
  movzx eax, byte ptr [ebp - 3]
  movzx ecx, byte ptr [ebp - 7]
  lea ecx, [ecx + eax * 2]
  mov eax, 0AAAAAAABh
  mul ecx
  shr edx, 1
  shl edx, 8
  or dword ptr [ebp - 12], edx
  movzx eax, byte ptr [ebp - 4]
  movzx ecx, byte ptr [ebp - 8]
  lea ecx, [ecx + eax * 2]
  mov eax, 0AAAAAAABh
  mul ecx
  shr edx, 1
  or dword ptr [ebp - 12], edx
  ; Calculate fourth color
  mov dword ptr [ebp - 16], 0FF000000h
  movzx eax, byte ptr [ebp - 6]
  movzx ecx, byte ptr [ebp - 2]
  lea ecx, [ecx + eax * 2]
  mov eax, 0AAAAAAABh
  mul ecx
  shr edx, 1
  shl edx, 16
  or dword ptr [ebp - 16], edx
  movzx eax, byte ptr [ebp - 7]
  movzx ecx, byte ptr [ebp - 3]
  lea ecx, [ecx + eax * 2]
  mov eax, 0AAAAAAABh
  mul ecx
  shr edx, 1
  shl edx, 8
  or dword ptr [ebp - 16], edx
  movzx eax, byte ptr [ebp - 8]
  movzx ecx, byte ptr [ebp - 4]
  lea ecx, [ecx + eax * 2]
  mov eax, 0AAAAAAABh
  mul ecx
  shr edx, 1
  or dword ptr [ebp - 16], edx
 
copy:
  ; Set EBP to point to the last color. The color table is at lower addresses than EBP.
  ; Since the memory operand can only add offsets, we invert the indices
  ; and point EBP at the last color, so adding an inverted index to EBP
  ; is the same as subtracting the original index from the address of the first color.
  sub ebp, 16
 
  ; Get indices
  mov eax, dword ptr [esi + 4]
  not eax  ; Invert indices since we can only add offsets in mov operator []
 
  ; Set up the address where the loop ends (64 bytes: 4x4 pixels, 32 bits each)
  lea ecx, [edi + 64]
lp:
  mov edx, eax
  and edx, 03h
  mov edx, [ebp + edx * 4]
  mov [edi], edx
  shr eax, 2
  add edi, 4
  cmp edi, ecx
  jne lp
 
  ; Restore original EBP position
  add ebp, 16
  ; Restore registers
  pop edi
  pop esi
  ; Deallocate color table
  mov esp, ebp
 
  ret
DXTDBlockDxt1 ENDP
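One detail in the listing above deserves a note: the division by 3 is done with mul by 0AAAAAAABh followed by shr edx, 1, not with a div instruction. A C++ sketch of the same arithmetic follows; DivideBy3 and BlendTwoThirds are hypothetical names used only for this illustration.

```cpp
#include <cstdint>

// Computes x / 3 for any 32-bit unsigned x without a division instruction.
// 0xAAAAAAAB is 2^33 / 3 rounded up. The 32-bit MUL leaves the high half of
// the 64-bit product in EDX, so "shr edx, 1" equals shifting the full
// product right by 33 bits.
inline uint32_t DivideBy3(uint32_t x)
{
  return uint32_t((uint64_t(x) * 0xAAAAAAABu) >> 33);
}

// One channel of the third color: (2 * a + b) / 3
inline uint8_t BlendTwoThirds(uint8_t a, uint8_t b)
{
  return uint8_t(DivideBy3(2u * a + b));
}
```

For the channel sums that occur here (0 to 765) the result is identical to integer division by 3.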

Decompressing entire DXT1 image

When decompressing entire images, I faced two primary issues. First, DXT compresses blocks of 16 pixels (4x4), so you cannot just copy the result into a bitmap scanline. Second, a GDI HBITMAP requires you to backscan the bitmap: the first row you see on the screen is actually the last row stored in memory.

A minor issue to consider is handling images whose width and height are not multiples of 4. Remember, DXT works with blocks of 4x4 px, so a block cannot represent an image of 3x2 px exactly. Most of today's graphics cards support such dimensions, but the compressed data is still aligned to the size of a whole block.
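The alignment rule and the resulting size of the compressed data can be written down as a short C++ sketch. AlignToBlock and CompressedSize are hypothetical helper names; the block sizes (8 bytes for DXT1, 16 bytes for DXT2-5) are the ones given earlier.

```cpp
#include <cstddef>
#include <cstdint>

// Rounds a dimension up to the next multiple of 4.
inline uint32_t AlignToBlock(uint32_t dim)
{
  return ((dim + 3) / 4) * 4;
}

// Size in bytes of the compressed data for an image of the given size.
// blockSize is 8 for DXT1 and 16 for DXT2-5.
inline size_t CompressedSize(uint32_t width, uint32_t height, size_t blockSize)
{
  const size_t blocksWide = AlignToBlock(width) / 4;
  const size_t blocksHigh = AlignToBlock(height) / 4;
  return blocksWide * blocksHigh * blockSize;
}
```

A 3x2 px DXT1 image therefore still occupies one full block of 8 bytes, and the 128 x 128 px test image from the benchmark takes 8192 bytes.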

Below is assembly code which decompresses an entire image compressed with DXT1. It simply goes through the input data buffer, decompressing each block into an intermediate pixel buffer, advancing the input pointer by 8 (the DXT1 block size), and finally copying the intermediate pixel buffer into the output bitmap scanlines. This algorithm is for backscan bitmaps, such as HBITMAP.

; Decompresses entire DXT1 image into a backscan bitmap (ie HBITMAP)
; Width and height must be a multiple of 4. To align dim, use this formula: ((dim + 3) / 4) * 4
; Pixels must be DWORDs packed as ARGB (A being MSB, B being LSB)
_DXTDImageBackscanDxt1__PURE PROC, dwWidth:DWORD, dwHeight:DWORD, pbBlock:PTR BYTE, pdwPixels:PTR DWORD
  sub esp, 64         ; Allocate pixel block
  push edi            ; Save registers
  push esi
  push ebx
 
  lea esi, [ebp - 64] ; Setup ESI to point to the intermediate pixel buffer
 
  shl dwWidth, 2      ; Multiply width by 4 to get scanline size
  mov eax, dwHeight   ; Multiply height by scanline size to get the bitmap size
  mul dwWidth
  mov dwHeight, eax
row:
  xor ebx, ebx        ; Reset column counter
col:
  push esi            ; Address of the pixel block
  push pbBlock        ; Address of the source block
  call DXTDBlockDxt1
  add esp, 8
  add pbBlock, 8      ; Increment block pointer by 8 (DXT1 block size)
 
  mov edi, pdwPixels  ; Get pointer to the beginning of the current block
  add edi, dwHeight
 
  ; Copy first line
  sub edi, dwWidth    ; Subtract one scanline
  mov eax, [esi]
  mov dword ptr [edi + ebx], eax
  mov eax, [esi + 04h]
  mov dword ptr [edi + ebx + 4], eax
  mov eax, [esi + 08h]
  mov dword ptr [edi + ebx + 8], eax
  mov eax, [esi + 0Ch]
  mov dword ptr [edi + ebx + 12], eax
 
  ; Copy second line
  sub edi, dwWidth    ; Subtract one scanline
  mov eax, [esi + 010h]
  mov dword ptr [edi + ebx], eax
  mov eax, [esi + 014h]
  mov dword ptr [edi + ebx + 4], eax
  mov eax, [esi + 018h]
  mov dword ptr [edi + ebx + 8], eax
  mov eax, [esi + 01Ch]
  mov dword ptr [edi + ebx + 12], eax
 
  ; Copy third line
  sub edi, dwWidth    ; Subtract one scanline
  mov eax, [esi + 020h]
  mov dword ptr [edi + ebx], eax
  mov eax, [esi + 024h]
  mov dword ptr [edi + ebx + 4], eax
  mov eax, [esi + 028h]
  mov dword ptr [edi + ebx + 8], eax
  mov eax, [esi + 02Ch]
  mov dword ptr [edi + ebx + 12], eax
 
  ; Copy fourth line
  sub edi, dwWidth    ; Subtract one scanline
  mov eax, [esi + 030h]
  mov dword ptr [edi + ebx], eax
  mov eax, [esi + 034h]
  mov dword ptr [edi + ebx + 4], eax
  mov eax, [esi + 038h]
  mov dword ptr [edi + ebx + 8], eax
  mov eax, [esi + 03Ch]
  mov dword ptr [edi + ebx + 12], eax
 
  add ebx, 16         ; Increment used width (4 pixels * 32 bits each)
  cmp ebx, dwWidth    ; Check whether there are more columns to process
  jne col             ; Process more columns
 
  shl ebx, 2          ; EBX contains scanline size, multiply by 4 to get block scanline
  sub dwHeight, ebx   ; Subtract block scanline from height
  cmp dwHeight, 0     ; Check whether there are more rows to process
  jne row             ; Process more rows
 
  pop ebx             ; Restore registers
  pop esi
  pop edi
  add esp, 64         ; Deallocate pixel block
  ret
_DXTDImageBackscanDxt1__PURE ENDP
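The address arithmetic of the backscan copy may be easier to follow in C++. The sketch below stores one decoded 4x4 block into a bottom-up bitmap; StoreBlockBackscan is a hypothetical name, and block coordinates are counted from the top-left corner of the image as it appears on screen.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copies one decoded 4x4 block into a bottom-up (backscan) ARGB bitmap.
// width and height are in pixels and must be multiples of 4.
inline void StoreBlockBackscan(const uint32_t block[16], uint32_t* bitmap,
                               size_t width, size_t height,
                               size_t blockX, size_t blockY)
{
  for (size_t row = 0; row < 4; ++row)
  {
    // Row as displayed, counted from the top of the image
    const size_t y = blockY * 4 + row;
    // In memory the rows are reversed: displayed row 0 is the last scanline
    uint32_t* dest = bitmap + (height - 1 - y) * width + blockX * 4;
    std::memcpy(dest, block + row * 4, 4 * sizeof(uint32_t));
  }
}
```

Calling this for every block, left to right and top to bottom, fills the bitmap from its end toward its beginning, just as the assembly does by starting at pdwPixels plus the bitmap size and subtracting one scanline per row.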

Optimization with SSE

As you can see in the code above, copying 16 pixels from the intermediate buffer into the output bitmap is quite expensive. Since this example shows the 32-bit version, each pixel (4 bytes long) must be read from the intermediate buffer into a register and then written to the output bitmap. It would be great to copy a whole row (4 pixels, 16 bytes) at a time. Luckily, if the CPU supports SSE, we can do that with the XMM registers (128 bits wide), as in the example below:

; Decompresses entire DXT1 image into a backscan bitmap (ie HBITMAP). Uses SSE instructions
; Width and height must be a multiple of 4. To align dim, use this formula: ((dim + 3) / 4) * 4
; Pixels must be DWORDs
_DXTDImageBackscanDxt1__SSE PROC, dwWidth:DWORD, dwHeight:DWORD, pbBlock:PTR BYTE, pdwPixels:PTR DWORD
  sub esp, 64         ; Allocate pixel block
  push edi            ; Save registers
  push esi
  push ebx
 
  lea esi, [ebp - 64] ; Setup ESI to point to the intermediate pixel buffer
 
  shl dwWidth, 2      ; Multiply width by 4 to get scanline size
  mov eax, dwHeight   ; Multiply height by scanline size to get the bitmap size
  mul dwWidth
  mov dwHeight, eax
row:
  xor ebx, ebx        ; Reset column counter
col:
  push esi            ; Address of the pixel block
  push pbBlock        ; Address of the source block
  call DXTDBlockDxt1
  add esp, 8
  add pbBlock, 8      ; Increment block pointer by 8 (DXT1 block size)
 
  mov edi, pdwPixels  ; Get pointer to the beginning of the current block
  add edi, dwHeight
 
  ; Copy first line
  sub edi, dwWidth    ; Subtract one scanline
  movups xmm0, [esi]
  movups [edi + ebx], xmm0
 
  ; Copy second line
  sub edi, dwWidth    ; Subtract one scanline
  movups xmm0, [esi + 010h]
  movups [edi + ebx], xmm0
 
  ; Copy third line
  sub edi, dwWidth    ; Subtract one scanline
  movups xmm0, [esi + 020h]
  movups [edi + ebx], xmm0
 
  ; Copy fourth line
  sub edi, dwWidth    ; Subtract one scanline
  movups xmm0, [esi + 030h]
  movups [edi + ebx], xmm0
 
  add ebx, 16         ; Increment used width (4 pixels * 32 bits each)
  cmp ebx, dwWidth    ; Check whether there are more columns to process
  jne col             ; Process more columns
 
  shl ebx, 2          ; EBX contains scanline size, multiply by 4 to get block scanline
  sub dwHeight, ebx   ; Subtract block scanline from height
  cmp dwHeight, 0     ; Check whether there are more rows to process
  jne row             ; Process more rows
 
  pop ebx             ; Restore registers
  pop esi
  pop edi
  add esp, 64         ; Deallocate pixel block
  ret
_DXTDImageBackscanDxt1__SSE ENDP
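For readers who prefer intrinsics to assembly, the same 16-byte row copy can be sketched in C++. CopyRow16 and the SKETCH_HAS_SSE macro are hypothetical names; the sketch falls back to memcpy when the compiler does not target SSE.

```cpp
#include <cstdint>
#include <cstring>
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAS_SSE 1
#endif

// Copies one row of four ARGB pixels (16 bytes) from src to dest.
inline void CopyRow16(uint32_t* dest, const uint32_t* src)
{
#ifdef SKETCH_HAS_SSE
  // Unaligned 128-bit load and store, the intrinsic form of MOVUPS
  const __m128 row = _mm_loadu_ps(reinterpret_cast<const float*>(src));
  _mm_storeu_ps(reinterpret_cast<float*>(dest), row);
#else
  // Portable fallback when SSE is not available at compile time
  std::memcpy(dest, src, 16);
#endif
}
```

The unaligned variants are used because neither the intermediate buffer nor the destination scanline is guaranteed to sit on a 16-byte boundary.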

Migrating to 64-bit Windows

Another improvement comes with migrating the assembly code to 64-bit. Let's take the above code as an example. We have to copy rows of 4 pixels (4 bytes per pixel = 16 bytes) from an intermediate pixel buffer to the output bitmap. Without SSE, the 32-bit code needs two mov instructions (a load and a store) for each pixel. The x64 platform has 64-bit wide general-purpose registers, so we can copy two pixels at a time, which halves the number of instructions needed for the copy.

Another improvement comes with the calling convention. On x86 platforms, function parameters are passed on the stack, whereas the Microsoft x64 calling convention uses RCX, RDX, R8 and R9 for the first four integer or pointer parameters. Any further parameters are passed on the stack.

My library is optimized for the x64 platform and, as you can see in the chart at the top, it performs about 20% faster than the x86 code. To see the difference between the x86 and x64 code, download the source code and examine the _x64.asm files.

How to build the library

To build from source, you need Visual Studio or MASM (Microsoft Macro Assembler). In Visual Studio, create a new empty Win32 project. Choose either a Dll or a static library; you can also include this code directly in your project, but I highly recommend against it. The reason I chose a Dll is that I can check in the DllMain routine whether the CPU supports SSE instructions and, if it does, select the SSE implementation as shown in the example below.

Once you have created the project, right-click it in the Solution Explorer and go to Build customizations... . Make sure the masm option is checked. Then simply add the existing .asm source files to the project. For x64 platforms, you have to define the preprocessor constant _WIN64 in Project properties / Microsoft Macro Assembler / General / Preprocessor definitions. For x86 builds, you have to go to Project properties / Linker / Advanced and set Image Has Safe Exception Handlers to No (/SAFESEH:NO).

Dll entry point

// Dll entry point
BOOL WINAPI DllMain(HINSTANCE hInstanceDll, DWORD dwReason, LPVOID)
{
  switch (dwReason)
  {
  case DLL_PROCESS_ATTACH:
    // When attaching Dll to a process, we determine whether the CPU supports
    // SSE so we can choose optimized routines instead of generic ones
    DXTDSetConfiguration(DXTDHasSseSupport() ? DXTD_CONFIG_SSE : 0);
    break;
  case DLL_PROCESS_DETACH:
  case DLL_THREAD_ATTACH:
  case DLL_THREAD_DETACH:
    break;
  }
 
  // Dll successfully loaded
  return TRUE;
}
 
// Sets internal configuration
DXTD_API void DXTD_CALL DXTDSetConfiguration(DXTD_FLAGS flags)
{
  dxtdConfig = flags & DXTD_CONFIG_MASK;
  if (flags & DXTD_CONFIG_SSE)
  {
    dxtdConfig |= DXTD_CONFIG_SSE;
    dxtdImageBackscanDxt1 = _DXTDImageBackscanDxt1__SSE;
    dxtdImageBackscanDxt3 = _DXTDImageBackscanDxt3__SSE;
    dxtdImageBackscanDxt5 = _DXTDImageBackscanDxt5__SSE;
  }
  else
  {
    dxtdImageBackscanDxt1 = _DXTDImageBackscanDxt1__PURE;
    dxtdImageBackscanDxt3 = _DXTDImageBackscanDxt3__PURE;
    dxtdImageBackscanDxt5 = _DXTDImageBackscanDxt5__PURE;
  }
}
 
// Decompresses entire DXT1 image into a backscan bitmap (ie HBITMAP)
DXTD_API DXTD_BOOL DXTD_CALL DXTDImageBackscanDxt1(const DWORD width, const DWORD height,
  const LPBYTE inputImage, PDWORD outputPixels)
{
  if (width < 4 || height < 4 || width % 4 != 0 || height % 4 != 0)
    return 0;
 
  dxtdImageBackscanDxt1(width, height, inputImage, outputPixels);
 
  return 1;
}
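The listing above calls DXTDHasSseSupport without showing it. One common way to write such a check is to query CPUID, as in the C++ sketch below. HasSseSupport is a hypothetical stand-in and not the library's actual implementation; it uses the __cpuid intrinsic on Visual C++ and __get_cpuid on GCC or Clang.

```cpp
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
#elif (defined(__GNUC__) || defined(__clang__)) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>
#endif

// Returns true when the CPU reports SSE support (CPUID leaf 1, EDX bit 25).
// Hypothetical stand-in for DXTDHasSseSupport.
inline bool HasSseSupport()
{
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
  int regs[4] = { 0, 0, 0, 0 };        // EAX, EBX, ECX, EDX
  __cpuid(regs, 1);                    // Leaf 1: processor feature flags
  return (regs[3] & (1 << 25)) != 0;
#elif (defined(__GNUC__) || defined(__clang__)) && (defined(__i386__) || defined(__x86_64__))
  unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
  if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
    return false;                      // Leaf 1 is not available
  return (edx & (1u << 25)) != 0;
#else
  return false;                        // Not an x86 CPU or unknown compiler
#endif
}
```

Every x64 CPU supports SSE, so on 64-bit builds the check always succeeds; it only matters for 32-bit builds running on very old processors.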