1. The long journey to 1k real-time ray-tracing
  2. Making of the visual
  3. Making of the music (part1) by ern0
  4. Making of the music (part2) by TomCat
  5. Download page and source code

Making of the visual by TomCat

Technical details

HiRes TrueColor

The lowest TrueColor resolution available on any modern PC is 640 x 480 with 32 bits/pixel. You can set it easily through the VESA BIOS.

 MOV BX,112H
 MOV AX,4F02H
 INT 10H

But the video mode number can differ depending on the VGA card. The most common number is 112H. It works on nVidia, Intel, DosBox etc. but not on ATI: on ATI/AMD VGA cards this number sets up a 640x480 mode with 24 bits/pixel instead, so there the right mode number is 121H.

Unfortunately, the VESA drivers of virtual machines use even weirder numbers: VirtualBox - 142H, VMware - 13FH :-(
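
Since the mode number is vendor-specific, a size-unconstrained program could search for the mode instead of hardcoding it. The following is only a minimal sketch in C of that idea: the `ModeInfo` struct and `find_mode` are hypothetical names, and the table is assumed to have been filled from the VBE "Get Mode Info" call (AX=4F01H). The intro itself simply hardcodes the number.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down view of one VBE mode. In a real program these
   fields would be filled from the VBE "Get Mode Info" call (AX=4F01H). */
typedef struct {
    uint16_t number;   /* VESA mode number, e.g. 0x112 */
    uint16_t width;    /* horizontal resolution in pixels */
    uint16_t height;   /* vertical resolution in pixels */
    uint8_t  bpp;      /* bits per pixel */
} ModeInfo;

/* Return the first mode number matching the request, or 0 if none does. */
uint16_t find_mode(const ModeInfo *modes, size_t count,
                   uint16_t width, uint16_t height, uint8_t bpp)
{
    for (size_t i = 0; i < count; i++) {
        if (modes[i].width == width && modes[i].height == height &&
            modes[i].bpp == bpp)
            return modes[i].number;
    }
    return 0;
}
```

With a table that lists 112H as 640x480x24 and 121H as 640x480x32 (the ATI/AMD case described above), `find_mode(table, 2, 640, 480, 32)` returns 0x121.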

Video memory

Under DOS, on CPUs that support SSE, the fastest way to write to the video memory is the MOVAPS instruction. It can write 16 bytes (4 pixels) at once.

 MOVAPS [ES:DI],XMM7

To use this, we have to assemble 4 pixels in the register. The lower 4 bytes of XMM2 hold the new data for one pixel. After rotating XMM7 left by 4 bytes, we can insert the new pixel.

 SHUFPS XMM7,XMM7,10010011B
 MOVSS XMM7,XMM2
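
To make the lane movement easier to follow, here is a small scalar model of these two instructions in C. It is only an illustration: `Vec4` and `push_pixel` are made-up names, and lane 0 stands for the lowest dword of XMM7.

```c
#include <stdint.h>

/* Four 32-bit lanes standing in for XMM7; lane 0 is the lowest dword. */
typedef struct { uint32_t lane[4]; } Vec4;

/* Model of SHUFPS XMM7,XMM7,10010011B followed by MOVSS XMM7,XMM2. */
void push_pixel(Vec4 *v, uint32_t pixel)
{
    uint32_t top = v->lane[3];

    /* SHUFPS with 10010011B yields [old3, old0, old1, old2]:
       every lane moves one position up, the top lane wraps around. */
    v->lane[3] = v->lane[2];
    v->lane[2] = v->lane[1];
    v->lane[1] = v->lane[0];
    v->lane[0] = top;

    /* MOVSS register-to-register overwrites only lane 0. */
    v->lane[0] = pixel;
}
```

After pushing the pixels 1, 2, 3, 4 in this order, the lanes hold [4, 3, 2, 1], so the most recent pixel ends up at the lowest address when the register is stored.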

When calculating the color components of a pixel, we get float values. They have to be converted to integers and clamped between 0 and 255. SSE instructions are very useful for doing this:

 CVTPS2DQ XMM2,XMM2
 PACKSSDW XMM2,XMM2
 PACKUSWB XMM2,XMM2
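
As a scalar reference for what the three instructions achieve together, the conversion can be sketched in C as below. This is an approximation under stated assumptions: `to_byte` and `pack_pixel` are illustrative names, ties are rounded up instead of to even, and the inputs are assumed to fit into a 32-bit integer.

```c
#include <stdint.h>

/* Round one color component and saturate it to 0..255.
   Ties are rounded up here; the hardware default rounds ties to even.
   Assumes the value fits into a 32-bit integer. */
uint8_t to_byte(float x)
{
    if (!(x > 0.0f)) return 0;      /* negatives and NaN end up as 0 */
    if (x >= 255.0f) return 255;    /* saturate at the top */
    return (uint8_t)(x + 0.5f);
}

/* Pack four components into one 32-bit pixel, component 0 in the low byte. */
uint32_t pack_pixel(const float c[4])
{
    uint32_t p = 0;
    for (int i = 0; i < 4; i++)
        p |= (uint32_t)to_byte(c[i]) << (8 * i);
    return p;
}
```

For the components {300.0, -5.0, 127.4, 0.0} this gives the pixel 0x007F00FF: the first component saturates to 255, the second to 0.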

VESA high-color video memory is arranged in banks. With a 64 KB bank and 16 bytes per write, DI wraps around to zero after 4096 writes, and then we have to switch to the next bank.

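 ADD DI,16      ; advance by one 16-byte write
 JNZ .4         ; DI did not wrap: still inside the same bank
 PUSHA          ; preserve registers across the BIOS call
 SUB BX,BX      ; BH=0: set window, BL=0: window A
 MOV AX,4F05H   ; VBE function 05H: memory window control
 INT 10H
 POPA
 INC DX         ; DL: number of memory bank
.4: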
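
The arithmetic behind the bank switching can be checked with a few lines of C. This is just a model, assuming a 64 KB window (which the 16-bit wrap of DI implies) and a 640x480x32 frame; `count_bank_switches` is an illustrative name.

```c
#include <stdint.h>

#define BANK_SIZE   65536u   /* assumed 64 KB window, as the 16-bit DI wrap implies */
#define WRITE_SIZE  16u      /* bytes stored by one MOVAPS */

/* Count how many bank switches writing a whole frame needs. */
unsigned count_bank_switches(uint32_t frame_bytes)
{
    uint16_t di = 0;         /* offset inside the window, wraps like DI */
    unsigned switches = 0;

    for (uint32_t written = 0; written < frame_bytes; written += WRITE_SIZE) {
        di = (uint16_t)(di + WRITE_SIZE);
        if (di == 0)         /* ADD DI,16 wrapped and set the zero flag */
            switches++;
    }
    return switches;
}
```

A 640x480x32 frame is 1,228,800 bytes, i.e. 76,800 writes, so the offset wraps 18 times per frame.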

Adaptive sub-sampling

Basically, we trace every fourth eye ray. During the trace, I compute a stamp byte, which is a unique value depending on what was intersected.

If the stamp byte is the same as it was 4 pixels earlier, then I can interpolate between the colors. If not, then we have to trace more eye rays between the two pixels.

 INSERTPS XMM7,XMM2,00110000B ; XMM7: insert new color on the top
.2:
 SHUFPS XMM7,XMM7,10010011B   ; XMM7: rotate left
 PAVGB XMM2,XMM7              ; XMM2: averaging the colors
 MOVSS XMM7,XMM2              ; XMM7: put interpolated color on the bottom
 CMP [BP+SI],BL               ; is it the same stampbyte?
 LOOPNZ .3                    ; if no, then trace the next pixel
 TEST CL,3                    ; was it the fourth pixel?
 JNZ .2                       ; if no, then interpolate the next pixel
.3:
 TEST CL,3                    ; was it the fourth pixel?
 JNZ .4                       ; if no, then skip putpixel
 CALL putpixel
 SHUFPS XMM7,XMM7,11111111B   ; XMM7: fill by the last color
 MOV BL,[BP+SI]               ; store the stampbyte
 ADD CX,8                     ; go to right by 8 pixels
.4:
 CMP CX,RESX/2+4              ; was it the last pixel in the row?
 JNE nextpixel                ; if no, then go to the next pixel

So we have to trace more than every fourth eye ray, but on average still fewer than every third, I think.
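
The idea can be sketched in C on a single row. This is a simplified model, not the listing above: it interpolates linearly instead of using PAVGB, it retraces all three in-between pixels when the stamps differ, and `trace_pixel` is a hypothetical stand-in for the real tracer (a color ramp that hits another object from x = 10 on).

```c
#include <stdint.h>

typedef struct { int color; uint8_t stamp; } Sample;

static int rays_traced = 0;   /* counts how many eye rays were really traced */

/* Hypothetical stand-in for the tracer: a ramp, another object at x >= 10. */
Sample trace_pixel(int x)
{
    Sample s;
    rays_traced++;
    s.stamp = (x >= 10) ? 1 : 0;
    s.color = s.stamp ? 200 : 10 * x;
    return s;
}

/* Fill out[0..width-1]; width must be a multiple of 4 plus 1. */
void render_row(int *out, int width)
{
    Sample prev = trace_pixel(0);
    out[0] = prev.color;

    for (int x = 4; x < width; x += 4) {
        Sample cur = trace_pixel(x);
        for (int i = 1; i < 4; i++) {
            if (cur.stamp == prev.stamp)   /* same object: interpolate */
                out[x - 4 + i] = prev.color + (cur.color - prev.color) * i / 4;
            else                           /* edge in between: trace for real */
                out[x - 4 + i] = trace_pixel(x - 4 + i).color;
        }
        out[x] = cur.color;
        prev = cur;
    }
}
```

On a 17-pixel row this test scene needs 8 traced rays instead of 17: five anchor pixels plus the three pixels around the edge.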

Orthogonal projection

Eye rays are shot orthogonally to the X-Y plane (in other words, parallel to the Z axis). The Direction vector is always [0,0,1] and the eye Position is the X, Y coordinates from the screen, plus any negative Z value. More precisely, P is [+94..-94,-160..+160,-8260] after aspect ratio correction.

 MOV AX,RESY/2
nextline:
 MOV CX,-RESX/2+4
nextpixel:
 PUSHA                      ; -20:DI SI BP SP BX DX CX AX 1 0
 PMOVSXWD XMM6,[BX-8]       ; XMM6: P = x,y,1,0
 CVTDQ2PS XMM6,XMM6
 MOVAPS XMM5,XMM6
 MULPS XMM6,[SI]            ; *Aspect [SI]=[0.5028877,0.39081812,-8260.683]
 SHUFPS XMM5,XMM5,11101111B ; XMM5: D = 0,0,1,0
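
In scalar form, the ray setup amounts to the following C sketch. The scale factors are the ones quoted in the listing's comment; everything else is an assumption made for illustration: the names `Ray` and `make_eye_ray`, screen coordinates centered on the middle of a 640x480 screen, and the pairing of the factors with the axes in the order they appear in the comment.

```c
typedef struct { float x, y, z; } Vec3;
typedef struct { Vec3 origin, dir; } Ray;

/* Scale factors quoted in the listing's comment. */
static const float ASPECT_X = 0.5028877f;
static const float ASPECT_Y = 0.39081812f;
static const float EYE_Z    = -8260.683f;

/* Build the eye ray for a screen position given relative to the center. */
Ray make_eye_ray(int sx, int sy)
{
    Ray r;
    r.origin.x = (float)sx * ASPECT_X;
    r.origin.y = (float)sy * ASPECT_Y;
    r.origin.z = EYE_Z;            /* fixed negative Z for every ray */
    r.dir.x = 0.0f;                /* all rays are parallel to the Z axis */
    r.dir.y = 0.0f;
    r.dir.z = 1.0f;
    return r;
}
```

For the corner (320, 240) the origin comes out at roughly (160.9, 93.8, -8260.7), which matches the +-160 and +-94 ranges mentioned above.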

Vectors

Performing calculations on three (or four) vector coordinates simultaneously using SSE instructions is a speedup in itself. Here is how I assign the vectors to the SSE registers:

;XMM0: temporary #1
;XMM1: temporary #2
;XMM2: color coordinates
;XMM3: reflection vector
;XMM4: normal vector
;XMM5: direction vector
;XMM6: point
;XMM7: collector for colors of 4 pixels

Normalization?

Today the most expensive instructions are division and square root. Normalizing a vector uses both of them, so I tried to avoid vector normalization. That's why we are casting rays with orthogonal projection: every eye ray has the direction [0,0,1], which is already a unit vector.

The reflected rays are also unit vectors, because reflection preserves length, so we don't need to normalize them. The only vectors where normalization is unavoidable are the shadow rays. Luckily there is a dedicated instruction for computing reciprocals of square roots: RSQRTPS.

 MOVAPS XMM0,XMM5         ; XMM5: D = VNORM(D)
 DPPS XMM0,XMM0,01111111B
 RSQRTPS XMM0,XMM0        ; instead of SQRTPS XMM0,XMM0
 MULPS XMM5,XMM0          ; instead of DIVPS XMM5,XMM0

Note: RSQRTPS gives a major performance boost. However, it is VERY approximate: it produces results with a relative error of less than 1.5 * 2^-12. Given that the machine epsilon of single-precision floats is 2^-24, we can say that this approximation has roughly half the precision. It could not be used on eye rays, but it's not so bad on shadow rays.

When I tried to normalize eye rays with RSQRTPS, it resulted in many artifacts on the contour of the sphere.
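
The "multiply by an approximate reciprocal square root" pattern can be tried out in portable C as well. Note that the sketch below swaps in a different approximation: the classic bit-trick estimate plus one Newton step, which is not what RSQRTPS does internally and is somewhat coarser (a relative error on the order of 10^-3). `approx_rsqrt` and `normalize3` are illustrative names.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for RSQRTPS: bit-trick estimate refined by one Newton step.
   This is NOT how the CPU instruction works; it is only a cheap
   approximation of 1/sqrt(x) for positive x. */
float approx_rsqrt(float x)
{
    uint32_t bits;
    float y;

    memcpy(&bits, &x, sizeof bits);          /* reinterpret float as integer */
    bits = 0x5f3759dfu - (bits >> 1);        /* initial estimate */
    memcpy(&y, &bits, sizeof y);
    return y * (1.5f - 0.5f * x * y * y);    /* one Newton-Raphson step */
}

/* Normalize v in place without SQRT or DIV. */
void normalize3(float v[3])
{
    float inv_len = approx_rsqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
    v[0] *= inv_len;   /* multiply by the reciprocal instead of dividing */
    v[1] *= inv_len;
    v[2] *= inv_len;
}
```

Normalizing (3, 4, 0) this way gives a vector whose length is close to, but not exactly, 1. That kind of error is what shows up as artifacts when it is applied to eye rays.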

Shading

A single light source isn't too interesting, so we have two lights. The first light is at (255,255,255) and the second one is opposite to it, at (-255,-255,-255).

I use the Phong model for shading. The diffuse component is very basic: dot(normal,shadow)

 MOVAPS XMM1,XMM4         ; XMM1: N.S
 DPPS XMM1,XMM5,01110001B
 MOVAPS [DI],XMM1
 CMP [DI+3],CH
 JLE @F                   ; Ambient
 FADD DWORD [DI]          ; Ambient+Diffuse
@@:

The specular component is more interesting: dot(reflected,shadow)^2^2^2, that is, the value squared three times (the 8th power).

 DPPS XMM5,XMM3,01110001B ; XMM5: R.S
 MOVAPS [DI],XMM5
 FLD1
 FADD DWORD [DI]          ; Specular Ambient+Diffuse
@@:
 FMUL ST0,ST0             ; Specular=Specular^2
 INC CX
 JPO @B                   ; loop x3
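
Written out in C, the two terms look roughly like this. It is a sketch of the formulas given in the text, not a transcription of the listings: the `ambient` parameter and the function names are assumptions, and the 1.0 that the specular listing adds before squaring (FLD1 / FADD) is not modeled.

```c
/* Diffuse: only the ambient term when the surface faces away from the light. */
float diffuse_term(float n_dot_s, float ambient)
{
    return (n_dot_s > 0.0f) ? ambient + n_dot_s : ambient;
}

/* Specular: three squarings give the 8th power without calling pow(). */
float specular_term(float r_dot_s)
{
    float s = r_dot_s;
    for (int i = 0; i < 3; i++)
        s *= s;            /* s, s^2, s^4, s^8 */
    return s;
}
```

For example, `specular_term(0.5f)` is 0.5^8 = 0.00390625, which shows how quickly the highlight falls off as the reflected ray turns away from the light.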

Recursion

Three levels of recursion are clearly recognizable in the reflections, but more levels would be a waste. I use the stack pointer register to check the level of the current recursion.

 CMP SP,-22-2*maxlevel  ; Max recursion level = 3
 LOOPE Tquit            ; JE Tquit + DEC CX
 MOVAPS XMM5,XMM3       ; D = R
 FMUL DWORD [SI]        ; level/2 0
 FILD DWORD [SI]        ; big number for min
 CALL Trace0
Tquit:
 RETN

After every recursion level, I halve the intensity of the reflected color.
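
The depth limit and the halving can be illustrated with a tiny recursive function in C. This is a toy model under assumptions: every hit is taken to contribute the same `local_color` and to reflect again, and "three levels" is read as the primary ray plus two bounces. The intro tracks the depth through SP instead of an explicit counter.

```c
#define MAX_LEVEL 3   /* assumed reading of "three levels": primary ray plus two bounces */

/* Hypothetical scene: every hit contributes local_color and reflects again. */
float trace_level(int level, float local_color)
{
    if (level >= MAX_LEVEL)
        return 0.0f;                       /* recursion limit reached */

    /* own color plus the reflected color at half intensity */
    return local_color + 0.5f * trace_level(level + 1, local_color);
}
```

With a local color of 1.0 the result is 1 + 1/2 + 1/4 = 1.75. A fourth level would add only 1/8, which is why deeper recursion buys so little.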

Scenes

My inspiration for the 1st scene was the real-time part of the Chrome2 intro.
But this time, full-screen, and with nice intersections.

At Walt Disney Studios theme park, in the main hall at the top of the shop, three tires are rotating. I liked this so much.
For speed reasons, in the 2nd scene there is only one tire consisting of spheres; the others are reflections.

I've already tried to recreate this hypnotic motion in my 256 byte intro, but I wasn't satisfied with the result.
This ray-traced scene is much better with stored colors and reflections.


If you liked this writeup, then leave a comment at the download page :)
And make sure you have also read the Making of the music (part1) by ern0