- The long journey to 1k real-time ray-tracing
- Making of the visual
- Making of the music (part1) by ern0
- Making of the music (part2) by TomCat
- Download page and source code

The lowest TrueColor resolution available on any modern PC is 640 x 480 with 32 bits/pixel. You can set it easily through the VESA BIOS.

MOV BX,112H ; BX: VESA video mode number
MOV AX,4F02H ; VBE function 02H: set video mode
INT 10H

But the video mode number can differ depending on the VGA card. The most common number is 112H. It works on nVidia, Intel, DosBox etc., but not on ATI: on ATI/AMD VGA cards this number sets up a 640x480x24bit mode, so there the right mode number is 121H.

Unfortunately, the VESA drivers of virtual machines use even weirder numbers: VirtualBox - 142H, VMware - 13FH :-(

Under DOS, on CPUs that support it, the fastest way to write to the video memory is the MOVAPS instruction. It can write 16 bytes (4 pixels) at once.

MOVAPS [ES:DI],XMM7

To use this, we have to assemble 4 pixels in the register. The lower 4 bytes of XMM2 hold the new data for one pixel. After rotating XMM7 left by 4 bytes, we can insert the new pixel.

SHUFPS XMM7,XMM7,10010011B ; XMM7: rotate left by 4 bytes (one pixel)
MOVSS XMM7,XMM2 ; XMM7: put the new pixel into the lowest 4 bytes
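The rotate-and-insert step can be modelled in plain C. This is only a sketch of what the two instructions do to the register; the `Collector` type and `collector_push` name are made up for illustration and are not part of the intro.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of XMM7: lane[0] is the lowest 4 bytes. */
typedef struct { uint32_t lane[4]; } Collector;

/* Models SHUFPS XMM7,XMM7,10010011B followed by MOVSS XMM7,XMM2:
   every pixel moves up by one lane, then the new pixel lands in lane 0.
   (The old top lane wraps around to lane 0 and is overwritten.) */
void collector_push(Collector *c, uint32_t pixel)
{
    assert(c != NULL);
    c->lane[3] = c->lane[2];
    c->lane[2] = c->lane[1];
    c->lane[1] = c->lane[0];
    c->lane[0] = pixel;
}
```

After four pushes the register holds four pixels, with the most recent one in the lowest lane, ready for a single 16-byte write.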

When calculating the color components of a pixel, we get float values. They should be converted to integer and clamped between 0 and 255. SSE instructions are very useful for doing this:

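CVTPS2DQ XMM2,XMM2 ; floats -> 32-bit integers
PACKSSDW XMM2,XMM2 ; 32-bit -> 16-bit with signed saturation
PACKUSWB XMM2,XMM2 ; 16-bit -> 8-bit with unsigned saturation (0..255)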
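For readers who don't think in SSE, here is a scalar C sketch of the net effect of that conversion chain: round, then clamp to 0..255. The function names are invented for this example, and the rounding of exact .5 ties is an assumption of the sketch (the hardware rounding mode may treat ties differently).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar stand-in for the convert + double saturating pack. */
uint8_t float_to_byte(float x)
{
    if (!(x > 0.0f)) return 0;     /* negatives, zero and NaN map to 0 here */
    if (x >= 255.0f) return 255;   /* everything above the range saturates */
    return (uint8_t)(x + 0.5f);    /* round half up: an assumption of this sketch */
}

/* Converts three float color components to three bytes. */
void pack_color(const float rgb[3], uint8_t out[3])
{
    assert(rgb != NULL && out != NULL);
    for (int i = 0; i < 3; i++)
        out[i] = float_to_byte(rgb[i]);
}
```

The SSE version does the same work for all components at once, without a single branch.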

VESA high-color video memory is arranged in banks, so after 4096 writes (4096 x 16 bytes = 64 KB), we have to switch to the next bank.

ADD DI,16
JNZ .4
PUSHA
SUB BX,BX
MOV AX,4F05H
INT 10H
POPA
INC DX ; DL: number of memory bank
.4:
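The bank arithmetic is easy to check with a small C model. It is a sketch that assumes a 64 KB window, which is what the 16-bit wrap-around of DI implies; the `VideoCursor` type and `cursor_advance` function are hypothetical names, and the real code calls INT 10H where this model only counts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the write position: DI plus the bank number. */
typedef struct { uint16_t offset; uint16_t bank; } VideoCursor;

/* Advances by one 16-byte write. Returns 1 when the 16-bit offset wrapped
   to zero, which is the moment a bank switch is needed. */
int cursor_advance(VideoCursor *v)
{
    assert(v != NULL);
    v->offset = (uint16_t)(v->offset + 16);
    if (v->offset != 0)
        return 0;
    v->bank++;
    return 1;
}
```

A full 640 x 480 x 32 bit frame is 1,228,800 bytes, that is 76,800 writes or 18.75 banks, so this model reports 18 bank switches per frame.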

Basically, we trace every fourth eye ray. During the trace, I compute a stamp byte, which is a unique value depending on what was intersected.

If the stamp byte is the same as it was 4 pixels earlier, then I can interpolate between the colors. If not, then we have to trace more eye rays between the two pixels.

INSERTPS XMM7,XMM2,00110000B ; XMM7: insert new color on the top
.2:
SHUFPS XMM7,XMM7,10010011B ; XMM7: rotate left
PAVGB XMM2,XMM7 ; XMM2: averaging the colors
MOVSS XMM7,XMM2 ; XMM7: put interpolated color on the bottom
CMP [BP+SI],BL ; is it the same stampbyte?
LOOPNZ .3 ; if no, then trace the next pixel
TEST CL,3 ; was it the fourth pixel?
JNZ .2 ; if no, then interpolate the next pixel
.3:
TEST CL,3 ; was it the fourth pixel?
JNZ .4 ; if no, then skip putpixel
CALL putpixel
SHUFPS XMM7,XMM7,11111111B ; XMM7: fill by the last color
MOV BL,[BP+SI] ; store the stampbyte
ADD CX,8 ; go to right by 8 pixels
.4:
CMP CX,RESX/2+4 ; was it the last pixel in the row?
JNE nextpixel ; if no, then go to the next pixel

So we have to trace more than every fourth eye ray, but on average this is still less than every third pixel, I think.
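The stamp-byte idea can be sketched in C on a single row of grayscale pixels. This is a simplified illustration, not the intro's code: `demo_trace` is a made-up stand-in for the real tracer, and the in-between pixels are filled by plain linear interpolation rather than the PAVGB averaging used in the listing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint8_t color; uint8_t stamp; } Hit;

/* Hypothetical tracer: a shaded object left of x = 10, flat background after. */
Hit demo_trace(int x)
{
    Hit h;
    h.stamp = (uint8_t)(x < 10 ? 1 : 2);
    h.color = (uint8_t)(x < 10 ? 40 + 2 * x : 200);
    return h;
}

/* Fills out[0..width-1] and returns the number of rays traced.
   Assumes width - 1 is a multiple of 4 to keep the sketch short. */
int render_row(uint8_t *out, int width)
{
    int rays = 0;
    assert(out != NULL && width >= 1 && (width - 1) % 4 == 0);
    Hit prev = demo_trace(0);
    out[0] = prev.color;
    rays++;
    for (int x = 4; x < width; x += 4) {
        Hit cur = demo_trace(x);           /* every fourth pixel is always traced */
        rays++;
        out[x] = cur.color;
        for (int k = 1; k < 4; k++) {
            int xi = x - 4 + k;
            if (cur.stamp == prev.stamp) { /* same object: interpolate */
                out[xi] = (uint8_t)((prev.color * (4 - k) + cur.color * k) / 4);
            } else {                       /* an edge in between: trace for real */
                out[xi] = demo_trace(xi).color;
                rays++;
            }
        }
        prev = cur;
    }
    return rays;
}
```

For a 17 pixel row with one edge, this sketch traces 8 rays instead of 17; only the group of four that contains the edge is traced in full.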

Eye rays are shot orthogonal to the X-Y plane (in other words, parallel to the Z axis). The direction vector is always [0,0,1], and the eye position is the X, Y coordinate from the screen plus any negative Z value. More precisely, P is [+94..-94,-160..+160,-8260] after aspect ratio correction.

MOV AX,RESY/2
nextline:
MOV CX,-RESX/2+4
nextpixel:
PUSHA ; -20:DI SI BP SP BX DX CX AX 1 0
PMOVSXWD XMM6,[BX-8] ; XMM6: P = x,y,1,0
CVTDQ2PS XMM6,XMM6
MOVAPS XMM5,XMM6
MULPS XMM6,[SI] ; *Aspect [SI]=[0.5028877,0.39081812,-8260.683]
SHUFPS XMM5,XMM5,11101111B ; XMM5: D = 0,0,1,0
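In C, the same ray setup could look like the sketch below. The scale factors are the ones shown in the listing's comment; the `Vec3`, `Ray` and `eye_ray` names are hypothetical, and the pixel coordinates are assumed to be centered on the middle of the screen, as the loop counters suggest.

```c
#include <assert.h>

typedef struct { float x, y, z; } Vec3;
typedef struct { Vec3 p, d; } Ray;

/* Builds an orthographic eye ray for a screen pixel.
   px, py: signed pixel coordinates relative to the screen center. */
Ray eye_ray(int px, int py)
{
    Ray r;
    /* the listing sign-extends 16-bit words, so stay within that range */
    assert(px >= -32768 && px <= 32767 && py >= -32768 && py <= 32767);
    r.p.x = (float)px * 0.5028877f;   /* aspect ratio correction */
    r.p.y = (float)py * 0.39081812f;
    r.p.z = 1.0f * -8260.683f;        /* the constant 1 scaled to a far negative Z */
    r.d.x = 0.0f;                     /* every ray points straight along +Z */
    r.d.y = 0.0f;
    r.d.z = 1.0f;
    return r;
}
```

Because the direction never changes, it is a unit vector for free, and no per-pixel normalization is needed.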

Performing calculations on three (or four) vector coordinates simultaneously using SSE instructions is a speedup in itself. Here is how I assign the vectors to the SSE registers:

;XMM0: temporary #1
;XMM1: temporary #2
;XMM2: color coordinates
;XMM3: reflection vector
;XMM4: normal vector
;XMM5: direction vector
;XMM6: point
;XMM7: collector for colors of 4 pixels

Today the most expensive instructions are division and square root. Normalizing a vector uses both of them, so I tried to avoid vector normalization. That's why the rays are cast with an orthogonal projection: the eye rays are already unit vectors.

The reflected rays are also unit vectors because of the property of reflection, so we don't need to normalize them. The only vectors where normalization is unavoidable are the shadow rays. Luckily, there is a dedicated instruction that computes reciprocal square roots: RSQRTPS.

MOVAPS XMM0,XMM5 ; XMM5: D = VNORM(D)
DPPS XMM0,XMM0,01111111B
RSQRTPS XMM0,XMM0 ; instead of SQRTPS XMM0,XMM0
MULPS XMM5,XMM0 ; instead of DIVPS XMM5,XMM0

Note: RSQRTPS gives a major performance boost. However, it is VERY approximate: it produces results with relative error less than 1.5 * 2^-12. Given that machine epsilon of single precision float numbers is 2^-24, we can say that this approximation has roughly half the precision. It could not be used on eye rays, but it's not so bad on shadow rays.

When I tried to normalize eye rays with RSQRTPS, it resulted in many artifacts on the contour of the sphere.
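To show the "multiply by an approximate reciprocal square root" pattern in portable C, the sketch below swaps in the well-known bit-trick approximation with one Newton-Raphson step. It is a software stand-in, not RSQRTPS: its error is on the order of 0.2%, which is coarser than the hardware instruction. The function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Approximate 1/sqrt(x) without a division or a square root. */
float approx_rsqrt(float x)
{
    assert(x > 0.0f);
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);          /* reinterpret the float as an integer */
    bits = 0x5F3759DFu - (bits >> 1);        /* initial guess from the exponent bits */
    float y;
    memcpy(&y, &bits, sizeof y);
    return y * (1.5f - 0.5f * x * y * y);    /* one Newton-Raphson refinement */
}

/* Normalizes v in place: one dot product, one rsqrt, three multiplies. */
void normalize3(float v[3])
{
    float inv = approx_rsqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
    v[0] *= inv;
    v[1] *= inv;
    v[2] *= inv;
}
```

The resulting vector is only approximately of unit length, which is the same trade-off as in the assembly: acceptable for shadow rays, visible on eye rays.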

A single light source isn't too interesting, so we have two lights: one at (255,255,255), and a second one opposite to it at (-255,-255,-255).

I use the Phong model for shading. The diffuse component is very basic: dot(normal,shadow)

MOVAPS XMM1,XMM4 ; XMM1: N.S
DPPS XMM1,XMM5,01110001B
MOVAPS [DI],XMM1
CMP [DI+3],CH
JLE @F ; Ambient
FADD DWORD [DI] ; Ambient+Diffuse
@@:
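Written out in C, the diffuse step is just a dot product that is added to the ambient term only when the surface faces the light. This is a sketch with invented names (`dot3`, `diffuse_term`); the ambient value is passed in as a parameter because the listing does not show where it comes from.

```c
#include <assert.h>
#include <stddef.h>

float dot3(const float a[3], const float b[3])
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

/* n: surface normal, s: shadow ray direction (towards the light). */
float diffuse_term(const float n[3], const float s[3], float ambient)
{
    assert(n != NULL && s != NULL);
    float d = dot3(n, s);
    if (d <= 0.0f)
        return ambient;       /* facing away from the light: ambient only */
    return ambient + d;       /* ambient + diffuse */
}
```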

The specular component is more interesting: dot(reflected,shadow)^2^2^2, that is, the value squared three times in a row (the 8th power).

DPPS XMM5,XMM3,01110001B ; XMM5: R.S
MOVAPS [DI],XMM5
FLD1
FADD DWORD [DI] ; Specular Ambient+Diffuse
@@:
FMUL ST0,ST0 ; Specular=Specular^2
INC CX
JPO @B ; loop x3
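The repeated squaring is a cheap way to get a high power without a pow() call. A small C sketch, with a hypothetical function name:

```c
#include <assert.h>

/* Squares base the given number of times: 3 squarings give base^8. */
float repeated_square(float base, int squarings)
{
    assert(squarings >= 0);
    for (int i = 0; i < squarings; i++)
        base *= base;
    return base;
}
```

Note that the listing appears to load 1 and add the dot product before the squaring loop, so the value being squared there would be 1 + R.S rather than R.S alone; with that reading, a perfect highlight gives repeated_square(2.0f, 3), which is 256.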

Three levels of recursion are clearly recognizable in the reflections, but more levels would be a waste. I use the stack pointer register to check the level of the current recursion.

CMP SP,-22-2*maxlevel ; Max recursion level = 3
LOOPE Tquit ; JE Tquit + DEC CX
MOVAPS XMM5,XMM3 ; D = R
FMUL DWORD [SI] ; level/2 0
FILD DWORD [SI] ; big number for min
CALL Trace0
Tquit:
RETN

After every recursion level, I halve the intensity of the reflected color.
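The depth limit and the halving can be sketched as a tiny recursive function in C. Everything here is illustrative: `trace_level` is a made-up name, the local shading result of each bounce is simply read from an array, and an explicit level counter replaces the stack pointer trick. It also assumes that "three levels" means three nested trace calls in total.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LEVEL 3   /* assumed: three nested trace calls in total */

/* local[i] stands in for the locally shaded color found at bounce i. */
float trace_level(const float *local, int level)
{
    assert(local != NULL && level >= 0 && level < MAX_LEVEL);
    float color = local[level];
    if (level + 1 < MAX_LEVEL)                           /* stop at the depth limit */
        color += 0.5f * trace_level(local, level + 1);   /* reflection at half intensity */
    return color;
}
```

With this weighting the bounces contribute 1, 1/2 and 1/4 of their own color, so a deeper level would add at most 1/8, which fits the remark that more levels would be a waste.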

My inspiration for the 1st scene was the real-time part of the Chrome2 intro.

But this time, full-screen, and with nice intersections.

At the Walt Disney Studios theme park, in the main hall at the top of the shop, three tires are rotating.
I liked this so much.

For speed reasons, the 2nd scene has only one tire consisting of spheres; the others are reflections.

I've already tried to recreate this hypnotic motion in my 256 byte intro, but I wasn't satisfied with the result.

This ray-traced scene is much better with stored colors and reflections.

*
If you liked this writeup, then leave a comment at the download page :)
And make sure you have also read the Making of the music (part1) by ern0
*