Line routine - Second attempt

Post by **Dbug** » Tue Mar 18, 2008 10:01 pm

Since the version of LineBench I uploaded the other day was some old attempt made when I was working on ULA Paint, I decided to work a bit and clean up the code.

The previous version had a score of 856 to draw 440 line (2275 for the BASIC ROM routine), the new one is down to 822, while using quite a lot less memory (less intermediate code and tables).

You can find it here:
http://www.defence-force.org/ftp/forum/ ... ench-2.zip

Don't hesitate to participate by giving suggestion, and even better provide a faster version.

Ideally we are looking for two kind of "faster" versions:
- The faster one that does not use too much memory (good for games)
- The insanely faster one that can gobble half of the memory (usable for demo records and stuff like that)

Have fun !

Post by **Dbug** » Thu Mar 20, 2008 7:19 pm

And here is a new version.

The last published was down to 822, Twilighte took a look and managed to reduce it to nearly 800, and since I had a day off, I spent some time on the code today, both optimizing even more the code down to 793 (yeah, we are under 800 !), plus additional code path for totaly vertical and totaly horizontal lines.

You can find it here:
http://www.defence-force.org/ftp/forum/ ... ench-3.zip

I'm still open for all suggestions and optimisations. A faster line routine is good for everybody

waskol · Post by **waskol** » Thu Mar 20, 2008 7:50 pm

I will have a look to those pieces of code.

highwayman · Post by **highwayman** » Thu Mar 20, 2008 11:52 pm

Q:

about drawing circles.
is it my memory failing, or was the atmos rom *way* faster than the original 1.0?

Post by **Dbug** » Fri Mar 21, 2008 12:32 am

No idea, but whatever the speed it's possible to do WAY better, can even handle clipping while still being at least twice faster

waskol · Post by **waskol** » Fri Mar 21, 2008 12:29 pm

I can't remember where I got this code but it was supposed to be very very fast. I wonder if it could be adapted in oric asm, with all the screen optimizations you can usually do (you know the tables), pecause the curset function is soooo slow.

Code: Select all

#include <lib.h>

#define setpix(x,y) curset(x,y,2)

void lineTwoStep(int x0, int y0, int x1, int y1)
    {   
        int i,dx,dy,stepx, stepy,length,extras,incr1,incr2,c,d;
         
        dy = y1 - y0;
        dx = x1 - x0;
        
        if (dy < 0) { dy = -dy;  stepy = -1; } else { stepy = 1; }
        if (dx < 0) { dx = -dx;  stepx = -1; } else { stepx = 1; }
        
        setpix(x0,y0);
        setpix(x1,y1);
        if (dx > dy) {
            length = (dx - 1) >> 2;
            extras = (dx - 1) & 3;
            incr2 = (dy << 2) - (dx << 1);
            if (incr2 < 0) {
                c = dy << 1;
                incr1 = c << 1;
 
                d =  incr1 - dx;
                
                for (i = 0; i < length; i++) {
                    x0 += stepx;
                    x1 -= stepx;
                    if (d < 0) {              // Pattern:
                        setpix(x0,y0);        //
                        x0+=stepx;            //
                        setpix(x0,y0);        //  x o o
                        setpix(x1,y1);        //
                        setpix(x1,y1);
                        x1 -= stepx;
                        d += incr1;
                    } else {
                        if (d < c) {			// Pattern:
                            setpix(x0,y0);		//      
                            x0+= stepx;               //
                            y0 += stepy;              //      o
                            setpix(x0,y0);		//  x o
                            setpix(x1,y1);		//
                            x1-=stepx;
                            y1 -= stepy;
                            setpix(x1,y1);
                        } else {
                            y0+=stepy;
                            setpix(x0,y0);		// Pattern:
                            x0 += stepx;              //
                            setpix(x0,y0);		//    o o 
                            y1 -= stepy;              //  x  
                            setpix(x1,y1);			
                            x1-=stepx;
                            setpix(x1,y1);			
                        }
                        d+=incr2;
                    }
                }
                if (extras > 0) {
                    if (d < 0) {
                        x0+=stepx;
                        setpix(x0,y0);
                        if (extras > 1) {x0+=stepx; setpix(x0,y0);}
                        if (extras > 2) {x1-=stepx; setpix(x1,y1);}
                    } else
                    if (d < c) {
                        x0+=stepx;
                        setpix(x0,y0);
                        if (extras > 1) {x0+=stepx; y0+=stepy; setpix(x0,y0);}
                        if (extras > 2) {x1-=stepx; setpix(x1,y1);}
                    } else {
                        x0+=stepx; 
                        y0+=stepy;
                        setpix(x0,y0);
                        if (extras > 1) {x0+=stepx;setpix(x0,y0);}
                        if (extras > 2) {x1-=stepx; y1-=stepy; setpix(x1,y1);}
                    }
                }
            } else {
                c = (dy - dx) << 1;
                incr1 = c << 1;
                d =  incr1 + dx;
                
                for (i = 0; i < length; i++) {
                    x0 += stepx;
                    x1 -= stepx;
                    if (d > 0) {  
                        y0 += stepy;
                        setpix(x0,y0);		
                        x0+=stepx;              // Pattern:
                        y0+=stepy;              //
                        setpix(x0,y0);		//      o
                        y1 -= stepy;            //    o
                        setpix(x1,y1);		//  x	
                        x1-= stepx;
                        y1 -= stepy;
                        setpix(x1,y1);		
                        d += incr1;
                    } else {
                        if (d < c) {
                            setpix(x0,y0);
                            x0 += stepx;	      // Pattern:
                            y0 += stepy;        //
                            setpix(x0,y0);      //      o
                            setpix(x1, y1);     //  x o
                            x1 -= stepx;
                            y1 -= stepy;
                            setpix(x1,y1);       
                        } else {
                            y0 += stepy;
                            setpix(x0,y0);   			// Pattern:
                            x0 += stepx;
                            setpix(x0,y0);   			//    o o
                            y1 -= stepy;                    //  x
                            setpix(x1,y1);			
                            x1 -= stepx;
                            setpix(x1,y1);			
                        }
                        d += incr2;
                    }
                }
                if (extras > 0) {
                    if (d > 0) {
                        x0 += stepx;
                        y0 += stepy;
                        setpix(x0,y0);   
                        if (extras > 1) {x0+=stepx; y0+=stepy; setpix(x0,y0);}
                        if (extras > 2) {x1-=stepx; y1-=stepy; setpix(x1,y1);}
                    } else
                    if (d < c) {
                        x0 += stepx;
                        setpix(x0,y0);
                        if (extras > 1) {x0+=stepx; y0+=stepy; setpix(x0,y0);}
                        if (extras > 2) {x1-=stepx; setpix(x1,y1);}
                    } else {
                        x0 += stepx;
                        y0 += stepy;
                        setpix(x0,y0);
                        if (extras > 1) {x0+=stepx; setpix(x0,y0);}
                        if (extras > 2) {
                            if (d > c)
                                {x1-=stepx; y1-=stepy; setpix(x1,y1);}
                            else
                                {x1-=stepx; setpix(x1,y1);}
                        }
                    }
                }
            }
        } else {
            length = (dy - 1) >> 2;
            extras = (dy - 1) & 3;
            incr2 = (dx << 2) - (dy << 1);
            if (incr2 < 0) {
                c = dx << 1;
                incr1 = c << 1;
                d =  incr1 - dy;
                
                for (i = 0; i < length; i++) {
                    y0 += stepy;
                    y1 -= stepy;
                    if (d < 0) {
                        setpix(x0,y0);
                        y0 += stepy;
                        setpix(x0,y0);
                        setpix(x1,y1);
                        y1 -= stepy; 
                        setpix(x1,y1);
                        d += incr1;
                    } else {
                        if (d < c) {
                            setpix(x0, y0);
                            x0 += stepx;
                            y0 += stepy;
                            setpix(x0,y0);
                            setpix(x1,y1);
                            x1 -= stepx;
                            y1 -= stepy;
                            setpix(x1,y1);
                        } else {
                            x0 += stepx;
                            setpix(x0,y0);
                            y0 += stepy;
                            setpix(x0,y0);
                            x1 -= stepx; setpix(x1,y1);
                            y1 -= stepy; setpix(x1,y1);
                        }
                        d+=incr2;
                    }
                }
                if (extras > 0) {
                    if (d < 0) {
                        y0+=stepy; setpix( x0, y0);
                        if (extras > 1) {y0+=stepy; setpix(x0,y0);}
                        if (extras > 2) {y1-=stepy; setpix(x1,y1);}
                    } else
                    if (d < c) {
                        y0+=stepy; setpix(x0,y0);
                        if (extras > 1) {x0+=stepx; y0+=stepy; setpix(x0,y0);}
                        if (extras > 2) {y1-=stepy;setpix(x1,y1);}
                    } else {
                        x0+=stepx; y0+=stepy; setpix(x0,y0);
                        if (extras > 1) {y0 += stepy;setpix(x0,y0);}
                        if (extras > 2) {x1-=stepx; y1-=stepy; setpix(x1,y1);}
                    }
                }
            } else {
                c = (dx - dy) << 1;
                incr1 = c << 1;
                d =  incr1 + dy;
                
                for (i = 0; i < length; i++) {
                    y0 += stepy;
                    y1 -= stepy;
                    if (d > 0) {
                        x0+=stepx; setpix(x0,y0);
                        x0+=stepx; y0+=stepy; setpix(x0,y0);
                        x1-=stepy; setpix(x1,y1);
                        x1-=stepx; y1-=stepy; setpix(x1,y1);
                        d+=incr1;
                    } else {
                        if (d < c) {
                            setpix(x0,y0);
                            x0+=stepx; y0+=stepy; setpix(x0,y0);
                            setpix(x1,y1);
                            x1-=stepx, y1-=stepy; setpix(x1,y1);
                        } else {
                            x0+=stepx; setpix(x0,y0);
                            y0+=stepy; setpix(x0,y0);
                            x1-=stepy; setpix(x1,y1);
                            y1-=stepy; setpix(x1,y1);
                        }
                        d += incr2;
                    }
                }
                if (extras > 0) {
                    if (d > 0) {
                        x0+=stepx; y0+=stepy; setpix(x0,y0);
                        if (extras > 1) {x0+=stepx; y0+=stepy; setpix(x0,y0);}
                        if (extras > 2) {x1-=stepx, y1-=stepy; setpix(x1,y1);}
                    } else
                    if (d < c) {
                        y0+=stepy; setpix(x0,y0);
                        if (extras > 1) {x0+=stepx; y0+=stepy; setpix(x0,y0);}
                        if (extras > 2) {y1-=stepy; setpix(x1,y1);}
                    } else {
                        x0+=stepx; y0+=stepy; setpix(x0,y0);
                        if (extras > 1) {y0+=stepy; setpix(x0,y0);}
                        if (extras > 2) {
                            if (d > c)
                                {x1-=stepx, y1-=stepy; setpix(x1,y1);}
                            else
                                {y1-=stepy; setpix(x1,y1);}

                        }
                    }
                }
            }
        }
    }

That's something I have adapted from a code found on internet.
You can give it a try, it is way faster than Bresnham algorithm.

Post by **Dbug** » Fri Mar 21, 2008 1:36 pm

Found the original explanation here, with the same source code:
http://cs.unc.edu/~mcmillan/comp136/Lecture6/Lines.html

waskol · Post by **waskol** » Fri Mar 21, 2008 2:17 pm

Dbug wrote:Found the original explanation here, with the same source code:
http://cs.unc.edu/~mcmillan/comp136/Lecture6/Lines.html

yes, that's this one !

Otherelse, I some ideas in order to improve it a little bit (to avoid the oric curset function). If you don't mind I will try to use your optimization tables.

Post by **Dbug** » Fri Mar 21, 2008 2:30 pm

As a matter of fact, the "plot pixel" or whatever similar routine... should be used only for the sole purpose of drawing one single pixel on screen.

For anything else composed of more than one pixel, the pixel plotting should be inlined in the rest of the code, else just the fact of plotting pixels will use most of the cpu time.

Just for reference, if you add ONE nop (or useless CLC/SEC) in the current line routine in the benchmark, the time taken will jump up by about "30" units.

Imagine the cost of a single JSR/RTS to the plot routine

Post by **Dbug** » Sun Jan 31, 2010 8:42 pm

After the discussions about 1337, decided to take a look at 8 bit Bresenham implementations. The version in C= Hacking is very buggy - actually it does not even assemble I believe -

So the last published was down to 796 units, the new code is now down to 676. Not impressive I know, but possibly the setup code could be improved. I'm not quite sure how I could improve the inner loops.

I moved the location of the code now we have a Subversion server, you can find the latest code here:
http://miniserve.defence-force.org/svn/ ... linebench/

So Chema, could you try to integrate that one in 1337 and tell us if it improves the framerate in any perceivable way?

And as last time, I'm still open for all suggestions and optimisations: A faster line routine is good for everybody

Chema · Post by **Chema** » Sun Jan 31, 2010 9:18 pm

Dbug wrote: So Chema, could you try to integrate that one in 1337 and tell us if it improves the framerate in any perceivable way?

Ok I integrated into the shipdemo test program and it improves the time in 1 "unit" of 1/100ths of second. This is quite a constant value, but most noticeable in some models, where there are more and longer lines to draw.

Would need a more accurate timing to be sure, but this could be a good first approximation.

I think, however, there are some "artifacts", some extra pixels at the end of some lines and even a single pixel in the middle of a line which is drawn 1 pixel up. This only happens sometimes.

Would a method for "crunching" all the pixels in a single scan and plot them with only one sta improve this even further?

Cheers

Post by **Dbug** » Sun Jan 31, 2010 9:47 pm

For the artefacts, would be nice if you could give me some set of coordinates I can test with.

I also published a new version, which goes from 676 to now 644

For the chunking, I guess it could be doable by adding some metaflags in the _TableBit6Reverse table, have to think about it.

Iapetus · Post by **Iapetus** » Sun Jan 31, 2010 10:16 pm

wow from 856 to 644 is a huge leap, well done Dbug, I am sure you can still improve this

Continue the good work

JamesD · Post by **JamesD** » Mon Feb 01, 2010 8:18 am

Algarbi wrote:wow from 856 to 644 is a huge leap, well done Dbug, I am sure you can still improve this

Continue the good work

He still has to fix the artifacts so that might go up a few cycles but it's definitely a nice gain. I think more savings is certainly possible but the more you save, the more effort required to get those savings.

I think the only significant improvement will be in devising a way to calculate/write all pixels for a byte in one shot, possibly reducing the number of OR operations with screen memory.

I played with the idea of calculating how many horizontal bits you write for each x. Then you just perform a table lookup of the bits and OR the byte with the screen destination. However, tables of possible bit patterns is a problem since the size of the table increases by 1 each time you add a horizontal pixel. (see bit pattern tables below). There may also be a different number of pixels per byte depending on the current x value, so I didn't really figure out a practical way to do this except for horizontal lines where I don't need to perform as many lookups.

10000000
01000000
00100000
00010000
00001000
00000100
00000010
00000001

10000000
11000000
01100000
00110000
00011000
00001100
00000110
00000011
00000001

10000000
11000000
11100000
01110000
00111000
00011100
00001110
00000111
00000011
00000001

10000000
11000000
11100000
11110000
01111000
00111100
00011110
00001111
00000111
00000011
00000001

etc...

To do this for horizontal lines you still have to draw the ends of the line separately but everything in between is just a loop writing bytes with all pixels set. And the ends can potentially be done with a lookup but it requires a table for each end. This offers a bigger gain on the CoCo/MC-10 where there are 8 pixels / byte in hi-res, the CPUs have 16 bit index registers, and the screen layout makes for easy math. I'm not sure about the Oric with the 6502, 6 pixels / byte, and it's screen layout.

Basic algorithm on CoCo is something like this (not exactly 100% C but you'll get the idea):

Code: Select all

#define BYTESPERLINE 32
#define PIXELSPERBYTE 8

If X2 < X1 then swap points
if X2 -X1 < PIXELSPERBYTE + 1   then special case because less than 2 bytes involved

; note that shifts work well for the math on this screen layout
; if you have to use preincrement then subtract one at end of following line
; calculate starting address of first byte
Address = ScreenBase + (Y * BYTESPERLINE) + (X1 / PIXELSPERBYTE )  

; lookup pixels to set and write leftmost byte
*(Address++) |=  (*(LeftTable + (X1 & 7))) ; set bits in left byte
 I =(X2-X1-2)

;loop for middle bytes/pixels if needed (test takes place before a write)
DO 
 *(Address++) = 255   ; 255 = set all bits on middle bytes
WHILE (I-- > 0)

; lookup pixels to set and write rightmost byte
*(Address++) |=  (*(RightTable + (X2 & 7)))  ; set bits in right byte

CoCo tables:
; Left end of horizontal line
11111111
01111111
00111111
00011111
00001111
00000111
00000011
00000001

; Right end of horizontal line
10000000
11000000
11100000
11110000
11111000
11111100
11111110
11111111

Oric version would have smaller tables due to less pixels / byte. Attribute bits would need to be set to what you want to use.
The math however might be ugly unless you can just lookup values... which I think I used anyway as it was faster.
; Left end of horizontal line
00111111
00011111
00001111
00000111
00000011
00000001

; Right end of horizontal line
00100000
00110000
00111000
00111100
00111110
00111111

I hope I'm making sense here but it's late, I'm tired, and that might be the formula for chocolate milk for all I know.

Just an FYI, the C64 version of Elite doesn't draw circles with a normal circle routine. It looks like it's drawing 4 lines / quarter since lines are a probably a little faster.

<edit>
Notice the last three bits of the X represent the pixel number within a byte and the top bits represent the byte number on a given screen line on the CoCo.

The tables can overlap by a byte is you put the right table first and the left table second. It saves a byte.

thrust26 · Post by **thrust26** » Thu Feb 04, 2010 10:43 pm

JamesD wrote:I think the only significant improvement will be in devising a way to calculate/write all pixels for a byte in one shot, possibly reducing the number of OR operations with screen memory.

Steve Judd has developed a quite brilliant routine (C=Hacking, ussue 10) for exactly this, which works without any tables in the inner loop.

Code: Select all

        initialize x, dx, etc.
        xold = x
        take a step in x: LSR X
        have we hit the end of a column?  If so, then plot and check on y
        is it time to take a step in y?
        if not, take another step in x
        if it is, then let a=x EOR xold
                       plot a into the buffer
                       let xold=x
        keep on going until we're finished

It saves a lot of time for lines where dX > 2 * dY on the C64 the Atari 2600 (~33% saving there). Maybe it can somehow be adapted for the Oric?