Line routine - Second attempt
Line routine - Second attempt
Since the version of LineBench I uploaded the other day was some old attempt made when I was working on ULA Paint, I decided to work a bit and clean up the code.
The previous version had a score of 856 to draw 440 line (2275 for the BASIC ROM routine), the new one is down to 822, while using quite a lot less memory (less intermediate code and tables).
You can find it here:
http://www.defence-force.org/ftp/forum/ ... ench-2.zip
Don't hesitate to participate by giving suggestion, and even better provide a faster version.
Ideally we are looking for two kind of "faster" versions:
- The faster one that does not use too much memory (good for games)
- The insanely faster one that can gobble half of the memory (usable for demo records and stuff like that)
Have fun !
The previous version had a score of 856 to draw 440 line (2275 for the BASIC ROM routine), the new one is down to 822, while using quite a lot less memory (less intermediate code and tables).
You can find it here:
http://www.defence-force.org/ftp/forum/ ... ench-2.zip
Don't hesitate to participate by giving suggestion, and even better provide a faster version.
Ideally we are looking for two kind of "faster" versions:
- The faster one that does not use too much memory (good for games)
- The insanely faster one that can gobble half of the memory (usable for demo records and stuff like that)
Have fun !
And here is a new version.
The last published was down to 822, Twilighte took a look and managed to reduce it to nearly 800, and since I had a day off, I spent some time on the code today, both optimizing even more the code down to 793 (yeah, we are under 800 !), plus additional code path for totaly vertical and totaly horizontal lines.
You can find it here:
http://www.defence-force.org/ftp/forum/ ... ench-3.zip
I'm still open for all suggestions and optimisations. A faster line routine is good for everybody
The last published was down to 822, Twilighte took a look and managed to reduce it to nearly 800, and since I had a day off, I spent some time on the code today, both optimizing even more the code down to 793 (yeah, we are under 800 !), plus additional code path for totaly vertical and totaly horizontal lines.
You can find it here:
http://www.defence-force.org/ftp/forum/ ... ench-3.zip
I'm still open for all suggestions and optimisations. A faster line routine is good for everybody
-
- Flying Officer
- Posts: 148
- Joined: Fri Oct 12, 2007 8:08 pm
I can't remember where I got this code but it was supposed to be very very fast. I wonder if it could be adapted in oric asm, with all the screen optimizations you can usually do (you know the tables), pecause the curset function is soooo slow.
That's something I have adapted from a code found on internet.
You can give it a try, it is way faster than Bresnham algorithm.
Code: Select all
#include <lib.h>
#define setpix(x,y) curset(x,y,2)
void lineTwoStep(int x0, int y0, int x1, int y1)
{
int i,dx,dy,stepx, stepy,length,extras,incr1,incr2,c,d;
dy = y1 - y0;
dx = x1 - x0;
if (dy < 0) { dy = -dy; stepy = -1; } else { stepy = 1; }
if (dx < 0) { dx = -dx; stepx = -1; } else { stepx = 1; }
setpix(x0,y0);
setpix(x1,y1);
if (dx > dy) {
length = (dx - 1) >> 2;
extras = (dx - 1) & 3;
incr2 = (dy << 2) - (dx << 1);
if (incr2 < 0) {
c = dy << 1;
incr1 = c << 1;
d = incr1 - dx;
for (i = 0; i < length; i++) {
x0 += stepx;
x1 -= stepx;
if (d < 0) { // Pattern:
setpix(x0,y0); //
x0+=stepx; //
setpix(x0,y0); // x o o
setpix(x1,y1); //
setpix(x1,y1);
x1 -= stepx;
d += incr1;
} else {
if (d < c) { // Pattern:
setpix(x0,y0); //
x0+= stepx; //
y0 += stepy; // o
setpix(x0,y0); // x o
setpix(x1,y1); //
x1-=stepx;
y1 -= stepy;
setpix(x1,y1);
} else {
y0+=stepy;
setpix(x0,y0); // Pattern:
x0 += stepx; //
setpix(x0,y0); // o o
y1 -= stepy; // x
setpix(x1,y1);
x1-=stepx;
setpix(x1,y1);
}
d+=incr2;
}
}
if (extras > 0) {
if (d < 0) {
x0+=stepx;
setpix(x0,y0);
if (extras > 1) {x0+=stepx; setpix(x0,y0);}
if (extras > 2) {x1-=stepx; setpix(x1,y1);}
} else
if (d < c) {
x0+=stepx;
setpix(x0,y0);
if (extras > 1) {x0+=stepx; y0+=stepy; setpix(x0,y0);}
if (extras > 2) {x1-=stepx; setpix(x1,y1);}
} else {
x0+=stepx;
y0+=stepy;
setpix(x0,y0);
if (extras > 1) {x0+=stepx;setpix(x0,y0);}
if (extras > 2) {x1-=stepx; y1-=stepy; setpix(x1,y1);}
}
}
} else {
c = (dy - dx) << 1;
incr1 = c << 1;
d = incr1 + dx;
for (i = 0; i < length; i++) {
x0 += stepx;
x1 -= stepx;
if (d > 0) {
y0 += stepy;
setpix(x0,y0);
x0+=stepx; // Pattern:
y0+=stepy; //
setpix(x0,y0); // o
y1 -= stepy; // o
setpix(x1,y1); // x
x1-= stepx;
y1 -= stepy;
setpix(x1,y1);
d += incr1;
} else {
if (d < c) {
setpix(x0,y0);
x0 += stepx; // Pattern:
y0 += stepy; //
setpix(x0,y0); // o
setpix(x1, y1); // x o
x1 -= stepx;
y1 -= stepy;
setpix(x1,y1);
} else {
y0 += stepy;
setpix(x0,y0); // Pattern:
x0 += stepx;
setpix(x0,y0); // o o
y1 -= stepy; // x
setpix(x1,y1);
x1 -= stepx;
setpix(x1,y1);
}
d += incr2;
}
}
if (extras > 0) {
if (d > 0) {
x0 += stepx;
y0 += stepy;
setpix(x0,y0);
if (extras > 1) {x0+=stepx; y0+=stepy; setpix(x0,y0);}
if (extras > 2) {x1-=stepx; y1-=stepy; setpix(x1,y1);}
} else
if (d < c) {
x0 += stepx;
setpix(x0,y0);
if (extras > 1) {x0+=stepx; y0+=stepy; setpix(x0,y0);}
if (extras > 2) {x1-=stepx; setpix(x1,y1);}
} else {
x0 += stepx;
y0 += stepy;
setpix(x0,y0);
if (extras > 1) {x0+=stepx; setpix(x0,y0);}
if (extras > 2) {
if (d > c)
{x1-=stepx; y1-=stepy; setpix(x1,y1);}
else
{x1-=stepx; setpix(x1,y1);}
}
}
}
}
} else {
length = (dy - 1) >> 2;
extras = (dy - 1) & 3;
incr2 = (dx << 2) - (dy << 1);
if (incr2 < 0) {
c = dx << 1;
incr1 = c << 1;
d = incr1 - dy;
for (i = 0; i < length; i++) {
y0 += stepy;
y1 -= stepy;
if (d < 0) {
setpix(x0,y0);
y0 += stepy;
setpix(x0,y0);
setpix(x1,y1);
y1 -= stepy;
setpix(x1,y1);
d += incr1;
} else {
if (d < c) {
setpix(x0, y0);
x0 += stepx;
y0 += stepy;
setpix(x0,y0);
setpix(x1,y1);
x1 -= stepx;
y1 -= stepy;
setpix(x1,y1);
} else {
x0 += stepx;
setpix(x0,y0);
y0 += stepy;
setpix(x0,y0);
x1 -= stepx; setpix(x1,y1);
y1 -= stepy; setpix(x1,y1);
}
d+=incr2;
}
}
if (extras > 0) {
if (d < 0) {
y0+=stepy; setpix( x0, y0);
if (extras > 1) {y0+=stepy; setpix(x0,y0);}
if (extras > 2) {y1-=stepy; setpix(x1,y1);}
} else
if (d < c) {
y0+=stepy; setpix(x0,y0);
if (extras > 1) {x0+=stepx; y0+=stepy; setpix(x0,y0);}
if (extras > 2) {y1-=stepy;setpix(x1,y1);}
} else {
x0+=stepx; y0+=stepy; setpix(x0,y0);
if (extras > 1) {y0 += stepy;setpix(x0,y0);}
if (extras > 2) {x1-=stepx; y1-=stepy; setpix(x1,y1);}
}
}
} else {
c = (dx - dy) << 1;
incr1 = c << 1;
d = incr1 + dy;
for (i = 0; i < length; i++) {
y0 += stepy;
y1 -= stepy;
if (d > 0) {
x0+=stepx; setpix(x0,y0);
x0+=stepx; y0+=stepy; setpix(x0,y0);
x1-=stepy; setpix(x1,y1);
x1-=stepx; y1-=stepy; setpix(x1,y1);
d+=incr1;
} else {
if (d < c) {
setpix(x0,y0);
x0+=stepx; y0+=stepy; setpix(x0,y0);
setpix(x1,y1);
x1-=stepx, y1-=stepy; setpix(x1,y1);
} else {
x0+=stepx; setpix(x0,y0);
y0+=stepy; setpix(x0,y0);
x1-=stepy; setpix(x1,y1);
y1-=stepy; setpix(x1,y1);
}
d += incr2;
}
}
if (extras > 0) {
if (d > 0) {
x0+=stepx; y0+=stepy; setpix(x0,y0);
if (extras > 1) {x0+=stepx; y0+=stepy; setpix(x0,y0);}
if (extras > 2) {x1-=stepx, y1-=stepy; setpix(x1,y1);}
} else
if (d < c) {
y0+=stepy; setpix(x0,y0);
if (extras > 1) {x0+=stepx; y0+=stepy; setpix(x0,y0);}
if (extras > 2) {y1-=stepy; setpix(x1,y1);}
} else {
x0+=stepx; y0+=stepy; setpix(x0,y0);
if (extras > 1) {y0+=stepy; setpix(x0,y0);}
if (extras > 2) {
if (d > c)
{x1-=stepx, y1-=stepy; setpix(x1,y1);}
else
{y1-=stepy; setpix(x1,y1);}
}
}
}
}
}
}
You can give it a try, it is way faster than Bresnham algorithm.
Found the original explanation here, with the same source code:
http://cs.unc.edu/~mcmillan/comp136/Lecture6/Lines.html
http://cs.unc.edu/~mcmillan/comp136/Lecture6/Lines.html
yes, that's this one !Dbug wrote:Found the original explanation here, with the same source code:
http://cs.unc.edu/~mcmillan/comp136/Lecture6/Lines.html
Otherelse, I some ideas in order to improve it a little bit (to avoid the oric curset function). If you don't mind I will try to use your optimization tables.
As a matter of fact, the "plot pixel" or whatever similar routine... should be used only for the sole purpose of drawing one single pixel on screen.
For anything else composed of more than one pixel, the pixel plotting should be inlined in the rest of the code, else just the fact of plotting pixels will use most of the cpu time.
Just for reference, if you add ONE nop (or useless CLC/SEC) in the current line routine in the benchmark, the time taken will jump up by about "30" units.
Imagine the cost of a single JSR/RTS to the plot routine
For anything else composed of more than one pixel, the pixel plotting should be inlined in the rest of the code, else just the fact of plotting pixels will use most of the cpu time.
Just for reference, if you add ONE nop (or useless CLC/SEC) in the current line routine in the benchmark, the time taken will jump up by about "30" units.
Imagine the cost of a single JSR/RTS to the plot routine
After the discussions about 1337, decided to take a look at 8 bit Bresenham implementations. The version in C= Hacking is very buggy - actually it does not even assemble I believe -
So the last published was down to 796 units, the new code is now down to 676. Not impressive I know, but possibly the setup code could be improved. I'm not quite sure how I could improve the inner loops.
I moved the location of the code now we have a Subversion server, you can find the latest code here:
http://miniserve.defence-force.org/svn/ ... linebench/
So Chema, could you try to integrate that one in 1337 and tell us if it improves the framerate in any perceivable way?
And as last time, I'm still open for all suggestions and optimisations: A faster line routine is good for everybody
So the last published was down to 796 units, the new code is now down to 676. Not impressive I know, but possibly the setup code could be improved. I'm not quite sure how I could improve the inner loops.
I moved the location of the code now we have a Subversion server, you can find the latest code here:
http://miniserve.defence-force.org/svn/ ... linebench/
So Chema, could you try to integrate that one in 1337 and tell us if it improves the framerate in any perceivable way?
And as last time, I'm still open for all suggestions and optimisations: A faster line routine is good for everybody
Ok I integrated into the shipdemo test program and it improves the time in 1 "unit" of 1/100ths of second. This is quite a constant value, but most noticeable in some models, where there are more and longer lines to draw.Dbug wrote: So Chema, could you try to integrate that one in 1337 and tell us if it improves the framerate in any perceivable way?
Would need a more accurate timing to be sure, but this could be a good first approximation.
I think, however, there are some "artifacts", some extra pixels at the end of some lines and even a single pixel in the middle of a line which is drawn 1 pixel up. This only happens sometimes.
Would a method for "crunching" all the pixels in a single scan and plot them with only one sta improve this even further?
Cheers
He still has to fix the artifacts so that might go up a few cycles but it's definitely a nice gain. I think more savings is certainly possible but the more you save, the more effort required to get those savings.Algarbi wrote:wow from 856 to 644 is a huge leap, well done Dbug, I am sure you can still improve this
Continue the good work
I think the only significant improvement will be in devising a way to calculate/write all pixels for a byte in one shot, possibly reducing the number of OR operations with screen memory.
I played with the idea of calculating how many horizontal bits you write for each x. Then you just perform a table lookup of the bits and OR the byte with the screen destination. However, tables of possible bit patterns is a problem since the size of the table increases by 1 each time you add a horizontal pixel. (see bit pattern tables below). There may also be a different number of pixels per byte depending on the current x value, so I didn't really figure out a practical way to do this except for horizontal lines where I don't need to perform as many lookups.
10000000
01000000
00100000
00010000
00001000
00000100
00000010
00000001
10000000
11000000
01100000
00110000
00011000
00001100
00000110
00000011
00000001
10000000
11000000
11100000
01110000
00111000
00011100
00001110
00000111
00000011
00000001
10000000
11000000
11100000
11110000
01111000
00111100
00011110
00001111
00000111
00000011
00000001
etc...
To do this for horizontal lines you still have to draw the ends of the line separately but everything in between is just a loop writing bytes with all pixels set. And the ends can potentially be done with a lookup but it requires a table for each end. This offers a bigger gain on the CoCo/MC-10 where there are 8 pixels / byte in hi-res, the CPUs have 16 bit index registers, and the screen layout makes for easy math. I'm not sure about the Oric with the 6502, 6 pixels / byte, and it's screen layout.
Basic algorithm on CoCo is something like this (not exactly 100% C but you'll get the idea):
Code: Select all
#define BYTESPERLINE 32
#define PIXELSPERBYTE 8
If X2 < X1 then swap points
if X2 -X1 < PIXELSPERBYTE + 1 then special case because less than 2 bytes involved
; note that shifts work well for the math on this screen layout
; if you have to use preincrement then subtract one at end of following line
; calculate starting address of first byte
Address = ScreenBase + (Y * BYTESPERLINE) + (X1 / PIXELSPERBYTE )
; lookup pixels to set and write leftmost byte
*(Address++) |= (*(LeftTable + (X1 & 7))) ; set bits in left byte
I =(X2-X1-2)
;loop for middle bytes/pixels if needed (test takes place before a write)
DO
*(Address++) = 255 ; 255 = set all bits on middle bytes
WHILE (I-- > 0)
; lookup pixels to set and write rightmost byte
*(Address++) |= (*(RightTable + (X2 & 7))) ; set bits in right byte
; Left end of horizontal line
11111111
01111111
00111111
00011111
00001111
00000111
00000011
00000001
; Right end of horizontal line
10000000
11000000
11100000
11110000
11111000
11111100
11111110
11111111
Oric version would have smaller tables due to less pixels / byte. Attribute bits would need to be set to what you want to use.
The math however might be ugly unless you can just lookup values... which I think I used anyway as it was faster.
; Left end of horizontal line
00111111
00011111
00001111
00000111
00000011
00000001
; Right end of horizontal line
00100000
00110000
00111000
00111100
00111110
00111111
I hope I'm making sense here but it's late, I'm tired, and that might be the formula for chocolate milk for all I know.
Just an FYI, the C64 version of Elite doesn't draw circles with a normal circle routine. It looks like it's drawing 4 lines / quarter since lines are a probably a little faster.
<edit>
Notice the last three bits of the X represent the pixel number within a byte and the top bits represent the byte number on a given screen line on the CoCo.
The tables can overlap by a byte is you put the right table first and the left table second. It saves a byte.
Steve Judd has developed a quite brilliant routine (C=Hacking, ussue 10) for exactly this, which works without any tables in the inner loop.JamesD wrote:I think the only significant improvement will be in devising a way to calculate/write all pixels for a byte in one shot, possibly reducing the number of OR operations with screen memory.
Code: Select all
initialize x, dx, etc.
xold = x
take a step in x: LSR X
have we hit the end of a column? If so, then plot and check on y
is it time to take a step in y?
if not, take another step in x
if it is, then let a=x EOR xold
plot a into the buffer
let xold=x
keep on going until we're finished
Have fun!
thrust26
thrust26