Line routine - Second attempt

thrust26 · Post by **thrust26** » Mon Feb 08, 2010 6:24 pm

JamesD wrote:I'm getting an "Unresolved External : osdk_stack" message on the latest code so I'm dead in the water.

That's not in the linebench code. Can you compile other projects?

thrust26 · Post by **thrust26** » Mon Feb 08, 2010 6:35 pm

Linebench is at 473 now. I think the main optimization for speed is done, so this can be used for testing and measuring.

JamesD · Post by **JamesD** » Mon Feb 08, 2010 6:39 pm

If you unroll the loop for the nearly vertical lines you should be able to do 8 bit math until you know a carry will take place. However, identifying the point where you start/end drawing like that may take more cycles than it saves or more table space than is practical.

Another possibility is the same math I wanted to try for nearly horizontal lines so you can skip dealing with x until you know it will change. But I still haven't figured out the math (mostly because I'm procrastinating) so that's not going to help either.

JamesD · Post by **JamesD** » Mon Feb 08, 2010 6:43 pm

thrust26 wrote:
JamesD wrote:I'm getting an "Unresolved External : osdk_stack" message on the latest code so I'm dead in the water.
That's not in the linebench code. Can you compile other projects?

I closed that DOS window and opened a new one. It works again now.

thrust26 · Post by **thrust26** » Mon Feb 08, 2010 6:44 pm

JamesD wrote:If you unroll the loop for the nearly vertical lines you should be able to do 8 bit math until you know a carry will take place. However, identifying the point where you start/end drawing like that may take more cycles than it saves or more table space than is practical.

Another possibility is the same math I wanted to try for nearly horizontal lines so you can skip dealing with x until you know it will change. But I still haven't figured out the math (mostly because I'm procrastinating) so that's not going to help either.

What do you mean with 8 bit math? We are already doing 8 bit math.

Unrolling would be about the only option (AFAIK) for further optimizing. But I thing Cheama would kill me then.

JamesD · Post by **JamesD** » Mon Feb 08, 2010 6:47 pm

thrust26 wrote:Linebench is at 473 now. I think the main optimization for speed is done, so this can be used for testing and measuring.

Have you checked the code in?

thrust26 · Post by **thrust26** » Mon Feb 08, 2010 7:00 pm

JamesD wrote:
thrust26 wrote:Linebench is at 473 now. I think the main optimization for speed is done, so this can be used for testing and measuring.
Have you checked the code in?

Yes.

JamesD · Post by **JamesD** » Mon Feb 08, 2010 7:33 pm

thrust26 wrote:What do you mean with 8 bit math? We are already doing 8 bit math.

I think I simplified that too much to make sense. I'll explain both approaches a little.

In the section of code below, you have to check for a carry caused by X.
In the math approach you know how many times you step in Y before X changes and you simplify the inner loop at the cost of more in the setup and outer loop. This might work but I think it's yet another special case with only a minor speedup because the real time savings aren't in the line routines anymore.

The other approach lets you adjust Y by only updating the lower byte until you know the higher byte will change. You should be able to do this up to 6 times for certain lines. However, the complexity of finding where you can do that might negate any savings. Think of it as a vertical equivalent to the optimization I suggested for the nearly horizontal line. The difference is, you must add 40 to the low byte of the screen address each pass in the inner loop. At best, the code would be much larger due to the number of special cases. At worst it could be slower because of additional tests to find where you can do this.

I don't think it would be worth attempting this unless both ideas could be combined.

Code: Select all

loop 
	iny             		; Step in y
__auto_adx
	adc #00					; +DX
	bcc skip    			; Time to step in x?
	
__auto_stepx
	inx            			; Step in x
		
__auto_dy  
	sbc #00     			; -DY
 
skip 
	; Set the new screen adress
	sta save_a
	lda _HiresAddrLow,y
	sta tmp0+0
	lda _HiresAddrHigh,y
	sta tmp0+1

draw_pixel	
	; Draw the pixel
	sty save_y
	ldy _TableDiv6,x
	lda _TableBit6Reverse,x

JamesD · Post by **JamesD** » Mon Feb 08, 2010 7:41 pm

Ok, so someone else can mull over the math concept and possibly come up with a solution...

Here is the simple part of the math:
X1 - X2 = DX
Y1 - Y2 = DY

If DX > DY THEN
DX / DY = number of steps in Y per step in X
ELSE
DY / DX = number of steps in X per step in Y

However... that isn't an even number often so you have DIV and MOD results, plus a divide is slow.

So, somehow the steps must be adjusted for each inner loop based on the MOD. Perhaps the steps + a table lookup based on the MOD before each inner loop.

Then the inner loop just uses a counter based on that in the Y direction or sets that number of bits in the X direction. Or something along those lines.

thrust26 · Post by **thrust26** » Mon Feb 08, 2010 7:42 pm

JamesD wrote:I think I simplified that too much to make sense. I'll explain both approaches a little.

I see.

Yes, there are some possibly optimizations here. But like you say, the overhead for utilizing them is often too high. And the code becomes ultra-complex soon.

thrust26 · Post by **thrust26** » Mon Feb 08, 2010 7:45 pm

There is another possible optimization, called "two-step" line drawing. It looks pretty interesting, but I haven't tried it yet.

JamesD · Post by **JamesD** » Mon Feb 08, 2010 8:46 pm

thrust26 wrote:There is another possible optimization, called "two-step" line drawing. It looks pretty interesting, but I haven't tried it yet.

I took a look at that when I was messing with this stuff back in July but never got around to trying it. Short attention span.

Since it draws from both ends it should be similar to partially unrolling the loops and simpler to implement than the math idea. If you have a CPU with enough registers it could be much faster, but I don't see that being the case here. I think it would be eliminating up to half the loop testing and branching clock cycles but that's what? A guess would be less than 10 clock cycles * 1/2 line length per line? It should be faster though.

It would definitely be larger in size and you need more temporary variables, but then I think at this point you aren't going to be able to make any speed improvements without an increase in code size.

<edit>
BTW, I'm not sure how it would compare to the chunking optimization.
Since both ends can differ where in a chunk you start I'm not sure you'd do better on near horizontal lines.

JamesD · Post by **JamesD** » Tue Feb 09, 2010 8:31 am

thrust26 wrote:Linebench is at 473 now. I think the main optimization for speed is done, so this can be used for testing and measuring.

Which emulator are you using? I'm not getting anywhere near that with Oricutron.

Xeron · Post by **Xeron** » Tue Feb 09, 2010 9:00 am

JamesD wrote:Which emulator are you using? I'm not getting anywhere near that with Oricutron.

Oricutrons timings should (in theory) be more accurate than euphoric. You shouldn't directly compare results from different emulators.

That said, i don't have a real oric to compare so if someone could try linebench on real hw, euphoric, and Oricutron and see which is nearest that would be a useful test.

JamesD · Post by **JamesD** » Tue Feb 09, 2010 10:06 am

Xeron wrote:
JamesD wrote:Which emulator are you using? I'm not getting anywhere near that with Oricutron.
Oricutrons timings should (in theory) be more accurate than euphoric. You shouldn't directly compare results from different emulators.

That said, i don't have a real oric to compare so if someone could try linebench on real hw, euphoric, and Oricutron and see which is nearest that would be a useful test.

Well, I'm getting 664 so I think the latest code isn't in the repository.