Size optimisation: Compact subroutines

Here you can ask questions or provide insights about how to use efficiently 6502 assembly code on the Oric.
User avatar
Dbug
Site Admin
Posts: 4444
Joined: Fri Jan 06, 2006 10:00 pm
Location: Oslo, Norway
Contact:

Size optimisation: Compact subroutines

Post by Dbug »

Since I've been working on minigames, I have been struggling with trying to get the most possibly compact way of doing things.

When I check back the code of 4KKong and Cyclotron, I notice that many things could have been improved, specially in the category of parameter passing. :?

Typically, routines are called in some similar way:

Code: Select all

lda #value1
ldx #value2
ldy #value3
jsr SomeRoutine
such code takes 9 bytes just for passing parameters.

I was thinking that since the routine knows in wich order parameters are supposed to be passed, that we could do some trickery and avoid the specification of registers, like this:

Code: Select all

jsr SomeRoutine
.byte value1,value2,value3
this call now takes 6 bytes for parameters passing, but of course we need to get some code in the called routine to 1) extract the parameters, and 2) correct the program counter so it will skip the three values.

For such a method to be efficient, the size taken by this stack correction code needs to be smaller that the sum of bytes we don't waste at each function call.

In the minigames, typical routines that are called very often are things like text display routines, memory copy, graphical primitives, sound replay things, keyboard testing, ... a way to use less memory would be to actually also not have to use a 3 bytes long absolute JSR to reach these routines. So came to me the idea that perhaps we could use BRK for such a purpose.

Let's imagine we have a table of adresses of all routines we want to use with this optimised parameter calling method, and of the number of parameters the routine needs, we could get something like this:

Code: Select all

#define ROUTINE_CLEAR_SCREEN 0
#define ROUTINE_SWITCH_TEXT 1
#define ROUTINE_SWITCH_HIRES 2
#define ROUTINE_DISPLAY_TEXT 3
#define ROUTINE_READ_KEYBOARD 4

TableRoutineAdressLow
    .byte <ClearScreen
    .byte <SwitchToText
    .byte <SwitchToHires
    .byte <DisplayText
    .byte <ReadKeyboard

TableRoutineAdressHigh
    .byte >ClearScreen
    .byte >SwitchToText
    .byte >SwitchToHires
    .byte >DisplayText
    .byte >ReadKeyboard

TableRoutineParamCount
    .byte 0
    .byte 0
    .byte 0
    .byte 2  ; Message index, screen position index
    .byte 0
Using these defines and tables, we can then have our function calls looking this way:

Code: Select all

    ; Display text
    brk
    .byte ROUTINE_DISPLAY_TEXT
    .byte MESSAGE_HELLO_WORLD,POSITION_STATUS_LINE
The code would then trigger an exception when reaching the BRK instruction, read the first byte after that indicates which is the routine to call, using the TableRoutineParamCount it would know how many parameters to read after, would read the values in A, X and Y, and then call the routine itself by reading the adress in the two other tables.

I'm not sure of how compact or efficient this would be, and I'm not very sure of how to actually make this BRK code handling routine...

So let's do a bit of maths, to compare the efficiency of normal code and of this 'optimized' code. Let's define some values:
- ROUTS the number of vectorized routines
- CALLS the number of times functions are called.
- PARAMS the number of parameters passed during the various calls

In Cyclotron, I have many calls to the following routines:
- 2x KeyboardGetChar (0 parameter)
- 3x DrawDisc (1 parameter)
- 4x KeyboardDebounce (0 parameter)
- 4x IncrementScore (0 parameter)
- 4x Temporize (0 parameter)
- 6x EraseMemory (1 parameter)
- 7x Arena_DrawRectangle (1 parameter)
- 9x Display_Message (1 parameter)

There are some others, but let's say that for the sake of discussion that I just vectorized these ones. This gives us the following:
- ROUTS=8 (because we have 8 routines).
- CALLS =39 (because it's the result of 2+3+4+4+4+6+7+9)
- PARAMS=25 (because it's what (3x1)+(7x1)+(9x1) is equal to)

With the old routine, the total memory consumption is simple to compute, since this is just the sum of all JSR and LDA/LDX/LDY used when calling the functions:
(CALLS*3)+(PARAMS*2)
=(39*3)+(25*2)
=117+50
=167 bytes

With the new routine, the cost of function calls and parameters storing is equal to one less byte for each JSR, and one less byte for each LDA/LDX/LDY used when calling the functions, plus the one time cost of the tables defining the functions, plus the size of the BRK handling code:
(CALLS*2)+PARAMS+(ROUTS*3)+BRKCODE
=(39*2)+25+8+BRKCODE
=78+25+8+BRKCODE
=111+BRKCODE

If we want (in this example code) to actually get a memory decrease, the BRK handling code has to be less than 56 bytes long.

Now, in the whole code of Cyclotron I had 133 jsr in total, some being repeated, some other not. Speculating that these 94 additional function calls are all going to different functions (worse case for the new routine because it increases the parameter table), and have no parameters either (worse case because there is no win on parameters passing), this would add to the original code 94*3=282 bytes, while it would add to the new routine 94*2+94*3=470 bytes, which I admit is not pretty.

In order to get a good candidate for vectorization, a routine needs to be called a minimum number of time. This number depends of how many parameters the routine has:

The cost for the standard routine is: CALLS*(3+PARAMS*2)
The cost for the BRK routine is: CALLS*(2+PARAMS)+3

By comparing the cost for the various number of parameters, we can find the number of calls under which the new routine is more efficient:

- For 0 parameters it need to be called at least 4 times.
- For 1 parameter it need to be called at least 3 times.
- For 2 parameters it need to be called at least 2 times.
- For 3 parameters it need to be called at least 1 time (but then it takes as much room as a standard call, benefits arrive with the second call)

So, does it looks like something doable, or does it looks like very painfull idea without much room for positive points ? And how would be this BRK handling routine coded on the Oric ? :roll:

Yuck, looks like a decently long post, thanks for reading ;)
User avatar
Euphoric
Game master
Posts: 99
Joined: Mon Jan 09, 2006 11:33 am
Location: France

Post by Euphoric »

In Cube, I had so many calls to the "string display routine" that I indeed use BRK to display messages. At first, I used BRK #n to display message number n, which took only 2 bytes. Using a single byte to refer to a message allows to gain one byte compared to passing the message address.
However, I calculated that it was shorter in my special case to use LDX #n ; BRK (3 bytes then), because I could factorize a lot of the LDX instructions (and spent less space in the parameter fetch, but still had to adjust the return address)... And, since I had less than 128 different messages, I used the 8th bit of number n to specify whether the string-display routine has to add a CR/LF sequence at the end of the message or not...

Cube's source is available, it's part of the zip package on the individual minigames page of the 2004 compo... (yes, the BRK handling code is far less than 56 bytes long)

Cheers,

Fabrice
User avatar
Dbug
Site Admin
Posts: 4444
Joined: Fri Jan 06, 2006 10:00 pm
Location: Oslo, Norway
Contact:

Post by Dbug »

Ok, thanks.
I downloaded the package, so here is the code you wrote/

This parts installs the BRK handler:

Code: Select all

lda #<irq_brk_handler
ldy #>irq_brk_handler
sta $0245
sty $0246
This is the BRK handler itself:

Code: Select all

irq_brk_handler
	sta saveA
	pla
	pha
	asl
	asl
	asl
	asl
	lda saveA
	bcs brk_handler
	jmp $EE22
brk_handler		; no need to save registers for BRK
	txa
	tay
	tsx
	lda $0103,x
	bne *+5
	dec $0103,x
	dec $0102,x
	tya
	pha
	and #$7F
	tax
	jsr print_msg
	pla
	bmi brk_return
	jsr $CBF0	; CR+LF
brk_return
	rti
And it's typically called this way:

Code: Select all

	ldx #WHAT_NOW
	brk
I have some questions, want to be sure that I understood everything :)
- Is it safe to write a value in $245/$246 without disabling the interuptions ? Is there no chance that some interrupts will be called in between the STA and the STY ?

- The PLA/PHA is done to get the value of the status register and then reading the value of the 4th bit that indicates that is its a BRK and not a regular interruption, right ?

- Les calculs compliqués servent à récuperer le pointeur de pile pour changer l'adresse de retour pour corriger le PC+1 après le BRK.

C'est effectivement relativement compact :)
Et la série de asl asl asl asl se compresse pas trop mal je pense :D
User avatar
carlsson
Pilot Officer
Posts: 127
Joined: Thu Jan 12, 2006 11:26 pm
Location: Västerås, Sweden

Post by carlsson »

Memory copy within a minigame? Reading input is something I prefer to do once in a loop, and if you make it a subroutine, do you need input parameters?

For comparison, I have a total of 54 subroutine calls in Jewels 20, 42 calls in Othello, 34 calls in VIQ-Bert, 41 calls in Zapactris and 30 calls in ReConnectris. Maybe I'm not doing as much textual output etc.
User avatar
Dbug
Site Admin
Posts: 4444
Joined: Fri Jan 06, 2006 10:00 pm
Location: Oslo, Norway
Contact:

Post by Dbug »

Well, If you take a look at the intro in Cyclotron, I have a vectorial rendering engine to draw the TRON logo, then it appears by scrolling on screen.

Then I also have a table of highscores to sort when inserting a new name.

For generating the playfieds I have circle and rectangle drawing routines, plus an emboss filter to give a 3d highlights looks.

Since there is no already made method to scroll the screen on the oric, I need some fast buffer to screen copy code too.

Also a lot of conditional code, because there is 8 totaly different levels (one or two players, with different walls placements, different end of level objectives too -time limit, last survivor, reach the exit-).
User avatar
Euphoric
Game master
Posts: 99
Joined: Mon Jan 09, 2006 11:33 am
Location: France

Post by Euphoric »

- Is it safe to write a value in $245/$246 without disabling the interuptions ? Is there no chance that some interrupts will be called in between the STA and the STY ?
Taken out of context, you're right, there's absolutely a chance that an interrupt happens between the STA/STY (let's say, one chance out of 2500)... :-)

But, since this code is executed a very short time after loading, and since the timer is reinitialized by the loading routine in rom, an interrupt cannot happen at the beginning of the program... :-)

- The PLA/PHA is done to get the value of the status register and then reading the value of the 4th bit that indicates that is its a BRK and not a regular interruption, right ?
Absolutely...
- Les calculs compliqués servent à récuperer le pointeur de pile pour changer l'adresse de retour pour corriger le PC+1 après le BRK.
Oui... on peut se contenter de décrémenter le poids faible de l'adresse de retour si on sait qu'aucune adresse de retour n'est multiple de 256... C'est pénible à vérifier pendant la phase de développement, mais on peut le faire tout à la fin pour gagner quelques octets sur la version finale...

Please note that this decrementation is not necessary if you use BRK as an immediate addressing mode instruction (with a one byte parameter, so)...

Cheers,

Fabrice
User avatar
Dbug
Site Admin
Posts: 4444
Joined: Fri Jan 06, 2006 10:00 pm
Location: Oslo, Norway
Contact:

Post by Dbug »

Sorry for the technical problems... I was not supposed to switch back to French language on the two last sentences :D
Les calculs compliqués servent à récuperer le pointeur de pile pour changer l'adresse de retour pour corriger le PC+1 après le BRK.
This means "the complex calculations are there to extract the stack pointer in order to change the value of the return adress to PC+1 after the BRK has been executed.
Post Reply