One reason for this fatness is the way the opcodes are generated from a pseudo-assembly language.
For example, there are calls to the LSH1W_DD instruction (as a matter of fact it is used 17 times), which is implemented this way:
/* LSH1W_DD: tmp2 = tmp1 << 1, both 16-bit little-endian words */
#define LSH1W_DD(tmp1,tmp2)\
lda tmp1 ;\
asl ;\
sta tmp2 ;\
lda tmp1+1 ;\
rol ;\
sta tmp2+1 ;\
The problem is that in all 17 of these calls tmp1 is equal to tmp2, so the most efficient code would simply have been:
asl tmp1
rol tmp1+1
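To check that the in-place form really does the same job, here is a small C model of both sequences. It is only a sketch: it tracks the accumulator, the carry flag and a little-endian word in memory, and ignores the N/Z flags and timing. All function names are made up for illustration. The one observable difference is the final value of A (the macro leaves the shifted high byte in A, the in-place form leaves A untouched), so the substitution presumably relies on the next macro not reading A before reloading it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Minimal model of the 6502 state touched by these sequences. */
typedef struct {
    uint8_t a;        /* accumulator */
    bool carry;       /* carry flag */
    uint8_t mem[256]; /* zero page; a word is mem[addr] (lo), mem[addr+1] (hi) */
} Cpu;

static void asl_a(Cpu *c) { c->carry = c->a & 0x80; c->a = (uint8_t)(c->a << 1); }

static void rol_a(Cpu *c) {
    bool out = c->a & 0x80;
    c->a = (uint8_t)((c->a << 1) | (c->carry ? 1 : 0));
    c->carry = out;
}

static void asl_m(Cpu *c, int addr) {
    c->carry = c->mem[addr] & 0x80;
    c->mem[addr] = (uint8_t)(c->mem[addr] << 1);
}

static void rol_m(Cpu *c, int addr) {
    bool out = c->mem[addr] & 0x80;
    c->mem[addr] = (uint8_t)((c->mem[addr] << 1) | (c->carry ? 1 : 0));
    c->carry = out;
}

/* LSH1W_DD(src,dst) as the macro expands it: lda/asl/sta then lda/rol/sta. */
static void lsh1w_dd(Cpu *c, int src, int dst) {
    c->a = c->mem[src];     asl_a(c); c->mem[dst] = c->a;
    c->a = c->mem[src + 1]; rol_a(c); c->mem[dst + 1] = c->a;
}

/* In-place form: asl tmp / rol tmp+1 */
static void lsh1w_inplace(Cpu *c, int addr) {
    asl_m(c, addr);
    rol_m(c, addr + 1);
}

/* True if, with src == dst, both forms give the same word and carry
   for every 16-bit input, and that word equals v << 1. */
static bool lsh1w_forms_agree(int addr) {
    assert(addr >= 0 && addr < 255); /* the word must fit in the modeled page */
    for (uint32_t v = 0; v <= 0xFFFF; v++) {
        Cpu x = {0}, y = {0};
        x.mem[addr] = y.mem[addr] = (uint8_t)(v & 0xFF);
        x.mem[addr + 1] = y.mem[addr + 1] = (uint8_t)(v >> 8);
        lsh1w_dd(&x, addr, addr);
        lsh1w_inplace(&y, addr);
        if (x.mem[addr] != y.mem[addr] || x.mem[addr + 1] != y.mem[addr + 1] ||
            x.carry != y.carry)
            return false;
        if ((x.mem[addr] | (x.mem[addr + 1] << 8)) != (uint16_t)(v << 1))
            return false;
    }
    return true;
}
```

Assuming tmp1 lives in zero page, the macro expansion is 10 bytes and 16 cycles, while asl/rol on zero page is 4 bytes and 10 cycles, so each of the 17 calls would save 6 bytes.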
The same thing happens with CZBW_DD:
/* CZBW_DD: zero-extend the byte at tmp1 into the 16-bit word at tmp2 */
#define CZBW_DD(tmp1,tmp2)\
lda tmp1 ;\
sta tmp2 ;\
lda #0 ;\
sta tmp2+1 ;\
which, when tmp1 is equal to tmp2, reduces to:
lda #0
sta tmp1+1
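Since the C preprocessor cannot easily test whether two macro arguments are the same operand, one option would be to make that check where the pseudo-assembly is emitted. A minimal sketch in C, with hypothetical emitter functions (this is not lcc65's actual code generator), that writes the short sequences when source and destination are spelled identically and falls back to the existing macros otherwise:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical emitter for LSH1W_DD(src,dst): dst = src << 1 (16-bit).
   Writes into buf and returns what snprintf returns. */
static int emit_lsh1w(char *buf, size_t size, const char *src, const char *dst) {
    assert(src != NULL && dst != NULL);
    if (strcmp(src, dst) == 0)
        /* Same operand: shift the word in place, A is left untouched. */
        return snprintf(buf, size, "\tasl %s\n\trol %s+1\n", dst, dst);
    return snprintf(buf, size, "\tLSH1W_DD(%s,%s)\n", src, dst);
}

/* Hypothetical emitter for CZBW_DD(src,dst): zero-extend byte src into word dst. */
static int emit_czbw(char *buf, size_t size, const char *src, const char *dst) {
    assert(src != NULL && dst != NULL);
    if (strcmp(src, dst) == 0)
        /* Low byte is already in place: only the high byte needs clearing. */
        return snprintf(buf, size, "\tlda #0\n\tsta %s+1\n", dst);
    return snprintf(buf, size, "\tCZBW_DD(%s,%s)\n", src, dst);
}
```

Comparing operand text is conservative: "tmp0" and "tmp0+0" would not match and simply keep the macro. For CZBW_DD the short form ends in exactly the same state (A = 0, same flags), since the dropped lda/sta pair only copied the byte onto itself.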
Here is what a short piece of generated code looks like before macro expansion:
Lctk193
INDIRB_CD(Lctk190,tmp0)
CZBW_DD(tmp0,tmp0)
LSH1W_DD(tmp0,tmp0)
ADDW_DCD(tmp0,Lctk82,tmp0)
INDIRW_ZD(tmp0,tmp0)
INDIRW_CD(reg0,tmp1)
NEW_DD(tmp0,tmp1,Lctk197)
LEAVE
Lctk197
And here is the same code once the macros have been expanded:
Lctk193
lda Lctk190
sta tmp0
lda tmp0
sta tmp0
lda #0
sta tmp0+1
lda tmp0
asl
sta tmp0
lda tmp0+1
rol
sta tmp0+1
clc
lda #<(Lctk82)
adc tmp0
sta tmp0
lda #>(Lctk82)
adc tmp0+1
sta tmp0+1
ldy #0
lda (tmp0),y
tax
iny
lda (tmp0),y
stx tmp0
sta tmp0+1
lda reg0
sta tmp1
lda reg0+1
sta tmp1+1
lda tmp0
eor tmp1
sta tmp
lda tmp0+1
eor tmp1+1
ora tmp
beq *+5
jmp Lctk197
jmp leave
Lctk197
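The expansion above shows the kind of redundancy a peephole pass could catch: right after sta tmp0, the pair lda tmp0 / sta tmp0 does nothing. As a rough sketch only, here is a C pass with hypothetical names that drops the second of two adjacent lda/sta instructions sharing the same operand. It assumes the operand is plain RAM (not an I/O register), that no code patches these instructions at run time (the self-modifying code problem), that labels sit on their own unindented lines so nothing can jump between the two instructions, and, for the sta-then-lda case, that nothing relies on the N and Z flags the dropped lda would have set.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_LINE 32

/* Split an indented "mnemonic operand" line. Returns 0 for labels
   (column 0), blank lines and opcodes without operand such as "asl". */
static int split(const char *line, char op[4], char arg[MAX_LINE]) {
    if (line[0] != ' ' && line[0] != '\t')
        return 0;
    return sscanf(line, " %3[a-z] %31s", op, arg) == 2;
}

static int is_load_or_store(const char *op) {
    return strcmp(op, "lda") == 0 || strcmp(op, "sta") == 0;
}

/* Removes the second of two adjacent lda/sta lines with the same operand,
   in place. Returns the new number of lines. */
static int peephole(char lines[][MAX_LINE], int n) {
    assert(n >= 0);
    int i = 0;
    while (i + 1 < n) {
        char op1[4], arg1[MAX_LINE], op2[4], arg2[MAX_LINE];
        if (split(lines[i], op1, arg1) && split(lines[i + 1], op2, arg2) &&
            is_load_or_store(op1) && is_load_or_store(op2) &&
            strcmp(arg1, arg2) == 0) {
            /* A and memory already agree on this value: drop line i+1 and
               re-check line i against its new neighbour. */
            memmove(lines[i + 1], lines[i + 2],
                    (size_t)(n - i - 2) * MAX_LINE);
            n--;
        } else {
            i++;
        }
    }
    return n;
}
```

On this listing it would only remove the lda tmp0 / sta tmp0 pair. Turning the lda/asl/sta and lda/rol/sta sequences into asl tmp0 / rol tmp0+1 would also require knowing that A is not used afterwards, which pairwise pattern matching alone cannot tell.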
Some time ago Fabrice talked about writing a peephole optimizer. I guess that could be doable in some way, but it is far from obvious as soon as you take things like self-modifying code into consideration. I guess it would work fine on the C code