Experimental very fast tape loading

Anything related to the tools Tap2Wav, Tap2CD, Tap2Dsk, Sedoric Disc Manager, Tape Header Creator, WriteDsk, and generally speaking tools related to the management of Oric data files and devices.
NekoNoNiaow
Flight Lieutenant
Posts: 272
Joined: Sun Jan 15, 2006 10:08 pm
Location: Montreal, Canadia

Re: Experimental very fast tape loading

Post by NekoNoNiaow »

Symoon wrote: Sun Apr 01, 2018 8:34 am Is it a 3 cycles or 2 cycles advantage, since I'm exiting from the BCS?
I think I tried to, but it took more time, or same with more bytes ;) Here the PLP/CLC/PHP/RTI is 15 cycles, couldn't find a shorter sequence...

Problem here is that:
- sadly registers A and Y must remain unaffected (or restored)
- I can't use much more bytes (I almost fill the page), but let's forget that for now
I have been working on that a bit during the weekend and I have a solution which saves 6 cycles but trashes the X register. From what I understand, X needs to contain the timer counter value after coming back from the interrupt, but this can probably be avoided depending on what the code after the waiting loop is doing.
This solution also puts a constraint on the stack: its depth when entering the waiting loop must stay constant throughout the loading.

But if you are not doing any JSR or RTS after the waiting loop and before coming back to the waiting loop then the X trashing is not necessary and you would save 4 more cycles (so 10 total).

I will post it tomorrow, as I also have another idea I want to explore before.
Symoon wrote: Sun Apr 01, 2018 8:34 am Oh BTW it's not just the interrupt that has to be faster, that would be too easy :D It's the whole loop decoding a byte (about 100 cycles + interrupt time). I need to save about 6 cycles I think.
I am like DBug, I would love to see it. ;)
Symoon wrote: Mon Apr 02, 2018 1:57 pm Tested and working on my "slow" Atmos! :mrgreen:
Holy kitty, this is fast! Great job!
Symoon
Archivist
Posts: 2307
Joined: Sat Jan 14, 2006 12:44 am
Location: Paris, France

Re: Experimental very fast tape loading

Post by Symoon »

Thanks for your efforts ;)
I'm currently working on 3 things in parallel:
1- cleaning and optimising a bit the working code (saving 4 or 5 bytes)
2- a "slow" version that could work on ROM 1.0 (would be around 25% slower I think)
3- coding a WAV generator that would let you choose between the "slow" 1.0/1.1 version or the "fast" 1.1-only version, including the loader at F16 speed, etc.

I'll try to finish 1/ first and post it this weekend!
Symoon
Archivist
Posts: 2307
Joined: Sat Jan 14, 2006 12:44 am
Location: Paris, France

Re: Experimental very fast tape loading

Post by Symoon »

Here's the source code (well, in my usual unfriendly format) for the Oric part.
Cleaned and translated it quickly, sorry if some parts are awkward!

Novalight_v1.1a_source_ENGLISH.txt.zip
(4.41 KiB) Downloaded 346 times

I think it could be optimized a bit more if it didn't check for the end of the program after each byte, but I'm not sure it would be enough to allow a faster speed - need to check that. The last idea I have to go faster would be to save a total of 23µs (one 44 kHz sample) on every main byte loop, so the start bytes could be 5, 6, 7 or 8 samples long instead of 6, 7, 8 or 9 now.
(EDIT: might be worth testing. Not checking for the end of the program, which would instead be indicated by a 9-sample sinusoid, also saves the need for STA 2F... And moving the LDY#$00 into the repeat loop (before RTS) saves 2 more µs. Total saved 19µs...)
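To put these figures in CPU terms, here is a small sketch that converts WAV samples into 6502 cycles. It assumes a 44.1 kHz sample rate (the post only says "44 kHz") and a 1 MHz CPU clock, so that one cycle lasts 1 µs. Both constants and the function names are illustrative assumptions, not taken from the Novalight source.

```python
SAMPLE_RATE_HZ = 44_100   # assumption: "44 kHz" in the post means 44.1 kHz
CPU_CLOCK_HZ = 1_000_000  # assumption: 1 MHz 6502, i.e. 1 cycle = 1 microsecond


def sample_duration_us(samples: int = 1) -> float:
    """Duration of `samples` WAV samples, in microseconds."""
    return samples * 1_000_000 / SAMPLE_RATE_HZ


def cycles_per_samples(samples: int) -> float:
    """CPU cycles that elapse while `samples` WAV samples are played."""
    return samples * CPU_CLOCK_HZ / SAMPLE_RATE_HZ


if __name__ == "__main__":
    print(f"one sample = {sample_duration_us():.2f} us")
    # Start-byte lengths quoted in the post: current vs. hoped-for.
    for current, target in zip((6, 7, 8, 9), (5, 6, 7, 8)):
        print(f"{current} samples = {cycles_per_samples(current):6.1f} cycles, "
              f"{target} samples = {cycles_per_samples(target):6.1f} cycles")
```

Under these assumptions one sample corresponds to roughly 22.7 cycles, which is the budget the 19 µs mentioned in the EDIT can be compared against.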

BTW it's intended to be located in page 1 (stack) so it can't use much more bytes...
Dbug
Site Admin
Posts: 4444
Joined: Fri Jan 06, 2006 10:00 pm
Location: Oslo, Norway

Re: Experimental very fast tape loading

Post by Dbug »

Adding labels would help: Having to locate where BNE -7 and BCC -27 go... is not super optimal.
Symoon
Archivist
Posts: 2307
Joined: Sat Jan 14, 2006 12:44 am
Location: Paris, France

Re: Experimental very fast tape loading

Post by Symoon »

Aaaaah, but it helps me calculate the hex value, and know which jumps are relative or not and need to be changed when changing the program size... (yes I'm doing everything using notepad.exe :lol: )
Give me some time to switch to better shared habits. Hey, I already tried to use $ and # correctly this time :mrgreen:
Symoon
Archivist
Posts: 2307
Joined: Sat Jan 14, 2006 12:44 am
Location: Paris, France

Re: Experimental very fast tape loading

Post by Symoon »

Symoon wrote: Fri Apr 06, 2018 1:13 pm I think it could be optimized a bit more if it didn't check for the end of the program after each byte, but I'm not sure it would be enough to allow a faster speed - need to check that. The last idea I have to go faster would be to save a total of 23µs (one 44 kHz sample) on every main byte loop, so the start bytes could be 5, 6, 7 or 8 samples long instead of 6, 7, 8 or 9 now.
(EDIT: might be worth testing. Not checking for the end of the program, which would instead be indicated by a 9-sample sinusoid, also saves the need for STA 2F... And moving the LDY#$00 into the repeat loop (before RTS) saves 2 more µs. Total saved 19µs...)
Ok, not working - reading byte loop is too slow.
It would have saved about 5% loading time (for instance Zorgon would have loaded in 13.7 seconds instead of 14.5).
I can't see where to save time now, except by finding a way to get rid of the JSR to read a byte when loading the program (in 043D). But reading a byte is also called before, twice, to start and to load the header...

I guess I'd rather spend time working on the signal generator tool now!
Symoon
Archivist
Posts: 2307
Joined: Sat Jan 14, 2006 12:44 am
Location: Paris, France

Re: Experimental very fast tape loading

Post by Symoon »

Symoon wrote: Sat Apr 07, 2018 7:54 am Ok, not working - reading byte loop is too slow.
Actually, it was bad programming on my part (assuming C=0 when it's not actually certain has consequences... As well as exiting a program while being in a JSR :mrgreen: ). There is still hope, but it requires going back to work!
Symoon
Archivist
Posts: 2307
Joined: Sat Jan 14, 2006 12:44 am
Location: Paris, France

Re: Experimental very fast tape loading

Post by Symoon »

Victory! :D :D
I'm saving a sample per byte ;)
Two seem to almost work but a sinusoid is missed after a few hundred bytes, both on Euphoric and real machines, so I'll forget that (until an idea pops up to save more cycles ;) )

Another mystery: I noticed that with my "slow" Atmos, and only with it, I sometimes have to reboot the PC that plays the WAV file, otherwise there were errors with Novalight. But if I don't reboot and switch to another Atmos, everything is fine.
That puzzles me! And I wonder if it didn't interfere with the validity of some of my previous tests.
Chema
Game master
Posts: 3014
Joined: Tue Jan 17, 2006 10:55 am
Location: Gijón, SPAIN

Re: Experimental very fast tape loading

Post by Chema »

Just wanted to pop in to express how impressed I am with your work!

When I saw the video I could hardly believe it was loading Xenon1 so quickly... Unbelievable! And now even faster. :shock:
Symoon
Archivist
Posts: 2307
Joined: Sat Jan 14, 2006 12:44 am
Location: Paris, France

Re: Experimental very fast tape loading

Post by Symoon »

Thanks! Well, it's just me having fun, pushing Fabrice's wonderful tools to the limits :wink:

You probably won't notice any change with Xenon: it goes (without loader) from 11.38 seconds to 10.80. That's half a second faster, but with such short loading times, making significant progress becomes complicated ;)
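For reference, the savings quoted in the thread (Zorgon 14.5 s vs 13.7 s, Xenon 11.38 s vs 10.80 s) can be expressed both in seconds and as a percentage. This is just arithmetic on the posted numbers; the `saving` helper is a made-up name.

```python
def saving(before_s: float, after_s: float) -> tuple[float, float]:
    """Return (seconds saved, percentage saved) for two loading times."""
    saved = before_s - after_s
    return saved, 100.0 * saved / before_s


if __name__ == "__main__":
    # Loading times quoted in the thread, in seconds.
    for name, before, after in (("Zorgon", 14.5, 13.7), ("Xenon", 11.38, 10.80)):
        seconds, percent = saving(before, after)
        print(f"{name}: {seconds:.2f} s saved ({percent:.1f} %)")
```

Both cases land at a little over 5%, consistent with the "about 5%" estimate given earlier for Zorgon.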
iss
Wing Commander
Posts: 1641
Joined: Sat Apr 03, 2010 5:43 pm
Location: Bulgaria

Re: Experimental very fast tape loading

Post by iss »

Congrats for victory, Symoon!
Your source looks super. I tried to test it with some quickly written encoding code based on the explanations in your file, but failed. Anyway, my tests are not relevant and only for fun, I'll wait until you release the encoding part.
Symoon
Archivist
Posts: 2307
Joined: Sat Jan 14, 2006 12:44 am
Location: Paris, France

Re: Experimental very fast tape loading

Post by Symoon »

Thanks guys.
I'll have to set the loader in page 1 now, test it again, and write a decent WAV generator. With my rotten C memories, don't hold your breath ;)

I think by default for multipart programs, the loader will be loaded again for each part (holding the name of the following high-speed program, so this remains compatible with CLOAD orders in programs). I will have to check how Fabrice and Chema did for this part.
Maybe a "no loader" option, and maybe another option for 1.0 ROMs, but I'll put this last one aside until I've first finished a complete 1.1 ROM version.
NekoNoNiaow
Flight Lieutenant
Posts: 272
Joined: Sun Jan 15, 2006 10:08 pm
Location: Montreal, Canadia

Re: Experimental very fast tape loading

Post by NekoNoNiaow »

I am a bit late but here are my attempts at accelerating the code that Symoon posted.
(This is the code on the forum, not from the archive, I have not read it (yet).)

Not all of these attempts bring acceleration (but many do) and some have drawbacks.
I think it is worth looking at all of them, they might give ideas to other coders. ;)

The last one is I think the best if it can be adapted with your system.
It saves 14 cycles total and costs a few bytes. \(ˆˆ)/

There are variants indicated by "Note1" and "Note2" mentions. You can skip them once you get the gist of what they mean.

Please tell me if you find any errors.
Now, go grab a coffee because that is a long read. ;)

Hope this helps!

Original code for reference

Code: Select all

(...)
0460   2     38       SEC
0461   2/3   B0 FE    BCS -2   infinite loop (waiting for interrupt)
(...)

Interrupt code:
04D0   4   AE 00 03  LDX 0300   Reset flag on CB1
04D3   4   AE 08 03  LDX 0308   read timer (sinusoid duration) in X
04D6   4   8E 09 03  STX 0309   Reset timer counter (writing in #309 sets #308 with #F5 once instruction executed)
04D9   4   28        PLP        Get the system flags saved by the interrupt
04DA   2   18        CLC        Set C to 0 to leave the loop
04DB   3   08        PHP        Save the system flags
04DC   6   40        RTI        Back to the loop 
Pre-calculated interrupt stack

Code: Select all

; Setup code (is run only once at the start of the loading routine).
; Prepares an interrupt return stack at the bottom of the stack = $100 to $102.
; It should never be overwritten during the tape loading routine.
xx00   2   A9 00     LDA #0     ; store the desired RTI flag value (=0)
xx02   4   8D 00 01  STA $0100
xx05   2   A9 6A     LDA #$6A   ; store the desired RTI return address (046A LSB)
xx07   4   8D 01 01  STA $0101
xx0A   2   A9 04     LDA #$04   ; store the desired RTI return address (046A MSB)
xx0C   4   8D 02 01  STA $0102
(...)

Code: Select all

(...) ; Interrupt waiting loop : 16 cycles (previous = 4/5)
0461   2   BA        TSX        ; Save stack pointer into the operand of the second LDX below |
0462   4   8E 6B 04  STX $046B  ;                                                             |-> cf Note1
0465   2   A2 FF     LDX #$FF   ; Prepare the future RTI stack pointer (RTI pulls from S+1, so $FF reads $0100 to $0102)
0467   2   38        SEC        ; Make sure the carry is set (would be nice if the 6502 had a branch-always instruction)
0468   2   B0 FE     BCS -2     ; infinite loop, waiting for interrupt (counted as 2c; the RTI returns directly to $046A)
046A   2   A2 xx     LDX #xx    ; Reload the stack pointer |
046C   2   9A        TXS        ;                          |-> Cf Note2
(...)

Code: Select all

(...) ; Interrupt routine : 20 cycles (previous = 27)
04D0   2   9A        TXS        ; Point the stack to the prepared "return of interrupt" stack
04D1   4   AE 00 03  LDX 0300   ; Reset flag on CB1
04D4   4   AE 08 03  LDX 0308   ; read timer (sinusoid duration) in X
04D7   4   8E 09 03  STX 0309   ; Reset timer counter (writing in #309 sets #308 with #F5 once instruction executed)
04DA   6   40        RTI        ; Back to the loop

; Total cycles:
;                  waiting loop (one iteration)
;                    + IRQ routine
;                         + failed branch test when coming back from interrupt
; original code =  4 + 27 + 3 = 34
; alternative   = 16 + 20     = 36
Conclusion: worse than the original, and this trashes X so the value of the timer is lost.
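As a sanity check of the pre-calculated frame, the sketch below models only the three pulls that RTI performs. It relies on the standard 6502 behaviour that a pull increments S before reading, so a frame stored at $0100 to $0102 is read when S is $FF on entry to RTI. The function names and the dictionary used as memory are hypothetical helpers, not part of the loader.

```python
STACK_PAGE = 0x0100  # the 6502 stack always lives in page 1


def rti_pull_addresses(sp: int) -> list[int]:
    """Addresses read by RTI, in order (P, PCL, PCH), for stack pointer `sp`.

    Each pull increments S (wrapping within 8 bits) and then reads $0100+S.
    """
    addresses = []
    for _ in range(3):
        sp = (sp + 1) & 0xFF
        addresses.append(STACK_PAGE + sp)
    return addresses


def rti(memory: dict[int, int], sp: int) -> tuple[int, int, int]:
    """Return (flags, return address, new S) after an RTI."""
    p_addr, pcl_addr, pch_addr = rti_pull_addresses(sp)
    pc = memory[pcl_addr] | (memory[pch_addr] << 8)
    return memory[p_addr], pc, (sp + 3) & 0xFF


if __name__ == "__main__":
    # Frame prepared by the setup code: flags = 0, return address = $046A.
    frame = {0x0100: 0x00, 0x0101: 0x6A, 0x0102: 0x04}
    flags, pc, sp = rti(frame, sp=0xFF)
    print(f"flags=${flags:02X} pc=${pc:04X} s=${sp:02X}")
```

Calling `rti_pull_addresses(0x00)` instead shows the frame being read one byte higher, at $0101 to $0103.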

Note1:

Code: Select all

Note1: if the stack pointer value is guaranteed to be constant when it reaches the waiting loop,
       then it can be precomputed in advance during the setup phase and stored directly in
       the instruction LDX #xx .
       In that case, the waiting loop routine does not need to store it and becomes shorter
       by moving the TSX/STX pair out of the loop and into the setup phase

This would bring  the number of cycles to:
; original code =  4 + 27 + 3 = 34
; alternative   = 10 + 20     = 30  (4 cycles gain)
Conclusion: 4 cycles faster than the original, but we still trash X to reload the stack pointer so we lose the value of the timer.

Note2:

Code: Select all

Note2: if no JSR instructions are used after coming back from the interrupt and
       before branching back to the waiting loop,
       then there is no need to restore the stack counter and the LDX + TXS pair at $046A can be moved
       later after the loop is done.
       Moreover, this prevents the trashing of X/timer-value!

This would bring  the number of cycles to:
; original code =  4 + 27 + 3 = 34
; alternative   = 12 + 20     = 32  (2 cycles gain)
Conclusion: 2 cycles faster than the original and works without trashing X
but does not work if a JSR is done before the next iteration of the loop
because the stack is at $100 and that would be messy.

Code: Select all

If Note1 and Note2 apply together.
This would bring  the number of cycles to:
; original code =  4 + 27 + 3 = 34
; alternative   =  6 + 20     = 26  (8 cycles gain)
Conclusion: 8 cycles faster than the original and works without trashing X
but does not work if a JSR is done before the next iteration of the loop (stack at $100, messy).
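The cycle totals quoted for this variant can be re-derived mechanically. The sketch below is only a tally of the per-instruction cycle counts as they are listed in the post (including the BCS counted as 2 cycles); the dictionary and function names are made up for illustration.

```python
# Cycle counts as listed in the post for the pre-calculated stack variant.
WAIT_LOOP = {"TSX": 2, "STX abs": 4, "LDX #0": 2, "SEC": 2, "BCS": 2,
             "LDX #xx": 2, "TXS": 2}
IRQ_ROUTINE = {"TXS": 2, "LDX $0300": 4, "LDX $0308": 4, "STX $0309": 4, "RTI": 6}
ORIGINAL_TOTAL = 4 + 27 + 3  # loop + IRQ routine + branch after return, from the post

NOTE1 = ("TSX", "STX abs")   # moved out of the loop into the setup phase
NOTE2 = ("LDX #xx", "TXS")   # moved after the loop


def total(skip: tuple[str, ...] = ()) -> int:
    """Loop + IRQ cycles, leaving out the loop instructions named in `skip`."""
    loop = sum(cycles for name, cycles in WAIT_LOOP.items() if name not in skip)
    return loop + sum(IRQ_ROUTINE.values())


if __name__ == "__main__":
    cases = (("as listed", ()), ("Note1", NOTE1), ("Note2", NOTE2),
             ("Note1+Note2", NOTE1 + NOTE2))
    for label, skip in cases:
        cycles = total(skip)
        print(f"{label:12s} {cycles:2d} cycles, gain {ORIGINAL_TOTAL - cycles:+d}")
```

The same tally can be reused for the other variants by swapping in their instruction lists.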

JMP variant

It does still precompute an interrupt stack at $100 to $102 and pre-store it in setup.
But now, it never uses RTI but instead uses JMP to go back after the waiting loop
- this allows removing the TXS in the interrupt routine
but requires pulling the flags (PLP) to clear the I flag before jumping back

Code: Select all

(...) setup phase is the same as previous attempt

(...) ; Interrupt waiting loop : 16 cycles (previous = 4/5)
0461   2   BA        TSX        ; Save stack pointer into the operand of the second LDX below |
0462   4   8E 6B 04  STX $046B  ;                                                             |-> cf Note1
0465   2   A2 00     LDX #0     ; Prepare the future RTI stack pointer 
0467   2   38        SEC
0468   2   B0 FE     BCS -2     ; infinite loop, waiting for interrupt (counted as 2c; the JMP returns directly to $046A)
046A   2   A2 xx     LDX #xx    ; Restore the stack pointer |
046C   2   9A        TXS        ;                           |-> Cf Note2.
(...)

(...) ; Interrupt routine : 19 cycles (previous = 27)
04D0   4   AE 00 03  LDX 0300   ; Reset flag on CB1
04D3   4   AE 08 03  LDX 0308   ; read timer (sinusoid duration) in X
04D6   4   8E 09 03  STX 0309   ; Reset timer counter (writing in #309 sets #308 with #F5 once instruction executed)
04D9   4   28        PLP        ; Restore system flags before interrupt, this clears the interrupt disable flag
04DA   3   4C 6A 04  JMP $046A  ; Jump right after the waiting loop

Code: Select all

; Total cycles:
;                  waiting loop (one iteration)
;                    + IRQ routine
;                         + failed branch test when coming back from interrupt
; original code =  4 + 27 + 3 = 34
; this code     = 16 + 19     = 35  (-1 cycles gain)
Conclusion: slower than the original, oops. ;)

Note1 applies as well (save stack in setup).

Code: Select all

This would bring  the number of cycles to:
; original code =  4 + 27 + 3 = 34
; this code     = 10 + 19     = 29  (5 cycles gain)
Conclusion: 5 cycles faster than the original, but we are still trashing X/timer-value.

Note2 applies as well (avoid restoring the stack in the loop).

Code: Select all

This would bring  the number of cycles to:
; original code =  4 + 27 + 3 = 34
; this code     = 12 + 19     = 31  (3 cycles gain)
Conclusion: 3 cycles faster than the original
but does not work if a JSR is done before the next iteration of the loop
because the stack is at $100 and that would be messy.

If Note1 and Note2 are applied together.

Code: Select all

This would bring  the number of cycles to:
; original code =  4 + 27 + 3 = 34
; alternative   =  6 + 19     = 25  (9 cycles gain)
Conclusion: 9 cycles faster than the original and works without trashing X
but does not work if a JSR is done before the next iteration of the loop (stack at $100, messy).

Merge loop and interrupt routine variant

The idea here is to:
- incorporate the interrupt routine directly in the waiting loop (oO)
- so after the interrupt is handled, there is no need to come back from the routine:
just pop the flags, restore the stack pointer and just continue without returning/branching

This has many advantages:
- there is no need for a dedicated interrupt stack anymore
- we can now use JSR since we are still on the regular stack

But we cannot use RTS to return from the current routine since the address of the BCS is still on the stack.
(This will be solved next variant.)
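To make the RTS remark concrete, here is a minimal model of the hardware stack showing what is left behind once the interrupt has been taken and only PLP has been executed. The class, the helper, and the example stack pointer and flag values are all hypothetical; only the push order of an IRQ (PCH, PCL, flags) is standard 6502 behaviour.

```python
class Stack6502:
    """Minimal model of the 6502 hardware stack (page 1, 8-bit pointer)."""

    def __init__(self, sp: int = 0xFF) -> None:
        self.sp = sp
        self.page = [0] * 256

    def push(self, value: int) -> None:
        self.page[self.sp] = value & 0xFF
        self.sp = (self.sp - 1) & 0xFF

    def pull(self) -> int:
        self.sp = (self.sp + 1) & 0xFF
        return self.page[self.sp]


def take_irq(stack: Stack6502, pc: int, flags: int) -> None:
    """An IRQ pushes PCH, then PCL, then the flags."""
    stack.push(pc >> 8)
    stack.push(pc & 0xFF)
    stack.push(flags)


if __name__ == "__main__":
    stack = Stack6502(sp=0xF0)              # hypothetical S on entering the loop
    saved_sp = stack.sp                     # what TSX/STX would record
    take_irq(stack, pc=0x0466, flags=0x21)  # interrupted while spinning on the BCS
    stack.pull()                            # PLP: only the flags are removed
    print(f"bytes left on the stack: {saved_sp - stack.sp}")
    stack.sp = saved_sp                     # LDX #xx / TXS: drop the leftover address
```

The two leftover bytes are the address of the interrupted BCS, which is what an RTS at that point would pull instead of the caller's return address.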

Code: Select all

(...) ; Interrupt waiting loop : 10 cycles (previous = 4/5)
0461   2   BA        TSX        ; Save stack pointer into the operand of the LDX below |
0462   4   8E 73 04  STX $0473  ;                                                      |-> Cf Note1.
0465   2   38        SEC
0466   2   B0 FE     BCS -2     ; infinite loop, waiting for interrupt (counted as 2c)
       ; Interrupt routine : 20 cycles (previous = 27)
0468   4   AE 00 03  LDX 0300   ; Reset flag on CB1
046B   4   AE 08 03  LDX 0308   ; read timer (sinusoid duration) in X
046E   4   8E 09 03  STX 0309   ; Reset timer counter (writing in #309 sets #308 with #F5 once instruction executed)
0471   4   28        PLP        ; Restore system flags before interrupt, this clears the interrupt disable flag
0472   2   A2 xx     LDX #xx    ; Reload the stack pointer |
0474   2   9A        TXS        ;                          |-> Cf Note2.
; here the code continues normally as if it was just after the waiting loop BCS
; ...

Code: Select all

; Total cycles:
; original code =  4 + 27 + 3 = 34
; this code     = 10 + 20     = 30  (4 cycles gain)
Conclusion: 4 cycles faster than the original, but we still trash the X/timer value to restore the stack pointer.

Note1 applies as well (save stack in setup).

Code: Select all

This would bring  the number of cycles to:
; original code =  4 + 27 + 3 = 34
; this code     =  4 + 20     = 24  (10 cycles gain)
Conclusion: 10 cycles faster than the original, but we are still trashing X/timer-value.

Note2 applies as well (avoid restoring the stack in the loop).

Code: Select all

This would bring  the number of cycles to:
; original code =  4 + 27 + 3 = 34
; this code     = 10 + 16     = 26  (8 cycles gain)
Conclusion: 8 cycles faster than the original and works without trashing X
we can do JSR because we are still on the original stack
we cannot do an RTS though because the address of the BCS is still on the stack

If Note1 and Note2 are applied together.

Code: Select all

This would bring  the number of cycles to:
; original code =  4 + 27 + 3 = 34
; alternative   =  4 + 16     = 20  (14 cycles gain)
Conclusion: 14 cycles faster (YOOHOO) than the original and works without trashing X
we can do JSR because we are still on the original stack
we cannot do an RTS though because the address of the BCS is still on the stack

Final version (for now :D)

This illustrates how to make everything work together with the optims of both Note1 & 2.
This is the same as above but illustrates how to restore the stack effortlessly once the loop is done.

Code: Select all

(...) ; calling code
0xxx JSR setup_waiting_loop
; then do other stuff
(...)

Code: Select all

setup_waiting_loop: ; This runs only once
0461   2   BA        TSX        ; Save stack pointer into the LDX of the restoration code
0462   4   8E bb aa  STX aabb   ; aabb = see below
waiting_loop:    ; Interrupt waiting loop : 4 cycles (previous = 4)
0465   2   38        SEC
0466   2   B0 FE     BCS -2     ; infinite loop, waiting for interrupt (counted as 2c)
; Interrupt routine : 16 cycles (previous = 27)
0468   4   AE 00 03  LDX 0300   ; Reset flag on CB1
046B   4   AE 08 03  LDX 0308   ; read timer (sinusoid duration) in X
046E   4   8E 09 03  STX 0309   ; Reset timer counter (writing in #309 sets #308 with #F5 once instruction executed)
0471   4   28        PLP        ; Restore system flags before interrupt, this clears the interrupt disable flag
                                ; The address of the BCS is still on the stack but we do not care.
0472   .....   ; here regular code which handles the result of the interrupt.
047x  JSR some_ROM_code   ; no problem, this works
047y  ......   ; Do some other stuff here
...
; Exit condition of the loop: are we finished?
xxyy-2 2/3 F0 ??    BEQ waiting_loop ; Could be any branch test.
aabb-1 2   A2 xx    LDX #xx          ; Restore the stack pointer
aabb+1 2   9A       TXS
aabb+2 6   60       RTS              ; back to calling code

Code: Select all

; Total cycles:
; original code   =  4 + 27 + 3 = 34
; this code       =  4 + 16     = 20  (14 cycles gain)
Conclusion: 14 cycles gain. No disadvantages that I can see.
I am not counting the branching back as a penalty because you probably already have one.

If the branching site of the loop is too far away from waiting_loop we lose a few cycles:

Code: Select all

; Exit condition of the loop: are we finished?
xxyy-2 2/3 D0 03      BNE exit_loop    ; Could be any branch test.
xxyy   3   4C 65 04   JMP waiting_loop ; Loop back
exit_loop:
aabb-1 2   A2 xx      LDX #xx          ; Restore the stack pointer
aabb+1 2   9A         TXS
aabb+2 6   60         RTS              ; back to calling code
Conclusion: 14-3-3 = 8 cycles gain. Still worth it. ;)

Congratulations to everyone who managed to read everything. ;)
Symoon
Archivist
Posts: 2307
Joined: Sat Jan 14, 2006 12:44 am
Location: Paris, France

Re: Experimental very fast tape loading

Post by Symoon »

Wow, thanks for all this ;)
Interesting variants, I'll have to read them again but I'm not sure they can apply to Novalight. I have done so many variants and changes yesterday that I'm a bit lost!

I should have posted the whole code: as you will notice, the infinite loop is used at several places (5 times I think) and the result is managed in very different ways, according to the byte type to decode.
It is called between two and 5 times to decode a byte (and an undetermined amount of times for RLE ;) )

Here is the very latest code version.
Novalight_v1.1d_source_ENGLISH.zip
(4.41 KiB) Downloaded 359 times
iss
Wing Commander
Posts: 1641
Joined: Sat Apr 03, 2010 5:43 pm
Location: Bulgaria

Re: Experimental very fast tape loading

Post by iss »

My proposal is modest but I think it makes sense - in short you don't need PHP/RTI in the interrupt routine.
Use only RTS instead - gain 3 cycles and 1 precious byte. :)
I.e. from:

Code: Select all

04CD	4	AE 00 03  LDX $0300	Re-initialize CB1 flag
04D0	4	AE 08 03  LDX $0308	Read duration in X
04D3	4	8E 09 03  STX $0309	Re-initialize timer count (writing in MSB #309 sets LSB #308 to #F5)
04D6	4	28	  PLP		Get the state saved in the stack by the interrupt
04D7	2	18	  CLC		Set C to 0 to leave the "calling" infinite loop
04D8	3	08	  PHP		Save again the state into the stack
04D9	6	40	  RTI           Go back to the "calling" infinite loop
To:

Code: Select all

04CD	4	AE 00 03  LDX $0300	Re-initialize CB1 flag
04D0	4	AE 08 03  LDX $0308	Read duration in X
04D3	4	8E 09 03  STX $0309	Re-initialize timer count (writing in MSB #309 sets LSB #308 to #F5)
04D6	4	28	  PLP		Get the state saved in the stack by the interrupt
04D7	2	18	  CLC		Set C to 0 to leave the "calling" infinite loop
04D8	6	60	  RTS		Go back to the "calling" infinite loop
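The claimed gain can be checked with a quick tally of the two exit sequences, using the standard NMOS 6502 timings for these single-byte instructions. The `COST` table and `tally` function are illustrative names only.

```python
# (cycles, bytes) per instruction, standard NMOS 6502 timings.
COST = {"PLP": (4, 1), "CLC": (2, 1), "PHP": (3, 1), "RTI": (6, 1), "RTS": (6, 1)}


def tally(sequence: list[str]) -> tuple[int, int]:
    """Return (total cycles, total bytes) for a list of mnemonics."""
    cycles = sum(COST[op][0] for op in sequence)
    size = sum(COST[op][1] for op in sequence)
    return cycles, size


if __name__ == "__main__":
    before = tally(["PLP", "CLC", "PHP", "RTI"])
    after = tally(["PLP", "CLC", "RTS"])
    print(f"before: {before[0]} cycles, {before[1]} bytes")
    print(f"after:  {after[0]} cycles, {after[1]} bytes")
    print(f"gain:   {before[0] - after[0]} cycles, {before[1] - after[1]} byte")
```

The tally covers cycles and bytes only. It does not model where execution resumes: RTI resumes at the pulled address, whereas RTS resumes at the pulled address plus one.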