Text compression

Here you can ask questions or provide insights about how to use 6502 assembly code efficiently on the Oric.


Post by Chema »

Greetings,

Has anybody worked on text compression for games/demos? Imagine a program where you have a set of different strings you may have to display (an adventure game is a good example of this). You would usually compress text to save space and decompress the given string on the fly for printing.

Has anybody done something similar? I know it can be done with standard compression algorithms, such as extended Huffman or LZW, but I am not sure which method is best for this case. Note that most common algorithms won't work, as they would need to decompress the whole text into memory (rather than one substring at a time), and using them on each short string separately is usually not an option. This is the case, I think, with Dbug's Pack and UnPack routines (correct me if I am wrong!).

I suppose you can use an external program to analyse the whole text of the program and generate a kind of lookup table storing the most used substrings (sets of two, three, four or even more characters that repeat often), and substitute them with values that do not represent a character or attribute but an entry in the lookup table. This way, with the table and a small routine, you can decompress each string separately when needed. The asymmetry of the method (compression is much more computationally expensive than decompression) is not a problem here, as you would compress only once (and on a PC).
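
Just to make the idea concrete with made-up values (this is not from any real analysis, only an illustration of the kind of table meant here):

Code:

/* Purely illustrative: bytes >= 0x80 index the lookup table, everything
 * else is a literal character. */
const char *table[] = { "the ", "ing" };            /* tokens 0x80, 0x81 */
const unsigned char msg[] = { 0x80, 'k', 0x81, 0 }; /* decodes to "the king" */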

Regards,

Post by mmu_man »

Yes, it could be possible to use tokenization/atomization on those,
like BASIC does by using ASCII codes >128 for BASIC tokens.

Post by Chema »

mmu_man wrote:Yes, it could be possible to use tokenization/atomization on those,
like BASIC does by using ASCII codes >128 for BASIC tokens.
Yeah... that was the idea, more or less. But then I suppose nobody has actually done this... strange, anyway.

I have been looking at grammar-driven compression techniques, which basically extend the token table so that it encodes a context-free grammar derived specifically from your text, to achieve higher compression.

Even if it is not hard to decompress with this method, it might be too much code and it is difficult to generate the grammar, so I might program a routine which searches for the most used tetragrams, trigrams and digrams (for instance) and stores them in tables.

If anybody has further ideas or has some code, let me know.

Regards,

Post by Dbug »

There is also the possibility to reduce the character set.
Instead of storing on 8 bits, it is possible to store each character on 6 or 5 bits.

5 bits looks a bit 'short', but you can easily store the 26 letters of the alphabet and still have some free codes for punctuation. It is then possible to use "escape" codes to do things like "switch to numbers mode" or "next char is an upper-case letter", etc.
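
As a rough sketch of the bit-packing idea in C (the code assignments below, 0 as a terminator, 1-26 for 'a'-'z' and the rest free for punctuation or escapes, are assumptions for illustration, not an actual Oric format):

Code:

#include <stdio.h>

static const unsigned char *src;   /* packed data */
static unsigned int bitbuf;        /* bit accumulator */
static int bitcount;               /* bits currently held in bitbuf */

static unsigned char get5(void)
{
    while (bitcount < 5) {              /* refill with whole bytes */
        bitbuf = (bitbuf << 8) | *src++;
        bitcount += 8;
    }
    bitcount -= 5;
    return (unsigned char)((bitbuf >> bitcount) & 0x1F);
}

/* Code 0 ends the string, 1-26 are 'a'-'z', 27-31 are free for
 * punctuation or escape codes such as "next char is upper case". */
void print_packed(const unsigned char *packed)
{
    unsigned char c;
    src = packed;
    bitbuf = 0;
    bitcount = 0;
    while ((c = get5()) != 0) {
        if (c <= 26)
            putchar('a' + c - 1);
        else
            putchar('?');               /* escape handling left out */
    }
}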

Post by Twilighte »

I tried a variation of Dbug's method for the text compression in Times of Lore, but I found it better to concentrate on the compression method mentioned by Chema.

As an example, I have uploaded this file, which is actually the complete text for Times of Lore.

http://www.defence-force.org/ftp/forum/ ... l_text.rar

If interested, I may be persuaded to also provide the code that uses this data.

Post by Chema »

OK, I coded a simple algorithm (I think it is similar to LZO) and made a quick test. The basic idea is that it should be able to decompress any string inside the text with minimum overhead in memory and code.

The algo works as follows:

Code:

1. Scan the text searching for digrams (repeated pairs of symbols). If none is repeated more than N times (N at least 3), or there is no more room in the table, we are done. Otherwise go to step 2.
2. Take the digram which repeats most often and add it to the token table. Substitute every occurrence of the digram in the text with the token.
3. Go back to step 1.
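
The loop in steps 1-3 is essentially byte-pair encoding. A rough sketch of the compression side in C (the names, the 128-rule limit and the N=3 threshold below simply mirror the description above; this is not the actual tool):

Code:

#include <string.h>

#define FIRST_TOKEN 128      /* values below this are literal characters */
#define MAX_TOKENS  128      /* 128 rules -> 256-byte grammar table */
#define MIN_COUNT   3        /* "N at least 3" */

unsigned char Grammar[MAX_TOKENS][2];

/* text holds the symbols (plain chars < 128 on input); *len is updated
 * in place.  Returns the number of grammar rules created. */
int build_grammar(unsigned char *text, int *len)
{
    static int count[256][256];
    int rules = 0;

    while (rules < MAX_TOKENS) {
        int besta = 0, bestb = 0, best = 0;
        int i, j;

        /* step 1: count every digram and pick the most frequent one */
        memset(count, 0, sizeof(count));
        for (i = 0; i + 1 < *len; i++)
            count[text[i]][text[i + 1]]++;
        for (i = 0; i < 256; i++)
            for (j = 0; j < 256; j++)
                if (count[i][j] > best) {
                    best = count[i][j];
                    besta = i;
                    bestb = j;
                }
        if (best < MIN_COUNT)          /* nothing repeats enough: done */
            break;

        /* step 2: record the pair and replace it with a new token */
        Grammar[rules][0] = (unsigned char)besta;
        Grammar[rules][1] = (unsigned char)bestb;
        for (i = 0, j = 0; i < *len; ) {
            if (i + 1 < *len && text[i] == besta && text[i + 1] == bestb) {
                text[j++] = (unsigned char)(FIRST_TOKEN + rules);
                i += 2;
            } else {
                text[j++] = text[i++];
            }
        }
        *len = j;
        rules++;                       /* step 3: loop again */
    }
    return rules;
}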
As you can see we end up with a kind of grammar which looks like:

Code:

<0> -> e, [space]
<1> -> d, [space]
<2> -> t, h
<3> -> s,
<4> -> i, n
...
<9> -> e, <0>
<10> -> [space], <2>
where <i> is a non-terminal symbol and the others are terminals (characters). Frequently repeated sequences of several characters are thus coded as combinations of the above rules. The trigram "ee[space]" is coded by rule 0 and rule 9, and ends up in the text as token 9. The tetragram "had[space]" is represented by token 83, and coded by:

Code:

<1> -> d, [space]
<27> -> h, a
<83> -> <27>, <1>
Normally there is no waste of additional space, as, for example, rule 27 also codes the digram "ha", which is used quite often on its own. More levels of recursion code pentagrams or hexagrams (if those are the correct terms)... as long as they repeat often.

The reason for this approach is that the table is very simple to work with, as all entries are only 2 bytes, and it should not require too much 6502 code. I have a rough idea of how to do it using the processor stack and a quite simple loop.

The problem is that the maximum number of tokens (128 for this test, with a table of 256 bytes) is not enough to achieve great compression ratios; they are in the order of 45-50%. For example, I compressed the tol_text.s file from the link above (not the plain text of the game, but the actual source file) and it went from 30798 bytes to 15448, while RAR gets it down to 9.6K.

Not sure if this is the correct approach, but we can study it later, I suppose.

Regards,

EDIT: I had a small bug. The output for the test file is actually some hundreds of bytes smaller, but this does not change the overall compression ratio much. Also, the decompression code in C, applied to each char of the compressed input, is:

Code:

#include <stdio.h>

#define FIRST_TOKEN 128              /* 128 tokens for this test, as above */
extern unsigned char Grammar[][2];   /* one pair of symbols per rule */

void process_token(unsigned char token, FILE *textOut)
{
    unsigned char left, right;

    if (token < FIRST_TOKEN)
        fputc(token, textOut);       /* plain character: print it */
    else
    {
        left  = Grammar[token - FIRST_TOKEN][0];
        right = Grammar[token - FIRST_TOKEN][1];
        process_token(left, textOut);    /* expand both halves recursively */
        process_token(right, textOut);
    }
}
I know it is recursive, but surely it is easy to implement in a compact way using the stack to store tokens that still have to be processed, plus a loop. Access to the table is also immediate (well, not really "immediate" :) ): load token-FIRST_TOKEN into reg X and issue

Code:

lda Grammar+1,x
pha
lda Grammar,x
pha
...
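
For completeness, an iterative version of process_token with an explicit stack (a sketch only; the stack depth of 64 is an arbitrary guess, and FIRST_TOKEN/Grammar are as in the recursive version above):

Code:

#include <stdio.h>

#define FIRST_TOKEN 128
extern unsigned char Grammar[][2];

/* Iterative sketch: an explicit stack replaces the recursion. */
void process_token_iter(unsigned char token, FILE *textOut)
{
    unsigned char stack[64];
    int top = 0;

    stack[top++] = token;
    while (top > 0) {
        unsigned char t = stack[--top];
        if (t < FIRST_TOKEN) {
            fputc(t, textOut);
        } else {
            /* push the right symbol first so the left one is expanded first */
            stack[top++] = Grammar[t - FIRST_TOKEN][1];
            stack[top++] = Grammar[t - FIRST_TOKEN][0];
        }
    }
}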

Post by Chema »

It works!

I have implemented the small routine in 6502 for decompressing a character from a compressed text. The routine gets the character (token or not) in reg A (it would be better to use another reg to gain some cycles, but...), uses the stack, and works as follows:

Code:

decomp
.(
    tax
    ; Prepare the stack for the decompression routine
    lda #0
    pha
    txa
    pha

loop
    pla
    ; If it is 0, we are done
    beq end
    ; If it is a token, expand it, else print and continue
    bpl printit

    and #%01111111 ; Get table entry
    asl
    tax
    lda _grammar+1,x
    pha
    lda _grammar,x
    pha
    jmp loop
printit
    ldy #0
    sta (sp),y
    jsr _putchar
    jmp loop
end
    rts
.)
I took a long text, compressed it, and obtained the grammar, which is a 256-byte table. Then I took a piece of 5161 bytes of text, which was only 2291 bytes once compressed, and tested it.

It is indeed very quick and works perfectly. In addition, no further space is needed and you can take any piece of the compressed text. It should preserve any code outside the token list, even those below ASCII 32. If we use fewer tokens (fewer rules) we could even preserve attributes!

The major drawback is that you have to create the grammar from the complete text and not part by part, so it should be done at a final stage of development. I have a dirty C program that does this and creates the compressed file. I might clean it up and make it generate something directly usable in the OSDK. If anybody is interested, drop me a line.
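
For reference, one way such a generator could emit the table as assembler source (the _grammar label and the .byt directive below are assumptions about the target assembler, not what the actual program does):

Code:

#include <stdio.h>

extern unsigned char Grammar[][2];

/* Hypothetical exporter: write the grammar table as assembler source.
 * Adjust the label and directive to whatever the assembler expects. */
void dump_grammar(FILE *out, int rules)
{
    int i;
    fprintf(out, "_grammar\n");
    for (i = 0; i < rules; i++)
        fprintf(out, "    .byt %d,%d\n", Grammar[i][0], Grammar[i][1]);
}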