This document deals with the inner workings. If you're a tech type, and are interested in the somewhat deeper-lying aspects of how the Imploder operates, chances are you'll find it here. Subjects covered are:

- Compression
- Decompression
- Reversibility
- The Library
- Overlayed Files
- Merging
- ARP
- The Music
- The Copperlist
- 68040 Cache Coherency

** Compression **

The Imploder (we only recently learned :-) does LZ77-like compression with a per-mode static Huffman coding step on the various parts of the skip, offset and length tuples. Due to the efficient encoding, a tuple can require less than 12 bits, and thus strings of 2 bytes length are encodable with a decent gain (given small Huffman patterns corresponding to likely circumstances).

To speed up the string searching, the turbo mode causes the accessible strings to be indexed using a hashing table. However, because strings with a minimum size of two bytes are still potential candidates for compression, the hashing function is necessarily rather simplistic. When the implosion algorithm processes highly redundant data, the entries in the hashing table tend to get very imbalanced, causing a reduction in performance. For most types of data, notably executables, this isn't the case though. A sketch of such a hash follows below.

** Decompression **

The goal of decompression is to reproduce the segment list of the original program. This is a linked list of data/code/bss hunks, some of which require being positioned in chip memory. The decompression code has to fill the data and code hunks with the decompressed data and, if required, perform relocation of absolute references.

The Imploder lets LoadSeg allocate all target hunks (as BSS). These are at the start of the hunk list. They are followed by a hunk with the decompression code (only for non-library imploded programs), a compressed data hunk (normally about 50% of the static data size, depending on compression gain of course), and a decompression buffer (only up to 17K in size). As you can see, no allocations need to be done, so the exploding process will never fail.

During decompression, data is decompressed from the compressed data buffer into the decompression buffer until it is full. The buffer is then emptied by filling hunks and processing reloc tables. When the buffer has been emptied, the process repeats until all data has been scatter-decompressed across the target hunks.

Now you might ask, why not decompress directly to the target hunks instead of using a decompression buffer? This is because the Imploder uses the decompressed data to decompress, and needs to be able to access it easily (for speed). Referencing data distributed across several hunks is much more cumbersome. An added bonus of having separate source and target memory regions is that a notation can be used that doesn't gain under rare circumstances, but on average yields better results (no chance of source/target pointer collision).

When explosion has finished, the decompression buffer, code and data hunks are freed, and memory usage is reduced to what it would have been for a non-imploded version of the program.
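Here is the hash sketch promised in the compression section, in C (all names are invented for illustration; the Imploder itself is of course 68000 assembly). Because a match may be as short as two bytes, only two bytes of context are available to the hash function, so runs of identical bytes all pile into the same bucket:

    #define WINDOW_SIZE 32768           /* assumed history size */
    #define HASH_BITS   12
    #define HASH_SIZE   (1 << HASH_BITS)

    static long head[HASH_SIZE];        /* most recent position per bucket */
    static long chain[WINDOW_SIZE];     /* previous position with same hash */

    /* Only the two bytes that start a candidate string can be hashed,
     * which is why the function must stay this simplistic. */
    static unsigned hash2(const unsigned char *p)
    {
        return ((unsigned)(p[0] << 4) ^ p[1]) & (HASH_SIZE - 1);
    }

    /* Index position 'pos' so later searches can walk the chain of
     * earlier positions starting with the same two bytes. */
    static void index_position(const unsigned char *buf, long pos)
    {
        unsigned h = hash2(buf + pos);
        chain[pos % WINDOW_SIZE] = head[h];
        head[h] = pos;
    }

On highly redundant input (say, a long run of zero bytes) every position hashes to the same bucket, the chain for that bucket grows very long, and search time degrades; that is the imbalance described above.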
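The explosion loop just described might be pictured roughly like this (again a hypothetical C sketch; the real code is assembly, and the helper names are made up). It shows the two-phase rhythm: decompress into the small buffer, then drain it across the target hunks and apply relocations:

    #include <string.h>

    #define BUFFER_MAX (17 * 1024)      /* the decompression buffer */

    struct Hunk {
        unsigned char *mem;             /* LoadSeg-allocated BSS memory */
        long size, filled;
        struct Hunk *next;
    };

    /* Hypothetical helpers standing in for the real routines. */
    extern int  more_data(void);
    extern long decompress_chunk(unsigned char *dst, long max);
    extern void apply_relocs(struct Hunk *h);

    void explode_all(struct Hunk *target)
    {
        static unsigned char buffer[BUFFER_MAX];

        while (more_data() && target) {
            /* Phase 1: decompress into the buffer. Back references
             * stay inside this one contiguous region, so source and
             * target pointers can never collide. */
            long n = decompress_chunk(buffer, BUFFER_MAX);
            long off = 0;

            /* Phase 2: scatter the buffer across the target hunks. */
            while (off < n && target) {
                long room = target->size - target->filled;
                long take = (n - off < room) ? n - off : room;
                memcpy(target->mem + target->filled, buffer + off, take);
                target->filled += take;
                off += take;
                if (target->filled == target->size) {
                    apply_relocs(target);   /* patch absolute references */
                    target = target->next;
                }
            }
        }
    }
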
People often compare the Imploder to the Power-Packer. The Imploder decompresses faster and looks cooler [:-)], but the most interesting differences lie in the implementation of the various steps of the decompression process. So let's contrast the advantages of the Imploder's approach with the Power-Packer's implementation.

- By having LoadSeg allocate all memory as BSS hunks, explosion will never fail. The Power-Packer on the other hand allocates the hunks itself; if an allocation fails, it simply exits. Power-Packed programs launched from the WorkBench thus won't reply the startup message, which will be left dangling in memory forever.

- Memory doesn't get fragmented. The explosion-related hunks are at the end of the seglist and thus were allocated (by LoadSeg) AFTER the target hunks. This isn't true for the Power-Packer: it leaves a hole in your free memory list when it frees its decompression stuff.

- Additional memory usage is only about 50% of the static data size plus the size of the decompression buffer, which is always small relative to large programs (maximum 17K). So a 30K program might require 62K to decompress (30+15+17), and a 300K program will require 467K (300+150+17), assuming a 50% compression reduction. The memory usage report generated after a program has been imploded includes BSS hunks; I've discussed only static data here. BSS hunks don't require any extra memory, of course. Power-Packed files require a buffer as large as the original program for both compressed data storage and decompression. Their memory usage is therefore always about twice the static data size (again ignoring BSS), while for the Imploder it drops towards 1.5 times for executables large enough to matter memory-wise. (In this comparison I'm talking about executables as produced by Power-Packer version 3.0b.)

Non-library imploded programs have a small first hunk that calls the decompression code hunk at the end, and frees these last three hunks. For library imploded programs this freeing occurs in the library, so no preceding hunk is needed.

** Reversibility **

Before compression the Imploder preprocesses executables. It kicks out all the redundant stuff by merging subreloc tables referring to the same hunk, sorting relocs (improves compression), removing debug symbols, etc. This is what all those info blurbs in the text window are about. So the deploded executable isn't guaranteed to be byte-by-byte identical as far as loadfile control codes are concerned. What is guaranteed is that the memory images created when the original, imploded and deploded program versions are loaded are identical.

So the deplosion process isn't 100% reversible. Normally this is no problem. The reason for uncrunching is mostly wanting to recompress an executable using a different compression mode, or having a quick peek at the code, e.g. when applying a patch with something like NewZap. If however you expect to need the debug symbols, or (important) require the executable to be in the _exact_ original format in order to have things like lpatch updates applied, you're out of luck if you've only got the compressed executable. So always keep the original archives/disks! This is yet another argument for retaining the original archives. The Imploder is an online space creator, not a distribution archiver (see the "Philosophy" text).
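As an illustration of the kind of cleanup involved, here is a sketch (invented names, simplified table model, error handling omitted) of merging reloc sub-tables that refer to the same hunk and then sorting the offsets. Sorting helps compression because the deltas between sorted offsets become small and repetitive:

    #include <stdlib.h>
    #include <string.h>

    /* One HUNK_RELOC32 sub-table: a list of offsets that refer to a
     * particular target hunk. An executable may contain several
     * sub-tables for the same target; merging them saves the
     * per-table headers. */
    struct SubReloc {
        long  target;       /* hunk the offsets point into */
        long  count;
        long *offsets;
    };

    static int cmp_long(const void *a, const void *b)
    {
        long x = *(const long *)a, y = *(const long *)b;
        return (x > y) - (x < y);
    }

    /* Merge all sub-tables with the same target into the first one,
     * then sort each surviving table. Returns the new table count. */
    long normalize_relocs(struct SubReloc *t, long n)
    {
        long i, j, out = 0;
        for (i = 0; i < n; i++) {
            for (j = 0; j < out; j++)
                if (t[j].target == t[i].target) break;
            if (j == out) {         /* first table for this target */
                t[out++] = t[i];
            } else {                /* append offsets to earlier table */
                t[j].offsets = realloc(t[j].offsets,
                                       (t[j].count + t[i].count) * sizeof(long));
                memcpy(t[j].offsets + t[j].count, t[i].offsets,
                       t[i].count * sizeof(long));
                t[j].count += t[i].count;
                free(t[i].offsets);
            }
        }
        for (j = 0; j < out; j++)
            qsort(t[j].offsets, t[j].count, sizeof(long), cmp_long);
        return out;
    }
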
** The Library **

The library code has been unrolled a bit, and optimized here and there, in order to achieve optimal performance. This makes it faster than the normal explosion algorithm.

If you library-implode a program, there is NO way in which the program, after explosion, will be able to notice. If you make sure the library is resident, this is also true for any executable file loaded for any purpose by any program. For normal and pure imploded programs, the startup/cleanup hunk mentioned at the end of the "Decompression" section might be detected if a program goes through contortions involving finding the segment list via murky DOS structures, instead of simply doing PC-relative hunk start referencing, which also works from the WorkBench. I haven't encountered any programs that do this. Still, this is yet another reason to use the library; there is not even the slightest chance of it being incompatible with an executable.

Note that the LoadSeg vector is patched in an "intelligent" manner; it will install fine on pre-2.0 kickstarts (braindead jumptable format) as well as on BCPL-free systems (2.0+). Under pre-2.0, when a library imploded file is run from the WorkBench and the explode.library isn't resident yet, Exec will try to load the library from disk. The process's message port however is in use by the WorkBench reply message, and until that message has been replied, the port cannot be used by the DOS to send packets. Thus the DOS gurus. Also, BCPL code doesn't jump through the library vector. The only structural problem this causes is with handlers: these are loaded by the DOS, and the DOS is BCPL code, again ONLY under < 2.0. Under 2.0 the library works just as intended when it was first conceived. Transparently, that is. (A sketch of the 2.0 patching follows at the end of the next section.)

** Overlayed Files **

The Imploder compresses the load part of an overlayed file as if it were a normal executable file. Subsequently, the overlay table and the overlayed program sections are appended. It then tries to adapt the overlay table.

Because different types of overlay supervisors are in use, the format of the overlay table isn't known to the Imploder. The only assumption made is that the overlay table contains seek-offset longwords, at longword-aligned locations, that point into the file to the hunk_header ($3F3) identifiers of the overlayed program sections. This is how the standard overlay manager operates, but nothing prevents a programmer with sufficient technical knowledge from creating a novel overlay format (e.g. self-extracting DImp files). If the Imploder finds one of these offsets, it is adjusted by the amount by which the initial load part of the executable file has compressed. The deplosion algorithm also tries to find these offsets when restoring the overlay table. Thus there is always a very small chance that the Imploder will adapt a longword that was never meant to be an offset.

An overlayed file gets its information from the loader in four longwords at the start of the first hunk. In an imploded overlayed file, this hunk is the root hunk, and after decrunching these longwords are adjusted and moved into the first hunk of the actual program (the second hunk of the seglist). Evidently this process can never be 100% deterministic, so take heed and test any overlayed programs you've Imploded. Or don't use overlay implosion at all if you can spare the bits.
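The offset hunt could be sketched like this (hypothetical C over the raw file image; a big-endian host like the Amiga is assumed for the table reads): scan the table a longword at a time, and shift every value that is longword aligned and lands on a hunk_header identifier:

    #define HUNK_HEADER 0x3F3UL

    static unsigned long read_be32(const unsigned char *p)
    {
        return ((unsigned long)p[0] << 24) | ((unsigned long)p[1] << 16)
             | ((unsigned long)p[2] << 8)  |  (unsigned long)p[3];
    }

    /* Adjust the seek offsets in an overlay table. 'file'/'filelen'
     * is the raw file image, 'table'/'entries' the overlay table
     * inside it, and 'delta' the (negative) size change of the
     * compressed load part. A longword is treated as an offset only
     * if it is longword aligned and points at a $3F3 identifier,
     * hence the small chance of patching an innocent longword. */
    static void adjust_overlay_table(unsigned char *file, long filelen,
                                     long *table, long entries, long delta)
    {
        long i;
        for (i = 0; i < entries; i++) {
            long v = table[i];
            if (v > 0 && (v & 3) == 0 && v + 4 <= filelen &&
                read_be32(file + v) == HUNK_HEADER)
                table[i] = v + delta;
        }
    }
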
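And the 2.0 patching promised above: on a BCPL-free system the vector can in principle be replaced with exec's standard SetFunction call, roughly as in this sketch. The -150 offset is the standard dos.library LVO for LoadSeg; everything else here is illustrative only, and real code must respect the DOS register convention (the name arrives in D1), while the pre-2.0 jumptable needs the hand-rolled treatment described earlier:

    #include <exec/types.h>
    #include <proto/exec.h>
    #include <proto/dos.h>

    #define LVO_LoadSeg (-150)          /* dos.library vector offset */

    static BPTR (*OldLoadSeg)(STRPTR name);

    /* Replacement that calls through to the original LoadSeg and
     * could then explode an imploded seglist before returning it.
     * Sketch only: a real patch must take 'name' in register D1. */
    static BPTR MyLoadSeg(STRPTR name)
    {
        BPTR seglist = OldLoadSeg(name);
        /* ... detect an imploded first hunk and explode it here ... */
        return seglist;
    }

    void install_patch(void)
    {
        OldLoadSeg = (BPTR (*)(STRPTR))
            SetFunction((struct Library *)DOSBase, LVO_LoadSeg,
                        (APTR)MyLoadSeg);
    }
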
** Merging **

Though modern linkers/compilers typically produce executables with one code hunk and one data hunk, there are still some old executables and less evolved linkers around. The merging option was implemented when executables with enough hunks to cause a lot of redundancy were still commonplace. Every hunk requires a longword in the allocation header, plus a hunk ID, a load size, and a hunk end ID. That's 16 bytes per hunk, and thus 16 bytes saved for every merge action. Doesn't sound like much, but generally hunks also have reloc tables. These waste a lot more space, especially with references to a lot of different hunks, though there's no easy equation.

The merging step merges matching hunks (data-data, chipdata-chipdata, code-code) into hunks of up to the merge threshold in size. The actual size is of course determined by the sum of the sizes of the constituent hunks, and may very well be a bit less than the specified threshold. Obviously this process discards redundant data in an irreversible fashion, so deploding the executable won't reverse it. A sketch of the merge bookkeeping follows below.

Lots of tiny hunks cause memory fragmentation, but increase the chance of the program being able to load when the system is fragmented and low on memory. Thus there is a kind of optimal balance that varies from system to system. In general it can be said that hunks smaller than 10K or larger than 100K are "bad". Another factor is that loading a program with many tiny hunks causes the LoadSeg function to issue twice as many tiny read commands, bogging down the speed with which an executable can be loaded into memory.

For simplicity's sake, I've chosen for the Imploder to process executables within a single buffer, without the need for additional backup buffers. Thus, removing redundant information and copying hunk data during the merging and reloc cleanup process involves moving or mirroring large parts of the buffer. This is why merging can take a while when processing a large executable with a hundred or so hunks.
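The merge bookkeeping promised above, as a sketch (invented names; simplified to adjacent hunks, whereas the real tool must also move the hunk data and renumber/rebase the reloc references of everything that shifts):

    /* Greedy merge of adjacent same-type hunks (code-code, data-data,
     * chipdata-chipdata) up to a size threshold. Each merge saves the
     * 16 bytes of per-hunk overhead plus whatever the combined reloc
     * tables shed. */
    enum HunkType { CODE, DATA, CHIPDATA };

    struct MHunk {
        enum HunkType type;
        long size;
        struct MHunk *next;
    };

    void merge_hunks(struct MHunk *h, long threshold)
    {
        while (h && h->next) {
            struct MHunk *n = h->next;
            if (n->type == h->type && h->size + n->size <= threshold) {
                h->size += n->size;   /* also: move data, rebase relocs */
                h->next = n->next;    /* unlink the absorbed hunk */
            } else {
                h = n;                /* can't merge; move on */
            }
        }
    }

Note how the result can come out a bit under the threshold: the loop never splits a hunk, it only sums whole constituent hunks until the next one would overshoot.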
** ARP **

Programs written for use with the ARP shell are able to specify the stacksize they require inside the first hunk of their executable. If such a program is normal or pure imploded, the segment list won't become visible until the program is run, so ARP has no way of finding out what the proper stacksize should be. Library imploded programs have no trouble with this, because they are already decrunched once they have been loaded into memory with LoadSeg (provided the library has already been made resident).

The Imploder will recognize these files and report on them. If the requested stacksize is larger than the usual minimum (4000 bytes), a warning will be printed, and you'll be urged to use only library implosion. The chance of a programmer relying on a soon-to-be-obsolete shell for setting a stack LARGER than the usual default is rather slim though. It would have been very nice if 2.0 had sported such a stack setting feature, and indeed it had been planned, but it was never implemented due to lack of time on the part of the Commodore programmers. We'll be on the lookout for any future changes to the executable file format, in order to fix any potential incompatibilities before they cause problems.

** The Music **

When we got word that the CIA-A timer was used by the OS under 2.0, we switched to trying to allocate first CIA-A and, if that wasn't available, CIA-B to drive the music. However, the CIA-B interrupt priority is too high and can interfere with things like serial transfer. So Paul got this great idea to keep on using a CIA for precision timing purposes, but drop down to a SoftInt for doing the actual work, modulations, etc. This works great; the amount of code executed at CIA priority is now negligible.

Recently, the CATS started feeling guilty about hijacking the CIA-A timer and thus created "Jumpy the magic timer device". If I understood things correctly, the latest 2.0 timer device moves out of the way and starts using a less accurate timing source whenever an application tries to allocate the CIA-A. Paul's music driver can run off both CIA-A and CIA-B, and it would be a pity to make Jumpy jump without good reason, so he changed the allocation sequence from A-B to B-A.

** The Copperlist **

There are a couple of unavoidable quirks when one uses copperlists on Intuition screens. On certain machines, probably PAL ones, under certain circumstances, dragging a dense copperlist past scanline 256 or so will cause some video crud to appear at the top of the display. This can't hurt, but it sure does look ugly. I suspect this is a hardware misfeature, because it ain't fixed yet under 2.0. This is the reason why the screen of older Imploder versions wasn't draggable; you might just have thought this muck was our doing.

The second problem is that copperlists "shine through" onto other screens in front. For this reason we've chosen a colour > 4 for the level bars, so this is never observable with screens less than 3 bitplanes deep. The 2.0 OS supports proper copperlist clipping, but it has been disabled by default for compatibility reasons (yeach). Supposedly there is a bit somewhere in the graphics base to turn this back on, so I'm sure, in due time, there will be some preference editor to re-enable it.

** 68040 Cache Coherency **

With the advent of the 68040 processor, programs that diddle with code which is subsequently executed will be prone to some problems. I don't mean the usual self-modifying code problem, where the code sitting in the instruction cache is no longer what the algorithm expects. That is something the Imploder never had a problem with; indeed, the Imploder has always worked fine with anything up to and including a 68030.

The reason the 68040 is different is that it has a "copyback" mode. In this mode (which WILL be used by people, because it increases speed dramatically) writes get cached and aren't guaranteed to be written out to main memory immediately. Thus four consecutive byte writes will require only one longword main memory write access. Now you might have heard that the 68040 does bus-snooping. The odd thing is that it doesn't snoop the internal cache buses! Thus if you stuff some code into memory and try to execute it, chances are some of it will still be in the data cache. The code cache won't know about this, and won't be notified when it fetches from main memory locations whose contents are still waiting to be written out of the data cache. This problem is amplified by the absolutely huge size of the caches.

So programs that move code, like the explosion algorithms, need to flush the cache after they are done. As of version 4.0, the appended decompression algorithms as well as the explode.library flush the cache, but only under OS 2.0. The reason for this is that only OS 2.0 has calls for cache-flushing. This is yet another reason not to distribute imploded programs; they might just cross the path of a proud '40 owner still running under 1.3. It will be interesting to see how many other applications will run into trouble once the '40 comes into common use among Amiga owners. The problem explained above is something that could not easily have been anticipated by developers. It is known that the startup code shipped with certain compilers copies bits of code around, so it might very well be a large problem.
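For reference, the flush itself is a one-liner once you know you are on 2.0. A sketch of the guard described above, using exec's V37 CacheClearU() call (the function name and version check are the standard 2.0 ones; the wrapper name is invented):

    #include <exec/execbase.h>
    #include <proto/exec.h>

    extern struct ExecBase *SysBase;

    /* After moving code into place, flush the caches so the 68040's
     * copyback data cache cannot leave stale code behind. Only the
     * V37+ (2.0) exec has an official call for this; under 1.3 there
     * is simply nothing supported to call, as described above. */
    void flush_caches_if_possible(void)
    {
        if (SysBase->LibNode.lib_Version >= 37)
            CacheClearU();              /* flush/invalidate all caches */
    }
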