This document deals with the inner workings. If you're a tech type, and are interested in the somewhat deeper-lying aspects of how the Imploder operates, chances are you'll find it here. Subjects covered are:

- Compression
- Decompression
- Reversibility
- The Library
- Overlayed Files
- Merging
- ARP
- The Music
- The Copperlist
- 68040 Cache Coherency

** Compression **

The Imploder (we only recently learned :-) does LZ77-like compression with a per-mode static Huffman coding step on the various parts of the skip, offset and length tuples. Due to the efficient encoding, a tuple can require less than 12 bits, and thus strings of 2 bytes length are encodable with a decent gain (given small Huffman patterns corresponding to likely circumstances).

To speed up the string searching, the turbo mode causes the accessible strings to be indexed using a hashing table. However, because strings with a minimum size of two bytes are still potential candidates for compression, the hashing function is necessarily rather simplistic. When the implosion algorithm processes highly redundant data, the entries in the hashing table tend to get very imbalanced, causing a reduction in performance. For most types of data, notably executables, this isn't the case though. A sketch of such a hash follows below.

** Decompression **

The goal of decompression is to reproduce the segment list of the original program. This is a linked list of data/code/bss hunks, some of which require being positioned in chip memory. The decompression code has to fill the data and code hunks with the decompressed data and, if required, perform relocation of absolute references.

The Imploder lets LoadSeg allocate all target hunks (as BSS). These are at the start of the hunk list. They are followed by a hunk with the decompression code (only for non-library imploded programs), a compressed data hunk (normally about 50% of the static data size, depending on compression gain of course), and a decompression buffer (only up to 17K in size). As you can see, no allocations need to be done, so the exploding process will never fail.

During decompression, data is decompressed from the compressed data buffer into the decompression buffer until it is full. The buffer is then emptied by filling hunks and processing reloc tables. When the buffer has been emptied, the process repeats until all data has been scatter-decompressed across the target hunks.

Now you might ask, why not decompress directly to the target hunks instead of using a decompression buffer? This is because the Imploder uses the decompressed data to decompress, and needs to be able to access it easily (for speed). Referencing data distributed across several hunks is much more cumbersome. An added bonus of having separate source and target memory regions is that a notation can be used that doesn't gain under rare circumstances, but on average yields better results (no chance of source/target pointer collision).

When explosion has finished, the decompression buffer, code and data hunks are freed, and memory usage is reduced to what it would have been for a non-imploded version of the program.
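Here is the hash sketch promised in the compression section, in C (all names are invented for illustration; the Imploder itself is of course 68000 assembly). Because a match may be as short as two bytes, only two bytes of context are available to the hash function, so runs of identical bytes all pile into the same bucket:

    #define WINDOW_SIZE 32768           /* assumed history size */
    #define HASH_BITS   12
    #define HASH_SIZE   (1 << HASH_BITS)

    static long head[HASH_SIZE];        /* most recent position per bucket */
    static long chain[WINDOW_SIZE];     /* previous position with same hash */

    /* Only the two bytes that start a candidate string can be hashed,
     * which is why the function must stay this simplistic. */
    static unsigned hash2(const unsigned char *p)
    {
        return ((unsigned)(p[0] << 4) ^ p[1]) & (HASH_SIZE - 1);
    }

    /* Index position 'pos' so later searches can walk the chain of
     * earlier positions starting with the same two bytes. */
    static void index_position(const unsigned char *buf, long pos)
    {
        unsigned h = hash2(buf + pos);
        chain[pos % WINDOW_SIZE] = head[h];
        head[h] = pos;
    }

On highly redundant input (say, a long run of zero bytes) every position hashes to the same bucket, the chain for that bucket grows very long, and search time degrades; that is the imbalance described above.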
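The explosion loop just described might be pictured roughly like this (again a hypothetical C sketch; the real code is assembly, and the helper names are made up). It shows the two-phase rhythm: decompress into the small buffer, then drain it across the target hunks and apply relocations:

    #include <string.h>

    #define BUFFER_MAX (17 * 1024)      /* the decompression buffer */

    struct Hunk {
        unsigned char *mem;             /* LoadSeg-allocated BSS memory */
        long size, filled;
        struct Hunk *next;
    };

    /* Hypothetical helpers standing in for the real routines. */
    extern int  more_data(void);
    extern long decompress_chunk(unsigned char *dst, long max);
    extern void apply_relocs(struct Hunk *h);

    void explode_all(struct Hunk *target)
    {
        static unsigned char buffer[BUFFER_MAX];

        while (more_data() && target) {
            /* Phase 1: decompress into the buffer. Back references
             * stay inside this one contiguous region, so source and
             * target pointers can never collide. */
            long n = decompress_chunk(buffer, BUFFER_MAX);
            long off = 0;

            /* Phase 2: scatter the buffer across the target hunks. */
            while (off < n && target) {
                long room = target->size - target->filled;
                long take = (n - off < room) ? n - off : room;
                memcpy(target->mem + target->filled, buffer + off, take);
                target->filled += take;
                off += take;
                if (target->filled == target->size) {
                    apply_relocs(target);   /* patch absolute references */
                    target = target->next;
                }
            }
        }
    }
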
People often compare the Imploder to the Power-Packer. The Imploder decompresses faster and looks cooler [:-)], but the most interesting differences lie in the implementation of the various steps of the decompression process. So let's contrast the advantages of the Imploder's approach with the Power-Packer's implementation.

- By having LoadSeg allocate all memory as BSS hunks, explosion will never fail. The Power-Packer on the other hand allocates the hunks itself; if an allocation fails, it simply exits. Power-Packed programs launched from the WorkBench thus won't reply the startup message, which will be left dangling in memory forever.

- Memory doesn't get fragmented. The explosion-related hunks are at the end of the seglist and thus were allocated (by LoadSeg) AFTER the target hunks. This isn't true for the Power-Packer: it leaves a hole in your free memory list when it frees its decompression stuff.

- Additional memory usage is only about 50% of the static data size plus the size of the decompression buffer, which is always small relative to large programs (maximum 17K). So a 30K program might require 62K to decompress (30+15+17), and a 300K program will require 467K (300+150+17), assuming a 50% compression reduction. The memory usage report generated after a program has been imploded includes BSS hunks; I've discussed only static data here. BSS hunks don't require any extra memory, of course. Power-Packed files require a buffer as large as the original program for both compressed data storage and decompression. Their memory usage is therefore always about twice the static data size (again ignoring BSS), while for the Imploder it drops towards 1.5 times for executables large enough to matter memory-wise. (In this comparison I'm talking about executables as produced by Power-Packer version 3.0b.)

Non-library imploded programs have a small first hunk that calls the decompression code hunk at the end, and frees these last three hunks. For library imploded programs this freeing occurs in the library, so no preceding hunk is needed.

** Reversibility **

Before compression the Imploder preprocesses executables. It kicks out all the redundant stuff by merging subreloc tables referring to the same hunk, sorting relocs (improves compression), removing debug symbols, etc. This is what all those info blurbs in the text window are about. So the deploded executable isn't guaranteed to be byte-by-byte identical as far as loadfile control codes are concerned. What is guaranteed is that the memory images created when the original, imploded and deploded program versions are loaded are identical.

So the deplosion process isn't 100% reversible. Normally this is no problem. The reason for uncrunching is mostly wanting to recompress an executable using a different compression mode, or having a quick peek at the code, e.g. when applying a patch with something like NewZap. If however you expect to need the debug symbols, or (important) require the executable to be in the _exact_ original format in order to have things like lpatch updates applied, you're out of luck if you've only got the compressed executable. So always keep the original archives/disks! This is yet another argument for retaining the original archives. The Imploder is an online space creator, not a distribution archiver (see the "Philosophy" text).
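As an illustration of the kind of cleanup involved, here is a sketch (invented names, simplified table model, error handling omitted) of merging reloc sub-tables that refer to the same hunk and then sorting the offsets. Sorting helps compression because the deltas between sorted offsets become small and repetitive:

    #include <stdlib.h>
    #include <string.h>

    /* One HUNK_RELOC32 sub-table: a list of offsets that refer to a
     * particular target hunk. An executable may contain several
     * sub-tables for the same target; merging them saves the
     * per-table headers. */
    struct SubReloc {
        long  target;       /* hunk the offsets point into */
        long  count;
        long *offsets;
    };

    static int cmp_long(const void *a, const void *b)
    {
        long x = *(const long *)a, y = *(const long *)b;
        return (x > y) - (x < y);
    }

    /* Merge all sub-tables with the same target into the first one,
     * then sort each surviving table. Returns the new table count. */
    long normalize_relocs(struct SubReloc *t, long n)
    {
        long i, j, out = 0;
        for (i = 0; i < n; i++) {
            for (j = 0; j < out; j++)
                if (t[j].target == t[i].target) break;
            if (j == out) {         /* first table for this target */
                t[out++] = t[i];
            } else {                /* append offsets to earlier table */
                t[j].offsets = realloc(t[j].offsets,
                                       (t[j].count + t[i].count) * sizeof(long));
                memcpy(t[j].offsets + t[j].count, t[i].offsets,
                       t[i].count * sizeof(long));
                t[j].count += t[i].count;
                free(t[i].offsets);
            }
        }
        for (j = 0; j < out; j++)
            qsort(t[j].offsets, t[j].count, sizeof(long), cmp_long);
        return out;
    }
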
** The Library **

The library code has been unrolled a bit, and optimized here and there, in order to achieve optimal performance. This makes it faster than the normal explosion algorithm.

If you library-implode a program, there is NO way in which the program, after explosion, will be able to notice. If you make sure the library is resident, this is also true for any executable file loaded for any purpose by any program. For normal and pure imploded programs, the startup/cleanup hunk mentioned at the end of the "Decompression" section might be detected if a program goes through contortions involving finding the segment list via murky DOS structures, instead of simply doing PC-relative hunk start referencing, which also works from the WorkBench. I haven't encountered any programs that do this. Still, this is yet another reason to use the library; there is not even the slightest chance of it being incompatible with an executable.

Note that the LoadSeg vector is patched in an "intelligent" manner; it will install fine on pre-2.0 kickstarts (braindead jumptable format) as well as on BCPL-free systems (2.0+). Under pre-2.0, when a library imploded file is run from the WorkBench and the explode.library isn't resident yet, Exec will try to load the library from disk. The process's message port however is in use by the WorkBench reply message, and until that message has been replied, the port cannot be used by the DOS to send packets. Thus the DOS gurus. Also, BCPL code doesn't jump through the library vector. The only structural problem this causes is with handlers: these are loaded by the DOS, and the DOS is BCPL code, again ONLY under < 2.0. Under 2.0 the library works just as intended when it was first conceived. Transparently, that is. (A sketch of the 2.0 patching follows at the end of the next section.)

** Overlayed Files **

The Imploder compresses the load part of an overlayed file as if it were a normal executable file. Subsequently, the overlay table and the overlayed program sections are appended. It then tries to adapt the overlay table.

Because different types of overlay supervisors are in use, the format of the overlay table isn't known to the Imploder. The only assumption made is that the overlay table contains seek-offset longwords, at longword-aligned locations, that point into the file to the hunk_header ($3F3) identifiers of the overlayed program sections. This is how the standard overlay manager operates, but nothing prevents a programmer with sufficient technical knowledge from creating a novel overlay format (e.g. self-extracting DImp files). If the Imploder finds one of these offsets, it is adjusted by the amount by which the initial load part of the executable file has compressed. The deplosion algorithm also tries to find these offsets when restoring the overlay table. Thus there is always a very small chance that the Imploder will adapt a longword that was never meant to be an offset.

An overlayed file gets its information from the loader in four longwords at the start of the first hunk. In an imploded overlayed file, this hunk is the root hunk, and after decrunching these longwords are adjusted and moved into the first hunk of the actual program (the second hunk of the seglist). Evidently this process can never be 100% deterministic, so take heed and test any overlayed programs you've Imploded. Or don't use overlay implosion at all if you can spare the bits.
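The offset hunt could be sketched like this (hypothetical C over the raw file image; a big-endian host like the Amiga is assumed for the table reads): scan the table a longword at a time, and shift every value that is longword aligned and lands on a hunk_header identifier:

    #define HUNK_HEADER 0x3F3UL

    static unsigned long read_be32(const unsigned char *p)
    {
        return ((unsigned long)p[0] << 24) | ((unsigned long)p[1] << 16)
             | ((unsigned long)p[2] << 8)  |  (unsigned long)p[3];
    }

    /* Adjust the seek offsets in an overlay table. 'file'/'filelen'
     * is the raw file image, 'table'/'entries' the overlay table
     * inside it, and 'delta' the (negative) size change of the
     * compressed load part. A longword is treated as an offset only
     * if it is longword aligned and points at a $3F3 identifier,
     * hence the small chance of patching an innocent longword. */
    static void adjust_overlay_table(unsigned char *file, long filelen,
                                     long *table, long entries, long delta)
    {
        long i;
        for (i = 0; i < entries; i++) {
            long v = table[i];
            if (v > 0 && (v & 3) == 0 && v + 4 <= filelen &&
                read_be32(file + v) == HUNK_HEADER)
                table[i] = v + delta;
        }
    }
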
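And the 2.0 patching promised above: on a BCPL-free system the vector can in principle be replaced with exec's standard SetFunction call, roughly as in this sketch. The -150 offset is the standard dos.library LVO for LoadSeg; everything else here is illustrative only, and real code must respect the DOS register convention (the name arrives in D1), while the pre-2.0 jumptable needs the hand-rolled treatment described earlier:

    #include <exec/types.h>
    #include <proto/exec.h>
    #include <proto/dos.h>

    #define LVO_LoadSeg (-150)          /* dos.library vector offset */

    static BPTR (*OldLoadSeg)(STRPTR name);

    /* Replacement that calls through to the original LoadSeg and
     * could then explode an imploded seglist before returning it.
     * Sketch only: a real patch must take 'name' in register D1. */
    static BPTR MyLoadSeg(STRPTR name)
    {
        BPTR seglist = OldLoadSeg(name);
        /* ... detect an imploded first hunk and explode it here ... */
        return seglist;
    }

    void install_patch(void)
    {
        OldLoadSeg = (BPTR (*)(STRPTR))
            SetFunction((struct Library *)DOSBase, LVO_LoadSeg,
                        (APTR)MyLoadSeg);
    }
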
** Merging **

Though modern linkers/compilers typically produce executables with one code hunk and one data hunk, there are still some old executables and less evolved linkers around. The merging option was implemented when executables with enough hunks to cause a lot of redundancy were still commonplace. Every hunk requires a longword in the allocation header, plus a hunk ID, a load size, and a hunk end ID. That's 16 bytes per hunk, and thus 16 bytes saved for every merge action. Doesn't sound like much, but generally hunks also have reloc tables. These waste a lot more space, especially with references to a lot of different hunks, though there's no easy equation.

The merging step merges matching hunks (data-data, chipdata-chipdata, code-code) into hunks of up to the merge threshold in size. The actual size is of course determined by the sum of the sizes of the constituent hunks, and may very well be a bit less than the specified threshold. Obviously this process discards redundant data in an irreversible fashion, so deploding the executable won't reverse it. A sketch of the merge bookkeeping follows below.

Lots of tiny hunks cause memory fragmentation, but increase the chance of the program being able to load when the system is fragmented and low on memory. Thus there is a kind of optimal balance that varies from system to system. In general it can be said that hunks smaller than 10K or larger than 100K are "bad". Another factor is that loading a program with many tiny hunks causes the LoadSeg function to issue twice as many tiny read commands, bogging down the speed with which an executable can be loaded into memory.

For simplicity's sake, I've chosen for the Imploder to process executables within a single buffer, without the need for additional backup buffers. Thus, removing redundant information and copying hunk data during the merging and reloc cleanup process involves moving or mirroring large parts of the buffer. This is why merging can take a while when processing a large executable with a hundred or so hunks.
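The merge bookkeeping promised above, as a sketch (invented names; simplified to adjacent hunks, whereas the real tool must also move the hunk data and renumber/rebase the reloc references of everything that shifts):

    /* Greedy merge of adjacent same-type hunks (code-code, data-data,
     * chipdata-chipdata) up to a size threshold. Each merge saves the
     * 16 bytes of per-hunk overhead plus whatever the combined reloc
     * tables shed. */
    enum HunkType { CODE, DATA, CHIPDATA };

    struct MHunk {
        enum HunkType type;
        long size;
        struct MHunk *next;
    };

    void merge_hunks(struct MHunk *h, long threshold)
    {
        while (h && h->next) {
            struct MHunk *n = h->next;
            if (n->type == h->type && h->size + n->size <= threshold) {
                h->size += n->size;   /* also: move data, rebase relocs */
                h->next = n->next;    /* unlink the absorbed hunk */
            } else {
                h = n;                /* can't merge; move on */
            }
        }
    }

Note how the result can come out a bit under the threshold: the loop never splits a hunk, it only sums whole constituent hunks until the next one would overshoot.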
** ARP **

Programs written for use with the ARP shell are able to specify the stacksize they require inside the first hunk of their executable. If such a program is normal or pure imploded, the segment list won't become visible until the program is run, so ARP has no way of finding out what the proper stacksize should be. Library imploded programs have no trouble with this, because they are already decrunched once they have been loaded into memory with LoadSeg (provided the library has already been made resident).

The Imploder will recognize these files and report on them. If the requested stacksize is larger than the usual minimum (4000 bytes), a warning will be printed, and you'll be urged to use only library implosion. The chance of a programmer relying on a soon-to-be-obsolete shell for setting a stack LARGER than the usual default is rather slim though. It would have been very nice if 2.0 had sported such a stack setting feature, and indeed it had been planned, but it was never implemented due to lack of time on the part of the Commodore programmers. We'll be on the lookout for any future changes to the executable file format, in order to fix any potential incompatibilities before they cause problems.

** The Music **

When we got word that the CIA-A timer was used by the OS under 2.0, we switched to trying to allocate first CIA-A and, if that wasn't available, CIA-B to drive the music. However, the CIA-B interrupt priority is too high and can interfere with things like serial transfer. So Paul got this great idea to keep on using a CIA for precision timing purposes, but drop down to a SoftInt for doing the actual work, modulations, etc. This works great; the amount of code executed at CIA priority is now negligible.

Recently, the CATS started feeling guilty about hijacking the CIA-A timer and thus created "Jumpy the magic timer device". If I understood things correctly, the latest 2.0 timer device moves out of the way and starts using a less accurate timing source whenever an application tries to allocate the CIA-A. Paul's music driver can run off both CIA-A and CIA-B, and it would be a pity to make Jumpy jump without good reason, so he changed the allocation sequence from A-B to B-A.

** The Copperlist **

There are a couple of unavoidable quirks when one uses copperlists on Intuition screens. On certain machines, probably PAL ones, under certain circumstances, dragging a dense copperlist past scanline 256 or so will cause some video crud to appear at the top of the display. This can't hurt, but it sure does look ugly. I suspect this is a hardware misfeature, because it ain't fixed yet under 2.0. This is the reason why the screen of older Imploder versions wasn't draggable; you might just have thought this muck was our doing.

The second problem is that copperlists "shine through" onto other screens in front. For this reason we've chosen a colour > 4 for the level bars, so this is never observable with screens less than 3 bitplanes deep. The 2.0 OS supports proper copperlist clipping, but it has been disabled by default for compatibility reasons (yeach). Supposedly there is a bit somewhere in the graphics base to turn this back on, so I'm sure, in due time, there will be some preference editor to re-enable it.

** 68040 Cache Coherency **

With the advent of the 68040 processor, programs that diddle with code which is subsequently executed will be prone to some problems. I don't mean the usual self-modifying code problem, where the code sitting in the instruction cache is no longer what the algorithm expects. That is something the Imploder never had a problem with; indeed, the Imploder has always worked fine with anything up to and including a 68030.

The reason the 68040 is different is that it has a "copyback" mode. In this mode (which WILL be used by people, because it increases speed dramatically) writes get cached and aren't guaranteed to be written out to main memory immediately. Thus four consecutive byte writes will require only one longword main memory write access. Now you might have heard that the 68040 does bus-snooping. The odd thing is that it doesn't snoop the internal cache buses! Thus if you stuff some code into memory and try to execute it, chances are some of it will still be in the data cache. The code cache won't know about this, and won't be notified when it fetches from main memory locations whose contents are still waiting to be written out of the data cache. This problem is amplified by the absolutely huge size of the caches.

So programs that move code, like the explosion algorithms, need to flush the cache after they are done. As of version 4.0, the appended decompression algorithms as well as the explode.library flush the cache, but only under OS 2.0. The reason for this is that only OS 2.0 has calls for cache-flushing. This is yet another reason not to distribute imploded programs; they might just cross the path of a proud '40 owner still running under 1.3. It will be interesting to see how many other applications will run into trouble once the '40 comes into common use among Amiga owners. The problem explained above is something that could not easily have been anticipated by developers. It is known that the startup code shipped with certain compilers copies bits of code around, so it might very well be a large problem.
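For reference, the flush itself is a one-liner once you know you are on 2.0. A sketch of the guard described above, using exec's V37 CacheClearU() call (the function name and version check are the standard 2.0 ones; the wrapper name is invented):

    #include <exec/execbase.h>
    #include <proto/exec.h>

    extern struct ExecBase *SysBase;

    /* After moving code into place, flush the caches so the 68040's
     * copyback data cache cannot leave stale code behind. Only the
     * V37+ (2.0) exec has an official call for this; under 1.3 there
     * is simply nothing supported to call, as described above. */
    void flush_caches_if_possible(void)
    {
        if (SysBase->LibNode.lib_Version >= 37)
            CacheClearU();              /* flush/invalidate all caches */
    }
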