1541 isn't fast enough to decode in realtime.

— MagerValp

GCR decoding on the fly by Linus Åkesson

The little routine presented here is, with all due humbleness, hands down the best code I ever wrote. It is esoteric and might not appeal to the masses, and it certainly has no commercial value. But in terms of personal satisfaction at a job well done, this is about as good as it gets. Here's why:

The code solves a real, practical problem that people have been struggling with for 30 years.
The key to the solution lies in looking at the problem from a novel angle.
The implementation would not be possible without plenty of trickery and clever coding techniques. Several instructions do more than one thing.
The routine is incredibly compact, and leaves very little room for variation; it is THE solution.

Due to the strict timing requirements, there was no way of testing the routine before it was complete. In contrast with the prevalent fashion of trial-and-error software development, writing this code was an exercise of the mind, and my only tools were pen, paper and text editor.

Acknowledgements

I would like to thank Krill of Plush for creating an excellent loader, documenting some of its inner workings and releasing the entire source code, and Ninja of The Dreams for his concise reference material All about your 1541.

The problem

Commodore compatible 5.25" floppy disks store data as magnetic polarity reversals along 35 concentric tracks on the surface. The drive positions the read head (a coil and an amplifier) over a track, spins the disk, and reads off a stream of ones and zeroes. A switch in polarity indicates a one, whereas no switch in polarity for a predetermined amount of time indicates a zero. But because the spindle motors in different drives won't go at exactly the same speed, it is not permitted to store more than two consecutive zeroes on a track. Otherwise there's a risk that the floppy drive counts off too many or too few zeroes, and the data comes out all wrong.

Hence the data is encoded using a scheme called GCR (Group Code Recording). For every four bits of data (also known as a nybble), five bits are written to the disk surface, chosen from a table. Let's refer to them as a quintuple. The encoding is designed to ensure that, no matter how you order the quintuples, there won't ever be more than two zeroes in a row.

Nybble	Quintuple	Nybble	Quintuple
`0000`	`01010`	`1000`	`01001`
`0001`	`01011`	`1001`	`11001`
`0010`	`10010`	`1010`	`11010`
`0011`	`10011`	`1011`	`11011`
`0100`	`01110`	`1100`	`01101`
`0101`	`01111`	`1101`	`11101`
`0110`	`10110`	`1110`	`11110`
`0111`	`10111`	`1111`	`10101`

When reading a disk, the floppy drive must convert the incoming bit stream, five bits at a time, into the original nybbles. The bits are, however, lumped together eight at a time, and handed off as bytes to a 6502 CPU. To indicate that a new byte has arrived, the overflow flag in the processor status register is set by means of the SO pin. It is then up to the CPU to extract the quintuples from the bytes, decode them according to the GCR scheme, and join the resulting nybbles into bytes. This is a realtime job, and the deadlines are very tight: On tracks with the highest storage density, a new byte of raw GCR-encoded data arrives every 26 microseconds.

11111222 22333334 44445555 56666677 77788888

The quintuples are packed into bytes, so that each byte contains parts of up to three different quintuples. The least common multiple of five and eight is 40, so after 40 bits we're back in phase. Thus, a GCR decoder generally contains a loop that deals with five bytes in each iteration.

Commodore's approach

Some of the floppy drives produced by Commodore sported 2 MHz processors and large ROM chips with lookup-tables to facilitate the GCR decoding. However, Commodore also made the 1541, a low-cost drive, which turned out to be immensely popular, and has earned its place in the standard setup used for C64 demo compos (an unexpanded C64 and a 1541 floppy drive). The 1541 contains a single 6502 CPU clocked at 1 MHz, 2 kB of RAM and 16 kB of ROM.

With such modest specifications, Commodore concluded that the 1541 was not capable of decoding the GCR data in realtime. Instead, the raw data for a complete 256-byte sector is stashed away in a buffer (of 320 bytes, because of the 5:4 expansion). Only afterwards does the CPU revisit each byte in the buffer, use a series of shifting and masking operations to bring each quintuple into the lower part of an index register, and perform a lookup in a ROM table of 32 entries. Actually, there are two such ROM tables, one containing low nybbles and one containing high nybbles, to be joined into a byte using bitwise-or. During this non-realtime pass, the sector checksum is also computed and verified.

loop
        bvc     loop
        clv
        lda     $1c01
        sta     ($30),y
        iny
        bne     loop

The fetch loop in the 1541 ROM simply reads all the bytes and stores them in a buffer. First it busy-waits until the overflow flag is set. Then it clears the flag in anticipation of the next byte. Then, the current byte is fetched from the hardware register ($1c01) and stored into memory. This snippet only reads 256 bytes, but it is followed by a similar loop that reads the remaining bytes.

The loop needs 19 clock cycles to complete one iteration (assuming the bvc instruction falls through immediately). This leaves several cycles of slack, to account for differences in motor speed across drives.

A loop with more than 26 cycles would miss bytes, whereas a 25-cycle loop would work well in practice. At exactly 26 cycles you would be pushing your luck.

People were not happy with the performance of the original 1541 ROM routines, not only because of the slow GCR decoding, but also because of the dreadfully slow serial protocol used to transmit the decoded data to the computer. The one thing that Commodore got right in that protocol, though, is that they provided a way to upload code into the 1541, and thus to take over the floppy drive completely. The replacement code and any data buffers must fit in the 2 kB of RAM, but this is enough if all you want to do is load data as quickly as possible from the disk and transmit it as quickly as possible to the computer. Such software is called a fastloader.

Krill's approach

As a shining example of a modern fastloader, let's have a look at how GCR decoding is performed in Krill's loader. As expected, there's a loop that receives five bytes of GCR encoded data, and performs as much partial decoding as possible in between receiving the bytes (i.e. while the sector is passing the read head). The goal is to minimise the execution time of the second, non-realtime decoding pass, without overstepping the maximum execution time of the fetch loop, 130 cycles (26 cycles times five). The partially decoded data is stored in two buffers (512 bytes in total), corresponding to the partially decoded high and low nybbles.

But Krill also realised that there is no need to synchronise with the overflow flag for every byte.

Each track has a statically assigned bit rate. As the disk spins at a constant angular velocity (300 rpm), and the bit density of the medium is constant, it is possible to store more bits on the outermost tracks. Commodore defined four bit rates: A byte arrives every 26, 28, 30 or 32 cycles.

After a synchronisation, the first, second and third byte of data are present in the hardware register according to the following schedule. Thus, if we make sure to read the register during the safe areas indicated on the last row, we can read up to three bytes after a bvc loop.

           cycle
bit rate   0         10        20        30        40        50        60        70        80        90
0          111111111111111111111111111111112222222222222222222222222222222233333333333333333333333333333333
1          111111111111111111111111111111222222222222222222222222222222333333333333333333333333333333
2          111111111111111111111111111122222222222222222222222222223333333333333333333333333333
3          111111111111111111111111112222222222222222222222222233333333333333333333333333
safe areas 111111111111111111111-----      ---222222222222-----            ---333333-----

Because of slight variations in the timing (due to the three-cycle jitter introduced by the bvc loop, and the fact that the bit detection timer is reset after every one-bit), it is advisable to read the hardware register a couple of cycles into the safe area, but not much later.

As the loop deals with five incoming bytes per iteration, it is sensible to synchronise twice: Once in anticipation of three bytes, and once in anticipation of two bytes.

After a byte has been read, it is subjected to a number of shifting and masking operations, and the partially decoded quintuples are stored away. During the non-realtime pass, the decoding is completed and the checksum is verified. Interestingly, in Krill's loader, the nybbles are not joined together into bytes, but remain in separate buffers. This is because they are going to be transmitted serially anyway, and thus split up into even smaller parts.

The epiphany

As part of my efforts to learn C64 demo programming, I became interested in GCR decoding. One day, in a flash of inspiration, I started thinking about what would happen if all the shifting operations were removed. Consider the second quintuple. It arrives partly in the first byte, partly in the second, and normally you'd try to right-justify it in an index register. For the second quintuple, it's easiest to shift to the left, like so:

        11111222 22333334
        -----222 22------       mask
        ---22222 --------       shift

(The digits represent data from the different GCR-encoded quintuples. The dashes are zero-bits.)

But what if you, instead of shifting, combined the masked parts into a single byte using bitwise-or?

        11111222 22333334
        -----222 22------       mask
        22---222                bitwise-or

That is to say, the three most significant bits remain in the low bits while the two least significant bits remain in the high bits. We could of course use this value as an index into a full-page lookup table, but it would be a sparse table. First, we've masked out three of the bits, so we can only reach 32 out of the 256 entries in the table. But actually, because these are GCR encoded values, we'll only access 16 out of those 32 entries, corresponding to the valid quintuples.

Suppose we do this with all eight quintuples, obtaining eight index values:

Now, if we were to allocate a full page of RAM for each of these sparse tables, we'd use up the entire 2 kB. But here's the kicker: The whole point of GCR encoding is to ensure that there can never be a sequence of three consecutive zeroes in the bit stream. Since we're masking out three bits, we are introducing a sequence of three zeroes! So every index contains exactly one sequence of three or more zero-bits, and this can be regarded as a key that selects one out of eight sets of table entries. If these sets are non-overlapping, we can combine all eight tables into a single page of memory.

Unfortunately the sets overlap, for the following reason: A GCR-encoded quintuple may begin and/or end with a single zero bit. If the index value contains a sequence of exactly three zeroes, we know that these are the key. If there is a sequence of five zeroes, we know that the middle bits are the key. But if the index value contains a sequence of four zeroes, then we don't know if it's the key followed by a GCR-bit, or a GCR-bit followed by the key. Example: Index 00010110 contains a sequence of four zero-bits (bits 0, 7, 6 and 5, wrapping around from the right edge to the left edge). We don't know if this should be interpreted as --33333- (the quintuple would be 01011 which is decoded as 0001) or ---88888 (the quintuple would be 10110 which is decoded as 0110).

However, if we use two pages of memory, one for the odd quintuples and one for the even quintuples, then there can be no collisions: If there are four consecutive zeroes in an index value, we know that the three that make up the key must begin at an odd or even bit position. Continuing the example from the previous paragraph, at index 00010110 we'd put the decoded nybble 0001 in the odd table and the decoded nybble 0110 in the even table.

As a bonus, remember that the quintuples are alternating between the high and low nybbles of the decoded bytes. In the odd table, we can shift all the values into the high nybble. This way, to form the final byte out of the first two nybbles, all we have to do is compute the bitwise-or of odd_table[11111---] and even_table[22---222].

This is what the tables look like (entries marked "--" are not accessed):

Even quintuples

-- -- -- -- -- -- -- -- -- 08 00 01 -- 0c 04 05 
-- -- 02 03 -- 0f 06 07 -- 09 0a 0b -- 0d 0e -- 
-- 02 -- -- 08 -- -- -- 00 -- -- -- 01 -- -- -- 
-- 03 -- -- 0c -- -- -- 04 -- -- -- 05 -- -- -- 
-- -- 08 0c -- 0f 09 0d 02 -- -- -- 03 -- -- -- 
-- 0f -- -- 0f -- -- -- 06 -- -- -- 07 -- -- -- 
-- 06 -- -- 09 -- -- -- 0a -- -- -- 0b -- -- -- 
-- 07 -- -- 0d -- -- -- 0e -- -- -- -- -- -- -- 
-- -- 00 04 02 06 0a 0e -- -- -- -- -- -- -- -- 
08 09 -- -- -- -- -- -- -- -- -- -- -- -- -- -- 
00 0a -- -- -- -- -- -- -- -- -- -- -- -- -- -- 
01 0b -- -- -- -- -- -- -- -- -- -- -- -- -- -- 
-- -- 01 05 03 07 0b -- -- -- -- -- -- -- -- -- 
0c 0d -- -- -- -- -- -- -- -- -- -- -- -- -- -- 
04 0e -- -- -- -- -- -- -- -- -- -- -- -- -- -- 
05 -- -- -- -- -- -- -- -- -- -- -- -- -- -- --

Odd quintuples

-- -- -- -- -- 00 -- 40 -- 20 -- 60 -- a0 -- e0 
-- -- 80 -- 00 -- 10 -- -- -- c0 -- 40 -- 50 -- 
-- 80 -- 90 20 -- 30 -- -- -- f0 -- 60 -- 70 -- 
-- -- 90 -- a0 -- b0 -- -- -- d0 -- e0 -- -- -- 
-- 00 20 a0 -- -- -- -- 80 -- -- -- -- -- -- -- 
00 -- -- -- -- -- -- -- 10 -- -- -- -- -- -- -- 
-- 10 30 b0 -- -- -- -- c0 -- -- -- -- -- -- -- 
40 -- -- -- -- -- -- -- 50 -- -- -- -- -- -- -- 
-- -- -- -- 80 10 c0 50 -- 30 f0 70 90 b0 d0 -- 
20 -- -- -- -- -- -- -- 30 -- -- -- -- -- -- -- 
-- c0 f0 d0 -- -- -- -- f0 -- -- -- -- -- -- -- 
60 -- -- -- -- -- -- -- 70 -- -- -- -- -- -- -- 
-- 40 60 e0 -- -- -- -- 90 -- -- -- -- -- -- -- 
a0 -- -- -- -- -- -- -- b0 -- -- -- -- -- -- -- 
-- 50 70 -- -- -- -- -- d0 -- -- -- -- -- -- -- 
e0 -- -- -- -- -- -- -- -- -- -- -- -- -- -- --

Beautiful, aren't they?

Let's recap the algorithm:

Read five bytes.

        11111222 22333334 44445555 56666677 77788888

Extract twelve fragments using bitwise-and.

        11111--- -----222 --33333- -------4 ----5555 -66666-- ------77 ---88888
                 22------          4444---- 5-------          777-----

Combine the fragments into eight bytes using bitwise-or.

        11111--- 22---222 --33333- 4444---4 5---5555 -66666-- 777---77 ---88888

Retrieve decoded nybbles from the tables.

The digits represent decoded bits from now on.

        1111---- 3333---- 5555---- 7777----
        ----2222 ----4444 ----6666 ----8888

Join them into bytes using bitwise-or.

        11112222 33334444 55556666 77778888

Store the four bytes in a buffer.

You might be thinking: This is a novel and quite interesting approach and all, but there's nothing to suggest that it would be faster than ordinary shifting. And, indeed, much work remains.

The tricks

We'll use plenty of self-modifying code. This is not really a trick, but an established technique on the 6502 processor. In order to shave off a cycle from every instruction that modifies another one, the entire loop will be placed in the zero-page.

Rather than storing the decoded bytes using an indexed addressing mode (five cycles per byte, plus a few more to advance the index register), we'll push them on the stack (three cycles per byte, pointer automatically decremented, and both index registers are free for other purposes). The 6502 stack is restricted to one page, so a complete sector fits perfectly. The 257th byte (checksum) is handled as a special case after the loop. The downside of this method is of course that we cannot use the stack for anything else. In particular, we can't make any subroutine calls.

For masking, we can use the and instruction, which puts the result in the accumulator. However, sometimes we'd rather preserve the contents of the accumulator. Using unintended (also known as "illegal") opcodes, there are a couple of ways to achieve this: The sax instruction is a mix of sta and stx; it tries to write the contents of both A and X to memory, and due to the NMOS process used to manufacture the 6502, zeroes are stronger than ones, so bus contentions resolve into bitwise-and operations. Thus, we can use sax to store a subset of the bits in A (or X) to memory without clobbering the rest of the register. Furthermore, we can perform a bitwise-and operation using the sbx instruction (a mix of cpx and tax) with an operand of zero, which will leave the result in X.

We can obtain two copies of each incoming byte: The unintended opcode lax is a mix of lda and ldx, and can be used to fetch an encoded byte into both A and X.

To combine index parts, we can use a bitwise-or operation as described above, but of course addition would work equally well. Addition can be computed implicitly as part of an indexed memory access: Since the odd and even tables are page-aligned, we can store some of the index bits in the least significant byte of the instruction operand, and the remaining bits in an index register.

Finally, consider quintuple number four: It arrives in two parts, one of which is simply the least significant bit of the second incoming byte. Rather than mask this bit and keep it in a register, we can use the lsr instruction (the only shifting operation in the entire GCR decoder) to place it in the carry flag. To combine it with the other part of the index, we simply adc (add-with-carry) a constant zero. This frees up a register, but that's not all: As a side effect, the addition operation will clear the overflow flag. By executing the adc at the right moment, we can get rid of one clv instruction, saving two cycles.

Armed with these tools, we are faced with a magnificent puzzle of instructions, registers, bits and cycles. When taking on this challenge, I had no idea if it would be possible to find a solution. By the time I had only a few cycles left to optimise, I could hardly think of anything else for two days. But then everything clicked into place.

The code

The following loop decodes GCR on the fly. The checksum must still be verified in a separate pass.

Start at the label begin_reading_here; that is the logical beginning of the loop. In practice, the first iteration is handled partly outside the loop, which is entered at zpc_entry ("zpc" is short for zero-page code).

zpc_loop
                ; This nop is needed for the slowest bit rate, because it is
                ; not safe to read the third byte already at cycle 65.

                ; However, with the nop, the best case time for the entire loop
                ; is 130 cycles, which leaves absolutely no slack for motor speed
                ; variance at the highest bit rate.

                ; Thus, we modify the bne instruction at the end of the loop to
                ; either include or skip the nop depending on the current
                ; bit rate.

                nop

                lax     $1c01              ; 62 63 64 65   44445555
                and     #$f0               ; 66 67
                adc     #0                 ; 68 69         A <- C, also clears V
                tay                        ; 70 71
zpc_mod3        lda     oddtable           ; 72 73 74 75   lsb = --33333-
                ora     eventable,y        ; 76 77 78 79   y = 4444---4, lsb = --------

                ; in total, 80 cycles from zpc_b1

zpc_b2          bvc     zpc_b2             ; 0 1

                pha                        ; 2 3 4         second complete byte (nybbles 3, 4)
zpc_entry
                lda     #$0f               ; 5 6
                sax     zpc_mod5+1         ; 7 8 9

                lax     $1c01              ; 10 11 12 13   56666677
                and     #$80               ; 14 15
                tay                        ; 16 17
                lda     #$03               ; 18 19
                sax     zpc_mod7+1         ; 20 21 22
                lda     #$7c               ; 23 24
                sbx     #0                 ; 25 26

zpc_mod5        lda     oddtable,y         ; 27 28 29 30   y = 5-------, lsb = ----5555
                ora     eventable,x        ; 31 32 33 34   x = -66666--, lsb = --------
                pha                        ; 35 36 37      third complete byte (nybbles 5, 6)

                lax     $1c01              ; 38 39 40 41   77788888
                clv                        ; 42 43
                and     #$1f               ; 44 45
                tay                        ; 46 47

                ; in total, 48 cycles from b2

zpc_b1          bvc     zpc_b1             ; 0 1

                lda     #$e0               ; 2 3
                sbx     #0                 ; 4 5
zpc_mod7        lda     oddtable,x         ; 6 7 8 9       x = 777-----, lsb = ------77
                ora     eventable,y        ; 10 11 12 13   y = ---88888, lsb = --------
                pha                        ; 14 15 16      fourth complete byte (nybbles 7, 8)

begin_reading_here
                lda     $1c01              ; 17 18 19 20   11111222
                ldx     #$f8               ; 21 22
                sax     zpc_mod1+1         ; 23 24 25
                and     #$07               ; 26 27
                tay                        ; 28 29
                ldx     #$c0               ; 30 31

                lda     $1c01              ; 32 33 34 35   22333334
                sax     zpc_mod2+1         ; 36 37 38
                ldx     #$3e               ; 39 40
                sax     zpc_mod3+1         ; 41 42 43
                lsr                        ; 44 45         4 -> C

zpc_mod1        lda     oddtable           ; 46 47 48 49   lsb = 11111---
zpc_mod2        ora     eventable,y        ; 50 51 52 53   lsb = 22------, y = -----222
                pha                        ; 54 55 56      first complete byte (nybbles 1, 2)

                tsx                        ; 57 58
BNE_WITH_NOP    =       (zpc_loop - (* + 2)) & $ff
BNE_WITHOUT_NOP =       (zpc_loop + 1 - (* + 2)) & $ff
zpc_bne         .byt    $d0,BNE_WITH_NOP   ; 59 60 61      bne zpc_loop

Final remarks

Of course, once I had figured out the solution, I needed to write a fastloader around it to see whether it would really work. The loader grew into the Spindle trackmo toolchain, which I am saving for another article. To see the GCR decoder in action, have a look at my demo Shards of Fancy.

In interest of pure speed, I would love to see this used with a non-IRQ/no-display loader for 2 revolution track transfer. I've looked into it and don't think it's possible on highest density tracks unfortunately.

From the end of this processing until the sector header 2 away you have:

- current sector tail - 2 00s + 4 or more gap = 6+ bytes

- skipped sector header - 5 sync + 10 contents + 9 gap = 24 bytes

- skipped sector data - 5 sync + 325 data + 4 or more gap = 334+ bytes

- for total of 364+ bytes (note these are all in terms of encoded (GCR) not decoded bytes)

So 364 is the minimum number of disk bytes that may fly by in the time you need to transfer the sector just read. After that you need to go back to reading a sector header, etc.

At highest density setting, this is 26us per byte, or 26 * 364 = 9464us.

The fastest serial bit-bang transfer I think has ever been done is Vorpal V1 (like used in Winter Games). It transfers 3 bytes in a 113us loop, for 37.67us / byte.

We need to transfer 257 bytes (256+checksum) in 9464us. This means each byte must be transferred in 9464us / 257 = 36.82us.

So I'd expect Vorpal style serial transfer to allow you to squeeze by with 2 revolutions per track on lower-densities, but just miss the mark and still require 3 revolutions per track on highest density.

I would still love to tweak this code to be more density dependent to eliminate that BVC or two though. That should free enough cycles to replace PHA with STA $1801 and allow a (non-IRQ/no-display) 1 revolution parallel port track transfer :).

Aaron

Anonymous
Wed 19-Jun-2019 10:45

Great article! I have integrated the code into my disk utility and it allows me to scan the entire disk in 21 seconds using an interleave of 2 sectors!! I only send the first 4B and an error byte from each sector back to the C64, so that's why I can hit interleave of 2. I used the short checksum snipet from Shards of Fancy. I also noticed in Shards of Fancy that you use the "nop" for density 0 and 1 (Tracks 25-35), not just the slowest ones (31-35).

Now for some improvements!!
1) instead of storing $0100,$01FF->$0101, I store $01FF->$0100 so the data does not have a wrap. This is accomplished by starting the stack at $FF instead of $00, and also changing the final few instructions in the zero-page from pha before tsx to tsx before pha. This works because pha does not change the zero flag.
2) if the first byte of data is not $55, I give up after 3 tries and report error 04 for that sector and move on. also, if parity fails, give up after 3 tries and report error 05. This is because my project is a disk utility and not a loader :) I cannot "hang" trying forever.
3) save/restore the last 8B of stack, and SP, just before/after this routine, so I can still jsr/rts everywhere else in the code.
4) save/restore a portion of ZP before/after reading all the sectors I wanted, so I can return to KERNAL when done. Returning to KERNAL is important for my utility project. I am using $86-$E5 for the ZP code and only save/restore $99-$B4 -- the rest is just zero'd out.

Cheers!

Update: Make that 17 seconds. by decoding the sector HEADER on the fly as well, the "track-change" delay is reduced by accepting the first sector encountered after changing track.

Also important: The "first byte" and "last byte" decoding code is not shown on this blog, but it is visible in the Shards of Fancy demo. I found that it is unnecessary to separately handle the first byte and enter the zero-page code in the middle. You can start at the top if your "first byte" handling is as follows:

; assuming you already found the sector header of interest, and

; skipped the 'sync' that follows it, here is what you do next:

@byte0: bvc @byte0

clv

lda $1C01

cmp #$55 ; "DATA" signature

bne fail

ldx #$FF

txs ; set SP = $1FF

jmp decoder

fail: rts ; wait for another header, etc...

decoder:

nop ; required delay

@byte1: bvc @byte1

clv

lda $1C01

tax

lsr A ; set Carry Flag per first decode

txa

and #$3E

sta zaddr0+1 ; address of first "oddtable" access

nop ; required delay

@byte2: bvc @byte2

jmp zpdecode+1 ; top of zero page code (skip the nop)

Good luck all!

--Paul Nelsen

I used modified ROM's.. Once we got a hold of kevin pikles disassembler it was just adding hooks. Even using default transfers and such the time it took compared to humans was minuscule. So I could break in and trace anything anywhere and then continue where I left off. Checks which looked for such things did not work but that were few. I was still a student so money was hard to come by and that stopped many developments. I did the same thing with the amiga-Genesis development system and that was far easier comparatively since money was not stopping me from doing the basic things. I mean the C64 cost me more than a semester of school.. It wont occur to people not worried about eating :) But the $10 or $20 makes a huge difference and you got to weigh if that goes to the hardware or your stomach. Many days I lived on a single egg :) because the hardware won..

After I got the C128 things were easier since I was also getting paid by les not less.. So its kind of a shock that the one who treated me well was also such a crook. I always expect other to be a little crooked but off everyone Les was the only one who did not shaft me. And my upgrades to Vmax was due to him. I mean harald did not pay me. Because it was based on royalties.. Which he never paid so I lived on what Les sent me. And also the upgrades stopped as soon as there was a fully working system because well they told me they wont pay the royalties and that they had already stolen everything. Well they also stole my name and other things.. might have also stolen some underwear because I was missing some.. The speed difference between the C128 and C64 made real time debugging etc far easier.. By that time wrap speed was finished and adding hooks to that to do all this was like just 1 keypress away.. to go from computer memory to drive memory.. Did not care about extra drive memory since parallel cable using computer memory was so much faster and larger.. You could read/write in real time with plenty of cycles left over. If I had known I would have done all that first even on the C64. But initially things were difficult and it was all trial and error. I think EEprom programmer cost like $100 which was like giving up the car and lunch for a month so that took a while.

Vorpal did manage to do single track decode and transfer.. And it was on my list of projects before I stopped.. I got upto 2 revs with more capacity than standard disk but it needed far more research here because by now the technical specs shot way up and took far more effort for very small improvements. I could not continue even if I wanted to because the betrayals hurt my ego and nothing was important any longer. This is why I wanted to talk to scott nelsen so badly but the state of my mind also prevented me from forcing the issue. By using self modying code in the drive as well as the computer the load was fast and the demo did it in real time. Dont know if they ever released it.

see this comment
"The only loader we ever saw, that could handle data no matter how it was
framed (i.e. had no sync bytes) was the last VORPAL loader, which gave us
migraines just looking at how it managed to accomplish that task. It used
self modifying real-time GCR conversion code to do half the job, and the
serial transfer routine to do the remainder of the work, of putting things
back in proper order. Twisted is the only term that does it justice. I
never completely analyzed it, it was just too much trouble to figure all the
little details out.

Wouldn't doubt it could move the head faster, but unless they improved the
serial transfer routine, I doubt they could have actually loaded data any
faster (Can't get any quicker than real-time GCR conversion, and I couldn't
get real-time conversion of a GCR that was denser than 75% data, 25% clock).
However, it would have been easy to use the 50% data, 50% clock of alternate
byte real-time conversion (via xor), at the cost of disk space.

By comparison, Commodore was 80% data, 20% clock. Which couldn't be decoded in real-time (though I don't know what VORPAL's final throughput was, or what density of data/clock they were using). So if VORPAL was using
Commodore's GCR scheme, and Joe & Scott (blum not nelsen who did vorpal) finished reverse engineering it, they might well have beaten V-MAX!, now that I think of it. If anyone could have done it, it would have been JP."

Search This Blog

The Scammer Lair

GCR decoding on the fly on the 1541

GCR decoding on the fly by Linus Åkesson

Acknowledgements

The problem

Commodore's approach

Krill's approach

The epiphany

Even quintuples

Odd quintuples

The tricks

The code

Final remarks

Comments

Post a Comment

Popular posts from this blog

Protection Analysis

RapidLok Copy Protection for Commodore 64 disk

V-Max Copy Protection for Commodore 64 disk