Goat2 8-bit pipeline computer

Build image: 
In-game name (NO VISITORS! Must be LEARNER rank or higher!): 
goatflyer
Builder name(s) and current ranks: 
goatflyer: Regular
Creation location: 
2502 183 -2475
Machine Features: 

text file formatted with linebreaks available here:

https://pastebin.com/mk2L6g4U

The Goat2 is a Harvard-architecture RISC machine, implemented as
a pipeline of 2-stages, running with a clock period of 26 ticks.
Due to the goal of keeping critical signals short, it might be better
described as a DPU (Distributed Processing Unit) as opposed to CPU!

Its internal registers are:
- a program counter register "P" (yellow near ROM) 2524/197/-2510
- an 8-bit accumulator "A" (red beside ALU) 2476/219/-2496
- an 8-bit pointer register "X" (purple behind RAM) 2519/219/-2450
- a "link" "L" to hold a return address (purple near Incrementor) 2507/195/-2493
- 2 single-bit flags: "Z" zero (blue), and "C" carry (red) (under ALU) 2491/197/-2479

Instruction words are 13 bits wide: 8 bits of data (white) and a 5 bit opcode (pink).
Depending on the instruction, the 8 bits of data is either a RAM address,
a literal constant, a branch target address in ROM, or otherwise ignored.

Instruction decoding is scattered everywhere along the (pink) inverted
opcode bus, with the decoding of interest close to where control lines are needed.

Although the instruction set supports full 8-bit address for both ROM and RAM,
I implemented only 56 instruction words of ROM and 16 bytes of RAM
for reasons of speed and plot size.

Therefore registers "P" and "L" implement only 6 bits each, and only the lower 4 bits
of register "X" are connected to the RAM address mux.

Clock (light blue) and reset (black) control are in front of the ROM and low.
2504/181/-2474

Explanation of Features
-----------------------

Indirection
-----------

The single pointer register "X" is used for all indirection.
To illustrate its use, here is a possible "memcpy" implementation:

memcpy:
ldx srcptr ; load X with current source pointer
ldax ; load source byte
ldx dstptr ; now load X with current destination pointer
stax ; store byte to destination
inc srcptr
inc dstptr ; bump both pointers
dec loopcount ; reduce count
jnz memcpy ; repeat for the length of the transfer
...

Delay Slots
-----------

Pipeline means the next instruction is fetched while the current one is executing.

When the current instruction is a branch or call or return,
the Goat2 will still execute the fetched instruction,
rather than discard it as some other CPUs do when a branch is taken.
(Goat2 is similar to SPARC in this regard)

Because no instruction is ever "wasted", the need for "branch prediction"
is completely eliminated for 2-stage pipeline computers.

A delay slot instruction is never skipped.
If you really can't think of anything useful, then put a "nop".

Call and Return
---------------

The Goat2 has no hardware stack or stack pointer. Instead, the "call" instruction
is actually a "jump-and-link" instruction: the return address is stored into
the link register "L", and the program counter "P" receives the target address.
(Similar to ARM Cortex behaviour).
This approach allows single-cycle dispatch to called functions.

Note: the return address is saved into "L" by the 2nd stage of the pipeline,
so it (correctly) contains the address AFTER the delay slot instruction.

The Goat2 offers three different "return" instructions:

retl
This loads "P" from "L" directly.
This is useful for simple functions which do not call other functions,
the return address simply stays in "L" and is returned to at the end.

ret <addr>
This loads "P" from the contents of memory address <addr>.
The called function should store "L" at some address using the
"stl <addr>" instruction somewhere near the beginning of the function.
This is an easy way to write non-recursive functions.
Note: ret <addr> can also be used to perform indirect jumps or computed goto.

retx
This loads "P" from the location pointed to by register "X"
Recursive functions must employ some kind of software stack;
one such implementation is as follows:

myfunction:
stl temp ; hold "L" temporarily
ldx stackpointer
lda temp ; get "L" back
stax ; store return address to whereever "X" points
inc stackpointer ; bump address (this example uses post-increment "push")

... ; do function work which may involve calls to myself

dec stackpointer ; un-bump (here we must pre-decrement for "pop")
ldx stackpointer
retx ; jump to the return address saved where x points to
... ;(delay slot) something useful like loading a return value

Register Forwarding
-------------------

As mentioned previously there are two stages in the pipeline which operate in parallel.

The first stage fetches instructions and arguments for the ALU.
The arguments are read midway through the stage after the instruction was fetched.

The second stage latches the ALU arguments and opcode from the first stage,
then performs ALU computation and prepares for the writeback of the ALU output.

The writeback actually occurs at the end of the stage (at the following clock edge).

Without register forwarding this produces surprising results;
consider the naive, pipeline-unaware code sequence

lda value1 ; load "A" from "value1"
aca value2 ; add-with-carry "A" and contents of "value2", result to "A"
sta result ; store sum here

The problems with this code (without forwarding) are:
- when the first stage is fetching the arguments for aca,
register "A" has still not been updated with the contents of value1 from lda.
- When reading "A" for sta, register "A" does not yet contain the result of the add.

What forwarding does is to connect the first stage's argument reader
EITHER to the present register output (if the register is not being updated),
OR directly to the alu output bus (if the register is being updated).

This connection is safe to make even if the ALU has not finished calculating output,
since the final value will only be latched at the (next) clock edge.

With forwarding, the naive example above works correctly as the comments indicate.

The Goat2 supports forwarding for registers "A", "X", and the two flags "Z" and "C" only.
But specifically, memory forwarding is NOT supported!
Examples such as this will still give surprising results...

sta temp ; write new value
lda temp ; !!this value of temp is the old value from before the sta!!

Instruction Set
---------------

There are 3 logical (or, and, xor) and 1 arithmetic (add-with-carry) instructions
which read and update the accumulator "A".
The second argument is either an immediate constant or a RAM address.
with these 4, software can create the rest: add (no carry), sub, sbc, complement, etc.

The 4 monadic operators (increment, decrement, rotate right, and ?reserved?)
operate on a single byte in RAM only.

All the above instructions affect the "Z" flag (Z=1 if ALU result is zero)
Adding, shifting, or rotating also affect "C" flag (C=1 if carry out is set)

All other instructions, including loading, storing,
and jump/call/return instructions never affect the flags.

(PLEASE VIEW TABLE IN FIXED-WIDTH FONT)
Instruction decoding table
--------------------------

top2 00 01 10 11
bot3

000 nop ori # inc m ora m
001 jl t ani # dec m ana m
010 jc t xri # ? m xra m
011 jnc t aci # ror m aca m

100 jz t mvia # sta m lda m
101 jnz t retl stl m ret m
110 jle t mvix # stx m ldx m
111 jgt t retx stax ldax

t = a target (ROM) address
# = a numeric constant
m = a memory (RAM) address

jle = jump if either carry or zero is set
jgt = jump if both carry and zero are not set

? was originally rol, or rotate left
but it seems such a waste because the sequence
lda m, aca m, sta m produces the same output in memory and carry,
whereas there is no substitute for ror (it can divide by 2).

Also, without memory forwarding, a sequence of rols or rors
on the same address does not do the expected useful thing.

For now, this instruction spot is reserved for a better idea not described here

Single-Step & Breakpoints
-------------------------

Breakpoints are added by placing repeaters in the topmost (red) level of ROM.
There is no limit to the number of breakpoints set.
When the selected ROM is read, the system clock will stop
and the note block at 2515/190/-2472 will play.

To continue normal execution at full speed again,
press the Single-Step button once.

It is not necessary to remove the breakpoint; this is
useful for stopping at the same point in each iteration
of a loop for example.

If instead you wish to single-step the next few instructions,
first stop the clock, then press Single-Step.
Only 1 instruction will execute for each press of Single-Step.

To resume full speed operation when not stopped at a breakpoint,
just turn the clock back on.

How To Use: 

Demonstration Program: Double Dabble
------------------------------------

The demo program is an implementation of the well-known DoubleDabble algorithm
which converts a binary byte into BCD.

It illustrates many of the Goat2 features:
calling functions
different types of return
conditional branching
delay slot instructions
indirection
register forwarding

To use the demo, there is a (magenta) column of levers 2470/216/-2447
for the input byte inside one corner of the RAM (cell 0xf)
and 3 decimal digit displays designed by Cadenjb
attached outside the orange ALU output bus. 2449/217/-2502

The levers presently have a value of 0xA5 = 165 decimal.
Enter a binary number on the levers and start the computer (see below).
Let it run for about 25 minutes to view the result.

Reset/Restart Procedure
-----------------------

- Turn on RESET
- Turn on CLK
- Wait at least 2 full periods
- Turn off RESET

The program execution begins at ROM location 0,
All registers and flags are considered garbage
and should be initialized by software before use.

Although the clock will automatically shut off
due to a breakpoint I left at the end address 3,
it is polite to turn off the CLK when you are
finished using the computer.

Demo Program Listing
------------------------------

Program Listing: Double Dabble
------------------------------

// RAM addresses used:
#define input 0x0E // memory mapped levers
#define units 0x0F // top nibble drives lamps
#define tens 0x0D // top nibble drives lamps
#define hundreds 0x0B // top nibble drives lamps

#define binary 0x0C // copy of levers, consumed during execution
#define binret 0x0A // return address for toplevel
#define loopcnt 0x09 // iteration downcounter

#define tmp2 0x02
#define tmp1 0x01

#define call jl // call is an alias for the jump-and-link instruction "jl"
#define jmp jl // so is jmp

// we use call to show we expect the called function to return
// whereas we use jmp to show an unconditional jump with no return
// but remember "L" is overwritten no matter which alias is used!

// timing information cycle count: 1~ = one instruction
// 535~ * 2.6 seconds/~ = 1391 seconds, or 23 min 11 seconds.
program:
00 1c0e lda input ; read levers

01 0105 call bin2dec
02 140c sta binary ; (delay slot) store input argument

done:
03 0103 jmp done ; loop forever
04 0000 nop ; (delay slot)

; this routine converts the value stored in "binary"
; into the top nibble of 3 bcd bytes named units, tens, hundreds.
bin2dec: // 11~ + 8*65~ = 531~
05 150a stl binret ; save return address

06 0c00 mvia #0
07 140f sta units
08 140d sta tens
09 140b sta hundreds ; clear output variables

0a 0c08 mvia #8 ; we loop 8 times...
0b 1409 sta loopcnt ; because "binary" contains 8 bits

0c 0b00 aci #0 ; clear carry (add 0 to "A" with carry yields no carry)
0d 1c0c lda binary

ddloop: // 11~ + 3*18~ = 65~ for one iteration of the loop
0e 1b0c aca binary ; shift left (add to itself), msbit goes to cy
0f 140c sta binary

10 011b call dd ; do double-dabble on this digit
11 0e0f mvix #units ;(delay slot) point to units variable

12 011b call dd
13 0e0d mvix #tens ;(delay slot) point to 10s

14 011b call dd
15 0e0b mvix #hundreds ;(delay slot) point to 100s
; maximum hundreds value of 2 never overflows so carry is always clear

16 1109 dec loopcnt
17 050e jnz ddloop
18 1c0c lda binary ; (delay slot) get ready to double in next iteration

19 1d0a ret binret ; return to caller
1a 0000 nop ;(delay slot) nothing useful to put here

; this is double dabble on a single bcd digit.
; the 4 bits of digit are in the top nibble of the byte
; therefore, to add 3 we really need to add 0x30

dd: // 18~
1b 031e jnc savecy
1c 0c00 mvia #0 ; (delay slot) we will save this value if input carry was 0
1d 0b0f aci #0x0F ; otherwise we will save 0x10 (cy + 0x0F)
; note: this also clears carry

savecy:
1e 1401 sta tmp1 ; save shifted input cy here (0x00 or 0x10)

1f 1f00 ldax ; get bcd digit (high nibble)
20 0b30 aci #0x30 ; trial add 3
21 1402 sta tmp2 ; save result of trial addition here

22 0980 ani #0x80 ;if bcd digit was originally 5 or more, then bit 7 will now be set
23 0526 jnz double ; if result after AND is non-zero, this means bit 7 is set
24 1c02 lda tmp2 ;(delay slot) re-get trial result on our way there

25 1f00 ldax ; re-get original bcd digit since it was less than 5

double:
26 1402 sta tmp2 ; this is what we want to double
27 0b00 aci #0 ; but first we clear carry

28 1c01 lda tmp1 ; get (shifted) input carry
29 1b02 aca tmp2 ; *1
2a 1b02 aca tmp2 ; *2

2b 0d00 retl ; return to caller (output cy goes to the next higher digit)
2c 1700 stax ;(delay slot) store the updated bcd digit as we leave

Program size: 45 instructions ROM.
From 56 leaves 11 instruction words unused (empty).

Data size: 9 bytes used.
From 16 leaves 7 bytes unused.

Request status: 
Not approved

Comments

Since 1.16, the MC client is swamped with light update messages when the computer runs.
To see this pipeline computer running, you could export a .schematic of it and drop it into a 1.15.2 world
(which was when this promo request was submitted). I don't know if it will run in a single-player 1.16 world.

By goatflyer