What is inside the FPGA ?

I have made the thumbnails extremely small so that you may feel the need to click on them to view the diagrams in PDF format.

A few words of vocabulary :
- BRAM (Block RAM) : on-chip, dual port RAM. The chip has 48kB of this.
- core : some HDL code put together in order to perform a usable function, that you can instantiate (import) into your design.

FPGA contents

The first thing you will notice is probably the LED. No FPGA board is complete without an LED which proudly blinks to show it hasn't crashed yet.

Well, most aspects of this design were determined by the fact that, in order to stream data via ethernet, or any other wire, you must be able to flush the receive buffers fast enough. If you can't pull the packets fast enough from the tiny cramped 8kB chip buffer, packets will drop and then it will suck. This means a 50 MHz system must be able to process 12 MB/s data without overloading, not counting transmission in the other direction for ADC data.

Counting 12 MB/s incoming and 2 MB/s outgoing, this means the timing budget is about 7 clock cycles for each 16 bit word from in to out, all overhead included.
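The 7 cycle figure can be checked with a few lines of arithmetic. This is only a sketch using the numbers quoted above (50 MHz clock, 12 MB/s in, 2 MB/s out, 16 bit words); the constant and function names are mine.

```python
# Back-of-the-envelope timing budget, using only the figures quoted in the text.
CLOCK_HZ = 50_000_000               # 50 MHz system clock
INCOMING_BYTES_PER_S = 12_000_000   # playback data arriving from ethernet
OUTGOING_BYTES_PER_S = 2_000_000    # ADC data going back to the PC
WORD_BYTES = 2                      # the external bus moves 16 bit words


def cycles_per_word(clock_hz, bytes_per_s, word_bytes):
    """Clock cycles available per word at a given sustained byte rate."""
    words_per_s = bytes_per_s / word_bytes
    return clock_hz / words_per_s


budget = cycles_per_word(
    CLOCK_HZ, INCOMING_BYTES_PER_S + OUTGOING_BYTES_PER_S, WORD_BYTES
)
print(f"{budget:.2f} cycles per 16 bit word")  # prints "7.14 cycles per 16 bit word"
```

Everything that touches a word (MAC read, checksum, copy to SDRAM, copy to the audio buffers) has to fit in that total.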

On ucLinux, this board does about 200 kbytes/s sustained throughput.

So first thing was to drop ucLinux and hack the driver to work standalone without OS. This way I got about 1-2 MB/s. Still not enough, so I learnt Verilog.

BusMux

The designers from Atmark decided to connect all on-board devices to the same tiny slow 16 bit external bus.
(A new version of the board is available with two external busses, one of which is directly connected to an SDRAM chip, allowing fast accesses. Unfortunately it is fitted with a very slow ethernet chip which can't sustain the full wire speed.) If you ask me, they should have used 32 bit SDRAM, a PHY, and the Ethernet MAC from OpenCores.

Thus, I had to place a bus multiplexer to allow the various cores inside the FPGA to talk to the outside chips.

I did not use a bus access scheme based on address decoding because it was too slow and the SDRAM controller would read only crap data and write to nowhere. So I used a bus arbitration scheme where the cores request the bus, wait, get it, and do their thing, and release. This allows use of very few gates between the cores and the external bus (just a few muxes).
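The request / wait / grant / release cycle can be modelled in software roughly like this. It is a toy behavioural sketch, not the actual Verilog: the class name, the core names and the fixed-priority policy are all assumptions made for illustration.

```python
class BusArbiter:
    """Toy model of a request/grant bus arbiter.

    Fixed priority (earlier in the list wins) is an assumption; the real
    design may use a different policy.
    """

    def __init__(self, core_names):
        self.priority = list(core_names)
        self.requests = set()
        self.owner = None

    def request(self, core):
        """A core raises its request line and then waits for the grant."""
        self.requests.add(core)

    def release(self, core):
        """The owning core drops its request and frees the bus."""
        if self.owner == core:
            self.owner = None
        self.requests.discard(core)

    def tick(self):
        """One clock cycle: if the bus is free, grant it to a requester."""
        if self.owner is None:
            for core in self.priority:
                if core in self.requests:
                    self.owner = core
                    break
        return self.owner


arb = BusArbiter(["sdram", "mac_dma", "cpu"])   # hypothetical core names
arb.request("cpu")
arb.request("mac_dma")
first = arb.tick()      # mac_dma wins, it has the higher assumed priority
still = arb.tick()      # the owner keeps the bus until it releases it
arb.release("mac_dma")
second = arb.tick()     # now the waiting cpu request is granted
```

Since only the current owner drives the bus, the hardware equivalent is just a mux selected by the owner, which is where the "very few gates" come from.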

The SDRAM controller expects to be directly connected to the SDRAM chip ; it wants its registers hard placed in the output buffers. Muxing it is not at all advised, but it seems to work if you hand place the gates.

So I had to hand-place some gates inside the FPGA to meet timings. I had never touched an FPGA before last month so I went by instinct. It works.

To speed things up, the CPU can request the bus on behalf of one of the cores and keep it locked, gaining 1 cycle for all subsequent accesses. This is used when talking exclusively to the MAC chip.

CPU MAC Controller

I wrote an interface allowing the CPU to talk to the MAC chip. I plugged everything into the processor's LMB bus. This is absolutely against common practice but :
- It's simpler
- It's faster

So in order to read a MAC register the CPU now needs 3 cycles instead of about 10 before. I might hack that down to 2 cycles if needed, but I don't think it is.

MAC DMA engine

When the CPU writes (address, length, direction) in this core's registers, it requests and locks the external bus, and transfers data from/to the MAC to/from a BRAM block to which it has direct access.

This eats 3 cycles per 16 bit word out of the timing budget due to the need to wait for the MAC chip to assert its asynchronous ready flag on each request. Maybe I could use a pipeline with retry to get down to 2 cycles, but I don't think it will be necessary. The fact that the MAC FIFO register has a constant address helps, as there is no need to wait for the address setup time.

As this core does not use the bus to access memory, the CPU can start parsing the packet while it is copied into memory. It just asks the core how much data has been copied so far, and once the ethernet and IP headers are available, it starts parsing. Dual port memories are cool (the second port of this memory block is connected to the OPB bus).
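To get a feel for how much this overlap buys, here is a rough estimate. It assumes the 3 cycles per 16 bit word quoted above, a 14 byte ethernet header, a 20 byte IPv4 header without options, and a 1514 byte frame as an example size; it ignores bus arbitration and any other overhead.

```python
# Rough estimate of when the CPU can start parsing versus when the copy ends.
CLOCK_HZ = 50_000_000
CYCLES_PER_WORD = 3            # DMA cost per 16 bit word, from the text
WORD_BYTES = 2
ETH_HEADER_BYTES = 14
IPV4_MIN_HEADER_BYTES = 20     # IPv4 header without options
EXAMPLE_FRAME_BYTES = 1514     # assumed example: 14 byte header + 1500 byte payload


def cycles_until_copied(nbytes):
    """Cycles the DMA engine needs before the first nbytes are in BRAM."""
    words = -(-nbytes // WORD_BYTES)   # round up to whole 16 bit words
    return words * CYCLES_PER_WORD


header_cycles = cycles_until_copied(ETH_HEADER_BYTES + IPV4_MIN_HEADER_BYTES)
frame_cycles = cycles_until_copied(EXAMPLE_FRAME_BYTES)

print(header_cycles, "cycles until the headers are readable")   # 51
print(frame_cycles, "cycles until the whole frame is copied")   # 2271
print(f"{frame_cycles / CLOCK_HZ * 1e6:.2f} us for the full copy")
```

Under these assumptions the CPU can begin parsing after about 51 cycles instead of idling for the 2271 cycles the full copy takes.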

This core also computes packet checksums in hardware.
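The text does not say which checksum is computed. Assuming it is the usual 16 bit ones' complement Internet checksum used by IP and UDP, a software reference model (handy for checking the hardware result) could look like this. The function name is mine.

```python
def internet_checksum(data: bytes) -> int:
    """16 bit ones' complement checksum over big-endian 16 bit words.

    Assumption: this is the checksum the hardware computes. The running sum
    with end-around carry maps naturally onto a 16 bit adder that sees one
    word per transfer.
    """
    if len(data) % 2:
        data += b"\x00"                 # pad odd-length data with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # fold the carry back in
    return ~total & 0xFFFF


# A sample 20 byte IPv4 header with its checksum field set to zero.
header = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
print(hex(internet_checksum(header)))   # 0xb861
```

Running the same function over a header that already contains the correct checksum gives 0, which is the usual validity test on the receive side.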

I did not use the standard Xilinx peripheral controller and DMA controller core because the DMA controller is not compatible with 16 bit FIFO accesses, and the external bus is only 16 bits. Besides, it's slow.

SDRAM Buffering

If the packet contains audio data, the CPU will tell the OPB DMA controller (Xilinx core) to copy it to SDRAM where it is buffered. This uses the OPB bus. Reads from BRAM are done at 32 bits/cycle, SDRAM accesses at 16 bits/cycle. It's pretty fast.

The CPU also uses the same DMA controller to copy data from SDRAM to the BRAM audio buffers.

General data flow

This diagram shows where the data goes. Playback data is buffered in SDRAM whereas recorded data is sent directly. It still has to be copied between two different block RAMs but this is fast enough.

I2S encoders

I wrote a core which reads data from a BRAM through one of its ports, demultiplexes the various audio channels and transmits them as I2S to outboard DACs. It also does the reverse for data coming from ADCs. See the block diagram at the right.

For flexible demultiplexing of streams running at various sample widths and rates, I implemented a very simple processor which reads a list of instructions from distributed RAM and processes them.

These instructions are of the form : Read X bytes (2 or 3 depending on sample width) and store them to the DAC fifo #Y. It waits on full FIFOs. It does the reverse for ADC data.
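As an illustration of the playback direction, here is a small software model of such an instruction list. The instruction encoding, the little-endian byte order, the FIFO depth and the fact that the program loops back to its start are all assumptions for the sketch, not a description of the real core.

```python
from collections import deque


def run_demux(program, stream, fifos, fifo_depth=4):
    """Interpret a list of (nbytes, fifo_index) instructions over a byte stream.

    Each instruction reads nbytes (2 or 3) and stores them as one sample in
    the given FIFO. The program is assumed to loop. Returns the number of
    bytes consumed; stops when the data runs out or the target FIFO is full
    (the hardware would simply wait at that point).
    """
    pos = 0
    pc = 0
    while program:
        nbytes, index = program[pc]
        if pos + nbytes > len(stream):
            break                           # not enough data left
        if len(fifos[index]) >= fifo_depth:
            break                           # full FIFO: hardware stalls here
        sample = int.from_bytes(stream[pos:pos + nbytes], "little")  # assumed order
        fifos[index].append(sample)
        pos += nbytes
        pc = (pc + 1) % len(program)
    return pos


# One 16 bit stream interleaved with one 24 bit stream.
program = [(2, 0), (3, 1)]
stream = bytes([0x01, 0x02, 0x0A, 0x0B, 0x0C, 0x03, 0x04, 0x0D, 0x0E, 0x0F])
fifos = [deque(), deque()]
consumed = run_demux(program, stream, fifos)
```

After this run, FIFO 0 holds the two 16 bit samples and FIFO 1 the two 24 bit samples, and changing the sample layout only means rewriting the `program` list.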

The PC can modify this "program" remotely.

Each "channel" contains a FIFO, and a multichannel I2S encoder. So, a "channel" can drive many DAC chips and real audio channels, all running at the same sample rate and sample width. So you need multiple "channels" only if you have DACs with differing sample widths or frequencies.

A FIFO is a useful way to cross clock domains (from CPU clock to audio master clock).

Most parameters are software-controllable (see the I2S encoders page with scope shots).

The logic includes underrun and overrun detection. Should either condition occur, the output is muted and the hardware resets itself. The PC will have to restart the stream from a known state. This avoids blown tweeters with digital crossovers, where a buffer underrun could cause replay of old data stored in the buffer at an unknown offset, which means channels could end up swapped (woofer/tweeter) or something.
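The playback side of this behaviour can be sketched as a small state machine. This is a behavioural model under my own assumptions (class and method names, muting represented as outputting zero samples), not the actual logic.

```python
from collections import deque


class PlaybackChannel:
    """Toy model: mute on underrun and stay muted until an explicit restart."""

    def __init__(self):
        self.fifo = deque()
        self.muted = False

    def push(self, sample):
        """Data side: refused while muted, so stale data never gets queued."""
        if self.muted:
            return False
        self.fifo.append(sample)
        return True

    def next_output(self):
        """Audio clock side: called once per sample period."""
        if self.muted:
            return 0                 # muted output, modelled as silence
        if not self.fifo:
            self.fifo.clear()        # underrun: reset and mute
            self.muted = True
            return 0
        return self.fifo.popleft()

    def restart(self):
        """The PC restarts the stream from a known state."""
        self.fifo.clear()
        self.muted = False


ch = PlaybackChannel()
ch.push(5)
a = ch.next_output()    # 5
b = ch.next_output()    # FIFO empty: underrun, output muted
c = ch.push(7)          # refused until the stream is restarted
ch.restart()
ch.push(9)
d = ch.next_output()    # 9
```

The important property is that nothing plays between the underrun and the restart, so samples can never reach the wrong DAC at a shifted offset.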

The CPU can query status and configure the hardware via various registers, including :
- audio clock counter for PC synchronization
- choosing the clock rate for each channel
- etc.

Result

It works.

Note :
To generate these block diagrams I decided to try a new piece of software : yEd. Click on the link to read more.