Sunday, September 21, 2014

BBB PRU and ARM throughput measurements


Overview
The goal of this exercise was to understand PRU and memory performance characteristics in support of a Beagle Bone Black (BBB) high speed analog to digital converter (ADC).  In my mind there are questions regarding how much throughput can be sustained in both instructions and memory bandwidth.  I decided to construct a couple of quick tests to actually measure the sustainable bandwidth rather than rely on paper analysis.

The concept is that one PRU will read parallel data values from an ADC based on a sample clock input. To accommodate additional performance, a second PRU will be used to hand these values off to DRAM. The BBB ARM processor then consumes the samples from DRAM. The use of a second PRU allows for minor additional signal processing and provides some wiggle room to address issues should they arise. This approach was used effectively in a previous exercise with a serial interface ADC. That implementation streamed 12 bit data at 1MSPS. The processing was limited to clocking out the data (PRU), then Hilbert transforming the data to complex I/Q and shipping it off via Ethernet (ARM). There wasn't a lot of performance headroom (the difference between compiling the ARM code with or without -O3 was important).

PRU1 would be used as the source since it can most readily interface with a 12 bit ADC based on the BBB expansion pin assignments.  The first page of SRAM is used as a fifo.  PRU0 takes the samples and places them in DRAM using 32 byte stores for efficiency.  The ARM processor extracts the samples from the DRAM fifo for processing.  The processing flow is shown below.

Beagle Bone Black PRU-ADC Acquisition
PRU1
In order to emulate the processing and provide a known data source for testing, PRU1 is programmed to spin a configurable number of times and produce a simple data pattern. The spinning provides a variable rate control and stands in for waiting on an ADC sample clock. A known pattern is achieved by making each output the sum of the previous two output samples (masked to fit within 16 bits). This pattern is cheap to compute and provides better power of 2 aliasing detection than a counting pattern. The full source code is here. The following is the salient snippet of instructions with my version of the clocks per instruction:
PRU1 Benchmark Reference Code
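
For readers who want to experiment with the test pattern off target, here is a minimal Python sketch of a generator and checker for it. The seed values and function names are assumptions for illustration and are not taken from the PRU1 source.

```python
MASK16 = 0xFFFF  # keep each sample within 16 bits


def pattern(n, a=0, b=1):
    """Return n samples, each the sum of the previous two, masked to 16 bits.

    The seeds a and b are assumed values, not taken from the PRU1 source.
    """
    out = []
    for _ in range(n):
        out.append(a)
        a, b = b, (a + b) & MASK16
    return out


def count_errors(samples):
    """Count positions where a sample is not the masked sum of the two before it."""
    errors = 0
    for i in range(2, len(samples)):
        if samples[i] != (samples[i - 2] + samples[i - 1]) & MASK16:
            errors += 1
    return errors
```

Because each sample depends only on the two before it, a checker like this can start anywhere in the stream, which is what makes snapshot inspection possible.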

The following table maps a count of 5 ns clocks (or single cycle instructions) to sample period, MSPS and MB/s (assuming 2 bytes per sample).

PRU Clocks to MSPS (Calculated Reference)

The rough counting indicates 14 clocks, but it is unclear to me whether the stores are 2 or 3 clocks based on the documentation I could find. At 14 to 16 clocks per sample this puts the maximum source throughput in the range of 28.5 down to 25.0 MB/s, or in the neighborhood of 13 MSPS.
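
The conversion behind these figures is simple enough to sketch in Python. The function name is an assumption for illustration; the 5 ns clock and 2 bytes per sample come from the text above, and 16 clocks is the count implied by the 25.0 MB/s bound.

```python
PRU_CLOCK_NS = 5.0       # one PRU clock, per the 5 ns figure above
BYTES_PER_SAMPLE = 2     # as assumed for the table above


def rates(clocks_per_sample):
    """Return (period_ns, MSPS, MB/s) for a loop taking the given clocks per sample."""
    period_ns = clocks_per_sample * PRU_CLOCK_NS
    msps = 1e3 / period_ns
    return period_ns, msps, msps * BYTES_PER_SAMPLE


for clocks in (14, 16):
    period_ns, msps, mbps = rates(clocks)
    print(f"{clocks} clocks: {period_ns:.0f} ns, {msps:.2f} MSPS, {mbps:.1f} MB/s")
```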

PRU0
The PRU0 code for this test simply follows the PRU1 position in SRAM and stores the samples to DRAM.  It also updates a head pointer in SRAM to allow the ARM processor to know its current position in the data stream.  The idea with PRU0 is that it can buffer 32 bytes worth of samples and issue a single memory line write.  In alternate applications it could be used to decimate further and apply a CIC filter.  The full source code is available here with the salient instructions shown below.
PRU0 Benchmark Reference Code

Again, based on my counting this is 33 instructions. In each of these intervals, 32 bytes of data are transferred from SRAM to DRAM. Based on the previous table, 33 clocks per 2 byte sample corresponds to about 12MB/s, so moving 32 bytes in the same interval should in theory support 16 times that, or 12*16 = 192MB/s. I do not know the characteristics of the DRAM, DRAM controller, and bus bridges between the PRU and the final DRAM. In my experience, you can go hunt down all of these parameters from the datasheets and still be very wrong because you missed a footnote, an erratum, or a software configuration item. It is better to actually measure comparable results to understand what to expect. In any event, this is significantly higher than the source throughput, and as long as there is not significant throughput degradation, this step should not be a bottleneck.
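
As a cross-check on that estimate, the same figure can be computed directly from the burst size rather than by scaling the table. This is a sketch that assumes one 5 ns clock per instruction and ignores any stall in the path to DRAM; the function name is made up for illustration.

```python
PRU_CLOCK_NS = 5.0  # one PRU clock; assumes one clock per instruction


def burst_throughput_mbps(clocks_per_burst, bytes_per_burst):
    """Return MB/s for a loop that moves bytes_per_burst every clocks_per_burst clocks."""
    burst_ns = clocks_per_burst * PRU_CLOCK_NS
    return bytes_per_burst / burst_ns * 1e3  # bytes per ns scaled to MB/s


pru0 = burst_throughput_mbps(33, 32)
print(f"PRU0 theoretical: {pru0:.0f} MB/s")
```

The direct calculation lands within a couple of MB/s of the 192MB/s figure; the small difference comes from rounding 12.1MB/s down to 12 before multiplying by 16.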

ARM
The ARM code is broken into two portions. The first is a pructl application which loads the PRU kernel images and sets up their operating parameters in SRAM. The second is a data consumer, or follower, which measures the throughput rate and checks the data pattern. Samples are transferred from the DRAM area accessed by the PRU to another DRAM location used as the working area for ARM computations. Samples can be transferred and checked in various block sizes, using various copy techniques (e.g. 32 byte, 8 byte or 2 byte transfers). The source code for both utilities can be found here.
 
Results
The short story is that in this configuration the PRUs can sustain 26MB/s to DRAM. The ARM can sustain this rate but only with limited processing of the data stream. The following walks through the data and thinking.

The source PRU is limited to about 26MB/s (roughly 13MSPS at 2 bytes per sample) based on its clock timing. The PRU transferring samples to DRAM should be able to sustain more than this based on cycle counting. By using the follower application to periodically extract blocks of samples and inspect them, we can verify that the transferring PRU can at least keep pace with the source PRU. If this were not the case we would expect some snapshots of the inspected data sequence to have errors. Large blocks of samples (e.g. 32 kB) are inspected by the follower, and it is allowed to run for minutes monitoring for errors in the sampled data stream.

Once the throughput limits of PRU0 and PRU1 are known, the next measurement is the effective throughput of the ARM. Without any other activity on the ARM it can verify the PRU data stream continuously at 26MB/s (as opposed to the snapshots in the previous test case). If the block size is reduced below 4kB, the streaming bandwidth of the ARM drops. This makes sense, as every block of sample data requires checking the DRAM head pointer in the PRU SRAM. This should be configured as a processor non-coherent read, which stalls the processor until the read is complete. Basically, if you check the head pointer (non-coherent SRAM read) for every sample you extract from the DRAM sample buffer, you begin dropping through 4MB/s into 1MB/s territory. It is significantly better to extract lots of DRAM samples (e.g. 4kB) for every non-coherent SRAM (head pointer) access by the ARM. No surprises here, just measurement verification of what basic computer architecture should tell you.
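
The amortization idea can be modeled with a small Python sketch: a follower that reads the head pointer once per block instead of once per sample. This is an illustrative model only; the class, its fields, and the `read_head` callable are hypothetical stand-ins and do not reflect the actual follower source.

```python
class Follower:
    """Consume a ring buffer in fixed-size blocks, reading the head once per block."""

    def __init__(self, ring, block_bytes, read_head):
        self.ring = ring              # bytearray standing in for the DRAM fifo
        self.block = block_bytes      # bytes extracted per head-pointer read
        self.read_head = read_head    # callable standing in for the slow SRAM read
        self.tail = 0                 # consumer position in the ring
        self.head_reads = 0           # how many slow reads were issued

    def available(self):
        """Read the head pointer and return the number of unread bytes."""
        self.head_reads += 1
        return (self.read_head() - self.tail) % len(self.ring)

    def next_block(self):
        """Return the next block, or None if a full block is not yet available."""
        if self.available() < self.block:
            return None
        end = self.tail + self.block
        size = len(self.ring)
        if end <= size:
            data = bytes(self.ring[self.tail:end])
        else:  # the block wraps past the end of the ring
            data = bytes(self.ring[self.tail:]) + bytes(self.ring[:end - size])
        self.tail = end % size
        return data
```

With 4kB blocks and 2 byte samples, a loop like this issues one slow head-pointer read per 2048 samples rather than one per sample.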

When the follower is configured to check the entire sample stream continuously and the throughput measuring tool is used, there are errors at any source rate above 17.7MB/s. This appears independent of the copy technique, that is, of how the blocks of samples are moved from DRAM to the ARM checking buffers (2 byte, 8 byte, or 32 byte transfers). To better quantify the processing and additional performance available at the ARM, a couple of approaches were taken.

First, the throughput in verifying a fixed memory pattern was measured. This involves placing a large pattern in memory (e.g. 32MB, larger than the caches) and then running the pattern check over it repeatedly. This produces results on the order of 135MB/s running alone or 66MB/s in conjunction with the PRU throughput measurement tool. This makes sense in that there are two CPU bound processes switching back and forth. It does provide a ballpark estimate of walking memory with an add and a comparison, which is the simplest computation we can conduct.

Next, the stream test with the PRU was used with a variable delay after the extraction and verification of each block of samples. Using a sample block extraction size of 4kB or 2k samples we get a period of (2e3 samples / 13e6 samples per second) = 154us. A delay of 42us per block could be sustained without errors. This indicates the transfer and checking time of a block is 154 - 42us or 112us. Looking at it differently, transferring and checking 4kB per 112us produces a throughput of (4e3 bytes / 112e-6 seconds) = about 36MB/s. This is lower than the memory-only verification rate (as expected due to the intervening SRAM head references).

Similar delay testing was conducted with a lower throughput rate of 17.7MB/s or (2e3 samples / 8.85e6 samples per second) = 226us period. The lossless delay measured here was 118us. This indicates a transfer and checking time of 226 - 118us or 108us. This translates into a transfer and checking throughput of (4e3 bytes / 108e-6 seconds) = 37MB/s, which is consistent with the other measurements.
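
The delay arithmetic in the two cases above can be reproduced with a short Python sketch. The function name is an assumption for illustration; the inputs are the measured values quoted in the text, using the same round figures of 2e3 samples and 4e3 bytes per block.

```python
BYTES_PER_SAMPLE = 2


def block_budget(block_samples, sample_rate_sps, lossless_delay_us):
    """Return (period_us, busy_us, MB/s) for one block of samples.

    busy_us is the time left for transfer and checking once the largest
    delay that still ran without errors has been subtracted from the period.
    """
    period_us = block_samples / sample_rate_sps * 1e6
    busy_us = period_us - lossless_delay_us
    mbps = block_samples * BYTES_PER_SAMPLE / busy_us  # bytes per us equals MB/s
    return period_us, busy_us, mbps


for rate, delay in ((13e6, 42), (8.85e6, 118)):
    period_us, busy_us, mbps = block_budget(2000, rate, delay)
    print(f"{rate / 1e6:.2f} MSPS: period {period_us:.0f} us, "
          f"busy {busy_us:.0f} us, {mbps:.1f} MB/s")
```

Both cases land in the mid 30s of MB/s, which is the point: the transfer and checking cost per block is roughly constant regardless of the source rate.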

In short, the PRUs should be capable of dealing with a 10MSPS ADC.  The ARM is a slightly different story depending on the processing to be performed.  Streaming applications (e.g. SDR) will need to decimate the data either in PRU or ARM while snapshot or fetch-and-process applications (e.g. instrumentation) can leverage the full sample rate.

Saturday, September 6, 2014

Interface Board for Beagle Bone Black

The stacking of the boards has worked out well. It provides a nice compact footprint. By using a couple of 2x5 headers, only the first board in the stack needs the full connector set. The first board also generally needs additional pins and higher speed IO. This allows boards higher in the stack to recoup some board space and use fewer pins for less insertion/extraction force.

The only shortcomings of this approach have been: a) current/power supply limits, b) lack of a properly sequenced 5V supply, and c) difficulty in shielding and isolating the boards. To address these I created an interface board (I-board, to go with the A/B/C/R BREC boards) that drags out some gpios and provides system 5VDC rather than the regulated 3.3V supply.

Part of the goal of this project was to preserve all of the PRU pins for higher speed digital interfaces (to DACs and ADCs).  This would allow an I board to stack on top of an ADC or DAC board and then provide several other shielded boards with a power and gpio based SPI interface.  A couple of pins were reserved for discrete in/out status.  The model was the synthesizer and RF front end boards (B/C/R).  Below is a picture of the first version of that board.

Beagle Bone Black Interface Board
The ability to power down a board to quiesce any signals it may be generating is a very useful aspect in RF configurations so it was important to have a supply switching capability. Each port contains two ground pins and two system 5VDC lines.  The supply lines derive from the BBB VDD system 5V directly from the external supply.  They are routed through high side supply switches.  I used the MIC94073.  This part includes a digital enable (that works with 3.3V logic), capacitive discharge and soft start.  This provides enough current for most of the boards of interest while staying within the current carrying capabilities of IDC.  It also provides a supply sufficient that most 5V regulators can be used to produce a regulated version within specification on the daughter board.

There are PTC fuses on each port. To try and make things less exciting for day to day work, power and ground lines are at opposite diagonal corners of the header pin mapping. This way if you accidentally rotate a connector by 180 degrees you are not guaranteed to short supply to ground (your mileage may vary on the rest of your pins based on assignment). The other 6 pins of each port are connected to BBB gpio lines. These can be any configuration desired and switched at run time. The source code on Google Code contains the default mapping and direction (under Iboard). The software configures all gpio pins back to input when a port is powered down.

To test the software and evaluate hardware startup a small test harness was created with LEDs for visual inspection of lines and through hole components with loops you can get a scope probe on or a jumper clip to route digital outputs back to digital inputs.  The following is a picture of that harness.
10 Pin Port Test Harness
It was fun going back to assembling something using through hole components and things you don't need a magnifying glass and tweezers to work with.

The power on sequencing of ports and supply power is the only interesting aspect of this kind of activity. Now you might think - "gee, how hard can it be to screw up something like that". I have been using the Beagle Bone Black System Reference Manual (SRM) A5.6. There are a couple of passing references to SYS_RESETn and gating signals with it, specifically the boot lines. By version C.1 of the SRM there are 5 instances of large, bold, red text indicating "NO PINS ARE TO BE DRIVEN UNTIL AFTER THE SYS_RESET LINE GOES HIGH." So apparently, I am not alone in failing to pay proper attention to power on sequencing (not that I should have needed to have this in the SRM; this is one of those basic digital design things, and I can still see my old professor just smiling, shaking his head, and calmly walking away...). This revision of the board does not use the appropriate gating but is easy to correct. Consequently, I won't post the schematic until I update and retest.

The other interesting aspect of this is the initialization state of the pins following SYS_RESET. If you look at the Sitara AM335x processor "data sheet", TI SPRS717G, you see Table 4-1 called "Ball Characteristics". This table provides the state of all pins, specifically the gpio pins of interest, and the state they are in by default after reset. They will sit in this state until your cape overlay gets loaded (which can be several seconds). You can modify U-Boot to update them; however, there are still hundreds of milliseconds until this can occur. The point being that I used 3 gpios to drive the high side supply switches for external power to each of the ports. I did not choose carefully enough and consult the SPRS717G with respect to initial state. I wound up using lines which, while inputs, default to having pull ups. This is enough to enable the port supply lines immediately following reset. The original goal in allocation of gpios was to preserve most of the P8 connector lines for PRU input/output while using as many gpios from P9 as possible.

The bottom line is: given the IO and boards I am working with in the near future, this version is good enough and will not drive IO pins prior to it being safe to do so. However, the power on sequencing needs to be corrected for general use. This is also a good example of why additional care and consideration should be taken if you are going to use the BBB unregulated supply.