Sunday, September 21, 2014

BBB PRU and ARM throughput measurements


Overview
The goal of this exercise was to understand PRU and memory performance characteristics in support of a Beagle Bone Black (BBB) high-speed analog-to-digital converter (ADC). The open questions were how much throughput can be sustained, both in instruction rate and in memory bandwidth. I decided to construct a couple of quick tests to measure the sustainable bandwidth rather than rely on paper analysis.

The concept is that one PRU will read parallel data values from an ADC based on a sample clock input. To accommodate additional performance, a second PRU will be used to hand these values off to DRAM. The BBB ARM processor then consumes the samples from DRAM. The use of a second PRU allows for minor additional signal processing and provides some wiggle room to address issues should they arise. This approach was used effectively in a previous exercise with a serial interface ADC. That implementation streamed 12-bit data at 1MSPS. The processing was limited to clocking out the data (PRU), then Hilbert transforming the data to complex I/Q and shipping it off via Ethernet (ARM). There wasn’t a lot of performance headroom (the difference between compiling the ARM code with or without -O3 was important).

PRU1 would be used as the source since it can most readily interface with a 12-bit ADC based on the BBB expansion pin assignments. The first page of SRAM is used as a FIFO. PRU0 takes the samples and places them in DRAM using 32-byte stores for efficiency. The ARM processor extracts the samples from the DRAM FIFO for processing. The processing flow is shown below.

Beagle Bone Black PRU-ADC Acquisition

PRU1
In order to emulate the processing and provide a known data source for testing, PRU1 is programmed to spin a configurable number of times and produce a simple data pattern. The spinning provides a variable rate control and stands in for waiting on an ADC sample clock. A known pattern is achieved by making each output the sum of the previous two output samples (masked to fit within 16 bits). This pattern is cheap to compute and provides better power-of-2 aliasing detection than a counting pattern. The full source code is here. The following is the salient snippet of instructions with my version of the clocks per instruction:
PRU1 Benchmark Reference Code
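
As a rough illustration of the test pattern (not the PRU assembly itself), the recurrence and a matching check can be sketched in Python. The seed values 0 and 1 and the function names are assumptions made for this sketch; the only thing taken from the description above is that each sample is the sum of the previous two, masked to 16 bits.

```python
MASK = 0xFFFF  # keep every sample within 16 bits, as described in the text


def pattern(count, a=0, b=1):
    """Yield `count` samples; each is the sum of the previous two, masked.

    The seeds a and b are assumptions for this sketch, not the PRU1 values.
    """
    for _ in range(count):
        yield a
        a, b = b, (a + b) & MASK


def first_error(samples):
    """Return the index of the first sample that breaks the recurrence, or None."""
    for i in range(2, len(samples)):
        if samples[i] != (samples[i - 1] + samples[i - 2]) & MASK:
            return i
    return None


block = list(pattern(4096))
print(first_error(block))  # an intact block has no errors
```

Because each sample depends on its two predecessors, a checker only needs the two previous values to validate the next one, so it can start anywhere in the stream without knowing the absolute position.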

The following table summarizes the number of 5 ns clocks (or single-cycle instructions) per sample period and the resulting MSPS and MB/s (assuming 2 bytes per sample).

PRU Clocks to MSPS (Calculated Reference)

The rough counting indicates 14 clocks, but it is unclear to me from the documentation I could find whether the stores take 2 or 3 clocks. This puts the maximum source throughput between 28.6MB/s (14 clocks, 70 ns per sample) and 25.0MB/s (16 clocks, 80 ns per sample), or in the neighborhood of 13MSPS.
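
The conversion behind these figures is simple enough to reproduce. The following is a minimal sketch, assuming only the 5 ns PRU clock and 2 bytes per sample stated above; the function and constant names are made up for illustration.

```python
PRU_CLOCK_NS = 5      # one PRU clock (200 MHz), per the text
BYTES_PER_SAMPLE = 2  # assumption stated in the text


def rates(clocks_per_sample):
    """Return (sample period in ns, MSPS, MB/s) for a given clock count."""
    period_ns = clocks_per_sample * PRU_CLOCK_NS
    msps = 1e3 / period_ns            # 1000 ns per microsecond
    mb_per_s = msps * BYTES_PER_SAMPLE
    return period_ns, msps, mb_per_s


for clocks in (14, 15, 16):
    period_ns, msps, mb = rates(clocks)
    print(f"{clocks:2d} clocks  {period_ns:3d} ns  {msps:5.2f} MSPS  {mb:5.2f} MB/s")
```

At 14 clocks this gives roughly 14.3MSPS and 28.6MB/s; at 16 clocks it gives 12.5MSPS and 25.0MB/s.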

PRU0
The PRU0 code for this test simply follows the PRU1 position in SRAM and stores the samples to DRAM.  It also updates a head pointer in SRAM to allow the ARM processor to know its current position in the data stream.  The idea with PRU0 is that it can buffer 32 bytes worth of samples and issue a single memory line write.  In alternate applications it could be used to decimate further and apply a CIC filter.  The full source code is available here with the salient instructions shown below.
PRU0 Benchmark Reference Code

Again, based on my counting, this loop is 33 instructions. In each of these intervals, 32 bytes of data are transferred from SRAM to DRAM. In the previous table, a 33-clock period at 2 bytes per sample corresponds to about 12MB/s, so moving 32 bytes per period should in theory support 16 times that, or 12*16 = 192MB/s. I do not know the characteristics of the DRAM, the DRAM controller, and the bus bridges between the PRU and the final DRAM. In my experience, you can hunt down all of these parameters from the datasheets and still be very wrong because you missed a footnote, an erratum, or a software configuration item. It is better to measure comparable results to understand what to expect. In any event, this is significantly higher than the source throughput, and as long as there is no significant throughput degradation, this step should not be a bottleneck.
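
For reference, the same estimate can be computed directly from the loop length. This is a small sketch under the assumptions above (a 33-clock loop, 5 ns clocks, 32 bytes per pass, and no stalls from the interconnect or DRAM); the function name is hypothetical.

```python
def burst_throughput_mb_s(bytes_per_loop, clocks_per_loop, clock_ns=5):
    """Theoretical MB/s if every loop pass moves `bytes_per_loop` bytes.

    Ignores any stalls from the bus bridges or the DRAM controller.
    """
    loop_ns = clocks_per_loop * clock_ns
    return bytes_per_loop / loop_ns * 1e3  # bytes per ns -> MB/s


pru0 = burst_throughput_mb_s(32, 33)
print(f"PRU0 paper estimate: {pru0:.1f} MB/s")
print(f"Headroom over a 28.6 MB/s source: {pru0 / 28.6:.1f}x")
```

Computed without rounding the per-sample figure to 12MB/s, the estimate comes out slightly higher, at about 194MB/s.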

ARM
The ARM code is broken into two portions. The first is a pructl application, which loads the PRU kernel images and sets up their operating parameters in SRAM. The second is a data consumer, or follower, which measures the throughput rate and checks the data pattern. Samples are transferred from the DRAM area accessed by the PRU to another DRAM location used as the working area for ARM computations. Samples can be transferred and checked in various block sizes, using various copy techniques (e.g. 32-byte, 8-byte or 2-byte transfers). The source code for both utilities can be found here.
 
Results
The short story is that in this configuration the PRUs can sustain 26MB/s to DRAM. The ARM can sustain this rate, but only with limited processing of the data stream. The following walks through the data and the thinking.

The source PRU is limited to 26MB/s (13MSPS) based on its clock timing. The PRU transferring samples to DRAM should be able to sustain more than this based on cycle counting. By using the follower application to periodically extract blocks of samples and inspect them, we can verify that the transferring PRU can at least keep pace with the source PRU. If this were not the case, we would expect some snapshots of the inspected data sequence to have errors. Large blocks of samples (i.e. 32 kB) are inspected by the follower, and it is allowed to run for minutes monitoring for errors in the sampled data stream.

Once the throughput limits of PRU0 and PRU1 are known, the next measurement is the effective throughput of the ARM. Without any other activity on the ARM, it can verify the PRU data stream continuously at 26MB/s (as opposed to the snapshots in the previous test case). If the sample block size is reduced from 4kB, the streaming bandwidth of the ARM drops. This makes sense, as every block of sample data requires checking the DRAM head pointer in the PRU SRAM. This read should be configured as a processor non-coherent read, which stalls the processor until the read is complete. Basically, if you check the head pointer (non-coherent SRAM read) for every sample you extract from the DRAM sample buffer, you begin dropping through 4MB/s into 1MB/s territory. It is significantly better to extract lots of DRAM samples (e.g. 4kB) for every non-coherent SRAM (head pointer) access by the ARM. No surprises here, just measurement verification of what basic computer architecture should tell you.
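
The idea of amortizing one head-pointer read over a whole block can be sketched with a simple ring-buffer model. This is not the follower's actual code: the ring layout, the names, and the convention that head equal to tail means "empty" are all assumptions for illustration, and overrun detection is left out.

```python
def available(head, tail, ring_size):
    """Bytes written by the producer but not yet consumed, with wraparound.

    Assumes head == tail means the ring is empty.
    """
    return (head - tail) % ring_size


def drain(ring, head, tail, block_size):
    """Copy out whole blocks between tail and head; return (blocks, new_tail).

    `head` is passed in once per call, standing in for a single stalled
    SRAM read that is then amortized over every block copied here.
    """
    ring_size = len(ring)
    blocks = []
    while available(head, tail, ring_size) >= block_size:
        end = tail + block_size
        if end <= ring_size:
            blocks.append(bytes(ring[tail:end]))
        else:
            # The block wraps past the end of the ring.
            blocks.append(bytes(ring[tail:]) + bytes(ring[:end - ring_size]))
        tail = end % ring_size
    return blocks, tail


ring = bytearray(range(16))  # toy 16-byte ring standing in for the DRAM buffer
blocks, tail = drain(ring, head=6, tail=14, block_size=4)
print([list(b) for b in blocks], tail)
```

With a larger `block_size`, fewer head-pointer reads are needed per byte moved, which is the effect described in the measurements above.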

When the follower is configured to check the entire sample stream continuously while the throughput measuring tool is also running, there are errors at any source rate above 17.7MB/s. This appears independent of the copy technique, that is, of how the blocks of samples are moved from DRAM to the ARM checking buffers (2-byte, 8-byte or 32-byte transfers). To better quantify the processing and additional performance available at the ARM, a couple of approaches were taken.

First, the throughput of verifying a fixed memory pattern was measured. This involves placing a large pattern in memory (e.g. 32MB, larger than the caches) and then running the pattern check over it repeatedly. This produces results on the order of 135MB/s running alone or 66MB/s in conjunction with the PRU throughput measurement tool. This makes sense in that there are two CPU-bound processes switching back and forth. It does provide a ballpark estimate of walking memory with an add and a comparison, which is the simplest computation we can conduct.

Next, the stream test with the PRU was used with a variable delay after the extraction and verification of each block of samples. Using a sample block extraction size of 4kB, or 2k samples, we get a period of (2e3 samples / 13e6 samples per second) = 154µs. A delay of 42µs per block could be sustained without errors. This indicates that the transfer and checking time of a block is 154 - 42 = 112µs. Looking at it differently, transferring and checking 4kB per 112µs produces a throughput of (4e3 bytes / 112e-6 seconds) = 35.7MB/s. This is lower than the memory-only verification rate (as expected, due to the intervening SRAM head pointer references).

Similar delay testing was conducted with a lower throughput rate of 17.7MB/s, or (2e3 samples / 8.85e6 samples per second) = 226µs period. The lossless delay measured here was 118µs. This indicates a transfer and checking time of 226 - 118 = 108µs. This translates into a transfer and checking throughput of (4e3 bytes / 108e-6 seconds) = 37MB/s, which is consistent with the other measurements.
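
The arithmetic in the two delay tests follows the same pattern, so it can be captured in one small helper. This is only a sketch of the calculation, using the measured delays quoted above as inputs; the function name and parameters are hypothetical.

```python
def block_budget(samples_per_block, sample_rate_sps, max_delay_us,
                 bytes_per_sample=2):
    """Return (block period, busy time, effective MB/s) for one test run.

    The busy time is the block period minus the longest delay that could
    be inserted per block without errors.
    """
    period_us = samples_per_block / sample_rate_sps * 1e6
    busy_us = period_us - max_delay_us
    mb_per_s = samples_per_block * bytes_per_sample / busy_us  # bytes/us == MB/s
    return period_us, busy_us, mb_per_s


for rate_sps, delay_us in ((13e6, 42), (8.85e6, 118)):
    period, busy, mb = block_budget(2000, rate_sps, delay_us)
    print(f"{rate_sps / 1e6:5.2f} MSPS: period {period:6.1f} us, "
          f"busy {busy:6.1f} us, {mb:5.1f} MB/s")
```

Without rounding the intermediate periods, the two runs work out to roughly 35.8MB/s and 37.0MB/s, in line with the rounded figures in the text.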

In short, the PRUs should be capable of dealing with a 10MSPS ADC. The ARM is a slightly different story, depending on the processing to be performed. Streaming applications (e.g. SDR) will need to decimate the data in either the PRU or the ARM, while snapshot or fetch-and-process applications (e.g. instrumentation) can leverage the full sample rate.
