Overview
The goal of this exercise was to understand PRU and memory performance characteristics in support of a BeagleBone Black (BBB) high-speed analog-to-digital converter (ADC). I had open questions about how much throughput can be sustained, both in instruction cycles and in memory bandwidth, so I constructed a couple of quick tests to measure the sustainable bandwidth rather than rely on paper analysis.
The concept is that one PRU will read parallel data values from an ADC based on a sample clock input. To leave room for additional performance, a second PRU will be used to hand these values off to DRAM. The BBB ARM processor then consumes the samples from DRAM. The use of a second PRU allows for minor additional signal processing and provides some wiggle room to address issues should they arise. This approach was used effectively in a previous exercise with a serial interface ADC. That implementation streamed 12-bit data at 1 MSPS. The processing was limited to clocking out the data (PRU), Hilbert transforming the data to complex I/Q, and shipping it off via Ethernet (ARM). There wasn’t a lot of performance headroom (the difference between compiling the ARM code with or without -O3 was important).
PRU1 is used as the source since it can most readily interface with a 12-bit ADC based on the BBB expansion pin assignments. The first page of SRAM is used as a FIFO. PRU0 takes the samples and places them in DRAM using 32-byte stores for efficiency. The ARM processor extracts the samples from the DRAM FIFO for processing. The processing flow is shown below.
Figure: BeagleBone Black PRU-ADC Acquisition
In order to emulate the processing and provide a known data source for testing, PRU1 is programmed to spin a configurable number of times and produce a simple data pattern. The spinning provides a variable rate control and stands in for waiting on an ADC sample clock. The known pattern makes each output the sum of the previous two output samples (masked to fit within 16 bits). This pattern is cheap to compute and provides better power-of-2 aliasing detection than a counting pattern. The full source code is here. The following is the salient snippet of instructions, annotated with my count of the clocks per instruction:
Figure: PRU1 Benchmark Reference Code
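To make the test pattern concrete, here is a minimal Python sketch of the generator and the matching check. It is not the PRU or ARM code from this project. The seed values (0 and 1) and the names `pattern` and `first_error` are my assumptions for illustration; only the recurrence (sum of the previous two samples, masked to 16 bits) comes from the description above.

```python
MASK16 = 0xFFFF  # keep each sample within 16 bits, as described in the text


def pattern(count, first=0, second=1):
    """Yield `count` samples; each is the sum of the previous two, masked to 16 bits.

    The seed values are an assumption, the real PRU code may start elsewhere.
    """
    a, b = first, second
    for _ in range(count):
        yield a
        a, b = b, (a + b) & MASK16


def first_error(samples):
    """Return the index of the first sample that breaks the recurrence, or None."""
    for i in range(2, len(samples)):
        if samples[i] != (samples[i - 2] + samples[i - 1]) & MASK16:
            return i
    return None


block = list(pattern(2048))           # 2048 samples, i.e. one 4 kB block
dropped = block[:100] + block[101:]   # simulate a single lost sample

print("clean block error index:  ", first_error(block))
print("dropped-sample error index:", first_error(dropped))
```

Because the check only needs the two previous samples, a follower can start verifying at any point in the stream without knowing the absolute sample index.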
The following table converts a count of 5 ns clocks (or single-cycle instructions) into sample period, MSPS, and MB/s (assuming 2 bytes per sample).
Table: PRU Clocks to MSPS (Calculated Reference)
The rough count is 14 clocks, but from the documentation I could find it is unclear to me whether the stores take 2 or 3 clocks. This puts the maximum source throughput in the range of 25.0 to 28.5 MB/s, or in the neighborhood of 13 MSPS.
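The conversion behind those numbers is simple enough to sketch. The snippet below assumes the 5 ns PRU clock and 2 bytes per sample stated above; the function name `rates` is mine, and the 16-clock case is simply what the 25.0 MB/s end of the range implies if the stores are slower than I counted.

```python
PRU_CLOCK_NS = 5        # 5 ns per PRU clock, from the text
BYTES_PER_SAMPLE = 2    # 12-bit samples carried in 16 bits


def rates(clocks_per_sample):
    """Return (sample period in ns, MSPS, MB/s) for a given clock count per sample."""
    period_ns = clocks_per_sample * PRU_CLOCK_NS
    msps = 1e3 / period_ns                 # 1000 ns per microsecond
    mb_per_s = msps * BYTES_PER_SAMPLE
    return period_ns, msps, mb_per_s


for clocks in (14, 16):
    period_ns, msps, mb_per_s = rates(clocks)
    print(f"{clocks} clocks: {period_ns} ns, {msps:.2f} MSPS, {mb_per_s:.1f} MB/s")
```

Running this for 14 and 16 clocks brackets the throughput range quoted above, to within rounding.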
The PRU0 code for this test simply follows the PRU1 position in SRAM and stores the samples to DRAM. It also updates a head pointer in SRAM so the ARM processor knows PRU0's current position in the data stream. The idea with PRU0 is that it can buffer 32 bytes worth of samples and issue a single memory line write. In alternate applications it could be used to decimate further and apply a CIC filter. The full source code is available here with the salient instructions shown below.
Figure: PRU0 Benchmark Reference Code
Again, based on my counting this is 33 instructions. In each of these intervals, 32 bytes of data are transferred from SRAM to DRAM. Based on the previous table, it should in theory support 16 times the throughput indicated at 33 clocks, or 12 * 16 = 192 MB/s. I do not know the characteristics of the DRAM, DRAM controller, and bus bridges between the PRU and the final DRAM. In my experience, you can go hunt down all of these parameters from the datasheets and still be very wrong because you missed a footnote, erratum, or software configuration item. It is better to actually measure comparable results to understand what to expect. In any event, this is significantly higher than the source throughput, and as long as there is no significant throughput degradation, this step should not be a bottleneck.
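As a quick sanity check on that paper estimate, the arithmetic can be written out as follows. This is only the cycle-count calculation under the stated assumptions (33 clocks per 32-byte line, 5 ns clock, and the 28.5 MB/s upper bound for the source); it says nothing about what the DRAM path actually delivers, which is why the measurement matters.

```python
PRU_CLOCK_NS = 5             # 5 ns per PRU clock, from the text
BURST_CLOCKS = 33            # my instruction count for one PRU0 loop
BURST_BYTES = 32             # one 32-byte line written per loop
SOURCE_MAX_MB_PER_S = 28.5   # upper bound of the PRU1 estimate above

burst_ns = BURST_CLOCKS * PRU_CLOCK_NS          # time to move one 32-byte line
pru0_mb_per_s = BURST_BYTES / burst_ns * 1e3    # bytes per ns scaled to MB/s
headroom = pru0_mb_per_s / SOURCE_MAX_MB_PER_S  # how far PRU0 is ahead of PRU1

print(f"{burst_ns} ns per line, {pru0_mb_per_s:.0f} MB/s, {headroom:.1f}x the source rate")
```

The exact figure comes out slightly above 192 MB/s because the 12 MB/s in the text is rounded down before multiplying by 16.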
ARM
The ARM code is broken into two portions. The first is a pructl application which loads the PRU kernel images and sets up their operating parameters in SRAM. The second is a data consumer, or follower, which measures the throughput rate and checks the data pattern. Samples are transferred from the DRAM area accessed by the PRU to another DRAM location used as the working area for ARM computations. Samples can be transferred and checked in various block sizes, using various copy techniques (e.g. 32-byte, 8-byte, or 2-byte transfers). The source code for both utilities can be found here.
Results
The short story is that in this configuration the PRUs can sustain 26 MB/s to DRAM. The ARM can sustain this rate, but only with limited processing of the data stream. The following walks through the data and thinking.
The source PRU is limited to 26 MB/s (about 13 MSPS) based on its clock timing. The PRU transferring samples to DRAM should be able to sustain more than this based on cycle counting. By using the follower application to periodically extract blocks of samples and inspect them, we can verify that the transferring PRU can at least keep pace with the source PRU. If this were not the case, we would expect some snapshots of the inspected data sequence to have errors. Large blocks of samples (e.g. 32 kB) are inspected by the follower, and it is allowed to run for minutes monitoring for errors in the sampled data stream.
Once the throughput limits of PRU0 and PRU1 are known, the next measurement is the effective throughput of the ARM. Without any other activity on the ARM, it can verify the PRU data stream continuously at 26 MB/s (as opposed to the snapshots in the previous test case). If the size of each block of sample data is reduced from 4 kB, the streaming bandwidth of the ARM drops. This makes sense, as every block of sample data requires checking the DRAM head pointer in the PRU SRAM. This should be configured as a processor non-coherent read, which stalls the processor until the read is complete. Basically, if you check the head pointer (non-coherent SRAM read) for every sample you extract from the DRAM sample buffer, you begin dropping through 4 MB/s into 1 MB/s territory. It is significantly better to extract lots of DRAM samples (e.g. 4 kB) for every non-coherent SRAM (head pointer) access by the ARM. No surprises here, just measurement verification of what basic computer architecture should tell you.
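The effect of block size on head pointer traffic can be illustrated with a toy model. This is a hypothetical Python sketch, not the follower code: `SharedHead`, `consume`, `run`, and the 64 kB ring size are all names and values I made up for illustration. It only counts how often the head pointer is read for a given block size; it does not model the actual stall time of each read.

```python
RING_BYTES = 64 * 1024   # hypothetical DRAM ring size


class SharedHead:
    """Stand-in for the head pointer PRU0 keeps in SRAM; counts every read."""

    def __init__(self):
        self.value = 0
        self.reads = 0

    def read(self):
        self.reads += 1
        return self.value


def consume(ring, head, tail, block_bytes):
    """Copy out one block if the producer is far enough ahead; return (block, new_tail)."""
    size = len(ring)
    available = (head.read() - tail) % size   # one head pointer read per attempt
    if available < block_bytes:
        return None, tail
    end = tail + block_bytes
    if end <= size:
        block = ring[tail:end]
    else:                                     # block wraps around the end of the ring
        block = ring[tail:] + ring[:end - size]
    return bytes(block), end % size


def run(block_bytes, total_bytes=256 * 1024, chunk=4096):
    """Stream total_bytes through the ring; return (bytes received, head pointer reads)."""
    ring = bytearray(RING_BYTES)
    head = SharedHead()
    tail = received = written = 0
    while written < total_bytes:
        # Producer (PRU0 stand-in) writes one chunk, then publishes the new head.
        for i in range(chunk):
            ring[(head.value + i) % RING_BYTES] = (written + i) & 0xFF
        head.value = (head.value + chunk) % RING_BYTES
        written += chunk
        # Consumer drains whatever is available, one block at a time.
        while True:
            block, tail = consume(ring, head, tail, block_bytes)
            if block is None:
                break
            received += len(block)
    return received, head.reads


for block_bytes in (2, 4096):
    received, reads = run(block_bytes)
    print(f"block={block_bytes} B: {received} bytes, {reads} head pointer reads")
```

In this model, going from one sample per check to 4 kB per check cuts the number of head pointer reads by roughly three orders of magnitude for the same amount of data, which is the direction of the effect measured above.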
When the follower is configured to check the entire sample stream continuously and the throughput measuring tool is in use, there are errors at any source rate above 17.7 MB/s. This appears independent of the copy technique, i.e. how the blocks of samples are moved from DRAM to the ARM checking buffers (2-byte, 8-byte, or 32-byte transfers). To better quantify the processing and additional performance available at the ARM, a couple of approaches were taken.
First, the throughput in verifying a fixed memory pattern was measured. This involves placing a large pattern in memory (e.g. 32 MB, larger than the caches) and then repeatedly running the pattern check over it. This produces results on the order of 135 MB/s running alone, or 66 MB/s in conjunction with the PRU throughput measurement tool. This makes sense in that there are two CPU-bound processes switching back and forth. It does provide a ballpark estimate of walking memory with an add and a comparison, which is the simplest computation we can conduct.
Next, the stream test with the PRU was used with a variable delay after the extraction and verification of each block of samples. Using a sample block extraction size of 4 kB, or roughly 2k samples, we get a period of (2e3 samples / 13e6 samples per second) = 154 µs. A delay of 42 µs per block could be sustained without errors. This indicates the transfer and checking time of a block is 154 - 42 = 112 µs. Looking at it differently, transferring and checking 4 kB per 112 µs produces a throughput of (4e3 bytes / 112e-6 seconds) = about 35.7 MB/s. This is lower than the memory-only verification rate (as expected due to the intervening SRAM head references).
Similar delay testing was conducted with a lower throughput rate of 17.7 MB/s, or (2e3 samples / 8.85e6 samples per second) = 226 µs period. The lossless delay measured here was 118 µs. This indicates a transfer and checking time of 226 - 118 = 108 µs. This translates into a transfer and checking throughput of (4e3 bytes / 108e-6 seconds) = 37 MB/s, which is consistent with the other measurements.
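The two delay measurements use the same arithmetic, so it can be captured in one small helper. This is a sketch of the calculation only; the function name `processing_estimate` is mine, and it uses the same round numbers as the text (2e3 samples and 4e3 bytes per block).

```python
def processing_estimate(block_samples, source_msps, max_delay_us, bytes_per_sample=2):
    """Estimate ARM transfer-and-check time and throughput from the longest lossless delay.

    Returns (block period in us, busy time in us, effective MB/s).
    """
    period_us = block_samples / source_msps      # samples / (samples per microsecond)
    busy_us = period_us - max_delay_us           # time left for transfer and checking
    mb_per_s = block_samples * bytes_per_sample / busy_us   # bytes per us equals MB/s
    return period_us, busy_us, mb_per_s


# (source rate in MSPS, longest lossless delay in us) for the two runs described above
for msps, delay_us in ((13.0, 42), (8.85, 118)):
    period_us, busy_us, mb_per_s = processing_estimate(2000, msps, delay_us)
    print(f"{msps} MSPS: period {period_us:.0f} us, busy {busy_us:.0f} us, {mb_per_s:.1f} MB/s")
```

Both runs land in the mid-30s of MB/s, with small differences from the figures in the text because the period is not rounded before subtracting the delay.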
In short, the PRUs should be capable of dealing with a 10 MSPS ADC. The ARM is a slightly different story depending on the processing to be performed. Streaming applications (e.g. SDR) will need to decimate the data either in the PRU or the ARM, while snapshot or fetch-and-process applications (e.g. instrumentation) can leverage the full sample rate.