
Building TI Linux SDK and IPC examples on Linux SDK 9.03

The process below was performed on a 64-bit x86 Linux host running Ubuntu 24.04 with ti-processor-sdk 09_03_00_00. Using the same (or a comparable) Ubuntu and SDK version is recommended.

TI provides a good rundown of the different communication techniques available between the ARM and the DSPs. This page focuses on some of the IPC and BigData IPC examples.
https://software-dl.ti.com/processor-sdk-linux/esd/AM57X/09_03_06_05/exports/docs/linux/Foundational_Components_IPC.html#multiple-ways-of-arm-dsp-communication

References

You will need the TI Processor SDK and the TI RTOS Processor SDK for most projects using the ARM and DSP. This guide is intended to supplement TI's documentation, not replace it. It is recommended to read through the referenced documentation.

Prerequisites

Download PROCESSOR-SDK-LINUX-AM57X and PROCESSOR-SDK-RTOS-AM57X.

PROCESSOR-SDK-LINUX-AM57X
  • Download ti-processor-sdk-linux-am57xx-evm-09_03_06_05-Linux-x86-Install.bin
PROCESSOR-SDK-RTOS-AM57X
  • Download processor_sdk_rtos_am57xx_09_03_00_00-linux-x64-installer.tar.gz

Software Dependencies

Install the following packages on your system:

sudo apt update
sudo apt install make binutils

Building the IPC libraries

  • Download Makefile_ipc_linux_examples into <PROCESSOR_SDK_INSTALL_DIR>/ti-processor-sdk-linux-am57xx-evm-09_03_06_05/makerules/
  • Navigate to the top level of the Processor SDK directory
    $ cd <PROCESSOR_SDK_INSTALL_DIR>/ti-processor-sdk-linux-am57xx-evm-09_03_06_05/
    
  • Export the following variable
    • Replace <RTOS_SDK_INSTALL_DIR> with the proper installation path.
      $ export TI_RTOS_PATH=<RTOS_SDK_INSTALL_DIR>/processor_sdk_rtos_am57xx_09_03_00_00
      
  • Build TI's IPC examples
    $ make ti-ipc-linux-examples -j$(nproc)
    

Note: TI has not yet officially released a method for building these examples; for now, this is an acceptable approach. See: https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1528563/processor-sdk-am57x-ipc-example-fails-to-create-and-execute-app/5923150

Reloading DSP/IPU firmware

To load or reload firmware on a live system, you need to unbind and then bind the omap-rproc platform driver using the processor's rproc device name. These names are captured below for easier reference.

Processor | Device Name  | Remote Proc Number | MultiProc Id
IPU1      | 58820000.ipu | remoteproc0        |
IPU2      | 55020000.ipu | remoteproc1        | 1
DSP1      | 40800000.dsp | remoteproc2        | 3
DSP2      | 41000000.dsp | remoteproc3        | 4

Example:

root@am57xx-evm:~# echo 40800000.dsp > /sys/bus/platform/drivers/omap-rproc/unbind
root@am57xx-evm:~# echo 40800000.dsp > /sys/bus/platform/drivers/omap-rproc/bind
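The unbind/bind pair above can be wrapped in a small helper function. This is a hypothetical convenience wrapper, not part of the SDK; the `SYSFS_ROOT` override exists purely so the function can be exercised off-target (it defaults to `/sys` on the board):

```shell
# Hypothetical helper: reload a remote processor's firmware by unbinding
# and rebinding its device from the omap-rproc platform driver.
# SYSFS_ROOT defaults to /sys; it is overridable only for testing.
rproc_reload() {
    local dev="$1"                                               # e.g. 40800000.dsp
    local drv="${SYSFS_ROOT:-/sys}/bus/platform/drivers/omap-rproc"
    echo "$dev" > "$drv/unbind"
    echo "$dev" > "$drv/bind"
}

# Usage on the target:
#   rproc_reload 40800000.dsp
```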

Running the ex02_messageq example

  • Copy the host and remote processor binaries to the SOM
    $ scp <RTOS_SDK_INSTALL_DIR>/processor_sdk_rtos_am57xx_09_03_00_00/ipc_3_52_00_00/examples/DRA7XX_linux_elf/ex02_messageq/host/bin/debug/app_host <RTOS_SDK_INSTALL_DIR>/processor_sdk_rtos_am57xx_09_03_00_00/ipc_3_52_00_00/examples/DRA7XX_linux_elf/ex02_messageq/dsp1/bin/debug/server_dsp1.xe66 <RTOS_SDK_INSTALL_DIR>/processor_sdk_rtos_am57xx_09_03_00_00/ipc_3_52_00_00/examples/DRA7XX_linux_elf/ex02_messageq/ipu1/bin/debug/server_ipu1.xem4 root@<IP_ADDRESS>:/home/root/ex02_messageq/
    
  • Load the DSP and IPU firmware images
    root@am57xx-evm:~# ln -sf /home/root/ex02_messageq/server_dsp1.xe66 /lib/firmware/dra7-dsp1-fw.xe66
    root@am57xx-evm:~# echo 40800000.dsp > /sys/bus/platform/drivers/omap-rproc/unbind
    root@am57xx-evm:~# echo 40800000.dsp > /sys/bus/platform/drivers/omap-rproc/bind
    
    root@am57xx-evm:~# ln -sf /home/root/ex02_messageq/server_dsp2.xe66 /lib/firmware/dra7-dsp2-fw.xe66
    root@am57xx-evm:~# echo 41000000.dsp > /sys/bus/platform/drivers/omap-rproc/unbind
    root@am57xx-evm:~# echo 41000000.dsp > /sys/bus/platform/drivers/omap-rproc/bind
    
    root@am57xx-evm:~# ln -sf /home/root/ex02_messageq/server_ipu1.xem4 /lib/firmware/dra7-ipu1-fw.xem4
    root@am57xx-evm:~# echo 55020000.ipu > /sys/bus/platform/drivers/omap-rproc/unbind
    root@am57xx-evm:~# echo 55020000.ipu > /sys/bus/platform/drivers/omap-rproc/bind
    
  • Link the ARM dynamic loader so the app_host binary can execute
    root@mitysom-am57x:~# ln -sf /lib/ld-linux-armhf.so.3 /lib/ld-linux.so.3
    
  • Run the host application to demonstrate inter-processor communication
    root@am57xx-evm:~# ./ex02_messageq/app_host DSP1
    root@am57xx-evm:~# ./ex02_messageq/app_host DSP2
    root@am57xx-evm:~# ./ex02_messageq/app_host IPU1
    
    • Example output
      --> main:
      [ 1219.776184] omap-iommu 40d01000.mmu: 40d01000.mmu: version 3.0
      [ 1219.782165] omap-iommu 40d02000.mmu: 40d02000.mmu: version 3.0
      --> Main_main:
      --> App_create:
      App_create: Host is ready
      <-- App_create:
      --> App_exec:
      App_exec: sending message 1
      App_exec: sending message 2
      App_exec: sending message 3
      App_exec: message received, sending message 4
      App_exec: message received, sending message 5
      App_exec: message received, sending message 6
      App_exec: message received, sending message 7
      App_exec: message received, sending message 8
      App_exec: message received, sending message 9
      App_exec: message received, sending message 10
      App_exec: message received, sending message 11
      App_exec: message received, sending message 12
      App_exec: message received, sending message 13
      App_exec: message received, sending message 14
      App_exec: message received, sending message 15
      App_exec: message received
      App_exec: message received
      App_exec: message received
      <-- App_exec: 0
      --> App_delete:
      <-- App_delete:
      <-- Main_main:
      <-- main:
      

Note: The AM57x device has two IPU subsystems (IPUSS), each of which has 2 cores. IPU2 is used as a controller in multi-media applications, so if you have Processor SDK Linux running, chances are that IPU2 already has firmware loaded. However, IPU1 is open for general purpose programming to offload the ARM tasks. (IPC for AM57xx)

Running the ipc tests

  • Build the ipc bios examples
    $ cd $TI_RTOS_PATH/
    $ make ipc_bios -j$(nproc)
    
  • Copy the built example files
    $ scp -r $TI_RTOS_PATH/ipc_3_52_00_00/packages/ti/ipc/tests/bin/* root@<IP_ADDRESS>:/home/root/ipc-tests/
    
  • Load new firmware onto the DSPs/IPUs
    root@mitysom-am57x:~# ln -sf /home/root/ipc-tests/ti_platforms_evmDRA7XX_dsp1/messageq_single.xe66 /lib/firmware/dra7-dsp1-fw.xe66
    root@mitysom-am57x:~# echo 40800000.dsp > /sys/bus/platform/drivers/omap-rproc/unbind
    root@mitysom-am57x:~# echo 40800000.dsp > /sys/bus/platform/drivers/omap-rproc/bind
    
  • On the devkit, run the example ARM app
    root@mitysom-am57x:~# MessageQApp 1 4
    [19864.018829] omap-iommu 55082000.mmu: 55082000.mmu: version 2.1
    [19864.055328] omap-iommu 40d01000.mmu: 40d01000.mmu: version 3.0
    [19864.061248] omap-iommu 40d02000.mmu: 40d02000.mmu: version 3.0
    Using numLoops: 1; procId : 4
    Entered MessageQApp_execute
    Local MessageQId: 0x80
    Remote queueId  [0x40080]
    Exchanging 1 messages with remote processor DSP1...
    MessageQ_get #1 Msg = 0xb6500808
    Exchanged 1 messages with remote processor DSP1
    Sample application successfully completed!
    Leaving MessageQApp_execute
    
    root@mitysom-am57x:~# MessageQBench 1000 8 4
    Using numLoops: 1000; payloadSize: 8, procId : 4
    Entered MessageQApp_execute
    Local MessageQId: 0x80
    Remote queueId  [0x40080]
    Exchanging 1000 messages with remote processor DSP1...
    DSP1: Avg round trip time: 139 usecs
    Leaving MessageQApp_execute
    

Note: DSP1 is procId: 4, DSP2 is procId: 3

Note: As of 03/08/2021, the sdk-linux docs list the wrong MessageQBench arguments. This has been reported to TI.
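Because MessageQApp and MessageQBench take a numeric procId, a tiny lookup helper can make scripts self-documenting. This is a hypothetical wrapper based only on the mapping in the note above:

```shell
# Hypothetical helper: map a DSP name to the procId used by MessageQApp
# and MessageQBench (DSP1 -> 4, DSP2 -> 3, per the note above).
proc_id() {
    case "$1" in
        DSP1) echo 4 ;;
        DSP2) echo 3 ;;
        *)    echo "unknown processor: $1" >&2; return 1 ;;
    esac
}

# Usage, equivalent to "MessageQApp 1 4":
#   MessageQApp 1 "$(proc_id DSP1)"
```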

Timing the IPC latency using MessageQBench

Test: Send 3 doubles from the DSP to the ARM

Note: MessageQBench times messages from ARM->DSP->ARM, with messages sent one at a time

Assuming the setup time is static, the example sends ~7,812 messages per second; double that to count both directions.

(2.43s-1.15s)/10000 = 128 us
1/(128us) = 7812.5
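The same arithmetic, written out as a quick awk check (numbers taken from the run above; this is just a calculation aid, not part of the SDK):

```shell
# 10000 round trips took (2.43 s - 1.15 s) of benchmark time, so each
# ARM->DSP->ARM exchange costs ~128 us.
awk 'BEGIN {
    rt_us = (2.43 - 1.15) / 10000 * 1e6   # per-round-trip time in microseconds
    printf "round trip: %.0f us, rate: %.1f msgs/s\n", rt_us, 1e6 / rt_us
}'
# prints: round trip: 128 us, rate: 7812.5 msgs/s
```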

Changing it to floats (12 bytes) resulted in ~10k messages a second.

Running two copies of the benchmark, one against each DSP, produced the same timings, so higher aggregate throughput is likely achievable when both DSPs are sending data, assuming the DSP calculations are not the bottleneck.


Building and running the big-data-ipc example (WORK IN PROGRESS)

This example was dropped from TI's supported IPC examples back in SDK 08_02_00_04. The dma-buf cache calls do not support invalidating less than the entire buffer, which could cause data corruption if the ARM cache write-back call wrote back more data than expected. It also means the cache calls are much slower than expected, which has a large impact on throughput.
TODO: Rewrite to use smaller DMA_Heap buffers, instead of initializing a single big buffer at start

  • Clone the big-data-ipc-examples git repository
    $ git clone https://support.criticallink.com/git/mitysom-am57x/big-data-ipc-examples.git
    $ cd big-data-ipc-examples
    $ git checkout -B mitysom-big-data-1.0 origin/mitysom-big-data-1.0
    

    Note: All other branches in this repository should be considered unstable and should not be used.
  • Examine the commits below. They contain the major changes needed to get this example to build.
    $ git log -p e0bcf1ac8f4863309696267b790409e5daf66db7
    $ git log -p fb1d4d475899f73863a2a4b89799db06650d9a2d
    $ git log -p 389cf11eca9bcb1f0dbe4dacc42f431a7d9ca85d
    
  • Export the following variables
    • Replace <PROCESSOR_SDK_INSTALL_DIR> and <RTOS_SDK_INSTALL_DIR> with the installation path to the respective SDKs.
      $ export TI_SDK_PATH=<PROCESSOR_SDK_INSTALL_DIR>/ti-processor-sdk-linux-am57xx-evm-09_03_06_05
      $ export PATH=$TI_SDK_PATH/linux-devkit/sysroots/x86_64-arago-linux/usr/bin/arm-oe-linux-gnueabi:$PATH
      $ export TI_RTOS_PATH=<RTOS_SDK_INSTALL_DIR>/processor_sdk_rtos_am57xx_09_03_00_00
      $ export IPC_INSTALL_PATH=$TI_RTOS_PATH/ipc_3_52_00_00
      # Env variables for ipc_bios compile
      $ export SDK_INSTALL_PATH=$TI_RTOS_PATH
      $ export TOOLS_INSTALL_PATH=$TI_RTOS_PATH
      $ export XDC_INSTALL_PATH=$TI_RTOS_PATH/xdctools_3_55_02_22_core
      $ export BIOS_INSTALL_PATH=$TI_RTOS_PATH/bios_6_76_03_01
      $ export LINUX_SYSROOT_DIR=$TI_SDK_PATH/linux-devkit/sysroots/armv7at2hf-neon-oe-linux-gnueabi
      
  • Build the project
    $ make host_linux
    
  • Create a project directory called big_data on the board
    root@mitysom-am57x:~# mkdir big_data
    
  • Copy the release binaries to the devkit
    $ scp host_linux/simple_buffer_example/dsp/bin/DRA7XX/release/server_dsp.xe66 root@<IP_ADDRESS>:/home/root/big_data/
    $ scp host_linux/simple_buffer_example/host/bin/DRA7XX/release/app_host root@<IP_ADDRESS>:/home/root/big_data/
    
  • Symlink the DSP server binary as the dsp1 firmware
    root@mitysom-am57x:~# ln -sf /home/root/big_data/server_dsp.xe66 /lib/firmware/dra7-dsp1-fw.xe66
    root@mitysom-am57x:~# echo 40800000.dsp > /sys/bus/platform/drivers/omap-rproc/unbind
    root@mitysom-am57x:~# echo 40800000.dsp > /sys/bus/platform/drivers/omap-rproc/bind
    
  • Link the ARM dynamic loader so the app_host binary can execute
    root@mitysom-am57x:~# ln -sf /lib/ld-linux-armhf.so.3 /lib/ld-linux.so.3
    
  • Run the ARM host program to send and receive big data messages from dsp1
    root@mitysom-am57x:~# ./big_data/app_host DSP1 10 16
    root@mitysom-am57x:~# cat /sys/kernel/debug/remoteproc/remoteproc2/trace0
    

    Note: Although we specify 10 data messages, a total of 16 messages are expected. The first message shares the memory pointer with the DSP to allow it to configure its SharedRegion accordingly. The following two messages are no-ops. After that, the 10 actual big data messages are transmitted followed by two more no-ops, and a shutdown message.
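The accounting in the note can be written out explicitly (a trivial sketch of where the expected total of 16 comes from):

```shell
# Expected message total for "app_host DSP1 10 16", per the note above.
setup=1           # shared-memory pointer message
priming_nops=2    # no-op messages sent up front
data=10           # actual big data messages
trailing_nops=2   # no-op messages after the data
shutdown=1        # final shutdown message
echo $(( setup + priming_nops + data + trailing_nops + shutdown ))   # prints 16
```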


Summary of how the big data example works:

ARM Code:
  • Create 16MB shared memory region using CMEM and SharedRegion
  • Create a Heap which is used to split this shared memory into trackable chunks to send over to the DSP
  • SEND: Send the shared memory pointer to DSP so it can setup its SharedRegion to match
  • SEND: Send 2 no-op messages
    The no-op messages prime the pump, so to speak. The ARM app only sends another message after it receives one, so sending 2 no-op messages ensures there are 3 messages in flight at a time: one for the DSP to process, one for the ARM, and one waiting. This keeps all processors active.
  • RECV: Get a response MSG from DSP
    • If the message is a BIGDATA message, then validate DSP count pattern and free the buffer
  • SEND: For every message received, we send a BIGDATA message allocated from the Heap filled with the ARM count pattern to DSP
  • For the last 3 messages, send 2 no-ops and then 1 shutdown message
DSP Code:
  • RECV: Get a message from ARM
    • If the message is a SETUP message, setup SharedRegion using info from ARM
    • If the message is a BIGDATA message, then validate the ARM count pattern and replace it with a DSP count pattern. Send the message back to ARM

The Heap is only accessed directly by the ARM code. The buffers acquired from the Heap are only accessed by one processor at a time so no locks are required.

Note: The example is designed around the expectation that the ARM sends data to the DSP to operate on and then gets it returned. If the DSP generated data on its own and then sent it to the ARM, it might be beneficial for the DSP to own the Heap management.

Note: The Big Data example was updated to allow the number of messages and the buffer size to be adjusted via command-line arguments.

Validation

Data Integrity
  • The big-data example was validated by transmitting over 5 million messages across a span of 4–5 days, with system reboots occurring between message batches. The results showed no errors related to message transmission, reserved memory usage, or message count validation.
Timing and Throughput Benchmarks
  • Table 1: Average Round Trip Time and Throughput

    Message Count | Size (KB) | CMEM Time (µs) | DMA_Heap Time (µs) | CMEM Throughput (KB/s) | DMA_Heap Throughput (KB/s)
    1000          | 16        | 22736          | 133797             | 703                    | 119
    100           | 256       | 54826          | 268019             | 4669                   | 955
    100           | 1024      | 217495         | 271783             | 4708                   | 3767
    10            | 2048      | 394998         | 429340             | 5184                   | 4770
    10            | 4096      | 789427         | 844750             | 5188                   | 4848

Note: As of SDK 8.02, CMEM is deprecated. TI has transitioned to using the officially supported kernel mechanisms for memory allocation, specifically DMA and CMA (Contiguous Memory Allocator).
Note: DMA_Heap cache calls run significantly slower than CMEM counterparts.
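As a sanity check on Table 1, each throughput figure follows from the message size and the average round-trip time; for example, the first CMEM row:

```shell
# Throughput (KB/s) = message size (KB) / average round-trip time (s).
# 16 KB per message at a 22736 us average round trip (first CMEM row).
awk 'BEGIN {
    printf "%d KB/s\n", 16 * 1e6 / 22736    # prints 703 KB/s
}'
```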

  • Table 2: Benchmark Timings for 10 16KB messages (all times in µs; "Accum" is the accumulated time, "Time" is the per-stage delta)

    Stage                                         | DMA_Heap Accum | DMA_Heap Time | No Cache Accum | No Cache Time | CMEM Accum | CMEM Time
    elapsedHeapAlloc: Heap mem and MessageQ alloc | 19519          | 19519         | 30             | 30            | 15         | 15
    elapsedFilled: Write count pattern            | 19519          | 0             | 30             | 0             | 37         | 22
    elapsedToGlobal: Cache invalidate call        | 22630          | 3111          | 30             | 0             | 40         | 3
    elapsedGet: Wait for message from DSP         | 81159          | 58529         | 3050           | 3020          | 2805       | 2765
    elapsedToLocal: Cache write back              | 84484          | 3325          | 3050           | 0             | 2809       | 4
    elapsedChecked: ARM check values              | 84514          | 30            | 3050           | 0             | 2820       | 11
    elapsed: Total message lifetime               | 100679         | 16165         | 3080           | 30            | 2841       | 21

Note: "DMA_Heap no cache" was tested without the cache invalidate and write-back calls. This is not recommended, as it breaks data integrity, which was reported by the ARM check-values step.

Analysis

CMEM demonstrates significantly better performance than DMA_Heap at smaller message sizes. This advantage is primarily due to CMEM cache operations being run on much smaller chunks of memory. At larger message sizes, the differences in round-trip time and throughput become less pronounced; however, CMEM still outperforms DMA_Heap. This sustained advantage is likely because cache maintenance operations operate on larger blocks of data less frequently.

Disabling the cache invalidate and write-back for DMA_Heap yields a substantial performance improvement, which helps isolate these cache operations as a key factor in the observed degradation. Benchmark results for CMEM and for DMA_Heap with cache disabled align with expected performance metrics. However, disabling cache operations is neither valid nor recommended, as it compromises data integrity.

When cache operations (invalidate and write-back) are enabled, the latency increases significantly. This is due to additional overhead introduced in various parts of the message transfer process. For instance, MessageQ and HeapMem operations—such as allocation and deallocation—take longer when cache management is active. As a result, the delay before the ARM is able to successfully call MessageQ_get and retrieve the response is further exacerbated.

Cache maintenance operations are inefficient when using DMA_Heap because the updated cache invalidate and write-back functions no longer take the address and size of the DMA buffer. As a result, these operations are performed over the entire 128MB of CMA-reserved memory, significantly increasing overhead. To improve this code, multiple smaller DMA_Heap buffers would need to be allocated so the cache calls operate on far less memory and run much faster.

The following articles provide a good explanation of DMA buffer heaps.

https://lwn.net/Articles/822052/
https://lwn.net/Articles/822521/
