Building TI Linux SDK and IPC examples on Linux SDK 9.03¶
The process below was done on a 64-bit x86 Linux host running Ubuntu 24.04 with ti-processor-sdk 09_03_00_00. Using a comparable Ubuntu release with this SDK version is recommended.
TI has a good rundown of the different communication techniques available between the ARM and the DSPs. This page focuses on some of the IPC and BigData IPC examples:
https://software-dl.ti.com/processor-sdk-linux/esd/AM57X/09_03_06_05/exports/docs/linux/Foundational_Components_IPC.html#multiple-ways-of-arm-dsp-communication
References¶
You will need the TI Processor SDK and the TI RTOS Processor SDK for most projects using the ARM and DSP. This guide is intended to supplement TI's documentation, not replace it. It is recommended to read through the referenced documentation.
- TI Linux IPC
- TI RTOS IPC
- TI IPC Training
- https://github.com/jcormier/big-data-ipc-example/commits/benchmark
Prerequisites¶
Download PROCESSOR-SDK-LINUX-AM57X and PROCESSOR-SDK-RTOS-AM57X.
- Download ti-processor-sdk-linux-am57xx-evm-09_03_06_05-Linux-x86-Install.bin
- Download processor_sdk_rtos_am57xx_09_03_00_00-linux-x64-installer.tar.gz
Software Dependencies¶
Please install the following packages on your system:
sudo apt update
sudo apt install make binutils
Building the IPC libraries¶
- Download Makefile_ipc_linux_examples into <PROCESSOR_SDK_INSTALL_DIR>/ti-processor-sdk-linux-am57xx-evm-09_03_06_05/makerules/
- Navigate to the top-level processor SDK directory
$ cd <PROCESSOR_SDK_INSTALL_DIR>/ti-processor-sdk-linux-am57xx-evm-09_03_06_05/
- Export the following variable
- Replace <RTOS_SDK_INSTALL_DIR> with the proper installation path.
$ export TI_RTOS_PATH=<RTOS_SDK_INSTALL_DIR>/processor_sdk_rtos_am57xx_09_03_00_00
- Build TI's IPC examples
$ make ti-ipc-linux-examples -j$(nproc)
Note: TI has not yet officially released a method for building these examples. For now, this is an acceptable way to build them. See: https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1528563/processor-sdk-am57x-ipc-example-fails-to-create-and-execute-app/5923150
Reloading DSP/IPU firmware key¶
To load or reload firmware on a live system, you need to unbind and then rebind the omap remoteproc driver using the processor's rproc device name. These names are captured below for easier reference.
| Processor | Device Name | Remote Proc Number | MultiProc id |
|-----------|--------------|--------------------|--------------|
| IPU1 | 58820000.ipu | remoteproc0 | |
| IPU2 | 55020000.ipu | remoteproc1 | 1 |
| DSP1 | 40800000.dsp | remoteproc2 | 3 |
| DSP2 | 41000000.dsp | remoteproc3 | 4 |
Example:
root@am57xx-evm:~# echo 40800000.dsp > /sys/bus/platform/drivers/omap-rproc/unbind
root@am57xx-evm:~# echo 40800000.dsp > /sys/bus/platform/drivers/omap-rproc/bind
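The symlink/unbind/bind sequence can be wrapped in a small helper for repeated use. This is an illustrative sketch, not part of the SDK; `rproc_reload` and `run` are hypothetical names, and `run` only echoes each step so the sequence can be reviewed before being executed for real on the target:

```shell
#!/bin/sh
# Sketch of a firmware-reload helper around the omap-rproc unbind/bind
# sequence. Device and firmware names come from the table above.
DRIVER=/sys/bus/platform/drivers/omap-rproc

# 'run' echoes each step; remove the echo to execute on the target.
run() { echo "+ $*"; }

rproc_reload() {
    dev="$1"      # rproc device name, e.g. 40800000.dsp
    fw_src="$2"   # new firmware binary
    fw_dst="$3"   # firmware name the kernel loads, e.g. /lib/firmware/dra7-dsp1-fw.xe66
    run "ln -sf $fw_src $fw_dst"
    run "echo $dev > $DRIVER/unbind"
    run "echo $dev > $DRIVER/bind"
}

rproc_reload 40800000.dsp /home/root/server_dsp1.xe66 /lib/firmware/dra7-dsp1-fw.xe66
```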
Running the ex02_messageq example¶
- Copy the host and remote processor binaries to the SOM
$ scp \
    <RTOS_SDK_INSTALL_DIR>/processor_sdk_rtos_am57xx_09_03_00_00/ipc_3_52_00_00/examples/DRA7XX_linux_elf/ex02_messageq/host/bin/debug/app_host \
    <RTOS_SDK_INSTALL_DIR>/processor_sdk_rtos_am57xx_09_03_00_00/ipc_3_52_00_00/examples/DRA7XX_linux_elf/ex02_messageq/dsp1/bin/debug/server_dsp1.xe66 \
    <RTOS_SDK_INSTALL_DIR>/processor_sdk_rtos_am57xx_09_03_00_00/ipc_3_52_00_00/examples/DRA7XX_linux_elf/ex02_messageq/ipu1/bin/debug/server_ipu1.xem4 \
    root@<IP_ADDRESS>:/home/root/ex02_messageq/
- Load the DSP and IPU firmwares
root@am57xx-evm:~# ln -sf /home/root/ex02_messageq/server_dsp1.xe66 /lib/firmware/dra7-dsp1-fw.xe66
root@am57xx-evm:~# echo 40800000.dsp > /sys/bus/platform/drivers/omap-rproc/unbind
root@am57xx-evm:~# echo 40800000.dsp > /sys/bus/platform/drivers/omap-rproc/bind
root@am57xx-evm:~# ln -sf /home/root/ex02_messageq/server_dsp2.xe66 /lib/firmware/dra7-dsp2-fw.xe66
root@am57xx-evm:~# echo 41000000.dsp > /sys/bus/platform/drivers/omap-rproc/unbind
root@am57xx-evm:~# echo 41000000.dsp > /sys/bus/platform/drivers/omap-rproc/bind
root@am57xx-evm:~# ln -sf /home/root/ex02_messageq/server_ipu1.xem4 /lib/firmware/dra7-ipu1-fw.xem4
root@am57xx-evm:~# echo 55020000.ipu > /sys/bus/platform/drivers/omap-rproc/unbind
root@am57xx-evm:~# echo 55020000.ipu > /sys/bus/platform/drivers/omap-rproc/bind
- Create the dynamic loader symlink so the ARM app_host binary can run
root@mitysom-am57x:~# ln -sf /lib/ld-linux-armhf.so.3 /lib/ld-linux.so.3
- Run the host application against the new firmware to demonstrate inter-processor communication
root@am57xx-evm:~# ./ex02_messageq/app_host DSP1
root@am57xx-evm:~# ./ex02_messageq/app_host DSP2
root@am57xx-evm:~# ./ex02_messageq/app_host IPU1
- Example output
--> main:
[ 1219.776184] omap-iommu 40d01000.mmu: 40d01000.mmu: version 3.0
[ 1219.782165] omap-iommu 40d02000.mmu: 40d02000.mmu: version 3.0
--> Main_main:
--> App_create:
App_create: Host is ready
<-- App_create:
--> App_exec:
App_exec: sending message 1
App_exec: sending message 2
App_exec: sending message 3
App_exec: message received, sending message 4
App_exec: message received, sending message 5
App_exec: message received, sending message 6
App_exec: message received, sending message 7
App_exec: message received, sending message 8
App_exec: message received, sending message 9
App_exec: message received, sending message 10
App_exec: message received, sending message 11
App_exec: message received, sending message 12
App_exec: message received, sending message 13
App_exec: message received, sending message 14
App_exec: message received, sending message 15
App_exec: message received
App_exec: message received
App_exec: message received
<-- App_exec: 0
--> App_delete:
<-- App_delete:
<-- Main_main:
<-- main:
Note: The AM57x device has two IPU subsystems (IPUSS), each of which has 2 cores. IPU2 is used as a controller in multi-media applications, so if you have Processor SDK Linux running, chances are that IPU2 already has firmware loaded. However, IPU1 is open for general purpose programming to offload the ARM tasks. (IPC for AM57xx)
Running the ipc tests¶
- Build the ipc bios examples
$ cd $TI_RTOS_PATH
$ make ipc_bios -j$(nproc)
- Copy the build example files
$ scp -r $TI_RTOS_PATH/ipc_3_52_00_00/packages/ti/ipc/tests/bin/* root@<IP_ADDRESS>:/home/root/ipc-tests/
- Load the new firmware on the DSPs/IPUs
root@mitysom-am57x:~# ln -sf /home/root/ipc-tests/ti_platforms_evmDRA7XX_dsp1/messageq_single.xe66 /lib/firmware/dra7-dsp1-fw.xe66
root@mitysom-am57x:~# echo 40800000.dsp > /sys/bus/platform/drivers/omap-rproc/unbind
root@mitysom-am57x:~# echo 40800000.dsp > /sys/bus/platform/drivers/omap-rproc/bind
- On the devkit, run the example ARM app
root@mitysom-am57x:~# MessageQApp 1 4
[19864.018829] omap-iommu 55082000.mmu: 55082000.mmu: version 2.1
[19864.055328] omap-iommu 40d01000.mmu: 40d01000.mmu: version 3.0
[19864.061248] omap-iommu 40d02000.mmu: 40d02000.mmu: version 3.0
Using numLoops: 1; procId : 4
Entered MessageQApp_execute
Local MessageQId: 0x80
Remote queueId  [0x40080]
Exchanging 1 messages with remote processor DSP1...
MessageQ_get #1 Msg = 0xb6500808
Exchanged 1 messages with remote processor DSP1
Sample application successfully completed!
Leaving MessageQApp_execute
root@mitysom-am57x:~# MessageQBench 1000 8 4
Using numLoops: 1000; payloadSize: 8, procId : 4
Entered MessageQApp_execute
Local MessageQId: 0x80
Remote queueId  [0x40080]
Exchanging 1000 messages with remote processor DSP1...
DSP1: Avg round trip time: 139 usecs
Leaving MessageQApp_execute
Note: DSP1 is procId: 4, DSP2 is procId: 3
Note: As of 03/08/2021, the sdk-linux docs list the wrong MessageQBench arguments. This has been reported to TI.
Timing the IPC latency using MessageQBench¶
Test: Send 3 doubles from the DSP to the ARM
Note: MessageQBench times messages from ARM->DSP->ARM, with messages sent one at a time
Assuming the setup time is static, the example is sending ~7,812 round-trip messages per second; double that to count messages in both directions.
(2.43 s - 1.15 s) / 10000 = 128 µs per round trip
1 / 128 µs = 7812.5 messages/s
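The same arithmetic in shell form, using integer microseconds (the timestamps come from the benchmark run described above):

```shell
# Back-of-envelope rate math for the MessageQBench run:
# 10000 round trips took 2.43 s - 1.15 s = 1.28 s of wall time.
elapsed_us=$(( (243 - 115) * 10000 ))   # (2.43 - 1.15) s expressed in microseconds
per_msg_us=$(( elapsed_us / 10000 ))    # microseconds per round trip
rate=$(( 1000000 / per_msg_us ))        # round trips per second
echo "${per_msg_us} us per round trip, ~${rate} msgs/s"   # 128 us, ~7812 msgs/s
```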
Changing it to floats (12 bytes) resulted in ~10k messages a second.
Running two copies of the benchmark, one against each DSP, also resulted in the same timings, so higher aggregate throughput is likely achievable with both DSPs sending data, assuming the DSP calculations aren't the bottleneck.
Building and running the big-data-ipc example (WORK IN PROGRESS)¶
This example was dropped from TI's supported IPC examples back in SDK 08_02_00_04. The dma-buf cache calls don't support invalidating less than the entire buffer, which could cause data corruption if the ARM cache write-back call wrote back more data than expected. It also means the cache calls are much slower than expected, which has a large impact on throughput.
TODO: Rewrite to use smaller DMA_Heap buffers, instead of initializing a single big buffer at start
- Clone the big-data-ipc-example git repository
$ git clone https://support.criticallink.com/git/mitysom-am57x/big-data-ipc-examples.git
$ cd big-data-ipc-examples
$ git checkout -B mitysom-big-data-1.0 origin/mitysom-big-data-1.0
Note: All other branches in this repository are considered unstable and should not be used.
- Examine the commits below. They contain the major changes needed to get this example to build.
$ git log -p e0bcf1ac8f4863309696267b790409e5daf66db7
$ git log -p fb1d4d475899f73863a2a4b89799db06650d9a2d
$ git log -p 389cf11eca9bcb1f0dbe4dacc42f431a7d9ca85d
- Export the following variables
- Replace <PROCESSOR_SDK_INSTALL_DIR> and <RTOS_SDK_INSTALL_DIR> with the installation path to the respective SDKs.
$ export TI_SDK_PATH=<PROCESSOR_SDK_INSTALL_DIR>/ti-processor-sdk-linux-am57xx-evm-09_03_06_05
$ export PATH=$TI_SDK_PATH/linux-devkit/sysroots/x86_64-arago-linux/usr/bin/arm-oe-linux-gnueabi:$PATH
$ export TI_RTOS_PATH=<RTOS_SDK_INSTALL_DIR>/processor_sdk_rtos_am57xx_09_03_00_00
$ export IPC_INSTALL_PATH=$TI_RTOS_PATH/ipc_3_52_00_00
# Env variables for ipc_bios compile
$ export SDK_INSTALL_PATH=$TI_RTOS_PATH
$ export TOOLS_INSTALL_PATH=$TI_RTOS_PATH
$ export XDC_INSTALL_PATH=$TI_RTOS_PATH/xdctools_3_55_02_22_core
$ export BIOS_INSTALL_PATH=$TI_RTOS_PATH/bios_6_76_03_01
$ export LINUX_SYSROOT_DIR=$TI_SDK_PATH/linux-devkit/sysroots/armv7at2hf-neon-oe-linux-gnueabi
- Build the project
$ make host_linux
- Create a project directory called big_data on the board
root@mitysom-am57x:~# mkdir big_data
- Copy the release binaries to the devkit
$ scp host_linux/simple_buffer_example/dsp/bin/DRA7XX/release/server_dsp.xe66 root@<IP_ADDRESS>:/home/root/big_data/
$ scp host_linux/simple_buffer_example/host/bin/DRA7XX/release/app_host root@<IP_ADDRESS>:/home/root/big_data/
- Symlink the DSP server binary as the DSP1 firmware
root@mitysom-am57x:~# ln -sf /home/root/big_data/server_dsp.xe66 /lib/firmware/dra7-dsp1-fw.xe66
root@mitysom-am57x:~# echo 40800000.dsp > /sys/bus/platform/drivers/omap-rproc/unbind
root@mitysom-am57x:~# echo 40800000.dsp > /sys/bus/platform/drivers/omap-rproc/bind
- Create the dynamic loader symlink so the ARM app_host binary can run
root@mitysom-am57x:~# ln -sf /lib/ld-linux-armhf.so.3 /lib/ld-linux.so.3
- Run the ARM host program to send and receive big data messages from DSP1
root@mitysom-am57x:~# ./big_data/app_host DSP1 10 16
root@mitysom-am57x:~# cat /sys/kernel/debug/remoteproc/remoteproc2/trace0
Note: Although we specify 10 data messages, a total of 16 messages are expected. The first message shares the memory pointer with the DSP to allow it to configure its SharedRegion accordingly. The following two messages are no-ops. After that, the 10 actual big data messages are transmitted followed by two more no-ops, and a shutdown message.
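The expected total in the note can be sanity-checked; the variable names below are just for illustration of the breakdown:

```shell
# Total messages for './app_host DSP1 10 16': 10 payload messages plus
# the framing messages described in the note above.
setup=1        # first message: shares the shared-memory pointer with the DSP
prime_noops=2  # initial no-op messages
data=10        # the actual big data messages
tail_noops=2   # trailing no-op messages
shutdown=1     # shutdown message
echo $(( setup + prime_noops + data + tail_noops + shutdown ))   # 16
```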
Summary of how the big data example works:¶
ARM Code:
- Create 16MB shared memory region using CMEM and SharedRegion
- Create a Heap which is used to split this shared memory into trackable chunks to send over to the DSP
- SEND: Send the shared memory pointer to DSP so it can setup its SharedRegion to match
- SEND: Send 2 no-op messages
The no-op messages prime the pump, so to speak. The ARM app is set up to only send another message when it receives one, so sending 2 no-op messages ensures there are 3 messages in flight at a time: one for the DSP to process, one for the ARM, and one waiting. This keeps all processors active.
- RECV: Get a response MSG from DSP
- If the message is a BIGDATA message, then validate DSP count pattern and free the buffer
- SEND: For every message received, we send a BIGDATA message allocated from the Heap filled with the ARM count pattern to DSP
- For the last 3 messages, send 2 no-ops and then 1 shutdown message
DSP Code:
- RECV: Get a message from ARM
- If the message is a SETUP message, setup SharedRegion using info from ARM
- If the message is a BIGDATA message, then validate the ARM count pattern and replace it with a DSP count pattern. Send the message back to ARM
The Heap is only accessed directly by the ARM code. The buffers acquired from the Heap are only accessed by one processor at a time so no locks are required.
Note: The example is designed around the expectation that the ARM sends data to the DSP to operate on, which is then returned. If the DSP instead generated data on its own and sent it to the ARM, it might be beneficial for the DSP to own the Heap management.
Note: The Big Data example was updated to allow the number of messages and the buffer size to be adjusted via command-line arguments.
Validation¶
Data Integrity
- The big-data example was validated by transmitting over 5 million messages across a span of 4–5 days, with system reboots occurring between message batches. The results showed no errors related to message transmission, reserved memory usage, or message count validation.
- Table 1: Average Round Trip Time and Throughput
| Message Count | Size (KB) | CMEM Time (µs) | DMA_Heap Time (µs) | CMEM Throughput (KB/s) | DMA_Heap Throughput (KB/s) |
|---------------|-----------|----------------|--------------------|------------------------|----------------------------|
| 1000 | 16 | 22736 | 133797 | 703 | 119 |
| 100 | 256 | 54826 | 268019 | 4669 | 955 |
| 100 | 1024 | 217495 | 271783 | 4708 | 3767 |
| 10 | 2048 | 394998 | 429340 | 5184 | 4770 |
| 10 | 4096 | 789427 | 844750 | 5188 | 4848 |
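The throughput columns in Table 1 follow directly from message size divided by average round-trip time. A quick check of the first row:

```shell
# Throughput derivation for Table 1, row 1: 16 KB per message.
size_kb=16
cmem_us=22736      # CMEM average round-trip time
dmaheap_us=133797  # DMA_Heap average round-trip time
echo $(( size_kb * 1000000 / cmem_us ))      # ~703 KB/s (CMEM column)
echo $(( size_kb * 1000000 / dmaheap_us ))   # ~119 KB/s (DMA_Heap column)
```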
Note: As of SDK 8.02, CMEM is deprecated. TI has transitioned to using the officially supported kernel mechanisms for memory allocation, specifically DMA and CMA (Contiguous Memory Allocator).
Note: DMA_Heap cache calls run significantly slower than CMEM counterparts.
- Table 2: Benchmark Timings for 10 16KB messages
| Stage | DMA_Heap Accumulated (µs) | DMA_Heap Time (µs) | DMA_Heap no cache Accumulated (µs) | DMA_Heap no cache Time (µs) | CMEM Accumulated (µs) | CMEM Time (µs) |
|-------|---------------------------|--------------------|------------------------------------|-----------------------------|-----------------------|----------------|
| elapsedHeapAlloc: Heap mem and MessageQ alloc | 19519 | 19519 | 30 | 30 | 15 | 15 |
| elapsedFilled: Write count pattern | 19519 | 0 | 30 | 0 | 37 | 22 |
| elapsedToGlobal: Cache invalidate call | 22630 | 3111 | 30 | 0 | 40 | 3 |
| elapsedGet: Wait for Message from DSP | 81159 | 58529 | 3050 | 3020 | 2805 | 2765 |
| elapsedToLocal: Cache write back | 84484 | 3325 | 3050 | 0 | 2809 | 4 |
| elapsedChecked: ARM Check Values | 84514 | 30 | 3050 | 0 | 2820 | 11 |
| elapsed: Total Message Life Time | 100679 | 16165 | 3080 | 30 | 2841 | 21 |
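A quick consistency check on Table 2: the per-stage "Time" deltas in the DMA_Heap column should sum to the final accumulated total message lifetime:

```shell
# Sum of the DMA_Heap per-stage deltas from Table 2, in row order:
# HeapAlloc, Filled, ToGlobal, Get, ToLocal, Checked, final elapsed delta.
deltas="19519 0 3111 58529 3325 30 16165"
total=0
for t in $deltas; do total=$(( total + t )); done
echo "$total"   # 100679, matching the accumulated Total Message Life Time
```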
Note: "DMA_Heap no cache" was tested without the cache invalidate and write-back calls. This is not recommended, as it breaks data integrity, which is reported by the ARM check-values step.
Analysis¶
CMEM demonstrates significantly better performance than DMA_Heap at smaller message sizes. This advantage is primarily due to CMEM cache operations being run on much smaller chunks of memory. At larger message sizes, the differences in round-trip time and throughput become less pronounced; however, CMEM still outperforms DMA_Heap. This sustained advantage is likely because cache maintenance operations operate on larger blocks of data less frequently.
Disabling the cache invalidate and write-back for DMA_Heap results in a substantial performance improvement, helping to isolate these cache operations as a key factor in the observed performance degradation. Benchmark results for CMEM and for DMA_Heap with caching disabled align with expected performance metrics. However, disabling cache operations is neither valid nor recommended, as it compromises data integrity.
When cache operations (invalidate and write-back) are enabled, the latency increases significantly. This is due to additional overhead introduced in various parts of the message transfer process. For instance, MessageQ and HeapMem operations—such as allocation and deallocation—take longer when cache management is active. As a result, the delay before the ARM is able to successfully call MessageQ_get and retrieve the response is further exacerbated.
Cache maintenance operations are inefficient when using DMA_Heap because the updated cache invalidate and write-back functions no longer honor the address and size of the DMA buffer. As a result, these operations are performed over the entire 128MB of CMA-reserved memory, significantly increasing overhead. To improve this code, multiple smaller DMA_Heap buffers would need to be allocated so the cache calls operate on far less memory and complete much faster.
The following articles provide a good explanation of the DMA buffer heap:
https://lwn.net/Articles/822052/
https://lwn.net/Articles/822521/