Performance problem with baremetal code

Added by Christian Kempter over 9 years ago

I tried to do some analysis of the ARM floating point performance on the MitySOM-5CSX in baremetal code. For this I wrote the two modules performance.c and mul.c.

performance.c:

#include <stdio.h>
#include <stdlib.h>
#include "alt_cache.h" 

int __auto_semihosting;

#define N 256

void mul(const double *in_a, const double *in_b, unsigned n, double *out);

int main(int argc, char** argv) {
        static double a[N], b[N], c[N];
        unsigned i, t;

        alt_cache_system_enable();
        for (i = 0; i < N; i++) {
                a[i] = (double)rand();
                b[i] = (double)rand();
        }

        *(unsigned volatile *)0xFFFEC600 = 0xFFFFFFFF; /* timer reload value */
        *(unsigned volatile *)0xFFFEC604 = 0xFFFFFFFF; /* current timer value */
        *(unsigned volatile *)0xFFFEC608 = 0x003;      /* start timer at 200MHz and automatically reload */
        mul(a, b, N, c);
        t = *(unsigned volatile *)0xFFFEC604;
        printf("used time for %u multiplications = %u ns\n", N, 5 * (0xFFFFFFFF - t));

        return 0;
}

mul.c:

void mul(const double *in_a, const double *in_b, unsigned n, double *out)
{
    unsigned i;
    for (i = 0; i < n; i++)
        out[i] = in_a[i] * in_b[i];
}

I put these two files, together with alt_cache.[hc] from the Altera hardware lib, into the baremetal hello-world example of DS-5. Linking into internal RAM, which should be the fastest RAM, results in an execution time of 47us. I've also verified the bad performance with the trace tool. Using the external RAM (linker script cycloneV-dk-ram-hosted.ld instead of cycloneV-dk-oc-ram-hosted.ld) results in 31us execution time, a little bit faster, but still at least a factor of 10 too slow.

Any ideas what's going wrong there? I suspect the cache initialisation fails somehow, because the difference with and without cache is only a very small improvement. I also used compiler optimization -O3.

Also I don't understand why the external RAM is faster than the internal RAM. Any ideas?

In principle the ARM performance should be OK. I've tested the same example also under Linux, which also uses the private timer as its 100 Hz trigger:

performance.c (under Linux):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <assert.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <fcntl.h>

#define N 256

void mul(const double *in_a, const double *in_b, unsigned n, double *out);

int main(int argc, char** argv) {
        static double a[N], b[N], c[N];
        uint32_t i, t[2];
        volatile uint32_t *timer;
        char *mem;
        int fd;

        /* Map the MPU peripheral block and point at the private timer counter (0xFFFEC604). */
        assert((fd = open("/dev/mem", O_RDWR)) >= 0);
        assert(MAP_FAILED != (mem = mmap(0, 0x10000, PROT_READ, MAP_SHARED, fd, 0xFFFE0000)));
        timer = (void *)(mem + 0xc604);

        for (i = 0; i < N; i++) {
                a[i] = (double)rand();
                b[i] = (double)rand();
        }

        t[0] = *timer;
        mul(a, b, N, c);
        t[1] = *timer;
        /* The timer counts down at 200 MHz, i.e. 5 ns per tick. */
        printf("used time for %u multiplications = %u ns\n", N, 5 * (t[0] - t[1]));

        munmap(mem, 0x10000);
        close(fd);

        return 0;
}

There I get execution times of about 2.7us, also with optimization level -O3. This execution time roughly matches the code generated by the compiler and the cycle counts I've found in the ARM documentation, e.g. 2 cycles for a floating point multiplication.
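
As a rough sanity check (assuming the usual 800 MHz CPU clock, so the private timer at CPU/4 = 200 MHz, i.e. the 5 ns per tick used above): 256 iterations at something like 8-10 cycles each for the two loads, the multiply and the store gives roughly 2000-2500 cycles, i.e. about 2.5-3us. So the 2.7us under Linux is plausible, while the 31-47us baremetal results are at least a factor of 10 too slow.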

Is there anything else than caches I have to initialize?


Replies (2)

RE: Performance problem with baremetal code - Added by Michael Williamson over 9 years ago

Hello Mr. Kempter,

We don't normally run / support bare-metal applications here (we find most folks get into more trouble using bare-metal with this class of processor).

It looks like the Linux application is running at the correct speed; my guess is that the baremetal case is slow because of one of the following:

  1. The L2 or L1 caches are not enabled in the bare-metal case.
  2. The compiler is not using the proper instructions for the vectorized multiply.

Are you using the same toolchain for both examples? If not, which toolchain are you using for each one? If they are different, it is possible that one of them is not generating the proper optimized VFP/NEON instructions for the Cortex-A9.
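
One quick way to check (assuming a GNU toolchain; the ARM compiler ships its own disassembler) would be to disassemble the mul object from each build and compare the inner loop:

arm-none-eabi-objdump -d mul.o

If the hardware FPU is being used you should see vmul.f64 inside the loop; if the disassembly instead shows calls to the __aeabi_dmul library helper, that build is doing software floating point, which by itself would explain a factor of 10.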

Because this issue is really specific to the Altera Cyclone V chip (and not an issue with the system-on-module integration), I would recommend that you post it to Altera, either via their Rocketboards RFI mailing list or on their support forum. I suspect this problem could be reproduced using the Altera Dev Kit or the Arrow SoCKit reference board as well, and you might get a quicker answer there. It will take us some time to reproduce your issue; I apologize for the delay.

-Mike

RE: Performance problem with baremetal code - Added by Christian Kempter over 9 years ago

In the meantime, after an ARM Cortex training, I've found the root cause of the problem:

To enable the caches, you also have to initialize the MMU, e.g. as in the version of performance.c attached below. Nevertheless the baremetal code is still about 10% slower than the Linux code. I suppose there are two reasons for this difference:
  • The baremetal gcc and the Linux gcc produce different code for the loop inside mul.
  • Maybe the instruction cache under Linux has a different preload state due to the Linux application loader.

performance.c (baremetal, with MMU initialization):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <stddef.h>
#include <assert.h>
#include "alt_cache.h" 
#include "alt_mmu.h" 

int __auto_semihosting;

#define N 256
#define ARRAY_SIZE(array) (sizeof(array) / sizeof(array[0]))

void mul(const double *in_a, const double *in_b, unsigned n, double *out);

/* MMU first-level page table: 4096 entries (16 KB), aligned on a 16 KB boundary */
static uint32_t __attribute__ ((aligned (0x4000))) alt_pt_storage[4096];

/* Page table "allocator" for alt_mmu_va_space_create(): it simply returns the
   statically allocated, properly aligned table passed in as the context. */
static void *alt_pt_alloc(const size_t size, void *context)
{
        return context;
}

static void mmu_init(void)
{
        uint32_t *ttb1 = NULL;

        /* Populate the page table with sections (1 MiB regions). */
        ALT_MMU_MEM_REGION_t regions[] = {
                /* Memory area: 1 GiB */
                {
                        .va         = (void *)0x00000000,
                        .pa         = (void *)0x00000000,
                        .size       = 0x40000000,
                        .access     = ALT_MMU_AP_FULL_ACCESS,
                        .attributes = ALT_MMU_ATTR_WBA,
                        .shareable  = ALT_MMU_TTB_S_NON_SHAREABLE,
                        .execute    = ALT_MMU_TTB_XN_DISABLE,
                        .security   = ALT_MMU_TTB_NS_SECURE
                },

                /* Device area: Everything else */
                {
                        .va         = (void *)0x40000000,
                        .pa         = (void *)0x40000000,
                        .size       = 0xc0000000,
                        .access     = ALT_MMU_AP_FULL_ACCESS,
                        .attributes = ALT_MMU_ATTR_DEVICE_NS,
                        .shareable  = ALT_MMU_TTB_S_NON_SHAREABLE,
                        .execute    = ALT_MMU_TTB_XN_ENABLE,
                        .security   = ALT_MMU_TTB_NS_SECURE
                }
        };

        assert(ALT_E_SUCCESS == alt_mmu_init());
        assert(alt_mmu_va_space_storage_required(regions, ARRAY_SIZE(regions)) <= sizeof(alt_pt_storage));
        assert(ALT_E_SUCCESS == alt_mmu_va_space_create(&ttb1, regions, ARRAY_SIZE(regions), alt_pt_alloc, alt_pt_storage));
        assert(ALT_E_SUCCESS == alt_mmu_va_space_enable(ttb1));
}

int main(int argc, char** argv) {
        static double a[N], b[N], c[N];
        unsigned i, t;

        mmu_init();
        alt_cache_system_enable();

        for (i = 0; i < N; i++) {
                a[i] = (double)rand();
                b[i] = (double)rand();
        }

        *(unsigned volatile *)0xFFFEC600 = 0xFFFFFFFF; /* timer reload value */
        *(unsigned volatile *)0xFFFEC604 = 0xFFFFFFFF; /* current timer value */
        *(unsigned volatile *)0xFFFEC608 = 0x003;      /* start timer at 200MHz and automatically reload */
        mul(a, b, N, c);
        t = *(unsigned volatile *)0xFFFEC604;
        printf("used time for %u multiplications = %u ns\n", N, 5 * (0xFFFFFFFF - t));

        return 0;
}