Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Using GNU Profiling (gprof) With ARM Cortex-M

DZone's Guide to

Using GNU Profiling (gprof) With ARM Cortex-M

Learn how to profile an embedded application on ARM Cortex-M devices using GNU gprof.

· IoT Zone
Free Resource

I published a Sneak Preview of how GNU gprof profiling looks for an embedded target ARM Cortex-M in an earlier post:

Freescale Kinetis Microcontroller Application Profiling

Freescale Kinetis Microcontroller Application Profiling

This tutorial explains how to profile an embedded application (no RTOS needed) on ARM Cortex-M devices with GNU gprof. Additionally I explain the inner workings to generate the data necessary for gprof.

Outline

In my earlier posts (“Code Coverage with gcov, launchpad tools and Eclipse Kinetis Design Studio V3.0.0“) I showed how GNU tools can be used to create coverage information for a bare metal application. This post is about Application Profiling with gprof. As it is a longer article with many details, it took me a while to get it written down. But hopefully this post enables you to use the powerful GNU gprof profiling with embedded targets on ARM Cortex-M. I show how to use it in command line mode or using gprof in Eclipse (e.g. in Freescale Kinetis Design Studio, or any other Eclipse distribution with standard GNU tools for ARM Embedded).

What is profiling? It is about to know where the application spends most of its time so I can optimize it. In the GNU world, there is ‘gprof‘ used for that, and it is part of the standard GNU binaries. gprof it is commonly used e.g. for host or (Embedded) Linux applications. Profiling with gprof requires the sources compiled by gcc with a special option to instrument the code. The gprof tool is used then on the host to analyze and visualize the data. To my surprise I have not found instructions or steps how to use gproffor an embedded bare-metal target. This post tries to close that gap.

Of course I can use any RTOS like FreeRTOS for this too. I’m using bare metal in this post to keep it very generic.

For this I’m using:

  • Toolchain: I’m using GNU ARM Embedded (launchpad) 4.9q2-2015. But any other or earlier GNU toolchain for ARM could be used too.
  • ARM Cortex-M: I’m using an ARM Cortex-M4F from Freescale (Kinetis K64F on aFRDM-K64F board). But any other ARM Cortex-M could be used instead.
  • Run Control: You need a run control device capable to do full semihosting including file I/O. I’m using a Segger J-Link which is embedded on the FRDM-K64F board from Freescale. But any external Segger J-Link probe can be used too.
  • SDK: I need to set up a timer (SysTick) for the statistical PC sampling (histogram). Every silicon vendor provides their own libraries, I’m using the Freescale Kinetis SDKin this article. But any other SDK or library will do it.
  • IDE: I’m using the Eclipse based Freescale Kinetis Design Studio V3.0.0. But any other Eclipse IDE or IDE will do it. Or use vi, that will work too :-). Or use command line utilities (no IDE is required).

The project used in this article is posted at the GitHub link at the end of this post. From the link you can download the sources and project files.

Profiling With GNU gprof

To use gprof, the sources need to be compiled with a special (-pg) compiler option. This directs the gcc compiler to add special calls to the generated code (more about this later). I can instrument all of my application code or only a part of it. Instrumenting only a part of it makes sense if I want to limit the overhead (FLASH, RAM) created in the instrumented (.elf) image (more about this later too). The instrumented image will collect the profiling information which then is written to the gmon.out file.

gprof System Overview

gprof System Overview

Typically, the generated .elf file gets downloaded to the embedded target/board e.g. using the debugger. At exit of the application, the profiling data gets written to the file gmon.out. Typically a bare embedded target does not have a file system. Therefore I need a way to transfer the data from the board to the host. One way to do is to use ‘semihosting’: With semihosting the application is using a special I/O Library.

Semihosting means that the embedded application gets linked with a special version of the library which will trigger the debugger for any console or file I/O operation. The application can use things like f_open() to open and then read/write files on the host :-).

With semihosting the debugger will intercept the I/O operations: it will re-route the data to the host through the debug interface: the embedded application can read/write files which are actually on the host.

Semihosting

Semihosting

I’m using semihosting to get the gmon.out data file from the embedded board to the host PC. Once the file is on the host, it can be analyzed by the gprof program to generate reports. For this gprof is using the debug information in the .elf file to locate the functions and source files.

gprof: Call Counting and PC Sampling

This is how gprof works: it has two aspects:

  1. Function Call Counting: it counts for every instrumented function, how many times the function is called and from where. gcc inserts at the beginning of each instrumented function a call to a special library function to count the calls. This creates ‘Arcs’: information about who is calling what, how many times. Think about directed edges in a graph. Counting is done with _mcount() or in the case of GNU for ARM with __gnu_mcount_nc().
  2. Program Counter Sampling: at a period time, the current PC (Program Counter) gets sampled, creating a histogram of PC over addresses. This is typically done with a profile() function.

Function Call Counting is done with an insertion of a special call at the beginning of each instrumented function. The GNU ARM Embedded calls the function __gnu_mcount_nc():

Calling _mcount

Calling _mcount

The _mcount_internal() function calculates the source and destination address of each call and stores it in a table. That table has a counter telling how many times a call from a source address has been performed. Such a table line entry is called an ‘arc’.

For the Function Call Counting, the compiler inserts a call to __gnu_mcount_nc at each instrumented function. Below is an example function to be instrumented:

static void CheckButton(void) {
 uint32_t val;
 
 val = GPIO_DRV_ReadPinInput(BUTTON_SW2);
 if (val==0) { /* SW2 pressed! */
...

After compiling it with the -pg option, the disassembly code shows the call to __gnu_mcount_nc() inserted:

30:../Sources/Application.c **** static void CheckButton(void) {
 44 .loc 1 30 0
 45 .cfi_startproc
 46 @ args = 0, pretend = 0, frame = 8
 47 @ frame_needed = 1, uses_anonymous_args = 0
 48 0000 80B5 push {r7, lr}
 49 .cfi_def_cfa_offset 8
 50 .cfi_offset 7, -8
 51 .cfi_offset 14, -4
 52 0002 82B0 sub sp, sp, #8
 53 .cfi_def_cfa_offset 16
 54 0004 00AF add r7, sp, #0
 55 .cfi_def_cfa_register 7
 56 0006 00B5 push {lr}
 57 0008 FFF7FEFF bl __gnu_mcount_nc
 31:../Sources/Application.c **** uint32_t val;
 32:../Sources/Application.c **** 
 33:../Sources/Application.c **** val = GPIO_DRV_ReadPinInput(BUTTON_SW2);
 58 .loc 1 33 0
 59 000c 40F20620 movw r0, #518
 60 0010 FFF7FEFF bl GPIO_DRV_ReadPinInput
 61 0014 7860 str r0, [r7, #4]
 34:../Sources/Application.c **** if (val==0) { /* SW2 pressed! */
 62 .loc 1 34 0
 63 0016 7B68 ldr r3, [r7, #4]
 64 0018 002B cmp r3, #0
 65 001a 19D1 bne .L2
...

The second thing is the Program Counter Sampling. gprof does *not* measure the time a function takes from the beginning to the end:

Measuring Function Execution Time

Measuring Function Execution Time


Such a measurement could be done with the ‘poor man’s profiling’ method.

Instead, it does a periodic sampling of the program counter (PC), say every 10 ms and counts the number of that PC in a histogram:

PC sampling to create Histogram

PC sampling to create Histogram

The histogram is a basically an array of code space addresses with a number how many times that PC has been sampled.

Note that the sampling is only about program counter addresses. gprof later will use the debug information to match address ranges with functions.

Accuracy and Timing

Based on the arcs and PC sampling information, gprof is able to estimate time spent in functions. Consider the following example where A() calls B() and C(), and D() calls C().A() has been sampled 3 times ([3 s]) and D() has been sampled one time ([1 s]):

Arcs with Timing Information

Arcs with Timing Information

The numbers on the arrows indicate the number of calls performed: A() calls B() one time and three times C()gprof will start bottom-up distributing run-time: B() has been sampled with one sample (e.g. one sample duration is 10 ms). C() has been sampled 10 times. As C() has been called in total 5 times, the sample time of 5 gets distributed to the callers of C() according to the call counts. So A() gets added ((10/5)*3 ==> 6) and D()gets added an additional of 4 samples ((10/5)*2).

gprof assumes that the time spent in C() is regardless how it is called. Maybe A() is calling C() with parameters letting C() spend more CPU cycles. gprof does assume that calling a function() always takes the same effort. Which might not be always true.

That PC sampling is sometimes called ‘statistical sampling’ and is not error-free. The first error of PC sampling comes with the sampling frequency: Usually sampling is done at 10 ms or 1 ms rate. That means that short functions less than the sampling time will be rarely sampled or not at all. If a function has not been sampled multiple times, the timing for it will not be very accurate.

Mathematically, the expected error for the sampling is the square root of the number of samples taken. To illustrate this with examples:

  1. Sampling frequency is 100 Hz (Sampling period 10 ms). 10 samples will make a runtime if 1 second. The error is SQRT(100) which is 10, or 10*10ms = 100ms. So the ‘measured’ 1 second has an error of 100 ms (10%).
  2. Sampling frequency is again 100 Hz, and a function gets sampled 10000 times (runtime 100 seconds). The error is SQRT(10000) or 100 samples or 1 second. This means that the error for this function is 1%.

This means that only high sample numbers are accurate (several hundreds of samples). If the sample count for a function is small, this does not tell much. To get higher sample counts, you need to profile for an extended period or combine multiple profiling runs.

Memory Requirements for Arcs and Histogram

I have presented so far what information gprof collects:

  1. Arcs: which address (function) gets called from where (address) and how many times.
  2. Histogram: sampling of program counter on a periodic base, program counter frequency distribution over text/code range.

In the implementation I’m using, that information is collected and stored in a data structure as shown below:

Data Structure to Collect Information for gprof

Data Structure to Collect Information for gprof

  • text is the memory range to be monitored. Typically this is the FLASH or code range of the application. Only one memory area is supported.
  • The arrays froms[] and tos[] are building the Arcs. froms[] is an array mapping the text addresses from where function calls are performed.
  • froms[] is an array of short (16bit) index values into the tos[] array. This means that there can be up to 65535 Arcs.
  • To save memory for the froms[] array, the mapping is not a 1 to 1: instead ahashfraction is used to reduce the number of entries. This hashfraction is configurable, and can be 2 or higher for an thumb code on ARM because calls can be only done from even addresses.
  • tos[] is an array of  desination address with total 96 bits for each entry: a 32bitselfpc (destination address of call), a 32bit counter how many times that selfpc has been called, a 16bit link index into tos[] array. To align the array entries to 32bit addresses, there is a 16bit padding field which is not used for any data.
  • To reduce the size of the tos[] array an ardensity can be specified. This value is a precentage, how many arcs are expected for an address range. Or in other words: the density of function calls for the code profiled. For example an arcdensity of 2% means that we can cover up to 2% of the text area using calls.
  • kcount[] is the histogram of the sampled PCs: each counter is 16 bit wide. To reduce the amount of RAM needed, the available text/code area is divided byhistfraction. For ARM Cortex-M with thumb mode, as every instruction at least 16bit long, a value of histfraction >= 2 can be used.

So the larger the text range to cover, the more RAM is needed. To give an example:

  • text==8192 bytes
  • hashfraction==2 (to cover every ‘from’ address
  • arcenensity==2% (2% of the text are function calls)
  • histfraction==2 (every address can be sampled)

This would result in a need for 10148 bytes of RAM:

Example of required RAM

Example of required RAM

Clearly, many targets will not have that much of RAM available. The strategy then is to use higher fraction values and/or lower arc density. Another approach is to selectively instrument only a portion of the source files/modules and place them into a text range using the linker file, as multiple gprof runs can be combined. But the amount of RAM needed to store the data is indeed a problem using gprof :-(.

Program Flow

After instrumentation, the program will run as usual, except that function calls get recorded in arcs and the PC gets sampled into a histogram. The data is written usually with calling_exit() at the end fo the program which then calls _mclean() which does the write of the file:

Program Flow to Write gmon.out

Program Flow to Write gmon.out

The application of course can directly call _mcleanup() which then will write the data with file I/O which gets transferred to the host with semihosting.

To enable semihosting (with the newlib-nano) I specify this option in the linker settings:

--specs=rdimon.specs

Linker Option for Semihosting

Linker Option for Semihosting

Note that there are several different level of semihosting implemented. I’m using a Segger J-Link interface which has console AND file semihosting implemented.

Profiler Library

As the gcc ARM Embedded (launchpad) tools do not come with the library having profiling support included, I have added ported the cygwin for i386 port to GNU ARM. The profiling port libraries are available in the project at the GitHub location at the end of this article.

Make sure you use the latest files from GitHub. I’m posting here the current version for documentation purposes.

Profiling Library

Profiling Library

With the option -pg, the gcc compiler will add calls to __gnu_mound_nc() to count the functions and building the arcs. profiler.S implements that function:

/*
 * profiler.S
 * Implements the gprof profiler arc counting function.
 * Created on: 06.08.2015
 * Author: Erich Styger
 */
 
 .syntax unified
 .arch armv7-m
 
.globl __gnu_mcount_nc
.type __gnu_mcount_nc, %function
 
__gnu_mcount_nc:
#if 0 /* dummy version, doing nothing */
 mov ip, lr
 pop { lr }
 bx ip
#else
 push {r0, r1, r2, r3, lr} /* save registers */
 bic r1, lr, #1 /* R1 contains callee address, with thumb bit cleared */
 ldr r0, [sp, #20] /* R0 contains caller address */
 bic r0, r0, #1 /* clear thumb bit */
 bl _mcount_internal /* jump to internal _mcount() implementation */
 pop {r0, r1, r2, r3, ip, lr} /* restore saved registers */
 bx ip /* return to caller */
#endif

gmon.h implements the interface to gmon.c. It declares the data structures and contains the configuration parameters. Calling _mcleanup() will write the output file.  _monInit()has to be called from the startup code in case the startup code is instrumented too.

/* $OpenBSD: gmon.h,v 1.3 1996/04/21 22:31:46 deraadt Exp $ */
/* $NetBSD: gmon.h,v 1.5 1996/04/09 20:55:30 cgd Exp $ */
 
/*-
 * Copyright (c) 1982, 1986, 1992, 1993
 * The Regents of the University of California. All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 * notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 * notice, this list of conditions and the following disclaimer in the
 * documentation and/or other materials provided with the distribution.
 * 4. Neither the name of the University nor the names of its contributors
 * may be used to endorse or promote products derived from this software
 * without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 *
 * @(#)gmon.h 8.2 (Berkeley) 1/4/94
 */
 
/*
 * This file is taken from Cygwin distribution. Please keep it in sync.
 * The differences should be within __MINGW32__ guard.
 */
 
#ifndef _SYS_GMON_H_
#define _SYS_GMON_H_
 
#ifndef __P
#define __P(x) x
#endif
 
/* On POSIX systems, profile.h is a KRB5 header. To avoid collisions, just
 pull in profile.h's content here. The profile.h header won't be provided
 by Mingw-w64 anymore at one point. */
#if 0
#include <profile.h>
#else
#ifndef _WIN64
#define _MCOUNT_CALL __attribute__ ((regparm (2)))
extern void _mcount(void);
#else
#define _MCOUNT_CALL
extern void mcount(void);
#endif
#define _MCOUNT_DECL __attribute__((gnu_inline)) __inline__ \
 void _MCOUNT_CALL _mcount_private
#define MCOUNT
#endif
 
/*
 * Structure prepended to gmon.out profiling data file.
 */
struct gmonhdr {
 size_t lpc; /* base pc address of sample buffer */
 size_t hpc; /* max pc address of sampled buffer */
 int ncnt; /* size of sample buffer (plus this header) */
 int version; /* version number */
 int profrate; /* profiling clock rate */
 int spare[3]; /* reserved */
};
#define GMONVERSION 0x00051879
 
/*
 * histogram counters are unsigned shorts (according to the kernel).
 */
#define HISTCOUNTER unsigned short
 
/*
 * fraction of text space to allocate for histogram counters here, 1/2
 */
#define HISTFRACTION 2
 
/*
 * Fraction of text space to allocate for from hash buckets.
 * The value of HASHFRACTION is based on the minimum number of bytes
 * of separation between two subroutine call points in the object code.
 * Given MIN_SUBR_SEPARATION bytes of separation the value of
 * HASHFRACTION is calculated as:
 *
 * HASHFRACTION = MIN_SUBR_SEPARATION / (2 * sizeof(short) - 1);
 *
 * For example, on the VAX, the shortest two call sequence is:
 *
 * calls $0,(r0)
 * calls $0,(r0)
 *
 * which is separated by only three bytes, thus HASHFRACTION is
 * calculated as:
 *
 * HASHFRACTION = 3 / (2 * 2 - 1) = 1
 *
 * Note that the division above rounds down, thus if MIN_SUBR_FRACTION
 * is less than three, this algorithm will not work!
 *
 * In practice, however, call instructions are rarely at a minimal
 * distance. Hence, we will define HASHFRACTION to be 2 across all
 * architectures. This saves a reasonable amount of space for
 * profiling data structures without (in practice) sacrificing
 * any granularity.
 */
#define HASHFRACTION 2
 
/*
 * percent of text space to allocate for tostructs with a minimum.
 */
#define ARCDENSITY 2 /* this is in percentage, relative to text size! */
#define MINARCS 50
#define MAXARCS ((1 << (8 * sizeof(HISTCOUNTER))) - 2)
 
struct tostruct {
 size_t selfpc; /* callee address/program counter. The caller address is in froms[] array which points to tos[] array */
 long count; /* how many times it has been called */
 u_short link; /* link to next entry in hash table. For tos[0] this points to the last used entry */
 u_short pad; /* additional padding bytes, to have entries 4byte aligned */
};
 
/*
 * a raw arc, with pointers to the calling site and
 * the called site and a count.
 */
struct rawarc {
 size_t raw_frompc;
 size_t raw_selfpc;
 long raw_count;
};
 
/*
 * general rounding functions.
 */
#define ROUNDDOWN(x,y) (((x)/(y))*(y))
#define ROUNDUP(x,y) ((((x)+(y)-1)/(y))*(y))
 
/*
 * The profiling data structures are housed in this structure.
 */
struct gmonparam {
 int state;
 u_short *kcount; /* histogram PC sample array */
 size_t kcountsize; /* size of kcount[] array in bytes */
 u_short *froms; /* array of hashed 'from' addresses. The 16bit value is an index into the tos[] array */
 size_t fromssize; /* size of froms[] array in bytes */
 struct tostruct *tos; /* to struct, contains histogram counter */
 size_t tossize; /* size of tos[] array in bytes */
 long tolimit;
 size_t lowpc; /* low program counter of area */
 size_t highpc; /* high program counter */
 size_t textsize; /* code size */
};
extern struct gmonparam _gmonparam;
 
/*
 * Possible states of profiling.
 */
#define GMON_PROF_ON 0
#define GMON_PROF_BUSY 1
#define GMON_PROF_ERROR 2
#define GMON_PROF_OFF 3
 
void _mcleanup(void); /* routine to be called to write gmon.out file */
void _monInit(void); /* initialization routine */
 
#endif /* !_SYS_GMONH_ */

gmon.c implements the function counting with _mcount_internal() and the data writing with _mcleanup(). The implementation uses a global flag already_setup: the first time arcs get created, it will allocate memory using malloc().

/*-
 * Copyright (c) 1983, 1992, 1993
 *    The Regents of the University of California.  All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 * 4. Neither the name of the University nor the names of its contributors
 *    may be used to endorse or promote products derived from this software
 *    without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 */
 
/*
 * This file is taken from Cygwin distribution. Please keep it in sync.
 * The differences should be within __MINGW32__ guard.
 */
 
#include <fcntl.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include "gmon.h"
#include "profil.h"
#include <stdint.h>
#include <string.h>
 
#define MINUS_ONE_P (-1)
#define bzero(ptr,size) memset (ptr, 0, size);
#define ERR(s) write(2, s, sizeof(s))
 
struct gmonparam _gmonparam = { GMON_PROF_OFF, NULL, 0, NULL, 0, NULL, 0, 0L, 0, 0, 0};
static char already_setup = 0; /* flag to indicate if we need to init */
static int    s_scale;
/* see profil(2) where this is described (incorrectly) */
#define        SCALE_1_TO_1    0x10000L
 
static void moncontrol(int mode);
 
/* required for gcc ARM Embedded 4.9-2015-q2 */
void *_sbrk(int incr) {
  extern char __HeapLimit; /* Defined by the linker */
  static char *heap_end = 0;
  char *prev_heap_end;
 
  if (heap_end==0) {
    heap_end = &__HeapLimit;
  }
  prev_heap_end = heap_end;
  heap_end += incr;
  return (void *)prev_heap_end;
}
 
static void *fake_sbrk(int size) {
  void *rv = malloc(size);
  if (rv) {
    return rv;
  } else {
    return (void *) MINUS_ONE_P;
  }
}
 
void monstartup (size_t lowpc, size_t highpc) {
    register size_t o;
    char *cp;
    struct gmonparam *p = &_gmonparam;
 
    /*
     * round lowpc and highpc to multiples of the density we're using
     * so the rest of the scaling (here and in gprof) stays in ints.
     */
    p->lowpc = ROUNDDOWN(lowpc, HISTFRACTION * sizeof(HISTCOUNTER));
    p->highpc = ROUNDUP(highpc, HISTFRACTION * sizeof(HISTCOUNTER));
    p->textsize = p->highpc - p->lowpc;
    p->kcountsize = p->textsize / HISTFRACTION;
    p->fromssize = p->textsize / HASHFRACTION;
    p->tolimit = p->textsize * ARCDENSITY / 100;
    if (p->tolimit < MINARCS) {
        p->tolimit = MINARCS;
    } else if (p->tolimit > MAXARCS) {
        p->tolimit = MAXARCS;
    }
    p->tossize = p->tolimit * sizeof(struct tostruct);
 
    cp = fake_sbrk(p->kcountsize + p->fromssize + p->tossize);
    if (cp == (char *)MINUS_ONE_P) {
        ERR("monstartup: out of memory\n");
        return;
    }
 
    /* zero out cp as value will be added there */
    bzero(cp, p->kcountsize + p->fromssize + p->tossize);
 
    p->tos = (struct tostruct *)cp;
    cp += p->tossize;
    p->kcount = (u_short *)cp;
    cp += p->kcountsize;
    p->froms = (u_short *)cp;
 
    p->tos[0].link = 0;
 
    o = p->highpc - p->lowpc;
    if (p->kcountsize < o) {
#ifndef notdef
        s_scale = ((float)p->kcountsize / o ) * SCALE_1_TO_1;
#else /* avoid floating point */
        int quot = o / p->kcountsize;
 
        if (quot >= 0x10000)
            s_scale = 1;
        else if (quot >= 0x100)
            s_scale = 0x10000 / quot;
        else if (o >= 0x800000)
            s_scale = 0x1000000 / (o / (p->kcountsize >> 8));
        else
            s_scale = 0x1000000 / ((o << 8) / p->kcountsize);
#endif
    } else {
        s_scale = SCALE_1_TO_1;
    }
    moncontrol(1); /* start */
}
 
void _mcleanup(void) {
    static const char gmon_out[] = "gmon.out";
    int fd;
    int hz;
    int fromindex;
    int endfrom;
    size_t frompc;
    int toindex;
    struct rawarc rawarc;
    struct gmonparam *p = &_gmonparam;
    struct gmonhdr gmonhdr, *hdr;
    const char *proffile;
#ifdef DEBUG
    int log, len;
    char dbuf[200];
#endif
 
    if (p->state == GMON_PROF_ERROR) {
        ERR("_mcleanup: tos overflow\n");
    }
    hz = PROF_HZ;
    moncontrol(0); /* stop */
    proffile = gmon_out;
    fd = open(proffile , O_CREAT|O_TRUNC|O_WRONLY|O_BINARY, 0666);
    if (fd < 0) {
        perror( proffile );
        return;
    }
#ifdef DEBUG
    log = open("gmon.log", O_CREAT|O_TRUNC|O_WRONLY, 0664);
    if (log < 0) {
        perror("mcount: gmon.log");
        return;
    }
    len = sprintf(dbuf, "[mcleanup1] kcount 0x%x ssiz %d\n",
        p->kcount, p->kcountsize);
    write(log, dbuf, len);
#endif
    hdr = (struct gmonhdr *)&gmonhdr;
    hdr->lpc = p->lowpc;
    hdr->hpc = p->highpc;
    hdr->ncnt = p->kcountsize + sizeof(gmonhdr);
    hdr->version = GMONVERSION;
    hdr->profrate = hz;
    write(fd, (char *)hdr, sizeof *hdr);
    write(fd, p->kcount, p->kcountsize);
    endfrom = p->fromssize / sizeof(*p->froms);
    for (fromindex = 0; fromindex < endfrom; fromindex++) {
        if (p->froms[fromindex] == 0) {
            continue;
        }
        frompc = p->lowpc;
        frompc += fromindex * HASHFRACTION * sizeof(*p->froms);
        for (toindex = p->froms[fromindex]; toindex != 0; toindex = p->tos[toindex].link) {
#ifdef DEBUG
            len = sprintf(dbuf,
            "[mcleanup2] frompc 0x%x selfpc 0x%x count %d\n" ,
                frompc, p->tos[toindex].selfpc,
                p->tos[toindex].count);
            write(log, dbuf, len);
#endif
            rawarc.raw_frompc = frompc;
            rawarc.raw_selfpc = p->tos[toindex].selfpc;
            rawarc.raw_count = p->tos[toindex].count;
            write(fd, &rawarc, sizeof rawarc);
        }
    }
    close(fd);
}
 
/*
 * Control profiling
 *    profiling is what mcount checks to see if
 *    all the data structures are ready.
 */
static void moncontrol(int mode) {
    struct gmonparam *p = &_gmonparam;
 
    if (mode) {
        /* start */
        profil((char *)p->kcount, p->kcountsize, p->lowpc, s_scale);
        p->state = GMON_PROF_ON;
    } else {
        /* stop */
        profil((char *)0, 0, 0, 0);
        p->state = GMON_PROF_OFF;
    }
}
 
void _mcount_internal(uint32_t *frompcindex, uint32_t *selfpc) {
  register struct tostruct    *top;
  register struct tostruct    *prevtop;
  register long            toindex;
  struct gmonparam *p = &_gmonparam;
 
  if (!already_setup) {
    extern char __etext; /* end of text/code symbol, defined by linker */
    already_setup = 1;
    monstartup(0x410, (uint32_t)&__etext);
  }
  /*
   *    check that we are profiling
   *    and that we aren't recursively invoked.
   */
  if (p->state!=GMON_PROF_ON) {
    goto out;
  }
  p->state++;
  /*
   *    check that frompcindex is a reasonable pc value.
   *    for example:    signal catchers get called from the stack,
   *            not from text space.  too bad.
   */
  frompcindex = (uint32_t*)((long)frompcindex - (long)p->lowpc);
  if ((unsigned long)frompcindex > p->textsize) {
      goto done;
  }
  frompcindex = (uint32_t*)&p->froms[((long)frompcindex) / (HASHFRACTION * sizeof(*p->froms))];
  toindex = *frompcindex;
  if (toindex == 0) {
    /*
    *    first time traversing this arc
    */
    toindex = ++p->tos[0].link; /* the link of tos[0] points to the last used record in the array */
    if (toindex >= p->tolimit) { /* more tos[] entries than we can handle! */
        goto overflow;
      }
    *frompcindex = toindex;
    top = &p->tos[toindex];
    top->selfpc = (size_t)selfpc;
    top->count = 1;
    top->link = 0;
    goto done;
  }
  top = &p->tos[toindex];
  if (top->selfpc == (size_t)selfpc) {
    /*
     *    arc at front of chain; usual case.
     */
    top->count++;
    goto done;
  }
  /*
   *    have to go looking down chain for it.
   *    top points to what we are looking at,
   *    prevtop points to previous top.
   *    we know it is not at the head of the chain.
   */
  for (; /* goto done */; ) {
    if (top->link == 0) {
      /*
       *    top is end of the chain and none of the chain
       *    had top->selfpc == selfpc.
       *    so we allocate a new tostruct
       *    and link it to the head of the chain.
       */
      toindex = ++p->tos[0].link;
      if (toindex >= p->tolimit) {
        goto overflow;
      }
      top = &p->tos[toindex];
      top->selfpc = (size_t)selfpc;
      top->count = 1;
      top->link = *frompcindex;
      *frompcindex = toindex;
      goto done;
    }
    /*
     *    otherwise, check the next arc on the chain.
     */
    prevtop = top;
    top = &p->tos[top->link];
    if (top->selfpc == (size_t)selfpc) {
      /*
       *    there it is.
       *    increment its count
       *    move it to the head of the chain.
       */
      top->count++;
      toindex = prevtop->link;
      prevtop->link = top->link;
      top->link = *frompcindex;
      *frompcindex = toindex;
      goto done;
    }
  }
  done:
    p->state--;
    /* and fall through */
  out:
    return;        /* normal return restores saved registers */
  overflow:
    p->state++; /* halt further profiling */
    #define    TOLIMIT    "mcount: tos overflow\n"
    write (2, TOLIMIT, sizeof(TOLIMIT));
  goto out;
}
 
void _monInit(void) {
  _gmonparam.state = GMON_PROF_OFF;
  already_setup = 0;
}

In profile.h is the interface to the PC sampling histogram functionality:

/* profil.h: gprof profiling header file
 
   Copyright 1998, 1999, 2000, 2001, 2002 Red Hat, Inc.
 
This file is part of Cygwin.
 
This software is a copyrighted work licensed under the terms of the
Cygwin license.  Please consult the file "CYGWIN_LICENSE" for
details. */
 
/*
 * This file is taken from Cygwin distribution. Please keep it in sync.
 * The differences should be within __MINGW32__ guard.
 */
 
#ifndef __PROFIL_H__
#define __PROFIL_H__
 
/* profiling frequency.  (No larger than 1000) */
#define PROF_HZ            1000
 
/* convert an addr to an index */
#define PROFIDX(pc, base, scale)    \
  ({                                    \
    size_t i = (pc - base) / 2;                \
    if (sizeof (unsigned long long int) > sizeof (size_t))        \
      i = (unsigned long long int) i * scale / 65536;            \
    else                                \
      i = i / 65536 * scale + i % 65536 * scale / 65536;        \
    i;                                    \
  })
 
/* convert an index into an address */
#define PROFADDR(idx, base, scale)        \
  ((base)                    \
   + ((((unsigned long long)(idx) << 16)    \
       / (unsigned long long)(scale)) << 1))
 
/* convert a bin size into a scale */
#define PROFSCALE(range, bins)        (((bins) << 16) / ((range) >> 1))
 
typedef void *_WINHANDLE;
 
typedef enum {
  PROFILE_NOT_INIT = 0,
  PROFILE_ON,
  PROFILE_OFF
} PROFILE_State;
 
struct profinfo {
  PROFILE_State state; /* profiling state */
  u_short *counter;            /* profiling counters */
  size_t lowpc, highpc;        /* range to be profiled */
  u_int scale;            /* scale value of bins */
};
 
int profile_ctl(struct profinfo *, char *, size_t, size_t, u_int);
int profil(char *, size_t, size_t, u_int);
 
#endif /* __PROFIL_H__ */

The PC sampling is implemented in profile.c. I’m using the SysTick with the Kinetis SDK to generate a 1 kHz sampling interrupt:

/* profil.c -- win32 profil.c equivalent
 
   Copyright 1998, 1999, 2000, 2001, 2002 Red Hat, Inc.
 
   This file is part of Cygwin.
 
   This software is a copyrighted work licensed under the terms of the
   Cygwin license.  Please consult the file "CYGWIN_LICENSE" for
   details. */
 
/*
 * This file is taken from Cygwin distribution, adopted to be used for bare embeeded targets.
 */
#include <stdio.h>
#include <sys/types.h>
#include <errno.h>
#include <math.h>
#include "profil.h"
#include <string.h>
#include <stdint.h>
 
/* global profinfo for profil() call */
static struct profinfo prof = {
  PROFILE_NOT_INIT, 0, 0, 0, 0
};
 
/* sample the current program counter */
void SysTick_Handler(void) {
  void OSA_SysTick_Handler(void);
  static size_t pc, idx;
 
  OSA_SysTick_Handler(); /* call normal Kinetis SDK SysTick handler */
  if (prof.state==PROFILE_ON) {
    pc = ((uint32_t*)(__builtin_frame_address(0)))[14]; /* get SP and use it to get the return address from stack */
    if (pc >= prof.lowpc && pc < prof.highpc) {
      idx = PROFIDX (pc, prof.lowpc, prof.scale);
      prof.counter[idx]++;
    }
  }
}
 
/* Stop profiling to the profiling buffer pointed to by p. */
static int profile_off (struct profinfo *p) {
  p->state = PROFILE_OFF;
  return 0;
}
 
/* Create a timer thread and pass it a pointer P to the profiling buffer. */
static int profile_on (struct profinfo *p) {
  p->state = PROFILE_ON;
  return 0; /* ok */
}
 
/*
 * start or stop profiling
 *
 * profiling goes into the SAMPLES buffer of size SIZE (which is treated
 * as an array of u_shorts of size size/2)
 *
 * each bin represents a range of pc addresses from OFFSET.  The number
 * of pc addresses in a bin depends on SCALE.  (A scale of 65536 maps
 * each bin to two addresses, A scale of 32768 maps each bin to 4 addresses,
 * a scale of 1 maps each bin to 128k address).  Scale may be 1 - 65536,
 * or zero to turn off profiling
 */
int profile_ctl (struct profinfo *p, char *samples, size_t size, size_t offset, u_int scale) {
  size_t maxbin;
 
  if (scale > 65536) {
    errno = EINVAL;
    return -1;
  }
  profile_off(p);
  if (scale) {
    memset(samples, 0, size);
    memset(p, 0, sizeof *p);
    maxbin = size >> 1;
    prof.counter = (u_short*)samples;
    prof.lowpc = offset;
    prof.highpc = PROFADDR(maxbin, offset, scale);
    prof.scale = scale;
    return profile_on(p);
  }
  return 0;
}
 
/* Equivalent to unix profil()
   Every SLEEPTIME interval, the user's program counter (PC) is examined:
   offset is subtracted and the result is multiplied by scale.
   The word pointed to by this address is incremented. */
int profil (char *samples, size_t size, size_t offset, u_int scale) {
  return profile_ctl (&prof, samples, size, offset, scale);
}

With this, the profiling library and support is complete.

It is ok to profile the profiler library too, except do *not* profile gmon.c. I have not found a way to disable profiling for a single function.

Profiling and Calling _mcleanup()

With the library included in the application, I can run the application as usual. The profiling with arc counting and histogram generation might slow down the application maybe ~30%, depending on the number of function calls. The PC sampling (histogram) adds a constant load of about 5% or less, depending on the sampling frequency (is use typically a frequency of 1 kHz). With using an RTOS like FreeRTOS the PC sampling could be done from a task or better from the FreeRTOS tick hook.

At the end of the application (or when I’m done with profiling), I have to call _mcleanup() to write the profiling data. I most cases I’m using my own version of the _exit() function as below:

void _exit(int status) {
 (void)status; /* not used */
 _mcleanup(); /* write gmon.out file */
 /* turn on all LED's ==&gt; WHITE */
 GPIO_DRV_ClearPinOutput(led_red);
 GPIO_DRV_ClearPinOutput(led_green);
 GPIO_DRV_ClearPinOutput(led_blue);
 for(;;){} /* does not return */
}

In that above implementation I’m indicating the end with a visual LED code, because writing the output file with semihosting can take several seconds.

Compiler Settings

Now I have the library for profiling. What is missing are the required project settings. In theGNU ARM Eclipse panels, there is a ‘Generate gprof information (-pg)’ option:

-pg option

-pg option

But it will not work that way! This option will be used for all the source files and will be passed to the linker. I do *not* set that option:

  1. It will instrument all the files which very likely will exceed the RAM size of an embedded target
  2. It will cause the linker to link with a special version of the C library, compiled with the -pg option too (libc_p.a)

Even if I would have enough RAM, the second point is a problem for the GCC ARM Embedded (launchpad) tools (at least up to version 4.9 2015q2): that GNU gcc distribution does not have that library included :-( (see this community discussion): the linker will report:

cannot find -lc_p

GNU linker error that it cannot find profiling library

GNU linker error that it cannot find profiling library

Instead, I specify the -pg compiler option on a per file/folder base: I only apply it to the files I want to profile (right click on the file/folder to set per-file/folder options):

Per file Option to Profile

Per file Option to Profile

Do *not* instrument (add the option -pg) to the profiling functions itself (file gmon.c), otherwise it will be called in a recursive way causing a stack overflow!

Eclipse will mark the files (see “Icon and Label Decorators in Eclipse“):

Eclipse Icon Decorators to show Special Options per File

Eclipse Icon Decorators to show Special Options per File

Running the program creates the gmon.out on the host through semihosting:

Generated gmon.out

Generated gmon.out

Using Command Line version of prof

The gmon.out result file then gets analyzed with gprof command:

gprof <options> <ELF file> <gmon file>

See https://sourceware.org/binutils/docs/gprof/Invoking.html#Invoking for the command line summary.

To make it easier for me, I have the following addPath.bat DOS batch file which adds the GNU tools path:

@ECHO off
ECHO Adding GNU tools to PATH
SET KDSBIN=C:\Freescale\KDS_3.0.0\bin
SET KDSTOOLBIN=C:\Freescale\KDS_3.0.0\toolchain\bin
ECHO %PATH%|findstr /i /c:"%KDSBIN:"=%">nul || set PATH=%PATH%;%KDSBIN%
ECHO %PATH%|findstr /i /c:"%KDSTOOLBIN:"=%">nul || set PATH=%PATH%;%KDSTOOLBIN%
ECHO %PATH%

The gprof utility can be used like this:

arm-none-eabi-gprof.exe Debug\FRDM-K64F_Profiling.elf gmon.out > gmon.txt

This creates the file gmon.txt with a ‘flat’ profile:

Flat profile:
 
Each sample counts as 0.001 seconds.
 % cumulative self self total 
 time seconds seconds calls s/call s/call name 
 94.61 28.92 28.92 1 28.92 29.31 APP_Run
 2.31 29.63 0.71 OSA_TimeDiff
 1.48 30.08 0.45 OSA_TimeGetMsec
 1.17 30.43 0.36 15 0.02 0.02 OSA_TimeDelay
 0.29 30.52 0.09 _mcount_internal
 0.10 30.55 0.03 15953 0.00 0.00 GPIO_HAL_ReadPinInput
 0.05 30.57 0.01 __gnu_mcount_nc
 0.00 30.57 0.00 15953 0.00 0.00 GPIO_DRV_ReadPinInput
 0.00 30.57 0.00 15 0.00 0.00 CheckButton
 0.00 30.57 0.00 15 0.00 0.00 GPIO_DRV_TogglePinOutput
 0.00 30.57 0.00 15 0.00 0.00 GPIO_HAL_TogglePinOutput
 0.00 30.57 0.00 3 0.00 0.00 CLOCK_SYS_EnablePortClock
 0.00 30.57 0.00 3 0.00 0.00 GPIO_DRV_OutputPinInit
 0.00 30.57 0.00 3 0.00 0.00 GPIO_DRV_SetPinOutput
 0.00 30.57 0.00 3 0.00 0.00 GPIO_HAL_SetPinDir
 0.00 30.57 0.00 3 0.00 0.00 GPIO_HAL_SetPinOutput
 0.00 30.57 0.00 3 0.00 0.00 GPIO_HAL_WritePinOutput
 0.00 30.57 0.00 3 0.00 0.00 PORT_HAL_SetDriveStrengthMode
 0.00 30.57 0.00 3 0.00 0.00 PORT_HAL_SetMuxMode
 0.00 30.57 0.00 3 0.00 0.00 PORT_HAL_SetOpenDrainCmd
 0.00 30.57 0.00 3 0.00 0.00 PORT_HAL_SetSlewRateMode
 0.00 30.57 0.00 3 0.00 0.00 SIM_HAL_EnableClock
 0.00 30.57 0.00 1 0.00 0.00 CLOCK_HAL_GetClkOutStat
 0.00 30.57 0.00 1 0.00 0.00 CLOCK_HAL_GetMcgExternalClkFreq
 0.00 30.57 0.00 1 0.00 0.00 CLOCK_HAL_GetOutClk
 0.00 30.57 0.00 1 0.00 0.00 CLOCK_HAL_GetOutDiv1
 0.00 30.57 0.00 1 0.00 0.00 CLOCK_HAL_GetPll0Clk
 0.00 30.57 0.00 1 0.00 0.00 CLOCK_HAL_GetPll0RefFreq
 0.00 30.57 0.00 1 0.00 0.00 CLOCK_HAL_TestOscFreq
 0.00 30.57 0.00 1 0.00 0.00 CLOCK_SYS_GetCoreClockFreq
 0.00 30.57 0.00 1 0.00 0.00 GPIO_DRV_Init
 0.00 30.57 0.00 1 0.00 0.00 OSA_Init
 0.00 30.57 0.00 1 0.00 0.00 _exit
 0.00 30.57 0.00 1 0.00 29.31 main
 0.00 30.57 0.00 1 0.00 0.00 task_init
 
 % the percentage of the total running time of the
time program used by this function.
 
cumulative a running sum of the number of seconds accounted
 seconds for by this function and those listed above it.
 
 self the number of seconds accounted for by this
seconds function alone. This is the major sort for this
 listing.
 
calls the number of times this function was invoked, if
 this function is profiled, else blank.
  
 self the average number of milliseconds spent in this
ms/call function per call, if this function is profiled,
 else blank.
 
 total the average number of milliseconds spent in this
ms/call function and its descendents per call, if this 
 function is profiled, else blank.
 
name the name of the function. This is the minor sort
 for this listing. The index shows the location of
 the function in the gprof listing. If the index is
 in parenthesis it shows where it would appear in
 the gprof listing if it were to be printed.
 
Copyright (C) 2012 Free Software Foundation, Inc.
 
Copying and distribution of this file, with or without modification,
are permitted in any medium without royalty provided the copyright
notice and this notice are preserved.
 
 Call graph (explanation follows)
 
 
granularity: each sample hit covers 4 byte(s) for 0.00% of 30.57 seconds
 
index % time self children called name
 28.92 0.39 1/1 main [2]
[1] 95.9 28.92 0.39 1 APP_Run [1]
 0.36 0.00 15/15 OSA_TimeDelay [6]
 0.00 0.03 15/15 CheckButton [10]
 0.00 0.00 15/15 GPIO_DRV_TogglePinOutput [12]
 0.00 0.00 3/3 GPIO_DRV_SetPinOutput [16]
 0.00 0.00 1/1 _exit [106]
-----------------------------------------------
 0.00 29.31 1/1 _mainCRTStartup [3]
[2] 95.9 0.00 29.31 1 main [2]
 28.92 0.39 1/1 APP_Run [1]
-----------------------------------------------
 &lt;spontaneous&gt;
[3] 95.9 0.00 29.31 _mainCRTStartup [3]
 0.00 29.31 1/1 main [2]
-----------------------------------------------
 &lt;spontaneous&gt;
[4] 2.3 0.71 0.00 OSA_TimeDiff [4]
-----------------------------------------------
 &lt;spontaneous&gt;
[5] 1.5 0.45 0.00 OSA_TimeGetMsec [5]
-----------------------------------------------
 0.36 0.00 15/15 APP_Run [1]
[6] 1.2 0.36 0.00 15 OSA_TimeDelay [6]
-----------------------------------------------
 &lt;spontaneous&gt;
[7] 0.3 0.09 0.00 _mcount_internal [7]
-----------------------------------------------
 0.00 0.03 15953/15953 CheckButton [10]
[8] 0.1 0.00 0.03 15953 GPIO_DRV_ReadPinInput [8]
 0.03 0.00 15953/15953 GPIO_HAL_ReadPinInput [9]
-----------------------------------------------
 0.03 0.00 15953/15953 GPIO_DRV_ReadPinInput [8]
[9] 0.1 0.03 0.00 15953 GPIO_HAL_ReadPinInput [9]
-----------------------------------------------
 0.00 0.03 15/15 APP_Run [1]
[10] 0.1 0.00 0.03 15 CheckButton [10]
 0.00 0.03 15953/15953 GPIO_DRV_ReadPinInput [8]
-----------------------------------------------
 &lt;spontaneous&gt;
[11] 0.0 0.01 0.00 __gnu_mcount_nc [11]
-----------------------------------------------
 0.00 0.00 15/15 APP_Run [1]
[12] 0.0 0.00 0.00 15 GPIO_DRV_TogglePinOutput [12]
 0.00 0.00 15/15 GPIO_HAL_TogglePinOutput [13]
-----------------------------------------------
 0.00 0.00 15/15 GPIO_DRV_TogglePinOutput [12]
[13] 0.0 0.00 0.00 15 GPIO_HAL_TogglePinOutput [13]
-----------------------------------------------
 0.00 0.00 3/3 GPIO_DRV_OutputPinInit [15]
[14] 0.0 0.00 0.00 3 CLOCK_SYS_EnablePortClock [14]
 0.00 0.00 3/3 SIM_HAL_EnableClock [24]
-----------------------------------------------
 0.00 0.00 3/3 GPIO_DRV_Init [33]
[15] 0.0 0.00 0.00 3 GPIO_DRV_OutputPinInit [15]
 0.00 0.00 3/3 CLOCK_SYS_EnablePortClock [14]
 0.00 0.00 3/3 PORT_HAL_SetMuxMode [21]
 0.00 0.00 3/3 GPIO_HAL_SetPinDir [17]
 0.00 0.00 3/3 GPIO_HAL_WritePinOutput [19]
 0.00 0.00 3/3 PORT_HAL_SetSlewRateMode [23]
 0.00 0.00 3/3 PORT_HAL_SetDriveStrengthMode [20]
 0.00 0.00 3/3 PORT_HAL_SetOpenDrainCmd [22]
-----------------------------------------------
 0.00 0.00 3/3 APP_Run [1]
[16] 0.0 0.00 0.00 3 GPIO_DRV_SetPinOutput [16]
 0.00 0.00 3/3 GPIO_HAL_SetPinOutput [18]
-----------------------------------------------
 0.00 0.00 3/3 GPIO_DRV_OutputPinInit [15]
[17] 0.0 0.00 0.00 3 GPIO_HAL_SetPinDir [17]
-----------------------------------------------
 0.00 0.00 3/3 GPIO_DRV_SetPinOutput [16]
[18] 0.0 0.00 0.00 3 GPIO_HAL_SetPinOutput [18]
-----------------------------------------------
 0.00 0.00 3/3 GPIO_DRV_OutputPinInit [15]
[19] 0.0 0.00 0.00 3 GPIO_HAL_WritePinOutput [19]
-----------------------------------------------
 0.00 0.00 3/3 GPIO_DRV_OutputPinInit [15]
[20] 0.0 0.00 0.00 3 PORT_HAL_SetDriveStrengthMode [20]
-----------------------------------------------
 0.00 0.00 3/3 GPIO_DRV_OutputPinInit [15]
[21] 0.0 0.00 0.00 3 PORT_HAL_SetMuxMode [21]
-----------------------------------------------
 0.00 0.00 3/3 GPIO_DRV_OutputPinInit [15]
[22] 0.0 0.00 0.00 3 PORT_HAL_SetOpenDrainCmd [22]
-----------------------------------------------
 0.00 0.00 3/3 GPIO_DRV_OutputPinInit [15]
[23] 0.0 0.00 0.00 3 PORT_HAL_SetSlewRateMode [23]
-----------------------------------------------
 0.00 0.00 3/3 CLOCK_SYS_EnablePortClock [14]
[24] 0.0 0.00 0.00 3 SIM_HAL_EnableClock [24]
-----------------------------------------------
 0.00 0.00 1/1 CLOCK_HAL_GetOutClk [27]
[25] 0.0 0.00 0.00 1 CLOCK_HAL_GetClkOutStat [25]
-----------------------------------------------
 0.00 0.00 1/1 CLOCK_HAL_GetPll0RefFreq [30]
[26] 0.0 0.00 0.00 1 CLOCK_HAL_GetMcgExternalClkFreq [26]
 0.00 0.00 1/1 CLOCK_HAL_TestOscFreq [31]
-----------------------------------------------
 0.00 0.00 1/1 CLOCK_SYS_GetCoreClockFreq [32]
[27] 0.0 0.00 0.00 1 CLOCK_HAL_GetOutClk [27]
 0.00 0.00 1/1 CLOCK_HAL_GetClkOutStat [25]
 0.00 0.00 1/1 CLOCK_HAL_GetPll0Clk [29]
-----------------------------------------------
 0.00 0.00 1/1 CLOCK_SYS_GetCoreClockFreq [32]
[28] 0.0 0.00 0.00 1 CLOCK_HAL_GetOutDiv1 [28]
-----------------------------------------------
 0.00 0.00 1/1 CLOCK_HAL_GetOutClk [27]
[29] 0.0 0.00 0.00 1 CLOCK_HAL_GetPll0Clk [29]
 0.00 0.00 1/1 CLOCK_HAL_GetPll0RefFreq [30]
-----------------------------------------------
 0.00 0.00 1/1 CLOCK_HAL_GetPll0Clk [29]
[30] 0.0 0.00 0.00 1 CLOCK_HAL_GetPll0RefFreq [30]
 0.00 0.00 1/1 CLOCK_HAL_GetMcgExternalClkFreq [26]
-----------------------------------------------
 0.00 0.00 1/1 CLOCK_HAL_GetMcgExternalClkFreq [26]
[31] 0.0 0.00 0.00 1 CLOCK_HAL_TestOscFreq [31]
-----------------------------------------------
 0.00 0.00 1/1 CLOCK_SYS_GetSystickFreq [42]
[32] 0.0 0.00 0.00 1 CLOCK_SYS_GetCoreClockFreq [32]
 0.00 0.00 1/1 CLOCK_HAL_GetOutClk [27]
 0.00 0.00 1/1 CLOCK_HAL_GetOutDiv1 [28]
-----------------------------------------------
 0.00 0.00 1/1 Components_Init [43]
[33] 0.0 0.00 0.00 1 GPIO_DRV_Init [33]
 0.00 0.00 3/3 GPIO_DRV_OutputPinInit [15]
-----------------------------------------------
 0.00 0.00 1/1 PE_low_level_init [55]
[34] 0.0 0.00 0.00 1 OSA_Init [34]
 0.00 0.00 1/1 task_init [35]
-----------------------------------------------
 0.00 0.00 1/1 OSA_Init [34]
[35] 0.0 0.00 0.00 1 task_init [35]
-----------------------------------------------
 0.00 0.00 1/1 APP_Run [1]
[106] 0.0 0.00 0.00 1 _exit [106]
-----------------------------------------------
 
 This table describes the call tree of the program, and was sorted by
 the total amount of time spent in each function and its children.
 
 Each entry in this table consists of several lines. The line with the
 index number at the left hand margin lists the current function.
 The lines above it list the functions that called this function,
 and the lines below it list the functions this one called.
 This line lists:
 index A unique number given to each element of the table.
 Index numbers are sorted numerically.
 The index number is printed next to every function name so
 it is easier to look up where the function is in the table.
 
 % time This is the percentage of the `total' time that was spent
 in this function and its children. Note that due to
 different viewpoints, functions excluded by options, etc,
 these numbers will NOT add up to 100%.
 
 self This is the total amount of time spent in this function.
 
 children This is the total amount of time propagated into this
 function by its children.
 
 called This is the number of times the function was called.
 If the function called itself recursively, the number
 only includes non-recursive calls, and is followed by
 a `+' and the number of recursive calls.
 
 name The name of the current function. The index number is
 printed after it. If the function is a member of a
 cycle, the cycle number is printed between the
 function's name and the index number.
 
 
 For the function's parents, the fields have the following meanings:
 
 self This is the amount of time that was propagated directly
 from the function into this parent.
 
 children This is the amount of time that was propagated from
 the function's children into this parent.
 
 called This is the number of times this parent called the
 function `/' the total number of times the function
 was called. Recursive calls to the function are not
 included in the number after the `/'.
 
 name This is the name of the parent. The parent's index
 number is printed after it. If the parent is a
 member of a cycle, the cycle number is printed between
 the name and the index number.
 
 If the parents of the function cannot be determined, the word
 `&lt;spontaneous&gt;' is printed in the `name' field, and all the other
 fields are blank.
 
 For the function's children, the fields have the following meanings:
 
 self This is the amount of time that was propagated directly
 from the child into the function.
 
 children This is the amount of time that was propagated from the
 child's children to the function.
 
 called This is the number of times the function called
 this child `/' the total number of times the child
 was called. Recursive calls by the child are not
 listed in the number after the `/'.
 
 name This is the name of the child. The child's index
 number is printed after it. If the child is a
 member of a cycle, the cycle number is printed
 between the name and the index number.
 
 If there are any cycles (circles) in the call graph, there is an
 entry for the cycle-as-a-whole. This entry shows who called the
 cycle (as parents) and the members of the cycle (as children.)
 The `+' recursive calls entry shows the number of function calls that
 were internal to the cycle, and the calls entry for each member shows,
 for that member, how many times it was called from other members of
 the cycle.
 
Copyright (C) 2012 Free Software Foundation, Inc.
 
Copying and distribution of this file, with or without modification,
are permitted in any medium without royalty provided the copyright
notice and this notice are preserved.
 
Index by function name
 
 [1] APP_Run [15] GPIO_DRV_OutputPinInit [5] OSA_TimeGetMsec
 [25] CLOCK_HAL_GetClkOutStat (fsl_mcg_hal.h) [8] GPIO_DRV_ReadPinInput [20] PORT_HAL_SetDriveStrengthMode (fsl_port_hal.h)
 [26] CLOCK_HAL_GetMcgExternalClkFreq (fsl_mcg_hal.c) [16] GPIO_DRV_SetPinOutput [21] PORT_HAL_SetMuxMode (fsl_port_hal.h)
 [27] CLOCK_HAL_GetOutClk [12] GPIO_DRV_TogglePinOutput [22] PORT_HAL_SetOpenDrainCmd (fsl_port_hal.h)
 [28] CLOCK_HAL_GetOutDiv1 (fsl_sim_hal_mk64f12.h) [9] GPIO_HAL_ReadPinInput (fsl_gpio_hal.h) [23] PORT_HAL_SetSlewRateMode (fsl_port_hal.h)
 [29] CLOCK_HAL_GetPll0Clk [17] GPIO_HAL_SetPinDir [24] SIM_HAL_EnableClock (fsl_sim_hal_mk64f12.h)
 [30] CLOCK_HAL_GetPll0RefFreq [18] GPIO_HAL_SetPinOutput (fsl_gpio_hal.h) [11] __gnu_mcount_nc
 [31] CLOCK_HAL_TestOscFreq [13] GPIO_HAL_TogglePinOutput (fsl_gpio_hal.h) [106] _exit
 [14] CLOCK_SYS_EnablePortClock [19] GPIO_HAL_WritePinOutput [7] _mcount_internal
 [32] CLOCK_SYS_GetCoreClockFreq [34] OSA_Init [2] main
 [10] CheckButton (Application.c) [6] OSA_TimeDelay [35] task_init
 [33] GPIO_DRV_Init [4] OSA_TimeDiff

Using gprof in Eclipse

The command line version of gprof is powerful. Another way is to use gprof in Eclipse. There is only a subtle problem: gprof and other GNU tools need to be present in the PATH, *without* the architecture (arm-none-eabi) prefix.

Both gcov (see “Code Coverage with gcov, launchpad tools and Eclipse Kinetis Design Studio V3.0.0“) and gprof need to have the tools in the PATH that way.

To keep toolchains and Eclipse IDE’s separate, I’m reluctant to modify the Windows PATH globally as this affects *everything*. And using a non-matching gprof with a different GNU toolchain (mixing GNU versions) can have strange effects.

Instead, I have a special batch file which

  1. Configures the PATH locally in a CMD shell (addPath.bat)
  2. Ensures that the necessary GNU tools without arm-none-eabi name are present (checkGNUbin.bat)
  3. Launches Eclipse from the environment (runEclipseGcovGprof.bat)

addPath.bat has the following content:

@ECHO off
ECHO Adding GNU tools to PATH
SET KDSBIN=C:\Freescale\KDS_3.0.0\bin
SET KDSTOOLBIN=C:\Freescale\KDS_3.0.0\toolchain\bin
ECHO %PATH%|findstr /i /c:"%KDSBIN:"=%"&gt;nul || set PATH=%PATH%;%KDSBIN%
ECHO %PATH%|findstr /i /c:"%KDSTOOLBIN:"=%"&gt;nul || set PATH=%PATH%;%KDSTOOLBIN%
ECHO %PATH%

checkGNUbin.bat has the following content:

@ECHO OFF
ECHO Checking GNU Binaries and making copies as necessary
SET KDS3=C:\Freescale\KDS_3.0.0
IF NOT EXIST "%KDS3%\toolchain\bin\gcov.exe" COPY "%KDS3%\toolchain\bin\arm-none-eabi-gcov.exe" "%KDS3%\toolchain\bin\gcov.exe"
IF NOT EXIST "%KDS3%\toolchain\bin\addr2line.exe" COPY "%KDS3%\toolchain\bin\arm-none-eabi-addr2line.exe" "%KDS3%\toolchain\bin\addr2line.exe"
IF NOT EXIST "%KDS3%\toolchain\bin\nm.exe" COPY "%KDS3%\toolchain\bin\arm-none-eabi-nm.exe" "%KDS3%\toolchain\bin\nm.exe"
IF NOT EXIST "%KDS3%\toolchain\bin\c++filt.exe" COPY "%KDS3%\toolchain\bin\arm-none-eabi-c++filt.exe" "%KDS3%\toolchain\bin\c++filt.exe"
IF NOT EXIST "%KDS3%\toolchain\bin\strings.exe" COPY "%KDS3%\toolchain\bin\arm-none-eabi-strings.exe" "%KDS3%\toolchain\bin\strings.exe"
IF NOT EXIST "%KDS3%\toolchain\bin\gdb.exe" COPY "%KDS3%\toolchain\bin\arm-none-eabi-gdb.exe" "%KDS3%\toolchain\bin\gdb.exe"
IF NOT EXIST "%KDS3%\toolchain\bin\gprof.exe" COPY "%KDS3%\toolchain\bin\arm-none-eabi-gprof.exe" "%KDS3%\toolchain\bin\gprof.exe"
ECHO Done!

And here is the content of runEclipseGcovGprof.bat:

REM Run Eclipse with gcov environment
 
REM Check that path setup is correct
call addPath.bat
 
REM Check that GNU tools are present
call checkGNUbin.bat
 
REM launch Eclipse
C:\Freescale\KDS_3.0.0\eclipse\kinetis-design-studio.exe

With an Eclipse launched like this, I can double click on the .out file in Eclipse and it will ask me for the .elf file to be used with that gmon.out file:

Launching gprof in Eclipse

Launching gprof in Eclipse

This then will open a nice graphical view (sorted by file):

gprof View in Eclipse

gprof View in Eclipse

I have different ways of sorting, for example I can sort it by function calls:

gprof Function Call Graph

gprof Function Call Graph

The columns show the collected information:

  • Samples: How many times it has been sampled (PC histogram).
  • Calls: how many calls to that function has been counted.
  • Time/Call: average time per call.
  • % Time: Percentage of time spent. The sum of all profiled functions are summed up to 100%.

As I can see from above example screenshot: I’m using a sample frequency of 1 kHz (“each sample counts as 1.000 ms” written in the header). GPIO_HAL_ReadPinInput() has been sampled 26 times, but has been called 15953 times. That function has been called from his parent function GPIO_DRV_readPinInput(), but because the parent is a very small/short function, it has been never sampled (Samples count zero).

So this shows that short functions might not be sampled. Basically I have a good chance to sample functions which take longer than the PC sampling period (in my case here 1 ms). So the sampling time is statistical and should be read with care. But the number of calls are counted in an exact way.

The gprof Eclipse view can visualize the data in multiple way, e.g. producing graphs and charts: select lines in the table view of gprof and use the ‘Create chart…’ button:

grpof chart

grpof chart

Summary

Using gprof to profile an application gives useful information about where the application spends most of the time. It is important to understand how gprof works to correctly use the data generated. Using gprof requires a considerable amount of RAM which needs to be taken into account during the design. Understanding the way how gprof works helps to read the data and helps to find the ‘hot spots’ and to optimize the application. Gprof is more targeted to host or embedded Linux applications where lots of RAM is available. But as I can instrument parts of the application and with the help of semihosting, it is very applicable to embedded targets having at a few kByte of RAM available for this kind of thing. And gathering this kind of information is always better than not knowing anything ;-).

The GCC ARM Embedded toolchain does not come with gprof enabled libraries. I have solved the problem with providing my own profiling implementation and libraries, inspired by the i386 cygwin libraries present in the GCC ARM Embedded libraries source files.

The project and source files discussed in this article can be found on GitHub:https://github.com/ErichStyger/mcuoneclipse/tree/master/Examples/KDS/FRDM-K64F120M/FRDM-K64F_Profiling

I’m going to present this research and approach at the Embedded Computing Conference 2015.

Happy Profiling :-)

Topics:
iot ,ARM Cortex-M ,gnu ,gprof

Published at DZone with permission of Erich Styger, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

The best of DZone straight to your inbox.

SEE AN EXAMPLE
Please provide a valid email address.

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.
Subscribe

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}