This post walks through Counters, an Instruments tool to profile low-level chip events on Apple devices. With the right configuration, Counters can help you quickly and reliably find performance improvements in apps.
Counters is an Instruments tool for profiling low-level chip events. For example,
the INST_BRANCH event can be added to the Counters tool to count the number of branches executed.
Unlike other Instruments tools, Counters requires some configuration to provide valuable insights. Further, the events Counters profiles are hardware-specific. This means the chip event options available on an A10 chip inside of an iPad will differ from the chip events available on an A12 chip inside of an iPhone Xs.
To configure Counters, select
Recording Options from the Instruments navigation menu. A menu will appear with configuration options for Counters:

For the purposes of this post, sampling by Time will be selected. Using the + button, you can add specific events that Counters can count on the particular CPU currently connected to Instruments. With
INST_BRANCH selected, the performance profile may look something like this:
Creating Formulas Using Counters
The number of branches executed on the chip is not, by itself, enough information to find performance issues. We also need the number of missed branches to compute the % of branches missed.
The best way to get high-value performance profiles from Counters is to use formulas. Formulas use events to compute a numerical result, for example, the % of branches missed on the chip. To configure a formula, select the ⚙ icon and then Create Formula.
Branch misprediction is one metric for how efficiently code executes. Missed branches are expensive: each one can stall the processor's pipeline. Reducing the % of missed branches can greatly improve performance.
To count the % of branches missed, divide the number of missed branches by the total number of branches, then multiply by 100. Thus, enter
100 * (SYNC_BR_ANY_MISP / INST_BRANCH) into the formula box like so:
Instructions Per Cycle (IPC)
Counters can track multiple formulas at once. Instructions per cycle (IPC) is an important metric to determine processing efficiency. The greater the number of instructions per cycle that an app executes, the more efficiently the app is using the CPU.
The formula for IPC is the number of instructions divided by the number of cycles. In other words:
FIXED_INSTRUCTIONS / FIXED_CYCLES.
L2 Cache Misses
Modern CPUs have multiple levels of caching on the chip, often an L1, L2, and L3 cache. Since Counters can only count the events exposed by the chip, only L2 events are available on an A10 chip. Because these caches help the CPU access data efficiently, reducing the number of cache misses can greatly increase performance.
The formula for the L2 cache miss rate is the number of missed stores and loads in the L2 cache, divided by the total number of stores and loads. To get a percentage, multiply by 100. Thus:
100 * ((L2C_AGENT_ST_MISS + L2C_AGENT_LD_MISS) / (L2C_AGENT_ST + L2C_AGENT_LD)).
Be careful: the names of these events may be different (or may not exist at all) depending on the chip you are profiling.
Finding Performance Improvements Using Counters
With these three formulas configured, a profile of an application may look like this:
To quickly find performance opportunities, first select Invert Call Tree under the Call Tree menu at the bottom of the Instruments window. This inverts the collected stack traces so the innermost functions, where the time is actually spent, appear at the top level.
Then, sort the list by Running Time and look for the following:
- IPC below 2.5
- Branch Misprediction % greater than 2%
- L2 Cache Misses greater than 5%
These rules are not set in stone, and the right thresholds may depend on the specifics of your application. In general, though, these three checks provide a baseline guide for finding performance opportunities.
An example of what a great performance opportunity can look like is:
| Time | Time % | IPC | Branch Miss | L2 Cache Miss | Symbol |
|------|--------|-----|-------------|---------------|--------|
Counters makes it easy to spot functions and call trees with abnormal performance characteristics. In fact, this function can be optimized to run 600% faster by reducing cache misses through spatial locality. You can read more about that in another blog post: Fast Image Quantization For Tensorflow Lite.