This post dives into image quantization as implemented in the TensorFlow Lite iOS camera example application (as of March 2019). Instruments is used to profile performance. If you are not familiar with Instruments (part of Xcode), check out this Instruments Overview.

Previous Implementation

Image quantization is an important technique for preparing an image for a machine learning model in resource-constrained environments. The source image is downsampled and transformed into a simpler representation.

In the TensorFlow Lite iOS camera example, the operation produces a 224x224 downsampled image with uint8_t RGB components per pixel.

// Source modified for readability
for (int y = 0; y < out_height; ++y) {
    uint8_t* out_row = output + (y * out_width * out_channels);

    for (int x = 0; x < out_width; ++x) {

        const int in_x = (y * im_width) / out_width;
        const int in_y = (x * im_height) / out_height;

        uint8_t* in_pixel = input + (in_y * im_width * im_channels);
        in_pixel += (in_x * im_channels);

        uint8_t* out_pixel = out_row + (x * out_channels);

        for (int c = 0; c < out_channels; ++c) {
            out_pixel[c] = in_pixel[c];
        }
    }
}

Identifying The Performance Improvement

The Counters instrument was used to identify this optimization opportunity. If you are not familiar with Counters, or with setting up Counters formulas for L2 Cache Misses, check out: Use Counters In Instruments To Find Performance Improvements.

When profiling on an iPad with an A10 chip, the following aggregate sample was observed:

Time      Time %   IPC     Branch Miss   L2 Cache Miss   Symbol
296.0ms   9.1%     0.723   2.846%        51.987%         -[CameraExample runModelOnFrame:]

Multiple items stand out, particularly the significant L2 Cache Miss percentage, which suggests that significant gains are available. Upon inspection, the ProcessInputWithQuantizedModel function and interpreter->Invoke() were found to require the most compute.

Looking deeper at the quantization function, notice the order of iteration:

for (int y = 0; y < out_height; ++y) {
    for (int x = 0; x < out_width; ++x) {
        // ...
    }
}

Although the loop order seems correct, the variable names are misleading. Inside the loop, the following indices are computed:

const int in_x = (y * im_width) / out_width;
const int in_y = (x * im_height) / out_height;

Here, x and y are flipped, indicating that the loop iterates over columns rather than rows: each increment of the inner x index advances in_y by one or more full rows, a jump of im_width * im_channels bytes or more per access. Since the memory layout of the input matrix is row-major, reversing the order of iteration takes advantage of better spatial locality and thus decreases cache misses.
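
To see the effect in isolation, here is a minimal standalone sketch (not from the example app; the buffer dimensions are made up) contrasting the two traversal orders over the same row-major interleaved buffer:

#include <cstdint>

// Illustrative dimensions only; any row-major interleaved buffer behaves the same
constexpr int width = 1920, height = 1080, channels = 4;

// Row-major walk: consecutive iterations touch adjacent memory, so each
// cache line that is fetched gets fully used before it is evicted
uint64_t SumRowMajor(const uint8_t* buf) {
    uint64_t sum = 0;
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            sum += buf[(y * width + x) * channels];
    return sum;
}

// Column-major walk: each step jumps width * channels bytes ahead, landing
// on a different cache line almost every access
uint64_t SumColumnMajor(const uint8_t* buf) {
    uint64_t sum = 0;
    for (int x = 0; x < width; ++x)
        for (int y = 0; y < height; ++y)
            sum += buf[(y * width + x) * channels];
    return sum;
}

Both functions compute the same result; only the memory access pattern differs, which is exactly what the L2 Cache Miss counter surfaces.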

Further, a significant number of operations occur inside the inner loop. Moving the loop-invariant computations outside of the loop reduces the number of operations needed to complete the quantization transformation.
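
As a simplified sketch of the idea (reusing the variable names above, not the app's actual code), the row offset below depends only on y, so it can be hoisted out of the inner loop and computed once per row instead of once per pixel:

// Before: the row offset is recomputed for every pixel
for (int y = 0; y < out_height; ++y) {
    for (int x = 0; x < out_width; ++x) {
        uint8_t* px = output + (y * out_width + x) * out_channels;
        // ... write to px ...
    }
}

// After: the part that only depends on y is computed once per row
for (int y = 0; y < out_height; ++y) {
    uint8_t* row = output + (y * out_width * out_channels);
    for (int x = 0; x < out_width; ++x) {
        uint8_t* px = row + (x * out_channels);
        // ... write to px ...
    }
}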

Faster Implementation

By moving loop-invariant operations outside of the loop and taking better advantage of spatial locality, the function can be optimized to run almost 6x faster (296.0ms down to 51.0ms):

Time      Time %   IPC     Branch Miss   L2 Cache Miss   Symbol
51.0ms    1.6%     0.678   4.872%        17.904%         -[CameraExample runModelOnFrame:]

A note: the image format used in the TensorFlow Lite iOS camera example comes from AVFoundation, specifically an AVCaptureSession configured to output kCMPixelFormat_32BGRA. The optimized code presented here not only significantly improves performance, but also (depending on how the model was trained) potentially fixes a bug: the previous implementation did not convert the BGRA input into RGB output.

// Determine the starting offsets and strides so the sampling grid is
// centered, stays consistent during downsampling, and leaves less for
// the loop body to compute
const int width_start = (im_width % out_width) / 2;
const int height_start = (im_height % out_height) / 2;

const int width_stride = (im_width / out_width) * im_channels;
const int height_stride = im_height / out_height;

// Iterate by offsetting a stored pointer to help reduce loop computation
uint8_t* out_pixel = output;

uint8_t* in_start = input + (height_start * im_width * im_channels) + (width_start * im_channels);
uint8_t* in_pixel;

// Precompute the byte size of height_stride input rows for faster loop computation
const int row_size = height_stride * im_width * im_channels;

// Iterate over the input and store in the output; the outer loop walks
// output rows and the inner loop walks output columns, matching the
// row-major memory layout
for (int y = 0; y < out_height; ++y) {
    in_pixel = in_start + (y * row_size);

    for (int x = 0; x < out_width; ++x) {

        // The output channels are hard-coded, so eliminate the loop in
        // favor of faster direct assignment; reversing the channel order
        // also converts the BGRA input to RGB output
        out_pixel[0] = in_pixel[2];
        out_pixel[1] = in_pixel[1];
        out_pixel[2] = in_pixel[0];

        // Move to the next pixel mapping
        in_pixel += width_stride;
        out_pixel += out_channels;
    }
}
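
As a sanity check, the optimized loop can be wrapped in a standalone function and run against a synthetic frame. Everything below (the QuantizeBGRAToRGB name, the 1280x720 input size) is an illustrative sketch rather than code from the example app:

#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative wrapper around the optimized loop above
void QuantizeBGRAToRGB(const uint8_t* input, int im_width, int im_height,
                       uint8_t* output, int out_width, int out_height) {
    const int im_channels = 4;   // BGRA in
    const int out_channels = 3;  // RGB out

    const int width_start = (im_width % out_width) / 2;
    const int height_start = (im_height % out_height) / 2;
    const int width_stride = (im_width / out_width) * im_channels;
    const int height_stride = im_height / out_height;
    const int row_size = height_stride * im_width * im_channels;

    const uint8_t* in_start =
        input + (height_start * im_width + width_start) * im_channels;
    uint8_t* out_pixel = output;

    for (int y = 0; y < out_height; ++y) {
        const uint8_t* in_pixel = in_start + (y * row_size);
        for (int x = 0; x < out_width; ++x) {
            out_pixel[0] = in_pixel[2];  // R
            out_pixel[1] = in_pixel[1];  // G
            out_pixel[2] = in_pixel[0];  // B
            in_pixel += width_stride;
            out_pixel += out_channels;
        }
    }
}

int main() {
    const int im_w = 1280, im_h = 720, out_w = 224, out_h = 224;

    // Fill a synthetic BGRA frame with a recognizable pattern: B=1, G=2, R=3, A=4
    std::vector<uint8_t> in(im_w * im_h * 4);
    for (size_t i = 0; i < in.size(); i += 4) {
        in[i] = 1; in[i + 1] = 2; in[i + 2] = 3; in[i + 3] = 4;
    }

    std::vector<uint8_t> out(out_w * out_h * 3);
    QuantizeBGRAToRGB(in.data(), im_w, im_h, out.data(), out_w, out_h);

    // Every output pixel should now read R=3, G=2, B=1
    assert(out[0] == 3 && out[1] == 2 && out[2] == 1);
    return 0;
}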