kn2row / kn2col Convolution

In this article, we will cover two convolution methods, Kn2row and Kn2col, which are space-efficient alternatives to Im2row and Im2col.

Table of Contents:

  1. Introduction
  2. Kn2row and Kn2col Convolution
    2.1 Kn2row Algorithm
    2.2 Kn2col Algorithm
  3. Differences
  4. Memory Usage
  5. GEMM Call for Kn2row
  6. GEMM Call for Kn2col
  7. Implementation of kn2row convolution
  8. Implementation of kn2col convolution
  9. Quick Summary of all types of Convolution methods

Introduction:

Convolution is the process of applying a filter to an input to produce an activation. Repeatedly applying the same filter across an input, such as an image, produces a feature map that shows the position and strength of a detected feature in that input.

What makes convolutional neural networks unique is their capacity to learn a large number of filters in parallel, specific to a training dataset, under the constraints of a given predictive modelling task such as image classification. As a result, highly specific features can be detected anywhere in the input images.

Kn2row and Kn2col Convolution:

Kn2row: Also known as kernel-to-row, this approach is based on the fact that a KxK convolution can be computed by shifting and adding the partial outputs of K^2 separate 1x1 convolutions. This requires an additional buffer of (K^2 − 1) x M x H x W values (see the Memory Usage section below).

Kn2col: Also known as kernel-to-column, this is the transposed counterpart of the kn2row approach: the roles of the kernel and input matrices in the GEMM are swapped.

We know that a 1 × 1 MCMK (multi-channel, multi-kernel convolution) can be computed without data replication, but how can one implement a K × K MCMK for K > 1?
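
In other words, a 1 × 1 MCMK is just a single GEMM between the kernel matrix and the input matrix:

[M] × [C] x [C] × [H × W]
= [M] × [H × W]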

A K × K convolution can be expressed as the sum of K^2 separate 1 × 1 convolutions. The sum, however, is not trivial to compute. Each 1 × 1 convolution produces a result matrix with the dimensions [M] x [H x W]. Because each resulting matrix corresponds to a different kernel value in the K x K kernel, we cannot simply add the matrices pointwise. Instead, before the addition, every pixel in every channel of the multi-channel representation of each matrix must be offset vertically and/or horizontally (by row and column offsets) by one or more places.

For example, when computing a 3×3 convolution, the result of computing the 1 × 1 MCMK for the central point of the 3 × 3 kernel is perfectly aligned with the final sum matrix. On the other hand, the matrix that results from computing the 1×1 MCMK for the upper-left value of the 3×3 kernel must be offset up by one place and left by one place (in its multi-channel representation) before being added to the final sum that computes the 3 × 3 MCMK. Note that when the intermediate results of the 1 × 1 convolutions are offset, some values of the offset matrix fall outside the boundaries of the final result matrix. These out-of-bounds values are simply discarded when computing the sum of 1 × 1 convolutions.

Let’s see the Kn2row Algorithm:

Each of the K^2 separate 1 x 1 convolutions could be computed with its own matrix multiplication, but they can all be computed with a single GEMM. We reorganise the kernel matrix so that the channel data is laid out appropriately, with M as the outer dimension and C as the inner dimension. This data reorganisation can be done statically, ahead of time, and reused for all subsequent MCMK invocations. We then multiply the [K^2xM]x[C] kernel matrix by the [C]x[HxW] input matrix with a single call to GEMM, resulting in a [K^2xM]x[HxW] matrix. In a post pass of shift-add, working in the multi-channel representation, each of the K^2 submatrices of size [M] x [H x W] is summed into the output with the appropriate row and column offsets. The result of this sum, an [M] x [H x W] matrix, is the output of our MCMK method. This is the kn2row algorithm.
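
As a rough illustration of the shift-add post pass (a simplified sketch, not the implementation given later in this article), the function below accumulates the K^2 submatrices of a [K^2 x M] x [H x W] GEMM result into an [M] x [H x W] output. It assumes same padding and one common sign convention, where the submatrix for kernel position (kr, kc) is read at an offset of (kr − K/2, kc − K/2); the sign convention in a real implementation depends on how it defines padding and kernel orientation. Out-of-bounds values are simply discarded.

#include <stddef.h>  /* for size_t */

// Minimal kn2row post-pass sketch: sum K*K shifted [M] x [H*W] submatrices.
// gemm_out : [K*K*M] x [H*W], row-major (result of the single GEMM).
// out      : [M] x [H*W], row-major, zero-initialised by the caller.
void kn2row_shift_add(const float *gemm_out, float *out,
                      int K, int M, int H, int W) {
  for (int kr = 0; kr < K; ++kr) {
    for (int kc = 0; kc < K; ++kc) {
      int row_shift = kr - K / 2;                  // vertical offset of this kernel tap
      int col_shift = kc - K / 2;                  // horizontal offset of this kernel tap
      const float *sub = gemm_out + (size_t)(kr * K + kc) * M * H * W;
      for (int m = 0; m < M; ++m) {
        for (int r = 0; r < H; ++r) {
          int src_r = r + row_shift;
          if (src_r < 0 || src_r >= H) continue;   // out-of-bounds rows are discarded
          for (int c = 0; c < W; ++c) {
            int src_c = c + col_shift;
            if (src_c < 0 || src_c >= W) continue; // out-of-bounds columns are discarded
            out[((size_t)m * H + r) * W + c] +=
                sub[((size_t)m * H + src_r) * W + src_c];
          }
        }
      }
    }
  }
}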

Let’s see the Kn2col Algorithm:

To obtain the kn2col algorithm, we swap the dimensions of the kernel matrix so that C is no longer the innermost dimension, and swap the input layout so that C becomes the innermost dimension.

Differences:

The kn2row approach exhibits a higher degree of locality. Across all of the networks, kn2row produces a significantly narrower dispersion of performance between layers, most notably in AlexNet. Another noteworthy finding is that the performance of both kn2col and kn2row increases with C and M while decreasing with kernel size.

Memory Usage:

When evaluating these algorithms, one may find that for some of the chosen input dimensions the board intermittently runs out of memory while executing the O(K^2) algorithms. The methods require extra memory; for the two covered here, the additional space is as follows:

Kn2row:

(K^2 − 1) × (M × (HW))

Kn2col:

(K^2 − 1) × (M × (HW))
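
To make this concrete, consider a hypothetical layer with K = 3, M = 64 and a 56 × 56 feature map: the extra buffer is (3^2 − 1) × 64 × 56 × 56 = 1,605,632 values, or roughly 6.4 MB in single precision (4 bytes per value). For comparison, the im2col input-patch matrix for the same layer grows as K^2 × C × H × W.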

Now let's look at the GEMM (General Matrix Multiplication) call for each of the two convolution methods.

GEMM Call for Kn2row:

[K^2×M]×[C]  // kernel matrix
[C]×[H×W] // input matrix

We multiply the kernel matrix by the input matrix:

[K^2×M]×[C] x [C]×[H×W]
= [K^2×M]×[H×W]

Then we perform a post pass of shift-add, summing each of the K^2 submatrices of size [M] × [H × W]:

= [M] × [H × W] // output of our MCMK algorithm
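
As a sketch, assuming the kernel has already been reshaped into a row-major [K^2×M]×[C] matrix and the input is a row-major [C]×[H×W] matrix, this single GEMM could be issued through CBLAS as follows (the function and variable names here are illustrative, not part of the original source):

#include <cblas.h>

// kn2row: one GEMM computes all K*K 1x1 convolutions at once.
// kernel   : (K*K*M) x C      row-major
// input    : C x (H*W)        row-major
// gemm_out : (K*K*M) x (H*W)  row-major (written by this call)
static void kn2row_gemm(const float *kernel, const float *input,
                        float *gemm_out, int K, int M, int C, int H, int W) {
  cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
              K * K * M,   /* rows of the result        */
              H * W,       /* columns of the result     */
              C,           /* shared (inner) dimension  */
              1.0f, kernel, C,
              input, H * W,
              0.0f, gemm_out, H * W);
}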

GEMM Call for Kn2col:

[H×W]×[C] // input matrix
[C]×[K^2×M] // kernel matrix

We multiply the input matrix by the kernel matrix:

[H×W]×[C] x [C]×[K^2×M]
= [H×W]×[K^2×M]

A shift-add post pass over the K^2 submatrices of size [H × W] × [M] then produces the [H × W] × [M] output, i.e. the same result as kn2row but in a channels-last layout.
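
Similarly, assuming the kernel is stored as a row-major [C]×[K^2×M] matrix while the input keeps its usual [C]×[H×W] layout, the kn2col GEMM can be expressed by transposing the input operand (again, the function and variable names are illustrative):

#include <cblas.h>

// kn2col: (H*W) x C times C x (K*K*M), with the input used transposed.
// input    : C x (H*W)        row-major (op(A) = input^T is (H*W) x C)
// kernel   : C x (K*K*M)      row-major
// gemm_out : (H*W) x (K*K*M)  row-major (written by this call)
static void kn2col_gemm(const float *input, const float *kernel,
                        float *gemm_out, int K, int M, int C, int H, int W) {
  cblas_sgemm(CblasRowMajor, CblasTrans, CblasNoTrans,
              H * W,       /* rows of the result        */
              K * K * M,   /* columns of the result     */
              C,           /* shared (inner) dimension  */
              1.0f, input, H * W,
              kernel, K * K * M,
              0.0f, gemm_out, K * K * M);
}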

Implementation of the Kn2row Convolution method:

Let's look at an implementation of kn2row convolution in C. Note that, instead of one large [K^2×M]×[C] GEMM, this implementation issues K^2 separate GEMM calls (one 1x1 convolution per kernel position) and shift-adds each result into the output; the end result is the same.

#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <cblas.h>
#include "common_types.h"
#include "data_reshape.h"
#include "utils.h"

//
// col_shift : +ve --> shift the overlap matrix left, -ve --> shift it right,
//             or shift the base matrix left and keep the overlap matrix as it is.
//
// row_shift : +ve (coefficient is below the centre coefficient) --> shift the overlap
//             matrix up, -ve --> shift it down, or shift the base matrix up.

void MatrixShiftAdd(float *base_mat,
                     int base_no_rows, int base_no_cols,
                     float *overlap_mat,
                     int ov_no_rows, int ov_no_cols,
                     int row_shift, int col_shift) {
  if (row_shift == 0 && col_shift == 0 && (base_no_rows == ov_no_rows) &&
      (base_no_cols == ov_no_cols)) {
    // normal matrix add
    cblas_saxpy(base_no_rows * base_no_cols, 1.0, overlap_mat, 1, base_mat, 1);
    return;
  }
  int rows_to_add, cols_to_add;
  int base_row_start, base_col_start;
  int ov_row_start, ov_col_start;
  // without padding case
  if (ov_no_rows > base_no_rows) {
    rows_to_add = base_no_rows;
    cols_to_add = base_no_cols;
    base_row_start = 0;
    base_col_start = 0;
    ov_row_start = row_shift < 0? -row_shift : 0;
    ov_col_start = col_shift < 0? -col_shift : 0;

  } else {
    rows_to_add = ov_no_rows - abs(row_shift);
    cols_to_add = ov_no_cols - abs(col_shift);

    ov_col_start = col_shift > 0? col_shift : 0;
    ov_row_start = row_shift > 0? row_shift : 0;
    base_row_start = row_shift < 0? -row_shift : 0;
    base_col_start = col_shift < 0? -col_shift : 0;
  }

  for (int r = 0; r < rows_to_add; ++r) {
    int base_mat_offset = (r + base_row_start) * base_no_cols + base_col_start;
    int overlap_mat_offset = (r + ov_row_start) * ov_no_cols + ov_col_start;
    cblas_saxpy(cols_to_add, 1.0, overlap_mat + overlap_mat_offset, 1,
                base_mat + base_mat_offset, 1);
  }
}

bool Kn2RowConvLayer(const float *in_data, const float *filters,
                         const float *bias, TensorDim in_dim,
                         TensorDim filt_dim, int stride, int pad, int group,
                         float *output) {

  assert(group == 1);
  assert((pad == 0) || (pad == filt_dim.w / 2));
  assert(in_dim.n == 1);
  assert(filt_dim.h == filt_dim.w);
  assert(stride == 1);

  // Output dimensions.
  TensorDim out_dim;
  out_dim.w = (in_dim.w + (pad + pad) - filt_dim.w) / stride + 1;
  out_dim.h = (in_dim.h + (pad + pad) - filt_dim.h) / stride + 1;
  out_dim.c = filt_dim.n;
  out_dim.n = in_dim.n;

  // Re-arrange filters in the  k x k x no_out_maps x no_in_maps.
  // We can avoid this if the filters are already reshaped in this format.
  float *kkmc_filters = malloc(filt_dim.n * filt_dim.c * filt_dim.h *
                               filt_dim.w * sizeof(float));
  NCHW2HWNC(filters, filt_dim.n, filt_dim.c, filt_dim.h, filt_dim.w,
            kkmc_filters);

  int H = in_dim.h;
  int W = in_dim.w;
  float alpha = 1.0;
  float beta = 0.0;

  // We need separate buffer because GEMM output will have width = H*W even
  // if there is no padding (pad = 0).
  
  float *gemm_output = malloc(out_dim.c * H * W * sizeof(float));

  // Prefill output buffer with bias if present else set to zero.
  
  if (bias) {
    for (int m = 0; m < out_dim.c; ++m) {
      for (int a = 0; a < out_dim.h * out_dim.w; ++a) {
        output[m * out_dim.h * out_dim.w + a] = bias[m];
      }
      // For batch size > 1
      for (int b = 1; b < out_dim.n; ++b) {
        memcpy(output + b * out_dim.c * out_dim.h * out_dim.w,
               output, out_dim.c * out_dim.h * out_dim.w * sizeof(float));
      }
    }
  } else {
    memset(output, 0, out_dim.n * out_dim.c * out_dim.h * out_dim.w *
           sizeof(float));
  }

  for (int kr = 0; kr < filt_dim.h; kr++) {
    int row_shift = kr - filt_dim.h / 2;
    for (int kc = 0; kc < filt_dim.w; kc++) {
      int group_no = kr * filt_dim.w + kc;
      int col_shift = kc - filt_dim.w / 2;
      // Matrix dimensions - A -> mxk B -> kxn  C --> mxn
      int m = filt_dim.n;
      int k = filt_dim.c;
      int n = in_dim.h * in_dim.w;
      // This is just 1x1 convolution
      cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                  m, n, k, alpha, kkmc_filters + group_no * m * k,
                  k, in_data, n, beta, gemm_output, n);
      // Slide the resulting matrix which has contribution from one of the
      // KxK kernel coefficients and add to the output.
      for (int omap = 0; omap < filt_dim.n; omap++) {
        MatrixShiftAdd(output + omap * out_dim.h * out_dim.w,
                        out_dim.h, out_dim.w,
                        gemm_output + omap * H * W,
                        H, W, row_shift, col_shift);
      }
    }
  }
  free(kkmc_filters);
  free(gemm_output);
  return true;
}
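
For completeness, here is a rough usage sketch (not part of the original source). It assumes that the TensorDim struct from common_types.h exposes the n, c, h and w fields used above, that Kn2RowConvLayer is declared in a header such as conv_layers.h, and that the program is linked against a CBLAS implementation (e.g. OpenBLAS).

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include "common_types.h"
#include "conv_layers.h"

int main(void) {
  // Hypothetical layer: 3 input channels, 8 output channels, 3x3 kernel,
  // 16x16 input, same padding, stride 1 - the only configuration the
  // asserts in Kn2RowConvLayer allow.
  TensorDim in_dim   = {.n = 1, .c = 3, .h = 16, .w = 16};
  TensorDim filt_dim = {.n = 8, .c = 3, .h = 3,  .w = 3};

  float *in_data = calloc(in_dim.c * in_dim.h * in_dim.w, sizeof(float));
  float *filters = calloc(filt_dim.n * filt_dim.c * filt_dim.h * filt_dim.w,
                          sizeof(float));
  float *bias    = calloc(filt_dim.n, sizeof(float));
  // Same padding keeps the spatial size, so the output is M x H x W.
  float *output  = calloc(filt_dim.n * in_dim.h * in_dim.w, sizeof(float));

  // ... fill in_data, filters and bias with real values here ...

  bool ok = Kn2RowConvLayer(in_data, filters, bias, in_dim, filt_dim,
                            /*stride=*/1, /*pad=*/1, /*group=*/1, output);
  printf("kn2row convolution %s\n", ok ? "succeeded" : "failed");

  free(in_data); free(filters); free(bias); free(output);
  return 0;
}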

Implementation of the Kn2col Convolution method:

Let's look at an implementation of kn2col convolution in C. Here each GEMM produces its output in H x W x C order, so it is converted back to C x H x W (via NHWC2NCHW) before the shift-add, which costs an extra buffer and an extra pass over the data:

#include <stdbool.h>
#include <cblas.h>
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include "common_types.h"
#include "data_reshape.h"
#include "utils.h"
#include "conv_layers.h"

bool Kn2ColConvLayer(const float *in_data, const float *filters,
                         const float *bias, TensorDim in_dim,
                         TensorDim filt_dim, int stride, int pad, int group,
                         float *output) {
 
  assert(group == 1);
  assert((pad == 0) || (pad == filt_dim.w / 2));
  assert(in_dim.n == 1);
  assert(filt_dim.h == filt_dim.w);
  assert(stride == 1);

  // Output dimensions.
  TensorDim out_dim;
  out_dim.w = (in_dim.w + (pad + pad) - filt_dim.w) / stride + 1;
  out_dim.h = (in_dim.h + (pad + pad) - filt_dim.h) / stride + 1;
  out_dim.c = filt_dim.n;
  out_dim.n = in_dim.n;

  // Reshape filters to HWCN (ker_size x ker_size x no_in_ch x no_out_ch) format.
  
  float *kkcm_filters = malloc(filt_dim.n * filt_dim.c * filt_dim.h *
                               filt_dim.w * sizeof(float));
  NCHW2HWCN(filters, filt_dim.n, filt_dim.c, filt_dim.h, filt_dim.w,
            kkcm_filters);

 
  int H = in_dim.h;
  int W = in_dim.w;
  float alpha = 1.0;
  float beta = 0.0;

  // We need separate buffer because GEMM output will have width = H*W even
  // if there is no padding (pad = 0).
  
  float *gemm_output = malloc(out_dim.c * H * W * sizeof(float));
  float *nchw_gemm_output = malloc(out_dim.c * H * W * sizeof(float));
  
  // Prefill output buffer with bias if present else set to zero.
  
  if (bias) {
    for (int m = 0; m < out_dim.c; ++m) {
      for (int a = 0; a < out_dim.h * out_dim.w; ++a) {
        output[m * out_dim.h * out_dim.w + a] = bias[m];
      }
      
      // For batch size > 1
      
      for (int b = 1; b < out_dim.n; ++b) {
        memcpy(output + b * out_dim.c * out_dim.h * out_dim.w,
               output, out_dim.c * out_dim.h * out_dim.w * sizeof(float));
      }
    }
  } else {
    memset(output, 0, out_dim.n * out_dim.c * out_dim.h * out_dim.w *
           sizeof(float));
  }

  for (int kr = 0; kr < filt_dim.h; kr++) {
    int row_shift = pad - kr;
    for (int kc = 0; kc < filt_dim.w; kc++) {
      int group_no = kr * filt_dim.w + kc;
      int col_shift = pad - kc;
      // Matrix dimensions - A -> mxk B -> kxn  C --> mxn
      int m = in_dim.h * in_dim.w;
      int k = filt_dim.c;
      int n = filt_dim.n;

      // output is in H x W x C format.
      
      cblas_sgemm(CblasRowMajor, CblasTrans, CblasNoTrans,
                  m, n, k, alpha, in_data, m,
                  kkcm_filters + group_no * filt_dim.c * filt_dim.n, n, beta,
                  gemm_output, n);

      // convert to CxHxW format.
      // FIXME: this will be slow. Need to find other ways :(
      NHWC2NCHW(gemm_output, 1, filt_dim.n, H, W, nchw_gemm_output);

      for (int omap = 0; omap < filt_dim.n; omap++) {
        MatrixShiftAdd(output + omap * out_dim.h * out_dim.w,
                        out_dim.h, out_dim.w,
                        nchw_gemm_output + omap * H * W,
                        H, W, row_shift, col_shift);
      }
    }
  }
  free(kkcm_filters);
  free(gemm_output);
  free(nchw_gemm_output);
  return true;
}

Please note the following notations:

N : Batch size

C : No of input channels

H : Input height

W : Input width

M : No of output channels

K : Kernel size

We at OpenGenus have covered almost all the major types of convolution, i.e. Im2row, Im2col, Kn2row and Kn2col. Here is a quick definition and benefit of each convolution method:

Im2row vs Im2col:

1. Im2col: Im2col convolution, also known as Image Block to Column, is a method that involves flattening each window and stacking it as columns in a matrix.

2. Im2row : Im2row is a method wherein the local patches are unrolled into rows of the input-patch matrix I and the kernels are unrolled into columns of the kernel-patch-matrix K. Im2row works better compared to im2col.

Comparison/Benefit:

Im2col: The im2col conversion considerably increases the memory footprint of the input matrix, which lowers data locality.

Im2row: Im2row Convolution is actually a better version of im2col convolution. It is fast and steady and improves model performance.

Kn2row vs Kn2col:

1. Kn2row: This approach is based on the fact that a KxK convolution can be computed by shifting and adding the partial outputs of K^2 separate 1x1 convolutions, at the cost of an additional buffer of (K^2 − 1) x M x H x W values.

2. Kn2col: Also known as kernel-to-column, this method is the transposed counterpart of the kn2row method.

Comparison/Benefit:

Kn2row also produces a slightly tighter spread of performance across layers in all the networks, most notably in AlexNet.

However, the performance of both kn2col and kn2row tends to increase with C and M and decrease with kernel size.

With this article at OpenGenus, you must have the complete idea of kn2row and kn2col Convolution.
