Skip to content

Instantly share code, notes, and snippets.

@eugnsp
Last active June 16, 2019 14:05
Show Gist options
  • Select an option

  • Save eugnsp/7ffff78c6bd0633280f8bfaf90429eeb to your computer and use it in GitHub Desktop.

Select an option

Save eugnsp/7ffff78c6bd0633280f8bfaf90429eeb to your computer and use it in GitHub Desktop.
Fewer memory accesses – faster rotation?

Fewer memory accesses – faster rotation?

A k-rotation (or a k-cyclic shift) of a sequence (a0a2...an-1) is a sequence (aP(1)aP(2)...aP(n-1)), where the index permutation is P(i) = (i + k) % n. The problem is to transform the given sequence into its k-rotation.

There are four well-known algorithms to do it:

  1. Dolphin algorithm:

    Compute the number of cycles, nc = gcd(k, n - k); for each cycle (aCi(0)aCi(1)aCi(2)...) with Ci(j) = (i + jk) % n, i = 0, ..., nc - 1, make a cyclic shift of all the elements by one position to obtain (aCi(1)aCi(2)...aCi(0)).

  2. Gries–Mills algorithm:.

    If k = n - k, swap the subsequences (a0...ak-1) and (ak...an-1); if k < n - k, swap the subsequences (a0...ak-1) and (ak...a2k-1), then proceed recursively for the suffix subsequence (a0...ak-1a2k...an-1) with k' = k; if k > n - k, swap the subsequences (a0...an-k-1) and (ak...an-1), then proceed recursively for the suffix subsequence (an-k...ak-1a0...an-k-1) with k' = 2k - n.

  3. Using three reverses:

    Reverse the subsequences (a0...ak-1) and (ak...an-1), then reverse the whole sequence (ak-1...a0an-1...ak).

  4. Direct copy using an additional memory buffer.

Below are the results for rotation of std::vector with N elements of type int. std::rotate is used as a baseline. Versions I and II of algorithms differ in some details, please refer to the source code.

The dolphin algorithm makes much fewer (~2-3 times) memory accesses. However, it is not cache-friendly and has poor performance in practice, as is vividly illustrated below.

Benchmark results

Intel Core i7-4770 with GCC 8.3.0 (-O3 -m64 -march=native):

Intel Core i7-4770

N Bytes (1 I) (1 II) (2 I) (2 II) (3) (4 I) (4 II)
64 256 1.57 1.93 1.09 1.12 0.70 0.64 0.63
256 1K 2.21 3.01 1.68 1.89 0.57 0.56 0.54
1'024 4K 3.12 4.54 2.81 3.16 0.58 0.54 0.53
4'096 16K 4.26 6.51 4.28 4.66 0.70 0.69 1.29
16'384 64K 8.02 8.49 4.01 4.29 1.27 1.27 1.45
65'536 256K 8.50 8.53 3.17 3.37 1.16 1.30 1.37
262'144 1M 7.69 7.72 2.51 2.66 1.12 1.12 1.12
1'048'576 4M 7.46 7.52 2.33 2.44 1.04 1.20 1.22
4'194'304 16M 16.68 17.56 1.36 1.41 1.40 1.23 1.28
16'777'216 64M 16.62 14.73 1.11 1.13 1.12 1.63 1.79
67'108'864 256M 22.14 19.23 1.11 1.11 1.07 1.92 1.43

Cache sizes on this CPU: L1 Data – 32K, L2 – 256K, L3 – 8M.


Intel Xeon E5-2650 v3 with GCC 7.3.0 (-O3 -m64 -march=native):

Intel Xeon E5-2650 v3

N Bytes (1 I) (1 II) (2 I) (2 II) (3) (4 I) (4 II)
64 256 1.45 1.80 0.97 1.00 0.80 1.33 1.31
256 1K 2.10 2.81 1.50 1.70 1.20 1.54 1.48
1'024 4K 3.20 4.49 2.66 3.02 2.22 1.92 1.84
4'096 16K 4.59 6.64 4.16 4.57 3.57 2.64 2.60
16'384 64K 8.05 8.58 4.05 4.36 3.62 2.62 2.59
65'536 256K 8.52 8.56 3.23 3.43 2.96 2.28 2.33
262'144 1M 7.93 8.01 2.52 2.66 2.38 1.85 1.83
1'048'576 4M 7.65 7.72 2.33 2.46 2.20 1.70 1.70
4'194'304 16M 8.96 9.11 2.26 2.38 2.13 1.96 2.61
16'777'216 64M 10.69 12.90 1.56 1.61 1.61 1.98 2.00
67'108'864 256M 12.43 12.44 1.39 1.43 1.41 2.04 1.86

Cache sizes on this CPU: L1 Data – 32K, L2 – 256K, L3 – 25M.

References

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment