Rotation revisited: Another unidirectional algorithm

The Old New Thing (Raymond Chen) 06/02/26, 02:00 PM News

algorithm rotation c-plus-plus memory optimization programming

Summary

Raymond Chen revisits a unidirectional rotation algorithm for swapping adjacent memory blocks, explaining its recursive approach and performance characteristics.

<p>Some time ago, we looked at the problem of <a title="Swapping two blocks of memory that reside inside a larger block, in constant memory" href="https://devblogs.microsoft.com/oldnewthing/20260101-00/?p=111955"> swapping two blocks of memory that reside inside a larger block, in constant memory</a>, and along the way, we learned about <code>std::<wbr />rotate</code> which swaps two <i>adjacent</i> blocks of memory (not necessarily the same size).</p> <p>I noted in a postscript that <a href="https://github.com/llvm/llvm-project/blob/682c8e22e61f50ce2d9a0c42475a3aa6d578a1ad/libcxx/include/__algorithm/rotate.h"> clang’s libcxx</a> and <a href="https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/bits/stl_algo.h"> gcc’s libstdc++</a> contain specializations of <code>std::<wbr />rotate</code> for random-access iterators that view the operation as a permutation and decomposes the permutation into cycles.</p> <p>I was mistaken.</p> <p>The implementation in gcc’s libstdc++ has special cases for single-element rotations, but in the general case, it uses a different algorithm.</p> <p>Let’s call the blocks of memory to be exchanged A and B, where A is made up of elements A1, A2, A3, and so on; and block B has elements B1, B2, B3, and so on. Without loss of generality, suppose the A block is smaller. (If not, we can just mirror the algorithm.) And for concreteness let’s say that the elements are A1, A2, A3, B1, B2, B3, B4, B5.</p> <table style="border-collapse: collapse; text-align: center;" title="See text above" border="0" cellspacing="0" cellpadding="1"> <tbody> <tr> <td style="width: 2em; border: solid 1px currentcolor; background: lch(from currentcolor l c h / .1);">A1</td> <td style="width: 2em; border: solid 1px currentcolor; background: lch(from currentcolor l c h / .1);">A2</td> <td style="width: 2em; border: solid 1px currentcolor; background: lch(from currentcolor l c h / .1);">A3</td> <td style="width: 2em; border: solid 1px currentcolor;">B1</td> <td style="width: 2em; border: solid 1px currentcolor;">B2</td> <td style="width: 2em; border: solid 1px currentcolor;">B3</td> <td style="width: 2em; border: solid 1px currentcolor;">B4</td> <td style="width: 2em; border: solid 1px currentcolor;">B5</td> </tr> <tr> <td>↑</td> <td> </td> <td> </td> <td>↑</td> <td> </td> <td> </td> <td> </td> <td> </td> <td>↑</td> </tr> <tr> <td>first</td> <td> </td> <td> </td> <td>mid</td> <td> </td> <td> </td> <td> </td> <td> </td> <td>last</td> </tr> </tbody> </table> <p>Exchange elements at <code>first</code> and <code>mid</code>, then move both iterators forward. After the first step, we have this:</p> <table style="border-collapse: collapse; text-align: center;" title="B1 A2 A3 A1 B2 B3 B4 B5, with first pointing at A2 and mid pionting at B2" border="0" cellspacing="0" cellpadding="1"> <tbody> <tr> <td style="width: 2em; border: solid 1px currentcolor;">B1</td> <td style="width: 2em; border: solid 1px currentcolor; background: lch(from currentcolor l c h / .1);">A2</td> <td style="width: 2em; border: solid 1px currentcolor; background: lch(from currentcolor l c h / .1);">A3</td> <td style="width: 2em; border: solid 1px currentcolor; background: lch(from currentcolor l c h / .1);">A1</td> <td style="width: 2em; border: solid 1px currentcolor;">B2</td> <td style="width: 2em; border: solid 1px currentcolor;">B3</td> <td style="width: 2em; border: solid 1px currentcolor;">B4</td> <td style="width: 2em; border: solid 1px currentcolor;">B5</td> </tr> <tr> <td> </td> <td>↑</td> <td> </td> <td> </td> <td>↑</td> <td> </td> <td> </td> <td> </td> <td>↑</td> </tr> <tr> <td> </td> <td>first</td> <td> </td> <td> </td> <td>mid</td> <td> </td> <td> </td> <td> </td> <td>last</td> </tr> </tbody> </table> <p>After three steps, we have moved all of the A’s out and replaced them with an equal number of B’s.</p> <table style="border-collapse: collapse; text-align: center;" title="B1 B2 B3 A1 A2 A3 B4 B5, with first pointing at A1 and mid pointing at B4" border="0" cellspacing="0" cellpadding="1"> <tbody> <tr> <td style="width: 2em; border: solid 1px currentcolor;">B1</td> <td style="width: 2em; border: solid 1px currentcolor;">B2</td> <td style="width: 2em; border: solid 1px currentcolor;">B3</td> <td style="width: 2em; border: solid 1px currentcolor; background: lch(from currentcolor l c h / .1);">A1</td> <td style="width: 2em; border: solid 1px currentcolor; background: lch(from currentcolor l c h / .1);">A2</td> <td style="width: 2em; border: solid 1px currentcolor; background: lch(from currentcolor l c h / .1);">A3</td> <td style="width: 2em; border: solid 1px currentcolor;">B4</td> <td style="width: 2em; border: solid 1px currentcolor;">B5</td> </tr> <tr> <td> </td> <td> </td> <td> </td> <td>↑</td> <td> </td> <td> </td> <td>↑</td> <td> </td> <td>↑</td> </tr> <tr> <td> </td> <td> </td> <td> </td> <td>first</td> <td> </td> <td> </td> <td>mid</td> <td> </td> <td>last</td> </tr> </tbody> </table> <p>But don’t stop. Keep on going until <code>mid</code> reaches <code>last</code>.</p> <table style="border-collapse: collapse; text-align: center;" title="B1 B2 B3 B4 B5 A3 A1 A2, with first pointing at A1 and mid pointing one past the final element" border="0" cellspacing="0" cellpadding="1"> <tbody> <tr> <td style="width: 2em; border: solid 1px currentcolor;">B1</td> <td style="width: 2em; border: solid 1px currentcolor;">B2</td> <td style="width: 2em; border: solid 1px currentcolor;">B3</td> <td style="width: 2em; border: solid 1px currentcolor;">B4</td> <td style="width: 2em; border: solid 1px currentcolor;">B5</td> <td style="width: 2em; border: solid 1px currentcolor; background: lch(from currentcolor l c h / .1);">A3</td> <td style="width: 2em; border: solid 1px currentcolor; background: lch(from currentcolor l c h / .1);">A1</td> <td style="width: 2em; border: solid 1px currentcolor; background: lch(from currentcolor l c h / .1);">A2</td> </tr> <tr> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td>↑</td> <td> </td> <td> </td> <td>↑</td> </tr> <tr> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td>first</td> <td> </td> <td> </td> <td> </td> <td> </td> <td>mid<br /> last</td> </tr> </tbody> </table> <p>All of the B’s have been swapped to their final positions, but the A’s are jumbled.</p> <p>But you can predict the exact nature of the jumbling. The A block is in two chunks. If we let <var>n</var> be the total number of elements |A| + |B| and <var>a</var> be the number of elements in A, then the first chunk has the final <var>n</var> % <var>a</var> elements, and the second chunk has the initial <var>a</var> − (<var>n</var> % <var>a</var>) elements.</p> <p>Therefore, we can recursively rotate the two pieces of the A block to finish the job. Move <code>mid</code> to <code>first</code> + (<var>n</var> % <var>a</var>) and restart the algorithm.</p> <p>This algorithm performs <var>n</var> − 1 swaps. You can calculate this inductively by observing that we perform |B| swaps, and then recursively rotate |A|. Or you can calculate this directly by observing that each swap moves one element to its final position, except that the final swap moves two elements to their final position.</p> <p>The locality of this algorithm fairly good. The <code>first</code> iterator moves steadily forward, and the <code>mid</code> iterator moves forward most of the time, with at most <var>O</var>(log (min(|A|, |B|)) backward resets.</p> <p>Next time, we’ll make a shocking discovery about this algorithm.</p> <p>The post <a href="https://devblogs.microsoft.com/oldnewthing/20260602-00/?p=112376">Rotation revisited: Another unidirectional algorithm</a> appeared first on <a href="https://devblogs.microsoft.com/oldnewthing">The Old New Thing</a>.</p>

Original Article

View Cached Full Text

Cached at: 06/03/26, 03:35 AM

# Rotation revisited: Another unidirectional algorithm - The Old New Thing Source: [https://devblogs.microsoft.com/oldnewthing/20260602-00?p=112376](https://devblogs.microsoft.com/oldnewthing/20260602-00?p=112376) Some time ago, we looked at the problem of[swapping two blocks of memory that reside inside a larger block, in constant memory](https://devblogs.microsoft.com/oldnewthing/20260101-00/?p=111955), and along the way, we learned about`std::rotate`which swaps two*adjacent*blocks of memory \(not necessarily the same size\)\. I noted in a postscript that[clang’s libcxx](https://github.com/llvm/llvm-project/blob/682c8e22e61f50ce2d9a0c42475a3aa6d578a1ad/libcxx/include/__algorithm/rotate.h)and[gcc’s libstdc\+\+](https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/bits/stl_algo.h)contain specializations of`std::rotate`for random\-access iterators that view the operation as a permutation and decomposes the permutation into cycles\. I was mistaken\. The implementation in gcc’s libstdc\+\+ has special cases for single\-element rotations, but in the general case, it uses a different algorithm\. Let’s call the blocks of memory to be exchanged A and B, where A is made up of elements A1, A2, A3, and so on; and block B has elements B1, B2, B3, and so on\. Without loss of generality, suppose the A block is smaller\. \(If not, we can just mirror the algorithm\.\) And for concreteness let’s say that the elements are A1, A2, A3, B1, B2, B3, B4, B5\. A1A2A3B1B2B3B4B5↑↑↑firstmidlastExchange elements at`first`and`mid`, then move both iterators forward\. After the first step, we have this: B1A2A3A1B2B3B4B5↑↑↑firstmidlastAfter three steps, we have moved all of the A’s out and replaced them with an equal number of B’s\. B1B2B3A1A2A3B4B5↑↑↑firstmidlastBut don’t stop\. Keep on going until`mid`reaches`last`\. B1B2B3B4B5A3A1A2↑↑firstmid lastAll of the B’s have been swapped to their final positions, but the A’s are jumbled\. But you can predict the exact nature of the jumbling\. The A block is in two chunks\. If we letnbe the total number of elements \|A\| \+ \|B\| andabe the number of elements in A, then the first chunk has the finaln%aelements, and the second chunk has the initiala− \(n%a\) elements\. Therefore, we can recursively rotate the two pieces of the A block to finish the job\. Move`mid`to`first`\+ \(n%a\) and restart the algorithm\. This algorithm performsn− 1 swaps\. You can calculate this inductively by observing that we perform \|B\| swaps, and then recursively rotate \|A\|\. Or you can calculate this directly by observing that each swap moves one element to its final position, except that the final swap moves two elements to their final position\. The locality of this algorithm fairly good\. The`first`iterator moves steadily forward, and the`mid`iterator moves forward most of the time, with at mostO\(log \(min\(\|A\|, \|B\|\)\) backward resets\. Next time, we’ll make a shocking discovery about this algorithm\. ### Category ### Topics ## Author ![Raymond Chen](https://devblogs.microsoft.com/oldnewthing/wp-content/uploads/sites/38/2019/02/RaymondChen_5in-150x150.jpg) Raymond has been involved in the evolution of Windows for more than 30 years\. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie\-jeebies\. The Web site spawned a book, coincidentally also titled The Old New Thing \(Addison Wesley 2007\)\. He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information\.

Rotation revisited: Another unidirectional algorithm

Similar Articles

Rotation revisited: A shocking discovery about gcc’s unidirectional rotation algorithm

Rotation revisited: Cycle decomposition in clang’s libcxx

Rotation revisited: Avoiding having to calculate the gcd when doing cycle decomposition

Stupid RCU Tricks: Corner-Case RCU Implementations

@che_shr_cat: 1/ We have spent years optimizing KV cache via head-sharing (GQA/MQA), but we ignored a fundamental assumption: why do …

Submit Feedback

Similar Articles

Rotation revisited: A shocking discovery about gcc’s unidirectional rotation algorithm

Rotation revisited: Cycle decomposition in clang’s libcxx

Rotation revisited: Avoiding having to calculate the gcd when doing cycle decomposition

Stupid RCU Tricks: Corner-Case RCU Implementations

@che_shr_cat: 1/ We have spent years optimizing KV cache via head-sharing (GQA/MQA), but we ignored a fundamental assumption: why do …