AMDGPU Execution Synchronization

This document covers different ways of synchronizing execution of threads on AMD GPUs.

Note

This document is not exhaustive. There may be more ways of synchronizing execution that are not covered by this document.

Barriers

This section covers execution synchronization using barrier-style primitives.

Execution Model

This section contains a formal execution model that can be used to model the behavior of barriers on AMDGPU targets.

Barriers only synchronize execution and do not affect the visibility of memory operations between threads. Refer to the execution barriers memory model to determine how to synchronize memory operations through barrier-executes-before.

Note

The barrier execution model is experimental and subject to change.

Barrier Objects

Threads can synchronize execution by performing barrier operations on barrier objects as described below:

Each barrier object has the following state:

  • An unsigned positive integer expected count: counts the number of arrive operations expected for this barrier object.

  • An unsigned non-negative integer arrive count: counts the number of arrive operations already performed on this barrier object.

    • The initial value of arrive count is zero.

    • When an operation causes arrive count to be equal to expected count, the barrier is completed, and the arrive count is reset to zero.

Barrier objects exist within a scope instance (see AMDHSA LLVM Sync Scopes), and thus can only be accessed by threads in the same scope instance.

Barrier Operations

Barrier operations are performed on barrier objects. A barrier operation is a dynamic instance of one of the following:

  • Barrier init

    • Barrier init takes an additional unsigned positive integer argument k.

    • Sets the expected count of the barrier object to k.

    • Resets the arrive count of the barrier object to zero.

  • Barrier drop.

    • Decrements expected count of the barrier object by one.

    • A barrier drop cannot cause the expected count of the barrier object to become negative; otherwise, the behavior is undefined.

  • Barrier arrive.

    • Increments the arrive count of the barrier object by one.

    • If supported, an additional argument to arrive can also update the expected count of the barrier object before the arrive count is incremented; the new expected count must be strictly greater than the arrive count, otherwise the behavior is undefined.

  • Barrier wait.

    • Introduces execution dependencies between threads; this operation depends on other barrier operations to complete.

Barrier modification operations are barrier operations that modify the barrier object state:

  • Barrier init.

  • Barrier drop.

  • Barrier arrive.
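The state and modification operations above can be sketched as a small Python model. This is only an illustration of the counting rules, not a description of any hardware; the names `BarrierObject`, `BarrierUB` and `completions` are hypothetical, and the choice to ignore the equality of two zero counts is an assumption made by this sketch.

```python
class BarrierUB(Exception):
    """Raised where the model says the behavior is undefined."""


class BarrierObject:
    """Toy model of the state of one barrier object."""

    def __init__(self):
        self.expected = None   # unset until a barrier init is performed
        self.arrived = 0       # the arrive count starts at zero
        self.completions = 0   # hypothetical counter, for illustration only

    def _require_init(self):
        if self.expected is None:
            raise BarrierUB("operation on an uninitialized barrier object")

    def _check_complete(self):
        # Completion: arrive count equals expected count, then reset to zero.
        # Assumption of this sketch: 0 == 0 is not treated as a completion.
        if self.expected > 0 and self.arrived == self.expected:
            self.completions += 1
            self.arrived = 0

    def init(self, k):
        if k <= 0:
            raise BarrierUB("expected count must be a positive integer")
        self.expected = k
        self.arrived = 0

    def drop(self):
        self._require_init()
        if self.expected - 1 < 0:
            raise BarrierUB("drop would make the expected count negative")
        self.expected -= 1
        self._check_complete()

    def arrive(self, new_expected=None):
        self._require_init()
        if new_expected is not None:
            # Optional update of the expected count, applied before the increment.
            if new_expected <= self.arrived:
                raise BarrierUB("new expected count must exceed the arrive count")
            self.expected = new_expected
        self.arrived += 1
        self._check_complete()
```

For example, after `b = BarrierObject(); b.init(2)`, the first `b.arrive()` leaves the arrive count at one, and the second completes the barrier and resets the arrive count to zero.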

Thread-barrier-order<BO>

Thread-barrier-order<BO> is the subset of program-order that only relates barrier operations performed on a barrier object BO.

Barrier-modification-order<BO>

All barrier modification operations on a barrier object BO occur in a strict total order called barrier-modification-order<BO>; it is the order in which BO observes barrier operations that change its state. For any valid barrier-modification-order<BO>, the following must be true:

  • If A and B are two barrier modification operations such that A -> B in thread-barrier-order<BO>, then A -> B is also in barrier-modification-order<BO>.

  • The first element in barrier-modification-order<BO> is always a barrier init, otherwise the behavior is undefined.
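As an illustration, the two constraints above can be checked for a candidate order with a short Python function. The tuple encoding `(thread, seq, kind)` and the function name are assumptions of this sketch; `seq` stands for the position of an operation in its thread's program-order.

```python
def is_valid_modification_order(order):
    """Check a candidate barrier-modification-order<BO>.

    order: list of (thread, seq, kind) tuples for one barrier object BO,
    where kind is "init", "drop" or "arrive".
    """
    # The first element, if any, must be a barrier init.
    if order and order[0][2] != "init":
        return False
    last_seq = {}
    for thread, seq, kind in order:
        if kind not in ("init", "drop", "arrive"):
            return False  # only modification operations are ordered here
        # thread-barrier-order<BO> must be preserved by the total order.
        if thread in last_seq and seq <= last_seq[thread]:
            return False
        last_seq[thread] = seq
    return True
```

Operations from different threads may interleave freely; only the relative order of operations from the same thread is constrained.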

Barrier-participates-in

Barrier-participates-in relates barrier operations to the barrier waits that depend on them to complete. A barrier operation X barrier-participates-in a barrier wait W if and only if all of the following are true:

  • X and W are both performed on the same barrier object BO.

  • X is a barrier arrive or drop operation.

  • X does not barrier-participate-in another distinct barrier wait W' in the same thread as W.

  • W -> X not in thread-barrier-order<BO>.

  • All dependent constraints and relations are satisfied as well. [0]

For the set S consisting of all barrier operations that barrier-participate-in a barrier wait W for some barrier object BO:

  • The elements of S all exist in a continuous, uninterrupted interval of barrier-modification-order<BO>.

  • The arrive count of BO is zero before the first operation of S in barrier-modification-order<BO>.

  • The arrive count and expected count of BO are equal after the last operation of S in barrier-modification-order<BO>. The arrive count and expected count of BO cannot be equal at any other point in S.
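The three constraints on S can be sketched as a replay of barrier-modification-order<BO> in Python. The encoding of operations as tuples and the function name are hypothetical; the sketch assumes the order starts with a barrier init and that S is non-empty.

```python
def check_participation_set(order, s):
    """order: list of ("init", k), ("arrive",) or ("drop",) tuples in
    barrier-modification-order<BO>; s: set of positions in `order` that
    form the set S for some barrier wait W."""
    if not s:
        return False                      # this sketch only handles non-empty S
    lo, hi = min(s), max(s)
    if set(range(lo, hi + 1)) != set(s):
        return False                      # S must be one uninterrupted interval
    expected, arrived = None, 0
    for pos, op in enumerate(order):
        if pos == lo and arrived != 0:
            return False                  # arrive count must be zero before S
        if op[0] == "init":
            expected, arrived = op[1], 0
        elif op[0] == "drop":
            expected -= 1
        elif op[0] == "arrive":
            arrived += 1
        equal = (arrived == expected)
        if lo <= pos <= hi and equal != (pos == hi):
            return False                  # counts are equal only after the last op of S
        if equal:
            arrived = 0                   # completion resets the arrive count
    return True
```

With an order of one `("init", 2)` followed by four arrives, the positions `{1, 2}` and `{3, 4}` each form a valid S, while `{2, 3}` does not because the arrive count is not zero before position 2.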

Barrier-executes-before

A barrier operation A barrier-executes-before another barrier operation B if any of the following is true:

  • A -> B in program-order.

  • A -> B in barrier-participates-in.

  • A barrier-executes-before some barrier operation X, and X barrier-executes-before B.

Barrier-executes-before is consistent with barrier-modification-order<BO> for every barrier object BO.
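Since barrier-executes-before is built from program-order, barrier-participates-in and transitivity, it can be computed as a transitive closure. The following Python sketch does this for small examples; the function name and the operation labels used afterwards are illustrative assumptions.

```python
def executes_before(program_order, participates_in):
    """Both arguments are sets of (a, b) pairs meaning a -> b.

    Returns the transitive closure of their union, i.e. every pair
    (a, b) such that a barrier-executes-before b.
    """
    closure = set(program_order) | set(participates_in)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure
```

For instance, if thread 0 performs an arrive `A0`, a wait `W0` and a drop `D0` in that order, and an arrive `A1` from thread 1 barrier-participates-in `W0`, then `("A1", "D0")` is in the result because `A1 -> W0 -> D0`.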

Barrier drop races

For every pair of barrier arrive A and barrier drop D performed on a barrier object BO, such that A -> D in thread-barrier-order<BO>, one of the following must be true:

  • A does not barrier-participate-in any barrier wait.

  • A barrier-participates-in at least one barrier wait W such that W -> D in barrier-executes-before.

Barrier-phase-with

Barrier-phase-with is a symmetric relation over barrier operations defined as the transitive closure of: barrier-participates-in and its inverse relation.
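Because the relation is the transitive closure of barrier-participates-in together with its inverse, operations related by it fall into connected groups. A minimal Python sketch using a union-find structure is shown below; the function name and the operation labels are hypothetical.

```python
def phase_groups(participates_in):
    """participates_in: iterable of (x, w) pairs, meaning that x
    barrier-participates-in the barrier wait w.

    Returns a dict mapping each operation to the frozenset of operations
    it is related to by barrier-phase-with.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for x, w in participates_in:
        parent[find(x)] = find(w)           # merge the two groups

    groups = {}
    for op in list(parent):
        groups.setdefault(find(op), set()).add(op)
    return {op: frozenset(groups[find(op)]) for op in parent}
```

Two arrives that participate in the same wait, and two waits that share a participating arrive, end up in the same group.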

Barrier phase separation

For every barrier operation A that barrier-participates-in a barrier wait W on a barrier object BO:

  • There is no barrier operation X on BO such that A -> X -> W in barrier-executes-before, and X barrier-phase-with a non-empty set of operations that does not include W.

Informational Notes

Informally, we can deduce from the above formal model that execution barriers behave as follows:

  • Barrier-executes-before relates the dynamic instances of operations from different threads together. For example, if A -> B in barrier-executes-before, then the execution of A must complete before the execution of B can complete.

    • This property can also be combined with program-order. For example, given two (non-barrier) operations X and Y such that X -> A and B -> Y in program-order, the execution of X completes before the execution of Y does.

  • Barriers do not complete “out-of-thin-air”; a barrier wait W cannot depend on a barrier operation X to complete if W -> X in barrier-executes-before.

  • It is undefined behavior to operate on an uninitialized barrier object.

  • It is undefined behavior for a barrier wait to never complete.

  • It is not mandatory to drop a barrier after joining it.

  • A thread may not arrive and then drop a barrier object unless the barrier completes before the barrier drop. Incrementing the arrive count and decrementing the expected count directly after may cause undefined behavior.

  • Joining a barrier is only useful if the thread will wait on that same barrier object later.
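The ordering property from the first note can be illustrated with host threads, using Python's `threading.Barrier` as a stand-in for an arrive immediately followed by a wait. This is only an analogy under that assumption and does not model AMDGPU hardware; the function name `run_demo` is hypothetical.

```python
import threading


def run_demo():
    log = []
    barrier = threading.Barrier(2)   # two participating threads

    def thread0():
        log.append("X")              # X precedes the arrive in program-order
        barrier.wait()               # arrive, then wait

    def thread1():
        barrier.wait()               # arrive, then wait
        log.append("Y")              # Y follows the wait in program-order

    threads = [threading.Thread(target=t) for t in (thread1, thread0)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Regardless of which thread starts first, `thread1` cannot leave the barrier before `thread0` has reached it, so `run_demo()` returns `["X", "Y"]`.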

Barrier Implementations on AMDGPU Targets

s_barrier

s_barrier instructions are the primary barrier implementation on AMD GPUs.

s_barrier instructions can only be used to synchronize threads at a wavefront granularity. s_barrier instructions are convergent within a wave, and thus can only be performed in wave-uniform control flow.

The s_barrier family of instructions is available in some form on all GFX targets, and has evolved over time. The sub-sections below cover the capabilities offered by every major iteration of this feature separately.

GFX6-11

Targets from GFX6 through GFX11 inclusive do not have the “split barrier” feature. The barrier arrive and barrier wait operations cannot be performed independently using s_barrier.

There is only one workgroup barrier object of workgroup scope that is implicitly used by all s_barrier instructions.

The following code sequences can be used to implement the barrier operations defined by the execution synchronization model using s_barrier on GFX6 through GFX11:

Table 125 s_barrier GFX6-11

Barrier Operation(s)

Barrier Object

AMDGPU Machine Code

Init and Drop

init

  • Workgroup barrier

Automatically initialized by the hardware when a workgroup is launched. The expected count of this barrier is set to the number of waves in the workgroup.

drop

  • Workgroup barrier

When a thread ends, it automatically drops this barrier object if it had previously joined it.

Arrive and Wait

arrive then wait

  • Workgroup barrier

With BackOffBarrier:
s_barrier
Without BackOffBarrier:
s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
s_waitcnt_vscnt null, 0x0
s_barrier
  • If the target does not have the BackOffBarrier feature, then there cannot be any outstanding memory operations before issuing the s_barrier instruction.

  • The waitcnts can independently be moved earlier, or removed entirely as long as the associated counter remains at zero before issuing the s_barrier instruction.

  • The s_barrier instruction cannot complete before all waves of the workgroup have launched.

arrive

  • Workgroup barrier

Not available separately, see arrive then wait

wait

  • Workgroup barrier

Not available separately, see arrive then wait
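The choice between the two "arrive then wait" sequences in the table above can be sketched as a small Python helper that returns the instructions to emit. The function name and the boolean parameter are hypothetical, and the sketch conservatively keeps every wait instead of removing those whose counter is already known to be zero.

```python
def arrive_then_wait_gfx6_11(has_backoff_barrier):
    """Return the instruction sequence for "arrive then wait" on the
    workgroup barrier, as listed for GFX6 through GFX11."""
    if has_backoff_barrier:
        return ["s_barrier"]
    # Without BackOffBarrier, no memory operation may be outstanding
    # when s_barrier is issued, so all counters are waited on first.
    return [
        "s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)",
        "s_waitcnt_vscnt null, 0x0",
        "s_barrier",
    ]
```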

GFX12

GFX12 targets have the split-barrier feature, and also allow s_barrier instructions to use one of multiple barrier objects available per workgroup. s_barrier instructions use the barrier ID operand to determine the barrier object they operate on.

GFX12.5 additionally introduces new barrier objects that offer more flexibility for synchronizing the execution of a subset of waves of a workgroup, or synchronizing execution across workgroups within a workgroup cluster, via s_barrier. These are called “named barriers”.

Note

Check the table below to determine which barrier IDs are available to s_barrier instructions on a given target.

“Named Barriers” Model Extensions

In order to reason about the execution of s_barrier “named barriers” on GFX12.5 and up, we define the following extensions to the barrier execution model:

Note

The aforementioned execution model always applies unless stated otherwise by one of the extensions below.

Barrier Objects

There is a sub-type of barrier objects called named barrier objects. Named barrier objects inherit all the properties of barrier objects as defined by the barrier execution model. They are also subject to additional constraints.

Not all barrier objects are named barrier objects, and both types can coexist in an implementation.

Barrier Operations

The entirety of the barrier operations section applies, with the following barrier operation being added:

  • Barrier join.

    • Allow the thread that executes the operation to wait on a barrier object.

    • Can only be used on named barrier objects.

Barrier-joined-before

A barrier join J is barrier-joined-before a barrier operation X if and only if all of the following are true:

  • J -> X in thread-barrier-order<BO>.

  • X is not a barrier join.

  • There is no barrier join or drop JD where J -> JD -> X in thread-barrier-order<BO>.

  • There is no barrier join J' on a distinct barrier object BO' such that J -> J' -> X in program-order.
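The four conditions can be checked over a single thread's barrier operations with a short Python function. The encoding of a thread as a list of `(kind, barrier)` pairs in program-order, and the function name, are assumptions of this sketch.

```python
def joined_before(ops, j, x):
    """ops: one thread's barrier operations in program-order, as
    (kind, barrier) pairs. Returns True if ops[j] is
    barrier-joined-before ops[x]."""
    kind_j, bo = ops[j]
    kind_x, bo_x = ops[x]
    if kind_j != "join" or kind_x == "join":
        return False                      # J must be a join, X must not be
    if not j < x or bo_x != bo:
        return False                      # need J -> X in thread-barrier-order<BO>
    for kind, other in ops[j + 1:x]:
        if other == bo and kind in ("join", "drop"):
            return False                  # intervening join or drop on BO
        if other != bo and kind == "join":
            return False                  # intervening join on a distinct BO'
    return True
```

For a thread that performs `join 1`, `arrive 1`, `wait 1`, `join 0`, `wait 1`, the first join is barrier-joined-before the first wait but not the second one, because the join on barrier 0 intervenes.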

Join and Drop Ordering

For every barrier drop D performed on a named barrier object BO:

  • There is a barrier join J such that J -> D in barrier-joined-before; otherwise, the behavior is undefined.

Join and Wait Ordering

For every barrier wait W performed on a named barrier object BO:

  • There is a barrier join J such that J -> W in barrier-joined-before, and J must barrier-executes-before at least one operation X that barrier-participates-in W; otherwise, the behavior is undefined.

Code Sequences

The following code sequences can be used to implement the barrier operations defined by the GFX12 s_barrier execution synchronization model:

Table 126 s_barrier GFX12

Barrier Operation(s)

Barrier ID

AMDGPU Machine Code

Init, Join and Drop

init

  • -2, -1

Automatically initialized by the hardware when a workgroup is launched. The expected count of this barrier is set to the number of waves in the workgroup.

init

  • -4, -3

Automatically initialized by the hardware when a workgroup is launched as part of a workgroup cluster. The expected count of this barrier is set to the number of workgroups in the workgroup cluster.

init

  • 0

Automatically initialized by the hardware and always available. This barrier object is opaque and immutable as all operations other than barrier join are no-ops.

init

  • [1, 16]

s_barrier_init <N>
  • <N> is an immediate constant, or stored in the lower half of m0.

  • The value to set as the expected count of the barrier is stored in the upper half of m0.

join

  • -2, -1

  • -4, -3

N/A

join

  • 0

  • [1, 16]

s_barrier_join <N>
  • <N> is an immediate constant, or stored in the lower half of m0.

drop

  • 0

  • [1, 16]

s_barrier_leave
  • s_barrier_leave takes no operand. It can only be used to drop a named barrier object BO if BO was previously joined using s_barrier_join.

  • Drops the named barrier object BO if and only if there is a barrier join J such that J is barrier-joined-before this barrier drop operation.

drop

  • -2, -1

  • -4, -3

When a thread ends, it automatically drops this barrier object if it had previously joined it.

Arrive and Wait

arrive

  • -4, -3

  • -2, -1

  • 0

  • [1, 16]

s_barrier_signal <N>
Or
s_barrier_signal_isfirst <N>
  • <N> is an immediate constant, or stored in bits [4:0] of m0.

  • The _isfirst variant sets SCC=1 if this wave is the first to signal the barrier, otherwise SCC=0.

  • For barrier objects [1, 16]: When using m0 as an operand, if there is a non-zero value contained in the bits [22:16] of m0, the expected count of the barrier object is set to that value before the arrive count of the barrier object is incremented. The new expected count value must be greater than or equal to the arrive count, otherwise the behavior is undefined.

  • For barrier objects -4 and -3 (cluster barriers): only one wave per workgroup may arrive at the barrier on behalf of its entire workgroup. However, any wave within the workgroup cluster can then wait on this barrier object.

  • This is a no-op on the NULL barrier (barrier object 0).

wait

  • -4, -3

  • -2, -1

  • 0

  • [1, 16]

s_barrier_wait <N>

  • <N> is an immediate constant.

  • For barrier objects -2 and -1: This instruction cannot complete before all waves of the workgroup have launched.

  • For barrier objects -4 and -3 (cluster barriers): This instruction cannot complete before all waves of the workgroup cluster have launched.

  • This is a no-op on the NULL barrier (barrier object 0).

  • For named barrier objects, this instruction always waits on the last named barrier object that the thread has joined, even if it is different from the barrier object passed to the instruction.
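As an illustration of the m0 operand described for s_barrier_signal, the following Python sketch packs a barrier ID into bits [4:0] and an optional expected count into bits [22:16]. The helper names are hypothetical, and the sketch only accepts the barrier IDs [1, 16], since the expected count update is described for those objects only and the encoding of negative IDs is not covered here.

```python
def pack_signal_m0(barrier_id, expected_count=0):
    """Build an m0 value for s_barrier_signal with m0 as the operand.

    barrier_id goes into bits [4:0]; a non-zero expected_count goes into
    bits [22:16] and requests an update of the expected count.
    """
    if not 1 <= barrier_id <= 16:
        raise ValueError("this sketch only handles barrier IDs [1, 16]")
    if not 0 <= expected_count < (1 << 7):
        raise ValueError("expected count must fit in bits [22:16]")
    return (expected_count << 16) | barrier_id


def unpack_signal_m0(m0):
    """Inverse of pack_signal_m0: returns (barrier_id, expected_count)."""
    return m0 & 0x1F, (m0 >> 16) & 0x7F
```

A zero in bits [22:16] leaves the expected count unchanged, so `pack_signal_m0(n)` is simply `n`.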

The following barrier IDs are available:

Table 127 s_barrier IDs GFX12

Barrier ID | Scope | Availability | Named barrier object? | Description
---------- | ----- | ------------ | --------------------- | -----------
-4 | cluster | GFX12.5 | NO | Cluster trap barrier; cluster barrier object for use by all workgroups of a workgroup cluster. Dedicated for the trap handler and only available in privileged execution mode (not accessible by the shader).
-3 | cluster | GFX12.5 | NO | Cluster user barrier; cluster barrier object for use by all workgroups of a workgroup cluster.
-2 | workgroup | GFX12 (all) | NO | Workgroup trap barrier, dedicated for the trap handler and only available in privileged execution mode (not accessible by the shader).
-1 | workgroup | GFX12 (all) | NO | Workgroup barrier.
0 | workgroup | GFX12.5 | YES | NULL barrier.
[1, 16] | workgroup | GFX12.5 | YES | Named barrier objects for the shader to assign and use.

Informally, we can note that:

  • All operations on the NULL named barrier object other than join are no-ops.

    • As the NULL barrier (barrier ID 0) is also a named barrier object, a thread can use a join on the NULL barrier as a way to “unjoin” a named barrier (break barrier-joined-before) without having to use a drop operation.

  • When a thread ends, it does not implicitly drop any named barrier objects (barrier IDs [0, 16]) it has joined.
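The barrier ID table can also be expressed as a lookup structure, which makes it easy to query which IDs exist on a target or which are named barrier objects. This Python sketch is illustrative only; the names `BARRIER_IDS` and `available_ids` and the target labels `"GFX12"` and `"GFX12.5"` are assumptions, and "GFX12 (all)" is read as including GFX12.5.

```python
# barrier ID -> (scope, availability, is_named_barrier_object)
BARRIER_IDS = {
    -4: ("cluster", "GFX12.5", False),    # cluster trap barrier (privileged)
    -3: ("cluster", "GFX12.5", False),    # cluster user barrier
    -2: ("workgroup", "GFX12", False),    # workgroup trap barrier (privileged)
    -1: ("workgroup", "GFX12", False),    # workgroup barrier
    0: ("workgroup", "GFX12.5", True),    # NULL barrier
}
# Named barrier objects [1, 16] for the shader to assign and use.
BARRIER_IDS.update({n: ("workgroup", "GFX12.5", True) for n in range(1, 17)})


def available_ids(target):
    """Return the sorted barrier IDs available on "GFX12" or "GFX12.5"."""
    if target == "GFX12.5":
        return sorted(BARRIER_IDS)        # GFX12 (all) rows plus GFX12.5 rows
    if target == "GFX12":
        return sorted(i for i, (_, avail, _) in BARRIER_IDS.items()
                      if avail == "GFX12")
    raise ValueError("unknown target label: " + repr(target))


def named_ids():
    """Return the sorted IDs of the named barrier objects."""
    return sorted(i for i, (_, _, named) in BARRIER_IDS.items() if named)
```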