RISC-V Bit-Manipulation ISA-extensions

Date:: 2023-08-14

This document is released under the Creative Commons Attribution 4.0 International License.

It describes the BitManip Zba, Zbb, Zbc and Zbs extensions being submitted for public review.

Contributors to this specification (in alphabetical order) include:

Jacob Bachmeyer, Allen Baum, Ari Ben, Alex Bradbury, Steven Braeger, Rogier Brussee, Michael Clark, Ken Dockser, Paul Donahue, Dennis Ferguson, Fabian Giesen, John Hauser, Robert Henry, Bruce Hoult, Po-wei Huang, Ben Marshall, Rex McCrary, Lee Moore, Jiří Moravec, Samuel Neves, Markus Oberhumer, Christopher Olson, Nils Pipenbrinck, Joseph Rahmeh, Xue Saw, Tommy Thorn, Philipp Tomsich, Avishai Tvila, Andrew Waterman, Thomas Wicki, and Claire Wolf.

We express our gratitude to everyone that contributed to, reviewed or improved this specification through their comments and questions.

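Bit-manipulation a, b, c and s extensions grouped for public review and ratification

The bit-manipulation (bitmanip) extension comprises several component extensions to the base RISC-V architecture that are intended to provide some combination of code size reduction, performance improvement, and energy reduction. While the instructions are intended to have general use, some instructions are more useful in some domains than others. Hence, several smaller bitmanip extensions are provided, rather than one large extension. Each of these smaller extensions is grouped by common function and use case, and each has its own Zb*-extension name.

Each bitmanip extension includes a group of several bitmanip instructions that have similar purposes and that can often share the same logic. Some instructions are available in only one extension while others are available in several. The instructions have mnemonics and encodings that are independent of the extensions in which they appear. Thus, when implementing extensions with overlapping instructions, there is no redundancy in logic or encoding.

The bitmanip extensions are defined for RV32 and RV64. Most of the instructions are expected to be forward-compatible with RV128. While the shift-immediate instructions are defined to have at most a 6-bit immediate field, a 7th bit is available in the encoding space should this be needed for RV128.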

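Word Instructions

The bitmanip extension follows the convention in RV64 that w-suffixed instructions (without a dot before the w) ignore the upper 32 bits of their inputs, operate on the least-significant 32 bits as signed values, and produce a 32-bit signed result that is sign-extended to XLEN.

Bitmanip instructions with the suffix .uw have one operand that is an unsigned 32-bit value extracted from the least-significant 32 bits of the specified register. Other than that, these instructions perform full XLEN operations.

Bitmanip instructions with the suffixes .b, .h and .w only look at the least-significant 8 bits, 16 bits and 32 bits of the input (respectively) and produce an XLEN-wide result that is sign-extended or zero-extended, depending on the specific instruction.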
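
The two RV64 suffix conventions can be illustrated with a minimal Python sketch. It assumes XLEN = 64 and models registers as unsigned 64-bit integers; the helper names (sext32, rolw, add_uw) are illustrative only and are not defined by this specification.

```python
XLEN = 64
MASK = (1 << XLEN) - 1

def sext32(x):
    """Sign-extend the low 32 bits of x to XLEN (returned as a register bit pattern)."""
    x &= 0xFFFFFFFF
    if x & 0x80000000:
        x |= MASK ^ 0xFFFFFFFF
    return x

def rolw(rs1, rs2):
    # w suffix: upper 32 input bits are ignored, result is sign-extended.
    word = rs1 & 0xFFFFFFFF
    shamt = rs2 & 31
    rotated = ((word << shamt) | (word >> (32 - shamt))) & 0xFFFFFFFF
    return sext32(rotated)

def add_uw(rs1, rs2):
    # .uw suffix: rs1 is an unsigned 32-bit operand, the addition is XLEN-wide.
    return (rs2 + (rs1 & 0xFFFFFFFF)) & MASK
```

In this model, rolw discards the upper half of rs1 and sign-extends bit 31 of the result, whereas add_uw only narrows one operand and otherwise keeps all 64 bits.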

Pseudocode for instruction semantics

The semantics of each instruction in Instructions (in alphabetical order) are expressed in SAIL syntax.

Extensions

The first group of bitmanip extensions released for public review is:

Address generation instructions
Basic bit-manipulation
Carry-less multiplication
Single-bit instructions

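Below is a list of all of the instructions (and pseudoinstructions) that are included in these extensions, along with their mapping to each extension: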

RV 32	RV 64	Mnemonic	Instruction	Zba	Zbb	Zbc	Zbs
	✓	add.uw rd, rs1, rs2	Add unsigned word	✓
✓	✓	andn rd, rs1, rs2	AND with inverted operand		✓
✓	✓	clmul rd, rs1, rs2	Carry-less multiply (low-part)			✓
✓	✓	clmulh rd, rs1, rs2	Carry-less multiply (high-part)			✓
✓	✓	clmulr rd, rs1, rs2	Carry-less multiply (reversed)			✓
✓	✓	clz rd, rs	Count leading zero bits		✓
	✓	clzw rd, rs	Count leading zero bits in word		✓
✓	✓	cpop rd, rs	Count set bits		✓
	✓	cpopw rd, rs	Count set bits in word		✓
✓	✓	ctz rd, rs	Count trailing zero bits		✓
	✓	ctzw rd, rs	Count trailing zero bits in word		✓
✓	✓	max rd, rs1, rs2	Maximum		✓
✓	✓	maxu rd, rs1, rs2	Unsigned maximum		✓
✓	✓	min rd, rs1, rs2	Minimum		✓
✓	✓	minu rd, rs1, rs2	Unsigned minimum		✓
✓	✓	orc.b rd, rs	Bitwise OR-Combine, byte granule		✓
✓	✓	orn rd, rs1, rs2	OR with inverted operand		✓
✓	✓	rev8 rd, rs	Byte-reverse register		✓
✓	✓	rol rd, rs1, rs2	Rotate left (Register)		✓
	✓	rolw rd, rs1, rs2	Rotate Left Word (Register)		✓
✓	✓	ror rd, rs1, rs2	Rotate right (Register)		✓
✓	✓	rori rd, rs1, shamt	Rotate right (Immediate)		✓
	✓	roriw rd, rs1, shamt	Rotate right Word (Immediate)		✓
	✓	rorw rd, rs1, rs2	Rotate right Word (Register)		✓
✓	✓	bclr rd, rs1, rs2	Single-Bit Clear (Register)				✓
✓	✓	bclri rd, rs1, imm	Single-Bit Clear (Immediate)				✓
✓	✓	bext rd, rs1, rs2	Single-Bit Extract (Register)				✓
✓	✓	bexti rd, rs1, imm	Single-Bit Extract (Immediate)				✓
✓	✓	binv rd, rs1, rs2	Single-Bit Invert (Register)				✓
✓	✓	binvi rd, rs1, imm	Single-Bit Invert (Immediate)				✓
✓	✓	bset rd, rs1, rs2	Single-Bit Set (Register)				✓
✓	✓	bseti rd, rs1, imm	Single-Bit Set (Immediate)				✓
✓	✓	sext.b rd, rs	Sign-extend byte		✓
✓	✓	sext.h rd, rs	Sign-extend halfword		✓
✓	✓	sh1add rd, rs1, rs2	Shift left by 1 and add	✓
	✓	sh1add.uw rd, rs1, rs2	Shift unsigned word left by 1 and add	✓
✓	✓	sh2add rd, rs1, rs2	Shift left by 2 and add	✓
	✓	sh2add.uw rd, rs1, rs2	Shift unsigned word left by 2 and add	✓
✓	✓	sh3add rd, rs1, rs2	Shift left by 3 and add	✓
	✓	sh3add.uw rd, rs1, rs2	Shift unsigned word left by 3 and add	✓
	✓	slli.uw rd, rs1, imm	Shift-left unsigned word (Immediate)	✓
✓	✓	xnor rd, rs1, rs2	Exclusive NOR		✓
✓	✓	zext.h rd, rs	Zero-extend halfword		✓
	✓	zext.w rd, rs	Zero-extend word (pseudoinstruction for add.uw)	✓

Zba extension

Note

The Zba extension is frozen.

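The Zba instructions can be used to accelerate the generation of addresses that index into arrays of basic types (halfword, word, doubleword) using both unsigned word-sized and XLEN-sized indices.

The shift-and-add instructions perform a left shift of 1, 2, or 3 because these shifts are common in real-world code and can be implemented with minimal additional hardware beyond a simple adder. This avoids lengthening the critical path in implementations.

While the shift-and-add instructions are limited to a maximum left shift of 3, the slli instruction (from the base ISA) can be used to perform similar shifts for indexing into arrays of wider elements. The slli.uw instruction added in this extension can be used when the index is to be interpreted as an unsigned word.

The Zba extension consists of the following instructions: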

RV 32	RV 64	Mnemonic	Instruction
	✓	add.uw rd, rs1, rs2	Add unsigned word
✓	✓	sh1add rd, rs1, rs2	Shift left by 1 and add
	✓	sh1add.uw rd, rs1, rs2	Shift unsigned word left by 1 and add
✓	✓	sh2add rd, rs1, rs2	Shift left by 2 and add
	✓	sh2add.uw rd, rs1, rs2	Shift unsigned word left by 2 and add
✓	✓	sh3add rd, rs1, rs2	Shift left by 3 and add
	✓	sh3add.uw rd, rs1, rs2	Shift unsigned word left by 3 and add
	✓	slli.uw rd, rs1, imm	Shift-left unsigned word (Immediate)
	✓	zext.w rd, rs	Zero-extend word (pseudoinstruction for add.uw)
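
As an illustration of the address-generation use case, the following Python sketch models sh2add, sh2add.uw and slli.uw for XLEN = 64. The base address and index values are made up for the example, and the function names are only stand-ins for the instructions.

```python
XLEN = 64
MASK = (1 << XLEN) - 1

def sh2add(rs1, rs2):
    # rd = rs2 + (rs1 << 2): address of element rs1 in an array of 4-byte words.
    return (rs2 + (rs1 << 2)) & MASK

def sh2add_uw(rs1, rs2):
    # Same, but the index is the zero-extended low word of rs1.
    return (rs2 + ((rs1 & 0xFFFFFFFF) << 2)) & MASK

def slli_uw(rs1, shamt):
    # Zero-extend the low word of rs1, then shift; for elements wider than 8 bytes.
    return ((rs1 & 0xFFFFFFFF) << shamt) & MASK

base = 0x80000000          # hypothetical array base address
index = 5                  # hypothetical element index
word_addr = sh2add(index, base)               # &array[5] for 4-byte elements
wide_addr = (base + slli_uw(index, 4)) & MASK  # &array[5] for 16-byte elements
```

With 16-byte elements the shift exceeds 3, so the sketch falls back to slli.uw followed by a separate add, as the text above describes.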

Zbb: Basic bit-manipulation

Note

The Zbb extension is frozen.

Logical with negate

RV 32	RV 64	Mnemonic	Instruction
✓	✓	andn rd, rs1, rs2	AND with inverted operand
✓	✓	orn rd, rs1, rs2	OR with inverted operand
✓	✓	xnor rd, rs1, rs2	Exclusive NOR

Note

The Logical with Negate instructions can be implemented by inverting the rs2 inputs to the base-required AND, OR, and XOR logic instructions. In some implementations, the inverter on rs2 used for subtraction can be reused for this purpose.

Count leading/trailing zero bits

RV 32	RV 64	Mnemonic	Instruction
✓	✓	clz rd, rs	Count leading zero bits
	✓	clzw rd, rs	Count leading zero bits in word
✓	✓	ctz rd, rs	Count trailing zero bits
	✓	ctzw rd, rs	Count trailing zero bits in word

Count population

These instructions count the number of set bits (1-bits). This is commonly referred to as population count.

RV 32	RV 64	Mnemonic	Instruction
✓	✓	cpop rd, rs	Count set bits
	✓	cpopw rd, rs	Count set bits in word

Integer minimum/maximum

The integer minimum/maximum instructions are arithmetic R-type instructions that return the smaller or larger of two operands.

RV 32	RV 64	Mnemonic	Instruction
✓	✓	max rd, rs1, rs2	Maximum
✓	✓	maxu rd, rs1, rs2	Unsigned maximum
✓	✓	min rd, rs1, rs2	Minimum
✓	✓	minu rd, rs1, rs2	Unsigned minimum

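Sign- and zero-extension

These instructions perform the sign-extension or zero-extension of the least-significant 8 bits, 16 bits or 32 bits of the source register.

These instructions replace the generalized idioms slli rD,rS,(XLEN-<size>) + srai (for the sign-extension of 8-bit and 16-bit quantities) and slli + srli (for the zero-extension of 16-bit and 32-bit quantities).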

RV 32	RV 64	Mnemonic	Instruction
✓	✓	sext.b rd, rs	Sign-extend byte
✓	✓	sext.h rd, rs	Sign-extend halfword
✓	✓	zext.h rd, rs	Zero-extend halfword
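
The shift-pair idioms that these instructions replace can be sketched in Python as follows, assuming XLEN = 64. The slli, srli and srai helpers model the base-ISA shifts on 64-bit register values; the *_idiom names are illustrative only.

```python
XLEN = 64
MASK = (1 << XLEN) - 1

def slli(x, n):
    return (x << n) & MASK

def srli(x, n):
    return (x & MASK) >> n

def srai(x, n):
    x &= MASK
    if x >> (XLEN - 1):      # reinterpret as a negative two's-complement value
        x -= 1 << XLEN
    return (x >> n) & MASK   # Python's >> on negative ints is arithmetic

def sext_b_idiom(rs):
    return srai(slli(rs, XLEN - 8), XLEN - 8)

def sext_h_idiom(rs):
    return srai(slli(rs, XLEN - 16), XLEN - 16)

def zext_h_idiom(rs):
    return srli(slli(rs, XLEN - 16), XLEN - 16)
```

Each idiom costs two instructions, whereas sext.b, sext.h and zext.h produce the same result in one.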

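Bitwise rotation

Bitwise rotation instructions are similar to the shift-logical operations from the base specification. However, where the shift-logical instructions shift in zeros, the rotate instructions shift in the bits that were shifted out of the other side of the value. Such operations are also referred to as 'circular shifts'.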

RV 32	RV 64	Mnemonic	Instruction
✓	✓	rol rd, rs1, rs2	Rotate left (Register)
	✓	rolw rd, rs1, rs2	Rotate Left Word (Register)
✓	✓	ror rd, rs1, rs2	Rotate right (Register)
✓	✓	rori rd, rs1, shamt	Rotate right (Immediate)
	✓	roriw rd, rs1, shamt	Rotate right Word (Immediate)
	✓	rorw rd, rs1, rs2	Rotate right Word (Register)

Note

The rotate instructions were included to replace a common four-instruction sequence that achieves the same effect (neg; sll/srl; srl/sll; or).
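
The equivalence between a rotate and the four-instruction sequence can be checked with a small Python model, assuming XLEN = 64. The sll, srl and neg helpers model the base-ISA instructions (sll and srl use only the low log2(XLEN) bits of the shift amount); rol_sequence and rol are illustrative names.

```python
XLEN = 64
MASK = (1 << XLEN) - 1

def sll(x, s):
    return (x << (s & (XLEN - 1))) & MASK

def srl(x, s):
    return (x & MASK) >> (s & (XLEN - 1))

def neg(x):
    return (-x) & MASK

def rol_sequence(x, s):
    """neg; sll; srl; or"""
    t = neg(s)          # low bits of t equal (XLEN - s) mod XLEN
    hi = sll(x, s)
    lo = srl(x, t)
    return hi | lo

def rol(x, s):
    """Single rotate-left, following the rol pseudocode."""
    s &= XLEN - 1
    return ((x << s) | (x >> (XLEN - s))) & MASK
```

A rotate right can be modelled the same way by swapping the roles of sll and srl.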

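OR Combine

orc.b sets the bits of each byte in the result rd to all zeros if no bit within the respective byte of rs is set, or to all ones if any bit within the respective byte of rs is set.

One use case is string-processing functions such as strlen and strcpy, which can use orc.b to test for the terminating zero byte by counting the set bits in the leading non-zero bytes of a word.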

RV 32	RV 64	Mnemonic	Instruction
✓	✓	orc.b rd, rs	Bitwise OR-Combine, byte granule
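
A minimal Python sketch of the zero-byte test follows, assuming XLEN = 64 and a little-endian word loaded from a string. This variant inverts the orc.b result and counts trailing zeros rather than counting set bits; the ctz helper stands in for the ctz instruction, and first_zero_byte is a hypothetical name.

```python
XLEN = 64
MASK = (1 << XLEN) - 1

def orc_b(rs):
    out = 0
    for i in range(0, XLEN, 8):
        if (rs >> i) & 0xFF:      # any bit set in this byte?
            out |= 0xFF << i      # then the whole result byte is ones
    return out

def ctz(rs):
    if rs == 0:
        return XLEN
    return (rs & -rs).bit_length() - 1

def first_zero_byte(word):
    """Byte index of the first zero byte, or XLEN // 8 if there is none."""
    return ctz(~orc_b(word) & MASK) >> 3

word = int.from_bytes(b"abc\0defg", "little")
position = first_zero_byte(word)
```

A strlen-style loop would add XLEN // 8 to its running length for each word with no zero byte and stop at the first word that contains one.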

Byte-reverse

rev8 reverses the byte-ordering of rs.

RV 32	RV 64	Mnemonic	Instruction
✓	✓	rev8 rd, rs	Byte-reverse register

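Zbc: Carry-less multiplication

Note

The Zbc extension is frozen.

Carry-less multiplication is the multiplication in the polynomial ring over GF(2).

clmul produces the lower half of the 2·XLEN carry-less product and clmulh produces the upper half of the 2·XLEN carry-less product.

clmulr produces bits 2·XLEN−2:XLEN−1 of the 2·XLEN carry-less product.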

RV 32	RV 64	Mnemonic	Instruction
✓	✓	clmul rd, rs1, rs2	Carry-less multiply (low-part)
✓	✓	clmulh rd, rs1, rs2	Carry-less multiply (high-part)
✓	✓	clmulr rd, rs1, rs2	Carry-less multiply (reversed)
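
The relationship between the three instructions can be sketched in Python by computing the full carry-less product once and slicing it, assuming XLEN = 64. The helper clmul_full is hypothetical and exists only for illustration; a hardware implementation would not necessarily form the full product.

```python
XLEN = 64
MASK = (1 << XLEN) - 1

def clmul_full(a, b):
    """Full carry-less product: XOR of shifted copies of a, one per set bit of b."""
    out = 0
    for i in range(XLEN):
        if (b >> i) & 1:
            out ^= a << i
    return out

def clmul(a, b):
    return clmul_full(a, b) & MASK                   # bits XLEN-1 : 0

def clmulh(a, b):
    return clmul_full(a, b) >> XLEN                  # bits 2*XLEN-1 : XLEN

def clmulr(a, b):
    return (clmul_full(a, b) >> (XLEN - 1)) & MASK   # bits 2*XLEN-2 : XLEN-1
```

For example, clmul(0b11, 0b11) is 0b101, because (x + 1)·(x + 1) = x² + 1 when coefficients are taken modulo 2.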

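Zbs: Single-bit instructions

Note

The Zbs extension is frozen.

The single-bit instructions provide a mechanism to set, clear, invert, or extract a single bit in a register. The bit is specified by its index.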

RV 32	RV 64	Mnemonic	Instruction
✓	✓	bclr rd, rs1, rs2	Single-Bit Clear (Register)
✓	✓	bclri rd, rs1, imm	Single-Bit Clear (Immediate)
✓	✓	bext rd, rs1, rs2	Single-Bit Extract (Register)
✓	✓	bexti rd, rs1, imm	Single-Bit Extract (Immediate)
✓	✓	binv rd, rs1, rs2	Single-Bit Invert (Register)
✓	✓	binvi rd, rs1, imm	Single-Bit Invert (Immediate)
✓	✓	bset rd, rs1, rs2	Single-Bit Set (Register)
✓	✓	bseti rd, rs1, imm	Single-Bit Set (Immediate)
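
The four register-form operations can be modelled in a few lines of Python, assuming XLEN = 64. The function names mirror the mnemonics but are only an illustrative model; as in the instruction descriptions, the index is taken from the low log2(XLEN) bits of the second operand.

```python
XLEN = 64
MASK = (1 << XLEN) - 1

def bset(rs1, rs2):
    return (rs1 | (1 << (rs2 & (XLEN - 1)))) & MASK

def bclr(rs1, rs2):
    return rs1 & ~(1 << (rs2 & (XLEN - 1))) & MASK

def binv(rs1, rs2):
    return (rs1 ^ (1 << (rs2 & (XLEN - 1)))) & MASK

def bext(rs1, rs2):
    return (rs1 >> (rs2 & (XLEN - 1))) & 1
```

Because only the low bits of the index are used, an index of XLEN + 3 addresses bit 3 in this model.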

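Zbkc: Carry-less multiplication for Cryptography

Note

The Zbkc extension is frozen.

Carry-less multiplication is the multiplication in the polynomial ring over GF(2). This is a critical operation in some cryptographic workloads, particularly the AES-GCM authenticated encryption scheme. This extension provides only the instructions needed to efficiently implement the GHASH operation, which is part of this workload.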

RV 32	RV 64	Mnemonic	Instruction
✓	✓	clmul rd, rs1, rs2	Carry-less multiply (low-part)
✓	✓	clmulh rd, rs1, rs2	Carry-less multiply (high-part)

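Zbkx: Crossbar permutations

Note

The Zbkx extension is frozen.

These instructions implement a "lookup table" for 4-bit and 8-bit elements inside the general-purpose registers. rs1 is used as a vector of N-bit words, and rs2 as a vector of N-bit indices into rs1. Each element of rs2 is replaced by the element of rs1 that it indexes, or by zero if the index is out of bounds.

These instructions are useful for expressing N-bit to N-bit boolean operations, and for implementing cryptographic code with secret-dependent memory accesses (particularly SBoxes) such that the execution latency does not depend on the (secret) data being operated on.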

RV 32	RV 64	Mnemonic	Instruction
✓	✓	xperm.n rd, rs1, rs2	Crossbar permutation (nibbles)
✓	✓	xperm.b rd, rs1, rs2	Crossbar permutation (bytes)
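
The lookup behaviour can be sketched in Python as follows, assuming XLEN = 64. The generic xperm helper and the xperm_b / xperm_n wrappers are illustrative names that follow the mnemonics in the table above; they are a model of the described behaviour, not the normative definition.

```python
XLEN = 64

def xperm(rs1, rs2, n):
    """Crossbar lookup on n-bit elements: rs1 is the table, rs2 holds the indices."""
    elem_mask = (1 << n) - 1
    count = XLEN // n
    out = 0
    for i in range(count):
        idx = (rs2 >> (i * n)) & elem_mask
        if idx < count:  # out-of-range indices produce a zero element
            out |= ((rs1 >> (idx * n)) & elem_mask) << (i * n)
    return out

def xperm_b(rs1, rs2):
    return xperm(rs1, rs2, 8)

def xperm_n(rs1, rs2):
    return xperm(rs1, rs2, 4)

table = 0x8877665544332211
reversed_bytes = xperm_b(table, 0x0001020304050607)
```

Every element is looked up in the same way regardless of the index value, which is the property the constant-latency use case above relies on.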

Zbkb: Bit-manipulation for Cryptography

Note

The Zbkb extension is frozen.

This extension contains instructions essential for implementing common operations in cryptographic workloads.

RV 32	RV 64	Mnemonic	Instruction
✓	✓	rol	Rotate left (Register)
	✓	rolw	Rotate Left Word (Register)
✓	✓	ror	Rotate right (Register)
✓	✓	rori	Rotate right (Immediate)
	✓	roriw	Rotate right Word (Immediate)
	✓	rorw	Rotate right Word (Register)
✓	✓	andn	AND with inverted operand
✓	✓	orn	OR with inverted operand
✓	✓	xnor	Exclusive NOR
✓	✓	pack	Pack low halves of registers
✓	✓	packh	Pack low bytes of registers
	✓	packw	Pack low 16-bits of registers (RV64)
✓	✓	rev.b	Reverse bits in bytes
✓	✓	rev8	Byte-reverse register
✓		zip	Bit interleave
✓		unzip	Bit deinterleave

Instructions (in alphabetical order)

add.uw

Synopsis: Add unsigned word
Mnemonic: add.uw rd, rs1, rs2
Pseudoinstructions: zext.w rd, rs1 → add.uw rd, rs1, zero

Encoding

Description: This instruction performs an XLEN-wide addition between rs2 and the zero-extended least-significant word of rs1.

Operation

let base = X(rs2);
let index = EXTZ(X(rs1)[31..0]);

X(rd) = base + index;

Included in

Extension	Minimum version	Lifecycle state
Zba (Address generation instructions)	0.93	Frozen

andn

Synopsis: AND with inverted operand
Mnemonic: andn rd, rs1, rs2

Encoding

Description: This instruction performs the bitwise logical AND operation between rs1 and the bitwise inversion of rs2.

Operation

X(rd) = X(rs1) & ~X(rs2);

Included in

Extension	Minimum version	Lifecycle state
Zbb (Basic bit-manipulation)	0.93	Frozen
Zbkb (Bit-manipulation for Cryptography)	v0.9.4	Frozen

bclr

Synopsis: Single-Bit Clear (Register)
Mnemonic: bclr rd, rs1, rs2

Encoding

Description: This instruction returns rs1 with a single bit cleared at the index specified in rs2. The index is read from the lower log2(XLEN) bits of rs2.

Operation

let index = X(rs2) & (XLEN - 1);
X(rd) = X(rs1) & ~(1 << index)

Included in

Extension	Minimum version	Lifecycle state
Zbs (Single-bit instructions)	0.93	Frozen

bclri

Synopsis: Single-Bit Clear (Immediate)
Mnemonic: bclri rd, rs1, shamt

Encoding (RV32)

Encoding (RV64)

Description: This instruction returns rs1 with a single bit cleared at the index specified in shamt. The index is read from the lower log2(XLEN) bits of shamt. For RV32, the encodings corresponding to shamt[5]=1 are reserved.

Operation

let index = shamt & (XLEN - 1);
X(rd) = X(rs1) & ~(1 << index)

Included in

Extension	Minimum version	Lifecycle state
Zbs (Single-bit instructions)	0.93	Frozen

bext

Synopsis: Single-Bit Extract (Register)
Mnemonic: bext rd, rs1, rs2

Encoding

Description: This instruction returns a single bit extracted from rs1 at the index specified in rs2. The index is read from the lower log2(XLEN) bits of rs2.

Operation

let index = X(rs2) & (XLEN - 1);
X(rd) = (X(rs1) >> index) & 1;

Included in

Extension	Minimum version	Lifecycle state
Zbs (Single-bit instructions)	0.93	Frozen

bexti

Synopsis: Single-Bit Extract (Immediate)
Mnemonic: bexti rd, rs1, shamt

Encoding (RV32)

Encoding (RV64)

Description: This instruction returns a single bit extracted from rs1 at the index specified in shamt. The index is read from the lower log2(XLEN) bits of shamt. For RV32, the encodings corresponding to shamt[5]=1 are reserved.

Operation

let index = shamt & (XLEN - 1);
X(rd) = (X(rs1) >> index) & 1;

Included in

Extension	Minimum version	Lifecycle state
Zbs (Single-bit instructions)	0.93	Frozen

binv

Synopsis: Single-Bit Invert (Register)
Mnemonic: binv rd, rs1, rs2

Encoding

Description: This instruction returns rs1 with a single bit inverted at the index specified in rs2. The index is read from the lower log2(XLEN) bits of rs2.

Operation

let index = X(rs2) & (XLEN - 1);
X(rd) = X(rs1) ^ (1 << index)

Included in

Extension	Minimum version	Lifecycle state
Zbs (Single-bit instructions)	0.93	Frozen

binvi

Synopsis: Single-Bit Invert (Immediate)
Mnemonic: binvi rd, rs1, shamt

Encoding (RV32)

Encoding (RV64)

Description: This instruction returns rs1 with a single bit inverted at the index specified in shamt. The index is read from the lower log2(XLEN) bits of shamt. For RV32, the encodings corresponding to shamt[5]=1 are reserved.

Operation

let index = shamt & (XLEN - 1);
X(rd) = X(rs1) ^ (1 << index)

Included in

Extension	Minimum version	Lifecycle state
Zbs (Single-bit instructions)	0.93	Frozen

bset

Synopsis: Single-Bit Set (Register)
Mnemonic: bset rd, rs1, rs2

Encoding

Description: This instruction returns rs1 with a single bit set at the index specified in rs2. The index is read from the lower log2(XLEN) bits of rs2.

Operation

let index = X(rs2) & (XLEN - 1);
X(rd) = X(rs1) | (1 << index)

Included in

Extension	Minimum version	Lifecycle state
Zbs (Single-bit instructions)	0.93	Frozen

bseti

Synopsis: Single-Bit Set (Immediate)
Mnemonic: bseti rd, rs1, shamt

Encoding (RV32)

Encoding (RV64)

Description: This instruction returns rs1 with a single bit set at the index specified in shamt. The index is read from the lower log2(XLEN) bits of shamt. For RV32, the encodings corresponding to shamt[5]=1 are reserved.

Operation

let index = shamt & (XLEN - 1);
X(rd) = X(rs1) | (1 << index)

Included in

Extension	Minimum version	Lifecycle state
Zbs (Single-bit instructions)	0.93	Frozen

clmul

Synopsis: Carry-less multiply (low-part)
Mnemonic: clmul rd, rs1, rs2

Encoding

Description: clmul produces the lower half of the 2·XLEN carry-less product.

Operation

let rs1_val = X(rs1);
let rs2_val = X(rs2);
let output : xlenbits = 0;

foreach (i from 0 to (xlen - 1) by 1) {
   output = if   ((rs2_val >> i) & 1)
            then output ^ (rs1_val << i)
            else output;
}

X[rd] = output

Included in

Extension	Minimum version	Lifecycle state
Zbc (Carry-less multiplication)	0.93	Frozen
Zbkc (Carry-less multiplication for Cryptography)	v0.9.4	Frozen

clmulh

Synopsis: Carry-less multiply (high-part)
Mnemonic: clmulh rd, rs1, rs2

Encoding

Description: clmulh produces the upper half of the 2·XLEN carry-less product.

Operation

let rs1_val = X(rs1);
let rs2_val = X(rs2);
let output : xlenbits = 0;

foreach (i from 1 to xlen by 1) {
   output = if   ((rs2_val >> i) & 1)
            then output ^ (rs1_val >> (xlen - i))
            else output;
}

X[rd] = output

Included in

Extension	Minimum version	Lifecycle state
Zbc (Carry-less multiplication)	0.93	Frozen
Zbkc (Carry-less multiplication for Cryptography)	v0.9.4	Frozen

clmulr

Synopsis: Carry-less multiply (reversed)
Mnemonic: clmulr rd, rs1, rs2

Encoding

Description: clmulr produces bits 2·XLEN−2:XLEN-1 of the 2·XLEN carry-less product.

Operation

let rs1_val = X(rs1);
let rs2_val = X(rs2);
let output : xlenbits = 0;

foreach (i from 0 to (xlen - 1) by 1) {
   output = if   ((rs2_val >> i) & 1)
            then output ^ (rs1_val >> (xlen - i - 1))
            else output;
}

X[rd] = output

Note

The clmulr instruction is used to accelerate CRC calculations. The r in the instruction’s mnemonic stands for reversed, as the instruction is equivalent to bit-reversing the inputs, performing a clmul, then bit-reversing the output.
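
The bit-reversal equivalence stated in this note can be checked with a short Python model, assuming XLEN = 64. The clmul and clmulr functions follow the loops in the Operation sections; bitrev is a hypothetical helper that reverses all XLEN bits of a value.

```python
XLEN = 64
MASK = (1 << XLEN) - 1

def clmul(a, b):
    out = 0
    for i in range(XLEN):
        if (b >> i) & 1:
            out ^= a << i
    return out & MASK

def clmulr(a, b):
    out = 0
    for i in range(XLEN):
        if (b >> i) & 1:
            out ^= a >> (XLEN - i - 1)
    return out

def bitrev(x):
    """Reverse the order of all XLEN bits of x."""
    return int(f"{x & MASK:0{XLEN}b}"[::-1], 2)

def clmulr_via_reversal(a, b):
    # Bit-reverse the inputs, perform clmul, then bit-reverse the output.
    return bitrev(clmul(bitrev(a), bitrev(b)))
```

Both functions return the same value for any pair of 64-bit inputs in this model.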

Included in

Extension	Minimum version	Lifecycle state
Zbc (Carry-less multiplication)	0.93	Frozen

clz

Synopsis: Count leading zero bits
Mnemonic: clz rd, rs

Encoding

Description: This instruction counts the number of 0’s before the first 1, starting at the most-significant bit (i.e., XLEN-1) and progressing to bit 0. Accordingly, if the input is 0, the output is XLEN, and if the most-significant bit of the input is a 1, the output is 0.

Operation

val HighestSetBit : forall ('N : Int), 'N >= 0. bits('N) -> int

function HighestSetBit x = {
  foreach (i from (xlen - 1) to 0 by 1 in dec)
    if [x[i]] == 0b1 then return(i) else ();
  return -1;
}

let rs = X(rs);
X[rd] = (xlen - 1) - HighestSetBit(rs);

Included in

Extension	Minimum version	Lifecycle state
Zbb (Basic bit-manipulation)	0.93	Frozen

clzw

Synopsis: Count leading zero bits in word
Mnemonic: clzw rd, rs

Encoding

Description: This instruction counts the number of 0’s before the first 1 starting at bit 31 and progressing to bit 0. Accordingly, if the least-significant word is 0, the output is 32, and if the most-significant bit of the word (i.e., bit 31) is a 1, the output is 0.

Operation

val HighestSetBit32 : forall ('N : Int), 'N >= 0. bits('N) -> int

function HighestSetBit32 x = {
  foreach (i from 31 to 0 by 1 in dec)
    if [x[i]] == 0b1 then return(i) else ();
  return -1;
}

let rs = X(rs);
X[rd] = 31 - HighestSetBit32(rs);

Included in

Extension	Minimum version	Lifecycle state
Zbb (Basic bit-manipulation)	0.93	Frozen

cpop

Synopsis: Count set bits
Mnemonic: cpop rd, rs

Encoding

Description: This instruction counts the number of 1’s (i.e., set bits) in the source register.

Operation

let bitcount = 0;
let rs = X(rs);

foreach (i from 0 to (xlen - 1) in inc)
    if rs[i] == 0b1 then bitcount = bitcount + 1 else ();

X[rd] = bitcount

Note

This operation is known as population count, popcount, sideways sum, bit summation, or Hamming weight.

The GCC builtin function __builtin_popcount (unsigned int x) is implemented by cpop on RV32 and by cpopw on RV64. The GCC builtin function __builtin_popcountl (unsigned long x) for LP64 is implemented by cpop on RV64.

Included in

Extension	Minimum version	Lifecycle state
Zbb (Basic bit-manipulation)	0.93	Frozen

cpopw

Synopsis: Count set bits in word
Mnemonic: cpopw rd, rs

Encoding

Description: This instruction counts the number of 1’s (i.e., set bits) in the least-significant word of the source register.

Operation

let bitcount = 0;
let val = X(rs);

foreach (i from 0 to 31 in inc)
    if val[i] == 0b1 then bitcount = bitcount + 1 else ();

X[rd] = bitcount

Included in

Extension	Minimum version	Lifecycle state
Zbb (Basic bit-manipulation)	0.93	Frozen

ctz

Synopsis: Count trailing zero bits
Mnemonic: ctz rd, rs

Encoding

Description: This instruction counts the number of 0’s before the first 1, starting at the least-significant bit (i.e., 0) and progressing to the most-significant bit (i.e., XLEN-1). Accordingly, if the input is 0, the output is XLEN, and if the least-significant bit of the input is a 1, the output is 0.

Operation

val LowestSetBit : forall ('N : Int), 'N >= 0. bits('N) -> int

function LowestSetBit x = {
  foreach (i from 0 to (xlen - 1) by 1 in inc)
    if [x[i]] == 0b1 then return(i) else ();
  return xlen;
}

let rs = X(rs);
X[rd] = LowestSetBit(rs);

Included in

Extension	Minimum version	Lifecycle state
Zbb (Basic bit-manipulation)	0.93	Frozen

ctzw

Synopsis: Count trailing zero bits in word
Mnemonic: ctzw rd, rs

Encoding

Description: This instruction counts the number of 0’s before the first 1, starting at the least-significant bit (i.e., 0) and progressing to the most-significant bit of the least-significant word (i.e., 31). Accordingly, if the least-significant word is 0, the output is 32, and if the least-significant bit of the input is a 1, the output is 0.

Operation

val LowestSetBit32 : forall ('N : Int), 'N >= 0. bits('N) -> int

function LowestSetBit32 x = {
  foreach (i from 0 to 31 by 1 in inc)
    if [x[i]] == 0b1 then return(i) else ();
  return 32;
}

let rs = X(rs);
X[rd] = LowestSetBit32(rs);

Included in

Extension	Minimum version	Lifecycle state
Zbb (Basic bit-manipulation)	0.93	Frozen

max

Synopsis: Maximum
Mnemonic: max rd, rs1, rs2

Encoding

Description: This instruction returns the larger of two signed integers.

Operation

let rs1_val = X(rs1);
let rs2_val = X(rs2);

let result = if   rs1_val <_s rs2_val
             then rs2_val
             else rs1_val;

X(rd) = result;

Note

Calculating the absolute value of a signed integer can be performed using the following sequence: neg rD,rS followed by max rD,rS,rD. When using this common sequence, it is suggested that they are scheduled with no intervening instructions so that implementations that are so optimized can fuse them together.
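
The neg/max sequence can be modelled in Python as follows, assuming XLEN = 64 and register values held as unsigned 64-bit patterns. The names to_signed, neg, max_signed and abs_sequence are illustrative; max_signed follows the max pseudocode above.

```python
XLEN = 64
MASK = (1 << XLEN) - 1

def to_signed(x):
    x &= MASK
    return x - (1 << XLEN) if x >> (XLEN - 1) else x

def neg(rs):
    return (-rs) & MASK

def max_signed(rs1, rs2):
    # Return rs2 only if rs1 is strictly smaller under signed comparison.
    return rs2 if to_signed(rs1) < to_signed(rs2) else rs1

def abs_sequence(rs):
    rd = neg(rs)               # neg rD,rS
    return max_signed(rs, rd)  # max rD,rS,rD
```

In this model the most negative value, 1 << (XLEN - 1), is returned unchanged because its negation wraps to itself.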

Included in

Extension	Minimum version	Lifecycle state
Zbb (Basic bit-manipulation)	0.93	Frozen

maxu

Synopsis: Unsigned maximum
Mnemonic: maxu rd, rs1, rs2

Encoding

Description: This instruction returns the larger of two unsigned integers.

Operation

let rs1_val = X(rs1);
let rs2_val = X(rs2);

let result = if   rs1_val <_u rs2_val
             then rs2_val
             else rs1_val;

X(rd) = result;

Included in

Extension	Minimum version	Lifecycle state
Zbb (Basic bit-manipulation)	0.93	Frozen

min

Synopsis: Minimum
Mnemonic: min rd, rs1, rs2

Encoding

Description: This instruction returns the smaller of two signed integers.

Operation

let rs1_val = X(rs1);
let rs2_val = X(rs2);

let result = if   rs1_val <_s rs2_val
             then rs1_val
             else rs2_val;

X(rd) = result;

Included in

Extension	Minimum version	Lifecycle state
Zbb (Basic bit-manipulation)	0.93	Frozen

minu

Synopsis: Unsigned minimum
Mnemonic: minu rd, rs1, rs2

Encoding

Description: This instruction returns the smaller of two unsigned integers.

Operation

let rs1_val = X(rs1);
let rs2_val = X(rs2);

let result = if   rs1_val <_u rs2_val
             then rs1_val
             else rs2_val;

X(rd) = result;

Included in

Extension	Minimum version	Lifecycle state
Zbb (Basic bit-manipulation)	0.93	Frozen

orc.b

Synopsis: Bitwise OR-Combine, byte granule
Mnemonic: orc.b rd, rs

Encoding

Description: Combines the bits within each byte using bitwise logical OR. This sets the bits of each byte in the result rd to all zeros if no bit within the respective byte of rs is set, or to all ones if any bit within the respective byte of rs is set.

Operation

let input = X(rs);
let output : xlenbits = 0;

foreach (i from 0 to (xlen - 8) by 8) {
   output[(i + 7)..i] = if   input[(i + 7)..i] == 0
                        then 0b00000000
                        else 0b11111111;
}

X[rd] = output;

Included in

Extension	Minimum version	Lifecycle state
Zbb (Basic bit-manipulation)	0.93	Frozen

orn

Synopsis: OR with inverted operand
Mnemonic: orn rd, rs1, rs2

Encoding

Description: This instruction performs the bitwise logical OR operation between rs1 and the bitwise inversion of rs2.

Operation

X(rd) = X(rs1) | ~X(rs2);

Included in

Extension	Minimum version	Lifecycle state
Zbb (Basic bit-manipulation)	0.93	Frozen
Zbkb (Bit-manipulation for Cryptography)	v0.9.4	Frozen

pack

Synopsis: Pack the low halves of rs1 and rs2 into rd.
Mnemonic: pack rd, rs1, rs2

Encoding

Description: The pack instruction packs the XLEN/2-bit lower halves of rs1 and rs2 into rd, with rs1 in the lower half and rs2 in the upper half.

Operation

let lo_half : bits(xlen/2) = X(rs1)[xlen/2-1..0];
let hi_half : bits(xlen/2) = X(rs2)[xlen/2-1..0];
X(rd) = EXTZ(hi_half @ lo_half);

Included in

Extension	Minimum version	Lifecycle state
Zbkb (Bit-manipulation for Cryptography)	v0.9.4	Frozen

packh

Synopsis: Pack the low bytes of rs1 and rs2 into rd.
Mnemonic: packh rd, rs1, rs2

Encoding

Description: The packh instruction packs the least-significant bytes of rs1 and rs2 into the 16 least-significant bits of rd, zero-extending the rest of rd.

Operation

let lo_half : bits(8) = X(rs1)[7..0];
let hi_half : bits(8) = X(rs2)[7..0];
X(rd) = EXTZ(hi_half @ lo_half);

Included in

Extension	Minimum version	Lifecycle state
Zbkb (Bit-manipulation for Cryptography)	v0.9.4	Frozen

packw

Synopsis: Pack the low 16-bits of rs1 and rs2 into rd on RV64.
Mnemonic: packw rd, rs1, rs2

Encoding

Description: This instruction packs the low 16 bits of rs1 and rs2 into the 32 least-significant bits of rd, sign extending the 32-bit result to the rest of rd. This instruction only exists on RV64 based systems.

Operation

let lo_half : bits(16) = X(rs1)[15..0];
let hi_half : bits(16) = X(rs2)[15..0];
X(rd) = EXTS(hi_half @ lo_half);

Included in

Extension	Minimum version	Lifecycle state
Zbkb (Bit-manipulation for Cryptography)	v0.9.4	Frozen

rev8

Synopsis: Byte-reverse register
Mnemonic: rev8 rd, rs

Encoding (RV32)

Encoding (RV64)

Description: This instruction reverses the order of the bytes in rs.

Operation

let input = X(rs);
let output : xlenbits = 0;
let j = xlen - 1;

foreach (i from 0 to (xlen - 8) by 8) {
   output[(i + 7)..i] = input[j..(j - 7)];
   j = j - 8;
}

X[rd] = output

Note

The rev8 mnemonic corresponds to different instruction encodings in RV32 and RV64.

Note

The byte-reverse operation is only available for the full register width. To emulate word-sized and halfword-sized byte-reversal, perform a rev8 rd,rs followed by a srai rd,rd,K, where K is XLEN-32 and XLEN-16, respectively.
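
The emulation described in this note can be sketched in Python, assuming XLEN = 64. The rev8 and srai functions model the instructions on unsigned 64-bit register values; bswap32 and bswap16 are hypothetical names for the two-instruction sequences.

```python
XLEN = 64
MASK = (1 << XLEN) - 1

def rev8(rs):
    # Reinterpret the little-endian bytes of rs as a big-endian integer.
    return int.from_bytes((rs & MASK).to_bytes(XLEN // 8, "little"), "big")

def srai(x, n):
    x &= MASK
    if x >> (XLEN - 1):      # negative in two's complement
        x -= 1 << XLEN
    return (x >> n) & MASK

def bswap32(rs):
    return srai(rev8(rs), XLEN - 32)

def bswap16(rs):
    return srai(rev8(rs), XLEN - 16)
```

Because srai is an arithmetic shift, the swapped value is sign-extended to XLEN; a logical shift (srli) would yield a zero-extended result instead.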

Included in

Extension	Minimum version	Lifecycle state
Zbb (Basic bit-manipulation)	0.93	Frozen
Zbkb (Bit-manipulation for Cryptography)	v0.9.4	Frozen

rev.b

Synopsis: Reverse the bits in each byte of a source register.
Mnemonic: rev.b rd, rs

Encoding

Description: This instruction reverses the order of the bits in every byte of a register.

Operation

result : xlenbits = EXTZ(0b0);
foreach (i from 0 to (sizeof(xlen) - 8) by 8) {
    result[i+7..i] = reverse_bits_in_byte(X(rs)[i+7..i]);
};
X(rd) = result;

Included in

Extension	Minimum version	Lifecycle state
Zbkb (Bit-manipulation for Cryptography)	v0.9.4	Frozen

rol

Synopsis: Rotate Left (Register)
Mnemonic: rol rd, rs1, rs2

Encoding

Description: This instruction performs a rotate left of rs1 by the amount in the least-significant log2(XLEN) bits of rs2.

Operation

let shamt = if   xlen == 32
            then X(rs2)[4..0]
            else X(rs2)[5..0];
let result = (X(rs1) << shamt) | (X(rs1) >> (xlen - shamt));

X(rd) = result;

Included in

Extension	Minimum version	Lifecycle state
Zbb (Basic bit-manipulation)	0.93	Frozen
Zbkb (Bit-manipulation for Cryptography)	v0.9.4	Frozen

rolw

Synopsis: Rotate Left Word (Register)
Mnemonic: rolw rd, rs1, rs2

Encoding

Description: This instruction performs a rotate left on the least-significant word of rs1 by the amount in the least-significant 5 bits of rs2. The resulting word value is sign-extended by copying bit 31 to all of the more-significant bits.

Operation

let rs1 = EXTZ(X(rs1)[31..0]);
let shamt = X(rs2)[4..0];
let result = (rs1 << shamt) | (rs1 >> (32 - shamt));
X(rd) = EXTS(result[31..0]);

Included in

Extension	Minimum version	Lifecycle state
Zbb (Basic bit-manipulation)	0.93	Frozen
Zbkb (Bit-manipulation for Cryptography)	v0.9.4	Frozen

ror

Synopsis: Rotate Right (Register)
Mnemonic: ror rd, rs1, rs2

Encoding

Description: This instruction performs a rotate right of rs1 by the amount in the least-significant log2(XLEN) bits of rs2.

Operation

let shamt = if   xlen == 32
            then X(rs2)[4..0]
            else X(rs2)[5..0];
let result = (X(rs1) >> shamt) | (X(rs1) << (xlen - shamt));

X(rd) = result;

Included in

Extension	Minimum version	Lifecycle state
Zbb (Basic bit-manipulation)	0.93	Frozen
Zbkb (Bit-manipulation for Cryptography)	v0.9.4	Frozen

rori

Synopsis: Rotate Right (Immediate)
Mnemonic: rori rd, rs1, shamt

Encoding (RV32)

Encoding (RV64)

Description: This instruction performs a rotate right of rs1 by the amount in the least-significant log2(XLEN) bits of shamt. For RV32, the encodings corresponding to shamt[5]=1 are reserved.

Operation

let shamt = if   xlen == 32
            then shamt[4..0]
            else shamt[5..0];
let result = (X(rs1) >> shamt) | (X(rs1) << (xlen - shamt));

X(rd) = result;

Included in

Extension	Minimum version	Lifecycle state
Zbb (Basic bit-manipulation)	0.93	Frozen
Zbkb (Bit-manipulation for Cryptography)	v0.9.4	Frozen

roriw

Synopsis: Rotate Right Word (Immediate)
Mnemonic: roriw rd, rs1, shamt

Encoding

Description: This instruction performs a rotate right on the least-significant word of rs1 by the amount in the least-significant 5 bits of shamt. The resulting word value is sign-extended by copying bit 31 to all of the more-significant bits.

Operation

let rs1_data = EXTZ(X(rs1)[31..0]);
let result = (rs1_data >> shamt) | (rs1_data << (32 - shamt));
X(rd) = EXTS(result[31..0]);

Included in

Extension	Minimum version	Lifecycle state
Zbb (Basic bit-manipulation)	0.93	Frozen
Zbkb (Bit-manipulation for Cryptography)	v0.9.4	Frozen

rorw

Synopsis: Rotate Right Word (Register)
Mnemonic: rorw rd, rs1, rs2

Encoding

Description: This instruction performs a rotate right on the least-significant word of rs1 by the amount in the least-significant 5 bits of rs2. The resulting word value is sign-extended by copying bit 31 to all of the more-significant bits.

Operation

let rs1 = EXTZ(X(rs1)[31..0]);
let shamt = X(rs2)[4..0];
let result = (rs1 >> shamt) | (rs1 << (32 - shamt));
X(rd) = EXTS(result[31..0]);

Included in

Extension	Minimum version	Lifecycle state
Zbb (Basic bit-manipulation)	0.93	Frozen
Zbkb (Bit-manipulation for Cryptography)	v0.9.4	Frozen

sext.b

Synopsis: Sign-extend byte
Mnemonic: sext.b rd, rs

Encoding

Description: This instruction sign-extends the least-significant byte in the source to XLEN by copying the most-significant bit in the byte (i.e., bit 7) to all of the more-significant bits.

Operation

X(rd) = EXTS(X(rs)[7..0]);

Included in

Extension	Minimum version	Lifecycle state
Zbb (Basic bit-manipulation)	0.93	Frozen

sext.h

Synopsis: Sign-extend halfword
Mnemonic: sext.h rd, rs

Encoding

Description: This instruction sign-extends the least-significant halfword in rs to XLEN by copying the most-significant bit in the halfword (i.e., bit 15) to all of the more-significant bits.

Operation

X(rd) = EXTS(X(rs)[15..0]);

Included in

Extension	Minimum version	Lifecycle state
Zbb (Basic bit-manipulation)	0.93	Frozen

sh1add

Synopsis: Shift left by 1 and add
Mnemonic: sh1add rd, rs1, rs2

Encoding

Description: This instruction shifts rs1 to the left by 1 bit and adds it to rs2.

Operation

X(rd) = X(rs2) + (X(rs1) << 1);

Included in

Extension	Minimum version	Lifecycle state
Zba (Address generation instructions)	0.93	Frozen

sh1add.uw

Synopsis: Shift unsigned word left by 1 and add
Mnemonic: sh1add.uw rd, rs1, rs2

Encoding

Description: This instruction performs an XLEN-wide addition of two addends. The first addend is rs2. The second addend is the unsigned value formed by extracting the least-significant word of rs1 and shifting it left by 1 place.

Operation

let base = X(rs2);
let index = EXTZ(X(rs1)[31..0]);

X(rd) = base + (index << 1);

Included in

Extension	Minimum version	Lifecycle state
Zba (Address generation instructions)	0.93	Frozen

sh2add

Synopsis: Shift left by 2 and add
Mnemonic: sh2add rd, rs1, rs2

Encoding

Description: This instruction shifts rs1 to the left by 2 places and adds it to rs2.

Operation

X(rd) = X(rs2) + (X(rs1) << 2);

Included in

Extension	Minimum version	Lifecycle state
Zba (Address generation instructions)	0.93	Frozen

sh2add.uw

Synopsis: Shift unsigned word left by 2 and add
Mnemonic: sh2add.uw rd, rs1, rs2

Encoding

Description: This instruction performs an XLEN-wide addition of two addends. The first addend is rs2. The second addend is the unsigned value formed by extracting the least-significant word of rs1 and shifting it left by 2 places.

Operation

let base = X(rs2);
let index = EXTZ(X(rs1)[31..0]);

X(rd) = base + (index << 2);

Included in

Extension	Minimum version	Lifecycle state
Zba (Address generation instructions)	0.93	Frozen

sh3add

Synopsis: Shift left by 3 and add
Mnemonic: sh3add rd, rs1, rs2

Encoding

Description: This instruction shifts rs1 to the left by 3 places and adds it to rs2.

Operation

X(rd) = X(rs2) + (X(rs1) << 3);

Included in

Extension	Minimum version	Lifecycle state
Zba (Address generation instructions)	0.93	Frozen

sh3add.uw

Synopsis: Shift unsigned word left by 3 and add
Mnemonic: sh3add.uw rd, rs1, rs2

Encoding

Description: This instruction performs an XLEN-wide addition of two addends. The first addend is rs2. The second addend is the unsigned value formed by extracting the least-significant word of rs1 and shifting it left by 3 places.

Operation

let base = X(rs2);
let index = EXTZ(X(rs1)[31..0]);

X(rd) = base + (index << 3);

Included in

Extension	Minimum version	Lifecycle state
Zba (Address generation instructions)	0.93	Frozen

slli.uw

Synopsis: Shift-left unsigned word (Immediate)
Mnemonic: slli.uw rd, rs1, shamt

Encoding

Description: This instruction takes the least-significant word of rs1, zero-extends it, and shifts it left by the immediate.

Operation

X(rd) = (EXTZ(X(rs1)[31..0]) << shamt);

Included in

Extension	Minimum version	Lifecycle state
Zba (Address generation instructions)	0.93	Frozen

Note

This instruction is the same as slli with zext.w performed on rs1 before shifting.
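
The equivalence stated in the note can be sketched in C for RV64. The name slli_uw_model is hypothetical, and the shift amount is assumed to be the 6-bit immediate (0 to 63).

```c
#include <stdint.h>

/* Hypothetical model of slli.uw on RV64; illustrative only. */
static inline uint64_t slli_uw_model(uint64_t rs1, unsigned shamt)
{
    uint64_t zext = (uint64_t)(uint32_t)rs1;   /* zext.w: keep bits 31..0 */
    return zext << (shamt & 63);               /* slli by the immediate */
}
```

For example, slli_uw_model(0xFFFFFFFFFFFFFFFF, 4) returns 0xFFFFFFFF0: the upper 32 source bits are discarded before the shift rather than shifted out.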

unzip

Synopsis: Implements the inverse of the zip instruction.
Mnemonic: unzip rd, rs

Encoding

Description: This instruction places the even-numbered bits of the source word into the low half of the destination word and the odd-numbered bits into the high half. It is the inverse of the zip instruction. This instruction is available only on RV32.

Operation

foreach (i from 0 to xlen/2-1) {
  X(rd)[i] = X(rs)[2*i];
  X(rd)[i+xlen/2] = X(rs)[2*i+1];
}

Note

This instruction is useful for implementing the SHA3 cryptographic hash function on a 32-bit architecture, as it implements the bit-interleaving operation used to speed up the 64-bit rotations directly.

Included in

Extension	Minimum version	Lifecycle state
Zbkb (Bit-manipulation for Cryptography) (RV32)	v0.9.4	Frozen
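
Reading the Operation pseudocode for unzip literally, a bit-by-bit C model for RV32 might look like the sketch below. The name unzip32_model is hypothetical, and the loop favours clarity over speed.

```c
#include <stdint.h>

/* Hypothetical model of unzip for RV32 (XLEN = 32); illustrative only. */
static inline uint32_t unzip32_model(uint32_t rs)
{
    uint32_t rd = 0;
    for (unsigned i = 0; i < 16; i++) {
        rd |= ((rs >> (2 * i)) & 1u) << i;             /* even bit -> low half  */
        rd |= ((rs >> (2 * i + 1)) & 1u) << (i + 16);  /* odd bit  -> high half */
    }
    return rd;
}
```

A source of 0x55555555 (all even bits set) yields 0x0000FFFF, and 0xAAAAAAAA (all odd bits set) yields 0xFFFF0000.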

xnor

Synopsis: Exclusive NOR
Mnemonic: xnor rd, rs1, rs2

Encoding

Description: This instruction performs the bit-wise exclusive-NOR operation on rs1 and rs2.

Operation

X(rd) = ~(X(rs1) ^ X(rs2));

Included in

Extension	Minimum version	Lifecycle state
Zbb (Basic bit-manipulation)	0.93	Frozen
Zbkb (Bit-manipulation for Cryptography)	v0.9.4	Frozen

xperm.b

Synopsis: Byte-wise lookup of indices into a vector in registers.
Mnemonic: xperm.b rd, rs1, rs2

Encoding

Description: The xperm.b instruction operates on bytes. The rs1 register contains a vector of XLEN/8 8-bit elements. The rs2 register contains a vector of XLEN/8 8-bit indexes. The result is each element in rs2 replaced by the indexed element in rs1, or zero if the index into rs1 is out of bounds.

Operation

val xpermb_lookup : (bits(8), xlenbits) -> bits(8)
function xpermb_lookup (idx, lut) = {
    (lut >> (idx @ 0b000))[7..0]
}

function clause execute ( XPERM_B (rs2,rs1,rd)) = {
    result : xlenbits = EXTZ(0b0);
    foreach(i from 0 to xlen - 8 by 8) {
        result[i+7..i] = xpermb_lookup(X(rs2)[i+7..i], X(rs1));
    };
    X(rd) = result;
    RETIRE_SUCCESS
}

Included in

Extension	Minimum version	Lifecycle state
Zbkx (Crossbar permutations)	v0.9.4	Frozen
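
A minimal C sketch of the xperm.b lookup, assuming RV64 (eight byte lanes). The name xperm_b_model is hypothetical. The explicit bounds check plays the role of the shift in the pseudocode running past the top of the register.

```c
#include <stdint.h>

/* Hypothetical model of xperm.b for RV64; illustrative only. */
static inline uint64_t xperm_b_model(uint64_t rs1, uint64_t rs2)
{
    uint64_t rd = 0;
    for (unsigned i = 0; i < 64; i += 8) {
        unsigned idx = (unsigned)((rs2 >> i) & 0xFF);   /* index byte from rs2 */
        /* Indices 8..255 fall outside the 8-entry table and yield zero. */
        uint64_t elem = (idx < 8) ? ((rs1 >> (idx * 8)) & 0xFF) : 0;
        rd |= elem << i;
    }
    return rd;
}
```

With rs2 = 0x0001020304050607 the model reverses the byte order of rs1, and with rs2 = 0 it broadcasts byte 0 of rs1 to every lane.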

xperm.n

Synopsis: Nibble-wise lookup of indices into a vector.
Mnemonic: xperm.n rd, rs1, rs2

Encoding

Description: The xperm.n instruction operates on nibbles. The rs1 register contains a vector of XLEN/4 4-bit elements. The rs2 register contains a vector of XLEN/4 4-bit indexes. The result is each element in rs2 replaced by the indexed element in rs1, or zero if the index into rs1 is out of bounds.

Operation

val xpermn_lookup : (bits(4), xlenbits) -> bits(4)
function xpermn_lookup (idx, lut) = {
    (lut >> (idx @ 0b00))[3..0]
}

function clause execute ( XPERM_N (rs2,rs1,rd)) = {
    result : xlenbits = EXTZ(0b0);
    foreach(i from 0 to xlen - 4 by 4) {
        result[i+3..i] = xpermn_lookup(X(rs2)[i+3..i], X(rs1));
    };
    X(rd) = result;
    RETIRE_SUCCESS
}

Included in

Extension	Minimum version	Lifecycle state
Zbkx (Crossbar permutations)	v0.9.4	Frozen
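
The out-of-bounds case for xperm.n can only arise on RV32, where the table in rs1 has 8 entries but a 4-bit index can reach 15; on RV64 the table has 16 entries, so every index is in range. The following C sketch models the RV32 behaviour; the name xperm_n32_model is hypothetical.

```c
#include <stdint.h>

/* Hypothetical model of xperm.n for RV32 (eight nibble lanes); illustrative only. */
static inline uint32_t xperm_n32_model(uint32_t rs1, uint32_t rs2)
{
    uint32_t rd = 0;
    for (unsigned i = 0; i < 32; i += 4) {
        unsigned idx = (rs2 >> i) & 0xF;               /* index nibble from rs2 */
        /* Indices 8..15 fall outside the 8-entry table and yield zero. */
        uint32_t elem = (idx < 8) ? ((rs1 >> (idx * 4)) & 0xF) : 0;
        rd |= elem << i;
    }
    return rd;
}
```

With the table 0xFEDCBA98 in rs1 and the indexes 0x00000F80 in rs2, the lanes holding indexes 8 and 15 become zero and the result is 0x88888008.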

zext.h

Synopsis: Zero-extend halfword
Mnemonic: zext.h rd, rs

Encoding (RV32)

Encoding (RV64)

Description: This instruction zero-extends the least-significant halfword of the source to XLEN by inserting 0’s into all of the bits more significant than 15.

Operation

X(rd) = EXTZ(X(rs)[15..0]);

Note

The zext.h mnemonic corresponds to different instruction encodings in RV32 and RV64.

Included in

Extension	Minimum version	Lifecycle state
Zbb (Basic bit-manipulation)	0.93	Frozen
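
For comparison, the sketch below models zext.h in C for XLEN = 64 next to the two-shift sequence (shift left, then logical shift right, both by XLEN-16) that code limited to the base ISA can use, since andi cannot encode the mask 0xFFFF in its 12-bit immediate. Both function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical model of zext.h for XLEN = 64; illustrative only. */
static inline uint64_t zext_h_model(uint64_t rs)
{
    return rs & 0xFFFF;          /* EXTZ(X(rs)[15..0]) */
}

/* Equivalent shift pair for XLEN = 64: slli by 48, then srli by 48. */
static inline uint64_t zext_h_by_shifts(uint64_t rs)
{
    return (rs << 48) >> 48;     /* logical shifts on an unsigned type */
}
```

Both functions return 0x8000 for the input 0xFFFFFFFFFFFF8000.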

zip

Synopsis: Interleave the upper and lower halves of the source word into the odd and even bits of the destination.
Mnemonic: zip rd, rs

Encoding

Description: This instruction interleaves the high and low halves of the source word: bit i of the low half is placed in even bit position 2i of the destination word, and bit i of the high half is placed in odd bit position 2i+1. It is the inverse of the unzip instruction. This instruction is available only on RV32.

Operation

foreach (i from 0 to xlen/2-1) {
  X(rd)[2*i] = X(rs)[i];
  X(rd)[2*i+1] = X(rs)[i+xlen/2];
}

Note

This instruction is useful for implementing the SHA3 cryptographic hash function on a 32-bit architecture, as it implements the bit-interleaving operation used to speed up the 64-bit rotations directly.

Included in

Extension	Minimum version	Lifecycle state
Zbkb (Bit-manipulation for Cryptography) (RV32)	v0.9.4	Frozen
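
Reading the Operation pseudocode for zip literally, a bit-by-bit C model for RV32 might look like the sketch below. The name zip32_model is hypothetical, and the loop favours clarity over speed.

```c
#include <stdint.h>

/* Hypothetical model of zip for RV32 (XLEN = 32); illustrative only. */
static inline uint32_t zip32_model(uint32_t rs)
{
    uint32_t rd = 0;
    for (unsigned i = 0; i < 16; i++) {
        rd |= ((rs >> i) & 1u) << (2 * i);             /* low-half bit  -> even position */
        rd |= ((rs >> (i + 16)) & 1u) << (2 * i + 1);  /* high-half bit -> odd position  */
    }
    return rd;
}
```

A source of 0x0000FFFF (low half set) yields 0x55555555, and 0xFFFF0000 (high half set) yields 0xAAAAAAAA.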

Software optimization guide

strlen

The orc.b instruction allows for the efficient detection of NUL bytes in an XLEN-sized chunk of data:

the result of orc.b on a chunk that does not contain any NUL bytes will be all-ones, and
after a bitwise negation of the result of orc.b, ctz or clz (depending on the endianness of the data) returns the number of bits before the first NUL byte (if any); shifting that count right by 3 gives the number of data bytes before it.
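
The two observations above can be sketched in C for a little-endian 64-bit chunk. The helpers orc_b_model, ctz64_model and bytes_before_nul_le are hypothetical software stand-ins for the orc.b and ctz instructions, written for clarity rather than speed; a big-endian variant would count leading zeros instead.

```c
#include <stdint.h>

/* Hypothetical emulation of orc.b for XLEN = 64: each byte becomes
   0xFF if any of its bits is set, and 0x00 otherwise. */
static inline uint64_t orc_b_model(uint64_t x)
{
    uint64_t r = 0;
    for (unsigned i = 0; i < 64; i += 8) {
        if ((x >> i) & 0xFF)
            r |= (uint64_t)0xFF << i;
    }
    return r;
}

/* Count trailing zeros; returns 64 for x == 0, as the ctz instruction does. */
static inline unsigned ctz64_model(uint64_t x)
{
    unsigned n = 0;
    if (x == 0)
        return 64;
    while ((x & 1) == 0) {
        x >>= 1;
        n++;
    }
    return n;
}

/* Number of bytes before the first NUL in a little-endian chunk,
   or 8 if the chunk contains no NUL byte. */
static inline unsigned bytes_before_nul_le(uint64_t chunk)
{
    return ctz64_model(~orc_b_model(chunk)) >> 3;
}
```

For the chunk 0x0000000000636261 (the bytes 'a', 'b', 'c' followed by NULs) the result is 3; for a chunk with no NUL byte the negated orc.b result is zero and the result is 8.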

A full example of a strlen function, which uses these techniques and also shows how to handle unaligned/partial data, is the following:

#include <sys/asm.h>

    .text
    .globl strlen
    .type  strlen, @function
strlen:
    andi    a3, a0, (SZREG-1)   // offset
    andi    a1, a0, -SZREG      // align pointer
.Lprologue:
    li      a4, SZREG
    sub     a4, a4, a3          // SZREG - offset (valid bytes in first chunk)
    slli    a3, a3, 3           // offset * 8 (bits per byte, on RV32 and RV64)
    REG_L   a2, 0(a1)           // chunk
    /*
     * Shift the partial/unaligned chunk we loaded to remove the bytes
     * from before the start of the string, adding NUL bytes at the end.
     */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    srl     a2, a2, a3          // chunk >> (offset * 8)
#else
    sll     a2, a2, a3
#endif
    orc.b   a2, a2
    not     a2, a2
    /*
     * Non-NUL bytes in the string have been expanded to 0x00, while
     * NUL bytes have become 0xff.  Search for the first set bit
     * (corresponding to a NUL byte in the original chunk).
     */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    ctz     a2, a2
#else
    clz     a2, a2
#endif
    /*
     * The first chunk is special: compare against the number of valid
     * bytes in this chunk.
     */
    srli    a0, a2, 3
    bgtu    a4, a0, .Ldone
    addi    a3, a1, SZREG
    li      a4, -1
    .align 2
    /*
     * Our critical loop is 4 instructions and processes data in 4-byte
     * or 8-byte chunks.
     */
.Lloop:
    REG_L   a2, SZREG(a1)
    addi    a1, a1, SZREG
    orc.b   a2, a2
    beq     a2, a4, .Lloop

.Lepilogue:
    not     a2, a2
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    ctz     a2, a2
#else
    clz     a2, a2
#endif
    sub     a1, a1, a3
    add     a0, a0, a1
    srli    a2, a2, 3
    add     a0, a0, a2
.Ldone:
    ret

strcmp

#include <sys/asm.h>

  .text
  .globl strcmp
  .type  strcmp, @function
strcmp:
  or    a4, a0, a1
  li    t2, -1
  andi  a4, a4, SZREG-1
  bnez  a4, .Lsimpleloop

  # Main loop for aligned strings
.Lloop:
  REG_L a2, 0(a0)
  REG_L a3, 0(a1)
  orc.b t0, a2
  bne   t0, t2, .Lfoundnull
  addi  a0, a0, SZREG
  addi  a1, a1, SZREG
  beq   a2, a3, .Lloop

  # Words don't match, and no null byte in first word.
  # Get bytes in big-endian order and compare.
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
  rev8  a2, a2
  rev8  a3, a3
#endif
  # Synthesize (a2 >= a3) ? 1 : -1 in a branchless sequence.
  sltu a0, a2, a3
  neg  a0, a0
  ori  a0, a0, 1
  ret

.Lfoundnull:
  # Found a null byte.
  # If words don't match, fall back to simple loop.
  bne   a2, a3, .Lsimpleloop

  # Otherwise, strings are equal.
  li    a0, 0
  ret

  # Simple loop for misaligned strings
.Lsimpleloop:
  lbu   a2, 0(a0)
  lbu   a3, 0(a1)
  addi  a0, a0, 1
  addi  a1, a1, 1
  bne   a2, a3, 1f
  bnez  a2, .Lsimpleloop

1:
  sub   a0, a2, a3
  ret

.size   strcmp, .-strcmp
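
The word-compare path in the listing above (rev8 followed by sltu, neg and ori) can be sketched in C for little-endian 64-bit chunks. The helpers rev8_model and compare_chunks_le are hypothetical; the sketch assumes, as that code path does, that the two chunks differ and contain no NUL byte.

```c
#include <stdint.h>

/* Portable byte swap standing in for rev8 at XLEN = 64. */
static inline uint64_t rev8_model(uint64_t x)
{
    uint64_t r = 0;
    for (unsigned i = 0; i < 8; i++) {
        r = (r << 8) | (x & 0xFF);
        x >>= 8;
    }
    return r;
}

/* Compare two unequal little-endian chunks in string order.
   Mirrors sltu/neg/ori: returns -1 if a sorts first, otherwise 1. */
static inline int compare_chunks_le(uint64_t a, uint64_t b)
{
    uint64_t x = rev8_model(a);   /* first string byte becomes most significant */
    uint64_t y = rev8_model(b);
    int64_t lt = (x < y);         /* sltu: 1 if x < y, else 0 */
    return (int)(-lt | 1);        /* neg, then ori with 1: -1 or 1 */
}
```

The byte swap matters because string order is decided by the first differing byte, which is the least-significant byte of a little-endian load. The chunks for "baaaaaaa" (0x6161616161616162) and "aaaaaaab" (0x6261616161616161) compare as 1 here, although the first value is smaller as a plain integer.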