In this post we demonstrate CPA on AES on our custom-made RISC-V board with

SiFive FE310 for ChipWhisperer Target Board. For this attack we use a slightly modified implementation of AES for microcontrollers. We still follow methodology described in previously done CPA attacks with some modifications enhancing performance and accuracy.

## Trace misalignment

For this CPA we use standard attack on the first round of AES, where known bytes of the plaintext are xored with unknown bytes of the key and then an Sbox is applied. In tiny AES it is implemented as follows:

static void SubBytes(void) {

uint8_t i, j;

for(i = 0; i < 4; ++i){

for(j = 0; j < 4; ++j){

(*state)[j][i] = getSBoxValue((*state)[j][i]);

}

}

}

The assembly of this source code contains a bunch of conditional jumps:

...

2040075e: ffc60793 addi a5,a2,-4

20400762: 00c506b3 add a3,a0,a2

20400766: 0007c703 lbu a4,0(a5)

2040076a: 0006c583 lbu a1,0(a3)

2040076e: 0785 addi a5,a5,1

20400770: 0685 addi a3,a3,1

20400772: 8f2d xor a4,a4,a1

20400774: fee78fa3 sb a4,-1(a5)

20400778: fec797e3bnea5,a2,20400766 <Cipher+0x3a>

2040077c: 00478613 addi a2,a5,4

20400780: fd061fe3bnea2,a6,2040075e <Cipher+0x32>...

Like many chips, SiFive FE310 implements branch prediction. A missed prediction on SiFive results in a 3-cycle latency and consequently, power traces that we collect with ChipWhisperer are misaligned. This significantly impacts CPA analysis as fixed power trace points for different traces may correspond to different operations. This could be overcome by collecting sufficiently many traces to cancell-out the noise or selecting only the longest traces, where all predictions were misses and thus all branches had the 3-cycle penalty.

One approach to solve this problem would be to align the traces, e.g. by cutting out irrelevant parts caused by missed predictions. It’s an interesting topic that we would like to tackle in a dedicated post. For now, to simplify the procedure, we manually unrolled the SubBytes procedure so that there are no conditional jumps:

static void SubBytes(void) {

(*state)[0][0] = getSBoxValue((*state)[0][0]);

(*state)[0][1] = getSBoxValue((*state)[0][1]);

(*state)[0][2] = getSBoxValue((*state)[0][2]);

(*state)[0][3] = getSBoxValue((*state)[0][3]);

(*state)[1][0] = getSBoxValue((*state)[1][0]);

(*state)[1][1] = getSBoxValue((*state)[1][1]);

(*state)[1][2] = getSBoxValue((*state)[1][2]);

(*state)[1][3] = getSBoxValue((*state)[1][3]);

(*state)[2][0] = getSBoxValue((*state)[2][0]);

(*state)[2][1] = getSBoxValue((*state)[2][1]);

(*state)[2][2] = getSBoxValue((*state)[2][2]);

(*state)[2][3] = getSBoxValue((*state)[2][3]);

(*state)[3][0] = getSBoxValue((*state)[3][0]);

(*state)[3][1] = getSBoxValue((*state)[3][1]);

(*state)[3][2] = getSBoxValue((*state)[3][2]);

(*state)[3][3] = getSBoxValue((*state)[3][3]);

}

We got rid of the loops and it results in constant time of execution and no branch prediction.

## CPA preparations

Using scope.adc.trig_count we know that one execution of SubBytes takes 111 cyles. We take 4 measurements for every cycle. Let’s look at the average trace for 1000 runs:

The first trace should always be discarded due to instruction cache misses. However, when taking a batch of samples, there will usually be some small number of samples of length different than the base length (in this case 444). This is most likely caused by cache misses on the SBox lookup table. Worth noting is that those cache misses could lead to another vulnerability, known as Cache Attacks, where cache misses are modelled and predicted. In this post we do not focus on the cache miss cases and instead drop all such traces.

We quickly discovered that we need a larger number of traces on SiFive than on other chips we worked with to perform an accurate CPA. It is probably caused by many factors like five-stage pipeline, result latency for some instructions and in general “more noise” on the wires because of a high resistance on the shunt (e.g. higher than on Cortex M3). The high resistance makes measurements less accurate what in consequence creates the need for much more traces. In this post “accuracy” means we want to perform CPA such that all rank-0 candidates for the key bytes are the correct ones, just for the reference for the future. It basically means we want very accurate CPA with no brute-force after getting possible key bytes to get a sense of what we are able to do.

After some experiments (and taking a peek at the assembly), we noticed that two instructions will correlate the best: loading sbox output and sbox output xored with key byte xored with plaintext byte. Having that in mind, we profiled AES execution taking 7977 sample traces of length 444 (from 8000 sized batch) with known key.

For the sole sbox correlation we get:

[98, 102, 130, 138, 162, 178, 194, 202, 226, 242, 253, 277, 386, 301, 326, 342]

which are the points in power traces with the highest correlation for consecutive key bytes. Analogously, for the 2nd type of correlation we get:

[86, 102, 118, 134, 149, 166, 181, 198, 214, 234, 246, 266, 282, 298, 314, 330]

It’s easy to notice that the differences between correlation peaks are not consistent and that there is one peak that is the same in both sets. This is most probably due to sequencing of instructions in the assembly file.

## CPA

In our analysis we declare ranks for key bytes based on the sum of the correlations in points we selected (rank 0 is the most probable candidate). We chose this approach because of a simple observation. Usually when testing a set of possible subkeys the correct subkey has the highest correlation and the rest of subkeys is noticeably below that level. When it’s not the case, the correlation for all the subkeys tends to be around the same level. We hope that the sum of correlations will be able to “catch” those high points in either of the correlations (or both).

For our CPA we collected 1000 traces, from which 996 are of length 111 (444 measurements). Look at the video to follow our CPA.

We can see that all key bytes are in rank-0 with relatively high correlations. We could easy drop some traces and still get this result, e.g. we managed to choose randomly 500 traces and still recover the whole key.

Jupyter notebook for this post together with collected traces can be found in our public repo.