Power-Balancing Software Implementation to Mitigate Side-Channel Attacks without Using Look-Up Tables

With the increasing number of side-channel attacks, countermeasure designers continue to develop various implementations to address such threats. Power-balancing (PB) methods hold the number of 1s and/or transitions (i.e., Hamming weight/distance) of internal processes constant to ensure side-channel safety in an environment in which it is difficult to use random numbers. Most existing studies employed look-up tables (LUTs) to compute all operations except XOR and NOT. However, LUT-based schemes exhibit side-channel issues in the address bits of the LUTs. In this paper, we propose the application of AND and ADD operations to PB methods based on a rule that encodes 8-bit data into a 32-bit codeword without using LUTs. Unlike previous studies that employed LUTs, our proposals overcome the side-channel vulnerabilities associated with the address bits and avoid the memory wastage. In addition, we evaluate the side-channel security ensured by the proposed method in comparison with that ensured by other methods. Finally, we apply our methods to SIMON/SPECK ciphers and analyze their performance by comparing them with older schemes.


Introduction
Side-channel attacks are techniques that extract secret values using physical leakages, such as power consumption, electromagnetic radiation, and running time, which occur during the execution of cryptographic algorithms on real devices [1][2][3]. Power analysis, one of the best-known side-channel attacks, is a statistical technique that extracts a secret key by exploiting the fact that the total power consumed by a cryptographic module is strongly related to the computed value. To address these threats, countermeasure designers have attempted to break the link between computed data and physical measurements. However, it is very hard to completely break this relationship.
Another type of countermeasure is masking, which blinds secret data with random numbers so that the power measurements appear random [4]. For instance, a dth-order side-channel attack, which exploits the leakages related to d intermediate variables at the same time, is theoretically countered by a dth-order masking method in which d random values are involved per sensitive variable [6]. However, because masking methods unconditionally require random numbers, they are not suited for resource-constrained environments in which it may be difficult to employ a cryptographic pseudorandom number generator (PRNG) [5].
Power-balancing (PB) methods may ensure side-channel safety in an environment in which it is difficult to use PRNGs [7]. These methods encode data and construct operations to ensure that the number of 1s and/or transitions of bits (i.e., Hamming weight/distance (HW/HD)) is constant, and therefore power leakages can be similarly generated regardless of the secret values [8]. Because the power consumption of a crypto-module applied to PB methods contains small amounts of information related to the computed data, these schemes do not ensure perfect security against power analyses. However, in environments where random numbers cannot be used, these methods exhibit advantages that reduce the impact of power analyses. Moreover, these methods can be combined with other countermeasures such as shuffling and dummy operation schemes to achieve a higher level of security [9].
The fundamental idea of PB methods was derived from dual-rail with precharge logic (DPL) styles in hardware [8]. In DPL styles, a logic bit a is represented by a 2-bit pair (a, ā), where ā is the complement of a. Hoogvorst et al. were the first to attempt to bring the DPL styles of hardware into software [7]. Since then, several studies, such as those on constant XOR and look-up table-based (LUT-based) operations, have applied these schemes to symmetric ciphers [10][11][12][13][14][15][16][17][18]. Here, constant operators, such as constant XOR, are operations modified so that they always hold the same HW/HD in their internal processes, using only assembly language. The most recent attempt is the study by Pour et al., which proposed constant ANDs using only a small number of LUTs and applied these operations to the SIMON block cipher [19]. However, most studies that use LUTs are vulnerable to power analyses targeting the address bits. Moreover, these implementations are inefficient in terms of memory usage [18]. In addition, both the data and the address bits of LUTs should have the same HW, as in 0101, 0110, and 1001, so the remaining values are unavailable.
In this paper, we first propose constant AND and ADD operations without using LUTs. Our operations are based on a rule that encodes 8-bit data into a 32-bit codeword. The 32-bit codewords are optimized for 32-bit microcontrollers such as ARM and AVR32, and our suggestions are applicable to ARX ciphers in environments where PRNG is not available. Moreover, we evaluate the side-channel security and analyze the performances of the proposed countermeasures. The main contributions of this paper are as follows.

New Constant Operations
We propose a new AND operation that can be applied to the PB method without using LUTs. This constant AND exhibits advantages in terms of performance and side-channel security compared with previous works using LUTs. In terms of performance, the proposed scheme requires the same running time regardless of the bit size of the computed data. Because previous LUT-based algorithms were designed on a divide and conquer strategy, they result in an increase in computing time with increase in the bit size of inputs. On the other hand, the proposed method computes a whole word at once; therefore, it takes the same amount of time regardless of the bit size of inputs. Moreover, our operation can reduce memory usage requirements. In terms of side-channel security, the proposed method is resilient against side-channel attacks that target leakages from the address bits of LUTs. Side-channel leakages in the process of loading LUTs involve not only information about the input and output values of LUTs, but also information about the address bits of memory cells in which LUTs are stored. Because our algorithm does not use LUTs themselves, it is immune to threats originating from address bits.
Next, we propose an addition operation for the PB method that combines the proposed constant AND with existing constant operations. In order to apply PB methods to an addition operation, we optimized a software-based Kogge-Stone adder for the multi-word environment. The algorithm operates very efficiently in a constrained environment where the codewords are too long to be handled all at once.

Security Evaluation
We evaluate the side-channel security ensured by the proposed method in comparison with that ensured by other methods. To measure the amount of information leaked in the side-channel signals, we acquired power traces of the proposed, existing, and unprotected implementations. We found that the signal-to-noise ratios (SNRs) and the peaks of correlation power analysis (CPA) are reduced by up to approximately 2697% and 318%, respectively, compared with the unprotected implementation. Through experiments, we also confirmed the side-channel vulnerabilities of the address bits of LUTs: when PB methods are not applied to the address bits, information leakages similar to those of the unprotected case occur.

Performance Analysis
We analyze the performances of the proposed schemes in comparison with those of the other schemes. We verify the number of operators in the proposed schemes as well as those in the existing studies, and we compare the number of clocks in an ARM simulator provided by IAR Embedded Workbench. We apply the proposed algorithm to SIMON and SPECK block ciphers and analyze their performances in comparison with existing studies. As a result, it is found that the SIMON cipher using constant ANDs and the SPECK cipher using constant ADDs exhibit performance improvements of approximately 33% and 37%, respectively.
The remainder of this paper is organized as follows. In Section 2, we introduce PB methods by employing dual-rail-based encoding schemes. Section 3 is the core of the paper, in which we present new dual-rail-based PB methods without the application of LUTs. In Section 4, we evaluate the security against side-channel attacks. In Section 5, we present the performance metrics of our proposed method as well as those of existing methods. Finally, we present the conclusions in Section 6.

Existing Works on Power-Balancing Countermeasures by Employing Dual-Rail-Based Encoding Schemes
PB methods are hiding countermeasures that bring the DPL styles of hardware to software. In hardware, DPL operates on pairs of data bits and complement bits, so the circuit consumes the same amount of power irrespective of the data being processed. Because hardware allows flexible circuit design, it is easy to implement operations in the DPL styles. On the other hand, software running on 8/16/32-bit microcontrollers has difficulty applying DPL styles directly. PB methods therefore define operations that satisfy constant HW/HD using only operators supported by microcontrollers. These methods are divided into (x, y)-code-based schemes and dual-rail-based schemes according to the encoding rule used to achieve constant HW/HD. (x, y)-code-based schemes encode data using codewords in which the HW is denoted by x and the bit size is denoted by y. For instance, if we encode 4-bit nibble data into a (3, 6)-code, we use only codewords drawn from the C(6, 3) = 20 six-bit words with HW = 3. In 2014, Servant et al. proposed constant HW codes using LUTs [14]. Rauzy et al. developed a tool to apply (x, y)-code-based countermeasures to arbitrary cryptographic algorithms [15]. Maghrebi et al. proposed a methodology for selecting optimal codewords on an (x, y)-code [16]. This method was improved by Bhasin et al. [17]. Petrvalsky et al. applied the (x, y)-code-based countermeasure to AES and embedded it on an ARM Cortex-M3 [18]. One advantage of (x, y)-code-based schemes is that memories may be constructed compactly, but these schemes can only define operations using LUTs.
Dual-rail-based schemes operate on codewords that double or quadruple the size of the data. For instance, data a1 a0 (ai ∈ F2) is encoded into a codeword a1 ā1 a0 ā0. In 2011, Hoogvorst et al. first tried to bring the dual-rail-based schemes of hardware into software [7]. Han [12,13]. The most recent balancing countermeasure based on a Han-like encoding rule is the method proposed by Pour et al. [19]. They first proposed a constant AND with only a small number of LUTs and applied these operations to the SIMON block cipher. One disadvantage of dual-rail-based schemes is that they have large codewords, but it is easy to define the operations because the encoding rules exhibit regularity.
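As a small illustration of the (3, 6)-code example, the following Python sketch enumerates every 6-bit word with HW = 3 and maps the 16 nibble values onto the first 16 of them. The function name and the nibble-to-codeword assignment are assumptions made for illustration only; the cited works select their codewords by other criteria.

```python
from itertools import combinations

def constant_weight_words(x, y):
    """Return all y-bit words whose Hamming weight is exactly x."""
    words = []
    for ones in combinations(range(y), x):
        w = 0
        for pos in ones:
            w |= 1 << pos          # set the chosen bit positions
        words.append(w)
    return sorted(words)

codewords = constant_weight_words(3, 6)

# Hypothetical assignment: nibble value v -> v-th codeword (16 of the 20 are used).
nibble_code = {v: codewords[v] for v in range(16)}
```

Because only 16 of the 20 candidates are needed, four codewords remain unused, which is the kind of freedom the codeword-selection studies exploit.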
In this section, we describe a 1-to-4 encoding rule, which encodes 8-bit data into a 32-bit codeword based on the method proposed by Pour et al. We denote constant XOR, NOT, and AND operations based on 1-to-4 encoding [19]. At the end of this section, we discuss the drawbacks of existing research and how to overcome it.

Balanced Encoding Rule
There are many ways of presenting 1-to-4 encoding. In this paper, we define constant operations based on three types of encoding rules, as shown in Table 1, where x̄ is the complement of x.
Table 1. Three types of proposed encoding rules.
We denote the intermediate values of the algorithms as x ∈E 1 , X ∈E 1 , or Y ∈E 2 . This means that the registers x and X are encoded in type 1, and the register Y is encoded in type 2. Some notations with no information of the encoding type indicate that the codeword satisfies constant HW/HD. However, the encoding uses a different format compared to the three defined encodings.
This encoding rule exhibits the following properties for any bit x, y.
These properties are the key principles for designing constant operations. The first property is that all the HWs of codewords are twice the bit size of data. Second, the HDs between different types of codewords are twice the bit size of data. In other words, an XOR result of different types of codewords leaks constant power. Finally, the conversion processes of the encoding types can be simply computed by XORing a constant.
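The three properties can be checked exhaustively for a single data bit. The sketch below uses an assumed set of three 4-bit layouts, (x, x̄, x, x̄), (x, x, x̄, x̄), and (x, x̄, x̄, x), chosen only because they are consistent with the masks mentioned later in the text; the actual contents of Table 1 may differ, and the names `enc_bit`, `hw`, and `CONV` are hypothetical.

```python
def enc_bit(x, t):
    """Encode data bit x as a 4-bit codeword of (assumed) type t in {1, 2, 3}."""
    n = x ^ 1                                   # complement bit
    layout = {1: (x, n, x, n),                  # assumed type 1
              2: (x, x, n, n),                  # assumed type 2
              3: (x, n, n, x)}[t]               # assumed type 3
    word = 0
    for b in layout:                            # most significant codeword bit first
        word = (word << 1) | b
    return word

def hw(w):
    """Hamming weight of an integer."""
    return bin(w).count("1")

# Constants that convert one assumed type into another for the same data bit.
CONV = {(1, 2): 0b0110, (1, 3): 0b0011, (2, 3): 0b0101}

for x in (0, 1):
    for t in (1, 2, 3):
        assert hw(enc_bit(x, t)) == 2           # property 1: HW is twice the data size
    for (s, t), const in CONV.items():
        assert enc_bit(x, s) ^ const == enc_bit(x, t)   # property 3: XOR by a constant
        for y in (0, 1):
            assert hw(enc_bit(x, s) ^ enc_bit(y, t)) == 2  # property 2: constant HD
```

Under these assumed layouts, the XOR of codewords of two different types is itself a valid codeword of the third type, which is what makes a constant XOR possible.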

Constant XOR/NOT Operation
As can be observed intuitively from the properties described above, we can achieve constant XOR/NOT by converting the encoding type appropriately.
Constant XORs are summarized as follows.
When different types of codewords are XORed, the result is denoted as another type of codeword. This result can be converted into other encoding types by XORing the constant described in the encoding rule.
These operations rest on the following principle: the XOR of a true pair is true, the XOR of a false pair is also true, and the XOR of a cross pair is false. In the encoding rule, the codeword bits of different encoding types are arranged so that they form two cross pairs and two true/false pairs.
Similarly, constant NOTs can be easily defined. Constant NOTs include the following.
These processes also satisfy constant HW/HD.
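At the word level, a constant XOR followed by a type conversion can be sketched as follows for 8-bit data and 32-bit codewords. The nibble layouts in `ONE`, the nibble-per-bit packing order, and the helper names are assumptions for illustration and are not taken from Table 1.

```python
# Assumed 4-bit codeword of data bit 1 for each type; bit 0 uses the complement.
ONE = {1: 0b1010, 2: 0b1100, 3: 0b1001}

def encode(v, t):
    """Encode an 8-bit value into a 32-bit codeword of (assumed) type t."""
    w = 0
    for i in range(7, -1, -1):                  # most significant data bit first
        nib = ONE[t] if (v >> i) & 1 else ONE[t] ^ 0xF
        w = (w << 4) | nib
    return w

def decode(w):
    """Recover the 8-bit value; the top bit of every nibble carries the data bit."""
    v = 0
    for i in range(7, -1, -1):
        v = (v << 1) | ((w >> (4 * i + 3)) & 1)
    return v

E3_TO_E1 = 0x33333333                            # per-nibble constant 0011 (assumed layouts)

def const_xor(a_e1, b_e2):
    """XOR a type-1 codeword with a type-2 codeword; the result is of type 3."""
    return a_e1 ^ b_e2

c = const_xor(encode(0x3C, 1), encode(0xA5, 2))  # encodes 0x3C ^ 0xA5 in type 3
c_e1 = c ^ E3_TO_E1                              # converted back to type 1
```

Every intermediate word here has HW = 16, that is, twice the number of data bits, regardless of the operands.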

Constant AND Operation
Pour et al. proposed two methods of constant AND that use only small LUTs. Algorithm 1 and Table 2 present algorithms for constant AND and LUTs, respectively.
However, as mentioned in the paper by Pour et al., even if few LUTs are employed, the address bits of LUTs leak side-channel information. The authors checked the hex file after a compiling phase and applied their methods to the address bits. However, if a system with cryptographic modules is large, it can be difficult to practically use special memory addresses.
To overcome these limitations, we propose constant AND and ADD that completely exclude LUTs. The algorithms generate data-independent leakages, which can be used for cryptographic modules without SBoxes such as ARX block ciphers.

A New Dual-Rail-Based Power-Balancing Countermeasure without Look-Up Tables
In this section, we propose a constant AND and a constant ADD based on dual-rail encoding schemes without LUTs. These algorithms are based on the 1-to-4 encoding rule. Unlike previous studies in which LUTs were always used, our proposals overcome the side-channel vulnerabilities in address bits and reduce memory usage.
Our constant ADD is an extended version of the methods employed by prior studies [20,21] (Appendix A). The proposed algorithm is a software-optimized Kogge-Stone adder (KS adder) that can process multi-word codewords. Briefly, the software KS adder proposed by Coron et al. did not consider carry bits for multi-word processing [20]. Won et al. proposed an algorithm that handles carry bits, but it is inefficient because it treats the carry bit as a digit after a decimal point [21]. We propose an optimized algorithm based on the fact that the carry bit from a lower word never exceeds 1. We also propose an addition algorithm that can be applied to PB methods.
Table 3 illustrates the process of our constant AND, where c is the result of a ∧ b.
Table 3. Operation process of the constant AND.

Constant AND Operation
The design principles include the following.
(1) Even if data is encoded into the dual-rail scheme, at least one bit must include an AND result. Based on this idea, we calculated an AND operation of different types of codewords.
Step (1) in Table 3 is the AND of the codeword a encoded with E1 and the codeword b encoded with E2. We verified that the AND operation between different types of codewords satisfies constant HW/HD. Tables 4 and 5 present HW(E1(a) ∧ E2(b)) and HD(E1(a), E1(a) ∧ E2(b)), respectively. As can be observed in the tables, both the HW and the HD of the AND result are always 1.
(2) We must remove the a-terms and b-terms from each bit of the AND result while satisfying constant HW/HD. Here, we consider the removal of the a-terms. A simple method of removal is to XOR aā00. However, because this operation causes strong side-channel leakages related to a, both a and ā must be applied simultaneously to remove the a-terms. We therefore removed the a-terms using the OR of the initial input E1(a) and 0011. The resulting form is aā11, which maintains constant HW/HD.
(3) The b-terms are removed in a manner similar to (2). To convert E2(b) into b0b̄1, we AND it with 1010 and then OR the result with 0001. All of these processes satisfy constant HW/HD. Algorithm 2 shows the algorithm described above. The initial value of T, which is set to 0, may instead be initialized to x or y.
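The constancy claims in steps (1) to (3) can be checked per data bit with a short Python sketch. The layouts E1(x) = (x, x̄, x, x̄) and E2(x) = (x, x, x̄, x̄) are assumptions chosen to agree with the masks 0011, 1010, and 0001 quoted above; the sketch only inspects the individual intermediate values and does not reproduce the full sequence of Algorithm 2.

```python
# Assumed 4-bit codewords per data bit (not taken from Table 1).
E1 = {0: 0b0101, 1: 0b1010}    # (x, not x, x, not x)
E2 = {0: 0b0011, 1: 0b1100}    # (x, x, not x, not x)

def hw(w):
    """Hamming weight of an integer."""
    return bin(w).count("1")

rows = []
for a in (0, 1):
    for b in (0, 1):
        t = E1[a] & E2[b]              # step (1): AND of different codeword types
        u = E1[a] | 0b0011             # step (2): a-terms in the form (a, not a, 1, 1)
        v1 = E2[b] & 0b1010            # step (3): keep (b, 0, not b, 0)
        v2 = v1 | 0b0001               # step (3): form (b, 0, not b, 1)
        rows.append((hw(t), hw(E1[a] ^ t), hw(u), hw(v1), hw(v2)))

# Each column holds a single value, i.e. HW/HD do not depend on a or b.
constant_columns = [len({row[i] for row in rows}) == 1 for i in range(5)]
```

For these assumed layouts every row equals (1, 1, 3, 1, 2), so each intermediate value has the same HW for all four input combinations.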

Algorithm 2: ConstAND, AND Operator Applied to Power-balancing Countermeasure
Require:

Constant ADD Operation
In this section, we describe how PB methods can be applied to addition operations built from constant Boolean operations. We were inspired by the Kogge-Stone adder (KS adder). The concept of the KS adder was developed by Peter M. Kogge and Harold S. Stone [22]. The KS adder, which can be computed using only Boolean operators, is known as a very fast addition algorithm. Coron et al. adapted the KS adder to software using operators that are generally supported, such as XOR, AND, and SHIFT [20]. Won et al. extended the KS adder to handle multi-word data [21]. Each algorithm is summarized in Appendix A. Note that the papers of Coron et al. and Won et al. mainly concern implementing the KS adder in software and are not related to PB methods. However, in order to compare the performance of the multi-word operation, we used their algorithms as the comparison group.
To apply PB methods to the KS adder, several problems must be addressed. For instance, assume that we apply our encoding rule to an 8-bit based block cipher on a 32-bit microcontroller. In this case, we simply apply the constant XOR and AND operations to Coron's algorithm. In other words, we replace 8-bit data with a 32-bit codeword and put it in a 32-bit register. However, most ARX block ciphers require 32-bit additions. Under our encoding rule, these cryptosystems represent 32-bit data using four 32-bit codewords. This means that a carry bit generated in a lower word must be propagated into the upper word, as in arbitrary-precision arithmetic. Won's algorithm addressed this problem; however, it is not efficient because it computes the carry bit of a lower word as a digit after a decimal point. For instance, suppose that 32-bit data is calculated as four 8-bit words. Won's algorithm then requires five 8-bit KS adders.
We designed an algorithm that directly applies a carry bit of a lower word to the least significant bit (LSB) of an upper word. Algorithm 3 shows our KS adder. Algorithm 4 is an addition algorithm with multi-words comprising our KS adder and Coron's KS adder.
Here, we compare our Algorithm 3 with Coron's Algorithm A1 and Won's Algorithms A3 and A5. Coron's Algorithm A1 did not consider carry bits. According to the design principles of the KS adder, the carry bit passed to an upper word is the most significant bit of the final state of G. That is, the carry out of a word can be obtained easily. A problem arises when handling the carry bit that comes in from a lower word.
Won's Algorithms A3 and A5 overcame this problem based on the idea of a decimal point. Assume that the words to be added are (0b 0101) and (0b 0011), and a carry bit is generated in a lower word. Won's algorithm performs computations by modifying the two words into (0b 0101.1) and (0b 0011.1). This method intuitively handles the carry bit of a lower word; however, it is inefficient. The algorithm requires the initial and final processes for converting words between a normal format and a decimal format, and the algorithm again computes the KS adder.
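The decimal-point idea can be illustrated with ordinary integer arithmetic. In this sketch, which is an interpretation of the description above rather than Won's actual algorithm, the carry bit c is appended below the LSB of both words, so that the two appended bits generate a carry into the LSB exactly when c = 1. Python's built-in `+` stands in for the KS adder, and the function name is hypothetical.

```python
def add_with_decimal_point(x, y, c, k=4):
    """Compute x + y + c for k-bit words by appending c as a fractional bit."""
    xe = (x << 1) | c            # e.g. 0b0101 becomes 0b0101.1 when c = 1
    ye = (y << 1) | c            # e.g. 0b0011 becomes 0b0011.1 when c = 1
    s = xe + ye                  # (k+1)-bit addition; .1 + .1 carries into the LSB
    total = s >> 1               # drop the fractional bit again
    return total >> k, total & ((1 << k) - 1)   # (carry out, k-bit sum)

carry, z = add_with_decimal_point(0b0101, 0b0011, 1)   # returns (0, 0b1001)
```

The identity behind it is (2x + c) + (2y + c) = 2(x + y + c), which also shows why the words must be converted to and from the extended format around every addition.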

Algorithm 4: Our Multi-bit Adder Combining Kogge-Stone Adders
Require:

Before describing our algorithm, consider the following example: "A carry bit is generated in a lower word, and both LSBs of the words are 1." Even in this case, the carry produced by the two LSBs together with the lower carry bit does not exceed 1. We redesigned the software-based KS adder based on this fact. Lines 1 and 3 in Algorithm 3 are the keys to handling a carry bit. First, assume that the KS algorithm is computed in 1-bit units. The algorithm then reduces to the form in which the for loop after Line 4 is removed. Here, P denotes the result of the addition, and G denotes the carry bit. Now, suppose that we increase the word size. Even if the word size increases, the LSB of the result P ⊕ 2G does not change. Based on these ideas, we initialized P to the XOR of the lower carry bit and the two input words (Line 1). In the case of G, the LSB of G denotes the carry bit generated by the lower carry bit and the LSBs of the inputs. We initialized G using the majority rule (Line 3). However, because the well-known majority rule includes three AND operators, we rewrote the formula so that it uses one AND operator.
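The following Python sketch shows one way to realize these ideas on plain (unencoded) 8-bit words. It is not a transcription of Algorithm 3: the function names are hypothetical, and the one-AND majority identity maj(x, y, c) = ((x ⊕ c) ∧ (y ⊕ c)) ⊕ c is a known Boolean identity that may differ from the exact formula used in Line 3. The word size k = 8 is assumed so that l iterations also cover the carry out of the word.

```python
def ks_add_carry(x, y, c_in=0, k=8):
    """Add two k-bit words plus a carry-in bit using XOR, AND, and shifts only.

    Returns (carry_out, sum mod 2**k). Sketch for plain values with k = 8.
    """
    mask = (1 << k) - 1
    l = max((k - 2).bit_length(), 1)        # equals max(ceil(log2(k - 1)), 1)
    p = x ^ y ^ c_in                        # carry-in folded into the LSB of P
    g = ((x ^ c_in) & (y ^ c_in)) ^ c_in    # LSB: majority(x0, y0, c_in); elsewhere x & y
    q = p                                   # working copy used for propagation
    for i in range(l - 1):
        s = 1 << i
        g = (q & (g << s)) ^ g              # extend the generate signals
        q = q & (q << s)                    # extend the propagate signals
    g = (q & (g << (1 << (l - 1)))) ^ g
    return (g >> (k - 1)) & 1, (p ^ (g << 1)) & mask

def add32(x, y):
    """Add two 32-bit values as four 8-bit words, chaining the carry bit."""
    carry, z = 0, 0
    for w in range(4):                      # least significant word first
        xw = (x >> (8 * w)) & 0xFF
        yw = (y >> (8 * w)) & 0xFF
        carry, zw = ks_add_carry(xw, yw, carry)
        z |= zw << (8 * w)
    return z
```

With c_in = 0 the initialization reduces to P = x ⊕ y and G = x ∧ y, which corresponds to the carry-free form used for the least significant word, so a 32-bit addition needs four word additions in total.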
We further designed Algorithms 5 and 6 with PB methods. Algorithms A2, A4, and A6 in Appendix A show the algorithms of Coron et al. and Won et al. with PB methods.

Side-Channel Security Evaluation
In this section, we evaluate the side-channel security ensured by the proposed method and by existing methods. We used the SCARF ARM (ARM920T) evaluation board, which has an operating frequency of approximately 100 MHz [23]. Power signals were collected using a LeCroy WaveRunner 204Xi-A oscilloscope at a sample rate of 1 GS/s. We set two comparison groups: an 8-bit AND operation (u8AND) without any countermeasures and a constant AND operation with Pour's method-1. We evaluated the address bits of Pour's method-1 to analyze the side-channel vulnerabilities in LUTs. We verified at the assembly level that the implementations used the registers as intended. Figures 1-3 present the evaluation results of the u8AND operation, our ConstAND, and Pour's method-1, respectively. In each figure, the upper left panel is a graph of 10 traces; the x-axis denotes the samples, and the y-axis denotes the relative magnitude. The upper right panel shows the SNR [24] computed with respect to the HW; the x-axis denotes the samples, and the y-axis denotes the SNR. The lower left panel shows the CPA [25] results with 10,000 traces; the x-axis denotes the samples, and the y-axis denotes the correlation coefficient. The lower right panel is a graph of the highest CPA peaks according to the number of traces; the x-axis denotes the number of traces, and the y-axis denotes the correlation coefficient. The SNR formula is as follows [24].
Definition 1 (Signal-to-Noise Ratio (SNR) [24]). The signal-to-noise ratio of a leakage, denoted by a random variable L that depends on the informative part denoted by I, is defined as follows.

$$\mathrm{SNR} = \frac{\mathrm{Var}\left[\mathrm{E}[L \mid I]\right]}{\mathrm{E}\left[\mathrm{Var}[L \mid I]\right]} \qquad (1)$$

Figure 1 presents the results of the normal AND operation. These results illustrate the physical characteristics of the evaluation board. As observed from the CPA peaks, this board exhibits nearly ideal side-channel leakages, as in an HW model. The SNR is approximately 5 to 8, and the board exhibits strong leakages in power consumption, such as those of register B.
Figure 2 presents the results of our ConstAND. PB methods are known to reduce SNRs and decrease CPA peaks. In this experiment, the SNR decreased by 429% to 2697%, and the CPA peaks decreased by 48% to 318%. We attribute the relatively strong side-channel leakage at approximately sample 1450 to register B not following an ideal HW model. Nevertheless, our countermeasure exhibits decreased SNRs and CPA peaks compared to the u8AND experiments. The proposed method thus behaves as expected of a PB method.
Figure 3 presents the results of Pour's method-1. We implemented a constant AND based on Pour's method-1 using the divide and conquer method on four 8-bit codewords. In other words, we implemented a constant AND consisting of four LUT i(=1,...,4) operations in 8-bit units. We do not present the results of the experiments related to the data; they showed low leakages, as in our ConstAND experiments. We focused on demonstrating the side-channel vulnerabilities of the address bits of LUTs. As the authors had noted in their paper, special editing of the hex file is required to prevent side-channel leakages of the address bits. However, it is practically difficult to manipulate the hex file of cryptographic modules in a large system. We therefore experimented under the assumption that the hex file cannot be manipulated. We recovered the address bits of the target LUT by brute force over starting addresses from 0x00 to 0xFF. The experimental results for the address bits are almost the same as in the case of the normal AND operation.
The reason why the SNR is small compared with the CPA peaks is that the codewords of the LUT input include only 2 bits of information.
In summary, we evaluated the side-channel security ensured by our algorithm. The SNR and the CPA peak were decreased by up to 2697% and 318%, respectively, compared to the normal operations. Compared with existing methods that use LUTs, the proposed method is not exposed to attacks on the address bits at all, because it uses no LUTs.
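For reference, the SNR of Definition 1 can be estimated at a single sample point from leakage values grouped by their label I (for example, the HW of the processed value). The sketch below is a generic estimator written for illustration; the function name and the toy data are assumptions and do not correspond to the measurements reported here.

```python
from collections import defaultdict
from statistics import fmean, pvariance

def estimate_snr(leakages, labels):
    """Estimate Var[E[L|I]] / E[Var[L|I]] from leakage samples and their labels."""
    groups = defaultdict(list)
    for value, label in zip(leakages, labels):
        groups[label].append(value)
    n = len(leakages)
    overall_mean = fmean(leakages)
    # Variance of the per-label means, weighted by how often each label occurs.
    signal = sum(len(g) * (fmean(g) - overall_mean) ** 2 for g in groups.values()) / n
    # Mean of the per-label variances, with the same weighting.
    noise = sum(len(g) * pvariance(g) for g in groups.values()) / n
    return signal / noise

# Toy data: two labels with means 1 and 5 and variance 1 inside each group.
snr = estimate_snr([0.0, 2.0, 4.0, 6.0], [0, 0, 1, 1])   # 4.0
```

A balanced implementation aims to make the per-label means coincide, which drives the numerator, and hence the SNR, toward zero.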

Performance Analysis and Case Study
In this section, we measure the performance of our method and of existing studies. We analyzed the number of operators in the proposed method as well as in the comparison groups (Coron's algorithm and Won's algorithm) and compared the number of operating clocks using the ARM simulator (Cortex-M0) provided by IAR Embedded Workbench. We applied PB methods to the SIMON and SPECK block ciphers and compared their performances with previous studies. SIMON and SPECK were selected as case studies because their main operations are AND and ADD, respectively, so that the additional cost of the proposed countermeasures can be precisely analyzed. For the SIMON block cipher, we compared the proposed method with the unprotected case and Pour's countermeasure; for the SPECK block cipher, we compared the proposed method with the unprotected case and Won's technique. These comparisons may raise the question of whether the results would differ for other block ciphers. When side-channel countermeasures are applied to block ciphers, the main part of the overhead lies in the nonlinear function. In other words, the additional cost of applying PB methods to nonlinear functions has a major impact on overall performance. Therefore, as can be seen from the number of operators in the algorithms, similar results can be expected for other block ciphers.

Performance Analysis
The number of operators is presented in Table 6, and the number of simulation clocks is shown in Table 7. Here, n is the number of words, k is the number of "data" bits in a word, and l = max(⌈log2(k − 1)⌉, 1). For instance, 32-bit data is transformed into four 32-bit codewords with 1-to-4 encoding; in this case, n is 4, k is 8, and l is 3 (= ⌈log2 7⌉). We counted the output operation of the KS algorithms (i.e., the operation that splits the result into a carry bit and an addition result) as one AND and one SHIFT. As can be observed in Tables 6 and 7, our ConstAND operates with only six operators. In the KS algorithms, our method requires additional operations in the initial process of calculating P and G. However, the total number of operations is almost the same as that in the other methods because the operations inside the for loops are the same. When the PB methods are applied to the KS adder with ConstAND, the total number of operations of the three algorithms is again almost the same. However, our method exhibits advantages in the multi-bit KS adder. Coron's KS adder cannot handle the carry bit of a lower word, and it was used only for the least significant word. To process n words, Won's multi-bit adder requires a total of (n + 1) KS adders, where the number of Coron's KS adders is 1 and the number of Won's KS adders is n. On the other hand, our method requires a total of n KS adders, where the number of Coron's KS adders is 1 and the number of our KS adders is (n − 1). Furthermore, Won's multi-bit adder requires conversion of the input and output between normal and decimal representations.
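The parameters in this paragraph can be reproduced with a few lines of Python. The function names are hypothetical, and the adder counts simply restate the n versus (n + 1) comparison given in the text; they are not clock measurements.

```python
from math import ceil, log2

def ks_depth(k):
    """l = max(ceil(log2(k - 1)), 1) for a word with k data bits (k >= 2)."""
    return max(ceil(log2(k - 1)), 1)

def ks_adders_per_addition(n):
    """Number of KS adder invocations needed to add two n-word operands."""
    return {"ours": 1 + (n - 1),   # one Coron adder plus (n - 1) of the proposed adders
            "won": 1 + n}          # one Coron adder plus n of Won's adders

l = ks_depth(8)                    # 3 for k = 8
counts = ks_adders_per_addition(4) # {'ours': 4, 'won': 5}
```

For 32-bit additions split into four words, this gives the four-versus-five adder comparison that underlies the SPECK results.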

SIMON/SPECK Block Cipher
The SIMON and SPECK families were developed for content security on constrained devices by a group of researchers at the US National Security Agency's Research Directorate [26]. The SIMON block cipher consists of only AND, ROTATION, and XOR operations. The SPECK block cipher consists of an ARX (ADD, ROTATION, and XOR) structure. We implemented SIMON and SPECK using the constant AND and ADD, and we compared the performance with the ARM simulator. Table 8 presents the ARM simulation results. The 32-bit coding is implemented using 32-bit long type registers, and the 8-bit coding is implemented using 8-bit char type registers. Because the SPECK block cipher required 32-bit additions, these additions also used 32-bit additions even with the 8-bit coding. The nonlinear functions of the SIMON block cipher include AND operators. We implemented the protected SIMON cipher with Pour's method-1 and our ConstAND. As a result, it was found that our method demonstrated a performance improvement of 33% compared to Pour's method-1. This difference in performance was because Pour's method-1 handles a codeword in small LUT units, whereas our method processes the entire codeword.
The nonlinear functions of the SPECK cipher include ADD operators. We implemented this cipher with four 32-bit codewords based on 1-to-4 encoding. As a result, our method demonstrated a performance improvement of 37% compared to Won's method. The main reason for the difference in performance is that our method invokes four KS adders per protected addition, whereas Won's method invokes five.

Conclusions
In this paper, we proposed power-balancing methods without LUTs. We designed new constant AND and ADD operations based on 1-to-4 encoding. We also evaluated the side-channel security on a real board and assessed the performance based on the number of operators and simulation results. The proposed methods can be optimized for 32-bit microcontrollers such as ARM and AVR32. The biggest advantage of our methods is that they do not use look-up tables at all. As a result, the proposed schemes do not lead to side-channel leakages from the address bits of memories. On the other hand, the proposed algorithms have a disadvantage in terms of running time compared with an implementation coded using only LUTs. Such a LUT-only implementation is appropriate in an environment in which memory resources are sufficient and the implementation is coded with complete control over the side-channel leakages of address bits. However, from a practical perspective, it is very difficult to implement all the operations using LUTs and to check the side-channel leakages of address bits for all of them.
Given that threats such as side-channel attacks are growing in number, customized countermeasures are required in various environments. Our proposals are particularly effective for an environment where random numbers cannot be used as in IoT environments.

Appendix A. Existing Algorithms for Addition and Their Countermeasures
For completeness, we include the existing algorithms and the protected algorithms that employ PB methods in this appendix. Algorithms A1-A6 show the algorithms of Coron et al. and Won et al. In addition, Figure A1 shows examples of our algorithm and the existing software-based KS adder algorithms. From the upper left, Figure A1 presents examples of the software-based KS adders of this work, Coron et al., and Won et al., respectively. As the example shows, Coron's algorithm did not consider carry bits. Therefore, it cannot compute the addition of multi-word data by itself. On the other hand, when c = 0, as in the first addition, Won's algorithm performs the computation by modifying the two words into (0b 110.0) and (0b 111.0). If the carry bit is 1, as in the second addition, the algorithm modifies the two words into (0b 101.1) and (0b 010.1). However, the algorithm requires initial and final processes for switching words between the normal format and the decimal format. To overcome these drawbacks, we closely examined the principle of carry-bit propagation in the software-based KS algorithm and proposed an optimized algorithm.
Require: x, y ∈ {0, 1}^k, and l = max(⌈log2(k − 1)⌉, 1)
Ensure: Carry c ∈ {0, 1}, z (= x + y mod 2^k)
1: P ← x ⊕ y
2: P′ ← P
3: G ← x ∧ y
4: for i = 0 to l − 2 do
5:   G ← (P′ ∧ (G ≪ (1 ≪ i))) ⊕ G
6:   P′ ← P′ ∧ (P′ ≪ (1 ≪ i))
7: end for
8: G ← (P′ ∧ (G ≪ (1 ≪ (l − 1)))) ⊕ G
9: c||z ← P ⊕ (2G) {Carry bit is MSB of G}
10: return (c, z)