Debdeep Mukhopadhyay Chester Rebeiro - PowerPoint Presentation

Uploaded On 2018-11-04




Presentation Transcript

Slide1

Debdeep Mukhopadhyay, Chester Rebeiro
Department of Computer Science and Engineering
Indian Institute of Technology Kharagpur

Accelerations of Scalar Multiplication: Advanced Techniques

23-27 May 2011
Anurag Labs, DRDO

Slide2

Non-Adjacency Form (NAF)

NAF(29) = (1, 0, 0, -1, 0, 1), since 29 = 32 - 4 + 1
Binary(29) = (1, 1, 1, 0, 1), since 29 = 16 + 8 + 4 + 1

Pros:
NAF never has two consecutive non-zero digits (hence called non-adjacent).
The average density of non-zero digits in a NAF is 1/3.
It reduces the number of point additions in ECC scalar multiplication.

Cons:
The NAF can be one digit longer than the binary representation.

Slide3

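Algorithm for NAF generation

While k > 0: if k is odd, the digit is ki = 2 - (k % 4) and k = k - ki; otherwise ki = 0. Then k = k/2.

Example, k = 29:
k0 = 2 - (29 % 4) = 1, k = 29 - 1 = 28, then k = 14
k1 = 0 (note that it can never be non-zero right after a non-zero digit), k = 7
k2 = 2 - (7 % 4) = -1, k = 7 + 1 = 8, then k = 4
k3 = 0, k = 2
k4 = 0, k = 1
k5 = 2 - (1 % 4) = 1, k = 0 (algorithm terminates)

Slide4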

Why Non-adjacent?

When k is odd, it is either 4p+1 or 4p+3 (p is an integer).
Case 1: k = 4p+1. Then ki = 1 and (k - ki)/2 = 2p (even), so the next NAF digit is 0.
Case 2: k = 4p+3. Then ki = -1 and (k - ki)/2 = 2p+2 (even), so the next NAF digit is 0.

Slide5

Scalar Multiplication with NAF

Expected run time with NAF = (m/3) A + m D
Run time with the binary method = (m/2) A + m D
(A = point addition, D = point doubling, m = length of the scalar in bits)

Note that the number of doublings is unchanged. Later we see a method that removes doubling altogether.

Slide6

Width w-NAF

k = 29, w = 3
NAF digits = (1, 0, 0, 0, 0, -3)
29 = (1, 0, 0, 0, 0, -3) = 1·32 - 3

Pros:
Density of non-zero terms = 1/(w+1)

Cons:
Pre-computation is required, which means storage in hardware.
The length is the same as for the normal NAF.

Slide7

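Algorithm for width w-NAF generation

For odd k, the digit u satisfies u ≡ k (mod 2^w) with -2^(w-1) ≤ u < 2^(w-1), and k = k - u. For even k, the digit is 0. Then k = k/2.

Example, k = 29, w = 3:
k0 = -3, k = 29 + 3 = 32, then k = 16
k1 = 0, k = 8
k2 = 0, k = 4
k3 = 0, k = 2
k4 = 0, k = 1
k5 = 1, k = 0 (algorithm terminates)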
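The digit rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; the function name `wnaf` is an assumption. With w = 2 the same routine reproduces the ordinary NAF of the earlier slides, because the signed residue modulo 4 of an odd k is exactly 2 - (k % 4).

```python
def wnaf(k, w):
    """Return the width-w NAF digits of a positive integer k, least significant first."""
    digits = []
    while k > 0:
        if k & 1:
            u = k % (1 << w)           # u ≡ k (mod 2^w)
            if u >= 1 << (w - 1):      # map the residue into [-2^(w-1), 2^(w-1))
                u -= 1 << w
            k -= u                     # k is now divisible by 2^w
        else:
            u = 0
        digits.append(u)
        k >>= 1
    return digits


# Digits are printed most significant first, as on the slides.
print(wnaf(29, 2)[::-1])  # [1, 0, 0, -1, 0, 1]
print(wnaf(29, 3)[::-1])  # [1, 0, 0, 0, 0, -3]
```

For w = 3 the non-zero digits are odd values in {-3, -1, 1, 3}, which is why the multiples P and 3P have to be pre-computed before the main loop.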

Slide8

Scalar Multiplication with width w-NAF

Pre-computation: 1 D + (2^(w-2) - 1) A
Expected run time = (m/(w+1)) A + m D
Run time with the binary method = (m/2) A + m D

Hence an architecture based on w-NAF incurs an initial pre-computation phase.

Slide9

Koblitz Curves

Koblitz curves are a special class of elliptic curves, defined over GF(2^m) by

E_a: y^2 + xy = x^3 + a·x^2 + 1

where the elliptic curve parameter a is either 0 or 1.

Koblitz curves are computationally efficient compared to random curves, as the Frobenius map can be utilized to accelerate scalar multiplication.

The previous methods did not reduce the number of doubling operations.

Koblitz proposed a set of curves which does not require any doubling. The curves were named after him.

Slide10

Choice of the curve

The choice of the curve depends on a, which can be either 0 or 1.
As we have seen, the points of an elliptic curve form a group.
The group should be chosen such that the ECDLP is difficult.
The number of elements in the elliptic curve group is called the order of the group.
For security, the order of the group should be very nearly prime (the product of a large prime and a small integer); otherwise there are subgroups whose orders divide the group order, which makes the curve cryptographically weak.
The field elements belong to GF(2^m).
The subgroups are defined over GF(2^d), where d | m.
If m is prime, the only proper divisor is d = 1. Thus the only such subgroups are E0(GF(2)) and E1(GF(2)).
It can be easily checked from the curve equation y^2 + xy = x^3 + a·x^2 + 1 that:
E0(GF(2)) = {O, (0,1), (1,0), (1,1)}
E1(GF(2)) = {O, (0,1)}

Slide11

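An Interesting Property

The curve satisfies: (x^4, y^4) + 2(x, y) = µ(x^2, y^2), where µ = (-1)^(1-a).

Define the Frobenius map τ as: τ(x, y) = (x^2, y^2), with τ(O) = O.

The Frobenius map follows the relation τ^2(P) + 2P = µτ(P), i.e. τ^2 + 2 = µτ, where µ = (-1)^(1-a).

For a point P on the Koblitz curve, we can use this property of the Frobenius map to compute 2P.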
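Rearranging the relation makes the doubling-free expression for 2P explicit. The lines below are only a worked restatement, writing τ for the Frobenius map (x, y) → (x^2, y^2):

```latex
\begin{align*}
\tau^2(P) + 2P &= \mu\,\tau(P) \\
2P &= \mu\,\tau(P) - \tau^2(P) \\
2(x, y) &= \mu\,(x^2, y^2) - (x^4, y^4), \qquad \mu = (-1)^{1-a}
\end{align*}
```

So a doubling can be traded for field squarings of the coordinates plus one point subtraction.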

Slide12

τ-adic NAF

The scalar k can be represented as a polynomial k = u_(l-1)·τ^(l-1) + ... + u_1·τ + u_0 with digits u_i in {0, 1, -1}, where τ is the indeterminate.
This sum is analogous to the binary expansion.
The scalar is said to belong to the ring Z[τ].
It can be proved that the τ-adic NAF representation is unique.

Slide13

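Divisibility by τ

In order to generate this NAF, we divide the element by τ, just as we divided by 2 in the binary NAF.
An element r0 + r1·τ of Z[τ] is divisible by τ exactly when r0 is even, and then (r0 + r1·τ)/τ = (r1 + µ·r0/2) - (r0/2)·τ, which follows from τ^2 = µτ - 2.
As it is a NAF, the remainder is chosen such that the next NAF digit is zero.

Slide14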

Algorithm for τ-adic NAF generation

k = 29, taking a = 1 so that µ = 1 and τ^2 = τ - 2.
The τ-adic NAF is (1, 0, 1, 0, 0, -1, 0, 0, 0, -1, 0, 1)
=> 29 = 1 - τ^2 - τ^6 + τ^9 + τ^11
29P = P - τ^2(P) - τ^6(P) + τ^9(P) + τ^11(P)
29P = (x, y) - (x^4, y^4) - (x^64, y^64) + (x^512, y^512) + (x^2048, y^2048)

Thus the scalar multiplication avoids any doubling operation; instead it performs cheap squaring operations.

It may be noted that the length is almost twice that of the binary expansion, hence a reduction is necessary.
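The division-by-τ procedure can be sketched as follows. This is a minimal illustration using the standard digit rule u = 2 - ((r0 - 2·r1) mod 4) for an element r0 + r1·τ with r0 odd; the function names `tnaf` and `evaluate` are assumptions made for this example. The second function multiplies the expansion back out in Z[τ], using τ^2 = µτ - 2, to confirm that it really represents k.

```python
def tnaf(k, mu=1):
    """Return the tau-adic NAF digits of the integer k, least significant first."""
    r0, r1 = k, 0                      # current element r0 + r1*tau of Z[tau]
    digits = []
    while r0 != 0 or r1 != 0:
        if r0 % 2:                     # not divisible by tau: pick u in {1, -1}
            u = 2 - ((r0 - 2 * r1) % 4)
            r0 -= u
        else:
            u = 0
        digits.append(u)
        # Divide by tau: (r0 + r1*tau)/tau = (r1 + mu*r0/2) - (r0/2)*tau
        r0, r1 = r1 + mu * (r0 // 2), -(r0 // 2)
    return digits


def evaluate(digits, mu=1):
    """Evaluate sum(u_i * tau^i) by Horner's rule; returns (c0, c1) for c0 + c1*tau."""
    c0, c1 = 0, 0
    for u in reversed(digits):
        # (c0 + c1*tau)*tau = -2*c1 + (c0 + mu*c1)*tau, then add the digit u
        c0, c1 = -2 * c1 + u, c0 + mu * c1
    return c0, c1


digits = tnaf(29)
print(digits[::-1])        # most significant digit first
print(len(digits))         # roughly twice the 5-bit binary length of 29
print(evaluate(digits))    # (29, 0)
```

The evaluation step is only a sanity check; a scalar multiplier would apply the digits to a point instead, replacing each multiplication by τ with a squaring of the coordinates.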

Slide15

Reduction of the scalar

τ^m(P) = P [from Fermat's Little Theorem: x^(2^m) = x for every x in GF(2^m)]
(τ^m - 1)(P) = O
Hence, if γ ≡ k (mod τ^m - 1), then γ(P) = k(P).

Slide16

Reduction of Scalar

Solinas presented an efficient algorithm for reduction of a scalar. The algorithm involves integer multiplication and is thus costly for hardware implementations.

An alternative approach known as Lazy Reduction was proposed by Brumley and Jarvinen; it uses the observation that division by τ is cheap.

The algorithm uses multiplication and division by 2 and integer additions.

Implementation is simple and the area requirement is small.

The algorithm takes m clock cycles.

So, Lazy Reduction is used.

Slide17

Scalar Multiplication with τ-adic NAF

Expected run time with τ-adic NAF = (m/3) A
Run time with the binary method = (m/2) A + m D

Slide18

Summary

Basic steps of scalar multiplication on Koblitz curves:
1. Reduction of the scalar.
2. NAF generation from the reduced scalar.
3. Point addition for nonzero NAF digits. Point addition is performed in the Lopez-Dahab projective co-ordinate system.
4. Point squaring for every NAF digit.
5. Final field inversion to transform the scalar multiplication result from the projective co-ordinate system into the affine co-ordinate system.

Our Koblitz curve scalar multiplier uses the simple NAF method for scalar multiplication.

Slide19

Top Level Architecture

Slide21

Architecture for Reduction of Scalar

An arrangement of adders and shift circuits is used to perform the reduction of the scalar. Here u is the LSB of c0. There are registers to store intermediate values. The control unit generates control signals for the multiplexers and write enable signals for the storage registers. Initially the storage register for c0 contains the value of the scalar.

Slide22

T-NAF Generation from Reduced Scalar

Reduced scalar: r0 = b0 + c0, r1 = b1 + c1

Each T-NAF digit can be found by observing the last two bits of c0 and c1.

T-NAF digits are generated after the reduction of the scalar. As the algorithm performs integer additions and subtractions, the adders of the reduction circuit can be reused to generate the T-NAF digits.

Slide23

Architecture for Reduction & T-NAF Generation

The left portion of the circuit is used to generate the digits. The NAF generation and reduction hardware share the adders and registers. During reduction, control signal M is set to 0. After the reduction is over, NAF generation starts and M is changed to 1.

Slide24

Choice of Scalar Multiplication Algorithm

There are two scalar multiplication algorithms in the literature:
Process the scalar starting from the MSB (Left-to-Right).
Process the scalar starting from the LSB (Right-to-Left).

The Left-to-Right algorithm first computes the entire NAF of the reduced scalar and then starts processing the NAF from the MSB.

So it waits for the entire NAF generation, and this takes nearly m clock cycles in GF(2^m).

Additionally, at every iteration, Q is squared. So, while a point addition is in progress, the squaring cannot be performed in parallel.

But squaring is cheap in hardware, and the algorithm does not use this opportunity for parallel processing.

Slide25

Fast Scalar Multiplication Algorithm

The Right-to-Left algorithm for scalar multiplication works as follows:
Q = O
For each NAF digit u, starting from the LSB:
    if u = 1 then Q = Q + P; if u = -1 then Q = Q - P
    P = τ(P), i.e. P(x, y) becomes P(x^2, y^2)
Return Q

The scalar multiplication does not wait for the entire NAF of the scalar. As soon as the LSB, i.e. the first NAF digit, is generated, the scalar multiplication starts.

Additionally, point addition updates only Q, and point squaring is independent of Q.

So we can use the fact that point squaring is cheap in hardware and can be performed in parallel with point addition.

So, we select this Right-to-Left algorithm for scalar multiplication.
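The right-to-left method can be checked end to end on a toy curve. The sketch below is a software model under stated assumptions, not the hardware design of this talk: it uses the small field GF(2^5) with modulus x^5 + x^2 + 1, the curve with a = 1, affine rather than Lopez-Dahab co-ordinates, no scalar reduction, and function names chosen for the example. It compares the result with plain double-and-add.

```python
# Toy model: Koblitz curve E_1: y^2 + xy = x^3 + x^2 + 1 over GF(2^5).
# Field size, modulus and all function names are illustrative assumptions.
M = 5
MOD = 0b100101   # x^5 + x^2 + 1, irreducible over GF(2)
A, MU = 1, 1     # a = 1, hence mu = (-1)^(1-a) = 1

def gf_mul(a, b):
    """Multiply two GF(2^M) elements (bit-vectors of polynomial coefficients)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= MOD
    return r

def gf_inv(a):
    """Inverse of a nonzero element as a^(2^M - 2), by square-and-multiply."""
    r, e = 1, (1 << M) - 2
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r

def neg(P):
    """-(x, y) = (x, x + y); None stands for the point at infinity O."""
    return None if P is None else (P[0], P[0] ^ P[1])

def add(P, Q):
    """Affine addition (and doubling) on y^2 + xy = x^3 + A*x^2 + 1."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2:
        if y1 != y2 or x1 == 0:
            return None                      # Q = -P
        lam = x1 ^ gf_mul(y1, gf_inv(x1))    # tangent slope
        x3 = gf_mul(lam, lam) ^ lam ^ A
        return (x3, gf_mul(x1, x1) ^ gf_mul(lam ^ 1, x3))
    lam = gf_mul(y1 ^ y2, gf_inv(x1 ^ x2))   # chord slope
    x3 = gf_mul(lam, lam) ^ lam ^ x1 ^ x2 ^ A
    return (x3, gf_mul(lam, x1 ^ x3) ^ x3 ^ y1)

def frobenius(P):
    """tau(x, y) = (x^2, y^2): two field squarings, no point doubling."""
    return None if P is None else (gf_mul(P[0], P[0]), gf_mul(P[1], P[1]))

def tnaf_digits(k):
    """Yield the tau-adic NAF digits of k one at a time, LSB first."""
    r0, r1 = k, 0
    while r0 or r1:
        u = 2 - ((r0 - 2 * r1) % 4) if r0 & 1 else 0
        r0 -= u
        yield u
        r0, r1 = r1 + MU * (r0 // 2), -(r0 // 2)

def scalar_mult_rtl(k, P):
    """Right-to-left: consume each digit as soon as it is produced."""
    Q = None
    for u in tnaf_digits(k):
        if u == 1:
            Q = add(Q, P)
        elif u == -1:
            Q = add(Q, neg(P))
        P = frobenius(P)                     # does not depend on Q
    return Q

def scalar_mult_binary(k, P):
    """Reference double-and-add, used only to cross-check the result."""
    Q = None
    while k:
        if k & 1:
            Q = add(Q, P)
        P = add(P, P)
        k >>= 1
    return Q

def find_point():
    """Brute-force search for an affine point with x != 0."""
    for x in range(1, 1 << M):
        for y in range(1 << M):
            x2 = gf_mul(x, x)
            if gf_mul(y, y) ^ gf_mul(x, y) == gf_mul(x2, x) ^ gf_mul(A, x2) ^ 1:
                return (x, y)

P = find_point()
print(all(scalar_mult_rtl(k, P) == scalar_mult_binary(k, P) for k in range(1, 50)))
```

In `scalar_mult_rtl` the update of P never reads Q, which is the independence that lets the squarer run alongside the point adder in hardware.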

Slide26

Point Addition Unit

The point addition unit performs point addition in the Lopez-Dahab co-ordinate system and takes 8 clock cycles. Initially the three registers are initialized with the base point (Px, Py, 1). After every point addition, the result Q = Q + P is stored in the registers (RA1, RB1, RC1). In the figure, P = (RA2, RB2). In every clock cycle a field multiplication is performed, and the multiplier is of Hybrid Karatsuba type. Control signals are used to control the multiplexers and the write enable signals of the storage registers.

Slide27

Point Addition Unit

Slide28

Point Squaring Unit

During scalar multiplication, point squaring is performed in every clock cycle. The base point is updated as P(x, y) → P(x^2, y^2). Point squarings are performed using dedicated squarer circuits, as squarers are cheap.

From the scalar multiplication algorithm it can be seen that point squaring is independent of point additions.

A nonzero NAF digit is followed by at least one zero digit, and often several (NAF property). So, during a point addition, we can continue point squaring in parallel until another nonzero NAF digit appears.

Slide29

Point Squaring Unit

The NAF digits are generated from the LSB side. Let a portion of the entire NAF be <... 1 0 0 0 0 0 1 ...>. For the first 1, a point addition is required, and this point addition takes 8 clock cycles.

From the algorithm it can be seen that for a nonzero NAF digit u, a point addition takes place and uses the present value of P.

If we consider only sequential processing, then after performing the point addition for the first 1, the algorithm performs 6 point squarings for the sequence <0 0 0 0 0 1>. This requires 6 clock cycles.

As P is independent of Q, we can perform the 6 point squarings in parallel with the point addition (which takes 8 clock cycles), thus saving 6 clock cycles.

When the next nonzero digit appears in <... 1 0 0 0 0 0 1 ...>, we must stop this parallel processing of zeros, as the last updated value of P will be required during the next point addition.

Slide30

Architecture for Point Squaring Unit

The point P(x, y) is in affine co-ordinates, and two dedicated squarers are used for squaring the x and y co-ordinates.

Initially the registers are assigned the base point. When the scalar multiplication starts, point squaring is performed for every digit and the registers are updated.

A write enable signal en is used to protect the content of the registers from unnecessary squarings, especially for the case (another nonzero digit) mentioned on the previous slide.

Slide31

Architecture for Inversion

Scalar multiplication, when done in the Lopez-Dahab co-ordinate system, requires a final inversion after processing the entire scalar.

For ECC, Itoh-Tsujii inversion is efficient.

In a field GF(2^m), the inverse of an element a is a^(2^m - 2).

Using the quad operation we can compute the inverse. Here is an example for the field GF(2^233).

This requires multiplications and repeated quad operations. We can implement this using a multiplier and quad circuits.
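The idea behind Itoh-Tsujii inversion can be sketched in software. This is a hedged illustration, not the circuit of this talk: it works in the small field GF(2^7) with modulus x^7 + x + 1, uses repeated squarings where the hardware uses quad circuits, and all names are assumptions. The routine builds a^(2^k - 1) for growing k with an addition chain on m - 1, then squares once to reach a^(2^m - 2).

```python
M = 7
MOD = 0b10000011   # x^7 + x + 1, irreducible over GF(2)


def gf_mul(a, b):
    """Multiply two GF(2^M) elements (bit-vectors of polynomial coefficients)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= MOD
    return r


def gf_sqr_n(a, n):
    """Square a field element n times, giving a^(2^n)."""
    for _ in range(n):
        a = gf_mul(a, a)
    return a


def itoh_tsujii_inverse(a):
    """Return a^(-1) = a^(2^M - 2) = (a^(2^(M-1) - 1))^2 for nonzero a."""
    beta, k = a, 1                    # invariant: beta = a^(2^k - 1)
    for bit in bin(M - 1)[3:]:        # bits of M-1 after the leading 1
        beta = gf_mul(gf_sqr_n(beta, k), beta)    # k -> 2k
        k *= 2
        if bit == "1":
            beta = gf_mul(gf_mul(beta, beta), a)  # k -> k + 1
            k += 1
    return gf_mul(beta, beta)         # final squaring


print(all(gf_mul(a, itoh_tsujii_inverse(a)) == 1 for a in range(1, 1 << M)))
```

Only a handful of general multiplications are needed; the rest of the work is repeated squaring, which is the part a cascade of quad circuits accelerates.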

Slide32

Architecture for Inversion

This is the basic block diagram for the inversion unit. The multiplier is actually a part of the point addition unit; it is shared between the point addition unit and the inversion unit.

It can be seen from the previous slide that there are repeated quad operations, for example in step 7. If we use a single quad circuit, then that exponentiation takes 14 clock cycles. To reduce the number of clock cycles, we use a cascade of several quad circuits. This cascade of quad circuits is called a Quadblock.

Slide33

Architecture for Quadblock …

Here is an example of a Quadblock which contains 11 cascaded quad circuits. So, for an element a, we can raise it to a maximum of a^(4^11) in one pass.

A multiplexer is used to tap intermediate results.

To raise an element to a power that needs more quad operations than the number of cascaded quad circuits, repeated application of the Quadblock is required. So the number of clock cycles depends on the number of quad circuits. For example, an exponentiation that needs 14 quad operations (as in step 7 on the previous slide) can be done in two clock cycles.

The number of clock cycles reduces if we increase the number of quad circuits, but the delay increases. So the design must balance delay against the number of quad circuits.
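The cycle count can be modelled with one line of arithmetic. The sketch below assumes that one pass through the Quadblock takes one clock cycle and applies at most as many quad operations as there are cascaded circuits; the function name is hypothetical, and the longer critical path of a deeper cascade is not modelled.

```python
import math


def quadblock_cycles(quad_ops, quads_per_block):
    """Clock cycles needed to apply `quad_ops` successive quad (a -> a^4) operations
    when one pass through the block applies at most `quads_per_block` of them."""
    return math.ceil(quad_ops / quads_per_block)


print(quadblock_cycles(14, 1))   # 14
print(quadblock_cycles(14, 11))  # 2
```

A deeper cascade lowers this count but lengthens the combinational path, which is the delay side of the trade-off.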

Slide34

Experimental Performance

Experimentation was performed on a Xilinx Virtex V FPGA for GF(2^283).

The scalar multiplier on a random curve in the field GF(2^283) has an area of around 40,000 LUTs, a frequency of 37 MHz and a computation time of 63 microseconds.

The Koblitz curve scalar multiplier (in the first stage of implementation) in GF(2^283) has an area of 41,300 LUTs, a frequency of 31 MHz and an average computation time of 35 microseconds.

It can be seen that the Koblitz curve crypto processor takes almost half the computation time of the random curve crypto processor.

Slide35

Further Acceleration

We have found a novel technique to reduce number of point additions during scalar multiplication using representation of a scalar.

For any scalar, we have found that length of is close to half of the length of .

However, there is an overhead of a small amount of pre-computation and an increased area.

In Virtex IV FPGA, scalar multiplication using for the field GF(2^283) saves 35% computation time compared to method.

Slide36

Thank You