Correlation Attack

In cryptography, correlation attacks are a class of known-plaintext attacks for breaking stream ciphers whose keystream is generated by combining the output of several linear-feedback shift registers (LFSRs) using a Boolean function.

Correlation attacks exploit a statistical weakness that arises from certain choices of the Boolean function. A cipher of this type is not inherently insecure, provided the Boolean function is chosen to avoid this weakness. As with all attack methods, this should be accounted for when designing an encryption system.

Explanation

Correlation attacks are possible when there is a significant correlation between the output of an individual LFSR in the keystream generator and the output of the Boolean function that combines the outputs of all of the LFSRs. Combined with partial knowledge of the keystream, which is easily derived from partial knowledge of the plaintext by XORing the known plaintext with the corresponding ciphertext, this allows an attacker to brute-force the key for the individual LFSR separately from the rest of the system. For instance, in a keystream generator in which four 8-bit LFSRs are combined to produce the keystream, if one of the registers is correlated with the Boolean function output, one can brute-force it first and then the remaining three, for a total attack complexity of 2⁸ + 2²⁴. Compared to the cost of a brute-force attack on the entire system, with complexity 2³², this represents an effort saving factor of just under 256, which is substantial. If a second register is correlated with the function, we may repeat this process and drop the attack complexity to 2⁸ + 2⁸ + 2¹⁶, for an effort saving factor of just under 65028. In this sense, correlation attacks can be considered divide-and-conquer algorithms.
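
The saving factors quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch; the variable names are illustrative and "work" is counted simply as the number of candidate keys tried.

```python
# Work factors for a generator built from four 8-bit LFSRs,
# counted as the number of candidate keys that must be tried.
full = 2 ** 32                               # brute force over all four registers
one_correlated = 2 ** 8 + 2 ** 24            # one register first, then the other three
two_correlated = 2 ** 8 + 2 ** 8 + 2 ** 16   # two registers separately, then the other two

saving_one = full / one_correlated   # just under 256
saving_two = full / two_correlated   # just under 65028

print(saving_one, saving_two)
```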

Example

Breaking the Geffe generator

Consider the case of the Geffe generator, which consists of three LFSRs: LFSR-1, LFSR-2 and LFSR-3. If we denote the outputs of these registers by $x_{1}$, $x_{2}$ and $x_{3}$, respectively, then the Boolean function that combines the three registers to provide the generator output is given by $F(x_{1},x_{2},x_{3})=(x_{1}\wedge x_{2})\oplus (\neg x_{1}\wedge x_{3})$ (i.e. ($x_{1}$ AND $x_{2}$) XOR (NOT $x_{1}$ AND $x_{3}$)). There are 2³ = 8 possible values for the outputs of the three registers, and the value of this combining function for each of them is shown in the table below:

$x_{1}$	$x_{2}$	$x_{3}$	$F(x_{1},x_{2},x_{3})$
0	0	0	0
0	0	1	1
0	1	0	0
0	1	1	1
1	0	0	0
1	0	1	0
1	1	0	1
1	1	1	1

Consider the output of the third register, $x_{3}$. The table above shows that of the 8 possible outputs of $x_{3}$, 6 are equal to the corresponding value of the generator output, $F(x_{1},x_{2},x_{3})$. In 75% of all possible cases, $x_{3}=F(x_{1},x_{2},x_{3})$. Thus we say that LFSR-3 is correlated with the generator. This is a weakness we may exploit as follows:

Suppose we intercept the ciphertext $c_{1},c_{2},c_{3},\ldots ,c_{n}$ of a plaintext $p_{1},p_{2},p_{3},\ldots ,p_{n}$ which has been encrypted by a stream cipher using a Geffe generator as its keystream generator, i.e. $c_{i}=p_{i}\oplus F(x_{1i},x_{2i},x_{3i})$ for $i=1,2,3,\ldots ,n$, where $x_{1i}$ is the output of LFSR-1 at time $i$, etc. Suppose further that we know some part of the plaintext, e.g. we know $p_{1},p_{2},p_{3},\ldots ,p_{32}$, the first 32 bits of the plaintext (corresponding to 4 ASCII characters of text). This is not as improbable as it may seem: if we know the plaintext is an XML file that begins with an XML declaration, for instance, we know that the first 4 ASCII characters must be “<?xm”. Similarly, many file formats or network protocols have standard headers or footers which can be guessed easily. Given the intercepted $c_{1},c_{2},c_{3},\ldots ,c_{32}$ and our known or guessed $p_{1},p_{2},p_{3},\ldots ,p_{32}$, we may easily find $F(x_{1i},x_{2i},x_{3i})$ for $i=1,2,3,\ldots ,32$ by XORing the two together. We now know 32 consecutive bits of the generator output.
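
The keystream-recovery step described above amounts to a bitwise XOR of ciphertext and known plaintext. The sketch below illustrates it under stated assumptions: the helper names `to_bits` and `recover_keystream`, the 4-byte header `b"HEAD"` and the stand-in keystream are all hypothetical and chosen only for the demonstration.

```python
def to_bits(data):
    """Convert bytes to a list of bits, most significant bit first."""
    return [(byte >> (7 - i)) & 1 for byte in data for i in range(8)]

def recover_keystream(cipher_bits, known_plain_bits):
    """XOR known plaintext bits with the matching ciphertext bits."""
    return [c ^ p for c, p in zip(cipher_bits, known_plain_bits)]

# Demonstration with made-up data: a hypothetical 4-byte known header
# and an arbitrary 32-bit pattern standing in for the generator output.
known_plain = to_bits(b"HEAD")
demo_keystream = [1, 0, 1, 1, 0, 0, 1, 0] * 4
cipher = [p ^ k for p, k in zip(known_plain, demo_keystream)]

recovered = recover_keystream(cipher, known_plain)
print(recovered == demo_keystream)
```

Only the positions where the plaintext is known yield keystream bits; the rest of the ciphertext stays opaque until the register keys are found.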

Now we may begin a brute-force search of the space of possible keys (initial values) for LFSR-3 (assuming we know the tapped bits of LFSR-3, an assumption which is in line with Kerckhoffs’ principle). For any given key in the keyspace, we may quickly generate the first 32 bits of LFSR-3’s output and compare these to our recovered 32 bits of the entire generator’s output. Because we have established earlier that there is a 75% correlation between the output of LFSR-3 and the generator, we know that if we have correctly guessed the key for LFSR-3, approximately 24 of the first 32 bits of LFSR-3 output will match up with the corresponding bits of generator output. If we have guessed incorrectly, we should expect roughly half, or 16, of the first 32 bits of these two sequences to match. Thus we may recover the key for LFSR-3 independently of the keys of LFSR-1 and LFSR-2. At this stage we have reduced the problem of brute forcing a system of 3 LFSRs to the problem of brute forcing a single LFSR and then a system of 2 LFSRs. The amount of effort saved here depends on the length of the LFSRs. For realistic values, it is a very substantial saving and can make brute-force attacks very practical.

Observe in the table above that $x_{2}$ also agrees with the generator output 6 times out of 8, again a 75% correlation between $x_{2}$ and the generator output. We may therefore begin a brute-force attack against LFSR-2 independently of the keys of LFSR-1 and LFSR-3, leaving only LFSR-1 unbroken. Thus, we are able to break the Geffe generator with as much effort as is required to brute force 3 entirely independent LFSRs, meaning that the Geffe generator is a very weak generator and should never be used to generate stream cipher keystreams.
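
The whole attack can be sketched end to end on a toy Geffe generator. Everything specific here is an assumption made for illustration: the register lengths (5, 7 and 8 bits), the tap positions, the secret seeds, and the use of 512 known keystream bits instead of 32 so that the correct key stands out clearly from the wrong ones. The all-zero seed is skipped because it produces a constant output.

```python
def lfsr_bits(seed, length, taps, n):
    """First n output bits of a Fibonacci LFSR.

    seed: integer whose low `length` bits form the initial state.
    taps: state positions XORed together to form the feedback bit.
    """
    state = [(seed >> i) & 1 for i in range(length)]
    out = []
    for _ in range(n):
        out.append(state[0])
        feedback = 0
        for t in taps:
            feedback ^= state[t]
        state = state[1:] + [feedback]
    return out

def geffe(x1, x2, x3):
    """Geffe combining function: (x1 AND x2) XOR (NOT x1 AND x3)."""
    return (x1 & x2) ^ ((1 - x1) & x3)

# Hypothetical register descriptions: number -> (length, tap positions).
REGS = {1: (5, (0, 2)), 2: (7, (0, 1)), 3: (8, (0, 2, 3, 4))}
N = 512  # number of known keystream bits (assumed, see lead-in)

def keystream(k1, k2, k3, n=N):
    """Generator output for the three register seeds."""
    s1 = lfsr_bits(k1, *REGS[1], n)
    s2 = lfsr_bits(k2, *REGS[2], n)
    s3 = lfsr_bits(k3, *REGS[3], n)
    return [geffe(a, b, c) for a, b, c in zip(s1, s2, s3)]

def best_seed(reg, z):
    """Seed of one register whose output agrees most often with z."""
    length, taps = REGS[reg]

    def agreement(seed):
        bits = lfsr_bits(seed, length, taps, len(z))
        return sum(b == zb for b, zb in zip(bits, z))

    return max(range(1, 2 ** length), key=agreement)

secret = (0b10110, 0b1001011, 0b11010010)  # arbitrary demonstration seeds
z = keystream(*secret)                     # what the attacker has recovered

k3 = best_seed(3, z)  # LFSR-3 alone, via its 75% correlation
k2 = best_seed(2, z)  # LFSR-2 alone, via its 75% correlation
# LFSR-1 is uncorrelated, so search it exhaustively using the known k2, k3.
k1 = next(k for k in range(1, 2 ** 5) if keystream(k, k2, k3) == z)

recovered = (k1, k2, k3)
print(recovered == secret)
```

The search tries at most 255 + 127 + 31 candidate seeds rather than every combination of the three, which is the divide-and-conquer saving in miniature.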

Note from the table above that $x_{1}$ agrees with the generator output 4 times out of 8, a 50% correlation. We cannot use this to brute force LFSR-1 independently of the others: the correct key will yield output which agrees with the generator output 50% of the time, but on average so will an incorrect key. This represents the ideal situation from a security perspective: the combining function $F(x_{1},x_{2},x_{3})$ should be chosen so that the correlation between each variable and the combining function’s output is as close as possible to 50%. In practice it may be difficult to find a function which achieves this without sacrificing other design criteria, e.g. period length, so a compromise may be necessary.
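
The agreement counts used in this example (4, 6 and 6 out of 8 for the first, second and third register) can be reproduced by enumerating the truth table of the combining function. This is a minimal sketch; the function name `geffe` is an illustrative choice.

```python
from itertools import product

def geffe(x1, x2, x3):
    """Geffe combining function: (x1 AND x2) XOR (NOT x1 AND x3)."""
    return (x1 & x2) ^ ((1 - x1) & x3)

# One row per input combination: (x1, x2, x3, F).
rows = [(x1, x2, x3, geffe(x1, x2, x3))
        for x1, x2, x3 in product((0, 1), repeat=3)]

# For each input, count the rows in which it equals the output.
agreements = [sum(row[i] == row[3] for row in rows) for i in range(3)]
print(agreements)  # [4, 6, 6]
```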

Clarifying the statistical nature of the attack

While the above example illustrates well the relatively simple concepts behind correlation attacks, it oversimplifies the explanation of precisely how the brute forcing of individual LFSRs proceeds. We make the statement that incorrectly guessed keys will generate LFSR output which agrees with the generator output roughly 50% of the time, because given two random bit sequences of a given length, the probability of agreement between the sequences at any particular bit is 0.5. However, specific individual incorrect keys may well generate LFSR output which agrees with the generator output more or less often than exactly 50% of the time. This is particularly salient in the case of LFSRs whose correlation with the generator is not especially strong; for small enough correlations, it is quite possible that an incorrectly guessed key will also lead to LFSR output that agrees with the expected number of bits of the generator output. Thus, we may not be able to find the key to that LFSR uniquely and with certainty. We may instead find a number of possible keys, although this is still a significant breach of the cipher’s security. If we had, say, a megabyte of known plaintext, the situation would be substantially different. An incorrect key may generate LFSR output that agrees with more than 512 kilobytes of the generator output, but it is unlikely to generate output that agrees with as much as 768 kilobytes of the generator output, as a correctly guessed key would. As a rule, the weaker the correlation between an individual register and the generator output, the more known plaintext is required to find that register’s key with a high degree of confidence. Readers with a background in probability theory should be able to see easily how to formalize this argument and obtain estimates of the length of known plaintext required for a given correlation using the binomial distribution.
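
One way to formalize the binomial argument is sketched below. It assumes that each bit of a wrong key's output agrees with the generator independently with probability 0.5, and each bit of the correct key's output with probability 0.75; a key is accepted when its agreement count reaches a threshold. The function names and the choice of threshold (the midpoint, 5n/8) are assumptions made for illustration.

```python
from math import comb

def tail_at_least(n, k, p):
    """P[X >= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j)
               for j in range(k, n + 1))

def false_alarm(n, threshold):
    """Chance that a single wrong key reaches the threshold."""
    return tail_at_least(n, threshold, 0.5)

def miss(n, threshold, p_correct=0.75):
    """Chance that the correct key falls below the threshold."""
    return 1.0 - tail_at_least(n, threshold, p_correct)

for n in (32, 128, 512):
    threshold = (5 * n) // 8  # midpoint between n/2 and 3n/4
    print(n, threshold, false_alarm(n, threshold), miss(n, threshold))
```

With 32 known bits and a threshold of 24, a single wrong key passes with probability of roughly 0.35%, so a search over a few hundred candidates can be expected to turn up occasional false positives. Both error probabilities shrink rapidly as the amount of known keystream grows.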

Higher order correlations

Definition

The correlations which were exploited in the example attack on the Geffe generator are examples of what are called first order correlations: they are correlations between the value of the generator output and an individual LFSR. It is possible to define higher order correlations in addition to these. For instance, it may be possible that while a given Boolean function has no strong correlations with any of the individual registers it combines, a significant correlation may exist between its output and some Boolean function of two of the registers, e.g. $x_{1}\oplus x_{2}$. This would be an example of a second order correlation. We can define third order correlations and so on in the obvious way.

Higher order correlation attacks can be more powerful than first order correlation attacks; however, this effect is subject to a “law of limiting returns”. The table below shows a measure of the computational cost for various attacks on a keystream generator consisting of eight 8-bit LFSRs combined by a single Boolean function. Understanding the calculation of cost is relatively straightforward: the leftmost term of the sum represents the size of the keyspace for the correlated registers, and the rightmost term represents the size of the keyspace for the remaining registers.

Attack	Effort (size of keyspace)
Brute force	$2^{8\times 8}=18446744073709551616$
Single 1st order correlation attack	$2^{8}+2^{7\times 8}=72057594037928192$
Single 2nd order correlation attack	$2^{2\times 8}+2^{6\times 8}=281474976776192$
Single 3rd order correlation attack	$2^{3\times 8}+2^{5\times 8}=1099528404992$
Single 4th order correlation attack	$2^{4\times 8}+2^{4\times 8}=8589934592$
Single 5th order correlation attack	$2^{5\times 8}+2^{3\times 8}=1099528404992$
Single 6th order correlation attack	$2^{6\times 8}+2^{2\times 8}=281474976776192$
Single 7th order correlation attack	$2^{7\times 8}+2^{8}=72057594037928192$
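
The effort figures in the table can be regenerated with a short calculation. This is a minimal sketch; the names `REGISTERS`, `BITS` and `costs` are illustrative.

```python
REGISTERS = 8  # number of LFSRs in the generator
BITS = 8       # length of each LFSR in bits

brute_force = 2 ** (REGISTERS * BITS)

# Cost of a single m-th order attack: the keyspace of the m correlated
# registers plus the keyspace of the remaining registers.
costs = {m: 2 ** (m * BITS) + 2 ** ((REGISTERS - m) * BITS)
         for m in range(1, REGISTERS)}

for m, cost in costs.items():
    print(m, cost, brute_force / cost)

cheapest = min(costs, key=costs.get)  # the order with the lowest cost
```

The cost is symmetric in the order and is lowest at the 4th order, where the two keyspaces are the same size.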

While higher order correlations lead to more powerful attacks, they are also more difficult to find, as the space of available Boolean functions to correlate against the generator output increases as the number of arguments to the function does.

Terminology

A Boolean function $F(x_{1},\ldots ,x_{n})$ of n variables is said to be “m-th order correlation immune” or to have “m-th order correlation immunity” for some integer m if no significant correlation exists between the function’s output and any Boolean function of m of its inputs. For example, a Boolean function which has no first order or second order correlations but which does have a third order correlation exhibits 2nd order correlation immunity. Obviously, higher correlation immunity makes a function more suitable for use in a keystream generator (although this is not the only thing which needs to be considered).

Siegenthaler showed that the correlation immunity m of a Boolean function of algebraic degree d of n variables satisfies $m+d\leq n$; for a given set of input variables, this means that a high algebraic degree will restrict the maximum possible correlation immunity. Furthermore, if the function is balanced then $m\leq n-1$. [1]

It follows that it is impossible for a function of n variables to be n-th order correlation immune. This also follows from the fact that any such function can be written using a Reed-Muller basis as a combination of XORs of the input functions.
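
The order of correlation immunity of a small function can be computed by brute force. The sketch below does not use the informal definition above directly; it relies on the Walsh transform characterization (the Xiao-Massey theorem), under which a function is m-th order correlation immune exactly when its Walsh coefficient is zero for every input mask of Hamming weight 1 to m. The function names are illustrative.

```python
from itertools import product

def walsh(f, n, w):
    """Walsh coefficient of f at mask w: sum over x of (-1)^(f(x) XOR w.x)."""
    total = 0
    for x in product((0, 1), repeat=n):
        dot = sum(wi & xi for wi, xi in zip(w, x)) % 2
        total += 1 if (f(*x) ^ dot) == 0 else -1
    return total

def ci_order(f, n):
    """Largest m such that all masks of weight 1..m have a zero coefficient."""
    order = 0
    for m in range(1, n + 1):
        masks = [w for w in product((0, 1), repeat=n) if sum(w) == m]
        if any(walsh(f, n, w) != 0 for w in masks):
            break
        order = m
    return order

def geffe(x1, x2, x3):
    return (x1 & x2) ^ ((1 - x1) & x3)

def xor3(x1, x2, x3):
    return x1 ^ x2 ^ x3

print(ci_order(geffe, 3))  # 0: first order correlations exist
print(ci_order(xor3, 3))   # 2: the maximum for a balanced function of 3 variables
```

The XOR of all three inputs reaches order 2 but has algebraic degree 1, consistent with the trade-off between correlation immunity and degree.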

Cipher design implications

Given the possibly extreme severity of a correlation attack’s impact on a stream cipher’s security, it should be considered essential to test a candidate Boolean combination function for correlation immunity before deciding to use it in a stream cipher. However, it is important to note that high correlation immunity is a necessary but not sufficient condition for a Boolean function to be appropriate for use in a keystream generator. There are other issues to consider, e.g. whether or not the function is balanced, that is, whether it outputs as many or roughly as many 1s as it does 0s when all possible inputs are considered.

Research has been conducted into methods for easily generating Boolean functions of a given size which are guaranteed to have at least some particular order of correlation immunity. This research has uncovered links between correlation immune Boolean functions and error correcting codes.