Tuesday, August 27, 2013

Encryption, Compression and Randomness - Part 1

One often comes across situations where data has to be compressed for size reduction and also encrypted for security. But should you first compress and then encrypt? Or should it be done in reverse, i.e. encrypt and then compress?


If you are using an encrypted transport like https, you don’t have to think much about this. You would usually compress the data and then transfer it over https with the Content-Type header set to something like application/zip or application/x-gzip, or even application/octet-stream. So the data ends up being first compressed by the application and then encrypted by the SSL/TLS layer.

But is compression followed by encryption the right way to do it?

Yes, it is. The reason is that encryption increases the "randomness" of the data, making it more difficult to compress. Compression algorithms reduce statistical redundancy by encoding repeated patterns efficiently; when the randomness increases, it is harder to find patterns and repetitions, and compression is not as effective.
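This can be seen in miniature even before the full experiment below. A minimal Python sketch (the repeated line is just an illustration):

import os
import zlib

text = b"To-morrow, and to-morrow, and to-morrow\n" * 1000  # highly repetitive text
rand = os.urandom(len(text))                                # pseudo-random bytes of the same length

print(len(text), len(zlib.compress(text)))  # the text shrinks to a tiny fraction
print(len(rand), len(zlib.compress(rand)))  # the random bytes come out slightly larger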

It may be a good idea to try some quick samples. The examples here were run on Mac OS X, where OpenSSL comes installed. Similar commands will work on Linux.

Macbeth.txt contains the text of Shakespeare's play Macbeth. Its size is 100129 bytes.

> ls -log Macbeth.txt
-rw-r--r--  1 100129 Aug 19 05:18 Macbeth.txt

Let us compress it and then encrypt it. Then we will print the sizes of the output after each operation.

> gzip -c Macbeth.txt > Macbeth.txt.gz

We want to retain the original Macbeth.txt, so we make use of the -c option to gzip, which writes to stdout.

We will need an encryption password, and we attempt to select a good quality one by generating 15 (pseudo)random bytes and encoding them as a 20 character base64 string. This password will be used for all our subsequent encryption and decryption operations.

> openssl rand -base64 15
WV+V+4LW9y7d8dDFlWYX

Next we encrypt the compressed file using AES-256.

> openssl enc -aes-256-cbc -salt -in Macbeth.txt.gz -out Macbeth.txt.gz.enc
enter aes-256-cbc encryption password:
Verifying - enter aes-256-cbc encryption password:

Before going further, it is good to verify that we are able to decrypt and gunzip the resulting file and get back the text of Macbeth. That would confirm that our operations worked correctly.

> openssl enc -aes-256-cbc -d -in Macbeth.txt.gz.enc -out /tmp/Macbeth.txt.gz
enter aes-256-cbc decryption password:
> gunzip /tmp/Macbeth.txt.gz
> diff Macbeth.txt /tmp/Macbeth.txt
>

The diff shows no output, so things look good.

Now we look at the sizes after compression (.gz) followed by encryption (.gz.enc):

> ls -log Macbeth.txt*
-rw-r--r--  1 100129 Aug 23 15:40 Macbeth.txt
-rw-r--r--  1  41468 Aug 23 15:40 Macbeth.txt.gz
-rw-r--r--  1  41488 Aug 23 15:47 Macbeth.txt.gz.enc

We see that compression reduced the size to roughly 41% of the original, and the subsequent encryption added only 20 more bytes. Symmetric block ciphers like AES work by applying pseudo-random permutations (bijective self-mappings) over the set of values of an input block. They do not increase the size beyond the bytes required for padding the last block and for a header holding the randomization parameters needed for decryption (e.g. the salt in OpenSSL's output).
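As a sanity check on those 20 bytes, the expected output size can be computed directly: OpenSSL's enc output is a 16 byte header (described in the note below) followed by the PKCS#7-padded ciphertext. A minimal Python sketch, assuming that format:

def openssl_enc_size(n):
    # 16 byte header ("Salted__" plus 8 byte salt) + ciphertext padded up
    # to the next multiple of the 16 byte AES block (always >= 1 pad byte)
    return 16 + (n // 16 + 1) * 16

print(openssl_enc_size(41468))   # 41488, matching Macbeth.txt.gz.enc
print(openssl_enc_size(100129))  # 100160, matching Macbeth.txt.enc further below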

Note: The 8 byte salt actually sits in a 16 byte header that starts with the 8 byte string Salted__. This can be seen by opening the encrypted file in a text editor. The password, together with the salt, is used to derive the encryption key and the IV (initialization vector) for encryption and decryption.
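The header can also be inspected programmatically; a minimal sketch:

import binascii

with open("Macbeth.txt.gz.enc", "rb") as f:
    header = f.read(16)
assert header[:8] == b"Salted__"       # OpenSSL's magic string
print(binascii.hexlify(header[8:16]))  # the 8 byte salt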

So the compressed encrypted output is about 41% of the input.

Let us see what happens if we encrypt first and compress after that.

> openssl enc -aes-256-cbc -salt -in Macbeth.txt -out Macbeth.txt.enc
enter aes-256-cbc encryption password:
Verifying - enter aes-256-cbc encryption password:

> gzip -c Macbeth.txt.enc > Macbeth.txt.enc.gz


> ls -log Macbeth.txt*
-rw-r--r--  1 100129 Aug 23 15:40 Macbeth.txt
-rw-r--r--  1 100160 Aug 23 15:50 Macbeth.txt.enc
-rw-r--r--  1 100214 Aug 23 15:50 Macbeth.txt.enc.gz
-rw-r--r--  1  41468 Aug 23 15:40 Macbeth.txt.gz
-rw-r--r--  1  41488 Aug 23 15:47 Macbeth.txt.gz.enc

In this case, the encryption added 31 bytes but the subsequent compression did not live up to its name and ended up adding a further 54 bytes.

Clearly this is not efficient: the overall result was not a reduction in size at all, but a small increase instead.

It looks like the compression failed to find enough patterns to reduce redundancy, due to the randomness of the encrypted input. But can we get a better feel for what is going on? Maybe some visual representations could help.

Making use of the techniques described on Roland's home page (which use a Python script and ImageMagick), we will try some visuals. For each of the three files, i.e. the original text (Macbeth.txt), its compression (Macbeth.txt.gz) and its encryption (Macbeth.txt.enc), we will generate:
i) a frequency distribution of the numeric values of the bytes (0 - 255)
ii) an image representation of the bitstream, with white and black pixels for 1 and 0 bit values respectively.

For ii), the pixels are laid out as horizontal lines, one below the other, so as to give rise to a square of n x n pixels. It is good to make n a multiple of 8 so that bytes align below one another. To make this happen, we select n as the largest multiple of 8 not exceeding the square root of 8*N, where N is the number of bytes in the file. In Java, this would be
int n = (((int) Math.sqrt(8 * N)) / 8) * 8;
and in Python it can be done as
n = 8 * (int(math.sqrt(8 * N)) // 8)
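Putting the layout scheme and this formula together, the same kind of image can also be generated directly in Python; a minimal sketch using the Pillow imaging library (assuming it is installed; the bit order within a byte may differ from what ImageMagick's mono: format uses):

import math
from PIL import Image

def bitstream_image(path, out):
    data = open(path, "rb").read()
    n = 8 * (int(math.sqrt(8 * len(data))) // 8)   # side of the square, as above
    # mode "1" packs 1 bit per pixel; n*n/8 bytes fill the n x n square exactly
    img = Image.frombytes("1", (n, n), data[:n * n // 8])
    img.save(out)

bitstream_image("Macbeth.txt", "Macbeth.txt-mono.png")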

First we look at i), i.e. frequency distribution of byte values, for the three files.
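Computing such a distribution is straightforward; a minimal sketch that counts byte values (the actual plots can be drawn with any charting tool):

data = bytearray(open("Macbeth.txt", "rb").read())  # bytearray yields ints
freq = [0] * 256
for b in data:
    freq[b] += 1
print(freq.index(max(freq)))  # 32, i.e. the space character, for Macbeth.txt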

1. Macbeth.txt: For the plain text file, we can see the highest peaks at 32 (space), 101 (lowercase e), 116 (lowercase t) and so on, in line with character frequencies in English text. Also, there is nothing above 127 since this is ASCII text.


2. Macbeth.txt.gz: For the compressed file, the spread is smoother and the full range of byte values is seen.


3. Macbeth.txt.enc: The encrypted file shows a distribution that is much closer to uniform.



Next we look at ii), i.e. image representations of the bitstreams, for each of the three files.

1. Macbeth.txt  (n=888)
Here is the top left 1/16th portion of the image representation of the file's bitstream, produced by cropping the image to 1/4th of its width and 1/4th of its height. It is generated using
> convert -size 888x888 mono:Macbeth.txt -crop 222x222+0+0 Macbeth.txt-mono-cropped.png

[Image: Macbeth.txt-mono-cropped.png]

The full image, generated using
> convert -size 888x888 mono:Macbeth.txt Macbeth.txt-mono.png
is available here.

2. Macbeth.txt.gz (n = 568)
Below is the top left 1/16th of the image representation of the file's bitstream, produced using
> convert -size 568x568 mono:Macbeth.txt.gz -crop 142x142+0+0 Macbeth.txt.gz-mono-cropped.png

[Image: Macbeth.txt.gz-mono-cropped.png]


The full image generated using
> convert -size 568x568 mono:Macbeth.txt.gz Macbeth.txt.gz-mono.png
is here.


3. Macbeth.txt.enc (n=888)
The top left 1/16th portion can be produced using
> convert -size 888x888 mono:Macbeth.txt.enc -crop 222x222+0+0 Macbeth.txt.enc-mono-cropped.png

[Image: Macbeth.txt.enc-mono-cropped.png]

And the full image generated using
> convert -size 888x888 mono:Macbeth.txt.enc Macbeth.txt.enc-mono.png
is here.

What do we observe from the images in the above three cases?

When we look at either the cropped images or the full-sized ones for Macbeth.txt and Macbeth.txt.gz, we find that they look very much like scaled versions of one another, with clear patterns of ridges and furrows in the vertical direction. The compression of the file seems to have resulted in a compression of the visual representation as our eyes perceive it, which is indeed somewhat fascinating!

But when we look at the cropped or full-sized images for Macbeth.txt and Macbeth.txt.enc, we clearly see that the encrypted version gets rid of the visual patterns and looks like uniform noise.

As a final activity, we dump pseudo-random bytes from /dev/urandom and plot the byte value frequency distribution and the bitstream images. This would let us see what good quality random data looks like. We would expect it to be pretty similar to what we got for Macbeth.txt.enc.

There are differences between /dev/urandom and /dev/random on Linux (not on Mac OS X) but that is a long story for another day.

The data file is created using
> dd if=/dev/urandom bs=1k count=1k of=urandom_output
This produces 1 MB (1048576 bytes) of random data in urandom_output.
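The same file could also be produced from Python; a minimal sketch:

import os

with open("urandom_output", "wb") as f:
    f.write(os.urandom(1024 * 1024))  # 1048576 pseudo-random bytes from the OS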

The frequency distribution is very close to uniform, in fact even closer to uniform than what we got with Macbeth.txt.enc.


n can be computed to be 2896, and the cropped top left 1/64th section, generated using
> convert -size 2896x2896 mono:urandom_output -crop 362x362+0+0 urandom_output-mono-cropped.png
is shown below.

[Image: urandom_output-mono-cropped.png]


We did not list entropies and other statistical measures of randomness so far. But that can be done too, using the ENT utility, which runs a number of statistical tests on data files at the byte level. More details are available on the ENT page. We will not get into the details here, but from the outputs shown below, it can be seen that these measures are consistent with what we have seen so far.
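As a cross-check on ENT's first measure, the per-byte Shannon entropy can be computed directly; a minimal sketch:

import math

def entropy_bits_per_byte(path):
    data = bytearray(open(path, "rb").read())
    freq = [0] * 256
    for b in data:
        freq[b] += 1
    total = float(len(data))
    # Shannon entropy: -sum(p * log2(p)) over the observed byte values
    return -sum(c / total * math.log(c / total, 2) for c in freq if c)

print(entropy_bits_per_byte("Macbeth.txt"))  # close to ENT's 4.768412 below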

> ent Macbeth.txt
Entropy = 4.768412 bits per byte.

Optimum compression would reduce the size of this 100129 byte file by 40 percent.

Chi square distribution for 100129 samples is 1324655.46, and randomly would exceed this value less than 0.01 percent of the times.

Arithmetic mean value of data bytes is 87.9000 (127.5 = random).
Monte Carlo value for Pi is 4.000000000 (error 27.32 percent).
Serial correlation coefficient is -0.001562 (totally uncorrelated = 0.0).

>  ent Macbeth.txt.gz
Entropy = 7.992967 bits per byte.

Optimum compression would reduce the size of this 41468 byte file by 0 percent.

Chi square distribution for 41468 samples is 405.21, and randomly would exceed this value less than 0.01 percent of the times.

Arithmetic mean value of data bytes is 128.2851 (127.5 = random).
Monte Carlo value for Pi is 3.138764289 (error 0.09 percent).
Serial correlation coefficient is 0.007875 (totally uncorrelated = 0.0).

> ent Macbeth.txt.enc
Entropy = 7.998104 bits per byte.

Optimum compression would reduce the size of this 100160 byte file by 0 percent.

Chi square distribution for 100160 samples is 262.74, and randomly would exceed this value 35.61 percent of the times.

Arithmetic mean value of data bytes is 127.8811 (127.5 = random).
Monte Carlo value for Pi is 3.120349847 (error 0.68 percent).
Serial correlation coefficient is -0.002324 (totally uncorrelated = 0.0).

> ent urandom_output
Entropy = 7.999864 bits per byte.

Optimum compression would reduce the size of this 1048576 byte file by 0 percent.

Chi square distribution for 1048576 samples is 197.19, and randomly would exceed this value 99.71 percent of the times.

Arithmetic mean value of data bytes is 127.5124 (127.5 = random).
Monte Carlo value for Pi is 3.140430986 (error 0.04 percent).
Serial correlation coefficient is -0.000428 (totally uncorrelated = 0.0).

In Part 2, we will try similar things on a simple binary input.

2 comments:

  1. As I've posted on the crypto forums, you are wrong. Here is my post, copied from the forums.

    One big warning: compress-then-encrypt is not CPA secure. If you have an encryption of an unknown message M, you can ask for encryptions of M||b. If the ciphertext is smaller, you know it compressed b, and then you know something about the plaintext. This is something you can use, and it is being used in attacks on TLS/SSL. Compression AFTER encryption is useless, since the output should be indistinguishable from random and thus a compression algorithm will not decrease the ciphertext size on average.

    Proof that combining compression with a CPA-secure encryption does not give a CPA-secure encryption.
    The CPA game:
    The oracle takes a compression function C, an encryption function E, a key K, and an i in {0,1}.
    The attacker asks for the encryption of two distinct messages M0, M1.
    The oracle returns c := E(C(Mi), K).
    The attacker then generates a padding P such that |C(M0||P)| << |C(M1||P)|, e.g. P = M0.
    The attacker then asks for encryptions of M0||P, M1||P.
    The oracle returns the encryption d := E(C(Mi||P), K).
    If |d| >> |C(M0||P)|, then we know i = 1; else we know i = 0. Thus we win the CPA game with probability 1, and so compress-then-encrypt is not CPA-secure.

    TL;DR:
    Compress-then-encrypt is not CPA-secure, and you can design a general attack on compress-then-encrypt which works even if the encryption algorithm is CPA-secure (or even authenticated encryption).

    Replies
    1. Replied in the crypto forum. Copied below.

      Thanks for your post.

      For one thing, compression followed by encryption seems to be widely used in practice, though, as in all practical solutions, there are additional parameters for randomization, dedup, prevention of replay attacks, etc.
      Examples include sending compressed data over https, database storage (http://docs.oracle.com/cd/E16655_01/network.121/e17729/asotrans_gen.htm, "Compression and Data Deduplication of Encrypted Data") and file storage (https://code.google.com/p/s3backer/, section Compression).
      Of course, that does not necessarily make them secure, as we have seen from various examples in the course.

      My posts were primarily about the size, efficiency and randomness of the output of the compression<->encryption sequence. That said, while attempting to answer that question in the problem set, the first thought that flashed briefly was about the security of compress->encrypt vs encrypt->compress.

      I think I was able to find something along the lines of the SSL/TLS attacks that you have referred to (http://blog.astrumfutura.com/2013/08/breach-attacks-extracting-https-encrypted-data-in-under-a-minut...) - which is quite interesting and pretty recent.

      Your proof does look good based on my initial exposure to such proofs in this course. But I am not well-versed enough in this area to judge its formal correctness. Padding M0 with itself looks cool.

      However, I am not sure your second sentence is right: "if you have an encryption of an unknown message M, you can ask for encryptions of M||b." If M is unknown and all you have is the encryption of M, you cannot ask for the encryption of M||b. You need to know M0 and M1, and the way this seems to happen in the above attacks is via 2 and 3 in the section titled Attack Requirements.

      On a different note, there appears to be research that claims to have actually compressed block-cipher-encrypted data significantly, at least for certain types of input, e.g. http://users.ece.gatech.edu/~demi/docs/dcc09paper.pdf
