Decrypting Signal for Android's Backup Files

Signal is one of the better specimens of the 'private messaging app' genre, providing end-to-end encrypted messaging between mobile phones. Unfortunately, Signal does not yet provide a way to bulk-export your messages and media (for example, for archival purposes). The Android app does, however, come with a rudimentary backup and restore facility which produces an encrypted dump of the app's internal state. In this article we'll walk through the process of decrypting that dump.

Whilst the backup format is not documented, Signal is Open Source so the format is fairly easy to work out. Once decrypted and unpacked, the result is a plaintext SQLite database and a collection of media files. In this article we'll just look at decryption; we leave the next step of actually using the data as an exercise for the reader.

You can find a ready-to-use script which ties together all of the steps described in this article on GitHub at mossblaser/signal_for_android_decryption. Indeed, if you're not interested in the inner workings you can stop reading here and jump straight over to that script.

Sources

For the purposes of figuring out the backup format I mostly referred to the Signal Android app's source code as it stood in January 2021. Be warned that it is entirely possible that the backup format will change at some point in the future rendering some or all of this article inaccurate.

Though the Signal app's backup code is spread over a few parts of the codebase, the FullBackupExporter class is a good place to start if you want to start digging for yourself.

Unfortunately it appears that the iOS Signal app's backup function differs from the Android version. As such, this article may be of little use for restoring from those backups.

Obtaining a backup file

To generate a backup, follow the instructions on the Signal support pages.

Be very careful to write down the generated passphrase correctly.

As a warning: in my experience, if you have a large number of photographs and videos within any of your groups, the backup process can take several hours. In addition, expect the generated backup to be somewhat larger than the space consumed by the Signal app itself because the backup file contains a lot of (encrypted, but uncompressed) SQL statements, about one per message.

Backup format: broad overview

The encrypted backup format consists of a series of encrypted 'frames'. Each decrypted frame contains a Protocol Buffer formatted structure which, in general, describes either an SQL statement or a binary blob.

The sequence of SQL statements may be used to initialise and populate an SQLite database containing the internal state of the Signal app. For instance, this database contains all message text and group details. Like the backup format, the schema of the database is not documented, but it is relatively easy to find your way around.

Message attachments (such as photographs and videos) are stored in the binary blobs within the backup file. These large files live outside the SQLite database but are referenced by ID in the database.

Encryption: broad overview

Disclaimer: I do not have a cryptography background and do not really know what I'm talking about...

Each frame in the backup is encrypted using an AES block cipher in counter (CTR) mode.

Additionally, each frame is accompanied by a HMAC-SHA256 (Hash-based Message Authentication Code) which is used to verify the authenticity of the frame's contents (and, indirectly, the correctness of the passphrase). For the uninitiated: a HMAC is similar to a checksum except that it hashes both the data and a secret key. This means that without knowing the secret key you can neither produce nor verify a HMAC. Consequently, if some data matches its HMAC you can be confident that it has not been tampered with.
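
To make the idea concrete, here is a minimal sketch of computing and checking a HMAC-SHA256 using only Python's standard hmac module (the rest of this article uses the cryptography library instead). The function names, key and data are made up purely for illustration.

```python
import hashlib
import hmac


def compute_mac(key: bytes, data: bytes) -> bytes:
    """Return the 32 byte HMAC-SHA256 of data under key."""
    return hmac.new(key, data, hashlib.sha256).digest()


def mac_is_valid(key: bytes, data: bytes, mac: bytes) -> bool:
    """Recompute the MAC and compare it in constant time."""
    return hmac.compare_digest(compute_mac(key, data), mac)


# Made-up example values: a real key would come from key derivation.
example_key = b"an example secret key"
example_data = b"some data to protect"
example_mac = compute_mac(example_key, example_data)

print(mac_is_valid(example_key, example_data, example_mac))         # Correct key and data
print(mac_is_valid(example_key, example_data + b"!", example_mac))  # Tampered data
print(mac_is_valid(b"the wrong key", example_data, example_mac))    # Wrong key
```

Only the first check should succeed: changing either the data or the key changes the MAC.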

Both the AES cipher and HMAC use a secret key to do their jobs. This key is derived from the many-digit passphrase using the HKDFv3 key derivation function alongside a (possibly bespoke) iterated SHA512-based hash.

Backup decryption steps

In the subsections below we'll describe the low-level process of decrypting a backup file. The process is described using snippets of Python 3 code which have the following dependencies:

  • protobuf: The Google Protocol Buffers Python support library.
  • cryptography: A library containing the various cryptographic functions we'll need.

You can install these using pip as follows:

$ pip install protobuf cryptography

In the examples below, we'll assume that the backup file has been opened in binary mode like so:

>>> backup_file = open("signal-0000-00-00-00-00-00.backup", "rb")

And the passphrase has been stored in a string like so (spaces are optional):

>>> passphrase = "00000 00000 00000 00000 00000 00000"

Compiling the protobuf definition

The first step is to obtain the Backups.proto protobuf description defined in the Signal source code.

This can then be compiled ready for use with Python using:

$ protoc -I=. --python_out=. Backups.proto

NB: The -I=. and --python_out=. arguments specify the location where the input *.proto file resides and the output Python modules should be written. In this case we use the working directory.

This will create a Python module in Backups_pb2.py in the working directory. We'll use this to deserialise the decrypted data structures in the backup file. We can import it like so:

>>> from Backups_pb2 import BackupFrame

Unpacking the header

Before we can decrypt the data within the backup we need to determine the 32 byte encryption key and 16 byte initialisation vector for the AES cipher and the 32 byte key used by the HMAC. The backup file begins with a short plaintext header containing the various details we'll need.

Specifically, the first four bytes of the file hold a 32 bit, big-endian integer length field giving the size of the backup header in bytes. We can read this, then read the header and unpack it into the protobuf's 'BackupFrame' type:

>>> import struct

>>> length = struct.unpack(">I", backup_file.read(4))[0]
>>> backup_frame = BackupFrame.FromString(backup_file.read(length))
>>> assert backup_frame.HasField("header")
>>> header = backup_frame.header

>>> initialisation_vector = header.iv
>>> salt = header.salt

The header principally contains the initialisation_vector which is used by the AES cipher to encrypt the backup file and a salt value used during the derivation of the AES and HMAC keys from our passphrase.

Key derivation

Key derivation proceeds in two steps.

In the first step, the salt and the passphrase are combined in an artificially costly repeated hashing operation (to make brute force attacks harder).

>>> from cryptography.hazmat.primitives.hashes import Hash, SHA512

>>> # NB: The passphrase has its spaces removed and is encoded using ASCII
>>> passphrase_bytes = passphrase.replace(" ", "").encode("ascii")

>>> hash = passphrase_bytes
>>> sha512 = Hash(algorithm=SHA512())
>>> sha512.update(salt)
>>> for _ in range(250000):
...     sha512.update(hash)
...     sha512.update(passphrase_bytes)
...     hash = sha512.finalize()
...     sha512 = Hash(algorithm=SHA512())

The first 32 bytes of the resulting hash (in hash) are now fed to the HKDFv3 key derivation function (which the cryptography module simply calls 'HKDF'). HKDFv3 may be used to turn a single key into an arbitrary number of keys. In the case of Signal, it is used to generate a single 64 byte key from which the first 32 bytes are used as the key for the AES cipher and the remaining 32 bytes as the HMAC key:

>>> from cryptography.hazmat.primitives.hashes import SHA256
>>> from cryptography.hazmat.primitives.kdf.hkdf import HKDF

>>> hkdf = HKDF(algorithm=SHA256(), length=64, info=b"Backup Export", salt=b"")
>>> keys = hkdf.derive(hash[:32])
>>> cipher_key = keys[:32]
>>> hmac_key = keys[32:]
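
The two steps above can also be wrapped up in a single function. The sketch below is one way to do so using only Python's standard library (hashlib and hmac) in place of the cryptography module, with a small hand-written HKDF following RFC 5869. The function names and the rounds parameter are illustrative inventions, not anything taken from the Signal source.

```python
import hashlib
import hmac


def stretch_passphrase(passphrase: str, salt: bytes, rounds: int = 250000) -> bytes:
    """Iterated SHA-512 hash of the passphrase, mirroring the loop above."""
    passphrase_bytes = passphrase.replace(" ", "").encode("ascii")
    digest = passphrase_bytes
    sha512 = hashlib.sha512(salt)  # The salt only goes into the first round
    for _ in range(rounds):
        sha512.update(digest)
        sha512.update(passphrase_bytes)
        digest = sha512.digest()
        sha512 = hashlib.sha512()
    return digest[:32]


def hkdf_sha256(key_material: bytes, info: bytes, length: int) -> bytes:
    """HKDF (RFC 5869) using SHA-256 and an all-zero salt.

    An all-zero salt gives the same result as the empty salt used above
    because HMAC pads short keys with zeros.
    """
    pseudo_random_key = hmac.new(bytes(32), key_material, hashlib.sha256).digest()
    output = b""
    block = b""
    counter = 1
    while len(output) < length:
        block = hmac.new(
            pseudo_random_key, block + info + bytes([counter]), hashlib.sha256
        ).digest()
        output += block
        counter += 1
    return output[:length]


def derive_keys(passphrase: str, salt: bytes, rounds: int = 250000):
    """Return (cipher_key, hmac_key), each 32 bytes long."""
    stretched = stretch_passphrase(passphrase, salt, rounds)
    keys = hkdf_sha256(stretched, b"Backup Export", 64)
    return keys[:32], keys[32:]
```

With the header unpacked as described earlier, this would be called as derive_keys(passphrase, salt). The rounds parameter exists only so that the function can be exercised quickly with a small value; real backups need the default of 250000.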

At this point we're ready to start decrypting the backup.

Decrypting a backup frame

All data after the header/BackupFrame (read in an earlier step) is contained in encrypted frames. These frames consist of a (plaintext) 32 bit big-endian length field followed by a corresponding number of bytes. The last 10 bytes of the data blob contain the first 10 bytes of the Message Authentication Code (MAC) computed by the HMAC algorithm for the preceding bytes of ciphertext.

>>> length = struct.unpack(">I", backup_file.read(4))[0]
>>> assert length >= 10
>>> ciphertext = backup_file.read(length - 10)
>>> their_mac = backup_file.read(10)
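
The same read can be packaged as a small helper which also notices a truncated file. This is only a sketch: read_encrypted_frame and MAC_LENGTH are hypothetical names, and the final lines build a fake frame in memory purely to show the layout.

```python
import io
import struct

MAC_LENGTH = 10  # Only the first 10 bytes of the MAC are stored


def read_encrypted_frame(backup_file):
    """Read one length-prefixed frame, returning (ciphertext, their_mac)."""
    length_bytes = backup_file.read(4)
    if len(length_bytes) != 4:
        raise EOFError("Backup file ended before a frame length field")
    (length,) = struct.unpack(">I", length_bytes)
    if length < MAC_LENGTH:
        raise ValueError("Frame is too short to contain a MAC")
    body = backup_file.read(length)
    if len(body) != length:
        raise EOFError("Backup file ended part way through a frame")
    # The ciphertext comes first; the truncated MAC is the last 10 bytes.
    return body[:-MAC_LENGTH], body[-MAC_LENGTH:]


# A made-up frame: 3 bytes of 'ciphertext' followed by a 10 byte 'MAC'.
fake_file = io.BytesIO(struct.pack(">I", 13) + b"abc" + b"0123456789")
ciphertext, their_mac = read_encrypted_frame(fake_file)
```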

Before trying to decrypt the data, we must first verify its authenticity and integrity.

>>> from cryptography.hazmat.primitives.hmac import HMAC
>>> from cryptography.hazmat.primitives.hashes import SHA256

>>> hmac = HMAC(hmac_key, SHA256())
>>> hmac.update(ciphertext)
>>> our_mac = hmac.finalize()

>>> assert their_mac == our_mac[:len(their_mac)], "Bad MAC (wrong password? corrupt?)"

If the MAC we computed does not match the MAC provided in the backup file, one of three things may have happened:

  1. The backup file may have been corrupted
  2. The backup decryption process may have got out of sync with the file (e.g. forgot to read a particular field)
  3. (Most likely) The incorrect passphrase was supplied

In any case, if the MACs don't match, there's nothing more we can do so we should stop at this point.

Assuming the MACs match we can be confident that the ciphertext is intact and we can begin to decrypt it.

>>> from cryptography.hazmat.primitives.ciphers import Cipher
>>> from cryptography.hazmat.primitives.ciphers.algorithms import AES
>>> from cryptography.hazmat.primitives.ciphers.modes import CTR

>>> cipher = Cipher(algorithm=AES(cipher_key), mode=CTR(initialisation_vector))
>>> decryptor = cipher.decryptor()
>>> frame_bytes = decryptor.update(ciphertext) + decryptor.finalize()

Before we start unpacking the decrypted data (frame_bytes) we must increment the initialisation_vector ready to decrypt the next frame. Specifically, we should increment the 32-bit big endian integer encoded in the first 4 bytes of the 16 byte initialisation_vector (the other bytes must be left as-is).

>>> def increment_initialisation_vector(initialisation_vector: bytes) -> bytes:
...     counter = struct.unpack(">I", initialisation_vector[:4])[0]
...     counter = (counter + 1) & 0xFFFFFFFF
...     return struct.pack(">I", counter) + initialisation_vector[4:]

>>> initialisation_vector = increment_initialisation_vector(initialisation_vector)

NB: The initialisation vector is incremented between frames because CTR mode requires each key and initialisation vector pair to be used only once: reusing a pair for two messages would encrypt both with the same keystream, undermining the encryption.
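
As a quick check of the behaviour described above, the following repeats the function and applies it to two made-up initialisation vectors: an ordinary one, and one whose counter is about to wrap around.

```python
import struct


def increment_initialisation_vector(initialisation_vector: bytes) -> bytes:
    """Add one to the 32 bit big-endian counter in the first 4 bytes."""
    counter = struct.unpack(">I", initialisation_vector[:4])[0]
    counter = (counter + 1) & 0xFFFFFFFF
    return struct.pack(">I", counter) + initialisation_vector[4:]


# Made-up 16 byte vectors: a 4 byte counter followed by 12 other bytes.
tail = bytes(range(12))
first = bytes.fromhex("00000001") + tail
last = bytes.fromhex("ffffffff") + tail

# Only the counter changes; the remaining 12 bytes are left as-is.
assert increment_initialisation_vector(first) == bytes.fromhex("00000002") + tail
# The counter wraps around to zero rather than carrying into later bytes.
assert increment_initialisation_vector(last) == bytes.fromhex("00000000") + tail
```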

Processing a decrypted backup frame

Each decrypted backup frame contains a BackupFrame protobuf structure which we can deserialise as follows:

>>> frame = BackupFrame.FromString(frame_bytes)

The type of backup frame is determined by which of the following fields is defined in the frame structure:

  • version: Contains a version number from the database contained in the backup. This corresponds to the SQLite user_version field for the database and is set based on the constants in the SQLCipherOpenHelper class in the Signal source code. If you intend on building any kind of robust queries against the database, you may wish to check this version and apply all of the migration steps enumerated in that class to get a consistent view.
  • statement: Contains a single SQLite statement to execute as part of initialising and populating a blank SQLite database.
  • preference: A couple of these currently appear in each backup and appear to contain a public and private key. See the IdentityKeyUtil class. These are presumably used for some part of the Signal protocol.
  • attachment, sticker and avatar frames contain data blobs for files stored outside of the database. attachment frames contain the attached pictures and videos sent with messages. sticker frames contain 'sticker' graphics sent or received in messages. avatar frames contain user and group avatars. In all three cases, the frame is followed immediately in the backup file by a blob of encrypted data containing the file contents associated with that frame. The length of this blob is defined by a 'length' field within the frame.
  • end: The final frame in the backup file just contains an end field.

To check which kind of frame we have, we use a chain of protobuf HasField method calls, e.g.:

>>> if frame.HasField("version"):
...     version_frame = frame.version
...     # TODO: Deal with the version_frame...
>>> if frame.HasField("statement"):
...     statement_frame = frame.statement
...     # TODO: Deal with the statement_frame...
>>> # ...
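
Since every frame type is tested in the same way, the chain can also be written as a loop. The following is a minimal sketch: frame_kind and FRAME_KINDS are hypothetical names introduced for this example, the field names are the ones listed above, and other versions of Backups.proto may well define more.

```python
# Field names as described in this article; treat the list as an assumption
# and check it against the Backups.proto you actually compiled.
FRAME_KINDS = (
    "header",
    "version",
    "statement",
    "preference",
    "attachment",
    "sticker",
    "avatar",
    "end",
)


def frame_kind(frame):
    """Return the name of the first populated field in frame, or None.

    frame may be any object with a protobuf-style HasField(name) method,
    such as a deserialised BackupFrame.
    """
    for name in FRAME_KINDS:
        if frame.HasField(name):
            return name
    return None
```

A frame for which this returns None contains a field this list does not know about, which may be a hint that the backup format has changed.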

statement frames

A statement frame contains a string containing an SQLite statement and a series of parameters which should be substituted into this statement when executed against the database.

The following caveats apply when unpacking statements:

Firstly, integer parameters are represented as unsigned 64 bit integers in the protobuf but SQLite expects signed 64 bit integers. As such a manual conversion is required, e.g.:

>>> if i & (1 << 63):
...     i |= -1 << 63
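
The same conversion can be written as a small function. This is just a sketch (to_signed64 is a made-up name) which subtracts 2^64 whenever the top bit is set, giving the same result as the bitwise version above.

```python
def to_signed64(value: int) -> int:
    """Reinterpret an unsigned 64 bit integer as a signed one."""
    if not 0 <= value < (1 << 64):
        raise ValueError("Expected an unsigned 64 bit integer")
    if value & (1 << 63):
        # Top bit set: the value is negative in two's complement
        value -= 1 << 64
    return value


assert to_signed64(5) == 5
assert to_signed64((1 << 64) - 1) == -1
assert to_signed64(1 << 63) == -(1 << 63)
```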

Secondly, create table statements are included for internal SQLite tables (named sqlite_*). These must not be executed when extracting the backup.

Finally, Signal backups may contain statements relating to tables named sms_fts_* and mms_fts_* (used to implement full text search in the app). These are omitted during restoration by the Signal app and it's likely you will want to do the same.
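
The last two caveats might be handled along the following lines. This is a rough sketch rather than a copy of what the Signal app does: should_skip_statement and apply_statement are hypothetical names, the string matching is a simple guess at a sufficient filter, and it assumes the statement uses '?' placeholders and that its parameters have already been unpacked into ordinary Python values.

```python
import sqlite3


def should_skip_statement(sql: str) -> bool:
    """Guess whether a statement should be left out of the restored database."""
    lowered = sql.lower()
    return (
        lowered.startswith("create table sqlite_")  # Internal SQLite tables
        or "sms_fts_" in lowered  # Full text search tables
        or "mms_fts_" in lowered
    )


def apply_statement(db: sqlite3.Connection, sql: str, parameters=()) -> bool:
    """Execute one statement against db, returning False if it was skipped."""
    if should_skip_statement(sql):
        return False
    db.execute(sql, parameters)
    return True


# Made-up statements standing in for the contents of statement frames.
db = sqlite3.connect(":memory:")
apply_statement(db, "CREATE TABLE sms (_id INTEGER, body TEXT)")
apply_statement(db, "INSERT INTO sms VALUES (?, ?)", [1, "Hello"])
apply_statement(db, "CREATE TABLE sqlite_sequence(name,seq)")  # Skipped
```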

attachment, sticker and avatar frames

These frames are followed by a blob of encrypted data containing the contents of the relevant attachment, sticker or avatar. The size of this blob is encoded by the length field in the frame. The frame also includes an identifier used within the SQLite database to refer to a particular attachment, sticker or avatar. These IDs are given in the following fields:

  • For attachment: attachment_frame.attachmentId
  • For sticker: sticker_frame.rowId
  • For avatar: avatar_frame.recipientId

NB: No information is given on the type of data included. This information may be later queried from the database or inferred from the file contents.

Decryption logically follows a similar process to the decryption of a frame. Like frames, a 10 byte MAC is stored after the data. Unlike frames, the length field does not include the 10 bytes which hold the MAC. Additionally, the MAC is computed on the concatenation of the initialisation vector and ciphertext. (I am unsure why this is done in this case.)

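>>> # The blob's size comes from the length field of the frame just decoded,
>>> # e.g. for an attachment (likewise frame.sticker or frame.avatar):
>>> length = frame.attachment.length

>>> # Warning: Blobs can be quite large so you may end up using up all your
>>> # RAM reading the whole blob in one go. A practical implementation
>>> # could instead read, decrypt/verify and write the blob in chunks of,
>>> # e.g., a few KB at a time.
>>> ciphertext = backup_file.read(length)
>>> their_mac = backup_file.read(10)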

>>> hmac = HMAC(hmac_key, SHA256())
>>> hmac.update(initialisation_vector)
>>> hmac.update(ciphertext)
>>> our_mac = hmac.finalize()

>>> assert their_mac == our_mac[:len(their_mac)], "Bad MAC (wrong password? corrupt?)"

>>> cipher = Cipher(algorithm=AES(cipher_key), mode=CTR(initialisation_vector))
>>> decryptor = cipher.decryptor()
>>> blob = decryptor.update(ciphertext) + decryptor.finalize()

>>> initialisation_vector = increment_initialisation_vector(initialisation_vector)

Here, the decrypted attachment, sticker or avatar is contained in blob and can be written out to disk (e.g. using the relevant ID as part of its name).
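
For large blobs, the chunked approach suggested in the warning above might look something like the sketch below. The copy_blob function and its arguments are hypothetical. To keep the example self-contained it uses the standard hmac module for the MAC and takes the decryptor as an argument: any object with update and finalize methods will do, such as the one returned by cipher.decryptor().

```python
import hashlib
import hmac

MAC_LENGTH = 10  # Only the first 10 bytes of the MAC are stored


def copy_blob(backup_file, out_file, length, hmac_key, initialisation_vector,
              decryptor, chunk_size=8192):
    """Decrypt length bytes from backup_file into out_file, chunk by chunk.

    Returns True if the MAC stored after the blob matches. The MAC covers
    the initialisation vector followed by the ciphertext.
    """
    mac = hmac.new(hmac_key, initialisation_vector, hashlib.sha256)
    remaining = length
    while remaining > 0:
        chunk = backup_file.read(min(chunk_size, remaining))
        if not chunk:
            raise EOFError("Backup file ended part way through a blob")
        mac.update(chunk)
        out_file.write(decryptor.update(chunk))
        remaining -= len(chunk)
    out_file.write(decryptor.finalize())
    their_mac = backup_file.read(MAC_LENGTH)
    return hmac.compare_digest(their_mac, mac.digest()[:MAC_LENGTH])
```

Note that this writes out decrypted data before the MAC has been checked, so the output should be discarded if the function returns False. The caller is still responsible for incrementing the initialisation vector afterwards.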

Complete script

For a complete script which brings together all of the steps described above, head over to mossblaser/signal_for_android_decryption on GitHub.

Using a decrypted backup

Once decrypted, you will be left with a SQLite database and a collection of attachment, sticker and avatar files. Turning these into, for example, a readable chat log or collections of media files with sensible filenames is left as a reverse-engineering exercise for the reader. Some hints are provided below to help get you started, however.

If your file browser is smart enough you can probably directly open most attachment files and they'll open in your image viewer or media player if they contain pictures or video. Alternatively you could use the file command to guess what kind of file each is and give it a more appropriate file extension.

More robustly, the original mime type of attachments can be found in the part table in the ct column. Attachment IDs may be found in the unique_id column. The caption column contains caption text associated with the attachment. The mid column is a foreign key pointing to entries in the mms table containing the message this attachment was sent in.
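
As a starting point, the columns described above could be queried as in the following sketch. The list_attachments function and the suggested file naming are made up for illustration, and the table and column names are simply those mentioned in this article, so they may not match other versions of the database.

```python
import mimetypes
import sqlite3


def list_attachments(db: sqlite3.Connection):
    """Yield a dictionary describing each attachment in the part table."""
    rows = db.execute("SELECT unique_id, ct, caption, mid FROM part")
    for unique_id, mime_type, caption, message_id in rows:
        # Guess a file extension from the mime type, falling back on '.bin'
        extension = mimetypes.guess_extension(mime_type or "") or ".bin"
        yield {
            "attachment_id": unique_id,
            "mime_type": mime_type,
            "caption": caption,
            "message_id": message_id,
            "suggested_filename": f"{unique_id}{extension}",
        }
```

Given a connection to the restored database (e.g. from sqlite3.connect), iterating over list_attachments(db) produces one entry per attachment which can be matched up with the extracted attachment files by ID.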

Message data appears to be held in the sms and mms tables, though I'm not clear what the distinction between these is.

Groups are enumerated in the groups table (with group names in the title column). These can be related back to messages in sms and mms via the recipient and thread tables.