Decrypting Signal for Android's Backup Files
Signal is one of the better specimens of the 'private messaging app' genre, providing end-to-end encrypted messaging between mobile phones. Unfortunately, Signal does not yet provide a way to bulk-export your messages and media (for example, for archival purposes). The Android app does, however, come with a rudimentary backup and restore facility which produces an encrypted dump of the app's internal state. In this article we'll walk through the process of decrypting that dump.
Whilst the backup format is not documented, because Signal is Open Source, it is fairly easy to work out. Once decrypted and unpacked, the result is a plaintext SQLite database and a collection of media files. In this article we'll just look at decryption; we leave the next step of actually using the data as an exercise for the reader.
You can find a ready-to-use script which ties together all of the steps described in this article on GitHub at mossblaser/signal_for_android_decryption. Indeed, if you're not interested in the inner workings you can stop reading here and jump straight over to that script.
Sources
For the purposes of figuring out the backup format I mostly referred to the Signal Android app's source code as it stood in January 2021. Be warned that it is entirely possible that the backup format will change at some point in the future rendering some or all of this article inaccurate.
Though the Signal app's backup code is spread over a few parts of the codebase, the FullBackupExporter class is a good place to start if you want to dig in for yourself.
Unfortunately it appears that the iOS Signal app's backup function differs from the Android version. As such, this article may be of little use for restoring from those backups.
Obtaining a backup file
To generate a backup, follow the instructions on the Signal support pages.
Be very careful to write down the generated passphrase correctly.
As a warning: in my experience, if you have a large number of photographs and videos within any of your groups, the backup process can take several hours. In addition, expect the generated backup to be somewhat larger than the space consumed by the Signal app itself because the backup file contains a lot of (encrypted, but uncompressed) SQL statements, about one per message.
Backup format: broad overview
The encrypted backup format consists of a series of encrypted 'frames'. Each decrypted frame contains a Protocol Buffer formatted structure which, in general, describes either an SQL statement or a binary blob.
The sequence of SQL statements may be used to initialise and populate an SQLite database containing the internal state of the Signal app. For instance, this database contains all message text and group details. Like the backup format, the schema of the database is not documented, but it is relatively easy to find your way around.
Message attachments (such as photographs and videos) are stored in the binary blobs within the backup file. These large files live outside the SQLite database but are referenced by ID in the database.
Encryption: broad overview
Disclaimer: I do not have a cryptography background and do not really know what I'm talking about...
Each frame in the backup is encrypted using an AES block cipher in counter (CTR) mode.
Additionally, each frame is accompanied by a HMAC-SHA256 (Hash-based Message Authentication Code) which is used to verify the authenticity of the frame's contents (and, indirectly, the correctness of the passphrase). For the uninitiated: a HMAC is similar to a checksum except that it hashes both the data and a secret key. This means that without knowing the secret key you cannot produce (nor verify) a HMAC. Because of this, if some data matches its HMAC you can be confident that it has not been tampered with because without knowing the secret key, a valid HMAC cannot be produced.
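As a small illustration of the HMAC behaviour described above, the following sketch uses Python's standard hmac module. The key and message are made-up values chosen purely for the example and have nothing to do with Signal's real keys.

```python
import hashlib
import hmac

# Hypothetical key and message, purely for illustration.
secret_key = b"an example secret key"
message = b"some frame contents"

# Compute the HMAC-SHA256 of the message under the secret key.
mac = hmac.new(secret_key, message, hashlib.sha256).digest()

# Recomputing with the same key and data gives the same MAC...
same_mac = hmac.new(secret_key, message, hashlib.sha256).digest()

# ...but changing either the data or the key gives a different one.
tampered_mac = hmac.new(secret_key, message + b"!", hashlib.sha256).digest()
wrong_key_mac = hmac.new(b"a different key", message, hashlib.sha256).digest()
```

Someone who modifies the data without knowing the key therefore has no practical way to produce the matching MAC.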
Both the AES cipher and HMAC use a secret key to do their jobs. This key is derived from the many-digit passphrase using the HKDFv3 key derivation function alongside a (possibly bespoke) iterated SHA512-based hash.
Backup decryption steps
In the subsections below we'll describe the low-level process of decrypting a backup file. The process is described using snippets of Python 3 code which have the following dependencies:
- protobuf: The Google Protocol Buffers Python support library.
- cryptography: A library containing the various cryptographic functions we'll need.
You can install these using pip as follows:
$ pip install protobuf cryptography
In the examples below, we'll assume that the backup file has been opened in binary mode like so:
>>> backup_file = open("signal-0000-00-00-00-00-00.backup", "rb")
And the passphrase has been stored in a string like so (spaces are optional):
>>> passphrase = "00000 00000 00000 00000 00000 00000"
Compiling the protobuf definition
The first step is to obtain the Backups.proto protobuf description defined in the Signal source code.
This can then be compiled ready for use with Python using:
$ protoc -I=. --python_out=. Backups.proto
NB: The -I=. and --python_out=. arguments specify the location where the input *.proto file resides and where the output Python modules should be written. In this case we use the working directory.
This will create a Python module, Backups_pb2.py, in the working directory.
We'll use this to deserialise the decrypted data structures in the backup file.
We can import it like so:
>>> from Backups_pb2 import BackupFrame
Unpacking the header
Before we can decrypt the data within the backup we need to determine the 32 byte encryption key and 16 byte initialisation vector for the AES cipher and the 32 byte key used by the HMAC. The backup file begins with a short plaintext header containing the various details we'll need.
The file starts with a 32 bit, big-endian integer length field which gives the size of the backup header in bytes. We can read that many bytes and unpack them into the protobuf's 'BackupFrame' type:
>>> import struct
>>> length = struct.unpack(">I", backup_file.read(4))[0]
>>> backup_frame = BackupFrame.FromString(backup_file.read(length))
>>> assert backup_frame.HasField("header")
>>> header = backup_frame.header
>>> initialisation_vector = header.iv
>>> salt = header.salt
The header principally contains the initialisation_vector which is used by the AES cipher to encrypt the backup file and a salt value used during the derivation of the AES and HMAC keys from our passphrase.
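The length-prefixed read used for the header can be wrapped in a small helper which also notices truncated files. This is only a sketch: the names read_exactly and read_length_prefixed are hypothetical, and an in-memory file with made-up contents stands in for a real backup.

```python
import io
import struct


def read_exactly(f, size: int) -> bytes:
    """Read exactly `size` bytes from `f` or raise EOFError."""
    data = f.read(size)
    if len(data) != size:
        raise EOFError(f"Expected {size} bytes but only got {len(data)}")
    return data


def read_length_prefixed(f) -> bytes:
    """Read a 32 bit big-endian length followed by that many bytes."""
    (length,) = struct.unpack(">I", read_exactly(f, 4))
    return read_exactly(f, length)


# Usage with an in-memory stand-in for a backup file (made-up contents).
example_file = io.BytesIO(struct.pack(">I", 5) + b"hello" + b"rest of file")
header_bytes = read_length_prefixed(example_file)
```

With a real backup, the returned bytes would be passed to BackupFrame.FromString as shown earlier.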
Key derivation
Key derivation proceeds in two steps.
In the first step, the salt and the passphrase are combined in an artificially costly repeated hashing operation (to make brute force attacks harder).
>>> from cryptography.hazmat.primitives.hashes import Hash, SHA512
>>> # NB: The passphrase has all whitespace removed and is encoded using ASCII
>>> passphrase_bytes = passphrase.replace(" ", "").encode("ascii")
>>> hash = passphrase_bytes
>>> sha512 = Hash(algorithm=SHA512())
>>> sha512.update(salt)
>>> for _ in range(250000):
...     sha512.update(hash)
...     sha512.update(passphrase_bytes)
...     hash = sha512.finalize()
...     sha512 = Hash(algorithm=SHA512())
The first 32 bytes of the resulting hash (in hash) are now fed to the HKDFv3 key derivation function (which the cryptography module simply calls 'HKDF'). HKDFv3 may be used to turn a single key into an arbitrary number of keys. In the case of Signal, it is used to generate a single 64 byte key from which the first 32 bytes are used as the key for the AES cipher and the remaining 32 bytes as the HMAC key:
>>> from cryptography.hazmat.primitives.hashes import SHA256
>>> from cryptography.hazmat.primitives.kdf.hkdf import HKDF
>>> hkdf = HKDF(algorithm=SHA256(), length=64, info=b"Backup Export", salt=b"")
>>> keys = hkdf.derive(hash[:32])
>>> cipher_key = keys[:32]
>>> hmac_key = keys[32:]
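For reference, both key derivation steps can also be sketched using only Python's standard library. This is an illustrative re-implementation rather than the code used by Signal or by the script on GitHub. The function names are hypothetical and the HKDF follows my reading of RFC 5869, treating the empty salt as a hash-length run of zero bytes. The iterations parameter exists only so that the sketch can be tried out quickly with a smaller count.

```python
import hashlib
import hmac


def stretch_passphrase(passphrase: str, salt: bytes, iterations: int = 250000) -> bytes:
    """Iterated SHA-512 as in the first step above; returns the first 32 bytes."""
    passphrase_bytes = passphrase.replace(" ", "").encode("ascii")
    digest = passphrase_bytes
    sha512 = hashlib.sha512(salt)  # The salt is only included in the first round
    for _ in range(iterations):
        sha512.update(digest)
        sha512.update(passphrase_bytes)
        digest = sha512.digest()
        sha512 = hashlib.sha512()
    return digest[:32]


def hkdf_sha256(key_material: bytes, info: bytes, length: int) -> bytes:
    """HKDF (RFC 5869) using SHA-256 and an all-zero salt."""
    # Extract step: derive a pseudorandom key from the input key material
    prk = hmac.new(b"\x00" * 32, key_material, hashlib.sha256).digest()
    # Expand step: chain HMAC blocks until enough output has been produced
    output = b""
    block = b""
    counter = 1
    while len(output) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        output += block
        counter += 1
    return output[:length]


def derive_keys(passphrase: str, salt: bytes, iterations: int = 250000):
    """Return a (cipher_key, hmac_key) pair of 32 byte keys."""
    stretched = stretch_passphrase(passphrase, salt, iterations)
    keys = hkdf_sha256(stretched, b"Backup Export", 64)
    return keys[:32], keys[32:]
```

Calling derive_keys(passphrase, salt) should then yield the same two keys as the snippets above, assuming the sketch matches the library's behaviour.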
At this point we're ready to start decrypting the backup.
Decrypting a backup frame
All data after the header/BackupFrame (read in an earlier step) is contained in encrypted frames. These frames consist of a 32 bit big-endian length field followed by a corresponding number of bytes. (NB: Prior to v6.26.0, this length field was plaintext but since then, lengths are encrypted. This article reflects the older format but see the completed program on GitHub for a version supporting the latest format.) The last 10 bytes of each frame contain the first 10 bytes of the Message Authentication Code (MAC) computed by the HMAC algorithm over the preceding bytes of ciphertext.
>>> length = struct.unpack(">I", backup_file.read(4))[0]
>>> assert length >= 10
>>> ciphertext = backup_file.read(length - 10)
>>> their_mac = backup_file.read(10)
Before trying to decrypt the data, we must first verify its authenticity and integrity.
>>> from cryptography.hazmat.primitives.hmac import HMAC
>>> from cryptography.hazmat.primitives.hashes import SHA256
>>> hmac = HMAC(hmac_key, SHA256())
>>> hmac.update(ciphertext)
>>> our_mac = hmac.finalize()
>>> assert their_mac == our_mac[:len(their_mac)], "Bad MAC (wrong password? corrupt?)"
If the MAC we computed does not match the MAC provided in the backup file, one of three things may have happened:
- The backup file may have been corrupted
- The backup decryption process may have got out of sync with the file (e.g. forgot to read a particular field)
- (Most likely) The incorrect passphrase was supplied
In any case, if the MACs don't match, there's nothing more we can do so we should stop at this point.
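In a standalone script it may be preferable to raise an exception rather than rely on assert, since assertions are skipped when Python is run with the -O flag. The sketch below uses the standard library's hmac module. The BadMacError class and verify_truncated_mac function are hypothetical names, not part of any library.

```python
import hashlib
import hmac


class BadMacError(Exception):
    """Hypothetical exception raised when a MAC does not match."""


def verify_truncated_mac(hmac_key: bytes, data: bytes, their_mac: bytes) -> None:
    """Raise BadMacError unless their_mac is a prefix of HMAC-SHA256(data)."""
    our_mac = hmac.new(hmac_key, data, hashlib.sha256).digest()
    # compare_digest avoids stopping early at the first differing byte
    if not hmac.compare_digest(their_mac, our_mac[: len(their_mac)]):
        raise BadMacError("Bad MAC (wrong passphrase? corrupt file?)")
```

For a frame, this would be called as verify_truncated_mac(hmac_key, ciphertext, their_mac) before any decryption is attempted.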
Assuming the MACs match we can be confident that the ciphertext is intact and we can begin to decrypt it.
>>> from cryptography.hazmat.primitives.ciphers import Cipher
>>> from cryptography.hazmat.primitives.ciphers.algorithms import AES
>>> from cryptography.hazmat.primitives.ciphers.modes import CTR
>>> cipher = Cipher(algorithm=AES(cipher_key), mode=CTR(initialisation_vector))
>>> decryptor = cipher.decryptor()
>>> frame_bytes = decryptor.update(ciphertext) + decryptor.finalize()
Before we start unpacking the decrypted data (frame_bytes) we must increment the initialisation_vector ready to decrypt the next frame. Specifically, we should increment the 32-bit big endian integer encoded in the first 4 bytes of the 16 byte initialisation_vector (the other bytes must be left as-is).
>>> def increment_initialisation_vector(initialisation_vector: bytes) -> bytes:
...     counter = struct.unpack(">I", initialisation_vector[:4])[0]
...     counter = (counter + 1) & 0xFFFFFFFF
...     return struct.pack(">I", counter) + initialisation_vector[4:]
>>> initialisation_vector = increment_initialisation_vector(initialisation_vector)
NB: The initialisation vector is incremented between frames because AES in CTR mode requires each key and initialisation vector pair to be used for only one message. Reusing a pair for two messages would make the resulting ciphertexts insecure.
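Why a key and initialisation vector pair must not be reused can be illustrated with a toy example. In CTR mode the cipher produces a keystream which is XORed with the plaintext, so reusing the pair reuses the keystream. The sketch below uses a SHA-256 based stand-in for the keystream instead of real AES, and all keys and messages are made up. It is only meant to show the principle.

```python
import hashlib


def toy_keystream(key: bytes, iv: bytes, length: int) -> bytes:
    """A toy stand-in for a CTR mode keystream (NOT real AES, illustration only)."""
    stream = b""
    block_number = 0
    while len(stream) < length:
        stream += hashlib.sha256(key + iv + block_number.to_bytes(4, "big")).digest()
        block_number += 1
    return stream[:length]


def xor(a: bytes, b: bytes) -> bytes:
    """XOR two byte strings together (truncating to the shorter of the two)."""
    return bytes(x ^ y for x, y in zip(a, b))


key = b"k" * 32
iv = b"\x00" * 16
plaintext_1 = b"first secret msg"
plaintext_2 = b"other secret msg"

# Encrypting two messages with the same key and IV reuses the keystream...
ciphertext_1 = xor(plaintext_1, toy_keystream(key, iv, len(plaintext_1)))
ciphertext_2 = xor(plaintext_2, toy_keystream(key, iv, len(plaintext_2)))

# ...so the keystream cancels out when the two ciphertexts are XORed together,
# revealing the XOR of the two plaintexts without any knowledge of the key.
leaked = xor(ciphertext_1, ciphertext_2)
```

Incrementing the initialisation vector after every frame and blob avoids this situation.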
Processing a decrypted backup frame
Each decrypted backup frame contains a BackupFrame protobuf structure which we can deserialise as follows:
>>> frame = BackupFrame.FromString(frame_bytes)
The type of backup frame is determined by which of the following fields is defined in the frame structure:
- version: Contains a version number from the database contained in the backup. This corresponds to the SQLite user_version field for the database and is set based on the constants in the SQLCipherOpenHelper class in the Signal source code. If you intend on building any kind of robust queries against the database, you may wish to check this version and apply all of the migration steps enumerated in that class to get a consistent view.
- statement: Contains a single SQLite statement to execute as part of initialising and populating a blank SQLite database.
- preference: A couple of these currently appear in each backup and appear to contain a public and private key. See the IdentityKeyUtil class. These are presumably used for some part of the Signal protocol.
- attachment, sticker and avatar: These frames contain data blobs for files stored outside of the database. attachment frames contain the attached pictures and videos sent with messages. sticker frames contain 'sticker' graphics sent or received in messages. avatar frames contain user and group avatars. In all three cases, the frame is followed immediately in the backup file by a blob of encrypted data containing the file contents associated with that frame. The length of this blob is defined by a 'length' field within the frame.
- end: The final frame in the backup file just contains an end field.
To check which kind of frame we have, we use a chain of protobuf HasField method calls, e.g.:
>>> if frame.HasField("version"):
...     version_frame = frame.version
...     # TODO: Deal with the version_frame...
>>> if frame.HasField("statement"):
...     statement_frame = frame.statement
...     # TODO: Deal with the statement_frame...
>>> # ...
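The chain of HasField checks can be condensed into a loop over the field names listed above. In this sketch, frame_kind and FRAME_KINDS are hypothetical names, and FakeFrame is a stand-in used only so that the example runs without the compiled protobuf module. With a real backup you would pass in a decoded BackupFrame instead.

```python
# Field names taken from the list of frame types described above.
FRAME_KINDS = (
    "header", "version", "statement", "preference",
    "attachment", "sticker", "avatar", "end",
)


def frame_kind(frame) -> str:
    """Return the name of the first known field which is set on the frame."""
    for name in FRAME_KINDS:
        if frame.HasField(name):
            return name
    raise ValueError("Unrecognised backup frame type")


class FakeFrame:
    """Hypothetical stand-in for a BackupFrame, only for this demonstration."""

    def __init__(self, field_name):
        self._field_name = field_name

    def HasField(self, name):
        return name == self._field_name


kind = frame_kind(FakeFrame("statement"))
```

Raising an error for unknown frames means that any frame type not covered by this article is noticed rather than silently skipped.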
statement frames
A statement frame contains a string containing an SQLite statement and a series of parameters which should be substituted into this statement when executed against the database.
The following caveats apply when unpacking statements:
Firstly, integer parameters are represented as unsigned 64 bit integers in the protobuf but SQLite expects signed 64 bit integers. As such a manual conversion is required, e.g.:
>>> if i & (1 << 63):
...     i |= -1 << 63
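Wrapped up as a function (to_signed_64 is a hypothetical name), the same conversion can be checked against a few boundary values:

```python
def to_signed_64(i: int) -> int:
    """Reinterpret an unsigned 64 bit integer as a signed (two's complement) one."""
    if i & (1 << 63):
        # The top bit is set, so the value is negative: extend the sign bit
        i |= -1 << 63
    return i


examples = {
    0: to_signed_64(0),
    2**63 - 1: to_signed_64(2**63 - 1),  # Largest positive value, unchanged
    2**63: to_signed_64(2**63),          # Becomes the most negative value
    2**64 - 1: to_signed_64(2**64 - 1),  # Becomes -1
}
```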
Secondly, create table statements are included for internal SQLite tables (named sqlite_*). These must not be executed when extracting the backup.
Finally, the Signal backups may contain statements relating to tables named sms_fts_* and mms_fts_* (used to implement full text search in the app). These are omitted during restoration by the Signal app and it's likely you will want to do the same.
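The filtering described above might be sketched as follows using Python's built-in sqlite3 module. The string matching is a simple heuristic of my own rather than a copy of Signal's logic, and the function names are hypothetical. The sketch also assumes that statement parameters have already been converted into plain Python values and that the statements use placeholders which sqlite3 understands.

```python
import sqlite3


def should_skip_statement(statement: str) -> bool:
    """Heuristic filter for statements which should not be executed."""
    lowered = statement.strip().lower()
    return (
        lowered.startswith("create table sqlite_")
        or "sms_fts_" in lowered
        or "mms_fts_" in lowered
    )


def execute_statement(db: sqlite3.Connection, statement: str, parameters=()) -> bool:
    """Execute a statement unless it is filtered out; returns True if executed."""
    if should_skip_statement(statement):
        return False
    db.execute(statement, parameters)
    return True


# Demonstration using made-up statements and an in-memory database.
db = sqlite3.connect(":memory:")
execute_statement(db, "CREATE TABLE example (id INTEGER, body TEXT)")
execute_statement(db, "INSERT INTO example VALUES (?, ?)", (1, "hello"))
skipped = not execute_statement(db, "CREATE TABLE sqlite_sequence(name,seq)")
rows = db.execute("SELECT id, body FROM example").fetchall()
```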
attachment, sticker and avatar frames
These frames are followed by a blob of encrypted data containing the contents of the relevant attachment, sticker or avatar. The size of this blob is encoded by the length field in the frame. The frame also includes an identifier used within the SQLite database to refer to a particular attachment, sticker or avatar. These IDs are given in the following fields:
- For attachment: attachment_frame.attachmentId
- For sticker: sticker_frame.rowId
- For avatar: avatar_frame.recipientId
NB: No information is given on the type of data included. This information may be later queried from the database or inferred from the file contents.
Decryption logically follows a similar process to the decryption of a frame. Like frames, a 10 byte MAC is stored after the data. Unlike frames, the length field does not include the 10 bytes which hold the MAC. Additionally, the MAC is computed on the concatenation of the initialisation vector and ciphertext. (I am unsure why this is done in this case.)
>>> # Warning: Blobs can be quite large so you may end up using up all your
>>> # RAM reading the whole blob in one go. A practical implementation
>>> # could instead read, decrypt/verify and write the blob in chunks of,
>>> # e.g., a few KB at a time.
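>>> # NB: Here `length` is the value of the length field in the attachment,
>>> # sticker or avatar frame (not the length of the frame itself).
>>> ciphertext = backup_file.read(length)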
>>> their_mac = backup_file.read(10)
>>> hmac = HMAC(hmac_key, SHA256())
>>> hmac.update(initialisation_vector)
>>> hmac.update(ciphertext)
>>> our_mac = hmac.finalize()
>>> assert their_mac == our_mac[:len(their_mac)], "Bad MAC (wrong password? corrupt?)"
>>> cipher = Cipher(algorithm=AES(cipher_key), mode=CTR(initialisation_vector))
>>> decryptor = cipher.decryptor()
>>> blob = decryptor.update(ciphertext) + decryptor.finalize()
>>> initialisation_vector = increment_initialisation_vector(initialisation_vector)
Here, the decrypted attachment, sticker or avatar is contained in blob and can be written out to disk (e.g. using the relevant ID as part of its name).
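To avoid holding large blobs in memory, the data can be read in chunks and the MAC updated incrementally, as the warning in the snippet above suggests. The sketch below shows the chunked reading and MAC calculation using the standard library's hmac module. The iter_chunks helper is a hypothetical name and the key, initialisation vector and file contents are made up. As far as I'm aware, the decryptor objects from the cryptography module can be fed chunks through repeated update() calls in the same way.

```python
import hashlib
import hmac
import io


def iter_chunks(f, length: int, chunk_size: int = 8192):
    """Yield `length` bytes from `f` in pieces of at most `chunk_size` bytes."""
    remaining = length
    while remaining > 0:
        chunk = f.read(min(chunk_size, remaining))
        if not chunk:
            raise EOFError("Backup file ended part way through a blob")
        remaining -= len(chunk)
        yield chunk


# Demonstration with made-up values standing in for the real file and keys.
example_key = b"\x01" * 32
example_iv = b"\x02" * 16
example_ciphertext = bytes(range(256)) * 100
example_file = io.BytesIO(example_ciphertext + b"trailing data")

# For blobs, the MAC covers the initialisation vector followed by the ciphertext
mac = hmac.new(example_key, example_iv, hashlib.sha256)
chunk_count = 0
for chunk in iter_chunks(example_file, len(example_ciphertext), chunk_size=4096):
    mac.update(chunk)  # A real implementation would also decrypt and write here
    chunk_count += 1
chunked_mac = mac.digest()

# The incremental MAC matches the one computed over the whole blob in one go.
one_shot_mac = hmac.new(
    example_key, example_iv + example_ciphertext, hashlib.sha256
).digest()
```

One consequence of this approach is that decrypted data is written before the MAC has been checked, so the output file should be discarded if the MAC turns out not to match.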
Complete script
For a complete script which brings together all of the steps described above, head over to GitHub: mossblaser/signal_for_android_decryption.
Using a decrypted backup
Once decrypted, you will be left with an SQLite database and a collection of attachment, sticker and avatar files. Turning these into, for example, a readable chat log or collections of media files with sensible filenames is left as a reverse-engineering exercise for the reader. Some hints are provided below to help get you started, however.
If your file browser is smart enough you can probably directly open most attachment files and they'll open in your image viewer or media player if they contain pictures or video. Alternatively you could use the file command to guess what kind of file each is and give it a more appropriate file extension.
More robustly, the original mime type of attachments can be found in the part table in the ct column. Attachment IDs may be found in the unique_id column. The caption column contains caption text associated with the attachment. The mid column is a foreign key pointing to entries in the mms table containing the message this attachment was sent in.
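As a rough sketch of how the part table might be used to choose file extensions, the example below builds a made-up, minimal stand-in for that table containing only the columns mentioned above (the real table has many more). It then maps each mime type to an extension using Python's standard mimetypes module. The filename scheme is hypothetical.

```python
import mimetypes
import sqlite3

# A made-up stand-in for the decrypted database's part table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE part (unique_id INTEGER, ct TEXT, caption TEXT, mid INTEGER)")
db.execute("INSERT INTO part VALUES (1234, 'image/png', NULL, 1)")
db.execute("INSERT INTO part VALUES (5678, 'application/x-unknown-example', NULL, 2)")

filenames = {}
for unique_id, mime_type in db.execute("SELECT unique_id, ct FROM part"):
    # guess_extension returns None for types it does not recognise
    extension = mimetypes.guess_extension(mime_type or "") or ".bin"
    filenames[unique_id] = f"attachment_{unique_id}{extension}"
```

With a real backup, the same query would be run against the database produced by executing the statement frames.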
Message data appears to be held in the sms and mms tables, though I'm not clear what the distinction between these is.
Groups are enumerated in the groups table (with group names in the title column). These can be related back to messages in sms and mms via the recipient and thread tables.