Handling binary assets and test data with CMake

A common situation facing many projects is how to incorporate large binary assets into the main source code and its build process. Examples of such assets include firmware binaries for embedded products, videos, user manuals, test data and so on. These binary assets often have their own workflow for managing source materials, change history and building the binaries. This article demonstrates an approach to handling this situation with CMake builds.

The ExternalData module

Since version 2.8.11, CMake has provided the ExternalData module. Originally, it was aimed at handling tests that needed access to large binary files, allowing such data to be downloaded and used at build time rather than requiring the files to be included directly in the source code repository. It turns out, however, that ExternalData is much more flexible than that and can be used for managing external files needed at any part of a build.

When using ExternalData, the large binary files are stored separately from the main source code repository and only downloaded on demand. This brings a number of advantages:

If a developer is building targets which don’t need to access the large binary files, those files may not need to be downloaded at all.
The main repository can remain small in size. This is especially useful with git repositories, since local working copies include a complete copy of the whole repository, including every version of every binary file ever committed to it.
Access to the binary assets can be controlled separately from the main source code, allowing only those developers who should be able to access sensitive binaries to do so (e.g. debug versions of firmware with sensitive diagnostic capabilities that contractors should not have access to).

The basics

To use ExternalData in your project, add the following line to your CMakeLists.txt file:

include(ExternalData)

The following three CMake functions then become available:

ExternalData_Expand_Arguments
ExternalData_Add_Test
ExternalData_Add_Target

These functions understand references to external files using the syntax DATA{filename}. We will explain how to use these functions shortly, but first we need to understand how these DATA{filename} statements are processed. For every file referenced via DATA{filename}, CMake will look for a corresponding file in the source tree of the form filename.algorithm where algorithm must be one of md5, sha1, sha224, sha256, sha384 or sha512.

The contents of the filename.algorithm file should be the hash value of the real file using the named hashing algorithm. For example, a file referred to by DATA{foobar.bin} could have a file in the source tree called foobar.bin.md5 whose contents are the MD5 checksum of the real foobar.bin file. This foobar.bin.md5 file acts like a placeholder for the real file.

Once CMake finds an appropriate local placeholder file in the source tree, it then consults the contents of the ExternalData_URL_TEMPLATES variable, which is expected to be a list of URL templates. For each template, CMake will replace %(algo) with the name of the hashing algorithm (in upper case, e.g. MD5 for a placeholder ending in .md5) and %(hash) with the contents of the placeholder file. CMake will attempt to download each resulting URL in order until one succeeds. This is best illustrated by an example:

include(ExternalData)

set(ExternalData_URL_TEMPLATES
    "file://$ENV{HOME}/assets/%(algo)/%(hash)"
    "https://intranet.mycompany.com/assets/%(algo)/%(hash)"
)

In the above example, when resolving foobar.bin.md5, CMake will first look for assets/MD5/HASH below the user’s home directory (where HASH is the contents of the local foobar.bin.md5 file and MD5 is what %(algo) expands to for an md5 placeholder). If not found there, CMake will then try to download https://intranet.mycompany.com/assets/MD5/HASH. You can list as many URLs in the ExternalData_URL_TEMPLATES variable as you like.

When you want to make a file available to CMake using ExternalData, you need to perform the following three steps:

1. Generate the hash for the file using your desired hashing algorithm. We will use md5 for the remainder of this article since it is widely supported and easy to use. On macOS and the BSDs, run md5 filename; on most Linux systems the equivalent command is md5sum filename. CMake itself can also do this portably with cmake -E md5sum filename.
2. On the external location(s) referred to by ExternalData_URL_TEMPLATES, copy the real binary to the directory that %(algo) expands to (MD5 for an md5 hash) and make the file name the same as the checksum from the previous step.
3. Create a file in your local source tree with whatever name you want to use to refer to it, with .md5 appended to the file name. The contents of the file should be just the checksum generated above, without the file name that some hashing tools print alongside it.
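
The three steps above can be sketched as a small CMake script run in script mode. This is only an illustration under a few assumptions: the script name publish_asset.cmake and the ASSET and STORE variables are hypothetical, the store is a directory reachable through the local file system, and the MD5 subdirectory name reflects our understanding that %(algo) expands to the upper-case algorithm name.

```cmake
# publish_asset.cmake (hypothetical helper, not part of CMake). Run as:
#   cmake -DASSET=firmware.bin -DSTORE=/path/to/assets -P publish_asset.cmake
if(NOT ASSET OR NOT STORE)
    message(FATAL_ERROR
        "Usage: cmake -DASSET=<file> -DSTORE=<dir> -P publish_asset.cmake")
endif()

# Step 1: compute the MD5 hash of the real file.
file(MD5 "${ASSET}" assetHash)

# Step 2: copy the real file into the store, named after its hash.
# Assumption: the store layout is <STORE>/MD5/<hash>.
file(MAKE_DIRECTORY "${STORE}/MD5")
execute_process(
    COMMAND "${CMAKE_COMMAND}" -E copy "${ASSET}" "${STORE}/MD5/${assetHash}"
    RESULT_VARIABLE copyResult
)
if(NOT copyResult EQUAL 0)
    message(FATAL_ERROR "Failed to copy ${ASSET} into ${STORE}/MD5")
endif()

# Step 3: write the placeholder next to the asset; it holds only the hash.
file(WRITE "${ASSET}.md5" "${assetHash}\n")
message(STATUS "Wrote ${ASSET}.md5 with hash ${assetHash}")
```

The placeholder is written beside the asset here purely for simplicity; in practice you would move or write it into the source tree at the location from which DATA{} should refer to it.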

Returning now to the three functions defined by the ExternalData module, their respective syntaxes are as follows:

ExternalData_Expand_Arguments(
    <target>  # Name of data management target
    <outVar>  # Output variable
    [args...] # Input arguments, DATA{} allowed
)

ExternalData_Add_Target(
    <target>  # Name of data management target
)

ExternalData_Add_Test(
    <target>  # Name of data management target
    ...       # Arguments of add_test(), DATA{} allowed
)

The ExternalData_Expand_Arguments function acts as a translator, converting a set of DATA{filename} arguments to their build directory equivalents. It replaces each DATA{} reference in args with the full path of a real data file on disk that will exist after the named target builds. On its own, ExternalData_Expand_Arguments does not define the target itself; it just provides the expanded versions of the arguments in the outVar output variable and records some internal information about each external file and the target those external files should be associated with.
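
To see the translation in action, a minimal sketch like the following prints the expanded arguments at configure time. The project name, the demoData target name, the --input and --verbose arguments and the URL are all made up for illustration, and a foobar.bin.md5 placeholder is assumed to sit next to the CMakeLists.txt file.

```cmake
cmake_minimum_required(VERSION 3.10)
project(ExpandDemo NONE)

include(ExternalData)

# Hypothetical remote location; replace with one that really holds the data.
set(ExternalData_URL_TEMPLATES
    "https://intranet.mycompany.com/assets/%(algo)/%(hash)"
)

# Only the DATA{} reference is rewritten; other arguments pass through as-is.
ExternalData_Expand_Arguments(demoData expandedArgs
    --input DATA{foobar.bin} --verbose
)
message(STATUS "Expanded arguments: ${expandedArgs}")

# Defines the demoData target that fetches foobar.bin when built.
ExternalData_Add_Target(demoData)
```

Nothing is downloaded while CMake runs; the message simply shows where in the build directory the file will appear once the demoData target has been built.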

The ExternalData_Add_Target function is the other half of the picture. Its purpose is to define a build target which takes care of downloading all external files that were previously mentioned in any ExternalData_Expand_Arguments call with the same target. This is the function that sets up the commands and rules for managing the automatic download of the external files associated with the specified target.

The ExternalData_Add_Test function is just a convenience wrapper around ExternalData_Expand_Arguments and the built-in add_test command. Its implementation is simply:

function(ExternalData_Add_Test target)
    ExternalData_Expand_Arguments("${target}" testArgs "${ARGN}")
    add_test(${testArgs})
endfunction()

The easiest way to understand how to use these three functions is through some concrete examples.

Example 1: A test requiring externally supplied data

Adding a test which needs to access a file stored remotely is one of the simplest scenarios for using ExternalData. The following example comes straight out of the CMake documentation:

include(ExternalData)

set(ExternalData_URL_TEMPLATES
    "file:///local/%(algo)/%(hash)"
    "file:////host/share/%(algo)/%(hash)"
    "http://data.org/%(algo)/%(hash)"
)

ExternalData_Add_Test(MyData
    NAME MyTest
    COMMAND MyExe DATA{MyInput.png}
)
ExternalData_Add_Target(MyData)

This defines a new test called MyTest which will run an executable called MyExe. When that executable is run, it is given the path to MyInput.png on the command line. More specifically, it is given the path to the file in the build directory, and that file is fetched on demand when the MyData target is built, so it is in place before the test runs. The local source tree needs a placeholder file (e.g. MyInput.png.md5) which CMake will then use to try to find the real file in one of the three locations specified in ExternalData_URL_TEMPLATES.

Example 2: Adding firmware to a target

Let’s say you are building an application myapp which expects a firmware file to be located at Assets/firmware.bin relative to the application’s executable at run-time. That firmware file is stored in a remote location and it should be downloaded on demand when myapp is built. This could be implemented as a post-build step for the myapp target, something like the following:

include(ExternalData)

set(ExternalData_URL_TEMPLATES
    "file://$ENV{HOME}/assets/%(algo)/%(hash)"
    "https://intranet.mycompany.com/firmware/%(algo)/%(hash)"
)

ExternalData_Expand_Arguments(myapp_externalData
    firmware
    DATA{firmware.bin}
)
ExternalData_Add_Target(myapp_externalData)

add_executable(myapp myapp.cpp)
add_custom_command(TARGET myapp POST_BUILD
    COMMAND ${CMAKE_COMMAND} -E make_directory $<TARGET_FILE_DIR:myapp>/Assets
    COMMAND ${CMAKE_COMMAND} -E copy ${firmware} $<TARGET_FILE_DIR:myapp>/Assets/
)
add_dependencies(myapp myapp_externalData)

Using a local cache

If you list a local location first in ExternalData_URL_TEMPLATES, it can act like a local cache. This allows you to work offline if you can populate that local location with any remote binaries you may wish to use while disconnected. This has a few drawbacks though:

Different developers may use different paths for their local caches and you wouldn’t typically want to have developer-specific entries in your CMakeLists.txt file.
Even if CMake downloads a file from a remote location, it won’t put it in your local cache for you, so you still have to manually manage your local cache.

Happily, the ExternalData module provides a separate ExternalData_OBJECT_STORES variable which can be used to specify local cache locations to check before consulting ExternalData_URL_TEMPLATES. If a particular file is found in one of the locations specified by ExternalData_OBJECT_STORES, the remote locations in ExternalData_URL_TEMPLATES won’t be used. Furthermore, if a file is downloaded from a remote location, it will automatically be added to the first location listed in ExternalData_OBJECT_STORES. This means downloads only happen once and are cached locally thereafter without any manual intervention. If you don’t provide ExternalData_OBJECT_STORES, a local cache is created in your build output directory instead.

When the ExternalData_OBJECT_STORES variable is used, CMake will append the %(algo)/%(hash) part to the path when searching for an object. An example may make this a little clearer:

set(ExternalData_OBJECT_STORES
    "$ENV{HOME}/assets"
    "/opt/sitecache"
)

Compare this with ExternalData_URL_TEMPLATES where we need to explicitly include %(algo) and %(hash).

Using a local object store is particularly useful when setting up a build server. It will ensure that remote files only need to be downloaded once and then the locally cached version will be used thereafter. If the build server is performing frequent clean builds, this can save repeatedly downloading the same remote files over and over again and no manual cache population is required.
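
One possible way to avoid hard-coding a developer-specific or server-specific path is to route the object store location through a cache variable. The sketch below assumes a hypothetical cache variable named MYPROJ_DATA_STORE and reuses the intranet URL from earlier; neither is anything ExternalData itself defines.

```cmake
include(ExternalData)

# MYPROJ_DATA_STORE is a made-up name for this sketch. Its default points
# below the user's home directory, but anyone can override it when
# configuring, e.g. cmake -DMYPROJ_DATA_STORE=/opt/sitecache ...
set(MYPROJ_DATA_STORE "$ENV{HOME}/assets"
    CACHE PATH "Local object store used by ExternalData")

# Checked first; downloaded objects are also saved here.
set(ExternalData_OBJECT_STORES "${MYPROJ_DATA_STORE}")

# Consulted only when the object is not already in the store.
set(ExternalData_URL_TEMPLATES
    "https://intranet.mycompany.com/assets/%(algo)/%(hash)"
)
```

A build server can then pass its own store location on the cmake command line while developers keep the default. Note that HOME may not be set on every platform, so a different default may be needed there.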

Closing remarks

One of the first things that becomes apparent when using the ExternalData module is that when you place the real file on the remote location, you lose any humanly meaningful name for it, since its name has to be the hash value. To help with this, if the remote location is a Unix-based file system, you may be able to use a symbolic link where the link has the hash value for its name, but the linked-to file can have any meaningful name you like (including something different to the name you use for it in your local source tree).
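
As a rough sketch of the symbolic link idea, the following script could be run on the machine hosting the files. The script name link_asset.cmake and the ASSET and STORE variables are hypothetical, ASSET is assumed to be an absolute path so that the link resolves from anywhere, and the MD5 directory name again assumes that %(algo) expands to the upper-case algorithm name.

```cmake
# link_asset.cmake (hypothetical helper). Run as:
#   cmake -DASSET=/srv/assets/named/firmware-v2.bin -DSTORE=/srv/assets -P link_asset.cmake
if(NOT ASSET OR NOT STORE)
    message(FATAL_ERROR
        "Usage: cmake -DASSET=<abs file> -DSTORE=<dir> -P link_asset.cmake")
endif()

# The link is named after the hash; the real file keeps its readable name.
file(MD5 "${ASSET}" assetHash)
file(MAKE_DIRECTORY "${STORE}/MD5")
execute_process(
    COMMAND "${CMAKE_COMMAND}" -E create_symlink
            "${ASSET}" "${STORE}/MD5/${assetHash}"
    RESULT_VARIABLE linkResult
)
if(NOT linkResult EQUAL 0)
    message(FATAL_ERROR "Could not create symlink for ${ASSET}")
endif()
message(STATUS "${STORE}/MD5/${assetHash} -> ${ASSET}")
```

Whether clients can follow such links depends on how the location is served, so treat this as something to try rather than a guaranteed solution.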

Since ExternalData_URL_TEMPLATES can contain any number of entries, you can set up multiple external locations. You may, for example, have one location for storing firmware files, another for videos, etc. You can also set up redundant locations or have some locations that only make sense for some platforms. For example, you can specify a Windows share as follows:

set(ExternalData_URL_TEMPLATES "file:////host/share/%(algo)/%(hash)")

Regarding history of a binary file, when a given external binary file is updated, the local source tree only needs to update the hash value in its placeholder file. This is very convenient, since changes in the contents of these placeholder files will be recorded in any version control system being used, so earlier code can still be rebuilt with the original binary files from that point in history (assuming those binary files are still available from one of the remote locations or a local object store).
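
Since the placeholder is the only link between a revision of the source tree and its binary, it can be useful to check that a store still holds a matching object. The script below is one way this might be done; the script name verify_asset.cmake and the PLACEHOLDER and STORE variables are hypothetical, STORE is assumed to be an absolute path, and the <STORE>/MD5/<hash> layout is an assumption about how the store is organised.

```cmake
# verify_asset.cmake (hypothetical helper). Run as:
#   cmake -DPLACEHOLDER=firmware.bin.md5 -DSTORE=/path/to/assets -P verify_asset.cmake
if(NOT PLACEHOLDER OR NOT STORE)
    message(FATAL_ERROR
        "Usage: cmake -DPLACEHOLDER=<file.md5> -DSTORE=<abs dir> -P verify_asset.cmake")
endif()

# Read the expected hash, ignoring surrounding whitespace and letter case.
file(READ "${PLACEHOLDER}" expectedHash)
string(STRIP "${expectedHash}" expectedHash)
string(TOLOWER "${expectedHash}" expectedHash)

set(object "${STORE}/MD5/${expectedHash}")
if(NOT EXISTS "${object}")
    message(FATAL_ERROR "No object for ${PLACEHOLDER} found at ${object}")
endif()

# Recompute the hash of the stored object and compare.
file(MD5 "${object}" actualHash)
if(NOT actualHash STREQUAL expectedHash)
    message(FATAL_ERROR
        "Hash mismatch for ${object}: expected ${expectedHash}, got ${actualHash}")
endif()
message(STATUS "${PLACEHOLDER} matches ${object}")
```

Running a check like this over the placeholders of an old revision before pruning a store gives some confidence that the revision can still be rebuilt.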

In summary, the ExternalData module is a great way of managing large binary files without having to add them directly to your source tree. Not only is it a great time and bandwidth saver, it makes for more flexible build arrangements.


5 thoughts on “Handling binary assets and test data with CMake”

Craig Scott

July 12, 2015 at 10:29 pm

Updated to fix incorrect information about how ExternalData_OBJECT_STORES is used.
LPC

May 31, 2017 at 11:30 pm

Hi Craig,

Thank you for this wonderful article. However, this CMake module still appears obscure to me, even after reading your article and the official documentation. In particular, I would like to know if the ExternalData module can be used for general purpose data that should be copied next to the binary dir. I tried to use this module with ExternalData_OBJECT_STORES but it doesn’t seem to detect when the files are modified. For instance, let’s say I have a mydb.db file that I want to copy into the target binary dir, but only if it has changed. Would that be possible using this module?

I don’t know if I’m being clear enough. Thank you for your time!
- Craig Scott
  
  June 1, 2017 at 8:30 am
  
  ExternalData can definitely be used for any general purpose data, it isn’t restricted to files used by tests. The location the data is copied to will be in the build directory, not outside of it, although it may update an object store located outside the build directory as part of what it does.
  
  When using ExternalData, if the object you want to copy is modified, then its hash value will also change. This means you have to update the value inside your placeholder file (e.g. named mydb.db.md5), which also means the build target created by ExternalData_Add_Target() should see that the downloaded file needs to be updated. If you’re finding that isn’t the case, I suggest you report it as a bug in CMake’s gitlab, including a simple example that reproduces the problem.
  
  https://gitlab.kitware.com/cmake/cmake/issues
Adam

March 20, 2018 at 3:23 am

Hi Craig,
Thanks for this article. It explains a lot, however I still have difficulty applying ExternalData in my project. The data I need for my tests is organised in directories. The documentation claims it is possible to specify a trailing ‘/’ and patterns for files in the directory in the following way:

DATA{MyDataDir/,RECURSE:,REGEX:.*}

However, I can’t see what data I should keep in the repository. The options I can see are as follows:
1. I keep hash for every single file in the actual dir.
2. I keep hash for an archived directory.
3. I keep no hashes, just a placeholder for the directory that is actually placed in an external storage.

The problem is that the structure of the directory might change in time and from the test perspective it’s not relevant to know it (except for one entry point file that is always there). Can you explain to me what is the actual behaviour of ExternalData in case of directories? How does it expect to keep the data in the external storage?
- Craig Scott
  
  March 20, 2018 at 6:44 am
  
  I don’t know off the top of my head how the directory side of things works, I haven’t used that aspect of the command (or at least it was so long ago I no longer recall!). If you’re feeling adventurous, you can always try to follow the source code of the module itself, it’s just CMake commands. You can find it in your local copy of CMake, or have a look online here:
  
  https://gitlab.kitware.com/cmake/cmake/blob/master/Modules/ExternalData.cmake
  
  Sorry I can’t be of more help. If I get the chance to dig into it more, I’ll see if I can update the article, but that won’t be any time soon unfortunately.
