Reverse Engineer Docker Images into Dockerfiles
Take a look at how to use the image exploration tool Dive to learn how to recover a Dockerfile from a Docker image.
As public Docker registries like Docker Hub and TreeScale increase in popularity, it has become common, except in the most restrictive environments, for admins and developers to casually download an image built by an unknown entity. It often comes down to the convenience outweighing the perceived risk. When a Docker image is made publicly available, the Dockerfile is sometimes also provided, either directly in the listing, in a git repository, or through an associated link, but sometimes this is not the case. And even when a Dockerfile is provided, we have few assurances that the published image is safe to use.
Maybe security vulnerabilities aren't your concern. Perhaps one of your favorite images is no longer maintained, and you would like to update it to run on the latest version of Ubuntu. Or perhaps a compiler on another distribution has an exclusive feature that produces better-optimized binaries, and you have an uncontrollable compulsion to release a similar image that's just a little more optimized.
Whatever the reason, if you wish to recover a Dockerfile from an image, there are options. Docker images aren't a black box. Often, you can retrieve most of the information you need to reconstruct a Dockerfile. In this article, we will explore exactly how to do that by looking inside a Docker image so that we can very closely reconstruct the Dockerfile that built it.
In this article, we will show how it's possible to reconstruct a Dockerfile from an image using two tools: Dedockify, a customized Python script provided for this article, and dive. The basic process flow used will be as follows.
To get some quick, minimal-effort intuition regarding how images are composed, we will introduce ourselves to various advanced and potentially unfamiliar Docker concepts using Dive. Dive is an image exploration tool that allows examination of each layer of a Docker image.
First, let us create a simple, easy to follow Dockerfile that we can explore for testing purposes.
In an empty directory, enter the following snippet directly into the command line:
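The original snippet is not preserved in this copy of the article; the following reconstruction is consistent with the description (the testfile1 through testfile3 names are an assumption, matching the layer contents examined later):

```shell
# Write a scratch-based Dockerfile and create three zero-byte test files.
cat > Dockerfile <<'EOF'
FROM scratch
COPY testfile1 /
COPY testfile2 /
COPY testfile3 /
EOF
touch testfile1 testfile2 testfile3
```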
By entering the above and pressing enter, we've just created a new Dockerfile and populated three zero-byte test files in the same directory.
So now, let's build an image using this Dockerfile and tag it as example1. Building the example1 image should produce the following output:
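The build command itself was not preserved here; reconstructed, it would be (this requires a running Docker daemon, and the output shows Docker creating one layer per COPY directive):

```shell
# Build the image from the current directory and tag it as example1.
docker build -t example1 .
```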
The following zero-byte example1 image should now be available:
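We can confirm with docker images; since every layer is empty, the reported size should be 0B:

```shell
# List the image we just built; SIZE should read 0B.
docker images example1
```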
Note that since there's no binary data, this image won't be functional. We are only using it as a simplified example of how layers can be viewed in Docker images.
We can see here by the size of the image that there is no source image. Instead of a source image, we used scratch, which instructed Docker to use a zero-byte blank image as the source image. We then modified the blank image by copying three additional zero-byte test files onto it, and then tagged the changes as example1.
Now, let us explore our new image with Dive.
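Dive's documented container-based invocation mounts the host's Docker socket so it can inspect local images:

```shell
# Run dive from its published image against example1.
docker run --rm -it \
    -v /var/run/docker.sock:/var/run/docker.sock \
    wagoodman/dive example1
```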
Executing the above command should automatically pull wagoodman/dive from Docker Hub, and produce the output of Dive's polished interface.
Scroll through the three layers of the image in the list to find the three files in the tree displayed on the right.
We can see the contents on the right change as we scroll through each layer. As each file was copied to a blank Docker scratch image, it was recorded as a new layer.
Notice also that we can see the commands that were used to produce each layer. We can also see the hash value of the source file and the file that was updated.
If we take note of the items in the Command: section, we should see the following:
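The three entries should look similar to the following (the hashes are placeholders; the actual values will differ):

```
COPY file:<hash> in /
COPY file:<hash> in /
COPY file:<hash> in /
```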
Each command provides solid insight into the original command used in the Dockerfile to produce the image. However, the original filename is lost. It appears that the only way to recover this information is to make observations about the changes to the target filesystem, or perhaps to infer based on other details. More on this later.
Aside from third-party tools like dive, the tool we have immediately available is docker history. If we use the docker history command on our example1 image, we can view the entries we used in the Dockerfile to create that image.
We should, therefore, get the following result:
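The command is simply:

```shell
# Show the layer history for example1.
docker history example1
```

Its output is a table with IMAGE, CREATED, CREATED BY, SIZE, and COMMENT columns; each of our three COPY layers appears as a truncated /bin/sh -c #(nop) COPY entry.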
Notice that everything in the CREATED BY column is truncated. These are Dockerfile directives passed through the Bourne shell. This information could be useful for recreating our Dockerfile, and although it is truncated here, we can view all of it by also using the --no-trunc option:
While this has some useful data, it could be a challenge to parse from the command line. We could also use docker inspect. However, in this article, we will focus on using the Docker Engine API for Python.
Using Docker Engine API for Python
Docker released a Python library for the Docker Engine API, which allows full control of Docker from within Python. In the following example, we can recover similar information to docker history by running the following Python 3 code:
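The original listing isn't preserved in this copy; the following sketch uses the official docker package (pip install docker) and should behave equivalently. If no Docker daemon is reachable, it falls back to sample entries shaped like real history() output:

```python
def history_commands(entries):
    """Return each layer's CreatedBy command, oldest layer first.

    The Docker Engine API returns history entries newest-first,
    so we reverse them to read in Dockerfile order.
    """
    return [e.get("CreatedBy", "").strip() for e in reversed(entries)]

try:
    import docker  # pip install docker
    entries = docker.from_env().images.get("example1").history()
except Exception:
    # Fallback: sample entries shaped like real history() output,
    # used when no Docker daemon is available.
    entries = [
        {"CreatedBy": "/bin/sh -c #(nop) COPY file:<hash> in / "},
        {"CreatedBy": "/bin/sh -c #(nop) COPY file:<hash> in / "},
        {"CreatedBy": "/bin/sh -c #(nop) COPY file:<hash> in / "},
    ]

for command in history_commands(entries):
    print(command)
```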
This should result in output much like the following:
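The shape of the output (hashes are placeholders):

```
/bin/sh -c #(nop) COPY file:<hash> in /
/bin/sh -c #(nop) COPY file:<hash> in /
/bin/sh -c #(nop) COPY file:<hash> in /
```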
Looking at the output, we can see that reconstructing much of the Dockerfile is just a matter of parsing all the relevant data and reversing the entries. But as we saw earlier, we also notice that there are a few hashed entries in the COPY directives. As previously mentioned, the hashed entries here represent filenames used from outside the layer. This information isn't directly recoverable. However, just as we saw in dive, we can infer these names when we search for changes made to the layer. It's also sometimes possible to infer them in cases where the original copy directive included the target filename as the destination. In other cases, the filenames may not be critical, allowing us to use arbitrary filenames. And in still other cases, while more difficult to assess, we can infer filenames that are back-referenced elsewhere in the system, such as in supporting dependencies like scripts or configuration files. In any case, searching for all changes between layers is the most reliable approach.
Let's take this a few steps further. In order to help reverse engineer this image into a Dockerfile, we will need to parse everything and reformat it into a form that is readable. Please note that for the purposes of this article, the following Python 3 code has been made available and can be obtained from the Dedockify repository on GitHub. Thanks goes to LanikSJ for all prior work.
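The full script lives in the Dedockify repository; its core logic can be sketched roughly as follows (a simplification written for this article, not the actual Dedockify source): walk the history entries oldest-first, peel off the /bin/sh -c wrapper, and emit one Dockerfile directive per layer.

```python
def history_to_dockerfile(entries, base="unknown"):
    """Rebuild approximate Dockerfile lines from Engine API history entries.

    Layers arrive newest-first. Docker flags metadata directives
    (COPY, ENV, CMD, ...) with a '#(nop)' marker; anything else was
    executed through the shell and maps back to a RUN directive.
    """
    lines = [f"FROM {base}"]
    for entry in reversed(entries):
        cmd = entry.get("CreatedBy", "").strip()
        if "#(nop)" in cmd:
            # e.g. "/bin/sh -c #(nop) COPY file:<hash> in /"
            lines.append(cmd.split("#(nop)", 1)[1].strip())
        elif cmd.startswith("/bin/sh -c"):
            lines.append("RUN " + cmd[len("/bin/sh -c"):].strip())
        elif cmd:
            lines.append("RUN " + cmd)
    return "\n".join(lines)
```

Metadata directives keep their original text; shell layers become RUN lines. The base image name has to be supplied or guessed, which is exactly the weak point discussed below.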
Initial Dockerfile Generation
If you've made it this far, then you should have two images: wagoodman/dive and our custom example1. Running this code against our example1 image will finally produce the following:
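The hashes will vary, but the output should have this shape:

```
FROM example1:latest
COPY file:<hash> in /
COPY file:<hash> in /
COPY file:<hash> in /
```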
We've extracted nearly the same information that we observed when we explored the image with dive earlier. Notice that the FROM directive shows us example1:latest instead of scratch. Our code is making an assumption about the base image that is technically incorrect in this case.
As a comparison, let us do the same thing with our wagoodman/dive image. This shows a lot more diversity compared to our example1 image. We notice the ADD directive just before the FROM directive; our code is making the wrong assumption again. We don't know what the ADD directive is adding, and we don't know for sure what the base image is. The ADD directive could have been used to extract a local tar file into the root directory, and it's possible that it was using this method to load another base image.
Dedockify Limitation Testing
Let's experiment by creating an example Dockerfile where we explicitly define the base image. As we did earlier, in an empty directory, run the following snippet directly from the command line.
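As before, the original snippet isn't preserved here; a reconstruction consistent with the description:

```shell
# Same three zero-byte files, but now on top of an explicit base image.
cat > Dockerfile <<'EOF'
FROM ubuntu:latest
COPY testfile1 /
COPY testfile2 /
COPY testfile3 /
EOF
touch testfile1 testfile2 testfile3
```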
Now, perform a build that tags our new image as example2. This will create a similar image as before, except that instead of scratch, it will use ubuntu:latest as the base image.
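The build step mirrors the earlier one (reconstructed; requires a Docker daemon):

```shell
docker build -t example2 .
```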
Since we now have a slightly more complex Dockerfile to reconstruct, and we have the exact Dockerfile we used to generate this image, we can make a comparison. Let us generate the output from our Python script.
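Against example2, the reconstructed output should look roughly like this (hashes are placeholders):

```
FROM ubuntu:latest
COPY file:<hash> in /
COPY file:<hash> in /
COPY file:<hash> in /
```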
This correlates well with the original Dockerfile. There's no ADD directive this time, and the FROM directive is correct. Provided that our base image is defined in the original Dockerfile, and it avoids using scratch or using the ADD directive to create a base image from a tar file, we should be able to reconstruct the Dockerfile with some accuracy. We still don't know the names of the original files that were copied, however.
Blind Freestyle Dockerfile Reconstruction
Now, let us try reverse engineering a Docker container the proper way using the tools that we've already discussed. The container we will use has been modified from the above examples. Our earlier Dockerfile has been modified to create
example3. The image has been made functional by adding a small binary. The assembly source code is available here in the Dedockify GitHub repository. Since this image is so small, we won't need to build or pull it. We can just copy and paste the entire container right into our Docker environment with the snippet below.
Running everything directly from the command line will load example3:latest. Now, let us try to recreate the Dockerfile:
This gives us a base Dockerfile to work from. Since example3:latest is the name of this image, we can assume from the context that it's using scratch. Now, we need to see what files were copied into /app. Let us run this image against dive to see how we will recover the missing data.
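We invoke dive the same way as before, this time against example3:

```shell
docker run --rm -it \
    -v /var/run/docker.sock:/var/run/docker.sock \
    wagoodman/dive example3:latest
```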
If you scroll down to the last layer, you'll be able to see all of the missing data populate the tree on the right. Each of the directories had a zero-byte file named testfile3 copied to it. And in the last layer, a 63-byte file called hello was copied to the /app directory.
Now, let us recover those files! There doesn't appear to be a way to copy the files directly from the image, so we will need to create a container first.
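The original command isn't preserved here; one way to do this, consistent with the running container referenced below, is:

```shell
# Start a named container from the image so its filesystem can be copied out.
docker run -d --name example3 example3:latest
```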
Now, let us copy the files we need from the container to the host, using the paths and filenames we recovered from dive.
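Assuming a container named example3 (a hypothetical name) and the /app/hello path observed in dive:

```shell
# Copy the recovered binary from the container to the host.
docker cp example3:/app/hello ./hello
```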
We might first check to see if our container is still running.
If a container isn't running for some reason, that's fine. We can verify its status to see that it's stopped.
We can also check the logs.
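The status checks from the three steps above (the example3 container name is assumed):

```shell
docker ps                # is the container still running?
docker ps -a             # include stopped containers to verify its status
docker logs example3     # review its output
```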
It appears to be running a persistent Hello, world! program. Actually, in this case, the Hello, world! program wasn't designed to be persistent; in Docker version 19.03.6, there may be a bug that's preventing the application from terminating normally. This is acceptable for now. The container can be active or stopped; the application doesn't need to be persistent to recover any of the data we need. A container in any state only needs to be generated from the source image from which we are extracting data.
By running the recovered executable to verify its behavior, we should see the following:
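After marking it executable, running the recovered binary should print the greeting described above:

```shell
chmod +x ./hello
./hello
```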
Using the Dockerfile we generated earlier, we can update it to include all the new details. This includes updating the FROM directive to scratch, along with all of the discovered filenames we found while exploring with dive.
Again, combining all files in a shared folder, we're ready to build our reverse engineered Dockerfile.
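With the Dockerfile, the hello binary, and the zero-byte test files in one directory, the build would be (the example3:recovered tag is an assumed name, to keep it distinct from the original):

```shell
docker build -t example3:recovered .
```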
Now, for further verification, let's check the layers with dive.
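The same dive invocation as before, run against the rebuilt image (example3:recovered is an assumed tag name for the rebuild):

```shell
docker run --rm -it \
    -v /var/run/docker.sock:/var/run/docker.sock \
    wagoodman/dive example3:recovered
```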
This image shows the same files as the original. Comparing the two images side by side, they both match: both show the same file sizes, and both function in exactly the same way.
Here is the original Dockerfile used to generate the original example3 image:
We can see that, while we weren't able to reconstruct it perfectly, we were able to reconstruct it approximately. There's no way to reconstruct a Dockerfile that uses a multi-stage build like this one; the information simply isn't available. Our only option is to reconstruct the Dockerfile of the image we actually have. If we had images from the earlier build stages, we could reproduce a Dockerfile for each of those, but in this case, all we had was the final build. Regardless, we have still successfully reproduced a useful Dockerfile from a Docker image.
By using a similar approach as dive, we should be able to update the Dedockify source code to traverse each of the layers automatically in order to recover all useful file information. The program could also be updated to automatically recover files from the container and store them locally, while making the appropriate updates to the Dockerfile. Finally, it could be updated to easily infer whether the base layer is an empty scratch image or something else. With some additional changes to the recovered Dockerfile, Dedockify can potentially be updated to completely automate the reverse engineering of a Docker image into a functional Dockerfile in most cases.
This article was originally published on https://appfleet.com/blog/reverse-engineer-docker-images-into-dockerfiles-with-dedockify/.
Published at DZone with permission of Sudip Sengupta. See the original article here.