Hosting user-uploaded files (photos, PDFs, forum avatar images, etc.) opens a potential attack vector against your website. In this blog post I discuss the common pitfalls of hosting user content and how to prevent it from being used as an attack vector. The post discusses this in terms of the Python programming language, the Django web framework and the Plone CMS, but the advice applies to all systems.
I look at the issues from the perspective of compromising user data and your server (shellcode); denial of service and other brute-force attacks are not considered here. Also, web technologies that map executable scripts directly to URLs, like PHP, are open to far more attack vectors, such as uploading a server-side executable file; those are not discussed here.
The list of issues here is not exhaustive. Please leave a blog comment if I missed something.
1. Filename and path attacks
Letting user-supplied filenames from the HTTP request through to the server file system exposes you to attacks both when storing the file and later when serving it.
Do not disclose the server path where you store user-uploaded files (full path disclosure).
Do not save user-supplied filenames on the file system as is. Instead, use a running counter (database id), a random hex string or a hash as the name when writing the file. Counters are good, since you need to avoid filename conflicts in any case. It’s OK to save the original name separately in the database if you later want to hint the browser to save the download under that name, using the Content-Disposition header discussed in the section on serving files. In Django, the File Storage API takes care of filename conflicts and sanitizes the uploaded filenames; in Plone/Zope, filenames are tied to the database transaction id (blobstorage).
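A minimal sketch of the naming idea above, using a random hex string. The function name `make_storage_name` and the record layout are my own illustration, not part of Django or Plone; in a real application the record would be a database row.

```python
import secrets


def make_storage_name(original_name: str) -> dict:
    """Separate the on-disk name from the name the user supplied."""
    # 16 random bytes rendered as 32 hex characters. The user's filename
    # never reaches the file system; it is only kept as metadata.
    storage_name = secrets.token_hex(16)
    return {"storage_name": storage_name, "original_name": original_name}


record = make_storage_name("../../etc/passwd")
# record["storage_name"] is safe to use as a filename;
# record["original_name"] is only ever sent back in a response header.
```

If you already have a database id for the upload, that id works just as well as the random string and avoids conflicts for free.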
Do not mix paths and filenames with string operations, e.g. by concatenating the upload folder path with the filename. An attacker could then use relative (..) and absolute (/) paths to overwrite any writable file on your server.
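If a name does have to be joined to the upload folder, one way to guard against both the relative and the absolute case is to resolve the result and check that it is still inside the folder. This is a sketch; `safe_join` and the `/srv/uploads` path are hypothetical names chosen for the example.

```python
from pathlib import Path


def safe_join(upload_root: str, name: str) -> Path:
    """Join name to upload_root, refusing results outside upload_root."""
    root = Path(upload_root).resolve()
    # An absolute name replaces root entirely in the join, and ".." parts
    # are collapsed by resolve(), so both tricks show up in candidate.
    candidate = (root / name).resolve()
    if root not in candidate.parents:
        raise ValueError(f"path escapes upload folder: {name!r}")
    return candidate


safe_join("/srv/uploads", "a1b2c3.bin")      # fine
# safe_join("/srv/uploads", "../evil")       # raises ValueError
# safe_join("/srv/uploads", "/etc/passwd")   # raises ValueError
```

The check is a second line of defence; generating the filename yourself is still the safer option.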
If you cannot tie the filename to the database entry, you may want to normalize (munge) the filename before writing it to the disk. Also, non-ASCII filenames can cause issues when files move between file systems (backups, migrations, etc.).
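One possible munging function, as a sketch. The whitelist of characters, the length limit and the `unnamed` fallback are assumptions I picked for the example; adjust them to your own needs.

```python
import re
import unicodedata


def munge_filename(name: str, max_length: int = 100) -> str:
    """Reduce a user-supplied filename to a conservative ASCII form."""
    # Keep only the last path component, whichever separator was used.
    name = name.replace("\\", "/").rsplit("/", 1)[-1]
    # Decompose accented characters, then drop anything non-ASCII.
    name = unicodedata.normalize("NFKD", name)
    name = name.encode("ascii", "ignore").decode("ascii")
    # Replace runs of anything outside the whitelist with an underscore.
    name = re.sub(r"[^A-Za-z0-9._-]+", "_", name)
    # No leading dots: avoids hidden files and the special name "..".
    name = name.lstrip(".")[:max_length]
    return name or "unnamed"


munge_filename("Rapport d'été 2012.pdf")   # "Rapport_d_ete_2012.pdf"
```

Note that two different uploads can munge to the same name, so you still need conflict handling when writing the file.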
Distribute user-uploaded files across a hashed folder structure. Do not store all the uploads in the same file system folder. Even with a fairly small number of files (1000+) you may start overloading the file system. For example, the Django web framework simply dumps all the files into the same folder by default. If possible, put all user-specific content in a folder named after the user and then create hashed subfolders inside it.
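A hashed folder structure can be derived from the storage name itself. The sketch below assumes two levels of two hex characters each (up to 65536 leaf folders); `sharded_path` is a hypothetical helper name.

```python
import hashlib
from pathlib import PurePosixPath


def sharded_path(storage_name: str, levels: int = 2, width: int = 2) -> PurePosixPath:
    """Return a relative path like 'ab/cd/<storage_name>'."""
    # Hash the name so that sequential ids spread evenly over the folders.
    digest = hashlib.sha256(storage_name.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return PurePosixPath(*parts, storage_name)


path = sharded_path("42.bin")
# Three components: two hash-derived folder names and the file name.
```

Because the folders are computed from the name, no extra database column is needed to find the file again.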
2. Serving unsanitized documents
Downloading a user-uploaded file from your server is the other side of the user content problem. Though it might not look like a potential attack vector at first, the history of HTML and HTTP, backwards compatibility and the way browsers handle downloads make this another can of worms.
Never read files from the file system using a user-provided filename coming in an HTTP request. Always refer to files by ids stored in the database (see the Relative Path Traversal attack, and Directory Traversal in the Django security guide).
When serving the file, do not put user-supplied filenames in download URLs, as this is a potential XSS attack vector. Instead, use the HTTP Content-Disposition response header. Just make sure your web framework is not open to an HTTP response header injection exploit through newlines in the filename.
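As a rough illustration of building the header value by hand, the sketch below strips control characters (including CR and LF) and emits both a plain ASCII `filename` and an RFC 5987 style `filename*` parameter. The function name `content_disposition` is hypothetical; your framework may already provide an equivalent, in which case prefer that.

```python
from urllib.parse import quote


def content_disposition(original_name: str) -> str:
    """Build a Content-Disposition value from an untrusted filename."""
    # Drop CR, LF and other non-printable characters, so the name
    # cannot terminate the header and inject a new one.
    cleaned = "".join(ch for ch in original_name if ch.isprintable())
    # ASCII fallback for older clients: no quotes or backslashes inside
    # the quoted string.
    fallback = cleaned.encode("ascii", "ignore").decode("ascii")
    fallback = fallback.replace("\\", "_").replace('"', "_") or "download"
    # Percent-encoded UTF-8 form for clients that understand filename*.
    encoded = quote(cleaned, safe="")
    return f"attachment; filename=\"{fallback}\"; filename*=UTF-8''{encoded}"


content_disposition('a"b.txt')
# attachment; filename="a_b.txt"; filename*=UTF-8''a%22b.txt
```

The URL itself then only needs to carry the database id of the file.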
3. Image uploads and content decode attacks
When you are processing user uploads on the server side, e.g. by resizing images, you are open to codec bugs. Most codecs are implemented by open source C libraries which have native bindings to your run-time environment (Python, Ruby, Node.js). When you process the user-uploaded file, the code path jumps into native library code, which is open to traditional C exploits (buffer overflows).
To reduce the attack surface, limit the number of codecs used on your website and enable only the codecs you really want to support. In the case of images, these are usually JPEG, GIF and PNG. Even the most common codecs can be unsafe now and then (PNG vulnerability 2012). By default, imaging libraries may have all file formats enabled and are not picky about what they process.
File extension blacklists or whitelists are useless. Native libraries usually detect the content type from the payload; the Python imaging library (Pillow), for example, detects the image format this way. This means that even if you allow uploads with the .PNG extension only, the user can potentially upload a renamed SVG file; SVG has a much more complex codec, which increases the attack surface against your service. In Python’s case you can use python-magic to detect the actual file payload and do your own file content validation in the form validation phase. Alternatively, hack Pillow by dropping file formats from the PIL.Image.OPEN registry.
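To show what payload-based detection means, here is a standard-library-only sketch that checks the first bytes of an upload against the well-known signatures of the three formats mentioned above. It is a simplified stand-in for what python-magic does, not a replacement for it; `sniff_image_format` is a hypothetical name.

```python
from typing import Optional

# Leading bytes ("magic numbers") of the formats we are willing to decode.
SIGNATURES = {
    "png": (b"\x89PNG\r\n\x1a\n",),
    "jpeg": (b"\xff\xd8\xff",),
    "gif": (b"GIF87a", b"GIF89a"),
}


def sniff_image_format(header: bytes) -> Optional[str]:
    """Return the whitelisted format the payload starts with, or None."""
    for fmt, magics in SIGNATURES.items():
        if header.startswith(magics):
            return fmt
    return None


# A renamed SVG is rejected regardless of its .png extension.
sniff_image_format(b"<svg xmlns='http://www.w3.org/2000/svg'/>")  # None
```

A matching signature only tells you which codec will be invoked; it does not prove the rest of the file is well formed. Recent Pillow versions also accept a `formats` argument to `Image.open` that restricts which decoders are tried, which may be a cleaner option than editing the registry.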
If you are processing very complex files with native codecs (video, PDF, .DOC, etc.) it might make sense to run the processor in an external process with limited UNIX privileges.
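A sketch of the external process idea, for UNIX only. Note that it applies resource limits (CPU time, no core dumps) and a wall-clock timeout rather than switching to a different user account; dropping to an unprivileged user requires the parent to run as root and is not shown here. `run_limited` and its default limits are assumptions made for the example.

```python
import resource
import subprocess
import sys


def run_limited(argv, cpu_seconds=10, wall_seconds=30):
    """Run an external converter with CPU and wall-clock limits."""

    def apply_limits():
        # Runs in the child process, after fork() and before exec().
        _, hard = resource.getrlimit(resource.RLIMIT_CPU)
        limit = cpu_seconds if hard == resource.RLIM_INFINITY else min(cpu_seconds, hard)
        resource.setrlimit(resource.RLIMIT_CPU, (limit, limit))
        # Do not leave core dumps of a crashed codec on the disk.
        resource.setrlimit(resource.RLIMIT_CORE, (0, 0))

    return subprocess.run(
        argv,
        preexec_fn=apply_limits,
        capture_output=True,
        timeout=wall_seconds,
        check=False,
    )


# Stand-in for a real converter command line.
result = run_limited([sys.executable, "-c", "print('ok')"])
```

A non-zero `result.returncode` or a `subprocess.TimeoutExpired` exception should be treated as a rejected upload, not retried.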