DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Please enter at least three characters to search
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Modernize your data layer. Learn how to design cloud-native database architectures to meet the evolving demands of AI and GenAI workkloads.

Secure your stack and shape the future! Help dev teams across the globe navigate their software supply chain security challenges.

Releasing software shouldn't be stressful or risky. Learn how to leverage progressive delivery techniques to ensure safer deployments.

Avoid machine learning mistakes and boost model performance! Discover key ML patterns, anti-patterns, data strategies, and more.

Related

  • Your First Java RDD With Apache Spark
  • Recurrent Workflows With Cloud Native Dapr Jobs
  • Java Virtual Threads and Scaling
  • Java’s Next Act: Native Speed for a Cloud-Native World

Trending

  • Optimize Deployment Pipelines for Speed, Security and Seamless Automation
  • Problems With Angular Migration
  • Beyond ChatGPT, AI Reasoning 2.0: Engineering AI Models With Human-Like Reasoning
  • Failure Handling Mechanisms in Microservices and Their Importance
  1. DZone
  2. Coding
  3. Java
  4. Determining File Types in Java

Determining File Types in Java

Programmatically determining the type of a file can be surprisingly tricky and there have been many content-based file identification approaches proposed and implemented.

By 
Dustin Marx user avatar
Dustin Marx
·
Mar. 04, 15 · Tutorial
Likes (8)
Comment
Save
Tweet
Share
168.5K Views

Join the DZone community and get the full member experience.

Join For Free

Programmatically determining the type of a file can be surprisingly tricky and there have been many content-based file identification approaches proposed and implemented. There are several implementations available in Java for detecting file types and most of them are largely or solely based on files' extensions. This post looks at some of the most commonly available implementations of file type detection in Java.

Several approaches to identifying file types in Java are demonstrated in this post. Each approach is briefly described, illustrated with a code listing, and then associated with output that demonstrates how different common files are typed based on extensions. Some of the approaches are configurable, but all examples shown here use "default" mappings as provided out-of-the-box unless otherwise stated.

About the Examples

The screen snapshots shown in this post are of each listed code snippet run against certain subject files created to test the different implementations of file type detection in Java. Before covering these approaches and demonstrating the type each approach detects, I list the files under test and what they are named and what they really are.

File
Name
File
Extension
File
Type
Type Matches
Extension Convention?
actualXml.xmlxmlXMLYes
blogPostPDFPDFNo
blogPost.pdfpdfPDFYes
blogPost.gifgifGIFYes
blogPost.jpgjpgJPEGYes
blogPost.pngpngPNGYes
blogPostPDF.txttxtPDFNo
blogPostPDF.xmlxmlPDFNo
blogPostPNG.gifgifPNGNo
blogPostPNG.jpgjpgPNGNo
dustin.txttxtTextYes
dustin.xmlxmlTextNo
dustinTextNo

Files.probeContentType(Path) [JDK 7]

Java SE 7 introduced the highly utilitarian Files class and that class's Javadoc succinctly describes its use: "This class consists exclusively of static methods that operate on files, directories, or other types of files" and, "in most cases, the methods defined here will delegate to the associated file system provider to perform the file operations."

The java.nio.file.Files class provides the method probeContentType(Path) that "probes the content type of a file" through use of "the installed FileTypeDetector implementations" (the Javadoc also notes that "a given invocation of the Java virtual machine maintains a system-wide list of file type detectors").

/**
 * Identify file type of file with provided path and name
 * using JDK 7's Files.probeContentType(Path).
 *
 * @param fileName Name of file whose type is desired.
 * @return String representing identified type of file with provided name.
 */
public String identifyFileTypeUsingFilesProbeContentType(final String fileName)
{
   String fileType = "Undetermined";
   final File file = new File(fileName);
   try
   {
      fileType = Files.probeContentType(file.toPath());
   }
   catch (IOException ioException)
   {
      out.println(
           "ERROR: Unable to determine file type for " + fileName
              + " due to exception " + ioException);
   }
   return fileType;
}

When the above Files.probeContentType(Path)-based approach is executed against the set of files previously defined, the output appears as shown in the next screen snapshot.

The screen snapshot indicates that the default behavior for Files.probeContentType(Path) on my JVM seems to be tightly coupled to the file extension. The files with no extensions show "null" for file type and the other listed file types match the files' extensions rather than their actual content. For example, all three files with names starting with "dustin" are really the same single-sentence text file, butFiles.probeContentType(Path) states that they are each a different type and the listed types are tightly correlated with the different file extensions for essentially the same text file.

MimetypesFileTypeMap.getContentType(String) [JDK 6]

The class MimetypesFileTypeMap was introduced with Java SE 6 to provide "data typing of files via their file extension" using "the .mime.types format." The class's Javadoc explains where in a given system the class looks for MIME types file entries. My example uses the ones that come out-of-the-box with my JDK 8 installation. The next code listing demonstrates use of javax.activation.MimetypesFileTypeMap.

/**
 * Identify file type of file with provided name using
 * JDK 6's MimetypesFileTypeMap.
 *
 * See Javadoc documentation for MimetypesFileTypeMap class
 * (http://docs.oracle.com/javase/8/docs/api/javax/activation/MimetypesFileTypeMap.html)
 * for details on how to configure mapping of file types or extensions.
 */
public String identifyFileTypeUsingMimetypesFileTypeMap(final String fileName)
{    
   final MimetypesFileTypeMap fileTypeMap = new MimetypesFileTypeMap();
   return fileTypeMap.getContentType(fileName);
}

The next screen snapshot demonstrates the output from running this example against the set of test files.

This output indicates that the MimetypesFileTypeMap approach returns the MIME type ofapplication/octet-stream for several files including the XML files and the text files without a .txt suffix. We see also that, like the previously discussed approach, this approach in some cases uses the file's extension to determine the file type and so incorrectly reports the file's actual file type when that type is different than what its extension conventionally implies.

URLConnection.getContentType()

I will be covering three methods in URLConnection that support file type detection. The first isURLConnection.getContentType(), a method that "returns the value of the content-type header field." Use of this instance method is demonstrated in the next code listing and the output from running that code against the common test files is shown after the code listing.

/**
 * Identify file type of file with provided path and name
 * using JDK's URLConnection.getContentType().
 *
 * @param fileName Name of file whose type is desired.
 * @return Type of file for which name was provided.
 */
public String identifyFileTypeUsingUrlConnectionGetContentType(final String fileName)
{
   String fileType = "Undetermined";
   try
   {
      final URL url = new URL("file://" + fileName);
      final URLConnection connection = url.openConnection();
      fileType = connection.getContentType();
   }
   catch (MalformedURLException badUrlEx)
   {
      out.println("ERROR: Bad URL - " + badUrlEx);
   }
   catch (IOException ioEx)
   {
      out.println("Cannot access URLConnection - " + ioEx);
   }
   return fileType;
}

The file detection approach using URLConnection.getContentType() is highly coupled to files' extensions rather than the actual file type. When there is no extension, the String returned is "content/unknown."

URLConnection.guessContentTypeFromName(String)

The second file detection approach provided by URLConnection that I'll cover here is its methodguessContentTypeFromName(String). Use of this static method is demonstrated in the next code listing and associated output screen snapshot.

/**
 * Identify file type of file with provided path and name
 * using JDK's URLConnection.guessContentTypeFromName(String).
 *
 * @param fileName Name of file whose type is desired.
 * @return Type of file for which name was provided.
 */
public String identifyFileTypeUsingUrlConnectionGuessContentTypeFromName(final String fileName)
{
   return URLConnection.guessContentTypeFromName(fileName);
}

URLConnection's guessContentTypeFromName(String) approach to file detection shows "null" for files without file extensions and otherwise returns file type String representations that closely mirror the files' extensions. These results are very similar to those provided by the Files.probeContentType(Path)approach shown earlier with the one notable difference being that URLConnection'sguessContentTypeFromName(String) approach identifies files with .xml extension as being of file type "application/xml" while Files.probeContentType(Path) identifies these same files' types as "text/xml".

URLConnection.guessContentTypeFromStream(InputStream)

The third approach I cover that is provided by URLConnection for file type detection is via the class's static method guessContentTypeFromStream(InputStream). A code listing employing this approach and associated output in a screen snapshot are shown next.

/**
 * Identify file type of file with provided path and name
 * using JDK's URLConnection.guessContentTypeFromStream(InputStream).
 *
 * @param fileName Name of file whose type is desired.
 * @return Type of file for which name was provided.
 */
public String identifyFileTypeUsingUrlConnectionGuessContentTypeFromStream(final String fileName)
{
   String fileType;
   try
   {
      fileType = URLConnection.guessContentTypeFromStream(new FileInputStream(new File(fileName)));
   }
   catch (IOException ex)
   {
      out.println("ERROR: Unable to process file type for " + fileName + " - " + ex);
      fileType = "null";
   }
   return fileType;
}

All the file types are null! The reason for this appears to be explained by the Javadoc for the InputStream parameter of the URLConnection.guessContentTypeFromStream(InputStream) method: "an input stream that supports marks." It turns out that the instances of FileInputStream in my examples do not support marks (their calls to markSupported() all return false).

Apache Tika

All of the examples of file detection covered in this post so far have been approaches provided by the JDK. There are third-party libraries that can also be used to detect file types in Java. One example is Apache Tika, a "content analysis toolkit" that "detects and extracts metadata and text from over a thousand different file types." In this post, I look at using Tika's facade class and its detect(String) method to detect file types. The instance method call is the same in the three examples I show, but the results are different because each instance of the Tika facade class is instantiated with a different Detector.

The instantiations of Tika instances with different Detectors is shown in the next code listing.

/** Instance of Tika facade class with default configuration. */
private final Tika defaultTika = new Tika();

/** Instance of Tika facade class with MimeTypes detector. */
private final Tika mimeTika = new Tika(new MimeTypes());
his is 
/** Instance of Tika facade class with Type detector. */
private final Tika typeTika = new Tika(new TypeDetector());

With these three instances of Tika instantiated with their respective Detectors, we can call thedetect(String) method on each instance for the set of test files. The code for this is shown next.

/**
 * Identify file type of file with provided name using
 * Tika's default configuration.
 *
 * @param fileName Name of file for which file type is desired.
 * @return Type of file for which file name was provided.
 */
public String identifyFileTypeUsingDefaultTika(final String fileName)
{
   return defaultTika.detect(fileName);
}

/**
 * Identify file type of file with provided name using
 * Tika's with a MimeTypes detector.
 *
 * @param fileName Name of file for which file type is desired.
 * @return Type of file for which file name was provided.
 */
public String identifyFileTypeUsingMimeTypesTika(final String fileName)
{
   return mimeTika.detect(fileName);
}

/**
 * Identify file type of file with provided name using
 * Tika's with a Types detector.
 *
 * @param fileName Name of file for which file type is desired.
 * @return Type of file for which file name was provided.
 */
public String identifyFileTypeUsingTypeDetectorTika(final String fileName)
{
   return typeTika.detect(fileName);
}

When the three above Tika detection examples are executed against the same set of files are used in the previous examples, the output appears as shown in the next screen snapshot.

We can see from the output that the default Tika detector reports file types similarly to some of the other approaches shown earlier in this post (very tightly tied to the file's extension). The other two demonstrated detectors state that the file type is application/octet-stream in most cases. Because I called the overloaded version of detect(-) that accepts a String, the file type detection is "based on known file name extensions."

If the overloaded detect(File) method is used instead of detect(String), the identified file type results are much better than the previous Tika examples and the previous JDK examples. In fact, the "fake" extensions don't fool the detectors as much and the default Tika detector is especially good in my examples at identifying the appropriate file type even when the extension is not the normal one associated with that file type. The code for using Tika.detect(File) and the associated output are shown next.

   /**
    * Identify file type of file with provided name using
    * Tika's default configuration.
    *
    * @param fileName Name of file for which file type is desired.
    * @return Type of file for which file name was provided.
    */
   public String identifyFileTypeUsingDefaultTikaForFile(final String fileName)
   {
      String fileType;
      try
      {
         final File file = new File(fileName);
         fileType = defaultTika.detect(file);
      }
      catch (IOException ioEx)
      {
         out.println("Unable to detect type of file " + fileName + " - " + ioEx);
         fileType = "Unknown";
      }
      return fileType;
   }

   /**
    * Identify file type of file with provided name using
    * Tika's with a MimeTypes detector.
    *
    * @param fileName Name of file for which file type is desired.
    * @return Type of file for which file name was provided.
    */
   public String identifyFileTypeUsingMimeTypesTikaForFile(final String fileName)
   {
      String fileType;
      try
      {
         final File file = new File(fileName);
         fileType = mimeTika.detect(file);
      }
      catch (IOException ioEx)
      {
         out.println("Unable to detect type of file " + fileName + " - " + ioEx);
         fileType = "Unknown";
      }
      return fileType;
   }

   /**
    * Identify file type of file with provided name using
    * Tika's with a Types detector.
    *
    * @param fileName Name of file for which file type is desired.
    * @return Type of file for which file name was provided.
    */
   public String identifyFileTypeUsingTypeDetectorTikaForFile(final String fileName)
   {
      String fileType;
      try
      {
         final File file = new File(fileName);
         fileType = typeTika.detect(file);
      }
      catch (IOException ioEx)
      {
         out.println("Unable to detect type of file " + fileName + " - " + ioEx);
         fileType = "Unknown";
      }
      return fileType;
   }

Caveats and Customization

File type detection is not a trivial feat to pull off. The Java approaches for file detection demonstrated in this post provide basic approaches to file detection that are often highly dependent on a file name's extension. If files are named with conventional extensions that are recognized by the file detection approach, these approaches are typically sufficient. However, if unconventional file type extensions are used or the extensions are for files with types other than that conventionally associated with that extension, most of these approaches to file detection break down without customization. Fortunately, most of these approaches provide the ability to customize the mapping of file extensions to file types. The Tika approach usingTika.detect(File) was generally the most accurate in the examples shown in this post when the extensions were not the conventional ones for the particular file types.

Conclusion

There are numerous mechanisms available for simple file type detection in Java. This post reviewed some of the standard JDK approaches for file detection and some examples of using Tika for file detection.


File system Java (programming language)

Published at DZone with permission of Dustin Marx, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • Your First Java RDD With Apache Spark
  • Recurrent Workflows With Cloud Native Dapr Jobs
  • Java Virtual Threads and Scaling
  • Java’s Next Act: Native Speed for a Cloud-Native World

Partner Resources

×

Comments
Oops! Something Went Wrong

The likes didn't load as expected. Please refresh the page and try again.

ABOUT US

  • About DZone
  • Support and feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends:

Likes
There are no likes...yet! 👀
Be the first to like this post!
It looks like you're not logged in.
Sign in to see who liked this post!