
Automate Metadata Cleaning Before External Sharing

How to use GroupDocs .NET libraries to clean metadata from digital documents.


What is Metadata?

Here is an excerpt from the Wikipedia article about Metadata:

Metadata were traditionally used in the card catalogs of libraries. As information has become increasingly digital, metadata is also used to describe digital data using metadata standards specific to a particular discipline. Describing the contents and context of data or data files increases their usefulness. For example, a web page may include metadata specifying what language the page is written in, what tools were used to create it, and where to find more information about the subject; this metadata can automatically improve the reader's experience.

What is the Purpose of Metadata?

Again, an excerpt from the Wikipedia article explains the purpose of Metadata:

A main purpose of metadata is to facilitate in the discovery of relevant information, more often classified as resource discovery. Metadata also helps organize electronic resources, provide digital identification, and helps support archiving and preservation of the resource. Metadata assists in resource discovery by "allowing resources to be found by relevant criteria, identifying resources, bringing similar resources together, distinguishing dissimilar resources, and giving location information."

What Are the Common Metadata Types?

Below are some common metadata types found in documents and images:

  1. Built-in Document Properties (Office Documents)

  2. Custom Document Properties (Office Documents)

  3. XMP Metadata (PDF Documents and Images including JPEG/JPG, PNG, PSD and GIF)

  4. EXIF Metadata (Images including JPEG/JPG and TIFF)

  5. IPTC-IIM (JPEG/JPG Images)

  6. Email Metadata (Email Documents including EML and MSG formats)

Why Clean Metadata?

Although metadata is useful, it often contains confidential information, such as author names, comments, and internal property values. Because of this, organizations want to avoid leaking the metadata attached to documents and images. In most cases, they will want to clean metadata from digital media before sharing it externally.

Why Automate Metadata Cleaning?

An organization may hold hundreds or thousands of documents and images. Manually cleaning the metadata from that many files can be a nightmare. For example, please visit some links to get an idea about manually cleaning such metadata:

Automating metadata cleaning removes most of that manual effort and makes the process repeatable. The next question is: how do you automate it?

Introduction to GroupDocs.Metadata for .NET

GroupDocs.Metadata for .NET provides public classes and interfaces to create, manipulate, clean, search, and compare metadata associated with Microsoft Office documents, email, PDF, AutoCAD, and image formats. Since it is a .NET API, developers can use it from any .NET language or framework, such as C#, VB.NET, and ASP.NET.

For a more specific chart of the supported features and supported document and image formats, please visit:

Criteria-Based Metadata Cleaning

Although GroupDocs.Metadata for .NET provides many more features for working with document and image metadata, this article focuses only on metadata cleaning. A console-based project with C# and VB.NET examples of the features offered by GroupDocs.Metadata for .NET can be found on GitHub: https://github.com/groupdocs-metadata/GroupDocs.Metadata-for-.NET/tree/master/Examples

All you have to do is download the Examples project and follow these instructions to run the feature usage examples.

Coming back to metadata cleaning, let's envision a real-life use case where you want to clean metadata from the documents created by a certain author. The steps are:

  1. Getting all documents created by a particular author from a directory.

  2. Cleaning metadata from all the documents found in that collection.

  3. Saving the metadata-free documents to another directory.

Let's have a look at C# code that performs the above steps using GroupDocs.Metadata for .NET:


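        // Requires using directives for System, System.Collections.Generic, and System.IO,
        // plus the GroupDocs.Metadata namespaces referenced by the Examples project.
        /// <summary>
        /// Removes comments and custom document properties from files created by the specified author
        /// </summary>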
        /// <param name="authorName">Author name</param>
        public void RemoveMetadataByAuthor(string authorName)
        {
            // Map directory in source folder
            string sourceDirectoryPath = Common.MapSourceFilePath(this.DocumentsPath);

            // get all files present in the source directory
            string[] files = Directory.GetFiles(sourceDirectoryPath);

            foreach (string path in files)
            {
                // recognize format
                FormatBase format = FormatFactory.RecognizeFormat(path);

                // proceed only if the file was recognized as a DocFormat
                DocFormat docFormat = format as DocFormat;
                if (docFormat != null)
                {
                    // get document properties
                    DocMetadata properties = docFormat.DocumentProperties;

                    // check whether the author matches (case-insensitive)
                    if (string.Equals(properties.Author, authorName, StringComparison.OrdinalIgnoreCase))
                    {
                        // remove comments
                        docFormat.ClearComments();

                        List<string> customKeys = new List<string>();

                        // collect all custom keys first, so the collection is not modified while it is enumerated
                        foreach (KeyValuePair<string, PropertyValue> keyValuePair in properties)
                        {
                            if (!properties.IsBuiltIn(keyValuePair.Key))
                            {
                                customKeys.Add(keyValuePair.Key);
                            }
                        }

                        // and remove all of them
                        foreach (string key in customKeys)
                        {
                            properties.Remove(key);
                        }
                        // save the cleaned document to the destination directory
                        string fileName = Path.GetFileName(path);
                        string outputFilePath = Common.MapDestinationFilePath(this.DocumentsPath + "/" + fileName);
                        docFormat.Save(outputFilePath);
                    }
                }
            }

            Console.WriteLine("Press any key to exit.");
        }
    } // closes the enclosing example class, whose declaration is not shown in this excerpt
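
The listing above gathers the custom keys into a list before removing them. This is most likely because removing entries from a collection while enumerating it can invalidate the enumerator. The sketch below isolates that collect-then-remove pattern using only the .NET base class library. It is not the GroupDocs API: the CustomPropertyCleaner class, the BuiltInKeys set, and the plain string dictionary are hypothetical stand-ins for DocMetadata and its IsBuiltIn check.

```csharp
using System;
using System.Collections.Generic;

public static class CustomPropertyCleaner
{
    // Hypothetical list of built-in property names; the real library
    // decides this through its own IsBuiltIn check.
    private static readonly HashSet<string> BuiltInKeys =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "Author", "Title", "Subject" };

    // Removes every key that is not built-in and returns the removed keys.
    public static List<string> RemoveCustomKeys(IDictionary<string, string> properties)
    {
        var customKeys = new List<string>();

        // First pass: only read from the dictionary and remember the custom keys.
        foreach (KeyValuePair<string, string> pair in properties)
        {
            if (!BuiltInKeys.Contains(pair.Key))
            {
                customKeys.Add(pair.Key);
            }
        }

        // Second pass: enumeration is finished, so removing entries is safe.
        foreach (string key in customKeys)
        {
            properties.Remove(key);
        }

        return customKeys;
    }
}
```

Calling CustomPropertyCleaner.RemoveCustomKeys on a dictionary that holds Author, ReviewNotes, and ClientId entries would leave only Author in the dictionary and return the other two keys.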

Feedback

I hope you found this article useful in solving a real-life metadata use case with the GroupDocs.Metadata for .NET API. Many other real-life metadata processing use cases can be accomplished using the multitude of metadata features and file formats supported by the API.


Topics:
c#.net, document management, document sharing, image manipulation, metadata, protecting shared data
