Countering the Risks of Document Hidden Data & Metadata




Introduction to the Risks of Document Hidden Data & Metadata

All popular document file formats can contain a wealth of hidden data & metadata. These include Microsoft Office® Word, Excel, and PowerPoint files, OpenDocument Text, Spreadsheet, and Presentation files, and PDF files.

While hidden data & metadata are useful for finding files and reviewing documents, they pose privacy and confidentiality risks when the files are shared. The hidden data often contains private and sensitive information that, if unintentionally exposed, can embarrass the document creator and their organization, with possible financial and legal implications.

Types of dangerous hidden data that document files might contain include Document Properties, Comments, Tracked Changes, Hidden Text and Objects, and Embedded Images Hidden Data.

The dangers of document hidden data & metadata have been emphasized in recent years by several high-profile incidents in which sensitive data was inadvertently leaked through hidden data stored within document files that were sent to other parties by email or posted on the web.

There are two main approaches for countering the risks of document hidden data:

  1. Removing the hidden data with a reliable hidden data remover

  2. Converting the documents to hidden-data-free PDF


Types of Dangerous Document Hidden Data

Types of dangerous hidden data that document files might contain include the following:

  • Document Properties - Metadata that includes details such as author name, title, subject, keywords, category, status, comments, revision number, and total editing time. Document properties may also include user-defined custom properties, which are non-standard metadata that can be added to a document.
  • Comments - Comments that were added to the document. With each comment, the name of the user who added it and the date and time at which it was added are also saved.
  • Tracked Changes - Tracked changes are changes made to the document while the Track Changes option was enabled. This includes inserted, deleted, modified, and moved text. Every change is saved with the name of the user who made it, as well as the date and time at which it occurred. If the tracked changes are not removed from the document, previous versions of the document can still be viewed.
  • Hidden Text and Objects - Text and objects can be formatted as hidden so they won't be printed. Hidden text and objects also will not appear on the screen unless the application is specifically set to show them.
  • Embedded Images Hidden Data - Hidden data of image files that have been embedded in the document. This might include personally identifiable information (PII) and geographical coordinates.
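
To make the list above concrete, here is a minimal Python sketch that looks for several of these hidden data types inside a Word DOCX file. It relies on the fact that a DOCX file is a ZIP package of XML parts. The function name scan_docx and the shape of the report are illustrative assumptions, not part of any tool mentioned in this paper, and the check is deliberately shallow: it only reads the file and does not remove anything.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the DOCX parts inspected below.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
DC = "http://purl.org/dc/elements/1.1/"
CP = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties"


def scan_docx(data: bytes) -> dict:
    """Report hidden-data indicators found in a DOCX package (read-only).

    Hypothetical helper for illustration; it covers only a few indicators.
    """
    report = {"creator": None, "last_modified_by": None,
              "comments": 0, "tracked_changes": 0, "hidden_runs": 0}
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = set(zf.namelist())
        # Document Properties: author names live in docProps/core.xml.
        if "docProps/core.xml" in names:
            core = ET.fromstring(zf.read("docProps/core.xml"))
            report["creator"] = core.findtext(f"{{{DC}}}creator")
            report["last_modified_by"] = core.findtext(f"{{{CP}}}lastModifiedBy")
        # Comments: one w:comment element per comment.
        if "word/comments.xml" in names:
            comments = ET.fromstring(zf.read("word/comments.xml"))
            report["comments"] = len(comments.findall(f"{{{W}}}comment"))
        if "word/document.xml" in names:
            body = ET.fromstring(zf.read("word/document.xml"))
            # Tracked Changes: insertions and deletions are w:ins / w:del.
            for tag in ("ins", "del"):
                report["tracked_changes"] += sum(
                    1 for _ in body.iter(f"{{{W}}}{tag}"))
            # Hidden Text: counts w:vanish elements only; it does not
            # evaluate w:val attributes or hidden formatting set via styles.
            report["hidden_runs"] = sum(1 for _ in body.iter(f"{{{W}}}vanish"))
    return report
```

Passing the bytes of a document to scan_docx returns a dictionary; a non-empty author name or a non-zero count suggests the file still carries that type of hidden data. An all-clear result from a sketch like this is not proof that a file is clean.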


The Wrong Ways to Remove Document Hidden Data

One method for removing document hidden data that is often recommended on websites and blogs is to use the Remove Properties and Personal Information feature that is built into Windows®. To access this feature, the user selects one or more files in Windows Explorer, right-clicks, and selects "Properties" from the context menu. In the "Properties" window, the user selects the "Details" tab and then clicks the "Remove Properties and Personal Information" link at the bottom.

This feature, however, is completely unsuitable for the task. It can only remove a small number of document properties from old Microsoft Office® 2003 files (Word, Excel, and PowerPoint files with three-letter extensions: DOC, XLS, and PPT). If Microsoft Office® 2007 or above is installed on the computer, this feature can also remove some properties from Microsoft Office® 2007-2021 files (Word, Excel, and PowerPoint files with four-letter extensions: DOCX, XLSX, and PPTX), as well as OpenDocument Text files. In either case, this feature cannot remove any of the other types of dangerous document hidden data, including Comments, Tracked Changes, Hidden Text and Objects, and Embedded Images Hidden Data. PDF is entirely unsupported by this feature.

Another method often recommended on websites and blogs is to use a metadata editing application, or the metadata editing feature of a document editor, to clear the document properties. This is also a wrong way to remove hidden data, since it does not get rid of the other types of dangerous document hidden data, including Comments, Tracked Changes, Hidden Text and Objects, and Embedded Images Hidden Data. Due to unique features of the PDF format, this method might not even remove the cleared document properties. PDF can store document properties in two forms: Document Properties (the document information dictionary) and XMP metadata. This method might clear just one of them. In addition, PDF has an incremental update feature, meaning that deleted objects might only be marked as deleted while still persisting in the file. This feature exists for efficiency reasons. As a result, while cleared properties will not appear in regular viewing applications, it might still be possible to recover them with a specialized application such as a hex editor.
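
The two PDF pitfalls described above can be illustrated with a rough byte-level check in Python. This is a heuristic sketch under stated assumptions, not a PDF parser: the function name scan_pdf is hypothetical, the check works on raw bytes, and it can miss metadata in encrypted files or report extra end-of-file markers for linearized PDFs that were never incrementally updated.

```python
import re


def scan_pdf(data: bytes) -> dict:
    """Heuristic, byte-level indicators of leftover PDF metadata.

    Hypothetical helper for illustration only.
    """
    return {
        # Each saved revision normally ends with its own %%EOF marker, so
        # more than one suggests incremental updates, in which case older
        # (supposedly deleted) objects may still be present in the file.
        "eof_markers": len(re.findall(rb"%%EOF", data)),
        # An /Info reference points to the document information dictionary.
        "has_info_dict": re.search(rb"/Info\s+\d+\s+\d+\s+R", data) is not None,
        # XMP packets are XML and are usually stored uncompressed.
        "has_xmp": b"<x:xmpmeta" in data or b"<?xpacket" in data,
    }
```

If a file that has supposedly been cleaned still reports an information dictionary, an XMP packet, or several end-of-file markers, it deserves a closer look before it is shared.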

Unfortunately, the misleading websites and blogs that suggest the above wrong methods often appear high in search engine results for queries such as "how to remove metadata from pdf".

Removing just the Document Properties portion of document files, while retaining other types of dangerous document hidden data, such as Comments, Tracked Changes, Hidden Text and Objects, and Embedded Images Hidden Data, is a problem that also exists in many third-party tools that are advertised as document metadata scrubbers, both commercial and open-source.
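
One way to spot-check whether a scrubber left Embedded Images Hidden Data behind is to look inside the cleaned file itself. The Python sketch below assumes a ZIP-based document (such as DOCX, XLSX, PPTX, or an OpenDocument file) and flags embedded JPEG images that still appear to carry an Exif segment. The function name images_with_exif is hypothetical, and the check covers JPEG only; other image formats store their metadata differently.

```python
import io
import re
import zipfile

# JPEG stores Exif data in an APP1 segment: marker bytes FF E1, a two-byte
# length, then the identifier "Exif" followed by two zero bytes.
EXIF_APP1 = re.compile(rb"\xff\xe1..Exif\x00\x00", re.DOTALL)


def images_with_exif(data: bytes) -> list:
    """List embedded JPEG parts that appear to contain an Exif segment.

    Hypothetical helper for illustration; JPEG only.
    """
    flagged = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for name in zf.namelist():
            if not name.lower().endswith((".jpg", ".jpeg")):
                continue
            if EXIF_APP1.search(zf.read(name)):
                flagged.append(name)
    return flagged
```

An empty list means no Exif segment was detected in the embedded JPEG images; it does not rule out other kinds of image metadata.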


The Slow and Dangerous Way to Remove Document Hidden Data

In recent years, several online services for removing hidden data & metadata from document files have appeared. They can be accessed from any web browser.

Using online hidden data & metadata removers is generally slower than using offline applications, since they require uploading the files to a server, waiting for them to be cleaned, and then downloading the cleaned files. They are also generally less convenient to use, especially for cleaning multiple files at once.

When using an online service, there is also a danger that uploaded files will be misused in a way that compromises privacy and confidentiality, as well as a danger of a data breach. It is therefore imperative to make sure the service is trustworthy and employs adequate data security measures.


The Right Way to Remove Document Hidden Data

The best way to remove hidden data & metadata from document files is to use a specialized, privacy-oriented hidden data & metadata removal tool that is dedicated to the document file types one wishes to clean and supports a broad spectrum of hidden data types. Such an application will permanently remove the supported hidden data types, so they cannot be recovered from the files. An offline application that runs on the local computer is preferable.

For Microsoft Office® and OpenDocument files, the Document Inspector tool that is built into Microsoft Office® has decent coverage of the hidden data types it can remove. To access this feature, the user opens the document in Microsoft Office® and selects "Info" from the "File" menu. In the "Info" window, the user clicks "Check for Issues" and then selects "Inspect Document". The user is then presented with a window where the document can be inspected for selected hidden data types. The Document Inspector can remove a number of hidden data types, but can only warn about the presence of a few others. The Document Inspector cannot remove hidden data from PDF files. Notable disadvantages of this tool are the many steps required to remove hidden data with it, as well as the lack of a batch removal capability for cleaning multiple documents at once.

BatchPurifier™ can permanently remove hidden data & metadata from multiple files at once. It supports multiple types of document, image, and media file formats, including Microsoft Office® Word, Excel, and PowerPoint files, OpenDocument Text, Spreadsheet, and Presentation files, and PDF files. It is an offline desktop application for Windows that supports a broad spectrum of hidden data types.

Document file types are very intricate and complex. They are also extensible. Using a well-designed hidden data & metadata remover can considerably reduce the amount of dangerous hidden data & metadata left in the files it processes. However, using such a tool does not necessarily guarantee that no hidden data is left in the files, or that the processed files are anonymized. In some cases, one might be able to learn new information about a document and its authors by closely examining processed files for hidden data, either from the embedded hidden data on its own or by combining it with data from other sources.


Converting Documents to PDF to Avoid Hidden Data Risks

Another approach to counter the problem of hidden data risks is to convert the documents to PDF. When using this approach, it's imperative to make sure the converting software does not pass through the metadata of the source documents and does not introduce new metadata into the PDF. If that cannot be assured, the risk can be countered by processing the PDF files with a PDF hidden data & metadata remover (such as BatchPurifier™). An offline PDF converting application that runs on the local computer is preferable to an online service.

This approach has disadvantages, since features of the original files that one might wish to preserve in the shared files are lost in the conversion. For instance, one might want to preserve speaker notes in presentations and only get rid of other hidden data types. It might also be desirable at times for the shared files to be editable, whereas editing PDF files is generally less convenient, as the format wasn't designed for it. Text documents and presentations generally convert well to PDF. Converting spreadsheets to PDF, on the other hand, often makes the resulting files useless.

