Countering the Risks of Document Hidden Data & Metadata
Introduction to the Risks of Document Hidden Data & Metadata

All popular document file formats can contain a wealth of hidden data & metadata. These include Microsoft Office® Word, Excel, and PowerPoint files, OpenDocument Text, Spreadsheet, and Presentation files, and PDF files. While hidden data & metadata are useful for finding files and reviewing documents, they pose privacy and confidentiality risks when the files are shared. The hidden data often contains private and sensitive information that, if unintentionally exposed, can embarrass the document creator and their organization, with possible financial and legal implications.

Types of dangerous hidden data that document files might contain include Document Properties, Comments, Tracked Changes, Hidden Text and Objects, and Embedded Images Hidden Data.

The dangers of document hidden data & metadata have been emphasized in recent years by several high-profile incidents in which sensitive data was inadvertently leaked through the hidden data stored within document files that were sent to other parties by email or posted on the web.

There are two main approaches for countering the risks of document hidden data: removing the hidden data from the files before sharing them, and converting the documents to PDF.
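Types of Dangerous Document Hidden Data

Types of dangerous hidden data that document files might contain include the following: Document Properties, Comments, Tracked Changes, Hidden Text and Objects, and Embedded Images Hidden Data.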
The Wrong Ways to Remove Document Hidden Data

One method for removing document hidden data, often seen on websites and blogs, is to use the Remove Properties and Personal Information feature that is built into Windows®. To access this feature, the user selects one or more files in Windows Explorer, right-clicks, and selects "Properties" from the context menu. In the "Properties" window, the user selects the "Details" tab and then clicks the "Remove Properties and Personal Information" link at the bottom.

This feature, however, is completely unsuitable for the task. It can only remove a small number of document properties from old Microsoft Office® 2003 files (Word, Excel, and PowerPoint files with three-letter extensions: DOC, XLS, and PPT). If Microsoft Office® 2007 or above is installed on the computer, this feature can also remove some properties from Microsoft Office® 2007-2021 files (Word, Excel, and PowerPoint files with four-letter extensions: DOCX, XLSX, and PPTX), as well as OpenDocument Text files. In either case, this feature cannot remove any of the other types of dangerous document hidden data, including Comments, Tracked Changes, Hidden Text and Objects, and Embedded Images Hidden Data. PDF is entirely unsupported by this feature.

Another method often seen on websites and blogs is to use a metadata editing application, or the metadata editing feature of a document editor, to clear the document properties. This is a wrong way to remove hidden data as well, since it does not get rid of the other types of dangerous document hidden data, including Comments, Tracked Changes, Hidden Text and Objects, and Embedded Images Hidden Data. Due to unique features of the PDF format, this method might not even remove the cleared document properties. PDF can store document properties in two forms: Document Properties and XMP metadata. This method might clear just one of them.
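
The two PDF metadata stores mentioned above can be spotted with a rough byte-level scan. The following Python sketch is only a heuristic under stated assumptions: it assumes the XMP packet is stored uncompressed and that the trailer's /Info entry is visible in the raw bytes. The function name scan_pdf_metadata and the sample bytes are hypothetical, and the sample is a hand-made fragment rather than a valid PDF. The sketch also counts %%EOF markers, since more than one can hint that the file was saved with incremental updates (or that it is linearized). It is not a substitute for a real PDF parser or a removal tool.

```python
import re


def scan_pdf_metadata(data: bytes) -> dict:
    """Heuristically look for metadata markers in raw PDF bytes."""
    return {
        # Trailer entry that points at the document information dictionary.
        "info_dictionary": re.search(rb"/Info\s+\d+\s+\d+\s+R", data) is not None,
        # XMP packets are assumed here to be stored as uncompressed XML.
        "xmp_metadata": b"<x:xmpmeta" in data or b"<?xpacket begin" in data,
        # Each appended revision normally ends with its own %%EOF marker.
        "eof_markers": data.count(b"%%EOF"),
    }


# Hypothetical hand-made fragment for illustration only (not a valid PDF).
sample = (
    b"%PDF-1.7\n"
    b"1 0 obj << /Title (Draft) /Author (Alice) >> endobj\n"
    b"trailer << /Root 2 0 R /Info 1 0 R >>\n%%EOF\n"
    b"3 0 obj << /Type /Metadata /Subtype /XML >> stream\n"
    b"<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>"
    b"<x:xmpmeta xmlns:x='adobe:ns:meta/'></x:xmpmeta>\n"
    b"endstream endobj\n"
    b"trailer << /Root 2 0 R /Info 1 0 R /Prev 0 >>\n%%EOF\n"
)

report = scan_pdf_metadata(sample)
print(report)
```

A file that still reports an Info dictionary or an XMP packet after its properties were supposedly cleared deserves a closer look with a dedicated tool.
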
In addition, PDF has an incremental update feature, meaning deleted objects might only be marked as deleted while still persisting in the file. This feature exists for efficiency reasons. As a result, while cleared properties will not appear in regular viewing applications, it might still be possible to recover them with a specialized application such as a hex editor.

Unfortunately, the misleading websites and blogs that suggest the above wrong methods often appear high in search engine results for queries such as "how to remove metadata from pdf".

Removing just the Document Properties portion of document files, while retaining other types of dangerous document hidden data, such as Comments, Tracked Changes, Hidden Text and Objects, and Embedded Images Hidden Data, is a problem that also exists in many third-party tools that are advertised as document metadata scrubbers, both commercial and open-source.

The Slow and Dangerous Way to Remove Document Hidden Data

In recent years, several developers have created online web services for removing hidden data & metadata from document files, accessible from any web browser. Using online hidden data & metadata removers is generally slower than using offline applications, since they require uploading the files to the server, waiting for them to be cleaned, and then downloading the cleaned files. They are also generally less convenient to use, especially for cleaning multiple files at once.

When using an online service, there is also a danger that uploaded files will be misused in a way that compromises privacy and confidentiality, as well as a danger of a data breach. It is therefore imperative to make sure the service is trustworthy and employs adequate data security measures.
The Right Way to Remove Document Hidden Data

The best way to remove hidden data & metadata from document files is to use a specialized privacy-oriented hidden data & metadata removal tool, dedicated to the document file types one wishes to clean, with a broad spectrum of supported hidden data types. Such an application will permanently remove the supported hidden data types, so they cannot be recovered from the files. An offline application that runs on the local computer is preferable.

For Microsoft Office® and OpenDocument files, the built-in Document Inspector tool included in Microsoft Office® has decent coverage of the hidden data types it can remove. To access this feature, the user opens the document in Microsoft Office® and selects "Info" from the "File" menu. In the "Info" window, the user clicks "Check for Issues" and then selects "Inspect Document". The user is then presented with a window where the document can be inspected for selected hidden data types. The Document Inspector can remove a few hidden data types, and can only warn about the presence of a few others. The Document Inspector cannot remove hidden data from PDF files. Notable disadvantages of this tool are the many steps required to remove hidden data with it, as well as the lack of a batch removal capability for cleaning multiple documents at once.

BatchPurifier™ can permanently remove hidden data & metadata from multiple files at once. It supports multiple types of document, image, and media file formats, including Microsoft Office® Word, Excel, and PowerPoint files, OpenDocument Text, Spreadsheet, and Presentation files, and PDF files. It is an offline desktop application for Windows, with a broad spectrum of supported hidden data types.

Document file types are very intricate and complex. They are also extensible.
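
As a rough illustration of this intricacy, a DOCX file is a ZIP package, and different kinds of hidden data live in different parts of it. The Python sketch below checks a few well-known locations: the docProps parts for document properties, word/comments.xml for comments, and the w:ins, w:del, and w:vanish elements in word/document.xml for tracked changes and hidden text. The function name list_docx_hidden_data and the tiny in-memory sample package are hypothetical, and the sample is not a valid Word document. The checks are heuristics that cover only a small subset of hidden data types, and only for DOCX.

```python
import io
import re
import zipfile


def list_docx_hidden_data(docx_bytes: bytes) -> list[str]:
    """Report a few kinds of hidden data found in a DOCX package (heuristic)."""
    findings = []
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        names = set(package.namelist())
        # Core and extended document properties (author, company, and so on).
        if "docProps/core.xml" in names or "docProps/app.xml" in names:
            findings.append("document properties")
        # Reviewer comments are stored in their own part.
        if "word/comments.xml" in names:
            findings.append("comments")
        body = b""
        if "word/document.xml" in names:
            body = package.read("word/document.xml")
        # Tracked insertions and deletions appear as w:ins / w:del elements.
        if re.search(rb"<w:(ins|del)[\s>]", body):
            findings.append("tracked changes")
        # The w:vanish run property marks hidden text.
        if re.search(rb"<w:vanish[\s/>]", body):
            findings.append("hidden text")
    return findings


# Build a hypothetical minimal package in memory for illustration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr(
        "docProps/core.xml",
        "<cp:coreProperties><dc:creator>Alice</dc:creator></cp:coreProperties>",
    )
    package.writestr(
        "word/document.xml",
        '<w:document><w:body><w:p><w:ins w:author="Alice">'
        "<w:r><w:t>new</w:t></w:r></w:ins></w:p></w:body></w:document>",
    )

findings = list_docx_hidden_data(buffer.getvalue())
print(findings)
```

Even this short list of checks touches several separate parts of the package, which is why clearing document properties alone leaves most hidden data in place.
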
Using a well-designed hidden data & metadata remover can considerably reduce the amount of dangerous hidden data & metadata left in the files it processes. However, using such a tool does not necessarily guarantee that no hidden data is left in the files, and does not necessarily guarantee that processed files are anonymized. In some cases, one might be able to learn new information about the document and its authors by closely examining processed files for hidden data, either from the embedded hidden data on its own or by combining it with data from other sources.

Converting Documents to PDF to Avoid Hidden Data Risks

Another approach to counter the problem of hidden data risks is to convert the documents to PDF. When using this approach, it's imperative to make sure the converting software does not pass through the metadata of the source documents and does not introduce new metadata into the PDF. If that cannot be assured, the risk can be countered by processing the PDF files with a PDF hidden data & metadata remover (such as BatchPurifier™). An offline PDF converting application that runs on the local computer is preferable to an online service.

This approach has disadvantages, since features of the original files that one might wish to preserve in the shared files are lost in the conversion. For instance, one might want to preserve speaker notes in presentations and only get rid of other hidden data types. It might also be desirable at times for the shared files to be editable, whereas editing PDF files is generally less convenient, as the format wasn't designed for it. Text documents and presentations generally convert well to PDF. Converting spreadsheets to PDF, on the other hand, often makes the resulting files useless.
©2024 Digital Confidence Ltd. All rights reserved.