U.S. Department of Energy
Office of Scientific and Technical Information

Toward the Detection of Polyglot Files

Conference

Standardized file types play a key role in the development and use of computer software. However, standardized file type processing can be confounded by a file crafted to be valid as more than one file type. The resulting polyglot ("many languages") file can confuse file type identification, allowing elements of the file to evade analysis. This is especially problematic for malware detection systems that rely on file type identification for feature extraction. Although work has been done to identify file types using more comprehensive methods than file signatures, accurate identification of polyglot files remains an open problem. Because malware detection systems routinely perform file type-specific feature extraction, polyglot files need to be filtered out before ingestion by these systems; otherwise, malicious content could pass through undetected. To address the problem of polyglot detection, we assembled a data set using the mitra tool. We then evaluated the performance of the most commonly used file identification tools, including file, polydet, binwalk, and TrID. Our analysis demonstrates that existing file type detection tools fail to provide reliable polyglot detection. We next evaluated the ability of a range of machine learning and deep learning models to detect polyglot files. The best-performing models were MalConv2 and CatBoost, which achieved the highest recall on our data set, 95.16% and 95.45%, respectively. These models outperformed existing methods and could be incorporated into a malware detector’s file processing pipeline to filter out potentially malicious polyglots before file type-dependent feature extraction takes place.
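To make the polyglot idea concrete, the following is a minimal Python sketch of a signature check that reports every format whose marker it finds, rather than stopping at the first match. It is an illustration only, not the method evaluated in the paper: the three-entry signature table and the function names (`matching_types`, `looks_like_polyglot`) are assumptions introduced here. It relies on the fact that different formats are anchored at different places in a file (GIF at the start, ZIP at the end, and PDF, in some readers, anywhere near the start).

```python
from typing import Callable, Dict, List


def _is_gif(data: bytes) -> bool:
    # GIF is anchored at offset 0.
    return data.startswith((b"GIF87a", b"GIF89a"))


def _is_pdf(data: bytes) -> bool:
    # Assumption: a lenient reader that tolerates the header
    # anywhere in the first 1024 bytes.
    return b"%PDF-" in data[:1024]


def _is_zip(data: bytes) -> bool:
    # ZIP is read from the end: search the tail for the
    # end-of-central-directory marker (22-byte record plus
    # a comment of up to 65535 bytes).
    return b"PK\x05\x06" in data[-65557:]


# Hypothetical, deliberately tiny signature table.
CHECKS: Dict[str, Callable[[bytes], bool]] = {
    "gif": _is_gif,
    "pdf": _is_pdf,
    "zip": _is_zip,
}


def matching_types(data: bytes) -> List[str]:
    """Return every format in CHECKS whose signature is present."""
    return [name for name, check in CHECKS.items() if check(data)]


def looks_like_polyglot(data: bytes) -> bool:
    """Flag data that matches more than one signature."""
    return len(matching_types(data)) > 1


if __name__ == "__main__":
    # Synthetic bytes: a GIF header followed by a bare ZIP end record.
    sample = b"GIF89a" + b"\x00" * 10 + b"PK\x05\x06" + b"\x00" * 18
    print(matching_types(sample), looks_like_polyglot(sample))
```

A check like this only finds marker collisions. It does not validate either format, and it would flag benign files that merely contain another format's marker, which is consistent with the abstract's observation that signature-style tools do not provide reliable polyglot detection.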

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-00OR22725
OSTI ID:
3002965
Country of Publication:
United States
Language:
English

Similar Records

Toward the Detection of Polyglot Files
Conference · Mon Aug 08 00:00:00 EDT 2022 · OSTI ID:1885926

Beyond the Hype: An Evaluation of Commercially Available Machine-Learning-Based Malware Detectors
Journal Article · Wed Feb 15 23:00:00 EST 2023 · Digital Threats: Research and Practice · OSTI ID:1965262

Deep PDF parsing to extract features for detecting embedded malware.
Technical Report · Thu Sep 01 00:00:00 EDT 2011 · OSTI ID:1030303
