Advancements in File Similarity Techniques: Traditional and Modern Approaches for Malware Detection

Udbhav Prasad

DOI: https://doi.org/10.14445/22312803/IJCTT-V72I12P118

Research Article | Open Access

Volume 72 | Issue 12 | Year 2024 | Article Id. IJCTT-V72I12P118

Received: 04 Nov 2024 | Revised: 30 Nov 2024 | Accepted: 17 Dec 2024 | Published: 31 Dec 2024

Citation:

Udbhav Prasad, "Advancements in File Similarity Techniques: Traditional and Modern Approaches for Malware Detection," International Journal of Computer Trends and Technology (IJCTT), vol. 72, no. 12, pp. 144-152, 2024. Crossref, https://doi.org/10.14445/22312803/IJCTT-V72I12P118

Abstract

Threat hunting, malware analysis, and digital forensics often rely on signatures to identify malicious executables. Cryptographic hashes identify a particular file uniquely, but attackers often tailor their malware to particular systems, releasing variants that target different platforms, operating systems, and even specific organizations or governments, and each variant produces a different hash. As attacks become more sophisticated, security researchers have proposed “similarity” digests that attempt to overcome the limitations of cryptographic hashes and other traditional signatures by detecting variants of an executable. Modern enterprises manage tens of thousands of endpoints holding billions of files, so the scalability of these techniques matters more than ever. This survey reviews traditional file similarity digests, such as ssdeep, sdhash, and TLSH, alongside emerging technologies such as embeddings and vector databases. By classifying and comparing these techniques, the paper highlights their strengths, weaknesses, and practical applications in malware detection. Key contributions include a structured taxonomy of methods and insights into integrating traditional digests with modern vector database solutions for scalable, efficient detection. This work provides a roadmap for future research and development in this critical domain.

Keywords

Cybersecurity, Malware detection, File similarity, Fuzzy digests, Vector databases.
