On concept drift, deployability, and adversarial selection in machine learning-based malware detection

Arun Lakhotia, Anshuman Singh

Research output: Thesis, Doctoral Thesis

Abstract

Machine learning-based methods are used for malware detection because they can automatically learn detection rules from examples. Applying these methods effectively requires addressing problems that arise from the adversarial nature of the malware domain. We address three such problems in this dissertation: concept drift, deployable classifier selection, and adversarial configuration of a selection-based AV system. Concept drift results from nonstationary populations. Malware populations may not be stationary because malware evolves to evade detection. Machine learning methods for malware detection, however, assume that the malware population is stationary, i.e., that the probability distribution of the observed characteristics (features) of the population does not change over time. We investigate this assumption, treating malware families as populations. We propose two measures for tracking concept drift in malware families when feature sets are very large: relative temporal similarity and metafeatures. Our study, which applies the proposed measures to 4000+ samples from three real-world families of x86 malware, spanning over 5 years, shows negligible drift in mnemonic 2-grams extracted from unpacked versions of the samples.
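The stationarity assumption can be probed, in a simplified way, by comparing feature distributions across time windows. The sketch below is a generic illustration only. It is not the relative temporal similarity or metafeature measure proposed in the dissertation, whose definitions are not given in this abstract. It assumes each sample is available as a list of instruction mnemonics and uses cosine similarity between average 2-gram frequency profiles; all function names and the toy data are hypothetical.

```python
from collections import Counter
from math import sqrt


def ngram_frequencies(mnemonics, n=2):
    """Relative frequency of each mnemonic n-gram in a single sample."""
    grams = [tuple(mnemonics[i:i + n]) for i in range(len(mnemonics) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    return {gram: count / total for gram, count in Counter(grams).items()}


def mean_profile(samples, n=2):
    """Average n-gram frequency vector over all samples in one time window."""
    profile = Counter()
    for sample in samples:
        for gram, freq in ngram_frequencies(sample, n).items():
            profile[gram] += freq / len(samples)
    return profile


def cosine_similarity(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(p[g] * q[g] for g in p.keys() & q.keys())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


# Toy data (hypothetical): two time windows of samples from one family.
early_window = [["push", "mov", "call", "ret"], ["push", "mov", "xor", "ret"]]
late_window = [["push", "mov", "call", "ret"], ["xor", "mov", "call", "ret"]]

similarity = cosine_similarity(mean_profile(early_window), mean_profile(late_window))
print(f"similarity between windows: {similarity:.3f}")
```

A similarity that stays close to 1 across successive windows would be consistent with little drift in the chosen features; a steady decline would suggest that the feature distribution is changing.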

A novel classifier selection criterion, called deployability, is proposed. Deployability explicitly takes into account the performance target that the deployed classifier is expected to meet on unseen data. The performance target, in conjunction with an interval estimate of the generalization performance of candidate classifiers, can be used to select deployable classifiers. An evaluation of the criterion shows that the least expected cost classifier may not be deployable for a given cost target, whereas higher expected cost classifiers may be deployable for a given cost target and confidence level.

A game-theoretic model of a dynamic classifier selection-based AV system is proposed. The model takes into account the possible evasion of the selector. A backward induction-based equilibrium solution of the game between adversary and defender gives the optimal configuration of the classifiers in the system, that is, the configuration that minimizes the defender's expected cost.
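The deployability idea can be illustrated with a minimal sketch. It assumes, as one possible reading of the criterion, that a classifier is deployable when the upper confidence bound on its expected cost does not exceed the cost target. The normal-approximation bound, the function names, and the per-fold costs below are illustrative assumptions, not the dissertation's procedure or data.

```python
from math import sqrt
from statistics import mean, stdev


def cost_upper_bound(fold_costs, z=1.645):
    """One-sided upper confidence bound on expected cost.

    Uses a normal approximation (z = 1.645 for roughly 95% confidence),
    which is a simplification chosen for this sketch.
    """
    return mean(fold_costs) + z * stdev(fold_costs) / sqrt(len(fold_costs))


def is_deployable(fold_costs, cost_target, z=1.645):
    """Assumed rule: deployable if the upper bound meets the cost target."""
    return cost_upper_bound(fold_costs, z) <= cost_target


# Hypothetical per-fold costs for two candidate classifiers.
candidates = {
    "A": [0.02, 0.20, 0.03, 0.18, 0.02],  # lower mean cost, high variance
    "B": [0.11, 0.12, 0.10, 0.11, 0.11],  # higher mean cost, low variance
}
cost_target = 0.12

least_cost = min(candidates, key=lambda name: mean(candidates[name]))
deployable = [name for name, costs in candidates.items()
              if is_deployable(costs, cost_target)]

print("least expected cost:", least_cost)
print("deployable at target:", deployable)
```

In this toy setting the classifier with the lowest mean cost has too wide an interval to meet the target, while the one with the higher mean cost meets it, which mirrors the finding reported in the abstract.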
The solutions to each of the three problems would help in effective application of machine learning-based methods to malware detection.

ISBN: 978-1-267-83877-3
Original language: American English
Qualification: Ph.D.
State: Published - Jan 1 2012
Externally published: Yes

Disciplines

  • Computer Sciences
  • Artificial Intelligence and Robotics
