Paper Note | A Lightweight Framework for Function Name Reassignment Based on Large-Scale Stripped Binaries

April 16, 2023

Publication: ISSTA 2021

Abstract

Software in the wild is usually released as stripped binaries that contain no debug information (e.g., function names). This paper studies the issue of reassigning descriptive names for functions to help facilitate reverse engineering. Since the essence of this issue is a data-driven prediction task, persuasive research should be based on sufficiently large-scale and diverse data. However, prior studies can only be based on small-scale datasets because their techniques suffer from heavyweight binary analysis, making them powerless in the face of big-size and large-scale binaries.

This paper presents the Neural Function Rename Engine (NFRE), a lightweight framework for function name reassignment that utilizes both sequential and structural information of assembly code. NFRE uses fine-grained and easily acquired features to model assembly code, making it more effective and efficient than existing techniques. In addition, we construct a large-scale dataset and present two data-preprocessing approaches to help improve its usability. Benefiting from the lightweight design, NFRE can be efficiently trained on the large-scale dataset, thereby having better generalization capability for unknown functions. The comparative experiments show that NFRE outperforms two existing techniques by a relative improvement of 32% and 16%, respectively, while the time cost for binary analysis is much less.

Problem Addressed and Novelty

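The paper targets the problem of naming functions in stripped binaries. It argues that prior work (DEBIN, Nero) was evaluated on relatively small datasets, and that dataset size affects the final results, so this work trains and tests on a larger dataset. It further argues that prior work cannot scale to larger datasets because its binary analysis is too heavyweight, so this work adopts a more lightweight analysis.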

Claimed Contributions

We present NFRE, a lightweight framework for the reassignment of function names in stripped binaries. It does not require heavyweight binary analysis, so it can be efficiently trained and evaluated on large-scale binaries. It also has a wider application scope than Nero in design.
We summarize the label noise and sparsity problems and present two data-preprocessing approaches to help mitigate them. In this way, we improve the usability of the large-scale dataset in a (semi-)automated manner.
We conduct extensive experiments to evaluate NFRE and validate our intuitions. The results demonstrate the significance of data preprocessing and show NFRE outperforms existing techniques, DEBIN and Nero, by a relative improvement of 32% and 16%, respectively, while the time cost for feature extraction is much less.

Overall Approach

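At a high level, NFRE takes a binary as input and, after obtaining its disassembled instructions, (1) normalizes the instructions to alleviate the out-of-vocabulary (OOV) problem, for example by replacing constants; (2) builds an instruction-level CFG from the instructions, in which each node is a single instruction rather than a basic block, which keeps the analysis lightweight; and (3) applies DeepWalk to the instruction-level CFG to obtain graph embeddings. The embeddings are then fed into an encoder-decoder prediction model that predicts the function name. In the paper's words: "The structural information is used in a pre-training manner. First, we perform structural instruction embedding based on CFGs, representing instructions as embeddings (i.e., high-dimensional numerical vectors) that potentially aggregate the control-flow information. Then we use the pre-trained embeddings as the input of the neural model so that the model can benefit from the structural information."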
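The three preprocessing steps can be illustrated with a minimal Python sketch. It is not NFRE's implementation. The `CONST` placeholder, the toy `(address, text)` instruction format, and the helper names `normalize`, `build_instruction_cfg`, and `random_walks` are assumptions made for illustration. Only the random-walk half of DeepWalk is shown; the resulting walks would then be passed to a skip-gram model to learn the instruction embeddings.

```python
import random
import re


def normalize(insn):
    """Replace concrete constants with a placeholder token (assumed: CONST)."""
    insn = re.sub(r"\b0x[0-9a-fA-F]+\b", "CONST", insn)
    return re.sub(r"\b\d+\b", "CONST", insn)


def build_instruction_cfg(insns):
    """Build a CFG whose nodes are single instructions, not basic blocks.

    insns is a list of (address, text) pairs for one function (toy format).
    Returns an adjacency dict keyed by instruction index.
    """
    addr_to_idx = {addr: i for i, (addr, _) in enumerate(insns)}
    edges = {i: [] for i in range(len(insns))}
    for i, (_, text) in enumerate(insns):
        mnemonic, _, operand = text.partition(" ")
        if mnemonic.startswith("j"):
            try:
                target = addr_to_idx.get(int(operand, 16))
            except ValueError:
                target = None  # indirect jump: target unknown in this sketch
            if target is not None:
                edges[i].append(target)
        # Fall through to the next instruction unless control never continues.
        if mnemonic not in ("jmp", "ret") and i + 1 < len(insns):
            edges[i].append(i + 1)
    return edges


def random_walks(edges, tokens, num_walks=2, walk_len=5, seed=0):
    """Generate truncated random walks over the CFG, as token sequences."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in edges:
            node, walk = start, [tokens[start]]
            while len(walk) < walk_len and edges[node]:
                node = rng.choice(edges[node])
                walk.append(tokens[node])
            walks.append(walk)
    return walks


# Toy function: compare, conditional jump, one instruction, return.
insns = [
    (0x10, "cmp eax, 0x5"),
    (0x13, "jle 0x1a"),
    (0x15, "mov ebx, 0x1"),
    (0x1A, "ret"),
]
tokens = [normalize(text) for _, text in insns]
cfg = build_instruction_cfg(insns)
for walk in random_walks(cfg, tokens):
    print(" | ".join(walk))
```

Each walk reads like a "sentence" of normalized instructions. That is why a word-embedding model can be reused to embed instructions while capturing some control-flow context.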

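In addition, the work designs methods to address two problems in building the training dataset: label noise (a binary classifier is trained to identify function names that are not meaningful) and label sparsity (synonyms are identified, e.g., Message and Msg, so that getMessageType and getMsgType are treated as function names with the same meaning).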
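The label-sparsity idea can be sketched as follows, assuming function names are split into word tokens and that a synonym table is available. The `SYNONYMS` table and the helpers `split_name` and `canonical_label` are hypothetical. How the paper actually finds synonyms is not reproduced here.

```python
import re

# Hypothetical synonym table mapping abbreviations to a canonical word.
SYNONYMS = {"msg": "message"}


def split_name(name):
    """Split a camelCase or snake_case function name into lowercase tokens."""
    tokens = []
    for part in re.split(r"_+", name):
        tokens += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part)
    return [t.lower() for t in tokens]


def canonical_label(name):
    """Map each token to its canonical synonym so equivalent names collapse."""
    return [SYNONYMS.get(t, t) for t in split_name(name)]


print(canonical_label("getMsgType"))
print(canonical_label("getMessageType"))
```

Both names map to the same token sequence, so the model sees one label instead of two rare ones.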