개요

파일 위치가 바뀌어도 동작하도록 확장 가능한 path 설정

# Walk the project root once and locate the two dataset files by filename
# pattern, so the paths keep working even if the files move between subfolders.
root_path = "/lockard_ai/works/malware_detection/"
# NOTE: the loop variables were `dir` and `file`, which shadow Python builtins;
# renamed to the conventional os.walk names. Behavior is unchanged.
for dirpath, dirnames, filenames in os.walk(root_path):
    for fname in filenames:
        if "preprocessed_data" in fname:
            origin_data_path = os.path.join(dirpath, fname)
            print(origin_data_path)
        if "KISA_total.csv" in fname:
            KISA_dataset_path = os.path.join(dirpath, fname)
            print(KISA_dataset_path)

진행

Untitled

EDA, 전처리 없이 ExtraTreesClassifier, SelectFromModel 모델 학습 진행

def extratree_process(X, y, feature_names=None, threshold=1.8e-2, test_size=0.2):
    """Select important features with ExtraTrees, then compare several models.

    Fits an ExtraTreesClassifier on (X, y), keeps the features whose importance
    exceeds ``threshold`` via SelectFromModel, splits the reduced data, and
    trains/scores a set of models on the hold-out split.

    Parameters
    ----------
    X, y : array-like
        Full feature matrix and target labels.
    feature_names : sequence of str, optional
        Column names aligned with X's columns. When omitted, falls back to the
        module-level ``dataset.columns`` (the original implicit global).
    threshold : float, optional
        Importance cutoff passed to SelectFromModel (original hard-coded 1.8e-2).
    test_size : float, optional
        Hold-out fraction for train_test_split (original hard-coded 0.2).

    Returns
    -------
    tuple
        (fitted RandomForest model, list of selected feature names,
         dict mapping model name -> hold-out score).
    """
    if feature_names is None:
        # Preserve the original behavior of reading the global `dataset`.
        feature_names = dataset.columns

    extratrees = ek.ExtraTreesClassifier().fit(X, y)
    selector = SelectFromModel(extratrees, prefit=True, threshold=threshold)
    X_new = selector.transform(X)
    nbfeatures = X_new.shape[1]
    X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=test_size)

    # Rank the kept features by importance (descending) and report them.
    index = np.argsort(extratrees.feature_importances_)[::-1][:nbfeatures]
    print("index : ", index)
    features = []
    for f in range(nbfeatures):
        print("%d. feature %s (%f)" % (f + 1, feature_names[index[f]], extratrees.feature_importances_[index[f]]))
        features.append(feature_names[index[f]])

    models = {
        "DecisionTree": tree.DecisionTreeClassifier(max_depth=10),
        "RandomForest": ek.RandomForestClassifier(n_estimators=50),
        "Adaboost": ek.AdaBoostClassifier(n_estimators=50),
        "GradientBoosting": ek.GradientBoostingClassifier(n_estimators=50),
        "GNB": GaussianNB(),
        # NOTE(review): LinearRegression is a regressor — .score() here returns
        # R², not accuracy, so this entry is not directly comparable to the
        # classifiers. Kept so the returned results dict keys are unchanged.
        "LinearRegression": LinearRegression(),
    }

    # Fixed: original printed "\\n" (a literal backslash-n), almost certainly
    # a notebook-export artifact; emit a real newline instead.
    print("\n모델 성능 비교")
    # 모델 성능 비교 (compare model performance on the hold-out split)
    results = {}
    for algo in models:
        clf = models[algo]
        clf.fit(X_train, y_train)
        score = clf.score(X_test, y_test)
        print("%s : %s " % (algo, score))
        results[algo] = score

    return models["RandomForest"], features, results

malware binary classification

multi class classification

ML에서 multi-class classification은 단순히 y에 multi-class 레이블 값을 전달하여 fit을 진행하기만 하면 된다.

각 클래스에 대해서 label encoding 진행