Commit e839a41

get feature
deep fm
1 parent 24648db commit e839a41

1 file changed, +10 -0 lines changed

deep_ctr/get_feature.py

Lines changed: 10 additions & 0 deletions
@@ -2,6 +2,16 @@
 """
 Preprocess Criteo dataset. This dataset was used for the Display Advertising
 Challenge (https://www.kaggle.com/c/criteo-display-ad-challenge).
+The original dataset is known as the Criteo 1TB click log, for which Criteo Labs collected 30 days of masked data. We only know that there are 13 numerical and 26 categorical features; no feature description has been released. We therefore name these features num_0 ... num_12 and cat_0 ... cat_25.
+- For numerical features, @tianyao chen discretized them into equal-size buckets, following APEXDatasets.
+- For categorical features, he removed long-tail values that appear fewer than 20 times.
+- Negative down-sampling is used, and the resulting positive sample ratio is about 0.5.
+After one-hot encoding, the feature space has roughly 1M dimensions.
+
+
+TODO:
+#1 Continuous features: outliers / discretization / normalization
+#2 Categorical features: long-tail low-frequency features
 """
 import os
 import sys
