Description
I've noticed recently that for some data sets, the accuracies obtained by running liblinear on the command line and using scikit-learn (with the same set of parameters) are very different. For the dataset I'm linking here, scikit-learn's performance is substantially better, but for a much larger proprietary dataset run with the same settings, the performance is much worse.
With this training file and this test file, which use a subset of the features and instances from the Kaggle Titanic task, we get the following output from liblinear 1.94:
```
$ ~/Documents/liblinear-1.94/train -s 6 train/family.libsvm family.libsvm.model
iter 1 #CD cycles 1
iter 2 #CD cycles 1
iter 3 #CD cycles 2
=========================
optimization finished, #iter = 3
Objective value = 487.412579
#nonzeros/#features = 2/2
$ ~/Documents/liblinear-1.94/predict dev/family.libsvm family.libsvm.model family.libsvm.pred
Accuracy = 36.3128% (65/179)
```
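As a sanity check on the comparison, liblinear's accuracy can be recomputed with scikit-learn's own metric; this is a minimal sketch assuming that family.libsvm.pred contains one predicted label per line (i.e. `predict` was run without `-b`):

```python
# Sketch: score liblinear's predictions with scikit-learn's accuracy_score,
# so both tools are evaluated with exactly the same metric and labels.
import numpy as np
from sklearn.datasets import load_svmlight_file
from sklearn.metrics import accuracy_score

# True labels of the held-out set used on the command line
_, y_true = load_svmlight_file("dev/family.libsvm")

# Assumes liblinear's predict wrote one predicted label per line (no -b option)
y_pred = np.loadtxt("family.libsvm.pred")

print(accuracy_score(y_true, y_pred))  # ~0.3631, matching the 36.3128% above
```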
Repeating the same experiment using scikit-learn 0.15.1:
```
In [4]: from sklearn.datasets import load_svmlight_files
In [5]: X_train, y_train, X_test, y_test = load_svmlight_files(("train/family.libsvm", "dev/family.libsvm"))
...
In [11]: clf = LogisticRegression(penalty="l1", tol=0.01, fit_intercept=False)
In [12]: clf
Out[12]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=False,
          intercept_scaling=1, penalty='l1', random_state=None, tol=0.01)
In [14]: clf.fit(X_train, y_train)
Out[14]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=False,
          intercept_scaling=1, penalty='l1', random_state=None, tol=0.01)
In [17]: yhat = clf.predict(X_test)
In [18]: from sklearn.metrics import accuracy_score
In [20]: accuracy_score(y_test, yhat)
```
```
Out[20]: 0.69273743016759781
```

Setting the keyword arguments to penalty="l1", tol=0.01, fit_intercept=False should have yielded the same results as running liblinear from the command line with solver 6, since those match the defaults it uses.
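For completeness, here is the same scikit-learn run as a single self-contained script; the only line not shown in the session above is the LogisticRegression import, which is assumed to come from sklearn.linear_model:

```python
# Self-contained version of the scikit-learn experiment above.
# Parameters follow the session: penalty="l1", tol=0.01, fit_intercept=False,
# intended to mirror liblinear's `-s 6` (L1-regularized logistic regression)
# with its default C=1.0 and no bias term.
from sklearn.datasets import load_svmlight_files
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X_train, y_train, X_test, y_test = load_svmlight_files(
    ("train/family.libsvm", "dev/family.libsvm")
)

clf = LogisticRegression(penalty="l1", tol=0.01, C=1.0, fit_intercept=False)
clf.fit(X_train, y_train)

print(accuracy_score(y_test, clf.predict(X_test)))  # 0.6927... here vs 36.31% from liblinear
```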