Description
I've noticed recently that for some data sets, the accuracies obtained by running liblinear on the command line and using scikit-learn (with the same set of parameters) are very different. For the dataset I'm linking here, scikit-learn's performance is substantially better, but for a much larger proprietary dataset run with the same settings, the performance is much worse.
With this training file and this test file, which use a subset of the features and instances from the Kaggle Titanic task, we get the following output from liblinear 1.94:
```
$ ~/Documents/liblinear-1.94/train -s 6 train/family.libsvm family.libsvm.model
iter 1 #CD cycles 1
iter 2 #CD cycles 1
iter 3 #CD cycles 2
=========================
optimization finished, #iter = 3
Objective value = 487.412579
#nonzeros/#features = 2/2
$ ~/Documents/liblinear-1.94/predict dev/family.libsvm family.libsvm.model family.libsvm.pred
Accuracy = 36.3128% (65/179)
```
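As a sanity check on the comparison, liblinear's accuracy can be recomputed with scikit-learn's own metric; this is a minimal sketch assuming that family.libsvm.pred contains one predicted label per line (i.e. `predict` was run without `-b`):

```python
# Sketch: score liblinear's predictions with scikit-learn's accuracy_score,
# so both tools are evaluated with exactly the same metric and labels.
import numpy as np
from sklearn.datasets import load_svmlight_file
from sklearn.metrics import accuracy_score

# True labels of the held-out set used on the command line
_, y_true = load_svmlight_file("dev/family.libsvm")

# Assumes liblinear's predict wrote one predicted label per line (no -b option)
y_pred = np.loadtxt("family.libsvm.pred")

print(accuracy_score(y_true, y_pred))  # ~0.3631, matching the 36.3128% above
```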
Repeating the same experiment using scikit-learn 0.15.1:
```
In [4]: from sklearn.datasets import load_svmlight_files
In [5]: X_train, y_train, X_test, y_test = load_svmlight_files(("train/family.libsvm", "dev/family.libsvm"))
...
In [11]: clf = LogisticRegression(penalty="l1", tol=0.01, fit_intercept=False)
In [12]: clf
Out[12]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=False,
          intercept_scaling=1, penalty='l1', random_state=None, tol=0.01)
In [14]: clf.fit(X_train, y_train)
Out[14]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=False,
          intercept_scaling=1, penalty='l1', random_state=None, tol=0.01)
In [17]: yhat = clf.predict(X_test)
In [18]: from sklearn.metrics import accuracy_score
In [20]: accuracy_score(y_test, yhat)
```
```
Out[20]: 0.69273743016759781
```

Setting the keyword arguments to penalty="l1", tol=0.01, fit_intercept=False should have yielded the same results as running liblinear from the command line with solver 6, since those match the defaults it uses.
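For completeness, here is the same scikit-learn run as a single self-contained script; the only line not shown in the session above is the LogisticRegression import, which is assumed to come from sklearn.linear_model:

```python
# Self-contained version of the scikit-learn experiment above.
# Parameters follow the session: penalty="l1", tol=0.01, fit_intercept=False,
# intended to mirror liblinear's `-s 6` (L1-regularized logistic regression)
# with its default C=1.0 and no bias term.
from sklearn.datasets import load_svmlight_files
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X_train, y_train, X_test, y_test = load_svmlight_files(
    ("train/family.libsvm", "dev/family.libsvm")
)

clf = LogisticRegression(penalty="l1", tol=0.01, C=1.0, fit_intercept=False)
clf.fit(X_train, y_train)

print(accuracy_score(y_test, clf.predict(X_test)))  # 0.6927... here vs 36.31% from liblinear
```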