
LogisticRegression in scikit-learn and liblinear 1.94 give vastly different results for same data #3600

@dan-blanchard

I've noticed recently that for some data sets, the accuracies obtained by running liblinear on the command line and using scikit-learn (with the same set of parameters) are very different. For the dataset I'm linking here, scikit-learn's performance is substantially better, but for a much larger proprietary dataset run with the same settings, the performance is much worse.

With this training file and this test file, which use a subset of the features and instances from the Kaggle Titanic task, we get the following output from liblinear 1.94:

$ ~/Documents/liblinear-1.94/train -s 6 train/family.libsvm family.libsvm.model
iter   1  #CD cycles 1
iter   2  #CD cycles 1
iter   3  #CD cycles 2
=========================
optimization finished, #iter = 3
Objective value = 487.412579
#nonzeros/#features = 2/2

$ ~/Documents/liblinear-1.94/predict dev/family.libsvm family.libsvm.model family.libsvm.pred
Accuracy = 36.3128% (65/179)
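
One way to put the two solvers on the same footing is to evaluate liblinear's objective for the weights each of them returns. The sketch below implements the L1-regularized logistic regression objective as documented for solver 6, `||w||_1 + C * sum_i log(1 + exp(-y_i * w.x_i))`, with labels encoded as -1/+1. The function name `l1r_lr_objective` and the dense list-of-lists input are illustrative assumptions, not part of either library.

```python
import math


def l1r_lr_objective(w, X, y, C=1.0):
    """L1-regularized logistic regression objective (no bias term).

    w: list of weights, X: list of dense feature rows,
    y: list of labels encoded as -1 or +1, C: cost parameter.
    """
    l1_norm = sum(abs(wj) for wj in w)
    loss = 0.0
    for xi, yi in zip(X, y):
        z = yi * sum(wj * xj for wj, xj in zip(w, xi))
        # Numerically stable evaluation of log(1 + exp(-z)).
        if z >= 0:
            loss += math.log1p(math.exp(-z))
        else:
            loss += -z + math.log1p(math.exp(z))
    return l1_norm + C * loss
```

With scikit-learn, the inputs could plausibly be taken from `clf.coef_.ravel()` and `X_train.toarray()`, mapping `clf.classes_[1]` to +1 and the other class to -1. If the value obtained this way is far from the `Objective value` liblinear prints, the two runs are probably not solving the same problem.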

Repeating the same experiment using scikit-learn 0.15.1:

In [4]: from sklearn.datasets import load_svmlight_files

In [5]: X_train, y_train, X_test, y_test = load_svmlight_files(("train/family.libsvm", "dev/family.libsvm"))
...
In [11]: clf = LogisticRegression(penalty="l1", tol=0.01, fit_intercept=False)

In [12]: clf
Out[12]: 
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=False,
          intercept_scaling=1, penalty='l1', random_state=None, tol=0.01)

In [14]: clf.fit(X_train, y_train)
Out[14]: 
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=False,
          intercept_scaling=1, penalty='l1', random_state=None, tol=0.01)

In [17]: yhat = clf.predict(X_test)

In [18]: from sklearn.metrics import accuracy_score

In [20]: accuracy_score(y_test, yhat)
Out[20]: 0.69273743016759781

Setting the keyword arguments penalty="l1", tol=0.01, fit_intercept=False should yield the same results as running liblinear from the command line with solver 6 (-s 6), since those values match the defaults liblinear uses for that solver.
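
To check whether the two fits actually agree, it may help to compare the learned weights directly rather than only the accuracies. The sketch below reads the plain-text model file that `train` writes (header lines such as `solver_type`, `label`, `bias`, then a `w` line followed by one row of weights per feature). The function name `parse_liblinear_model` is hypothetical, and the layout is assumed from liblinear's usual model format.

```python
def parse_liblinear_model(text):
    """Parse the text of a liblinear model file into a small dict.

    Returns the solver type, the label order, the bias setting and the
    weight rows (one row per feature, one column per weight vector).
    """
    header = {}
    weights = []
    in_weights = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if in_weights:
            weights.append([float(v) for v in line.split()])
        elif line == "w":
            # Everything after the "w" marker is weight data.
            in_weights = True
        else:
            key, _, value = line.partition(" ")
            header[key] = value
    return {
        "solver_type": header.get("solver_type"),
        "labels": header.get("label", "").split(),
        "bias": float(header["bias"]) if "bias" in header else None,
        "weights": weights,
    }
```

For example, `parse_liblinear_model(open("family.libsvm.model").read())` could be compared against `clf.coef_`. One thing to watch, as far as I understand the two libraries: for a two-class model liblinear treats the first entry on the `label` line as the positive class, while scikit-learn sorts `clf.classes_`, so the weights may differ in sign even when the fits are equivalent.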
