101 commits
a580cac
ENH Download recommendation data in chapter 8
luispedro Oct 10, 2014
437da2a
ENH Use curl instead of wget
luispedro Oct 10, 2014
efce70f
ENH Add color histogram function
luispedro Oct 10, 2014
8209239
RFCT Use scipy instead of rolling our own
luispedro Oct 12, 2014
b0636c4
ENH Use color histograms with SVMs+Grid Search
luispedro Oct 12, 2014
1421b34
ENH Partition matrix into train/test
luispedro Oct 13, 2014
de74835
RFCT Simplify sampling code
luispedro Oct 17, 2014
87c322d
RFCT Cleaner neighbor correlation code
luispedro Oct 17, 2014
b651220
RFCT Simplify user regression code
luispedro Oct 17, 2014
5ece8c0
RFCT Split prediction from evaluation
luispedro Oct 17, 2014
b7117f2
MIN Simple explanatory comments
luispedro Oct 21, 2014
a183b7e
ENH Set random seed to ensure consistency
luispedro Oct 23, 2014
b89ac7e
RFCT Remove flawed stacking code
luispedro Oct 24, 2014
362207f
RFCT Do not filter matrix
luispedro Oct 24, 2014
f5d8cc6
ENH Stacked learning for combining methods
luispedro Oct 24, 2014
d7e2f96
DOC Start a README file
luispedro Oct 24, 2014
e19da48
MIN Describe different scripts in README
luispedro Oct 24, 2014
87d1cfd
ENH Take sqrt for stddev
luispedro Oct 29, 2014
c3226d8
MIN Remove unnecessary print statement
luispedro Oct 30, 2014
b8dfd13
MIN Remove unused import
luispedro Oct 30, 2014
e28cc3a
ENH Simple & commented code
luispedro Oct 31, 2014
773a401
MIN Fix typo in print call
luispedro Oct 31, 2014
51701a0
ENH Update URL & make curl call follow redirects
luispedro Oct 31, 2014
9904f16
ENH Comment code. Use namedtuple for readability
luispedro Oct 31, 2014
13fc479
ENH Comment on alternative implementation
luispedro Oct 31, 2014
f559c3a
DOC Describe Apriori files
luispedro Oct 31, 2014
a9e4eea
ENH Simplify & document code
luispedro Nov 2, 2014
5886bf0
ENH Improve Apriori code examples
luispedro Nov 3, 2014
2680779
ENH Add association rule code
luispedro Nov 3, 2014
8c337a0
Switching to py3-complaint twitter library; improving handling of rat…
wrichert Nov 5, 2014
b8c32d5
py3 compliant
wrichert Nov 5, 2014
740e009
py3-compliant; help message for missing SentiWordNet
wrichert Nov 6, 2014
a3d1afe
grace wait time
wrichert Nov 6, 2014
671888a
Remove deprecated parameter 'indices'
wrichert Nov 6, 2014
5758d97
MIN Output Nr descriptors. Use range
luispedro Nov 7, 2014
7c4ad5c
MIN Add explanatory comment
luispedro Nov 8, 2014
61b894d
ENH Second edition has less image processing
luispedro Nov 8, 2014
332400f
ENH Update to newer Chapter 10 architecture
luispedro Nov 10, 2014
55b9fa9
BLD Ignore output files
luispedro Nov 10, 2014
ac0afd9
MIN Use all Haralick features
luispedro Nov 16, 2014
0fbc909
ENH Final updated version of simple_classification
luispedro Nov 16, 2014
e4403a9
FIG Show building text & building
luispedro Nov 16, 2014
9cebdb7
MIN Use range instead of xrange
luispedro Nov 16, 2014
9d3ea50
ENH Finalize large classification method
luispedro Nov 17, 2014
5bdba98
ENH Add code for image neighbors
luispedro Nov 17, 2014
f764536
MIN Scale features
luispedro Nov 21, 2014
ba4e379
MIN Use np.bincount instead of loop
luispedro Nov 21, 2014
d252557
ENH Add AWS code from the book
luispedro Nov 21, 2014
a24de07
ENH Update jug file to match Chapter 10
luispedro Nov 21, 2014
940baf5
ENH Code that is simpler to explain
luispedro Nov 22, 2014
40a59ec
ENH Add LBP to image classification script
luispedro Nov 22, 2014
34302d2
ENH Work with Python 2.6
luispedro Nov 23, 2014
79857c0
MIN Use curl instead of wget
luispedro Jan 6, 2015
691de56
ENH Add thresholded figure
luispedro Jan 6, 2015
d535cfc
FIG Improve figures
luispedro Jan 26, 2015
fa00f07
BUG Use threshold obtained from current code
luispedro Jan 26, 2015
7fd23a3
RFCT Use model as first argument
luispedro Jan 26, 2015
d818133
Modify UnicodeDecodeError text. You'll use utf-8
neoneo40 Jan 27, 2015
91da359
ENH Update to Python 3. Better figures
luispedro Jan 27, 2015
4de7593
Modify UnicodeDecodeError text in Python 2.x
neoneo40 Jan 27, 2015
9682f47
RFCT Update figure generation code
luispedro Jan 29, 2015
9077676
RFCT Remove old script
luispedro Jan 29, 2015
e1d2b6d
Merge pull request #6 from re4lfl0w/unicodedecodeerror_fix
wrichert Feb 3, 2015
822f6ac
MIN Better axis label
luispedro Feb 6, 2015
83d426c
MIN Add axis= for explicitness
luispedro Feb 9, 2015
d192bee
ENH Improve code readability
luispedro Feb 12, 2015
f2b5bdf
MIN Better filename for output
luispedro Feb 14, 2015
cd69e60
ENH More readable code
luispedro Mar 20, 2015
30a1512
ENH Use matutils.corpus2dense instead of looping
luispedro Mar 20, 2015
2c530e1
MIN Use same function name as in book
luispedro Mar 20, 2015
6948563
RFCT Simpler code
luispedro Mar 20, 2015
c03c2f3
DOC Add note on randomness
luispedro Mar 20, 2015
c1881c9
Switching to py3-complaint twitter library; improving handling of rat…
wrichert Nov 5, 2014
2f8ee9e
py3 compliant
wrichert Nov 5, 2014
37920fb
py3-compliant; help message for missing SentiWordNet
wrichert Nov 6, 2014
fbd6a3d
grace wait time
wrichert Nov 6, 2014
9b8b80a
Remove deprecated parameter 'indices'
wrichert Nov 6, 2014
22bc140
Modify UnicodeDecodeError text. You'll use utf-8
neoneo40 Jan 27, 2015
1d8fd23
Modify UnicodeDecodeError text in Python 2.x
neoneo40 Jan 27, 2015
4205208
ENH Single file with the code as in book
luispedro Mar 25, 2015
bb9b8de
ENH Average methods (instead of stacked learning)
luispedro Mar 25, 2015
afcdc0b
ENH Add chapter.py files
luispedro Mar 27, 2015
3ae4c9d
BLD Update gitignore files
luispedro Mar 27, 2015
24ac39d
ENH Add script to process Wikipedia with HDP
luispedro Mar 27, 2015
309ab17
DAT Add example image
luispedro Mar 27, 2015
77b13d8
Merge branch 'master' into second_edition
luispedro Apr 3, 2015
e28dfc4
DOC Clarify that this is the second edition code
luispedro Apr 3, 2015
79581d2
rename missed name
iory Apr 20, 2015
0999da1
Merge pull request #9 from iory/master
luispedro Apr 20, 2015
8386e85
ENH Better spacing & comments
luispedro Jun 16, 2015
6536b1b
analyze_webstats with PEP8
juanpabloaj Jan 3, 2016
90b8207
Merge pull request #13 from juanpabloaj/webstats_pep8
wrichert Mar 5, 2016
db795e0
BUG Fix name of function chist
luispedro Mar 8, 2016
c714303
Fix predict method called
tomahawk28 Nov 2, 2015
b97a0be
format this file
ao-song Apr 28, 2016
456a830
Merge pull request #17 from ao-song/master
wrichert May 1, 2016
98d66ea
MIN Remove extraneous function call
luispedro Jul 21, 2016
c0a3b3a
DOC Make explicit how to get AP data
luispedro Nov 27, 2016
c4c71a5
MIN Update link to AP data
luispedro Mar 28, 2017
a237d75
BUG Fixes API usage
luispedro Jun 25, 2017
52891e6
BUG Fix function import
luispedro May 21, 2018
11 changes: 7 additions & 4 deletions README.md
@@ -1,11 +1,14 @@
 Building Machine Learning Systems with Python
 =============================================

-Source Code for the book Building Machine Learning Systems with Python by
-[Willi Richert](http://twotoreal.com) and [Luis Pedro
-Coelho](http://luispedro.org).
+Source Code for the book Building Machine Learning Systems with Python by [Luis
+Pedro Coelho](http://luispedro.org) and [Willi Richert](http://twotoreal.com).

-The book was published in 2013 by Packt Publishing and is available [from their
+The book was published in 2013 (second edition in 2015) by Packt Publishing and
+is available [from their
 website](http://www.packtpub.com/building-machine-learning-systems-with-python/book).

+The code in the repository corresponds to the second edition. Code for the
+first edition is available in [first\_edition
+branch](https://github.com/luispedro/BuildingMachineLearningSystemsWithPython/tree/first_edition).

7 changes: 4 additions & 3 deletions ch01/analyze_webstats.py
@@ -26,8 +26,9 @@
 x = x[~sp.isnan(y)]
 y = y[~sp.isnan(y)]

-# plot input data
+
 def plot_models(x, y, models, fname, mx=None, ymax=None, xmin=None):
+    ''' plot input data '''

     plt.figure(num=None, figsize=(8, 6))
     plt.clf()
@@ -138,8 +139,8 @@ def error(f, x, y):
 train = sorted(shuffled[split_idx:])
 fbt1 = sp.poly1d(sp.polyfit(xb[train], yb[train], 1))
 fbt2 = sp.poly1d(sp.polyfit(xb[train], yb[train], 2))
-print("fbt2(x)= \n%s"%fbt2)
-print("fbt2(x)-100,000= \n%s"%(fbt2-100000))
+print("fbt2(x)= \n%s" % fbt2)
+print("fbt2(x)-100,000= \n%s" % (fbt2-100000))
 fbt3 = sp.poly1d(sp.polyfit(xb[train], yb[train], 3))
 fbt10 = sp.poly1d(sp.polyfit(xb[train], yb[train], 10))
 fbt100 = sp.poly1d(sp.polyfit(xb[train], yb[train], 100))
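
As a rough illustration of the train/test comparison that this part of `analyze_webstats.py` performs, the sketch below fits polynomials of increasing degree on a training subset and scores them on held-out points. It is only a sketch under assumptions: the data is synthetic rather than the book's `web_traffic.tsv`, the 30% hold-out fraction and the seed are hypothetical, and it calls `np.polyfit` and `np.poly1d` directly instead of the `sp.` aliases seen in the diff.

```python
import numpy as np

rng = np.random.RandomState(3)

# Synthetic stand-in for the web-traffic data (hypothetical values)
x = np.arange(1.0, 201.0)
y = 0.05 * x ** 2 + 3.0 * x + 100.0 + rng.normal(scale=50.0, size=x.size)


def error(f, x, y):
    """Sum of squared residuals of model f on the points (x, y)."""
    return np.sum((f(x) - y) ** 2)


# Hold out a fraction of the points for testing (fraction is an assumption)
frac = 0.3
split_idx = int(frac * len(x))
shuffled = rng.permutation(len(x))
test = np.sort(shuffled[:split_idx])
train = np.sort(shuffled[split_idx:])

# Fit on the training points only, then score on the held-out points
models = {d: np.poly1d(np.polyfit(x[train], y[train], d)) for d in (1, 2, 3)}
test_errors = {d: error(f, x[test], y[test]) for d, f in models.items()}

for d in sorted(test_errors):
    print("degree %d: test error %.1f" % (d, test_errors[d]))
```

Scoring on points the fit never saw is what makes it possible to notice when a higher degree stops helping.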
22 changes: 9 additions & 13 deletions ch01/gen_webstats.py
@@ -17,26 +17,22 @@

 sp.random.seed(3) # to reproduce the data later on

-x = sp.arange(1, 31 * 24)
-y = sp.array(200 * (sp.sin(2 * sp.pi * x / (7 * 24))), dtype=int)
+x = sp.arange(1, 31*24)
+y = sp.array(200*(sp.sin(2*sp.pi*x/(7*24))), dtype=int)
 y += gamma.rvs(15, loc=0, scale=100, size=len(x))
-y += 2 * sp.exp(x / 100.0)
-y = sp.ma.array(y, mask=[y < 0])
-print(sum(y), sum(y < 0))
+y += 2 * sp.exp(x/100.0)
+y = sp.ma.array(y, mask=[y<0])
+print(sum(y), sum(y<0))

 plt.scatter(x, y)
 plt.title("Web traffic over the last month")
 plt.xlabel("Time")
 plt.ylabel("Hits/hour")
-plt.xticks([w * 7 * 24 for w in [0, 1, 2, 3, 4]], ['week %i' % (w + 1) for w in
-           [0, 1, 2, 3, 4]])
-
+plt.xticks([w*7*24 for w in range(5)],
+           ['week %i' %(w+1) for w in range(5)])
 plt.autoscale(tight=True)
 plt.grid()
 plt.savefig(os.path.join(CHART_DIR, "1400_01_01.png"))

-# sp.savetxt(os.path.join("..", "web_traffic.tsv"),
-# zip(x[~y.mask],y[~y.mask]), delimiter="\t", fmt="%i")
-
-sp.savetxt(os.path.join(
-    DATA_DIR, "web_traffic.tsv"), list(zip(x, y)), delimiter="\t", fmt="%s")
+sp.savetxt(os.path.join(DATA_DIR, "web_traffic.tsv"),
+           list(zip(x, y)), delimiter="\t", fmt="%s")
3 changes: 3 additions & 0 deletions ch02/README.rst
@@ -6,6 +6,9 @@ Support code for *Chapter 2: Learning How to Classify with Real-world
 Examples*. The directory data contains the seeds dataset, originally downloaded
 from https://archive.ics.uci.edu/ml/datasets/seeds

+chapter.py
+    The code as printed in the book.
+
 figure1.py
     Figure 1 in the book: all 2-by-2 scatter plots

164 changes: 164 additions & 0 deletions ch02/chapter.py
@@ -0,0 +1,164 @@
# This code is supporting material for the book
# Building Machine Learning Systems with Python
# by Willi Richert and Luis Pedro Coelho
# published by PACKT Publishing
#
# It is made available under the MIT License


from matplotlib import pyplot as plt
import numpy as np

# We load the data with load_iris from sklearn
from sklearn.datasets import load_iris
data = load_iris()

# load_iris returns an object with several fields
features = data.data
feature_names = data.feature_names
target = data.target
target_names = data.target_names

for t in range(3):
    if t == 0:
        c = 'r'
        marker = '>'
    elif t == 1:
        c = 'g'
        marker = 'o'
    elif t == 2:
        c = 'b'
        marker = 'x'
    plt.scatter(features[target == t, 0],
                features[target == t, 1],
                marker=marker,
                c=c)
# We use NumPy fancy indexing to get an array of strings:
labels = target_names[target]

# The petal length is the feature at position 2
plength = features[:, 2]

# Build an array of booleans:
is_setosa = (labels == 'setosa')

# This is the important step:
max_setosa = plength[is_setosa].max()
min_non_setosa = plength[~is_setosa].min()
print('Maximum of setosa: {0}.'.format(max_setosa))

print('Minimum of others: {0}.'.format(min_non_setosa))

# ~ is the boolean negation operator
features = features[~is_setosa]
labels = labels[~is_setosa]
# Build a new target variable, is_virginica
is_virginica = (labels == 'virginica')

# Initialize best_acc to impossibly low value
best_acc = -1.0
for fi in range(features.shape[1]):
    # We are going to test all possible thresholds
    thresh = features[:,fi]
    for t in thresh:

        # Get the vector for feature `fi`
        feature_i = features[:, fi]
        # apply threshold `t`
        pred = (feature_i > t)
        acc = (pred == is_virginica).mean()
        rev_acc = (pred == ~is_virginica).mean()
        if rev_acc > acc:
            reverse = True
            acc = rev_acc
        else:
            reverse = False

        if acc > best_acc:
            best_acc = acc
            best_fi = fi
            best_t = t
            best_reverse = reverse

print(best_fi, best_t, best_reverse, best_acc)

def is_virginica_test(fi, t, reverse, example):
    'Apply threshold model to a new example'
    test = example[fi] > t
    if reverse:
        test = not test
    return test
from threshold import fit_model, predict

# Training accuracy was 96.0%.
# Testing accuracy was 90.0% (N = 50).
correct = 0.0

for ei in range(len(features)):
    # select all but the one at position `ei`:
    training = np.ones(len(features), bool)
    training[ei] = False
    testing = ~training
    model = fit_model(features[training], is_virginica[training])
    predictions = predict(model, features[testing])
    correct += np.sum(predictions == is_virginica[testing])
acc = correct/float(len(features))
print('Accuracy: {0:.1%}'.format(acc))


###########################################
############## SEEDS DATASET ##############
###########################################

from load import load_dataset

feature_names = [
'area',
'perimeter',
'compactness',
'length of kernel',
'width of kernel',
'asymmetry coefficient',
'length of kernel groove',
]
features, labels = load_dataset('seeds')



from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=1)
from sklearn.cross_validation import KFold

kf = KFold(len(features), n_folds=5, shuffle=True)
means = []
for training,testing in kf:
    # We learn a model for this fold with `fit` and then apply it to the
    # testing data with `predict`:
    classifier.fit(features[training], labels[training])
    prediction = classifier.predict(features[testing])

    # np.mean on an array of booleans returns fraction
    # of correct decisions for this fold:
    curmean = np.mean(prediction == labels[testing])
    means.append(curmean)
print('Mean accuracy: {:.1%}'.format(np.mean(means)))


from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

classifier = KNeighborsClassifier(n_neighbors=1)
classifier = Pipeline([('norm', StandardScaler()), ('knn', classifier)])

means = []
for training,testing in kf:
    # We learn a model for this fold with `fit` and then apply it to the
    # testing data with `predict`:
    classifier.fit(features[training], labels[training])
    prediction = classifier.predict(features[testing])

    # np.mean on an array of booleans returns fraction
    # of correct decisions for this fold:
    curmean = np.mean(prediction == labels[testing])
    means.append(curmean)
print('Mean accuracy: {:.1%}'.format(np.mean(means)))
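
The two cross-validation loops at the end of `chapter.py` compare a 1-nearest-neighbor classifier with and without feature normalization. The sketch below shows the same idea with NumPy only, so it does not depend on a particular scikit-learn version (the `sklearn.cross_validation` module imported in the diff was removed in later scikit-learn releases). Everything here is an assumption made for illustration: `kfold_indices` and `nn_predict` are hypothetical helpers, not functions from the repository, and the data is synthetic rather than the seeds dataset.

```python
import numpy as np


def kfold_indices(n, n_folds, rng):
    """Yield (training, testing) index arrays for shuffled k-fold CV."""
    folds = np.array_split(rng.permutation(n), n_folds)
    for i in range(n_folds):
        training = np.concatenate(
            [folds[j] for j in range(n_folds) if j != i])
        yield training, folds[i]


def nn_predict(train_x, train_y, test_x):
    """1-nearest-neighbor: copy the label of the closest training row."""
    # Squared Euclidean distances, shape (n_test, n_train)
    dists = ((test_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=2)
    return train_y[dists.argmin(axis=1)]


# Hypothetical data: feature 0 is informative, feature 1 is large-scale noise
rng = np.random.RandomState(0)
n = 100
labels = np.repeat([0, 1], n // 2)
features = np.column_stack([
    3.0 * labels + rng.normal(size=n),
    rng.normal(scale=1000.0, size=n),
])

raw_means, norm_means = [], []
for training, testing in kfold_indices(n, 5, rng):
    # Without normalization
    pred = nn_predict(features[training], labels[training], features[testing])
    raw_means.append(np.mean(pred == labels[testing]))

    # With z-scoring; mean and std are computed on the training fold only
    mu = features[training].mean(axis=0)
    sigma = features[training].std(axis=0)
    pred = nn_predict((features[training] - mu) / sigma, labels[training],
                      (features[testing] - mu) / sigma)
    norm_means.append(np.mean(pred == labels[testing]))

print('Mean accuracy (raw): {:.1%}'.format(np.mean(raw_means)))
print('Mean accuracy (normalized): {:.1%}'.format(np.mean(norm_means)))
```

Computing `mu` and `sigma` inside the loop mirrors what wrapping `StandardScaler` in a `Pipeline` achieves: the scaling parameters never see the testing fold.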
14 changes: 11 additions & 3 deletions ch02/figure1.py
@@ -19,13 +19,21 @@

 fig,axes = plt.subplots(2, 3)
 pairs = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
+
+# Set up 3 different pairs of (color, marker)
+color_markers = [
+    ('r', '>'),
+    ('g', 'o'),
+    ('b', 'x'),
+]
 for i, (p0, p1) in enumerate(pairs):
     ax = axes.flat[i]

-    # Use a different marker/color for each class `t`
-    for t, marker, c in zip(range(3), ">ox", "rgb"):
+    for t in range(3):
+        # Use a different color/marker for each class `t`
+        c,marker = color_markers[t]
         ax.scatter(features[target == t, p0], features[
-            target == t, p1], marker=marker, c=c, s=40)
+            target == t, p1], marker=marker, c=c)
     ax.set_xlabel(feature_names[p0])
     ax.set_ylabel(feature_names[p1])
     ax.set_xticks([])
7 changes: 4 additions & 3 deletions ch02/figure2.py
@@ -23,8 +23,9 @@
 labels = labels[~is_setosa]
 is_virginica = (labels == 'virginica')

-# Hand fixed threshold:
-t = 1.75
+# Hand fixed thresholds:
+t = 1.65
+t2 = 1.75

 # Features to use: 3 & 2
 f0, f1 = 3, 2
@@ -49,7 +50,7 @@
 ax.fill_between([t, x1], [y0, y0], [y1, y1], color=area2c)
 ax.fill_between([x0, t], [y0, y0], [y1, y1], color=area1c)
 ax.plot([t, t], [y0, y1], 'k--', lw=2)
-ax.plot([t - .1, t - .1], [y0, y1], 'k:', lw=2)
+ax.plot([t2, t2], [y0, y1], 'k:', lw=2)
 ax.scatter(features[is_virginica, f0],
            features[is_virginica, f1], c='b', marker='o', s=40)
 ax.scatter(features[~is_virginica, f0],
2 changes: 1 addition & 1 deletion ch02/figure4_5_no_sklearn.py
@@ -45,7 +45,7 @@ def plot_decision(features, labels):

     model = fit_model(1, features[:, (0, 2)], np.array(labels))
     C = predict(
-        np.vstack([X.ravel(), Y.ravel()]).T, model).reshape(X.shape)
+        model, np.vstack([X.ravel(), Y.ravel()]).T).reshape(X.shape)
     if COLOUR_FIGURE:
         cmap = ListedColormap([(1., .6, .6), (.6, 1., .6), (.6, .6, 1.)])
     else:
4 changes: 2 additions & 2 deletions ch02/figure4_5_sklearn.py
@@ -58,11 +58,11 @@ def plot_decision(features, labels, num_neighbors=1):
     ax.pcolormesh(X, Y, C, cmap=cmap)
     if COLOUR_FIGURE:
         cmap = ListedColormap([(1., .0, .0), (.1, .6, .1), (.0, .0, 1.)])
-        ax.scatter(features[:, 0], features[:, 2], c=labels, cmap=cmap, s=40)
+        ax.scatter(features[:, 0], features[:, 2], c=labels, cmap=cmap)
     else:
         for lab, ma in zip(range(3), "Do^"):
             ax.plot(features[labels == lab, 0], features[
-                labels == lab, 2], ma, c=(1., 1., 1.), ms=8)
+                labels == lab, 2], ma, c=(1., 1., 1.), ms=6)
     return fig,ax


4 changes: 2 additions & 2 deletions ch02/knn.py
@@ -26,7 +26,7 @@ def plurality(xs):
             return k

 # This function was called ``apply_model`` in the first edition
-def predict(features, model):
+def predict(model, features):
     '''Apply k-nn model'''
     k, train_feats, labels = model
     results = []
@@ -42,5 +42,5 @@ def predict(features, model):


 def accuracy(features, labels, model):
-    preds = predict(features, model)
+    preds = predict(model, features)
     return np.mean(preds == labels)
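
The `knn.py` change only swaps the argument order of `predict` so that the model comes first. For readers without the full file at hand, here is a minimal sketch of a k-nearest-neighbor model with that calling convention. The function names follow the diff, but the bodies are a reconstruction under assumptions (squared Euclidean distance, a `Counter`-based plurality vote), not the repository's exact code, and the six training points are made up.

```python
from collections import Counter

import numpy as np


def fit_model(k, features, labels):
    """A k-nn 'model' is just k plus the stored training data."""
    return k, np.asarray(features, dtype=float), np.asarray(labels)


def plurality(xs):
    """Return the most common element of xs."""
    return Counter(xs).most_common(1)[0][0]


def predict(model, features):
    """Apply a k-nn model; the model is the first argument."""
    k, train_feats, labels = model
    results = []
    for f in np.asarray(features, dtype=float):
        # Squared Euclidean distance to every training example
        dists = np.sum((train_feats - f) ** 2, axis=1)
        nearest = np.argsort(dists)[:k]
        results.append(plurality(labels[nearest].tolist()))
    return np.array(results)


def accuracy(features, labels, model):
    """Fraction of examples whose prediction matches the label."""
    preds = predict(model, features)
    return np.mean(preds == labels)


# Hypothetical toy data: two well-separated clusters
train_x = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = [0, 0, 0, 1, 1, 1]
model = fit_model(3, train_x, train_y)

predictions = predict(model, [[0.2, 0.1], [5.5, 5.2]])
print(predictions)
print(accuracy(np.array(train_x), np.array(train_y), model))
```

Putting the model first matches the `predict(model, ...)` calls that `chapter.py` and `figure4_5_no_sklearn.py` make elsewhere in this diff.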
4 changes: 2 additions & 2 deletions ch02/threshold.py
@@ -40,7 +40,7 @@ def fit_model(features, labels):


 # This function was called ``apply_model`` in the first edition
-def predict(features, model):
+def predict(model, features):
     '''Apply a learned model'''
     # A model is a pair as returned by fit_model
     t, fi, reverse = model
@@ -51,5 +51,5 @@ def predict(features, model):

 def accuracy(features, labels, model):
     '''Compute the accuracy of the model'''
-    preds = predict(features, model)
+    preds = predict(model, features)
     return np.mean(preds == labels)
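
To show how the threshold model fits together with the new `predict(model, features)` order, the following sketch pairs an exhaustive threshold search with the leave-one-out loop from `chapter.py`. It is an illustration under assumptions, not the repository's `threshold.py`: the body of `fit_model` is reconstructed from the search loop shown in `chapter.py`, labels are assumed to be a boolean array, and the eight data points are invented.

```python
import numpy as np


def fit_model(features, labels):
    """Exhaustively search for the best single-feature threshold."""
    best_acc = -1.0
    best = None
    for fi in range(features.shape[1]):
        for t in features[:, fi]:
            pred = features[:, fi] > t
            acc = (pred == labels).mean()
            rev_acc = (pred == ~labels).mean()
            # Reverse the comparison if that classifies more examples correctly
            reverse = rev_acc > acc
            if reverse:
                acc = rev_acc
            if acc > best_acc:
                best_acc = acc
                best = (t, fi, reverse)
    return best


def predict(model, features):
    """Apply a learned model; the model is the first argument."""
    t, fi, reverse = model
    pred = features[:, fi] > t
    return ~pred if reverse else pred


# Hypothetical data: column 1 separates the classes, column 0 does not
features = np.array([
    [0.3, 1.0], [0.9, 1.2], [0.1, 1.5], [0.7, 1.1],
    [0.5, 3.0], [0.2, 3.3], [0.8, 2.9], [0.4, 3.1],
])
labels = np.array([False] * 4 + [True] * 4)

model = fit_model(features, labels)
train_acc = np.mean(predict(model, features) == labels)

# Leave-one-out: refit without example `ei`, then test on it alone
correct = 0.0
for ei in range(len(features)):
    training = np.ones(len(features), bool)
    training[ei] = False
    testing = ~training
    fold_model = fit_model(features[training], labels[training])
    correct += np.sum(predict(fold_model, features[testing]) == labels[testing])
loo_acc = correct / len(features)

print('Training accuracy: {0:.1%}'.format(train_acc))
print('Leave-one-out accuracy: {0:.1%}'.format(loo_acc))
```

Because candidate thresholds come from the training rows themselves, holding out a boundary example can shift the learned threshold, which is one reason the leave-one-out estimate can fall below the training accuracy.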
4 changes: 4 additions & 0 deletions ch04/.gitignore
@@ -1,2 +1,6 @@
 wiki_lda.pkl
 wiki_lda.pkl.state
+*.png
+*.npy
+*.pkl
+topics.txt
14 changes: 14 additions & 0 deletions ch04/README.rst
@@ -4,6 +4,16 @@ Chapter 4

 Support code for *Chapter 4: Topic Modeling*

+
+AP Data
+-------
+
+To download the AP data, use the ``download_ap.sh`` script inside the ``data``
+directory::
+
+    cd data
+    ./download_ap.sh
+
 Word cloud creation
 -------------------

@@ -49,3 +59,7 @@ Scripts

 blei_lda.py
     Computes LDA using the AP Corpus.
+wikitopics_create.py
+    Create the topic model for Wikipedia using LDA (must download wikipedia database first)
+wikitopics_create_hdp.py
+    Create the topic model for Wikipedia using HDP (must download wikipedia database first)