|
110 | 110 | }, |
111 | 111 | { |
112 | 112 | "cell_type": "code", |
113 | | - "execution_count": 2, |
| 113 | + "execution_count": null, |
114 | 114 | "metadata": { |
115 | 115 | "collapsed": true |
116 | 116 | }, |
|
817 | 817 | }, |
818 | 818 | { |
819 | 819 | "cell_type": "code", |
820 | | - "execution_count": 23, |
| 820 | + "execution_count": null, |
821 | 821 | "metadata": { |
822 | 822 | "collapsed": true |
823 | 823 | }, |
824 | 824 | "outputs": [], |
825 | 825 | "source": [ |
826 | | - "%psource PluralityLearner" |
| 826 | + "psource(PluralityLearner)" |
827 | 827 | ] |
828 | 828 | }, |
829 | 829 | { |
|
909 | 909 | }, |
910 | 910 | { |
911 | 911 | "cell_type": "code", |
912 | | - "execution_count": 25, |
| 912 | + "execution_count": null, |
913 | 913 | "metadata": { |
914 | 914 | "collapsed": true |
915 | 915 | }, |
916 | 916 | "outputs": [], |
917 | 917 | "source": [ |
918 | | - "%psource NearestNeighborLearner" |
| 918 | + "psource(NearestNeighborLearner)" |
919 | 919 | ] |
920 | 920 | }, |
921 | 921 | { |
|
991 | 991 | "\n", |
992 | 992 | "Information Gain is difference between entropy of the parent and weighted sum of entropy of children. The feature used for splitting is the one which provides the most information gain.\n", |
993 | 993 | "\n", |
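| | + "In symbols (using the standard definitions), if splitting on an attribute $A$ partitions the examples $S$ into subsets $S_v$, one for each value $v$ of $A$, then:\n", |
| | + "\n", |
| | + "$$Gain(S, A) = H(S) - \\sum_{v \\in Values(A)} \\frac{|S_v|}{|S|} H(S_v)$$\n", |
| | + "\n", |
| | + "where $H$ denotes entropy; the algorithm splits on the attribute with the largest gain.\n", |
| | + "\n", |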
| 994 | + "#### Pseudocode\n", |
| 995 | + "\n", |
| 996 | + "You can view the pseudocode by running the cell below:" |
| 997 | + ] |
| 998 | + }, |
| 999 | + { |
| 1000 | + "cell_type": "code", |
| 1001 | + "execution_count": null, |
| 1002 | + "metadata": { |
| 1003 | + "collapsed": true |
| 1004 | + }, |
| 1005 | + "outputs": [], |
| 1006 | + "source": [ |
| 1007 | + "pseudocode(\"Decision Tree Learning\")" |
| 1008 | + ] |
| 1009 | + }, |
| 1010 | + { |
| 1011 | + "cell_type": "markdown", |
| 1012 | + "metadata": {}, |
| 1013 | + "source": [ |
994 | 1014 | "### Implementation\n", |
995 | 1015 | "The nodes of the tree constructed by our learning algorithm are stored using either `DecisionFork` or `DecisionLeaf` based on whether they are a parent node or a leaf node respectively." |
996 | 1016 | ] |
997 | 1017 | }, |
998 | 1018 | { |
999 | 1019 | "cell_type": "code", |
1000 | | - "execution_count": 27, |
| 1020 | + "execution_count": null, |
1001 | 1021 | "metadata": { |
1002 | 1022 | "collapsed": true |
1003 | 1023 | }, |
1004 | 1024 | "outputs": [], |
1005 | 1025 | "source": [ |
1006 | | - "%psource DecisionFork" |
| 1026 | + "psource(DecisionFork)" |
1007 | 1027 | ] |
1008 | 1028 | }, |
1009 | 1029 | { |
|
1015 | 1035 | }, |
1016 | 1036 | { |
1017 | 1037 | "cell_type": "code", |
1018 | | - "execution_count": 28, |
| 1038 | + "execution_count": null, |
1019 | 1039 | "metadata": { |
1020 | 1040 | "collapsed": true |
1021 | 1041 | }, |
1022 | 1042 | "outputs": [], |
1023 | 1043 | "source": [ |
1024 | | - "%psource DecisionLeaf" |
| 1044 | + "psource(DecisionLeaf)" |
1025 | 1045 | ] |
1026 | 1046 | }, |
1027 | 1047 | { |
|
1033 | 1053 | }, |
1034 | 1054 | { |
1035 | 1055 | "cell_type": "code", |
1036 | | - "execution_count": 29, |
| 1056 | + "execution_count": null, |
1037 | 1057 | "metadata": { |
1038 | 1058 | "collapsed": true |
1039 | 1059 | }, |
1040 | 1060 | "outputs": [], |
1041 | 1061 | "source": [ |
1042 | | - "%psource DecisionTreeLearner" |
| 1062 | + "psource(DecisionTreeLearner)" |
1043 | 1063 | ] |
1044 | 1064 | }, |
1045 | 1065 | { |
|
1142 | 1162 | "source": [ |
1143 | 1163 | "### Implementation\n", |
1144 | 1164 | "\n", |
1145 | | - "The implementation of the Naive Bayes Classifier is split in two; Discrete and Continuous. The user can choose between them with the argument `continuous`." |
| 1165 | + "The implementation of the Naive Bayes Classifier is split in two; *Learning* and *Simple*. The *learning* classifier takes as input a dataset and learns the needed distributions from that. It is itself split into two, for discrete and continuous features. The *simple* classifier takes as input not a dataset, but already calculated distributions (a dictionary of `CountingProbDist` objects)." |
1146 | 1166 | ] |
1147 | 1167 | }, |
1148 | 1168 | { |
|
1237 | 1257 | }, |
1238 | 1258 | { |
1239 | 1259 | "cell_type": "code", |
1240 | | - "execution_count": 32, |
| 1260 | + "execution_count": null, |
1241 | 1261 | "metadata": { |
1242 | 1262 | "collapsed": true |
1243 | 1263 | }, |
1244 | 1264 | "outputs": [], |
1245 | 1265 | "source": [ |
1246 | | - "%psource NaiveBayesDiscrete" |
| 1266 | + "psource(NaiveBayesDiscrete)" |
1247 | 1267 | ] |
1248 | 1268 | }, |
1249 | 1269 | { |
|
1327 | 1347 | }, |
1328 | 1348 | { |
1329 | 1349 | "cell_type": "code", |
1330 | | - "execution_count": 35, |
| 1350 | + "execution_count": null, |
1331 | 1351 | "metadata": { |
1332 | 1352 | "collapsed": true |
1333 | 1353 | }, |
1334 | 1354 | "outputs": [], |
1335 | 1355 | "source": [ |
1336 | | - "%psource NaiveBayesContinuous" |
| 1356 | + "psource(NaiveBayesContinuous)" |
| 1357 | + ] |
| 1358 | + }, |
| 1359 | + { |
| 1360 | + "cell_type": "markdown", |
| 1361 | + "metadata": {}, |
| 1362 | + "source": [ |
| 1363 | + "#### Simple\n", |
| 1364 | + "\n", |
| 1365 | + "The simple classifier (chosen with the argument `simple`) does not learn from a dataset, instead it takes as input a dictionary of already calculated `CountingProbDist` objects and returns a predictor function. The dictionary is in the following form: `(Class Name, Class Probability): CountingProbDist Object`.\n", |
| 1366 | + "\n", |
| 1367 | + "Each class has its own probability distribution. The classifier given a list of features calculates the probability of the input for each class and returns the max. The only pre-processing work is to create dictionaries for the distribution of classes (named `targets`) and attributes/features.\n", |
| 1368 | + "\n", |
| 1369 | + "The complete code for the simple classifier:" |
| 1370 | + ] |
| 1371 | + }, |
| 1372 | + { |
| 1373 | + "cell_type": "code", |
| 1374 | + "execution_count": null, |
| 1375 | + "metadata": {}, |
| 1376 | + "outputs": [], |
| 1377 | + "source": [ |
| 1378 | + "psource(NaiveBayesSimple)" |
| 1379 | + ] |
| 1380 | + }, |
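| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "To make the idea concrete, here is a minimal sketch of what such a predictor does. It is an illustrative rewrite, not the library's implementation (the cell above prints the real code): it scores each class by its prior probability times the product of the per-feature probabilities from that class's `CountingProbDist`, and returns the name of the best-scoring class." |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "def naive_bayes_simple_sketch(distribution):\n", |
| | + "    # distribution: {(class_name, class_prob): CountingProbDist}\n", |
| | + "    def predict(example):\n", |
| | + "        def score(target):\n", |
| | + "            p = target[1]  # start from the class prior probability\n", |
| | + "            for feature in example:\n", |
| | + "                p *= distribution[target][feature]  # times P(feature | class)\n", |
| | + "            return p\n", |
| | + "        return max(distribution, key=score)[0]  # name of the best-scoring class\n", |
| | + "    return predict" |
| | + ] |
| | + }, |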
| 1381 | + { |
| 1382 | + "cell_type": "markdown", |
| 1383 | + "metadata": {}, |
| 1384 | + "source": [ |
| 1385 | + "This classifier is useful when you already have calculated the distributions and you need to predict future items." |
1337 | 1386 | ] |
1338 | 1387 | }, |
1339 | 1388 | { |
|
1385 | 1434 | "cell_type": "markdown", |
1386 | 1435 | "metadata": {}, |
1387 | 1436 | "source": [ |
1388 | | - "Notice how the Discrete Classifier misclassified the second item, while the Continuous one had no problem." |
| 1437 | + "Notice how the Discrete Classifier misclassified the second item, while the Continuous one had no problem.\n", |
| 1438 | + "\n", |
| 1439 | + "Let's now take a look at the simple classifier. First we will come up with a sample problem to solve. Say we are given three bags. Each bag contains three letters ('a', 'b' and 'c') of different quantities. We are given a string of letters and we are tasked with finding from which bag the string of letters came.\n", |
| 1440 | + "\n", |
| 1441 | + "Since we know the probability distribution of the letters for each bag, we can use the naive bayes classifier to make our prediction." |
| 1442 | + ] |
| 1443 | + }, |
| 1444 | + { |
| 1445 | + "cell_type": "code", |
| 1446 | + "execution_count": 2, |
| 1447 | + "metadata": { |
| 1448 | + "collapsed": true |
| 1449 | + }, |
| 1450 | + "outputs": [], |
| 1451 | + "source": [ |
| 1452 | + "bag1 = 'a'*50 + 'b'*30 + 'c'*15\n", |
| 1453 | + "dist1 = CountingProbDist(bag1)\n", |
| 1454 | + "bag2 = 'a'*30 + 'b'*45 + 'c'*20\n", |
| 1455 | + "dist2 = CountingProbDist(bag2)\n", |
| 1456 | + "bag3 = 'a'*20 + 'b'*20 + 'c'*35\n", |
| 1457 | + "dist3 = CountingProbDist(bag3)" |
| 1458 | + ] |
| 1459 | + }, |
| 1460 | + { |
| 1461 | + "cell_type": "markdown", |
| 1462 | + "metadata": {}, |
| 1463 | + "source": [ |
| 1464 | + "Now that we have the `CountingProbDist` objects for each bag/class, we will create the dictionary. We assume that it is equally probable that we will pick from any bag." |
| 1465 | + ] |
| 1466 | + }, |
| 1467 | + { |
| 1468 | + "cell_type": "code", |
| 1469 | + "execution_count": 3, |
| 1470 | + "metadata": { |
| 1471 | + "collapsed": true |
| 1472 | + }, |
| 1473 | + "outputs": [], |
| 1474 | + "source": [ |
| 1475 | + "dist = {('First', 0.5): dist1, ('Second', 0.3): dist2, ('Third', 0.2): dist3}\n", |
| 1476 | + "nBS = NaiveBayesLearner(dist, simple=True)" |
| 1477 | + ] |
| 1478 | + }, |
| 1479 | + { |
| 1480 | + "cell_type": "markdown", |
| 1481 | + "metadata": {}, |
| 1482 | + "source": [ |
| 1483 | + "Now we can start making predictions:" |
| 1484 | + ] |
| 1485 | + }, |
| 1486 | + { |
| 1487 | + "cell_type": "code", |
| 1488 | + "execution_count": 4, |
| 1489 | + "metadata": {}, |
| 1490 | + "outputs": [ |
| 1491 | + { |
| 1492 | + "name": "stdout", |
| 1493 | + "output_type": "stream", |
| 1494 | + "text": [ |
| 1495 | + "First\n", |
| 1496 | + "Second\n", |
| 1497 | + "Third\n" |
| 1498 | + ] |
| 1499 | + } |
| 1500 | + ], |
| 1501 | + "source": [ |
| 1502 | + "print(nBS('aab')) # We can handle strings\n", |
| 1503 | + "print(nBS(['b', 'b'])) # And lists!\n", |
| 1504 | + "print(nBS('ccbcc'))" |
| 1505 | + ] |
| 1506 | + }, |
| 1507 | + { |
| 1508 | + "cell_type": "markdown", |
| 1509 | + "metadata": {}, |
| 1510 | + "source": [ |
| 1511 | + "The results make intuitive sence. The first bag has a high amount of 'a's, the second has a high amount of 'b's and the third has a high amount of 'c's. The classifier seems to confirm this intuition.\n", |
| 1512 | + "\n", |
| 1513 | + "Note that the simple classifier doesn't distinguish between discrete and continuous values. It just takes whatever it is given. Also, the `simple` option on the `NaiveBayesLearner` overrides the `continuous` argument. `NaiveBayesLearner(d, simple=True, continuous=False)` just creates a simple classifier." |
1389 | 1514 | ] |
1390 | 1515 | }, |
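| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "As a quick, purely illustrative check of that last point, the classifier below is built with both arguments and behaves exactly like `nBS` above:" |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "nBS2 = NaiveBayesLearner(dist, simple=True, continuous=False)\n", |
| | + "print(nBS2('aab'))  # same prediction as nBS('aab')" |
| | + ] |
| | + }, |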
1391 | 1516 | { |
|
1423 | 1548 | }, |
1424 | 1549 | { |
1425 | 1550 | "cell_type": "code", |
1426 | | - "execution_count": 37, |
| 1551 | + "execution_count": null, |
1427 | 1552 | "metadata": { |
1428 | 1553 | "collapsed": true |
1429 | 1554 | }, |
1430 | 1555 | "outputs": [], |
1431 | 1556 | "source": [ |
1432 | | - "%psource PerceptronLearner" |
| 1557 | + "psource(PerceptronLearner)" |
1433 | 1558 | ] |
1434 | 1559 | }, |
1435 | 1560 | { |
|