Week 8 workshops (NumPy)

pbosilj · pbosilj · commit aff94d4a65a2 · 2024-11-10T17:46:36.000Z
diff --git a/CMP9065 Data Programming in Python/workshops/Workshop 8 - Numpy.ipynb b/CMP9065 Data Programming in Python/workshops/Workshop 8 - Numpy.ipynb
@@ -0,0 +1,276 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Workshop 8 - NumPy"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In today's workshop, you will work with a reduced set of data from [[1]](#References). This data describes measurements of different dry beans, used for designing a system for classifying different types of beans. To run this workshop, you will need to download the file `beans.txt` from Blackboard and place it in the same folder as your workshop notebook.\n",
+    "\n",
+    "Through the workshop, you will be expected to analyse the measured characteristics of the beans in the dataset:\n",
+    "- [Loading the dataset](#Loading-the-dataset)\n",
+    "- [Analysing the dataset](#Analysing-the-dataset)\n",
+    "  - [Array dimensions](#Array-dimensions)\n",
+    "    - [Exercise 1](#Exercise-1)\n",
+    "  - [Extreme values](#Extreme-values)\n",
+    "    - [Exercise 2](#Exercise-2)\n",
+    "    - [Exercise 3](#Exercise-3)\n",
+    "  - [Calculating new features](#Calculating-new-features)\n",
+    "    - [Exercise 4](#Exercise-5)\n",
+    "  - [Dataset statistics](#Dataset-statistics)\n",
+    "    - [Exercise 5](#Exercise-5)\n",
+    "    - [Exercise 6](#Exercise-6)\n",
+    "- [References](#References)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Loading the dataset\n",
+    "\n",
+    "The file `beans.txt` contains some measurements on 7 different types of beans. The below code snippet loads this data into a NumPy array, such that every row corresponds to a different bean.\n",
+    "\n",
+    "The columns correspond to the following measurements of the bean: **area**, **perimiter**, **major axis length**, **minor axis length**. The last column represents the **class** (type of bean), and is encoded as an integer number.\n",
+    "\n",
+    "The code for loading the dataset from the file `beans.txt` is provided. This code relies only on the material covered in the lectures and workshops so far. In the coming lectures, we will learn about libraries better suited for handling large data files."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "with open('beans.txt', 'r') as beans_data:\n",
+    "    beans = np.array([line.strip().split(',') for line in beans_data], dtype='float')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To ensure we have correctly loaded the dataset, let's inspect the features (measurements) of the first bean:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[28395.           610.291        208.17811671   173.88874704\n",
+      "     0.        ]\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(beans[0])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Analysing the dataset"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Array dimensions\n",
+    "\n",
+    "Let's try and get a basic sense of the dataset by looking the number of samples it contains."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Exercise 1\n",
+    "\n",
+    "How many different samples (beans) does this dataset contain?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "13611\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(beans.shape[0])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Extreme values"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Exercise 2\n",
+    "\n",
+    "Let's get a deeper look into the dataset. What is the area of the largest bean in the datset?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Exercise 3\n",
+    " \n",
+    "Can you print all the features of the largest bean?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Calculating new features"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Exercise 4\n",
+    "\n",
+    "Beans have an approximately ellipsoid shape. The ratio between the major and minor axis can tell us something about the shape of the bean. Calculate the **aspect ratio** between the major and minor axis of every bean and store it in a new array `aspect_ratio`.\n",
+    "\n",
+    "**Extension:** Add this new informatio to the array representing the dataset as a new column. You will need [`numpy.transpose()`](https://numpy.org/doc/stable/reference/generated/numpy.transpose.html) and [`numpy.hstack()`](https://numpy.org/doc/stable/reference/generated/numpy.hstack.html#numpy.hstack)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Dataset statistics"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Exercise 5\n",
+    "\n",
+    "First, calculate the mean and standard deviation of the aspect ratio of all the beans."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "mean_all = np.mean(beans[:, 5])\n",
+    "std_all = np.std(beans[:, 5])\n",
+    "print(mean_all, std_all)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Exercise 6\n",
+    "\n",
+    "Now, let's compare the above global statistics for the dataset to those of class 3. Claculate the mean and standard deviation of the aspect ratio of all the beans of class 3 (you will need to use mask idexing)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "You should be able to observe a difference between the mean and standard deviation of the measured characteristic between the whole dataset and class 3. **How would you interpret this difference** (especially the one in the standard deviation)?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### References\n",
+    "\n",
+    "[1] KOKLU, M. and OZKAN, I.A., (2020), \"Multiclass Classification of Dry Beans Using Computer Vision and Machine Learning Techniques.\" Computers and Electronics in Agriculture, 174, 105507. DOI: https://doi.org/10.1016/j.compag.2020.105507"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.6"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}