|
| 1 | +--- |
| 2 | +layout: post |
| 3 | +title: "Code Walkthrough: Tablib, a Python Module for Tabular Datasets" |
| 4 | +date: 2018-10-08 21:00:00 |
| 5 | +comments: true |
| 6 | +categories: blog |
| 7 | +image: /assets/img/source_code.png |
| 8 | +description: Reading Great Code and its benefits. Code walkthrough of the Tablib Python module by Nipun Sadvilkar |
| 9 | +--- |
| 10 | +<hr> |
| 11 | + |
| 12 | +<h1 style="font-size: 30px;">Motivation</h1> |
| 13 | + |
| 14 | +Often, I dive into open source projects to learn the best practices and design patterns that experienced programmers use to do things correctly and efficiently. [Peter Norvig](https://en.wikipedia.org/wiki/Peter_Norvig) makes the same point in his famous blog post [Teach Yourself Programming in Ten Years](http://norvig.com/21-days.html): |
| 15 | + |
| 16 | +> *Talk with other programmers; read other programs. This is more important than any book or training course.* |
| 17 | +
|
| 18 | +I am a big advocate of this advice. This blog post shows how reading open source code helps you identify and understand efficient patterns and coding constructs. |
| 19 | + |
| 20 | +<h1 style="font-size: 30px;">Tablib</h1> |
| 21 | + |
| 22 | +I admire [Kenneth Reitz](https://github.com/kennethreitz) very much. Do read and follow his [The Hitchhiker’s Guide to Python!](https://docs.python-guide.org) to become a great Python programmer. One lesson from this book, [Reading Great Code](https://docs.python-guide.org/writing/reading/?highlight=tablib#reading-great-code), is the main reason I decided to read the source code of [Tablib](https://github.com/kennethreitz/tablib). Reading source code is daunting at first because some constructs are obscure or unfamiliar, which is natural. Despite such hurdles, if you keep concentrating you will find a lot of "Aha!" moments as you identify useful patterns. In my case, I came across a very simple yet useful snippet for a common and important data-cleaning task: removing duplicates. |
| 23 | + |
| 24 | +[Source code: Tablib `remove_duplicates` method](http://docs.python-tablib.org/en/master/_modules/tablib/core/#Dataset.remove_duplicates): |
| 25 | + |
| 26 | +```python |
| 27 | +def remove_duplicates(self): |
| 28 | +    """Removes all duplicate rows from the :class:`Dataset` object |
| 29 | +    while maintaining the original order.""" |
| 30 | +    seen = set() |
| 31 | +    self._data[:] = [row for row in self._data if not (tuple(row) in seen or seen.add(tuple(row)))] |
| 32 | +``` |
| 33 | + |
| 34 | +Look at the `if` clause inside the list comprehension (a close relative of the [_generator expression_](https://dbader.org/blog/python-generator-expressions)). The technique it uses to check for duplicate rows relies on Python's [_short-circuit evaluation_](https://www.geeksforgeeks.org/short-circuiting-techniques-python/). |
| 35 | + |
| 36 | + |
| 37 | +[Short circuit explained by official docs](https://docs.python.org/2/library/stdtypes.html#boolean-operations-and-or-not): |
| 38 | + |
| 39 | +|Operation|Result|Notes| |
| 40 | +|---|---|---| |
| 41 | +|`x or y`|if x is false, then y, else x|Only evaluates the second argument (`y`) if the first one (`x`) is false.| |
| 42 | +|`x and y`|if x is false, then x, else y|Only evaluates the second argument (`y`) if the first one (`x`) is true.| |
| 43 | +|`not x`|if x is false, then True, else False|`not` has a lower priority than non-Boolean operators| |
| 44 | + |
| 45 | +<br> |
| 46 | +The `remove_duplicates` method uses the first and third operations from the table above. |
| 47 | + |
| 48 | +The key thing to remember is: |
| 49 | + |
| 50 | +**Expressions are evaluated from left to right.** |
| 51 | + |
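A small sketch can make the left-to-right rule concrete. The helper `probe` below is a hypothetical name introduced only for illustration (it is not part of Tablib or the standard library); it records each call so we can see which operands Python actually evaluated.

```python
calls = []

def probe(label, value):
    """Record that this operand was evaluated, then return its value."""
    calls.append(label)
    return value

# `or` stops at the first truthy operand, so "right" is never evaluated.
result_or = probe("left", True) or probe("right", False)
calls_or = list(calls)
calls.clear()

# `and` stops at the first falsy operand and returns that operand itself.
result_and = probe("left", 0) and probe("right", 1)
calls_and = list(calls)
calls.clear()

# The left side is falsy here, so Python goes on to evaluate the right side.
# set.add always returns None, so the whole `or` expression is None (falsy).
seen = set()
result_add = probe("left", 1 in seen) or probe("right", seen.add(1))
calls_add = list(calls)

print(result_or, calls_or)          # True ['left']
print(result_and, calls_and)        # 0 ['left']
print(result_add, calls_add, seen)  # None ['left', 'right'] {1}
```

The last case is the one the deduplication idiom depends on: the right-hand side runs only for its side effect of adding to the set, and its `None` result keeps the overall expression falsy.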
| 52 | +Here is the deduplication idiom applied to a toy dataset: |
| 53 | +```python |
| 54 | +>>> _data = [[1, 2, 3], [4, 5, 6], [1, 2, 3]] |
| 55 | +>>> seen = set() |
| 56 | +>>> data_deduplicated = [row for row in _data if not (tuple(row) in seen or seen.add(tuple(row)))] |
| 57 | + |
| 58 | +>>> print(data_deduplicated) |
| 59 | +# [[1, 2, 3], [4, 5, 6]] |
| 60 | +``` |
| 61 | + |
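| 62 | +To put it into words: the list comprehension iterates over the data row by row and checks whether the given row is already in the `seen` _set_. (Each row is converted to a tuple first, because lists are unhashable and cannot be stored in a set.) If it is not present, |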
| 63 | +```python |
| 64 | +tuple(row) in seen |
| 65 | +``` |
| 66 | +evaluates to `False` and, as per the first operation in the table, Python evaluates the second argument, which adds the row to the `seen` _set_. Because `set.add` always returns `None`, which is falsy, the whole parenthesized expression is falsy, the `if not (...)` condition is satisfied, and the row is added to the output list. When the same row occurs again, `tuple(row) in seen` is `True`, so `or` short-circuits without calling `seen.add`, `not True` is `False`, and the row is skipped. Overall, this removes duplicate rows while preserving the original order. |
| 67 | + |
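The same logic can be written as an explicit loop, which may be easier to follow than the one-liner. This is a minimal sketch and not Tablib's implementation; the function name `dedupe_rows` is a hypothetical one chosen for illustration.

```python
def dedupe_rows(rows):
    """Return rows with duplicates removed, keeping first occurrences in order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)    # lists are unhashable, so use a tuple as the set key
        if key in seen:
            continue        # duplicate: skip it, as the short-circuit does
        seen.add(key)       # first occurrence: remember it ...
        unique.append(row)  # ... and keep the row
    return unique

data = [[1, 2, 3], [4, 5, 6], [1, 2, 3]]
print(dedupe_rows(data))  # [[1, 2, 3], [4, 5, 6]]
```

One difference worth noting: the Tablib method assigns to `self._data[:]`, which replaces the contents of the existing list in place, whereas this sketch returns a new list.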
| 68 | +If you are more of a visual learner, the following demonstration using the [Python Tutor tool](http://pythontutor.com/), built by an outstanding academic and prolific blogger, [Philip Guo](http://pgbovine.net), would help*: |
| 69 | +> *If the iframe below is not visible, please enable **"load unsecure script"** in your browser. Don't worry! The browser flags it as insecure only because [Python Tutor](http://pythontutor.com/) is served over http and not **https**. |
| 70 | +
|
| 71 | +<iframe width="820" height="650" frameborder="1.5" src="http://pythontutor.com/iframe-embed.html#code=_data%20%3D%20%5B%5B1,2,3%5D,%20%5B4,5,6%5D,%20%5B1,2,3%5D%5D%0Aseen%20%3D%20set%28%29%0Adata_deduplicated%20%3D%20%5Brow%20for%20row%20in%20_data%20if%20not%20%28tuple%28row%29%20in%20seen%20or%20seen.add%28tuple%28row%29%29%29%5D&codeDivHeight=400&codeDivWidth=350&cumulative=false&curInstr=6&heapPrimitives=nevernest&origin=opt-frontend.js&py=2&rawInputLstJSON=%5B%5D&textReferences=false"> </iframe> |
| 72 | + |
| 73 | +I hope by now you understand the [_short-circuit technique_](https://www.geeksforgeeks.org/short-circuiting-techniques-python/) and the importance of reading open source code. So keep exploring and do share your experience with me. Thank you! :) |