Commit 73c2861

Add remove-duplicates script.
1 parent f1d8d75 commit 73c2861

3 files changed, with 202 additions and 110 deletions.

README.md

Lines changed: 121 additions & 110 deletions
@@ -65,22 +65,6 @@ and inserted directly into the README as markdown.
 
 
 
-## [python2.7/music-organizer.py](https://github.com/bamos/python-scripts/blob/master/python2.7/music-organizer.py)
-+ Authors: [Brandon Amos](http://bamos.github.io)
-+ Created: 2014.04.19
-
-
-This script (music-organizer.py) organizes my music collection for
-iTunes and [mpv](http://mpv.io) using tag information.
-The directory structure is `<artist>/<track>`, where `<artist>` and `<track>`
-are lower case strings separated by dashes.
-
-See my blog post
-[Using Python to organize a music directory](http://bamos.github.io/2014/07/05/music-organizer/)
-for a more detailed overview of this script.
-
-
-
 ## [python2.7/mt.py](https://github.com/bamos/python-scripts/blob/master/python2.7/mt.py)
 + Authors: [Brandon Amos](http://bamos.github.io)
 + Created: 2014.11.30
@@ -93,78 +77,39 @@ of the output.
 
 
 
-## [python3/github-repo-summary.py](https://github.com/bamos/python-scripts/blob/master/python3/github-repo-summary.py)
-+ Authors: [Brandon Amos](http://bamos.github.io)
-+ Created: 2014.11.02
-
-
-Produces a Markdown table concisely summarizing a list of GitHub repositories.
-
-
-
-## [python3/link-checker.py](https://github.com/bamos/python-scripts/blob/master/python3/link-checker.py)
+## [python2.7/music-organizer.py](https://github.com/bamos/python-scripts/blob/master/python2.7/music-organizer.py)
 + Authors: [Brandon Amos](http://bamos.github.io)
-+ Created: 2014.02.06
-
-
-Script to be run by crontab to report broken links.
-
-Builds upon linkchecker (Ubuntu: sudo apt-get install linkchecker)
-to hide warnings and to send a concise email if bad links are found.
-
-![Link checker screenshot](https://raw.githubusercontent.com/bamos/python-scripts/master/link-checker-screenshot.png?raw=true)
-
++ Created: 2014.04.19
 
 
-## [python3/phonetic.py](https://github.com/bamos/python-scripts/blob/master/python3/phonetic.py)
-+ Authors: [Brandon Amos](http://bamos.github.io)
-+ Created: 2014.02.14
-
+This script (music-organizer.py) organizes my music collection for
+iTunes and [mpv](http://mpv.io) using tag information.
+The directory structure is `<artist>/<track>`, where `<artist>` and `<track>`
+are lower case strings separated by dashes.
 
-Obtain the NATO phonetic alphabet representation from short phrases.
+See my blog post
+[Using Python to organize a music directory](http://bamos.github.io/2014/07/05/music-organizer/)
+for a more detailed overview of this script.
 
-```
-$ phonetic.py github
-g - golf
-i - india
-t - tango
-h - hotel
-u - uniform
-b - bravo
-```
 
 
+## [python3/eval-expr.py](https://github.com/bamos/python-scripts/blob/master/python3/eval-expr.py)
++ Authors: J. Sebastian, [Brandon Amos](http://bamos.github.io)
++ Created: 2013.08.01
 
-## [python3/rank-writing.py](https://github.com/bamos/python-scripts/blob/master/python3/rank-writing.py)
-+ Authors: [Brandon Amos](http://bamos.github.io)
-+ Created: 2014.02.14
 
+A module to evaluate a mathematical expression using Python's AST.
 
-`rank-writing.py` ranks the writing quality of my
-blog's Markdown posts and my project's Markdown README files.
++ Original by: J. Sebastian at http://stackoverflow.com/questions/2371436.
++ Modifications by: [Brandon Amos](http://bamos.github.io).
 
-The following programs should be on your `PATH`:
-+ [aspell](http://aspell.net/)
-+ [write-good](https://github.com/btford/write-good)
-+ [diction](https://www.gnu.org/software/diction/)
+If you want a command-line expression evaluator, use
+[Russell91/pythonpy](https://github.com/Russell91/pythonpy).
 
 
 ```
-$ rank-writing.py *.md
-
-=== 2013-05-03-scraping-tables-python.md ===
-Total: 53
-├── aspell: 34
-├── diction: 0
-└── write-good: 19
-
-...
-
-=== 2013-04-16-pdf-from-plaintext.md ===
-Total: 0
-├── aspell: 0
-├── diction: 0
-└── write-good: 0
+$ eval-expr.py '(((4+6)*10)<<2)'
+(((4+6)*10)<<2) = 400
 ```
 
 
@@ -182,10 +127,10 @@ delete the current wallpaper.
 
 
 ### Warning
-+ This approach doesn't work with multiple monitors.
++ This approach doesn't work with multiple monitors or virtual desktops.
 
 ### Tested On
-+ OSX Yosemite 10.10.2 with a single monitor on a MBP.
++ OSX Yosemite 10.10.2 with a single desktop on a MBP.
 
 ### Usage
 Ensure `db_path` and `wallpaper_dir` are correctly set below.
@@ -220,6 +165,38 @@ alias rm-wallpaper='rm $(get-osx-wallpaper.py) && killall Dock'
 
 
 
+## [python3/github-repo-summary.py](https://github.com/bamos/python-scripts/blob/master/python3/github-repo-summary.py)
++ Authors: [Brandon Amos](http://bamos.github.io)
++ Created: 2014.11.02
+
+
+Produces a Markdown table concisely summarizing a list of GitHub repositories.
+
+
+
+## [python3/link-checker.py](https://github.com/bamos/python-scripts/blob/master/python3/link-checker.py)
++ Authors: [Brandon Amos](http://bamos.github.io)
++ Created: 2014.02.06
+
+
+Script to be run by crontab to report broken links.
+
+Builds upon linkchecker (Ubuntu: sudo apt-get install linkchecker)
+to hide warnings and to send a concise email if bad links are found.
+
+![Link checker screenshot](https://raw.githubusercontent.com/bamos/python-scripts/master/link-checker-screenshot.png?raw=true)
+
+
+
+## [python3/merge-mutt-contacts.py](https://github.com/bamos/python-scripts/blob/master/python3/merge-mutt-contacts.py)
++ Authors: [Brandon Amos](http://bamos.github.io)
++ Created: 2014.01.08
+
+
+Merges two mutt contact files.
+
+
+
 ## [python3/merge-pdfs-printable.py](https://github.com/bamos/python-scripts/blob/master/python3/merge-pdfs-printable.py)
 + Authors: [Brandon Amos](http://bamos.github.io)
 + Created: 2014.10.17
@@ -257,6 +234,70 @@ PS file.
 
 
 
+## [python3/phonetic.py](https://github.com/bamos/python-scripts/blob/master/python3/phonetic.py)
++ Authors: [Brandon Amos](http://bamos.github.io)
++ Created: 2014.02.14
+
+
+Obtain the NATO phonetic alphabet representation from short phrases.
+
+```
+$ phonetic.py github
+g - golf
+i - india
+t - tango
+h - hotel
+u - uniform
+b - bravo
+```
+
+
+
+## [python3/rank-writing.py](https://github.com/bamos/python-scripts/blob/master/python3/rank-writing.py)
++ Authors: [Brandon Amos](http://bamos.github.io)
++ Created: 2014.02.14
+
+
+`rank-writing.py` ranks the writing quality of my
+blog's Markdown posts and my project's Markdown README files.
+
+The following programs should be on your `PATH`:
++ [aspell](http://aspell.net/)
++ [write-good](https://github.com/btford/write-good)
++ [diction](https://www.gnu.org/software/diction/)
+
+
+```
+$ rank-writing.py *.md
+
+=== 2013-05-03-scraping-tables-python.md ===
+Total: 53
+├── aspell: 34
+├── diction: 0
+└── write-good: 19
+
+...
+
+=== 2013-04-16-pdf-from-plaintext.md ===
+Total: 0
+├── aspell: 0
+├── diction: 0
+└── write-good: 0
+```
+
+
+
+## [python3/remove-duplicates.py](https://github.com/bamos/python-scripts/blob/master/python3/remove-duplicates.py)
++ Authors: [Brandon Amos](http://bamos.github.io)
++ Created: 2015.06.06
+
+
+Detect and remove duplicate images using average hashing.
+
+http://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html
+
+
+
 ## [python3/word-counter.py](https://github.com/bamos/python-scripts/blob/master/python3/word-counter.py)
 + Authors: [Brandon Amos](http://bamos.github.io)
 + Created: 2014.11.7
@@ -288,45 +329,15 @@ $ word-counter.py shakespeare.md --numWords 4 --maxTuples 3
 
 
 
-## [python3/eval-expr.py](https://github.com/bamos/python-scripts/blob/master/python3/eval-expr.py)
-+ Authors: J. Sebastian, [Brandon Amos](http://bamos.github.io)
-+ Created: 2013.08.01
-
-
-A module to evaluate a mathematical expression using Python's AST.
-
-+ Original by: J. Sebastian at http://stackoverflow.com/questions/2371436.
-+ Modifications by: [Brandon Amos](http://bamos.github.io).
-
-If you want a command-line expression evaluator, use
-[Russell91/pythonpy](https://github.com/Russell91/pythonpy).
-
-
-```
-$ eval-expr.py '(((4+6)*10)<<2)'
-(((4+6)*10)<<2) = 400
-```
-
-
-
-## [python3/merge-mutt-contacts.py](https://github.com/bamos/python-scripts/blob/master/python3/merge-mutt-contacts.py)
-+ Authors: [Brandon Amos](http://bamos.github.io)
-+ Created: 2014.01.08
-
-
-Merges two mutt contact files.
-
-
-
 # Similar Projects
 There are many potpourri Python script repositories on GitHub.
 The following list shows a short sampling of projects,
 and I'm happy to merge pull requests of other projects.
 
 Name | Stargazers | Description
 ----|----|----
-[averagesecurityguy/Python-Examples](https://github.com/averagesecurityguy/Python-Examples) | 18 | Example scripts for common python tasks
+[averagesecurityguy/Python-Examples](https://github.com/averagesecurityguy/Python-Examples) | 20 | Example scripts for common python tasks
 [ClarkGoble/Scripts](https://github.com/ClarkGoble/Scripts) | 26 | My scripts - primarily using python and appscript
-[computermacgyver/twitter-python](https://github.com/computermacgyver/twitter-python) | 40 | Simple example scripts for Twitter data collection with Tweepy in Python
-[gpambrozio/PythonScripts](https://github.com/gpambrozio/PythonScripts) | 40 | A bunch of Python scripts I made and that might interest somebody else
-[realpython/python-scripts](https://github.com/realpython/python-scripts) | 56 | because i'm tired of gists
+[computermacgyver/twitter-python](https://github.com/computermacgyver/twitter-python) | 45 | Simple example scripts for Twitter data collection with Tweepy in Python
+[gpambrozio/PythonScripts](https://github.com/gpambrozio/PythonScripts) | 39 | A bunch of Python scripts I made and that might interest somebody else
+[realpython/python-scripts](https://github.com/realpython/python-scripts) | 59 | because i'm tired of gists
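
The README entry for `eval-expr.py` above says it evaluates a mathematical expression using Python's AST. The sketch below is not the contents of that script; it is a minimal illustration of the general idea, assuming Python 3.8+ (where literals parse to `ast.Constant`). The names `eval_expr`, `_eval_node`, and the operator whitelist are hypothetical choices made for this example.

```python
import ast
import operator

# Whitelist of AST operator node types and the functions that implement them.
# Anything not listed here is rejected rather than evaluated.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.LShift: operator.lshift,
    ast.RShift: operator.rshift,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def _eval_node(node):
    """Recursively evaluate a whitelisted arithmetic AST node."""
    if isinstance(node, ast.Constant):
        value = node.value
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            return value
    elif isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        left = _eval_node(node.left)
        right = _eval_node(node.right)
        return _BINARY_OPS[type(node.op)](left, right)
    elif isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError("unsupported expression: {}".format(ast.dump(node)))


def eval_expr(expr):
    """Evaluate an arithmetic expression string without calling eval()."""
    return _eval_node(ast.parse(expr, mode="eval").body)


if __name__ == "__main__":
    expr = "(((4+6)*10)<<2)"  # the expression used in the README example
    print("{} = {}".format(expr, eval_expr(expr)))
```

Walking the parsed tree and dispatching only on whitelisted node types means names, calls, and attribute access raise `ValueError` instead of running, which is the main reason to prefer this over `eval()` for untrusted input.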

python3/remove-duplicates.py

Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
+#!/usr/bin/env python3
+
+__author__ = ['[Brandon Amos](http://bamos.github.io)']
+__date__ = '2015.06.06'
+
+"""
+Detect and remove duplicate images using average hashing.
+
+http://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html
+"""
+
+import argparse
+import imagehash
+import os
+import sys
+
+from collections import defaultdict
+from PIL import Image
+
+
+def getImgs(d):
+    """Get the images from the test directory, partitioned by class."""
+    exts = ["jpg", "png"]
+
+    imgClasses = []  # Images, separated by class.
+    for subdir, dirs, files in os.walk(d):
+        imgs = []
+        for fName in files:
+            (imageClass, imageName) = (os.path.basename(subdir), fName)
+            if any(imageName.lower().endswith("." + ext) for ext in exts):
+                imgs.append(os.path.join(subdir, fName))
+        imgClasses.append(imgs)
+    return imgClasses
+
+
+def getHash(imgPath):
+    """Get the hash of an image, and catch exceptions if the image
+    file is corrupted."""
+    try:
+        return imagehash.average_hash(Image.open(imgPath))
+    except:
+        return None
+
+
+def runOnClass(args, imgs):
+    """Find and remove duplicates within an image class."""
+    d = defaultdict(list)
+    for imgPath in imgs:
+        imgHash = getHash(imgPath)
+        if imgHash:
+            d[imgHash].append(imgPath)
+
+    numFound = 0
+    for imgHash, imgs in d.items():
+        if len(imgs) > 1:
+            print("{}: {}".format(imgHash, " ".join(imgs)))
+            numFound += len(imgs) - 1  # Keep a single image.
+
+            if args.delete:
+                largestImg = max(imgs, key=os.path.getsize)
+                print("Keeping {}.".format(largestImg))
+                imgs.remove(largestImg)
+                for img in imgs:
+                    os.remove(img)
+    return numFound
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser()
+    parser.add_argument('inplaceDir', type=str,
+                        help="Directory of images, divided into "
+                        "subdirectories by class.")
+    parser.add_argument('--delete', action='store_true',
+                        help="Delete the smallest duplicate images instead "
+                        "of just listing them.")
+    args = parser.parse_args()
+
+    numFound = 0
+    for imgClass in getImgs(args.inplaceDir):
+        numFound += runOnClass(args, imgClass)
+    print("\n\nFound {} total duplicate images.".format(numFound))
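
The script above delegates hashing to `imagehash.average_hash` and treats two images as duplicates when their hashes are equal. To show what an average hash computes, here is a simplified pure-Python sketch. It is not the `imagehash` implementation: it assumes the image is already a 2-D grid of grayscale values whose sides are multiples of the hash size, and it uses plain block averaging for the downscale. The function names and the synthetic gradient images are hypothetical.

```python
def average_hash(pixels, hash_size=8):
    """Return an average hash of a 2-D grid of grayscale values as an int.

    Assumes `pixels` is a list of equal-length rows whose height and width
    are both multiples of `hash_size`.
    """
    height, width = len(pixels), len(pixels[0])
    block_h, block_w = height // hash_size, width // hash_size

    # Step 1: shrink to hash_size x hash_size by averaging each block.
    small = []
    for by in range(hash_size):
        for bx in range(hash_size):
            block = [pixels[y][x]
                     for y in range(by * block_h, (by + 1) * block_h)
                     for x in range(bx * block_w, (bx + 1) * block_w)]
            small.append(sum(block) / len(block))

    # Step 2: compute the mean of the shrunken image.
    mean = sum(small) / len(small)

    # Step 3: emit one bit per cell, set when the cell is above the mean.
    bits = 0
    for value in small:
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming_distance(hash_a, hash_b):
    """Count the bit positions in which two hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


if __name__ == "__main__":
    # Synthetic 16x16 images: a left-to-right gradient, a uniformly
    # brighter copy, and an inverted copy.
    gradient = [[x * 16 for x in range(16)] for _ in range(16)]
    brighter = [[v + 10 for v in row] for row in gradient]
    inverted = [[255 - v for v in row] for row in gradient]

    print(average_hash(gradient) == average_hash(brighter))
    print(hamming_distance(average_hash(gradient), average_hash(inverted)))
```

Because each bit only records whether a cell is above the image's own mean, a uniform brightness shift leaves the hash unchanged, while structurally different images differ in many bits. Comparing by Hamming distance rather than strict equality would also catch near-duplicates, at the cost of choosing a threshold.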

requirements-3.txt

Lines changed: 1 addition & 0 deletions
@@ -2,3 +2,4 @@ Jinja2==2.7.2
 PyGithub==1.25.2
 PyPDF2==1.23
 toolz==0.7.1
+imagehash==0.3
