6 changes: 5 additions & 1 deletion tutorial-13/tutorial-13-a/README.md
@@ -1,3 +1,7 @@
# Tutorial 13-a

This analyzer parses the URLs from https://state.1keydata.com/ into a URL list. A Python script then fetches the web pages and saves them in a folder, which can easily be moved into the second analyzer, where the pages are processed.

## NOTE

You will have to install BeautifulSoup and certifi before using the Python script.
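If they are not installed yet, `pip install beautifulsoup4 certifi` should be enough; note that BeautifulSoup is published on PyPI as `beautifulsoup4`.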
5 changes: 2 additions & 3 deletions tutorial-13/tutorial-13-a/input/urlfetch.py
@@ -5,12 +5,11 @@
from bs4 import BeautifulSoup
from pathlib import Path
import re
import certifi

wordsfile = os.path.join(os.path.dirname(__file__), "urls.txt")
file1 = codecs.open(wordsfile, "r", "utf-8")
lines = file1.readlines()

urlbase = "https://state.1keydata.com/"

count = 0
for url in lines:
@@ -31,7 +30,7 @@
    found = False

    try:
        page = urllib.request.urlopen(url)
        page = urllib.request.urlopen(url, cafile=certifi.where())
    except HTTPError as e:
        print(' Error code: ', e.code)
        file1 = open(os.path.join(os.path.dirname(__file__), "urlorphans.txt"), "a")
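A note on the fix itself: passing `cafile=certifi.where()` to `urllib.request.urlopen()` does make the request trust certifi's CA bundle, but the `cafile` argument has been deprecated since Python 3.6 and was removed in Python 3.12. Below is a minimal sketch of an equivalent call for newer interpreters, passing the bundle through an explicit `ssl` context; the `fetch` helper is hypothetical and only meant to illustrate the pattern.

```python
import ssl
import urllib.request
from urllib.error import HTTPError

import certifi

# SSL context that trusts certifi's CA bundle instead of the system store.
ctx = ssl.create_default_context(cafile=certifi.where())

def fetch(url):
    """Return the page bytes, or None when the server answers with an HTTP error."""
    try:
        with urllib.request.urlopen(url, context=ctx) as page:
            return page.read()
    except HTTPError as e:
        print(' Error code: ', e.code)
        return None
```

Either form works on Python 3.11 and earlier; the `context=` form is the one that keeps working going forward.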