
Setup

Open settings.py and adjust the database settings
DATABASE_ENGINE can be either "mysql" or "sqlite"
For sqlite, only DATABASE_HOST is used, and its value should begin with a '/'
For mysql, all of the other DATABASE_* settings are required as well
VERBOSE mode makes the crawler print stats as it runs
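
As a rough illustration of the settings described above, here is a minimal sketch of what a sqlite configuration might look like, plus a small sanity check. Only DATABASE_ENGINE, DATABASE_HOST and VERBOSE are named in this README. The other DATABASE_* names, the example path, and the check_settings helper are assumptions made for illustration and may not match the real settings.py.

```python
# Names taken from the Setup notes above.
DATABASE_ENGINE = "sqlite"           # "mysql" or "sqlite"
DATABASE_HOST = "/tmp/PyCrawler.db"  # for sqlite the value must begin with '/' (example path)
VERBOSE = True                       # print stats while crawling

# Hypothetical names standing in for the remaining DATABASE_* settings.
DATABASE_NAME = ""
DATABASE_USER = ""
DATABASE_PASS = ""


def check_settings(engine, host, others):
    """Return a list of problems found in the settings (hypothetical helper).

    `others` maps the remaining DATABASE_* setting names to their values.
    """
    problems = []
    if engine not in ("mysql", "sqlite"):
        problems.append("DATABASE_ENGINE must be 'mysql' or 'sqlite'")
    elif engine == "sqlite":
        if not host.startswith("/"):
            problems.append("DATABASE_HOST must begin with '/' for sqlite")
    else:
        # mysql: every DATABASE_* setting has to be filled in.
        if not host:
            problems.append("DATABASE_HOST is required for mysql")
        for name, value in others.items():
            if not value:
                problems.append(name + " is required for mysql")
    return problems
```

Calling check_settings(DATABASE_ENGINE, DATABASE_HOST, {}) with the values above returns an empty list, while a sqlite host without the leading '/' is reported as a problem.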

Current State

The mysql engine is untested
There are lots of debug prints
In some situations the database is locked and queries cannot execute. This is presumably an issue only with sqlite's file-based approach
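
One common way to soften "database is locked" errors with sqlite is to give the connection a longer busy timeout and retry the statement a few times. The sketch below is not part of PyCrawler, and the helper name execute_with_retry and its parameters are made up for illustration. It uses only Python's standard sqlite3 module.

```python
import sqlite3
import time


def execute_with_retry(conn, sql, params=(), retries=5, delay=0.1):
    """Run one statement, retrying with backoff while the database is locked."""
    for attempt in range(retries):
        try:
            # "with conn" commits on success and rolls back on an exception.
            with conn:
                return conn.execute(sql, params).fetchall()
        except sqlite3.OperationalError as exc:
            # Re-raise anything that is not a lock error, or the final failure.
            if "locked" not in str(exc) or attempt == retries - 1:
                raise
            time.sleep(delay * (2 ** attempt))


# timeout is how long sqlite3 itself waits on a lock before raising.
conn = sqlite3.connect(":memory:", timeout=30.0)
execute_with_retry(conn, "CREATE TABLE links (url TEXT)")
execute_with_retry(conn, "INSERT INTO links VALUES (?)", ("http://example.com/",))
rows = execute_with_retry(conn, "SELECT url FROM links")
```

Whether this fits the crawler depends on how it opens and shares its connections, which this README does not describe.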

Misc

Designed to run on multiple machines that work together to collect info in a central DB
Links to be crawled are queued in the database, so any machine running the crawler against the central DB can pull from the same queue. This reduces crawling redundancy.
Thread pool approach to analyzing keywords in text
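
The two ideas above, a link queue stored in the database and a thread pool for keyword analysis, can be sketched roughly as follows. This is an illustrative sketch using sqlite and the standard library, not the project's actual code. The table layout and every function name here are assumptions.

```python
import re
import sqlite3
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def open_queue(path):
    # isolation_level=None leaves transaction control to the explicit
    # BEGIN/COMMIT statements in claim_next().
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS queue ("
        "id INTEGER PRIMARY KEY, url TEXT UNIQUE, claimed INTEGER DEFAULT 0)"
    )
    return conn


def enqueue(conn, url):
    # UNIQUE plus INSERT OR IGNORE keeps a link from being queued twice.
    conn.execute("INSERT OR IGNORE INTO queue (url) VALUES (?)", (url,))


def claim_next(conn):
    """Mark the oldest unclaimed link as claimed and return its URL, or None."""
    # Taking the write lock up front stops two crawlers claiming the same row.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT id, url FROM queue WHERE claimed = 0 ORDER BY id LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute("UPDATE queue SET claimed = 1 WHERE id = ?", (row[0],))
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return row[1] if row else None


def count_words(text):
    # Lowercase the text and count each word-like token.
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def count_keywords(texts, workers=4):
    """Count words across several page texts using a pool of worker threads."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for counts in pool.map(count_words, texts):
            total.update(counts)
    return total
```

Each crawler process would call claim_next() in a loop and enqueue() any new links it finds. Because a claimed row is never handed out again, two machines sharing the database do not crawl the same link twice.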

About

A python web crawler
