Personal Project
I decided I liked the idea of writing a program to parse my squid access.log so that I could categorize the web requests passing through my proxy using shallalist (http://www.shallalist.de/). If I succeeded and cross-referenced the information with time and IP, I could start doing all kinds of cool little analytics to get a handle on the day-to-day traffic that I'm seeing.
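As a rough sketch of what that cross-referencing could look like: the snippet below assumes squid's default "native" log format, where the first field is a Unix timestamp, the third is the client IP and the seventh is the requested URL. The helper names `parse_access_line` and `tally_requests` are made up for illustration, and hours are reported in UTC for simplicity.

```python
from collections import Counter
from datetime import datetime, timezone
from urllib.parse import urlsplit


def parse_access_line(line):
    """Return (hour, client_ip, host) for one log line, or None if unusable."""
    fields = line.split()
    if len(fields) < 7:
        return None
    # Assumed native format: time elapsed client code/status bytes method URL ...
    try:
        timestamp = float(fields[0])
    except ValueError:
        return None
    client_ip, url = fields[2], fields[6]
    host = urlsplit(url).hostname
    if host is None:
        # No "scheme://host" part, e.g. CONNECT requests logged as host:port.
        return None
    hour = datetime.fromtimestamp(timestamp, tz=timezone.utc).hour
    return hour, client_ip, host


def tally_requests(lines):
    """Count requests per (hour, client_ip, host) combination."""
    counts = Counter()
    for line in lines:
        parsed = parse_access_line(line)
        if parsed is not None:
            counts[parsed] += 1
    return counts
```

Feeding `tally_requests` an open access.log gives a Counter keyed by hour, client and host; swapping the host for its shallalist category would give the per-category view.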
I figured I'd start with a "proof of concept", which you'll find below. I wrote it for speed rather than memory usage. This code took just under 1.5 seconds to process a 21 megabyte access.log against a 17 kilobyte list of domains. I don't find that too shabby, but I wouldn't turn down any suggestions on how to make it faster.
EDIT: This is now old code. Core functionality has since been added and the implementation has changed slightly. If you'd like the new code, you can contact me.
from collections import Counter, deque
import re

# Blocklist entries keep their trailing newline, so lookups below append '\n'.
test_counter = Counter(
    open('/Users/xxxxx/BL/searchengines/domains', 'r').readlines())

# Pull the host out of every request URL in the log.
domain_list = []
domain_regex = re.compile('^.*http://(.*?)/')
for line in open('/Users/xxxxx/access.log', 'r'):
    domain_match = domain_regex.match(line)
    try:
        domain_list.append(domain_match.group(1))
    except AttributeError:
        # No http:// URL on this line (match() returned None).
        pass

for domain in domain_list[:]:
    # I need to check to see if the domain is blocked, not just the subdomain.
    domain_parts = deque(domain.split('.'))
    if len(domain_parts) > 1:
        # Rebuild the name right to left: "com", "google.com", "www.google.com".
        domain_construction = domain_parts.pop()
        while len(domain_parts) > 0:
            domain_construction = '.'.join(
                (domain_parts.pop(), domain_construction))
            if (domain_construction + '\n') in test_counter:
                print(domain_construction, "** search engine")
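On the "make it faster" front, one possible variation (a sketch, not what I benchmarked above) is to strip the blocklist entries once into a plain set and wrap the right-to-left suffix check in a function. The names `load_domains` and `matching_suffix` are hypothetical, and unlike the loop above this returns only the first (shortest) listed parent domain instead of printing every match.

```python
def load_domains(lines):
    """Build a set of blocklist entries, dropping blank lines and whitespace."""
    return {line.strip().lower() for line in lines if line.strip()}


def matching_suffix(host, blocked):
    """Return the shortest listed parent domain of host, or None."""
    parts = host.lower().split('.')
    # Start with the last two labels ("google.com") and widen to the left,
    # so a listed domain also catches all of its subdomains.
    for start in range(len(parts) - 2, -1, -1):
        candidate = '.'.join(parts[start:])
        if candidate in blocked:
            return candidate
    return None


blocked = load_domains(["google.com\n", "search.yahoo.com\n"])
print(matching_suffix("www.google.com", blocked))
print(matching_suffix("yahoo.com", blocked))
```

Single-label hosts such as `localhost` never match, the same as the `len(domain_parts) > 1` guard in the original.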