Personal Project

2011-03-29 15:58:39.391863


I decided I liked the idea of writing a program to parse my Squid access.log so that I could categorize the web requests passing through my proxy using shallalist (http://www.shallalist.de/). If I succeeded and cross-referenced that information with time and IP, I could start doing all kinds of cool little analytics to get a handle on the day-to-day traffic I'm seeing.
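To make the "time and IP" idea concrete, here is a minimal sketch that counts requests per client IP and hour of day. It assumes the log uses Squid's default native format, where the first whitespace-separated field is a Unix timestamp, the third is the client address, and the seventh is the URL. The function names and the sample lines are made up for illustration and are not part of the code in this post.

```python
from collections import Counter
from datetime import datetime, timezone


def parse_access_line(line):
    """Return (timestamp, client_ip, url) for one log line, or None.

    Assumes Squid's native log format; lines that do not fit are skipped.
    """
    fields = line.split()
    if len(fields) < 7:
        return None
    try:
        when = datetime.fromtimestamp(float(fields[0]), tz=timezone.utc)
    except ValueError:
        # First field is not a number, so this is not a native-format line.
        return None
    return when, fields[2], fields[6]


def requests_per_client_hour(lines):
    """Count requests keyed by (client_ip, hour of day in UTC)."""
    counts = Counter()
    for line in lines:
        parsed = parse_access_line(line)
        if parsed is not None:
            when, client_ip, _url = parsed
            counts[(client_ip, when.hour)] += 1
    return counts


# Hypothetical sample lines, invented for this example.
sample_lines = [
    "1301414319.391    120 192.168.1.10 TCP_MISS/200 4512 GET "
    "http://www.example.com/index.html - DIRECT/93.184.216.34 text/html",
    "1301414325.102     85 192.168.1.10 TCP_HIT/200 1024 GET "
    "http://www.example.com/logo.png - NONE/- image/png",
]

for key, total in requests_per_client_hour(sample_lines).items():
    print(key, total)
```

Keying the counter on a tuple keeps the sketch short; the same tuple could be extended with a shallalist category once the domain lookup is in place.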

I figured I'd start with a "proof of concept", which you'll find below. I wrote it for speed rather than memory usage. This code took just under 1.5 seconds to process a 21-megabyte access.log against a 17-kilobyte list of domains. I don't find that too shabby, but I wouldn't turn down any suggestions on how to make it faster.

EDIT: This is now old code; core functionality has since been added and the implementation has changed slightly. If you'd like the new code, you can contact me.

from collections import Counter, deque
import re

# Load the shallalist category file; each entry keeps its trailing newline.
test_counter = Counter(
    open('/Users/xxxxx/BL/searchengines/domains', 'r').readlines())

# Pull the host out of every logged http:// URL.
domain_list = []
domain_regex = re.compile('^.*http://(.*?)/')
for line in open('/Users/xxxxx/access.log', 'r'):
    domain_match = domain_regex.match(line)
    try:
        domain_list.append(domain_match.group(1))
    except AttributeError:
        # The line has no http:// URL, so there is nothing to record.
        pass

for domain in domain_list[:]:
    # I need to check to see if the domain is blocked, not just the subdomain.
    domain_parts = deque(domain.split('.'))
    if len(domain_parts) > 1:
        domain_construction = domain_parts.pop()
        while len(domain_parts) > 0:
            # Rebuild the name one label at a time, right to left.
            domain_construction =\
                '.'.join((domain_parts.pop(), domain_construction))
            if domain_construction + '\n' in test_counter:
                print(domain_construction, "** search engine")
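The parent-domain check in the last loop can also be written as a small standalone function. The following is a minimal sketch, not the code described in the EDIT above: it swaps the Counter for a plain set with the newlines stripped, and it stops at the first match instead of printing every match. The names load_domains, parent_domains and first_listed_domain are hypothetical, and whether this is actually faster has not been measured here.

```python
def load_domains(lines):
    """Build a set of listed domains from an iterable of text lines."""
    return {line.strip() for line in lines if line.strip()}


def parent_domains(host):
    """Yield the suffixes of host, shortest first, skipping the bare TLD.

    A host with a single label (no dot) yields nothing, which mirrors
    the len(domain_parts) > 1 guard in the original loop.
    """
    labels = host.split('.')
    for start in range(len(labels) - 2, -1, -1):
        yield '.'.join(labels[start:])


def first_listed_domain(host, listed):
    """Return the shortest suffix of host found in listed, or None."""
    for candidate in parent_domains(host):
        if candidate in listed:
            return candidate
    return None


# Hypothetical list entries, standing in for a shallalist domains file.
listed = load_domains(["google.com\n", "bing.com\n", "\n"])

for host in ("maps.google.com", "www.example.org", "localhost"):
    print(host, first_listed_domain(host, listed))
```

Returning the matched suffix rather than printing it makes the function easier to reuse when the result needs to be combined with other fields from the log line.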


