New job assessment: webcluster log parsing
This is one of the proposed solutions for the job assessment commented in a previous post.
Provide a design which is able to parse Apache access-logs so we can generate an overview of which IP has visited a specific link. This design needs to be usable for a 500+ node webcluster. Please provide your configs/possible scripts and explain your choices and how scalable they are.
I will consider these requirements:
It is not critical to register every log entry; there is no need to ensure that all web hits are recorded.
No duplicate control: there is no need to check whether a log entry has already been loaded.
A mechanism to gather the logs from the webservers must also be proposed.
It must be scalable.
Flexibility to allow further, different kinds of analysis is a plus.
The problems to be solved are log storage and log gathering, but the main bottleneck will be the storage.
Given the characteristics of the data to be processed (log entries), a NoSQL database is the best option:
Time-ordered entries
No duplicates
Need for fast insertion
Fixed fields
No relations between data, no referential integrity requirements
Need to be rotated (old entries removed)
etc.
So, I propose using MongoDB, which fits the requirements:
It is fast, both at inserting and querying.
Scales horizontally without disruption (if properly configured initially).
Supports replication and High Availability.
Well-known solution, with commercial support if needed.
Python bindings (PyMongo).
Note: I will not go into the details of a scalable, highly available MongoDB architecture. See the quick start guide to set up a single node and the documentation for architecture examples.
To parse the logs and store them in MongoDB, I propose a simple Python script, accesslog_parse_mongo.py, that will:
Set up a direct MongoDB connection.
Read the access log from standard input.
Parse the log lines and store all the fields, including client_ip, url, referer, status_code, timestamp, timezone...
I do not set any indexes in the NoSQL DB. Indexes could be created on the url or client_ip fields, but having no indexes allows faster insertions, which is the objective; reads are very uncommon and performed in batch processes.
Notice that the script should be improved to be more reliable: for instance, it does not check for errors (DB failures, etc.), and it could buffer entries in case of a DB failure. A minimal sketch follows.
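A minimal sketch of accesslog_parse_mongo.py under these assumptions: a local MongoDB on the default port, a database named logs with a collection named access, and the Apache combined log format. All of these names are illustrative, not fixed by the assessment.

#!/usr/bin/env python
# Sketch of accesslog_parse_mongo.py: reads Apache combined-format
# access log lines from stdin and inserts one document per entry.
import re
import sys

from pymongo import MongoClient

# Apache "combined" LogFormat:
# %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"
LINE_RE = re.compile(
    r'(?P<client_ip>\S+) \S+ \S+ '
    r'\[(?P<timestamp>[^\]]+) (?P<timezone>[+-]\d{4})\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" '
    r'(?P<status_code>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"')

def main():
    client = MongoClient('mongodb://localhost:27017')  # assumed address
    collection = client.logs.access                    # assumed db/collection
    for line in sys.stdin:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip malformed lines: losing some entries is acceptable
        collection.insert_one(match.groupdict())

if __name__ == '__main__':
    main()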
A second script, example_query_accesslog.py, queries the DB and prints the accesses. It takes an optional argument, the relative URL to filter on.
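A sketch of example_query_accesslog.py, assuming the same database and collection names as the parser above:

#!/usr/bin/env python
# Sketch of example_query_accesslog.py: prints which client IP
# visited which URL, optionally filtered by a relative URL argument.
import sys

from pymongo import MongoClient

def main():
    client = MongoClient('mongodb://localhost:27017')  # assumed address
    collection = client.logs.access                    # same db/collection as the parser
    query = {}
    if len(sys.argv) > 1:
        query['url'] = sys.argv[1]
    # If reads became frequent, an index could be created once:
    # collection.create_index('url')
    for entry in collection.find(query):
        print(entry['client_ip'], entry['url'])

if __name__ == '__main__':
    main()

For example, ./example_query_accesslog.py /index.html would list every client IP that requested /index.html.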
To feed the DB with the logs from the webservers, several solutions are possible:
Copy the log files with a scheduled task via SSH or similar, then process them with accesslog_parse_mongo.py on a centralized server (or cluster of servers); a sketch follows the pros and cons below.
* Pros: Logs are centralized. Only a small set of servers accesses MongoDB. The system can be stopped as needed.
* Cons: Needs extra programming to gather the logs. No realtime data.
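For this first option, a minimal sketch of the scheduled task. The script name gather_logs.py, the host names, and the log paths are assumptions for illustration:

#!/usr/bin/env python
# Sketch of a hypothetical gather_logs.py, run from cron on the central
# server: copies the last rotated access log from each webserver over
# SSH (keys already deployed) and pipes it to accesslog_parse_mongo.py.
import subprocess

WEBSERVERS = ['web001.example.com', 'web002.example.com']  # ...the 500+ nodes
REMOTE_LOG = '/var/log/apache2/access.log.1'               # assumed path

for host in WEBSERVERS:
    local_copy = '/tmp/access-%s.log' % host
    subprocess.check_call(['scp', '%s:%s' % (host, REMOTE_LOG), local_copy])
    with open(local_copy) as log:
        subprocess.check_call(['/apath/accesslog_parse_mongo.py'], stdin=log)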
Use a centralized syslog service such as syslog-ng (which can be balanced and configured for HA), and set up all the webservers to send their logs via syslog (see this article).
On the log server, we can process the resulting files with a batch process, or pipe all the messages to accesslog_parse_mongo.py. For instance, the destination configuration for syslog-ng:
destination d_prog { program("/apath/accesslog_parse_mongo.py"
    template("$MSGONLY\n")
    template-escape(no)); };
* Pros: Centralized logs. No extra programming. Realtime data. Uses existing infrastructure (syslog). Only a small set of servers accesses MongoDB.
* Cons: Some log entries can be dropped. It cannot be stopped; otherwise log entries will be lost.
Pipe the webserver logs directly to the script, accesslog_parse_mongo.py, using Apache's piped logging. In the Apache configuration:
CustomLog "|/apath/accesslog_parse_mongo.py" combined
* Pros: Easy to implement. No extra programming or infrastructure. Realtime data.
* Cons: Some log entries can be dropped. It cannot be stopped, or log entries will be lost.
The script should be improved to make it more reliable.