New job assessment: Webcluster logs parsing

This is one of the proposed solutions for the job assessment discussed in a previous post.

Provide a design which is able to parse Apache access-logs so we can generate an overview of which IP has visited a specific link. This design needs to be usable for a 500+ node webcluster. Please provide your configs/possible scripts and explain your choices and how scalable they are.

I will consider these requirements:

The problems to be solved are log storage and log gathering, but the main bottleneck will be the storage.

A NoSQL database looks like the best option, given the characteristics of the data to process (log entries):

So I propose using MongoDB [1], which fits the requirements:

[1]

Note: I will not go into the details of a scalable, highly available MongoDB architecture. See the quick start guide to set up a single node and the documentation for architecture examples.

To parse the logs and store them in MongoDB, I propose a simple Python script, accesslog_parse_mongo.py, that:
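As a minimal sketch of the parsing step only (not the actual accesslog_parse_mongo.py), assuming the Apache "combined" log format: the field names (ip, path, status and so on) and the function name parse_line are my own illustrative choices. Requests containing escaped quotes are simply rejected here, which a production parser would need to handle.

```python
import re
from datetime import datetime

# Apache "combined" format:
# host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED_RE = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Turn one access-log line into a dict, or return None if it does not match."""
    match = COMBINED_RE.match(line)
    if match is None:
        return None
    doc = match.groupdict()
    # The request field looks like: GET /some/path HTTP/1.1
    parts = doc.pop("request").split()
    doc["method"] = parts[0] if parts else None
    doc["path"] = parts[1] if len(parts) > 1 else None
    doc["status"] = int(doc["status"])
    doc["size"] = 0 if doc["size"] == "-" else int(doc["size"])
    try:
        # Example timestamp: 10/Oct/2000:13:55:36 -0700
        doc["time"] = datetime.strptime(doc["time"], "%d/%b/%Y:%H:%M:%S %z")
    except ValueError:
        return None
    return doc
```

A dict of this shape can be stored as one MongoDB document per request, and an index on the path field would then support the "which IP has visited a specific link" overview.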

To feed the DB with the logs from the webservers, some solutions could be:

Copy the log files with a scheduled task via SSH or similar, then process them with accesslog_parse_mongo.py on a central server (or cluster of servers).

* Pros: Logs are centralized. Only a small set of servers accesses MongoDB. The system can be stopped as needed.

* Cons: Needs extra programming to fetch the logs. No realtime data.
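For this batch approach, the central server needs a way to walk through the copied files. The sketch below is one possible shape for that step, assuming the files end up in a single directory and that rotated logs may be gzip-compressed; the function name iter_log_lines and the default file name pattern are hypothetical.

```python
import gzip
from pathlib import Path


def iter_log_lines(log_dir, pattern="*access*log*"):
    """Yield (file name, line) for every line of every matching log file."""
    for path in sorted(Path(log_dir).glob(pattern)):
        # Rotated logs are often compressed; pick the opener by suffix.
        opener = gzip.open if path.suffix == ".gz" else open
        with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
            for line in handle:
                yield path.name, line.rstrip("\n")
```

Because it is a generator, the files are read line by line and never loaded fully into memory, which matters when 500+ nodes each ship their logs to the same place.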

Use a centralized syslog service, like syslog-ng (which can be load balanced and configured for HA), and set up all the webservers to send their logs via syslog (see this article).

On the log server, we can process the resulting files with a batch process or send all the messages to accesslog_parse_mongo.py. For instance, the configuration for syslog-ng:

destination d_prog { program("/apath/accesslog_parse_mongo.py"
                              template("$MSGONLY\n")
                              template-escape(no)); };




* Pros: Centralized logs. No extra programming. Realtime data. Uses existing infrastructure (syslog). Only a small set of servers accesses MongoDB.

* Cons: Some log entries can be dropped. It cannot be stopped, or log entries will be lost.

Pipe the webserver logs directly to the script, accesslog_parse_mongo.py. In the Apache configuration:

CustomLog "|/apath/accesslog_parse_mongo.py" combined




* Pros: Easy to implement. No extra programming or infrastructure. Realtime data.

* Cons: Some log entries can be dropped. It cannot be stopped, or log entries will be lost. The script should be improved to make it more reliable.
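One direction for making a piped script more robust is to skip malformed lines instead of crashing, and to write in batches. The sketch below is only an illustration under assumptions: feed is a hypothetical helper, parse is any callable that returns a dict or None for a line, and collection is any object with an insert_many method (a PyMongo collection provides one).

```python
def feed(lines, parse, collection, batch_size=500):
    """Parse lines and store them in batches; return (stored, skipped) counts."""
    batch, stored, skipped = [], 0, 0
    for line in lines:
        doc = parse(line)
        if doc is None:
            # Malformed entry: count it and keep going rather than abort.
            skipped += 1
            continue
        batch.append(doc)
        if len(batch) >= batch_size:
            collection.insert_many(batch)
            stored += len(batch)
            batch = []
    # Flush whatever is left when the input ends.
    if batch:
        collection.insert_many(batch)
        stored += len(batch)
    return stored, skipped
```

When reading from a pipe, lines would be sys.stdin. Note the trade-off: a larger batch_size means fewer round trips to the database, but entries wait in memory until the batch fills, so on a low-traffic server a small value (or a time-based flush, not shown) is closer to realtime.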
