Apache2 Version Analysis

Posted by Bagpuss on January 20, 2011
Tags: honeynet, digital forensics

During a recent attempt at answering the Honeynet Log Mysteries Challenge, I wrote a series of reasoned analyses for the supplied Honeynet logging data. Unfortunately, teaching workloads stopped me from submitting any realistic challenge answer.

Inspired by the idea of applying the Scientific Method to Digital Forensics (see Casey2009 and Carrier2006), I set about attempting to apply the same principles to analysing the Log Mysteries data sets.

Using just the apache2/www-* logs from the Log Mysteries Honeynet challenge, this blog post demonstrates how we can define upper bounds on the version of Apache2 used and, more interestingly, data regarding Apache's worker threads. We are also able to establish how to obtain the log events with microsecond (instead of just second) timestamp accuracy.

These surprising results (well they were to me!) arise because the Apache2 LogFormat directive had been customised to include the contents of the environment variable UNIQUE_ID (which in turn has had its value set by the Apache2 module mod_unique_id). By examining source code changes to the underlying module, one is then able to deduce that Apache2 is at revision 420983 (ie. release version 2.2.2) or below.

Using our Apache2 revision number estimate now allows us to correctly decode the UNIQUE_ID value to extract:

  • the Apache worker thread ID (as present in the Apache2 score board data structure) that handled the request
  • the web server process ID for the worker thread that handled the request
  • a 4 byte timestamp value that is derived from the time that the request was received
  • a 2 byte counter value that is initialised (when the worker thread runs for the first time) from the current time in microseconds and then incremented whenever the worker thread handles a new request
  • and the IP address of the web server handling the request.

Using our apache2/www-* files, we can now determine that the largest recorded UNIQUE_ID timestamp value is 4281. If Apache2 were at a revision number > 420983 then these timestamp values would be close (ie. we'd expect them to be within ±1 second) to each logged event's observed timestamp (expressed as seconds since the UNIX Epoch).

As this is not what we observe, we may take 420983 (ie. release version 2.2.2 - see the Apache2 tags link) as our upper bound on the Apache2 revision number.

Worked Example

If we take the apache2/www-access.log log line:

10.0.1.2 - - [19/Apr/2010:06:36:15 -0700] "GET /feed/ HTTP/1.1" 200 16605 "-" "Apple-PubSub/65.12.1" C4nE4goAAQ4AAEP1Dh8AAAAA 3822005

then our UNIQUE_ID value is C4nE4goAAQ4AAEP1Dh8AAAAA, which in turn provides us with the following data regarding the Apache2 worker thread that handled the request:
  • PID is 17397 and the (scoreboard) thread index is 0
  • a timestamp value of 193
  • a counter value of 3615
  • the web server that handled the request is at 10.0.1.14.

Our observed timestamp value is 13:36:15 UTC on the 19th April 2010, or 1271684175 seconds since the UNIX Epoch. As this is less than 2^32, we have that 1271684175 mod 2^32 = 1271684175. As 193 and 1271684175 are orders of magnitude apart, we clearly see that the UNIQUE_ID value has been encoded using mod_unique_id.c at revision 420983 or earlier.
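
The conversion from the logged time to seconds since the UNIX Epoch can be checked with a short Python sketch. The function name log_time_to_epoch is purely illustrative, and the format string assumes the default Apache access log time layout and an English locale for the month abbreviation.

```python
from datetime import datetime, timezone

# Timestamp field copied from the access log line in the worked example.
LOG_TIME = "19/Apr/2010:06:36:15 -0700"


def log_time_to_epoch(text):
    """Parse an Apache access-log timestamp into whole seconds since the UNIX Epoch."""
    # %z consumes the "-0700" offset, so the result is timezone-aware.
    parsed = datetime.strptime(text, "%d/%b/%Y:%H:%M:%S %z")
    return int(parsed.timestamp())


epoch = log_time_to_epoch(LOG_TIME)
# Show the same instant in UTC, then the value reduced modulo 2^32.
print(datetime.fromtimestamp(epoch, tz=timezone.utc).isoformat())
print(epoch, epoch % 2**32)
```

Because the epoch value is smaller than 2^32, the modulo step leaves it unchanged, which is why the comparison against the decoded timestamp field can be made directly.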

If the mod_unique_id.c code present in revision 420983 was used to generate our UNIQUE_ID values, then we may additionally estimate our observed timestamp values to microsecond accuracy.

Worked Example: continued

We now assume that revision 420983 of mod_unique_id.c was used. Thus, 193 is the number of microseconds past 1271684175 seconds at which our log event was observed. In other words, our log event was actually received at 1271684175.000193 seconds past the UNIX Epoch.

At this point, the reader may be interested to know that the preceding estimate of the Apache2 version actually introduces a subtle error! In the next blog post, we'll rework our logical reasoning (with a dash of data visualisation) to locate and fix this error - in the meantime, the reader is invited to try and locate that reasoning error. Future blog posts will focus on using data visualisation and statistical analysis techniques to further analyse the Honeynet logging data.

Some Technical Details

The mod_unique_id documentation informs us that the tuple ( ip_addr, pid, time_stamp, counter ), via an algorithm similar to MIME base64, is encoded as a 19 character string using the characters [A-Za-z0-9+/]. The resulting value is placed in the UNIQUE_ID environment variable.

Viewing the subversion revision log for mod_unique_id.c, we see that revision 596448 is the latest version of code that our server could have used (this is based on the revision log timestamps and that the last Apache log entry timestamp, in the apache2/www-* log files, is 01:52:24 on the 25th April 2010 UTC).

In viewing revision 596448 of the mod_unique_id source code, we notice (see lines 56 to 103) that the tuple ( time_stamp, in_addr, pid, counter, thread_index ) is in fact used to generate our UNIQUE_ID value - this explains why the UNIQUE_ID value is in fact 24 characters in length and not 19 (BTW, revision 981084 fixes the incorrectly commented code).

Note:

According to the source code (see function unique_id_global_init), the size of the UNIQUE_ID string is:

((size_of(unsigned int)+size_of(unsigned int)+size_of(unsigned int)+size_of(unsigned short)+size_of(unsigned int)) * 8 + 5) div 6
= ((4+4+4+2+4) * 8 + 5) div 6
= 24 characters

From the source code (see function gen_unique_id) we also find that a standard MIME base64 encoding is used followed by translating the '+' character to '@' and the '/' character to '-'. This allows us to easily reverse engineer our UNIQUE_ID values and so extract the original input tuple.
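
As a rough illustration of that reverse engineering, here is a minimal Python sketch. It assumes the 18 byte record is packed without padding, in the field order ( time_stamp, in_addr, pid, counter, thread_index ) given above, with every integer in network byte order. The names UniqueId, RECORD and decode_unique_id are invented for this example and do not come from the Apache source.

```python
import base64
import ipaddress
import struct
from collections import namedtuple

UniqueId = namedtuple("UniqueId", "stamp ip_addr pid counter thread_index")

# Assumed layout: 4+4+4+2+4 = 18 bytes, network byte order, no padding.
RECORD = struct.Struct("!I4sIHI")


def decode_unique_id(value):
    """Split a 24 character UNIQUE_ID string into its assumed input tuple."""
    expected_len = (RECORD.size * 8 + 5) // 6  # same arithmetic as the note above
    if len(value) != expected_len:
        raise ValueError("expected %d characters, got %d" % (expected_len, len(value)))
    # altchars maps '@' back to '+' and '-' back to '/' before decoding.
    raw = base64.b64decode(value, altchars=b"@-")
    stamp, addr, pid, counter, thread_index = RECORD.unpack(raw)
    return UniqueId(stamp, str(ipaddress.IPv4Address(addr)), pid, counter, thread_index)


print(decode_unique_id("C4nE4goAAQ4AAEP1Dh8AAAAA"))
```

For the value from the worked example, this layout gives PID 17397, counter 3615, thread index 0 and address 10.0.1.14. The raw 32 bit timestamp field comes out as 193578210, of which the 193 quoted in the worked example appears to be the leading digits.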

In examining the revision log for mod_unique_id.c we have that:

  • revision 596448 differs from revision 420983 in the way in which it handles the timestamp value within the function gen_unique_id
  • revision 420983 differs from revision 596059 by changes to the license comments and in the way that it handles calculating UNIQUE_ID over frequent process restarts (ie. within < 1 second of the previous restart)
  • revisions prior to 596059 (all of which are dated July 2002 or earlier) alter the code in multiple ways.

More specifically, we have that:
  • revisions 420983 and 596448 both use the C expression r->request_time to extract the apr_time_t struct that encodes the time at which the request was received
  • revision 596448 uses the C macro apr_time_sec to extract (from the apr_time_t struct) the time at which the request was received in seconds as our timestamp value (modulo 2^32)
  • revision 420983 uses the first 4 bytes of the apr_time_t struct (ie. the tm_usec field or the microseconds component of the time at which the request was received) as our timestamp value.

Based on these observations we can now implement a test, based on our timestamp encoding and the time at which the log event was received, to determine whether a given UNIQUE_ID value was generated by a revision > 420983 or not.
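
One way such a test might look is sketched below in Python. It assumes the timestamp is the first 4 bytes of the decoded record in network byte order, and it uses the ±1 second tolerance mentioned earlier. The function names are hypothetical, and wrap-around at the 2^32 boundary is deliberately ignored to keep the sketch short.

```python
import base64
import struct


def timestamp_field(unique_id):
    """Return the first 4 bytes of the decoded UNIQUE_ID as an unsigned integer."""
    raw = base64.b64decode(unique_id, altchars=b"@-")
    return struct.unpack("!I", raw[:4])[0]


def looks_like_epoch_seconds(unique_id, observed_epoch, tolerance=1):
    """True when the timestamp field is within `tolerance` seconds of the logged time.

    A True result is what we would expect from the seconds-based encoding
    (revisions > 420983); False suggests the older encoding.
    """
    expected = observed_epoch % 2**32
    return abs(timestamp_field(unique_id) - expected) <= tolerance


# Worked example: logged at 1271684175 seconds since the UNIX Epoch.
print(looks_like_epoch_seconds("C4nE4goAAQ4AAEP1Dh8AAAAA", 1271684175))
```

Applied to every line of the apache2/www-* logs, a check of this shape would separate seconds-based timestamp fields from the older encoding without needing to know anything else about the server.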

References

[Carrier2006] Brian Carrier. A Hypothesis-based Approach to Digital Forensic Investigations. PhD Thesis, Purdue University, May 2006. CERIAS Tech Report 2006-06.

[Casey2009] Eoghan Casey. Handbook of Digital Forensics and Investigation. Academic Press, October 2009. ISBN: 978-0-12-374267-4.