Wordpress Versioning: Part 2

Posted by Bagpuss on April 09, 2011
Tags: honeynet, digital forensics, data visualisation, wordpress

During a recent attempt at answering the Honeynet Log Mysteries Challenge, I wrote a series of reasoned analyses for the supplied Honeynet logging data. Unfortunately, teaching workloads stopped me from submitting any realistic challenge answer.

Inspired by the idea of applying the Scientific Method to Digital Forensics (see Casey2009 and Carrier2006) and using data visualisation (see Conti2007 and Marty2008), I set about attempting to apply the same principles to analysing the Log Mysteries data sets.

In Wordpress Versioning: Part 1 we showed how, by downloading candidate Wordpress plugins, we could compare each downloaded plugin against a series of observed URLs. In doing so, we could effectively test whether a given candidate plugin was unlikely to be installed. This article focuses on using probability measures to estimate:

  • the version of Wordpress that is installed
  • and the Wordpress plugins that are installed.

Wordpress and its plugins have their source code version controlled with Subversion (at least this is the case these days!). By checking out the entire source code tree for Wordpress and all its plugins, we can build database tables relating every Wordpress and plugin file in the repository to its size, SHA1 hash, etc.

Mapping Subversion Repositories to Rails Models

When using script-based implementations to process large repositories, it is best to check out the entire repository first and then process the results offline. Thus, we first issue the commands:

svn co http://core.svn.wordpress.org evidence/wordpress
svn co http://svn.wp-plugins.org evidence/wp-plugins
to check out the entire Wordpress (and plugin) Subversion repositories (see the rake tasks checkout:svn:wordpress and checkout:svn:wp-plugins in svn.rake).

Once we have successfully downloaded the Wordpress repositories, we can use the rake tasks build:svn:wordpress and build:svn:wp-plugins to process the downloaded code into an instance of the following class diagram:

Note: checking out all of the Wordpress (and plugin) source code repositories takes approximately 3 days and consumes around 70GB of disk space.
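The offline processing step can be sketched roughly as follows. This is not the code behind the build:svn:* rake tasks; it is a minimal illustration that assumes a checkout laid out as <root>/tags/<tag>/..., and the names FileRecord and index_tags are hypothetical.

```ruby
require "digest"
require "find"

# Hypothetical record type standing in for a row of the database tables.
FileRecord = Struct.new(:tag, :path, :bytes, :sha1)

# Walk <root>/tags/<tag>/... (an assumed layout) and return one record per
# file, skipping Subversion's .svn metadata directories.
def index_tags(root)
  records = []
  tags_dir = File.join(root, "tags")
  return records unless File.directory?(tags_dir)

  Dir.children(tags_dir).sort.each do |tag|
    tag_dir = File.join(tags_dir, tag)
    next unless File.directory?(tag_dir)

    Find.find(tag_dir) do |entry|
      if File.directory?(entry)
        Find.prune if File.basename(entry) == ".svn"
        next
      end
      # Path relative to the tag, e.g. "/wp-includes/js/jquery/jquery.js".
      relative = entry.delete_prefix(tag_dir)
      sha1 = Digest::SHA1.file(entry).hexdigest
      records << FileRecord.new(tag, relative, File.size(entry), sha1)
    end
  end
  records
end
```

Keeping the path relative to the tag directory means a record can later be compared directly against an observed request URL and response size.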

From Wordpress Versioning: Part 1 we can identify that two URLs are used to access the Wordpress application:

  • GET /wp-includes/js/jquery/jquery.js with a response size of 57276 bytes - we'll refer to this as event:
    $E_0 = \{ ($"/wp-includes/js/jquery/jquery.js"$, 57276) \}$
  • and GET /wp-includes/js/jquery/jquery.form.js with a response size of 8429 bytes - we'll refer to this as event:
    $E_1 = \{ ($"/wp-includes/js/jquery/jquery.form.js"$, 8429) \}$.
Using this information allows us to construct the following naive Bayesian network in SamIam:
Here, the probability function for our Wordpress query node is:
\(Pr(x \in W(v)) = \frac{card\; W(v)}{card\; \cup\; ran\; W}\)
and, for our observed URL nodes, we use the following conditional probability (for $i = 0, 1$):
\(Pr(x \in W(v) | (u, s) \in E_i) = \frac{card\; \{ x \in W(v) \;|\; u \in url(x) \wedge size(x) = s \}}{card\; \{ n \in \mathbb{N}_0 \;|\; \exists x \in W(v) \cdot size(x) = n \}}\)
where:
  • $File$ is a non-empty set of files (we assume that each file is a hash data structure with keys $size: File \rightarrow \mathbb{N}_0$ and $url: File \rightarrow \mathbb{P}(String)$);
  • $x \in File$ is a random variable;
  • $Version$ is a non-empty set of version tags;
  • $W: Version \rightarrow \mathbb{P}(File)$ associates files with a specific tag release of Wordpress;
  • $v \in Version$ is the tag release to be classified;
  • $u \in String$ is an observed URL request;
  • and $s \in \mathbb{N}_0$ is an observed response size (in bytes).
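To make the two formulas concrete, here is a minimal sketch that evaluates them literally over an in-memory mapping. It assumes that $W$ is represented as a Ruby Hash from version tag to a Set of files and that two files are the same element of $File$ when their URLs and sizes are equal; WpFile, prior and conditional are hypothetical names, not part of the Rails application.

```ruby
require "set"

# Hypothetical stand-in for an element of File: the URLs it can be served
# under and its size in bytes.
WpFile = Struct.new(:urls, :bytes)

# card W(v) / card (union of ran W), as in the first formula.
def prior(w, v)
  all_files = w.values.reduce(Set.new) { |acc, files| acc | files }
  return Rational(0) if all_files.empty?

  Rational(w.fetch(v).size, all_files.size)
end

# Files in W(v) matching URL u and size s, divided by the number of
# distinct file sizes in W(v), as in the second formula.
def conditional(w, v, u, s)
  files = w.fetch(v)
  distinct_sizes = files.map(&:bytes).uniq.size
  return Rational(0) if distinct_sizes.zero?

  matches = files.count { |x| x.urls.include?(u) && x.bytes == s }
  Rational(matches, distinct_sizes)
end
```

Because a file can belong to several tag releases, the values returned by prior need not sum to one across versions.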

With our naive Bayesian network set up, we can now use it to classify the (tag release) versions of Wordpress as follows:

From this we estimate the following (based on evidence $E_0$ and $E_1$; all other tag releases have a probability of 0% and so are excluded from this table):

Wordpress Tag Release | Probability

Working to one decimal place, we are able to estimate (with equal likelihood) that Wordpress has a tag release within the range of releases listed in the table above (i.e. we're on a 2.8 or 2.9 tag release branch). The small probability variations here can be accounted for by differing tag release population sizes.
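The estimates in this article come from SamIam. As a rough illustration of the normalisation that a naive Bayesian classifier performs, the sketch below multiplies a prior by one likelihood per observed event and rescales the result so that it sums to one. The function name posterior and the input numbers in the usage example are hypothetical, not values taken from the analysis.

```ruby
# priors:      { version => prior weight }
# likelihoods: one { version => weight } table per observed event;
#              a version missing from a table is treated as weight 0.
def posterior(priors, likelihoods)
  scores = priors.to_h do |version, prior|
    score = likelihoods.reduce(prior) do |acc, table|
      acc * table.fetch(version, 0)
    end
    [version, score]
  end

  total = scores.values.sum
  return {} if total.zero?

  # Drop versions ruled out by the evidence, then normalise the rest.
  scores.select { |_, score| score.positive? }
        .transform_values { |score| score.fdiv(total) }
end

# Hypothetical usage: "3.0" never matches the first event, so it drops out.
example = posterior(
  { "2.8" => 0.5, "2.9" => 0.25, "3.0" => 0.25 },
  [{ "2.8" => 0.1, "2.9" => 0.2 }, { "2.8" => 0.1, "2.9" => 0.1, "3.0" => 0.3 }]
)
```

Versions whose score is zero for any event are omitted, mirroring the way tag releases with a probability of 0% are left out of the table.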

Wordpress Plugins: Tag Release Estimates

In a similar manner, we can also build naive Bayesian classifiers for determining which Wordpress plugins are installed, along with their respective tag release or trunk version numbers, as follows:

Wordpress Plugin | Tag Release | Observations
Contact Form 7 | equal probability for each tag release in list 2.1, 2.1.1, 2.1.2, 2.2, 2.2.1 and 2.3 | estimate consistent with parameter ver=2.1.1
Google Analyticator | equal probability for each tag release in list 6.0, 6.0.1, 6.0.2, 6.1 and 6.1.1 | estimate consistent with parameter ver=6.0.2
Google Syntax Highlighter | 1.5.1 | this estimate only holds if we ignore the observation of shBrushBash.js with a size of 2810 bytes

Note: searching for the file shBrushBash.js within the Wordpress plugin repository reveals no file with a size of 2810 bytes.
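A simpler, non-Bayesian consistency check gives a feel for how such observations narrow down the candidates. The sketch below assumes a nested Hash index of the form { plugin => { tag => { path => bytes } } }; the function name candidate_tags and the data in the test are hypothetical. It sets aside any observation that matches no tag at all, then keeps the tags that agree with every remaining observation.

```ruby
# index:        { plugin => { tag => { path => bytes } } }  (assumed shape)
# observations: array of [path, bytes] pairs taken from the logs
def candidate_tags(index, plugin, observations)
  tags = index.fetch(plugin, {})

  # Observations that no tag release of this plugin can explain.
  unmatched = observations.reject do |path, bytes|
    tags.any? { |_, files| files[path] == bytes }
  end
  usable = observations - unmatched

  # Tags consistent with every usable observation; with nothing usable
  # we report no candidates rather than all of them.
  matching = if usable.empty?
               {}
             else
               tags.select do |_, files|
                 usable.all? { |path, bytes| files[path] == bytes }
               end
             end

  { tags: matching.keys, unmatched: unmatched }
end
```

Reporting the unmatched observations separately keeps cases like the 2810-byte shBrushBash.js visible instead of silently discarding them.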

In the final blog article of this series, we shall look at how the work of Florian Buchholz (e.g. see An Improved Clock Model for Translating Timestamps) can be used to measure logging event times relative to a suitable reference clock description.


Modeling and Reasoning with Bayesian Networks
by A.Darwiche
Cambridge University Press 2009

Data Analysis in Forensic Science: a Bayesian Decision Perspective
by F.Taroni, S.Bozza, A.Biedermann, P.Garbolino and C.Aitken
Wiley 2010

Tools Used

SamIam used to build our naive Bayesian classifiers.
SamIam naive Bayesian classifiers used in this article:
Rails 3 used to model our data (see GitHub project for Rails application used in analysis).