-- FrankOGorman - 24 May 2006
Daily processing of usage data
Overview
This topic describes the modification of the usage reporting system to run daily instead of just at the end of the month.
The problems
There were various, primarily operational, issues with the usage reporting system that motivated the switch to "daily processing":
- Memory use: The amount of memory used for processing journal data had been rising steadily with increasing use of the WIS website and was becoming critical. The processing of EAL and BAL customers had been separated some time ago to reduce memory requirements. Nevertheless, at the end of 2005 the 4GB limit for 32-bit programs was reached. Using a 64-bit version of the usagereport program provided some temporary relief, but by March 2006 this was using around 13GB of memory. Since libra (the machine on which usagereport is run) has only 16GB of RAM, the consequence was excessive page-swapping and very slow running of the program (i.e. "thrashing").
- Processing time: The total CPU time to generate the end-of-month reports had increased to around 60 hours. Although this could be distributed over 3 of libra's 4 CPUs, it was still taking several days to generate the reports. With QA time added, it typically took 2 weeks to provide reports to customers.
- Operator time: The running of the end-of-month processing was labour-intensive. The following steps had to be done manually:
- Run several commands on the live site (apollo) to filter and compress the AM stats log.
- Copy the compressed AM stats log and compressed database dump (produced automatically by a cron job) from apollo to libra and uncompress them.
- Copy various manually-generated files from the previous month.
- Run a script to pre-process journal weblog files (already copied from apollo by a cron job).
- Move weblog files to the directories that the usagereport program expected them to be in.
- Run up to 3 instances of the usagereport program (on different lists of products), trying to balance the load across 3 CPUs.
- Run a Perl script to generate Article Select reports.
- When QA of the reports was completed, copy the reports to apollo and run a script on apollo to create directories for any new customers.
The solution
The processing performed by the original usagereport program consisted of three main stages:
- Load document and customer data dumped from the Wispers database.
- Process the month's weblog files in chronological order, counting accesses by each customer for each document.
- Write usage reports for each customer to disk.
Most of the memory used by the program was for the counts generated by the second stage and needed by the third stage.
The first change in the "daily processing" system is to distribute the work done in the second stage, and do all the weblog processing and most of the counting on a daily basis. So each day, usagereport does:
- Load document and customer data dumped from the Wispers database.
- Process the day's weblog file in chronological order, counting accesses by each customer for each document.
- Write the day's access counts to a file on disk.
And at the end of the month, usagereport does:
- Load document and customer data dumped from the Wispers database.
- Add together the daily counts.
- Write usage reports for each customer to disk.
This re-organization drastically reduces the end-of-month processing time (to around 5 hours), thus dealing with the "processing time" issue listed above. Note, however, that the total processing time for the whole month has been significantly increased, due to the repeated loading of document and customer data.
This re-organization does not yet address the "memory use" issue, since the end-of-month processing still has all the counts in memory at the end of the second stage. This is dealt with by concatenating the daily count files and sorting them into customer order. The end-of-month procedure then becomes:
- Concatenate and sort the daily count files.
- Load document and customer data dumped from the Wispers database.
- For each customer:
- Add together the daily counts.
- Write usage reports for that customer to disk.
With this organization, only the counts for a single customer at a time are needed in memory.
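To illustrate the memory behaviour, a minimal sketch of this end-of-month pass is shown below. The file name and the commented-out report step are hypothetical; the real implementation is described under "End-of-month processing" below.

#include <cstdlib>
#include <fstream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Sketch only: stream the sorted daily count file, holding the counts for
// just one customer in memory at a time.
int main()
{
    std::ifstream in("counts.sorted");            // hypothetical input file name
    std::string line;
    std::string currentCustomer;
    std::map<std::string, std::vector<long> > monthly;   // document ID -> summed counts

    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string group, customer, kind, document, field;
        std::getline(fields, group, '|');
        std::getline(fields, customer, '|');
        std::getline(fields, kind, '|');
        if (kind != "stats")                      // e.g. "ip" lines are handled elsewhere
            continue;
        std::getline(fields, document, '|');

        std::string key = group + "|" + customer;
        if (key != currentCustomer) {
            // writeReportsFor(currentCustomer, monthly);  // hypothetical report step
            monthly.clear();                      // release the previous customer's counts
            currentCustomer = key;
        }
        std::vector<long>& counts = monthly[document];
        for (size_t i = 0; std::getline(fields, field, '|'); ++i) {
            if (counts.size() <= i)
                counts.resize(i + 1, 0);
            counts[i] += std::atol(field.c_str());
        }
    }
    // writeReportsFor(currentCustomer, monthly);          // last customer
    return 0;
}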
The running of usagereport daily obviously does not help with the "operator time" problem - in fact, it aggravates it hugely. It did, however, make it essential that something was done about the problem. The solution was to automate the running of the system using shell scripts run by cron (see Operational aspects of usage data generation).
Implementation
Overview
The top-level of the usagereport program has been modified to perform as outlined above, with a command-line option specifying whether daily or end-of-month processing is to be performed.
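In outline, the new top level behaves something like the sketch below. The option name and the helper functions are illustrative only, not the actual names used by usagereport.

#include <cstring>

// Hypothetical stand-ins for the real usagereport stages.
void loadDatabaseDump()     { /* load document and customer data from the Wispers dump */ }
void processDailyWeblog()   { /* process one day's weblog and save the counts/sessions */ }
void aggregateDailyCounts() { /* sum the daily counts and write the monthly reports */ }

int main(int argc, char* argv[])
{
    // A command-line option selects daily or end-of-month processing
    // ("--daily" here is an illustrative flag name, not the real one).
    bool daily = (argc > 1 && std::strcmp(argv[1], "--daily") == 0);

    loadDatabaseDump();                 // needed in both modes
    if (daily)
        processDailyWeblog();
    else
        aggregateDailyCounts();
    return 0;
}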
Saving of daily counts
When usagereport is run for a single day, it does essentially what it used to do when run for a whole month, except that it processes only a single weblog and it stops short of generating reports. Instead, it saves all the access counts (and related data) to a text file, so that they can be used for end-of-month processing.
The access counts are stored in memory in what is essentially a tree structure, with the counts at the leaves of the tree. The levels of the tree correspond to:
- Customer group (e.g. EAL, BAL)
- Customer
- Document
- Access counts
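As an illustration only (the real tree is built from the Report, CustomerTableGroup, CustomerTable and Customer classes, whose dump_stats functions are listed further below), the levels map onto nested containers roughly as follows:

#include <map>
#include <string>
#include <vector>

// Rough shape of the in-memory count tree (illustrative names only).
typedef std::vector<long>                     AccessCounts;    // one counter per access type
typedef std::map<int, AccessCounts>           DocumentCounts;  // document ID -> counts
typedef std::map<int, DocumentCounts>         CustomerCounts;  // customer ID -> documents
typedef std::map<std::string, CustomerCounts> GroupCounts;     // customer group -> customers

int main()
{
    GroupCounts tree;
    // Bump one access counter for document 1430 of customer 97077091 in group JournalEAL.
    AccessCounts& counts = tree["JournalEAL"][97077091][1430];
    if (counts.size() < 5)
        counts.resize(5, 0);
    counts[4] += 1;
    return 0;
}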
Each line that is written to the text file consists of a number of fields, separated by '|', and corresponds to a path from the root of the tree to a leaf. For example, in the line:
JournalEAL|97077091|stats|1430|0|0|0|3|0|0|0|0|0|0|1|0|0|0|0|2|0|0|0|0|0|0|0|0|0
JournalEAL is the customer group, 97077091 is the Customer ID, 1430 is the Document ID, and the fields following the Document ID are the access counts. The stats field value is used to indicate that this line contains access counts. This is done so that other data may also be saved in the file. For example, an IP address used by the customer is written as:
JournalEAL|97077091|ip|2174877984
where 2174877984 is the IP address (in undotted decimal format).
The advantages of this file format are:
- It is "object oriented". Each object in the tree is responsible for writing and reading its own field or fields, and is independent of the fields preceding or following it. Additions or changes to the file format can thus be done in a modular fashion.
- It can be sorted into customer order using the standard sort utility.
- It is human-readable, which helps with debugging.
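For illustration, one leaf of the tree sketched above could be written out as a stats line like this (hypothetical code; the real work is done by the dump_stats functions listed further below):

#include <iostream>
#include <string>
#include <vector>

// Illustrative only: write one "stats" line for a (group, customer, document)
// path, with fields separated by '|' as described above.
void dumpStatsLine(std::ostream& out, const std::string& group,
                   long customerId, long documentId,
                   const std::vector<long>& counts)
{
    out << group << '|' << customerId << "|stats|" << documentId;
    for (size_t i = 0; i < counts.size(); ++i)
        out << '|' << counts[i];
    out << '\n';
}

int main()
{
    std::vector<long> counts(25, 0);   // the number of counters is illustrative
    counts[3] = 3;
    dumpStatsLine(std::cout, "JournalEAL", 97077091, 1430, counts);
    return 0;
}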
The format obviously results in much repetition: for example, there may be a million lines beginning with JournalEAL. The waste of disk space that this might have caused is avoided by compressing the files.
The files have names of the form product.stats.gz and are stored in the directory hierarchy under UsageReportsCounter/DailyStats. For example, the Kirk counts for 13 May 2006 are stored in the file UsageReportsCounter/DailyStats/2006/05/13/Kirk.stats.gz.
Saving of the counts is done simply by recursing/looping through the objects containing the counts. The functions that do this are:
- Report::dump_stats
- CustomerTableGroup::dump_stats
- CustomerTable::dump_stats
- Customer::dump_stats
- UsageStats::dump_usage_stats
- UsageData::dump_usage_stats
Compression of the files is done using the GzOStream class, which is a wrapper for the zlib (i.e. gzip) library.
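The GzOStream class itself is internal to usagereport; the effect is roughly that of writing through zlib's gz* interface, as in the following sketch (the directory layout matches the description above, the rest is illustrative):

#include <string>
#include <zlib.h>

// Sketch only: write a compressed stats file for one product and day, under
// UsageReportsCounter/DailyStats/YYYY/MM/DD/<product>.stats.gz.  The real
// program writes through the GzOStream wrapper rather than raw zlib calls.
bool writeStatsFile(const std::string& product, const std::string& yyyy,
                    const std::string& mm, const std::string& dd)
{
    std::string path = "UsageReportsCounter/DailyStats/" + yyyy + "/" + mm + "/"
                     + dd + "/" + product + ".stats.gz";
    gzFile out = gzopen(path.c_str(), "wb");      // assumes the directories already exist
    if (out == NULL)
        return false;
    gzprintf(out, "JournalEAL|97077091|stats|1430|0|0|0|3\n");   // truncated example line
    return gzclose(out) == Z_OK;
}

int main()
{
    return writeStatsFile("Kirk", "2006", "05", "13") ? 0 : 1;
}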
Session handling
When processing a weblog, usagereport initially increments per-session counters. Only when a session has ended (i.e. timed out) are the counts added to the corresponding customer counters.
Sessions may "straddle" day boundaries, i.e. a session started on one day may not finish until the following day. The partial counts for such sessions have to be saved at the end of daily processing, and restored at the start of daily processing on the next day. No session data is saved at the end of the last day of a month - the sessions are deemed to have ended.
Other information about each session also has to be saved. For example, the time of the last access is saved so that session timeout can be detected correctly. A list of last-access times for URLs is also saved for handling "double clicks".
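The end-of-day decision for each session can be pictured roughly as follows (a sketch with assumed names and an assumed timeout value; the real logic lives in the session classes whose dump/load functions are listed below):

#include <ctime>
#include <string>

// Illustrative sketch of per-session end-of-day handling.  A session that has
// already timed out is flushed into the customer counters; otherwise its
// partial counts and last-access times are saved so that the next day's run
// can resume it.  On the last day of the month every session is flushed.
struct Session {
    std::string key;         // SID cookie concatenated with the client IP address
    std::time_t lastAccess;  // time of the last access in this session
    // ... per-session counters, last-access times of URLs, etc.
};

const std::time_t kSessionTimeout = 30 * 60;   // assumed value, not the real one

enum EndOfDayAction { FlushToCustomer, SaveForNextDay };

EndOfDayAction endOfDayAction(const Session& s, std::time_t endOfDay, bool lastDayOfMonth)
{
    if (lastDayOfMonth)
        return FlushToCustomer;                 // sessions are deemed to have ended
    if (endOfDay - s.lastAccess > kSessionTimeout)
        return FlushToCustomer;                 // the session has already timed out
    return SaveForNextDay;                      // the session straddles the day boundary
}

int main()
{
    Session s;
    s.key = "002bd26817635a6ef2e9b03b02f9b310746f7234; 131.217.6.6";
    s.lastAccess = 1139846845;                  // example timestamp (also used in the format example below)
    return endOfDayAction(s, s.lastAccess + 600, false) == SaveForNextDay ? 0 : 1;
}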
The file format is similar to that of the .stats.gz files. The first field on each line is the session "key", which consists of the SID cookie concatenated with the client IP address. The next field is stats for a line containing access counts, url for a line containing the last-access time of a URL, and so on. For example:
002bd26817635a6ef2e9b03b02f9b310746f7234; 131.217.6.6|stats|5885|0|0|0|3|0|1|0|0|0|0|2|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0
002d21be96014708db0aba8b55a381bf008cd008; 18.51.1.222|data|002d21be96014708db0aba8b55a381bf008cd008; |-1|305332702|1139846845
In the first line, 002bd26817635a6ef2e9b03b02f9b310746f7234; 131.217.6.6 is the session key, 5885 is the document ID, and the remaining fields are access counts. In the second line, 002d21be96014708db0aba8b55a381bf008cd008 is the SID cookie, -1 is the user ID, 305332702 is the client IP (in undotted decimal format), and 1139846845 is the last access time of the session (as seconds since 1 Jan 1970).
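The "undotted decimal" form used for IP addresses in both file formats is simply the 32-bit value of the address written in decimal; a small illustrative conversion (the function name is made up):

#include <cstdio>

// An IPv4 address a.b.c.d in "undotted decimal" form is the single value
// (a<<24) | (b<<16) | (c<<8) | d, written in decimal.
unsigned long toUndotted(unsigned a, unsigned b, unsigned c, unsigned d)
{
    return (static_cast<unsigned long>(a) << 24) | (b << 16) | (c << 8) | d;
}

int main()
{
    // 18.51.1.222, the client IP in the second example line above,
    // gives the undotted value 305332702.
    std::printf("%lu\n", toUndotted(18, 51, 1, 222));
    return 0;
}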
The files have names of the form product.sessions.gz, and are stored in the same directory hierarchy as the .stats.gz files.
The functions that save the states of the sessions are:
- Report::dump_sessions
- SessionGroup::dump
- SessionUsage::dump
- UsageStats::dump_usage_stats
- UsageData::dump_usage_stats
and the corresponding functions that load the session states on the next day are:
- Report::loadSessions
- SessionGroup::load
- SessionUsage::load
- UsageStats::load_usage_stats
- UsageData::load_usage_stats
End-of-month processing
At the end of the month, the product.stats.gz files for each product are uncompressed and concatenated (using the gunzip utility) and sorted into customer order (using the sort utility).
The usagereport program reads the concatenated/sorted file, and adds together the daily counts to get counts for the month. Each time it reaches the end of the data for a customer it generates the reports for that customer. Most of the counts for that customer are then deleted to release memory, before reading and processing the counts for the next customer. The main functions involved in this are:
- Report::print_usage_stats
- CustomerTableGroup::print_stats
- CustomerTable::load_stats
- CustomerTable::print_stats
- CustomerTable::print_cust_stats
- Customer::delete_detail_stats
For products other than Journal, client IP addresses are treated much like customers, in the sense that web accesses are counted for each IP address as well as for each customer. However, detailed reports are only required for the IP addresses with the highest (top 100) text accesses and access denieds. This means that we cannot generate the report for an IP address when we reach the end of the data for that IP address, since we do not know at that point which are the top 100. On the other hand, we do not want to retain all the detailed counts until the end of processing (there are far more IP addresses than customers...). This is handled by using "heaps" to keep track of the top 100 IP addresses seen so far, and only retaining the detailed counts of those, as sketched after the list below. The main functions involved are:
- IPCustomerTable::load_stats
- IPCustomerTable::update_tables
- IPCustomerTable::add_to_table
- IPCustomer::inc_refcount
- IPCustomer::dec_refcount
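A minimal sketch of the "top 100" bookkeeping, using a size-bounded min-heap, is shown below. The names are assumed, and the real code also tracks access denieds and manages IPCustomer reference counts via inc_refcount/dec_refcount.

#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Sketch only: keep the N IP addresses with the highest text-access counts
// seen so far.  The root of the min-heap is the smallest retained count, so
// detailed counts need only be kept for addresses currently in the heap.
typedef std::pair<long, std::string> Entry;   // (text accesses, IP address)

class TopN {
public:
    explicit TopN(size_t limit) : limit_(limit) {}

    void add(const std::string& ip, long textAccesses)
    {
        if (heap_.size() < limit_) {
            heap_.push(Entry(textAccesses, ip));
        } else if (textAccesses > heap_.top().first) {
            heap_.pop();                          // evicted address: its detailed counts can be freed
            heap_.push(Entry(textAccesses, ip));
        }
    }

private:
    size_t limit_;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry> > heap_;
};

int main()
{
    TopN top100(100);
    top100.add("131.217.6.6", 12);   // example values only
    top100.add("18.51.1.222", 3);
    return 0;
}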
Unresolved issues
The switch to daily processing has raised a few issues that are currently not resolved in a satisfactory manner.
Database updates
The end-of-month processing uses a database dump taken at the end of the month, whereas daily processing uses a dump taken on the day itself. This can result in inconsistent database snapshots being used to process the access counts. In particular:
- If a customer is deleted from the database during the month, the end-of-month processing will have to process counts for a customer ID that has no details (name, IP addresses, etc.) in the database dump. Currently, no reports are generated for such a customer and the counts are discarded (although they are included in the summary counts for the appropriate customer group, e.g. EAL). The old (pre-daily) system would have included the counts in the "guest" customer group.
- If a document is deleted from the database during the month, the end-of-month processing will have to process counts for a document ID that has no details (title, etc.) in the database. Currently, such counts are discarded, although in this case it might be possible to retrieve the details from the Document DB file. The old (pre-daily) system would not have counted any accesses for the document.
Retention of files for re-runs
The weblog files and monthly database dumps have always been retained to enable regeneration of usage reports. Retention of additional files may make re-runs quicker or more accurate, but at the expense of additional disk usage.
The product.stats.gz and product.sessions.gz files are currently being retained. The total size of these files is less than 1GB per month, so the disk space used is not a major problem.
Retention of the product.stats.gz files enables the end-of-month processing to be re-run without having to re-run the daily processing. Retention of the product.sessions.gz files enables daily processing to be re-run from part-way through the month.
However, if daily processing is re-run, it will currently use the database dump produced at the end of the month. This may produce different counts due to deletions of customers, changes to customer IP ranges, etc. Note that the counts produced by such a re-run are not actually wrong, they are simply different (and, in fact, should be the same as the counts that would have been produced by the old pre-daily system).
In order for a re-run of daily processing to produce the same counts as the original run, all the daily database dumps would have to be retained. The daily database dumps are typically around 1GB, but can be compressed to around 150MB. Retaining the daily database dumps (compressed) would therefore require an extra 5GB of disk space per month.
Whatever retention policy is adopted (i.e. which files to retain, for how long, and whether they should be compressed), it should ideally be automated.