-- FrankOGorman - 24 May 2006

Daily processing of usage data

Overview

This topic describes the modification of the usage reporting system to run daily instead of just at the end of the month.

The problems

There were various, primarily operational, issues with the usage reporting system that motivated the switch to "daily processing":

  • Memory use: The amount of memory used for processing journal data had been rising steadily with increasing use of the WIS website and was becoming critical. The processing of EAL and BAL customers had been separated some time ago to reduce memory requirements. Nevertheless, at the end of 2005 the 4GB limit for 32-bit programs was reached. Using a 64-bit version of the usagereport program provided some temporary relief, but by March 2006 this was using around 13GB of memory. Since libra (the machine on which usagereport is run) has only 16GB of RAM, the consequence was excessive page-swapping and very slow running of the program (i.e. "thrashing").

  • Processing time: The total CPU time to generate the end-of-month reports had increased to around 60 hours. Although this could be distributed over 3 of libra's 4 CPUs, it was still taking several days to generate the reports. With QA time added, this meant that it was typically taking 2 weeks to provide reports to customers.

  • Operator time: The running of the end-of-month processing was labour-intensive. The following steps had to be done manually:
    • Run several commands on the live-site (apollo) to filter and compress the AM stats log.
    • Copy compressed AM stats log and compressed database dump (produced automatically by a cron job) from apollo to libra and uncompress them.
    • Copy various manually-generated files from the previous month.
    • Run a script to pre-process journal weblog files (already copied from apollo by a cron job).
    • Move weblog files to the directories that the usagereport program expected them to be in.
    • Run up to 3 instances of the usagereport program (on different lists of products), trying to balance the load across 3 CPUs.
    • Run a Perl script to generate Article Select reports.
    • When QA of the reports was completed, copy the reports to apollo and run a script on apollo to create directories for any new customers.

The solution

The processing performed by the original usagereport program consisted of three main stages:

  • Load document and customer data dumped from the Wispers database.
  • Process the month's weblog files in chronological order, counting accesses by each customer for each document.
  • Write usage reports for each customer to disk.

Most of the memory used by the program was for the counts generated by the second stage and needed by the third stage.

The first change in the "daily processing" system is to distribute the work done in the second stage, and do all the weblog processing and most of the counting on a daily basis. So each day, usagereport does:

  • Load document and customer data dumped from the Wispers database.
  • Process the day's weblog file in chronological order, counting accesses by each customer for each document.
  • Write the day's access counts to a file on disk.

And at the end of the month, usagereport does:

  • Load document and customer data dumped from the Wispers database.
  • Add together the daily counts.
  • Write usage reports for each customer to disk.

This re-organization drastically reduces the end-of-month processing time (to around 5 hours), thus dealing with the "processing time" issue listed above. Note, however, that the total processing time for the whole month has been significantly increased, due to the repeated loading of document and customer data.

This re-organization does not yet address the "memory use" issue, since the end-of-month processing still has all the counts in memory at the end of the second stage. This is dealt with by concatenating the daily count files and sorting them into customer order. The end-of-month procedure then becomes:

  • Concatenate and sort the daily count files.
  • Load document and customer data dumped from the Wispers database.
  • For each customer:
    • Add together the daily counts.
    • Write usage reports for that customer to disk.

With this organization, only the counts for a single customer at a time are needed in memory.
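
As a rough illustration of this streaming approach, the following self-contained C++ sketch reads sorted count lines from standard input, accumulates the counts for one customer at a time, and discards them as soon as that customer's lines are exhausted. This is not the real usagereport code (the actual functions are listed under "End-of-month processing" below); the field handling is simplified and the pipeline shown in the comment is only indicative.

    // Streaming per-customer aggregation over sorted daily count lines.
    // Input is assumed to be something like (roughly):
    //   gunzip -c 2006/05/*/Kirk.stats.gz | sort | ./merge
    // so that all lines for one customer are adjacent. Illustrative only.
    #include <cstddef>
    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <vector>

    int main()
    {
        std::string line;
        std::string current_key;                        // "group|customer" currently in memory
        std::map<std::string, std::vector<long> > doc_counts;  // doc id -> running totals

        // In the real program this is where the finished customer's reports
        // would be generated; here the monthly totals are just printed.
        auto flush_customer = [&]() {
            if (current_key.empty()) return;
            for (const auto& d : doc_counts) {
                std::cout << current_key << '|' << d.first;
                for (long c : d.second) std::cout << '|' << c;
                std::cout << '\n';
            }
            doc_counts.clear();                         // release this customer's memory
        };

        while (std::getline(std::cin, line)) {
            std::istringstream in(line);
            std::string group, customer, kind, doc, field;
            std::getline(in, group, '|');
            std::getline(in, customer, '|');
            std::getline(in, kind, '|');
            if (kind != "stats") continue;              // ignore ip and other lines here
            std::getline(in, doc, '|');

            std::string key = group + '|' + customer;
            if (key != current_key) {                   // sorted input: customer has changed
                flush_customer();
                current_key = key;
            }
            std::vector<long>& counts = doc_counts[doc];
            for (std::size_t i = 0; std::getline(in, field, '|'); ++i) {
                if (counts.size() <= i) counts.push_back(0);
                counts[i] += std::stol(field);          // add this day's counts to the totals
            }
        }
        flush_customer();                               // flush the last customer
        return 0;
    }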

Running usagereport daily obviously does not help with the "operator time" problem - in fact, it aggravates it hugely. It did, however, make it essential that something be done about the problem. The solution was to automate the running of the system using shell scripts run by cron (see Operational aspects of usage data generation).

Implementation

Overview

The top-level of the usagereport program has been modified to perform as outlined above, with a command-line option specifying whether daily or end-of-month processing is to be performed.

Saving of daily counts

When usagereport is run for a single day, it does essentially what it used to do when run for a whole month, except that it processes only a single weblog and it stops short of generating reports. Instead, it saves all the access counts (and related data) to a text file, so that they can be used for end-of-month processing.

The access counts are stored in memory in what is essentially a tree structure, with the counts at the leaves of the tree. The levels of the tree correspond to:

  • Customer group (e.g. EAL, BAL)
  • Customer
  • Document
  • Access counts

Each line that is written to the text file consists of a number of fields, separated by '|', and corresponds to a path from the root of the tree to a leaf. For example, in the line:

JournalEAL|97077091|stats|1430|0|0|0|3|0|0|0|0|0|0|1|0|0|0|0|2|0|0|0|0|0|0|0|0|0

JournalEAL is the customer group, 97077091 is the Customer ID, 1430 is the Document ID, and the fields following the Document ID are the access counts. The stats value in the third field indicates that this line contains access counts; this allows other kinds of data to be saved in the same file. For example, an IP address used by the customer is written as:

JournalEAL|97077091|ip|2174877984

where 2174877984 is the IP address (in undotted decimal format).
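
To make the format concrete, the following self-contained C++ fragment splits such lines on '|' and converts the undotted decimal IP address back to dotted form. It is not the production reader; the helper functions are invented for the example.

    // Splitting a '|'-separated line and decoding an undotted decimal IP.
    // Illustrative only, not the real usagereport parser.
    #include <cstdint>
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    static std::vector<std::string> split(const std::string& line, char sep = '|')
    {
        std::vector<std::string> fields;
        std::istringstream in(line);
        std::string f;
        while (std::getline(in, f, sep)) fields.push_back(f);
        return fields;
    }

    static std::string dotted(uint32_t ip)              // undotted decimal -> a.b.c.d
    {
        std::ostringstream out;
        out << (ip >> 24) << '.' << ((ip >> 16) & 0xff) << '.'
            << ((ip >> 8) & 0xff) << '.' << (ip & 0xff);
        return out.str();
    }

    int main()
    {
        // Abbreviated count list for the example.
        std::vector<std::string> stats = split("JournalEAL|97077091|stats|1430|0|0|0|3|0");
        std::cout << "group=" << stats[0] << " customer=" << stats[1]
                  << " document=" << stats[3] << " first count=" << stats[4] << '\n';

        std::vector<std::string> ip = split("JournalEAL|97077091|ip|2174877984");
        std::cout << "customer=" << ip[1] << " ip="
                  << dotted(static_cast<uint32_t>(std::stoul(ip[3]))) << '\n';  // 129.162.1.32
        return 0;
    }

The same conversion applies to the client IP addresses in the session files described below (for example, 305332702 corresponds to 18.51.1.222).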

The advantages of this file format are:

  • It is "object oriented". Each object in the tree is responsible for writing and reading its own field or fields, and is independent of the fields preceding or following it. Additions or changes to the file format can thus be done in a modular fashion.
  • It can be sorted into customer order using the standard sort utility.
  • It is human-readable, which helps with debugging.

The format obviously results in much repetition: for example, there may be a million lines beginning with JournalEAL. The waste of disk space that this might have caused is avoided by compressing the files.

The files have names of the form product.stats.gz and are stored in a directory hierarchy under UsageReportsCounter/DailyStats. For example, the Kirk counts for 13 May 2006 are stored in the file UsageReportsCounter/DailyStats/2006/05/13/Kirk.stats.gz.

Saving of the counts is done simply by recursing/looping through the objects containing the counts. The functions that do this are:

  • Report::dump_stats
  • CustomerTableGroup::dump_stats
  • CustomerTable::dump_stats
  • Customer::dump_stats
  • UsageStats::dump_usage_stats
  • UsageData::dump_usage_stats

Compression of the files is done using the GzOStream class, which is a wrapper for the zlib (i.e. gzip) library.
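
The exact interface of GzOStream is not documented in this topic, but a minimal wrapper along the same lines, using only the standard zlib calls gzopen, gzputs and gzclose, might look like the sketch below. The class and method names here are illustrative, not the real ones.

    // A minimal GzOStream-style wrapper over zlib's gzip file API (link with -lz).
    // The real GzOStream class may differ; only the zlib calls are taken as given.
    #include <stdexcept>
    #include <string>
    #include <zlib.h>

    class GzWriter {                                    // illustrative name
    public:
        explicit GzWriter(const std::string& path)
            : file_(gzopen(path.c_str(), "wb"))
        {
            if (!file_) throw std::runtime_error("cannot open " + path);
        }
        ~GzWriter() { if (file_) gzclose(file_); }

        void write_line(const std::string& line)
        {
            gzputs(file_, line.c_str());                // data is compressed as it is written
            gzputs(file_, "\n");
        }

    private:
        gzFile file_;
        GzWriter(const GzWriter&);                      // non-copyable
        GzWriter& operator=(const GzWriter&);
    };

    int main()
    {
        GzWriter out("Kirk.stats.gz");
        out.write_line("JournalEAL|97077091|ip|2174877984");
        return 0;
    }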

Session handling

When processing a weblog, usagereport initially increments per-session counters. Only when a session has ended (i.e. timed-out) are the counts added to the corresponding customer counters.

Sessions may "straddle" day boundaries, i.e. a session started on one day may not finish until the following day. The partial counts for such sessions have to be saved at the end of daily processing, and restored at the start of daily processing on the next day. No session data is saved at the end of the last day of a month - the sessions are deemed to have ended.

Other information about each session also has to be saved. For example, the time of the last access is saved so that session timeout can be detected correctly. A list of last-access times for URLs is also saved for handling "double clicks".
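
The following self-contained C++ sketch illustrates the two per-session time checks described above: session timeout and "double click" suppression. It is not the real usagereport session code; the data structures are simplified, and the timeout and double-click windows are placeholder values, since the actual values are not given in this topic.

    // Simplified session bookkeeping: last-access time for timeout detection,
    // per-URL last-access times for double-click suppression, and per-session
    // partial counts. Placeholder constants; illustrative only.
    #include <ctime>
    #include <iostream>
    #include <map>
    #include <string>

    struct Session {
        std::time_t last_access = 0;                    // for timeout detection
        std::map<std::string, std::time_t> url_last;    // per-URL, for double clicks
        std::map<std::string, long> counts;             // per-document partial counts
    };

    const std::time_t SESSION_TIMEOUT = 30 * 60;        // placeholder: 30 minutes
    const std::time_t DOUBLE_CLICK_WINDOW = 10;         // placeholder: 10 seconds

    // Record an access; returns false if it is ignored as a double click.
    bool record_access(Session& s, const std::string& url,
                       const std::string& doc, std::time_t now)
    {
        std::map<std::string, std::time_t>::iterator it = s.url_last.find(url);
        bool double_click = (it != s.url_last.end() &&
                             now - it->second <= DOUBLE_CLICK_WINDOW);
        s.url_last[url] = now;
        s.last_access = now;
        if (double_click) return false;                 // don't count the repeat
        ++s.counts[doc];                                // per-session counter
        return true;
    }

    // A session whose last access is older than the timeout has ended; at that
    // point its partial counts are added to the customer's counters.
    bool session_ended(const Session& s, std::time_t now)
    {
        return now - s.last_access > SESSION_TIMEOUT;
    }

    int main()
    {
        Session s;
        std::time_t now = std::time(0);
        record_access(s, "/fulltext/5885", "5885", now);      // counted
        record_access(s, "/fulltext/5885", "5885", now + 2);  // ignored: double click
        std::cout << "ended: " << session_ended(s, now + 3600) << '\n';
        return 0;
    }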

The file format is similar to that of the .stats.gz files. The first field on each line is the session "key", which consists of the SID cookie concatenated with the client IP address. The next field is stats for a line containing access counts, url for a line containing the last-access time of a URL, and so on. For example:

002bd26817635a6ef2e9b03b02f9b310746f7234; 131.217.6.6|stats|5885|0|0|0|3|0|1|0|0|0|0|2|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0
002d21be96014708db0aba8b55a381bf008cd008; 18.51.1.222|data|002d21be96014708db0aba8b55a381bf008cd008; |-1|305332702|1139846845

In the first line, 002bd26817635a6ef2e9b03b02f9b310746f7234; 131.217.6.6 is the session key, 5885 is the document ID, and the remaining fields are access counts. In the second line, 002d21be96014708db0aba8b55a381bf008cd008 is the SID cookie, -1 is the user ID, 305332702 is the client IP (in undotted decimal format), and 1139846845 is the last access time of the session (as seconds since 1 Jan 1970).

The files have names of the form product.sessions.gz, and are stored in the same directory hierarchy as the .stats.gz files.

The functions that save the states of the sessions are:

  • Report::dump_sessions
  • SessionGroup::dump
  • SessionUsage::dump
  • UsageStats::dump_usage_stats
  • UsageData::dump_usage_stats

and the corresponding functions that load the session states on the next day are:

  • Report::loadSessions
  • SessionGroup::load
  • SessionUsage::load
  • UsageStats::load_usage_stats
  • UsageData::load_usage_stats

End-of-month processing

At the end of the month, the product.stats.gz files for each product are uncompressed and concatenated (using the gunzip utility) and sorted into customer order (using the sort utility).

The usagereport program reads the concatenated/sorted file, and adds together the daily counts to get counts for the month. Each time it reaches the end of the data for a customer it generates the reports for that customer. Most of the counts for that customer are then deleted to release memory, before reading and processing the counts for the next customer. The main functions involved in this are:

  • Report::print_usage_stats
  • CustomerTableGroup::print_stats
  • CustomerTable::load_stats
  • CustomerTable::print_stats
  • CustomerTable::print_cust_stats
  • Customer::delete_detail_stats

For products other than Journal, client IP addresses are treated much like customers, in the sense that web accesses are counted for each IP address as well as for each customer. However, detailed reports are required only for the IP addresses with the highest (top 100) counts of text accesses and of access denials. This means that we cannot generate the report for an IP address when we reach the end of the data for that IP address, since we do not know at that point which are the top 100. On the other hand, we do not want to retain all the detailed counts until the end of processing (there are far more IP addresses than customers). This is handled by using "heaps" to keep track of the top 100 IP addresses so far, and retaining the detailed counts of those only (see the sketch after the following list). The main functions involved are:

  • IPCustomerTable::load_stats
  • IPCustomerTable::update_tables
  • IPCustomerTable::add_to_table
  • IPCustomer::inc_refcount
  • IPCustomer::dec_refcount
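
As an illustration of the heap idea only (not the real IPCustomerTable code), the following self-contained C++ sketch keeps the 100 largest counts seen so far in a bounded min-heap; an IP address that is evicted, or never enters the heap, need not have its detailed counts retained. In the real program the ranking covers both text accesses and access denials, which is presumably why IPCustomer objects are reference-counted.

    // Bounded min-heap holding the top-N (IP, count) pairs seen so far.
    // Illustrative only; not the real usagereport data structures.
    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <iostream>
    #include <queue>
    #include <utility>
    #include <vector>

    typedef std::pair<long, uint32_t> Entry;            // (access count, IP address)

    class TopN {
    public:
        explicit TopN(std::size_t n) : n_(n) {}

        // Offer an IP address with its (complete) count; returns true if it is
        // currently among the top n. The smallest entry is evicted when full.
        bool offer(uint32_t ip, long count)
        {
            if (heap_.size() < n_) {
                heap_.push(Entry(count, ip));
                return true;
            }
            if (count <= heap_.top().first) return false;   // below the cut-off
            heap_.pop();                                     // evict the smallest
            heap_.push(Entry(count, ip));
            return true;
        }

        std::size_t size() const { return heap_.size(); }

    private:
        std::size_t n_;
        // Min-heap: the entry with the smallest count is on top.
        std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry> > heap_;
    };

    int main()
    {
        TopN top(100);
        std::cout << top.offer(2174877984u, 42) << ' '   // counts here are made up
                  << top.offer(305332702u, 7) << '\n';
        return 0;
    }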

Unresolved issues

The switch to daily processing has raised a few issues that are currently not resolved in a satisfactory manner.

Database updates

The end-of-month processing is done using a database dump done at the end of the month, whereas daily processing is done using a dump done on the day. This can result in inconsistent database snapshots being used to process the access counts. In particular:

  • If a customer is deleted from the database during the month, the end-of-month processing will have to process counts for a customer ID that has no details (name, IP addresses, etc.) in the database dump. Currently, no reports are generated for such a customer and the counts are discarded (although they are included in the summary counts for the appropriate customer group, e.g. EAL). The old (pre-daily) system would have included the counts in the "guest" customer group.

  • If a document is deleted from the database during the month, the end-of-month processing will have to process counts for a document ID that has no details (title, etc.) in the database. Currently, such counts are discarded, although in this case it might be possible to retrieve the details from the Document DB file. The old (pre-daily) system would not have counted any accesses for the document.

Retention of files for re-runs

The weblog files and monthly database dumps have always been retained to enable regeneration of usage reports. Retention of additional files may make re-runs quicker or more accurate, but at the expense of additional disk usage.

The product.stats.gz and product.sessions.gz files are currently being retained. The total size of these files is less than 1GB per month, so the disk space used is not a major problem.

Retention of the product.stats.gz files enables the end-of-month processing to be re-run without having to re-run the daily processing. Retention of the product.sessions.gz files enables daily processing to be re-run from part-way through the month.

However, if daily processing is re-run, it will currently use the database dump produced at the end of the month. This may produce different counts due to deletions of customers, changes to customer IP ranges, etc. Note that the counts produced by such a re-run are not actually wrong; they are simply different (and, in fact, should be the same as the counts that would have been produced by the old pre-daily system).

In order for a re-run of daily processing to produce the same counts as the original run, all the daily database dumps would have to be retained. The daily database dumps are typically around 1GB, but can be compressed to around 150MB. Retaining the daily database dumps (compressed) would therefore require an extra 5GB of disk space per month.

Whatever retention policy is adopted (i.e. which files to retain, for how long, and whether they should be compressed), it should ideally be automated.
