-- Main.MonicaCroucher - 09 May 2006 !CountUse

SOME THOUGHTS ON THE DESIGN OF THE USAGE REPORTING SYSTEM


This document was started with the aim of finding further ways to improve the current usage reporting system, which runs on a daily basis. The main focus was on changes that might improve processing time, memory use and disk use, in ways that scale to cope with future increases in load.

Ideas were generated partly by examining the system and its problems, and partly by thinking of a design to meet needs without the constraints of the existing system.

In the course of doing this, many simpler ways of gaining the same benefits emerged, which are more radical in their effect on the system - it would probably be quicker to rewrite it. The simpler methods also make the system easier to develop, maintain and extend to incorporate new features. This is hard to achieve with the existing system even with improvements.  So I also include a sketch of more radical changes.

The simpler ways are not new ideas, but maybe put together a little differently. Others with more detailed knowledge of specific areas may have a better idea of whether the approach is feasible.

Below, thoughts on the existing system are followed by the sketch of more radical changes, then a comparison section.

Existing System: UsageReport

Possible changes to UsageReport are outlined below with respect to the benefits they provide in the areas of processing time, memory use, disk use, and webpage presentation.

Processing time

Processing times of around 60 hours at the end of the month have already been reduced by distributing the work over the month under 'Daily Processing'.

The current estimated times to produce the webpage access statistics are:
End of Month:  5 hrs
Daily:         4 hrs

At the end of each month the main activities are to concatenate the daily files into one file, sort and add all the counts, perform end of month counts, such as Top-100_FullTxtAccess, and copy to files in a format for webpage display.  The breakdown of daily activities is below.

Estimated Daily Breakdown

Activity                                                                   Hrs
Database dumps                                                             0.5
Reading in large Document file/creating data structures for each Product   1.5
Reading/parsing Weblog file for each Product                               1.0
Reading in Customer files per Product, and building index                  1.0
Counts                                                                     0.5
Daily Total                                                                4.5

Possible Changes:


  • Read in the large Document file once (by rewriting the top level of the program to handle all Products) - estimated saving 1.25 hrs
  • Sort the Weblog files into Product order (hard to estimate, since the sort itself takes time) - estimated saving 0.5 hr
  • Improve Weblog parsing/pattern matching - saving unknown - not done
  • Customer files per Product (unclear what could be modified, or how much would be gained) - saving unknown - not done

Memory Use

Currently, with features introduced under 'Daily Processing', only 2GB of memory is used when running the program (reduced from 4GB-13GB).  Now only the Customers and Documents for the day are held in memory. Most of the remaining daily memory use is still the UsageStats data structures used to keep counts (per Document per Customer/IPCustomer).

Possible Change:


  • Process One Customer at a Time
After processing each Customer, save the results to a file and release the memory. To do this quickly, the Weblog file needs to be sorted by IPCustomer and Customer within each Product (see Product ordering above).
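The one-Customer-at-a-time idea can be sketched as a streaming pass over the sorted Weblog: only the current Customer's counts are held in memory at any moment. A minimal Python sketch, assuming records have been reduced to (customer, document) pairs; the names are illustrative, not the real UsageReport structures:

```python
from itertools import groupby
from operator import itemgetter

def process_by_customer(records):
    """Process one Customer at a time from records pre-sorted by customer.

    Because the input is sorted (as the sorted Weblog would be), each
    Customer's counts can be built, yielded for saving to a file, and
    then dropped, so memory use stays bounded by one Customer's data.
    """
    for customer, group in groupby(records, key=itemgetter(0)):
        counts = {}
        for _, document in group:
            counts[document] = counts.get(document, 0) + 1
        yield customer, counts
```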

Disk Use

Estimated Monthly Disk Use

Type of Use                                                  GB
Raw Data: Weblogs, Database dumps, Misc Logs                  6
Processed Data: CustomerReports/ArticleReports Berkeley DBs   5
Total                                                        11

There is a lot of duplication in the data stored in the Berkeley DB files, both within each month and between months. Some duplication has been removed but a lot remains - for example, Document titles and Customers are repeated in every Product (i.e. in each Berkeley DB).

Possible Change:

  • Some of this duplication could be further reduced. However, most of the duplication is tied to webpage presentation and would not be easy to remove. For example, several Berkeley DB files corresponding to different webpages for a Customer may contain columns of data with the same statistics, each of which the Perl scripts are written to access. Reducing this duplication would require significant changes both to the format of the Berkeley DB files and to the Perl scripts (see 'Webpage Presentation' below).

Webpage Presentation

It has already been  proposed by others that Java Cocoon should replace the Perl scripts that currently generate webpages.

The Perl scripts currently access precomputed end-of-month data in Berkeley DB files, stored in a format specific to each webpage. Because the data is scattered across webpage-specific files, it is hard to access, and it has to be re-computed whenever a new or modified statistic is needed. This can be lengthy: for example, a new statistic covering the last 3 years would, at current times, take 7.5 days to recompute (5 hrs * 12 months * 3 years = 180 hrs / 24).

Possible Changes:

  • Store data in a general format (not webpage specific) in the Berkeley DB files. This makes it unnecessary to re-run the program when a new statistic or a new webpage is needed.
  • New procedures, called by the web presentation code (such as Java Cocoon), need to be written to select, sort and compute from the data in the new format.
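As a sketch of the second point, with SQLite standing in for whichever database engine is chosen, a statistic such as a Customer's most-accessed Documents could be computed on the fly from generally-formatted rows. The table and column names here are illustrative assumptions, not the actual CountUse schema:

```python
import sqlite3

def make_store(conn):
    """Create a general (not webpage-specific) counts table; the schema
    is an illustrative assumption."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS access_counts "
        "(customer TEXT, document TEXT, count INTEGER)"
    )

def top_documents(conn, customer, n):
    """Compute a Customer's n most-accessed Documents on the fly using
    standard selection and sorting, instead of a page-specific file."""
    return conn.execute(
        "SELECT document, SUM(count) AS total FROM access_counts "
        "WHERE customer = ? GROUP BY document "
        "ORDER BY total DESC LIMIT ?",
        (customer, n),
    ).fetchall()
```

A new webpage or statistic then only needs a new query, not a re-run of the whole program.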

Sketch of Alternative System: CountUse

There are two main features to CountUse. One is to replace reading and parsing of the weblogs by getting data directly from the Apache threads. The second is to replace existing data representation and storage by a standard database.

Overview

CountUse would consist of two main processes. One to be run on Apollo - call it  CollectData - which would log data from the web server. The other process -  ProcessData - to be run on Libra,  would, among other tasks, process  the data collected on Apollo and store it in standard databases. Interim counts would be stored in a 'WorkInProgress' database on Libra, which might also hold up-to-date Customer and Document data. Monthly historical data for webpage display would be stored in dedicated 'Web' databases on Apollo and on Libra.       

CollectData: process on Apollo

This process would read data from Apache threads in real-time via a FIFO, and write the data to a log. Periodically during a day, a new log would be started and the old moved to a directory where ProcessData on Libra would look for logs to copy.

It is important to write the data immediately to a log on Apollo to avoid losing it: if Libra were down, this process could not send the data in real-time, but neither could it wait, since it must continue to read from the FIFO.

The data needed is already used by an Apache thread to check a Customer license for a Document, and would just need to be written to the FIFO. Minimally this data would include Customer IP address, Document ID/TypeOfAccess, and Session ID. Additionally the date/time on Apollo would be needed. Getting this data directly eliminates current Weblog parsing and reading.

The FIFO would probably be needed for the speed gained from reads/writes being done in memory rather than on disk, so that other Apache threads are not held up.  [Unknown - time to open/close FIFO per thread.]

[Apache Thread Modifications: open a FIFO, write a line of data to the FIFO  and close the FIFO.]

ProcessData: process on Libra

On a daily basis this process would copy logs written by CollectData on Apollo, and sort and store cumulative data in the WorkInProgress database. Monthly it would perform end of the month counts, and transfer authorized data to the Web databases. Additionally, it would respond to requests at any time to re-run tasks.

Many of these tasks are independent and need to be capable of being performed independently to provide additional flexibility. So the structure of this process might be a Control process that spawns several sub-processes.

ControlProcess - This would spawn several sub-processes at appropriate times. For ease of reference I'll name the sub-processes CopyLog, ProcessLog, ProcessEndMonth and TransferCounts. It might also respond to user requests, for example, to restart the processing from the beginning of the month, as well as to requests to re-run from month N, or transfer monthly data to Apollo. Responding to requests might be implemented over a socket connection with a simple client that accepts and sends user requests.

CopyLog - this would check a log directory on Apollo to copy any log files that CollectData had written, and concatenate them with the existing log for the day.

ProcessLog - this would sort a log and store cumulative totals per Customer per Document in the WorkInProgress database. The original unsorted log would be retained to allow counts that require the original ordering (e.g. sessions) to be reconstructed when necessary.

Sorting the log would minimize database accesses to the hard disk, and, therefore, reduce processing time. The preference for sorting the log makes it unnecessary to transfer data in real-time from Apollo.
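The benefit of sorting can be sketched as follows: with the day's records sorted, each cumulative total in the WorkInProgress store is read and written exactly once, which is what keeps the number of disk accesses low. A dict stands in for the database here; the names are assumptions:

```python
from itertools import groupby

def store_cumulative(db, sorted_records):
    """Fold a day's sorted log into cumulative (customer, document) totals.

    `db` is any mapping acting as the WorkInProgress store (a dict here,
    a real database in the design). Because equal records arrive
    consecutively, each key needs only one read-modify-write.
    """
    for key, group in groupby(sorted_records):
        db[key] = db.get(key, 0) + sum(1 for _ in group)
```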

To perform all the counts for a webpage access, ProcessLog also needs to be able to find other Documents in the hierarchy, and other Customer Licenses that contain the IP address. For ease of reference I'll call this extra data ExtraDataForCounts. It might be provided by one of the following alternatives:

- The Apache thread might extract this data, if time allowed, from the Apollo database and write it to the FIFO along with the other data. It is not known whether this is feasible.
- The WorkInProgress database might itself hold data on Customer Licenses, their IP address ranges and Document hierarchies. This would need to be kept up-to-date, and this alternative also requires the following:
- Each Customer IP entry would need a start date and, once no longer valid, an end date. Dates are needed to reconstruct a snapshot of the database so that statistics for the past can be correctly recalculated when needed. For example, if an IP address is no longer valid, an access might have to be treated as a Guest.
- When Document hierarchies change, it might be simplest to start a new database, unless a simple means of representing the hierarchy with dates could be found. Changes to hierarchies are believed to be infrequent, so if it were necessary to start a new database, it might not add much to disk space requirements.
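The start/end dates on a Customer IP entry amount to a simple validity test applied when recalculating past statistics. A minimal Python sketch, assuming an entry holds a start date and an optional end date (None while still valid):

```python
from datetime import date

def ip_entry_valid(entry, access_date):
    """Decide whether a Customer IP entry was valid at the time of an access.

    `entry` is assumed to be a (start, end) pair of dates, with end set
    to None while the entry is still valid. An access outside the window
    would be counted as a Guest instead.
    """
    start, end = entry
    if access_date < start:
        return False
    return end is None or access_date <= end
```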

ProcessEndMonth - at the end of the month this sub-process would perform the monthly counts (e.g. Top-100_IPs_FullTxtAccess, Top-10_CustomerDenieds).

TransferCounts - this would transfer the latest monthly statistics from the WorkInProgress database to the Web databases on Apollo and Libra for the display of webpages for customers and for internal use.

If an error requires the current month to be re-run, the Control process would need to reset the WorkInProgress database and restart ProcessLog on the logs from the start of the month, leaving CopyLog to continue copying new logs from Apollo.

Isolating the copying of the logs, and combining the sorting and counting functions, means there is no need to store sorted logs. The two independent tasks - copying new logs, and re-processing unsorted logs that have already been copied - can easily continue in parallel, a feature which might be useful if the volume of webpage statistics increased.

To re-process previous months, Control would run another ProcessLog on the appropriate month's log directory, and use an independent section of the database to store the results. An independent TransferCounts sub-process would also be started to transfer the data to the Web databases when authorized.

Webpage Presentation

As indicated in the previous section, Java Cocoon has been suggested previously to do webpage presentation instead of the current Perl scripts.

Under CountUse the historical data would be stored in databases on Apollo and Libra. Procedures for computing data when a webpage is accessed would, therefore, use standard selection and sorting database procedures to produce their statistics on the fly.

Comparison

Below, the main features of CountUse are itemized and compared with corresponding aspects of UsageReport.

  • Real-time data from Apache threads on webpage access
-Eliminates data from the Weblog that the system does not need.
-Eliminates Weblog processing time.
-Eliminates specialized Weblog processing code and its maintenance.
-Eliminates the administration task of maintaining the mapping of URLs to Document IDs, and the need to re-run the program if it hasn't been kept up-to-date.


  • WorkInProgress database
More sophisticated indexes come as standard
- Eliminates specialized code to build indexes (e.g. for Customers per Product) and its maintenance.
- Eliminates daily processing time for constructing indexes for Customers and Documents, though this might be offset by database access time.
Automatic addition of new Documents/Customers and maybe Products
- Reduced administration time.
Easier to add procedures to do new totals
- Reduced development time.

  • ExtraDataForCounts: Document/Customer data, stored in the WorkInProgress database or streamed from Apache threads
- Eliminates the time to do the database dumps.
- Eliminates the time to read in the database dumps and build data structures from them.

If ExtraDataForCounts is stored in WorkInProgress, rather than streamed, it also eliminates the disk space currently required by monthly database dumps, but at the expense of increased processing time. 

  • Web databases for historical data
- Provides standard facilities for selecting and sorting data.
- Eliminates the need to write specialized code, which requires maintenance.
- Ease of modifying/adding webpages, and hence reduced development time.
This is due in part to the general data format from which to extract data. Although the Berkeley DB files could be reorganized in a more general way, taking advantage of this would require implementing the selection and sorting facilities that come as standard with a database.

  • Control Process
The Control process and the modular sub-processes would provide useful facilities that reduce administration time, such as re-running the program to do counts. A control process could, however, be implemented in the current system, with the same advantages.

Summary

The initial aim was to investigate ways to improve the use of memory, disk and processor. With the proposed changes to UsageReport, both systems would offer significant benefits over the current system. Memory use should be similar in the two systems, but disk use should be significantly better in CountUse. It is more difficult to compare processing time, since it is a comparison between specialized index-building and generalized database access. 

Although much can be done on UsageReport towards the initial aim, it would not gain the simplicity of CountUse, and thereby its accessibility to others, its ease of maintenance, or its extensibility. These factors are the primary benefits of the more radical changes, and they assume a more important role now that the immediate needs for memory, disk and processor would seem to be met.

In the long-term, the ease of extending CountUse would cut development and maintenance time.

In the short-term, development time might be similar for the two systems, even though CountUse involves more radical changes. This is partly because some improvements would be harder or involve extra work on UsageReport, and partly because the interfacing can be difficult when so much of the system is unknown. 

The modularity of CountUse, and the use of more standard components, means that many aspects of development can be done independently and by using a wider range of development skills and, therefore, personnel.


Revision: r1.1 - 09 May 2006 - 10:27 - MonicaCroucher
Copyright © 1999-2006 John Wiley & Sons Ltd.