Some Thoughts on the Design of a Usage Reporting System

This document was started with the aim of generating more ways in which the current usage reporting system, which is run on a daily basis, could be improved. The main focus was on changes that might reduce processing time, memory usage, and disk usage in ways that can scale to cope with future increases in load.

Ideas were generated partly by examining the system and its problems, and partly by thinking of a design to meet the requirements without the constraints of the existing system.

In the course of doing this, many simpler ways of gaining the same benefits emerged, which are more radical in their effect on the system - it would probably be simpler to rewrite it. The simpler methods also make the system easier to develop, maintain, and extend to incorporate new features, which is hard to achieve with the existing system even with improvements. So I also include a sketch of the more radical changes.

The simpler ways are not new ideas, but are perhaps put together a little differently. Others with more detailed knowledge of specific areas may have ideas on whether the approach is feasible.

Below, thoughts on the existing system are followed by the sketch of the more radical changes, and then a comparison section.

Existing System: UsageReport

Processing time
Processing times of around 60 hours a month have already been reduced by distributing the work over the month.

The current estimated times to produce the webpage access statistics are:

    End of Month:   5 hrs
    Daily:          4 hrs

At the end of the month the main activities are to concatenate the daily files into one file, sort and add all the Counts, perform end-of-month counts such as Top-100_FullTxtAccess, and copy the results to files in a format for webpage display. The breakdown of daily activities is below.


Daily Breakdown

    Activity                                                      Hrs
    Database dumps                                                0.5
    Reading in large Document file / creating data structures
      for each Product                                            1.5
    Reading/parsing Weblog file for each Product                  1.0
    Reading in Customer files per Product, and building index     1.0
    Counts                                                        0.5
    Daily Total                                                   4.5

Possible Changes:

- This processing time can be reduced to a few minutes by reorganizing the top level of the program to process all Products in one pass over the weblog (see the sketch below).
- It is not easy to estimate the processing time saved, since the data also has to be sorted.
- It is not known how easy the program would be to modify, nor how much benefit would be gained.
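
As a rough illustration of this reorganization, the sketch below counts all Products in a single pass, in the style of the existing Perl. The tab-separated line layout is an assumption for illustration, not the real weblog format.

    #!/usr/bin/perl
    # Sketch: count ALL Products in one pass over the day's weblog,
    # instead of re-reading and re-parsing the log once per Product.
    use strict;
    use warnings;

    my %counts;    # $counts{$product}{$customer}{$doc} = accesses
    while (my $line = <>) {
        chomp $line;
        # Assumed layout: product, customer, document id (tab-separated).
        my ($product, $customer, $doc) = split /\t/, $line;
        next unless defined $doc;
        $counts{$product}{$customer}{$doc}++;
    }

    # One sorted dump per Product at the end of the single pass.
    for my $p (sort keys %counts) {
        for my $c (sort keys %{ $counts{$p} }) {
            print join("\t", $p, $c, $_, $counts{$p}{$c}{$_}), "\n"
                for sort keys %{ $counts{$p}{$c} };
        }
    }
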
Memory Usage

Currently, with the features introduced under 'Daily Processing', only 2GB of memory is used when running the program (reduced from 4-13GB). Now only the Customers and Documents for the day are held in memory. Most of the remaining daily usage is still the UsageStats data structures used to keep the counts (per Document per Product, per Customer/IPCustomer).

Possible Change:

After processing each Customer, save its counts to a file and release the memory. To do this quickly, the weblog needs to be sorted by IPCustomer and Customer within each Product (see Product ordering above); a sketch follows.
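
A minimal sketch of the idea, assuming the day's weblog has already been sorted so that each Customer's lines are contiguous (the two-field line layout is made up for illustration):

    #!/usr/bin/perl
    # Sketch: hold only ONE Customer's counts in memory at a time.
    # When the Customer changes in the sorted log, the counts are
    # written out and the memory released before the next Customer.
    use strict;
    use warnings;

    my ($current, %counts);
    while (my $line = <>) {        # sorted weblog on stdin
        chomp $line;
        my ($customer, $doc) = split /\t/, $line;    # assumed layout
        next unless defined $doc;
        if (defined $current && $customer ne $current) {
            flush_customer($current, \%counts);
            %counts = ();          # release this Customer's memory
        }
        $current = $customer;
        $counts{$doc}++;
    }
    flush_customer($current, \%counts) if defined $current;

    sub flush_customer {
        my ($customer, $counts) = @_;
        open my $out, '>>', "counts.$customer" or die "open: $!";
        print {$out} "$_\t$counts->{$_}\n" for sort keys %$counts;
        close $out;
    }
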
Disk Usage

Estimated Monthly Disk Use

    Type of Use                                                   GB
    Raw Data: Weblogs, Database dumps, Misc Logs                  6.33
    Processed Data: CustomerReports/ArticleReports Berkeley DBs   4.6
    Total                                                         10.93

There is a lot of duplication in the data stored, both within every Berkeley DB file each month and between months: for example, the same Customers appear in every Product (i.e. in each Berkeley DB), and Document titles are repeated.

Possible Change:

- Store shared data, such as Customers and Document titles, once, rather than repeating it in every Product's Berkeley DB each month.

Webpage Presentation

Currently webpages are generated by Perl scripts. It has already been proposed elsewhere that these should be replaced by Java Cocoon, which would bring them into line with the rest of the website. The Perl scripts currently access precomputed data in Berkeley DB files, stored in a format specific to each webpage.

The end-of-month data is not kept in a general format, only in the forms for specific webpages. If the program has to be re-run, therefore, the general data has to be recomputed. So, for example, if end-of-month data were needed for a new statistic for the last 3 years, at current times it would take 7.5 days to compute (5 hrs * 12 months * 3 years = 180 hrs = 7.5 days).

Possible Changes:

- Keep the end-of-month data in a general format, so that new statistics can be computed from it without re-running the whole program.
- Replace the Perl scripts with Java Cocoon, as already proposed.

Sketch of Alternative System: CountUse

There are two main features of CountUse. One is to replace the reading and parsing of the weblogs by getting the data directly from the Apache threads. The second is to replace the existing data representation and storage with a standard database.

Overview

CountUse would consist of two main processes. One, to be run on Apollo (call it CollectData), would log data from the web server. The other (call it ProcessData), to be run on Libra, would, among other tasks, process the data collected on Apollo and store it in standard databases. Interim counts would be stored in a 'WorkInProgress' database on Libra, which might also hold up-to-date Customer and Document data. Monthly historical data for webpage display would be stored in dedicated 'Web' databases on Apollo and on Libra.

CollectData: process on Apollo

This process would read data from the Apache threads in real time via a FIFO, and write the data to a log. Periodically during the day, a new log would be started and the old one copied to a directory where ProcessData on Libra would look for logs to copy.

Writing the data immediately to a log on Apollo is necessary because, if Libra is down and the data cannot be sent, this process can neither stop reading from the FIFO nor lose the data it has already read.

The data needed is already used by an Apache thread to check a Customer license for a Document, and would just need to be written to the FIFO. Minimally, this data would include the Customer IP address, webpage URL and session id; additionally, the date/time on Apollo would be needed. Getting this data directly eliminates the current weblog reading and parsing.

The FIFO would probably be needed for the speed gained from reads/writes being done in memory rather than on disk, so that other Apache threads are not held up. [Unknown: the time to open/close the FIFO per thread.]

[Apache thread modifications: open the FIFO, write a line of data to it, and close it.]
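
A sketch of the CollectData loop follows. The FIFO path, log names, pickup directory, and rotation interval are all assumptions; the record layout is simply whatever the Apache threads write (minimally IP address, URL, session id and date/time).

    #!/usr/bin/perl
    # CollectData sketch: read access records from the FIFO and append
    # them to a local log at once, so nothing is lost if Libra is down.
    # Rotation renames the log into the directory that ProcessData polls.
    use strict;
    use warnings;
    use POSIX qw(strftime mkfifo);

    my $fifo   = '/var/run/countuse.fifo';        # assumed path
    my $pickup = '/var/log/countuse/outgoing';    # polled from Libra

    mkfifo($fifo, 0660) unless -p $fifo;

    while (1) {
        # Blocks until an Apache thread opens the other end for writing.
        open my $in,  '<',  $fifo         or die "open FIFO: $!";
        open my $out, '>>', 'current.log' or die "open log: $!";
        my $opened = time;
        while (my $rec = <$in>) {
            print {$out} $rec;                # "ip\turl\tsession\ttime"
            if (time - $opened > 15 * 60) {   # start a new log ~15-minutely
                close $out;
                my $stamp = strftime('%Y%m%d-%H%M%S', localtime);
                rename 'current.log', "$pickup/access.$stamp.log";
                open $out, '>>', 'current.log' or die "reopen log: $!";
                $opened = time;
            }
        }
        close $in;    # EOF: every writer has closed; loop round and reopen
        close $out;
    }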

ProcessData: process on Libra

On a daily basis this process would copy the logs written by CollectData on Apollo, then sort them and store cumulative data in the WorkInProgress database. Monthly, it would perform the end-of-month counts and transfer authorized data to the Web databases. Additionally, it would respond at any time to requests to re-run tasks.

Many of these tasks are independent, and need to be runnable independently to provide additional flexibility. So the structure of this process might be a Control process that spawns several sub-processes.

Control Process

This would spawn the sub-processes at appropriate times. For ease of reference I'll name them CopyLog, ProcessLog, ProcessEndMonth and TransferCounts. Control might also respond to user requests, for example to restart the processing from the beginning of the month, to re-run from month N, or to transfer monthly data to Apollo. Responding to requests might be implemented over a socket connection, with a simple client that accepts and sends user requests (see "Process organisation" below); a sketch follows.
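
One possible shape for the Control process, using the sub-process names above; the request port, command names and sub-process paths are illustrative only.

    #!/usr/bin/perl
    # Control sketch: spawn the routine sub-processes, then accept
    # one-line requests (e.g. "rerun 2004-06") from a simple client.
    use strict;
    use warnings;
    use IO::Socket::INET;

    $SIG{CHLD} = 'IGNORE';    # don't leave zombies from finished children

    sub spawn {
        my @cmd = @_;
        my $pid = fork;
        die "fork: $!" unless defined $pid;
        if ($pid == 0) { exec @cmd or die "exec @cmd: $!" }
        return $pid;
    }

    spawn('./CopyLog');       # keep copying new logs from Apollo
    spawn('./ProcessLog');    # sort and count the current month

    my $server = IO::Socket::INET->new(
        LocalPort => 7070,    # assumed request port
        Listen    => 5,
        ReuseAddr => 1,
    ) or die "listen: $!";

    while (my $client = $server->accept) {
        my $request = <$client>;
        close $client;
        next unless defined $request;
        if ($request =~ /^rerun\s+(\S+)/) {
            # Independent re-run of month $1 (see "Process organisation").
            spawn('./ProcessLog',     '--month', $1);
            spawn('./TransferCounts', '--month', $1);
        }
        elsif ($request =~ /^endmonth/) {
            spawn('./ProcessEndMonth');
        }
    }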

CopyLog - this would check a log directory on Apollo and copy any log files that CollectData had written, concatenating the ones it copies with the existing logs for the day. [The logs would be kept so that statistics can be reconstructed when needed.]

ProcessLog - this would sort a log and store cumulative totals per Customer per Document in the WorkInProgress database. [Only the original unsorted log would be retained to allow counts that require the original ordering (e.g. sessions) to be reconstructed when necessary.]

Sorting the log would minimize database accesses to the hard disk and, therefore, reduce processing time. Since the log is sorted before counting anyway, there is no need to collect the data from Apollo in real time. A sketch of this sort-then-count core follows.
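
The core of ProcessLog might then look like the sketch below. The counts table, its columns, and the use of DBI with SQLite as a stand-in are all assumptions; any standard SQL database would do.

    #!/usr/bin/perl
    # ProcessLog sketch: sort the log by Customer then Document so that
    # each (Customer, Document) total is touched exactly once, then add
    # the totals into the WorkInProgress database.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=wip.db', '', '',
                           { RaiseError => 1, AutoCommit => 0 });
    my $add = $dbh->prepare(
        'UPDATE counts SET n = n + ? WHERE customer = ? AND doc = ?');
    my $ins = $dbh->prepare(
        'INSERT INTO counts (customer, doc, n) VALUES (?, ?, ?)');

    # An external sort keeps memory use flat however large the log is.
    open my $sorted, '-|', 'sort', '-t', "\t", '-k1,1', '-k2,2', $ARGV[0]
        or die "sort: $!";

    my ($key, $n) = ('', 0);
    my $flush = sub {
        return unless $key;
        my ($customer, $doc) = split /\t/, $key;
        # Add to an existing total; insert a new row if there is none.
        $ins->execute($customer, $doc, $n)
            unless $add->execute($n, $customer, $doc) > 0;
    };

    while (my $line = <$sorted>) {
        chomp $line;
        my ($customer, $doc) = split /\t/, $line;    # assumed layout
        next unless defined $doc;
        my $k = "$customer\t$doc";
        if ($k ne $key) { $flush->(); ($key, $n) = ($k, 0) }
        $n++;
    }
    $flush->();
    $dbh->commit;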

To perform all the necessary counts, ProcessLog also needs to be able to find the other Documents in the hierarchy of Documents, and the other Customer Licenses that contain the IP address in a web access. This data might be provided by one of the following alternatives:

- The Apache thread might extract this data, if time allowed, from the Apollo database and write it to the FIFO at the same time as the other data. It is not known whether this is feasible.
- The WorkInProgress database might itself hold up-to-date data on Customer Licenses, their IP address ranges, and Document hierarchies. This would need to be kept up to date. Additionally, each Customer IP entry would need a start date and, when no longer valid, an end date. This is needed so that, if there is a program error and the statistics for a month have to be re-run, the database snapshot for that month is available - i.e. if an IP address is no longer present, an access might have to be treated differently, as a Guest for example. Start and end dates would not be needed for the first alternative, since the list of Customers would be stored in the log for each month. (A sketch of this representation follows.)
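
For the second alternative, the date-ranged Customer IP data might be represented along the following lines. The table and column names are made up, and the IP ranges are stored as packed integers purely so that a range comparison works; this is a sketch, not a schema proposal.

    #!/usr/bin/perl
    # Sketch of date-ranged Customer IP entries in WorkInProgress.
    # A NULL end_date means the entry is still valid; re-running an old
    # month selects only the rows that were valid on dates in that month.
    use strict;
    use warnings;
    use DBI;
    use Socket qw(inet_aton);

    my $dbh = DBI->connect('dbi:SQLite:dbname=wip.db', '', '',
                           { RaiseError => 1 });
    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS customer_ip (
            customer   TEXT    NOT NULL,
            ip_low     INTEGER NOT NULL,  -- range start, packed as integer
            ip_high    INTEGER NOT NULL,  -- range end
            start_date TEXT    NOT NULL,  -- when the entry became valid
            end_date   TEXT               -- NULL while still valid
        )
    });

    # "Which Customers' licenses contained this IP on this date?"
    my $who = $dbh->prepare(q{
        SELECT customer FROM customer_ip
        WHERE  ? BETWEEN ip_low AND ip_high
        AND    start_date <= ?
        AND    (end_date IS NULL OR end_date > ?)
    });
    my $ip = unpack 'N', inet_aton('10.0.3.17');    # example access
    $who->execute($ip, '2004-06-15', '2004-06-15');
    print "$_->[0]\n" for @{ $who->fetchall_arrayref };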

ProcessEndMonth - at the end of the month this sub-process would perform the monthly counts (e.g. Top-100_IPs_FullTxtAccess, Top-10_CustomerDenieds).
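
With the cumulative totals already in the database, a monthly count such as Top-100_IPs_FullTxtAccess reduces to one grouped, ordered query rather than specialized counting code. The names below are assumptions; for this count, the counts table is assumed to also record the IP address and the type of access.

    #!/usr/bin/perl
    # ProcessEndMonth sketch: a Top-100 count becomes a single query.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=wip.db', '', '',
                           { RaiseError => 1 });
    my $top = $dbh->prepare(q{
        SELECT ip, SUM(n) AS total
        FROM   counts
        WHERE  access_type = 'fulltext'    -- assumed column
        GROUP  BY ip
        ORDER  BY total DESC
        LIMIT  100
    });
    $top->execute;
    while (my ($ip, $total) = $top->fetchrow_array) {
        print "$ip\t$total\n";
    }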

TransferCounts - this would transfer the latest monthly statistics from WorkInProgress to the Web databases on Apollo and Libra for the display of webpages for customers and for internal use.
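
TransferCounts might then be little more than a copy of the authorized month's rows into a Web database, assuming the counts rows carry a month column. SQLite's ATTACH is used below purely for illustration; the copy to the Web database on Apollo would go over the network.

    #!/usr/bin/perl
    # TransferCounts sketch: copy one authorized month's totals from
    # WorkInProgress into the Web database in a single statement.
    use strict;
    use warnings;
    use DBI;

    my $month = $ARGV[0];    # e.g. "2004-06"
    my $dbh = DBI->connect('dbi:SQLite:dbname=wip.db', '', '',
                           { RaiseError => 1 });
    $dbh->do(q{ATTACH DATABASE 'web.db' AS web});
    $dbh->do(q{
        INSERT INTO web.monthly_counts (month, customer, doc, n)
        SELECT month, customer, doc, n FROM counts WHERE month = ?
    }, undef, $month);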

Process organisation

When a request is made to re-run the current month (if, for example, there had been some error), the Control process would need to reset the CountUse_lib_wip database and restart ProcessLog on the logs from the start of the month, leaving CopyLog to continue copying new logs from Apollo.

Isolating the copying of the logs, and combining the sorting and counting functions, means that there is no need to store sorted logs, and the two independent tasks - copying the new logs, and re-processing unsorted logs that have already been copied - can easily be continued in parallel.

To re-process previous months, Control would run another ProcessLog on the appropriate month's log directory, and use an independent section of the database to store the results. An independent TransferCounts sub-process would also be started, to transfer the data to the webpage display databases when authorized.

Webpage presentation

As indicated, it has already been proposed that webpage presentation should be via Java Cocoon instead of the current Perl scripts.
Under CountUse the historical data would be stored in databases on Apollo and Libra, so the procedures for computing data when a webpage is accessed would use standard database selection and sorting to produce their statistics on the fly.
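
For example, a Customer's usage page might be produced at view time by an ordinary ordered query, instead of from a per-page Berkeley DB file. The sketch keeps to Perl for consistency with the other examples (under the proposal the same query would be issued from Java Cocoon), and the schema is the one assumed above.

    #!/usr/bin/perl
    # Webpage sketch: statistics computed on the fly by a standard
    # selection-and-sort query over the Web database.
    use strict;
    use warnings;
    use DBI;

    my ($customer, $month) = @ARGV;
    my $dbh = DBI->connect('dbi:SQLite:dbname=web.db', '', '',
                           { RaiseError => 1 });
    my $rows = $dbh->selectall_arrayref(q{
        SELECT doc, n FROM monthly_counts
        WHERE  customer = ? AND month = ?
        ORDER  BY n DESC
    }, undef, $customer, $month);
    printf "%-40s %8d\n", @$_ for @$rows;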

Comparison

Collecting data directly from the Apache threads (see above)

- Eliminates the reading and parsing of the Weblog file.
- Eliminates the daily processing time for weblog parsing.
- Eliminates the specialized parsing code and, therefore, its maintenance and further development.
- Eliminates the time to do database dumps.
- Eliminates the need to read in database files and build data structures from them.
- Eliminates the processing time for reading database dumps and building data structures in the current system (also faster than the alternative of storing the data in the WorkInProgress database each month).
- Saves a lot of the duplication in the database dumps each month, though less than if the data were stored in the WorkInProgress database.

Storage of Document and Customer Data (see above)

- More sophisticated indexes come as standard:
  - Eliminates the specialized code to build indexes (e.g. for Customers per Product) to search files and, therefore, its maintenance.
  - Eliminates the daily processing time for constructing indexes for Customers and Documents.
- Automatic addition of new Documents/Customers, and maybe Products:
  - Reduced administration time.
- Easier to add procedures to do cumulative totals in a database:
  - Reduced development time.
- Speed of updating counts:
  - This should be similar, since the data in both systems would be sorted into Customer/Document order, so the most that needs processing at any one time should fit in memory. Where it is necessary to read and write to disk, the database may be slower than specialized code.

Storage of Document and Customer Data (here, or passed by the Apache thread - see above)

- Provides standard facilities for selecting and sorting data.
- Eliminates the need to write specialized code that requires maintenance.
- Ease of modifying/adding webpages, due in part to the general data format from which to extract the data. Although the Berkeley DB files could be reorganized in a more general way, taking advantage of the better organization would require implementing the selection and sorting facilities that come as standard with a database:
  - Reduced development time.

The Control process and the modular sub-processes would provide useful facilities that reduce administration time, such as re-running the program to redo counts. A control process could, however, be implemented in the current system, with the same advantages.

With the proposed changes, both systems are probably equal in their ability to deal with memory limitations. If, with future increases in webpage access, all of a Customer's Counts for each Document could not easily be held in memory at once, then, in both the existing and the alternative system, sorting Documents within each Customer would reduce unnecessary disk accesses and, therefore, processing time.

With the proposed changes to the existing system, the two systems are probably not dissimilar in disk usage.

Summary

There is a lot that can be done with the existing system, but without gaining the simplicity of the new one, and thereby its accessibility to others, or its ease of maintenance.

Where improvements could be made to achieve that simplicity, they would in effect amount to making the more radical changes of the new system, at a higher development cost. An evolutionary approach to change might be advantageous if there were immediate needs for improvement but, at present, the foreseeable needs for memory, disk space, and processing speed would seem to be met.

The two are probably equally scalable. For example, a log can be sorted into both Customer and Document order to reduce how much memory is required to process a webpage access, and both systems could spawn a process per Product to run simultaneously on separate CPUs if web usage increased significantly.

The old system is probably not as extensible. The modularity of the processes in CountUse, and its use of more standard components such as databases and Java Cocoon, mean that many aspects of development can be done independently, and by people with a wider range of development skills.