Some Thoughts on the Design of a Usage Reporting System

This document was started with the aim of generating more ways in which the current usage reporting system, which is run on a daily basis, could be improved. The main focus was on changes that might reduce processing time, memory usage, and disk usage in ways that can scale to cope with future increases in load.

Ideas were generated partly by examining the system and its problems, and partly by thinking of a design to meet the requirements without the constraints of the existing system.

In the course of doing this, many simpler ways of gaining the same benefits emerged, which are more radical in their effect on the system - it would probably be simpler to rewrite it. The simpler methods also make the system easier to develop, maintain, and extend to incorporate new features, which is hard to achieve with the existing system even with improvements. So I also include a sketch of the more radical changes.

The simpler ways are not new ideas, but are perhaps put together a little differently. Others with more detailed knowledge of specific areas may have ideas on whether the approach is feasible.

Below, thoughts on the existing system are followed by the sketch of the more radical changes, and then a comparison section.

Existing System: UsageReport

Processing time
Processing times of around 60 hours a month have already been reduced by distributing the work over the month.

The current estimated times to produce the webpage access statistics are:

    End of Month:   5 hrs
    Daily:          4 hrs

At the end of the month the main activities are to concatenate the daily files into one file, sort and add all the Counts, perform end-of-month counts such as Top-100_FullTxtAccess, and copy the results to files in a format for webpage display. The breakdown of daily activities is below.


Daily Breakdown

    Activity                                                      Hrs
    Database dumps                                                0.5
    Reading in large Document file / creating data structures
      for each Product                                            1.5
    Reading/parsing Weblog file for each Product                  1.0
    Reading in Customer files per Product, and building index     1.0
    Counts                                                        0.5
    Daily Total                                                   4.5

Possible Changes:

- This processing time can be reduced to a few minutes by reorganizing the top level of the program to process all Products in one pass over the weblog (see the sketch below).
- It is not easy to estimate the processing time saved, since the data also has to be sorted.
- It is not known how easy the program would be to modify, nor how much benefit would be gained.
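
As a rough illustration of this reorganization, the sketch below counts all Products in a single pass, in the style of the existing Perl. The tab-separated line layout is an assumption for illustration, not the real weblog format.

    #!/usr/bin/perl
    # Sketch: count ALL Products in one pass over the day's weblog,
    # instead of re-reading and re-parsing the log once per Product.
    use strict;
    use warnings;

    my %counts;    # $counts{$product}{$customer}{$doc} = accesses
    while (my $line = <>) {
        chomp $line;
        # Assumed layout: product, customer, document id (tab-separated).
        my ($product, $customer, $doc) = split /\t/, $line;
        next unless defined $doc;
        $counts{$product}{$customer}{$doc}++;
    }

    # One sorted dump per Product at the end of the single pass.
    for my $p (sort keys %counts) {
        for my $c (sort keys %{ $counts{$p} }) {
            print join("\t", $p, $c, $_, $counts{$p}{$c}{$_}), "\n"
                for sort keys %{ $counts{$p}{$c} };
        }
    }
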
Memory Usage

Currently, with the features introduced under 'Daily Processing', only 2GB of memory is used when running the program (reduced from 4-13GB). Now only the Customers and Documents for the day are held in memory. Most of the remaining daily usage is still the UsageStats data structures used to keep the counts (per Document per Product, per Customer/IPCustomer).

Possible Change:

After processing each Customer, save its counts to a file and release the memory. To do this quickly, the weblog needs to be sorted by IPCustomer and Customer within each Product (see Product ordering above); a sketch follows.
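
A minimal sketch of the idea, assuming the day's weblog has already been sorted so that each Customer's lines are contiguous (the two-field line layout is made up for illustration):

    #!/usr/bin/perl
    # Sketch: hold only ONE Customer's counts in memory at a time.
    # When the Customer changes in the sorted log, the counts are
    # written out and the memory released before the next Customer.
    use strict;
    use warnings;

    my ($current, %counts);
    while (my $line = <>) {        # sorted weblog on stdin
        chomp $line;
        my ($customer, $doc) = split /\t/, $line;    # assumed layout
        next unless defined $doc;
        if (defined $current && $customer ne $current) {
            flush_customer($current, \%counts);
            %counts = ();          # release this Customer's memory
        }
        $current = $customer;
        $counts{$doc}++;
    }
    flush_customer($current, \%counts) if defined $current;

    sub flush_customer {
        my ($customer, $counts) = @_;
        open my $out, '>>', "counts.$customer" or die "open: $!";
        print {$out} "$_\t$counts->{$_}\n" for sort keys %$counts;
        close $out;
    }
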
Disk Usage

Estimated Monthly Disk Use

    Type of Use                                                   GB
    Raw Data: Weblogs, Database dumps, Misc Logs                  6.33
    Processed Data: CustomerReports/ArticleReports Berkeley DBs   4.6
    Total                                                         10.93

There is a lot of duplication in the data stored, both within every Berkeley DB file each month and between months: for example, the same Customers appear in every Product (i.e. in each Berkeley DB), and Document titles are repeated.

Possible Change:

- Store shared data, such as Customers and Document titles, once, rather than repeating it in every Product's Berkeley DB each month.

Webpage Presentation

Currently webpages are generated by Perl scripts. It has already been proposed elsewhere that these should be replaced by Java Cocoon, which would bring them into line with the rest of the website. The Perl scripts currently access precomputed data in Berkeley DB files, stored in a format specific to each webpage.

The end-of-month data is not kept in a general format, only in the forms for specific webpages. If the program has to be re-run, therefore, the general data has to be recomputed. So, for example, if end-of-month data were needed for a new statistic for the last 3 years, at current times it would take 7.5 days to compute (5 hrs * 12 months * 3 years = 180 hrs = 7.5 days).

Possible Changes:

- Keep the end-of-month data in a general format, so that new statistics can be computed from it without re-running the whole program.
- Replace the Perl scripts with Java Cocoon, as already proposed.

Sketch of Alternative System: CountUse

There are two main features of CountUse. One is to replace the reading and parsing of the weblogs by getting the data directly from the Apache threads. The second is to replace the existing data representation and storage with a standard database.

Overview

CountUse would consist of two main processes. One, to be run on Apollo (call it CollectData), would log data from the web server. The other (call it ProcessData), to be run on Libra, would, among other tasks, process the data collected on Apollo and store it in standard databases. Interim counts would be stored in a 'WorkInProgress' database on Libra, which might also hold up-to-date Customer and Document data. Monthly historical data for webpage display would be stored in dedicated 'Web' databases on Apollo and on Libra.

CollectData: process on Apollo

This process would read data from the Apache threads in real time via a FIFO, and write the data to a log. Periodically during the day, a new log would be started and the old one copied to a directory where ProcessData on Libra would look for logs to copy.

Writing the data immediately to a log on Apollo is necessary because, if Libra is down and the data cannot be sent, this process can neither stop reading from the FIFO nor lose the data it has already read.

The data needed is already used by an Apache thread to check a Customer license for a Document, and would just need to be written to the FIFO. Minimally, this data would include the Customer IP address, webpage URL and session id; additionally, the date/time on Apollo would be needed. Getting this data directly eliminates the current weblog reading and parsing.

The FIFO would probably be needed for the speed gained from reads/writes being done in memory rather than on disk, so that other Apache threads are not held up. [Unknown: the time to open/close the FIFO per thread.]

[Apache thread modifications: open the FIFO, write a line of data to it, and close it.]
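
A sketch of the CollectData loop follows. The FIFO path, log names, pickup directory, and rotation interval are all assumptions; the record layout is simply whatever the Apache threads write (minimally IP address, URL, session id and date/time).

    #!/usr/bin/perl
    # CollectData sketch: read access records from the FIFO and append
    # them to a local log at once, so nothing is lost if Libra is down.
    # Rotation renames the log into the directory that ProcessData polls.
    use strict;
    use warnings;
    use POSIX qw(strftime mkfifo);

    my $fifo   = '/var/run/countuse.fifo';        # assumed path
    my $pickup = '/var/log/countuse/outgoing';    # polled from Libra

    mkfifo($fifo, 0660) unless -p $fifo;

    while (1) {
        # Blocks until an Apache thread opens the other end for writing.
        open my $in,  '<',  $fifo         or die "open FIFO: $!";
        open my $out, '>>', 'current.log' or die "open log: $!";
        my $opened = time;
        while (my $rec = <$in>) {
            print {$out} $rec;                # "ip\turl\tsession\ttime"
            if (time - $opened > 15 * 60) {   # start a new log ~15-minutely
                close $out;
                my $stamp = strftime('%Y%m%d-%H%M%S', localtime);
                rename 'current.log', "$pickup/access.$stamp.log";
                open $out, '>>', 'current.log' or die "reopen log: $!";
                $opened = time;
            }
        }
        close $in;    # EOF: every writer has closed; loop round and reopen
        close $out;
    }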

ProcessData: process on Libra

On a daily basis this process would copy the logs written by CollectData on Apollo, then sort them and store cumulative data in the WorkInProgress database. Monthly, it would perform the end-of-month counts and transfer authorized data to the Web databases. Additionally, it would respond at any time to requests to re-run tasks.

Many of these tasks are independent, and need to be runnable independently to provide additional flexibility. So the structure of this process might be a Control process that spawns several sub-processes.

Control Process

This would spawn the sub-processes at appropriate times. For ease of reference I'll name them CopyLog, ProcessLog, ProcessEndMonth and TransferCounts. Control might also respond to user requests, for example to restart the processing from the beginning of the month, to re-run from month N, or to transfer monthly data to Apollo. Responding to requests might be implemented over a socket connection, with a simple client that accepts and sends user requests (see "Process organisation" below); a sketch follows.
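
One possible shape for the Control process, using the sub-process names above; the request port, command names and sub-process paths are illustrative only.

    #!/usr/bin/perl
    # Control sketch: spawn the routine sub-processes, then accept
    # one-line requests (e.g. "rerun 2004-06") from a simple client.
    use strict;
    use warnings;
    use IO::Socket::INET;

    $SIG{CHLD} = 'IGNORE';    # don't leave zombies from finished children

    sub spawn {
        my @cmd = @_;
        my $pid = fork;
        die "fork: $!" unless defined $pid;
        if ($pid == 0) { exec @cmd or die "exec @cmd: $!" }
        return $pid;
    }

    spawn('./CopyLog');       # keep copying new logs from Apollo
    spawn('./ProcessLog');    # sort and count the current month

    my $server = IO::Socket::INET->new(
        LocalPort => 7070,    # assumed request port
        Listen    => 5,
        ReuseAddr => 1,
    ) or die "listen: $!";

    while (my $client = $server->accept) {
        my $request = <$client>;
        close $client;
        next unless defined $request;
        if ($request =~ /^rerun\s+(\S+)/) {
            # Independent re-run of month $1 (see "Process organisation").
            spawn('./ProcessLog',     '--month', $1);
            spawn('./TransferCounts', '--month', $1);
        }
        elsif ($request =~ /^endmonth/) {
            spawn('./ProcessEndMonth');
        }
    }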

CopyLog - this would check a log directory on Apollo and copy any log files that CollectData had written, concatenating the ones it copies with the existing logs for the day. [The logs would be kept so that statistics can be reconstructed when needed.]

ProcessLog - this would sort a log and store cumulative totals per Customer per Document in the WorkInProgress database. [Only the original unsorted log would be retained to allow counts that require the original ordering (e.g. sessions) to be reconstructed when necessary.]

Sorting the log would minimize database accesses to the hard disk and, therefore, reduce processing time. Since the log is sorted before counting anyway, there is no need to collect the data from Apollo in real time. A sketch of this sort-then-count core follows.
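
The core of ProcessLog might then look like the sketch below. The counts table, its columns, and the use of DBI with SQLite as a stand-in are all assumptions; any standard SQL database would do.

    #!/usr/bin/perl
    # ProcessLog sketch: sort the log by Customer then Document so that
    # each (Customer, Document) total is touched exactly once, then add
    # the totals into the WorkInProgress database.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=wip.db', '', '',
                           { RaiseError => 1, AutoCommit => 0 });
    my $add = $dbh->prepare(
        'UPDATE counts SET n = n + ? WHERE customer = ? AND doc = ?');
    my $ins = $dbh->prepare(
        'INSERT INTO counts (customer, doc, n) VALUES (?, ?, ?)');

    # An external sort keeps memory use flat however large the log is.
    open my $sorted, '-|', 'sort', '-t', "\t", '-k1,1', '-k2,2', $ARGV[0]
        or die "sort: $!";

    my ($key, $n) = ('', 0);
    my $flush = sub {
        return unless $key;
        my ($customer, $doc) = split /\t/, $key;
        # Add to an existing total; insert a new row if there is none.
        $ins->execute($customer, $doc, $n)
            unless $add->execute($n, $customer, $doc) > 0;
    };

    while (my $line = <$sorted>) {
        chomp $line;
        my ($customer, $doc) = split /\t/, $line;    # assumed layout
        next unless defined $doc;
        my $k = "$customer\t$doc";
        if ($k ne $key) { $flush->(); ($key, $n) = ($k, 0) }
        $n++;
    }
    $flush->();
    $dbh->commit;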

To perform all the necessary counts, ProcessLog also needs to be able to find the other Documents in the hierarchy of Documents, and the other Customer Licenses that contain the IP address in a web access. This data might be provided by one of the following alternatives:

- The Apache thread might extract this data, if time allowed, from the Apollo database and write it to the FIFO at the same time as the other data. It is not known whether this is feasible.
- The WorkInProgress database might itself hold up-to-date data on Customer Licenses, their IP address ranges, and Document hierarchies. This would need to be kept up to date. Additionally, each Customer IP entry would need a start date and, when no longer valid, an end date. This is needed so that, if there is a program error and the statistics for a month have to be re-run, the database snapshot for that month is available - i.e. if an IP address is no longer present, an access might have to be treated differently, as a Guest for example. Start and end dates would not be needed for the first alternative, since the list of Customers would be stored in the log for each month. (A sketch of this representation follows.)
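
For the second alternative, the date-ranged Customer IP data might be represented along the following lines. The table and column names are made up, and the IP ranges are stored as packed integers purely so that a range comparison works; this is a sketch, not a schema proposal.

    #!/usr/bin/perl
    # Sketch of date-ranged Customer IP entries in WorkInProgress.
    # A NULL end_date means the entry is still valid; re-running an old
    # month selects only the rows that were valid on dates in that month.
    use strict;
    use warnings;
    use DBI;
    use Socket qw(inet_aton);

    my $dbh = DBI->connect('dbi:SQLite:dbname=wip.db', '', '',
                           { RaiseError => 1 });
    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS customer_ip (
            customer   TEXT    NOT NULL,
            ip_low     INTEGER NOT NULL,  -- range start, packed as integer
            ip_high    INTEGER NOT NULL,  -- range end
            start_date TEXT    NOT NULL,  -- when the entry became valid
            end_date   TEXT               -- NULL while still valid
        )
    });

    # "Which Customers' licenses contained this IP on this date?"
    my $who = $dbh->prepare(q{
        SELECT customer FROM customer_ip
        WHERE  ? BETWEEN ip_low AND ip_high
        AND    start_date <= ?
        AND    (end_date IS NULL OR end_date > ?)
    });
    my $ip = unpack 'N', inet_aton('10.0.3.17');    # example access
    $who->execute($ip, '2004-06-15', '2004-06-15');
    print "$_->[0]\n" for @{ $who->fetchall_arrayref };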

ProcessEndMonth - at the end of the month this sub-process would perform the monthly counts (e.g. Top-100_IPs_FullTxtAccess, Top-10_CustomerDenieds).
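
With the cumulative totals already in the database, a monthly count such as Top-100_IPs_FullTxtAccess reduces to one grouped, ordered query rather than specialized counting code. The names below are assumptions; for this count, the counts table is assumed to also record the IP address and the type of access.

    #!/usr/bin/perl
    # ProcessEndMonth sketch: a Top-100 count becomes a single query.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=wip.db', '', '',
                           { RaiseError => 1 });
    my $top = $dbh->prepare(q{
        SELECT ip, SUM(n) AS total
        FROM   counts
        WHERE  access_type = 'fulltext'    -- assumed column
        GROUP  BY ip
        ORDER  BY total DESC
        LIMIT  100
    });
    $top->execute;
    while (my ($ip, $total) = $top->fetchrow_array) {
        print "$ip\t$total\n";
    }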

TransferCounts - this would transfer the latest monthly statistics from WorkInProgress to the Web databases on Apollo and Libra for the display of webpages for customers and for internal use.
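
TransferCounts might then be little more than a copy of the authorized month's rows into a Web database, assuming the counts rows carry a month column. SQLite's ATTACH is used below purely for illustration; the copy to the Web database on Apollo would go over the network.

    #!/usr/bin/perl
    # TransferCounts sketch: copy one authorized month's totals from
    # WorkInProgress into the Web database in a single statement.
    use strict;
    use warnings;
    use DBI;

    my $month = $ARGV[0];    # e.g. "2004-06"
    my $dbh = DBI->connect('dbi:SQLite:dbname=wip.db', '', '',
                           { RaiseError => 1 });
    $dbh->do(q{ATTACH DATABASE 'web.db' AS web});
    $dbh->do(q{
        INSERT INTO web.monthly_counts (month, customer, doc, n)
        SELECT month, customer, doc, n FROM counts WHERE month = ?
    }, undef, $month);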

Process organisation

When a request is made to re-run the current month (if, for example, there had been some error), the Control process would need to reset the CountUse_lib_wip database and restart ProcessLog on the logs from the start of the month, leaving CopyLog to continue copying new logs from Apollo.

Isolating the copying of the logs, and combining the sorting and counting functions, means that there is no need to store sorted logs, and the two independent tasks - copying the new logs, and re-processing unsorted logs that have already been copied - can easily be continued in parallel.

To re-process previous months, Control would run another ProcessLog on the appropriate month's log directory, and use an independent section of the database to store the results. An independent TransferCounts sub-process would also be started, to transfer the data to the webpage display databases when authorized.

Webpage presentation

As indicated, it has already been proposed that webpage presentation should be via Java Cocoon instead of the current Perl scripts.
Under CountUse the historical data would be stored in databases on Apollo and Libra, so the procedures for computing data when a webpage is accessed would use standard database selection and sorting to produce their statistics on the fly.
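
For example, a Customer's usage page might be produced at view time by an ordinary ordered query, instead of from a per-page Berkeley DB file. The sketch keeps to Perl for consistency with the other examples (under the proposal the same query would be issued from Java Cocoon), and the schema is the one assumed above.

    #!/usr/bin/perl
    # Webpage sketch: statistics computed on the fly by a standard
    # selection-and-sort query over the Web database.
    use strict;
    use warnings;
    use DBI;

    my ($customer, $month) = @ARGV;
    my $dbh = DBI->connect('dbi:SQLite:dbname=web.db', '', '',
                           { RaiseError => 1 });
    my $rows = $dbh->selectall_arrayref(q{
        SELECT doc, n FROM monthly_counts
        WHERE  customer = ? AND month = ?
        ORDER  BY n DESC
    }, undef, $customer, $month);
    printf "%-40s %8d\n", @$_ for @$rows;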

Comparison

Collecting data directly from the Apache threads (see above)

- Eliminates the reading and parsing of the Weblog file.
- Eliminates the daily processing time for weblog parsing.
- Eliminates the specialized parsing code and, therefore, its maintenance and further development.
- Eliminates the time to do database dumps.
- Eliminates the need to read in database files and build data structures from them.
- Eliminates the processing time for reading database dumps and building data structures in the current system (also faster than the alternative of storing the data in the WorkInProgress database each month).
- Saves a lot of the duplication in the database dumps each month, though less than if the data were stored in the WorkInProgress database.

Storage of Document and Customer Data (see above)

- More sophisticated indexes come as standard:
  - Eliminates the specialized code to build indexes (e.g. for Customers per Product) to search files and, therefore, its maintenance.
  - Eliminates the daily processing time for constructing indexes for Customers and Documents.
- Automatic addition of new Documents/Customers, and maybe Products:
  - Reduced administration time.
- Easier to add procedures to do cumulative totals in a database:
  - Reduced development time.
- Speed of updating counts:
  - This should be similar, since the data in both systems would be sorted into Customer/Document order, so the most that needs processing at any one time should fit in memory. Where it is necessary to read and write to disk, the database may be slower than specialized code.

Storage of Document and Customer Data (here, or passed by the Apache thread - see above)

- Provides standard facilities for selecting and sorting data.
- Eliminates the need to write specialized code that requires maintenance.
- Ease of modifying/adding webpages, due in part to the general data format from which to extract the data. Although the Berkeley DB files could be reorganized in a more general way, taking advantage of the better organization would require implementing the selection and sorting facilities that come as standard with a database:
  - Reduced development time.

The Control process and the modular sub-processes would provide useful facilities that reduce administration time, such as re-running the program to redo counts. A control process could, however, be implemented in the current system, with the same advantages.

With the proposed changes, both systems are probably equal in their ability to deal with memory limitations. If, with future increases in webpage access, all of a Customer's Counts for each Document could not easily be held in memory at once, then, in both the existing and the alternative system, sorting Documents within each Customer would reduce unnecessary disk accesses and, therefore, processing time.

With the proposed changes to the existing system, the two systems are probably not dissimilar in disk usage.

Summary

There is a lot that can be done with the existing system, but without gaining the simplicity of the new one, and thereby its accessibility to others, or its ease of maintenance.

Where improvements could be made to achieve that simplicity, they would in effect amount to making the more radical changes of the new system, at a higher development cost. An evolutionary approach to change might be advantageous if there were immediate needs for improvement but, at present, the foreseeable needs for memory, disk space, and processing speed would seem to be met.

The two are probably equally scalable. For example, a log can be sorted into both Customer and Document order to reduce how much memory is required to process a webpage access, and both systems could spawn a process per Product to run simultaneously on separate CPUs if web usage increased significantly.

The old system is probably not as extensible. The modularity of the processes in CountUse, and its use of more standard components such as databases and Java Cocoon, mean that many aspects of development can be done independently, and by people with a wider range of development skills.