Ideas were generated partly by examining the existing system and its problems, and partly by thinking about a design that would meet the same needs without the constraints of the existing system. In the course of doing this, many simpler ways of gaining the same benefits emerged. These are more radical in their effect on the system - it would probably be quicker to rewrite it than to retrofit them. The simpler methods would also make the system easier to develop, maintain and extend with new features, which is hard to achieve with the existing system even with improvements. So I also include a sketch of the more radical changes.
The simpler methods are not new ideas, though they are perhaps put together a little differently, and others with more detailed knowledge of specific areas will have a better sense of whether the approach is feasible.
Below, thoughts on the existing system are followed by the sketch of the more radical changes, and then a comparison of the two.
Possible changes to UsageReport are outlined below in terms of the benefits they provide in four areas: processing time, memory use, disk use, and webpage presentation.
Activity | Hrs |
Database dumps | 0.5 |
Reading in large Document file / creating data structures for each Product | 1.5 |
Reading/parsing Weblog file for each Product | 1.0 |
Reading in Customer files per Product, and building index | 1.0 |
Counts | 0.5 |
Daily Total | 4.5 |
Possible Changes:
Currently, with the features introduced under 'Daily Processing', only 2GB of memory is used when running the program (reduced from 4GB-13GB), since only the Customers and Documents for the day are now held in memory. Most of the remaining daily memory use is the UsageStats data structures used to keep counts (per Document per Customer/IPCustomer).
Possible Change:
Type of Use | GB |
Raw Data: Weblogs, Database dumps, Misc Logs | 6 |
Processed Data: CustomerReports/ArticleReports Berkeley DBs | 5 |
Total | 11 |
There is a lot of duplication in the data stored in the Berkeley DB files, both within each month and between months. Some duplication has been removed, but a lot remains: for example, Document titles and Customers are repeated in every Product (i.e. in each Berkeley DB).
Possible Change:
It has already been proposed by others that Java Cocoon should replace the Perl scripts that currently generate the webpages. The Perl scripts access precomputed end-of-month data in Berkeley DB files, stored in a format specific to each webpage. Because the data is scattered across webpage-specific files it is hard to access, so it has to be re-computed whenever a new or modified statistic needs to be derived. This can be lengthy: for example, if a new statistic were needed for the last 3 years, it would, on current times, take 7.5 days to recompute (5 hrs x 12 months x 3 years = 180 hrs = 7.5 days).
Possible Changes:
There are two main features to CountUse. One is to replace the reading and parsing of the weblogs by getting data directly from the Apache threads. The second is to replace the existing data representation and storage with a standard database.
CountUse would consist of two main processes. One, to be run on Apollo - call it CollectData - would log data from the web server. The other - ProcessData - to be run on Libra, would, among other tasks, process the data collected on Apollo and store it in standard databases. Interim counts would be stored in a 'WorkInProgress' database on Libra, which might also hold up-to-date Customer and Document data. Monthly historical data for webpage display would be stored in dedicated 'Web' databases on Apollo and on Libra.
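To make the data side concrete, a minimal sketch of what the WorkInProgress database might contain follows, using SQLite only for self-containment; the real system could use any standard database, and all table and column names here are hypothetical, not decisions.

```python
import sqlite3

# Minimal sketch of a possible WorkInProgress schema (all names hypothetical).
conn = sqlite3.connect("work_in_progress.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS usage_counts (
    month       TEXT NOT NULL,            -- e.g. '2004-06'
    customer_id TEXT NOT NULL,
    document_id TEXT NOT NULL,
    access_type TEXT NOT NULL,            -- e.g. 'fulltext', 'denied'
    count       INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (month, customer_id, document_id, access_type)
);
-- Up-to-date Customer and Document data might also be held here.
CREATE TABLE IF NOT EXISTS customers (
    customer_id TEXT PRIMARY KEY,
    name        TEXT
);
CREATE TABLE IF NOT EXISTS documents (
    document_id TEXT PRIMARY KEY,
    title       TEXT,
    parent_id   TEXT                      -- position in the Document hierarchy
);
""")
conn.commit()
```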
This process would read
data from Apache threads in real-time via a FIFO, and write the data to
a log. Periodically during a day, a new log would be started and the
old moved to a directory where ProcessData
on Libra would
look for logs to copy.
The data needed is already used by an Apache thread to check a Customer license for a Document, and would just need to be written to the FIFO. Minimally this data would include the Customer IP address, Document ID/TypeOfAccess, and Session ID. The date/time on Apollo would also be needed. Getting this data directly eliminates the current reading and parsing of the Weblogs.
The FIFO would probably be needed for the speed gained from reads/writes being done in memory rather than on disk, so that other Apache threads are not held up. [Unknown: the time to open/close the FIFO per thread.]
[Apache Thread Modifications: open the FIFO, write a line of data to it, and close the FIFO.]
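A minimal sketch of CollectData follows, assuming a named pipe at /var/run/countuse.fifo, a spool directory on Apollo, and hourly log rotation; the paths, the rotation interval, and the one-line-per-access format are all assumptions.

```python
import os
import time

FIFO_PATH = "/var/run/countuse.fifo"   # assumed pipe shared with the Apache threads
LOG_DIR = "/var/log/countuse"          # assumed spool directory that CopyLog fetches from
ROTATE_SECS = 3600                     # assumed rotation interval

def collect():
    os.makedirs(LOG_DIR, exist_ok=True)
    if not os.path.exists(FIFO_PATH):
        os.mkfifo(FIFO_PATH)
    while True:
        started = time.time()
        log_path = os.path.join(LOG_DIR, time.strftime("%Y%m%d-%H%M%S.log"))
        with open(log_path, "a") as log:
            # Each Apache thread opens the FIFO, writes one line, and closes it,
            # so the reader sees EOF often and simply reopens until it is time
            # to rotate; the finished log is then left for CopyLog to pick up.
            while time.time() - started < ROTATE_SECS:
                with open(FIFO_PATH) as fifo:   # blocks until a writer appears
                    for line in fifo:
                        log.write(line)
                log.flush()

if __name__ == "__main__":
    collect()
```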
On a daily basis this
process would copy logs written by CollectData
on Apollo, and sort and store
cumulative data in the WorkInProgress database. Monthly it
would perform end of the month counts, and transfer authorized
data to the Web databases. Additionally, it would respond to
requests at any time to re-run tasks.
Many of these tasks are independent, and being able to run them independently provides additional flexibility. So this process might be structured as a Control process that spawns several sub-processes.
ControlProcess - This would spawn the sub-processes at appropriate times. For ease of reference I'll name them CopyLog, ProcessLog, ProcessEndMonth and TransferCounts. It might also respond to user requests, for example to restart the processing from the beginning of the month, to re-run from month N, or to transfer monthly data to Apollo. Responding to requests might be implemented over a socket connection, with a simple client that accepts and sends user requests.
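One possible shape for the request side of ControlProcess is sketched below: requests arrive as single-line commands over a TCP socket and each one spawns the relevant sub-process. The port, the command names, and the sub-process script names are all hypothetical, and the time-based spawning (daily, monthly) is omitted.

```python
import socket
import subprocess

PORT = 9090  # hypothetical control port

# Hypothetical commands mapped to the sub-process each would (re)start.
COMMANDS = {
    "rerun_month": ["python", "process_log.py", "--from-month-start"],
    "rerun_from":  ["python", "process_log.py", "--from-month"],  # takes a month argument
    "transfer":    ["python", "transfer_counts.py"],
}

def serve():
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("", PORT))
    srv.listen(1)
    while True:
        conn, _ = srv.accept()
        with conn:
            words = conn.recv(1024).decode().split()
            if not words:
                continue
            cmd, args = words[0], words[1:]
            if cmd in COMMANDS:
                # CopyLog keeps running; only the requested task is spawned.
                subprocess.Popen(COMMANDS[cmd] + args)
                conn.sendall(b"started\n")
            else:
                conn.sendall(b"unknown request\n")

if __name__ == "__main__":
    serve()
```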
CopyLog - this would check a log directory on Apollo, copy any log files that CollectData had written, and concatenate them with the existing log for the day.
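CopyLog might amount to little more than the sketch below, which assumes finished logs sit in a known spool directory on Apollo and are fetched with scp; the paths and the use of scp (rsync would do equally well) are assumptions.

```python
import os
import subprocess

REMOTE = "apollo:/var/log/countuse/"      # assumed spool directory on Apollo
LOCAL_DIR = "/data/countuse/incoming"     # assumed staging directory on Libra

def copy_logs(day_log="/data/countuse/today.log"):
    os.makedirs(LOCAL_DIR, exist_ok=True)
    # Fetch any finished logs from Apollo.
    subprocess.run(["scp", "-q", REMOTE + "*.log", LOCAL_DIR])
    # Concatenate them, in order, onto the day's log.
    with open(day_log, "a") as day:
        for name in sorted(os.listdir(LOCAL_DIR)):
            path = os.path.join(LOCAL_DIR, name)
            with open(path) as chunk:
                day.write(chunk.read())
            os.remove(path)
```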
ProcessLog - this would sort a log
and store cumulative totals per Customer per Document in the
WorkInProgress database. The original unsorted log would be
retained to allow counts that require the original ordering (e.g.
sessions) to be reconstructed when necessary.
Sorting the log would minimize database accesses to the hard disk and, therefore, reduce processing time. Since the log is to be sorted and processed in batches anyway, there is no need to transfer data from Apollo in real time.
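A sketch of the core of ProcessLog follows, reusing the hypothetical log format and usage_counts table from the earlier sketches. The IP-to-Customer mapping (part of ExtraDataForCounts, below) is elided, so the raw IP stands in for the Customer; sorting groups identical keys together so each database row is touched once per batch.

```python
import sqlite3

def process_log(log_path, db_path="work_in_progress.db", month="2004-06"):
    # Sort a copy of the day's lines; the unsorted original stays on disk
    # so ordering-dependent counts (e.g. sessions) can be rebuilt later.
    with open(log_path) as f:
        lines = sorted(f)

    # Accumulate totals per (Customer, Document, type of access).
    batch = {}
    for line in lines:
        # Assumed field layout written by the Apache threads:
        # customer_ip  document_id  access_type  session_id  timestamp
        customer_ip, document_id, access_type, _session, _ts = line.split()
        key = (customer_ip, document_id, access_type)
        batch[key] = batch.get(key, 0) + 1

    # One cumulative database update per key.
    conn = sqlite3.connect(db_path)
    for (cust, doc, acc), n in batch.items():
        conn.execute(
            """INSERT INTO usage_counts (month, customer_id, document_id, access_type, count)
               VALUES (?, ?, ?, ?, ?)
               ON CONFLICT (month, customer_id, document_id, access_type)
               DO UPDATE SET count = count + excluded.count""",
            (month, cust, doc, acc, n),
        )
    conn.commit()
```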
To perform all the counts for a webpage access, ProcessLog also needs to be able to find the other Documents in the hierarchy and the other Customer Licenses that contain the IP address. For ease of reference I'll call this extra data ExtraDataForCounts. It might be provided by one of the following alternatives:
ProcessEndMonth - at the end of the
month
this sub-process would perform the monthly counts (e.g.
Top-100_IPs_FullTxtAccess, Top-10_CustomerDenieds).
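With the counts in a standard database, the end-of-month counts become ordinary queries. The sketch below computes something like Top-100_IPs_FullTxtAccess against the hypothetical usage_counts table; the 'fulltext' access-type value is an assumption.

```python
import sqlite3

def top_ips_fulltext(db_path="work_in_progress.db", month="2004-06", n=100):
    # Top-N Customers/IPs by full-text accesses for the month.
    conn = sqlite3.connect(db_path)
    return conn.execute(
        """SELECT customer_id, SUM(count) AS total
           FROM usage_counts
           WHERE month = ? AND access_type = 'fulltext'
           GROUP BY customer_id
           ORDER BY total DESC
           LIMIT ?""",
        (month, n),
    ).fetchall()
```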
TransferCounts - this would transfer
the
latest monthly statistics from the WorkInProgress database to the Web
databases on
Apollo
and Libra for the display of webpages for customers and for internal
use.
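With both stores as standard databases, TransferCounts could reduce to a copy between them. The sketch below uses SQLite's ATTACH purely for illustration; the real transfer to the Web database on Apollo would presumably go over the network.

```python
import sqlite3

def transfer_month(month, wip_db="work_in_progress.db", web_db="web.db"):
    # Copy one month's authorized counts from WorkInProgress to a Web database.
    conn = sqlite3.connect(wip_db)
    conn.execute("ATTACH DATABASE ? AS web", (web_db,))
    conn.execute("""CREATE TABLE IF NOT EXISTS web.usage_counts
                    AS SELECT * FROM usage_counts WHERE 0""")
    conn.execute(
        "INSERT INTO web.usage_counts SELECT * FROM usage_counts WHERE month = ?",
        (month,),
    )
    conn.commit()
```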
When a request is made to re-run the current month, for example after some error, the Control process would need to reset the WorkInProgress database and restart ProcessLog on the logs from the start of the month, leaving CopyLog to continue copying new logs from Apollo.
Isolating the copying of the logs, and combining the sorting and counting functions, means that there is no need to store sorted logs, and that two independent tasks - copying new logs, and re-processing unsorted logs that have already been copied - can easily continue in parallel, a feature that might be useful if the volume of webpage statistics increased.
To re-process previous months, Control would run another ProcessLog on the appropriate month's log directory, using an independent section of the database to store the results. An independent TransferCounts sub-process would also be started to transfer the data to the Web databases when authorized.
As noted in the previous section, Java Cocoon has been suggested for webpage presentation in place of the current Perl scripts.
Under CountUse the historical data would be stored in standard databases on Apollo and Libra. Procedures run when a webpage is accessed could therefore use standard database selection and sorting to produce their statistics on the fly.
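For example, a per-Customer report that today needs its own precomputed Berkeley DB file might be produced at page-request time with a single query. This assumes the Web database carries the usage_counts and documents tables from the earlier sketches; as before, all names are hypothetical.

```python
import sqlite3

def monthly_use_for_customer(customer_id, month, db_path="web.db"):
    # Per-Document totals for one Customer, sorted for display,
    # computed when the page is requested rather than precomputed per page.
    conn = sqlite3.connect(db_path)
    return conn.execute(
        """SELECT d.title, SUM(u.count) AS total
           FROM usage_counts u JOIN documents d USING (document_id)
           WHERE u.customer_id = ? AND u.month = ?
           GROUP BY u.document_id
           ORDER BY total DESC""",
        (customer_id, month),
    ).fetchall()
```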
Below, the main features
of CountUse are itemized and compared with
corresponding aspects of UsageReport.
The initial aim was to
investigate ways to improve the use of memory, disk and
processor. With the proposed changes to UsageReport, both systems would
offer significant benefits over the current system. Memory use should be similar in the two systems, but disk
use should be significantly better in CountUse. It is
more difficult to compare processing time, since it is a comparison
between specialized index-building and generalized database
access.
Although much can be done to UsageReport towards the initial aim, it would not gain the simplicity of CountUse, and with it CountUse's accessibility to others, ease of maintenance and extensibility. These factors are the primary benefits of the more radical changes, and they assume a more important role now that the immediate needs for memory, disk and processor seem to be met.
In the long-term, the ease of extending CountUse would cut development and maintenance time.
In the short-term, development time might be similar for the two systems, even though CountUse involves more radical changes. This is partly because some improvements would be harder, or involve extra work, on UsageReport, and partly because interfacing can be difficult when so much of the existing system is unknown.
The modularity of CountUse, and its use of more standard components, mean that many aspects of development could be done independently, drawing on a wider range of development skills and, therefore, personnel.