Ideas were generated partly by examining the system and its problems, and partly by thinking of a design to meet the requirements without the constraints of the existing system. In the course of doing this, many simpler ways of gaining the same benefits emerged. These are more radical in their effect on the system - it would probably be simpler to rewrite it. The simpler methods would also make the system easier to develop, maintain and extend to incorporate new features, which is hard to achieve with the existing system even with improvements. So I also include a sketch of the more radical changes.
Below, thoughts on the existing system are followed by the sketch of the more radical changes, and a comparison section.
The current estimated times to produce the webpage access statistics are:

End of Month: 5 hrs

Activity | Hrs
Database Dumps | 0.5
Reading in large Document file/creating data structures for each Product | 1.5
Reading/parsing Weblog file for each Product | 1.0
Reading in Customer files per Product, and building index | 1.0
Counts | 0.5
Daily Total | 4.5
Possible Changes:
Currently, with the features introduced under 'Daily Processing', only 2GB of memory is used when running the program (reduced from 4GB-13GB), since only the Customers and Documents for the day are now held in memory. Most of the remaining daily usage is the UsageStats data structures used to keep the counts (per Document per Product, per Customer/IPCustomer).
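For reference, the UsageStats structures might look something like the following sketch (Python used for illustration; the names and nesting are assumptions based on the description above, not the actual code):

```python
from collections import defaultdict

# Hypothetical sketch of the in-memory UsageStats layout: a count per
# Document per Product, per Customer (or IP-derived "IPCustomer").
# usage_stats[product][document][customer] -> access count
usage_stats = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

def record_access(product, document, customer):
    usage_stats[product][document][customer] += 1

record_access("ProductA", "doc-123", "customer-42")  # illustrative values
```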
Possible Change:
Type of Use | GB
Raw Data: Weblogs, Database dumps, Misc Logs | 6.33
Processed Data: CustomerReports/ArticleReports Berkeley DBs | 4.6
Total | 10.93
There is a lot of duplication in the data stored within every Berkeley DB file each month, and between months. For example, Customer details are repeated in every Product (i.e. in each Berkeley DB), as are Document titles.
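One way to remove this duplication, anticipating the standard database proposed below, is a normalized schema in which Customer and Document details are stored once and referenced by id. A minimal sketch, with SQLite as a stand-in and illustrative table names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a real file-backed database in practice
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE document (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE product  (id INTEGER PRIMARY KEY, name TEXT);
-- counts reference the shared rows instead of repeating the details
CREATE TABLE usage_count (
    product_id  INTEGER REFERENCES product(id),
    document_id INTEGER REFERENCES document(id),
    customer_id INTEGER REFERENCES customer(id),
    month       TEXT,     -- e.g. '2005-03'
    hits        INTEGER,
    PRIMARY KEY (product_id, document_id, customer_id, month)
);
""")
```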
Possible Change:
Currently webpages are generated by Perl scripts. It has already been proposed elsewhere that these should be replaced by Java Cocoon, which would bring this system into line with the rest of the website. The Perl scripts currently access precomputed data in Berkeley DB files that is stored in a format specific to each webpage.
The end of month data is not kept in a general format, only in the forms needed for specific webpages. If the program has to be re-run, therefore, the general data has to be recomputed. So, for example, if end-of-month data were needed for a new statistic covering the last 3 years, at current times it would take 7.5 days to compute (5 hrs x 12 months x 3 years = 180 hrs = 7.5 days).
Possible Changes:
There are two main features to CountUse. One is to replace the reading and parsing of the weblogs with data obtained directly from the Apache threads. The second is to replace the existing data representation and storage with a standard database.
CountUse would consist of two main processes. One, to be run on Apollo - call it CollectData - would log data from the web server. The other, to be run on Libra - call it ProcessData - would, among other tasks, process the data collected on Apollo and store it in standard databases. Interim counts would be stored in a 'WorkInProgress' database on Libra, which might also hold up-to-date Customer and Document data. Monthly historical data for webpage display would be stored in dedicated 'Web' databases on Apollo and on Libra.
CollectData - this process would read data from the Apache threads in real time via a FIFO, and write the data to a log. Periodically during the day, a new log would be started and the old one copied to a directory where ProcessData on Libra would look for logs to copy.
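A minimal sketch of CollectData, assuming Python, illustrative paths, and an hourly rotation interval (none of these are specified yet):

```python
import os
import shutil
import time

FIFO_PATH = "/var/run/countuse.fifo"        # assumed location
LOG_PATH = "/var/log/countuse/current.log"  # assumed location
OUTBOX = "/var/log/countuse/outbox"         # directory ProcessData polls
ROTATE_SECS = 3600                          # assumed rotation interval

def collect():
    last_rotate = time.time()
    log = open(LOG_PATH, "a")
    while True:
        # a FIFO reaches EOF whenever all writers close, so reopen it
        with open(FIFO_PATH) as fifo:
            for line in fifo:               # one access record per line
                log.write(line)
                if time.time() - last_rotate >= ROTATE_SECS:
                    log.close()
                    # hand the finished log over for ProcessData to copy
                    dest = os.path.join(OUTBOX, "log.%d" % int(time.time()))
                    shutil.move(LOG_PATH, dest)
                    log = open(LOG_PATH, "a")
                    last_rotate = time.time()
```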
The data needed is already used by an Apache thread to check a Customer's license for a Document, and would just need to be written to the FIFO. Minimally this data would include the Customer IP address, the webpage URL and the session id. Additionally, the date/time on Apollo would be needed. Getting this data directly eliminates the current weblog reading and parsing.
[Apache Thread Modifications: open a FIFO, write a line of data to the FIFO and close the FIFO.]
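On the writer side, the record might look like this (a Python illustration of the data format only; the real modification would be made in the Apache thread code that already performs the license check):

```python
import datetime

FIFO_PATH = "/var/run/countuse.fifo"  # assumed; must match CollectData

def log_access(customer_ip, url, session_id):
    # open the FIFO, write one line of data, close it - as described above;
    # opening for writing blocks until CollectData has it open for reading
    stamp = datetime.datetime.now().isoformat()  # date/time on Apollo
    with open(FIFO_PATH, "w") as fifo:
        fifo.write(f"{stamp}\t{customer_ip}\t{url}\t{session_id}\n")
```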
ProcessData - on a daily basis this process would copy the logs written by CollectData on Apollo, and sort and store cumulative data in the WorkInProgress database. Monthly, it would perform the end of month counts and transfer authorized data to the Web databases. Additionally, it would respond at any time to requests to re-run tasks.
Many of these tasks are independent, and need to be capable of being performed independently to provide additional flexibility. So the structure of this process might be a Control process that spawns several sub-processes.
Control - this would spawn the sub-processes at appropriate times. For ease of reference I'll name the sub-processes CopyLog, ProcessLog, ProcessEndMonth and TransferCounts. Control might also respond to user requests, for example to restart the processing from the beginning of the month, to re-run from month N, or to transfer monthly data to Apollo. Responding to requests might be implemented over a socket connection, with a simple client that accepts and sends user requests (see "Process organisation" below).
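A sketch of how Control might spawn the sub-processes and accept requests over a socket (the script names, request format and port number are all assumptions for illustration):

```python
import socket
import subprocess

SUB_PROCESSES = {                 # names as used in the text; the scripts
    "copylog":    ["python", "copy_log.py"],           # are hypothetical
    "processlog": ["python", "process_log.py"],
    "endmonth":   ["python", "process_end_month.py"],
    "transfer":   ["python", "transfer_counts.py"],
}

def serve(port=9099):             # port chosen arbitrarily for the sketch
    srv = socket.create_server(("", port))
    while True:
        conn, _ = srv.accept()
        with conn:
            # one-line requests from the simple client, e.g. "processlog 2005-03"
            request = conn.makefile().readline().split()
            if request and request[0] in SUB_PROCESSES:
                subprocess.Popen(SUB_PROCESSES[request[0]] + request[1:])
                conn.sendall(b"started\n")
            else:
                conn.sendall(b"unknown request\n")
```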
CopyLog - this would check a log directory on Apollo to copy any log files that CollectData had written, and concatenate the ones it copies with the existing logs for the day. [The logs would be kept to reconstruct statistics when needed.]
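CopyLog might be as simple as the following, assuming the logs are fetched from Apollo with scp and appended to a per-day file (the host name, paths and file naming are all illustrative):

```python
import glob
import os
import subprocess

REMOTE_OUTBOX = "apollo:/var/log/countuse/outbox/*"  # assumed outbox on Apollo
INBOX = "/data/countuse/inbox"                       # staging area on Libra
DAY_LOG = "/data/countuse/logs/{day}.log"            # day = YYYY-MM-DD

def copy_logs(day):
    # fetch any logs CollectData has finished with (no-op if none are ready)
    subprocess.run(["scp", REMOTE_OUTBOX, INBOX])
    # concatenate the copies with the existing log for the day
    with open(DAY_LOG.format(day=day), "a") as out:
        for path in sorted(glob.glob(os.path.join(INBOX, "log.*"))):
            with open(path) as piece:
                out.write(piece.read())
            os.remove(path)
```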
ProcessLog - this would sort a log and store cumulative totals per Customer per Document in the WorkInProgress database. [Only the original unsorted log would be retained, to allow counts that require the original ordering (e.g. sessions) to be reconstructed when necessary.]
Sorting the log would minimize database accesses to the hard disk, and therefore reduce processing time. The preference for sorting the log makes it unnecessary to collect data in real time from Apollo.
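A sketch of this sort-then-count step, with SQLite standing in for the WorkInProgress database and the IP address standing in for the Customer (the real keys would come from the Customer/Document lookups described next):

```python
import sqlite3
from collections import Counter

def process_log(log_path, db_path):
    records = []
    with open(log_path) as log:
        for line in log:
            stamp, ip, url, session = line.rstrip("\n").split("\t")
            records.append((ip, url))
    records.sort()                 # Customer order, then Document order
    totals = Counter(records)      # cumulative total per (Customer, Document)

    db = sqlite3.connect(db_path)
    db.execute("""CREATE TABLE IF NOT EXISTS wip_counts
                  (ip TEXT, url TEXT, hits INTEGER, PRIMARY KEY (ip, url))""")
    # updates arrive in key order, so they hit the disk with fewer seeks
    for (ip, url), n in totals.items():
        db.execute("""INSERT INTO wip_counts VALUES (?, ?, ?)
                      ON CONFLICT (ip, url)
                      DO UPDATE SET hits = hits + excluded.hits""",
                   (ip, url, n))
    db.commit()
```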
To perform all the necessary counts, ProcessLog also needs to be able to find the other Documents in the hierarchy of Documents, and the other Customer Licenses that contain the IP address in a web access. This data might be provided by one of the following alternatives:
ProcessEndMonth - at the end of the month this sub-process would perform the monthly counts (e.g. Top-100_IPs_FullTxtAccess, Top-10_CustomerDenieds).
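With the counts in a standard database, these reports reduce to ordinary queries. For example, a Top-100 IPs report might be written as follows (table and column names follow the ProcessLog sketch above; whether the real report groups by IP in exactly this way is an assumption):

```python
import sqlite3

def top_ips(db_path, limit=100):
    db = sqlite3.connect(db_path)
    return db.execute("""SELECT ip, SUM(hits) AS total
                         FROM wip_counts
                         GROUP BY ip
                         ORDER BY total DESC
                         LIMIT ?""", (limit,)).fetchall()
```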
TransferCounts - this would transfer the latest monthly statistics from WorkInProgress to the Web databases on Apollo and Libra, for the display of webpages for customers and for internal use.
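A sketch of the transfer, again with SQLite as the stand-in and an assumed monthly_counts table in the Web database:

```python
import sqlite3

def transfer_month(wip_path, web_path, month):
    wip = sqlite3.connect(wip_path)
    web = sqlite3.connect(web_path)
    web.execute("""CREATE TABLE IF NOT EXISTS monthly_counts
                   (month TEXT, ip TEXT, url TEXT, hits INTEGER)""")
    # stamp each WorkInProgress row with the month and copy it across
    rows = wip.execute("SELECT ?, ip, url, hits FROM wip_counts", (month,))
    web.executemany("INSERT INTO monthly_counts VALUES (?, ?, ?, ?)", rows)
    web.commit()
```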
When a request is made to re-run the current month - if, for example, there had been some error - the Control process would need to reset the CountUse_lib_wip database and restart ProcessLog on the logs from the start of the month, leaving CopyLog to continue copying new logs from Apollo.
Isolating the copying of the logs, and combining the sorting and counting functions, means that there is no need to store sorted logs, and that the two independent tasks - copying the new logs, and re-processing unsorted logs that have already been copied - can easily continue in parallel.
To re-process previous months, Control would run another ProcessLog on the appropriate month's log directory, and use an independent section of the database to store the results. An independent TransferCounts sub-process would also be started, to transfer the data to the webpage display databases when authorized.
As indicated, it has already been proposed that webpage presentation should be via Java Cocoon, instead of the current Perl scripts. Under CountUse the historical data would be stored in databases on Apollo and Libra. Procedures for computing data when a webpage is accessed would therefore use standard database selection and sorting operations to produce their statistics on the fly.
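For example, a page showing a Document's usage for a month would issue a query like the following (against the assumed monthly_counts table from the transfer sketch), rather than reading a page-specific precomputed record:

```python
import sqlite3

def document_hits(web_path, month, url):
    db = sqlite3.connect(web_path)
    row = db.execute("""SELECT SUM(hits) FROM monthly_counts
                        WHERE month = ? AND url = ?""",
                     (month, url)).fetchone()
    return row[0] or 0   # SUM is NULL when there were no accesses
```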
There's a lot that can be done on the existing system, but without gaining the simplicity of the new one, and thereby its accessibility to others and its ease of maintenance.
Where improvements could be made to achieve that simplicity, they would in effect be making the more radical changes of the new system, and at a higher development cost. An evolutionary approach to change might be advantageous if there were immediate needs for improvement but, at present, foreseeable needs for memory, disk space and processing speed would seem to be met.
Both are probably as scalable as each other. For example, a log can be sorted into both Customer and Document order to reduce how much memory is required to process a webpage access, and both could spawn a process per Product to run simultaneously on separate CPUs if web usage increased significantly.
The old is probably not as extensible ....
The modularity of the processes in CountUse, and the use of more standard components such as databases and Java Cocoon, mean that many aspects of development can be done independently, using a wider range of development skills and, therefore, of personnel.