Overview

Historical Background

The original aims with regard to fully integrating UsageReporting with DataWarehousing are unclear. It seems likely that integration was never considered.

Current Status

Currently the DataWarehousing system appears to be supporting the internal Business requirements for Usage Reports.

However, the Counter requirements for Customers are not supported:

The design of the DataWarehousing system does not currently include the data needed to support UsageReporting, such as Session Counts or appropriate concepts of Customer.

Additionally, its design apparently precludes the incorporation of some features in the foreseeable future, such as storing journals with old and new titles.

It also seems probable that a single system could not have been designed for the two independent tasks: UsageReporting needs 100s of reports daily, whereas DataWarehousing needs fewer, longer reports.
The historical data requirement might also be incompatible with the combined needs.

Future

Weblog Parser Module

This module would do the weblog parsing for both DataWarehousing and UsageReporting, rather than each doing the parsing independently.

A common weblog-parsing frontend would:

Make development and extension of the systems easier by virtue of the modularity of the weblog parser. That is, the weblog parsing is independent of the further processing done by the two systems.
Reduce maintenance. Code maintenance would be reduced as there would be only one parser to maintain. Administrative procedures such as adding or modifying URL patterns would also be easier where they are in common.
Reduce parsing time. The basic parsing of each weblog line would only be done once. URL pattern-matching time would also be reduced, but how much depends on how much the patterns have in common.

There are two main sets of web logs that are processed - usually referred to as "journal" and "MRW" logs:

A journal log is typically around 2,000,000 lines/day
The MRW log is typically around 300,000 lines/day

The first two benefits (easier development and reduced maintenance) are probably more important at present. The system is relatively stable in terms of processing time, memory and disk space requirements, whereas maintenance, usage procedures and development are the more pressing problems.

Data output from this module would be available to both DataWarehousing and UsageReporting for the next stage in processing.

This might be accomplished dynamically, or the output could be stored directly in a database to be retrieved by each system.
The alternative is to modify each of the existing systems to accept the data in its new form.
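
The parse-once arrangement described in this section can be sketched as follows. This is only a minimal illustration in Python, not the existing implementation: the log layout (Common Log Format), the URL patterns, and the names parse_line, classify and dispatch are all assumptions made for the example. Each weblog line is parsed and pattern-matched once, and the resulting record is handed to every downstream consumer (here standing in for DataWarehousing and UsageReporting).

```python
import re
from typing import Callable, Dict, Iterable, List, Optional

# Assumed log layout: Common Log Format. The real weblogs may differ.
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+'
)

# Hypothetical URL patterns held in one place for both systems.
URL_PATTERNS = {
    "fulltext": re.compile(r"^/journal/[^/]+/fulltext"),
    "abstract": re.compile(r"^/journal/[^/]+/abstract"),
}


def parse_line(line: str) -> Optional[Dict[str, str]]:
    """Parse one weblog line; return None if it does not match the layout."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None


def classify(url: str) -> Optional[str]:
    """Return the name of the first URL pattern that matches, if any."""
    for name, pattern in URL_PATTERNS.items():
        if pattern.search(url):
            return name
    return None


def dispatch(
    lines: Iterable[str],
    consumers: List[Callable[[Dict[str, str]], None]],
) -> int:
    """Parse each line once and pass the record to every consumer.

    Returns the number of lines that were parsed successfully.
    """
    parsed = 0
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue  # unparseable line: skipped in this sketch
        record["kind"] = classify(record["url"]) or "other"
        parsed += 1
        for consume in consumers:
            consume(record)
    return parsed
```

In this sketch a consumer is any callable that accepts a record, so the same front end could feed an in-memory handler (the "dynamic" option) or a function that writes to a database.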

Counter Report database

Customer reports are currently displayed by Perl scripts that access the Berkeley DB files in which the usage reporting counts are stored. For reasons outlined in report .. , there are advantages to replacing the Berkeley DB files with a standard database, and the Perl scripts that access them with an alternative. Moreover, these two changes would need to be made together.

The database would be used for:

Storing Counts, or computing them on the fly (discussion suggested this should be fast enough). It is not known whether the historical requirement could also be met if the data are computed on the fly.
Storing the Customer and Document ID data required to fulfil the Counter requirements (e.g. licenses, Customer IDs, Document IDs).
Automatic updating of Customers and Documents.
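
To make the "compute on the fly" option concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for whatever standard database is chosen. The table and column names (customer, document, usage_event, session_id, event_month) and the sample rows are hypothetical and exist only for the illustration. The database holds relatively unprocessed usage events, and the request and session counts are derived by a query at report time.

```python
import sqlite3
from typing import Dict


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical schema and sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customer (customer_id TEXT PRIMARY KEY, name TEXT);
        CREATE TABLE document (document_id TEXT PRIMARY KEY, title TEXT);
        CREATE TABLE usage_event (
            customer_id TEXT REFERENCES customer(customer_id),
            document_id TEXT REFERENCES document(document_id),
            session_id  TEXT,
            event_month TEXT  -- 'YYYY-MM'
        );
        """
    )
    conn.execute("INSERT INTO customer VALUES ('C1', 'Example Library')")
    conn.executemany(
        "INSERT INTO document VALUES (?, ?)",
        [("D1", "Example Journal A"), ("D2", "Example Journal B")],
    )
    conn.executemany(
        "INSERT INTO usage_event VALUES (?, ?, ?, ?)",
        [
            ("C1", "D1", "S1", "2005-01"),
            ("C1", "D2", "S1", "2005-01"),
            ("C1", "D1", "S2", "2005-01"),
            ("C1", "D1", "S3", "2005-02"),
        ],
    )
    conn.commit()
    return conn


def monthly_counts(conn: sqlite3.Connection, customer_id: str, month: str) -> Dict[str, int]:
    """Compute request and session counts for one customer and month at query time."""
    row = conn.execute(
        """
        SELECT COUNT(*), COUNT(DISTINCT session_id)
        FROM usage_event
        WHERE customer_id = ? AND event_month = ?
        """,
        (customer_id, month),
    ).fetchone()
    return {"requests": row[0], "sessions": row[1]}
```

Because the events rather than the counts are stored, past months can be recomputed at any time, provided the event rows are retained; whether that satisfies the historical requirement is the open question noted above.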

Customer Reports could be generated by:

Business objects from the database.

This would handle the majority of the remaining tasks that the current system deals with. The remainder would fall into the category of control processes, covering administrative tasks and the user interface to the Counter Report system (see the Control Module below).

Control Module

This would be a user interface for transferring counts to the database on apollo, for handling user requests to restart processing, and for dealing with unusual data. One aspect might be a means of automating the entry of new URLs.
An additional feature would be to automate some of the monthly administrative task of doing the QA on the statistics.
This would be developed in part through the regression tests needed to test the Weblog Parser Module and the database modules.
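
One way such a regression check might look is sketched below. It is an assumption about how the QA could be automated, not a description of any existing procedure: counts produced by a new run are compared with a stored baseline, and every difference is reported for a person to review. The function name diff_counts and the (customer ID, month) key are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical key: (customer_id, report_month) -> count
Counts = Dict[Tuple[str, str], int]


def diff_counts(expected: Counts, actual: Counts) -> List[str]:
    """Return one message per key whose count differs between the two runs.

    Keys present in only one of the two inputs are reported with None
    for the missing side.
    """
    problems = []
    for key in sorted(set(expected) | set(actual)):
        exp = expected.get(key)
        act = actual.get(key)
        if exp != act:
            problems.append(f"{key[0]} {key[1]}: expected {exp}, got {act}")
    return problems
```

An empty result would mean the new run reproduces the baseline exactly; any messages would be candidates for the monthly QA review.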

RES - Store relatively unprocessed data (not counts) in the database, though a copy is needed on apollo for the website.
        - Compute on the fly (database apparently fast enough).
        - Counts need to be transferred each month to the database on apollo for website access.

        - Since 100s of reports need to be generated a day, probably not a general-purpose datawarehousing database, since that is designed for a limited number of reports.
        - Not designed for it anyway:

DataWarehousing is not currently storing data that are needed (e.g. session counts), nor, by the sound of it, could such data be stored in the foreseeable future (e.g. historical data: storing old titles for journals was recently ruled out when requested).

The concepts of Customer needed for the Counter requirements are not incorporated in DataWarehousing.