Report on Meeting with regard to the Integration of UsageReporting with DataWarehousing

 
Background
UsageReporting is in 2 parts:

Historical Background

The original aims with regard to fully integrating UsageReporting with DataWarehousing are unclear. It seems likely that full integration was never considered.

Current Status

Currently the DataWarehousing system appears to be supporting the internal Business requirements for Usage Reports.

The Counter requirements for Customers, however, are not supported:

The design of the DataWarehousing system does not currently include the data needed to support UsageReporting, such as Session Counts, or appropriate concepts of Customer.

Additionally, its design apparently precludes the incorporation of some features in the foreseeable future, such as storing journals with old and new titles.

It also seems probable that a single system could not have been designed for the two independent tasks: the UsageReporting need for 100s of reports daily and the DataWarehousing need for fewer, longer reports.
The historical data requirements might also be incompatible with the combined needs.
Future
Weblog Parser Module
This module would do the weblog parsing for both DataWarehousing and UsageReporting, rather than each system doing the parsing independently.

A common weblog-parsing frontend would:
There are two main sets of web logs that are processed - usually referred to as "journal" and "MRW" logs:
A journal log is typically around 2,000,000 lines/day
The MRW log is typically around 300,000 lines/day
For the Usage Reporting System:
The journal log is processed 3 times (once for each of the 3 products: journal, book, and cochrane).
The MRW log is processed over 50 times (once for each of over 50 products).
For the Data Warehousing System:
Both logs are processed once.
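
As a rough check on the figures above, the following Python sketch totals the lines parsed per day under the current arrangement and under a single shared parsing pass. It assumes each pass reads the whole log, and it takes "over 50" MRW products as exactly 50, so the UsageReporting figure is a lower bound. The function and variable names are illustrative only.

```python
# Daily log sizes quoted in the report.
JOURNAL_LINES_PER_DAY = 2_000_000
MRW_LINES_PER_DAY = 300_000


def lines_parsed(journal_passes: int, mrw_passes: int) -> int:
    """Total lines read per day for a given number of passes over each log."""
    return (JOURNAL_LINES_PER_DAY * journal_passes
            + MRW_LINES_PER_DAY * mrw_passes)


# UsageReporting: 3 passes over the journal log, at least 50 over the MRW log.
usage_reporting = lines_parsed(journal_passes=3, mrw_passes=50)
# DataWarehousing: one pass over each log.
data_warehousing = lines_parsed(journal_passes=1, mrw_passes=1)

current_total = usage_reporting + data_warehousing
# A shared parser would read each log once for both systems.
shared_parser = lines_parsed(journal_passes=1, mrw_passes=1)

print(f"UsageReporting now:  {usage_reporting:,} lines/day")
print(f"DataWarehousing now: {data_warehousing:,} lines/day")
print(f"Current total:       {current_total:,} lines/day")
print(f"Shared parser:       {shared_parser:,} lines/day")
```

On these figures the two systems together parse at least 23,300,000 lines a day, against 2,300,000 for a single shared pass.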

The first two points are probably more important at present. The system is relatively stable in terms of processing time, memory and disk space requirements; the maintenance, usage procedures and development aspects are the more pressing problems.

Data output from this module would be available to each module of DataWarehousing and UsageReporting for the next stage in processing.

This might be accomplished dynamically, or the output could be stored directly in a database to be retrieved by each system.
The alternative is to modify each of the existing systems to accept the data in its new form.
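
The "dynamic" option could look something like the Python sketch below: each log line is parsed once and the resulting record is handed to every interested consumer. This is only an illustration under stated assumptions. The three-field line layout, the rule that the first URL path segment names the product, the sample lines, and all names (Hit, parse_line, run_single_pass, usage_consumer) are hypothetical and do not describe the real logs or systems.

```python
from collections import Counter
from typing import Callable, Iterable, List, NamedTuple, Optional


class Hit(NamedTuple):
    timestamp: str
    customer: str
    url: str


def parse_line(line: str) -> Optional[Hit]:
    """Parse one log line. The 3-field layout is an assumption."""
    parts = line.split()
    if len(parts) != 3:
        return None  # skip malformed lines
    return Hit(*parts)


def run_single_pass(lines: Iterable[str],
                    consumers: List[Callable[[Hit], None]]) -> int:
    """Parse each line once and pass the record to every consumer."""
    parsed = 0
    for line in lines:
        hit = parse_line(line)
        if hit is None:
            continue
        parsed += 1
        for consume in consumers:
            consume(hit)
    return parsed


# Hypothetical UsageReporting consumer: count hits per (customer, product).
usage_counts: Counter = Counter()


def usage_consumer(hit: Hit) -> None:
    # Assumption: the first path segment of the URL names the product.
    product = hit.url.strip("/").split("/")[0]
    usage_counts[(hit.customer, product)] += 1


# Hypothetical DataWarehousing consumer: keep the parsed records as they are.
warehouse_rows: List[Hit] = []

# Invented sample lines, for illustration only.
sample = [
    "2005-01-01T00:00:01 cust1 /journal/abc",
    "2005-01-01T00:00:02 cust1 /book/xyz",
    "malformed line",
    "2005-01-01T00:00:03 cust2 /journal/abc",
]
parsed = run_single_pass(sample, [usage_consumer, warehouse_rows.append])
print(parsed, dict(usage_counts), len(warehouse_rows))
```

Storing the parsed records in a database instead would amount to replacing the in-memory consumers with one that writes each record to a table.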

Counter Report Database
The current means of displaying the Customer reports is by using Perl scripts that access Berkeley DB files in which the usage reporting counts are stored. For reasons outlined in report .. , there are advantages to replacing the Berkeley DB files with a standard database, and the Perl scripts that access them with an alternative. Moreover, these two changes would need to be done together.
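
To illustrate what holding the counts in a standard database might involve, here is a minimal Python sketch using SQLite as a stand-in. No particular database or schema has been chosen; the table, its columns, the sample values, and the helper names add_hits and customer_report are all assumptions made for the example.

```python
import sqlite3

# In-memory SQLite database standing in for whichever database is chosen.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE usage_counts (
        customer TEXT NOT NULL,
        product  TEXT NOT NULL,
        month    TEXT NOT NULL,
        hits     INTEGER NOT NULL,
        PRIMARY KEY (customer, product, month)
    )
""")


def add_hits(customer: str, product: str, month: str, hits: int) -> None:
    """Add hits to an existing count, or create the row if it is missing."""
    cur = conn.execute(
        "UPDATE usage_counts SET hits = hits + ? "
        "WHERE customer = ? AND product = ? AND month = ?",
        (hits, customer, product, month),
    )
    if cur.rowcount == 0:
        conn.execute(
            "INSERT INTO usage_counts (customer, product, month, hits) "
            "VALUES (?, ?, ?, ?)",
            (customer, product, month, hits),
        )


def customer_report(customer: str, month: str) -> list:
    """Return (product, hits) rows for one customer and month."""
    return conn.execute(
        "SELECT product, hits FROM usage_counts "
        "WHERE customer = ? AND month = ? ORDER BY product",
        (customer, month),
    ).fetchall()


# Invented sample data, for illustration only.
add_hits("cust1", "journal", "2005-01", 10)
add_hits("cust1", "journal", "2005-01", 5)
add_hits("cust1", "book", "2005-01", 2)
report = customer_report("cust1", "2005-01")
print(report)
```

A report then becomes a single query against the table, rather than a script that opens and walks individual DB files.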

The database would be used for:
Customer Reports could be generated by:
This would handle the majority of the remaining tasks that the current system deals with. The rest would fall to a control process covering administrative tasks and a user interface to the Counter Report system (see the Control Module below).
Control Module
This would be a user interface for transferring counts to the database on apollo, for handling user requests to restart processing, and for dealing with unusual data. One aspect might be a means of automating the entry of new URLs.
An additional feature would be to automate some of the monthly administrative task of doing QA on the statistics.
This would be developed in part through the regression tests needed to test the Weblog Parser Module and the database modules.

RES - Store relatively unprocessed data (not counts) in the database, though a copy is needed on apollo for the website.
        - Compute counts on the fly (the database is apparently fast enough).
        - Counts need to be transferred each month to the database on apollo for website access.
        - Since 100s of reports need to be generated a day, probably not a general-purpose datawarehousing database, since that is designed for a limited number of reports.
        - Not designed for it anyway:
          - The data warehouse does not currently store the data needed (e.g. session counts), nor, by the sound of it, will it be able to in the foreseeable future (e.g. historical data: storing old titles for journals was recently ruled out when requested).
          - Concepts of Customer needed for Counter requirements are not incorporated in datawarehousing.

Not sure whether it would be possible to use their DB to store counts if wanted:
        - Speed of access

UsageReporting Now:

Stable for memory, disk use and processing time for a while.
Main problems: maintenance, procedures and extensions are time-consuming.