Future of Usage Reporting System


Will the Usage Reporting System have a future when the Data Warehouse is operational?

Introduction

When the Wiley InterScience Data Warehouse was proposed two years ago, the plan was for it to take over some of the work of the Usage Reporting System (see Wiley InterScience Data Warehouse - WPS Initiated Usage Reporting Requirements). The question of whether it would ever completely replace the Usage Reporting System was raised but, so far as I know, never resolved. Now that the Data Warehouse is about to become operational, it seems opportune to revisit this question.

Current state of Usage Reporting System

Users and generated reports

The Usage Reporting System currently generates reports for two sets of users: Wiley customers and Wiley employees.

The reports for Wiley employees are used mainly for business purposes. The Data Warehouse is intended to replace the Usage Reporting System for these purposes.

The reports for Wiley customers allow them to monitor their licensed use of the InterScience web site. Providing these reports is a contractual requirement of the license, and their content and form have to satisfy the COUNTER Code of Practice. The Data Warehouse has not been designed to replace the Usage Reporting System for this purpose.

So the issue under discussion is how to provide customer reports in the future.

Outstanding problems

Recent work on the Usage Reporting System has overcome immediate problems with memory use and processing time, and has automated normal running of the system.

There are, however, still problems with the system. These are discussed elsewhere, but they include repeated parsing of the same weblog files, an ad hoc data representation, and reliance on Perl scripts.

The document Some thoughts on the design of the Usage Reporting System proposes a replacement of the Usage Reporting System that would address these and other problems.

Processing in common with Data Warehouse

As noted above, the Usage Reporting System repeatedly parses the same weblog files from the InterScience website. The Data Warehouse also independently parses these weblog files.

There are two main sets of web logs that are processed - usually referred to as "journal" and "MRW" logs.
For the Usage Reporting System:
The Data Warehouse processes both logs once.

Both systems perform the following main steps for each weblog line:
  1. Split the line into fields (IP address, date/time, request URL, etc.)
  2. Match the request URL against a set of patterns to determine whether the line is of interest
  3. Use the request URL match to determine what document was accessed, and in what manner
The patterns used in steps 2 and 3 are currently different for the two systems. However, they should, in theory, have a lot in common, since they are used to identify accesses to the same documents. Even if the patterns themselves cannot be combined, the code that does the matching certainly can be.
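
The three steps above can be sketched as follows. This is only an illustration in Python, not the code of either system: the log line layout (an Apache-style common log format), the URL patterns, and the names `LINE_RE`, `URL_PATTERNS`, `parse_line` and `classify` are all assumptions made for the example.

```python
import re
from typing import NamedTuple, Optional

# Assumed log layout: Apache-style common log format.
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+'
)

# Hypothetical URL patterns; the real patterns differ between the two systems.
URL_PATTERNS = [
    (re.compile(r'^/journal/(?P<doc>\d+)/abstract$'), 'abstract'),
    (re.compile(r'^/journal/(?P<doc>\d+)/fulltext$'), 'fulltext'),
]


class Access(NamedTuple):
    ip: str
    when: str
    document: str
    manner: str


def parse_line(line: str) -> Optional[dict]:
    """Step 1: split a weblog line into named fields (None if malformed)."""
    m = LINE_RE.match(line)
    return m.groupdict() if m else None


def classify(fields: dict) -> Optional[Access]:
    """Steps 2 and 3: match the URL, then derive document and access manner."""
    for pattern, manner in URL_PATTERNS:
        m = pattern.match(fields['url'])
        if m:
            return Access(fields['ip'], fields['when'], m.group('doc'), manner)
    return None  # the line is not of interest
```

Because the patterns live in a table separate from the matching loop, each system could in principle supply its own pattern table while sharing `parse_line` and `classify`.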

Options for future

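There are three main options for generating customer reports in the future:
  1. The Data Warehouse completely replaces the Usage Reporting System
  2. The Usage Reporting System is retained as a totally separate system
  3. The Usage Reporting System and the Data Warehouse share a front-end
These options are discussed in more detail below.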

Data Warehouse completely replaces Usage Reporting System

This option has the obvious advantage that there is only one system to maintain.

However, the current Data Warehouse has not been designed to provide customer reports, and it is not clear how easy it would be to do so. In particular:

Usage Reporting System retained as totally separate system

The only advantage of this option is that it does not require any immediate development, since this is the current state of affairs.

Obviously this option does not take advantage of any opportunity to share development and maintenance with the Data Warehouse.

Also, as a separate system, the existing problems specific to the Usage Reporting System would still need to be addressed.

Usage Reporting System and Data Warehouse share front-end

In this option, a common module would perform weblog processing for both the Usage Reporting System and the Data Warehouse.

Data output from this module would be passed to both the Usage Reporting System and the Data Warehouse for the next stage in processing. This might be done dynamically, or the data could be stored in a database to be retrieved by each system.
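
The "dynamic" variant of this hand-off might look like the sketch below: each weblog line is parsed once and the resulting record is passed to every registered back end. Everything here is hypothetical, including the stand-in parser `split_fields`, the function `run_frontend` and the two consumer functions; neither system is known to expose such an interface.

```python
from typing import Callable, Dict, Iterable, List, Optional

Record = Dict[str, str]


def split_fields(line: str) -> Optional[Record]:
    """Stand-in parser (assumption): expects 'ip url' separated by whitespace."""
    parts = line.split()
    if len(parts) != 2:
        return None
    return {"ip": parts[0], "url": parts[1]}


def run_frontend(lines: Iterable[str],
                 consumers: List[Callable[[Record], None]]) -> int:
    """Parse each line once and hand the record to every consumer."""
    parsed = 0
    for line in lines:
        record = split_fields(line)
        if record is None:
            continue  # skip lines the parser cannot handle
        for consume in consumers:
            consume(record)
        parsed += 1
    return parsed


# Hypothetical back ends standing in for the two systems.
usage_counts: Dict[str, int] = {}
warehouse_rows: List[Record] = []


def usage_reporting_consumer(record: Record) -> None:
    usage_counts[record["url"]] = usage_counts.get(record["url"], 0) + 1


def data_warehouse_consumer(record: Record) -> None:
    warehouse_rows.append(record)
```

The database variant would replace the consumer list with a single writer, leaving each system to read the stored records on its own schedule.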

If we were to introduce a common weblog-parsing frontend, this would:
The first two points are probably more important at present. The system is relatively stable in terms of processing time, memory and disk space requirements. The maintenance, usage and development aspects are the more pressing problems.

As in option 2, replacing the frontend of the Usage Reporting System with a shared module would not address the outstanding problems with the rest of the system (ad hoc data representation, use of Perl scripts, etc.). These would still need to be addressed to ease development, maintenance and everyday administration. The arguments presented in the Some thoughts ... document for a database-based system would still apply. A variation on the Some thoughts ... proposal (from FreddieQuek) is to store relatively unprocessed data from the weblog parser in the database, rather than usage counts, and use a report generator such as Business Objects to generate customer-report web pages on demand.
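
To make the variation concrete, the sketch below stores one row per access and derives the counts only when a report is requested. It uses Python's standard `sqlite3` module purely for illustration; the table layout, the column names and the functions `open_store`, `record_access` and `monthly_report` are assumptions, and a plain SQL query stands in for a report generator such as Business Objects.

```python
import sqlite3
from typing import List, Tuple


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create (if needed) a table holding one row per access, not counts."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS access ("
        " customer TEXT, document TEXT, manner TEXT, month TEXT)"
    )
    return conn


def record_access(conn: sqlite3.Connection, customer: str,
                  document: str, manner: str, month: str) -> None:
    """Store a single access as produced by the weblog parser."""
    conn.execute(
        "INSERT INTO access VALUES (?, ?, ?, ?)",
        (customer, document, manner, month),
    )


def monthly_report(conn: sqlite3.Connection,
                   customer: str) -> List[Tuple[str, str, int]]:
    """Count full-text accesses per month and document, on demand."""
    cur = conn.execute(
        "SELECT month, document, COUNT(*) FROM access"
        " WHERE customer = ? AND manner = 'fulltext'"
        " GROUP BY month, document ORDER BY month, document",
        (customer,),
    )
    return cur.fetchall()
```

Keeping the rows unaggregated means a change to the report definition needs only a new query, not a re-parse of the weblogs, at the cost of a larger table.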

Conclusion

The most promising way forward seems to be the third option: we retain a separate Usage Reporting System, but one that shares a weblog-parsing frontend with the Data Warehouse and ideally has a re-designed backend.