Future of Usage Reporting System

Will the Usage Reporting System have a future when the Data Warehouse is operational?

Introduction

When the Wiley InterScience Data Warehouse was proposed two years ago, it was planned that it would take over some of the work of the Usage Reporting System (see Wiley InterScience Data Warehouse - WPS Initiated Usage Reporting Requirements). The issue of whether it would ever totally replace the Usage Reporting System was raised, but not, so far as I know, ever resolved. Now that the Data Warehouse is about to become operational, it seems opportune to revisit this issue.

Current state of Usage Reporting System

Users and generated reports

The Usage Reporting System currently generates reports for two sets of users: Wiley customers and Wiley employees.

The reports for Wiley employees are used mainly for business purposes. The Data Warehouse is intended to replace the Usage Reporting System for these purposes.

The reports for Wiley customers are provided so that they can monitor their licensed use of the InterScience web site. It is a contractual requirement of the license that these reports are provided, and the content and form of the reports have to satisfy the COUNTER Code of Practice. The Data Warehouse has not been designed to replace the Usage Reporting System for this purpose.

So the issue under discussion is how to provide customer reports in the future.

Existing problems

Recent work on the Usage Reporting System has overcome immediate problems with memory use and processing time, and has automated normal running of the system.

The system still has problems, however. They are discussed in detail elsewhere, but they include:

The administrative task of maintaining the mapping of URLs from weblogs is time-consuming and error-prone
Parsing of weblogs is done repeatedly, which is wasteful of resources
The form in which usage reporting data is stored is ad-hoc, making it very difficult and error-prone to present it in different ways
The use of Perl scripts for presenting the data also makes maintenance and development difficult

Processing in common with Data Warehouse

As noted above, the Usage Reporting System repeatedly parses the same weblog files from the InterScience website. The Data Warehouse also independently parses these weblog files.

There are two main sets of web logs that are processed - usually referred to as "journal" and "MRW" logs.

The journal log is typically around 2,000,000 lines/day
The MRW log is typically around 300,000 lines/day

For the Usage Reporting System:

The journal log is processed 3 times (once for each of the 3 products: journal, book, and cochrane).
The MRW log is processed over 50 times (once for each of over 50 products).

The Data Warehouse processes both logs once.
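
As a rough illustration of the difference, the figures above imply the following daily line counts. This is a back-of-envelope sketch only: it takes "over 50" MRW products as exactly 50, so the Usage Reporting System total is a lower bound.

```python
# Approximate daily log volumes quoted above.
JOURNAL_LINES = 2_000_000
MRW_LINES = 300_000

# Number of passes the Usage Reporting System makes over each log.
# MRW_PASSES is an assumption: "over 50" is taken as exactly 50.
JOURNAL_PASSES = 3
MRW_PASSES = 50

# Lines parsed per day by each system.
urs_lines = JOURNAL_LINES * JOURNAL_PASSES + MRW_LINES * MRW_PASSES
dw_lines = JOURNAL_LINES + MRW_LINES  # each log is parsed once

print(f"Usage Reporting System: {urs_lines:,} lines parsed per day")
print(f"Data Warehouse:         {dw_lines:,} lines parsed per day")
print(f"Ratio:                  {urs_lines / dw_lines:.1f}x")
```

On these figures the Usage Reporting System parses at least 21,000,000 lines a day against 2,300,000 for the Data Warehouse, roughly nine times as many.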

Both systems perform the following main steps for each weblog line:

Split the line into fields (IP address, date/time, request URL, etc.)
Match the request URL against a set of patterns to determine whether the line is of interest
Use the request URL match to determine what document was accessed, and in what manner

The patterns used in the second and third steps are currently different for the two systems. However, they should, in theory, have a lot in common, since they are being used to identify accesses to the same documents. Even if the patterns themselves cannot be combined, the code that does the matching certainly can be.
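
The three steps listed above can be sketched as follows. This is a minimal illustration, not the code of either system: the log layout is assumed to resemble the Apache Common Log Format, and the URL patterns, the Access type and the parse_line function are all hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Assumption: the weblogs resemble Apache Common Log Format.
# The real InterScience log layout may differ.
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+'
)

class Access(NamedTuple):
    ip: str
    when: str
    document: str  # which document was accessed
    manner: str    # how it was accessed (abstract, pdf, ...)

# Hypothetical URL patterns; each system could supply its own list.
PATTERNS = [
    (re.compile(r'^/journal/(?P<doc>\d+)/abstract$'), "abstract"),
    (re.compile(r'^/journal/(?P<doc>\d+)/fulltext\.pdf$'), "pdf"),
]

def parse_line(line: str, patterns=PATTERNS) -> Optional[Access]:
    # Step 1: split the line into fields.
    fields = LINE_RE.match(line)
    if fields is None:
        return None
    # Step 2: match the request URL against the patterns.
    for pattern, manner in patterns:
        hit = pattern.match(fields["url"])
        if hit:
            # Step 3: the match identifies the document and the manner of access.
            return Access(fields["ip"], fields["when"], hit["doc"], manner)
    # No pattern matched: the line is not of interest.
    return None
```

Because the pattern list is a parameter, the two systems could call the same matching code with different patterns, which is the kind of sharing suggested above.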

Options for future

There are 3 main options for generating customer reports in the future:

Use the Data Warehouse - so the Usage Reporting System would no longer be needed
Continue to use the Usage Reporting System, and maintain it as a totally separate system from the Data Warehouse
Continue to use the Usage Reporting System, but share the processing of weblogs with the Data Warehouse

These options are discussed in more detail below.

Option 1: Data Warehouse completely replaces Usage Reporting System

This option has the obvious advantage that there is only one system to maintain.

However, the current Data Warehouse has not been designed to provide customer reports, and it is not clear how easy it would be to do so. In particular:

The existing Data Warehouse design seems to be intended for generating a small number of in-depth reports for a small number of users. For customer usage reporting, we need to generate a large number of simple reports for thousands of customers. This may not be feasible given the large amount of data stored in the Data Warehouse.
The Data Warehouse appears not to comply with the COUNTER Code of Practice in various ways, such as its handling of double-clicks, sessions, identification of customer licenses, and historic naming of documents. It is not clear whether a future version of the Data Warehouse could ever be COUNTER compliant and also satisfy internal business requirements.
Customer reports need to be presented as web pages on the Wiley InterScience website. For efficiency reasons, it would probably be necessary to copy the relevant data from the Data Warehouse and store it on the Wiley InterScience web-server. So this part of the system would still be separate.

Option 2: Usage Reporting System retained as totally separate system

The only advantage of this option is that, since the system is already separate, no work is needed.

Obviously this option does not take advantage of any opportunity to share development and maintenance with the Data Warehouse.

Also, as a separate system, existing problems specific to Usage Reporting would still need to be addressed.

Option 3: Usage Reporting System and Data Warehouse share front-end

In this option, a common module would perform weblog processing for both the Usage Reporting System and the Data Warehouse.

Data output from this module would be passed to both the Usage Reporting System and the Data Warehouse for the next stage in processing. The data might be handed over dynamically as it is produced, or it could be stored in a database to be retrieved by each system.
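
The dynamic variant of this arrangement might look something like the sketch below. All names here (run_front_end, toy_parse and the two row lists) are hypothetical stand-ins, and the parser is deliberately trivial; the point is only that each line is parsed once and the result is handed to every downstream system.

```python
from typing import Callable, Iterable, List, Optional

Record = dict
Parser = Callable[[str], Optional[Record]]
Consumer = Callable[[Record], None]

def run_front_end(lines: Iterable[str], parse: Parser,
                  consumers: List[Consumer]) -> int:
    """Parse each line once and hand the result to every consumer."""
    delivered = 0
    for line in lines:
        record = parse(line)
        if record is None:
            continue  # line is not of interest to either system
        for consume in consumers:
            consume(record)
        delivered += 1
    return delivered

# Stand-in parser: "<ip> <url>" lines, keeping only journal URLs.
def toy_parse(line: str) -> Optional[Record]:
    ip, _, url = line.partition(" ")
    return {"ip": ip, "url": url} if url.startswith("/journal/") else None

# Stand-ins for the next stage of each system.
usage_reporting_rows: List[Record] = []
data_warehouse_rows: List[Record] = []

count = run_front_end(
    ["192.0.2.1 /journal/1/abstract", "192.0.2.2 /images/logo.gif"],
    toy_parse,
    [usage_reporting_rows.append, data_warehouse_rows.append],
)
```

For the stored variant, a single consumer could write each record to a database table that both systems read later; the front end itself would not need to change.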

If we were to introduce a common weblog-parsing frontend, this would:

Make development and extension of the systems easier by virtue of the modularity of the weblog parser. That is, the weblog parsing would be independent of the further processing done by the two systems.
Reduce maintenance. Code maintenance would be reduced, as there would be only one parser to maintain. Administrative procedures such as adding or modifying URL patterns would also be easier wherever the patterns are shared.
Reduce parsing time. The basic parsing of each weblog line would only be done once. URL pattern-matching time would also be reduced, but by how much depends on how much the patterns have in common.

The first two points are probably more important at present. The system is relatively stable in terms of processing time, memory and disk space requirements. The maintenance, usage and development aspects are the more pressing problems.

As in option 2, existing problems specific to the Usage Reporting System (e.g. ad hoc data representation, use of Perl scripts, etc.) still need to be solved to ease development, maintenance and everyday administration tasks. The document Some thoughts on the design of the Usage Reporting System contains proposals that address these and other problems.

Conclusion

The most promising way forward seems to be the third option: we retain a separate Usage Reporting System, but one that shares a weblog-parsing frontend with the Data Warehouse and ideally has a re-designed backend.