Future of Usage Reporting System
Will the Usage Reporting System have a future when the Data Warehouse is operational?
Introduction
When the Wiley InterScience Data Warehouse was proposed two years ago, it was planned that it would take over some of the work of the Usage Reporting System (see Wiley InterScience Data Warehouse - WPS Initiated Usage Reporting Requirements). The issue of whether it would ever totally
replace the Usage Reporting System was raised, but not, so far as I
know, ever resolved. Now that the Data Warehouse is about to become
operational it seems opportune to re-visit this issue.
Current state of Usage Reporting System
Users and generated reports
The Usage Reporting System currently generates reports for two sets of users: Wiley customers and Wiley employees.
The reports for Wiley employees are used mainly for business purposes. The Data Warehouse is intended to replace the Usage Reporting System for these purposes.
The reports for Wiley customers are provided so that they can monitor their licensed use of the InterScience web site. It is a contractual requirement of the license that these reports are provided, and the content and form of the reports have to satisfy the COUNTER Code of Practice. The Data Warehouse has not been designed to replace the Usage Reporting System for this purpose.
So the issue under discussion is how to provide customer reports in the future.
Existing problems
Recent work on the Usage Reporting System has overcome immediate problems with memory use and processing time, and has automated normal running of the system.
There are, however, still problems with the system. These are discussed in detail elsewhere, but they include:
- The administrative task of maintaining the mapping of URLs from weblogs is time-consuming and error-prone
- Parsing of weblogs is done repeatedly, which is wasteful of resources
- The form in which usage reporting data is stored is ad-hoc, making it very difficult and error-prone to present it in different ways
- The use of Perl scripts for presenting the data also makes maintenance and development difficult
Processing in common with Data Warehouse
As noted above, the Usage Reporting System repeatedly parses the same weblog files from the InterScience website. The Data Warehouse also independently parses these weblog
files.
There are two main sets of weblogs that are processed, usually referred to as "journal" and "MRW" logs.
- The journal log is typically around 2,000,000 lines/day
- The MRW log is typically around 300,000 lines/day
For the Usage Reporting System:
- The journal log is processed 3 times (once for each of the 3 products: journal, book, and cochrane).
- The MRW log is processed over 50 times (once for each of over 50 products).
The Data Warehouse processes both logs once.
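To put these figures side by side, the following rough calculation totals the number of lines each system reads per day. It is only a sketch: the line counts are the approximate values quoted above, and "over 50" is taken as exactly 50, so the Usage Reporting System total is a lower bound.

```python
# Approximate daily log sizes quoted above.
JOURNAL_LINES_PER_DAY = 2_000_000
MRW_LINES_PER_DAY = 300_000

# Number of passes made by the Usage Reporting System.
JOURNAL_PASSES = 3   # journal, book, cochrane
MRW_PASSES = 50      # assumption: "over 50" products taken as exactly 50

usage_reporting_lines = (JOURNAL_LINES_PER_DAY * JOURNAL_PASSES
                         + MRW_LINES_PER_DAY * MRW_PASSES)

# The Data Warehouse reads each log once.
data_warehouse_lines = JOURNAL_LINES_PER_DAY + MRW_LINES_PER_DAY

print(usage_reporting_lines)   # 21000000
print(data_warehouse_lines)    # 2300000
print(round(usage_reporting_lines / data_warehouse_lines, 1))  # 9.1
```

On these assumptions the Usage Reporting System reads roughly nine times as many lines per day as the Data Warehouse, most of them from the repeated MRW passes.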
Both systems perform the following main steps for each weblog line:
- Split the line into fields (IP address, date/time, request URL, etc.)
- Match the request URL against a set of patterns to determine whether the line is of interest
- Use the request URL match to determine what document was accessed, and in what manner
The patterns used in the second and third steps are currently different for the two systems. However, they should, in theory, have a lot in common since they are being used to identify accesses to the same documents. Even if the patterns themselves cannot be combined, the code that does the matching certainly can be.
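The three steps above can be sketched as a single function that both systems could call. This is an illustration only: the log layout is assumed to be the Common Log Format, and the URL patterns, the `Access` record and the `parse_line` function are hypothetical names invented for the example, not taken from either system.

```python
import re
from typing import NamedTuple, Optional

# Assumed log layout (Common Log Format); the real InterScience logs may differ.
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+'
)

# Hypothetical URL patterns: (compiled pattern, manner of access).
URL_PATTERNS = [
    (re.compile(r'^/journal/(?P<doc>\d+)/abstract$'), "abstract"),
    (re.compile(r'^/journal/(?P<doc>\d+)/fulltext$'), "fulltext"),
]


class Access(NamedTuple):
    ip: str
    when: str
    document: str
    access_type: str


def parse_line(line: str) -> Optional[Access]:
    """Return an Access record, or None if the line is of no interest."""
    # Step 1: split the line into fields.
    fields = LINE_RE.match(line)
    if fields is None:
        return None
    # Step 2: match the request URL against the set of patterns.
    for pattern, access_type in URL_PATTERNS:
        hit = pattern.match(fields.group("url"))
        if hit:
            # Step 3: the match says which document was accessed, and how.
            return Access(fields.group("ip"), fields.group("when"),
                          hit.group("doc"), access_type)
    return None


sample = ('192.0.2.1 - - [10/Oct/2005:13:55:36 +0000] '
          '"GET /journal/12345/fulltext HTTP/1.1" 200 2326')
rec = parse_line(sample)
print(rec.document, rec.access_type)  # 12345 fulltext
```

Keeping the patterns in a plain table, separate from the matching code, is what would allow the code to be shared even if each system kept its own pattern set.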
Options for future
There are 3 main options for generating customer reports in the future:
- Use the Data Warehouse - so the Usage Reporting System would no longer be needed
- Continue to use the Usage Reporting System, and maintain it as a totally separate system from the Data Warehouse
- Continue to use the Usage Reporting System, but share the processing of weblogs with the Data Warehouse
These options are discussed in more detail below.
Option 1: Data Warehouse completely replaces Usage Reporting System
This option has the obvious advantage that there is only one system to maintain.
However, the current Data Warehouse has not been designed to provide
customer reports, and it is not clear how easy it would be to do so. In
particular:
- The existing Data Warehouse design seems to be intended for
generating a small number of in-depth reports for a small number of
users. For customer usage reporting, we need to generate a large number
of simple reports for thousands of customers. This may not be feasible
given the large amount of data stored in the Data Warehouse.
- The Data Warehouse appears not to comply with the COUNTER Code of
Practice in various ways, such as its handling of double-clicks,
sessions, identification of customer licenses, and historic naming
of documents. It is not clear whether a future version of the Data
Warehouse could ever be COUNTER compliant and also satisfy internal business
requirements.
- Customer reports need to be presented as web pages on the Wiley
InterScience website. For efficiency reasons, it would probably be
necessary to copy the relevant data from the Data Warehouse and store
it on the Wiley InterScience web-server. So this part of the system
would still be separate.
Option 2: Usage Reporting System retained as totally separate system
The only advantage to this option is that since the system is
already separate, nothing needs doing.
Obviously this option does not take
advantage of any opportunity to share development and maintenance with
the Data Warehouse.
Also, as a separate system, existing problems specific to
Usage Reporting would still need to be addressed.
Option 3: Usage Reporting System and Data Warehouse share front-end
In this option, a common module would perform weblog processing for both the Usage Reporting System and the Data Warehouse.
Data output from this module would be passed to both the Usage Reporting System and the Data Warehouse for the next stage in processing.
The hand-over might be accomplished dynamically, or the data could be stored in a database to be retrieved by each system.
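A minimal sketch of the dynamic variant is given below, in which each line is parsed once and the resulting record is handed to every downstream system. All names here (`run_frontend`, `toy_parse`, the two input lists and the sample URLs) are hypothetical stand-ins; the real parser and the real interfaces of the two systems would replace them.

```python
from typing import Callable, Iterable, List, Optional

Record = dict
Parser = Callable[[str], Optional[Record]]
Consumer = Callable[[Record], None]


def run_frontend(lines: Iterable[str], parse: Parser,
                 consumers: List[Consumer]) -> int:
    """Parse each line once and pass the record to every consumer.

    Returns the number of lines that produced a record.
    """
    count = 0
    for line in lines:
        record = parse(line)
        if record is None:
            continue  # line is of no interest to either system
        for consume in consumers:
            consume(record)
        count += 1
    return count


# Stand-ins for the next processing stage of each system.
usage_reporting_input: List[Record] = []
data_warehouse_input: List[Record] = []


def toy_parse(line: str) -> Optional[Record]:
    """Placeholder parser: expects 'ip url' separated by whitespace."""
    fields = line.split()
    if len(fields) < 2:
        return None
    return {"ip": fields[0], "url": fields[1]}


n = run_frontend(
    ["192.0.2.1 /journal/1/abstract", "", "192.0.2.2 /mrw/2/article"],
    toy_parse,
    [usage_reporting_input.append, data_warehouse_input.append],
)
print(n)  # 2
```

The stored variant would differ only in the consumers: a single consumer would write each record to a database table, and each system would read from that table on its own schedule.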
If we were to introduce a common weblog-parsing frontend, this would:
- Make development and extension of the
systems easier by virtue of the modularity of the weblog parser. That
is, the weblog parsing would be independent of the further processing done by
the two systems.
- Reduce maintenance. Code maintenance
would be reduced as there would be only one parser to maintain.
Administrative procedures such as adding or modifying URL patterns
would also be easier where the patterns are common to both systems.
- Reduce parsing time. The basic parsing of
each weblog line would only be done once. URL pattern-matching time
would also be reduced, but by how much depends on how much the patterns
have in common.
The first two points are probably more important at present. The system is
relatively stable in terms of processing time, memory and disk space
requirements, so the maintenance, usage and
development aspects are the more pressing problems.
As in option 2, existing problems specific to the Usage
Reporting System (e.g. ad hoc data representation, use of Perl
scripts, etc.) still need to be solved to ease development,
maintenance and everyday administration tasks.
The document Some thoughts on the design of the Usage Reporting System contains proposals that address these and other problems.
Conclusion
The most promising way forward seems to be the
third option: we retain a separate Usage Reporting System, but one that
shares a weblog-parsing frontend with the Data Warehouse and
ideally has a re-designed backend.