Future of Usage Reporting System
Will the Usage Reporting System have a future when the Data Warehouse
is operational?
Introduction
When the Wiley InterScience Data Warehouse was proposed two years ago,
it was planned that it would take over some of the work of the Usage
Reporting System (see Wiley
InterScience Data Warehouse - WPS Initiated Usage Reporting Requirements).
The issue of whether it would ever totally
replace the Usage Reporting System was raised, but not, so far as I
know, ever resolved. Now that the Data Warehouse is about to become
operational, it seems opportune to revisit this issue.
Current state of Usage Reporting System
Users and generated reports
The Usage Reporting System currently generates reports for two sets of
users: Wiley customers and Wiley employees.
The
reports for Wiley employees are used mainly for business purposes. The
Data Warehouse is intended to replace the Usage Reporting System for
these purposes.
The reports for Wiley customers are provided for
them to monitor their licensed use of the InterScience web site. It is
a contractual requirement of the license that these reports are
provided, and the content and form of the reports have to satisfy the COUNTER Code of Practice.
The Data Warehouse has not
been designed to replace the Usage Reporting System for this purpose.
So the issue under discussion is how to provide customer reports
in the future.
Outstanding problems
Recent work on the Usage Reporting System has overcome immediate problems with memory use and
processing time, and has automated normal running of
the system.
There are, however, still problems with the system. These are discussed
elsewhere, but some of
the problems are:
- The administrative task of maintaining the mapping of URLs from
weblogs is time-consuming and error-prone
- Parsing of weblogs is done repeatedly, which is wasteful of
resources
- The
form in which usage reporting data is stored is ad-hoc, making it very
difficult and error-prone to present it in different ways
- The use of Perl scripts for presenting the data also makes
maintenance and development difficult
The document Some
thoughts on the design of the Usage Reporting System proposes a
replacement of the Usage Reporting System that would address these and
other problems.
Processing in common with Data Warehouse
As noted above, the Usage Reporting System repeatedly parses the same
weblog files from the InterScience website. The
Data Warehouse also independently parses these weblog
files.
There are two
main sets of web logs that are processed - usually referred to as
"journal" and "MRW" logs.
- The journal log is typically around
2,000,000 lines/day
- The MRW log is typically around 300,000
lines/day
For the Usage Reporting System:
- The journal log is processed 3 times (for
each of the 3
products: journal, book, and cochrane).
- The MRW log is processed over 50 times
(once for each of over 50 products).
The Data Warehouse processes both
logs once.
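As a rough illustration of the duplicated work, the figures above can be turned into line-parses per day. This is a back-of-envelope sketch only: it treats "over 50" MRW products as exactly 50, so the Usage Reporting System total is a lower bound, and the variable names are made up for the example.

```python
# Typical daily log sizes quoted above (lines/day).
JOURNAL_LINES = 2_000_000
MRW_LINES = 300_000

# Number of passes over each log.
# ASSUMPTION: "over 50" MRW products is taken as exactly 50 here,
# so the Usage Reporting System figure is a lower bound.
URS_JOURNAL_PASSES = 3
URS_MRW_PASSES = 50
DW_PASSES = 1

urs_lines_parsed = JOURNAL_LINES * URS_JOURNAL_PASSES + MRW_LINES * URS_MRW_PASSES
dw_lines_parsed = (JOURNAL_LINES + MRW_LINES) * DW_PASSES

print(f"Usage Reporting System: at least {urs_lines_parsed:,} line-parses/day")
print(f"Data Warehouse:         {dw_lines_parsed:,} line-parses/day")
print(f"Ratio:                  about {urs_lines_parsed / dw_lines_parsed:.1f}x")
```

On these figures the Usage Reporting System parses at least 21,000,000 lines a day against 2,300,000 for the Data Warehouse, and most of the difference comes from the repeated passes over the MRW log.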
Both systems perform the following main steps for each weblog line:
- Split the line into fields (IP address, date/time, request URL,
etc.)
- Match the request URL against a set of patterns to determine
whether the line is of interest
- Use the request URL match to determine what document was
accessed, and in what manner
The patterns used in the second and third steps are currently different for the two
systems. However, they should, in theory, have a lot in common since
they are being used to identify accesses to the same documents. Even if
the patterns themselves cannot be combined, the code that does the
matching certainly can be.
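The three steps described above might be sketched as follows. This is only an illustration, not the code of either system: it assumes the weblogs are in Common Log Format, and the URL patterns, the Access record and the parse_line function are all hypothetical.

```python
import re
from typing import NamedTuple, Optional

# ASSUMPTION: Common Log Format. The real InterScience log layout may differ.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+'
)

# HYPOTHETICAL URL patterns: each pairs a regex with the manner of access.
URL_PATTERNS = [
    (re.compile(r"^/journal/(?P<doc>\d+)/abstract$"), "abstract"),
    (re.compile(r"^/journal/(?P<doc>\d+)/fulltext\.pdf$"), "fulltext-pdf"),
]


class Access(NamedTuple):
    ip: str
    when: str
    document: str
    manner: str


def parse_line(line: str) -> Optional[Access]:
    """Return an Access for a line of interest, or None otherwise."""
    # Step 1: split the line into fields.
    fields = LOG_LINE.match(line)
    if fields is None:
        return None
    # Step 2: match the request URL against the set of patterns.
    for pattern, manner in URL_PATTERNS:
        hit = pattern.match(fields["url"])
        if hit:
            # Step 3: the match says which document was accessed, and how.
            return Access(fields["ip"], fields["when"], hit["doc"], manner)
    return None  # URL matched no pattern, so the line is not of interest.


sample = ('192.0.2.1 - - [10/Oct/2005:13:55:36 -0700] '
          '"GET /journal/12345/abstract HTTP/1.0" 200 2326')
print(parse_line(sample))
```

In a sketch like this the matching code is independent of the pattern table, which is the point made above: even if the two systems kept separate pattern sets, they could share parse_line.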
Options for future
There are 3 main options for generating customer reports in the future:
- Use the Data Warehouse - so the Usage Reporting System would no
longer be needed
- Continue to use the Usage Reporting System, and maintain it as a
totally separate system from the Data Warehouse
- Continue to use the Usage Reporting System, but share the
processing of weblogs with the Data Warehouse
These options are discussed in more detail below.
Data Warehouse completely replaces Usage Reporting System
This option has the obvious advantage that there is only one system to
maintain.
However, the current Data Warehouse has not been designed to provide
customer reports, and it is not clear how easy it would be to do so. In
particular:
- The existing Data Warehouse design seems to be intended for
generating a small number of in-depth reports for a small number of
users. For customer usage reporting, we need to generate a large number
of simple reports for thousands of customers. This may not be feasible
given the large amount of data stored in the Data Warehouse.
- The Data Warehouse appears not to comply with the COUNTER Code of
Practice in various ways, such as its handling of double-clicks,
sessions, identification of customer licenses, and historic naming
of documents. It is open to debate
whether a future version of the Data
Warehouse could be COUNTER compliant and also satisfy internal business
requirements.
- Customer reports need to be presented as web pages on the Wiley
InterScience website. For efficiency reasons, it would probably be
necessary to copy the relevant data from the Data Warehouse and store
it on the Wiley InterScience web-server. So this part of the system
would still be separate.
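To make the double-click point in the list above concrete, the kind of filtering involved might look like the sketch below. The function name, the record layout and the 10-second window are assumptions for illustration only; the actual rule and window lengths are those defined by the COUNTER Code of Practice.

```python
from typing import Dict, Iterable, List, Tuple

# HYPOTHETICAL record layout: (user or session key, URL, timestamp in seconds).
Request = Tuple[str, str, float]


def collapse_double_clicks(requests: Iterable[Request],
                           window_seconds: float) -> List[Request]:
    """Keep a request only if the same user did not request the same URL
    within window_seconds before it. A chain of rapid repeats collapses
    into its first request."""
    last_seen: Dict[Tuple[str, str], float] = {}
    kept: List[Request] = []
    for user, url, ts in sorted(requests, key=lambda r: r[2]):
        key = (user, url)
        previous = last_seen.get(key)
        if previous is None or ts - previous > window_seconds:
            kept.append((user, url, ts))
        # Always record the latest click, so chained repeats keep collapsing.
        last_seen[key] = ts
    return kept


clicks = [("u", "/a", 0.0), ("u", "/a", 5.0), ("u", "/a", 40.0), ("v", "/a", 3.0)]
# ASSUMPTION: a 10-second window, used here purely as a placeholder value.
print(collapse_double_clicks(clicks, window_seconds=10.0))
```

Here the second click by user "u" is dropped, while the click 40 seconds later and the click by user "v" are both counted.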
Usage Reporting System retained as totally separate system
The only advantage to this option is that it does not require any
immediate development, since this is the current state of affairs.
Obviously this option does not take
advantage of any opportunity to share development and maintenance with
the Data Warehouse.
Also, the existing problems specific to the
Usage Reporting System would still need to be addressed.
Usage Reporting System and Data Warehouse share front-end
In this option, a common module would perform
weblog processing for both the Usage Reporting System and the Data
Warehouse.
Data output from this module would be passed
to both the
Usage Reporting System and the Data Warehouse for the next stage in
processing.
The data might be passed on dynamically, or it could be stored in a
database to be retrieved by each system.
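The "dynamic" variant of this arrangement could be sketched roughly as below: each weblog line is parsed once and the resulting record is handed to every registered consumer. Everything here is hypothetical, including the simplified "document manner" line format, the parse stand-in, and the two consumers standing in for the Usage Reporting System and the Data Warehouse.

```python
from collections import Counter
from typing import Callable, Iterable, List, Optional, Tuple

# HYPOTHETICAL parsed record: (document id, manner of access).
Record = Tuple[str, str]


def parse(line: str) -> Optional[Record]:
    """Stand-in for the shared weblog parser. For brevity each line is just
    'document manner'; blank lines and '#' lines are 'not of interest'."""
    if not line.strip() or line.startswith("#"):
        return None
    doc, manner = line.split()
    return doc, manner


def run_frontend(lines: Iterable[str],
                 consumers: List[Callable[[Record], None]]) -> int:
    """Parse each line once and pass the record to every consumer.
    Returns the number of lines of interest."""
    parsed = 0
    for line in lines:
        record = parse(line)
        if record is None:
            continue
        parsed += 1
        for consume in consumers:
            consume(record)
    return parsed


usage_counts: Counter = Counter()      # stand-in for the Usage Reporting System
warehouse_rows: List[Record] = []      # stand-in for the Data Warehouse loader

n = run_frontend(
    ["J1 abstract", "# noise", "J1 pdf", "J2 abstract"],
    [lambda rec: usage_counts.update([rec[0]]), warehouse_rows.append],
)
print(n, dict(usage_counts), warehouse_rows)
```

The stored variant would differ only in the consumers: a single consumer would write each record to a database, and each system would read from there at its own pace.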
If we were to introduce a common
weblog-parsing frontend, this would:
- Make development and extension of the
systems easier by virtue of the modularity of the weblog parser. That
is, the weblog parsing would be independent of the further processing
done by
the two systems.
- Reduce maintenance. Code maintenance
would be reduced as there is only one parser to maintain.
Administrative procedures such as adding or modifying URL patterns
would also be easier where they are in common.
- Reduce parsing time. The basic parsing of
each weblog line would only be done once. URL pattern-matching time
would also be reduced, but by how much depends on how much the patterns
have in common.
The first two points are probably more important at present. The system is
relatively stable in terms of processing time, memory and disk space
requirements. The maintenance, usage and
development aspects are the more pressing problems.
As in option 2, replacing the frontend of the Usage Reporting System with
a shared module would not address the outstanding problems with the
rest of the
system (ad hoc data representation, use of Perl scripts, etc.). These
still need to be addressed to ease development, maintenance and
everyday administration tasks. The
arguments presented in the Some
thoughts ... document for a database-based system would still
apply. A variation on the Some
thoughts ... proposal
(from FreddieQuek)
is to store relatively unprocessed data from the weblog parser in the
database, rather than usage counts, and use a report-generator such as
Business Objects to generate customer-report web-pages on demand.
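The idea of storing access records rather than usage counts, and counting only when a report is requested, can be illustrated with a small sketch. It uses an in-memory SQLite database and a plain SQL query as a stand-in for the report-generator; the table layout, customer names, dates and the monthly_report function are all invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# HYPOTHETICAL schema: one row per access, as emitted by the weblog parser.
conn.execute(
    "CREATE TABLE access (customer TEXT, document TEXT, manner TEXT, day TEXT)"
)
conn.executemany(
    "INSERT INTO access VALUES (?, ?, ?, ?)",
    [
        ("cust-A", "J1", "abstract", "2005-10-01"),
        ("cust-A", "J1", "abstract", "2005-10-02"),
        ("cust-A", "J1", "pdf", "2005-10-02"),
        ("cust-B", "J1", "pdf", "2005-10-03"),
        ("cust-A", "J2", "pdf", "2005-11-01"),
    ],
)


def monthly_report(db: sqlite3.Connection, customer: str, month: str):
    """Count one customer's accesses for a month ('YYYY-MM') on demand."""
    return db.execute(
        "SELECT document, manner, COUNT(*) FROM access "
        "WHERE customer = ? AND substr(day, 1, 7) = ? "
        "GROUP BY document, manner ORDER BY document, manner",
        (customer, month),
    ).fetchall()


print(monthly_report(conn, "cust-A", "2005-10"))
```

The trade-off is the one raised earlier for the Data Warehouse: counts are no longer precomputed, so the cost of each report depends on how much raw data has to be scanned.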
Conclusion
The most promising way forward seems to be the
third option: we retain a separate Usage Reporting System, but one that
shares a weblog-parsing frontend with the Data Warehouse and
ideally has a re-designed backend.