Report on Meeting with regard to the Integration of UsageReporting with
DataWarehousing
Background
UsageReporting has two parts:
- Business - internal business requirements
- Customer - Counter requirements
Historical Background
The original aims with regard to fully integrating UsageReporting
with DataWarehousing are unclear. It seems likely that full integration
was never considered.
Current Status
Currently the DataWarehousing system appears to be supporting the
internal Business requirements for Usage Reports.
The Counter requirements for Customers, however, are not supported:
The design of the DataWarehousing system does not currently include
the data needed to support UsageReporting, such as Session Counts, or
appropriate concepts of Customer.
Additionally, its design apparently precludes the incorporation of some
features in the foreseeable future, such as storing journals with old
and new titles.
It also seems probable that a single design could not have served the
two independent tasks: the UsageReporting need for hundreds of reports
daily and the DataWarehousing need for fewer, longer reports.
The historical data requirement might also be incompatible with the
combined needs.
Future
Weblog Parser Module
This module would do the weblog parsing for both DataWarehousing and
UsageReporting, rather than each doing the parsing independently.
A common weblog-parsing frontend would:
- Make development and extension of the systems easier by virtue of the
modularity of the weblog parser. That is, the weblog parsing is
independent of the further processing done by the two systems.
- Reduce maintenance. Code maintenance would be reduced as there would
be only one parser to maintain. Administrative procedures such as
adding or modifying URL patterns would also be easier where they are in
common.
- Reduce parsing time. The basic parsing of each weblog line would only
be done once. URL pattern-matching time would also be reduced, but how
much depends on how much the patterns have in common.
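The idea of a common frontend can be sketched as follows. This is only an illustration under stated assumptions, not a description of either existing system: the log line layout, the URL patterns, and the names parse_line, parse_log and usage_reporting are all hypothetical, since the real weblog format and patterns are not given in this report. Each line is parsed and matched against the URL patterns once, and the resulting record is handed to both systems.

```python
import re
from collections import Counter

# Hypothetical line layout (Common Log Format style); the real weblog
# format is not described in this report.
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3})'
)

# Hypothetical URL patterns, held in one place for both systems.
URL_PATTERNS = {
    "journal": re.compile(r"^/journal/"),
    "book": re.compile(r"^/book/"),
}


def parse_line(line):
    """Parse one weblog line into a dict, or return None if it is malformed."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Classify the URL once; both systems reuse the result.
    record["product"] = next(
        (name for name, pattern in URL_PATTERNS.items()
         if pattern.search(record["url"])),
        None,
    )
    return record


def parse_log(lines, consumers):
    """Parse each line exactly once and hand the record to every consumer."""
    for line in lines:
        record = parse_line(line)
        if record is not None:
            for consume in consumers:
                consume(record)


# Stand-ins for the two downstream systems.
usage_counts = Counter()
warehouse_rows = []


def usage_reporting(record):
    """Count requests per product, ignoring URLs that match no pattern."""
    if record["product"] is not None:
        usage_counts[record["product"]] += 1


# Made-up sample lines for illustration only.
sample = [
    '10.0.0.1 - - [01/Jan/2000:00:00:01 +0000] "GET /journal/1/abstract HTTP/1.0" 200',
    '10.0.0.2 - - [01/Jan/2000:00:00:02 +0000] "GET /about HTTP/1.0" 200',
    'not a weblog line',
]
parse_log(sample, [usage_reporting, warehouse_rows.append])
```

Adding or changing a URL pattern then means editing one table, and a new downstream system only has to supply another consumer function.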
There are two main sets of web logs that are processed, usually
referred to as "journal" and "MRW" logs:
- A journal log is typically around 2,000,000 lines/day.
- The MRW log is typically around 300,000 lines/day.
For the Usage Reporting System:
- A journal log is processed 3 times (once for each of the 3 products:
journal, book, and cochrane).
- The MRW log is processed over 50 times (once for each of over 50
products).
For the Data Warehousing System:
- Both logs are processed once.
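As a rough worked example of what these figures imply, the number of basic line-parses per day can be estimated as below. The log sizes and pass counts are the approximate ones quoted above; "over 50" is taken as exactly 50, so the current total is a lower bound. The variable and function names are illustrative only.

```python
# Approximate daily log sizes quoted in this report.
JOURNAL_LINES = 2_000_000
MRW_LINES = 300_000

# Number of times each log is read by each system today.
# "Over 50" MRW passes is treated as 50, giving a lower bound.
usage_passes = {"journal": 3, "mrw": 50}
warehouse_passes = {"journal": 1, "mrw": 1}


def line_parses(passes):
    """Total lines parsed per day for a given number of passes per log."""
    return passes["journal"] * JOURNAL_LINES + passes["mrw"] * MRW_LINES


current = line_parses(usage_passes) + line_parses(warehouse_passes)
shared = JOURNAL_LINES + MRW_LINES  # each line parsed once by a common frontend

print(f"current: {current:,} line-parses/day")  # 23,300,000
print(f"shared:  {shared:,} line-parses/day")   # 2,300,000
```

On these figures the basic parsing work would fall by roughly a factor of ten. This counts line-parsing only; the per-product processing that follows would still have to be done.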
The first two benefits (easier development and reduced maintenance) are
probably more important at present: the system is relatively stable in
terms of processing time, memory and disk space requirements, whereas
maintenance, usage procedures and development are the more pressing
problems.
Data output from this module would be available to each module of
DataWarehousing and UsageReporting
for the next stage in processing.
This might be accomplished dynamically or it could be stored directly
in a database to be retrieved by each system.
The alternative is to modify the existing systems of each to accept the
data in its new form.
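The option of storing the parsed output in a database for each system to retrieve might look something like the sketch below. It uses SQLite purely for illustration; the choice of database, the table name parsed_hits, its columns, and the function names are all assumptions and were not decided at the meeting.

```python
import sqlite3


def store_records(conn, records):
    """Write parsed weblog records to a shared table (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS parsed_hits ("
        "log_date TEXT, customer_id TEXT, session_id TEXT, "
        "product TEXT, url TEXT)"
    )
    conn.executemany(
        "INSERT INTO parsed_hits VALUES "
        "(:log_date, :customer_id, :session_id, :product, :url)",
        records,
    )
    conn.commit()


def fetch_for_product(conn, product):
    """Retrieve the stored records one downstream system needs."""
    return conn.execute(
        "SELECT log_date, customer_id, url FROM parsed_hits "
        "WHERE product = ? ORDER BY rowid",
        (product,),
    ).fetchall()


# Made-up records standing in for the parser module's output.
conn = sqlite3.connect(":memory:")
store_records(conn, [
    {"log_date": "2000-01-01", "customer_id": "C1", "session_id": "s1",
     "product": "journal", "url": "/journal/1/abstract"},
    {"log_date": "2000-01-01", "customer_id": "C2", "session_id": "s2",
     "product": "book", "url": "/book/7/toc"},
])
journal_rows = fetch_for_product(conn, "journal")
```

Each system would then read from the shared table in its own time, rather than being coupled to the parser's run.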
Counter Report database
The current means of displaying the Customer reports is
by using Perl scripts that access Berkeley DB files in which the usage
reporting counts are stored. For reasons outlined in report .. ,
there are advantages to replacing the Berkeley DB files with a standard
database, and the Perl scripts that access them with an alternative.
Moreover, these two changes would need to be done together.
The database would be used for:
- Storing Counts, or computing them on the fly (discussion suggested
this should be fast enough). Whether the historical requirement could
also be met if the data is computed on the fly is not known.
- Storing the Customer and Document ID data required to fulfil the
Counter requirements (e.g. licenses, Customer IDs, Document IDs).
- Automatic updating of Customers and Documents.
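To make "computing on the fly" concrete, the sketch below derives request and session counts for one Customer directly from stored hits at report time, instead of reading pre-computed counts. SQLite, the hits table, its columns, and the sample rows are all hypothetical; whether this would be fast enough on real volumes was only suggested in discussion, not measured.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table of relatively unprocessed hits (not counts).
conn.execute(
    "CREATE TABLE hits ("
    "month TEXT, customer_id TEXT, session_id TEXT, document_id TEXT)"
)
# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO hits VALUES (?, ?, ?, ?)",
    [
        ("2000-01", "C1", "s1", "D1"),
        ("2000-01", "C1", "s1", "D2"),
        ("2000-01", "C1", "s2", "D1"),
        ("2000-02", "C1", "s3", "D1"),
        ("2000-01", "C2", "s4", "D1"),
    ],
)


def monthly_counts(conn, customer_id):
    """Compute requests and distinct sessions per month at query time."""
    return conn.execute(
        "SELECT month, COUNT(*) AS requests, "
        "COUNT(DISTINCT session_id) AS sessions "
        "FROM hits WHERE customer_id = ? "
        "GROUP BY month ORDER BY month",
        (customer_id,),
    ).fetchall()


report = monthly_counts(conn, "C1")
# report == [("2000-01", 3, 2), ("2000-02", 1, 1)]
```

Storing hits rather than counts keeps the historical data available for recomputation, at the cost of more storage and a query per report.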
Customer Reports could be generated by:
- Business objects from the database.
This would handle the majority of the remaining tasks
that the current system deals with. The remainder would fall into the
category of control processes, covering administrative tasks and the
user interface to the Counter Report system, in the module below.
Control Module
This would be a user interface for transferring counts
to the database on apollo, for handling user requests to restart
processing, and for dealing with unusual data. One aspect might be a
means of automating the entering of new URLs.
An additional feature would be to automate some of the monthly
administrative task of doing the QA on the statistics.
This would be developed in part through the regression tests needed to
test the Weblog Parser Module and the database modules.
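One possible form for an automated QA check is sketched below: compare each product's count with the previous month's and flag anything that changes by more than a chosen factor, or that appears or disappears. The function name flag_unusual, the threshold, and the sample figures are assumptions made for illustration; the actual monthly QA procedure is not described in this report.

```python
def flag_unusual(previous, current, max_ratio=2.0):
    """Return products whose monthly count changed by more than max_ratio,
    or which are present in only one of the two months."""
    flagged = []
    for product in sorted(set(previous) | set(current)):
        before = previous.get(product, 0)
        after = current.get(product, 0)
        if before == 0 or after == 0:
            # A product appearing or vanishing is always worth a look.
            if before != after:
                flagged.append(product)
        elif max(before, after) / min(before, after) > max_ratio:
            flagged.append(product)
    return flagged


# Made-up counts for two consecutive months.
last_month = {"journal": 1000, "book": 200, "cochrane": 50}
this_month = {"journal": 1100, "book": 900, "mrw": 10}
to_review = flag_unusual(last_month, this_month)
# to_review == ["book", "cochrane", "mrw"]
```

A check like this would not replace the QA itself, but it could narrow the monthly review to the products that need a closer look.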
RES
- Store relatively unprocessed data (not counts) in the database,
though a copy is needed on apollo for the website.
- Compute on the fly (the database is apparently fast enough).
- Counts need to be transferred each month to the database on apollo
for website access.
- Since hundreds of reports need to be generated a day, this is probably
not a job for a general-purpose datawarehousing database, since that is
designed for a limited number of reports.
- The DataWarehouse is not designed for it anyway:
  - It is not currently storing data that is needed (e.g. session
  counts), nor, by the sound of it, could such data be stored in the
  foreseeable future (e.g. historical data, such as old titles for
  journals, was recently ruled out when requested).
  - The concepts of Customer needed for the Counter requirements are not
  incorporated in DataWarehousing.
  - It is not clear whether it would be possible to use their DB to
  store counts if wanted.
UsageReporting Now:
- Stable for memory, disk use and processing time for a while.
- The main problems are that maintenance, procedures and extensions are
time-consuming.