1/24/00 Plot of total elapsed processing time for relevance factors from .01 to .10.
1/24/00 Stacked bar graph breakdown of processing times by relevance factor.
1/24/00 Stacked bar graph breakdown of processing times by relevance factor excluding document parsing.
1/24/00 Plot of total elapsed processing time for 3 to 18 processors.
1/24/00 Stacked bar graph breakdown of processing times for 3 to 18 processors.
1/24/00 Stacked bar graph breakdown of processing times for 3 to 18 processors excluding document parsing.
1/25/00 Plot of total elapsed processing time for relevance factors from .01 to .10.
1/25/00 Stacked bar graph breakdown of processing times by relevance factor.
1/25/00 Stacked bar graph breakdown of processing times by relevance factor excluding document parsing.
1/25/00 Plot of total elapsed processing time for 3 to 18 processors.
1/25/00 Stacked bar graph breakdown of processing times for 3 to 18 processors.
1/25/00 Stacked bar graph breakdown of processing times for 3 to 18 processors excluding document parsing.
1/26/00 Plot of total elapsed processing time for relevance factors from .01 to .10.
1/26/00 Stacked bar graph breakdown of processing times by relevance factor.
1/26/00 Stacked bar graph breakdown of processing times by relevance factor excluding document parsing.
1/26/00 Plot of total elapsed processing time for 3 to 18 processors.
1/26/00 Stacked bar graph breakdown of processing times for 3 to 18 processors.
1/26/00 Stacked bar graph breakdown of processing times for 3 to 18 processors excluding document parsing.
2/17/00 Plot of total elapsed processing time for relevance factors from .01 to .10.
2/17/00 Stacked bar graph breakdown of processing times by relevance factor.
2/17/00 Stacked bar graph breakdown of processing times by relevance factor excluding document parsing.
2/17/00 Plot of total elapsed processing time for 3 to 18 processors.
2/17/00 Stacked bar graph breakdown of processing times for 3 to 18 processors.
2/17/00 Stacked bar graph breakdown of processing times for 3 to 18 processors excluding document parsing.
3/06/00 Plot of total elapsed processing time for 3 to 18 processors.
3/06/00 Stacked bar graph breakdown of processing times for 3 to 18 processors.
3/06/00 Stacked bar graph breakdown of processing times for 3 to 18 processors excluding document parsing.
3/07/00 Plot of total elapsed processing time for 30 to 200 standing queries.
3/07/00 Stacked bar graph breakdown of processing times for 30 to 200 standing queries.
3/07/00 Stacked bar graph breakdown of processing times for 30 to 200 standing queries excluding document parsing.
3/08/00 Plot of total individual document processing time for 200 standing queries.
3/08/00 Stacked bar graph breakdown of individual document processing times for 200 standing queries.
3/08/00 Stacked bar graph breakdown of individual document processing times 200 standing queries excluding document parsing.
5/24/00 Plot of total elapsed processing time for 1 to 1000 standing queries.
5/24/00 Stacked bar graph breakdown of processing times for 1 to 1000 standing queries.
5/24/00 Stacked bar graph breakdown of processing times for 1 to 1000 standing queries excluding document parsing.
#docs x #SQ x (#unique_stemmed_words_in_doc x 8) x 2 = total memory allocated
In the tests run here we have 350 documents, 200 standing queries, and the
average number of unique stemmed words per document looks to be around 350.
So the total memory used will be:
350 x 200 x (350 x 8) x 2 = 392MB
Alternatively, you can say that each document will cause the standing
query server to grow in size by 392MB/350 = 1.12MB (much larger than the
documents themselves.)
The PC's in the computer cluster have 256MB of RAM and 256MB of swap space.
When the tests start running there is about 300-400MB of space available (difficult to tell since Linux uses RAM for disk caching also.) Given that there
are other parts of the code that also allocate memory, it's not too difficult
to believe that the Vector caching and Centroids are the main reason we are
using up all the memory when we get close to 350 documents, especially if
we happened to have run across a sequence of larger documents. One
document with 1000 words would itself be responsible for 3.2MB of memory.
Looking at the size of the documents being added most are between 1000
and 8000 BYTES (not words!) in size, but there are a few around 40K BYTES in size.
The tradeoffs would appear to be between these four factors:
# documents clustered
Vector caching (speed vs memory)
memory usage/availability
# standing queries
Increase the # of documents clustered and you either need more memory
(you could turn off caching) or have to reduce the number of standing
queries. Increase the number of standing queries and again you either
need more memory or have to reduce the number of documents clustered.
Use caching and you have greater speed but less memory which reduces
the number of documents clustered or the number of standing queries.
Increase any one of them and the others are detrimentally affected.
During one of the test runs I plotted memory usage by the SQS versus
number of documents processed. The graph looks like this (note that you
have to add 50 to every point on the X axis since 50 documents are clustered
when the query is registered):
Memory Usage of SQS versus Number of New Documents Processed
Where things get weird in the upper right corner is where the PC was starting to run out of both RAM and swap space. The starting size of an SQS server is around 20MB before any queries are registered. I watched how the SQS grew in size during a similar test run and noted that it had grown to 53904KB after 162 queries had been registered. Since it started at 20KB this works out to (53904 - 20000)/162 = 209KB per query. Each query immediately clusters 50 documents so that is 209KB/50 = 4180B/document. Divide by 8 to get 522 unique stemmed words per document which seems a little high. However, these first 50 documents come from the serval Usenet news database whereas the new documents added to the cluster come from Clarinet news. Our Usenet news database had some problems in filtering during formation that led to containing some very large documents (1MB binary files for example where every line becomes a word).
Some things to try:
Try the same tests with just Clarinet news. Arne is building the database.
Turn off Vector caching to verify that it is the cause of the memory growth.
Try a test run with fewer new documents sent through the system so it will be
able to support more standing queries.
The results below show a comparison of Katya's Star Clustering algorithm and Fred's improvement on that algorithm (Linear Space Sampled Star algorithm) when run using a single standing query.
11/6/00 Plot of total elapsed processing time using Katya's algorithm.
11/6/00 Stacked bar graph breakdown of processing times using Katya's algorithm
11/6/00 Stacked bar graph breakdown of relevance and transmission times using Katya's algorithm.
11/6/00 Plot of total elapsed processing time using Fred's algorithm.
11/6/00 Stacked bar graph breakdown of processing times using Fred's algorithm
11/6/00 Stacked bar graph breakdown of relevance and transmission times using Fred's algorithm.
2/13/01 Plot of total elapsed processing time using Katya's algorithm.
2/13/01 Stacked bar graph breakdown of processing times using Katya's algorithm
2/13/01 Stacked bar graph breakdown of relevance and transmission times using Katya's algorithm.
2/13/01 Plot of total elapsed processing time using Fred's algorithm.
2/13/01 Stacked bar graph breakdown of processing times using Fred's algorithm
2/13/01 Stacked bar graph breakdown of relevance and transmission times using Fred's algorithm.
Graphs of Fred's algorithm with vector caching disabled.
2/22/01 Plot of total elapsed processing time using Fred's algorithm.
2/22/01 Stacked bar graph breakdown of processing times using Fred's algorithm
2/22/01 Stacked bar graph breakdown of relevance and transmission times using Fred's algorithm.
Graphs of varying relevance percentage for Fred's algorithm, varying the relevance from 0.00 to 0.20
11/15/00 Plot of total elapsed processing time for relevance factors from .01 to .20.
11/15/00 Stacked bar graph breakdown of processing times by relevance factor.
11/15/00 Stacked bar graph breakdown of processing times by relevance factor excluding document parsing.
A paper entitled Scalable Information Organization contains a description of the algorithm devised by Fred Reiss as an extension of Katya's Star clustering algorithm.
SQS.ppt Diagram of architecture of standing query server.
PersistentQueryC.ppt Peristant queries poster.