The agent has been modified to launch from one to ten children,
each to a separate PC. Each child acts as a client in an
IR task: it repeatedly jumps to a central server, performs a query,
processes the query results, and returns to its client PC with the
final results. The time taken by the processing and by the jumps is
measured and reported to the parent agent, where the times are
averaged and summarized as a performance measurement.
By varying the number of children from one to ten, a curve can
be constructed showing how the agent system's performance scales
with the number of clients.
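The averaging step described above can be sketched roughly as follows. This is only an illustration in Python, not the agent code itself; the function name `scalability_curve` and the sample timings are hypothetical.

```python
from statistics import mean

def scalability_curve(runs):
    """Map each child count to the mean total time per query.

    `runs` maps a number of children to the list of total times
    (jump plus processing) reported back to the parent agent.
    """
    return {n: mean(times) for n, times in sorted(runs.items())}

# Hypothetical timings in milliseconds, for illustration only.
runs = {1: [100.0, 110.0], 2: [150.0, 170.0], 3: [210.0, 230.0]}
curve = scalability_curve(runs)
for n, avg in curve.items():
    print(n, avg)
```

Plotting the resulting averages against the number of children gives the scalability curve.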
Similarly, the client/server code was modified so that the client and the server each accept a port number when started. A control script starts clients on one to ten PCs and a corresponding number of servers on a single server machine. The idea is that having N independent server tasks on a single machine is similar to having N threads on a server spawned in response to incoming client requests, especially since Linux threads are separate tasks anyway. Thus the server machine running multiple server applications is a close equivalent of a machine running a multithreaded server application (without our having to take the time to write and debug a multithreaded server). The control script can start from one to ten client/server pairs. The clients and servers keep track of execution and data transmission times, from which a performance scalability curve can be constructed showing client/server performance versus number of clients.
Shown below is a chronological record of the results so far, along
with some descriptions of current thoughts and snapshots of
source code and scripts. The parameters that we can vary are:
The bandwidth was varied using a PC with two network interface cards, running DummyNet. DummyNet is a modified FreeBSD firewall that can be used to control bandwidth allocations.
The PCs used as the clients and as the server are VA Linux VarStation 28, Model 2871E; Pentium II at 450MHz, 512K ECC L2 Cache, 256MB RAM, running Linux 2.2.14 (RedHat 6.2).
9/18/00 2Mbps Ethernet, 0.5 query/sec, 200/1000 queries, 60 docs, 20% relevance
The agent data for 200 and 300 queries is shown here, though it's difficult
to see since there is practically no difference. Hence 200 queries is
probably sufficient to get meaningful data for the agent tests.
Not sure why there is a jump between 7 and 8 in the agent curves, although
this is right about where the tests used to fail before the EALREADY fix
to the agent server. It's not due to retries in the agent code; I added
code to count retries and there are none in the agent (or the client
for that matter). Could be retries in the agent server?
9/20/00 1Mbps Ethernet, 0.5 query/sec, 200/1000 queries, 60 docs, 20% relevance
These slopes and intercepts are for the test results shown above.
60 Documents, 5% Relevance
file                       slope       intercept
agent.dat.200.100Mbps      35.4335     41.9946
agent.dat.200.2Mbps        86.4419     144.2459
agent.dat.200.1Mbps        157.8562    276.4375
client.dat.1000.100Mbps    13.4332     20.0544
client.dat.1000.2Mbps      1020.0      4.0
client.dat.1000.1Mbps      2051.0      0.3
60 Documents, 20% Relevance
file                       slope       intercept
agent.dat.200.100Mbps      41.8522     61.9915
agent.dat.200.2Mbps        235.2351    60.1098
agent.dat.200.1Mbps        506.8804    70.6040
client.dat.1000.100Mbps    11.8074     26.9357
client.dat.1000.2Mbps      1019.9      4.8
client.dat.1000.1Mbps      2053.1      -5.6
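The notes do not say how these slopes and intercepts were computed. Assuming an ordinary least-squares line fitted to average time versus number of clients, the calculation would look something like the sketch below. The function name `fit_line` and the sample points are hypothetical and are not taken from the test data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical points lying exactly on time = 20 * clients + 10.
clients = [1, 2, 3, 4, 5]
times_ms = [30.0, 50.0, 70.0, 90.0, 110.0]
slope, intercept = fit_line(clients, times_ms)
print(slope, intercept)
```

Under this reading, the slope is the extra time added per additional client and the intercept is the extrapolated time at zero clients.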
All the client agents start at the same time and it appears this
is causing data artifacts. The
7 agent and 8 agent plots above, which show the total
time (jump plus data processing) for each message per machine, have about the same maximums,
but the 8 agent case has a much lower minimum and a wider spread of times.
When I added a 60 second delay between the start of each agent, the
7 agent time dropped dramatically (from 1566.79 to 1048.93msec) and
the spread of times got a lot wider as shown in the third plot above.
Even with the delay between starts, there still appear to be timing
artifacts (machines 9 through 12 have increasing times, and there are
unusual gaps in the higher-numbered machines).
Longer duration test runs would seem to be called for. Tests are
currently limited to 200 queries due to a bug which causes a segmentation
fault. Averaging several 200 query runs might be possible, if they
can be started with varying offsets (otherwise, from what I've seen so
far, they will repeat almost exactly.) Fixing the bug would be the
best solution, though likely to take the most time.
As to why the timing changes between 7 and 8 clients in both places
where the artifacts show up, it appears that this is the point in
both cases where the available bandwidth matches the size of the
data + agents at a jump rate of 0.5/sec. That is, it's coincidence
(or the perverseness of numbers) that it happened between 7 and 8 both times.
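The saturation argument above can be written as a back-of-envelope calculation: the channel fills when the number of clients times the bits each client sends per second equals the available bandwidth. The sketch below is a simplification that ignores protocol overhead and treats the channel as a single shared pipe. The function name and the 60,000 byte payload are hypothetical placeholders, not measured sizes of the data and agents.

```python
def saturation_clients(bandwidth_bps, bytes_per_query, queries_per_sec):
    """Number of clients at which offered load equals channel bandwidth."""
    bits_per_client_per_sec = bytes_per_query * 8 * queries_per_sec
    return bandwidth_bps / bits_per_client_per_sec

# Hypothetical payload: 60,000 bytes of agent plus data per query.
n = saturation_clients(bandwidth_bps=2_000_000,
                       bytes_per_query=60_000,
                       queries_per_sec=0.5)
print(round(n, 2))
```

Substituting the real combined size of data plus agent for each bandwidth would show whether the crossover really falls between 7 and 8 clients in both cases.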
These graphs show the order of arrival of messages per client. They were created by logging the client machine number as each message arrived and then plotting that log of machine numbers, so the horizontal axis corresponds roughly to time. If any one client were getting many messages in a row, it would appear as grouped marks on these graphs. If any client didn't get any messages for a long time, it would appear as a gap in the marks. The one client plot isn't shown because it's meaningless.
The number of marks horizontally indicates the number of queries each
client executed.
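The same check for clusters and gaps could also be done numerically on the arrival log rather than by eye. The following is a minimal sketch, assuming the log is simply a list of client machine numbers in arrival order; the function names and the sample log are hypothetical.

```python
from itertools import groupby

def longest_runs(arrivals):
    """Longest streak of consecutive messages from each client."""
    runs = {}
    for client, group in groupby(arrivals):
        length = sum(1 for _ in group)
        runs[client] = max(runs.get(client, 0), length)
    return runs

def longest_gaps(arrivals):
    """Largest count of other clients' messages between two
    successive messages from the same client."""
    last_seen = {}
    gaps = {}
    for index, client in enumerate(arrivals):
        if client in last_seen:
            gap = index - last_seen[client] - 1
            gaps[client] = max(gaps.get(client, 0), gap)
        else:
            gaps.setdefault(client, 0)
        last_seen[client] = index
    return gaps

# Hypothetical arrival log for three clients.
log = [1, 2, 3, 1, 2, 3, 3, 3, 1, 2]
print(longest_runs(log))
print(longest_gaps(log))
```

A large run length corresponds to a cluster of marks on the graph, and a large gap corresponds to a client that went a long time without a message.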
Overall, there are no surprises here. No client starts early or ends
late. They all execute about the same number of queries. There are
no large gaps or clusters, though there are a few very small ones.
In fact, the very evenness might be an indication that all the
queries are always executing in the same order, one after another.
Looking at the data files that is indeed the case, although the
tests with more clients occasionally change order slightly. So
basically we seem to have a train of agents running through the
bandwidth bottleneck in the same order in both directions. Which
makes sense given the precise timing of the query launches and
relatively high speed of the clients and server compared to the
speed of the channel connecting them.
The first client is always faster when the channel is not full
because the server CPU is free when it arrives (all the previous agents
are still waiting to leave.) As more agents arrive they use
more CPU so the times get slower and slower per client.
At some point the channel becomes full and the times for all agents
equalize because they are all waiting on the channel.
10,2,1Mbps, 20% relevance, 60 docs, 1000 queries/client, 0.5 queries/sec +/-0.25
Here's a graph of all six curves together:
10,2,1Mbps, 20% relevance, 60 docs, 1000 queries/client, 0.5 queries/sec +/-0.25 (before and after randomization added to interquery time.)
OLD - 100Mbps, 5% relevance, 60 docs, 1000 queries/client, 0.5 queries/sec (+/-0.25 Q/s for agent)
* NEW results 10/19/00 *
10Mbps, 20% relevance, 60 docs, 1000 queries/client, 0.5 queries/sec (+/-0.25 Q/s for agent)
10Mbps, 5% relevance, 60 docs, 1000 queries/client, 0.5 queries/sec (+/-0.25 Q/s for agent)
2Mbps, 20% relevance, 60 docs, 1000 queries/client, 0.5 queries/sec (+/-0.25 Q/s for agent)
2Mbps, 5% relevance, 60 docs, 1000 queries/client, 0.5 queries/sec (+/-0.25 Q/s for agent)
1Mbps, 20% relevance, 60 docs, 1000 queries/client, 0.5 queries/sec (+/-0.25 Q/s for agent)
1Mbps, 5% relevance, 60 docs, 1000 queries/client, 0.5 queries/sec (+/-0.25 Q/s for agent)
Here are the slopes and intercepts for the above data:
              agent                    client/server
speed  rel    slope      intercept     slope      intercept
100M   20%    16.6253    91.7759       22.4702    9.7161
100M   5%     16.2592    53.8143       21.4159    15.2020
10M    20%    11.1708    158.9240      234.0797   -7.0076
10M    5%     6.4676     103.4845      234.6322   -9.4191
2M     20%    185.5561   -101.7343     1019.9     4.8
2M     5%     15.4103    205.6061      1020.0     4.0
1M     20%    543.9005   -282.2995     2053.1     -5.6
1M     5%     89.7088    147.2869      2051.0     0.3
An Excel spreadsheet with the above numbers plus bandwidth times slope calculations is available here.
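For reference, a bandwidth times slope product can be computed directly from the client/server columns of the table above. This is only a sketch; whether the spreadsheet uses exactly this product, and in what units, is an assumption. The slope values are copied from the table, and the bandwidth is taken in Mbps.

```python
# (bandwidth in Mbps, relevance) -> client/server slope from the table above
client_server_slopes = {
    (100, "20%"): 22.4702,
    (100, "5%"): 21.4159,
    (10, "20%"): 234.0797,
    (10, "5%"): 234.6322,
    (2, "20%"): 1019.9,
    (2, "5%"): 1020.0,
    (1, "20%"): 2053.1,
    (1, "5%"): 2051.0,
}

# Multiply each slope by its bandwidth.
products = {
    key: key[0] * slope for key, slope in client_server_slopes.items()
}

for (mbps, rel), value in products.items():
    print(f"{mbps:>4} Mbps {rel:>4} {value:10.2f}")
```

The client/server products all fall between roughly 2040 and 2350, which would be consistent with the client/server slope being dominated by transfer time over the channel, though that interpretation is not confirmed by the notes.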
10/18/00 - Increased the number of queries for the agent test at
100Mbps to 2000 and the number of queries for the client/server test
at 100Mbps to 3000 to see if that would smooth out the curves.
Instead of smoothing them it changed them!
Here's what the new data looks like for 5% relevance:
100Mbps, 5% relevance, 60 docs, 2000/3000 queries/client, 0.5 queries/sec (+/-0.25 Q/s for agent)
Here's a graph with both the old and new data on it for 5% relevance to make
it easier to see what has changed:
100Mbps, 5% relevance, 60 docs, 2000/3000 queries/client, 0.5 queries/sec (+/-0.25 Q/s for agent)
Here's what the new data looks like for 20% relevance:
100Mbps, 20% relevance, 60 docs, 2000/3000 queries/client, 0.5 queries/sec (+/-0.25 Q/s for agent)
Here's a graph with both the old and new data on it for 20% relevance:
100Mbps, 20% relevance, 60 docs, 2000/3000 queries/client, 0.5 queries/sec (+/-0.25 Q/s for agent)
The additional queries seem to have resulted in the client/server curve becoming a straight line and the agent curve now almost seems to converge to the client/server line.
10/20/00 - A 3000 query agent test run completed and the results are strange. I've plotted the results for 1000, 2000 and 3000 queries on the same graph to make them easy to compare (ignore the title):
It would appear that we can move the right end of the curve up or down at
will by altering the number of queries. This doesn't make sense in terms
of any kind of bandwidth limitations because 100Mbps should be more
than adequate for the agents to utilize with minimal contention.
There is also a discernible effect even with just a few clients (even
with just one client!), which would
seem to rule out CPU, memory, and bandwidth limits as the cause of this effect.
The shape of all the curves is identical as well, with kinks in odd places.
If you look at the 100Mbps client/server data, it is exhibiting a
somewhat similar phenomenon in that the curve moves up with more
queries. It would be interesting to try a 5000 query client/server test
to see if it changes any. When the cluster is available again I could
try firing up the mtool daemons to collect info about what's going on
with each CPU, memory, and network interface. There is some info I
can get from dummynet about number of packets dropped as well.
Anyone have any ideas about what is going on here?
Here's the agent data with the Y axis values multiplied by the bandwidth:
This graph shows the idle CPU percentage over time during the test:
Idle CPU versus Time
This graph shows the memory utilization of the Java agents versus time:
Agent Size versus Time
1Mbps 5% & 20%
1Mbps 5% & 20% EPS format
2Mbps 5% & 20%
2Mbps 5% & 20% EPS format
10Mbps 5% & 20%
10Mbps 5% & 20% EPS format
100Mbps 5% & 20%
100Mbps 5% & 20% EPS format
---
All 5% data on one graph
All 5% data
All 5% data EPS format
All 20% data on one graph
All 20% data
All 20% data EPS format
---
10Mbps, 2Mbps, and 1Mbps 5% data on one graph
1,2,10Mbps 5% data
1,2,10Mbps 5% data EPS format
10Mbps, 2Mbps, and 1Mbps 20% data on one graph
1,2,10Mbps 20% data
1,2,10Mbps 20% data EPS format
1,2,10Mbps 20% data EPS format (title, line, and point styles changed.)
---
Client-Server/Agent Performance Ratio for 100Mbps, 10Mbps, 2Mbps, and 1Mbps 5% data on one graph
Ratio 1,2,10,100Mbps 5% data
Ratio 1,2,10,100Mbps 5% data EPS format
Client-Server/Agent Performance Ratio for 100Mbps, 10Mbps, 2Mbps, and 1Mbps 20% data on one graph
Ratio 1,2,10,100Mbps 20% data
Ratio 1,2,10,100Mbps 20% data EPS format
This graph shows the Server idle CPU percentage over time during the test:
Idle CPU versus Time
We reach maximum server CPU usage at about 200*30 = 6000sec = 1hr40min.
A short time later (1hr55min) the server CPU becomes completely idle.
This graph shows the Server memory utilization (RAM and swap) versus time:
Total RAM and Swap usage versus Time.
The top line is the free swap space. The bottom line is the free RAM.
The "tail" floating in midspace is free RAM since the top line continues
at that point and the bottom line doesn't. So at about 410*15sec = 6,150 seconds = 1hr42min swap space stopped declining and some free RAM appeared.
From the test log, the clients started at 22:15:40 and they stopped
at 00:07:36 which is a total of 1:51:56.
So the time scales all look consistent.
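The three time figures quoted above can be cross-checked with a few lines of Python. The sample counts and intervals (200 samples at 30 seconds, 410 samples at 15 seconds) and the start and stop times are taken from the text; the helper name `elapsed` is hypothetical.

```python
from datetime import datetime, timedelta

def elapsed(start, stop):
    """Elapsed wall-clock time, allowing the run to cross midnight."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(stop, fmt) - datetime.strptime(start, fmt)
    if delta < timedelta(0):
        delta += timedelta(days=1)
    return delta

run_length = elapsed("22:15:40", "00:07:36")   # from the test log
cpu_peak = timedelta(seconds=200 * 30)         # maximum server CPU usage
swap_flat = timedelta(seconds=410 * 15)        # swap space stops declining

print(run_length, cpu_peak, swap_flat)
```

Both events fall inside the run, a few minutes before the clients stop.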
-----
The same graphs, drawn to the same scale as the 10/27 ones.
5% 1,2,10,100Mbps
20% 1,2,3,10,100Mbps (Same graph, different key)
MainAgent.java - agent that launches client agents as children.
QueryAgent.java - the child client agent.
server.cc - Server for client/server implementation.
client.cc - Client for client/server implementation.
scaletest - agent test control script
scaletestclient - client/server test control script.
I did a performance comparison between JDK 1.0 and JDK 1.2, comparing
the time it takes to do the IR task. Here's the Java source code
and here's the control script that builds and runs the test.
The Blackdown JDK 1.2.2 and the
most recent AgentJava JDK 1.0.2 were used for the test. The
JIT is supposedly on by default in the Blackdown port.
The tests were run on a node in the cluster (agentc02), which is
a VA Linux VarStation 28,
Model 2871E; Pentium II at 450MHz, 512K ECC L2 Cache, 256MB RAM, running
Linux 2.2.14 (RedHat 6.2).
The results from running the test are:
20% JDK 1.0.2 Average Task Time: 61.644
20% JDK 1.2.2 Average Task Time: 96.169
5% JDK 1.0.2 Average Task Time: 55.903
5% JDK 1.2.2 Average Task Time: 88.953
So it would appear the task runs significantly faster in JDK 1.0.2 than it does in JDK 1.2.2: the JDK 1.2.2 times are about 56% higher at 20% relevance and about 59% higher at 5% relevance.
Here are some initial results from the Lockheed tests: Preliminary Results in Excel 97 spreadsheet
MainAgent.java
QueryTask.java
ReturnTask.java
SocketManagerNoDelay.java
ThreadNotify.java
TimedSequentialAgent.java
setup.csh
More results from the Lockheed tests:
12/14/00 Results in Excel 97 spreadsheet
D'Agents, EMAA, and NOMADS Results combined in an Excel spreadsheet
D'Agents, EMAA, and NOMADS Results combined in a Powerpoint presentation
D'Agents final results in Excel format including breakout of Task Time and Jump Time:
D'Agents Final Results in Excel 97 spreadsheet
NOMADS final results in Excel format:
NOMADS Results in Excel spreadsheet
Latest Lockheed Results in Excel spreadsheet
1/16/01
The following spreadsheet contains the results from D'Agents, EMAA, and NOMADS for 100Mbps, 10Mbps, and 1Mbps at 20% and 5% relevance. Missing from the data are the NOMADS 10Mbps and 5% tests. The NOMADS data should be read using the right-hand scale in all cases, except for the plots of Jump Times (where NOMADS times are comparable to the other systems). Also included are ratio plots that show the Client-Server/Agent performance ratio of Total Times.
Latest combined results in a variety of formats in an Excel spreadsheet
Latest results from Lockheed in an Excel spreadsheet
1/17/01 - Results from all systems combined into one spreadsheet.
Latest combined results graphed in a variety of formats in an Excel spreadsheet
Latest D'Agents and EMAA results graphed as a contour plot of
Client-Server/Agent Ratios.
The axes are Bandwidth (X) versus Number of Clients (Y) versus
Client-Server/Agent Ratio (color). Format is Excel spreadsheet.
1/18/01 - Added Query Completion Rate graphs to the
results from all systems combined into one spreadsheet.
Query Completion Rates added to the latest combined results graphed in a variety of formats in an Excel spreadsheet
New Query Completion Rate graphs added to other graphs in an Excel spreadsheet
The new data is in the following worksheets:
60Docs5%Rel at 1Mbps
60Docs5%Rel at 10Mbps
60Docs5%Rel at 100Mbps
60Docs20%Rel at 10Mbps
Note that, due to time constraints, we have collected numbers only for 1, 3, 5, 7,
and 10 clients.
Latest data for NOMADS.
1/23/01 - Excel spreadsheet with all data
Latest data from everyone. Some of the NOMADS data is interpolated. The graph legends have been removed, but can easily be restored in Excel by selecting Chart/Chart Options from the menu, then clicking on the Legend tab and selecting the "Show Legend" checkbox.
1/25/01 - Time for C++ code to perform just the IR task
The C++ client code was modified to perform the entire task (reading the documents off of disk and searching them) locally in order to get a time to compare with the Java task times shown above in the 12/6/00 entry. Here's the source code used: tasktest.cc
The resulting task times were:
20% C++ Average Task Time: 3.02 ms
5% C++ Average Task Time: 2.92 ms
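To put the C++ times next to the Java task times reported in the JDK comparison, the ratios can be computed as in the sketch below. The Java figures are copied from that comparison; the notes do not state their units, so treating them as milliseconds is an assumption.

```python
# Java average task times (assumed to be in ms) from the JDK comparison.
java_ms = {
    "20%": {"JDK 1.0.2": 61.644, "JDK 1.2.2": 96.169},
    "5%": {"JDK 1.0.2": 55.903, "JDK 1.2.2": 88.953},
}
# C++ average task times in ms from the results above.
cpp_ms = {"20%": 3.02, "5%": 2.92}

# Ratio of Java task time to C++ task time for each case.
ratios = {
    (rel, jdk): t / cpp_ms[rel]
    for rel, by_jdk in java_ms.items()
    for jdk, t in by_jdk.items()
}

for (rel, jdk), ratio in ratios.items():
    print(f"{rel:>4} {jdk}: {ratio:.1f}x the C++ time")
```

If the unit assumption holds, the Java task takes roughly 19 to 32 times as long as the C++ task, depending on the JDK and the relevance setting.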
D'Agents 100,10,3,2,1Mbps, 5%,20% relevance, 1 to 20 clients, Excel spreadsheet ratio3D.xls
CPU idle time measurements for 100Mbps, 5% test with 1 to 20 clients cpu5_100M.xls
D'Agents results in Excel format
Paper in PDF format
Paper in PS format
Data and graphs in Excel format
ETIvsDN.eps
ETIvsDNratio-05.eps
eti.eps - Graph comparing 3Mbps 5% agent tests
Shows the original test, the test after the cluster upgrade to RH7.1,
and the test after the RH7.1 upgrade AND the changeover to using the
ETI machine for DummyNet.
scalaJun28.xls - Results of 5% D'Agents tests
These test results were calculated after the upgrades to the cluster
(RedHat 7.1, DummyNet moved to 500MHz Pentium, network now fully switched).
cpunet_5cent.xls - Excel spreadsheet with CPU and network traffic info
for 100Mbps 5% tests at 1, 10 and 20 clients.
scalaJul3.xls - Excel spreadsheet with Client/Server and D'Agents 5% and 20% pass ratio data and graphs for 1, 10, 100Mbps.
idleAverages5Cent.xls - Excel spreadsheet with % CPU Idle averages graphs
scalaJul17.xls - Excel spreadsheet with D'Agents, KAoS, and NOMADS data and graphs.
sort1.ps - graph of sorted query times. Client/Server, 5 clients, 10Mbps, 5% pass ratios
sort2.ps - zoom on central region of above graph.
idlecpuJul19.xls - averages of CPU idle percentage for 5% and 20% pass ratios, for 1,10,100 Mbps, for both client/server and D'Agents.
scalaAug22.xls - complete AgentJava results. Mostly complete NOMADS
results, incorrect KAoS results (no interquery time was implemented), and
partial EMAA results.
scalaAug30.xls - Excel spreadsheet with complete D'Agents, KAoS, and NOMADS data and graphs. Also has partial EMAA data.
InfoMinion.java
InfoMinionMother.java
MOBIRMainAgent.java
MOBIRMainAgent_orig.java
MOBTIEConstants.java
TaskResult.java
IResultsReceiver.java
StDeviationCalculator.java
scalaSep4.xls - Excel spreadsheet with final D'Agents, KAoS, EMAA, and NOMADS data. Graphs still need reformatting.
scalaSep6.xls - Excel spreadsheet with final D'Agents, KAoS, EMAA, and NOMADS data. Graphs have been reformatted for publication.
scalaOct4.xls - Excel spreadsheet with really final D'Agents, KAoS, EMAA, and NOMADS data. Graphs have been reformatted for publication.
paper.pdf - paper with new graphs in PDF format
paper.ps - paper with new graphs in PS format