Today, I ran STREAM on my old router (Sun V210, 2x UltraSPARC IIIi) and my new one (Sun T2000, 1x Niagara T1).
Benchmark Overview
STREAM is a simple memory bandwidth benchmark.
It runs a series of simple operations over large arrays, then reports how long each one took
and the memory bandwidth achieved, and double-checks that the results are correct. The tests are:
- Copy: Y[] = X[]
- Scale: Y[] = X[]*n
- Add: Y[] = X[]+Z[]
- Triad: Y[] = X[]*n+Z[]
The Copy test almost always gets optimized to a memcpy, which serves as a
good reference for systems with weak FPU performance, or with no FPU at all.
All other tests tend to make heavy use of any available FPU.
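For reference, here's a rough sketch of what the four tests above boil down to. This is plain Python for illustration only, not the real benchmark (which is written in C and uses arrays much larger than the CPU caches). The function names, the array size and the value of n are all made up for this example. The bytes-per-element figures follow STREAM's usual accounting of 8-byte doubles, counting every array read or written, and I'm assuming a megabyte of 10^6 bytes.

```python
# Illustrative sketch of the four STREAM kernels; all names here are hypothetical.
N = 1000   # tiny on purpose; the real benchmark uses arrays far larger than cache
n = 3.0    # the scalar called "n" in the test list

X = [1.0] * N
Z = [2.0] * N

def stream_copy(x):
    # Copy: Y[] = X[]
    return x[:]

def stream_scale(x, n):
    # Scale: Y[] = X[]*n
    return [xi * n for xi in x]

def stream_add(x, z):
    # Add: Y[] = X[]+Z[]
    return [xi + zi for xi, zi in zip(x, z)]

def stream_triad(x, z, n):
    # Triad: Y[] = X[]*n+Z[]
    return [xi * n + zi for xi, zi in zip(x, z)]

# Bytes moved per array element: Copy and Scale touch two arrays,
# Add and Triad touch three, at 8 bytes per double.
BYTES_PER_ELEMENT = {"Copy": 16, "Scale": 16, "Add": 24, "Triad": 24}

def mb_per_second(kernel, elements, seconds):
    # Bandwidth in MB/s, assuming 1 MB = 10**6 bytes.
    return BYTES_PER_ELEMENT[kernel] * elements / seconds / 1e6
```

Note that Copy involves no arithmetic at all, which is why it's the one test that doesn't lean on the FPU.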
System Overview
V210
The V210 uses two UltraSPARC IIIi CPUs attached to DDR memory. Each IIIi has a single
core with its own FPU.
T2000
The T2000 uses a single UltraSPARC T1 CPU attached to DDR2 memory. The T1 supports
four cores each with eight threads, but with only a single FPU, for an effective total
of up to 32 independently schedulable threads. The T1 is also known for slow
single-threaded performance, a weakness addressed in the T4 and newer CPUs.
STREAM Results in Megabytes/Second
Box:   | V210x1 | V210x2 | T2000x1 | T2000x32 |
Copy:  |  496.7 |  577.5 |   429.6 |   3492.9 |
Scale: |  498.3 |  568.0 |   261.2 |   1558.7 |
Add:   |  494.1 |  597.1 |   282.8 |   2133.4 |
Triad: |  419.3 |  579.5 |   220.9 |   1176.8 |
Single-Threaded Results
V210
What can I say? This router is getting old.
T2000
The T1's single-threaded results are bad - even worse than the IIIi (a design four years older).
This could prove to be a problem, since in addition to routing, I'll need it to run a few
mostly single-threaded game servers as well. More measurements required.
Relative Multi-Threading Improvement over Single-Threading
Box:   | V210x2 | T2000x32 |
Copy:  |   1.16 |     8.13 |
Scale: |   1.14 |     5.97 |
Add:   |   1.21 |     7.54 |
Triad: |   1.38 |     5.33 |
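In case anyone wants to check the arithmetic, each ratio above is just the multi-threaded rate divided by the single-threaded rate for the same box. A quick sketch in Python, with the numbers copied from the results table (the dictionary layout and the helper name are only for this example):

```python
# STREAM results in MB/s, copied from the results table.
results = {
    "Copy":  {"V210x1": 496.7, "V210x2": 577.5, "T2000x1": 429.6, "T2000x32": 3492.9},
    "Scale": {"V210x1": 498.3, "V210x2": 568.0, "T2000x1": 261.2, "T2000x32": 1558.7},
    "Add":   {"V210x1": 494.1, "V210x2": 597.1, "T2000x1": 282.8, "T2000x32": 2133.4},
    "Triad": {"V210x1": 419.3, "V210x2": 579.5, "T2000x1": 220.9, "T2000x32": 1176.8},
}

def speedups(results, single, multi):
    # Multi-threaded rate over single-threaded rate, rounded to two decimals.
    return {test: round(row[multi] / row[single], 2) for test, row in results.items()}

v210 = speedups(results, "V210x1", "V210x2")
t2000 = speedups(results, "T2000x1", "T2000x32")

for test in results:
    print(f"{test:6s} {v210[test]:5.2f} {t2000[test]:5.2f}")
```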
Multi-Threaded Results
V210
A little bit faster, but not a whole lot. This usually means that one thread
is capable of nearly saturating the memory bus/controller, which is good - it
implies that the penalty for the extra multithreading hardware is relatively
small. It could also just mean that the memory controller or cache isn't
very good.
T2000
This is where the T1 shines, with between 5.3x and 8.1x more bandwidth when
the work is spread over 32 threads. What's interesting here is that the overall
improvement was greater than 4x (the number of cores). This means that a
single hardware thread isn't capable of saturating the bandwidth available to
its own core, so eight or more threads will be needed to saturate the chip's
bandwidth, and even that may only happen if the kernel schedules them two to
a core.
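To put a rough number on that, dividing each speedup by the core count gives how many "single threads' worth" of bandwidth each core delivered. This is a back-of-the-envelope sketch only. It takes the four-core figure from the system overview at face value (the T1 shipped in several core-count variants, so adjust CORES to match your chip), and it assumes the load was spread evenly across cores, which I haven't verified.

```python
# Assumption: core count as stated in this post; change it for other T1 variants.
CORES = 4

# T2000 multi-threaded speedups from the table above.
t2000_speedup = {"Copy": 8.13, "Scale": 5.97, "Add": 7.54, "Triad": 5.33}

# Speedup per core: values above 1.0 mean one thread alone
# was not using all the bandwidth its core could reach.
per_core = {test: s / CORES for test, s in t2000_speedup.items()}

for test, value in per_core.items():
    print(f"{test:6s} {value:.2f} single-thread equivalents per core")
```

Every test comes out above 1.0 per core, and Copy lands at roughly 2, which is where the two-to-a-core guess comes from.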