STREAM: UltraSPARC IIIi vs UltraSPARC T1

Written 2014-06-21

Tags: SPARC benchmark STREAM computing

Today, I ran STREAM on my old router (Sun V210, 2x UltraSPARC IIIi) and my new one (Sun T2000, 1x Niagara T1).

Benchmark Overview

STREAM is a simple memory bandwidth benchmark. It runs a series of operations on large vectors and reports how long each operation took, along with a check that the results are correct. The tests are:
  • Copy: Y[] = X[]
  • Scale: Y[] = X[]*n
  • Add: Y[] = X[]+Z[]
  • Triad: Y[] = X[]*n+Z[]
The Copy test almost always gets optimized to a memcpy, which makes it a good reference for systems with weak FPU performance, or with no FPU at all. All other tests tend to make heavy use of any available FPU.
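As a rough illustration of what the four tests compute, here is a minimal pure-Python sketch. The function names, the tiny vectors and the scalar value are made up for this example. The real benchmark is compiled code running over arrays far larger than any cache, so this shows only the arithmetic, not the speed. The byte counts assume 8-byte elements and count every vector read or written once per pass, which is how STREAM turns timings into MB/s.

```python
# Illustrative only: what each STREAM kernel computes, not how fast it runs.
# All names here are hypothetical helpers, not part of the real benchmark.

def stream_copy(x):
    """Copy: Y[] = X[]"""
    return list(x)

def stream_scale(x, n):
    """Scale: Y[] = X[]*n"""
    return [n * xi for xi in x]

def stream_add(x, z):
    """Add: Y[] = X[]+Z[]"""
    return [xi + zi for xi, zi in zip(x, z)]

def stream_triad(x, z, n):
    """Triad: Y[] = X[]*n+Z[]"""
    return [n * xi + zi for xi, zi in zip(x, z)]

def bytes_moved(kernel, length, elem_size=8):
    """Bytes counted per pass: Copy and Scale touch two vectors,
    Add and Triad touch three (assumes 8-byte elements by default)."""
    vectors = {"copy": 2, "scale": 2, "add": 3, "triad": 3}[kernel]
    return vectors * elem_size * length

x = [1.0, 2.0, 3.0]
z = [0.5, 0.5, 0.5]
n = 3.0

print(stream_copy(x))            # [1.0, 2.0, 3.0]
print(stream_scale(x, n))        # [3.0, 6.0, 9.0]
print(stream_add(x, z))          # [1.5, 2.5, 3.5]
print(stream_triad(x, z, n))     # [3.5, 6.5, 9.5]
print(bytes_moved("triad", 3))   # 72
```

Dividing the bytes moved by the time one pass takes gives the bandwidth figure for that test.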

System Overview

V210

The V210 uses two UltraSPARC IIIi CPUs attached to DDR memory. Each IIIi has a single core with its own FPU.

T2000

The T2000 uses a single UltraSPARC T1 CPU attached to DDR2 memory. This T1 has eight cores with four hardware threads each, all sharing a single FPU, which gives 32 independently schedulable threads. The T1 is also known for slow single-threaded performance, a weakness addressed in the T4 and newer CPUs.

STREAM Results in Megabytes/Second

Box:    V210x1  V210x2  T2000x1  T2000x32
Copy:    496.7   577.5    429.6    3492.9
Scale:   498.3   568.0    261.2    1558.7
Add:     494.1   597.1    282.8    2133.4
Triad:   419.3   579.5    220.9    1176.8

Single-Threaded Results

V210

What can I say? This router is getting old.

T2000

The T1's single-threaded results are bad, even worse than the IIIi (a design four years older). This could prove to be a problem because, in addition to routing, I'll need it to run a few mostly single-threaded game servers as well. More measurements required.

Relative Multi-Threading Improvement over Single-Threading

Box:    V210x2  T2000x32
Copy:     1.16      8.13
Scale:    1.14      5.97
Add:      1.21      7.54
Triad:    1.38      5.33
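For anyone wanting to check the arithmetic, the ratios are just the multi-threaded bandwidth divided by the single-threaded bandwidth for each box. A small Python sketch, with the numbers copied from the STREAM results table and a made-up helper name (speedup):

```python
# STREAM results in MB/s, copied from the results table in this post.
results = {
    "Copy":  {"V210x1": 496.7, "V210x2": 577.5, "T2000x1": 429.6, "T2000x32": 3492.9},
    "Scale": {"V210x1": 498.3, "V210x2": 568.0, "T2000x1": 261.2, "T2000x32": 1558.7},
    "Add":   {"V210x1": 494.1, "V210x2": 597.1, "T2000x1": 282.8, "T2000x32": 2133.4},
    "Triad": {"V210x1": 419.3, "V210x2": 579.5, "T2000x1": 220.9, "T2000x32": 1176.8},
}

def speedup(row, single, multi):
    """Multi-threaded bandwidth relative to single-threaded, to two decimals."""
    return round(row[multi] / row[single], 2)

for test, row in results.items():
    v210 = speedup(row, "V210x1", "V210x2")
    t2000 = speedup(row, "T2000x1", "T2000x32")
    print(f"{test + ':':<7}{v210:5.2f}{t2000:7.2f}")
```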

Multi-Threaded Results

V210

A little bit faster, but not a whole lot. This usually means that one thread is capable of nearly saturating the memory bus/controller, which is good: it implies that the extra multithreading hardware comes at a relatively small cost. It could also mean that the memory controller or cache just isn't very good.

T2000

This is where the T1 shines, with between 5.3x and 8.1x more bandwidth when the load is spread over 32 threads. What's interesting here is that the Copy improvement came out slightly above 8x, the number of cores on a 32-thread T1 (eight cores, four threads each). This suggests that a single hardware thread isn't capable of saturating the bandwidth available to its core, so more than one thread per core is needed to saturate the chip's bandwidth, and that may only happen if the kernel spreads the threads evenly across the cores.