Three RTX 3090s, One 32B Model: A Pipeline-Parallel Canary

Nodehome's current three-GPU serving experiment is less about chasing a single headline benchmark and more about finding the shapes that actually work on owned Ampere hardware.

The useful lesson from the latest canary: tensor parallelism is not always the natural fit. Tensor-parallel runtimes generally need the attention heads to divide evenly across the GPUs, and for the tested Qwen2.5 32B AWQ checkpoint the attention layout makes a 3-way tensor split invalid. Pipeline parallelism, which splits the model by layers instead, is the viable way to put all three RTX 3090s to work on that model.
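A minimal Python sketch of the kind of divisibility check that rules out a 3-way tensor split. The head counts (40 query heads, 8 KV heads) are an assumption taken from the published Qwen2.5 32B configuration, so confirm them against the checkpoint's own config.json. The KV-head rule is a simplified version of what tensor-parallel runtimes commonly enforce, not the exact logic of any specific runtime, and the names AttentionLayout, tp_is_valid, and valid_tp_sizes are illustrative.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AttentionLayout:
    num_attention_heads: int  # query heads
    num_kv_heads: int         # key/value heads (grouped-query attention)


def tp_is_valid(layout: AttentionLayout, tp_size: int) -> bool:
    """Simplified rule: query heads must split evenly across ranks, and
    KV heads must either split evenly or be evenly replicated."""
    if tp_size < 1:
        return False
    if layout.num_attention_heads % tp_size != 0:
        return False
    if layout.num_kv_heads >= tp_size:
        return layout.num_kv_heads % tp_size == 0
    return tp_size % layout.num_kv_heads == 0


def valid_tp_sizes(layout: AttentionLayout, max_gpus: int) -> List[int]:
    """List every tensor-parallel degree up to max_gpus that passes the check."""
    return [tp for tp in range(1, max_gpus + 1) if tp_is_valid(layout, tp)]


# ASSUMPTION: head counts for the tested checkpoint; verify in config.json.
QWEN25_32B_ASSUMED = AttentionLayout(num_attention_heads=40, num_kv_heads=8)

print(tp_is_valid(QWEN25_32B_ASSUMED, 3))   # False: 40 is not divisible by 3
print(valid_tp_sizes(QWEN25_32B_ASSUMED, 8))  # [1, 2, 4, 8]
```

Under these assumed head counts, 2-way and 4-way tensor splits pass the check while 3-way does not, which is why a three-card box falls back to splitting by layers.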

The canary completed a repeated-request soak with successful HTTP responses, stable worker behavior, and bursty inference load at the 300W power cap. Temperatures peaked in the low 80s °C on the hotter cards, then dropped quickly once the request burst ended.
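For readers who want to run a similar check, here is a rough Python sketch of a bursty repeated-request soak loop. It is not the harness used for this canary. The function name run_soak, its parameters, and the injected send_request callable are all hypothetical; in a real run, send_request would wrap an HTTP call to the serving endpoint and return the status code.

```python
import time
from typing import Callable, Dict


def run_soak(
    send_request: Callable[[], int],
    bursts: int,
    requests_per_burst: int,
    idle_seconds: float,
) -> Dict[str, float]:
    """Send requests in bursts with idle gaps, then summarize the results.

    send_request is any callable that performs one request and returns
    an HTTP status code.
    """
    statuses = []
    latencies = []
    for burst in range(bursts):
        for _ in range(requests_per_burst):
            start = time.perf_counter()
            statuses.append(send_request())
            latencies.append(time.perf_counter() - start)
        # Idle between bursts so the cards can cool, mimicking bursty load.
        if burst < bursts - 1:
            time.sleep(idle_seconds)

    ok = sum(1 for status in statuses if 200 <= status < 300)
    return {
        "total": len(statuses),
        "ok": ok,
        "failed": len(statuses) - ok,
        "max_latency_s": max(latencies) if latencies else 0.0,
    }


# Stand-in sender for a dry run; replace with a real HTTP call.
summary = run_soak(lambda: 200, bursts=2, requests_per_burst=3, idle_seconds=0.0)
print(summary["total"], summary["ok"], summary["failed"])
```

Passing the sender in as a callable keeps the loop testable without a live server. GPU temperature and power would need to be logged separately alongside it.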

That makes it a real serving-shape signal, but it does not settle every question:

  • It is a 32B AWQ pipeline-parallel proof, not a 70B proof.
  • It is a bursty inference canary, not a sustained training or stress-test pass.
  • It shows that three consumer GPUs can be useful even when tensor parallelism is the wrong fit.
  • It keeps the next question focused on workload shape: interactive agents, long-context concurrency, and sustained thermal policy are different tests.

The practical takeaway for local builders is simple: count GPUs, but also count attention heads, KV layout, runtime constraints, and thermal behavior. "Three GPUs" is not one architecture. It is a menu of possible serving shapes.
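The "menu" idea can be made concrete with a small Python sketch that enumerates the ways to factor a GPU count into tensor-parallel and pipeline-parallel degrees. The defaults used here (40 attention heads, 64 layers) are assumptions about the tested model, the even layer split is illustrative rather than what any particular runtime does, and the function names are hypothetical. Only the query-head check is modeled; memory fit and thermals are out of scope.

```python
from typing import Dict, List


def split_layers(num_layers: int, stages: int) -> List[int]:
    """Spread layers across pipeline stages as evenly as possible."""
    base, extra = divmod(num_layers, stages)
    return [base + (1 if i < extra else 0) for i in range(stages)]


def serving_shapes(num_gpus: int, num_heads: int, num_layers: int) -> List[Dict]:
    """Enumerate every (tp, pp) pair with tp * pp == num_gpus."""
    shapes = []
    for tp in range(1, num_gpus + 1):
        if num_gpus % tp != 0:
            continue
        pp = num_gpus // tp
        shapes.append({
            "tp": tp,
            "pp": pp,
            # Query heads must divide evenly across tensor-parallel ranks.
            "heads_divide": num_heads % tp == 0,
            "layers_per_stage": split_layers(num_layers, pp),
        })
    return shapes


# ASSUMPTION: 40 heads and 64 layers for the tested 32B model.
for shape in serving_shapes(num_gpus=3, num_heads=40, num_layers=64):
    print(shape)
```

With three GPUs the only full-machine options are a 3-stage pipeline or a 3-way tensor split, and only the former passes the head check under these assumptions. Shapes that leave a card idle, such as a 2-way tensor split on two of the three GPUs, are also on the menu but are not enumerated here.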