Problem rendering with Vray on 32 Core Opterons.


#1

So, we started building a few workstations to improve rendering performance; one day we will go on to build a render farm :). We ran render tests on 3 machines, including my 2010 MacBook Pro. *Note: we are rendering with V-Ray 2.0 on 3ds Max 2010.

Computer #1
2 AMD Opteron 6276 16 Core Processors
Quadro 5000
64 gigs of Ram

Computer #2
Intel core i7 -3960x CPU @3.3ghz - 6 cores - 12 threads
Nvidia gtx 570
16 gb ram

Computer #3 (Macbook Pro 2010)
i7 2.6ghz - 4 cores, 8 threads
8gigs of ram

So we rendered the same file on all 3 machines. The file has GI on (light cache and brute force) and was rendered at a resolution of 1920 x 1080. After the renders, we were VERY surprised at the performance of the Opteron machine.

The MacBook Pro rendered the image in 16 min.
The Intel Core i7 3960X rendered the image in 5 min.
The OPTERONS rendered the image in 20 min.

We are really confused! Why is the Opteron workstation the slowest of the three machines?! After countless hours of research, we couldn't find an answer. I assumed it had to do with the light cache, but setting it to 32 did nothing on the Opteron machine.


#2

Does it utilize 100% of all the cpu cores when you render?

Are you using any sub surface scattering shaders? I’m not 100% sure because I’m not a v-ray user, but I think I’ve read that v-ray SSS shaders are single-threaded just as they are in mental ray.

Individually, the Opteron cores are a little slow, so if there are any single-threaded rendering functions, it can really slow things down while the other cores wait on 1 or 2 cores to finish something.
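A rough way to see how much a single-threaded stage can hurt is Amdahl's law. The sketch below is only an idealized model: the serial fractions are made-up illustration values, not measurements from this scene, and it ignores per-core speed differences entirely.

```python
def amdahl_speedup(serial_fraction, cores):
    """Ideal speedup over one core when serial_fraction of the work is single-threaded."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


# Hypothetical serial fractions, chosen only to show the trend.
for serial in (0.0, 0.05, 0.20):
    print(
        f"serial part {serial:.0%}: "
        f"32 cores -> {amdahl_speedup(serial, 32):.1f}x, "
        f"6 cores -> {amdahl_speedup(serial, 6):.1f}x"
    )
```

With a 20% serial part the 32-core figure already falls below 5x, so a machine with many slow cores loses most of its advantage once something in the pipeline runs on one core.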

I’d run the Task Manager, click the Performance tab, set the update speed of the graph to “low” under the View menu, and under CPU History set it to display “one graph, all CPUs”. Then I’d render a file with it open and see whether the graph is fully pegged the whole time or sits low for a large portion of the render.
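If you would rather log numbers than watch the graph, the same check can be sketched in a few lines. This is only an illustration: `summarize_cpu_samples` is a hypothetical helper, and the snapshots below are invented 4-core values. On a real machine the per-core readings could come from something like `psutil.cpu_percent(percpu=True)`, sampled once a second during the render.

```python
def summarize_cpu_samples(samples, busy_threshold=90.0):
    """Summarize snapshots of per-core CPU utilization (percentages).

    samples: list of snapshots, each a list with one value per core.
    """
    if not samples:
        raise ValueError("need at least one snapshot")
    total = len(samples)
    mean_util = sum(sum(s) / len(s) for s in samples) / total
    # Snapshots where every core is at or above the threshold.
    pegged = sum(1 for s in samples if min(s) >= busy_threshold)
    # Snapshots where at most two cores are doing real work.
    few_busy = sum(
        1 for s in samples if sum(1 for u in s if u >= busy_threshold) <= 2
    )
    return {
        "mean_utilization": mean_util,
        "fraction_fully_pegged": pegged / total,
        "fraction_one_or_two_cores": few_busy / total,
    }


# Invented example data: two snapshots fully busy, two with a single busy core.
samples = [
    [100, 100, 100, 100],
    [100, 5, 3, 2],
    [100, 4, 2, 1],
    [98, 97, 99, 96],
]
report = summarize_cpu_samples(samples)
print(report)
```

A large "one or two cores" fraction would point toward a single-threaded stage rather than a memory problem.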


#3

Recent Opteron machines are built on NUMA architecture, which means that the processor cores don’t have equal access to RAM. Each core has some part of the RAM attached to it and it can access it fastest. If the core needs to fetch data from other parts of the RAM, it needs to ask the other cores for it.

This works very well if the machine runs many separate processes performing different tasks, as is the case for web servers and database servers. This is also the main target for these Opteron machines.

However this can become very slow if one single application needs to use all the cores operating on the same data, as is the case with rendering. Such an application needs to be specifically coded for NUMA processor architecture to achieve maximum performance. Most of the applications that you use normally are not coded in this way (and it’s a lot of effort to rewrite them).

Best regards,
Vlado
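To make the NUMA point above a bit more concrete, here is a toy model of average memory latency when data is spread evenly over all nodes. Everything numeric in it is an assumption for illustration: the 100 ns and 200 ns latencies are invented, and the node count of 4 is a guess, not a measured property of the machine in this thread.

```python
def expected_remote_fraction(nodes):
    """Share of accesses that hit another node if data is spread uniformly."""
    if nodes < 1:
        raise ValueError("need at least one node")
    return (nodes - 1) / nodes


def effective_latency_ns(local_ns, remote_ns, remote_fraction):
    """Average access latency for a given mix of local and remote accesses."""
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns


# Invented latencies: 100 ns local, 200 ns remote.
for nodes in (1, 2, 4):
    frac = expected_remote_fraction(nodes)
    print(f"{nodes} node(s): {effective_latency_ns(100, 200, frac):.0f} ns average")
```

In this model a single process touching shared scene data from every core pays mostly remote latency, while separate processes that each stay on their own node would pay mostly local latency.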


#4

The V-Ray SSS shaders are multithreaded.

Best regards,
Vlado


#5

Opteron processors have always been NUMA, not just recently. Xeon processors are NUMA as well. Are you saying that V-Ray sucks on anything with more than one socket? NUMA is not the problem.

Sounds to me like something else is going on, like it’s using one processor core the whole time or most of the time (or there’s a bug somewhere affecting Opteron systems). Those Opteron processors score around 15 on Cinebench 11.5, which is about 50% faster than the 3960X.


#6

Hi, I’m no expert (actually I could use some advice too regarding what CPUs to buy for our own render farm, see some threads below :) ),
but how big is the scene and what is your dynamic memory limit?


#7

Do you know if they’re 100% parallel multithreaded, or does each SSS shader get calculated by its own individual core, but they calculate at the same time? If that’s the case, then if one shader is taking up 90% of the screen while 5 other SSS shaders take up 10%, the others could finish early and everything waits on the one taking 90% of the image.


#8

I’m pretty sure the first Opterons at least had an option to work in SMP mode.

“Xeon processors are NUMA as well.”
Could be, but then it seems better implemented.

“NUMA is not the problem.”
For Opterons, as far as we could test, it does present a problem. It was more efficient to run multiple render jobs on the different nodes, rather than one job on all CPU nodes. It was even a lot faster to do DR on the same machine with several render servers running locally, each rendering on its own CPU node.

Best regards,
Vlado
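The "one job per CPU node" setup described above comes down to pinning each render process to the cores of a single node. Below is a minimal sketch of computing the affinity bitmasks for that. It assumes logical cores are numbered contiguously within each node and that the 32-core machine splits into nodes of 8 cores; both are assumptions, so check the real topology first. How the V-Ray render servers themselves get launched is not shown here; on Windows a hex mask like these can be passed to `start /affinity`.

```python
def node_affinity_masks(total_cores, cores_per_node):
    """Return one CPU affinity bitmask per node, lowest-numbered node first."""
    if cores_per_node <= 0 or total_cores % cores_per_node != 0:
        raise ValueError("total_cores must be a positive multiple of cores_per_node")
    node_mask = (1 << cores_per_node) - 1  # e.g. 0xff for 8 cores
    return [
        node_mask << (node * cores_per_node)
        for node in range(total_cores // cores_per_node)
    ]


# Assumed layout: 32 cores in 4 nodes of 8 cores each.
for node, mask in enumerate(node_affinity_masks(32, 8)):
    print(f"node {node}: affinity mask {mask:#x}")
```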


#9

They are.

“or does each SSS shader get calculated by its own individual core, but they calculate at the same time? If that’s the case, then if one shader is taking up 90% of the screen while 5 other SSS shaders take up 10%, the others could finish early and everything waits on the one taking 90% of the image.”
This is not the case.

Best regards,
Vlado


#10

I currently use a quad core 9550 and just ordered an i7 3930K. Maybe it’s not the time to worry about it, but I started wondering how the new machine will perform on viewport operations:

Render performance is confusingly stated to be between two and three times that of my current machine on many benchmark sites. Here is a link to Anandtech Benchmarks which says it is just about twice at most in radiosity rendering and even worse with other 3ds Max benchmark subjects.

You know that when there are 4 cores, viewport operations are handled by a single core, which uses 25% of the total performance. Looking at the benchmark results together with the 12 virtual cores of the i7 3930K, I started to wonder whether this new machine will be slower during standard viewport operations:

I_____I_____I_____I_____I Performance of all 4 cores of q9550

I___I___I___I___I___I___I___I___I___I___I___I___I Performance of all 12 cores of i7 3930K (less than twice)

As you can see above, a single core of the 3930K would end up being slower, so e.g. a pflow event with lots of particles recomputing while you move the time slider, or a complex boolean operation, would make you wait longer than on the previous CPU.

I wonder if this will be the case. If so then I’d better sit and cry, not because the advantage of the new CPU is only halving the rendering times, but because I’ve just given my old invaluable single core P4 3.2 GHz to the son of a friend :D


I wonder how the mental ray benchmark would turn out on those 3 machines of yours. Is it possible for you to test that with one scene?


#11

This is why single-threaded benchmarks, such as the single-threaded Cinebench, are as important as the multithreaded ones. I don’t understand why websites always seem to run the multithreaded Cinebench and not the single-threaded one as well.

I have both 3930Ks and 9550s at work and the 3930K is waaaay faster than the 9550 as long as you’re not running old software. If something is slower it’s because the software wasn’t coded to account for virtual cores via hyperthreading, or in some cases, some older software won’t see more than 4 cores.

The 6 virtual cores on the 3930k account for roughly 25% of the total multithreaded performance. If you turn off the virtual cores by turning hyperthreading off you’ll lose 25% of your multithreaded performance, but single-threaded performance will be just as fast unless you’re using older software that isn’t coded to understand the difference between a real core and virtual core.
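Putting rough numbers on this: the sketch below takes the pessimistic "about twice the Q9550" total from the earlier post and the 25% hyperthreading share mentioned here, then compares the throughput share of one physical core with the naive "divide by 12" estimate. These are the posters' ballpark figures, not benchmark results, and throughput share is only a crude stand-in for real single-threaded speed.

```python
def per_core_estimate(total_multithreaded, physical_cores, ht_share):
    """Throughput of one physical core if hyperthreading adds ht_share of the total."""
    if not 0.0 <= ht_share < 1.0:
        raise ValueError("ht_share must be in [0, 1)")
    without_ht = total_multithreaded * (1.0 - ht_share)
    return without_ht / physical_cores


# Normalized so the whole Q9550 (4 cores, no hyperthreading) scores 1.0.
q9550_core = per_core_estimate(1.0, 4, 0.0)
# Pessimistic case: 3930K total is only 2x, with 25% of that from virtual cores.
i7_core = per_core_estimate(2.0, 6, 0.25)
# Naive estimate that treats all 12 logical cores as equal.
naive_core = 2.0 / 12

print(f"Q9550 core: {q9550_core:.3f}")
print(f"3930K core (HT-aware): {i7_core:.3f}")
print(f"3930K core (naive /12): {naive_core:.3f}")
```

Even under the pessimistic 2x figure, the hyperthreading-aware estimate puts one 3930K core level with one Q9550 core rather than behind it; the naive division by 12 is what makes the new chip look slower.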

For instance, Adobe 32bit After Effects CS4 and earlier ran something like 2-3x faster if hyperthreading was turned off. With CS5 and higher, it was no longer penalized for having it on.

The benchmarks on Anandtech are all valid, but you’ll notice many of them use older versions of programs and benchmarks.


#12

Hi Chris,

Did you ever solve your problem with the slow-rendering Opteron machine?
I’m having the same issue right now…

