R16 Team Render feedback thread

  06 June 2015
New job sent to TRS, new old problems with getting anything done.

Just 5 out of the 20 nodes are rendering frames. No other job is running (although I see plenty of failed renders from other users). Status is "Busy" for all nodes though, good to know.

What should have been a one hour render is roughly 40% through after three and a half hours of waiting. My productivity is plummeting with each new job and the producers don't want to hear it anymore. Thanks Maxon.
 
  06 June 2015
wow, sorry to hear you are still having issues...

be sure to check the log on the TRS Server and look at what the network speeds are to the nodes - so far, for many of the people i have offered to help, the speeds are very slow, in the KiB/s range, and that is timing out the nodes. (this is where maxon is missing a lot in testing - having just a few nodes on a very fast network shows them only one half of the scenario, whereas NET was developed back when 10BASE-T was FAST, so it knows how to handle slower networks ok)

i have had such good success with TRS and TR since 16.050 that i was thinking, other than the two issues i experience, most users with fewer than 10 or 20 nodes would be going ok.

what are the logs on the TR nodes saying?

dann
__________________
-
Dann Stubbs - dann@darkskydigital.com
http://www.RenderKing.com Value Priced C4D, VRAY, Cycles4D Render Farm
-
 
  06 June 2015
Are you running 16.050? Grab the log for one of the busy clients that's not rendering, along with the job log, and attach them here, or fill out the support form (mention my name so they can route the ticket to me).
 
  06 June 2015
Thanks, Dann. I wish I knew what's happening with the network speeds, as the engineer got 'offended' after I got in touch with Maxon (true story) and stopped informing me of whatever he's doing (long story). Patrick Goski has not replied to me since then either, so I'm blind from both sides. I can see the speeds are slow but I have no idea what Maxon and the engineer discussed about it or if there's any fix for it.

In this case, as the project has no big assets and therefore there's little network activity, I don't see why just some of the nodes are rendering while the rest are "Busy" doing nothing.
 
  06 June 2015
sorry to hear that, well i can't say what was said, but many years ago the couple times i ever talked to maxon support it was "always" my fault. which if you got one of "those guys", yeah i can see how it may not have been a good resolution. i can imagine it possibly going like this "your network speeds are too slow, get a faster network, it's your fault"

hopefully not, but...

are you rendering on the TRS Server node? i don't recall if you ever really answered that to me, but try without that node rendering (if it is) and see if that helps - it may help the throughput to the other nodes.

i will note one weird thing i noticed in my logs the other day: i had my normal 20 nodes active to start a job and i saw the TRS Server was sending out the data to all of them unusually slowly... the posted rates were like a tiny fraction of the speeds i typically get.

i grew concerned and started checking my network switches, the server HD etc thinking something was faulty or failed - everything seemed normal, so i started adding on the additional 12 TR nodes to the job (i usually add them in 6, then 6 more in a couple minutes to avoid the "too many open files error")

and the speeds to these nodes posted by the same TRS Server of course were orders of magnitude higher - like back to normal for here, 60-90 MiB per second. i should have taken a screen shot but the initial send out to the first 20 was like 500-700 KiB or something - something unusually slow for here - far slower than even if it was 1/20 of the network speeds i normally get
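
To put those figures side by side, here is a rough check in Python. It assumes, purely for illustration, that the usual 60-90 MiB/s would be split evenly across the first 20 nodes; the variable names are made up for this sketch and nothing here comes from Team Render itself.

```python
# Rough check of the rates quoted in the post (values are the poster's estimates).
KIB_PER_MIB = 1024

normal_mib_s = (60, 90)      # typical rate reported, MiB/s
observed_kib_s = (500, 700)  # slow initial rate to the first 20 nodes, KiB/s
nodes = 20

# Assumption: the normal rate is shared evenly across all 20 nodes.
fair_share_kib_s = tuple(rate * KIB_PER_MIB / nodes for rate in normal_mib_s)
print(fair_share_kib_s)  # (3072.0, 4608.0)

# Ratio between the lowest even share and the best observed rate.
shortfall = fair_share_kib_s[0] / observed_kib_s[1]
print(round(shortfall, 1))  # 4.4
```

So even the most generous reading of the slow rate is more than four times below an even 1/20 split, which is consistent with the "far slower than 1/20" remark.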

if maxon could put in a limiter, like the TRS sends to 5 nodes, then to the next 5, then the next 5 etc - i could see this really helping with the massive barrage of data that needs to flow out - it could probably keep users with slower networks from getting overloaded too - maybe preventing or solving this issue. this type of "limit" action is fairly standard in upper level render managers (can you imagine TRS trying to blast the data all at once to 900 TR nodes? it would just die)

the number of nodes should, in my opinion, be a user-controlled preference of course - like i may like it to be 10, but others may want 5, etc... or if it's a small network they could set it to a higher number so all nodes get sent to at once.
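
The limiter idea above can be sketched roughly as follows. This is only an illustration of batched distribution with a user-set batch size, not how TRS works internally; `send_in_batches`, the `send` callback, and the node names are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def send_in_batches(nodes: Iterable[str],
                    send: Callable[[str], None],
                    batch_size: int = 5) -> List[List[str]]:
    """Send to at most batch_size nodes at a time; return the batches used."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    node_list = list(nodes)
    batches = [node_list[i:i + batch_size]
               for i in range(0, len(node_list), batch_size)]
    for batch in batches:
        # Transfers inside a batch run in parallel; the next batch starts
        # only after every transfer in the current one has finished.
        with ThreadPoolExecutor(max_workers=len(batch)) as pool:
            list(pool.map(send, batch))
    return batches


# Hypothetical usage: 12 nodes, batch size 5, "sending" just records the name.
sent = []
used = send_in_batches([f"node{i:02d}" for i in range(1, 13)],
                       sent.append, batch_size=5)
print([len(b) for b in used])  # [5, 5, 2]
```

Setting `batch_size` to the total node count reproduces the "send to everyone at once" behavior, which is the small-network case mentioned above.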

there definitely is something fishy with the TCP/IP in TRS and TR though... i still see that timeout happening far too frequently, yes they come back online by themselves but it really should not be happening so often.

feel free to email me your logs or any info if you want me to take a look, sorry again about the troubles...

dann
 
  06 June 2015
Thanks Rick, here's the log of one of the Busy nodes that was not rendering my project. The failed project ALF_RED2 is from another artist. Mine seems to be the last line after I restarted the node.

C4DNRSERVER1A Download-Speed 1.59 MiB\s
Peer-to-Peer Statistics End
2015/06/28 21:42:39 (Error) Render Job failed: Rendering stopped because of an out-of-memory or unknown error
2015/06/28 22:30:04 Downloaded Asset(s) in 1.011 seconds
2015/06/28 22:30:04 Rendering frame 426 of ALF_RED2
2015/06/28 22:35:06 (Error) Render Job failed: Rendering stopped because of an out-of-memory or unknown error
2015/06/28 22:35:06 Peer-to-Peer Statistics:
> C4DNRSERVER1A Download-Speed 1.44 MiB\s
Peer-to-Peer Statistics End
2015/06/28 23:16:55 Downloaded Asset(s) in 1.970 seconds
2015/06/28 23:16:56 Rendering frame 399 of ALF_RED2
2015/06/28 23:19:28 Rendering frame 400 of ALF_RED2
2015/06/28 23:23:15 Rendering frame 401 of ALF_RED2
2015/06/28 23:23:15 (Error) Render Job failed: Rendering stopped because of an out-of-memory or unknown error
2015/06/28 23:23:15 Peer-to-Peer Statistics:
> C4DNRSERVER1A Download-Speed 647.44 KiB\s
Peer-to-Peer Statistics End
2015/06/29 00:04:17 Downloaded Asset(s) in 1.330 seconds
2015/06/29 00:04:19 Rendering frame 385 of ALF_RED2
2015/06/29 00:09:03 Peer-to-Peer Statistics:
> C4DNRSERVER1A Download-Speed 905.71 KiB\s
Peer-to-Peer Statistics End
2015/06/29 00:09:03 Rendering frame 386 of ALF_RED2
2015/06/29 00:09:03 (Error) Render Job failed: Rendering stopped because of an out-of-memory or unknown error
2015/06/29 00:29:06 Downloaded Asset(s) in 0.928 seconds
2015/06/29 00:29:06 Rendering frame 171 of ALF_RED2
2015/06/29 00:34:16 Peer-to-Peer Statistics:
> C4DNRSERVER1A Download-Speed 2.13 MiB\s
Peer-to-Peer Statistics End
2015/06/29 00:34:16 (Error) Render Job failed: Rendering stopped because of an out-of-memory or unknown error
2015/06/29 00:54:22 Downloaded Asset(s) in 1.128 seconds
2015/06/29 00:54:22 Rendering frame 406 of ALF_RED2
2015/06/29 00:59:11 (Error) Render Job failed: Rendering stopped because of an out-of-memory or unknown error
2015/06/29 00:59:11 Peer-to-Peer Statistics:
> C4DNRSERVER1A Download-Speed 1.09 MiB\s
Peer-to-Peer Statistics End
2015/06/29 00:59:11 Rendering frame 407 of ALF_RED2
2015/06/29 01:01:48 Downloaded Asset(s) in 2.917 seconds
2015/06/29 04:56:21 Peer-to-Peer Statistics:
> C4DNRSERVER1A Download-Speed 545.69 KiB\s
Peer-to-Peer Statistics End
2015/06/29 04:58:38 Downloaded Asset(s) in 0.703 seconds

Last edited by Nanome : 06 June 2015 at 10:10 PM.
 
  06 June 2015
So it appears as though the issue is not related to your scene at all, but the ALF_RED2 scene. Any idea what's in that scene? Can you pull the job log for it?

The network speeds are pretty slow, which is concerning. I just reviewed the last 20 pages of this thread to catch up on what ground has already been covered (and remembered why I hate forums for tech support), and I see Patrick already brought this up. Do you know if Fetch Assets from Server is enabled as Patrick suggested?
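
One way to quantify what a client log like the one above shows is to pull out the reported download speeds and count the failures per job. The sketch below is a minimal example that assumes the line formats seen in the posted log; `summarize` is a made-up helper, and attributing each failure to the most recent "Rendering frame" line is a simplifying assumption.

```python
import re
from collections import Counter

# Patterns assume the line formats in the posted log (note the literal "\s").
SPEED_RE = re.compile(r"Download-Speed\s+([\d.]+)\s+(KiB|MiB)\\s")
FRAME_RE = re.compile(r"Rendering frame \d+ of (\S+)")
TO_KIB = {"KiB": 1.0, "MiB": 1024.0}


def summarize(log_text):
    """Return (speeds in KiB/s, failure count per job) for a client log."""
    speeds_kib = []
    failures = Counter()
    current_job = None  # last job seen in a "Rendering frame" line
    for line in log_text.splitlines():
        speed = SPEED_RE.search(line)
        if speed:
            speeds_kib.append(float(speed.group(1)) * TO_KIB[speed.group(2)])
            continue
        frame = FRAME_RE.search(line)
        if frame:
            current_job = frame.group(1)
            continue
        if "(Error) Render Job failed" in line:
            # Assumption: the failure belongs to the job rendered last.
            failures[current_job] += 1
    return speeds_kib, failures


# A few lines copied from the log posted earlier in the thread.
SAMPLE = r"""2015/06/28 23:23:15 Rendering frame 401 of ALF_RED2
2015/06/28 23:23:15 (Error) Render Job failed: Rendering stopped because of an out-of-memory or unknown error
2015/06/28 23:23:15 Peer-to-Peer Statistics:
> C4DNRSERVER1A Download-Speed 647.44 KiB\s
Peer-to-Peer Statistics End
2015/06/29 00:34:16 Peer-to-Peer Statistics:
> C4DNRSERVER1A Download-Speed 2.13 MiB\s
Peer-to-Peer Statistics End
"""

speeds, failures = summarize(SAMPLE)
print(min(speeds))     # 647.44
print(dict(failures))  # {'ALF_RED2': 1}
```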

Also, is C4DNRSERVER1A a Windows or Mac client? It might be helpful to see a screenshot of its Performance tab / Activity Monitor if that's possible. Also, please make sure that it doesn't have both an Ethernet and a Wi-Fi connection on at the same time - that could be a source of the slow network speeds.

I'm not sure how much of this you and I can work out - it sounds like there's a network engineer between us, and I've managed IT long enough to know how people like us can be when users start complaining. If the engineer is willing to contact me, and maybe do some screensharing, it might help us get to a solution quicker. I presume your studio is in the US based on your forum tag? If he just mentions my name through the main support phone or email they'll direct things over to me (usually not a great idea because I'm out of the office, in meetings or on special projects a lot, but for a case like this I'm happy to step in).
 
  06 June 2015
now on rare occasions (used to be all the time before 16.050) the TR nodes will just decide they are done rendering and will just go idle - it's only happened a couple times since .050 but that could be one issue - and if you just simply quit that TR Client and then restart it they will usually come back online and jump in on the rendering fine.

i do see they sometimes get weird after a VRAY render in particular - sometimes the VRAY job is STILL running on the node despite the job being stopped/quit/errored out. this does happen with normal C4D renders too at times.

usually if you go to the node and look at the info and job running you can see the previous job may still show running (or sometimes it does indeed show it as stopped) but the CPU utilization is still active (could be anywhere from 13% to 100% depending on task stuck running) but i do have to quit the TR Client to get the job to really stop, restart it and then it will move onto the next job that is really active

i suspect that is what is going on in your "busy" nodes - they are still actually running the previous job that has errored out - some just don't seem to get the "quit" message (with both AR and VRAY jobs after errors)
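
The check described above (job no longer current or no longer running, but CPU still active) could be expressed like this. It is only a sketch of the reasoning: the `NodeStatus` fields, the 10% idle threshold, and the job name `MY_JOB` are assumptions for illustration, and the numbers would have to be collected from each node by hand or by some other tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeStatus:
    name: str
    job_name: str       # job the node reports it is working on
    job_state: str      # e.g. "running", "stopped", "error"
    cpu_percent: float  # CPU utilization observed on the node


def looks_stuck(node: NodeStatus, active_job: str,
                idle_cpu: float = 10.0) -> bool:
    """True if the node is burning CPU on something other than the live job."""
    on_old_job = node.job_name != active_job or node.job_state != "running"
    return on_old_job and node.cpu_percent >= idle_cpu


def stuck_nodes(nodes: List[NodeStatus], active_job: str) -> List[str]:
    return [n.name for n in nodes if looks_stuck(n, active_job)]


# Hypothetical readings for four nodes while "MY_JOB" is the only live job.
readings = [
    NodeStatus("n01", "ALF_RED2", "error", 13.0),    # errored job, CPU active
    NodeStatus("n02", "MY_JOB", "running", 98.0),    # healthy
    NodeStatus("n03", "ALF_RED2", "running", 100.0), # old job still shown running
    NodeStatus("n04", "ALF_RED2", "stopped", 1.5),   # really stopped
]
print(stuck_nodes(readings, "MY_JOB"))  # ['n01', 'n03']
```

Nodes flagged this way are the ones where quitting and restarting the TR Client, as described above, would be the first thing to try.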

also what stop on error settings do you have set on your TRS Server?

"Abort render on Client error" will stop the job on that node if it gets any error - so say one gets an out of memory error - then the job gets stopped (i have turned that one off so the job keeps running except the one node, which i guess drops off - it's not really super clear what the intended result is or should be)

but also the "handle warning as error" - even if just a benign error comes up it will stop the job on that node if that check mark above is on

and the "exclude client on loading error" - i think this may be most of your issue: with the slow network speeds i think the TR Client times out and then just does not run the job. this one is good, but with a slow network it gets triggered very often

also i have turned OFF the peer to peer asset distribution - that whole "idea" sounds good on paper but not in the real world unfortunately - i guess if you have just two or three nodes on a really fast network it's ok but i think it gets overwhelmed easily in bigger setups. this may also help your network speeds i think. (kinda like bonjour etc)

dann
 
  06 June 2015
Thanks Dann, I now have admin access to the TRS webpage but I cannot find any of those preferences you mentioned. Where should I be looking? I don't have physical access to the machines in the farm if that's what's needed.

Rick, thanks again, the clients are all Windows machines, I will request that activity info that you mentioned.

Even though the errors in that log I posted were related to another project, after restarting server and clients and rendering the project from scratch, the "Busy nodes that are actually idle" problem persists. Nothing is rendering but the clients are "Busy"; only the Server says Idle and it's not even sending the render command to the nodes.

Needless to say, I was forced to open the file and recreate all the materials in R14 and it already rendered twice with no problems using NET Render. Only issue with this is that the R16 version looks way better and I hate having to compromise on the way stuff looks because of TRS. Very frustrating.

Rick, maybe you can't give an answer to this but, why won't Maxon just give us NET Render R16 for the time being till they fix this mess?
 
  06 June 2015
The prefs are in the Server app itself, so you'd need console access to the machine running TR server to check the prefs. I'm going to make a suggestion that they be visible in the Web Interface, because that would be helpful for troubleshooting.

When you restarted the server and client, and started just your simple job they immediately exhibited the issue? Hmm. Guess we need more logs.

It's really not possible to go back to the NET architecture. A lot of changes were made to the render engine to support DR and other Team Render concepts, and TR introduced new fundamental networking code that other functions now rely on as well.

The only way to go is forward, so we'll need to continue working together to resolve these issues. Good progress was made with R16.050, to the point where it's much easier now to narrow down and troubleshoot the issues that remain. Yours seems a tricky issue, but we'll get it figured out.
 
  06 June 2015
Are there any 3rd-party plugins installed? Are any plugins being used in the scene?

After restart, can you successfully render 90 frames of a rotating cube?

We may need to boil this down to the lowest common denominator, and then start building it back up.
 
  06 June 2015
the prefs i mention are in the TR Client or TRS Server app preferences... so you need OS level access to check or change them.

that is very weird that the behavior persists, with them saying busy but being idle - have they ever worked? or has this started with 16.050?

i think most know that i am very vocal so maxon hears it, but 16.050 has been working very well - i don't have any special version or anything, i suffered with the previous versions just the same - and finally 16.050 i think is almost equivalent to NET. still not a huge fan of the TRS GUI, but the fact it's working pretty reliably now means i don't need to do much other than the MONITOR page and the occasional pop over to the JOB tab, so now it's not too annoying to use. (albeit missing things like the frames info of the scene, and the generic 1 of 100 is not too helpful as i usually want to be sure it's actually frames 100-200, or 350-450, and both those just show as generic 1 of 100 so meh)

if maxon can get the TCP/IP timeout more stable and get rid of that "too many open files" error, then really TRS will be in pretty good shape compared to NET

wish i had more ideas, but not having OS level access makes it hard to troubleshoot. at least at this point with admin access from the NETWORK tab you can see the full TRS Server log as well as the TR Client logs too - so read through those and maybe it will give some clues?

dann
 
  07 July 2015
Originally Posted by Nanome: Thanks, Dan. I'd wish I knew what's happening with the network speeds as the engineer got 'offended' after I got in touch with Maxon (true story) and stopped informing me of whatever he's doing (long story). Patrick Goski has not replied to me since then either so I'm blind from both sides. I can see the speeds are slow but I have no idea of what Maxon and the engineer discussed about it or if there's any fix for it.

In this case, as the project has no big assets and therefore there's little network activity, I don't see why just some of the nodes are rendering while the rest are "Busy" doing nothing.


Haha.
I haven't heard anything back since I escalated the issue.
So really, the best I can offer is to reach out to the engineer...buy him some beers to placate his wounded ego...and then work to resolve the network problems that are currently causing issues.

All I can say is that TR/TRS continues to be a major priority for development.
16.050 was a HUGE step in the process and all of the sample files that you have all provided me have been forwarded to development (these I delivered personally), and will remain a part of the testing process as things move forward.

All your input has been a huge help along the way.
__________________
The views expressed in this post are by no means the opinion of those making the post or of any one person in particular.
 