
Switch Traffic Theory


Cryptite
12-05-2012, 03:07 PM
This is a call for advice on switch traffic theory for a small-sized (~13 artists) VFX studio. For the most part we do fine, but we have a couple of problems (really, one big one) that cause bandwidth choking for us, and I'd love to get some ideas about how your studios run their networks and what has and hasn't worked for you in the past.

Here's the setup for us:

Roughly 5 file servers (we'll shorten to FS for the rest of the post): one main one that serves all of the production files; the other 4 are rare-use and can legitimately be forgotten about for the time being.
60 render farm blades; our 20 fastest render our Fusion/Nuke comps (which is where the problem lies; I'll get to it in a moment).
~13 artist workstations, presumed equal in speed, but this is a bandwidth issue, so their specs don't much matter here.


The switch setup is the following:

40 render blades (all 3D rendering, no Fusion rendering) are on Switch A.
The other 20 aforementioned comp/3D render blades are on Switch B.
Switch B is wired to Switch A with 2 CAT6 cables.
Switch A then routes traffic from both itself and B (daisy-chain, if you will) to the "Main Switch" through 4 CAT6 cables.
The "Main Switch" hosts all of our file servers and workstations, as well as all of the traffic from the render farm switches.


The main bandwidth choke point is when we launch Fusion jobs to the farm. We only have 10 render licenses for it, but when 10 of our fastest machines (or probably any 10, for that matter) get to rendering Fusion, the constant pull of Very Large EXRs™, combined with how Fusion handles EXRs (poorly, but that's a topic for another day), means nearly everybody on the network takes a noticeable hit (our AE guys complain the most) and pretty much just has to work through it until those jobs finish. No other jobs seem to have this problem, as all of our 3D scene files and assets are copied locally before they render.
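
For a rough back-of-the-envelope (the frame size and read rate below are assumed example numbers, not our actual measurements), the math looks something like this:

# Rough estimate: 10 Fusion nodes pulling big EXRs over shared gigabit uplinks.
# Frame size and read rate are made-up example numbers, not measured values.
frame_mb = 100               # assumed size of one multi-channel EXR frame, in MB
frames_per_sec_per_node = 1  # assumed read rate while comping
nodes = 10

demand_mb_s = frame_mb * frames_per_sec_per_node * nodes  # what the farm wants to pull
uplink_mb_s = 4 * 125        # 4x gigabit from Switch A to the main switch, ~125 MB/s each

print(f"farm wants ~{demand_mb_s} MB/s, uplinks top out at ~{uplink_mb_s} MB/s")
# With numbers anywhere in this ballpark the farm saturates the uplinks,
# and the workstations share whatever is left on the main switch.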

If it needs mentioning, we use Royal Render as our render farm manager.

Also, Nuke is new to our pipeline and we're slowly introducing it. We know it handles EXRs better, but is it so much better that this problem may not even exist anymore once we've fully transitioned?

My question to ye vfx types is: how does your company route its network traffic? We've had many internal theories about separating switch connections so that the farm and the main FS can talk to each other directly without going through the same switch the workstations use. Either way, we know something needs to change; we're just not 100% certain what that is.

Also, for extra credit, how do you host your files in terms of production assets and renders for compositing? Do you keep them all in the same project folder? Do you separate your renders onto an entirely different FS so that they can be accessed without affecting the production FS? Do you all use SSDs? Do your computers sit atop wireless unicorns?

Thanks in advance!
-Crypt

DePaint
12-05-2012, 03:41 PM
You may get better help asking on a forum dedicated to IT networking, like here:

http://www.daniweb.com/hardware-and-software/networking/13?gclid=COfvr7_Mg7QCFYqvzAodITEAHg

Google for "Networking Forum" or "IT Networking Forum" and you'll get a bunch of places where the networking pros hang out...

cojam
12-05-2012, 07:19 PM
You may get better help asking on a forum dedicated to IT networking, like here:

http://www.daniweb.com/hardware-and-software/networking/13?gclid=COfvr7_Mg7QCFYqvzAodITEAHg

Google for "Networking Forum" or "IT Networking Forum" and you'll get a bunch of places where the networking pros hang out...

Or hire a professional?

MDuffy
12-05-2012, 08:24 PM
Maybe have a cache server on the same switch/subnet as your renderfarm, and update the cache as the first operation of the render job. Some caches can be set up so they refresh on file request, making it a bit more transparent at the expense of a few more calls back to the main file server.

That way you will only pull new/changed content over the users' switch, and the farm can hammer its own cache and switch as much as it likes without the normal humans noticing. You can also even out network access by staggering the job start times a bit so they aren't all requesting big files at the same time, and some renderers/compositors can internally stagger the order they process nodes (where possible) to smooth out file access as well.
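
Something along these lines as a pre-render step could do it (a minimal sketch; the paths, the cache mount, and the idea of calling rsync from a pre-render script are assumptions about your setup, not Royal Render specifics):

import subprocess

# Hypothetical paths -- substitute your real production FS and farm-local cache.
MAIN_FS = "/mnt/prod/shots/sh010/comp/exr/"
FARM_CACHE = "/mnt/farmcache/shots/sh010/comp/exr/"

def warm_cache():
    """Pull new/changed EXRs onto the farm-side cache before the comp render starts.

    Only the delta crosses the main switch; the Fusion nodes then hammer
    the cache on their own switch instead of the production file server.
    """
    subprocess.run(
        ["rsync", "-a", "--update", MAIN_FS, FARM_CACHE],
        check=True,
    )

if __name__ == "__main__":
    warm_cache()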

Cheers,
Michael

sentry66
12-05-2012, 11:29 PM
Switches that support 10 gigabit,
network cards faster than 1 gigabit,
and possibly faster hard drives would all help.

olson
12-06-2012, 12:46 AM
A cache server like Duffy has suggested could reduce the hit felt by the rest of the artists. There are specialized caching servers available that can do this transparently, or commodity hardware could be used if the pipeline handles the cache updates and path changes.

I don't think that's the best solution though, because it sounds like the production needs have simply outgrown the network and file servers. Even if you have the ideal caching setup there's still only so much bandwidth on the network (assuming we're talking gigabit Ethernet here). All of the compositing render nodes on switch B are sharing less than 250 MB/s between them (two gigabit Ethernet connections from switch B to A). If the file server is on a single gigabit Ethernet connection it can offer less than 125 MB/s in total.
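
Spelling that arithmetic out (theoretical line rates, ignoring protocol overhead):

# Theoretical line rates, ignoring Ethernet/TCP/SMB overhead.
gigabit_mb_s = 1000 / 8               # 1 Gbit/s ~= 125 MB/s
uplink_b_to_a = 2 * gigabit_mb_s      # Switch B -> A: ~250 MB/s shared by 20 comp nodes
per_comp_node = uplink_b_to_a / 20    # ~12.5 MB/s each if they all pull at once
file_server = 1 * gigabit_mb_s        # single-homed FS: ~125 MB/s total, for everyone
ten_gig = 10 * gigabit_mb_s           # one 10 GbE port: ~1250 MB/s
print(uplink_b_to_a, per_comp_node, file_server, ten_gig)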

It might be time to get a new switch with some 10 gigabit Ethernet ports and a new file server or file server cluster with 10 gigabit Ethernet as well. If you look at the numbers no matter how you configure things with gigabit Ethernet and multiple switches there will always be severe bottlenecks on the network. If the budget won't allow for any major upgrades the best option would probably be to put a cache server on the switch of each group of nodes as needed. For example put a cache server on the switch with the compositing nodes.

If nobody has done this yet, look into monitoring the network traffic and file server traffic. In the past I've used Cacti for this, which supports many switch models as well as workstations and servers when configured properly. It will keep track of what uses bandwidth and when, along with lots of other useful information, so you can see exactly where the bottlenecks are coming from. It logs everything too, so you can look back over the last month or two and correlate spikes with jobs from the render queue.
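
If a full Cacti install feels like overkill to start with, even a crude poller on the file server will show you when the pipe is pinned (a Linux-only sketch reading /proc/net/dev; the interface name is an assumption):

import time

IFACE = "eth0"  # assumed interface name on the file server

def read_bytes(iface):
    """Return (rx_bytes, tx_bytes) for one interface from /proc/net/dev."""
    with open("/proc/net/dev") as f:
        for line in f:
            if line.strip().startswith(iface + ":"):
                fields = line.split(":", 1)[1].split()
                return int(fields[0]), int(fields[8])
    raise ValueError(f"interface {iface} not found")

prev = read_bytes(IFACE)
while True:
    time.sleep(5)
    cur = read_bytes(IFACE)
    rx = (cur[0] - prev[0]) / 5 / 1e6   # average MB/s received over the interval
    tx = (cur[1] - prev[1]) / 5 / 1e6   # average MB/s sent over the interval
    print(f"{time.strftime('%H:%M:%S')}  rx {rx:6.1f} MB/s  tx {tx:6.1f} MB/s")
    prev = cur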

tswalk
12-06-2012, 05:00 AM
It sounds to me like you are using a single VLAN, so no matter how you daisy-chain those switches... the broadcasts are going to kill you... but I'm guessing on that, because you didn't really describe your segmentation, only how you cabled them. And by default, everyone will be on a single VLAN.

If it were me, I would segment the backend traffic onto a different VLAN, if you have multiple NICs on the blades (one NIC for VLAN1 and another for VLAN2). So when you launch a job, it runs on VLAN2 (which in theory can be on the same switch)... but optimally, since you have 3 switches, you could dedicate that VLAN2/switch to this "backend" and it will keep your client and other server segments clear of that traffic.

Also, if you choose to (and have the capability), do the same with the file server, bridging between the VLANs.

I would also segment my client workstations on another VLAN with your "main" switch.
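
On the pipeline side, one way to make the farm jobs actually use that backend VLAN is to point them at a second address/mount of the same file server (a rough sketch; the hostnames and mount points here are made up, not anything from your setup):

# Hypothetical mounts: "fs-main" is what artists use, "fs-backend" is the same
# file server reached over the render-only VLAN/NIC.
FRONTEND = "/mnt/fs-main/prod"
BACKEND = "/mnt/fs-backend/prod"

def remap_for_farm(path):
    """Rewrite a production path so a farm job reads over the backend VLAN."""
    if path.startswith(FRONTEND):
        return BACKEND + path[len(FRONTEND):]
    return path

print(remap_for_farm("/mnt/fs-main/prod/shots/sh010/comp/frame.0101.exr"))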

I'm just guessing a lot... no idea what kind of switches you have, or blades/servers, etc...

So, if this fixes the problem... does it mean a free lunch at Sonny's or just a beer at the WestEnd Pub? And if you really want, I've got a CCNA in Keller, and another living in Flower Mound... I'm out here in Mansfield, however... they're great guys I used to work with up at Nokia.

Here's a diagram (what I think you've described versus the proposed segmentation):

http://sdrv.ms/11XIn3S

Right now, I'd probably take the beer...

CGTalk Moderation
12-06-2012, 05:00 AM
This thread has been automatically closed as it remained inactive for 12 months. If you wish to continue the discussion, please create a new thread in the appropriate forum.