Renderfarm in the cloud


#21

Great success! :slight_smile:

I’ve got one EC2 instance rendering along with my LAN renderfarm.

After creating the EC2 node and installing 3ds Max etc. I had to set up a bridged VPN tunnel to join the two LANs into a single network. I tried to do this with Vista’s native VPN services… 8 hours of epic fails later I installed OpenVPN and had it up and running in 30 minutes.
OpenVPN has great documentation, is open source and free, nice stuff!

http://openvpn.net/index.php/open-source/documentation/howto.html#startup
(See the download page for the installers etc.; take the ‘bridge’ route through the documentation instead of the ‘routing’ one.)

One gotcha: I run a 64-bit OS and OpenVPN installs into “Program Files (x86)\”, but a few of OpenVPN’s batch files don’t take the “(x86)” part into account, so be aware of that.

(Read the FAQ on how to set up a bridge on the 192.168.1.xx subnet, which you’re probably on.)

After some port-forwards in the router and enabling file sharing on the EC2 node I could see it in my network neighborhood and vice versa. Great :slight_smile: After allowing Backburner through the firewall I started the manager and a server on the EC2 node and another server on my workstation, and it all just worked after adjusting some IP settings. The Backburner monitor showed both servers, I submitted a job and it just started!
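In case it helps anyone: letting Backburner through the Windows firewall can be scripted too. The install path below is just the usual default, so adjust it if yours differs:

```
rem Allow the Backburner server and manager through the Windows firewall
rem (the install path is the usual default; change it if yours differs)
netsh advfirewall firewall add rule name="Backburner server" dir=in action=allow program="C:\Program Files (x86)\Autodesk\Backburner\server.exe" enable=yes
netsh advfirewall firewall add rule name="Backburner manager" dir=in action=allow program="C:\Program Files (x86)\Autodesk\Backburner\manager.exe" enable=yes
```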

Next up is making it more solid, locking down IP addresses etc. and adding more nodes.

I don’t want a new VPN tunnel from my workstation to each node; I want the EC2 LAN to connect to my workstation via one tunnel, for scalability and performance.

I think I’m going to try to set up one ‘master’ EC2 node plus normal EC2 nodes. The normal ones are all identical instances, IP’ed by DHCP, so I only have to install a plugin or script once and then start a few instances of the normal node. I’ll have to see if that’s even possible first :slight_smile:


#22

A few notes: Use OpenVPN’s config file to give the ‘master’ vpn-client a static local IP. This is where backburner’s manager will live. Having a static IP makes it easy for all the servers and monitors to find it on the LAN.

And I’ve used a free cmd util called ‘sendemail’ and the task scheduler to send me an email every 30 minutes while the EC2 node is running; this keeps me from forgetting to turn it off :slight_smile: I think I can automate this later but for now this will be fine.
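A minimal sketch of that reminder, assuming sendEmail.exe is on the node; the addresses, SMTP server, path and task name are placeholders:

```
rem reminder.bat -- nag me while the node is up (addresses and server are placeholders)
sendEmail.exe -f render@example.com -t me@example.com -u "EC2 node still running" -m "Don't forget to shut it down." -s smtp.example.com:25

rem Run it every 30 minutes via the task scheduler
schtasks /create /tn "EC2 running reminder" /tr "C:\scripts\reminder.bat" /sc minute /mo 30
```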

And when you install Backburner as a service, so it runs without a user having to log on, you should give the service some credentials to log in with so it can access the network. And you should use the ‘delayed start’ option since the OpenVPN service has to start up first. Not sure about this but it solved the problem of the manager not starting properly.
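For reference, this is roughly what that looks like from the command line, assuming the manager service is called ‘backburner_mgr’ (check the real name in services.msc) and ‘.\render’ is a local account that can reach the network:

```
rem Run the service under an account that can access the network shares
sc config backburner_mgr obj= ".\render" password= "secret"

rem Delayed start, so OpenVPN is already up when the manager comes alive
sc config backburner_mgr start= delayed-auto
```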

Now I have to do some actual work again… :slight_smile:


#23

A few updates:

I’ve dumped the bridged VPN and have built a routed VPN tunnel. Much better performance and scalability.

Now I’ve got a 10.0.0.x Cloud Area Network (CAN) and my local 192.168.1.x LAN connected via a single encrypted VPN tunnel in the 192.168.10.x range. Via a routing table those two networks can talk to each other; each node can see all the other nodes in both networks.

In the cloud I’ve set up a master node that runs the manager and is the client side of the VPN tunnel. My workstation doubles as the VPN server end. The two subnets are linked via routing tables, so for example my 192.168.1.12 workstation can connect to a 10.0.0.185 node and vice versa. Since I had to learn all the tech behind this to make it work it was a three-day trip down the rabbit hole and back again; I learned a lot and I have mad ninja network skills now :slight_smile:
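Roughly what the server side of such a routed (tun) setup looks like in OpenVPN terms. The subnets match mine, but the tunnel range, file names and client name are examples, so treat it as a sketch rather than my exact config:

```
# server.conf on the workstation (LAN side), routed/tun mode
dev tun
server 192.168.10.0 255.255.255.0        # VPN tunnel subnet
route 10.0.0.0 255.255.255.0             # the cloud subnet lives behind a client
client-config-dir ccd
push "route 192.168.1.0 255.255.255.0"   # let the cloud side reach the LAN

# ccd/master-node  (per-client file for the cloud master)
iroute 10.0.0.0 255.255.255.0            # tell OpenVPN the 10.0.0.x subnet sits behind this client
```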

The beauty of this setup is that I send a job to the manager-in-the-cloud and that node sends it to all the other cloud nodes, so I don’t waste bandwidth on my own internet connection.

Another thing is that I want to spawn extra nodes from a single render-node template. If I install an update or a plugin I don’t want to do that 20 times; I only have to do it twice now. Once on the manager node (which renders as well, with Max in a low-priority mode so managing the jobs takes precedence over the actual rendering) and once on the template render node.

And at EC2 you pay for storage as well; if I don’t need them I don’t want to keep 20× 35 GB disk images stored, just the manager node and a single render node.

I can start up to 19 render nodes from the same template. The only problem is the backburner.xml. It holds the MAC address as an identifier, so every instance has the same XML file now and that won’t work. Luckily my command-line batch script magic is strong and I’ve made a batch script that finds the MAC and computer name and puts them into the XML file before the server service starts up. Btw: to run Backburner as a service make sure to use ‘delayed start’ so it won’t fail on dependencies that have not started yet. I use NSSM to register the batch file as a service so it starts as well without anyone logging in first.

The whole ‘ohh no it’s in the cloud’ security issue is an academic non-issue as far as I’m concerned. The internets are scary!

And a nice side effect is that I now know exactly how much it costs to render something, so I can bill that without having to make a wild guess.

Stay tuned :slight_smile:


#24

@jonadb

Nice progress you got there. As I wrote earlier - although you don’t seem to care for my replies so far - we are on quite the same road here.

I had some issues with running Backburner as a service. Did that work out for you right from the start, or did you have to do some tricks? I can’t get it to really start anything… Windows says the service is running, but nothing really happens after startup. It’s a bit of a letdown so far… I don’t understand why backburner-server isn’t bootable via the command line with additional parameters like the manager IP and stuff. That would make things so much easier to handle.

Another thing about node startup:

I have done it quite the same way… I have a node image template from which I can spawn new ones. It has a startup script that looks for the Backburner XML and deletes the node’s own MAC address (because that causes trouble otherwise, when all the nodes claim to have the same MAC address).

I know that it is possible to pass user data into a started node via the Amazon tools, but I haven’t found a workable solution to grab that data and use it inside the node so far.
That would be great to do though, because it would let you prepare a node image template that doesn’t depend on a fixed manager IP. You could simply pass the current manager node IP via user data, and the node would grab it and insert it via a script into the correct config file so the Backburner server could boot up in no time.
Then you wouldn’t have to have the manager node up and running all the time.
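One direction that looks promising is the instance metadata service: inside the node, the user data passed at launch is exposed over plain HTTP, so a startup script could fetch it with something like this (just a sketch; it assumes PowerShell is available on the node and the paths are placeholders):

```
rem EC2 exposes the launch user data over HTTP from inside the instance
powershell -Command "(New-Object Net.WebClient).DownloadString('http://169.254.169.254/latest/user-data')" > C:\scripts\userdata.txt

rem e.g. if the user data is just the manager's IP, read it into a variable
set /p MANAGER_IP=<C:\scripts\userdata.txt
```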

How do you handle the data access for rendering? Do you rely on Max project folders and mapped network drives? That is how we do it… and it works out great so far. Max can be quite picky when it comes to network access of textures and maps.

Can you give a more detailed explanation of how you set up OpenVPN? I tried that tool and fiddled with it a couple of times, but haven’t really got to grips with it so far. I would appreciate it if you could explain a bit more about what you did to set everything up.

You should always keep an eye on the running EBS volumes of your instances. When you boot up EBS-backed instances, every node has its own EBS volume. The advantage of those is that you can stop your instance at any time - and thus not be forced to pay the hourly fee for it - and resume work later on. The only thing you have to watch out for is that the EBS volume outlives the node. So if you terminate the node, you have to delete the volume afterwards or you will keep paying for it although the instance itself isn’t there anymore.

And by the way… a little reward for you, since you’ve come all this way:

Ever had a look at spot instances? You definitely should :wink: Be prepared to have your jaw drop at the possible cost reduction and, more importantly, the node-count increase. You can have up to 100 nodes in addition to your 20 fixed ones. The only “drawback” is that you can lose those nodes if someone else is willing to pay more per hour than you do. The bidding works a bit like a stock exchange, with an hourly evaluated supply/demand ratio that dictates the pricing. Have a look into it… it’s pretty simple to handle and greatly increases your flexibility.
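For example, with Amazon’s command-line tools a spot request looks roughly like this (this is the newer AWS CLI syntax; the price, count and launch specification file are placeholders, and the specification JSON names the AMI, instance type, key pair and so on):

```
rem Bid for 20 spot instances of the render-node AMI (all values are placeholders)
aws ec2 request-spot-instances --spot-price "0.60" --instance-count 20 --type "one-time" --launch-specification file://render-node-spec.json
```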


#25

Hi Deracus, sorry I didn’t reply to your comments earlier, they are appreciated!

About the MACs: each instance you boot will get a different MAC automatically as far as I can tell, the problem is that backburner.xml doesn’t get updated. When you start the server as a service without a valid XML it fails silently, so I think that might be your problem.

I hacked together a little batch+vbscript that creates a new backburner.xml. I’ve attached it to this post.

  1. Place the content in the same folder as the original backburner.xml.
  2. Adjust the backburner_reset.xml to your settings; this is the template.
  3. Run ‘createXML.bat’; it will insert your MAC and computer name into the new backburner.xml file.
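For anyone who can’t grab the attachment, here’s a rough sketch of the same idea. The placeholder names __MAC__ and __NAME__ in the template are my own convention, and the real attached script (batch+vbscript) differs in the details:

```
@echo off
rem createXML.bat (sketch) -- rebuild backburner.xml from a template before the
rem server service starts. Assumes backburner_reset.xml contains the placeholders
rem __MAC__ and __NAME__.
setlocal enabledelayedexpansion

rem First MAC address reported by getmac (CSV output, %%~M strips the quotes).
rem If the node has several adapters you may need to pick a different one.
for /f "tokens=1 delims=," %%M in ('getmac /fo csv /nh') do (
    set "MAC=%%~M"
    goto :gotmac
)
:gotmac

rem Substitute the placeholders and write a fresh backburner.xml
(for /f "usebackq delims=" %%L in ("backburner_reset.xml") do (
    set "line=%%L"
    set "line=!line:__MAC__=%MAC%!"
    set "line=!line:__NAME__=%COMPUTERNAME%!"
    echo(!line!
)) > backburner.xml

endlocal
```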

Use ‘nssm’ ( https://iain.cx/src/nssm/ ) to install the batch file as a service. Adjust the registry to set the exit action to ‘Ignore’ (see the bottom of that same page); you’ll have to start/stop the service once to get the keys into the registry first. Then in the services manager adjust the remaining settings (startup type, and a user logon with admin rights).
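The nssm part itself is a one-liner; the service name and path here are just examples:

```
rem Register the batch file as a Windows service (name and path are placeholders)
nssm install backburner_xml "C:\Program Files (x86)\Autodesk\Backburner\createXML.bat"
```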

The server/manager services should use ‘delayed start’ and be given an admin-rights logon as well.

If I use network paths for my resources it renders just fine. I prefer to just use the ‘include maps’ option; it saves some bandwidth in the end.

The VPN is actually quite easy once you know how. Just follow the basic howto for a ‘routed tunnel’. Once you have the ‘master cloud node’ VPN’ed to your LAN you have to set up a few things to get all the other nodes in both networks to see each other; these links helped me a lot:
http://openvpn.net/index.php/open-source/documentation/howto.html
http://sysextra.blogspot.com/2011/01/creating-virtual-private-cluster-with.html

So in short (assuming you’ve put all cloud nodes in a VPC):
Step 1: get the OpenVPN server running on your LAN.
Step 2: get the OpenVPN client on the master cloud node and connect them.
Step 3: set up routing so you can ping from 10.0.0.x to 192.168.1.x and vice versa from each node on both networks.
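Step 3 is the fiddly one; on Windows it boils down to persistent routes plus enabling forwarding on the two tunnel endpoints. A rough sketch with my subnets (the gateway IPs are just my machines, and in a VPC the Amazon-side route table / source-dest check has to be sorted out too):

```
rem On LAN machines (other than the OpenVPN server): send cloud traffic to the workstation
route -p add 10.0.0.0 mask 255.255.255.0 192.168.1.12

rem On cloud nodes (other than the master): send LAN traffic to the master node
route -p add 192.168.1.0 mask 255.255.255.0 10.0.0.185

rem On both tunnel endpoints: let Windows forward packets between adapters (takes effect after a reboot)
reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters /v IPEnableRouter /t REG_DWORD /d 1 /f
```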

That last step took me two days to figure out; I can’t really summarize what I’ve done, but I might be able to help you if you get stuck on a certain point.

I’ve been using EBS volumes etc. When all is offline I have just 2 EBS volumes sitting there.

Thanks for mentioning you can get 100 spot nodes, I did see the option but didn’t realize you can get so many, nice! :slight_smile:

And if you put the ServerPriority script (http://www.jdbgraphics.nl/index/30/nl/DOWNLOADS ) in the script/startup folder, Max will start in a low-priority mode; this prevents the manager node from getting slow while rendering.

edit:
btw once you’ve got your VPN tunnel to the VPC you can assign the manager node a fixed internal IP (10.0.0.185 in my case). That’s where all the servers connect to. That internal IP can be made static by manually assigning an IP in the TCP/IP settings.
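On the manager node that’s a one-liner; the adapter name, mask and gateway below are examples, use whatever your instance actually has:

```
rem Pin the manager's internal IP (adapter name, mask and gateway are examples)
netsh interface ip set address name="Local Area Connection" static 10.0.0.185 255.255.255.0 10.0.0.1 1
```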


#26

I know the MAC problem and have a little C# solution for that myself. For me it was enough to simply delete the old MAC address in the XML to force Backburner to write a new one as soon as the server starts. That works great… but only if I start the server manually. Maybe I should have a look at your tool, where you insert the correct MAC into the file… maybe the service is a bit more picky than the Backburner server :wink:

I might have found a solution for accessing the user data. I’ll post it here as soon as I know more :wink:

I have to use mapped network drives, because the include-maps function proved to be not very reliable as soon as XRefs and stuff like irradiance maps come into play. For this I set up a sync tool to permanently sync my local data with the data on the manager node. That way I can hack together a little startup script for the nodes that automatically maps the network drive to the manager node, so data access is available and reliable. I only have to find a way to access the Amazon user data so I can dynamically adjust for changing manager nodes. That is necessary because the farm isn’t up all the time and I don’t want to pay for the manager if I don’t use it.
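That node startup script can be as small as a net use line; the drive letter, share name and credentials here are placeholders:

```
rem Map the manager's project share at node startup (share, user and password are placeholders)
net use P: \\10.0.0.185\projects secret /user:render /persistent:no
```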

I’ll have a look into that. Thanks for the offer.

You know that you have to pay for EBS volumes even though they are not mounted to a machine? I’m researching a way to automatically delete them as soon as I permanently terminate a node. I’m working on custom-made control software for my cloud-based render solution anyway, as I want to build a complete package others can use to control and maintain the farm without much prior knowledge.

Spot instances are spot on :slight_smile: Don’t know where I would be without them. Being able to get a 70+ node farm up and running in minutes is awesome.

I had it that way at first too, but have dropped it by now, as my manager node doesn’t have time to render very much anymore. As soon as you start projects with more than 50 nodes, the manager has quite a busy time shuffling all the data around for the nodes. As the manager is the most valuable node in the network, I tend not to leave it at full load all the time. Safety rule… it doesn’t really matter to me if there is one node more or less rendering… but it does matter that the manager always has enough power to cope with the whole process.

Does that mean I become independent of the IP Amazon assigns when launching a new node from an image? That could solve my data-access problems then…
So I could let all the nodes connect to that static VPN IP of the manager? Or did I get something wrong there?


#27
I can tell you right now: it is more picky  :)
Don’t you need the EBS volumes to store the AMIs on? So two different nodes equal two EBS volumes.

Sounds awesome indeed! But it turns out I can’t use spots because I’ve got my nodes in a VPC. I could make a third node type that connects to the master node via its own VPN tunnel. I will look into that :slight_smile:

Indeed… once set up, the master node ‘calls’ home and the tunnel comes up. All subsequent communication goes through the tunnel, which routes all the local IPs back and forth, either 192.168.1.x or 10.0.0.x. So once the tunnel is ready you don’t need any outside IP from Amazon anymore. And because the node calls home, home doesn’t have to know its IP.

For example I can remote-desktop to all the nodes in the cloud via their local IP right from my desktop/LAN.

I haven’t tested this, but if you give a bitmap a path like “\\192.168.1.12\textures\bla.jpg” all the nodes should just find it. Just forget about using the Windows share names :slight_smile:

edit: About the EBS.

Once I have an instance completely set up, I create a new AMI from it. Then I terminate the old one and delete all snapshots and volumes from the old AMI. So when it is all offline I just have one volume per AMI… I think that is the most efficient way?
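Expressed with Amazon’s command-line tools it would look something like this (this is the newer AWS CLI, and the IDs are placeholders; it’s the idea rather than my exact commands):

```
rem Bake the configured instance into a new AMI...
aws ec2 create-image --instance-id i-0123456789abcdef0 --name "render-node-template-v2"
rem ...then terminate the instance and clean up the previous AMI and its snapshot
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0
aws ec2 deregister-image --image-id ami-0abc1234
aws ec2 delete-snapshot --snapshot-id snap-0abc1234
```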

Once I start up the manager node I can give it a static internal IP in the create-instance dialog, and that’s all I have to do… the nodes just fire up and are ready to go.


#28

I just had a brilliant idea :) It can all be done with only one node type!

if ip == 10.0.0.185 {
    start manager node stuff
} else if ip == 10.0.0.* {
    start normal node stuff
} else {
    start spot instance stuff
}

This way there is only 1 AMI to store and maintain!
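A minimal batch sketch of that startup branch. The manager IP matches my setup, but the called scripts are placeholders, and grabbing “the” IP this way assumes one relevant adapter and English Windows output:

```
@echo off
rem Sketch of the single-AMI startup branch (called scripts are placeholders)
for /f "tokens=2 delims=:" %%A in ('ipconfig ^| findstr /c:"IPv4 Address"') do (
    set "IP=%%A"
    goto :gotip
)
:gotip
set "IP=%IP: =%"

if "%IP%"=="10.0.0.185" (
    rem Manager node: Backburner manager plus a low-priority render server
    call start_manager.bat
) else if "%IP:~0,7%"=="10.0.0." (
    rem Dedicated render node inside the VPC
    call start_can_node.bat
) else (
    rem Spot instance outside the VPC: bring up its own VPN tunnel first
    call start_spot_node.bat
)
```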


#29

I would never figure that stuff out, hah.

My main question is, for the amount of speed that you’re getting on your renders, is it worth the money?


#30

The idea is pretty nice… but cost-wise not really necessary. Have you had a look at how much it costs to store the AMI and stuff? It’s not really worth the time to think about… I pay around $2-3 a month for that and it doesn’t really matter :wink:
But from the maintenance point of view it’s great. You only have to update and maintain one AMI. I will incorporate that.

I had another idea about a possible problem that you might run into with your VPN node-to-node connection approach: the way Amazon charges for inter-node communication.

The traffic between two nodes in the same availability zone is free, BUT you have to pay for traffic in and out of the zone and/or the cloud.
With your approach of creating a VPN-IP-based connection between the master and the nodes (and that is what you are doing, isn’t it?) you are effectively moving the communication to a channel that leads out of and then back into the cloud.

I am not sure whether Amazon would charge you for that traffic, because I am not sure how they evaluate the costs. Usually I would say they don’t charge for traffic that stays in the local IP ranges of the individual nodes of the cloud, as those IP ranges aren’t reachable from the outside.

You might have to check that… or maybe I got something wrong about that idea in my head :wink:

@darthviper107:

It is - as far as my calculations, actual project experience and all of that go (around a year of working with this kind of solution) - absolutely worth the money.
You are cheaper than almost every commercial render service on the market (as long as you stick to heavy use of spot instances and an efficient setup of the machines) and you have - which is a huge plus for me - absolute control over environment parameters like software versions, plugins, shaders and so on. That is something no other service than your own renderfarm can offer you. Running and maintaining your own - physically present - renderfarm, on the other hand, has quite a few problems that you have to weigh against Amazon to properly judge the efficiency of that approach.

If you buy and run a renderfarm yourself, you have to pay for the hardware, the cooling, the energy. You have to maintain it, set it up, fix hardware troubles and all of that. If you don’t run it for paid work quite a lot, the costs generated by the farm are bigger than the amount of money Amazon charges you. But for Amazon to be efficient, the way you work with the possibilities it gives you is vital.


#31

How much faster is it though? The systems don’t look like they’re all that powerful, so do you get like 10-20x performance compared to using the average workstation?


#32

Not all the traffic goes through the tunnel, only the CAN <-> LAN traffic does. The manager and the render nodes are all on the 10.0.0.x CAN, so the main part of the paid traffic is me sending a job from my LAN and downloading the rendered images from the CAN. So once the render job is started I could even shut down the VPN tunnel if I wanted to.

The spot nodes will be outside of the CAN but I can make sure they are in the same availability zone so that traffic will be free as well.

@DarthV:

One of those 8-core 2.2 GHz nodes is about as powerful as my workstation… so 20 or 100+ nodes can be 20 to 100 times faster.

But even 10× is a pretty big game changer: I can get my results in an hour instead of the next day… so I can do a few more render iterations of a project and get there sooner.

And I hope that in the not too distant future the GPU clusters become available for the Windows platform as well so I can use iray much more. I can use it in the cloud now but it’s still running on CPUs.


#33

Wow, Rebusfarm is offering a 75% summer special discount… I doubt the cloud will be cheaper than that :smiley:


#34

Now go create some idiot-proof script wizard UI for everyone to set up an Amazon render farm without having to go through all those batch files, MAC address forging, Backburner patching and whatnot.
I’d buy it instantly.


#35

I think they are scared of my monster cloud and trying to earn some cash while they can :slight_smile:

But seriously, this is a bit cheaper but not vastly different; when using spot instances I can get around 3c per GHz/h… not sure about the math behind it though.

2.2 GHz × 8 cores -> 17.6 GHz

On spot prices I can get one of those for about $0.60/hour (50% off)

$0.60 / 17.6 ≈ $0.034 per GHz/h

I’m not going to compete with any commercial company :slight_smile: I just like learning and doing stuff like this and the extra control and comfort is worth the few extra cents.


#36

I’ve almost got it down to a single AMI (Amazon Machine Image) that is internally configured to work as a manager, a dedicated ‘CAN node’ or a ‘spot node’ depending on how it’s started.

If you had this image, all you’d have to do is run OpenVPN on your workstation, adjust two lines in a config file and maybe some network settings, and you’re up. But it’s not ready yet :slight_smile:


#37

I did the math behind it and also quite a bit of benchmarking, and have to admit that the node performance isn’t quite as good as you would think from the numbers.

The EC2 Compute Units that Amazon uses to calculate things are quite vague; supposedly one unit equals an AMD Opteron of around 1.1 GHz or something. But then there is also the performance you lose through the virtualization of the machines themselves. The RAM performance is not on par with a real, physical machine, and especially the disk-access performance is not as good.

I came to the conclusion that an instance of type c1.xlarge (High-CPU, 20 EC2 Compute Units, 7 GB RAM) is roughly equal to a Core i7 870 or 880 performance-wise…
which works out to around 4 × 3.0 GHz plus the usual ~30% gain from Hyper-Threading when rendering… so around 15-16 GHz per node.
That said, it is still pretty fast for the price you have to pay for it.

For me and our business, it definitely has huge advantages compared to rendersolutions like Rebus & Co.


#38

Thanks for sharing those numbers, interesting data!


#39

I found a neat trick to save a few bucks on traffic costs:

http://alestic.com/2009/06/ec2-elastic-ip-internal

My spot instances are outside the VPC (spot isn’t available in a VPC yet), so the instances create a VPN tunnel to the manager node located in the VPC. They do this over the manager’s elastic IP (the outside address), so the traffic through this tunnel costs a few cents per GB.

With the trick described in the article you can get the inside IP from the outside elastic IP… and running the tunnel to the inside IP is free within the same AZ. Nice :slight_smile:
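The trick boils down to resolving the elastic IP’s public DNS name from inside EC2; it answers with the private address, which you then use as the VPN endpoint (the hostname below is a placeholder):

```
rem From inside EC2 the public DNS name of an elastic IP resolves to the private 10.x address
nslookup ec2-184-73-10-10.compute-1.amazonaws.com
rem ...then point the OpenVPN 'remote' at the returned 10.x.x.x address instead of the elastic IP
```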


#40

I have now read every single word of this post.

As soon as my brain unscrambles, I am going to try it! This sounds too good to be true.

If you had to estimate, in terms of dollars only, what percentage savings (±) are you realizing with this setup?

Thanks so much for all this - it’s gold!