I am building a modular application: a set of separate systems communicating with each other. All of them run Hadoop, and three of the four run HBase.
Scaling will only be an issue on the non-HBase system, which uses its own NoSQL-like store. It is client-facing and very memory-intensive, so each server can handle only a limited number of users before performance drops drastically. All four systems have ZooKeeper procedures and options built in.
Is it smart to design a command-and-control server that monitors system resources and spins servers up and down during peaks? Or would that set me up for failure, since if it faults the whole system can fail? How hard would it be to automate a task like that for Hadoop?
At a basic level, you have to have something in place in order to scale up / down the size of your environment.
In your application’s infancy, you’ll provide that “something.” As you receive feedback from your systems (or more likely from your users complaining about your systems), you’ll increase the capacity of the environment.
Automating that manual process is an excellent idea.
I would recommend designing an asynchronous method for querying your system nodes. An asynchronous update will allow your C&C (command and control) center to keep operating even when a queried node is buried under load or has completely gone down.
But I’ll contradict myself by mentioning ICMP and SNMP as two technologies to consider if you don’t want to roll everything yourself.
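To make the asynchronous-querying idea concrete, here is a minimal sketch using Python's asyncio. The node addresses and the health-check body are placeholders (in practice the check would be an HTTP health endpoint, an SNMP GET, or similar); the point is that a timeout per node keeps a buried or downed node from stalling the whole polling loop.

```python
import asyncio

# Hypothetical node addresses; in a real deployment these would come from
# your cluster inventory (e.g. ZooKeeper, which all four systems already use).
NODES = ["node-a:9000", "node-b:9000", "node-c:9000"]

async def fake_health_check(node: str) -> None:
    # Stand-in for real I/O so the sketch is runnable as-is.
    await asyncio.sleep(0.01)

async def query_node(node: str, timeout: float = 2.0) -> dict:
    """Ask one node for its health; never let a dead node block the loop."""
    try:
        await asyncio.wait_for(fake_health_check(node), timeout=timeout)
        return {"node": node, "status": "up"}
    except (asyncio.TimeoutError, OSError):
        # A buried or downed node simply reports as unreachable; the
        # C&C center keeps polling everyone else.
        return {"node": node, "status": "unreachable"}

async def poll_all() -> list:
    # Fire all queries concurrently; slow nodes don't delay fast ones.
    return await asyncio.gather(*(query_node(n) for n in NODES))

results = asyncio.run(poll_all())
```

Because `asyncio.gather` runs the checks concurrently, one sweep takes roughly as long as the slowest single node (capped by the timeout), not the sum of all of them.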
If you’re concerned about your C&C center going down, then the next level is to set up redundancy between two or more C&C centers. There are a number of schemes you can use to minimize the overhead of querying nodes. I think this is beyond the scope of your question, so I won’t delve into the details and will be content with having brought it up.
You are at risk of logic flaws within the C&C center. But if you work through the logic diagram, you’ll see you’re either better off or no worse off than you would have been otherwise.
- C&C good, nodes good == you’re good
- C&C good, nodes fail == C&C fixes the problem with more resources
- C&C fail, nodes good == you’re still good; the C&C wasn’t needed
- C&C fail, nodes fail == you have resource starvation, but no worse than the case of the nodes failing with no C&C at all
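The table above can be written out as a tiny decision function, which makes it easy to see (and test) that no combination leaves you worse off than having no C&C at all. The state names are my own labels, not anything from a real system:

```python
def cc_outcome(cc_ok: bool, nodes_ok: bool) -> str:
    """Enumerate the C&C/node combinations from the logic diagram."""
    if nodes_ok:
        return "healthy"            # C&C state is irrelevant while nodes are fine
    if cc_ok:
        return "cc_adds_capacity"   # C&C fixes the problem with more resources
    return "starved"                # resource starvation, same as having no C&C

# All four combinations at a glance:
outcomes = {
    (cc, nodes): cc_outcome(cc, nodes)
    for cc in (True, False)
    for nodes in (True, False)
}
```

Working through every branch like this is cheap insurance against exactly the kind of logic flaw the C&C center is at risk of.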
If your health monitor gets really good, it can send warning signals to the C&C center instead of outright error states. The C&C can then have special logic to wait for the “all-clear” from that node. If the all-clear doesn’t arrive in a set amount of time, the C&C server can automatically spin up more nodes knowing that the troubled node never made it out of the danger zone.
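The warning/all-clear escalation can be sketched as a small tracker. The class name, the grace period, and the explicit `now` parameter are all illustrative choices of mine (passing `now` in keeps the logic deterministic and testable); the behavior is exactly the rule above: a warning starts a clock, an all-clear stops it, and any node still warned after the grace period is flagged for reinforcement.

```python
class DangerZoneTracker:
    """Track nodes that sent a warning; escalate if no all-clear arrives in time."""

    def __init__(self, grace: float = 60.0):
        self.grace = grace                 # seconds a node may stay in the danger zone
        self.warned_at = {}                # node -> timestamp of first warning

    def warning(self, node: str, now: float) -> None:
        # The first warning starts the clock; repeated warnings don't reset it.
        self.warned_at.setdefault(node, now)

    def all_clear(self, node: str) -> None:
        # The node made it out of the danger zone; forget the warning.
        self.warned_at.pop(node, None)

    def nodes_to_reinforce(self, now: float) -> list:
        # Nodes that never sent an all-clear within the grace period:
        # the C&C can spin up replacements for these.
        return [n for n, t in self.warned_at.items() if now - t >= self.grace]
```

On each polling sweep the C&C would feed warnings and all-clears into the tracker, then ask `nodes_to_reinforce()` which nodes warrant spinning up extra capacity.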