Over the last month I've noticed a sudden increase in cluster failures that result in a restart of ejabberd on each node. The symptoms are extremely high load generated by the beam.smp process, coinciding with all available memory being consumed. Once the system runs out of physical memory, the node stops responding and the process has to be restarted.
I've checked the logs and there's nothing obvious about why the nodes crash under the load (ejabberd.log shows only the connection info prior to the restart). As for consistency, it has happened two weeks in a row at the start of the 'morning rush' of users connecting. However, it doesn't happen daily, and in the past, when one node failed, the other picked up the reconnecting users without issue. Has anyone experienced anything similar before?
The current setup runs ejabberd 2.1 (and has for a while without issue), with the system laid out as follows:
Gateway/Firewall -> (pair of Zeus load balancers) -> (node1)(node2)
Apart from balancing the traffic, the LBs also terminate SSL for the nodes. Concurrent user connections run between roughly 1500 and 1700 at peak.
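Since the Zeus boxes handle the SSL, the nodes themselves only see plain (already-decrypted) c2s traffic on 5222. The listener block in ejabberd.cfg is along these lines (a representative sketch, not a verbatim copy of the live config):

    %% Representative c2s listener (values illustrative only).
    %% No starttls/certfile options here: TLS is terminated upstream
    %% on the Zeus LBs, so the nodes accept plain connections on 5222.
    {listen,
     [
      {5222, ejabberd_c2s, [
                            {access, c2s},
                            {shaper, c2s_shaper},
                            {max_stanza_size, 65536}
                           ]}
     ]}.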
Addendum
It's also worth noting that prior to the overload the system load is quite low (15-minute average < 0.5), and after restarting the cluster it returns to that average.
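Next time the load starts ramping I'm planning to attach to the node (e.g. via ejabberdctl debug) and grab a quick memory/process snapshot before it becomes unresponsive. Something roughly like the following sketch; the message-queue check is just a guess at where the memory might be piling up, not a known cause:

    %% Run from a shell attached to the live node (e.g. `ejabberdctl debug`).

    %% Overall VM memory, broken down by category (processes, ets, binary, ...):
    erlang:memory().

    %% Process count against the configured limit:
    {erlang:system_info(process_count), erlang:system_info(process_limit)}.

    %% Ten processes with the longest message queues -- a rough indicator of
    %% something backing up under load:
    Queues = [{process_info(P, message_queue_len), P} || P <- erlang:processes()],
    lists:sublist(lists:reverse(lists:sort(Queues)), 10).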