So!
I've spent the last few weeks working with a dev to stress-test our proposed ejabberd installation.
I'm trying to future-proof the environment a bit, so we're testing a much larger environment than we currently have.
We're using the LDAP shared roster (the version patched for AD). We're testing with 4000 users total (only about 1500 active), and the roster is an absurd 12000 entries long (each user is in 3 groups). Lastly, we're using clustering for geo-distributed chat, with DNS A records and sortlisting to make sure each client connects to its local site.
We ran into the following challenges, and we're trying to figure out whether we need to Just Live With them:
- DNS round robin sucks at keeping things balanced. We know that. SRV records don't sortlist. Is there another solution I should be staring at other than an actual load balancer?
- Memory usage per connected user was around 10MB. This is about an order of magnitude larger than our old chat system, Openfire. Should I chalk this up to the cost of concurrent performance?
- Split-brain recovery seems to require restarting the minority nodes. I tried restarting just Mnesia, and in every case ejabberd halted as Mnesia started. This is pretty much expected, yes?
- Normal CPU usage is basically nothing. We tested up to 8000 messages per second before the two 2-CPU servers in the test cluster reached capacity. However, the work done at login time was more than capable of bringing the server to its knees if we didn't throttle logins a bit. Do I just rely on client backoff to make sure everyone doesn't log in at once, or is there an accept queue configuration I should be setting?
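To make the "client backoff" in the last bullet concrete, here is a minimal Python sketch of exponential backoff with random jitter, which is one common way clients avoid all logging in at the same instant after an outage. The function names (`reconnect_delay`, `schedule_logins`) and the base and cap values are assumptions chosen for illustration; they are not taken from ejabberd or from any particular XMPP client.

```python
import random


def reconnect_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Return a randomized delay in seconds before login attempt `attempt` (0-based).

    The upper bound doubles with each attempt (base * 2**attempt) up to `cap`,
    and the actual delay is drawn uniformly from [0, bound] so that clients
    which failed at the same moment do not retry at the same moment.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def schedule_logins(n_clients, attempt, seed=0):
    """Return sorted login delays for `n_clients` clients on the same attempt.

    Hypothetical helper for illustration: it only shows how the jitter
    spreads a burst of logins over the backoff window.
    """
    rng = random.Random(seed)
    return sorted(reconnect_delay(attempt, rng=rng) for _ in range(n_clients))


if __name__ == "__main__":
    # 1500 matches the number of active users in the test described above.
    delays = schedule_logins(n_clients=1500, attempt=5)
    print(f"first login at {delays[0]:.2f}s, last at {delays[-1]:.2f}s")
```

This only helps if the clients actually implement something like it, which is why a server-side limit is still worth asking about.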
On the whole, the experience has been terrific. The only thing I'm significantly discomfited by is that 10MB per client memory requirement. If none of my other questions get answered, I'd be most appreciative if you could give me some guidance on that.
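For a sense of scale, here is a rough back-of-the-envelope calculation of what that figure implies, as a small Python sketch. The 10 MB per user and the 1500 and 4000 user counts come from the test above; the 1 MB per user comparison is only an assumption derived from "an order of magnitude", not a measured Openfire number, and the helper name `total_memory_gb` is made up for this example.

```python
def total_memory_gb(connected_users, mb_per_user):
    """Estimate total memory in GB (decimal, 1 GB = 1000 MB) for connected users."""
    return connected_users * mb_per_user / 1000


if __name__ == "__main__":
    observed_mb = 10  # per connected user, as observed in the test
    assumed_old_mb = 1  # assumption: roughly an order of magnitude less
    for users in (1500, 4000):
        now = total_memory_gb(users, observed_mb)
        before = total_memory_gb(users, assumed_old_mb)
        print(f"{users} users: ~{now:.1f} GB at 10 MB each, ~{before:.1f} GB at 1 MB each")
```

Under these assumptions, the 1500 active users alone would need about 15 GB across the cluster, and all 4000 connected at once would need about 40 GB.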