We have a small problem with our ejabberd installation.
We had a couple of sites running chat and everything was running fine for months. The decision was taken to roll it out to all our other sites (most of which shared a member base). As more and more sites were added, the system took more strain. Our concurrent users never really went higher than about 110.
Then one day ejabberd crashed and would not start up again. A crash file was generated with the following:
Slogan: eheap_alloc: Cannot allocate 40121760 bytes of memory (of type "heap").
System version: Erlang (BEAM) emulator version 5.6.5 [source] [smp:4] [async-threads:0] [hipe] [kernel-poll:true]
Compiled: Wed Jun 2 14:18:09 2010
Atoms: 11760
=memory
total: 3055651424
processes: 50346268
processes_used: 50341476
system: 3005305156
atom: 536285
atom_used: 521045
binary: 36760
code: 4351225
ets: 2992667516
After some investigation, I saw that the roster file had grown to a massive 309MB, so the problem was most likely the roster being too big. As a test, I backed up the roster file and then emptied it with cat /dev/null > roster.DCD. After this, ejabberd would start up again, new roster entries could be added, and everything was fine.
I've noticed other people have had the same crash problem with no real way of fixing it being presented. I've tried plenty of things myself, with different configurations to try and sort the problem out. No matter what I do, the big roster file crashes the system.
I suspect that hardware is an issue, as ejabberd is not running on a dedicated machine and shares the machine with an image server and a sendmail relay, amongst other things. Convincing the boss that it is the hardware is a different story, and I need backup to prove my point.
I've been searching around for hardware specs or suggestions for ejabberd. On the site it claims to support connections with X amount of concurrent users. That's all good and dandy, but what are the details where that many connections are concerned?
For instance, I've started looking around at other Jabber servers, to see if they might fare any better. One of the servers I came across was Tigase, a Java and MySQL Jabber server. On their site, they claim 500k users on a single domain, but they go further and tell you how they got to that number with the test details (
At the end, it seems like 300k users, each having a roster size of about 50, was the max it could do before everything went pear-shaped. The server specs are shown too for each server and, quite frankly, they are pretty impressive, nothing near what we are running now.
I was wondering if the same sort of details are hidden somewhere for ejabberd? I'd really like to be able to present some details so we can move on and get ejabberd running again with the number of users we want.
Thanks
Garth
Summary of your stats:
A) Our concurrent users never really went higher than about 110.
B) Slogan: eheap_alloc: Cannot allocate 40,121,760 bytes of memory (of type "heap").
C) the roster file had grown to a massive 309MB
You don't say how many registered accounts there are. Apparently, the server has a very large number of roster items, either concentrated in a few users or spread across all the users.
You don't describe how your server is used, but this smells like either you (or your client software) are automatically adding useless roster items, or you are under some attack or stress (either malicious or through ignorance or a bug).
Check the WebAdmin, and consider installing mod_statsdx.
Some information about a free server I administer (for typical chat, using Psi and Jabber clients).
Some stats provided by mod_statsdx:
List of Top rosters provided by mod_statsdx
Some Mnesia spool files details:
Your 309 MB roster.DCD for 110 concurrent users (let's imagine the server has 1,000 different users over the week, and 10,000 total accounts registered, each with 15 roster items) is crazy high.
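To put a number on "crazy high", the imagined figures above can be turned into a bytes-per-item estimate. This is only a rough Python sketch: the account and roster counts are the hypothetical ones from the previous paragraph, not measured values, and 309 MB is taken as 309 × 2^20 bytes.

```python
def bytes_per_roster_item(file_bytes, accounts, items_per_account):
    """Average on-disc bytes per roster item for a given file size."""
    return file_bytes / (accounts * items_per_account)

# Assumption: "309 MB" means 309 * 2**20 bytes.
roster_file_bytes = 309 * 2**20

# Hypothetical population from the paragraph above:
# 10,000 registered accounts with 15 roster items each.
estimate = bytes_per_roster_item(roster_file_bytes, 10_000, 15)

print(f"{estimate:.0f} bytes per roster item")  # about 2160 bytes
```

A roster item is little more than a JID, a nickname, a subscription state and some group names, so roughly 2 KB per item would be far more than expected. Under these assumptions, the file more likely holds many more items than 150,000.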
Hi there
Thanks for your reply. I've installed the mod and come back with the following stats:
As you can see, the number of users we have is pretty high and, in turn, the roster size is pretty high. To put that into perspective, I've yet to add another 400k-500k members to the system, so the roster size will probably get boosted back up to the 300MB+ mark once that is done. I've held off doing that for now.
Now even if the roster size is 309MB and it's trying to load the full roster into memory, the machine should technically have the memory available to do this, so I don't understand why it's actually crashing.
Do you think throwing hardware at the problem will fix the issue, or are we just going to run into the same problem further down the line?
Thanks
Try the table on disc only; try on another machine
Ok, your table size is reasonable considering the number of roster items.
"Now even if the roster size is 309MB and it's trying to load the full roster into memory, the machine should technically have the memory available to do this, so I don't understand why it's actually crashing."
I don't know the free RAM in your server machine when ejabberd is starting.
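That said, the crash dump in the first post already breaks the memory down, so the numbers can at least be checked. A minimal Python sketch using only the figures from that dump; taking 309 MB as 309 × 2^20 bytes is an assumption, and note that the ets figure covers every in-memory table, not only the roster:

```python
# Figures copied from the =memory section of the crash dump (bytes).
memory = {
    "total": 3055651424,
    "processes": 50346268,
    "system": 3005305156,
    "atom": 536285,
    "binary": 36760,
    "code": 4351225,
    "ets": 2992667516,
}

def share_of_total(mem, key):
    """Percentage of the VM's total allocated memory used by one category."""
    return 100.0 * mem[key] / mem["total"]

for key in ("processes", "system", "ets", "code", "binary", "atom"):
    print(f"{key:10s} {memory[key]:>13,d} bytes  {share_of_total(memory, key):6.2f}%")

# Assumption: the 309 MB roster.DCD is 309 * 2**20 bytes on disc.
roster_file_bytes = 309 * 2**20
ets_to_file_ratio = memory["ets"] / roster_file_bytes
print(f"ets is {ets_to_file_ratio:.1f}x the size of roster.DCD")
```

By this arithmetic, ets tables hold roughly 98% of the 3.06 GB the VM had allocated when it failed, which is about nine times the size of the roster file on disc. So the data seems to need far more RAM once loaded than the 309 MB file size suggests.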
You can try to set the roster table as "copy in disc only" using the WebAdmin. This reduces RAM usage, but increases resource consumption when reading/writing roster items (when users log in). Maybe that trade-off is beneficial in your case.
"Do you think throwing hardware at the problem will fix the issue, or are we just going to run into the same problem further down the line?"
Before touching the hardware, you may prefer to run experiments: install ejabberd on another idle machine with more RAM, like your desktop machine, copy over the Mnesia spool files and the config, and check what happens on that one.
Hi, don't you think it's better to use MySQL or some other database, instead of Mnesia?
Probably yes.
I suppose it's an Erlang problem here, because it takes as many resources as it can get and, when there are none left, it crashes without warning. A hardware improvement would be helpful here, but instead of increasing the capabilities of a single system, I would suggest you start using distributed computing power (the "divide and conquer" method). This configuration will allow you to add more power to your server in the future without restarting it.