Roster becomes too big

We have a small problem with our ejabberd installation.

We had a couple of sites running chat and everything was running fine for months. The decision was taken to roll it out to all our other sites (most of which shared a member base). As more and more sites were added, the system took more strain. Our concurrent users never really went higher than about 110.

Then one day ejabberd crashed and would not start up again. A crash file was generated with the following:

Slogan: eheap_alloc: Cannot allocate 40121760 bytes of memory (of type "heap").
System version: Erlang (BEAM) emulator version 5.6.5 [source] [smp:4] [async-threads:0] [hipe] [kernel-poll:true]
Compiled: Wed Jun  2 14:18:09 2010
Atoms: 11760
=memory
total: 3055651424
processes: 50346268
processes_used: 50341476
system: 3005305156
atom: 536285
atom_used: 521045
binary: 36760
code: 4351225
ets: 2992667516

After some investigation, I saw that the roster file had grown to a massive 309MB, and that the problem was most likely the roster being too big. I did some testing: I backed up the roster file and then ran cat /dev/null > roster.DCD. After this, ejabberd would start up again, new roster entries could be added, and everything was fine.
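As a rough check on the crash dump figures above, here is a minimal Python sketch that works out how much of the emulator's memory sat in ETS tables and how that compares with the on-disk roster file. It assumes the "309MB" figure means MiB, and that most of the ETS memory belongs to the roster table; the dump itself does not confirm either.

```python
# Figures copied from the =memory section of the crash dump (all in bytes).
memory = {
    "total": 3055651424,
    "processes": 50346268,
    "system": 3005305156,
    "ets": 2992667516,
}

# Assumption: the "309MB" roster.DCD size is in MiB, as `ls -lh` would report it.
ROSTER_DCD_BYTES = 309 * 1024 * 1024

# Fraction of all emulator memory held by ETS tables.
ets_share = memory["ets"] / memory["total"]

# How many times larger the in-memory tables are than the roster file on disk.
ets_to_file_ratio = memory["ets"] / ROSTER_DCD_BYTES

print(f"ETS share of total memory: {ets_share:.1%}")
print(f"ETS size / roster.DCD size: {ets_to_file_ratio:.1f}x")
```

On these numbers ETS holds roughly 98% of the memory in use, and the in-memory tables are about nine times the size of the file on disk, so the file size alone understates how much RAM the loaded table needs.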

I've noticed other people have had the same crash, with no real fix ever being presented. I've tried plenty of things myself, with different configurations, to try and sort the problem out. No matter what I do, the big roster file crashes the system.

I suspect that hardware is an issue, as ejabberd is not running on a dedicated machine and shares the machine with an image server and a sendmail relay, amongst other things. Convincing the boss that it is the hardware is a different story, and I need backup to prove my point.

I've been searching around for hardware specs or suggestions for ejabberd. On the site it claims to support connections with X amount of concurrent users. That's all good and dandy, but what are the details where that many connections are concerned?

For instance, I've started looking around at other jabber servers, to see if they might fare any better. One of the servers I came across was tigase, a java and mysql jabber server. On their site, they claim 500k users on a single domain, but they go further and tell you how they got to that number with the test details (http://www.tigase.org/content/tigase-load-tests-again-500k-user-connections). 500k users doesn't mean anything when each user has a roster size of 0.

In the end, it seems like 300k users, each having a roster size of about 50, was the max it could do before everything went pear-shaped. The server specs are shown too for each server and, quite frankly, they are pretty impressive, nothing near what we are running now.

I was wondering if the same sort of details are hidden somewhere for ejabberd? I'd really like to be able to present some details so we can move on and get ejabberd running again with the number of users we want.

Thanks

Garth

Summary of your

Summary of your stats:

synack wrote:

A) Our concurrent users never really went higher than about 110.

B) Slogan: eheap_alloc: Cannot allocate 40,121,760 bytes of memory (of type "heap").

C) the roster file had grown to a massive 309MB

You don't say how many registered accounts there are. Apparently, the server has a great many roster items, either for a few users or in general for all the users.

You don't clarify what your server is used for, but this smells like either you (or your client software) are automatically adding useless roster items, or you are under some attack/stress (either malicious or caused by ignorance or a bug).

Check the WebAdmin, and consider installing mod_statsdx.

Some information about a free server I administer (for typical chat, using Psi and Jabber clients).

Some stats provided by mod_statsdx:

Registered users	9,759
Online users	695

Total roster items	135,774

Mean items in roster	13.9127

List of Top rosters provided by mod_statsdx

Top rosters
#	Jabber ID	Value
1	ken2@example.org	1536
2	etireng2@example.org	1529
3	rdomo@example.org	1445
4	rer@example.org	1407
5	rnero@example.org	1295
6	ersa@example.org	985
7	o@example.org	969
8	ro.coria@example.org	938
...

Some Mnesia spool files details:

-rw-r--r--  1 ejabberd ejabberd  569K Feb  5 21:30 passwd.DCD
-rw-r--r--  1 ejabberd ejabberd   33K Feb 17 13:30 passwd.DCL

-rw-r--r--  1 ejabberd ejabberd   26M Feb 12 19:07 roster.DCD
-rw-r--r--  1 ejabberd ejabberd  1.3M Feb 17 13:36 roster.DCL

Your 309 MB roster.DCD for 110 concurrent users (let's imagine the server has 1,000 different users along the week, and 10,000 total accounts registered, each with 15 roster items) is crazy high.
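To make that comparison concrete, here is a rough back-of-the-envelope sketch in Python. It derives an approximate on-disk cost per roster item from my server's figures above and applies it to the imagined 10,000 accounts with 15 items each. It assumes the "26M" printed by ls means MiB and that file size scales linearly with the number of items, which is only an approximation.

```python
# Figures from the mod_statsdx output and the file listing above.
ROSTER_ITEMS = 135_774
ROSTER_DCD_BYTES = 26 * 1024 * 1024  # "26M" as printed by ls, taken as MiB

# Approximate on-disk cost of one roster item on that server.
bytes_per_item = ROSTER_DCD_BYTES / ROSTER_ITEMS

# Hypothetical server from the paragraph above: 10,000 accounts, 15 items each.
hypothetical_items = 10_000 * 15
expected_mib = hypothetical_items * bytes_per_item / (1024 * 1024)

print(f"~{bytes_per_item:.0f} bytes per roster item")
print(f"expected roster.DCD: ~{expected_mib:.0f} MiB (vs 309 MB reported)")
```

That works out to roughly 200 bytes per item and a file of under 30 MiB for the imagined server, about a tenth of the reported 309 MB.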

Hi there Thanks for your

Hi there

Thanks for your reply. I've installed the mod and come back with the following stats:

ejabberdctl getstatsdx totalrosteritems
696269
ejabberdctl getstatsdx registeredusers
202918
ejabberdctl getstatsdx meanitemsinroster
3.4311861500022176
-rw-r--r--  1 root     root     135M Feb 21 09:40 /var/lib/ejabberd/roster.DCD
-rw-r--r--  1 root     root     4.0M Feb 23 14:19 /var/lib/ejabberd/roster.DCL

As you can see, the number of users we have is pretty high and, in turn, the roster size is pretty high. To put this into perspective, I've yet to add another 400k-500k members to the system, so the roster size will probably get boosted back up to the 300MB+ mark once that is done. I've held off doing that for now.
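A rough projection of that growth, as a minimal Python sketch. It assumes the new members would have the same average number of roster items as the current ones and that the file grows linearly with the item count; neither is guaranteed. The function name projected_dcd_mib is just an illustrative helper.

```python
# Figures from the getstatsdx output and the file listing above.
TOTAL_ROSTER_ITEMS = 696_269
REGISTERED_USERS = 202_918
ROSTER_DCD_BYTES = 135 * 1024 * 1024  # "135M" as printed by ls, taken as MiB

# Average roster size and approximate on-disk cost per item today.
mean_items = TOTAL_ROSTER_ITEMS / REGISTERED_USERS
bytes_per_item = ROSTER_DCD_BYTES / TOTAL_ROSTER_ITEMS


def projected_dcd_mib(extra_users):
    """Projected roster.DCD size in MiB after adding extra_users accounts.

    Assumes new accounts have the same mean roster size as existing ones.
    """
    items = (REGISTERED_USERS + extra_users) * mean_items
    return items * bytes_per_item / (1024 * 1024)


for extra in (400_000, 500_000):
    print(f"+{extra:,} users -> ~{projected_dcd_mib(extra):.0f} MiB")
```

Under those assumptions the file would land somewhere around 400 to 470 MiB, which is in line with the 300MB+ estimate.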

Now even if the roster size is 309MB and it's trying to load the full roster into memory, the machine should technically have the memory available to do this, so I don't understand why it's actually crashing.

Do you think throwing hardware at the problem will fix the issue, or are we just going to run into the same problem further down the line?

Thanks

Try table in disc; try in other machine

Ok, your table size is reasonable considering the number of roster items.

synack wrote:

Now even if the roster size is 309MB and it's trying to load the full roster into memory, the machine should technically have the memory available to do this, so I don't understand why it's actually crashing.

I don't know the free RAM in your server machine when ejabberd is starting.

You can try to set the roster table as "copy in disc only" using the WebAdmin. This reduces RAM usage, but increases resource consumption when reading/writing roster items (when users log in). Maybe that trade-off is beneficial in your case.

synack wrote:

Do you think throwing hardware at the problem will fix the issue, or are we just going to run into the same problem further down the line?

Before touching the hardware, you may prefer to run experiments: install ejabberd on another idle machine with more RAM, like your desktop machine, copy over the mnesia spool files and the config, and check what happens on that one.

Hi, don't you think it's

Hi, don't you think it's better to use MySQL or some other database, instead of Mnesia?

Probably yes.

Probably yes.

I suppose it's an Erlang

I suppose it's an Erlang problem here, because it takes as many resources as it can get, and when there are none left it crashes without warning. A hardware improvement would help here, but instead of increasing your system's capacity I would suggest you start using distributed computing power (the "divide and conquer" method). This configuration will allow you to add more power to your server in the future without restarting it.
