Hello,
I am running Ejabberd 1.0.0 with Erlang 10B-8. I am using Jabsimul to benchmark the system, and am using
The test will run with no errors for about an hour with message frequency of 500 ms, then in the space of 5 minutes the memory usage will spike from 400MB to 2GB and the swap space usage will jump from 0MB to 1GB. Then all current connections are dumped and all further connections are refused. Even if I restart the ejabberd service I am unable to log in, and have to restart the server before I can connect again.
I am running Gentoo 2.6.14-r5 with 2GB of memory and a 1GB NIC. I have jabber starting as a service with the following command line:
ulimit -n 15000;/usr/local/bin/erl -pa /var/lib/ejabberd/ebin -sname ejabberd@farlinux01 -s ejabberd -env ERL_MAX_PORTS 5000 -env ERL_MAX_ETS_TABLES 20000 -ejabberd config \"/etc/ejabberd/ejabberd.cfg\" log-path \"/var/log/ejabberd.log\" -sasl sasl_error_logger \{file,\"var/log/ejabberd/sasl.log\"\} -mnesia dir \"var/lib/ejabberd/spool\" +P 250000 +K true -detached
Any help with this would be appreciated, thank you.
Re: Jabber crashes
Hello,
I didn't have this problem when running my test, but I had less than 1 GB of RAM and my tests were more about connections than about routing messages.
Do any suspicious error or warning messages appear on ejabberd.log or sasl.log?
Re: Jabber crashes
No error messages appear on the ejabberd server. I checked ejabberd.log, sasl.log, and the server logs; all of the errors appeared on the jab_simul server.
I figured out that every time the ejabberd server crashed, the jab_simul server had run out of disk space. I assumed this was unrelated, but set up a cron job to delete the tmp log files anyway. This did make the error go away, and I was able to run a simulation for 70 hours this weekend with no errors. I had 500 users with a message frequency of 100 ms, using 160MB of memory.
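For anyone hitting the same disk-space problem, a cleanup job along these lines might look like the sketch below. This is not the actual cron job used here: the function name clean_old_logs, the LOG_DIR and MAX_AGE_MIN variables, the *.log pattern, and the script path in the crontab comment are all assumptions, and it relies on a find that supports -maxdepth, -mmin and -delete (GNU findutils does).

```shell
#!/bin/sh
# Hypothetical helper: delete log files in one directory that were last
# modified more than a given number of minutes ago.
# Assumes jab_simul's temporary logs match "*.log"; adjust the pattern as needed.
clean_old_logs() {
    log_dir=$1       # directory to clean (not searched recursively)
    max_age_min=$2   # age threshold in minutes
    find "$log_dir" -maxdepth 1 -type f -name '*.log' \
        -mmin "+$max_age_min" -delete
}

# Run the cleanup only when LOG_DIR is set, so the file can also be sourced.
if [ -n "${LOG_DIR:-}" ]; then
    clean_old_logs "$LOG_DIR" "${MAX_AGE_MIN:-30}"
fi

# Example crontab line (every 10 minutes; both paths are placeholders):
# */10 * * * * LOG_DIR=/tmp/jab_simul_logs /usr/local/bin/clean_jab_simul_logs.sh
```

Deleting by age rather than removing everything keeps the most recent output around, which is the part needed when a run does fail.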
I was able to recreate the error even with these settings by adding an additional 300 users, bringing me to 800 total. The job runs for 5 minutes, then I start getting "Kolejka za dluga, pakiet anulowany!" errors (Polish for "Queue too long, packet cancelled!"). Shortly after that I get "POLLERR: Connection terminated" and "POLLERR: Connection refused" errors.
When I check my ejabberd server, the beam process has become a zombie: it is still running, but it refuses all connections. The memory usage on the server spiked and then came back down after the beam process crashed.
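To pin down when the memory spike starts and when the process changes state, a small sampler could be left running during the test. The sketch below is only an illustration: the function name sample_proc and the BEAM_PID variable are made up, and it assumes a procps-style ps that accepts -o stat= and -o rss=.

```shell
#!/bin/sh
# Hypothetical helper: print one line with a Unix timestamp, the process
# state, and the resident memory (kB) of a PID. Returns non-zero if the
# process no longer exists.
sample_proc() {
    pid=$1
    # "stat=" and "rss=" suppress the header line; a state starting with Z
    # means the process is a zombie (defunct).
    out=$(ps -o stat= -o rss= -p "$pid") || return 1
    [ -n "$out" ] || return 1
    set -- $out
    echo "$(date +%s) pid=$pid state=$1 rss_kb=$2"
}

# Example loop (BEAM_PID and the log path are placeholders):
# while sample_proc "$BEAM_PID" >> /tmp/beam_mem.log; do sleep 10; done
```

Comparing the timestamps in that log with the times of the first jab_simul errors would show whether the memory growth starts before or after the client begins dropping packets.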
If I dial the message frequency back to 500 ms with 800 users, it runs for at least 10 minutes, but I'm testing that as I type.
I don't understand the interaction between ejabberd and jab_simul well enough to know why this is happening, but I am concerned that the POLLERR errors are causing my ejabberd server to crash. I was unable to find any mention of this problem. Has anything like it occurred before?