ejabberd - Comments for "Erlang core dump in OpenSSL"

RE: Some ideas to try

Tronman — Thu, 20 May 2010 15:28:54 +0000

badlop wrote:

'before' what? Updating Erlang R12 to R13B01?, Updating ejabberd 2.0.0 to 2.1.3 (which has SMP enabled by default)?, ...

You can try to update Erlang to the most recent R13B04.

That's the strange thing, I haven't changed anything...it just started crashing this way out of nowhere as far as I can tell. I used to be able to turn ejabberd on and off with SMP set to auto, but now I can't. No hardware changes, no software changes, no configuration changes. There *might* have been changes in the Solaris environment that I'm unaware of, but I don't think so. I didn't see any error messages such as "not properly closed" in the ejabberd log (level is set at INFO).

I did indeed try upgrading to R13B04 and ejabberd 2.1.3, but I was no longer able to log any clients in, receiving errors such as:

** Reason for termination =
** {badarg,[{erlang,port_control,[crypto_drv02,1,"admin1:localhost:admin1"]},

which I spent some time trying to solve but eventually gave up so I went back to the original R13B01 and 2.1.2 setup (I think I said 2.1.1 in my original message, but I meant 2.1.2, sorry about that).

Anyway, not worrying about why I had to disable SMP for no reason, since I've disabled it the crashes/core dumps have stopped but under load (~30000 simultaneous users) many of my simulated clients are getting time out errors when trying to sign it. I have the following values set in ejabberdctl:

POLL=true
SMP=disable
ERL_MAX_PORTS=320000
ERL_PROCESSES=2500000
ERL_MAX_ETS_TABLES=100000

It looks like I'm hitting a wall just over 9000 users. Specifically 9,227 is the number I show logged in in this last test before I start getting hit with time outs.

prstat shows memory usage around 900 MB and CPU usage from 0.5% to 1.0%. The memory/cpu usages of the other processes are negligible. The machine is a T5120 with 16GB of ram and the machine and a 64 core processor. So I think I should be able to handle a lot more users, and that this ~9000 user limit I'm seeing seems like its being artificially imposed by something, I just don't know what!

Some ideas to try

mfoss — Thu, 20 May 2010 10:08:46 +0000

Tronman wrote:

The logs do show me:

=INFO REPORT==== 2010-05-17 21:12:24 ===
I(<0.470.0>:ejabberd_c2s:1366) : ({socket_state,tls,{tlssock,#Port<0.3434>,#Port<0.3454>},<0.469.0>}) Close session for admin1@localhost/27926996461274119930670529

=INFO REPORT==== 2010-05-17 21:12:29 ===
I(<0.37.0>:ejabberd_app:86) : ejabberd 2.1.3 is stopped in the node ejabberd@localhost

Then, the core occurs at 21:12:41, around 10-12 seconds later or so.

So, the erlang node crashes while it is being stopped. You can try to start and stop it interactively:

$ ejabberdctl live
...
=INFO REPORT==== 20-May-2010::11:54:15 ===
I(<0.41.0>:ejabberd_app:69) : ejabberd 2.1.3 is started in the node ejabberd@localhost

=PROGRESS REPORT==== 20-May-2010::11:54:15 ===
         application: ejabberd                
          started_at: ejabberd@localhost      

... login a client ...

(ejabberd@localhost)1> application:stop(ejabberd).

=INFO REPORT==== 20-May-2010::12:01:23 ===
I(<0.41.0>:ejabberd_app:86) : ejabberd 2.1.3 is stopped in the node ejabberd@localhost

=INFO REPORT==== 20-May-2010::12:01:24 ===
    application: ejabberd                 
    exited: stopped                       
    type: temporary                       
ok                                        
(ejabberd@localhost)2> application:stop(mnesia).  

=INFO REPORT==== 20-May-2010::12:01:29 ===
    application: mnesia                   
    exited: stopped
    type: permanent
ok
(ejabberd@localhost)3> application:stop(crypto).

=INFO REPORT==== 20-May-2010::12:01:35 ===
    application: crypto
    exited: stopped
    type: temporary
ok
(ejabberd@localhost)6> init:stop().
ok

$

Tronman wrote:

Which brings me back to the possibility that something is screwing up with OpenSSL...but what? And why if only someone has connected? Does that cleanup code only get executed in that case?

Maybe OpenSSL (or some part of it, or some library of it) is only loaded/started when the first client connects to the server.

Tronman wrote:

EDIT: You might also find it interesting that ejabberd takes up to 10 to 15 seconds to start. That is, from the time I invoke ./ejabberdctl start to the time I can hit the web client. This is quite a bit slower then my initial tests on Ubuntu. This has always been the case since I've been running it on the Solaris machine.

If ejabberd doesn't stop cleanly in Solaris, when it's started it may need to repair the Mnesia table, showing something like this on the log files:

dets: file "/var/lib/ejabberd/offline_msg.DAT" not properly closed, repairing ...

Tronman wrote:

Or perhaps any guesses on why the SMP was causing such a weird error?

Maybe Erlang support for SMP in Solaris wasn't completely stable in Erlang R13B01.

Tronman wrote:

And again, I'm sure I never had such strange behavior before even with SMP set to auto...so there still could be something in the environment changed somewhere.

'before' what? Updating Erlang R12 to R13B01?, Updating ejabberd 2.0.0 to 2.1.3 (which has SMP enabled by default)?, ...

You can try to update Erlang to the most recent R13B04.

Another update

Tronman — Mon, 17 May 2010 19:59:47 +0000

Bah, that's a problem then. It seems to happen for me on a fresh install on ejabberd 2.1.3 on Solaris 10, though I did have to change:

ejabberdctl to a bash script from a shell script. The shell script was giving me a "bad substitution error" but wouldn't identify what line so I don't know whats causing it

I also had to change id -g/-G to be /usr/xpg4/bin/id since the one in the path didn't support the -g option.

The logs do show me:

=INFO REPORT==== 2010-05-17 21:12:29 ===
I(<0.37.0>:ejabberd_app:86) : ejabberd 2.1.3 is stopped in the node ejabberd@localhost

Then, the core occurs at 21:12:41, around 10-12 seconds later or so.

I've also discovered that:

The client doesn't have to be connected when the simulator goes down to core. However, at least one client must have been connected during the lifetime of the session. To recap:

1. Start Ejabberd/Stop Ejabberd = No Core
2. Start Ejabberd/Connect Client/Stop Ejabberd = Core
3. Start Ejabberd/Connect Client/Disconnect Client/Stop Ejabberd = Core

gdb always shows:

1. That the "Program terminated with signal 11, Segmentation fault."
2. The core was generated by: beam.smp -K true -P
3. The backtrace is always the same as I first listed; with CRYPTO_lock being the last one executed.

Which brings me back to the possibility that something is screwing up with OpenSSL...but what? And why if only someone has connected? Does that cleanup code only get executed in that case?

EDIT2: I took your advice and tried disabling symmetric multiprocessing. It seems to have stopped the core's when logging clients in and out which is great, but I don't know about under load yet. Speaking of load, any idea how disabling the SMP will impact ejabberd's performance? I'm not sure how ejabberd would even make use of it since all the XMPP messages would be asynchronous. Or perhaps any guesses on why the SMP was causing such a weird error? And again, I'm sure I never had such strange behavior before even with SMP set to auto...so there still could be something in the environment changed somewhere.

Thanks!

Indeed

Tronman — Mon, 17 May 2010 17:39:04 +0000

badlop wrote:

And another weird thing: do you have 128 processors?

It's a pretty high end server which advertises up to 128 simultaneous jobs, I'm not exactly sure if that translates to cores or processors, or some sort of sharing, but I don't think that number is wrong. I share the server with many other parties so it's likely it doesn't actually have access to all 128 at once.

badlop wrote:

Maybe Erlang on Solaris has some bug. You can try to disable SMP support in ejabberdctl.cfg

[/quote]

I am starting to fear this as well. Of course, my build box (the one I mentioned previously) is also Solaris 10 which is where I printed the proper header from. The runtime box is also Solaris 10, but *doesn't* show the header. So possibly some sort of configuration issue? I'm not a Solaris expert, but I don't get a choice in the OS :)

SMP 128?

mfoss — Mon, 17 May 2010 14:18:39 +0000

Tronman wrote:

When I run erl on my build box I get:

Erlang R13B01 (erts-5.7.2) [source] [smp:128:128] [rq:128] [async-threads:0] [kernel-poll:false]
Eshell V5.7.2 (abort with ^G)

And another weird thing: do you have 128 processors?

In my case it's:

$ erl
Erlang R13B04 (erts-5.7.5) [source] [smp:2:2] [rq:2] [async-threads:0] [kernel-poll:false]

Eshell V5.7.5  (abort with ^G)
1>

Maybe Erlang on Solaris has some bug. You can try to disable SMP support in ejabberdctl.cfg

ejabberdctl stop is a normal operation, no crash

mfoss — Mon, 17 May 2010 10:11:39 +0000

Tronman wrote:

A minor update:

It appears that I can recreate the core pretty easily, all I have to do is connect a client with Pidgin, then run ./ejabberdctl stop while he's connected.

It appears that killing the simulator with a connected client causes it to core with the same backtrace I reported. Is this expected behavior or a bug in ejabberd? So it's possible something did indeed cause the simulator to crash/stop while clients were disconnected which caused the core and not the core causing the clients to disconnect.

That isn't normal. In ejabberd 2.1.3 I compiled and installed from source in Debian Sid 32bits, with the Erlang packaged in Debian, with Pidgin (home) and Tkabber (work) connected to my 'badlop' account, when I execute "ejabberdctl stop", there isn't any crash, just this is logged in ejabberd.log and ejabberd stops:

=INFO REPORT==== 2010-05-17 12:07:05 ===
I(<0.366.0>:ejabberd_c2s:1409) : ({socket_state,tls,{tlssock,#Port<0.3910>,#Port<0.3957>},<0.365.0>})
  Close session for badlop@localhost/Home

=INFO REPORT==== 2010-05-17 12:07:05 ===
I(<0.369.0>:ejabberd_c2s:1409) : ({socket_state,gen_tcp,#Port<0.3960>,<0.368.0>})
  Close session for badlop@localhost/work

=INFO REPORT==== 2010-05-17 12:07:10 ===
I(<0.41.0>:ejabberd_app:86) : ejabberd 2.1.3 is stopped in the node ejabberd@localhost

Update

Tronman — Fri, 14 May 2010 19:55:46 +0000

A minor update:

It appears that I can recreate the core pretty easily, all I have to do is connect a client with Pidgin, then run ./ejabberdctl stop while he's connected.

Re: Erlang core dump in OpenSSL

Tronman — Fri, 14 May 2010 14:50:29 +0000

Oh I see. I'd thought that ejabberd restarted itself since I could still login/hit the web client after the core without my actually starting it back up again. So maybe what ever caused it to crash (which I'd thought was a segmentation fault) only disconnected all the currently connected users (which is still bad, lol) but didn't actually bring the system down.

The load environment is down, so I haven't tested it yet since upgrading to 2.1.3, I wasn't able to upgrade the Erlang version since it was giving me different errors when I tried to log in (though with the crypto library yet again). I've also increased the number of ERL_MAX_PORTS, ERL_PROCESSES and ERL_MAX_ETS_TABLES. Hopefully I'll have more luck, if not, I'll be popping back in here. Thanks!

Re: Erlang core dump in OpenSSL

zinid — Fri, 14 May 2010 02:30:50 +0000

Tronman wrote:

I'm not sure what you mean by "when ejabberd stops until you don't reload ejabberd in the emulator every second" ? Would you mind elaborating?

That was just a joke ;)

Quote:

If tls_drv_finish() is only called when ejabberd stops, would that imply that my crash is occurring elsewhere and the core dump I'm getting is a consequence of ejabberd trying to restart itself to recover from a crash?

Yes with the one exception: ejabberd never tries to restart itself. In your case it just stops or Erlang emulator tries to stop it. So your core dump is not a cause but an effect. I think if you comment tls_drv_finish() function you will get a crash dump next time.

Re: Erlang core dump in OpenSSL

Tronman — Thu, 13 May 2010 14:21:58 +0000

Hi zinid,

I'm not sure what you mean by "when ejabberd stops until you don't reload ejabberd in the emulator every second" ? Would you mind elaborating?

Thanks

Well

Tronman — Wed, 12 May 2010 18:15:18 +0000

I did run into some problems like that at first but I tracked down all the missing libraries (that I'm aware of or that were causing a problem) and put them in my LD_LIBRARY_PATH. This includes all the crypto/ssl libraries. So I'm not aware of any that could be missing but I can't guarantee it either.

EDIT:
There is the weird header issue. When I run erl on my build box I get:

Erlang R13B01 (erts-5.7.2) [source] [smp:128:128] [rq:128] [async-threads:0] [kernel-poll:false]
Eshell V5.7.2 (abort with ^G)

but on the runtime box I only get:

Eshell V5.7.2 (abort with ^G)
1>

without the "Header" at the beginning. I never was able to figure that out.

Re: Erlang core dump in OpenSSL

zinid — Wed, 12 May 2010 17:40:45 +0000

Dunno. Perhaps your target system lacks of some libraries installed.

Installing Erlang...

Tronman — Wed, 12 May 2010 17:33:13 +0000

Interesting! I'd haven't heard that, and I have had to do some tricks to get Erlang running which might be causing an issue with something I'm not aware of.

Basically, since I'm running on Solaris my only option is to build Erlang from source. Things are made trickier by the fact that I have absolutely no build tools on my runtime machine, so I have to build on a separate Solaris machine. So my method of "installing" erlang is to download the source code, run ./configure with a prefix to a specific directory (e.g. /usr/local/me/erlang), then make install.

I then take the built /usr/local/me/erlang folder and copy it verbatim to the runtime box including the path structure. From there, I do the same with my ejabberd install and make sure it's pointing to the correct Erlang. This has been working fine until now, either with the load issue, or with logging in on the newer version. So is there something that my install method is missing that could cause an issue like this?

Re: Erlang core dump in OpenSSL

zinid — Wed, 12 May 2010 17:20:59 +0000

Since there is a core dump, there will not be any crash dump. The problems with crypto is caused by incorrect Erlang installation in 99% of cases.

Hmm

Tronman — Wed, 12 May 2010 17:05:19 +0000

I can't seem to find an erlang crash dump, which I find odd. Usually they are in the /var/log/ejabberd folder. All I seem to have is the core file.

I did try installing Erlang R13B04 and Ejabberd 2.1.3, but now I have a new issue: I can hit the web client fine but every time I try to sign in, I fail half way through the authentication with:

** Reason for termination =
** {badarg,[{erlang,port_control,[crypto_drv02,1,"admin1:localhost:admin1"]},

Which seems identical to the issue mentioned here: http://www.ejabberd.im/node/3223, but there doesn't seem to be a resolution to that nor does any of the solutions listed there work. Perhaps I should start a separate thread for that though.