ejabberd - Comments for "Erlang core dump in OpenSSL" https://www.ejabberd.im/node/4050 en RE: Some ideas to try https://www.ejabberd.im/node/4050#comment-55786 <div class="quote-msg"> <div class="quote-author"><em>badlop</em> wrote:</div> <p>'before' what? Updating Erlang R12 to R13B01?, Updating ejabberd 2.0.0 to 2.1.3 (which has SMP enabled by default)?, ...</p> <p>You can try to update Erlang to the most recent R13B04.</p></div> <p>That's the strange thing, I haven't changed anything... it just started crashing this way out of nowhere as far as I can tell. I used to be able to turn ejabberd on and off with SMP set to auto, but now I can't. No hardware changes, no software changes, no configuration changes. There *might* have been changes in the Solaris environment that I'm unaware of, but I don't think so. I didn't see any error messages such as "not properly closed" in the ejabberd log (level is set at INFO).</p> <p>I did indeed try upgrading to R13B04 and ejabberd 2.1.3, but I was no longer able to log any clients in, receiving errors such as:</p> <p>** Reason for termination =<br /> ** {badarg,[{erlang,port_control,[crypto_drv02,1,"admin1:localhost:admin1"]},</p> <p>which I spent some time trying to solve, but I eventually gave up and went back to the original R13B01 and 2.1.2 setup (I think I said 2.1.1 in my original message, but I meant 2.1.2, sorry about that).</p> <p>Anyway, setting aside why I had to disable SMP for no apparent reason: since I disabled it, the crashes/core dumps have stopped, but under load (~30000 simultaneous users) many of my simulated clients are getting timeout errors when trying to sign in. I have the following values set in ejabberdctl:</p> <p>POLL=true<br /> SMP=disable<br /> ERL_MAX_PORTS=320000<br /> ERL_PROCESSES=2500000<br /> ERL_MAX_ETS_TABLES=100000</p> <p>It looks like I'm hitting a wall just over 9000 users. Specifically, 9,227 is the number of users I show as logged in during this last test before I start getting hit with timeouts. 
</p> <p>prstat shows memory usage around 900 MB and CPU usage from 0.5% to 1.0%. The memory/CPU usage of the other processes is negligible. The machine is a T5120 with 16 GB of RAM and a 64-core processor. So I think I should be able to handle a lot more users, and this ~9000 user limit I'm seeing seems like it's being artificially imposed by something, I just don't know what!</p> Thu, 20 May 2010 15:28:54 +0000 Tronman comment 55786 at https://www.ejabberd.im Some ideas to try https://www.ejabberd.im/node/4050#comment-55783 <div class="quote-msg"> <div class="quote-author"><em>Tronman</em> wrote:</div> <p>The logs do show me:</p> <p>=INFO REPORT==== 2010-05-17 21:12:24 ===<br /> I(&lt;0.470.0&gt;:ejabberd_c2s:1366) : ({socket_state,tls,{tlssock,#Port&lt;0.3434&gt;,#Port&lt;0.3454&gt;},&lt;0.469.0&gt;}) Close session for admin1@localhost/27926996461274119930670529</p> <p>=INFO REPORT==== 2010-05-17 21:12:29 ===<br /> I(&lt;0.37.0&gt;:ejabberd_app:86) : ejabberd 2.1.3 is stopped in the node ejabberd@localhost</p> <p>Then, the core occurs at 21:12:41, around 10-12 seconds later or so. </p></div> <p>So, the Erlang node crashes while it is being stopped. You can try to start and stop it interactively:</p> <pre>
$ ejabberdctl live
...
=INFO REPORT==== 20-May-2010::11:54:15 ===
I(&lt;0.41.0&gt;:ejabberd_app:69) : ejabberd 2.1.3 is started in the node ejabberd@localhost
=PROGRESS REPORT==== 20-May-2010::11:54:15 ===
application: ejabberd
started_at: ejabberd@localhost
... login a client ...
(ejabberd@localhost)1&gt; application:stop(ejabberd).
=INFO REPORT==== 20-May-2010::12:01:23 ===
I(&lt;0.41.0&gt;:ejabberd_app:86) : ejabberd 2.1.3 is stopped in the node ejabberd@localhost
=INFO REPORT==== 20-May-2010::12:01:24 ===
application: ejabberd
exited: stopped
type: temporary
ok
(ejabberd@localhost)2&gt; application:stop(mnesia).
=INFO REPORT==== 20-May-2010::12:01:29 ===
application: mnesia
exited: stopped
type: permanent
ok
(ejabberd@localhost)3&gt; application:stop(crypto).
=INFO REPORT==== 20-May-2010::12:01:35 ===
application: crypto
exited: stopped
type: temporary
ok
(ejabberd@localhost)6&gt; init:stop().
ok
$
</pre><div class="quote-msg"> <div class="quote-author"><em>Tronman</em> wrote:</div> <p>Which brings me back to the possibility that something is screwing up with OpenSSL...but what? And why only if someone has connected? Does that cleanup code only get executed in that case? </p></div> <p>Maybe OpenSSL (or some part of it, or some library of it) is only loaded/started when the first client connects to the server.</p> <div class="quote-msg"> <div class="quote-author"><em>Tronman</em> wrote:</div> <p>EDIT: You might also find it interesting that ejabberd takes up to 10 to 15 seconds to start. That is, from the time I invoke ./ejabberdctl start to the time I can hit the web client. This is quite a bit slower than my initial tests on Ubuntu. This has always been the case since I've been running it on the Solaris machine. </p></div> <p>If ejabberd doesn't stop cleanly on Solaris, then when it's started it may need to repair the Mnesia tables, showing something like this in the log files:</p> <pre> dets: file "/var/lib/ejabberd/offline_msg.DAT" not properly closed, repairing ... </pre><div class="quote-msg"> <div class="quote-author"><em>Tronman</em> wrote:</div> <p>Or perhaps any guesses on why the SMP was causing such a weird error? </p></div> <p>Maybe Erlang support for SMP on Solaris wasn't completely stable in Erlang R13B01.</p> <div class="quote-msg"> <div class="quote-author"><em>Tronman</em> wrote:</div> <p>And again, I'm sure I never had such strange behavior before even with SMP set to auto...so there still could be something in the environment changed somewhere. </p></div> <p>'before' what? 
Updating Erlang R12 to R13B01?, Updating ejabberd 2.0.0 to 2.1.3 (which has SMP enabled by default)?, ...</p> <p>You can try to update Erlang to the most recent R13B04.</p> Thu, 20 May 2010 10:08:46 +0000 mfoss comment 55783 at https://www.ejabberd.im Another update https://www.ejabberd.im/node/4050#comment-55776 <p>Bah, that's a problem then. It seems to happen for me on a fresh install of ejabberd 2.1.3 on Solaris 10, though I did have to change:</p> <p>ejabberdctl to a bash script from a shell script. The shell script was giving me a "bad substitution" error but wouldn't identify which line, so I don't know what's causing it.</p> <p>I also had to change id -g/-G to be /usr/xpg4/bin/id since the one in the path didn't support the -g option.</p> <p>The logs do show me:</p> <p>=INFO REPORT==== 2010-05-17 21:12:24 ===<br /> I(&lt;0.470.0&gt;:ejabberd_c2s:1366) : ({socket_state,tls,{tlssock,#Port&lt;0.3434&gt;,#Port&lt;0.3454&gt;},&lt;0.469.0&gt;}) Close session for admin1@localhost/27926996461274119930670529</p> <p>=INFO REPORT==== 2010-05-17 21:12:29 ===<br /> I(&lt;0.37.0&gt;:ejabberd_app:86) : ejabberd 2.1.3 is stopped in the node ejabberd@localhost</p> <p>Then, the core occurs at 21:12:41, around 10-12 seconds later or so. </p> <p>I've also discovered that:</p> <p>The client doesn't have to be connected when the simulator goes down to core. However, at least one client must have been connected during the lifetime of the session. To recap:</p> <p>1. Start Ejabberd/Stop Ejabberd = No Core<br /> 2. Start Ejabberd/Connect Client/Stop Ejabberd = Core<br /> 3. Start Ejabberd/Connect Client/Disconnect Client/Stop Ejabberd = Core</p> <p>gdb always shows: </p> <p>1. That the "Program terminated with signal 11, Segmentation fault."<br /> 2. The core was generated by: beam.smp -K true -P<br /> 3. 
The backtrace is always the same as the one I first listed, with CRYPTO_lock being the last function executed.</p> <p>Which brings me back to the possibility that something is screwing up with OpenSSL...but what? And why only if someone has connected? Does that cleanup code only get executed in that case?</p> <p>EDIT: You might also find it interesting that ejabberd takes up to 10 to 15 seconds to start. That is, from the time I invoke ./ejabberdctl start to the time I can hit the web client. This is quite a bit slower than my initial tests on Ubuntu. This has always been the case since I've been running it on the Solaris machine.</p> <p>EDIT2: I took your advice and tried disabling symmetric multiprocessing. It seems to have stopped the cores when logging clients in and out, which is great, but I don't know about under load yet. Speaking of load, any idea how disabling SMP will impact ejabberd's performance? I'm not sure how ejabberd would even make use of it since all the XMPP messages would be asynchronous. Or perhaps any guesses on why the SMP was causing such a weird error? And again, I'm sure I never had such strange behavior before even with SMP set to auto...so there still could be something in the environment changed somewhere.</p> <p>Thanks!</p> Mon, 17 May 2010 19:59:47 +0000 Tronman comment 55776 at https://www.ejabberd.im Indeed https://www.ejabberd.im/node/4050#comment-55775 <div class="quote-msg"> <div class="quote-author"><em>badlop</em> wrote:</div> <p>And another weird thing: do you have 128 processors? </p></div> <p>It's a pretty high-end server which advertises up to 128 simultaneous jobs. I'm not exactly sure if that translates to cores or processors, or some sort of sharing, but I don't think that number is wrong. I share the server with many other parties so it's likely it doesn't actually have access to all 128 at once. </p> <div class="quote-msg"> <div class="quote-author"><em>badlop</em> wrote:</div> <p>Maybe Erlang on Solaris has some bug. 
You can try to disable SMP support in ejabberdctl.cfg</p></div> <p>I am starting to fear this as well. Of course, my build box (the one I mentioned previously) is also Solaris 10, which is where I printed the proper header from. The runtime box is also Solaris 10, but *doesn't* show the header. So possibly some sort of configuration issue? I'm not a Solaris expert, but I don't get a choice in the OS :)</p> Mon, 17 May 2010 17:39:04 +0000 Tronman comment 55775 at https://www.ejabberd.im SMP 128? https://www.ejabberd.im/node/4050#comment-55774 <div class="quote-msg"> <div class="quote-author"><em>Tronman</em> wrote:</div> <p>When I run erl on my build box I get:</p> <p>Erlang R13B01 (erts-5.7.2) [source] [smp:128:128] [rq:128] [async-threads:0] [kernel-poll:false]<br /> Eshell V5.7.2 (abort with ^G)</p></div> <p>And another weird thing: do you have 128 processors?</p> <p>In my case it's:</p> <pre>$ erl
Erlang R13B04 (erts-5.7.5) [source] [smp:2:2] [rq:2] [async-threads:0] [kernel-poll:false]
Eshell V5.7.5 (abort with ^G)
1&gt;
</pre><p> Maybe Erlang on Solaris has some bug. You can try to disable SMP support in ejabberdctl.cfg</p> Mon, 17 May 2010 14:18:39 +0000 mfoss comment 55774 at https://www.ejabberd.im ejabberdctl stop is a normal operation, no crash https://www.ejabberd.im/node/4050#comment-55765 <div class="quote-msg"> <div class="quote-author"><em>Tronman</em> wrote:</div> <p>A minor update:</p> <p>It appears that I can recreate the core pretty easily; all I have to do is connect a client with Pidgin, then run ./ejabberdctl stop while it's connected.</p> <p>It appears that killing the simulator with a connected client causes it to core with the same backtrace I reported. Is this expected behavior or a bug in ejabberd? So it's possible something did indeed cause the simulator to crash/stop while clients were connected, which caused the core, and not the core causing the clients to disconnect.</p></div> <p>That isn't normal. 
With ejabberd 2.1.3, compiled and installed from source on Debian Sid 32-bit with the Erlang packaged in Debian, and with Pidgin (home) and Tkabber (work) connected to my 'badlop' account, executing "ejabberdctl stop" doesn't cause any crash; just this is logged in ejabberd.log and ejabberd stops:</p> <pre>
=INFO REPORT==== 2010-05-17 12:07:05 ===
I(&lt;0.366.0&gt;:ejabberd_c2s:1409) : ({socket_state,tls,{tlssock,#Port&lt;0.3910&gt;,#Port&lt;0.3957&gt;},&lt;0.365.0&gt;}) Close session for badlop@localhost/Home
=INFO REPORT==== 2010-05-17 12:07:05 ===
I(&lt;0.369.0&gt;:ejabberd_c2s:1409) : ({socket_state,gen_tcp,#Port&lt;0.3960&gt;,&lt;0.368.0&gt;}) Close session for badlop@localhost/work
=INFO REPORT==== 2010-05-17 12:07:10 ===
I(&lt;0.41.0&gt;:ejabberd_app:86) : ejabberd 2.1.3 is stopped in the node ejabberd@localhost
</pre> Mon, 17 May 2010 10:11:39 +0000 mfoss comment 55765 at https://www.ejabberd.im Update https://www.ejabberd.im/node/4050#comment-55761 <p>A minor update:</p> <p>It appears that I can recreate the core pretty easily; all I have to do is connect a client with Pidgin, then run ./ejabberdctl stop while it's connected.</p> <p>It appears that killing the simulator with a connected client causes it to core with the same backtrace I reported. Is this expected behavior or a bug in ejabberd? So it's possible something did indeed cause the simulator to crash/stop while clients were connected, which caused the core, and not the core causing the clients to disconnect.</p> Fri, 14 May 2010 19:55:46 +0000 Tronman comment 55761 at https://www.ejabberd.im Re: Erlang core dump in OpenSSL https://www.ejabberd.im/node/4050#comment-55759 <p>Oh I see. I'd thought that ejabberd restarted itself since I could still login/hit the web client after the core without my actually starting it back up again. 
So maybe whatever caused it to crash (which I'd thought was a segmentation fault) only disconnected all the currently connected users (which is still bad, lol) but didn't actually bring the system down. </p> <p>The load environment is down, so I haven't tested it since upgrading to 2.1.3. I wasn't able to upgrade the Erlang version, since it was giving me different errors when I tried to log in (though with the crypto library yet again). I've also increased the values of ERL_MAX_PORTS, ERL_PROCESSES and ERL_MAX_ETS_TABLES. Hopefully I'll have more luck; if not, I'll be popping back in here. Thanks!</p> Fri, 14 May 2010 14:50:29 +0000 Tronman comment 55759 at https://www.ejabberd.im Re: Erlang core dump in OpenSSL https://www.ejabberd.im/node/4050#comment-55757 <div class="quote-msg"> <div class="quote-author"><em>Tronman</em> wrote:</div> <p>I'm not sure what you mean by "when ejabberd stops until you don't reload ejabberd in the emulator every second" ? Would you mind elaborating?</p></div> <p>That was just a joke ;)</p> <div class="quote-msg"> <div class="quote-author">Quote:</div> <p>If tls_drv_finish() is only called when ejabberd stops, would that imply that my crash is occurring elsewhere and the core dump I'm getting is a consequence of ejabberd trying to restart itself to recover from a crash?</p></div> <p>Yes, with one exception: ejabberd never tries to restart itself. In your case it just stops, or the Erlang emulator tries to stop it. So your core dump is not a cause but an effect. I think if you comment out the tls_drv_finish() function you will get a crash dump next time.</p> Fri, 14 May 2010 02:30:50 +0000 zinid comment 55757 at https://www.ejabberd.im Re: Erlang core dump in OpenSSL https://www.ejabberd.im/node/4050#comment-55754 <p>Hi zinid,</p> <p>I'm not sure what you mean by "when ejabberd stops until you don't reload ejabberd in the emulator every second" ? 
Would you mind elaborating?</p> <p>If tls_drv_finish() is only called when ejabberd stops, would that imply that my crash is occurring elsewhere and the core dump I'm getting is a consequence of ejabberd trying to restart itself to recover from a crash?</p> <p>Thanks</p> Thu, 13 May 2010 14:21:58 +0000 Tronman comment 55754 at https://www.ejabberd.im Well https://www.ejabberd.im/node/4050#comment-55750 <p>I did run into some problems like that at first, but I tracked down all the missing libraries (that I'm aware of or that were causing a problem) and put them in my LD_LIBRARY_PATH. This includes all the crypto/ssl libraries. So I'm not aware of any that could be missing, but I can't guarantee it either.</p> <p>EDIT:<br /> There is the weird header issue. When I run erl on my build box I get:</p> <p>Erlang R13B01 (erts-5.7.2) [source] [smp:128:128] [rq:128] [async-threads:0] [kernel-poll:false]<br /> Eshell V5.7.2 (abort with ^G)</p> <p>but on the runtime box I only get:</p> <p>Eshell V5.7.2 (abort with ^G)<br /> 1&gt;</p> <p>without the "Header" at the beginning. I never was able to figure that out.</p> Wed, 12 May 2010 18:15:18 +0000 Tronman comment 55750 at https://www.ejabberd.im Re: Erlang core dump in OpenSSL https://www.ejabberd.im/node/4050#comment-55749 <p>Dunno. Perhaps your target system is missing some libraries.</p> Wed, 12 May 2010 17:40:45 +0000 zinid comment 55749 at https://www.ejabberd.im Installing Erlang... https://www.ejabberd.im/node/4050#comment-55748 <p>Interesting! I hadn't heard that, and I have had to do some tricks to get Erlang running which might be causing an issue with something I'm not aware of. </p> <p>Basically, since I'm running on Solaris my only option is to build Erlang from source. Things are made trickier by the fact that I have absolutely no build tools on my runtime machine, so I have to build on a separate Solaris machine. 
So my method of "installing" Erlang is to download the source code, run ./configure with a prefix to a specific directory (e.g. /usr/local/me/erlang), then make install.</p> <p>I then take the built /usr/local/me/erlang folder and copy it verbatim to the runtime box, including the path structure. From there, I do the same with my ejabberd install and make sure it's pointing to the correct Erlang. This had been working fine until now, when I hit either the load issue or the login issue on the newer version. So is there something that my install method is missing that could cause an issue like this?</p> Wed, 12 May 2010 17:33:13 +0000 Tronman comment 55748 at https://www.ejabberd.im Re: Erlang core dump in OpenSSL https://www.ejabberd.im/node/4050#comment-55747 <p>Since there is a core dump, there will not be any crash dump. Problems with crypto are caused by an incorrect Erlang installation in 99% of cases.</p> Wed, 12 May 2010 17:20:59 +0000 zinid comment 55747 at https://www.ejabberd.im Hmm https://www.ejabberd.im/node/4050#comment-55745 <p>I can't seem to find an Erlang crash dump, which I find odd. Usually they are in the /var/log/ejabberd folder. All I seem to have is the core file.</p> <p>I did try installing Erlang R13B04 and ejabberd 2.1.3, but now I have a new issue: I can hit the web client fine, but every time I try to sign in, I fail halfway through the authentication with:</p> <p>** Reason for termination =<br /> ** {badarg,[{erlang,port_control,[crypto_drv02,1,"admin1:localhost:admin1"]},</p> <p>This seems identical to the issue mentioned here: <a href="http://www.ejabberd.im/node/3223" title="http://www.ejabberd.im/node/3223">http://www.ejabberd.im/node/3223</a>, but there doesn't seem to be a resolution to that, nor do any of the solutions listed there work. Perhaps I should start a separate thread for that though.</p> Wed, 12 May 2010 17:05:19 +0000 Tronman comment 55745 at https://www.ejabberd.im