Hi Everyone,
We are working on an ejabberd setup with external auth and a few custom modules. Everything works fine, but our stress test crashes the node. Once the server has accepted a little more than 16,300 connections, it crashes with these messages:
[error] <0.22081.3> gen_fsm <0.22081.3> in state loop terminated with reason: {system_limit,[{erlang,open_port,[{spawn,"expat_erl"},[binary]],[]},{xml_stream,new,2,[{file,"src/xml_stream.erl"},{line,182}]},{ejabberd_http_ws,parse,2,[{file,"src/ejabberd_http_ws.erl"},{line,312}]},{ejabberd_http_ws,handle_info,3,[{file,"src/ejabberd_http_ws.erl"},{line,213}]},{gen_fsm,handle_msg,7,[{file,"gen_fsm.erl"},{line,505}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}
2015-04-22 06:05:34.391 [error] <0.22081.3> CRASH REPORT Process <0.22081.3> with 1 neighbours exited with reason: {system_limit,[{erlang,open_port,[{spawn,"expat_erl"},[binary]],[]},{xml_stream,new,2,[{file,"src/xml_stream.erl"},{line,182}]},{ejabberd_http_ws,parse,2,[{file,"src/ejabberd_http_ws.erl"},{line,312}]},{ejabberd_http_ws,handle_info,3,[{file,"src/ejabberd_http_ws.erl"},{line,213}]},{gen_fsm,handle_msg,7,[{file,"gen_fsm.erl"},{line,505}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]} in gen_fsm:terminate/7 line 622
2015-04-22 06:05:34.391 [error] <0.32163.0> Supervisor ejabberd_http_sup had child undefined started with {ejabberd_http,start_link,undefined} at <0.22080.3> exit with reason {system_limit,[{erlang,open_port,[{spawn,"expat_erl"},[binary]],[]},{xml_stream,new,2,[{file,"src/xml_stream.erl"},{line,182}]},{ejabberd_http_ws,parse,2,[{file,"src/ejabberd_http_ws.erl"},{line,312}]},{ejabberd_http_ws,handle_info,3,[{file,"src/ejabberd_http_ws.erl"},{line,213}]},{gen_fsm,handle_msg,7,[{file,"gen_fsm.erl"},{line,505}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]} in context child_terminated
2015-04-22 06:05:34.501 [error] <0.32296.0>@ejabberd_listener:accept:316 (#Port<0.37951>) Failed TCP accept: system_limit
2015-04-22 06:05:34.569 [error] <0.22095.3> gen_fsm <0.22095.3> in state loop terminated with reason: {system_limit,[{erlang,open_port,[{spawn,"expat_erl"},[binary]],[]},{xml_stream,new,2,[{file,"src/xml_stream.erl"},{line,182}]},{ejabberd_http_ws,parse,2,[{file,"src/ejabberd_http_ws.erl"},{line,312}]},{ejabberd_http_ws,handle_info,3,[{file,"src/ejabberd_http_ws.erl"},{line,213}]},{gen_fsm,handle_msg,7,[{file,"gen_fsm.erl"},{line,505}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}
2015-04-22 06:05:34.570 [error] <0.22095.3> CRASH REPORT Process <0.22095.3> with 1 neighbours exited with reason: {system_limit,[{erlang,open_port,[{spawn,"expat_erl"},[binary]],[]},{xml_stream,new,2,[{file,"src/xml_stream.erl"},{line,182}]},{ejabberd_http_ws,parse,2,[{file,"src/ejabberd_http_ws.erl"},{line,312}]},{ejabberd_http_ws,handle_info,3,[{file,"src/ejabberd_http_ws.erl"},{line,213}]},{gen_fsm,handle_msg,7,[{file,"gen_fsm.erl"},{line,505}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]} in gen_fsm:terminate/7 line 622
2015-04-22 06:05:34.570 [error] <0.32163.0> Supervisor ejabberd_http_sup had child undefined started with {ejabberd_http,start_link,undefined} at <0.22094.3> exit with reason {system_limit,[{erlang,open_port,[{spawn,"expat_erl"},[binary]],[]},{xml_stream,new,2,[{file,"src/xml_stream.erl"},{line,182}]},{ejabberd_http_ws,parse,2,[{file,"src/ejabberd_http_ws.erl"},{line,312}]},{ejabberd_http_ws,handle_info,3,[{file,"src/ejabberd_http_ws.erl"},{line,213}]},{gen_fsm,handle_msg,7,[{file,"gen_fsm.erl"},{line,505}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]} in context child_terminated
2015-04-22 06:05:34.661 [error] <0.32296.0>@ejabberd_listener:accept:316 (#Port<0.37951>) Failed TCP accept: system_limit
2015-04-22 06:05:34.758 [error] <0.22101.3> gen_fsm <0.22101.3> in state loop terminated with reason: {system_limit,[{erlang,open_port,[{spawn,"expat_erl"},[binary]],[]},{xml_stream,new,2,[{file,"src/xml_stream.erl"},{line,182}]},{ejabberd_http_ws,parse,2,[{file,"src/ejabberd_http_ws.erl"},{line,312}]},{ejabberd_http_ws,handle_info,3,[{file,"src/ejabberd_http_ws.erl"},{line,213}]},{gen_fsm,handle_msg,7,[{file,"gen_fsm.erl"},{line,505}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}
2015-04-22 06:05:34.758 [error] <0.22101.3> CRASH REPORT Process <0.22101.3> with 1 neighbours exited with reason: {system_limit,[{erlang,open_port,[{spawn,"expat_erl"},[binary]],[]},{xml_stream,new,2,[{file,"src/xml_stream.erl"},{line,182}]},{ejabberd_http_ws,parse,2,[{file,"src/ejabberd_http_ws.erl"},{line,312}]},{ejabberd_http_ws,handle_info,3,[{file,"src/ejabberd_http_ws.erl"},{line,213}]},{gen_fsm,handle_msg,7,[{file,"gen_fsm.erl"},{line,505}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]} in gen_fsm:terminate/7 line 622
The system limits are set to:
Limit                     Soft Limit    Hard Limit    Units
Max cpu time              unlimited     unlimited     seconds
Max file size             unlimited     unlimited     bytes
Max data size             unlimited     unlimited     bytes
Max stack size            8388608       unlimited     bytes
Max core file size        0             unlimited     bytes
Max resident set          unlimited     unlimited     bytes
Max processes             15845         15845         processes
Max open files            40960         40960         files
Max locked memory         65536         65536         bytes
Max address space         unlimited     unlimited     bytes
Max file locks            unlimited     unlimited     locks
Max pending signals       15845         15845         signals
Max msgqueue size         819200        819200        bytes
Max nice priority         0             0
Max realtime priority     0             0
Max realtime timeout      unlimited     unlimited     us
Apparently (judging by the references to "spawn" and "proc_lib" in the crash log) it hits the process limit? But that is still confusing, because the system "max processes" limit is lower than the number of connections actually accepted. So, is this kind of crash really related to the system "max processes" setting, or is that pure coincidence?
Another important question, related to single-node capacity: how can we limit the number of client connections accepted by a particular node? The scenario we have in mind is to allow about 10K client connections per node; once this limit is reached, the node should refuse further new client connections while still responding normally to other connection types (s2s, xml-rpc, etc.). We presumably cannot achieve this behavior simply by setting ERL_MAX_PORTS, right?
You probably hit the Erlang process limit, not the Linux system limits. You should increase the number of allowed processes, either by editing ejabberdctl.cfg or by using the erl command-line +P parameter (depending on how you launch ejabberd).
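
For what it's worth, the trace fails inside erlang:open_port/2, and that call raises system_limit when the emulator has no free ports left, so the Erlang port limit is worth checking alongside the process limit. Here is a rough sketch of the relevant ejabberdctl.cfg settings, assuming the stock variable names ERL_PROCESSES and ERL_MAX_PORTS; the values are illustrative assumptions, not recommendations:

```shell
# ejabberdctl.cfg (read by the ejabberdctl script at startup)
# Values below are illustrative assumptions; size them for your own load.

# Maximum number of Erlang processes (handed to erl as +P).
ERL_PROCESSES=500000

# Maximum number of simultaneously open Erlang ports.
# Per the trace above, each WebSocket client holds at least two ports:
# its TCP socket and the expat_erl parser port opened by xml_stream:new/2.
ERL_MAX_PORTS=65536
```

If your Erlang/OTP release does not honor ERL_MAX_PORTS, the erl +Q flag sets the same limit. To see which limit is actually being approached, you can attach a shell to the running node and compare erlang:system_info(port_count) with erlang:system_info(port_limit), and erlang:system_info(process_count) with erlang:system_info(process_limit).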