Hi guys,
hope someone can help me with the following problem.
We have experienced multiple crashes of Ejabberd (2.1.5) recently. Analyzing the crash dumps showed some misbehaving c2s processes with enormously large message queues.
Scenario:
Trying to reproduce this situation on our internal environment led me to the following scenario:
I have 500 test bots constantly logging in, entering a single MUC room and leaving again. I've also created a 'misbehaving' client that stays in the MUC room the whole time and sends a message once in a while (so it doesn't get kicked for inactivity). The important thing is that this client only sends data; after entering the room it never reads from its socket (no recv at all). In short, I emulate the case where a consumer processes data much more slowly than the producers generate it.
As expected, taking a dump after a few minutes shows ~12K messages in this client's process mailbox. So, either intentionally or because of some network issues, we have a similar misbehaving client on the real deployment. The only solution I've found for this is the max_fsm_queue option, which (as stated in the documentation) should terminate the client process if its message queue exceeds the limit.
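For reference, I set the option at the top level of ejabberd.cfg roughly like this (200 is just my test limit, adjust as needed):

    %% ejabberd.cfg (ejabberd 2.1.x, Erlang-terms config)
    %% Terminate an FSM process when its message queue exceeds this many messages.
    {max_fsm_queue, 200}.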
Problem:
Unfortunately, max_fsm_queue doesn't work at all in my case. The misbehaving client was never kicked, right up until I got bored and collected a dump. With the limit set to 200, the client had more than 6K messages in its queue.
Reading through the code of p1_fsm, I found the code that does the queue size checks (p1_fsm:message_queue_len/2) and added logging there. Running the test again gave unexpected results: the check was called only a few times, and then nothing.
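As far as I can tell, the check boils down to something like this (a simplified sketch of the idea only; the names and exit reason are approximate, not the exact p1_fsm code):

    %% Simplified sketch: inspect our own mailbox and bail out if it is too big.
    check_queue_len(MaxQueue) when is_integer(MaxQueue), MaxQueue > 0 ->
        {message_queue_len, Len} = erlang:process_info(self(), message_queue_len),
        if
            Len >= MaxQueue ->
                %% approximate exit reason, the real one may differ
                exit({process_message_queue_too_large, Len});
            true ->
                ok
        end;
    check_queue_len(_) ->
        ok.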
Bottom line:
So, from my understanding, there's an architectural flaw in the p1_fsm queue check logic: the FSM validates its own queue size in its message processing loop. This means:
- The queue size limit can't be strictly enforced
- When the FSM's message processing logic blocks, the queue validation will not run until the FSM returns to its idle loop.
In the case of the c2s FSM, when the client does not recv for a while, the FSM blocks on writing to the socket and completely loses control over queue growth. Everything ends with an out-of-memory crash.
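To make it concrete, the pattern looks roughly like this (illustrative only, not the real c2s code; check_queue_len is the sketch from above):

    %% The queue check only runs between messages, right before the next receive.
    loop(Socket, MaxQueue) ->
        check_queue_len(MaxQueue),
        receive
            {send_text, Data} ->
                %% If the client never reads, gen_tcp:send/2 eventually blocks
                %% here once the socket buffers fill up, so we never get back
                %% to the check above and the mailbox keeps growing unchecked.
                ok = gen_tcp:send(Socket, Data),
                loop(Socket, MaxQueue)
        end.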
I have some ideas about how to solve it, but I would greatly appreciate any help from you; maybe I'm missing something.
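For example, one direction that comes to mind (just a rough sketch, the names are mine) is to move the check out of the FSM into a separate watchdog process that polls the c2s process from the outside, so it keeps working even while the FSM itself is blocked:

    %% Hypothetical sketch: poll another process's mailbox size and kill it
    %% when it exceeds the limit. Works even if Pid is stuck in gen_tcp:send/2.
    watchdog(Pid, MaxQueue) ->
        case erlang:process_info(Pid, message_queue_len) of
            {message_queue_len, Len} when Len > MaxQueue ->
                exit(Pid, kill);
            {message_queue_len, _} ->
                timer:sleep(1000),
                watchdog(Pid, MaxQueue);
            undefined ->
                ok  %% the process is already gone
        end.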
Best regards,
Sergii