Hi,
I experience strange loop in my system behavior:
- presence messages are routed by ejabberd_router
- then drop to ejabberd_sm
- and again I see ejabberd_router
I have 3 nodes, my router table has 3 entries.
What I know comes from dbg over user's process which hits database hundred times more often than regular.
I do not see any other activity for this process than ejabberd_router & ejabberd_sm. (side question - how to discover Pid of a reciever of the c2s?)
process_info on super active user shows empty queue:
[{current_function,{gen,do_call,4}},
{initial_call,{proc_lib,init_p,5}},
{status,waiting},
{message_queue_len,2},
{messages,[{'$gen_event',closed},
{'DOWN',#Ref<0.0.1179.161926>,process,<0.4679.1813>,
normal}]},
{links,[<0.306.0>]},
{dictionary,[{'$internal_queue_len',0},
{'$ancestors',[ejabberd_c2s_sup,ejabberd_sup,<0.37.0>]},
{'$initial_call',{gen,init_it,6}}]},
{trap_exit,false},
{error_handler,error_handler},
{priority,normal},
{group_leader,<0.36.0>},
{total_heap_size,17711},
{heap_size,6765},
{stack_size,63},
{reductions,10779599595},
{garbage_collection,[{min_bin_vheap_size,46368},
{min_heap_size,233},
{fullsweep_after,4},
{minor_gcs,3}]},
{suspending,[]}]]
The data flow is:
ejabberd_router:route
ejabberd_router:do_route
ejabberd_sm:route
and again _router..
How to determine if those are new messages or some loop in the code?
Reductions are going up if I query again and again. Queue is stable.
If I try to ask c2s for something - I got timeout. Only exit(C2S, kill) exits the process.
I see in logs lots of:
=ERROR REPORT==== 2011-05-05 23:37:17 ===
E(<0.7292.2960>:ejabberd_c2s:1154) : Unexpected info: {'DOWN',
#Ref<0.0.1439.188998>,
process,
<10545.13676.2940>,
killed}
how to decipher this error? Socket on remote node was closed?
How can I tell why socket/connection was closed?
Thanks a lot for any help,
Dzej
dzejefkej wrote: how to
how to discover Pid of a reciever of the c2s?)
Maybe I didn't understand your question. ejabberd creates two processes per XMPP session (one for sm and one for c2s). In this example, the processes are 0.483.0 and 0.484.0:
The data flow is:
ejabberd_router:route
ejabberd_router:do_route
ejabberd_sm:route
and again _router..
How to determine if those are new messages or some loop in the code?
Check if the function is called with the same arguments several times:
=ERROR REPORT==== 2011-05-05 23:37:17 ===
E(<0.7292.2960>:ejabberd_c2s:1154) : Unexpected info: {'DOWN',
#Ref<0.0.1439.188998>,
process,
<10545.13676.2940>,
killed}
A process number 0.XXX.0 means that the process lives in the local node where you see that number. A number like 123.XXX.0 means that the process lives in a remote node of the cluster.
Maybe ejabberd got confused with the routing scheme, and is sending a XMPP packet from one node to the other, and back to the first, and to the second... until some process dies or timeouts?
strange loop
how to discover Pid of a reciever of the c2s?)
Maybe I didn't understand your question. ejabberd creates two processes per XMPP session (one for sm and one for c2s). In this example, the processes are 0.483.0 and 0.484.0:
Your right, but that info comes from INFO debbug level. I cannot turn it on on production.
Is there any other way to discover Pid of the reciever belonging to c2s?
- get_status does not work (c2s hangs in call - thats mysterious also btw.)
- process_info on pid which is lower by 1 integer 483 and 484 returns undefined, however under heavy traffic there is no guarantee that those pids diff by 1.
The data flow is:
ejabberd_router:route
ejabberd_router:do_route
ejabberd_sm:route
and again _router..
How to determine if those are new messages or some loop in the code?
Check if the function is called with the same arguments several times:
I attached DBG on production to malfunctioning process. It is all the time the same presence unavailable stanza
with From = To. I traced calls and messages on _router, _local, _sm, _c2s, _roster. Here is the result, sometimes instead of presence I see unsubscribe. Couple sec of dbg generates ~50mb of trace :/
(<6567.1465.114>) << {route,
{jid,"test43",
"example.test.com","mobile",
"test43",
"example.test.com","mobile"},
{jid,"test43",
"example.test.com","mobile",
"test43",
"example.test.com","mobile"},
{xmlelement,"presence",[{"type","unavailable"}],[]}}
(<6567.1465.114>) call ejabberd_router:route({jid,"test43","example.test.com","mobile",
"test43","example.test.com","mobile"},{jid,"test43","example.test.com","mobile",
"test43","example.test.com","mobile"},{xmlelement,"presence",[{"type","unavailable"}],[]})
(<6567.1465.114>) call ejabberd_local:route({jid,"test43","example.test.com","mobile",
"test43","example.test.com","mobile"},{jid,"test43","example.test.com","mobile",
"test43","example.test.com","mobile"},{xmlelement,"presence",[{"type","unavailable"}],[]})
(<6567.1465.114>) call ejabberd_sm:route({jid,"test43","example.test.com","mobile",
"test43","example.test.com","mobile"},{jid,"test43","example.test.com","mobile",
"test43","example.test.com","mobile"},{xmlelement,"presence",[{"type","unavailable"}],[]})
(<6567.1465.114>) <6567.1465.114> ! {route,
{jid,"test43",
"example.test.com","mobile",
"test43",
"example.test.com","mobile"},
{jid,"test43",
"example.test.com","mobile",
"test43",
"example.test.com","mobile"},
{xmlelement,"presence",
[{"type","unavailable"}],
[]}}
(<6567.1465.114>) << {route,
{jid,"test43",
"example.test.com","mobile",
"test43",
"example.test.com","mobile"},
{jid,"test43",
"example.test.com","mobile",
"test43",
"example.test.com","mobile"},
{xmlelement,"presence",[{"type","unavailable"}],[]}}
So the SM asks for PID of To - it gets it, because it is in session table. It routs the presence to the PID and gets it back from router. Is there any mechanism to prevent such a behavior when From = To ? Also.. I do not see c2s here in dbg, so even if it hangs in 'call' why it does not time out?
=ERROR REPORT==== 2011-05-05 23:37:17 ===
E(<0.7292.2960>:ejabberd_c2s:1154) : Unexpected info: {'DOWN',
#Ref<0.0.1439.188998>,
process,
<10545.13676.2940>,
killed}
A process number 0.XXX.0 means that the process lives in the local node where you see that number. A number like 123.XXX.0 means that the process lives in a remote node of the cluster.
Maybe ejabberd got confused with the routing scheme, and is sending a XMPP packet from one node to the other, and back to the first, and to the second... until some process dies or timeouts?
Yeah, I got the pid structure, for now it looks like completely unrelated to 'loop'. Still I do not like correlation in time of those 2 issues. How to investigate routing scheme? How would you look for a reason to kill this local process?
I thought that socket may be closed, this kills receiver, which kills c2s and eventually affects the other side of conversation.
appreciate any hints on that..
strange loop
I think that {ejabberd_c2s,bounce_messages,0} is a root of my problems. It is executed endlessly.
It is because session is not removed from session table.. but why?;)
re
- get_status does not work (c2s hangs in call - thats mysterious also btw.)
Try:
ejabberd_sm:get_session_pid("badlop", "localhost", "home").