strange loop - ejabberd_router

Hi,
I experience strange loop in my system behavior:
- presence messages are routed by ejabberd_router
- then drop to ejabberd_sm
- and again I see ejabberd_router

I have 3 nodes, my router table has 3 entries.
What I know comes from dbg over user's process which hits database hundred times more often than regular.

I do not see any other activity for this process than ejabberd_router & ejabberd_sm. (side question - how to discover Pid of a reciever of the c2s?)

process_info on super active user shows empty queue:

[{current_function,{gen,do_call,4}},
{initial_call,{proc_lib,init_p,5}},
{status,waiting},
{message_queue_len,2},
{messages,[{'$gen_event',closed},
{'DOWN',#Ref<0.0.1179.161926>,process,<0.4679.1813>,
normal}]},
{links,[<0.306.0>]},
{dictionary,[{'$internal_queue_len',0},
{'$ancestors',[ejabberd_c2s_sup,ejabberd_sup,<0.37.0>]},
{'$initial_call',{gen,init_it,6}}]},
{trap_exit,false},
{error_handler,error_handler},
{priority,normal},
{group_leader,<0.36.0>},
{total_heap_size,17711},
{heap_size,6765},
{stack_size,63},
{reductions,10779599595},
{garbage_collection,[{min_bin_vheap_size,46368},
{min_heap_size,233},
{fullsweep_after,4},
{minor_gcs,3}]},
{suspending,[]}]]

The data flow is:
ejabberd_router:route
ejabberd_router:do_route
ejabberd_sm:route

and again _router..

How to determine if those are new messages or some loop in the code?

Reductions are going up if I query again and again. Queue is stable.

If I try to ask c2s for something - I got timeout. Only exit(C2S, kill) exits the process.

I see in logs lots of:

=ERROR REPORT==== 2011-05-05 23:37:17 ===
E(<0.7292.2960>:ejabberd_c2s:1154) : Unexpected info: {'DOWN',
#Ref<0.0.1439.188998>,
process,
<10545.13676.2940>,
killed}

how to decipher this error? Socket on remote node was closed?
How can I tell why socket/connection was closed?

Thanks a lot for any help,

Dzej

dzejefkej wrote: how to

dzejefkej wrote:

how to discover Pid of a reciever of the c2s?)

Maybe I didn't understand your question. ejabberd creates two processes per XMPP session (one for sm and one for c2s). In this example, the processes are 0.483.0 and 0.484.0:

=INFO REPORT==== 11-May-2011::00:44:10 ===
I(<0.476.0>:ejabberd_listener:281) : (#Port<0.5268>)
Accepted connection {{127,0,0,1},59334} -> {{127,0,0,1},5222}

=INFO REPORT==== 11-May-2011::00:44:10 ===
I(<0.484.0>:ejabberd_c2s:517) : ({socket_state,gen_tcp,#Port<0.5268>,<0.483.0>})
Accepted legacy authentication for badlop@localhost/work by ejabberd_auth_internal
dzejefkej wrote:

The data flow is:
ejabberd_router:route
ejabberd_router:do_route
ejabberd_sm:route

and again _router..

How to determine if those are new messages or some loop in the code?

Check if the function is called with the same arguments several times:

--- a/src/ejabberd_router.erl
+++ b/src/ejabberd_router.erl
@@ -65,6 +65,7 @@ start_link() ->
 
 
 route(From, To, Packet) ->
+    ?INFO_MSG("ejabberd_router:route:~nFrom: ~p~nTo: ~p~nPacket: ~p", [ From, To, Packet]),
     case catch do_route(From, To, Packet) of
        {'EXIT', Reason} ->
            ?ERROR_MSG("~p~nwhen processing: ~p",
dzejefkej wrote:

=ERROR REPORT==== 2011-05-05 23:37:17 ===
E(<0.7292.2960>:ejabberd_c2s:1154) : Unexpected info: {'DOWN',
#Ref<0.0.1439.188998>,
process,
<10545.13676.2940>,
killed}

A process number 0.XXX.0 means that the process lives in the local node where you see that number. A number like 123.XXX.0 means that the process lives in a remote node of the cluster.

Maybe ejabberd got confused with the routing scheme, and is sending a XMPP packet from one node to the other, and back to the first, and to the second... until some process dies or timeouts?

strange loop

badlop wrote:
dzejefkej wrote:

how to discover Pid of a reciever of the c2s?)

Maybe I didn't understand your question. ejabberd creates two processes per XMPP session (one for sm and one for c2s). In this example, the processes are 0.483.0 and 0.484.0:

Your right, but that info comes from INFO debbug level. I cannot turn it on on production.
Is there any other way to discover Pid of the reciever belonging to c2s?
- get_status does not work (c2s hangs in call - thats mysterious also btw.)
- process_info on pid which is lower by 1 integer 483 and 484 returns undefined, however under heavy traffic there is no guarantee that those pids diff by 1.

badlop wrote:
dzejefkej wrote:

The data flow is:
ejabberd_router:route
ejabberd_router:do_route
ejabberd_sm:route

and again _router..

How to determine if those are new messages or some loop in the code?

Check if the function is called with the same arguments several times:

--- a/src/ejabberd_router.erl
+++ b/src/ejabberd_router.erl
@@ -65,6 +65,7 @@ start_link() ->
 
 
 route(From, To, Packet) ->
+    ?INFO_MSG("ejabberd_router:route:~nFrom: ~p~nTo: ~p~nPacket: ~p", [ From, To, Packet]),
     case catch do_route(From, To, Packet) of
        {'EXIT', Reason} ->
            ?ERROR_MSG("~p~nwhen processing: ~p",

I attached DBG on production to malfunctioning process. It is all the time the same presence unavailable stanza
with From = To. I traced calls and messages on _router, _local, _sm, _c2s, _roster. Here is the result, sometimes instead of presence I see unsubscribe. Couple sec of dbg generates ~50mb of trace :/

(<6567.1465.114>) << {route,
                         {jid,"test43",
                             "example.test.com","mobile",
                             "test43",
                             "example.test.com","mobile"},
                         {jid,"test43",
                             "example.test.com","mobile",
                             "test43",
                             "example.test.com","mobile"},
                         {xmlelement,"presence",[{"type","unavailable"}],[]}}
(<6567.1465.114>) call ejabberd_router:route({jid,"test43","example.test.com","mobile",
     "test43","example.test.com","mobile"},{jid,"test43","example.test.com","mobile",
     "test43","example.test.com","mobile"},{xmlelement,"presence",[{"type","unavailable"}],[]})
(<6567.1465.114>) call ejabberd_local:route({jid,"test43","example.test.com","mobile",
     "test43","example.test.com","mobile"},{jid,"test43","example.test.com","mobile",
     "test43","example.test.com","mobile"},{xmlelement,"presence",[{"type","unavailable"}],[]})
(<6567.1465.114>) call ejabberd_sm:route({jid,"test43","example.test.com","mobile",
     "test43","example.test.com","mobile"},{jid,"test43","example.test.com","mobile",
     "test43","example.test.com","mobile"},{xmlelement,"presence",[{"type","unavailable"}],[]})
(<6567.1465.114>) <6567.1465.114> ! {route,
                                     {jid,"test43",
                                      "example.test.com","mobile",
                                      "test43",
                                      "example.test.com","mobile"},
                                     {jid,"test43",
                                      "example.test.com","mobile",
                                      "test43",
                                      "example.test.com","mobile"},
                                     {xmlelement,"presence",
                                      [{"type","unavailable"}],
                                      []}}
(<6567.1465.114>) << {route,
                         {jid,"test43",
                             "example.test.com","mobile",
                             "test43",
                             "example.test.com","mobile"},
                         {jid,"test43",
                             "example.test.com","mobile",
                             "test43",
                             "example.test.com","mobile"},
                         {xmlelement,"presence",[{"type","unavailable"}],[]}}

So the SM asks for PID of To - it gets it, because it is in session table. It routs the presence to the PID and gets it back from router. Is there any mechanism to prevent such a behavior when From = To ? Also.. I do not see c2s here in dbg, so even if it hangs in 'call' why it does not time out?

badlop wrote:
dzejefkej wrote:

=ERROR REPORT==== 2011-05-05 23:37:17 ===
E(<0.7292.2960>:ejabberd_c2s:1154) : Unexpected info: {'DOWN',
#Ref<0.0.1439.188998>,
process,
<10545.13676.2940>,
killed}

A process number 0.XXX.0 means that the process lives in the local node where you see that number. A number like 123.XXX.0 means that the process lives in a remote node of the cluster.

Maybe ejabberd got confused with the routing scheme, and is sending a XMPP packet from one node to the other, and back to the first, and to the second... until some process dies or timeouts?

Yeah, I got the pid structure, for now it looks like completely unrelated to 'loop'. Still I do not like correlation in time of those 2 issues. How to investigate routing scheme? How would you look for a reason to kill this local process?
I thought that socket may be closed, this kills receiver, which kills c2s and eventually affects the other side of conversation.

appreciate any hints on that..

strange loop

I think that {ejabberd_c2s,bounce_messages,0} is a root of my problems. It is executed endlessly.
It is because session is not removed from session table.. but why?;)

re

dzejefkej wrote:

- get_status does not work (c2s hangs in call - thats mysterious also btw.)

Try:
ejabberd_sm:get_session_pid("badlop", "localhost", "home").

Syndicate content