Looking for ejabberd docs?

To access the most up-to-date ejabberd documentation, please visit docs.ejabberd.im »

strange loop - ejabberd_router

Submitted by dzejefkej on Thu, 2011-05-05 23:45

ejabberd Development

Hi,
I experience strange loop in my system behavior:
- presence messages are routed by ejabberd_router
- then drop to ejabberd_sm
- and again I see ejabberd_router

I have 3 nodes, my router table has 3 entries.
What I know comes from dbg over user's process which hits database hundred times more often than regular.

I do not see any other activity for this process than ejabberd_router & ejabberd_sm. (side question - how to discover Pid of a reciever of the c2s?)

process_info on super active user shows empty queue:

[{current_function,{gen,do_call,4}},
{initial_call,{proc_lib,init_p,5}},
{status,waiting},
{message_queue_len,2},
{messages,[{'$gen_event',closed},
{'DOWN',#Ref<0.0.1179.161926>,process,<0.4679.1813>,
normal}]},
{links,[<0.306.0>]},
{dictionary,[{'$internal_queue_len',0},
{'$ancestors',[ejabberd_c2s_sup,ejabberd_sup,<0.37.0>]},
{'$initial_call',{gen,init_it,6}}]},
{trap_exit,false},
{error_handler,error_handler},
{priority,normal},
{group_leader,<0.36.0>},
{total_heap_size,17711},
{heap_size,6765},
{stack_size,63},
{reductions,10779599595},
{garbage_collection,[{min_bin_vheap_size,46368},
{min_heap_size,233},
{fullsweep_after,4},
{minor_gcs,3}]},
{suspending,[]}]]

The data flow is:
ejabberd_router:route
ejabberd_router:do_route
ejabberd_sm:route

and again _router..

How to determine if those are new messages or some loop in the code?

Reductions are going up if I query again and again. Queue is stable.

If I try to ask c2s for something - I got timeout. Only exit(C2S, kill) exits the process.

I see in logs lots of:

=ERROR REPORT==== 2011-05-05 23:37:17 ===
E(<0.7292.2960>:ejabberd_c2s:1154) : Unexpected info: {'DOWN',
#Ref<0.0.1439.188998>,
process,
<10545.13676.2940>,
killed}

how to decipher this error? Socket on remote node was closed?
How can I tell why socket/connection was closed?

Thanks a lot for any help,

Dzej

dzejefkej wrote: how to

Submitted by mfoss on Tue, 2011-05-10 22:53.

dzejefkej wrote:

how to discover Pid of a reciever of the c2s?)

Maybe I didn't understand your question. ejabberd creates two processes per XMPP session (one for sm and one for c2s). In this example, the processes are 0.483.0 and 0.484.0:

=INFO REPORT==== 11-May-2011::00:44:10 ===
I(<0.476.0>:ejabberd_listener:281) : (#Port<0.5268>)
Accepted connection {{127,0,0,1},59334} -> {{127,0,0,1},5222}

=INFO REPORT==== 11-May-2011::00:44:10 ===
I(<0.484.0>:ejabberd_c2s:517) : ({socket_state,gen_tcp,#Port<0.5268>,<0.483.0>})
Accepted legacy authentication for badlop@localhost/work by ejabberd_auth_internal

dzejefkej wrote:

The data flow is:
ejabberd_router:route
ejabberd_router:do_route
ejabberd_sm:route

and again _router..

How to determine if those are new messages or some loop in the code?

Check if the function is called with the same arguments several times:

--- a/src/ejabberd_router.erl
+++ b/src/ejabberd_router.erl
@@ -65,6 +65,7 @@ start_link() ->
 
 
 route(From, To, Packet) ->
+    ?INFO_MSG("ejabberd_router:route:~nFrom: ~p~nTo: ~p~nPacket: ~p", [ From, To, Packet]),
     case catch do_route(From, To, Packet) of
        {'EXIT', Reason} ->
            ?ERROR_MSG("~p~nwhen processing: ~p",

dzejefkej wrote:

=ERROR REPORT==== 2011-05-05 23:37:17 ===
E(<0.7292.2960>:ejabberd_c2s:1154) : Unexpected info: {'DOWN',
#Ref<0.0.1439.188998>,
process,
<10545.13676.2940>,
killed}

A process number 0.XXX.0 means that the process lives in the local node where you see that number. A number like 123.XXX.0 means that the process lives in a remote node of the cluster.

Maybe ejabberd got confused with the routing scheme, and is sending a XMPP packet from one node to the other, and back to the first, and to the second... until some process dies or timeouts?

strange loop

Submitted by dzejefkej on Wed, 2011-05-11 19:34.

badlop wrote:

dzejefkej wrote:

how to discover Pid of a reciever of the c2s?)

Maybe I didn't understand your question. ejabberd creates two processes per XMPP session (one for sm and one for c2s). In this example, the processes are 0.483.0 and 0.484.0:

Your right, but that info comes from INFO debbug level. I cannot turn it on on production.
Is there any other way to discover Pid of the reciever belonging to c2s?
- get_status does not work (c2s hangs in call - thats mysterious also btw.)
- process_info on pid which is lower by 1 integer 483 and 484 returns undefined, however under heavy traffic there is no guarantee that those pids diff by 1.

badlop wrote:

dzejefkej wrote:

The data flow is:
ejabberd_router:route
ejabberd_router:do_route
ejabberd_sm:route

and again _router..

How to determine if those are new messages or some loop in the code?

Check if the function is called with the same arguments several times:

--- a/src/ejabberd_router.erl
+++ b/src/ejabberd_router.erl
@@ -65,6 +65,7 @@ start_link() ->
 
 
 route(From, To, Packet) ->
+    ?INFO_MSG("ejabberd_router:route:~nFrom: ~p~nTo: ~p~nPacket: ~p", [ From, To, Packet]),
     case catch do_route(From, To, Packet) of
        {'EXIT', Reason} ->
            ?ERROR_MSG("~p~nwhen processing: ~p",

I attached DBG on production to malfunctioning process. It is all the time the same presence unavailable stanza
with From = To. I traced calls and messages on _router, _local, _sm, _c2s, _roster. Here is the result, sometimes instead of presence I see unsubscribe. Couple sec of dbg generates ~50mb of trace :/

(<6567.1465.114>) << {route,
                         {jid,"test43",
                             "example.test.com","mobile",
                             "test43",
                             "example.test.com","mobile"},
                         {jid,"test43",
                             "example.test.com","mobile",
                             "test43",
                             "example.test.com","mobile"},
                         {xmlelement,"presence",[{"type","unavailable"}],[]}}
(<6567.1465.114>) call ejabberd_router:route({jid,"test43","example.test.com","mobile",
     "test43","example.test.com","mobile"},{jid,"test43","example.test.com","mobile",
     "test43","example.test.com","mobile"},{xmlelement,"presence",[{"type","unavailable"}],[]})
(<6567.1465.114>) call ejabberd_local:route({jid,"test43","example.test.com","mobile",
     "test43","example.test.com","mobile"},{jid,"test43","example.test.com","mobile",
     "test43","example.test.com","mobile"},{xmlelement,"presence",[{"type","unavailable"}],[]})
(<6567.1465.114>) call ejabberd_sm:route({jid,"test43","example.test.com","mobile",
     "test43","example.test.com","mobile"},{jid,"test43","example.test.com","mobile",
     "test43","example.test.com","mobile"},{xmlelement,"presence",[{"type","unavailable"}],[]})
(<6567.1465.114>) <6567.1465.114> ! {route,
                                     {jid,"test43",
                                      "example.test.com","mobile",
                                      "test43",
                                      "example.test.com","mobile"},
                                     {jid,"test43",
                                      "example.test.com","mobile",
                                      "test43",
                                      "example.test.com","mobile"},
                                     {xmlelement,"presence",
                                      [{"type","unavailable"}],
                                      []}}
(<6567.1465.114>) << {route,
                         {jid,"test43",
                             "example.test.com","mobile",
                             "test43",
                             "example.test.com","mobile"},
                         {jid,"test43",
                             "example.test.com","mobile",
                             "test43",
                             "example.test.com","mobile"},
                         {xmlelement,"presence",[{"type","unavailable"}],[]}}

So the SM asks for PID of To - it gets it, because it is in session table. It routs the presence to the PID and gets it back from router. Is there any mechanism to prevent such a behavior when From = To ? Also.. I do not see c2s here in dbg, so even if it hangs in 'call' why it does not time out?

badlop wrote:

dzejefkej wrote:

=ERROR REPORT==== 2011-05-05 23:37:17 ===
E(<0.7292.2960>:ejabberd_c2s:1154) : Unexpected info: {'DOWN',
#Ref<0.0.1439.188998>,
process,
<10545.13676.2940>,
killed}

A process number 0.XXX.0 means that the process lives in the local node where you see that number. A number like 123.XXX.0 means that the process lives in a remote node of the cluster.

Maybe ejabberd got confused with the routing scheme, and is sending a XMPP packet from one node to the other, and back to the first, and to the second... until some process dies or timeouts?

Yeah, I got the pid structure, for now it looks like completely unrelated to 'loop'. Still I do not like correlation in time of those 2 issues. How to investigate routing scheme? How would you look for a reason to kill this local process?
I thought that socket may be closed, this kills receiver, which kills c2s and eventually affects the other side of conversation.

appreciate any hints on that..

strange loop

Submitted by dzejefkej on Thu, 2011-05-12 21:53.

I think that {ejabberd_c2s,bounce_messages,0} is a root of my problems. It is executed endlessly.
It is because session is not removed from session table.. but why?;)

re

Submitted by mfoss on Sat, 2011-05-14 10:17.

dzejefkej wrote:

- get_status does not work (c2s hangs in call - thats mysterious also btw.)

Try:
ejabberd_sm:get_session_pid("badlop", "localhost", "home").

Looking for ejabberd docs?

strange loop - ejabberd_router

dzejefkej wrote: how to

strange loop

strange loop

re

User login

Navigation