I am seeing some very strange behavior in our production environment. We have a cluster of 3 machines with about 1200 sessions. Our config is fairly minimal with a handful of custom modules.
CPU usage seems to be pinned at one full core on each of our machines. Running etop, I found that one of the ejabberd_mod_pubsub processes has a message queue that seems to be growing indefinitely. I looked at the message queue using process_info and found that it was full of {presence, JID, Pid} and {presence, User, Server, Resources, JID} messages. What's stranger is that presence subscriptions and updates work fine in our client: as people log on and off, presence updates are sent to clients.
A stack trace of that process, obtained with process_info, showed:
[{dets,req,2,[{file,"dets.erl"},{line,1245}]},
{dets,chunk_match,2,[{file,"dets.erl"},{line,1017}]},
{dets,do_safe_match,2,[{file,"dets.erl"},{line,982}]},
{dets,match_object,2,[{file,"dets.erl"},{line,571}]},
{mnesia_lib,catch_match_object,3,
[{file,"mnesia_lib.erl"},{line,1094}]},
{mnesia_lib,db_match_object,3,
[{file,"mnesia_lib.erl"},{line,1086}]},
{rpc,local_call,3,[{file,"rpc.erl"},{line,329}]},
{mnesia,do_dirty_rpc,5,[{file,"mnesia.erl"},{line,1807}]}]
The initial call is {mod_pubsub,send_loop,1}.
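For anyone who wants to repeat the inspection, here is a rough sketch of the kind of calls involved. The module name queue_summary and the functions count_tags/1 and inspect/1 are made up for illustration; only erlang:process_info/2 and the items passed to it are standard. It assumes OTP 19 or later (for maps:update_with/4) and that the pid is local to the node where it is run.

```erlang
-module(queue_summary).
-export([count_tags/1, inspect/1]).

%% Group messages by {Tag, Size}, so that {presence, JID, Pid} and
%% {presence, User, Server, Resources, JID} are counted separately.
count_tags(Messages) ->
    lists:foldl(fun(Msg, Acc) ->
                        maps:update_with(key(Msg), fun(N) -> N + 1 end, 1, Acc)
                end, #{}, Messages).

%% Hypothetical helper: non-tuple messages are lumped under 'other'.
key(Msg) when is_tuple(Msg), tuple_size(Msg) > 0 ->
    {element(1, Msg), tuple_size(Msg)};
key(_Other) ->
    other.

%% Pid must be a local process. Fetching 'messages' copies the whole
%% queue into the calling process, so this is costly on a long queue.
inspect(Pid) when is_pid(Pid) ->
    Items = [message_queue_len, messages, current_stacktrace, initial_call],
    case erlang:process_info(Pid, Items) of
        undefined ->
            %% The process has already exited.
            undefined;
        Info ->
            {messages, Msgs} = lists:keyfind(messages, 1, Info),
            [{queue_by_tag, count_tags(Msgs)}
             | lists:keydelete(messages, 1, Info)]
    end.
```

Calling queue_summary:inspect(Pid) with the pid of the suspect process returns the queue length, the current stack trace, the initial call, and a per-tag count of the queued messages in one list.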
If this process weren't actually draining and processing messages, I would expect presence updates not to reach the client at all. In fact, it almost seems like there are two send_loop processes running, with one of them functioning normally and the other not functioning at all. Sometimes I see in etop that a second mod_pubsub process has items in its queue, but they're drained by the next tick, which leads me to further believe that there's some phantom process running that just has an ever-growing queue.
Any insight at all would be much appreciated.