We have two clustered Linux machines running ejabberd-2.1.6. When two different clients connect to two different nodes at the same moment and create the same room, strange things can happen. Sometimes it works as expected. Sometimes both clients receive something like:
After that, one client can send to the MUC, while the other receives this when it sends:
<message xml:lang="en" type="error"
         to="nagios1.tester@chat.damballah.lindenlab.com/nagios"
         from="nagios-test-group@conference.chat.damballah.lindenlab.com">
  <body>NwLRbBMqbHcdARZoWkK</body>
  <nick xmlns="http://jabber.org/protocol/nick">nagios1.tester</nick>
  <error code="406" type="modify">
    <not-acceptable xmlns="urn:ietf:params:xml:ns:xmpp-stanzas"/>
    <text xmlns="urn:ietf:params:xml:ns:xmpp-stanzas">Only occupants are allowed to send messages to the conference</text>
  </error>
</message>
Both servers put something like this in ejabberd.log:
Is this a problem with how we've configured mnesia? Is it a bug? Any advice is welcome. I can provide the configuration and the client code, if it might help.
It may be a race as you thought, in mod_muc.erl line 474. The code is like this:
1. IF the room is not stored in the DB
2... THEN
3....... create the room
4....... and store the room in the DB
5... ELSE
6....... send this stanza to the existing room
In the code, the DB read in step 1 and the DB write in step 4 are not wrapped in a single atomic DB transaction. Since mod_muc on a node runs as a single process, the sequence is still effectively atomic within one node. With several nodes, however, there is probably no such guarantee, because each node runs its own mod_muc process.
The mnesia DB table used in that code is 'muc_online_room'. Have you configured that mnesia table to be shared among the nodes?
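Maybe the problem happens like this: the first node has passed the check in step 1 but has not yet stored the room (step 4) when the second node evaluates step 1, also finds no room, and enters the THEN branch too. Both nodes then create the room.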
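To make the suspected interleaving concrete, here is a small Python sketch. It is not ejabberd or mnesia code: RoomTable, racy_join and atomic_join are names made up for illustration, and a barrier is used to force the unlucky timing so the result is repeatable. The "atomic" variant only shows the general idea of doing the read and the write as one step; I am assuming, not asserting, that a real fix would look similar in spirit.

```python
import threading


class RoomTable:
    """Hypothetical stand-in for a shared table such as muc_online_room."""

    def __init__(self):
        self._rooms = {}
        self._lock = threading.Lock()

    def read(self, name):
        with self._lock:
            return self._rooms.get(name)

    def write(self, name, owner):
        with self._lock:
            self._rooms[name] = owner

    def create_if_absent(self, name, owner):
        # Read and write under one lock, so only one caller can win.
        with self._lock:
            return self._rooms.setdefault(name, owner)


def racy_join(table, name, node, barrier, created):
    existing = table.read(name)      # step 1: is the room stored?
    barrier.wait()                   # both nodes have read before either writes
    if existing is None:
        created.append(node)         # step 3: create the room
        table.write(name, node)      # step 4: store the room


def atomic_join(table, name, node, barrier, created):
    barrier.wait()                   # start both nodes at the same moment
    if table.create_if_absent(name, node) == node:
        created.append(node)         # only the winner creates the room


def run(join):
    """Run `join` from two simulated nodes; return the nodes that created a room."""
    table = RoomTable()
    barrier = threading.Barrier(2)
    created = []
    threads = [
        threading.Thread(
            target=join,
            args=(table, "nagios-test-group", node, barrier, created),
        )
        for node in ("node0", "node1")
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(created)
```

With racy_join, run() reports that both simulated nodes created the room, which matches the symptom of two room instances where each one only knows its own occupant. With atomic_join, only one node ever creates it.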
Does all of this make sense to you?
I've bundled up the program that can demonstrate this problem. On my Ubuntu box, this will get it built:
sudo apt-get install check libexpat1-dev
mkdir test
cd test
# see http://code.stanziq.com/strophe/
git clone git://code.stanziq.com/libstrophe
wget http://headache.hungry.com/~seth/xmpp-nagios-check.tar.bz2
tar xvf xmpp-nagios-check.tar.bz2
cd libstrophe
patch -p1 < ../xmpp-nagios-check/strophe.diff
./bootstrap.sh
./configure
make
cd ../xmpp-nagios-check
# edit xmpp-nagios-check.c and set USER* and PASS* at the top
make
./xmpp-nagios-check -H host0 -H host1 -j chat.something.com -v -r
It may take several runs to see the problem. If it works, it exits after about 10 seconds. If it fails (and demonstrates the race), it will keep running for a couple minutes.
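Since it can take several runs, a small wrapper can repeat the check and stop at the first run that hangs. This is only a sketch in Python: run_until_race is a made-up helper, and the 30 second limit is my assumption, chosen to sit between the roughly 10 second good run and the multi-minute bad run described above.

```python
import subprocess


def run_until_race(cmd, attempts=20, limit_seconds=30):
    """Run cmd up to `attempts` times.

    Returns the 1-based attempt number of the first run that exceeded
    limit_seconds (treated as the race showing up), or None if every
    run finished in time.
    """
    for attempt in range(1, attempts + 1):
        try:
            # subprocess.run kills the child and raises on timeout.
            subprocess.run(cmd, timeout=limit_seconds)
        except subprocess.TimeoutExpired:
            return attempt
    return None
```

For example, calling run_until_race(["./xmpp-nagios-check", "-H", "host0", "-H", "host1", "-j", "chat.something.com", "-v", "-r"]) from the xmpp-nagios-check directory would repeat the command above until one run goes past the limit.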
Experimental patch
Try this patch for ejabberd 2.1.x:
http://tkabber.jabber.ru/files/badlop/4585-muc-creation-race.diff
And tell me if it solves the problem or not.