expat vs. xmerl_scan

I have some questions regarding XML parsing:

1. Why is the Expat XML parser used rather than Erlang's xmerl_scan module? Are there historical reasons for using Expat? As I understand it, an external context has to be created every time Expat is needed (that is, every time a port is opened), which creates a lot of overhead (please correct me if I'm wrong).

A typical line taken from ejabberd/src/xml_stream.erl:
Port = open_port({spawn, expat_erl}, [binary]),

2. If a conversion is possible, will a future ejabberd release use xmerl_scan instead?

Pieter Rautenbach

Expat

Expat is a very efficient and robust XML stream parser. Being a stream parser matters because a Jabber XML stream is virtually infinite.
The overhead of opening the port is paid only once per connection, and processing is then very fast for the whole duration of the connection.
Given how heavily XML parsing is used, I think xmerl_scan can hardly be more efficient than Expat.
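
To illustrate the stream-parsing point outside ejabberd, here is a minimal sketch using Python's standard-library Expat binding (xml.parsers.expat) rather than the expat_erl driver. The helper name make_stream_parser, the sample chunks, and the depth-counting rule for detecting stanzas are assumptions made for this illustration, not ejabberd's actual code. It creates one parser per connection, feeds it data in arbitrary chunks, and reports each top-level stanza as soon as it closes, even though the root element never ends.

```python
import xml.parsers.expat


def make_stream_parser(on_stanza):
    """Create one Expat parser for one connection (hypothetical helper)."""
    parser = xml.parsers.expat.ParserCreate()
    depth = 0

    def start(name, attrs):
        nonlocal depth
        depth += 1

    def end(name):
        nonlocal depth
        depth -= 1
        # Depth 1 means a direct child of the still-open root element closed,
        # i.e. one complete stanza has been received.
        if depth == 1:
            on_stanza(name)

    parser.StartElementHandler = start
    parser.EndElementHandler = end
    return parser


stanzas = []
parser = make_stream_parser(stanzas.append)  # created once per connection

# Chunks as they might arrive from a socket: the boundaries fall in the
# middle of tags, and the <stream> root element is never closed.
chunks = [
    "<stream><mess",
    "age to='a'><body>hi</body></message><pre",
    "sence/><iq type='get'>",
    "<query/></iq>",
]
for chunk in chunks:
    parser.Parse(chunk, False)  # False: more data will follow

print(stanzas)
```

The parser is created once and then reused for every chunk, which mirrors the "pay the setup cost once per connection" argument above. A parser that needs the whole document before it can produce anything could not be used this way on a stream that never ends.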

In the future we could optimise the code a little by pooling several C ports, each managing a different Expat parsing process, but I think there are many other optimisations to try first that would probably have a bigger impact on performance.

Do you usually use xmerl_scan?

--
Mickaël Rémond
http://www.process-one.net/

I based my questions on a discussion with a friend. I've only been using Erlang for a month, but he has been using it extensively for quite a while. I was discussing with him the problems we're having with our cluster of 3 Jabber servers (seemingly random crashes). I told him that CPU usage per node is close to 100% with about 8000 users online per node, and he suggested that Expat could be what is causing the high CPU usage. Under these circumstances a node crashes, and the SASL log has entries like this one:

=CRASH REPORT==== 5-Jan-2006::21:05:58 ===
crasher:
pid: <0.14800.5>
registered_name: []
error_info: {enfile,[{erlang,open_port,[{spawn,expat_erl},[binary]]},
{xml_stream,new,1},
{ejabberd_receiver,receiver,4},
{proc_lib,init_p,5}]}
initial_call: {ejabberd_receiver,receiver,
[#Port<0.1683984>,
gen_tcp,
none,
<0.14799.5>]}
ancestors: [<0.14799.5>,ejabberd_c2s_sup,ejabberd_sup,<0.39.0>]
messages: []
links: []
dictionary: []
trap_exit: false
status: running
heap_size: 233
stack_size: 21
reductions: 65
neighbours:

This is obviously a serious problem, but a bigger problem is that the nodes seem to influence each other (it happened just now): Server 1 crashes, Server 2 follows a few minutes later and, a few minutes after that, *sometimes* Server 3.

Pieter Rautenbach
