Thursday, October 4, 2012

Data exchange format for enterprise application integration – XML, EDI and JSON?

The design of message exchange between the modules of an application, and across applications, is often an afterthought. In my experience, this neglect severely limits the functionality, performance and scalability of the applications. In this note I discuss the relative merits of XML, JSON and EDI-like flat file formats for data exchange.

The choice of data format depends on multiple factors. Among the most important are the size and structure of the message.

Consider a relational database for orders. It will have multiple tables for order headers and order lines. It will also have tables for items, suppliers, customers, addresses and currencies that are referenced from the order headers and order lines.

Technically, it is possible to design a structure, in all three formats, that lets you send the complete database in a single huge message.

The size of a message depends on its content, and it affects communication throughput, processing throughput, error handling, and disk and memory requirements.

Creating such a large message is cumbersome but not very difficult. The real problems appear when the recipient attempts to receive, log, parse and consume it.

Let us consider the issues related to the size and structure of a message in more detail.

Communication time: Messages are exchanged serially. Allowing for network overheads, a large message may take several seconds to travel from source to destination, and this time grows with message size. Any intermittent disruption may force the source to resend the whole message. (Compression can partially mitigate this issue.)

Logging: If traceability, retransmission or non-repudiation is required, both source and destination systems need to log the messages. Writing and reading large messages further reduces throughput and requires significant disk space.

Parsing: This aspect of integration deserves careful attention. If your application must understand the complete message before it can act, it has to parse the complete message and build an object in memory. Large messages take more memory and time to parse. Standard DOM (Document Object Model) parsing of XML requires significant memory, because the whole document is held in memory as a tree. Parsing the equivalent JSON message typically needs less memory. One reason an application may need to parse the complete message is a message structure that does not enforce the order in which elements appear. You can partially avoid parsing the complete message by enforcing such an order.
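As a minimal sketch of the difference, the Python fragment below reads the same small message twice with the standard library's xml.etree.ElementTree: once as a complete tree, and once incrementally with iterparse, releasing each ORDER after it has been handled. The element names follow the sample message in this note; the sample data and function names are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Illustrative two-order message (hypothetical data).
SAMPLE = (
    "<ORDERS>"
    "<ORDER><HEADER><SHIP_TO>A</SHIP_TO></HEADER>"
    "<LINES><LINE><ITEM>Pen</ITEM><QTY>10</QTY></LINE></LINES></ORDER>"
    "<ORDER><HEADER><SHIP_TO>B</SHIP_TO></HEADER>"
    "<LINES><LINE><ITEM>Laptop</ITEM><QTY>100</QTY></LINE></LINES></ORDER>"
    "</ORDERS>"
)


def ship_to_full_tree(xml_text):
    """Parse the whole message into a tree, then search it."""
    root = ET.fromstring(xml_text)
    return [order.findtext("HEADER/SHIP_TO") for order in root.iter("ORDER")]


def ship_to_streaming(stream):
    """Handle one ORDER at a time as the parser reaches its end tag."""
    names = []
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "ORDER":
            names.append(elem.findtext("HEADER/SHIP_TO"))
            elem.clear()  # drop the children of the finished order
    return names


full = ship_to_full_tree(SAMPLE)
streamed = ship_to_streaming(io.BytesIO(SAMPLE.encode("utf-8")))
```

Both calls return the same names. The streaming version never holds more than one populated order at a time, although the emptied ORDER elements stay attached to the root, so memory use is reduced rather than constant.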

Exception handling: As stated, you can transmit the content of a complete database (multiple orders, in the example above) in a single message. If the receiving application handles each message as a single transaction, then a single failure requires the complete message to be reprocessed.
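One possible way to limit the damage, sketched here in Python, is for the receiver to treat each order inside the message as its own unit of work, so that one bad order does not force the rest to be reprocessed. The function names, the order layout and the validation rule are hypothetical.

```python
def process_orders(orders, handle):
    """Apply handle() to each order independently; collect failures for retry."""
    done, failed = [], []
    for index, order in enumerate(orders):
        try:
            handle(order)
            done.append(index)
        except Exception as exc:  # broad on purpose: isolate any failure
            failed.append((index, str(exc)))
    return done, failed


def demo_handle(order):
    # Hypothetical business rule, used only for this demonstration.
    if int(order["qty"]) <= 0:
        raise ValueError("quantity must be positive")


done, failed = process_orders(
    [{"qty": "10"}, {"qty": "0"}, {"qty": "5"}], demo_handle
)
```

Here only the second order lands in the failed list; the other two are handled and do not need to be sent or processed again.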

Let us now evaluate the relative merits of the three formats: XML, JSON and EDI-like text.

XML is a well-established and popular standard. It is self-describing, open and extensible. Extensibility lets you add elements to a message with limited programming impact. Ready-made parsing libraries exist in almost all popular languages. XML itself does not enforce the sequence in which elements appear; a schema can impose one, but a receiver cannot rely on it unless every message is validated against that schema. This becomes a major weakness, as the receiving application may need to parse the complete message to find the elements it is interested in, sometimes in multiple passes. One option is to parse the complete message once into a Document Object Model (DOM) structure that can be searched later. However, DOM parsing is memory-hungry and is overkill if you need only a few elements from the message. The self-describing structure of XML is its other major weakness. XML wraps each data element in a start tag and an end tag. While these descriptive tags help a human reader, they add to the overall size of the message.

JSON is a newer message format that is gaining popularity. It is also open, extensible and self-describing. Its syntax is derived from JavaScript object literals and maps directly onto the maps, lists, strings and numbers of most languages, so parsing it and loading it into memory typically takes significantly less time than parsing the equivalent XML. However, as JSON also does not enforce an element sequence, you still need to load the complete message. JSON has no end tags, so a JSON message is typically 30% to 40% smaller than the corresponding XML message. Parsing libraries exist for Java, JavaScript and most other popular languages.

I use the term EDI-like flat file format for a generic, variable-length, text-based message structure in which each line corresponds to an individual record of data. The first few characters of each line, called the record identifier, identify the record type. Elements within each line (record) are separated by an element separator. This format is the most compact and can be 60% to 70% smaller than the corresponding XML. It is obviously not self-describing, so source and recipients must share a previously agreed definition of the message structure. Parsing libraries exist for standard EDI, but custom message formats may need custom parsers. Because formats are mutually agreed between senders and receivers, record and element sequences are usually enforced as part of that agreement. This guaranteed sequencing allows extremely large messages to be parsed sequentially, with limited system resources and at significantly higher throughput. As the messages are not self-describing, programmers need to be careful when changing formats.
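To make the sequential parsing concrete, here is a minimal Python sketch of a parser for such a format. It assumes the layout used in the sample in this note: `*` as the element separator, and the record identifiers B (begin order), H (header), L (line) and E (end order). The function name and the dictionary keys are illustrative, not part of any standard.

```python
def iter_orders(lines):
    """Yield one order at a time from an iterable of record lines."""
    order = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines between records
        if not line.startswith("*"):
            raise ValueError(f"not a record: {line!r}")
        fields = line.split("*")
        record_id, elements = fields[1], fields[2:]
        if record_id == "B":
            order = {"HEADER": None, "LINES": []}
        elif order is None:
            raise ValueError(f"record {record_id!r} outside an order")
        elif record_id == "H":
            ship_to, address, order_date, payment = elements
            order["HEADER"] = {
                "SHIP_TO": ship_to,
                "SHIP_TO_ADDRESS": address,
                "ORDER_DATE": order_date,
                "PAYMENT_INFO": payment,
            }
        elif record_id == "L":
            item, qty = elements
            order["LINES"].append({"ITEM": item, "QTY": qty})
        elif record_id == "E":
            yield order  # the order is complete; hand it to the caller
            order = None
        else:
            raise ValueError(f"unknown record type {record_id!r}")


# Hypothetical two-line order in the assumed layout.
SAMPLE = """*B*ORD*
*H*Don Trump*Atlanta, GA*April 21 2012*XYZ BANK
*L*Pen*10
*L*Laptop*100
*E*ORD"""

orders = list(iter_orders(SAMPLE.splitlines()))
```

Because the function is a generator that accepts any iterable of lines, it can be fed an open file and will hold only one order in memory at a time, however large the message is.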

I will now show the same data in XML, JSON and EDI-like formats.

XML: size 341 characters (not counting whitespace)

<ORDERS>
  <ORDER>
    <HEADER>
      <SHIP_TO>Don Trump</SHIP_TO>
      <SHIP_TO_ADDRESS>Atlanta, GA</SHIP_TO_ADDRESS>
      <ORDER_DATE>April 21 2012</ORDER_DATE>
      <PAYMENT_INFO>XYZ BANK</PAYMENT_INFO>
    </HEADER>
    <LINES>
      <LINE>
        <ITEM>Pen</ITEM>
        <QTY>10</QTY>
      </LINE>
      <LINE>
        <ITEM>Tablet</ITEM>
        <QTY>05</QTY>
      </LINE>
      <LINE>
        <ITEM>Laptop</ITEM>
        <QTY>100</QTY>
      </LINE>
    </LINES>
  </ORDER>
</ORDERS>

JSON: size 241 characters (not counting whitespace)

{
  "ORDERS": {
    "ORDER": {
      "HEADER": {
        "SHIP_TO": "Don Trump",
        "SHIP_TO_ADDRESS": "Atlanta, GA",
        "ORDER_DATE": "April 21 2012",
        "PAYMENT_INFO": "XYZ BANK"
      },
      "LINES": {
        "LINE": [
          {
            "ITEM": "Pen",
            "QTY": "10"
          },
          {
            "ITEM": "Tablet",
            "QTY": "05"
          },
          {
            "ITEM": "Laptop",
            "QTY": "100"
          }
        ]
      }
    }
  }
}

EDI-like: size 89 characters (not counting whitespace)

*B*ORD*
*H*Don Trump*Atlanta, GA*April 21 2012*XYZ BANK
*L*Pen*10
*L*Tablet*05
*L*Laptop*100
*E*ORD
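For readers who want to reproduce the comparison, the following Python sketch builds all three representations from one in-memory order and counts their characters. It ignores all whitespace when counting, which appears to be how the sizes quoted with the samples were measured. The helper names are illustrative assumptions.

```python
import json
import xml.etree.ElementTree as ET

ORDER = {
    "HEADER": {
        "SHIP_TO": "Don Trump",
        "SHIP_TO_ADDRESS": "Atlanta, GA",
        "ORDER_DATE": "April 21 2012",
        "PAYMENT_INFO": "XYZ BANK",
    },
    "LINES": [("Pen", "10"), ("Tablet", "05"), ("Laptop", "100")],
}


def to_xml(order):
    root = ET.Element("ORDERS")
    order_el = ET.SubElement(root, "ORDER")
    header_el = ET.SubElement(order_el, "HEADER")
    for name, value in order["HEADER"].items():
        ET.SubElement(header_el, name).text = value
    lines_el = ET.SubElement(order_el, "LINES")
    for item, qty in order["LINES"]:
        line_el = ET.SubElement(lines_el, "LINE")
        ET.SubElement(line_el, "ITEM").text = item
        ET.SubElement(line_el, "QTY").text = qty
    return ET.tostring(root, encoding="unicode")


def to_json(order):
    lines = [{"ITEM": item, "QTY": qty} for item, qty in order["LINES"]]
    doc = {"ORDERS": {"ORDER": {"HEADER": order["HEADER"],
                                "LINES": {"LINE": lines}}}}
    return json.dumps(doc, separators=(",", ":"))


def to_flat(order):
    records = ["*B*ORD*", "*H*" + "*".join(order["HEADER"].values())]
    records += [f"*L*{item}*{qty}" for item, qty in order["LINES"]]
    records.append("*E*ORD")
    return "\n".join(records)


def size(text):
    """Count characters, ignoring all whitespace (including spaces in values)."""
    return sum(1 for ch in text if not ch.isspace())


sizes = {name: size(render(ORDER)) for name, render in
         (("XML", to_xml), ("JSON", to_json), ("EDI-like", to_flat))}
for name, count in sizes.items():
    print(f"{name}: {count} characters, {100 * count // sizes['XML']}% of XML")
```

Counting this way measures only the markup and data, so the result does not depend on how each sample happens to be indented.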

 

As you can see, the EDI-like format is the most compact. The following table summarizes the pros and cons of the three formats.

|                                | XML                 | JSON             | EDI-like flat file                    |
|--------------------------------|---------------------|------------------|---------------------------------------|
| Relative size (XML = 100)      | 100                 | 70               | 25                                    |
| Can enforce element sequencing | Only with a schema  | No               | Yes                                   |
| Standard-based                 | Yes                 | Yes              | No                                    |
| Availability of parsers        | Best                | Widely available | Limited; custom parsers may be needed |
| Ease of parsing huge messages  | Need custom parsers |                  | Easiest                               |

Even though the EDI-like format requires more programming, I recommend using it as much as possible. If another application or module can handle only XML or JSON, you can implement translators that convert your EDI-like format to XML, JSON or any other open standard format.
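Such a translator can be quite small. The Python sketch below converts one EDI-like order message into JSON, driven by a table that stands in for the agreed message definition. The LAYOUT table, the function name and the choice of output shape are assumptions for illustration.

```python
import json

# Hypothetical agreed layout: record identifier -> element names, in order.
LAYOUT = {
    "H": ["SHIP_TO", "SHIP_TO_ADDRESS", "ORDER_DATE", "PAYMENT_INFO"],
    "L": ["ITEM", "QTY"],
}


def flat_to_json(message):
    """Translate one EDI-like order message into a JSON document."""
    header, lines = {}, []
    for record in message.splitlines():
        fields = record.strip().split("*")
        if len(fields) < 3:
            continue  # blank or malformed line: nothing to translate
        record_id, elements = fields[1], fields[2:]
        if record_id not in LAYOUT:
            continue  # B and E records only delimit the order
        named = dict(zip(LAYOUT[record_id], elements))
        if record_id == "H":
            header = named
        else:
            lines.append(named)
    doc = {"ORDERS": {"ORDER": {"HEADER": header, "LINES": {"LINE": lines}}}}
    return json.dumps(doc, indent=2)


translated = flat_to_json(
    "*B*ORD*\n*H*Don Trump*Atlanta, GA*April 21 2012*XYZ BANK\n*L*Pen*10\n*E*ORD"
)
```

Keeping the element names in one table means that a format change agreed with the other party is made in a single place.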

Note: you can use compression to reduce the size of your text messages further, by as much as another 80%. I will write another note on message compression.
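As a quick way to check what compression buys for your own messages, the Python sketch below compresses a message with the standard gzip module and reports the ratio. The generated message is hypothetical, and the saving depends heavily on how repetitive the real content is.

```python
import gzip


def compression_ratio(text):
    """Return compressed size divided by original size, both in bytes."""
    raw = text.encode("utf-8")
    return len(gzip.compress(raw)) / len(raw)


# Hypothetical message: 1,000 order lines in the EDI-like layout.
message = "\n".join(f"*L*Item{n % 50}*{n % 100}" for n in range(1000))
ratio = compression_ratio(message)
print(f"compressed to {ratio:.0%} of the original size")
```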
