Topic: APLX Help : Help on APL language : System Functions & Variables : ⎕XML Convert to/from XML

www.microapl.co.uk

⎕XML Convert to/from XML


Extensible Markup Language (XML) is a widely used standard for storing data in a text format that many different programs can access. It combines the actual data with 'mark-up' which indicates how the data should be interpreted.

The ⎕XML system function can be used to extract data from XML format into an APL array, and to generate XML from an APL array. The direction of conversion is determined by the type of the right argument.

See also the ⎕IMPORT and ⎕EXPORT functions, which allow data to be transferred to/from XML files in a single step.

An Example of XML format

A full description of XML is beyond the scope of this document. However, the following simple but complete XML example demonstrates some of the main features:

<?xml version="1.0" encoding="utf-8"?>
<sales>
    <!-- Sales by month -->
    <month>January
        <item>
            <name>Ice Cream</name>
            <amount currency="dollars">25.10</amount>
       </item>
       <item>
            <name>Fizzy Drinks</name>
            <amount currency="dollars">360.92</amount>
       </item>
    </month>
    <month>February
        <item>
            <name>Ice Cream</name>
            <amount currency="dollars">5.02</amount>
       </item>
       <item>
            <name>Fizzy Drinks</name>
            <amount currency="dollars">403.16</amount>
       </item>
    </month>
</sales>

The first line specifies the XML version used, and the third line ("Sales by month") is a comment. The remainder of the document consists of elements which contain the data. Each element begins with a start tag and ends with a matching end tag, for example:

    <name>...</name>

Element tag names are case-sensitive.

An element may contain data, or other elements nested within it, or both. In addition the start tag may include one or more attributes specifying how the data is to be interpreted. Each attribute is a pair of the form name="value", for example:

    <amount currency="dollars">25.10</amount>

An empty element which contains no data and no other elements nested within it can be written as:

    <name/>

Within an XML document there is usually no significance in the amount of white space used, for example the number of spaces used to indent an element or the positions of line breaks. The following is valid in XML:

<item><name>Ice Cream</name><amount currency="dollars">25.10</amount></item>

Converting XML Data to an APL Array

Syntax: R←[options] ⎕XML CHRVEC

The right argument is a character vector (with embedded carriage returns and/or line feeds) containing the XML text to be converted. The optional left argument gives some control over the conversion process and is discussed below.
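
As a minimal sketch, a small XML fragment can be assembled by hand, using ⎕L (line feed) to separate the lines. The variable name small is chosen only for illustration:

      small←'<item>',⎕L,'    <name>Ice Cream</name>',⎕L,'</item>'
      R←⎕XML small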

The result is an N-row, 5-column matrix containing a flattened representation of the XML data. Each element in the XML data will produce one row in the result. The columns are as follows:

Column 1:  An integer indicating the depth of nesting of the element.
           A value of 0 is used for the outer-most nesting level, with deeper nesting being indicated by higher numbers.
Column 2:  The element name as specified in the start tag.
Column 3:  The element data as a character vector.
Column 4:  An M-row, 2-column nested matrix containing any attribute name/value pairs. Each item in the matrix is a character vector.
           If the element has no attributes, this matrix will have 0 rows.
Column 5:  A code to help interpret the type of data the row contains (see below).

For example, when presented with the XML sample listed above the array produced is as follows:

      ⎕XML xml_data
 0 sales                                  3
 1 month                                  7
 2        January                         4
 2 item                                   3
 3 name   Ice Cream                       5
 3 amount 25.10         currency dollars  5
 2 item                                   3
 3 name   Fizzy Drinks                    5
 3 amount 360.92        currency dollars  5
 1 month                                  7
 2        February                        4
 2 item                                   3
 3 name   Ice Cream                       5
 3 amount 5.02          currency dollars  5
 2 item                                   3
 3 name   Fizzy Drinks                    5
 3 amount 403.16        currency dollars  5
 
 
      ⎕DISPLAY ⎕XML xml_data
┌→─────────────────────────────────────────────────────┐
↓   ┌→────┐  ┌⊖┐            ┌→────────┐                │
│ 0 │sales│  │ │            ⌽ ┌⊖┐ ┌⊖┐ │              3 │
│   └─────┘  └─┘            │ │ │ │ │ │                │
│                           │ └─┘ └─┘ │                │
│                           └∊────────┘                │
│   ┌→────┐  ┌⊖┐            ┌→────────┐                │
│ 1 │month│  │ │            ⌽ ┌⊖┐ ┌⊖┐ │              7 │
│   └─────┘  └─┘            │ │ │ │ │ │                │
│                           │ └─┘ └─┘ │                │
│                           └∊────────┘                │
│   ┌⊖┐      ┌→──────┐      ┌→────────┐                │
│ 2 │ │      │January│      ⌽ ┌⊖┐ ┌⊖┐ │              4 │
│   └─┘      └───────┘      │ │ │ │ │ │                │
│                           │ └─┘ └─┘ │                │
│                           └∊────────┘                │
│   ┌→───┐   ┌⊖┐            ┌→────────┐                │
│ 2 │item│   │ │            ⌽ ┌⊖┐ ┌⊖┐ │              3 │
│   └────┘   └─┘            │ │ │ │ │ │                │
│                           │ └─┘ └─┘ │                │
│                           └∊────────┘                │
│   ┌→───┐   ┌→────────┐    ┌→────────┐                │
│ 3 │name│   │Ice Cream│    ⌽ ┌⊖┐ ┌⊖┐ │              5 │
│   └────┘   └─────────┘    │ │ │ │ │ │                │
│                           │ └─┘ └─┘ │                │
│                           └∊────────┘                │
│   ┌→─────┐ ┌→────┐        ┌→─────────────────────┐   │
│ 3 │amount│ │25.10│        ↓ ┌→───────┐ ┌→──────┐ │ 5 │
│   └──────┘ └─────┘        │ │currency│ │dollars│ │   │
│                           │ └────────┘ └───────┘ │   │
│                           └∊─────────────────────┘   │
│   ┌→───┐   ┌⊖┐            ┌→────────┐                │
│ 2 │item│   │ │            ⌽ ┌⊖┐ ┌⊖┐ │              3 │
│   └────┘   └─┘            │ │ │ │ │ │                │
│                           │ └─┘ └─┘ │                │
│                           └∊────────┘                │
│   ┌→───┐   ┌→───────────┐ ┌→────────┐                │
│ 3 │name│   │Fizzy Drinks│ ⌽ ┌⊖┐ ┌⊖┐ │              5 │
│   └────┘   └────────────┘ │ │ │ │ │ │                │
│                           │ └─┘ └─┘ │                │
│                           └∊────────┘                │
│   ┌→─────┐ ┌→─────┐       ┌→─────────────────────┐   │
│ 3 │amount│ │360.92│       ↓ ┌→───────┐ ┌→──────┐ │ 5 │
│   └──────┘ └──────┘       │ │currency│ │dollars│ │   │
│                           │ └────────┘ └───────┘ │   │
│                           └∊─────────────────────┘   │
│   ┌→────┐  ┌⊖┐            ┌→────────┐                │
│ 1 │month│  │ │            ⌽ ┌⊖┐ ┌⊖┐ │              7 │
│   └─────┘  └─┘            │ │ │ │ │ │                │
│                           │ └─┘ └─┘ │                │
│                           └∊────────┘                │
│   ┌⊖┐      ┌→───────┐     ┌→────────┐                │
│ 2 │ │      │February│     ⌽ ┌⊖┐ ┌⊖┐ │              4 │
│   └─┘      └────────┘     │ │ │ │ │ │                │
│                           │ └─┘ └─┘ │                │
│                           └∊────────┘                │
│   ┌→───┐   ┌⊖┐            ┌→────────┐                │
│ 2 │item│   │ │            ⌽ ┌⊖┐ ┌⊖┐ │              3 │
│   └────┘   └─┘            │ │ │ │ │ │                │
│                           │ └─┘ └─┘ │                │
│                           └∊────────┘                │
│   ┌→───┐   ┌→────────┐    ┌→────────┐                │
│ 3 │name│   │Ice Cream│    ⌽ ┌⊖┐ ┌⊖┐ │              5 │
│   └────┘   └─────────┘    │ │ │ │ │ │                │
│                           │ └─┘ └─┘ │                │
│                           └∊────────┘                │
│   ┌→─────┐ ┌→───┐         ┌→─────────────────────┐   │
│ 3 │amount│ │5.02│         ↓ ┌→───────┐ ┌→──────┐ │ 5 │
│   └──────┘ └────┘         │ │currency│ │dollars│ │   │
│                           │ └────────┘ └───────┘ │   │
│                           └∊─────────────────────┘   │
│   ┌→───┐   ┌⊖┐            ┌→────────┐                │
│ 2 │item│   │ │            ⌽ ┌⊖┐ ┌⊖┐ │              3 │
│   └────┘   └─┘            │ │ │ │ │ │                │
│                           │ └─┘ └─┘ │                │
│                           └∊────────┘                │
│   ┌→───┐   ┌→───────────┐ ┌→────────┐                │
│ 3 │name│   │Fizzy Drinks│ ⌽ ┌⊖┐ ┌⊖┐ │              5 │
│   └────┘   └────────────┘ │ │ │ │ │ │                │
│                           │ └─┘ └─┘ │                │
│                           └∊────────┘                │
│   ┌→─────┐ ┌→─────┐       ┌→─────────────────────┐   │
│ 3 │amount│ │403.16│       ↓ ┌→───────┐ ┌→──────┐ │ 5 │
│   └──────┘ └──────┘       │ │currency│ │dollars│ │   │
│                           │ └────────┘ └───────┘ │   │
│                           └∊─────────────────────┘   │
└∊─────────────────────────────────────────────────────┘
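
Because the result is an ordinary nested matrix, standard indexing and selection can be used to pick out the parts of interest. The following is a minimal sketch rather than a recommended idiom. It assumes that xml_data holds the sales sample above and that the index origin is the default (⎕IO←1); the names R, mask and amounts are illustrative only:

      R←⎕XML xml_data
      mask←R[;2]≡¨⊂'amount'      ⍝ 1 for each row whose element name is 'amount'
      amounts←⍎¨mask/R[;3]       ⍝ Convert the character data to numbers
      +/amounts
794.2

Note that ⍎ executes whatever text it is given, so this approach is only suitable for XML from a trusted source.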


Options for converting XML to an APL array

The conversion from XML to an APL array described above can be controlled by an optional left argument which consists of one or more keyword/value pairs, for example:

     R←('markup' 'preserve') ('whitespace' 'preserve') ⎕XML xml_data

The supported keywords are:

  • 'markup': possible values 'preserve' and 'strip'

    By default ⎕XML strips out all XML statements which are not data elements. In the example above, the following two lines were stripped out:

            <?xml version="1.0" encoding="utf-8"?>
            <!-- Sales by month -->

    The first one is a processing instruction and the second is a comment. Neither of them contains any data.

    However, it is sometimes necessary to have access to the complete content of the XML document, for example if you need to do special processing of markup declarations like <!DOCTYPE> and <!ELEMENT>. By specifying 'markup' 'preserve' you can tell ⎕XML that all markup in the XML, not just the data elements, should produce corresponding rows in the APL array.

  • 'whitespace': possible values 'preserve', 'strip' and 'trim'

    By default ⎕XML strips all leading and trailing white space from element data, and compresses runs of white space within the data into a single space. You can modify this behaviour by specifying that all white space should be preserved, or that only leading and trailing spaces which enclose the data should be trimmed.

    There is one exception to this behaviour. If an XML element has the attribute xml:space="preserve" then white space is always retained.

  • 'unknown-entity': possible values 'preserve' and 'replace'

    XML data can include a number of predeclared entity references like "&amp;" to represent the "&" character, and numeric character references like "&#9017;" for Unicode character 9017. These are always converted by ⎕XML to their single-character forms.

    However, additional entity references can be declared in the XML Document Type Definition (DTD) and then used in the text. APLX does not currently parse the DTD and so does not know how to substitute for these references. Instead, the default behaviour of ⎕XML is to substitute the character specified by ⎕MC (by default, a question mark).

    This behaviour can be changed so that ⎕XML preserves unknown entity references, in which case they are passed to the APL array. The leading '&' is converted to an <ESC> character (⎕TCESC) so that the entity reference can be detected by the APL program, e.g. "&ref;" becomes "<ESC>ref"
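
As a sketch of how these keywords combine, the following preserves markup, trims white space, and then lists only the rows that are not data elements. It assumes that xml_data holds the sales sample above and that ⎕IO←1. Going by the type codes described in the next section, the XML declaration and the comment would be expected to appear as rows with codes 32 and 16:

      R←('markup' 'preserve') ('whitespace' 'trim') ⎕XML xml_data
      (R[;5]∊8 16 32)⌿R[;1 2 5]   ⍝ Depth, text and type code of the markup rows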

Type code returned by ⎕XML

The fifth column of the array produced by ⎕XML contains a type code which can be used to interpret the row. Its value depends on whether the XML element has any children.

Children can be of the following types. Note that if markup is stripped, only the first of these types can occur in the final result.

  • A nested XML element
        <Parent>
            <Child>...</Child>
        </Parent>
    
  • A nested XML comment
        <Parent>
            <!--Comment-->
        </Parent>
    
  • A nested XML Processing Instruction
        <Parent>
            <?Processing instruction?>
        </Parent>
    
  • Other nested XML markup
        <Parent>
            <!ELEMENT name (#PCDATA)>
        </Parent>
    

(a) If the XML element has children its type code is formed from a sum of the following values, reflecting the types of children found on subsequent rows:

 1   Element has a tag (in column 2) (always true)
 2   Element contains nested child element
 4   Element contains data as well as nested items
 8   Element contains nested XML markup
16   Element contains nested XML comment
32   Element contains nested XML Processing Instruction

For example, the element <Weight> in the following example has a type code of 21 (1 + 16 + 4) when markup and comments are preserved:

    <Weight>
        <!-- All weights approximate-->
        100
    </Weight>

Notice that an XML element with children always has a tag name in column 2. It never has any data in column 3: all the data is returned in subsequent rows.
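
Since each component of the type code is a distinct power of two, a code can be split back into its individual flags with Encode (⊤). A small illustration for the value 21 from the example above. Reading from left to right, the flags correspond to 32 16 8 4 2 1:

      (6⍴2)⊤21
0 1 0 1 0 1

The same idea extends to a whole result. Assuming R is a matrix returned by ⎕XML and ⎕IO←1, the following sketch gives a 1 for every row whose element contains a nested child element (the flag with value 2):

      ((6⍴2)⊤R[;5])[5;]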

(b) The following type codes are used for XML elements which don't have any children:

 1   Element is an empty XML tag, e.g. <empty/>.
     The tag name is returned in column 2, and column 3 is blank.
 4   Row is data for parent (see below).
     The data is returned in column 3, and column 2 is blank.
 5   Element has an XML tag and data, e.g. <Tag>Data</Tag>.
     The tag name is returned in column 2 and the data in column 3.
 8   Element is unprocessed XML markup, e.g. <!ELEMENT name (#PCDATA)>.
     The markup is returned in column 2, and column 3 is blank.
16   Element is an XML comment, e.g. <!--Comment-->.
     The comment is returned in column 2, and column 3 is blank.
32   Element is an XML Processing Instruction, e.g. <?xml version="1.0" encoding="utf-8"?>.
     The processing instruction is returned in column 2, and column 3 is blank.

The following example illustrates how the codes are used:

  <Tag1>Text
        <Tag2>
        <Tag3>Text</Tag3>
        </Tag2> 
        More Text
  </Tag1>

When converted by ⎕XML this will produce the following array

     ⎕DISPLAY ⎕XML xml_data
┌→───────────────────────────────────┐
↓   ┌→───┐ ┌⊖┐         ┌→────────┐   │
│ 0 │Tag1│ │ │         ⌽ ┌⊖┐ ┌⊖┐ │ 7 │
│   └────┘ └─┘         │ │ │ │ │ │   │
│                      │ └─┘ └─┘ │   │
│                      └∊────────┘   │
│   ┌⊖┐    ┌→───┐      ┌→────────┐   │
│ 1 │ │    │Text│      ⌽ ┌⊖┐ ┌⊖┐ │ 4 │
│   └─┘    └────┘      │ │ │ │ │ │   │
│                      │ └─┘ └─┘ │   │
│                      └∊────────┘   │
│   ┌→───┐ ┌⊖┐         ┌→────────┐   │
│ 1 │Tag2│ │ │         ⌽ ┌⊖┐ ┌⊖┐ │ 3 │
│   └────┘ └─┘         │ │ │ │ │ │   │
│                      │ └─┘ └─┘ │   │
│                      └∊────────┘   │
│   ┌→───┐ ┌→───┐      ┌→────────┐   │
│ 2 │Tag3│ │Text│      ⌽ ┌⊖┐ ┌⊖┐ │ 5 │
│   └────┘ └────┘      │ │ │ │ │ │   │
│                      │ └─┘ └─┘ │   │
│                      └∊────────┘   │
│   ┌⊖┐    ┌→────────┐ ┌→────────┐   │
│ 1 │ │    │More Text│ ⌽ ┌⊖┐ ┌⊖┐ │ 4 │
│   └─┘    └─────────┘ │ │ │ │ │ │   │
│                      │ └─┘ └─┘ │   │
│                      └∊────────┘   │
└∊───────────────────────────────────┘
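
Text that belongs directly to a parent element can therefore be recovered by selecting the rows with type code 4. A minimal sketch, assuming that xml_data holds the <Tag1> example above and that ⎕IO←1:

      R←⎕XML xml_data
      (R[;5]=4)/R[;3]      ⍝ Data rows only: here 'Text' and 'More Text'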

Creating XML Data from an APL Array

Syntax: R←[options] ⎕XML NSTMAT

When presented with an array of APL data, ⎕XML will convert it to XML representation. The result is a character vector with embedded line-feed characters.

The right argument must be a nested matrix with one row for each XML element, and between 3 and 5 columns as follows:

Column 1:  An integer indicating the depth of nesting of the element.
           A value of 0 is used for the outer-most nesting level, with deeper nesting being indicated by higher numbers.
Column 2:  The element name to use for the start tag.
Column 3:  The element data (see below).
Column 4:  (Optional) An M-row, 2-column nested matrix containing any attribute name/value pairs. Each item in the matrix is a character vector.
           If the element has no attributes you can specify a 0-row matrix, or a pair of empty character vectors.
           If none of the elements have any attributes you can omit column 4 completely.
Column 5:  (Optional) An integer type code (ignored).
           This column is only used to facilitate round-trip conversions from XML to APL and back again.

The data specified in Column 3 will usually be a character vector or scalar. However, as a convenience ⎕XML also allows you to specify numeric values. These are formatted as character data before copying to the XML result. Numeric values are also allowed for attribute values (but not names).

Example:

      array←1 4⍴0 '?xml version="1.0" encoding="utf-8"?' '' ('' '')
      array←array⍪0 'Person' '' ('' '')
      array←array⍪1 'Name' '' ('order' 'western')
      array←array⍪2 'FirstName' 'Fred' ('' '')
      array←array⍪2 'LastName' 'Smith' ('' '')
      array←array⍪1 'DateOfBirth' '' ('' '')
      array←array⍪2 'Year' 1943 ('' '')
      array←array⍪2 'Month' 12 ('' '')
      array←array⍪2 'Day' 17 ('' '')
      XML←⎕XML array
      ⎕SS XML ⎕L ⎕R    ⍝ Convert line feeds to carriage returns for display
<?xml version="1.0" encoding="utf-8"?>
<Person>
    <Name order="western">
        <FirstName>Fred</FirstName>
        <LastName>Smith</LastName>
    </Name>
    <DateOfBirth>
        <Year>1943</Year>
        <Month>12</Month>
        <Day>17</Day>
    </DateOfBirth>
</Person>
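
When none of the elements has attributes, the matrix can be reduced to three columns. A minimal sketch, with element names invented for illustration, which also passes a numeric value in column 3:

      array←3 3⍴0 'Person' '' 1 'Name' 'Fred' 1 'Age' 42
      XML←⎕XML array

Following the rules above, the result would be expected to contain a <Person> element enclosing <Name>Fred</Name> and <Age>42</Age>.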

The conversion process can be controlled by an optional left argument, for example:

     R←('whitespace' 'preserve') ⎕XML apl_data

The only supported option is:

  • 'whitespace': possible values 'preserve', 'strip' and 'trim'

    By default ⎕XML strips all leading and trailing white space from element data, and compresses runs of white space within the data into a single space. The XML text produced then has spaces and line-feed characters added to format it for readability. For example elements are indented to reflect their degree of nesting.

    You can modify this behaviour by specifying that all white space should be preserved, or that only leading and trailing spaces which enclose the data should be trimmed. The option of preserving all white space is most useful when you are re-creating XML data from an APL array which was itself produced by ⎕XML with spaces preserved.

    If you specify the attribute xml:space with the value preserve on any row, all white space is retained in the corresponding XML element.
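
A round trip with white space preserved in both directions might be sketched as follows, assuming xml_data holds some XML text. The name opt is illustrative. Since markup is stripped by default, the prologue and any comments in the original text would not be carried through by this particular call:

      opt←'whitespace' 'preserve'
      R←opt ⎕XML xml_data        ⍝ XML text to APL array, white space kept
      XML←opt ⎕XML R             ⍝ APL array back to XML text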

Adding the XML Prologue

To be valid, an XML file must start with a line containing an XML prologue, e.g.

<?xml version="1.0" encoding="utf-8"?>

Note that ⎕XML does not add the prologue automatically. To ensure that the XML is valid you must do one of two things:

(a) Make sure that the first row of the array used to generate the XML contains a valid prologue, as in the example above, or

(b) Prepend the prologue after the XML has been generated:

      XML←⎕XML 1 3⍴0 'Name' 'Fred Smith'
      XML←'<?xml version="1.0" encoding="utf-8"?>',⎕L,XML
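
If it is not known in advance whether the generated text already begins with a prologue, the prepending step can be made conditional. This is only a sketch: it treats any text starting with the five characters '<?xml' as already having a prologue, and the name prologue is illustrative:

      prologue←'<?xml version="1.0" encoding="utf-8"?>',⎕L
      XML←((~'<?xml'≡5↑XML)/prologue),XML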

If you create an XML file using ⎕EXPORT, APLX will automatically add the prologue if it is missing from the array.


Acknowledgment

This work is based on the original design concepts and implementation by Mark E. Johns, and has been designed in cooperation with Dyalog Ltd.

