Chocolat Blog

– by Alex Gordon & Jean-Nicolas Jolivet

Introducing self-ml

A human data format

self-ml[1] is a structural data language designed to be written by humans and read by computers. Its syntax is based on S-expressions with some changes and simplifications.

The biggest feature of self-ml is its simplicity. It doesn't have attributes, namespaces, dictionaries, symbols or even numerics. Lists and strings are all you got.

Each example below shows the self-ml form first, then the equivalent XML.

self-ml:
    (empty-tag)
XML:
    <empty-tag/>

self-ml:
    (vegetable Potato)
XML:
    <vegetable>Potato</vegetable>

self-ml:
    (vegetables Potato Carrot Onion)
XML:
    <vegetables>
        <vegetable>Potato</vegetable>
        <vegetable>Carrot</vegetable>
        <vegetable>Onion</vegetable>
    </vegetables>

self-ml:
    (person [John Smith])
XML:
    <person>John Smith</person>

self-ml:
    (code [NSMutableString *str = [[[NSMutableString alloc] init] autorelease];])
XML:
    <code>NSMutableString *str = [[[NSMutableString alloc] init] autorelease];</code>

self-ml:
    (student [Robert'); DROP TABLE Students;--])
XML:
    <student>Robert'); DROP TABLE Students;--</student>

self-ml:
    `Some complex data !@£$%^&*<([{/`
XML:
    <![CDATA[Some complex data !@£$%^&*<([{/]]>

self-ml:
    # A line comment
    {# A block
    comment #}
XML:
    <!-- A block
    comment -->

self-ml:
    (apple
        (music
            iPod
            iTunes)
        (computers
            iMac
            [Mac mini])
        (phones
            iPhone)
        (dishwashers))
XML:
    <apple>
        <music>
            <product>iPod</product>
            <product>iTunes</product>
        </music>
        <computers>
            <product>iMac</product>
            <product>Mac mini</product>
        </computers>
        <phones>
            <product>iPhone</product>
        </phones>
        <dishwashers/>
    </apple>

As you can see, especially from the last example, self-ml is a lot more succinct than XML, easier to read and easier to write. There's not a single backslash escape in sight: self-ml is Regex Friendly™. Unlike XML, self-ml allows a list of bare strings, such as (music iPod iTunes), and any number of nodes under the root node.

Casual Grammar

Warning: Technical details ahead. Skip to details on implementations.

A node is the basic unit of self-ml. A node can either be a list or a string.

node := list | string.

A list consists of a head and a list of other nodes, enclosed in round brackets. For example, (head node1 node2 node3). Unlike usual S-expressions, the empty list () is not accepted.

list := '(' string node_list ')'.

node_list := node_list node.
node_list := .

Strings can be written in three forms:

string := BACKTICK_STRING | BRACKETED_STRING | VERBATIM_STRING.
  1. If it contains no whitespace or brackets, then it can be written verbatim. For example, some-string.

    VERBATIM_STRING := [^[\](){}\s]+
  2. If all square brackets in the string are balanced, then it can be written enclosed in square brackets. For example, [NSMutableString *x = [[[NSMutableString alloc] init] autorelease];].

    BRACKETED_STRING := '[' ... deal with nested brackets ... ']'
  3. If you need to express unbalanced square brackets [ ] then you can use a *backtick string*. A backtick string starts at a ` and continues until another ` is found, unless that backtick is immediately followed by another backtick (two backticks insert a single backtick into the backtick string). For example, `This is a ``backtick`` string`.

    BACKTICK_STRING := `(``|[^`])*`
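
The three rules above also tell a writer which form to pick when emitting a string. Here is a minimal Python sketch of that choice; it is not taken from the C implementation, and the names `quote` and `brackets_balanced` are made up for illustration. As a cautious assumption, it also refuses to write a string verbatim if it contains # or `, since those characters start comments and backtick strings.

```python
import re

# Mirrors VERBATIM_STRING above: no whitespace, no brackets of any kind.
_VERBATIM = re.compile(r"[^\[\](){}\s]+")


def brackets_balanced(s: str) -> bool:
    """True if every ']' closes an earlier '[' and no '[' is left open."""
    depth = 0
    for ch in s:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def quote(s: str) -> str:
    """Return s written in the simplest string form that can hold it."""
    # Assumption: '#' and '`' are kept out of verbatim strings to avoid
    # clashing with comments and backtick strings.
    if _VERBATIM.fullmatch(s) and "#" not in s and "`" not in s:
        return s
    if brackets_balanced(s):
        return "[" + s + "]"
    # Fall back to a backtick string, doubling any embedded backticks.
    return "`" + s.replace("`", "``") + "`"
```

For example, `quote("Potato")` returns `Potato`, `quote("John Smith")` returns `[John Smith]`, and a string with an unbalanced bracket falls through to the backtick form.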

The root node is a list containing all top level nodes in the document. As mentioned, you can have any number of top level nodes, including zero.

root := node_list.
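
To make the grammar concrete, here is a small recursive-descent parser sketch in Python. It is an illustration under assumptions, not the C implementation: the function names are hypothetical, strings become Python `str`, a list `(head a b)` becomes `["head", "a", "b"]`, and comments are left out entirely to keep it short.

```python
def parse(text: str) -> list:
    """root := node_list. Returns the list of top-level nodes."""
    nodes, pos = _node_list(text, 0)
    if pos != len(text):
        raise ValueError(f"unexpected {text[pos]!r} at offset {pos}")
    return nodes


def _skip_ws(text, pos):
    while pos < len(text) and text[pos].isspace():
        pos += 1
    return pos


def _node_list(text, pos):
    """node_list := node_list node | (empty). Stops at ')' or end of input."""
    nodes = []
    while True:
        pos = _skip_ws(text, pos)
        if pos == len(text) or text[pos] == ")":
            return nodes, pos
        node, pos = _node(text, pos)
        nodes.append(node)


def _node(text, pos):
    """node := list | string, where list := '(' string node_list ')'."""
    if text[pos] != "(":
        return _string(text, pos)
    pos = _skip_ws(text, pos + 1)
    if pos == len(text) or text[pos] in "()":
        raise ValueError(f"list needs a string head at offset {pos}")
    head, pos = _string(text, pos)
    children, pos = _node_list(text, pos)
    if pos == len(text):
        raise ValueError("unclosed list")
    return [head] + children, pos + 1  # step over ')'


def _string(text, pos):
    """string := BACKTICK_STRING | BRACKETED_STRING | VERBATIM_STRING."""
    if text[pos] == "`":
        out, pos = [], pos + 1
        while pos < len(text):
            if text.startswith("``", pos):  # doubled backtick is a literal `
                out.append("`")
                pos += 2
            elif text[pos] == "`":
                return "".join(out), pos + 1
            else:
                out.append(text[pos])
                pos += 1
        raise ValueError("unclosed backtick string")
    if text[pos] == "[":
        depth, start, pos = 1, pos + 1, pos + 1
        while pos < len(text):
            if text[pos] == "[":
                depth += 1
            elif text[pos] == "]":
                depth -= 1
                if depth == 0:
                    return text[start:pos], pos + 1
            pos += 1
        raise ValueError("unclosed bracketed string")
    start = pos
    while pos < len(text) and not text[pos].isspace() and text[pos] not in "[](){}":
        pos += 1
    if pos == start:
        raise ValueError(f"unexpected {text[pos]!r} at offset {pos}")
    return text[start:pos], pos
```

With this sketch, `parse("(person [John Smith])")` gives `[["person", "John Smith"]]`, an empty document gives `[]`, and `parse("()")` raises a `ValueError` because a list must have a head.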

There are two types of comments. Comments may only occur outside of bracketed and backtick string literals.

  1. Line comments start at # and continue until a CR, LF, CR LF or other newline character sequence is found.

    LINE_COMMENT := #.*$
  2. Block comments start at {# and end when a matching #} is found. Block comments may be nested.

    BLOCK_COMMENT := '{#' ... deal with nested comments ... '#}'
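
Because block comments nest, a plain regex is not enough to find where one ends; a depth counter does the job. The Python sketch below shows one way to skip a comment from a known starting offset. The name `skip_comment` is hypothetical, and the sketch treats only CR and LF as line endings. A tokenizer would call it only outside bracketed and backtick strings, as required above.

```python
def skip_comment(text: str, pos: int) -> int:
    """Return the offset just past the comment that starts at pos.

    If no comment starts at pos, pos is returned unchanged.
    """
    if text.startswith("{#", pos):
        depth = 0
        while pos < len(text):
            if text.startswith("{#", pos):    # opener, possibly nested
                depth += 1
                pos += 2
            elif text.startswith("#}", pos):  # closer for the innermost opener
                depth -= 1
                pos += 2
                if depth == 0:
                    return pos
            else:
                pos += 1
        raise ValueError("unclosed block comment")
    if text.startswith("#", pos):
        # Line comment: stop just before the line ending (CR or LF only here).
        while pos < len(text) and text[pos] not in "\r\n":
            pos += 1
    return pos
```

The line ending itself is left in place, since it is ordinary whitespace between nodes.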

Implementations

I have written an implementation in C. It's available on GitHub.

Other implementations are of course welcome! As are text editor plugins, test cases, documentation, patches, etc. Contact me on GitHub or email anything@fileability.net.

Edit: There's now a very simple TextMate bundle.


  1. self-ml stands for "S-expression like format... markup language". And no, it's not really a markup language.