Previous Up Next

4  Codecs

Utool is intended as a “Swiss Army Knife” for working with underspecified representations, and one of our design goals was therefore to make it compatible with as many current underspecification formalisms as possible. This is not a trivial task, as different formalisms typically differ in crucial details, and naive translations from one formalism to another are typically incorrect for pathological inputs.

As we have explained above, Utool internally works with labelled dominance graphs and uses codecs to translate between these graphs and the actual USRs that the user sees. An input codec transforms a USR in a certain format into an equivalent labelled dominance graph, and an output codec transforms a labelled dominance graph into a USR in a certain format.

Some codecs are quite simple; for instance, the domcon-gxl codec simply deals with one particular concrete syntax for labelled dominance graphs. However, there are also input codecs for MRS and Hole Semantics that do quite a bit of work to compute correct dominance graphs, and rely on nontrivial formal results; and there are output codecs for reading the solutions as terms or translating them into graph file formats that can be displayed using external viewers.

In addition, the Codec API is quite simple, and writing your own codec requires almost no knowledge of the rest of the Utool system. As long as you observe certain rules that codecs have to follow, the rest of the Utool system will simply cooperate with your new codec. In fact, you don't even have to recompile Utool to make your new codec available to it – you can simply package your own codec into a Jar file and add it to the classpath when running Utool.

Below, we will go through the input and output codecs in the current Utool distribution one by one, explain the formal background and the concrete syntax we assume, and discuss their limitations. Then we will explain how to write your own codec.

4.1  The domcon-oz codecs

The domcon-oz codecs deal with representations of dominance constraints [[]Egg etal.2001] as lists of terms of the Oz programming language. An example USR that these codecs can deal with looks as follows:
[label(x f(x1)) label(y g(y1)) label(z a) dom(x1 z) dom(y1 z)]
As you can see, USRs are lists of terms whose head symbols are either label or dom. Each such term stands for an atom of a dominance constraint. The labelling atom label(x f(x1 ... xn)) (where n ≥ 0) expresses that the node x should have the label f and its children in a tree that satisfies the constraint should be the nodes x1 to xn, from left to right. The dominance atom dom(x y) expresses that the node x should be above the node y in a satisfying tree. In addition, the constraint contains an implicit inequality atom xy for any two symbols x and y that appear as the first argument of a label term.

This format was first used in the old CHORUS demo (in 1999 or so), which was written in Mozart Oz. As we know today, weakly normal dominance constraints can be seen as dominance graphs quite directly [[]Koller2004]. If we want to read a domcon-oz USR as a weakly normal dominance graph, we can read the term label(x f(x1 ... xn)) as expressing that the node x in the dominance graph has the label f and the children x1 to xn over tree edges. The term dom(x y) then expresses that there is a dominance edge from x to y.

The domcon-oz input codec considers lines that start with a percent symbol as comments and ignores them. Both the input and the output codec will deal with arbitrary labelled dominance graphs, even if they are not weakly normal. These codecs are associated with the filename extension .clls.

4.2  The domgraph-gxl codecs

The domgraph-gxl codecs deal with representations of a labelled dominance graph in the GXL graph representation language. GXL (http://www.gupro.de/GXL/) is a XML-based standard language for this purpose. A USR in this format looks as follows:
<gxl xmlns:xlink="http://www.w3.org/1999/xlink">
   <graph id="utool-graph" edgeids="true" hypergraph="false" edgemode="directed">
      <node id="x">
         <type xlink:href="root" />
         <attr name="label"><string>f</string></attr>
      </node>
      <edge from="x" to="x1" id="edge0">
        <type xlink:href="solid" />
      </edge>

      <node id="x1">
         <type xlink:href="hole" />
      </node>
      <edge from="x1" to="z" id="edge3">
        <type xlink:href="dominance" />
      </edge>

      <node id="z">
         <type xlink:href="leaf" />
         <attr name="label"><string>a</string></attr>
      </node>
   </graph>
</gxl>
The USR specifies nodes (using node elements), which can have the type root (a node with outgoing tree edges), hole (a node with incoming but no outgoing tree edges), or leaf (a node with not adjacent tree edges). Each node which is not a hole must have an embedded attr element which specifies its label. In addition, the USR specifies edges, which can have the types solid or dominance.

The domgraph-gxl codecs are intended primarily as a portable exchange format for labelled dominance graphs. Both the input and the output codec will deal with arbitrary labelled dominance graphs, even if they are not weakly normal. These codecs are associated with the filename extension .dg.xml.

4.3  The MRS input codecs

Utool supports two input codecs that deal with underspecified representations based upon Minimal Recursion Semantics (MRS; [[]Copestake etal.1999]), the standard scope underspecification formalism used in current HPSG grammars. The mrs-prolog codec deals with MRS expressions in a Prolog style term representation. Alternatively, the mrs-xml codec deals with MRS expressions based upon an XML-style syntax. Both codecs are compatible with the concrete syntax that is used in the LKB system [[]Copestake2002].

Utool doesn't contain any MRS output codecs, because MRS makes some specific assumptions about the underlying object language, and it is not clear that a useful class of labelled dominance graphs can indeed be correctly translated into MRS.

The mrs-prolog input codec.
A concrete example of a Prolog style MRS expression looks as follows:
psoa(h1,e2,
  [ rel('prop-or-ques_m',h1,
        [ attrval('ARG0',e2),
          attrval('MARG',h3),
          attrval('PSV',u4),
          attrval('TPC',u5) ]),
    rel('_every_q',h6,
        [ attrval('ARG0',x7),
          attrval('RSTR',h9),
          attrval('BODY',h8) ]),
    rel('_dog_n_1',h10,
        [ attrval('ARG0',x7) ]),
    rel('_bark_v_1',h11,
        [ attrval('ARG0',e2),
          attrval('ARG1',x7) ]) ],
  hcons([ qeq(h3,h11), qeq(h9,h10) ]))
Here, h1 is the top handle, and e2 an event variable. Terms of the form
 
rel(L, H, [attrval(F, V), ...]) 
represent elementary predications, which pair a quoted string L, a handle H, and a list of feature-value pairs. Features are represented by (quoted) atoms, and values can be either handles, object language individual or event variables, “unspecified” values (see below) and quoted strings. Finally, terms of the form qeq(H1, H2) represent handle constraints. Terms of the form geq(H1, H2) can also be used to express handle constraints.

The codec makes the following assumptions: The codec also accepts terms starting with lowercase u or i, which correspond to values left unspecified by the syntax-semantics interface of the grammar used to derive the MRS expression. These terms are ignored by the input codec.

The codec implements an extended version of the translation of MRS into normal dominance graphs defined by [[]Niehren and Thater2003]. During the translation, the codec makes certain constraints that are implicit in the MRS description explicit by adding dominance edges (this is called normalisation). Most importantly, it adds a dominance edge from each quantifier to its bound variables. We assume that quantifiers are elementary predications which contain exactly the features ARG0, RSTR and BODY. An individual variable is treated as bound if it occurs as the value of some other feature.3

In order to guarantee correctness of the translation, the codec makes two important assumptions about the MRS structures is processes. First, the qeq (and geq) constraints translate simply into dominance edges. This is a conceptional simplification, which however seems to be compatible with the way modern grammars make use of qeq constraints [[]Fuchss etal.2004]. Second, the translation is restricted to MRS expressions whose graph satisfies certain structural restrictions: The MRS must be a net, or equivalently, the resulting dominance graph must be normal, hypernormally connected and leaf-labelled.

If the input MRS is not well-formed (see below) or not a net, the codec reports an error. The error code is obtained by forming the bitwise OR of 192 and the relevant bits in the following table. The codec is able to report multiple errors at once, so e.g. if the input MRS translates into a graph that is neither normal nor leaf-labelled, the error code would be 201.
code symbolic name meaning
1 NOT_NORMAL resulting graph is not normal
2 NOT_WEAKLY_NORMAL resulting graph is not weakly normal
4 NOT_HYPERNORMALLY_CONNECTED resulting graph is not hypernormally connected
8 NOT_LEAF_LABELLED resulting graph is not leaf-labelled
16 NOT_WELLFORMED the input MRS is not well-formed

An MRS is not well-formed if it violates some (in the widest sense) syntactic constraints, for instance, if it contains a variable x but no quantifier that binds x. We should note that this terminology departs from [[]Copestake etal.1999]'s terminology: in that paper, a well-formed MRS must additionally be solvable.

Both MRS codecs have an option normalisation, which can take the values nets (this is the default case) or none. The option value none makes the codec skip the normalisation step and the well-formedness tests. This means that the dominance graph will typically only be a weakly normal and not a normal graph, and the solved forms of the graph will not necessarily correspond one-to-one to the scopings of the MRS. This means that the codec translates the USR incorrectly. However, it can sometimes be useful for debugging.

The input codec is associated with the filename extension .mrs.pl.

The mrs-xml input codec.
In addition to the Prolog style term representation, Utool also supports an XML style syntax. Apart from the concrete syntax (XML as opposed to Prolog terms), the two codecs are exactly the same. The Prolog-style MRS expressions shown above is represented in XML as follows:
<?xml version="1.0"?>
<mrs>
  <var vid="h1"/>
  <ep>
    <pred>prop-or-ques_m_rel</pred>
    <var vid="h1"/>
    <fvpair><rargname>ARG0</rargname><var vid="e2"/></fvpair>
    <fvpair><rargname>MARG</rargname><var vid="h3"/></fvpair>
    <fvpair><rargname>PSV</rargname><var vid="u4"/></fvpair>
    <fvpair><rargname>TPC</rargname><var vid="u5"/></fvpair>
  </ep>
  <ep>
    <pred>_every_q_rel</pred>
    <var vid="h6"/>
    <fvpair><rargname>ARG0</rargname><var vid="x7"/></fvpair>
    <fvpair><rargname>RSTR</rargname><var vid="h9"/></fvpair>
    <fvpair><rargname>BODY</rargname><var vid="h8"/></fvpair>
  </ep>
  <ep>
    <pred>_dog_n_1_rel</pred>
    <var vid="h10"/>
    <fvpair><rargname>ARG0</rargname><var vid="x7"/></fvpair>
  </ep>
  <ep>
    <pred>_bark_v_1_rel</pred>
    <var vid="h11"/>
    <fvpair><rargname>ARG0</rargname><var vid="e2"/></fvpair>
    <fvpair><rargname>ARG1</rargname><var vid="x7"/></fvpair>
  </ep>
  <hcons hreln="qeq">
    <hi><var vid="h3"/></hi>
    <lo><var vid="h11"/></lo>
  </hcons>
  <hcons hreln="qeq">
    <hi><var vid="h9"/></hi>
    <lo><var vid="h10"/></lo>
  </hcons>
</mrs>
This codec supports the same normalisation option as mrs-prolog. It is associated with the filename extension .mrs.xml.

4.4  The holesem-comsem input codec

The holesem-comsem input codec can read USRs of the Hole Semantics formalism. Hole Semantics [[]Bos1996] is a rather popular underspecification formalism because it is conceptually very accessible (underspecify formulas by allowing them to have holes which can be plugged by other formulas). We assume the concrete syntax used in the Prolog system that accompanies the Computational Semantics textbook [[]Blackburn and Bos2005]; that is, it should be possible to use all USRs generated by the book software with this codec. USRs in this syntax are Prolog terms that look e.g. as follows:
some(A, some(B, some(C, some(X, and(label(A), and(hole(B),
and(label(C), and(some(A, X, B), and(pred1(C,man,X), leq(C,B))))))))))
Here A and C are labels, B is a hole, and X is an object variable of the intended semantic representation, which will be bound by an existential quantifier. All four symbols are introduced by the outer some terms. In addition, the term some(A,X,B) expresses that the formula under the label A is ∃ X. B, and the term pred1(C,man,X) expresses that the formula under the label C is man(X). The term leq(C,B) says that the label C must be below the hole B.

The codec supports USRs built from the logical symbols hole, label, some, and, or, imp, not, all, leq, que, eq, pred1, pred2, and arbitrary non-logical symbols. It does not support the symbols lam and app that are used in intermediate representations during semantics construction by Blackburn and Bos, as these symbols are beyond the scope of ordinary Hole Semantics USRs (they don't relate holes and labels, but entire Hole Semantics USRs).

This codec makes use of the theoretical result that Hole Semantics USRs that are hypernormally connected and leaf-labelled can be translated into equivalent labelled dominance graphs [[]Koller etal.2003]. The translation itself is not that complicated, but the correctness proof is not trivial. The codec checks whether the resulting graph is normal, leaf-labelled, and hypernormally connected. If it isn't, the codec reports this as a semantic error with one of the following exit codes:
code symbolic name meaning
193 ERROR_GRAPH_NOT_NORMAL resulting graph is not normal
194 ERROR_GRAPH_NOT_HNC resulting graph is not hypernormally connected
195 ERROR_GRAPH_NOT_LEAF_LABELLED resulting graph is not leaf-labelled
196 ERROR_MULTIPLE_PARENTS graph has a node with more than one parent

Although every normal, hypernormally connected, and leaf-labelled dominance graph can in principle be translated back into an equivalent Hole Semantics USR [[]Koller etal.2003], there is no corresponding output codec. This is because the concrete syntax imposes a number of very inconvenient restrictions on the USRs that it can express. For instance, all nodes that are not holes must have exactly one or two children via tree edges, and all non-holes whose children are not object-level variables must be labelled with a connective of predicate logic. This has the consequence that only a tiny minority of all labelled dominance graphs can actually be encoded given this syntax, which made the encoding more difficult to implement and debug than we felt it was worth.

The input codec is associated with the filename extension .hs.pl.

4.5  The domgraph-udraw and domgraph-dot output codecs

As an alternative to the display command, one can use utool to convert underspecified representations into formats which can be further processed by other graph viewers:
  1. The domgraph-udraw output-codec outputs a dominance graph in a format that can be used to display the graph with the uDraw(Graph) tool (formerly daVinci) from the University of Bremen (http://www.informatik.uni-bremen.de/uDrawGraph/en/index.html);
  2. the domgraph-dot codec outputs a dominance graph in the “dot” format supported by many graph drawing tools, such as graphviz (http://graphviz.org/).
We don't describe these output formats in detail here, but you can write the output into a file and then load it from the respective graph viewers. Note that these commands will only produce meaningful output if you use them from the convert command, rather than the solve command, as neither graph format supports the representation of multiple graphs (i.e. the multiple solved forms computed by solve) at once.

As a further convenience, the domgraph-udraw accepts the codec option pipe. If you use this codec option, you can directly feed Utool's output to uDraw(Graph) via a pipe instead of writing the graph to a file and then opening this file from uDraw(Graph), like so:
utool convert foo.clls -O domgraph-udraw --output-codec-options pipe=true \
   | uDrawGraph -pipe
The domgraph-dot codec accepts the codec option enforceEdgeOrder. If this option is set to true, the dot representation computed by the codec will contain an instruction that forces the Dot/Graphviz layout algorithm to respect the left-to-right order of edges that come out of the same node. This is a good thing where tree edges are concerned, as their relative order corresponds to e.g. the restriction and scope of a quantifier. However, because of limitations in the dot format, the ordering constraint will also apply to dominance edges. This is not only unnecessary (there is no meaningful order between outgoing dominance edges), but it is also inconvenient, as the graph layout for nontrivial graphs will typically not be pretty. The enforceEdgeOrder option is thus set to false by default.

Both codecs are able to encode any labelled dominance graph, even if it is not weakly normal, thus they never report any semantic errors. The domgraph-udraw codec is associated with the filename extension .dg.udg, and the domgraph-dot codec is associated with the extension .dg.dot.

4.6  The plugging output codecs

The plugging-oz and plugging-lkb output codecs display a dominance graph in solved form as a list of pairs (hole, root) that specify the dominance edges in the solved form (in Hole Semantics terminology, this is a plugging). They differ only in the concrete syntax:

plugging-oz is intended as a convenient output format if you only want to see the dominance edges. On the other hand, plugging-lkb is practically relevant because it displays pluggings in the format that the LKB workbench expects from its own MRS solver. This means that Utool with the mrs-prolog input codec and the plugging-lkb output codec can be used as a drop-in replacement for the LKB's own MRS solver (see also Section 5.343Integration with the LKB Workbenchsubsection.5.3).

These two codecs are only intended to be run on dominance graphs in solved form. They will also accept any other dominance graph and will then display its dominance edges, but this is less meaningful than displaying the dominance edges in a solved form. The plugging-oz codec is associated with the filename extension .plug.oz, and the plugging-lkb codec is associated with the extension .lkbplug.lisp.

4.7  The term output codecs

The term-oz and term-prolog output codecs can be used to encode labelled dominance graphs in simple solved form, i.e. solved forms which are trees and in which each hole has exactly one outgoing dominance edge. They traverse these trees top-down and print the ground term that corresponds to the tree structure. Their output looks as follows: The only difference between the two codecs is that the Prolog codec separates arguments of a term by commas, whereas the Oz codec separates them with whitespace. These two codecs are intended as human-readable representations of solved forms.

Both codecs assume that the solved form they encode is simple and leaf-labelled, and will report an error code of 225 if it isn't. term-oz is associated with the filename extension .t.oz, and term-prolog with .t.pl.

4.8  The chain input codec

The chain input codec will generate the pure chain [[]Koller2004] of a given length. A chain is a zig-zag graph consisting of upper and lower fragments that are connected by dominance edges; the pure chain of length 3 is shown in Fig. 111Running Utool from the command linefigure.1. Chains appear frequently as parts of linguistically motivated USRs, and are therefore a nice basis for benchmarking (e.g. in [[]Bodirsky etal.2004]).

The “underspecified descriptions” that this codec expects in server mode are simply string representations of numbers (such as the string 3). The codec will then generate the pure chain of this length. When used with the command-line version of Utool, chain behaves differently than all other codecs discussed so far in that it doesn't interpret its argument as a filename, but again directly as the chain length. This means that you can use a Utool call as follows:
$ utool convert -I chain 3 -O domcon-oz
The codec will report a parsing error (code 192) if the chain length specification isn't a number, and a semantic error with code 193 if the number isn't positive. Because chain doesn't read its USRs from files, it is not associated with any filename extension.

4.9  Writing your own codecs

Although Utool comes with a collection of codecs that cover many existing popular underspecification formalisms, there are some formalisms we don't support (yet), and we can expect that other formalisms will be developed in the future. To this end, it may be helpful for you to write your own codec. We will now describe how this is done. Notice that you don't have to rebuild Utool just to add codecs.

Codec classes.
A codec is a public, non-abstract class that is derived from one of the abstract base classes InputCodec (if it is an input codec) or OutputCodec (for an output codec), both of which belong to the package de.saar.chorus.domgraph.codec. Codecs are typically not instantiated directly by the programmer. They are registered in a codec manager (an object of class CodecManager), which has methods for querying and instantiating codecs. The codec manager separates codecs from the rest of the Utool system and enforces a uniform interface that makes sure that all components of the system interact properly with the codecs: Once you have implemented and registered a new codec, it will automatically be usable from the command line, the server, and the GUI.

The abstract base class InputCodec defines a method decode, which you must implement when you develop an input codec. The method takes a Reader, a DomGraph, and a NodeLabels object as arguments. Its job is to read an USR from the Reader, translate it into a labelled dominance graph, and then change the DomGraph and the NodeLabels to this graph. If an error occurs during this decoding process, it may throw an IOException, a ParserException (if the USR was syntactically not well-formed), or a MalformedDomgraphException (if the USR could not be translated to a dominance graph for semantic reasons). The Utool exit codes (see Section 3.416Exit codessubsection.3.4) corresponding to these errors are 128 for the I/O error, 192 for the parse error, and 192+N for the semantic errors, where N is an integer between 1 and 31 (inclusive) which you can specify when you construct the exception.

Conversely, the class OutputCodec defines a method encode, which accepts a DomGraph and a NodeLabels, translates them into your output format, and writes the result to a Writer. The method may throw IOExceptions and MalformedDomgraphExceptions, with the same meaning as above. The exit codes available to an output codec are 128 for I/O errors and 224 + N for semantic errors. In addition, the codec must implement the methods print_header and print_footer, which are called by Utool at the very beginning and end of the encoding process; the builtin codecs use them to print a version header at the beginning of the output files. It is a good idea to call flush() on the writer in the footer method.

If your output codec supports the output of multiple USRs into the same file, it should be derived from the class MultiOutputCodec, which is itself derived from OutputCodec. Only MultiOutputCodecs can be used as the output codec of a solve command on the command line, or the “Export Solved Forms” menu entry in Ubench. MultiOutputCodec inherits all abstract methods from OutputCodec and adds three more: The methods print_start_list and print_end_list are called before printing the first solved form (but after print_header) and after printing the last solved form (but before print_footer), and print_list_separator is called between any two encode calls for subsequent solved forms. Notice that print_start_list and print_end_list are called only for results of solve and “Export Solved Forms”, whereas print_header and print_footer are called in these cases and also for the convert and “Export” commands.

Annotations.
Every codec class must be annotated with a @CodecMetadata annotation. This annotation has two required arguments name and extension, both of which are of type String. The former specifies the codec's name; there may be no two input codecs and no two output codecs of the same name. The latter enables Utool to associate filenames with codecs: A file whose name ends with the specified extension will be automatically processed with this codec. If you pass an empty string for the extension, your codec will not be associated with any filename. In addition, you may mark your codec as experimental by passing the option experimental=true to the metadata annotation.

Thus, a typical declaration of an input codec would look as follows:
@CodecMetadata(name="mrs-prolog", extension=".mrs.pl")
public class MrsInputCodec extends InputCodec {
 ...
The next requirement is that each codec must declare exactly one codec constructor, which will be used whenever the codec manager wants to create a new instance of the codec. If the codec class has exactly one public constructor, this constructor is used as the codec constructor. (If the class declares no constructors at all, the argumentless default constructor will do.) On the other hand, if the class declares more than one public constructor, exactly one of them must be annotated with @CodecConstructor, and this annotated constructor will be used as the codec constructor.

The codec constructor must not be declared as throwing any checked exceptions. In addition, each parameter of the codec constructor must have an @CodecOption annotation. This annotation takes a required string argument name, which gives this particular parameter a public name. It may also take an argument defaultValue, which defines a default value for when the user doesn't specify an explicit value. If you specify no defaultValue, the empty string will be used.

Thus, a typical declaration of a codec constructor would look as follows:
@CodecConstructor
public DomgraphUdrawOutputCodec(@CodecOption(name="pipe", defaultValue="false") 
                                boolean usePipe) {
 ...
This way of declaring codec constructors and codec options looks a bit complicated at first sight, but is quite convenient in various respects. For one thing, Utool is aware of the codec options, and will decode them automatically. If you select the File/Export menu entry in Ubench and choose the domgraph-udraw input codec, you will be offered a button “Option”, which will reveal an option selection panel. This panel will contain a checkbox with the label “pipe”, because your codec constructor has a parameter with name “pipe” and type boolean; that is, it uses information about the parameter types to offer the user appropriate input forms. Similarly, the Utool command-line tool will be able to parse an argument --output-codec-options pipe=true, and the server will likewise deal with the option; these modes use the parameter types to decode the argument strings into appropriate values. For this reason, every parameter of the codec constructor must have one of the following types: On the other hand, the way that codecs represent metadata stays out of your way if you don't need it. Every codec class must have a @CodecMetadata annotation. But beyond that, there is no need for a @CodecConstructor annotation if there is a unique public constructor to begin with; and if the constructor has no parameters, nothing needs to be annotated with @CodecOption. The latter situation accounts for the majority of the builtin codecs.

Registration.
The final step of the story is to make Utool aware of your new codec. This is handled by creating a file de/saar/chorus/domgraph/codec/codecclasses.properties somewhere in your classpath. In this file, you list the fully qualified classname(s) of your new codec(s), one per line. Utool will then automatically read this file and try to register your codecs. If a codec fails to register (typically because an annotation was missing), Utool will be terminated with an error message and exit code 142.

For instance, let's say that you have implemented and compiled a codec class foo.bar.MyCodec. You would then create a codecclasses.properties file with the following contents:
foo.bar.MyCodec
Then you would start a MyCodec-enhanced version of Utool as follows:
$ jar cf mycodec.jar foo/bar/MyCodec.class \
    de/saar/chorus/domgraph/codec/codecclasses.properties
$ java -cp Utool.jar:mycodec.jar de.saar.chorus.domgraph.utool.Utool -d

Previous Up Next