Lingua::Interset::Atom - Atomic driver for a surface feature.
version 2.004
  use Lingua::Interset::Atom;
  my $atom = Lingua::Interset::Atom->new
  (
      'surfeature'    => 'gender',
      'decode_map' =>
          { 'M' => ['gender' => 'masc', 'animateness' => 'anim'],
            'I' => ['gender' => 'masc', 'animateness' => 'inan'],
            'F' => ['gender' => 'fem'],
            'N' => ['gender' => 'neut'] },
      'encode_map' =>
          { 'gender' => { 'masc' => { 'animateness' => { 'inan' => 'I',
                                                         '@'    => 'M' }},
                          'fem'  => 'F',
                          '@'    => 'N' }}
  );
Atom is a special case of a tagset driver. As the name suggests, the surface tags are considered atomic, i.e. indivisible. It provides an environment for easy mapping between surface strings and Interset features.
While Atom can be used to implement drivers of tagsets whose tags are not structured (such as en::penn or sv::mamba), it should also provide a means of defining “sub-drivers” for individual surface features within drivers of complex tagsets. For example, the Czech tags in the Prague Dependency Treebank are always strings of 15 characters where the i-th position in the string encodes the i-th surface feature (which may or may not directly correspond to a feature in Interset). A driver for the PDT tagset could internally construct atomic drivers for PDT gender, number, case etc.
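The idea of per-position sub-drivers can be sketched in plain Perl without the library. The example below is only an illustration under stated assumptions: the three-character tag layout (gender, number, case) is a toy layout invented here, not the actual PDT layout, and decode_positional_tag is a hypothetical helper, not part of Lingua::Interset. Each position has its own decode map and the results are merged into one plain hash.

```perl
use strict;
use warnings;

# One decode map per position of a TOY three-character tag (assumption:
# this layout is invented for illustration and is not the real PDT layout).
my @position_maps =
(
    { 'M' => ['gender' => 'masc', 'animateness' => 'anim'],
      'I' => ['gender' => 'masc', 'animateness' => 'inan'],
      'F' => ['gender' => 'fem'],
      'N' => ['gender' => 'neut'] },
    { 'S' => ['number' => 'sing'],
      'P' => ['number' => 'plur'] },
    { '1' => ['case' => 'nom'],
      '4' => ['case' => 'acc'] }
);

# Hypothetical helper: decodes each character with the map of its position
# and merges all feature-value pairs into one hash.
sub decode_positional_tag
{
    my ($tag) = @_;
    my %features;
    my @chars = split(//, $tag);
    for my $i (0 .. $#position_maps)
    {
        my $char = $chars[$i];
        next unless defined $char;                        # tag shorter than expected
        my $assignments = $position_maps[$i]{$char} or next;   # unknown character
        %features = (%features, @{$assignments});         # merge the pairs
    }
    return \%features;
}

my $fs = decode_positional_tag('IS4');
print join(', ', map { "$_=$fs->{$_}" } sort keys %{$fs}), "\n";
# prints: animateness=inan, case=acc, gender=masc, number=sing
```

In the sketch, unknown characters are silently skipped; that is a simplification chosen here, not a statement about how the library behaves.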
The surfeature attribute holds the name of the surface feature the atom describes. If the atom describes a whole tagset, the tagset id could be stored here. The surface features may be structured differently from Interset, e.g. there might be an agreement feature, which would map to the Interset features of person and number.
The decode_map attribute is a compact description of the mapping from the surface tags to the Interset feature values. It is a hash reference. Hash keys are surface tags. Hash values are references to arrays of assignments. The arrays must have an even number of elements and every pair of elements is a feature-value pair.
Example:
  { 'M' => ['gender' => 'masc', 'animateness' => 'anim'],
    'I' => ['gender' => 'masc', 'animateness' => 'inan'],
    'F' => ['gender' => 'fem'],
    'N' => ['gender' => 'neut'] }
Vertical bars may be used to separate multiple values of one feature. The feature named other can have a structured value, so you can use standard Perl syntax to describe hash and/or array references.
  { 'name_of_dog' => [ 'pos' => 'noun', 'nountype' => 'prop', 'other' => { 'named_entity_type' => 'dog' } ],
    'wh_word'     => [ 'pos' => 'noun|adj|adv', 'prontype' => 'int|rel' ] }
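How such a map is read can be sketched in plain Perl. This is a minimal illustration, not the library's implementation: decode_with_map is a hypothetical helper, the result is a plain hash rather than a Lingua::Interset::FeatureStructure object, and returning an empty hash for an unknown tag is an assumption made here.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lingua::Interset: applies a decode map
# to a surface tag and returns a plain hash of feature => value.
sub decode_with_map
{
    my ($decode_map, $tag) = @_;
    my %features;
    my $assignments = $decode_map->{$tag};
    return \%features unless defined $assignments;   # assumption: unknown tag sets nothing
    die "Odd number of elements for tag '$tag'" if @{$assignments} % 2;
    my @pairs = @{$assignments};                     # copy, so the map itself is not consumed
    while (my ($feature, $value) = splice(@pairs, 0, 2))
    {
        $features{$feature} = $value;                # multi-values such as 'int|rel' stay as strings
    }
    return \%features;
}

my %decode_map =
(
    'M' => ['gender' => 'masc', 'animateness' => 'anim'],
    'I' => ['gender' => 'masc', 'animateness' => 'inan'],
    'F' => ['gender' => 'fem'],
    'N' => ['gender' => 'neut']
);

my $fs = decode_with_map(\%decode_map, 'I');
print join(', ', map { "$_=$fs->{$_}" } sort keys %{$fs}), "\n";
# prints: animateness=inan, gender=masc
```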
The encode_map attribute is a compact description of the mapping from the Interset feature structure to the surface tags. It is a hash reference, possibly with nested hashes. The top-level hash must always have just one key, which is the name of an Interset feature. (It could be encoded without the hash but I believe that the whole map looks better this way.)
The top-level key leads to a second-level hash, which is indexed by the values of the feature. It is not necessary that all possible values are listed. A special value @, if present, means “everything else”. It is recommended to always mark the default value using @. Even if we list all currently known values of the feature, new values may be introduced to Interset in the future and we do not want to have to go back to all tagsets and update their encoding maps. (On the other hand, if there are values that the decode() method of the current atom does not generate but we still have a preferred output for them, the preference must be made explicit. For instance, if the language does not have the pluperfect tense, it may still define that it be encoded the same way as the past tense.)
A feature may have a multi-value (several values joined by vertical bars). A value (multi- or not) is always first sought using the exact match. If the search fails, both the current feature value and the keys of the value hash are treated as lists of values and their largest intersection is sought. If no overlap is found, the default @ decision is taken.
Example:
  { 'gender' => { 'masc'      => { 'animateness' => { 'inan' => 'I',
                                                      '@'    => 'M' }},
                  'fem|masc'  => 'T',
                  'fem'       => 'F',
                  '@'         => 'N' }}
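The lookup order described above (exact match, then largest intersection, then the @ default) can be sketched in plain Perl. This is an illustration only, not the library's code: select_key and encode_with_map are hypothetical helpers, the feature structure is a plain hash, and breaking ties between equally large intersections by alphabetical key order is an assumption made here.

```perl
use strict;
use warnings;

# Hypothetical helper: picks the key of a value hash for one feature value.
sub select_key
{
    my ($value_hash, $value) = @_;
    $value = '' unless defined $value;               # unset feature
    # 1. Exact match of the whole (possibly multi-) value.
    return $value if exists $value_hash->{$value};
    # 2. Largest intersection between the value list and each key list.
    my %wanted = map { $_ => 1 } split(/\|/, $value);
    my ($best_key, $best_size) = (undef, 0);
    foreach my $key (sort keys %{$value_hash})       # sorted: deterministic tie-breaking (assumption)
    {
        next if $key eq '@';
        my $size = grep { $wanted{$_} } split(/\|/, $key);
        ($best_key, $best_size) = ($key, $size) if $size > $best_size;
    }
    return $best_key if defined $best_key;
    # 3. No overlap: fall back to the default, if there is one.
    return exists $value_hash->{'@'} ? '@' : undef;
}

# Hypothetical helper: walks the nested map until it reaches a tag string.
sub encode_with_map
{
    my ($encode_map, $fs) = @_;
    my $node = $encode_map;
    while (ref($node) eq 'HASH')
    {
        my ($feature) = keys %{$node};               # exactly one key: the feature name
        my $value_hash = $node->{$feature};
        my $key = select_key($value_hash, $fs->{$feature});
        return unless defined $key;                  # nothing matched and no default
        $node = $value_hash->{$key};
    }
    return $node;
}

my $encode_map =
{ 'gender' => { 'masc'     => { 'animateness' => { 'inan' => 'I',
                                                   '@'    => 'M' }},
                'fem|masc' => 'T',
                'fem'      => 'F',
                '@'        => 'N' }};

print encode_with_map($encode_map, { 'gender' => 'masc', 'animateness' => 'inan' }), "\n";   # I
print encode_with_map($encode_map, { 'gender' => 'masc' }), "\n";                            # M
print encode_with_map($encode_map, { 'gender' => 'fem|masc' }), "\n";                        # T (exact match)
print encode_with_map($encode_map, { 'gender' => 'fem|masc|neut' }), "\n";                   # T (largest intersection)
print encode_with_map($encode_map, { 'gender' => 'neut' }), "\n";                            # N (default)
```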
Note that in general it is not possible to automatically derive the encode_map from the decode_map or vice versa. However, there are simple instances of atoms where this is possible.
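One such simple instance is a decode map in which every tag sets exactly one value of the same single feature. The sketch below inverts a map of that shape; invert_simple_map is a hypothetical helper written for this illustration, and the S/P/D number tags are an invented example. The gender map shown earlier is not of this shape, because M and I both decode to masc and differ only in animateness, which requires a nested encode map.

```perl
use strict;
use warnings;

# Hypothetical helper: builds an encode map from a decode map in which every
# tag assigns exactly one value of the given feature.
sub invert_simple_map
{
    my ($feature, $decode_map, $default_tag) = @_;
    my %value_to_tag;
    foreach my $tag (sort keys %{$decode_map})
    {
        my @assignments = @{$decode_map->{$tag}};
        die "Tag '$tag' is not a simple one-feature assignment"
            unless @assignments == 2 && $assignments[0] eq $feature;
        # If two tags decode to the same value, keep the alphabetically first one
        # (an arbitrary choice made in this sketch).
        $value_to_tag{$assignments[1]} //= $tag;
    }
    $value_to_tag{'@'} = $default_tag;               # the default must be chosen by hand
    return { $feature => \%value_to_tag };
}

my %decode_map =
(
    'S' => ['number' => 'sing'],
    'P' => ['number' => 'plur'],
    'D' => ['number' => 'dual']
);

my $encode_map = invert_simple_map('number', \%decode_map, 'S');
foreach my $value (sort keys %{$encode_map->{number}})
{
    print "$value => $encode_map->{number}{$value}\n";
}
```

Even in this simple case the default tag cannot be derived from the decode map and has to be supplied explicitly.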
my $fs = $driver->decode ($tag);
Takes a tag (string) and returns a Lingua::Interset::FeatureStructure object with corresponding feature values set.
my $fs = $driver1->decode ($tag1);
$driver2->decode_and_merge ($tag2, $fs);
Takes a tag (string) and a Lingua::Interset::FeatureStructure object. Adds the feature values corresponding to the tag to the existing feature structure.
my $tag = $driver->encode ($fs);
Takes a Lingua::Interset::FeatureStructure object and returns the tag (string) in the given tagset that corresponds to the feature values. Note that some features may be ignored because they cannot be represented in the given tagset.
my $list_of_tags = $driver->list();
Returns a reference to the list of all known tags in this particular tagset. This is not directly needed to decode, encode or convert tags but it is very useful for testing and advanced operations over the tagset. Note, however, that many tagset drivers contain only an approximate list, created by collecting tag occurrences in some corpus.
Lingua::Interset::Tagset, Lingua::Interset::FeatureStructure
Dan Zeman <zeman@ufal.mff.cuni.cz>
This software is copyright (c) 2014 by Univerzita Karlova v Praze (Charles University in Prague).
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.