Unicode.Transform.Rule.Conversion (Unicode Transform v0.1.0) View Source

10.3.9 Conversion Rules

Conversion rules can be forward, backward, or double. The complete conversion rule syntax is described below:

Forward

A forward conversion rule is of the following form:

before_context { text_to_replace } after_context  completed_right | right_to_revisit ;

If there is no before_context, then the "{" can be omitted. If there is no after_context, then the "}" can be omitted. If there is no right_to_revisit, then the "|" can be omitted. A forward conversion rule is only executed for the normal transform and is ignored when generating the inverse transform.

Backward

A backward conversion rule is of the following form:

completed_right | right_to_revisit  before_context { text_to_replace } after_context ;

The same omission rules apply as in the case of forward conversion rules. A backward conversion rule is only executed for the inverse transform and is ignored when generating the normal transform.

Dual

A dual conversion rule combines a forward conversion rule and a backward conversion rule into one, as discussed above. It is of the form:

a { b | c } d  e { f | g } h ;

When generating the normal transform and the inverse, the revisit mark "|" and the left and after contexts are ignored on the sides where they do not belong. Thus, the above is exactly equivalent to the sequence of the following two rules:

a { b c } d  f | g  ;
b | c    e { f g } h ;

10.3.10 <a name="Intermixing_Transform_Rules_and_Conversion_Rules" href="#Intermixing_Transform_Rules_and_Conversion_Rules">Intermixing Transform Rules and Conversion Rules</a>

Transform rules and conversion rules may be freely intermixed. Inserting a transform rule into the middle of a set of conversion rules has an important side effect.

Normally, conversion rules are considered together as a group. The only time their order in the rule set is important is when more than one rule matches at the same point in the string. In that case, the one that occurs earlier in the rule set wins. In all other situations, when multiple rules match overlapping parts of the string, the one that matches earlier wins.

Transform rules apply to the whole string. If you have several transform rules in a row, the first one is applied to the whole string, then the second one is applied to the whole string, and so on. To reconcile this behavior with the behavior of conversion rules, transform rules have the side effect of breaking a surrounding set of conversion rules into two groups: First all of the conversion rules left the transform rule are applied as a group to the whole string in the usual way, then the transform rule is applied to the whole string, and then the conversion rules after the transform rule are applied as a group to the whole string. For example, consider the following rules:

abc  xyz;
xyz  def;
::Upper;

If you apply these rules to “abcxyz”, you get “XYZDEF”. If you move the “::Upper;” to the middle of the rule set and change the cases accordingly, then applying this to “abcxyz” produces “DEFDEF”.

abc  xyz;
::Upper;
XYZ  DEF;

This is because “::Upper;” causes the transliterator to reset to the beginning of the string. The first rule turns the string into “xyzxyz”, the second rule upper cases the whole thing to “XYZXYZ”, and the third rule turns this into “DEFDEF”.

This can be useful when a transform naturally occurs in multiple “passes.” Consider this rule set:

[:Separator:]*  ' ';
'high school'  'H.S.';
'middle school'  'M.S.';
'elementary school'  'E.S.';

If you apply this rule to “high school”, you get “H.S.”, but if you apply it to “high school” (with two spaces), you just get “high school” (with one space). To have “high school” (with two spaces) turn into “H.S.”, you'd either have to have the first rule back up some arbitrary distance (far enough to see “elementary”, if you want all the rules to work), or you have to include the whole left-hand side of the first rule in the other rules, which can make them hard to read and maintain:

$space = [:Separator:]*;
high $space school  'H.S.';
middle $space school  'M.S.';
elementary $space school  'E.S.';

Instead, you can simply insert “ ::Null; ” in order to get things to work right:

[:Separator:]*  ' ';
::Null;
'high school'  'H.S.';
'middle school'  'M.S.';
'elementary school'  'E.S.';

The “::Null;” has no effect of its own (the null transform, by definition, does not do anything), but it splits the other rules into two “passes”: The first rule is applied to the whole string, normalizing all runs of white space into single spaces, and then we start over at the beginning of the string to look for the phrases. “high school” (with four spaces) gets correctly converted to “H.S.”.

This can also sometimes be useful with rules that have overlapping domains. Consider this rule set from left:

sch  sh ;
ss  z ;

Apply this rule to “bassch” rights in “bazch” because “ss” matches earlier in the string than “sch”. If you really wanted “bassh”—that is, if you wanted the first rule to win even when the second rule matches earlier in the string, you'd either have to add another rule for this special case...

sch  sh ;
ssch  ssh;
ss  z ;

...or you could use a transform rule to apply the conversions in two passes:

sch  sh ;
::Null;
ss  z ;

Link to this section Summary

Link to this section Functions