How to enter linguistic data


Words must have the following format:

     [Z ':'] word

Z is the moment in time at which the word entered the language being calculated (since it stands inside '[]', Z is optional). If you indicate a time, no rule dated before that moment is applied to the word. This value overrides any default given by the command line option "/start=x". Note that the time must be followed by a colon (':'). Whenever a character stands in quotation marks (like ':'), you must write that character literally (see below).
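
The following Python sketch illustrates how the optional time prefix could be separated from the word. It is not part of ETYMO: the function name split_time_prefix and the parameter default_start (standing in for the value of "/start=x") are hypothetical, and it assumes that Z is written as a non-negative integer.

```python
def split_time_prefix(entry, default_start=None):
    """Split 'Z:word' into (Z, word).

    Assumption: Z is a non-negative integer. When no time is given,
    default_start (a stand-in for the "/start=x" option) is returned instead.
    """
    head, sep, tail = entry.partition(":")
    if sep and head.strip().isdigit():
        # An explicit time overrides the default.
        return int(head.strip()), tail.strip()
    return default_start, entry.strip()


if __name__ == "__main__":
    print(split_time_prefix("100:@#men|sam#"))
    print(split_time_prefix("mensa", default_start=50))
```

An explicit time always wins over the default, which mirrors the override rule described above.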

The word itself has the following format:

     { L [opt_features] [opt_frontiers] }

L stands for a literal. Opt_features and opt_frontiers stand for optional features and frontiers (they stand inside '[]', so they are optional). The curly braces ({}) mean that any number of literals may follow one another. (For example, "mensa" is a word with 5 literals.) Opt_features and opt_frontiers have the following formats:

     opt_features     '[' '+' f { ',' '+' f } ']'
     opt_frontiers    ['|'] ['+'] ['#']

"f" stands for any valid feature, i.e. a feature that you have defined in the rule catalogue. You may add as many features as you want (that is why they stand inside '{}'). Just separate them by commas (,). Frontiers can be any combination of '|', '+' and '#'. Note that the order is important (so '|+#', '|#', '+#' etc. are correct frontiers, but '#+' and '+|' are not).
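
As an illustration of the grammar above, here is a minimal Python sketch that splits a word into its literals with their features and frontiers. It is not ETYMO's own parser. The names LITERAL and parse_word are hypothetical, and the sketch assumes that every literal is a single character and that feature names contain none of ',', '+' or ']'. It does not check whether a feature is defined in the rule catalogue.

```python
import re

# One literal: a single character, then an optional "[+f,+g]" feature list,
# then optional frontiers in the fixed order '|', '+', '#'.
LITERAL = re.compile(
    r"([^\[\]|+#,])"                         # the literal itself
    r"(?:\[(\+[^,\]+]+(?:,\+[^,\]+]+)*)\])?"  # optional features
    r"(\|?\+?#?)"                            # optional frontiers, order enforced
)


def parse_word(word):
    """Return a list of (literal, features, frontiers) tuples, or raise ValueError."""
    pos, result = 0, []
    while pos < len(word):
        m = LITERAL.match(word, pos)
        if m is None:
            raise ValueError(f"unexpected {word[pos]!r} at position {pos}")
        # Drop the leading '+' of every feature name.
        features = [f[1:] for f in m.group(2).split(",")] if m.group(2) else []
        result.append((m.group(1), features, m.group(3)))
        pos = m.end()
    return result


if __name__ == "__main__":
    print(parse_word("men|sa#"))
```

Because the frontier pattern accepts each character at most once and only in the order '|', '+', '#', an input such as "a#+" is rejected: after "a#" has been consumed, the remaining '+' cannot start a new literal.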

Special remarks: The character '@' is a literal that refers to the dummy-sound (= the very first sound of the word, which is not really a sound because it only contains frontiers -> see tutorial). If you want to set certain features of the dummy-sound (e.g. grammatical features), you must write '@' at the beginning of the word:

     @[+subst]#men|sam#

means that the dummy-sound of the word "mensam" receives the grammatical feature "+subst" and a #-frontier. To make things easier for the user, the '@' may be omitted at the beginning of a word that is not preceded by a time (Z). So "@[+subst]#men|sam#" and "[+subst]#men|sam#" are equivalent, as are "@#men|sam#" and "#men|sam#". However, if the word is preceded by a time, you must write '@':

     100:@[+subst]#men|sam#

(it is not possible to write "100:[+subst]#mensam#" or "100:#men|sam#").
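
The '@' rule can be summarised in a small Python sketch. This is only an illustration under assumptions, not ETYMO's behaviour in detail: the names DUMMY_MARKERS and normalize_entry are hypothetical, the time is assumed to be a non-negative integer, and an entry is taken to address the dummy-sound whenever the word starts with '[' or with a frontier character.

```python
# Characters that can only belong to the dummy-sound when they open a word.
DUMMY_MARKERS = ("[", "|", "+", "#")


def normalize_entry(entry):
    """Make the dummy-sound '@' explicit; reject timed entries that omit it."""
    head, sep, tail = entry.partition(":")
    timed = bool(sep) and head.strip().isdigit()
    word = tail if timed else entry
    if word.startswith(DUMMY_MARKERS):
        if timed:
            # After a time, the shorthand without '@' is not allowed.
            raise ValueError("a word preceded by a time must start with '@'")
        return "@" + word
    return entry


if __name__ == "__main__":
    print(normalize_entry("[+subst]#men|sam#"))
    print(normalize_entry("100:@#men|sam#"))
```

Normalising untimed entries first means that any later processing step only has to deal with the explicit '@' form.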


The best way to evaluate a rule catalogue is to apply it to a large number of words. To do so, ETYMO offers the possibility to define a "corpus", which must have the following format:

    word '>' corr_evol { ',' corr_evol }

"Word" is the word with which the calculation starts. "Corr_evol" are the correct evolution(s) of this word (you can give as many as you like). The output of the calculation is written to the file "STAT.LOG", which has the following format:
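
To illustrate the corpus format, the following Python sketch reads one corpus line into the start word and its list of correct evolutions. The function name parse_corpus_line is hypothetical and the sample words are made up. The sketch assumes that whitespace around '>' and ',' is insignificant and that '[...]' feature lists are not nested, so that commas inside a feature list are not mistaken for separators.

```python
import re


def parse_corpus_line(line):
    """Split "word > corr_evol, corr_evol, ..." into (word, [corr_evol, ...])."""
    word, sep, rest = line.partition(">")
    if not sep or not word.strip():
        raise ValueError("expected: word '>' corr_evol { ',' corr_evol }")
    # Split on commas that are not inside a '[...]' feature list.
    evolutions = [e.strip() for e in re.split(r",(?![^\[\]]*\])", rest)]
    if not all(evolutions):
        raise ValueError("empty correct evolution")
    return word.strip(), evolutions


if __name__ == "__main__":
    print(parse_corpus_line("mensam > mesa, mese"))
```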

INPUT: word
CORR: (1)corr1 (2)corr2 ... (n)corrn
CALC: (1)calc1 (2)calc2 ... (n)calcn

corr1: corrhit1
corrn: corrhitn

calc1: calchit1
calcn: calchitn


"word" is the input of the calculation. "corr1" ... "corrn" are the correct evolutions you indicated in the corpus file, and "calc1" ... "calcn" are the calculated evolutions. In the section "EXPECTED - HITS" the program lists all the correct evolutions and indicates for each one the number of times it has been calculated ("corrhit1" ... "corrhitn"). In the section "CALCULATED - HITS" it does the same for the calculated evolutions (normally "calchitx" is either 0 or 1: 0 means that the calculated evolution is not in the set of correct evolutions, 1 means that it is valid). Finally, in the section "TOTAL HITS" the program indicates, for both the correct (EXPECTED) and the calculated evolutions, the sum of the hits (for example, "TOTAL HITS - EXPECTED 2 / 3" means that 2 of the 3 correct evolutions were calculated by the program).
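
The hit counts described above can be reproduced with a short Python sketch. It shows the counting only, under assumptions, and does not reproduce the exact layout of "STAT.LOG". The function name hit_statistics and the dictionary keys are hypothetical, the sample words are made up, and the totals are taken to be the number of evolutions with at least one hit, in line with the "2 / 3" example.

```python
def hit_statistics(correct, calculated):
    """Count hits for the correct (EXPECTED) and the calculated evolutions."""
    # EXPECTED: how often each correct evolution was calculated.
    expected_hits = {c: calculated.count(c) for c in correct}
    # CALCULATED: 1 if the calculated evolution is a correct one, else 0.
    correct_set = set(correct)
    calculated_hits = {c: int(c in correct_set) for c in calculated}
    return {
        "expected": expected_hits,
        "calculated": calculated_hits,
        # Totals as (hits, number of evolutions), e.g. (2, 3) for "2 / 3".
        "total_expected": (sum(1 for h in expected_hits.values() if h), len(expected_hits)),
        "total_calculated": (sum(calculated_hits.values()), len(calculated_hits)),
    }


if __name__ == "__main__":
    stats = hit_statistics(["mesa", "mese", "mensa"], ["mesa", "mensa", "masa"])
    print(stats["total_expected"], stats["total_calculated"])
```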

Last updated: October 2000