
Tuesday, November 17, 2015

Regular Expressions

1.4. Regular Expressions

Regular expressions allow users to create complex queries. The list below covers the most commonly used regular expressions, with explanations and some potential uses.
  • [abc] means "a or b or c", e.g. query "[br]ang" will match both "adbarnirrang" and "bang"
  • [^abc] means "any character except a, b or c", e.g. query "[^aeou]ang" will match "rang" but not "baang"
  • [a-zA-Z] means "a character from a/A through z/Z", e.g. "b[a-zA-Z]" will match "bang", "bLang" or "baang" but not "b8ng"
  • . (the dot) means "any character", e.g. "b.ng" will match "bang", "b8ng", but not "baang"
  • X* means "X zero or more times", e.g. "ba*ng" will match "bng", "bang", "baang", "baaang" etc.
  • X+ means "X one or more times", e.g. "ba+ng" will match "bang", "baang" but not "bng"
  • ^ means "the beginning of the annotation", e.g. "^ng" will match "ngabi" but not "bukung"
  • $ means "the end of the annotation", e.g. "ung$" will match "bukung" but not "ngabi"
    Examples
    • ^[pbtd][^aeiou]
      You can use this expression to search for complex onsets. It will find words that start with one of the plosives ("p","b","t","d") followed by a character that is not a vowel ("a","e","i","o","u"). An example of a matching word is "tsakeha"
    • [^n]g$
      You can use this expression to search for annotations ending with a "g", but not with "ng". In Dutch, the results will include "snelweg" and "maandag" but not words such as "bang".
    • ^k.+k$
      You can use this expression to search for annotations starting and ending with "k", with one or more characters between them, e.g. "kitik" or "kanak-kanak".
    • ^(.+)\1$
      You can use this expression to search for words that are reduplicated. When you put part of an expression in parentheses, such as (.+), you create a group whose matched text you can refer back to as "\1". This expression therefore searches for an annotation that consists of one or more arbitrary characters followed by that same sequence of characters. It will match, for instance, "kulukulu".
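
The expressions in the list above can be tried out in any regex engine. The following is a minimal sketch in Python, assuming that its re module treats these particular patterns the same way as the search tool does (the tool's own engine may differ in details). The helper name matches is hypothetical, and re.search is used because the examples above match anywhere inside an annotation.

```python
import re

def matches(pattern, annotation):
    """Return True if the pattern is found anywhere in the annotation."""
    return re.search(pattern, annotation) is not None

# (pattern, annotation, expected result) triples taken from the list above.
cases = [
    ("[br]ang", "adbarnirrang", True),
    ("[br]ang", "bang", True),
    ("[^aeou]ang", "rang", True),
    ("[^aeou]ang", "baang", False),
    ("b.ng", "b8ng", True),
    ("b.ng", "baang", False),
    ("ba*ng", "bng", True),
    ("ba+ng", "bng", False),
    ("^ng", "ngabi", True),
    ("^ng", "bukung", False),
    ("ung$", "bukung", True),
    ("^[pbtd][^aeiou]", "tsakeha", True),
    ("[^n]g$", "snelweg", True),
    ("[^n]g$", "bang", False),
    ("^k.+k$", "kanak-kanak", True),
    (r"^(.+)\1$", "kulukulu", True),   # raw string keeps the backslash in \1
]

for pattern, annotation, expected in cases:
    # Fails loudly if Python disagrees with the behaviour described above.
    assert matches(pattern, annotation) == expected, (pattern, annotation)

print("all", len(cases), "cases behave as described")
```
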
More about regular expressions...
The following tables were created by a user of ELAN (an annotation tool which has the same search mechanism as TROVA). They may prove quite useful for other users as well, since they offer a simple and clear overview of the main symbols used in regular expressions (partly different from the ones just seen), with a short explanation and an example for each of them. Bear in mind that the examples are taken from the language that the user is researching, so pay attention not to the meaning of the words but to the working mechanism of the regular expressions.

Table 1.1. Symbols
Symbols  | Place                                   | Meaning
\b       | at the beginning and/or end of a string | word boundary
\w+      | at the end of a string                  | variable end of word
.        | anywhere                                | any letter
.*       | between spaces                          | any string of letters between spaces/any word
.*\      | between spaces                          | any string of words
(x|y)    | anywhere                                | either x or y
[^x]     | place at the beginning                  | not x
(....)\1 | anywhere                                | words with four reduplicated letters
?        | after a letter                          | the preceding letter is optional


Table 1.2. Search for particular word forms (examples)
Symbols         | Hits | Examples
sa              | all words containing the string sa | sa, vasaku, sahata, tisa
\bsa            | all words starting with sa | sa, sahata, sana; NOT vasaku, tisa
\bsa\b          | all words sa | sa
\bsa..\b        | all words consisting of sa + two letters that follow sa | saka, saku, sana
\bsa\w+         | all words beginning with sa, but not the word sa by itself | sahata, sana
\b.*ana\b       | all words ending in ana | sinana, tamuana, sana, bana, maana
(....)\1        | all words with four reduplicated letters | pakupaku, vapakupaku, mahumahun, vamahumahun
\b(....)\1      | all words beginning with four reduplicated letters | pakupaku; NOT vapakupaku
\b(....)\1ana\b | all words beginning with four reduplicated letters and ending in ana | vasuvasuana, hunuhunuana
\bva(....)\1    | all words consisting of the prefix va- + four reduplicated letters | vapakupaku, vagunagunaha
\bvahaa?\b      | all tokens of vahaa and vaha | vahaa, vaha
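
Several rows of Table 1.2 can be checked against a small word list. This is a sketch only: the list is assembled from the example column of the table, the helper name hits is hypothetical, and Python's re module is assumed to behave like the search tool for these patterns. The reduplication patterns use the backreference \1, as in the reduplication example earlier in this section.

```python
import re

# Word list assembled from the example column of Table 1.2.
words = ["sa", "vasaku", "sahata", "tisa", "sana", "saka", "saku",
         "pakupaku", "vapakupaku", "vasuvasuana", "vaha", "vahaa"]

def hits(pattern):
    """Return the words in which the pattern is found, in list order."""
    return [w for w in words if re.search(pattern, w)]

print(hits(r"\bsa"))        # words starting with sa
print(hits(r"\bsa\b"))      # -> ['sa']
print(hits(r"\bsa..\b"))    # -> ['sana', 'saka', 'saku']
print(hits(r"\bsa\w+"))     # starts with sa, but not sa by itself
print(hits(r"(....)\1"))    # four reduplicated letters anywhere
print(hits(r"\b(....)\1"))  # -> ['pakupaku', 'vasuvasuana']
print(hits(r"\bvahaa?\b"))  # -> ['vaha', 'vahaa']
```

Each word is tested on its own here, so \b only ever matches at the start or end of the word; in a longer annotation it would also match at the spaces between words.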


Table 1.3. Search for particular sequences of words (examples)
Symbols                          | Hits | Examples
\bsaka\b .* \bhaa                | string of 3 words: (1) saka; (2) any word; (3) the word haa by itself or with suffixes | saka antee haa, saka abana haari, saka kabuu haana
saka .* \bhaa\w+                 | string of 3 words: (1) saka; (2) any word; (3) a word beginning with haa, but NOT the word haa by itself | saka abana haari, saka kabuu haana
(\bsaka\b|\bsa\b) \bpaku\b       | 2-word string consisting of saka or sa and paku | saka paku, sa paku
(\bsaka\b|\bsa\b) .* \bvaha\b    | strings of 3 words: (1) saka or sa; (2) any word; (3) vaha | saka tii vaha, sa tapaku vaha
(\bsaka\b|\bsa\b) (....)\1 \bhaa | strings of 3 words: (1) saka or sa; (2) any word with four reduplicated letters; (3) the word haa or a word beginning with haa | sa natanata haa, saka natanata haana
