| In [[computer science]], in the area of [[formal language theory]], frequent use is made of a variety of [[string functions]]; however, the notation differs from that used in [[computer programming]], and some functions commonly used in the theoretical realm are rarely used when programming. This article defines some of these basic terms.
| | |
| ==Strings and languages==
| |
| A string is a finite sequence of characters.
| |
| The [[empty string]] is denoted by <math>\varepsilon</math>.
| |
| The concatenation of two strings <math>s</math> and <math>t</math> is denoted by <math>s \cdot t</math>, or shorter by <math>s t</math>.
| |
| Concatenating with the empty string makes no difference: <math>s \cdot \varepsilon = s = \varepsilon \cdot s</math>.
| |
| Concatenation of strings is associative: <math>s \cdot (t \cdot u) = (s \cdot t) \cdot u</math>.
| |
| | |
| For example, <math>(\langle b \rangle \cdot \langle l \rangle) \cdot (\varepsilon \cdot \langle ah \rangle) = \langle bl \rangle \cdot \langle ah \rangle = \langle blah \rangle</math>.
| |
| | |
| A [[language (computer science)|language]] is a finite or infinite set of strings.
| |
| Besides the usual set operations like union, intersection etc., concatenation can be applied to languages:
| |
| if both <math>S</math> and <math>T</math> are languages, their concatenation <math>S \cdot T</math> is defined as the set of concatenations of any string from <math>S</math> and any string from <math>T</math>, formally <math>S \cdot T = \{ s \cdot t \mid s \in S \land t \in T \}</math>.
| |
| Again, the concatenation dot <math>\cdot</math> is often omitted for shortness.
| |
| | |
| The language <math>\{\varepsilon\}</math> consisting of just the empty string is to be distinguished from the empty language <math>\{\}</math>.
| |
| Concatenating any language with the former doesn't make any change: <math>S \cdot \{\varepsilon\} = S = \{\varepsilon\} \cdot S</math>,
| |
| while concatenating with the latter always yields the empty language: <math>S \cdot \{\} = \{\} = \{\} \cdot S</math>.
| |
| Concatenation of languages is associative: <math>S \cdot (T \cdot U) = (S \cdot T) \cdot U</math>.
| |
| | |
| For example, abbreviating <math>D = \{ \langle 0 \rangle, \langle 1 \rangle, \langle 2 \rangle, \langle 3 \rangle, \langle 4 \rangle, \langle 5 \rangle, \langle 6 \rangle, \langle 7 \rangle, \langle 8 \rangle, \langle 9 \rangle \}</math>, the set of all three-digit decimal numbers is obtained as <math>D \cdot D \cdot D</math>. The set of all decimal numbers of arbitrary length is an example of an infinite language.
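The language operations above can be sketched in Python for finite languages, representing a language as a <code>set</code> of <code>str</code> values and the empty string as <code>""</code>. The function name <code>concat</code> and the variable names are illustrative assumptions, not standard notation.

```python
from itertools import product

def concat(S, T):
    """Concatenation S.T of two finite languages given as sets of strings."""
    return {s + t for s, t in product(S, T)}

EMPTY_LANGUAGE = set()     # the empty language {}
EPSILON_LANGUAGE = {""}    # the language containing only the empty string

# D is the set of the ten one-digit strings, as in the example above.
D = {str(d) for d in range(10)}

# D.D.D: all three-digit strings, including those with leading zeros.
three_digit = concat(concat(D, D), D)
```

In this sketch <code>concat(D, EPSILON_LANGUAGE)</code> returns <code>D</code> unchanged, while <code>concat(D, EMPTY_LANGUAGE)</code> returns the empty set, mirroring the two identities stated above. Infinite languages cannot be represented this way.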
| |
| | |
| ==Alphabet of a string==
| |
| The '''alphabet of a string''' is the set of all of the characters that occur in a particular string. If ''s'' is a string, its [[alphabet (computer science)|alphabet]] is denoted by
| |
| | |
| :<math>\operatorname{Alph}(s)</math>
| |
| | |
| The '''alphabet of a language''' <math>S</math> is the set of all characters that occur in any string of <math>S</math>, formally:
| |
| <math>\operatorname{Alph}(S) = \bigcup_{s \in S} \operatorname{Alph}(s)</math>.
| |
| | |
| For example, the set <math>\{\langle a \rangle,\langle c \rangle,\langle o \rangle\}</math> is the alphabet of the string <math>\langle cacao \rangle</math>,
| |
| and the [[#Strings_and_languages|above]] <math>D</math> is the alphabet of the [[#Strings_and_languages|above]] language <math>D \cdot D \cdot D</math> as well as of the language of all decimal numbers.
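As a minimal sketch, assuming finite languages represented as Python sets of strings, the two alphabet definitions can be written as follows. The names <code>alph_string</code> and <code>alph_language</code> are hypothetical.

```python
def alph_string(s):
    """Alph(s): the set of characters occurring in the string s."""
    return set(s)

def alph_language(S):
    """Alph(S): the union of Alph(s) over all strings s in the finite language S."""
    return set().union(*(alph_string(s) for s in S))
```

For instance, <code>alph_string("cacao")</code> yields the three-element set containing <code>a</code>, <code>c</code> and <code>o</code>.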
| |
| | |
| ==String substitution==
| |
| Let ''L'' be a [[language (computer science)|language]], and let <math>\Sigma</math> be its alphabet. A '''string substitution''' or simply a '''substitution''' is a mapping ''f'' that maps letters in <math>\Sigma</math> to languages (possibly in a different alphabet). Thus, for example, given a letter <math>a\in \Sigma</math>, one has <math>f(a)=L_a</math> where <math>L_a\subseteq\Delta^*</math> is some language whose alphabet is <math>\Delta</math>. This mapping may be extended to strings as
| |
| | |
| :<math>f(\varepsilon)=\{\varepsilon\}</math>
| |
| | |
| for the [[empty string]] <math>\varepsilon</math>, and
| |
| | |
| :<math>f(sa)=f(s)f(a)</math>
| |
| | |
| for string <math>s\in L</math> and letter <math>a\in\Sigma</math>. String substitution may be extended to the entire language as
| | |
| :<math>f(L)=\bigcup_{s\in L} f(s)</math>
| |
| | |
| [[Regular language]]s are closed under string substitution. That is, if each letter in the alphabet of a regular language is substituted by a regular language, the result is still a regular language.
| |
| | |
| A simple example is the conversion <math>f_{uc}(\cdot)</math> to upper case, which may be defined e.g. as follows:
| |
| | |
| {| class="wikitable"
| |
| |-
| |
| ! letter !! mapped to language !! remark
| |
| |-
| |
| ! <math>x</math> !! <math>f_{uc}(x)</math> !!
| |
| |-
| |
| | <math>\langle a \rangle</math> || <math>\{\langle A \rangle\}</math> || map lower-case char to corresponding upper-case char
| |
| |-
| |
| | <math>\langle A \rangle</math> || <math>\{\langle A \rangle\}</math> || map upper-case char to itself
| |
| |-
| |
| | <math>\langle \text{ß} \rangle</math> || <math>\{\langle SS \rangle\}</math> || no upper-case char available, map to two-char string
| |
| |-
| |
| | <math>\langle 0 \rangle</math> || <math>\{\varepsilon\}</math> || map digit to empty string
| |
| |-
| |
| | <math>\langle ! \rangle</math> || <math>\{\}</math> || forbid punctuation, map to empty language
| |
| |-
| |
| | <math>\ldots</math> || || similar for other chars
| |
| |}
| |
| | |
| For the extension of <math>f_{uc}</math> to strings, we have e.g.
| |
| * <math>f_{uc}(\langle \text{Straße} \rangle) = \{\langle S \rangle\} \cdot \{\langle T \rangle\} \cdot \{\langle R \rangle\} \cdot \{\langle A \rangle\} \cdot \{\langle SS \rangle\} \cdot \{\langle E \rangle\} = \{ \langle STRASSE \rangle \}</math>,
| |
| * <math>f_{uc}(\langle u2 \rangle) = \{\langle U \rangle\} \cdot \{\varepsilon\} = \{\langle U \rangle\}</math>, and
| |
| * <math>f_{uc}(\langle Go! \rangle) = \{\langle G \rangle\} \cdot \{\langle O \rangle\} \cdot \{\} = \{\}</math>.
| |
| For the extension of <math>f_{uc}</math> to languages, we have e.g.
| |
| * <math>f_{uc}(\{\langle \text{Straße} \rangle, \langle u2 \rangle, \langle Go! \rangle\}) = \{ \langle STRASSE \rangle \} \cup \{\langle U \rangle\} \cup \{\} = \{ \langle STRASSE \rangle, \langle U \rangle\}</math>.
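The extension of a substitution from letters to strings and to languages can be sketched in Python for finite languages. The dictionary <code>f_uc</code> below is a hypothetical fragment covering only the letters needed for the examples above, and all function names are illustrative. The sketch assumes that the empty string is mapped to the language containing only the empty string.

```python
def concat(S, T):
    """Concatenation of two finite languages given as sets of strings."""
    return {s + t for s in S for t in T}

def substitute_string(f, s):
    """Extend a letter-to-language map f to a string: f(sa) = f(s).f(a)."""
    result = {""}  # image of the empty string
    for a in s:
        result = concat(result, f[a])
    return result

def substitute_language(f, L):
    """f(L): the union of f(s) over all strings s in the finite language L."""
    out = set()
    for s in L:
        out |= substitute_string(f, s)
    return out

# Hypothetical fragment of f_uc, restricted to the letters used in the examples.
f_uc = {
    "S": {"S"}, "t": {"T"}, "r": {"R"}, "a": {"A"}, "ß": {"SS"}, "e": {"E"},
    "u": {"U"}, "G": {"G"}, "o": {"O"},
    "2": {""},    # digit mapped to the empty string
    "!": set(),   # punctuation mapped to the empty language
}
```

A single letter mapped to the empty language makes the image of the whole string empty, which is why <code>substitute_string(f_uc, "Go!")</code> returns the empty set in this sketch.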
| |
|
| |
| | |
| Another example is the conversion of an [[EBCDIC]]-encoded string to [[ASCII]].
| |
| | |
| ==String homomorphism==
| |
| A '''string homomorphism''' (often referred to simply as a [[Homomorphism#Homomorphisms_and_e-free_homomorphisms_in_formal_language_theory|homomorphism]] in [[formal language theory]]) is a string substitution such that each letter is replaced by a single string. That is, <math>f(a)=s</math>, where ''s'' is a string, for each letter ''a''.
| |
| | |
| String homomorphisms are [[monoid morphism]]s on the [[free monoid]], preserving the [[binary operation]] of [[string concatenation]]. Given a language ''L'', the set <math>f(L)</math> is called the '''homomorphic image''' of ''L''. The '''inverse homomorphic image''' of a string ''s'' is defined as
| |
| | |
| :<math>f^{-1}(s)=\{w\vert f(w)=s\}</math>
| |
| | |
| while the inverse homomorphic image of a language ''L'' is defined as
| |
| | |
| :<math>f^{-1}(L)=\{s\vert f(s)\in L\}</math>
| |
| | |
| Note that, in general, <math>f(f^{-1}(L))\ne L</math>, while one does have
| |
| | |
| :<math>f(f^{-1}(L)) \subseteq L</math>
| |
| | |
| and
| |
| | |
| :<math>L \subseteq f^{-1}(f(L))</math>
| |
| | |
| for any language ''L''.
| |
| | |
| A string homomorphism is said to be <math>\varepsilon </math>-free (or e-free) if <math>f(a) \ne \varepsilon</math> for all <math>a</math> in the alphabet <math>\Sigma</math>. Simple single-letter [[substitution cipher]]s are examples of (<math>\varepsilon</math>-free) string homomorphisms.
| |
| | |
| An example string homomorphism <math>g_{uc}</math> can be defined similarly to the [[#String_substitution|above]] substitution: <math>g_{uc}(\langle a \rangle) = \langle A \rangle</math>, ..., <math>g_{uc}(\langle 0 \rangle) = \varepsilon</math>, but leaving <math>g_{uc}</math> undefined on punctuation chars. Besides this restriction of its input domain, <math>g_{uc}</math> differs from <math>f_{uc}</math> by returning strings, while the latter returns singleton sets of strings. Examples of inverse homomorphic images are
| |
| * <math>g_{uc}^{-1}(\{ \langle SSS \rangle \}) = \{ \langle sss \rangle, \langle \text{sß} \rangle, \langle \text{ßs} \rangle\} </math>, since <math>g_{uc}(\langle sss \rangle) = g_{uc}(\langle \text{sß} \rangle) = g_{uc}(\langle \text{ßs} \rangle) = \langle SSS \rangle</math>, and
| |
| * <math>g_{uc}^{-1}(\{ \langle A \rangle, \langle bb \rangle \}) = \{ \langle a \rangle\} </math>, since <math>g_{uc}(\langle a \rangle) = \langle A \rangle</math>, while <math>\langle bb \rangle</math> cannot be reached by <math>g_{uc}</math>.
| |
| For the latter language, <math>g_{uc}(g_{uc}^{-1}(\{ \langle A \rangle, \langle bb \rangle \})) = g_{uc}(\{ \langle a \rangle\}) = \{ \langle A \rangle \} \neq \{ \langle A \rangle, \langle bb \rangle \}</math>.
| |
| The homomorphism <math>g_{uc}</math> is not <math>\varepsilon </math>-free, since it maps e.g. <math>\langle 0 \rangle</math> to <math>\varepsilon</math>.
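Inverse homomorphic images are infinite in general, so they cannot simply be enumerated. The following Python sketch therefore searches only candidate strings up to a given length. The map <code>g</code> is a hypothetical toy homomorphism on a three-letter alphabet, not the full <math>g_{uc}</math>, and the function names are assumptions made for illustration.

```python
from itertools import product

def apply_hom(g, s):
    """Extend a letter-to-string map g to strings: g(sa) = g(s)g(a)."""
    return "".join(g[a] for a in s)

def inverse_image_bounded(g, L, max_len):
    """All strings w over the domain of g with len(w) <= max_len and g(w) in L."""
    letters = sorted(g)
    found = set()
    for n in range(max_len + 1):
        for tup in product(letters, repeat=n):
            w = "".join(tup)
            if apply_hom(g, w) in L:
                found.add(w)
    return found

def is_epsilon_free(g):
    """True if no letter is mapped to the empty string."""
    return all(image != "" for image in g.values())

# Hypothetical toy homomorphism, assumed only for this example.
g = {"a": "A", "s": "S", "ß": "SS"}
```

Because this toy <code>g</code> is <math>\varepsilon</math>-free, every preimage of a string of length ''n'' has length at most ''n'', so a bound of 3 finds all preimages of <code>"SSS"</code> under <code>g</code>. For a homomorphism that is not <math>\varepsilon</math>-free, the bounded search returns only part of the inverse image.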
| |
| | |
| ==String projection==
| |
| If ''s'' is a string, and <math>\Sigma</math> is an alphabet, the '''string projection''' of ''s'' is the string that results by removing all letters which are not in <math>\Sigma</math>. It is written as <math>\pi_\Sigma(s)\,</math>. It is formally defined by removal of letters from the right hand side:
| |
| | |
| :<math>\pi_\Sigma(s) = \begin{cases}
| |
| \varepsilon & \mbox{if } s=\varepsilon \mbox{ the empty string} \\
| |
| \pi_\Sigma(t) & \mbox{if } s=ta \mbox{ and } a \notin \Sigma \\
| |
| \pi_\Sigma(t)a & \mbox{if } s=ta \mbox{ and } a \in \Sigma
| |
| \end{cases}</math>
| |
| | |
| Here <math>\varepsilon</math> denotes the [[empty string]]. The projection of a string is essentially the same as a [[projection in relational algebra]].
| |
| | |
| String projection may be promoted to the '''projection of a language'''. Given a [[formal language]] ''L'', its projection is given by
| |
| | |
| :<math>\pi_\Sigma (L)=\{\pi_\Sigma(s) \vert s\in L \}</math>
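A minimal Python sketch of projection, assuming the alphabet is given as a set of single-character strings and languages are finite sets of strings. The names <code>project</code> and <code>project_language</code> are hypothetical. The sketch filters the string from left to right, which gives the same result as the right-to-left recursive definition above.

```python
def project(s, sigma):
    """Projection of s onto sigma: drop every letter of s not in sigma."""
    return "".join(a for a in s if a in sigma)

def project_language(L, sigma):
    """Projection of a finite language: project each of its strings."""
    return {project(s, sigma) for s in L}
```

Distinct strings may have the same projection, so the projected language can contain fewer strings than the original.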
| |
| | |
| ==Right quotient==
| |
| The '''right quotient''' of a string ''s'' by a letter ''a'' is the string obtained by removing a trailing ''a'' from ''s''. It is denoted as <math>s/a</math>. If ''s'' does not end in ''a'', the result is the empty string. Thus:
| |
| | |
| :<math>(sa)/ b = \begin{cases}
| |
| s & \mbox{if } a=b \\
| |
| \varepsilon & \mbox{if } a \ne b
| |
| \end{cases}</math>
| |
| | |
| The quotient of the empty string may be taken:
| |
| | |
| :<math>\varepsilon / a = \varepsilon</math>
| |
| | |
| Similarly, given a subset <math>S\subset M</math> of a monoid <math>M</math>, one may define the quotient subset as
| |
| | |
| :<math>S/a=\{s\in M \vert sa\in S\}</math>
| | |
| Left quotients may be defined similarly, with operations taking place on the left of a string.
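The right quotient definitions above can be sketched in Python as follows, assuming ''a'' is a single character and the subset ''S'' is a finite set of strings. The function names are illustrative assumptions.

```python
def right_quotient(s, a):
    """s/a: s without its final letter if that letter is a, else the empty string."""
    if s and s[-1] == a:
        return s[:-1]
    return ""

def right_quotient_language(S, a):
    """S/a: every string s such that s followed by a belongs to the finite set S."""
    return {s[:-1] for s in S if s.endswith(a)}
```

The two functions treat a mismatch differently, following the definitions: the string quotient returns the empty string, whereas the subset quotient simply contributes no element for a string of ''S'' that does not end in ''a''.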
| |
| | |
| ==Syntactic relation==
| |
| The right quotient of a subset <math>S\subset M</math> of a monoid <math>M</math> defines an [[equivalence relation]], called the '''right [[syntactic relation]]''' of ''S''. It is given by
| |
| | |
| :<math>\sim_S \;\,=\, \{(s,t)\in M\times M \vert S/s = S/t \}</math>
| |
| | |
| The relation is of finite index (has a finite number of equivalence classes) if and only if the family of right quotients is finite; that is, if
| |
| | |
| :<math>\{S/m \vert m\in M\}</math>
| |
| | |
| is finite. In this case, ''S'' is a [[recognizable language]], that is, a language that can be recognized by a [[finite state automaton]]. This is discussed in greater detail in the article on [[syntactic monoid]]s.
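For a finite language the family of right quotients can be computed directly, which gives a small illustration of the finiteness condition. The Python sketch below makes several assumptions: the monoid is the set of all strings over a non-empty alphabet, the quotient is taken by a whole string ''m'' rather than a single letter, and ''S'' is a finite set of strings. All function names are hypothetical.

```python
def quotient_by_string(S, m):
    """S/m: every string s such that s followed by m belongs to the finite set S."""
    return frozenset(s[:len(s) - len(m)] for s in S if s.endswith(m))

def suffixes(s):
    """All suffixes of s, including s itself and the empty string."""
    return {s[i:] for i in range(len(s) + 1)}

def right_quotient_family(S):
    """The distinct sets S/m, with m ranging over all strings.

    Only suffixes of strings in S give a non-empty quotient; every other m
    yields the empty set, which is added once up front.
    """
    family = {frozenset()}
    for s in S:
        for m in suffixes(s):
            family.add(quotient_by_string(S, m))
    return family
```

Two strings are related by the right syntactic relation of ''S'' exactly when <code>quotient_by_string</code> returns the same set for both. For the finite set containing <code>"ab"</code> and <code>"cb"</code>, for example, the strings <code>"ab"</code> and <code>"cb"</code> are related, since both quotients contain only the empty string.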
| |
| | |
| ==Right cancellation==
| |
| The '''right cancellation''' of a letter ''a'' from a string ''s'' is the removal of the first occurrence of the letter ''a'' in the string ''s'', starting from the right hand side. It is denoted as <math>s\div a</math> and is recursively defined as
| |
| | |
| :<math>(sa)\div b = \begin{cases}
| |
| s & \mbox{if } a=b \\
| |
| (s\div b)a & \mbox{if } a \ne b
| |
| \end{cases}</math>
| |
| | |
| The empty string is always cancellable:
| |
| | |
| :<math>\varepsilon \div a = \varepsilon</math>
| |
| | |
| Clearly, right cancellation and projection [[Commutative property|commute]]:
| |
| | |
| :<math>\pi_\Sigma(s)\div a = \pi_\Sigma(s \div a )</math>
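Right cancellation can be sketched in Python by locating the rightmost occurrence of the letter, which is equivalent to unfolding the recursive definition above. The names <code>right_cancel</code> and <code>project</code> are assumptions, and ''b'' is assumed to be a single character.

```python
def right_cancel(s, b):
    """Remove the rightmost occurrence of the letter b from s, if there is one."""
    i = s.rfind(b)
    if i == -1:
        return s  # b does not occur, so s is returned unchanged
    return s[:i] + s[i + 1:]

def project(s, sigma):
    """Projection of s onto sigma: drop every letter of s not in sigma."""
    return "".join(a for a in s if a in sigma)
```

With these two helpers the commutation property can be spot-checked on individual strings, for example by comparing <code>right_cancel(project(s, sigma), a)</code> with <code>project(right_cancel(s, a), sigma)</code>. Such a check illustrates the identity but does not prove it.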
| |
| | |
| ==Prefixes==
| |
| The '''prefixes of a string''' form the set of all [[prefix (computer science)|prefixes]] of that string, with respect to a given language:
| |
| | |
| :<math>\operatorname{Pref}_L(s) = \{t \vert s=tu \mbox { for } t,u\in \operatorname{Alph}(L)^*\}</math>
| | |
| where <math>s\in L</math>.
| |
| | |
| The '''prefix closure of a language''' is
| |
| | |
| :<math>\operatorname{Pref} (L) = \bigcup_{s\in L} \operatorname{Pref}_L(s) = \left\{ t\vert s=tu; s\in L; t,u\in \operatorname{Alph}(L)^* \right\}</math>
| |
| | |
| '''Example:''' <br>
| |
| <math>L=\left\{abc\right\}\mbox{ then } \operatorname{Pref}(L)=\left\{\varepsilon, a, ab, abc\right\}</math>
| |
| | |
| A language is called '''prefix closed''' if <math>\operatorname{Pref} (L) = L</math>.
| |
| | |
| The prefix closure operator is [[idempotent]]:
| |
| | |
| :<math>\operatorname{Pref} (\operatorname{Pref} (L)) =\operatorname{Pref} (L)</math>
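The prefix definitions can be sketched in Python for finite languages represented as sets of strings. The function names are hypothetical.

```python
def prefixes(s):
    """All prefixes of s, from the empty string up to s itself."""
    return {s[:i] for i in range(len(s) + 1)}

def prefix_closure(L):
    """Pref(L): the union of the prefix sets of all strings in the finite language L."""
    out = set()
    for s in L:
        out |= prefixes(s)
    return out

def is_prefix_closed(L):
    """True if L already contains every prefix of each of its strings."""
    return prefix_closure(L) == set(L)
```

Applying <code>prefix_closure</code> twice gives the same set as applying it once, which is the idempotence property stated above.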
| |
| | |
| The '''prefix relation''' is a [[binary relation]] <math>\sqsubseteq</math> such that <math>s\sqsubseteq t </math> if and only if <math>s \in \operatorname{Pref}_L(t)</math>. This relation is a particular example of a [[prefix order]].
| |
| | |
| ==See also==
| |
| * [[Comparison of programming languages (string functions)]]
| |
| * [[Levi's lemma]]
| |
| | |
| == References ==
| |
| {{reflist}}
| |
| * {{cite book | first1=John E. | last1=Hopcroft | first2=Jeffrey D. | last2=Ullman | title=Introduction to Automata Theory, Languages and Computation | publisher=Addison-Wesley Publishing | location=Reading, Massachusetts | year=1979 | isbn=0-201-02988-X | zbl=0426.68001 }} ''(See chapter 3.)''
| |
| | |
| [[Category:Formal languages]]
| |
| [[Category:Relational algebra]]
| |
| [[Category:String (computer science)|Operations]]
| |