
Digital State Machines - PowerPoint Presentation

marina-yarberry
Uploaded On 2020-01-17


Presentation Transcript

Digital State Machines Regular Expressions & Languages

7 October 2008, Veton Këpuska

Chapter Outline
  Regular Expressions
  Basic Regular Expression Patterns
  Disjunction, Grouping and Precedence
  Examples
  Advanced Operators
  Regular Expression Substitution, Memory and ELIZA
  Summary

Regular Expressions (RE)
An algebraic description of finite-state automata (FSA).
Regular expressions define exactly the same languages that the various forms of automata describe: the regular languages.
Regular expressions offer a declarative way to express the strings we want to accept; FSA do not.
REs serve as the input language for many systems that process strings:
  Search commands such as UNIX grep (egrep, etc.) for finding strings; WWW browsers, text-formatting systems, etc. Search systems convert REs into FSA (D-FSA or N-FSA).
  Lexical-analyzer generators, such as LEX or FLEX.
  Compilers.
  Language modeling systems in speech recognizers.
  Grammar and spell checkers.

FSA, RE and Regular Languages
[Figure: diagram relating finite automata, regular languages, and regular expressions]

The Operators of Regular Expressions
Regular expressions denote languages. For example, 01*+10* denotes the language consisting of all strings that are in either {0, 01, 011, 0111, 01111, …} or {1, 10, 100, 1000, 10000, …}.
Operations on regular languages that regular expressions represent: let L, L1 and L2 be regular languages, with L = {0, 1}, L1 = {10, 001, 111} and L2 = {ε, 001}. Then:
  The union (or disjunction) of L1 and L2: L1 ∪ L2 = {ε, 10, 001, 111}
  The concatenation: L1L2 = {xy | x ∈ L1, y ∈ L2} = {10, 001, 111, 10001, 001001, 111001}
  The closure (or star, *, or Kleene closure): L* = L^0 ∪ L^1 ∪ L^2 ∪ … ∪ L^i ∪ …

Example
L = {0, 11}
L^0 = {ε}, independent of what language L is.
L^1 = L, which represents the choice of one string from L.
L^0 ∪ L^1 = {ε, 0, 11}
L^2 = {00, 011, 110, 1111}
L^3 = {000, 0011, 0110, 01111, 1100, 11011, 11110, 111111}
To compute L* we must compute L^i for each i; here L^i has 2^i members.
The union of an infinite number of terms L^i is generally an infinite language (L*), as it is in this example.
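The operations above can be tried on small finite languages. The following is a minimal Python sketch, assuming languages are represented as sets of strings and the empty string "" stands for ε; the helper names concat and power are illustrative, not part of the slides.

```python
from itertools import product

def concat(A, B):
    """Concatenation AB = {xy | x in A, y in B}."""
    return {x + y for x, y in product(A, B)}

def power(L, i):
    """L^i: concatenation of i strings from L; L^0 = {""} (only the empty string)."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result

L1 = {"10", "001", "111"}
L2 = {"", "001"}
print(sorted(L1 | L2))           # union
print(sorted(concat(L1, L2)))    # concatenation

L = {"0", "11"}
for i in range(4):
    print(i, sorted(power(L, i), key=lambda s: (len(s), s)))
```

Only finitely many powers can be enumerated this way, so the sketch illustrates L^i rather than the full closure L*.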

Example
Let L = {ε, 0, 00, 000, …}, the set of strings consisting of all zeros. L is an infinite language.
L^0 = {ε}, independent of what language L is.
L^1 = L, which represents the choice of one string from L.
L^0 ∪ L^1 = {ε, 0, 00, 000, 0000, …}
L^2 = {ε, 0, 00, 000, 0000, …} = L
L^3 = L
L* = L^0 ∪ L^1 ∪ L^2 ∪ … = L
∅, the empty set, is one of only two languages whose closure is not infinite:
∅^0 = {ε}
∅^i = ∅ for every i ≥ 1
∅* = {ε}

Distinction of Star (*) and Closure (*) Operator
Star: Σ* forms all strings whose symbols were chosen from the alphabet Σ.
The closure operator * is essentially the same, with a subtle difference.
Let L be a language containing strings of length 1, such that for each symbol a in Σ there is a string a in L.
Thus Σ is a set of symbols, while L is a set of strings; Σ* and L* denote the same language.

Building Regular Expressions
The algebra of regular expressions follows the pattern of classical algebra:
  Constants and variables denote languages.
  Operators: union, product (concatenation), star/closure.
A regular expression E (the language it represents is denoted by L(E)) is defined recursively.
BASIS:
  The constants ε and ∅ are regular expressions, denoting the languages L(ε) = {ε} and L(∅) = ∅ respectively.
  If a is any symbol, then a is a regular expression, with L(a) = {a}.
  Any variable, e.g. L, typically capitalized and italic, represents any language.

Building Regular Expressions
INDUCTION: If E and F are regular expressions, then:
  E+F is a regular expression denoting their union: L(E+F) = L(E) ∪ L(F).
  EF is a regular expression denoting their concatenation: L(EF) = L(E)L(F). A dot can optionally be used to denote the concatenation operator on languages or in a regular expression; the regular expression 0.1 is the same as 01, which represents the language {01}.
  E* is a regular expression denoting the closure of L(E): L(E*) = (L(E))*.
  (E) is also a regular expression, denoting the same language as E: L((E)) = L(E).

Example
Develop a regular expression for the language consisting of the single string 01.
  0 and 1 are expressions denoting the languages {0} and {1}.
  Concatenation of the two expressions results in the regular expression 01 for the language {01}.
  As a general rule, if we want a regular expression for the language consisting of only the string w, we use w itself as the regular expression.
Write a regular expression for the set of strings that consist of alternating 0’s and 1’s.
  From the above we get (01)*.
  Note 1: 01* ≠ (01)*
  Note 2: L((01)*) is not exactly what we want. What about strings with 1 at the beginning and/or 0 at the end?
  (01)* + (10)* + 1(01)* + 0(10)*
  The “+” operator indicates the union of the corresponding languages.

Example
Alternate solution: (ε+1)(01)*(ε+0)
Note: L(ε+1) = L(ε) ∪ L(1) = {ε} ∪ {1} = {ε, 1}
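One hedged way to gain confidence that the two solutions describe the same language is a brute-force comparison on all short strings. The sketch below uses Python's re module, which writes union as "|" instead of "+" and expresses (ε+1) as 1?; the helper names are illustrative assumptions.

```python
import re
from itertools import product

# Slide notation "+" (union) becomes "|" in Python; (ε+1) becomes 1?.
first = re.compile(r"(01)*|(10)*|1(01)*|0(10)*")
second = re.compile(r"(1?)(01)*(0?)")

def alternating(s):
    """Reference definition: no two adjacent symbols are equal."""
    return all(a != b for a, b in zip(s, s[1:]))

def all_strings(max_len, alphabet="01"):
    """Every string over the alphabet up to the given length."""
    for n in range(max_len + 1):
        for t in product(alphabet, repeat=n):
            yield "".join(t)

mismatches = [
    s for s in all_strings(10)
    if not (bool(first.fullmatch(s)) == bool(second.fullmatch(s)) == alternating(s))
]
print(mismatches)
```

A bounded check like this is evidence, not a proof: it only covers strings up to the chosen length.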

Precedence of Regular Expression Operators
From highest to lowest precedence:
  1. The star (*) operator.
  2. Concatenation (dot) operator.
  3. Union (+) operator.
The order of operations can be controlled with the grouping operator “()”.
Example: 01*+1 is grouped as (0(1*))+1. Parentheses give different languages: (01)*+1 or 0(1*+1).

Exercise Examples
Exercise 3.1.1: Write regular expressions for the following languages:
a) The set of strings over alphabet {a, b, c} containing at least one a and at least one b.
   A first attempt, aba*b*c*, covers only one ordering; what about other combinations? The a and the b may appear in either order, with arbitrary symbols around them:
   (a+b+c)*a(a+b+c)*b(a+b+c)* + (a+b+c)*b(a+b+c)*a(a+b+c)*
b) The set of strings of 0’s and 1’s whose tenth symbol from the right end is 1.
   (0+1)*1(0+1)(0+1)…(0+1)(0+1), with nine copies of (0+1) after the 1.
c) The set of strings of 0’s and 1’s with at most one pair of consecutive 1’s.
   (0+10)*(ε+1+11)(0+01)*
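Candidate answers to exercises like these can be tested against a direct definition of the language. The following sketch is one possible check, not the course's method: it assumes the candidate expressions shown in the code (written in Python syntax, with "|" for union and an empty alternative for ε) and compares each against a reference predicate on all strings up to a fixed length.

```python
import re
from itertools import product

def strings(alphabet, max_len):
    """Every string over the alphabet up to max_len."""
    for n in range(max_len + 1):
        for t in product(alphabet, repeat=n):
            yield "".join(t)

def agrees(pattern, predicate, alphabet, max_len):
    """True if the regex and the predicate accept the same strings up to max_len."""
    rx = re.compile(pattern)
    return all(bool(rx.fullmatch(s)) == predicate(s) for s in strings(alphabet, max_len))

# (a) at least one a and at least one b, over {a, b, c}
ok_a = agrees(r"[abc]*a[abc]*b[abc]*|[abc]*b[abc]*a[abc]*",
              lambda s: "a" in s and "b" in s, "abc", 7)

# (c) at most one pair of consecutive 1's (overlapping pairs counted separately)
def at_most_one_pair(s):
    return sum(1 for x, y in zip(s, s[1:]) if x == y == "1") <= 1

ok_c = agrees(r"(0|10)*(|1|11)(0|01)*", at_most_one_pair, "01", 12)
print(ok_a, ok_c)
```

The same agrees helper rejects an expression that requires the a and b to be adjacent, since a string such as acb satisfies the predicate but not the pattern.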

Finite Automata and Regular Expressions
Regular expressions describe languages in a fundamentally different form from finite automata. However, both describe the same set of languages: the regular languages. To show this, one must prove two things:
  1. Every language defined by one of these automata is also defined by a regular expression. For this proof we can assume that the language is accepted by some D-FSA.
  2. Every language defined by a regular expression is defined by one of these automata. For this proof it suffices to show that there is an N-FSA with ε-transitions accepting the same language.

Finite Automata and Regular Expressions
Plan for showing the equivalence of four different notations for regular languages.
[Figure: diagram of the four notations (ε-NFSA, NFSA, DFSA, RE) and the conversions between them]

Converting Regular Expressions to Automata
We can show that every language L that is L(R) for some regular expression R is also L(E) for some ε-NFSA E.
  Start by showing how to construct automata for the basis expressions: single symbols, ε, and ∅.
  Then show how to combine these automata into larger automata that accept the union, concatenation, or closure.

Converting Regular Expressions to Automata
Theorem: Every language defined by a regular expression is also defined by a finite automaton.
Proof: Suppose L = L(R) for a regular expression R. We will show that L = L(E) for some ε-NFSA E with:
  Exactly one accepting state.
  No arcs into the initial state.
  No arcs out of the accepting state.
The proof is by structural induction on R, following the recursive definition of regular expressions.

Converting Regular Expressions to Automata
BASIS: [Figure: three basis automata]
  (a) The automaton for ε: the language of the automaton is {ε}.
  (b) The automaton for ∅: there is no path from the start state to the accepting state, so ∅ is the language of the automaton.
  (c) The automaton for a symbol a: the language of the automaton is L(a), which consists of the one string a.

Converting Regular Expressions to Automata
INDUCTION: It is assumed that the statement of the theorem is true for the immediate sub-expressions of a given regular expression.
[Figure: constructions for the three operators]
  R+S: the automaton accepts L(R) ∪ L(S)
  RS: the automaton accepts L(R)L(S)
  R*: the automaton accepts L(R*)

Example
Convert (0+1)*1(0+1) to an ε-NFSA.
[Figure: automata built in stages for (0+1), then (0+1)*, then (0+1)*1(0+1)]
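The basis and induction steps can be sketched in code. The following is a minimal Python illustration, not the construction exactly as drawn on the slides: an automaton is assumed to be a triple (start, accept, arcs), ε-arcs are labeled None, and all function names are hypothetical. Each combinator keeps the three invariants from the theorem (one accepting state, no arcs into the start state, no arcs out of the accepting state).

```python
from itertools import count

_ids = count()   # source of fresh state numbers
EPS = None       # label used for ε-arcs

def symbol(a):
    """Basis: two states joined by one arc labeled a."""
    s, f = next(_ids), next(_ids)
    return s, f, [(s, a, f)]

def union(R, S):
    """R+S: new start and accept states, ε-arcs into and out of both parts."""
    s, f = next(_ids), next(_ids)
    return s, f, R[2] + S[2] + [(s, EPS, R[0]), (s, EPS, S[0]),
                                (R[1], EPS, f), (S[1], EPS, f)]

def concat(R, S):
    """RS: ε-arc from the accepting state of R to the start state of S."""
    return R[0], S[1], R[2] + S[2] + [(R[1], EPS, S[0])]

def star(R):
    """R*: allow skipping R entirely or looping back through it."""
    s, f = next(_ids), next(_ids)
    return s, f, R[2] + [(s, EPS, f), (s, EPS, R[0]),
                         (R[1], EPS, R[0]), (R[1], EPS, f)]

def eclose(states, arcs):
    """All states reachable from the given states by ε-arcs alone."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for p, a, r in arcs:
            if p == q and a is EPS and r not in seen:
                seen.add(r)
                stack.append(r)
    return seen

def accepts(nfa, w):
    """Simulate the ε-NFSA on the string w."""
    start, final, arcs = nfa
    current = eclose({start}, arcs)
    for ch in w:
        moved = {r for p, a, r in arcs if p in current and a == ch}
        current = eclose(moved, arcs)
    return final in current

def bit():
    return union(symbol("0"), symbol("1"))          # (0+1)

nfa = concat(concat(star(bit()), symbol("1")), bit())   # (0+1)*1(0+1)
print(accepts(nfa, "0010"), accepts(nfa, "01"))
```

The resulting automaton accepts exactly the strings whose second symbol from the end is 1, which gives a simple property to test against.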

Converting D-FSA’s to Regular Expressions by Eliminating States
When a state s is eliminated from a D-FSA, all the paths that went through s no longer exist in the automaton.
Thus, if the language of the automaton is not to change, we must include, on an arc that goes directly from state q to state p, the labels of the paths that went from q to p through the eliminated state s.

Converting D-FSA’s to Regular Expressions by Eliminating States
[Figure: elimination of state s; afterwards the arc from q1 to p1 is labeled R11 + Q1 S* P1]

Strategy from D-FSA to RE
For each accepting state q of the D-FSA, apply the reduction process to produce an equivalent automaton with regular-expression labels on the arcs. Eliminate all states except q and the start state q0.
If q ≠ q0, then we are left with a two-state automaton.
[Figure: generic two-state automaton]
Its regular expression is (R+SU*T)*SU*.

Strategy from D-FSA to RE
If the start state is also an accepting state, then we must also perform a state elimination from the original automaton that gets rid of every state but the start state. What is left is a one-state automaton.
[Figure: generic one-state automaton]
Its regular expression is R*.
The desired regular expression is the sum (union) of all the expressions derived from the reduced automata for each accepting state, by rules (2) and (3).

Example for: D-FSA to RE
Consider the N-FSA below, which accepts all strings of 0’s and 1’s such that either the second or third position from the end has a 1. Derive an equivalent regular expression for the language of this N-FSA.
[Figure: N-FSA with states A, B, C, D]
Solution: First replace the arc labels with regular expressions.

Example for: D-FSA to RE
Eliminate state B:
  Predecessor states: A
  Successor states: C
  Equivalent expression A → C: 1(0+1)

Example for: D-FSA to RE
Branching: eliminate states C and D in separate reductions.
Elimination of state C:
  Predecessor states: A
  Successor states: D
  Equivalent expression A → D: 1(0+1)(0+1)

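Example for: D-FSA to RE
Applying the generic two-state automaton rule after eliminating C gives the RE for accepting state D: (0+1)*1(0+1)(0+1)
Eliminating D instead (from the automaton obtained after eliminating B) results in a two-state automaton on A and C.
Corresponding RE: (0+1)*1(0+1)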

Example for: D-FSA to RE
Combining the two expressions for the entire automaton by summing each RE:
((0+1)*1(0+1)(0+1)) + ((0+1)*1(0+1))
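The derived expression can be checked against the stated property of the language (a 1 in the second or third position from the end). This is a hedged brute-force sketch in Python, with the slide's "+" written as "|" and (0+1) written as [01]; it compares both descriptions on every string up to length 10.

```python
import re
from itertools import product

derived = re.compile(r"[01]*1[01][01]|[01]*1[01]")

def second_or_third_from_end_is_1(s):
    """Direct statement of the language accepted by the example automaton."""
    return (len(s) >= 2 and s[-2] == "1") or (len(s) >= 3 and s[-3] == "1")

disagreements = [
    s
    for n in range(11)
    for s in map("".join, product("01", repeat=n))
    if bool(derived.fullmatch(s)) != second_or_third_from_end_is_1(s)
]
print(disagreements)
```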

Algebraic Laws for Regular Expressions

Algebraic Laws for Regular Expressions
A collection of laws that define when two regular expressions are equivalent.
Arithmetic:
  Commutativity: x+y = y+x. Switching the order of the operands does not change the result.
  Associativity: (xy)z = x(yz). The operands can be regrouped when the operator is applied twice.
Regular expressions have a number of laws similar to the laws of arithmetic.

Associativity and Commutativity
For languages L, M and N (defined by regular expressions or, equivalently, by FSA):
  Commutative law for union: L+M = M+L
  Associative law for union: (L+M)+N = L+(M+N)
  Associative law for concatenation: (LM)N = L(MN)

Identities and Annihilators
Arithmetic:
  Identity: 0 is the identity for addition: 0+x = x+0 = x
  Identity: 1 is the identity for multiplication: 1 × x = x × 1 = x
  Annihilator: 0 is the annihilator for multiplication: 0 × x = x × 0 = 0
Regular expressions:
  Identity for union: ∅+L = L+∅ = L
  Identity for concatenation: εL = Lε = L
  Annihilator for concatenation: ∅L = L∅ = ∅
These laws are important in the simplification of regular expressions.

Distributive Laws
Arithmetic:
  A distributive law involves two operators.
  Distributive law of multiplication over addition (most common): x × (y+z) = x × y + x × z
Regular expressions:
  Left distributive law of concatenation over union: L(M+N) = LM + LN
  Right distributive law of concatenation over union: (M+N)L = ML + NL

Distributive Laws
Theorem: If L, M, and N are any languages, then L(M ∪ N) = LM ∪ LN.
Proof: Show that a string w is in L(M ∪ N) if and only if it is in LM ∪ LN.
(Only-if) If w is in L(M ∪ N), then w = xy, where x is in L and y is in M ∪ N, so y is in M or in N.
  If y is in M, then w = xy is in LM, and therefore in LM ∪ LN.
  If y is in N, then w = xy is in LN, and therefore in LM ∪ LN.
(If) If w is in LM ∪ LN, then w is either in LM or in LN.
  If w is in LM, then w = xy with x in L and y in M. Since y is in M, y is in M ∪ N; thus w is in L(M ∪ N).
  If w is in LN, then w = xy with x in L and y in N. Since y is in N, y is in M ∪ N; thus w is in L(M ∪ N).

The Idempotent Law
Arithmetic: Common arithmetic operators are not idempotent: in general x+x ≠ x and x × x ≠ x.
Regular expressions: the idempotent law for union: L+L = L

Laws Involving Closures
(L*)* = L*: closing an expression that is already closed does not change the language.
∅* = ε: the closure of ∅ contains only the string ε.
ε* = ε
L+ = LL* = L*L, where L+ = L + LL + LLL + …
  L* = ε + L + LL + LLL + …
  LL* = Lε + LL + LLL + LLLL + … = L + LL + LLL + … = L+ (using Lε = εL = L)
L* = L+ + ε
L? = ε + L

Discovering Laws for Regular Expressions
There is an infinite variety of laws about regular expressions that might be proposed.
Is there a general methodology that will make proofs of the correct laws easy?
  The truth of a law reduces to a question of the equality of two specific languages.
  The technique is closely tied to the regular-expression operators.
  It cannot be extended to expressions involving some other operators (e.g., intersection).

Discovering Laws for Regular Expressions
Consider a proposed law: (L+M)* = (L*M*)*
Given two languages L and M, the closure of their union, (L+M)*, is identical to the closure of the concatenation of the individually closed languages, (L*M*)*.
Proof: Suppose w is in the language of (L+M)*. Then we can write w = w1 w2 w3 … wk for some k, where each wi is in either L or M.
  If wi is in L, it is also in L*. Picking ε from M*, we get wi = wi ε, so wi is in L*M*.
  Similarly, if wi is in M, picking ε from L* shows that wi is in L*M*.
  Since each wi of w = w1 w2 w3 … wk is in L*M*, w itself is in (L*M*)*.
To complete the proof we must also show that strings in (L*M*)* are in (L+M)*. This is left as an exercise.
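Before attempting a proof, a proposed law can be tested on concrete languages. The sketch below is an illustration under stated assumptions: languages are finite Python sets of strings, the sample languages L and M are arbitrary choices, and closures are truncated to strings of length at most n (so agreement is evidence for the law, not a proof).

```python
def concat_upto(A, B, n):
    """Concatenation AB, keeping only strings of length <= n."""
    return {x + y for x in A for y in B if len(x + y) <= n}

def star_upto(L, n):
    """All strings of L* with length <= n."""
    result, frontier = {""}, {""}
    while frontier:
        frontier = concat_upto(frontier, L, n) - result
        result |= frontier
    return result

L, M, n = {"0", "11"}, {"10", "1"}, 8      # sample languages, assumed for illustration
lhs = star_upto(L | M, n)                                             # (L+M)*
rhs = star_upto(concat_upto(star_upto(L, n), star_upto(M, n), n), n)  # (L*M*)*
print(lhs == rhs)
```

A false law shows up as a mismatch; for example, (L+M)* and L*M* differ on these sample languages, because L*M* cannot return to a string of L after using one from M.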

Regular Expressions Details

Regular Expressions
Formally, a regular expression is an algebraic notation for characterizing a set of strings. Thus regular expressions can be used to specify search strings as well as to define a language in a formal way.
Regular expression search requires:
  A pattern that we want to search for, and
  A corpus of text to search through.
When we give a search pattern, we will assume that the search engine returns the line of the document in which the match occurs. This is what the UNIX grep command does.
In the examples we indicate the exact part of the text that matches the regular expression.
A search can be designed to return all matches to a regular expression or only the first match. We will show only the first match.

Basic Regular Expression Patterns
The simplest kind of regular expression is a sequence of simple characters: /woodchuck/, /Buttercup/, /!/

RE               Example                                               First match
/woodchuck/      “interesting links to woodchucks and lemurs”          woodchuck
/a/              “Mary Ann stopped by Mona’s”                          a (in “Mary”)
/Claire says,/   “Dagmar, my gift please,” Claire says,                Claire says,
/song/           “all our pretty songs”                                song
/!/              “You’ve left the burglar behind again!” said Nori     !

Basic Regular Expression Patterns
Regular expressions are case sensitive: /s/ is distinct from /S/, and /woodchucks/ will not match “Woodchucks”.
Disjunction of characters: “[” and “]”.

RE               Match                     Example                    First match
/[wW]oodchuck/   Woodchuck or woodchuck    “Woodchuck”                Woodchuck
/[abc]/          ‘a’, ‘b’, or ‘c’          “In uomini, in soldati”    a
/[1234567890]/   Any digit                 “plenty of 7 to 5”         7

Basic Regular Expression Patterns
Specifying a range in regular expressions: “-”

RE        Match                  Example                                       First match
/[A-Z]/   An uppercase letter    “we should call it ‘Drenched Blossoms’”       D
/[a-z]/   A lowercase letter     “my beans were impatient to be hoed!”         m
/[0-9]/   A single digit         “Chapter 1: Down the Rabbit Hole”             1

Basic Regular Expression Patterns
Negative specification (what the pattern cannot be): “^”
If the first symbol after the open square brace “[” is “^”, the resulting pattern is negated.
Example: /[^a]/ matches any single character (including special characters) except a.

RE         Match (single characters)    Example                                  First match
/[^A-Z]/   Not an uppercase letter      “Oyfn pripetchik”                        y
/[^Ss]/    Neither ‘S’ nor ‘s’          “I have no exquisite reason for ’t”      I
/[^\.]/    Not a period                 “our resident Djinn”                     o
/[e^]/     Either ‘e’ or ‘^’            “look up ^ now”                          ^
/a^b/      Pattern ‘a^b’                “look up a^b now”                        a^b

Basic Regular Expression Patterns
How do we specify both woodchuck and woodchucks?
Optional character specification: /?/
/?/ means “the preceding character or nothing”.

RE              Match                      Example        First match
/woodchucks?/   woodchuck or woodchucks    “woodchuck”    woodchuck
/colou?r/       color or colour            “colour”       colour

Basic Regular Expression Patterns
The question mark “?” can be thought of as “zero or one instances of the previous character”. It is a way to specify how many of something we want.
Sometimes we need to specify regular expressions that allow repetitions of things. For example, consider the language of (certain) sheep, which consists of strings that look like the following:
  baa!
  baaa!
  baaaa!
  baaaaa!
  baaaaaa!
  …

Basic Regular Expression Patterns
Any number of repetitions is specified by “*”, which means “zero or more occurrences of the previous character or expression”.
Examples:
  /aa*/ : a followed by zero or more a’s
  /[ab]*/ : zero or more a’s or b’s. This will match aaaa or abababa or bbbb.

Basic Regular Expression Patterns
We know enough to specify part of our regular expression for prices: multiple digits.
Regular expression for an individual digit: /[0-9]/
Regular expression for an integer: /[0-9][0-9]*/
Why not just /[0-9]*/? Because that pattern also matches the empty string (zero digits).
Since it is annoying to specify “at least once” by repeating the same pattern, there is a special character that means “one or more”: “+”.
Regular expression for an integer then becomes: /[0-9]+/
Regular expression for the sheep language (a b, at least two a’s, then “!”): /baaa*!/, or /baa+!/
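The difference between “*” and “+” is easy to see by running the patterns. The following is a small sketch using Python's re module; the sample strings are made up for illustration, and the sheep patterns assume the language requires at least two a's, as in the strings baa!, baaa!, and so on.

```python
import re

integer = re.compile(r"[0-9]+")       # one or more digits
maybe_empty = re.compile(r"[0-9]*")   # zero or more digits
sheep_star = re.compile(r"baaa*!")    # b, two a's, zero or more further a's, !
sheep_plus = re.compile(r"baa+!")     # same language written with +

print(integer.search("room 4096, floor 7").group())     # first run of digits
print(repr(maybe_empty.search("room 4096").group()))     # empty match at position 0

for s in ["ba!", "baa!", "baaaa!", "b!"]:
    print(s, bool(sheep_star.fullmatch(s)), bool(sheep_plus.fullmatch(s)))
```

The second search succeeds immediately with an empty match before the "r", which is why /[0-9]*/ on its own is not a useful pattern for an integer.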

Basic Regular Expression Patterns
One very important special character is the period, /./, a wildcard expression that matches any single character (except a carriage return).
Example: Find any line in which a particular word (for example Veton) appears twice: /Veton.*Veton/

RE         Match                                Example
/beg.n/    Any character between beg and n      begin, beg’n, begun

Repetition Metacharacters
Example text for all rows: “The aerial acceleration alerted the ace pilot”

RE       Description and example
*        Matches any number of occurrences of the previous character: zero or more.
         /ac*e/ matches “ae”, “ace”, “acce”, “accce”, …; in the example text: “ae”, “acce”, “ace”.
?        Matches at most one occurrence of the previous character: zero or one.
         /ac?e/ matches “ae” and “ace”; in the example text: “ae”, “ace”.
+        Matches one or more occurrences of the previous character.
         /ac+e/ matches “ace”, “acce”, “accce”, …; in the example text: “acce”, “ace”.
{n}      Matches exactly n occurrences of the previous character.
         /ac{2}e/ matches “acce”; in the example text: “acce”.
{n,}     Matches n or more occurrences of the previous character.
         /ac{2,}e/ matches “acce”, “accce”, etc.; in the example text: “acce”.
{n,m}    Matches from n to m occurrences of the previous character.
         /ac{2,4}e/ matches “acce”, “accce” and “acccce”; in the example text: “acce”.
.        Matches one occurrence of any character of the alphabet except the newline character.
         /a.e/ matches aae, aAe, abe, aBe, a1e, etc.; in the example text: “ale”, “ace”.
.*       Matches any string of characters, up to (but not including) a newline character.
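The table can be reproduced by running each pattern over the example sentence. This is a short sketch using Python's re.findall, which returns every non-overlapping match rather than only the first one.

```python
import re

text = "The aerial acceleration alerted the ace pilot"
patterns = [r"ac*e", r"ac?e", r"ac+e", r"ac{2}e", r"ac{2,}e", r"ac{2,4}e", r"a.e"]

# Collect all matches of each pattern in the example sentence.
results = {p: re.findall(p, text) for p in patterns}
for p, found in results.items():
    print(f"/{p}/", found)
```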

Anchors
Anchors are special characters that anchor regular expressions to particular places in a string. The most common anchors are:
  “^” matches the start of a line
  “$” matches the end of a line
Examples:
  /^The/ matches the word “The” only at the start of a line.
  Three uses of “^”:
    /^xyz/ : matches the start of the line
    /[^xyz]/ : negation
    /[e^]/ : just a literal caret
  /⌴$/ : “⌴” stands for the space character; matches a space at the end of a line.
  /^The dog\.$/ : matches a line that contains only the phrase “The dog.”

Anchors
/\b/ matches a word boundary.
/\B/ matches a non-boundary.
/\bthe\b/ matches the word “the” but not the word “other”.
A word is defined as any sequence of digits, underscores or letters.
/\b99/ will match the string 99 in “There are 99 bottles of beer on the wall” but NOT in “There are 299 bottles of beer on the wall”. It will also match the 99 in “$99”, since 99 follows a “$”, which is not a digit, underscore, or letter.
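The word-boundary behaviour described above can be checked directly. A minimal sketch with Python's re module follows; the first sample sentence is an assumption made up for illustration, the others come from the slide.

```python
import re

# \bthe\b finds "the" only as a whole word.
whole_words = re.findall(r"\bthe\b", "the other theme, the end")
print(whole_words)

# \b99 needs a boundary before the first 9.
samples = [
    "There are 99 bottles of beer on the wall",
    "There are 299 bottles of beer on the wall",
    "It costs $99",
]
boundary_hits = [bool(re.search(r"\b99", s)) for s in samples]
print(boundary_hits)
```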

Disjunction, Grouping and Precedence
Suppose we need to search for texts about pets; specifically, we may be interested in cats and dogs. If we want to search for either the string “cat” or the string “dog”, we cannot use any of the constructs we have introduced so far (why not “[]”?).
The new operator that defines disjunction, also called the pipe symbol, is “|”.
/cat|dog/ matches either the string cat or the string dog.

Grouping
In many instances it is necessary to group a sequence of characters so that it is treated as one unit.
Example: Search for guppy and guppies: /gupp(y|ies)/
Grouping is useful in conjunction with the “*” operator, since /*/ applies to a single character and not to a whole sequence.
Example: Match “Column 1 Column 2 Column 3 …”
  /Column⌴[0-9]+⌴*/ will match a single column followed by any number of spaces.
  /(Column⌴[0-9]+⌴*)*/ will match “Column 1 Column 2 Column 3 …”

Operator Precedence Hierarchy
Precedence from highest to lowest:

Operator class            Operators
Parenthesis               ()
Counters                  * + ? {}
Sequences and anchors     ^ $
Disjunction               |

Simple Example
Problem statement: We want to write an RE to find cases of the English article “the”.
  /the/ : it will miss “The”.
  /[tT]he/ : it will also match “amalthea”, “Bethesda”, “theology”, etc.
  /\b[tT]he\b/ : is the correct RE.
Problem statement: If we want to find “the” where it might also have underlines or numbers nearby (“The-”, “the_” or “the25”), we need to specify that we want instances in which there are no alphabetic letters on either side of “the”:
  /[^a-zA-Z][tT]he[^a-zA-Z]/ : it will not find “the” if it begins the line.
  /(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
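The successive refinements can be compared by counting matches in one sample sentence. This is an illustrative sketch with Python's re module; the sentence is an assumption chosen to contain each problem case.

```python
import re

text = "The theology of the other; the_ end"
patterns = {
    "plain": r"the",
    "either case": r"[tT]he",
    "word boundary": r"\b[tT]he\b",
    "no letters around": r"(^|[^a-zA-Z])[tT]he[^a-zA-Z]",
}

# Number of non-overlapping matches of each pattern in the sample text.
counts = {name: len(re.findall(p, text)) for name, p in patterns.items()}
for name, c in counts.items():
    print(name, c)
```

The word-boundary version skips “the_” because the underscore counts as a word character, while the last pattern accepts it since the underscore is not an alphabetic letter.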

A More Complex Example
Problem statement: Build an application to help a user purchase a computer on the Web.
The user might want “any PC with more than 1000 MHz and 80 Gb of disk space for less than $1000”.
To solve the problem we must be able to match expressions like 1000 MHz, 1 GHz and 80 Gb, as well as $999.99, etc.

Solution: Dollar Amounts
Complete regular expression for prices of full dollar amounts: /$[0-9]+/
Adding fractions of dollars: /$[0-9]+\.[0-9][0-9]/ or /$[0-9]+\.[0-9]{2}/
Problem: this RE will match only “$199.99” and not “$199”. To solve this issue we must make the cents optional and make sure the dollar amount is a word:
/\b$[0-9]+(\.[0-9][0-9])?\b/
Note: in most regular expression engines the dollar sign must be escaped (\$) to match a literal “$”, since an unescaped $ is the end-of-line anchor.
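A runnable variant of the price pattern is sketched below for Python's re module. Two adaptations are assumptions of this sketch rather than part of the slides: the dollar sign is escaped as \$, and the leading \b is dropped, because there is no word boundary between a space and "$" (neither is a word character).

```python
import re

# Price with optional cents; trailing \b keeps the amount from running into more digits or letters.
price = re.compile(r"\$[0-9]+(\.[0-9][0-9])?\b")

def first_price(text):
    """Return the first price found in text, or None if there is none."""
    m = price.search(text)
    return m.group(0) if m else None

for text in ["only $199.99 today", "only $199 today", "no price here"]:
    print(text, "->", first_price(text))
```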

Solution: Processor Speed
Processor speed in megahertz (MHz) or gigahertz (GHz):
/\b[0-9]+⌴*(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b/
⌴* is used to denote “zero or more spaces”.

Solution: Disk Space
Dealing with disk space: Gb = gigabytes.
Memory size: Mb = megabytes or Gb = gigabytes.
Must allow optional fractions:
/\b[0-9]+⌴*(M[Bb]|[Mm]egabytes?)\b/
/\b[0-9]+(\.[0-9]+)?⌴*(G[Bb]|[Gg]igabytes?)\b/
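The speed and gigabyte patterns can be tried on a sample product description. In this sketch for Python's re module, the "⌴" placeholder is written as an ordinary space and the advertisement text is made up for illustration.

```python
import re

speed = re.compile(r"\b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b")
gigabytes = re.compile(r"\b[0-9]+(\.[0-9]+)? *(G[Bb]|[Gg]igabytes?)\b")

ad = "PC with 1000 MHz processor, 1.5 Gb of memory and 80 gigabytes of disk"

# m.group(0) is the whole matched text, regardless of the inner groups.
speeds = [m.group(0) for m in speed.finditer(ad)]
sizes = [m.group(0) for m in gigabytes.finditer(ad)]
print(speeds)
print(sizes)
```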

Solution: Operating Systems and Vendors
/\b((Windows)+⌴*(XP|Vista)?)\b/
/\b(Mac|Macintosh|Apple)\b/

Advanced Operators
Aliases for common sets of characters:

RE    Expansion       Match                             Example          First match
\d    [0-9]           Any digit                         “Party of 5”     5
\D    [^0-9]          Any non-digit                     “Blue moon”      B
\w    [a-zA-Z0-9_]    Any alphanumeric or underscore    “Daiyu”          D
\W    [^\w]           A non-alphanumeric                “!!!!”           !
\s    [⌴\r\t\n\f]     Whitespace (space, tab)           “ ”              (space)
\S    [^\s]           Non-whitespace                    “in Concord”     i

Literal Matching of Special Characters & “\” Characters
Some characters that need to be backslashed (“\”):

RE    Match                          Example                                  First match
\*    An asterisk “*”                “K*A*P*L*A*N”                            *
\.    A period “.”                   “Dr. Këpuska, I presume”                 .
\?    A question mark “?”            “Would you like to light my candle?”     ?
\n    A newline
\t    A tab
\r    A carriage return character

Regular Expression Substitution, Memory, and ELIZA
Substitutions are an important use of regular expressions.
s/regexp1/regexp2/ allows a string characterized by one regular expression (regexp1) to be replaced by a string characterized by a second expression (regexp2). Example: s/colour/color/
It is also important to be able to refer to a particular subpart of the string matching the first pattern.
Example: replace “the 35 boxes” with “the <35> boxes”:
s/([0-9]+)/<\1>/
Here “\1” refers to the text matched by the first parenthesized part of the first regular expression.

Regular Expression Substitution, Memory, and ELIZA
The parenthesis and number operators can also be used to specify that a certain string or expression must occur twice in the text.
Example: “the Xer they were, the Xer they will be”
We want to constrain the two X’s to be the same string:
/[Tt]he (.*)er they were, the \1er they will be/
This RE will match “The bigger they were, the bigger they will be”, but not “The bigger they were, the faster they will be”.
The number operator can be used with other numbers: if you match two different sets of parentheses, \2 means whatever matched the second set.
Example:
/[Tt]he (.*)er they (.*), the \1er they \2/
This RE will match “The bigger they were, the bigger they were”, but not “The bigger they were, the better they will be”.
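Substitution with memory maps directly onto Python's re module, where re.sub plays the role of s/…/…/ and \1, \2 refer to parenthesized groups in both the pattern and the replacement. The sketch below replays the examples from the slides.

```python
import re

# s/([0-9]+)/<\1>/ : wrap the matched number in angle brackets.
boxed = re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes")
print(boxed)

# \1 inside the pattern forces the second X to repeat the first one.
same_x = re.compile(r"[Tt]he (.*)er they were, the \1er they will be")
print(bool(same_x.search("The bigger they were, the bigger they will be")))
print(bool(same_x.search("The bigger they were, the faster they will be")))

# Two groups: \1 and \2 must both be repeated.
two_groups = re.compile(r"[Tt]he (.*)er they (.*), the \1er they \2")
print(bool(two_groups.search("The bigger they were, the bigger they were")))
print(bool(two_groups.search("The bigger they were, the better they will be")))
```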

Registers
Numbered memories are called registers:
  \1 : register 1
  \2 : register 2
  \3 : register 3

ELIZA
Substitutions using memory are very useful in implementing simple natural-language understanding programs like ELIZA. Here is an example of a dialog with ELIZA:
  User1: Men are all alike.
  ELIZA1: IN WHAT WAY
  User2: They’re always bugging us about something or other.
  ELIZA2: CAN YOU THINK OF A SPECIFIC EXAMPLE
  User3: Well, my boyfriend made me come here.
  ELIZA3: YOUR BOYFRIEND MADE YOU COME HERE
  User4: He says I’m depressed much of the time.
  ELIZA4: I AM SORRY TO HEAR YOU ARE DEPRESSED.

ELIZA
ELIZA worked by having a cascade of regular expression substitutions, each of which matched some part of the input lines and changed them.
The first substitutions changed all instances of:
  “my” ⇨ “YOUR”
  “I’m” ⇨ “YOU ARE”
The next set of substitutions looked for relevant patterns in the input and created an appropriate output:
  s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
  s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
  s/.* ALL .*/IN WHAT WAY/
  s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/

ELIZA
Since multiple substitutions could apply to a given input, substitutions were assigned a rank and were applied in order.
Creation of such patterns is addressed in Exercise 2.2.
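A ranked cascade of substitutions can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the original ELIZA program: matching is made case-insensitive, replies are upper-cased, only the first rule that fires is used, and the fallback reply "PLEASE GO ON" and the function name respond are hypothetical.

```python
import re

# First pass: pronoun rewrites applied to every input line.
PREPROCESS = [(r"\bI'm\b", "YOU ARE"), (r"\bmy\b", "YOUR")]

# Second pass: ranked rules; the first one that matches produces the reply.
RULES = [
    (r".* YOU ARE (depressed|sad) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".* ALL .*", "IN WHAT WAY"),
    (r".* always .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def respond(line):
    for pattern, replacement in PREPROCESS:
        line = re.sub(pattern, replacement, line)
    for pattern, replacement in RULES:
        reply, n = re.subn(pattern, replacement, line, flags=re.IGNORECASE)
        if n:                      # this rule fired; lower-ranked rules are skipped
            return reply.upper()
    return "PLEASE GO ON"          # assumed fallback, not from the slides

for utterance in ["Men are all alike.",
                  "They're always bugging us about something or other.",
                  "He says I'm depressed much of the time."]:
    print(respond(utterance))
```

Ordering the rules in a list is one simple way to realize the ranking mentioned above; a different ranking would change which reply wins when several patterns apply.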

Applications of Regular Expressions

Lexical Analysis (lex, flex, yacc): http://dinosaur.compilertools.net/
Finding Patterns in Text: http://www.gnu.org/software/grep/grep.html

End