CSC 594 Topics in AI


myesha-ticknor
Uploaded On 2019-12-26




Presentation Transcript

1 CSC 594 Topics in AI – Natural Language Processing
Spring 2018
3. Regular Expressions
(Some slides adapted from Jurafsky & Martin)

2 Document Search
'Information Retrieval (IR)' implies a query (e.g. search terms). For a given query, relevant or similar documents are returned.
But the most basic document retrieval technique is keyword/search-term matching: retrieve all (or selected) documents which contain the search terms, by string matching.
Python example:
>>> s1 = 'public'
>>> s2 = 'public'
>>> s2 == s1
True

myword = "month python"
with open("textfile.txt") as openfile:
    for line in openfile:
        if myword in line:
            print(line)

3 Regular Expressions and Text Searching
Regular expressions are a compact textual representation of a set of strings that constitute a language.
In the simplest case, regular expressions describe regular languages.
Here, a language means a set of strings given some alphabet.
Extremely versatile and widely used technology: Emacs, vi, perl, grep, etc.

4 Example
Find all the instances of the word "the" in a text.
/the/
/[tT]he/
/\b[tT]he\b/
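The three patterns can be compared in Python. This is a minimal sketch: the sample sentence is made up for illustration, chosen so that "the" also occurs inside other words.

```python
import re

# Hypothetical sample text (not from the slides).
text = "The other day the cat sat there."

# /the/ : any lowercase occurrence, even inside "other" and "there"
plain = re.findall(r"the", text)

# /[tT]he/ : also accepts a capital T, but still matches inside words
either_case = re.findall(r"[tT]he", text)

# /\b[tT]he\b/ : word boundaries keep only the standalone word
whole_word = re.findall(r"\b[tT]he\b", text)

print(len(plain), len(either_case), whole_word)
# 3 4 ['The', 'the']
```

Only the last pattern avoids both kinds of error: missing "The" and falsely matching "other" or "there".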

5 String Matching Using Patterns
Often, we wish to find a substring which matches a pattern, e.g. e-mail addresses:
Any number of alphanumeric characters and/or dots (not a dot at the beginning or end)
@
Any number of alphanumeric characters and/or dots (not a dot at the beginning or end); must contain at least one dot
Examples:
Valid: tomuro@cs.depaul.edu, noriko.tomuro@gmail.com
Invalid: .tomuro@cs.depaul.edu, tomuro@depaul
When you want to specify search words by patterns like this, regular expressions are commonly used.
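One possible way to write the e-mail rules above as a regular expression is sketched below. The names EMAIL and is_valid_email are illustrative, not from the slides. The pattern is also slightly stricter than the stated rules, since it rejects two dots in a row; real e-mail syntax is more permissive than either.

```python
import re

# Assumed translation of the slide's rules (EMAIL is an illustrative name).
# Each part is one or more alphanumerics, optionally followed by
# ".alphanumerics" pieces, so a dot can never start or end a part.
# The domain part requires at least one such ".piece".
EMAIL = re.compile(
    r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*"   # local part
    r"@"
    r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)+"   # domain with at least one dot
)

def is_valid_email(address):
    """Return True if the whole string fits the rules sketched above."""
    return EMAIL.fullmatch(address) is not None

for address in ["tomuro@cs.depaul.edu", "noriko.tomuro@gmail.com",
                ".tomuro@cs.depaul.edu", "tomuro@depaul"]:
    print(address, is_valid_email(address))
```

The first two addresses print True and the last two print False, matching the valid and invalid examples on the slide.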

6 Regular Expressions (1)
Regular expressions are an algebra for defining patterns. For example, the regular expression "a*b" matches the string "aaaab".
But without going through the formal definitions, here is a (partial) summary.
Simple Patterns
Characters match themselves. Note that characters are case-sensitive.
Metacharacters -- not to be used literally as is: . ^ $ * + ? { } [ ] \ | ( )
To use a metacharacter literally, put a backslash before it: \. \^ \+ etc.
Other special characters: \t, \n, \r, \f etc.
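A short sketch of why metacharacters need escaping, using Python's re module. The test strings are made up for illustration.

```python
import re

# "a*b": zero or more 'a' followed by 'b'
star = re.fullmatch(r"a*b", "aaaab")

# An unescaped dot is a metacharacter and matches any character...
loose = re.search(r"3.99", "3x99")
# ...while an escaped dot matches only a literal '.'
strict = re.search(r"3\.99", "3x99")

# re.escape() inserts the backslashes needed to match a string literally
escaped = re.escape("$3.99")

print(star is not None, loose is not None, strict is None)
# True True True
print(escaped)
# \$3\.99
```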

7 Regular Expressions (2)
Character classes
[abc] -- a, b, or c
[^abc] -- any character except a, b, or c
[a-zA-Z] -- a through z, or A through Z, inclusive (range)
Predefined character classes
. (dot) -- any character
\d -- a digit ([0-9])
\D -- a non-digit ([^0-9])
\s -- a whitespace character (e.g. space, \t, \n, \r)
\S -- a non-whitespace character
\w -- a word character ([a-zA-Z_0-9])
\W -- a non-word character ([^\w])
Boundary matchers
^ -- the beginning of a line
$ -- the end of a line
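The classes above can be tried out with re.findall. This is a small sketch with made-up input strings. Note that in Python, ^ and $ match at every line only when the re.M flag is given; without it they anchor to the start and end of the whole string.

```python
import re

digits = re.findall(r"\d", "Room 42B")            # every single digit
not_abc = re.findall(r"[^abc]", "cabbage")        # characters other than a, b, c
words = re.findall(r"\w+", "hi, there_1!")        # runs of word characters
pieces = re.split(r"\s+", "a b\tc\nd")            # split on any whitespace

lines = "first line\nsecond line"
starts = re.findall(r"^\w+", lines, re.M)         # first word of each line
ends = re.findall(r"\w+$", lines, re.M)           # last word of each line

print(digits, not_abc, words, pieces)
# ['4', '2'] ['g', 'e'] ['hi', 'there_1'] ['a', 'b', 'c', 'd']
print(starts, ends)
# ['first', 'second'] ['line', 'line']
```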

8 Regular Expressions (3)
Greedy quantifiers
X? -- X, once or not at all
X* -- X, zero or more times
X+ -- X, one or more times
X{n} -- X, exactly n times
X{n,m} -- X, at least n but no more than m times
Logical operators
XY -- X followed by Y
X|Y -- either X or Y
(X) -- X, as a capturing group
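A minimal sketch of each quantifier and operator in Python, using made-up test strings. The last line shows what "greedy" means: the quantifier takes as many characters as it can while still allowing a match.

```python
import re

optional = re.fullmatch(r"colou?r", "color") is not None   # u? : once or not at all
star = re.findall(r"ab*", "a ab abbb")                     # b* : zero or more
plus = re.findall(r"ab+", "a ab abbb")                     # b+ : one or more
exact = re.fullmatch(r"\d{4}", "2018") is not None         # {4} : exactly four
bounded = re.findall(r"a{2,3}", "a aa aaaa")               # {2,3} : two or three
either = re.findall(r"cat|dog", "cat dog bird")            # X|Y : alternation

# Greedy: .+ runs to the LAST '>' rather than stopping at the first one
greedy = re.search(r"<.+>", "<b>bold</b>").group()

print(star, plus, bounded, either)
# ['a', 'ab', 'abbb'] ['ab', 'abbb'] ['aa', 'aaa'] ['cat', 'dog']
print(greedy)
# <b>bold</b>
```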

9 Regular Expression in Python (1)
Regular expressions are in the 're' module.
Notation for patterns is slightly different from other languages: a raw string can be used as an alternative to a regular string, so that backslashes do not have to be doubled.
First compile an expression (into an re object). Then match it against a string.
>>> import re
>>> p = re.compile('ab*')

Regular String      Raw string
"ab*"               r"ab*"
"\\\\section"       r"\\section"
"\\w+\\s+\\1"       r"\w+\s+\1"
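The equivalence of the two notations can be checked directly. This is a small sketch, and the strings being matched are made up for illustration.

```python
import re

# Each pair spells the same pattern text; the raw form needs fewer backslashes.
same_1 = "ab*" == r"ab*"
same_2 = "\\\\section" == r"\\section"
same_3 = "\\w+\\s+\\1" == r"\w+\s+\1"

# Compile first, then match against a string.
p = re.compile(r"ab*")
m = p.match("abbbc")

# r"\\section" is the pattern for a literal backslash followed by "section".
sec = re.search(r"\\section", r"\section{Intro}")

print(same_1, same_2, same_3)
# True True True
print(m.group(), sec.group())
# abbb \section
```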

10 Regular Expression in Python (2)
Matching an re object against a string is done in several ways.

Method/Attribute    Purpose
match()             Determine if the RE matches at the beginning of the string.
search()            Scan through a string, looking for any location where this RE matches.
findall()           Find all substrings where the RE matches, and return them as a list.
finditer()          Find all substrings where the RE matches, and return them as an iterator.
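The four methods can be compared on one pattern. This sketch uses the sentence from the course examples; the variable names are illustrative.

```python
import re

sent = "This book on tennis cost $3.99 at Walmart."
p = re.compile(r"is")

at_start = p.match(sent)                          # None: the string starts with "Th"
first = p.search(sent)                            # first location anywhere in the string
all_text = p.findall(sent)                        # list of matched substrings
all_spans = [m.span() for m in p.finditer(sent)]  # iterator yields match objects

print(at_start, first.span(), all_text, all_spans)
# None (2, 4) ['is', 'is'] [(2, 4), (17, 19)]
```

finditer() is the one to use when the positions of the matches are needed, since findall() returns only the matched text.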

11
>>> import re
>>> sent = "This book on tennis cost $3.99 at Walmart."
>>> p1 = re.compile("ten")
>>> m1 = p1.match(sent)
>>> m1
>>> p2 = re.compile(".*ten.*")
>>> m2 = p2.match(sent)
>>> m2
<_sre.SRE_Match object; span=(0, 42), match='This book on tennis cost $3.99 at Walmart.'>
>>> m3 = re.search(p1, sent)
>>> m3
<_sre.SRE_Match object; span=(13, 16), match='ten'>
>>> m4 = re.search(p2, sent)
>>> m4
<_sre.SRE_Match object; span=(0, 42), match='This book on tennis cost $3.99 at Walmart.'>
>>> pp1 = re.compile("is")
>>> m5 = re.findall(pp1, sent)
>>> m5
['is', 'is']
>>> pp2 = re.compile("\\d")
>>> m6 = re.search(pp2, sent)
>>> m6
<_sre.SRE_Match object; span=(26, 27), match='3'>
>>> pp3 = re.compile("\\d+")
>>> m7 = re.search(pp3, sent)
>>> m7
<_sre.SRE_Match object; span=(26, 27), match='3'>

12
>>> pp3 = re.compile("\\$\\d+\\.\\d\\d")
>>> m8 = re.search(pp3, sent)
>>> m8
<_sre.SRE_Match object; span=(25, 30), match='$3.99'>
>>> pp4 = re.compile(r"\$\d+\.\d\d")
>>> m9 = re.search(pp4, sent)
>>> m9
<_sre.SRE_Match object; span=(25, 30), match='$3.99'>

13 Regular Expression in Python (3)
Grouping -- You can retrieve the matched substrings using parentheses.
Capturing groups are numbered by counting their opening parentheses from left to right. In the expression ((A)(B(C))), for example, there are four such groups:
1 ((A)(B(C)))
2 (A)
3 (B(C))
4 (C)
Group zero always stands for the entire expression.
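The numbering rule can be verified by matching the expression ((A)(B(C))) against the string "ABC". This is a minimal sketch.

```python
import re

m = re.fullmatch(r"((A)(B(C)))", "ABC")

whole = m.group(0)      # group zero: the entire match
numbered = m.groups()   # groups 1..4, ordered by their opening parenthesis

print(whole, numbered)
# ABC ('ABC', 'A', 'BC', 'C')
```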

14
>>> ppp1 = re.compile("(\\w+) cost (\\$\\d+\\.\\d\\d)")
>>> mm1 = re.search(ppp1, sent)
>>> mm1
<_sre.SRE_Match object; span=(13, 30), match='tennis cost $3.99'>
>>> mm1.group(0)
'tennis cost $3.99'
>>> mm1.group(1)
'tennis'
>>> mm1.group(2)
'$3.99'

15 TutorialsPoint, http://www.tutorialspoint.com/python/python_reg_expressions.htm

16 Regular Expression Modifiers: Option Flags
Regular expression literals may include an optional modifier to control various aspects of matching. The modifiers are specified as an optional flag. You can provide multiple modifiers by combining them with bitwise OR (|). Each modifier is one of these:

Modifier    Description
re.I        Performs case-insensitive matching.
re.L        Interprets words according to the current locale. This interpretation affects the alphabetic group (\w and \W), as well as word boundary behavior (\b and \B).
re.M        Makes $ match the end of a line (not just the end of the string) and makes ^ match the start of any line (not just the start of the string).
re.S        Makes a period (dot) match any character, including a newline.
re.U        Interprets letters according to the Unicode character set. This flag affects the behavior of \w, \W, \b, \B.
re.X        Permits "cuter" regular expression syntax. It ignores whitespace (except inside a set [] or when escaped by a backslash) and treats unescaped # as a comment marker.

TutorialsPoint, http://www.tutorialspoint.com/python/python_reg_expressions.htm
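A short sketch of four of these flags in use. The two-line text is made up for illustration; the sentence is the one from the earlier examples.

```python
import re

sent = "This book on tennis cost $3.99 at Walmart."
text = "The cat\nthe dog"

# re.I: ignore case
any_case = re.findall(r"the", text, re.I)

# re.I | re.M: flags are combined with bitwise OR; ^ now matches at each line start
line_starts = re.findall(r"^the", text, re.I | re.M)

# re.S: without it, the dot does not match the newline between the two lines
no_dotall = re.search(r"cat.the", text)
dotall = re.search(r"cat.the", text, re.S)

# re.X: whitespace and # comments inside the pattern are ignored
price = re.compile(r"""
    \$      # a literal dollar sign
    \d+     # dollars
    \.      # a literal dot
    \d\d    # cents
""", re.X)

print(any_case, line_starts, no_dotall is None, dotall is not None)
# ['The', 'the'] ['The', 'the'] True True
print(price.search(sent).group())
# $3.99
```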