Course 2 Regular expressions and - PowerPoint Presentation

Uploaded On 2022-07-28

Slide1

Course 2 Regular expressions and AntConc concordance tool

University of Bucharest

Digital Humanities Master

February 2020

Anca Dinu

Slide2

Regular expressions

In corpus linguistics, much of what we want to do with a tool is pattern-matching over texts/corpora, like:

Find all words that begin with […] and end with a vowel.

Find all words that have a sequence of three vowels.

Find all three-syllable words.

Find all adjectives ending in -[…].

Find all plural nouns preceded by "the" in questions.
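
As a rough illustration of such queries, here is a minimal Python sketch using the standard `re` module. The sample sentence and the simplified vowel set (a, e, i, o, u) are assumptions made for this example only; they do not come from the slides.

```python
import re

# Hypothetical sample text, used only for illustration.
text = "The beautiful queen quietly wrote a curious note."

# Words that contain a sequence of three vowels.
three_vowels = re.findall(r"\b\w*[aeiou]{3}\w*\b", text, flags=re.IGNORECASE)

# Words that end with a vowel.
ends_with_vowel = re.findall(r"\b\w*[aeiou]\b", text, flags=re.IGNORECASE)

print(three_vowels)     # ['beautiful', 'queen', 'quietly', 'curious']
print(ends_with_vowel)  # ['The', 'wrote', 'a', 'note']
```

Queries such as "three-syllable words" or "plural nouns in questions" need more than character patterns (syllabification, part-of-speech annotation), so they are left out of this sketch.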

Slide3

Regular expressions

Regexes are expressions that represent string patterns.

They are extremely useful for extracting information from any text by searching for matches of a specific search pattern (i.e. a specific sequence of ASCII or Unicode characters).

Regular expression search is among the most popular and powerful tools for this, and basic patterns are easy to use.

They describe the regular languages, the most restricted class in Chomsky's hierarchy, and were formalized by the mathematician Stephen Kleene (hence the Kleene star, *).

Slide4

Chomsky Hierarchy

From most to least strict, the four classes of formal grammars in the Chomsky hierarchy (CH) are:

Regular grammars, which retain no past state knowledge from input string to output string

Context-free grammars, which retain only recent state knowledge from input string to output string

Context-sensitive grammars, which keep all past state knowledge from input string to output string

Unrestricted (or recursively enumerable) grammars, which have all state knowledge and thus can create every output string imaginable from a given input string

Slide5

Chomsky Hierarchy

Slide6

Regular Grammar - linguistic flavour

A regular grammar is a mathematical object, G, with four components, G = (N, Σ, P, S), where

N is a nonempty, finite set of nonterminal symbols,

Σ is a finite set of terminal symbols (the alphabet),

P is a set of grammar rules, each having one of the forms

A → aB

A → a

A → ε, for A, B ∈ N, a ∈ Σ, and ε the empty string, and

S ∈ N is the start symbol.

Slide7

The Language Generated by a Regular Grammar

Let G = (N, Σ, P, S) be a regular grammar. The language generated by G is

L(G) = {w | S ⇒* w, where w ∈ Σ*}

Translation: the language of a regular grammar is the set of all strings over the alphabet Σ that can be derived from the start symbol by application of the grammar rules.
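
To make the definition concrete, the following Python sketch enumerates part of L(G) for a small grammar. The grammar itself (S → aB, B → bB, B → ε) and the dictionary encoding of the rules are assumptions chosen for illustration; they are not taken from the slides.

```python
from collections import deque

# Hypothetical regular grammar G = (N, Σ, P, S), chosen for illustration:
#   S → aB,  B → bB,  B → ε
START = "S"
# Each rule is a (terminal, next_nonterminal) pair.
# ("", None) encodes A → ε and (a, None) would encode A → a.
RULES = {
    "S": [("a", "B")],
    "B": [("b", "B"), ("", None)],
}

def generate(max_length):
    """Return all w with S ⇒* w and len(w) <= max_length, shortest first."""
    results = set()
    queue = deque([("", START)])  # (terminals derived so far, pending nonterminal)
    while queue:
        prefix, nonterminal = queue.popleft()
        for terminal, next_nonterminal in RULES[nonterminal]:
            word = prefix + terminal
            if len(word) > max_length:
                continue  # prune derivations that are already too long
            if next_nonterminal is None:
                results.add(word)  # derivation finished: word is in L(G)
            else:
                queue.append((word, next_nonterminal))
    return sorted(results, key=lambda w: (len(w), w))

print(generate(4))  # ['a', 'ab', 'abb', 'abbb']
```

L(G) is infinite for this grammar, so the sketch only lists the strings up to a chosen length.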

Slide8

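Regular grammar-regex equivalence

A formal grammar (like a regular grammar) generates and recognizes a language.

Regexes do the same.

Regular grammars are (approximately) equivalent to regexes: formal regular expressions describe exactly the regular languages, while practical regex engines add features, such as backreferences, that go beyond them.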
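
The equivalence can be checked empirically on a small case. The sketch below assumes a hypothetical grammar (S → aB, B → bB, B → ε) and the regex `ab*`, and compares the two on every string over {a, b} up to length 5; it is a sanity check under these assumptions, not a proof.

```python
import re
from itertools import product

# Hypothetical grammar: S → aB, B → bB, B → ε
# Rules are (terminal, next_nonterminal); ("", None) encodes A → ε.
RULES = {"S": [("a", "B")], "B": [("b", "B"), ("", None)]}

def grammar_accepts(word, start="S"):
    """Check whether the grammar derives `word` by tracking reachable states."""
    current = {start}  # None in this set means "derivation already finished"
    for symbol in word:
        current = {
            nxt
            for state in current
            if state is not None
            for terminal, nxt in RULES[state]
            if terminal == symbol
        }
    # Accept if a derivation has finished or can finish with an ε-rule.
    return None in current or any(
        ("", None) in RULES[state] for state in current if state is not None
    )

pattern = re.compile(r"ab*")
mismatches = [
    word
    for n in range(6)
    for word in map("".join, product("ab", repeat=n))
    if grammar_accepts(word) != bool(pattern.fullmatch(word))
]
print(mismatches)  # [] : grammar and regex agree on all strings up to length 5
```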

Slide9

Regexes - CS flavour

A regular expression is recursively defined as follows:

The empty set is a regular expression.

The empty string is a regular expression.

For any character x in the input alphabet, x is a regular expression that produces the regular language {x}.

Plus the following three operations:

Slide10

Regexes - CS flavour

Alternation: If x and y are regular expressions, then x | y is a regular expression. For example, the regular expression […] produces the regular language {[…]}.

Concatenation: If x and y are regular expressions, then x • y is a regular expression. For example, the regular expression […] • […] produces the regular language {[…]}.

Repetition (Kleene star): If x is a regular expression, then x* is a regular expression. For example, the regular expression […] • […]* produces the regular language {[…], abb, abbb, ...}.
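
The three operations can be tried out in Python's `re` module, as in the sketch below. The patterns `a|b`, `ab` and `ab*` and the list of candidate strings are examples chosen here for illustration, not the ones from the original slide. Python has no explicit • operator: concatenation is written by placing two expressions next to each other.

```python
import re

def language(pattern, candidates):
    """Return the candidate strings that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

# Hypothetical candidate strings over the alphabet {a, b}.
candidates = ["", "a", "b", "ab", "ba", "abb", "abbb", "aab"]

alternation = language(r"a|b", candidates)    # a OR b
concatenation = language(r"ab", candidates)   # a followed by b
repetition = language(r"ab*", candidates)     # a followed by zero or more b

print(alternation)    # ['a', 'b']
print(concatenation)  # ['ab']
print(repetition)     # ['a', 'ab', 'abb', 'abbb']
```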

Slide11

Regexes - CS flavour

There are some other operators derived from combinations of the three original operations on regexes (alternation, concatenation, repetition):

+, ?, etc. (see the regular expression cheat sheet by Michael Yoshitaka Erlewine)

Parentheses add extra power w.r.t. regular grammars.

Special characters need to be escaped (preceded by \) to be interpreted literally.
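
A short Python sketch of escaping, assuming a made-up sample sentence: an unescaped dot acts as a wildcard, an escaped dot matches only a literal dot, and `re.escape` escapes a whole literal string at once.

```python
import re

# Hypothetical sample text, used only for illustration.
text = "Price: 3.50 (approx.), not 3x50."

unescaped = re.findall(r"3.50", text)   # '.' matches any character
escaped = re.findall(r"3\.50", text)    # '\.' matches a literal dot only

# re.escape() adds the backslashes needed to search for a literal string.
literal = re.escape("(approx.)")
found = re.findall(literal, text)

print(unescaped)  # ['3.50', '3x50']
print(escaped)    # ['3.50']
print(found)      # ['(approx.)']
```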

Slide12

Summary

OR: A vertical bar separates alternatives. For example, gray|grey can match "gray" or "grey".

Grouping: Parentheses are used to define the scope and precedence of the operators (among other uses). For example, gray|grey and gr(a|e)y are equivalent patterns which both describe the set of "gray" or "grey".

Quantification (after a token):

? indicates zero or one occurrence of the preceding element. For example, colou?r matches both "color" and "colour".

* indicates zero or more occurrences of the preceding element. For example, ab*c matches "ac", "abc", "abbc", "abbbc", and so on.

+ indicates one or more occurrences of the preceding element. For example, ab+c matches "abc", "abbc", "abbbc", ..., but not "ac".
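
The examples in this summary can be verified with Python's `re` module. The helper `matches` and the extra non-matching test words ("groy", "colouur") are additions made for this sketch.

```python
import re

def matches(pattern, words):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

colours = ["gray", "grey", "groy"]
or_result = matches(r"gray|grey", colours)
group_result = matches(r"gr(a|e)y", colours)

optional = matches(r"colou?r", ["color", "colour", "colouur"])

abc_words = ["ac", "abc", "abbc", "abbbc"]
star = matches(r"ab*c", abc_words)
plus = matches(r"ab+c", abc_words)

print(or_result)     # ['gray', 'grey']
print(group_result)  # ['gray', 'grey']
print(optional)      # ['color', 'colour']
print(star)          # ['ac', 'abc', 'abbc', 'abbbc']
print(plus)          # ['abc', 'abbc', 'abbbc']
```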

Slide13

Summary

{n} The preceding item is matched exactly n times.

{min,} The preceding item is matched min or more times.

{min,max} The preceding item is matched at least min times, but not more than max times.

Wildcard: . matches any character. For example, a.b matches any string that contains an "a", then any single character, and then a "b"; a.*b matches any string that contains an "a" and a "b" at some later point.

Take a look at the examples: https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285
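
A small Python sketch of the counted quantifiers and the wildcard, using test strings invented for this example. `re.fullmatch` is used where the whole string must fit the pattern, and `re.search` where the pattern may occur anywhere in the string ("contains").

```python
import re

def matches(pattern, words):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

runs = ["a", "aa", "aaa", "aaaa"]
exactly_two = matches(r"a{2}", runs)
two_or_more = matches(r"a{2,}", runs)
two_to_three = matches(r"a{2,3}", runs)

# Wildcard: exactly one character between "a" and "b".
one_between = [s for s in ["axb", "cab", "a-b!", "ab"] if re.search(r"a.b", s)]
# Wildcard plus star: an "a" and, at some later point, a "b".
a_then_b = [s for s in ["ab", "a---b", "ba"] if re.search(r"a.*b", s)]

print(exactly_two)   # ['aa']
print(two_or_more)   # ['aa', 'aaa', 'aaaa']
print(two_to_three)  # ['aa', 'aaa']
print(one_between)   # ['axb', 'a-b!']
print(a_then_b)      # ['ab', 'a---b']
```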

Slide14

AntConc

AntConc is a general-purpose tool for analysing corpora.

It is free and easy to download and use.

It can be used for virtually any language.

It supports plain and annotated corpora.

Made by Laurence Anthony: http://www.laurenceanthony.net/software/antconc/

Slide15

Basics

Loading corpus files

Viewing files

Word list

Concordance tool

Tool preferences

Global settings

Slide16

Word list

A word list is a list of the words in a corpus, ordered by frequency;

Sort by: frequency (default), word (alphabetically), word end, or inverse order;

The word list can be saved by AntConc as a text file;

Tool preferences for Word list:

Lemma list: a list with the inflections of words, for instance for be: is, are, been, was, were, etc. It returns the list of head words, accompanied by their family words (inflections) and their frequency.
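
The idea behind a word list and a lemma list can be sketched in a few lines of Python. This is not how AntConc is implemented; the sample corpus, the simple `[a-z]+` tokenizer and the tiny lemma table are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical mini-corpus and a deliberately simple tokenizer.
corpus = "It is here. They are here. It was there. They were there."
tokens = re.findall(r"[a-z]+", corpus.lower())

word_list = Counter(tokens)

# Sort by frequency (ties broken alphabetically), by word, and by word end.
by_frequency = sorted(word_list.items(), key=lambda item: (-item[1], item[0]))
by_word = sorted(word_list)
by_word_end = sorted(word_list, key=lambda w: w[::-1])

# Hypothetical lemma list: inflected form -> head word.
lemma_of = {"is": "be", "are": "be", "was": "be", "were": "be", "been": "be"}
lemma_counts = Counter(lemma_of.get(t, t) for t in tokens)
be_family = sorted({t for t in tokens if lemma_of.get(t) == "be"})

print(by_frequency[:4])    # [('here', 2), ('it', 2), ('there', 2), ('they', 2)]
print(lemma_counts["be"])  # 4
print(be_family)           # ['are', 'is', 'was', 'were']
```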

Slide17

Word list

Word list range: use specific words (only the words the user is interested in) or use a stop list (exclude the words in the stop list). The use of those two options depends on the goal of the analysis:

if the user studies the stylome of an author or performs authorship identification, s/he could look only at function words, like prepositions or pronouns, because they are harder to falsify;

if the user performs a semantic study, s/he might want to exclude the function (stop) words.
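
Both options amount to filtering the token stream before counting. The Python sketch below assumes a made-up sentence and a deliberately tiny set of function words; a real study would use a fuller list.

```python
import re
from collections import Counter

# Hypothetical sample text and a deliberately tiny set of function words.
corpus = "The author of the letter wrote to the editor of the paper."
tokens = re.findall(r"[a-z]+", corpus.lower())
function_words = {"the", "of", "to", "a", "in", "and"}

# "Specific words": keep only the function words (e.g. for authorship studies).
only_function = Counter(t for t in tokens if t in function_words)

# "Stop list": drop the function words (e.g. for a semantic study).
without_stop = Counter(t for t in tokens if t not in function_words)

print(only_function.most_common())  # [('the', 4), ('of', 2), ('to', 1)]
print(sorted(without_stop))         # ['author', 'editor', 'letter', 'paper', 'wrote']
```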

Slide18

Concordance tool

Search for words and patterns

Sort by left and right context

Ex: report, reported, reporting, report on, to report

Search with wildcards:

Ex: report*
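
A keyword-in-context (KWIC) display can be approximated in Python as below. This is a sketch of the idea, not AntConc's implementation: the corpus, the `concordance` helper and the context width are invented for this example, and the regex `\breport\w*` is assumed to be a rough equivalent of the wildcard search report*.

```python
import re

# Hypothetical mini-corpus, used only for illustration.
corpus = ("They report on the event. She reported it yesterday. "
          "Reporting is hard, and he wants to report again.")

def concordance(pattern, text, width=15):
    """Return (left context, keyword, right context) for every match."""
    lines = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append((left, m.group(), right))
    return lines

hits = concordance(r"\breport\w*", corpus)
keywords = [keyword for _, keyword, _ in hits]

# Sort the concordance lines by their right context.
sorted_right = sorted(hits, key=lambda hit: hit[2].lower())

print(keywords)  # ['report', 'reported', 'Reporting', 'report']
for left, keyword, right in sorted_right:
    print(f"{left:>15} [{keyword}] {right}")
```

Sorting by left context works the same way with `hit[0]` as the key.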