Slide1
Course 2: Regular expressions and AntConc concordance tool
University of Bucharest
Digital Humanities Master
February 2020
Anca Dinu
Slide2
Regular expressions
In corpus linguistics, much of what we want to do with a tool is pattern-matching over texts/corpora, like:
Find all words that begin with k and end with a vowel.
Find all words that have a sequence of three vowels.
Find all three-syllable words.
Find all adjectives ending in -ic.
Find all plural nouns preceded by the in questions.
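Some of these searches can be sketched directly with Python's re module. The snippet below is only an illustration on invented sample sentences; the queries about syllables, adjectives and plural nouns would additionally need syllable or part-of-speech information, so the -ic search here returns spelling-based candidates only.

```python
import re

# Invented sample text; a real study would read corpus files instead.
text = "The kiwi and the koala saw a beautiful kangaroo queueing near a kayak."

# Words that begin with k and end with a vowel.
k_vowel = re.findall(r"\bk\w*[aeiou]\b", text, flags=re.IGNORECASE)

# Words that contain a sequence of three vowels.
three_vowels = re.findall(r"\b\w*[aeiou]{3}\w*\b", text, flags=re.IGNORECASE)

# Words ending in -ic: candidates only, since spelling alone
# cannot tell adjectives ("basic") from nouns ("topic").
ic_words = re.findall(r"\b\w+ic\b", "A basic but comic topic")

print(k_vowel)
print(three_vowels)
print(ic_words)
```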
Slide3
Regular expressions
Regexes are expressions that represent string patterns.
They are extremely useful for extracting information from any text by searching for matches of a specific search pattern (i.e. a specific sequence of ASCII or Unicode characters).
Regular expression searches are among the most popular and powerful search tools, and the basic patterns are easy to learn.
They describe the regular languages, the most restricted class in Chomsky's hierarchy, and were formalized by the mathematician Stephen Kleene (after whom the * operator, the Kleene star, is named).
Slide4
Chomsky Hierarchy
From most to least strict, the four formal grammars in the Chomsky hierarchy (CH) are:
Regular grammars, which retain no past state knowledge from input string to output string.
Context-free grammars, which retain only recent state knowledge from input string to output string.
Context-sensitive grammars, which keep all past state knowledge from input string to output string.
Unrestricted (or recursively enumerable) grammars, which have all state knowledge and thus can create every output string imaginable from a given input string.
Slide5
Chomsky Hierarchy
Slide6
Regular Grammar - linguistic flavour
A regular grammar is a mathematical object, G, with four components, G = (N, Σ, P, S), where
N is a nonempty, finite set of nonterminal symbols,
Σ is a finite set of terminal symbols (the alphabet),
P is a set of grammar rules, each having one of the forms
A → aB
A → a
A → ε, for A, B ∈ N, a ∈ Σ, and ε the empty string, and
S ∈ N is the start symbol.
Slide7
The Language Generated by a Regular Grammar
Let G be a regular grammar. The language generated by the regular grammar G = (N, Σ, P, S) is
L(G) = {w | S ⇒* w, where w ∈ Σ*}
Translation: the language of a regular grammar is the set of all strings over the alphabet Σ that can be derived from the start symbol S by application of the grammar rules.
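To make the definition of L(G) concrete, here is a minimal sketch that enumerates the short strings derivable from S. The toy grammar (S → aS | bA, A → bA | ε) and the names RULES and language are invented for this illustration and are not part of the course material.

```python
from collections import deque

# Hypothetical toy grammar: S -> aS | bA, A -> bA | ε
# Each rule is (terminal, next nonterminal); ("", None) encodes A -> ε.
RULES = {
    "S": [("a", "S"), ("b", "A")],
    "A": [("b", "A"), ("", None)],
}

def language(rules, start="S", max_len=3):
    """Return all strings of L(G) with at most max_len symbols."""
    results = set()
    queue = deque([("", start)])  # (terminals derived so far, pending nonterminal)
    while queue:
        prefix, nonterminal = queue.popleft()
        for terminal, next_nt in rules[nonterminal]:
            word = prefix + terminal
            if len(word) > max_len:
                continue
            if next_nt is None:
                results.add(word)          # no nonterminal left: w is in L(G)
            else:
                queue.append((word, next_nt))
    return sorted(results, key=lambda w: (len(w), w))

print(language(RULES))
```

Every string printed is one w with S ⇒* w; the length bound is only there because L(G) itself is infinite.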
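Slide8
Regular grammar-regex equivalence
A formal grammar (like a regular grammar) generates and recognizes a language.
Regexes do the same.
Regular grammars are (approximately) equivalent to regexes: both describe the regular languages, although practical regex engines add features, such as backreferences, that go beyond them.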
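The equivalence can be checked on a small case. The sketch below assumes an invented toy grammar (S → aS | bA, A → bA | ε) and the regex a*b+, which should describe the same language; it compares the two recognizers on every string over {a, b} up to length 4. The recognizer is simplified and assumes every derivation ends with an ε rule.

```python
import re
from itertools import product

# Hypothetical toy grammar: S -> aS | bA, A -> bA | ε
RULES = {
    "S": [("a", "S"), ("b", "A")],
    "A": [("b", "A"), ("", None)],   # ("", None) encodes A -> ε
}

def grammar_accepts(word, rules=RULES, start="S"):
    """Recognize word by tracking which nonterminals may still be pending."""
    pending = {start}
    for symbol in word:
        pending = {nxt for nt in pending
                   for terminal, nxt in rules[nt]
                   if terminal == symbol and nxt is not None}
    # Accept if some pending nonterminal can be erased by a rule A -> ε.
    return any(("", None) in rules[nt] for nt in pending)

REGEX = re.compile(r"a*b+")

def regex_accepts(word):
    return REGEX.fullmatch(word) is not None

# Compare both recognizers on every string over {a, b} up to length 4.
words = ["".join(p) for n in range(5) for p in product("ab", repeat=n)]
disagreements = [w for w in words if grammar_accepts(w) != regex_accepts(w)]
print(disagreements)
```

An empty list of disagreements is evidence, not a proof, that the grammar and the regex describe the same language.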
Slide9
Regexes - CS flavour
A regular expression is recursively defined as follows:
The empty set is a regular expression.
The empty string is a regular expression.
For any character x in the input alphabet, x is a regular expression that produces the regular language {x}.
Plus the following 3 operations:
Slide10
Regexes - CS flavour
Alternation: If x and y are regular expressions, then x | y is a regular expression. For example, the regular expression a | b produces the regular language {a, b}.
Concatenation: If x and y are regular expressions, then x • y is a regular expression. For example, the regular expression a • b produces the regular language {ab}.
Repetition (Kleene star): If x is a regular expression, then x* is a regular expression. For example, the regular expression a • b* produces the regular language {a, ab, abb, abbb, ...}.
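The three operations can also be read as operations on sets of strings. The following sketch models a language as a Python set; the function names are made up for this example, and the star is truncated after a few repetitions because the real language is infinite.

```python
import re

def alternation(x, y):
    """x | y: union of the two languages."""
    return x | y

def concatenation(x, y):
    """x • y: every string of x followed by every string of y."""
    return {u + v for u in x for v in y}

def star(x, max_repeats=3):
    """x*: zero or more repetitions (truncated at max_repeats)."""
    result = {""}
    current = {""}
    for _ in range(max_repeats):
        current = concatenation(current, x)
        result |= current
    return result

a, b = {"a"}, {"b"}
a_or_b = sorted(alternation(a, b))
a_then_b = sorted(concatenation(a, b))
a_then_b_star = sorted(concatenation(a, star(b)), key=len)

print(a_or_b)
print(a_then_b)
print(a_then_b_star)

# Cross-check against Python's regex engine: every generated string matches ab*.
assert all(re.fullmatch(r"ab*", w) for w in a_then_b_star)
```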
Slide11
Regexes - CS flavour
There are some other operators derived from combinations of the three original operations on regexes (alternation, concatenation, repetition):
+, ?, etc. (see the regular expression cheat sheet of Michael Yoshitaka Erlewine)
Parentheses add extra power with respect to regular grammars: a captured group can be referred to again later in the pattern (a backreference).
Special characters need to be escaped (preceded by \) to be interpreted literally.
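A short sketch of these three points in Python's re module, on invented sample strings: + and ? behave like combinations of the basic operations, an escaped dot is read literally, and a backreference (\1) repeats whatever a parenthesised group captured.

```python
import re

samples = ["a", "ab", "abb", ""]

# + is derived: ab+ accepts the same strings as abb* (b followed by b*).
plus_same = all(bool(re.fullmatch(r"ab+", s)) == bool(re.fullmatch(r"abb*", s))
                for s in samples)

# ? is derived: ab? accepts the same strings as a(b|), i.e. b or nothing.
optional_same = all(bool(re.fullmatch(r"ab?", s)) == bool(re.fullmatch(r"a(b|)", s))
                    for s in samples)

# Escaping: an unescaped dot matches any character, an escaped dot only ".".
unescaped = re.findall(r"e.g.", "e.g. eggs")
escaped = re.findall(r"e\.g\.", "e.g. eggs")

# re.escape builds the escaped pattern automatically.
auto_escaped = re.escape("e.g.")

# Backreference: \1 must repeat the text captured by group 1 (a doubled word).
doubled = re.findall(r"\b(\w+) \1\b", "it is is what it it is")

print(plus_same, optional_same)
print(unescaped, escaped, auto_escaped)
print(doubled)
```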
Slide12
Summary
OR:
A vertical bar separates alternatives. For example, gray|grey can match "gray" or "grey".
Grouping:
Parentheses are used to define the scope and precedence of the operators (among other uses). For example, gray|grey and gr(a|e)y are equivalent patterns which both describe the set of "gray" or "grey".
Quantification (after a token):
? indicates zero or one occurrences of the preceding element. For example, colou?r matches both "color" and "colour".
* indicates zero or more occurrences of the preceding element. For example, ab*c matches "ac", "abc", "abbc", "abbbc".
+ indicates one or more occurrences of the preceding element. For example, ab+c matches "abc", "abbc", "abbbc", ..., but not "ac".
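The examples on this slide can be tried out as follows. The helper matches is a hypothetical convenience function written for this sketch; it uses re.fullmatch so that a candidate counts only if the whole string fits the pattern.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

alternatives = matches(r"gray|grey", ["gray", "grey", "griy"])
grouped = matches(r"gr(a|e)y", ["gray", "grey", "griy"])
optional_u = matches(r"colou?r", ["color", "colour", "colouur"])
zero_or_more = matches(r"ab*c", ["ac", "abc", "abbc"])
one_or_more = matches(r"ab+c", ["ac", "abc", "abbc"])

print(alternatives, grouped)
print(optional_u)
print(zero_or_more)
print(one_or_more)
```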
Slide13
Summary
{n} The preceding item is matched exactly n times.
{min,} The preceding item is matched min or more times.
{min,max} The preceding item is matched at least min times, but not more than max times.
Wildcard:
. matches any character. For example, a.b matches any string that contains an "a", then any other character and then a "b"; a.*b matches any string that contains an "a" and a "b" at some later point.
Take a look at the examples at:
https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285
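A small sketch of the counted quantifiers and the wildcard, using invented test strings. Note that in Python the dot matches any character except a newline unless the re.DOTALL flag is given; other tools may differ.

```python
import re

words = ["go", "goo", "gooo", "goooo"]

exactly_two = [w for w in words if re.fullmatch(r"go{2}", w)]     # {n}
two_or_more = [w for w in words if re.fullmatch(r"go{2,}", w)]    # {min,}
two_to_three = [w for w in words if re.fullmatch(r"go{2,3}", w)]  # {min,max}

# re.search looks for the pattern anywhere in the string ("contains").
dot_hits = [s for s in ["acb", "a-b", "ab", "xaybz"] if re.search(r"a.b", s)]
dot_star_hits = [s for s in ["ab", "a...b", "ba"] if re.search(r"a.*b", s)]

print(exactly_two, two_or_more, two_to_three)
print(dot_hits)
print(dot_star_hits)
```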
Slide14
AntConc
AntConc is a general-purpose tool for analysing corpora.
It is free and easy to download and use.
It can be used for virtually any language.
It supports plain and annotated corpora.
Made by Laurence Anthony:
http://www.laurenceanthony.net/software/antconc/
Slide15
Basics
Loading corpus files
Viewing files
Word list
Concordance tool
Tool preferences
Global settings
Slide16
Word list
The Word List tool produces a list of the words in a corpus, ordered by their frequency;
Sort by: frequency (default), word (alphabetically), word end, or in inverse order;
The word list can be saved by AntConc as a text file;
Tool preferences for Word list:
Lemma list: a list with the inflections of words, for instance for be: is, are, been, was, were, etc. It returns the list of head words, each accompanied by its inflected forms and their frequency.
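What the Word List tool computes can be approximated in a few lines of Python. This is a rough sketch, not AntConc's implementation: the sample text, the simple letters-only tokenizer and the tiny lemma_list dictionary are all assumptions made for illustration.

```python
import re
from collections import Counter

text = "The cat was here. The cats were there, and the dog is here."

# Tokenize into lowercase words (a simplifying assumption).
tokens = re.findall(r"[a-z]+", text.lower())
freq = Counter(tokens)

# Sort by frequency (ties broken alphabetically), by word, and by word end.
by_frequency = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
by_word = sorted(freq)
by_word_end = sorted(freq, key=lambda w: w[::-1])

# Hypothetical lemma list: head word -> inflected forms.
lemma_list = {"be": ["is", "are", "been", "was", "were"], "cat": ["cat", "cats"]}
form_to_lemma = {form: lemma for lemma, forms in lemma_list.items() for form in forms}

# Count head words instead of surface forms; unknown words count as themselves.
lemma_freq = Counter(form_to_lemma.get(t, t) for t in tokens)

print(by_frequency[:3])
print(by_word_end[:3])
print(lemma_freq["be"], lemma_freq["cat"])
```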
Slide17
Word list
Word list range: use specific words (only the words the user is interested in) or use a stop list (exclude the words in the stop list). The choice between those two options depends on the goal of the analysis:
if the user studies the stylome of an author or performs authorship identification, s/he could look only at function words, like prepositions or pronouns, because they are harder to disguise;
if the user performs a semantic study, s/he might want to exclude the function (stop) words.
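The two options amount to two opposite filters over the same token list. The sketch below uses an invented sentence and a deliberately tiny, hypothetical function-word set; real stop lists are much longer.

```python
import re
from collections import Counter

text = "The author wrote of the sea and of the storm in the night."
tokens = re.findall(r"[a-z]+", text.lower())

# Hypothetical function-word list, for illustration only.
function_words = {"the", "of", "and", "in", "a", "to"}

# "Use specific words": keep only function words (e.g. for authorship studies).
function_profile = Counter(t for t in tokens if t in function_words)

# "Use a stop list": drop function words (e.g. for a semantic study).
content_profile = Counter(t for t in tokens if t not in function_words)

print(function_profile.most_common())
print(sorted(content_profile))
```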
Slide18
Concordance tool
Search for words and patterns
Sort by left and right context
Ex: report, reported, reporting, report on, to report
Search with wildcards:
Ex: report*
(all wildcards are listed in Global Settings)
Editing tricks: click on the highlighted words, using shift, alt, ctrl
Search options: not whole word (rep, por), lower/upper case
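A concordance (keyword in context, KWIC) view can be sketched in Python as follows. This is not how AntConc is implemented: the kwic function, the sample text and the choice to treat the wildcard report* as roughly the regex \breport\w* are assumptions for illustration.

```python
import re

text = ("They report the news. She reported a fault. He is reporting on it. "
        "We want to report on the results.")

def kwic(text, pattern, width=20):
    """Return (left context, hit, right context) for every match."""
    lines = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append((left, m.group(), right))
    return lines

# Assumed regex counterpart of the wildcard search report*
hits = kwic(text, r"\breport\w*")

def first_right_word(line):
    """Sort key: the first word to the right of the hit."""
    words = line[2].split()
    return words[0].lower() if words else ""

sorted_hits = sorted(hits, key=first_right_word)

for left, hit, right in sorted_hits:
    print(f"{left:>20} | {hit} | {right}")
```

Sorting by the right context groups "report on" and "reporting on" together, which is the kind of pattern the Sort option is meant to reveal.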
Slide19
Concordance tool
Search for regular expressions:
\br[a-z]+?t\b
\bcat\b
\bcat\w.
(cat|dog)
[aeiou][aeiou][aeiou]
\d
\b(\w+)er\b
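To see what each of these patterns picks out, they can be run over a sample sentence in Python. The sentence is invented, and the sketch assumes that Python's re module treats these simple patterns the same way as AntConc's regex search does.

```python
import re

text = ("The cat and two cats met 3 dogs near a quiet river; "
        "the reporter sent a short report.")

patterns = [
    r"\br[a-z]+?t\b",            # words starting with r and ending in t
    r"\bcat\b",                  # the whole word cat
    r"\bcat\w.",                 # cat + one word character + any character
    r"(cat|dog)",                # cat or dog, also inside longer words
    r"[aeiou][aeiou][aeiou]",    # three vowels in a row
    r"\d",                       # a digit
    r"\b(\w+)er\b",              # words ending in -er (group 1 holds the stem)
]

# Map each pattern to the list of full matches it finds in the text.
results = {p: [m.group() for m in re.finditer(p, text)] for p in patterns}

for pattern, hits in results.items():
    print(f"{pattern:<25} {hits}")
```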
Advanced search: load multiple search words or a search list from a file and search for a context word in a window (said)
‘Clone results’ for comparing 2 or more results
Exporting the results: Tool preferences (adding a delimiter to the hit word, such as a tab, for copy-pasting into an Excel spreadsheet).
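The context-window search and the tab-delimited export can be imitated in Python along these lines. It is a sketch under stated assumptions: the text, the window of 3 tokens and both function names are invented, and AntConc's own output format may differ.

```python
import re

text = ("He said the report was late. The report, she said, was fine. "
        "A report arrived today.")
tokens = re.findall(r"\w+", text.lower())

def hits_with_context(tokens, search_word, context_word, window=3):
    """Positions of search_word with context_word within `window` tokens on either side."""
    found = []
    for i, token in enumerate(tokens):
        if token != search_word:
            continue
        neighbourhood = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context_word in neighbourhood:
            found.append(i)
    return found

def tab_delimited(tokens, index, width=3):
    """One concordance line with tabs around the hit, ready to paste into a spreadsheet."""
    left = " ".join(tokens[max(0, index - width):index])
    right = " ".join(tokens[index + 1:index + 1 + width])
    return f"{left}\t{tokens[index]}\t{right}"

hits = hits_with_context(tokens, "report", "said")
rows = [tab_delimited(tokens, i) for i in hits]

print("\n".join(rows))
```

Because the hit is surrounded by tabs, pasting the rows into a spreadsheet places left context, hit and right context in three separate columns.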