A regular expression is a way to specify the conditions that a piece of text must fulfill in order to match. Normally, when you search in a text editor, you specify the text to search for literally. With a regular expression, on the other hand, you describe what a match should look like. Examples include: I'm searching for the word KDE, but only at the beginning of a line; I'm searching for the word the, but it must stand on its own; or I'm searching for files starting with the word test, followed by a number of digits, for example test12, test107 and test007.
You build regular expressions from smaller regular expressions, just like you build large Lego toys from smaller subparts. As in the Lego world, there are a number of basic building blocks. In the following I will describe each of these basic building blocks using a number of examples.
Example 2.1. Searching for normal text.
If you just want to search for a given text, then a regular expression is definitely not a good choice. The reason for this is that regular expressions assign special meaning to some characters. The characters . * | and $ are among them. Thus, if you want to search for the text kde. (i.e. the characters kde followed by a period), then you would need to specify this as kde\.[1] Writing \. rather than just . is called escaping.
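To make the effect of escaping concrete, here is a minimal sketch using Python's re module. Python is my choice for illustration only, the handbook itself describes the regular expression editor, and the sample text is made up. The dot and the backslash escape behave in Python as described above.

```python
import re

# Made-up sample text for illustration.
text = "kde1 kde. kdeX"

# Unescaped: the dot matches any single character.
unescaped = re.findall(r"kde.", text)

# Escaped: the backslash makes the dot match a literal period only.
escaped = re.findall(r"kde\.", text)

# re.escape builds an escaped pattern from a literal string for you.
auto_escaped = re.escape("kde.")

print(unescaped)
print(escaped)
print(auto_escaped)
```

The unescaped pattern finds all three words, while the escaped one finds only kde. with the real period.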
Example 2.2. Matching URLs
When you select something looking like a URL in KDE, then the program klipper will offer to start konqueror with the selected URL.
Klipper does this by matching the selection against several different regular expressions. When one of the regular expressions matches, the corresponding command is offered.
The regular expression for URLs says (among other things) that the selection must start with the text http://. This is described in a regular expression by prefixing the text http:// with a hat (the ^ character).
The above is an example of matching positions using regular expressions. Similarly, the position end-of-line can be matched using the character $ (i.e. a dollar sign).
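The two position anchors can be tried out with a small sketch like the following. It uses Python's re module as an assumption on my part. The patterns are simplified stand-ins, not the expressions Klipper actually ships with, and the selections are invented.

```python
import re

# Invented selections; the real Klipper patterns are not shown in this handbook.
selections = [
    "http://www.kde.org",
    "see http://www.kde.org",
    "ftp://ftp.kde.org",
    "http://www.kde.org/index.html",
]

# ^ matches the position at the start of the text.
starts_with_http = [s for s in selections if re.search(r"^http://", s)]

# $ matches the position at the end of the text.
ends_with_org = [s for s in selections if re.search(r"\.org$", s)]

print(starts_with_http)
print(ends_with_org)
```

Note that the second selection contains http:// but does not start with it, so the hat rules it out.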
Example 2.3. Searching for the word the, but not there, breathe or another
Two extra position types can be matched in the above way, namely the position at a word boundary and the position at a non-word boundary. The positions are specified using the text \b (for word boundary) and \B (for non-word boundary).
Thus, searching for the word the can be done using the regular expression \bthe\b. This specifies that we are searching for the with no letters on either side of it (i.e. with a word boundary on each side).
The four position-matching regular expressions are inserted in the regular expression editor using four different position tools.
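Here is a short sketch of the word boundary positions, again assuming Python's re module for illustration. The word list comes from the example title; the variable names are mine.

```python
import re

words = ["the", "there", "breathe", "another"]

# \b requires a word boundary on both sides, so only the standalone word matches.
standalone = [w for w in words if re.search(r"\bthe\b", w)]

# \B requires a non-word boundary on both sides, so "the" must sit inside a word.
embedded = [w for w in words if re.search(r"\Bthe\B", w)]

print(standalone)
print(embedded)
```

In there and breathe the letters the touch one end of the word, so one side is a boundary and the other is not, and neither pattern matches them.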
Example 2.4. Searching for either this or that
Imagine that you want to run through your document searching for either the word this or the word that. With a normal search method you could do this in two sweeps: the first time around you would search for this, and the second time around you would search for that.
Using regular expression searches you would search for both in the same sweep. You do this by searching for this|that, i.e. separating the two words with a vertical bar.[2]
In the regular expression editor you do not write the vertical bar yourself, but instead select the alternative tool, and insert the smaller regular expressions above each other.
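As a sketch of the vertical bar in action, the following uses Python's re module (my assumption for illustration) on an invented sentence. The second pattern shows that each side of the bar can itself be a larger regular expression, here one that uses word boundaries.

```python
import re

# Invented sample text.
text = "this thistle, that thatch"

# Plain alternation: matches the letters wherever they occur.
anywhere = re.findall(r"this|that", text)

# Each alternative is a regular expression in its own right.
whole_words = re.findall(r"\bthis\b|\bthat\b", text)

print(anywhere)
print(whole_words)
```

Both searches are done in a single sweep over the text.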
Example 2.5. Matching anything
Regular expressions are often compared to wildcard matching in the shell - that is the capability to specify a number of files using the asterisk. You will most likely recognize wildcard matching from the following examples:
ls *.txt - here *.txt is the shell wildcard matching every file ending with the .txt extension.
cat test??.res - matching every file starting with test followed by two arbitrary characters, and finally followed by the text .res
In the shell the asterisk matches any character any number of times. In other words, the asterisk matches anything. This is written as .* in regular expression syntax. The dot matches any single character, i.e. just one character, and the asterisk says that the regular expression prior to it should be matched any number of times. Together this says any single character any number of times.
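To compare the two notations side by side, here is a minimal sketch that assumes Python, with its fnmatch module standing in for the shell and its re module for regular expressions. The file names are invented.

```python
import fnmatch
import re

# Invented file names.
names = ["notes.txt", "a.txt", "notes.txt.bak", "readme"]

# Shell-style wildcard: the asterisk alone matches anything.
by_wildcard = [n for n in names if fnmatch.fnmatchcase(n, "*.txt")]

# Regular expression: .* is any character any number of times,
# and \. is a literal dot before the extension.
by_regex = [n for n in names if re.fullmatch(r".*\.txt", n)]

print(by_wildcard)
print(by_regex)
```

Both lists contain the same two names, which suggests that *.txt and .*\.txt describe the same set of files here.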
This may seem overly complicated, but when you get the larger picture you will see the power. Let me show you another basic regular expression: a. The letter a on its own is a regular expression that matches a single letter, namely the letter a. If we combine this with the asterisk, i.e. a*, then we have a regular expression matching any number of a's (including none at all).
We can combine several regular expressions after each other, for example ba(na)*.[3] Imagine you had typed this regular expression into the search field in a text editor; then you would have found the following words (among others): ba, bana, banana, bananananananana.
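The repetition can be checked with a short sketch. Python's re module is assumed for illustration, and the two non-matching candidates at the end of the list are my own additions.

```python
import re

# The first four words come from the text; "ban" and "banan" are added as counterexamples.
candidates = ["ba", "bana", "banana", "bananananananana", "ban", "banan"]

# ba followed by the group na repeated any number of times (including zero).
pattern = re.compile(r"ba(na)*")

# fullmatch requires the whole word to fit the pattern.
matching = [w for w in candidates if pattern.fullmatch(w)]

# a* on its own also accepts zero repetitions, i.e. the empty string.
empty_ok = re.fullmatch(r"a*", "") is not None

print(matching)
print(empty_ok)
```

The words ban and banan are rejected because the group na must be repeated as a whole.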
Given the information above, it hopefully isn't hard for you to write the shell wildcard test??.res as a regular expression.
Answer: test..\.res. The dot on its own is any character. To match a single dot you must write \.[4] In other words, the regular expression \. matches a dot, while a dot on its own matches any character.
In the regular expression editor, a repeated regular expression is created using the repeat tool.
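To see why the dot before res has to be escaped in the answer, consider the following sketch. It assumes Python's re module, and the file names are invented for the purpose.

```python
import re

# Invented file names.
names = ["test12.res", "testAB.res", "test1.res", "test12xres", "test123.res"]

# Correct translation of test??.res: two arbitrary characters, then a literal dot.
escaped = [n for n in names if re.fullmatch(r"test..\.res", n)]

# Forgetting the backslash lets any character stand where the dot should be.
unescaped = [n for n in names if re.fullmatch(r"test...res", n)]

print(escaped)
print(unescaped)
```

The unescaped version wrongly accepts test12xres, which the shell wildcard would not match.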
Example 2.6. Replacing & with &amp; in an HTML document
In HTML the special character & must be written as &amp; - this is similar to escaping in regular expressions.
Imagine that you have written an HTML document in a normal editor (e.g. XEmacs or Kate), and you totally forgot about this rule. What you would do when you realized your mistake was to replace every occurrence of & with &amp;.
This can easily be done using normal search and replace. There is, however, one glitch. Imagine that you did remember this rule - just a bit - and did it right in some places. Replacing unconditionally would result in &amp; being replaced with &amp;amp;.
What you really want to say is that & should only be replaced if it is not followed by the letters amp;. You can do this with regular expressions using negative lookahead.
The regular expression which only matches an ampersand if it is not followed by the letters amp; looks as follows: &(?!amp;). This is, of course, easier to read using the regular expression editor, where you would use the lookahead tools.
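The difference between the unconditional replacement and the lookahead version can be sketched as follows. Python's re module is assumed for illustration, and the HTML fragment is invented.

```python
import re

# Invented fragment: one ampersand is already written correctly.
html = "Tom & Jerry &amp; friends & more"

# Unconditional replacement also rewrites the ampersand that was already correct.
naive = html.replace("&", "&amp;")

# The lookahead (?!amp;) only lets & match when it is not followed by amp;.
careful = re.sub(r"&(?!amp;)", "&amp;", html)

print(naive)
print(careful)
```

The lookahead only inspects the following text and does not consume it, so the replacement touches nothing but the ampersand itself. This pattern would still rewrite the ampersand in other entities such as &lt;, which is outside the scope of this example.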
[1] The regular expression editor solves this problem by taking care of escape rules for you.
[2] Note that on each side of the vertical bar is a regular expression, so this feature is not only for searching for two different pieces of text, but for searching for two different regular expressions.
[3] (na)* just says that what is inside the parentheses is repeated any number of times.
[4] This is called escaping.