How can you check if a word is found in a string using Python?
Recently, I had an issue where I wanted to exclude strings from a list if they contained a certain word. I thought I could use the following idiom, familiar to most Python users:
if 'word' in 'how to find word in string':
    pass  # do something
But the problem ended up being a little more difficult than that. For example, what if you want to exclude the term “word”, but not when it is found inside other words, like “sword”?
>>> 'word' in 'sword'
True
I then thought this could be achieved by adding spaces around my searched word:
>>> needle = 'word'
>>> haystack = 'sword'
>>> f' {needle} ' in haystack
False
But what if the word was at the start or end of the haystack phrase?
>>> needle = 'word'
>>> haystack = 'word is here'
>>> f' {needle} ' in haystack
False
One method could be to similarly wrap the haystack in spaces too, like so:
>>> needle = 'word'
>>> haystack = 'word in here'
>>> f' {needle} ' in f' {haystack} '
True
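For a single-word needle, a roughly equivalent check is to split the haystack on whitespace and test for membership. This is only a sketch: the helper name has_word is hypothetical, and it assumes words are separated by whitespace alone.

```python
def has_word(needle: str, haystack: str) -> bool:
    # Hypothetical helper: str.split() with no arguments splits on any run of whitespace
    return needle in haystack.split()

print(has_word('word', 'word in here'))  # True
print(has_word('word', 'sword'))         # False
```

Like the space-wrapping approach, this compares whole tokens, so a token such as 'word?' would not match.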
Finding Word In String With Punctuation
But then I ran into another issue: what if the haystack contained punctuation like commas, colons, semi-colons, full stops, question marks and exclamation marks?
>>> needle = 'word'
>>> haystack = 'are you not a word?'
>>> f' {needle} ' in f' {haystack} '
False
As you can see, the word “word” is found in the sentence, but because of the trailing question mark it is not recognised.
One approach is to remove all non-alphanumeric characters (except the space character) from the haystack string, and the easiest way to do this is with re.sub from Python’s built-in re module:
>>> import re
>>> needle = 'word'
>>> haystack = 'are you not a word?'
>>> alpha_haystack = re.sub(r'[^a-z0-9\s]', '', haystack)
>>> f' {needle} ' in f' {alpha_haystack} '
True
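One thing to watch with this pattern: the character class only lists lowercase a-z, so uppercase letters are stripped along with the punctuation. The short sketch below (the sample sentence is my own) shows the effect, and how lowercasing the haystack first avoids it.

```python
import re

haystack = 'Are you not a Word?'

# Uppercase letters are not in [a-z0-9\s], so they are removed too
stripped = re.sub(r'[^a-z0-9\s]', '', haystack)
# Lowercasing first keeps every letter
lowered = re.sub(r'[^a-z0-9\s]', '', haystack.lower())

print(stripped)  # re you not a ord
print(lowered)   # are you not a word
```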
But what if the word is a hyphenated word?
>>> import re
>>> needle = 'word'
>>> haystack = 'did you hear this from word-of-mouth?'
>>> alpha_haystack = re.sub(r'[^a-z0-9\s]', '', haystack)
>>> f' {needle} ' in f' {alpha_haystack} '
False
>>> print(alpha_haystack)
did you hear this from wordofmouth
With the above code, the hyphenated word “word-of-mouth” becomes “wordofmouth”, and whether hyphenated words should retain their hyphens depends on your use case.
Let’s look at how you can incorporate hyphenated words into your search.
Finding Hyphenated Word In String
What if the needle search term was a hyphenated word?
If I was searching for a hyphenated word, then I’d need to stop hyphens from being removed by my regex pattern. This is as simple as adding the hyphen to the character list in re.sub, like so:
>>> import re
>>> needle = 'word'
>>> haystack = 'did you hear this from word-of-mouth?'
>>> alpha_haystack = re.sub(r'[^a-z0-9\s-]', '', haystack)
>>> f' {needle} ' in f' {alpha_haystack} '
False
>>> print(alpha_haystack)
did you hear this from word-of-mouth
But the issue here is that the needle now needs to be wrapped in either spaces or hyphens. For this you’re going to need to use regex to wrap the needle variable, like so:
>>> import re
>>> needle = 'word'
>>> haystack = 'did you hear this from word-of-mouth?'
>>> alpha_haystack = re.sub(r'[^a-z0-9\s-]', '', haystack)
>>> re.findall(r'[\s-]' + needle + r'[\s-]', alpha_haystack)
[' word-']
As you can see, the re.findall function successfully finds “word” in the haystack sentence.
To wrap this all up into a function, here’s what I created:
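import re

def word_in_string(needle: str, haystack: str) -> bool:
    # Remove everything except lowercase letters, digits, whitespace and hyphens
    alpha_haystack = re.sub(r'[^a-z0-9\s-]', '', haystack)
    # Pad the haystack with spaces so a needle at the very start or end still matches
    return len(re.findall(r'[\s-]' + needle + r'[\s-]', f' {alpha_haystack} ')) > 0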
To use this function, simply call it as follows:
>>> needle = 'word'
>>> haystack = 'what is the word for today?'
>>> word_in_string(needle, haystack)
True
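Because the needle is concatenated straight into the pattern, any regex metacharacters it contains (such as '+' or '.') would be interpreted rather than matched literally. Below is a minimal sketch of a variant that guards against this with re.escape and uses lookarounds instead of stripping the haystack. The name word_in_string_escaped is hypothetical, and this is an alternative technique, not the function above.

```python
import re

def word_in_string_escaped(needle: str, haystack: str) -> bool:
    # Hypothetical variant: re.escape makes characters like '+' or '.' match literally
    # The lookarounds require that no letter or digit touches either side of the needle
    pattern = r'(?<![A-Za-z0-9])' + re.escape(needle) + r'(?![A-Za-z0-9])'
    return re.search(pattern, haystack) is not None

print(word_in_string_escaped('word', 'what is the word for today?'))  # True
print(word_in_string_escaped('word', 'sword'))                        # False
print(word_in_string_escaped('c++', 'i write c++ daily'))             # True
```

As with the spaces-or-hyphens wrapping, a needle next to a hyphen (as in “word-of-mouth”) still counts as a match here.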
Finding Word In String With Capitalization
The last hurdle to overcome was handling capitalization. What is the difference between “word” and “Word” in a sentence? The latter could be talking about the handy software Microsoft Word.
To handle this particular case, an easy way is to call the .lower() method on the haystack variable. The modified word_in_string function would look like so:
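import re

def word_in_string(needle: str, haystack: str) -> bool:
    # Lowercase first, then remove everything except letters, digits, whitespace and hyphens
    alpha_haystack = re.sub(r'[^a-z0-9\s-]', '', haystack.lower())
    # Pad the haystack with spaces so a needle at the very start or end still matches
    return len(re.findall(r'[\s-]' + needle + r'[\s-]', f' {alpha_haystack} ')) > 0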
>>> needle = 'word'
>>> haystack = 'Do you use Word?'
>>> word_in_string(needle, haystack)
True
However, this doesn’t help distinguish whether you’re hunting for just “word” or “Word”. Here are some false matches using the above code:
>>> needle = 'word'
>>> haystack_1 = 'Do you use Microsoft Word?'
>>> word_in_string(needle, haystack_1)
True
>>> haystack_2 = 'Yes. Word is great for processing'
>>> word_in_string(needle, haystack_2)
True
In the example above, I’ve tried to illustrate the shortcomings of using the .lower() method on the haystack string inside the function. If the needle word is at the beginning of a sentence, there is no easy way to distinguish whether it’s a proper noun or the needle.
Some of these conditions may need to be manually inserted into a version of the function where capitalization is retained, such as:
- Is the word found at the beginning of the haystack?
- Is the word found directly after a full stop?
- Is the word found at the start of dialogue, for example: Simon said, "Word is awesome!" (and then you have all the nuances of the 15 different character types for apostrophes and quotes).
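The conditions above could be sketched roughly as follows. This is only an illustration under my own assumptions: the name classify_matches is hypothetical, it uses \b word boundaries rather than the stripping approach, and it only recognises a small set of sentence-ending and opening-quote characters. It flags a capitalized match as ambiguous when its position alone would explain the capital letter.

```python
import re

# Characters after which a capital letter tells us nothing (assumed, not exhaustive):
# sentence enders, straight quotes, and the curly opening quotes
AMBIGUOUS_PREFIXES = '.!?"\'\u201c\u2018'

def classify_matches(needle: str, haystack: str) -> list:
    """Return (matched_text, is_ambiguous) for each whole-word, case-insensitive match."""
    pattern = re.compile(r'\b' + re.escape(needle) + r'\b', re.IGNORECASE)
    results = []
    for match in pattern.finditer(haystack):
        text = match.group()
        before = haystack[:match.start()].rstrip()
        # Ambiguous: capitalized AND at the haystack start, after a sentence end, or after a quote
        ambiguous = text[0].isupper() and (before == '' or before[-1] in AMBIGUOUS_PREFIXES)
        results.append((text, ambiguous))
    return results

print(classify_matches('word', 'Do you use Microsoft Word?'))         # [('Word', False)]
print(classify_matches('word', 'Yes. Word is great for processing'))  # [('Word', True)]
```

A capitalized match that is not flagged as ambiguous (as in the first call) is more likely to be a proper noun, while a flagged one still needs a human decision.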
For my particular use case, keeping everything lower case was sufficient, and hopefully this article has helped with your use case too. There are more complications that may need to be considered when searching for a word in a string, and capitalization would certainly be the hardest to tackle.
Summary
Using Python to search for a word in a string is a relatively simple exercise, but one that needs some additional thought depending upon your use case.
A simple one-liner will do if the haystack string needs no modification and the word will not be attached to punctuation or hyphens:
f' {needle} ' in f' {haystack} '
If the haystack is likely to contain punctuation, hyphens or mixed capitalization (and the capitalization doesn’t matter), then you might want to look at defining a function and writing something like so:
import re

def word_in_string(needle: str, haystack: str) -> bool:
    # Lowercase first, then remove everything except letters, digits, whitespace and hyphens
    alpha_haystack = re.sub(r'[^a-z0-9\s-]', '', haystack.lower())
    # Pad the haystack with spaces so a needle at the very start or end still matches
    return len(re.findall(r'[\s-]' + needle + r'[\s-]', f' {alpha_haystack} ')) > 0
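To tie this back to the original problem of excluding strings from a list, here is a small sketch. The function is repeated so the snippet runs on its own, and the sample phrases are made up for illustration.

```python
import re

def word_in_string(needle: str, haystack: str) -> bool:
    # Lowercase, strip punctuation (keeping hyphens), then look for the delimited needle
    alpha_haystack = re.sub(r'[^a-z0-9\s-]', '', haystack.lower())
    return len(re.findall(r'[\s-]' + needle + r'[\s-]', f' {alpha_haystack} ')) > 0

# Made-up sample data
phrases = [
    'The sword is sharp',
    'What is the word for today?',
    'Spread by word-of-mouth',
    'Nothing to see here',
]

# Keep only the phrases that do NOT contain the needle as a whole word
kept = [phrase for phrase in phrases if not word_in_string('word', phrase)]
print(kept)  # ['The sword is sharp', 'Nothing to see here']
```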