Advanced Searching on Strings

3.1. Advanced Searching on Strings#

3.1.1. Finding Features within a String#

import pandas as pd
df = pd.read_csv("../data/titanic.csv")
df = df[["Name"]]
df

	Name
0	Braund, Mr. Owen Harris
1	Cumings, Mrs. John Bradley (Florence Briggs Th...
2	Heikkinen, Miss. Laina
3	Futrelle, Mrs. Jacques Heath (Lily May Peel)
4	Allen, Mr. William Henry
...	...
886	Montvila, Rev. Juozas
887	Graham, Miss. Margaret Edith
888	Johnston, Miss. Catherine Helen "Carrie"
889	Behr, Mr. Karl Howell
890	Dooley, Mr. Patrick

891 rows × 1 columns

When I am looking at the df, I notice that there is a “Rev.” in index 886. As a historian, I find this fascinating. Now, I start to wonder, how many reverends were there on the Titanic? Is this individual unique? If I wanted to ask this question outside of Pandas, I could do the following:

names = df.Name.tolist()
revs = []
for name in names:
    if "Rev." in name:
        revs.append(name)
print (revs)

['Byles, Rev. Thomas Roussel Davids', 'Bateman, Rev. Robert James', 'Carter, Rev. Ernest Courtenay', 'Kirkland, Rev. Charles Leonard', 'Harper, Rev. John', 'Montvila, Rev. Juozas']

Sure, that works, but I don’t have any of the other data associated with each of these reverends. I would have to then do some manual searching in the DataFrame to find their corresponding data, or save the data as a dictionary and then run look ups. But why do all of that, when we can do it in a single line of code using Pandas’ built-in function. We can use .str.contains() which takes an argument of what we want to return.

df.loc[df["Name"].str.contains("Rev\.")]

	Name
149	Byles, Rev. Thomas Roussel Davids
150	Bateman, Rev. Robert James
249	Carter, Rev. Ernest Courtenay
626	Kirkland, Rev. Charles Leonard
848	Harper, Rev. John
886	Montvila, Rev. Juozas

We can, therefore, see not only the reverends, but also their corresponding data.

3.1.2. Finding Strings that Don’t Contain Feature#

What if we wanted to eliminate all names that do not contain “Rev.”? We can introduce “~” prior to df to specify that the Names column should not have whatever condition we express.

df.loc[~df["Name"].str.contains("Rev\.")]

	Name
0	Braund, Mr. Owen Harris
1	Cumings, Mrs. John Bradley (Florence Briggs Th...
2	Heikkinen, Miss. Laina
3	Futrelle, Mrs. Jacques Heath (Lily May Peel)
4	Allen, Mr. William Henry
...	...
885	Rice, Mrs. William (Margaret Norton)
887	Graham, Miss. Margaret Edith
888	Johnston, Miss. Catherine Helen "Carrie"
889	Behr, Mr. Karl Howell
890	Dooley, Mr. Patrick

885 rows × 1 columns

3.1.3. Using RegEx with Pandas#

Out of the box, Pandas supports RegEx. RegEx stands for Regular Expressions. It is a powerful way of performing complex string matching. If we were interested in finding any instance of “Rev.” or “Mr.”, we would have to write something like this without RegEx:

df.loc[(df["Name"].str.contains("Rev\.")) | (df["Name"].str.contains("Mr\."))]

	Name
0	Braund, Mr. Owen Harris
4	Allen, Mr. William Henry
5	Moran, Mr. James
6	McCarthy, Mr. Timothy J
12	Saundercock, Mr. William Henry
...	...
883	Banfield, Mr. Frederick James
884	Sutehall, Mr. Henry Jr
886	Montvila, Rev. Juozas
889	Behr, Mr. Karl Howell
890	Dooley, Mr. Patrick

523 rows × 1 columns

While this works, imagine if we had 20 or 30 different conditions! That would be a very long piece of code to write and while it would work, it is always best practice to write shorter, tighter code. So, let’s do the same thing, but with RegEx. We can add the Or-condition into the str.contains() argument. This is a RegEx command. To ensure that RegEx is registered, it may be necessary to pass it as an argument.

df.loc[df["Name"].str.contains("Rev\.|Mr\.", regex=True)]

	Name
0	Braund, Mr. Owen Harris
4	Allen, Mr. William Henry
5	Moran, Mr. James
6	McCarthy, Mr. Timothy J
12	Saundercock, Mr. William Henry
...	...
883	Banfield, Mr. Frederick James
884	Sutehall, Mr. Henry Jr
886	Montvila, Rev. Juozas
889	Behr, Mr. Karl Howell
890	Dooley, Mr. Patrick

523 rows × 1 columns

In some instances, we may have uncleaned data and the use of “Rev.” may be lowercase in one instance. To ensure that we grab both upper and lowercase forms of this sequence, let’s ignore the case by using the case keyword and setting it to False.

import re
df.loc[df["Name"].str.contains("Rev\.|Mr\.", case=False, regex=True)]

	Name
0	Braund, Mr. Owen Harris
4	Allen, Mr. William Henry
5	Moran, Mr. James
6	McCarthy, Mr. Timothy J
12	Saundercock, Mr. William Henry
...	...
883	Banfield, Mr. Frederick James
884	Sutehall, Mr. Henry Jr
886	Montvila, Rev. Juozas
889	Behr, Mr. Karl Howell
890	Dooley, Mr. Patrick

523 rows × 1 columns

Advanced Searching on Strings

Contents

3.1. Advanced Searching on Strings#

3.1.1. Finding Features within a String#

3.1.2. Finding Strings that Don’t Contain Feature#

3.1.3. Using RegEx with Pandas#