3.1. Advanced Searching on Strings#

3.1.1. Finding Features within a String#

import pandas as pd
df = pd.read_csv("../data/titanic.csv")
df = df[["Name"]]
df
Name
0 Braund, Mr. Owen Harris
1 Cumings, Mrs. John Bradley (Florence Briggs Th...
2 Heikkinen, Miss. Laina
3 Futrelle, Mrs. Jacques Heath (Lily May Peel)
4 Allen, Mr. William Henry
... ...
886 Montvila, Rev. Juozas
887 Graham, Miss. Margaret Edith
888 Johnston, Miss. Catherine Helen "Carrie"
889 Behr, Mr. Karl Howell
890 Dooley, Mr. Patrick

891 rows × 1 columns

When I am looking at the df, I notice that there is a “Rev.” in index 886. As a historian, I find this fascinating. Now, I start to wonder, how many reverends were there on the Titanic? Is this individual unique? If I wanted to ask this question outside of Pandas, I could do the following:

names = df.Name.tolist()
revs = []
for name in names:
    if "Rev." in name:
        revs.append(name)
print (revs)
['Byles, Rev. Thomas Roussel Davids', 'Bateman, Rev. Robert James', 'Carter, Rev. Ernest Courtenay', 'Kirkland, Rev. Charles Leonard', 'Harper, Rev. John', 'Montvila, Rev. Juozas']

Sure, that works, but I don’t have any of the other data associated with each of these reverends. I would have to then do some manual searching in the DataFrame to find their corresponding data, or save the data as a dictionary and then run look ups. But why do all of that, when we can do it in a single line of code using Pandas’ built-in function. We can use .str.contains() which takes an argument of what we want to return.

df.loc[df["Name"].str.contains("Rev\.")]
Name
149 Byles, Rev. Thomas Roussel Davids
150 Bateman, Rev. Robert James
249 Carter, Rev. Ernest Courtenay
626 Kirkland, Rev. Charles Leonard
848 Harper, Rev. John
886 Montvila, Rev. Juozas

We can, therefore, see not only the reverends, but also their corresponding data.

3.1.2. Finding Strings that Don’t Contain Feature#

What if we wanted to eliminate all names that do not contain “Rev.”? We can introduce “~” prior to df to specify that the Names column should not have whatever condition we express.

df.loc[~df["Name"].str.contains("Rev\.")]
Name
0 Braund, Mr. Owen Harris
1 Cumings, Mrs. John Bradley (Florence Briggs Th...
2 Heikkinen, Miss. Laina
3 Futrelle, Mrs. Jacques Heath (Lily May Peel)
4 Allen, Mr. William Henry
... ...
885 Rice, Mrs. William (Margaret Norton)
887 Graham, Miss. Margaret Edith
888 Johnston, Miss. Catherine Helen "Carrie"
889 Behr, Mr. Karl Howell
890 Dooley, Mr. Patrick

885 rows × 1 columns

3.1.3. Using RegEx with Pandas#

Out of the box, Pandas supports RegEx. RegEx stands for Regular Expressions. It is a powerful way of performing complex string matching. If we were interested in finding any instance of “Rev.” or “Mr.”, we would have to write something like this without RegEx:

df.loc[(df["Name"].str.contains("Rev\.")) | (df["Name"].str.contains("Mr\."))]
Name
0 Braund, Mr. Owen Harris
4 Allen, Mr. William Henry
5 Moran, Mr. James
6 McCarthy, Mr. Timothy J
12 Saundercock, Mr. William Henry
... ...
883 Banfield, Mr. Frederick James
884 Sutehall, Mr. Henry Jr
886 Montvila, Rev. Juozas
889 Behr, Mr. Karl Howell
890 Dooley, Mr. Patrick

523 rows × 1 columns

While this works, imagine if we had 20 or 30 different conditions! That would be a very long piece of code to write and while it would work, it is always best practice to write shorter, tighter code. So, let’s do the same thing, but with RegEx. We can add the Or-condition into the str.contains() argument. This is a RegEx command. To ensure that RegEx is registered, it may be necessary to pass it as an argument.

df.loc[df["Name"].str.contains("Rev\.|Mr\.", regex=True)]
Name
0 Braund, Mr. Owen Harris
4 Allen, Mr. William Henry
5 Moran, Mr. James
6 McCarthy, Mr. Timothy J
12 Saundercock, Mr. William Henry
... ...
883 Banfield, Mr. Frederick James
884 Sutehall, Mr. Henry Jr
886 Montvila, Rev. Juozas
889 Behr, Mr. Karl Howell
890 Dooley, Mr. Patrick

523 rows × 1 columns

In some instances, we may have uncleaned data and the use of “Rev.” may be lowercase in one instance. To ensure that we grab both upper and lowercase forms of this sequence, let’s ignore the case by using the case keyword and setting it to False.

import re
df.loc[df["Name"].str.contains("Rev\.|Mr\.", case=False, regex=True)]
Name
0 Braund, Mr. Owen Harris
4 Allen, Mr. William Henry
5 Moran, Mr. James
6 McCarthy, Mr. Timothy J
12 Saundercock, Mr. William Henry
... ...
883 Banfield, Mr. Frederick James
884 Sutehall, Mr. Henry Jr
886 Montvila, Rev. Juozas
889 Behr, Mr. Karl Howell
890 Dooley, Mr. Patrick

523 rows × 1 columns