{ "cells": [ { "cell_type": "markdown", "id": "tested-egyptian", "metadata": {}, "source": [ "# Advanced Searching on Strings" ] }, { "cell_type": "markdown", "id": "provincial-interference", "metadata": {}, "source": [ "## Finding Features within a String" ] }, { "cell_type": "code", "execution_count": 1, "id": "following-disney", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Name
0Braund, Mr. Owen Harris
1Cumings, Mrs. John Bradley (Florence Briggs Th...
2Heikkinen, Miss. Laina
3Futrelle, Mrs. Jacques Heath (Lily May Peel)
4Allen, Mr. William Henry
......
886Montvila, Rev. Juozas
887Graham, Miss. Margaret Edith
888Johnston, Miss. Catherine Helen \"Carrie\"
889Behr, Mr. Karl Howell
890Dooley, Mr. Patrick
\n", "

891 rows × 1 columns

\n", "
" ], "text/plain": [ " Name\n", "0 Braund, Mr. Owen Harris\n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th...\n", "2 Heikkinen, Miss. Laina\n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel)\n", "4 Allen, Mr. William Henry\n", ".. ...\n", "886 Montvila, Rev. Juozas\n", "887 Graham, Miss. Margaret Edith\n", "888 Johnston, Miss. Catherine Helen \"Carrie\"\n", "889 Behr, Mr. Karl Howell\n", "890 Dooley, Mr. Patrick\n", "\n", "[891 rows x 1 columns]" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "df = pd.read_csv(\"../data/titanic.csv\")\n", "df = df[[\"Name\"]]\n", "df" ] }, { "cell_type": "markdown", "id": "several-former", "metadata": {}, "source": [ "When I am looking at the df, I notice that there is a \"Rev.\" in index 886. As a historian, I find this fascinating. Now, I start to wonder, how many reverends were there on the Titanic? Is this individual unique? If I wanted to ask this question outside of Pandas, I could do the following:" ] }, { "cell_type": "code", "execution_count": 2, "id": "printable-bulgaria", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Byles, Rev. Thomas Roussel Davids', 'Bateman, Rev. Robert James', 'Carter, Rev. Ernest Courtenay', 'Kirkland, Rev. Charles Leonard', 'Harper, Rev. John', 'Montvila, Rev. Juozas']\n" ] } ], "source": [ "names = df.Name.tolist()\n", "revs = []\n", "for name in names:\n", " if \"Rev.\" in name:\n", " revs.append(name)\n", "print (revs)" ] }, { "cell_type": "markdown", "id": "floral-medline", "metadata": {}, "source": [ "Sure, that works, but I don't have any of the other data associated with each of these reverends. I would have to then do some manual searching in the DataFrame to find their corresponding data, or save the data as a dictionary and then run look ups. But why do all of that, when we can do it in a single line of code using Pandas' built-in function. We can use `.str.contains()` which takes an argument of what we want to return." ] }, { "cell_type": "code", "execution_count": 3, "id": "finished-danger", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Name
149Byles, Rev. Thomas Roussel Davids
150Bateman, Rev. Robert James
249Carter, Rev. Ernest Courtenay
626Kirkland, Rev. Charles Leonard
848Harper, Rev. John
886Montvila, Rev. Juozas
\n", "
" ], "text/plain": [ " Name\n", "149 Byles, Rev. Thomas Roussel Davids\n", "150 Bateman, Rev. Robert James\n", "249 Carter, Rev. Ernest Courtenay\n", "626 Kirkland, Rev. Charles Leonard\n", "848 Harper, Rev. John\n", "886 Montvila, Rev. Juozas" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[df[\"Name\"].str.contains(\"Rev\\.\")]" ] }, { "cell_type": "markdown", "id": "fatty-batch", "metadata": {}, "source": [ "We can, therefore, see not only the reverends, but also their corresponding data." ] }, { "cell_type": "markdown", "id": "hundred-response", "metadata": {}, "source": [ "## Finding Strings that Don't Contain Feature" ] }, { "cell_type": "markdown", "id": "italian-semiconductor", "metadata": {}, "source": [ "What if we wanted to eliminate all names that do not contain \"Rev.\"? We can introduce \"~\" prior to df to specify that the Names column should not have whatever condition we express." ] }, { "cell_type": "code", "execution_count": 4, "id": "adequate-health", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Name
0Braund, Mr. Owen Harris
1Cumings, Mrs. John Bradley (Florence Briggs Th...
2Heikkinen, Miss. Laina
3Futrelle, Mrs. Jacques Heath (Lily May Peel)
4Allen, Mr. William Henry
......
885Rice, Mrs. William (Margaret Norton)
887Graham, Miss. Margaret Edith
888Johnston, Miss. Catherine Helen \"Carrie\"
889Behr, Mr. Karl Howell
890Dooley, Mr. Patrick
\n", "

885 rows × 1 columns

\n", "
" ], "text/plain": [ " Name\n", "0 Braund, Mr. Owen Harris\n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th...\n", "2 Heikkinen, Miss. Laina\n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel)\n", "4 Allen, Mr. William Henry\n", ".. ...\n", "885 Rice, Mrs. William (Margaret Norton)\n", "887 Graham, Miss. Margaret Edith\n", "888 Johnston, Miss. Catherine Helen \"Carrie\"\n", "889 Behr, Mr. Karl Howell\n", "890 Dooley, Mr. Patrick\n", "\n", "[885 rows x 1 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[~df[\"Name\"].str.contains(\"Rev\\.\")]" ] }, { "cell_type": "markdown", "id": "adolescent-plumbing", "metadata": {}, "source": [ "## Using RegEx with Pandas" ] }, { "cell_type": "markdown", "id": "fossil-dance", "metadata": {}, "source": [ "Out of the box, Pandas supports RegEx. RegEx stands for Regular Expressions. It is a powerful way of performing complex string matching. If we were interested in finding any instance of \"Rev.\" or \"Mr.\", we would have to write something like this without RegEx:" ] }, { "cell_type": "code", "execution_count": 5, "id": "other-queensland", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Name
0Braund, Mr. Owen Harris
4Allen, Mr. William Henry
5Moran, Mr. James
6McCarthy, Mr. Timothy J
12Saundercock, Mr. William Henry
......
883Banfield, Mr. Frederick James
884Sutehall, Mr. Henry Jr
886Montvila, Rev. Juozas
889Behr, Mr. Karl Howell
890Dooley, Mr. Patrick
\n", "

523 rows × 1 columns

\n", "
" ], "text/plain": [ " Name\n", "0 Braund, Mr. Owen Harris\n", "4 Allen, Mr. William Henry\n", "5 Moran, Mr. James\n", "6 McCarthy, Mr. Timothy J\n", "12 Saundercock, Mr. William Henry\n", ".. ...\n", "883 Banfield, Mr. Frederick James\n", "884 Sutehall, Mr. Henry Jr\n", "886 Montvila, Rev. Juozas\n", "889 Behr, Mr. Karl Howell\n", "890 Dooley, Mr. Patrick\n", "\n", "[523 rows x 1 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[(df[\"Name\"].str.contains(\"Rev\\.\")) | (df[\"Name\"].str.contains(\"Mr\\.\"))]" ] }, { "cell_type": "markdown", "id": "structured-millennium", "metadata": {}, "source": [ "While this works, imagine if we had 20 or 30 different conditions! That would be a very long piece of code to write and while it would work, it is always best practice to write shorter, tighter code. So, let's do the same thing, but with RegEx. We can add the Or-condition into the `str.contains()` argument. This is a RegEx command. To ensure that RegEx is registered, it may be necessary to pass it as an argument." ] }, { "cell_type": "code", "execution_count": 6, "id": "narrative-traveler", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Name
0Braund, Mr. Owen Harris
4Allen, Mr. William Henry
5Moran, Mr. James
6McCarthy, Mr. Timothy J
12Saundercock, Mr. William Henry
......
883Banfield, Mr. Frederick James
884Sutehall, Mr. Henry Jr
886Montvila, Rev. Juozas
889Behr, Mr. Karl Howell
890Dooley, Mr. Patrick
\n", "

523 rows × 1 columns

\n", "
" ], "text/plain": [ " Name\n", "0 Braund, Mr. Owen Harris\n", "4 Allen, Mr. William Henry\n", "5 Moran, Mr. James\n", "6 McCarthy, Mr. Timothy J\n", "12 Saundercock, Mr. William Henry\n", ".. ...\n", "883 Banfield, Mr. Frederick James\n", "884 Sutehall, Mr. Henry Jr\n", "886 Montvila, Rev. Juozas\n", "889 Behr, Mr. Karl Howell\n", "890 Dooley, Mr. Patrick\n", "\n", "[523 rows x 1 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[df[\"Name\"].str.contains(\"Rev\\.|Mr\\.\", regex=True)]" ] }, { "cell_type": "markdown", "id": "particular-afternoon", "metadata": {}, "source": [ "In some instances, we may have uncleaned data and the use of \"Rev.\" may be lowercase in one instance. To ensure that we grab both upper and lowercase forms of this sequence, let's ignore the case by using the case keyword and setting it to False." ] }, { "cell_type": "code", "execution_count": 7, "id": "optical-auckland", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Name
0Braund, Mr. Owen Harris
4Allen, Mr. William Henry
5Moran, Mr. James
6McCarthy, Mr. Timothy J
12Saundercock, Mr. William Henry
......
883Banfield, Mr. Frederick James
884Sutehall, Mr. Henry Jr
886Montvila, Rev. Juozas
889Behr, Mr. Karl Howell
890Dooley, Mr. Patrick
\n", "

523 rows × 1 columns

\n", "
" ], "text/plain": [ " Name\n", "0 Braund, Mr. Owen Harris\n", "4 Allen, Mr. William Henry\n", "5 Moran, Mr. James\n", "6 McCarthy, Mr. Timothy J\n", "12 Saundercock, Mr. William Henry\n", ".. ...\n", "883 Banfield, Mr. Frederick James\n", "884 Sutehall, Mr. Henry Jr\n", "886 Montvila, Rev. Juozas\n", "889 Behr, Mr. Karl Howell\n", "890 Dooley, Mr. Patrick\n", "\n", "[523 rows x 1 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "df.loc[df[\"Name\"].str.contains(\"Rev\\.|Mr\\.\", case=False, regex=True)]" ] }, { "cell_type": "code", "execution_count": null, "id": "lightweight-reconstruction", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "b083da4e-96c0-49db-b1f9-c0fddf862ce6", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 5 }