Filter and Querying

3.2. Filter and Querying#

3.2.1. Introduction#

In this chapter, we will meet two advanced ways of filtering or searching (querying) our data. These are aptly named Filter() and Query() functions. These two functions allow us to do some fairly advanced things within a narrow scope. By narrow scope, I mean the questions that we want to pose. Whenever you are manipulating or probing data, it is always best to think about the task as simply asking a question. In essence, this is precisely what you are doing. You are asking the database a question. In order to ask the question correctly, as is the case with any language, you need to know the correct syntax and when that particular question is the right one to ask. In this chapter, we explore how to frame specific questions with Filter() and Query().

Each function is used in particular circumstances. Filter() is useful for getting a large data down to a smaller size, based on the questions you want to ask. Query(), on the other hand, is useful for phrasing questions that use comparison operators (less than, equal to, greater than, etc.). Let’s explore each in turn, but first, let’s import our Titanic dataset.

import pandas as pd
df = pd.read_csv("../data/titanic.csv")
df

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
...	...	...	...	...	...	...	...	...	...	...	...	...
886	887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.0000	NaN	S
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	B42	S
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	C148	C
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q

891 rows × 12 columns

3.2.2. The Filter Function#

The filter function is a great way to grab only the relevant columns. The syntax of filter is a bit easier to use. It takes a single argument, a list of strings. These strings correspond to the columns.

When to use Filter()
Use filter() when you want to get a quick sense of your dataset or, as we shall see, create a new dataframe based on the columns you want. It is particularly useful if your dataset has many columns. You can also use it to reorder your columns in a more desired way.

Let’s say that we are interested in just studying the names of the passengers of the Titanic. It does not make sense to work with the entire DataFrame. We can, therefore, use filter to just grab the names column, like so.

df.filter(["Name"])

	Name
0	Braund, Mr. Owen Harris
1	Cumings, Mrs. John Bradley (Florence Briggs Th...
2	Heikkinen, Miss. Laina
3	Futrelle, Mrs. Jacques Heath (Lily May Peel)
4	Allen, Mr. William Henry
...	...
886	Montvila, Rev. Juozas
887	Graham, Miss. Margaret Edith
888	Johnston, Miss. Catherine Helen "Carrie"
889	Behr, Mr. Karl Howell
890	Dooley, Mr. Patrick

891 rows × 1 columns

This is great, but what if I also want to see the ages of these passengers. No problem. I can add an additional column to the list.

df.filter(["Name", "Age"])

	Name	Age
0	Braund, Mr. Owen Harris	22.0
1	Cumings, Mrs. John Bradley (Florence Briggs Th...	38.0
2	Heikkinen, Miss. Laina	26.0
3	Futrelle, Mrs. Jacques Heath (Lily May Peel)	35.0
4	Allen, Mr. William Henry	35.0
...	...	...
886	Montvila, Rev. Juozas	27.0
887	Graham, Miss. Margaret Edith	19.0
888	Johnston, Miss. Catherine Helen "Carrie"	NaN
889	Behr, Mr. Karl Howell	26.0
890	Dooley, Mr. Patrick	32.0

891 rows × 2 columns

What if I want age to come before name? I can rearrange the order.

df.filter(["Age", "Name"])

	Age	Name
0	22.0	Braund, Mr. Owen Harris
1	38.0	Cumings, Mrs. John Bradley (Florence Briggs Th...
2	26.0	Heikkinen, Miss. Laina
3	35.0	Futrelle, Mrs. Jacques Heath (Lily May Peel)
4	35.0	Allen, Mr. William Henry
...	...	...
886	27.0	Montvila, Rev. Juozas
887	19.0	Graham, Miss. Margaret Edith
888	NaN	Johnston, Miss. Catherine Helen "Carrie"
889	26.0	Behr, Mr. Karl Howell
890	32.0	Dooley, Mr. Patrick

891 rows × 2 columns

Note that we are not bound to the order of the DataFrame. This is particularly useful if we want to make a new DataFrame.

new_df = df.filter(["Age", "Name"])
new_df

	Age	Name
0	22.0	Braund, Mr. Owen Harris
1	38.0	Cumings, Mrs. John Bradley (Florence Briggs Th...
2	26.0	Heikkinen, Miss. Laina
3	35.0	Futrelle, Mrs. Jacques Heath (Lily May Peel)
4	35.0	Allen, Mr. William Henry
...	...	...
886	27.0	Montvila, Rev. Juozas
887	19.0	Graham, Miss. Margaret Edith
888	NaN	Johnston, Miss. Catherine Helen "Carrie"
889	26.0	Behr, Mr. Karl Howell
890	32.0	Dooley, Mr. Patrick

891 rows × 2 columns

Now, we can only examine the data that we actually need. This will make our code faster and require examining fewer data for each row. Filter’s big limitation is in the fact that it cannot filter the data further. We cannot, for example, add an extra argument to filter() that would only return the Names with “Miss.”, but we can tack on additional arguments to filter, such as those that we saw in the previous chapter on searching strings.

Let’s try that now.

df.filter(["Age", "Name"]).Name.str.contains("Miss\.")

    False
    False
     True
    False
    False
       ...  
  False
   True
   True
  False
  False
Name: Name, Length: 891, dtype: bool

Note that we now have a list of True False statements. These tell us if the word “Miss.” is in the column Name. If it is there, we see a True. If it is not, we see a False. Let’s say we want to know how many passengers have the title “Miss.”, we stack .value_counts() into the chain. Note the plural of counts.

df.filter(["Age", "Name"]).Name.str.contains("Miss\.").value_counts()

False    709
True     182
Name: Name, dtype: int64

Now, let’s say we were interested in ONLY seeing the rows that contain “Miss.” in them. We need to structure that filtering as a list. But note that if we wrap the whole thing in a list, we don’t filter out just the Age and Name columns. Instead, we get the entire DataFrame.

df[df.filter(["Age", "Name"]).Name.str.contains("Miss\.") == True]

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
10	11	1	3	Sandstrom, Miss. Marguerite Rut	female	4.0	1	1	PP 9549	16.7000	G6	S
11	12	1	1	Bonnell, Miss. Elizabeth	female	58.0	0	0	113783	26.5500	C103	S
14	15	0	3	Vestrom, Miss. Hulda Amanda Adolfina	female	14.0	0	0	350406	7.8542	NaN	S
22	23	1	3	McGowan, Miss. Anna "Annie"	female	15.0	0	0	330923	8.0292	NaN	Q
...	...	...	...	...	...	...	...	...	...	...	...	...
866	867	1	2	Duran y More, Miss. Asuncion	female	27.0	1	0	SC/PARIS 2149	13.8583	NaN	C
875	876	1	3	Najib, Miss. Adele Kiamie "Jane"	female	15.0	0	0	2667	7.2250	NaN	C
882	883	0	3	Dahlberg, Miss. Gerda Ulrika	female	22.0	0	0	7552	10.5167	NaN	S
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	B42	S
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S

182 rows × 12 columns

This is because of where our filter occurs in the chain of commands. Note that filter occurs within the brackets where we are setting up our parameters. This means that we are filtering under the conditions of how the list is created, but that once a row is processed, the DataFrame is unfiltered. If we want our filter to work, we need to place it after the conditions have been sorted. Notice that the filter is outside of our brackets now.

df[df.Name.str.contains("Miss\.") == True].filter(["Age", "Name"])

	Age	Name
2	26.0	Heikkinen, Miss. Laina
10	4.0	Sandstrom, Miss. Marguerite Rut
11	58.0	Bonnell, Miss. Elizabeth
14	14.0	Vestrom, Miss. Hulda Amanda Adolfina
22	15.0	McGowan, Miss. Anna "Annie"
...	...	...
866	27.0	Duran y More, Miss. Asuncion
875	15.0	Najib, Miss. Adele Kiamie "Jane"
882	22.0	Dahlberg, Miss. Gerda Ulrika
887	19.0	Graham, Miss. Margaret Edith
888	NaN	Johnston, Miss. Catherine Helen "Carrie"

182 rows × 2 columns

3.2.3. The Query Function#

The Pandas Query() method is a fantastic way to filter and query data. Unlike other Pandas methods, it uses a string argument that functions rather similar to SQL syntax.

When to use Query
You should only use Query() when your question (query) can be posed as greater than, less than, equal to, or not equal to (or some combination of these). Let me demonstrate. If we wanted to filter out all the rows where the Pclass was equal to 3, we could use the following string.

df.query("Pclass == 3")

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
5	6	0	3	Moran, Mr. James	male	NaN	0	0	330877	8.4583	NaN	Q
7	8	0	3	Palsson, Master. Gosta Leonard	male	2.0	3	1	349909	21.0750	NaN	S
...	...	...	...	...	...	...	...	...	...	...	...	...
882	883	0	3	Dahlberg, Miss. Gerda Ulrika	female	22.0	0	0	7552	10.5167	NaN	S
884	885	0	3	Sutehall, Mr. Henry Jr	male	25.0	0	0	SOTON/OQ 392076	7.0500	NaN	S
885	886	0	3	Rice, Mrs. William (Margaret Norton)	female	39.0	0	5	382652	29.1250	NaN	Q
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q

491 rows × 12 columns

Query can also look for if a column contains any item in a list. We can do this by setting Pclass == [list]

df.query("Pclass == [1,3]")

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
...	...	...	...	...	...	...	...	...	...	...	...	...
885	886	0	3	Rice, Mrs. William (Margaret Norton)	female	39.0	0	5	382652	29.1250	NaN	Q
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	B42	S
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	C148	C
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q

707 rows × 12 columns

We can even stack questions together within this string. Let’s say, I am interested in all who were in Pclass 3 and survived. I could write the following string argument.

df.query("Pclass == 3 & Survived == 1")

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
8	9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.0	0	2	347742	11.1333	NaN	S
10	11	1	3	Sandstrom, Miss. Marguerite Rut	female	4.0	1	1	PP 9549	16.7000	G6	S
19	20	1	3	Masselmani, Mrs. Fatima	female	NaN	0	0	2649	7.2250	NaN	C
22	23	1	3	McGowan, Miss. Anna "Annie"	female	15.0	0	0	330923	8.0292	NaN	Q
...	...	...	...	...	...	...	...	...	...	...	...	...
838	839	1	3	Chip, Mr. Chang	male	32.0	0	0	1601	56.4958	NaN	S
855	856	1	3	Aks, Mrs. Sam (Leah Rosen)	female	18.0	0	1	392091	9.3500	NaN	S
858	859	1	3	Baclini, Mrs. Solomon (Latifa Qurban)	female	24.0	0	3	2666	19.2583	NaN	C
869	870	1	3	Johnson, Master. Harold Theodor	male	4.0	1	1	347742	11.1333	NaN	S
875	876	1	3	Najib, Miss. Adele Kiamie "Jane"	female	15.0	0	0	2667	7.2250	NaN	C

119 rows × 12 columns

Let’s make the question even more complex. We want to now find the number of these individuals who were over the age of 40.

df.query("Pclass == 3 & Survived == 1 & Age > 40")

	PassengerId	Survived	Pclass	Name	Sex	Age	Ticket	Fare	Cabin	Embarked
338	339	1	3	Dahl, Mr. Karl Edwart	male	45.0	7598	8.0500	NaN	S
414	415	1	3	Sundman, Mr. Johan Julian	male	44.0	STON/O 2. 3101269	7.9250	NaN	S
483	484	1	3	Turkula, Mrs. (Hedwig)	female	63.0	4134	9.5875	NaN	S

I now have a list of 3 individuals who met all criteria. I can also use my Or operator (|), rather than &. Let’s see if we can achieve what we want. (Note this is an intentional mistake. Look below for why).

df.query("Pclass == 3 & Survived == 1 & Age > 40 | Age < 10")

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
7	8	0	3	Palsson, Master. Gosta Leonard	male	2.00	3	1	349909	21.0750	NaN	S
10	11	1	3	Sandstrom, Miss. Marguerite Rut	female	4.00	1	1	PP 9549	16.7000	G6	S
16	17	0	3	Rice, Master. Eugene	male	2.00	4	1	382652	29.1250	NaN	Q
24	25	0	3	Palsson, Miss. Torborg Danira	female	8.00	3	1	349909	21.0750	NaN	S
43	44	1	2	Laroche, Miss. Simonne Marie Anne Andree	female	3.00	1	2	SC/Paris 2123	41.5792	NaN	C
...	...	...	...	...	...	...	...	...	...	...	...	...
827	828	1	2	Mallet, Master. Andre	male	1.00	0	2	S.C./PARIS 2079	37.0042	NaN	C
831	832	1	2	Richards, Master. George Sibley	male	0.83	1	1	29106	18.7500	NaN	S
850	851	0	3	Andersson, Master. Sigvard Harald Elias	male	4.00	4	2	347082	31.2750	NaN	S
852	853	0	3	Boulos, Miss. Nourelain	female	9.00	1	1	2678	15.2458	NaN	C
869	870	1	3	Johnson, Master. Harold Theodor	male	4.00	1	1	347742	11.1333	NaN	S

65 rows × 12 columns

Woops! Something has gone seriously wrong here. We have all different kinds of Pclasses, not just 3s. We have people who survived and did not. And, most problematically, we have people only under the age of 10. What has gone wrong here!? The answer lies in a perhaps forgotten part of math from when we were children, the order of operations. If you recall from those lessons, the order of operations determines the way in which you process the problem. 4 + 7 x 2 is very different from (4+7) x 2. The former is 18 and the latter is 22 because the latter has parentheses which tell the reader to do that operation first. Because programming sits on top of mathematics (especially Boolean algebra), the syntax of mathematics is often embedded in programming.

Let’s use the order of operations correctly and rephrase our query. Note the parentheses now before Age > 40 and after Age < 10. Note also that the & is before the parentheses.

df.query("Pclass == 3 & Survived == 1 & (Age > 40 | Age < 10)")

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
10	11	1	3	Sandstrom, Miss. Marguerite Rut	female	4.00	1	1	PP 9549	16.7000	G6	S
165	166	1	3	Goldsmith, Master. Frank John William "Frankie"	male	9.00	0	2	363291	20.5250	NaN	S
172	173	1	3	Johnson, Miss. Eleanor Ileen	female	1.00	1	1	347742	11.1333	NaN	S
184	185	1	3	Kink-Heilmann, Miss. Luise Gretchen	female	4.00	0	2	315153	22.0250	NaN	S
233	234	1	3	Asplund, Miss. Lillian Gertrud	female	5.00	4	2	347077	31.3875	NaN	S
261	262	1	3	Asplund, Master. Edvin Rojj Felix	male	3.00	4	2	347077	31.3875	NaN	S
338	339	1	3	Dahl, Mr. Karl Edwart	male	45.00	0	0	7598	8.0500	NaN	S
348	349	1	3	Coutts, Master. William Loch "William"	male	3.00	1	1	C.A. 37671	15.9000	NaN	S
381	382	1	3	Nakid, Miss. Maria ("Mary")	female	1.00	0	2	2653	15.7417	NaN	C
414	415	1	3	Sundman, Mr. Johan Julian	male	44.00	0	0	STON/O 2. 3101269	7.9250	NaN	S
448	449	1	3	Baclini, Miss. Marie Catherine	female	5.00	2	1	2666	19.2583	NaN	C
469	470	1	3	Baclini, Miss. Helene Barbara	female	0.75	2	1	2666	19.2583	NaN	C
479	480	1	3	Hirvonen, Miss. Hildur E	female	2.00	0	1	3101298	12.2875	NaN	S
483	484	1	3	Turkula, Mrs. (Hedwig)	female	63.00	0	0	4134	9.5875	NaN	S
489	490	1	3	Coutts, Master. Eden Leslie "Neville"	male	9.00	1	1	C.A. 37671	15.9000	NaN	S
644	645	1	3	Baclini, Miss. Eugenie	female	0.75	2	1	2666	19.2583	NaN	C
691	692	1	3	Karun, Miss. Manca	female	4.00	0	1	349256	13.4167	NaN	C
751	752	1	3	Moor, Master. Meier	male	6.00	0	1	392096	12.4750	E121	S
777	778	1	3	Emanuel, Miss. Virginia Ethel	female	5.00	0	0	364516	12.4750	NaN	S
788	789	1	3	Dean, Master. Bertram Vere	male	1.00	1	2	C.A. 2315	20.5750	NaN	S
803	804	1	3	Thomas, Master. Assad Alexander	male	0.42	0	1	2625	8.5167	NaN	C
869	870	1	3	Johnson, Master. Harold Theodor	male	4.00	1	1	347742	11.1333	NaN	S

This is why query is such a powerful function in Pandas. You can do a lot with a single string. There are other ways to achieve this same result, but if your question can be entirely phrased as a series of comparison operators (equal to, less than, etc.), then Query is likely the best option.

Filter and Querying

Contents

3.2. Filter and Querying#

3.2.1. Introduction#

3.2.2. The Filter Function#

3.2.3. The Query Function#