Grouping with Groupby()

3.3. Grouping with Groupby()#

3.3.1. Introduction#

When working with large quantities of data, it can sometimes be a bit difficult to understand broad patterns within your data. Often, you will need to group your data into small subsections based on some parameter, such as age, name, or some other feature. You can do this in Pandas using groupby(), which will be the main subject of this chapter. Groupby is a feature of Pandas that returns a special groupby object. This object can be called to perform different types of analyses on data, especially when leveraging the built-in quantitative features of Pandas, such as count() and sum(). In this chapter, we will explore these features and see how they can be used on a real-world dataset, the Titanic dataset.

import pandas as pd
df = pd.read_csv("../data/titanic.csv")
df = df[["Name", "Sex", "Age", "Pclass", "Fare"]]
df

	Name	Sex	Age	Pclass	Fare
0	Braund, Mr. Owen Harris	male	22.0	3	7.2500
1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	71.2833
2	Heikkinen, Miss. Laina	female	26.0	3	7.9250
3	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	53.1000
4	Allen, Mr. William Henry	male	35.0	3	8.0500
...	...	...	...	...	...
886	Montvila, Rev. Juozas	male	27.0	2	13.0000
887	Graham, Miss. Margaret Edith	female	19.0	1	30.0000
888	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	3	23.4500
889	Behr, Mr. Karl Howell	male	26.0	1	30.0000
890	Dooley, Mr. Patrick	male	32.0	3	7.7500

891 rows × 5 columns

3.3.2. Groupby()#

The groupby() function allows us to easily group our data in the DataFrame. Once your data are grouped, there are a lot of quantitative questions you can begin to ask. Let’s start simple. Let’s group our DataFrame by Sex.

df.groupby("Sex")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002754F3CFB80>

This output may not be quite what you expect. This is an object to which we can now pose targeted questions. Let’s try and see a DataFrame that only has “male” in the Sex column. We can do that by using get_group(“male”)

df.groupby("Sex").get_group("male")

	Name	Sex	Age	Pclass	Fare
0	Braund, Mr. Owen Harris	male	22.0	3	7.2500
4	Allen, Mr. William Henry	male	35.0	3	8.0500
5	Moran, Mr. James	male	NaN	3	8.4583
6	McCarthy, Mr. Timothy J	male	54.0	1	51.8625
7	Palsson, Master. Gosta Leonard	male	2.0	3	21.0750
...	...	...	...	...	...
883	Banfield, Mr. Frederick James	male	28.0	2	10.5000
884	Sutehall, Mr. Henry Jr	male	25.0	3	7.0500
886	Montvila, Rev. Juozas	male	27.0	2	13.0000
889	Behr, Mr. Karl Howell	male	26.0	1	30.0000
890	Dooley, Mr. Patrick	male	32.0	3	7.7500

577 rows × 5 columns

This argument does not have to be a string. Let’s say, we want to just get all the people who are aged 20. We can do the same thing by grouping the dataset by “Age” and then getting the group of 20 year olds.

df.groupby("Age").get_group(20)

	Name	Sex	Age	Pclass	Fare
12	Saundercock, Mr. William Henry	male	20.0	3	8.0500
91	Andreasson, Mr. Paul Edvin	male	20.0	3	7.8542
113	Jussila, Miss. Katriina	female	20.0	3	9.8250
131	Coelho, Mr. Domingos Fernandeo	male	20.0	3	7.0500
378	Betros, Mr. Tannous	male	20.0	3	4.0125
404	Oreskovic, Miss. Marija	female	20.0	3	8.6625
441	Hampe, Mr. Leon	male	20.0	3	9.5000
622	Nakid, Mr. Sahid	male	20.0	3	15.7417
640	Jensen, Mr. Hans Peder	male	20.0	3	7.8542
664	Lindqvist, Mr. Eino William	male	20.0	3	7.9250
682	Olsvigen, Mr. Thor Anderson	male	20.0	3	9.2250
725	Oreskovic, Mr. Luka	male	20.0	3	8.6625
762	Barah, Mr. Hanna Assi	male	20.0	3	7.2292
840	Alhomaki, Mr. Ilmari Rudolf	male	20.0	3	7.9250
876	Gustafsson, Mr. Alfred Ossian	male	20.0	3	9.8458

3.3.3. Quantitative Analysis with Count() and Sum()#

This is typically not how you would use the grouby function. It is far more powerful and often used for quantitative analysis on subsets of your data. Let’s say that I want to examine my dataset by sex and I am interested in known the quantity of column based solely on the metric of sex. I could use groupby() and .count(). When chained together, our question then becomes, how many PassengerId, Survived, Pclass, Name, etc. do we see for each column based on sex. While this question is particularly useful for the qualitative rows (such as Name) or numerical strings (such as PassengerId) because they display the total number of passengers because each person has a unique PassengerId and Name.

df.groupby("Sex").count()

	Name	Age	Pclass	Fare
Sex
female	314	261	314	314
male	577	453	577	577

For the quantitative rows, we can use sum() function. This will tell us the sum of all the columns that have floats or integers. Note that this is not a really good question to pose for the Age column. It is, however, very useful for the Fare column and the Survived Column. Remember, if a person survived, they have a 1; if they did not, they have a 0. We can use the sum to know how many male vs. female survivors there were.

df.groupby("Sex").sum()

	Age	Pclass	Fare
Sex
female	7286.00	678	13966.6628
male	13919.17	1379	14727.2865

Let’s say, though, that we are only interested in the Fare column. Before we add sum to our chain, we can specify that we want specifically the Fare column.

df.groupby("Sex").Fare.sum()

Sex
female    13966.6628
male      14727.2865
Name: Fare, dtype: float64

3.3.4. Working with Multiple Groups#

Now, we have just the data on a single column. We can see that the combined fare of male passengers was greater than the combined sum of female passengers. Let’s say though that we are interested in how these sums divide over Pclass. We can pass a list to groupby, rather than just a string. This list will be a list of a strings that correspond to columns.

df.groupby(["Sex", "Pclass"]).Fare.sum()

Sex     Pclass
female  1         9975.8250
        2         1669.7292
        3         2321.1086
male    1         8201.5875
        2         2132.1125
        3         4393.5865
Name: Fare, dtype: float64

The result of this new question is more nuanced. We are not looking at the sum of all fares, rather the sum of fares divided on a Pclass-by-Pclass basis. This means that we can now understand that these sums varied by class. For example, while the total fare for male passengers was greater, the total fare for first class female passengers was greater than their first class male counterparts. The male fare, however, is greater for both the 2nd Class and 3rd Class groups.

3.3.5. Groupings with Many Subsets#

What if we were interested in something that would have more than just 6 neat subsections, such as 3 classes per sex. What if we also wanted to add another aspect to the groups, such as age. If we try and do that, our results are cutoff. We can try and use pd.set_option()

df.groupby(["Sex", "Pclass", "Age"]).Fare.sum()

Sex     Pclass  Age 
female  1       2.0     151.5500
                14.0    120.0000
                15.0    211.3375
                16.0    183.8792
                17.0    165.9000
                          ...   
male    3       59.0      7.2500
                61.0      6.2375
                65.0      7.7500
                70.5      7.7500
                74.0      7.7750
Name: Fare, Length: 283, dtype: float64

What if we wanted to make this look a bit nicer, as a Pandas DataFrame? We can pass all our data back into a new DataFrame object.

df = pd.DataFrame(df.groupby(["Sex", "Pclass", "Age"]).Fare.sum())
df

			Fare
Sex	Pclass	Age
female	1	2.0	151.5500
		14.0	120.0000
		15.0	211.3375
		16.0	183.8792
		17.0	165.9000
...	...	...	...
male	3	59.0	7.2500
		61.0	6.2375
		65.0	7.7500
		70.5	7.7500
		74.0	7.7750

283 rows × 1 columns

This is now a bit easier to read. You should now have a fairly good understanding of how to group data in Pandas using groupby() and some of the more powerful ways you can use groupby() to manipulate quantitative data.