3.3. Grouping with Groupby()#

3.3.1. Introduction#

When working with large quantities of data, it can sometimes be a bit difficult to understand broad patterns within your data. Often, you will need to group your data into small subsections based on some parameter, such as age, name, or some other feature. You can do this in Pandas using groupby(), which will be the main subject of this chapter. Groupby is a feature of Pandas that returns a special groupby object. This object can be called to perform different types of analyses on data, especially when leveraging the built-in quantitative features of Pandas, such as count() and sum(). In this chapter, we will explore these features and see how they can be used on a real-world dataset, the Titanic dataset.

import pandas as pd
df = pd.read_csv("../data/titanic.csv")
df = df[["Name", "Sex", "Age", "Pclass", "Fare"]]
Name Sex Age Pclass Fare
0 Braund, Mr. Owen Harris male 22.0 3 7.2500
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 71.2833
2 Heikkinen, Miss. Laina female 26.0 3 7.9250
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 53.1000
4 Allen, Mr. William Henry male 35.0 3 8.0500
... ... ... ... ... ...
886 Montvila, Rev. Juozas male 27.0 2 13.0000
887 Graham, Miss. Margaret Edith female 19.0 1 30.0000
888 Johnston, Miss. Catherine Helen "Carrie" female NaN 3 23.4500
889 Behr, Mr. Karl Howell male 26.0 1 30.0000
890 Dooley, Mr. Patrick male 32.0 3 7.7500

891 rows × 5 columns

3.3.2. Groupby()#

The groupby() function allows us to easily group our data in the DataFrame. Once your data are grouped, there are a lot of quantitative questions you can begin to ask. Let’s start simple. Let’s group our DataFrame by Sex.

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002754F3CFB80>

This output may not be quite what you expect. This is an object to which we can now pose targeted questions. Let’s try and see a DataFrame that only has “male” in the Sex column. We can do that by using get_group(“male”)

Name Sex Age Pclass Fare
0 Braund, Mr. Owen Harris male 22.0 3 7.2500
4 Allen, Mr. William Henry male 35.0 3 8.0500
5 Moran, Mr. James male NaN 3 8.4583
6 McCarthy, Mr. Timothy J male 54.0 1 51.8625
7 Palsson, Master. Gosta Leonard male 2.0 3 21.0750
... ... ... ... ... ...
883 Banfield, Mr. Frederick James male 28.0 2 10.5000
884 Sutehall, Mr. Henry Jr male 25.0 3 7.0500
886 Montvila, Rev. Juozas male 27.0 2 13.0000
889 Behr, Mr. Karl Howell male 26.0 1 30.0000
890 Dooley, Mr. Patrick male 32.0 3 7.7500

577 rows × 5 columns

This argument does not have to be a string. Let’s say, we want to just get all the people who are aged 20. We can do the same thing by grouping the dataset by “Age” and then getting the group of 20 year olds.

Name Sex Age Pclass Fare
12 Saundercock, Mr. William Henry male 20.0 3 8.0500
91 Andreasson, Mr. Paul Edvin male 20.0 3 7.8542
113 Jussila, Miss. Katriina female 20.0 3 9.8250
131 Coelho, Mr. Domingos Fernandeo male 20.0 3 7.0500
378 Betros, Mr. Tannous male 20.0 3 4.0125
404 Oreskovic, Miss. Marija female 20.0 3 8.6625
441 Hampe, Mr. Leon male 20.0 3 9.5000
622 Nakid, Mr. Sahid male 20.0 3 15.7417
640 Jensen, Mr. Hans Peder male 20.0 3 7.8542
664 Lindqvist, Mr. Eino William male 20.0 3 7.9250
682 Olsvigen, Mr. Thor Anderson male 20.0 3 9.2250
725 Oreskovic, Mr. Luka male 20.0 3 8.6625
762 Barah, Mr. Hanna Assi male 20.0 3 7.2292
840 Alhomaki, Mr. Ilmari Rudolf male 20.0 3 7.9250
876 Gustafsson, Mr. Alfred Ossian male 20.0 3 9.8458

3.3.3. Quantitative Analysis with Count() and Sum()#

This is typically not how you would use the grouby function. It is far more powerful and often used for quantitative analysis on subsets of your data. Let’s say that I want to examine my dataset by sex and I am interested in known the quantity of column based solely on the metric of sex. I could use groupby() and .count(). When chained together, our question then becomes, how many PassengerId, Survived, Pclass, Name, etc. do we see for each column based on sex. While this question is particularly useful for the qualitative rows (such as Name) or numerical strings (such as PassengerId) because they display the total number of passengers because each person has a unique PassengerId and Name.

Name Age Pclass Fare
female 314 261 314 314
male 577 453 577 577

For the quantitative rows, we can use sum() function. This will tell us the sum of all the columns that have floats or integers. Note that this is not a really good question to pose for the Age column. It is, however, very useful for the Fare column and the Survived Column. Remember, if a person survived, they have a 1; if they did not, they have a 0. We can use the sum to know how many male vs. female survivors there were.

Age Pclass Fare
female 7286.00 678 13966.6628
male 13919.17 1379 14727.2865

Let’s say, though, that we are only interested in the Fare column. Before we add sum to our chain, we can specify that we want specifically the Fare column.

female    13966.6628
male      14727.2865
Name: Fare, dtype: float64

3.3.4. Working with Multiple Groups#

Now, we have just the data on a single column. We can see that the combined fare of male passengers was greater than the combined sum of female passengers. Let’s say though that we are interested in how these sums divide over Pclass. We can pass a list to groupby, rather than just a string. This list will be a list of a strings that correspond to columns.

df.groupby(["Sex", "Pclass"]).Fare.sum()
Sex     Pclass
female  1         9975.8250
        2         1669.7292
        3         2321.1086
male    1         8201.5875
        2         2132.1125
        3         4393.5865
Name: Fare, dtype: float64

The result of this new question is more nuanced. We are not looking at the sum of all fares, rather the sum of fares divided on a Pclass-by-Pclass basis. This means that we can now understand that these sums varied by class. For example, while the total fare for male passengers was greater, the total fare for first class female passengers was greater than their first class male counterparts. The male fare, however, is greater for both the 2nd Class and 3rd Class groups.

3.3.5. Groupings with Many Subsets#

What if we were interested in something that would have more than just 6 neat subsections, such as 3 classes per sex. What if we also wanted to add another aspect to the groups, such as age. If we try and do that, our results are cutoff. We can try and use pd.set_option()

df.groupby(["Sex", "Pclass", "Age"]).Fare.sum()
Sex     Pclass  Age 
female  1       2.0     151.5500
                14.0    120.0000
                15.0    211.3375
                16.0    183.8792
                17.0    165.9000
male    3       59.0      7.2500
                61.0      6.2375
                65.0      7.7500
                70.5      7.7500
                74.0      7.7750
Name: Fare, Length: 283, dtype: float64

What if we wanted to make this look a bit nicer, as a Pandas DataFrame? We can pass all our data back into a new DataFrame object.

df = pd.DataFrame(df.groupby(["Sex", "Pclass", "Age"]).Fare.sum())
Sex Pclass Age
female 1 2.0 151.5500
14.0 120.0000
15.0 211.3375
16.0 183.8792
17.0 165.9000
... ... ... ...
male 3 59.0 7.2500
61.0 6.2375
65.0 7.7500
70.5 7.7500
74.0 7.7750

283 rows × 1 columns

This is now a bit easier to read. You should now have a fairly good understanding of how to group data in Pandas using groupby() and some of the more powerful ways you can use groupby() to manipulate quantitative data.