3.3. Grouping with Groupby()#
3.3.1. Introduction#
When working with large quantities of data, it can sometimes be a bit difficult to understand broad patterns within your data. Often, you will need to group your data into small subsections based on some parameter, such as age, name, or some other feature. You can do this in Pandas using groupby(), which will be the main subject of this chapter. Groupby is a feature of Pandas that returns a special groupby object. This object can be called to perform different types of analyses on data, especially when leveraging the built-in quantitative features of Pandas, such as count() and sum(). In this chapter, we will explore these features and see how they can be used on a real-world dataset, the Titanic dataset.
import pandas as pd
df = pd.read_csv("../data/titanic.csv")
df = df[["Name", "Sex", "Age", "Pclass", "Fare"]]
df
Name | Sex | Age | Pclass | Fare | |
---|---|---|---|---|---|
0 | Braund, Mr. Owen Harris | male | 22.0 | 3 | 7.2500 |
1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 71.2833 |
2 | Heikkinen, Miss. Laina | female | 26.0 | 3 | 7.9250 |
3 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 53.1000 |
4 | Allen, Mr. William Henry | male | 35.0 | 3 | 8.0500 |
... | ... | ... | ... | ... | ... |
886 | Montvila, Rev. Juozas | male | 27.0 | 2 | 13.0000 |
887 | Graham, Miss. Margaret Edith | female | 19.0 | 1 | 30.0000 |
888 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 3 | 23.4500 |
889 | Behr, Mr. Karl Howell | male | 26.0 | 1 | 30.0000 |
890 | Dooley, Mr. Patrick | male | 32.0 | 3 | 7.7500 |
891 rows × 5 columns
3.3.2. Groupby()#
The groupby() function allows us to easily group our data in the DataFrame. Once your data are grouped, there are a lot of quantitative questions you can begin to ask. Let’s start simple. Let’s group our DataFrame by Sex.
df.groupby("Sex")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002754F3CFB80>
This output may not be quite what you expect. This is an object to which we can now pose targeted questions. Let’s try and see a DataFrame that only has “male” in the Sex column. We can do that by using get_group(“male”)
df.groupby("Sex").get_group("male")
Name | Sex | Age | Pclass | Fare | |
---|---|---|---|---|---|
0 | Braund, Mr. Owen Harris | male | 22.0 | 3 | 7.2500 |
4 | Allen, Mr. William Henry | male | 35.0 | 3 | 8.0500 |
5 | Moran, Mr. James | male | NaN | 3 | 8.4583 |
6 | McCarthy, Mr. Timothy J | male | 54.0 | 1 | 51.8625 |
7 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 21.0750 |
... | ... | ... | ... | ... | ... |
883 | Banfield, Mr. Frederick James | male | 28.0 | 2 | 10.5000 |
884 | Sutehall, Mr. Henry Jr | male | 25.0 | 3 | 7.0500 |
886 | Montvila, Rev. Juozas | male | 27.0 | 2 | 13.0000 |
889 | Behr, Mr. Karl Howell | male | 26.0 | 1 | 30.0000 |
890 | Dooley, Mr. Patrick | male | 32.0 | 3 | 7.7500 |
577 rows × 5 columns
This argument does not have to be a string. Let’s say, we want to just get all the people who are aged 20. We can do the same thing by grouping the dataset by “Age” and then getting the group of 20 year olds.
df.groupby("Age").get_group(20)
Name | Sex | Age | Pclass | Fare | |
---|---|---|---|---|---|
12 | Saundercock, Mr. William Henry | male | 20.0 | 3 | 8.0500 |
91 | Andreasson, Mr. Paul Edvin | male | 20.0 | 3 | 7.8542 |
113 | Jussila, Miss. Katriina | female | 20.0 | 3 | 9.8250 |
131 | Coelho, Mr. Domingos Fernandeo | male | 20.0 | 3 | 7.0500 |
378 | Betros, Mr. Tannous | male | 20.0 | 3 | 4.0125 |
404 | Oreskovic, Miss. Marija | female | 20.0 | 3 | 8.6625 |
441 | Hampe, Mr. Leon | male | 20.0 | 3 | 9.5000 |
622 | Nakid, Mr. Sahid | male | 20.0 | 3 | 15.7417 |
640 | Jensen, Mr. Hans Peder | male | 20.0 | 3 | 7.8542 |
664 | Lindqvist, Mr. Eino William | male | 20.0 | 3 | 7.9250 |
682 | Olsvigen, Mr. Thor Anderson | male | 20.0 | 3 | 9.2250 |
725 | Oreskovic, Mr. Luka | male | 20.0 | 3 | 8.6625 |
762 | Barah, Mr. Hanna Assi | male | 20.0 | 3 | 7.2292 |
840 | Alhomaki, Mr. Ilmari Rudolf | male | 20.0 | 3 | 7.9250 |
876 | Gustafsson, Mr. Alfred Ossian | male | 20.0 | 3 | 9.8458 |
3.3.3. Quantitative Analysis with Count() and Sum()#
This is typically not how you would use the grouby function. It is far more powerful and often used for quantitative analysis on subsets of your data. Let’s say that I want to examine my dataset by sex and I am interested in known the quantity of column based solely on the metric of sex. I could use groupby() and .count(). When chained together, our question then becomes, how many PassengerId, Survived, Pclass, Name, etc. do we see for each column based on sex. While this question is particularly useful for the qualitative rows (such as Name) or numerical strings (such as PassengerId) because they display the total number of passengers because each person has a unique PassengerId and Name.
df.groupby("Sex").count()
Name | Age | Pclass | Fare | |
---|---|---|---|---|
Sex | ||||
female | 314 | 261 | 314 | 314 |
male | 577 | 453 | 577 | 577 |
For the quantitative rows, we can use sum() function. This will tell us the sum of all the columns that have floats or integers. Note that this is not a really good question to pose for the Age column. It is, however, very useful for the Fare column and the Survived Column. Remember, if a person survived, they have a 1; if they did not, they have a 0. We can use the sum to know how many male vs. female survivors there were.
df.groupby("Sex").sum()
Age | Pclass | Fare | |
---|---|---|---|
Sex | |||
female | 7286.00 | 678 | 13966.6628 |
male | 13919.17 | 1379 | 14727.2865 |
Let’s say, though, that we are only interested in the Fare column. Before we add sum to our chain, we can specify that we want specifically the Fare column.
df.groupby("Sex").Fare.sum()
Sex
female 13966.6628
male 14727.2865
Name: Fare, dtype: float64
3.3.4. Working with Multiple Groups#
Now, we have just the data on a single column. We can see that the combined fare of male passengers was greater than the combined sum of female passengers. Let’s say though that we are interested in how these sums divide over Pclass. We can pass a list to groupby, rather than just a string. This list will be a list of a strings that correspond to columns.
df.groupby(["Sex", "Pclass"]).Fare.sum()
Sex Pclass
female 1 9975.8250
2 1669.7292
3 2321.1086
male 1 8201.5875
2 2132.1125
3 4393.5865
Name: Fare, dtype: float64
The result of this new question is more nuanced. We are not looking at the sum of all fares, rather the sum of fares divided on a Pclass-by-Pclass basis. This means that we can now understand that these sums varied by class. For example, while the total fare for male passengers was greater, the total fare for first class female passengers was greater than their first class male counterparts. The male fare, however, is greater for both the 2nd Class and 3rd Class groups.
3.3.5. Groupings with Many Subsets#
What if we were interested in something that would have more than just 6 neat subsections, such as 3 classes per sex. What if we also wanted to add another aspect to the groups, such as age. If we try and do that, our results are cutoff. We can try and use pd.set_option()
df.groupby(["Sex", "Pclass", "Age"]).Fare.sum()
Sex Pclass Age
female 1 2.0 151.5500
14.0 120.0000
15.0 211.3375
16.0 183.8792
17.0 165.9000
...
male 3 59.0 7.2500
61.0 6.2375
65.0 7.7500
70.5 7.7500
74.0 7.7750
Name: Fare, Length: 283, dtype: float64
What if we wanted to make this look a bit nicer, as a Pandas DataFrame? We can pass all our data back into a new DataFrame object.
df = pd.DataFrame(df.groupby(["Sex", "Pclass", "Age"]).Fare.sum())
df
Fare | |||
---|---|---|---|
Sex | Pclass | Age | |
female | 1 | 2.0 | 151.5500 |
14.0 | 120.0000 | ||
15.0 | 211.3375 | ||
16.0 | 183.8792 | ||
17.0 | 165.9000 | ||
... | ... | ... | ... |
male | 3 | 59.0 | 7.2500 |
61.0 | 6.2375 | ||
65.0 | 7.7500 | ||
70.5 | 7.7500 | ||
74.0 | 7.7750 |
283 rows × 1 columns
This is now a bit easier to read. You should now have a fairly good understanding of how to group data in Pandas using groupby() and some of the more powerful ways you can use groupby() to manipulate quantitative data.