3.3.
Advanced Grouping with Groupby()

Dr. W.J.B. Mattingly
Smithsonian Data Science Lab and United States Holocaust Memorial Museum
August 2021

3.3.1. Covered in this Chapter

  1. What is the Purpose of Groupby()

  2. What are Pandas Group Objects

  3. How to Analyze Group Objects Quantitatively

  4. How to work with Multiple Groupings at Once

3.3.2. Introduction

When working with large quantities of data, it can sometimes be a bit difficult to understand broad patterns within your data. Often, you will need to group your data into small subsections based on some parameter, such as age, name, or some other feature. You can do this in Pandas using groupby(), which will be the main subject of this chapter. Groupby is a feature of Pandas that returns a special groupby object. This object can be called to perform different types of analyses on data, especially when leveraging the built-in quantitative features of Pandas, such as count() and sum(). In this chapter, we will explore these features and see how they can be used on a real-world dataset, the Titanic dataset.

import pandas as pd
df = pd.read_csv("data/titanic.csv")
df
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns

3.3.3. Groupby()

The groupby() function allows us to easily group our data in the DataFrame. Once your data are grouped, there are a lot of quantitative questions you can begin to ask. Let’s start simple. Let’s group our DataFrame by Sex.

df.groupby("Sex")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001F436D18910>

This output may not be quite what you expect. This is an object to which we can now pose targeted questions. Let’s try and see a DataFrame that only has “male” in the Sex column. We can do that by using get_group(“male”)

df.groupby("Sex").get_group("male")
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
883 884 0 2 Banfield, Mr. Frederick James male 28.0 0 0 C.A./SOTON 34068 10.5000 NaN S
884 885 0 3 Sutehall, Mr. Henry Jr male 25.0 0 0 SOTON/OQ 392076 7.0500 NaN S
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

577 rows × 12 columns

This argument does not have to be a string. Let’s say, we want to just get all the people who are aged 20. We can do the same thing by grouping the dataset by “Age” and then getting the group of 20 year olds.

df.groupby("Age").get_group(20)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
12 13 0 3 Saundercock, Mr. William Henry male 20.0 0 0 A/5. 2151 8.0500 NaN S
91 92 0 3 Andreasson, Mr. Paul Edvin male 20.0 0 0 347466 7.8542 NaN S
113 114 0 3 Jussila, Miss. Katriina female 20.0 1 0 4136 9.8250 NaN S
131 132 0 3 Coelho, Mr. Domingos Fernandeo male 20.0 0 0 SOTON/O.Q. 3101307 7.0500 NaN S
378 379 0 3 Betros, Mr. Tannous male 20.0 0 0 2648 4.0125 NaN C
404 405 0 3 Oreskovic, Miss. Marija female 20.0 0 0 315096 8.6625 NaN S
441 442 0 3 Hampe, Mr. Leon male 20.0 0 0 345769 9.5000 NaN S
622 623 1 3 Nakid, Mr. Sahid male 20.0 1 1 2653 15.7417 NaN C
640 641 0 3 Jensen, Mr. Hans Peder male 20.0 0 0 350050 7.8542 NaN S
664 665 1 3 Lindqvist, Mr. Eino William male 20.0 1 0 STON/O 2. 3101285 7.9250 NaN S
682 683 0 3 Olsvigen, Mr. Thor Anderson male 20.0 0 0 6563 9.2250 NaN S
725 726 0 3 Oreskovic, Mr. Luka male 20.0 0 0 315094 8.6625 NaN S
762 763 1 3 Barah, Mr. Hanna Assi male 20.0 0 0 2663 7.2292 NaN C
840 841 0 3 Alhomaki, Mr. Ilmari Rudolf male 20.0 0 0 SOTON/O2 3101287 7.9250 NaN S
876 877 0 3 Gustafsson, Mr. Alfred Ossian male 20.0 0 0 7534 9.8458 NaN S

3.3.4. Quantitative Analysis with Count() and Sum()

This is typically not how you would use the grouby function. It is far more powerful and often used for quantitative analysis on subsets of your data. Let’s say that I want to examine my dataset by sex and I am interested in known the quantity of column based solely on the metric of sex. I could use groupby() and .count(). When chained together, our question then becomes, how many PassengerId, Survived, Pclass, Name, etc. do we see for each column based on sex. While this question is particularly useful for the qualitative rows (such as Name) or numerical strings (such as PassengerId) because they display the total number of passengers because each person has a unique PassengerId and Name.

df.groupby("Sex").count()
PassengerId Survived Pclass Name Age SibSp Parch Ticket Fare Cabin Embarked
Sex
female 314 314 314 314 261 314 314 314 314 97 312
male 577 577 577 577 453 577 577 577 577 107 577

For the quantitative rows, we can use sum() function. This will tell us the sum of all the columns that have floats or integers. Note that this is not a really good question to pose for the Age column. It is, however, very useful for the Fare column and the Survived Column. Remember, if a person survived, they have a 1; if they did not, they have a 0. We can use the sum to know how many male vs. female survivors there were.

df.groupby("Sex").sum()
PassengerId Survived Pclass Age SibSp Parch Fare
Sex
female 135343 233 678 7286.00 218 204 13966.6628
male 262043 109 1379 13919.17 248 136 14727.2865

Let’s say, though, that we are only interested in the Fare column. Before we add sum to our chain, we can specify that we want specifically the Fare column.

df.groupby("Sex").Fare.sum()
Sex
female    13966.6628
male      14727.2865
Name: Fare, dtype: float64

3.3.5. Working with Multiple Groups

Now, we have just the data on a single column. We can see that the combined fare of male passengers was greater than the combined sum of female passengers. Let’s say though that we are interested in how these sums divide over Pclass. We can pass a list to groupby, rather than just a string. This list will be a list of a strings that correspond to columns.

df.groupby(["Sex", "Pclass"]).Fare.sum()
Sex     Pclass
female  1         9975.8250
        2         1669.7292
        3         2321.1086
male    1         8201.5875
        2         2132.1125
        3         4393.5865
Name: Fare, dtype: float64

The result of this new question is more nuanced. We are not looking at the sum of all fares, rather the sum of fares divided on a Pclass-by-Pclass basis. This means that we can now understand that these sums varied by class. For example, while the total fare for male passengers was greater, the total fare for first class female passengers was greater than their first class male counterparts. The male fare, however, is greater for both the 2nd Class and 3rd Class groups.

3.3.6. Groupings with Many Subsets

What if we were interested in something that would have more than just 6 neat subsections, such as 3 classes per sex. What if we also wanted to add another aspect to the groups, such as age. If we try and do that, our results are cutoff. We can try and use pd.set_option()

df.groupby(["Sex", "Pclass", "Age"]).Fare.sum()
Sex     Pclass  Age 
female  1       2.0     151.5500
                14.0    120.0000
                15.0    211.3375
                16.0    183.8792
                17.0    165.9000
                          ...   
male    3       59.0      7.2500
                61.0      6.2375
                65.0      7.7500
                70.5      7.7500
                74.0      7.7750
Name: Fare, Length: 283, dtype: float64

In order to print off this many rows, we need to increase the amount of data Pandas will display. We can use the pd.set_options() command and pass an argument to display 1000 rows.

pd.set_option('display.max_rows', 1000)
df.groupby(["Sex", "Pclass", "Age"]).Fare.sum()
Sex     Pclass  Age  
female  1       2.00     151.5500
                14.00    120.0000
                15.00    211.3375
                16.00    183.8792
                17.00    165.9000
                18.00    569.5500
                19.00    147.3625
                21.00    340.3333
                22.00    322.6500
                23.00    376.2750
                24.00    534.2625
                25.00    151.5500
                26.00     78.8500
                29.00    211.3375
                30.00    374.3542
                31.00    278.1417
                32.00     76.2917
                33.00    229.6000
                35.00    926.5375
                36.00    326.6333
                38.00    378.8083
                39.00    329.5916
                40.00    287.9625
                41.00    134.5000
                42.00    227.5250
                43.00    211.3375
                44.00     85.7000
                45.00    164.8667
                47.00     52.5542
                48.00     65.5292
                49.00    102.6584
                50.00    276.2333
                51.00     77.9583
                52.00    171.7667
                53.00     51.4792
                54.00    137.6667
                56.00     83.1583
                58.00    326.5333
                60.00     75.2500
                62.00     80.0000
                63.00     77.9583
        2       2.00      26.0000
                3.00      41.5792
                4.00      62.0000
                5.00      27.7500
                6.00      33.0000
                7.00      26.2500
                8.00      26.2500
                13.00     19.5000
                14.00     30.0708
                17.00     22.5000
                18.00     36.0000
                19.00     52.0000
                21.00     10.5000
                22.00     70.5792
                23.00     13.7917
                24.00    177.2500
                25.00     56.0000
                26.00     26.0000
                27.00     45.3583
                28.00     88.6500
                29.00     62.5000
                30.00     46.3500
                31.00     26.2500
                32.00     13.0000
                32.50     13.0000
                33.00     53.7500
                34.00     79.0000
                35.00     21.0000
                36.00     52.0000
                38.00     13.0000
                40.00     67.7500
                41.00     19.5000
                42.00     39.0000
                44.00     26.0000
                45.00     39.7500
                48.00     65.0000
                50.00     47.0000
                54.00     23.0000
                55.00     16.0000
                57.00     10.5000
        3       0.75      38.5166
                1.00      26.8750
                2.00      81.9250
                3.00      21.0750
                4.00      52.1417
                5.00      63.1208
                6.00      31.2750
                8.00      21.0750
                9.00     108.7958
                10.00     24.1500
                11.00     31.2750
                13.00      7.2292
                14.00     19.0959
                14.50     14.4542
                15.00     29.7084
                16.00     62.3833
                17.00     22.3833
                18.00     91.4667
                19.00     15.7334
                20.00     18.4875
                21.00     59.6000
                22.00     50.8792
                23.00     15.4750
                24.00     60.6583
                25.00     15.7000
                26.00     31.8792
                27.00     31.5333
                28.00     22.2958
                29.00     46.7833
                30.00     45.2875
                30.50      7.7500
                31.00     55.0625
                32.00     15.5000
                33.00     15.8500
                35.00     20.2500
                36.00     17.4000
                37.00      9.5875
                38.00     31.3875
                39.00     60.4000
                40.00      9.4750
                41.00     59.9000
                43.00     46.9000
                45.00     50.1042
                47.00     14.5000
                48.00     34.3750
                63.00      9.5875
male    1       0.92     151.5500
                4.00      81.8583
                11.00    120.0000
                17.00    110.8833
                18.00    108.9000
                19.00    316.1000
                21.00     77.2875
                22.00    135.6333
                23.00     63.3583
                24.00    326.7208
                25.00    146.5209
                26.00     30.0000
                27.00    371.8292
                28.00    191.3208
                29.00     96.6000
                30.00     27.7500
                31.00    159.4958
                32.00     30.5000
                33.00      5.0000
                34.00     26.5500
                35.00    565.1667
                36.00    803.9792
                37.00    135.3542
                38.00    243.4625
                39.00      0.0000
                40.00     58.7208
                42.00    130.8417
                44.00     90.0000
                45.00    172.0750
                45.50     28.5000
                46.00    140.3750
                47.00    150.1083
                48.00    155.2792
                49.00    256.9167
                50.00    295.9750
                51.00     87.9292
                52.00    110.1500
                54.00    129.1500
                55.00     30.5000
                56.00     92.7458
                58.00    142.9750
                60.00    105.7500
                61.00     65.8208
                62.00     53.1000
                64.00    289.0000
                65.00     88.5292
                70.00     71.0000
                71.00     84.1584
                80.00     30.0000
        2       0.67      14.5000
                0.83      47.7500
                1.00      76.0042
                2.00      26.0000
                3.00      44.7500
                8.00      36.7500
                16.00     36.5000
                18.00    109.5000
                19.00     70.7500
                21.00    158.5000
                23.00     76.0458
                24.00     97.0000
                25.00    106.5792
                26.00     10.5000
                27.00     52.0000
                28.00     70.0000
                29.00     59.2208
                30.00     73.5000
                31.00     86.7542
                32.00    110.0000
                32.50     30.0708
                33.00     12.2750
                34.00    107.0000
                35.00     36.5000
                36.00     64.1250
                36.50     26.0000
                37.00     26.0000
                39.00     52.0000
                42.00     53.0000
                43.00     26.2500
                44.00     26.0000
                46.00     26.0000
                47.00     15.0000
                48.00     13.0000
                50.00     13.0000
                51.00     12.5250
                52.00     26.5000
                54.00     66.0000
                57.00     12.3500
                59.00     13.5000
                60.00     39.0000
                62.00     10.5000
                66.00     10.5000
                70.00     10.5000
        3       0.42       8.5167
                1.00     107.1625
                2.00      89.8875
                3.00      47.2875
                4.00      99.4333
                6.00      12.4750
                7.00      68.8125
                8.00      29.1250
                9.00     114.7125
                10.00     27.9000
                11.00     65.6875
                12.00     11.2417
                14.00     86.5875
                15.00      7.2292
                16.00    154.9042
                17.00     47.3959
                18.00     74.2333
                19.00     94.7915
                20.00    110.8751
                20.50      7.2500
                21.00    111.3541
                22.00    108.8874
                23.00     24.9750
                23.50      7.2292
                24.00     95.1791
                24.50      8.0500
                25.00     85.2125
                26.00    166.3333
                27.00     45.7833
                28.00    153.2374
                28.50     23.3292
                29.00     65.3749
                30.00     71.3000
                30.50      8.0500
                31.00     23.4500
                32.00    192.5291
                33.00     70.9083
                34.00     36.9958
                34.50      6.4375
                35.00     38.1708
                36.00     55.0916
                37.00      7.9250
                38.00     23.6083
                39.00     71.2750
                40.00     58.5208
                40.50     22.2500
                41.00     21.2333
                42.00     32.2667
                43.00     14.5000
                44.00     40.1250
                45.00     15.0250
                45.50      7.2250
                47.00     16.2500
                48.00      7.8542
                49.00      0.0000
                50.00      8.0500
                51.00     22.8542
                55.50      8.0500
                59.00      7.2500
                61.00      6.2375
                65.00      7.7500
                70.50      7.7500
                74.00      7.7750
Name: Fare, dtype: float64

What if we wanted to make this look a bit nicer, as a Pandas DataFrame? We can pass all our data back into a new DataFrame object.

df = pd.DataFrame(df.groupby(["Sex", "Pclass", "Age"]).Fare.sum())
df
Fare
Sex Pclass Age
female 1 2.00 151.5500
14.00 120.0000
15.00 211.3375
16.00 183.8792
17.00 165.9000
18.00 569.5500
19.00 147.3625
21.00 340.3333
22.00 322.6500
23.00 376.2750
24.00 534.2625
25.00 151.5500
26.00 78.8500
29.00 211.3375
30.00 374.3542
31.00 278.1417
32.00 76.2917
33.00 229.6000
35.00 926.5375
36.00 326.6333
38.00 378.8083
39.00 329.5916
40.00 287.9625
41.00 134.5000
42.00 227.5250
43.00 211.3375
44.00 85.7000
45.00 164.8667
47.00 52.5542
48.00 65.5292
49.00 102.6584
50.00 276.2333
51.00 77.9583
52.00 171.7667
53.00 51.4792
54.00 137.6667
56.00 83.1583
58.00 326.5333
60.00 75.2500
62.00 80.0000
63.00 77.9583
2 2.00 26.0000
3.00 41.5792
4.00 62.0000
5.00 27.7500
6.00 33.0000
7.00 26.2500
8.00 26.2500
13.00 19.5000
14.00 30.0708
17.00 22.5000
18.00 36.0000
19.00 52.0000
21.00 10.5000
22.00 70.5792
23.00 13.7917
24.00 177.2500
25.00 56.0000
26.00 26.0000
27.00 45.3583
28.00 88.6500
29.00 62.5000
30.00 46.3500
31.00 26.2500
32.00 13.0000
32.50 13.0000
33.00 53.7500
34.00 79.0000
35.00 21.0000
36.00 52.0000
38.00 13.0000
40.00 67.7500
41.00 19.5000
42.00 39.0000
44.00 26.0000
45.00 39.7500
48.00 65.0000
50.00 47.0000
54.00 23.0000
55.00 16.0000
57.00 10.5000
3 0.75 38.5166
1.00 26.8750
2.00 81.9250
3.00 21.0750
4.00 52.1417
5.00 63.1208
6.00 31.2750
8.00 21.0750
9.00 108.7958
10.00 24.1500
11.00 31.2750
13.00 7.2292
14.00 19.0959
14.50 14.4542
15.00 29.7084
16.00 62.3833
17.00 22.3833
18.00 91.4667
19.00 15.7334
20.00 18.4875
21.00 59.6000
22.00 50.8792
23.00 15.4750
24.00 60.6583
25.00 15.7000
26.00 31.8792
27.00 31.5333
28.00 22.2958
29.00 46.7833
30.00 45.2875
30.50 7.7500
31.00 55.0625
32.00 15.5000
33.00 15.8500
35.00 20.2500
36.00 17.4000
37.00 9.5875
38.00 31.3875
39.00 60.4000
40.00 9.4750
41.00 59.9000
43.00 46.9000
45.00 50.1042
47.00 14.5000
48.00 34.3750
63.00 9.5875
male 1 0.92 151.5500
4.00 81.8583
11.00 120.0000
17.00 110.8833
18.00 108.9000
19.00 316.1000
21.00 77.2875
22.00 135.6333
23.00 63.3583
24.00 326.7208
25.00 146.5209
26.00 30.0000
27.00 371.8292
28.00 191.3208
29.00 96.6000
30.00 27.7500
31.00 159.4958
32.00 30.5000
33.00 5.0000
34.00 26.5500
35.00 565.1667
36.00 803.9792
37.00 135.3542
38.00 243.4625
39.00 0.0000
40.00 58.7208
42.00 130.8417
44.00 90.0000
45.00 172.0750
45.50 28.5000
46.00 140.3750
47.00 150.1083
48.00 155.2792
49.00 256.9167
50.00 295.9750
51.00 87.9292
52.00 110.1500
54.00 129.1500
55.00 30.5000
56.00 92.7458
58.00 142.9750
60.00 105.7500
61.00 65.8208
62.00 53.1000
64.00 289.0000
65.00 88.5292
70.00 71.0000
71.00 84.1584
80.00 30.0000
2 0.67 14.5000
0.83 47.7500
1.00 76.0042
2.00 26.0000
3.00 44.7500
8.00 36.7500
16.00 36.5000
18.00 109.5000
19.00 70.7500
21.00 158.5000
23.00 76.0458
24.00 97.0000
25.00 106.5792
26.00 10.5000
27.00 52.0000
28.00 70.0000
29.00 59.2208
30.00 73.5000
31.00 86.7542
32.00 110.0000
32.50 30.0708
33.00 12.2750
34.00 107.0000
35.00 36.5000
36.00 64.1250
36.50 26.0000
37.00 26.0000
39.00 52.0000
42.00 53.0000
43.00 26.2500
44.00 26.0000
46.00 26.0000
47.00 15.0000
48.00 13.0000
50.00 13.0000
51.00 12.5250
52.00 26.5000
54.00 66.0000
57.00 12.3500
59.00 13.5000
60.00 39.0000
62.00 10.5000
66.00 10.5000
70.00 10.5000
3 0.42 8.5167
1.00 107.1625
2.00 89.8875
3.00 47.2875
4.00 99.4333
6.00 12.4750
7.00 68.8125
8.00 29.1250
9.00 114.7125
10.00 27.9000
11.00 65.6875
12.00 11.2417
14.00 86.5875
15.00 7.2292
16.00 154.9042
17.00 47.3959
18.00 74.2333
19.00 94.7915
20.00 110.8751
20.50 7.2500
21.00 111.3541
22.00 108.8874
23.00 24.9750
23.50 7.2292
24.00 95.1791
24.50 8.0500
25.00 85.2125
26.00 166.3333
27.00 45.7833
28.00 153.2374
28.50 23.3292
29.00 65.3749
30.00 71.3000
30.50 8.0500
31.00 23.4500
32.00 192.5291
33.00 70.9083
34.00 36.9958
34.50 6.4375
35.00 38.1708
36.00 55.0916
37.00 7.9250
38.00 23.6083
39.00 71.2750
40.00 58.5208
40.50 22.2500
41.00 21.2333
42.00 32.2667
43.00 14.5000
44.00 40.1250
45.00 15.0250
45.50 7.2250
47.00 16.2500
48.00 7.8542
49.00 0.0000
50.00 8.0500
51.00 22.8542
55.50 8.0500
59.00 7.2500
61.00 6.2375
65.00 7.7500
70.50 7.7500
74.00 7.7750

This is now a bit easier to read. You should now have a fairly good understanding of how to group data in Pandas using groupby() and some of the more powerful ways you can use groupby() to manipulate quantitative data.