2.2.
Organizing the DataFrame

Dr. W.J.B. Mattingly
Smithsonian Data Science Lab and United States Holocaust Memorial Museum
August 2021

2.2.1. Covered in this Chapter

  1. How to Sort Data by Single Column

  2. How to Reverse Sort Data by Single Column

  3. How to Sort Data by Multiple Columns

  4. How to Sort Data by Multiple Columns with Different Values Organized Differently

2.2.2. Video

%%HTML
<center>
<iframe width="560" height="315" src="https://www.youtube.com/embed/1G1ursLeZNg" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</center>

2.2.3. How to Sort Data By Single Column

import pandas as pd
df = pd.read_csv("data/titanic.csv")

Now that we’ve import pandas and created our DataFrame, let’s see what it looks like again.

df
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns

In this scenario, I am interested in sorting the data (rather like Excel). Rather than using the sort feature in Excel, we can use the df.sort_values() method in Python. This will take one argument, specifically the column that you want to organize by. By default, this will be ascending. Let’s do this by class. In other words, sort the DataFrame so that those who were in first class appear first and those in third class appear last. We will do this by passing the argument “Pclass”, the column name corresponding to Passenger Class.

df.sort_values("Pclass")
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
445 446 1 1 Dodge, Master. Washington male 4.0 0 2 33638 81.8583 A34 S
310 311 1 1 Hays, Miss. Margaret Bechstein female 24.0 0 0 11767 83.1583 C54 C
309 310 1 1 Francatelli, Miss. Laura Mabel female 30.0 0 0 PC 17485 56.9292 E36 C
307 308 1 1 Penasco y Castellana, Mrs. Victor de Satode (M... female 17.0 1 0 PC 17758 108.9000 C65 C
306 307 1 1 Fleming, Miss. Margaret female NaN 0 0 17421 110.8833 NaN C
... ... ... ... ... ... ... ... ... ... ... ... ...
379 380 0 3 Gustafsson, Mr. Karl Gideon male 19.0 0 0 347069 7.7750 NaN S
381 382 1 3 Nakid, Miss. Maria ("Mary") female 1.0 0 2 2653 15.7417 NaN C
382 383 0 3 Tikkanen, Mr. Juho male 32.0 0 0 STON/O 2. 3101293 7.9250 NaN S
371 372 0 3 Wiklund, Mr. Jakob Alfred male 18.0 1 0 3101267 6.4958 NaN S
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns

2.2.4. How to Reverse Sort Data by Single Column

As we can see, our data is now appearing as expected. We can pass additional keyword arguments to sort the data in the opposite direction, or descending by setting ascending to False. See the example below.

df.sort_values("Pclass", ascending=False)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
511 512 0 3 Webber, Mr. James male NaN 0 0 SOTON/OQ 3101316 8.0500 NaN S
500 501 0 3 Calic, Mr. Petar male 17.0 0 0 315086 8.6625 NaN S
501 502 0 3 Canavan, Miss. Mary female 21.0 0 0 364846 7.7500 NaN Q
502 503 0 3 O'Sullivan, Miss. Bridget Mary female NaN 0 0 330909 7.6292 NaN Q
... ... ... ... ... ... ... ... ... ... ... ... ...
102 103 0 1 White, Mr. Richard Frasar male 21.0 0 1 35281 77.2875 D26 S
710 711 1 1 Mayne, Mlle. Berthe Antonine ("Mrs de Villiers") female 24.0 0 0 PC 17482 49.5042 C90 C
711 712 0 1 Klaber, Mr. Herman male NaN 0 0 113028 26.5500 C124 S
712 713 1 1 Taylor, Mr. Elmer Zebley male 48.0 1 0 19996 52.0000 C126 S
445 446 1 1 Dodge, Master. Washington male 4.0 0 2 33638 81.8583 A34 S

891 rows × 12 columns

2.2.5. How to Sort Data by Multiple Columns

Again, we can see the power of Pandas over Excel by the simplicity of altering our command to include multiple columns. Let’s say that we want to sort all the data by Pclass, then we want that data organized again by sex, so that all male and female passengers appear in order. We can do this by passing the argument of what we want organized as a list. Note the order of the list as well. The columns that appear earlier in the list correspond to those that receive primacy in the ascending. In other words, we organize by passenger class firsit, then sex. In this case, the method head, is simply showing the top 100 rows.

df.sort_values(["Pclass", "Sex"]).head(100)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S
31 32 1 1 Spencer, Mrs. William Augustus (Marie Eugenie) female NaN 1 0 PC 17569 146.5208 B78 C
52 53 1 1 Harper, Mrs. Henry Sleeper (Myna Haxtun) female 49.0 1 0 PC 17572 76.7292 D33 C
... ... ... ... ... ... ... ... ... ... ... ... ...
23 24 1 1 Sloper, Mr. William Thompson male 28.0 0 0 113788 35.5000 A6 S
27 28 0 1 Fortune, Mr. Charles Alexander male 19.0 3 2 19950 263.0000 C23 C25 C27 S
30 31 0 1 Uruchurtu, Don. Manuel E male 40.0 0 0 PC 17601 27.7208 NaN C
34 35 0 1 Meyer, Mr. Edgar Joseph male 28.0 1 0 PC 17604 82.1708 NaN C
35 36 0 1 Holverson, Mr. Alexander Oskar male 42.0 1 0 113789 52.0000 NaN S

100 rows × 12 columns

As with before, we can control how the data is sorted, either ascending or descending. If we set ascending to False, we organize all items in the list by this method. We can do this with the sample code below.

df.sort_values(["Pclass", "Sex"], ascending=False)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
12 13 0 3 Saundercock, Mr. William Henry male 20.0 0 0 A/5. 2151 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
856 857 1 1 Wick, Mrs. George Dennick (Mary Hitchcock) female 45.0 1 1 36928 164.8667 NaN S
862 863 1 1 Swift, Mrs. Frederick Joel (Margaret Welles Ba... female 48.0 0 0 17466 25.9292 D17 S
871 872 1 1 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47.0 1 1 11751 52.5542 D35 S
879 880 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 0 1 11767 83.1583 C50 C
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S

891 rows × 12 columns

2.2.6. How to Sort Data by Multiple Columns with Different Values Organized Differently

What if we want to organize the data differently. By this I mean, we want for all the data to be organized by passenger class first and for that data to be ascending (1, 2, 3), but we want the sex of the passengers to be organized descending (male, female, rather than female, male). To achieve this, we can pass a list to ascending with 0s and 1s. 0 is False and 1 is True.

df.sort_values(["Pclass", "Sex"], ascending=[1,0])
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
23 24 1 1 Sloper, Mr. William Thompson male 28.0 0 0 113788 35.5000 A6 S
27 28 0 1 Fortune, Mr. Charles Alexander male 19.0 3 2 19950 263.0000 C23 C25 C27 S
30 31 0 1 Uruchurtu, Don. Manuel E male 40.0 0 0 PC 17601 27.7208 NaN C
34 35 0 1 Meyer, Mr. Edgar Joseph male 28.0 1 0 PC 17604 82.1708 NaN C
... ... ... ... ... ... ... ... ... ... ... ... ...
863 864 0 3 Sage, Miss. Dorothy Edith "Dolly" female NaN 8 2 CA. 2343 69.5500 NaN S
875 876 1 3 Najib, Miss. Adele Kiamie "Jane" female 15.0 0 0 2667 7.2250 NaN C
882 883 0 3 Dahlberg, Miss. Gerda Ulrika female 22.0 0 0 7552 10.5167 NaN S
885 886 0 3 Rice, Mrs. William (Margaret Norton) female 39.0 0 5 382652 29.1250 NaN Q
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S

891 rows × 12 columns

What is particularly nice about Pandas over Excel is that this operation scales nicely. If we want to add more methods of sorting, we can do that too by simple increasing the indices of our lists. Always make sure that the length of your lists match, however. In other words, do not have 3 attributes to sort by and 2 items in your ascending list. In this case, we want to organize by passenger class, sex, and age with passenger class ascending, sex descending, and age ascending. Let’s see what that would look like.

df.sort_values(["Pclass", "Sex", "Age"], ascending=[1,0,1])
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
305 306 1 1 Allison, Master. Hudson Trevor male 0.92 1 2 113781 151.5500 C22 C26 S
445 446 1 1 Dodge, Master. Washington male 4.00 0 2 33638 81.8583 A34 S
802 803 1 1 Carter, Master. William Thornton II male 11.00 1 2 113760 120.0000 B96 B98 S
550 551 1 1 Thayer, Mr. John Borland Jr male 17.00 0 2 17421 110.8833 C70 C
505 506 0 1 Penasco y Castellana, Mr. Victor de Satode male 18.00 1 0 PC 17758 108.9000 C65 C
... ... ... ... ... ... ... ... ... ... ... ... ...
697 698 1 3 Mullens, Miss. Katherine "Katie" female NaN 0 0 35852 7.7333 NaN Q
727 728 1 3 Mannion, Miss. Margareth female NaN 0 0 36866 7.7375 NaN Q
792 793 0 3 Sage, Miss. Stella Anna female NaN 8 2 CA. 2343 69.5500 NaN S
863 864 0 3 Sage, Miss. Dorothy Edith "Dolly" female NaN 8 2 CA. 2343 69.5500 NaN S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S

891 rows × 12 columns

As we move forward throughout this textbook, we will explore more robust ways to sort and organize our data. For now, you should feel comfortable with how to use sort_values() to do fairly robust tasks quickly.