Cleaning the DataFrame

2.3. Cleaning the DataFrame#

2.3.1. How to Drop a Column in Pandas DataFrame#

import pandas as pd
df = pd.read_csv("../data/titanic.csv")
df

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
...	...	...	...	...	...	...	...	...	...	...	...	...
886	887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.0000	NaN	S
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	B42	S
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	C148	C
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q

891 rows × 12 columns

Imagine that we have a large DataFrame, but we are not interested in a couple columns. This is especially import when your DataFrame has 10s or 100s of columns. In these instances, you need to examine the DataFrame without the useless data. Imagine that we wanted to study the Titanic data but knew that Parch and Ticket were categories that we did not need. We can use df.drop() to pass an argument to remove those specific columns.

df.drop(columns=["Parch", "Ticket"])

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	8.0500	NaN	S
...	...	...	...	...	...	...	...	...	...	...
886	887	0	2	Montvila, Rev. Juozas	male	27.0	0	13.0000	NaN	S
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	30.0000	B42	S
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	23.4500	NaN	S
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	30.0000	C148	C
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	7.7500	NaN	Q

891 rows × 10 columns

Likewise, we can do the opposite. Rather than dropping certain columns, we can keep certain columns with the example code below.

df[["Survived", "Name"]]

	Survived	Name
0	0	Braund, Mr. Owen Harris
1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...
2	1	Heikkinen, Miss. Laina
3	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)
4	0	Allen, Mr. William Henry
...	...	...
886	0	Montvila, Rev. Juozas
887	1	Graham, Miss. Margaret Edith
888	0	Johnston, Miss. Catherine Helen "Carrie"
889	1	Behr, Mr. Karl Howell
890	0	Dooley, Mr. Patrick

891 rows × 2 columns

Note the use of double brackets here, e.g. [[]].

2.3.2. How to Remove Rows that have NaN in any Column#

One of the biggest problems in datasets is the absence of data. If you are training a machine learning model or just performing quantitative analysis, rows that have missing values, or NaN, can radically alter your results. It is often good practice to ignore that data or alter it in some way. Let’s presume that we want to simply remove it from our dataset. To do that, we can use df.dropna() which will remove all rows that have any instance of NaN in any column.

df.dropna()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
6	7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S
10	11	1	3	Sandstrom, Miss. Marguerite Rut	female	4.0	1	1	PP 9549	16.7000	G6	S
11	12	1	1	Bonnell, Miss. Elizabeth	female	58.0	0	0	113783	26.5500	C103	S
...	...	...	...	...	...	...	...	...	...	...	...	...
871	872	1	1	Beckwith, Mrs. Richard Leonard (Sallie Monypeny)	female	47.0	1	1	11751	52.5542	D35	S
872	873	0	1	Carlsson, Mr. Frans Olof	male	33.0	0	0	695	5.0000	B51 B53 B55	S
879	880	1	1	Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)	female	56.0	0	1	11767	83.1583	C50	C
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	B42	S
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	C148	C

183 rows × 12 columns

2.3.3. How to Remove Rows that have NaN in a Specific Column#

In some instances, though, we don’t want to remove an entire row just because of NaN in one column. Maybe that column is not as important for quantitative analysis and we are not planning to include it in our analysis, but we still want to see it. A good example of this is the column Cabin which is a string or Age which is a float (we’ll get to that in a moment). Let’s say we want to remove all rows that have NaN in the Age column. We can use the command below.

df2 = df[df["Age"].notna()]

df2

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
...	...	...	...	...	...	...	...	...	...	...	...	...
885	886	0	3	Rice, Mrs. William (Margaret Norton)	female	39.0	0	5	382652	29.1250	NaN	Q
886	887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.0000	NaN	S
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	B42	S
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	C148	C
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q

714 rows × 12 columns

As we can see, the size of our DataFrame dropped from 891 rows to 714.

2.3.4. How to Convert DataFrame Data Types (from Float to Int)#

In other instances, it may be important not to simply remove a column, but alter it into a different type of data. In this dataset, Age is a float. This is to account for infants who were below the age of 1 on the Titanic. Let’s presume that we want to convert all these floats to integers. To do that we can use the .astype() method on a specific row.

df2.Age = df2.Age.astype(int)

/tmp/ipykernel_708450/3009824114.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df2.Age = df2.Age.astype(int)

df2

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35	0	0	373450	8.0500	NaN	S
...	...	...	...	...	...	...	...	...	...	...	...	...
885	886	0	3	Rice, Mrs. William (Margaret Norton)	female	39	0	5	382652	29.1250	NaN	Q
886	887	0	2	Montvila, Rev. Juozas	male	27	0	0	211536	13.0000	NaN	S
887	888	1	1	Graham, Miss. Margaret Edith	female	19	0	0	112053	30.0000	B42	S
889	890	1	1	Behr, Mr. Karl Howell	male	26	0	0	111369	30.0000	C148	C
890	891	0	3	Dooley, Mr. Patrick	male	32	0	0	370376	7.7500	NaN	Q

714 rows × 12 columns

Now our Age column is no longer a float, rather an integer.

2.3.5. Conclusion#

Sorting, cleaning, and organizing data in Pandas can require practice. Even after years of using Pandas, you will still find yourself looking things up on Stackoverflow or other resources online. The power of Pandas comes at the cost of making it difficult to master quickly. With regular practice, however, Pandas does get easier to use over time. Once you have a command of Pandas, you can do quick data cleaning and analysis all in Python. In the next chapter, we will look at how leverage Pandas to do more advanced searching methods.

Cleaning the DataFrame

Contents

2.3. Cleaning the DataFrame#

2.3.1. How to Drop a Column in Pandas DataFrame#

2.3.2. How to Remove Rows that have NaN in any Column#

2.3.3. How to Remove Rows that have NaN in a Specific Column#

2.3.4. How to Convert DataFrame Data Types (from Float to Int)#

2.3.5. Conclusion#