Data Transformations with Pandas

5 min read · Mar 2, 2021

Pandas is an immensely useful Python package used mainly for data manipulation, though it can also be used for data visualization. As we discussed in a previous article, “A Pandas Primer”, Pandas builds on two essential packages: NumPy, which makes manipulating our data quick and easy, and Matplotlib, which powers its plotting methods. One of the most important parts of the Pandas package to get familiar with is the DataFrame.

DataFrames

In data science we have to store our data in a form that is easy to work with. Most often this is a rectangular form, more commonly referred to as tabular data. In Pandas, tabular data is represented as a DataFrame object. We have already seen how to import this type of data in the article mentioned above; now we will start to explore its contents and later learn how to manipulate it. When we first import our data we need to get acquainted with it, and there are a few methods and attributes that we should always use to do so. They are as follows:

.head(): returns the first few rows (five by default) of the DataFrame for quick inspection
.info(): displays the column names, the data types they contain, and the non-null count for each column, which reveals any missing values
.shape: (attribute) returns a tuple with the number of rows and columns in the DataFrame
.describe(): gives summary statistics, such as the mean and median (the 50% row), for columns that contain numerical data

For all the examples from here on we will use the King County house sales dataset, which can be found on Kaggle. We will first inspect our DataFrame with the .head() method to see what we are working with, and then use that output as a baseline to compare against the methods we go over later.
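As a rough sketch of that first inspection, the snippet below runs the four tools listed earlier. To keep it self-contained it builds a tiny DataFrame by hand instead of loading the Kaggle file; the rows are made-up stand-ins, and only the column names (price, bedrooms, bathrooms, sqft_living) are meant to mirror the real dataset. In practice you would load the downloaded CSV with pd.read_csv() instead.

```python
import pandas as pd

# Made-up stand-in rows (an assumption for illustration); the real
# Kaggle dataset has far more rows and columns than this.
df = pd.DataFrame({
    "price": [200000.0, 550000.0, 180000.0, 600000.0, 500000.0, 1200000.0],
    "bedrooms": [3, 3, 2, 4, 3, 4],
    "bathrooms": [1.0, 2.0, 1.0, 3.0, 2.0, 4.5],
    "sqft_living": [1200, 2500, 800, 2000, 1700, 5400],
})

first_rows = df.head()    # first five rows by default
print(first_rows)

df.info()                 # prints column names, dtypes, and non-null counts

print(df.shape)           # attribute, not a method: (rows, columns)

summary = df.describe()   # count, mean, std, min, quartiles, and max per numeric column
print(summary)
```

Note that .shape has no parentheses because it is an attribute, and that the median shows up in the .describe() output as the row labelled 50%.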

Sorting and Subsetting Rows

Once we have a sense of the dataset we are working with, we want to dig into manipulating our DataFrame to uncover and present the data in a clear way. There are many different ways to manipulate a DataFrame, and we won’t always start the same way, but one of the first things we can do is sort our data. We can accomplish this by calling the .sort_values() method and passing in the column that we would like to sort by. Going back to our DataFrame above, we will sort our values by the column ‘sqft_living’.

As you can see, the .sort_values() method puts the smallest value at the top of our DataFrame and the largest at the bottom. We can sort in descending order if needed by passing the argument ascending=False into .sort_values(); by default this argument is set to True. We can also sort our DataFrame by multiple columns: we simply pass a list of column names into .sort_values(), in the order we would like to sort by.
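The three sorting variants described above can be sketched as follows. The DataFrame here is a small hand-built stand-in with made-up values, not the real dataset. Passing a list to ascending, one flag per column, is an extra shown for illustration.

```python
import pandas as pd

# Made-up stand-in rows (an assumption for illustration).
df = pd.DataFrame({
    "bedrooms": [3, 4, 2, 4],
    "sqft_living": [1200, 2000, 800, 5400],
})

# Ascending is the default: smallest living area first.
smallest_first = df.sort_values("sqft_living")

# Descending: largest living area first.
largest_first = df.sort_values("sqft_living", ascending=False)

# Multiple columns: sort by bedrooms first, then break ties by living
# area, largest first. One ascending flag is given per column.
by_two = df.sort_values(["bedrooms", "sqft_living"], ascending=[True, False])

print(smallest_first)
print(largest_first)
print(by_two)
```

.sort_values() returns a new, sorted DataFrame and leaves df itself in its original order, so assign the result to a name if you want to keep it.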

Another way that we can start to manipulate our DataFrame is by subsetting it, that is, looking at only one or a few columns. Pandas makes this easy with square brackets. For instance, if we wanted to look at just the bedrooms column of our DataFrame, the code would be df['bedrooms']. To subset by multiple columns, we instead pass a list of column names into the square brackets, like so: df[['bedrooms', 'bathrooms']], which gives us just the bedrooms and bathrooms columns.
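A minimal sketch of both forms of column subsetting, again on a hand-built stand-in DataFrame with made-up values:

```python
import pandas as pd

# Made-up stand-in rows (an assumption for illustration).
df = pd.DataFrame({
    "bedrooms": [3, 4, 2],
    "bathrooms": [1.0, 3.0, 1.0],
    "sqft_living": [1200, 2000, 800],
})

# Single brackets with one column name return a Series.
bedrooms = df["bedrooms"]

# Double brackets (a list of names inside the brackets) return a DataFrame.
beds_baths = df[["bedrooms", "bathrooms"]]

print(type(bedrooms).__name__)
print(beds_baths)
```

The outer pair of brackets does the subsetting and the inner pair builds the list of column names, which is why selecting several columns needs two sets.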

We can also subset, or filter, the rows of our DataFrame using logical conditions. For instance, if we wanted to see which houses have 4 or more bedrooms, we could use the following code: df[df['bedrooms'] >= 4].
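Here is a small sketch of row filtering on a hand-built stand-in DataFrame with made-up values. The second filter, which combines two conditions with &, goes slightly beyond the example in the text and is included only as an illustration.

```python
import pandas as pd

# Made-up stand-in rows (an assumption for illustration).
df = pd.DataFrame({
    "bedrooms": [3, 4, 2, 4],
    "sqft_living": [1200, 2000, 800, 5400],
})

# The comparison produces a boolean Series, one True/False per row.
mask = df["bedrooms"] >= 4

# Passing that Series into the brackets keeps only the True rows.
four_plus = df[mask]

# Conditions can be combined with & (and) or | (or); each condition
# needs its own parentheses.
big_four_plus = df[(df["bedrooms"] >= 4) & (df["sqft_living"] > 2500)]

print(four_plus)
print(big_four_plus)
```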

New Columns

One of the cooler things we can do is create new columns for our data, often based on data that we already have. This is often referred to as feature engineering, and it is another way to explore our dataset. Let’s say we would like to add a column to our DataFrame that gives the number of bedrooms per bathroom, which we will call ‘b_per_b’. We do so with a simple assignment: on the left side is the name of the column we want to create, in subsetting format; on the right side we divide the bedrooms column by the bathrooms column, like so:

df['b_per_b'] = df['bedrooms'] / df['bathrooms']

When we next inspect our DataFrame we will have the ‘b_per_b’ column added as the last column of our DataFrame.
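The same assignment, sketched on a hand-built stand-in DataFrame with made-up values:

```python
import pandas as pd

# Made-up stand-in rows (an assumption for illustration).
df = pd.DataFrame({
    "bedrooms": [3, 4, 2],
    "bathrooms": [1.0, 2.0, 1.0],
})

# The division is applied row by row; the result is stored as a new column.
df["b_per_b"] = df["bedrooms"] / df["bathrooms"]

print(df.columns.tolist())   # the new column is appended at the end
print(df)
```

One thing to watch for with a ratio like this: any row with zero bathrooms would produce an infinite or missing value, so it can be worth checking the new column with .describe() afterwards.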

Conclusion

These are just a few of the ways that we can manipulate our DataFrames. There are plenty of other methods for transforming our data to explore it and draw insights from our dataset. I urge you to play around with the methods we have gone over, and to look into the Pandas documentation to find other ways to transform your data.

Written by Jason Drummond