Categorical Data
One of the biggest and most important steps in almost any data science project is the data preparation phase, which includes loading and cleaning our data. When we begin to clean our data, we will most likely encounter two different types of data. The first is numerical data, sometimes called quantitative data. It is expressed as a value and is usually a measurement, such as height or weight. The other is categorical data, sometimes called qualitative data. It is usually composed of a finite number of distinct groups, such as eye or hair colors. Our focus for the remainder of this article will be explaining what categorical data is and how we deal with it before we move on to the modeling process.
Categorical Categories
We can break categorical data down even further into two types: ordinal and nominal. Ordinal variables are not too hard to understand once you think about it: they are variables that have some sort of order to them. For example, I’m sure we have all filled out surveys asking us to rate something with choices ranging from strongly disagree to strongly agree. Nominal variables, on the other hand, cannot be placed into any sort of order. A good example is being asked what your favorite color is: no color ranks higher than any other.
Working with Categorical Variables
Let’s start getting the hang of working with categorical variables. For all of our examples we will be using a housing dataset from Kaggle that contains both numerical and categorical data. You can follow the link at the bottom of the article to download the same dataset, or you can work with your own. To work with the data we will use the pandas library. If you are unfamiliar with pandas, the first steps are to import the library and read in our CSV file so that we can begin exploring it.
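As a minimal sketch of these two steps, the snippet below imports pandas and reads CSV data into a dataframe. So that it runs on its own, it reads from a small in-memory string rather than the Kaggle file. The four rows are made up for illustration, and the file name "train.csv" in the comment is an assumption about what the download is called.

```python
import io

import pandas as pd

# A tiny stand-in for the housing CSV; these rows are made up for illustration.
csv_text = """Id,Street,OverallCond,SalePrice
1,Pave,5,200000
2,Pave,8,180000
3,Grvl,5,150000
4,Pave,6,140000
"""

# With the real download you would pass a file path instead,
# e.g. pd.read_csv("train.csv") ("train.csv" is an assumed file name).
df = pd.read_csv(io.StringIO(csv_text))

# Peek at the first few rows to confirm the data loaded as expected.
print(df.head())
```
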
One of the first things you should do whenever you read in a dataset is get a sense of what type of data you are working with. This can be done with the pandas .info() method, which gives us a look at the dataset’s variables and their data types.
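A short sketch of the call is below. It uses a small made-up dataframe as a stand-in for the housing data so that it runs on its own. Only the Street and OverallCond column names come from the article's dataset; the values are illustrative.

```python
import pandas as pd

# Made-up stand-in for the housing dataframe.
df = pd.DataFrame(
    {
        "Street": ["Pave", "Pave", "Grvl", "Pave"],
        "OverallCond": [5, 8, 5, 6],
        "SalePrice": [200000, 180000, 150000, 140000],
    }
)

# Prints one line per column: name, non-null count, and Dtype.
df.info()

# The dtypes alone are also available as a Series.
print(df.dtypes)
```

Depending on your pandas version, the text column may be reported with a dedicated string dtype rather than object. Either way, a non-numeric dtype is the hint to look for.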
This gives us a lot of useful info, but for the moment we will focus on the Dtype column. If we take a look at the Street column, we can see that it has a dtype of object. A dtype of object is how pandas stores strings and is a good indicator that this variable might be categorical. We can explore this column in even more detail with the pandas .describe() method. Make sure to subset your dataframe to only this column.
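Subsetting to a single column and describing it can be sketched as follows. The dataframe is again a small made-up stand-in, so the numbers it prints will not match the real dataset.

```python
import pandas as pd

# Made-up stand-in for the housing dataframe.
df = pd.DataFrame({"Street": ["Pave", "Pave", "Grvl", "Pave"]})

# For a text column, describe() reports count, unique, top, and freq.
street_summary = df["Street"].describe()
print(street_summary)
```

Here `unique` is the number of distinct values, `top` is the most frequent value, and `freq` is how many times that value appears.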
This shows us that there are only two unique values among the 1,460 entries, with ‘Pave’ being the most frequent. We can be pretty sure from here that this is a categorical variable, but is it ordinal or nominal? The pandas .value_counts() method will show us the frequency of each distinct value in the specified column.
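A minimal sketch of counting the distinct values, again on a made-up stand-in for the Street column:

```python
import pandas as pd

# Made-up stand-in for the housing dataframe.
df = pd.DataFrame({"Street": ["Pave", "Pave", "Grvl", "Pave"]})

# Frequency of each distinct value, most common first.
street_counts = df["Street"].value_counts()
print(street_counts)

# normalize=True returns proportions instead of raw counts.
street_shares = df["Street"].value_counts(normalize=True)
print(street_shares)
```
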
We can now confidently say that the categorical data in this column is nominal. As a reminder, this is because nothing is being ranked here: the column simply categorizes the street as either pavement or gravel. To give an example of ordinal data we will look at the ‘OverallCond’ column, which rates the overall condition of the house. Let’s take a look at this column with the pandas .value_counts() method.
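For an ordinal column it can help to sort the counts by the rating itself rather than by frequency, so that the scale reads from low to high. The sketch below assumes a made-up stand-in for the OverallCond column. The sort_index() step is an addition for readability, not something the original walkthrough requires.

```python
import pandas as pd

# Made-up stand-in for the housing dataframe.
df = pd.DataFrame({"OverallCond": [5, 8, 5, 6, 5, 7]})

# Count each rating, then order the result by rating instead of by count.
cond_counts = df["OverallCond"].value_counts().sort_index()
print(cond_counts)
```
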
As we can see, this column is filled with ratings of the condition of the house, going from 1 at the low end to 9 at the high end. In general we don’t really need to do anything with this type of data before moving on to the modeling phase, the exception being nominal data in text format. This is because most machine learning models expect numerical input and cannot use raw text directly. We can deal with this in a few ways, but the main options are ranking the data in some sort of numerical form or doing something called one-hot encoding, which we will go over in the next section.
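The first option, ranking the data in numerical form, suits ordinal text such as the survey answers mentioned earlier. A minimal sketch is below. The `responses` series and the `rank` mapping are hypothetical and not part of the housing dataset.

```python
import pandas as pd

# Hypothetical survey answers stored as text.
responses = pd.Series(
    ["agree", "strongly disagree", "neutral", "strongly agree", "disagree"]
)

# Hand-written mapping from each category to its position on the scale.
rank = {
    "strongly disagree": 1,
    "disagree": 2,
    "neutral": 3,
    "agree": 4,
    "strongly agree": 5,
}

# map() looks each answer up in the dictionary and returns its rank.
encoded = responses.map(rank)
print(encoded)
```

Writing the mapping by hand keeps the order explicit. Any answer missing from the dictionary would come back as NaN, which makes typos in the data easy to spot.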
One-Hot Encoding
As mentioned, categorical data is often represented as text. This is easy for humans to understand, but we need to encode these categorical features as numeric values in order to use them in our machine learning models. We also can’t simply convert the categories to integers, because our model may pick up on a ranking that does not exist if we turn the first category into a 1, the second into a 2, and so on. This is where one-hot encoding comes in: each value is encoded by creating an additional binary feature that records whether that value was present or not. This may be hard to picture at first, but it will make a lot more sense once we see the resulting dataframe. First we will go over how to do this encoding.
We will use the pandas get_dummies() function to encode our categorical data. This function takes a dataframe and a list of the categorical columns that we want one-hot encoded, and returns a new dataframe in which those columns have been replaced by their encoded versions. Again, this will make much more sense after we see an example. Here we will one-hot encode the Street column that we were working with earlier.
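A minimal sketch of the call on a made-up stand-in for the housing data is below. Passing dtype=int is an assumption made here so the new columns hold 1s and 0s. Recent pandas versions otherwise return True/False values by default.

```python
import pandas as pd

# Made-up stand-in for the housing dataframe.
df = pd.DataFrame(
    {
        "Street": ["Pave", "Pave", "Grvl", "Pave"],
        "OverallCond": [5, 8, 5, 6],
    }
)

# Replace Street with one binary column per category.
encoded = pd.get_dummies(df, columns=["Street"], dtype=int)
print(encoded)
```

Each new column is named after the original column and the category it represents, which is where names like Street_Grvl and Street_Pave come from.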
The get_dummies() function is easy enough to use, and now that we have the resulting dataframe in front of us it is easy to see what is going on. The Street column was split into one column per category, in this case two: one representing a gravel street and one representing a paved street. The important thing to note is the values in these columns. If the house (row) has the feature (column), the value is 1; if not, it is 0. Because the large majority of the streets are paved, most of the houses have a 1 in the ‘Street_Pave’ column and a 0 in the ‘Street_Grvl’ column. If you have more columns to encode, simply add them to the columns parameter of get_dummies() and pandas will handle the rest. Once all of our categorical variables are encoded, we can move on to the beginning of our modeling process.
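Encoding several columns at once might look like the sketch below. The dataframe is a made-up stand-in, and the second categorical column, CentralAir, is used here purely as an illustrative assumption.

```python
import pandas as pd

# Made-up stand-in with two text columns; CentralAir is illustrative.
df = pd.DataFrame(
    {
        "Street": ["Pave", "Pave", "Grvl", "Pave"],
        "CentralAir": ["Y", "Y", "N", "Y"],
        "SalePrice": [200000, 180000, 150000, 140000],
    }
)

# List every column to encode; the other columns are left untouched.
encoded = pd.get_dummies(df, columns=["Street", "CentralAir"], dtype=int)
print(encoded)

# Each house has exactly one 1 among the columns created from Street.
street_cols = [c for c in encoded.columns if c.startswith("Street_")]
print(encoded[street_cols].sum(axis=1))
```
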
Conclusion
Handling categorical data is more involved than handling numerical data. It is important to learn how to tell whether we are working with categorical data and which type it is, ordinal or nominal, as this changes how we preprocess the feature. I hope this has been a good introduction to dealing with categorical data, but these techniques are just a brief look at the amount of work you may have to do to clean this type of data. I urge you to keep doing projects, and after some time you will learn the ins and outs of dealing with categorical data!