test post

One of the largest and most important steps in almost any data science project is the data preparation phase, which includes loading and cleaning our data. When we begin to clean our data, we will most likely encounter two different types of data. The first is numerical data, sometimes called quantitative data, which is expressed as a value and usually takes the form of a measurement such as height or weight. The other is categorical data, sometimes called qualitative data, which is usually composed of a finite number of distinct…

A subquery in SQL is a query that is nested inside another query. This simply means that we have an additional SELECT statement, contained inside parentheses, surrounded by another complete SQL statement. Often, to retrieve the information we want, we have to perform some intermediary transformations on our data before selecting, filtering, or calculating. This is why subqueries are so valuable: they are a common way of performing these intermediary transformations. A subquery can be placed in any part of our main query, such as the SELECT, FROM, or WHERE clauses…
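As a minimal sketch of a subquery in the WHERE clause, the snippet below uses Python's built-in sqlite3 module with an in-memory database. The `employees` table, its columns, and the sample rows are hypothetical and only serve to illustrate the idea: the inner SELECT computes the average salary, and the outer query filters on it.

```python
import sqlite3

# In-memory database with a small, hypothetical employees table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ana", 50000), ("Ben", 60000), ("Cal", 90000), ("Dee", 100000)],
)

# The SELECT inside the parentheses is the subquery: it computes the
# average salary, which the outer query then uses as a filter.
query = """
    SELECT name
    FROM employees
    WHERE salary > (SELECT AVG(salary) FROM employees)
    ORDER BY name
"""
above_average = [row[0] for row in conn.execute(query)]
print(above_average)

conn.close()
```

The subquery runs as an intermediary step, so the average never has to be computed and stored separately before filtering.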

We generate huge amounts of data every day simply by using our phones, playing a YouTube video, or paying for a meal. The companies that provide these services collect this data internally and use it to make data-driven decisions. In general, the public will never get to see that data unless the company decides to make it freely available. One of the first problems you will face as a data scientist is where to find the data for your project. …

In Python we use a for loop to iterate over an iterable, most often a list, dictionary, tuple, set, or string. Traditionally it is used when we have a block of code that we want to repeat a certain number of times. For loops are extremely valuable to us as programmers: they reduce the number of lines of code we have to write, keep us from writing the same line of code over and over, and help keep our programs less complex and more readable. The general syntax of a for loop is as follows:
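A minimal sketch of that syntax, using illustrative names (`fruits`, `fruit`, `lengths`, `total`) that are assumptions rather than the original post's example:

```python
# General form:
#     for <item> in <iterable>:
#         <block of code to repeat>

fruits = ["apple", "banana", "cherry"]  # hypothetical sample data

# Visit each element of the list once, in order.
lengths = []
for fruit in fruits:
    lengths.append(len(fruit))
print(lengths)  # [5, 6, 6]

# Repeat a block a fixed number of times with range().
total = 0
for i in range(3):  # i takes the values 0, 1, 2
    total += i
print(total)  # 3
```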

This…

Confusion matrices are a useful tool for evaluating binary classification models. They can also be extended to multiclass classification problems, but we will stick to binary cases for this article. The matrix lets us compare our model’s predicted classes against the actual outcomes. It’s called a confusion matrix because it reveals how “confused” the model is between the two possible outcomes and highlights instances in which one class is confused for the other. …
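To make the idea concrete, here is a minimal sketch that builds a binary confusion matrix in plain Python. The `actual` and `predicted` labels are made-up sample data, and the layout (rows are actual classes, columns are predicted classes) is one common convention assumed here, not necessarily the one the article uses.

```python
from collections import Counter

# Hypothetical labels: 1 is the positive class, 0 the negative class.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Count how often each (actual, predicted) pair occurs.
counts = Counter(zip(actual, predicted))

tn = counts[(0, 0)]  # true negatives: actual 0, predicted 0
fp = counts[(0, 1)]  # false positives: actual 0, predicted 1
fn = counts[(1, 0)]  # false negatives: actual 1, predicted 0
tp = counts[(1, 1)]  # true positives: actual 1, predicted 1

# Rows are actual classes, columns are predicted classes.
matrix = [[tn, fp],
          [fn, tp]]
print(matrix)  # [[3, 1], [1, 3]]
```

The off-diagonal cells (`fp` and `fn`) are where one class is confused for the other.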

Today we will be continuing our series on visualization tools, focusing specifically on the Seaborn library. As a reminder, we went over the basics of the Matplotlib library in our previous article; a link is provided at the bottom of the page for you to view it. In this blog we will introduce basic plotting techniques in Seaborn and go over the different types of plots that we can make.

Seaborn is an awesome Python library for creating easy and stunning data visualizations. It was developed in order to make it easier to create the most common types…

This is the fifth part in my series teaching the basics of SQL. As a reminder, in parts 1–4 we went over a few of the most common keywords used to query a database. We will be using many of those keywords here, such as GROUP BY, WHERE, and the different types of JOINs. If you need a refresher, please follow the links at the bottom of this page to review those keywords. …

We have seen the importance of the pandas library and its impact on data manipulation and visualization for data scientists. We can also use pandas to explore our dataframes by grouping them by certain variables and then computing summary statistics on each group. In this blog we will go over aggregating data as well as computing summary statistics on our data. First we will dive into summary statistics.
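As a minimal sketch of both ideas, assuming pandas is installed: the DataFrame below, with its `team` and `score` columns, is hypothetical sample data. The first call summarizes a whole column, and the second groups the rows by a variable before summarizing each group.

```python
import pandas as pd

# Hypothetical sample data: five scores split across two teams.
df = pd.DataFrame({
    "team": ["A", "A", "B", "B", "B"],
    "score": [10, 20, 30, 40, 50],
})

# Summary statistics for the whole column.
overall = df["score"].agg(["mean", "min", "max"])
print(overall)

# The same statistics, computed separately for each team.
by_team = df.groupby("team")["score"].agg(["mean", "min", "max"])
print(by_team)
```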

Summary statistics, as follows from their name, are numbers that summarize and tell you about your dataset. For example, mean, median, minimum, maximum, and standard deviation are summary statistics. Calculating summary statistics allows…

Pandas is an immensely useful Python package used mainly for data manipulation, but it can also be used for data visualization. As we discussed in a previous article, “A Pandas Primer”, Pandas relies on two essential packages: it is built on NumPy, which allows for easy and quick manipulation of our data, and it uses Matplotlib for its plotting methods. One of the more important parts of the Pandas package that we need to get familiar with is the dataframe.

In data science we have to store our data in some form that is easy to work with…