Importing files is an essential skill for any budding data scientist. You will find yourself working with many different types of files, such as .txt, .csv, Excel spreadsheets, Stata, and MATLAB files. You need to be able to import any type of file thrown your way so that you can work with that data in your environment.
The first type of file we will learn to import is the basic text file, which can be broadly classified into two types. The first contains plain text, such as the words.txt file, which is taken from MIT’s Introduction to Computer Science and Programming in Python course, where it serves as the dictionary for an interactive hangman game built in Python. The second contains records, or tabular data, such as the kings_county.csv file, where each row is a house listing and each column is a characteristic of the house, such as square footage, number of bedrooms, number of bathrooms, and so on. This last type of file is called a flat file, and we will learn the importance of these later on.
Importing Plain Text Files
To open any plain text file, we can use Python’s built-in open() function to start a connection to the file. To do this, we assign the filename to a variable and pass that variable to open(). We also pass the argument mode='r', which ensures that we only read the file and do not write to it. We then apply the read() method to the resulting file object and assign the text of the file to another variable. Once we are done, we also need to close the connection to the file using the .close() method. An example of this code can be seen below:
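The steps above can be sketched roughly as follows. So that the snippet runs anywhere, it first writes a tiny stand-in words.txt into a temporary folder; the three words in it are made up for illustration and are not the contents of the MIT file. The variable names are also just one possible choice.

```python
import tempfile
from pathlib import Path

# Assumption: a small stand-in for words.txt with made-up contents.
filename = Path(tempfile.mkdtemp()) / "words.txt"
filename.write_text("apple banana cherry\n")

# Open a read-only connection to the file.
file = open(filename, mode="r")

# Read the entire contents of the file into a single string.
text = file.read()

# Close the connection once we are done with it.
file.close()
```

With your own file you would skip the stand-in step and simply set filename = 'words.txt'.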
Now we can simply print out the contents of our file with the print() function. If we are working with a very large file, we may not want to print all of the text at once. We can use the readline() method to read the file line by line so we don’t have to deal with a huge wall of text. We only need to make a slight alteration to the code above to do this, which you can see below.
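A minimal sketch of the line-by-line version might look like this. As before, the stand-in file and its three lines are assumptions made only so the example is self-contained.

```python
import tempfile
from pathlib import Path

# Assumption: a small stand-in for words.txt with made-up lines.
filename = Path(tempfile.mkdtemp()) / "words.txt"
filename.write_text("first line\nsecond line\nthird line\n")

file = open(filename, mode="r")

# Each call to readline() returns the next line, including its trailing newline.
first_line = file.readline()
second_line = file.readline()

file.close()

print(first_line)
print(second_line)
```

Because each line keeps its newline character, print() leaves a blank line after each one; calling .strip() on the line is one way to avoid that.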
The above code block will only print the first two lines of our words.txt file. We can print as many lines as we want as long as we continue to use the readline() method.
Importing Flat Files
Now that we have gone over how to import plain text files, we will start to look at flat files, such as the Kings County housing dataset, kings_county.csv. As a reminder, each row in this dataset is the sale of a house, and each column is a characteristic of the house, such as number of bedrooms, number of bathrooms, and square footage, as well as many others. An example of this file can be seen below:
Flat files most often have a header, which is a line at the top of the file that describes the contents of each column. Always take note of whether your file has a header, as this slightly alters the way we import the data. You may have noticed that the extension of the file we are using is .csv, which stands for comma-separated values. As we can see from the above image, each row contains multiple entries separated by commas. This comma is called a delimiter. Almost any special character can serve as a delimiter; tabs and slashes appear often, but the comma is the most common.
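One quick way to check for a header and confirm the delimiter is to read the first couple of lines with the same open() and readline() tools from earlier. The sketch below assumes a tiny stand-in file; the column names and values are hypothetical and only loosely modeled on the housing data described above.

```python
import tempfile
from pathlib import Path

# Assumption: a tiny stand-in for kings_county.csv with hypothetical columns.
filename = Path(tempfile.mkdtemp()) / "kings_county.csv"
filename.write_text("bedrooms,bathrooms,sqft_living\n3,1.0,1180\n3,2.25,2570\n")

file = open(filename, mode="r")
header_line = file.readline().strip()   # first line: the header
first_record = file.readline().strip()  # second line: the first row of data
file.close()

# Splitting on the delimiter recovers the individual entries.
column_names = header_line.split(",")
values = first_record.split(",")

print(column_names)
print(values)
```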
We can import flat files in two different ways: with the numpy package or with the pandas package. Importing with numpy works best if you have numerical values and want to store them as a numpy array, while importing with pandas is best for storing values in a DataFrame. We will start by going over how to import a flat file with numpy.
To import a flat file with numpy, we must first import numpy and alias it as np. Next, as we did with the plain text file, we store the file name as a variable (remember, our file is named kings_county.csv). We then call numpy’s loadtxt() function and pass in our file name as well as the delimiter used in our file. Additionally, if our file has a header, we use the skiprows argument to skip the first row of our file. An example of the above process is given below:
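Here is a rough sketch of that process. It assumes every column in the file is numeric, since loadtxt() converts values to floats by default and will fail on text columns. The stand-in file and its column names are hypothetical.

```python
import tempfile
from pathlib import Path

import numpy as np

# Assumption: a tiny, all-numeric stand-in for kings_county.csv.
filename = Path(tempfile.mkdtemp()) / "kings_county.csv"
filename.write_text("bedrooms,bathrooms,sqft_living\n3,1.0,1180\n3,2.25,2570\n2,1.0,770\n")

# delimiter="," splits each row on commas; skiprows=1 skips the header line.
data = np.loadtxt(filename, delimiter=",", skiprows=1)

print(data.shape)
print(data)
```

The result is a two-dimensional numpy array with one row per house and one column per characteristic.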
We will now go over how to import files with the pandas package. Pandas is without a doubt the best way to import flat files if we want to perform basic data science operations on our dataset. Pandas allows us to manipulate, reshape, group, join, merge, compute statistics on, and visualize our dataset, along with a whole bunch of other things. If you need to perform any type of analysis or modeling with your data, I highly recommend utilizing the pandas library. Pandas is also very easy to work with: importing our file is as easy as calling the read_csv() function. To do so, we first import the pandas library aliased as pd. Then we use read_csv() to import our file. If needed, we can change the delimiter, which defaults to a comma. Also, if our file does not have a header, we must set the header argument to None; otherwise pandas treats the first row of data as the column names.
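Both cases can be sketched as follows. The two stand-in files, their column names, and the tab-separated file name are hypothetical and exist only to keep the example self-contained. The delimiter is passed here through the sep argument of read_csv().

```python
import tempfile
from pathlib import Path

import pandas as pd

folder = Path(tempfile.mkdtemp())

# Assumption: a tiny stand-in for kings_county.csv with a header row.
with_header = folder / "kings_county.csv"
with_header.write_text("bedrooms,bathrooms,sqft_living\n3,1.0,1180\n3,2.25,2570\n")

# Default behavior: comma delimiter, first row used as column names.
df = pd.read_csv(with_header)
print(df.head())

# Assumption: a hypothetical tab-delimited file with no header row.
no_header = folder / "kings_county_no_header.tsv"
no_header.write_text("3\t1.0\t1180\n3\t2.25\t2570\n")

# sep changes the delimiter; header=None keeps the first row as data.
df_no_header = pd.read_csv(no_header, sep="\t", header=None)
print(df_no_header.head())
```

When header=None is used, pandas labels the columns with integers starting at 0.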
Congratulations! You have now learned how to import basic text files and flat files using the numpy and pandas libraries. If you find yourself needing to import other files, such as Excel spreadsheets, I recommend looking further into the pandas documentation, as it most likely has a function that will easily import that file.