Before we begin this tutorial on how to parse data with Python, please download this machine learning data file so that you can follow along.
The data set provided at the above link mirrors exactly how the web pages looked at the time we visited them. The interesting part is that we do not even need to visit the pages: we have the full HTML source code, so it is just like parsing the website without the annoying bandwidth use.
The first thing to do is to match a date to each piece of data; after that, we will pull the actual data.
Here is how we start:
import pandas as pd
import os
import time
from datetime import datetime

path = "X:/Backups/intraQuarter"
As shown above, we import pandas for the Pandas module, os so that we can interact with directories, and time and datetime for managing date and time information.
Finally, we define the path, which points to the intraQuarter folder that you get when you unzip the zip file you just downloaded.
def Key_Stats(gather="Total Debt/Equity (mrq)"):
    statspath = path+'/_KeyStats'
    # os.walk yields (dirpath, dirnames, filenames) tuples; keep only dirpath
    stock_list = [x[0] for x in os.walk(statspath)]
    #print(stock_list)
We begin our function by specifying that we are going to collect all the Debt/Equity values.
statspath is the path to the stats directory.
stock_list is a quick one-liner for loop (a list comprehension) that uses os.walk to list what is inside that directory.
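If you want to see what os.walk hands back before pointing it at the real data, here is a small sketch that runs against a temporary folder. The list_dirs helper and the ticker names "aaa" and "bbb" are made up for illustration; they are not part of the data set.

```python
import os
import tempfile

def list_dirs(root):
    """Return the directory paths that os.walk visits under root, root first."""
    # os.walk yields (dirpath, dirnames, filenames) tuples; keep only dirpath.
    return [dirpath for dirpath, _dirnames, _filenames in os.walk(root)]

# Build a tiny stand-in for the stats folder (hypothetical ticker names).
with tempfile.TemporaryDirectory() as root:
    for ticker in ("aaa", "bbb"):
        os.mkdir(os.path.join(root, ticker))
    dirs = list_dirs(root)

# The first entry is the root folder itself, so slice it off.
tickers = sorted(os.path.basename(d) for d in dirs[1:])
print(tickers)  # ['aaa', 'bbb']
```

The order in which os.walk visits sibling folders is not guaranteed, which is why the names are sorted before printing.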
Then the next step is to do this:
    for each_dir in stock_list[1:]:
        each_file = os.listdir(each_dir)
        if len(each_file) > 0:
Above, we cycle through every directory in stock_list, one per stock ticker; the [1:] slice skips the first entry, which is the stats directory itself. Next we build each_file, a list of all the files within that stock's directory. We only proceed if the length of each_file is greater than 0, because some stocks have no files or data. For the stocks that do have files, we continue like this:
            for file in each_file:
                date_stamp = datetime.strptime(file, '%Y%m%d%H%M%S.html')
                unix_time = time.mktime(date_stamp.timetuple())
                print(date_stamp, unix_time)
                #time.sleep(15)

Key_Stats()
Finally, we run a loop that pulls the date_stamp from each file name. All our files are stored under their ticker, with a file name that gives the exact date and time at which the information was captured.
From there, we tell datetime what the format of our date stamp is, and then we convert it to a Unix timestamp.
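To try the whole walk-then-parse loop without the real download, the sketch below rebuilds it against a temporary folder. It is only an illustration under a few assumptions: the collect_stamps helper, the ticker names, and the two file names are invented, and the function returns the timestamps in a dict instead of printing them as the tutorial code does.

```python
import os
import tempfile
import time
from datetime import datetime

def collect_stamps(statspath):
    """Map each non-empty ticker folder name to the Unix times of its files."""
    stamps = {}
    # Skip the first os.walk entry, which is statspath itself.
    for each_dir in [x[0] for x in os.walk(statspath)][1:]:
        each_file = os.listdir(each_dir)
        if len(each_file) == 0:
            continue  # some tickers have no saved pages
        ticker = os.path.basename(each_dir)
        stamps[ticker] = []
        for file in sorted(each_file):
            # The literal '.html' in the format must match the file name exactly.
            date_stamp = datetime.strptime(file, "%Y%m%d%H%M%S.html")
            # mktime reads the time tuple as local time.
            stamps[ticker].append(time.mktime(date_stamp.timetuple()))
    return stamps

# Hypothetical layout: one ticker with two saved pages, one ticker with none.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "aaa"))
    os.mkdir(os.path.join(root, "bbb"))  # left empty on purpose
    for name in ("20040130190102.html", "20040615120000.html"):
        open(os.path.join(root, "aaa", name), "w").close()
    result = collect_stamps(root)

print(sorted(result))  # ['aaa']
```

Because time.mktime works in local time, the exact Unix values depend on the time zone of the machine running the code, so none are shown here. A file name that does not match the format makes strptime raise a ValueError.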
To know more about data parsing or anything else in Python, learn Machine Learning Using Python with the experts at DexLab Analytics.
This post originally appeared on – pythonprogramming.net/parsing-data-website-machine-learning