Crash Course — Python & Pandas for Trading and Investing (Part 2)

Conditional distributions, custom candlestick plots, & visualization tricks

Mitchell Rosenthal
15 min read · Jun 5, 2021

In this crash course, you’ll learn how to:

  • Test directional signals for statistical significance
  • Make candlestick plots from scratch & visualize signals with shading
  • Import data from free sources & properly index it

Make sure to go through the first part of this course so you’re up to speed. You can grab the notebook for this crash course here.

For even more advanced analysis, be sure to check out part 3!

Let’s dive in.

Signal Testing — Conditional Distributions

One reasonable way to test a trading signal is to look at the distribution of future returns (or some other future variable of interest) for the entire population, and compare that to the distribution of future returns when the trading signal is on.

If a signal has value, we should see a noticeably different distribution of future returns when it is active compared to the general population of future returns.

If we want to optimize our signal parameters, we need to do so only during the “training” portion of our data set (for time series, usually the earliest 70% of the time period) and then see the signal’s performance in the remaining sample. However, this is not as important if we did not optimize our signal parameters.

To check if the distributions are significantly different, we can calculate the Kolmogorov–Smirnov (KS) statistic using scipy.stats, which quantifies the difference between two distributions. In addition, we can look at the 20th, 50th, and 80th percentile values of our target variable (usually future returns) for the general population, as well as for the filtered population. A valuable signal should cause a noticeable change in these percentile values.
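To make this concrete, here is a minimal, self-contained sketch of the mechanics on synthetic data (illustration only; we'll run the real analysis on our dataframe further below):

### Minimal sketch: KS statistic + percentile comparison on synthetic samples
import numpy as np
import scipy.stats as stats
rng = np.random.default_rng(0)
pop = rng.normal(loc=0.000, scale=0.01, size=5000) # "population" future returns
filt = rng.normal(loc=0.002, scale=0.01, size=200) # future returns when a (made-up) signal is on
ks_stat, pval = stats.ks_2samp(pop, filt)
print(f'KS: {ks_stat:.3f}, p-value: {pval:.1e}')
for q in [0.20, 0.50, 0.80]:
    print(f'{q:.0%}ile  pop: {np.quantile(pop, q):+.4f}  filt: {np.quantile(filt, q):+.4f}')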

Overall, we want a plot like the one below. It shows the population’s distribution of future returns (in blue), the distribution of future returns when the signal is active (in orange), the number of observations in the orange distribution, and the KS statistic. Its vertical lines show the percentiles of the population vs the “signal active” periods; you can list the percentiles of interest. Finally, it allows us to break our timespan into multiple, evenly-sized chunks (in the example below, two chunks). That way, we can see how the strength of the signal has changed over time.

We’ll assume that our dataframe is formatted correctly: it has a datetime index sorted in ascending order and contains the “Open”, “High”, “Low”, and “Close” columns. Let’s get started by defining the parameters of our function.

Before we go further, we should import some useful tools and packages.

from matplotlib.ticker import FormatStrFormatter
import matplotlib
from itertools import permutations
import scipy.stats as stats
from scipy.stats import skew, kurtosis
import matplotlib.pyplot as plt
from datetime import timedelta
from datetime import datetime
from datetime import date
from cycler import cycler
import matplotlib as mpl
import pandas as pd
import numpy as np
import seaborn as sns
import statistics
import random
import math
import re
sns.set()

Real-World Example: Testing Closes Below a Trend Line

Let’s create a random time series and run some tests on it. We’ll use the same method we used in the prior crash course.

##### Generating simulated stock price
our_index = pd.date_range(start='2021-01-01', periods=25000*24*2, freq='30T') # create an index, 30min time increments
df1 = pd.DataFrame(index = our_index)
R_old = 0.0002
stdev_old = 0.008
R_new = (1+R_old)**(1/48)-1
stdev_new = (stdev_old)*(1**0.5)*((1/48)**0.5)
### Get list of decimal values.
# These are the log changes of each increment
df1['Close'] = np.random.normal(loc=R_new, scale=stdev_new, size=25000*24*2)
df1['Close'] = np.exp(df1['Close'])
df1.iloc[0, df1.columns.get_loc('Close')] = 1 # set the starting price to 1 (avoids chained-assignment warnings)
df1['Close'] = df1['Close'].cumprod()
### Summarize intraday simulated moves as daily increments, OHLC.
# Only keep rows that take place betw 930am and 4pm
df1 = df1.between_time('09:30', '16:00') # Downsample - summarize those rows as daily data
df1 = df1['Close'].resample('1D').ohlc()
df1.rename(columns={'open':'Open','high':'High','low':'Low','close':'Close'}, inplace=True)
df1.head(5) # show first 5 rows

Let’s make a quick plot of the data to see if it looks natural. We’ll create one large plot of the Close price, and a plot below it that is only 30% as tall. The bottom plot will show the distance between the Close price and its 30-increment average. Using “gridspec” makes these plotting tasks pretty easy.

## Making columns to plot
i = 30 # lookback for moving average
name_ma = f'{i}ma'
name_madist = f'{i}madist'
df1[name_ma] = df1['Close'].rolling(i).mean()
df1[name_madist] = (df1['Close']/df1[name_ma]) - 1
df1.head(5) # show first 5 rows
#### plotting
from matplotlib import gridspec
fig = plt.figure(figsize=(9,6))
gs = gridspec.GridSpec(2, 1, height_ratios=[1, 0.3])
# 2 rows, one column. Second plot is 30% as tall as first.
ax1 = plt.subplot(gs[0])
ax2 = plt.subplot(gs[1])
dfplot = df1.head(500)
dfplot[['Close',name_ma]].plot(ax=ax1, logy=True, alpha=0.8)
## plot dist from ma as several circular dots connected by a line
dfplot[[name_madist]].plot(ax=ax2, style=['.-'], markevery=30, logy=False, alpha=0.7, color='tab:purple', sharex=True)
## plot dist from ma as an area under a curve.
dfplot[[name_madist]].plot.area(ax=ax2, stacked=False, logy=False, alpha=0.5, color='tab:purple', sharex=True)
ax1.legend(loc='upper left')
ax2.legend(loc='upper left')
import matplotlib.ticker
ax2.yaxis.set_major_locator(matplotlib.ticker.MultipleLocator(0.05))
plt.tight_layout()

Here’s what this looks like.

Now, back to testing our signal. Suppose the signal of interest occurs when the Close column falls below its average over the past 50 increments. First, we need to create a column that is “1” when our signal occurs, and 0 otherwise.

The first step is to create our trend line column. Its name will be a string that we store in a variable called name_trendline. Depending on our lookback parameter (in this case 50), our name will change. We use an f-string so that the name changes if we change the variable “i”.

Then, we make a condition that looks for our signal of interest and avoids overlapping readings.

### Testing a trend line signal - making key columns
i = 50 # lookback param, for finding avg over past 50 increments
name_trendline = f'{i}EMA'
df1[name_trendline] = df1['Close'].ewm(span=i, min_periods=i).mean()
### Can also create simple moving average (shown below)
# name_trendline = f'{i}SMA'
# df1[name_trendline] = df1['Close'].rolling(i).mean()
futparam = 20 # looking at returns or vol over next 20 increments.
# Observations must be 20 days apart (avoid overlap)
### Signal is active if recent increment's close < trendline today
### *and* prior increment's close was >= trendline
condition_on = (df1['Close'] < df1[name_trendline]) & (df1['Close'].shift(1) >= df1[name_trendline].shift(1))
### Need to avoid overlapping observations of future variable
# Note, doing condition*1 creates a series (column)
# In that series, it is 1 when condition is True, 0 otherwise
condition_nooverlap = ((condition_on)*1).rolling(futparam).sum()==1
condition_signal = (condition_on & condition_nooverlap)
df1['Signal'] = condition_signal*1 # column is 1 when condition_signal is true, 0 otherwise

For signals which do not occur at predictable intervals, we need to account for the fact that there will be a lag between when we see the signal and when we can act on it. If, at the end of increment 3, the Close crosses below the trend line, we cannot transact until the start (Open) of increment 4. To account for this delay, we can shift our signal column by 1. Note that for seasonality trades (buying when a certain month or hour begins), we don’t need to worry about a lag.

df1['Signal'] = df1['Signal'].shift(1)

Quick tip: for intraday data, we should look at increments that are at least 1 hour long, and we should use TWAP (time-weighted average price) instead of “Close”, because it is hard to assess the precision of OHLC readings for smaller increments. Doing this helps reduce the odds that you are fooled by measurement noise. During backtests, we will include the effect of slippage (transacting at sub-optimal prices), but for the initial exploration we will just test the raw signal.
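As an aside, a simple daily TWAP proxy is just the average of the intraday closes within each session. Here is a minimal sketch, where df_intraday is a hypothetical handle to the 30-minute bars we had before resampling to daily OHLC:

### Minimal sketch: daily TWAP proxy from intraday closes
# df_intraday is assumed to hold the 30-minute rows from before the daily resample
df_daily = df_intraday['Close'].resample('1D').ohlc()
df_daily.columns = ['Open','High','Low','Close']
df_daily['TWAP'] = df_intraday['Close'].resample('1D').mean() # avg of intraday closes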

Now, time to generate the variable of interest: in this case, future returns.

### Creating dependent variable of interest
j = 20 # looking 20 increments ahead
name_yvar = f'{j}futret'
df1[name_yvar] = df1['Close'].pct_change(j).shift(-j)

If we want to focus on only the first 65% of our sample, we can select that portion with the code below.

### looking only at first 65% of data 
### (may not be necessary if we did not try to optimize parameters)
train_portion = 0.65 # look at first 65% of sample
# Find row number where training portion ends
train_portion_end = int((len(df1)*100*train_portion)//100)
df0 = df1.copy().iloc[:train_portion_end, :]
# keep rows from row number 0 to train_portion_end
print(df1.tail())
print(df0.tail())

Next, we need to specify if we want to divide our overall sample into subperiods, aka “chunks”, which allows us to test how stable the performance of the signal is throughout the overall timespan. When plotting multiple chunks, we will create a grid of plots, and we need to specify how many columns we want that grid to have.

### Creating variable to plot
yvar = name_yvar
df = df0.copy()
df[f'{yvar}_pop'] = df[yvar]
df[f'{yvar}_filt'] = df.loc[df['Signal'] == 1, yvar]
# use the lagged Signal column; where the signal is not on, yvar is NaN.
### creating grid
nchunks = 2
fig_cols = 3 # dimension of grid of subplots (number of cols)
# number of rows needed to stay neat
fig_rows = math.ceil(nchunks / fig_cols)
fig_height = fig_rows * 5.1
# each grid row has a height of 5.1 so figures are modestly sized
fig_width = fig_cols * 7.1
# Create a figure that contains our grid of plots
fig = plt.figure(figsize=(fig_width, fig_height))
nrows = math.ceil(len(df) / nchunks)
# create a list, each element is a subperiod of the main df
list_df = [df.copy()[i:i+nrows] for i in range(0, df.shape[0], nrows)]

We will want to draw vertical lines that show the percentiles of the population vs the “signal active” periods, and to be able to list the percentiles of interest. Let’s define that list.

list_percentiles = [0.35, 0.50, 0.65]

Now we can loop through the list of dataframes that we’ll use to test and plot the signal’s performance. Each dataframe represents a subperiod of interest (such as the first third of the overall time period). One way to loop through a list is using “enumerate.” It allows us to keep track of what item number we are currently on.

For each subperiod that we plot, we want to keep track of when it starts and stops. We can store the date information as strings, and then use them in our plot later on. We also create a subplot in a specific position in the grid; add_subplot handles this for us. If we have 3 columns and are on the 5th list element, it will automatically add it on the second row in the second column.

for index, element in enumerate(list_df, start=1):
    # with start=1, the first element is numbered 1 rather than the default 0.
    dfz = element.copy() # element is a df in the list
    startstring = str(dfz.index[0].year)+'-'+str(dfz.index[0].month)+'-'+str(dfz.index[0].day)
    endstring = str(dfz.index[-1].year)+'-'+str(dfz.index[-1].month)+'-'+str(dfz.index[-1].day)
    ax = fig.add_subplot(fig_rows, fig_cols, index)

Within the same loop, we want to get a count of the number of y variable observations in the population as opposed to when the signal is active.

    num_popobservs = dfz[f'{yvar}_pop'].count()
    num_filtobservs = dfz[f'{yvar}_filt'].count()
    nbins = 20

To assess the difference between the population distribution and the filtered distribution, we can calculate the KS statistic and its associated p-value. It’s also useful to calculate the higher-order features, aka “moments”, of the distributions. These include the mean, variance, skewness, and kurtosis. We are still working within the same loop as before.

    # import scipy.stats as stats # in case you haven't imported yet
    ### KS test to see if dists are different
    # Drop NaNs first so the tests only see actual observations
    KSi, pval_KS = stats.ks_2samp(dfz[f'{yvar}_pop'].dropna(), dfz[f'{yvar}_filt'].dropna())
    # Anderson-Darling test to see if dists are different
    ADi, _, pval_AD = stats.anderson_ksamp((dfz[f'{yvar}_pop'].dropna(), dfz[f'{yvar}_filt'].dropna()))
    pop_mean = dfz[f'{yvar}_pop'].mean()
    pop_var = dfz[f'{yvar}_pop'].var()
    pop_skew = skew(dfz[f'{yvar}_pop'], nan_policy='omit')
    pop_kurtosis = kurtosis(dfz[f'{yvar}_pop'], fisher=False, nan_policy='omit')
    filt_mean = dfz[f'{yvar}_filt'].mean()
    filt_var = dfz[f'{yvar}_filt'].var()
    filt_skew = skew(dfz[f'{yvar}_filt'], nan_policy='omit')
    filt_kurtosis = kurtosis(dfz[f'{yvar}_filt'], fisher=False, nan_policy='omit')
    title_moments = f'''FiltVsPop- mean:{filt_mean:.2f},{pop_mean:.2f}, var:{filt_var:.1e},{pop_var:.1e},
FiltVsPop- skew:{filt_skew:.2f},{pop_skew:.2f}, kurtosis:{filt_kurtosis:.2f},{pop_kurtosis:.2f}'''

Within the same loop, we will create a sub loop to find each percentile of interest, and build a string to store the results. We will plot a vertical line in blue to represent the general population’s percentile, and in red to represent the filtered population’s percentile.

    titleQ_text = 'Qchng(Filt-Pop): '
    for num, Q in enumerate(list_percentiles, start=0):
        ypop_Q = dfz[f'{yvar}_pop'].quantile(Q)
        yfilt_Q = dfz[f'{yvar}_filt'].quantile(Q)
        changeQ = yfilt_Q - ypop_Q
        titleQ_text = titleQ_text + f' {Q*100:.0f}%ile:{changeQ*100:.2f}%,'
        ax.axvline(ypop_Q, linewidth=0.6, alpha=0.5, color='blue', linestyle='--')
        ax.axvline(yfilt_Q, linewidth=0.9, alpha=0.7, color='orangered', linestyle='-')

That’s it for the sub loop. Now we continue with the first loop we started. The last step is to plot and add a multi-line caption that contains the key information. For a slightly cleaner visual, we will not plot observations that are in the bottom or top 1% of the distributions. For numbers that might be very large, to save space we will display them in scientific notation by using “:.1e” inside the f-string (this means we want scientific notation and 1 decimal place shown).

    cut_upper = dfz[f'{yvar}_pop'].quantile(0.99)
    cut_lower = dfz[f'{yvar}_pop'].quantile(0.01)
    dfz = dfz[(dfz[f'{yvar}_pop'] < cut_upper) & (dfz[f'{yvar}_pop'] > cut_lower)]
    dfz[[f'{yvar}_pop', f'{yvar}_filt']].plot.hist(alpha=0.6, density=1, bins=nbins, ax=ax, title=titleQ_text)
    ax.set_xlabel(f'''Dates:{str(startstring)} to {str(endstring)}, #observs(Pop,Filt): {num_popobservs},{num_filtobservs},
pval_KS:{pval_KS:.0e}, KS:{KSi:.1e},
pval_AD:{pval_AD:.0e}, AD:{ADi:.1e},
{title_moments}''')
plt.tight_layout()

And that’s it. Now let’s give it a try. We should get something that looks like this.

The left plot shows us the first half of the timespan we investigated. As you can see from the title of that plot, the filtered distribution has a median (50th percentile) that is 0.0034 higher than the population’s. The KS statistic is fairly high, at around 0.98 (close to the maximum of 1), and its p-value is essentially zero, so we can reject the hypothesis that the filtered observations come from the same distribution as the population. The filtered distribution is negatively skewed (it has a longer left tail, with most of its mass to the right), which helps push its median above the population’s; its skew is -0.08, while the population’s skew is +0.2.

Making Candlestick Charts & Visualizing Signals with Shading

Third party tools that help you draw candlesticks can be helpful, but I find they don’t give you full control of the appearance and settings. To be able to freely customize the chart, I decided to try building candlesticks from scratch. It turns out, doing so is fairly simple.

First, we assume we have a dataframe with a datetime index, and with columns including Open, High, Low, and Close. We need to note how many minutes are in between each observation (for daily, say 24*60), and we need to state what axis to populate with the candlesticks. We also need to state what our “up” and “down” candle colors will be.

Let’s define a function that contains these parameters. We will also specify default values for the color_up and color_down parameters. We refer to our dataframe as “data”.

def draw_candlesticks(axis, data, MinutesPerCandle, color_up='cornflowerblue', color_down='gold'):

Inside this function, we plot every single row of data as a candlestick, which consists of a rectangle plus a vertical line. Thus, we need to iterate over every observation. To do this, we will use “enumerate” to go over each of the dataframe’s index values. To access the “Close” value for each observation, we would say data.loc[index]['Close']. First, we need to check whether that observation is an “up” increment or a “down” increment, and set the candlestick color accordingly.

    from matplotlib.patches import Rectangle
    import matplotlib.dates as mdates
    for num, index in enumerate(data.index):
        if data.loc[index]['Close'] > data.loc[index]['Open']:
            color = color_up
        else: # Close <= Open
            color = color_down

We want each candlestick to be centered over its start time. Suppose we are working with 30-minute increments. The row whose index is 2021-03-01 12:00 refers to March 1st, 2021, from 12:00 to 12:30. 12:00 is the start time of that row, and our candlestick will be centered over that value.

To do this, we have the rectangle start before 12:00 and extend a bit past 12:00. One simple way to do this is to divide the time increment (30 minutes) by 3; we subtract that number of minutes from our start time (12:00) to get the start of the rectangle, and add the same amount to get its end. Inside the loop, we can say the following:

        start = index - pd.DateOffset(hours=0, minutes=1*MinutesPerCandle//3)
        end = index + pd.DateOffset(hours=0, minutes=1*MinutesPerCandle//3)
        width = end - start
        height = data.loc[index]['Close'] - data.loc[index]['Open']

To plot the rectangle, we specify its bottom left corner, then its width and height. To plot the vertical line, we use “vlines” and specify the index (datetime index), the minimum y value, and the maximum y value. To make sure the line is displayed *behind* the rectangle, we give it a lower value of “zorder.” We also add two more lines: one to format the datetime labels on the x-axis, and one to adjust the rotation of the xticks. So we finish the function like this:

        rect = Rectangle((start, data.loc[index]['Open']), width=width, height=height, color=color, zorder=3)
        axis.add_patch(rect)
        axis.vlines(index, ymin=data.loc[index]['Low'], ymax=data.loc[index]['High'], linewidth=0.5, color='black', zorder=2)
        axis.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m-%d %H:%M"))
    plt.xticks(rotation=75)

And that’s it! Now let’s try plotting a grid of candlestick plots.

### creating grid
df3 = df1.copy().head(200)
nchunks = 6
fig_cols = 3 # dimension of grid of subplots (number of cols)
# number of rows needed to stay neat
fig_rows = math.ceil(nchunks / fig_cols)
fig_height = fig_rows * 5.1
# each grid row has a height of 5.1 so figures are modestly sized
fig_width = fig_cols * 7.1
# Create a figure that contains our grid of plots
fig = plt.figure(figsize=(fig_width, fig_height))
# create a list, each element is a subperiod of the main df
nrows = math.ceil(len(df3) / nchunks)
list_df = [df3.copy()[i:i+nrows] for i in range(0, df3.shape[0], nrows)]
for index, element in enumerate(list_df, start=1):
    dfz = element.copy() # element is a df in the list
    ax = fig.add_subplot(fig_rows, fig_cols, index)
    MinutesPerCandle = 24*60
    draw_candlesticks(axis=ax, data=dfz[['Open','Close','High','Low']], MinutesPerCandle=MinutesPerCandle, color_up='cornflowerblue', color_down='gold')
plt.tight_layout()

The result should look something like this:

We can also shade certain areas on each subplot that satisfy a condition of interest. Here’s how we shade regions where the Close is above its 5-day average in green. Within our loop, we add:

    condition_fill = (dfz['Close'] > dfz['Close'].rolling(5).mean())
    ax.fill_between(dfz.index, dfz['Low'].min(), dfz['High'].max(), where=condition_fill, alpha=0.2, color='green')

The result should look like this:

Importing & Formatting Data from Free Sources

We’re fortunate to be able to download a lot of historical financial time series data for free. Investing dot com is a particularly good resource for most publicly-traded U.S. stocks, as well as non-tradeable market indices (such as the VIX).

When we download historical data for U.S. equities from investing dot com, the formatting tends to be pretty consistent. Once we download that csv file, we can read it into a dataframe using the code below. It formats the “Date” column and uses it to set a datetimeindex. Each daily observation has a datetimeindex and is marked at 16:00, or 4 P.M., which is when the U.S. market closes.

def data_from_investing(url):
    name = pd.read_csv(url)
    if 'Vol.' in name.columns:
        name.rename(columns={"Price": "Close", 'Vol.': 'Volume'}, inplace=True)
        name = name[['Date','Open','High','Low','Close','Volume']]
    else:
        name.rename(columns={"Price": "Close"}, inplace=True)
        name = name[['Date','Open','High','Low','Close']]
    name['Date'] = pd.to_datetime(name['Date'])
    name.sort_values(by=['Date'], inplace=True, ascending=True)
    name['Close'] = name['Close'].astype(str).str.replace(',', '').astype(float)
    name['Open'] = name['Open'].astype(str).str.replace(',', '').astype(float)
    name['High'] = name['High'].astype(str).str.replace(',', '').astype(float)
    name['Low'] = name['Low'].astype(str).str.replace(',', '').astype(float)
    if 'Volume' in name.columns:
        name['Volume'] = name['Volume'].astype(str).str.replace('-', '0')
        name['Volume'] = name['Volume'].replace({'K': '*1e3', 'M': '*1e6', 'B': '*1e9'}, regex=True).map(pd.eval).astype(float)
    name['ChngOC'] = -1 + (name['Close']/name['Open'])
    name['ChngON'] = (name['Open']/name['Close'].shift(1)).add(-1)
    name['ChngPrevC'] = name['Close'].pct_change()
    name.set_index('Date', inplace=True)
    name = name.asfreq('D').dropna()
    name.index = name.index.tz_localize('US/Eastern', ambiguous=True) # add time zone information
    name.index = name.index + pd.DateOffset(hours=16, minutes=00) # mark each daily row at 4:00 pm New York time, the market close
    name.fillna(method='ffill', inplace=True) # fill NaNs due to index expansion with the most recent older row
    name.dropna(inplace=True)
    return name

To use this function, just download the historical data as a csv from investing dot com. Put that csv file in the same folder that contains your Python notebook, and pass the filename (e.g. “XYZ.csv”) as the “url” in the function above. For example, we can get historical data for Apple’s stock price here. The name of the file we download is “Apple Historical Data.csv”. This will be our “url”.

Once we download that data, we can create a dataframe like so.

url = 'Apple Historical Data.csv'
df = data_from_investing(url)
df.head()

When you request the head of your dataframe, it should look something like this (this example looks at XLK, the tech sector ETF):

Backtesting, Performance Analysis, & Sequence Sensitivity Testing

Once we have a signal that we believe has value, we can try to trade it. To do this, we need to build a function that can backtest our strategy, include slippage, and show the reward/risk profile (MFE/MAE) of each trade. It should produce a plot like the one below. To learn how to do this, you can check out part three of the crash course here; a rough sketch of the MFE/MAE idea follows.
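As a preview (a hedged sketch, not the part-three implementation): for a long trade, MFE is the best excursion in the trade’s favor and MAE is the worst excursion against it, measured over the bars held.

### Minimal sketch: MFE/MAE of a single long trade (illustration only)
# trade_bars: hypothetical OHLC rows from entry to exit; entry_price: the assumed fill price
def mfe_mae_long(trade_bars, entry_price):
    mfe = trade_bars['High'].max() / entry_price - 1 # max favorable excursion
    mae = trade_bars['Low'].min() / entry_price - 1 # max adverse excursion
    return mfe, mae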

We can also look at how two x-variables of interest interact: do high values of xvar1, paired with low values of xvar2, lead to higher future returns? In the plot below, the heights of the peaks show future returns; a sketch of the underlying tabulation follows.
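One simple way to tabulate that interaction (a sketch, assuming hypothetical signal columns 'xvar1' and 'xvar2' plus the future-return column from earlier) is to bin both variables and average future returns within each cell:

### Minimal sketch: mean future return per (xvar1, xvar2) bucket
# 'xvar1' and 'xvar2' are hypothetical signal columns; name_yvar is the future-return column
df['xvar1_bin'] = pd.qcut(df['xvar1'], 5) # quintile buckets
df['xvar2_bin'] = pd.qcut(df['xvar2'], 5)
surface = df.pivot_table(values=name_yvar, index='xvar1_bin', columns='xvar2_bin', aggfunc='mean')
print(surface) # the peaks in the 3D plot correspond to the largest cells here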

To get more clarity on a strategy’s behavior, we can analyze its performance by looking at the distribution of the trailing risk-adjusted returns experienced by the Benchmark and the Active portfolios.

We can also plot the rolling spread between the risk-adjusted performance of the Active portfolio and the Benchmark over time.
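A bare-bones version of both comparisons (a sketch, assuming ret_active and ret_bench are hypothetical daily return series for the two portfolios): divide each portfolio’s rolling mean return by its rolling volatility, then look at the two distributions and the spread between them.

### Minimal sketch: trailing risk-adjusted returns and their rolling spread
# ret_active and ret_bench are assumed daily return series (not built in this article)
w = 252 # ~1 year of trading days
ra_active = ret_active.rolling(w).mean() / ret_active.rolling(w).std()
ra_bench = ret_bench.rolling(w).mean() / ret_bench.rolling(w).std()
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 7))
pd.DataFrame({'Active': ra_active, 'Benchmark': ra_bench}).plot.hist(ax=ax1, alpha=0.6, bins=30, density=1)
(ra_active - ra_bench).plot(ax=ax2, title='Rolling risk-adjusted spread: Active minus Benchmark')
plt.tight_layout()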

If our strategy looks good, we should try to measure how sensitive it is to the order of its trades. A few good trades in the beginning can make a strategy’s long-term performance curve look impressive. We should try shuffling the order of trades hundreds or thousands of times, and plot the resulting equity curves to see how much of a difference the order of trades makes.
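A bare-bones version of that resequencing test (a sketch, where trade_returns is a hypothetical 1-D array of per-trade fractional returns):

### Minimal sketch: shuffle trade order, plot the family of equity curves
# trade_returns: hypothetical array of per-trade fractional returns
rng = np.random.default_rng(0)
fig, ax = plt.subplots(figsize=(9, 5))
for _ in range(500):
    shuffled = rng.permutation(trade_returns)
    ax.plot(np.cumprod(1 + shuffled), color='tab:blue', alpha=0.05, linewidth=0.7)
ax.plot(np.cumprod(1 + trade_returns), color='black', linewidth=1.5, label='actual order')
ax.legend()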

If you dig these, you can check out the complete walkthrough of the code in part three here.


Mitchell Rosenthal

B.S. in Fire Protection Engineering, Master of Quantitative Finance | Thoughts on Trading, Markets, Science, Stats | https://watchingrisk.substack.com/