By Vinesh Jha, CEO & founder of ExtractAlpha
In Part I and Part II, we talked about the advantages and pitfalls of systematic investment research. In this post, we’ll discuss some best practices we’ve picked up over the years and which we apply at ExtractAlpha.
Practice makes perfect
We all know how to get to Carnegie Hall, right? Practice, practice, practice. As a young music student, I learned the importance of repetition in gaining the necessary muscle memory needed to move beyond pure mechanics. After sufficient amounts of practice, a musician can start to think about phrasing, structure, and nuance rather than focusing on being able to play the notes.
It’s not so different with quant research. We have a set of analytical tools, and through their repeated use we are able to more fluidly answer deep research questions without expending as much time and energy on the mechanics, which, in this case, involve a lot of rote data analysis. Data manipulation is still a big part of the job – when we get a new dataset, we need to line it up and get familiar with it, like a musician sight-reading a new piece for the first time – but over time and with repetition we get better at it.
Checks and balances
Having a fantastic teacher or mentor helps, of course … in this regard, I was fortunate as a musician, but not in quant, where I’m largely an autodidact. And having a handbook or checklist, whether it’s written down or part of your routine, is also helpful. Here are a few of the things we do as part of our research routine at ExtractAlpha.
We start each research project with a list of hypotheses and ideas. This can be a long list – anything that’s sort of reasonable and which pertains to the questions at hand. We’ll brainstorm these ideas over the course of many days, often logging the ideas in a task management system. Then it’s time for triage, by sorting to the top those ideas we definitely want to test, and for which we have the resources – usually this means we have or can acquire the relevant data. In the middle go the “maybes,” and then we’ll throw out ideas that seem a bit too outlandish or for which data isn’t available.
Next, we pay careful attention to the datasets we’re analyzing. An often overlooked step is opening up the data in a spreadsheet-like format and scrolling through each of the fields (assuming we are talking about structured, tabular data). When we take this simple but crucial step, we’ll often notice things like odd conventions for storing null values, or suspiciously repeating values.
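As a minimal sketch of that kind of sanity scan, the snippet below uses pandas on a toy table (the column names, the -999 sentinel, and the repeated value are all hypothetical stand-ins for the quirks a quick scroll tends to surface):

```python
import pandas as pd

# Hypothetical raw vendor table with two common quirks: a -999 sentinel
# standing in for null, and a value that repeats suspiciously often
# (often a sign of stale or forward-filled data).
df = pd.DataFrame({
    "ticker":  ["AAA", "BBB", "CCC", "DDD", "EEE"],
    "revenue": [105.0, -999.0, 98.0, 98.0, 98.0],
})

# Flag likely sentinel nulls rather than trusting NaN handling blindly.
sentinel_rows = df[df["revenue"] == -999.0]

# Count repeated values; anything repeating more than a couple of times
# in a field that should vary is worth a closer look.
value_counts = df["revenue"].value_counts()
suspicious = value_counts[value_counts > 2]

print(len(sentinel_rows))          # rows using the -999 sentinel
print(suspicious.index.tolist())   # values that repeat suspiciously often
```

Nothing here is sophisticated – that’s the point. A few lines of inspection before any modeling catches problems that would otherwise silently corrupt every downstream result.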
We also look at the distribution of the fields of interest. Which are well populated? Which categorical variables’ values are common versus sparse? Which ones have outliers we need to take into consideration, for example by Winsorization?
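The Winsorization step mentioned above can be sketched in a few lines with numpy; the percentile cutoffs and the injected outlier are illustrative choices, not a prescription:

```python
import numpy as np

# Winsorize a factor at the 1st/99th percentiles so a handful of extreme
# observations can't dominate the analysis. Cutoffs are illustrative.
rng = np.random.default_rng(0)
factor = rng.normal(0.0, 1.0, size=1000)
factor[0] = 50.0  # inject a gross outlier

lo, hi = np.percentile(factor, [1, 99])
winsorized = np.clip(factor, lo, hi)

print(float(winsorized.max()) <= hi)  # extremes are capped at the cutoff
```

Clipping (rather than dropping) the tails keeps every observation in the cross section while bounding the influence of any single extreme value.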
Be strict about in- and out-of-sample testing. Now that we’ve got an understanding of the look and feel of the data, we can begin our in-sample hypothesis testing. For the majority of our research process, we will be in sample – that is, we will spend our time analyzing a pre-specified testing dataset, and the remainder of the data will remain out of sample, for verification at the very end of the process. It’s vital to be completely strict about in- and out-of-sample testing, as tempting as it might be to “peek” out of sample midway through your testing to ease your discomfort. Choosing an in-sample and out-of-sample split is also quite important. You want your out-of-sample period to be long enough to be meaningful, i.e., to encompass more than one type of market condition, but you also want your in-sample period to be representative of the current time period; things have changed in the last decade, as noted in the previous post. We have a few techniques we use to address these issues, and they are worth thinking about before beginning the research process.
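The simplest version of that discipline is a hard, date-based split chosen once, up front. The dates and cutoff below are hypothetical; the key property is that nothing past the cutoff is touched until final verification:

```python
import pandas as pd

# Illustrative strict time split: pick the cutoff before research begins,
# then never look past it until the very end of the process.
dates = pd.date_range("2005-01-01", "2020-12-31", freq="MS")  # month starts
cutoff = pd.Timestamp("2016-12-31")

in_sample  = dates[dates <= cutoff]   # where all hypothesis testing happens
out_sample = dates[dates > cutoff]    # reserved for one final verification

# The out-of-sample window should span more than one market regime,
# while the in-sample window stays representative of recent conditions.
print(len(in_sample), len(out_sample))
```

The mechanical split is trivial; the hard part is the organizational discipline of never running “just one quick check” on the held-out period.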
Use the right tool for the job. Hypothesis testing doesn’t necessarily mean portfolio backtesting. Event studies are helpful, and sometimes we’re trying to predict something other than returns; for example, models which predict company fundamentals such as revenues can often end up being more robust.
When it comes to backtests, we do a lot of cross-sectional tests. The goal is to come up with a score across a wide swath of stocks at each time period (say, day), and determine whether the high-scoring stocks outperform the low-scoring stocks. The advantage of this approach, versus say a trading simulation, is that we get a very rich set of information. We can learn how the factor performs across sectors, capitalization ranges, time, and other slices. Furthermore, the results are not as sensitive to the particular choice of portfolio construction parameters, and they are more indicative of how this factor might fit into an existing multifactor model.
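A toy version of such a cross-sectional test looks like the following. The data is synthetic – returns are built with a small, deliberate loading on the score – so the snippet only illustrates the mechanics of bucketing by score each period and measuring the top-minus-bottom spread:

```python
import numpy as np
import pandas as pd

# Synthetic panel: n_periods x n_stocks scores and forward returns.
rng = np.random.default_rng(42)
n_periods, n_stocks = 120, 200

scores = rng.normal(size=(n_periods, n_stocks))
# Returns carry a small positive loading on the score, plus noise.
returns = 0.01 * scores + rng.normal(0.0, 0.05, size=(n_periods, n_stocks))

spreads = []
for t in range(n_periods):
    # Bucket the cross section into quintiles by score each period.
    ranks = pd.qcut(scores[t], 5, labels=False)
    top = returns[t][ranks == 4].mean()       # highest-scoring quintile
    bottom = returns[t][ranks == 0].mean()    # lowest-scoring quintile
    spreads.append(top - bottom)

mean_spread = float(np.mean(spreads))
print(round(mean_spread, 4))  # average top-minus-bottom quintile return
```

Because the spread is recomputed every period across the whole cross section, the same loop can be sliced by sector, capitalization range, or sub-period, which is exactly the richness this approach buys over a single trading simulation.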
We also look at a factor’s exposure to common risk factors, its turnover and autocorrelation, and a plot of the time series of its (in sample!) returns, before and after transaction cost assumptions. All of this together gives us a holistic view of the efficacy of a factor, or of a variant of a factor as we try applying different hypotheses. And as we test, it allows us to understand the sensitivity of the idea to various formulations – the more robust, the better, lest we find the next butter in Bangladesh.
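One of those diagnostics – period-to-period rank autocorrelation, a quick proxy for how fast a factor turns over and hence how much trading cost it implies – can be sketched as follows on synthetic data (the 0.9 persistence parameter is an arbitrary illustration):

```python
import numpy as np
import pandas as pd

# Simulate a persistent factor: today's value is mostly yesterday's
# value plus noise, as is typical of slow-moving fundamental signals.
rng = np.random.default_rng(7)
n_stocks = 500

f_prev = rng.normal(size=n_stocks)
f_curr = 0.9 * f_prev + np.sqrt(1 - 0.9**2) * rng.normal(size=n_stocks)

# Rank autocorrelation: how stable is the cross-sectional ordering
# from one period to the next? High values imply low turnover.
rank_prev = pd.Series(f_prev).rank()
rank_curr = pd.Series(f_curr).rank()
rank_autocorr = float(rank_prev.corr(rank_curr))

print(round(rank_autocorr, 2))  # near 1 for slow-turnover factors
```

A factor whose ranks scramble every period may still look profitable before costs, but the implied turnover often eats the edge once transaction cost assumptions are applied.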
Use a realistic universe construction. We see a lot of backtests from commercial vendors that include something like 5,000 stocks in the US. Even for a retail investor, a small trade can move the price of the 5,000th-most-liquid stock, and for institutional investors these stocks aren’t tradable at all, even at very long horizons. We use a universe designed to mimic what institutions look at, but we’re also always careful to split results by capitalization range, lest we find something that only has value among very small, illiquid stocks.
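A minimal liquidity screen of this kind might look like the sketch below; the tickers, dollar-volume figures, and the $1M threshold are all hypothetical stand-ins for whatever an institution would actually calibrate:

```python
import pandas as pd

# Hypothetical universe candidates with their median daily dollar volume.
stocks = pd.DataFrame({
    "ticker": ["BIG", "MID", "SML", "TINY"],
    "median_dollar_volume": [5e8, 5e7, 2e6, 1e4],  # USD per day
})

# Keep only names an institution could plausibly trade; the threshold
# here is illustrative, not a recommendation.
threshold = 1e6
universe = stocks[stocks["median_dollar_volume"] >= threshold]

print(universe["ticker"].tolist())  # tradable universe after the screen
```

Even after screening, it pays to report results by capitalization bucket, since a signal concentrated entirely in the smallest surviving names is still a warning sign.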
Hopefully some of these pointers were helpful. With a lot of quant research, it doesn’t take huge amounts of resources to do it right. It’s more about being careful and finding the right tool for the job, and being aware of some common pitfalls. For many of us quants, there are few things more satisfying than finding value in a new dataset or anomaly; and it’s all the more satisfying if, having followed some of these best practices, we can have more confidence that we’re right.
And we’re here to help! If you’d like to learn more about the services we provide, please schedule a call.