Machine Learning Algorithmic Trading Project
The goal of this project was to create a trading algorithm for S&P 500 futures. The model is built in Python, and the live trading logic is executed in C++ running in the trading platform.
Details
The algorithmic trading project was built using Python to pull data from trade files that captured every trade made on the E-mini S&P 500 futures contract. I went through and labeled winning trades that fit a system I had developed. I built scripts to pull the data from the trade files, starting from the beginning of the day up to each labeled trade. From there, I created approximately 30 features that I believed described the trades as well as possible. I then started with a simple decision tree classification model to label trades as either true or false. I optimized my feature algorithms to get the best results possible with the simplest model, and then moved on to more advanced models. Through this process I used a random forest model and then landed on a gradient boosting algorithm, which gave the best results.
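As a rough illustration of that progression, the sketch below compares the three model families on a held-out split using scikit-learn. The file name, column names, and hyperparameters are hypothetical stand-ins for the actual data and tuning.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical file of ~30 engineered features plus a true/false label.
features = pd.read_csv("labeled_features.csv")
X = features.drop(columns="label")
y = features["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Start with the simplest model, then move to more advanced ensembles.
models = {
    "decision tree": DecisionTreeClassifier(max_depth=5),
    "random forest": RandomForestClassifier(n_estimators=300),
    "gradient boosting": GradientBoostingClassifier(n_estimators=300, learning_rate=0.05),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(f"{name}: precision = {precision_score(y_test, preds):.3f}")
```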
Once the best model had been chosen using a test data set, I set up similar algorithms in C++ to capture the identical features live while the trading software runs. The end result is a live trading system that evaluates trade data to create the features. Those features are piped over to a running Python script and passed through the trained model, and the resulting True or False is sent back to the trading script to either execute a trade or do nothing. That is what is shown in the video above.
Scripts / Algorithms
To make this all work, a number of scripts were created to process data and run tests. These are listed below:
- Create Trade Data - Python
- This is a binary search algorithm that finds the labeled trade in the data and processes it into a usable format, ready to be turned into features (a sketch of this step appears after this list).
- It also uses Python's multiprocessing functionality to increase speed.
- Analysis - Python, Pandas, Scikit-learn
- This algorithm takes in all the data from the previous script and processes it to create features.
- This script is also where model training is completed. To do this, it splits the labeled data into train and test sets and optimizes a gradient boosting algorithm.
- Once training is complete, the script also produces important outputs like precision-recall curves to evaluate the model, as well as waterfall plots of Shapley values to gain insight into how the algorithm classifies certain samples (see the evaluation sketch after this list).
- Read Data and Run Model - Python
- This script is what is shown running in the shell in the video above.
- Its purpose is to receive feature values from the C++ script and run them through the trained model (sketched after this list).
- Depending on the model's predicted probability, it sends True or False back to the trading algorithm to act on.
- Trade Execution - C++
- Trading script that uses live data to create the features and pass them to the Read Data and Run Model Python script.
- Manages trades and determines when and how to execute a trade.
- Manages stop losses, targets, and the number of contracts.
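For the Create Trade Data step, here is a minimal sketch of what the binary search plus multiprocessing might look like, assuming each day's trades are stored in a CSV sorted by timestamp. The file names, timestamps, and field names are illustrative, not the project's actual layout.

```python
import bisect
from multiprocessing import Pool

import pandas as pd

def extract_window(job):
    """Binary-search for the labeled trade's timestamp and return every
    trade from the start of the day up to and including that trade."""
    trade_file, label_ts = job
    trades = pd.read_csv(trade_file)  # assumed sorted by "timestamp"
    idx = bisect.bisect_right(trades["timestamp"].tolist(), label_ts)
    return trades.iloc[:idx]

if __name__ == "__main__":
    # One (trade file, labeled-trade timestamp) pair per labeled trade.
    jobs = [
        ("trades_2023-01-03.csv", 1672758000),
        ("trades_2023-01-04.csv", 1672844400),
    ]
    # Python's multiprocessing fans the work out across CPU cores.
    with Pool() as pool:
        windows = pool.map(extract_window, jobs)
```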
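The Analysis script's evaluation outputs might look something like the sketch below, which draws a precision-recall curve and a SHAP waterfall plot for one sample. The stand-in data from `make_classification` replaces the real engineered features, and the class imbalance shown is only a guess.

```python
import matplotlib.pyplot as plt
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import PrecisionRecallDisplay
from sklearn.model_selection import train_test_split

# Stand-in for the real engineered features: ~30 features, few positives.
X, y = make_classification(n_samples=2000, n_features=30, weights=[0.9, 0.1])
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

model = GradientBoostingClassifier().fit(X_train, y_train)

# Precision-recall curve on the held-out test set.
PrecisionRecallDisplay.from_estimator(model, X_test, y_test)
plt.savefig("precision_recall.png")

# Waterfall plot of Shapley values for a single sample, showing which
# features pushed the model toward its classification.
explainer = shap.TreeExplainer(model)
explanation = explainer(X_test)
shap.plots.waterfall(explanation[0])
```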
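And for the Read Data and Run Model script, a sketch of the live loop, assuming the C++ side writes one comma-separated feature vector per line to a named pipe and reads the answer back from a second pipe. The pipe paths, feature order, model file name, and 0.5 threshold are all assumptions for illustration.

```python
import joblib
import numpy as np

# Load the trained gradient boosting model saved by the Analysis script.
model = joblib.load("trained_model.joblib")  # hypothetical file name

with open("/tmp/features_pipe", "r") as inbound, open("/tmp/signal_pipe", "w") as outbound:
    for line in inbound:
        # One comma-separated feature vector per line from the C++ script.
        features = np.array([float(v) for v in line.strip().split(",")]).reshape(1, -1)
        prob = model.predict_proba(features)[0, 1]
        # Signal a trade only when the predicted probability clears the threshold.
        outbound.write(("True" if prob >= 0.5 else "False") + "\n")
        outbound.flush()
```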
Lessons Learned
There were many lessons learned while doing this, but here are a few:
- Organization is key when handling large amounts of data. This is as simple as making sure there is a solid naming convention so that there is minimal confusion about what information a file holds and what it should be used for.
- Imbalanced data sets with few positive cases can be hard to train a model on, especially if there are nuances in each positive case.
- Data labeling is incredibly important. The algorithm only learns from what is labeled and from the information the features extract. Because of this, if you don't have features to detect certain nuances, you must be sure every sample meets a minimum standard to be classified accordingly.
- Evaluating a model with dirty data is a waste of time. Again, this goes back to the importance of labeling. In an imbalanced data situation, the negative cases still matter, because false positives and overall performance are hard to evaluate properly when the data is not clean. You can still measure relative improvements, but they won't carry over to real-world data once the model is live.
Accomplishments
- I was able to create an algorithm that had a precision of 90% on live trading data.
- The model was fast enough that trading performance was not impacted by model or program speed.
Improvements and Other Variations to Try
- Creating a better pipe between the Python and C++ scripts, or running the trained model directly from the C++ code.
- Potentially add time-of-day information to the Create Trade Data script. This way, different time-based strategies could be incorporated to improve the model.
- A strictly programmed version that does not use machine learning. This would give better control and more precision over which trades to take; machine learning may not be the perfect match for this trading strategy.