Order Placement: RL Environment
This series of blog posts provides an overview of the execution model developed at Eveince. In the previous part, we described order execution duties, gave a formal problem definition, and reviewed the problem modeling done at Eveince. In this post, we describe the environment used for the reinforcement learning approach.
As described in the previous post, the execution model is a function approximator that takes market descriptors (a feature vector) as input and outputs the price and volume to be sent to the exchange. The abstract execution model is illustrated in the figure below.
This modeling suggests a supervised setting, in which a labeled dataset together with any optimization algorithm, e.g. gradient descent, would do all the work. Here comes the tricky part: there is no labeled dataset, and building one is out of the question since it would have to cover all possible situations. When you face a problem where no dataset exists but samples can be generated by interacting with an environment, reinforcement learning is a great choice.
Environment
In an RL setup, the agent interacts with the environment to produce learning samples, a.k.a. episodes. In our case the environment simulates an exchange. An overview of the environment and its components is illustrated in the figure below.
Order Book Data
The environment is in charge of simulating an exchange. More specifically, it provides the functionality to simulate a real exchange using historical order book data, and it is built on top of this data to match orders.
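For illustration, a historical order book snapshot can be represented as timestamped lists of bid and ask price levels. The structure below is a hypothetical sketch in Python, not the actual Eveince schema:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class OrderBookSnapshot:
    """One historical snapshot of the limit order book (hypothetical schema)."""
    timestamp: int                   # epoch milliseconds
    bids: List[Tuple[float, float]]  # (price, volume) pairs, best bid first
    asks: List[Tuple[float, float]]  # (price, volume) pairs, best ask first

# A toy snapshot; the simulation replays a time-ordered sequence of these.
snapshot = OrderBookSnapshot(
    timestamp=1_650_000_000_000,
    bids=[(100.0, 2.5), (99.9, 4.0)],
    asks=[(100.1, 1.0), (100.2, 3.0)],
)
```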
Environment Status
The environment keeps track of the status of the agent interacting with it. For instance, the environment knows that a specific agent is trying to execute a buy order. How much of the order has been filled, which trades have been generated so far, and which rewards have been passed to the agent are all stored in the environment.
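A minimal sketch of the per-agent state the environment might track (field names are assumptions for illustration, not the actual implementation):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ExecutionStatus:
    """Per-agent execution state kept inside the environment (illustrative)."""
    side: str                      # "buy" or "sell"
    target_volume: float           # total volume the agent must execute
    filled_volume: float = 0.0     # how much has been filled so far
    trades: List[dict] = field(default_factory=list)    # generated trades
    rewards: List[float] = field(default_factory=list)  # rewards already emitted

    @property
    def remaining(self) -> float:
        return self.target_volume - self.filled_volume
```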
Match Engine
This is where the simulation takes place. Upon receiving an order, the environment fetches the corresponding order book data and tries to execute the order against it. The match engine slides the order through the valid order books, generates trades where possible, and returns the trades that occurred during the given time window.
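As a rough illustration of the matching logic, the sketch below fills a limit buy order against the ask side of a single snapshot. The real match engine additionally slides the order across consecutive snapshots over the order's lifetime:

```python
from typing import List, Tuple

def match_limit_buy(limit_price: float, volume: float,
                    asks: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Fill a limit buy against (price, volume) ask levels, best ask first.

    Returns the generated trades as (price, volume) pairs. Illustrative only.
    """
    trades = []
    remaining = volume
    for ask_price, ask_volume in asks:
        if ask_price > limit_price or remaining <= 0:
            break  # no remaining ask level is marketable
        fill = min(remaining, ask_volume)
        trades.append((ask_price, fill))
        remaining -= fill
    return trades

# Buying 2.0 at a 100.15 limit against the toy snapshot fills 1.0 @ 100.1.
print(match_limit_buy(100.15, 2.0, [(100.1, 1.0), (100.2, 3.0)]))
```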
Reward
The environment executes the order received from the agent and generates a reward. This reward is the core signal used for RL training. We will elaborate on rewards in the following posts.
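Since reward design is deferred to a later post, the following is purely a generic placeholder: a common choice in execution problems is the price improvement of each fill relative to a benchmark such as the arrival mid-price. This is not the reward actually used at Eveince:

```python
def fill_reward(side: str, fill_price: float, fill_volume: float,
                benchmark_price: float) -> float:
    """Price improvement of a fill vs. a benchmark (generic placeholder).

    Positive means buying below, or selling above, the benchmark price.
    """
    sign = 1.0 if side == "sell" else -1.0
    return sign * (fill_price - benchmark_price) * fill_volume
```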
Reset
The reset function is used by the agent to reset all variables in the environment and start a new simulation. It selects a random point in time and presents a new situation in which the agent is required to execute an order.
Step
Step is the main function the agent uses to interact with the environment. The agent uses market data and its model to select the best possible action, which is passed to the environment through the step function. The environment then processes the action and returns the new state and the corresponding reward.
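Putting reset and step together, the environment exposes a gym-like interface. The skeleton below reuses the sketches above (OrderBookSnapshot, ExecutionStatus, match_limit_buy, fill_reward) and is a simplified illustration, not the actual Eveince implementation:

```python
import random

class ExchangeEnv:
    """Simulated exchange replaying historical order book data (sketch)."""

    def __init__(self, snapshots, episode_length):
        self.snapshots = snapshots        # time-ordered OrderBookSnapshot list
        self.episode_length = episode_length
        self.start = self.t = 0
        self.status = None

    def reset(self):
        """Jump to a random start time and hand the agent a fresh order."""
        self.start = random.randrange(len(self.snapshots) - self.episode_length)
        self.t = self.start
        self.status = ExecutionStatus(side="buy", target_volume=10.0)
        return self._observe()

    def step(self, action):
        """Match the agent's order, advance time, return (state, reward, done)."""
        price, volume = action            # the agent's limit order
        book = self.snapshots[self.t]
        trades = match_limit_buy(price, volume, book.asks)
        best_ask = book.asks[0][0]        # benchmark: current best ask (placeholder)
        reward = sum(fill_reward(self.status.side, p, v, best_ask)
                     for p, v in trades)
        self.status.filled_volume += sum(v for _, v in trades)
        self.t += 1
        done = (self.t - self.start >= self.episode_length
                or self.status.remaining <= 0)
        return self._observe(), reward, done

    def _observe(self):
        return self.snapshots[self.t]     # raw snapshot as the state (placeholder)
```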
So to sum up:
The environment is built upon historical order book data
It supports limit and market orders, as real exchanges do
The agent interacts with the environment by placing limit or market orders
The environment processes each order and matches it against the order book data as if it were a live exchange
After processing each order, a set of trades may be generated, which is returned to the agent along with the associated reward
Agent
In the RL setup, agents interact with their environment and tune their policy based on the rewards they receive. In the order execution modeling, each agent follows these steps (a minimal sketch of this loop follows the list):
Call reset on the environment. This selects a random position in the order book sequence, which specifies the start time of the new simulation.
The agent is now in charge of executing a pre-defined order (buy/sell) with a pre-defined volume within a specific time window
The agent extracts feature vectors from the order book data (or any other available data)
The function approximator is called to produce the expected reward with respect to each action
The best action is selected and used to update the environment
The received reward is used to update the function approximator
This process is repeated until the episode is finished
Then the agent calls reset once again and starts over
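The steps above map to a simple interaction loop. The sketch below assumes a value-based agent with hypothetical featurize, q_values, and update helpers; it only illustrates the control flow:

```python
def run_episode(env, agent):
    """One episode of the interaction loop described above (illustrative)."""
    state = env.reset()                       # random start time, fresh order
    done = False
    while not done:
        features = agent.featurize(state)     # feature vector from market data
        scores = agent.q_values(features)     # dict: action -> expected reward
        action = max(scores, key=scores.get)  # pick the best action
        state, reward, done = env.step(action)
        agent.update(features, action, reward)  # tune the function approximator
```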
In the next part we will elaborate on the reward, the state and action spaces, the feature space, and the learning paradigm. The distributed architecture of the pipeline will also be described in the following posts.