Reinforcement Learning: the feedback chaos!!
The Beginning
I first worked on DPO (Direct Preference Optimization) without learning reinforcement learning from scratch. Now that I am starting the Hugging Face course, I will jot down the important and interesting key pointers here.
As a basic point, reinforcement learning works on feedback: the agent gets a positive reward for a good action and a negative reward for a bad one. The agent's goal is to maximize its expected return, that is, the cumulative reward it expects to collect over time.
The return is not simply the sum of all rewards in the sequence. We also apply gamma, the discount factor: a fundamental parameter that influences the training and performance of the agent by balancing the importance of immediate versus future rewards. Gamma is a scalar value between 0 and 1, inclusive, and a reward received k steps in the future is weighted by gamma raised to the power k.
- Gamma closer to zero: The agent will tend to consider only immediate rewards.
- Gamma closer to one: The agent will consider future rewards with greater weight, willing to delay the reward.
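To make the effect of gamma concrete, here is a minimal sketch of a discounted return. The function name `discounted_return` and the reward list are made up for illustration; they are not from the course.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward at step k is weighted by gamma**k."""
    total = 0.0
    # Walk backwards so each step applies G_t = r_t + gamma * G_(t+1).
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical reward sequence: small rewards first, a big one at the end.
rewards = [1.0, 1.0, 1.0, 10.0]

print(discounted_return(rewards, gamma=0.0))  # 1.0  (only the immediate reward counts)
print(discounted_return(rewards, gamma=0.5))  # 3.0  (the late reward shrinks to 10 * 0.125)
print(discounted_return(rewards, gamma=1.0))  # 13.0 (plain sum, no discounting)
```

With gamma at 0 the big final reward is invisible to the agent, while with gamma at 1 it counts in full.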
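Interestingly, RL problems are framed as an MDP (Markov Decision Process). Its Markov property states that the agent only needs the current state, not the full history, to decide what action to take. That is unlike what we usually see in LLMs, which condition on the whole preceding context (not a direct relation, though).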
Each agent performs actions in an environment and gets information back from it. That information is either a state, which gives a complete description of the world (example: chess, where the whole board is visible), or an observation, which gives only a partial description of the state of the world (example: Super Mario Bros, where only the part of the level on screen is visible).
The action space, much like the search space in binary search, is the set of all possible actions in an environment. It can be discrete (e.g. moving in one of 4 directions) or continuous (e.g. moving at any angle).
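A rough sketch of the difference, using only the standard library. The action names and the angle range are assumptions chosen to match the examples above, not part of any RL library.

```python
import random

# Discrete action space: a finite set of choices (four directions).
DISCRETE_ACTIONS = ["up", "down", "left", "right"]


def sample_discrete(rng):
    """Pick one of the finitely many actions."""
    return rng.choice(DISCRETE_ACTIONS)


def sample_continuous(rng):
    """Pick a heading angle in degrees from a continuous range of 0 to 360."""
    return rng.uniform(0.0, 360.0)


rng = random.Random(0)  # seeded so runs are repeatable
print(sample_discrete(rng))    # one of the four direction names
print(sample_continuous(rng))  # a float between 0 and 360
```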
There are two types of tasks:
- Episodic: there is a starting state and a terminal state.
- Continuing: there is no terminal state, so the agent keeps going until we decide to stop it.
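The two task types lead to two different loop shapes. Below is a toy sketch: `Corridor`, `run_episode`, and `run_continuing` are hypothetical names invented for this note, and the environment is deliberately tiny.

```python
class Corridor:
    """Hypothetical episodic task: walk from cell 0 to the goal cell."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # action is +1 (right) or -1 (left); position is clipped to the corridor.
        self.position = max(0, min(self.length, self.position + action))
        done = self.position == self.length  # terminal state reached
        reward = 1.0 if done else 0.0
        return self.position, reward, done


def run_episode(env, policy):
    """Episodic: loop until the environment reports a terminal state."""
    state = env.reset()
    done = False
    steps = 0
    while not done:
        state, reward, done = env.step(policy(state))
        steps += 1
    return steps


def run_continuing(reward_fn, max_steps):
    """Continuing: no terminal state, so we choose when to stop."""
    total = 0.0
    for t in range(max_steps):
        total += reward_fn(t)
    return total


print(run_episode(Corridor(length=5), policy=lambda state: +1))  # 5
print(run_continuing(lambda t: 1.0, max_steps=100))              # 100.0
```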
The exploration/exploitation trade-off: we need to balance how much we explore the environment to discover new information against how much we exploit what we already know about the environment to maximize reward.
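One common way to handle this balance is an epsilon-greedy rule, which is not named in the notes above and is shown here only as an illustration. The function name and the value estimates are made up.

```python
import random


def epsilon_greedy(value_estimates, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the best-known one."""
    if rng.random() < epsilon:
        # Explore: try any action, even one that currently looks bad.
        return rng.randrange(len(value_estimates))
    # Exploit: pick the action with the highest estimated value.
    return max(range(len(value_estimates)), key=lambda a: value_estimates[a])


rng = random.Random(0)
estimates = [0.1, 0.7, 0.3]  # hypothetical value estimates for 3 actions

print(epsilon_greedy(estimates, epsilon=0.0, rng=rng))  # 1 (pure exploitation)
print(epsilon_greedy(estimates, epsilon=1.0, rng=rng))  # a random index in 0..2
```

A small epsilon means mostly exploiting with occasional exploration; a large epsilon means the opposite.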