
Global Client Engagement Leader

Posted: Thu Dec 26, 2024 4:39 am
by rriiffaatt77
OpenAI’s CriticGPT trains a model with RLHF to write natural-language feedback on real-world coding tasks, and it generalizes successfully to out-of-distribution (OOD) data, i.e. data the model never encountered during training. This feedback can help human reviewers make more accurate judgments, which in turn provides an effective reward signal for complex outputs.

Speculation on the technical principles

OpenAI's official "hints": Through reinforcement learning, o1 learns to refine its chain of thought and optimize the strategies it uses. It learns to recognize and correct its mistakes, break complex steps down into simpler ones, and try a different approach when the current one isn't working.



This process significantly improves the model's reasoning ability. The o1 model introduces reasoning tokens. The model "thinks" with these reasoning tokens, breaking down its understanding of the prompt and considering multiple ways to generate an answer. After producing its reasoning tokens, the model generates the answer as visible completion tokens and discards the reasoning tokens from its context. Below is an example of a multi-step conversation between a user and an assistant: the input and output tokens of each step are carried forward, while the reasoning tokens are discarded. ("How reasoning works," OpenAI official website)
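As a rough illustration of this context handling, here is a minimal sketch using the openai Python package. The model name "o1-mini" and the usage field names are assumptions for illustration, not taken from the post: each turn carries only the visible assistant answer forward in `messages`, while reasoning tokens are generated and discarded on the server and only surface in the usage counts.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The running conversation holds only input tokens and visible output tokens.
messages = [{"role": "user", "content": "How many primes are there below 100?"}]

first = client.chat.completions.create(model="o1-mini", messages=messages)
messages.append({"role": "assistant", "content": first.choices[0].message.content})

# Second step: the previous visible answer is in context, the previous
# reasoning tokens are not -- they were discarded after the first call.
messages.append({"role": "user", "content": "Briefly justify that count."})
second = client.chat.completions.create(model="o1-mini", messages=messages)

# Reasoning tokens are counted in usage but never re-enter `messages`.
details = getattr(second.usage, "completion_tokens_details", None)
print("visible completion tokens:", second.usage.completion_tokens)
print("discarded reasoning tokens:", getattr(details, "reasoning_tokens", None))
```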



Zhang Junlin's hypothesis: the MCTS search route. OpenAI mentions an RL scaling law for o1 at both training time and inference time, and points out that it has different characteristics from the pre-training scaling law. Clearly, if o1 takes the MCTS (Monte Carlo Tree Search) route, then cutting the CoT into finer steps (increasing the depth of the search tree) or proposing more candidate continuations at each step (increasing the number of branch nodes, i.e. the width of the tree) enlarges the search space; the larger the search space, the greater the chance of finding a good CoT path, the better the result, and the more compute is required for training and inference.
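To make the depth/width trade-off concrete, here is a toy sketch of MCTS over chain-of-thought steps. It is purely illustrative of Zhang Junlin's hypothesis, not anything OpenAI has confirmed: `propose_steps` and `score` are hypothetical stand-ins for a policy and a reward model, `width` controls the branching factor (tree width), `max_depth` the search depth, and raising either one enlarges the search space at the cost of more compute.

```python
import math
import random

def propose_steps(partial_cot, width):
    """Hypothetical policy: propose `width` candidate next reasoning steps."""
    return [partial_cot + [f"step{len(partial_cot)}.{i}"] for i in range(width)]

def score(cot):
    """Hypothetical reward model: score a finished chain of thought in [0, 1]."""
    return random.random()

class Node:
    def __init__(self, cot, parent=None):
        self.cot, self.parent = cot, parent
        self.children, self.visits, self.value = [], 0, 0.0

def ucb(child, parent, c=1.4):
    # Unvisited children are explored first; otherwise balance value vs. novelty.
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(math.log(parent.visits) / child.visits)

def mcts(width=3, max_depth=4, iterations=200):
    root = Node([])
    for _ in range(iterations):
        # Selection: descend by UCB until reaching an unexpanded node.
        node = root
        while node.children:
            node = max(node.children, key=lambda ch: ucb(ch, node))
        # Expansion: widen the tree with `width` candidate next steps.
        if len(node.cot) < max_depth:
            node.children = [Node(c, node) for c in propose_steps(node.cot, width)]
            node = random.choice(node.children)
        # Simulation: roll out to full depth, then score the finished CoT.
        rollout = list(node.cot)
        while len(rollout) < max_depth:
            rollout = random.choice(propose_steps(rollout, width))
        reward = score(rollout)
        # Backpropagation: update statistics along the selected path.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    best = max(root.children, key=lambda ch: ch.visits)
    return best.cot  # most-visited first step of the best CoT path found

print(mcts())
```

In this sketch, increasing `max_depth` corresponds to slicing the CoT more finely, and increasing `width` to proposing more alternatives per step; both grow the tree and therefore the compute needed per query, which is the trade-off the post attributes to the RL scaling law at inference time.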