Artificial intelligence (AI) is a rapidly evolving field focused on building machines that can perform tasks normally requiring human intelligence, such as visual perception, speech recognition, decision-making, and language translation. AI has the potential to transform the way we work and live by automating tasks and processes, increasing efficiency, and enhancing decision-making. The AI market has grown significantly over the past few years and is expected to keep expanding, and the adoption of AI technologies across industries could generate trillions of dollars in economic value. The impact of AI on the workforce is much debated: some fear it will lead to job displacement, while others argue it will create new job opportunities and raise productivity. Overall, the potential of AI is vast, and its influence on our daily lives will only continue to grow in the coming years.
Instruction Tuning
INSTRUCTION TUNING WITH GPT-4 is a paper by Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao from Microsoft Research. It is one of the most significant works I have seen so far on comparing GPT-style Large Language Models (LLMs) through a pragmatic process.
In this paper, they present the first attempt to use GPT-4 to generate instruction-following data for LLM fine-tuning. Their early experiments on instruction-tuned LLaMA models show that the 52K English and Chinese instruction-following examples generated by GPT-4 lead to better zero-shot performance on new tasks than instruction-following data generated by previous state-of-the-art models. They also collect feedback and comparison data from GPT-4 to enable comprehensive evaluation and reward-model training, and they make the data generated with GPT-4, as well as their codebase, publicly available.
The evaluations were carried out by human raters along three alignment criteria (a toy sketch of recording such ratings follows the list):
1. Helpfulness: whether it helps humans achieve their goals. A model that can answer questions accurately is helpful.
2. Honesty: whether it provides true information, and expresses its uncertainty to avoid misleading human users when necessary. A model that provides false information is not honest.
3. Harmlessness: whether it does not cause harm to humans. A model that generates hate speech or promotes violence is not harmless.
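Purely as an illustration (the 1-5 rating scale and the field names below are my assumption, not the paper's annotation interface), here is a toy sketch of how a human judgment along these three criteria might be recorded and aggregated:

```python
# Toy sketch: recording human ratings along the three alignment criteria.
# The 1-5 scale and field names are illustrative assumptions, not the paper's setup.
from dataclasses import dataclass
from statistics import mean

@dataclass
class AlignmentRating:
    helpfulness: int   # 1 = useless .. 5 = fully achieves the user's goal
    honesty: int       # 1 = misleading .. 5 = truthful, uncertainty expressed
    harmlessness: int  # 1 = harmful .. 5 = no harmful content

ratings = [AlignmentRating(5, 4, 5), AlignmentRating(4, 5, 5)]
for criterion in ("helpfulness", "honesty", "harmlessness"):
    print(criterion, mean(getattr(r, criterion) for r in ratings))
```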
The following graphs are self-explanatory: the first shows the main available models compared against ChatGPT (in other words, how they perform relative to ChatGPT), while the second histogram shows everything compared against GPT-4 (in other words, how the models perform relative to GPT-4). As we can see, GPT-4 is still far ahead of the other engines/models. However, Vicuna (based on LLaMA 13B) reaches a remarkable 99% score relative to ChatGPT, which is, as we all know, an impressive pre-trained large language model. In other words, a model like ChatGPT can effectively be used to train (fine-tune) another model up to nearly its own level.
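The 99% figure for Vicuna is easier to interpret with the usual relative-score convention in mind: a judge (GPT-4 in these evaluations) rates each model's answer, and a model's score is reported as a percentage of the baseline's total. The sketch below only illustrates that convention; the 1-10 scale and the example ratings are made-up assumptions, not data from the paper.

```python
# Illustrative relative-score computation (assumed judge-rating protocol;
# the ratings below are made up, not taken from the paper).
from typing import List

def relative_score(model_ratings: List[float], baseline_ratings: List[float]) -> float:
    """Model's total judge rating as a percentage of the baseline's total."""
    return 100.0 * sum(model_ratings) / sum(baseline_ratings)

vicuna_vs_chatgpt = relative_score([8.5, 9.0, 8.0], [8.7, 9.0, 8.1])
print(f"Vicuna relative to ChatGPT: {vicuna_vs_chatgpt:.0f}%")  # ~99% with these toy numbers
```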
It is fascinating to read how they generated the dataset for the fine-tuning: by asking GPT-4 itself to produce the instruction-following data used to train an AI agent. In particular, they state:
We train two models using supervised finetuning using the LLaMA 7B checkpoint: (i) LLaMA-GPT4 is trained on 52K English instruction-following data generated by GPT-4, which distribution is displayed in Figure 1. (ii) LLaMA-GPT4-CN is trained on 52K Chinese instruction-following data from GPT-4. We follow the training schedule in (Taori et al., 2023) for fair comparisons. These models are used to study the data quality of GPT-4 and the cross-language generalization properties when instruction-tuning LLMs in one language.
(From https://arxiv.org/pdf/2304.03277.pdf)
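To make the recipe more concrete, here is a minimal supervised fine-tuning sketch in the spirit of that setup. It assumes the 52K GPT-4 instruction data is available locally in the Alpaca JSON format (instruction/input/output fields, as released in the repository referenced below); the checkpoint name, prompt template, and hyperparameters are illustrative placeholders rather than the exact schedule of Taori et al. (2023).

```python
# Minimal supervised fine-tuning sketch (assumptions: Hugging Face transformers and
# datasets are installed, the GPT-4 instruction data is available locally in the
# Alpaca JSON format, and you have access to LLaMA 7B weights).
import json

from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

MODEL = "huggyllama/llama-7b"        # placeholder checkpoint name
DATA = "alpaca_gpt4_data.json"       # 52K GPT-4 instruction-following examples

tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL)

def to_prompt(example):
    # Alpaca-style prompt template: instruction (+ optional input) followed by the response.
    text = f"### Instruction:\n{example['instruction']}\n"
    if example.get("input"):
        text += f"### Input:\n{example['input']}\n"
    return text + f"### Response:\n{example['output']}{tokenizer.eos_token}"

records = json.load(open(DATA))
dataset = Dataset.from_list(records).map(
    lambda ex: tokenizer(to_prompt(ex), truncation=True, max_length=512),
    remove_columns=["instruction", "input", "output"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="llama-gpt4-sft",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        num_train_epochs=3,
        learning_rate=2e-5,
        bf16=True,
    ),
    train_dataset=dataset,
    # Causal-LM collator: labels are the input tokens themselves (no masked LM).
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```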
Conclusions
Upon encountering this enlightening paper, it became immediately clear to me that its message was of great significance and merited dissemination. As someone who has been involved in producing scientific publications, I am acutely aware of the intricacies of such an endeavor. Uncovering the dissimilarities and resemblances between Generative Pre-Trained Large Language Models (LLM-GPT) is not a task to be taken lightly; in fact, it is essential to understanding their potential impact on society.
Through comprehensive comparisons of these powerful models, we can gain valuable insight into their unique capabilities, limitations, and biases. By examining their respective strengths and weaknesses, we can discern which models may be better suited for particular tasks and applications. Furthermore, by identifying any potential societal implications of these models, we can take proactive measures to mitigate their negative effects while maximizing their benefits.
Therefore, I cannot stress enough the importance of conducting rigorous comparisons of these LLM-GPT models, as it is a critical step towards promoting the responsible and ethical use of artificial intelligence in our increasingly interconnected world.
References
1. Paper: https://arxiv.org/pdf/2304.03277.pdf
2. GitHub: https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM