We Made a Large-Model Test Set, "Z-Bench": 300 Questions to Comprehensively Examine the Abilities of Large Language Models

Since the release of ChatGPT, we often exclaim while using it: "Ah, it can answer this!" At the same time, we are pleased to see more and more large-model teams and products springing up.

As early-stage investors, we often need to try out and evaluate newly released conversational AI products. The most common approach is to feed each one the same prompts and intuitively compare its output against that of the benchmark, ChatGPT. In the process, we have gradually recorded problems that current large language models handle poorly, along with many interesting prompts.

So, which prompts should we test with? OpenAI has demonstrated 48 basic capabilities of ChatGPT on its official website. The NLP field also has widely used test sets such as SuperGLUE, MMLU, and Google's BIG-bench. Moreover, since new capabilities emerge in large models as parameter and data scale grows, test sets targeting these emergent capabilities keep multiplying as well.

However, in practice we found that current NLP test sets have the following problems:

1. Some tasks are not well suited to dialogue systems, and some tasks lack a good Chinese version.
2. As these test sets become industry standards, targeted optimization and overfitting may occur.
3. These test sets usually require deploying automated evaluation pipelines and are not suited to non-professionals asking questions day to day.

Therefore, several of us VC muggles, as heavy users of conversational AI, designed and compiled "Z-Bench" based on our own needs: a qualitative test set for large conversational products (ChatGPT-like products), prepared for non-technical users. We hope to share it with everyone.

"Z-Bench v0.9" provides a total of 300 Prompts from the three perspectives of basic capabilities, advanced capabilities, and vertical capabilities. Our starting point is to cover as many types of NLP tasks as possible. Our goal is not to provide an academically rigorous and complete test set, but to combine the existing academic test sets, some interesting cases collected daily, and the emergence and epiphany of academic discoveries after the emergence of large models , providing a large model capability test set suitable  for use by non-technical professionals . However, we will inevitably miss some scenes, or there will be a lot of amateur content from a professional perspective. In the future, we will continue to supplement and improve it based on the feedback we have collected, and publish it in a timely manner.