China proposes tougher rules for generative AI. Photo: Shutterstock

China proposes stricter curbs on training data and models used to build generative AI services in bid to tighten security

  • Draft targets two key areas for improvement – the security of raw training data and large language models used to build the generative AI services
  • The draft proposes a blacklist system to block training data materials that contain more than 5 per cent of illegal content

China is planning stricter curbs on how generative artificial intelligence (AI) services are applied in the country, as authorities attempt to strike a balance between harnessing the benefits of the technology and mitigating its risks.

New draft guidance published on Wednesday by the National Information Security Standardisation Technical Committee, an agency that enacts standards for IT security, targets two key areas for improvement – the security of raw training data and the large language models (LLMs) used to build the generative AI services.

The draft stipulates that AI training materials should not infringe copyright or breach personal data security. It also requires that training data first pass security checks and be processed by authorised data labellers and reviewers.

Secondly, when developers build their LLMs – the deep learning algorithms trained with massive data sets that power generative AI chatbots such as Baidu’s Ernie Bot – they should be based on foundational models filed with and licensed by authorities, according to the draft.

Alibaba opens Tongyi Qianwen model to public as new CEO embraces AI

The draft proposes a blacklist system to block training data materials that contain more than 5 per cent of illegal content, together with information deemed harmful under the nation’s cybersecurity laws.

Illegal content in China is typically defined as material that incites violence and extremism, spreads rumours and misinformation, or promotes pornography and superstition. Beijing also censors sensitive political information, such as questions about Taiwan’s status.

The draft proposes that during the training process, developers should consider the security of the content generated as one of the major points of evaluation, and “in every dialogue [with generative AI services], information keyed in by users should go through a security check to ensure the AI models generate positive content”.

The draft is open for public feedback until October 25.

China in August imposed a general regulation targeting domestic generative AI services, making it one of the first countries to set rules governing the emerging technology.

The Chinese government last month approved a batch of local generative AI services, including chatbots from search engine giant Baidu, state-backed iFlyTek, Zhipu AI, Sogou co-founder Wang Xiaochuan’s new venture Baichuan and SenseTime.

In tests performed by the Post, Chinese chatbots respond in a variety of ways when asked whether Taiwan is part of China. Some refuse to give a response and end the conversation abruptly, while others give a brief, affirmative response before also ending the interaction.
