Singapore Wants To Train Its Sea-Lion AI Model Ethically
- By Paul Mah
- April 17, 2024
AI Singapore is working to uphold a higher standard by sourcing the data for its AI model ethically, says Leslie Teo, senior director of AI Products at AI Singapore.
At a media briefing earlier this week, Teo and two team members at AI Singapore shared more about their work on the open-source Sea-Lion AI model, or Southeast Asian Languages in One Network.
Last year, we reported that the Singapore government is setting aside USD52 million to build the region’s first large language model (LLM) to incorporate the diverse cultures and languages of Southeast Asia.
Billed as the National Multimodal LLM Programme (NMLP), the initiative will build on Sea-Lion, which was trained on 11 languages from the region. The work will extend Sea-Lion into a multimodal speech-to-text model.
How Sea-Lion is better
To demonstrate Sea-Lion’s unique capabilities, Teo and his team members pitted the publicly released version of Sea-Lion against top language models such as Meta's Llama 2, Alibaba's SeaLLM, and OpenAI's GPT-4 Turbo.
Sea-Lion fared well when quizzed on local topics in regional languages such as Bahasa Indonesia, Thai, and even Tamil. Specifically, it gave contextually relevant advice that took into account local sensitivities and realities.
In contrast, Llama 2 and SeaLLM might decline to respond to questions perceived as touching on sensitive areas, give generic responses, or offer bad advice.
This is hardly surprising, as one-eighth of the almost one trillion tokens (981 billion) used to train Sea-Lion, or roughly 123 billion tokens, were Southeast Asian in origin. In comparison, just 0.5% of the data used to train Llama 2 comprised Southeast Asian content.
There are currently three Sea-Lion models: Sea-Lion 3B, Sea-Lion 7B, and Sea-Lion 7B Instruct. They were trained on Nvidia A100 GPUs on the AWS cloud.
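For readers who want to try the open-source release, below is a minimal sketch of loading Sea-Lion 7B with the Hugging Face Transformers library. The repository name "aisingapore/sea-lion-7b" and the trust_remote_code flag are assumptions about how the checkpoint is distributed, not details confirmed at the briefing.

```python
# Minimal sketch: loading a Sea-Lion checkpoint with Hugging Face Transformers.
# The repo name and trust_remote_code flag below are assumptions, not details
# from AI Singapore's briefing.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "aisingapore/sea-lion-7b"  # assumed Hugging Face repository name

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Prompt the base model in a regional language (Bahasa Indonesia here).
prompt = "Apa makanan khas Singapura?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```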
Ethical data sourcing
A recent New York Times report accused technology giants such as OpenAI, Meta, and Google of cutting corners to harvest high-quality data to train their AI models. This includes using copyrighted books, YouTube videos, and other online content without permission.
However, it was clear from the get-go that the AI Singapore team took great pains to source their data ethically. When quizzed about the source of data used to train Sea-Lion, Teo said: "We are aware of what others are doing... But because of who we are, we are very conservative. We have to make sure we are doing things correctly."
Teo says Sea-Lion was created by a lean team of just 20 Singaporeans, who painstakingly evaluated and cleaned up the data fed to Sea-Lion to ensure the model does not train on data it is not supposed to use.
In fact, Teo shared how he turned down data brokers who offered to sell high-quality data from dubious or unknown sources. “We want to do things correctly and uphold a higher standard with our AI model in Singapore.”
He conceded there are downsides: "We pay a price... our models will not be as good.”
On the bright side, Teo shared how regional partners, inspired by AI Singapore’s work on Sea-Lion, have approached the team with offers to share their data for training Sea-Lion.
Training on a budget
Teo clarified that not all of the USD52 million for the NMLP will go towards purchasing GPU time; a majority is earmarked for other related tasks. However, he says the money is enough for the next two Sea-Lion models planned.
Teo says he is not working to pit Sea-Lion against state-of-the-art LLMs like GPT-4, with its estimated trillion parameters. Moreover, the cost of GPU time is on a downward trend, which should help reduce training costs. Finally, Teo noted that his motivated, mission-driven team is paid "academic rates."
The next version of Sea-Lion, with some 13 billion parameters, could be released as soon as the middle of this year. A larger one, with 30 billion parameters, is slated for release at the end of the year. These models will incorporate greater safety tuning to ensure their suitability for public use, Teo says.
Ultimately, Teo sees the development of Sea-Lion as a sacred mission, both to fill a gap in regional representation and to put Singapore on the global map for AI. "We believe in the mission. We are mission-driven."
Image credit: iStock/Wirestock
Paul Mah
Paul Mah is the editor of DSAITrends, where he reports on the latest developments in data science and AI. A former system administrator, programmer, and IT lecturer, he enjoys writing both code and prose.