From code to conversation: Can AI teach robots Arabic?
The rise of advanced artificial intelligence (AI) systems, such as ChatGPT, holds the potential to completely transform our world. However, the majority of these platforms primarily operate in English, leaving languages like Arabic to face setbacks due to limited online data. In a world increasingly reliant on AI, Nizar Habash, a computer scientist specialising in natural language processing and computational linguistics, finds himself at a unique crossroads. Through extensive research spanning machine translation, morphological analysis, and computational modelling of Arabic and its dialects, Habash’s work offers insights into the challenges and opportunities presented by building Arabic-language AI systems, or in simpler terms, “teaching Arabic to the robots”.
“Arabic is one of the most important languages globally. It ranks among the top in terms of the number of people who use it, whether for day-to-day life or solely for religious purposes. It’s a significant language that has carried knowledge over a large span of human history, essentially preserving it,” says Habash.
A professor of Computer Science at NYU Abu Dhabi, Habash points out the urgent need for developing more sophisticated machine learning systems that are better equipped at processing cultural nuances embedded within different languages. “Today, when we assess the resources available for Arabic and the AI systems currently under use, we find that they don’t match the level of complexity the language holds.”
Originally hailing from Palestine, Habash mentions, “Being a native speaker of Arabic, I’ve been aware of its complexity from a very young age—from its various dialects across the Arab world to the standards I’ve had to adhere to throughout my education. I’ve often thought about how Arabic serves as a vehicle for our identity, knowledge, and communication, especially in the age of AI. And we encounter numerous examples of problems in this regard.”
Data challenges
Can the limitations of available online data for Arabic language learning impact the development and performance of AI systems? According to Habash, the current idea which has been quite successful in AI is “simply that the more data the better”. “That’s not the biggest challenge but for some people, it may be seen as the only challenge. The problem with this idea is that you will eventually get to the point where there is no more data which is naturally created and the moment you start generating artificial data and training AI systems on itself, it’s like creating monsters,” says Habash, who was previously a research scientist at Columbia University’s Centre for Computational Learning Systems before joining NYUAD.
AI uses feedback loops, which can sometimes involve inputs with ‘creative’ mistakes. Producing 100 times the amount of data therefore means the mistakes will also be amplified 100 times, Habash explains. “When the mistakes are repeated again and again, it becomes the norm, and the norm then becomes the operating model. The model has no concept of reality. It is simply trying to predict the next word, or fill in the blank, or use what’s called masking techniques to figure out the next part of the sentence. AI is great at making mistakes with confidence,” he adds.
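The point about repeated mistakes becoming the norm can be illustrated with a toy next-word predictor. This is only a minimal sketch, not how ChatGPT or any system Habash describes actually works: the bigram counting approach, the function names and the tiny corpus are all assumptions made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count, for every word, which words follow it and how often."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent follower of `word` and its relative frequency."""
    followers = counts[word]
    best, n = followers.most_common(1)[0]
    return best, n / sum(followers.values())

# Hypothetical corpus: one correct sentence, and one mistaken sentence
# that has been copied three times (for example by regenerated data).
corpus = ["the capital of france is paris"] + ["the capital of france is lyon"] * 3

model = train_bigrams(corpus)
guess, confidence = predict_next(model, "is")
print(guess, confidence)
```

The toy model has no notion of which sentence is true. It only sees that the mistaken continuation is more frequent, so it predicts it, and with a higher score than the correct one.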
When discussing the limitations of collecting online data for Arabic, Habash also spotlights the dangers of algorithmic bias and the inability to decipher grammatical nuances inherent in Arabic script, such as the absence of diacritical marks. Diacritical marks, also known as ‘Tashkeel’ or ‘Harakat’ in Arabic, are small symbols placed above or below Arabic letters to indicate vowel sounds, pronunciation, and grammatical structures.
These intricacies pose critical challenges for AI systems striving to comprehend and process Arabic text accurately. “Arabic, typically for common usage, is written without the diacritical marks, which signify the vowels. Only about one to two per cent of Arabic words in newspapers actually have a marker for vowels but the Arabic readers know how to understand it. It’s a subconscious understanding that readers have, we don’t have to think about it. However, a word may be ambiguous as a result and could have many meanings. So, when we’re teaching the machines, context becomes really important,” says Habash.
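A small sketch can show why missing diacritical marks create ambiguity. The example below assumes two vocalised words, kataba (“he wrote”) and kutub (“books”), and removes their marks using Python’s standard unicodedata module. The helper name strip_diacritics is hypothetical, and dropping every nonspacing mark is a simplification chosen for illustration.

```python
import unicodedata

def strip_diacritics(text):
    """Drop nonspacing combining marks (Unicode category Mn), which
    include the Arabic short-vowel signs such as fatha and damma."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

# Written with escape codes so the individual marks are visible.
# U+0643 kaf, U+062A ta, U+0628 ba, U+064E fatha, U+064F damma
kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"  # "he wrote"
kutub = "\u0643\u064F\u062A\u064F\u0628"         # "books"
bare = "\u0643\u062A\u0628"                      # the form usually printed

print(strip_diacritics(kataba) == bare)
print(strip_diacritics(kutub) == bare)
```

Both words collapse into the same three-letter string once the marks are removed, so a system reading everyday text has only the surrounding context to decide which one was meant.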
Another key aspect to keep in mind is the many dialects under the umbrella of the Arabic language, he adds. “Where there are dialects, there also are historical variants. Classical Arabic, the Arabic of the Quran, is spelled in slightly different ways than modern standard Arabic. This is another thing that machines are dealing with. It can mix up the Quran text with the modern standard Arabic, with Egyptian dialect, and put this pile together, which would confuse a lot of things.”
“There are different complexities. In my opinion, some of the interesting challenges that are not tapped in yet, are potentially to do with algorithmic bias,” says Habash.
Cultural sensitivity and biases
When it comes to advanced AI systems such as ChatGPT, there are different kinds of biases one has to keep in mind. “One is content bias and the other one is the grammatical form bias but both are interconnected,” Habash explains. “The content bias is related to the kind of ideas about the world that a system is likely to generate in generative models. As AI scientist Toby Walsh had previously said, ‘Language is political. There’s always bias embedded’. To an extent, I agree with this. For instance, in traditional journalism reportage, we’d always see the die-kill paradigm, where Israelis seem to always be ‘killed’ and Palestinians always ‘die’—we are impossible to kill. These types of biases can also occur in Arabic language.”
Citing a more recent example from ChatGPT doing the rounds on social media, he adds, “Similarly, ChatGPT was asked ‘Do Palestinians deserve to be free?’ and ‘Do Israelis deserve to be free?’ The answer for the Israelis was something related to ‘Of course, Israelis are human beings, and all human beings deserve freedom’, whereas, for Palestinians, the response was along the lines of ‘The question of Palestinians being free is a complex question with many opinions’. There are biases everywhere; AI will repeat what it learns,” says Habash.
What steps can be taken to ensure that Arabic-language AI systems are culturally sensitive and avoid biases in their interactions? “The real challenge is to figure out how to get the machines to model properly, to know which things should be given higher or lower weight,” says Habash.
“One solution is to add more data in the training systems, to get better results. However, that comes with its own challenges. Another solution, a more promising one, is that researchers should work towards identifying content that seems to be away from the normal distribution which is expected,” he adds. “For example, if there’s a lot of mentions of doctors being men and nurses being women, can you actually artificially reduce the weights of the model? You don’t have to change the data; you can change how you learn from the data. If we see a pattern that looks kind of odd we can work on balancing it out.”
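One way to picture “changing how you learn from the data” is example reweighting: the data stays as it is, but over-represented pairings count for less during training. The sketch below is one common balancing scheme, not a method attributed to Habash; the function name, the toy counts and the choice of giving each gender equal total weight within an occupation are all assumptions.

```python
from collections import Counter

def balancing_weights(examples):
    """Give each (occupation, gender) example a weight so that, within an
    occupation, every gender contributes the same total weight."""
    pair_counts = Counter(examples)
    occupation_counts = Counter(occ for occ, _ in examples)
    genders_per_occupation = Counter(occ for occ, _ in pair_counts)
    return [
        occupation_counts[occ] / (genders_per_occupation[occ] * pair_counts[(occ, gender)])
        for occ, gender in examples
    ]

# Hypothetical skewed data: doctors mostly men, nurses mostly women.
examples = (
    [("doctor", "man")] * 3 + [("doctor", "woman")]
    + [("nurse", "woman")] * 3 + [("nurse", "man")]
)
weights = balancing_weights(examples)

# Total weight per pairing after balancing.
totals = Counter()
for pair, weight in zip(examples, weights):
    totals[pair] += weight
print(dict(totals))
```

The examples themselves are untouched. A training procedure that multiplies each example’s loss by its weight would see “doctor” paired with men and with women equally often in effect, even though the raw counts remain skewed.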
“It’s really an exciting new space because we’re dealing with data and information and it could be manipulated in different ways,” says Habash.
Role of language and AI experts
In what ways can computational language experts, such as Habash, contribute to overcoming these challenges and make ‘better’ design choices to ensure cultural sensitivity in AI systems?
“That’s a great question. As an industry, we are more focused on the efficiency, efficacy and design of the model, creating something that is simple and easy with the sort of ‘Google elegance’. Google simplified everything with one simple search box and that’s very attractive for people who are already overwhelmed. The amount of data on the web is so ridiculously huge. Everyone wants the short answer,” Habash responds.
In the realm of design choices for AI models, the computer scientist advocates for simplicity without sacrificing substance, cautioning against ‘deceptive fluency’. “For example, if you talk to an English speaker who has good pronunciation, your basic assumption is that this person sounds good to the ear, you understand him or her. Clearly, they’re smart, if they’re smart, they’re good, if they’re good, they’re telling the truth. But if a super smart person who actually knows a lot more but has trouble speaking in English, you might not think the same,” he adds.
“It’s the same thing with the machines. Fluency equals intelligence equals truth, which is not logically valid. So, we’re not dealing with something that we have not dealt with before but the only thing is the volume and accessibility with advanced AI systems are a lot higher,” says Habash.
Hence, the perils of relinquishing human agency to AI are steep and “if we rely too much on AI to make decisions on our behalf and to be our voice, we’re giving up something about our humanity, intelligence, conscience, and potentially our responsibility, which is not going to take us too far”, says Habash, emphasising the irreplaceable role of human judgement, empathy, and ethical responsibility. “That’s why I think it’s extremely important to continue to educate human beings.”