Even British singer Ed Sheeran couldn’t resist Misal Pav’s allure during his recent Mumbai visit when he joined Sanjyot Keer to whip up Maharashtra’s iconic dish.
Bengaluru-based startup Smallstep.ai is cooking up its own “Misal”. However, it’s not making the spicy curry, but building a large language model (LLM) for the Marathi language.
In a space where homegrown players are developing their own LLMs in local languages, Sagar Sarkale, CEO and Founder of Smallstep.ai, drew inspiration from misal to build two versions of its LLM, Misal-7B and Misal-1B.
Founded in January this year, the AI firm focuses on simplifying AI and developing AI models, particularly in the Indic language space.
“Misal was developed to address the limitations of existing large language models, which were predominantly trained on English data, with only a small percentage dedicated to non-English languages,” Sarkale tells YourStory.
In simpler terms, an LLM is a type of AI designed to understand and generate human language by training on vast amounts of text data.
This foundational model can be fine-tuned for tasks such as translation, regionalisation of content, content generation, summarisation, and customer service.
“One example is, imagine coding Python in Marathi. We aim to become the Shopify for multilingual AI apps in the education and media space, providing accessible solutions for audiences to seek language-based tools and experiences,” he adds.
Misal is only the beginning, according to the founder. In the future, the startup aims to build LLMs in other languages, starting with Bisibele bhath for Kannada.
How does it work?
Sarkale believes that many people still miss out on the opportunities created by the internet, as most of its content is centred around English.
“With generative AI (GenAI) technology, it is now possible to make the internet and all its opportunities available in local languages all over the world. Smallstep.ai is creating a platform that helps developers build multilingual AI apps with ease,” Sarkale explains.
Currently, hundreds of developers are using the platform to build applications in multiple languages. Misal is built on top of Meta’s Llama 2, an open-source GenAI model, which Sarkale further customised for Marathi.
The model was developed to address the limitations of Llama 2, which was trained mostly on English data, with only a small portion devoted to code and other languages.
“Since only 2% of its data represents non-English languages, Llama2 isn’t well-suited for building GenAI applications in languages beyond English,” Sarkale says.
The model underwent training using data sourced from libraries and offline materials unavailable on the internet. As a result, the Marathi-based LLM excelled in the reading comprehension task, surpassing the performance of ChatGPT 3.5.
Finding the right ingredient for ‘Misal’
The startup used a three-step procedure to create the Instruction Tuned Misal model.
The team first addressed the challenge of non-English languages such as Marathi by developing a specialised SentencePiece tokeniser that expands the token vocabulary. The model then underwent pretraining on a large volume of Marathi text, totalling more than two billion tokens from sources such as newspapers and online datasets.
Finally, to refine the model, the team gathered a collection of Marathi instructions for fine-tuning, leaving out any related to code.
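To see why the tokeniser step matters, consider a minimal sketch in Python. It is a toy illustration, not Smallstep.ai's code and not the SentencePiece algorithm: the `tokenize` function, both vocabularies, and the sample text are assumptions made up for this example. It uses a greedy longest-match tokeniser that falls back to one token per UTF-8 byte when a character is missing from the vocabulary, which is roughly what happens when an English-centric vocabulary meets Devanagari script.

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenisation with a per-byte fallback (toy example)."""
    tokens = []
    max_len = max(len(piece) for piece in vocab)
    i = 0
    while i < len(text):
        # Try the longest candidate piece first, then shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            # Character not covered by the vocabulary: emit one token per UTF-8 byte.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


# Hypothetical English-centric vocabulary (assumption for illustration only).
BASE_VOCAB = {" ", "mis", "al", "pav", "spicy", "curry"}

# Hypothetical expanded vocabulary with two Marathi pieces added.
EXTENDED_VOCAB = BASE_VOCAB | {"मिसळ", "पाव"}

TEXT = "मिसळ पाव"  # "Misal Pav" in Marathi

base_tokens = tokenize(TEXT, BASE_VOCAB)
extended_tokens = tokenize(TEXT, EXTENDED_VOCAB)

print(len(base_tokens), "tokens with the base vocabulary")
print(len(extended_tokens), "tokens with the expanded vocabulary:", extended_tokens)
```

With the expanded vocabulary, the same phrase needs far fewer tokens, so more Marathi text fits into the model's context window and each token carries more meaning during pretraining.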
There are two versions of the model: Misal-1B and Misal-7B. While Misal-1B is based on a smaller model called TinyLlama and understands Marathi, it doesn’t perform as well as Misal-7B, which is larger and more advanced.
However, the team is actively working on improving Misal-1B, potentially by making it larger or refining its training data, with the aim of bringing its performance closer to that of Misal-7B.
“While we cannot outperform ChatGPT 3.5 in tasks such as sentiment analysis, paraphrasing, and translation, the Misal-7B model surpasses Krutrim in all tasks except translation,” Sarkale says.
Misal-7B is proficient in many tasks but requires fine-tuning for creative writing such as essays or poems. Some areas for improvement include refining the ability to provide concise responses, formatting numbers, and reducing word repetitions in longer outputs.
Misal competes with the likes of Sarvam.ai and Ola Krutrim, as well as homegrown LLMs such as Project Vaani by the Indian Institute of Science (IISc) and ARTPARK (AI & Robotics Technology Park), BharatGPT by CoRover.ai, and Indus Project built by Tech Mahindra.
India’s AI market is expected to reach $17 billion by 2027, according to a report co-authored by BCG.
Some of Smallstep’s target customers include content platforms, publishing houses, production houses, and educational platforms.
Edited by Affirunisa Kankudti