Indika AI bets on synthetic data to overcome challenges of real-world data

What do self-driving cars, fraud detection tools, and language systems have in common? These seemingly disconnected scenarios share a common thread: synthetic data. It is helping autonomous cars respond better to real-world challenges, improving fraud detection in the fintech industry, and training language systems.

Artificial intelligence (AI) solutions need high-quality training datasets if the models are to perform as intended. But real-world data comes with challenges of privacy, security, cost, access, and availability, as well as limits on size and scale.

This is where synthetic data comes in—with the promise of scale, privacy, security, and accuracy. And this is what Mumbai-based startup Indika AI is betting on. It is developing a platform for synthetic data generation in various areas such as finance, medical and legal AI.

“Synthetic data retains all the insights of real-world data but not the identities. It helps overcome data privacy and regulatory issues concerning data sharing,” says Hardik Dave, Founder and CEO of Indika AI.

Founded in 2021 by Hardik Dave and Dr Anshul Pandey, Indika AI is a data solutions company that works with AI firms across the globe, providing training-dataset solutions that help them build effective AI models.

Data annotation

An AI model must be trained to comprehend particular information for it to make decisions and take action. For this, data needs to be annotated or labelled so that the AI solution can understand areas of interest, identify objects, and uncover hidden patterns, contexts, intents, and sentiments in the dataset.

For instance, Indika AI has worked on a project to label financial news to train an AI-based stock price prediction tool. The tool suggests stocks that an individual should add to their portfolio based on news on stock prices.

Preparing training datasets for AI firms operating in niche and regulated areas, such as financial services, legal, medical, and conversational AI, is challenging because the work demands strong domain expertise, involves subjective judgement, and raises data security and regulatory concerns, says Hardik.

Which is why synthetic data generation could be a game-changer in the coming years, as it would offer datasets that are complete, more accurate, consistent, and without any bias, he says. Of course, the quality of any synthetically generated data depends on the quality of the AI model that generates the data.

What synthetic data brings to the table

Synthetic data is algorithmically generated (artificially manufactured) data, approximating the properties of original real-world data, such as tabular data, text, images, videos, and speech. The process involves feeding information into an AI model to generate synthetic data that can be a useful addition or a substitute to real-world data.

Currently, most AI models are trained with real-world data, and only a small percentage of models use synthetic data. But this will change in the future, according to Hardik.

Synthetic data would not only fill gaps in AI training data where real-world data is unavailable, unusable for security or privacy reasons, or expensive; it would also make it possible to quickly generate larger datasets to test and train AI models, he says.

Currently, Amazon uses synthetic data to train Alexa’s language system, Google’s Waymo uses synthetic data to train its self-driving cars, while American Express and J. P. Morgan use synthetic financial data to improve fraud detection.

By 2024, 60% of the data used for the development of AI and analytics projects will be synthetically generated, according to research and consulting firm Gartner.

What Indika AI is doing

Indika AI is developing a platform to generate synthetic data in various areas such as finance, medical and legal AI. It expects to launch this platform in three months.

The company’s synthetic data platform would be compatible with tabular data to begin with and would support other data types, such as unstructured text and images, as the platform is scaled.

Hardik illustrates a possible use case in financial services. Let us say an existing dataset does not contain the credit-card spending details of working women in their 30s who earn a salary of Rs 10 lakh and live in Tier III cities. A synthetic dataset for this group could be created by analysing the spending patterns of other users for each factor, such as gender, age, income group, and location.
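To make the idea concrete, here is a minimal Python sketch of one way such a gap could be filled. It is not Indika AI's method, which the company has not described. It assumes a simple additive model: each factor level shifts average spend by a fixed amount, and synthetic records for the missing group are drawn around the combined estimate. The records, field names, and function names are all invented for illustration.

```python
import random
from collections import defaultdict
from statistics import mean, pstdev

FACTORS = ("gender", "age_band", "income_band", "city_tier")

# Toy records, invented for illustration only (monthly card spend in Rs).
records = [
    {"gender": "F", "age_band": "30s", "income_band": "10L", "city_tier": "I", "spend": 42000},
    {"gender": "F", "age_band": "20s", "income_band": "5L", "city_tier": "III", "spend": 15000},
    {"gender": "M", "age_band": "30s", "income_band": "10L", "city_tier": "III", "spend": 30000},
    {"gender": "M", "age_band": "40s", "income_band": "5L", "city_tier": "I", "spend": 25000},
    {"gender": "F", "age_band": "40s", "income_band": "10L", "city_tier": "II", "spend": 38000},
    {"gender": "M", "age_band": "20s", "income_band": "5L", "city_tier": "II", "spend": 18000},
]

# The group that is absent from the toy records above.
target = {"gender": "F", "age_band": "30s", "income_band": "10L", "city_tier": "III"}


def fit_marginal_effects(rows):
    """Return the overall mean spend and each factor level's deviation from it."""
    overall = mean(r["spend"] for r in rows)
    by_level = defaultdict(list)
    for r in rows:
        for f in FACTORS:
            by_level[(f, r[f])].append(r["spend"])
    effects = {key: mean(vals) - overall for key, vals in by_level.items()}
    return overall, effects


def expected_spend(profile, overall, effects):
    """Additive estimate: overall mean plus one effect per factor level."""
    return overall + sum(effects[(f, profile[f])] for f in FACTORS)


def synthesize(profile, rows, n, seed=0):
    """Draw n synthetic records for a profile, centred on the additive estimate."""
    overall, effects = fit_marginal_effects(rows)
    centre = expected_spend(profile, overall, effects)
    # Spread is taken from how far the additive estimate misses the observed rows.
    residuals = [r["spend"] - expected_spend(r, overall, effects) for r in rows]
    spread = pstdev(residuals)
    rng = random.Random(seed)
    return [
        {**profile, "spend": max(0.0, rng.gauss(centre, spread)), "synthetic": True}
        for _ in range(n)
    ]


synthetic_rows = synthesize(target, records, n=5, seed=1)
for row in synthetic_rows:
    print(row)
```

Every factor level in the target profile must appear somewhere in the observed records, even though the full combination does not. An additive model ignores interactions between factors, so a production system would likely need a richer generative model and checks against real data.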

Market size and growth

Indika AI works with AI companies in India, North America, and Europe to develop customised solutions. It competes with firms such as iMerit Technology, Scale Labs, and Appen.

The company generated revenue of Rs 60 lakh in the ten months following its inception in May last year. The startup expects stronger growth this financial year and aims to touch about Rs 5 crore in revenue.

Last year, Indika AI raised an undisclosed amount in a Pre-Seed round from Dr Anshul. The company plans to raise Seed funding soon.

The global data collection and labelling market size was valued at $1.67 billion in 2021 and is expected to expand at a CAGR of 25% from 2022 to 2030, according to business consulting firm Grand View Research.
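As a rough check on what that growth rate implies, the short Python sketch below compounds the 2021 figure forward. It assumes growth compounds annually from the 2021 base over the nine years to 2030; the resulting number is an illustration of the arithmetic, not a forecast quoted from Grand View Research.

```python
def project_value(base, annual_growth, years):
    """Compound a base value at a fixed annual growth rate."""
    return base * (1 + annual_growth) ** years


BASE_2021_USD_BN = 1.67   # market size in 2021, from the article
CAGR = 0.25               # compound annual growth rate, from the article
YEARS = 2030 - 2021       # assumption: nine compounding periods from the 2021 base

projected_2030 = project_value(BASE_2021_USD_BN, CAGR, YEARS)
print(f"Implied 2030 market size: ${projected_2030:.2f} billion")
```

Under these assumptions, the market would be worth roughly $12.4 billion by 2030, more than seven times its 2021 size.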

The data annotation industry in India is at a nascent stage. According to a NASSCOM report, the data annotation market serviced by India could exceed $7 billion by 2030.

Team Indika AI

The Indika AI team comprises over 100 people, including domain experts, data scientists, solution architects, and trained annotators.

Hardik, who takes care of the legal AI team, has a background in corporate, tax, and IP laws. He had earlier worked with consulting firms Ernst & Young and Baker Tilly International. Dr Anshul, the co-founder of Indika AI, is also the co-founder of Accern, a no-code fintech AI firm based in the US.
