Poor data quality is estimated to cost organisations an average of $12.9 million annually, according to Gartner. A Deloitte survey found that 67% of executive respondents are uncomfortable using data from advanced analytics systems, even in companies with data-driven cultures. It turns out the old adage “garbage in, garbage out” is more relevant than ever.
Data quality – a key component of data management – plays a crucial role in customer trust, innovation, and business opportunities, and significantly impacts the accuracy and reliability of information used for decision-making. However, several challenges affect data quality: privacy and data protection laws, errors and inaccuracies hidden in large volumes of data, duplicate data, dark data, and the difficulty of building a data quality culture in organisations.
YourStory, in association with Snowflake, organised a round table with technology leaders to explore the implications of poor-quality data, best practices to address quality, and strategies tailored to each organisation and its long-term success. The discussion, titled ‘The question of data quality: strategies, best practices & challenges’, delved deep into the challenges of maintaining high data quality, the impact of poor data on decision-making, and the need for bespoke solutions.
The panellists included Anand Gopal, Vice President of Product Management, HackerRank; Vishwastam Shukla, Chief Technology Officer, HackerEarth; Joydeep Banik Roy, Head of Data Science and ML Engineering, Zeotap; Dev Kumar, Co-founder and Chief Product and Technology Officer, Prosperr.io; Shikhar Jaiswal, Head of Product Management, Xeno; Arvind Subramanian, VP Engineering – Software, SkyServe; Arvind Singh, Chief Technology Officer and Chief Information Officer, Puravankara Group; and Sumeet Tandure, Senior Manager, Sales Engineering at Snowflake.
Why data quality matters
Defining the right quality of data is an important exercise. Measuring data quality can be complex and context-dependent, as what constitutes “good” data varies between industries and individual businesses. For example, a financial institution prioritises accuracy, since even minor data discrepancies can have significant economic impacts. Participants agreed that, to accomplish this, organisations must build data quality frameworks that complement their business operations and objectives.
Kumar, of Prosperr.io, said that in the financial services industry, everything functions on trust. “And trust comes with the data – how secure and accurate your data is, what quality of data is maintained so there is zero chance of error in terms of calculations,” he said.
Shukla, of HackerEarth, said skills data is a key component. The company’s mission is to dive deep into the skills of individuals, mostly developers, and match them with the right opportunities across the globe. Skills data affects a large number of individuals because it shapes their job prospects and reflects how their skill sets have evolved over the years. Data lineage plays a key role here – tracking the original systems from which data is sourced and how it is ingested into each downstream system.
“For techies like us who are used to capturing millions and millions of data points on a regular basis, it’s very easy to talk in terabytes and petabytes, but the smallest bit of data if incorrectly captured could significantly impact businesses,” he said.
The speakers emphasised a bespoke approach to data quality, since not every organisation needs everything. While some organisations need stronger data governance, others need data lineage or data security. It is therefore important for leaders to choose what fits their organisation best, and perhaps assemble a bespoke mix of products from their respective cloud providers.
People and processes are the other two sides of data quality. Organisations must put the right processes in place to ensure data quality does not deteriorate over time. The value of a platform is realised only when the data is accurate and customers can trust organisations with their data. If data quality is poor, decision-making will be fundamentally flawed.
Tools to measure and assess data quality
The participants discussed specific tools and strategies to improve data quality, such as Amazon SageMaker, Amazon QuickSight, Great Expectations, and large language model (LLM)-generated data checks. Tooling is highly context-driven and application-specific, they agreed, while touching upon the importance of data enrichment and the potential risks of data quality issues in real-time decision-making scenarios.
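To make the idea of a data quality check concrete, the sketch below shows the kinds of completeness, uniqueness, and validity rules that frameworks such as Great Expectations package as declarative, reusable expectations. It is a minimal, illustrative example in plain pandas; the dataset and column names are hypothetical and do not come from any panellist’s system.

```python
import pandas as pd

# Hypothetical customer-transactions extract; column names are illustrative only.
df = pd.DataFrame({
    "customer_id": [101, 102, 102, None],
    "amount": [250.0, -40.0, 99.5, 120.0],
    "country": ["IN", "IN", "US", "SG"],
})

# Completeness: key identifiers should never be null.
null_ids = df["customer_id"].isna().sum()

# Uniqueness: duplicate customer IDs often signal an upstream ingestion bug.
duplicate_ids = df["customer_id"].duplicated(keep=False).sum()

# Validity: transaction amounts are expected to be non-negative.
invalid_amounts = (df["amount"] < 0).sum()

report = {
    "null_customer_ids": int(null_ids),
    "duplicate_customer_ids": int(duplicate_ids),
    "negative_amounts": int(invalid_amounts),
}
print(report)  # e.g. {'null_customer_ids': 1, 'duplicate_customer_ids': 2, 'negative_amounts': 1}
```

Dedicated tooling takes checks like these further by versioning them, running them automatically on every load, and surfacing the results as data quality metrics.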
Data enrichment occurs at two levels: at source and at destination. It is important to create visibility into how enriched the data is, educate customers about healthy data, and work with them on the ground to deploy strategies informed by an understanding of their ecosystem. “Data enrichment is actually done by our customer data platform (CDP) because, as a company aggregating data, we have to provide enrichment, transformation, standardisation, and data cataloging facilities,” said Roy, of Zeotap. From a data science angle, one problem is the transformation of data into features, which needs a lot of pre-processing and custom processing. “So we are grappling with how a lot of LLM data getting generated actually gets passed on to other systems. The problem has shifted from actual data to LLM-generated data,” he added.
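As an illustration of the standardisation and enrichment a CDP typically performs before raw records can be turned into features, the sketch below normalises identifiers and country values and derives an age-band feature. The field names and mappings are hypothetical, chosen for illustration rather than taken from Zeotap’s actual pipeline.

```python
import pandas as pd

# Hypothetical raw profile records arriving from different sources.
raw = pd.DataFrame({
    "email": ["A@Example.com ", "b@example.com", "A@example.COM"],
    "country": ["india", "IN", "In"],
    "dob": ["1990-05-01", "1985-11-23", "1990-05-01"],
})

# Standardisation: trim and lower-case identifiers so records can be matched.
raw["email"] = raw["email"].str.strip().str.lower()

# Standardisation: map free-text country values to ISO-style codes.
country_map = {"india": "IN", "in": "IN"}
raw["country"] = raw["country"].str.lower().map(country_map).fillna(raw["country"])

# Enrichment: derive an age-band feature from date of birth.
age_years = (pd.Timestamp.today() - pd.to_datetime(raw["dob"])).dt.days // 365
raw["age_band"] = pd.cut(age_years, bins=[0, 25, 40, 60, 120],
                         labels=["<=25", "26-40", "41-60", "60+"])

# Deduplication: keep one record per standardised email.
profiles = raw.drop_duplicates(subset="email")
print(profiles)
```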
Jaiswal, of Xeno, said the company has been using tools to ensure that data is replicated accurately across its platform, whose analytics have to be highly relevant and real-time for customers. “As we scale, there’ll be a point in our journey where we will have to deploy more sophisticated tools. But currently, we are implementing the basics and building simple things in-house as a matter of principle,” he said.
From a data governance standpoint, Gopal, of HackerRank, believes in constantly educating his teams to understand whether they are operating in a high-risk or low-risk use case. “While there is a lot of bespoke tooling, the way we determine when to use what is by knowing where we stand,” he said.
“As we are growing, we are building tooling infrastructure that is needed to make the journey very fast. We are discovering and iterating our tooling based on what outcomes we need to deliver,” said Subramanian, of SkyServe, a spacetech and edge computing firm. He discussed the challenges of submitting models for execution in a specific environment, emphasising the need for security measures like obfuscation and model encryption. Subramanian also highlighted the constraints of running models on devices with limited power and time, such as satellites with only 20 minutes of operation.
Securing the data governance framework
An interesting development on the horizon is the Digital Personal Data Protection Act, which will come into force soon. The law will compel tech businesses to reconsider their data strategies, particularly around data residency and data transfer. Panellists agreed that human-generated data will become increasingly valuable: as AI-generated data is consumed across more domains, the focus will shift to how human-generated data is preserved and given its right value.
Singh, of Puravankara Group, suggested breaking data capture into stages to better align with different stages of the sales process, allowing for more targeted data collection. “We use the CDP of Salesforce to streamline data management and improve customer interactions,” he said, highlighting the challenges of culture change in traditional industries such as real estate and the importance of training and adopting new technologies.
The importance of data quality, security, and freshness was underscored, with a focus on centralised platforms and the role of data enrichment in enhancing AI models. With the technology tools available today, data governance is not seen as much of a challenge in fintech if the system is well architected, as the data is limited and concise. The issue of governing data arises at the operations level, with those who handle and protect user data. The discussion also covered the need for a clear separation of data governance across different areas of the business to avoid conflicts.
The potential of synthetic data was explored to address customer concerns about data privacy and compliance. The speakers also highlighted the importance of having an LLM observability platform to track and manage the performance of LLMs while providing real-time metrics for data quality. They also discussed the challenges of maintaining data accuracy across external systems and the importance of quantifiable data quality.