For GenAI to live up to its promise, a reliable flow of data is key. AI models are only as good as the data pipeline connections that bring in quality data.
Outdated connections mean more hallucinations and untrustworthy results, leaving data engineers struggling to manually integrate hundreds of AI data feeds. We spoke to Rivery co-founder and CEO Itamar Ben Hemo about why good data pipelines are key to success.
BN: Why is data pipeline quality critical to GenAI app reliability and trustworthiness?
IBH: GenAI applications tend to present their responses as highly accurate, but in reality they are subject to hallucinations to varying degrees. A GenAI app is only as good as the data that’s fed into it. If inconsistent, inaccurate or missing data is fed into the LLM engine, users of the GenAI application will very likely get inaccurate or incorrect results. Over time, users will begin to lose trust in the application and stop using it or, worse, they will trust answers built on incorrect data.
BN: How is building data pipelines for GenAI applications different from building data pipelines for analytics purposes?
IBH: When building data pipelines for analytics, data engineers often deal with structured data that is loaded into data warehouses or data lakes designed to handle large volumes of data. It is then modeled so it can be used by downstream applications such as business intelligence tools. This effort requires connecting pipelines to many different data sources, mainly databases and third-party SaaS applications such as Salesforce, NetSuite, Google Analytics, and others.
When building AI pipelines, the ingested data consists of structured, semi-structured, and now unstructured data that lives in text files, email exchanges, Slack messages, or free-form text fields collected via different applications (e.g. email exchanges within support systems such as Zendesk).
Ingested unstructured data needs to be handled differently from classic data modeling for analytics purposes. For example, to help LLMs produce better results, the data should be grouped together logically and then uploaded to a database that supports vector formats (e.g. Pinecone, Snowflake) or to an abstracted AI service like Amazon Bedrock to execute retrieval-augmented generation (RAG) workflows. This gives the GenAI app the context it needs to generate responses.
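To make that pattern concrete, here is a minimal ingestion sketch, assuming the OpenAI embeddings API and the Pinecone Python client; the index name, chunk size, and ticket fields are placeholder assumptions for illustration, not a description of any particular product's implementation.

```python
# Minimal RAG ingestion sketch: chunk support-ticket text, embed each chunk,
# and upsert the vectors into a vector database for later retrieval.
# Assumes the `openai` and `pinecone` Python packages; names are placeholders.
from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()                      # reads OPENAI_API_KEY from the environment
pc = Pinecone(api_key="YOUR_PINECONE_KEY")    # placeholder credential
index = pc.Index("support-tickets")           # hypothetical index name

def chunk(text: str, size: int = 800) -> list[str]:
    """Split free-form text into roughly fixed-size chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def ingest_ticket(ticket_id: str, body: str) -> None:
    chunks = chunk(body)
    # One embedding per chunk; the model name is an assumption.
    resp = openai_client.embeddings.create(model="text-embedding-3-small", input=chunks)
    vectors = [
        {
            "id": f"{ticket_id}-{i}",
            "values": item.embedding,
            "metadata": {"ticket_id": ticket_id, "text": chunks[i]},
        }
        for i, item in enumerate(resp.data)
    ]
    index.upsert(vectors=vectors)

ingest_ticket("ZD-1042", "Customer reports login failures after the latest release...")
```

At query time, the GenAI app embeds the user's question the same way and retrieves the nearest chunks as context for the LLM.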
For data engineers, that means they have new data sources to extract data from and potentially new target locations to load data into. It also means they need to transform data in new ways, beyond the modeling techniques designed for analytics. Finally, data engineers are spending more effort orchestrating new kinds of data workflows for GenAI apps, whereas for analytics, landing and modeling data within warehouses is typically sufficient.
BN: What are the best ways to design, implement, and maintain data pipelines that prioritize security and scalability?
IBH: While there are popular open source libraries to connect to data sources and replicate data (e.g. Debezium helps run CDC processes on top of databases), running them at scale and in line with security and compliance standards (e.g. SOC 2, HIPAA, GDPR) is a challenging mission for data teams.
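As an illustration of the operational surface involved, the sketch below registers a Debezium PostgreSQL CDC connector through Kafka Connect's REST API. The hostnames, credentials, and table list are placeholders, and exact configuration keys vary across Debezium versions.

```python
# Sketch: register a Debezium PostgreSQL CDC connector via the Kafka Connect REST API.
# Endpoint, credentials, and table list are placeholders; config keys vary by version.
import requests

connector = {
    "name": "orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",
        "database.hostname": "db.internal.example.com",
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "********",
        "database.dbname": "orders",
        "topic.prefix": "orders_db",
        "table.include.list": "public.orders",
    },
}

resp = requests.post("http://kafka-connect.example.com:8083/connectors", json=connector)
resp.raise_for_status()
print(resp.json())
```

Even this single connector implies a Kafka cluster, a Connect worker, secret management, and monitoring, which is the kind of operational overhead that makes scale and compliance hard to sustain in-house.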
Unless teams are prepared to set up infrastructure and invest time in maintenance, a platform that manages these processes at scale can take over the burden of managing data infrastructure and data protection. Adopting an ELT-first approach that uses the compute resources of cloud data warehouses to transform data means there are no memory or compute limitations, unlike ETL-first tools that transform data on local servers. With an ELT approach, data teams benefit not only from the scale to handle any volume of data but also from the ability to iterate on and adapt data models faster. In this scenario, the raw data is already available to analysts connecting to the warehouse, without the need to go back to the source systems and modify data extraction processes.
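Here is a minimal sketch of that ELT pattern, assuming the Snowflake Python connector: raw data is landed first, and the transformation runs as SQL inside the warehouse rather than on a local server. Connection details, schemas, and table names are placeholders.

```python
# ELT sketch: land raw data first, then transform inside the warehouse with SQL.
# Uses the snowflake-connector-python package; all identifiers are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",
    user="loader",
    password="********",
    warehouse="TRANSFORM_WH",
    database="ANALYTICS",
    schema="RAW",
)
cur = conn.cursor()

# Step 1 (EL): raw, untransformed records have already been loaded into
# ANALYTICS.RAW.ZENDESK_TICKETS by the extraction job; analysts can query them directly.

# Step 2 (T): the transformation is pushed down to the warehouse's compute,
# so there are no local memory or CPU limits and models can be re-run cheaply.
cur.execute("""
    CREATE OR REPLACE TABLE ANALYTICS.MODELED.DAILY_TICKET_COUNTS AS
    SELECT DATE_TRUNC('day', created_at) AS day, COUNT(*) AS tickets
    FROM ANALYTICS.RAW.ZENDESK_TICKETS
    GROUP BY 1
""")
conn.close()
```

Because the raw table persists in the warehouse, reworking the model is a matter of editing and re-running the SQL, not re-extracting from the source system.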
BN: What are the key challenges that data teams encounter to steadily ingest high-quality data into their pipelines?
IBH: With the advent of the cloud, and more specifically the SaaS revolution, organizations started to adopt software much faster without going through lengthy acquisition processes. Instead of data residing within a handful of on-premises systems, data teams now need to manage pulling data from hundreds of data sources, which are typically accessed via APIs provided by the vendors of the source applications.
The challenge is that leveraging those APIs requires not only writing code to build the data pipelines that extract the data but also maintaining those pipelines over time, as APIs frequently change and evolve. Teams that don't use a managed solution to extract data are at a disadvantage, trying to keep up with ever-changing data source APIs to ensure a consistent flow of data to the business.
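As a simple illustration of that maintenance burden, the sketch below pulls records from a hypothetical SaaS REST API using cursor-based pagination. Every detail here (endpoint, auth header, pagination contract) is an assumption the vendor can change at any time, which is exactly what breaks hand-rolled extractors.

```python
# Sketch: extract records from a hypothetical SaaS API with cursor pagination.
# The endpoint, auth scheme, and response shape are assumptions; vendors change
# these regularly, which is what makes hand-maintained extractors fragile.
import requests

BASE_URL = "https://api.example-saas.com/v2/records"   # hypothetical endpoint
HEADERS = {"Authorization": "Bearer YOUR_API_TOKEN"}    # placeholder token

def extract_all() -> list[dict]:
    records, cursor = [], None
    while True:
        params = {"page_size": 100}
        if cursor:
            params["cursor"] = cursor
        resp = requests.get(BASE_URL, headers=HEADERS, params=params, timeout=30)
        resp.raise_for_status()
        payload = resp.json()
        records.extend(payload["data"])
        cursor = payload.get("next_cursor")   # assumed pagination contract
        if not cursor:
            return records
```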
BN: How will data professionals’ roles change as they rely on AI to ensure data pipeline integrity?
IBH: As in many other domains, AI will bring a new level of automation and efficiency to the work of data engineers. While this will speed up their work (e.g. by generating SQL or Python code faster), data engineers will have to put the right checks and balances in place to ensure AI hallucinations don't produce incorrect outputs.
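One example of such a check, sketched below, is a guardrail that refuses to run LLM-generated SQL unless it is a read-only query against an allow-listed set of tables. This is a simplified, hypothetical policy, not a complete safeguard.

```python
# Sketch: a guardrail that validates LLM-generated SQL before it is executed.
# Simplified, hypothetical policy: read-only statements against an allow-listed
# schema only; a real deployment would also review results and log decisions.
import re

ALLOWED_TABLES = {"daily_ticket_counts", "zendesk_tickets"}   # assumed allow-list
FORBIDDEN = re.compile(r"\b(insert|update|delete|drop|alter|truncate|grant)\b", re.I)

def is_safe(generated_sql: str) -> bool:
    sql = generated_sql.strip().rstrip(";")
    if not sql.lower().startswith("select"):
        return False                       # read-only queries only
    if FORBIDDEN.search(sql):
        return False                       # no data-modifying keywords
    tables = set(re.findall(r"\bfrom\s+([a-z_][\w.]*)", sql, re.I))
    return all(t.split(".")[-1].lower() in ALLOWED_TABLES for t in tables)

assert is_safe("SELECT day, tickets FROM daily_ticket_counts")
assert not is_safe("DROP TABLE zendesk_tickets")
```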
As AI tools become more sophisticated, data engineers will also increasingly focus on higher-level responsibilities. For instance, they’ll need to ensure data quality, manage data governance, and create scalable architectures that can accommodate AI applications. This requires not just technical skills but also strong communication and collaboration abilities, as data engineers work closely with data scientists, analysts, and business leaders to align data strategies with organizational objectives. As AI continues to evolve, the demand for skilled data engineers who can bridge the gap between technology and business will only grow, ensuring their crucial role in the data ecosystem for years to come.
Image credit: anterovium/depositphotos.com