In this episode of The AI Kubernetes Show, we talked with Chris Khanoyan, Tech Lead and Senior Data Scientist at Booz Allen, about the rapid evolution of AI, the foundational challenges of data governance, and the technology’s role in increasing accessibility.
This blog post was generated by AI from the interview transcript, with some editing.
Khanoyan highlighted how quickly the data science landscape is changing, with new capabilities emerging at a remarkable pace. There's a significant overlap between data scientists and data engineers, particularly when it comes to transforming data for AI and machine learning initiatives. Today's AI tools have made coding accessible to practically anyone; even people who aren't tech-savvy can generate code.
However, despite the ease of auto-generated code, a solid technical foundation remains absolutely critical. It’s important to understand the basics, because when a bug inevitably appears in your code, you need to know what it means, where to look for it, and how to find a solution.
Data governance and management play a huge role in successful AI implementation. To start, data scientists and engineers have to identify the data pipeline, the various data sources, and the different systems involved.
Controlling the data is also essential. Even when creating data pipeline code with AI, practitioners need to consider how that process controls the governance of the data itself. Looking ahead, the industry is focused on learning more about how to govern, manage, and protect data, as well as understanding the associated risks.
The cleanliness and relevance of data are universal challenges, both when training Large Language Models (LLMs) and when using data as context. One of the most important steps is establishing data provenance: where the data comes from, who owns the dataset, and, for an AI-generated dataset, who is responsible for its origin.
When cleaning a dataset, you need to determine how much data is actually necessary. For training models, it’s essential to select the relevant data that sets the model up for success – you don't always need the entire dataset.
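As a rough illustration of that kind of pruning, the sketch below trims a hypothetical tabular dataset down to the rows and columns that matter for one task before training. The file and column names are invented for illustration and are not from the interview.

```python
# A rough sketch of narrowing a dataset to what the model actually needs.
# The file name and column names are hypothetical placeholders.
import pandas as pd

# Load the full dataset, then keep only the fields relevant to the task.
df = pd.read_csv("support_tickets.csv")
relevant = df[["ticket_text", "product_area", "resolution_code"]]

# Drop rows that can't contribute a useful training signal.
relevant = relevant.dropna(subset=["ticket_text", "resolution_code"])
relevant = relevant.drop_duplicates(subset=["ticket_text"])

# Keep only the slice that matters for this model, rather than everything.
training_set = relevant[relevant["product_area"] == "billing"]
training_set.to_csv("billing_training_set.csv", index=False)
```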
A core principle for starting any AI project is to begin with the end in mind. Think about what you want to accomplish and then design towards that goal. Ultimately, data is the core part of AI. Everything starts with figuring out where the data is coming from, what you are going to use it for, and how you will prepare it to meet your objectives.
Working with data often presents two big hurdles: not having the necessary access and a general lack of data. When dealing with non-technical clients, especially in sectors like the federal government, access is a common sticking point. To perform certain actions with data across different organizations, a formal agreement is usually required to ensure you're granted the necessary permissions.
Data scarcity, while not ideal, has existing workarounds. To compensate, it's often necessary to combine various datasets to get the full picture. That usually means pulling data from both on-prem and cloud systems and consolidating it into a single, central repository. This centralized location, whether on-prem or in the cloud, is fed by different data pipelines. A good example of an efficient data management pattern on Google Cloud is combining Pub/Sub and Dataflow.
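As a minimal sketch of what that Pub/Sub-plus-Dataflow pattern can look like in practice, the snippet below defines an Apache Beam streaming pipeline that reads messages from a Pub/Sub subscription and appends them to a central BigQuery table. The project, subscription, and table names are placeholders, not details from the interview, and the target table is assumed to already exist with a matching schema.

```python
# Minimal Apache Beam pipeline (runnable on Dataflow) that reads events
# from a Pub/Sub subscription and lands them in a central BigQuery table.
# Project, subscription, and table names are hypothetical placeholders.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    options = PipelineOptions(
        streaming=True,
        project="example-project",   # placeholder project ID
        region="us-central1",
        runner="DataflowRunner",     # use "DirectRunner" to test locally
    )

    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                subscription="projects/example-project/subscriptions/events-sub")
            | "ParseJSON" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                table="example-project:central_repo.events",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            )
        )


if __name__ == "__main__":
    run()
```

The same shape works for consolidating on-prem sources as well: anything that can publish to the Pub/Sub topic feeds the central repository through the one pipeline.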
As a deaf individual, Khanoyan offered a unique perspective on how AI is transforming accessibility. AI assistance is helping to ease the mental fatigue that comes with constantly processing captions and interpreters. This allows for a quick review of scenarios, boosting productivity.
A significant new development is the potential of AI-powered glasses, like the Ray-Ban Meta product, which provide live captions. It's a game changer because it frees the user from having to stare at a screen to read captions. However, widespread adoption of this technology faces a security barrier due to data sensitivity: customers are concerned about what happens to the captioned data and where it goes. The hope is that one day, users will be able to control the data source behind the captioning.
You can connect with Chris Khanoyan and learn more about his work as a tech lead and senior data scientist at Booz Allen on LinkedIn and GitHub.
The main challenge is ensuring that the data used for AI is clean and well-governed and that the data set's source and ownership are clearly understood.
AI has automated code generation, making it possible for people without a technical background to produce working code. This shifts the focus for technical practitioners to understanding the foundational elements, ensuring data quality, and controlling the governance of the data.
The consensus among technical experts is to begin with the end in mind, think about what you want to accomplish, and then design towards that.