Laying a Unified Data Foundation for an AI-Powered Future
- By Masahiro Waki, NetApp
- July 16, 2024
Enterprises around the world have been eager to adopt GenAI since it burst onto the scene a few years ago. According to NetApp’s 2024 Cloud Complexity Report, AI-leading organizations globally are more likely to report benefits from AI, including a 50% increase in production rates, 46% in the automation of routine activities, and a 45% improvement in customer experience.
Closer to home, enterprises in Asia are also looking to harness benefits from GenAI. The GenAI market in Asia is expected to grow at an annual rate of 22.65% between 2024 and 2030, reaching US$60.83 billion by 2030, according to Statista.
While the potential of GenAI is immense, AI and GenAI are only as good as the data that fuels them. The large language models (LLMs) behind GenAI depend on an immense amount of data, which must be well organized and delivered through a fast data pipeline for the models to function effectively.
Think of data and GenAI as the basic foundation and building blocks that form the larger structure of enterprise innovation. To lay a strong foundation that lets firms maximize ROI from their AI investments, corporate data must be meticulously prepared and optimally used with the right tools throughout its life cycle.
The 5W-1H rule
The most common sources of data in most organizations are internal product documentation, sales and purchase records and support information, as well as media such as videos and images. Proper organization is essential for the GenAI engine to use this wide range of data efficiently.
Enter the 5W-1H way of preparing data.
When: The first step is to know when and how often data is collected. In-house product documentation is created whenever product releases and updates occur. Customer information is stored in CRM or other systems as needed. In the case of video and audio, the files may be generated in real time. A system must be built to promptly collect data when it is created.
Where: Network sprawl across the data estate can pose a challenge to data storage and management. In-house product documentation is edited on a local PC and can be stored on a file server or online in the cloud. Customer information is typically stored in a database on-premises or in the cloud. Video and audio are often generated at the edge and must be collected over the network. A system needs to be properly set up to collect data at each location.
Who: It’s crucial for ownership of the various data to be clearly defined. Only then can business owners and stakeholders work with the data owners, with clear accountability, to better manage and protect the data and ensure it is used consistently.
What: Data comes in multiple forms and formats. Product documents are typically stored in word processing file formats, while customer information is often stored as structured data in databases. Media files, which are unstructured, include video and audio. Understanding the type of data being handled enables organizations to pre-process and analyze it effectively.
Why: When tapping data for AI, users need to define the issues at hand from the start. Doing so enables an organization to focus on the most relevant data. Measurable numerical targets should also be identified and used to track progress over time.
How: Appropriate methods for data collection should consider the nature and location of data. For instance, the collection of file server data uses protocols like NFS or CIFS. Data collection from databases uses appropriate accounts and database-specific protocols. And for collection of real-time data, the ability to work with edge devices is vital.
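As a rough illustration, the six questions above could be captured as one catalog record per data source. The following is a minimal Python sketch, not a prescribed implementation: the class name DataSourceProfile, its field names, the missing_answers helper, and the sample values (including the business goal under "why") are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from enum import Enum


class DataKind(Enum):
    """Broad forms of data named in the 'What' step."""
    DOCUMENT = "document"      # e.g. word processing files
    STRUCTURED = "structured"  # e.g. customer records in a database
    MEDIA = "media"            # e.g. video and audio


@dataclass(frozen=True)
class DataSourceProfile:
    """Hypothetical catalog entry answering the 5W-1H questions for one source."""
    name: str
    when: str       # collection trigger or frequency
    where: str      # storage location
    who: str        # accountable data owner
    what: DataKind  # form of the data
    why: str        # business issue and measurable target
    how: str        # collection method or protocol

    def missing_answers(self) -> list[str]:
        """Return the names of free-text 5W-1H fields that are still blank."""
        text_fields = ("when", "where", "who", "why", "how")
        return [f for f in text_fields if not getattr(self, f).strip()]


# Sample entry; every value here is an illustrative assumption.
product_docs = DataSourceProfile(
    name="product-docs",
    when="on each product release or update",
    where="file server",
    who="",  # owner not yet assigned
    what=DataKind.DOCUMENT,
    why="shorten support resolution time",
    how="NFS",
)

print(product_docs.missing_answers())  # ['who']
```

A record like this makes gaps visible early: in the sample entry, the blank owner field is flagged before any data is collected for AI use.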
Boosting speed to deployment
MLOps is key to streamlining workflows and speeding up the shift from AI development to production. At this stage, it's essential to make full use of your organization's storage infrastructure, both for MLOps and for data operations such as data pipelines and DataOps.
To facilitate a smooth and rapid deployment, the following key features of enterprise storage systems should be leveraged.
Optimizing Enterprise Data Management
Enterprise storage facilitates the collection of data from diverse sources through multi-protocol support, including Network File System (NFS) and Common Internet File System (CIFS). It also streamlines essential management tasks such as data protection, versioning, and security for AI applications. Moreover, recent advancements in container support enable MLOps and DataOps, allowing data scientists to concentrate on AI model development.
Advancing Hybrid Multi-Cloud Strategy
Setting up a hybrid multi-cloud environment and data mobility is crucial, especially for GenAI and LLMs that rely on cloud-exclusive services and functionalities.
Organizations in India and Singapore are leading in this regard, with 70% and 69% of respondents, respectively, already on hybrid cloud and well positioned for AI deployments, according to NetApp’s 2024 Cloud Complexity Report. This is well above the global average of 58% hybrid cloud adoption.
Establishing a data pipeline between on-premises environments and the cloud offers flexibility and scalability to support AI initiatives. Enterprise storage provides features that work with cloud vendors' object storage, or enable mirroring and caching within the cloud, facilitating a tailored hybrid cloud strategy for enterprises.
Implementing Security Measures
Security in AI is vital as data is constantly exposed to the risk of cyberattacks. Enterprise storage systems equipped with security features like multi-tenancy and encryption ensure data safety while managing it efficiently. This allows companies to hold multiple datasets in minimal space for auditing and compliance purposes. Understanding the nature of data and applying these security measures enables organizations to drive transformation and deepen business insights.
Conclusion
As businesses stand on the edge of the AI revolution, data serves as the building blocks of AI innovation, with each piece holding the potential for transformative insights. However, organizations must first understand the nature of the data they wish to leverage.
Collection mechanisms become more complex when data is varied and widely dispersed across the data estate. Approaches such as MLOps on the one hand, and data pipelines and DataOps on the other, can help from the AI-centric and data-centric operational perspectives, respectively.
A winning formula can be achieved by appropriately blending the approaches to accelerate organizations’ GenAI programs, helping edge out competition in the race towards AI supremacy.
The views and opinions expressed in this article are those of the author and do not necessarily reflect those of CDOTrends.