How Commvault and Lucidworks Turn Backups Into A Searchable Data Lake
In a story I wrote earlier this year, I referenced how data protection can actually empower a new generation of app development. As part of this, I said that there’s no reason the backup function can’t play a vital role in helping companies drive innovation.
How can backups help? I asserted that backups in a platform could be put to the following uses:
- Creating a metadata catalog
- Using data crawls
- Establishing better search functionality within the data
- Serving as a transformation engine
- Operating as a workflow engine
- Analyzing the use of data over time
And in the piece, I also suggested that those mechanisms, within a platform, could be used to satisfy a variety of use cases and that the use cases included:
- Understanding what data you have
- Getting access to all the data
- Extracting nuggets
- Looking back in time
- Performing metadata analytics
(For more detail on each of these bullet points, please see the original story: “How Data Protection Platforms Can Power A New Generation Of Apps, AI And Data Science.”)
Therefore, it was very interesting to hear that Commvault had formed a partnership with Lucidworks, an AI-powered search company whose technology should be able to do many of the things I mentioned above.
The idea is that the dynamic index Commvault creates as part of its process can be fed into Lucidworks, which can then search based on the content, but can also use its algorithms to do things like enrich the data and find related data in a form of entity resolution. In my view, such a combination goes a long way toward providing a business with many of the benefits a data lake provides, including many of the Data Lake 2.0 benefits I describe in the “Saving Your Data Lake” research mission on Early Adopter Research.
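Neither company publishes the internal API of this integration, so as a rough illustration only, here is a minimal sketch of the underlying pattern: content surfaced by a backup catalog feeding a searchable inverted index. All names here (`build_index`, `search`, the sample documents) are hypothetical, not Commvault or Lucidworks interfaces.

```python
from collections import defaultdict

def build_index(docs):
    """Build a toy inverted index: term -> set of doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of docs containing every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

# Stand-in for content a backup process has already collected
backup_docs = {
    "doc1": "quarterly sales contract with acme",
    "doc2": "acme purchase order payment terms",
    "doc3": "employee onboarding checklist",
}

idx = build_index(backup_docs)
print(sorted(search(idx, "acme contract")))  # ['doc1']
```

The point of the pattern is that the backup process has already done the expensive work of collecting and cataloging content, so the search layer only has to index and enrich it.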
With Lucidworks’ integration with the Commvault Data Platform, customers will be able to search across all the data in Commvault’s dynamic index. In addition, AI and ML algorithms can search through the data to support more advanced tasks such as:
- Entity resolution, through which all data associated with a person, company or product is assembled
- Compliance auditing, ensuring no traces remain of data that is supposed to be deleted, and that data is not being stored in a way that incurs regulatory liability
- Advanced data quality, finding contradictions or toxic data (see “Toxic Data: A New Challenge For Data Governance And Security”)
- Supporting faster legal eDiscovery or investigations with concept clustering (naturally occurring keyword and proximity clusters derived from unsupervised ML, which can be tuned with supervised ML)
- Addressing misspellings in search queries or content
- Categorization and classification of document types, such as contracts and procurement documents, for various types of analysis and processes. For example:
  - Show me contracts with no GDPR provisions.
  - Show me purchase orders whose payment terms don’t match our payables system.
- Sentiment and other analyses that could indicate the frame of mind of customers, employees and partners (frequently used in customer and employee experience initiatives, or in employee, partner and vendor performance evaluations)
- Triggering actions regarding VIPs (employees, partners, customers) when events are sensed through communications monitoring, a common approach to customer churn, employee turnover, fraud detection and so on
- Leveraging AI to enrich centralized metadata to enable Information Governance with automated data policies
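Entity resolution, the first item above, is worth making concrete. A minimal sketch of the idea, under the assumption that records sharing any normalized identifying key (email, phone) refer to the same entity, follows; this is my own illustration, not the algorithm Lucidworks actually ships, and real systems use fuzzier matching.

```python
from collections import defaultdict

def resolve_entities(records, keys=("email", "phone")):
    """Naive entity resolution: records that share any identifying
    key value are merged into one entity via union-find."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    seen = {}  # (key, normalized value) -> first record index
    for i, rec in enumerate(records):
        for key in keys:
            value = rec.get(key)
            if not value:
                continue
            token = (key, value.strip().lower())
            if token in seen:
                union(i, seen[token])
            else:
                seen[token] = i

    groups = defaultdict(list)
    for i, rec in enumerate(records):
        groups[find(i)].append(rec)
    return list(groups.values())

records = [
    {"name": "J. Smith", "email": "jsmith@example.com"},
    {"name": "John Smith", "email": "JSmith@Example.com", "phone": "555-0100"},
    {"name": "Jane Doe", "email": "jdoe@example.com"},
]
clusters = resolve_entities(records)
print(len(clusters))  # 2: the two Smith records merge on email
```

Assembling every record tied to one person or company this way is exactly what compliance audits and eDiscovery requests demand.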
Perhaps the most exciting use cases are related to finding and assembling data for training of predictive ML models:
- Discovery of data that could be useful in predictive big data applications – an interesting area for data scientists trying to discover and use correlation data to predict various outcomes
- Predictive/recommendation searching (if the user likes this, they’ll probably like that)
- Predictions of which customers will likely be impacted by known product faults based on their service and support history
- Anomaly detection for product performance. Such techniques could be applied to IoT data, hardware/software instrumentation, or be used for defect prediction.
While all of these use cases could be pursued with other data sets, using the Commvault dynamic index means the largest, freshest set of data can be brought to bear, so the use cases work optimally.
A New Kind of Data Lake?
How does the Commvault-Lucidworks partnership get us down the road toward a de facto data lake? Well, for one, by pairing search and AI technology with backup and data protection, companies can glean more from their existing data and leverage it for analytics.
As the two companies’ press release about the partnership points out, merging Commvault’s “ability to collect, index and store data from all across an organization” with “Lucidworks’ AI and machine learning capabilities will enable enterprises to perform content-aware discovery and analysis of critical content across data sources, whether on-premises or in the cloud. The integration will also provide a constant stream of data enrichment for Commvault data: when updated AI models from Lucidworks and search usage data (signals) are reapplied to data from Commvault, further context and meaning is applied to the data under management. Enriched data offers the user a guided and faster data discovery experience to quickly find the most relevant data to reduce the time and expense incurred in discovery events or power new search application use-cases.”
I point all this out because I find it exciting that Commvault and Lucidworks also intend to create new products using AI “to help identify and remediate sensitive data, apply data policy and automate workflows to enable and enforce data policies and processes. This will allow customers to close the process gaps and eliminate the multiple niche tools typically associated with the various phases of discovery, information management and risk management.”
The end result could very well be an ability to track data and leverage it with AI that serves all the functions of the data lake. It’s a partnership worth keeping an eye on to see how much of this vision can be brought to life.