Informatica has just announced that they have made another acquisition this summer: GreenBay Technologies, a startup focused on AI and machine learning. Read about their July 2020 acquisition here.

GreenBay Technologies brings CloudMatcher to Informatica’s Intelligent Data Platform (IDP). CloudMatcher uses machine learning to automate entity matching and schema matching tasks with high accuracy. This impacts several key data management capabilities such as master data management, data cataloging, data quality, governance, and data integration.

This acquisition adds to the core capabilities of Informatica’s CLAIRE® engine. Informatica has previously collaborated with and invested in GreenBay Technologies, with some elements of CloudMatcher technology already embedded in Informatica products. Founders and team members of GreenBay Technologies will join Informatica as full-time employees. The prior collaboration, the investment, and the shared aim of providing a more complete view and understanding of enterprise data all foreshadow a successful acquisition.

Empowering the Business User with AI-Powered Data Management Tools

CloudMatcher further strengthens Informatica’s approach to no-code data management. It brings innovations to both schema matching and entity matching. While schema matching determines that two columns in two tables, such as “ename” and “emp name”, are semantically the same, entity matching determines that two records, such as (David Smith, Acme Company) and (Dave M. Smith, Acme), are the same real-world entity.
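To make the distinction concrete, here is a minimal sketch of both tasks using plain string similarity from Python's standard library. It is not CloudMatcher's method, which relies on machine learning. The function names, the 0.6 threshold, and the averaging of field scores are illustrative assumptions.

```python
from difflib import SequenceMatcher


def _normalize(text: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] after normalization."""
    return SequenceMatcher(None, _normalize(a), _normalize(b)).ratio()


def match_columns(left_cols, right_cols, threshold=0.6):
    """Schema matching (toy version): pair each left column with its most
    similar right column, keeping the pair only if it clears the threshold.
    The threshold value is an arbitrary assumption."""
    pairs = []
    for left in left_cols:
        best = max(right_cols, key=lambda right: similarity(left, right))
        if similarity(left, best) >= threshold:
            pairs.append((left, best))
    return pairs


def record_score(rec_a, rec_b) -> float:
    """Entity matching (toy version): average field-by-field similarity
    of two records that have the same number of fields."""
    scores = [similarity(x, y) for x, y in zip(rec_a, rec_b)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Column names and records taken from the examples in the text.
    print(match_columns(["ename", "salary"], ["emp name", "dept"]))
    print(record_score(("David Smith", "Acme Company"),
                       ("Dave M. Smith", "Acme")))
```

Fixed similarity thresholds like these are exactly the kind of hand-crafted rule that a learned matcher is meant to replace.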

Historically, entity matching has remained a complex problem because most solutions are rule-based, requiring skilled developers and a significant amount of time. Many existing solutions to schema and entity matching use hand-crafted rules. GreenBay Technologies has created a blend of “declarative rules” and “AI rules” for match classification. With this acquisition, Informatica adds a machine learning model that can capture complex matching rules of the kind that users cannot create manually. The supervised learning used in this approach requires relatively little effort from the user to train the system. The active learning stage presents the business user with interactive labeling exercises for tuple pairs from two separate tables. As the user provides more feedback and curation, the system learns from it and continues to improve over time. This workflow goes through multiple iterations and yields a high degree of accuracy. CloudMatcher is also highly amenable to distributed and parallel processing, allowing the solution to scale to very large data sets and dramatically reduce the manual data stewardship required.
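The labeling loop described above can be sketched as a toy active learning routine. This is an illustration under stated assumptions, not CloudMatcher's implementation: the "model" is a single similarity threshold, the selection strategy is uncertainty sampling (ask about the pair closest to the current threshold), and `ask_user` is a hypothetical callback standing in for the business user's yes/no answer.

```python
from difflib import SequenceMatcher


def pair_score(a: str, b: str) -> float:
    """Similarity of two record strings in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fit_threshold(labeled):
    """Put the decision threshold midway between the highest-scoring
    labeled non-match and the lowest-scoring labeled match."""
    matches = [score for score, is_match in labeled if is_match]
    non_matches = [score for score, is_match in labeled if not is_match]
    low = max(non_matches) if non_matches else 0.0
    high = min(matches) if matches else 1.0
    return (low + high) / 2


def active_learning(pairs, ask_user, rounds=4):
    """Repeatedly ask the user to label the most uncertain pair, then refit.
    Returns the learned threshold; pairs scoring at or above it are
    predicted to be matches."""
    unlabeled = [(pair_score(a, b), (a, b)) for a, b in pairs]
    labeled = []
    threshold = 0.5  # starting guess before any labels exist
    for _ in range(rounds):
        if not unlabeled:
            break
        # Uncertainty sampling: the pair nearest the threshold is the one
        # the current model is least sure about.
        pick = min(unlabeled, key=lambda item: abs(item[0] - threshold))
        unlabeled.remove(pick)
        score, pair = pick
        labeled.append((score, ask_user(pair)))
        threshold = fit_threshold(labeled)
    return threshold


if __name__ == "__main__":
    # Hypothetical candidate pairs and answers, for demonstration only.
    candidate_pairs = [
        ("David Smith, Acme Company", "Dave M. Smith, Acme"),
        ("David Smith, Acme Company", "Maria Lopez, Globex"),
        ("Acme Company", "Acme Co."),
        ("Globex", "Initech"),
    ]
    answers = dict(zip(candidate_pairs, [True, False, True, False]))
    print(active_learning(candidate_pairs, answers.__getitem__))
```

A one-number threshold cannot express the complex rules a real learned matcher captures, but the loop has the same shape: select an informative pair, collect a label, retrain, repeat.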

This opens the door for many data management services to be “hands-off”, meaning easily usable by business users without the need for coding or a developer background. As data sets grow ever larger, the improved match rate offered by CloudMatcher will help reduce false positives. By applying a crowdsourcing approach in which business users label training data, Informatica can configure a sophisticated data matching system that adapts to an organization’s data landscape in a few hours as opposed to days or weeks. This technology also helps extend matching beyond identity data to product, supplier, location, and other data domains with even higher accuracy.

AI on Demand: A Continuation of the Platform Economy

We see the trend of harnessing AI to make complex tasks more accessible to non-technical users rising in many different fields. Conversational AI building platforms are one such area, with many vendors providing the general technology and language capabilities along with support for training the chatbot on a customer’s specific industry and corporate knowledge. Cybersecurity is another, where AI capabilities come already integrated into the solution and can be trained in a context-specific environment, rather than requiring a “build it yourself” approach. This may shift in the future though, as methodologies for presenting the AI development process (data preparation, model selection, training, validating, implementation, maintenance, and retirement) to non-AI experts improve. Some tech companies (read about some examples here and here) are offering AI capabilities irrespective of industry; they aim to provide the infrastructure, data pipeline management, automated support to determine appropriate learning models, and governance for the AI lifecycle.

The data management use cases that we see Informatica addressing have the potential to unlock more data-based insights at a lower cost, whether by reducing time-to-value or by enabling subject matter experts to work with their data as a data scientist or developer would. Enabling more data-driven business will shape processes, internal decisions, production goals, and much more. But this is only possible with strong data management practices, including the correct labeling of entities. User-friendly AI to manage this step is key to seeing more value from data management products.