Aryn: Bringing Generative AI to OpenSearch and Data Preparation

Today, we’re proud to share that Aryn is coming out of stealth. We’re a team that’s built and scaled a variety of AWS big data and database services over a decade and led the creation of the OpenSearch project. We have raised $7.5M in series seed funding from Factory, 8VC, Lip-Bu Tan, Amarjit Gill, and other notable investors. Our mission is to answer questions from all of your data, and we’re doing this by bringing generative AI to OpenSearch and data preparation.

Over the last six years, we’ve seen tremendous growth in both the size and capabilities of generative AI models. The most remarkable property of these models continues to be their ability to perform tasks that they weren’t trained to do. With simple natural language instructions, a large language model (LLM) can write a limerick, answer an email, or summarize Don Quixote. At Aryn, we are most interested in their magical ability to process unstructured data for information extraction, question-answering, summarization, content generation, and more.

Paradigms for Unstructured Data

LLMs have also captured the imagination of countless people across industries. For example, analysts at investment firms want to sift through volumes of internal reports and external analyst research to form investment theses. Or, imagine a manufacturing company with thousands of service manuals and guides written and collected over the years. They'd like to let their customers easily query this information to answer questions about the installation and interoperability of parts. An emerging paradigm to enable these use cases is to empower users to “chat with their unstructured enterprise data,” an approach we call “conversational search.”

Unfortunately, LLMs are limited in important ways that block many critical enterprise use cases. They don’t know about private enterprise data and will confidently provide inaccurate answers, also known as hallucinations. Training them on private data is prohibitive for most companies because it’s expensive and requires specialized hardware and unique expertise. Most AI companies are focused on building and tuning these models, despite the inevitable difficulties in using them for conversational search. On the other hand, when these AI models are prompted with relevant data from a prepared dataset, they generate high-quality answers.

At Aryn, our roots are in databases and search, and our perspective on question-answering and data processing is different from pure ML approaches. To overcome the limitations of LLMs, we borrow an idea from Ted Codd – the father of relational databases. He had a simple and powerful idea: let users specify “what” they want from the data, and the computer should figure out “how” to compute the answer. For structured data, the language for specifying the “what” was relational calculus, and SQL was a user-friendly way to express it [1]. Codd’s insight was that for any SQL query, a computer could generate a plan (the “how”) to compute the answer using a data pipeline of simple operations. This paradigm has fueled the $100B+ database and data warehousing industries for the last 50 years.
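To make the “what” versus “how” split concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and its rows are made up for illustration and are not part of Aryn's stack. The SQL query states only what result is wanted, and EXPLAIN QUERY PLAN asks the engine to describe how it intends to compute it.

```python
import sqlite3

# Hypothetical in-memory table, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE manuals (id INTEGER PRIMARY KEY, part TEXT, pages INTEGER)"
)
conn.executemany(
    "INSERT INTO manuals (part, pages) VALUES (?, ?)",
    [("pump", 12), ("valve", 30), ("pump", 8)],
)

# The "what": total pages per part. No loops or access paths are specified.
query = "SELECT part, SUM(pages) FROM manuals GROUP BY part ORDER BY part"
rows = conn.execute(query).fetchall()

# The "how": the plan the engine chose for that same query.
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()

print(rows)
for step in plan:
    print(step[-1])  # last column holds the human-readable plan step
```

The exact plan text depends on the SQLite version, which is the point: the query stays the same while the engine is free to change how it is executed.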

Less well known is that Codd had another, related project for answering natural language questions over unstructured data. This project didn’t go anywhere because the technology for analyzing unstructured data was not good enough. In the last 30 years, we’ve seen another paradigm emerge - enterprise search - for finding relevant information in unstructured data. Search and retrieval are helpful, but they do not go all the way to the question-answering and conversations that users want. Until now.

Enter LLMs. Alone, they’re not enough, though they have an incredible ability to understand and capture the meaning of unstructured data. Similarly, current enterprise search engines are insufficient. But, if we look a little closer at these enterprise search stacks, we see that they’re built with data pipelines — for data preparation, for search and retrieval, and for question-answering. What if we brought AI to these dataflows, and closed the gap between these two technologies?

At Aryn, we’re data people, and we don’t see AI as the center of the universe – instead, we see data pipelines. We borrowed ideas from Codd and applied them to natural language queries and unstructured data. In this case, the “how” is using LLMs as operations to enhance and simplify the data pipelines that compose search stacks. At each stage, developers can choose the best models for their use cases, and easily swap in new ones as needed. We also know that the quality of natural language answers is highly dependent on how well data is prepared for search. Combining generative AI and search stacks gives the best of both worlds: a conversational experience with high-quality answers generated from prepared enterprise datasets. Finally, our goal is to enable developers and give them the tools to easily build and scale systems that power the next generation of conversational search applications.

Aryn’s Conversational Search Stack

Today, we are launching an open source conversational search stack for unstructured enterprise data such as documents, presentations, and internal websites. This stack makes it easy for developers to build and scale applications like question-answering over knowledge bases, chatbots for customer support or documentation, and research and discovery platforms across internal datasets. It is an integrated one-stop shop and consists of three main components: a new semantic data preparation system called Sycamore, semantic search with OpenSearch, and new conversational capabilities in OpenSearch. Generative AI powers each of these components, leading to higher quality answers and ease of use. Developers can quickly build and get to production with this stack without needing to become experts in AI and search. Additionally, Aryn’s stack is 100% open source (Apache License v2.0), which gives developers the freedom to customize and build without vendor lock-in.

At its core, Aryn’s stack includes OpenSearch, a tried and true, open source enterprise search engine. We chose to build with OpenSearch because of its robust vector database and search techniques, enterprise-grade security, battle-tested scalability and reliability, and most of all, its large and fast-growing open source community and user base. Today, the project has 5M+ downloads per week and 450+ contributors, and thousands of enterprise customers such as Goldman Sachs, Pinterest, and Zoom use OpenSearch in production. We are excited to join these developer and user communities, too.

We added support for conversation memory and APIs in OpenSearch v2.10, so that developers can build conversational apps without needing to stitch together and manage generative AI toolkits and vector databases that are in their infancy. This new functionality stores the history of conversations and orchestrates interactions with LLMs using retrieval-augmented generation (RAG) pipelines. RAG is a popular approach for grounding LLMs: relevant data is retrieved first and supplied to the model, which then synthesizes an answer from it. With these features integrated into OpenSearch, developers can easily build conversational search apps and deploy them to production faster.
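As a rough illustration of how conversation memory and RAG fit together, here is a toy sketch in plain Python. It does not use the OpenSearch APIs. Every name in it (Conversation, retrieve, build_prompt, answer) is hypothetical, the retriever is a simple word-overlap ranking standing in for real search, and the LLM is passed in as an ordinary function.

```python
from dataclasses import dataclass, field


@dataclass
class Conversation:
    """Hypothetical conversation memory: a list of (question, answer) turns."""
    turns: list = field(default_factory=list)


def retrieve(query, documents, k=2):
    """Toy retriever: rank documents by shared words with the query."""
    q_terms = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda d: len(q_terms & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, passages, conversation):
    """Combine prior turns and retrieved passages into one prompt."""
    history = "\n".join(f"Q: {q}\nA: {a}" for q, a in conversation.turns)
    context = "\n".join(f"- {p}" for p in passages)
    return f"History:\n{history}\n\nContext:\n{context}\n\nQuestion: {question}"


def answer(question, documents, conversation, llm):
    """One RAG turn: retrieve, prompt the model, then store the exchange."""
    passages = retrieve(question, documents)
    reply = llm(build_prompt(question, passages, conversation))
    conversation.turns.append((question, reply))
    return reply
```

In this sketch, llm can be any callable that takes a prompt string and returns a string, so a real model client could be swapped in without changing the retrieval or memory steps.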

We are also releasing Sycamore, a robust and scalable, open source semantic data preparation system. Sycamore uses LLMs to unlock the meaning of unstructured data and prepare it for search. We believe this is the missing link for getting quality answers from LLMs and RAG-based architectures, and we are excited to build and work with the community on this software. Sycamore provides a high-level API to construct Python-native data pipelines with operations such as data cleaning, information extraction, enrichment, summarization, and generation of vector embeddings that encapsulate the semantics of data. It also uses generative AI to make these operations simple and effective. With Sycamore, developers can easily prepare data in order to get high-quality answers for conversational search.
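To give a feel for the kind of pipeline described above, here is a minimal sketch in plain Python. It is not Sycamore's API. The function names are hypothetical, the "extraction" step just takes the first sentence where a real pipeline might call an LLM, and the "embedding" is a hashed bag-of-words vector standing in for a learned embedding model.

```python
import hashlib
import re


def clean(text):
    """Data cleaning: collapse whitespace runs and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def extract_title(text):
    """Stand-in for LLM-based extraction: use the first sentence as a title."""
    return text.split(".")[0]


def embed(text, dims=8):
    """Stand-in embedding: count tokens into hashed buckets (not semantic)."""
    vec = [0.0] * dims
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dims
        vec[bucket] += 1.0
    return vec


def prepare(raw_docs):
    """Run each raw document through clean -> extract -> embed."""
    records = []
    for raw in raw_docs:
        text = clean(raw)
        records.append(
            {"text": text, "title": extract_title(text), "embedding": embed(text)}
        )
    return records


records = prepare(["  Pump  manual.\n Install with four bolts. "])
print(records[0]["title"])
```

Each step is a small function, so a stage can be replaced (for example, swapping the stand-in embedding for a real model) without touching the rest of the pipeline.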

We’re starting with conversational search, but we believe this technology and approach can go much further, and it will take a community to see how far we can take it. As with Codd’s impact with the relational model, we’re hoping to lay the groundwork that will power the next 50 years of unlocking value from unstructured data. We’re excited to go on this journey with you.

- The Aryn Team

Learn more about Aryn

[1] SQL was created by Don Chamberlin and Ray Boyce.