Better answers over tables using Sycamore

When building a conversational search application, you need to consider how to get the highest quality answers on unstructured data. The downstream data flows in a search query, such as retrieval-augmented generation (RAG), rely on retrieving the correct data from a set of indexes (in Aryn’s case, we use hybrid search, which combines semantic and keyword retrieval). Furthermore, how well you can retrieve the parts of your dataset most relevant to a query depends directly on how that data was prepared and enriched before creating vector embeddings and being indexed.
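To make the hybrid retrieval idea concrete, here is a minimal sketch of one common way to fuse semantic and keyword scores: min-max normalize each retriever's scores, then take a weighted sum. This illustrates the general technique only; it is not Aryn's or OpenSearch's exact implementation, and the weight and example scores are made up.

```python
# Sketch of hybrid score fusion (illustrative, not Aryn's implementation):
# min-max normalize each retriever's scores, then take a weighted sum.

def min_max_normalize(scores):
    """Scale a dict of doc_id -> score into [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def hybrid_scores(semantic, keyword, semantic_weight=0.7):
    """Combine normalized semantic (vector) and keyword (BM25) scores."""
    sem = min_max_normalize(semantic)
    kw = min_max_normalize(keyword)
    docs = set(sem) | set(kw)
    return {
        doc: semantic_weight * sem.get(doc, 0.0)
             + (1 - semantic_weight) * kw.get(doc, 0.0)
        for doc in docs
    }

# Made-up scores from the two retrievers:
semantic = {"doc1": 0.91, "doc2": 0.42, "doc3": 0.10}
keyword = {"doc2": 12.3, "doc4": 8.7}
ranked = sorted(hybrid_scores(semantic, keyword).items(),
                key=lambda kv: kv[1], reverse=True)
print(ranked[0][0])  # -> doc1
```

A document that scores well on only one retriever can still win overall, which is why hybrid search tends to be more robust than either method alone.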

The Aryn Search Platform not only provides RAG pipelines through capabilities we added to OpenSearch, but also enables you to focus on getting your data ready for search through Sycamore, an open source semantic data preparation system. In unstructured datasets, it’s common to find tables throughout the documents. In order to answer questions from data in these tables with high quality, you often need to extract and process the tables during the preparation data flow. Luckily, Sycamore makes table extraction easy, and we’ll show you an example in this post using Amazon Textract as the underlying extractor.
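One reason extracted tables improve answers is that a table can be flattened into row-wise sentences that embed and retrieve well, so each row carries its headers and caption into the index. The sketch below illustrates the idea; the table structure is a hypothetical stand-in, not Sycamore's internal representation, and the figures are illustrative.

```python
# Sketch: flatten an extracted table into row-wise sentences for embedding.
# The input structure here is hypothetical, not Sycamore's internal format.

def table_to_sentences(caption, headers, rows):
    """Render each table row as a standalone sentence with its headers."""
    sentences = []
    for row in rows:
        pairs = ", ".join(f"{h}: {v}" for h, v in zip(headers, row))
        sentences.append(f"{caption} - {pairs}")
    return sentences

headers = ["Segment", "Net Sales"]
rows = [["North America", "$87.9B"], ["International", "$32.1B"], ["AWS", "$23.1B"]]
for s in table_to_sentences("Amazon Q3 2023 net sales by segment", headers, rows):
    print(s)
```

Each emitted sentence is self-contained, so a query like "AWS net sales" can match the relevant row even though the original cell contained only a number.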

AWS prerequisites for using Amazon Textract

This example builds on our previous blog post here, and you will reconfigure and relaunch the Aryn stack to use table extraction. We assume you have already run through those instructions before starting this second part.

As a component of the Aryn Search Platform, Sycamore has the ability to utilize a variety of libraries and web services during the data preparation process. In this example, we will use Sycamore’s table extraction feature and configure it to use Amazon Textract as the underlying service for the extraction. Therefore, you will need AWS credentials that can access Textract in the US-East-1 region and an Amazon S3 bucket in US-East-1. Please note that you will be charged for AWS resources consumed, though it should be negligible. If you do not have an AWS account, sign up here.

First, install the AWS CLI here.

Next, you can enable AWS SSO login with these instructions, or you can use other methods to configure the values for AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and if needed AWS_SESSION_TOKEN.

If using AWS SSO:

aws sso login --profile your-profile-name

eval "$(aws configure export-credentials --format env --profile your-profile-name)"

Next, create an Amazon S3 bucket for use with Textract. Make sure to set the region to US-East-1.

aws s3 mb s3://your-bucket-name --region us-east-1 --profile your-profile-name

Then, set the configuration for Textract's input and output S3 location:

export SYCAMORE_TEXTRACT_PREFIX=s3://your-bucket-name

Relaunch Aryn stack with new configuration

Take down and reset the Aryn stack from the previous blog post if you have not already done so. Run these commands in the folder where you downloaded the Docker compose files:

docker compose down

docker compose run reset

Next, you will enable Textract:

export ENABLE_TEXTRACT=true

Note that the default setting is “true” for the Aryn Quickstart configuration, but in the prior post, you had disabled it.

Now, you will relaunch the Aryn stack:

docker compose up --pull=always

The Aryn Search Platform will start up. The Quickstart will automatically run an example data ingestion job using a provided Sycamore script, and it will ingest this file into a newly created index. We will ingest our files into this same index in the next step. You will know the Aryn stack is ready when you see log messages similar to:

No changes at [datetime] sleeping

Ingest the earnings report data

Next, let’s ingest the Amazon Q3 2023 earnings reports into the relaunched Aryn stack. Sycamore will now use Textract for table extraction during the preparation data flow. In a new terminal window, run:

docker compose run sycamore_crawler_http https://s2.q4cdn.com/299287126/files/doc_financials/2023/q3/AMZN-Q3-2023-Earnings-Release.pdf

docker compose run sycamore_crawler_http https://s2.q4cdn.com/299287126/files/doc_financials/2023/q3/64aef25f-17ea-4c46-985a-814cb89f6182.pdf

In the terminal window with the original Docker compose command, you’ll see log output from Sycamore running a new data processing job. When the new files are loaded, you will see log messages similar to:

No changes at [datetime] sleeping

Ask questions and search

Now that the newly prepared data is loaded, you will return to the Aryn Quickstart demo UI to ask some questions on the tables in the 10-Q earnings report. Using your internet browser, visit http://localhost:3000 to access the demo UI. Create a new conversation by entering a name in the text box in the "Conversations" panel and hitting enter.

First, ask “What were the net sales for Amazon in Q3 2023?” and you’ll get back the answer along with a citation to where the data was found. This data was in a table in the document and would not have been retrieved without the table extraction step. Next, you may want to see how this was divided across business units. Let’s use the conversational features of Aryn Search and ask “Can you break this up into business units?”

Aryn returns the requested information, and this data was also taken from a table in the 10-Q document. Using table extraction with Sycamore enabled Aryn Search to retrieve information from the tables in the dataset and return high-quality answers.

Finally, we can ask “What was the biggest operating expense in Q3 2023?” and Aryn Search will retrieve the correct value from the data and provide a citation.

You can choose to ask other sample questions on this dataset. Also, if you want to shut down and clean up the Aryn stack:

docker compose down
docker compose run reset

You can also delete the S3 bucket you created.

Conclusion

This second part of the blog series shows how data preparation and enrichment can have a huge impact on conversational search quality. In this example, we used Sycamore and Amazon Textract to easily extract tables from our dataset and make that data available for high-quality search.

For more examples of data preparation and enrichment with Sycamore, check out how to get started with Sycamore development using Jupyter notebooks.

If you have any feedback or questions, please email us at: info@aryn.ai 

or join our Sycamore Slack group here:

https://join.slack.com/t/sycamore-ulj8912/shared_invite/zt-23sv0yhgy-MywV5dkVQ~F98Aoejo48Jg


- Jon Fritz, CPO

Get started with Aryn conversational search in seconds with containers

Getting started with the Aryn Search Platform is now even easier! We just launched a Quickstart configuration, which deploys the Aryn stack in Docker containers with a single command. The Quickstart also makes it easy to ingest some sample unstructured data and ask questions on this dataset using a demo UI. The Aryn stack deploys two main components: semantic data preparation with Sycamore and conversational search with OpenSearch. You can also crawl and ingest data from arbitrary websites using the Sycamore crawler.  When you’re done with your Aryn stack, you can easily clean up your environment with a single command.

Aryn provides a one-stop shop that includes all of the data flows needed for high-quality, conversational search. It includes a retrieval-augmented generation (RAG) pipeline in OpenSearch for a last-mile natural language experience, but goes well beyond that to deliver high-quality answers. The Aryn Search Platform also focuses on the data enrichment and information retrieval data flows involved in a search query, which greatly affect the quality of the data that's retrieved and used in a RAG pipeline.

The Aryn Quickstart GitHub README has details on launching the Aryn stack and running a demo that ingests HTML and PDFs from the Sort Benchmark website. However, in this blog post, we will instead focus on processing a different dataset. We’ll pretend we’re financial analysts interested in learning more about Amazon’s Q3 2023 earnings. In this example, we aren't going to use any external dependencies, so we won't require you to have or configure AWS credentials for use with Amazon Textract. Textract is enabled by default in the Quickstart, so we will disable it in the steps below.

In this post, we’ll deploy the Aryn stack locally with Docker containers, download and ingest Amazon’s Q3 2023 earnings reports, and ask questions on this data using conversational search. Let’s get started!

Launch the containerized Aryn stack

You’ll need two prerequisites for launching the Aryn stack for this example. First, you need to run Docker to deploy the containerized Aryn stack. If you don't have Docker already installed, visit here. Second, you need an OpenAI key so the Aryn stack can use the GPT large language model (LLM) service for entity extraction and RAG data flows. Keep in mind that you will accrue costs for this usage, though it will likely be negligible. You can create an OpenAI account here, or if you already have one, you can retrieve your key here.

Next, download the Docker compose files from the Quickstart repository. You will need the compose and .env files.

git clone git@github.com:aryn-ai/quickstart.git

Then, you will set two environment variables. First, your OpenAI key:

export OPENAI_API_KEY=YOUR-KEY

We will disable Amazon Textract in this example. This simplifies our prerequisites, because we do not need to supply the AWS credentials that are required to use Textract. However, for other use cases, we encourage the use of Textract for table extraction, which can provide better data preparation and thus better search results. To disable Textract in the Quickstart:

export ENABLE_TEXTRACT=false

Next, start the Docker service. On macOS or Windows, start Docker Desktop; on Linux, if you installed Docker with your local package manager, it should already be running. We also recommend increasing the memory setting to 6 GB and Swap to 4 GB. You can make these adjustments in the “Resources” section of the “Settings” menu, which is accessed via the gear icon in the top right of the UI.

Start the Aryn stack in the directory where you downloaded the Docker compose files:

docker compose up --pull=always

The Aryn Search Platform will start up. The Quickstart will automatically run an example data ingestion job using a provided Sycamore script, and it will ingest this file into a newly created index. We will ingest our files into this same index in the next step. You will know the Aryn stack is ready when you see log messages similar to:

No changes at [datetime] sleeping

Ingest the earnings report data

Next, let’s ingest our Amazon Q3 2023 earnings reports. You will use the Aryn crawlers to download and ingest the files. In a new terminal window, run:

docker compose run sycamore_crawler_http https://s2.q4cdn.com/299287126/files/doc_financials/2023/q3/AMZN-Q3-2023-Earnings-Release.pdf

docker compose run sycamore_crawler_http https://s2.q4cdn.com/299287126/files/doc_financials/2023/q3/64aef25f-17ea-4c46-985a-814cb89f6182.pdf

In the terminal window with the original Docker compose command, you’ll see log output from Sycamore running a new data processing job. When the new files are loaded, you will see log messages similar to:

No changes at [datetime] sleeping

Ask questions and search

Now that the data is loaded, you will go to the Aryn Quickstart demo UI for conversational search over this data. Using your internet browser, visit http://localhost:3000 to access the demo UI. Create a new conversation by entering a name in the text box in the "Conversations" panel and hitting enter.

You can now ask questions on this data. Let’s start by asking “How much impact did AWS have on the Q3 2023 Amazon earnings?” and hit enter. You’ll see the Aryn Search Platform spring into action, with question rewriting using generative AI, hybrid search results (semantic + keyword search) returned in the right panel, and a conversational answer to the question using generative AI and RAG in the middle panel.

You can follow up by asking "What AI related events happened with it?" and you will see how Aryn Search uses conversational memory to understand prior context. It relates "it" to AWS (from the prior interaction), and rephrases the search query accordingly. You can click on the citation links or the documents from the hybrid search results to take you to a highlighted section of the source documents.
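Conceptually, the rewriting step resolves references like "it" against entities mentioned in earlier turns. Aryn uses generative AI for this; the toy rule-based sketch below, with made-up function names, only illustrates the idea.

```python
# Toy sketch of conversational query rewriting: substitute a standalone
# pronoun with the most recent entity from prior turns. Aryn does this
# with an LLM; this rule-based version just illustrates the concept.

def rewrite_query(query, prior_entities):
    """Replace a standalone 'it' with the most recently mentioned entity."""
    if not prior_entities:
        return query
    latest = prior_entities[-1]
    out = []
    for word in query.split():
        core = word.strip("?.,!")  # keep trailing punctuation intact
        if core.lower() == "it":
            word = word.replace(core, latest)
        out.append(word)
    return " ".join(out)

# Entities surfaced by the prior interactions, most recent last:
history_entities = ["Amazon", "AWS"]
print(rewrite_query("What AI related events happened with it?", history_entities))
# -> What AI related events happened with AWS?
```

The rewritten query is what actually gets sent to hybrid search, which is why the follow-up retrieves AWS-related passages rather than matching the bare pronoun.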

Next, you could ask “What was the effect of Rivian on the Q3 2023 Amazon earnings?” and you'll see a similar set of data flows. You can also create other conversations in parallel, ask additional questions, or ingest more data by replacing the URL in the "docker compose run" commands for Sycamore crawler you ran above. 

If you want to shut down and clean up the Quickstart:

docker compose down
docker compose run reset

Conclusion

This example uses a demo Sycamore processing script we wrote for processing and ingesting the Sort Benchmark dataset. Although you used that same script to prepare and ingest the Amazon earnings reports, you could likely prepare the data even better by iterating on the Sycamore processing script, using Amazon Textract for table extraction, and possibly experimenting with other vector embedding models. All of this is easy to do with Sycamore, and you can check out the second part of this example, which enables table extraction for better answers on data in tables.

You can also learn more about how to get started with Sycamore development using Jupyter notebooks.

In this example, you launched the Aryn Search Platform using the Quickstart, ingested Amazon’s Q3 2023 earnings reports, and ran conversational search over that dataset. If you have any feedback or questions, please email us at: info@aryn.ai 

or join our Sycamore Slack group here:

https://join.slack.com/t/sycamore-ulj8912/shared_invite/zt-23sv0yhgy-MywV5dkVQ~F98Aoejo48Jg


- Jon Fritz, CPO

Aryn: Bringing Generative AI to OpenSearch and Data Preparation

Today, we’re proud to share that Aryn is coming out of stealth. We’re a team that’s built and scaled a variety of AWS big data and database services over a decade and led the creation of the OpenSearch project. We have raised $7.5M in series seed funding from Factory, 8VC, Lip-Bu Tan, Amarjit Gill, and other notable investors. Our mission is to answer questions from all of your data, and we’re doing this by bringing generative AI to OpenSearch and data preparation.

Over the last six years, we’ve seen tremendous growth in both the size and capabilities of generative AI models. The most remarkable property of these models continues to be their ability to perform tasks that they weren’t trained to do. With simple natural language instructions, a large language model (LLM) can write a limerick, answer an email, or summarize Don Quixote. At Aryn, we are most interested in their magical ability to process unstructured data for information extraction, question-answering, summarization, content generation, and more.

Paradigms for Unstructured Data

LLMs have also caught the imagination of countless people across industries. For example, analysts at investment firms want to sift through volumes of internal reports and external analyst research to form investment theses. Or, imagine a manufacturing company with thousands of service manuals and guides written and collected over the years. They'd like to let their customers easily ask questions on this information about the installation and interoperability of parts. An emerging paradigm to enable these use cases is to empower users to “chat with their unstructured enterprise data,” an approach we call “conversational search.”

Unfortunately, LLMs are limited in important ways that block many critical enterprise use cases. They don’t know about private enterprise data and will confidently provide inaccurate answers, also known as hallucinations. Training them on private data is prohibitive for most companies because it’s expensive and requires specialized hardware and unique expertise. Most AI companies are focused on building and tuning these models, despite the inevitable difficulties in using them for conversational search. On the other hand, when these AI models are prompted with relevant data from a prepared dataset, they generate high-quality answers.

At Aryn, our roots are in databases and search, and our perspective on question-answering and data processing is different from pure ML approaches. To overcome the limitations of LLMs, we borrow an idea from Ted Codd – the father of relational databases. He had a simple and powerful idea: let users specify “what” they want from the data, and the computer should figure out “how” to compute the answer. For structured data, the language for specifying the “what” was relational calculus, and SQL was a user-friendly way to express it [1]. Codd’s insight was that for any SQL query, a computer could generate a plan (the “how”) to compute the answer using a data pipeline of simple operations. This paradigm has fueled the $100B+ database and data warehousing industries for the last 50 years.

Much lesser known, Codd had another related project for answering natural language questions over unstructured data. This project didn’t go anywhere because the technology for analyzing unstructured data was not good enough. In the last 30 years, we’ve seen another paradigm emerge - enterprise search - for finding relevant information in unstructured data. Search and retrieval are helpful, but do not go all the way to question-answering and conversations that users want. Until now.

Enter LLMs. Alone, they’re not enough, though they have an incredible ability to understand and capture the meaning of unstructured data. Similarly, current enterprise search engines are insufficient. But, if we look a little closer at these enterprise search stacks, we see that they’re built with data pipelines — for data preparation, for search and retrieval, and for question-answering. What if we brought AI to these dataflows, and closed the gap between these two technologies?

At Aryn, we’re data people, and we don’t see AI as the center of the universe – instead, we see data pipelines. We borrowed ideas from Codd and applied them to natural language queries and unstructured data. In this case, the “how” is using LLMs as operations to enhance and simplify the data pipelines that compose search stacks. At each stage, developers can choose the best models for their use cases, and easily swap in new ones as needed. We also know that the quality of natural language answers is highly dependent on how well data is prepared for search. Combining generative AI and search stacks gives the best of both worlds, giving a conversational experience with high-quality answers generated from prepared enterprise datasets.  Finally, our goal is to enable developers, and give them the tools to easily build and scale systems to power the next generation of conversational search applications.

Aryn’s Conversational Search Stack

Today, we are launching an open source conversational search stack for unstructured enterprise data such as documents, presentations, and internal websites. This stack makes it easy for developers to build and scale applications like question-answering over knowledge bases, chatbots for customer support or documentation, and research and discovery platforms across internal datasets. It is an integrated one-stop shop and consists of three main components: a new semantic data preparation system called Sycamore, semantic search with OpenSearch, and new conversational capabilities in OpenSearch. Generative AI powers each of these components, leading to higher quality answers and ease of use. Developers can quickly build and get to production with this stack without needing to become experts in AI and search. Additionally, Aryn’s stack is 100% open source (Apache License v2.0), which gives developers the freedom to customize and build without vendor lock-in.

At its core, Aryn’s stack includes OpenSearch, a tried and true, open source enterprise search engine. We chose to build with OpenSearch because of its robust vector database and search techniques, enterprise-grade security, battle-tested scalability and reliability, and most of all: a large and fast growing open source community and userbase. Today, the project has 5M+ downloads per week, 450+ contributors, and thousands of enterprise customers such as Goldman Sachs, Pinterest, and Zoom use OpenSearch in production. We are excited to join these developer and user communities, too.

We added support for conversation memory and APIs in OpenSearch v2.10, so that developers can build conversational apps without needing to stitch together and manage generative AI toolkits and vector databases that are in their infancy. This new functionality stores the history of conversations and orchestrates interactions with LLMs using retrieval-augmented generation (RAG) pipelines. RAG is a popular approach for grounding LLMs to synthesize answers. With these features integrated into OpenSearch, developers can easily build conversational search apps and deploy them to production faster.
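As a rough illustration of what a RAG pipeline with conversation memory assembles for the LLM, the sketch below combines retrieved passages and prior turns into a single prompt. The prompt wording, function name, and data shapes are hypothetical stand-ins, not OpenSearch's actual API.

```python
# Sketch of RAG prompt assembly with conversation memory: prior turns
# ground follow-up questions, retrieved passages ground the answer.
# The prompt format is illustrative, not OpenSearch's actual template.

def build_rag_prompt(question, retrieved_passages, conversation_history):
    """Assemble a single prompt from passages, history, and the question."""
    lines = ["Answer the question using only the passages below.", ""]
    for i, passage in enumerate(retrieved_passages, 1):
        lines.append(f"[{i}] {passage}")
    lines.append("")
    for turn in conversation_history:
        lines.append(f"{turn['role']}: {turn['text']}")
    lines.append(f"user: {question}")
    return "\n".join(lines)

history = [{"role": "user", "text": "What were Amazon's Q3 2023 net sales?"},
           {"role": "assistant", "text": "$143.1 billion [1]."}]
passages = ["Net sales increased 13% to $143.1 billion in Q3 2023."]
prompt = build_rag_prompt("How did AWS contribute?", passages, history)
print(prompt.splitlines()[2])  # -> [1] Net sales increased 13% to $143.1 billion in Q3 2023.
```

Because the stored history rides along in every prompt, the LLM can resolve follow-ups like "How did AWS contribute?" without the user restating the context.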

We are also releasing Sycamore, a robust and scalable, open source semantic data preparation system. Sycamore uses LLMs for unlocking the meaning of unstructured data and preparing it for search. We believe this is the missing link for getting quality answers from LLMs and RAG-based architectures, and we are excited to build and work with the community on this software. Sycamore provides a high-level API to construct Python-native data pipelines with operations such as data cleaning, information extraction, enrichment, summarization, and generation of vector embeddings that encapsulate the semantics of data. And, it uses generative AI to make these operations simple and effective. With Sycamore, developers can easily prepare data in order to get high-quality answers for conversational search.
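To give a feel for the shape of such a pipeline, here is a conceptual sketch of the partition, enrich, and embed stages composed as plain functions. The function names and the toy "embedding" are hypothetical stand-ins for illustration, not Sycamore's actual API.

```python
# Conceptual sketch of a semantic data-preparation pipeline of the kind
# Sycamore expresses. Names and the toy "embedding" are hypothetical.

def partition(document):
    """Split a raw document into passages (here: by blank lines)."""
    return [p.strip() for p in document.split("\n\n") if p.strip()]

def enrich(passage):
    """Attach simple metadata; a real pipeline might extract entities here."""
    return {"text": passage, "length": len(passage)}

def embed(record):
    """Toy stand-in for a vector-embedding step."""
    record["embedding"] = [record["length"] / 100.0]  # placeholder vector
    return record

def prepare(document):
    """Compose the stages: partition -> enrich -> embed."""
    return [embed(enrich(p)) for p in partition(document)]

doc = "Net sales increased 13% in Q3 2023.\n\nAWS segment sales grew 12%."
records = prepare(doc)
print(len(records))  # -> 2
```

In Sycamore the analogous stages are chained through its high-level API and can call LLMs and services like Textract; the point here is only the pipeline-of-operations structure.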

We’re starting with conversational search, but we believe this technology and approach can go much further, and it will take a community to see how far we can take it. As with Codd’s impact with the relational model, we’re hoping to lay the groundwork that will power the next 50 years of unlocking value from unstructured data. We’re excited to go on this journey with you.

- The Aryn Team


Learn more about Aryn

[1] SQL was created by Don Chamberlin and Ray Boyce.


Finding Your Next Exponential

This is re-posted from Mehul A. Shah's personal blog.

We, tech workers, have been fortunate. The technology industry has been rapidly expanding for decades with no foreseeable end in sight. The recent macroeconomic environment, I believe, is temporary. It has given us all a chance to pause and reflect on our careers and what we find meaningful. And, we already are starting to see new opportunities appear, especially in areas related to AI.

People often ask me for advice on careers in times like these. What fields of endeavor would be fruitful to pursue? How do you decide which job or role to take? What should you look for and stay away from? When is the right time to leave and take on something new?

While there are many axes to consider and no universal answer, a common theme that repeatedly emerges is one of learning and growth. My advice often boils down to how to look for and identify a place and opportunity where the people and environment will help you learn the most. Learning breeds passion and satisfaction. Money, title, prestige, fame, and other fruits are simply by-products.

People often learn the most when their company or organization is growing. I do not mean steady linear growth, but rather exponential growth. Growth can be measured in many ways – in terms of users, customers, revenue, or employees – and typically all of these go hand in hand.

My advice to look for exponential growth is not new. Eric Schmidt once famously told Sheryl Sandberg, “If you’re offered a seat on a rocket ship, don’t ask what seat. Just get on.” Paul Graham also argues in “Startup = Growth” that exponential growth is essential for a business to be considered a startup. Although one can find growth outside of startups, startups are certainly a cauldron for learning.

Some Rules of Growth

So, how can an outsider tell whether there’s growth?  In hindsight, it’s easy – start by talking to people on the inside. Over my career, I’ve accumulated some rules of thumb to help find your next exponential.

  1. Growth is fun. There’s a healthy vibe in the air. People are optimistic. For example, when I interviewed with Google in 2004 and with AWS in 2014, I could tell that people were having fun. There was an atmosphere of chaotic optimism. People were busy and hustling, but they were never too busy to talk and share the optimism.

  2. Growth is obvious; it does not hide. It is not under the rug or just around the corner. It’s a Mack truck that hits you in the face. The data will tell you.

    Startups will stretch the truth to make it appear as if they’re growing. For example, they’ll highlight the credentials of founders or the initial team. They’ll talk about a recent high-valuation funding round. Let’s be clear. While these are reasons to be optimistic, they are not evidence of growth. Sometimes, companies will exaggerate or decline to divulge data about users, customers, or revenue. It’s hard to know without some verifiable data.

    So, as a proxy, I find it useful to get a sense of headcount growth. Fast growing places, especially startups, must hire behind need. A cute trick is to ask the people that you meet (e.g. interviewers) how long it's been since they started, and compute the average. Unless it’s early, the shorter the average, the faster the growth. For example, when I joined AWS in 2014, they were investing heavily in growing the Palo Alto office. The average tenure of people I met there was less than 4 weeks. In a couple months, I was a veteran. I later learned my division was growing at triple digit rates at that time.

  3. Growth is divined, not engineered. Sometimes companies are early in their journey. They may not have a product or may still be tweaking the product to discover what customers want. So, by definition, they will not show fast growth. Looking for growth is like searching for oil. You have the right tools and know-how, but need to make educated guesses and dig in many places.

    In this case, you should assess how well the company is searching for growth. While it’s important to have a long-term plan, people need to be working with customers, collecting feedback, and using both data and intuition to iterate fast. If customers are not using the product, the company needs to relentlessly try new angles of attack. Be wary of long development cycles with little to no customer interaction or feedback – it’s hard to engineer a product “from whole cloth” that will grow exponentially.

  4. Growth is not forever. Finally, every company or organization eventually slows down or plateaus. Either the market becomes saturated, or they hit some other internal bottleneck. So, if you’re experiencing growth, enjoy it while it lasts. When assessing a new opportunity, remember that reputation lags reality. Refer to the previous rules to search for and assess growth. If your current environment has lost its ability to grow, then that’s a sign to look for your next exponential.

The Thrills of Exponentials

Environments that grow exponentially are rare, and AWS in 2014 was the first place I experienced this kind of growth. At first, I found the environment to be unintuitive, chaotic, and often unsettling. My first manager warned me that “the world in which you operate and assumptions you make will fundamentally change every three months.” She was not wrong. It’s hard to describe a world with customers, revenue, and teams growing nearly 3x a year and the challenges and issues that accompany it. Scaling was an exercise in organizational and system brinkmanship. By the time we deployed a new feature, system, or process, it was time to revisit it. Everything was constantly breaking, and we learned to stay one step ahead of it all collapsing.

I quickly found the pace to be exhilarating and addictive, and the growth necessitated an environment of trust and camaraderie. There’s so much to do that everyone can find real impactful opportunities. And because there’s so much open space, people are not territorial, which means no politics and more fun. If I wanted to work on something, I could simply join an existing effort. Or I could start something new and convince others to join. While people did not always agree with my ideas, no one stopped me from trying. Everyone started with trust, assumed good intentions, and had high expectations. With success, this cycle built on itself. By landing at a place with exponential growth, I learned so much and so quickly, and as a by-product my career also grew quickly.

Growth is for Community

I often remind people that growth’s purpose is to enrich and sustain our community and not collect wealth and power for individuals. Silicon Valley and the technology industry that it spawned are predicated on exponential growth, but not on growth at all costs. I started my career at HP, which was founded by Bill Hewlett and David Packard, two pioneers who set the standard for the valley. (Unfortunately, I joined too late to have met them.) Dave famously said that a company’s responsibility is to its employees, customers, and community first, and then its shareholders. A company is a collaborative effort by a group of people that want to make a contribution, and money is simply fuel to sustain their activities. I concur. Somewhere, Silicon Valley lost its way with the recent growth-at-all-cost behavior of some big tech firms and personalities. I hope the pendulum swings back, in this respect, to the old school ways.

Many wondered why I left AWS last year. Like others, I was amazed by the unprecedented advances in generative AI models. While their abilities to create images, audio, video, and natural language are remarkable, we felt that an essential ingredient was missing and needed exploration. We saw a larger opportunity outside of AWS that many did not agree with. So, I co-founded Aryn to seize that opportunity, find my next exponential, and make a contribution back to our community.



I thank Ben Sowell and Jon Fritz for feedback on drafts of this post.