When building a conversational search application, you need to consider how to get the highest-quality answers from unstructured data. The downstream data flows in a search query (e.g. RAG) rely on retrieving the correct data from a set of indexes (in Aryn’s case, we use hybrid search, which combines semantic and keyword retrieval). Furthermore, how well you can retrieve the parts of your dataset most relevant to a query depends directly on how that data was prepared and enriched before vector embeddings were created and the data was indexed.
The Aryn Search Platform not only provides RAG pipelines through capabilities we added to OpenSearch, but also enables you to focus on getting your data ready for search through Sycamore, an open source semantic data preparation system. In unstructured datasets, it’s common to find tables throughout the documents. To get high-quality answers to questions about the data in these tables, you often need to extract and process the tables during the preparation data flow. Luckily, Sycamore makes table extraction easy, and we’ll show you an example in this post using Amazon Textract as the underlying extractor.
AWS prerequisites for using Amazon Textract
This example builds on our previous blog post here, and you will reconfigure and relaunch the Aryn stack to use table extraction. We will assume you have already run through those instructions before starting this second part.
As a component of the Aryn Search Platform, Sycamore can use a variety of libraries and web services during data preparation. In this example, we will use Sycamore’s table extraction feature and configure it to use Amazon Textract as the underlying extraction service. Therefore, you will need AWS credentials that can access Textract in the US-East-1 region and an Amazon S3 bucket in US-East-1. Please note that you will be charged for the AWS resources consumed, though the cost should be negligible. If you do not have an AWS account, sign up here.
First, install the AWS CLI here.
Next, you can enable AWS SSO login with these instructions, or you can use other methods to configure the values for AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and, if needed, AWS_SESSION_TOKEN.
If using AWS SSO:
aws sso login --profile your-profile-name
eval "$(aws configure export-credentials --format env --profile your-profile-name)"
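Before moving on, it can help to confirm that the credentials actually landed in your shell. The following is a minimal sketch, not part of the Aryn Quickstart: check_aws_env is a hypothetical helper name, and it only checks that the variables mentioned above are non-empty, not that they are valid.

```shell
# Hypothetical helper (not part of the Aryn Quickstart): report which of the
# AWS credential variables named above are missing from the current shell.
check_aws_env() {
  _missing=0
  if [ -z "${AWS_ACCESS_KEY_ID:-}" ]; then
    echo "missing: AWS_ACCESS_KEY_ID" >&2
    _missing=1
  fi
  if [ -z "${AWS_SECRET_ACCESS_KEY:-}" ]; then
    echo "missing: AWS_SECRET_ACCESS_KEY" >&2
    _missing=1
  fi
  # AWS_SESSION_TOKEN is only needed for temporary credentials (such as SSO),
  # so its absence is reported but not treated as an error.
  if [ -z "${AWS_SESSION_TOKEN:-}" ]; then
    echo "note: AWS_SESSION_TOKEN is not set" >&2
  fi
  return "$_missing"
}
```

After defining the function, you could run check_aws_env && aws sts get-caller-identity, which asks AWS to report the account and identity behind the credentials in use.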
Next, create an Amazon S3 bucket for use with Textract. Make sure to set the region to US-East-1.
aws s3 mb s3://your-bucket-name --region us-east-1 --profile your-profile-name
Then, set the configuration for Textract's input and output S3 location:
export SYCAMORE_TEXTRACT_PREFIX=s3://your-bucket-name
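A common slip at this step is to export the bare bucket name without the s3:// scheme. As a small sanity check, and assuming the prefix only needs to look like s3://bucket or s3://bucket/path, a sketch such as the following could catch that early. The is_s3_uri name is hypothetical, and the check is purely syntactic: it does not confirm that the bucket exists.

```shell
# Hypothetical helper: succeed only if the argument looks like
# s3://bucket or s3://bucket/some/prefix. This is a syntax check only.
is_s3_uri() {
  case "$1" in
    s3:///*) return 1 ;;   # scheme present but the bucket name is empty
    s3://?*) return 0 ;;   # scheme followed by at least one character
    *)       return 1 ;;   # anything else, including an empty value
  esac
}
```

For example, is_s3_uri "$SYCAMORE_TEXTRACT_PREFIX" || echo "SYCAMORE_TEXTRACT_PREFIX should start with s3://" >&2 prints a warning when the value is unset or malformed.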
Relaunch Aryn stack with new configuration
Take down and reset the Aryn stack from the previous blog post if you have not already done so. Run these commands in the folder where you downloaded the Docker compose files:
docker compose down
docker compose run reset
Next, you will enable Textract:
export ENABLE_TEXTRACT=true
Note that the default setting is “true” for the Aryn Quickstart configuration, but in the prior post, you had disabled it.
Now, you will relaunch the Aryn stack:
docker compose up --pull=always
The Aryn Search Platform will start up. The Quickstart will automatically run an example data ingestion job using a provided Sycamore script, and it will ingest this file into a newly created index. We will ingest our files into this same index in the next step. You will know the Aryn stack is ready when you see log messages similar to:
No changes at [datetime] sleeping
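If you would rather not watch the logs by hand, you could filter them for that message. This is a minimal sketch under the assumption that the idle line always contains "No changes at" followed later by "sleeping", as in the sample above; wait_for_idle is a hypothetical helper name.

```shell
# Hypothetical helper: read log lines on stdin and return success as soon as
# one matches the idle message ("No changes at ... sleeping"). It returns
# failure if the input ends without a match.
wait_for_idle() {
  grep -q 'No changes at .* sleeping'
}
```

One way to use it is docker compose logs --follow --since 1m | wait_for_idle && echo "Aryn stack is idle", run from the folder with the Docker compose files. The --since option limits the replay of older log lines, so a stale idle message from much earlier is less likely to match. Note that the logs command may keep running until it next writes a line after the match.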
Ingest the earnings report data
Next, let’s ingest the Amazon Q3 2023 earnings reports into the relaunched Aryn stack. Sycamore will now use Textract for table extraction during the preparation data flow. In a new terminal window, run:
docker compose run sycamore_crawler_http https://s2.q4cdn.com/299287126/files/doc_financials/2023/q3/AMZN-Q3-2023-Earnings-Release.pdf
docker compose run sycamore_crawler_http https://s2.q4cdn.com/299287126/files/doc_financials/2023/q3/64aef25f-17ea-4c46-985a-814cb89f6182.pdf
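If you have more than a couple of documents to load, the same crawler command can be wrapped in a loop. The sketch below assumes nothing beyond the command shown above; ingest_urls and the CRAWLER variable are hypothetical names introduced here so the command can be swapped out (for example, with echo for a dry run).

```shell
# Hypothetical helper: run the HTTP crawler once per URL and stop at the
# first failure. CRAWLER is an assumption of this sketch; by default it is
# the crawler command used above.
ingest_urls() {
  for url in "$@"; do
    # Left unquoted on purpose so the default command splits into words.
    ${CRAWLER:-docker compose run sycamore_crawler_http} "$url" || return 1
  done
}
```

Calling ingest_urls with the two URLs above is equivalent to running the two commands one after the other, and setting CRAWLER=echo first prints the URLs instead of crawling them.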
In the terminal window with the original Docker compose command, you’ll see log messages from Sycamore running a new data processing job. When the new files are loaded, you will see log messages similar to:
No changes at [datetime] sleeping
Ask questions and search
Now that the newly prepared data is loaded, you will return to the Aryn Quickstart demo UI to ask some questions on the tables in the 10-Q earnings report. Using your internet browser, visit http://localhost:3000 to access the demo UI. Create a new conversation by entering a name in the text box in the "Conversations" panel and pressing Enter.
First, ask “What were the net sales for Amazon in Q3 2023?” and you’ll get back the answer and a citation to where the data was found. This data was in a table in the document, and would not have been retrieved without the table extraction step. Next, you may want to see how this was divided across business units. Let’s use the conversational features of Aryn Search and ask “Can you break this up into business units?”
Aryn returns the requested information, and this data was also taken from a table in the 10-Q document. Using table extraction with Sycamore enabled Aryn Search to retrieve information from the tables in the dataset and return high-quality answers.
Finally, you can ask “What was the biggest operating expense in Q3 2023?” and Aryn Search will retrieve the correct value from the data and provide a citation.
You can ask other sample questions on this dataset. When you are done, you can shut down and clean up the Aryn stack:
docker compose down
docker compose run reset
Conclusion
This second part of the blog post series shows how data preparation and enrichment can have a huge impact on conversational search quality. In this example, we used Sycamore and Amazon Textract to easily extract tables from our dataset and make this data available for high-quality search.
For more examples of data preparation and enrichment with Sycamore, check out how to get started with Sycamore development using Jupyter notebooks.
If you have any feedback or questions, please email us at: info@aryn.ai
or join our Sycamore Slack group here:
https://join.slack.com/t/sycamore-ulj8912/shared_invite/zt-23sv0yhgy-MywV5dkVQ~F98Aoejo48Jg
- Jon Fritz, CPO