Commit ab7c673

Merge pull request #87 from cbossie/master
Sample Document Processing Application
2 parents 39994d1 + 0c9f7b0 commit ab7c673

138 files changed

+129372
-2
lines changed

SampleApplications/2023/ServerlessDocumentAnalysis/.gitignore

Lines changed: 420 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 177 additions & 0 deletions
@@ -0,0 +1,177 @@
1+
# Serverless Document Analysis with .NET
2+
3+
## Introduction
4+
This is a sample project that shows a full end-to-end application built entirely with AWS serverless services. It uses [Amazon Textract](https://aws.amazon.com/pm/textract) to analyze a PDF file. You can preconfigure natural language queries that Textract will attempt to answer (e.g. 'What is the date of service of this invoice?'). It also submits the document for expense analysis and returns the results as a set of expense metadata documents.
5+
6+
This demonstrates how you can use .NET for an end-to-end serverless document processing solution on AWS. This README describes the solution in full. The service can be deployed into an AWS account and, because it is self-contained, can serve as an add-on to an existing application.
7+
8+
## What specifically does this sample demonstrate?
9+
This solution is meant to be useful in a real-world scenario in which multiple technologies, techniques, and services are used. Specifically, this document analysis tool showcases the technologies listed below.
10+
11+
1. ### AWS Lambda with .NET
12+
- Custom runtime functions using .NET 8.0
13+
- Observability implemented using [Powertools for AWS Lambda (.NET)](https://docs.powertools.aws.dev/lambda/dotnet/)
14+
- [Lambda Annotations Framework](https://github.com/aws/aws-lambda-dotnet/blob/master/Libraries/src/Amazon.Lambda.Annotations/README.md) to implement dependency injection, with source generation to automatically create the "Main" method.
15+
16+
1. ### Infrastructure as Code with CDK
17+
- All infrastructure is expressed with [AWS CDK (C#/.NET)](https://docs.aws.amazon.com/cdk/v2/guide/work-with-cdk-csharp.html) with .NET 8.0.
18+
19+
1. ### Serverless AWS Services
20+
- Data and configuration are stored in an [Amazon DynamoDB](https://aws.amazon.com/dynamodb/) table. Data access uses the [.NET Object persistence model](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DotNetSDKHighLevel.html) to simplify data access with POCO objects.
21+
- The Lambda functions are orchestrated using an [AWS Step Functions](https://aws.amazon.com/step-functions/) standard workflow. A standard workflow was chosen because it supports the [Wait for Callback (task token)](https://docs.aws.amazon.com/step-functions/latest/dg/connect-to-services.html#connect-to-services-integration-patterns) integration pattern.
22+
- [Amazon Textract](https://aws.amazon.com/pm/textract) provides document analysis (standard and expense) capabilities.
23+
- An [Amazon EventBridge rule](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-rules.html) is used to automatically trigger the workflow when a document is uploaded to an [Amazon S3](https://aws.amazon.com/s3) bucket.
24+
- Feedback is provided to the client application through two [Amazon SQS](https://aws.amazon.com/sqs) queues.
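
The Wait for Callback (task token) pattern mentioned above can also be exercised by hand with the AWS CLI, which may help when debugging a paused execution. The following is a minimal sketch, not the sample's own implementation (the sample performs this step in a .NET Lambda function). The function names are hypothetical, and treating every Textract job status other than `SUCCEEDED` as a failure is an assumption.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, and mapping every status
# other than SUCCEEDED to a failure callback is an assumption.

# Print the Step Functions CLI subcommand that matches a Textract job status.
callback_command() {
  case "$1" in
    SUCCEEDED) echo "send-task-success" ;;
    *)         echo "send-task-failure" ;;
  esac
}

# Resume a paused execution. $1 = task token, $2 = Textract job status.
resume_execution() {
  if [ "$(callback_command "$2")" = "send-task-success" ]; then
    aws stepfunctions send-task-success \
      --task-token "$1" \
      --task-output '{"status":"SUCCEEDED"}'
  else
    aws stepfunctions send-task-failure \
      --task-token "$1" \
      --error "TextractJobFailed" \
      --cause "Textract reported status $2"
  fi
}
```

Calling `resume_execution "$TASK_TOKEN" SUCCEEDED` would let the paused workflow continue, where `TASK_TOKEN` is the token Step Functions handed to the waiting task.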
25+
26+
## Overview of the solution
27+
28+
This is an overview of the process. The names of the resources are generic, since each deployment will yield resources with different physical names (to avoid resource name collisions). Some design decisions are noted below, but there are alternative ways of accomplishing some of the items.
29+
![Overview of serverless document analysis](/assets/doc-analysis-overview.jpg)
30+
31+
This application is self-contained. We will refer to an external application that integrates with this system as the "client application". There can be more than one client application, and a client application that provides input (i.e. uploads a file) may be different from an application that responds to the output of the system.
32+
1. A client application writes a PDF to the `InputBucket` S3 bucket.
33+
34+
- If the service has been configured to use natural language queries (explanation below), a subset of them can be specified using a colon-separated list of query keys, supplied as a tag on the uploaded S3 object. For example:
35+
36+
`Tag: "Queries"`
37+
38+
`Value: 'q1:q2:q3'`
39+
40+
If no queries are supplied, then all configured queries will be used.
41+
42+
- The client can also supply an ID that will be passed through the entire system. This allows an uploaded file's result to be correlated with a record in the client's system. For example:
43+
44+
`Tag: "Id"`
45+
46+
`Value: "abc-12345"`
47+
48+
_Note: A client application must have permissions to write files to the InputBucket. A CloudFormation output is created when this is deployed, `inputBucketPolicyOutput`, that provides an example IAM policy that you can use to allow access to the bucket._
49+
50+
2. An EventBridge rule triggers the Step Function.
51+
52+
3. The Step Function definition is shown below. It consists of seven Lambda function integrations and two SQS integrations. Any unrecoverable errors (from any of the Lambda functions) are caught and sent to the `FailureFunction` function, which then writes a message to the `FailureQueue` with details for the client.
53+
54+
![Step Function Definition](/assets/stepfunctions_graph.png)
55+
56+
4. The EventBridge message is parsed by the `InitializeProcessing` Lambda function, which creates a record in the `ProcessData` DynamoDB table. It also retrieves the query text from the `ConfigData` DynamoDB table for use in the next step.
57+
58+
5. In the `SubmitToTextract` Lambda function, the uploaded file is submitted to Textract for standard analysis. This step in the workflow uses the `Wait for Task Token` pattern; the step function will pause until restarted.
59+
60+
6. When Textract is complete, it writes the output to the `TextractBucket` S3 bucket and sends a message to the supplied SNS topic, `TextractSuccessTopic`. The Lambda function `RestartStepFunction` then calls _SendTaskSuccess_ or _SendTaskFailure_, depending on the Textract job status.
61+
62+
7. The function `ProcessTextractQueryResults` retrieves the results from the `TextractBucket` bucket, and writes all the query results to the `ProcessData` table.
63+
64+
8. In the `SubmitToTextractExpense` Lambda function, the uploaded file is submitted to Textract for expense analysis. This step in the workflow uses the `Wait for Task Token` pattern; the step function will pause until restarted.
65+
66+
9. When Textract is complete, step 6 is repeated, and the Step Function is restarted accordingly.
67+
68+
10. The function `ProcessTextractExpenseResults` retrieves the results from the `TextractBucket` bucket, and writes all the expense results to the `ProcessData` table.
69+
70+
11. The `SuccessFunction` Lambda function formats the results from both query and expense analyses, and writes the data to the `SuccessQueue` queue.
71+
72+
_Note: A client application must have permission to access both the `SuccessQueue` and `FailureQueue`. A CloudFormation output is created for each when this application is deployed, `failureQueueOutput` and `successQueueOutput`. These provide example IAM policies that you can use to allow access to the queues._
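
Steps 1 and 11 are the two points where a client application touches the system. As a rough sketch using the AWS CLI, a client could upload a document with the `Queries` and `Id` tags and then poll the success queue. The bucket name, queue URL, and function names here are placeholders rather than values from this repository, and the shape of the messages on the queue is not assumed.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical; bucket and queue values
# must come from your own deployment.

# Build the tag string passed to `--tagging`.
# $1 = colon-separated query keys (may be empty), $2 = client Id (may be empty)
build_tagging() {
  tags=""
  if [ -n "$1" ]; then
    tags="Queries=$1"
  fi
  if [ -n "$2" ]; then
    if [ -n "$tags" ]; then
      tags="$tags&"
    fi
    tags="${tags}Id=$2"
  fi
  echo "$tags"
}

# Upload a PDF with its tags set at creation time.
# $1 = local file, $2 = input bucket name, $3 = query keys, $4 = client Id
# Assumes at least one of $3 or $4 is non-empty.
upload_document() {
  aws s3api put-object \
    --bucket "$2" \
    --key "$(basename "$1")" \
    --body "$1" \
    --tagging "$(build_tagging "$3" "$4")"
}

# Long-poll a queue for one result message. $1 = queue URL
receive_result() {
  aws sqs receive-message \
    --queue-url "$1" \
    --wait-time-seconds 20 \
    --max-number-of-messages 1
}
```

For example, `upload_document invoice.pdf "$INPUT_BUCKET" "q1:q2:q3" "abc-12345"` followed by `receive_result "$SUCCESS_QUEUE_URL"`, where both variables hold values from your deployment.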
73+
74+
## Codebase
75+
76+
This is a brief explanation of the solution's codebase.
77+
78+
`/assets` - Images and diagrams
79+
80+
`/functions` - Lambda function source code
81+
82+
`/infrastructure` - CDK .NET project source code
83+
84+
## Deploying in your environment
85+
86+
### Prerequisites
87+
To deploy this solution, you will need the following prerequisites.
88+
- Clone this repository
89+
- You will need an AWS account and an IAM user with adequate permissions to deploy resources. You will need to set up a [credentials profile](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html). For the remainder of this exercise, we will assume the profile is named `my-profile`.
90+
- Install and set up the [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-quickstart.html)
91+
- Install the [.NET 8.0 SDK](https://dotnet.microsoft.com/en-us/download/dotnet/8.0)
92+
- Install the latest version of the [AWS CDK](https://docs.aws.amazon.com/cdk/v2/guide/getting_started.html), and [bootstrap the environment](https://docs.aws.amazon.com/cdk/v2/guide/bootstrapping.html)
93+
- Install the [AWS Amazon.Lambda.Tools .NET Global CLI tools](https://docs.aws.amazon.com/lambda/latest/dg/csharp-package-cli.html)
94+
95+
### Build
96+
Before you deploy the solution, you will need to build the .NET 8.0 Lambda functions. A script is included to build all the Lambda functions, for both Windows and Linux/Mac.
97+
98+
Windows:
99+
```
100+
cd infrastructure
101+
.\build.bat
102+
```
103+
104+
Linux/Mac:
105+
```
106+
cd infrastructure
107+
sh build.sh
108+
```
109+
110+
The process will take several minutes. The Lambda function archives are output to the `infrastructure/function-output/` directory.
111+
112+
### Deploy
113+
114+
To deploy the CDK application, you will need to supply several context values. They are:
115+
116+
- `environmentName` - The environment deployed to (e.g. 'dev', 'test', 'prod')
117+
- `stackName` - The name of the CloudFormation stack this will be deployed as. _Note: the stack name will have the environment name as a suffix._
118+
- `functionDirectory` - The directory where the .NET Lambda function archives are located. If not supplied, it defaults to './function-output'.
119+
- `resourcePrefix` - A prefix used when physically naming resources. This must be all lowercase and alphanumeric. Defaults to 'docprocessing'.
120+
121+
You can supply these [runtime context](https://docs.aws.amazon.com/cdk/v2/guide/context.html) values in several ways. For the purpose of this demo, you can use a local file.
122+
123+
Create a file called `cdk.context.json` in the `infrastructure` directory.
124+
125+
Populate it similarly to the following:
126+
```
127+
{
128+
"environmentName":"dev",
129+
"stackName":"docAnalysis",
130+
"functionBaseDirectory":"./function-output",
131+
"resourcePrefix":"doc"
132+
}
133+
```
134+
135+
Synthesize the CDK stack with:
136+
137+
```
138+
cdk synth
139+
```
140+
_Note:_ You can actually include the build in the synthesis step by adding the `--build` switch:
141+
142+
```
143+
cdk synth --build .\build.bat
144+
```
145+
146+
You can then deploy the stack with the following command:
147+
148+
```
149+
cdk deploy --profile my-profile
150+
```
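
As an alternative to the local file, the CDK CLI also accepts context values on the command line through its `-c` (`--context`) option. The sketch below assembles those flags and, after deployment, lists the stack outputs such as `inputBucketPolicyOutput`, `successQueueOutput`, and `failureQueueOutput`. The function names are hypothetical, and the deployed stack name (which includes the environment suffix) is passed in as a parameter because its exact format is not assumed here.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical.

# Print the -c flags for the three context values.
# $1 = environmentName, $2 = stackName, $3 = resourcePrefix
context_args() {
  printf '%s\n' "-c environmentName=$1 -c stackName=$2 -c resourcePrefix=$3"
}

# Deploy using command-line context instead of cdk.context.json.
deploy_stack() {
  # The substitution is left unquoted on purpose so the flags split into words.
  cdk deploy --profile my-profile $(context_args "$1" "$2" "$3")
}

# List the CloudFormation outputs of the deployed stack.
# $1 = deployed stack name (including the environment suffix)
show_stack_outputs() {
  aws cloudformation describe-stacks \
    --stack-name "$1" \
    --profile my-profile \
    --query 'Stacks[0].Outputs[].[OutputKey,OutputValue]' \
    --output text
}
```

With the values from the example file, `deploy_stack dev docAnalysis doc` would run `cdk deploy` with the same settings.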
151+
152+
### Configure queries
153+
154+
To configure natural language queries for your environment, you will need to add them to the "QueryData" DynamoDB table. Each record represents one query. This is an example of what a query should look like:
155+
156+
```
157+
{
158+
"query":"q1",
159+
"queryText":"What is the date of service?"
160+
}
161+
```
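
One way to add such a record is with the AWS CLI. This is a minimal sketch that assumes both attributes are stored as DynamoDB strings, as the example above suggests. The function names are hypothetical, and the physical table name is passed in as a parameter because it differs per deployment.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical. Assumes both attributes are
# strings and that the values contain no double quotes.

# Print a query record in DynamoDB's attribute-value JSON format.
# $1 = query key, $2 = query text
query_item() {
  printf '{"query":{"S":"%s"},"queryText":{"S":"%s"}}\n' "$1" "$2"
}

# Write one query record. $1 = physical table name, $2 = key, $3 = text
add_query() {
  aws dynamodb put-item \
    --table-name "$1" \
    --item "$(query_item "$2" "$3")" \
    --profile my-profile
}
```

For example, `add_query "$QUERY_TABLE" q1 "What is the date of service?"`, where `QUERY_TABLE` holds the table name from your deployment.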
162+
163+
## Cleanup
164+
You can remove the infrastructure by using the following command:
165+
166+
```
167+
cdk destroy --profile my-profile
168+
```
169+
You can also manually delete the CloudFormation stack that was originally created.
170+
171+
This will delete any resources created, as well as any data contained within your S3 buckets or DynamoDB tables.
172+
173+
## TODO
174+
These are some items that will be added at a later date to make the solution more extensible.
175+
1. Create a Systems Manager Parameter that will parameterize the following items:
176+
- The name of the tag that is used to specify queries to be applied to the document analysis (currently hardcoded to 'Queries')
177+
1. Add a configuration switch that enables building the .NET Lambda functions with Native AOT
