To query tables through GraphQL, send the query in a `POST` request to the `/api/graphql` endpoint.

The GraphQL query frontend supports the same set of operators as the REST query frontend. Here is how you can apply various operators in a query:
```graphql
{
  table_name(
    filter: {
      col1: false
      col2: { gteq: 4, lt: 1000 }
    }
    sort: [
      { field: "col2", order: "desc" }
      { field: "col3" }
    ]
    limit: 100
  ) {
    col1
    col2
    col3
  }
}
```
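For example, a query body can be posted with curl; the table and column names below are the same placeholders used in the query above:

```bash
curl -s -X POST localhost:8080/api/graphql --data-binary @- <<EOF
query {
  table_name(limit: 100) {
    col1
    col2
  }
}
EOF
```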
+ ROAPI exposes a diverse set of interfaces through the HTTP protocol.
Query key-value stores through the REST API by sending `GET` requests to `/api/kv/{kv_name}/{key}`. For example, the kvstore defined in the sample config can be queried like below:

```bash
$ curl -v localhost:8080/api/kv/spacex_launch_name/600f9a8d8f798e2a4d5f979e
Starlink-21 (v1.0)
```
Query tables through the REST API by sending `GET` requests to `/api/tables/{table_name}`. Query operators are specified as query params.

The REST query frontend currently supports the following query operators:
To sort column `col1` in ascending order and `col2` in descending order, set the query param to `sort=col1,-col2`.
To find all rows with `col1` equal to string `'foo'`, set the query param to `filter[col1]='foo'`. You can also do basic comparisons with filters; for example, the predicate `0 <= col2 < 5` can be expressed as `filter[col2]gte=0&filter[col2]lt=5`.
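For example, a sketch of a request that combines filter and sort against a hypothetical `table_name` table:

```bash
curl "localhost:8080/api/tables/table_name?filter[col2]gte=0&filter[col2]lt=5&sort=col1,-col2"
```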
To query tables using a subset of standard SQL, send the query payload in a `POST` request to the `/api/sql` endpoint.

The SQL query frontend is more flexible and powerful than the REST and GraphQL frontends, and it is the only query frontend that supports table joins.
The SQL frontend supports the `[""]` operator for accessing struct fields and array elements, for example `struct_col["key1"]["key2"]` or `array_col[0]`.
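As an illustrative sketch, assuming a hypothetical table `mytable` with a struct column `struct_col` and an array column `array_col`:

```bash
curl -X POST \
  -d 'SELECT struct_col["key1"]["key2"], array_col[0] FROM mytable LIMIT 10' \
  localhost:8080/api/sql
```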
By default, ROAPI encodes responses in JSON format, but you can request a different encoding by specifying the `ACCEPT` header:

```bash
curl -X POST \
     -H 'ACCEPT: application/vnd.apache.arrow.stream' \
     -d "SELECT launch_library_id FROM spacex_launches WHERE launch_library_id IS NOT NULL" \
     localhost:8080/api/sql
```
ROAPI currently supports the following serialization formats:

- `application/json`
- `application/csv`
- `application/vnd.apache.arrow.file`
- `application/vnd.apache.arrow.stream`
- `application/parquet`
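For example, the same kind of query can be returned as CSV instead of JSON; this sketch assumes the sample `spacex_launches` table is loaded:

```bash
curl -X POST \
     -H 'ACCEPT: application/csv' \
     -d "SELECT id, name FROM spacex_launches LIMIT 5" \
     localhost:8080/api/sql
```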
Get schemas for all tables:

```bash
curl localhost:8080/api/schema
```

Get the schema for a specific table by name:

```bash
curl localhost:8080/api/schema/{TABLE_NAME}
```
ROAPI currently supports the following blob storages: filesystem, HTTP, S3, GCS, and Azure Storage.
The filesystem store can be specified using the `file:` or `filesystem:` schemes. In a Windows environment, the scheme is mandatory. On Unix systems, a URI without a scheme prefix is treated as a filesystem-backed data source by ROAPI.
For example, to serve a local Parquet file `test_data/blogs.parquet`, you can just set the URI to the file path:

```yaml
tables:
  - name: "blogs"
    uri: "test_data/blogs.parquet"
```
The filesystem store supports loading partitioned tables. In other words, you can split up a table into multiple files and load all of them into a single table by setting the URI to the directory path. When loading a partitioned dataset, you will need to manually specify the table format, since the URI will not contain the table format as a suffix:

```yaml
tables:
  - name: "blogs"
    uri: "test_data/blogs/"
    option:
      format: "parquet"
```
ROAPI can build tables from datasets served over HTTP. However, keep in mind that the HTTP store doesn't support partitioned datasets, because there is no native directory listing support in the HTTP protocol.
To set custom headers for HTTP requests, you can use the `headers` key under `io_option`:

```yaml
tables:
  - name: "TABLE_NAME"
    uri: "http://bucket/TABLE/KEY.csv"
    io_option:
      headers:
        'Content-Type': 'application/json'
        Authorization: 'Bearer TOKEN'
```
ROAPI can build tables from datasets hosted in S3 buckets. Configuration is similar to the filesystem store:

```yaml
tables:
  - name: "TABLE_NAME"
    uri: "s3://BUCKET/TABLE/KEY"
    option:
      format: "csv"
```
+To configure S3 credentials, you can set the following environment variables:
- `AWS_ENDPOINT_URL`: The endpoint URL for your S3 provider
- `AWS_REGION`: The region where your S3 bucket is located
- `AWS_SECRET_ACCESS_KEY`: The secret access key from your S3 provider
- `AWS_ACCESS_KEY_ID`: The access key ID from your S3 provider
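For example, a sketch of serving a CSV object from an S3 bucket with static credentials; the region, bucket, key, and credential values are placeholders:

```bash
export AWS_REGION=us-east-1
export AWS_ACCESS_KEY_ID=EXAMPLE_KEY_ID
export AWS_SECRET_ACCESS_KEY=EXAMPLE_SECRET
roapi --table 'mytable=s3://my-bucket/path/data.csv,format=csv'
```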
ROAPI can build tables from datasets hosted in GCS buckets. Configuration is similar to the filesystem store:

```yaml
tables:
  - name: "TABLE_NAME"
    uri: "gs://BUCKET/TABLE/KEY"
    option:
      format: "csv"
```
To configure GCS credentials, you can set the following environment variables if you are using a service account:

- `GOOGLE_SERVICE_ACCOUNT` / `GOOGLE_SERVICE_ACCOUNT_PATH`: location of the service account file
- `GOOGLE_SERVICE_ACCOUNT_KEY`: JSON-serialized service account key
- `GOOGLE_APPLICATION_CREDENTIALS`: set by the gcloud SDK

ROAPI can build tables from datasets hosted in Azure Storage. Configuration is similar to the filesystem store:

```yaml
tables:
  - name: "TABLE_NAME"
    uri: "az://BUCKET/TABLE/KEY"
    option:
      format: "csv"
```

The supported URL schemes are:

- `abfs[s]://<container>/<path>`
- `az://<container>/<path>`
- `adl://<container>/<path>`
- `azure://<container>/<path>`
To configure Azure credentials, you can set the following environment variables:

- `AZURE_STORAGE_ACCOUNT_NAME`
- `AZURE_STORAGE_ACCOUNT_KEY`
- `AZURE_STORAGE_CLIENT_ID`
- `AZURE_STORAGE_TENANT_ID`
- `AZURE_STORAGE_SAS_KEY`
- `AZURE_STORAGE_TOKEN`
- `AZURE_MSI_ENDPOINT` / `AZURE_IDENTITY_ENDPOINT`: Endpoint to request an IMDS managed identity token
- `AZURE_OBJECT_ID`: Object ID for use with managed identity authentication
- `AZURE_MSI_RESOURCE_ID`: MSI resource ID for use with managed identity authentication
- `AZURE_FEDERATED_TOKEN_FILE`: File containing a token for Azure AD workload identity federation

You can configure ROAPI to load as many tables as you want by repeating the `--table` argument:
```bash
roapi --table 'table1_name=table1_uri' --table 'table2_name=table2_uri'
```

You can use the `RUST_LOG` environment variable to control the logging verbosity:

```bash
RUST_LOG=debug roapi ...
```
Here is the output from `roapi --help`:

```text
roapi 0.7.1
QP Hou
Create full-fledged APIs for static datasets without writing a single line of code.

USAGE:
    roapi [OPTIONS]

OPTIONS:
    -a, --addr-http <IP:PORT>
            HTTP endpoint bind address

    -c, --config <config>
            config file path

    -d, --disable-read-only
            Start roapi in read write mode

    -h, --help
            Print help information

    -p, --addr-postgres <IP:PORT>
            Postgres endpoint bind address

    -t, --table <[table_name=]uri[,option_key=option_value]>
            Table sources to load. Table option can be provided as optional setting as part of the
            table URI, for example: `blogs=s3://bucket/key,format=delta`. Set table uri to `stdin`
            if you want to consume table data from stdin as part of a UNIX pipe. If no table_name is
            provided, a table name will be derived from the filename in URI.

    -V, --version
            Print version information
```
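As an illustrative sketch of the `stdin` form of the `--table` option described in the help text above (the file name and format value here are assumptions):

```bash
cat test_data/blogs.ndjson | roapi --table 'blogs=stdin,format=ndjson'
```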
You can configure multiple table sources using a YAML config, which supports more advanced, format-specific table options. For example:

```yaml
addr:
  # binding address for TCP port that speaks HTTP protocol
  http: 0.0.0.0:8080
  # binding address for TCP port that speaks Postgres wire protocol
  postgres: 0.0.0.0:5432
tables:
  - name: "blogs"
    uri: "test_data/blogs.parquet"

  - name: "ubuntu_ami"
    uri: "test_data/ubuntu-ami.json"
    option:
      format: "json"
      pointer: "/aaData"
      array_encoded: true
    schema:
      columns:
        - name: "zone"
          data_type: "Utf8"
        - name: "name"
          data_type: "Utf8"
        - name: "version"
          data_type: "Utf8"
        - name: "arch"
          data_type: "Utf8"
        - name: "instance_type"
          data_type: "Utf8"
        - name: "release"
          data_type: "Utf8"
        - name: "ami_id"
          data_type: "Utf8"
        - name: "aki_id"
          data_type: "Utf8"

  - name: "spacex_launches"
    uri: "https://api.spacexdata.com/v4/launches"
    option:
      format: "json"

  - name: "github_jobs"
    uri: "https://web.archive.org/web/20210507025928if_/https://jobs.github.com/positions.json"
```
Table sources can be loaded into in-memory key-value stores if you specify which two columns to use for keys and values in the config:

```yaml
kvstores:
  - name: "spacex_launch_name"
    uri: "test_data/spacex_launches.json"
    key: id
    value: name
```
The above config will create a key-value store named `spacex_launch_name` that allows you to look up SpaceX launch names using launch IDs.
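For example, the store can then be queried through the key-value REST endpoint using one of the launch IDs:

```bash
curl localhost:8080/api/kv/spacex_launch_name/600f9a8d8f798e2a4d5f979e
```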
You can override DataFusion configuration settings by specifying them in the `datafusion` section of your config file. This allows you to tune the query engine's behavior for your specific use case:

```yaml
datafusion:
  "execution.collect_statistics": "true"
  "execution.batch_size": "8192"
  "sql_parser.enable_ident_normalization": "true"
```

The `datafusion` field accepts a map of configuration key-value pairs where both keys and values are strings. You can reference the DataFusion configuration documentation for a complete list of available configuration options.
Use the `-c` argument to run ROAPI with a specific config file:

```bash
roapi -c ./roapi.yml
```
Thanks to the connector-x Rust crate, ROAPI is able to load tables from popular relational databases like MySQL, PostgreSQL, and SQLite.

```yaml
tables:
  - name: "table_foo"
    uri: "mysql://username:password@localhost:3306/database"
  - name: "table_bar"
    uri: "mysql://username:password@localhost:3306/database"
  - name: "table_baz"
    uri: "sqlite://path/to/sqlite/file"
```
With this, you can now write a single SQL query to join tables between MySQL, SQLite, and local CSV files!
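For example, a sketch of a cross-source join through the SQL frontend, assuming the tables above share a hypothetical `id` column:

```bash
curl -X POST \
  -d "SELECT f.id, b.id FROM table_foo f JOIN table_baz b ON f.id = b.id LIMIT 10" \
  localhost:8080/api/sql
```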
By default, ROAPI will use the provided table name as the table name to extract data from the database. If you want to expose a specific database table under a different name, you can use the `option` field to specify the original table name like below:

```yaml
tables:
  - name: custom_table_name
    uri: "sqlite://path/to/sqlite/file"
    option:
      table: original_table_name
      format: sqlite
```
When a table URI ends in `.arrow` or `.arrows`, ROAPI will try to load it as an Arrow IPC file or stream if no `format` option is specified:

```yaml
tables:
  - name: "mytable"
    uri: "http://mytable.arrow/"  # or .arrows
```
You can partition an Arrow dataset into multiple partitions and load all of them into a single table by directory path:

```yaml
tables:
  - name: "mytable"
    uri: "./table_dir"
    option:
      format: "arrow"  # or arrows
```
When a table URI ends in `.csv`, ROAPI will try to load it as a CSV table if no `format` option is specified:

```yaml
tables:
  - name: "mytable"
    uri: "http://mytable.csv/"
```
You can partition a CSV dataset into multiple partitions and load all of them into a single table by directory path:

```yaml
tables:
  - name: "mytable"
    uri: "./table_dir"
    option:
      format: "csv"
```
ROAPI supports loading Delta tables from Delta Lake through delta-rs. Since a Delta table doesn't have an extension suffix, ROAPI cannot infer the table format from the table URI alone. Therefore, the `format` option needs to be set to `delta` explicitly for Delta table sources:

```yaml
tables:
  - name: "mytable"
    uri: "s3://bucket/delta_table/path"
    option:
      format: "delta"
```
ROAPI loads the entire table into memory by default. If your table is large or you want to avoid loading all data during startup, you can set the additional option `use_memory_table: false` (default: `true`). With that configuration, ROAPI will not copy the data into memory, but will instruct DataFusion to operate directly on the backing storage. At the moment, this comes with some limitations.

Example:

```yaml
tables:
  - name: "mytable"
    uri: "./path/to/delta_table"
    option:
      format: "delta"
      use_memory_table: false
```
Note that when providing the `use_memory_table` option, it becomes necessary to also specify the format.
ROAPI supports loading Microsoft Excel-compatible formats, including xls, xlsx, xlsb, and ods files.
```yaml
tables:
  - name: "mytable"
    uri: "path/to/file.xlsx"
    option:
      format: "xlsx"
      sheet_name: "Sheet1"
```
You can specify which sheet to load from the spreadsheet using the `sheet_name` option. By default, ROAPI will use the first sheet if no sheet name is specified:

```yaml
tables:
  - name: "mytable"
    uri: "path/to/file.xlsx"
    option:
      format: "xlsx"
      sheet_name: "MyDataSheet"
```
You can also specify a specific range of cells to load from the spreadsheet:

```yaml
tables:
  - name: "mytable"
    uri: "path/to/file.xlsx"
    option:
      format: "xlsx"
      sheet_name: "Sheet1"
      rows_range_start: 1
      rows_range_end: 4
      columns_range_start: 1
      columns_range_end: 3
```
ROAPI can automatically infer the schema from your Excel data. The first row within the specified range is treated as column names, and ROAPI will analyze the remaining rows to determine data types.
You can control schema inference with the `schema_inference_lines` option, which specifies how many rows to analyze (including the header row). For example, `schema_inference_lines: 3` will use the first row for column names and analyze 2 additional rows for data types. If a column contains mixed data types (like both integers and floats), ROAPI will default to the Utf8 (string) data type.
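For example, a minimal sketch that limits inference to the first three rows:

```yaml
tables:
  - name: "mytable"
    uri: "path/to/file.xlsx"
    option:
      format: "xlsx"
      schema_inference_lines: 3
```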
For better performance and predictable data types, you can define the schema explicitly in your configuration:

```yaml
tables:
  - name: "excel_table"
    uri: "path/to/file.xlsx"
    option:
      format: "xlsx"
    schema:
      columns:
        - name: "int_column"
          data_type: "Int64"
          nullable: true
        - name: "string_column"
          data_type: "Utf8"
          nullable: true
        - name: "float_column"
          data_type: "Float64"
          nullable: true
        - name: "datetime_column"
          data_type: !Timestamp [Seconds, null]
          nullable: true
        - name: "duration_column"
          data_type: !Duration Second
          nullable: true
        - name: "date32_column"
          data_type: Date32
          nullable: true
        - name: "date64_column"
          data_type: Date64
          nullable: true
        - name: "null_column"
          data_type: Null
          nullable: true
```
To serve a Google spreadsheet as an API, you need to gather the following config values:

- The spreadsheet ID and sheet ID, which can be found in the spreadsheet URL: `https://docs.google.com/spreadsheets/d/{SPREADSHEET_ID}#gid={SHEET_ID}`.
- The sheet title, required only if `SHEET_ID` is not specified in the URL through `#gid`.
- A service account key for accessing the Google Sheets API (see the steps below).
+ADD KEY
-> Create new key
. Then select JSON format
+and save it somewhere safe as a file. {ACCOUNT_NAME}@{PROJECT_ID}.iam.gserviceaccount.com
. Now you can configure ROAPI to load the google spreadsheet into a table using:
+tables:
+ - name: "table_name"
+ uri: "https://docs.google.com/spreadsheets/d/1-lc4oij04aXzFSRMwVBLjU76s-K0-s6UPc2biOvtuuU#gid=0"
+ option:
+ format: "google_spreadsheet"
+ application_secret_path: "path/to/service_account_key.json"
+ # sheet_title is optional if `#gid` is specified in uri
+ sheet_title: "sheet_name_within_google_spreadsheet"
+
ROAPI only invokes the Google Sheets API during the initial data load; subsequent query requests are served using in-memory data.
Here is what it looks like to serve this public Google spreadsheet with ROAPI.

Start the server:
```bash
$ roapi -c local.yaml
[2022-05-31T01:07:55Z INFO roapi::context] loading `uri(https://docs.google.com/spreadsheets/d/1-lc4oij04aXzFSRMwVBLjU76s-K0-s6UPc2biOvtuuU#gid=0)` as table `properties`
[2022-05-31T01:07:56Z INFO roapi::context] registered `uri(https://docs.google.com/spreadsheets/d/1-lc4oij04aXzFSRMwVBLjU76s-K0-s6UPc2biOvtuuU#gid=0)` as table `properties`
[2022-05-31T01:07:56Z INFO roapi::startup] 🚀 Listening on 127.0.0.1:5432 for Postgres traffic...
[2022-05-31T01:07:56Z INFO roapi::startup] 🚀 Listening on 127.0.0.1:8080 for HTTP traffic...
```
Query through the Postgres wire protocol:

```text
$ psql -h 127.0.0.1
psql (12.10 (Ubuntu 12.10-0ubuntu0.20.04.1), server 13)
WARNING: psql major version 12, server major version 13.
         Some psql features might not work.
Type "help" for help.

houqp=> select "Address", "Bed", "Bath", "Occupied" from properties;
     Address      | Bed | Bath | Occupied
------------------+-----+------+----------
 Bothell, WA      |   3 |    2 | f
 Lynnwood, WA     |   2 |    1 | f
 Kirkland, WA     |   4 |    2 | f
 Kent, WA         |   3 |    2 | f
 Mount Vernon, WA |   2 |    1 | f
 Seattle, WA      |   3 |    1 | f
 Seattle, WA      |   2 |    1 | f
 Shoreline, WA    |   1 |    1 | f
 Bellevue, WA     |   3 |    1 | f
 Renton, WA       |   4 |    2 | f
 Woodinville, WA  |   3 |    3 | f
 Kenmore, WA      |   4 |    3 | f
 Fremont, WA      |   5 |    3 | f
 Redmond, WA      |   2 |    2 | f
 Mill Creek, WA   |   3 |    3 | f
(15 rows)
```
Query with aggregation using the HTTP SQL frontend:

```bash
$ curl -s -X POST localhost:8080/api/sql --data-binary @- <<EOF | jq
SELECT DISTINCT("Landlord"), COUNT("Address")
FROM properties
GROUP BY "Landlord"
EOF
```

```json
[
  {
    "Landlord": "Carl",
    "COUNT(properties.Address)": 3
  },
  {
    "Landlord": "Roger",
    "COUNT(properties.Address)": 3
  },
  {
    "Landlord": "Mike",
    "COUNT(properties.Address)": 4
  },
  {
    "Landlord": "Sam",
    "COUNT(properties.Address)": 2
  },
  {
    "Landlord": "Daniel",
    "COUNT(properties.Address)": 3
  }
]
```
Query with a filter using the HTTP GraphQL frontend:

```bash
$ curl -s -X POST localhost:8080/api/graphql --data-binary @- <<EOF | jq
query {
  properties(
    filter: {
      Bed: { gt: 3 }
      Bath: { gte: 3 }
    }
    sort: [
      { field: "Bed", order: "desc" }
    ]
  ) {
    Address
    Bed
    Bath
    Monthly_Rent
  }
}
EOF
```

```json
[
  {
    "Address": "Fremont, WA",
    "Bed": 5,
    "Bath": 3,
    "Monthly_Rent": "$4,500"
  },
  {
    "Address": "Kenmore, WA",
    "Bed": 4,
    "Bath": 3,
    "Monthly_Rent": "$4,000"
  }
]
```
Note: when inferring the table schema from Google spreadsheets, ROAPI automatically replaces spaces in column names with `_` (underscore).
When a table URI ends in `.json`, ROAPI will try to load it as a JSON table if no `format` option is specified:

```yaml
tables:
  - name: "mytable"
    uri: "http://mytable.json/"
```
Sometimes the JSON array you want to serve might be stored inside a JSON object. To support this use case, ROAPI supports loading JSON using a JSON pointer.

Take the following JSON data as an example:
```json
{
  "x": {
    "y": [{"col1": "z"}, {"col1": "zz"}]
  }
}
```
In order to only serve `[{"col1": "z"}, {"col1": "zz"}]` through ROAPI, you can configure the JSON table source as:

```yaml
tables:
  - name: "mytable"
    uri: "http://mytable.json/"
    option:
      format: "json"
      pointer: "/x/y"
```
Each row in JSON data can be encoded as an array for size reduction. This convention avoids repeating column names in every row.

For example:
```json
[
  {"col1": 1, "col2": "abc"},
  {"col1": 2, "col2": "efg"}
]
```
can be stored as:

```json
[
  [1, "abc"],
  [2, "efg"]
]
```
However, when loading JSON rows using array encoding, you must explicitly specify the schema, since there are no column names left in the data source for ROAPI to perform schema inference on:

```yaml
tables:
  - name: "mytable"
    uri: "http://mytable.json/"
    option:
      format: "json"
      array_encoded: true
    schema:
      columns:
        - name: "col1"
          data_type: "Int64"
        - name: "col2"
          data_type: "Utf8"
```
NDJSON stands for newline-delimited JSON. It is a convenient format for storing or streaming structured data that may be processed one record at a time.
When a table URI ends in `.ndjson`, ROAPI will try to load it as an NDJSON table if no `format` option is specified:

```yaml
tables:
  - name: "mytable"
    uri: "http://mytable.ndjson/"
```
When a table URI ends in `.parquet`, ROAPI will try to load it as a Parquet table if no `format` option is specified:

```yaml
tables:
  - name: "mytable"
    uri: "http://mytable.parquet/"
```
You can partition a Parquet dataset into multiple partitions and load all of them into a single table by directory path:

```yaml
tables:
  - name: "mytable"
    uri: "./table_dir"
    option:
      format: "parquet"
```
ROAPI loads the entire table into memory by default. If your table is large or you want to avoid loading all data during startup, you can set the additional option `use_memory_table: false` (default: `true`). With that configuration, ROAPI will not copy the data into memory, but will instruct DataFusion to operate directly on the backing storage. At the moment, this comes with some limitations.

Example:

```yaml
tables:
  - name: "mytable"
    uri: "./table_dir"
    option:
      format: "parquet"
      use_memory_table: false
```
Note that when providing the `use_memory_table` option, it becomes necessary to also specify the format.
ROAPI automatically spins up read-only APIs and query frontends for slowly moving datasets without requiring you to write a single line of code. It builds on top of Apache Arrow and DataFusion. Its pluggable query core design makes it possible for users to efficiently perform join queries across a diverse set of data sources, from simple CSV/Parquet files in data warehouses, to MySQL/Postgres, to SaaS products like Google spreadsheets.
See below for a high-level design diagram:

> If you'd like to share feedback on how you're using ROAPI, please fill out this survey. Thanks!