The Amazon Athena Query Federation SDK allows you to customize Amazon Athena with your own data sources and code.
Amazon Athena Query Federation
The Amazon Athena Query Federation SDK allows you to customize Amazon Athena with your own code. This enables you to integrate with new data sources and proprietary data formats, or to build new user-defined functions. Initially these customizations will be limited to the parts of a query that occur during a TableScan operation, but they will eventually be expanded to include other parts of the query lifecycle using the same easy-to-understand interface.
This functionality is currently in Public Preview while customers provide us feedback on usability and on the ease of using the service or building new connectors. We do not recommend that you use these connectors in production or use this preview to make assumptions about the performance of Athena’s Federation features. As we receive more feedback, we will make improvements to the preview and increase limits associated with query/connector performance, APIs, SDKs, and user experience. The best way to understand the performance of Athena Data Source Connectors is to run a benchmark when they become generally available (GA) or review our performance guidance.
To enable this Preview feature, create an Athena workgroup named AmazonAthenaPreviewFunctionality. Any query that federates to a connector, uses a UDF, or uses SageMaker inference must be run from that workgroup.
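As a rough illustration of the workgroup requirement, the sketch below creates the preview workgroup only if it is missing. It assumes a boto3 Athena client is passed in; the helper name `ensure_preview_workgroup` and the description text are hypothetical, and the same step can be done by hand in the Athena Console.

```python
# A minimal sketch, not an official setup script.
# Assumption: `athena_client` is a boto3 Athena client, e.g.
#   boto3.client("athena", region_name="us-east-1")
PREVIEW_WORKGROUP = "AmazonAthenaPreviewFunctionality"


def ensure_preview_workgroup(athena_client,
                             description="Workgroup for Athena preview features"):
    """Create the preview workgroup if it is missing.

    Returns True if the workgroup was created, False if it already existed.
    """
    existing = set()
    token = None
    # Page through all workgroups visible to the caller.
    while True:
        kwargs = {"NextToken": token} if token else {}
        page = athena_client.list_work_groups(**kwargs)
        existing.update(wg["Name"] for wg in page.get("WorkGroups", []))
        token = page.get("NextToken")
        if not token:
            break

    if PREVIEW_WORKGROUP in existing:
        return False

    athena_client.create_work_group(Name=PREVIEW_WORKGROUP,
                                    Description=description)
    return True
```

Taking the client as a parameter keeps the helper free of credentials and region settings, so it can be exercised with a stub before pointing it at a real account.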
TL;DR Get Started:
1. Ensure you have the proper permissions/policies to deploy and use Athena Federated Queries.
2. Navigate to the Serverless Application Repository and search for “athena-federation”. Be sure to check the box to show entries that require custom IAM roles.
3. Look for entries published by the “Amazon Athena Federation” author.
4. Deploy the application.
5. Go to the Athena Console in us-east-1 (N. Virginia) and create a workgroup called “AmazonAthenaPreviewFunctionality”. Any queries run from that workgroup will be able to use the Preview features described in this repository.
6. Run the query “show databases in `lambda:`”, appending the name of the Lambda function you deployed in the previous steps after `lambda:`.
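The last step can also be scripted. The following is a sketch under stated assumptions: it expects a boto3 Athena client to be passed in, the helper names `show_databases_sql` and `run_in_preview_workgroup` are hypothetical, and the S3 output location and Lambda function name are placeholders you would supply yourself.

```python
import time

PREVIEW_WORKGROUP = "AmazonAthenaPreviewFunctionality"
TERMINAL_STATES = ("SUCCEEDED", "FAILED", "CANCELLED")


def show_databases_sql(function_name):
    """Build the 'show databases' statement for a deployed connector Lambda."""
    if not function_name or "`" in function_name:
        raise ValueError("function_name must be non-empty and contain no backticks")
    return "show databases in `lambda:{}`".format(function_name)


def run_in_preview_workgroup(athena_client, sql, output_location,
                             poll_seconds=1.0, max_polls=60):
    """Submit `sql` from the preview workgroup and wait for a terminal state.

    Returns (query_execution_id, state).
    Assumption: `output_location` is an S3 URI you own, e.g. "s3://my-bucket/results/".
    """
    query_id = athena_client.start_query_execution(
        QueryString=sql,
        WorkGroup=PREVIEW_WORKGROUP,
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    for _ in range(max_polls):
        execution = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in TERMINAL_STATES:
            return query_id, state
        time.sleep(poll_seconds)

    raise TimeoutError("query {} did not finish in time".format(query_id))
```

With a real client, for example `boto3.client("athena", region_name="us-east-1")`, a call might look like `run_in_preview_workgroup(client, show_databases_sql("my-connector"), "s3://my-bucket/results/")`, where `my-connector` and the bucket are stand-ins for your own resources.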
For more information please consult:
- Intro Video
- SDK ReadMe
- Quick Start Guide
- Available Connectors
- Federation Features
- How To Build A Connector or UDF
- Gathering diagnostic info for support
- Frequently Asked Questions
- Common Problems
- Installation Pre-requisites
- Known Limitations & Open Issues
- Predicate Pushdown How-To
- Our GitHub Wiki
- Java Doc
We’ve written integrations with more than 20 databases, storage formats, and live APIs in order to refine this interface and balance flexibility with ease of use. We hope that making this SDK and initial set of connectors Open Source will allow us to continue to improve the experience and performance of Athena Query Federation.
Serverless Big Data Using AWS Lambda
Queries That Span Data Stores
Imagine a hypothetical e-commerce company whose architecture uses:
- Payment processing in a secure VPC with transaction records stored in HBase on EMR.
- Redis to store active orders so that the processing engine can access them quickly.
- DocumentDB (a MongoDB-compatible store) for customer account data such as email addresses and shipping addresses.
- An e-commerce site using auto-scaling on Fargate, with the product catalog in Amazon Aurora.
- CloudWatch Logs to house the Order Processor’s log events.
- A write-once-read-many data warehouse on Redshift.
- Shipment tracking data in DynamoDB.
- A fleet of Drivers performing last-mile delivery while utilizing IoT enabled tablets.
- Advertising conversion data from a 3rd party source.
Customer service agents begin receiving calls about orders ‘stuck’ in a weird state. Some show as pending even though they have been delivered; others show as delivered but haven’t actually shipped. It would be great if we could quickly run a query across this diverse architecture to understand which orders might be affected and what they have in common.
Using Amazon Athena Query Federation and many of the connectors found in this repository, our hypothetical e-commerce company would be able to run a query that:
- Grabs all active orders from Redis. (see athena-redis)
- Joins against any orders with ‘WARN’ or ‘ERROR’ events in CloudWatch Logs by using regex matching and extraction. (see athena-cloudwatch)
- Joins against our EC2 inventory to get the hostname(s) and status of the Order Processor(s) that logged the ‘WARN’ or ‘ERROR’. (see athena-cmdb)
- Joins against DocumentDB to obtain customer contact details for the affected orders. (see athena-docdb)
- Joins against DynamoDB to get shipping status and tracking details. (see athena-dynamodb)
- Joins against HBase to get payment status for the affected orders. (see athena-hbase)
```sql
WITH logs AS (
    SELECT log_stream,
           message AS order_processor_log,
           regexp_extract(message, '.*orderId=(\d+) .*', 1) AS orderId,
           regexp_extract(message, '(.*):.*', 1) AS log_level
    FROM "lambda:cloudwatch"."/var/ecommerce-engine/order-processor".all_log_streams
    WHERE regexp_extract(message, '(.*):.*', 1) != 'WARN'),
active_orders AS (
    SELECT *
    FROM redis.redis_db.redis_customer_orders),
order_processors AS (
    SELECT instanceid,
           publicipaddress,
           state.NAME
    FROM awscmdb.ec2.ec2_instances),
customer AS (
    SELECT id,
           email
    FROM docdb.customers.customer_info),
addresses AS (
    SELECT id,
           is_residential,
           address.street AS street
    FROM docdb.customers.customer_addresses),
shipments AS (
    SELECT order_id,
           shipment_id,
           from_unixtime(cast(shipped_date AS double)) AS shipment_time,
           carrier
    FROM lambda_ddb.default.order_shipments),
payments AS (
    SELECT "summary:order_id",
           "summary:status",
           "summary:cc_id",
           "details:network"
    FROM "hbase".hbase_payments.transactions)

SELECT _key_ AS redis_order_id,
       customer_id,
       customer.email AS cust_email,
       "summary:cc_id" AS credit_card,
       "details:network" AS CC_type,
       "summary:status" AS payment_status,
       status AS redis_status,
       addresses.street AS street_address,
       shipments.shipment_time AS shipment_time,
       shipments.carrier AS shipment_carrier,
       publicipaddress AS ec2_order_processor,
       NAME AS ec2_state,
       log_level,
       order_processor_log
FROM active_orders
     LEFT JOIN logs
            ON logs.orderid = active_orders._key_
     LEFT JOIN order_processors
            ON logs.log_stream = order_processors.instanceid
     LEFT JOIN customer
            ON customer.id = customer_id
     LEFT JOIN addresses
            ON addresses.id = address_id
     LEFT JOIN shipments
            ON shipments.order_id = active_orders._key_
     LEFT JOIN payments
            ON payments."summary:order_id" = active_orders._key_
```
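To make the log-parsing step of the query easier to follow, here is a small Python sketch of what the two regular-expression extractions would return. It assumes the patterns are `(.*):.*` for the log level and `.*orderId=(\d+) .*` for the order id. The sample messages are hypothetical, and Python's `re` module stands in for Athena's regex engine, which may differ in edge cases.

```python
import re

# Patterns assumed to match the ones used in the logs part of the query.
LOG_LEVEL_PATTERN = re.compile(r"(.*):.*")
ORDER_ID_PATTERN = re.compile(r".*orderId=(\d+) .*")


def extract(pattern, message, group=1):
    """Return the requested capture group of the first match, or None.

    None plays the role of the SQL NULL returned when nothing matches.
    """
    match = pattern.search(message)
    return match.group(group) if match else None


# Hypothetical log lines; real Order Processor messages may look different.
sample = "WARN: orderId=1234 payment confirmation timed out"
print(extract(LOG_LEVEL_PATTERN, sample))   # log level portion
print(extract(ORDER_ID_PATTERN, sample))    # numeric order id as a string

# The leading (.*) is greedy, so it captures up to the LAST colon in the line.
tricky = "ERROR: retry at 12:30 orderId=7 failed"
print(extract(LOG_LEVEL_PATTERN, tricky))
```

Because the first group is greedy, a message containing more than one colon yields more than the bare level, which is worth keeping in mind when filtering on the extracted value.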
This project is licensed under the Apache-2.0 License.
To restore the repository, download the bundle and clone it:
git clone awslabs-aws-athena-query-federation_-_2019-11-26_22-56-58.bundle
Upload date: 2019-11-26