What is AWS Athena?
Amazon Athena is an interactive query service for analyzing data stored in Amazon Simple Storage Service (S3) using standard SQL. It is fully managed, and it can query data that is partitioned, compressed (for example with Snappy), or stored in columnar formats such as Parquet.
Amazon launched Athena in November 2016 as a serverless query service for analyzing data with standard SQL.
From the AWS Management Console, users can point Athena at data stored in Amazon S3 and run standard SQL queries, with results typically returned in seconds.
Amazon Athena has no infrastructure to set up or manage, and customers pay only for the queries they run.
Amazon Athena scales automatically and executes queries in parallel, so results come back quickly even for large datasets and complex queries.
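To make the pay-per-query model concrete, here is a minimal sketch of a cost estimate. The price of $5 per TB scanned and the 10 MB per-query minimum are assumptions based on commonly published Athena pricing; actual rates vary by region and may change, and the function name is hypothetical.

```python
# Assumed pricing values; check the current Athena pricing page for your region.
PRICE_PER_TB_USD = 5.0
BYTES_PER_TB = 1024 ** 4
MIN_BILLABLE_BYTES = 10 * 1024 ** 2  # assumed 10 MB minimum charge per query


def estimate_query_cost(bytes_scanned: int) -> float:
    """Estimate the cost in USD of one query from the bytes it scanned."""
    billable = max(bytes_scanned, MIN_BILLABLE_BYTES)
    return billable / BYTES_PER_TB * PRICE_PER_TB_USD


# Example: a query that scans 100 GB of data.
cost = estimate_query_cost(100 * 1024 ** 3)
print(f"Estimated cost: ${cost:.4f}")
```

Because the charge depends on bytes scanned, anything that reduces scanning (partitioning, compression, columnar formats) lowers the estimate directly.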
Why Amazon Athena?
In the current Big Data landscape, data volumes grow day by day, and we want to keep all of that data in our data lake. But we don’t actually need expensive Redshift nodes running all the time. This led us to Presto, a distributed SQL query engine designed for analytic queries. Presto decouples data from processing: no data is stored in Presto itself, so it reads from elsewhere, e.g. S3. Since S3 storage is cheap, it makes a lot of sense to use it as the storage system for your data lake.
Amazon Athena is built on Presto and supports standard SQL syntax, which makes it easy for our data analysts to use. Do note, however, that its SQL dialect differs in some ways from others, e.g. Redshift’s SQL.
Partitioning of Data:
By partitioning your data, you can restrict the amount of data scanned by each query, thus improving performance and reducing cost.
Athena uses Apache Hive-style partitioning.
You can partition your data by any key.
You can query geospatial data.
You can query different kinds of logs as your datasets.
Athena stores query results in S3.
Athena retains query history for 45 days.
Athena does not support stored procedures. Early releases also lacked user-defined functions and “INSERT INTO” statements, so check the current documentation before relying on either.
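The partitioning points above can be illustrated with a short sketch. It builds a Hive-style S3 prefix (`key=value` path segments) and the matching `ALTER TABLE ... ADD PARTITION` statement. The bucket, table, and function names are hypothetical, and values are interpolated without escaping, so this is only suitable for trusted input.

```python
from typing import Mapping


def hive_partition_prefix(base_uri: str, partitions: Mapping[str, str]) -> str:
    """Return an S3 prefix in Hive style, e.g. .../year=2024/month=01/."""
    segments = [f"{key}={value}" for key, value in partitions.items()]
    return base_uri.rstrip("/") + "/" + "/".join(segments) + "/"


def add_partition_statement(
    table: str, base_uri: str, partitions: Mapping[str, str]
) -> str:
    """Build an ALTER TABLE ... ADD PARTITION statement for the given keys."""
    spec = ", ".join(f"{key}='{value}'" for key, value in partitions.items())
    location = hive_partition_prefix(base_uri, partitions)
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({spec}) LOCATION '{location}'"
    )


# Hypothetical table and bucket, for illustration only.
print(add_partition_statement(
    "access_logs", "s3://example-bucket/logs/", {"year": "2024", "month": "01"}
))
```

A query that filters on `year` and `month` then only reads objects under the matching prefix, which is where the reduction in scanned data comes from.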
Features of Athena
Athena is one of the best services offered by AWS. It has several features making it suitable to analyze your data. Let’s have a look at the various features of Athena given below:
Easy Implementation: Athena requires no installation and can be accessed directly from the AWS Console.
Serverless: Because Athena is a serverless service, the end user does not have to deal with configuration, scaling, or failures. The service takes care of these on its own.
Pay per query: You are charged only for the queries you run, based on the amount of data scanned per query.
Fast: Athena is a high-speed analytics tool that can run even complex queries in relatively little time by splitting them into simpler parts, running those in parallel, and merging the results to produce the desired output.
Secure: Using AWS Identity and Access Management (IAM) policies, Athena gives you control over access to your data sets.
High availability: Athena is highly available, so users can run queries around the clock.
Integration: A key feature of Athena is its integration with AWS Glue, an ETL service.
AWS Glue is a fully managed ETL service that makes it easy for customers to prepare and load data for analytics. You can build and run an ETL job in the AWS Management Console with a few clicks. You point AWS Glue at your data stored on AWS, and it discovers the data and stores the associated metadata, such as schema and table definitions, in the AWS Glue Data Catalog. Once cataloged, your data is immediately searchable, queryable, and available for ETL.
Benefits of AWS Glue:
AWS Glue is integrated with a wide range of AWS services, which means less hassle for you while onboarding.
AWS Glue is serverless, i.e. there is no infrastructure to provision or manage.
You pay only for the resources used to run your jobs.
Amazon QuickSight is a fast, cloud-powered BI service that makes it easy to deliver insights to everyone in the organization. As a fully managed service, QuickSight lets you easily create and publish interactive dashboards that include ML insights. Dashboards can be accessed from any device and embedded into your applications, websites, and portals. With pay-per-session pricing, you can give everyone access to the data they need while paying only for what you use.
Some of the major benefits provided by Amazon QuickSight are listed as follows:
Pay only for what you use
Scale from 10 users to 10,000
Embed self-service data analytics
Build end-to-end BI solutions
How does AWS Athena work?
Athena works directly with data in S3. It uses Presto, a distributed SQL engine, to run queries, and Apache Hive to create and alter tables and partitions.
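As a rough illustration of the Hive side, the sketch below assembles a `CREATE EXTERNAL TABLE` statement that points a table definition at data already sitting in S3. The table name, columns, bucket, and the choice of Parquet as the storage format are all assumptions made for the example, and the helper function is hypothetical.

```python
from typing import Sequence, Tuple

Column = Tuple[str, str]  # (column name, Hive data type)


def create_external_table_ddl(
    table: str,
    columns: Sequence[Column],
    partition_columns: Sequence[Column],
    location: str,
) -> str:
    """Build a Hive-style DDL statement for a partitioned external table."""
    cols = ",\n  ".join(f"{name} {type_}" for name, type_ in columns)
    parts = ", ".join(f"{name} {type_}" for name, type_ in partition_columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({parts})\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


# Hypothetical schema and bucket, for illustration only.
ddl = create_external_table_ddl(
    "access_logs",
    [("request_id", "string"), ("bytes_sent", "bigint")],
    [("year", "string")],
    "s3://example-bucket/logs/",
)
print(ddl)
```

The table is "external" because the statement only registers metadata; the data stays in S3, and dropping the table does not delete the underlying files.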
Let’s have a look at the prerequisites to start working with Athena:
You must have an AWS account.
Enable your account to export your cost and usage data into an S3 bucket.
Prepare buckets for Athena to connect.
Every time AWS writes to the bucket, it creates manifest files from metadata. It also creates a folder named Athena inside the technology-aws-billing-data bucket, which contains only the data.
To simplify the setup, we can use one region: the us-west-2 region.
The final step is to download the credentials for the new IAM user. These credentials map directly to the database credentials used to connect:
|Database username|IAM username|
|Database password|Secret Access Key|
|Database name|Access Key ID|
|S3 staging directory|s3://aws-athena-query-results-technology/|
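Once the staging directory exists, a query can also be submitted programmatically. The sketch below only builds the request; the keys match the parameters of boto3's `start_query_execution` call, while the database name `billing`, the table name `cost_and_usage`, and the helper function are hypothetical. The actual call is left as a comment because it needs boto3 and valid AWS credentials.

```python
from typing import Any, Dict


def build_query_request(sql: str, database: str, staging_dir: str) -> Dict[str, Any]:
    """Assemble keyword arguments for Athena's StartQueryExecution API."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": staging_dir},
    }


request = build_query_request(
    sql="SELECT COUNT(*) FROM cost_and_usage",  # hypothetical table
    database="billing",                         # hypothetical database
    staging_dir="s3://aws-athena-query-results-technology/",
)

# With boto3 installed and credentials configured, this could be sent as:
#   import boto3
#   athena = boto3.client("athena", region_name="us-west-2")
#   response = athena.start_query_execution(**request)
print(request)
```

Keeping the request construction separate from the network call makes it easy to inspect or unit test the parameters before any query is billed.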
With public cloud providers offering service-based analytics such as Amazon Athena, businesses can run more analysis without the expensive complications that come with home-built analytics tools.
With its serverless architecture and ANSI SQL support, Athena makes data queries quick to set up, easy to use, and fast to run. Its pay-per-use model makes running analytics affordable. Since Athena works with Amazon Simple Storage Service (S3), it inherits the scalability, durability, and reliability of object storage, which makes it well suited to analytics workloads.