AWS Athena is an interactive query engine that lets us run SQL queries on raw data stored in S3 buckets. It is a completely serverless solution, meaning you do not need to deploy or manage any infrastructure to use it. Most importantly, you do not pay for any data warehouse resources such as clusters when you are not using it. It is built on Presto, which was developed by Facebook in 2012 and open-sourced in 2013 as a distributed SQL query engine.
When we use AWS Athena to query data, we actually leverage three AWS services together: Amazon S3 to store the data, AWS Glue Data Catalog to catalog the data we store, and AWS Athena to query it. Note that AWS Athena doesn't store any data or copy of your data; since the data is already stored in your S3 bucket and cataloged with AWS Glue, it doesn't need to.
AWS Glue stores the following information about the data you have stored,
where do you store this data?
which tables do you have?
how is the data formatted?
which columns do you have?
is there any compression on your data?
is there any partitioning enabled?
Unlike RDBMS databases, Athena works in an asynchronous manner: when you send a query you get a query execution ID as the response. You can poll AWS Athena for the status of the query, and when it is completed you can use this ID to fetch the query result through the AWS CLI, SDKs, or the Management Console. Athena also saves the query result to an S3 bucket that you configure during setup.
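As a rough sketch of this flow with the AWS CLI (the table, database, output bucket, and execution ID below are placeholders; we create our real ones later in this post):

# Submit the query; Athena immediately returns a QueryExecutionId
aws athena start-query-execution \
  --query-string "SELECT * FROM my_table LIMIT 10" \
  --query-execution-context Database=my_database \
  --result-configuration OutputLocation=s3://my-athena-output-bucket/

# Poll the status until it becomes SUCCEEDED (or FAILED / CANCELLED)
aws athena get-query-execution --query-execution-id <execution-id> \
  --query 'QueryExecution.Status.State'

# Fetch the result set once the query is done
aws athena get-query-results --query-execution-id <execution-id>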
Two options are available for the underlying data storage,
Amazon S3; using SerDe (serializer/deserializer) libraries. The data formats you can query on S3 vary: CSV (comma-separated), TSV (tab-separated), custom-delimited, JSON, Apache Avro, Parquet, and ORC. Using line-based text SerDes such as Regex and Grok, you can even define your own format, which gives you great flexibility.
Data sources other than S3, using Athena Federated Query; this puts AWS Lambda between AWS Athena and your data store to run serverless queries. Data connectors available for Athena Federated Query include Amazon CloudWatch Logs, Amazon DynamoDB, MongoDB (AWS-managed or self-hosted), MySQL (Amazon RDS or self-hosted), PostgreSQL (Amazon RDS or self-hosted), Elasticsearch (Amazon Elasticsearch Service or self-hosted), and Redis (Amazon ElastiCache or self-hosted). One great advantage is that you can use this feature to run asynchronous queries against a relational database from within your application.
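Once a connector is deployed and registered as a data catalog, you query it the same way; only the catalog in the query execution context changes. A hedged sketch (my_dynamodb_catalog is a hypothetical catalog name that depends on how you registered your connector):

aws athena start-query-execution \
  --query-string "SELECT * FROM my_table LIMIT 10" \
  --query-execution-context Catalog=my_dynamodb_catalog,Database=default \
  --result-configuration OutputLocation=s3://my-athena-output-bucket/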
I used WHO's COVID-19 data, published as a CSV file on their website, containing the official daily counts of COVID-19 cases, deaths, and vaccine utilization reported to WHO by countries, territories, and areas. In this data the country codes are in ISO Alpha-2 format, so to visualize it on a map we'll need to match those Alpha-2 codes to coordinate information (latitude/longitude) for each country. For that I used a gist shared on GitHub by a community member.
Covid Data: https://covid19.who.int/info/
Country List: https://gist.github.com/tadast/8827699#file-countries_codes_and_coordinates-csv
Download the files and upload them to different folders in an Amazon S3 bucket using this structure:
➜ ~ aws s3 ls s3://sufle-athena-bucket --recursive
2021-10-18 22:52:01 0 country/
2021-10-18 22:55:01 12720 country/countries_codes_and_coordinates.csv
2021-10-18 22:51:47 0 covid/
2021-10-18 22:55:40 4802871 covid/WHO-COVID-19-global-data.csv
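If you prefer the CLI for the upload as well, it could look roughly like this (assuming the two downloaded files are in your current directory under the names shown above):

aws s3 cp WHO-COVID-19-global-data.csv s3://sufle-athena-bucket/covid/
aws s3 cp countries_codes_and_coordinates.csv s3://sufle-athena-bucket/country/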
Let's go to the AWS Athena service on the AWS Management Console. If this is the first time you are using this service, it will ask you for some initial settings;
Query result location is where AWS Athena will store the results of the queries that you run. Remember, I mentioned that Athena runs asynchronously and saves results to a bucket. Create a bucket from the Amazon S3 service if you like, or select an existing one. In our example we have selected sufle-athena-output-bucket.
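The same output location can also be set from the CLI on the default workgroup; a sketch, assuming you keep using the primary workgroup:

aws athena update-work-group --work-group primary \
  --configuration-updates 'ResultConfigurationUpdates={OutputLocation=s3://sufle-athena-output-bucket/}'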
After you are done with initial configuration, click on the "Connect data source" button from the "Data sources" tab to start creating your first catalog.
On "Step 1: Choose a data source", you will choose;
Choose where your data is located: Query data in Amazon S3
Choose a metadata catalog: AWS Glue Data Catalog
And on the next one, "Step 2: Connection details", you need to select,
Connection details: choose an AWS Glue Data Catalog, in this case AWS Glue Data Catalog in this account
Choose a way to create a table: Create a crawler in AWS Glue
These selections will redirect you to the crawler creation page in AWS Glue.
Before we move on, what is a "Crawler"?
A crawler is a tool that automatically scans your data and populates the AWS Glue Data Catalog for you. It is the easiest way to create a catalog.
It classifies your data to determine the metadata information such as format, schema, and associated properties of the raw data.
It groups your data into tables or partitions.
Finally, it writes the metadata to the AWS Glue Data Catalog.
Let's get back to creating the data catalog, using the step-by-step form:
Crawler info:
Crawler name: enter a name that you want, let's call it covid
Crawler source type:
Crawler source type: it will be "Data sources" since we want it to crawl the data that we have in the S3 bucket.
Repeat crawls of S3 data stores: we want it to crawl all folders
Data store:
Choose a data store: "S3", obviously :)
Connection: we will leave this blank, since we don't want to configure a custom network connection in this example
Include path: enter the location where you uploaded the COVID-19 data, s3://sufle-athena-bucket/covid in our example
IAM role:
Choose to create a new IAM role and name it something like AWSGlueServiceRole-covid
Schedule: Here we will select the "Run on demand" option since this is an example, but for real workloads you can use this feature to run crawlers automatically; they can run on the schedules that you configure and keep your data catalog up to date.
Database: Create a new database, naming it something like covidprod
We'll do the same for the country data that we have, very similarly:
Crawler info:
Crawler name: enter a name that you want, let's call it country this time
Crawler source type:
Crawler source type: it will be "Data sources" since we want it to crawl the data that we have in the S3 bucket.
Repeat crawls of S3 data stores: we want it to crawl all folders
Data store:
Choose data store: "S3", again.
Connection: we will leave this blank again
Include path: enter the location where you uploaded the country data, s3://sufle-athena-bucket/country in our example
IAM role:
Choose to create a new IAM role and name it something like AWSGlueServiceRole-country
Schedule: Here we will select the "Run on demand" option as before.
Database: Choose the database that you created while configuring the previous crawler, in our example covidprod.
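As a side note, the same two crawlers could roughly be created from the CLI as well; a sketch using the names and paths from the forms above (it assumes the IAM roles already exist, whereas the console wizard creates them for you):

# Create the database the crawlers will write into
aws glue create-database --database-input '{"Name":"covidprod"}'

aws glue create-crawler --name covid \
  --role AWSGlueServiceRole-covid \
  --database-name covidprod \
  --targets '{"S3Targets":[{"Path":"s3://sufle-athena-bucket/covid"}]}'

aws glue create-crawler --name country \
  --role AWSGlueServiceRole-country \
  --database-name covidprod \
  --targets '{"S3Targets":[{"Path":"s3://sufle-athena-bucket/country"}]}'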
After you have created the crawlers, you can select them and click on "Run crawler" for AWS Glue to create the data catalog automatically for you. After these runs are completed, you can go to "Tables" to see details of the cataloged data such as record counts, delimiter, columns, etc.
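From the CLI, the equivalent would be roughly:

aws glue start-crawler --name covid
aws glue start-crawler --name country

# A crawler goes RUNNING -> STOPPING -> READY when it has finished
aws glue get-crawler --name covid --query 'Crawler.State'

# List what has been cataloged in the database
aws glue get-tables --database-name covidprod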
Since the data and catalogs are ready now, we can start querying our data using Amazon Athena. Let's go back to the Athena service and select "Query editor". Then, from the sidebar, select the database that you created in the crawler configuration and use the following queries to test,
SELECT * FROM "covidprod"."covid" limit 10;
SELECT * FROM "covidprod"."covid" where country="Turkey";
If you check the output bucket that you selected before, you'll see the results that have been saved.
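Each execution leaves a <query-execution-id>.csv result file (plus a small .metadata file) under that location; the exact prefix depends on your settings, but you can list everything with:

aws s3 ls s3://sufle-athena-output-bucket/ --recursive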
If you query the country table that we have created, you'll see that there are double quote characters (") in the fields, since the CSV data file contains them. Also, latitude and longitude are empty. To fix this we'll go back to the table on AWS Glue and click on "Edit Schema". A couple of things we change to query it easily,
Change column name alpha-2 code to alpha2
Change column name alpha-3 code to alpha3
Change column name numeric code to numeric
Change column name latitude (average) to latitude
Change column name longitude (average) to longitude
and save these changes.
AWS Glue uses LazySimpleSerDe by default. If you have a different format, you can use the "Supported SerDes and data formats" section of the Amazon Athena documentation to select a different serializer/deserializer for your catalog. As the documentation states, use the OpenCSVSerDe for processing CSV when your data includes quotes in values; so we will use org.apache.hadoop.hive.serde2.OpenCSVSerde, add quoteChar with the value " and add escapeChar with the value \ in the SerDe parameters section.
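You can verify that the SerDe change took effect, and inspect its parameters, from the CLI as well:

aws glue get-table --database-name covidprod --name country \
  --query 'Table.StorageDescriptor.SerdeInfo'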
After you are done with these configuration changes, you'll see all of the data in the Results section of the Amazon Athena console when you query the table once more.
Now you can also join these tables:
SELECT * FROM "covidprod"."covid" AS cvd JOIN "covidprod"."country" AS ctr
ON cvd.country_code=ctr.alpha2
LIMIT 3;
Amazon QuickSight is a fast, easy-to-use, cloud-powered business analytics service of AWS that enables you to build visualizations, perform ad-hoc analysis, and quickly get business insights from your data, anytime, on any device. You can upload CSV and Excel files; connect it to SaaS applications like Salesforce; access your on-premises databases like SQL Server, MySQL, and PostgreSQL; and seamlessly discover your AWS data sources such as Amazon Redshift, Amazon RDS, Amazon Aurora, Amazon Athena, and Amazon S3. We will use QuickSight to visualize our data.
On the QuickSight dashboard, let's create a new Dataset, select Athena as the data source, and name our dataset something like covid-data. It will bring up the list of databases and tables to select from. Since we want to use both tables, we'll use the custom query option and put in the following joined query:
SELECT ctr.country, ctr.latitude, ctr.longitude, cvd.date_reported, cvd.new_cases
FROM "covidprod"."covid" AS cvd
JOIN "covidprod"."country" AS ctr
ON cvd.country_code=ctr.alpha2;
Then go back to datasets, click on the dataset that you created, and select Create Analysis to start visualizing. On the sidebar that appears on the left side of your window, you'll see different Visual types; we'll select the Points on map icon (the one with an earth image). At the top you'll see that it requires a Geospatial input to visualize, so we need to create coordinates from the latitude and longitude values that we have in our data. Let's go back to Datasets and use Edit to customize our dataset. On the left sidebar you'll see our fields latitude and longitude; if you click on the three dots next to one of them, you'll see that the first option is Add to Coordinates. Name the coordinates coordinate and select the fields from the drop-down lists.
Go back to the analysis that you created a couple of minutes ago and drag and drop the fields into the wells from the Field List:
Geospatial: coordinate
Size: new_cases
Color: country
And the final result:
Congrats on completing this tutorial and visualizing data using serverless services! If you have any questions or need a hand visualizing your data on AWS at almost no cost, reach out to us and make your dreams come true without managing any servers or data warehouses.
Once a Software Developer and now an AWS Certified Solutions Architect Professional and AWS Ambassador, Gizem is always eager to take on professional challenges. Her diligence at work is combined with a passion for sharing her knowledge with tech-savvy professionals and communities.