Dataflow and BigQuery for ETL on Google Cloud
Prerequisites
A Google Cloud Platform (GCP) project
A Google account
Open the Cloud Console.
Click Activate Cloud Shell.
![](https://prwatech.in/blog/wp-content/uploads/2021/04/1-8.jpg)
$ gcloud config list project #To show the currently configured project
$ gcloud config set project my-demo-project-306417 #To select your project (use your own project ID)
![](https://prwatech.in/blog/wp-content/uploads/2021/04/2-8.jpg)
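If you are not sure of your project ID, you can list every project your account can access (optional check):
$ gcloud projects list #Lists all projects available to your account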
In the Cloud console, click Menu > IAM & Admin.
![](https://prwatech.in/blog/wp-content/uploads/2021/04/3-8.jpg)
Confirm that the default Compute Engine service account is present and has the Editor role assigned.
{project-number}-compute@developer.gserviceaccount.com
![](https://prwatech.in/blog/wp-content/uploads/2021/04/4-7.jpg)
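You can also check this from Cloud Shell (optional; replace <YOUR-PROJECT-ID> with your project ID). The output should include roles/editor.
$ gcloud projects get-iam-policy <YOUR-PROJECT-ID> --flatten="bindings[].members" --filter="bindings.members:compute@developer.gserviceaccount.com" --format="table(bindings.role)" #Lists the roles granted to the default compute service account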
To check the project number, go to Menu > Home > Dashboard.
![](https://prwatech.in/blog/wp-content/uploads/2021/04/5-4.jpg)
In the dashboard, the project number is displayed under Project info.
![](https://prwatech.in/blog/wp-content/uploads/2021/04/6-5.jpg)
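Alternatively, the project number can be printed from Cloud Shell (replace <YOUR-PROJECT-ID> with your project ID):
$ gcloud projects describe <YOUR-PROJECT-ID> --format="value(projectNumber)" #Prints the project number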
In Cloud Shell, run the command below.
$ gsutil -m cp -R gs://spls/gsp290/dataflow-python-examples . # To copy the Dataflow Python examples from Google Cloud’s professional services GitHub repository
![](https://prwatech.in/blog/wp-content/uploads/2021/04/7-4-1024x154.jpg)
Export the name of your cloud project as an environment variable.
$ export PROJECT=<YOUR-PROJECT-ID>
Set the active project to the one you exported. [Don’t change $PROJECT; it expands to your project ID.]
gcloud config set project $PROJECT
![](https://prwatech.in/blog/wp-content/uploads/2021/04/8-2.jpg)
gsutil mb -c regional -l us-central1 gs://$PROJECT #To create a new regional bucket in us-central1
![](https://prwatech.in/blog/wp-content/uploads/2021/04/9-2.jpg)
gsutil cp gs://spls/gsp290/data_files/usa_names.csv gs://$PROJECT/data_files/ #To copy the file into the bucket we created
gsutil cp gs://spls/gsp290/data_files/head_usa_names.csv gs://$PROJECT/data_files/ #Also copy head_usa_names.csv, which the later pipelines use as input
![](https://prwatech.in/blog/wp-content/uploads/2021/04/10-2-1024x65.jpg)
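You can verify that the files landed in the bucket (optional check):
gsutil ls gs://$PROJECT/data_files/ #Should list the copied CSV files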
bq mk lake #To create a BigQuery dataset named lake
![](https://prwatech.in/blog/wp-content/uploads/2021/04/12-1.jpg)
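To confirm the dataset was created (optional check):
bq ls #Lists the datasets in the current project; lake should appear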
Change into the examples directory.
cd dataflow-python-examples/
Install virtualenv.
sudo pip install virtualenv
![](https://prwatech.in/blog/wp-content/uploads/2021/04/15-1-1024x215.jpg)
virtualenv -p python3 venv #Creating a virtual environment
![](https://prwatech.in/blog/wp-content/uploads/2021/04/16-1-1024x100.jpg)
source venv/bin/activate #Activating Virtual Environment
pip install 'apache-beam[gcp]==2.24.0' #Installing Apache Beam
![](https://prwatech.in/blog/wp-content/uploads/2021/04/17-1-1024x148.jpg)
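To confirm the installation inside the virtual environment (optional check):
python -c "import apache_beam as beam; print(beam.__version__)" #Should print 2.24.0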
In the Cloud Shell window, click Open Editor.
![](https://prwatech.in/blog/wp-content/uploads/2021/04/ed.jpg)
In the workspace you will see the dataflow-python-examples folder. Go through each Python program and try to understand the code.
![](https://prwatech.in/blog/wp-content/uploads/2021/04/14-2.jpg)
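If you prefer the terminal, you can list the scripts and skim one before running anything (optional):
ls dataflow_python_examples/ #Lists the example pipelines
head -n 40 dataflow_python_examples/data_ingestion.py #Shows the start of the ingestion pipeline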
Here we’ll run the Dataflow pipeline in the cloud. It will spin up the workers required and shut them down when complete.
python dataflow_python_examples/data_ingestion.py --project=$PROJECT --region=us-central1 --runner=DataflowRunner --staging_location=gs://$PROJECT/test --temp_location gs://$PROJECT/test --input gs://$PROJECT/data_files/usa_names.csv --save_main_session
It will take some time to provision the workers and finish the job.
![](https://prwatech.in/blog/wp-content/uploads/2021/04/18-1-1024x114.jpg)
Open the console and click Menu > Dataflow > Jobs
![](https://prwatech.in/blog/wp-content/uploads/2021/04/19-1.jpg)
In Jobs, check the status. It shows Succeeded after completion and Running while the job is still in progress.
![](https://prwatech.in/blog/wp-content/uploads/2021/04/20.jpg)
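The same status can be checked from Cloud Shell (optional):
gcloud dataflow jobs list --region=us-central1 #Lists Dataflow jobs and their current state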
Open Menu > BigQuery > SQL Workspace.
![](https://prwatech.in/blog/wp-content/uploads/2021/04/21.jpg)
It will show the usa_names table under the lake dataset.
![](https://prwatech.in/blog/wp-content/uploads/2021/04/22.jpg)
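You can also preview the loaded rows from Cloud Shell (this assumes the table was created as lake.usa_names, as shown above):
bq query --use_legacy_sql=false 'SELECT * FROM lake.usa_names LIMIT 10' #Previews the first 10 rows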
Here we’ll run the Dataflow pipeline saved in the data_transformation.py file. It will spin up the workers required and shut them down when complete.
python dataflow_python_examples/data_transformation.py --project=$PROJECT --region=us-central1 --runner=DataflowRunner --staging_location=gs://$PROJECT/test --temp_location gs://$PROJECT/test --input gs://$PROJECT/data_files/head_usa_names.csv --save_main_session
![](https://prwatech.in/blog/wp-content/uploads/2021/04/23-1024x168.jpg)
In the Cloud Shell Editor, open dataflow-python-examples > dataflow_python_examples > data_enrichment.py.
In line 83, replace
values = [x.decode('utf8') for x in csv_row]
with
values = [x for x in csv_row]
In Python 3 the CSV values are already strings, so the decode call is unnecessary and would raise an error; with this change the pipeline can populate the data in BigQuery.
![](https://prwatech.in/blog/wp-content/uploads/2021/04/25-1024x408.jpg)
![](https://prwatech.in/blog/wp-content/uploads/2021/04/26-1024x416.jpg)
Now we’ll run the Dataflow pipeline saved in data_enrichment.py. It will spin up the workers and shut them down when finished.
python dataflow_python_examples/data_enrichment.py --project=$PROJECT --region=us-central1 --runner=DataflowRunner --staging_location=gs://$PROJECT/test --temp_location gs://$PROJECT/test --input gs://$PROJECT/data_files/head_usa_names.csv --save_main_session
![](https://prwatech.in/blog/wp-content/uploads/2021/04/27-1024x123.jpg)
Open Menu > Big Query > SQL Workspace
It will show the populated tables under the lake dataset.
![](https://prwatech.in/blog/wp-content/uploads/2021/04/28.jpg)
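To see every table the pipelines have written so far, you can list the contents of the dataset from Cloud Shell:
bq ls lake #Lists the tables in the lake dataset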
Open Menu > Dataflow > Jobs.
![](https://prwatech.in/blog/wp-content/uploads/2021/04/28z.jpg)
Click the job you just ran.
![](https://prwatech.in/blog/wp-content/uploads/2021/04/28y.jpg)
It will show the Dataflow pipeline graph for the job.
![](https://prwatech.in/blog/wp-content/uploads/2021/04/29-1024x843.jpg)