Dataflow and BigQuery for ETL on Google Cloud
Prerequisites
A Google Cloud Platform (GCP) project
A Google account
Open the Cloud Console.
Click Activate Cloud Shell.
![](https://prwatech.in/blog/wp-content/uploads/2021/04/1-8.jpg)
$ gcloud config list project #To show the currently configured project
$ gcloud config set project my-demo-project-306417 #To select your project (use your own project ID)
![](https://prwatech.in/blog/wp-content/uploads/2021/04/2-8.jpg)
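If you are not sure of your project ID, you can list every project your account can access (optional check):
$ gcloud projects list #Lists all projects available to your account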
In the Cloud console, click Menu > IAM & Admin.
![](https://prwatech.in/blog/wp-content/uploads/2021/04/3-8.jpg)
Confirm that the default Compute Engine service account is present and has the Editor role assigned.
{project-number}-compute@developer.gserviceaccount.com
![](https://prwatech.in/blog/wp-content/uploads/2021/04/4-7.jpg)
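You can also check this from Cloud Shell (optional; replace <YOUR-PROJECT-ID> with your project ID). The output should include roles/editor.
$ gcloud projects get-iam-policy <YOUR-PROJECT-ID> --flatten="bindings[].members" --filter="bindings.members:compute@developer.gserviceaccount.com" --format="table(bindings.role)" #Lists the roles granted to the default compute service account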
To check the project number, go to Menu > Home > Dashboard.
![](https://prwatech.in/blog/wp-content/uploads/2021/04/5-4.jpg)
In the dashboard, the project number is displayed under Project info.
![](https://prwatech.in/blog/wp-content/uploads/2021/04/6-5.jpg)
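Alternatively, the project number can be printed from Cloud Shell (replace <YOUR-PROJECT-ID> with your project ID):
$ gcloud projects describe <YOUR-PROJECT-ID> --format="value(projectNumber)" #Prints the project number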
In Cloud Shell, run the command below.
$ gsutil -m cp -R gs://spls/gsp290/dataflow-python-examples . # To copy the Dataflow Python examples from Google Cloud’s professional services GitHub repository
![](https://prwatech.in/blog/wp-content/uploads/2021/04/7-4-1024x154.jpg)
Export the name of your cloud project as an environment variable.
$ export PROJECT=<YOUR-PROJECT-ID>
Set the active project to the one you exported. [Don’t change $PROJECT; it expands to your project ID.]
gcloud config set project $PROJECT
![](https://prwatech.in/blog/wp-content/uploads/2021/04/8-2.jpg)
gsutil mb -c regional -l us-central1 gs://$PROJECT #To create a new regional bucket in us-central1
![](https://prwatech.in/blog/wp-content/uploads/2021/04/9-2.jpg)
gsutil cp gs://spls/gsp290/data_files/usa_names.csv gs://$PROJECT/data_files/ #To copy the file into the bucket we created
gsutil cp gs://spls/gsp290/data_files/head_usa_names.csv gs://$PROJECT/data_files/ #Also copy head_usa_names.csv, which the later pipelines use as input
![](https://prwatech.in/blog/wp-content/uploads/2021/04/10-2-1024x65.jpg)
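You can verify that the files landed in the bucket (optional check):
gsutil ls gs://$PROJECT/data_files/ #Should list the copied CSV files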
bq mk lake #To create a BigQuery dataset named lake
![](https://prwatech.in/blog/wp-content/uploads/2021/04/12-1.jpg)
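To confirm the dataset was created (optional check):
bq ls #Lists the datasets in the current project; lake should appear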
Change into the examples directory.
cd dataflow-python-examples/
Install virtualenv.
sudo pip install virtualenv
![](https://prwatech.in/blog/wp-content/uploads/2021/04/15-1-1024x215.jpg)
virtualenv -p python3 venv #Creating a virtual environment
![](https://prwatech.in/blog/wp-content/uploads/2021/04/16-1-1024x100.jpg)
source venv/bin/activate #Activating Virtual Environment
pip install 'apache-beam[gcp]==2.24.0' #Installing Apache Beam
![](https://prwatech.in/blog/wp-content/uploads/2021/04/17-1-1024x148.jpg)
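To confirm the installation inside the virtual environment (optional check):
python -c "import apache_beam as beam; print(beam.__version__)" #Should print 2.24.0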
In the Cloud Shell window, click Open Editor.
![](https://prwatech.in/blog/wp-content/uploads/2021/04/ed.jpg)
In the workspace you will see the dataflow-python-examples folder. Go through each Python program and try to understand the code.
![](https://prwatech.in/blog/wp-content/uploads/2021/04/14-2.jpg)
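If you prefer the terminal, you can list the scripts and skim one before running anything (optional):
ls dataflow_python_examples/ #Lists the example pipelines
head -n 40 dataflow_python_examples/data_ingestion.py #Shows the start of the ingestion pipeline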
Here we’ll run the Dataflow pipeline in the cloud. It will spin up the workers required and shut them down when complete.
python dataflow_python_examples/data_ingestion.py --project=$PROJECT --region=us-central1 --runner=DataflowRunner --staging_location=gs://$PROJECT/test --temp_location gs://$PROJECT/test --input gs://$PROJECT/data_files/usa_names.csv --save_main_session
It will take some time to provision the workers and finish the job.
![](https://prwatech.in/blog/wp-content/uploads/2021/04/18-1-1024x114.jpg)
Open the console and click Menu > Dataflow > Jobs
![](https://prwatech.in/blog/wp-content/uploads/2021/04/19-1.jpg)
In Jobs, check the status. It shows Succeeded after completion and Running while the job is still in progress.
![](https://prwatech.in/blog/wp-content/uploads/2021/04/20.jpg)
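The same status can be checked from Cloud Shell (optional):
gcloud dataflow jobs list --region=us-central1 #Lists Dataflow jobs and their current state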
Open Menu > BigQuery > SQL Workspace.
![](https://prwatech.in/blog/wp-content/uploads/2021/04/21.jpg)
It will show the usa_names table under the lake dataset.
![](https://prwatech.in/blog/wp-content/uploads/2021/04/22.jpg)
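You can also preview the loaded rows from Cloud Shell (this assumes the table was created as lake.usa_names, as shown above):
bq query --use_legacy_sql=false 'SELECT * FROM lake.usa_names LIMIT 10' #Previews the first 10 rows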
Here we’ll run the Dataflow pipeline saved in the data_transformation.py file. It will spin up the workers required and shut them down when complete.
python dataflow_python_examples/data_transformation.py --project=$PROJECT --region=us-central1 --runner=DataflowRunner --staging_location=gs://$PROJECT/test --temp_location gs://$PROJECT/test --input gs://$PROJECT/data_files/head_usa_names.csv --save_main_session
![](https://prwatech.in/blog/wp-content/uploads/2021/04/23-1024x168.jpg)
In the Cloud Shell Editor, open dataflow-python-examples > dataflow_python_examples > data_enrichment.py.
In line 83, replace
values = [x.decode('utf8') for x in csv_row]
with
values = [x for x in csv_row]
In Python 3 the CSV values are already strings, so the decode call is unnecessary and would raise an error; with this change the pipeline can populate the data in BigQuery.
![](https://prwatech.in/blog/wp-content/uploads/2021/04/25-1024x408.jpg)
![](https://prwatech.in/blog/wp-content/uploads/2021/04/26-1024x416.jpg)
Now we’ll run the Dataflow pipeline saved in data_enrichment.py. It will spin up the workers and shut them down when finished.
python dataflow_python_examples/data_enrichment.py --project=$PROJECT --region=us-central1 --runner=DataflowRunner --staging_location=gs://$PROJECT/test --temp_location gs://$PROJECT/test --input gs://$PROJECT/data_files/head_usa_names.csv --save_main_session
![](https://prwatech.in/blog/wp-content/uploads/2021/04/27-1024x123.jpg)
Open Menu > Big Query > SQL Workspace
It will show the populated tables under the lake dataset.
![](https://prwatech.in/blog/wp-content/uploads/2021/04/28.jpg)
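To see every table the pipelines have written so far, you can list the contents of the dataset from Cloud Shell:
bq ls lake #Lists the tables in the lake dataset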
Open Menu > Dataflow > Jobs.
![](https://prwatech.in/blog/wp-content/uploads/2021/04/28z.jpg)
Click the job you just ran.
![](https://prwatech.in/blog/wp-content/uploads/2021/04/28y.jpg)
It will show the Dataflow pipeline graph for the job.
![](https://prwatech.in/blog/wp-content/uploads/2021/04/29-1024x843.jpg)