Dataflow and BigQuery for ETL on Google Cloud
Prerequisites
A Google account with access to Google Cloud Platform (GCP)
Open the Cloud Console and click Activate Cloud Shell.

$ gcloud config list project # Show the currently configured project
$ gcloud config set project my-demo-project-306417 # Switch to your project

In the Cloud console, click Menu > IAM & Admin.

Confirm that the default Compute Engine service account is present and has the Editor role assigned.
{project-number}-compute@developer.gserviceaccount.com

To find the project number, go to Menu > Home > Dashboard.

The project number is displayed on the dashboard under Project info.

In Cloud Shell, run the following command.
$ gsutil -m cp -R gs://spls/gsp290/dataflow-python-examples . # Copy the Dataflow Python examples from Google Cloud's professional services GitHub repository

Export the name of your cloud project as an environment variable.
$ export PROJECT=<YOUR-PROJECT-ID>
Set the exported project as the active one. [Type $PROJECT literally; the shell expands it]
gcloud config set project $PROJECT

gsutil mb -c regional -l us-central1 gs://$PROJECT # Create a new regional bucket named after the project

gsutil cp gs://spls/gsp290/data_files/usa_names.csv gs://$PROJECT/data_files/ # Copy the data file into the bucket we just created

bq mk lake # Create a BigQuery dataset named lake

Change into the examples directory.
cd dataflow-python-examples/
Install virtualenv.
sudo pip install virtualenv

virtualenv -p python3 venv # Create a Python 3 virtual environment

source venv/bin/activate # Activate the virtual environment
pip install apache-beam[gcp]==2.24.0 # Install Apache Beam with the GCP extras

In the Cloud Shell window, click Open Editor.

The dataflow-python-examples folder appears in the workspace. Go through each Python program and try to understand the code.

Here we'll run the Dataflow pipeline in the cloud. It will spin up the required workers and shut them down when the job completes.
python dataflow_python_examples/data_ingestion.py --project=$PROJECT --region=us-central1 --runner=DataflowRunner --staging_location=gs://$PROJECT/test --temp_location gs://$PROJECT/test --input gs://$PROJECT/data_files/usa_names.csv --save_main_session
It will take some time to provision the workers and finish the job.
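Before running the pipeline, it helps to see what data_ingestion.py does at its core: it reads each CSV line and maps the fields onto BigQuery column names. The sketch below is illustrative only; the column list is an assumption based on the usa_names.csv layout, and the lab's script defines its own header list.

```python
import csv
from io import StringIO

# Assumed column names for usa_names.csv; not copied from the lab's script.
COLUMNS = ['state', 'gender', 'year', 'name', 'number', 'created_date']

def parse_line(line):
    """Turn one CSV line into a dict keyed by column name --
    the row shape BigQuery expects for inserts."""
    values = next(csv.reader(StringIO(line)))
    return dict(zip(COLUMNS, values))

row = parse_line('AL,F,1910,Annie,482,11/28/2016')
print(row['name'])  # Annie
```

Every value stays a string at this stage; the later transformation pipeline is what applies the schema types.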

Open the console and click Menu > Dataflow > Jobs.

In Jobs, check the status. It will show Succeeded after completion, or Running while the job is still in progress.

Open Menu > BigQuery > SQL Workspace.

You will see the usa_names table under the lake dataset.

Here we'll run the Dataflow pipeline saved in the data_transformation.py file. It will spin up the required workers and shut them down when the job completes.
python dataflow_python_examples/data_transformation.py --project=$PROJECT --region=us-central1 --runner=DataflowRunner --staging_location=gs://$PROJECT/test --temp_location gs://$PROJECT/test --input gs://$PROJECT/data_files/head_usa_names.csv --save_main_session
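The transformation step's key idea is casting the raw CSV strings to the types declared in a BigQuery schema. The snippet below is a hedged sketch of that idea, not the lab's code: data_transformation.py reads its types from a JSON schema file, and the schema fragment here is an assumption.

```python
# Illustrative schema fragment (an assumption, not the lab's JSON schema).
SCHEMA = [('state', 'STRING'), ('gender', 'STRING'),
          ('year', 'INTEGER'), ('name', 'STRING'), ('number', 'INTEGER')]

# Map BigQuery type names to Python constructors.
CASTS = {'STRING': str, 'INTEGER': int}

def transform(fields):
    """Cast a list of raw CSV strings to typed values per the schema."""
    return {name: CASTS[bq_type](value)
            for (name, bq_type), value in zip(SCHEMA, fields)}

print(transform(['AL', 'F', '1910', 'Annie', '482']))
```

After this step, numeric columns such as year and number land in BigQuery as integers rather than strings.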

In the Cloud Shell Editor, open dataflow-python-examples > dataflow_python_examples > data_enrichment.py.
On line 83, replace
values = [x.decode('utf8') for x in csv_row]
with
values = [x for x in csv_row]
With this change, the pipeline can populate the data in BigQuery.
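The reason for the edit: under Python 3, the CSV fields arrive as str, and only bytes has a .decode() method, so the Python-2-era line raises an AttributeError. A minimal demonstration (the sample values below are made up, not taken from the lab's data):

```python
# Sample field values for illustration only.
csv_row = ['KY', 'F', '1910', 'Dorothy', '47']

try:
    # The original line 83: works on Python 2 byte strings, fails on Python 3 str.
    values = [x.decode('utf8') for x in csv_row]
except AttributeError:
    print("str has no .decode() in Python 3")

# The fixed line: the fields are already decoded strings.
values = [x for x in csv_row]
print(values[3])  # Dorothy
```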


Now we'll run the Dataflow pipeline saved in data_enrichment.py. It will spin up the workers and shut them down when finished.
python dataflow_python_examples/data_enrichment.py --project=$PROJECT --region=us-central1 --runner=DataflowRunner --staging_location=gs://$PROJECT/test --temp_location gs://$PROJECT/test --input gs://$PROJECT/data_files/head_usa_names.csv --save_main_session

Open Menu > BigQuery > SQL Workspace.
You will see the populated data in the lake dataset.

Open Menu > Dataflow > Jobs.

Click the job you just ran.

It will display the Dataflow pipeline graph for the job.
