Prerequisites
Hardware: GCP
Google account
Open the console.
Click Activate Cloud Shell.

$ gcloud config list project # Show the currently configured project
$ gcloud config set project my-demo-project-306417 # Select the project

In the Cloud console, click Menu > IAM & Admin.

Confirm that the default Compute Engine service account is present and has the Editor role assigned:
{project-number}-compute@developer.gserviceaccount.com

To check the project number, go to Menu > Home > Dashboard.

In the dashboard, the project number is displayed under Project info.
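
The service account name follows a fixed pattern, so it can be derived from the project number. Below is a minimal sketch in Python; the function name and the sample project number are made up for illustration.

```python
def default_compute_service_account(project_number):
    """Return the default Compute Engine service account email.

    Follows the pattern {project-number}-compute@developer.gserviceaccount.com.
    """
    number = str(project_number).strip()
    if not number.isdigit():
        # A project ID such as my-demo-project-306417 is not a project number.
        raise ValueError("project number must contain digits only")
    return f"{number}-compute@developer.gserviceaccount.com"


# 123456789012 is a placeholder, not a real project number.
print(default_compute_service_account(123456789012))
```

Compare the printed address with the entry listed under IAM & Admin.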

In Cloud Shell, run the command below.
$ gsutil -m cp -R gs://spls/gsp290/dataflow-python-examples . # Copy the Dataflow Python examples (originally from Google Cloud's professional-services GitHub repository) from Cloud Storage

Export the name of your cloud project as an environment variable.
$ export PROJECT=<YOUR-PROJECT-ID>
Select the exported project. Type $PROJECT as written; do not substitute a value.
$ gcloud config set project $PROJECT

$ gsutil mb -c regional -l us-central1 gs://$PROJECT # Create a new regional bucket named after the project

$ gsutil cp gs://spls/gsp290/data_files/usa_names.csv gs://$PROJECT/data_files/ # Copy the data file into the bucket created above
$ gsutil cp gs://spls/gsp290/data_files/head_usa_names.csv gs://$PROJECT/data_files/ # Copy the file with a header row, used as input by the later pipelines

$ bq mk lake # Create a BigQuery dataset named lake

Go to the examples directory.
$ cd dataflow-python-examples/
Install virtualenv.
$ sudo pip install virtualenv

$ virtualenv -p python3 venv # Create a virtual environment

$ source venv/bin/activate # Activate the virtual environment
$ pip install 'apache-beam[gcp]==2.24.0' # Install Apache Beam with the GCP extras

In the Cloud Shell window, click Open Editor.

The workspace contains the copied dataflow-python-examples folder. Read through each Python program and try to understand the code.

Next, run the Dataflow pipeline in the cloud. It starts the workers it needs and shuts them down when the job completes.
$ python dataflow_python_examples/data_ingestion.py --project=$PROJECT --region=us-central1 --runner=DataflowRunner --staging_location=gs://$PROJECT/test --temp_location gs://$PROJECT/test --input gs://$PROJECT/data_files/usa_names.csv --save_main_session
Assigning the workers and finishing the job takes some time.
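
The command mixes a flag for the script itself (--input) with flags intended for the Dataflow runner. A common way for a Beam script to separate the two is argparse's parse_known_args. The sketch below shows that idea on its own; it is not a copy of data_ingestion.py, and the function name and the my-project value are assumptions for illustration.

```python
import argparse


def split_args(argv):
    """Separate script-specific flags from flags meant for the pipeline runner."""
    parser = argparse.ArgumentParser()
    # Only --input is treated as script-specific in this sketch (assumption).
    parser.add_argument("--input", required=True, help="CSV file to read")
    # Flags the parser does not know are returned untouched in the second value.
    known_args, pipeline_args = parser.parse_known_args(argv)
    return known_args, pipeline_args


known_args, pipeline_args = split_args([
    "--project=my-project",  # placeholder project ID
    "--region=us-central1",
    "--runner=DataflowRunner",
    "--temp_location", "gs://my-project/test",
    "--input", "gs://my-project/data_files/usa_names.csv",
    "--save_main_session",
])
print(known_args.input)
print(pipeline_args)
```

By default argparse recognizes a flag only when it starts with an ASCII hyphen, so flags must be typed with two plain hyphens (--input). A command pasted from a formatted page may carry a typographic dash instead, and such an argument is not parsed as a flag.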

Open the console and click Menu > Dataflow > Jobs.

In Jobs, check the status. It shows Running while the job is in progress and Succeeded after completion.

Open Menu > BigQuery > SQL Workspace.

The usa_names table appears under the lake dataset.

Next, run the Dataflow pipeline saved in data_transformation.py. It starts the workers it needs and shuts them down when the job completes.
$ python dataflow_python_examples/data_transformation.py --project=$PROJECT --region=us-central1 --runner=DataflowRunner --staging_location=gs://$PROJECT/test --temp_location gs://$PROJECT/test --input gs://$PROJECT/data_files/head_usa_names.csv --save_main_session

Open the Cloud Shell Editor, then open dataflow-python-examples > dataflow_python_examples > data_enrichment.py.
On line 83, replace
values = [x.decode('utf8') for x in csv_row]
with
values = [x for x in csv_row]
The pipeline in this file populates the data in BigQuery.
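
The likely reason for this edit is that the virtual environment runs Python 3, where the csv module yields str values and str has no decode method. The standalone sketch below illustrates that behavior; it is not the code from data_enrichment.py, and the function name and the sample row are made up.

```python
import csv
import io


def parse_row(line):
    """Split one CSV line into a list of string values."""
    csv_row = next(csv.reader(io.StringIO(line)))
    # In Python 3 every field is already a str, so no .decode() call is needed.
    values = [x for x in csv_row]
    return values


row = parse_row("Texas,F,1910,Mary,895")  # made-up sample row
print(row)                         # ['Texas', 'F', '1910', 'Mary', '895']
print(hasattr(row[0], "decode"))   # False: str has no decode method
print(b"Mary".decode("utf8"))      # Mary: only bytes objects are decoded
```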


Now run the Dataflow pipeline saved in data_enrichment.py. It starts the workers it needs and shuts them down when the job completes.
$ python dataflow_python_examples/data_enrichment.py --project=$PROJECT --region=us-central1 --runner=DataflowRunner --staging_location=gs://$PROJECT/test --temp_location gs://$PROJECT/test --input gs://$PROJECT/data_files/head_usa_names.csv --save_main_session
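
The three pipeline commands in this walkthrough differ only in the script name and the input file. As an optional convenience, the argument list could be generated rather than retyped. This is a rough sketch; the helper name and the my-project value are hypothetical, and the flags are the ones used in the commands above.

```python
import shlex


def build_pipeline_command(script, project, input_file, region="us-central1"):
    """Return the argument list for launching one example pipeline on Dataflow."""
    bucket = f"gs://{project}"
    return [
        "python", f"dataflow_python_examples/{script}",
        f"--project={project}",
        f"--region={region}",
        "--runner=DataflowRunner",
        f"--staging_location={bucket}/test",
        "--temp_location", f"{bucket}/test",
        "--input", f"{bucket}/data_files/{input_file}",
        "--save_main_session",
    ]


# my-project is a placeholder project ID.
cmd = build_pipeline_command("data_enrichment.py", "my-project", "head_usa_names.csv")
print(shlex.join(cmd))
```

The printed line can be pasted into Cloud Shell, or the list can be passed to subprocess.run(cmd, check=True) from the dataflow-python-examples directory.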

Open Menu > BigQuery > SQL Workspace.
The populated dataset appears under lake.

Open Menu > Dataflow > Jobs.

Click the job you just ran.

It shows the Dataflow pipeline graph for the job.
