Sign in to your Google account.
Open the Google Cloud console.
Click Activate Cloud Shell.

$ gcloud config list project # To check the currently active project
$ gcloud config set project my-demo-project-306417 # To select a project (use your own project ID)
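To confirm the change took effect, you can print the active project directly:
$ gcloud config get-value project # Prints the currently active project ID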

In the Cloud console, click Menu > IAM & Admin.

Confirm that the Compute Engine default service account is present and has the Editor role assigned.
{project-number}-compute@developer.gserviceaccount.com

To find the project number, go to Menu > Home > Dashboard.
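The project number can also be fetched from Cloud Shell; a quick sketch using standard gcloud flags:
$ gcloud projects describe $(gcloud config get-value project) --format="value(projectNumber)" # Prints the project number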


In Cloud Shell, run the command below.
$ gsutil -m cp -R gs://spls/gsp290/dataflow-python-examples . # Copy the Dataflow Python examples from Google Cloud's professional services GitHub repo

Export your project ID as an environment variable.
$ export PROJECT=<YOUR-PROJECT-ID>
Set the active project to the exported value. [Type $PROJECT literally; the shell expands it.]
gcloud config set project $PROJECT

gsutil mb -c regional -l us-central1 gs://$PROJECT # Create a new regional bucket named after the project ID

gsutil cp gs://spls/gsp290/data_files/usa_names.csv gs://$PROJECT/data_files/ # Copy the data file into the bucket we just created
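Both steps can be verified with standard gsutil listings:
gsutil ls -L -b gs://$PROJECT # Show bucket metadata, including its location
gsutil ls gs://$PROJECT/data_files/ # Confirm usa_names.csv landed in the bucket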

bq mk lake # Create a BigQuery dataset named lake
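To confirm the dataset was created:
bq ls # The lake dataset should appear in the list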

Change into the examples directory:
cd dataflow-python-examples/
Install virtualenv:
sudo pip install virtualenv

virtualenv -p python3 venv # Create a virtual environment

source venv/bin/activate # Activate the virtual environment
pip install apache-beam[gcp]==2.24.0 # Install Apache Beam with the GCP extras
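A quick sanity check that Beam installed into the virtual environment (assumes the venv is still active):
python -c "import apache_beam as beam; print(beam.__version__)" # Should print 2.24.0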

In the Cloud Shell window, click Open Editor.

The dataflow-python-examples folder appears in the editor workspace. Go through each Python program and try to understand the code.
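If you'd rather skim the files from the shell than the editor, for example:
ls dataflow_python_examples/ # List the example pipelines
head -n 40 dataflow_python_examples/data_ingestion.py # Skim the start of the ingestion pipeline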

Run the ingestion pipeline on Dataflow:
python dataflow_python_examples/data_ingestion.py --project=$PROJECT --region=us-central1 --runner=DataflowRunner --staging_location=gs://$PROJECT/test --temp_location gs://$PROJECT/test --input gs://$PROJECT/data_files/usa_names.csv --save_main_session
It takes a few minutes for Dataflow to spin up workers and finish the job.
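The job can also be watched from the command line:
gcloud dataflow jobs list --region=us-central1 --status=active # Shows jobs that are still running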

Open the console and click Menu > Dataflow > Jobs

In Jobs, check the status. It shows Running while the pipeline is in progress and Succeeded once it completes.

Open Menu > BigQuery > SQL Workspace.

It will show the usa_names table under the lake dataset.
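The table can also be queried without leaving the shell (standard bq usage; SELECT * is used since the schema comes from the pipeline):
bq query --use_legacy_sql=false 'SELECT * FROM lake.usa_names LIMIT 10'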

Next, run the Dataflow pipeline in data_transformation.py. It adds the workers it needs and shuts them down when complete.
python dataflow_python_examples/data_transformation.py --project=$PROJECT --region=us-central1 --runner=DataflowRunner --staging_location=gs://$PROJECT/test --temp_location gs://$PROJECT/test --input gs://$PROJECT/data_files/head_usa_names.csv --save_main_session

In the Cloud Shell Editor, open dataflow-python-examples > dataflow_python_examples > data_enrichment.py.
On line 83, replace
values = [x.decode('utf8') for x in csv_row]
with
values = [x for x in csv_row]
In Python 3 the CSV values are already strings, so the decode call is unnecessary; with this change the pipeline can populate the data in BigQuery.
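If you prefer to apply the edit from the shell instead of the editor, a sed one-liner works too (a sketch; confirm the line content first):
sed -i "s/x.decode('utf8') for x in csv_row/x for x in csv_row/" dataflow_python_examples/data_enrichment.py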


Now run the Dataflow pipeline in data_enrichment.py. As before, it adds workers and shuts them down when finished.
python dataflow_python_examples/data_enrichment.py --project=$PROJECT --region=us-central1 --runner=DataflowRunner --staging_location=gs://$PROJECT/test --temp_location gs://$PROJECT/test --input gs://$PROJECT/data_files/head_usa_names.csv --save_main_session

Open Menu > BigQuery > SQL Workspace.
The newly populated tables appear under the lake dataset.
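From the shell, listing the dataset shows every table the pipelines have written so far:
bq ls lake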

Open Menu > Dataflow > Jobs.

Click the job you just ran.

It shows the Dataflow pipeline graph for the job.
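Job details, including state and timing, are also available from the CLI (replace JOB_ID with the ID shown in the Jobs list):
gcloud dataflow jobs describe JOB_ID --region=us-central1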
