Cloud Data Fusion and BigQuery for data integration
Prerequisites
GCP account
Open Console.
Click on Activate cloud shell
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-156.png)
$ echo $GOOGLE_CLOUD_PROJECT #Display the Project ID
$ gsutil cp gs://cloud-training/OCBL017/ny-taxi-2018-sample.csv gs://$GOOGLE_CLOUD_PROJECT
#Copying file from cloud training public bucket into our bucket
$ gsutil mb gs://$GOOGLE_CLOUD_PROJECT-temp #Creating bucket
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-157.png)
Open Menu > Cloud Storage > Browser
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-158.png)
The temporary bucket is created and the file is copied in main bucket.
Open Menu > Data Fusion
Click on View Instance. It will open the Data Fusion page.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-160.png)
Click No thanks.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-161.png)
Click on Wrangler
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-162.png)
In Default Cloud Storage > Open the bucket which we copied the file.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-163.png)
Select the copied file.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-164.png)
Click on drop down arrow.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-165.png)
Click Parse > CSV
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-166.png)
Select comma and tick set first row as header.
Click Apply
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-167.png)
In drop down of body column, Click delete column
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-168.png)
In Trip distance drop down > change data type to Float
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-169.png)
open Filter > Custom condition
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-170.png)
Give Condition as >0.0
Click Apply
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-171.png)
In right side of table, we can see the transformations which we did in our table
Then Click on Create a pipeline
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-172.png)
Click on Batch Pipeline.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-173.png)
It will create a Pipeline.
properties in Wrangler
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-174.png)
The modifications done in table will be already mentioned in here
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-175.png)
In right side, output schema is visible. In that delete column named extra.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-176.png)
Validate.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-177.png)
Click on close if no errors found.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-178.png)
Open Console > BigQuery
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-179.png)
Click Create dataset.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-180.png)
Give dataset name. Click Create Dataset.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-181.png)
Click on More > Query Settings
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-182.png)
Select Set destination table for query results.
Give Project name,Dataset name, and table name.
Select Write if empty or Overwrite table and press save.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-183.png)
Click save
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-184.png)
In Query Editor, Paste the below code
SELECT
zone_id,
zone_name,
borough
FROM
bigquery-public-data.new_york_taxi_trips.taxi_zone_geom
Click Run
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-185.png)
The output will be saved in the created table.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-186.png)
Go back to Data Fusion
Source > BigQuery
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-187.png)
Properties in BigQuery.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-188.png)
Reference name.
Give the dataset name and table name which we created.
Click Get schema
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-189.png)
The Output Schema will be displayed in Right side.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-190.png)
Temporary bucket name.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-191.png)
Click on Validate.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-192.png)
if no errors found.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-193.png)
Click Analytics > Joiner
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-194.png)
Drag the arrow into joiner from both BigQuery and Wrangler.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-195.png)
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-196.png)
Click on properties in joiner
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-197.png)
Jointype : inner
Join condition
Wrangler : pickup_location_id
BigQuery : zone_id
Click on Get Schema.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-198.png)
In output schema remove pickup_location_id & zone_id
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-199.png)
remove zone_id.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-200.png)
Validate.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-201.png)
close if no errors found.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-202.png)
Sink > bigQuery
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-203.png)
properties in second big query.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-204.png)
Give Reference name, dataset, table name.
Give the temporary bucket name which we created.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-205.png)
Click on Validate.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-206.png)
close if no errors found
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-207.png)
Now drag the arrow from Joiner
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-208.png)
rename the pipeline. Click save
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-209.png)
Save.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-210.png)
Deploy
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-211.png)
It will deploy pipeline
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-212.png)
Click run.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-213.png)
It will Start Execution. This execution will take arround 10-15 minutes.
Execution is Provisioning
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-214.png)
Starting
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-215.png)
Running
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-216.png)
Finished Successfully.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-217.png)
Go to BigQuery. In the dataset, new table will be created.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-218.png)
In the data Fusion, Click Logs.You can see the logs of Execution.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-219.png)
Cloud Data Fusion and BigQuery for data integration