Cloud Data Fusion and BigQuery for data integration
Prerequisites
A GCP account
Open the Cloud Console.
Click Activate Cloud Shell.

$ echo $GOOGLE_CLOUD_PROJECT # Display the project ID
$ gsutil mb gs://$GOOGLE_CLOUD_PROJECT # Create our main bucket, named after the project, if it does not exist yet
$ gsutil cp gs://cloud-training/OCBL017/ny-taxi-2018-sample.csv gs://$GOOGLE_CLOUD_PROJECT # Copy the sample file from the cloud-training public bucket into our bucket
$ gsutil mb gs://$GOOGLE_CLOUD_PROJECT-temp # Create a temporary bucket for staging

Open Menu > Cloud Storage > Browser

The temporary bucket has been created and the file has been copied into the main bucket.
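Optionally, verify both from Cloud Shell:

$ gsutil ls gs://$GOOGLE_CLOUD_PROJECT # The sample CSV should be listed here
$ gsutil ls -b gs://$GOOGLE_CLOUD_PROJECT-temp # The temporary bucket should exist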
Open Menu > Data Fusion.
Click View Instance. This opens the Data Fusion UI.

Click No thanks on the welcome pop-up.

Click Wrangler.

Under Default Cloud Storage, open the bucket into which we copied the file.

Select the copied file.

Click the drop-down arrow next to the body column.

Click Parse > CSV.

Select Comma as the delimiter and tick "Set first row as header".
Click Apply.

In the body column's drop-down, click Delete column.

In the trip_distance column's drop-down, change the data type to Float.

Open Filter > Custom condition.

Enter >0.0 as the condition, so only rows with a positive trip distance are kept.
Click Apply.

On the right side of the table, the Transformation steps pane lists every transformation we applied.
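For reference, these UI steps correspond to a Wrangler recipe roughly like the one below (directive spellings can differ between CDAP versions, so treat this as a sketch rather than the exact generated recipe):

parse-as-csv :body ',' true
drop body
set-type :trip_distance float
filter-rows-on condition-false trip_distance > 0.0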
Then click Create a Pipeline.

Click Batch pipeline.

This creates a pipeline in the Pipeline Studio.
Open Properties on the Wrangler node.

The transformations we applied in Wrangler are already listed here.

On the right side, the output schema is visible. In it, delete the column named extra.

Click Validate.

Click Close if no errors are found.

Open Console > BigQuery.

Click Create dataset.

Give the dataset a name and click Create dataset.
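The same dataset can also be created from Cloud Shell; the name trips below is only an example:

$ bq mk --dataset $GOOGLE_CLOUD_PROJECT:trips # Create a dataset named trips in our project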

Click More > Query settings.

Select "Set a destination table for query results".
Enter the project name, dataset name, and table name.
Choose Write if empty or Overwrite table as the write preference.

Click Save.

In the Query Editor, paste the query below.

SELECT
  zone_id,
  zone_name,
  borough
FROM
  `bigquery-public-data.new_york_taxi_trips.taxi_zone_geom`

Click Run.

The query output is saved into the destination table we configured.
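Equivalently, the query and its destination table can be set in a single Cloud Shell command; trips.zone_id_table is a placeholder for whatever dataset and table you configured above:

$ bq query --use_legacy_sql=false \
    --destination_table $GOOGLE_CLOUD_PROJECT:trips.zone_id_table \
    --replace \
    'SELECT zone_id, zone_name, borough FROM `bigquery-public-data.new_york_taxi_trips.taxi_zone_geom`'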

Go back to Data Fusion.
From the Source section, add a BigQuery node.

Open Properties on the BigQuery source node.

Enter a reference name.
Enter the dataset name and table name we created in BigQuery.
Click Get Schema.

The output schema is displayed on the right side.
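You can compare this against the table's actual schema from Cloud Shell (trips.zone_id_table is again a placeholder):

$ bq show --schema --format=prettyjson $GOOGLE_CLOUD_PROJECT:trips.zone_id_table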

In Temporary Bucket Name, enter the name of the temporary bucket we created earlier.

Click Validate, and close if no errors are found.

Click Analytics > Joiner.

Drag an arrow into the Joiner from both the BigQuery node and the Wrangler node.


Open Properties on the Joiner node.

Join type: Inner
Join condition:
  Wrangler: pickup_location_id
  BigQuery: zone_id
Click Get Schema.

In the output schema, remove pickup_location_id and zone_id (the join keys are redundant in the joined output).

Click Validate.

Close if no errors are found.
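Conceptually, the Joiner performs an inner join equivalent to the SQL below; wrangler_output and bigquery_source are illustrative names for the two input stages, not real tables:

SELECT w.*, b.zone_name, b.borough
FROM wrangler_output AS w
INNER JOIN bigquery_source AS b
  ON w.pickup_location_id = b.zone_id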

From the Sink section, add a second BigQuery node.

Open Properties on this second BigQuery node.

Enter a reference name, a dataset name, and a table name for the joined output.
Enter the name of the temporary bucket we created.

Click Validate, and close if no errors are found.

Now drag an arrow from the Joiner to the BigQuery sink.

Name the pipeline and click Save.

Click Deploy.

Data Fusion deploys the pipeline.

Click Run.

Execution starts; it takes around 10-15 minutes, and the status moves through Provisioning, Starting, and Running before finishing successfully.

Go back to BigQuery. In the dataset, a new table has been created with the joined output.
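To preview a few rows of the new table from Cloud Shell (trips.joined_trips is a placeholder for the dataset and table you entered in the sink properties):

$ bq head --max_rows=5 $GOOGLE_CLOUD_PROJECT:trips.joined_trips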

Back in Data Fusion, click Logs to view the execution logs.
