Data Fusion with BigQuery

  • date 29th May, 2021
  • by Prwatech

Cloud Data Fusion and BigQuery for data integration

 

Prerequisites

GCP account

Open the GCP Console.

Click Activate Cloud Shell.

$ echo $GOOGLE_CLOUD_PROJECT    # Display the project ID

$ gsutil cp gs://cloud-training/OCBL017/ny-taxi-2018-sample.csv gs://$GOOGLE_CLOUD_PROJECT    # Copy the sample file from the public training bucket into our bucket

$ gsutil mb gs://$GOOGLE_CLOUD_PROJECT-temp    # Create a temporary bucket
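Note that the copy above assumes a bucket named after the project ID already exists. If it does not, a minimal sketch to create it first (bucket names are global, so the project ID being available as a bucket name is an assumption):

$ gsutil mb gs://$GOOGLE_CLOUD_PROJECT    # Create the destination bucket before copying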

Open Menu > Cloud Storage > Browser.

The temporary bucket has been created and the file has been copied into the main bucket.

Open Menu > Data Fusion.

Click View Instance. It will open the Data Fusion UI.

Click No thanks to dismiss the guided tour.

Click Wrangler.

Under the default Cloud Storage connection, open the bucket into which we copied the file.

Select the copied file.

Click the drop-down arrow of the body column.

Click Parse > CSV.

Select Comma as the delimiter and tick Set first row as header.

Click Apply.

In the drop-down of the body column, click Delete column.

In the drop-down of the trip_distance column, change the data type to Float.

Open Filter > Custom condition.

Give the condition as >0.0 (this keeps only rows with a positive trip distance).

Click Apply.

On the right side of the table, we can see the list of transformations applied so far; a sketch of the equivalent directives follows below.
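For reference, the recipe that Wrangler builds from these clicks looks roughly like the directives below (a sketch; directive syntax can differ between CDAP/Wrangler versions, so treat it as illustrative rather than exact):

parse-as-csv :body ',' true
drop body
set-type :trip_distance float
filter-rows-on condition-true trip_distance <= 0.0

The last directive removes the rows that fail the >0.0 condition.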

Then click Create a Pipeline.

 

Click Batch pipeline.

It will create a pipeline in the Pipeline Studio.

Open Properties on the Wrangler node.

The transformations we applied in Wrangler are already listed here.

On the right side, the output schema is visible; in it, delete the column named extra.

Click Validate.

Click Close if no errors are found.

Open Console > BigQuery.

Click Create dataset.

Give the dataset a name and click Create dataset.

Click More > Query settings.

Select Set a destination table for query results.

Give the project name, dataset name, and table name.

Select Write if empty or Overwrite table, then click Save.

In the Query Editor, paste the query below:

SELECT
  zone_id,
  zone_name,
  borough
FROM
  `bigquery-public-data.new_york_taxi_trips.taxi_zone_geom`

Click Run.

The output will be saved in the destination table we configured.
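The same dataset, destination table, and query can also be produced from Cloud Shell with the bq CLI (a sketch; the dataset name trips and table name zone_lookup are hypothetical):

$ bq mk --dataset $GOOGLE_CLOUD_PROJECT:trips    # Create the dataset
$ bq query --use_legacy_sql=false \
    --destination_table=trips.zone_lookup \
    'SELECT zone_id, zone_name, borough
     FROM `bigquery-public-data.new_york_taxi_trips.taxi_zone_geom`'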

Go back to Data Fusion.

From Source, add BigQuery.

Open Properties on the BigQuery node.

Give a reference name.

Give the dataset name and table name which we just created.

Click Get Schema; the output schema will be displayed on the right side.

Give the temporary bucket name.

Click Validate, then close if no errors are found.

Click Analytics > Joiner.

Drag an arrow into the Joiner from both the BigQuery and Wrangler nodes.

Open Properties on the Joiner node.

Join type: Inner.

Join condition:

Wrangler: pickup_location_id

BigQuery: zone_id

Click Get Schema.

In the output schema, remove pickup_location_id and zone_id.

Click Validate, then close if no errors are found.
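For intuition, the Joiner computes the equivalent of the inner join below, expressed here in BigQuery SQL via the bq CLI (a sketch; wrangler_output and zone_lookup are hypothetical table names, since the pipeline actually joins in-flight rather than in BigQuery):

$ bq query --use_legacy_sql=false \
    'SELECT w.* EXCEPT (pickup_location_id), z.zone_name, z.borough
     FROM trips.wrangler_output AS w
     INNER JOIN trips.zone_lookup AS z
       ON w.pickup_location_id = z.zone_id'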

From Sink, add BigQuery.

Open Properties on the second BigQuery node.

Give a reference name, dataset name, and table name.

Give the temporary bucket name which we created earlier.

Click Validate, then close if no errors are found.

Now drag the arrow from the Joiner to the sink.

Rename the pipeline and click Save.

Click Deploy. It will deploy the pipeline.

Click Run. It will start execution, which takes around 10-15 minutes.

The run status moves through Provisioning > Starting > Running > Succeeded.
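Once deployed, the pipeline can also be started from Cloud Shell through the instance's CDAP REST API (a sketch; INSTANCE, REGION, and PIPELINE are placeholders, and gcloud beta data-fusion assumes the beta component is available):

$ CDAP_ENDPOINT=$(gcloud beta data-fusion instances describe INSTANCE \
      --location=REGION --format="value(apiEndpoint)")
$ curl -X POST \
      -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      "${CDAP_ENDPOINT}/v3/namespaces/default/apps/PIPELINE/workflows/DataPipelineWorkflow/start"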

Go to BigQuery. In the dataset, a new table will have been created by the pipeline.
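To confirm from Cloud Shell (a sketch; trips and pipeline_output are hypothetical dataset and table names):

$ bq ls trips    # List the tables in the dataset
$ bq head -n 5 trips.pipeline_output    # Preview a few rows of the new table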

In Data Fusion, click Logs. You can see the logs of the execution.

 


 
