Cloud Data Fusion and BigQuery for data integration
Prerequisites
A GCP account
Open the Cloud Console.
Click Activate Cloud Shell.

$ echo $GOOGLE_CLOUD_PROJECT # Display the project ID
$ gsutil mb gs://$GOOGLE_CLOUD_PROJECT # Create our main bucket, named after the project, if it does not exist yet
$ gsutil cp gs://cloud-training/OCBL017/ny-taxi-2018-sample.csv gs://$GOOGLE_CLOUD_PROJECT # Copy the sample file from the cloud-training public bucket into our bucket
$ gsutil mb gs://$GOOGLE_CLOUD_PROJECT-temp # Create a temporary bucket for staging

Open Menu > Cloud Storage > Browser

The temporary bucket has been created and the file has been copied into the main bucket.
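Optionally, verify both from Cloud Shell:

$ gsutil ls gs://$GOOGLE_CLOUD_PROJECT # The sample CSV should be listed here
$ gsutil ls -b gs://$GOOGLE_CLOUD_PROJECT-temp # The temporary bucket should exist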
Open Menu > Data Fusion.
Click View Instance. This opens the Data Fusion UI.

Click No thanks on the welcome pop-up.

Click Wrangler.

Under Default Cloud Storage, open the bucket into which we copied the file.

Select the copied file.

Click the drop-down arrow next to the body column.

Click Parse > CSV.

Select Comma as the delimiter and tick "Set first row as header".
Click Apply.

In the body column's drop-down, click Delete column.

In the trip_distance column's drop-down, change the data type to Float.

Open Filter > Custom condition.

Enter >0.0 as the condition, so only rows with a positive trip distance are kept.
Click Apply.

On the right side of the table, the Transformation steps pane lists every transformation we applied.
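For reference, these UI steps correspond to a Wrangler recipe roughly like the one below (directive spellings can differ between CDAP versions, so treat this as a sketch rather than the exact generated recipe):

parse-as-csv :body ',' true
drop body
set-type :trip_distance float
filter-rows-on condition-false trip_distance > 0.0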
Then click Create a Pipeline.

Click Batch pipeline.

This creates a pipeline in the Pipeline Studio.
Open Properties on the Wrangler node.

The transformations we applied in Wrangler are already listed here.

On the right side, the output schema is visible. In it, delete the column named extra.

Click Validate.

Click Close if no errors are found.

Open Console > BigQuery.

Click Create dataset.

Give the dataset a name and click Create dataset.
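The same dataset can also be created from Cloud Shell; the name trips below is only an example:

$ bq mk --dataset $GOOGLE_CLOUD_PROJECT:trips # Create a dataset named trips in our project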

Click More > Query settings.

Select "Set a destination table for query results".
Enter the project name, dataset name, and table name.
Choose Write if empty or Overwrite table as the write preference.

Click Save.

In the Query Editor, paste the query below.

SELECT
  zone_id,
  zone_name,
  borough
FROM
  `bigquery-public-data.new_york_taxi_trips.taxi_zone_geom`

Click Run.

The query output is saved into the destination table we configured.
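Equivalently, the query and its destination table can be set in a single Cloud Shell command; trips.zone_id_table is a placeholder for whatever dataset and table you configured above:

$ bq query --use_legacy_sql=false \
    --destination_table $GOOGLE_CLOUD_PROJECT:trips.zone_id_table \
    --replace \
    'SELECT zone_id, zone_name, borough FROM `bigquery-public-data.new_york_taxi_trips.taxi_zone_geom`'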

Go back to Data Fusion.
From the Source section, add a BigQuery node.

Open Properties on the BigQuery source node.

Enter a reference name.
Enter the dataset name and table name we created in BigQuery.
Click Get Schema.

The output schema is displayed on the right side.
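You can compare this against the table's actual schema from Cloud Shell (trips.zone_id_table is again a placeholder):

$ bq show --schema --format=prettyjson $GOOGLE_CLOUD_PROJECT:trips.zone_id_table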

In Temporary Bucket Name, enter the name of the temporary bucket we created earlier.

Click Validate, and close if no errors are found.

Click Analytics > Joiner.

Drag an arrow into the Joiner from both the BigQuery node and the Wrangler node.


Open Properties on the Joiner node.

Join type: Inner
Join condition:
  Wrangler: pickup_location_id
  BigQuery: zone_id
Click Get Schema.

In the output schema, remove pickup_location_id and zone_id (the join keys are redundant in the joined output).

Click Validate.

Close if no errors are found.
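Conceptually, the Joiner performs an inner join equivalent to the SQL below; wrangler_output and bigquery_source are illustrative names for the two input stages, not real tables:

SELECT w.*, b.zone_name, b.borough
FROM wrangler_output AS w
INNER JOIN bigquery_source AS b
  ON w.pickup_location_id = b.zone_id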

From the Sink section, add a second BigQuery node.

Open Properties on this second BigQuery node.

Enter a reference name, a dataset name, and a table name for the joined output.
Enter the name of the temporary bucket we created.

Click Validate, and close if no errors are found.

Now drag an arrow from the Joiner to the BigQuery sink.

Name the pipeline and click Save.

Click Deploy.

Data Fusion deploys the pipeline.

Click Run.

Execution starts; it takes around 10-15 minutes, and the status moves through Provisioning, Starting, and Running before finishing successfully.

Go back to BigQuery. In the dataset, a new table has been created with the joined output.
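To preview a few rows of the new table from Cloud Shell (trips.joined_trips is a placeholder for the dataset and table you entered in the sink properties):

$ bq head --max_rows=5 $GOOGLE_CLOUD_PROJECT:trips.joined_trips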

Back in Data Fusion, click Logs to view the execution logs.
