Cloud Dataflow: Building Data Processing Pipelines
Cloud Dataflow offers a powerful platform for building scalable and reliable data processing pipelines within Google Cloud. These pipelines let organizations ingest, transform, and analyze large volumes of data in real time, providing valuable insights and driving informed decision-making.
At its core, Cloud Dataflow simplifies the development and management of data pipelines by abstracting away the complexities of distributed computing infrastructure. Developers can focus on defining the data processing logic using familiar programming languages and libraries, while Dataflow handles the underlying infrastructure provisioning, scaling, and optimization.
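To make the programming model concrete, here is a minimal sketch of an Apache Beam pipeline in Python. It is not the pipeline used later in this lab; the GCS paths and the transform are placeholders, and it assumes the apache-beam package is installed.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Minimal Beam pipeline: read text, transform each line, write the result.
# The gs:// paths are placeholders; replace them with your own bucket.
with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | 'Read'    >> beam.io.ReadFromText('gs://<your-bucket>/input.txt')
     | 'ToUpper' >> beam.Map(lambda line: line.upper())
     | 'Write'   >> beam.io.WriteToText('gs://<your-bucket>/output'))
```

The same pipeline code can run locally with the DirectRunner or on the Dataflow service with the DataflowRunner; only the pipeline options change, not the processing logic.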
Prerequisites
GCP account
Open the Google Cloud Console.
Open the Navigation menu > BigQuery
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-451.png)
In the Query Editor, paste the query below:
SELECT
  content
FROM
  `fh-bigquery.github_extracts.contents_java_2016`
LIMIT
  10
Click Run.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-452.png)
It will display the results.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-453.png)
Now paste the query below, which counts all rows in the same table:
SELECT
  COUNT(*)
FROM
  `fh-bigquery.github_extracts.contents_java_2016`
Click Run.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-454.png)
The query result is displayed.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-455.png)
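The same count can also be obtained from code rather than the console. The sketch below uses the google-cloud-bigquery client library and assumes the library is installed and Application Default Credentials are configured.

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses Application Default Credentials

query = """
    SELECT COUNT(*) AS total
    FROM `fh-bigquery.github_extracts.contents_java_2016`
"""

# Submit the query and wait for the job to finish.
rows = client.query(query).result()
for row in rows:
    print(row.total)
```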
Click Activate Cloud Shell and run the command below to clone the lab repository:
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-456.png)
ls
![](https://prwatech.in/blog/wp-content/uploads/2021/05/2-1024x46.jpg)
Create a bucket in the console. Name the bucket the same as the project ID.
In Cloud Shell, execute the commands below, replacing <bucket-name> with the name of your bucket:
BUCKET="<bucket-name>"
echo $BUCKET
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-458.png)
cd training-data-analyst/courses/data_analysis/lab2/python
ls
The files will be displayed
![](https://prwatech.in/blog/wp-content/uploads/2021/05/9-1024x72.jpg)
nano JavaProjectsThatNeedHelp.py
![](https://prwatech.in/blog/wp-content/uploads/2021/05/6-1024x74.jpg)
This opens the file so you can review the pipeline code. Press Ctrl+X to exit nano.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-459.png)
python3 JavaProjectsThatNeedHelp.py --bucket $BUCKET --project $DEVSHELL_PROJECT_ID --DataFlowRunner
![](https://prwatech.in/blog/wp-content/uploads/2021/05/7.jpg)
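JavaProjectsThatNeedHelp.py parses the --bucket, --project, and --DataFlowRunner arguments and submits the pipeline to the Dataflow service; see the file in the cloned repository for the actual logic. As a rough sketch of the runner-selection pattern only (not the script's real contents; the region, job name, and transform are placeholder assumptions), it typically looks like this:

```python
import argparse
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

parser = argparse.ArgumentParser()
parser.add_argument('--bucket', required=True)
parser.add_argument('--project', required=True)
parser.add_argument('--DataFlowRunner', action='store_true')
args = parser.parse_args()

# Run on the Dataflow service if --DataFlowRunner is passed, otherwise locally.
options = PipelineOptions(
    runner='DataflowRunner' if args.DataFlowRunner else 'DirectRunner',
    project=args.project,
    temp_location='gs://{0}/staging/'.format(args.bucket),
    region='us-central1',         # assumed region
    job_name='javahelp-example',  # placeholder job name
)

with beam.Pipeline(options=options) as p:
    (p
     | 'Create' >> beam.Create(['hello dataflow'])
     | 'Print'  >> beam.Map(print))
```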
Go to Dataflow > Jobs. The submitted job will be listed as running.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-460.png)
Open the running job
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-461.png)
You can see the pipeline graph as the job runs.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-462.png)
After the run completes, the job status is shown as Succeeded.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-463.png)
After execution, go to the bucket in Cloud Storage.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-464.png)
Open the javahelp/ folder.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-465.png)
The pipeline output is stored there.
![](https://prwatech.in/blog/wp-content/uploads/2021/05/image-466.png)
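The output objects under javahelp/ can be inspected from the console as shown, or listed programmatically. Here is a small sketch using the google-cloud-storage client; the bucket name placeholder matches the one set earlier in Cloud Shell, and the library is assumed to be installed.

```python
from google.cloud import storage

BUCKET = '<bucket-name>'  # same bucket as set in Cloud Shell
client = storage.Client()

# List the output objects the pipeline wrote under javahelp/.
for blob in client.list_blobs(BUCKET, prefix='javahelp/'):
    print(blob.name, blob.size, 'bytes')
```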