GCP Data Fusion
(Update 12-10-2020) Ran the DataFusionQuickstart pipeline from the Data Fusion Hub. Make sure the Compute Engine default service account (compute@developer) has the following roles:
BigQuery Admin
Cloud Data Fusion Runner
Dataproc Worker
Service Account User
Storage Admin
The Data Fusion service account still needs the “Service Account User” role (same as below). The BigQuery and Storage roles are needed because the pipeline uses both services. When the run succeeds, the final message is “Pipeline ‘DataFusionQuickstart’ succeeded.”
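As a rough sketch, the role grants listed above could be scripted instead of clicked through in the console. The snippet below only builds the `gcloud projects add-iam-policy-binding` command strings; it does not run them. The role IDs are my mapping from the display names in this post, and `my-project` and `123456789012` are placeholder values, so check both against your own project before using anything it prints.

```python
# Sketch: build gcloud commands that grant the roles from this post to the
# Compute Engine default service account. Nothing is executed here.

# Assumption: these role IDs correspond to the display names listed in the post.
ROLES = {
    "BigQuery Admin": "roles/bigquery.admin",
    "Cloud Data Fusion Runner": "roles/datafusion.runner",
    "Dataproc Worker": "roles/dataproc.worker",
    "Service Account User": "roles/iam.serviceAccountUser",
    "Storage Admin": "roles/storage.admin",
}


def compute_default_sa(project_number: str) -> str:
    """Email of the Compute Engine default service account for a project number."""
    return f"{project_number}-compute@developer.gserviceaccount.com"


def grant_commands(project_id: str, project_number: str) -> list[str]:
    """Return one add-iam-policy-binding command per role, in the order above."""
    member = f"serviceAccount:{compute_default_sa(project_number)}"
    return [
        f"gcloud projects add-iam-policy-binding {project_id} "
        f"--member={member} --role={role}"
        for role in ROLES.values()
    ]


if __name__ == "__main__":
    # Placeholder project ID and project number; replace with your own.
    for cmd in grant_commands("my-project", "123456789012"):
        print(cmd)
```

Printing the commands first makes it easy to review them before pasting them into Cloud Shell.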
(Original 10-26-2020) Tried running a couple more preset pipelines from Google. They took a while to run (not sure why). More on permissions (IAM): the “Dataproc Worker” role needs to be added to the “Compute Engine default service account”. I also added “Service Account User” to the “Cloud Data Fusion Service Account / Cloud Data Fusion API Service Agent”.
A couple of tutorials
Permission issue (note that the exact error also depends on the network setup; for example, the Data Fusion service account needs network access to run the pipeline, and it needs the corresponding role where applicable).
Cost: the Developer edition of a Data Fusion instance costs 35 cents per hour. The Basic edition costs $1.80 per hour but comes with the first 120 hours free; that is 5 days of free usage, and it is the recommended option. GCP also has ways to set up budgets and alerts.
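To make the comparison concrete, here is a small back-of-the-envelope calculation using only the numbers quoted above (35 cents per hour for Developer, $1.80 per hour for Basic with the first 120 hours free). The function names are made up for illustration, and the rates are the ones from this post, so they may not match current pricing.

```python
# Sketch: compare instance cost for the two editions using the rates in this post.
# Amounts are kept in integer cents to avoid floating-point rounding surprises.

DEVELOPER_CENTS_PER_HOUR = 35
BASIC_CENTS_PER_HOUR = 180
BASIC_FREE_HOURS = 120  # 120 hours / 24 = 5 days of continuous use


def developer_cost(hours: int) -> float:
    """Cost in dollars of running a Developer instance for `hours` hours."""
    return DEVELOPER_CENTS_PER_HOUR * hours / 100


def basic_cost(hours: int) -> float:
    """Cost in dollars of running a Basic instance, after the free hours."""
    billable_hours = max(0, hours - BASIC_FREE_HOURS)
    return BASIC_CENTS_PER_HOUR * billable_hours / 100


if __name__ == "__main__":
    for hours in (24, 120, 200):
        print(hours, developer_cost(hours), basic_cost(hours))
```

With these rates, Basic costs nothing for up to 120 hours, while Developer costs $42 for the same 120 hours; past that point the higher Basic rate starts to add up.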