Cloud Dataflow

Welcome to the Google Cloud Dataflow idea forum. You can submit and vote on ideas here to tell the Google Cloud Dataflow team which features you’d like to see.

This forum is for feature suggestions. If you’re looking for help forums, look here:

We can’t wait to hear from you!

  1. Log GCE worker setup process and exception

    We had trouble running even a simple word-count job with the DataflowRunner.

    The job was stuck for one hour, but the log said nothing.

    ==========================================
    INFO:root:2017-08-28T07:26:07.709Z: JOBMESSAGEBASIC: (5296cce74062ca91): Starting 1 workers in asia-northeast1-c...
    INFO:root:2017-08-28T07:26:07.731Z: JOBMESSAGEDEBUG: (f4c12c649707e205): Value "group/Session" materialized.
    INFO:root:2017-08-28T07:26:07.745Z: JOBMESSAGEBASIC: (f4c12c649707e994): Executing operation read/Read+split+pairwithone+group/Reify+group/Write

    It stayed stuck for an hour, so we cancelled the job.

    The reason was that we needed to set the [network] and [subnetwork] options explicitly and correctly.
    After that, the job worked.

    It would be very helpful to see what is happening, or what is blocking the job, while the workers are being set up.…
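
    For reference, a launch command that sets the two options explicitly might look like this (project, bucket, and network names are placeholders, and flag spellings should be checked against your SDK version):

```shell
python wordcount.py \
  --runner DataflowRunner \
  --project my-project \
  --region asia-northeast1 \
  --temp_location gs://my-bucket/tmp \
  --network my-network \
  --subnetwork regions/asia-northeast1/subnetworks/my-subnet
```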

    6 votes  ·  0 comments
  2. Python 3x support

    Python 3.x support is overdue. Python 3.6+ is now very mature and brings serious speed improvements over 2.7.

    188 votes  ·  1 comment
  3. Update the screenshots to show the current console.

    The screenshots are old and outdated, so they cannot be used for troubleshooting. Permissions, for example, is no longer on the left.

    The error log messages for permission failures do not say which account is having access issues, so I have spent hours trying different random combinations and still have not gotten even the most basic Dataflow examples to work.

    1 vote  ·  0 comments
  4. Iterative parallel processing

    Some more complex flows require one transformation to be applied repeatedly until a certain condition is met (for example, traversing a tree). Currently Dataflow does not allow doing that in a parallel way.

    3 votes  ·  0 comments
  5. Javascript

    It would be great to add JavaScript as a compute language, in addition to Python and Java.

    2 votes  ·  0 comments
  6. Pin products on console drawer by project

    As of today, when you pin a product (e.g. Google BigQuery) on the console drawer, the product is pinned independently of the selected Google project. Since different projects may use different sets of products, it would be nice if pinned products were scoped by project.

    2 votes  ·  0 comments
  7. Bug: Nullpointer when reading from BigQuery

    I believe I'm experiencing a bug in BigQuerySource for Apache Beam when running on Google Dataflow. I described it in detail on Stack Overflow: https://stackoverflow.com/questions/44718323/apache-beam-with-dataflow-nullpointer-when-reading-from-bigquery/44755305#44755305

    Nobody seems to be able to respond to it, so I'm posting it here as a potential bug report.

    1 vote  ·  0 comments
  8. autodetect

    Enable --autodetect for BigQuery loads, consistent with bq load --autodetect on the command line
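
    The existing CLI behavior the request wants mirrored (dataset, table, and bucket names are placeholders):

```shell
bq load --autodetect mydataset.mytable gs://my-bucket/data.json
```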

    1 vote  ·  0 comments
  9. 3 votes  ·  0 comments
  10. Ability to run dataflow pipeline from deployed flex python app-engine service without gcloud

    I have defined a batch Python pipeline inside a flex App Engine Python service.

    Once deployed on GCP, the pipeline cannot be compiled and started.
    The current workaround is to install the gcloud SDK in the service's Dockerfile.

    It would be great not to have to install the gcloud SDK on the deployed service, and to have documentation on best practices for running a Python pipeline from a deployed service.

    3 votes  ·  1 comment
  11. Patch Apache Beam's XmlSource to allow templating in Dataflow

    Background

    In our project, a Cloud Function is used to start a Dataflow pipeline in batch mode that uploads data to Elasticsearch. The source for the Dataflow is an XML file.

    The Dataflow template is used to upload the data into GCP.

    Problem

    Templated pipelines require their option parameters to accept values at runtime. This is implemented by wrapping the option type in the ValueProvider interface.

    The class for reading XML sources, XmlSource, did not use ValueProvider for its option parameters; we solved this by patching XmlSource and applying these changes to the class.

    Upload of dataflow template should be…
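
    For context, the ValueProvider pattern defers reading an option's value until the pipeline actually runs, which is what makes templating possible. A stdlib-only sketch of the idea (these are illustrative classes, not the actual Beam implementations):

```python
class StaticValueProvider:
    """Value known when the pipeline is constructed (non-templated runs)."""
    def __init__(self, value):
        self._value = value

    def get(self):
        return self._value


class RuntimeValueProvider:
    """Value supplied only when the template is launched."""
    def __init__(self, option_name, runtime_options):
        self._name = option_name
        self._options = runtime_options  # populated at launch time

    def get(self):
        if self._name not in self._options:
            raise RuntimeError('Option %s was not provided at runtime' % self._name)
        return self._options[self._name]


# A source whose constructor accepts a ValueProvider can defer reading the
# value until run time; one that demands a plain string (as XmlSource did)
# forces the value to exist at template-creation time.
```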

    32 votes  ·  0 comments
  12. Show avg/median per-element processing time in monitoring interface

    When selecting a transform in the Dataflow monitoring interface, you can currently see the number of elements that have been processed as well as the total execution time.

    It would be nice to be able to see the per-element processing time: either a simple average or better yet, a histogram. This would allow much easier performance monitoring.
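
    The simple average mentioned above can already be computed by hand from the two numbers the transform view exposes; the ask is for the UI to do this division (and ideally keep a histogram). For example:

```python
def per_element_ms(total_exec_secs, element_count):
    """Average per-element processing time in milliseconds, derived from
    the two figures shown today: total execution time and element count."""
    if element_count == 0:
        return 0.0
    return total_exec_secs * 1000.0 / element_count

# e.g. a transform that spent 120 s processing 60,000 elements
# averaged 2 ms per element.
```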

    3 votes  ·  0 comments
  13. Example Code is Maddeningly Incomplete

    The example code provided is maddeningly incomplete. The biggest issue is that the complete code for the default templates is not provided. I want to create a slight variation of the Pub/Sub -> BigQuery example template, but I can't find the code for that template anywhere. It would be nice if that code were available so that I could base a custom Dataflow job on it; that would provide a known working example of exactly what I want to build on.

    30 votes  ·  2 comments
  14. Ability to Downscale Google Provided Templates

    We have a job for which the basic templated Dataflow job works fairly well, but so far we cannot see a way to make it use fewer machines. Our data ingestion is large and growing, but not yet extremely large. The three 4-vCPU, 15 GB RAM machines that are started to process our volume of data are serious overkill. I do not see any way to use these basic templates while also setting the max_workers option.
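
    A worker-sizing flag may in fact be accepted when launching a Google-provided template; something along these lines (the template path and parameters are placeholders for the Pub/Sub-to-BigQuery case, and flag availability should be verified against your gcloud version):

```shell
gcloud dataflow jobs run my-job \
  --gcs-location gs://dataflow-templates/latest/PubSub_to_BigQuery \
  --max-workers=1 \
  --parameters inputTopic=projects/my-project/topics/my-topic,outputTableSpec=my-project:dataset.table
```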

    1 vote  ·  1 comment
  15. Show Total Memory and CPU Usage Alongside Worker Graph

    It would be amazing to be able to see memory and CPU consumption directly, in addition to the worker graph. This would make it much easier to correlate and debug different stages, and would also give some insight into which machine types perform best. It is possible to do this now by creating metrics in Stackdriver, but it's very involved, especially when a super simple graph would do the trick... just like in Kubernetes/Google Container Engine.

    28 votes  ·  1 comment
  16. Ability to load data to BigQuery Partitions from dataflow pipelines (Python)

    Python Dataflow pipelines fail in the parse_table_reference function when you specify a BigQuery table name with a partition decorator for loading. This is a very important capability if you want to leverage BigQuery table partitioning.
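
    To illustrate the failure mode: the partition decorator adds a `$YYYYMMDD` suffix that a table-reference parser must tolerate. A pure-Python sketch of a decorator-aware parser (illustrative only, not Beam's actual implementation):

```python
import re

# Matches [project:]dataset.table[$YYYYMMDD]; the optional trailing
# partition decorator is the part a decorator-unaware parser rejects.
_TABLE_RE = re.compile(
    r'^(?:(?P<project>[\w\-.]+):)?'
    r'(?P<dataset>\w+)\.(?P<table>\w+)'
    r'(?:\$(?P<partition>\d{8}))?$')

def parse_table_spec(spec):
    m = _TABLE_RE.match(spec)
    if m is None:
        raise ValueError('Invalid table reference: %r' % spec)
    return m.groupdict()
```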

    28 votes  ·  0 comments
  17. Ability to use our own kubernetes cluster as dataflow runner

    It would be good to be able to use our own container cluster for running Dataflow workers, since Dataflow already uses Kubernetes to deploy workers. This could even take into account the user-supplied cluster's current workload and balance workers between the user-provided cluster and the Dataflow-managed cluster.

    13 votes  ·  0 comments
  18. Show cost of current job according to the new pricing structure.

    It would be good to see the cost of the job in the job view. I even thought of writing a Chrome extension for this, since it's pretty trivial given the vCPU-sec, RAM MB-sec, PD MB-sec, etc. figures.
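
    A back-of-envelope version of that calculation, with placeholder rates (the real prices come from the Dataflow pricing page and change over time):

```python
# Placeholder per-unit rates, NOT real Dataflow prices.
RATES = {
    'vcpu_sec':   1.1e-5,   # $ per vCPU-second
    'ram_mb_sec': 3.0e-9,   # $ per MB of RAM per second
    'pd_mb_sec':  1.0e-10,  # $ per MB of persistent disk per second
}

def job_cost(usage):
    """usage: dict mapping each resource in RATES to its consumed units."""
    return sum(usage[resource] * rate for resource, rate in RATES.items())
```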

    10 votes  ·  0 comments
  19. Customizable Columns in Overview Page

    Ability to show total worker time, maximum number of workers, and zone information on the overview page. This should be customizable, similar to what we see on the App Engine versions page.

    1 vote  ·  0 comments
  20. Labeling Dataflow jobs

    We should be able to assign labels to Dataflow jobs and filter by labels on the overview page.

    21 votes  ·  1 comment
