Cloud Dataflow

Welcome to the Google Cloud Dataflow idea forum. You can submit and vote on ideas here to tell the Google Cloud Dataflow team which features you’d like to see.

This forum is for feature suggestions. If you’re looking for help forums, look here:

We can’t wait to hear from you!

  1. Show cost of current job according to the new pricing structure.

    It would be good to see the cost of the job in the job view. I even thought of writing a Chrome extension for this, since it's fairly easy to compute from the vCPU seconds, RAM MB seconds, PD MB seconds, etc. (a rough sketch of the calculation is below).

    10 votes  ·  0 comments
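
    A rough sketch of the calculation such an extension could do, assuming the resource metrics shown on the job page and purely illustrative rates (the real rates come from the Dataflow pricing page):

        # Illustrative cost estimate from the job's resource metrics.
        # The rates below are placeholders; substitute the current Dataflow prices.
        VCPU_RATE = 0.000010          # $ per vCPU second (placeholder)
        RAM_RATE = 0.0000000035       # $ per RAM MB second (placeholder)
        PD_RATE = 0.00000000005      # $ per PD MB second (placeholder)

        def estimate_cost(vcpu_sec, ram_mb_sec, pd_mb_sec):
            return (vcpu_sec * VCPU_RATE
                    + ram_mb_sec * RAM_RATE
                    + pd_mb_sec * PD_RATE)

        print(estimate_cost(vcpu_sec=36000, ram_mb_sec=135000000, pd_mb_sec=900000000))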
  2. Deployment Manager template for Dataflow

    Please provide a GDM (Deployment Manager) template to create, read, update, and delete Dataflow templates (see the sketch below). Currently there is no Deployment Manager support for Dataflow. Since Dataflow uses GCE and GCS under the hood, rich GDM support could be created to add tremendous value. GDM would allow mass/concurrent deployment of Dataflow templates.

    9 votes  ·  1 comment
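
    A Deployment Manager template for this might look like the Python sketch below; note the 'dataflow.v1b3.template' resource type is hypothetical, since no Dataflow type exists in Deployment Manager today (which is the point of this request):

        # Hypothetical Deployment Manager Python template for launching a Dataflow
        # template; the resource type name is made up, since no such type exists yet.
        def generate_config(context):
            return {
                'resources': [{
                    'name': context.env['name'],
                    'type': 'dataflow.v1b3.template',  # hypothetical type
                    'properties': {
                        'gcsPath': context.properties['templatePath'],
                        'location': context.properties['region'],
                        'parameters': context.properties.get('parameters', {}),
                    },
                }]
            }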
  3. Scala-friendly API / library

    The current darkjh/scalaflow library is pretty basic, and its DoFn handling is pretty messy. It would be nice if Scala were natively supported.

    8 votes  ·  1 comment
  4. Wait.on() PDone instead of PCollection

    I would like to be able to publish an event to Pub/Sub after writing data to BigQuery. The Wait.on() transform is intended for exactly this situation; however, Wait.on() requires a PCollection as the input to wait on, while BigQueryIO returns a PDone. As such, I would like to be able to apply Wait.on() to a PDone before applying a transform.

    7 votes  ·  0 comments
  5. Add the ability to sort jobs by status (e.g. running vs closed)

    I would like to be able to quickly see the number of jobs that are currently running. Sometimes streaming jobs that have been running for weeks get buried below batch or testing jobs.

    7 votes  ·  0 comments
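
    As a stopgap for quickly counting running jobs, the jobs API can filter by state; a minimal sketch using the Google API Python client (the project, region, and 'ACTIVE' filter value are assumptions worth checking against the v1b3 reference):

        from googleapiclient.discovery import build

        # List only the jobs that are currently running in one project/region.
        dataflow = build('dataflow', 'v1b3')
        resp = dataflow.projects().locations().jobs().list(
            projectId='my-project', location='us-central1', filter='ACTIVE').execute()
        for job in resp.get('jobs', []):
            print(job['name'], job['currentState'])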
  6. Improve cost effectiveness of autoscaling algorithm

    I'm currently processing log data from multiple days with Cloud Dataflow. According to the defined options it uses 10 to 100 workers and the throughput-based autoscaling algorithm. At the moment there are still 64 workers active, while only one job is still running at around 1,500 elements per second. If you look at the CPU graph of the workers, you can see that almost all of them have been idle for the last 30 minutes. I would prefer a more worry-free autoscaling behavior, where I know I always get optimal cost effectiveness.

    7 votes  ·  1 comment

    Hi Max,

    We’ve done a few performance optimizations lately that should result in a much improved experience. Could you share a jobID for us to take a look at? (I’m curious to examine the experience you describe).

    Thanks!

  7. API for Go

    It would be nice if Go were natively supported.

    6 votes  ·  0 comments
  8. Is there a plan for a Go SDK for Dataflow? It seems Apache Beam has a Go SDK now.

    I want to build a pipeline using the Go SDK. Java is slow for my application, and I don't want to waste money on renting more CPUs to pay for the slowness of Java/Python. Now that Apache Beam supports a Go SDK, is there a plan to support it in Dataflow?

    6 votes
    Vote
    Sign in
    (thinking…)
    Sign in with: Facebook Google
    Signed in as (Sign out)
    You have left! (?) (thinking…)
    2 comments  ·  Flag idea as inappropriate…  ·  Admin →
  9. Log the GCE worker setup process and exceptions

    We had trouble running even a simple word count job with the DataflowRunner.

    The job was stuck for an hour, but the log said nothing.

    ==========================================
    INFO:root:2017-08-28T07:26:07.709Z: JOB_MESSAGE_BASIC: (5296cce74062ca91): Starting 1 workers in asia-northeast1-c...
    INFO:root:2017-08-28T07:26:07.731Z: JOB_MESSAGE_DEBUG: (f4c12c649707e205): Value "group/Session" materialized.
    INFO:root:2017-08-28T07:26:07.745Z: JOB_MESSAGE_BASIC: (f4c12c649707e994): Executing operation read/Read+split+pair_with_one+group/Reify+group/Write

    Stuck for an hour, so we cancelled it.

    The reason was that we needed to set the [network] and [subnetwork] options explicitly and correctly (see the sketch below).
    After that, the job ran fine.

    It would be very good to know what is happening, or what is blocking the job, while the workers are being set up…

    6 votes  ·  0 comments
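
    For reference, the options that unblocked the job can be passed like this (a minimal sketch for the Python SDK; the project, bucket, and network names are placeholders):

        import apache_beam as beam
        from apache_beam.options.pipeline_options import PipelineOptions

        # Explicit network/subnetwork so workers can start in a non-default VPC.
        options = PipelineOptions(
            runner='DataflowRunner',
            project='my-project',
            region='asia-northeast1',
            temp_location='gs://my-bucket/temp',
            network='my-network',
            subnetwork='regions/asia-northeast1/subnetworks/my-subnet',
        )

        with beam.Pipeline(options=options) as p:
            p | beam.Create(['hello', 'world']) | beam.Map(print)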
  10. Pub/Sub to Cloud SQL template

    There are lots of useful templates. One that would be useful to me is Pub/Sub to Cloud SQL (a rough sketch of the equivalent pipeline is below).

    5 votes  ·  0 comments
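
    Until such a template exists, this is roughly the hand-rolled streaming pipeline it would replace (a minimal sketch; the connection details and table are placeholders, and pymysql is just one possible client):

        import json
        import apache_beam as beam
        from apache_beam.options.pipeline_options import PipelineOptions

        class WriteToCloudSQL(beam.DoFn):
            """Writes each Pub/Sub message into a Cloud SQL (MySQL) table."""
            def setup(self):
                import pymysql  # assumes the instance is reachable, e.g. via private IP
                self.conn = pymysql.connect(host='10.0.0.3', user='app',
                                            password='secret', database='events')

            def process(self, message):
                record = json.loads(message.decode('utf-8'))
                with self.conn.cursor() as cur:
                    cur.execute('INSERT INTO events (payload) VALUES (%s)',
                                (json.dumps(record),))
                self.conn.commit()

        with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
            (p
             | beam.io.ReadFromPubSub(
                   subscription='projects/my-project/subscriptions/my-sub')
             | beam.ParDo(WriteToCloudSQL()))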
  11. Cannot specify location when starting a dataflow job from REST API

    I'm using a Dataflow template to export data from Bigtable. Using the command-line API, I'm able to specify a region to run the job (europe-west1), but when it comes to the REST API, I can't specify any region except us-central1 (see the sketch below). The error is:

    "The workflow could not be created, since it was sent to an invalid regional endpoint (europe-west1). Please resubmit to a valid Cloud Dataflow regional endpoint."

    4 votes  ·  1 comment
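
    For what it's worth, the regional projects.locations.templates.launch method does accept a location; a minimal sketch with the Google API Python client (the template path and job name are illustrative):

        from googleapiclient.discovery import build

        dataflow = build('dataflow', 'v1b3')
        # Launch the Bigtable export template in europe-west1 via the regional method.
        response = dataflow.projects().locations().templates().launch(
            projectId='my-project',
            location='europe-west1',
            gcsPath='gs://dataflow-templates/latest/Cloud_Bigtable_to_GCS_SequenceFile',
            body={
                'jobName': 'bigtable-export',
                'parameters': {},  # template-specific parameters go here
            },
        ).execute()
        print(response['job']['id'])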
  12. Is there a way to get notified whenever there is a new Dataflow release?

    Can we subscribe to an email list and get notified whenever there is a new Dataflow release?

    4 votes  ·  0 comments
  13. Automated FTP

    I would like a way to create a Read transform that can be scheduled to upload an FTP payload to Google Cloud Storage for further processing (a rough sketch of the transfer is below).

    4 votes  ·  0 comments
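
    A minimal sketch of the transfer such a transform would wrap, using ftplib and the google-cloud-storage client (host, credentials, and paths are placeholders); the scheduling part would still need something like cron or Cloud Scheduler:

        import ftplib
        from google.cloud import storage

        def ftp_to_gcs(ftp_host, ftp_user, ftp_password, remote_path,
                       bucket_name, blob_name):
            """Downloads one file from an FTP server and uploads it to GCS."""
            local_tmp = '/tmp/ftp_payload'
            with ftplib.FTP(ftp_host, ftp_user, ftp_password) as ftp, \
                 open(local_tmp, 'wb') as f:
                ftp.retrbinary('RETR ' + remote_path, f.write)
            storage.Client().bucket(bucket_name).blob(blob_name) \
                .upload_from_filename(local_tmp)

        ftp_to_gcs('ftp.example.com', 'user', 'password',
                   'exports/data.csv', 'my-bucket', 'ftp/data.csv')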
  14. Stop a streaming pipeline when idle to save costs

    Reading from Pub/Sub for a limited amount of time was possible in SDK 1.91 with maxReadTime(Duration), but in 2.x we no longer have that option.

    I know that using Pub/Sub as a bounded collection is bad design, but sometimes we have scenarios in which only a few messages are pushed once a day, and it is impractical to leave a streaming job running in those cases.

    Like Dataproc does with scheduled deletion (https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/scheduled-deletion), it would be nice to have something similar in Dataflow to save money.

    4 votes  ·  0 comments
  15. Ability to Downscale Google Provided Templates

    We have a job for which the basic templated Dataflow job works fairly well, but so far we cannot see a way to make it use fewer machines. Our data ingestion is large and growing, but not yet extremely large. The three 4-vCPU, 15 GB RAM machines that are started to process our volume of data are very much overkill. I do not see any way to use these basic templates while also setting the max_workers setting (see the sketch below).

    4 votes  ·  1 comment
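
    When launching a template through the API (or gcloud dataflow jobs run), the runtime environment does accept worker sizing, in case that helps; a minimal sketch with the Google API Python client (template path and values are illustrative):

        from googleapiclient.discovery import build

        dataflow = build('dataflow', 'v1b3')
        # Launch a Google-provided template with fewer, smaller workers.
        dataflow.projects().locations().templates().launch(
            projectId='my-project',
            location='us-central1',
            gcsPath='gs://dataflow-templates/latest/PubSub_to_BigQuery',
            body={
                'jobName': 'small-ingest',
                'parameters': {},  # template-specific parameters
                'environment': {
                    'machineType': 'n1-standard-1',
                    'maxWorkers': 1,
                },
            },
        ).execute()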
  16. Set Proxy Host

    When trying to run a Dataflow driver program behind a firewall, it needs to use a proxy to connect to GCP, but there does not seem to be a way to specify that HTTPS traffic should go through a proxy.

    4 votes  ·  0 comments
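
    If the driver program happens to be Python, the underlying HTTP libraries generally honor the standard proxy environment variables, which may work as a stopgap (the proxy address is a placeholder); a first-class pipeline option would still be welcome:

        import os

        # Route the driver's HTTPS traffic (API calls, staging uploads) through a
        # proxy. Set these before the pipeline/client libraries make any requests.
        os.environ['HTTPS_PROXY'] = 'http://proxy.internal.example.com:3128'
        os.environ['NO_PROXY'] = 'localhost,127.0.0.1'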
  17. Link directly to logs

    It would be nice to have a direct link to the logs of the job from the overview page.

    At the moment, you have to:

    - Click the job
    - Wait for it to load
    - Click "Logs"
    - Click "Worker Logs"
    - Wait for it to load

    It should be just one click to get the logs :-)

    4 votes  ·  0 comments
    planned  ·  Rafael Fernandez responded

    Thanks for the feedback!

    We are working on improving this experience along these lines. We'll be happy to discuss more details the next time we sync.

  18. Add a more helpful error message to "projects.jobs.create"

    I'm trying to launch a batch pipeline from outside the project and the "projects.jobs.create" API is returning:

    {
      "error": {
        "code": 400,
        "message": "Request contains an invalid argument.",
        "status": "INVALID_ARGUMENT"
      }
    }

    With no indication of which argument is invalid. This wouldn't be such a big deal, except that the documentation for this method (https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.jobs/create?authuser=1) does not indicate which fields are required. :(

    3 votes  ·  0 comments
  19. Iterative parallel processing

    Some more complex flows require one transformation to be applied multiple times until a certain condition is met (for example, traversing a tree). Currently Dataflow does not allow doing that in a parallel way (the closest workaround is sketched below).

    3 votes  ·  0 comments
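
    Today the closest workaround is unrolling a fixed number of iterations at pipeline-construction time, which only works when the iteration count is known up front (process_level is a placeholder); data-dependent termination is exactly what is missing:

        import apache_beam as beam

        def process_level(element):
            # Placeholder for one round of the transformation
            # (e.g. expanding one level of tree nodes).
            return element

        MAX_DEPTH = 5  # must be known when the pipeline is built

        with beam.Pipeline() as p:
            pcoll = p | beam.Create([{'node': 'root', 'depth': 0}])
            for i in range(MAX_DEPTH):
                pcoll = pcoll | 'level_%d' % i >> beam.Map(process_level)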
  20. Meaningful information about a step's output collections in the UI

    In the UI, when clicking on a step, showing the output collections' tag names or OutputTag IDs when available, instead of "out" + index, would be more meaningful (see the example below of how tags are already named in the SDK).

    3 votes  ·  0 comments
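
    For context, a minimal Python example of named output tags, which the UI could surface instead of "out" + index:

        import apache_beam as beam
        from apache_beam import pvalue

        class SplitEvenOdd(beam.DoFn):
            def process(self, n):
                if n % 2 == 0:
                    yield n                              # main output ('even')
                else:
                    yield pvalue.TaggedOutput('odd', n)  # named additional output

        with beam.Pipeline() as p:
            results = (p
                       | beam.Create([1, 2, 3, 4])
                       | beam.ParDo(SplitEvenOdd()).with_outputs('odd', main='even'))
            results.even | 'print_even' >> beam.Map(print)
            results.odd | 'print_odd' >> beam.Map(print)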
