Cloud Dataflow

Welcome to the Google Cloud Dataflow idea forum. You can submit and vote on ideas here to tell the Google Cloud Dataflow team which features you’d like to see.

This forum is for feature suggestions. If you’re looking for help with Dataflow, please use the support forums instead.

We can’t wait to hear from you!

  1. Codegen

    I would like to take a valid job and generate at least some of the code I would need to re-create it in Python using the client API. I want to be able to do this for my historic jobs as well.
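
    As a starting point, the raw material such a code generator would need is already exposed by the Dataflow REST API. The sketch below is only an illustration (not an existing feature), using google-api-python-client; the project, region, and job ID are placeholders.

    from googleapiclient.discovery import build

    # Placeholders: project, region, and ID of the historic job to inspect.
    dataflow = build("dataflow", "v1b3")
    job = dataflow.projects().locations().jobs().get(
        projectId="my-project",
        location="us-central1",
        jobId="2018-06-01_00_00_00-1234567890123456789",
        view="JOB_VIEW_ALL",  # include steps and options, not just a summary
    ).execute()

    # The environment block carries the SDK pipeline options the job ran with,
    # which is most of what generated Python code would need to reproduce it.
    print(job["name"], job.get("environment", {}).get("sdkPipelineOptions"))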

    1 vote · 0 comments
  2. BigQuery to PubSub Dataflow Template

    I need a BigQuery to PubSub template in Dataflow. This would allow us to create a streaming job that exports BigQuery Gmail logs into our SIEM through our Pub/Sub HTTPS endpoint.
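
    A minimal (batch) sketch of the core transforms such a template could wrap, using the Beam Python SDK; the query, topic, and project names below are placeholders, and Pub/Sub I/O support outside streaming pipelines varies by runner and SDK version.

    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    with beam.Pipeline(options=PipelineOptions()) as p:
        (p
         # Placeholder query over the Gmail log export tables.
         | "ReadFromBigQuery" >> beam.io.Read(beam.io.BigQuerySource(
               query="SELECT * FROM `my-project.gmail_logs.daily`",
               use_standard_sql=True))
         # Serialize each row dict as JSON bytes; default=str copes with dates.
         | "ToJsonBytes" >> beam.Map(
               lambda row: json.dumps(row, default=str).encode("utf-8"))
         # Placeholder topic that the SIEM's HTTPS push endpoint subscribes to.
         | "PublishToPubSub" >> beam.io.WriteToPubSub(
               topic="projects/my-project/topics/siem-export"))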

    61 votes · 2 comments
  3. Cannot specify location when starting a Dataflow job from the REST API

    I'm using a Dataflow template to export data from Bigtable. Using the command-line API, I'm able to specify a region to run the job (europe-west1). But when it comes to the REST API, I can't specify any region except us-central1. The error is:

    The workflow could not be created, since it was sent to an invalid regional endpoint (europe-west1). Please resubmit to a valid Cloud Dataflow regional endpoint.
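
    For reference, this is what I would expect to work: the regional form of the templates.launch call. A rough sketch with google-auth and requests (project, bucket, and template paths are placeholders):

    import google.auth
    from google.auth.transport.requests import AuthorizedSession

    credentials, project = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"])
    session = AuthorizedSession(credentials)

    location = "europe-west1"  # the region the job should run in
    url = ("https://dataflow.googleapis.com/v1b3/projects/%s"
           "/locations/%s/templates:launch" % (project, location))
    body = {
        "jobName": "bigtable-export",
        "parameters": {},  # template-specific parameters go here
        "environment": {"tempLocation": "gs://my-bucket/temp"},
    }
    response = session.post(
        url, json=body,
        params={"gcsPath": "gs://my-bucket/templates/bigtable-to-gcs"})
    print(response.status_code, response.json())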

    1 vote · 1 comment
  4. Pub/Sub to Cloud SQL template

    There are lots of useful templates. One that would be useful to me is Pub/Sub to Cloud SQL.
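
    In the meantime, a hand-rolled version is possible; the sketch below is only an illustration, assuming the Beam Python SDK, a MySQL Cloud SQL instance reachable through the Cloud SQL proxy, and placeholder connection details, topic, and table names.

    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    import pymysql  # assumed DB-API driver; any client for your instance works

    class WriteToCloudSql(beam.DoFn):
        def start_bundle(self):
            # One connection per bundle keeps connection churn manageable.
            self.conn = pymysql.connect(host="127.0.0.1", user="dataflow",
                                        password="secret", database="events")

        def process(self, message):
            record = json.loads(message.decode("utf-8"))
            with self.conn.cursor() as cursor:
                cursor.execute(
                    "INSERT INTO events (id, payload) VALUES (%s, %s)",
                    (record["id"], json.dumps(record)))
            self.conn.commit()

        def finish_bundle(self):
            self.conn.close()

    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (p
         | beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
         | beam.ParDo(WriteToCloudSql()))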

    4 votes · 0 comments
  5. Button: "copy dataflow job"

    I would like to be able to copy a Dataflow job so that I can tweak the parameters and run it again without having to enter them all in manually.

    1 vote · 0 comments
  6. Cannot specify diskSizeGb when launching a template

    When I create a template, it's possible to specify --diskSizeGb; but if I don't specify it at template creation time, it's not possible to pass it as a parameter when launching the template.
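
    For context, the only workaround I know of is to bake the disk size into the template when it is created, since (per this report) the launch-time environment does not expose it. A rough example of template creation with the Python SDK, where the paths and values are placeholders and the flag is --disk_size_gb in Python (--diskSizeGb in Java):

    python my_pipeline.py \
        --runner=DataflowRunner \
        --project=my-project \
        --temp_location=gs://my-bucket/temp \
        --template_location=gs://my-bucket/templates/my_template \
        --disk_size_gb=100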

    1 vote · 0 comments
  7. Delete or cancel a job using the console

    Allow deleting or cancelling a job from the console.

    13 votes · 2 comments
  8. Python 3.x support

    Python 3.x support is overdue. Python 3.6+ is now very mature and adds some serious speed improvements over 2.7.

    183 votes · 1 comment
  9. 75 votes · 1 comment
  10. 24 votes · 0 comments
  11. Bug: Apache Beam Dataflow runner throwing setup error

    Hi,
    We are building a data pipeline using the Beam Python SDK and trying to run it on Dataflow, but we are getting the error below:

    A setup error was detected in beamapp-xxxxyyyy-0322102737-03220329-8a74-harness-lm6v. Please refer to the worker-startup log for detailed information.

    However, we could not find detailed worker-startup logs.

    We tried increasing the memory size, worker count, and so on, but we still get the same error.

    Here is the command we use:
    python run.py \
        --project=xyz \
        --runner=DataflowRunner \
        --staging_location=gs://xyz/staging \
        --temp_location=gs://xyz/temp \
        --requirements_file=requirements.txt \
        --worker_machine_type=n1-standard-8 \
        --num_workers=2

    Pipeline snippet:

    data = pipeline | "load data" >> beam.io.Read(
        beam.io.BigQuerySource(query="SELECT * FROM abc_table LIMIT 100"))

    data…

    16 votes · 0 comments
  12. Wait.on() PDone instead of PCollection

    I would like to be able to publish an event to Pub/Sub after writing data to BigQuery. The Wait.on() transform is intended for exactly this situation; however, Wait.on() requires a PCollection as the input to wait on, while BigQueryIO returns a PDone. As such, I would like to be able to use Wait.on() with a PDone before applying a transform.

    4 votes · 0 comments
  13. Dataprep / Dataflow jobs region

    TL;DR: Dataprep creates its jobs in the US only. This is probably a bug.

    I'm trying to prep data from BQ (EU) to BQ (EU).

    But Dataprep creates the Dataflow job in the US, and because of that I get an error:

    Cannot read and write in different locations: source: EU, destination: US

    15 votes · 2 comments
  14. Dataprep: cause a job to fail if there is an error in the recipe

    TL;DR: I want to be notified when a recipe is no longer valid due to errors in the data source or recipe, which can happen when a source table changes or for any other reason, and to see that the transformation has failed.

    Example scenario: I have a dataset (e.g. a Salesforce table) that is extracted on a daily basis and saved in GCS. A scheduled Dataprep job is invoked every day on the new table and transforms it with a pre-defined recipe. When a column is added to or removed from the data source, the recipe might…

    10 votes · 0 comments
  15. PubSub to BigQuery template should allow Avro and/or Protobuf

    BigQuery already supports loading files in Avro, so why not streaming them in from Pub/Sub? This seems like an obvious feature to have, but I don't see any way to do it currently. The PubSub to BigQuery template is great, but it would be so much better with this one feature turned on.
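
    To illustrate what Avro support might look like (this is only a sketch, not part of any existing template), a hand-written pipeline could decode schemaless Avro records with fastavro before writing to BigQuery; the schema, subscription, and table names below are placeholders.

    import io
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from fastavro import parse_schema, schemaless_reader

    # Placeholder Avro schema for the records carried in each Pub/Sub message.
    SCHEMA = parse_schema({
        "type": "record", "name": "Event",
        "fields": [{"name": "id", "type": "string"},
                   {"name": "value", "type": "double"}],
    })

    def decode_avro(message_bytes):
        # Each message is assumed to hold exactly one schemaless Avro record.
        return schemaless_reader(io.BytesIO(message_bytes), SCHEMA)

    with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
        (p
         | beam.io.ReadFromPubSub(
               subscription="projects/my-project/subscriptions/events-sub")
         | beam.Map(decode_avro)
         | beam.io.WriteToBigQuery(
               "my-project:events.decoded",
               schema="id:STRING,value:FLOAT",
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))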

    11 votes · 0 comments
  16. Provide more frequent Python updates

    The Python SDK is not up to date with the various cloud SDKs; the last update was in September…

    14 votes · 0 comments
  17. Stop a streaming pipeline when idle to save costs

    Reading from Pub/Sub for a fixed amount of time was possible in SDK 1.9.1 with maxReadTime(Duration), but in 2.x we no longer have that option.

    I know that using Pub/Sub as a bounded collection is bad design, but we sometimes have scenarios where only a few messages are pushed once a day, and leaving a streaming job running in those cases is impractical.

    Like Dataproc does with scheduled cluster deletion (https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/scheduled-deletion), it would be nice to have something similar in Dataflow to save money.
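
    Until something built-in exists, one workaround is to drain the job from a scheduled task through the Dataflow REST API. A rough sketch with google-api-python-client, where the project, region, and job ID are placeholders:

    from googleapiclient.discovery import build

    dataflow = build("dataflow", "v1b3")
    dataflow.projects().locations().jobs().update(
        projectId="my-project",
        location="us-central1",
        jobId="2018-06-01_00_00_00-1234567890123456789",
        # Draining lets in-flight messages finish before workers are torn down.
        body={"requestedState": "JOB_STATE_DRAINED"},
    ).execute()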

    4 votes · 0 comments
  18. Example Code is Maddeningly Incomplete

    The example code provided is maddeningly incomplete. The biggest issue I have is that the complete code for the default templates is not provided. I want to create a slight variation of the Pub/Sub -> BigQuery example template, but I can't find the code for that template anywhere. It would be nice if that code were available so that I could base a custom Dataflow job on it; that would provide a known working example of exactly what I want to build on.

    30 votes · 2 comments
  19. Show Total Memory and CPU Usage Alongside Worker Graph

    It would be amazing to see memory and CPU consumption directly alongside the worker graph. This would make it much easier to correlate and debug different stages, and would also give some insight into which machine types perform best. It is possible to do this today by creating metrics in Stackdriver, but that is quite involved when a super simple graph would do the trick, just like in Kubernetes / Google Container Engine.

    28 votes · 1 comment
  20. Labeling Dataflow jobs

    We should be able to assign labels to Dataflow jobs and filter by labels on the overview page.

    13 votes · 0 comments