Cloud Dataflow

Welcome to the Google Cloud Dataflow idea forum. You can submit and vote on ideas here to tell the Google Cloud Dataflow team which features you’d like to see.

This forum is for feature suggestions; for help and troubleshooting, please use the support forums instead.

We can’t wait to hear from you!

  1. Provide an example Gradle project

    I would love an example Gradle project. Gradle is very popular in the Java community, so it is odd that the documentation only covers using Dataflow with Maven.

    3 votes · 0 comments
  2. Beam SDK for Python: support for Triggers

    The Beam Python SDK doesn't support triggers (or state, or timers).
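
    For reference, here is a rough sketch, using the Python SDK's own module names (apache_beam.transforms.trigger), of the kind of trigger configuration the Java SDK already supports; the topic is hypothetical and streaming options are omitted:

    import apache_beam as beam
    from apache_beam.transforms import window
    from apache_beam.transforms.trigger import (
        AfterWatermark, AfterProcessingTime, AccumulationMode)

    with beam.Pipeline() as p:  # streaming options omitted for brevity
        (p
         | beam.io.ReadFromPubSub(topic='projects/my-project/topics/events')  # hypothetical topic
         # Fixed 60-second windows that also emit early results every
         # 30 seconds of processing time, discarding already-fired panes.
         | beam.WindowInto(
               window.FixedWindows(60),
               trigger=AfterWatermark(early=AfterProcessingTime(30)),
               accumulation_mode=AccumulationMode.DISCARDING)
         | beam.CombineGlobally(beam.combiners.CountCombineFn()).without_defaults())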

    3 votes
    Vote
    Sign in
    (thinking…)
    Sign in with: Facebook Google
    Signed in as (Sign out)
    You have left! (?) (thinking…)
    0 comments  ·  Flag idea as inappropriate…  ·  Admin →
  3. Add numWorkers and autoscalingAlgorithm parameters to the Google-provided templates

    I'm using the Google-provided "Datastore to GCS Text" template, but under high load (~8M entities) Dataflow takes too long to scale up. If we could pass numWorkers and autoscalingAlgorithm parameters, jobs would take less time to execute.
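
    For illustration, a hedged sketch of what this could look like when launching the template through the templates.launch REST API via the Python API client; the project name is a placeholder, and the autoscalingAlgorithm field is exactly the knob being requested, not something the API is known to accept today:

    from googleapiclient.discovery import build

    dataflow = build('dataflow', 'v1b3')
    response = dataflow.projects().templates().launch(
        projectId='my-project',  # placeholder
        gcsPath='gs://dataflow-templates/latest/Datastore_to_GCS_Text',
        body={
            'jobName': 'datastore-export',
            'parameters': {},  # the template's own parameters, omitted here
            'environment': {
                'numWorkers': 20,
                # 'autoscalingAlgorithm': 'THROUGHPUT_BASED',  # the requested knob
            },
        },
    ).execute()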

    1 vote · 0 comments
  4. Bug: Apache Beam Dataflow runner throwing a setup error

    Hi,
    We are building a data pipeline using the Beam Python SDK and trying to run it on Dataflow, but we get the error below:

    A setup error was detected in beamapp-xxxxyyyy-0322102737-03220329-8a74-harness-lm6v. Please refer to the worker-startup log for detailed information.

    However, we could not find detailed worker-startup logs.

    We tried increasing the memory size, worker count, etc., but still get the same error.

    Here is the command we use:

    python run.py \
        --project=xyz \
        --runner=DataflowRunner \
        --staging_location=gs://xyz/staging \
        --temp_location=gs://xyz/temp \
        --requirements_file=requirements.txt \
        --worker_machine_type n1-standard-8 \
        --num_workers 2

    Pipeline snippet:

    data = pipeline | "load data" >> beam.io.Read(
        beam.io.BigQuerySource(query="SELECT * …
    16 votes · 0 comments
  5. 1 vote · 0 comments
  6. BigQuery to PubSub Dataflow Template

    I need a BigQuery to PubSub template in Dataflow. This would allow us to create a streaming job that exports BigQuery Gmail logs into our SIEM through our PubSub HTTPS endpoint.
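
    Until a template exists, here is a minimal sketch of the pipeline such a template would presumably wrap, using the Beam Python SDK; the query and topic are placeholders:

    import json
    import apache_beam as beam

    # Placeholder query and topic; a real template would take these as parameters.
    QUERY = 'SELECT * FROM `my-project.gmail_logs.daily`'
    TOPIC = 'projects/my-project/topics/siem-export'

    with beam.Pipeline() as p:
        (p
         | 'read bq' >> beam.io.Read(beam.io.BigQuerySource(query=QUERY, use_standard_sql=True))
         # Serialize each row dict to JSON bytes for PubSub.
         | 'to json' >> beam.Map(lambda row: json.dumps(row).encode('utf-8'))
         | 'to pubsub' >> beam.io.WriteToPubSub(topic=TOPIC))

    Note that PubSub IO on Dataflow is generally restricted to streaming jobs, which is part of why a supported template would help here.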

    63 votes · 2 comments
  7. Bug: Dataflow jobs are not shown consistently in the Cloud Console

    Currently running jobs are sometimes not shown in the Cloud Console. After refreshing the page they sometimes show up, only to disappear again a couple of seconds later.

    The behaviour is very inconsistent, so I have not found a way to reproduce the issue; it seems mostly time-dependent. I first noticed it a couple of weeks ago. Today I've been suffering from it the whole day, while yesterday everything worked fine.

    20 votes · 4 comments
  8. Dataprep: allow deleting more than one dataset

    In the DATASETS tab in Dataprep, it would be extremely helpful to be able to select multiple datasets and delete them together.

    1 vote · 0 comments
  9. 34 votes · 0 comments
  10. Dataprep: drop columns by a specific rule

    Today, in order to remove columns in Dataprep, I have to delete each column manually. When dealing with raw datasets I often find myself manually deleting hundreds of columns because they carry no value (e.g. 90% empty). I want to be able to define a rule that deletes every column matching a specific logical expression (e.g. delete all columns that are more than 90% empty).
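
    For comparison, the requested rule is a one-liner in, for example, pandas; the file name is a placeholder:

    import pandas as pd

    df = pd.read_csv('raw_dataset.csv')  # placeholder input

    # Keep only the columns that are at most 90% empty.
    df = df.loc[:, df.isnull().mean() <= 0.9]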

    1 vote · 0 comments
  11. Dataprep: cause a job to fail if there is an error in the recipe

    TL;DR: I want to be notified when a recipe becomes invalid due to errors in the data source or recipe (which can happen from a change in the source tables or for any other reason), and to see that the transformation has failed.

    Example scenario: I have a dataset (e.g. a Salesforce table) that is extracted daily and saved to GCS. A scheduled Dataprep job runs every day on the new table and transforms it with a pre-defined recipe. When a column is added to or removed from the data source, the recipe might…

    11 votes · 0 comments
  12. PubSub to BigQuery template should allow Avro and/or Protobuf

    BigQuery already supports loading files in Avro, so why not streaming Avro in from PubSub? This seems like an obvious feature, but I don't see any way to do it currently. The PubSub to BigQuery template is great, but it would be so much better with this one feature turned on.
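
    As a rough sketch of what the template could do internally, assuming schemaless Avro records and the fastavro library (the schema, topic, and table below are all assumptions):

    import io
    import apache_beam as beam
    from fastavro import schemaless_reader

    # Hypothetical Avro schema matching the destination BigQuery table.
    SCHEMA = {
        'type': 'record', 'name': 'Event',
        'fields': [{'name': 'user', 'type': 'string'},
                   {'name': 'ts', 'type': 'long'}],
    }

    def decode_avro(message):
        # Each PubSub message is assumed to carry one schemaless Avro record.
        return schemaless_reader(io.BytesIO(message), SCHEMA)

    with beam.Pipeline() as p:  # streaming options omitted
        (p
         | beam.io.ReadFromPubSub(topic='projects/my-project/topics/events')
         | beam.Map(decode_avro)
         | beam.io.WriteToBigQuery('my-project:my_dataset.events'))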

    11 votes · 0 comments
  13. Dataprep / Dataflow jobs region

    TL;DR: Dataprep creates jobs in the US only. This is probably a bug.

    I'm trying to prep data from BQ (EU) to BQ (EU).

    But Dataprep creates the Dataflow job in the US, and because of that I get an error:

    Cannot read and write in different locations: source: EU, destination: US

    15 votes · 2 comments
  14. Stop a streaming pipeline when idle to save costs

    Reading from PubSub for a fixed amount of time was possible in SDK 1.9.x with maxReadTime(Duration), but in 2.x we don't have that option anymore.

    I know that using PubSub as a bounded collection is bad design, but sometimes we have scenarios in which only a few messages are pushed once a day, and it is impractical to leave a streaming job running in those cases.

    Like Dataproc does with scheduled deletion (https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/scheduled-deletion), it would be nice to have something similar in Dataflow to save money.
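
    Until something like that exists, one workaround is to drain the job yourself once it has been idle; a hedged sketch against the Dataflow REST API, with placeholder IDs:

    from googleapiclient.discovery import build

    def drain_job(project, region, job_id):
        # Request a drain: the job stops after in-flight data is processed.
        dataflow = build('dataflow', 'v1b3')
        dataflow.projects().locations().jobs().update(
            projectId=project,
            location=region,
            jobId=job_id,
            body={'requestedState': 'JOB_STATE_DRAINED'},
        ).execute()

    drain_job('my-project', 'us-central1', 'my-job-id')  # placeholders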

    4 votes · 0 comments
  15. Is there a way to get notified whenever there is a new Dataflow release?

    Can we subscribe to an email list and get notified whenever there is a new Dataflow release?

    4 votes · 0 comments
  16. 2 votes · 0 comments
  17. Provide more frequent Python updates

    The Python SDK is not up to date with the various cloud SDKs; the last update was in September…

    17 votes · 0 comments
  18. Load a file from a bucket into a table, then run 200 queries on it, sequentially

    I have to read a file from a bucket, load that file's data into a BigQuery table, and then run 200 queries against the loaded table. The whole process must run one step at a time, but because the work runs in parallel it is not done correctly, so I need to synchronize it such that one job finishes before the next is triggered.
    Can anyone help me?
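
    This is more an orchestration question than a Dataflow feature; one minimal sketch uses the google-cloud-bigquery client, where each .result() call blocks until the corresponding job finishes (all names are placeholders):

    from google.cloud import bigquery

    client = bigquery.Client(project='my-project')  # placeholder

    # Step 1: load the file from the bucket and wait for it to finish.
    load_job = client.load_table_from_uri(
        'gs://my-bucket/input.csv',          # placeholder file
        'my-project.my_dataset.staging',     # placeholder table
        job_config=bigquery.LoadJobConfig(source_format='CSV', autodetect=True),
    )
    load_job.result()  # blocks until the load job completes

    # Step 2: run the queries strictly one after another.
    queries = ['SELECT 1', 'SELECT 2']       # stand-ins for the 200 SQL statements
    for sql in queries:
        client.query(sql).result()  # blocks until this query completes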

    1 vote · 0 comments
  19. 77 votes · 1 comment
  20. I really need Python examples.

    How do I read a file from GCS and load it into BigQuery with Python?
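
    A minimal hedged sketch with the Beam Python SDK; the bucket, CSV layout, and table are placeholder assumptions:

    import apache_beam as beam

    def parse_line(line):
        # Assumes a simple two-column CSV: name,score
        name, score = line.split(',')
        return {'name': name, 'score': int(score)}

    with beam.Pipeline() as p:
        (p
         | 'read gcs' >> beam.io.ReadFromText('gs://my-bucket/input.csv')  # placeholder
         | 'parse' >> beam.Map(parse_line)
         | 'write bq' >> beam.io.WriteToBigQuery(
               'my-project:my_dataset.scores',  # placeholder table
               schema='name:STRING,score:INTEGER',
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
               create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))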

    1 vote · 0 comments