Cloud Dataflow

Welcome to the Google Cloud Dataflow idea forum. You can submit and vote on ideas here to tell the Google Cloud Dataflow team which features you’d like to see.

This forum is for feature suggestions. If you’re looking for help forums, look here:

We can’t wait to hear from you!

  1. BUG: google-cloud-firestore not working with apache_beam[gcp] 2.16.0 but works with 2.13.0

    Running

    python pipeline --runner DataflowRunner ... --requirements_file requirements.txt

    throws an error when requirements.txt contains

    apache_beam[gcp]==2.16.0
    google-cloud-firestore

    but not when it contains

    apache_beam[gcp]==2.13.0
    google-cloud-firestore

    Part of the error:

    ModuleNotFoundError: No module named 'Cython'
    ----------------------------------------
    Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/h4/n9rzy8z52lqdh7sfkhr96nnw0000gn/T/pip-download-lx28dwpv/pyarrow/

    See also: https://stackoverflow.com/questions/57286517/importing-google-firestore-python-client-in-apache-beam

    Where to report bugs?
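
    A hedged workaround sketch (not confirmed in this thread; the package name and version below are placeholders): ship the extra dependency through a setup.py passed with --setup_file instead of listing it in --requirements_file, so the workers can install binary wheels rather than building sdists such as pyarrow from source.

    # setup.py -- illustrative sketch only
    import setuptools

    setuptools.setup(
        name='my-pipeline',                          # hypothetical package name
        version='0.0.1',
        install_requires=['google-cloud-firestore'],
        packages=setuptools.find_packages(),
    )

    python pipeline --runner DataflowRunner ... --setup_file ./setup.py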

    1 vote · 0 comments
  2. Deployment Manager template for Dataflow

    Please provide a Deployment Manager (GDM) template type to create, update, and delete Dataflow templates. Currently there is no Deployment Manager support for Dataflow. Since Dataflow uses GCE and GCS under the hood, rich GDM support would add tremendous value, and GDM would allow mass/concurrent deployment of Dataflow templates.

    1 vote · 0 comments
  3. Support MapState from Beam API

    Currently the DataflowRunner does not support MapState for stateful DoFns.

    3 votes · 0 comments
  4. Cannot specify location when starting a dataflow job from REST API

    I'm using a Dataflow template to export data from Bigtable. Using the command-line API I'm able to specify a region to run the job (europe-west1), but with the REST API I can't specify any region except us-central1. The error is:

    The workflow could not be created, since it was sent to an invalid regional endpoint (europe-west1). Please resubmit to a valid Cloud Dataflow regional endpoint.
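
    For reference, a hedged sketch of how a region might be passed with the regional projects.locations.templates.launch method of the v1b3 API (project, template path, and parameters below are placeholders, not a verified fix):

    # Launch a template job against a regional endpoint (illustrative sketch).
    from googleapiclient.discovery import build

    dataflow = build('dataflow', 'v1b3')
    response = dataflow.projects().locations().templates().launch(
        projectId='my-project',
        location='europe-west1',                              # regional endpoint
        gcsPath='gs://my-bucket/templates/bigtable-export',   # hypothetical template path
        body={
            'jobName': 'bigtable-export-eu',
            'parameters': {},                                 # template-specific parameters
        },
    ).execute()
    print(response['job']['id'])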

    4 votes · 1 comment
  5. Codegen

    I would like to take a valid job and generate at least some of the code I would need to re-create it in Python using the client API. I want to be able to do this for my historic jobs.
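
    As a possible starting point for such codegen, a hedged sketch (project, region, and job id are placeholders) of pulling a historic job's full definition through the v1b3 REST API, which a generator could then translate into client-API calls:

    # Fetch the full definition of an existing job (illustrative sketch).
    from googleapiclient.discovery import build

    dataflow = build('dataflow', 'v1b3')
    job = dataflow.projects().locations().jobs().get(
        projectId='my-project',
        location='us-central1',
        jobId='2019-01-01_00_00_00-1234567890',    # hypothetical job id
        view='JOB_VIEW_ALL',
    ).execute()
    print(job['environment'])    # SDK version, pipeline options, worker pools, ...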

    1 vote · 0 comments
  6. BigQuery to PubSub Dataflow Template

    I need a BigQuery to PubSub Template in Dataflow. This will allow us to create a streaming job to export BigQuery Gmail Logs into our SIEM using our PubSub HTTPS Endpoint.
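
    An illustrative Beam Python sketch of what such a template might do (table, topic, and project names are hypothetical; note that Pub/Sub I/O in the Python SDK is generally intended for streaming pipelines, so this is only a sketch of the requested behaviour):

    # Read BigQuery rows, serialize them to JSON, and publish to Pub/Sub.
    import json
    import apache_beam as beam

    with beam.Pipeline() as p:
        (p
         | 'ReadLogs' >> beam.io.Read(beam.io.BigQuerySource(
               query='SELECT * FROM `my-project.gmail_logs.daily`',
               use_standard_sql=True))
         | 'ToJson' >> beam.Map(lambda row: json.dumps(row).encode('utf-8'))
         | 'Publish' >> beam.io.WriteToPubSub(
               topic='projects/my-project/topics/siem-export'))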

    63 votes · 2 comments
  7. pub/sub to cloud sql template

    There are lots of useful templates. One that would be useful to me is pub/sub to cloud sql.
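
    An illustrative sketch of what such a template could look like in the Python SDK, assuming a Postgres Cloud SQL instance reachable from the workers (host, credentials, table, and topic are placeholders; any DB-API driver would do):

    # Read JSON messages from Pub/Sub and insert them into Cloud SQL.
    import json
    import apache_beam as beam

    class WriteToCloudSql(beam.DoFn):
        def start_bundle(self):
            import pg8000                        # assumed Postgres driver
            self._conn = pg8000.connect(host='10.0.0.3', user='beam',
                                        password='secret', database='events')

        def process(self, message):
            row = json.loads(message)
            cur = self._conn.cursor()
            cur.execute('INSERT INTO events (id, payload) VALUES (%s, %s)',
                        (row['id'], json.dumps(row)))
            self._conn.commit()

        def finish_bundle(self):
            self._conn.close()

    with beam.Pipeline() as p:   # run with --streaming on Dataflow
        (p
         | 'Read' >> beam.io.ReadFromPubSub(topic='projects/my-project/topics/events')
         | 'Write' >> beam.ParDo(WriteToCloudSql()))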

    4 votes · 0 comments
  8. Button: "copy dataflow job"

    I would like to be able to copy a Dataflow job so that I can tweak the parameters and run it again without having to re-enter them all manually.

    1 vote · 0 comments
  9. Cannot specify diskSizeGb when launching a template

    When I create a template it's possible to specify --diskSizeGb, but if you don't specify it at creation time it's not possible to pass it as a parameter when launching the template.
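
    As a stopgap sketch (flag names taken from the Python SDK's worker options; paths are placeholders), the disk size can instead be baked in when the template is created, since it cannot be supplied at launch time:

    python pipeline.py \
    --runner DataflowRunner \
    --project my-project \
    --temp_location gs://my-bucket/temp \
    --template_location gs://my-bucket/templates/my-template \
    --disk_size_gb 100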

    1 vote · 0 comments
  10. delete or cancel job using the console

    Delete or cancel a job using the console.

    13 votes · 2 comments
  11. Python 3.x support

    Python 3.x support is overdue. Python 3.6+ is now very mature and adds serious speed improvements over 2.7.

    184 votes · 1 comment
  12. 28 votes · 0 comments
  13. Wait.on() PDone instead of PCollection

    I would like to be able to publish an event to Pub/Sub after writing data to BigQuery. The Wait.on() transform is intended for this situation; however, Wait.on() requires a PCollection as the input to wait on, while BigQueryIO returns a PDone. As such, I would like to be able to use Wait.on() with a PDone before applying a transform.

    7 votes · 0 comments
  14. 77 votes · 1 comment
  15. Bug: Apache Beam Dataflow runner throwing setup error

    Hi,
    We are building a data pipeline using the Beam Python SDK and trying to run it on Dataflow, but we get the error below:

    A setup error was detected in beamapp-xxxxyyyy-0322102737-03220329-8a74-harness-lm6v. Please refer to the worker-startup log for detailed information.

    But we could not find detailed worker-startup logs.

    We tried increasing the memory size, worker count, etc., but we still get the same error.

    Here is the command we use:

    python run.py \
    --project=xyz \
    --runner=DataflowRunner \
    --staging_location=gs://xyz/staging \
    --temp_location=gs://xyz/temp \
    --requirements_file=requirements.txt \
    --worker_machine_type n1-standard-8 \
    --num_workers 2

    Pipeline snippet:

    data = pipeline | "load data" >> beam.io.Read(
        beam.io.BigQuerySource(query="SELECT *
    16 votes · 0 comments
  16. dataprep/Causing a job to fail if there is an error in the recipe

    TL;DR: I want to be notified when a recipe is no longer valid due to errors in the data source or recipe (which can happen from a change in the source tables or for any other reason), and to see that the transformation has failed.

    Example scenario: I have a dataset (e.g. a Salesforce table) that is extracted daily and saved in GCS. A scheduled Dataprep job is invoked every day on the new table and transforms it with a pre-defined recipe. When a column is added or removed from the data source, the recipe might…

    11 votes · 0 comments
  17. dataprep / dataflow jobs region

    TL;DR: Dataprep creates jobs in the US only. This is probably a bug.

    I'm trying to prep data from BQ (EU) to BQ (EU).

    But Dataprep creates the Dataflow job in the US, and because of that I get an error:

    Cannot read and write in different locations: source: EU, destination: US, error: Cannot read and write in different locations: source: EU, destination: US

    15 votes · 2 comments
  18. PubSub to BigQuery template should allow Avro and/or Protobuf

    BigQuery already has support for loading files in Avro, so why not stream them in from PubSub? This seems like an obvious feature to have, but I don't see any way to do it currently. The PubSub to BigQuery template is great, but it would be so much better with this one feature turned on.
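
    An illustrative Beam Python sketch of the requested behaviour, assuming Avro-encoded message payloads decoded with fastavro (schema, topic, and table names are hypothetical; this is not the Google-provided template):

    # Decode Avro records from Pub/Sub and stream them into BigQuery.
    import io
    import apache_beam as beam
    from fastavro import parse_schema, schemaless_reader

    SCHEMA = parse_schema({                      # the writer's Avro schema
        'type': 'record', 'name': 'Event',
        'fields': [{'name': 'user_id', 'type': 'string'},
                   {'name': 'ts', 'type': 'long'}],
    })

    def decode_avro(message_bytes):
        return schemaless_reader(io.BytesIO(message_bytes), SCHEMA)

    with beam.Pipeline() as p:   # run with --streaming on Dataflow
        (p
         | 'Read' >> beam.io.ReadFromPubSub(topic='projects/my-project/topics/events')
         | 'Decode' >> beam.Map(decode_avro)
         | 'Write' >> beam.io.WriteToBigQuery(
               'my-project:my_dataset.events',
               schema='user_id:STRING,ts:INTEGER',
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))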

    11 votes · 0 comments
  19. Provide more frequent Python updates

    The Python SDK is not up to date with various Cloud SDKs; the last update was in September...

    17 votes · 0 comments
  20. Stop a streaming pipeline when idle to save costs

    Reading from PubSub for a set amount of time was possible in SDK 1.9.1 with maxReadTime(Duration), but in 2.x we don't have that option anymore.

    I know that using PubSub as a bounded collection is bad design, but sometimes only a few messages are pushed once a day, and it is impractical to leave a streaming job running in those cases.

    Like Dataproc does with scheduled deletion (https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/scheduled-deletion), it would be nice to have something similar in Dataflow to save money.
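
    In the meantime, a hedged sketch of a possible stopgap (not a Dataflow feature; project, region, and job id are placeholders): a small script, triggered e.g. by Cloud Scheduler once the daily messages have been processed, that stops the job through the v1b3 REST API.

    # Stop a streaming job via the Dataflow API (illustrative sketch).
    from googleapiclient.discovery import build

    def stop_job(project_id, region, job_id):
        dataflow = build('dataflow', 'v1b3')
        # Request cancellation; draining is usually preferable for streaming jobs.
        body = {'requestedState': 'JOB_STATE_CANCELLED'}
        return dataflow.projects().locations().jobs().update(
            projectId=project_id, location=region, jobId=job_id, body=body,
        ).execute()

    stop_job('my-project', 'us-central1', '2019-01-01_00_00_00-1234567890')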

    4 votes · 0 comments