PyFlink Starter Archetype
A minimal path from article to working scaffold
A simple scaffold and agent prompt for readers who want to move from the PyFlink article into a small, Python-first streaming project without guessing the first few structural decisions.
Why This Exists
The hardest part of trying a new streaming stack is often not the API. It is deciding what the first non-chaotic project shape should look like. This starter is intentionally small and biased toward learning the runtime model early.
Open These First
- PyFlink installation for supported Python versions and package install.
- Python DataStream API intro for the basic program shape.
- Python dependency management for shipping Python files, requirements, and archives.
- PyFlink debugging for local and remote debug patterns.
- Official Flink Docker image if you want a local cluster quickly.
- Kafka connector docs if your first real job is Kafka-shaped.
Suggested Structure
pyflink-starter/
    README.md
    pyproject.toml
    flink_jobs/
        __init__.py
        job.py
        transforms.py
        model_logic.py
    tests/
        test_transforms.py
    docker/
        Dockerfile
    conf/
        local.env
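The transforms module is the natural place for logic that can be unit-tested without a Flink runtime. A minimal sketch of what `flink_jobs/transforms.py` could contain — the event shape and the `parse_event` / `enrich_event` names are illustrative assumptions, not part of the scaffold:

```python
# flink_jobs/transforms.py (sketch; field names are illustrative)
import json
from typing import Optional


def parse_event(raw: str) -> Optional[dict]:
    """Parse one raw JSON line into an event dict, or None if malformed."""
    try:
        event = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # Require the minimal fields the downstream logic depends on.
    if "user_id" not in event or "value" not in event:
        return None
    return event


def enrich_event(event: dict) -> dict:
    """Pure enrichment step: easy to unit-test, easy to move behind a service later."""
    return {**event, "value_doubled": event["value"] * 2}
```

Because both functions are plain Python, they can be exercised from `tests/test_transforms.py` with no cluster involved.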
Local Setup Commands
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install apache-flink==2.2.0
# optional: verify the install
python -c "import pyflink; print('pyflink ok')"
At the time of writing, the stable installation docs require Python 3.9, 3.10, 3.11, or 3.12.
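A small guard can surface a version mismatch early instead of at job-submission time. This is a sketch; the supported range is copied from the line above and will drift as the docs change:

```python
import sys

# Supported (major, minor) versions per the stable PyFlink installation
# docs at the time of writing; update this set when the docs change.
SUPPORTED = {(3, 9), (3, 10), (3, 11), (3, 12)}


def is_supported(version=sys.version_info[:2]) -> bool:
    """Return True if (major, minor) is a Python version PyFlink currently supports."""
    return tuple(version) in SUPPORTED


if not is_supported():
    print(f"warning: Python {sys.version_info[:2]} is outside the documented PyFlink range")
```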
Local Runtime Option
docker network create flink-net
docker run -d --name jobmanager \
  --network flink-net \
  -p 8081:8081 \
  -e JOB_MANAGER_RPC_ADDRESS=jobmanager \
  flink:2.2 jobmanager

docker run -d --name taskmanager \
  --network flink-net \
  -e JOB_MANAGER_RPC_ADDRESS=jobmanager \
  flink:2.2 taskmanager
This is the quickest way to get a local Flink runtime without pretending the cluster side does not exist.
What To Build First
- Start with one small streaming job that reads a source, applies stateful logic, and emits a sink.
- Keep the model logic replaceable so you can compare native Python execution with a service boundary later.
- Prove local packaging, dependency shipping, and replay behavior before chasing throughput.
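Under those constraints, `flink_jobs/job.py` can stay very small. The sketch below is one possible shape, not the scaffold's actual code; the source, sink, and function names are placeholders, and the PyFlink import is deferred into `main()` so the pure logic stays importable and testable without Flink installed:

```python
# flink_jobs/job.py (sketch; source/sink choices are placeholders)


def double(value: int) -> int:
    """Pure logic kept at module level so tests never need a cluster."""
    return value * 2


def main() -> None:
    # Deferred import: only the job entry point needs apache-flink installed.
    try:
        from pyflink.datastream import StreamExecutionEnvironment
    except ImportError:
        print("apache-flink is not installed; see Local Setup Commands")
        return

    env = StreamExecutionEnvironment.get_execution_environment()
    env.set_parallelism(1)  # keep local runs deterministic while learning

    # Placeholder source; swap for a real connector (e.g. Kafka) later.
    stream = env.from_collection([1, 2, 3])
    stream.map(double).print()

    env.execute("pyflink-starter-job")


if __name__ == "__main__":
    main()
```

Keeping the transform as a named module-level function (rather than a lambda inside the pipeline) is what makes the "replaceable model logic" goal cheap to honor later.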
First Job Command
python flink_jobs/job.py
# once you move to a real cluster, keep dependency shipping explicit
# and treat Python files / requirements / connector JARs as part of the job
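On a real cluster the same job goes through the Flink CLI, with every Python dependency named explicitly. The sketch below assembles one possible submission — the paths and the `FLINK_HOME` location are assumptions; the `--python` / `--pyFiles` / `--pyRequirements` flag names are the Flink CLI's Python options:

```shell
# Sketch: assemble the submission so nothing ships by accident.
FLINK_BIN="${FLINK_HOME:-/opt/flink}/bin/flink"   # assumption: a local Flink distribution

CMD="$FLINK_BIN run \
  --python flink_jobs/job.py \
  --pyFiles flink_jobs \
  --pyRequirements requirements.txt"

echo "$CMD"
# Uncomment once a cluster (e.g. the Docker one above) is reachable:
# eval "$CMD"
```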
The dependency-management docs matter early because PyFlink becomes operationally confusing exactly when teams treat packaging as an afterthought.
What The Archetype Should Prove
- A small DataStream job can run locally without magical hidden state.
- Python dependencies are explicit and can be shipped deliberately.
- Connector JARs are treated as runtime dependencies, not forgotten later.
- The model logic can stay native Python at first, but the boundary can still be changed later.
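The second and fourth points are cheap to enforce from day one with one plain test file. A sketch of `tests/test_transforms.py`, assuming a hypothetical transform named `enrich_event`; a local stub stands in for the real import so the example is self-contained:

```python
# tests/test_transforms.py (sketch; `enrich_event` is a hypothetical transform name)

# In the real project this would be:
#   from flink_jobs.transforms import enrich_event
# A local stub keeps this sketch runnable on its own.
def enrich_event(event: dict) -> dict:
    return {**event, "value_doubled": event["value"] * 2}


def test_enrich_event_doubles_value():
    out = enrich_event({"user_id": 1, "value": 3})
    assert out["value_doubled"] == 6


def test_enrich_event_preserves_input_fields():
    out = enrich_event({"user_id": 1, "value": 3})
    assert out["user_id"] == 1
```

`pytest tests/` runs these with no Flink runtime involved, which is exactly the point of keeping the transforms pure.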
Agent Prompt
Use the PyFlink article as context and create a minimal PyFlink starter project.
Requirements:
- Python-first project layout
- one small DataStream job
- local runnable setup
- clear dependency management
- one testable transform module
- notes on where Java/JAR dependencies still enter the picture
- include a Docker-based local runtime option
- include a README with exact commands
- include one path for native Python model logic and one note on how to swap to a service boundary later
Do not optimize for scale yet.
Optimize for clarity, packaging sanity, and understanding the runtime boundary.
Useful Tooling
- Python venv to keep the client-side environment isolated.
- Docker to make the cluster side visible early instead of delaying it.
- PyCharm / IntelliJ plus the official PyFlink debugging flow if you want local or remote Python UDF debugging.
- Kafka only when your first example actually needs it; otherwise start with a smaller source/sink path and add connector JARs later.
