PyFlink Starter Archetype
A minimal path from article to working scaffold
A simple scaffold and agent prompt for readers who want to move from the PyFlink article into a small, Python-first streaming project without guessing the first few structural decisions.
Why This Exists
The hardest part of trying a new streaming stack is often not the API. It is deciding what the first non-chaotic project shape should look like. This starter is intentionally small and biased toward learning the runtime model early.
Open These First
- PyFlink installation for supported Python versions and package install.
- Python DataStream API intro for the basic program shape.
- Python dependency management for shipping Python files, requirements, and archives.
- PyFlink debugging for local and remote debug patterns.
- Official Flink Docker image if you want a local cluster quickly.
- Kafka connector docs if your first real job is Kafka-shaped.
Suggested Structure
pyflink-starter/
    README.md
    pyproject.toml
    flink_jobs/
        __init__.py
        job.py
        transforms.py
        model_logic.py
    tests/
        test_transforms.py
    docker/
        Dockerfile
    conf/
        local.env
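The transforms module is the natural place for logic that can be unit-tested without a Flink runtime. A minimal sketch of what `flink_jobs/transforms.py` could contain — the event shape and the `parse_event` / `enrich_event` names are illustrative assumptions, not part of the scaffold:

```python
# flink_jobs/transforms.py (sketch; field names are illustrative)
import json
from typing import Optional


def parse_event(raw: str) -> Optional[dict]:
    """Parse one raw JSON line into an event dict, or None if malformed."""
    try:
        event = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # Require the minimal fields the downstream logic depends on.
    if "user_id" not in event or "value" not in event:
        return None
    return event


def enrich_event(event: dict) -> dict:
    """Pure enrichment step: easy to unit-test, easy to move behind a service later."""
    return {**event, "value_doubled": event["value"] * 2}
```

Because both functions are plain Python, they can be exercised from `tests/test_transforms.py` with no cluster involved.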
Local Setup Commands
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install apache-flink==2.2.0
# optional: verify the install
python -c "import pyflink; print('pyflink ok')"
At the time of writing, the stable installation docs require Python 3.9, 3.10, 3.11, or 3.12.
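A small guard can surface a version mismatch early instead of at job-submission time. This is a sketch; the supported range is copied from the line above and will drift as the docs change:

```python
import sys

# Supported (major, minor) versions per the stable PyFlink installation
# docs at the time of writing; update this set when the docs change.
SUPPORTED = {(3, 9), (3, 10), (3, 11), (3, 12)}


def is_supported(version=sys.version_info[:2]) -> bool:
    """Return True if (major, minor) is a Python version PyFlink currently supports."""
    return tuple(version) in SUPPORTED


if not is_supported():
    print(f"warning: Python {sys.version_info[:2]} is outside the documented PyFlink range")
```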
Local Runtime Option
docker network create flink-net
docker run -d --name jobmanager \
  --network flink-net \
  -p 8081:8081 \
  -e JOB_MANAGER_RPC_ADDRESS=jobmanager \
  flink:2.2 jobmanager

docker run -d --name taskmanager \
  --network flink-net \
  -e JOB_MANAGER_RPC_ADDRESS=jobmanager \
  flink:2.2 taskmanager
This is the quickest way to get a local Flink runtime without pretending the cluster side does not exist.
What To Build First
- Start with one small streaming job that reads a source, applies stateful logic, and emits a sink.
- Keep the model logic replaceable so you can compare native Python execution with a service boundary later.
- Prove local packaging, dependency shipping, and replay behavior before chasing throughput.
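Under those constraints, `flink_jobs/job.py` can stay very small. The sketch below is one possible shape, not the scaffold's actual code; the source, sink, and function names are placeholders, and the PyFlink import is deferred into `main()` so the pure logic stays importable and testable without Flink installed:

```python
# flink_jobs/job.py (sketch; source/sink choices are placeholders)


def double(value: int) -> int:
    """Pure logic kept at module level so tests never need a cluster."""
    return value * 2


def main() -> None:
    # Deferred import: only the job entry point needs apache-flink installed.
    try:
        from pyflink.datastream import StreamExecutionEnvironment
    except ImportError:
        print("apache-flink is not installed; see Local Setup Commands")
        return

    env = StreamExecutionEnvironment.get_execution_environment()
    env.set_parallelism(1)  # keep local runs deterministic while learning

    # Placeholder source; swap for a real connector (e.g. Kafka) later.
    stream = env.from_collection([1, 2, 3])
    stream.map(double).print()

    env.execute("pyflink-starter-job")


if __name__ == "__main__":
    main()
```

Keeping the transform as a named module-level function (rather than a lambda inside the pipeline) is what makes the "replaceable model logic" goal cheap to honor later.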
First Job Command
python flink_jobs/job.py
# once you move to a real cluster, keep dependency shipping explicit
# and treat Python files / requirements / connector JARs as part of the job
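On a real cluster the same job goes through the Flink CLI, with every Python dependency named explicitly. The sketch below assembles one possible submission — the paths and the `FLINK_HOME` location are assumptions; the `--python` / `--pyFiles` / `--pyRequirements` flag names are the Flink CLI's Python options:

```shell
# Sketch: assemble the submission so nothing ships by accident.
FLINK_BIN="${FLINK_HOME:-/opt/flink}/bin/flink"   # assumption: a local Flink distribution

CMD="$FLINK_BIN run \
  --python flink_jobs/job.py \
  --pyFiles flink_jobs \
  --pyRequirements requirements.txt"

echo "$CMD"
# Uncomment once a cluster (e.g. the Docker one above) is reachable:
# eval "$CMD"
```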
The dependency-management docs matter early because PyFlink becomes operationally confusing exactly when teams treat packaging as an afterthought.
What The Archetype Should Prove
- A small DataStream job can run locally without magical hidden state.
- Python dependencies are explicit and can be shipped deliberately.
- Connector JARs are treated as runtime dependencies, not forgotten later.
- The model logic can stay native Python at first, but the boundary can still be changed later.
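The second and fourth points are cheap to enforce from day one with one plain test file. A sketch of `tests/test_transforms.py`, assuming a hypothetical transform named `enrich_event`; a local stub stands in for the real import so the example is self-contained:

```python
# tests/test_transforms.py (sketch; `enrich_event` is a hypothetical transform name)

# In the real project this would be:
#   from flink_jobs.transforms import enrich_event
# A local stub keeps this sketch runnable on its own.
def enrich_event(event: dict) -> dict:
    return {**event, "value_doubled": event["value"] * 2}


def test_enrich_event_doubles_value():
    out = enrich_event({"user_id": 1, "value": 3})
    assert out["value_doubled"] == 6


def test_enrich_event_preserves_input_fields():
    out = enrich_event({"user_id": 1, "value": 3})
    assert out["user_id"] == 1
```

`pytest tests/` runs these with no Flink runtime involved, which is exactly the point of keeping the transforms pure.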
Agent Prompt
Use the PyFlink article as context and create a minimal PyFlink starter project.
Requirements:
- Python-first project layout
- one small DataStream job
- local runnable setup
- clear dependency management
- one testable transform module
- notes on where Java/JAR dependencies still enter the picture
- include a Docker-based local runtime option
- include a README with exact commands
- include one path for native Python model logic and one note on how to swap to a service boundary later
Do not optimize for scale yet.
Optimize for clarity, packaging sanity, and understanding the runtime boundary.
Useful Tooling
- Python venv to keep the client-side environment isolated.
- Docker to make the cluster side visible early instead of delaying it.
- PyCharm / IntelliJ plus the official PyFlink debugging flow if you want local or remote Python UDF debugging.
- Kafka only when your first example actually needs it; otherwise start with a smaller source/sink path and add connector JARs later.
