A LoRA adapter on FLUX.2 Klein (4B) that predicts the magnitude spectrogram of a room impulse response (RIR) from a top-down schematic of the room. It reframes acoustic modeling as image-to-image generation: the source image is a schematic showing room geometry plus source and listener positions, the target is the RIR spectrogram in RGB, and inverting the bijective color encoding recovers a mono RIR suitable for audio convolution.
This adapter tests whether the recipe from Image Generators are Generalist Vision Learners (Gabeur et al., 2026; arXiv:2604.20329) extends to physics-grounded prediction tasks where the input is a 2D image and the output is a signal that captures the response of a physical system.
Method
Reframe room acoustics as image-to-image. Source: a 768 × 768 top-down schematic of a rectangular room with the audio source rendered as a red ⊕ glyph, the listener as a blue ⊙ glyph, and floor brightness encoding surface absorption (lighter = more reflective). Target: the room impulse response, computed via the image-source method, encoded as an RGB spectrogram.
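For concreteness, here is a minimal sketch of the schematic convention; `render_schematic.py` in this repository is the authoritative renderer, and the glyph radius, margin, and linear brightness mapping below are illustrative assumptions.

```python
from PIL import Image, ImageDraw

def render_schematic(w_m, d_m, absorption, src_xy, lis_xy, size=768):
    """Top-down schematic: floor brightness encodes absorption
    (lighter = more reflective), source = red circle-plus,
    listener = blue circle-dot. Sketch only; sizes are assumptions."""
    floor = int(255 * (1.0 - absorption))  # assumed linear brightness map
    img = Image.new("RGB", (size, size), (floor, floor, floor))
    draw = ImageDraw.Draw(img)
    margin = 32
    scale = (size - 2 * margin) / max(w_m, d_m)  # fit room, keep aspect

    def to_px(x, y):
        return (margin + x * scale, margin + y * scale)

    draw.rectangle([to_px(0, 0), to_px(w_m, d_m)], outline="black", width=4)
    for (x, y), color, glyph in [(src_xy, "red", "plus"), (lis_xy, "blue", "dot")]:
        cx, cy = to_px(x, y)
        r = 14
        draw.ellipse([cx - r, cy - r, cx + r, cy + r], outline=color, width=4)
        if glyph == "plus":   # source: red circle with a plus through it
            draw.line([cx - r, cy, cx + r, cy], fill=color, width=4)
            draw.line([cx, cy - r, cx, cy + r], fill=color, width=4)
        else:                 # listener: blue circle with a center dot
            draw.ellipse([cx - 4, cy - 4, cx + 4, cy + 4], fill=color)
    return img
```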
Bijective magnitude↔RGB encoding. Linear-amplitude STFT magnitude → dB clipped to [−100, 0] → normalized to u ∈ [0, 1] → 7-segment Hamiltonian path through the corners of the RGB cube (black → blue → cyan → green → yellow → red → magenta → white). The wider dB range relative to speech encodings captures the full RIR dynamic range, from the direct-arrival peak to the late-reverberation tail.
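A minimal NumPy sketch of this bijection follows; the repository's encoder is authoritative, and the inverse is exact only up to uint8 quantization.

```python
import numpy as np

# Hamiltonian path through the RGB-cube corners (7 segments):
# black -> blue -> cyan -> green -> yellow -> red -> magenta -> white.
PATH = np.array([
    [0, 0, 0], [0, 0, 1], [0, 1, 1], [0, 1, 0],
    [1, 1, 0], [1, 0, 0], [1, 0, 1], [1, 1, 1],
], dtype=np.float32)

def db_to_rgb(db):
    """Map dB in [-100, 0] to RGB by piecewise-linear interpolation
    along the 7-segment corner path."""
    u = (np.clip(db, -100.0, 0.0) + 100.0) / 100.0  # u in [0, 1]
    t = u * 7.0                                     # position along path
    seg = np.minimum(t.astype(int), 6)              # segment index 0..6
    frac = (t - seg)[..., None]                     # fraction within segment
    rgb = (1.0 - frac) * PATH[seg] + frac * PATH[seg + 1]
    return (rgb * 255.0).round().astype(np.uint8)

def rgb_to_db(rgb):
    """Inverse: project each pixel onto the nearest point of the path,
    then map the path position back to dB."""
    p = rgb.astype(np.float32) / 255.0
    best_u = np.zeros(p.shape[:-1], dtype=np.float32)
    best_d = np.full(p.shape[:-1], np.inf, dtype=np.float32)
    for s in range(7):
        a, b = PATH[s], PATH[s + 1]
        ab = b - a
        frac = np.clip(((p - a) @ ab) / (ab @ ab), 0.0, 1.0)
        proj = a + frac[..., None] * ab
        d = ((p - proj) ** 2).sum(axis=-1)
        closer = d < best_d
        best_d = np.where(closer, d, best_d)
        best_u = np.where(closer, (s + frac) / 7.0, best_u)
    return best_u * 100.0 - 100.0
```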
Audio params. 16 kHz, n_fft = 1024, hop = 256, 1-second clips. The STFT (513 frequency bins × 63 time frames) is placed top-left in a 768 × 768 canvas with silence padding.
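A sketch of that layout, reusing `db_to_rgb` from above; peak-normalizing before the dB conversion is an assumption about the encoder's 0 dB reference.

```python
import numpy as np
import librosa

def rir_to_image(rir, sr=16000, n_fft=1024, hop=256, size=768):
    """Encode a 1-second RIR as a 768 x 768 RGB spectrogram image."""
    rir = librosa.util.fix_length(rir, size=sr)                   # exactly 1 s
    mag = np.abs(librosa.stft(rir, n_fft=n_fft, hop_length=hop))  # (513, 63)
    mag = mag / (mag.max() + 1e-12)   # assumed: direct arrival sits at 0 dB
    db = 20.0 * np.log10(np.maximum(mag, 1e-10))
    canvas = np.full((size, size), -100.0, dtype=np.float32)      # silence pad
    canvas[: db.shape[0], : db.shape[1]] = np.clip(db, -100.0, 0.0)
    return db_to_rgb(canvas)          # (768, 768, 3) uint8, top-left active
```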
Training data: 10,000 randomly generated rectangular rooms via pyroomacoustics. Dimensions uniform on 3–12 m × 3–12 m with a 2.4–4.0 m ceiling; surface absorption uniform on [0.05, 0.50]; source and listener positions uniform inside the room with a minimum 0.5 m separation. RIRs computed by the image-source method up to reflection order 6.
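One corpus sample under those ranges might look like this; the repository's dataset-generation script is authoritative, and the 0.3 m wall margin is an assumption to keep positions strictly inside the room.

```python
import numpy as np
import pyroomacoustics as pra

rng = np.random.default_rng(0)
w, d = rng.uniform(3.0, 12.0, size=2)   # floor plan, metres
h = rng.uniform(2.4, 4.0)               # ceiling height
alpha = rng.uniform(0.05, 0.50)         # uniform surface absorption

room = pra.ShoeBox(
    [w, d, h], fs=16000,
    materials=pra.Material(alpha),
    max_order=6,                        # image-source reflection order
)

# Resample positions until source and listener are >= 0.5 m apart.
while True:
    src, lis = rng.uniform([0.3, 0.3, 0.3],
                           [w - 0.3, d - 0.3, h - 0.3], size=(2, 3))
    if np.linalg.norm(src - lis) >= 0.5:
        break
room.add_source(src)
room.add_microphone(lis)

room.compute_rir()
rir = room.rir[0][0]                    # mono RIR at 16 kHz
```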
Status
Training in progress. Weights will be added when complete.
Training
| Setting | Value |
| --- | --- |
| Base | black-forest-labs/FLUX.2-klein-base-4B |
| Adapter | LoRA, rank 256 on transformer attention + rank 32 on text encoder |
| Resolution | 768 × 768 |
| Batch size | 4 |
| Optimizer | AdamW, lr 1e-4, cosine schedule, 300-step warmup |
| Max steps | 15,000 |
| Mixed precision | bf16 |
| Training data | 10,000 synthetic rooms (pyroomacoustics, image-source method, max order 6) |
| Audio params | 16 kHz, n_fft 1024, hop 256, 1-second RIR clips |
| Spectrogram encoding | Linear magnitude → dB clipped [−100, 0] → 7-segment Hamiltonian RGB-cube path |
Usage
```python
import torch
from PIL import Image
from diffusers import Flux2KleinPipeline

pipe = Flux2KleinPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein-base-4B", torch_dtype=torch.bfloat16,
).to("cuda")
pipe.load_lora_weights("phanerozoic/echo-plantain")

# A top-down schematic of the target room (see `render_schematic.py` for the
# renderer convention: walls as outline, source as red ⊕, listener as blue ⊙,
# floor brightness encoding absorption).
schematic = Image.open("room_schematic.png").convert("RGB").resize((768, 768))

prompt = (
    "Generate a room impulse response spectrogram for the depicted space. "
    "Time on horizontal axis (early reflections at left, late reverb tail "
    "extending right), frequency on vertical axis. Energy encoded in RGB "
    "along a Hilbert path through the color cube: black is below noise "
    "floor, blue/cyan is faint reflections, green/yellow is strong "
    "reflections, red/magenta is direct-arrival energy."
)

img = pipe(
    image=schematic, prompt=prompt, height=768, width=768,
    guidance_scale=4.0, num_inference_steps=20,
).images[0]
```
The decoder (RGB → magnitude → mono RIR) is in `decode_rir.py`. The recovered RIR can be convolved with any dry signal to apply the predicted room reverb.
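A rough decode-and-apply sketch, reusing `rgb_to_db` from the encoding sketch above; `decode_rir.py` is authoritative, and the Griffin-Lim phase recovery and file names here are assumptions.

```python
import numpy as np
import librosa
import soundfile as sf
from scipy.signal import fftconvolve

rgb = np.asarray(img)                  # generated 768 x 768 x 3 image
db = rgb_to_db(rgb)[:513, :63]         # crop the active STFT region
mag = 10.0 ** (db / 20.0)              # dB -> linear magnitude

# Only magnitude is predicted; recover phase with Griffin-Lim (assumed).
rir = librosa.griffinlim(mag, n_iter=64, hop_length=256, n_fft=1024)

dry, sr = librosa.load("dry_voice.wav", sr=16000, mono=True)
wet = fftconvolve(dry, rir)[: len(dry)]
wet /= np.abs(wet).max() + 1e-9        # normalize to avoid clipping
sf.write("wet_voice.wav", wet, sr)
```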
License
The LoRA adapter weights in this repository are released under the Apache License 2.0, matching the license of the base model FLUX.2 Klein 4B.
Training data attribution
The training data is fully synthetic, generated at preparation time from random rectangular room geometries via the pyroomacoustics Python library (Scheibler, Bezzam, Dokmanić, 2018). pyroomacoustics is distributed under the MIT License. No external dataset is required to reproduce the training corpus; the dataset-generation script is included in this repository.