I attempted to implement a readback of a drawn framebuffer for auto-exposure, but no matter what I try, either glGetTexImage or glMapBuffer stalls.
Am I doing this wrong, or is this an issue with my particular machine? I can’t observe the stalling on my notebook.
I had to shorten the code to be able to post it, but basically I just read back a particular downsampled version of the framebuffer.
glGetTexImage with a GL_PIXEL_PACK_BUFFER bound isn’t supposed to stall, but it still does, even though I only map the buffer two frames later.
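The two-frame delay follows from the index rotation in `readback` with `buffers = 3`: `cur_buf` receives the glGetTexImage copy, then `counter` advances and names the buffer that is mapped in the same call. The sketch below models only that rotation (no OpenGL calls; `PboRing` and `readback_latency` are hypothetical names, and it assumes `readback` is called once per frame). It confirms that the mapped buffer was filled exactly `buffers - 1` calls earlier, and that during the first `buffers - 1` calls it has never been written at all.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical model of the PBO index rotation in ExposureReadback::readback.
// No GL calls: it only tracks which index is written and which is mapped.
struct PboRing {
    int buffers;
    int counter = 0;

    explicit PboRing(int n) : buffers(n) { assert(n > 0); }

    // Returns {cur_buf, oldest_buf}, computed the same way as in readback().
    std::pair<int, int> advance() {
        int cur_buf = counter;
        counter = (counter + 1) % buffers;
        int oldest_buf = counter;
        return {cur_buf, oldest_buf};
    }
};

// Frames between filling the buffer that is mapped in call `frame` and mapping it.
// Returns -1 while the mapped buffer has not been written yet (warm-up calls).
inline int readback_latency(int buffers, int frame) {
    PboRing ring(buffers);
    std::vector<int> written_at(buffers, -1);  // call index of last fill per PBO
    int latency = -1;
    for (int f = 0; f <= frame; ++f) {
        auto [cur_buf, oldest_buf] = ring.advance();
        written_at[cur_buf] = f;  // glGetTexImage into cur_buf happens first
        latency = written_at[oldest_buf] < 0 ? -1 : f - written_at[oldest_buf];
    }
    return latency;
}
```

With three buffers, calls 0 and 1 map a buffer that was never filled (the case where the post's code expects glMapBuffer to return null), and from call 2 onward the latency is always 2.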
```cpp
// Simple class wrapper for PBO
class DownloadPbo {
    GLuint pbo = 0;
public:
    MOVE_ONLY_CLASS_MEMBER(DownloadPbo, pbo);
    DownloadPbo () {} // not allocated
    DownloadPbo (std::string_view label) { // allocate
        glGenBuffers(1, &pbo);
        glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
        OGL_DBG_LABEL(GL_BUFFER, pbo, label);
        glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    }
    ~DownloadPbo () {
        if (pbo) glDeleteBuffers(1, &pbo);
    }
    operator GLuint () const { return pbo; }
};

struct ExposureReadback {
    int buffers = 3;
    int counter = 0;
    std::vector<DownloadPbo> pbos;

    ExposureReadback () {
        for (int i=0; i<buffers; ++i) {
            pbos.push_back( DownloadPbo{prints("ExposureReadback[%d]", i)} );
        }
        counter = 0;
    }

    int desired_res = 4;
    float2 edge_weight = 0.0f;

    bool readback (Render_Texture& tex, int2 full_res, lrgb* weighted_average) {
        ZoneScoped;
        OGL_TRACE("exposure readback");

        int cur_buf = counter;
        counter = (counter + 1) % buffers;
        int oldest_buf = counter;

        int mips = calc_mipmaps(full_res);
        int mip = clamp(mips - desired_res, 0, mips-1);
        auto res = calc_mip_res(full_res, mip);
        int size = res.x * res.y * sizeof(lrgb);

        printf("--------\n");
        printf("glGetTexImage %d\n", cur_buf);
        { // Trigger read into current pbo
            ZoneScopedN("glGetTexImage"); // CPU Profiler Zone
            OGL_TRACE("glGetTexImage"); // GPU Profiler Zone + Nsight region
            glInvalidateBufferData(pbos[cur_buf]);
            glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos[cur_buf]);
            glBufferData(GL_PIXEL_PACK_BUFFER, size, nullptr, GL_STREAM_DRAW); // TODO: avoid reallocating?
            glBindTexture(GL_TEXTURE_2D, tex);
            glGetTexImage(GL_TEXTURE_2D, mip, GL_RGB, GL_FLOAT, nullptr);
            glBindTexture(GL_TEXTURE_2D, 0);
        }

        bool readback_avail = false;
        { // Map now available oldest pbo
            ZoneScopedN("glMapBuffer");
            OGL_TRACE("glMapBuffer");
            glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos[oldest_buf]);
            auto* mapped = (lrgb*)glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
            if (mapped) { // returns null for first few frames
                printf("glMapBuffer %d\n", oldest_buf);
                lrgb total = 0;
                float total_weight = 0;
                for (int y=0; y<res.y; ++y)
                for (int x=0; x<res.x; ++x) {
                    float2 uv = ((float2)int2(x,y) + 0.5f) / (float2)res;
                    float2 t2d = lerp(edge_weight, 1.0f, 1 - abs(uv * 2 - 1));
                    float t = min(t2d.x, t2d.y);
                    total += mapped[x + y*res.x] * t; // weight the sample, not only the divisor
                    total_weight += t;
                }
                *weighted_average = total / total_weight;
                readback_avail = true;
                glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
                glInvalidateBufferData(pbos[oldest_buf]);
            }
        }
        glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
        return readback_avail;
    }
};
```
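For the exposure value to be a true weighted mean, each texel has to be multiplied by its weight `t`, not just counted in the divisor; otherwise a uniformly lit frame does not average to its own brightness. The sketch below reproduces the edge falloff (`t = min` over both axes of `lerp(edge_weight, 1, 1 - |uv*2 - 1|)`) on the CPU without any GL. It is an assumption-laden stand-in: `edge_falloff_weight` and `weighted_average` are hypothetical names, and a single `float` channel replaces the post's `lrgb` type.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Weight for texel (x, y) of a w*h image: close to 1 at the center, falling
// linearly toward edge_weight at the border (lerp(a, b, s) = a + (b - a) * s).
inline float edge_falloff_weight(int x, int y, int w, int h, float edge_weight) {
    float u = (x + 0.5f) / w;  // texel-center UVs, as in the post
    float v = (y + 0.5f) / h;
    float tu = edge_weight + (1.0f - edge_weight) * (1.0f - std::fabs(u * 2.0f - 1.0f));
    float tv = edge_weight + (1.0f - edge_weight) * (1.0f - std::fabs(v * 2.0f - 1.0f));
    return std::min(tu, tv);
}

// Weighted mean of a row-major single-channel image (hypothetical helper).
inline float weighted_average(const std::vector<float>& texels, int w, int h, float edge_weight) {
    assert(static_cast<int>(texels.size()) == w * h);
    float total = 0.0f;
    float total_weight = 0.0f;
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float t = edge_falloff_weight(x, y, w, h, edge_weight);
            total += texels[x + y * w] * t;  // sample scaled by its weight
            total_weight += t;               // normalizer
        }
    return total / total_weight;
}
```

With `edge_weight = 1` every weight is 1 and this reduces to the plain mean; with `edge_weight = 0`, a bright texel at the border of a 3x1 image contributes about a third as much as one at the center.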
[Profiler results screenshot]
This question would benefit from having a minimal reproducible example and a description of the machines on which it does and doesn’t work.
NVIDIA RTX 4070 Ti Super, with the driver reinstalled and all settings at factory defaults.
What should a minimal reproducible example look like?
Does it need to be an entire OpenGL application with all the boilerplate?
Where in the graph is it stalling?
The lower rows are CPU-side timing regions and the upper ones are GPU-side timings; I believe this particular profiler also synchronizes the two timelines accurately.
It’s clearly stalling: an asynchronous glGetTexImage should take negligible time, not most of the frame, and the CPU side can even be seen waiting exactly until the GPU finishes drawing the FBO.
How do you know it’s stalling and not computing something?
The CPU should not be waiting at all, and a copy of a few kB should take very little time on the GPU (which it does).
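To put a number on “a few kB”: the sketch below uses hypothetical stand-ins for the post’s `calc_mipmaps` and `calc_mip_res` helpers, assuming a full mip chain down to 1x1 in which each level halves the size with floor rounding. Under those assumptions, `desired_res = 4` selects mip 7 (15x8 texels) for a 1920x1080 target, which is 1440 bytes read back as `GL_RGB`/`GL_FLOAT`.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical calc_mipmaps: number of levels in a full chain down to 1x1.
inline int mip_count(int w, int h) {
    assert(w > 0 && h > 0);
    int n = 1;
    for (int s = std::max(w, h); s > 1; s >>= 1) ++n;
    return n;
}

struct MipRes { int x, y; };

// Hypothetical calc_mip_res: floor-halved size of a level, clamped to 1.
inline MipRes mip_res(int w, int h, int mip) {
    return { std::max(1, w >> mip), std::max(1, h >> mip) };
}

// Bytes per readback, mirroring the mip selection in readback()
// and assuming 3 floats (GL_RGB, GL_FLOAT) per texel.
inline int readback_bytes(int w, int h, int desired_res) {
    int mips = mip_count(w, h);
    int mip = std::clamp(mips - desired_res, 0, mips - 1);
    MipRes r = mip_res(w, h, mip);
    return r.x * r.y * 3 * static_cast<int>(sizeof(float));
}
```

A 3840x2160 target also lands on a 15x8 level (mip 8 of 12), so under these assumptions the copy stays at about 1.4 kB regardless of output resolution.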
What does it look like when it’s not stalling?
glGetTexImage takes much less time.