Hey,
For a while now I have been trying, without success, to solve an unexpected performance problem in an octree-based frustum-culling 3D drawing algorithm.
For every frame, the visible cells are determined. Then, sorted per material, the cells have their triangle indices concatenated into a system memory buffer. This buffer is submitted to the rendering pipeline by a glDrawElements call, and the next material is processed, until no more materials need to be drawn for the frame.
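To make the setup concrete, here is a minimal sketch of that per-frame loop. All names (Cell, bindMaterial, visibleCells, the size limits and so on) are simplified stand-ins for my actual structures, not the real code:

#include <string.h>   /* memcpy */
#include <GL/gl.h>

#define MAX_MATERIALS     64          /* hypothetical limits */
#define MAX_FRAME_INDICES (1 << 20)

typedef struct {                      /* hypothetical cell layout */
    GLuint  *indices[MAX_MATERIALS];  /* triangle indices, per material */
    GLsizei  indexCount[MAX_MATERIALS];
} Cell;

extern int   numMaterials, numVisibleCells;
extern Cell *visibleCells[];          /* filled by the octree culling pass */
void bindMaterial(int m);             /* texture/state setup per material */

static GLuint indexScratch[MAX_FRAME_INDICES]; /* system memory buffer */

void drawFrame(void)
{
    for (int m = 0; m < numMaterials; ++m) {
        GLsizei count = 0;
        bindMaterial(m);

        /* concatenate the triangle indices of every visible cell */
        for (int c = 0; c < numVisibleCells; ++c) {
            const Cell *cell = visibleCells[c];
            memcpy(indexScratch + count, cell->indices[m],
                   cell->indexCount[m] * sizeof(GLuint));
            count += cell->indexCount[m];
        }

        if (count > 0)
            glDrawElements(GL_TRIANGLES, count, GL_UNSIGNED_INT, indexScratch);
    }
}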
However, I have been unable to use VBOs to speed up this process. If my vertex, texture coordinate, and normal arrays are stored in VBOs, the frame rate of my test cases drops dramatically, suggesting I have hit a software path. If, on the other hand, those arrays are in normal system memory, the frame rate is much more reasonable. As an example, in a scene where on average some 10 or 20 thousand triangles are on screen, the VBO arrays give 1 or 2 FPS while the system memory arrays result in 40-50 FPS. These results are similar on an Nvidia 5200 PCI card with the latest drivers and an ATI 9600 AGP card with recent drivers, suggesting it is not a driver bug, nor specific to one single computer system.
In trying to identify the problem, I tried the following approaches.
- There might be too many indices for the bus bandwidth. While the similar results on both a PCI and an AGP machine seemed to rule this out, I still created VBOs prestoring the indices of each cell, per material, and then issued one glDrawElements per cell using the VBO indices (see the first sketch after this list). No notable difference in performance.
- The above attempt might have failed because of too many batches, that is, glDraw* operations. According to nVidia's book GPU Gems, the number of triangles modern graphics cards can handle has gone up, but the number of batches they can handle is still limited to some 10,000-30,000 per second. In order to reduce the number of batches, I attempted the following solutions:
- Using a streaming VBO as the element array, storing the indices with glBufferSubDataARB and then calling glDrawElements to output the triangles. No notable difference in performance.
- Using a streaming VBO as the element array, using glMapBufferARB to memcpy the indices into the VBO and then calling glDrawElements to output the triangles. Care was taken to discard rendered buffers with a NULL glBufferDataARB, as described in nVidia's VBO usage hints PDF document. (Both streaming variants are sketched after this list.) No notable difference in performance.
- The above solutions, but with glGetIntegerv(GL_MAX_ELEMENTS_INDICES, …) as a guide for how many indices to store at a time before calling glDrawElements (also sketched below). No notable difference in performance.
- Using glDrawRangeElements instead of glDrawElements (sketched below). For some reason this only worked with the streaming buffers described above: if the index buffer was in system memory, glDrawRangeElements crashed where glDrawElements did not, on exactly the same buffer. Possibly a driver bug; I did not test for this behavior on more than one machine.
- Having read online that glDrawElements is liable to scan the index list before submitting it to the graphics card, I suspected this behavior caused the driver to read data back from the graphics card. I substituted glDrawElements with a glBegin(GL_TRIANGLES); … glArrayElement(…); … glEnd(); loop (sketched below) to make certain no analysis of the elements would be attempted. No notable difference in performance.
- Perhaps VBO data is much more sensitive to alignment issues. I aligned the vertex data by adding a 4th 'w' component, making each position 16 bytes. No notable difference in VBO performance, possibly a slight increase in non-VBO performance.
- Maybe the graphics card is out of memory and has to swap the VBOs over the bus. I reduced all textures to a quarter of their size and rendered a scene where the VBOs would be relatively small. No notable difference in performance.
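For reference, here are rough sketches of the main attempts; all names not already mentioned in this post are hypothetical, and I assume the ARB extension entry points have been resolved in the usual way. First, the per-cell prestored index VBOs (indexVBO is a placeholder for a per-cell, per-material buffer uploaded once at load time with glBufferDataARB and GL_STATIC_DRAW_ARB; the other names carry over from the first sketch):

/* One glDrawElements per visible cell, indices prestored in VBOs.
   The (const GLvoid *)0 pointer is an offset into the bound VBO. */
for (int m = 0; m < numMaterials; ++m) {
    bindMaterial(m);
    for (int c = 0; c < numVisibleCells; ++c) {
        const Cell *cell = visibleCells[c];
        if (cell->indexCount[m] == 0)
            continue;
        glBindBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB, cell->indexVBO[m]);
        glDrawElements(GL_TRIANGLES, cell->indexCount[m],
                       GL_UNSIGNED_INT, (const GLvoid *)0);
    }
}
glBindBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB, 0); /* back to system memory indices */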
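Next, the streaming element-array attempts, showing the orphaning step and both upload variants. This is a fragment from the draw loop; streamVBO is a buffer created once with glGenBuffersARB, and gatherVisibleIndices stands for the concatenation loop from the first sketch:

extern GLuint streamVBO;   /* created once with glGenBuffersARB */
GLsizei gatherVisibleIndices(int m, GLuint *out);  /* concatenation loop above */

glBindBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB, streamVBO);
for (int m = 0; m < numMaterials; ++m) {
    bindMaterial(m);
    GLsizei count = gatherVisibleIndices(m, indexScratch);
    if (count == 0)
        continue;

    /* orphan the buffer: the driver can hand out fresh storage instead of
       waiting for outstanding draws that still reference the old contents */
    glBufferDataARB(GL_ELEMENT_ARRAY_BUFFER_ARB,
                    count * sizeof(GLuint), NULL, GL_STREAM_DRAW_ARB);

    /* variant a: copy with glBufferSubDataARB */
    glBufferSubDataARB(GL_ELEMENT_ARRAY_BUFFER_ARB, 0,
                       count * sizeof(GLuint), indexScratch);

    /* variant b, tried instead of a: map the buffer and memcpy
       GLuint *dst = (GLuint *)glMapBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB,
                                              GL_WRITE_ONLY_ARB);
       memcpy(dst, indexScratch, count * sizeof(GLuint));
       glUnmapBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB);
    */

    glDrawElements(GL_TRIANGLES, count, GL_UNSIGNED_INT, (const GLvoid *)0);
}
glBindBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB, 0);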
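The GL_MAX_ELEMENTS_INDICES splitting, combined with glDrawRangeElements, went roughly as follows. numVertices is a placeholder for the size of the bound vertex arrays; the conservative 0..numVertices-1 range avoids having to track the true min/max index per call:

extern GLuint numVertices;  /* size of the bound vertex arrays */

GLint maxIndices;
glGetIntegerv(GL_MAX_ELEMENTS_INDICES, &maxIndices);
maxIndices -= maxIndices % 3;  /* never split a triangle across calls */

for (GLsizei first = 0; first < count; first += maxIndices) {
    GLsizei n = count - first;
    if (n > maxIndices)
        n = maxIndices;
    glDrawRangeElements(GL_TRIANGLES, 0, numVertices - 1, n,
                        GL_UNSIGNED_INT, indexScratch + first);
}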
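And finally the immediate-mode substitute, which rules out any pre-scan of the index list by the driver:

/* Feed the indices one at a time; the driver sees them only inside
   glBegin/glEnd and cannot analyze the whole list up front. */
glBegin(GL_TRIANGLES);
for (GLsizei i = 0; i < count; ++i)
    glArrayElement(indexScratch[i]);
glEnd();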
So, at the end of the day, I have been unable to gain anything from using VBOs. I have used VBOs successfully many times before; in particular, this engine used to render more than 40k triangles at 50 FPS on the current test system when it kept all arrays in VBOs along with one big element array, in another VBO, that was rendered unconditionally. With frustum culling, different indices have to be rendered every frame, and it is somewhat disturbing that, because I cannot get VBOs to work with it, the frame rate actually dropped with the frustum culling implementation, even though far fewer triangles are being rendered.
Sorry about the lengthy post. Does anybody out there have experience with similar VBO behavior? To recap: with the indices in system memory and the vertex, normal, and texture coordinate arrays in system memory as well, glDrawElements can render a given scene at 40 FPS, but with the vertex, normal, and texture coordinate arrays in VBOs, the same scene renders at 1-2 FPS or less.
