I have experienced the exact same problem.
The problem turned out to be greedy threads combined with Windows insisting on doing some background work. I had a render thread and a thread optimizing a large mesh. The optimizer was trying to use 100% CPU time (on a dual core system) and the renderer was just rendering along.
When some background work had to be done, my optimizer was not about to be stopped, so the renderer was the one that got blocked for a short while until the background work finished, and then it ran again.
The solution was to lower the priority of my mesh optimization thread, so that when Windows insisted on getting CPU time, the optimizer was paused and not the renderer. The problem went away.
My experience relates heavily to dual core systems with two threads, but perhaps it can shed some light on your problem.
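For anyone wanting to try the same priority change, here is a minimal sketch, assuming Win32 and a C++11 compiler. `LowerCurrentThreadPriority` and `StartLowPriority` are names made up for illustration, and `THREAD_PRIORITY_BELOW_NORMAL` is just one plausible setting, not necessarily the level used in the case described above.

```cpp
#include <thread>
#include <utility>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: lower the priority of the calling thread.
// Returns true if the priority was actually changed.
bool LowerCurrentThreadPriority() {
#ifdef _WIN32
    // GetCurrentThread() returns a pseudo-handle for the calling thread.
    return SetThreadPriority(GetCurrentThread(),
                             THREAD_PRIORITY_BELOW_NORMAL) != 0;
#else
    return false;  // This sketch does nothing on non-Windows platforms.
#endif
}

// Hypothetical helper: run `work` on a new thread that first drops its own
// priority, so that the scheduler should prefer to pause it rather than the
// render thread when something else needs CPU time.
template <class Work>
std::thread StartLowPriority(Work work) {
    return std::thread(
        [](Work w) {
            LowerCurrentThreadPriority();
            w();
        },
        std::move(work));
}
```

The background job (the mesh optimizer here) would be passed in as `work`, and the render thread is left at its default priority.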
Even if you're not using multiple threads yourself you could be influenced by other threads in the system. I figure they're better at it these days, but AV software used to be guilty of grabbing lots of CPU time every now and then and leading multimedia apps to randomly pause...
Another issue I've had in the past was with log-files - on high logging levels I was streaming lots of text to a file and whenever the buffer was committed to disk it'd pause slightly (I/O bound) and then carry on.
One other thing that hasn't been suggested so far is your CPU/GPU synchronization. Your CPU-based application can keep on piling work on the GPU and get several frames ahead; once the command buffer is full D3D will force your application to stall until it can accept more commands - this stall often happens on Present() calls.
IIRC the "Accurately profiling Direct3D API Calls" paper in the SDK has code to force a pipeline stall using queries. You normally wouldn't want to do this, but if you place one of these in EVERY frame then you'll stop the CPU getting too far ahead, and you effectively replace one big occasional stall with lots of small stalls, thus smoothing out the overall frame-rate. Trying this could be a useful way of determining whether Direct3D or the GPU is at all responsible...
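To illustrate the idea without needing a D3D device, here is a rough, portable simulation of per-frame synchronization. It is not the code from the SDK paper: `FakeGpu` and `RunFrames` are hypothetical names, and the "GPU" is just a worker thread that takes about 1 ms per frame. In real D3D9 code the wait would, as far as I recall, be an event query: create it with `IDirect3DDevice9::CreateQuery(D3DQUERYTYPE_EVENT, ...)`, call `Issue(D3DISSUE_END)` after the frame's draw calls, and poll `GetData(..., D3DGETDATA_FLUSH)` until it stops returning `S_FALSE`.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the GPU: retires submitted frames one at a time.
class FakeGpu {
public:
    FakeGpu() : worker_([this] { Run(); }) {}
    ~FakeGpu() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            quit_ = true;
        }
        changed_.notify_all();
        worker_.join();
    }

    // CPU side: queue one frame of work and return its frame number.
    int Submit() {
        std::lock_guard<std::mutex> lock(mutex_);
        ++submitted_;
        // Lead = frames submitted but not yet retired by the "GPU".
        max_lead_ = std::max(max_lead_, submitted_ - completed_);
        changed_.notify_all();
        return submitted_;
    }

    // CPU side: block until the given frame has been retired.
    void WaitFor(int frame) {
        std::unique_lock<std::mutex> lock(mutex_);
        changed_.wait(lock, [&] { return completed_ >= frame; });
    }

    int MaxLead() {
        std::lock_guard<std::mutex> lock(mutex_);
        return max_lead_;
    }

private:
    void Run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            changed_.wait(lock, [&] { return quit_ || completed_ < submitted_; });
            if (completed_ == submitted_) return;  // Quit requested, nothing pending.
            lock.unlock();
            std::this_thread::sleep_for(std::chrono::milliseconds(1));  // "Render".
            lock.lock();
            ++completed_;
            changed_.notify_all();
        }
    }

    std::mutex mutex_;
    std::condition_variable changed_;
    int submitted_ = 0;
    int completed_ = 0;
    int max_lead_ = 0;
    bool quit_ = false;
    std::thread worker_;  // Declared last so the other members exist before it starts.
};

// Submit `frames` frames and return the largest lead the CPU ever had.
// With sync_every_frame set, the CPU waits for each frame before the next.
int RunFrames(int frames, bool sync_every_frame) {
    assert(frames > 0);
    FakeGpu gpu;
    for (int i = 0; i < frames; ++i) {
        int id = gpu.Submit();
        if (sync_every_frame) gpu.WaitFor(id);
    }
    gpu.WaitFor(frames);  // Drain before reading the statistic.
    return gpu.MaxLead();
}
```

With `sync_every_frame` set, the lead never exceeds one frame; without it, the CPU can race ahead by up to the full number of frames, which is the situation that ends in one large stall once a real command buffer fills.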
hth
Jack
You're mostly correct in your conclusion, but the issue is more general: the CPU is getting ahead of the GPU. The command buffer filling up is more a symptom than the cause.
I would recommend using "PIX for Windows" along with that article on profiling API calls to try and build up some information about where your time is being spent. Which bits of work require little CPU time but a lot of GPU time? Profiling is absolutely key here - you NEED results both to determine where your problem is and to measure any improvements/fixes.
Could also be worth considering the hardware used - it is always possible that your GPU is too slow (or relatively, your CPU is too fast).
hth
Jack
It can take a bit of time/effort to learn how to get the most from PIX, so it may well not be the best tool for you.
In order for the CPU and GPU to be running in parallel at maximum efficiency they must be producing (CPU) and consuming (GPU) at a roughly equal rate. If the GPU consumes too quickly it'll be idle whilst waiting for the CPU to catch up; alternatively, if the CPU produces too quickly then the GPU will have to block it until it's cleared some of the backlog.
You wouldn't be going too wrong to think of the CPU/GPU dynamic in terms of the standard producer-consumer problem from multiprogramming.
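The producer-consumer view can be put into numbers with a small, idealized model. This is only a sketch under stated assumptions: every frame costs a fixed `cpu_ms` to build and `gpu_ms` to draw, and the CPU may have at most `max_queued` frames outstanding. `Simulate` and `StallTotals` are hypothetical names, and the numbers are made up.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct StallTotals {
    double cpu_blocked_ms;  // Time the CPU spent waiting for queue space.
    double gpu_idle_ms;     // Time the GPU spent waiting for work.
};

// Idealized pipeline: fixed per-frame costs, bounded number of queued frames.
StallTotals Simulate(double cpu_ms, double gpu_ms, int max_queued, int frames) {
    assert(max_queued >= 1 && frames >= 1);
    std::vector<double> gpu_done(frames);  // Time at which the GPU retires frame i.
    StallTotals totals = {0.0, 0.0};
    double cpu_done = 0.0;  // Time at which the CPU finished the previous frame.
    for (int i = 0; i < frames; ++i) {
        // The CPU can start frame i once it is free AND frame i - max_queued
        // has been retired (otherwise the queue is full and it must block).
        double cpu_start = cpu_done;
        if (i >= max_queued) {
            cpu_start = std::max(cpu_start, gpu_done[i - max_queued]);
        }
        totals.cpu_blocked_ms += cpu_start - cpu_done;
        cpu_done = cpu_start + cpu_ms;

        // The GPU can start frame i once it is free AND the CPU has built it.
        // Start-up time before the very first frame is not counted as idle.
        double gpu_free = (i > 0) ? gpu_done[i - 1] : cpu_done;
        double gpu_start = std::max(cpu_done, gpu_free);
        totals.gpu_idle_ms += gpu_start - gpu_free;
        gpu_done[i] = gpu_start + gpu_ms;
    }
    return totals;
}
```

In this model, `Simulate(1, 2, 2, 100)` (a GPU twice as slow as the CPU) gives 98 ms of CPU blocking and no GPU idle time, while `Simulate(2, 1, 2, 100)` gives no CPU blocking and 99 ms of GPU idle time.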
My comment about the CPU being too fast isn't usually the case - it was more a throw-away comment to illustrate the possibility. It is more likely that the high-end GPUs (X1950s and 8800s) will be able to outpace many CPUs, and thus the mismatch results in idle time rather than blocked time, which rarely causes a noticeable problem.
Comparing your situation against other applications isn't really a good idea - they'll interact with the API/GPU in different ways that may well mean that the CPU/GPU utilization is better balanced. Running PIX against these apps might well reveal some interesting comparisons.
Look at how many Draw calls you make per frame, how many times you read/write resources (you did comment that you don't create/destroy resources, but modifying them can be equally bad), how many state changes you make, and so on...
Given that you mention that the frame-rate with forced synchronization is 15-25fps, have you considered that you're just pushing the hardware too hard? I can quite easily write graphics demos that exhibit similar characteristics, where I make only a few draw calls with hundreds of thousands (or even millions) of polygons each. Throw in a complex shader and you can quite easily choke a good GPU without bothering the CPU too much...
hth
Jack