The quickest thing available for comping images together is the GPU with some blending operations (you might need to split the images up and composite the pieces there). Whilst sending that amount of data back and forth isn't generally recommended, the benefits of using the GPU should outweigh the (comparatively) small cost of uploading the data and reading it back.
/edit
Btw, this sounds like a very slow way of doing it imo. You are accessing the raw data through pointers, I assume? I.e. not doing something silly like:
setOutPixel( getInPixel1() + getInPixel2() );
A function call per pixel like that would kill performance (and I assume you are building in release mode with optimisations turned on).
I assume you are using some sort of direct access to the pixels anyway, i.e. walking over the raw data with pointers rather than recomputing an (i*height + j) style index for every pixel?
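const unsigned bytes_per_pixel = 3; // rgb, 8 bits per channel
const unsigned char* pImage1 = someImageData;
const unsigned char* pImage2 = someOtherImageData;
unsigned char* pOut = someOutputImage;
// one past the last byte of image 1; all three buffers must be the same size
const unsigned char* pEndImage1 = pImage1 + img_w * img_h * bytes_per_pixel;
for( ; pImage1 != pEndImage1; ++pOut, ++pImage1, ++pImage2 )
{
    // add as int and clamp, otherwise the sum wraps around past 255
    // (assuming a saturating add is what you want for the comp)
    const int sum = *pImage1 + *pImage2;
    *pOut = static_cast<unsigned char>( sum > 255 ? 255 : sum );
}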
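If you want something you can compile and check on its own, here is a minimal sketch of the same pointer walk wrapped in a function. The name blend_add and the use of std::vector for the buffers are just my assumptions for illustration, and so is the clamp to 255 (a plain unsigned char add would wrap around instead); swap in whatever blend you actually need.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from any library): per-byte saturating add of two
// equally sized image buffers, walking the raw data with pointers.
std::vector<unsigned char> blend_add(const std::vector<unsigned char>& a,
                                     const std::vector<unsigned char>& b)
{
    assert(a.size() == b.size()); // both images must have the same byte count
    std::vector<unsigned char> out(a.size());

    const unsigned char* p1 = a.data();
    const unsigned char* p2 = b.data();
    const unsigned char* end = p1 + a.size(); // one past the last input byte
    unsigned char* pOut = out.data();

    for (; p1 != end; ++p1, ++p2, ++pOut)
    {
        // operands are promoted to int, so the sum itself cannot wrap
        const int sum = *p1 + *p2;
        // assumption: clamp to 255 rather than letting the result wrap
        *pOut = static_cast<unsigned char>(sum > 255 ? 255 : sum);
    }
    return out;
}
```

Since it works on flat byte counts, it doesn't care whether the data is rgb or rgba, as long as both inputs have the same layout and size.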