Taking advantage of sse4?


Mark Johnstone
March 8th, 2008, 01:01 PM
I’ve been reading a lot about how sse4 on the newer x86 CPUs is supposed to speed up video encoding. Will Cineform be able to take advantage of these new instructions? If so, when do you plan to support it?

I shoot and edit underwater video, and need to do a lot of color correction. So, I’m really interested in speeding up the editing time and playback of color corrected HDV in my NLE (Vegas 7).

It is my understanding that color correction is performed by uncompressing the video, modifying the color channel information, and recompressing the resulting video. Since I don’t fully understand how the Cineform codec integrates into Vegas, I might have this wrong. Any insight you can give me into this process would also help.

Thanks,

--Mark

David Newman
March 8th, 2008, 05:04 PM
We have looked into SSE4. CineForm encoding is already very fast (all SSE4-equipped systems can do a basic HD encode in real time), so we are more likely to look at what impact it has on decoder performance, where you are often decoding more than one stream. No dates on adding SSE4 acceleration.

Your description of color correction is basically correct. The CineForm codec connects to Vegas through the Video for Windows interface. When rendering color correction to CineForm you are benefiting from CineForm's ability to hold on to more information than native camera codecs, more like the quality performance of uncompressed files, without the huge storage and bandwidth headaches.

Anmol Mishra
November 19th, 2008, 03:32 AM
Hi David. Just a clarification on your post: "All SSE4 systems are fast enough for encoding." I am trying to use a battery-powered mini-ITX system as a portable CineForm recorder. One option is the 17W Penryn mobile CPUs. However, the Penryn SL9400 tops out at 1800 MHz. I have successfully encoded 1080p/30p with a 2.2GHz overclocked Merom.
2 GHz Meroms are pushing it, but with a custom-configured XP, it is possible. Can you confirm whether the 1.8GHz 17W TDP Penryns are able to record without a problem?
Benchmarks show about a 5% improvement in DivX and H.264 encoding, though.

Thanks!

Anmol Mishra
November 19th, 2008, 03:36 AM
Check this out - for SSE4 optimized code

AnandTech: Intel Mobile Penryn Benchmarked: Battery Life Improves Again (http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3195&p=4)

Is CineForm optimized for SSE4, David?

Richard Leadbetter
November 20th, 2008, 01:59 AM
I've run CineForm on notebooks equipped with both the 2.6GHz T9500 and the 2.5GHz T9300, and it runs well, but not as well as on a 2.4GHz Conroe. This could be down to many factors: front side bus, memory speed, chipset... Regardless, the Conroe doesn't have SSE4, so while I might be completely wrong, I don't think SSE4 is supported.

David Taylor
November 20th, 2008, 01:52 PM
Anmol, SSE4 instructions don't do as much for CineForm algorithms, so we haven't invested (yet) in SSE4 upgrades. We have gained more speed at a faster rate by upgrading our threading algorithms. That doesn't mean we won't code for SSE4, we just haven't done it yet.

Regarding your recording station, your CPU seems a bit slow. We won't "guarantee" a specific processor because there are too many variables such as scene complexity, encoder quality preference, HD-SDI versus HDMI ingest, spatial resolution, frame rate, telecine removal (or not), etc. But I understand that you prefer to use the lowest cost CPU you can get away with because it reduces your system cost.

Here is the way we usually recommend doing this. Set up a scene of reasonably high image complexity, then set up your CineForm encoding parameters (quality setting, pre-processing options, etc.) in the worst-case manner you intend to use them. As long as the CPU meters are in the 70% range while recording, you're probably in good shape. Remember, you want a reasonably complex scene that "taxes" the CPU more than an easy scene. If the CPU meters hover at 70% or below, that leaves sufficient overhead for even more complex scenes.
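
For anyone who wants to apply that rule of thumb to logged CPU readings, here is a minimal sketch in Python. The function name, the sample values, and the choice to compare the peak reading against the limit are assumptions made for illustration, not anything CineForm ships or specifies; only the 70% figure comes from the post above.

```python
def has_headroom(cpu_samples, limit_percent=70.0):
    """Return True if the peak CPU load stays at or below the limit.

    cpu_samples: CPU usage readings in percent, logged while recording
    the worst-case test scene (hypothetical input format).
    """
    if not cpu_samples:
        raise ValueError("need at least one CPU sample")
    return max(cpu_samples) <= limit_percent


# Hypothetical readings taken while recording a complex scene.
comfortable = [55.0, 62.0, 68.5, 64.0]
overloaded = [55.0, 91.0, 60.0, 72.0]

print(has_headroom(comfortable))  # True
print(has_headroom(overloaded))   # False
```

Checking the peak rather than the average is the stricter reading of the advice, since a single spike is what drops frames.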

Richard Leadbetter
November 22nd, 2008, 04:31 AM
My gut feeling is that your best bet will be to see how Core i7 works out in its mobile iterations. There was a big, big leap between Core Duo and Core 2 Duo on mobile CPUs, and I suspect you'll see a similar leap between C2D and i7.

The bottom line is that CineForm threading is on par with the best in the business and i7 is showing up to 50% improvements over Core 2 Quad. The integrated memory controller and hyper threading are two key factors that should give better CineForm performance without any changes to the code.

Any chance of a clock for clock performance increase benchmark from CineForm - Q9450 or downclocked Q9550 vs Core i7 920?

David Newman
November 22nd, 2008, 09:41 AM
Did you see my i7 performance blog entry (http://cineform.blogspot.com/2008/11/intel-core-i7-and-cineform.html)?

The system I was running is a 3.2GHz 965 Extreme, and I don't have any 920s or 940s -- the chart you want would be a lot of work. I did do one underclocked test by running the 965 at 1.6GHz (the slowest I could make it go), and it ran 1080p 4:4:4 at 54fps vs 104fps at 3.2GHz. It scales pretty well with clock speed.
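
To put a number on "scales pretty well", here is a small worked example in Python, taking the two clock speeds as 1.6GHz and 3.2GHz and the frame rates as 54fps and 104fps from the post above. The function name and the definition of efficiency (frame-rate speedup divided by clock ratio) are my own choices for illustration.

```python
def scaling_efficiency(fps_low, clock_low_ghz, fps_high, clock_high_ghz):
    """Ratio of the measured speedup to the clock-speed ratio.

    1.0 would mean frame rate scales perfectly with clock speed.
    """
    speedup = fps_high / fps_low
    clock_ratio = clock_high_ghz / clock_low_ghz
    return speedup / clock_ratio


# Figures quoted in the post: 1080p 4:4:4 on the Core i7 965.
eff = scaling_efficiency(54, 1.6, 104, 3.2)
print(round(eff, 3))  # 0.963
```

So doubling the clock gave roughly 96% of the ideal doubling in frame rate on this one test.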

Richard Leadbetter
November 22nd, 2008, 11:13 AM
Where is the :eek: smiley when you need it?

The performance increase at 1080p is staggering. I'm sitting here in a room with three quad core systems and now I want another one.

One thing though - your HP workstation uses two quad core Penryns and the 1080p 4:2:2 10-bit benchmark is 81.5fps. My QX6850 decodes 1080p60 4:2:2 8-bit in realtime. Indeed, the 1080p60 clips I sent to you were encoded, and decoded on a Q6600. Approximate/ballpark performance on a single CPU system is the same as on your dual quad - that doesn't seem right...

CineForm encoding is more efficient than decoding, right? At least until the latest enhancement... If you're looking at a 150% to 185% increase in decoding speed, can you speculate on encoding?

David Newman
November 22nd, 2008, 11:45 AM
A couple of factors impact the performance numbers. First, the clips used will affect the numbers (20-30%), and 4:2:2 doesn't scale as well on 8 cores as 4:4:4 does (4:4:4 has more work for more cores). 4:2:2 sources on 4 cores will likely run at 75% of the speed on 8 cores (Core-2 numbers; remember that dual quads share one memory bus, and this changes with Core-i7). The gap between the encode and decode speeds has shrunk: decoder threading gains have caught up with, and maybe exceeded, our very fast encoder. The only Core-i7 encode test I've done is a 4Kx2K RAW encode at 43 fps using Filmscan 2, which is blazing, as those numbers also include a 16-bit to 12-bit LUT before encoding. This is equivalent to a 4:4:4 12-bit encode at 60-80fps.
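
As a rough illustration of why 4:4:4 has "more work for more cores", the sketch below counts the samples in a 1080p frame for each chroma format. It only uses the standard definitions (4:2:2 halves horizontal chroma resolution, 4:4:4 keeps full chroma) and says nothing about how the CineForm codec actually divides work among threads; the function and dictionary names are made up for this example.

```python
# Average samples per pixel: one luma sample plus two chroma samples,
# with chroma stored for every other pixel horizontally in 4:2:2.
SAMPLES_PER_PIXEL = {"4:2:2": 2, "4:4:4": 3}


def samples_per_frame(width, height, chroma):
    """Total luma + chroma samples in one frame."""
    return width * height * SAMPLES_PER_PIXEL[chroma]


s422 = samples_per_frame(1920, 1080, "4:2:2")
s444 = samples_per_frame(1920, 1080, "4:4:4")
print(s422, s444, s444 / s422)  # 4147200 6220800 1.5

# If 4 cores reach 75% of the 8-core speed, doubling the cores
# gives about a 1.33x speedup rather than 2x.
speedup_8_vs_4 = 1 / 0.75
print(round(speedup_8_vs_4, 2))  # 1.33
```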

Richard Leadbetter
December 8th, 2008, 02:06 PM
Was just thinking about the decoding improvements and wondered whether it applies to both VFW and DS decoders. What's the difference speed-wise between the VFW and DS decoders these days? I used to be able to run 720p60 high in VirtualDub, but it lost frames on filmscan modes - that was on a QX6850.

David Newman
December 8th, 2008, 02:31 PM
Both should improve with quad-core (or better) systems, as they are linked using the same core libraries. The RGB nature of VfW will slow it down a little; let me know if the gap has widened.

Richard Leadbetter
December 10th, 2008, 04:14 AM
I could do a like for like test on my Q9300 and Q6600 systems, but the QX6850 is gone - overclocked i7 920 here we come! Really looking forward to seeing if I can move from 1080p60 high to 1080p60 filmscan, not to mention benching RGB performance of Huffyuv and Lagarith :)

Richard Leadbetter
December 12th, 2008, 05:43 AM
Quick i7 performance review based on our hardware and application:

1080p60 8-bit YUY2 capture at 'high' quality is being comfortably achieved with 50% CPU usage of the 2.66GHz Core i7 920 - astonishing bearing in mind how cheap this CPU is. I would estimate that I'm getting a 30% to 50% improvement over my QX6850 based on capture of the same material - maybe even more. I can encode 1080p60 Filmscan 2 with CPU usage going up to around 80%. I could never use Filmscan 1 or 2 with the QX6850 without severe loss of frames.
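
For a sense of scale, this short Python calculation works out the uncompressed data rate the encoder has to absorb for 1080p60 8-bit YUY2. YUY2 is a packed 4:2:2 format at 16 bits (2 bytes) per pixel; the function name is just for this example, and the figure covers raw input only, not the compressed CineForm output.

```python
def raw_bytes_per_second(width, height, fps, bytes_per_pixel=2):
    """Uncompressed video data rate; 2 bytes per pixel matches 8-bit YUY2."""
    return width * height * bytes_per_pixel * fps


rate = raw_bytes_per_second(1920, 1080, 60)
print(rate)                        # 248832000
print(f"{rate / 1e6:.1f} MB/s")    # 248.8 MB/s
```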

This chip is officially astounding.

CPU temperature doesn't even hit 45 degrees (I'm using a TRUE 120 HSF) - the 920 is supposed to be quite a hot chip. The i7 dual core CPUs should be VERY powerful processors.

Lagarith seems to be as slow as it ever was, even in multi-CPU mode, while Huffyuv seems to get a good speed boost. I can hit 1080p YUY2 at 58fps just in VFW mode. Our DirectShow version doesn't appear to work, though - the CPU multiplier stays at minimum for some reason. At 1.6GHz, 1080p YUY2 is at 38fps.