PanoTools mailing list archive

Mailing list: PanoTools NG
Sender: matt_nolan_uaf
Date/Time: 2007-Dec-24 18:27:19
Subject: New hardware testing on PTgui performance

I've spent the past few days doing performance testing of PTgui using 
various hardware configurations on my new computer.  I found many of 
the results surprising and non-intuitive.  I'm sharing them with you 
here in gory detail, in the hopes that you can either correct me 
where I'm wrong, confirm them, or follow up with the next phases of 
testing.

The computer:
My goal was to get the fastest computer I could for $5000.  Why 
$5000?  The official rationalization is that my camera body cost that 
much, so it seems reasonable to match that with the computer.  That's 
also about all I could afford.  Plus I had looked around a bit first 
and knew that a lot of good stuff could be had for that price.  I did
not include the monitor and graphics card in that limit, because 
these are driven by other needs, with prices ranging from $100 to 
$4000.
CPU: Intel Core 2 Extreme QX9650 Quad Core (penryn), Air cooled
Motherboard: Asus P5K Premium Wireless
Memory: 2x OCZ DDR2-800 Reaper 2048MB w/ Heatpipe cooling
HDD1:  Seagate Barracuda 7200.11 750GB SATAII (XP boot on 50GB part.)
HDD2: same (but no boot)
HDD3: Western Digital SATA Raptor 150GB
HDD4: same
HDD5: Gigabyte iRAM 2GB
HDD6: Gigabyte iRAM 2GB
RAID controller: 3Ware 9650SE-2LP 2-port PCI-E SATA2 RAID

Why this stuff?  I had originally thought along the lines of a dual 
quad Xeon configuration with 8GB RAM running 64 bit Windows.  But I 
couldn't afford the second Xeon initially, and only the 2.33 GHz 
model at that.  Then I read scary things about 64bit OS and great 
things about the QX9650 and the Asus P5K, especially in terms of 
their overclocking abilities.  So I thought it would be nice to have 
a CPU that I could run between 2GHz and 4GHz to assess the impact of 
CPU speed on stitching.  In terms of disk drives, Raptors (10k RPM) 
and Barracudas are commonly regarded as among the fastest consumer 
SATA drives around.  I decided against 15k SAS partly because gamers 
weren't using them much and partly because of noise and power.  I'm 
on home power, and another nice thing about the QX9650 is that it is 
a very low power CPU.  I went with the iRAM disks because I wanted to 
see if they could eliminate the I/O bottleneck and put the CPU to 
full use.  I almost went with Mtrons in RAID, but that would have 
cost $2000 on its own, whereas these iRAM (with only 2GB RAM on two 
cards) only cost $500, and would be sufficient to make the tests.  In 
terms of size and numbers of disks, for my work I stitch fisheye 
sphericals and gigapixel partial panoramas (with 300GB scratch space 
or more), so I wanted a system that could do both.

Using Milko's SpeedTest (a 3 shot fisheye scene at 6000 x 3000 pix 
and a 68 MB output), I wanted to assess the impact of CPU speed on 
the stitch speed.  I also wanted to see the impact of drive type on 
stitch speed.  Here speeds are given in Minutes:Seconds.  I ran the 
test using PTgui only for both stitching and blending, and ran it 
with regular and fast transform modes (the two times separated by /).

        iRAM        Raptor RAID0   Barracuda
2.0GHz  1:07/0:38   1:10/0:42      1:10/0:40
3.0GHz  0:46/0:29   0:52/0:35      0:47/0:29
4.2GHz  0:34/0:20   0:39/0:24      0:34/0:21

The results are pretty clear: CPU speed matters a lot for this test, 
with a doubling of CPU frequency resulting in nearly a halving of 
stitch time (note that I'm using exactly the same hardware, just 
under- or overclocking the CPU).  The iRAMs are consistently a bit 
faster, followed closely by a single Barracuda, with the two Raptors 
in RAID0 on the PCI card just slightly behind.  Note that times here are to 
the nearest second or so (measured by a stopwatch), and no reboots 
were done in between typically (except when overclocking) but I 
confirmed through multiple tests that this new and nearly empty 
machine did not require rebooting to improve speed between tests.
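
(For anyone repeating these runs: a short script beats a stopwatch.  
Below is a minimal Python sketch; note that the -batch and -x flags 
are my best guess at PTgui's command-line batch mode, so check the 
docs for your version, and the paths are hypothetical.)

    import subprocess
    import time

    PTGUI = r"C:\Program Files\PTGui\PTGui.exe"   # adjust to your install

    def time_stitch(project, runs=3):
        # run the same project several times and print each wall-clock time
        for i in range(runs):
            start = time.perf_counter()
            subprocess.run([PTGUI, "-batch", "-x", project], check=True)
            print(f"run {i + 1}: {time.perf_counter() - start:6.1f} s")

    time_stitch(r"C:\panos\speedtest.pts")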

I wondered how these results scaled with project size, so I tried a 
stitch of one of my regular panoramas, an 8 shot 12MP spherical, 
resulting in nearly a 12000 x 6000 tiff image, requiring a 4.8 GB 
scratch disk for an 8 bit Tiff.  For this test I only used the 
Barracuda, because the iRAMs were too small to use alone (because I 
had only 4GB not the 8GB that is possible) and I wanted a pure CPU 
test.  I used PTgui defaults and fast transform.


2.0GHz	7:11
3.0GHz	6:02
4.0GHz	5:28

Here again, CPU matters, but not nearly as much as previously.  A 
doubling of frequency results in only a 32% difference in stitch 
time.  Presumably now the I/O rates are becoming more important, due 
to larger file sizes and temp files.  This leads me to the idea that 
the results of Milko's speed test cannot be scaled to this size, and 
for those typically stitching at this larger size, only tests at this 
larger size should be used for hardware purchase decisions.  But I 
may be missing something.
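
One way to put a number on that hunch: model the stitch time as a 
CPU part that scales with clock speed plus a fixed part that doesn't, 
and fit it to two of the timings above.  This is just my own 
back-of-envelope model, not anything PTgui reports:

    # Model: T(f) = C/f + I, where C is clock-scaling work and I is the
    # fixed (I/O-ish) remainder; fit from two (frequency, time) points.

    def split(f1, t1, f2, t2):
        c = (t1 - t2) / (1.0 / f1 - 1.0 / f2)  # clock-scaling coefficient
        i = t1 - c / f1                        # non-scaling remainder
        return c, i

    # the 8-shot test on the Barracuda: 7:11 at 2.0 GHz, 5:28 at 4.0 GHz
    c, i = split(2.0, 431.0, 4.0, 328.0)
    print(f"at 4 GHz: {c / 4.0:.0f} s scale with clock, {i:.0f} s do not")
    # -> roughly 103 s vs 225 s, i.e. two thirds of the run is I/O-bound

By that crude fit, even an infinitely fast CPU would only get this 
stitch down to about 3:45.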

I took this same project and deleted a few images to make the scratch 
disk fit onto the iRAM disks (4GB max).  So with 5 images, the 
scratch space needed was 3.4GB.  I re-ran the tests for both the 
iRAMs and the Barracuda.  Note that on all tests, the original files, 
the saved file, and the scratch space were all on the same disk, 
unless otherwise noted.
        iRAM   Barracuda
2.0GHz  3:12   4:13
4.0GHz  1:55   3:12

I draw two conclusions from this: iRAMs rock and CPU speed still 
matters.  One interesting thing is that the percentage decrease in 
stitch time was not the same for both drives: 66% vs 31%.  Neither 
is near the 100% seen in Milko's speed test when doubling the CPU 
frequency, indicating that I/O time here is more of a bottleneck.  I did some 
tests described later that shed light on this.  The point is that 
neither drive is able to keep up with the CPU.  I tried an external 
USB drive on Milko's test just to see (this was the slowest drive I 
had) and found it to only be about 20% slower; and while significant, 
it is still a smaller percentage difference than the iRAM and 
Barracuda at my larger project size, suggesting that hard drive speed 
is not as important as CPU speed for Milko's test.

When trying to figure out which of my hardware was optimal for my 
standard 12kx6k (described later) I discovered something very odd 
which I thought I'd describe first: my Raptors in RAID0 are much 
slower than using them as independent disks (JBOD)!  I couldn't 
believe it at first, but I retried them several times to confirm.  I 
also tried in a software Windows striping mode and as a single disk.  
I tried to do a motherboard RAID 0, but couldn't figure it out.  For 
this test, I added a 20,000 x 10,000 project (175 images totaling 
72GB, with a max resolution of 60kx30k and a 25 GB scratch space requirement) to 
assess how the results scaled across different project sizes.
         Raptor RAID0   Raptor JBOD   Raptor striping   Raptor single
Milko's  0:41/0:28      0:38/0:24     0:33/0:23         xxxx
My 12k   8:50           5:43          5:46              6:57
My 20k   40:19          37:31         xxxx              xxxx
(xxxx = not tested)

As can be seen, at a variety of project sizes, using the Raptors as 
JBOD is much faster than in RAID0.  I think an obvious next question 
to ask is whether my RAID controller card is functional or not, 
because this goes completely against conventional wisdom.  I used the 
popular HDtach software to test out all of my disks.  It seems to 
confirm that my RAID array was working properly.

HDtach results:
Raptor Windows-stripe:   75 MB/s average read speed, 135 MB/s burst
Raptor 3ware RAID0:     107 MB/s average read speed, 170 MB/s burst
iRAM:                   134 MB/s average read speed, 134 MB/s burst
Barracuda:               85 MB/s average read speed, 134 MB/s burst

A potentially important note here is that the Barracuda average read 
speed over the first 100 GB was actually about 110MB/s, but that the 
speed declined markedly over the full disk size of 750GB (reducing 
the average), so this might help explain why it performs as well as 
or better than the Raptor, even in RAID.  Also, the iRAM speed was 
consistent across its entire size.  Several tests by others show that 
it scales perfectly in RAID.  I have no explanation for why the 
Raptors in RAID are sooo much slower – I'd really appreciate it if 
others could try this experiment with their own projects.  I guess 
either the card really is broken (or I'm using it wrong) or PTgui 
somehow is more efficient at using separate disks.
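
If you want to cross-check your own drives without HDtach, a crude 
sequential-read test needs nothing beyond the Python standard 
library.  (The path is hypothetical; use a file bigger than your RAM, 
or the OS cache will inflate the number.)

    import time

    def read_speed(path, chunk_mb=8):
        # stream the file in big chunks and report the average rate
        chunk = chunk_mb * 1024 * 1024
        total = 0
        start = time.perf_counter()
        with open(path, "rb") as f:
            while True:
                data = f.read(chunk)
                if not data:
                    break
                total += len(data)
        secs = time.perf_counter() - start
        print(f"{total / 1e6:.0f} MB in {secs:.1f} s "
              f"= {total / 1e6 / secs:.0f} MB/s")

    read_speed(r"E:\scratch\big_test_file.bin")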

Other conclusions I reached from the Raptor testing are that Windows 
striping is of no use (and may even be slower) and two disks are much 
faster than one.

So, for my 12k spherical panoramas, what's the best hardware 
configuration?  An important note here is that the scratch space 
needs for these are larger than my iRAM disks (4GB total), so it was 
not possible for me to process a full resolution project using only 
the iRAMs yet.

5:00  Raptors JBOD
5:00  Barracudas JBOD
3:52  iRAM (preferred) and 2GB Ramdisk (motherboard RAM)
3:44  iRAM and 2GB Ramdisk (both preferred) with input files on Ramdisk
3:31  iRAM (preferred) and Raptors JBOD

I found it was really important to set the preferred tab the right 
way when doing this.  While conducting these tests, I could watch the 
CPU rate fall from 70-80% utilization to 5-10% utilization when the 
iRAM disks filled and the Raptors took over, which I could tell from 
the sound they make.  This suggested to me that the iRAM disks are 
delivering data fast enough to the CPU that they are not much of a 
bottleneck comparatively; below I show that's really not the case.  
This is occurring at a difference of 134 MB/s for the iRAM compared 
to 85 MB/s for the single Raptors.  I thought that replacing the 
Raptors with a motherboard Ramdisk would improve the speed, but it 
did not, and I'm really not sure why (ideas?).  In any case, it's 
clear that the iRAM disk makes a huge improvement.  Unfortunately I 
only had 4GB of RAM for the iRAM, and it can hold 8GB.  

So to pretend like I had 8GB of iRAM and could stitch a panorama 
using only an iRAM, I used the 5 image version of this pano which 
only needed a 3.4GB scratch space.  This comparison I find very 
interesting:

2:49	iRAM only (for input, output and scratch space)
4:59	Raptors JBOD only (for input, output and scratch space)

The message that I take home from this is that if I invest in another 
4GB of cheap RAM (1GB 400MHz at $65), I can get a 75% increase in 
speed, saving over 2 minutes on this reduced stitch and likely nearly 
3 minutes on a full one.  That is, I can process a 16 bit Tiff on 
iRAM in the time it takes to process an 8 bit Tiff on Raptors in 
RAID0 (not shown).  That's well 
worth the price in my mind, and is the same price if I hadn't bought 
the RAID controller.  That is, for an $800 investment in two iRAM 
cards ($130 each) and 8 GB of cheap RAM ($65 each), I can more than 
double the speed of two Raptors in RAID0 (which would cost about the 
same amount with the controller card) or do 75% better than 2 Raptors 
JBOD.

One thing I was not able to test yet was using the two iRAM disks in 
RAID0.  Reviews show they scale perfectly, with over 250 MB/s 
sustained rates.  I tried it with both my PCI controller and 
motherboard controller, and neither worked.  I read in some forum 
that the iRAMs use an older technology and most new controllers 
expect something newer in some way.  Any ideas would be appreciated.

Would anything be gained by either RAID0 iRAMs or more CPU cores?  To 
sort of test this, I set the affinity of PTgui to 1, 2, 3, and 4 
cores for both iRAM only and Raptor JBOD only at 4.0 GHz CPU, and 
tried Milko's speedtest.
          iRAM   Raptor JBOD
1 core    1:13   1:13
2 cores   0:44   0:43
3 cores   0:35   0:35
4 cores   0:35   0:35

In the cases of 1 and 2 cores, CPU utilization was maxed at 100% 
during warping for both drives.  In the 3 core case, it was maxed for 
iRAM but only 95% for Raptors.  It didn't seem to make a difference 
in the time, but I suspect for larger projects the times would spread 
out a little bit.  Using 4 cores, both were reading about 75% 
utilization.  I guess the take home message here is that the 4th core 
is not helping in either case (i.e., 75% of 4 cores = 3 cores at 100%).  
Presumably if there were 8 cores, it would not matter either, but 
this is speculation from a computer neophyte.  I think it would be 
really neat if someone could try this.
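
For anyone who wants to repeat the affinity runs without clicking 
through Task Manager, here is a sketch using the third-party psutil 
package (the process name is an assumption; adjust it to whatever 
your Task Manager shows):

    import psutil

    def set_ptgui_affinity(num_cores):
        # pin every running PTGui process to cores 0..num_cores-1
        for proc in psutil.process_iter(["name"]):
            if (proc.info["name"] or "").lower() == "ptgui.exe":
                proc.cpu_affinity(list(range(num_cores)))
                print(f"pinned PID {proc.pid} to {num_cores} core(s)")

    set_ptgui_affinity(2)   # e.g. the 2-core case from the table above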

As this seemed like an interesting experiment, I tried it again for a 
larger project.  This time I used my 5 image 12k panorama with 3.4 GB 
scratch space needs.
          iRAM   CPU utilization
1 core    3:09   ~90% max
2 cores   2:52   ~45% max
4 cores   2:49   ~20-30%

What I learned here was that even using 1 core, the single CPU was 
not being maxed out, which means that there is an I/O bottleneck 
again.  I carefully watched the CPU plot in Task Manager while 
simultaneously watching a Windows Explorer window with the iRAM 
folder open.  High CPU utilization corresponded to reading (or at 
least not writing) or writing only small files (~25MB).  There were 
basically two file sizes being written: 25MB and 225MB.  When the 
larger files were being written, CPU utilization plummeted to 5% or 
so with both drives.  With the 25MB files, there was only a small 
blip of 5% below the max CPU rate.  Why PTgui needs to write 10 temp 
files at 225MB each when the final tiff is only 192MB is not clear 
to me, but I guess it must be necessary.  But if those files were 
only 25MB, performance might improve 
(or might not, since it would have to write 10 times more files).  In any 
case, the iRAMs at 134 MB/s are not able to keep up with the CPU 
(write speeds may be lower than read speeds).  Thus, my conclusion is 
that more cores and faster CPUs are not likely to speed things up 
nearly as much as putting the iRAMs in RAID0 or finding something 
even faster.  In the case of Milko's speedtest, the files are only 
70MB and barely make a blip on the CPU utilization, meaning to me 
that the iRAMs (and other disks) are keeping up.  Plus there are 
fewer of them, probably because there are fewer input images.  
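
Watching Task Manager and an Explorer window by eye is tedious; a 
rough monitor like the one below (again using psutil) logs total CPU 
utilization next to system-wide disk-write throughput once a second, 
which makes the dips during the 225MB writes easy to line up:

    import psutil

    last = psutil.disk_io_counters().write_bytes
    for _ in range(120):                         # watch for two minutes
        cpu = psutil.cpu_percent(interval=1.0)   # averaged over the second
        now = psutil.disk_io_counters().write_bytes
        print(f"CPU {cpu:5.1f}%   writes {(now - last) / 1e6:7.1f} MB/s")
        last = now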

In terms of overall system performance compared to other machines, 
for Milko's speed test my best result was 55 seconds for the stock 
test (meaning using panotools for warping and not clicking the fast 
transform).  Only one machine listed is faster than this (I think), 
at 29 seconds (Garret Veley).  Given that this was a 3 GHz machine, 
it seems suspicious that my fast transform results at 3GHz were quite 
similar, but with twice the cores and a faster disk maybe a 50% 
improvement can be achieved.  I'd like to get some confirmation on 
the parameters of that test (including RAID speeds), as this would 
really help assess the difference that dual quad cores make and might 
lead to new insights into the bottlenecks, especially if the tests 
could be redone with 4 cores instead of 8.

The situation gets worse for the gigapixel stitches, because here 
nearly all of the I/O has to happen on the slower spinning disks, 
since the iRAMs get filled up early and never get reutilized.  I 
don't know whether this is possible or not, but it might be worth 
investigating whether PTgui could treat a really really fast disk as 
swap space rather than scratch space, and moving files from it to 
slower disks in the background, allowing it to be reutilized.
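
Just to make the idea concrete, a background drain might look 
something like the sketch below.  This is purely illustrative: PTgui 
has no such hook that I know of, the paths are made up, and moving a 
temp file the stitcher still needs would of course break the run, so 
it would only work if the stitcher cooperated:

    import os
    import shutil
    import time

    FAST = r"R:\scratch"    # hypothetical small, fast scratch volume
    SLOW = r"D:\overflow"   # hypothetical big, slow overflow disk

    def drain_once(min_age_s=30):
        # move files the stitcher hasn't touched recently to the slow disk
        now = time.time()
        for name in os.listdir(FAST):
            src = os.path.join(FAST, name)
            if os.path.isfile(src) and now - os.path.getmtime(src) > min_age_s:
                shutil.move(src, os.path.join(SLOW, name))

    for _ in range(60):
        drain_once()
        time.sleep(5)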

In all of these tests, I never saw the motherboard RAM used much 
beyond what was needed for the system cache (about 275 MB), meaning 
that I always had more than 3GB of RAM not in use.  What's going on 
here?  Seems like this could be better utilized, and somehow used to 
avoid writing out these larger files, but I really have no idea how 
these things work.  Using a motherboard Ramdisk does not seem to 
help, which really surprises me.

So what's my conclusion for hardware?

The iRAMs clearly show that fast I/O makes a real difference to 
stitching, confirming what Joost has said all along regarding the 
bottlenecks.  For $800, they are worth it to me.  However, they are a 
bit flaky.  One of mine refuses to be recognized on boot up, and I 
must format it each time I boot (which takes 2 seconds!).  The other 
card works fine.  They take up a lot of space because the RAM chips 
stick out a bit, and I'm worried about shorting them with an adjacent 
card.  I was also unable to RAID them, but I know this is possible.  
Still, because only 2 cards at 4GB each fit in a standard 
motherboard, only a doubling of speed is possible with RAID0.  The 
Mtron Pro flash disks, however, are nearly as fast as the iRAMs, and 
have the added advantage of not losing their contents when the power 
goes out.  They come in 16GB, 32GB, and 64GB sizes, and a review has 
shown that more than 800 MB/s is possible when putting 8 of these in 
RAID0. 
http://www.nextlevelhardware.com/storage/battleship/  At $800 each 
for the 16GB cards, this is not a cheap solution.  However, 2 of them 
in RAID 0 reach 250 MB/s, which is really fast, and for many regular 
Windows applications, increasing the number of cards does not improve 
performance much.  So for $2000, this seems like a nice solution, 
and something that's bound to last better than the iRAMs.  Another 
nice thing about these is that they only take up one card for the 
RAID controller, the disks can be jammed into empty spaces (or 
routed externally perhaps), and they are very low power.  While $2000 
is a lot, going from my single quad processor configuration to a 
dual quad motherboard and second processor costs about that much (or 
much more, if I want 3 GHz processors), and from what I've seen, won't 
improve speed nearly as much.  From what I've read, the next big 
player will be fusionio, who are creating some sort of hybrid 
ram/flash thingie with 750 MB/s transfer speeds and disks starting at 
80GB. http://www.fusionio.com/ They claim such a disk will be less 
than $3000.  So it seems that these technologies are what's needed to 
really take stitching to the next level of speed, at least for 
projects on the order of 10,000 x 5000 pixels or larger.  
For now, I think I'm going to stick with the iRAMs since I have 
them.  But now that I have convinced myself that I/O at this level 
does have a major impact on stitching speeds, any future investments 
in hardware I make would be best spent on disks with such speed.

But, as many of you know, I'm new both to stitching and to this type 
of hardware-level performance testing, so I may be missing 
fundamental things here.  I certainly don't mean to suggest my 
results are the last word, just trying to call it as I currently see 
it.  I hope that others more experienced than I can point out either 
where I've gone astray or have other suggestions for where to go 
next.  In any case, I've yet to even view a panorama on my new 30" 
monitor, so I'm ready to get back to production stitching and can 
now sleep at night knowing that I'm getting close to the most out of 
my current hardware while doing it.  (Though I suppose I will have to 
do a few tests with Photoshop next…).

Happy Holidays,
Matt


PS.  BTW, I bought my new computer here http://www.pugetsystems.com/  
These guys are really first rate and interested in long-term repeat 
customers.  They were willing to track down all of my whacky ideas, 
they delivered the computer within a week of paying for it, they did 
a bunch of burn-in tests and had an online tracking system saying 
where they were in assembly/testing, it came delivered with a nice 
binder containing all of the various manufacturer information/disks/etc, 
their production notes/etc, and everything worked without a hitch. 
They claim that they'll take back any component in the first 30 days 
that doesn't perform as well as I'd like (meaning not just broken, but 
just too slow for my ever-expanding quest for speed), and I believe 
them based on what I know of them so far.  Plus, I don't void the 
warranty or lifetime tech support by messing around on the inside.  
They get five stars from me, and if you mention me as a referral by 
name, they'll apparently give you free shipping!



