33 Comments
cgsaben - Wednesday, November 19, 2008 - link
Great test, this is exactly why I visit this site. I am an architect for a large online backup company, and these articles are great resources.

I realize you might not have many samples, but it would be great to see what results you would get if you set up your RAID 10 with SSDs. I don't plan to use a single stand-alone SSD for critical data storage in my environment.
mafj - Friday, November 14, 2008 - link
To really test SSD performance, one would need a driver implementing Managed Flash Technology: http://managedflash.com/

In practice it can boost write performance to nearly match read performance (10K IOPS) on SSD drives, not to mention improving the SSD's lifespan.

You can try an evaluation version of the MFT drivers on that site.
vol7ron - Thursday, November 13, 2008 - link
Intel does not have a cheap MLC. If $600+ is cheap, you must be crazy. Other manufacturers produce cheap MLCs, which cost under $200 for the same size.

Those cheap MLCs also have random-write problems and are not adequate for anything, really. Intel has produced one of the best MLCs, almost equivalent in performance to an SLC, for slightly less money. It is not good to take one of the best MLCs, call it cheap, and compare it to top-of-the-line hard disk systems. That misleads the reader into thinking the actually cheap MLCs (like the OCZs or MTrons) are being used, which is not the case.
jand - Tuesday, November 11, 2008 - link
I had a system where we had a large number of files of roughly 60 KB each.

The disk subsystem could not run at full speed due to the bursty nature of the file writes. With a large file chunk, the write is sequential, so everything goes much faster.

By comparison, transaction logging is essentially bursty traffic, and it is relatively small compared to the data writes, hence it benefits more from lower latency than from higher throughput.

You could use a RAM disk for the transaction log, since it gives both high throughput and low latency. That would be your gold standard for comparisons.
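The bursty-versus-sequential point above can be made concrete with a back-of-the-envelope model. This is a rough sketch, not a benchmark: the 4 ms per-I/O latency and 80 MB/s sustained throughput are assumed illustrative figures, not measurements from the article.

```python
# Rough model: time to persist 1 GB either as many small synchronous
# writes (each paying a per-I/O latency) or as one sequential stream.
# Latency/throughput numbers are illustrative assumptions.

def time_small_writes(total_bytes, chunk_bytes, latency_s, throughput_bps):
    """Each chunk pays the per-I/O latency plus its transfer time."""
    n_chunks = total_bytes // chunk_bytes
    return n_chunks * (latency_s + chunk_bytes / throughput_bps)

def time_sequential(total_bytes, latency_s, throughput_bps):
    """One initial seek, then a single sustained transfer."""
    return latency_s + total_bytes / throughput_bps

GB = 1_000_000_000
bursty = time_small_writes(GB, 60_000, 0.004, 80_000_000)   # ~60 KB chunks
streaming = time_sequential(GB, 0.004, 80_000_000)

print(f"bursty: {bursty:.0f} s, sequential: {streaming:.0f} s")
```

Under these assumptions the same gigabyte takes several times longer when written as 60 KB bursts, which is why the log workload cares about latency rather than bandwidth.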
bigdbahead - Tuesday, November 11, 2008 - link
I have been doing MySQL benchmarks on SSDs for the past several months. I normally use sysbench, but I find DBT2 a little more reliable in terms of mirroring real-world OLTP performance. Regardless, take a look here for some interesting MySQL SSD material: http://www.bigdbahead.com/

overzealot - Monday, November 10, 2008 - link
If the performance improved significantly with low latency, would running the log on a RAM drive increase performance? Might be worth checking while you're testing.

RagingDragon - Monday, November 10, 2008 - link
Yeah, why not use a RAM drive for benchmarking? While it might not be a good idea for a production database (*), it would maximize the chances of a CPU bottleneck, and thus make a good CPU test. Rather like benchmarking games at minimum resolution and detail on top-of-the-line graphics cards.

(*) Gigabyte makes battery-backed RAM disks, and there are at least a few enterprise-grade systems too, for example: http://www.superssd.com/products/ramsan-440/
RagingDragon - Monday, November 10, 2008 - link
For benchmarking, I don't think the expense of the enterprise system would be justified (but if you can convince a vendor to "donate" one for your benchmarking, so much the better :D). However, something like the cheap Gigabyte cards, or just a server with tons of RAM and a simple operating-system RAM drive, should be OK for benchmarking (i.e. there are no serious consequences if you lose power/reboot and have to restore the DB).

ggordonliddy - Monday, November 10, 2008 - link
> The last time, more and more of the database vendors are talking
> about the wonders that SSD can do for transactional (OLTP) databases.
That statement does not make sense. Did you mean "were talking"?
JohanAnandtech - Tuesday, November 11, 2008 - link
Does it make sense now? :-)

ggordonliddy - Tuesday, November 11, 2008 - link
Yes, thanks.

hieuu - Monday, November 10, 2008 - link
Hi, please explain your exact setup / drive configuration / controller configuration.

Was the log drive, after the swap, on the same controller? What was your caching scheme?

Thanks
JohanAnandtech - Tuesday, November 11, 2008 - link
This should answer your question: InnoDB buffer pool of 950 MB, 258 MB database (so all indexes are cached), Adaptec 5805 RAID card with 512 MB of DDR2.

http://it.anandtech.com/cpuchipsets/intel/showdoc....

Basically, the SSD was plugged into the backplane of our Supermicro server, so it is running on a relatively low-end LSI 1068 controller.
torsteinowich - Monday, November 10, 2008 - link
Are MLC flash devices even worth considering for database applications, with their 10,000-cycle erase limit? I'm no database expert, but 10,000 sounds low for a database application.

Even SLCs might not live that long on databases with much write activity. For mostly-read applications SSDs are great, though.
JohanAnandtech - Monday, November 10, 2008 - link
Well, the MLC was the only one that was available to me; you are probably right. I do think SLC wear levelling is capable of making the drive survive longer than a mechanical disk, especially Intel-quality wear levelling.

But I was quite surprised to hear how many IT people (*) were using these cheap MLC drives for databases and are quite happy about it. They might be in for a bad surprise.

(*) Just hearsay, no real statistical data :-)
vol7ron - Thursday, November 13, 2008 - link
Intel's MLC has a wear-levelling controller that reduces the wear on each cell and extends the drive's life. Because the location where the data is stored doesn't matter (unlike a hard disk, where closer is faster), the algorithm makes sure to use the whole drive. Additionally, it reduces the number of times a cell is used, to minimize erases.

Notice first that it is not the re-writes that kill the MLCs; it is the erase-then-write-again cycle. So long as a cell is not erased, it will not be damaged. I'm not sure how this will affect the storage, but if they're using a good logging system, no DELETEs/UPDATEs will occur, only inserts with adjustments, which would mean that data is hardly ever being changed (aside from the indexes and logs).
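The idea of spreading erases across the whole drive can be sketched in a few lines. This is a deliberately naive toy, nothing like Intel's actual controller (which also handles logical-to-physical mapping, static data rotation, and spare area); it only shows the core principle of always directing the next erase/program cycle at the least-worn block.

```python
# Toy wear-leveling sketch (illustration only): each write erases the
# physical block with the fewest erase cycles so far, so wear spreads
# evenly instead of burning out one hot block.

class WearLeveler:
    def __init__(self, n_blocks: int):
        self.erase_counts = [0] * n_blocks

    def write(self) -> int:
        # Pick the least-worn block for the next erase/program cycle.
        victim = min(range(len(self.erase_counts)),
                     key=self.erase_counts.__getitem__)
        self.erase_counts[victim] += 1
        return victim

leveler = WearLeveler(n_blocks=100)
for _ in range(10_000):
    leveler.write()

# With perfect leveling, 10,000 erases over 100 blocks means 100 each.
print(min(leveler.erase_counts), max(leveler.erase_counts))
```

Without leveling, a hot block written 10,000 times would be through its entire MLC erase budget; with leveling, every block has seen only 100 cycles.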
RagingDragon - Monday, November 10, 2008 - link
Well, hopefully they back up their databases regularly, no matter what HD or SSD they use.

Goto 10 - Monday, November 10, 2008 - link
I would imagine the RAID controller would hide the write latency as long as you have a full write cache with battery backup running.

Since the price has dropped on "real" RAID controllers and the cache is now 256/512 MB per card, you could go with two controllers, one for data and one for logging. That should give you the write speed you need, but it wouldn't really help with the low access speed of the reads if they're not cached, though.

I think you would get a lot more if you swapped in the X25-M on the data part than on the logging part.

But I'm eagerly waiting for a full SSD / Nehalem database article; hopefully it will change the battlefield like AMD did with its Opterons when they were launched and got to fight the P4-based Xeons.
JohanAnandtech - Monday, November 10, 2008 - link
"I think you would get a lot more if you swapped in the X25-M on the data part than on the logging part."

I don't think so. Average latency on the data disk is much lower, as the RAID controller is accessing the disks much more sequentially. We are seeing very low activity (a few KB/s) and 50 MB/s at the end of the test. I am pretty sure those 50 MB/s are very sequential writes to our RAID-0 of six disks. That RAID-0 set (probably capable of 500 MB/s *) would not have any problem with a sequential 50 MB/s. I don't see how the SSD could make a difference there. In this case more spindles do help.

(*) http://it.anandtech.com/IT/showdoc.aspx?i=3147&... (gives a rough idea)
rmlarsen - Monday, November 10, 2008 - link
Yes, the combination of low latency and non-volatility of SSDs is really a game changer for database applications.

It does trouble me, though, that you seem surprised to (re-)discover that low write latency to the WAL is absolutely critical to the performance of a transactional database. In fact, the proof is right in your own numbers:
quote: "* typical average I/O latency is 0.23 ms (90%), with about 10% spikes of 7 to 12 ms
That reassured us that our transaction log disk was not a bottleneck"
No, that shows exactly that your disk latency is the limit: if these numbers are in the right ballpark, the average latency is at least (0.9 × 0.23 + 0.1 × 7) ≈ 0.9 ms, which limits the number of transactions per second to ~1100. Your performance is limited by the 10% of transactions that actually incur a disk-related latency.
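Spelling that arithmetic out, using only the figures quoted above:

```python
# Weighted-average commit latency implied by the quoted figures:
# 90% of log writes at 0.23 ms, 10% spiking to ~7 ms (low end of 7-12 ms).
avg_latency_ms = 0.9 * 0.23 + 0.1 * 7.0          # ~0.907 ms

# If each transaction in a serial commit stream must wait for its log
# write, throughput is capped at roughly 1 / latency commits per second.
max_tps = 1000.0 / avg_latency_ms                # ~1100 tps

print(f"avg latency: {avg_latency_ms:.3f} ms -> ~{max_tps:.0f} tps")
```

The rare 7 ms spikes dominate the average: the 10% slow path contributes 0.7 ms of the ~0.9 ms total, which is the point being made.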
Anyhow, thanks for posting these measurements.
JohanAnandtech - Monday, November 10, 2008 - link
I am trying to understand every step of the transaction, but I don't think it is that simple.

I cannot imagine that a thread would actually wait for the disk to say "it is finished"? After all, for the thread, the disk is just a block device it writes to, and there is no further communication?

The surprise was that the current advice of "get enough spindles" is not accurate. I would have expected that, as long as the disks can keep up, the user threads would send their transactions to the disk and forget about them, as a kind of async operation.

Secondly, I would have expected our BBU-protected RAID card, which is a really fast card (*), to take care of disk latency as long as the disks can keep up.
(*) http://it.anandtech.com/cpuchipsets/intel/showdoc....
rmlarsen - Monday, November 10, 2008 - link
I think your assumption about asynchronous writes is correct for the data part (on some databases), but not for the transaction log. To ensure ACID compliance, the thread writing the transaction log HAS to wait for the disk controller to say "finished". I believe that is what the "flush" part of "innodb_flush_log_at_trx_commit" refers to.

As you and others point out, the controller cache can mostly hide the latency, but if the cache runs full, your thread will have to wait for one or more rotational periods of the disk (4 ms). This appears to happen ~10% of the time for you. As you increase the size of the controller cache and/or the rate at which the controller can flush it to disk (which depends on both the transfer speed and the latency of the disks themselves), the fraction of transactions having to wait drops, as you also see in your experiment. I wonder if allocating more than one spindle to the transaction log file (in your non-SSD setup) would actually help in this regard?
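The "has to wait for the disk to say finished" mechanism is just an fsync-style flush on the log file. A minimal sketch, assuming a POSIX-like system (the transaction record and file name are made up for illustration; this mirrors what innodb_flush_log_at_trx_commit=1 implies, not InnoDB's actual code):

```python
import os
import tempfile

def durable_commit(log_fd: int, record: bytes) -> None:
    """Append a log record and block until it is on stable storage."""
    os.write(log_fd, record)   # lands in the OS page cache...
    os.fsync(log_fd)           # ...returns only once it has hit the disk;
                               # THIS wait is the commit latency discussed.

# Hypothetical log file just for the demo.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    path = tmp.name

fd = os.open(path, os.O_WRONLY | os.O_APPEND)
durable_commit(fd, b"txn-42 COMMIT\n")
os.close(fd)

with open(path, "rb") as f:
    data = f.read()
os.remove(path)
print(data)
```

A battery-backed controller cache hides this wait by acknowledging the flush from cache; once the cache is full, the fsync caller eats the full rotational latency, which matches the ~10% spikes above.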
JohanAnandtech - Tuesday, November 11, 2008 - link
"I wonder if allocating more than one spindle to the transaction log file (in your non-SSD setup) would actually help in this regard?"

Do you have a suggestion on how this could best be done? RAID-10 with a stripe size of 16 KB (= log writes)?
rmlarsen - Tuesday, November 11, 2008 - link
Sorry, but I don't have concrete experience with this setup, so your guess is probably better than mine. Anything involving RAID-1 (and by extension RAID-10) might not be optimal, since the controller has to wait for both disks in the pair to commit a block, and therefore the average latency is higher. I think you'll just have to experiment: RAID-0, RAID-10, different block sizes, etc.

RAID-0 is probably not acceptable from a reliability standpoint, but at least it should give you an idea of whether anything can be gained by going this route. I look forward to hearing what you find.
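The RAID-1 point (a mirrored write completes only when the slower disk finishes, so the mean is pulled up) is easy to check with a quick simulation. The uniform 3 to 5 ms latency distribution is an arbitrary assumption just to make the effect visible:

```python
import random

# A mirrored write finishes when the SLOWER of the two disks commits,
# so mean mirrored latency is E[max(A, B)] >= E[A] for a single disk.
# The 3-5 ms uniform latency distribution is an assumption for the demo.
random.seed(1)

def disk_latency_ms() -> float:
    return random.uniform(3.0, 5.0)

N = 100_000
single = sum(disk_latency_ms() for _ in range(N)) / N
mirrored = sum(max(disk_latency_ms(), disk_latency_ms()) for _ in range(N)) / N

print(f"single disk: {single:.2f} ms, mirrored pair: {mirrored:.2f} ms")
```

With this distribution the single disk averages ~4.0 ms while the mirrored pair averages ~4.33 ms, so mirroring the log trades a measurable slice of commit latency for redundancy.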
CU - Monday, November 10, 2008 - link
Is this a new AMD CPU? I thought all the Core i7 stuff had already been released.

RagingDragon - Monday, November 10, 2008 - link
The desktop i7s have been released, but the server variants haven't been released yet. And of course, AMD's Shanghai is coming soon too, so we can't be completely certain which processor it is... ;)

JarredWalton - Monday, November 10, 2008 - link
Probably not for the server market... I can't say for sure.

davepermen - Monday, November 10, 2008 - link
I'd like to see how the Intel SLC SSD would perform.

JohanAnandtech - Monday, November 10, 2008 - link
I am not able to answer that question accurately, but my first estimate would be that the difference would be rather small. I don't really see how SLC would improve the already spectacularly low write latencies much more.

JohanAnandtech - Tuesday, November 11, 2008 - link
I stand corrected: apparently, SLC drives have up to three times lower write latency.

Devzero - Monday, November 10, 2008 - link
Is this a problem only on MySQL, or would MS SQL Server/Oracle/DB2/etc. have the same problems?

bollux78 - Sunday, November 9, 2008 - link
It seems like software will never be optimized enough: in this case for SSD write latencies, and in other cases for the raw throughput of HDDs. We will always find new bottlenecks and weak points, as development will never cover all possible aspects.

JohanAnandtech - Monday, November 10, 2008 - link
The main objective of my blog post was not to point out a "weakness" of MySQL. More important was the fact that we were surprised to find that the mindset of "make sure you have enough spindles" for your transaction logs is not really accurate.

It is also important for our CPU tests: as CPUs get faster, it gets harder to keep the CPU load high enough to make a meaningful comparison. It is so horribly easy to screw up server testing :-).