Linux Client TCP Stack Slower Than Windows

Conventional wisdom says that Linux has a better TCP stack than Windows.  But with the current latest Linux and the current latest Windows (or even Vista), there is at least one aspect where this is not true.  (My definition of better is simple: which one is faster.)

Over the past year or so, researchers have proposed to adjust TCP’s initial congestion window (initcwnd) from its current value (2 packets, or ~4KB) up to about 10 packets.  These changes are still being debated, but it looks likely that a change will be ratified.  But even without official ratification, many commercial sites and commercially available load-balancing software have already increased initcwnd on their systems in order to reduce latency.

Back to the matter at hand – when a client makes a connection to a server, there are two variables which dictate how quickly the server can send data to the client.  The first is the client’s “receive window”.  The client tells the server, “please don’t exceed X bytes without my acknowledgement”, and this is a fundamental part of how TCP controls information flow.  The second is the server’s cwnd, which, as stated previously, is generally the bottleneck and is usually initialized to 2 packets.
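To make the interaction between these two variables concrete, here is a back-of-the-envelope sketch in Python.  It is a model, not a simulator – the 1460-byte segment size and the doubling-per-round-trip slow-start rule are my simplifying assumptions – but it counts how many round trips a server needs to deliver a response when each round trip can carry at most min(cwnd, rwnd) bytes:

```python
# Back-of-the-envelope model of TCP slow start (a sketch, not a simulator).
# Each round trip, the server may send at most min(cwnd, rwnd) bytes;
# cwnd then doubles (classic slow start), while rwnd stays fixed here.

MSS = 1460  # typical Ethernet segment payload, in bytes

def round_trips(payload_bytes, initcwnd_pkts, rwnd_bytes):
    cwnd = initcwnd_pkts * MSS
    sent = 0
    rounds = 0
    while sent < payload_bytes:
        sent += min(cwnd, rwnd_bytes)
        cwnd *= 2
        rounds += 1
    return rounds

# A 40KB web response under two scenarios:
print(round_trips(40_000, 2, 64 * 1024))   # big rwnd: the server's cwnd is the limit
print(round_trips(40_000, 10, 6 * 1024))   # 6KB rwnd: the client is the limit
```

Note what the second line shows: with a 6KB receive window, even a generous initcwnd of 10 cannot help, because the client’s advertised window becomes the bottleneck.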

In the long-ago past, TCP clients (like web browsers) would specify receive-window buffer sizes manually.  But these days, all modern TCP stacks adjust window sizes dynamically based on measurements from the network, and applications are advised to leave them alone, since the computer can do it better.  Unfortunately, the defaults on Linux are too low.

On my systems, with a 1Gbps network, here are the initial window sizes.  Keep in mind your system may vary, as each of these TCP stacks dynamically changes the window size based on many factors.

Vista:  64KB
Mac:    64KB
Linux:    6KB

6KB!  Yikes!  Well, the argument can be made that there is no need for the Linux client to use a larger initial receive window, since servers are supposed to abide by RFC2581.  But there really isn’t much downside to using a larger initial receive window, and we already know that many sites benefit from a large cwnd today.  The net result is that when the server is legitimately trying to use a larger cwnd, web browsing on Linux will be slower than web browsing on Mac or Windows, which don’t artificially constrain the initial receive window.
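You can get a rough look at your own system’s default with a few lines of Python.  One caveat: SO_RCVBUF is the kernel’s receive buffer for the socket, which bounds the advertised receive window but is not identical to it – Linux, for instance, reserves part of the buffer for bookkeeping and scales the advertised window dynamically.

```python
# Query the kernel's default TCP receive buffer for a fresh socket.
# This buffer bounds the advertised receive window, but the two are not
# identical: the stack reserves part of it and scales the window over time.
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
rcvbuf = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
s.close()
print(f"default SO_RCVBUF: {rcvbuf} bytes (~{rcvbuf // 1024}KB)")
```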

Some good news – a patch is in the works to allow users to change the default, but you’ll need to be a TCP whiz and install a kernel change to use it.  I don’t know of any plans to change the default value on Linux yet.  Certainly if the cwnd changes are approved, the default initial receive window must also be changed.  I have yet to find any way to make Linux use a larger initial receive window without a kernel change.

Two last notes: 

1) This isn’t theoretical.  It’s very visible in network traces to existing servers on the web that use larger-than-2 cwnd values.  And you don’t hit the stall just once, you hit it for every connection which tries to send more than 6KB of data in the initial burst.

2) As we look to make HTTP more efficient by using fewer connections (SPDY), this limit becomes yet-another-factor which favors protocols that use many connections instead of just one.  TCP implementors lament that browsers open 20-40 concurrent connections routinely as part of making sites load quickly.  But if a connection has an initial window of only 6KB, the use of many connections is the only way to work around the artificially low throttle.

There is always one more configuration setting to tweak.

Chrome: Cranking Up The Clock

Over the past couple of years, several of us have dedicated a lot of time to Chrome’s timer system. Because we do things a little differently, this has raised some eyebrows. Here is why and what we did.

Goal
Our goal was to have fast, precise, and reliable timers. By “fast”, I mean that the timers should fire repeatedly with a low period. Ideally we wanted microsecond timers, but we eventually settled for millisecond timers. By “precise”, I mean we wanted the timer system to work without drift – you should be able to monitor timers over short or long periods of time and still have them be precise. And by “reliable”, I mean that timers should fire consistently at the right times; if you set a 3.67ms timer, it should be able to fire repeatedly at 3.67ms without significant variance.
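You can measure two of these properties – speed and reliability – with a simple probe.  This sketch times a batch of short sleeps and reports the average overshoot; on a system with a coarse 15ms clock, a 1ms sleep historically overshot badly, while most modern systems land within a millisecond or two:

```python
# Measure how precisely this system can honor a short sleep.
# The average overshoot approximates the effective timer granularity.
import time

def average_overshoot(requested_s, iterations=50):
    overshoots = []
    for _ in range(iterations):
        start = time.perf_counter()
        time.sleep(requested_s)
        # sleep() is guaranteed to wait at least requested_s; the excess
        # is the scheduler/clock overshoot we care about.
        overshoots.append(time.perf_counter() - start - requested_s)
    return sum(overshoots) / len(overshoots)

avg = average_overshoot(0.001)
print(f"average overshoot on a 1ms sleep: {avg * 1000:.3f}ms")
```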

Why?
It may be surprising to hear that we had to do any work to implement these types of timers. After all, timers are a fundamental service provided by all operating systems. Lots of browsers use simpler mechanisms and they seem to work just fine. Unfortunately, the default timers really are too slow.

Specifically, Windows timers by default will only fire with a period of ~15ms. While processor speeds have increased from 500MHz to 3GHz over the past 15 years, the default timer resolution has not changed.  And at 3GHz, 15ms is an eternity.

This problem does affect web pages in a very real way. Internally, browsers schedule time-based tasks to run a short distance in the future, and if the clock can’t tick faster than 15ms, that means the application will sleep for at least that long. To demonstrate, Erik Kay wrote a nice visual sorting test. Due to how Javascript and HTML interact in a web page, applications such as this sorting test use timers to balance execution of the script with responsiveness of the webpage.

John Resig at Mozilla also wrote a great test for measuring the scalability, precision, and variance of timers. He conducted his tests on the Mac, but here is a quick test on Windows.

In this chart, we’re looking at the performance of IE8, which is similar to what Chrome’s timers looked like prior to our timer work. As you can see, the timers are slow and highly variable. They can’t fire faster than ~15ms. 

[Chart: IE8 timer performance]

A Seemingly Simple Solution
Internally, Windows applications are often architected on top of Event Loops. If you want to schedule a task to run later, you must queue up the task and wake your process later. On Windows, this means you’ll eventually land in the function WaitForMultipleObjects(), which is able to wait for UI events, file events, timer events, and custom events.  (Here is a link to Chrome’s central message loop code.)  By default, the internal timer for all wait-event functions in Windows is 15ms. Even if you set a 1ms timeout on these functions, they will only wake up once every 15ms (unless non-timer related events are pumped through them).
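The shape of the problem is easier to see in a cross-platform sketch (Python here rather than Win32, so `threading.Event.wait()` plays the role of WaitForMultipleObjects): the loop blocks in a single wait call with a timeout equal to the delay until the next scheduled task.  If the OS rounds that timeout up to its clock tick, every short timer in the loop fires late.

```python
# A miniature event loop: a deadline-ordered task queue plus one blocking
# wait. Event.wait() stands in for Win32's WaitForMultipleObjects; if the
# OS rounds the timeout up to a 15ms tick, every short timer fires late.
import heapq
import itertools
import threading
import time

class MiniLoop:
    def __init__(self):
        self._tasks = []               # heap of (deadline, seq, task)
        self._seq = itertools.count()  # tie-breaker so tasks never compare
        self._wakeup = threading.Event()

    def post_delayed(self, delay_s, task):
        deadline = time.monotonic() + delay_s
        heapq.heappush(self._tasks, (deadline, next(self._seq), task))
        self._wakeup.set()             # force the wait to re-evaluate

    def run_until_idle(self):
        while self._tasks:
            timeout = self._tasks[0][0] - time.monotonic()
            if timeout > 0:
                self._wakeup.clear()
                self._wakeup.wait(timeout)   # the one blocking wait
            while self._tasks and self._tasks[0][0] <= time.monotonic():
                _, _, task = heapq.heappop(self._tasks)
                task()

loop = MiniLoop()
loop.post_delayed(0.005, lambda: print("timer fired"))
loop.run_until_idle()
```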

To change the default timer, applications must call timeBeginPeriod(), which is part of the multimedia timers API. This function changes the clock frequency and is close to what we want.  Its lowest granularity is still only 1ms, but that is a lot better than 15ms. Unfortunately, it also has a couple of seriously scary side effects. The first side effect is that it is system wide. When you change this value, you’re impacting global thread scheduling among all processes, not just yours. Second, this API also affects the system’s ability to get into its lowest-power sleep states.

Because of these two side effects, we were reluctant to use this API within Chrome. We didn’t want to impact any process other than a Chrome process, and all of the possible impacts of the API were nebulous.  Unfortunately, there are no other APIs which could make our message loop work quickly. Although Windows does have a high-performance cycle counter API, that API is slow to execute [1], has bugs on some AMD hardware [2], and has no effect on the system-wide wait functions.

Justifying timeBeginPeriod
At one point during our development, we were about to give up on using the high resolution timers, because they just seemed too scary.  But then we discovered something. Using WinDbg to monitor Chrome, we discovered that every major multi-media browser plugin was already using this API. And this included Flash [3], Windows Media Player, and even QuickTime.  Once we discovered this, we stopped worrying about Chrome’s use of the API.  After all – what percentage of the time is Flash open when your browser is open?  I don’t have an exact number, but it’s a lot. And since this API affects the system globally, most browsers are already running in this mode.

We decided to make this the default behavior in Chrome.  But we hit another roadblock for our timers.

Browser Throttles and Multi-Process
With the high-resolution timer in place, we were now able to set events quickly for Chrome’s internals.  Most internal delayed tasks are long timers and didn’t need this feature, but there are a half dozen or so short timers in the code, and these did materially benefit. Nonetheless, the one which matters most – the timer backing the browser’s setTimeout and setInterval functions – did not yet benefit. This is because our WebKit code (and other browsers do this too) was intentionally preventing any timer from sustaining a tick faster than 10ms.

There are probably several reasons for the 10ms timer in browsers. One was simply convention. But another is that some websites are poorly written, and will set timers to run like crazy.  If the browser attempts to service the timers, this can spin the CPU, and who gets the bug report when the browser is spinning? The browser vendor, of course.  Even though the real bug is in the website and not the web browser, it is the browser that has to address the issue.

But the third, and probably most critical, reason is that most single-process browser architectures can become non-responsive if you allow websites to loop excessively with 0-millisecond delays in their JavaScript. Remember that browsers are generally written on top of Event Loops.  If the slow JavaScript interpreter is constantly scheduling a wakeup through a 0ms timer, this clogs the Event Loop which also processes mouse and keyboard events. The user is left with not just a spinning CPU, but a basically hung browser.  While I was able to reproduce this behavior in single-process browsers, Chrome turned out to be immune – and the reason was Chrome’s multi-process architecture. Chrome puts the website into a separate process (called a “renderer”) from the browser’s keyboard and mouse handling process.  Even if we spin the CPU in a renderer, the browser remains completely responsive, and unless the user is checking her Task Manager, she might not even notice.

So the multi-process architecture was the enabler. We wrote a simple test page to measure the fastest time through the setTimeout call and verified that a tight loop would not damage Chrome’s responsiveness.  Then, we modified WebKit to reduce the throttle from 10ms to 1ms and shipped the world’s peppiest beta browser: Chrome 1.0beta.
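The throttle itself is essentially a one-line clamp.  Here is a hypothetical sketch of the logic (the constant names are invented; the real code lives in WebKit’s DOM timer scheduling):

```python
# Sketch of a browser-style minimum-delay clamp (names are hypothetical).
# A page asking for setTimeout(f, 0) actually waits the browser's
# minimum tick: 10ms by old convention, 1ms in Chrome 1.0beta,
# and 4ms in Chrome today.
MIN_TIMER_INTERVAL_MS = 10.0

def clamped_delay_ms(requested_ms):
    # Short requests are raised to the floor; long timers pass through.
    return max(requested_ms, MIN_TIMER_INTERVAL_MS)

print(clamped_delay_ms(0))     # a 0ms setTimeout still waits 10ms
print(clamped_delay_ms(100))   # unchanged
```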

Real World Problems
Our biggest fear with shipping the product was that we would identify some website which was spinning the CPU and annoying users.  We did identify a couple of these, but they were relatively obscure sites. Finally, we found one which mattered – a small newspaper known as the New York Times. The NYTimes is a well-constructed site – they just ran into a little bug with a popular script called prototype.js, and this hadn’t been an issue before Chrome cranked up the clock. We filed a bug, but we had to change Chrome too. At this point, with a little experimentation we found that increasing the minimum timer from 1ms to 4ms seemed to work reasonably well on most machines. Indeed, to this day, Chrome still uses a 4ms minimum tick.

Soon, a second problem emerged as well. Engineers at Intel pointed out that Chrome was causing laptops to consume a lot more power. This was a far more serious problem and harder to fix.  We were not concerned much about the impact on desktops, because Flash, Windows Media Player, and QuickTime were already causing this to be true.  But for laptops, this was a big problem. To mitigate it, we started tapping into the Windows Power APIs to monitor when the machine is running on battery power. So before Chrome 1.0 shipped out of beta, we modified it to turn off fast timers if it detects that the system is running on batteries. Since we implemented this fix, we haven’t heard many complaints.

Results
Overall, we’re pretty happy with the results.  First off, we can look at John Resig’s timer performance test. In contrast to the default implementation,  Chrome has very smooth, consistent, and fast timers: 

[Chart: Chrome timer performance]

Finally, here is the result at the Visual Sorting Test mentioned above.  With a faster clock in hand, we see performance doubles. 

[Chart: visual sorting test results]

Future Work
We’d still like to eliminate the use of timeBeginPeriod.  It is unfortunate that it has such side effects on the system. One solution might be to create a dedicated timer thread, built atop the machine cycle counter (despite the problems with QueryPerformanceCounter), which wakens message loops based on self-calculated, sub-millisecond timers. This sounds trivial, but if we forget any operating system call which is stuck in a wait and don’t manually wake it, we’ll have janky timers. We’d also like to bring the current 4ms timer back down to 1ms. We may be able to do this if we better detect when web pages are accidentally spinning the CPU.

From the operating system side, we’d like to see sub-millisecond event waits built in by default which don’t use CPU interrupts or otherwise prevent CPU sleep states. A millisecond is a long time.

1. Although written in 2003, the data in this article is still relatively accurate: Win32 Performance Measurement Options
2. http://developer.amd.com/assets/TSC_Dual-Core_Utility.pdf
3. Note:  The latest versions of Flash (10) no longer use timeBeginPeriod.
NOTE: This article is my own view of events and does not reflect the views of my employer.

Velocity Conference 2009

I’ll be presenting as part of a discussion called What Makes Browsers Performant at the Velocity 2009 Conference, on June 23rd.  I’ve got limited time, but I’ll give an overview of how we approach performance in Google Chrome, detail some of the key areas in performance which make Chrome stand out, share some performance numbers never before shared, and hopefully squeeze in a must-see demo or two.

I’m a developer, not a marketer, so this will be an entertaining, technical talk, with no spin and no “marketecture”!  As a bonus, I promise to tell at least 2 good jokes.  If you don’t laugh, you get your money back.  Ok – that’s not true, ask the conference people about that.

If you haven’t signed up yet for Velocity you can use the coupon code VEL09FSP to get a 15% discount on tickets.

Javascript Faster and Slower

Several articles have been written about the latest in Javascript performance.  Here are some interesting points:

DownloadSquad:
“Chrome 2 beats Safari 4 like a rented mule”

CNet:
“The upshot: Chrome wins both tests handily, with Firefox in second place on Sunspider and Safari in second place on the V8 benchmark.”

Also interesting is that Firefox’s TraceMonkey Javascript engine may be falling behind.  Numerous articles have opined that Firefox 3.1’s ship date is in jeopardy due to TraceMonkey-related bugs.  But new data also confirms that Javascript in Firefox 3.1 beta 3 is markedly slower than in Firefox 3.1 beta 2.  The performance loss is palpable – Firefox lost 20% in performance from beta 2 to beta 3.  The problem may be that as the bugs have piled up in TraceMonkey, the fixes to ensure stability have eroded the performance gains initially boasted by the team.  It will be interesting to see Firefox’s final performance numbers when it ships out of beta.

Chrome Multi-Process Performance

One of the features in Chrome is that it is a multi-process browser.  To most people, that doesn’t mean much.  I could tell you that using multiple processes improves security, performance, and memory management, but you’ll probably yawn.  Here is my attempt to look at one angle of multi-process performance with a demo.


Process Priorities

Operating Systems like Microsoft Windows generally support the notion of running different processes at different priorities.  Because Chrome isolates each “tab” into its own process, Chrome can tell the operating system which tabs are important and which are less important.  When you switch to a tab, Chrome automatically tells the operating system to lower the priority of the tab which moved to the background, and raise the priority of the one which moved to the foreground.  Other browsers can’t do this because those browsers run all tabs in a single process.
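Here is a hedged sketch of the idea.  On Windows, Chrome uses the Win32 priority APIs; the snippet below is a POSIX analogue using the nice value (the function name is my own, and raising niceness needs no privileges, while lowering it does):

```python
# POSIX sketch of moving the current process "to the background" by
# raising its nice value. On Windows the analogous knob is
# SetPriorityClass(); this snippet only demonstrates the concept.
import os

def move_to_background(extra_niceness=5):
    # os.nice(n) adds n to the current niceness and returns the new value.
    return os.nice(extra_niceness)

before = os.nice(0)            # read current niceness without changing it
after = move_to_background()
print(f"niceness: {before} -> {after}")
```

With the foreground process left at normal priority, the scheduler will always run it ahead of the niced background process when both want the CPU.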

Why does this matter?  There are two primary reasons. 

First, when you have many tabs open, the tab you are actively working with will get the most CPU resources.  A background tab can continue to run, but it will never slow down your foreground work – because the operating system ensures that the higher priority processes always run first, even if it means starving the background process.  This keeps your browser responsive and snappy no matter how many tabs you’ve opened.

Second, by lowering the priority of unused tabs, Chrome is being nicer to other applications.  Whether your other application is Outlook, Word, Firefox, or even a game, Chrome’s background tabs cannot slow down those other applications because Chrome has intentionally yielded its priority to other applications.  Chrome is the only browser which does this.  Ironically, by making Chrome’s background tabs run slower, it makes your system faster!


Quick Demo

Let’s run Chrome 1.0.  In one tab, I’m running the v8 benchmark.  In another tab, I’m running my CPU spinner page which eats up lots of CPU.  If these two pages run concurrently, the browser’s score on the benchmark will be lower because the browser is doing two things at once.  However, if Chrome properly lowers the priority of the background task, then the benchmark score should be unaffected.  I’m running these tests on my laptop – a single-processor machine.

First, here is the V8 Benchmark for Chrome.  Chrome scored a 1333 in this quick run when nothing else was running.

[Screenshot: Chrome V8 benchmark, score 1333]

Next, we run the CPU spinner in a background tab, and run the benchmark in the foreground.  Chrome’s performance is unaffected by the background work, and in this run scores 1345.

[Screenshot: Chrome V8 benchmark with background CPU spinner, score 1345]

Let’s try the same thing with Firefox 3.

First, a dry run for the benchmark.  Firefox scores 165 on this test.

[Screenshot: Firefox V8 benchmark, score 165]

Now, with Chrome running the CPU spinner in the background, Firefox is not impacted.  Firefox still gets a score of 157, even though Chrome is using 100% of the CPU in a background process!  Priorities actually work.

[Screenshot: Firefox V8 benchmark with Chrome’s CPU spinner in the background, score 157]

Finally, let’s see what happens if Chrome didn’t do this.  This time, we’ll let Firefox run the background tab and see if it affects Chrome’s benchmark score.  Sure enough, Chrome’s score drops to 762 (it was 1333).  Firefox degraded Chrome’s performance by nearly 50%.

[Screenshot: Chrome V8 benchmark with Firefox running the background tab, score 762]


Other Performance Benefits

Startup performance is also enhanced with this feature – especially when you start up with multiple tabs.  In this case, Chrome will lower the priority of the background tabs, and only the foreground tab will have high priority.  You get the foreground tab first (which is what you want), and the secondary tabs fill in as CPU becomes available.


Conclusion

When single process applications like Firefox, Internet Explorer or Safari run web pages in background tabs, they can reduce your foreground application performance by as much as 50%!  However, because Chrome uses background process priorities, it has almost zero impact on foreground applications.  This lets the user get his work done and not worry that idle applications are in the way.

All part of what you get with a multi-process browser.

Chrome Channel Changer

I get a lot of questions about the latest releases of Chrome which are available and how to get it.  Chrome releases a little differently from other software.

If you just download the default version of Chrome, you’ll get the “stable” version, and you’ll automatically be kept up to date with the latest “stable” releases.  New releases will come out only when the software is ready – it may be several months between versions.

If you want to see features more quickly, you can switch to the “beta” version.  Again, you’ll be automatically kept up to date with the latest beta versions.  These versions are generally stable (we constantly run a plethora of tests to keep the code in reasonably good shape), but they are less tested than a stable release.  New releases may come out every month or so.

Finally, if you really want to see new features as they come in, you can subscribe to the “dev” channel.   Again, you’ll be automatically updated as new developer releases are available.   New releases may come out every few weeks.  These builds are still tested, but users should be aware that these releases will be the least stable.

To select which version of Chrome you want, you’ll need to run the Chrome “channel changer”.  You can get a copy of it here, or you can read more about it from here.  Once you’ve picked your channel, restart your browser, and soon you’ll have the version to which you’ve subscribed.

Here is a screenshot of the channel changer:

[Screenshot: Chrome channel changer]

Google Chrome Released

Google Chrome shipped today!  If you didn’t try Chrome because it was in beta, you can now download a finished and supported product.

Some people associate Google with never-ending betas.  GMail, for instance, is still in beta.  Remaining in beta was never the intention for Chrome; we always had a simple goal to take Chrome out of beta as soon as we had data to prove that it contained enough features, stability, and performance that real users would be happy with it as a primary browser.  We hope we’re at that point.

Our goal is to move quickly with new features and fixes for Chrome.  More needs to be done and more is coming.  If you’ve got comments or suggestions, be sure to let us know.

PS:  I do not speak officially for my employer.