How does Solaris 2.6 improve performance stats and Web performance?
We fill you in on all the new performance and measurement enhancements
Solaris 2.6 is out this month. It has a lot of new features, and apart from performance improvements for Web servers and databases, there are some extremely useful new performance measurements provided. (3,400 words)
A lot, a whole lot, way too much for me to cover in this column. I'll concentrate on performance improvements and new performance measurements.
Solaris 2.6 is a different kind of release from Solaris 2.5 and Solaris 2.5.1. Those releases were tied to very important hardware launches -- UltraSPARC support in Solaris 2.5 and Ultra Enterprise Server support in Solaris 2.5.1. With a hard deadline you have to keep functionality improvements under control, so there were relatively few new features. Solaris 2.6 is not tied to any hardware launch. New systems released this summer all run an updated version of 2.5.1 as well as Solaris 2.6. The current exception is the Enterprise 10000 (Starfire) which was not a Sun product early enough (Sun brought in the development team from Cray during 1996) to have Solaris 2.6 support at first release. Later this year an update release of Solaris 2.6 will include support for the E10000. Because Solaris 2.6 had a more flexible release schedule and fewer hardware dependencies it was possible to take longer over the development and add far more new functionality.
Some of the projects weren't quite ready for Solaris 2.5 (like large file support), so they ended up in Solaris 2.6. Other projects like the integration of Java 1.1 were important enough to delay the release of Solaris 2.6 by a few months. There are other documents on www.sun.com that describe most of the new features (see Resources below), so I'll concentrate on explaining some of the performance tuning that was done for this release and tell you about some small but useful changes to the performance measurements that sneaked into Solaris 2.6. Some of them were filed as request for enhancements (RFEs) by myself and Brian Wong over the last few years. (Brian has a feature story, "The TPC-C database benchmark -- What does it really mean?" in SunWorld this month.)
Web server performance
This is the most dramatic performance change in Solaris 2.6.
Multiprocessor scalability is now excellent, and that multiplies up
the good performance on one and two CPU systems that we had already
with Solaris 2.5.1. At the low end, the Ultra 2/2300 has two 300-MHz
CPUs with 2 MB caches, giving it a hardware boost of about 50
percent over the Ultra 2/2200 with 200-MHz CPUs and 1 MB caches. The
first processor has a useful increase in performance over Solaris
2.5.1, but with Solaris 2.6 the second processor is now contributing
almost as much performance, with no internal contention to slow it
down. The results on larger systems are harder to interpret, using
different server software, processor modules, and network interface
types. The essential message is very clear however. If you use
Solaris 2.6 you can throw a lot of CPUs at this problem and get good
additional performance from every one of them.
The essential difference is that Solaris 2.6 has extremely good multiprocessor scalability for TCP connection intensive workloads like Web service. No other vendor has demonstrated anything more than poor scalability to four CPUs; Solaris 2.6 has good scalability to 10 CPUs and beyond. This is the result of a large sustained effort by a team of engineers at SunSoft. They rewrote the locking strategies in TCP, IP, and streams, building on the in-kernel socket code that was introduced as part of Solaris 2.5.1/ISS. The benefit applies to any Web server code, although the large increase in kernel efficiency exposes the relative performance of different Web servers. With previous releases the kernel's TCP/IP stack barely scaled to two CPUs for connection-intensive workloads, so differences between server code were masked, and there was no performance improvement on large multiprocessor systems. These results show that the new Solaris Web Server (SWS1.0) that comes with server editions of Solaris 2.6 is the most efficient, with Netscape's Enterprise Server also performing well. Internal tests have shown that SWS is faster than the Zeus server used for many published SPECweb96 benchmarks, followed by Netscape then Apache.
The message should be obvious. Upgrade busy Web servers to Solaris 2.6 as soon as you can. Check out the features of SWS1.0 to see if you can use it (see Resources). In this release it has no server API, but does have a flexible security management system, so it could be a good upgrade for a basic Apache setup.
Database server performance
Database server performance was already very good and scaled well
with Solaris 2.5.1. There is always room for improvement though, and
several changes have been made to increase efficiency and
scalability even further in Solaris 2.6. If you look at the recently
published TPC benchmarks you will see that they have all used
Solaris 2.6. TPC rules say that the products used must ship within
six months. There are a couple of features worth mentioning. The
first is a transparent increase in efficiency on UltraSPARC
systems. The intimate shared memory segment used by most databases
is now mapped using 1 MB pages, rather than lots of 8 KB pages. This
greatly reduces the load on the memory management unit (MMU).
Intimate shared memory is an existing optimization which causes the
memory to be locked into RAM at a fixed address, and have its MMU
translations shared by all processes. See the SHM_SHARE_MMU option to shmat(2).
The second new feature is direct I/O. This enables a database table
that is resident in a filesystem to bypass the filesystem buffering
and behave more like a piece of raw disk. This benefit does not show
up in TPC benchmarks, as they are always run using raw disk for
maximum efficiency, but for many real-world installations that use
filesystems for administrative convenience, the performance
improvement can be dramatic. It makes the most difference on
write-intensive workloads. All I/O must be block aligned. If it is
not, then UFS buffering is used to hold the unaligned data. Some new
mount_ufs(1M)
options enable and control the direct I/O
features. For even higher performance the optional Veritas VxFS
filesystem is now a supported Sun product. It also has a direct I/O
capability, but its extent-based on-disk layout gives further
performance advantages over the UFS indirect block scheme.
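As a rough illustration, direct I/O can be switched on for a whole UFS filesystem at mount time. The device and mount point names here are made up for the example, and you should check the forcedirectio option name against the mount_ufs(1M) manual page on your release before relying on it:

# mount -F ufs -o forcedirectio /dev/dsk/c1t2d0s6 /db01

or, to make it permanent, an entry in /etc/vfstab along these lines:

/dev/dsk/c1t2d0s6  /dev/rdsk/c1t2d0s6  /db01  ufs  2  yes  forcedirectio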
New and improved performance measurements
A collection of RFEs had built up over several years, asking for
better measurements in the operating system and improvements for the
tools that display the metrics. Brian Wong and I filed some of them,
others came from database engineering and from customers.
These RFEs have now been implemented -- so I'm having to think of
some new ones! You should be aware that Sun's bug tracking tool has
three kinds of bug in it: problem bugs, RFEs, and Ease of Use (EOU)
issues. If you have an idea for an improvement, or think that
something should be easier to use, you can help everyone by taking
the trouble to call up Sun Service and asking them to register it.
It may take a long time to appear in a release, but it will take
even longer if you don't tell anyone!
The improvements we got this time include new disk metrics, new
iostat
options, tape metrics, client side NFS mount
point metrics, network byte counters, and detailed process memory
usage measurements.
Disk metrics
Disk configurations have become extremely large and complex on big
server systems. A maximally configured E10000 supports several
thousand disk drives, but even dealing with a few hundred is a
problem. When large numbers of disks are configured the overall
failure rate also increases. It can be hard to keep an inventory of
all the disks, and tools like Solstice Symon depend upon parsing
messages from syslog to see if any faults are reported. The size of
each disk is also growing. When more than one type of data is stored
on a disk, it becomes hard to work out which disk partition is
active. A series of new features has been introduced to help solve
these problems. New iostat options are provided to present these
metrics. One option (iostat -M
) shows throughput in
MB/s rather than KB/s, which is useful for hardware RAID units on
high-end systems. Another option (-n
) translates disk
names into a much more useful form so you don't have to deal with
the "sd43b" format, you get "c1t2d5s1." This makes it much easier to
keep track of per-controller load levels in large configurations.
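As a quick sketch of how the new options might be combined (assuming they stack with the existing -x option, as the examples later in this article suggest), you could watch the load on controller 1 in megabytes per second like this:

% iostat -xnM 30 | egrep 'device|c1t'

The -n format puts the c1t2d5s1-style name in the device column, so a simple egrep on the controller number is enough to pick out one controller's disks, plus the header line.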
Tape metrics
Fast tapes now match the performance impact of disks. We recently
ran a tape backup benchmark to see if there were any scalability or
throughput limits in Solaris, and we were very pleased to find that
the only real limit is the speed of your disks and tape drives. The
final result was an Oracle database backup rate of 1 terabyte per
hour. This works out at about 350 megabytes per second, which was as fast as
the disk subsystem we had configured could go. To sustain this rate
we used every tape drive we could lay our hands on, including 24
StorageTEK Redwood tape transports, which run at around 15 MB/s
each. We ran this test using Solaris 2.5.1, but there are no
measurements of tape drive throughput in Solaris 2.5.1. Tape metrics
have now been added to Solaris 2.6, thereby closing one of my RFEs
which was originally filed a few years ago, and now you can finally
see which tape drive is active, the throughput, average transfer
size, and service time for each tape drive. Thanks to Henry Newman
of Instrumental
(http://www.instrumental.com)
for raising this issue with me in the first place.
Tapes are instrumented the same way as disks; they appear in
sar
and iostat
automatically. Tape
read/write operations are instrumented with all the same measures
that are used for disks. Rewind and scan/seek are omitted from the
service time.
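For example, while a backup is running you could watch just the tape drives, which show up as stNN entries (as in the iostat examples in the next section). This is only a sketch, not a definitive recipe:

% iostat -x 30 | egrep 'device|^st'

This prints the column headers plus one line per tape drive every 30 seconds, so you can watch the throughput and service time of each transport.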
Some new iostat
options
The output format and options of sar(1)
are fixed by the generic
Unix standard SVID3, but the format and options for iostat
can be
changed. In Solaris 2.6, existing iostat
options are unchanged, and
apart from extra entries that appear for tape drives and NFS mount
points (described later), anyone storing iostat
data from a mixture
of Solaris 2 systems will get a consistent format. There are new
options that extend iostat
as follows:
-E    full error stats
-e    error summary stats
-n    disk name and NFS mount point translation, extended service time
-M    MB/s instead of KB/s
-P    partitions only
-p    disks and partitions
Here are some examples of the new iostat output formats:
% iostat -xp
                                 extended device statistics
device       r/s  w/s   kr/s   kw/s wait actv  svc_t  %w  %b
sd106        0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0
sd106,a      0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0
sd106,b      0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0
sd106,c      0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0
st47         0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0

% iostat -xe
                                 extended device statistics        ---- errors ----
device       r/s  w/s   kr/s   kw/s wait actv  svc_t  %w  %b  s/w  h/w  trn  tot
sd106        0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0    0    0    0    0
st47         0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0    0    0    0    0

% iostat -E
sd106    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: SEAGATE  Product: ST15230W SUN4.2G  Revision: 0626  Serial No: 00193749
RPM: 7200  Heads: 16  Size: 4.29GB <4292075520 bytes>
Media Error: 0  Device Not Ready: 0  No Device: 0  Recoverable: 0
Illegal Request: 0  Predictive Failure Analysis: 0
st47     Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: EXABYTE  Product: EXB-8505SMBANSH2  Revision: 0793  Serial No:
New NFS metrics
Local disk and NFS usage are functionally interchangeable, so Solaris
2.6 was changed to instrument NFS client mount points as if they were
disks! NFS mounts are always shown by iostat
and sar
. Automounted directories come and go much more often
than disks do, so this may cause problems for
performance tools that don't expect the number of
iostat
or sar
records to change often. We
will have to do some work on the SE toolkit to handle this
properly.
The full instrumentation includes the wait queue for commands in the
client (biod wait
) that have not yet been sent to the
server, the active queue for commands currently in the server, and
utilization (%busy) for the server mount point activity level. Note
that unlike for disks, 100 percent busy does NOT indicate that the
server itself is saturated; it just indicates that the client always
has outstanding requests to that server. An NFS server is much more
complex than a disk drive and can handle a lot more simultaneous
requests than a single disk drive can.
The example shows off the new "-xnP" option, although NFS mounts appear in all formats. Note that the "P" option suppresses disks and shows only disk partitions. The "xn" option breaks down the response time "svc_t" into wait and active times, and puts the full device name at the end of the line so that long names don't mess up the columns. The "vold" entry is used to mount floppy and CD-ROM devices.
crun% iostat -xnP
                            extended device statistics
  r/s  w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 crun:vold(pid363)
  0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 serv-dist:/usr/dist
  0.0  0.5    0.0    7.9  0.0  0.0    0.0   20.7   0   1 serv-home:/export/home2/adrianc
  0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 serv-home:/var/mail
  0.0  1.3    0.0   10.4  0.0  0.2    0.0  128.0   0   2 c0t2d0s0
  0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t2d0s2
New network metrics
The standard SNMP MIB for a network interface is supposed to contain
IfInOctets and IfOutOctets counters that report the number of bytes
input and output on the interface. These were not measured by
network devices for Solaris 2, so the MIB always reported zero.
Brian Wong and I filed RFEs against all the different interfaces a
few years ago, and bugs were filed more recently against the SNMP
implementation. The result is that these counters have been added to
the "le" and "hme" interfaces in Solaris 2.6, and the fix has been
backported to Solaris 2.5.1 in patches 103903-03 (le) and 104212-04 (hme).
The new counters added were:
rbytes, obytes -- read and output byte counts
multircv, multixmt -- multicast receive and transmit byte counts
brdcstrcv, brdcstxmt -- broadcast byte counts
norcvbuf, noxmtbuf -- buffer allocation failure counts
% netstat -k | more
...
le0: ipackets 0 ierrors 0 opackets 0 oerrors 5 collisions 0
defer 0 framing 0 crc 0 oflo 0 uflo 0 missed 0 late_collisions 0
retry_error 0 nocarrier 2 inits 11 notmds 0 notbufs 0 norbufs 0
nocanput 0 allocbfail 0 rbytes 0 obytes 0 multircv 0 multixmt 0
brdcstrcv 0 brdcstxmt 5 norcvbuf 0 noxmtbuf 0
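If you just want the new byte counters, a little nawk will pull them out of the netstat -k stream. This is only a sketch based on the output format shown above; adjust the interface name pattern to suit your own system:

% netstat -k | nawk '/^(le|hme)[0-9]+:/ {nic = $1; sub(/:$/, "", nic)} /rbytes/ {for (i = 1; i < NF; i++) if ($i == "rbytes" || $i == "obytes") print nic, $i, $(i+1)}'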
An unfortunate by-product of this change is that a spelling mistake was corrected in the metrics for "le." The metric "framming" was replaced by "framing." Not many tools look at all the metrics, but the SE toolkit does, and if patch 103903-03 is loaded any SE script that looks at the network and finds an "le" interface fails immediately.
New and changed ndd parameters
tcp_conn_req_max replaced.
This value is well-known as it normally needs to be increased for
Web servers in older releases of Solaris 2. It no longer exists in
Solaris 2.6, and patch 103582-12 adds this feature to Solaris 2.5.1.
The change is part of a fix that prevents denial of service from SYN
flood attacks. There are now two separate queues of partially
complete connections instead of one.
tcp_conn_req_max_q
(default value 128) is the maximum
number of completed connections waiting to return from an accept
call as soon as the right process gets some CPU time.
tcp_conn_req_max_q0
(default value 1024) is the maximum
number of connections with handshake incomplete. A SYN flood attack
could only affect this queue, and a special algorithm makes sure
that valid connections can still get through.
The new values are high enough to not need tuning in normal use as a Web server.
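You check and change the new listen queue limits with ndd on /dev/tcp, just like the old variable. The default shown here is the one quoted above; the values being set are only examples of the sort of increase a very busy server might use, not recommendations:

# ndd /dev/tcp tcp_conn_req_max_q
128
# ndd -set /dev/tcp tcp_conn_req_max_q 1024
# ndd -set /dev/tcp tcp_conn_req_max_q0 4096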
ip_addrs_per_if
is new in Solaris 2.5.1/ISS and 2.6.
This allows a larger number of virtual IP addresses to be hosted on
each interface. The default is 256 as before; it has been tested up
to 8192. Some work was also done to
speed up ifconfig
of large numbers of interfaces. You
configure a virtual IP address using ifconfig
on the
interface with the number separated by a colon.
ifconfig hme0:283 ...
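To host more than the default 256 addresses on one interface you need to raise ip_addrs_per_if first. As a sketch, assuming it behaves like the other IP tunables and is set with ndd on /dev/ip (something you should verify on your release), you would bump it from a startup script before the logical interfaces are configured:

# ndd -set /dev/ip ip_addrs_per_if 2048
# ndd /dev/ip ip_addrs_per_if
2048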
tcp_conn_hash_size
has moved!
There is a hash table structure that TCP uses to locate a TCP
connection control block. By default the table contains 256 entries,
but when running at sustained high connection rates, tens of
thousands of control blocks can be present. The hashed lookup
degrades to a linear search and wastes CPU cycles. For SPECweb96
tests the table size was set to 262144. You shouldn't normally set
it this high as it is a waste of RAM. The current size is shown at
the start of the read-only tcp_conn_hash
display using ndd.
This variable was introduced as an ndd
variable in
2.5.1/ISS, and could be changed online. In Solaris 2.6 the need for
multiprocessor scalability removed the lock that previously allowed
it to be changed online. The variable is now set in
/etc/system
as tcp:tcp_conn_hash_size. Changing it requires a reboot, and the
value is rounded up to the next power of two.
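Here is a minimal sketch of the new procedure; the 8192 value is just an example for a busy server, not a recommendation. Add a line to /etc/system (the asterisk line is a comment) and reboot:

* bigger TCP connection hash table, rounded up to a power of two at boot
set tcp:tcp_conn_hash_size=8192

After the reboot you can confirm the size, which is reported at the start of the read-only tcp_conn_hash display:

# ndd /dev/tcp tcp_conn_hash | head -1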
netstat
now non-invasive.
The netstat -s option dumps out all the network protocol statistics. In
previous releases netstat
grabbed a global lock that
caused additional contention in TCP. With the new scalable TCP some
work was done so that you can run netstat
without
locking TCP, so it no longer slows Web servers down.
Process memory usage
There is a new /proc/<pid>/<metric> directory structure that allows you to use
open/read/close rather than open/ioctl/close to read data from
/proc. It can be seen using ls
.
% ls /proc/5436
./       cred     lpsinfo  map       rmap    usage
../      ctl      lstatus  object/   root@   watch
as       cwd@     lusage   pagedata  sigact  xmap
auxv     fd/      lwp/     psinfo    status
The new xmap data provides extended mappings that show, for each segment, how much memory is mapped, how much is resident, and how much of that is shared or private. This is an excellent way to figure out memory sizing. If you want to run 100 copies of a process, you can look at one copy, see how much private memory it uses, and multiply that by 100. This facility is based on work done by Richard McDougall, who joined our group earlier this year.
% /usr/proc/bin/pmap -x 5436
5436:   /bin/csh
 Address   Kbytes Resident Shared Private Permissions        Mapped File
00010000      140      140    132       8 read/exec          csh
00042000       20       20      4      16 read/write/exec    csh
00047000      164       68      -      68 read/write/exec    [ heap ]
EF6C0000      588      524    488      36 read/exec          libc.so.1
EF762000       24       24      4      20 read/write/exec    libc.so.1
EF768000        8        4      -       4 read/write/exec    [ anon ]
EF790000        4        4      -       4 read/exec          libmapmalloc.so.1
EF7A0000        8        8      -       8 read/write/exec    libmapmalloc.so.1
EF7B0000        4        4      4       - read/exec/shared   libdl.so.1
EF7C0000        4        -      -       - read/write/exec    [ anon ]
EF7D0000      112      112    112       - read/exec          ld.so.1
EF7FB000        8        8      4       4 read/write/exec    ld.so.1
EFFF5000       44       24      -      24 read/write/exec    [ stack ]
--------   ------   ------ ------  ------
total Kb     1128      940    748     192
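Based on the output format above, a nawk one-liner gives a quick and rough sizing estimate from the Private column total. For this csh the answer is that 100 copies would need on the order of 19 MB of additional private memory, plus a single copy of the shared pages:

% /usr/proc/bin/pmap -x 5436 | nawk '/^total/ {print "private KB per copy:", $NF, "-> 100 copies need about", $NF * 100 / 1024, "MB"}'
private KB per copy: 192 -> 100 copies need about 18.75 MB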
Wrap up
I work in the enterprise server division at Sun, so it's really good
for us that Web server performance now scales, and an E4000 is now
the "sweet spot" for high-end Web server performance rather than a
cluster of Ultra 2s. Overall, despite the exponential growth of the
Internet, we seem to be matching or exceeding the performance needs
of Web servers. The situation now is that network bandwidth is the
bottleneck again -- the E4000-based SPECweb96 result used two 622-Mbit
ATM interfaces and a bunch of 100baseT interfaces!
The new metrics need to be supported by performance tools vendors. Unfortunately, some of the new metrics were not included in the beta release versions of Solaris 2.6, so vendors will have to test on the final release of Solaris 2.6 before they know what the situation is. It is quite normal for performance tools to be amongst the least portable applications from one release to the next. Since Rich Pettit recently left Sun to work for a performance tool vendor (Capital Technologies -- http://www.captech.com) we have not been able to keep the SE Toolkit tracking Solaris 2.6. It will take us a while after the final release of Solaris 2.6 is available before we have a version of SE that works and supports the new metrics. I'll update you on our progress next month.
Resources
About the author
Adrian Cockcroft joined Sun in 1988, and currently works as a performance specialist for the Server Division of SMCC. He wrote Sun Performance and Tuning: SPARC and Solaris, published by SunSoft Press PTR Prentice Hall.
Reach Adrian at adrian.cockcroft@sunworld.com.