When you find out you have fragmentation, your next concern might
be, "How bad is it?" If the Disk Analysis Utility
reveals a Mean Fragments Per File (fragmentation rating) of
1.2 (badly fragmented) or more, you may be in trouble. You had
better do something about that fast, before the system
stops altogether.
If you think I am exaggerating, consider this: One site, with
a combined system/user disk averaging 4.9 fragments per file, required
nearly half an hour for each user to log on. This dropped to a
few seconds once the main disk was defragmented. Another system,
with an incredible 18.7 fragments per file, was literally unusable
until defragmented.
A fragmentation rating of 1.2 means there are 20% more pieces
of files on the disk than there are files, indicating perhaps
20% extra computer work needed. It should be pointed out that
these numbers are merely indicators. If only a few files
are badly fragmented while the rest are contiguous, and those
few fragmented files are never accessed, the fragmentation may
have no performance impact at all. On the other hand, if your
applications are accessing the fragmented files heavily, the performance
impact could be much greater than 20%. You have to look further
to be sure. For example, if there were 1,000 files and only one
of those files is ever used, but that one is fragmented into 200
pieces (20% of the total fragments on the disk), you have a serious
problem, much worse than the 20% figure would indicate. In other
words, it is not the fact that a file is fragmented that causes
performance problems, it is the computer's attempts to access
the file that degrade performance.
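If you want to see that arithmetic spelled out, here is a small
sketch in Python using the numbers from the example just given:

    # Fragmentation rating for the 1,000-file example above.
    total_files = 1000                  # files on the disk
    contiguous_files = 999              # each stored in a single piece
    pieces_in_the_one_used_file = 200   # the only file anyone reads

    total_fragments = contiguous_files + pieces_in_the_one_used_file
    rating = total_fragments / total_files
    print(rating)    # 1.199 -- about 1.2, "badly fragmented"

The rating looks like a mild 1.2, yet all 199 excess pieces belong
to the one file that is actually accessed, which is why the real
impact is far worse than the average suggests.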
To explain this properly, it is first necessary to examine how
files are accessed and what is going on inside the computer when
files are fragmented.
Here's a diagram of a disk:
This diagram represents one side of a single platter. The circles
represent tracks, though in reality there would be far more tracks
on one side of a platter. Within one track is a shaded strip representing
a file. Imagine a head on an arm, not much different from the
needle on the tone arm of a phonograph, moving from file to file
as the platter spins. The contents of the file can be scanned
from the disk in one continuous sweep merely by positioning the
head over the right track and then detecting the file data as
the platter spins the track past the head.
Now here is a diagram of a disk with one file broken into two parts:
In this case, the file is fragmented into two parts on the same
track. Thus, to access this file, the head has to move into position
as described above, scan the first part of the file, then suspend
scanning briefly while waiting for the second part of the file
to move under the head. Then the head is reactivated and the remainder
of the file is scanned.
As you can see, the time needed to read the fragmented file is
longer than the time needed to read the unfragmented (contiguous)
file. The exact time needed is the time to rotate the entire file
under the head, plus the time needed to rotate the gap
under the head. A gap such as this might add a few milliseconds
to the time needed to access a file. Multiple gaps would, of course,
multiply the time added. The gap portion of the rotation is wasted
time due solely to the fragmentation disease. Then, on top of
that, you have to add all the extra operating system overhead
required to process the extra I/Os.
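To put a rough number on that, here is a back-of-the-envelope
sketch in Python. The spindle speed and the size of the gap are
assumptions for illustration; real drives vary:

    # Rough cost of one rotational gap between two fragments.
    rpm = 3600                        # assumed spindle speed
    full_rotation_ms = 60_000 / rpm   # about 16.7 ms per revolution
    gap_fraction_of_track = 0.25      # suppose the gap spans a quarter of the track
    extra_delay_ms = full_rotation_ms * gap_fraction_of_track
    print(round(extra_delay_ms, 1))   # about 4.2 ms wasted on every read of this file

A few milliseconds sounds small until you multiply it by every gap
in every fragmented file, on every access, all day long.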
Now let's look at another disk:
In this case, the file is again fragmented into two parts. But
this time the two parts are on two different tracks. So, in addition
to the delay added by the rotation of the disk past the gap, we
have to add time for movement of the head from one track to another.
This track-to-track motion is usually much more time-consuming
than rotational delay, costing tens of milliseconds per movement.
Further, this form of fragmentation is much more common than the
gap form.
To make matters worse, the relatively long time it takes to move
the head from the track containing the first fragment to the track
containing the second fragment can cause the head to miss the
beginning of the second fragment, necessitating a delay for nearly
one complete rotation of the disk, waiting for the second fragment
to come around again to be read.
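Again as a rough, assumed-numbers sketch in Python (the seek time
is a plausible figure for disks of this class, not a measurement):

    # Rough cost when the second fragment is on a different track.
    seek_ms = 15.0            # assumed head-movement (seek) time
    full_rotation_ms = 16.7   # one revolution at the 3,600 RPM assumed earlier
    missed_rotation_ms = full_rotation_ms   # worst case: wait nearly a full turn
    extra_ms = seek_ms + missed_rotation_ms
    print(extra_ms)           # 31.7 -- roughly 32 ms added to one file access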
But the really grim news is this: files don't always fragment
into just two pieces. You might have three or four, or ten or
a hundred fragments in a single file. Imagine the gymnastic maneuvers
your disk heads are going through trying to collect up all the
pieces of a file fragmented into 100 pieces!
When it takes more than one I/O to obtain the data contained in
one (fragmented) file, this is known as a split transfer or
split I/O. When a file is fragmented into more than the
seven pieces that can be accommodated by a single file window,
and the eighth or later fragment is accessed, one or more retrieval
pointers are flushed from the window and it is reloaded with seven
more retrieval pointers. This is called a window turn.
When more than 70 pointers are required to map (indicate the location
of) a file in its header, a second (or third, or fourth) file
header is required. The name for that is a multi-header file.
Each of these fragmentation symptoms costs overhead, and each
one described costs much more than the one before.
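Using the figures just given (seven retrieval pointers per window,
roughly seventy map pointers per file header), a short Python sketch
shows how quickly these penalties pile up as the fragment count
grows. The counts assume the file is read straight through; the
exact thresholds vary with volume and header parameters:

    # Estimated window turns and file headers for a file with N fragments.
    import math

    def fragmentation_penalties(fragments, window_size=7, pointers_per_header=70):
        window_turns = max(0, math.ceil(fragments / window_size) - 1)
        file_headers = math.ceil(fragments / pointers_per_header)
        return window_turns, file_headers

    for n in (2, 8, 100):
        turns, headers = fragmentation_penalties(n)
        print(f"{n:4d} fragments -> {turns:2d} window turns, {headers} file header(s)")

A two-fragment file costs only a split transfer; at eight fragments
the window turns begin; at one hundred fragments you are paying for
fourteen window turns and a second file header on top of all the
split transfers.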
For every split transfer, the overhead of a second (or third,
or fourth, etc.) disk I/O transfer is added. For every window
turn, the overhead of reloading the window is added, on top of the
I/O required just to access the fragment. For every multi-header
file accessed, add to each I/O the overhead of reading a second
(or third, or fourth, etc.) file header from the INDEXF.SYS file.
On top of all that, extra I/O requests due to split I/Os and window
turns are added to the I/O request queue along with ordinary and
needful I/O requests. The more I/O requests there are in the I/O
request queue, the longer user applications have to wait for I/O
to be processed. This means that fragmentation causes everyone
on the system to wait longer for I/O, not just the user accessing
the fragmented file.
Fragmentation overhead certainly mounts up. Imagine what it is
like when there are 300 users on the system, all incurring similar
amounts of excess overhead.
Now let's take a look at what these excess motions and file access
delays are doing to the computer.
OpenVMS is a complicated operating system. It is complex because
it has a great deal of functionality built into the system, saving
you and your programmers the trouble of building that functionality
into your application programs. One of those functions is the
service of providing an application with file data without the
application having to locate every bit and byte of data physically
on the disk. OpenVMS will do that for you.
When a file is fragmented, OpenVMS does not trouble your program
with the fact; it just rounds up all the data requested and passes
it along. This sounds fine, and it is a helpful feature, but there
is a cost. In directing the disk heads to all the right tracks and
LBNs within each track, OpenVMS consumes system time. That's
system time that would otherwise be available to your
applications. Such time, not directly used for running your program,
is called overhead.
You can see overhead depicted graphically on your system by using
the MONITOR utility. Type this command:
$ MONITOR MODES
You should see a display that looks something like this:
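(The display below is only an illustration; the percentages are
invented for this discussion, and the exact layout varies from one
OpenVMS version to another.)

                     TIME IN PROCESSOR MODES

        Interrupt Stack          12
        Kernel Mode              25
        Executive Mode            6
        Supervisor Mode           2
        User Mode                40
        Compatibility Mode        0
        Idle Time                15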
The critical line of this display is the User Mode line.
That's the one that tells you how much of the VAX's computing
capacity is being used to run application programs. Everything
else is mostly overhead, unless you are running PDP-11 programs
in compatibility mode, in which case that
would have to be counted as productive (!) time as well.
Idle time, of course, is unused computer time, but that is a type
of overhead, isn't it? When you look at this display, you really
want to compare the User Mode value to the total of the
values above it. The modes higher in this table show you how much
of your computer's time is being spent doing work on your behalf,
other than running the application program itself. In my experience
as a System Manager, I have been fairly satisfied to see these
values split in a 2-to-1 ratio. That is, I expect to see two-thirds
of the system being used directly for running applications in
user mode, and one-third being consumed by overhead. If you see
more than one-third of the system spent on overhead, as in the
example above, you have a performance problem, and fragmentation
is a good place to look for the cause.
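If you care to check the arithmetic, a few lines of Python will do
it, using the invented percentages from the illustrative display
above:

    # Compare user-mode time with overhead, ignoring idle time.
    user_mode = 40
    overhead = 12 + 25 + 6 + 2   # interrupt stack + kernel + executive + supervisor
    busy = user_mode + overhead
    print(f"user {user_mode / busy:.0%}, overhead {overhead / busy:.0%}")
    # user 47%, overhead 53% -- well past the one-third guideline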
If there is fragmentation and system overhead is high, as indicated
by large values for Interrupt Stack and Kernel Mode, you probably
have a situation in which OpenVMS is spending a lot of extra time
processing I/O requests because two or three or more actual I/O
transfers have to be done to collect up all the pieces of fragmented
files. This adds up to a performance problem.
What's happening to your applications while all this overhead
is going on? Simple: Nothing. They wait.
Oh yes, the users. . . .
The users wait, too, but they do not often wait without complaining,
as computers do. They get upset, as you may have noticed.
The users wait for their programs to complete, while excess fragments
of files are chased down all over the disk. They wait for keyboard
response while the computer is busy chasing down fragments for other
programs that run between the user's keyboard commands. They wait
for new files to be created, while the operating system searches
for enough free space on the disk and allocates a fragment here,
a fragment there, and so on. They wait for batch jobs to complete
that used to get done faster on the same computer with the same
user load, before fragmentation robbed them of their machine time.
They even wait to log in, as the operating system wades through
fragmented command procedures and data needed by startup programs.
Even backup takes longer - a lot longer - and the users suffer
while backup is hogging the machine for more and more of "their"
time.
All the users know is this: The system is slow; you're in charge
of the system; it's all your fault. And they're right. If you
are the System Manager, you are responsible for the computer system
and its performance.
If management and finance people are aware of the problem, they
view it as paying for 100% of a computer system, but getting something
less for their money. The users are not only upset, they're getting
less work done and producing less income for the company. That's
bad, and it's your responsibility.
Something had better be done about it, and quickly.