CHAPTER 4

WHAT'S WRONG WITH
FRAGMENTATION?


When you find out you have fragmentation, your next concern might be, "How bad is it?" If the Disk Analysis Utility reveals a Mean Fragments Per File (fragmentation rating) of 1.2 (badly fragmented) or more, you may be in trouble. You had better do something about that fast, before the system stops altogether.

If you think I am exaggerating, consider this: one site, with a combination system/user disk averaging 4.9 fragments per file, required nearly half an hour for each user to log on. This dropped to a few seconds once the main disk was defragmented. Another system, with an incredible 18.7 fragments per file, was literally unusable until defragmented.

A fragmentation rating of 1.2 means there are 20% more pieces of files on the disk than there are files, indicating perhaps 20% extra computer work needed. It should be pointed out that these numbers are merely indicators. If only a few files are badly fragmented while the rest are contiguous, and those few fragmented files are never accessed, the fragmentation may have no performance impact at all. On the other hand, if your applications are accessing the fragmented files heavily, the performance impact could be much greater than 20%. You have to look further to be sure. For example, if there were 1,000 files and only one of those files is ever used, but that one is fragmented into 200 pieces (20% of the total fragments on the disk), you have a serious problem, much worse than the 20% figure would indicate. In other words, it is not the fact that a file is fragmented that causes performance problems, it is the computer's attempts to access the file that degrade performance.
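The arithmetic above is easy to check for yourself. Here is a small sketch (plain Python, nothing to do with OpenVMS itself) of the rating calculation, using the 1,000-file example:

```python
def mean_fragments_per_file(fragments_per_file):
    """Fragmentation rating: total file pieces divided by total files."""
    return sum(fragments_per_file) / len(fragments_per_file)

# 1,000 files: 999 contiguous (one piece each), one file in 200 pieces.
files = [1] * 999 + [200]
rating = mean_fragments_per_file(files)
print(f"rating = {rating:.3f}")   # about 1.2 -- "20% more pieces than files"

# Yet if that 200-piece file is the only one ever accessed, the rating
# the applications actually feel is 200, not 1.2.
print(mean_fragments_per_file([200]))
```

The same disk thus looks mildly fragmented on average and severely fragmented in practice, which is exactly why the rating is only an indicator.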

To explain this properly, it is first necessary to examine how files are accessed and what is going on inside the computer when files are fragmented.

What's Happening to Your Disks?

Here's a diagram of a disk:

Figure 4-1 Disk

This diagram represents one side of a single platter. The circles represent tracks, though in reality there would be far more tracks on one side of a platter. Within one track is a shaded strip representing a file. Imagine a head on an arm, not much different from the needle on the tone arm of a phonograph, moving from file to file as the platter spins. The contents of the file can be scanned from the disk in one continuous sweep merely by positioning the head over the right track and then detecting the file data as the platter spins the track past the head.

Now here is a diagram of a disk with one file broken into two parts:

Figure 4-2 Disk With File In Two Parts

In this case, the file is fragmented into two parts on the same track. Thus, to access this file, the head has to move into position as described above, scan the first part of the file, then suspend scanning briefly while waiting for the second part of the file to move under the head. Then the head is reactivated and the remainder of the file is scanned.

As you can see, the time needed to read the fragmented file is longer than the time needed to read the unfragmented (contiguous) file. The exact time needed is the time to rotate the entire file under the head, plus the time needed to rotate the gap under the head. A gap such as this might add a few milliseconds to the time needed to access a file. Multiple gaps would, of course, multiply the time added. The gap portion of the rotation is wasted time due solely to the fragmentation disease. Then, on top of that, you have to add all the extra operating system overhead required to process the extra I/Os.
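Those few milliseconds can be put in rough numbers. The sketch below assumes a hypothetical 3,600 RPM drive (60,000/3,600 = 16.67 ms per rotation); the figures are illustrative, not measurements from any particular disk:

```python
RPM = 3600
MS_PER_ROTATION = 60_000 / RPM   # 16.67 ms per full rotation (assumed drive)

def read_time_ms(file_fraction_of_track, gap_fractions=()):
    """Time to read a file occupying a fraction of one track, plus any
    same-track gaps the head must wait out between fragments."""
    return MS_PER_ROTATION * (file_fraction_of_track + sum(gap_fractions))

contiguous = read_time_ms(0.25)            # file spans a quarter of the track
fragmented = read_time_ms(0.25, [0.10])    # same file split by one 10% gap
print(f"contiguous: {contiguous:.2f} ms, fragmented: {fragmented:.2f} ms")
```

Every added gap simply adds its share of a rotation, all of it wasted time.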

Now let's look at another disk:

Figure 4-3 Two File Extents On Different Tracks

In this case, the file is again fragmented into two parts. But this time the two parts are on two different tracks. So, in addition to the delay added by the rotation of the disk past the gap, we have to add time for movement of the head from one track to another. This track-to-track motion is usually much more time-consuming than rotational delay, costing tens of milliseconds per movement. Further, this form of fragmentation is much more common than the gap form.

To make matters worse, the relatively long time it takes to move the head from the track containing the first fragment to the track containing the second fragment can cause the head to miss the beginning of the second fragment, necessitating a delay for nearly one complete rotation of the disk, waiting for the second fragment to come around again to be read.
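A rough model shows how badly a missed rotation hurts. Here the seek time (10 ms) and rotation speed (3,600 RPM) are assumed, hypothetical values chosen only to illustrate the effect:

```python
MS_PER_ROTATION = 60_000 / 3600      # 16.67 ms per rotation (assumed drive)

def cross_track_delay_ms(seek_ms, gap_fraction):
    """Delay to reach a second fragment that starts `gap_fraction` of a
    rotation after the end of the first, on a different track."""
    gap_ms = gap_fraction * MS_PER_ROTATION
    if seek_ms <= gap_ms:
        return gap_ms                # head arrives in time; just wait out the gap
    # Head arrived late: wait for the fragment to come around again.
    late_by = (seek_ms - gap_ms) % MS_PER_ROTATION
    return seek_ms + (MS_PER_ROTATION - late_by)

# A 1 ms seek makes it; a 10 ms seek misses and waits almost a full turn.
print(cross_track_delay_ms(1.0, 0.10))
print(cross_track_delay_ms(10.0, 0.10))
```

A two-millisecond gap thus balloons into nearly twenty milliseconds once a missed rotation is involved.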

But the really grim news is this: files don't always fragment into just two pieces. You might have three or four, or ten or a hundred fragments in a single file. Imagine the gymnastic maneuvers your disk heads are going through trying to collect up all the pieces of a file fragmented into 100 pieces!

Figure 4-4 File In Many Fragments

When it takes more than one I/O to obtain the data contained in one (fragmented) file, this is known as a split transfer or split I/O. When a file is fragmented into more than the seven pieces that can be accommodated by a single file window, and the eighth or later fragment is accessed, one or more retrieval pointers are flushed from the window and it is reloaded with seven more retrieval pointers. This is called a window turn. When more than 70 pointers are required to map (indicate the location of) a file in its header, a second (or third, or fourth) file header is required. The name for that is a multi-header file. Each of these fragmentation symptoms costs overhead, and each one described costs much more than the one before.
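The figures just given (seven retrieval pointers per window, 70 per header) can be turned into a rough count of these symptoms. This sketch makes the simplifying assumption that each fragment consumes exactly one retrieval pointer:

```python
import math

POINTERS_PER_WINDOW = 7    # pointers a single file window holds (per the text)
POINTERS_PER_HEADER = 70   # pointers a single file header maps (per the text)

def fragmentation_costs(fragments):
    """For a file in `fragments` pieces, estimate the extra I/Os (split
    transfers), window turns, and file headers described above."""
    split_ios = max(0, fragments - 1)
    window_turns = max(0, math.ceil(fragments / POINTERS_PER_WINDOW) - 1)
    headers = math.ceil(fragments / POINTERS_PER_HEADER)
    return split_ios, window_turns, headers

# The 100-fragment file from Figure 4-4:
print(fragmentation_costs(100))   # 99 extra I/Os, 14 window turns, 2 headers
```

One badly fragmented file, in other words, triggers all three symptoms at once.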

For every split transfer, add the overhead of a second (or third, or fourth, etc.) disk I/O transfer. For every window turn, add the overhead of reloading the window, on top of the I/O required just to access the fragment. For every multi-header file accessed, add to each I/O the overhead of reading a second (or third, or fourth, etc.) file header from the INDEXF.SYS file.

On top of all that, extra I/O requests due to split I/Os and window turns are added to the I/O request queue along with ordinary and needful I/O requests. The more I/O requests there are in the I/O request queue, the longer user applications have to wait for I/O to be processed. This means that fragmentation causes everyone on the system to wait longer for I/O, not just the user accessing the fragmented file.
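A minimal sketch makes the queue effect concrete. Assume (hypothetically) a single disk serving requests first-come-first-served at 25 ms apiece:

```python
SERVICE_MS = 25            # assumed average time to service one disk I/O

def wait_for_newcomer_ms(queue_len):
    """Time a new request waits behind `queue_len` queued requests."""
    return queue_len * SERVICE_MS

# Four ordinary requests ahead of you...
print(wait_for_newcomer_ms(4))    # 100 ms
# ...versus the same four, each split in two by fragmentation.
print(wait_for_newcomer_ms(8))    # 200 ms
```

Your own file may be perfectly contiguous, yet your wait doubles because of fragmentation in files you never touch.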

Fragmentation overhead certainly mounts up. Imagine what it is like when there are 300 users on the system, all incurring similar amounts of excess overhead.

What's Happening to Your Computer?

Now let's take a look at what these excess motions and file access delays are doing to the computer.

OpenVMS is a complicated operating system. It is complex because it has a great deal of functionality built into the system, saving you and your programmers the trouble of building that functionality into your application programs. One of those functions is the service of providing an application with file data without the application having to locate every bit and byte of data physically on the disk. OpenVMS will do that for you.

When a file is fragmented, OpenVMS does not trouble your program with the fact; it just rounds up all the data requested and passes it along. This sounds fine, and it is a helpful feature, but there is a cost. OpenVMS, in directing the disk heads to all the right tracks and LBNs (logical block numbers) within each track, consumes system time to do so. That's system time that would otherwise be available to your applications. Such time, not directly used for running your program, is called overhead.

You can see overhead depicted graphically on your system by using the MONITOR utility. Type this command:

$ MONITOR MODES

You should see a display that looks something like this:

Example 4-1 MONITOR MODES Display

The critical line of this display is the User Mode line. That's the one that tells you how much of the VAX's computing capacity is being used to run application programs. Everything else is mostly overhead, unless you are running PDP-11 programs in compatibility mode, in which case that would have to be counted as productive (!) time as well.

Idle time, of course, is unused computer time, but that is a type of overhead, isn't it? When you look at this display, you really want to compare the User Mode value to the total of the values above it. The modes higher in this table show you how much of your computer's time is being spent doing work on your behalf, other than running the application program itself. In my experience as a System Manager, I have been fairly satisfied to see these values split in a 2-to-1 ratio. That is, I expect to see two-thirds of the system being used directly for running applications in user mode, and one-third being consumed by overhead. If you see more than one-third of the system spent on overhead, as in the example above, you have a performance problem, and fragmentation is a good place to look for the cause.
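That 2-to-1 rule of thumb is simple enough to write down. The mode names below match the MONITOR MODES categories; the percentages are hypothetical, invented only to illustrate the check:

```python
def overhead_check(user, interrupt, kernel, executive, supervisor):
    """Apply the 2-to-1 rule of thumb: of the busy (non-idle) time,
    at least two-thirds should be user mode."""
    busy = user + interrupt + kernel + executive + supervisor
    overhead = busy - user
    healthy = user >= 2 * overhead
    return overhead / busy, healthy

# Hypothetical display: 40% user, 15% interrupt stack, 25% kernel,
# 8% executive, 2% supervisor (the rest idle).
frac, ok = overhead_check(user=40.0, interrupt=15.0, kernel=25.0,
                          executive=8.0, supervisor=2.0)
print(f"overhead fraction of busy time: {frac:.0%}, healthy: {ok}")
```

Here more than half the busy time is overhead, well past the one-third threshold, so this hypothetical system would warrant a look at fragmentation.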

If there is fragmentation and system overhead is high, as indicated by large values for Interrupt Stack and Kernel Mode, you probably have a situation in which OpenVMS is spending a lot of extra time processing I/O requests because two or three or more actual I/O transfers have to be done to collect up all the pieces of fragmented files. This adds up to a performance problem.

What's Happening to Your Applications?

What's happening to your applications while all this overhead is going on? Simple: Nothing. They wait.

What's Happening to Your Users?

Oh yes, the users. . . .

The users wait, too, but they do not often wait without complaining, as computers do. They get upset, as you may have noticed.

The users wait for their programs to complete, while excess fragments of files are chased up around the disk. They wait for keyboard response while the computer is busy chasing up fragments for other programs that run between the user's keyboard commands. They wait for new files to be created, while the operating system searches for enough free space on the disk and allocates a fragment here, a fragment there, and so on. They wait for batch jobs to complete that used to get done faster on the same computer with the same user load, before fragmentation robbed them of their machine time. They even wait to log in, as the operating system wades through fragmented command procedures and data needed by startup programs. Even backup takes longer - a lot longer - and the users suffer while backup is hogging the machine for more and more of "their" time.

All the users know is this: The system is slow; you're in charge of the system; it's all your fault. And they're right. If you are the System Manager, you are responsible for the computer system and its performance.

If management and finance people are aware of the problem, they view it as paying for 100% of a computer system, but getting something less for their money. The users are not only upset, they're getting less work done and producing less income for the company. That's bad, and it's your responsibility.

Something had better be done about it, and quickly.
