KERMIT CHECKPOINT RESTART CAPABILITY REQUIREMENTS

August 16, 1993

Author: Frank da Cruz
Columbia University
Internet: fdc@columbia.edu
Telephone: W: 854-3508, H: 866-4894

Prepared By:
Computer Sciences Corporation
M/C 265
3160 Fairview Park Dr.
Falls Church, VA 22042

Table of Contents

INTRODUCTION
REQUIREMENTS DEFINITION
DESIGN
PARAMETERS
CONTROLLING THE CHECKPOINT FEATURE
NEGOTIATION OF CHECKPOINTING PROTOCOL
THE CHECKPOINT SYNC PACKET
THE FILE TRANSFER PHASE
SPACE-CHECKING
TERMINATING A SUCCESSFUL TRANSACTION
TAKING CHECKPOINTS
TEXT TRANSFER MODE
BINARY TRANSFER MODE
RECORDING CHECKPOINTS
THE CHECKPOINT REQUEST AND CONFIRMATION PACKETS
THE RECOVERY FILE
THE RECOVERY PROCESS
SERVER-MODE CONSIDERATIONS
SECURITY CONSIDERATIONS
MULTI-FILE TRANSFERS
AUTOMATIC REESTABLISHMENT OF CONNECTION
RECOVERY FROM AN INTERRUPTED TRANSFER THAT WAS NOT CHECKPOINTED
SUMMARY OF NEW COMMANDS AND PROTOCOL MESSAGES
PROJECT OVERVIEW
DEVELOPMENT AND IMPLEMENTATION
DOCUMENTATION
TESTING
TIMELINE
ESTIMATED BUDGET
LICENSING
REFERENCES

.c.INTRODUCTION

This report outlines a method for restarting a Kermit file transfer from a point of failure that should work correctly and dependably for all types of files independent of the underlying operating system and file system, plus a tentative implementation plan for MS-DOS, Windows, OS/2, UNIX, (Open)VMS, VM/CMS, MVS/TSO, and CICS. This is a preliminary discussion of the design, and an estimate of the cost to create this functionality.

Familiarity with Kermit file transfer protocol [1] is assumed, as well as with the operation of popular Kermit programs such as MS-DOS Kermit [2], C-Kermit [3] and IBM Mainframe Kermit [4].

.c.REQUIREMENTS DEFINITION

The fundamental requirement of this project is the addition of a restart-from-point-of-failure capability to Kermit file transfer protocol and software. This means that the transfer of a particular file can be resumed where it was interrupted, e.g.
by loss of connection, with a minimum of retransmission overhead, and with the resulting destination file exactly as it would have been if it had been transferred successfully without interruption. In particular:

1. A checkpoint/restart-capable Kermit program should be fully interoperable with a Kermit program that does not have this capability.

2. Recovery must work for both text and binary files.

3. Recovery methods must be workable between any pair of computer/operating-system platforms, and be easily adaptable to future systems.

4. Recovery must not require the two computers to have similar file formats.

5. The design must not lock out any popular type of computer or file system.

6. The design must not depend on specific capabilities that some computers or operating systems are likely to lack.

7. Automatic (unattended) recovery should be possible.

8. Manual (attended) recovery must be possible when automatic recovery is not.

9. The net result of recovery must be a received file identical to what would have been received in an uninterrupted transfer.

10. Within reason, the constraints of the checkpointing mechanism should not cause checkpointed transfers to fail in cases where non-checkpointed transfers would succeed, nor vice versa.

11. Neither Kermit program should make assumptions about the internal operation of the other, nor about the other's underlying file system.

12. Checkpointing should operate independently of the underlying communication and protocol settings. That is, it should work uniformly on serial and network connections, slow and fast connections, full and half duplex connections, 7- and 8-bit connections, with or without sliding windows or long packets, with or without text-file character-set conversion, and so on.

13. It is not desirable to invent a new notation or language for recovery information. Ordinary Kermit commands, parsable by the existing command parsers, should be used to record and recover checkpointing information.
Commands and terminology should be internally consistent within each Kermit software version, and uniform among versions.

To assure that our design meets these requirements, we will implement it on the following diverse platforms:

1. PCs with MS-DOS or Windows (MS-DOS Kermit) and PCs with OS/2 (C-Kermit). These computers have a sequential stream-oriented file system, in which text files consist of lines terminated by CRLF.

2. All computers running the UNIX operating system or any of its many variants as well as Data General MV-series computers running the AOS/VS operating system (C-Kermit). UNIX and AOS/VS have sequential stream-oriented file systems, in which text files consist of lines terminated by LF, and thus change size when they are transferred to (say) MS-DOS.

3. DEC VAX or Alpha AXP computers running the OpenVMS operating system (C-Kermit). OpenVMS has an extremely complex record-oriented file system, with many different record formats and file attributes. Both text and binary files almost always change size and format when transferred to non-VMS systems.

4. IBM mainframes with the VM/CMS, MVS/TSO, and CICS operating systems (IBM Mainframe Kermit-370). IBM mainframe operating systems have complicated record-oriented file systems, but with details and capabilities different from those of OpenVMS. In addition, text files are encoded in EBCDIC rather than ASCII-based codes. As with VMS, both text and binary files almost always change size and format when transferred to non-IBM-mainframe systems.

Thus, checkpoint/restart capability will be added to three separate Kermit software programs, each of which can be built for and/or executed on various different hardware platforms and/or software environments. These are, at present and for the foreseeable future, the three major Kermit software programs.
It is recognized that the immediate requirements of the contractor might not call for checkpoint/restart-capable Kermit software on all these platforms, but it is essential that we obtain operational proof-of-concept over a wide variety of computers and file systems to be reasonably certain that our design is adequate to cover all contingencies. Before checkpoint/restart capability can be added to a Kermit software program, the program must already include the following capabilities: - Basic Kermit file transfer protocol. - Attribute packets: file modification date/time, transfer mode (text/binary). - An interactive command parser. - The ability to execute commands from files. These capabilities are available in C-Kermit, MS-DOS Kermit, and IBM Mainframe Kermit. Note, in particular, that long packets, sliding windows, international character sets, single shifts, locking shifts, and other optional negotiated protocol features are neither required nor prohibited. In addition, in order to automatically establish and reestablish connections, a Kermit program must support: - Local-mode operation - Connection-establishment commands such as DIAL or TELNET - A script programming language (INPUT, OUTPUT, IF, GOTO, etc) These capabilities are available in MS-DOS Kermit and C-Kermit, but not in IBM Mainframe Kermit. IBM Mainframe Kermit is never the initiator of a connection. Finally, the underlying operating system, file system, and programming interface must provide the following capabilities: - To restart a transfer from the point of failure, the file sender should be capable of positioning its file pointer to a given byte or record within the input (source) file. Thus, the source file must be on a random-access device. - The file receiver is assumed to be able to append new material to the end of an existing file, or at least to be able to append two files together. 
The destination file need not be on disk -- it can also be a printer or other type of sequential output device.

- To keep crash-resistant recovery files, both the file sender and receiver must be capable of appending new material to the end of an existing file.

It is believed that every operating system offers these features.

.c.DESIGN

File transfer failures can be recoverable or unrecoverable. If the Kermit program can determine the reason for a protocol failure, it must set a return or status code accordingly, which can be tested to determine whether automatic recovery should be attempted. This will require:

- Assignment of standard error codes for transmission in error packets. These would be numeric strings at the head of the Error-packet field, which would cause no problems with Kermit programs which did not understand them, but which could be used by updated programs.

- Creation of some kind of status variable that can be queried by a script program, e.g. \v(recovery) in MS-DOS Kermit or C-Kermit. This variable would be set locally, or from an incoming E-packet's error code.

A recoverable failure is one that can be handled AUTOMATICALLY by a checkpoint-restart mechanism. These include:

- Loss of connectivity, e.g. a dialup or network connection that was dropped, but which can be reestablished a short time later.

- System failure, e.g. one of the two systems crashed for a short time.

An unrecoverable failure is one that can NOT be handled AUTOMATICALLY by a checkpoint-restart mechanism. Examples include:

- Destination disk filled up or storage quota exceeded.

- Incorrect communication or protocol settings that prevented the transaction from beginning successfully.

- Lack of sufficient transparency on the communication channel; for example, a device that changes modes when it receives a certain sequence of characters.

- A system, component, or connection method that disappeared forever, and similar "natural disasters".
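
The numeric error-code prefix proposed in this section could be handled roughly as in the following Python sketch. It is an illustration only: the report does not assign any code numbers, so the set RECOVERABLE_CODES and the function names are hypothetical. The sketch shows how an updated program might split a leading numeric string off the Error-packet data field and decide whether automatic recovery is worth attempting, while a field with no numeric prefix (from an older Kermit) is simply treated as having no code.

```python
import re

# Hypothetical code numbers; the report leaves the actual assignments open.
RECOVERABLE_CODES = {1, 2}   # e.g. 1 = connection lost, 2 = system crash (assumed)


def parse_error_field(field):
    """Split an E-packet data field into (code, message).

    The code is the numeric string at the head of the field, or None when
    the field carries no numeric prefix.
    """
    match = re.match(r"(\d+)\s*(.*)", field, re.S)
    if match is None:
        return None, field
    return int(match.group(1)), match.group(2)


def is_recoverable(field):
    """True only when the field carries a code listed as recoverable."""
    code, _ = parse_error_field(field)
    return code in RECOVERABLE_CODES


print(parse_error_field("2 Connection lost"))
print(is_recoverable("Disk full"))
```

A script-level status variable such as \v(recovery) could then be set from the result of a check like is_recoverable().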
Note that most unrecoverable failures can still be recovered manually, for example by fixing a broken computer, changing protocol or communication settings, or cleaning up a full disk.

.c2.PARAMETERS

Terminology:

- In any particular connection, the Kermit program that originated the connection is in LOCAL MODE, and the other Kermit program is in REMOTE mode. Similarly, one Kermit program is the file SENDER and the other is the file RECEIVER.

- The file being transferred is the SOURCE FILE from the SENDER's point of view, and is the DESTINATION FILE from the RECEIVER's point of view.

The information required to recover a failed transfer is the information necessary and sufficient to locate and verify the source and destination files, plus the information required for the sender to interpret the contents of the source file, and the receiver to interpret the contents of the incoming data packets:

- The TYPE OF TRANSFER: TEXT or BINARY.

- The fully qualified FILE SPECIFICATION of the source file: node, device, directory, version, etc, sufficient to locate the same file again. If a fully qualified name is not available, then the RELATIVE NAME plus, separately, the DEVICE, DIRECTORY, and any other necessary location information.

- The SIZE and MODIFICATION DATE AND TIME of the source file, to verify it has not changed.

- Kermit's LOCAL ACCESS METHOD for the source file: TEXT, BINARY, MACBINARY, V-BINARY, D-BINARY, LABELED, IMAGE, BLOCK, etc. These items depend on the Kermit implementation and the underlying file system.

- Any QUALIFIERS necessary for the source-file access method: ORGANIZATION (sequential, indexed, relative, random, etc), RECORD FORMAT (fixed, variable, variable with fixed header, stream CR, stream LF, stream CRLF), RECORD-LENGTH, CARRIAGE CONTROL, MARGINS, etc.

- For text-mode transfers, the source FILE CHARACTER-SET.

- For text-mode transfers, the TRANSFER CHARACTER-SET.
- The fully-qualified FILE SPECIFICATION of the destination file: node, device, directory, version, etc. or the relative name plus other location information.

- Kermit's LOCAL ACCESS METHOD for the destination file: TEXT, BINARY, MACBINARY, V-BINARY, D-BINARY, LABELED, IMAGE, BLOCK, etc.

- Any QUALIFIERS necessary for the local access method: ORGANIZATION, RECORD FORMAT, RECORD-LENGTH, CARRIAGE CONTROL, MARGINS, etc.

- The SIZE of the destination file, to tell whether it has changed.

- For text-mode transfers, the local FILE CHARACTER-SET of the destination file.

The information given above points up a minor inconsistency in Kermit command nomenclature. The command:

  SET FILE TYPE { TEXT, BINARY, ... }

actually does two things. It defines the local file access method, and, by implication, also the transfer mode. Examples:

  SET FILE TYPE TEXT      -- implies TEXT transfer mode
  SET FILE TYPE BINARY    -- implies BINARY transfer mode
  SET FILE TYPE IMAGE     -- implies BINARY transfer mode
  SET FILE TYPE V-BINARY  -- implies BINARY transfer mode
  SET FILE TYPE LABELED   -- implies BINARY transfer mode

Strictly speaking, these are separate issues. We might, for example, want to transfer a text file in binary mode, but using local access methods appropriate for text files. Or we might want to transfer a binary file in text mode in order to get CRLFs appended to each record. It is therefore worth distinguishing, at least conceptually, between the FILE TYPE and the TRANSFER MODE, and postulating (if not requiring) the availability of a new command:

  SET TRANSFER MODE { TEXT, BINARY }

.c2.CONTROLLING THE CHECKPOINT FEATURE

Checkpoint-restart capability might add perceptible overhead to file transfer operations. Obviously, every attempt will be made to ensure that the checkpoint-restart implementation is as efficient as possible, but the priority must be ironclad reliability.
As currently envisioned, however, checkpointing overhead will occur because separate recovery files must be maintained, files must be closed and opened repeatedly, and additional messages must be exchanged throughout file transfer. For this reason, and for compatibility with earlier Kermit software releases, this capability WILL NOT BE USED unless specifically requested. The command is:

  SET CHECKPOINT { ENABLED, DISABLED, ON }

ENABLED will be the default. It means "I will do checkpointing if requested" by the other Kermit. DISABLED means "I won't do it", period. ON tells your Kermit program to actively negotiate the use of checkpointing with another Kermit program. For checkpointing to take place, at least one of the Kermits must SET CHECKPOINT ON, and the other must SET CHECKPOINT ON or ENABLED. When recoverability is always the priority, SET CHECKPOINT ON can be included in the Kermit initialization file.

There must also be a control over how frequently checkpoints are taken:

  SET CHECKPOINT INTERVAL <n>

where <n> is the number of transmitted bytes at or after which a checkpoint should be taken. The default is implementation-dependent, and also dependent on the type and characteristics of the file. Let's say the nominal default is around 10K. At 2400 bps -- a common dialup transmission speed -- this amounts to about 45 seconds of transfer time. At 9600 bps, it's only about 11 seconds and, naturally, decreases with transmission speed.

Checkpoint information is kept in a separate "recovery file" by each transfer partner. The user should be allowed to specify the name of this file, even though this can complicate checkpointing setup for the user as well as the recovery process, particularly automated recovery. The advantage of this feature is that it allows multiple recoveries to be pending. For example, the user might have an automated procedure that connects to several hosts or services each night and transfers some files.
If one of these operations fails, it would be desirable to go on immediately to the next one, rather than wait the indefinite amount of time required to recover from the failed one, even if more than one transfer had failed.

The following command can be used to specify the name of the recovery file:

  SET CHECKPOINT RECOVERY-FILE <filename>

If this command is not given, an implementation-dependent default is used, which should be a fully qualified absolute pathname, so it can be found automatically in the event of an unattended restart. In practice, this would be a file of a certain name that is located (for example) according to the same rules as the initialization file. Examples:

  UNIX:    .kermrf in the user's home (login) directory (rf = recovery file)
  MS-DOS:  MSKERMIT.RF in the same directory as the MSKERMIT.INI file
  OS/2:    CKERMIT.RF in the same directory as the CKERMIT.INI file
  VMS:     CKERMIT.RF in the user's home directory
  VM/CMS:  KERMIT RF A1 (???)

NOTE: The recommended device for recovery files on PCs is the boot drive, since drive letters of other drives can change unexpectedly, e.g. when file servers are involved.

There are, of course, dangers in recording information in separate recovery files. For example, there might not be sufficient disk space for a recovery file. In particular, it will not be possible to send a file with checkpointing from a computer whose storage is completely full or write-protected; in such cases, the SET CHECKPOINT RECOVERY-FILE command allows the recovery file to be placed in a separate storage area.

More subtly, a recovery file might grow to fill available storage on the file sender, receiver, or both. Before proceeding, let's consider this situation. Suppose that a particular file transfer would have succeeded without checkpointing, but would fail with checkpointing because the recovery file filled up the disk, or there was an I/O error writing the recovery file, or there was some kind of checkpoint-related protocol error (e.g.
caused by a programming mistake). Should the transfer fail? The user should be given the choice. This can be accomplished with another SET CHECKPOINT command:

  SET CHECKPOINT ERROR-ACTION { PROCEED, QUIT }

The default action should be PROCEED, so that a file transfer will not fail simply because it is checkpointed. In this case, the transfer continues but checkpointing is canceled. When QUIT is elected, a checkpointing failure (e.g. failure to write to the recovery file) is fatal, and the transfer is canceled by an Error packet.

SET CHECKPOINT commands can be given to the file sender or the file receiver or both. Checkpointing may be initiated by either party to the file transfer. CHECKPOINT ERROR-ACTION QUIT, given to either party, is sufficient to stop a transfer when a checkpointing error occurs.

Finally, there should be a command:

  SHOW CHECKPOINT

This displays the current SET CHECKPOINT settings, and whether an active recovery file exists.

.c2.NEGOTIATION OF CHECKPOINTING PROTOCOL

The protocol initialization string (the data field of the S and I packets, and of their acknowledgements) contains the following new fields for checkpoint negotiation:

      10                                     new      new
 ---+-----------+-------+--------+--------+--------+--------+
 ...| CAPAS ... | WINDO | MAXLX1 | MAXLX2 | CHKPNT | CHKINT |
 ---+-----------+-------+--------+--------+--------+--------+

These fields are positional. The CAPAS field (capabilities mask), beginning at position 10, is extensible to multiple bytes by setting its low-order bit (currently it occupies only one byte). The WINDO byte is the first byte after the last CAPAS byte (we call this position CAPAS+1, currently byte 11). MAXLX1 is at CAPAS+2, and so forth.

The Attribute Packet Capability bit must be set in the capabilities mask. If it isn't, the following items are ignored and checkpointing is not done. If the Kermit program does not support lower-numbered fields (e.g.
WINDO, MAXLX1, MAXLX2), then their positions must be filled with blanks so that the CHKPNT field is at the CAPAS+4 position and the CHKINT field takes up the next three bytes. The new fields are encoded as follows:

1. CHKPNT, 1 byte, values:

     0 = WONT  I won't do it (SET CHECKPOINT DISABLED)
     1 = WILL  I will do it if asked (SET CHECKPOINT ENABLED)
     2 = DO    Please do it (SET CHECKPOINT ON)

   Anything else (including absence of this byte) is interpreted as WONT. These work as follows:

     Sender   Receiver   Checkpointing   Initiator
     WONT     (any)      No              None
     WILL     WONT       No              None
     WILL     WILL       No              None
     WILL     DO         Yes             Receiver
     DO       WONT       No              None
     DO       WILL       Yes             Sender
     DO       DO         Yes             Both

2. CHKINT, checkpoint interval: 3 bytes, containing a base-95 number, with digits in the normal offset-32 notation (SP = 0, ..., ~ = 94). Maximum value is 857374 = 95^3 - 1. If this field is missing from, or incomplete in, the receiver's ACK packet, or is zero (SP SP SP), checkpointing is not done. The protocol should allow any checkpoint interval at all, even an interval of one byte, but the implementation (e.g. the command parser) can prevent the user from selecting nonsensical values.

   The FILE SENDER sets this field to the largest value it can handle. For example, if the file sender is limited to 16-bit arithmetic, it might send a value of 65535. If the file sender has no particular limit on its checkpoint interval, it should set it to the maximum: 857374 (~~~).

   The FILE RECEIVER tells the file sender the checkpoint interval that should actually be used. This value must be no larger than the CHKINT value sent by the file sender. It may be any value equal to or less than the sender's value.

In cases where checkpointing is supported but not elected (i.e. CHKPNT = 0), the content of the CHKINT field is immaterial. However, if the CHKPNT field is present, then the CHKINT field is required too. In that case, the recommended content for the CHKINT field is "___" (three underscores) to allow easy (human) identification.
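
The negotiation rules above can be illustrated with a short Python sketch. This is not part of the proposed protocol text; it is a minimal model under stated assumptions: the function names are hypothetical, CHKPNT values are handled as plain integers (their on-the-wire character encoding is not modeled), and the three CHKINT digits are assumed to be sent most-significant digit first, which the text does not specify.

```python
WONT, WILL, DO = 0, 1, 2
BASE = 95
MAX_CHKINT = BASE ** 3 - 1          # 857374, encoded as "~~~"


def negotiate(sender, receiver):
    """Apply the CHKPNT table: return (checkpointing?, initiator)."""
    # Anything other than WILL or DO is interpreted as WONT.
    sender = sender if sender in (WILL, DO) else WONT
    receiver = receiver if receiver in (WILL, DO) else WONT
    if WONT in (sender, receiver) or DO not in (sender, receiver):
        return False, "None"
    if sender == DO and receiver == DO:
        return True, "Both"
    return True, ("Sender" if sender == DO else "Receiver")


def encode_chkint(value):
    """Encode an interval as three offset-32 base-95 digits (assumed MSD first)."""
    if not 0 <= value <= MAX_CHKINT:
        raise ValueError("checkpoint interval out of range")
    hi, rest = divmod(value, BASE * BASE)
    mid, lo = divmod(rest, BASE)
    return "".join(chr(digit + 32) for digit in (hi, mid, lo))


def decode_chkint(field):
    """Decode a CHKINT field; a missing or malformed field yields 0 (no checkpointing)."""
    if len(field) != 3 or any(not 32 <= ord(c) <= 126 for c in field):
        return 0
    value = 0
    for c in field:
        value = value * BASE + (ord(c) - 32)
    return value


def receiver_interval(sender_max, preferred):
    """The receiver may pick any value up to the sender's advertised maximum."""
    return min(sender_max, preferred)


print(negotiate(WILL, DO), repr(encode_chkint(MAX_CHKINT)))
```

For example, a receiver that prefers a 10240-byte interval but is offered a maximum of 4096 by the sender would answer with 4096.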
.c2.THE CHECKPOINT SYNC PACKET

At the beginning of a checkpointed (or recovery) file transfer, after the A-packet but before the first data packet, there is a CHECKPOINT SYNC packet. Its packet type is H. Its data field contains the following information:

  <length><Transfer ID><length><Checkpoint ID>

The <length> fields are single-character numbers in base-95 excess-32 notation. The <Transfer ID> is an identifier for this file transfer. This is a dynamically computed quantity that should be more-or-less globally unique, and so a many-digit date-and-time stamp, accurate to at least the second, would be a good choice, for example: 930808152832. The <Checkpoint ID> is described later, but on an "original" (first attempt, non-recovery) file transfer, it is the null string, i.e. its <length> field is 0 (SP).

If the CHECKPOINT SYNC packet fails to appear when expected -- that is, if a Data (D) or End-Of-File (Z) packet appears when an H packet is expected (this should not happen) -- the transaction is cancelled with an Error packet (if CHECKPOINT ERROR-ACTION is QUIT) or else checkpointing is disabled and the file transfer proceeds.

.c2.THE FILE TRANSFER PHASE

With checkpointing enabled, normal (i.e. non-recovery) file transfer proceeds as follows. For each file:

- The file sender sends the F packet.

- The file receiver acknowledges it.

- The file sender sends one or more A-packets.

- The file receiver acknowledges the A-packets, accepting or rejecting the file.

If the file is accepted:

- Upon receipt of the file acceptance notification (in the ACK to the A-packet), the file sender opens a new recovery file (overwriting any previous recovery file of the same name), computes the Transfer ID, writes it to the recovery file, writes the "prelude" (file name, settings, etc) to it (discussed below), closes it, and then sends a CHECKPOINT SYNC (H) packet with the Transfer ID and with a null (zero-length) Checkpoint ID.
If creation and initialization of the recovery file fails, the file sender first ensures that the recovery file is destroyed, and then sends an Error packet if CHECKPOINT ERROR-ACTION is QUIT, otherwise sends an H packet with a null (zero-length) Transfer ID to cancel checkpointing operations.

- Upon receipt of an H packet containing a null Transfer ID, the file receiver cancels checkpointing operations if its CHECKPOINT ERROR-ACTION is PROCEED, and ACKs the H packet, with an uppercase letter X occupying the data field of the ACK. If its CHECKPOINT ERROR-ACTION is QUIT, it responds with an E packet to cancel the entire file transfer.

- Upon receipt of an H packet containing a valid, non-null Transfer ID, the file receiver opens and initializes its own recovery file (deleting any previous recovery file), and ACKs the H-packet. The ACK contains the same Transfer ID and the receiver's checkpoint ID, which, on a non-recovery transfer, is also null. If the file receiver failed to open and initialize its recovery file, then, if CHECKPOINT ERROR-ACTION is PROCEED, it places an uppercase letter X in the data field of the ACK to the H packet; if it is QUIT, then an Error packet is sent.

- At this point, data packets will start to arrive (unless the source file is empty). The file receiver writes incoming file data out to a TEMPORARY FILE rather than to the real output file. (The temporary file, obviously, must be created in such a way as not to overwrite any existing files.)

- Checkpoints are taken and recovery files updated as described below.

- If a fatal error occurs during the data transfer phase, an error packet is sent and a status code should be set to indicate the cause of the failure, so the higher-level procedures can decide whether the failure is recoverable.

- Upon receipt of a Z packet, the file receiver takes the normal actions: closes the output file and responds with an ACK if and only if the file was closed successfully, otherwise with an E packet.
If possible, the partial destination file's modification date / time should be reset from the A-packet value each time the file is closed, to ensure that the destination file can be correctly identified should recovery be necessary. In any case, the date/time should be set when the file is successfully (fully) received and closed.

.c2.SPACE-CHECKING

An optional feature of Kermit protocol and software is the ability to check available disk space before agreeing to accept an incoming file. The file sender includes the file size (at best, an approximation, since it does not know what transformations will be done by the receiver); the receiver compares this number against available disk space, IF IT HAS THIS ABILITY (certain operating systems, notably UNIX and MVS/TSO, offer no good way to do this).

The use of temporary files and recovery files during checkpointing must be accounted for in the space calculation -- that is, the receiver must compare available space against the incoming file's size PLUS the negotiated checkpoint interval PLUS the estimated maximum size for the recovery file (if it is on the same storage device), with the customary allowance for expansion, depending on the transfer mode and operating systems involved.

.c2.TERMINATING A SUCCESSFUL TRANSACTION

At the end of a successful transaction (B packet sent and ACK'd), both recovery files can (and should) be deleted. Thus, recovery files are deleted at the beginning of each file transfer and at the end of the transaction (this prevents the final recovery file from remaining on disk after a transaction is completed successfully).

If the B packet is ACK'd but the ACK is never received, the sender can still delete its recovery file, because it knows the (last) file was received successfully, since the End-Of-File (Z) packet had already been ACK'd.

.c2.TAKING CHECKPOINTS

When should checkpoints be taken? We have to satisfy the constraints of both the sender and receiver.
Record-oriented file systems cannot be expected to write out a partial record, close the file, reopen it in append mode, and finish the partial record later. Thus checkpoints must be taken at record boundaries when one or both of the file systems involved is record-oriented. Text- and binary-mode transfers, however, must be handled in different ways.

.c2.TEXT TRANSFER MODE

We take it for granted that all computer operating systems are capable of writing out a record (line) to a text file, no matter what the record format. We do not assume that an operating system can write partial lines. Therefore, in text mode transfers, the file sender must send checkpoint requests only on record (line) boundaries. This means that the data packet preceding a checkpoint request might not be filled to capacity and, in fact, could be very short. This should cause no protocol or data-integrity problems, but will, of course, have a slight impact on performance.

If a text line is longer than the checkpoint interval, there is no choice but to postpone the checkpoint until the end of the record, because we cannot assume that the receiver can commit a partial record to disk.

Thus, in text mode, we view the checkpoint interval as a MINIMUM rather than a maximum, which simplifies matters quite a bit. If we had to send a checkpoint *before* the checkpoint interval, there would be a need for record-oriented lookahead, and we would still need special handling for the case in which a record was longer than the checkpoint interval. But note that this strategy also precludes the use of in-memory buffers in lieu of temp files, since there is no limit on the amount of data that might need to be stored in such a buffer.

.c2.BINARY TRANSFER MODE

Protocols like ZMODEM include a checkpoint-restart capability for binary files based on the assumption that the length, format, and layout of a binary file will be exactly the same on both ends. Nothing special happens during a normal file transfer.
To recover a binary-mode transfer, the file receiver sends the length of the partially-received destination file back to the file sender; the file sender positions its file pointer to the corresponding next byte in the source file and resumes sending from there. This method assumes that both systems have a stream-oriented file system in which the file length is recorded as an exact number of bytes and that a byte-oriented file pointer capability is available to the sender. There are numerous exceptions to this model. When transferring in binary mode, record boundaries will still be important if the file receiver has a record-oriented file system, and thus checkpoints should still occur only on record boundaries. But in this case, how does the file sender know when to send checkpoint requests? Conversely, the file sender might have a record-oriented file system, and can only restart a transfer from a record boundary. In the worst case, both systems are record-oriented, but use different record lengths. Assumptions: 1. On record-oriented systems, binary files have either fixed-length records or else a fixed-length "allocation unit" (e.g. blocksize). Discussion: CMS MODULEs and VMS object files are examples of binary files with variable length records. Each record includes a header giving its length. Normally, the record header is NOT considered part of the data. The Kermit protocol has a mechanism for dealing with such files, but this method has never been used because when such a file is sent to a stream-oriented file system, there is no way to preserve the record boundaries without also including the record headers. Therefore, all existing Kermit programs transfer such files by including the record headers as part of the data itself. The receiver is ignorant of the difference between files encoded this way and ordinary stream-binary files. 
To accomplish such transfers, the Kermit program on the record-oriented system is put into a special "local file mode", known only to itself, such as V-BINARY (VM/CMS) or LABELED (VMS). Files sent in such modes to non-record-oriented systems are said to be "archived", since the result contains structuring information as well as file data, and is, in general, not useful on the system to which it has been sent. Rather, it is designed to be sent back to the type of system on which it originated, where it can be restored to its original (useful) format.

2. One Kermit program cannot be expected to understand the archiving format of a different Kermit program.

3. Checkpoint requests can NOT be initiated by the file receiver.

Therefore archived files are transferred in regular binary mode, and if record length is an issue, it must be handled with a fixed number, whose value is to be determined.

Facts:

1. There is presently no mechanism in the A- or F-packet exchange for the file receiver to tell the sender the destination file's record length, blocksize, etc (let's call this the "allocation unit").

2. It does not make sense to do this in the S-packet exchange, because the allocation unit can change from file to file.

Therefore, we need to invent new syntax for the ACK to the A packet, in which the receiver informs the sender of its file allocation unit. This will be the new attribute tag '3' (ASCII 51). In the sender's attribute packet, this works in the normal way: the sender informs the receiver of its allocation unit:

  3<length><value>

e.g.:

  3#512

However, the treatment of this attribute in the receiver's ACK to the attribute packet must be different from how other items in the Attribute ACK are handled. Normally, the file receiver's ACK contains Y or N to accept or reject the file, respectively, followed by a list of attribute tags, but with no associated data. The '3' tag, however, will have to carry data in the Attribute ACK.
This is an ugly special case, but it is preferable to exchanging an extra packet to convey this information. The '3' tag is followed by a single-character base-94 offset-32 length field, and then a numeric value. A value of 0 means "I don't care", and a value of 1 means that the Kermit program is capable of writing one byte at a time to an output file (in practice, 0 and 1 would be equivalent). A value of 2 might be used by systems (like PRIME) that can do i/o only in "words" rather than bytes. Record-oriented systems would specify values like 80, 128, 512, 800, etc. NOTE: The effect of this field when received by Kermit programs that are not aware of it must be considered. Such Kermit programs will not understand the '3' and might misinterpret the subsequent data as attribute tags. Now, assuming we have a mechanism to allow the receiver to inform the sender of the destination file's allocation unit, the sender must compute a checkpoint interval that allows checkpoints to occur on record boundaries that the source and destination files share in common. This would be a number into which both the source and destination record lengths divide evenly, and which is also in the neighborhood of the desired checkpoint interval, e.g. 10240 for 512 and 80; in the worst case it would be the product of the two numbers. In the most common cases, e.g. UNIX, MS-DOS, etc, there are no records and therefore binary-mode checkpoints can occur anywhere at all. In the case where only one Kermit is record-oriented, the sender can choose any value close to the negotiated checkpoint interval that is a multiple of the record size. NOTE: The precise mechanism for binary mode checkpointing will require further study and refinement during the development stage. .c2.RECORDING CHECKPOINTS Checkpointing must be entirely consistent with sliding windows. 
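Before going further into how checkpoints are recorded, the interval arithmetic from the binary-mode discussion above can be made concrete. The following Python sketch is illustrative only and is not part of the proposed protocol: the function names are hypothetical, and it assumes that both allocation units are expressed in bytes and that the values 0 and 1 both mean "no record constraint". It picks the multiple of both record lengths that lies closest to the desired checkpoint interval.

```python
from math import gcd


def lcm(a: int, b: int) -> int:
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def checkpoint_interval(desired: int, sender_unit: int, receiver_unit: int) -> int:
    """Return the common multiple of both allocation units nearest to `desired`.

    Assumption: a unit of 0 ("I don't care") is treated like 1 (byte-at-a-time).
    """
    step = lcm(max(sender_unit, 1), max(receiver_unit, 1))
    # Round to the nearest multiple of `step`, but never below one step.
    multiples = max(1, (desired + step // 2) // step)
    return multiples * step
```

With record lengths of 512 and 80, the smallest shared boundary is 2560 bytes, so a desired interval of about 10000 yields 10240, the figure used in the text. When the two lengths have no common factor, the smallest shared boundary is their product, which is the worst case mentioned above.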
Checkpoint requests and confirmations should flow smoothly among the data packets, which means that checkpoint requests and confirmations can be widely separated in time. Since checkpoint requests and confirmations are separate packets, there can never be more than 31 of them in the window, since 31 is Kermit's maximum window size. In fact, there will always be at least one data packet between checkpoints, so no more than 16 checkpoint requests would ever be in the window. Each checkpoint is assigned a serial number, or ID, on which the two Kermit programs can synchronize during recovery. Since there can never be more than 16 checkpoint requests outstanding, the checkpoint ID ranges from 0 to 15 and then recycles. To handle checkpoints in the general case (windowed as well as non-windowed transfers), the file sender keeps a checkpoint window, implemented as a 16-element array indexed by the Checkpoint ID, which contains the recovery information associated with each checkpoint. The checkpoint window is guaranteed to contain all the checkpoints that are also in the packet window. A CHECKPOINT RECORD is written to the recovery file for each checkpoint. The format of a checkpoint record is: CHECKPOINT where the Checkpoint ID is a decimal number, 0-15, and the system-dependent recovery information is as follows: - For the FILE SENDER: how to identify the point in the source file corresponding to the checkpoint, e.g. a file pointer to the next byte to be read from the file, or the number or location or ID of the next record. - For the FILE RECEIVER: the size of the destination file after the checkpoint operation is completed, expressed in units appropriate to the file system: bytes, blocks, etc. It is, however, essential that this number grow as each checkpoint is recorded. 
Examples: CHECKPOINT 0 10240 CHECKPOINT 1 20480 .c2.THE CHECKPOINT REQUEST AND CONFIRMATION PACKETS Checkpoint requests are made by the FILE SENDER by sending a discrete packet, with a new packet type of J. The CHECKPOINT REQUEST packet contains the Checkpoint ID as a decimal ASCII numeric string, "0"-"15", in its Data field. The CHECKPOINT CONFIRMATION packet is simply an Acknowledgement (Y) for a CHECKPOINT REQUEST packet, containing the same Checkpoint ID in the same format. If the data field of the J packet or its ACK contains the uppercase letter X instead of a numeric Checkpoint ID, this indicates a checkpointing error, which is to be handled according to the CHECKPOINT ERROR-ACTION setting. A checkpoint is taken as follows: 1. Sender opens the recovery file in append mode, writes a checkpoint record into it, and then closes it. If this operation fails, the transfer is canceled with an E packet, or checkpointing is canceled with a J(X) packet, according to CHECKPOINT ERROR-ACTION. 2. Sender sends a J packet with the checkpoint ID in the data field, for example J3. 3. Upon receipt of the J packet, the file receiver performs the following actions: a. Closes the temp file to ensure all data has been written out to it. b. Creates the destination file if it doesn't exist yet. c. Appends the temp file to the destination file. NOTE: There is a window of vulnerability if the computer should crash at this point, or if the append operation succeeds, but fills the disk: the destination file is updated, but the update is not recorded in the recovery file. This situation is detected and handled during the recovery operation. d. If and only if all the above actions were successful, and if the J packet did not contain the "X" cancellation indicator, the file receiver opens its recovery file in append mode, writes the current checkpoint info to it, and closes it. e. Deletes the temp file, creates a new one, and opens it for write access. f. 
If and only if all the above actions were successful, the receiver sends a CHECKPOINT CONFIRMATION (ACK with Checkpoint ID in Data field) back to the file sender. Otherwise, the error is handled according to the CHECKPOINT ERROR-ACTION setting: if PROCEED, cancel checkpointing and respond with X in the data field of the ACK; if QUIT, send an E-packet. Observe that the connection can fail after the J packet has been sent, but before it is received, and therefore the two recovery files will be out of sync. Similarly, the connection can fail after the ACK is sent but before it is received. It is impossible to devise a strategy to ensure that the two recovery files always WILL be in sync, especially on a long-delay connection with sliding windows active. The simple strategy given above resolves this dilemma by ensuring that when the recovery files ARE out of sync, the SENDER IS ALWAYS AHEAD of the receiver. We know that it is possible for the sender to move its source-file pointer back to any desired position (byte or record), but we cannot make any such assumption about the file receiver. For all practical purposes, the destination file could be a printer, a deck of cards, or a punched paper tape, where what is done cannot be undone. .c2.THE RECOVERY FILE Since a file transfer failure might have been caused by a computer crash, information about the transfer must be recoverable after a computer restart. Therefore it must be recorded on a nonvolatile device. This would normally be in the file system as a separate file on disk. Each Kermit program keeps its own recovery file. The recovery file will not contain connection information such as phone number, communication settings, etc. In order to reestablish a connection automatically from the recovery file, it would be necessary to store a password there, and this violates the most fundamental concepts of computer security.
Therefore, automatic connection reestablishment must be accomplished using other methods, to be discussed later. The recovery file must contain sufficient information to ensure that in a recovery operation: - The two recovery files apply to the same file transfer transaction. - The correct source and destination files are identified for recovery, and have not changed in the meantime. - All settings that affect the final result of the transfer are the same as in the original transfer operation. - The two Kermit programs agree upon the exact point of failure. The recovery file is composed of ordinary Kermit commands (some of them new) and executed just like any other command file. Certain commands might make sense only in recovery mode; those commands could be marked as invisible or invalid in other modes. A new recovery file is written for EACH FILE that is transferred. No attempt is made to include transfer history for multiple files. There are several reasons for this: - The recovery file could become quite large. - Processing of the recovery file could take a long time and cause a lot of disk activity (e.g. accessing directory information for many files). - Various complications arise when we allow the recovery file to apply to many files. For example, ASSERT commands (see below) could fail, causing premature termination of a RECOVER operation, even though the file that the ASSERT commands apply to was transferred successfully (e.g. the file was modified after it was transferred). - There is no particular benefit in keeping records for multiple files. It does not, for example, tell us which files were NOT transferred yet. For this reason, a recovery file is created as the transfer of EACH file begins, and is destroyed (only) after the file is transferred successfully. The first command in the recovery file should be: SET TRANSFER ID The transfer ID is the key that joins the sender's and receiver's recovery files together. 
Each Kermit program should then write the commands corresponding to all settings that could affect the contents and form of the destination file, for example: SET FILE TYPE TEXT SET FILE CHARACTER-SET CP850 SET TRANSFER CHARACTER-SET LATIN1 SET FILE ... (system-dependent things -- record length, etc) Next we specify the direction of file transfer. SET TRANSFER ACTION { SEND, RECEIVE, MAIL
, PRINT } And then, if necessary (that is, if fully qualified file specifications are not available), we specify the current location at the time the SEND or RECEIVE command was given. If this command appears in the recovery file, all subsequent filenames that are not fully qualified are relative to this path: CD In a moment, we will give the name of the file that is being transferred and make several ASSERTIONS about it. An assertion fails if it is not true. Therefore we must ensure that if any of the subsequent commands fails, the recovery operation itself fails: SET TAKE ERROR ON; (This is the syntax for C-Kermit) (This would be equivalent, in MS-DOS Kermit 3.13 and earlier, to putting the command IF FAILURE END 1 after each command.) It is essential that processing fail if any of these assertions proves false; otherwise there is no guarantee that the recorded checkpoints are accurate. Now we identify the file: SET TRANSFER FILE The file name given is either a fully qualified path name or else, if a CD command was given, a path name relative to the path given in the CD command. If the TRANSFER ACTION is SEND, MAIL, or PRINT, the Kermit program also obtains the file's size and its modification date and time, and then checks to make sure they haven't changed. The new command, ASSERT, tells Kermit to check that the given condition is true and to FAIL if it isn't: ASSERT TRANSFER FILE DATE This ensures that the current modification date-and-time of the file given in the SET TRANSFER FILE command is the same as the given date and time. Similarly, the file sender also includes: ASSERT TRANSFER FILE SIZE to ensure that the file's size is still the one given. The remainder of the recovery file consists of checkpoint records and a final STATUS record: CHECKPOINT 0 CHECKPOINT 1 CHECKPOINT 2 ... At the end of a transaction, a STATUS statement records the status of the file transfer: STATUS The code is the numeric status code (values to be assigned). 0 means the file was transferred successfully.
If there is no STATUS statement -- that is, if the file ends on a CHECKPOINT statement -- it means the computer (or Kermit) crashed in the midst of file transfer, and the status is assumed to be a recoverable failure.

SAMPLE RECOVERY FILE

Here is a sample recovery file for a successful file transfer, from the file sender's point of view:

  ; ... PRELUDE
  SET TRANSFER ID 930719152832          ; File transfer ID
  SET FILE TYPE TEXT                    ; Transfer settings
  SET FILE CHARACTER-SET CP850
  SET TRANSFER CHARACTER-SET LATIN1
  SET TRANSFER ACTION SEND              ; We're sending files
  CD /usr/olga                          ; Current directory
  SET TAKE ERROR ON;
  SET TRANSFER FILE NAME oofa.txt       ; File identification
  ASSERT TRANSFER FILE DATE 930808125959
  ASSERT TRANSFER FILE SIZE 1234567
  ; ... CHECKPOINT HISTORY
  CHECKPOINT 0 10240                    ; Checkpoint records
  CHECKPOINT 1 20480
  CHECKPOINT 2 30720
  ...
  STATUS 0                              ; Completion status

.c2.THE RECOVERY PROCESS A transfer has failed and the connection is broken. The user reestablishes the connection, logs back in to the remote computer, starts Kermit, and gives the following new command: RECOVER or, to identify a non-default recovery file: RECOVER The RECOVER command enables checkpointing automatically, so if the recovery operation itself fails, it can be recovered just like any other interrupted file transfer for which checkpoints were taken. The RECOVER command is similar to the TAKE command in that it directs the command parser to execute commands from a file, but with certain key differences: 1. It sets a recovery-in-progress flag that persists until the transfer described in the recovery file is complete (or fails). 2. It enables or recognizes certain commands that are invalid or ignored during ordinary command processing (such as CHECKPOINT and STATUS). 3. It disables certain commands that are valid outside of recovery mode (such as SEND, RECEIVE, CONNECT, EXIT, HELP, etc), to protect against "hand-crafted" recovery files. 4.
It enters protocol mode automatically upon encountering the end of a valid recovery file. The remote Kermit reads the recovery file, executes all the settings, makes all the checks, etc, and if all is well, gives the KERMIT READY TO xxx... message and enters packet mode ("xxx" is SEND or RECEIVE, depending on the given TRANSFER ACTION). Now the user escapes back to the local Kermit and gives a RECOVER command there too. The local Kermit reads its own recovery file. When the FILE SENDER (which may be the remote or local Kermit program) reads the checkpoint records from its recovery file, it loads them into its checkpoint window, so in case an earlier checkpoint must be used, it can be located immediately without having to re-read the recovery file. The FILE RECEIVER (which may be the local or remote Kermit program) reads checkpoint records until it has found the last one. Now it compares the recovery information in that record with the current status (most commonly, the size) of the destination file, and then: IF THE DESTINATION FILE IS BIGGER THAN THE SIZE RECORDED IN THE FINAL CHECKPOINT RECORD, THE RECEIVER'S CHECKPOINT ID IS INCREMENTED BY ONE (modulo 16). If the destination file size is larger than the final recorded checkpoint, we know that exactly one checkpoint had been taken but not recorded. This shuts the "window of vulnerability" noted previously. Each program enters packet mode upon encountering the end of the recovery file, but only if the final entry was a CHECKPOINT statement or a STATUS statement indicating a failure. Otherwise, an error message is printed and the RECOVER command fails because there is nothing to recover. After the normal S, F, and A packet exchanges, the file sender sends the CHECKPOINT SYNC (H) packet, and the receiver checks it. If the Transfer IDs don't agree, the transfer terminates in error.
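The file receiver's side of this procedure can be sketched in Python. This is a minimal illustration under stated assumptions, not the proposed implementation: the function name is hypothetical, the system-dependent recovery information is assumed to be a plain byte count, and a recovery file with no CHECKPOINT record at all is treated as "nothing to recover". The case of a destination file smaller than the last recorded size is not covered by the text and is not handled here.

```python
ID_MODULUS = 16  # checkpoint IDs run from 0 to 15 and then recycle


def receiver_resume_point(recovery_lines, destination_size):
    """Return the checkpoint ID the receiver should resume from, or None.

    None means there is nothing to recover: either no checkpoint was
    recorded, or the file ends with STATUS 0 (successful transfer).
    """
    last_id = last_size = status = None
    for line in recovery_lines:
        words = line.split(";", 1)[0].split()  # drop any trailing comment
        if not words:
            continue
        keyword = words[0].upper()
        if keyword == "CHECKPOINT" and len(words) >= 3:
            last_id, last_size = int(words[1]), int(words[2])
        elif keyword == "STATUS" and len(words) >= 2:
            status = int(words[1])
    if last_id is None or status == 0:
        return None
    if destination_size > last_size:
        # One checkpoint reached the destination file but was never recorded.
        return (last_id + 1) % ID_MODULUS
    return last_id
```

For a recovery file whose last record is CHECKPOINT 1 20480, a destination file of 20480 bytes gives checkpoint 1, while a destination file of 30720 bytes gives checkpoint 2.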
NOTE: If, by chance, the wrong recovery file is used on one end, and we wind up with two recovery files specifying the same TRANSFER ACTION (SEND or RECEIVE), the operation will quickly fail with an unexpected packet type. Now the checkpoints from the CHECKPOINT SYNC packet are compared. If they do not agree, then -- by design -- the sender's will be the later of the two, and the sender rolls back its checkpoint to the one reported by the receiver. This information is already loaded into the sender's checkpoint window from the CHECKPOINT records in the recovery file. Now the file sender positions the source file to the agreed-upon checkpoint and begins sending from there. The file receiver writes out incoming data to temporary files and appends them to the destination file in the normal manner. Checkpoints are appended to the SAME recovery files that were used to launch the recovery operation. Note that the recovery-in-progress flag should inhibit the re-writing of the recovery-file "prelude", i.e. the material preceding the first CHECKPOINT record. If a recovered transfer fails, the RECOVER command sets a failure code for IF SUCCESS / IF FAILURE, and the recovery file -- perhaps with additional checkpoints and status appended to it -- is preserved so subsequent recovery attempts can be made. .c2.SERVER-MODE CONSIDERATIONS Some sites might wish to run a Kermit program only in server mode. For example, a Kermit server might be installed as the login shell on a particular computer for users who log in as "kermit" or "guest". Or a Kermit server might be set up on an Internet TCP socket, similar to an FTP server. Escape to command mode might be disabled for security or other reasons. Kermit servers, too, can participate in checkpointed file transfers. The protocol and procedures are the same. Checkpointing must be initiated by the client program unless the server has been told to SET CHECKPOINT ON before entering server mode.
In order to recover an interrupted checkpointed file transfer when a Kermit server is involved, a new protocol message is required by which the client program instructs the server to recover the interrupted transfer. This will be in the form of a "Generic" server command, packet-type G, new subtype O: +-----+--------------------+ | G | O | +-----+--------------------+ Type Data This packet would be sent to the server when the client executed the command: REMOTE RECOVER [ ] Normally, the filename would not be given, since the client would usually have no way of knowing what it was. Thus the would normally be zero (expressed as a SP character). Upon receipt of a REMOTE RECOVER command packet, the Kermit server would behave exactly as if it had been given an interactive RECOVER command except that any errors would cause the server to send an error packet and return to server command wait, rather than setting a FAILURE status and returning to the prompt. That is, neither success nor failure of the recovery operation should cause the server to exit from server mode. .c2.SECURITY CONSIDERATIONS Kermit software programs should never give users access to files that they would not otherwise have access to. NOTE: The statement above is subject to minor caveats. For example, in UNIX, it is sometimes necessary to grant a Kermit program special privileges to access communication devices or UUCP lockfiles or UUCP lockfile directories that are not normally accessible, but these privileges should not otherwise amplify the user's access rights. See the discussion in the UNIX C-Kermit installation notes, CKUINS.DOC. Managers of multiuser computer systems in which it is possible to confer privileges on a program are always cautioned to install Kermit software as an ordinary, unprivileged user program. Obviously, this recommendation can not be enforced any more for Kermit than it can for any other application software program. 
Thus any discussion of security relative to Kermit software has to assume it is installed according to recommendations. During a checkpointed file transfer, unprivileged Kermit software programs will not create any files that the user could not have created by other conventional means. The additional files are the temporary files created by the file receiver and the recovery files created by both sender and receiver. Neither do recovery files themselves pose a risk, as long as the Kermit programs are unprivileged. Recovery files do not contain passwords or other authentication material. Even if users alter recovery files in an attempt to gain access to forbidden information or resources, unprivileged Kermit software programs will not grant them such access. That is, Kermit software does not run with any kind of privilege or identity in checkpointing or recovery mode that it does not ordinarily have. Thus, addition of checkpoint/restart capability to Kermit software introduces NO NEW SECURITY RISKS. .c2.MULTI-FILE TRANSFERS A multi-file transfer is the transfer of a group of files in a single Kermit TRANSACTION; that is, a series of protocol messages initiated by an S-packet exchange and terminated by a B-packet exchange. Zero, one, or more files may be transferred in this way. Multi-file transfers are typically initiated by the use of wildcards or with an MSEND command containing a file list. The file list (either directly given or the result of wildcard expansion) is not conveyed from one Kermit to another, nor is it necessarily recorded locally. The order in which files are transferred cannot be guaranteed from one transaction to another. Thus, the files themselves -- and their operating-system-dependent attributes -- are the database from which we must construct recovery information. The Kermit protocol already offers a mechanism to recover from multi-file transfers at the point of failure, on a per-file basis. 
To enable this type of recovery, one of the following settings is given to the file RECEIVER: SET FILE COLLISION DISCARD If a file arrives that has the same name as a file that already exists in the current device/directory, the incoming file is refused via Kermit's attribute refusal mechanism, and the existing file is preserved. SET FILE COLLISION UPDATE If a file arrives that has the same name as a file that already exists in the current device/directory, AND the incoming file's modification date and time is less than or equal to (no newer than) that of the existing file, the incoming file is refused via Kermit's attribute refusal mechanism, and the existing file is preserved. This mechanism is independent of ordering, but entails a small amount of overhead as S, F, and Z packet exchanges occur for each file already transferred. This mechanism can be used in conjunction with checkpoint-restart to recover a multi-file transfer: 1. Recover the file that failed. 2. Resend the file group with the appropriate collision action selected at the receiver. .c2.AUTOMATIC REESTABLISHMENT OF CONNECTION Connection establishment occurs before Kermit protocol is activated, using commands like SET PORT, SET SPEED, DIAL, CONNECT, etc, and then by authenticating oneself to the remote host or service. This process is easily automated in MS-DOS Kermit or C-Kermit using a script program -- a procedure written in the Kermit program's own command language. Here is a crude example that would apply to both C-Kermit and MS-DOS Kermit.
  set count 20              ; Try up to 20 times to transfer the file
  set checkpoint on         ; Turn on checkpointing
  askq \%p Password:        ; Prompt for password interactively
                            ; (This is used by LOGIN.SCR)
  :LOGIN
  hangup
  dial 7654321              ; Dial the phone number
  if fail end 1
  take login.scr            ; Execute the login script
  set file type text        ; Set transfer parameters
  output kermit\13          ; Start Kermit on remote end
  input 5 ermit>            ; Wait for prompt
  if failure end 1          ; No prompt, fail
  if > 0 \v(count) -
    goto recover            ; Go to separate section for recovery

  :FIRST                    ; First try (non-recovery)
  output receive\13         ; Send RECEIVE command
  input 5 RECEIVE...        ; Wait for packet-mode prompt
  send message.txt          ; Try to send a file
  if success end 0          ; Success, we're finished
  if not = \v(recovery) 0 -
    end 1                   ; Failure not recoverable, quit
  if count goto login       ; Recoverable, go try again.
  end 1 Too many tries.

  :RECOVER
  output recover\13         ; Send RECOVER command
  input 5 RECEIVE...        ; Wait for packet-mode prompt
  recover                   ; Tell local Kermit to RECOVER
  if success end 0          ; Success, we're finished
  if not = \v(recovery) 0 -
    end 1                   ; Failure not recoverable, quit
  if count goto login       ; Recoverable failure, try again
  end 1 Too many tries

Automated recovery is always initiated by the caller, since only the caller knows how to reestablish the connection. Some Kermit programs, such as Kermit-370, are never the caller, and so need not implement any of the connection re-establishment features. .c2.RECOVERY FROM AN INTERRUPTED TRANSFER THAT WAS NOT CHECKPOINTED The checkpoint/restart protocol described in this document takes place only when: (a) both Kermit programs have implemented the checkpoint/restart protocol, and (b) the user has enabled its use. Suppose a non-checkpointed file transfer is interrupted? Normally, the receiving Kermit discards any incoming file that is not completely received.
However, most Kermit programs include a command: SET FILE INCOMPLETE { DISCARD, KEEP } The default is DISCARD, which is proper because users should never be given the false impression that an incomplete file transfer was successful. To enable the retention of partially received files, the user must give the following command to the file receiver prior to the transfer: SET FILE INCOMPLETE KEEP When this option is in effect, interrupted file transfers can be recovered manually by a somewhat laborious and error-prone process: 1. The user examines the partially received destination file to determine exactly where the transfer was interrupted. 2. The user uses a text editor or other utility to extract the as-yet unsent portion of the source file into a separate file. 3. The user transfers the newly created source-file fragment to the destination system as a new and separate file. 4. The user appends the two destination files together. (Steps 3 and 4 can be combined via some trickery plus SET FILE COLLISION APPEND, if available, on the receiving Kermit.) A simple modification to existing Kermit software -- independent of the checkpoint/restart feature and of the Kermit protocol itself -- can simplify this process somewhat. A new command, PSEND (Partial Send): PSEND can be used to tell the Kermit program to send the given file (the name of a single file, not a wildcard or file-group specification) starting at the given position, which is a system-dependent quantity representing a byte position, a record number, etc. Meanwhile, the file receiver is told to: SET FILE COLLISION APPEND meaning: when a file arrives that has the same name as an existing file, append the new material to the end, rather than creating a new file or overwriting the old one.
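The combined effect of PSEND and SET FILE COLLISION APPEND can be illustrated for the simplest case, a stream-oriented binary file on both ends. The Python sketch below is only a local simulation under that assumption; the function names are hypothetical, no Kermit packets are involved, and the starting position is taken to be the byte length of the partially received file.

```python
import os


def send_from(source_path, start, chunk_size=4096):
    """Yield the bytes of source_path from byte offset `start` onward (the PSEND side)."""
    with open(source_path, "rb") as src:
        src.seek(start)
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            yield chunk


def receive_append(dest_path, chunks):
    """Append incoming chunks to the existing file (the COLLISION APPEND side)."""
    with open(dest_path, "ab") as dst:
        for chunk in chunks:
            dst.write(chunk)


def resume_binary(source_path, dest_path):
    """Resume at the current length of the partial destination file; return that offset."""
    start = os.path.getsize(dest_path)
    receive_append(dest_path, send_from(source_path, start))
    return start
```

This works only because the partial file's length maps directly to a byte offset in the source. With text-mode conversion or record-oriented files that correspondence does not hold, which is why the manual method is described as error-prone.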
To handle the case where the file sender cannot be given a starting position that corresponds exactly to the end of the partially received destination file (for example, if the file sender has a record-oriented file system, but the receiver has a byte-oriented file system, or a different record size), the following new command can be given to the file receiver: PRECEIVE This instructs the file receiver to write incoming bytes beginning at the given position, possibly overwriting existing material. PRECEIVE capability is not necessarily available on all operating systems. It is, of course, the user's responsibility to reestablish all the original settings before attempting this type of recovery: text vs binary, character set, record length, etc. Recovery from interrupted transfers using this method can never be automatic (because the required information is not recorded anywhere) and is possible only when the file receiver has been given the command SET FILE INCOMPLETE KEEP in advance. If this mode of recovery is always desired as a fallback when true checkpoint/restart protocol has not been enabled or successfully negotiated, the SET FILE INCOMPLETE KEEP command can be added to the Kermit initialization file. .c2.SUMMARY OF NEW COMMANDS AND PROTOCOL MESSAGES Commands: SET CHECKPOINT { ENABLED, DISABLED, OFF } SET CHECKPOINT INTERVAL SET CHECKPOINT RECOVERY-FILE SET CHECKPOINT ERROR-ACTION { PROCEED, QUIT } SHOW CHECKPOINT SET TRANSFER ID SET TRANSFER ACTION { SEND, RECEIVE, MAIL
, PRINT } SET TRANSFER FILE ASSERT TRANSFER FILE DATE ASSERT TRANSFER FILE SIZE STATUS PSEND PRECEIVE Protocol Messages: CAPAS mask in Initialization string must include Attribute Packets bit. CHKPNT and CHKINT fields added to Initialization string. New CHECKPOINT SYNC (H) packet. New CHECKPOINT REQUEST (J) packet. New ALLOCATION UNIT field (Tag 3) in ACK to A-Packet. .c.PROJECT OVERVIEW The checkpoint-restart project will consist of five phases: 1. Requirements Definition 2. Design 3. Development / Implementation 4. Testing 5. Deployment The initial requirements definition and design for Kermit protocol extensions, user interface, nomenclature, and recovery procedures are given in this document. This design will be refined and expanded during the development process. .c2.DEVELOPMENT AND IMPLEMENTATION This is to be considered a small project that should not be subject to the formality of controls that apply to large projects. It will be conducted by one principal designer, with design review by the contractor and by the other developers, and with programming work by no more than four programmers on three different bodies of source code. Development and implementation will proceed in build-a-little, test-a-little increments. The individuals involved in the project will cooperate closely at all times (primarily by Internet email and file transfer), rather than working on discrete components in isolation from one another. Most of the work after the design stage will proceed in parallel, with developments and discoveries constantly feeding back into the design and implementation plan as real-life experience is gained with issues that have, so far, been completely abstract. Thus a highly detailed and specific "critical path" analysis would not apply to this project.
However, the overall structure of the workflow can be depicted as follows:

+---------+         +-------------+          +---------+      +------------+
| Initial | ------> | Development | ------>  | Public  | ---> | Acceptance |
| Design  | --+     | and testing |          | Beta    |      | testing    |
+---------+   |     +-------------+     +--> | testing |      +------------+
              |                         |    +---------+
              |     +---------------+   |
              +-->  | User & tech   | --+
                    | Documentation |
                    +---------------+

Omitted from this diagram (for simplicity) is the obvious fact that each stage can feed back to earlier stages. For example, development and testing might require changes in the design; Beta or acceptance testing might reveal bugs that need fixing or even previously undiscovered design problems, and so on. Here is a preliminary outline of the work to be done. This outline gives a suggested order in which tasks are to be accomplished, proceeding from the general (items not strictly related to checkpoint/restart capability, but needed as underpinnings to it) to the specific, with prototyping done at appropriate points. In most cases, later items depend on earlier items. This outline is not, however, a rigid prescription. In particular, developers should feel free to proceed to a later item if they are temporarily blocked (perhaps for reasons beyond their control) by an earlier item. For example, if a particular feature is to be tested among all combinations of MS-DOS Kermit, C-Kermit, and IBM Mainframe Kermit, but that feature is not yet ready in, say, C-Kermit, the MS-DOS and IBM Mainframe Kermit developers should not feel compelled to do nothing until C-Kermit is ready, but rather, they may proceed with other items that do not depend on that feature. A. LAYING THE FOUNDATION 1. Implementation, testing, and documentation of requisite capabilities that are lacking from specific Kermit software programs: - SET FILE COLLISION APPEND capability in MS-DOS and VMS C-Kermit. - SET FILE COLLISION UPDATE capability in MS-DOS Kermit.
   - SET { TAKE, MACRO } { ERROR, ECHO } { ON, OFF } in MS-DOS Kermit.
   - Carrier-loss detection in MS-DOS Kermit: SET CARRIER { ON, OFF, AUTO }.
   - Possible addition of an intrinsic DIAL command to MS-DOS Kermit, with associated SET MODEM and SET DIAL commands as in C-Kermit.

2. Redesign of C-Kermit's file input module specification to deliver consistently marked records, and recoding of system-dependent file i/o modules according to the new specification.

3. Implementation, testing, and documentation of PSEND and PRECEIVE commands to establish the ability to seek within a file.

4. Definition of standardized Kermit protocol error codes and their meanings.

5. Addition of standardized error codes to E packets in the three major Kermit versions.

6. Classification of error codes into recoverable and nonrecoverable categories and addition of a new variable or status code that can be queried to see whether a failed transfer is recoverable.

7. Coding and testing of the addition of the file allocation unit to the Attribute-packet reply.

B. CHECKPOINT/RESTART FRAMEWORK CODE DEVELOPMENT

1. Definition of file and variable names for checkpoint/restart, and specification of the associated semantics.

2. Coding, testing, and documentation of the commands to set and display checkpoint-related variables and capabilities: SET CHECKPOINT { ON, DISABLED, ENABLED, INTERVAL, RECOVERY-FILE, ERROR-ACTION }, SHOW CHECKPOINT. At this stage, these commands simply set and display internal variables.

3. Coding, testing, and documentation of the ASSERT TRANSFER FILE command.

4. Coding, testing, and documentation of the following prototype (nonoperational) commands:

   - SET TRANSFER ID
   - SET TRANSFER FILE
   - SET TRANSFER ACTION { SEND, RECEIVE }
   - CHECKPOINT
   - STATUS

   The Kermit program will parse these commands, but will not associate any actions with them. Check for syntactic problems or conflicts with other commands as well as conceptual problems or difficulties with documentation.

5. Create and test a new module used by the file sender to generate a transfer ID and write the initial part of the recovery file. At this stage of development, this module would be called by the file sender at the time it opens the input (source) file.

6. Ensure that the same Kermit program can read the prototype recovery files back without syntax errors, and set the corresponding variables correctly.

7. Ensure that the file sender creates a new Transfer ID and recovery file for each file in a file group, that each recovery file is destroyed after the file is successfully transferred, and that the proper recovery file remains on disk when a transfer is interrupted. Test with file groups consisting of zero files, one file, and more than one file.

8. Coding and testing of a module (in some cases, perhaps a preexisting system call) to create a temporary output file without destroying any existing file.

9. Coding of a module that appends one file to another. Install a new file-management command, APPEND, to test this code.

10. Enable use of temporary files by the file receiver without checkpointing. Receive into a temporary file, and when the transfer is complete, rename the temporary file to the desired name.

C. CODE DEVELOPMENT FOR CHECKPOINTED FILE TRANSFER

1. Coding and testing of checkpoint/restart protocol negotiation: WILL, WONT, DO; communication of checkpoint interval. Add display and/or debugging tools to monitor the progress and/or results of the negotiation. Test all combinations of SET CHECKPOINT { ON, OFF, ENABLED } among MS-DOS Kermit, C-Kermit, and Kermit-370, as well as against a non-checkpoint-capable Kermit version.

2. Add the new CHECKPOINT SYNC (H) packet and the appropriate protocol state transitions and actions. Sender generates a new Transfer ID, and uses a null Checkpoint ID. Create and initialize the recovery file in both file sender and receiver.
   To test, collect packet logs and ensure that the new packets are exchanged, have the correct format, and that file transfers still work. Also, ensure that the H-packet is NOT exchanged when checkpointing has not been negotiated. Also, ensure that failure of the CHECKPOINT SYNC packet to appear at the proper time when checkpointing has been negotiated is handled according to CHECKPOINT ERROR-RECOVERY. Inspect the recovery files and ensure they are correct.

3. Implementation of the checkpointing process in the file sender:

   - Determination of when to initiate a checkpoint request, based on negotiated checkpoint interval and file record/line boundaries.
   - Addition of capability to terminate a file-transfer data packet on a record (e.g. text line) boundary, rather than filling the packet.
   - Creation of the checkpoint window structure and recording of checkpoints.
   - Writing of CHECKPOINT and STATUS records to the recovery file.
   - Transmission of CHECKPOINT REQUEST packets.
   - Ability of receiver to accept CHECKPOINT REQUEST packets without error, but without actually processing them, so the sender's code can be tested.

4. Testing of checkpointing process in the file sender:

   - Collect packet logs.
   - Test both text and binary mode transfers.
   - Binary transfers should be tested for stream and record-oriented systems.
   - Ensure that checkpoints were taken at the right places.
   - Ensure that CHECKPOINT records were written correctly.
   - Dump the checkpoint window periodically to ensure it is correct.
   - Ensure that appropriate STATUS records are written for both successful and failed transfers.

5. Implementation of the checkpointing process in the file receiver. Upon receipt of CHECKPOINT REQUEST packet:

   - Flush, close temp file.
   - Append temp file to destination file and delete temp file.
   - Create new temp file for subsequent incoming data.
   - If all OK, write CHECKPOINT record to recovery file.
   - If OK, send CHECKPOINT CONFIRMATION packet; otherwise cancel transfer or checkpointing according to CHECKPOINT ERROR-RECOVERY setting.

6. Testing of the checkpointing process in the file receiver:

   - Ensure destination files (both text and binary) are created correctly.
   - Ensure that CHECKPOINT CONFIRMATION packets are correct.
   - Ensure recovery file is updated correctly with CHECKPOINT and STATUS records.
   - Ensure recovery file is retained when transfer fails, destroyed when transfer succeeds.

7. Further testing of checkpointed transfers. Verify that all mechanisms coded so far work on or for:

   - Text and binary files
   - For binary files: stream and record oriented.
   - For record-oriented binary files: records of different sizes / formats
   - All window sizes (test on long-delay connections with large window sizes)
   - Long and short packets
   - Single-file transfers and file-group transfers
   - 7-bit and 8-bit connections
   - With and without text character-set conversion.
   - Serial and network connections
   - Noisy and clean connections
   - Between all combinations of MS-DOS Kermit, C-Kermit, and Kermit-370

8. Performance evaluation. Compare file transfer efficiency with and without checkpointing:

   - For various checkpoint intervals
   - For text and binary files
   - For various window-size / packet-length combinations

D. CODE DEVELOPMENT FOR RECOVERY FROM POINT OF FAILURE

1. Implementation of the CHECKPOINT command. This command simply loads the given information into the indicated slot in the checkpoint window.

2. Implementation of the STATUS command. This simply sets an internal variable to the given number.

3. Implementation of SET TRANSFER { FILE, ID, ACTION } commands.

4. Implementation of the RECOVER command:

   - Locate and read recovery file, validate info, determine TRANSFER ACTION.
   - Open the transfer file in the given mode (read or write)
   - Load checkpoint window.
   - Determine final transfer status, don't recover if 0.
   - Close recovery file.
   - Enable checkpointing.
   - Enter protocol mode according to TRANSFER ACTION.
   - Process and synchronize CHECKPOINT SYNC from recovery file info.
   - Sender positions source-file pointer to indicated position.
   - Receiver opens destination file in append mode.
   - File transfer resumes where it left off when interrupted.
   - Recovery files are updated and disposed of in the normal way.

5. Testing the RECOVER command:

   - Collect packet logs on both ends, inspect to ensure correctness.
   - Using all combinations of connections, protocol settings, file settings, systems, etc., ensure that files of all types are recovered correctly.
   - Ensure that a recovery operation can itself be interrupted and recovered.

E. SAMPLE SCRIPTS FOR AUTOMATIC RECOVERY

1. Write and refine sample scripts for automated connection establishment, file transfer, detection of recoverable failures, connection reestablishment, and file transfer recovery.

2. Test with single- and multiple-file transfers.

3. Test on direct serial, dialed, and network connections.

4. Test for recovery by caller and callee.

5. Test recovery when receiver's recovery file is one checkpoint behind the destination file.

.c2. DOCUMENTATION

Throughout the development / implementation stage, technical documentation will be updated and refined, and trial copies of user documentation of each user-visible feature will be produced as that feature is added.

The technical documentation will include extensions to the Kermit protocol specification [1] as well as additions to the relevant program logic manuals (PLMs). Eventually the extensions to the Kermit protocol will be published in a new edition of [1]. PLMs are generally maintained as online English-language plain-text files.

Trial user documentation will be compiled from the documentation written during the development period, in the form of online English-language plain text, to be issued in the update or release notes with beta-test or newly released versions of the updated Kermit software programs.
Eventually, the user interface to the checkpoint/restart features will be described in new editions of the relevant published or online user manuals.

.c2.TESTING

Testing is performed by the developers at each step of the development process. When the initial implementation is complete, further testing will be done by project members who were not personally involved in the coding.

Once the internal tests are complete, the updated software and documentation will be turned over to the contractor for evaluation and testing, and, upon the contractor's go-ahead, will also be released to the general Internet user community for Beta testing. We feel that the wider Internet community will give the updated software a much more thorough workout on a much wider variety of platforms and communication methods than the developers or contractors could ever hope to accomplish by themselves.

After the public Beta test period is complete, the resulting software and documentation will be turned over to the contractor for acceptance testing. A detailed formal test plan will accompany the software. The contractor may, of course, devise its own tests.

Once the contractor has accepted the software, all test notices will be removed from its banners and documentation, and it will be released.

.c2.TIMELINE

Requirements Definition and Initial Design: One month. Done.

The development and testing phase is expected to take four months. Thus, if work commences September 1, it will be complete by December 31.

Development (keyed to section labels used above), 3 calendar months, 3-5 people working in parallel on each phase:

   A: One calendar month
   B: One half calendar month
   C: One half calendar month
   D: One half calendar month
   E: One half calendar month

Testing: One month

.c.REFERENCES

[1] da Cruz, Frank, "Kermit, A File Transfer Protocol", Digital Press (1987).

[2] Gianone, Christine, "Using MS-DOS Kermit", 2nd Ed., Digital Press (1992).

[3] da Cruz, Frank, and Christine Gianone, "Using C-Kermit", Digital Press (1993).

[4] Chandler, John, "IBM System/370 Kermit User's Guide", unpublished (1993).

(End of Document)