Frank da Cruz
12 October 2021
Last update: Wed Oct 13 14:52:44 2021 (New York time)
The ability to invoke external file-transfer protocols such as XMODEM and ZMODEM was added to C-Kermit in version 6.0, announced 1 December 1996, and described in the manual Using C-Kermit (second edition) pp.303-306; it works by running an external file-transfer program such as "sx" (Send with Xmodem) with its standard input/output redirected to the connection that C-Kermit has with the other computer. The manual warns on p.306 that:
Even if they (the external clients) can be successfully redirected, protocols such as XMODEM, YMODEM, and ZMODEM are likely to fail over TELNET connections because of transparency issues. The external protocol programs themselves are unaware that they have been redirected over a TELNET connection, and so even if they would know what to do in this case, they don't know they are supposed to do it. If your external protocol program has a command-line option to let you tell it to take precautions (certain ZMODEM implementations let you "escape" certain characters; carriage return [CR, Ascii 13] and "all ones" [0xFF] are the ones to watch out for), use it.
Kermit protocol distinguishes itself from most other file-transfer protocols by its adaptability to non-transparent connections. For example it can (if necessary) transfer control characters in a 2-byte printable form; it doubles the IAC (0xFF) on Telnet connections, etc.
In the ten years that followed the release of C-Kermit 6.0, I didn't get any complaints about the external protocols feature, which meant:
I spent some months in 2006-2007 trying to address this problem without
success. The original code is in the routine in the
Note: This document doesn't consider the problem of using an external protocol on SSH connections; it's a completely different case because C-Kermit implements SSH by forking the external SSH client. In this case we'd have SSH in one fork and (say) Xmodem in a separate fork, completely disconnected from each other (Kermit 95 for Windows has its own built-in SSH client).
Anyway, by this time (2021) I think this whole topic is just a historical curiosity because although C-Kermit supports SSL/TLS and Kerberos for Telnet and FTP connections, there are no servers for them any more. Meanwhile the use of clear-text Telnet and FTP is strongly discouraged for safety reasons (and servers for them are disappearing too). Nowadays everybody uses SSH for interactive network connections.
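To make the transparency issue concrete: on a Telnet connection the byte 0xFF is the IAC ("Interpret As Command") marker, so a data byte of 0xFF has to be sent twice. The following is a minimal sketch of that doubling, not C-Kermit's own code; the function name telnet_escape_iac is hypothetical, and the carriage-return handling that the manual also warns about is not shown.

```c
#include <assert.h>
#include <stddef.h>

#define TELNET_IAC 0xFF   /* Telnet "Interpret As Command" byte */

/*
 * Copy n bytes from in to out, sending every 0xFF data byte twice.
 * out must have room for 2*n bytes (worst case: all bytes are 0xFF).
 * Returns the number of bytes stored in out.
 */
size_t telnet_escape_iac(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t i, j = 0;

    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        out[j++] = in[i];
        if (in[i] == TELNET_IAC)
            out[j++] = TELNET_IAC;   /* double the IAC */
    }
    return j;
}
```

An external protocol program whose output is redirected onto a Telnet connection does not know it should do this, which is the failure mode the manual describes.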
Created a new version of ttruncmd() called ttptycmd(), which works by calling do_pty() to get a pty to run the command on, and then in a loop, reads from the pty and writes to the net and reads from the net and writes to the pty, using select() to decide which of those it should do on each pass. First cut just uses single-byte reads and writes. Tested using Kermit itself as an external protocol. Works but slowly: 6000cps. Zmodem doesn't work at all. ckutio.c, 24 Dec 2006.
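The general shape of such a loop can be sketched as follows. This is only an illustration under stated assumptions, not the ttptycmd() code: the names relay_fds, write_all, and relay_demo are hypothetical, reads are buffered rather than single-byte, and the demo uses two socketpairs in place of a real pty and network connection.

```c
#include <assert.h>
#include <errno.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Write all n bytes, retrying after short writes and EINTR. */
static int write_all(int fd, const char *buf, size_t n)
{
    while (n > 0) {
        ssize_t w = write(fd, buf, n);
        if (w < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += w;
        n -= (size_t)w;
    }
    return 0;
}

/*
 * Copy bytes in both directions between fd_a and fd_b until a read
 * reports end-of-file or an error.  Returns the number of bytes
 * relayed, or -1 if select() or write() fails.
 */
long relay_fds(int fd_a, int fd_b)
{
    char buf[1024];
    long total = 0;
    int nfds = (fd_a > fd_b ? fd_a : fd_b) + 1;

    assert(fd_a >= 0 && fd_a < FD_SETSIZE);
    assert(fd_b >= 0 && fd_b < FD_SETSIZE);
    for (;;) {
        fd_set rfds;
        int i;

        FD_ZERO(&rfds);
        FD_SET(fd_a, &rfds);
        FD_SET(fd_b, &rfds);
        if (select(nfds, &rfds, NULL, NULL, NULL) < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        for (i = 0; i < 2; i++) {
            int from = i ? fd_b : fd_a;
            int to = i ? fd_a : fd_b;
            ssize_t n;

            if (!FD_ISSET(from, &rfds))
                continue;
            n = read(from, buf, sizeof buf);
            if (n <= 0)
                return total;   /* EOF or read error ends the relay */
            if (write_all(to, buf, (size_t)n) < 0)
                return -1;
            total += n;
        }
    }
}

/* Demo: push "hello" through the relay using two socketpairs. */
long relay_demo(char *out, size_t outlen)
{
    int a[2], b[2];
    long n;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, a) < 0) return -1;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, b) < 0) return -1;
    if (write(a[0], "hello", 5) != 5) return -1;
    shutdown(a[0], SHUT_WR);        /* a[1] sees EOF after the data */
    n = relay_fds(a[1], b[0]);
    if (n > 0)
        n = (long)read(b[1], out, outlen);
    close(a[0]); close(a[1]); close(b[0]); close(b[1]);
    return n;
}
```

With sockets, end-of-file is reported cleanly by read(); the rest of this log is largely about the fact that a pty master does not behave so predictably when the process on the slave side exits.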
Changed single-character read() and write() to buffered reads and writes, with ttxin() and ttol() used for network i/o. Using Kermit as the external protocol, this gives 450Kcps (about 1/3 normal on this connection).
But now there's a problem: the loop doesn't know when to stop. How does it know when the process that is running on the pty has exited? With single character read()'s that are executed unconditionally when select() says the pty has data waiting, as in the first pass, I get EIO if there actually isn't any, and can exit the loop. But now, to avoid blocking, I call in_chk() to see how much data is waiting, and I don't try to read anything if it says nothing is waiting. If the process associated with the pty file descriptor has terminated, in_chk() would presumably get some kind of error, but it doesn't. I changed do_pty to return the pid of the fork where it execs its command so we can check the pid with kill(pid,0) when in_chk() of the pty says 0, but this doesn't help either; it seems like the process is not exiting, but of course it is.
I could not find any legitimate way to test when the pty fork terminated. Select() always says the pty file descriptor was ready, no matter what. Select() never reports an exception on the pty file descriptor; in_chk(ptyfd) returns 0 and not an error. read(ptyfd,...) gets 0 but not an error. fcntl(ptyfd,...) doesn't get an error. Finally I tried write(ptyfd,c,0) and this indeed gets EIO (i/o error). With this, using Kermit as the external protocol works fine in Solaris but I tend to think this trick will not be very portable (it isn't). 24 Dec 2006.
Made ttptycmd() use a more intelligent buffering scheme and fixed a few things about how I was setting up the select() call, which should address some of yesterday's problems. Still doesn't work, but it's progress. 25 Dec 2006.
Debugging yesterday's code… Still, the error conditions are never set, we never detect when the pty closes. In Solaris, if select() says ptyfd is ready to read but in_chk() says there are no characters there, we can treat this as a loop-exit condition. But in NetBSD, in_chk() always says 0 when used on a pty (but works OK on a serial or net connection).
Realized I could not use in_chk() on the pty because there is too much baggage with the communication path (myread(), etc.), so I replaced this with a simple ioctl(ptyfd,FIONREAD,&n). This works fine in Solaris but always returns 0 in NetBSD, despite what the man page says (i.e. that this function can be used on any file descriptor).
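For reference, the FIONREAD query amounts to something like the sketch below. The wrapper name bytes_waiting is hypothetical; the header shown is where FIONREAD is found on Linux and the BSDs, and other systems may need an additional one.

```c
#include <assert.h>
#include <sys/ioctl.h>   /* FIONREAD here on Linux and the BSDs; Solaris may also need <sys/filio.h> */
#include <unistd.h>

/*
 * Ask the kernel how many bytes can be read from fd without blocking.
 * Returns the count, or -1 if the ioctl is not supported on this fd.
 */
int bytes_waiting(int fd)
{
    int n = 0;

    assert(fd >= 0);
    if (ioctl(fd, FIONREAD, &n) < 0)
        return -1;
    return n;
}
```

On a pipe or socket this returns the queued byte count; as noted above, the answer on a pty master turned out to be platform-dependent.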
OK, let's see.... select() does not return useful results. It says characters are waiting on ptyfd when they are not, and it never detects the closure of the pty..... Well of course not, because we are the ones who have to close it. Just because the process has stopped doesn't mean the pty is closed. So we're back to square one, how do we know when to close it? ckupty.c seems to keep the process ID in a global variable, pty_fork_pid (which is not the same as the pid now returned by do_pty(), which is useless, but I don't understand why). But it doesn't matter because when we kill(pty_fork_pid,0), we still get no error of any kind, even after we know the process has exited. I am completely flummoxed. select() lies, and even if it didn't, there is simply no completion criterion. In the loop, select() always says that the pty is ready to read. To be continued. 26 Dec 2006.
Back to Square One, single-byte reads and writes.
But the ensuing read() gets EIO so we know the pty is gone. That means the same thing should happen in the buffered version, no? Yes; I went back to the buffered version and replaced all the other nonworking tests by a blocking read of 1 byte on the pty and this detects the termination. Now:
Let's call the remote, forked, redirected, external Kermit A and its local partner B. A sends its S-packet, B receives it OK and Acks. A apparently does not receive the ACK in time, so sends the S again, but OK, followed immediately by the F. B Acks the F. A sends the A, B Acks it. But now A sends a piece of the previous F packet and the first piece of a D packet.
Clearly the buffering is messed up. Sure enough, there was an extraneous statement incrementing a read pointer in a write section. Removing that cleared up the problems with Kermit, now we can send and receive substantial files efficiently in remote mode. Zmodem seems to work too, except that at the beginning a bunch of "**B0800000000022d"'s are stuffed into Kermit's command buffer, so after the transfer we get some error messages.
In local mode, over a Telnet connection, Kermit works fine. Zmodem works OK too except it doesn't finish right, so at the very end rz on the far end is still waiting for something; if I cancel out of it with ^X^X^X^X^X, it deletes the file. So there still is something wrong with the termination test.
Also you don't see anything on your screen when running Kermit or Zmodem this way. That's to be expected, since they are using stdio for the transfer, so they can't also be displaying progress or other messages.
Built this on NetBSD again… Seems to work this time, but has trouble finishing, like Zmodem. Hmmm, on closer examination, it turns out that since in_chk() always returns 0 on the ptyfd, we fall into our new single-byte read code, so it's really slow, like 10K cps on a connection where 1M is the norm. 27 Dec 2006.
Switched the pty from buffer peeking (FIONREAD) and blocking reads to nonblocking reads (O_NONBLOCK / O_NDELAY). Works just fine on NetBSD except now we no longer get EIO at the end when trying to read from the pty process that has exited. In fact, we're back to square one again: not ioctl(), not fcntl(), not select(), not even read() gets an i/o error after the pty process exits. But in NetBSD, we have to use nonblocking reads because ... Hmmmm, maybe switch the fd between blocking and nonblocking for the test… Nope, NetBSD seems to be hopeless (later, Ed Ravin confirmed that similar problems have been observed with other applications that try to do this).
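A minimal sketch of the nonblocking approach, assuming only POSIX fcntl() and read(): the helper names set_nonblocking and try_read are hypothetical, and the return-code convention is made up for this illustration. The point is that "no data yet", "end of file", and "hard error such as EIO" are three different outcomes, and which one a pty master gives after its child exits varies by platform.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Put fd into nonblocking mode.  Returns 0 on success, -1 on failure. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/*
 * One nonblocking read attempt.  Returns:
 *   > 0  number of bytes stored in buf
 *     0  nothing available right now (EAGAIN/EWOULDBLOCK/EINTR)
 *    -1  end of file (read() returned 0)
 *    -2  hard error such as EIO; errno is left set
 */
long try_read(int fd, char *buf, size_t len)
{
    ssize_t n;

    assert(buf != NULL && len > 0);
    n = read(fd, buf, len);
    if (n > 0)
        return (long)n;
    if (n == 0)
        return -1;
    if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
        return 0;
    return -2;
}
```

On a pipe the three cases are easy to tell apart; the difficulty recorded in this log is that a pty whose process has exited may keep reporting the "nothing available" case indefinitely.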
Switching to Linux, I see that yesterday's Solaris code (blocking reads) works exactly the same way on Linux.
Tried today's O_NDELAY method on Solaris. It works perfectly. And then I moved this one to Linux and it works perfectly there too. Except in both cases we have the weird thing with Zmodem at the end, but I think that's because rz/sz don't use standard i/o. On NetBSD, it still hangs at the end.
Turns out that testing the pid works in NetBSD, even though it didn't in Solaris. Turns out read() gets an i/o error in Solaris and Linux but not in NetBSD. So checking the read result first, and then checking the pid if read() got zero bytes catches all three. 28 Dec 2006.
Now the question of return code. In the original ttruncmd() function, we do a fork() and a wait(). When the external protocol program finishes, wait() gives us its return code and we can pass it on through \v(pexitstat) as well as ttruncmd's own return code. But ttptycmd() has to interact with the pty continuously, so it can't just sit back and wait() for it. Instead we have to detect when the process has exited and then call waitpid() on the fork pid, before shutting down the pty. Tested on Solaris using Kermit as the external protocol and then inducing failure, or letting it run to completion. FAILURE and SUCCESS set appropriately in each case. Tested with Zmodem too, works OK except for the aforementioned cosmetic glitch at the end. Tested on NetBSD, all OK.
To make K5 connection to Panix from Spam:
set telnet debug on
authenticate K5 init /realm:PANIX.COM /password:xxxxx
set host shell.panix.com 23 /k5login
Good… Now I try to send a file from Spam to Panix over the K5 connection using Kermit itself as the external protocol. It fails. Inspection of the debug log on the far side shows that the S-Packet was received correctly, good! This means we are reading the clear-text S-Packet from the external Kermit program, and that ttol() is encrypting appropriately.
The remote Kermit sends the Ack and goes to read the next packet: ttinl() calls myfillbuf() and:
SVORPOSIX myfillbuf calling read()
SVORPOSIX myfillbuf=0 <-- read returns 0
SVORPOSIX myfillbuf ttcarr=2
SVORPOSIX myfillbuf errno=0 <-- and reports no error
HEXDUMP: mygetbuf read (-3 bytes)
mygetbuf errno=0
ttinl myread failure, n=-3
ttinl myread errno=0
ttinl non-EINTR -3[closing]
This happens because myfillbuf() deliberately returns -3 when read() gets 0 bytes. I don't understand why this happens but the real problem is yet to come. The local Kermit (the one that has made the secure connection and is running the external protocol through ttptycmd()) eventually figures out that the transfer failed and when we reconnect, we get total garbage — the encryption either stopped happening, or got out of sync.
Looking at the local debug log, ttol() is doing its job, converting the initial "kermit -r\13" from plaintext to cyphertext, as shown by the hexdumps. Then it enters ttptycmd()… Hmmmm, wait, how can it send the "kermit -r" before it starts the external protocol? Never mind, worry about that later… Anyway, ttptycmd() says:
ttptycmd loop top have_pty=1
ttptycmd loop top have_net=1
ttptycmd FD_SET ptyfd in
ttptycmd FD_SET ttyfd in
ttptycmd nfds=5
ttptycmd select=1
ttptycmd FD_ISSET ttyfd in
...
ttptycmd in_chk(ttyfd) n=11
ttptycmd ttxin n=11
ttxin() asks for 11 bytes, myfillbuf() gets 11 bytes, and hexdump() shows the cyphertext; there doesn't seem to be any decrypting going on. Hmmm, it looks like the regular code calls ttinc() in a loop, rather than ttxin(). Maybe ttxin() doesn't have decryption hooks. No, that's not it, the code is there, but the Kermit packet reader does not use ttxin(), it uses ttinl(). But of course we can't use that for external protocols because it's designed only to read Kermit packets. Substituting a loop of ttinc()s for the ttxin() call fixes things (and strangely enough, it seems to be faster). And now we have our first external protocol transfer over a secure connection (external Kermit program, Linux over Kerberos 5 to NetBSD). Zmodem worked too for a short file but "something happens" with longer ones. 29 Dec 2006.
New makefile target for Linux with Kerberos 5, linux+krb5, that doesn't include anything extra from SSL or other security methods (but apparently it is still necessary to include -DOPENSSL_097 in order to get the right names for the DES routines?). Ditto netbsd+krb5 for NetBSD, except in this case -DOPENSSL_097 is not necessary. makefile, 30 Dec 2006.
Note to myself: On Panix:
export LD_LIBRARY_PATH=/usr/local/kerblib
make netbsd+krb5 "K5LIB=-L/usr/local/kerblib" "K5INC=-I/usr/local/include"
Can't telnet-k5 from newly built Kermit on NetBSD; partway through the negotiations, just after "TELNET RCVD SB ENCRYPTION SUPPORT DES_CFB64 DES_OFB64 IAC SE" it dumps core. The last two lines in debug.log after this are:
Rebuilding with -DOPENSSL_097 doesn't change anything. Ed Ravin said they have two different Kerberos installations, Heimdal and MIT; maybe some mixup between the two explains the problem (Jeff concurs). The core dump occurs in ck_crp.c, in encrypt_support():
debug(F100,"XXX ep not NULL","",0);
type = ep->start ? (*ep->start)(DIR_ENCRYPT, 0) : 0;   /* <-- Here */
debug(F101,"XXX new type","",type);
Anyway, I can log in with Kerberos 5 to Panix OK from Columbia (sesame) using 8.0.201. So let's try to resurrect the Solaris version with everything:
I hunted around to find where the current library and header file directories were… Last time I tried this (March 2006) it bombed, not finding libdes. Instead we have /opt/kerberos5125/lib/libdes425.a. Made a new cu-specific target that includes this; now we get farther; it blows up in ckcftp.c with tons of errors and warnings, which we can worry about later. Building again with -DNOFTP, it gets to ckuath.c (the first security module) and:
ckuath.c:151:18: error: krb5.h: No such file or directory
ckuath.c:152:21: error: profile.h: No such file or directory
ckuath.c:153:21: error: com_err.h: No such file or directory
ckuath.c:176:28: error: kerberosIV/krb.h: No such file or directory
In file included from /opt/openssl-0.9.8d/include/openssl/des.h:101,
                 from ckuath.c:219:
Found krb5.h in /opt/kerberos5125/include/krb5.h, added a -I for this directory ... Now we get lots of warnings in ckuath.c, but it completes OK, then we wind up bombing out in ck_crp.c; I don't know why — there are all the same warnings (related to argument passing to DES functions), but no errors. I have no clue.
Tried to resurrect the solaris2x+krb4 target; this required changing -lkrb to -lkrb4 and -ldes to -ldes425. Lots of warnings in ckutio.c, ckcnet.c, ckctel.c, then it bombs out in ckcftp.c because it can't find krb.h. I found it, adjusted the -I flags, but now it bombs because krb.h itself #includes <kerberosIV/des.h>, which of course it can't find because the brackets mean it's looking in /usr/include/kerberosIV/, which, of course, the sys folks have removed. Giving up on Solaris again. Later, Jeff said "Solaris does not publicly export the krb5 libraries. You need to build the MIT Kerberos libraries separately and link to them." 30 December 2006.
Changed copyright date to 2007. ckcmai.c, 1 Jan 2007.
With Ed Ravin's help, successfully built C-Kermit with Kerberos 5 and OpenSSL (netbsd+krb5+openssl+zlib), but it does not make K5 connections; it gets hung up in the Telnet negotiations. 3 Jan 2007.
Downloaded MIT Kerberos 5 v1.4.4 to Solaris 9, 54MB worth. This is just so I can build a Kerberized C-Kermit for testing ttptycmd(). Ran the configure program, got a few warnings but it didn't fail (should it?) Did "make install", specifying a private directory but it failed immediately with "cannot stat libkrb5support.so.0.0: No such file or directory". OK, I tried. 3 Jan 2007.
Made a new makefile target for Mac OS X, macosx10.4+krb5+ssl, ran it on Mac OS X 10.4.8. It bombs out in ckcftp.c with: "ckcftp.c:551: error: static declaration of 'gss_mech_krb5' follows non-static declaration /usr/include/gssapi/gssapi_krb5.h:76: error: previous declaration of 'gss_mech_krb5' was here". Ditto for gss_mech_krb5_old, gss_nt_krb5_name, and gss_nt_krb5_principal. Tried again with -DNOFTP. We get lots of warnings in the network modules, but they complete. But ck_ssl.c bombed with a conflict between its own declarations of encrypt_output and decrypt_input and the ones in ckuat2.h; after removing the prototypes from the latter (as Jeff advised), it built OK and it works OK too. Built with FTP too, but with link-time warnings about the aforementioned gss_* symbols. #ifdef'd them out (gss_mech_krb5, gss_mech_krb5_old, gss_mech_name, and gss_mech_principal) for MACOSX, where these symbols are exported by the library. Now it all compiles and links OK, and runs OK too. 3 Jan 2007.
Spent a day hunting around for a version of Zmodem that would build and execute on Mac OS X, finally found one. Now at last I could try a Zmodem external-protocol transfer over a secure connection. But phooey, C-Kermit's pty support didn't work on this box. Kermit finds master /dev/ptypa OK, then in ptyint_void_association() tries to open /dev/tty but gets ERRNO=6 "device not configured" (which is apparently OK, because the same thing happens on other platforms where this works), then tries to open slave /dev/ttypa and gets ERRNO=13 "permission denied" because, indeed, I don't have r/w permission on the device. Left a message. 4 Jan 2007.
Changed TRANSMIT /BINARY output buffer size from 252 to 508 to avoid TCP fragmentation. Need to add a SET command for this later. ckuus4.c, 5 Jan 2007.
Found another Mac where the ptys weren't protected against me, made a K5 connection and transferred a largish file with Zmodem with zero glitches, except it was kind of slow, 84K cps. Well, we're doing single-character reads on the net (ttinc()'s instead of ttxin()). Hmmm, but then I did it again and got 2.2Mcps. Success was reported, but it actually didn't work; it only sent the first quarter of the file.... Oh well, at least now we have a testbed. 5 Jan 2007.
Tried again, saw that the file is actually transferred instantly but then we're not picking up the protocol at the end. Theory: after the transfer finishes, we come back to the prompt on the remote host, which means we have something to read from the net and write to the pty, but the pty has already exited. AFTER THE PTY IS GONE, WE DO NOT WANT TO READ FROM THE NET ANY MORE. Adding this test makes Kermit succeed right away when sending the same largish file, with a transfer rate of 4M cps, that's better. But the rz program on the far end is evidently not receiving the goodbye handshake from the receiver, because it sits there forever in its *B09002402009418 mode until I ^X^X^X^X^X out of it, at which point it deletes the file it already received, not very helpful. In the code, I read from the pty if the pty is open and there is room in the buffer. This means that when we get to the end, either there is no room in the buffer (unlikely) or the last bit sent by sz before exiting was cut off when the fork closed. Why do we get in this fix only with Zmodem and not with Kermit?
In Mac OS X, after sz exits, we get ERRNO=5 if we try to write to the pty, but we still get no errors after that if we try to read from it. Still, prior to this we did more than 20 unproductive nonblocking reads from the pty (no error, no bytes) without incident; there did not seem to be anything waiting. In fact, the last thing we read from the pty were the text messages that are issued at the end of the transfer: "rz 3.73 1-30-03 finished." After which it pauses a second and spits out a message about UNREGISTERED COPY.
Figured out how to build lrzsz, in hopes that the previous problems were with rzsz and crzsz's fiddling with file descriptors, but I get the same behavior. Which is good, I guess, because if I can fix one, I fix them all. Or not… Testing lrz by itself (not under C-Kermit), I see that it doesn't work at all with Kermit's own Zmodem implementation.
OK, here's one problem: at the end of the transfer, the Omen Zmodems print stuff like "Please read the license agreement", Kermit dutifully reads this from the pty and sends it to the host, the host shell says "Please: command not found", issues its prompt again, which Kermit reads, feeds to the pty, and apparently the pty echoes it, so we send it back to the host, and there ensues an infinite loop of getty babble until the pty closes. Now, there ought to be a way to make the external protocol shut up, like Kermit's -q(uiet) flag, but these are unregistered versions so you can't shut up the messages. In fact, the transfer works, but the getty babble at the end ruins the experience. Now I'm beginning to wonder how any of these programs ever worked as external protocols. Hmmm, now that I try it, I see the same thing happens the old way, when using ttruncmd() rather than ttptycmd().
Reading the crzsz documentation I see it says that messages come out on stderr. OK, that's progress. In ckupty.c I try redirecting 2 to /dev/null. Well good, this filters out the messages from csz, but we still get getty babble on the prompt. In the debug log, we read the last bunch of stuff from net, 618 bytes of Zmodem stuff… Now what happens?
Zmodem on the remote exits, the host prints its prompt. Kermit, of course, reads the prompt from the net; now we come to the bottom of the loop and we have 7 bytes to write to the pty, and no error condition, so we continue the loop. select() says that the pty is ready for writing. We write the 7 bytes and get no error. Loop again, this time select() says the pty has data waiting. Sure enough we get the prompt back, and send it to the net, and thus begins the getty babble. There are two causes for this:
ttptycmd() needs to:
Tried setting the pty to noecho:
termbuf.c_lflag &= ~(ECHO|ECHOE|ECHOK);
and this seemed to stop the getty babble. After the file transfer, I read back the prompt from the host shell, I write the prompt bytes to the pty; there is no error. And now select() simply hangs forever (or times out if a timeout is set). The question here is: why didn't writing to the pty produce an error? And, because we never detect the pty has exited, we can't set a good return code. 5 Jan 2007.
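Two of the pty-side adjustments tried above, sending the external program's stderr to /dev/null and turning off echo on the pty, can be sketched with standard POSIX calls as follows. The function names are hypothetical and this is not the ckupty.c code; it only shows the calls involved.

```c
#include <assert.h>
#include <fcntl.h>
#include <termios.h>
#include <unistd.h>

/*
 * Point fd at /dev/null, e.g. fd 2 in the child just before exec.
 * Returns 0 on success, -1 on failure.
 */
int redirect_fd_to_devnull(int fd)
{
    int nullfd = open("/dev/null", O_WRONLY);

    if (nullfd < 0)
        return -1;
    if (nullfd != fd) {
        if (dup2(nullfd, fd) < 0) {
            close(nullfd);
            return -1;
        }
        close(nullfd);
    }
    return 0;
}

/* Clear the same echo bits as the termbuf line shown above. */
void clear_echo_flags(struct termios *t)
{
    assert(t != NULL);
    t->c_lflag &= ~(tcflag_t)(ECHO | ECHOE | ECHOK);
}

/* Apply the change to a terminal descriptor.  Returns 0 or -1. */
int tty_disable_echo(int fd)
{
    struct termios t;

    if (tcgetattr(fd, &t) < 0)
        return -1;              /* ENOTTY if fd is not a terminal */
    clear_echo_flags(&t);
    return tcsetattr(fd, TCSANOW, &t) < 0 ? -1 : 0;
}
```

Turning off echo only stops the pty from reflecting the prompt back; as the paragraph above notes, it does nothing to signal that the process on the pty has exited.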
Moved pty fork testing to a separate routine, pty_get_status(), and added a call to it from the place where we time out, in case the fork terminated; then we can get and return its status. 6 Jan 2007.
Added calls to pty_get_status() to every place where we suspect a pty error, tried again with lrzsz, crzsz, and regular rzsz. All three work, but in each case waitpid() indicates that the sz program gave exit code 1 (failure). ckutio.c, 7 Jan 2007.
Changing the subject… On my test system, every time I execute ttptycmd(), I get "permission denied" on /dev/ttyp3. Then I run it again and get to ttyp4 which is OK. I wanted to skip past any pty for which I lack permission and try the next without raising an error. Added debugging code:
16:25:23.524 pty_getpty() pty master open error[/dev/ptyp0]=5
16:25:23.524 pty_getpty() pty master open error[/dev/ptyp1]=5
16:25:23.524 pty_getpty() pty master open error[/dev/ptyp2]=5
16:25:23.524 pty_getpty() found pty master[/dev/ptyp3]
16:25:23.524 pty_getpty() slavebuf [/dev/ttyp3]
So it already was skipping past open errors; ptyp3 was opened successfully. The problem is that the master, ptyp3, is rw-rw-rw-, but the corresponding slave, ttyp3, is rw--r----. It seems the code assumes that if the master can be opened, then so can the corresponding slave. Unfortunately, the code is not structured to allow us to skip ahead to the next master if the slave can't be opened. 7 Jan 2007.
Spent a couple hours trying to rearrange the code in the pty module to skip past inaccessible slaves but it was a rabbit hole, not worth it, backed off. 8 Jan 2007.
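Purely as an illustration of the behavior that was wanted, and outside the structure of the real pty module, a search that skips a master whose slave is not accessible could look like the sketch below. Everything here is an assumption: the function names, the limit of 64 pairs, and the BSD-style /dev/ptyXX and /dev/ttyXX naming taken from the debug log above.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define PTY_NAME_LEN 16     /* room for "/dev/ptyp0" plus NUL */
#define PTY_PAIRS    64     /* banks p, q, r, s with 16 units each (assumed) */

/* Build the master and slave names for pair number idx.  0 or -1. */
int pty_pair_names(int idx, char master[PTY_NAME_LEN], char slave[PTY_NAME_LEN])
{
    static const char banks[] = "pqrs";
    static const char units[] = "0123456789abcdef";

    assert(master != NULL && slave != NULL);
    if (idx < 0 || idx >= PTY_PAIRS)
        return -1;
    snprintf(master, PTY_NAME_LEN, "/dev/pty%c%c", banks[idx / 16], units[idx % 16]);
    snprintf(slave, PTY_NAME_LEN, "/dev/tty%c%c", banks[idx / 16], units[idx % 16]);
    return 0;
}

/*
 * Open the first master whose slave is also readable and writable by us.
 * Returns the master fd and leaves the slave name in slave, or -1.
 */
int open_usable_pty(char slave[PTY_NAME_LEN])
{
    char master[PTY_NAME_LEN];
    int idx;

    for (idx = 0; pty_pair_names(idx, master, slave) == 0; idx++) {
        int mfd = open(master, O_RDWR);

        if (mfd < 0)
            continue;           /* master busy or absent: next pair */
        if (access(slave, R_OK | W_OK) == 0)
            return mfd;         /* both ends look usable */
        close(mfd);             /* slave is protected: next pair */
    }
    return -1;
}
```

The access() check is only advisory, since permissions can change before the slave is actually opened in the child fork.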
Tried an upload over a secure connection using lsz. Unexpectedly, this time it worked; not only was the file (about 0.5MB) transferred correctly, but Kermit detected the fork's termination and got the pid's exit status, and, for the first time, correctly reported a successful transfer. I have no idea why this works today and not yesterday. More tests; it works most of the time. It works with csz and with regular sz too.
ckucns.c seems to do the right thing; it recognizes the ZSTART string, activates the Zmodem-Receive APC, and returns. doconect() sees the APC and begins to execute it. The RECEIVE command results in a call to the GET command parser, doxget() (IS THAT RIGHT?), then comes a ttflui(), which throws away a bunch of stuff. Finally we get to ttptycmd(), we get a pty and run lrz in it, select() says stuff is waiting from the pty, but read returns 0, errno 0. Skipping the ttflui() in doxget() if the protocol was not Kermit didn't seem to make a difference. ckuus6.c, 8 Jan 2007.
The problem is that in this case, reads from the pty never get anything (no data, no error), write always gets an error. It's as if the pty was not being set up right, or we're using the wrong file descriptor. And if we skip the autodownload? Same thing.
OK, putting downloads aside for a moment, let's get uploads working as well as possible. At this point we have the odd situation (at least in this configuration) that the upload succeeds, but now for some reason we are unable to read the exit status from the process, even though this was working before, so ttptycmd() returns 0 (failure), yet Kermit reports success.
Well, it turns out that kill(pty_fork_pid,0) was gumming up the works. If we use only waitpid() all is well, I think. waitpid() with WNOHANG returns -1 with status -1 errno 0 if the pid has not exited, and it returns the pid and status > -1 if the process has exited. Fixed pty_get_status() to do it this way. ckutio.c, 7 Jan 2007.
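A minimal sketch of polling a child with waitpid() and WNOHANG follows. The function name poll_child_exit and its return convention are hypothetical, not those of pty_get_status(). Note that POSIX specifies a return value of 0 from waitpid() when WNOHANG is given and the child has not yet changed state, so the sketch treats 0 as "still running" and a negative value as an error, which differs from the values observed above.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Check once, without blocking, whether child pid has terminated.
 * Returns 1 and stores its exit code in *code if it has (the code is
 * -1 if the child was killed by a signal), 0 if it is still running,
 * and -1 if waitpid() fails (e.g. ECHILD: already reaped).
 */
int poll_child_exit(pid_t pid, int *code)
{
    int status = 0;
    pid_t r;

    assert(code != NULL);
    r = waitpid(pid, &status, WNOHANG);
    if (r == 0)
        return 0;               /* child exists but has not exited */
    if (r < 0)
        return -1;
    *code = WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    return 1;
}
```

Because a successful waitpid() reaps the child, the status can be collected only once; a second call for the same pid fails with ECHILD.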
Let's move this from Mac OS to NetBSD and see how it works. Well, the file transfer was just fine, but then I used some sexps to calculate the elapsed time and transfer rate, and Kermit hung in dosexp(). Fine, ignoring that… The debug log shows that ttptycmd() gets the pty OK, master and slave, the i/o goes smoothly, and waitpid() does its job perfectly. Solaris, same deal; ttruncmd() goes smoothly, but then the sexps afterward get "Arithmetic exception". Turns out there was a BAD bug in dosexp() that allowed an integer division by 0 to occur under certain circumstances; it's always been there. Fixed in dosexp(): ckuus3.c, 8 Jan 2007.
After noticing a few problems running the pop.ksc script in production over the past year, rewrote \femailaddress() to be more reliable and a lot simpler. ckuus4.c, 9 Jan 2007.
Back to ttptycmd()… When we left off, we could send but not receive. Set up a test case using Kermit as the external protocol for receiving a short file. If I SET STREAMING OFF and use short packets, it actually does work, so it's not a complete failure to function, but apparently a lack of flow control for the pty. Began by completing the parameterization of the pty module, so it can be called for interactive use (fc 0) or for running protocols (1). Confirmed that everything works at least as well as before (e.g. "set host /pty emacs" vs external protocols). ckcdeb.h, ckutio.c, ckupty.c, 9 Jan 2007.
Found in HP-UX "man 7 pty" a description of ioctl(fd,TIOCTTY,fc) which is exactly what we want: fc 0 turns off all termio processing and guarantees an uninterrupted, unmolested, flow-controlled stream of bytes in both directions. This function also exists in Linux, but not in Solaris, NetBSD, or Mac OS X (TIOCNOTTY is not what we want, it does something else entirely).
Another possibility is TIOCREMOTE, which "causes input to the pseudoterminal to be flow controlled and not input edited, regardless of the terminal mode". This one exists in at least HPUX, NetBSD, Solaris, and Mac OS X.
Solaris: builds OK, but at runtime we get ENOTTY ("Inappropriate ioctl for device"). By the time this happens, it's hard to tell from the code whether the fd we're using is for the master or the slave; TIOCREMOTE can be used only on the master. Close inspection shows that I am indeed doing that; ptyfd as seen by ttptycmd() is truly the master, i.e. the /dev/ptyXX device, not the /dev/ttyXX device (the slave fd can't be seen at all, as it exists only in a separate fork). OK, so now we know that TIOCREMOTE can't be used on Solaris.
NetBSD: Somehow, whether as a result of today's fiddling or the phase of the moon, the code in pty_open_slave() that tries to open /dev/tty started failing on NetBSD ("Device not configured"). Changing it to be run only if fc == 0 (which doesn't seem to hurt anything), once again I get ENOTTY on the TIOCREMOTE ioctl. Zmodem works but Kermit totally fails (the fork exits immediately with an exit code of 0, even though it didn't do anything).
Mac OS X: Exactly the same sequence and results as NetBSD.
Linux: It did not execute the new ioctl at all; apparently the TIOC symbols are hidden or not exported or something.
Where we stand:
All today's work on ttptycmd() looks like a dead end. To roll back to yesterday:
cp ckutio.c-20070108 ckutio.c
cp ckupty.c-20070108 ckupty.c
cp ckupty.h-20070108 ckupty.h
or to continue with today's:
cp ckutio.c-20070109 ckutio.c
cp ckupty.c-20070109 ckupty.c
cp ckupty.h-20070109 ckupty.h
Comparing Monday's and Tuesday's pty-related code, the differences are:
Commenting out 2 and 3 should put us back where we were on Monday if the parameterization was done right. And with this, on Solaris, downloading with Kermit external protocol works but slowly, 8K cps, with or without debugging. Debug log does not show any obvious bottlenecks; select() takes anywhere between no time at all and 0.1 seconds to return. If I increase the pty-net buffer size from 1K to 4K, the rate goes up to 55K cps. If I make it 8K I get 136K cps. With 16K I get 346K cps. 32K: 395K cps — this last one isn't worth the doubling. But at 24K I get 490K cps, sometimes twice that. Let's stick with 24K for now. Downloading with Zmodem (rzsz) works at the same rate, but now we're back to seeing the getty babble (Several "**B0800000000022d") at the end. 10 Jan 2007.
Moving to Mac OS X, everything works the same as on Solaris, except I don't get the Zmodem getty babble there, not even with Omen rzsz. Tested sends in both remote and local mode, the latter over a secure Kerberos 5 Telnet connection, using C-Kermit, rzsz, lrzsz, and crzsz, all good. 10 Jan 2007.
Now we're back where we were yesterday morning, but with better throughput. The big issue then was receiving files. But yikes, now it works! Not only that, I got a transfer rate of 2.1M cps. That's using Kermit protocol, streaming, and big (4K) packets. Which didn't work before. Not a fluke either, I uploaded bigger and bigger files up to 6MB, they all went smoothly, at rates between 1 and 2 MBps. 10 Jan 2007.
Not so great in Zmodem land, however. If I start the external-protocol receiver on the far end, escape back and start a Zmodem send… nothing. If I leave the remote C-Kermit at its prompt (where it is supposed to recognize the Zmodem start string), still nothing. On the other hand, if I do it with a script instead of by hand:
def xx output take blah\13, send /proto:zmodem \%1
it works, at least intermittently. But that's in remote mode. We won't be using this in remote mode. In local mode, where we have a secure connection to another computer, it seems we can read from the pty and write to the net, but we time out waiting to read from the net; nothing arrives. Well, we know that i/o works both ways, so there is some kind of screwup with the Zmodem protocol start itself. Increasing the (still hardwired) timeout from 5 to 22 sec and driving the whole process with a script so as to avoid autodownload as well as manual dexterity effects… It just sits there forever, way longer than 22 sec. ^C'ing out, I see that sz was indeed started on the far end and the protocol was executing. But it looks like the receiver (the one running under ttptycmd()) is getting trashed packets, because (a) it seems to be sending the same thing over and over again, and (b) sometimes it waits as long as 10 seconds before anything arrives from the remote. Maybe I was too impatient; I interrupted it after 4 minutes but it seems to have been making some progress. Whenever there was data available to read from the net, it was always 65 bytes, and it was not actually the same data over and over. This is using lrz as the external protocol. crz gets a bit farther. In this case we read up to 24K at a gulp, but the amount varies a lot. It looks like we took in about 1.2MB of Zmodem protocol data, but were only able to output the first 20K of the file. Clearly there were lots of errors. In the end, crz exits with status 1 (failure).
Anyway it looks like we're back at needing to find a way to accomplish something like TIOCREMOTE on the pty, which is where we came in. 10 Jan 2007.
Without any way to make the pty transparent and flow controlled, it would seem to make sense to write to the pty in smaller chunks than we do to the net. I left the read-from-pty-write-to-net buffer at 24K and changed the read-from-net-write-to-pty buffer to 48 bytes.
Upload using lsz worked but took about 3 minutes. Actually it didn't work. On the local end it seemed to work, but the file did not appear on the remote end. Tried this several times, each time with different results, adding more debugging each time. The problem this time was that the pty read could get EWOULDBLOCK. Changed the code to not treat this as an error, now Zmodem uploads are solid again except I never got EWOULDBLOCK again either, even though I repeated the same upload about 1000 times (with throughput of over 2MBps even with debugging on), so the test for it has not been exercised.
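The EWOULDBLOCK case can be illustrated with a minimal sketch. It is not the ttptycmd() code; the names read_some(), set_nonblocking() and the RD_* codes are hypothetical, made up for this example. The idea is simply that on a nonblocking descriptor, "nothing to read yet" has to be kept separate from both end-of-file and a real error.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical result codes for this sketch (not C-Kermit's). */
enum { RD_EOF = 0, RD_AGAIN = -1, RD_ERROR = -2 };

/* Put fd into nonblocking mode; returns 0 on success, -1 on failure. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0) return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/*
 * Read up to len bytes.  Returns the byte count (> 0), RD_EOF when the
 * other side has closed, RD_AGAIN when nothing is available yet, or
 * RD_ERROR for a real failure (errno is left as read() set it).
 */
long read_some(int fd, char *buf, size_t len)
{
    ssize_t n;
    assert(buf != NULL && len > 0);
    do {
        n = read(fd, buf, len);
    } while (n < 0 && errno == EINTR);   /* retry if a signal interrupted us */
    if (n > 0) return (long)n;
    if (n == 0) return RD_EOF;
    if (errno == EAGAIN || errno == EWOULDBLOCK) return RD_AGAIN;
    return RD_ERROR;
}
```

A caller would treat RD_AGAIN as "go back to select()" rather than as a reason to abandon the transfer.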
OK, uploads still work. Back to downloading… The very first pty read gets 0 bytes, followed by the fork test that shows that it exited with exit status 2.
Next we try starting sz with some different options on the far end:
-q: quiet (no messages): for some reason this gets totally stuck. It looks as if this option is misdocumented; sz seems to be sending the letter C (as in Xmodem 1K or whatever).
-e: escape (all control chars): first attempt to read pty finds the process gone with exit status 2.
-k: send 1k blocks: this one didn't stop immediately. It reads 48 bytes from net, writes 48 to the pty with no error. Then reads 21 bytes from the pty, writes them to the net OK. Then reads 48 bytes from net, writes them to pty OK, reads 21 from pty, writes to net OK, etc etc… It appears to have worked (final read from pty returned 0, fork test showed lrz exited with status 0), but only 754 bytes were received from the net when the file is 420K…
Well this only goes to show that the faster we shove stuff into the pty, the worse it gets. Zmodem downloads won't work unless we can make the pty transparent and flow-controlled. So to summarize today's developments:
11 Jan 2007.
Next day. This has got to be the most delicate code ever, it's like Whack-A-Mole, fix A and B pops up. Even without touching it, something that worked perfectly at 2:00 doesn't work at all an hour later. Maybe I could have used pipes instead of ptys, but pipes have problems of their own. There has to be a way to do this. The telnet server, the SSH server, etc. all run on ptys, and we can upload files to them with Kermit. Why? Because Kermit puts its terminal into all the right modes using the time-honored methods of ttpkt() and ttvt(). Perhaps all we need is a copy of ttpkt() that operates on the pty.
On that theory, let's go back to Kermit as the external protocol. It's important to suppress all messages and displays. With that, uploads work fine, no hitches.
Downloads: We fail right away. The debug log shows the Kermit program that we are starting in the pty says:
"" - Invalid command-line option, type "kermit -h" for help.
But of course we are not giving it an invalid command-line option. Switching to gkermit for the external protocol, now we see that no matter what command-line options we use, we read 0d 0d 0a from the pty and then the next time we go to read from the pty we get 0 bytes and waitpid() says the program has exited with status 1.
Why should downloading be different from uploading? ttptycmd has no idea, it does everything the same. The only difference would seem to be which side sends first, but even that tends to get washed out by each program's startup messages.
Downloading with Kermit worked 2 days ago, what's different now? The buffer sizes. Putting the net-to-pty back up to 24K (from 48 bytes)… Now it works again.
Conclusion: Kermit conditions the pty correctly, Zmodem does not. Therefore ttruncmd() must duplicate what ttpkt() does.
Or not. Because rz works fine on ssh/telnet ptys too. But not on our pty. lrz exits immediately with status code 2 = 01000 but there are no clues in the lrz.c source code, I don't even see this exit status set anywhere. Unredirecting stderr, I see that the error is "lrz: garbage on command line".
Why do both Kermit and Zmodem sometimes think they are receiving an invalid command line? If I could capture the garbage…
Side trip #1: ("pty.log",O_WRONLY) gives "no such file or directory". Changed this to ("pty.log",O_CREAT,0644) and now it doesn't get an error, and it creates the file, but not with 0644 permissions, and with nothing written in it. How come nothing works?
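One plausible reading of that side trip, assuming the call in question was open(): O_WRONLY by itself never creates a file, and O_CREAT by itself leaves the access mode at read-only on systems where O_RDONLY is 0, so the file is created but every write to it fails. The permission bits are also filtered through the process umask, so 0644 is an upper bound rather than a guarantee. A minimal sketch with hypothetical helper names:

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Open (creating or truncating) a log file for writing.
 * Returns a file descriptor, or -1 on failure with errno set.
 * The 0644 mode is still subject to the process umask.
 */
int open_log(const char *path)
{
    assert(path != NULL);
    return open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
}

/* Write one line to the log; returns 0 on success, -1 on failure. */
int log_line(int fd, const char *text)
{
    size_t len = strlen(text);
    if (write(fd, text, len) != (ssize_t)len) return -1;
    return write(fd, "\n", 1) == 1 ? 0 : -1;
}
```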
Fine, the debug log shows that ttptycmd() receives the correct string (e.g. "lrz -v"). It passes it to do_pty() correctly, and do_pty() passes it to exec_cmd(), which runs cksplit() on it, coming up (in this case) with "lrz" and "-v", which is right, and then:
args = q->a_head + 1;
execvp(args[0],args);
execvp() wants the args array to have a null element at the end. cksplit() does indeed do that, or at least the code is there. Added code to exec_cmd() to verify the argument list and that it is null-terminated. So far it is.
Anyway, we have traffic between the Zmodem partners, but no joy. Commenting out the bit that redirects stderr, now I can see it on my screen in real time:
lrz waiting to receive.
Retry 0: Bad CRC
Retry 0: Got ERROR
Retry 0: TIMEOUT
Retry 0: TIMEOUT
Retry 0: TIMEOUT
Retry 0: TIMEOUT
etc etc, forever. Trying sz -e on the far end, I get:
Retry 0: Bad CRC
Retry 0: Garbage count exceeded
Retry 0: Garbage count exceeded
Retry 0: Garbage count exceeded
Retry 0: Garbage count exceeded
Retry 0: Garbage count exceeded
Retry 0: Garbage count exceeded
Retry 0: Garbage count exceeded
...
Retry 0: Got ERROR
Retry 0: Bad CRC
Retry 0: Got ERROR
Retry 0: Got ERROR
lrz: xxufio.c removed.
So apparently it's not a matter of escaping. Trying some other stuff, I caught the command-line problem in the act:
lrz: garbage on commandline
Try `lrz --help' for more information.
Debug log shows:
cksplit result[lrz]=1
cksplit result[-v]=2
exec_cmd arg[lrz]=0
exec_cmd arg[-v]=1
exec_cmd arg=2
An empty string at the end instead of a null pointer. I really do not see any way that could happen, but rather than dig into cksplit() again after all these years I added a test for this in exec_cmd(), which, of course after adding it, never encountered this behavior again.
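The kind of check described here can be sketched as follows. This is not the code that went into exec_cmd(); fix_argv() is a hypothetical name, and the sketch assumes that a legitimately empty argument never occurs, since it treats the first empty string as the bad terminator.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Walk an argument vector of at most max slots and make sure it ends
 * with a NULL pointer, as execvp() requires.  An empty string is
 * treated as a bad terminator and replaced by NULL.
 * Returns the argument count, or -1 if no terminator was found.
 */
int fix_argv(char **argv, int max)
{
    int i;
    assert(argv != NULL && max > 0);
    for (i = 0; i < max; i++) {
        if (argv[i] == NULL) return i;       /* properly terminated */
        if (argv[i][0] == '\0') {            /* empty string, not NULL */
            argv[i] = NULL;                  /* substitute a real terminator */
            return i;
        }
    }
    return -1;                               /* ran off the end */
}
```

Calling this immediately before execvp() means a vector like {"lrz", "-v", ""} is handed over as {"lrz", "-v", NULL}, so the program never sees a stray empty argument.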
Fiddled with pty buffer size again. Made it 512 bytes instead of 24K. Zmodem downloads are the same (Retry 0: TIMEOUT, over and over). But I don't see what the problem is: every time we receive n bytes from the net, we write n bytes successfully to the pty and there are no errors. But it also looks like the remote sender is sending the file header over and over because it's not receiving an acknowledgment. If we're not losing data, then maybe it's a transparency problem.
Tried uncommenting the TIOCblah stuff I commented out before. Now instead of only timeouts I get:
lrz waiting to receive.
Retry 0: Bad CRC
Retry 0: Got ERROR
Retry 0: Bad CRC
Retry 0: Got ERROR
Retry 0: Bad CRC
Retry 0: Got ERROR
Retry 0: TIMEOUT
which is odd because the TIOCREMOTE ioctl failed with errno 14, EFAULT, bad address, which should indicate it had no effect. We're still receiving data from the remote in tiny chunks (from 12 to 65 bytes), apparently the same stuff (file header), and writing them to the pty successfully but nothing…
Looked at cloning ttpkt() for the pty, but these stupid routines use global tty mode structs so it's not going to be easy.
Well, we got exactly nowhere today, but I think I'll leave stderr as it is so users will see some feedback; no reason not to.
WHY DO KERMIT DOWNLOADS WORK AND ZMODEM NOT?
Is it 8-bit transparency? Up til now I've been testing with text files. If I try to download a binary what happens? Fails after 99 seconds. Packet log from the far end shows that as soon as the first packet containing 8-bit data is sent, everything stops. At least I got one of these:
17:23:56.475 exec_cmd arg[gkermit]=0
17:23:56.475 exec_cmd arg[-qr]=1
17:23:56.475 exec_cmd arg=2
17:23:56.475 exec_cmd SUBSTITUTING NULL=2 <-- the code I just added
Doing this again shows the same thing on the near end. All the 7-bit-only packets are sent and acknowledged OK. Three 8-bit data packets arrive and nothing else happens after that. This is with G-Kermit.
The same thing happens with C-Kermit receiving. But if I change C-Kermit's .kermrc to turn off streaming and use a short packet length:
The transfer works, even though it's sending 8-bit bytes. So the problem is not 8-bit data after all, per se. Facts:
So it's the combination of streaming and 8-bit data? 12 Jan 2007.
As a test I made a new routine pty_make_raw() that does cfmakeraw() (a nonportable "POSIX-like" function known to be used on ptys in applications that do approximately what we're attempting). Results:
Solaris: errno 25 - inappropriate ioctl for device.
This happens even when we try to get the terminal modes with tcgetattr(), which is completely nuts. We pass it the file descriptor of the pty master, which is supposed to work. But in Mac OS X, there are no errors. But downloads still don't work; lots of errors but the pattern is different. Using a very small buffer:
Retry 0: Bad CRC
Retry 0: Garbage count exceeded
Retry 0: Garbage count exceeded
Retry 0: Got TIMEOUT
Retry 0: TIMEOUT
Retry 0: Bad CRC
Retry 0: Bad CRC
Retry 0: Garbage count exceeded
Retry 0: Garbage count exceeded
Retry 0: TIMEOUT
Retry 0: Got ERROR
Retry 0: TIMEOUT
Retry 0: Garbage count exceeded
Retry 0: Garbage count exceeded
Retry 0: Bad CRC
Using a bigger buffer:
Retry 0: Garbage count exceeded
Retry 0: Garbage count exceeded
Retry 0: Garbage count exceeded
Retry 0: Garbage count exceeded
Retry 0: Garbage count exceeded
Retry 0: Garbage count exceeded
(several screensful)
Various other combinations… Nothing seems to work.
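For reference, here is a sketch of what a raw-mode routine of this kind typically does. It is not the pty_make_raw() from the Kermit sources; make_raw() is a hypothetical stand-in. Instead of calling cfmakeraw(), it spells out by hand the flag changes that the Linux termios(3) manual page lists for cfmakeraw(), using only POSIX names, and the VMIN/VTIME settings are an extra assumption of this sketch. When the descriptor is not a terminal, tcgetattr() fails with ENOTTY, which is the "inappropriate ioctl for device" error (errno 25) reported above.

```c
#include <assert.h>
#include <errno.h>
#include <termios.h>
#include <unistd.h>

/*
 * Put the terminal device open on fd into raw mode.
 * Returns 0 on success, -1 on failure with errno set
 * (ENOTTY if fd does not refer to a terminal).
 */
int make_raw(int fd)
{
    struct termios t;
    assert(fd >= 0);
    if (tcgetattr(fd, &t) < 0) return -1;

    /* No input translation, no XON/XOFF on output, keep all 8 bits. */
    t.c_iflag &= ~(IGNBRK | BRKINT | PARMRK | ISTRIP |
                   INLCR | IGNCR | ICRNL | IXON);
    /* No output post-processing (e.g. LF to CRLF). */
    t.c_oflag &= ~OPOST;
    /* No echo, no line editing, no signal characters. */
    t.c_lflag &= ~(ECHO | ECHONL | ICANON | ISIG | IEXTEN);
    /* 8 data bits, no parity. */
    t.c_cflag &= ~(CSIZE | PARENB);
    t.c_cflag |= CS8;
    /* Assumption of this sketch: read() returns once 1 byte is there. */
    t.c_cc[VMIN] = 1;
    t.c_cc[VTIME] = 0;

    return tcsetattr(fd, TCSANOW, &t);
}
```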
Insight: telnetd does exactly what we want to do, sort of. But it uses TIOCPKT, so every time it reads from pty, it receives one control byte and then the data bytes, which would complicate our buffering scheme considerably. Anyway the TIOCPKT ioctl() fails on Mac OS X with 14 "Bad address".
Also see: snoopserver.c (found in Google). It seems to do things in a slightly different way — it sets stdout to raw and then dups it to the slave side of the pty?
Maybe it's a mistake to use the ckupty.c routines. They are designed for creating and accessing an interactive session. Maybe just copy one of the other programs.
18 Jan 2007. Tried going back to blocking rather than nonblocking reads to see if it would make a difference, after all the other changes. Nope. OK, let's look at some of these other programs…
snoopserver.c. I don't know exactly what this is or where it's from or what platform it runs on and there are no comments to speak of, but it does approximately what ttptycmd() does. To get a pty it uses openpty():
if (openpty(&pty, &tty, NULL, NULL, NULL) == -1)
then creates a fork. In the fork, it closes the pty (master) and manipulates the modes of the tty (slave), dups tty to be stdio, and then does execv() on the command. Meanwhile the upper fork closes the tty (slave) and gets the attributes of stdin, using atexit() to have them automatically restored on exit. Then it sets stdin to raw mode and enters the select() loop on stdin, the pty master, and the net. It uses regular blocking reads. It does not use TIOCPKT or anything like it.
openpty() is supported on: Linux, Mac OS X, NetBSD, FreeBSD, ...
openpty() is NOT supported on: Solaris, HP-UX, ...
1. Try copying the pty code, but keep everything else the same.
I did this; it compiles and starts OK, upper fork (ttptycmd) debug log shows no errors, but nothing happens. Logs show that the Kermit program that is started in the subfork seems to die as soon as it reaches eof on its init file. The good news, at least, is that select() doesn't report that the pty is ready to be read. Clearly the file descriptors aren't being assigned as expected, or as before.
In ckupty.c getptyslave() dup2's the slave fd to 0 and 1. The new code does exactly the same thing. Debug log makes it look like the forked kermit is not receiving its command line. But now I'm not even sure that the forked kermit started at all. ps from another terminal doesn't show it.
19 Jan 2007: Noticed that in snoopserver, the select() calls use standard input and output file descriptors, rather than the pty master. Made that change… In doing that I had to look at every file descriptor in every line of code and discovered a couple mistakes, fixed them, put back the original code but with the fixes, tried it, but no change; can upload OK but still can't download with Zmodem without lots of errors and ultimate failure. Going back to the alternative version and trying to get the file descriptors sorted out, now it appears that the external Kermit program never even starts in the lower fork. After a bit more fiddling I sort that out, but now when the lower Kermit program goes to open "/dev/tty" it gets errno 6 "Device not configured". Forcing it to use stdio with "-l 0", it gets past this and actually sends its first packet. But the Kermit on top reads nothing from the pty.
Next, I change the pty fd from STDIN_FILENO and STDOUT_FILENO to slavefd. No difference. Next I comment out the dup2() calls. This time I get some action. The transfer starts, but only one packet comes. Log shows that the lower Kermit sends its S packet. The upper Kermit receives the ACK but the lower Kermit never gets it. The write to the pty succeeds, no error. Different combinations give different results. If I write to the master and read from the slave, I get packets in both directions but tons of errors.... This happens only if I comment out the dup2()'s.
25 Jan 2007: After leaving it sit for a while, and realizing that what I'm trying to do has to be possible because so much other software does the same thing (e.g. Telnet servers), I put things back to how they were originally — the upper fork (Kermit) uses the master and the lower fork the slave. The upper fork puts the master in raw mode, the lower fork puts the slave in raw mode. The lower fork dup2's the slave fd to stdin/out. Send file in remote mode using external Kermit: works OK but select() times out at the end. This means that the self-contained pty code in ttptycmd() is sorted out — all the file descriptors go to the right place, etc, and now we can use this routine as a testbed, rather than the original ckupty.c-based one.
But send with lsz, csz, and regular rz: Nothing happens, times out after 0 bytes of i/o. Once again, Kermit works, Zmodem doesn't. The reason for running Zmodem in a pty is so its i/o will work as it does on a terminal, no matter how it may fiddle the file descriptors. So why don't we see a single byte come out?
Commenting out pty_make_raw(), I get a successful Zmodem send using lsz. csz manages to get the filename across, but then gets stuck. regular sz, on the other hand, works perfectly. Testing csz by itself (not under Kermit), I see it fails in exactly the same way ("Got phony ZEOF", etc). OK, forget crzsz.
OK, let's move to local mode over a Kerberized Telnet connection… Uploading (sending) with external Kermit protocol… works. Downloading (receiving) with external Kermit protocol… works. Uploading with sz… works. Downloading with lrz… Gets tons of errors and fails.
Running pty_make_raw() on the slave but not on the master: no difference. Running pty_make_raw() on the master but not on the slave: no difference.
Back where we started… Either:
Theoretically we should be able to test these by using different sz switches:
-q: quiet (should always use this)
-e: escape all control characters
-B n: Buffer n bytes (rather than whole file)
-L n: Packet length
-l n: Frame length (>= packet length)
-w n: Window size
-4: 4K blocksize (doesn't help)

-q by itself doesn't help.
-q -e, this one worked but still got about 100 errors and was very slow.
-q -e -l 200 -L 100, failed fast and bad.
-q -e -w 1. Failed quickly.
-q -e -w 1 -B 100. Eventually failed.
-q -w 1, Eventually failed.
-q -l 1024, this gets much more errors, definitely need -e.
-q -e -l 1024, got pretty far before failing.
-q -e -w 1 -l 1024, also got pretty far before failing.
-q -e, this one got farthest of all, about 48K, before getting errors.
In the latter combinations that work somewhat better, we always get up to 16K, or 32K, or 48K, before the errors start coming out and piling up. Sometimes the errors are recoverable and we receive as much as 300K successfully before giving up.
Now that we have data flowing pretty well (but not well enough), tried reinstating pty_make_raw(), but it hurt more than helped.
As a sanity check, I tried transferring from the same host over the same kind of connection (Kerberized Telnet) directly to K95's built-in Zmodem protocol, and that worked fine. So the problem is definitely in the pty. Or more precisely, where Kermit writes incoming net data to the pty master.
26 Jan 2007: Tried changing the size of the net-to-pty buffer from 24K to 1K. Result: total failure. Set both buffers to 1K. Still total failure. Set both to 4K: now we get about 45K of data, then failure. Put them both back to 24K, still fails totally — the same code that worked pretty well yesterday. Actually, no downloads work, not even Kermit, not even of text files.
27 Jan 2007: Since I have not been able to find a way to make ptys work for this, I made a third copy of this routine, this time using pipes instead of ptys. The disadvantage here is that if the external protocol does not use stdio, the pipes won't work, but one thing at a time…
Inferior Kermit starts in lower fork, but when it tries to send its first packet it gets errno=9 EBADF, Bad File Descriptor. Substituting G-Kermit as the external protocol, which is simpler, reveals that the problem is that the external protocol gets errors when it tries to manipulate its stdio file descriptors with ioctls, etc; these are not valid for a pipe. The pipe mechanism itself works. If I take out the test for ttpkt() failing in gkermit, the file transfer works OK. Trying Zmodem… Sending works OK; receiving works a lot better than with ptys (it got 360K into the file before failing). Making the buffers smaller doesn't help.
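The pipe arrangement can be sketched roughly like this. It is a generic two-pipe fork/exec, not the third ttptycmd() variant itself, and spawn_on_pipes() is a hypothetical name. Because the child's stdin and stdout are pipes rather than a terminal, any terminal-mode calls it makes on them fail, which is the limitation noted above.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Start argv[0] with its stdin and stdout connected to two pipes.
 * On success returns the child's pid, stores the descriptor for writing
 * to the child's stdin in *to_child and the descriptor for reading its
 * stdout in *from_child.  Returns -1 on failure.
 */
pid_t spawn_on_pipes(char *const argv[], int *to_child, int *from_child)
{
    int in[2], out[2];             /* in: parent->child, out: child->parent */
    pid_t pid;

    assert(argv != NULL && argv[0] != NULL);
    if (pipe(in) < 0) return -1;
    if (pipe(out) < 0) { close(in[0]); close(in[1]); return -1; }

    pid = fork();
    if (pid < 0) {
        close(in[0]); close(in[1]); close(out[0]); close(out[1]);
        return -1;
    }
    if (pid == 0) {                /* lower fork */
        dup2(in[0], STDIN_FILENO);
        dup2(out[1], STDOUT_FILENO);
        close(in[0]); close(in[1]); close(out[0]); close(out[1]);
        execvp(argv[0], argv);
        _exit(127);                /* only reached if exec failed */
    }
    close(in[0]);                  /* upper fork keeps only its own ends */
    close(out[1]);
    *to_child = in[1];
    *from_child = out[0];
    return pid;
}
```

The upper fork would then run its select() loop over the network descriptor and these two pipe descriptors, and collect the exit status with waitpid() once the read side reports end-of-file.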
I'm starting to wonder if the problem might be in my buffering code, rather than in the pty or pipe interface… Try making a version that does single-character reads and writes.
This one reads the first packet from the lower Kermit and sends it. It is recognized by the other Kermit, which sends an ACK. We see the ^A of the ACK, but then select() times out on the next character — OF COURSE: because at a lower level, it has already been read. We have to check the myread buffer, and then call select() only if it's empty. Making this change:
Let's work our way back… With the same changes to the buffered pipe version:
But maybe now we're seeing pipe artifacts, so going back one more step to the version that gets its own pty and starts its own fork:
Another breakthrough: Moved the write pieces to below the read pieces. This is what was preventing the buffer reset code from working — with the writes done before the reads, we never catch up and can never reset the buffers.
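The earlier point about checking the read buffer before calling select() can be shown with a small sketch. The struct and function names are hypothetical and this is not Kermit's myread code; it only shows why select() alone gives the wrong answer once a lower layer has already pulled bytes off the descriptor.

```c
#include <assert.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* A tiny buffered reader; the names here are hypothetical. */
struct rbuf {
    int fd;
    char data[1024];
    int pos;      /* next unread byte in data[] */
    int len;      /* number of valid bytes in data[] */
};

/* Bytes already read from fd but not yet consumed. */
int rbuf_pending(const struct rbuf *b)
{
    return b->len - b->pos;
}

/*
 * Wait up to secs seconds for input.  Returns 1 if rbuf_getc() will
 * not block, 0 on timeout, -1 on error.  The buffer is consulted
 * first, because select() only knows about the descriptor.
 */
int rbuf_wait(struct rbuf *b, int secs)
{
    fd_set rfds;
    struct timeval tv;
    int rc;

    assert(b != NULL);
    if (rbuf_pending(b) > 0) return 1;
    FD_ZERO(&rfds);
    FD_SET(b->fd, &rfds);
    tv.tv_sec = secs;
    tv.tv_usec = 0;
    rc = select(b->fd + 1, &rfds, NULL, NULL, &tv);
    return rc > 0 ? 1 : rc;
}

/* Return the next byte (0..255), or -1 on EOF or error. */
int rbuf_getc(struct rbuf *b)
{
    if (rbuf_pending(b) == 0) {
        ssize_t n = read(b->fd, b->data, sizeof b->data);
        if (n <= 0) return -1;
        b->pos = 0;
        b->len = (int)n;
    }
    return (unsigned char)b->data[b->pos++];
}
```

After the first rbuf_getc() has slurped a whole packet into the buffer, a bare select() on the descriptor times out even though the rest of the packet is sitting in memory; rbuf_wait() avoids that by looking at the buffer first.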
From the log it looks like ttpkt() fails in the lower Kermit. Switching this with the hacked G-Kermit… it gets "transmission error on reliable link". Tried again with real Kermit below, this time with "-l 0" and not streaming. This was actually working, but slowly, I don't see any NAKs in the packet log, but then select() timed out.
28 Jan 2007: Restored both the calls to pty_make_raw():
Backed off on calling pty_make_raw(). Same thing. Reduced size of net-to-pty buffer. Same thing.
15 Feb 2007… Decided to give up on this and publish it as is, in hopes that somebody with more experience with ptys can make it work, because I'm just going in circles. So today I just have to get the code into shape so people can choose among the three alternative routines. The second one, yttyptycmd(), is the one that uses openpty(), which is not portable, so it can be enabled only for Mac OS X, NetBSD, FreeBSD, and Linux, or by also defining HAVE_OPENPTY at compile time. Anyway, if you build Kermit in the normal way, you get the regular behavior: ttruncmd() is used to execute external protocols. If you build it with -DXTTPTYCMD, you get the first version of ttptycmd(); with -DYTTPTYCMD the second, and with -DZTTPTYCMD the third.
I wrote all that 13-14 years ago. I hardly remember any of it. Anyway, here's the last ckutio.c module where I was working on this from 15 Feb 2007, with the aforementioned routines:
Mon Mar 12 16:52:20 2007: Put some effort into making ttpty.c more useful; added a debug log. Found that for some reason, at least on Mac OS X, select() always timed out at the end. I added a SIGCHLD alarm and that seems to handle the fork exit condition very nicely. Now we can send (say) a 3MB file at good speed on Ethernet (1Mcps) considering the debugging, etc, and it terminates instantly. But when sending a file into ttptycmd (with "gkermit -r"), things go wrong at the end: the Z packet is never acknowledged. This is reproducible. Maybe this is a good lead.... The log shows that select() timed out, even though the gkermit fork had not yet exited (or finished). It looks like gkermit sent the Ack, ttpty.c read it from the pty and sent it out the net:
0003: read pty=8 <-- read Ack from pty
0003: loop top have_pty=1
0003: loop top have_net=1
0003: FD_SET pty_in
0003: FD_SET ttyfd in
0003: FD_SET ttyfd out=8
0003: nfds=5
0003: select=1
0003: FD_ISSET ttyfd out
0003: write net=8 <-- send ack to net
0003: loop top have_pty=1
0003: loop top have_net=1
0003: FD_SET pty_in
0003: FD_SET ttyfd in
0003: nfds=5
0009: select=0
0009: select timeout - have_pty=1
But Ack never arrived. This is a streaming transfer. But nope, streaming is not the problem. If I disable streaming ("gkermit -Sr"), we hang in the middle of sending the data. If I use small packets, we don't hang: 1000 is OK, 2000 is not. In fact, the cutoff is 1024. OK, TBC…
Wed 14 Mar 2007: Receiving a file thru ttpty "gkermit -e 1200 -Srd" produces a debug log that shows that gkermit gets a lot of EAGAIN errors when it tries to read from its stdin. In fact, it takes 6 tries (read() calls) to read the S packet (27 bytes). Then when the first data packet arrives (1200 bytes), read() never returns even one single byte. The timeout interval is 15 seconds and it times out repeatedly. Added a primitive hex dump to the ttpty debug log for each read/write (showing only the first 24 characters and the last character, so it fits on one line). Tried uploading a file. The S, F, and A packets (short) are received and Ack'd OK, but then ttpty select() times out, never receiving even one byte from the D packet. Clearly, when the pty driver receives a burst of > 1K bytes, it stops working. As before, if I limit the packets to < 1K, it works fine.
Can I send an 8-bit binary file? Nope. ttpty reads the binary data just fine from the net and writes it exactly as it was received to the pty, but the first time we write an 8-bit byte, we never hear back from the PTY again. But the log shows that when the initial 7-bit packets arrive from the pty, it looks like the PTY is not in raw mode, because these packets end with ^J rather than ^M. Calling pty_make_raw() on the masterfd and slavefd explicitly, however, doesn't change anything. It doesn't matter if I do this in the lower fork or the upper fork. So maybe it's the actual semantics of pty_make_raw() that are wrong.
Thu 15 Mar 2007: Went thru all the terminal mode flags in Mac OS X; didn't help. Changed hex dump to show whole packet. Put hex dump routine in a private copy of G-Kermit. Tried to transfer an 8-bit file, logging both ttpty and gkermit. Compared what ttpty received on stdin with what it sent to the pty (same) and what was received by G-Kermit (same). Then I realized that my little test program was not putting its controlling terminal into raw mode; when I did that, I could upload binary files (streaming, 2MB/sec). And with Zmodem too (with rz; lrz doesn't work for some reason). Looking back at the original in ckutio.c, I see that ttptycmd() never called ttpkt(). Maybe that was the trouble all along. (Yup, but maybe not the whole trouble.)
Moving back to C-Kermit and the original ttptycmd() routine, adding the call to ttpkt(), and stripping out a lot of cruft, and moving the pty_make_raw() code to ckupty.c, Kermit uploads and downloads (streaming) work fine in Solaris. Zmodem sends a file, but then the transfer hangs at the very end, as if the signoff protocol were lost. This happens on Solaris. If I move back to Mac OS X, everything works just fine. Then, making a Kerberized connection from the Mac to NetBSD, I can send files from the Mac with both Zmodem and Kermit. Receiving… Kermit OK. Zmodem… Nope. "rz: Persistent CRC or other ERROR" (and created a 265MB debug.log!)
Fri 16 Mar 2007: ttptycmd() was for sending files with Zmodem across encrypted connections. But it occurred to me that it's necessary for clear-text connections too; e.g. Telnet, where 0xff has to be doubled. Of course Zmodem doesn't do that itself, so there's no way Zmodem external protocol could work when executed over a Telnet connection, and in fact it doesn't. I wonder why I ever thought it did.
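The 0xff doubling mentioned here is simple enough to sketch. This is a generic illustration of Telnet IAC escaping, not the code in C-Kermit; the function names are hypothetical, and it deliberately ignores real Telnet commands and Telnet's carriage-return rules.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define IAC 0xFF   /* Telnet "Interpret As Command" byte */

/*
 * Copy n bytes from in to out, doubling every IAC.
 * out must have room for 2*n bytes.  Returns the encoded length.
 */
size_t iac_double(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t i, j = 0;
    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        out[j++] = in[i];
        if (in[i] == IAC) out[j++] = IAC;
    }
    return j;
}

/*
 * Reverse of iac_double(): collapse each IAC IAC pair to one byte.
 * A lone IAC (a real Telnet command) is not handled here and is copied
 * through unchanged.  out must have room for n bytes.
 */
size_t iac_undouble(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t i, j = 0;
    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        out[j++] = in[i];
        if (in[i] == IAC && i + 1 < n && in[i + 1] == IAC) i++;
    }
    return j;
}
```

An external protocol that knows nothing about Telnet writes a bare 0xff; unless the program relaying its output applies something like iac_double() on the way to the net, the Telnet peer interprets that byte as the start of a command.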
Wed 21 Mar 2007: Back to where we left off a week ago. Trying C-Kermit's ttptycmd() on the Mac again, in remote mode:
. G-Kermit send txt (kst): OK 832Kcps
. G-Kermit recv txt (kr): OK 425Kcps
. G-Kermit send bin (ksb): OK 1000Kcps
. G-Kermit recv bin (kr): OK 188Kcps

. sz txt (zst): OK 563Kcps
. sz bin (zsb): OK 714Kcps
. rz txt (zr): OK 863Kcps
. rz bin (zr): OK 198Kcps
So in remote mode, everything works. Now let's try a clear-text Telnet connection…
. G-Kermit send txt (kst): OK 841Kcps
. G-Kermit recv txt (krt): OK 391Kcps
. G-Kermit send bin (ksb): OK 811Kcps
. G-Kermit recv bin (krb): OK 171Kcps
And Zmodem over the same clear-text telnet connection:
. sz txt (zst): OK 91Kcps (*)
Kermit is sending sz messages like "sz 3.73 1-30-03 finished." to the host, which tries to execute them, after the transfer is finished. Of course "sz" is a command, but:
sz: cannot open 3.73: No such file or directory
sz: cannot open 1-30-03: No such file or directory
sz: cannot open finished.: No such file or directory
Did I lose that code that dis-redirects stderr when I went back to using the pty code from the ckupty module? No, it's there and it's being executed. Apparently the copy of sz I have is writing its "finished" message to stdout because "sz blah 2> /dev/null" does not suppress it. Starting again with lsz instead of sz:
. sz txt (lzst): OK 413Kcps
. sz bin (lzsb): OK FAILED (*)
. rz txt (lzrt): OK
. rz bin (lzrb): OK
(*) Sigh. Using lsz, we get "garbage count exceeded" errors and eventual failure. But using regular sz, we get the extraneous message that starts sz on the far end, and the resulting getty babble.
But even without changing the code, it will work one minute, and then fail consistently the next. For example, I was able to send files with sz successfully over and over, but with the getty babble at the end. Then, after trying lsz and then going back to sz, every attempt at sending a file quits with "Got ZCAN". The difference has to be that Kermit always does at least some minimal encoding of C0/C1 control characters such as NUL, DEL, and IAC, and I doubt that Zmodem does.
If file transfer is initiated but never completes (ie a line like :Bytes Sent: 0/ 513 BPS:0 ETA 00:00 Retry 0: Got ZCAN can be seen, but transfer never completes), chances are the pty/tty on one of the systems are not 8-bit clean. (Linux is 8-bit clean, NetBSD is not). Using the -e (escape) option of rz should solve this problem.
It doesn't, at least not with lrz. And yes, the receiving end happens to be NetBSD. But it looks like the zssh people have been down this road too.
But with rz and sz, it worked. Once. Twice. Three times. But of course, with the getty babble at the end. This can be taken care of by doing:
rz -eq ; cat > foo
which puts "sz 3.73 1-30-03 finished" and any other messages in foo (but you have to type ^D to finish the cat). Using this method I was also able to send an 8K binary file that contained a test pattern of all 256 possible byte values. Then I tried a 3MB binary executable. All OK. So here we go again:
Downloading fails about halfway through a fairly large file. I tried an even bigger file, guaranteed to be 100% ASCII; same thing — halfway through: "rz: Persistent CRC or other ERROR". But it worked with a smaller version of the same file (82K versus 2MB). Tried again with the bigger version, it failed in exactly the same way at exactly the same spot: byte number 1048320. But this is just ASCII text so it can't be a transparency problem. Substituting another plain ASCII file of the same size but totally different contents, it doesn't fail (2.36MB). Back to the previous file, it fails again, but in a different spot (832960). So it's not totally deterministic.
To round things out, I tried downloading the binary test-pattern file; it's only 8K. This failed.
-4, --try-4k         go up to 4K blocksize
-B, --bufsize N      buffer N bytes (N==auto: buffer whole file)
-e, --escape         escape all control characters (Z)
-E, --rename         force receiver to rename files it already has
-L, --packetlen N    limit subpacket length to N bytes (Z)
-l, --framelen N     limit frame length to N bytes (l>=L) (Z)
Tried again with "sz -L 256 -B 256 -4aeq". Doesn't change anything.
NOTE: Mac OS X rz 3.73 1-30-03 does not support -e. NetBSD rz 0.12.20 does support -e.
Thu 22 Mar 2007: It occurs to me that ttpkt() might still be a problem; maybe it's the network connection and not the pty that is not transparent enough. To test this theory I did "stty raw ; stty -a" and then copied all of the flag values into ttpkt in the BSD44ORPOSIX section:
A little more fiddling with the flags and I got the 8K binary test pattern to SEEM to download OK (in the sense that rz gave a 0 return code) but the file itself was truncated, always at 224. If I changed the test pattern file to not include any bytes with value 224 (0xe0) or 255 (0xff), the download worked. So we have a transparency problem somewhere. The debug log shows that all byte values are being received from the network correctly so the problem has to occur when we try to feed them to the pty.
But no amount of twiddling with the termios flags seems to let these characters pass through. Of course, since they are not in the C0 or C1 control range, "sz -e" doesn't quote them (which it does by prefixing with Ctrl-X and then adding 0x40 to the byte value, so that, e.g., NUL becomes ^X@). Note that 255 does not cause problems because it coincides with the IAC character; the remote Telnet server doubles outbound IACs, and Kermit's ttinc() undoubles them automatically (as the log shows).
Trying to send a different file (a C-Kermit binary) shows that 255 is the real killer; the file is truncated where the first one appears (at about 6K), even though some 224's precede it. Going back to the remote-mode test, I see the same thing happens with the binary test-pattern file, if I send it from K95 direct to rz-under-C-Kermit-in-remote-mode. So it has nothing to do with C-Kermit having a network connection. Yet if I send the same file direct from K95 to rz, it goes OK and the result is not truncated, so it's not Zmodem either. The data arrives at C-Kermit intact, so the failure is definitely in writing it to the rz process through the slave and master ptys.
BUT if I send the same file from K95 to rz-under-ttpty, that works. What's the difference? Suppose I just transplant ttpty literally into C-Kermit… It makes no difference. When receiving the test-pattern, it truncates it in exactly the same place.
Well, all this is on Mac OS X. What if I move it to a different platform? OK, building on Solaris and following the exact same procedure, ttptycmd() doesn't even use the network connection. I think that's because rzsz on Solaris is hardwired to use the controlling terminal and can't be redirected, even in a pty?
Moved to NetBSD.
Well, this is a big mess. Sending doesn't work (or sometimes it does but reports that it didn't). Receiving… well, actually it's the same thing; the file is completely transferred but then the final protocol handshake is lost. The local C-Kermit returns to its prompt, but rz is still running:
Retry 0: Got TIMEOUT Retry 0: TIMEOUT Retry 0: Got TIMEOUT Retry 0: TIMEOUT Retry 0: Got TIMEOUT Retry 0: TIMEOUT Retry 0: Got TIMEOUT Retry 0: TIMEOUT Retry 0: Got TIMEOUT Retry 0: TIMEOUT Retry 0: Got TIMEOUT
I don't see how that is even possible. Even after I exit from Kermit the messages keep coming, even though ps doesn't show the rz process anywhere. Looking at the code, I see a place where end_pty() was still commented out from the ttpty.c episode; I uncommented it. But still:
Conclusion for the day: I think this is hopeless. Even if I can get it to work somewhere, the results depend on the exact Zmodem software, how it uses stdin/out vs stderr versus getting its own nonredirectable file descriptor, versus the Zmodem version on the other end and which options are available on each, versus the pty and select() quirks on each platform, and on and on. It will be so hard to explain and to set up that nobody would ever use it. It would be better to just implement Zmodem internally.
Fri 23 Mar 2007: Went back to the small test program, ttpty.c. Tried setting both the master and the slave pty to rawmode, even though I have never seen any other software that did this. I had it receive the binary test pattern file; it worked. I made a bigger test-pattern file, 3MB, containing single, double, and triple copies of each byte in byte order and in random order, this one was accepted too.
So it would seem that the ckupty.c module is something to avoid after all. It's full of stuff I don't understand and probably should not undo. So changing C-Kermit's ttptycmd() to manage its own pty again, using openpty() (which is not portable), I got it all to work in remote mode: Kermit text/binary up/down and Zmodem text/binary up/down. But in local mode on the client side of a Telnet connection…
zst: OK, but we still get the getty babble at the end that starts sz.
zsb: OK, ditto. This is with the 3MB test-pattern file.
zrt: Not OK: "Persistent CRC or other ERROR"
zrb: Not OK: got the cutoff at 224 again, "Persistent CRC or other ERROR"
It's close. But actually this was still with USE_CKUPTY_C defined. When I undefined it, it was back to being totally broken. Start over. (Check the new cfmakeraw() code.)
Tue 27 Mar 2007: Starting over. Back to ttpty.c. Let's verify, VERY CAREFULLY, that it really does work, using the most stressful of the four tests: sending the big (3.2768MB) binary test pattern from K95 into rz through ttpty, logging everything. ttpty definitely receives the big file smoothly with no errors or hiccups when I have it set to use the master side of the pty for i/o. The application program (Zmodem in this case) runs on the slave, and the network and/or control program communicates with the master. This implies that Zmodem controls the terminal modes of the slave, and ttpty should be concerned with those of the master. Doing it this way in ttpty confirms this.
Fine. But if I tell ttpty to SEND a file with sz, nothing happens. Ditto with lsz. Select times out waiting for input from the pty. But if I manually tell K95 to RECEIVE /PROTOCOL:ZMODEM it works OK. Somehow sz's initial B000000 string is being swallowed somewhere, and it's waiting for a reply from the receiver. sigh… But "ttpty gkermit -s filename" works fine. What's the difference? It has nothing to do with stdout vs stderr; sz is not writing to stderr at all. Is it some timing thing between the forks? Aha. It's that I change the modes of the pty master in one fork while sz is already starting in the other fork.
OK, good, now for the first time we have Kermit and Zmodem both able to upload and download a large worst-case binary test-pattern file… in remote mode. Now taking today's lessons and fitting them back into C-Kermit so I can try it local mode…
Using G-Kermit as the external protocol, first in remote mode… All good: text/binary up/down. The "halting problem" is solved by SIGCHLD, which catches fork termination instantly and lets ttptycmd() know there is no more pty. Zmodem:
zst: OK
zsb: OK
zrt: OK
zrb: OK
That's a first. Next, repeat in local mode, in which C-Kermit is the client and has made a Telnet connection to another host over a secure (Kerberos V) connection:
kst: OK zst: ... ksb: OK krt: OK krb: OK
It seems we can never end a day on a high note. Somehow I seem to have broken regular internal Kermit protocol transfers over encrypted connections — the en/decryption engine loses sync. But they still work OK over a clear-text Telnet connection.
OK, back to ttptycmd.... It seems that back on March 27th, I got everything working but I thought that there was still something wrong with it because of an unrelated problem, so I put it aside. The version of ttpty.c from that date worked OK, and it looks like I updated ckutio.c from it, but that version of ckutio.c was put aside. Since then I have been working on the ckutio.c version that was NOT put aside and so now I have to reconcile the two:
As a first cut I did this simply by replacing the contents of the #ifdef CK_REDIR section of the latter with that of the former. Of course in Solaris this comes up with openpty() implicitly declared at compile time and unresolved at link time. So the first task is to get HAVE_OPENPTY defined for platforms that have it and have the others use ttruncmd(). For starters I put an #ifdef block in ckcdeb.h that defines HAVE_OPENPTY for Linux, FreeBSD, NetBSD, OpenBSD, and Mac OS X. Ones that don't have openpty() include AIX, HP-UX, and Solaris. Others like SCO I don't know but I doubt it. The real solution is to get the ckupty.c module to work but one thing at a time… This version is supposed to work with secure builds on the openpty() platforms, and on the others like Solaris, if an external protocol is attempted on a secure (encrypted) connection, an error message is printed and the command fails. ckutio.c, 14 Aug 2007.
How to test? Apparently I did all my testing on Panix before, and that's where all my Zmodem builds are, but now when I build a Kerberized version (which works if I do it on the right pool host), it won't make a local connection, and there is no other place I can connect to that has a Kerberized Telnet server. I can, however, connect to Panix from here, using the same code, but on Mac OS X…
  Kermit     Zmodem
  kst OK     zst OK
  ksb OK     zsb OK
  krt OK     zrt OK
  krb OK     zrb Failed "rz: Persistent CRC or other ERROR"
We've seen this before. The problem is 0xff, Telnet IAC, as I proved to myself by constructing a 3MB file that contained every byte but 0xff in every mixture and order and transferring it successfully over the same connection. Presumably the Telnet server is doubling IACs, whereas of course rz is not undoubling, thus the CRC error. This is progress. 15 Aug 2007.
Log shows that indeed every IAC in the source file arrives doubled. Adding code to remove the first IAC of every adjacent pair, a small test file with different-length runs of IACs transfers OK. The 3MB all.bin file does not.
Starting over… I can receive a big text file with Zmodem OK. The 3.2MB binary test pattern that contains no IACs failed after 1.8MB, but the part that it transferred was OK. A second try, almost the whole thing arrived, it stopped just 584 bytes short of the end. Could be that file size is a separate problem. Making a new copy exactly 1MB long… Well, that's interesting, this one too stopped just short of the end. And again, the same thing. When connecting back to the host, the last Zmodem packet can be seen on the screen; it's as if the local Zmodem exited before reading the last packet… But OK, if I change the options on the remote sz sender to use small blocks, etc, then it works.
Now, changing from the 1MB no-IAC-binary test pattern, to the 1MB all-values test pattern, we fail after 81K. But the part that was transferred is correct. Second try, same thing, but 57K. Third: 40K. Each time, upon connecting back, the session is completely dead.
IF I HAVE TO undouble IACs for incoming files, don't I have to double them going out? To send a block to net we just call ttol(), but ttol() doesn't do any doubling (because Kermit protocol always quotes 0xff). To see what happens, I changed the ttol() call to ttoc() in a loop that doubles IACs. I tested this by sending the full 3.2MB test pattern, which worked fine.
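The outbound doubling described above can be sketched on a whole buffer, so that the result could be handed to a single ttol() or write() instead of one ttoc() per byte. This is only an illustration: double_iacs() is a hypothetical name, not a C-Kermit routine, and the caller is assumed to supply an output buffer of at least twice the input length.

```c
#include <assert.h>
#include <stddef.h>

#define IAC 0xff  /* Telnet "Interpret As Command" byte */

/*
 * Copy n bytes from in to out, writing every IAC (0xff) twice so the
 * Telnet peer treats it as data rather than the start of a command.
 * out must have room for 2*n bytes (worst case: every byte is IAC).
 * Returns the number of bytes placed in out.
 */
size_t double_iacs(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t i, k = 0;

    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        out[k++] = in[i];
        if (in[i] == IAC)
            out[k++] = IAC;   /* second copy acts as the quote */
    }
    return k;
}
```

Working on a buffer rather than byte by byte keeps the number of network writes the same as before the change; only the byte count grows.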
For receiving, it's slow but it works OK with files that don't contain IACs (my concern was that IACs might appear in outbound files or in Zmodem protocol messages). It receives the 1MB no-IAC test pattern, so there are no problems with protocol or timing. But the full test pattern always gets cut off, but at different points, as before, with the remote session dead. Changing the Zmodem receiver from rz to lrz on the local end (since the sender on the remote end is lsz) does not change the behavior.
Anyway, I went back and replaced the byte loop with something more efficient, and it goes about 20 times faster. But this doesn't help either, it only makes it fail faster. But aha, what if a doubled IAC is broken across successive pty reads — we have to make the "previous character" memory persistent. Well, that was a good insight, but it still didn't fix it. The log shows the IAC handling code is working fine.
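The "persistent previous character" idea can be sketched as a small state structure that outlives each read, so that an IAC pair split across two reads is still collapsed. The names iac_state and undouble_iacs() are hypothetical, and the sketch makes a simplifying assumption: an IAC followed by anything other than IAC (a real Telnet command) is just copied through unchanged, which an actual relay would have to handle differently.

```c
#include <assert.h>
#include <stddef.h>

#define IAC 0xff

/* Carried from one call to the next: did the previous buffer end
   with an IAC whose partner has not been seen yet? */
struct iac_state {
    int pending_iac;
};

/*
 * Copy in[0..n-1] to out, collapsing each IAC IAC pair to one 0xff.
 * out needs room for n + 1 bytes (a held-over IAC may be emitted).
 * Returns the number of bytes placed in out.
 */
size_t undouble_iacs(struct iac_state *st, const unsigned char *in,
                     size_t n, unsigned char *out)
{
    size_t i, k = 0;

    assert(st != NULL && in != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        unsigned char c = in[i];

        if (st->pending_iac) {
            st->pending_iac = 0;
            out[k++] = IAC;          /* the pair yields one data byte */
            if (c != IAC)
                out[k++] = c;        /* assumption: pass commands through */
        } else if (c == IAC) {
            st->pending_iac = 1;     /* hold it until the next byte arrives */
        } else {
            out[k++] = c;
        }
    }
    return k;
}
```

The state must be reset to zero at the start of each transfer; otherwise a stale pending flag from an earlier session would swallow or misplace a byte.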
What does sz say? Capturing its stderr to a file… "Retry 1: Got ZCAN". Next time: "Retry 1: Got TIMEOUT". Next time: Got ZCAN.
Trying different Zmodem options… apparently I don't need to use short blocks. But I do need to use -e, probably because of Telnet NVT treatment of carriage return; without -e, there is a "persistent CRC error". -O disables timeouts, but this makes no difference.
OK, we still have two Big Problems:
1. When a long file has no IACs, the final < 1K of the file is not received.
2. When a long file has IACs, the transfer generally stops very early.
Problem 1: the transfer consistently fails less than 1K from the end of the file. Upon CONNECT back to the host, a big Zmodem packet is sitting there waiting to be read, which means ttptycmd()'s copy of rz is terminating early. Can we catch it in the debug log? Doing this takes forever and writes a GB to the disk… And then the problem doesn't happen. Also, I can receive a HUGE text file almost instantly with no errors at all.
Switching to lrz on the receiving end, now I see the error messages, about 300 lines like this:
Retry 0: Garbage count exceeded
Bytes received: 872352/1000000 BPS:85464 ETA 00:01
Retry 0: Bad CRC
Bytes received: 892448/1000000 BPS:86690 ETA 00:01
Retry 0: Bad CRC
Retry 0: Garbage count exceeded
Retry 0: Garbage count exceeded
Retry 0: Garbage count exceeded
Retry 0: Garbage count exceeded
Retry 0: Garbage count exceeded
Retry 0: Garbage count exceeded
Retry 0: Got ERROR
Bytes received: 898336/1000000 BPS:84293 ETA 00:01
Retry 0: Bad CRC
Retry 0: Garbage count exceeded
Retry 0: Garbage count exceeded
Bytes received: 900384/1000000 BPS:83751 ETA 00:01
Bad escape sequence 2f
Retry 0: Bad data subpacket
Bytes received: 941472/1000000 BPS:86191 ETA 00:00
Retry 0: Bad CRC
Retry 0: Garbage count exceeded
Even when it succeeds, it gets these. But if I receive a text file, no matter how big, no errors or retries or timeouts at all. So it appears that there is only one problem: a big-time lack of transparency regarding 8-bit and/or control characters. The odd thing is, it's not that the characters can't get through — they all can — but they seem to cause transitory blockages. 16 Aug 2007.
Cleaned up the remaining pointer signedness warnings in ckutio.c, but this was a mistake, it broke Kerberos connections completely. Undid the changes. ckutio.c, 17 Aug 2007.
Changed all return() in the fork()==0 section of ttptycmd() to exit(). ckutio.c, 17 Aug 2007.
Tried explicitly setting the slave pty to rawmode. Makes no difference. Tried using the Mac OS X (curses) raw() function, and also system("stty raw"); still no difference. Tried doing all of these in different combinations and orders. I found one combination that cuts the errors about in half, and the transfer of the no-IAC test pattern almost always succeeds (but it's slow). Anyway, it doesn't help much with the test pattern that contains IACs. Well, the code is more solid than it was before but functionally we have not advanced much if we can't download a binary file with Zmodem! On the other hand, we can upload them, and we can transfer text files in both directions, which is an improvement over the previous situation, in which the entire session would hang due to loss of synchronization of the encryption stream.
Tried adding -funsigned-char to CFLAGS of Mac OS X target. It does not make the "signedness" warnings go away and it doesn't change the runtime symptoms.
I tried a simpler version of pty_make_raw(), the one from Serg Iakovlev, but it was a total failure. That's encouraging though, because it indicates that pty_make_raw() is the right place to be working.
Then I made pty_make_raw() set or unset every single terminal flag explicitly. This made no difference, but didn't hurt anything either.
Then I made pty_make_raw() explicitly set all the c_cc characters to 0 (but left c_cc[VMIN] as 1). This made no difference either.
I checked pty_make_raw() against ttpkt() and the only difference I found in the terminal flags is that ttpkt() sets IGNPAR thinking it means "ignore parity errors" when really it means "discard any character that has a parity error" (at least according to Iakovlev) — exactly the opposite. But I tried it both ways, no difference. 17 Aug 2007.
I noticed that even Zmodem text receives can fail. They don't get any errors, they just get cut off shortly before the end. (But usually they succeed, and fast too, like 500K cps).
What if I don't call pty_make_raw() at all on the slave pty?
zrt: EESSSSSSSS: 80% good (E = stopped just before end but no other errors)
zrb no-IAC test pattern, short blocks:
1. S/5 (success with 5 screens of errors)
2. S/7
3. S/7
4. S/6
5. E/7 (failed just before end)
6. S/7
7. S/6
8. S/6
9. S/6
So, lots of errors, but it recovered 90% of the time. Next, same thing, but without requesting short blocks:
1. E/5
2. S/5
3. E/4
4. S/5
5. S/5
6. S/5
7. X/0 (hard failure right away: "Got ZCAN")
8. S/5
9. S/5
So it doesn't look like short blocks make that much difference. Now what if I turn off prefixing? Bad CRC, fails immediately every time. Putting back pty_make_raw(slave), it still fails hard.
Tried a new strategy with pty_make_raw(): rather than modify existing flags, I set all flags to 0, and then turn on only those few that we need like CS8. Now we get only 2.5 screens of errors instead of 4-7 and the transfer rate is higher for binary files (all of the previous ones were under 100K CPS, while for text files it was 400-500K CPS):
1. S/2 195669 CPS
2. S/2 194720
3. E/3
4. S/2 192550
5. S/3 192325
6. S/3 145066
7. S/2 200689
8. S/3 188948
9. S/2 209461
10. S/3 181991
I noticed that there was no TIOCSCTTY ioctl in the pty/fork setup sequence, which is recommended somewhere, so I tried that and it was a disaster; the entire session hung. I took it back out. 18 Aug 2007.
Tried some transfers over a clear-text (not encrypted) connection with the same results: smooth, fast transfer of a big text file (400K cps); rocky but successful transfer of the no-IAC binary pattern file (135K cps). Switching back to ttruncmd(), the same binary file is received at 1.5M cps, and the no-IAC binary file totally fails after too many "Bad CRC"s; and we already know that any file that contains IACs will fail. One might say that ttptycmd() is better in every respect than ttruncmd() except in speed (when it works).
Let's see if ttptycmd() still works in remote mode (to local K95):
What about ttruncmd() in remote mode?
So we use ttruncmd() for remote mode, and we use it for local mode serial-port and modem connections, and we use ttptycmd() on network connections because (a) they might be encrypted, and (b) even if they are not, they use some protocol that we have to handle, e.g. Telnet, Rlogin. 19 Aug 2007.
Discovered that sending binary files no longer works. Text is OK, binary transfers don't even start. This happens on both encrypted and clear-text connections. ttptycmd() is being used in both cases. But oddly enough, receiving binary still works as before. What did I break, and when? Oh, it was just the script, when I changed it from using sz to lsz. Putting it back to sz makes it work, even with the full 3.2MB binary pattern with IACs.
I backed off the changes I made to ckctel.c to suppress some warnings, in view of the fact that similar changes to ckutio.c broke things so badly. 19 Aug 2007.
If sz is not given the -e flag, it sends control characters bare, except ^P, ^Q, ^S, and ^X. ^X is the control prefix, so ^A is sent as ^X followed by A. With -e, all C0 control chars are prefixed, but with ^X, which is, of course, a control character. Interestingly, the C1 analogs of ^P, ^Q, ^S (but not ^X and, unfortunately, not IAC) are also prefixed. -e makes no difference for 8-bit characters.
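The prefixing rules just described can be modeled in a few lines. This is a sketch of the observed behavior only, not code from sz or lsz: the names needs_prefix() and zdle_encode() are hypothetical, the C1 analogs are assumed to be prefixed regardless of -e, and the real protocol has further special cases that are not covered here.

```c
#include <assert.h>
#include <stddef.h>

#define ZDLE 0x18   /* Ctrl-X, the Zmodem control prefix */

/* Does byte c get a Ctrl-X prefix under the rules described above? */
static int needs_prefix(unsigned char c, int escape_all)
{
    unsigned char c7 = c & 0x7f;   /* fold the C1 analogs onto C0 */

    if (c == ZDLE)
        return 1;                  /* the prefix itself, always */
    if (c7 == 0x10 || c7 == 0x11 || c7 == 0x13)
        return 1;                  /* ^P, ^Q, ^S and their C1 analogs */
    if (escape_all && c < 0x20)
        return 1;                  /* -e: every C0 control character */
    return 0;
}

/*
 * Encode n bytes from in to out. A prefixed byte is sent as Ctrl-X
 * followed by the byte plus 0x40, so ^A (0x01) becomes ^X 'A'.
 * out must have room for 2*n bytes. Returns the encoded length.
 */
size_t zdle_encode(const unsigned char *in, size_t n,
                   unsigned char *out, int escape_all)
{
    size_t i, k = 0;

    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        if (needs_prefix(in[i], escape_all)) {
            out[k++] = ZDLE;
            out[k++] = (unsigned char)(in[i] + 0x40);
        } else {
            out[k++] = in[i];      /* 0xe0 and 0xff go out bare */
        }
    }
    return k;
}
```

The point for this investigation is the last branch: 224 (0xe0) and 255 (0xff) are never prefixed, with or without -e, so any protection for them has to come from the relay, not from Zmodem.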
If we have a Telnet connection and the server is in ASCII (NVT) mode, CR is always followed by LF or NUL. Well, it seems the server is putting us (Kermit) in binary mode in this case, but staying in ASCII mode itself. Added code to handle NVT byte stuffing and unstuffing in each direction independently, according to the TRANSMIT_BINARY state in that direction. I made a file containing just the bytes 0-31 and 127 and 128-159 and 255 (66 bytes all together) and sending it from the host to C-Kermit, the local log shows that every control character was received correctly and all TELNET conversions were done right — NUL removed after CR (and only after CR); IAC removed after IAC (and only after an IAC meant as a quote). For the first time, I can receive the 1MB all-values test pattern, but there are still tons of (correctable) CRC errors, so the transfer rate is really awful, like about 5% of what we get with a text file (25Kcps instead of 500).
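For the inbound direction, the "NUL removed after CR (and only after CR)" rule can be sketched as a filter keyed on that direction's TRANSMIT-BINARY state. The names nvt_state and nvt_unstuff() are hypothetical and this is not the code added to C-Kermit; the CR flag is kept in the state structure so that a CR at the end of one read and its NUL at the start of the next are still paired.

```c
#include <assert.h>
#include <stddef.h>

struct nvt_state {
    int binary;    /* nonzero if this direction is in TRANSMIT-BINARY */
    int prev_cr;   /* last byte seen was CR; persists across calls */
};

/*
 * Copy in[0..n-1] to out, dropping the NUL that an NVT-mode sender
 * inserts after a bare CR. In binary mode nothing is removed.
 * out needs room for n bytes. Returns the number of bytes kept.
 */
size_t nvt_unstuff(struct nvt_state *st, const unsigned char *in,
                   size_t n, unsigned char *out)
{
    size_t i, k = 0;

    assert(st != NULL && in != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        unsigned char c = in[i];

        if (!st->binary && st->prev_cr && c == 0x00) {
            st->prev_cr = 0;       /* drop this NUL, and only this one */
            continue;
        }
        st->prev_cr = (c == 0x0d);
        out[k++] = c;
    }
    return k;
}
```

A separate nvt_state would be kept for each direction, since the log shows the two ends can disagree about binary mode.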
Further experimentation shows that the fundamental transparency problem is fixed; we can receive short files (say, 1K or less) containing absolutely any byte values in any combination with no errors at all. But once the file size reaches (say) 10K, we get CRC errors, like one every 2 or 3K of data. These are not deterministic. In successive transfers of the same file, they come in different spots. It's tempting to blame pty buffer overruns, but then text files would show the same behavior. When a binary file size exceeds, say, 1MB, the chances of successful completion go way down, independent of whether my external protocol is rz or lrz. I like lrz better because the error reports come out on the screen as the transfer is going on. Trying to download a real-world binary file (a 2.2MB C-Kermit executable), I get 4500 error messages but the transfer eventually succeeds, with an effective throughput of 21Kcps.
Actually it turns out that "sz -a somebigtextfile" (2.2MB) also gets a lot of CRC errors. The -e flag (escape all control characters) makes the same big text file transfer with few or no errors. It's not sure-fire. Sometimes no errors, sometimes one or two, and sometimes a fatal error that kills the transfer.
With binary files… a 32K binary file seems to make it every time. 40K fails about 50% of the time. 48K fails 60% and every time it fails, it has created a partial file of exactly 32K (32768 bytes). 96K fails 9 out of 10 times; when it fails, the partial file is always 0 bytes, or 32768, or 65536, but that just means that rz's file output buffer is 32K.
Why, then, do binary files cause trouble if it is not a solid transparency problem? If a certain file can get through once, why can't it get through every time? When a character arrives at the pty, the pty driver probably takes a different path through its code, checking the terminal flags that would affect that character. I tried making Kermit's network read buffers very small but, surprisingly, this made things worse. I also tried making them very much bigger, which didn't help either. 24K still seems to be the right size.
So, is it that some characters take longer to process than others? So long that data is lost due to lack of flow control between TCP and the pty? One way to test this theory is to slow Zmodem down. I tried "-l 32" which, according to the man page, tells sz to "wait for the receiver to acknowledge correct data every N (32 <= N <= 1024) characters. This may be used to avoid network over-run when XOFF flow control is lacking." Makes no difference. I also tried the -w (Window) switch, ditto. In fact there are all sorts of options to set the "window size", "packet length", "block size", and "frame length", but with no explanation of what these mean or how they are related. If I crank everything down to minimum value:
lsz -q -L 32 -l 32 -w 1
I get 50% success with the 96K file instead of 10%. Adding -e, oddly enough, made it worse. I also tried setting the environment variable ZNULLS to different numbers like 512, no help there either.
I tried making the read-from-net-write-to-pty buffer small (1K) but leaving the pty-to-net one big. This improves chances of success, but it's intolerably slow (3Kcps when the connection is capable of 500K).
I also changed the write-to-pty operation from a single write() call of possibly many K characters to a byte loop, one write() per byte. Same result: success (but still about 300 recoverable errors), throughput 3Kcps. 20 Aug 2007.
With ttptycmd() configured to write to the pty in a byte loop, it is possible to delay each write. Adding a 10msec delay per character results in a transfer that runs at about 20 cps and (for the 96K test file) would take about 80 minutes to complete. And yet it still gets just as many errors. So it's not a matter of timing either. The errors come, on average, every 388 bytes, but not at regular intervals.
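The byte-loop experiment can be sketched as follows. The function name write_bytewise() is hypothetical and this is not the ttptycmd() code; in particular, retrying immediately on EAGAIN is a simplification, and a real relay on a non-blocking descriptor would wait for writability instead of spinning.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <time.h>
#include <unistd.h>

/*
 * Write n bytes to fd with one write() call per byte, sleeping
 * delay_ms milliseconds after each byte that goes out.
 * Returns the number of bytes written, or -1 on a hard error.
 */
long write_bytewise(int fd, const unsigned char *buf, size_t n, long delay_ms)
{
    struct timespec ts;
    size_t i = 0;

    assert(buf != NULL && delay_ms >= 0);
    ts.tv_sec = delay_ms / 1000;
    ts.tv_nsec = (delay_ms % 1000) * 1000000L;

    while (i < n) {
        ssize_t r = write(fd, buf + i, 1);

        if (r == 1) {
            i++;
            if (delay_ms > 0)
                nanosleep(&ts, NULL);    /* pace the next byte */
        } else if (r < 0 && errno != EINTR && errno != EAGAIN) {
            return -1;                   /* real failure */
        }
        /* r == 0, EINTR or EAGAIN: try the same byte again */
    }
    return (long)i;
}
```

As a check on the figures above: a 96K file is 98,304 bytes, and at 20 cps that is about 4,900 seconds, or roughly 80 minutes.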
I tried the TIOCREMOTE ioctl on the pty master, as discussed somewhat obliquely in the Mac OS X "man pty" page; "This mode causes input to the pseudo terminal to be flow controlled and not input edited (regardless of the terminal mode)" — sounds like just the ticket but it made no difference. Actually, looking at a man page on another OS (Solaris), it says this is only for lines of text, EOLs are supplied, so that would mess up the protocol. So remember: don't use this.
Tried without O_NDELAY; the behavior was the same but the speed was much slower.
Tried switching back to the ckupty.c routines on Mac OS X and found that it works now the same as with openpty(), except that I seem to get more getty babble at the end. But this means I can run some tests on Solaris. I moved the entire test environment from Mac OS X 10.4.9 to Solaris 9. But it doesn't work at all.
Trying to figure out the ckupty.c modules again.
Note that the file descriptor of the slave is known only to the lower fork. Therefore the lower fork is the one that has to set all the tty modes, etc. I took care of all that but the ckupty.c method doesn't work at all on Solaris. But it works "fine" on Mac OS X (the 32K all-bytes test file transfers instantly with no errors, but the 96K one errors out).
The problem on Solaris is that pty_make_raw() fails on the masterfd (but not on the slavefd) with errno 25 "ioctl inappropriate for device". It doesn't matter whether I do it in ckupty.c or ckutio.c. I found a web page on kde.org that says Solaris does not allow tcget/setattr() on a pty master. But the Sun "knowledge base" is not open to the public. Well, presumably changes made to the slave are reflected in the master (comments in Solaris telnetd seem to confirm this...) Let's come back to Solaris later.
Moving to a Linux with lrzsz installed… Built a Kerberos 5 version with USE_CKUPTY_C. Like on Mac OS X, it transfers short files OK and chokes on longer ones. Switched to openpty(), it behaves the same. So the problems on Mac OS X are evidently not OS-specific, which is good I guess, since that means finding the way around them will apply to more than one platform. 21 Aug 2007.
Look into TIOCSCTTY again. On System V based OS's, opening a pty acquires a controlling terminal automatically. On BSD-based OS's, no; you have to use the TIOCSCTTY on the slave file descriptor to give it one. I'm not sure why a controlling terminal would be needed, except that without one, the virtual device "/dev/tty" does not exist for the process that runs on the pty, and maybe the application that runs there (e.g. rzsz) checks for it. On the downside, having a controlling terminal opens the process up to terminal interrupts like SIGINT and SIGQUIT. Until now I have not been using this ioctl(). Results (in Linux):
With TIOCSCTTY: 96K all-bytes test: 11 screens of errors, then success Without TIOCSCTTY: exactly the same.
Tried the same thing with TIOCNOTTY instead of TIOCSCTTY, with exactly the same results (no effect whatsoever).
There has to be a way to make this work, because Zmodem works through telnetd, which is basically the same thing as ttptycmd(): a relay between the network and a pty. ttptycmd() is like telnetd backwards. Modern telnetds are not much help; they don't access ptys or the network directly, they go through "mux" devices so I can't see what they're doing to get transparency and flow control. An old BSD telnetd uses packet mode but that would be a big deal…
I tried ignoring various signals like SIGTTOU and SIGTSTP, since some Telnet clients do this. No effect, no difference. Anyway, in Linux the transfers almost always finish OK despite the many errors. There is just some trick I'm missing to make the pty accept a stream of arbitrary bytes without hiccuping.
What about Solaris, which uses ckupty.c? In streams-based OS's, where line disciplines and whatnot are pushed on top of the pty, it looks like the pty module saves the file descriptor of the "bare" slave pty (as 'spty') before pushing things onto it, and then later uses spty rather than the regular slave pty file descriptor when getting/setting terminal modes. I'm not sure what this is all about but it's definitely SysVish… It happens if STREAMSPTY is defined, but I noticed that STREAMSPTY is never defined anywhere. I tried defining it so we take an entirely different path through the code. It made absolutely no difference.
Then I noticed that HAVE_STREAMS is not defined for Solaris either. Tried defining it, but the session didn't work at all, no i/o. Removing the HAVE_STREAMS definition but keeping the STREAMSPTY defined, I rebuilt and tried "set host /connect /pty emacs". I got an EMACS screen but could not type anything into it, which means that STREAMSPTY should not be defined either. Removed the definition and "set host /pty" works again. So what's the problem with ttptycmd()?
In fact, ttptycmd() works on Solaris with Kermit as the external protocol, but not with Zmodem, not even with text files. So again, there is no fundamental problem with the code or the logic, it's Just A Matter Of Transparency to control and/or 8-bit characters — some trick I don't know about.
Looking at the Solaris debug log… I see that ckupty.c is calling init_termbuf() to get the tty modes of the master, not the slave, and set_termbuf() to set them, but you can't do that in Solaris, error 25. This is in getptyslave(). Shouldn't getptyslave() be setting the tty modes of the slave, not the master? I changed it to do this, but like all other changes, it made no difference. I checked to make sure that after the change, "set host /pty /connect emacs" still worked and it did.
And then what… I had some code to redirect stderr in ckupty.c that was not being executed due to a typo. When I fixed the typo, poof, Zmodem binary transfers started working, or working as well as they work in Linux and Mac OS X. It turns out that if I don't redirect stderr, sz and rz just don't work. But lsz and lrz do. But if I do redirect it, I don't see the progress messages from lsz/lrz. 22 Aug 2007.
Built on HP-UX 11i v3 (B.11.31 U ia64) with optimizing compiler, got tons of picky warnings, but it finished and linked and runs OK. Many of the warnings were like this:
"ckucns.c", line 1606: warning #2068-D: integer conversion resulted in a change of sign: tnopt = (CHAR) IAC;
IAC is defined as 255 in ckctel.h. If I define it as 0xff, I don't get the warnings. I changed the definitions of all the Telnet commands to be in hex notation rather than decimal. It cuts way down on the HP-UX warnings and doesn't seem to cause problems elsewhere. ckctel.h, 23 Aug 2007.
Now it looks like Solaris is working but then it hangs at the end. It appears as if the ckupty.c module is blocking SIGCHLD. Debug log shows that when the transfer is complete, we received IAC DM (Telnet Data Mark) after sz's last gasp and before the shell prompt is printed. But calling tn_doop() in this case is a mistake because we are reading the number of bytes that we know are available in a counted loop, but tn_doop() would consume an unknown number of bytes and we would never know when to exit the loop. Anyway, C-Kermit doesn't do anything with DM. Skipping over tn_doop() (and not writing out the Telnet command bytes) fixes the hanging condition at the end, even though SIGCHLD is never raised. ckutio.c, 23 Aug 2007.
Some tests, Solaris to NetBSD over K5. zst sends ascii.txt, a 2.36MB ascii text file (Kcps / Errors). zrt receives the same file:
zst  587/0  526/0  542/0  434/0  423/0
zrt  827/0  800/0  847/0  FAIL   610/0
So text is good. Binary not so good. Here we transfer the 1MB all-bytes pattern file. zrb receives it successfully, but with 1248 errors, at only 15Kcps. Sending the same file out always fails:
Begin 20070823 16:32:07: SEND BINARY all2.bin [sz]
Sending: all2.bin
Bytes Sent: 5600/1000000 BPS:12446 ETA 01:19
FAILURE
End 20070823 16:32:13
Elapsed time: 6.617992999999842
cps = 151103.2121067556
lsz: caught signal 1; exiting
Decided to move to Linux but found that something is screwed up in Linux C-Kermit with tilde expansion:
doesn't expand at all (but it did yesterday!). The problem was in the ancient, ancient realuid/setuid handling code; real_uid() no longer works in Linux. I worked around this in whoami() by setting ruid to getuid() if real_uid() returned a negative number. Maybe dangerous, worry about it later. ckufio.c, 23 Aug 2007.
ANYWAY… after fixing that, I tested zsb on Linux, and it's broken there too, using openpty(), so it's nothing to do with ckupty.c. After sending the first Zmodem data packet, it just hangs, nothing comes back. In text mode it gets farther, but then the same thing happens. Captured stderr from rz on the far end:
Bytes received: 608/1000000 BPS:21137 ETA 00:47
Retry 0: Bad CRC
Bytes received: 864/1000000 BPS:23540 ETA 00:42
Retry 0: Bad CRC
Bytes received: 1120/1000000 BPS:25003 ETA 00:39
Retry 0: Bad CRC
Bytes received: 5696/1000000 BPS:56988 ETA 00:17
Retry 0: Bad CRC
Bytes received: 9120/1000000 BPS:62227 ETA 00:15
Retry 0: Bad CRC
Bytes received: 9376/1000000 BPS:60766 ETA 00:16
Retry 0: Bad CRC
Bytes received: 9632/1000000 BPS:60361 ETA 00:16
Retry 0: Got TIMEOUT
Retry 0: Sender Canceled
Retry 0: Got ZCAN
The local sz, however, doesn't give any error message. ZCAN means: "other end canceled session by sending 5 ^X's" (or user typed them). What actually happens is that ttptycmd()'s select() times out waiting for something from the Zmodem partner and ttptycmd() itself kills the sz fork with SIGHUP. When lsz receives SIGHUP it sends the ZCAN. So the real problem is that after some point we're not receiving anything.
I changed the timeout from 4 seconds to 30 seconds and now I see it just stops for long periods of time and then resumes. The lrz log on the receiving end shows tons of timeouts, CRC errors, and other errors. The local log shows that lsz wound up sending ZCAN (2 x (10 x ^H, 10 x ^X)).
Moving on to another problem… Turns out Ctrl-C (SIGINT) is working right after all. Since I'm using my test scripts like kerbang scripts, Ctrl-C exits through trap(), as it should, closing the connection and cleaning up. If I start Kermit and tell it to TAKE the script, then Ctrl-C brings me back to the prompt with the connection still open (as it should). However, until now I haven't done anything about the fork or the ptys. Added code to trap() to kill the fork and close the master pty. ckuusx.c, 24 Aug 2007.
Added code to try to break the deadlock. If select() times out, but we have stuff to write either to the pty or the net, try to do it anyway, even though select() did not say we could. But this doesn't help because when select() times out we don't have anything to write. The problem is that after receiving that last packet from the remote rz, the local lsz doesn't seem to do anything, as if the lower fork wasn't running (and to confirm this hypothesis, sometimes I noticed that when I Ctrl-C'd out of this, the transfer would take off again).
Backing up and testing with gkermit rather than zmodem:
kst ripple.txt [824K]   OK
kst ascii.txt  [1359K]  OK
krt ripple.txt          FAILED
It seems that we can't handle streaming. If I set up krt to disable streaming on receipt, it works OK.
krt ripple.txt [824K]   OK
krb all2.bin   [1000K]  OK
So here we have no trouble sending but big trouble receiving unless we disable streaming. Whereas with Zmodem we have trouble receiving.
But this wasn't happening before, what changed? Using C-Kermit on the far end to receive the file with debug log on, I see that it is sending 4K data packet after 4K data packet, with the local gkermit silent, as expected. About midway through the transfer, the local Kermit sends an error packet "Transmission error on reliable link". Looking at G-Kermit's debug log… It receives the first five 4K data packets OK, but gets a CRC error on the fifth one, and sends the Error packet. So it has received a stream of 20-some thousand bytes OK and then messes up. That number sounds a lot like ttptycmd()'s buffer size. I changed the buffer sizes to be different:
Read from pty and write to net: 4K
Read from net and write to pty: 1K
This time it received the first 4K packet and failed on the second one. Then I increased the buffers to 98K each, expecting to receive lots more packets successfully but it bombed out on the 5th one. But that's good, it confirms there's no logic error in the buffer management. Just to make sure, though, let's set the buffer size smaller than the packet size and disable streaming. In this case we get 4 good data packets and a CRC error on the 5th one and so we request retransmission, and the next 8 times it arrives it gets a different CRC error, but the 9th copy is OK. Then the next packet comes and it gets a CRC error every time. And this is nothing but plain ASCII text.
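For reference, one direction of a relay whose correctness does not depend on the buffer size might look like the sketch below. It is not the ttptycmd() code: the names write_all, relay_chunk, and RELAY_MAX are hypothetical, and blocking descriptors are assumed. The key property is that a short write() is retried until every byte read has been passed on, so changing the chunk size changes only how many calls are made, not what arrives.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#define RELAY_MAX 4096   /* largest chunk this sketch will handle */

/* Write all n bytes, retrying after short writes and EINTR. 0 or -1. */
static int write_all(int fd, const unsigned char *buf, size_t n)
{
    while (n > 0) {
        ssize_t w = write(fd, buf, n);
        if (w < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += w;
        n -= (size_t)w;
    }
    return 0;
}

/*
 * Copy one chunk of at most bufsize (1..RELAY_MAX) bytes from `from`
 * to `to`. Returns the number of bytes relayed, 0 on EOF, -1 on error.
 */
long relay_chunk(int from, int to, size_t bufsize)
{
    unsigned char buf[RELAY_MAX];
    ssize_t r;

    if (bufsize == 0 || bufsize > RELAY_MAX)
        return -1;
    do {
        r = read(from, buf, bufsize);
    } while (r < 0 && errno == EINTR);
    if (r <= 0)
        return (long)r;
    if (write_all(to, buf, (size_t)r) < 0)
        return -1;
    return (long)r;
}
```

A relay built this way delivers the same byte stream with a 1K or a 4K chunk size, which matches the observation that varying the buffer sizes did not move the failure in any systematic way.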
Switching to remote mode:
REMOTE=1 kk kst
(after tricking myself because it was using ttruncmd() for this...) I see that nothing works at all. What did I break? 24 Aug 2007.
Fixed ttptycmd() to restore console modes after a remote-mode transfer. ckutio.c, 25 Aug 2007.
Noticed that error codes like ESRCH are not available in all modules. That's because of some complicated #ifdefs in ckcdeb.h that wind up not always #including <errno.h>. But I notice that ckutio.c includes it unconditionally with no ill effects, and so does ckvfio.c. Does any version of Unix at all not have <errno.h>? Added a catch-all clause to ckcdeb.h to #include <errno.h> (in UNIX only) if, after the other clauses, ESRCH was still not defined. ckcdeb.h, 25 Aug 2007.
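A catch-all clause of the kind described would look roughly like this. It is a sketch of the idea, not the actual text of ckcdeb.h, and it assumes UNIX is the symbol that marks Unix builds.

```c
/* After all the other errno-related clauses: */
#ifdef UNIX            /* assumption: symbol defined for Unix builds */
#ifndef ESRCH          /* still undefined, so <errno.h> was never included */
#include <errno.h>
#endif /* ESRCH */
#endif /* UNIX */
```

Testing for ESRCH rather than for a header guard keeps the clause independent of how any particular platform's <errno.h> protects itself against double inclusion.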
Now back to debugging ttptycmd()… Remote-mode transfers with ttptycmd() were broken in two places, maybe as long as 2 weeks ago (this would have affected non-network transfers too, which I can't test any more). The logic was missing in a couple places for the non-network and/or non-Telnet and/or non-encrypting connections (if statements with no else parts). Fixed in ckutio.c, 25 Aug 2007.
Testing remote mode:
kst OK    zst OK
ksb OK    zsb OK
krt OK    zrt OK
krb OK    zrb OK
Functionally it all works but there are hitches with Zmodem as always. When sending to K95:
So clearly the ptys are getting in the way. The hanging at the end would be caused by the sz process closing before its last output reached the master pty. It would need to do some form of flushing and/or pausing at the end but there's nothing I can do about that; these programs were not designed to be used in this way. Anyway, it only seems to happen with files longer than 100K.
For local mode, testing in Solaris over our Kerberos 5 connection again:
gkermit     lrzsz
kst OK      zst FAIL
ksb OK      zsb FAIL
krt OK      zrt OK but with errors
krb OK      zrb FAIL
If I use Omen rzsz as the external protocol (e.g. with zst), it blocks redirection and it sends the file to my terminal, rather than over the connection. This would probably be because it finds out the device name of the job's controlling terminal and opens it, to prevent redirection. This is hard to prevent in Solaris because there is no TIOCSCTTY ioctl(). Supposedly the same thing is accomplished by closing and reopening the slave pty after doing setsid(). I added code to do this, but it made no difference. (If I use lsz instead of sz, it is indeed redirected, but jams up after about 15K.) ckupty.c, 27 Aug 2007.
On Mac OS X with sz 3.73 1-30-03, however, the redirection works, so I assume it would also work in Linux, FreeBSD, NetBSD, etc, too. Doing the full test suite on Mac OS X:
gkermit     lrzsz           rzsz
kst OK      zst FAIL (1)    OK
ksb OK      zsb FAIL (2)    OK
krt OK      zrt OK (3)      OK for 100K file, fails for longer.
krb OK      zrb FAIL (4)    OK (1MB all-bytes test pattern)

(1) 64K file OK every time; 100K file fails every time.
(2) 10K file fails every time.
(3) Succeeds with 800K file but gets a few recoverable errors.
(4) Succeeds with 48K binary file with some errors, fails with longer ones.
So actually it looks pretty good, it's just that lrzsz messes up. When sending with lsz if I include -L 512 it sends the 100K test file with no errors, but still chokes on longer ones.
Testing on Mac OS X again, but this time over a clear-text Telnet connection:
gkermit     lrzsz           rzsz
kst OK      zst FAIL (1)    OK
ksb OK      zsb FAIL (2)    OK
krt OK      zrt OK (3)      OK
krb OK      zrb FAIL (4)    OK

(1) Almost worked, finished 777K out of 824K without errors.
(2) Got tons of errors, failed in first 30K out of 1000K.
(3) OK for 100K file but fails for larger.
(4) OK for 48K binary file but fails for larger.
Maybe see if we can do without the OPENPTY part.
TOMORROW — just clean up the code, add some SET / SHOW / HELP commands, document it, and move on.
Note: In K95, SET WINDOW sets the Zmodem packet length, 32 - 1024, multiple of 64.
Changed ftp port from int to unsigned int. ckcftp.c, 30 Aug 2007.
User interface that was never implemented (except to some extent in Kermit 95, which has its own built-in XYZMODEM protocol):
SYSTEM: Selects the system() service.
PTY: Selects the pseudoterminal method.
AUTO: Chooses the pty method for network connections and the system() service for serial connections as well as remote-mode file transfers.
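The AUTO rule above can be written out as a small decision function. Since this interface was never implemented, everything here is hypothetical: the enum, the conn_state structure, and the function name are invented for illustration only.

```c
/* The three settings described above. */
enum xfer_method { XM_SYSTEM, XM_PTY, XM_AUTO };

/* Hypothetical summary of the current connection. */
struct conn_state {
    int is_network;    /* nonzero for a network connection */
    int remote_mode;   /* nonzero for a remote-mode file transfer */
};

/* Resolve the user's setting to the method actually used. */
enum xfer_method choose_method(enum xfer_method setting,
                               const struct conn_state *c)
{
    if (setting != XM_AUTO)
        return setting;               /* explicit SYSTEM or PTY wins */
    if (c->remote_mode || !c->is_network)
        return XM_SYSTEM;             /* serial or remote mode: system() */
    return XM_PTY;                    /* network connection: pty */
}
```

Remote mode is checked first so that it selects system() regardless of the connection type, as the AUTO description states.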
Tried to build with -DCK_SRP and -lsrp but:
hash_supported          ckcftp.o
hash_getdescbyname      ckcftp.o
hash_getdescbyid        ckcftp.o
cipher_getdescbyname    ckcftp.o
krypto_delete           ckcftp.o
krypto_new              ckcftp.o
cipher_supported        ckcftp.o
krypto_msg_priv         ckcftp.o
krypto_msg_safe         ckcftp.o
hash_getlist            ckcftp.o
cipher_getlist          ckcftp.o
cipher_getdescbyid      ckcftp.o
Sent mail to Tom Wu and backed off for now. makefile, 14 Feb 2008. (Tom Wu never answered; seems like SRP is defunct.)
The ".blah = xxx" form of variable assignment only worked for variable names of length 22 or less, noticed and fixed by Wolfram Sang. ckucmd.c, 5 Mar 2008.
In "set host /pty ssh ..." connections, the INPUT command suddenly stopped working. This is in Solaris 9. It happens with all 8.0.* versions of C-Kermit, so it's nothing to do with ttptycmd(). Added some debug() statements but they don't show anything. Turns out there wasn't a problem after all. Wed Mar 26 16:04:53 2008