Go to the end of the line

Converting Word and MS-DOS documents to Unix

July 1999

Abstract

This month Mo explains how to move text documents from Word or DOS into a Unix environment. A couple of simple scripts can get rid of the annoying carets and such that appear when a Word document is opened in a Unix-based editor like vi. (2,400 words)

With the ability to move documents and text files so easily from one system to another, you would think that some common format could be devised for information. Unfortunately, moving a Word document or a spreadsheet from a Windows environment to a Unix environment leaves you with a document or spreadsheet that cannot be read or used by any application in the Unix world. There are exceptions -- but very few.

Ah, but what about the humble text file? Surely files containing nothing but ASCII characters must be portable across multiple systems. Well, that's true ... mostly.

The gotcha in moving text files between systems is most apparent in moving text from an MS-DOS or Windows environment to a Unix environment. This problem comes up often because these three operating systems are so common. Of course, you and I know that DOS is dead. Just don't tell that to all those people running those thousands of vertical market DOS applications -- none of which were ever ported to Windows, since their software vendors went out of business while trying to do the ports.

Aside from the need to move text files in general, you no doubt also have many general purpose programs, probably written in C, that you'd like to port from DOS to Unix or vice versa. ("Gee, Edwina, remember that utility that you wrote to unscramble the framis-gaggle? I bet you could move that code to Unix and it would compile and run there.") Yes, she probably could move it, but the text files that contain the source code for the framis-gaggle unscrambler will use a different end-of-line marker in the Unix world than they did under DOS or Windows, resulting in some tedious editing for poor Edwina. The same problem will arise for users who just want to move plain old text documents from platform to platform. Why should this be the case?

You have probably heard terms like carriage return, line feed, and newline bandied about in relation to text files and printing, but you might not know exactly what they are or how they relate to text files.

Back when dinosaurs ruled the earth, the primary method of outputting information from a computer was a printer or teletypewriter. (A vestigial memory of the latter piece of equipment is retained in the Unix designation for a terminal -- tty, an abbreviation for teletypewriter.) One of the important factors in controlling output to the printer or teletypewriter was carriage control. Printers and teletypewriters had a platen, or cylinder, which you can still see today on dot matrix and other impact printers. Paper was fed through the printer by rolling it around the platen, or, in the case of pin feed paper, by feeding it through a tractor feeder. The tractor feeder and the platen were both carriages, and their primary function was to carry the paper precisely and position it in front of the print head. The mechanism that moved the print head back and forth was also considered part of the carriage; the carriage as a whole was responsible for the amount of space between each printed line, as well as the positioning of the print head before each line was printed.

Early printers attached to IBM's big iron expected to receive two (sometimes three) bytes of information at the start of each line from the computer; these bytes contained information on where to print the upcoming line of characters on the piece of paper. Carriage control commands ranged from the very simple (PRINT AFTER ADVANCING 1 LINE) to the complex (PRINT BEFORE ADVANCING 3 VERTICAL TABS).

Because carriage control allowed control over vertical movement on the printed page, two (now well known) font variations could be created. By printing a line and then printing it again without advancing a line, you could print the same characters in the same place twice, creating bold type. By printing a line and then printing a line of underscores and spaces without advancing a line, you could print a line containing underlining. You could also print a line or a character, backspace, and then print a hyphen or a slash for a strike through, although this was less common.

IBM printers were frequently fed by large bundles of wires that carried carriage control and printing signals separately. When smaller printers were developed that were connected by parallel printer or serial port connections, it became necessary to send bytes of printer control information as part of the data stream. Thus, the ASCII (American Standard Code for Information Interchange) character set includes control characters in addition to the standard letters and numbers. A control character is a single character that can be sent to a computer device, such as a printer or monitor, that controls the behavior of that device, rather than printing an actual character.

ASCII uses 128 numbers to represent all the uppercase and lowercase characters of the alphabet, the digits, the punctuation characters, and these special characters that are used to control printers, terminals, and other computer devices. The 128 values are numbered beginning with zero, so the numbers used range from 0 through 127. All the printable characters (letters, digits, and punctuation) have values between 32 and 126. The values 0 through 31 and 127 are used for control characters.

The table below is a brief ASCII chart with the decimal value of each entry and its ASCII name or character. Several of the ASCII codes represent the nonprintable characters, and these are given with their names. You might already be familiar with some of these.

0	NUL	32	SP	64	@	96	`
1	SOH	33	!	65	A	97	a
2	STX	34	"	66	B	98	b
3	ETX	35	#	67	C	99	c
4	EOT	36	$	68	D	100	d
5	ENQ	37	%	69	E	101	e
6	ACK	38	&	70	F	102	f
7	BEL	39	'	71	G	103	g
8	BS	40	(	72	H	104	h
9	HT	41	)	73	I	105	i
10	LF	42	*	74	J	106	j
11	VT	43	+	75	K	107	k
12	FF	44	,	76	L	108	l
13	CR	45	-	77	M	109	m
14	SO	46	.	78	N	110	n
15	SI	47	/	79	O	111	o
16	DLE	48	0	80	P	112	p
17	DC1	49	1	81	Q	113	q
18	DC2	50	2	82	R	114	r
19	DC3	51	3	83	S	115	s
20	DC4	52	4	84	T	116	t
21	NAK	53	5	85	U	117	u
22	SYN	54	6	86	V	118	v
23	ETB	55	7	87	W	119	w
24	CAN	56	8	88	X	120	x
25	EM	57	9	89	Y	121	y
26	SUB	58	:	90	Z	122	z
27	ESC	59	;	91	[	123	{
28	FS	60	<	92	\	124	|
29	GS	61	=	93	]	125	}
30	RS	62	>	94	^	126	~
31	US	63	?	95	_	127	DEL

ASCII chart with decimal values
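
The values in the chart can be checked from a shell prompt. The following is a minimal sketch, assuming a POSIX-style printf; the function names ascii_value and ascii_char are made up for this illustration and are not standard commands.

```shell
#!/bin/sh
# ascii_value: print the decimal code of the first character of $1.
# A printf numeric argument that starts with a quote is read as
# "the code of the character that follows the quote".
ascii_value()
{
printf '%d\n' "'$1"
}

# ascii_char: print the character that has decimal code $1.
# The code is converted to three octal digits and used as a \ddd escape.
ascii_char()
{
printf "\\$(printf '%03o' "$1")"
echo
}

ascii_value A     # prints 65
ascii_value a     # prints 97
ascii_char 126    # prints ~
```

Control characters can be produced the same way, but most of them will not show anything useful on a terminal; piping the output through od -c makes them visible.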

Most of the nonprintable characters were and are used for communications protocols and have no real use for most applications programmers today. Even the others seem primitive in today's world of slick GUIs and laser printers. For example, the value 13 (CR) is a carriage return. When this value is sent to a printer, it causes the print head to return to column 1. A CR is also sometimes sent by the Return or Enter key on the keyboard, although Unix usually translates this to value 10, a line feed (LF). This latter control character advances a printer or terminal to the next line. Value 7 (BEL), when sent by the computer to the terminal, usually causes a beep or rings a bell. HT (horizontal tab, or just plain tab), value 9, is sent to a printer or a screen and causes the cursor or print head to advance to the next tab stop. SO (shift out) and SI (shift in), values 14 and 15, are also used in printer control. Many printers are set up with two built-in fonts. Sending an SO causes the printer to shift out to the second font, while an SI causes it to shift back in to the original typeface.

The values from 33 through 126 are printable characters. Value 32 (SP) is a space. Whether a space is actually a printable character is a debatable point, since a space does not usually put ink on the paper. Instead, it places a character containing no image. Some printers render this by simply advancing the print head one position.

The characters in the range below 32 are used extensively in telecommunications. For example, 2 and 3, STX (start of text) and ETX (end of text), are often used at the start and end of a block of transmitted information, respectively. 6 and 21, ACK and NAK, are often used by a receiving computer to signal an acknowledgement (ACK for well received) or a negative acknowledgement (NAK for not well received, please retransmit).

Control characters are also used inside text files to indicate the end of a line, and here is where our problem lies. Unix uses a single LF (line feed, ASCII value 10) character to designate this. DOS and Windows use a combination of CR (carriage return, value 13) and LF. Moving a text file back and forth between these two systems without translating the end-of-line marker causes some unusual results. For example, the MS-DOS Edit utility is smart enough to recognize a file that only has a line feed for an end-of-line marker and displays it correctly, but the Windows Notepad utility is not. Notepad displays an untranslatable control character as a thick black vertical bar that looks like a black box. In the following listings this black box is shown as a pair of square brackets (like this: []).

A Unix text file in MS-DOS Edit

These are the times
that try men's souls.
The Metropolitan Transit Authority,
better known as the MTA
etc.

Notepad cannot figure out where the lines in the same file end.

A Unix text file in Notepad

These are the times[]that try men's souls.[]The Metropolitan Transit Authority,[]better known as the MTA[]etc.

In the reverse case, a Windows text file has too many control characters for vi. The extra carriage return shows up as a control-M (^M) in the vi display.

A Windows/DOS text file in vi

These are the times^M
that try men's souls.^M
The Metropolitan Transit Authority,^M
better known as the MTA^M
etc.^M
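
If you are not sure which convention a file uses, you can count its carriage returns without opening an editor. This is a rough sketch, assuming standard tr, wc, and od utilities; count_cr and the sample file names are invented for the example.

```shell
#!/bin/sh
# count_cr: print the number of carriage returns (value 13) in file $1.
# A result of 0 suggests the file already uses the Unix convention.
count_cr()
{
# tr -cd deletes every byte except CR; wc -c counts what is left.
# The arithmetic expansion strips any padding that wc adds.
echo $(( $(tr -cd '\r' <"$1" | wc -c) ))
}

# Build two small sample files in a scratch directory to compare.
demo=${TMPDIR:-/tmp}/eol-demo.$$
mkdir -p "$demo"
printf 'one\ntwo\n' >"$demo/unix.txt"
printf 'one\r\ntwo\r\n' >"$demo/dos.txt"

count_cr "$demo/unix.txt"    # prints 0
count_cr "$demo/dos.txt"     # prints 2

# od -c dumps each byte; the DOS file shows \r \n pairs.
od -c "$demo/dos.txt"

rm -r "$demo"
```

A count equal to the number of lines points to a DOS or Windows file; a count somewhere in between usually means the file has been edited on both systems.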

Many Unix/Windows transfer utilities include a switch that can be set to indicate that a text file is being transferred, and the resulting file has its end-of-line character(s) translated. Some utilities have text translation as the default, and you must set a switch to suppress the translation when you are transferring binary files.

Unfortunately, most serious movement of files in volume is done by combining and compacting the files using one of the versions of zip, tar, or what have you, and the resulting file must be transferred as a binary file. The individual files within such an archive do not have their end-of-line characters translated when they are combined and transferred as a binary.
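
Once such an archive has been unpacked on the Unix side, the text files inside it can be fixed in one pass. The sketch below is only one way to do it and rests on an assumption that is not in the article: that the text files can be recognized by a .txt suffix. The function name strip_cr_tree is hypothetical. Anything that does not match the pattern is left alone, because removing bytes from a binary file would corrupt it.

```shell
#!/bin/sh
# strip_cr_tree: remove carriage returns from every *.txt file
# found under directory $1, rewriting each file in place.
strip_cr_tree()
{
find "$1" -type f -name '*.txt' | while IFS= read -r f
do
# Write to a temporary file first, and replace the original
# only if tr succeeded.
tr -d '\r' <"$f" >"$f.tmp" && mv "$f.tmp" "$f"
done
}
```

After unpacking an archive into a directory named, say, unpacked, you would run strip_cr_tree unpacked. Change the -name pattern to match whatever suffixes your text files really use, such as '*.c' for source code.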

Conversion scripts
The extra carriage return can be removed or inserted with two simple scripts. I use scripts here so that you can save and reuse them. The first one takes two command arguments, the Unix file name and the DOS/Windows file name. It adds a carriage return to the end of each line of a Unix text file and writes the result under the DOS text file name so that it can be transferred to DOS. To enter this with vi, when you get to the ^M, type control-V then control-M. The control-V causes the next character to be inserted as a literal control character. After you have saved it as lf2crlf, change its mode to allow execution (chmod a+x lf2crlf).

#!/bin/sh
# lf2crlf
# adds an extra carriage return in a unix
# text file so that end of line matches
# the Windows/DOS convention

usage()
{
echo "usage: lf2crlf unix.txt dos.txt" >&2
exit 1
}

if [ $# -ne 2 ]
then
usage
fi

# ^M is a literal carriage return (control-V, control-M), not a caret and an M
sed 's/$/^M/' <"$1" >"$2"
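
If typing a literal control character into the script is awkward, the same conversion can be sketched with awk, which understands the \r escape. This is an alternative I am suggesting, not the script above; the function name lf2crlf_awk is made up for the example.

```shell
#!/bin/sh
# lf2crlf_awk: copy Unix text file $1 to $2 with CR LF line endings.
lf2crlf_awk()
{
# awk removes the LF from each line as it reads it; printf writes
# the line back followed by a carriage return and a line feed.
awk '{ printf "%s\r\n", $0 }' <"$1" >"$2"
}
```

Call it as lf2crlf_awk unix.txt dos.txt. Like the sed version, it adds a carriage return to every line without checking, so running it on a file that already has DOS line endings leaves two carriage returns per line.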

For text files coming from Windows or DOS to Unix, the second script strips the extra CR from the end of each line. There is an additional wrinkle in MS-DOS files. Some DOS editors and utilities append a control-Z (value 26 or SUB in ASCII) to the end of a text file, which will display in vi as a ^Z. This script also removes that character. Note that the single quoted portion starts on one line and ends on the next. Use control-V, control-M to create the ^M and control-V, control-Z to create the ^Z.

#!/bin/sh
# crlf2lf
# removes an extra carriage return in a dos/windows
# text file so that end of line matches
# the Unix convention.
# Also removes a control-Z at end of file

usage()
{
echo "usage: crlf2lf dos.txt unix.txt" >&2
exit 1
}

if [ $# -ne 2 ]
then
usage
fi

# ^M and ^Z are literal control characters, not caret pairs
sed 's/^M//g
s/^Z//g' <"$1" >"$2"
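
The same cleanup can also be sketched with tr, which accepts escapes for both characters and so avoids the control-V keystrokes. Again this is an alternative for illustration rather than the script above, and crlf2lf_tr is a made-up name.

```shell
#!/bin/sh
# crlf2lf_tr: copy DOS/Windows text file $1 to $2 with Unix line endings.
crlf2lf_tr()
{
# \r is the carriage return (value 13).
# \032 is octal for 26, the control-Z (SUB) end-of-file marker.
tr -d '\r\032' <"$1" >"$2"
}
```

Call it as crlf2lf_tr dos.txt unix.txt. Like the sed version, it deletes every CR and control-Z in the file, not only those at the ends of lines, which is normally what you want for plain text.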

Just as a final note: some DEC systems used only a carriage return to mark an end of line. I once ported a C application from Unix to DOS and then from Unix to VAX. The end-of-line terminators had to be handled to move the source code from one system to another.
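
For a file that marks the end of each line with a lone carriage return, the fix is a translation rather than a deletion. A minimal sketch, assuming the file contains no line feeds at all; cr2lf is a hypothetical name.

```shell
#!/bin/sh
# cr2lf: copy file $1 to $2, turning every carriage return into a line feed.
cr2lf()
{
tr '\r' '\n' <"$1" >"$2"
}
```

Do not use this on a DOS or Windows file: each CR LF pair would become two line feeds and the text would come out double spaced.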

Resources

The JARGON dictionary entries for newline and related characters. You should bookmark the dictionary as it is very useful:
http://www.denken.or.jp/local/misc/JARGON/body-n/newline.html
An explanation of ASCII and the other popular collating sequence, EBCDIC:
http://www.denken.or.jp/local/misc/JARGON/body-a/ASCII.html
A useful ASCII chart:
http://members.tripod.com/~plangford/index.html
An ASCII chart including extended characters for values from 128 through 255:
ftp://dkuug.dk/i18n/WG15-collection/charmaps/ANSI_X3.110-1983
An alternative to crlf2lf:
http://mirriwinni.cse.rmit.edu.au/FAQ/FAQ137.html
A GNU public license utility that does the same conversions:
http://chrisheng.hypermart.net/tofrodos.html
Full listing of previous Unix 101 columns:
http://www.sunworld.com/common/swol-backissues-columns.html#unix101

About the author
Mo Budlong, president of King Computer Services Inc., specializes in Unix and client/server consulting and training. He currently publishes the COBOL Just In Time Course, a crash course for the year 2000 problem, as well as COBOL Dates and the Year 2000, which offers date solutions.

URL: http://www.sunworld.com/swol-07-1999/swol-07-unix101.html