[KLUG Members] Re: Html vs Adobe Acrobat -- Content v. Style, Archiving v. Compression (long)

Bryan J. Smith members@kalamazoolinux.org
13 May 2002 16:32:05 -0400


On Mon, 2002-05-13 at 13:44, Patrick McGovern wrote:
> > > Every linux distro comes with tools for converting postscript to pdf. 
> Hmm, I am not familiar with this.

The output of ps2pdf is definitely NOT native.  A lot of things are
rendered as bitmaps or vector primitives, so they don't look as good on
screen.  But the default (which is an option) is to output at 720dpi,
so it *DOES* come out okay.  Still, it is much better to use the native
PDF output capabilities of the application -- _assuming_ it isn't
simply calling ps2pdf itself.
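
For example (the file names are made up; the -r switch is passed
straight through to Ghostscript, so check your local man page before
relying on it):

    ps2pdf report.ps report.pdf         # default rendering resolution
    ps2pdf -r720 report.ps report.pdf   # force 720dpi rendering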

> Can I convert html files to postscript

Why would you?  To "fix" the formatting?  HTML is _just_content_ (with
free-form styling).  PDF is content _and_ style.

A _far_better_ approach is to create your document's _content_ in a
standard documentation language like DocBook/XML (the older
DocBook/SGML is fine too), then apply the proper XSL (or DSSSL in the
case of SGML) _style_ to format it as HTML, PDF, RTF, etc... as you
would like.  This is, by far, the _preferred_way_ to write technical
documentation today -- even versus TeX, although converting to/from TeX
is easy too.
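
A rough sketch of that workflow (the stylesheet path is distro-specific
and the file names are made up -- adjust for your setup):

    # DocBook/XML content + DocBook XSL stylesheets -> HTML
    xsltproc /usr/share/xml/docbook/xsl/html/docbook.xsl mydoc.xml > mydoc.html

    # DocBook/SGML via the docbook-utils wrappers (DSSSL underneath)
    docbook2html mydoc.sgml
    docbook2pdf  mydoc.sgml
    docbook2rtf  mydoc.sgml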

An older, but good, alternative is to use the TeX typesetting language,
which combines content and style in one language.  Typesetting
pre-dates personal computers, because that's how books were done back
then.  Because of its age, TeX has _very_rich_ HTML and PDF generation
capabilities, and there is an excellent GUI for it, LyX
(http://www.lyx.org).  In fact, the first thing a new "standard"
language does is make sure there is a way to convert to/from at least
LaTeX (DocBook included), the most popular set of TeX macros.  TeX can
do anything, but its age and all the support software and macros
written for it can also make it too "free-form" for most needs.  TeX is
used heavily by the American Mathematical Society (AMS) and the IEEE
(which produces 30% of all the world's technical documentation).

AbiWord is quickly becoming the "standard" document editor now.  Not
only does it do HTML and even XHTML, but also LaTeX (including "clean"
LaTeX which LyX can use), DocBook/XML and even PDF too.

> and then to pdf. and still have the html work inside the pdf file?

_Native_ PDF supports HTML tags and even other, advanced elements.  I've
even seen calculators written in PDF (non-ActiveX/non-Java too).

I currently write my documentation in LyX and use a single Perl script
called tex2pdf which calls pdflatex, hyperref and a number of other TeX
support utilities to create a _native_, "feature rich" PDF -- complete
with auto-bookmarking, auto-internal/external references (including HTML
links), etc...
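
Roughly what that boils down to (the file names are made up, and I'm
quoting tex2pdf's invocation from memory -- the pdflatex passes
underneath are the real work):

    tex2pdf mydoc.lyx        # LyX/LaTeX -> "feature rich" native PDF

    # or by hand, with \usepackage{hyperref} in the LaTeX preamble:
    pdflatex mydoc.tex       # first pass
    pdflatex mydoc.tex       # second pass resolves references/bookmarks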

> > snippets of HTML I'd use ".tgz".  Current versions of WinZip, etc... all
> > now read tgz and tgz beats the pants off zip.
> Isn't ".tgz" a unix format only??

WinZip's tar support is _sub_par_.  It doesn't even check CRCs.

First off, Tar is *NOT* UNIX-only.  In fact, 99.999999% of the time,
when you see something on UNIX and ASS-U-ME it is "UNIX-only", it is
really the _primary_standard_ for _all_non-Windows_systems_!  E.g.,
Postscript, OpenGL, Tar, etc... -- Mac, VMS, UNIX, etc... all use them! 
Windows "programming" has taught you to think that way.

Secondly, Archiving (packaging files into one file/device) and
Compression (reducing the size of a file) are _two_separate_
operations.  In UNIX, we often use tar (FIFO -- with many variants),
cpio (FIFO) or their new "pax" (FIFO) replacement for archiving, and
the compress (LZW), gzip (LZ77), bzip2 (BWT) or the newer, real-time
lzop (LZO) compression programs -- _separately_.  On Windows, PKZip
combines both steps into one -- using the same LZ77 (deflate) algorithm
as gzip, applied per file, with a directory-based indexing archive.
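
For example (the directory and file names are made up):

    tar -cf home.tar /home/pat              # archive only
    gzip -9 home.tar                        # compress separately -> home.tar.gz

    # or stream the two separate steps together on the fly:
    tar -cf - /home/pat | gzip -9 > home.tgz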

Cpio, tar and pax are _very_recoverable_ since they are block FIFO
archiving formats with CRCs, so you can even recover beyond the point
of corruption.  Unfortunately, the second you compress them, you lose
that recoverability, as the archive format is at the mercy of the
compression program's recoverability (they all suck, although bzip2 has
some block-localization features).  Some replacement programs like afio
(for cpio) do per-file compression inside the cpio archive, so it
retains both cpio compatibility and recoverability -- I
_highly_recommend_ it, especially in combination with lzop, which is a
real-time compression algorithm, even if you back up to a device with
hardware compression, because it gives you archive portability and
backup performance.  The original Tar format is limited to 8GB files,
although GNU Tar and other variants, including the POSIX pax
replacement, have their own ways of overcoming this limit.
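
A sketch of that combination, from memory (the flags and the tape
device are illustrative -- check the afio man page):

    # per-file lzop compression inside a cpio-compatible archive
    find /home/pat -print | afio -o -Z -P lzop /dev/nst0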

PKZip actually does per-file compression inside the archive, which
improves recoverability over compressed cpio/tar/pax archives.
Unfortunately, PKZip does NOT use block FIFO but directory-based
indexing, so a corruption affects the rest of the archive.  PKZip
through version 3 has a 4GB file-size limitation, and only the latest
PKWare PKZip does the new version 4 format that does not.  It still
does not use block FIFO, though.

I myself am working on a universal archiving/backup/copying _library_
(and associated utilities) written in C.  It does several things:
  - Per-file compression -- the main focus of the library
  - Verbatim copy utility -- replaces "cp -dpR" or "find|cpio -pmdv"
      (see the sketch after this list)
  - New 64-bit archive format:
      Block FIFO w/CRC like Tar/Cpio/Pax
      POSIX Access Control List (ACL) and Extended Attributes (EA)
  - Spanned backup features:
      Per- and entire-volume indices like PKZip
        (for quick lookup only; CRC checking determines whether
         block FIFO recovery is required)
      Single volumes in a spanned set are "self-contained" (so you
        aren't screwed if you lose a tape -- and you can pull from just
        one tape in a set)
  - Will add cpio, tar, pax, zip compatibility later on (much later ;-)
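
To be concrete about the "verbatim copy" item above, these are the
idioms it would replace (source and destination paths are made up):

    cp -dpR /home/pat /backup/pat     # GNU cp: keep links, perms, times

    cd /home/pat && find . -print | cpio -pmdv /backup/pat   # cpio pass-through copy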

It's clear that the POSIX team is NOT addressing compression, so I feel
I must.  Afio is nice, but it doesn't have integrated, streaming
compression (very wasteful/inefficient).  Worse yet, spanned-volume
restores *SUCK* in _every_product_ I've used -- integrating streamed
compression would allow you to store each individual volume's index at
the end of each tape (assuming you specify the tape size -- _before_
compression, which is handled by the library/utility).  And _nothing_
seems to address the new Linux POSIX ACL/EA standard in the 2.5.x
kernel that both Ext3 and XFS now agree on (and which will be
backported to 2.4.x in the coming year), and many other OSes (like
Solaris) also have POSIX ACL/EA features (and use modified utilities,
like Tar "double-file" methods, to store them -- long story).

The idea came out of a tcsh CD backup script I wrote to do per-file
compression, which was featured in the April 2002 edition of Sys Admin
magazine ( http://www.samag.com/articles/2002/0204/ ).  Most people
don't realize that the ISO9660 "Yellow Book" CD data track format
(commonly labelled ".iso" files) is actually an archive format in
itself (and can be created easily with mkisofs), so there is _no_need_
to tarball files before putting them on a CD.  In fact, storing a
compressed tarball on a CD -- or, God forbid, splitting it across
multiple CDs -- is the _worst_thing_ you can do for a backup, because
recoverability becomes a _nightmare_.
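
For example, mastering a data CD image straight from a directory tree
(paths and the burner device are made up -- use cdrecord -scanbus to
find yours):

    mkisofs -r -J -o /tmp/backup.iso /home/pat   # ISO9660 + Rock Ridge + Joliet
    cdrecord dev=0,0,0 speed=4 /tmp/backup.iso   # burn the image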

For those who can't wait, Hugo Rabson has created a great "disaster
recovery" CD/tape solution called Mondo Rescue (
http://www.microwerks.net/~hugo/ ).  It uses multiple, self-contained
afio volumes to preserve data integrity.  It's a great solution
regardless, because it creates a bootable CD (via its included Mindi
Linux utility -- it reads your kernel or uses a generic one, and then
uses Blackbox) and has an ncurses-based interactive restore program.  I
would eventually like my universal C library to replace the
"backup/restore" library/utility "guts" of Mondo Rescue so Hugo can
focus on more of the user stuff.

-- Bryan

-- 
The US government could be 100x more effective, and 1/100th the
Constitutional worry, if it dictated its policy to Microsoft as
THE MAJOR CUSTOMER it is, and not THE REGULATOR it fails to be.
---------------------------------------------------------------
Bryan J. Smith, SmithConcepts, Inc.   mailto:b.j.smith@ieee.org
Engineers and IT Professionals     http://www.SmithConcepts.com