[KLUG Members] Data transformation

Adam Tauno Williams awilliam at whitemice.org
Mon Sep 26 07:33:48 EDT 2005


> >>>>I have solved this problem in a number of environments, including SQL-based
> >>>> data systems....
> >>> I'd agree with Bob here; which bugs me to no end :-).
> >Most likely true.  But sometimes it is either (a) too much bother to 
> >introduce a third agent to get a solution or (b) it just doesn't work out 
> >to the solution's advantage.
> I don't understand what you're saying, exactly. Perhaps you're arguing in
> favor of an "All-SQL" or an "All-in the DBMS" solution? Please clarify.

No,  I'm not arguing for an All-* solution.  I'm saying that the
appropriate solution for where to process data depends upon a lot of
variables (some of them external to the technology itself).  Things like
sub-selects and outer joins exist in SQL for a reason.  I process data
using a variety of methods -

1.) PHP (aka LAMP, although with a proprietary database).  This is great
for interactive access to small sets of data;  it is terrible for data
processing.
2.) "In SQL" - when you want to integrate with something like a report
writer it is easiest to just make an 'insane' SQL as making a middleware
layer would require a pile of code and weeks of man hours for
development and testing.  And in the end it would be less flexible.  "In
SQL" is also best when multiple clients of various types developed in
different languages all need to process the same bit of information or
when the dataset is very large and the clients are remote.  No matter
how rigorous the documentation, different developers are going to
introduce different subtleties in how they go about doing something -
which equals some suit having a piss fit and waving a report in your
face.
3.) BIE.  This is good at data processing, and doesn't do much
interactively at all.  You can locate it near, or on, the database
server and it does very well up to horrendously large datasets.  The
XML tools start to get wonky when datasets are really, really large,
but they are better than even six months or a year ago, and easily
multitudes faster than they were a couple of years ago.  It also
provides sort of a database-aware crond.
4.) .NET - If the data transform is really complicated or just enormous
I'll code a component in .NET [Mono].  As far as performance is
concerned this leaves both BIE and PHP staggering around, choking on
its dust and wondering what just flew by.  But this means code with
dependencies and code that has to be maintained - which is overhead in
man-hours,  which are way more expensive than more CPU.
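
As a concrete illustration of point 2 - and of why sub-selects and
outer joins earn their keep - here is a minimal sketch; the customers
and orders tables are hypothetical:

    -- one row per customer, including customers with no orders at all
    SELECT c.name,
           COALESCE(o.order_count, 0) AS order_count
      FROM customers c
      LEFT OUTER JOIN (SELECT customer_id,
                              COUNT(*) AS order_count  -- orders per customer
                         FROM orders
                        GROUP BY customer_id) o
        ON o.customer_id = c.customer_id;

The outer join against the sub-select keeps customers with no orders in
the result (with a count of zero); doing the same in client code means
hauling both tables across the wire and merging them by hand.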

And this doesn't take into account other variables, like how fast
management wants the results, a department's insistence on using some
stupid client, integrating with some third party that can only
communicate via method {X}, etc...  If I thought about it long enough
I'm certain I could come up with a dozen more.

I agree with you, and in a perfect world would almost always advocate a
three-tier model - {Data store}:{Data Processing}:{Data Presentation}.
But reality introduces a large number of impediments to implementing a
technologically perfect solution;  often the software packages
themselves introduce the impediments (bad performance, statelessness,
limited protocol support, etc...) even without the help of the
suits.  

So one needs to select the solution appropriate to the particulars of
the situation and problem;  our friend who started this thread may have
perfectly legitimate reasons for wanting to solve the problem in the
manner he proposed.

> >If the host end is dramatically more powerful 
> >than the client end it just makes sense to do as much work there,  or the 
> >dataset is very large and the transfer time to the client would be 
> >unreasonable (especially true if WANs like frame-relay are involved).
> Almost always the case, as it turns out.  Generally the users only want to
> see a representation of the result that is vanishingly small compared to
> the data involved, and it is very, very common to process it all on the
> server, or even some dedicated background machine.

Yep.
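
For instance, a query along these lines (the sales table is
hypothetical) might grind through millions of rows on the server yet
ship only a handful of summary rows back over the WAN:

    -- the server scans the whole table; only the per-region totals
    -- ever cross the wire
    SELECT region,
           SUM(amount) AS total
      FROM sales
     GROUP BY region;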

> >and writing good fast code in C requires someone who really knows what
> >they are doing.
> One of the reasons I earn the big bucks! :)

Exactly!  Did the dealership get your Maserati in yet?

> >> Well, yeah, except there are some useful generalities to be observed here.
> >> I see the initial table as a transaction file, and now Andrew is 
> >> being asked to
> >> produce columnwise (or fieldwise) classifications for some reason. 
> >> Actually, it would seem that the output is not normalized,
> >Yea, it is cwazy what those pesky users ask for!  :)
> Well, not SO cwazy!  Those sorts of classifications and crosstabs are a lot
> more readable for people, result in better output from the POV of layout,
> and may be needed by legacy systems.

Gack.  Add "legacy systems" to all my points above.  I try to repress.
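
(For what it is worth, a common way to produce that sort of crosstab in
plain SQL is a CASE-based pivot; a minimal sketch, with a hypothetical
transactions table:

    -- one row per account, with amounts classified into columns
    SELECT account,
           SUM(CASE WHEN status = 'open'   THEN amount ELSE 0 END)
             AS open_total,
           SUM(CASE WHEN status = 'closed' THEN amount ELSE 0 END)
             AS closed_total
      FROM transactions
     GROUP BY account;

Each CASE expression picks out one classification, so the normalized
rows come back columnwise - exactly the layout people find readable.)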

> >> I can more easily believe the input may be normalized for the 
> >> initial purpose of the table/database. However, SQL doesn't handle 
> >> this very well, but a program (and it could be PHP,  but doesn't have 
> >> to be) that calls an RDBMS/SQL interface layer may very well be the 
> >> best way to go. Other tasks for the non-SQL code would be SQL code 
> >> generation and managed execution. There are some data languages that 
> >> essentially abstract the problem into higher-level commands, and that's 
> >> really helpful when such generality is necessary.
> >BIE does some neat stuff since all selects from the database are in XML 
> >and you can then do anything an XML transform can do.
> Actually, I wasn't thinking of BIE; there are older tools and a variety
> of (quasi-proprietary) database and data manipulation tools that reduce
> this sort of problem to single commands, since the implementors understood
> that these kinds of mappings were clearly needed by their audiences.

Sure,  I was just tossing out an example of something in the general
category.
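
On the SQL-code-generation point above: at its simplest that can even
be done from inside SQL itself, by selecting strings that are
themselves statements.  A hypothetical sketch (table and column names
invented):

    -- emit one UPDATE statement per dormant account; a driver program
    -- then executes the result set as a batch
    SELECT 'UPDATE accounts SET status = ''dormant'' WHERE id = '
           || id || ';'
      FROM accounts
     WHERE last_activity < DATE '2004-01-01';

The "managed execution" half is then just the calling program walking
the result set and running each generated statement.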

> >And XML transform tools keep getting faster and faster.  It lets you get
> >the data and mangle it with essentially no programming.
> Yes, they do, but they have a long way to go to catch well-designed compiler
> or optimized interpreter-based solutions that do intelligent stuff while
> using the RDBMS as a meta-filesystem. The targeting and profiling of the 
> data in the host language, coupled with the optimization work that has been
> invested in the good RDBMS's is a tough combination to beat for the serious
> data handling apps. The economics at this level also tend to make the "no
> need for programming" argument moot.

Sure.  But in most smallish businesses you aren't going to find these
tools, or the expertise required to use them - which may be
unfortunate, but is still true.


