[KLUG Members] Match-making data types and algorithms

Adam Williams members@kalamazoolinux.org
02 Jul 2001 07:35:50 -0400


>>The point being that for the various things I am looking to use something
>>like this on, there are multiple categories of relevant data that I would
>>like to create a result from with some sort of ranking for the closest,
>>next closest and so forth and to have the various categories be weighted
>>or at least weightable.  (i.e. search for something related to 'cars'
>>would weight "content LIKE '%cars%'" while looking for sarcastic snippets
>>about government might equally weight "style LIKE '%sarcastic%' AND
>>description = 'sarcasm'").
>Whoa. ;)
>You're treading into philosophical waters there, my friend. Value judgements
>are notoriously resistant to analysis.

True, but if you need to mine squishy data you don't have a choice but
to make value judgements.  The real key to success with such data is a
clear and concise definition of the "values".  This may be very hard when
working with marketing/management types, but a true geek should be able
to pull that off without breaking a sweat.

>Instead of arbitrary unipole goals like 'sarcasm', you might wish to scale
>things on more abstract values. For instance, sarcasm is one form (reversed)
>of 'sincerity' in 'humor'. Perhaps the middle of that goal is everyday
>speech (neither highly sincere nor insincere) and the far end is
>'worshipful'.

Here is how I solved such a problem.  We have a BLOB space in an
Informix Online RDBMS engine.  Each BLOB is loaded with a MIME type and
a brief text description of what it is, and is assigned a unique integer
ID (from a database sequence).  A second table, document types, defines
things like Invoice, Workorder, Clipart collection, etc.  A third table,
which we'll call "documents", holds just that: documents, i.e. specific
instances of clip art collections, invoices, etc.  Each document has a
type (from the table of types) and a unique ID (from a database
sequence).  A fourth table is a simple linking of doc ID and BLOB ID.  A
document can contain multiple BLOBs, and a BLOB can be contained in
multiple documents.  This lets you do some clever "weighted" searching:
Which BLOB occurs most often in a specific document type?  Which BLOB
occurs most often in this document and never occurs in any document of
type X?  And so on.  A system like this can be built in PHP with relative
ease and is amorphous enough to handle the oddities of reality.
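
To make the four-table layout concrete, here is a minimal sketch of it.
It is not the actual Informix schema: it uses Python with an in-memory
SQLite database as a stand-in, and every table name, column name and
sample row (blobs, document_types, documents, document_blobs, the
"company logo" BLOB, etc.) is made up for illustration.  The BLOB
payload itself is left out; only the metadata used for searching is
modelled.  The second query is a per-type variant of the "never occurs
in type X" question.

```python
import sqlite3

# In-memory SQLite stands in for the Informix engine (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE blobs (
    blob_id     INTEGER PRIMARY KEY,
    mime_type   TEXT NOT NULL,
    description TEXT NOT NULL
);
CREATE TABLE document_types (
    type_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL UNIQUE
);
CREATE TABLE documents (
    doc_id  INTEGER PRIMARY KEY,
    type_id INTEGER NOT NULL REFERENCES document_types(type_id)
);
-- Many-to-many link: a document holds many BLOBs and vice versa.
CREATE TABLE document_blobs (
    doc_id  INTEGER NOT NULL REFERENCES documents(doc_id),
    blob_id INTEGER NOT NULL REFERENCES blobs(blob_id),
    PRIMARY KEY (doc_id, blob_id)
);
""")

# Hypothetical sample data.
conn.executemany("INSERT INTO document_types VALUES (?, ?)",
                 [(1, "Invoice"), (2, "Workorder")])
conn.executemany("INSERT INTO blobs VALUES (?, ?, ?)",
                 [(1, "image/png", "company logo"),
                  (2, "text/plain", "payment terms"),
                  (3, "image/png", "wiring schematic")])
conn.executemany("INSERT INTO documents VALUES (?, ?)",
                 [(10, 1), (11, 1), (20, 2)])
conn.executemany("INSERT INTO document_blobs VALUES (?, ?)",
                 [(10, 1), (10, 2), (11, 1), (20, 1), (20, 3)])

# Shared FROM/WHERE: link rows for documents of one named type.
BASE = """
SELECT b.blob_id, b.description, COUNT(*) AS n
FROM document_blobs l
JOIN documents d      ON d.doc_id  = l.doc_id
JOIN document_types t ON t.type_id = d.type_id
JOIN blobs b          ON b.blob_id = l.blob_id
WHERE t.name = ?
"""
RANK = " GROUP BY b.blob_id, b.description ORDER BY n DESC, b.blob_id"


def most_common_blobs(conn, type_name):
    """BLOBs ranked by how many documents of type_name contain them."""
    return conn.execute(BASE + RANK, (type_name,)).fetchall()


def blobs_never_in_type(conn, type_name, excluded_type):
    """Same ranking, minus BLOBs found in any excluded_type document."""
    exclude = """
    AND b.blob_id NOT IN (
        SELECT l2.blob_id
        FROM document_blobs l2
        JOIN documents d2      ON d2.doc_id  = l2.doc_id
        JOIN document_types t2 ON t2.type_id = d2.type_id
        WHERE t2.name = ?
    )"""
    return conn.execute(BASE + exclude + RANK,
                        (type_name, excluded_type)).fetchall()


print(most_common_blobs(conn, "Invoice"))
print(blobs_never_in_type(conn, "Invoice", "Workorder"))
```

The count per BLOB is what serves as the "weight" here; a real system
could just as well multiply it by a per-type or per-category factor
before ranking.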

--
-----------------------------------------------------------
Ximian GNOME, Evolution, LTSP, and RedHat Linux + LVM & XFS
-----------------------------------------------------------