[KLUG Members] Data transformation

Jamie McCarthy jamie at mccarthy.vg
Sat Sep 24 09:33:45 EDT 2005


aeidson at meglink.com (Andrew Eidson) writes:

> This will be used for both Postgres databases and MS SQL
> databases.. my main question is.. Can this be done with a SQL
> query or will the data have to be exported Long and then run
> through a program/script

> there are some fields with Nulls.. 

It can be done with an SQL query but it's a little grungy and might
run extremely slowly on your system.

I don't have a copy of Postgres already installed to test on, so
I'm running this on MySQL, but this is standard SQL so the results
should work on most RDBMS's.  What I did was create a table like
the one you describe, but with one data value set to NULL, and one
data value missing:


mysql> create table foo (user_id int, field_id int, field_value varchar(10), unique key (user_id, field_id));      
Query OK, 0 rows affected (0.07 sec)

mysql> insert into foo values (1,1,'data11'),(1,2,'data12'),(1,3,'data13'),(1,4,'data14'),(2,1,'data21'),(2,2,'data22'),(2,3,'data23'),(2,4,'data24'),(3,1,'data31'),(3,2,NULL),(3,4,'data34');
Query OK, 11 rows affected (0.04 sec)
Records: 11  Duplicates: 0  Warnings: 0

mysql> select * from foo;
+---------+----------+-------------+
| user_id | field_id | field_value |
+---------+----------+-------------+
|       1 |        1 | data11      |
|       1 |        2 | data12      |
|       1 |        3 | data13      |
|       1 |        4 | data14      |
|       2 |        1 | data21      |
|       2 |        2 | data22      |
|       2 |        3 | data23      |
|       2 |        4 | data24      |
|       3 |        1 | data31      |
|       3 |        2 | NULL        |  <-- note missing row for (3,3)
|       3 |        4 | data34      |  <--
+---------+----------+-------------+
11 rows in set (0.00 sec)


The query to extract the data from that table in the format you want is:


mysql> select f.user_id, f1.field_value as id_1_value, f2.field_value as id_2_value, f3.field_value as id_3_value, f4.field_value as id_4_value from foo as f left join foo as f1 on (f.user_id=f1.user_id and f1.field_id=1) left join foo as f2 on (f.user_id=f2.user_id and f2.field_id=2) left join foo as f3 on (f.user_id=f3.user_id and f3.field_id=3) left join foo as f4 on (f.user_id=f4.user_id and f4.field_id=4) group by f.user_id;
+---------+------------+------------+------------+------------+
| user_id | id_1_value | id_2_value | id_3_value | id_4_value |
+---------+------------+------------+------------+------------+
|       1 | data11     | data12     | data13     | data14     |
|       2 | data21     | data22     | data23     | data24     |
|       3 | data31     | NULL       | NULL       | data34     |
+---------+------------+------------+------------+------------+
3 rows in set (0.00 sec)


Basically for each column you want in the output, you LEFT JOIN the
table against itself.  To put it loosely, the LEFT JOIN allows
missing data points to appear as NULLs.

That query is pretty slow, however, and even if you have the proper
indexes on the table, for a large table (thousands of rows, dozens
of columns) it might take a very long time to complete.  I can't
say.

You said "there are some fields with Nulls."  The above solution is
for the case where some rows are missing.  But if you know that your
data set actually has an existing row for every combination of
user_id and field_id (but some of those rows have NULL for a value),
then the solution is not a LEFT JOIN but a regular JOIN:


mysql> insert into foo values (3,3,'data33');
Query OK, 1 row affected (0.00 sec)

mysql> select f.user_id, f1.field_value as id_1_value, f2.field_value as id_2_value, f3.field_value as id_3_value, f4.field_value as id_4_value from foo as f, foo as f1, foo as f2, foo as f3, foo as f4 where f.user_id=f1.user_id and f1.field_id=1 and f.user_id=f2.user_id and f2.field_id=2 and f.user_id=f3.user_id and f3.field_id=3 and f.user_id=f4.user_id and f4.field_id=4 group by f.user_id;
+---------+------------+------------+------------+------------+
| user_id | id_1_value | id_2_value | id_3_value | id_4_value |
+---------+------------+------------+------------+------------+
|       1 | data11     | data12     | data13     | data14     |
|       2 | data21     | data22     | data23     | data24     |
|       3 | data31     | NULL       | data33     | data34     |
+---------+------------+------------+------------+------------+
3 rows in set (0.00 sec)


But if you do have "missing rows," then a regular JOIN will miss
some data (so you'll have to go back to the LEFT JOIN):


mysql> delete from foo where user_id=3 and field_id=3;
Query OK, 1 row affected (0.00 sec)

mysql> select f.user_id, f1.field_value as id_1_value, f2.field_value as id_2_value, f3.field_value as id_3_value, f4.field_value as id_4_value from foo as f, foo as f1, foo as f2, foo as f3, foo as f4 where f.user_id=f1.user_id and f1.field_id=1 and f.user_id=f2.user_id and f2.field_id=2 and f.user_id=f3.user_id and f3.field_id=3 and f.user_id=f4.user_id and f4.field_id=4 group by f.user_id;
+---------+------------+------------+------------+------------+
| user_id | id_1_value | id_2_value | id_3_value | id_4_value |
+---------+------------+------------+------------+------------+
|       1 | data11     | data12     | data13     | data14     |
|       2 | data21     | data22     | data23     | data24     |
+---------+------------+------------+------------+------------+
2 rows in set (0.00 sec)


But if I were doing this with a data set that had more than a few
columns, I'd probably just "SELECT *" into a perl array and then
massage the data in perl... that seems easier to me than typing a
bunch of LEFT JOIN, USING, blah blah... and perl is easier to
debug than spaghetti SQL... but that's just me.  :)
-- 
  Jamie McCarthy
 http://mccarthy.vg/
  jamie at mccarthy.vg



More information about the Members mailing list