ETL (Data import).from Json, XML, HTML

From time to time I still see people write a custom code to parse Json, XML or HTML on the fly to import the data into relational database tables.

I used to do the same till I found much better solution.
Almost all modern databases (Postgres, Microsoft SQL server, MySQL, ) have Json and XML datatype and functions to parse Json and XML, (Microsoft 2016 doesn't have Json datatype but has Json functions which have a great performance).
What you need to do is to save the XML, Json or HTML data in a relational table as a big string and then create a view using the Json and XML functions. In case of HTML you just need to convert HTML into XML before saving it in a table, there are several libraries which can do it (for example HtmlAgilityPack in .Net).
There are a lot of advantages of doing it this way. Here is a typical scenario. Let's say you need to import some data from a web service into a database. So the classic approach would be calling the web service with different parameters, parsing it on the fly, and saving the values in the database.

Here is much better way:

1. Save the data in XML datatype (Postgres, SQL server) or Json data type in Postgres or nvarchar(max) SQL server 2016.
2. Create views using Json or XML functions.
3. Now you can use the views for SELECT queries, but If you want better performance, you can always copy the data from the views intto tables and create appropriate indexes.

The only disadvantage of this approach is that you might end up saving much more data than you actually need, but with current cost of storage (almost nothing) it is usually not important. I don't know how it will work if the Json/XML strings are too big, but some of the strings I imported (stored in a single row) exceed 2GB, and it worked perfectly. The total amount of data parsed this way is unlimited.

Here are the main advantages:
1. The development is very simple and robust, to correct mistakes (for example misspelled object name), you just need to correct the view, and you see the corrected result right away.
2. Since you download and store all your data, if tomorrow, you decide you want to import some additional data from your raw data (Json/XML) into the database tables, you just need to change the view, you don't need to re-download and re-parse all the data.
3. You don't need to catch the errors, in case some of the expected objects/nodes are missing, the view will just return NULL.
4. Usually it is faster than parsing on the fly.

ETL (Data import).from Json, XML, HTML

Tuesday, January 31, 2017

The best way of importing Json, XML, load data from webservices and HTML into Relational database.