Pages

12/06/2010

Import StackOverflow DataDump into My Own Database

I'm about to duplicate StackOverflow data for researching purpose.

My database is MySQL v.5+ (I really forget the version number)
The problem is ..  the data dump from StackOverflow comes in XML format,
which I couldn't find any tool to help me import those into MySQL database.

data dump from http://blog.stackoverflow.com/category/cc-wiki-dump/

The format is like this:
<?xml version="1.0" encoding="utf-8"?>
<badges>
  <row Id="82946" UserId="3718" Name="Teacher" Date="2008-09-15T08:55:03.923" />
  <row Id="82947" UserId="994" Name="Teacher" Date="2008-09-15T08:55:03.957" />
  ...
</badges>

So, I think I have no choice but to implement an xml importer myself.

Specifications of my problem:

  • Importing all dumped data from StackOverflow into my MySQL database
  • data dump can be downloaded from http://blog.stackoverflow.com/category/cc-wiki-dump/
  • data dump comes in format of several XML files in one folder
    • each XML file represents one table
      • each row node in XML represents one row of data
      • each attribute of each node represents one column
    • XML file size ranges from 85MB to 5.4 GB (badges.xml , posthistory.xml)
  • MySQL is running on a remote server
  • Need to be Java language implementation, for others to read

My 4 concerns are:
  • Schema of tables are currently unknown
    • although I can guess the configurations of each table's fields from looking at each XML, but it does not specify how large the field is.
  • XML files are large, using XML parser like XPATH would consume too much memory and processing time.
  • inserting large data into MySQL database would be very slow
  • supporting for new data. If there's a new data dump coming in next month, it must support to import new data into the database.
Solution:
  • For the first problem, I found a SQL script for creating StackOverflow's tables for MSSQL here http://code.google.com/p/stack-exchange-data-explorer/source/browse/Data/stackoverflow.sql. It's an opensource program to explorer stackoverflow data using ODATA format. On today (12/6/10), the file is gone. However, I have seen it and translated the script from MSSQL format into MySQL format, seen below.
  • Although, Large XML files couldn't be parsed efficiently with DOM,JDOM,XPATH, they could be parsed with SAX and Stax easily. Here I use Stax because it's more easily to use.
  • Slow inserting to database, this can be solved by batch processing. I have a configuration parameter in the config.properties file to set how many rows per one batch insert and commit
    • Using 1000 rows/batch, the program managed to use 300kB/s for uploading to database, which took about 8 hours for all files to finish (I excludes PostHistory.xml from the process)
    • Another solution is to use LOAD DATA INFILE ... , but it requires me to upload data into a file format that is readable by MySQL. It'd raise another concern on how to make the file, especially there are fields containing plain+HTML text.
  • I found an SQL command called "REPLACE". It is proprietary for MySQL. What it does is it will delete the row(only if the row is duplicated), and then insert new one. This command is opposite of "INSERT IGNORE" which ignores the new insert if it's duplicated. I think I should keep new data instead of old data, so I picked "REPLACE" command.

SQL script for creating database and tables
CREATE DATABASE IF NOT EXISTS StackOverflow

USE StackOverflow

CREATE TABLE IF NOT EXISTS VoteTypes(
 Id int NOT NULL,
 Name varchar(40) NULL,
 PRIMARY KEY (Id)
);

CREATE TABLE IF NOT EXISTS Votes(
 Id int NOT NULL,
 PostId int NULL,
 VoteTypeId int NULL,
 CreationDate datetime NULL,
 CONSTRAINT PK_Votes PRIMARY KEY (id)
);

CREATE TABLE IF NOT EXISTS Users(
 Id int NOT NULL,
 Reputation int NULL,
 EmailHash varchar(40) NULL,
 CreationDate datetime NULL,
 DisplayName varchar(40) NULL,
 LastAccessDate datetime NULL,
 WebsiteUrl varchar(200) NULL,
 Location varchar(100) NULL,
 Age int NULL,
 AboutMe LONGTEXT NULL,
 Views int NULL,
 UpVotes int NULL,
 DownVotes int NULL,
 CONSTRAINT PK_Users PRIMARY KEY (Id)
);

CREATE TABLE IF NOT EXISTS Tags(
 Id int NOT NULL,
 TagName varchar(255) NULL,
 PRIMARY KEY(Id)
);

CREATE TABLE PostTags(
 PostId int NULL,
 TagId int NULL
);

CREATE TABLE IF NOT EXISTS Posts(
 Id int NOT NULL,
 PostTypeId int NULL,
 AcceptedAnswerId int NULL,
 CreationDate datetime NULL,
 Score int NULL,
 ViewCount int NULL,
 Body LONGTEXT NULL,
 OwnerUserId int NULL,
 OwnerDisplayName nvarchar(40) NULL,
 LastEditorUserId int NULL,
 LastEditDate datetime NULL,
 LastActivityDate datetime NULL,
 Title nvarchar(250) NULL,
 Tags nvarchar(150) NULL,
 AnswerCount int NULL,
 CommentCount int NULL,
 FavouriteCount int NULL,
 ClosedDate datetime NULL,
 ParentId int NULL,
 CommunityOwnedDate datetime NULL,
 CONSTRAINT PK_Posts PRIMARY KEY (id)
);

CREATE TABLE IF NOT EXISTS Comments(
 Id int NOT NULL,
 PostId int NULL,
 Score int NULL,
 Text LONGTEXT NULL,
 CreationDate datetime NULL,
 UserId int NULL,
 CONSTRAINT PK_Comments PRIMARY KEY (Id)
);

CREATE TABLE IF NOT EXISTS Badges(
 Id int NOT NULL,
 UserId int NULL,
 Name varchar(50) NULL,
 Date datetime NULL,
 CONSTRAINT PK_Badges PRIMARY KEY (Id)
);



This is the code "XMLDumpImporter"
/**
* This is a tool for importing StackOverflow data dump into MySQL database.
* @author ptantiku
*/
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Properties;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class XMLDumpImporter {

 Properties properties = null;
 String serveraddress;
 String port;
 String username;
 String password;
 String databasename;
 private int rowspercommit;
 String filePath;

 Connection connection;
 

 /**
  * Main Program 
  * @param args
  * @throws IOException
  */
 public static void main(String[] args) throws IOException {
  XMLDumpImporter xmlImporter = new XMLDumpImporter(
    "/media/DeerDisk2/Download/Stack Overflow Data Dump - Nov 2010/Content/112010 Stack Overflow");
  xmlImporter.importXMLFolder();
  xmlImporter.closeDB();
 }

 public XMLDumpImporter(String path) {
  loadConfig();
  connectDB();
  filePath = path;
 }

 private void loadConfig() {
  // create an instance of properties class
  properties = new Properties();

  // try retrieve data from file
  try {
   properties.load(new FileInputStream("config.properties"));
   serveraddress = properties.getProperty("serveraddress");
   port = properties.getProperty("port");
   username = properties.getProperty("username");
   password = properties.getProperty("password");
   databasename = properties.getProperty("databasename");
   rowspercommit = Integer.parseInt(properties.getProperty("rowspercommit"));
  } // catch exception in case properties file does not exist
  catch (IOException e) {
   e.printStackTrace();
  }
 }

 private void connectDB() {
  // connect to MySQL
  connection = null;
  try {
   // Load the JDBC driver
   String driverName = "org.gjt.mm.mysql.Driver"; // MySQL MM JDBC
               // driver
   Class.forName(driverName);

   // Create a connection to the database
   String url = "jdbc:mysql://" + serveraddress +":" + port + "/" + databasename; 
   connection = DriverManager.getConnection(url, username, password);
   connection.setAutoCommit(false);  //for batch insert
  } catch (ClassNotFoundException e) {
   // Could not find the database driver
   e.printStackTrace();
  } catch (SQLException e) {
   // Could not connect to the database
   e.printStackTrace();
  }
 }

 public void importXMLFolder() throws IOException {
  File path = new File(filePath);
  if (path == null || !path.isDirectory()) {
   throw new IOException("Path is wrong");
  }
  String fileList = properties.getProperty("filelist");
  String[] fileListArray = fileList.split(",");

  String tableName = properties.getProperty("tablename");
  String[] tableNameArray = tableName.split(",");

  for(int i=0;i<fileListArray.length;i++){
  //for (int i = 0; i < 1; i++) {
   String filename = filePath + File.separatorChar + fileListArray[i];
   String table = tableNameArray[i];
   importXMLFile(filename, table);
  }
 }

 public void importXMLFile(String filename, String tablename)
   throws IOException {
  System.out.println("Processing file: " + filename);

  String fields = properties.getProperty(tablename+"_fields");
  String[] fieldsArray = fields.split(",");
  
  
  
  try {
   FileInputStream fin = new FileInputStream(new File(filename));
   XMLInputFactory factory = XMLInputFactory.newInstance();
   XMLStreamReader r = factory.createXMLStreamReader(fin);
   int attributeCount = 0;
   int rowCount = 0;

   //connection part
   if(connection==null||connection.isClosed())
    connectDB();
   //REPLACE is non-standard SQL from MySQL, it's counter part of INSERT IGNORE
   String sql = "REPLACE INTO "+tablename+" ("+fields+") VALUES ("+fields.replaceAll("[^,]+", "?")+")";
   System.out.println("SQL: "+sql);
   PreparedStatement pstmt = connection.prepareStatement(sql);
   
   // start parsing
   int event = r.getEventType();
   while (true) {
    // for each wanted element 
    if (event == XMLStreamConstants.START_ELEMENT
      && r.getName().toString().equalsIgnoreCase("row")) {
     System.out.println("Table : "+tablename+" / Row : "+rowCount );
     if (attributeCount == 0)
      attributeCount = r.getAttributeCount();

     /*for (int a = 0; a < attributeCount; a++) {
      System.out.print("\t{" + r.getAttributeName(a) + ":"
        + r.getAttributeValue(a) + "}");
     }
     System.out.println();
     */
     
     //put each parameter to SQL
     int f=1;
     for (String field : fieldsArray){
      pstmt.setString(f++,r.getAttributeValue("", field));
     }
     pstmt.addBatch();
     rowCount++;
     
     if(rowCount%rowspercommit==0){
      System.out.println("Importing at row "+rowCount+" ... ");
      pstmt.executeBatch();
      connection.commit();
     }
    } // end for each row.

    // proceeding to next element
    if (!r.hasNext())
     break;
    else
     event = r.next();
   }

   System.out.println("Importing all rows into database (it might take time) ... ");
   pstmt.executeBatch();
   connection.commit();
   pstmt.close();
   r.close();
   fin.close();
   System.out.println("Importing "+filename+" is done");
  } catch (XMLStreamException e) {
   e.printStackTrace();
  } catch (SQLException e) {
   e.printStackTrace();
  }
 }
 
 public void closeDB(){
  try {
   if(connection!=null&&!connection.isClosed())
    connection.close();
  } catch (SQLException e) {
   e.printStackTrace();
  }
 }
}

This code needs config.properties file to supply necessary data to them, such as database configurations, StackOverflow table configurations.

This is the content of "config.properties" file (db config is masked):
# This is configuration file for XMLDumpImporter
serveraddress = XXX.XXX.XXX.XXX
port = 3306
username = XXXXX
password = XXXXX
databasename = StackOverflow

#number of rows inserted before each commit
rowspercommit = 1000

#file list for checking and processing in order
filelist = badges.xml,posts.xml,votes.xml,comments.xml,users.xml

#table name, sequentially according to filelist
tablename = badges,posts,votes,comments,users

#database fields: key = tablename_fields
badges_fields = Id,UserId,Name,Date
posts_fields = Id,PostTypeId,AcceptedAnswerId,CreationDate,Score,ViewCount,Body,OwnerUserId,OwnerDisplayName,LastEditorUserId,LastEditDate,LastActivityDate,Title,Tags,AnswerCount,CommentCount,FavouriteCount,ClosedDate,ParentId,CommunityOwnedDate
votes_fields = Id,PostId,VoteTypeId,CreationDate
comments_fields = Id,PostId,Score,Text,CreationDate,UserId
users_fields = Id,Reputation,EmailHash,CreationDate,DisplayName,LastAccessDate,WebsiteUrl,Location,Age,AboutMe,Views,UpVotes,DownVotes


In this work, I ignore the table "posthistory", since I have no use for it yet.

Other related info:
Odata from http://odata.stackexchange.com/stackoverflow/query/new
Odata explanation :: http://sqlserverpedia.com/wiki/Understanding_the_StackOverflow_Database_Schema
Doc: www.odata.org/developers/odata-sdk
Data dump @ http://blog.stackoverflow.com/category/cc-wiki-dump/
StackOverflow Schema: http://code.google.com/p/stack-exchange-data-explorer/source/browse/Data/stackoverflow.sql

4 comments:

  1. Great!
    there is a small typo in config list of bagdes columns should be CreationDate instead of Date. Regards

    --
    Szymon Chojnacki
    http://www.ipipan.eu/~sch/

    ReplyDelete
  2. Hi Szymon,
    Thanks for your correction. I just went to review it again and it seems like the SQL script is wrong. I use "CreationDate" in SQL script, rather than just "Date" like other places else. So, I corrected the SQL script. Thanks again for reading my blog post :)

    ReplyDelete
  3. Useful information, thanks for the contribution!
    http://SeanHaight.com

    ReplyDelete
  4. He needs to thoroughly examine any update that has to be made to the database and found to be genuine and safe. create mysql dashboard

    ReplyDelete