Tuesday, November 10, 2009

SQL Server and Bulk Data Persistence

In my last post I presented Bulk Data Persistence (BDP) as a general concept to send bulks of data data from a client to a database server.

Today I'll show you ways how to use BDP with SQL Server.

Sample Environment

To avoid duplicate source code I'll start with a short description of the environment we will work with.

The database actually contains one table "TestMany" which is described by the following listing and contains about 1,000,000 rows. The following DDL statement describes the structure of the "TestMany" table.
CREATE TABLE dbo.TestMany (
   Id int IDENTITY(1,1) NOT NULL
      PRIMARY KEY CLUSTERED
   ,SomeInt int NULL
   ,SomeDate datetime NULL
   ,SomeText varchar(100) NULL,
);
We use a domain objects from type TestMany which represents a 1:1 mapping to our database table with an additional property to represent the state of the object.
public class TestMany
{
   public TestMany(int id, int someInt, DateTime someDate, 
                   string someText, char changeState) {
      Id = id;
      SomeInt = someInt;
      SomeDate = someDate;
      SomeText = someText;
      ChangeState = changeState;
   }

   public int Id { get; set; }
   public int SomeInt { get; set; }
   public DateTime SomeDate { get; set; }
   public string SomeText { get; set; }
   public char ChangeState { get; set; }
}
Most likely you noticed the type char of the ChangeState property. Usually the change state of an objects should be represented as enum, I choose this just to keep the following samples simple. To maintain the value of ChangeState I'll use the following pseudo enum.
public static class ObjectState
{
   public const char Added = 'I';
   public const char Deleted = 'D';
   public const char Changed = 'U';
   public const char Unchanged = 'X';
}
Since we'll send a complete package of data to the database we have to tell it how to handle specific rows. Therefore I'll use the simple characters "I" (for INSERT), "U" (for UPDATE) and "D" (for DELETE). The Unchanged is not really needed in our samples, but added for completeness.

To avoid writing a connection string always and always again I use a simple helper method which gets a connection string from app.config and opens the connection automatically.
private SqlConnection GetConnection(bool open) {
   SqlConnection cn;
   cn = new SqlConnection(Properties.Settings.Default.SandboxOnSql2k801);
   if (open)
      cn.Open();
   return cn;
}
To create TestMany objects to be persisted we'll use the following GetTestData() method.
private IList<testmany> GetTestData() {
   List<testmany> data = new List<testmany>();

   // to update
   for (int i = 1; i <= 1000; i++)
      data.Add(new TestMany(
         i, 
         i, 
         DateTime.Now.AddSeconds(i),
         string.Format("Updated {0}", i),
         ObjectState.Changed));

   // to delete
   for (int i = 20000; i <= 21000; i++)
      data.Add(new TestMany(
         i, 
         0, 
         DateTime.Now,
         null,
         ObjectState.Deleted));

   // to insert
   for (int i = -1; i >= -1000; i--)
      data.Add(new TestMany(
         i, 
         i, 
         DateTime.Now.AddSeconds(i),
         string.Format("Added {0}", i),
         ObjectState.Added));

   return data;
}
Important: To get BDP work we need a unique surrogate id for new objects which will be send to the database and mapped to the new generated Ids in the client. I work with negative integer numbers in the previous method.

Since we will persist a complete bulk of objects in one database round trip we need a slightly different way to get the server side generated IDENTITY values for new inserted objects. When persisting single objects we can use the SCOPE_IDENTITY() (or @@IDENTITY) function to get the id of the last inserted row. With BDP we send all new rows in one operation, so we have to synchronize all new ids after this operation. The following method gets all data which have been handled in a BDP operation, the count of inserted rows (returned by the database) and the id of the last inserted row (returned by the database). Later in this post, we'll see how we get those information and how to keep them valid. For the moment I just want to show you how to synchronize the client ids with the specified last id and the count of added objects. Probably it makes sense to jump over this method for the moment and come back when we did the first BDP operations.
private void SyncAddedIds(IEnumerable<TestMany> data, int count, int lastId) {

   // get surrogate client ids for new added items
   SortedList<int, TestMany> idLookup = new SortedList<int, TestMany>(count);
   foreach (var item in data)
      if (item.ChangeState == ObjectState.Added)
         idLookup.Add(item.Id, item);

   // map client ids to server ids
   IList<int> clientIds = new List<int>(idLookup.Keys);
   foreach (var clientId in clientIds)
      idLookup[clientId].Id = lastId - --count;
}

SQL Server 2008 Configuration

To use BDP in SQL Server 2008, we need to define table-valued types for the data to be persisted in bulk. We will use these types as simple Data Transfer Objects (DTOs) to transfer structural data from the client to the database.

Notice: We are not needed to use stored procedures for BDP. We will see a sample using a SP but we will also see another sample which work without any procedure.

The table-valued type DTO for our TestMany class and table looks just like our class, containing a column Action which represents the kind of operation to be executed with each row.
CREATE TYPE TestManyMergeType AS TABLE (
   Id int NOT NULL
   ,Action char(1) NOT NULL
      CHECK (Action IN ('D', 'U', 'I'))
   ,SomeInt int NULL
   ,SomeDate datetime NULL
   ,SomeText varchar(100) NULL
   ,PRIMARY KEY CLUSTERED (Id, Action)
);
The PRIMARY KEY on the Id and Action columns is important to ensure integrity and optimizes the lookup performance when we merge data. I also defined a CHECK constraint for the "Action" column for my personal paranoia :-) - to avoid wrong data.

As a little hint: Keep all lookup columns (Id and Action in this case) at the beginning of the table. Try to avoid them at the end, especially after any columns with variable length (as SomeText). This could be another paranoia statement in case of our DTO, but is important with real tables.

With this type our database is ready to handle BDP operations. Let's start with some samples.

SQL Server 2008 and Stored Procedures

We start with the stored procedure sample because it separates the two parts of the party and makes it easier to discuss each of them separately.

Let's start with the stored procedure definition.
ALTER PROCEDURE MergeTestMany
   @data TestManyMergeType READONLY
AS
   -- delete
   DELETE trg 
      FROM TestMany trg 
      WHERE EXISTS (
         SELECT * 
         FROM @data src 
         WHERE src.Id = trg.Id AND src.Action = 'D'
         );

   -- update
   UPDATE trg SET 
         SomeInt = src.SomeInt
         ,SomeDate = src.SomeDate
         ,SomeText = src.SomeText
      FROM TestMany trg 
         JOIN @data src 
         ON trg.Id = src.Id AND src.Action = 'U';

   -- insert
   INSERT INTO TestMany
      SELECT SomeInt,SomeDate,SomeText 
      FROM @data 
      WHERE Action = 'I' 
      ORDER BY Id -- ## IMPORTANT to match ids in client

   -- return last IDENTITY and count of added rows
   SELECT SCOPE_IDENTITY(), @@ROWCOUNT;
The procedure gets one table-valued parameter with the melodic name @data (feel free to change this to a more specific name). Table-valued parameters are a new and very powerful feature of SQL Server 2008 to send a complete set of data to the server. The procedure body represents four separate parts.

As first we use a DELETE statement to delete all provided data which are marked with a "D"-action and matched by a specified Id. (At this point I have to say thank's to Paul White - a real SQL guru - from SQLServerCentral.com for his explanation of the left semi join when using EXISTS here.)

The second part of the procedure uses a joined UPDATE to update all existing data with their new values which are marked a "U"-action.

The third part simply inserts all provided DTOs which are marked with "I". Spot the ORDER BY for the source data. This ordering is very important to ensure the rows to be inserted in the order of the specified surrogate ids from client to be able to calculate the generated server IDENTITY values in the client application.

Last but not least we return the last generated IDENTITY value and the count of rows which have been newly inserted into our table. Feel free to scroll up at to the SyncAddedIds method which should be more comprehensible now since this is the most tricky and most important part to get a well working bulk insert operation.

Let's move to the client side implementation.

To provide a table-valued parameter from ADO.NET to SQL Server 2008, there are two different ways. You can provide a DataTable or an implementation of an IEnumerable<Microsoft.SqlServer.Server.SqlDataRecord> which returns all data to be transfered to the parameter. Personally I prefer the enumerator since a DataTable always represents a not needed overhead and requires relatively much memory.

The following listing shows a quiet straight forward implementation of this enumerable for our TestMany domain object.
class TestManyEnumerator : IEnumerable<SqlDataRecord>
{
   public TestManyEnumerator(IEnumerable<TestMany> data) {
      _data = data;
   }

   private IEnumerable<TestMany> _data;

   public IEnumerator<SqlDataRecord> GetEnumerator() {
      SqlMetaData[] metaData = {
         new SqlMetaData("Id", SqlDbType.Int),
         new SqlMetaData("MergeAction", SqlDbType.Char, 1),
         new SqlMetaData("SomeInt", SqlDbType.Int),
         new SqlMetaData("SomeDate", SqlDbType.DateTime),
         new SqlMetaData("SomeText", SqlDbType.VarChar, 100),
      };

      foreach (var item in _data) {
         SqlDataRecord record = new SqlDataRecord(metaData);

         record.SetValue(0, item.Id);
         record.SetValue(1, item.ChangeState);
         record.SetValue(2, item.SomeInt);
         record.SetValue(3, item.SomeDate);
         record.SetValue(4, item.SomeText);

         yield return record;
      }
   }

   System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator() {
      return this.GetEnumerator();
   }
}
As you see, the GetEnumerator() first creates an array of SqlMetaData which is required for the constructor of a SqlDataRecord objects. After this we have to set the values of the current TestMany object and return it.

The next sample shows how to call our stored procedure from ADO.NET.
using (SqlConnection cn = GetConnection(true))
using (SqlTransaction tran = cn.BeginTransaction()) {
   // get test data
   IList<TestMany> data = GetTestData();

   // create a new command as stored procedure
   SqlCommand cmd = new SqlCommand("MergeTestMany", cn);
   cmd.CommandType = CommandType.StoredProcedure;
   cmd.Transaction = tran;
   // create a new parameter ("Structured" defines a TVP)
   SqlParameter p = cmd.Parameters.Add("@data", SqlDbType.Structured);
   // specify the name of the TVP
   p.TypeName = "TestManyMergeType";
   p.Value = new TestManyEnumerator(data);

   // catch the returned IDENTITY information
   using (SqlDataReader reader = cmd.ExecuteReader()) {
      reader.Read();
      int lastId = reader.IsDBNull(0) ? -1 : (int)reader.GetDecimal(0);
      int count = reader.GetInt32(1);
      // synchronize all newly added keys
      SyncAddedIds(data, count, lastId);
   }

   tran.Commit();
}
As you see, to call a procedure which receives a table-valued parameter we have to specify a SqlParameter as SqlDbType.Structured, the name of the our T-SQL table-valued type as TypeName and provide the above shown enumerator as value of the parameter. This enables ADO.NET to use a single RPC call to utilize the procedure with all changes which have been made on our objects.

A SqlDataReader gets the IDENTITY information, returned by the procedure, and calls the SyncAddedIds method to set the newly generated ids to all our added domain objects.

Sadly, the current version of JDBC apparently not yet supports table-valued parameters. However, since there are many voices in the web (and on the Microsoft JDBC Driver Team Blog) I really hope this great feature will be added to the next version.

SQL Server 2008 without Procedures

Sure, if we don't work with stored procedures (probably we use an O/R-Mapper) in our environment, we are still able to realize BDP.

Whenever we changed domain objects in memory for a specific business case there will come the point where we have to persist the data to the database. At this position we usually generate some dynamic SQL statements to send the changes to the database. This is the position where we just have to replace the existing single-row CRUD statements with bulk operation statements as we've already seen in the procedural approach. ADO.NET is able to bind table-valued parameters to any custom SQL command.

Important: As in the procedural approach, we need a table-valued type which can be used as DTO when communicating with the database and an implementation of a IEnumerable<SqlDataRecord> (or a DataTable which can be provided as value for the SqlParameter.

When we compare the following snippet with the stored procedure solution we'll see, the only difference is the custom SQL instead of a stored procedure call.
using (SqlConnection cn = GetConnection(true))
using (SqlTransaction tran = cn.BeginTransaction()) {
   // get test data
   IList<TestMany> data = GetTestData();

   // create the complete DELETE/UPDATE/INSERT sql command
   String sql = @"
      DELETE trg FROM TestMany trg WHERE EXISTS(
         SELECT * FROM @data src WHERE src.Id = trg.Id AND src.Action = 'D'
         );

      UPDATE trg SET SomeInt=src.SomeInt, 
                     SomeDate=src.SomeDate, 
                     SomeText=src.SomeText
         FROM TestMany trg 
         JOIN @data src ON trg.Id = src.Id AND src.Action='U';

      INSERT INTO TestMany
         SELECT SomeInt,SomeDate,SomeText 
         FROM @data 
         WHERE Action='I' 
         ORDER BY Id -- <- IMPORTANT to match ids in client

      SELECT SCOPE_IDENTITY(), @@ROWCOUNT;
      ";

   SqlCommand cmd = new SqlCommand(sql, cn);
   cmd.Transaction = tran;
   // create a new table-valued parameter for our TestManyMergeType type
   SqlParameter p = cmd.Parameters.Add("@data", SqlDbType.Structured);
   p.TypeName = "TestManyMergeType";
   // provide the data
   p.Value = new TestManyEnumerator(data);

   // execute the sql command and map the returned server ids
   using (SqlDataReader reader = cmd.ExecuteReader()) {
      reader.Read();
      int lastId = reader.IsDBNull(0) ? -1 : (int)reader.GetDecimal(0);
      int count = reader.GetInt32(1);
      SyncAddedIds(data, count, lastId);
   }

   tran.Commit();
}
Indeed, usually we will never put such dynamic SQL statements to our client application. Those statements will be generated by a generic implementation within an O/R-Mapper or any other data access layer.

BDP and SQL Server 2005

Sadly, table-valued types have just been introduced with SQL Server 2008, what means they are not available in SQL Server 2005 and previous version. The DTOs could be realized with user defined CLR types but SQL Server does not support arrays like other database systems. Therefore BDP is not possible to use with SQL Server 2005.

One workaround might be to create a XML document which contains the data to be persisted but this is a hard way to go and requires deep performance tests. I successfully used this for larger operations (more than 10,000 rows in one batch) but I have no experiences with general solutions for client layers.

Conclusion

We have seen how we can realize Bulk Data Persistence when using SQL Server 2008. We need a table-valued type defined in SQL Server which can be used as Data Transer Object and an enumerator for SqlDataRecord objects.

Hence, if you do not want to use user-defined table-valued types in your database BDP is not the right way to go in your system.

Outstanding

Probably the idea of BDP sounds nice for you. Nevertheless, one important part is missing here. A comparison with a single statement based solution. I'll provide this in a following post and reference it here.
Edit: Table-Valued Parameters - A Performance Comparison

Since it would be way too much source to code for lazy people like me, I'll also provide a possible implementation of a generic approach which can be used for any kind of domain object.
Edit: A Generic BDP Approach for SQL Server 2008

For the database gurus out there I'll provide an special post where I'll compare the classical INSERT/DELETE/UPDATE functions to the shiny new MERGE function which was introduced with SQL Server 2008. Since now my tests are not stressful enough to provide a real statement about.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.