
Converting Oracle Varchar2 Data Type to SQL Server

Have you ever had to import data from Oracle and store it in a SQL Server database? Normally, I don't think too hard about what is required, but I ran into a situation where a table's row size would exceed the 8000-byte row size limit if converted to SQL Server. I was using SSIS (SQL Server Integration Services). The big issue was that the columns were defined as varchar2 in Oracle, and my initial plan was to convert them to nvarchar in SQL Server. By defining those columns as nvarchar, however, I exceeded the SQL Server buffer size limit for certain operations. An easy solution is to define the columns as varchar in SQL Server, which immediately brings the row back under the buffer size limit. But now my questions become: "Did I mess up the data? Will there be any character loss? What am I affecting by moving the data from an Oracle varchar2 column into a SQL Server varchar column?" Whether the data is damaged depends on the character set used in Oracle and the collation used in SQL Server. In Oracle, the character set is chosen when the database is created.
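Before getting into character sets, here is a minimal T-SQL sketch of why the nvarchar conversion blew past the size limit in the first place; the table and column names are illustrative, not my actual table. Every nvarchar column needs up to 2 bytes per character, so declaring the same widths as nvarchar roughly doubles the maximum row size.

    -- Hypothetical staging table: as varchar, the three columns need at most
    -- 4000 + 2000 + 1500 = 7500 bytes per row.
    CREATE TABLE dbo.OracleImportStage (
        OrderNotes    varchar(4000) NULL,
        ShippingNotes varchar(2000) NULL,
        InternalNotes varchar(1500) NULL
    );

    -- The same widths declared as nvarchar(4000), nvarchar(2000), nvarchar(1500)
    -- need up to 2 bytes per character, so the maximum row size roughly
    -- doubles to about 15000 bytes.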

In Oracle, varchar2(10) means the column can store 10 bytes (with the default BYTE length semantics). The maximum length is 4000 bytes. In a single-byte character set, a varchar2(10) column can store up to 10 characters; the number of bytes and the number of characters are essentially the same. In a multi-byte character set, things are different. If the database is set up to use a Unicode character set, the column can store characters that use more than 1 byte.
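A small sketch of what that means in practice; the table name is made up, and it assumes an AL32UTF8 database with the default BYTE length semantics:

    -- VARCHAR2(10 BYTE) limits the column to 10 bytes; VARCHAR2(10 CHAR)
    -- limits it to 10 characters regardless of how many bytes each one takes.
    CREATE TABLE length_demo (
        val_bytes VARCHAR2(10 BYTE),
        val_chars VARCHAR2(10 CHAR)
    );

    -- Ten 3-byte characters (30 bytes) fail for val_bytes with ORA-12899
    -- (value too large for column), but fit in val_chars.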

Oracle also supports UTF-16 by using the national character set (which is used for data stored in nvarchar2, nchar, and nclob columns). Any character in a UTF-16 implementation occupies no less than 2 bytes of storage, whereas varchar2 characters can be stored in as little as 1 byte. In an nvarchar2 column, size is the number of characters. (The number of bytes may be up to 2 times this number for the AL16UTF16 encoding and 3 times this number for the UTF8 encoding.)
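For example, assuming the national character set is AL16UTF16 (the table name is illustrative), a quick way to see the 2-bytes-per-character storage:

    CREATE TABLE nls_demo (name NVARCHAR2(10));   -- 10 characters, up to 20 bytes under AL16UTF16

    INSERT INTO nls_demo (name) VALUES (N'0123456789');   -- exactly 10 characters

    SELECT LENGTH(name) AS char_count, LENGTHB(name) AS byte_count
      FROM nls_demo;
    -- Expect char_count = 10 and byte_count = 20.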

When an Oracle database is created, the user chooses the character set. Oracle's recommendation is that if the environment (clients and servers) consists entirely of Oracle 9i or higher, the database character set (NLS_CHARACTERSET) should be AL32UTF8, which encodes Unicode data in UTF-8, and the national character set should be AL16UTF16, which encodes Unicode data in UTF-16. AL32UTF8 is a variable-width character set, which means the code for a character can be 1 to 4 bytes long, depending on the character itself; the AL16UTF16 character set uses 2 or 4 bytes to store a character. More detailed notes on the AL32UTF8 character set can be found here.
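If you are not sure what a given Oracle database uses, you can query the NLS settings directly:

    -- NLS_CHARACTERSET governs VARCHAR2/CHAR/CLOB columns;
    -- NLS_NCHAR_CHARACTERSET governs NVARCHAR2/NCHAR/NCLOB columns.
    SELECT parameter, value
      FROM nls_database_parameters
     WHERE parameter IN ('NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET');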

UTF-8 is the 8-bit encoding of Unicode. It is a variable-width encoding. One Unicode character can be 1 to 4 bytes in UTF-8 encoding. Characters from the European scripts are represented in either 1 or 2 bytes. Characters from most Asian scripts are represented in 3 bytes. Supplementary characters are represented in 4 bytes.
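You can see the variable widths for yourself; the queries below assume an AL32UTF8 database, and the sample characters are just illustrations:

    SELECT LENGTH('e') AS char_count, LENGTHB('e') AS byte_count FROM dual;    -- 1 character, 1 byte (ASCII)
    SELECT LENGTH('é') AS char_count, LENGTHB('é') AS byte_count FROM dual;    -- 1 character, 2 bytes (European accented letter)
    SELECT LENGTH('日') AS char_count, LENGTHB('日') AS byte_count FROM dual;  -- 1 character, 3 bytes (Asian script)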

UTF-16 is the 16-bit encoding of Unicode. It is an extension of UCS-2 and supports the supplementary characters defined in Unicode 3.1 by using a pair of UCS-2 code points. One Unicode character can be 2 bytes or 4 bytes in UTF-16 encoding. Characters (including ASCII characters) from European scripts and most Asian scripts are represented in 2 bytes. Supplementary characters are represented in 4 bytes.
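The same kind of check for UTF-16, assuming the national character set is AL16UTF16; UNISTR returns its result in the national character set, and the code points below are just examples:

    SELECT LENGTHB(UNISTR('e'))          AS byte_count FROM dual;  -- 2 bytes: even an ASCII letter uses 2 bytes
    SELECT LENGTHB(UNISTR('\65E5'))      AS byte_count FROM dual;  -- 2 bytes: most Asian characters (here U+65E5) fit in one 2-byte unit
    SELECT LENGTHB(UNISTR('\D83D\DE00')) AS byte_count FROM dual;  -- 4 bytes: a supplementary character stored as a surrogate pair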

In SQL Server, varchar uses one byte per character, and within that 1 byte it can carry the ASCII characters. Characters 0 through 127 are the ASCII characters (which covers English). Characters 128 through 255 also represent characters; which ones depends on the code page of the collation chosen for the database. nchar/nvarchar use 2 bytes per character to store a Unicode string. SQL Server 2012 provides full support for UTF-16 through its supplementary character (_SC) collations, using up to 4 bytes per character. Previous versions of SQL Server do not support UTF-16, but they do support UCS-2, which is a subset of UTF-16 limited to characters that fit in 2 bytes. Also note that nvarchar(10) allocates 10 byte-pairs (20 bytes), so it can carry only 5 characters that need 4 bytes each (supplementary characters). You have to examine the collation and code page being used to determine which characters can actually be stored.
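Here is a small T-SQL sketch of both points. It assumes the varchar collation maps to a Latin code page (for example Latin1_General_CI_AS), and the sample string is illustrative:

    DECLARE @n nvarchar(10) = N'Köln 日本';          -- 7 characters
    DECLARE @v varchar(10)  = @n;                    -- implicit conversion to the varchar collation's code page

    SELECT @n              AS unicode_value,
           @v              AS varchar_value,         -- likely 'Köln ??' under a Latin1 collation: the CJK characters are lost
           DATALENGTH(@n)  AS nvarchar_bytes,        -- 14 (2 bytes per character)
           DATALENGTH(@v)  AS varchar_bytes;         -- 7  (1 byte per character)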

With the above information, you can see that when converting Oracle varchar2 data to SQL Server, you should use an nvarchar data type in SQL Server. When that is not possible, as I experienced in one of my recent projects, you need to find an alternative or inform the end user that there may be some character loss if that conversion is made. The only time you can be 100 percent sure there is no character loss when using a varchar data type is when the source column contains only ASCII characters.
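If you do have to go with varchar, one way to verify that claim for your data is to stage it in an nvarchar column first and round-trip it through varchar; the table and column names below are illustrative:

    -- Any row whose value changes after the varchar round trip would lose
    -- (or alter) characters if stored as varchar.
    SELECT COUNT(*) AS rows_with_character_loss
      FROM dbo.StagingOrders
     WHERE CAST(CAST(OrderNotes AS varchar(4000)) AS nvarchar(4000)) <> OrderNotes;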

