Searching through UTF-16 files

In a typical day of a Siebel EAI developer, we are required to investigate problems with XML/XSLT files, and unless you know the system inside out, and outside in, you may encounter XML fragments that look unfamiliar.

For example, your ESB team sends you an XML that they claim to have been sent from your system, but from looking at the XML you dont recognise the interface that sent it. However you reason that, if you could search through all the XSLT files in Siebel you could track down the interface that could be generating this file.

The most obvious tool at the developers disposal is Windows Find, this is the defacto method of searching for files or text in files in a Windows OS, but beware because this method will not yield any results when searching through UTF-16 files. Surely Unix users would have more reliable tools in their arsenal. What about Grep?

Grep belongs to the Unix family of tools who's sole purpose is for searching, so how does this command fair with UTF-16 files? Fire up any Grep tool within reach and search for "xml" within your directory of UTF-16 XSLT files.

Here were my results.

Windows BareGrepX
Mac GrepX
HP Unix GrepX

To understand this behaviour, open a UTF-16 file in a binary editor, you'll see that there is a "00" byte (shown as a space character) between every character, this is how UTF-16 files are physically represented in the file system, but open it in Notepad or your favourite unicode aware editor, and it will magically render the UTF-16 code points into your familiar readable XML.

With that understanding, it becomes clear why text matches will not work in programs that use a non UTF-16 regex engine.

This is a problem for any unsuspecting EAI developer and Siebel project that relies heavily on XSLT for its request/response Integration, but knowing thy enemy is half the battle already won, so here are a few tools to deal with UTF-16 files.

1. Windows findstr
eg. findstr /s /c:"siebelmessage" *.xslt

This command can match strings in UTF-16, and can be used to search directories.

2. Notepad++
3. TextPad

The above two text editors, could search entire directories, and match for text within UTF-16 files.

With this in mind, I'm sure you could easily find other utilities and programs that can scour through UTF-16 files.

It is a little unsettling to discover that you cannot rely on such ubiquitous tools such as Find or Grep for your Integration needs, but hopefully the next time you get that foreign XML in your inbox, you'll already have a hand on a nearby UTF-16 scraper.


1 comment:

  1. I've just wasted an hour or so trying to locate strings in UTF-16 files in Windows using findstr, so I can report that it certainly doesn't work, at least not in Vista.

    Of course, what I'd really like would be a command line tool to do xpath queries against XML files matching a skeleton file name...


Comments are open to all, please make it constructive.