Improving comparison performance

Jun 3, 2013 at 8:50 PM
I wanted to get your suggestions on improving performance.

The tool works great on result sets smaller than 1,000 rows. One of my tests returns 20K rows and basically never finishes. GetHashCode is called (n * m) * 4 times, where "n" is the size of result set 1 and "m" is the size of result set 2. Of course, if there are a lot of columns, performance is even worse.

I want to pre-calculate the hash codes for the "key" and the "values" and link those to the dataset. Any suggestions on how to do this?

Jun 4, 2013 at 7:42 AM
If the GetHashCode method of the class 'DataRowKeysComparer' is the bottleneck, then we can improve this by storing the result of the calculation somewhere in the DataRow and so avoid recalculating it (n * m) * 4 times. The method will still be called (n * m) * 4 times, but the deep calculation of the hash code will be done only n + m times; all other calls will simply return the stored value.

Where to store this hash code? I think that inheriting from DataRow would be such a pain that I'd prefer to investigate other ways. Using the "RowError" property is really a dirty hack, but it could still serve for a prototype or a quick check of the solution. For the final solution, I'd add a column to my dataset (default value null) and clearly identify it as the hash code of the row. The GetHashCode method would then check the value of this column: if it's not null, return it; else calculate it, store it, and return it.
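The caching idea above can be sketched as follows. The actual project is C#/.NET (DataRow, GetHashCode); this is a minimal language-neutral Python sketch under stated assumptions, with hypothetical names like `Row` and `get_hash_code` standing in for the real classes:

```python
class Row:
    """Minimal stand-in for a DataRow: cell values plus a cached hash.

    In the .NET code the cache would live in an extra dataset column
    (null by default); here a plain attribute plays that role.
    """
    def __init__(self, values):
        self.values = tuple(values)
        self._cached_hash = None  # the "extra column", null until first use

    def get_hash_code(self):
        # The deep calculation happens only once per row (n + m times in
        # total); every later call returns the stored value.
        if self._cached_hash is None:
            self._cached_hash = hash(self.values)
        return self._cached_hash

rows1 = [Row((i, "a")) for i in range(3)]
rows2 = [Row((i, "a")) for i in range(3)]
# The comparison loop still calls get_hash_code() n * m times, but the
# expensive hashing runs only n + m times.
matches = sum(1 for r1 in rows1 for r2 in rows2
              if r1.get_hash_code() == r2.get_hash_code())
```

The key point is that the method's call count is unchanged; only the cost per call after the first drops to a field read.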

This is off the top of my head; there are probably other opportunities/solutions. Any suggestion is welcome.
Jun 4, 2013 at 4:34 PM
Interesting solution. I was wondering if it was possible to do something like that.

I'm thinking of using a hash table to perform the comparisons. They tend to have great performance.

The hash table would store: the hash of the key, the hash of the values, and a reference to the DataRow. If "AllKeys" is specified, then the hash of the key and the hash of the values are the same. Ideally we could store the hash of the key in the DataRow, so we would have a bi-directional association.

Then all of the comparisons would be performed using integers. The current code would be relatively unchanged.
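The proposed structure can be sketched like this. Again, the real implementation is C#/.NET; this Python sketch uses illustrative names (`build_index`, `key_cols`, `value_cols`) that are assumptions, not identifiers from the codebase. It maps each key hash to the values hash plus a row reference, so comparisons reduce to integer lookups:

```python
def build_index(rows, key_cols, value_cols):
    """Map hash-of-key -> (hash-of-values, row reference)."""
    index = {}
    for row in rows:
        key_hash = hash(tuple(row[c] for c in key_cols))
        values_hash = hash(tuple(row[c] for c in value_cols))
        index[key_hash] = (values_hash, row)
    return index

def compare(rows1, rows2, key_cols, value_cols):
    """Compare two result sets using only integer hash comparisons."""
    idx1 = build_index(rows1, key_cols, value_cols)
    idx2 = build_index(rows2, key_cols, value_cols)
    missing = [row for kh, (_, row) in idx1.items() if kh not in idx2]
    unexpected = [row for kh, (_, row) in idx2.items() if kh not in idx1]
    mismatched = [(idx1[kh][1], idx2[kh][1]) for kh in idx1
                  if kh in idx2 and idx1[kh][0] != idx2[kh][0]]
    return missing, unexpected, mismatched

rows1 = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
rows2 = [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}]
missing, unexpected, mismatched = compare(rows1, rows2, ["id"], ["amount"])
```

With the index in place, each lookup is expected constant time, turning the (n * m) pairwise scan into roughly n + m work.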

I can probably code this in the next two days. I'll send you a pull request once I'm done.

Your thoughts?
Jun 5, 2013 at 4:58 AM
I checked in code that pre-calculates the hash values. It didn't require too many changes and turned out to be quite simple.

The performance is a lot better. The tests I ran completed within milliseconds.

I also made the output of the elapsed time more human readable.
Jun 5, 2013 at 7:20 AM
Thanks for the fixes that you've pushed.

Glad to hear that you've been able to improve the performance of this feature.

PS: I'm not sure that I'll be able to review your pull requests before Saturday.
Jun 5, 2013 at 5:01 PM
I made best guesses about things based on my understanding of the code. Please rename and refactor as appropriate.

The changes I made do alter the behavior slightly: duplicate keys are no longer allowed in either result set. This shouldn't be too much of an issue; people just need to add "DISTINCT" to their SQL queries. MDX results are always unique, since the dimension slicers form the key.
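The duplicate-key check described above can be sketched as follows. This is a hedged Python illustration, not the actual C#/.NET code; the function name `index_by_key` and the use of `ValueError` are assumptions standing in for whatever exception the real implementation throws:

```python
def index_by_key(rows, key_cols):
    """Build the key index, rejecting duplicate keys up front."""
    index = {}
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        if key in index:
            # The real code raises its own exception type here.
            raise ValueError(
                f"Duplicate key {key} in result set; "
                "add DISTINCT to the query or widen the key.")
        index[key] = row
    return index

# A result set with a repeated key is rejected while the index is built,
# before any comparison takes place.
try:
    index_by_key([{"id": 1}, {"id": 1}], ["id"])
except ValueError as e:
    duplicate_detected = True
```

Detecting duplicates while building the index costs nothing extra, since each key is already being inserted into the hash table.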
Jun 5, 2013 at 9:44 PM
I don't mind about duplicate keys; it has never been the goal to support them. Anyway, your code to detect them sounds more robust than mine: two tests were incorrect and the existing code didn't detect them, while the new code you've submitted throws the expected exception.

I've commented on your first pull request about an issue with the value-column and numeric as the default. You haven't submitted your last commits as a pull request; is that just something forgotten, or an intentional move?
Jun 5, 2013 at 9:48 PM
I'm glad to hear the code is working.

I tried to submit another pull request, but CodePlex complains that a pull request already exists. I'm sure I'm doing something incorrectly.
Marked as answer by Seddryck on 12/23/2014 at 5:24 AM