Large Data Sets (Part 2 of 3?)

Our second topic for dealing with large data sets is sorting. The problem with sorting is you need to have all of the data available in order to produce an accurate sort. However, as we saw in the last article bringing down all of the data from server to client is incredibly expensive and therefore unrealistic. Even normal web applications rarely load all of their data from the database server to the app server when dealing with a large amount.

Instead of downloading all of the data to the client we’ll let the database take care of sorting for us (which is one of the things a database is very good at). Let’s build on our examples from the previous article and add sorting capability to the implicit paging solution. The examples can be found here and are a superset of what we did in article 1 (meaning that the old examples are there too).


Sorting

The first part to adding sorting support is in our data service. Remember from the last example we had a ValueListService which was similar to the ValueListHandler. I’ve now created a subinterface called SortedValueListService. The first method we care about is an extension of getElements called getSortedElements. We’ll keep this method simple and only allow sorting on a single field plus a boolean indicating whether the sort should be in ascending or descending order. If you were to build on this you might change the sortField into some form of sort descriptor.

I’ve implemented this method in the CensusService and added the appropriate method to the CensusDAO. The CensusDAO is now much more closely tied to the MS SQL Server database as it uses a stored procedure to retrieve the data as we need to keep track of row numbers which requires use of temporary tables. The same would not be true in Oracle which supports rownum inherently and I’m not sure about MySQL but I’d bet it’s not hard there either. My stored procedure is based on info I found on the 4GuysFromRolla.com site. The premise is the same as the normal getElements SQL, the only difference is that we need to throw sorting in and therefore can’t rely on the ID being the same as the row number.

Now I need to update my Flex code to take advantage of sorting. Most of my changes to support sorting go in the SimpleDataProvider. I’ve added an implementation of the sortItemsBy method which is specified by the DataProvider "interface" but did not implement the sortItems method as that uses a comparator function (which means needing all of the data on the client). I then assume that the data provider will always be retrieving sorted data, so I changed it to always call the getSortedElements method instead of getElements. The SDP then maintains the sortField and sortAscending boolean to pass to the server whenever a new page is needed. When the sort changes we first check to see if all the data is available in which case the sort can occur on the client. However, that rarely happens so we essentially need to start over retrieving data from the server. Therefore all we need to do is set sortField and sortAscending variables, then clear the list and let the DataGrid simply ask for the correct data. You can see this implementation in Example 4.

Maintaining the current selection

If you played with the example you may have noticed that when you change the sorting options the DataGrid maintains its vertical scroll position but does not maintain the selection. This is the default DataGrid behavior and is fine for small amounts of data but might be frustrating if you’re looking to use the sort to group similar items near each other starting from your selected item. Example 5 is one attempt at addressing this issue (and I’m not guaranteeing that it’s the best). Another method is added to SortedValueListService called getSortedPosition which given the same search criteria as getSortedElements will tell you the 0-based position of the element you pass in. This requires another stored procedure in the database for MS SQL Server but would be similarly straightforward on another database.

Integrating this change into the Flex code is a little more complicated as I didn’t want to corrupt the DataProvider itself with the concept of a selected item, as that is dependent on the view into that data. So what I decided to do is control the sorting operations myself instead of letting the DataGrid manage it. Now when the header is pressed I keep track of what the intended sort is, and then go ahead and get the index of the selectedItem and find out what its sorted position would be (this is all before actually modifying the DataProvider). Finding out the new sorted position requires a server call so it’s possible that you’d want to use the busy cursor or some other mechanism to indicate that things are happening, but I didn’t in the example. Once the new sorted position is known for the selected item we can go ahead and sort the list. Next we scroll the DataGrid to where the selectedItem should show up, and finally once that page has loaded we select the actual item. The reason we need to wait for the page to load before setting selectedIndex is that the DataGrid will clear the selection if it changes the underlying data. If you put a trace in the handleLoad method you’ll see that a few pages are loaded after a sort even though they aren’t the pages that should be visible. I think I’d probably need to change the DataGrid class to avoid this so we’ll ignore that for now despite its minor performance impact. If we were going to generalize this I think we’d want to subclass the DataGrid so that it knows it’s working with the pageable data provider along and give it a reference to an object that can provide certain information like the sorted position of an element.

Conclusion

So that’s the basic approach for sorting large data sets. If anyone has had experience that might improve upon these by all means let me know. In the next article I’ll write some things we can do to improve the perceived performance when dealing with large data, and then I think we’ll try to move on to other topics.