Large Data Sets (Part 1 of Many)

This is the first of what will most likely be many postings related to data management and Flex. Our opening topic is a brief discussion on large data sets and how to efficiently transfer them from your application server to the client. I’m going to assume you are familiar with MXML and the components and services that are a part of the Flex runtime.

This post will refer to many examples that are found here. Instructions for installing the samples can be found in a README included in the zip.

The Problem

Our problem is simple. We need to display a lot of data, and the user does not want to wait too long for it to appear. The more data a user requests, the longer it takes for the data to display. What we need to do is find a way to download data in pieces so that the user can view a smaller amount of data immediately and then only wait a brief amount of time as they request additional data. To keep the problem domain simple we are only going to discuss improving the actual retrieval of data. We’ll leave discussions on sorting, modification, filtering, etc. for another time.

Example 1 can help you visualize this problem. The data is census information that I got from The UCI KDD Archive and reduced to 20,000 records. Start by downloading a few records, perhaps 10. This doesn’t take long at all. If we increase that number to 100 the response is usually a little slower (these numbers can vary depending on your system setup, for example whether the database and application servers are on the same machine in addition to where you run the Flex app). Increase the number to 1000 and you’ll see a bigger jump. Finally 5000 and the time is becoming painful. Painful enough that an end-user might give up.


The Solution: Paging

Smarter people than I solved this problem a long time ago by displaying data in pages. A page allows the user to see a discrete number of entries, and then when ready move on to the next batch. A typical example would be a search engine. Go to your favorite search engine and notice that whenever you search for something you only have a small number of results (10 or maybe 20) before you reach the bottom and can move to the next page. Even if a search returned 3000 results I’ve personally found it unlikely to ever look at anything past the third page. So if that search engine had returned all 3000 results at once to my browser, I’d be waiting forever for it to come down over the wire and render on screen, when really I only cared about those first 30.

The Server Component

Before getting into writing paging support in Flex, let’s talk about how you retrieve pages on the server. I’ve written some very simple classes to retrieve my census data. The first is an interface (just to show that I’m thinking generically), the ValueListService. As you might have guessed, this is a very simple adaptation of the ValueListHandler described at the Sun Core J2EE Patterns site. The original ValueListHandler is treated as an Iterator which means that it must maintain state about its current position. Since my Flex application is going to do that for me, all I need from the ValueListService is the number of elements, and the ability to retrieve an arbitrary number of elements starting from a certain position. Therefore my ValueListService looks like this:

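    import java.util.List;

    public interface ValueListService
    {
        // total number of elements available on the server
        public int getNumElements();

        // up to count elements, starting at position begin
        public List getElements(int begin, int count);
    }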

I’ve then implemented this interface using a CensusService which is capable of using a CensusDAO to retrieve the relevant data, which is populated into CensusEntryVO objects. The CensusDAO is currently only tested on MS SQLServer but the SQL is pretty standard. It’s also not very robust given that this is sample code; all exceptions are simply caught and printed. So now when the Flex client wishes to view census data it will ask the CensusService for a List of CensusEntryVOs.
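To make the contract concrete, here is a minimal sketch of an implementation backed by an in-memory list rather than a database. It is not the CensusService from the samples: the class name InMemoryValueListService is hypothetical, the type parameter on the interface is added here for type safety, and clamping a request that runs past the end of the data is an assumption about how such a service might reasonably behave.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Same two methods as the ValueListService in this post; the type
// parameter T is an addition made for this sketch.
interface ValueListService<T> {
    int getNumElements();
    List<T> getElements(int begin, int count);
}

// Hypothetical in-memory stand-in; a real service would delegate to a DAO.
final class InMemoryValueListService<T> implements ValueListService<T> {
    private final List<T> rows;

    InMemoryValueListService(List<T> rows) {
        this.rows = new ArrayList<>(rows); // defensive copy
    }

    @Override
    public int getNumElements() {
        return rows.size();
    }

    @Override
    public List<T> getElements(int begin, int count) {
        if (begin < 0 || count < 0) {
            throw new IllegalArgumentException("begin and count must be non-negative");
        }
        if (begin >= rows.size()) {
            return Collections.emptyList(); // nothing left past the end
        }
        // Assumption: a request that overruns the data returns a short page.
        int end = (int) Math.min((long) begin + count, rows.size());
        return new ArrayList<>(rows.subList(begin, end));
    }
}
```

With 250 rows loaded, getElements(200, 100) returns the last 50 rows, so the client never has to know the exact size of the final page in advance.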

The RemoteObject Tag

We’re going to use the RemoteObject tag to access the CensusService. Flex offers three data services: HTTPService, WebService, and RemoteObject. HTTPService is useful for getting information from plain XML files or perhaps servlets that serve XML. WebService does what its name suggests: it calls web services, which return data over the wire as XML. RemoteObject allows you to access Java objects on your application server and by default uses a binary format called AMF for efficient data transfer. Since this "article" is about transferring large amounts of data, RemoteObject is a good choice of service because of its efficient transfer mechanism. However, the concepts discussed can easily be adapted to the other services.

Explicit Paging

Our first solution follows the same idea as a search engine: return data one page at a time and let the user request the next page. I’m going to call this technique explicit paging. The explicit paging solution is ideal for when a user requests a number of records but may only care about a specific subset rather than the entire result set. Searching is a primary example; an address book could be another. Pages do not have to be sized according to the number of records; they can meet other criteria, like last names beginning with C.

Example 2 lets you see how this works. Assuming you ran Example 1 you got a sense of how many records could be downloaded before the wait became unbearable. On my particular setup about 1000 records was the max I wanted to put myself through, but I think in reality that number would be closer to 100 for the app to appear speedy. There is a slider on the app that allows you to change the number of records in a page. From there you can see that the page selector is created to allow access to whichever page the user desires. The page selector uses a Repeater to create a number of Links which are then used to select the pages.
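The arithmetic a page selector needs is small: how many links to create for a given page size, and which begin and count values to pass to getElements for the page the user picked. The sketch below is written in Java for illustration and is not taken from Example 2; the class and method names (PageMath, pageCount, pageBegin, pageLength) are hypothetical, and pages are assumed to be numbered from zero.

```java
// Hypothetical helper for explicit paging; pages are zero-based.
final class PageMath {
    private PageMath() {}

    // Number of pages (and selector links) needed for totalRecords.
    static int pageCount(int totalRecords, int pageSize) {
        if (totalRecords < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("need totalRecords >= 0 and pageSize > 0");
        }
        // Ceiling division, done in long to avoid int overflow.
        return (int) (((long) totalRecords + pageSize - 1) / pageSize);
    }

    // Position of the first record on the given page (the "begin" argument).
    static int pageBegin(int page, int pageSize) {
        if (page < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("need page >= 0 and pageSize > 0");
        }
        return Math.multiplyExact(page, pageSize);
    }

    // Number of records on the given page; the last page may be short,
    // and a page past the end has zero records.
    static int pageLength(int page, int totalRecords, int pageSize) {
        int begin = pageBegin(page, pageSize);
        return Math.max(0, Math.min(pageSize, totalRecords - begin));
    }
}
```

For the 20,000 census records and a page size of 100, pageCount gives 200 links, and selecting the fourth link (page 3) translates to getElements(300, 100).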

This solution is pretty straightforward. You’ll see that most of my code is actually related to the page selector itself, not the data. This means that someone with better UI skills can write a whizbang page selector that can be re-used and all you’ll need to do is configure it. Flex is new, but I have no doubt that we’ll see robust components sprouting up soon for these kinds of purposes.

So an explicit paging solution is useful when a user wants to see pieces of data. However we might need another solution if it’s important for the user to see all the data at once.

Implicit Paging

Our second approach is a technique I’ll call implicit paging. Here we don’t want the end-user thinking about viewing finite pieces of data; instead, it should appear that all of the data is available. Rather than downloading all of the data at once, we’ll download a small piece and then as the user comes to data we haven’t downloaded we’ll go ahead and get it. If you use a mail reader and store your mail on a server this concept will be familiar to you. View one of your mailbox folders and begin to scroll down. As you scroll there may be a little pause as the reader gathers more data for you to look at.

This problem might sound hard; how do you configure the DataGrid control to only get small chunks of data at a time? It’s actually pretty simple! The DataGrid takes a dataProvider that must conform to an "interface." Interface is in quotes because unfortunately we don’t actually use an interface due to the need to mix in functionality to classes whose signature we can’t modify. So all we need to do is implement the DataProvider interface using a class that knows how to retrieve data in pages instead of all at once.

Example 3 shows my simple implementation. The SimplePagingDataProvider (SPDP) is given a reference to a RemoteObject that will communicate with a ValueListService. Whenever the DataGrid asks for data (using the getItemAt method) the SPDP will check to see if the page for that item is loaded. If so the data is returned; if not the SPDP will ask the RemoteObject to download the data and in the meantime return a dummy value. This is pretty straightforward:

    public function getItemAt(index : Number)
    {
        var item = data[index];
        if (item == null)
        {
            item = miss(index);
        }
        return item;
    }

    private function miss(index : Number) : Object
    {
        var page : Number = Math.floor(index / pageSize);

        //if the page was already loaded then the value actually is null
        if (pagesLoaded[page] == true) return null;

        //it's possible that the page is already being loaded
        if (pagesPending[page] == true)
        {
            //this miss event is useful for just monitoring what's going on
            dispatchEvent({type: "miss", index: index, alreadyPending: true});
            return "loading";
        }

        //if the page is not loaded call for it
        var call = dataService.getElements(page * pageSize, pageSize, this);
        call.page = page;
        //we want to keep track of how long it takes to load
        call.startTime = getTimer();
        pagesPending[page] = true;
        dispatchEvent({type: "miss", index: index, alreadyPending: false});
        return "loading";
    }

The DataGrid will show blank rows while the data is loading in the background but it will continue to function (i.e., no hanging). When the data is downloaded the rows will be filled in. In the example I used a page size of 100 because on my setup there were only brief periods where the DataGrid appeared not filled in. Note that this solution works both when a user slowly moves down the list (perhaps using the PgDn key) and when dragging the ScrollBar.
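The bookkeeping behind this behavior (loaded pages, pending pages, and a placeholder for rows that have not arrived) can also be sketched outside of Flex. The Java class below is an illustrative analog, not the SimplePagingDataProvider itself: the names PagedCache, onPageLoaded, and requestedPages are hypothetical, and the asynchronous RemoteObject call is replaced by a list that simply records which pages were asked for.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical Java analog of the paging data provider's miss handling.
final class PagedCache<T> {
    private final int pageSize;
    private final T placeholder;                       // shown while a page loads
    private final Map<Integer, T> data = new HashMap<>();
    private final Set<Integer> pagesLoaded = new HashSet<>();
    private final Set<Integer> pagesPending = new HashSet<>();
    private final List<Integer> requests = new ArrayList<>();

    PagedCache(int pageSize, T placeholder) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        this.pageSize = pageSize;
        this.placeholder = placeholder;
    }

    // Returns the item if its page is loaded, otherwise the placeholder.
    T getItemAt(int index) {
        if (index < 0) {
            throw new IndexOutOfBoundsException("index must be non-negative");
        }
        int page = index / pageSize;
        if (pagesLoaded.contains(page)) {
            return data.get(index); // null if the loaded page had no such row
        }
        // Only the first miss on a page triggers a request; Set.add
        // returns false when the page is already pending.
        if (pagesPending.add(page)) {
            requests.add(page); // stand-in for the asynchronous service call
        }
        return placeholder;
    }

    // Called when the rows for a page arrive from the server.
    void onPageLoaded(int page, List<T> items) {
        for (int i = 0; i < items.size(); i++) {
            data.put(page * pageSize + i, items.get(i));
        }
        pagesPending.remove(page);
        pagesLoaded.add(page);
    }

    // Pages requested so far, in request order.
    List<Integer> requestedPages() {
        return new ArrayList<>(requests);
    }
}
```

The point of the pending set is that scrolling across a hundred unloaded rows produces one request for the page, not a hundred; every other miss on that page just returns the placeholder until onPageLoaded fills the rows in.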

Conclusion

This is just the first step in addressing the problem of dealing with large datasets in your Flex RIA. I’ve introduced the concept of paging (though I doubt it’s new to you) and showed two different techniques for integrating a paging solution into your application. The explicit paging solution is useful when there is a lot of data to be shown but it is unlikely that the user wants to view all of it at once. The implicit paging solution acknowledges that a user wants to see a lot of data, but we bring it across the wire incrementally so that performance is acceptable. In future posts I’ll try to talk about expanding these solutions to improve performance (both real and perceived), take into account dynamic data, allow sorting, and more. If you have any thoughts on this, including topics you’d like to see discussed in the future, please drop me a comment.

Some resources that I’ve begun looking at and will come into play more in future entries: