Large Data Sets (Part 1 of Many)

This is the first of what will most likely be many postings related to data management and Flex. Our opening topic is a brief discussion on large data sets and how to efficiently transfer them from your application server to the client. I’m going to assume you are familiar with MXML and the components and services that are a part of the Flex runtime.

This post will refer to many examples that are found here. Instructions for installing the samples can be found in a README included in the zip.

The Problem

Our problem is simple. We need to display a lot of data, and the user does not want to wait too long for it to appear. The more data a user requests, the longer it takes for the data to display. What we need to do is find a way to download data in pieces so that the user can view a smaller amount of data immediately and then only wait a brief amount of time as they request additional data. To keep the problem domain simple we are only going to discuss improving the actual retrieval of data. We’ll leave discussions on sorting, modification, filtering, etc. for another time.

Example 1 can help you visualize this problem. The data is census information that I got from The UCI KDD Archive and reduced to 20,000 records. Start by downloading a few records, perhaps 10. This doesn’t take long at all. If we increase that number to 100 the response is usually a little slower (these numbers can vary depending on your system setup, for example whether the database and application servers are on the same machine in addition to where you run the Flex app). Increase the number to 1000 and you’ll see a bigger jump. Finally 5000 and the time is becoming painful. Painful enough that an end-user might give up.


The Solution: Paging

Smarter people than I solved this problem a long time ago by displaying data in pages. A page allows the user to see a discrete number of entries, and then when ready move on to the next batch. A typical example would be a search engine. Go to your favorite search engine and notice that whenever you search for something you only have a small number of results (10 or maybe 20) before you reach the bottom and can move to the next page. Even if a search returned 3000 results I’ve personally found it unlikely to ever look at anything past the third page. So if that search engine had returned all 3000 results at once to my browser, I’d be waiting forever for it to come down over the wire and render on screen, when really I only cared about those first 30.

The Server Component

Before getting into writing paging support in Flex, let’s talk about how you retrieve pages on the server. I’ve written some very simple classes to retrieve my census data. The first is an interface (just to show that I’m thinking generically), the ValueListService. As you might have guessed, this is a very simple adaptation of the ValueListHandler described at the Sun Core J2EE Patterns site. The original ValueListHandler is treated as an Iterator which means that it must maintain state about its current position. Since my Flex application is going to do that for me, all I need from the ValueListService is the number of elements, and the ability to retrieve an arbitrary number of elements starting from a certain position. Therefore my ValueListService looks like this:

    public interface ValueListService
    {
        public int getNumElements();
        public List getElements(int begin, int count);
    }

I’ve then implemented this interface using a CensusService which is capable of using a CensusDAO to retrieve the relevant data, which is populated into CensusEntryVO objects. The CensusDAO is currently only tested on MS SQLServer but the SQL is pretty standard. It’s also not very robust given that this is sample code; all exceptions are simply caught and printed. So now when the Flex client wishes to view census data it will ask the CensusService for a List of CensusEntryVOs.
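To make the contract concrete, here is a minimal sketch of a ValueListService implementation. It is not the CensusService from the samples: InMemoryValueListService and ValueListDemo are hypothetical names, the rows live in a plain in-memory list rather than behind a CensusDAO, and I’ve added a type parameter to the interface purely for this sketch. The one design decision worth noticing is that a request running past the end of the data returns a short (or empty) page instead of throwing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// The interface from the post; the type parameter is an addition for this sketch.
interface ValueListService<T> {
    int getNumElements();
    List<T> getElements(int begin, int count);
}

// Hypothetical stand-in for CensusService: serves pages from a list held in
// memory instead of querying a database through a DAO.
class InMemoryValueListService<T> implements ValueListService<T> {
    private final List<T> rows;

    InMemoryValueListService(List<T> rows) {
        this.rows = new ArrayList<>(rows);
    }

    @Override
    public int getNumElements() {
        return rows.size();
    }

    @Override
    public List<T> getElements(int begin, int count) {
        // Out-of-range requests yield an empty page rather than an exception.
        if (begin < 0 || count <= 0 || begin >= rows.size()) {
            return Collections.emptyList();
        }
        // Clamp the end so the last page may be shorter than 'count'.
        int end = (int) Math.min((long) begin + count, rows.size());
        return new ArrayList<>(rows.subList(begin, end));
    }
}

class ValueListDemo {
    public static void main(String[] args) {
        ValueListService<String> service =
                new InMemoryValueListService<>(Arrays.asList("a", "b", "c", "d", "e"));
        System.out.println(service.getNumElements());   // 5
        System.out.println(service.getElements(0, 2));  // [a, b]
        System.out.println(service.getElements(3, 10)); // [d, e]
    }
}
```

Because the client tracks its own position, the service stays stateless: every call carries the begin index and the count it needs.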

The RemoteObject Tag

We’re going to use the RemoteObject tag to access the CensusService. Flex offers three data services: HTTPService, WebService, and RemoteObject. HTTPService is useful for getting information from plain XML files or perhaps servlets that serve XML. WebService is obvious: it is used to call web services, which return data over the wire as XML. RemoteObject allows you to access Java objects on your application server and by default uses a binary format called AMF for efficient data transfer. Since this "article" is about transferring large amounts of data, RemoteObject is a good choice of service because of its efficient transfer mechanism. However, the concepts discussed can easily be adapted to the other services.

Explicit Paging

Our first solution follows the same idea as a search engine. Return data one page at a time and let the user request the next page. I’m going to call this technique explicit paging. The explicit paging solution is ideal for when a user requests a number of records but may only care about a specific subset rather than the entire result set. Searching is a primary example; an address book could be another. Pages do not have to be sized according to the number of records; they can meet other search criteria like last names beginning with C.

Example 2 lets you see how this works. Assuming you ran Example 1 you got a sense of how many records could be downloaded before the wait became unbearable. On my particular setup about 1000 records was the max I wanted to put myself through, but I think in reality that number would be closer to 100 for the app to appear speedy. There is a slider on the app that allows you to change the number of records in a page. From there you can see that the page selector is created to allow access to whichever page the user desires. The page selector uses a Repeater to create a number of Links which are then used to select the pages.

This solution is pretty straightforward. You’ll see that most of my code is actually related to the page selector itself, not the data. This means that someone with better UI skills can write a whizbang page selector that can be re-used and all you’ll need to do is configure it. Flex is new, but I have no doubt that we’ll see robust components sprouting up soon for these kinds of purposes.
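The arithmetic behind a page selector is small enough to sketch. This is not the code from Example 2; PageMath and its method names are hypothetical, and it is written in Java only to keep the samples in one language. It shows how the total record count and the slider’s page size determine how many links to draw, and which begin and count values to pass to getElements for a given page.

```java
// Hypothetical helper: the bookkeeping a page selector needs.
class PageMath {
    // Number of pages needed to show totalRecords at pageSize records per page.
    static int pageCount(int totalRecords, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        // Integer ceiling division: a partial last page still counts as a page.
        return (totalRecords + pageSize - 1) / pageSize;
    }

    // Index of the first record on a zero-based page (the 'begin' argument).
    static int firstIndex(int page, int pageSize) {
        return page * pageSize;
    }

    // Number of records actually on the page; the last page may be short.
    static int recordsOnPage(int page, int totalRecords, int pageSize) {
        int remaining = totalRecords - firstIndex(page, pageSize);
        return Math.max(0, Math.min(pageSize, remaining));
    }

    public static void main(String[] args) {
        int total = 20000; // the size of the census sample in this post
        System.out.println(pageCount(total, 100));         // 200
        System.out.println(pageCount(total, 300));         // 67
        System.out.println(firstIndex(66, 300));           // 19800
        System.out.println(recordsOnPage(66, total, 300)); // 200
    }
}
```

When the user moves the slider, only pageSize changes, so the selector can be rebuilt from pageCount without touching the server until a page link is clicked.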

So an explicit paging solution is useful when a user wants to see pieces of data. However we might need another solution if it’s important for the user to see all the data at once.

Implicit Paging

Our second approach is a technique I’ll call implicit paging. Here we don’t want the end-user thinking about viewing finite pieces of data; instead it should appear that all of the data is available. Rather than downloading all of the data at once, we’ll download a small piece and then as the user comes to data we haven’t downloaded we’ll go ahead and get it. If you use a mailreader and store your mail on a server this concept will be familiar to you. View one of your mailbox folders and begin to scroll down. As you scroll there may be a little pause as the reader gathers more data for you to look at.

This problem might sound hard; how do you configure the DataGrid control to only get small chunks of data at a time? It’s actually pretty simple! The DataGrid takes a dataProvider that must conform to an "interface." Interface is in quotes because unfortunately we don’t actually use an interface due to the need to mix in functionality to classes whose signature we can’t modify. So all we need to do is implement the DataProvider interface using a class that knows how to retrieve data in pages instead of all at once.

Example 3 shows my simple implementation. The SimplePagingDataProvider (SPDP) is given a reference to a RemoteObject that will communicate with a ValueListService. Whenever the DataGrid asks for data (using the getItemAt method) the SPDP will check to see if the page for that item is loaded. If so the data is returned; if not the SPDP will ask the RemoteObject to download the data and in the meantime return a dummy value. This is pretty straightforward:

    public function getItemAt(index : Number)
    {
        var item = data[index];
        if (item == null)
        {
            item = miss(index);
        }
        return item;
    }

    private function miss(index : Number) : Object
    {
        var page : Number = Math.floor(index / pageSize);
        //if the page was already loaded then the value actually is null
        if (pagesLoaded[page] == true) return null;
        //it's possible that the page is already being loaded
        if (pagesPending[page] == true)
        {
            //this miss event is useful for just monitoring what's going on
            dispatchEvent({type: "miss", index: index, alreadyPending: true});
            return "loading";
        }
        //if the page is not loaded call for it
        var call = dataService.getElements(page * pageSize, pageSize, this);
        call.page = page;
        //we want to keep track of how long it takes to load
        call.startTime = getTimer();
        pagesPending[page] = true;
        dispatchEvent({type: "miss", index: index, alreadyPending: false});
        return "loading";
    }

The DataGrid will show blank rows while the data is loading in the background but it will continue to function (i.e., no hanging). When the data is downloaded the rows will be filled in. In the example I used a page size of 100 because on my setup there were only brief periods where the DataGrid appeared not filled in. Note that this solution works both when a user slowly moves down the list (perhaps using the PgDn key) and when dragging the ScrollBar.
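The code above only shows the miss side. To illustrate the other half (what happens when a page arrives), here is a rough sketch of the same bookkeeping in Java. It is not the SPDP source: PagingCache, PageFetcher, and onPageLoaded are hypothetical names, and PageFetcher stands in for the asynchronous RemoteObject call, which I assume reports its result back later through a callback.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the asynchronous call to ValueListService.getElements.
interface PageFetcher {
    void requestPage(int begin, int count);
}

// Hypothetical cache mirroring the getItemAt/miss logic, plus result handling.
class PagingCache {
    static final String LOADING = "loading";

    private final Object[] data;
    private final int pageSize;
    private final PageFetcher fetcher;
    private final Set<Integer> pagesLoaded = new HashSet<>();
    private final Set<Integer> pagesPending = new HashSet<>();

    PagingCache(int numElements, int pageSize, PageFetcher fetcher) {
        this.data = new Object[numElements];
        this.pageSize = pageSize;
        this.fetcher = fetcher;
    }

    Object getItemAt(int index) {
        Object item = data[index];
        if (item != null) {
            return item;
        }
        int page = index / pageSize;
        // If the page was already loaded, the value really is null.
        if (pagesLoaded.contains(page)) {
            return null;
        }
        // add() returns true only for the first miss, so each page is requested once.
        if (pagesPending.add(page)) {
            fetcher.requestPage(page * pageSize, pageSize);
        }
        return LOADING;
    }

    // Called when a requested page comes back: fill in the rows and update state.
    void onPageLoaded(int begin, List<?> rows) {
        for (int i = 0; i < rows.size() && begin + i < data.length; i++) {
            data[begin + i] = rows.get(i);
        }
        int page = begin / pageSize;
        pagesPending.remove(page);
        pagesLoaded.add(page);
    }
}

class PagingCacheDemo {
    public static void main(String[] args) {
        List<int[]> requests = new ArrayList<>();
        PagingCache cache = new PagingCache(250, 100,
                (begin, count) -> requests.add(new int[] {begin, count}));

        System.out.println(cache.getItemAt(120)); // loading
        System.out.println(cache.getItemAt(130)); // loading (same page, no new request)
        System.out.println(requests.size());      // 1

        // Simulate the server answering the request for rows 100..199.
        List<String> rows = new ArrayList<>();
        for (int i = 100; i < 200; i++) {
            rows.add("row" + i);
        }
        cache.onPageLoaded(100, rows);
        System.out.println(cache.getItemAt(120)); // row120
    }
}
```

The pending set is what keeps a fast scroll from flooding the server: many rows on the same page can miss before the response arrives, but only the first miss triggers a request.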

Conclusion

This is just the first step in addressing the problem of dealing with large datasets in your Flex RIA. I’ve introduced the concept of paging (though I doubt it’s new to you) and showed two different techniques for integrating a paging solution into your application. The explicit paging solution is useful when there is a lot of data to be shown but it is unlikely that the user wants to view all of it at once. The implicit paging solution acknowledges that a user wants to see a lot of data, but we bring it across the wire incrementally so that performance is acceptable. In future posts I’ll try to talk about expanding these solutions to improve performance (both real and perceived), take into account dynamic data, allow sorting, and more. If you have any thoughts on this, including topics you’d like to see discussed in the future, please drop me a comment.

Some resources that I’ve begun looking at and will come into play more in future entries:

34 Responses to Large Data Sets (Part 1 of Many)

  1. Waldo Smeets says:

    Hi Matt, great article. Just a few comments for other people that try running this application:
    - I had to remove the empty from the flex config file to make the samples work. I first just created a new node for census.* and that did not work (the server returned an exception which was hard to debug)
    - For example 2 to run, you should copy Slider.swc from the flexstore sample into the folder where you are running this demo

  2. Robin Ward says:

    What about sorting?

    I recently spent about two weeks here at the office coding our own customizable Grid component using Flash MX that works in a similar fashion to the one you demonstrate.

    It is bound to a REST-based URL that returns an XML feed of data in chunks. As you are viewing, there is a progress bar on the bottom. You can also adjust things like the sizes of the columns, and since all fields are editable, dynamic drop-downs can appear.

    Here’s an image of it:
    http://trinja.cryptek.org/robin/images/SDGrid.jpg

    The flash component is below the HTML form that FireFox has rendered there. Entering new search terms and clicking submit sends a JavaScript message to the Grid to filter on specific fields. Additionally, you can click on the headings of columns to have them sort ascendingly or descendingly.

    The biggest problem that I faced was figuring out in what application layer to do the sorting. At first, I thought that it could be done in the Flash Player, on the client side (as the Flex online demos suggest.) It worked great for result sets of 100-500 records, but when I threw thousands at it, it would hang up the machine or throw the MX Development Environment into scripting errors.

    In the end, the only solution I could come up with was to dump the current results from memory, and contact the server again for new results (again, showing the progress bar, so that the user can edit the data they’ve been given before waiting for the rest of it to finish downloading.)

    Do you think this is the best approach? How about filtering, such as with the form above the grid? Is the best way to dump what you have in memory and ask for more from the application server?

    Fortunately, we have all the results in memory, so we’re not going to the database with each new request. The first request to the application server will load the entire results into a write-through EJB, and subsequent requests will have it filter on that. This is surprisingly fast; even result sets of 20,000+ rows can be processed by a web application server in under a second.

    -Robin Ward
    Senior Web Developer, SalesDriver

  3. Matt says:

    Hi Robin,
    This is definitely something that I hope to address in a future article. As I said in the problem description I want to leave sorting, filtering, and other similar topics out for now to keep the problem domain smaller (and because it takes a long time to come up with general solutions).
    Matt

  4. Robin says:

    Ooops, sorry, missed that line :$
    Looking forward to that series.
    -RW

  5. Sami Hoda says:

    Any idea how to start Flex as a Windows service instead of manually each time?

  6. Matt says:

    That’s the kind of question for a support forum or flexcoders.

  7. Hans Omli says:

    service threw an exception during method invocation: null
    ideas?

  8. Matt says:

    Probably there is a problem in flex-config.xml, maybe you should check your logs? As Waldo pointed out when filling in a RemoteObject whitelist you cannot have any empty tags, they have to be filled in. That’s my thought for now.

  9. Eric Wilcox says:

    Thanks for the examples… I am looking forward to more. If you are keeping a list of things to discuss, I would be interested in learning more about the best practices when using RemoteObjects. Maybe answers to questions like, does the Java Class corresponding to the RObj have to be serializable or a bean? What’s the best way to handle return types that aren’t primitive (i.e., other Java classes)? What’s a good example of a facade class?

  10. Hans Omli says:

    Think my problem may be at the database. I’m using SQL Server in this case. I imported the CSV file into a table in an existing database. The table was automatically named “censusincome-small”. Then I created a resource in JRun with DSN “censusdata” pointing to the existing database. The DSN verifies successfully. Any ideas where I could be going wrong?

  11. Matt says:

    The Java expects the table to be named censusincome, not censusincome-small. Sorry, I didn’t test the import too carefully.

  12. Hans Omli says:

    I also had to set the JNDI reference to “jdbc/censusincome” rather than “censusincome” alone. Duh. Anyway, got them working now and am looking forward to your discussions on improving performance and sorting!

  13. wakuwaku says:

    hi
    I want add census.* in flex_config.xml. but I found 6 in this xml. which is correct position?
    thanks

  14. wakuwaku says:

    oh, sorry. is right choice.
    but problem still happend:
    service threw an exception during method invocation: null :(
    your JNDI name is “java:jdbc/censusdata”
    maybe “java:/comp/dev/jdbc/censusdata”?
    thxs

  15. wakuwaku says:

    ha, I solved this problem.
    Hans Omli, you maybe use tomcat 5.
    jndi name should be “java:comp/env/jdbc/censusdata”
    modify & recompile CensusService.java, put it into Census.jar

  16. wakuwaku says:

    java:/comp/env/jdbc/censusdata

  17. Matt says:

    Take out the java:, the jndi name should be jdbc/censusincome I think.

  18. Matt says:

    Sorry, in JRun the jndi-name setting in my jrun-resources.xml is jdbc/censusincome and then as you know the Java src says java:jdbc/censusincome. Not sure what might need to be different for other app servers.

  19. Al says:

    Can anyone tell me how I configure the JNDI file (jrun-resources.xml) so it is aware of the sql database which contains the correctly named table? I am working with the Flex trial version on the supplied JRun.
    Thanks

  20. Matt says:

    Can’t seem to get html working but try copying this. http://www.markme.com/mchotin/files/data-1-jrun.xml

  21. Al Choudhury says:

    First of all thank you for the config file. It did the trick.
    I have a question regarding the retrieval of data from the database. In ‘example 3’ I notice you make a call up to the database each time the user gets to the end of the current chunk of data. Does the flex grid have a way for knowing “first visible row in the grid”, so a call for new data can be made when a pre-defined record becomes visible. Also is there a way to set off a background thread within flex UI (client side) which will call the server method and return the result-set and append it to the visible record-set.
    Your assistance is much appreciated.

  22. Matt says:

    I’m going to be addressing issues like this in my next post hopefully. If you look at the SPDP you’ll see that getItemAt is called by the DataGrid, so it’s relatively straightforward to keep track of the current request and add in a lookahead there.
    As for background threads, all calls to RemoteObjects are asynchronous so the Player will not hang while accessing the data. However you will see a hiccup when the data returns and is loaded into the DataGrid. That’s why you can continue scrolling even when you know the DP has requested more data.
    Matt

  23. Al Choudhury says:

    Our application is typically written to connect up with Oracle, MS SQL or any main stream Enterprise database, thus database vendor independent. Therefore, we are unable to use stored procedures or any other vendor specific Database attributes.
    I believe your example was going down the ‘prime key’. However, most of our queries consist of compound columns. For instance, suppose your query ended with ‘ORDER BY age asc, classofworker desc, id asc’. The record ID in this instance is not relevant, so would Flex be able to cope effectively with this type of query?

  24. Matt says:

    Sure, Flex can cope just fine. You’ll notice in article 2 that for getting the current position of the selected item I passed down the entire object. No reason why you can’t do the same thing for searching purposes, I just like keeping examples simple. As for stored procedure vs normal SQL that’s not a problem either. This is the point of the DataAccessObject pattern that I implemented. You can simply pick the right DAO depending on your database but keep the CensusService class the same. If you use ANSI SQL that works on every database then you’ll only need one DAO, but people often find that writing database-specific code provides better performance.

  25. Great story, thanks. Best regards from Germany.

  28. norville says:

    Service threw an exception during method invocation: census/CensusService (Bad Magic Number) …I got this after i recompiled census.jar.

  29. norville says:

    Ignore that, i am a re-re and i forgot to compile the class files.

  30. jerry says:

    In example 2 (Paging by using slider)
    How should I modify the program if my table does not have a numeric PK but only a PK of string (e.g. email address)?
    What arguments should I pass to the method?
    Original:
    SQL: SELECT * FROM Table WHERE id > ? AND id ? AND Email <= ?
    begin: ??
    count: ??

  32. zabestia says:

    Hi! I can’t download the example zip file. Can you post the valid URL for this?

    • Matt Chotin says:

      Was lost in the blog migration to WordPress, will see if I can recover. Check back in a few days.