Rules and Tools for Accessing & Remixing the Deep Web

Michael Bazeley wrote today in the print Merc News and on SiliconValley.com about a startup scouring the deep web. They are also doing some HousingMaps-style remixing to display job listings geographically.
Deep linking and remixing are clearly a major trend. One reason the Web has so rapidly evolved to become the world’s dominant communication backbone is that its compositional architecture enables these kinds of unintended combinations. And the separation of information feeds from presentation, which encourages novel applications, is a primary hallmark of Web 2.0.
But when remixing apps start using robots that simulate human users accessing content databases, I believe there are some ethical and perhaps legal issues that deserve more attention. Unlike an Amazon RSS feed, which is published for others to consume, job site A was not necessarily expecting to enable job site B to access its postings, and its business model may or may not support such access. Bloggers can get pretty bent out of shape at RSS “feed stealing”. And some Ivy League applicants got in serious trouble for what seemed to be pretty much a manual version of what the Merc News is glorifying as “Mining the Deep Web”, but which was (IMO mis-)portrayed as “hacking”.
I believe we need to think through the legal and ethical rules for accessing and remixing the deep web, and to provide infrastructure support that adequately enforces these rules in the Web 2.0 platform. Otherwise, service providers may find themselves with an incentive to stick with Web 1.0 solutions where presentation and data remain tangled up. And surely the right answer is not just more captchas.
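One piece of infrastructure that already exists for stating access rules is the Robots Exclusion Protocol (robots.txt), which a well-behaved robot consults before fetching anything. It is purely advisory and enforces nothing, which is part of the gap described above. As a minimal sketch of what honoring it looks like, here is a check using Python's standard urllib.robotparser. The site jobs.example.com, the bot name remix-bot, and the rules themselves are all hypothetical, made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that a job site might publish: ordinary clients
# may read the public feeds but not the search results, and one named
# remixing bot is asked to stay out entirely.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Allow: /feeds/

User-agent: remix-bot
Disallow: /
"""


def build_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(policy: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the published rules allow user_agent to fetch url."""
    return policy.can_fetch(user_agent, url)


policy = build_policy(ROBOTS_TXT)

feed_url = "https://jobs.example.com/feeds/latest.xml"
search_url = "https://jobs.example.com/search?q=python"

for agent in ("some-reader", "remix-bot"):
    for url in (feed_url, search_url):
        verdict = "allowed" if may_fetch(policy, agent, url) else "disallowed"
        print(f"{agent} -> {url}: {verdict}")
```

Under these made-up rules a generic client may read the feed but not scrape the search results, and the named bot may do neither. Nothing stops a robot from ignoring the file or lying about its user agent, so this expresses the site's wishes rather than enforcing them.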
A related issue is the personal-level remixing that’s going on via tools like Greasemonkey. As a user who wants control, I applaud Greasemonkey. And while these scripts may be somewhat parasitical, parasites can also accelerate evolution (some ancestor’s bacterial infection became my mitochondria). But as an app developer, it is not a good thing if my internal JavaScript APIs become effectively frozen because some popular third-party script uses them, or worse yet, get misused for a purpose counter to my business goals.