
Tuesday, August 4, 2009

Netezza announces new architecture with 10-15x price-performance improvement

I have previously discussed Netezza, who produce data warehousing appliances that provide outstanding performance and simplicity for complex analytics on very large data volumes. I did some consulting work with them last year as they added spatial capabilities to their system. Today they announced a major new architecture, which they say gives a 3-5x performance improvement for typical workloads (more for some operations, less for others), and reduces price per terabyte by a factor of 3. So overall price performance improves by a factor of 10-15. Database guru Curt Monash has a good discussion of the new architecture and pricing implications on his blog.

The new hardware architecture is more flexible than the old one, making it easier to vary the proportions of processor, memory and disk, which will allow them to provide additional product families in the future:
  • High storage (more disk, lower cost per terabyte, lower throughput)
  • High throughput (higher cost per terabyte but faster)
  • Entry level models
I think that the entry level and high throughput models will be especially interesting for geospatial applications, many of which could do interesting analytics with a Netezza appliance, but may not have the super large data volumes (multiple terabytes) that Netezza's business intelligence customers have. Another interesting change for the future is that Netezza's parallel processing units (now Snippet blades, or S-blades, formerly snippet processing units or SPUs) are now running Linux, whereas previously they were running a rather more obscure operating system called Nucleus. In future, this should make it easier to port existing analytic applications to take advantage of Netezza's highly parallel architecture (though this is not something that is available yet). The parallel processing units also do floating point operations in hardware rather than software, which should also have a significant performance benefit for their spatial capabilities.

I continue to think that Netezza offers some very interesting capabilities for users wanting to do high end geospatial analytic applications on very large data volumes, and that there will be a lot of scope for its use in analyzing historical location data generated by GPS and other location sensors. And I am just impressed by anyone who produces an overnight 10-15x price performance improvement in any product :) !

Wednesday, April 8, 2009

Google App Engine and BigTable - VERY interesting!

Every so often you come across a radically different approach to a certain class of data processing problem which makes you completely rethink what you knew before about how best to develop applications in that space. Systems which I would put in this category over my career include:
  • Smallworld VMDS (early 90s), for its approach to handling long transactions and graphically intensive applications in a database
  • Apama (early 2000s), for its approach to streaming analytics on real time data streams, by indexing queries instead of indexing data
  • Netezza (last year - for me), for its approach to data warehousing using "SQL in hardware" in its smart disk readers, together with extreme parallel processing
I would now add to that list Google's BigTable for incredibly scalable request-oriented web applications (update: Barry Hunter pointed out in the comments that the App Engine datastore and BigTable are not the same thing - datastore is built on top of the lower level BigTable, and adds extra capabilities. I haven't updated the whole post but in most cases where I say BigTable, it should say datastore. Thanks Barry!). I know I'm a bit behind the times on this - Google has been using it internally for years, and it was first made available for external use with the release of Google's App Engine last year. I had read about App Engine but hadn't got around to looking at it in any detail until last weekend, when for some reason I read a few more detailed articles, downloaded the App Engine development environment and ran through the tutorial and played around with it a bit.

There are too many interesting things to talk about in this regard for one post, so I'll spread them over several. And I should add the caveat that everything I say here is based on a few hours of poking around App Engine and BigTable, so it is quite possible I have missed or misunderstood certain things - if anyone with more experience in the environment has thoughts I would be very interested to hear them.

In general I was very impressed with App Engine - in less than an hour I was able to run through the getting started tutorial, which included getting a local development environment set up, going through 5 or 6 iterations of a simple web application, including user authentication, database setup, etc, and deploying several iterations of the application online. It takes care of a huge amount for you, including the ability to automatically scale to zillions of users. We could throw away large portions of our code for whereyougonnabe if we moved to App Engine, something which I am now seriously considering.

But for the rest of this post I'd like to talk about BigTable, the "database" behind App Engine. Google stresses that it isn't a traditional database - this paper, from several years ago, describes it as a "distributed storage system". It can handle petabytes of data spread across thousands of servers and is used by many Google applications, including search and Google Earth. So clearly BigTable is enormously scalable.

However, it also has some limitations on the types of queries it allows, which at first glance for someone used to a traditional relational database management system seem incredibly restrictive. Some of these restrictions seem to have good technical reasons and some seem a bit arbitrary. For example:
  • A query cannot return more than 1000 rows
  • You cannot use an inequality operator (<, <=, >=, >, !=) on more than one property (aka "field") in a query - so you can do
    SELECT * FROM Person WHERE birth_year >= :min
    AND birth_year <= :max
    but not
    SELECT * FROM Person WHERE birth_year >= :min_year
    AND height >= :min_height
  • If a query has both a filter with an inequality comparison and one or more sort orders, the query must include a sort order for the property used in the inequality, and the sort order must appear before sort orders on other properties.
  • And more - see the full list (and the short code sketch of these restrictions just below).
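
To make these restrictions concrete, here is a minimal sketch using the App Engine Python datastore API (google.appengine.ext.db); the Person model and property names are purely illustrative, not from a real application:

    from google.appengine.ext import db

    class Person(db.Model):
        birth_year = db.IntegerProperty()
        height = db.FloatProperty()

    # Allowed: both inequality filters are on the same property.
    people = (Person.all()
              .filter('birth_year >=', 1970)
              .filter('birth_year <=', 1980)
              .fetch(100))

    # Not allowed: inequality filters on two different properties
    # (birth_year and height) - the datastore rejects this query.
    # bad = (Person.all()
    #        .filter('birth_year >=', 1970)
    #        .filter('height >=', 1.8)
    #        .fetch(100))

    # If a query mixes an inequality filter with sort orders, the property
    # used in the inequality must be the first sort order.
    ordered = (Person.all()
               .filter('birth_year >=', 1970)
               .order('birth_year')
               .order('height')
               .fetch(100))
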
While these sorts of constraints impose some challenges, the positive side is that as far as I can see, you can't write an inefficient query using BigTable (if anyone has a counterexample to this statement - based as I said on a couple of hours' exposure to BigTable - please let me know!). It changes your whole approach to the problem. A traditional relational DBMS makes it very easy to ask whatever question you want (generally speaking), but you may then need quite a lot of work in terms of indexing, tuning, even data model redesign, to make the answer to that question come back quickly. It's easy to do sloppy data model and query design. With BigTable you may need to think more up front about how to fit your problem into the constraints it imposes, but if you can, then you are guaranteed (I think!) that it will run quickly and scale.

There's an interesting example of this type of redesign in this post, which shows how you can redesign a query on a date range, where the obvious approach is to have two fields storing start_date and end_date, and run a query which includes an inequality operator against both fields - something which BigTable does not allow. The interesting solution given here is to use one field containing a list of (two) dates, which BigTable does allow (and most traditional DBMSs don't). This is a real world example of a query which is pretty inefficient if you do it in the obvious way in a traditional database (I have seen performance issues for this type of query in the development of whereyougonnabe) - BigTable forces you to structure the data in a different way which ends up being far more efficient.
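
To make that concrete, here is a hedged sketch of one variant of the pattern described in that post - finding records whose date range contains a given date - using the App Engine Python API. The model and property names are illustrative, and it assumes the multi-valued property query semantics described in the linked post:

    import datetime
    from google.appengine.ext import db

    class Booking(db.Model):
        # holds [start_date, end_date] as a two-element list
        period = db.ListProperty(datetime.datetime)

    def bookings_active_on(when):
        # Both inequality filters are on the same property ('period'),
        # which the datastore allows; an entity matches when some list
        # value is <= 'when' (its start date) and some list value is
        # >= 'when' (its end date).
        return (Booking.all()
                .filter('period <=', when)
                .filter('period >=', when)
                .fetch(100))
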

I am still in two minds about the restriction of not allowing inequality operators on more than one field. This clearly guarantees that the query can run quickly, but restricts you from answering certain questions. Most database management systems would use the approach of having a "primary filter" and a "secondary filter" for a compound query like this - the system uses the primary filter to efficiently retrieve candidate records from the database which satisfy the first condition, and then you test each of those against the second condition to decide whether to return them. This is very common in spatial queries, where you return candidate records based on a bounding box search which can be done very efficiently, and then you compare candidate records returned against a more precise polygon to decide if they should be included in the result set. But this also adds complexity - it is non-trivial to decide which one of multiple clauses to use as the primary filter (this is what a query optimizer does), and it is quite possible that you end up scanning large portions of a table, which seems to be one of the things that BigTable wants to avoid.
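
To illustrate the primary/secondary filter pattern outside of App Engine, here is a simple Python sketch: a cheap bounding-box test selects candidate points, and an exact point-in-polygon test then prunes them. The data structures and function names are just for illustration:

    def bbox_contains(bbox, pt):
        (minx, miny, maxx, maxy) = bbox
        (x, y) = pt
        return minx <= x <= maxx and miny <= y <= maxy

    def point_in_polygon(pt, polygon):
        # Standard ray-casting test; polygon is a list of (x, y) vertices.
        (x, y) = pt
        inside = False
        j = len(polygon) - 1
        for i in range(len(polygon)):
            (xi, yi) = polygon[i]
            (xj, yj) = polygon[j]
            if ((yi > y) != (yj > y)) and \
               (x < (xj - xi) * (y - yi) / (yj - yi) + xi):
                inside = not inside
            j = i
        return inside

    def query_polygon(points, polygon, bbox):
        # Primary filter: cheap bounding-box comparison over all points
        # (in a real system this would be an indexed search).
        candidates = [p for p in points if bbox_contains(bbox, p)]
        # Secondary filter: exact geometric test against the polygon.
        return [p for p in candidates if point_in_polygon(p, polygon)]
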

Nonetheless, technically it would be easy for Google to implement a secondary filter capability, so I can only assume it is a conscious decision to omit this, to force you to design your data structures and queries in a way which only scan a small portion of a table (as the restriction of 1000 records returned does) - so ensuring the scalability of the application. I would be curious as to whether some of these restrictions, like the 1000 record limit, apply to internal Google applications also, or just to the public site where App Engine runs (in order to stop applications consuming too many resources). When App Engine first came out it was free with quotas, but Google now has a system for charging based on system usage (CPU, bandwidth, etc) once you go above certain limits - so it will be interesting to see if they lift some of these restrictions at some point or not.

But in general it's an interesting philosophical approach to impose certain artificial restrictions to make you design things in a certain way (in this case, for efficiency and scalability). Twitter imposing a message length of 140 characters is limiting in certain ways, but imposes a certain communication style which is key to how it works. The Ignite and Pecha-Kucha presentation formats impose restrictive and artificial constraints on how you present a topic (20 slides which auto-advance after 15 or 20 seconds respectively), but they force you to really think about how to present your subject matter concisely.

With BigTable I think there is an interesting mix of constraints which have clear technical reasons (they can't be easily overcome) and those which don't (they could be easily overcome - like secondary inequality filters and the 1000 record limit). Whether there is really a conscious philosophy here, or the approach is just about avoiding overloading resources on the public site (or a mix of both), either way I am intrigued by this idea of having a system where seemingly any query you can write is "guaranteed" to run fast and be extremely scalable (not formally guaranteed by Google, but it seems to me that this should be the case).

Of course one key question for me and for readers of this blog is how well does BigTable handle geospatial data - especially since a standard bounding box query involves inequality operators on multiple fields, which is not allowed. BigTable does support a simple "GeoPt" data type, but doesn't support spatial queries out of the box. I have seen some examples using a geohash (which is claimed on Wikipedia to be a recent invention, which as a referencing scheme may be true, but as an indexing scheme it is just a simple form of the good old quadtree index which has been in use since the 1980s - see the end of this old paper of mine). The examples I have seen so far using a geohash are simple and just "approximate" - they will give good results in some cases but incorrect results in others. I have several ideas for using simple quadtree or grid indexes which I will experiment with, but I'll save the discussion on spatial data in BigTable for another post in the not too distant future.
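
As a hint of the kind of simple quadtree / grid approach I have in mind, here is an illustrative sketch (the names are mine, not from any library): encode a latitude/longitude as a string of quadrant digits, so that a prefix range - which the datastore can express as two inequality filters on a single property - retrieves everything in a cell:

    def quadtree_key(lat, lon, depth=12):
        """Return a string cell key; longer keys mean smaller cells."""
        min_lat, max_lat = -90.0, 90.0
        min_lon, max_lon = -180.0, 180.0
        key = []
        for _ in range(depth):
            mid_lat = (min_lat + max_lat) / 2
            mid_lon = (min_lon + max_lon) / 2
            quadrant = 0
            if lat >= mid_lat:
                quadrant += 2
                min_lat = mid_lat
            else:
                max_lat = mid_lat
            if lon >= mid_lon:
                quadrant += 1
                min_lon = mid_lon
            else:
                max_lon = mid_lon
            key.append(str(quadrant))
        return ''.join(key)

    # A prefix query on such a key needs only two inequality filters on
    # one property, e.g. cell >= prefix and cell < prefix + '4', with the
    # upper bound computed in Python (quadrant digits are 0-3, so
    # appending '4' gives a string greater than anything in the cell).
    print(quadtree_key(39.74, -104.99))  # a point in Denver
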

Thursday, February 12, 2009

Webinar next week on data warehouse appliances for Location Intelligence

I have posted previously about Netezza, who make data warehouse appliances, which can perform certain types of complex spatial analysis from 10x to 100x faster than traditional systems - I did some consulting for them last year. On Thursday next week I am speaking in a free webinar hosted by Directions Magazine and sponsored by Netezza, on the topic of data warehouse appliances for Location Intelligence. My talk will include the following topics:
  • One enterprise DBMS?
  • Data warehousing concepts
  • Data warehouse appliances
  • New possibilities for geospatial applications
On the subject of "one enterprise DBMS" I get to re-use the following slide, which I first used a (slightly different) version of back in 1993 or so when talking about Smallworld's VMDS database ... it's good to have a little material that can last that long in these times of rapid change :) !
One size fits all
The other main speaker will be Shajy Mathai from reinsurance company Guy Carpenter, who have been doing some very interesting things with Netezza - Shajy gave an excellent presentation at the Netezza User Conference and I look forward to hearing what he has to say.

If you're interested you can get more information and sign up here.

Friday, September 19, 2008

Analysis at the speed of thought, and other interesting ideas

As I have posted previously, I spent last week out at the Netezza User Conference, where they announced their new Netezza Spatial product for very high performance spatial analytics on large data volumes. I thought it was an excellent event, and I continue to be very impressed with Netezza's products, people and ideas. I thought I would discuss a couple of general ideas that I found interesting from the opening presentation by CEO Jit Saxena.

The first was that if you can provide information at "the speed of thought", or the speed of a click, this enables people to do interesting things, and work in a different and much more productive way. Google Search is an example - you can ask a question, and you get an answer immediately. The answer may or may not be what you were looking for, but if it isn't you can ask a different question. And if you do get a useful answer, it may trigger you to ask additional questions to gain further insight on the question you are investigating. Netezza sees information at the speed of thought as a goal for complex analytics, which can lead us to get greater insights from data - more than you would if you spent the same amount of time working on a system which was say 20 times slower (spread over 20 times as much elapsed time), as you lose the continuity of thought. This seems pretty plausible to me.

A second idea is that when you are looking for insights from business data, the most valuable data is "on the edges" - one or two standard deviations away from the mean. This leads to another Netezza philosophy which is that you should have all of your data available and online, all of the time. This is in contrast to the approach which is often taken when you have very large data volumes, where you may work on aggregated data, and/or not keep a lot of historical data, to keep performance at reasonable levels (historical data may be archived offline). In this case of course you may lose the details of the most interesting / valuable data.

This got me to thinking about some of the places where you might apply some of those principles in the geospatial world. The following examples are somewhat speculative, but they are intended to get people thinking about the type of things we might do if we can do analysis 100x faster than we can now on very large data volumes, and follow the principle of looking for data "on the edges".

One area is in optimizing inspection, maintenance and management of assets for any organization managing infrastructure, like a utility, telecom or cable company, or local government. This type of infrastructure typically has a long life cycle. What if you stored, say, the last 10 or 20 years of data on when equipment failed and was replaced, when it was inspected and maintained, and so on? Add in information on load/usage if you have it, detailed weather information (for exposed equipment), soil type (for underground equipment), etc, and you would have a pretty interesting (and large) dataset to analyze for patterns, which you could apply to how you do work in the future. People have been talking about doing more sophisticated pre-emptive / preventive maintenance in utilities for a long time, but I don't know of anyone doing very large scale analysis in this space. I suspect there are a lot of applications in different areas where interesting insights could be obtained by analyzing large historical datasets.

This leads into another thought, which is that of analyzing GPS tracks. As GPS and other types of data tracking (like RFID) become more pervasive, we will have access to huge volumes of data which could provide valuable insights but are challenging to analyze. Many organizations now have GPS in their vehicles for operational purposes, but in most cases do not keep much historical data online, and may well store relatively infrequent location samples, depending on the application (for a long distance trucking company, samples every 5, 15 or even 60 minutes would provide data that had some interest). But there are many questions that you couldn't answer with a coarse sampling but could with a denser sampling of data (like every second or two). Suppose I wanted to see how much time my fleet of vehicles spent waiting to turn left compared to how long they spend waiting to turn right, to see if I could save a significant amount of time for a local delivery service by calculating routes that had more right turns in them (assuming I am in a country which drives on the right)? I have no idea if this would be the case or not, but it would be an interesting question to ask, which could be supported by a dense GPS track but not by a sparse one. Or I might want to look at how fuel consumption is affected by how quickly vehicles accelerate (and model the trade-off in potential cost savings versus potential time lost) - again this is something that in theory I could look at with a dense dataset but not a sparse one. Again, this is a somewhat speculative / hypothetical example, but I think it is interesting to contemplate new types of questions we could ask with the sort of processing power that Netezza can provide - and think about situations where we may be throwing away (or at least archiving offline) data that could be useful. In general I think that analyzing large spatio-temporal datasets is going to become a much more common requirement in the near future.

I should probably mention a couple of more concrete examples too. I have talked to several companies doing site selection with sophisticated models that take a day or two to run. Often they only have a few days to decide whether (and how much) to bid for a site, so they may only be able to run one or two analyses before having to decide. Being able to run tens or hundreds of analyses in the same time would let them vary their assumptions and test the sensitivity of the model to changes, and analyze details which are specific to that site - going back to the "speed of thought" idea, they may be able to ask more insightful questions if they can do multiple analyses in quick succession.

Finally, for now, another application that we have had interest in is analyzing the pattern of dropped cell phone calls. There are millions of calls placed every day, and this is an application where there is both interest in doing near real time analysis, as well as more extended historical analysis. As with the hurricane analysis application discussed previously, the Netezza system is well suited to analysis on rapidly changing data, as it can be loaded extremely quickly, in part because of the lack of indexes in Netezza - maintaining indexes adds a lot of overhead to data loading in traditional system architectures.

Wednesday, September 17, 2008

Interview with Rich Zimmerman about Netezza Spatial

A new development for the geothought blog, our first video interview! It's not going to win any awards for cinematography or production, but hopefully it may be somewhat interesting for the geospatial database geeks out there :). Rich Zimmerman of IISi is the lead developer of the recently announced spatial extensions to Netezza, and I chatted to him about some technical aspects of the work he's done. Topics include the geospatial standards followed in the development, why he chose not to use PostGIS source code directly, and how queries work in Netezza's highly parallelized architecture.

Interview with Rich Zimmerman about Netezza Spatial from Peter Batty on Vimeo.

Tuesday, September 16, 2008

Netezza Spatial

I have alluded previously to some interesting developments going on in very high performance spatial analytics, and today the official announcement went out about Netezza Spatial (after being pre-announced via Adena at All Points Blog and James Fee).

For me, the most impressive aspect of today at the Netezza User Conference was the presentation from Shajy Mathai of Guy Carpenter, the first customer for Netezza Spatial, who talked about how they have improved the performance of their exposure management application, which analyzes insurance risk due to an incoming hurricane. They have reduced the time taken to do an analysis of the risk on over 4 million insured properties from 45 minutes using Oracle Spatial to an astonishing 5 seconds using Netezza (that’s over 500x improvement!). Their current application won the Oracle Spatial Excellence “Innovator Award” in 2006. About half of the 45 minutes is taken up loading the latest detailed weather forecast/risk polygons and other related data, and the other half doing point in polygon calculations for the insured properties. In Netezza the data updates just run continuously in the background as they are so fast, and the point in polygon analysis takes about 5 seconds. For insurance companies with billions of dollars of insured properties at risk, this time difference to get updated information is hugely valuable. The performance improvement you will see over traditional database systems will vary depending on the data and the types of analysis being performed - in general we anticipate performance improvements will typically be in the range of 10x to 100x.

Netezza is a company I have been very impressed with (and in the interests of full disclosure, I am currently doing some consulting work for them and have been for several months). They have taken a radically different approach to complex database applications in the business intelligence space, developing a “database appliance” – a combination of specialized hardware and their own database software, which delivers performance for complex queries on large (multi-terabyte) databases which is typically 10 to 100 times faster than traditional relational database architectures like Oracle or SQL Server. There are two primary means by which they achieve this level of performance. One is by highly parallelizing the processing of queries – a small Netezza configuration has about 50 parallel processing units, each one a powerful computer in its own right, and a large one has around 1000 parallel units (known as Snippet Processing Units or SPUs). Effectively parallelizing queries is a complex software problem – it’s not just a case of throwing lots of hardware at the issue. The second key element is their smart disk readers, which use technology called Field Programmable Gate Arrays (FPGAs), which essentially implement major elements of SQL in hardware, so that basic filtering (eliminating unwanted rows) and projection (eliminating unwanted fields) of data all happens in the disk reader, so unnecessary data is never even read from disk, which eliminates a huge bottleneck in doing complex ad hoc queries in traditional systems.

Apart from outstanding performance, the other key benefit of Netezza is significantly simpler design and administration than with traditional complex database applications. Much of this is due to the fact that Netezza has no indexes, and design of indexes and other ongoing performance tuning operations usually take a lot of time for complex analytic applications in a traditional environment.

Netezza’s technology has been validated by their dramatic success in the database market, which in my experience is quite conservative and resistant to change. This year they expect revenues of about $180m, growth of over 40% over last year’s $127m. About a year ago, Larry Ellison of Oracle said in a press conference that Oracle would have something to compete with Netezza within a year. This is notable because it’s unusual for them to mention specific competitors, and even more unusual to admit that they basically can’t compete with them today and won’t for a year. Given the complexity of what Netezza has done, and the difficulty of developing specialized hardware as well as software, I am skeptical about others catching them any time soon.

So anyway (to get back to the spatial details), the exciting news for people trying to do complex large scale spatial analytics is that Netezza has now announced support for spatial data types and operators – specifically vector data types: points, lines and areas. They support the OGC standard SQL for Simple Features, as well as commonly used functions not included in the standard (the functionality is similar to PostGIS). This enables dramatic performance improvements for complex applications, and in many cases lets us answer questions that we couldn’t even contemplate asking before. We have seen strong interest already from several markets, including insurance, retail, telecom, online advertising, crime analysis and intelligence, and Federal government. I suspect that many of the early users will be existing Netezza customers, or other business intelligence (BI) users, who want to add a location element to their existing BI applications. But I also anticipate some users with existing complex spatial applications and large data volumes, for whom Netezza can deliver these substantial performance improvements for analytics, while simplifying administration and tuning requirements.
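
As a purely illustrative sketch (not taken from Netezza documentation) of the kind of query this enables, here is roughly what a point-in-polygon analysis might look like, written with OGC Simple Features function names as they appear in PostGIS; the exact Netezza function names, table names and connection details are my assumptions:

    import pyodbc

    # Assumed ODBC data source name for the appliance.
    conn = pyodbc.connect('DSN=netezza_dw')
    cursor = conn.cursor()

    # Count insured properties falling inside each current risk polygon.
    cursor.execute("""
        SELECT r.polygon_id, COUNT(*) AS properties_at_risk
        FROM insured_property p
        JOIN risk_polygon r
          ON ST_Within(p.location, r.geom)
        GROUP BY r.polygon_id
    """)
    for polygon_id, count in cursor.fetchall():
        print(polygon_id, count)
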

One important thing to note is that Netezza is specifically not focused on "operational" geospatial applications. The architecture is designed to work effectively for mass queries and analysis - if you are just trying to access a single record or small set of records with a pre-defined query, then a traditional database architecture is the right solution. So in cases where the application focus is not exclusively on complex analytics, Netezza is likely to be an add-on to existing operational systems, not a replacement. This is typical in most organizations doing business intelligence applications, where data is consolidated from multiple operational systems into a corporate data warehouse for analytics (whether spatial or non-spatial).

Aside from the new spatial capabilities, the Netezza conference has been extremely interesting in general, and I will post again in the near future with more general comments on some of the interesting themes that I have heard here, including "providing information at the speed of thought"!

Having worked with interesting innovations in spatial database technologies for many years, from IBM's early efforts on storing spatial data in DB2 in the mid to late eighties, to Smallworld's innovations with long transactions, graphical performance and extreme scalability in terms of concurrent update users in the early nineties, and Ubisense's very high performance real time precision tracking system more recently, it's exciting to see another radical step forward for the industry, this time in terms of what is possible in the area of complex spatial analytics.

Friday, September 12, 2008

If you could do geospatial analysis 50 to 100 times faster ... (revisited)

A little while back I posted on the topic of what compelling new things would you do if you could do geospatial analysis 50-100 times faster than you can today, on very large data volumes. This generated quite a bit of interesting discussion both on my blog and over at James Fee's. This project will be coming out of stealth mode with an announcement next week - if you are a friend of mine on whereyougonnabe you should be able to figure out where the technology is coming from (you can also get the answer if you watch this 3 minute video carefully)!

One interesting thing you might do is analyze the projected impact of hurricane Ike in a much more comprehensive and timely fashion than you can do with current technologies, and we'll have a case study about that next week. I'll be blogging about all this in much more detail next week.

Thursday, February 28, 2008

If you could do geospatial analysis 50 to 100 times faster …

… than you can today, what compelling new things would this enable you to do? And yes, I mean 50 to 100 times faster, not 50 to 100 percent faster! I’m looking for challenging geospatial analytical problems that would deliver a high business value if you could do this, involving many gigabytes or terabytes of data. If you have a complex analysis that takes a week to run, but you only need to run it once a year for regulatory purposes, there is no compelling business value to being able to run it in an hour or two. But if you are a retail chain and you need to run some complex analysis to decide whether you want to buy an available site for a new store within the next three days, it makes a huge difference whether you can just run one analysis which takes two days, or dozens of analyses which take 30 minutes each, allowing you to try a range of assumptions and different models. Or if you’re a utility or emergency response agency running models to decide where to deploy resources as a hurricane is approaching your territory, being able to run analyses in minutes rather than hours could make a huge difference to being able to adjust your plans to changing conditions. There may be highly valuable analyses that you don’t even consider running today as they would take months to run, but which would have very high value if you could run them in a day.

If you have problems in this category I would be really interested to hear about them, either in the comments here, or feel free to email me if you prefer.

Update: I wanted to say that this is not just a hypothetical question, but I can't talk about any details yet. See the comments for more discussion.

Monday, September 17, 2007

GIS in the Rockies panel on Service Oriented Architecture (SOA)

As mentioned previously, last week I was on the "all star panel" at GIS in the Rockies, with a distinguished cast of characters. Joe Berry published his response, so I thought I would do the same. The following is based on the notes I made in advance, with a few edits to reflect changes that happened on the fly.

The six panelists were asked to spend five minutes each answering the following question: Do you agree that service oriented architecture (SOA) is the key to enterprise data integration and interoperability? If so, how do you see geospatial technology evolving to support the concept of location services and data interoperability? If not, what is the alternative?

My short answer to the question was that SOA is one of various elements which can play a role in implementing integrated enterprise systems, but it is not THE key, just one of several aspects of solving the problem. The preceding panelists had given quite varied responses about different aspects of interoperability and SOA, which was perhaps indicative of the fact that SOA is a bit of a fuzzy notion with no universally agreed upon definition. The one constant in the responses was that the 5 minute time limit was not well adhered to :) !!

So in my response, I first tried to clarify what SOA was from my point of view (based on various inputs), and then talked about other aspects of enterprise integration and interoperability.

SOA is really nothing radically new, just an evolution of notions of distributed computing and encapsulating functionality that have been around for a long time. In fact, a reader survey of Network Computing magazine ranked SOA as the most-despised tech buzzword of 2007! SOA is an approach to creating distributed computing systems, combining functionality from multiple different systems into something which appears to be one coherent application to a user.

CORBA and DCOM were both earlier attempts at distributed computing. The so-called "EAI" (Enterprise Application Integration) vendors like Tibco (which existed from 1985 as part of Teknekron, and since 1997 independently) and Vitria (since 1994) pioneered many of the important ideas for a long time before SOA became the buzzword du jour. More recently the big IT vendors have been getting involved in the space, like Oracle with their Fusion middleware and Microsoft with BizTalk.

In general, most commentators seem to regard SOA as having two distinguishing characteristics from other distributed computing architectures:
  • Services are fairly large chunks of functionality - not very granular
  • Typically there is a common “orchestration” layer which lets you specify how services connect to each other (often without requiring programming)
SOA is most often implemented using web services to talk to various applications, but this does not have to be the case. One thing that has helped adoption though is the widespread usage of web services, which owes much to the simplicity of XML. Back in the early days of OGC there were three separate standards which didn’t talk to each other - COM, CORBA and SQL - and OGC gained a lot of momentum at the end of the 90s when it switched its focus to an XML-based approach.

Anyway, SOA certainly is a logical approach for various aspects of data integration with geospatial as with other kinds of data.

Now let's talk about some other issues apart from SOA which are important in enterprise integration and interoperability ...

Integration of front end user functionality
SOA is really focused on data integration (as the question said), but this is only a part of enterprise integration. Integration or embedding of functionality is important also, especially when you look at geospatial functionality - map display with common operations like panning, zooming, query, etc is a good example. There are various approaches to doing this, especially using browser-based applications, but still you need some consistency in client architecture approach across the organization, including integration with many legacy applications.

Standards
One key for really getting full benefits from SOA is standards. One of the key selling points is that you can replace components of your enterprise system without having to rewrite many custom interfaces - for example, you buy a new work management system and just rewrite one connector to the SOA, rather than 20 point to point interfaces. But this argument is even more compelling if there are standard interfaces which are implemented by multiple vendors in a given application space, so that custom integration work is minimized.

On the geospatial side, we have OGC standards and more recently geoRSS and KML, and there are efforts going on to harmonize these. The more “lightweight” standards have some strong momentum through the growth in non-traditional applications.

But in many ways the bigger question for enterprise integration is not about geospatial standards, it’s about standards for integrating major business systems, for example in a utility you have CIS (Customer Information System), work management, asset management, outage management, etc.

There have been various efforts to standardize these types of workflows in electric utilities - including MultiSpeak for small utilities (co-ops), and the IEC TC57 Working Group 14 for larger utilities. MultiSpeak is probably further along in terms of adoption.

Database level integration
Database level integration is somewhat over-rated in certain respects, especially for doing updates - with any sort of application involving complex business logic, typically you need to do updates via some sort of API or service. But having data in a common database is useful in certain regards, especially for reporting. This could be reporting against operational data, or in a data warehouse / data mart.

Scalability
The SOA approach is largely (but not entirely) focused on a three-tier architecture with a thin client. Still, for certain applications, especially heavy duty data creation and editing, a thick client approach is preferable for performance and scalability.

Separation of geospatial data into a distinct “GIS” or “GIS database”
Enterprise systems involving geospatial data have typically had additional barriers to overcome because of technological limitations which required the geospatial data to be stored in its own separate "GIS database". Increasingly though these artificial barriers shouldn’t be necessary. All the major database management systems now support geospatial data types, there are increasingly easy and lightweight mechanisms for publishing that data (geoRSS and KML) ... so why shouldn’t location of customers be stored in the CIS, and location of assets be stored in the asset database, rather than implementing complex ways of linking objects in these systems back to locations stored in the “GIS” database? This is not just a technology question; we also need a change in mentality here - that spatial is not special, just another kind of data, a theme I have talked about before (and will talk about again!).

Summary
SOA is a good thing and helps with many aspects of enterprise integration, but it’s not a silver bullet - many other things are important in this area, including the following:
  • Standards are key, covering whole business processes, not just geospatial aspects
  • As geospatial data increasingly becomes just another data type, this will be a significant help in removing technological integration barriers which existed before - this really changes how we think about integration issues
  • Integration of front end user functionality is important for geospatial, as well as data / back end service integration (the latter being where SOA is focused)
  • There are still certain applications for which a “thick client” makes sense

Thursday, June 21, 2007

Microsoft SQL Server Spatial update

I attended the Microsoft SQL Server SIG at the ESRI User Conference to hear a short update from Ed Katibah, who is leading the development of the SQL Server spatial capabilities. I thought he said a few interesting things of note, which I hadn't heard elsewhere. One was that they are doing nightly performance tests against all their major competitors, including Oracle, Postgres, Informix and DB2, and, while he wasn't allowed to be specific, he was "very pleased" with the results. He showed an example of a polygon with 600,000 vertices and 7,000 holes, and said that it could be intersected with an offset version of itself in 15 seconds on a 1GHz machine. From the way Ed described the spatial indexing approach, I suspect that it will probably do a pretty good job on this problem of Paul Ramsey's, though I'm sure it would take more than that for Paul to consider a Microsoft solution :) ! He mentioned that Microsoft was rejoining OGC - they had left because of legal concerns relating to how OGC handled certain IP issues, but these had now been resolved. He said that the target for release was middle of next year, and that there was a very strong focus within Microsoft on meeting that deadline. Both Ed and various ESRI people mentioned that Microsoft and ESRI had been working together to ensure that ESRI would support the SQL Server Spatial capabilities.

Tuesday, June 12, 2007

Interesting points from ESRI customer survey

ESRI has published a lengthy pre-conference Q&A document on the user conference blog, which several people have commented on. One answer talked about results from their customer survey, and I thought this highlighted some interesting industry trends.

They said that 45% of customers have asked for tight integration with Google Earth and nearly 47% have asked for support for interoperability - so overall, 92% of ESRI customers are looking for integration with Google Earth (assuming that these two response categories were mutually exclusive, which seems to be the case from the context). For Virtual Earth the numbers were a little lower, 26% and 43%, so 69% in total. This is just reconfirmation of the trend we are all aware of that "serious" GIS users are interested in using Google and Microsoft as a means to distribute their data - but it's interesting to see hard numbers, and 92% is a resounding endorsement for Google. It's also interesting that the vote for Google is quite a bit higher than Microsoft. I think that in the consumer world and the blogosphere, Google has pretty clearly had a higher geospatial profile, but among "corporate" GIS users I have talked to, many have had a bit more of a leaning towards Microsoft, if only because their organizations tend to be doing business with Microsoft already. This survey goes against the subjective impression I had formed on that particular point (admittedly from a small sample size).

The other point that was interesting was that 80% of customers want ESRI to support or tightly integrate their technology with the upcoming Microsoft SQL Server spatial extension - this is a very high number, especially given that Oracle probably still has around 50% of the database market share (48.6% in 2005, according to Gartner). These two numbers don't directly correspond in that the ESRI number is based on number of customers, so is likely to more strongly reflect the interests of smaller organizations (assuming that there are a large number of small organizations responding), whereas the Gartner number is based on revenue so probably more influenced by large organizations. But nevertheless, a very strong statement about the level of interest in Microsoft SQL Server Spatial.

There is a separate statement that less than 19% of customers have asked for tight integration with Oracle Spatial - but unfortunately no comment on what percentage want "support" for Oracle Spatial (which is currently provided via what was ArcSDE, now part of ArcGIS Server), so no direct information on relative levels of interest in Oracle versus SQL Server. I have been thinking for a little while that Oracle Spatial is at an interesting juncture in terms of its position in the market, but I'll save my thoughts on that for a future post :) !

Thursday, June 7, 2007

Some ancient history - GIS database article from 1990!

In both my presentations at the last GITA conference, I included a few slides talking about my personal perspective on the history of geospatial technology moving into the mainstream, and Geoff Zeiss was kind enough to comment on this and say that he found it interesting. One of the main points I made was what a long time it has taken, and how we have been talking about moving to the mainstream for 20 years or so.

This prompted me to think that it might be interesting to elaborate on a few of these historical themes, in addition to looking at new developments. I managed to dig out of the (paper) archives (that makes me feel old!) the first significant article which I had published, "Exploiting Relational Database Technology in GIS", which first appeared in Mapping Awareness magazine in the UK in 1990, and a couple of slightly edited versions came out elsewhere over the next year or so. This reflected the work we were doing at IBM with our GFIS product at that time, using IBM's SQL/DS and DB2, and at the same time the Canadian company GeoVision was taking a similar approach using Oracle. Doug Seaborn of GeoVision presented a paper with some of the same themes at the 1992 AM/FM (now GITA) conference with the bold title "1995: the year that GIS disappeared" (note: when making visionary predictions, be wary about attaching dates to them :) !!). He had the right idea, saying that GIS would become absorbed into mainstream IT, he was just a decade or so ahead in terms of timing. As I remarked at GITA, we techies always tend to think that change will happen faster than it actually does. (As an aside, Doug has been out of the geospatial industry for a long time, but I got an email from him the other day saying that he was now back, and working for ESRI).

Anyway, back in those days these were fairly radical ideas - at the time, most systems used file-based (and tiled) systems for (primarily vector) graphics, and a separate database for alphanumeric data. It's interesting how things often go in cycles with many aspects of technology - we spent lots of effort getting to continuous databases and eliminating tiling, which has big advantages for editing linear and areal data, and now the big emphasis is on file-based tiled systems again for easy, fast and scalable distribution of data (but still with continuous database-oriented systems in the background to create and maintain the data).

I think that while this mainstream database approach was a good philosophy, Smallworld came out with its proprietary database in 1991, which had huge advantages over anything else available at the time - you could dynamically pan and zoom around very large continuous databases without having to extract data into small working datasets first, while more or less all the other approaches at that time (certainly those using the type of database approach described in my article) required you to do data extracts which typically took minutes rather than seconds. This delayed the uptake of the standard relational approach, since it just couldn't match the performance you could get with other approaches. Now 15+ years later, we have gone through 10 iterations of Moore's Law (performance doubling every 18 months), so computers are 1000 times faster and that goes a long way to overcoming those issues!

Quite by accident, I happened to be at the GIS 95 conference in Vancouver, when Oracle announced its new "Multidimension" product, which would later become Oracle Spatial. I met Edric Keighan there (now with CubeWerx), who led that development, and he told me that the development team had a copy of my article posted on their noticeboard as it articulated well what they were trying to achieve.

So anyway, after that rather meandering introduction, here is the article.

Friday, May 18, 2007

SQL Server spatial

There have been various comments floating around about Microsoft's announcement of spatial support for SQL Server, ranging from great excitement to extreme skepticism. I have known about this for a while under NDA, and I think it is an important step for the industry. Oracle Spatial has been out for 12 years, so Microsoft has quite a bit of catching up to do, but many of the recent Oracle developments have been focused beyond pure "spatial database" functionality, and more on functionality that was previously the domain of the GIS vendors, such as spatial analysis and mapping - so I think that Microsoft should be able to provide core database functionality relatively quickly. It will be good for the industry for Oracle to have some more serious competition within the commercial database world, in regard to spatial. It really reinforces the trend that everyone has been talking about of geospatial data becoming mainstream, now that the top two commercial database vendors will both support spatial. The fact that Microsoft has an offering both in the space of the new disruptive online mapping offerings with extensive packaged data (like Google) and in the database (like Oracle, which does not have an offering in the former space), opens up some interesting possibilities for them to offer solutions leveraging both aspects. I think that geospatial technology has much higher visibility at high levels in Microsoft than at Oracle (driven by Virtual Earth) - I talked more about this in an article at Geoplace.

So Microsoft will need to follow through on the announcement, and has plenty to do to catch up with Oracle and the various open source spatial database offerings, but it's another important step in terms of geospatial really being a major area for Microsoft.