SQL Blues: Get Happy with 10 Tips
July 15, 2009
A happy quack to the reader who sent me a link to “10 Tips for Working Smarter with SQL”. I am sufficiently old and addled to remember the joy of crafting by hand complex SQL statements. I even remember the great little tool that was made available by either Illustra, Informix, or another old school database vendor. The Web page would permit one to enter a SQL statement, and then respond with a mark up of that statement. I recall that the little tool worked quite well, then it disappeared, and I had to return to more traditional ways of coaxing Dr. Codd’s invention to spit out what I wanted. If you are working with Oracle Ultra Search (Oracle Text) or Thunderstone search systems, you will need to have some familiarity with SQL or SQL variants.
I downloaded and saved Susan Harkins’ (TechRepublic) article because it contained several quite useful tips. I can’t reproduce the full list. But I want to highlight two of her tips and urge you to visit Builder.au to garner the rest of the insights.
First, she does a very good job of reminding me about the differences between ALL, DISTINCT, and DISTINCTROW. She includes a useful table which I immediately printed and taped in my database notebook. (Yes, I still use paper.)
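For readers without the table to hand, here is a minimal sketch of the ALL versus DISTINCT difference. It uses Python's built-in sqlite3 module and a made-up `orders` table; the table, columns, and rows are my assumptions, not anything from the article. DISTINCTROW is left out because SQLite does not implement it: it is an engine-specific keyword (Microsoft Access applies it to whole source records, and MySQL treats it as a synonym for DISTINCT).

```python
import sqlite3

# Hypothetical in-memory table, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        ("Acme", "Louisville"),
        ("Acme", "Louisville"),  # deliberate duplicate row
        ("Birch", "Cleveland"),
    ],
)

# ALL is the default predicate: every row comes back, duplicates included.
all_rows = conn.execute("SELECT ALL customer, city FROM orders").fetchall()

# DISTINCT collapses rows whose selected columns are identical.
distinct_rows = conn.execute(
    "SELECT DISTINCT customer, city FROM orders"
).fetchall()

print(len(all_rows), "rows with ALL")
print(len(distinct_rows), "rows with DISTINCT")
```

The duplicate Acme row survives ALL and disappears under DISTINCT. Other engines may differ in details, so check your own database's documentation before relying on this.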
Second, she makes short work of the UNION operator. A glitch here can trash tables, forcing addled geese like me to reopen the two tables and rerun the instruction. She wrote:
By default, UNION sorts records by the values in the first column because UNION uses an implicit DISTINCT predicate to omit duplicate records. To include all records, including duplicates, use UNION ALL, which eliminates the implicit sort. If you know there are no duplicate records, but there are a lot of records, you can use UNION ALL to improve performance because the engine will skip the comparison that’s necessary to sort (to find duplicates).
Good work this.
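The UNION point in the quote can be sketched the same way. This is an illustration under assumptions: the two tables (`list_a`, `list_b`) and their rows are invented, and the engine is SQLite via Python's sqlite3 module, not the engine the article was written about. Whether an engine sorts as a side effect of removing duplicates is implementation specific; an explicit ORDER BY is the only portable way to guarantee order.

```python
import sqlite3

# Two hypothetical tables that share one value ('duck').
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE list_a (name TEXT);
    CREATE TABLE list_b (name TEXT);
    INSERT INTO list_a VALUES ('goose'), ('duck');
    INSERT INTO list_b VALUES ('duck'), ('swan');
    """
)

# UNION applies an implicit DISTINCT, so the shared value appears once.
union_rows = conn.execute(
    "SELECT name FROM list_a UNION SELECT name FROM list_b"
).fetchall()

# UNION ALL skips the duplicate check and returns every row from both tables.
union_all_rows = conn.execute(
    "SELECT name FROM list_a UNION ALL SELECT name FROM list_b"
).fetchall()

print(len(union_rows), "rows from UNION")
print(len(union_all_rows), "rows from UNION ALL")
```

On two small tables the cost difference is invisible. The performance argument in the quote only starts to matter when the inputs are large and you already know they do not overlap.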
Stephen Arnold, July 15, 2009
Software Robots Determine Content Quality
July 15, 2009
ZDNet ran an interesting article by Tom Steinert-Threlkeld about software taking over human editorial judgment. “Quality Scores for Web Content: How Numbers Will Create a Beautiful Cycle of Greatness for Us All” is worth tucking into one’s folder for future reference.
Some background. Mr. Steinert-Threlkeld notes that the hook for his story is a fellow named Patrick Keane, who worked at the Google for several years. What’s not included in Mr. Steinert-Threlkeld’s write up is that Google has been working on “quality scores” for many years. You can get references to specific patent and technical documents in my Google monographs. I just wanted to point out that the notion of letting software methods do the work that arbiters of taste have been doing is not a new idea.
The core of the ZDNet story was:
Keane is at work on figuring out what will constitute a Quality Score, for every article, podcast, Webcast or other piece of output generated by an Associated Content contributor. If his 21st Century content production and distribution network can figure out how to put a useful rank on what it puts out on the Web then it can raise it up, notch by notch. This scoring comes right back to the Page Rank process that is at the heart of Google’s success as a search engine. “The great thing about Page Rank in Google’s algorithm is … seeing the Web as a big popularity contest,” said Keane, in Associated Content’s offices on Ninth Avenue in Manhattan.
Mr. Steinert-Threlkeld does a good job of explaining how the method at Mr. Keane’s company (Associated Content) will approach the scoring issue.
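To make the “popularity contest” idea concrete, here is a toy sketch of the textbook PageRank calculation by repeated iteration. It is not Google's production method and has nothing to do with whatever Associated Content builds for its Quality Score; the four-page `toy_web` graph, the function name, and the parameter values are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    Every link target must also appear as a key in `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal scores

    for _ in range(iterations):
        # Every page gets a small baseline score regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its current score among the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its score evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page Web: three pages "vote" for page c.
toy_web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(toy_web)
for page, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Page c collects the most votes and ends up with the highest score, and page a does well because the popular page c links to it. That second effect is the difference between a link-based score and a simple count of inbound links.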
My thoughts, before I forget them, are:
- Digging into what Google has disclosed about its scoring systems and methods is probably a useful exercise for those covering Google and the businesses in which former Googlers find themselves. The key point is that the Google is leaning more heavily on smart software and less on humans. The implication of this decision is that as content flows go up, Google’s costs will rise less quickly than those of outfits such as Associated Content. Costs are the name of the game in my opinion.
- Former Googlers are going to find themselves playing in interesting jungle gyms. The insights about information will create what I call “Cuil situations”; that is, how far from the Googzilla nest will a Xoogler stray? My hunch is that Associated Content may find itself surfing on Google because Associated Content will not have the plumbing that the Google possesses.
- Dependent services, by definition, will be subordinate to the core provider. Xooglers may be capping the uplift of their new employers who will find themselves looking at short term benefits, not the long term implications of certain methods.
I think Associated Content will be an interesting company to watch.
Stephen Arnold, July 15, 2009
UK Book Industry Under Pressure
July 14, 2009
The Independent on July 13, 2009, ran “Two Weeks to Save Britain’s Book Trade.” You can locate a version of the story by poking around on the newspaper’s Web site. The article explains that book stores are closing and publishers are in a world of hurt. Not too surprising in the wake of similar woes in the American publishing sector.
Two aspects of this write up surprised me. First, there were some telling quotes. Let me highlight two:
The first is attributed to Jonny Geller, an executive at the Curtis Brown literary agency. The remark pertains to the advances that publishers pay to authors who can sell books that are likely to be blockbusters:
Publishing has become quite reactive. It is sales led. We need publishers to start taking risks again.
I found this interesting because the Independent ran a special section of the July 13 newspaper made up of old news. Yep, recycled information. I found that revealing.
The second was this statement in a side bar written by Arifa Akbar. She wrote:
The agreement’s collapse did not just pave the way for supermarkets and chain stores to dominate the trade with deeply discounted prices, but it was at this point that books lost their immunity from the changing winds of market forces.
The “agreement” refers to the Net Book Agreement which provided guidelines for how the book industry would be run in order to prevent market forces from operating.
There is a keen insight in her “Comment”; specifically, book publishing works when the market forces are blocked. Remove the force field and the book industry faces a tough financial storm.
Will the British book trade survive? In my opinion, books the “old fashioned way” face a bleak autumn. I noticed a promotion for Dan Brown’s latest blockbuster. Place a prepublication order and save some money. Will one novel power the book trade’s trireme? Possibly, but I think the book industry’s vessel will be safer in a protected harbor tied to a dock. The old ship may have difficulty in the open sea.
Stephen Arnold, July 14, 2009
Semantic Search Revealed
July 14, 2009
I read “Semantic Search round Table at the Semantic Technology Conference” in ZDNet Web logs. Paul Miller, the author of the write up, did a good job, including snippets from the participants in the round table. In order to get a sense of the companies, the topics covered, and the nuances of the session, please, read the original. I want to highlight three points that jumped out at me:
First, I saw that there was a lot of talk about semantics, but I did not come away from the participants’ comments with a sense that a single definition was in play. Two quick examples:
- One participant said, ‘It means different things’. Okay, but once again we have “wizards” talking about search in general and semantic search in particular and I am forced to deal with ambiguity. “Different things” means absolutely zero to me. True, I am an addled goose, but my warning lights started flashing.
- The Googler (artificial intelligence guru Dr. Peter Norvig) put my feathers back in place. He is quoted as saying, ‘Different types of answers are appropriate for different types of questions…’. That’s okay, but I think that definition should have been the operating foundation for the entire session.
Second, the wrap up of the article focused on Bing.com. Now Bing incorporates Powerset, according to what I have read. But Bing.com is a variation on the types of search results that have been available from such companies as Endeca for a while and from newcomers like Kosmix. The point I wanted to have addressed is what specific use is being made of semantics in each of the search and content processing systems represented in the roundtable discussion. Unreasonable? Sure, but facts are better than generalities and faux politesse.
Finally, I did not learn much about search. Nothing unusual in that. Innovation was not what the participants were conveying in their comments.
Bottom line: great write up, disappointing information.
Stephen Arnold, July 14, 2009
Overflight Adds Coveo and Thunderstone
July 14, 2009
If you want to keep up with what’s new at Coveo and Thunderstone, navigate to the ArnoldIT.com Overflight service. In addition to real time updates about the Google, you can now enjoy the same multi-source information “overflight” about Coveo (privately held Canadian company with enterprise and mobile search) and Thunderstone (privately held company in Cleveland, long an innovator in search and retrieval). The goslings and I use the Coveo tools. We had a Thunderstone appliance, but we had to be good geese and return it. Sigh.
Watch for more companies on the Overflight service, which is free to anyone who chooses to visit the service. A commercial version is available which permits integration and merging of internal content along with the Web information shown on this demonstration site.
Stephen Arnold, July 14, 2009
Oracle, Publishing, and XSQL
July 14, 2009
I am a big fan of the MarkLogic technology. A reader told me that I should not be such a fan boy. That’s a fair point, but the reader has not worked on the same engagements I have. As a result, the reader has zero clue about how the MarkLogic technology can resolve some of the fundamental information management, access, and repurposing issues that some organizations face. I am all for making parental type suggestions. I give them to my dog. They don’t work because the dog does not share my context.
The same reader who wanted me to be less supportive of MarkLogic urged me to dig into Oracle’s capabilities in Oracle XSQL, which I know something about because XSQL has been around longer than MarkLogic has.
Now Oracle is a lot like IBM. The company is under pressure because its core business lights up the radar of its licensees’ chief financial officer every time an invoice arrives. Oracle is in the software, consulting, open source, and hardware business. Sure, Oracle may not want to make SPARC chips, but until those units of Sun Micro are dumped, Oracle is a hardware outfit. Like I said, “Like IBM.”
MarkLogic has been growing rapidly. The last time I talked with MarkLogic’s tech team, it was clear to me that the company was thriving. New hires, new clients, and new technologies—these added to the buzz about the company. Then MarkLogic nailed another round of financing to fuel its growth. Positive signs.
Oracle cannot sit on its hands and watch a company that is just up Highway 101 expand into a data management sector right under Oracle’s nose. Enter Oracle XSQL, which is Oracle’s answer to MarkLogic Server.
The first document I examined was “XSQL Pages Publishing Framework” from the Oracle 9i/XML Developer’s Kits Guide. I printed out my copy, but you can locate an online instance on the Oracle West download site. I am not sure if you will have to register. Parts of Oracle recognize me; other parts want me to set up a new account. Go figure. Also, Oracle has published a book about XSQL, and you can learn more about that from eBooksLab.com. You can also snag a Wiley book on the subject: Oracle XSQL: Combining SQL, Oracle Text, XSLT, and Java to Publish Dynamic Web Content (2003). A Google preview is available as well. (I find this possibly ironic because I think Wiley is a MarkLogic licensee but I might be wrong about that.)
Oracle has an Oracle BI Publisher Web log that provides information about the use of XSQL. The most recent post I located was a June 11, 2009, write up but the link pointed to “Crystal Fallout” dated May 22, 2009. Scroll to the bottom of this page because the results are listed in chronological order, with the most recent write up at the bottom of the stack. The first article, dated May 3, 2006, is interesting. “It’s Here: XML Publisher Enterprise Is Released” by Tim Dexter provides a run down of the features of this XSQL product. A download link is provided, but it points to a registration process. I terminated the process because I wasn’t that interested in having an Oracle rep call me.
I found “BI Publisher Enterprise 10.1.3.2. Comes Out of Hiding” interesting as well. The notion that an Oracle product cannot be found underscores another aspect of Oracle’s messaging. From surprising chronological order to hiding a key product, Oracle XSQL seems to be on the sidelines in my opinion.
An August 31, 2007 post “A Brief History of BIP” surprised me. The enterprise publishing project was not a main development effort. It evolved out of frustration with circa 2007 Oracle tools. Mr. Dexter wrote:
Three years later and the tool has come a long way … we still have a long way to go of course. But you’ll find it in EBS, PeopleSoft, JDE, BIEE as a standalone product, integrated with APEX and maybe even bundled with the database one day – its a fun ride, exhausting but fun.
This statement, if accurate, pegs one part of XSQL in 2004. (I apologize that the links point to the long list of postings, but Oracle’s system apparently cannot link to a single Web log post on a separate Web page. Annoying, I know. MarkLogic’s system provides such fine grain control with a mouse click, gentle reader.)
When we hit 2009, posts begin to taper off. A new release, 10.1.3.3.3, was announced in May 2008. The interesting posts described the method of tapping into External Data Engines (Part 1, May 13, 2008, and Part 2, May 15, 2008).
The flow seems somewhat non intuitive to me, even after reading two detailed Web log posts.
An iPhone version of Publisher became available on July 17, 2008.
In August 2008, Version 10.1.3.4 was released. The principal features, as I understand them, were:
- Integration with Oracle Enterprise Performance Management Workspace
- Integration with Oracle “Smart Space”
- Support for multidimensional data sources, including Hyperion Essbase, SQL Server, and SAP Business Information Warehouse (!)
- Usability and operation enhancements which seem to eliminate the need to write scripts for routine functions
- Support for triggers
- Enhanced Web services support
- A Word template builder
- Support for BEA Web Logic, JBoss, and Mac OS X.
Another release came out in April 2009. This one was 10.1.3.4.1 and focused on enhancements. When I scanned the list of changes, most of these modifications looked like bug fixes to me. In April 2009, Tim Dexter explained a migration gotcha. I read this as a pretty big glitch in one Oracle service integrating with another Oracle service.
Stepping back, I am left with the impression that XSQL and this product are not the mainstream interest of “big” Oracle. In fact, if I had to decide between Oracle’s XSQL and MarkLogic’s solution, I would not hesitate to select MarkLogic’s for these reasons:
- MarkLogic has one mission: facilitate content and information management. The company is not running an XQuery side show. The company runs an XQuery main event.
- The MarkLogic server generates pages that make it easy to produce crunchy content. The Oracle system produces big chunks of content that are difficult to access and print out. Manual copying and pasting is necessary to extract information from the referenced Web log.
- The search function in MarkLogic works. Search in Oracle is slow and returns unpredictable results. I encountered this problem when trying to figure out whether “search” means “Ultra Search” or “SES”.
So, I appreciate the feedback about my enthusiasm for MarkLogic. I think my judgment is sound. Go with an outfit that does something well, not something that is a sideline.
Stephen Arnold, July 14, 2009
The Gilbane Lecture: Google Wave as One Environmental Factor
July 14, 2009
Author’s note: In early June 2009, I gave a talk to about 50 attendees of the Gilbane content management systems conference in San Francisco. When I tried to locate the room in which I was to speak, the sign in team could not find me on the program. After a bit of 30 something “we’re sure we’re right” outputs, the organizer of the session located me and got me to the room about five minutes late. No worries because the Microsoft speaker was revved and ready.
When my turn came, I fired through my briefing in 20 minutes and plopped down, expecting no response from the audience. Whenever I talk about the Google, I am greeted with either blank stares or gentle snores. I was surprised because I did get several questions. I may have to start arriving late and recycling more old content. Seems to be a winning formula.
This post is a summary of my comments. I will hit the highlights. If you want more information about this topic, you can get it by searching this Web log for the word “Wave”, buying the IDC report No. 213562 Sue Feldman and I did last September, or buying a copy of Google: The Digital Gutenberg. If you want to grouse about my lack of detail, spare me. This is a free Web log that serves a specific purpose for me. If you are not familiar with my editorial policy, take a moment to get up to speed. Keep in mind I am not a journalist, don’t pretend to be one, and don’t want to be included in the occupational category.
Here we go with my original manuscript written in UltraEdit from which I gave my talk on June 5, 2009, in San Francisco:
For the last two years, I have been concluding my Google briefings with a picture of a big wave. I showed the wave smashing a skin cancer victim, throwing surfer dude and surf board high into the air. I showed the surfer dude riding inside the “tube”. I showed pictures of waves smashing stuff. I quite like the pictures of tsunami waves crushing fancy resorts, sending people in sherbet colored shirts and beach wear running for their lives.
Yep, wave.
Now Google has made public why I use the wave images to explain one of the important capabilities Google is developing. Today, I want to review some features of what makes the wave possible. Keep in mind that the wave is a consequence of deeper geophysical forces. Google operates at this deeper level, and most people find themselves dealing with the visible manifestations of the company’s technical physics.
Source: http://www.toocharger.com/fiches/graphique/surf/38525.htm
This is important for enterprise search for three reasons. First, search is a commodity and no one, not even I, finds key word queries useful. More sophisticated information retrieval methods are needed on the “surface” and in the deeper physics of the information factory. Second, Google is good at glacial movement. People see incremental actions that are separated in time and conceptual space. Then these coalesce and the competitors say, “Wow, where did that come from?” Google Wave, the present media darling, is a superficial development that combines a number of Google technologies. It is not the deep geophysical force, however. Third, Google has a Stalin-era type of planning horizon. Think in terms of five years, then you have the timeline on which to plot Google developments. Wave, in fact, is more than three years old if you start when Google bought a company called Transformics, older if you dig into the background of the Transformics technology and some other components Google snagged in the last five years. Keep that time thing in mind.
First, key word search is at a dead end. I have been one of the most vocal critics of key word search and variants of that approach. When someone says, “Key word search is what we need,” I reply, “Search is dead.” In my mind, I add, “So is your future in this organization.” I keep my parenthetical comment to myself.
Users need information access, not a puzzle to solve in order to open the information lock box. In fact, we have now entered the era of “data anticipation”, a phrase I borrowed from SAS, the statistics outfit. We have to view search in terms of social analytics because human interactions provide important metadata not otherwise obtainable by search, semantic, or linguistic technology. I will give you an example of this to make this type of metadata crystal clear.
You work at Enron. You get an email about creating a false transaction. You don’t take action but you forward the email to your boss and then ignore the issue. When Enron collapses, the “fact” that you knew and did nothing, both when you first knew and subsequently, is used to make a case that you abetted fraud. You say, “I sent the email to my boss.” From your prison cell, you keep telling your attorney the same thing. Doesn’t matter. The metadata about what you did to that piece of information through time put your tail feather in a cell with a biker convicted of third degree murder and a prior for aggravated assault.
Got it?
Convera, Hakia Added to Overflight
July 13, 2009
The Overflight search intelligence service allows a person interested in search and content processing to visit a page, select a vendor, and see a report. The report is free and draws information from the ArnoldIT.com database of information about more than 350 vendors in the information retrieval sector. Today, Convera (the challenged vertical search company) and Hakia (a vendor of semantic technology and systems) have been added to Overflight. The following links will get you to the Overflight information:
- The Overflight splash page is at http://arnoldit.com/overflight/
- Convera click here
- Hakia click here.
You can watch Google’s daily incremental thrusts from the Overflight splash page as well.
Stephen Arnold, July 13, 2009
Search, Broken, Content Management, Maybe Hopeless
July 13, 2009
I spent some time in the last three months dealing with a “search” challenge that had little to do with search. In fact, after poking around I found that the vendor (who must remain nameless according to my legal eagle) has a successful implementation in the same organization where another implementation has not worked too well.
What’s the issue?
I think the operative factor boils down to people. Yep, technology is innocent of this collision between user expectations and the top dogs’ ability to make good decisions. I am almost at the point where I can assert that search technology is not broken. It works quite well, and I have yet to find substantive variances between some free systems’ performance and some of the up market solutions.
The variable is the people. If the vendor has the right people AND the client has the right people, success is likely. When the people part of the implementation shows signs of stress, I am not sure that swapping technology will make the situation better.
Data reported by eConsultancy in its “Companies Focus on CMS Implementation” suggest that spending for new systems is flat and that ease of use is a key issue. Search is complicated, and so is CMS.
Both enterprise applications are in similar situations. CMS, as the business is described by those in the know, gained momentum when the need for Web content exploded. When people who were not gifted in writing or trained to crank out usable material had to produce for Web sites, problems ensued. The fix was to develop systems that allowed anyone to plug content in a block and punch publish. The system worked up to a point, but as the Web became less brochureware and more application software, more problems developed.
CMS today are the equivalent of the editorial systems that once were found in major publishing houses. The difference is that the tradition, training, and rules of the game are different today. CMS, not surprisingly, has not been able to turn a company filled with telemarketers into Web content producers. The solution seems to be to let the customers create the content and the CMS will do the knowledge work—well, sort of do the knowledge work.
Combine the two and you have companies that are making an effort to capture the revenue from search vendors and from enterprise software firms delivering certain back office functions.
I read a news release on the Reuters site about Autonomy, once a vendor of search, as the “fastest growing vendor” of enterprise content management, according to “Autonomy Gains Leader Position in Gartner Report”. I have no doubt that Autonomy is making headway against the likes of Interwoven, Vignette, and other ECM companies. The reason is that Autonomy’s management has seized an opportunity to bundle several services in one cohesive “product package”.
Autonomy offers eDiscovery (legal), CMS (Web and enterprise publishing), and IDOL content processing (analytics, search, autoindexing, etc.). I think that Microsoft wants to play in this arena as well. Google, although a slow mover in the enterprise, will probably enter the fray as well.
With these giant firms providing all-in-one services, I don’t see a significant change in the success rate for search, CMS, or bundles. None of these successful companies can control the people equation. Thus, as the market converges, unless the “people factor” is addressed, user grousing is not likely to decrease.
Stephen Arnold, July 13, 2009
Mysteries of Online Available as a Free Report
July 13, 2009
A student at a library school in Toronto sent me an email asking for permission to reuse two of the write ups in this Web log’s “Mysteries of Online” series. I wrote nine essays which are findable via the Blossom search box on any of this Web log’s pages. After that email, I decided to make life easy for students and any other person who wanted to review what I have learned in the last couple of decades about online information and deriving revenue from that type of information.
You can now click here and download a PDF that contains the nine essays. I have added a short disclaimer and a basic table of contents so you can locate the essay you wish to review. I did not prepare an index or insert the illustrations that I use in my formal lectures and presentations.
The only caveat attached to the document is that if you work for a commercial enterprise, write me at seaky2000 at yahoo dot com to let me know what you want to do. There is some legal boilerplate that must be inserted if you want to recycle my work.
Stephen Arnold, July 13, 2009