List of Free Data Sets

November 15, 2008

I have relied on KDnuggets.com for quite a while. (KD stands for Knowledge Discovery.) The site is one of the leading sources of information on Data Mining, Web Mining, Knowledge Discovery, and Decision Support Topics, including News, Software, Solutions, Companies, Jobs, Courses, Meetings, Publications, and more. Here is the site map and FAQ. The Web site does a good job of covering the data mining and knowledge discovery space. If you navigate here, you will find a useful listing of 20 data sets which you can use for benchmarking or testing. The listing does not include large collections of text yet. You may find this listing with hot links useful.

Stephen Arnold, November 15, 2008

Adhere’s SPIRAL Leverages the Google Search Appliance

November 14, 2008

Adhere Solutions, the innovative Google partner creating a buzz for next-generation solutions for Google’s enterprise products, announced its SPIRAL methodology. Google has placed more than 24,000 Google Search Appliances on customer premises since the product’s début several years ago. Many customers want to build on the GSA device, and often need assistance. Adhere Solutions has a proven methodology to make the GSA more than a search system.

The SPIRAL methodology encapsulates the collective experience of Adhere’s consultants, who have vast experience in web search, site search, and enterprise search deployments. With the SPIRAL process, organizations now can leverage search best practices to design, develop, and deploy Google’s suite of enterprise products and technologies.

Erik Arnold, co-founder of Adhere said:

Every day we talk to customers who want to put the full power of the Google Search Appliance to work for them. They want to offer users a robust search experience, tailored to their needs. Our SPIRAL methodology saves organizations time and money by helping assess users’ needs, apply search best practices, and fully benefit from the enterprise search technologies that Google has to offer.

The methodology bundles analysis, consulting, and coding into one business process. Adhere Solutions has used the approach to extend the utility of the GSA at the American Library Association, the Department of Energy, and the Federal Trade Commission.

Google has a solid track record in selecting partners who can deliver. Adhere Solutions has emerged as one of the “go to” partners for enterprise applications with the Google Search Appliance as a technology component. You can get more information about Adhere Solutions here.

Donald Anderson, November 14, 2008

Google: Maturing More Quickly

November 14, 2008

Nicholas Carlson at Silicon Alley Insider has an interesting story about Google. The article “Google Finally Starts Firing Slackers” is here. Mr. Carlson explains that Google is taking a closer look at employee productivity. He writes, “You can’t slack off and expect to keep your job anymore”, which is a statement provided to Mr. Carlson by a reader. Mr. Carlson’s information dovetails with some comments I have heard. For example, some of the Googlers making sales calls are showing a bit more intensity recently. If these indications are accurate, Google may be maturing more quickly. And if not exactly acting like a Fortune 50 company, the firm is responding to the changing economic environment. Google may not winnow the bottom 10 percent of its work force the way some executives do, but change may be rippling through the GOOG. The next round of financial reports will be quite interesting.

Stephen Arnold, November 14, 2008

Yasni: People Search

November 14, 2008

yasni, a people search engine, just launched in the U.S. If you’re on the web, yasni supposedly will find you. But the search is on first and last names, and there are lots of “Jessica Bratcher”s out there. My yasni search returned 30 results, including hits on amazon.com, Facebook, MySpace, Google News and Blogs, Technorati, even criminal searches. For more listings, yasni will send me an e-mail list within 24 hours.

People search has been and remains very important. Zoom Info, LinkedIn, and other sites provide useful information. I have found Cluuz.com useful as well. Cluuz.com displays relationship charts. I did some ego surfing to test yasni and I ran the same queries on Cluuz.com. On Cluuz.com, I found an interview I did in 2005. Cluuz.com also surfaced several articles about newspaper awards I’ve received. On my test queries, I did not find yasni as useful. But it is early in the game for yasni. I will check back in a month or so to see how the service develops. I do recommend that you give it a whirl.

Jessica Bratcher, November 14, 2008

ZDNet Identifies Google’s Fatal Flaw

November 14, 2008

Sure, the stock is at 2005 levels. The company took it on the snoot with the Yahoo deal, the possible loss of the Verizon account, and grousing about employee options that are underwater. Dana Blankenhorn’s “Google Fatal Flaw Revealed” takes a view of Google that I have not previously considered. As a result, the ZDNet article is a must read. I don’t want to spoil the “fatal flaw” for you. Click here and you will see the “fatal flaw” revealed in the first subhead. In my two Google books, whose titles I will not retype today, I identified a number of Google vulnerabilities. I admit I did not hit upon the weakness Dana Blankenhorn’s research unearthed. For me, aside from the fatal flaw, the most interesting comment in the article was:

Yet whether I’m covering the efforts at Chrome, at Android, or at Google Health, what I see are Google employees working on a Google Island, depending only on fellow Googlers and Google-made code in their efforts.

The idea is that Google is an island. I know that the company buys technology. The purchase of Transformic, Inc. is an example. Most people don’t recognize the name of this acquisition. Google doesn’t talk about some of its more interesting activities. My own research about Google suggests that when it buys a company, it gets new people. In the case of Transformic, the gurus running that shop attracted more fresh talent to Google. As a result, Google has a number of engineers and scientists who are steeped in the type of systems Transformic was developing. So I think that Google may be operating as an island, but it is an island with regular shuttle service to the mainland, abundant bandwidth, and a very dynamic presence at certain technology venues. See if you agree with the ZDNet article and share your comments.

Stephen Arnold, November 14, 2008

Google: Site Search Tweak for Publishers

November 14, 2008

DMSNews published on November 13, 2008, “Google Launches On Demand Indexing for Publishers.” You should take a gander at this interesting article here. The author, Mary Elizabeth Hurn, describes Google’s new on demand indexing service. The service is, from what I understand, an extension of Google’s site search service. The new service includes more controls over what the Site Search robot does. The on demand service will, according to Ms. Hurn, “effect only searches within a publisher’s site and not searches done from Google.com.” In the 1980s, indexing was the primary source of revenue for commercial database producers and companies such as Dialog Information Services and LexisNexis. Google continues to move into the high-value content arena. My thought is that the Library of Congress may face competition from the Google.

Stephen Arnold, November 14, 2008

Webinar: Open Standards and Semantic Technology

November 14, 2008

The economic downturn worldwide bodes poorly for dollars to add more search technologies to the enterprise, but the umbrella in the thunderstorm may be found in a movement quietly readying for a download launch. When will a standardized, semantic IT infrastructure be the basis of the enterprise’s entire IT framework for operations across all divisions?

There is a growing discussion in Europe, now spilling over into the US, regarding the SMILA project, the SeMantic Information Logistics Architecture. For more detail, click here or navigate to http://eccenca.broxblogs.de. This open source solution comes from a partnership between brox IT-Solutions and empolis in Germany, through Eclipse.org.

Semantic Technologies

Semantic technologies continue to gain ground in discussions among researchers and companies investing in their own search frameworks across the organization, because unstructured data remains the elephant in the room. There are proponents in several large IT companies who believe an answer is available in SMILA. When will a semantic IT infrastructure be the basis of the enterprise’s entire IT framework for operations across all divisions? Consider this white paper (in German; use translate.google.com) http://www.heise.de/open/Union-Investment-Integrationsplattform-auf-Basis-offener-Standards–/artikel/118395 The paper contends that:

“Open standards make applications more quickly realized and flawless.”

Eccenca is the commercial-level version for the enterprise, deployed with professional services and support. At brox, the company is building commercial-grade architecture and applications for the enterprise under the Eccenca Foundation, based on the SMILA codebase. Eccenca products will reflect internal expertise and existing customer requests, including those of startups in Theseus, Volkswagen, and others. See more information in the response to this blog’s recent discussion (Nov. 4) at http://h3lge.de/weblog/. Eccenca.com and the first download of SMILA are anticipated in short order. At Eccenca.com, brox will set up and manage a marketplace for standard-based plug ins, solutions, and expertise.

Webinar

A webinar in English will discuss this whole approach further on December 17, 2008. The seminar will run about one hour and take place at 8:00 am PST / 11:00 am EST / 4:00 pm GMT. The seminar will be given by Georg Schmidt (brox IT-Solutions) and Igor Novakovic (empolis). The title of the webinar is “SMILA – SeMantic Information Logistics Architecture.” This webinar will present the SMILA project (emphasizing the integration possibilities), provide a status report on the latest project developments, and give a short demonstration of currently implemented features.

The webinar will discuss the challenge posed by the exponential growth in the amount and diversity of information, mainly in the area of unstructured data such as emails, text files, blogs, and images. Poor data accessibility, user rights integration, and the lack of semantic metadata are constraining factors for building next generation enterprise search and other document centric applications. Missing standards result in proprietary solutions with huge short and long term costs. SMILA is an extensible framework for building search solutions to access unstructured information in the enterprise. Besides providing essential infrastructure components and services, SMILA also delivers ready-to-use add-on components, like connectors to the most relevant data sources. Using the framework as their basis will enable developers to concentrate on the creation of higher value solutions, like semantic driven applications.

An article authored by Dawn Marie Yankeelov, president of ASPectx.

Google: We’re Reliable, No, Really Reliable

November 14, 2008

InfoWorld ran a remarkable story by Juan Carlos Perez called “Google Cries Foul over Coverage of Google Apps.” You must read the full text here. This is a three part story, and I have to admit I found it surprising. To sum up the long write up, I would say, “We’re Google. We’re reliable. Really reliable.” There were a number of interesting comments in the well written story. For me, the most memorable is this passage:

We’re definitely hearing what people are saying and responding to feedback in that very transparent way and also looking at whether we need a centralized place like Amazon and Salesforce do.

My take is that after 10 years in business, Google might want to have a way to allow customers to talk to a human Googler or get a pleasant email response to a problem. I find the word “transparency” amusing. In fact, I wrote a column for KMWorld about Google’s summer of transparency. Well, fall is here and the transparency, like the clear summer sky of Kentucky, seems to have become opaque. Just the opinion of an addled goose who heard today that a customer with some serious money invested in Google Search Appliances couldn’t reach a Googler with a question. Probably a fluke, not a transparency issue at all.

Stephen Arnold, November 14, 2008

SEO Revealed: Do the Basics Well

November 13, 2008

The Webmaster Central Blog, published by Google, linked to a search engine optimization starter guide. You can read the original Google post here and download the PDF of the “Starter Guide” here. In my 2005 The Google Legacy, I presented in summary form more than 60 factors the Google PageRank algorithm seemed to use when determining a page’s relevance. I reviewed the data I collected in 2003 and 2004, and I was reminded that the individual factors, such as how to present URLs, were mostly common sense. The problem then and now is that for an outsider running tests with different page set ups and features, it is difficult to know which factor carries the greatest weight in a particular context. ArnoldIT.com no longer does much SEO because we have discovered that content, not tricks, works better for us over time. However, the value of this new Google guide is that it highlights certain factors that Google in late 2008 seems to suggest are important. Two quick examples are:

  • Emphasis on descriptive page titles
  • Descriptive meta tags.

We identified a number of minor factors, but this guide reminds me that Google wants the basics included in pages that its crawler identifies and its system processes. The real benefit for me is that this guide makes explicit which factors are important to Google. In my opinion, the guide also underscores how many basic problems plague the billions of Web pages that Google processes. Finally, the guide makes clear upon which factors Google wants Webmasters to focus. Google could easily charge for this publication. Its information is crisp and more authoritative than much of the information I have seen about search engine optimization.
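The two factors called out above, descriptive page titles and descriptive meta tags, are easy to check with a few lines of code. Here is a minimal sketch that uses only Python’s standard `html.parser` module. The names `HeadAudit` and `audit_page` and the two length thresholds are my own illustrative assumptions; they are not numbers or tools from Google’s guide.

```python
from html.parser import HTMLParser


class HeadAudit(HTMLParser):
    """Collect the <title> text and the meta description from a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            # Only <meta name="description" content="..."> is of interest here.
            if (attr.get("name") or "").lower() == "description":
                self.description = (attr.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


def audit_page(html, min_title_chars=10, min_description_chars=50):
    """Return a list of problems; the thresholds are arbitrary assumptions."""
    parser = HeadAudit()
    parser.feed(html)
    parser.close()

    title = parser.title.strip()
    problems = []
    if not title:
        problems.append("missing title")
    elif len(title) < min_title_chars:
        problems.append("title too short")
    if not parser.description:
        problems.append("missing meta description")
    elif len(parser.description) < min_description_chars:
        problems.append("meta description too short")
    return problems
```

Feed `audit_page` the HTML of a page as a string and it returns a list of complaints, or an empty list if both basics are in place. A check like this says nothing about whether the title and description are any good, only whether someone bothered to write them.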

Stephen Arnold, November 13, 2008

Speeding Up Database Piggies

November 13, 2008

In Denmark, I mentioned next generation data management companies. Several people talked with me after my lecture. My comments were unsettling because I described the costs of scaling Codd-style databases. I know I mentioned Aster Data, a company that I have written about in this Web log. Aster Data’s profile in Aarhus was not high, so when I suggested that database piggies may be left behind by the Aster Datas, the database administrators put their shields up.

I told these DBAs that if they needed to speed up their existing Oracle, SQL Server, and DB2 systems, there were some options. One option is to throw hardware at the problem. The per CPU pricing model is designed for this brute force approach. However, another alternative is to look at companies with subsystems that can smooth out some of the most aggravating Codd database flaws; for example, trashing tables when writes go wrong or choking when transactions exceed the system’s capability or when data tables want more space and there is no more space. You can probably think of your favorite Codd database challenge. We each have our memorable moments.

One company that can help ameliorate some Codd database problems is an outfit called GoldenGate Software. You can find out more about this company by navigating to their Web site, going through a dorky registration process, and reading about their database middleware. The company does not describe its technology as middleware, but I find it a convenient metaphor. You license GoldenGate’s transactions system and maybe its transformation component. You plug one end into the relational database systems and the other end into whatever system needs the databased content. The system works wonders at financial institutions, intelligence agencies, and any other implementation where near real time access to database content is required by enterprise systems.

The company has a new release of its flagship product, TDM, shorthand for Transactional Data Management. You can read about some of its nifty features here. The company thrives on database jargon, but the software works. Some of the banks for which I worked before these outfits went down the drain discovered that it was more cost effective to license the GoldenGate product than add database servers to break a data transfer bottleneck.

So, if you aren’t in a position to jump to a next generation database, you will find GoldenGate a useful subsystem about which to learn more.

Stephen Arnold, November 13, 2008
