UK Government Manual Outlines Open Source Approach
March 28, 2013
The U.K. government addresses its staff on the use of open-source software in the Open Source section of its Government Service Design Manual. The site is still in beta as of this writing, but the full release is expected in April. These prescriptions for when and how to use open-source resources contain some good advice, pertinent even to those of us who do not report to a U.K. government office.
For example, on preparing developers, the document counsels:
“Ensure developers have the ability to install and experiment with open source software, have environments to easily publish prototype services on The Web, have convenient access to a wide variety of network connected devices for testing Web sites, and have unrestricted access to collaboration tools such as GitHub, Stack Overflow and IRC.”
It is worth noting that the text goes on to recommend giving back to open-source projects, as well as citing any open-source code used. The document also notes that, where security is concerned, open-source software actually has an advantage:
“Separation of project code from deployed instances of a project is good development practice, and using open source enables developers to easily fork and experiment with multiple development, and operations to quickly spin-up multiple test and integration environments. . . . A number of metrics and models attest to the quicker response to security issues in open source products when compared to closed source equivalents.”
See the clear and concise document for more of its perspective on open-source software. The conclusion also points to the site’s Open Standards and Licensing page for more information.
Cynthia Murrell, March 28, 2013
Sponsored by ArnoldIT.com, developer of Augmentext
The Pros of the Oracle Upgrade from Fatwire Content Server
March 28, 2013
The details of the Oracle upgrade are discussed in this article from Element Solutions,
Why Upgrade From Fatwire Content Server to Oracle WebCenter Site 11g. The piece begins with the faults of the Fatwire server: difficulty finding and deleting assets, the lack of multi-asset editing, a confusing UI, and a cumbersome system for creating and placing pages. The benefits of the upgrade that require no template code modification include:
“Separation of Administration UI and Content Contributor UI…Multi asset editing via Multi tabs…Drag and drop assets in Form Mode…Search for assets across multiple assets types…Moves MetaData Fields out of the way…Save and Continue work: the ability to save and continue work is now available with a simple click….
The key to a successful upgrade is proper planning, testing and training of your content contributors.”
There are also benefits available by making template code changes: the ability to drag and drop content while it is being viewed in Preview Mode, and the ability to add content directly to page assets, instead of using page assets only for the construction of other assets.
Chelsea Kerwin, March 28, 2013
Sponsored by ArnoldIT.com, developer of Augmentext
Raspberry Pi Running Apache Tomcat Server Hosting Ontopia Declared a Success
March 28, 2013
Raspberry Pi system enthusiasts will be excited to read Escape Velocity’s article, Ontopia Runs on Raspberry Pi. Ontopia, a collection of open source tools for building, maintaining, and deploying Topic Maps-based applications, reportedly runs successfully on the Raspberry Pi, the credit-card-sized ARM GNU/Linux computer developed by the Raspberry Pi Foundation. The article reports:
“Using the Raspberry Pi to run the Apache Tomcat server that hosts the Ontopia software, response time is as good or better than I have experienced when hosting the Ontopia software on a cloud-based Linux server at my ISP. Topic maps open quickly in all three applications and navigation from topic to topic within each application is downright snappy…
I am expecting the Pi to be a viable development platform and a decent host for low-volume Tomcat-based demonstration applications that Pi enthusiasts might create.”
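For readers who want to reproduce the timing observation, a minimal sketch follows. It assumes a Pi reachable on the local network with Tomcat on its default port 8080; the hostname and the application paths are illustrative assumptions, not details taken from the article.

import time
import urllib.request

# Hypothetical host and paths: a Raspberry Pi on the local network running
# Tomcat on its default port; the web application paths are assumptions
# based on the applications Ontopia bundles, not taken from the article.
BASE = "http://raspberrypi.local:8080"
PATHS = ["/omnigator/", "/ontopoly/", "/vizigator/"]

def time_request(url, tries=5):
    """Average wall-clock latency over several GET requests."""
    total = 0.0
    for _ in range(tries):
        start = time.time()
        with urllib.request.urlopen(url, timeout=30) as response:
            response.read()
        total += time.time() - start
    return total / tries

for path in PATHS:
    try:
        average = time_request(BASE + path)
        print(f"{path}: {average * 1000:.0f} ms average")
    except OSError as error:
        print(f"{path}: unreachable ({error})")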
The promising findings reported in the article suggest that the Pi may be capable of supporting more applications and technologies than previously thought. If you are interested in embedding taxonomy functions into a Raspberry Pi system, Ontopia has an answer. Pi enthusiasm has spread, especially among high school and grade school students. Fans even meet at monthly “Raspberry Jams” to discuss the ins and outs of the system.
Chelsea Kerwin, March 28, 2013
Sponsored by ArnoldIT.com, developer of Augmentext
Call for Experts: Search Experts Need Not Apply
March 27, 2013
A reader sent me a link to a call for experts issued by one of the European Commission’s entities. The program is called Horizon 2020, and a countdown timer on the Web site reports how many days until Horizon 2020 launches. The program is a “framework” for research and innovation. The Europa.eu Web site says:
The European Commission is widening its search for experts from all fields to participate in shaping the agenda of Horizon 2020, the European Union’s future funding programme for research and innovation. The experts of the advisory groups will provide high quality and timely advice for the preparation of the Horizon 2020 calls for project proposals. The Commission services plan to set up a certain number of Advisory Groups covering the Societal Challenges and other specific objectives of Horizon 2020. To reach the broadest range of individuals and actors with profiles suited to contribute to the European Union’s vision and objectives for Horizon 2020, including striving for a large proportion of newcomers, and to gain consistent and consolidated advice of high quality, the Commission is calling for expressions of interest with the aim of creating lists of high level experts that will participate in each of these groups.
The list of expertise required is wide ranging. What is fascinating is that in the lengthy list of what’s needed there is no call for search, big data, content processing, or analytics. The EC has funded Promise (more accurately PPromise), which has a focus on search from what strikes me as a somewhat traditional approach combined with a quest for “good enough” solutions. I suppose innovation can result from the pursuit of “good enough.” I wonder if the exclusion of search and its related disciplines from this call for experts is a reflection on the role of information retrieval or on the results which have flowed from previous EC support of findability projects. On the other hand, perhaps the assumption is that search is a slam dunk. If so, then those engaged in search and content processing have to do a better job of communicating the dismal state of search and its related disciplines.
Much work remains to be done, and calls for expertise which omit specific remarks about information retrieval trouble me. Maybe the “good enough” notion is more pervasive than I understood.
Stephen E Arnold, March 27, 2013
Retire the Label Unstructured Data
March 27, 2013
Grant Ingersoll, CTO of LucidWorks, is sick and tired of the term “unstructured data.” It is really hard to blame him. The term is everywhere these days and has come to stand for any data that is hard for a traditional database to capture.
Ingersoll says:
“I think that, in the early days of databases, someone coined ‘unstructured’ as a derogatory term to mean ‘all the stuff a database isn’t good at working on.’ If ‘structured’ is good, then ‘un’-structured must be bad, right? The problem is that working with text is one of the defining computational challenges of our time. We need our best and brightest working on it; and not just so we can better target ads to consumers. It’s too full of promise to describe with such a diminutive word as ‘unstructured.’ Numerical data? Child’s play! Text? Now there’s a real challenge.”
Ingersoll goes on to say that “rich data” is his new phrase of choice. If unstructured is meant to be negative, and text is some of the most challenging but most rewarding content we have available, then rich may very well fit the bill. Regardless, end users are looking for solutions to tackle their individual content storage and retrieval problems. LucidWorks, the company that Ingersoll helped found, does just that. So unstructured or rich, LucidWorks has a solution to meet your data needs.
Emily Rae Aldridge, March 27, 2013
Sponsored by ArnoldIT.com, developer of Beyond Search
Loom Dataset Management for Hadoop Released by Revelytix
March 27, 2013
In the article Revelytix Launches Loom Dataset Management for Hadoop, from Data Center Knowledge, the early-access availability of Loom is celebrated. Revelytix, a big data software and services provider, offers tools that enable data scientists to work with Hadoop. Loom is the product of years of design and innovation for the Department of Defense, pharmaceutical companies, financial services firms, and leading intelligence agencies in the United States. Loom’s capabilities are explained in the article as follows:
“Loom makes it easy for data scientists and IT to build more analytics faster with easy-to-use interfaces that simplify getting the right data for the job quickly and managing datasets efficiently over time with proper tracking and data auditing,” said Revelytix CEO Mike Lang. Loom includes dataset lineage so you know where a dataset came from, Active Scan to dynamically profile datasets, Lab Bench for finding, transforming, and analyzing data in Hadoop and Hive; data suitability, and open APIs.”
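The article names dataset lineage as a headline capability. As a rough illustration of what lineage tracking means in practice, here is a toy sketch; it is not Revelytix’s data model or API, and every name in it is hypothetical.

from dataclasses import dataclass, field
from typing import List

# Illustrative only: a toy lineage record showing the idea behind dataset
# lineage, not Revelytix Loom's actual data model or API.
@dataclass
class Dataset:
    name: str
    source: str                        # where the data came from
    derived_from: List["Dataset"] = field(default_factory=list)
    transformations: List[str] = field(default_factory=list)

    def lineage(self, depth=0):
        """Walk the derivation chain back to the original sources."""
        lines = ["  " * depth + f"{self.name} (source: {self.source})"]
        for parent in self.derived_from:
            lines.extend(parent.lineage(depth + 1))
        return lines

raw = Dataset("web_logs_raw", "hdfs:///logs/2013/03")
clean = Dataset("web_logs_clean", "hive", [raw], ["drop malformed rows"])
daily = Dataset("daily_visits", "hive", [clean], ["group by day"])
print("\n".join(daily.lineage()))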
As this excerpt reveals, the article reads more like a Revelytix company newsletter than anything else. It goes on to state that Revelytix also recently announced that it would continue its work for the Department of Defense in 2013, broadening the implementation of the data management capabilities already in place.
Chelsea Kerwin, March 27, 2013
Sponsored by ArnoldIT.com, developer of Augmentext
Learn the Possibilities of Content Enrichment with OpenCalais
March 27, 2013
For a simple explanation of content enrichment, see Web CMS Content Enrichment with OpenCalais, Crafter Rivet and Alfresco, on Rivet Logic Blogs. Content enrichment, the art of mining data and adding value to it, is now within easy reach thanks to services such as OpenCalais, a free semantic data mining service from Thomson Reuters. Available for use on your blog, Web site, or application, OpenCalais’s mission is to make “the world’s content more accessible.” The article explains:
“A few examples of content enrichment include: entity extraction, topic detection, SEO (Search Engine Optimization), and sentiment analysis. Entity extraction is the process of identifying unique entities like people and places and tagging the content with it. Topic detection looks at the content and determines to some probabilistic measure what the content is about. SEO enrichment will look at the content and suggest edits and keywords that will boost the content’s search engine performance. Sentiment analysis can determine the tone or polarity (negative or positive) of the content.”
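For the curious, a minimal sketch of calling the service for entity extraction follows. The endpoint, headers, and response layout reflect OpenCalais’s documented REST API of this era, but treat them as assumptions, and bring your own free license key.

import json
import urllib.request

# A minimal sketch of an OpenCalais REST call. The endpoint, headers, and
# response layout are assumptions based on the service's 2013-era
# documentation; the API key below is a placeholder.
API_URL = "http://api.opencalais.com/tag/rs/enrich"
API_KEY = "your-opencalais-license-id"

def enrich(text):
    request = urllib.request.Request(
        API_URL,
        data=text.encode("utf-8"),
        headers={
            "x-calais-licenseID": API_KEY,
            "Content-Type": "text/raw",
            "Accept": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))

result = enrich("Thomson Reuters released OpenCalais to tag entities "
                "such as people, places, and companies.")
# Assumed layout: entities come back as keyed objects with _type and name.
for item in result.values():
    if isinstance(item, dict) and item.get("_typeGroup") == "entities":
        print(item.get("_type"), "->", item.get("name"))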
The tutorial on using OpenCalais with Crafter Rivet’s platform offered in this article is short and straightforward. Without tools like OpenCalais, achieving the advantages of content enrichment would cost authors and content managers countless hours. The resources available can save time while improving the effectiveness of content.
Chelsea Kerwin, March 27, 2013
Sponsored by ArnoldIT.com, developer of Augmentext
Dataset Management for Revelytix Loom and Cloudera Navigator
March 27, 2013
A surprising article from DBMS 2 (DataBase Management System Services) about dataset management includes an explanation of the new term, dataset. The term was coined for Revelytix, a big data software company that seems to have had trouble with the older term for what it does: metadata management. That term is problematic because it could refer to several types of data. Dataset management describes both Revelytix’s offering and the recently released Cloudera Navigator. The author asserts:
“My idea for the term dataset is to connote more grandeur than would be implied by the term “table”, but less than one might assume for a whole “database”. I.e.:
A dataset contains all the information about something. This makes it a bigger deal than a mere table, which could be meaningless outside the context of a database.
But the totality of information in a “dataset” could be less comprehensive than what we’d expect in a whole “database”.”
Mid-tier consultants may try to use the new problem as a revenue lever. Products to watch are Cloudera Navigator, which comes from a leading Hadoop company and starts with auditing, and Revelytix Loom, which already does lineage in addition to auditing and is the main product of a company focused on metadata management.
Chelsea Kerwin, March 27, 2013
Sponsored by ArnoldIT.com, developer of Augmentext
Search Evaluation in the Wild
March 26, 2013
If you are struggling with search, you may be calling your search engine optimization advisor. I responded to a query from an SEO expert who needed information about enterprise search. His clients, as I understood the question, were seeking guidance from a person with expertise in spoofing the indexing and relevance algorithms used by public Web search vendors. (The discussion appeared in the Search-Based Applications (SBA) and Enterprise Search group on LinkedIn. Note that you may need to be a member of LinkedIn to view the archived discussion.)
The whole notion of turning search into marketing has interested me for a number of years. Our modern technology environment creates a need for faux information. The idea, as Jacques Ellul pointed out in Propaganda, is that modern man needs something to fill a void.
How can search deliver easy, comfortable, and good enough results? Easy. Don’t let the user formulate a query. A happy quack to Resistance Quotes.
It, therefore, makes perfect sense that a customer who is buying relevance in a page of free Web results would expect an SEO expert to provide similar functionality for enterprise search. Not surprisingly, the notion of controlling search results based on an externality like keyword stuffing or content flooding is a logical way to approach enterprise search.
Precision, recall, hard metrics about indexing time, and the other impedimenta of the traditional information retrieval expert are secondary to results. Like metrics about Web traffic, a number is better than no number; even if the number’s flaws are not understood, the number is better than nothing. In fact, the entire approach to search as marketing is based on results which are good enough. One can see the consequences of this thinking when one runs a query on Bing or on systems which permit users’ comments to influence relevancy. Vivisimo activated this type of value adding years ago, and it remains a good example of trying to make search useful. The result list which forces the user to work through a laundry list of documents and determine what is useful is gone. If a document has internal votes of excellence, that document is the “right” one. Instead of precision and recall, modern systems deliver “good enough” results. The user sees one top hit and assumes the system has made the more informed decision.
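To make the mechanics concrete, here is a toy sketch of vote-influenced ranking. The blending formula is invented for illustration and is not any vendor’s actual relevancy model; note how a loosely relevant but popular document outranks a precise one.

import math

# Illustrative only: a toy ranking that blends a text-match score with
# user votes, in the spirit of the vote-influenced relevancy described
# above. The weighting is an assumption, not any vendor's formula.
def blended_score(text_score, votes, vote_weight=0.5):
    """Boost a base relevance score by the log of user votes."""
    return text_score + vote_weight * math.log1p(votes)

documents = [
    {"title": "Precise but unpopular", "text_score": 2.0, "votes": 0},
    {"title": "Loosely relevant crowd favorite", "text_score": 1.2, "votes": 40},
]

ranked = sorted(documents,
                key=lambda d: blended_score(d["text_score"], d["votes"]),
                reverse=True)
for doc in ranked:
    score = blended_score(doc["text_score"], doc["votes"])
    print(f"{score:.2f}", doc["title"])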
There is a downside to the good enough approach to search, which delivers a concrete result that, like Web traffic statistics, looks so solid, so meaningful: the user consumes information which may not be accurate, germane, or timely. In the quest for better search, good enough trumps the mentally exhausting methods of the traditional precision and recall crowd.
To get a better feel for the implications of this “good enough” line of thinking, take a look at the September 2012 “deliverable” from Promise (whose acronym, in my opinion, should be spelled PPromise), “Tutorial on Evaluation in the Wild.” The abstract for the document does not emphasize the “good enough” angle, stating:
The methodology estimates the user perception based on a wide range of criteria that cover four categories, namely indexing, document matching, the quality of the search results and the user interface of the system. The criteria are established best practices in the information retrieval domain as well as advancements for user search experience. For each criterion a test script has been defined that contains step-by-step instructions, a scoring schema and adaptations for the three PROMISE use case domains.
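To make the scoring schema idea concrete, here is a toy aggregation. The four categories mirror the abstract, but the criteria, scores, and equal weighting are invented for illustration and do not come from the PROMISE deliverable.

# Hypothetical aggregation of per-criterion scores into category scores,
# sketching the kind of scoring schema the PROMISE tutorial describes.
# Categories follow the abstract; criteria and scores are invented.
scores = {
    "indexing":          {"coverage": 4, "freshness": 3},
    "document matching": {"query handling": 2, "ranking sanity": 3},
    "search results":    {"relevance": 3, "snippet quality": 4},
    "user interface":    {"layout": 5, "feedback": 2},
}

def category_average(criteria):
    return sum(criteria.values()) / len(criteria)

for category, criteria in scores.items():
    print(f"{category}: {category_average(criteria):.1f} / 5")

overall = sum(category_average(c) for c in scores.values()) / len(scores)
print(f"overall: {overall:.1f} / 5")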
The idea is that by running what strikes me as a subjective data collection from users of systems, an organization can gain insight into the search system’s “performance” and “all aspects of his or her behavior.” (The “all” is a bit problematic to me.)
Stephen E Arnold, March 26, 2013
IBM Content Analytics and Search V2.2 Exam
March 26, 2013
I am not sure how, but two links found their way to me today. The subject of the exam is IBM’s Content Analytics and Search V2.2.
Information about the IBM test is at http://www-03.ibm.com/certify/certs/27003701.shtml. Information about the April 2011 version of the system, which is the current one, is at this IBM link. The current version is going on two years old, which does not suggest continuous, aggressive updating to me.
The first link points to Blog Pass 4 Test. The site presents some sample questions for the examination, which is part of the IBM certification process.
You can pass the IBM 000-583 (IBM Content Analytics and Search V2.2) examination with an “examination guide.”
The examination is available from Blog.pass4test.net. Here are three sample questions to whet your appetite:
Which documents from the collection are used to create the clustering proposal?
A. All of the documents in the index are used.
B. A random sample of the number that you specify
C. The first 1000 documents that were added to the index.
D. A round-robin alphabetically ordered sampling from each different crawler
Answer: B

Which languages listed are supported for text analytics collections?
A. French, Arabic, Hindi, Malay
B. German, English, Polish, Greek
C. Hebrew, Italian, English, Russian
D. English, Spanish, Arabic, German
Answer: D

Which is NOT a supported operating system?
A. AIX 5.3 (32-bit)
B. AIX 6.1 (64-bit)
C. Red Hat Enterprise Linux Advanced Server (32-bit)
D. Microsoft Windows Server 2003 Enterprise (32-bit)
Answer: A
Pretty thin gruel for the cold winter mornings required to get complex proprietary and open source systems to work in an optimal manner.
The second link is to Exam 2 Home. The idea is that for $49, a person can buy a PDF with questions and answers. You can find this exam guide at http://www.exam2home.com/000-583.htm. The site asserts:
Many IBM Content Analytics and Search V2.2 test questions or brain dump providers in the market focus solely on passing the exam while skipping the real-world exam preparation. This approach only gives short-term solution while giving the candidates real setbacks in the job market. The main focus of Exam2Home’s IBM 000-583 questions is to teach you the techniques to prepare your exam in the right sense covering all aspects of the exam. We have truly a 1-2 knockout solution for your IBM 000-583 exam.
Two observations. First, I must be on a list of folks trying to master IBM Content Analytics and Search V2.2. Interesting idea, just not accurate. Second, these two pitches seem quite similar. Is this another of the learn-quick, get-a-raise training schemes? I ran across a similar program for Quicken. Interesting but suspicious to me.
Stephen E Arnold, March 26, 2013