Affinio and the Differences between Useful Data and Fanciful Data
May 17, 2016
I read “Understanding the Cultural Differences Between NASCAR and Formula One Fans [Analysis].” The write up is in a blog post from Affinio. The company describes itself in this way:
Marketing Intelligence that leverages the social graph to understand today’s customer.
The information in the write up presents clusters of interest between the two fan bases for each of these motor sports. F1 consists of clusters labeled this way:
To illustrate the differences, Affinio presents a visualization of the Nascar audience:
The labels strike me as unhelpful; for example, Cluster 14, Cluster 6, etc.
The top interests of the two audiences consist of a collage of small images. I am not sure what each image represents.
Equally unhelpful are the word clouds for each of the audiences; for example:
The map showing the geographic area where F1 is popular focuses on a global scale with a centroid in Western Europe. The absence of a hot spot in the Middle East was puzzling. Is Australia as large an F1 market as the UAE in terms of money spent on F1 activities?
The map for the Nascar market depicts only the US of A. My question, “Why not show a global map?”
Thinking about this analysis, I have several questions:
- A list of dot points would get the message across in a more efficient, possibly less confusing way, would it not?
- What is analyzed? It seems that the single actionable fact is that the F1 market is global and the Nascar market is local.
- What are the data sets used for the analysis?
- Why are terms like “Cluster 14” used instead of words?
The most important data from my uninformed vantage point is the money generated by the two types of motor racing.
My hunch is that the Affinio write up wanted to show off visualizations, not substantive and actionable data analysis. In short, is this marketing or is it substance? I will leave the answer to you, gentle reader.
Stephen E Arnold, May 17, 2016
The Most Dangerous Writing App Will Delete Your Work If You Stop Typing, for Free
May 2, 2016
The article on The Verge titled The Most Dangerous Writing App Lets You Delete All of Your Work For Free speculates on the difficulties and hubris of charging money for technology that someone can clone and offer for free. Manuel Ebert’s The Most Dangerous Writing App offers a self-detonating notebook that you trigger if you stop typing. The article explains,
“Ebert’s service appears to be a repackaging of Flowstate, a $15 Mac app released back in January that functions in a nearly identical way. He even calls it The Most Dangerous Writing App, which is a direct reference to the words displayed on Flowstate creator Overman’s website. The difference: Ebert’s app is free, which could help it take off among the admittedly niche community of writers looking for self-deleting online notebooks.”
One such community that comes to mind is that of the creative writers. Many writers, and poets in particular, rely on exercises akin to the philosophy of The Most Dangerous Writing App: don’t let your pen leave the page, even if you are just writing nonsense. Adding higher stakes to the process might be an interesting twist, especially for those writers who believe that just as the nonsense begins, truth and significance are unlocked.
Chelsea Kerwin, May 2, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
Search without Indexing
April 27, 2016
I read “Outsmarting Google Search: Making Fuzzy Search Fast and Easy Without Indexing.”
Here’s a passage I highlighted:
It’s clear the “Google way” of indexing data to enable fuzzy search isn’t always the best way. It’s also clear that limiting the fuzzy search to an edit distance of two won’t give you the answers you need or the most comprehensive view of your data. To get real-time fuzzy searches that return all relevant results you must use a data analytics platform that is not constrained by the underlying sequential processing architectures that make up software parallelism. The key is hardware parallelism, not software parallelism, made possible by the hybrid FPGA/x86 compute engine at the heart of the Ryft ONE.
I also circled:
By combining massively parallel FPGA processing with an x86-powered Linux front-end, 48 TB of storage, a library of algorithmic components and open APIs in a small 1U device, Ryft has created the first easy-to-use appliance to accelerate fuzzy search to match exact search speeds without indexing.
An outfit called InsideBigData published “Ryft Makes Real-time Fuzzy Search a Reality.” Alas, that link is now dead.
Perhaps a real time fuzzy search will reveal the quickly deleted content?
Sounds promising. How does one retrieve information within videos, audio streams, and images? How does one hook together or link a reference to an entity (discovered without controlled term lists) with a phone number?
My hunch is that the methods disclosed in the article have promise; the future of search seems to be lurching toward applications that solve real world, real time problems. Ryft may be heading in that direction in a search climate which presents formidable headwinds.
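For readers who want to see what "an edit distance of two" means in practice, here is a minimal sketch of fuzzy matching by brute force scan, with no index at all. It is not Ryft's method; the function names and the sample word list are my own illustrations. The sketch simply compares the query against every word in sequence, which is the software-only approach the quoted passage says hardware parallelism is meant to speed up.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the classic two-row dynamic program."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def fuzzy_scan(query: str, words: list[str], max_distance: int = 2) -> list[str]:
    """Check every word in turn; nothing is indexed ahead of time."""
    query = query.lower()
    return [w for w in words if edit_distance(query, w.lower()) <= max_distance]


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    sample = ["Ryft", "raft", "rift", "Google", "goggle", "index"]
    print(fuzzy_scan("ryft", sample, max_distance=1))
```

The cost of this approach grows with the size of the data on every query, which is why most software systems build an index first.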
Stephen E Arnold, April 27, 2016
Data Intake: Still a Hassle
April 21, 2016
I read “Big Data’s Biggest Problem: It’s Too Hard to Get the Data In.” Here’s a quote I noted:
According to a study by data integration specialist Xplenty, a third of business intelligence professionals spend 50% to 90% of their time cleaning up raw data and preparing to input it into the company’s data platforms. That probably has a lot to do with why only 28% of companies think they are generating strategic value from their data.
My hunch is that with the exciting hyperbole about Big Data, the problem of normalizing, cleaning, and importing data is ignored. The challenge of taking file A in a particular file format and converting to another file type is indeed a hassle. A number of companies offer expensive filters to perform this task. The one I remember is Outside In, which sort of worked. I recall that when odd ball characters appeared in the file, there would be some issues. (Does anyone remember XyWrite?) Stellent purchased Outside In in order to move content into that firm’s content management system. Oracle purchased Stellent in 2006. Then Kapow “popped” on the scene. The firm promoted lots of functionality, but I remember it as a vendor who offered software which could take a file in one format and convert it into another format. Kofax (yep, the scanner oriented outfit) bought Kapow to move content from one format into one that Kofax systems could process. Then Lexmark bought Kofax and ended up with Kapow. With that deal, Palantir and other users of the Kapow technology probably had a nervous moment or are now having a nervous moment as Lexmark marches toward a new owner. Entropy, a French outfit, was a file conversion outfit. It sold out to Salesforce. Once again, converting files from Type A to another desired format seems to have been the motivating factor.
Let us not forget the wonderful file conversion tools baked into software. I can save a Word file as an RTF file. I can import a comma separated file into Excel. I can even fire up Framemaker and save a Dot fm file as RTF. In fact, many programs offer these import and export options. The idea is to lessen the pain of having a file in one format which another system cannot handle. Hey, for fun, try opening a macro filled XyWrite file in Framemaker or Indesign. Just change the file extension to one the system thinks it recognizes. This is indeed entertaining.
The write up is not interested in the companies which have sold for big bucks because their technology could make file conversion a walk in the Hounz Lane Park. (Watch out for the rats, gentle reader.) The write up points out three developments which will make the file intake issues go away:
- The software performing file conversion “gets better.” Okay, I have been waiting for decades for this happy time to arrive. No joy at the moment.
- “Data preparers become the paralegals of data science.” Now that’s a special idea. I am not clear on what a “data preparer” is, but it sounds like a task that will be outsourced pretty quickly to some country far from the home of NASCAR.
- “Artificial intelligence” will help cleanse data. Excuse me, but smart software has been operative in file conversion methods for quite a while. In my experience, the exception files keep on piling up.
What is the problem with file conversion? I don’t want to convert this free blog post into a lengthy explanation. I can highlight five issues which have plagued me and my work in file conversion for many years:
First, file types change over time. Some of the changes are not announced. Others like the Microsoft Word XML thing are the subject of months long marketing. The problem is that unless the outfit responsible for the file conversion system creates a fix, the exception files can overrun a system’s capacity to keep track of problems. If someone is asleep at the switch, data in the exception folder can have an adverse impact on some production systems. Loss of data is interesting but trashing the file structure is a carnival. Who does not pay attention? In my experience, vendors, licensees, third parties, and probably most of the people responsible for a routine file conversion task.
Second, the thrill of XML is that it is not particularly consistent. Somewhere along the line, creativity takes precedence over well formed. How does one deal with a couple hundred thousand XML files in an exception folder? What do you think about deleting them?
Third, the file conversion software works as long as the person creating a document does not use Fancy Dan “inserts” in the source document. Problems arise from videos, certain links, macros, and odd ball formatting of the source document. Yep, some folks create text in Excel and wonder why the resulting text is a bit of a mess.
Fourth, workflows get screwed up. A file conversion system is semi smart. If a process creates a file with an unrecognized extension, the file conversion system fills the exception folder. But what if one valid extension is changed to a supported but incorrect extension? Yep, XML users be aware that there are proprietary XML formats. The files converted and made available to a system are “sort of right.” Unfortunately sort of right in mission critical applications can have some interesting consequences.
Fifth, attention to detail is often less popular than fiddling with one’s mobile phone or reading Facebook posts. Human inattention can make large scale data conversion fail. I have watched as a person of my acquaintance deleted the folder of exception files. Yo, it is time for lunch.
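The extension problem in the fourth point can be made concrete with a small sketch. It assumes a hypothetical intake step which reads the first few bytes of each file and compares them with the signature expected for the file's extension; anything unknown or mismatched goes to the exception folder instead of being converted. The signature table, function names, and folder arguments are illustrative, not taken from any vendor's product.

```python
from pathlib import Path
import shutil

# Leading bytes ("magic numbers") for a few common formats.
SIGNATURES = {
    ".pdf": b"%PDF-",
    ".rtf": b"{\\rtf",
    ".docx": b"PK\x03\x04",  # DOCX is a ZIP container
    ".xml": b"<?xml",        # assumes an XML declaration and no byte order mark
}


def extension_matches_content(name: str, header: bytes) -> bool:
    """Return True only when the extension is known and the header agrees with it."""
    expected = SIGNATURES.get(Path(name).suffix.lower())
    return expected is not None and header.startswith(expected)


def route(source: Path, accepted: Path, exceptions: Path) -> Path:
    """Move a file into the accepted or the exception folder; return its new path."""
    with source.open("rb") as handle:
        header = handle.read(8)
    if extension_matches_content(source.name, header):
        target_dir = accepted
    else:
        target_dir = exceptions
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.move(str(source), str(target_dir / source.name)))
```

A check like this catches an RTF file renamed to .pdf. It does nothing for the proprietary XML case, where the header is valid but the schema is not what the downstream system expects.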
So what? Smart software makes certain assumptions. At this time, file intake is perceived as a problem which has been solved. My view is that file intake is a core function which needs a little bit more attention. I do not need to be told that smart software will make file intake pain go away.
Stephen E Arnold, April 21, 2016
Chinese Restaurant Names as Journalism
April 19, 2016
I read an article in Jeff Bezos’ newspaper. The title was “We Analyzed the Names of Almost Every Chinese Restaurant in America. This Is What We Learned.” The almost is a nifty way of slip sliding around the sampling method which used restaurants listed in Yelp. Close enough for “real” journalism.
Using the notion of a frequency count, the write up revealed:
- The word appearing most frequently in the names of the sample was “restaurant.”
- The words “China” and “Chinese” appear in about 15,000 of the sample’s restaurant names.
- “Express” is a popular word, not far ahead of “panda”.
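The frequency count behind findings like these takes only a few lines. The sketch below is my own illustration, not the newspaper's code, and the restaurant names are made up for the example.

```python
from collections import Counter
import re


def word_frequencies(names: list[str]) -> Counter:
    """Count how often each lowercase word appears across all names."""
    counts = Counter()
    for name in names:
        # Letters only, so "Wong's" splits into "wong" and "s".
        counts.update(re.findall(r"[a-z]+", name.lower()))
    return counts


if __name__ == "__main__":
    # Hypothetical sample, not the Yelp data used in the article.
    sample = [
        "Panda Express",
        "China Garden Restaurant",
        "Golden Dragon Chinese Restaurant",
        "Great Wall Restaurant",
    ]
    print(word_frequencies(sample).most_common(3))
```

With any realistic list of names, generic words such as "restaurant" float to the top unless they are filtered out first.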
The word list and their frequencies were used to generate a word cloud:
To answer the question where Chinese food is most popular in the US, the intrepid data wranglers at Jeff Bezos’ newspaper output a map:
Amazing. I wonder if law enforcement and intelligence entities know that one can map data to discover things like the fact that the word “restaurant” is the most used word in a restaurant’s name.
Stephen E Arnold, April 19, 2016
Content Marketers at Risk
April 19, 2016
I read “Goldman Sachs Leads a $30 million Round for Persado’s AI-Based, Automated Copywriting Service.” My first reactions:
- Search engine optimization wizards will have a tool to increase the flow of baloney search and content marketing to people who write blogs
- Journalists, who have been subject to reduction in force actions, may face fierce competition from smart software
- Teachers of college composition will have a tough time figuring out if the student essays are coming from fraternity and sorority reference files or from a cloud based writing service.
According to the write up, the service is a “cognitive one.” Poor IBM. The company wants Watson to be the cognitive champion. Now an outfit which uses software to create articles has embraced the concept. I noted:
The company [Persado] has cataloged 1 million words and phrases that marketers use in their copy, and scored those words based on sentiment analysis and the structure of marketing pitches defined by a message’s format, linguistic structure, description, emotional language, and its actual call to action. The software can create a message, optimize its language, and then translate that message into any of 23 language…
There is a bright side. IBM could purchase Persado and then use the system to flog its confection of Lucene, acquired technology, and home brew code into a system which tirelessly promotes IBM.
Stephen E Arnold, April 19, 2016
What Is the Potential of Social Media?
April 11, 2016
Short honk. I read “How to Hack an Election.” The write up reports that a person was able to rig elections. According to the story:
For $12,000 a month, a customer hired a crew that could hack smartphones, spoof and clone Web pages, and send mass e-mails and texts. The premium package, at $20,000 a month, also included a full range of digital interception, attack, decryption, and defense. The jobs were carefully laundered through layers of middlemen and consultants.
Worth reading and then considering this question:
What are the implications of weaponized information?
Are pundits, mavens, self appointed experts, and real journalists on the job and helping to ensure that information online is “accurate”?
Stephen E Arnold, April 11, 2016
Machine Learning: 10 Numerical Recipes
April 8, 2016
The chatter about smart is loud. I cannot hear the mixes on my Creamfields 2014 CD. Mozart, you are a goner.
If you want to cook up some smart algorithms to pick music or drive your autonomous vehicle without crashing into a passenger carrying bus, navigate to “Top 10 Machine Learning Algorithms.”
The write up points out that just like pop music, there is a top 10 list. More important in my opinion is the concomitant observation that smart software may be based on a limited number of procedures. Hey, this stuff is taught in many universities. Go with what you know maybe?
What are the top 10? The write up asserts:
- Linear regression
- Logistic regression
- Linear discriminant analysis
- Classification and regression trees
- Naive Bayes
- K nearest neighbors
- Learning vector quantization
- Support vector machines
- Bagged decision trees and random forest
- Boosting and AdaBoost.
The article tosses in a bonus too: Gradient descent.
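To show how modest some of these recipes are, here is a minimal sketch which combines the first item on the list, linear regression, with the bonus item, gradient descent. It is plain Python with toy data of my own invention. The learning rate and step count are assumed values, the kind of "settings" an operator has to get right.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=5000):
    """Fit y = slope * x + intercept by gradient descent on mean squared error."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        grad_slope = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = 2 / n * sum(errors)
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


if __name__ == "__main__":
    # Toy data generated from y = 2x + 1.
    xs = [0, 1, 2, 3, 4]
    ys = [1, 3, 5, 7, 9]
    slope, intercept = fit_line(xs, ys)
    print(round(slope, 3), round(intercept, 3))
```

For this toy data a learning rate of 0.2 is too large and the iteration diverges instead of converging, which illustrates how one incorrect configuration value can change what a standard procedure produces.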
What is interesting is that there is considerable overlap with the list I developed for my lecture on manipulating content processing using shaped or weaponized text strings. How’s that, Ms. Null?
The point is that when systems use the same basic methods, are those systems sufficiently different? If so, in what ways? How are systems using standard procedures configured? What if those configurations or “settings” are incorrect?
Exciting.
Stephen E Arnold, April 8, 2016
ThomsonReuters: Palantir Not Enough Math?
April 6, 2016
I read “TRRI Users Will Gain Access to FiscalNote’s Legislative Modeling Techniques.” The licensee of Palantir Metropolitan and owner of the Westlaw smart software for legal eagles is pushing into new territory. That’s probably good news for stakeholders who have watched ThomsonReuters bump into a bit of a revenue ceiling in the last few years.
According to the write up:
The main benefit of the agreement [with FiscalNote] will grant Thomson Reuters’ Regulatory Intelligence (TRRI) newly extended capabilities across its predictive legislative analytics. TRRI is a global solution that helps clients focus and leverage their regulatory risk. Per the agreement, FiscalNote will help provide TRRI users with likelihood factors and other insights relegated to specifics pieces of legislative passage.
Interesting. I assumed that Palantir’s platform would have the extensibility to handle this type of content processing and analysis. Wrong again.
I learned:
FiscalNote utilizes machine learning and natural language processing in its modeling techniques that help it engineer models to conduct a host of analyses on open government data. In essence, these models allow FiscalNote to automatically analyze how legislation is going to yield any material impact via a combination of factors such as legislators, committee assignments, actions taken, bill versions, and amendments.
Wait, wait, don’t tell me. Westlaw’s smart software which can do many wonderful advanced text processing tricks is not able to perform in the manner of FiscalNote.
My hunch is that the deal has less to do with technologies, extensible or not, and more to do with getting some customers and an opportunity to find a way to pump up those revenues. Another idea: Is ThomsonReuters emulating IBM’s tactic of buying duplicative technology as a revenue rocket booster?
Perhaps Palantir and Westlaw should team up so ThomsonReuters’ customers have additional choices? Think of the XML slicing and dicing strategy with the intelligence and legal technology working in harmony.
Stephen E Arnold, April 6, 2016
Patents and Semantic Search: No Good, No Good
March 31, 2016
I have been working on a profile of Palantir (open source information only, however) for my forthcoming Dark Web Notebook. I bumbled into a video from an outfit called ClearstoneIP. I noted that ClearstoneIP’s video showed how one could select from a classification system. With every click, the result set changed. For some types of searching, a user may find the point-and-click approach helpful. However, there are other ways to root through what appears to be patent applications. There are the very expensive methods happily provided by Reed Elsevier and Thomson Reuters, two fine outfits. And then there are less expensive methods like Alphabet Google’s odd ball patent search system or the quite functional FreePatentsOnline service. In between, you and I have many options.
None of them is a slam dunk. When I was working through the publicly accessible Palantir Technologies’ patents, I had to fall back on my very old-fashioned method. I tracked down a PDF, printed it out, and read it. Believe me, gentle reader, this is not the most fun I have ever had. In contrast to the early Google patents, Palantir’s documents lack the detailed “background of the invention” information which the salad days’ Googlers cheerfully presented. Palantir’s write ups are slogs. Perhaps the firm’s attorneys were born with dour brain circuitry.
I did a side jaunt and came across a white paper from ClearstoneIP called “Why Semantic Searching Fails for Freedom-to-Operate (FTO).” The 12 page write up is about patent searching, and ClearstoneIP is a patent analysis company. The company, according to its Web site, is a “paradigm shifter.” The company describes itself this way:
ClearstoneIP is a California-based company built to provide industry leaders and innovators with a truly revolutionary platform for conducting product clearance, freedom to operate, and patent infringement-based analyses. ClearstoneIP was founded by a team of forward-thinking patent attorneys and software developers who believe that barriers to innovation can be overcome with innovation itself.
The “freedom to operate” phrase is a bit of legal jargon which I don’t understand. I am, thank goodness, not an attorney.
The firm’s search method makes much of the ontology, taxonomy, classification approach to information access. Hence, the reason my exploration of Palantir’s dynamic ontology with objects tossed ClearstoneIP into one of my search result sets.
The white paper is interesting if one works around the legal mumbo jumbo. The company’s approach is remarkable and invokes some of my caution light words; for example:
- “Not all patent searches are the same.”, page two
- “This all leads to the question…”, page seven
- “…there is never a single “right” way to do so.”, page eight
- “And if an analyst were to try to capture all of the ways…”, page eight
- “to capture all potentially relevant patents…”, page nine.
The absolutist approach to argument is fascinating.
Okay, what’s the ClearstoneIP search system doing? Well, it seems to me that it is taking a path to consider some of the subtleties in patent claims’ statements. The approach is very different from that taken by Brainware and its tri-gram technology. Now that Lexmark owns Brainware, the application of the Brainware system to patent searching has fallen off my radar. Brainware relied on patterns; ClearstoneIP uses the ontology-classification approach.
Both are useful in identifying patents related to a particular subject.
What is interesting in the write up is its approach to “semantics.” I highlighted in billable hour green:
Anticipating all the ways in which a product can be described is serious guesswork.
Yep, but isn’t that where a human with relevant training and expertise becomes important? The white paper takes the approach that semantic search fails for the ClearstoneIP method dubbed FTO or freedom to operate information access.
The white paper asserted:
Semantic
Semantic searching is the primary focus of this discussion, as it is the most evolved.
ClearstoneIP defines semantic search in this way:
Semantic patent searching generally refers to automatically enhancing a text-based query to better represent its underlying meaning, thereby better identifying conceptually related references.
I think the definition of semantic is designed to strike directly at the heart of the methods offered to lawyers with paying customers by Lexis-type and Westlaw-type systems. Lawyers to be usually have access to the commercial-type services when in law school. In the legal market, there are quite a few outfits trying to provide better, faster, and sometimes less expensive ways to make sense of the Miltonesque prose popular among the patent crowd.
The white paper, in a lawyerly way, describes the approach of semantic search systems. Note that the “narrowing” to the concerns of attorneys engaged in patent work is in the background even though the description seems to be painted in broad strokes:
This process generally includes: (1) supplementing terms of a text-based query with their synonyms; and (2) assessing the proximity of resulting patents to the determined underlying meaning of the text-based query. Semantic platforms are often touted as critical add-ons to natural language searching. They are said to account for discrepancies in word form and lexicography between the text of queries and patent disclosure.
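The two steps in the quoted passage can be sketched in a few lines. This is a toy under stated assumptions, not ClearstoneIP's system or any commercial platform: the synonym table is invented, and simple word overlap is a crude stand-in for "assessing the proximity" of a document to the query's meaning.

```python
import re

# Hypothetical synonym table; real systems use far larger lexical resources.
SYNONYMS = {
    "fastener": {"screw", "bolt", "rivet"},
    "attach": {"affix", "couple", "secure"},
}


def expand_query(query: str) -> set[str]:
    """Step 1: supplement the query terms with their synonyms."""
    terms = set(query.lower().split())
    for term in list(terms):
        terms |= SYNONYMS.get(term, set())
    return terms


def rank(query: str, documents: list[str]) -> list[str]:
    """Step 2 (simplified): order documents by overlap with the expanded query."""
    expanded = expand_query(query)
    scored = []
    for doc in documents:
        words = set(re.findall(r"[a-z]+", doc.lower()))
        overlap = len(expanded & words)
        if overlap:
            scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]
```

The sketch also shows the weakness the white paper leans on: a claim that describes the same product in words missing from the synonym table scores zero and never surfaces.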
The white paper offers this conclusion about semantic search:
it [semantic search] is surprisingly ineffective for FTO.
Seems reasonable, right? Semantic search assumes a “paradigm.” In my experience, taxonomies, classification schema, and ontologies perform the same intellectual trick. The idea is to put something into a cubby. Organizing information makes manifest what something is and where it fits in a mental construct.
But these semantic systems do a lousy job figuring out what’s in the Claims section of a patent. That’s a flaw which is a direct consequence of the lingo lawyers use to frame the claims themselves.
Search systems use many different methods to pigeonhole a statement. The “aboutness” of a statement or a claim is a sticky wicket. As I have written in many articles, books, and blog posts, finding on point information is very difficult. Progress has been made when one wants a pizza. Less progress has been made in finding the colleagues of the bad actors in Brussels.
Palantir requires that those adding content to the Gotham data management system add tags from a “dynamic ontology.” In addition to what the human has to do, the Gotham system generates additional metadata automatically. Other systems use mostly automatic systems which are dependent on a traditional controlled term list. Others just use algorithms to do the trick. The systems which are making friends with users strike a balance; that is, using human input directly or indirectly and some administrator only knowledgebases, dictionaries, synonym lists, etc.
ClearstoneIP keeps its eye on its FTO ball, which is understandable. The white paper asserts:
The point here is that semantic platforms can deliver effective results for patentability searches at a reasonable cost but, when it comes to FTO searching, the effectiveness of the platforms is limited even at great cost.
Okay, I understand. ClearstoneIP includes a diagram which drives home how its FTO approach soars over the competitors’ systems:
ClearstoneIP, © 2016
My reaction to the white paper is that for decades I have evaluated and used information access systems. None of the systems is without serious flaws. That includes the clever n gram-based systems, the smart systems from dozens of outfits, the constantly reinvented keyword centric systems from the Lexis-type and Westlaw-type vendors, even the simplistic methods offered by free online patent search systems like Pat2PDF.org.
What seems to be the reality of the legal landscape is:
- Patent experts use a range of systems. With lots of budget, many free and for fee systems will be used. The name of the game is meeting the client needs and obviously billing the client for time.
- No patent search system to which I have been exposed does an effective job of thinking like a very good patent attorney. I know that the notion of artificial intelligence is the hot trend, but the reality is that seemingly smart software usually cheats by formulating queries based on analysis of user behavior, facts like geographic location, and who pays to get their pizza joint “found.”
- A patent search system, in order to be useful for the type of work I do, has to index germane content generated in the course of the patent process. Comprehensiveness is simply not part of the patent search systems’ modus operandi. If there’s a B, where’s the A? If there is a germane letter about a patent, where the heck is it?
I am not on the “side” of the taxonomy-centric approach. I am not on the side of the crazy semantic methods. I am not on the side of the keyword approach when inventors use different names on different patents, Babak Parviz aliases included. I am not in favor of any one system.
How do I think patent search is evolving? ClearstoneIP has it sort of right. Attorneys have to tag what is needed. The hitch in the git along has been partially resolved by Palantir-type systems; that is, the ontology has to be dynamic and available to anyone authorized to use a collection in real time.
But for lawyers there is one added necessity which will not leave us any time soon. Lawyers bill; hence, whatever is output from an information access system has to be read, annotated, and considered by a semi-capable human.
What’s the future of patent search? My view is that there will be new systems. The one constant is that, by definition, a lawyer cannot trust the outputs. The way to deal with this is to pay a patent attorney to read patent documents.
In short, like the person looking for information in the scriptoria at the Alexandria Library, the task ends up as a manual one. Perhaps there will be a friendly Boston Dynamics librarian available to do the work some day. For now, search systems won’t do the job because attorneys cannot trust an algorithm when the likelihood of missing something exists.
Oh, I almost forgot. Attorneys have to get paid via that billable time thing.
Stephen E Arnold, March 30, 2016