Searching Google Patent Documents with ISYS Version 9

October 13, 2008

After my two lectures at the Enterprise Search Summit in San Jose, California, in mid-September 2008, I had two people write me about my method for figuring out Google patent documents. Please appreciate that I can’t reveal the tools that my team has developed and that I use. These are my secret sauce, but I can describe the broad approach and provide some detail about what Constance, Don, Stuart, and Tony do when I have to cut through the “fog of transparency” and lava lamp light emanating from Google.

Background

Google generates a large volume of technical information and comparatively modest amounts of patent-related documents. The starting point, therefore, is a fact that catches my attention. One client sent two people to “watch” me investigate a technical topic. After five days of taking notes, snapping digital photos, and reviewing the information that I have flowing into my Harrod’s Creek, Kentucky, offices, the pair gave up. The procedure was easily flow charted, but the identification of an important and interesting item was a consequence of effort and years of grunting through technical material. Knowing what to research, it seems, is a matter of experience, judgment, and instinct.

The two “watchers” looked at the dozens of search, text mining, and content utilities I had on my machines. The two even fiddled with the systems’ ability to pattern match using n-gram technology, entity extraction using 12-year-old methods that some companies still find cutting edge, and various search systems from companies still in business as well as those long since bought out or simply shut down.

Here’s the big picture:

  1. Spider and collect information via various push methods. The data may be in XML, PDF, or other formats. The key point is that everything I process is open source. This means that I rely on search engines, university Web sites, government agencies with search systems that are prone to time outs, and postings of Web logs. I use exactly the same data that you can use when you run a query on any of the more than 500 systems listed here. This list is one of the keys to our work because none of the well known search systems index “everything”. The popular search engines don’t even come close. In fact, most don’t go more than two or three links deep for certain Web sites. Do some exploring on the US Department of Energy Web site, and you will see what I mean. The key is to run the query across multiple systems and filter out duplicates. Software and humans do this work, just as humans process information at national security operations in many countries. (If you read my Web log, you will know that I have a close familiarity with systems developed by former intelligence professionals.)
  2. Take the filtered subset and process it with a search engine. The bulk of this Web log post describes the ISYS Search Software system. We have been using this system for several years, and we find that it is a quick indexer, so we can process new information quickly.
  3. Subset analysis. Once we have a cut from the content we are processing, then we move the subset into our proprietary tools. One of these tools runs stored queries or what some people call saved searches against the subset looking for specific people and things. My team looks at these outputs.
  4. I review the winnowed subset, and, as time allows, I involve myself in the preceding steps. Once the subset is on my machine, I have to do what anyone reviewing patents and technical documents must do. I read these materials. No, I don’t like to do it, but I have found that doing consistently the dog work that most people prefer to dismiss as irrelevant is what makes it possible for me to “connect the dots”.
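
The first three steps above can be illustrated with a minimal sketch. This is not the proprietary tooling described in the post, and it does not use ISYS; it is a toy illustration, assuming each source system returns hits as dictionaries with hypothetical `url` and `text` keys. The function names `normalize_url`, `fingerprint`, `merge_results`, and `run_stored_queries` are invented for this example.

```python
import hashlib
import re
from urllib.parse import urlsplit


def normalize_url(url):
    """Lower-case the scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    query = "?" + parts.query if parts.query else ""
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}{query}"


def fingerprint(text):
    """Hash the text after collapsing whitespace and case differences."""
    collapsed = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha1(collapsed.encode("utf-8")).hexdigest()


def merge_results(result_lists):
    """Merge hits from several systems, keeping the first copy of each item."""
    seen_urls, seen_text, merged = set(), set(), []
    for hits in result_lists:
        for hit in hits:
            url_key = normalize_url(hit["url"])
            text_key = fingerprint(hit["text"])
            if url_key in seen_urls or text_key in seen_text:
                continue  # duplicate by address or by content
            seen_urls.add(url_key)
            seen_text.add(text_key)
            merged.append(hit)
    return merged


def run_stored_queries(docs, stored_queries):
    """Return {query name: [urls]} for docs that contain every term of a query."""
    matches = {}
    for name, terms in stored_queries.items():
        matches[name] = [
            d["url"]
            for d in docs
            if all(t.lower() in d["text"].lower() for t in terms)
        ]
    return matches


# Hypothetical hits from two different search systems.
system_a = [{"url": "http://Example.com/doc1/", "text": "Google data center cooling"}]
system_b = [
    {"url": "http://example.com/doc1", "text": "Google data center   cooling"},
    {"url": "http://example.org/doc2", "text": "Microsoft Monsoon build out"},
]

subset = merge_results([system_a, system_b])
flagged = run_stored_queries(subset, {"cooling": ["google", "cooling"]})
print(len(subset), flagged)
```

Exact-match fingerprints only catch verbatim copies; near-duplicate detection, and the human review the post emphasizes, would need more than this sketch provides.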

Searching

There’s not much to say about running queries and collecting information that comes via RSS or other push technologies. We get “stuff” from open sources, and we filter out the spam, duplicates, and uninteresting material. Let’s assume that we have information regarding new Google patent documents. We get this information pushed to us, and these are easy to flag. You can navigate to the USPTO Web site and see what we get. You can pay commercial services to send you alerts when new Google documents are filed or published. You can poke around on the Web and find a number of free patent services. If you want to use Google to track Google, then you can use Google’s own patent service. I don’t find it particularly helpful, but Google may improve it at some point in the future. Right now, it’s on my list, but it’s like a dull but well meaning student. I let the student attend my lectures, but I don’t pay much attention to the outputs. If you want some basic information about patent documents, click here.


Narrowed result set for a Google hardware invention related to cooling. This is an image generated using ISYS Version 9, which is now available.

Before Running Queries

You can’t search patent documents and technical materials shooting from the hip. When I look for information about Google or Microsoft, for instance, I have to get smart with regard to terminology. Let me illustrate. If you want to find out how Microsoft is building data centers to compete with Google, you will get zero useful information with this type of query on any system: “Microsoft” and “data centers”. My actual queries are more complex and use nesting, but this test query is one you can use on Microsoft’s Live.com search. Now run the same query for “Microsoft Monsoon”. You will see what you need to know here. If you don’t know the code word “Monsoon”, you will never find the information. It’s that simple.


Mark Logic and Basis Technology

October 13, 2008

Browsing the Basis Technology Web site revealed an October 7, 2008, news release about a Basis Technology and Mark Logic tie up. You can read the news release here or here. Basis Technology licenses text and content processing components and systems. The Basis Technology announcement says “Rosette Entity Extractor provides advanced search and text analytics for MarkLogic Server 4.0.” Mark Logic, as I have noted elsewhere in this Web log, is one of the leading providers of XML server technology. The new version can store, manage, search, and deliver content in a variety of forms to individual users, other enterprise systems, or to devices. REX (shorthand for Rosette Entity Extractor) can identify people, organizations, locations, numeric strings such as credit card numbers, email addresses, geographic data, and other items such as dates from unstructured or semi structured content. I don’t have details on the deal. My take on this is that Mark Logic wants to put its XML engine into gear and drive into market spaces not now well served with applications and functions in other vendors’ XML systems. Enterprise search is dead. Long live more sophisticated information and data management systems. Search will be tucked in these solutions, but it’s no longer the focal point of the system. I am pondering the impact of this announcement on other XML vendors and upon such companies as Microsoft Fast Search.
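
To make the idea of pulling items such as email addresses, card-like numeric strings, and dates out of unstructured text concrete, here is a toy sketch. It is not REX and says nothing about how Basis Technology's product works; commercial extractors handle people, organizations, and locations with far more than pattern matching. The `PATTERNS` table and `extract_items` function are hypothetical names, and the regular expressions are deliberately simple assumptions.

```python
import re

# Deliberately simple patterns; real extractors are much more thorough.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "card_like_number": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
    "date": re.compile(
        r"\b(?:January|February|March|April|May|June|July|August|"
        r"September|October|November|December) \d{1,2}, \d{4}\b"
    ),
}


def extract_items(text):
    """Return every match for each pattern, keyed by the pattern's label."""
    return {label: pattern.findall(text) for label, pattern in PATTERNS.items()}


sample = (
    "Contact info@example.com about the October 7, 2008 release. "
    "Test card: 4111 1111 1111 1111."
)
print(extract_items(sample))
```

Pattern matching of this kind finds strings with a predictable shape. Names of people and organizations have no such shape, which is one reason entity extraction is sold as a product.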

Stephen Arnold, October 13, 2008

Smart Money versus Start Ups

October 12, 2008

Matt Marshall’s “Expect to See Start Ups and VCs Hit Standoff over Valuations” is a very important article. The piece appeared in Venture Beat on October 10, 2008. The hook for the story is that a lousy market puts VCs at odds with the companies these firms funded. Mr. Marshall provides useful color about the different approaches some VC firms use when looking for the next Google. The examples are ones you will want to tuck in your notebook. Insider info like Mr. Marshall’s is hard to find.

For me, the most interesting comment in the article was this passage:

It may be next year before we can give a serious assessment of the true fallout for start-ups. Expect to see more companies go out of business too, as VCs in some cases decide not to invest at all.

The impact on start ups will be immediate and continue for at least a year. I agree.

So, what’s the impact on search and content processing? I will be giving this subject considerable thought in the weeks and months ahead, but I have some preliminary thoughts. I want to capture these before they dissolve from this addled goose’s mind. Also, keep in mind that I may change my views as I obtain more data and do more critical thinking. Here goes:

  1. Backlash. I think there will be a backlash against consultants who promise quick, easy, and cheap fixes to problems with search, content processing, and content management systems. The notion that a dab of Neosporin and a bit of tape will make the pain of flawed information systems go away will be dismissed out of hand. Martin White and I have written 250 pages that explain the methodical approach needed to back out of a search system disaster. A big problem cannot be resolved overnight, so management expertise in budgeting and controlling work becomes more important than “recipes” or “silver bullets”. Tough times demand management resolve, not placebos and truisms.
  2. Push back. Companies offering “platform” solutions that are not really platforms will have a difficult time closing new deals. In fact, I think the economic climate will encourage organizations to seek point solutions that can, if warranted, be scaled to handle larger jobs.
  3. Protectionism. Vendors will escalate their efforts to create lock ins for their existing customers and whenever possible set up deals that lock out competitors. I learned about one large company that is solidly Microsoft and the procurement team is looking only to Microsoft for a solution. The goal is “one throat to choke” for the customer. For Microsoft, it is control of the account. The problem is that Microsoft does not have a solution that will work, so the loser in this deal will be the naive licensee who will spend millions and end up with the same information problem. The goals of each party deliver a problem wrapped in what looks like an ideal solution. The fur will fly in 18 months. Today, customer and vendor are drinking to one another’s health.
  4. Attrition. I encounter too many entrepreneurs who believe their approach to search is the “next big thing.” In most cases, these companies will find that revenues will be tough to generate. I talked with one company three weeks ago and encountered paranoia about my call. The irony of this call is that its purpose was to put the company in a major consulting firm’s “watch” database. The call was, therefore, a “good news” call, but the business owner heard only the veiled threats his own mind whispered in his ear. The issue was resolved, but this “fear” will close off opportunities for some companies, leading to less likelihood for revenue magnetism. Fear and paranoia are not as appealing in tough economic times; pragmatism and common sense are pretty charming in my opinion.
  5. Skepticism. Prospects won’t believe much of what some vendors say unless the vendor is already in the fox hole with the customer.
  6. Baloney. Lots of Buffy and Trent marketing and PR information will be generated. I wish I was 23, filled with energy, and able to invent new buzzwords to describe functions and operations that are 50 years old.

If you want to add or modify the items on this preliminary list, please, use the comments section for this Web log. Don’t write me directly. I am on the road, returning to the US in about nine or 10 days. My email systems perform miserably when out of the country, but the Web log system is pretty reliable.

Stephen Arnold, October 12, 2008

Google and Customer Service: The Math Club Syndrome

October 12, 2008

Rhodri Marsden, writing in The Independent, gnaws on a topic that I have heard quite a few people discuss. Mr. Marsden’s article “Cyberclinic: Why Is It Impossible to Contact Google by Telephone” ran on October 8, 2008, but the story just made its way to rural Kentucky. You can read the opinion piece here. The point of the story is that people can’t get a Googler on the phone. For me, the most interesting comment in the article was:

‘The New York Times’ reported this week on the efforts of their readers to phone Google to resolve problems with their GMail service. After their emails went unanswered, they scoured the website for a contact number – but to no avail.

My thought was, “So, this is a surprise?” Mr. Marsden doesn’t understand what I call Math Club Syndrome, which I will explain in a moment.

I heard in early September 2008 that Google paid a big wheel consulting firm to provide some insight into Google’s sales activities. I haven’t seen the report, nor do I know if what I heard was true. I offer this as a possible Google action, not a fact. Here’s the story: the person with whom I was speaking told me that the big wheel consulting firm delivered its findings. Among these findings were these recommendations: return phone calls, keep appointments for meetings, and follow up.

I have had very few direct dealings with the GOOG. The most interesting was the interaction with the firm when my Programmable Search Engine piece appeared as a Bear Stearns note. The Googler calling asserted that the information in the PSE write up was confidential. We faxed the patent application numbers and the Googler dropped the matter. I concluded that Google doesn’t know what Google itself knows.

In my opinion, Google will change, but change won’t come too quickly.

In a feeble attempt at humor, I characterize Google’s approach as the Math Club Syndrome. Here’s what I mean.

In math club, it is reassuring to be with people who understand math and those who like math, maybe like math more than sports, members of the opposite sex, parents, and the latest fashion trends.

As a former math club member, I remember the fun we had. And nothing elicited enjoyment and challenge like a reference to the Euler-Mascheroni constant. Yes! Do you feel the thrill?

So now you know what the Math Club Syndrome means.

If you don’t get it, you don’t belong. In my high school math club, people who didn’t get math were losers. Snort. Snort. We knew that there was no extra credit for working with people who were not in the Math Club.

My hunch is that this Math Club Syndrome influences Google’s approach to customer support. If you get it, you don’t need support. If you don’t get it, it’s not Google’s job to teach you to be Googley.

I was asked last week about the person Cyrus to whom I refer in my semi-humorous Web log posts. Well, Cyrus is (maybe was?) a Googler who told my son and the president of a search company in Utah that I faked (his word was allegedly “photoshopped”) a representation of Google’s little known dossier report output.

Brilliant and sparkling Cyrus did not know that Google put the shoddy and blurry image in a public Google patent document. Did Cyrus apologize?

No way, dude. Cyrus will go through his Googley life convinced that Mother Google would never create such a crappy graphic.

That’s the Math Club Syndrome, and that’s why Mr. Marsden, the New York Times, the Department of Agriculture, and others can’t get a Googler on the phone.

But there’s progress.

Before Google abandoned its exhibit at the Enterprise Search Summit in San Jose, there were six Googlers talking among themselves in the booth. Now the Googlers were not reaching out to attendees, but the Googlers were present and deeply engaged with one another just like my high school math club get togethers. On the third day of the show, no Googlers. So, staffing the exhibit two of the three days is a form of progress. Returning phone calls is right around the corner.

Agree? Disagree? Help me learn.

Stephen Arnold, October 12, 2008

Google and Publishers

October 12, 2008

Book2Book (BookTrade.info) reported on October 9, 2008, that Google may be smoking the peace pipe with traditional publishers, well, some traditional publishers. You can read “Settlement in Google Lawsuit Appears Near” here. This is a short news item, and I don’t want to quote anything from the story so I can avoid being pinged for ripping off another person’s content. Google has been scanning books for several years. Publishers don’t like that idea for many reasons. The lawsuit was an attempt by the copyright holders to keep the GOOG at bay. Now, if the Book2Book news item is on target, Google and the publishers may have a deal soon. Read the story; decide for yourself.

Here’s my take:

  1. Google is saying that it is not a publisher. Anyone with knowledge or Knol can figure out that Google intakes original content and outputs it. That sure seems like publishing to me.
  2. A deal with Google does little to stop Google from becoming the Internet and online. Publishers can’t stop this, and the legal tactics over the last couple of years to stop Google have had zero impact. Any financial deal with Google is going to be too little too late. The publishers’ children, just like my neighbors’ children, use Google. Google has, for a certain demographic, won and will keep winning.
  3. Publishers have been unable to adjust their business model. It is no longer an issue of technology; publishers are working in a frame. Google is outside the frame. If a deal emerges, I can visualize sitting in the audience at a publisher conference hearing speakers explain how the deal will protect the publishers’ franchise. This is what I call the imaginary number problem. If you don’t know about the square root of minus one, then it’s tough to understand the solution to certain problems.

What is your view of Google’s relationship with publishers? What am I missing?

Stephen Arnold, October 12, 2008

Oracle Looking to Acquire: Is Autonomy a Candidate

October 12, 2008

ITWorld ran Chris Kanaracus’ “Ellison Strikes Bullish Tone at Shareholder Meeting”, which contained a reference to Larry Ellison’s view that the economy can help Oracle grow. You can read the ITWorld story here. Bloomberg.com picked up on the point. In Vivek Shankar’s and Rochelle Garner’s story “Oracle Holders’ Vote Shows Dissent on Ellison’s Pay” here, this comment caught my attention:

Ellison said today he plans to boost profit with stock buybacks and acquisitions. He owned 1.15 billion shares as of Feb. 15, according to Bloomberg data. “We are better positioned than our peers, the other software companies, to do well in tough times,” Ellison said, referring to the credit and financial market crises. “There are opportunities to make acquisitions that would cost less.”

The notion of taking advantage of the cratering of the financial market and its subsequent shock waves triggered my speculative gene this fine Saturday morning in rural Kentucky. Here are the highlights of the notes I jotted down as I watched the geese, wolves, and other beasties frolic in the misty hollow:

  1. Oracle has not made much visible progress in information access–also known as search–in the last 18 months. Could Oracle reignite its passion for Autonomy? Autonomy has a beefy customer base, OEM “annuity” revenue, and a foot in the door in the eDiscovery sector. You can click here to see the dip in Autonomy’s share price in the last week. If the shares continue to drop, Autonomy could become affordable. Prior to the crash, the share price made the company financially unpalatable according to my back of envelope calculations.
  2. Could Oracle start buying Google partners who are gaining traction? I was surprised to learn that Oracle is selling Google Search Appliances. I don’t have much detail about this tie up, but if the Google Search Appliance is able to drive more Oracle consulting and application revenue, why not roll up some of the partners? Adhere Solutions, a hot partner, is one example of a candidate for this type of play. Some Google partners are listed here.
  3. Oracle R&D has not had the impact on the expanding data management market that companies like Aster Data and InfoBright have had. InfoBright has hooked up with Hewlett Packard, and it might be time for Oracle to start thinking about next generation data issues from an angle that is more agile than the traditional Oracle database.

I had several other thoughts. The theme in my speculation is that Oracle is going to have to take some positive action because my research suggests that a roiling of the traditional database world may be coming. The trigger point may be none other than Oracle’s neighbors in Mountain View, California. Google, based on my analysis of some 2007 and 2008 technical papers, may be poised to pull a leap frog in enterprise data management.

Will Oracle buy search, partners and service firms, data management, or move in some other direction? I don’t know, but if the company wants to take advantage of the present financial climate, Oracle will have to move before even more severe shocks rock the landscape.

Stephen Arnold, October 11, 2008

Ask: In Trouble

October 11, 2008

CNet ran a Web 2.0 deathwatch article on October 10, 2008. The company that caught my attention was Ask.com, Barry Diller’s killer search system. Well, Mr. Diller terminated with extreme prejudice Jeeves, the cartoon logo that I liked. Bang, Jeeves. Now, according to Rafe Needleman, Ask.com may be troubled. You can read his story “11 Troubled Web Companies: The Next Kozmos?” here. The write up about Ask.com is short, so I won’t quote any of Mr. Needleman’s material. My question is, “Is this a surprise?” Maybe I should start my own search deathwatch? I have some patients in mind. Just a thought. Just a thought.

Stephen Arnold, October 11, 2008

Aberdeen on Search: Data That Surprises

October 11, 2008

Aberdeen Group published a research brief this week (October 6, 2008). You can read a brief description of the report here. I was clicking on the page, and I was able to download the six page document from this page. I scanned the document, and for me the most interesting comment in the report was:

Not surprisingly, Google appears to be the most dominant vendor amongst our survey respondents with 29 percent of respondents indicating they use Google for enterprise search…

I found this sentence’s use of the phrase “not surprisingly” quite remarkable. I am not sure that Google perceives itself as the leader in enterprise search. That distinction goes to SharePoint with 100 million licenses, according to my sources. I found the figure of 29 percent intriguing as well. Most organizations have multiple search systems, so any break out will total more than 100 percent, and most of the systems generate significant dissatisfaction among their users. So the Google penetration raised my eye brows. I think the numbers are out of sync with the data I have collected for my search analyses. But that’s statistics.
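
The point that a break out will total more than 100 percent when organizations run several search systems is simple arithmetic. The sketch below uses invented respondents and placeholder system names; none of these numbers come from the Aberdeen report or from my own data.

```python
from collections import Counter

# Hypothetical survey: each respondent lists every search system in use.
responses = [
    {"System A", "System B"},
    {"System A"},
    {"System B", "System C"},
    {"System A", "System C"},
]


def usage_shares(answers):
    """Percent of respondents who report using each system."""
    counts = Counter(system for answer in answers for system in answer)
    total = len(answers)
    return {system: 100 * n / total for system, n in counts.items()}


shares = usage_shares(responses)
print(shares)
print(sum(shares.values()))
```

With four respondents naming seven systems among them, the shares add up to 175 percent. A single vendor’s percentage therefore says little about whether that vendor is the primary system in any organization.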

Check out the report. You may find the thumbnail descriptions of the search leaders amusing. Yahoo is in the list. Yahoo! Wow.

Stephen Arnold, October 11, 2008

SharePoint Search How To

October 11, 2008

A happy quack to the reader who alerted me to TechRepublic’s pointer to a Microsoft search how-to document. The Word file is 40 pages long, and you can start the download process (which is pretty annoying) here. Once you have signed up for a Microsoft sales professional to contact you or my colleague Tess the Boxer, you can read “Deploying and Supporting Enterprise Search. Using SharePoint Server 2007 to Help Employees Locate Information and People at Microsoft.” The white paper was written in July 2007, not quite a year before Microsoft bought the outstanding search airplane kit, Fast Search & Transfer. The paper is blissfully unaware of Microsoft Office SharePoint Search, but you may find some interesting information in the white paper. In fact, I think I read this paper. I know I have used it as a source in my Microsoft Data Center Architecture lecture. Heck, I may have written a Web log post about it. Nevertheless, let me highlight a couple of gems from the document:

  1. Microsoft had indexed at the time the paper was written 27 million documents. With SharePoint search struggling when asked to index about 50 million documents, I wonder how Microsoft indexes its email. Microsoft’s employees must generate a significant volume of email, and I wonder if SharePoint processes this content or omits it. My guess is that Microsoft hopes Fast Search & Transfer can process document collections that exceed SharePoint’s current ceiling. See page 19 for what Microsoft indexed in May 2007.
  2. Search when this paper was written did not permit a query across all content in SharePoint installations in world wide Microsoft. My guess is that federating search across SharePoint servers is too expensive and compute intensive even for Microsoft. See pages 13-15 for more information about the Microsoft implementation of SharePoint.
  3. The administrative task summary is remarkable, comprising the bulk of the document. These tasks remind me to get a notebook, write down what was done, make screenshots, print them out, paste them in the notebook, and keep it handy.


The search interface in May 2007. I don’t know what I am looking at or what I am supposed to be seeing.

If you are a SharePoint engineer, you may want to download this paper. Put it in your configuration notebook. It may come in handy.

Stephen Arnold, October 11, 2008

Yahoo BOSS: Not with Much Authority

October 11, 2008

I have been identifying technical challenges for Yahoo for a number of years. The core of my concern is that Yahoo has collected search systems the way my mother hoarded figurines. To recap: Yahoo had a primitive search system for its directory (now long gone), then Yahoo bought Inktomi, then Yahoo bought Overture and got Overture’s home grown search system and AltaVista.com, then Yahoo bought AllTheWeb.com from Fast Search & Transfer, then Yahoo licensed InQuira for customer support, then Yahoo bought Stata Labs, then Yahoo cut a deal with X1 for desktop search, then Yahoo divorced X1 and married IBM OmniFind (Lucene), then Yahoo bought Flickr and got its search system, then Yahoo bought Delicious.com, recoded it and its search system (a two year project), and … now I’m tired. I don’t think I have these in chronological order. I skipped Yahoo Mindset and the semantic search system and probably two or three other systems.

Why is this a big deal? Cost. If an engineer knows something about Stata Labs and the Overture team needs some help, the skills are not transferable across heterogeneous systems.

The article about Yahoo BOSS that triggered my creating a variant of Homer’s list of ships in the Iliad is by Stephen Shankland, and it is quite good. You can read his story “Academics Sink Teeth into Yahoo Search Service” here. Mr. Shankland reviews the purpose of BOSS, a play by Yahoo to get more traction for the company’s Web search service. With some clever people building on the Yahoo index, Yahoo hopes to pump up its ad revenues. The twist in the lariat, however, is that Google has a Web search market share of 65 percent, maybe higher depending on which research firm’s data one chooses to consult. Also, Yahoo has fumbled a deal with Microsoft. With its shares below $15, Yahoo doesn’t command much authority in Web search, with investors, or with me. Mr. Shankland notes that academics are liking the Yahoo BOSS service.

For me, the most interesting comment was this one:

“We’re not a market leader,” said Prabhakar Raghavan, chief strategist for Yahoo Search. “From a strategic standpoint, it does make sense to let other people innovate on top of us. If the pie grows, our share of the pie grows at the expense of somebody else.”

Baloney. By the time the pie grows, Google will have eaten more. Google is growing at Yahoo’s, Microsoft’s, and Ask.com’s expense. Even with the crash this week, the GOOG continues to chomp away at other vendors’ market share.

Let me capture my thoughts about the choral singing of Kumbaya by Yahoo and wizards-in-training from a number of prestigious academic institutions:

  1. Costs. Yahoo is going to have to find a way to create a more homogeneous approach to search. If one of these wizards-in-training hits a home run, Yahoo is going to need a way to scale which means money. With heterogeneous search systems, something’s got to give. That “something” will be cost control.
  2. Time. Time is running out. Yahoo has made minimal progress in its Web traffic race with Google. Instead of focusing, Yahoo has fiddled as its market share has burned. Yahoo has its own Nero, and I’m not sure how much longer the head Yahoo is going to be left in power.
  3. Google. Cutting a deal with Google seemed like a great way to get out of the Microsoft deal. Now the Google deal is not a reality, and if it is or it isn’t, Yahoo is too far gone to be a threat. The company will become a GM to Google’s Toyota. If the last six months suggest Yahoo’s strategic strength, Yahoo is going to be Studebaker.

In short, Yahoo has got to do more than BOSS (a name that suggests control and superiority) to convince me that it has much of a future. In fact, when I hear boss I think of deboss, a word that denotes stamping a hole into a surface.

Stephen Arnold, October 11, 2008
