Web Scraping Made Easier
October 14, 2020
If one has a list of Web sites of interest, how can a person suck down the content? In the consumerized world of business intelligence, the answer is, “Every day.” Crawling Web sites is getting easier. Scrape Owl wants to make the process ever simpler. For as little as $30 per month you can chug along with 250,000 API calls and 10 concurrent requests. If you need more scraping horsepower, sign up for the $100 per month service. You use Scrape Owl’s API and can set custom headers and cookies and inject JavaScript. The more expensive service also allows proxies. These can be helpful under certain circumstances because some site operators are not keen on slurping. For more information, navigate to Scrape Owl’s FAQ page and take your first steps to becoming the new Google.
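To make the plan limits concrete, here is a minimal sketch of a crawl capped at 10 requests in flight, with custom headers and a cookie attached to each request. Everything named in it is an assumption for illustration: the URLs, the header values, and the `fetch` stub are made up, the stub makes no network calls, and none of this is Scrape Owl's actual API.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request

MAX_CONCURRENT = 10        # concurrency cap quoted for the $30 plan
MONTHLY_CALLS = 250_000    # API calls quoted for the $30 plan
MONTHLY_PRICE_USD = 30

def build_request(url):
    """Build a request carrying custom headers and a cookie (values are made up)."""
    return Request(url, headers={
        "User-Agent": "example-crawler/0.1",
        "Cookie": "session=example",
    })

def fetch(request):
    """Hypothetical stand-in for a real HTTP call; returns the URL it would fetch."""
    return request.full_url

def crawl(urls):
    """Process every URL with at most MAX_CONCURRENT requests in flight."""
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
        return list(pool.map(fetch, map(build_request, urls)))

sites = [f"https://example.com/page/{i}" for i in range(25)]
results = crawl(sites)

# Price per call at the quoted entry-level numbers.
cost_per_call = MONTHLY_PRICE_USD / MONTHLY_CALLS
```

At the quoted numbers, the entry plan works out to about $0.00012 per API call. A real crawler would replace the stub with an actual HTTP client and respect each site's terms.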
Stephen E Arnold, October 14, 2020
2020: Reactive, Semi-Proactive, and Missing the Next Big Thing
July 27, 2020
I wanted to wrap up my July 28, 2020, DarkCyber this morning. Producing my one-hour prerecorded lecture for the US National Cyber Crime Conference sucked up my time.
But I scanned two quite different write ups AFTER I read “Public Asked To Report Receipt of any Unsolicited Packages of Seeds.” Call me suspicious, but I noted this passage in the news release from the Virginia Department of Agriculture and Consumer Services:
The Virginia Department of Agriculture and Consumer Services (VDACS) has been notified that several Virginia residents have received unsolicited packages containing seeds that appear to have originated from China. The types of seeds in the packages are unknown at this time and may be invasive plant species. The packages were sent by mail and may have Chinese writing on them. Please do not plant these seeds.
And why, pray tell? What’s the big deal with seeds possibly from China, America’s favorite place to sell soy beans? Here’s the key passage:
Invasive species wreak havoc on the environment, displace or destroy native plants and insects and severely damage crops. Taking steps to prevent their introduction is the most effective method of reducing both the risk of invasive species infestations and the cost to control and mitigate those infestations.
Call me suspicious, but the US is struggling with the Rona or what I call WuFlu, is it not? Now seeds. My mind suggested from parts unknown that perhaps, just perhaps, the soy bean buyers are testing another bio-vector.
As the other 49 states realize that they too may want to put some “real” scientists to work examining the freebie seeds, I noted two other articles.
I am less concerned with the intricate arguments, the charts, and the factoids and more about how I view each write up in the context of serious thinking about some individuals’ ability to perceive risk.
The first write up is by Benedict Evans, a former Andreessen Horowitz partner. The title of the essay is “Regulating Technology.” The article explains that technology is now a big deal, particularly online technology. The starting point is 1994, which is about 20 years after the early RECON initiatives. The key point is that regulators have had plenty of time to come to grips with unregulated digital information flows. (I want to point out that those in Mr. Evans’ circle tossed accelerants into the cyberfires which were containable decades ago.) My point is that current analysis makes what is happening so logical, just a half century too late.
The second write up is about TikTok, the Chinese centric app banned in India and accused of the phone home tricks popular among the Huawei and Xiaomi crowd. The point of “TikTok, the Facebook Competitor?” seems to be that TikTok has bought its way into the American market. The same big tech companies that continue to befuddle analysts and regulators took TikTok’s cash and said, “Come on down.” The TikTok prize may be a stream of free flowing data particularized to tasty demographics. My point is that this is a real time, happening event. There’s nothing like a “certain blindness” to ensure a supercharged online service will smash through data collection barriers.
News flash. The online vulnerabilities (lack of regulation, thumb typing clueless users, and lack of meaningful regulatory action) are the old threat vector.
The new threat vector? Seeds. Bio-attacks. Bio-probes. Bio-ignorance. Big, fancy thoughts are great. Charts are wonderful. Reformed Facebookers’ observations are interesting. But the now problem is the bio thing.
Just missing what is in front of their faces maybe? Rona masks and seed packets. Probes or attacks? The motto may be a certain foreign power’s willingness to learn the lessons of action oriented people like Generals Curtis LeMay or George Patton. Add some soy sauce and stir in a cup of Sun Tzu. Yummy. Cheap. Maybe brutally effective?
So pundits and predictive analytics experts, analyze but look for the muted glowing of a threat vector beyond the screen of one’s mobile phone.
Stephen E Arnold, July 27, 2020
Another Dust Up: A Consequence of Swisherism?
July 3, 2020
I associate Silicon Valley journalism with the dynamic duo of Swisher and Mossberg. The Walt has retired from the field of battle—almost. Kara Swisher sallies forth. The analytic approach taken by the “I” journalist has had a significant impact on others who want to reveal the gears, levers, and machine oil keeping the Silicon Valley factories running the way their owners and bankers intended.
Hence, Swisherism which I define as:
A critical look at Silicon Valley as a metaphor for the foibles of individuals who perceive themselves as smarter than anyone else, including those not in the room.
A good example of Swisherism’s consequences appears in “Silicon Valley Elite Discuss Journalists Having Too Much Power in Private App.” The write up is like a techno anime fueled with Jolt Cola.
For example:
During a conversation held Wednesday night on the invite-only Clubhouse app—an audio social network popular with venture capitalists and celebrities—entrepreneur Balaji Srinivasan, several Andreessen Horowitz venture capitalists, and, for some reason, television personality Roland Martin spent at least an hour talking about how journalists have too much power to “cancel” people and wondering what they, the titans of Silicon Valley, could do about it.
This is inside baseball given a dramatic twist. Big names (for some I suppose). A country-club app for insiders. An us versus them plot line worthy of Homer. The specter of retribution.
Yikes.
Even more interesting is that the article references a “recording” of what may have been perceived as a private conversation.
There’s nothing to inspire confidence like leaked recordings, right?
There is a sprinkling of foul language. A journalist becomes the target of interest. There is loaded language like “has been harassed and impersonated” to make sure that the reader understands the badness of the situation.
Swisherisms? Sort of, but the spirit is there. The underdog needs some support. Pitch in. Let’s make attitudes “better.” Rah rah.
I particularly like the use of Twitter as a weapon of myth destruction:
Lorenz’s tweet was immediately tweeted about by several Silicon Valley venture capitalists, most notably Srinivasan, who eventually made a seven-tweet thread in which he suggested Lorenz, and journalists like her, are “sociopaths.” That same day, a self-described Taylor Lorenz “parody” Twitter account started retweeting Srinivasan and other tech investors and executives critical of her work. The account’s bio also links to a website, also self-described as parody, which is dedicated to harassing Lorenz. (Twitter told Motherboard it deleted another account for impersonating Lorenz.)
“Lorenz” is the journalist who became the windmill toward which the Silicon Valley elite turned their digital lances.
Net net: Darned exciting. New type of “real” journalism. That’s the Swisherism in bright regalia. Snarkiness, insults, crude talk, and the other oddments of Silicon Valley excitement. No one likes constructive criticism it seems. Politics, invective, overt and latent hostility, and a “you should do better” leitmotif. Sturm und drang to follow? Absolutely.
Stephen E Arnold, July 3, 2020
Business Intelligence: When Case Studies Are Not
June 25, 2020
A case study in the good old days was a little soft, a little firm, and a lot mushy. The precise definition of a case study is “it depends.” The problem is that case studies are often not easily duplicated. The data collection methods vary because many organizations do not keep data or, if kept, do not maintain data in a consistent manner. There’s a bright young sprout who wants data in a format unintelligible to other people and maybe systems.
Other minor potholes wander through thickets of subjectivity and into the mysterious world of sparse data. Ever heard, “Well, we don’t have that data, but we can take the inverse of these data and use them.”
The silly idea of answering who, what, why, when, where, and how is often discarded because the information is not available, secret, or just too much work. Just because you know. Meetings.
I thought about the murky world of case studies when I read “5 Valuable Business Intelligence Use Cases for Organizations.” First, there is the word “valuable.” Second, there is the phrase “business intelligence.” Third, there is the jargon “use cases.” Examples is a useful word. Why not employ it?
What caught my attention was that the examples illustrate the type of effort a group of volunteers makes when no one wants to work very hard. You may have participated in filling a food basket with canned goods which few would actually consume.
Let’s look at one “use case,” and I will leave it to you to dig through the other four.
Use case number 2 explains how business intelligence can speed up and improve decision making. Okay, we have this pandemic thing. We have a bit of a financial downturn. We have the disruption of supply chains. We have the work from home method. We have Zoom solutions to knit together humans who like to hang out in break rooms and share gossip.
The fix is to use business intelligence to bring
together data mining, data analysis and data visualization to give executives and other business users a comprehensive view of enterprise data, which they can then use to make business decisions in a more informed way.
Now what do these terms mean? Data mining, data analysis, and data visualization. Where do the data come from? Are the data valid? Are the data comprehensive?
The “evidence” in the example is a survey conducted in a sample of an unknown number. The sample which may or may not be representative reports that “reliable data” is a hurdle. No kidding.
The case explains that the shift to real time data is important. Plus real time data piped into predictive analytics allows “fast action.”
The conclusion: Instantaneous decisions are possible.
Net net: The write up is a fluffy promotion of a nebulous concept. Use case my foot! I made an instant decision. Business intelligence, like knowledge management and content management, is a confection.
That’s why crazy “use case” explanations are needed. The other four examples in the article? Similar. Disconnected. A food basket filled with stuff no one will consume.
Stephen E Arnold, June 25, 2020
Conferences: A Juicy Source of Intelligence?
June 9, 2020
Conferences are interesting. These face-to-face experiences are becoming virtual. After decades of operating off the radar for most attendees, the content of conferences is “suddenly” getting some love.
Decades ago, I worked at a company which produced a database called CPI or Conference Papers Index. That database was sold to another firm, and I am not sure if the original product persists 39 years later. Only a handful of customers accessed this product compared to our flagship databases ABI/INFORM and Business Dateline.
“Potential Organized Fraud in ACM/IEEE Computer Architecture Conferences” caused me to think about who used CPI: the people and the companies (the outfits hiring the people). Almost 40 years ago, the customers were government agencies from countries which now provide high technology to the US and other nation states, companies based in the US with non-US owners, or outfits with names difficult to connect to a particular discipline. Did I care 40 years ago? Nope. We wanted to sell that database for several reasons:
- Conference organizers were among the most disorganized and distracted outfits we tapped for information; for example, copies of talks, abstracts, and names and affiliations of speakers. Much effort and many “let’s have lunch” and “yes, we will send that information tomorrow.” Sorry, lesson learned. Conferences 40 years ago were a different content animal. Fiefdoms, ego centric owners who wanted “total control”, trade associations eager to serve their members and preserve their mostly concierge type jobs, and similar flora and fauna. Much remains unchanged even as conferences undergo Rona-ization.
- Customers were not plentiful. The customers the CPI attracted wanted more: More images, more full text, more presentation foils. Delivering more cost money and it was not clear that if we invested the money to get “more” information that it would be a profitable operation. My hunch is that indexes of conferences, including the wonky listings one can find on the Internet, are essentially useless. Why? Sponsors are not indexed consistently. Names of speakers are not included as searchable content. The presentations, if one is lucky, become YouTube videos, usually delivered with both lousy audio and video. Sigh. Conferences are today a black hole of content. Going into the virtual conference business just makes the black hole deeper and weirder than before Rona.
- Conference organization is a remarkable exercise in rejecting, begging, and scrambling. Each conference wants stars for the keynotes. Each conference wants new talent to deliver hot information. Each conference desperately needs sponsors; that is, people to pay for snacks (yuck), liquor (much loved by attendees except for virtual presentations unless a company FedExes bottles to an attendee-with-a-budget’s home), and lunch (now a weird buffet brown bag thing which hopefully will disappear from real and virtual events completely). The organizer wants to put on a stellar show but lacks the expertise, money, and organizational talent to pull off most events.
What’s the fix?
If the information in the write up is accurate, it seems — note the hedge word “seems” — that individuals, companies, and countries are doing everything in their power to get their hands on the same information that people told us to include in our Conference Papers Index.
Valuable data include:
- Abstracts of proposed talks, some submitted a year before an event in certain event cycles
- The actual draft presentations: Text, PDFs of the visuals, author’s biography, and author details
- Names of speakers, addresses, email, etc.
The blog post suggests that some fancy dancing has been underway in the rarified world of big tech at the ACM and IEEE computer architecture conferences.
The article is worth reading.
However, there is context for what amounts to intelligence exploitation.
The question is, “Will most conference organizers care?” Another question, “Will most conference organizers be sufficiently adept at addressing the alleged problem?”
DarkCyber has a tentative answer, “Nope.” The sucking of conference data is an institutionalized behavior for many “experts,” their employers, some government entities, and even employees of conference companies.
Net net: Squeeze the fruit for informational juice.
Stephen E Arnold, June 9, 2020
MBA Think Reveals a Stunningly Obvious Insight
April 16, 2020
I am not sure when I began to feel uncomfortable with MBA speak. Perhaps it was after I developed an aversion to lawyer speak? On the other hand, my reaction could have been triggered by the rash I developed when I was exposed to accountant speak. Each “specialty” has its own lingo. But MBA speak is usually fascinating.
I read “Coronavirus Clarity.” The write up explains that the present medical challenge makes it clear that large technology companies have “power.” There is the patois of the MBA; for example:
- Conversation
- Differentiation
- Scale
- Zero margins
There are examples: Apple, Facebook, et al.
What’s the main idea?
The big technology companies like Apple, Facebook, et al are powerful.
Amazing. Who knew these monopolies were capable of collusion and operating as nation states? I feel in a way similar to Jonathan Edwards’ reaction to his revelatory moment in the woods.
Who knew? Plus, those not associated with Apple, Facebook et al should be grateful these firms are just doing so much for everyone. Proof? Check out “Google’s Former CEO Hopes the Coronavirus Makes People More Grateful for Big Tech.” Absolutely.
Stephen E Arnold, April 16, 2020
Acquisdata: High Value Intelligence for Financial and Intelligence Analysts
March 31, 2020
Are venture capitalists, investment analysts, and other financial professionals like intelligence officers? The answer, according to James Harker-Mortlock, is, “Yes.”
The reasons, as DarkCyber understands them, are:
- Financial professionals, to be successful, have to be data omnivores; that is, they must consume masses of data of different types from continuously flowing inputs
- The need for near real time or real time data streams can make the difference between a profit and a loss
- Changing work patterns on the trading floor are forcing even boutique investment firms and global giants to rely upon smart software to provide a competitive edge. These smart systems require data for training machine learning modules.
James Harker-Mortlock, founder of Acquisdata, told DarkCyber:
The need for high-value data from multiple sources in formats easily imported into analytic engines is growing rapidly. Our Acquisdata service provides what the financial analysts and their smart software require. We have numerous quant driven hedge funds downloading all our data every week to assist them in maintaining a comprehensive picture of their target companies and industries.
According to the company’s Web site:
Acquisdata is a fast growing digital financial publishing company. Established in 2010, we have quickly become a provider to the world’s leading financial news companies, including Thomson Reuters/Refinitiv, Bloomberg, Factset, IHS Markit, and Standard and Poor’s Capital IQ, part of McGraw Hill Financial, and ISI Emerging Markets. We also provide content to a range of global academic and business database providers, including EBSCO, ProQuest, OCLC, Research & Markets, CNKI and Thomson Reuters West. We know and understand the electronic publishing business well. Our management has experience in the electronic publishing industry going back 40 years. We aim to provide comprehensive and timely information for investors and others interested in the drivers of the global economy, primarily through our core products, the Industry SnapShot, Company SnapShot and Executive SnapShot products. Our units provide the annual and interim reports of public companies around the world and fundamental research on companies in emerging markets sectors, and aggregated data from third-party sources. In a world where electronic publishing is quickly changing the way we consume news and information, Acquisdata is at the very forefront of providing digital news and content solutions.
DarkCyber was able to obtain one of the firm’s proprietary Acquisdata Industry Snapshots. “United States Armaments, 16 March 2020” provides a digest of information about the US weapons industry. The contents of the 66-page report include news and commentary, selected news releases, research data, industry sector data, and company-specific information.
Obtaining these types of information from many commercial sources poses a problem for a financial professional. Some reports are in Word files; some are in Excel; some are in Adobe PDF image format; and some are in formats proprietary to a data aggregator. Acquisdata provides data in XML which can be easily imported into an analytic system; for example, Palantir’s Metropolis or a similar analytical tool. PDF versions of the more than 100 weekly reports are available.
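Why XML matters for import is easy to show. The sketch below flattens a small XML report into rows an analytic tool could load. The element names, attribute names, companies, and figures are all invented for illustration; the actual Acquisdata schema is not described in this post.

```python
import xml.etree.ElementTree as ET

# Hypothetical report layout and made-up values; not the real Acquisdata schema.
SAMPLE = """
<report industry="United States Armaments" date="2020-03-16">
  <company name="Example Corp">
    <metric name="revenue" unit="USD millions">1200.5</metric>
    <metric name="employees" unit="count">3400</metric>
  </company>
  <company name="Sample Industries">
    <metric name="revenue" unit="USD millions">310</metric>
  </company>
</report>
"""

def flatten(xml_text):
    """Turn the nested report into one flat row per company metric."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for company in root.iter("company"):
        for metric in company.iter("metric"):
            rows.append({
                "industry": root.get("industry"),
                "date": root.get("date"),
                "company": company.get("name"),
                "metric": metric.get("name"),
                "unit": metric.get("unit"),
                "value": float(metric.text),
            })
    return rows

rows = flatten(SAMPLE)
```

The same flattening against a Word file or an image PDF would require format-specific extraction first, which is the problem the paragraph above describes.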
DarkCyber’s reaction to these intelligence “briefs” was positive. The approach is similar to the briefing documents prepared for the White House.
Net net: The service is of high value and warrants a close look for professionals who need current, multi-type data about a range of company and industry investment opportunities.
You can get more information about Acquisdata at www.acquidata.com.
Stephen E Arnold, March 31, 2020
Medical Surveillance: Numerous Applications for Government Entities and Entrepreneurs
March 16, 2020
With the Corona virus capturing headlines and disrupting routines, how can smart software monitoring data help with the current problem?
DarkCyber assumes that government health professionals would want to make use of technology that reduced a Corona disruption. Enforcement professionals would understand that monitoring, alerting, and identifying functions could assist in spotting issues; for example, in a particular region.
What’s interesting is that the application of intelware systems and methods to health issues is likely to become a robust business. However, the techniques themselves are established: identifying signals in a stream of data is an extension of innovations reaching back to i2 Analyst’s Notebook and other sensemaking systems in wide use in many countries’ enforcement and intelligence agencies.
What’s different is the keen attention these monitoring, alerting, and identifying systems are attracting.
Let’s take one example: BlueDot, a company operating from Canada and founded by an infectious disease physician, Dr. Kamran Khan. This company was one of the first firms to highlight the threat posed by the Coronavirus. According to Diginomica, BlueDot “alerted its private sector and government clients about a cluster of unusual pneumonia cases happening around a market in Wuhan, China.”
BlueDot, founded in 2013, combined expertise in infectious disease, artificial intelligence, analytics, and flows of open source and specialized information. “How Canadian AI start-up BlueDot Spotted Coronavirus before Anyone Else Had a Clue” explains what the company did to sound the alarm:
The BlueDot engine gathers data on over 150 diseases and syndromes around the world searching every 15 minutes, 24 hours a day. This includes official data from organizations like the Center for Disease Control or the World Health Organization. But, the system also counts on less structured information. Much of BlueDot’s predictive ability comes from data it collects outside official health care sources including, for example, the worldwide movements of more than four billion travelers on commercial flights every year; human, animal and insect population data; climate data from satellites; and local information from journalists and healthcare workers, pouring through 100,000 online articles each day spanning 65 languages. BlueDot’s specialists manually classified the data, developed a taxonomy so relevant keywords could be scanned efficiently, and then applied machine learning and natural language processing to train the system. As a result, it says, only a handful of cases are flagged for human experts to analyze. BlueDot sends out regular alerts to health care, government, business, and public health clients. The alerts provide brief synopses of anomalous disease outbreaks that its AI engine has discovered and the risks they may pose.
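The taxonomy step in that description can be illustrated with a toy example. The sketch below scans article text against a small keyword taxonomy and flags only the items that hit more than one category, so that a handful reach a human analyst. The categories, keywords, threshold, and sample articles are assumptions for illustration. This is plain keyword matching, not BlueDot's system, and it omits the machine learning and natural language processing the quoted passage mentions.

```python
import re

# Hypothetical taxonomy: category -> keywords. Not BlueDot's actual vocabulary.
TAXONOMY = {
    "respiratory": {"pneumonia", "cough", "sars"},
    "outbreak": {"cluster", "outbreak", "unusual"},
}
MIN_CATEGORIES = 2  # assumed threshold: flag only when several categories co-occur

def categories_hit(text):
    """Return the taxonomy categories whose keywords appear in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {cat for cat, keywords in TAXONOMY.items() if words & keywords}

def flag_for_review(articles):
    """Keep only articles that touch enough categories to merit a human look."""
    return [a for a in articles if len(categories_hit(a)) >= MIN_CATEGORIES]

articles = [
    "Officials report a cluster of unusual pneumonia cases near a market.",
    "Local team wins the regional football final.",
    "Pharmacies see seasonal demand for cough syrup.",
]
flagged = flag_for_review(articles)
```

With these made-up inputs, only the first article is flagged; the cough syrup item touches a single category and is filtered out.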
DarkCyber interprets BlueDot’s pinpointing of the Corona virus as an important achievement. More importantly, DarkCyber sees BlueDot’s system as an example of innovators replicating the systems, methods, procedures, and outputs from intelware and policeware systems.
Independent thinkers arrive at a practical workflow to convert raw data into high-value insights. BlueDot is a company that points the way to the future of deriving actionable information from a range of content.
Some vendors of specialized software work hard to keep their systems and methods confidential and in some cases secret. Now a person interested in how some specialized software and service providers assist government agencies, intelligence professionals, and security experts can read about BlueDot in open source articles like the one cited in this blog post or work through the information on the BlueDot Web site. The company wants to hire a surveillance analyst.
Net net: BlueDot provides a template for innovators wanting to apply systems and methods that once were classified or confidential to commercial problems. Business intelligence may become more like traditional intelligence more quickly than some anticipated.
Stephen E Arnold, March 16, 2020
Import.io and Connotate: One Year Later
March 3, 2020
There has been an interesting shift in search and content processing. Import.io, founded in 2012, purchased Connotate. Before you ask, “Connotate what?”, let me say that Connotate was a content scraping and analysis firm. I paid some attention to Connotate when it acquired Fetch, an outfit with an honest-to-goodness Xoogler on its team. Fetch processed structured data and Connotate was mostly an unstructured data outfit. I asked a Connotate professional when the company would process Dark Web content, only to be told, “We can’t comment on that.” Secretive, right?
Connotate was founded in 2000 and required about $25 million in funding. The amount Import.io paid was not revealed in a source to which DarkCyber has access. Import.io has ingested about $38 million. DarkCyber assumes that the stakeholders are confident that 1 + 1 will equal 3 or more.
Import.io says:
We are funded by some of the greatest minds in technology.
The great minds include AME Cloud Ventures, Open Ocean, IP Group, and several others.
The company explains:
Starting from a simple web data extractor and evolving to an enterprise level solution for concurrently getting data that drives business, industry, and goodness.
What does the company provide? The answer is Web data integration: Identify, extract, prepare, integrate, and consume content from a user-provided list of URLs. To illustrate the depth of the company’s capabilities, Import.io defines “integrate” this way:
Integrate prepared data with a library of APIs to support seamless integration with internal business systems and workflows or deliver it to any data repository to develop robust data sets for advanced analytics capabilities.
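The extract, prepare, and integrate steps can be sketched in a few lines. The example below pulls the page title from each URL in a list, cleans it up, and assembles records ready for a downstream system. The URLs, the markup, and the function names are made up, the pages are held in a dictionary so nothing is fetched, and none of this reflects how Import.io implements its service.

```python
from html.parser import HTMLParser

# Stand-in for fetched pages so the sketch runs offline; URLs and markup are made up.
PAGES = {
    "https://example.com/a": "<html><head><title> Widget A </title></head><body>$10</body></html>",
    "https://example.com/b": "<html><head><title>Widget B</title></head><body>$12</body></html>",
}

class TitleExtractor(HTMLParser):
    """Collect the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def extract(html):
    """Extract: pull the raw title text out of the markup."""
    parser = TitleExtractor()
    parser.feed(html)
    return parser.title

def prepare(raw_title):
    """Prepare: collapse stray whitespace."""
    return " ".join(raw_title.split())

def integrate(urls):
    """Integrate: assemble one record per URL for a downstream system."""
    return [{"url": u, "title": prepare(extract(PAGES[u]))} for u in urls]

records = integrate(list(PAGES))
```

The point-and-click extractors mentioned in the post presumably replace hand-written code like `TitleExtractor`; that is an inference, not something the company's materials state here.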
The firm’s Web site makes it clear that it serves the online travel, retail, manufacturing, hedge fund, advisory services, data scientists, analysts, journalists, marketing and product, hospitality, and media producers. These are a mix of sectors and industries, and DarkCyber did not create the grammatically inconsistent listing.
Import.io offers videos which provide some information about one of its important innovations: “interactive extractors.” The idea is to convert script editing to point-and-click choices.
The company is growing. About a year ago, Import.io said that it experienced record sales growth. The company provided a link to its Help Center, but a number of panels contained neither information nor links to content.
The company offers a free version and a premium version. Price quotes are provided by the company.
Like Amplyfi and maybe ServiceMaster, Import.io is a company providing search and content processing with a 21st century business positioning. A new buzzword is needed to convey what Import.io, Amplyfi, and ServiceMaster are providing. DarkCyber believes that these companies are examples of where search and content processing has begun to coalesce.
The question is, “Is acquiring, indexing, and analyzing OSINT content a truck stop or a destination like Miami Beach?”
Worth monitoring the trajectory of the company.
Stephen E Arnold, March 3, 2020
An Interesting View of Snowden
October 2, 2019
DarkCyber noted “Looking Back at the Snowden Revelations.” This essay highlights the “cryptographic” angle of the leaked documents. Key points in the essay are:
- Explanation of the collect everything method
- The importance of signals intelligence
- The “problem” of encryption.
The write up states:
…the world that Snowden brought to our attention isn’t necessarily a world that Americans have much say in. As an example: today the U.S. government is in the midst of forcing a standoff with China over the global deployment of Huawei’s 5G wireless networks around the world. This is a complicated issue, and financial interest probably plays a big role. But global security also matters here. This conflict is perhaps the clearest acknowledgement we’re likely to see that our own government knows how much control of communications networks really matters, and our inability to secure communications on these networks could really hurt us. This means that we, here in the West, had better get our stuff together — or else we should be prepared to get a taste of our own medicine.
Interesting write up. Should the focus be on government collection and analysis?
Stephen E Arnold, October 2, 2019