☀️ Your guide to AI: June 2023
Welcome to the latest issue of your guide to AI, an editorialized newsletter covering key developments in AI policy, research, industry, and startups during May 2023. Before we kick off, a couple of news items from us :-)
Our full speaker lineup is live for the 7th Research and Applied AI Summit in London on 23 June. RSVP here.
I went on the Venture Europe podcast to discuss the State of AI…
Congratulations to team Valence Discovery (Air Street portfolio) on their acquisition by Recursion. These are two special teams reimagining drug discovery using high-throughput experimentation and machine learning.
The UK government is inviting submissions to its university spinout survey that will directly inform policy in the fall. I really encourage anyone with spinout experience to submit.
Careers @ Air Street: If you or someone in your network is particularly excited about developer/ML relations, leading our work in building AI communities, synthesising and distributing best practices, and building new AI-first products using the latest tools, please reply here!
As usual, we love hearing what you’re up to and what’s on your mind, just hit reply or forward to your friends :-)
🌎 The (geo)politics of AI
If there ever was a month for the political recognition of AI, it was May 2023. OpenAI CEO Sam Altman testified before Congress to discuss the benefits and risks of AI. “We believe that the benefits of the tools we have deployed so far vastly outweigh the risks, but ensuring their safety is vital to our work”, he said. But the main highlight of the hearing was probably his call for AI regulation via, for example, the government granting licenses to companies that build AI models: “The US government might consider a mix of licensing and testing requirements for the development and release of AI models above a threshold of capabilities”. His statement was slammed for being vague and self-serving, and the criticism is at least partially deserved. But if AI is as dangerous as some experts claim, requiring a license to build it isn’t such a bad idea… The problem lies with the “if” and with the “threshold of capabilities”: much of the debate around AI risks rests on shaky ground. We still haven’t built tools to reliably and cheaply evaluate AI models and the risks behind them, and the AI community doesn’t yet agree on measures for these capabilities, never mind the right threshold.
Authors from Google DeepMind, OpenAI, University of Toronto, University of Oxford, University of Cambridge, Anthropic, Université de Montréal, Mila, and AI-safety-focused research centers wrote a paper presenting a framework for model evaluation for extreme risks. Read it and you’ll realize how tall a task AI model evaluation is. And even once an evaluation toolbox is established, it can still miss many blind spots. Who’d have thought a technology relying on messy human data, built on little mathematical theory, and carrying massive financial costs and incentives would be hard to regulate?! In the absence of evaluation tools and generally accepted standards, we’re stuck in a very polarized debate between “Stop everything or we are doomed” and “Race ahead. This technology will save us”.
Altman and the CEOs of Alphabet, Anthropic and Microsoft had also previously met with VP Kamala Harris to discuss the risks and benefits of AI, after the administration announced a $140M investment to establish seven new AI research institutes. The British PM met with Sam Altman, Demis Hassabis (Google DeepMind) and Dario Amodei (Anthropic) to discuss AI risks. Sunak will also meet with Biden to discuss AI “extinction risk”. So many meetings; let’s hope something constructive comes out of them.
With regards to extinction risk: the latest development to date is the Center for AI Safety’s one-sentence statement signed by the who’s-who of AI: “Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.” As much as this statement has been mocked, it’s once again so vague that it’s difficult to judge. Sure, let’s make that a global priority… but what actions that entails is what will need to be discussed. And please, signatory AI scientists: either prepare your TV appearances or cancel them altogether. These botched TV interviews do you, your cause, and your audience a massive disservice. A recent interview with NYU professor and Prescient Design/Genentech research scientist Kyunghyun Cho is a rather measured (and welcome) contribution to what is increasingly becoming a debate of faith.
In the absence of a common voice around AI safety and of concrete ways to evaluate and define AI risks, OpenAI is pushing for public oversight of AI development, calling for example for an IAEA (International Atomic Energy Agency) for AI. More generally, OpenAI wants to promote a democratic process for deciding rules that should govern AI systems, by setting up “a program to award ten $100,000 grants to fund experiments in setting up a democratic process for deciding what rules AI systems should follow, within the bounds defined by the law.”
What comes closest to AI regulation in the Western world is the EU’s AI Act. Amusingly, Sam Altman, who was touring the EU this month, first threatened that OpenAI could leave the EU if the regulation ended up being too stringent, but he soon backtracked, saying that OpenAI had no plans to leave. Again, words, words. The EU AI Act, spearheaded by France and Germany and seemingly uniting the 27 members of the EU bloc, is privately criticized, especially by more tech-friendly Central and Northern European countries. They fear that complex compliance rules on small companies and a possible crackdown on American Big Tech could hamper innovation and reduce US investment in the EU. The fear is all the more justified as the counterbalancing voice of the UK, generally less prone to regulation, has been lost since Brexit.
Speaking of the UK, are we beginning to see the first signs of a change of heart amid the rising tide of extinction warnings? This month, we witnessed the surreal sight of a government minister writing into The Economist to complain about their use of ‘light-touch’ to describe the UK’s approach to AI regulation. As a number of eagle-eyed observers have pointed out, the government itself described its framework last year as “proportionate, light-touch, and forward-looking”. We’re now also seeing rumors that the UK is considering a new safety-focused global oversight body. Whether it’s CERN, the IAEA, or the IPCC, we’ve seen a swirl of suggested acronyms from which this body could draw inspiration. The devil will of course be in the detail: who would run this body, would it be adequately funded, and, at the moment, could an oversight body in London command the global respect it would need to be effective?
Meanwhile, Japan made a bold move on AI regulation by deciding to NOT protect copyrighted materials used in AI datasets.
🏭 Big tech
By now it’s obvious that, driven by market appetite for AI, companies are being more vocal about AI integration inside their products. What might have been hidden a few years ago and presented as technical advancements under the hood without mention of AI – or merely in low-profile blog posts – is now presented front and center (just check out this video of Sundar Pichai from the latest Google I/O). For example, hardly anyone remembers DeepMind’s MuZero’s first integration into a real-world product, YouTube, to optimize video compression, or their labeling model for YouTube videos. Still, interestingly, AI at the time (February 2022!!) was dismissed by most as a marketing tool (“it looks cool but it doesn’t work yet…”). It has now become as much of a marketing tool as an actual tool underpinning hundreds of massively used products. At the risk of stating the obvious: when something (mostly) works, it’s easy to sell.
With this preamble in mind, let’s move on to a product-heavy month of AI.
Google announced PaLM 2, the second iteration of its most capable language model. This one was trained on more multilingual text (100+ languages). We also know that PaLM 2 was heavily trained on scientific papers and web pages, as well as code, like all LMs nowadays. Apart from this, as is now standard, no other details were released on the datasets used to train it. Google has already integrated PaLM 2 into its existing products, most notably Bard and Med-PaLM 2 (a version of PaLM trained with medical expert demonstrations that “can answer questions and summarize insights from a variety of dense medical texts”).
YouTube integrated DeepMind’s Flamingo model into Shorts. Flamingo is a multimodal model trained using self-supervision that can explain in natural language what appears in images. It is a natural fit with Shorts, which are fast to create, but whose creators often don’t include descriptions or even titles, making them difficult to find while browsing YouTube. This is a typical example of what we just discussed: Flamingo’s action won’t appear anywhere for the user; the user will just see better search results. A few years ago, Google may have barely mentioned it in a developer event.
This year’s Microsoft Build event was another occasion for Microsoft to do what they’ve been doing lately: condensing what feels like 10 years of product development into a few weeks by stuffing GPT-4 everywhere and integrating their products with OpenAI’s. Now Bing can use ChatGPT plugins, and ChatGPT can use Bing to present up-to-date answers. Windows, just like the Microsoft 365 apps, has its Copilot; the Windows Terminal has its Copilot; and Microsoft Edge will soon have its Copilot. If this feels a bit redundant, it’s because it is. But when you invest tens of billions of dollars into an LLM company, you need to squeeze everything you can out of those LLMs. We’re looking forward to seeing usage (and satisfaction) numbers for these products.
A more developer-focused announcement was the release of a Model Catalog as part of their AI Studio, which allows developers to easily use not only OpenAI’s models but also open-source ones on Azure. Notably, the model catalog includes the Hugging Face Hub, which is a big win for Hugging Face following on their early promise of being the GitHub of machine learning. This adds to similar integrations with Amazon SageMaker and IBM watsonx.ai.
The most impressive demo of the month comes from Adobe’s Generative Fill, a Photoshop-integrated feature that lets users modify images using natural language. The video is worth a thousand words.
Facebook released the Massively Multilingual Speech Project, which includes datasets, and text-to-speech and speech-to-text models covering more than 1,100 languages, more than 10 times the coverage of existing datasets (and of the models built on them).
Amazon is reportedly working on an integration of a conversational agent as part of its shopping experience.
Apple launched a range of AI features “for cognitive, vision, hearing, and mobility accessibility, along with innovative tools for individuals who are non-speaking or at risk of losing their ability to speak.”
NVIDIA revealed DGX GH200, the supercomputer with the largest shared memory space, reaching 144 TB across 256 NVIDIA Grace Hopper Superchips. A key advantage of this supercomputer is that it uses components developed solely in-house: the Grace CPU, the Hopper GPU, and the NVLink Chip-2-Chip interconnect, which increases bandwidth and significantly reduces interconnect power consumption compared to previous DGX generations. NVIDIA advertises it as “The Trillion-Parameter Instrument of AI”. Let’s see who the customers will be for such an expensive little toy. Or will it be NVIDIA itself? As covered in previous newsletters, NVIDIA has been developing its own AI models, either alone or in partnership with other companies like Hugging Face.
NVIDIA recently became the latest member of the $1T market-cap club, joining Apple, Microsoft, Google, and Amazon. The latest surge in its stock price was driven by its latest earnings and sales report, in which it announced that profit increased by 26% and sales by 19%. I opined in the FT that NVIDIA’s lead is likely worth a couple of years thanks to its grip over the software stack.
Meanwhile, at Google I/O, Google Cloud announced its own brand-new supercomputer, the Compute Engine A3, which brings together 26,000 NVIDIA H100 GPUs. Based on the data we’ve collected in our State of AI Report Compute Index, this would be larger than Meta’s A100 cluster.
🏥 Life science
A couple of massive funding rounds and partnerships took place lately in AI and biotech. XtalPi, a Chinese startup building a platform for small-molecule drug discovery, struck a $250M deal (upfront and milestone payments) with Lilly for the design and delivery of drug candidates for an undisclosed target. Hippocratic AI, which develops LLMs trained on medical data and claims to have developed a model that outperforms GPT-4 on most healthcare exams, raised a $50M seed round. Recognizing that LLMs aren’t yet robust enough to assist with clinical diagnosis, the company targets less critical tasks such as empathic bedside care for patients. CHARM Therapeutics announced a new investment from NVIDIA. The company has now raised $70M in total for its DragonFold platform (see our April newsletter). Unfortunately, details of the deal, including a potential compute component, haven’t been shared.
Meanwhile, Exscientia’s deals with pharma companies continue to be fruitful. The company announced that the Japanese Sumitomo Pharma plans to initiate a Phase 1 clinical study of a new molecule “with broad potential in psychiatric disease”. The molecule was created using their platform (amusingly now explicitly marketed as a “generative AI platform”), and is the 6th such molecule to enter clinical trials.
🔬 Research
LIMA: Less Is More for Alignment, Meta, Carnegie Mellon University, University of Southern California, Tel Aviv University. This paper tries to answer the following question: what matters most in training the recent slew of LLMs, pretraining or instruction tuning and reinforcement learning? The researchers fine-tune a 65B LLaMA model – which is pre-trained on open-source datasets – on a set of 1,000 prompts and responses, some manually authored and others collected from Q&A forums (StackExchange, Wikihow, Reddit, etc.). To elicit the same type of responses from the model (and effectively use supervised learning to replace RLHF), the input prompts are chosen/designed to be diverse while the outputs have similar formats. LIMA is then compared to existing instruction-tuned and RLHF-trained models (OpenAI’s GPT-4 and DaVinci003, Google’s Bard, Anthropic’s Claude) and an additional model, Alpaca 65B, which they trained using the 52k Alpaca training set. As a side note, it’s fascinating to see the quick feedback loop in today’s large-scale deep learning research, going in 4 weeks from Meta’s LLaMA models to Stanford’s LLaMA-based 7B Alpaca and back to Meta (& Co)’s Alpaca 65B. To evaluate the models, the authors generated a single response per test prompt and per model and asked crowd workers whether they preferred LIMA’s response or each of the other models’. The percentage of ties + LIMA wins: 43% vs. GPT-4, 46% vs. Claude, 83% vs. Alpaca 65B. They also evaluated the models using GPT-4 as an evaluator, which resulted in a slightly less uniform distribution (low win rate when LIMA loses, high when it wins). Unfortunately, the authors didn’t compare LIMA to Alpaca 65B with RLHF (which requires a budget of its own…). Given the results, it’s not clear that careful supervised fine-tuning is competitive with very large RLHF-trained models, but it does offer a cost-efficient alternative to them.
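LIMA’s recipe boils down to plain supervised fine-tuning: concatenate prompt and response, and compute the loss only on the response tokens. Here is a minimal sketch of that data-preparation step, with a toy character-level tokenizer standing in for LLaMA’s real one (all names are illustrative, not from the paper’s code):

```python
# Illustrative sketch of LIMA-style supervised fine-tuning data prep:
# concatenate prompt + response, then mask the loss on prompt positions
# so the model is only trained to produce the response.

IGNORE_INDEX = -100  # common convention: cross-entropy skips these labels

def toy_tokenize(text: str) -> list[int]:
    # Stand-in for a real subword tokenizer (e.g. LLaMA's SentencePiece).
    return [ord(c) for c in text]

def build_example(prompt: str, response: str, eos_id: int = 0) -> dict:
    prompt_ids = toy_tokenize(prompt)
    response_ids = toy_tokenize(response) + [eos_id]
    input_ids = prompt_ids + response_ids
    # Labels mirror input_ids, but prompt positions are ignored by the loss.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return {"input_ids": input_ids, "labels": labels}

ex = build_example("Q: What is 2+2?\nA: ", "4")
assert len(ex["input_ids"]) == len(ex["labels"])
assert ex["labels"][: len("Q: What is 2+2?\nA: ")] == [IGNORE_INDEX] * 19
```

Masking prompt positions with -100 follows the convention of most cross-entropy implementations, so gradients flow only from response tokens — which is exactly how supervised fine-tuning substitutes for RLHF in this setup.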
Yet another way researchers have devised to cheaply fine-tune pretrained models is to fine-tune them on the outputs of large RLHF-trained models. In The False Promise of Imitating Proprietary LLMs, UC Berkeley researchers show that this technique leads to smaller models whose outputs are stylistically similar to larger models’, but factually unreliable and often simply incorrect. They “finetune a series of LMs that imitate ChatGPT using varying base model sizes (1.5B–13B), data sources, and imitation data amounts (0.3M–150M tokens).” Their conclusion and proposed path ahead: “We conclude that model imitation is a false promise: there exists a substantial capabilities gap between open and closed LMs that, with current methods, can only be bridged using an unwieldy amount of imitation data or by using more capable base LMs. In turn, we argue that the highest leverage action for improving open-source models is to tackle the difficult challenge of developing better base LMs, rather than taking the shortcut of imitating proprietary systems.” Another interesting finding from their work is that, due to the remarkable imitation ability of small fine-tuned models, crowd workers actually have trouble differentiating between imitation models and larger ones even when the former are factually incorrect. This confirms one of the most pressing issues regarding LLM evaluation and safety: “how can we cheaply and quickly probe the utility of a powerful LLM?”
Tree of Thoughts: Deliberate Problem Solving with Large Language Models, Princeton University, Google DeepMind. It is more or less commonly accepted that bare LLMs fail on involved, multi-step reasoning problems. And because of the lack of access to large models and cost-driven model inertia (it’s too expensive to retrain giant models), one has to take a fixed model and find better ways to prompt it. The starting point of this paper is to notice that, as elaborate as they can be, all common prompting strategies for LLMs are “forward-only”: input-output prompting starts with one or a few examples of input-output texts and generates a single output; chain-of-thought (CoT) prompting goes a step further to deal with more complex input-output relations by prompting the model not only with input-output examples but with input-step1-step2-...-stepN-output examples, which leads the model to reproduce a similar reasoning; self-consistency with CoT, which is tailored for tasks where a ground-truth answer is expected, generates multiple CoT outputs and then takes a majority vote to define the final output. In contrast, when we want to solve a problem, we reason in multiple steps, exploring at each step the multiple options at hand, choosing the one that seems most promising at that point but potentially abandoning it later if it leads to a dead end, then exploring the second most promising one, etc. In essence, human reasoning follows a tree structure, where we combine a mix of breadth-first search and depth-first search, guided by intuition, to search for the solution. This paper emulates this process for tasks where it is tractable, via a prompting technique the authors call Tree of Thoughts – ToT (Section 3 contains a more rigorous presentation of the implementation of their method).
On Game of 24, a game where the goal is to use 4 given numbers to obtain 24 using basic arithmetic operations: “while GPT-4 with chain-of-thought prompting only solved 4% of tasks, [their] method achieved a success rate of 74%.” They used ToT with BFS with a breadth of 5. The second-best (non-ToT) prompting strategy reached 49%.
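To make the search concrete, here is a toy sketch of ToT-style BFS. In the paper, both the proposer and the scorer are LLM calls; below they are hand-written stand-ins solving a simpler numeric search task, so this illustrates only the control flow, not the prompting:

```python
# Toy sketch of Tree of Thoughts with breadth-first search.
# In the paper, propose() and score() are LLM calls; here they are
# stand-ins searching for a target number via +1 / *2 operations.

def propose(state: int) -> list[int]:
    # Candidate next "thoughts": the LM would propose partial solutions
    # conditioned on the state so far.
    return [state + 1, state * 2]

def score(state: int, target: int) -> float:
    # The LM would rate each partial solution ("sure/maybe/impossible");
    # here we use closeness to the target, heavily penalizing overshoot.
    return -abs(target - state) if state <= target else -10 * (state - target)

def tot_bfs(start: int, target: int, breadth: int = 5, max_depth: int = 10) -> bool:
    frontier = [start]
    for _ in range(max_depth):
        if target in frontier:
            return True
        # Expand every state in the frontier, then keep the `breadth` best.
        candidates = [s for state in frontier for s in propose(state)]
        candidates.sort(key=lambda s: score(s, target), reverse=True)
        frontier = candidates[:breadth]
    return target in frontier

assert tot_bfs(1, 24)
```

The paper’s Game of 24 setup is exactly this loop with breadth 5, except that states are partial arithmetic expressions and both expansion and scoring are delegated to GPT-4.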
Scaling data-constrained language models, Hugging Face, Harvard University, University of Turku. LLM scaling laws are simple models predicting the performance of LLMs depending on the compute used and dataset size. While compute is potentially infinite, constrained only by time and benefitting from hardware innovation, original (non-synthetic) data is comparatively constrained. Indeed, given current model sizes and compute-optimal scaling laws, we will soon run out of available data if we want to train ever-larger language models without wasting compute. This raises the question: given a constrained dataset size, for how many epochs can we train models before further training becomes useless (as measured by non-decreasing validation loss)? The researchers train a large range of models on a large range of training tokens and “propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters”. The magic number (with caveats) is 4: 4 epochs is how far you can go before adding compute becomes useless on most dataset sizes. To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis, from NUS, the University of Edinburgh and ETH Zurich, examines the same problem, but on smaller models and with more epoch repetitions. They show that using more epochs is consistently harmful, and that carefully tuned dropout can help.
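The flavor of the result can be conveyed with a toy diminishing-returns model: assume each extra epoch over the same unique tokens contributes a constant fraction of the previous epoch’s value. The geometric-decay form and the decay constant below are our own assumptions for illustration, not the parametric law the paper actually fits:

```python
# Toy illustration of the paper's core finding: each additional epoch over
# the same unique tokens adds less "effective data" than the last, so
# returns flatten out after a handful of epochs.
# The geometric-decay form and decay=0.5 are illustrative assumptions.

def effective_tokens(unique_tokens: float, epochs: int, decay: float = 0.5) -> float:
    # Epoch k contributes unique_tokens * decay**(k - 1): a geometric sum.
    return unique_tokens * sum(decay ** (k - 1) for k in range(1, epochs + 1))

one_epoch = effective_tokens(1e9, 1)
four_epochs = effective_tokens(1e9, 4)
forty_epochs = effective_tokens(1e9, 40)
# Going from 1 to 4 epochs nearly doubles effective data...
assert four_epochs < 2 * one_epoch
# ...but going from 4 to 40 adds almost nothing.
assert forty_epochs - four_epochs < 0.2 * one_epoch
```

Under this toy model, effective data saturates at unique_tokens / (1 - decay), and with decay = 0.5 you are already at ~94% of that cap after 4 epochs — which echoes (but does not reproduce) the paper’s observation that around 4 epochs is where extra compute stops paying off.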
Also on our radar:
Let’s Verify Step by Step, OpenAI. Authors show that supervising every step of reasoning for LLMs (process supervision) leads to SOTA models on the MATH benchmark. They release a dataset of 800,000 step-level human feedback labels.
MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers, Meta AI. Devises a scalable tokenization-free method for training transformers. Read Andrej Karpathy’s thoughts on why this is important. Closely related is this even more recent work from Apple: Bytes Are All You Need: Transformers Operating Directly On File Bytes.
Video Prediction Models as Rewards for Reinforcement Learning, UC Berkeley. Uses pretrained video prediction models’ likelihood as reward signals for reinforcement learning.
Towards Expert-Level Medical Question Answering with Large Language Models (a.k.a. Med-PaLM 2), Google. Improves Med-PaLM by (i) using PaLM 2 as a base model, (ii) better medical domain finetuning, and (iii) better prompting strategies. Results in a 19-percentage-point performance improvement on the MedQA dataset (which mimics US Medical Licensing Examination questions) over the previous SOTA, Med-PaLM.
Gorilla: Large Language Model Connected with Massive APIs, UC Berkeley, Microsoft Research. Proposes a fine-tuned LLaMA-based model – Gorilla – that performs better than GPT-4 on API calls. Given a prompt, Gorilla can pick among 1,000 APIs to execute a desired task.
ESMFold hallucinates native-like protein sequences, GSK. Authors invert ESMFold (originally trained to predict protein structure from sequence) to generate protein sequences from structure/desired properties. Inverting AlphaFold results in unnatural and poorly expressive protein sequences, whereas Protein Language Model-based models like ESMFold do in fact result in “more native-like and more likely to express” proteins. A drawback is that this comes with a prohibitively high memory footprint for large proteins.
🚀 Funding highlight reel
Anthropic, the OpenAI spinoff behind the Claude language model, raised a $450M Series C led by Spark Capital. Anthropic is now up to $1.5B in total VC funding.
Builder AI, which offers AI powered tools to make building apps faster, raised a $250M Series D led by Qatar Investment Authority.
Lightmatter, a company promising to build photonic chips for AI, raised a $154M Series C from multiple investors.
Figure, an AI robotics company building humanoid robots, raised a $70M Series A led by Parkway Venture Capital.
Sana Labs raised a $28M Series B extension to expand its AI platform for learning and development into enterprise AI solutions.
Poolside (Air Street portfolio), founded by ex-GitHub CTO Jason Warner and serial entrepreneur Eiso Kant, raised a $26M seed round from Redpoint Ventures, where Jason Warner is a managing director. The company is pursuing narrow AGI for code. Meanwhile Coverity, which will use AI to help users with code tests (cc Codium), raised a $10M seed. Coverity is the second company founded by Andy Chou, who went after a similar vision back in 2002, before AI was cool.
Union AI, a data and machine-learning workflow tooling platform, raised a $19M Series A round from NEA and Nava Ventures.
Together, which is building “a decentralized cloud for artificial intelligence”, raised a $20M seed round led by Lux Capital.
Hyro raised a $20M Series B led by Liberty Mutual, Macquarie Capital and Black Opal for its healthcare-focused conversational AI platform.
Tinycorp, founded by popular hacker George Hotz, raised a $5.1M seed round. He aims to write software that makes AMD chips competitive with NVIDIA’s on MLPerf, an ML-focused hardware benchmark.
RunwayML, creators of creative AI tools for video editing, raised $100M at a $1.5B valuation from a cloud provider.
Everseen, which uses computer vision to prevent theft at self-checkout counters, raised €65M from existing investor Crosspoint Capital Partners.
BenchSci, a company helping pharma companies increase the efficiency of their R&D through reagent selection, raised a CAD $95M Series D led by Generation Investment Management.
NewLimit, a new aging/regenerative medicine company focused on epigenetics, raised $40M from Dimension, Founders Fund and Kleiner Perkins, in addition to a $110M commitment from its founders, who include Coinbase’s Brian Armstrong and Blake Byers, formerly of GV.
Vectara, a conversational (generative) AI startup built on retrieval augmented generation, raised a $28.5M Seed from Race Capital and announced a partnership with Databricks.
Adonis, a healthcare revenue intelligence platform, raised a $17.3M Series A led by General Catalyst.
Deep Sentinel, a computer vision security company, raised $15M from Intel Capital.
Mitiga, a climate simulation spinout from Barcelona’s National Supercomputer Center, raised a $14.4M Series A led by Kibo Ventures.
Chemix, an AI-first battery chemistry development company, raised a $10M Seed led by Mayfield Fund.
AMP Robotics, the robotics-driven recycling company, raised additional Series C funding from Microsoft Climate Innovation Fund.
Etched, a new AI semiconductor startup, raised a $5M Seed led by Primary Venture Partners.
Embark Trucks, an automated trucking startup formed out of Waterloo that ultimately SPAC’d at a $5B valuation, has folded and been acquired for $71M by Applied Intuition.
Neeva, the once-hyped “Google Search killer”, has closed down its consumer search application, refocused on applying its generative search technology, and been acquired by Snowflake… search is hard.
6 River Systems, one of the early-moving warehouse robotics startups that was first acquired by Shopify, has been sold on to British technology company Ocado, as the former divests from its vertical integration ambitions of a few years ago. Shopify has offloaded its logistics business to Flexport too.
DiA Imaging Analysis, an Israeli AI-based ultrasound image analysis company with FDA clearance, was acquired for close to $100M by Philips.
Valence Discovery, an AI-first drug design company formed out of the Mila Institute with a focus on generative chemistry, was acquired by Recursion Pharmaceuticals for over $50M. Air Street Capital had co-led their Seed investment. My storyline here 🙂. Alongside Valence, Recursion also acquired Cyclica.
Nathan Benaich, Othmane Sebbouh, 4 June 2023
Air Street Capital invests in AI-first technology and life science entrepreneurs from the very beginning of your company-building journey.