Hands-on Gemini 1.5 Pro with AI Studio: Images, Video, Text & Code

April 8, 2024

Let's talk about Gemini 1.5 Pro and practical examples of what it can do. It's a mid-size multimodal model, optimized to scale across a wide range of tasks involving text, images, videos, audio, and even code. Today I’ll walk through images, video, text, and code. What really sets it apart is its long-context window, which can process up to a whopping 1 million tokens in production.

To put this into perspective, the original Gemini 1.0 had a context window of 32,000 tokens. We've increased 1.5 Pro's window roughly 31-fold, so it can take in vast amounts of information in a single prompt: 1 hour of video, 11 hours of audio, codebases with over 30,000 lines, or more than 700,000 words of text.
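If you want to sanity-check those numbers, here is a minimal back-of-the-envelope sketch. It only uses the figures quoted above; the tokens-per-word ratio is my own rough derivation from "1 million tokens" and "more than 700,000 words", and real token counts vary with language and content, so treat `estimate_tokens` as an approximation rather than how the model actually counts.

```python
import math

# Figures quoted in the post.
GEMINI_1_0_CONTEXT = 32_000
GEMINI_1_5_PRO_CONTEXT = 1_000_000
WORDS_IN_FULL_WINDOW = 700_000  # "more than 700,000 words", so a lower bound

# How much larger the 1.5 Pro window is than the 1.0 window.
growth = GEMINI_1_5_PRO_CONTEXT / GEMINI_1_0_CONTEXT


def estimate_tokens(word_count: int) -> int:
    """Scale a word count by the post's implied tokens-per-word ratio (approximate)."""
    return math.ceil(word_count * GEMINI_1_5_PRO_CONTEXT / WORDS_IN_FULL_WINDOW)


def fits_in_window(word_count: int) -> bool:
    """Rough check of whether a text of this length fits in one prompt."""
    return estimate_tokens(word_count) <= GEMINI_1_5_PRO_CONTEXT


print(f"Window growth: {growth:.2f}x")
print(f"A 100,000-word book is roughly {estimate_tokens(100_000):,} tokens")
```

A full-length book lands well inside the window by this estimate, which matches my experience later in this post.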

Under the hood, Gemini 1.5 Pro's Mixture-of-Experts (MoE) architecture is a thing of beauty. Imagine you're building a house, and instead of having one jack-of-all-trades contractor, you've got a team of specialists: an electrician, a plumber, a carpenter, etc. That's essentially what the MoE architecture does for Gemini. It breaks down the model into smaller, specialized "expert" networks that are activated only when needed.
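To make the "only when needed" part concrete, here is a toy sketch of top-k gating, a common way MoE layers pick experts. This is a generic illustration, not Gemini's actual routing, which I'm not describing here. The three "experts" are hypothetical one-line functions, and the gate scores are hard-coded; in a real model a learned router would produce them for each token.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(gate_scores, experts, x, top_k=2):
    """Run only the top_k highest-scoring experts and blend their outputs."""
    # Indices of the top_k gate scores (ties keep their original order).
    top = sorted(range(len(gate_scores)), key=lambda i: gate_scores[i], reverse=True)[:top_k]
    # Renormalize the gate weights over the selected experts only.
    weights = softmax([gate_scores[i] for i in top])
    output = sum(w * experts[i](x) for w, i in zip(weights, top))
    return output, top


# Hypothetical "specialists": each is just a tiny function here.
experts = [
    lambda x: x + 1,   # "electrician"
    lambda x: x * 2,   # "plumber"
    lambda x: x ** 2,  # "carpenter"
]
gate_scores = [0.1, 2.0, 2.0]  # made-up router output for one input

y, used = route(gate_scores, experts, 3.0)
print(f"experts used: {used}, output: {y}")
```

The first expert is never called for this input, which is the whole point: the model holds many specialists, but only a few do work for any given token.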


But enough with the numbers; let's dive into my hands-on experience. As an avid reader and a sucker for leadership and management books, I wanted to put Gemini 1.5 Pro's multimodal chops to the test. I started in Google AI Studio by giving 1.5 Pro an image of a bookshelf and prompting it to extract a list of book titles. It did this without breaking a sweat.

images to list


Next up, I gave it a video that quickly pans past a bookshelf. Inspired by Simon Willison's demo, I wanted to see if the model could generate a JSON array of the books it spotted. And boy, did it deliver. Not only did it accurately capture the book titles, but it even managed to fill in the author names, despite that information not being explicitly present in the video.

video to json
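If you want to use a reply like this in your own code, you usually need to pull the JSON out of the surrounding chat text first. Here is a minimal sketch of how I might do that. It assumes the reply contains exactly one JSON array, possibly wrapped in a markdown code fence; the sample reply, titles, and authors are placeholders, not the model's real output.

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence, built this way to keep the example readable
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json_array(reply: str) -> list:
    """Pull the JSON array out of a model reply, whether it is fenced or not."""
    fenced = FENCED_BLOCK.search(reply)
    candidate = fenced.group(1) if fenced else reply
    start, end = candidate.find("["), candidate.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON array found in reply")
    return json.loads(candidate[start : end + 1])


# Placeholder reply in the shape I asked for: a list of title/author objects.
reply = (
    "Here are the books I spotted:\n"
    + FENCE + "json\n"
    + '[{"title": "Example Book One", "author": "Author A"},\n'
    + ' {"title": "Example Book Two", "author": "Author B"}]\n'
    + FENCE
)

books = extract_json_array(reply)
print(f"{len(books)} books, first title: {books[0]['title']}")
```

Since the authors were inferred rather than read from the video, it's worth spot-checking them before relying on the data.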


To showcase Gemini’s abilities in handling long-form text, I fed it an entire public domain book – "Psychology of Management" from Project Gutenberg. With a simple prompt for the key highlights, the model churned through the book and provided a helpful summary. It was like having a personal librarian that can summarize dense texts in a snap.

long-context window with text


Now, let's talk about code. To put Gemini to the test, I dusted off an old side project – a simple web app for showcasing my favorite books. I uploaded a screencast of the app to Gemini 1.5 Pro and asked it to recreate a simpler version.

Video to JSON

First, I tasked the model with generating a JSON array of the books featured in the video. Not only did it nail the book titles, but it also managed to fill in the author names, going above and beyond the information explicitly present in the screencast.

video to json

Video to HTML/CSS/JavaScript

Next, I challenged Gemini 1.5 Pro to recreate the HTML and CSS layout of the original web page. It generated the necessary HTML, JavaScript to fetch the JSON data, and even the CSS styles to render the layout using CSS Grid.

video to html css and js
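To give a feel for the layout, here is a small sketch that renders a book list into a CSS Grid page. Note the swap: the generated app used JavaScript to fetch the JSON in the browser, while this sketch builds a static page in Python so it stays in one language with the other examples in this post. The class names, column sizes, and sample book are my own placeholders, not what the model produced.

```python
import html

PAGE_TEMPLATE = """<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Books</title>
<style>
  .books {{ display: grid; grid-template-columns: repeat(auto-fill, minmax(160px, 1fr)); gap: 1rem; }}
  .book {{ border: 1px solid #ddd; padding: 0.75rem; }}
</style>
</head>
<body>
<main class="books">
{cards}
</main>
</body>
</html>
"""


def render_page(books):
    """Render one grid card per book, escaping text so titles cannot break the markup."""
    cards = "\n".join(
        '  <article class="book"><h2>{}</h2><p>{}</p></article>'.format(
            html.escape(book["title"]), html.escape(book.get("author", ""))
        )
        for book in books
    )
    return PAGE_TEMPLATE.format(cards=cards)


page = render_page([{"title": "Example Book & Friends", "author": "Author A"}])
print(page)
```

The `repeat(auto-fill, minmax(...))` pattern lets the grid add or drop columns as the window resizes, so no media queries are needed for a simple gallery like this.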

Data extraction and reformatting

I realized that the generated app was missing the book cover images. So, I threw another challenge at Gemini 1.5 Pro: match the book cover images from my original Astro-based app to the titles in the generated JSON. And guess what? It nailed it, populating the "coverImageUrl" entries in the JSON like a champ.
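For comparison, here is roughly how I might do the same matching by hand. This is a minimal sketch that assumes the cover filenames are slugified versions of the titles, which may not hold for every file; the paths and titles are placeholders. Only the `coverImageUrl` key comes from the generated JSON.

```python
import re
from pathlib import PurePosixPath


def slugify(text):
    """Lowercase the text and collapse anything non-alphanumeric into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def attach_covers(books, cover_paths):
    """Return copies of the books with coverImageUrl set, or None when nothing matches."""
    by_slug = {slugify(PurePosixPath(path).stem): path for path in cover_paths}
    return [
        {**book, "coverImageUrl": by_slug.get(slugify(book["title"]))}
        for book in books
    ]


# Placeholder data for illustration.
books = [{"title": "Example Book One"}, {"title": "Unmatched Title"}]
covers = ["/images/example-book-one.jpg", "/images/another-book.png"]

for book in attach_covers(books, covers):
    print(book["title"], "->", book["coverImageUrl"])
```

Exact slug matching is brittle with subtitles and punctuation, which is exactly where handing the fuzzy matching to the model saved me time.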

Building UI with the data

To take things up a notch, I asked Gemini 1.5 Pro to add a modal view for displaying book details when clicking on an entry. With a few prompts, it updated the code, and voila! Clicking on a book now triggered a sleek modal showcasing the cover image and title. If I had included the book descriptions in the JSON, it would have displayed those too.

building ui with data

Uploading a full GitHub repository for processing

But the real magic happened when I uploaded the entire folder of individual markdown files from my original Astro app's repo. I asked Gemini 1.5 Pro to extract the title, author, Amazon URL, cover, and description from each file and combine everything into one comprehensive JSON file. And guess what? It handled the heavy lifting, reworking the files into the desired format for me. Talk about a productivity boost!

gemini with a github repo
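To show what that conversion involves, here is a minimal sketch of the manual version. It assumes each markdown file starts with a simple `key: value` frontmatter block between `---` lines and that the body text is the description. The key names (`amazonUrl`, `cover`) and the sample file are hypothetical, and a real project would likely use a proper YAML parser instead of this line splitting.

```python
import json

# Hypothetical markdown file in the assumed layout.
SAMPLE = """\
---
title: "Example Book One"
author: Author A
amazonUrl: https://example.com/book-one
cover: /images/example-book-one.jpg
---
A short description of the book.
"""


def parse_book_markdown(text):
    """Read simple frontmatter fields and treat the remaining body as the description."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing frontmatter")
    try:
        end = next(i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---")
    except StopIteration:
        raise ValueError("unterminated frontmatter") from None
    book = {}
    for line in lines[1:end]:
        if ":" not in line:
            continue
        # Split on the first colon only, so URLs in the value stay intact.
        key, _, value = line.partition(":")
        book[key.strip()] = value.strip().strip("\"'")
    book["description"] = "\n".join(lines[end + 1 :]).strip()
    return book


books = [parse_book_markdown(SAMPLE)]
print(json.dumps(books, indent=2))
```

Writing and debugging this for a folder of slightly inconsistent files is the tedious part the model took off my plate.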

Extending data

Finally, I put the cherry on top by asking Gemini 1.5 Pro to generate another 20 book recommendations based on the existing JSON data. Lo and behold, it updated the JSON with new titles and author names, giving me a fresh batch of reading material to explore.
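If you fold suggestions like these back into your own data, a small merge step helps keep things tidy. Here is a minimal sketch that skips titles already in the list and tags the new ones so they can be reviewed later. The `recommended` flag and the sample entries are my own placeholders, not part of the JSON the model produced.

```python
def merge_recommendations(existing, recommended):
    """Append recommended books whose titles are not already present (case-insensitive)."""
    seen = {book["title"].casefold() for book in existing}
    merged = list(existing)
    for book in recommended:
        key = book["title"].casefold()
        if key in seen:
            continue
        seen.add(key)
        # Tag model-suggested entries so they are easy to verify or filter out.
        merged.append({**book, "recommended": True})
    return merged


# Placeholder data for illustration.
existing = [{"title": "Example Book One", "author": "Author A"}]
suggested = [
    {"title": "example book one", "author": "Author A"},  # duplicate, different case
    {"title": "Example Book Two", "author": "Author B"},
]

merged = merge_recommendations(existing, suggested)
print([book["title"] for book in merged])
```

Keeping the flag around makes it easy to double-check that each suggested title and author pairing is real before publishing it.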

As you can see, Gemini 1.5 Pro could be a game-changer for software engineers. Its multimodal capabilities and long-context window open up a world of possibilities for enhancing our development workflows. From extracting information from images and videos to processing entire codebases and generating comprehensive JSON data, this model is like having a team of expert assistants at your fingertips.


So, if you have the opportunity to get your hands on Gemini 1.5 Pro through Google AI Studio, I highly recommend giving it a spin. Explore its multimodal magic, push the boundaries of its long-context window, and witness firsthand how it can revolutionize your development workflows. The possibilities are endless, and I can't wait to see the incredible things our community will build with tools like Gemini 1.5 Pro.