#1293: The Personalized AI-Driven “Tulpamancer” VR Sandpaintings with AI Text to Audio & VR Workflow

I interviewed Tulpamancer co-directors Marc Da Costa and Matthew Niederhauser as well as AI Engineer Aaron Santiago at Venice Immersive 2023. See more context in the rough transcript below.

This is a listener-supported podcast through the Voices of VR Patreon.

Music: Fatality

Rough Transcript

[00:00:05.452] Kent Bye: The Voices of VR Podcast. Hello, my name is Kent Bye, and welcome to the Voices of VR Podcast. It's a podcast that looks at immersive storytelling, experiential design, and the future of spatial computing. You can support the podcast at patreon.com slash voicesofvr. So continuing on my series of looking at different experiences from Venice Immersive 2023, this is episode number 23 of 35 and number 3 of 10 of looking at the context of ideas and adventure. So this piece is called Tulpamancer, which is by creators Marc Da Costa and Matthew Niederhauser, as well as AI engineer Aaron Santiago. So this is a very provocative piece where you go into this room and you start to enter in lots of different answers to these different prompts that you're given on this really old computer named Tulpa. And once you answer a lot of questions around both your childhood memories and where you've been in the past, and maybe an ephemeral moment that happened that day, and then your aspirations for the future, then you go into a room and you put on a virtual reality headset and you have an entire virtual reality experience that's created just for you. And then when you're done, it's all deleted, kind of like a sand painting. You're the only one who gets to experience this experience. So it's making lots of different calls in the back end to ChatGPT to be able to weave it together with this narrative and this arc. It's also sent to a text-to-speech engine to be able to generate the whole audio narrative aspect of it. And then some of these prompts are also fed into Stable Diffusion to create these equirectangular images with depth maps, these virtual reality scenes that take these six or seven or so different moments and create a series of still shots as you listen to the narration that's custom generated for you. So this is a hot topic of the future of artificial intelligence and creativity. It's a piece that is cultivated to push the limits of what AI can do when it comes to creating these kind of personalized narratives, taking the input that you're providing it and then being able to create a whole immersive experience around it. So this is a center of gravity of looking at the different concepts and ideas and exploring different aspects of artificial intelligence and collective intelligence, and exploring aspects of your dreams, but also your own self and your identity, as you have this dialectical conversation with these machines and have them reflect back to you different aspects of your memory and your identity and your aspirations for the future. So the primary center of gravity of presence in this piece is this mental and social presence, where you're asked for lots of different input. As the creators say throughout the course of this interview, it's kind of like the more you put in, the more you get out. So the more that you are answering and responding with language from your memories, the more specific an experience it may provide for you. And you have this AI narrator that's speaking to you, so it kind of creates this proto social presence, like someone speaking to you about you, which is quite surreal to have such a customized experience like that. And then there's aspects of your active presence and agency. There's a lot of your ability to really steer where this experience goes based upon the different types of input that you give it.
So it's very much a customized type of experience, where the more that you decide to put into the experience, the more possibilities there are for it to go in different directions. And then there's the emotional presence, which is just whether the story is resonating or not resonating. I think this is a bit hit or miss, where you're entering all these different things together and you're getting this multimodal experience back. And sometimes it all fuses together into these perfect moments of synchronicity. And other times it just kind of falls flat because, for whatever reason, you break presence or break plausibility because it's not really resonating. But it does have the potential to really resonate on emotional levels for some folks. And it's also using different dimensions of environmental presence, because you're in both the installation of this piece, which is quite exquisite and very well done, and then you go into the virtual reality experience, and then there's things that may or may not land because you're basically seeing these generated images from AI. So that's sort of the architecture of the piece. And so we'll be diving in, in this conversation, just unpacking a lot more and talking about some of the ethical and moral implications of technology like this. So that's what we're covering on today's episode of the Voices of VR Podcast. So this interview with Marc, Matthew, and Aaron happened on Saturday, September 2nd, 2023 at Venice Immersive in Venice, Italy. So with that, let's go ahead and dive right in.

[00:04:24.863] Marc Da Costa: My name is Marc Da Costa. I am a cultural anthropologist by training. I also have spent a lot of time working in the tech world, sort of with a data analytics company I started in New York. And this is actually my first VR work. My artistic practice to date has been mostly focused on exploring the ways in which data and emerging technologies shape our understanding of the world, and what the creative possibilities are of intervening in that and playing with it. So I've seen a lot of VR experiences that have been very moving to me, and I was very excited to work with Matt and Aaron on this project to actually get my hands dirty in it.

[00:05:07.913] Matthew Niederhauser: And my name is Matthew Niederhauser. I have a background as a photojournalist and also in cinematography. And around eight years ago, I started an experiential studio called Sensorium while at an incubator called NEW INC in New York. And ever since then I have been building original immersive experiences in VR, AR, XR, all the R's, and recently started collaborating with Marc on some new projects, especially looking at how we can use artificial intelligence and machine learning and incorporate them into these sorts of interactive installations. And, yeah, we're the two creators of Tulpamancer, and we were very lucky to also meet Aaron Santiago at ONX, a studio in New York that sort of accelerates XR projects, and he came on board to help us build a project we'd been trying to imagine, or we'd been imagining for years, and, you know, finally had access to the tools to build it.

[00:06:22.943] Aaron Santiago: I'm Aaron Santiago. I'm a freelance creative technologist based out of New York. Years ago I had a background in software engineering but now really I just work on fun projects with fancy new technology and I'm super excited about this one. There's all this super really cutting-edge stuff that I haven't really gotten to build at this level yet and it's been a blast. It's really come together.

[00:06:45.537] Kent Bye: Awesome. Maybe each of you could give a bit more context to your background and your journey into working with XR, spatial storytelling, and artificial intelligence.

[00:06:54.859] Marc Da Costa: Absolutely. So as I mentioned a bit earlier, I have a little bit of a dual-tracked background. So as a cultural anthropologist, I did a lot of work on what are called placemaking practices, or trying to understand the answer to the question, how do we know where we are? And in my scholarly and academic work, I was very interested in the role of Antarctic science and climate data modeling in giving us a sense of what it means to live on a planet with a climate system that's at risk, and how we come to have that understanding of the world. And there's a deep current in my work that's very interested in how complicated technical systems actually shape our reality in a way. They inform kind of our phenomenological experience of the world around us and are really quite profound and invisible in a lot of ways. So I've had, I suppose you could say, a deep philosophical interest in this for a long time, and also recognizing, unfortunately, that academic work isn't always the best way to have a sustainable life, I began a data technology company in New York that also kind of emerged from some of this research and realizing how much public data is produced and maintained by governments, so everything from the contents of every shipping container to who owns what aircraft, you know, how shell companies interrelate. And I was totally amazed by that and certainly saw that there was an opportunity to do something with it in a commercial way, but I think the through line for all of that was a real philosophical and artistic interest in what data is, what archives are, and how they shape our everyday experiences.

[00:08:42.757] Matthew Niederhauser: Pretty complex background with some of the immersive stuff that I've been building over the years. I should also note that I did start Sensorium with John Fitzgerald, my good friend. I didn't throw that out there in the beginning. But we got into this by building 360 cameras from scratch and immediately were investigating how to tell immersive stories within that context. And we were very lucky to be in a very active community in New York and immediately started questioning, you know, how to layer on interactivity, what it means to take people on a journey inside of these new modalities, all these new headsets and other devices through which we could see the world differently. And that led to a bunch of projects, especially where we would put people into virtual worlds together at the same time, including Zikr, a four-person VR experience we showed at Sundance that was also made with Gabo Arora and John Fitzgerald. And also Objects in Mirror Are Closer Than They Appear, a really fun AR experience made with Jeff Sobel and Graham Sack and also John Fitzgerald that used AR to explore these set designs. And most recently, or before this string of projects I've been doing with Marc, a really great piece called Metamorphic that was shown at Sundance in 2020 before the pandemic, made with Wesley Allsbrook, Elie Zananiri, and John Fitzgerald. And once again, it was about putting people in virtual worlds together. And that sort of took on this journey that included body swapping and changing aspects of your world with the environment. And so for me, it's really been constantly questioning new ways of telling people stories that make them question their fundamental being, in a sense, or put them in a place that's outside of the normal context of their body. And, you know, finding devices, sometimes ones that are even outside of language, that can offer a new experience. And most recently, as I've mentioned, the tools that we have used to create Tulpamancer have gotten to a state where it feels like we can really craft an installation and an experience that is ready to be shown. And so it's been a big experiment coming here to Venice with Tulpamancer.

[00:11:20.899] Aaron Santiago: My background's been in games mainly. I think it's funny, I started programming when I was very young, I was 12 years old, because I really just wanted to make games for myself while I was in detention at school. Because back then, you know, they gave us access to those computers. But, you know, fast forward however many years, I saw a Kickstarter for the original Oculus Rift, and I immediately fell in love. I spent, that was all of my money back then, whatever it was, $600 to get it, and I think really since then I've just been tracking, like, how can I be a part of this world, these people that make things with these new technologies? And as I've gotten deeper and deeper, I've realized that there's a whole movement of generations. You know, there's VR, there were depth sensors, there were NFTs, and now it's AI. And I want to stay at this forefront, and I have very strong opinions on how these things should be made, how they should be presented to the public, and how people should understand them. So that's a big reason that I'm here today working on Tulpamancer.

[00:12:24.472] Kent Bye: Yeah, and I know, Matt, we just had a chance to talk just three months ago at the summer screening that was happening at ONX Studio, where I had a chance to see Kinetic Diffusion, which, Aaron, I know that you were helping to work on, and I had a whole interview with Brandon Powers talking about how that was very much a timed experience, where there were a number of different beats and prompts that were being sent out that gave this kind of timeline to the prompt engineering. And I know, Matt, when we were talking, you were talking about how you were, with Marc, working on a number of different AI projects. And so there's a number of different streams that are all coming together here to create what is Tulpamancer. So I'd love to hear where it began and what was the catalyzing moment for you to start working together and to start to initially explore AI, and then bring in Aaron with all the different stuff that's happening on Kinetic Diffusion. So yeah, I'd just love to hear how that story plays out.

[00:13:14.685] Matthew Niederhauser: I like to tell this one because it's a pretty funny anecdote, for me at least. But as you mentioned, Marc and I, this is actually our third project that we've built this year using this type of technology. And the last one was called Parallels, which was an LED installation that was shown at Plasmata 2 in Ioannina, Greece, which is a big outdoor digital arts exhibition that the Onassis Foundation puts together. And it includes sort of a camera on the back of this LED wall that is perfectly positioned to create this little slice of the world that you can encounter on the other side that's being interpreted by machine learning. It's like a little slice where you can have this visceral encounter with how these new technologies are decoding the world and this landscape. And so I was at ONX in New York building this special cart with, like, a 4090 computer and this camera attached and this HDMI input, and prepping it to test some software that Marc had been developing. And I don't know, I think it was at 4 p.m. and Brandon and Aaron had booked the space next, and I was wrapping up for the day. And so they walked up, and Aaron basically was like, oh, this is exactly what I need. And I was like, what? Like, what are you doing? You know, I've literally built this device to interpret live video through a Stable Diffusion box, and this is exactly what you need for your project right now? And I was slightly offended, because I basically felt like I'd prepped their project all day, but at the same time super intrigued, and this was Kinetic Diffusion that they were working on. So I was just immediately interested. You know, I mean, I've known Brandon Powers for a while and everything that he's been doing, I love his work, but I saw Aaron's capabilities in terms of how he was building that system and working with Brandon on that. And so I immediately went to Marc, who I'd been talking about Tulpamancer with for a while, as I said. We had seen an architecture that I think we knew could be accomplished, and we very quickly recognized that Aaron was somebody who had the capability to come in and help us bring it to life, really.

[00:15:42.200] Kent Bye: Yeah, I want to dive into Kinetic Diffusion. But before that, I want to go back to the first experience that you guys created, because you gave me a little bit of context, Matthew, when we were in New York City for the ONX Studio exhibition in the summer. And you had a chance to talk about that a little bit on a podcast that I've published already. But I'd love to hear from your perspective, coming in and starting to work with these AI technologies. And my memory and recollection was trying to take some of this writing that had a little bit of poetic interpretation and the use of metaphor, and how you can start to turn that into something that's understandable or digestible by the machine learning models. So yeah, I'd love to hear about how that project came about.

[00:16:16.010] Marc Da Costa: Yeah, absolutely. So the first project that we did this year was called Ekphrasis, and in it we were using poetry as a way to try to basically bring a critical lens and a sort of interrogation to the ways in which AI image generating algorithms see and conceive of the world. So, you know, when you read a poem it can be a very sort of fragmentary and visual experience where images come into your head and then they go away, and this sort of helps build around the meaning of a poem and your experience of it. And so in Ekphrasis we engaged with the corpus of a poet called Constantine Cavafy, who was probably the most famous 20th century Greek poet. And in it we basically were making a series of video works that were, in a sense, visualizations of or driven from the poetry itself. And this was a very interesting experience because it was, I think, our first time working with Stable Diffusion and these sorts of technologies. I think we did a lot of learning on how they work and how to even kind of operate the knobs, if you will. But it was very interesting, because most of the time the images that were produced in these videos were frankly pretty bad, but then there were a bunch that were very surprising, and I think even for me they were sort of very interesting, because when you read a poem you read it very closely, trying to really understand each word and how it works and why is that comma there and what's going on. And of course we don't speak Greek, and we're reading all of Cavafy's poetry in translation. In any event, this work was really focused on seeing what it means to even do a close reading of poetry with this sort of technology, what kinds of outputs can be made. And in addition to making these videos, we also produced sort of an artist book as a codex next to it that was very focused on documenting the process. So Matt and I had some screenshots of some of our WhatsApp chats, and the interfaces that we used, really sort of trying to recognize that this was in March 2023, that what we were doing was a total snapshot of the state of the tech zeitgeist for these things at the time. And this immediately led, I think we learned a lot about how these things worked, immediately led to the project we did in June in Ioannina called Parallels, which was very interested in, as Matt was saying, thinking about how we can bring this sort of machine learning technology, these image generation technologies, out of the laboratory, out of the sort of festival space, and embed them in the natural landscape, in nature, and do it in such a way that gives a broad public visceral and immediate access to how this stuff is seeing the world. For this we trained it on a very site-specific corpus of artistic styles from people that were from this part of Greece over the millennia, you know, from what we have, but also from foreigners who would go there and sort of paint the landscape as part of the Grand Tour, different landscape painting traditions. And so it was very much, I think, again in a sense hearkening back to sort of a history of fine art and trying to use that as a way to crack open or take another look at what is actually going on with these new technologies.

[00:19:37.931] Matthew Niederhauser: I'll add quickly onto that. Parallels was probably seen by close to a hundred thousand people over six to eight weeks, whereas Tulpamancer will be seen by a maximum of 200 to 250 people over the course of the week because of how it is designed, and that's totally fine. But I will also add, you know, those two projects were also the beginning of figuring out what it meant to, as many people say, collaborate or have a conversation with machine learning tools. It was really very fundamental in terms of where we eventually ended up with Tulpamancer. And it's a little bit crazy because of the pace of the technology and how it's moving. You know, it's crazy to think that that was in March and then June, and now we're in August. And honestly, it's like maybe even every three months. And as soon as we're finished, it almost feels like it's old. And there's like something new that we want to do. And we can talk about that eventually, like what's post-Tulpamancer. But anyway, just a little addendum to what he was talking about.

[00:20:46.952] Kent Bye: Yeah, and so Aaron, I want to cut back to you walking into ONX Studio with Brandon. You see what Matthew's been working on and you have a reaction, and then I had a chance to see Kinetic Diffusion at ONX Studio in June, and yeah, just a really amazing, like, 30 frames per second. Brandon was telling me there's like 11 GPUs that are in the cloud that are taking these different frames and rendering them out, and you've got a click track or a metronome that's playing, and so there's a timing that happens in these three acts. And so as he's dancing, then you have these prompts that are going out at very specific times. And so I'd love to hear about some of your reflections on the types of innovations you had to do in order to create this type of architecture for this kind of real-time generative AI style transfer with a multi-channel projection installation. And then now we're here at Venice and you're applying some of those different insights into what we are seeing with Tulpamancer. So yeah, I'd love to hear your journey on that.

[00:21:41.370] Aaron Santiago: Yeah, I think I'm definitely very grateful that I had Kinetic Diffusion before Tulpamancer, just so that I really, really came out of that project, came out of Kinetic Diffusion, understanding Stable Diffusion, like, extremely deeply. Like, I know how all the models work and the checkpoints, how everything interacts, what causes different kinds of... And it's interesting how different the projects are as well, because I think Kinetic Diffusion's raison d'être is, like, it's fast and nobody does it this fast and nobody does it this real time. You know, there's like a two-second latency and it comes in and the frame rate is super high and it's just so dazzling. But one commonality between it and Tulpamancer is that everything else, in terms of the limitations of Stable Diffusion, is extremely unimportant to the audience. Like the resolution, like who cares; like how the prompts are engineered, like who cares. Kinetic Diffusion is about this core of immediate interactivity, and I think for Tulpamancer, it's the same way of taking what people say and showing it back to you. And a lot of the work for both of these projects has been trying to erase any other rough edge around how any of these processes happen. And I think, like a lot of the stuff that you mentioned, like the multi-channel installation and the tight timing of the beats, at the end of the day, Kinetic Diffusion is a dance performance, and those are just, when you do that with technology, you want three projectors if you can have them, you want things to be tightly timed, and when you look at it as an audience, that's a lot of where the soul of it is, is that it's so, like, crisp. And that's because I always want to make something good. And I think the parallel to Tulpamancer, which we haven't really gotten into yet, is that the installation aspect of it is sort of far away from the technology as I understand it. Like it presents with this very old desk and extremely old computer and this printer that prints out in a way that's super nostalgic for people and super alien for young people. But it really adds to the character of the piece. And that's, to me, what a lot of the aspects of timing and projection in Kinetic Diffusion are for Tulpamancer.

[00:23:46.957] Matthew Niederhauser: You don't know the deep feeling of a dot matrix printer going off, Aaron? This is the first time you've told us, I mean.

[00:23:58.400] Kent Bye: So yeah, let's get into what was the catalyzing moment for Tulpamancer, where you've been looking at these generative AI tools, both for image generation, but also for ChatGPT. This is all about entering text and getting interactions with large language models to get back some sort of response. You know, there are critiques from AI ethicists who call it the stochastic parrot that is just trying to come up with the statistically best next word. But because of the probabilities of the relational dynamics of human language, it's able to give the perception that it's actually speaking with some sort of knowledge or understanding, using this input to come back and to give a narrative. My experience with ChatGPT has been that it often will either ramble on or it won't be like an arc of a narrative. It'll just kind of be a bunch of average knowledge dump, like almost reading a Wikipedia article in some ways. But you've managed to take enough inputs and create this whole narrative arc of this experience. And so what was the catalyzing moment where you thought, OK, now this is ready to actually translate what these technologies and these tools can do and then actually turn it into a meaningful, immersive experience?

[00:25:08.622] Matthew Niederhauser: I could kick it off a little bit with some of the broader structure and hand it over to Marc, who actually did a lot of the prompt engineering. We actually had to really split up tasks because we rushed to get this done for Venice. I'm super happy with how it's turned out here, of course. But the first thing in terms of approaching this was trying to realize very clearly the limitations and what type of information could actually be taken and gotten out of the tools themselves. And this one especially was catalyzed by understanding that with Stable Diffusion we could create these compelling equirectangular images and get matching depth maps, which made me feel like we could do something in VR. And those are the fundamental assets that are then animated and turned into particle systems that create these dreamlike visual worlds. And knowing that we could achieve that, it really then became about, at least in my head, how are we going to structure an installation that got text out of people, or, like, had them respond to it, you know, in an open and sincere manner, and also gave us time to build a little, you know, in the background to deliver what we wanted to. So that was what really made me feel like it was time to go with this in a big way. It was not looking just at the quality of the text and scripts we were getting out of, you know, we did use ChatGPT for a lot of that, but at what the media assets were that you were going to hear and see. And, you know, the last component was using a text-to-voice tool called Play.ht, which we actually had first started diving into with the Ekphrasis project about Cavafy's poetry. So that for me was the go time, and then, you know, that catalyzed how we were going to bring a narrative to it.
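
[Editor's note: to make that image stage concrete, here is a minimal, hypothetical sketch of the kind of pipeline being described: a Stable Diffusion render of a wide, equirectangular-style panorama paired with a monocular depth estimate. The model names and prompt below are stand-ins, not the community checkpoints tuned for equirectangular output that the team actually used.]

```python
# Hypothetical sketch of the image stage: render a wide "equirectangular-style"
# frame with Stable Diffusion, then estimate a matching depth map.
# Model names are examples, not the checkpoints the Tulpamancer team used.
import torch
from diffusers import StableDiffusionPipeline
from transformers import pipeline

sd = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "equirectangular 360 panorama of a childhood bedroom at dusk, soft light"
panorama = sd(prompt, width=1024, height=512, num_inference_steps=30).images[0]

# Monocular depth estimation produces the depth map that would be used to
# displace the panorama into a particle field inside the game engine.
depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")
depth = depth_estimator(panorama)["depth"]

panorama.save("scene_01_color.png")
depth.save("scene_01_depth.png")
```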

[00:27:18.917] Marc Da Costa: And I would add to that, I mean, you know, at a high level, I think, you know, Tulpamancer is trying to give participants and audiences a sort of intimate and engaged encounter with what's actually going on with these AI tools. And, you know, to sort of just play off of the way you framed the question, and sort of touching on some of the ethical questions and sort of what's going on here, you know, it's not the Matrix that we've made. It's not like a Skynet Terminator sort of thing. It's really an experience that's designed to prime people, put them in a certain headspace where they're thinking about their past and their future and alternative timelines and things, and then creating something totally out of the text that they share, totally for them, from the voice to the images and everything else, and then deleting it at the end. So it's sort of like a little bubble moment to reflect, see how it lands, how it fails, and creating a space to give some other perspectives and kind of ways for people to approach these really big scary questions of, like, a lot of stuff is changing around us, what are we going to do about it? I think we're trying to get into that space a little bit. And from the production ready-to-roll level, I think, you know, for me, it was also seeing kind of what's possible, you know, certainly, as you spoke about, with Stable Diffusion. Additionally, seeing with ChatGPT how you can use the API to sort of give a whole context and backstory, which there is, you know, when we sort of initialize it, and there's a whole kind of mythology that we feed this conversational thread in a way, and seeing how you can shape what the encounter looks like. And then also very even banal sorts of things, like there are ways to coax ChatGPT to give you a response in a way that you can then split using a basic function and then actually use it for other stuff. So realizing also that with certain versions of that model, there is enough stability where you can put rails around it, have confidence that you're going to get something within a set of parameters back, and then focus on the task of shaping it.
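
[Editor's note: as a rough illustration of what that might look like in practice, here is a hedged sketch of a ChatGPT API call (using the pre-1.0 `openai` Python client that was current in 2023) where a system prompt carries the backstory and the model is asked to delimit its output so a basic split can hand each beat to the later stages. The system prompt, delimiter, and function are placeholders, not the project's actual prompt engineering.]

```python
# Hedged sketch of the text stage: a system prompt carries the tulpa's
# backstory, and the model is asked to delimit scene beats so a basic
# split() can hand each one to the image and voice stages.
# The real system prompt and delimiter are the creators' own; these are placeholders.
import openai  # pre-1.0 openai client (circa 2023); API key read from OPENAI_API_KEY

SYSTEM_PROMPT = (
    "You are Tulpa, an old machine that weaves a visitor's memories into a "
    "short narrated journey. Write six scene beats, separated by the line "
    "'###', each two or three sentences, in a calm second-person voice."
)

def generate_script(visitor_answers: str) -> list:
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": visitor_answers},
        ],
        temperature=0.8,
    )
    text = resp["choices"][0]["message"]["content"]
    # The delimiter makes the output machine-splittable, one beat per VR scene.
    return [beat.strip() for beat in text.split("###") if beat.strip()]
```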

[00:29:29.247] Matthew Niederhauser: Yeah, I would add very quickly to that. You definitely have to know how to tell ChatGPT to shut up eventually as well, because it'll just keep trying and grasping in terms of a narrative style.

[00:29:44.155] Marc Da Costa: One just funny thing to add on that is that if you spend too much time telling ChatGPT to shut up, it breaks, because it, like, forgets the other thing you told it to do. So it only shuts up about that but not the rest, so it's always a little bit of chasing your own tail.

[00:29:56.722] Matthew Niederhauser: But definitely what really carries this experience is the quality of essentially the script that is turned into the voiceover, which is really carrying people and getting them emotionally engaged. Otherwise I don't think we would be in the place where we are right now in terms of being able to coax out those stories, and, you know, how amazingly it integrates those responses, and they are very different every single time. And we are in a strange new place where, you know, for a very long time I would check in on the Turing test every single year and see what was new out there, and suddenly we've just gone way past it, and we're able to model these encounters with these large-scale machine learning models that are spooky in action and foreshadowing a lot of what is to come.

[00:31:03.544] Kent Bye: I wanted to get your perspective, Aaron, on coming out of this project. But before I do that, I want to ask one follow-up question, which is about the script that you're mentioning. Before I came out to Venice, I'd posted a 17-episode series looking at the intersection of XR and AI, as over the past four months there's a lot of interviews that I've done looking at this intersection, including a couple of projects at Laval Virtual, the Quantum Bar, and another student project that was feeding what ended up being GPT-3 and GPT-3.5 to do this world building, where you're able to prime it with certain information. And so when you said that you're feeding it a script, do you mean that you're feeding it the script in terms of the screenplay that you have for the overall arc? Or do you mean that you're feeding it a primed prompt that is setting a context that you're then melding in with whatever input that is also coming from the user?

[00:31:53.413] Matthew Niederhauser: Yeah, it's a primed prompt. I mean, with the API, what he's saying is you're almost telling ChatGPT what type of character it's embodying in the first place. And then, yeah, I mean, sometimes when we were first talking about it, the prompt is a little bit of a mad lib. You know, it's sort of like, given someone said this, you know, respond to them in a way that does this and in a certain tone, also based off of, you know, essentially this character backstory. And that was a process that needed a lot of fine-tuning, but it is not a script insofar as it needs to hit certain dramatic points every single time. No one's going on a hero's journey, you know, in a different fashion each way. It is really responding to the textual input and using that as the basis for trying to reach out and connect to what the people are actually inputting, or the semblance thereof. So yeah, I don't know, it's not a script, it's an architecture.
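
[Editor's note: to give a flavor of what such a mad-lib-style primed prompt could look like, here is a purely illustrative template; the field names and wording are guesses at the shape of the approach described above, not the production prompt.]

```python
# A guess at the "mad lib" shape of the primed prompt: fixed scaffolding with
# the visitor's own words slotted in. The wording here is illustrative only.
PROMPT_TEMPLATE = """\
A visitor has shared the following:
- A childhood memory: {childhood_memory}
- Something that happened today: {ephemeral_moment}
- A hope for the future: {aspiration}

Speaking as the tulpa, respond in a warm, reflective tone. Weave their own
words back to them, move from past to present to future, and do not invent
facts they did not share."""

primed_prompt = PROMPT_TEMPLATE.format(
    childhood_memory="a yellow bedroom with a humming radiator",
    ephemeral_moment="sheltering from sudden rain outside a restaurant",
    aspiration="to write something that outlives me",
)
```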

[00:33:04.693] Kent Bye: So yeah, in my experience of the piece, you go in and you're at this older computer and you're answering questions that are then being mined to feed into ChatGPT, which is creating like an audio narration, with the audio translation sort of forming the heart of the piece in creating this arc. But there's also a lot of images in that. And so you're also taking fragments from those text inputs and you're creating images. And so I'd love to hear from your perspective as you're coming onto this project from the Stable Diffusion or the other image generation stuff. I'm assuming that's some of what you were working on, but I'd love to hear from your perspective coming onto the Tulpamancer project.

[00:33:41.379] Aaron Santiago: Yeah, I think a big part of my contribution to that process was trying to understand how to use the resources that we had during a run in a way that gives people the best experience. Like, a big thing is, you know, there's so many different levers with AI. There's like, how high quality is the image? How much resolution is it? Like, how long are the prompts that we send to GPT? How long are the text files that we send to Play.ht? And when are those requests happening? And when can we expect them back? So a big part of the challenge that I had to face was orchestrating all of that so that when people sat down in the headset, everything was just ready. And sort of like I was mentioning earlier, people don't care about this stuff. So I felt like a lot of my job was to make all that stuff as good as possible within the restrictions that we had. We're generating assets up until almost the last second, when we can afford it, just so that we can get that extra hundred pixels when we need it. But I think Marc should talk about the prompt engineering that goes into generating the script, because there's a lot of tricks that we do when turning the text into the script that people hear at the end.
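
[Editor's note: one plausible way to think about that orchestration, sketched under assumptions since the real scheduler isn't described here, is to fire the slow, network-bound jobs concurrently as soon as the script beats exist and only block right before the headset needs each asset. The generator functions below are placeholders.]

```python
# Sketch of the orchestration problem described above: start the slow
# network-bound jobs (voice synthesis, image rendering) as early as possible
# and only block right before each scene is needed. Function arguments are
# placeholders for the project's real generators.
from concurrent.futures import ThreadPoolExecutor

def build_session(visitor_answers, generate_script, synthesize_voice, render_scene):
    with ThreadPoolExecutor(max_workers=4) as pool:
        script_beats = generate_script(visitor_answers)          # must finish first
        voice_jobs = [pool.submit(synthesize_voice, b) for b in script_beats]
        image_jobs = [pool.submit(render_scene, b) for b in script_beats]
        # Assets keep generating "up until the last second"; collect them in
        # scene order so scene 1 is ready even if scene 6 is still rendering.
        return [(v.result(), i.result()) for v, i in zip(voice_jobs, image_jobs)]
```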

[00:34:54.925] Marc Da Costa: Yeah, and it was, I mean, on the Stable Diffusion side, you know, it's interesting. Like, it is difficult to generate equirectangular images out of the box with the Stable Diffusion models. And, you know, there's a very active open source community around it where there are probably hundreds, if not even maybe a thousand, sort of models that are based off of these base models of Stable Diffusion 1.5, or now SDXL 1.0, where people give it their own spin in a little way. And we had to do a lot of work testing different ones to find which would consistently create workable equirectangulars. And it's interesting because it's not about nailing it once, it's about nailing it a thousand times when you kind of run it. And we don't get it every time. You know, there was sort of an interesting, like, evaluation process almost that we had to do of coming up with hypothetical answers to all of these questions, pretending to be different literary characters or whatever, and then literally just batch running all of this stuff to kind of fine-tune. Because again, it is really this chasing-your-tail sort of thing, where you change one thing, it breaks something else, and really trying to find this kind of sweet spot of where you put the knobs and how you actually craft not only the prompt itself, but how you coax the prompt. Because it's basically, you answer things, there's sort of a script, and then there are visuals that connect to that. And that's sort of done in an automated way of trying to connect these, frankly, very slippery and chaotic juncture points to get from A to B. And so it really took a lot of just mass scale testing to sort of see where did it feel like we were consistently landing in the right place. Because again, it isn't just like with Ekphrasis, about making one banging video, if you will, and just really being able to edit it and tweak it. It's how do you set up a context where this can be open to the public and it can work.
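
[Editor's note: for illustration only, a batch evaluation like the one Marc describes might be scripted along these lines, assuming a `render_scene()` helper that wraps the diffusion call; the checkpoint names and hypothetical answers are invented.]

```python
# Sketch of the batch testing described above: run a grid of candidate
# checkpoints against hypothetical visitor answers (written in the voice of
# different literary characters) and eyeball which settings land consistently.
# Checkpoint names and the render_scene() helper are placeholders.
import itertools
import pathlib

CANDIDATE_CHECKPOINTS = ["checkpoint_a", "checkpoint_b", "checkpoint_c"]
HYPOTHETICAL_ANSWERS = [
    "I grew up by a cold grey harbour...",        # e.g. answering as a literary character
    "My childhood was a garden behind a wall...",
]

def batch_evaluate(render_scene, out_dir="eval_runs", runs_per_combo=20):
    out = pathlib.Path(out_dir)
    out.mkdir(exist_ok=True)
    for ckpt, answer in itertools.product(CANDIDATE_CHECKPOINTS, HYPOTHETICAL_ANSWERS):
        for run in range(runs_per_combo):   # "nailing it a thousand times", not once
            image = render_scene(checkpoint=ckpt, prompt=answer, seed=run)
            image.save(out / f"{ckpt}_{hash(answer) & 0xffff}_{run:03d}.png")
```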

[00:36:55.391] Matthew Niederhauser: Yeah, when you start answering the questions, it's a waterfall at that point, and all of it needs to work seamlessly so that it literally ends up ready for the headset immediately after. We're not there interceding at any single moment. And in that sense, you know, it's talking to itself at a few points, in terms of taking the responses to create the scripts, and then turning the scripts inside out to get Stable Diffusion prompts, and then cutting those up to send to text-to-speech, and then having all of them land gracefully on a different computer that is then waiting to churn it through another architecture within Unity itself. And that feat was accomplished by Aaron and me being like, no, it's like this thing into this. And Aaron did an amazing job of listening to my rants and actually doing shader and particle work that really visualized it in ways that I wasn't even expecting sometimes. And so, you know, it's been a real group effort to make this entire thing land at Venice this week. But what's really interesting about that back end, though, is its malleability in terms of using other tools as they're rolling out this year. This has been a milestone for the project, and we always talk about this with machine learning projects. We sort of want to encase it a little bit, because even the state of the visuals right now is going to be very particular to this time, and they could be recreated easily because Stable Diffusion is open source, and we'll probably always have this ensconced somewhere. But being here, having people go through it, especially the community that shows up for Venice, it's really amazing feedback, like the people you're not even expecting to go through and haven't even met before, aside from the usual suspects like you, Kent. It's really catalyzing a lot of new ways of thinking about how, you know, this back end that we've built could be used outside of the context of Tulpamancer, but also how Tulpamancer itself could continue to evolve and incorporate new machine learning tools, you know, moving forward.
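
[Editor's note: the waterfall handoff described here (answers to script, script to image prompts and text-to-speech, assets landing on a second machine running Unity) could be staged in many ways; one simple, assumed pattern is a per-session folder on a shared drive that the Unity side watches, with a manifest written last as the ready signal. The paths and field names below are illustrative.]

```python
# Illustrative sketch of the handoff to the Unity machine: each run gets its
# own folder of scene assets, and a manifest written last acts as the
# "ready" signal for the playback side. Paths and layout are assumptions.
import json
import pathlib
import uuid

def stage_session(scenes, shared_drive="//render-box/sessions"):
    """scenes: list of dicts mapping asset kind ('audio', 'color', 'depth') to a file path."""
    session_dir = pathlib.Path(shared_drive) / uuid.uuid4().hex
    session_dir.mkdir(parents=True)
    for i, scene in enumerate(scenes):
        for kind, src in scene.items():
            src = pathlib.Path(src)
            target = session_dir / f"scene_{i:02d}_{kind}{src.suffix}"
            target.write_bytes(src.read_bytes())
    # Writing the manifest last means the Unity watcher never sees a half-copied run.
    (session_dir / "manifest.json").write_text(json.dumps({"scene_count": len(scenes)}))
    return session_dir
```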

[00:39:30.524] Kent Bye: The image that came to mind as you were describing that was a little bit of the Ouroboros of the snake eating its tail. You get one output and slice and dice it and put it into another. So it's really this multimodal exploration of how to use all these input and output options to be able to really, at the end, create an immersive experience that's really quite compelling: these immersive equirectangular images from Stable Diffusion created into this stereoscopic spatial experience that has the audio narration that ends up having a complete arc. But Aaron, I'd love to hear any other reflections on this kind of Ouroboros, you know, slicing and dicing the different inputs and outputs that you had to put together and somehow feed into Unity in real time, and, like, by the time you're done entering in all the text, have an immersive experience people can hop into.

[00:40:13.743] Aaron Santiago: Look, I mean, credit where credit is due. When these guys approached me with a project, they had it mapped out and I looked at it and maybe for the first time in my life, I was like, yeah, that's exactly how it's going to look. So, I mean, a lot of the challenges came with, you know, we would first do it and it would say things that just made no sense and we would see images that had horrible seams in them and it was all about ironing out these little details and the smaller the details got, the harder it got to iron them out. I think really this concept of the Ouroboros and how many times we feed ChatGPT into itself is like the big breakthrough that we had that allowed us to get such quality out of both the images and the audio. Yeah, yeah.

[00:41:01.665] Marc Da Costa: I would just sort of add on to that a forward-looking statement, which is, you know, we're really happy to be here and be able to share it with this community, but this is a project that's going to keep evolving, and I think we're very excited to get, you know, our hands dirty again with the developments that have happened with Stable Diffusion, you know, potentially training some more of our own models and really taking a more curatorial, thoughtful approach instead of frankensteining things that are out there. And also, it's been very interesting on the large language model front, where increasingly there are quite good local language models for generating text that you can run on a 4090 gaming computer. And so we think also having everything locally, because right now we use ChatGPT and Play.ht, which are remote services, opens up a lot of possibility for us to extend it and continue to refine it.
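
[Editor's note: as a hedged sketch of that local-inference direction, a quantized Llama-family model can be run on a single consumer GPU through llama-cpp-python instead of calling a hosted API; the model file, prompt, and settings below are assumptions for illustration, not anything the creators have said they use.]

```python
# Hedged sketch of running the text stage locally on a single 4090 with a
# quantized Llama-family model via llama-cpp-python, instead of a remote
# ChatGPT call. Model file and prompts are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-13b-chat.Q4_K_M.gguf",  # hypothetical local weights
    n_gpu_layers=-1,   # offload all layers to the GPU
    n_ctx=4096,
)

resp = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are Tulpa, a machine that narrates memories."},
        {"role": "user", "content": "I remember the smell of rain on the vaporetto dock..."},
    ],
    max_tokens=512,
)
print(resp["choices"][0]["message"]["content"])
```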

[00:41:56.751] Aaron Santiago: Yeah, it's so funny, we always say this about how it's about right now and how the tools are just coming out right now with this piece. During the development of this, a brand new version of Stable Diffusion came out, a brand new version of Play.ht came out, a brand new version of Meta's Llama came out, and we were already locked on the stack that we had because we needed a ticket to this festival. But it's like all of a sudden we have a vision of a totally new iteration of it that may just eclipse what we have in quality.

[00:42:29.195] Kent Bye: I wanted to speak a little bit about my own experience of the piece. You know, I've done a lot of interviews about AI, and there's a lot of deeper ethical aspects of how AI is fitting into the broader ecosystem of artistic creation and authorship, and all the different writers' strike and SAG strike stuff that's happening, where AI is a big part of that in terms of its relationship to human labor. But from an experiential perspective, I feel like the piece and the prompts are roughly getting at aspects of my memories, the nostalgia from the past, an ephemeral moment that just happened in the present moment of the day, and then looking out to the aspirations of the future. So you have the past, present, and future that's roughly looking at these arcs of where you're coming from, where you're at now, and where you may be going. And so I thought about the ethics of sort of messing with people's memories. When I go into Google Earth VR, and I go into a place where I used to live, and then I see this really low-resolution, imperfect representation of what my memories of something are, it's sort of like, somehow, when I'm remembering something, I'm maybe overriding some of those different memories. And so, when you're asking for these different nostalgic moments, then to what degree am I going to be opening up to this new version of my childhood bedroom? Is that going to be overwritten? But then it was a really magical moment of, like, having the description of how I got here today, like, well, I had to take a boat from the Venice Airport, and I was just describing all the things I was seeing, and then just to see that play out in this. It started pouring down rain really hard, and then I had to go hide in a restaurant, and then there's a waiter that came up and looked at me and just, like, nodded, like, yeah, that's what I would be doing. And he's like, can I help you? And I was like, no, I'm just waiting. And he's like, yeah. And so that was the moment I shared. And so then it had created this whole, like, ephemeral moment that had happened that day that was entered in as a prompt and kind of transformed into this whole other meaningful thing that I wasn't even thinking about. So it's like taking an ephemeral moment and adding all these layers of meaning that is mythologized. And that one little moment I'll probably remember for a long time, where ordinarily I'd probably forget it in the absence of this. And then the more aspirational prompts of the future, of trying to reflect upon these dreams and aspirations that I have, and having ChatGPT reflect back to me this, like, yeah, yeah, you can do this type of feeling, reaffirming this vision of my identity, of this future potential self that I want to have. So I think, you know, there's the ethical aspects that we can get into, but there's also the experiential aspect, which is: the technology is there, what can we do with it? But also wrestling with the implications of what does this mean for artists and creators. If Tulpamancer walks away from Venice with the top prize, is this a sign that the arts and the creators are no longer needed because now the AI can just tell us all of our stories? So there's things like that that I think about, these potential possibilities that are like, okay, what's that mean as we move forward? So I'd love to hear any reflections on that.

[00:45:20.789] Matthew Niederhauser: We thought of none of this, no. No, these are the important questions right now, especially in terms of trying to have this very personal moment. And to draw out these, what would potentially be private memories, was intentional, and we were very careful with that architecture in the back end. And it is a very easy tool to turn against somebody as well, depending on how you prompt it and the backstory that you give to ChatGPT that is trying to interpret you. And I would say we used very, very soft gloves in terms of the model itself. We have had a lot of conversations. It's like, yeah, we could have potentially been more provocative even, in terms of the stories that we could tell or challenge. For our first iteration using these tools, I didn't think this was the place, you know, to do it. And we're being very careful in terms of how we handle people's data within that context, and also in terms of the fact that it is gone afterwards. But there's always risk involved in terms of letting the tools sort of speak for themselves eventually, even though we spent so much time trying to architect the experience itself. So, you know, fundamentally I feel that these tools are out there and they're going to be used, and they need to be experimented with, they need to be interacted with, they need to be used in a way that actually allows greater criticality. And I think that every individual should be aware of what's coming, and a lot of it's going to be hidden in terms of its use. We're very open about what we're doing. And in that sense, I think we were playing within a good zone to introduce this type of technology within the context of XR and storytelling. We did not imagine this project within the context of the strikes that are occurring right now, which are amazing and one of the first labor movement responses to the intrusion of these technologies into people's livelihoods. You say artificial intelligence, and the national Italian media have definitely come over, and we were on the five o'clock news on RAI television. We've had some reporters actually even coming over from the film side. So, you've created an experience that has no screenwriters, no actors, and no production crew? And, you know, it's like, well, you're technically correct. And it's been interesting trying to represent what we feel are artistic values, to use this tool in new and compelling ways, and hopefully thought-provoking ways that give you a better understanding of what the implications of these technologies are, you know, and the fact that it's going to have some pretty major consequences on society as a whole, quite frankly. Slow rolling or not. So I don't know, I mean, we really wanted to stake out some ground with this project by coming to Venice with it. Its context has taken on a lot of new dimensions even in the past month or two leading up to it. But yeah, I don't know, Marc, I mean, we've all been in the trenches with this as well, but yeah.

[00:48:57.776] Marc Da Costa: Yeah, I mean, we've certainly been putting in our share of 14, 15-hour days for a while, so it's not as if there's no labor involved in producing something like this. I mean, I think it is, just to touch on it, a provocative question, a very interesting one that you ask with respect to the ethics of memory. So we do ask people to share very personal things, and it's something that, you know, we did think very carefully about. But the way that you framed it, of what are the ethics of potentially intervening in those memories or reshaping them or scrambling them or whatever, I think I hadn't actually quite framed it in that way, but the first thing that I thought of as you were saying that was to ask myself, well, what are the ethics there? Every time we read a novel about a child or see a film or something, these new experiences and stories do inevitably start to intermingle with our own and sort of touch us in these ways. But I think it's not an obvious answer, though, to this question of what it does mean. It was actually interesting, in an early development phase of this, I think I did give ChatGPT, I forget the name of it, but there is like a discredited psychoanalytic or whatever technique of bringing up repressed memories. There's a name for it. And I did ask ChatGPT to, like, pretend you're doing that, just curious what would happen. And it was funny because you then in a lot of ways start to run into the ethical guardrails that are, like, baked into the software from whatever the company decides. So it was just like, this is a controversial thing, it should only be done by a doctor, no one should do it, and it just kind of totally closed down that whole train of things. But I think there is, yeah, it's interesting. Something to meditate on, I think.

[00:50:47.824] Matthew Niederhauser: I'll be a little provocative. I sort of think it's fair game a little bit, you know, within a context that is trying to still create safe spaces for exploring it. And, you know, as I said, we were mindful of the architecture. We were fundamentally looking for positive responses that acted as a bit of a reflection of what you gave to it. I mean, quite frankly, it really gives back as much as you put in. And if you withheld a lot, you know, it would still try to connect with you about what you shared. But fundamentally, insofar as it tried to reach out and speak to your memories, you know, it also always exposed what it doesn't understand about humanity and its context. Some people actually, well, some people didn't get that, and, like, we definitely have seen some tears. We've definitely seen some people who are, I don't want to say angry, but just, like, that didn't land. And, you know, the spectrum of responses has been greatly varied, and that's great, you know? And I don't know, one of the most crazy things about immersive, or using these technologies over the past six or seven years as storytelling devices, and this is after doing a ton of festivals and all these contexts of showing this technology, often for the first time, to people, is that when somebody comes out of an experience, it really exposes their sort of phenomenological context in a strange way. And it could be even basic physiology, like people who really don't understand how to maneuver in VR or don't have the coordination, or something like, wow, is that how you navigate the world all the time, like barely holding on or catching on to things? You know, it's like it really exposes, you know, it's amazing that we can all hang out and talk to each other and society's great, but, you know, we fundamentally sometimes see the world very differently, and immersive brings that out in a big way. And the strange thing about this machine learning project, I think, in terms of trying to create an entity that's going to connect to you, is that it sort of brought out the different ways that people anthropomorphize things, in a way, that they connect or can humanize. Like, you know, we imbue consciousness into toys, into objects, into pets, into everything around us all the time. You know, we look for patterns and subjectivity and personality in things everywhere. And this is a big litmus test in a strange way. You know, obviously it's imbued with a lot of the news cycles and perception of what AI and ML stand for in the first place, but it's been a new, I don't know, maybe this is just exhaustion, you know, looking in after five days of running this thing, but it's sort of opened up a new dimension in terms of analyzing the responses of what people have come out of it with. But, you know, I don't know. We've tried to do good by our participants, I would say, fundamentally, but it's been really interesting.

[00:54:12.222] Aaron Santiago: Yeah, I think one of the really beautiful things about Tulpamancer is the fact that we ask people about their memories, because a lot of the impetus and the impact of the piece is getting people to understand the relationship between what you put into AI and what you get out. And one of the biggest challenges with this kind of piece, especially, like, definitely when showing it to the general public, is that it's so hard to get them to be creative and honest about things that they have in their mind. So when we have them address their memories, it invokes such visceral imagery in their heads that they can't help but compare it to what the AI is telling them and showing them. And I think this is something that we've talked about recently, but there's, like, this big fetishization of the capabilities of AI, but we definitely trusted that it's really not that good, and it's not going to go in and manipulate you into doing something crazy, and it's not going to, like, send your information off to some secret underground Skynet overlord and hide that away. Really, it's going to sometimes say something very insightful and sometimes say something totally incomprehensible, and, you know, every time people are just rolling the dice to see what they get.

[00:55:25.889] Matthew Niederhauser: Yeah, and this brings up the idea of what the tulpa is in the first place. And it is based off of this concept that is originally Buddhist and is in the Tibetan Book of the Dead: that through meditative practice or otherwise you could manifest an entity that you could speak to or talk to, and how it became repopularized again in early 20th century intellectual circles, especially through a book called Thought Forms. But we feel in this strange sense, or at least I feel, that these large-scale machine learning models are essentially trained and based off of this corpus of knowledge, digital knowledge, that has been created by billions of people over the past 20 to 30 years. And in a strange way, it's been analyzed and externalized into this massive probability machine that is a strange tulpa of its own kind. We've created something that we can talk to, in a way, and it is oddly based off of humanity's digital archive. And that's, I guess, one of the bigger picture things that we're thinking about in terms of the project, and is the idea behind the Tulpamancer in the first place. And that is to maybe even counteract a little bit what Aaron just said. It's just the probability of one word coming after another, done at such a large scale that it is, I don't know, potentially smarter than most of us. But it's, man, I don't know. We just also did this big fishbowl AI thing earlier as well. And there is so much to think about and talk about just in the core nomenclature of how do we define intelligence? What is its being? You know, why is that different for everybody in the first place? And yeah, I don't know. We've sort of tried to poke the bear a little bit with this by creating this encounter, and I think the best part has been the takeaways that really provoked new thoughts about how the technology works and how it can be used. And that's the least I think we could do right now by showing Tulpamancer here.

[00:58:00.368] Kent Bye: I want to ask a quick technical follow-up, like an ethical question, because you say that the data are deleted at the end, but in the back end, are you monitoring what's being said? Are you taking a look at what the text prompts are? Are you doing some sort of level of quality control? And what level of privacy do people have when it comes to the data that they're entering and what the creators have access to?

[00:58:23.068] Matthew Niederhauser: Well, yeah, I mean, if we wanted to flip a switch back there, we could suddenly start saving everything, of course, but it is scripted to delete it, you know?

[00:58:35.135] Kent Bye: I'm just wondering if you can monitor it in real time.

[00:58:38.865] Matthew Niederhauser: Oh, yeah, there are logs. But the logs are mainly there to watch all of the API calls moving back and forth. The text would be contained within them, but it doesn't pop up anywhere. We would have to go looking for it while it was happening, you know what I mean? So yeah.

[00:58:59.032] Kent Bye: So you're not in the back watching all the responses?

[00:59:01.853] Matthew Niederhauser: No. There's a monitor back there that you could probably turn on, and while a session is happening you could be peeking in from behind the curtain and potentially find out what someone is saying. But no, we set it up so we don't have to be back there. It is a cool, quiet place where you can escape the chaos, but we're not there to monitor it. The only thing we might check is whether it's showing visuals, and sometimes we'll take a peek to make sure that the headset is moving. The things we're definitely trying to protect the most are the text input and the audio output. Otherwise, in a festival context, we've really worked for it to basically run itself, delete everything, and move on through each cycle.

[00:59:59.666] Aaron Santiago: I think one of the greatest technical challenges of being here at Venice is that we have no insight into what happened. People will come out and they'll say something, and my brain will sort of tickle: did it work properly? And we won't know. The logs have almost no useful information about what people wrote or what came out of it. We have no idea. And by the time the next run is over, everything is just gone. It's crazy.
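[Editor's note: To make the deletion-by-default pattern the creators describe a bit more concrete, here is a minimal sketch in Python. The function names, the redacted-logging approach, and the temporary-directory layout are all illustrative assumptions, not the actual production code behind Tulpamancer.]

```python
import logging
import shutil
import tempfile
from pathlib import Path

# Hypothetical sketch: each visitor gets a throwaway working directory,
# and the only thing written to the persistent log is API traffic metadata
# (endpoint, status, latency) -- never the visitor's text or generated media.

logging.basicConfig(filename="api_traffic.log", level=logging.INFO)
log = logging.getLogger("tulpa.session")


def log_api_call(endpoint: str, status: int, latency_ms: float) -> None:
    """Record that a call happened, without recording what was said."""
    log.info("endpoint=%s status=%d latency_ms=%.1f", endpoint, status, latency_ms)


def run_session(generate_experience) -> None:
    """Run one visitor's cycle inside a temp dir, then delete everything."""
    workdir = Path(tempfile.mkdtemp(prefix="tulpa_session_"))
    try:
        # generate_experience (hypothetical callable) would write prompts,
        # narration audio, and generated scenes into workdir for this run only.
        generate_experience(workdir, on_api_call=log_api_call)
    finally:
        # Scripted deletion: nothing from this visitor survives the cycle.
        shutil.rmtree(workdir, ignore_errors=True)
```

This mirrors what the creators describe above: the operators can see that API traffic is flowing, but the visitor's words never land anywhere permanent.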

[01:00:29.789] Matthew Niederhauser: I will say that there is a physical takeaway at the end that we haven't talked about. May we return to Aaron's lack of acknowledgement of the genius of the dot matrix printer sound. There is a final ChatGPT call that creates a closing message rendered in ASCII art. So when the experience is over, the first thing you hear right next to you is a very loud dot matrix printer going off. We take you out of the headset, and the docent rips off this one page, which is a final message from the tulpa. That physical item, which you can destroy or take away or do whatever you want with, is a trace of your input. So there is this physical, hard-copy, interpretive piece that is left. In that sense, a very vulnerable piece of data is left behind, but it's in your hands and you can do with it what you will.
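[Editor's note: For the printed takeaway Matthew describes, here is a minimal sketch of what that last step could look like, assuming the OpenAI Python client and a printer exposed through the system's `lp` command. The prompt wording, the model choice, and the printer name are illustrative assumptions, not the project's actual implementation.]

```python
import subprocess

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def print_farewell(session_summary: str, printer: str = "dot_matrix") -> str:
    """Ask the model for a short ASCII-art farewell and send it to the printer."""
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative model choice
        messages=[
            {
                "role": "system",
                "content": "You are the tulpa. Write a short farewell message "
                           "with a small piece of ASCII art, at most 40 lines, "
                           "plain text only.",
            },
            {"role": "user", "content": session_summary},
        ],
    )
    message = response.choices[0].message.content
    # Hand the text to the dot matrix printer via the system print spooler.
    subprocess.run(["lp", "-d", printer], input=message.encode("utf-8"), check=True)
    return message
```

The printed page is the only artifact that outlives the session, which is why handing it to the visitor, rather than keeping it, carries the meaning Matthew describes.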

[01:01:34.462] Kent Bye: And finally, as we start to wrap up here, I'm curious, what do you think the ultimate potential of virtual reality, immersive storytelling, and artificial intelligence might be, and what they might be able to enable?

[01:01:45.675] Marc Da Costa: Well, this is a big question. I would, first of all, disaggregate them a little bit. On the virtual reality side, from the experiences I've gone through, I think there is something really special about the medium as a way of telling stories in modes that can be quite all-encompassing and really fill the frame in different ways. I don't know the ultimate trajectory of it; I've been enjoying the journey, as it were. On the AI side, no one knows ultimately where it's going to go, but I think it will be very strange when it totally fades into the background and we have our AIs calling other people's AIs and shaping the world around us in that way. I think I'll leave it there.

[01:02:39.696] Aaron Santiago: Yeah, I think a big part of the point of this piece, especially as it pertains to AI, is to uncover what that looks like. We're taking a step forward in terms of what AI is capable of by forcing it to give people an experience in real time. Personally, I think VR is still nascent. Right now it's a great tool for us to get people to sit down and feel like they're alone with the tulpa, but it falls to the background in this piece. A lot of the immersion comes from the fact that we isolate you and tell you that everything will be destroyed afterwards, and you're only left with this piece of paper at the end. But there are a lot of discussions that we have with people coming out of it, and between ourselves, like: what does this mean for AI? Will you have automatically generated movies and TV shows everywhere? Or a little personal assistant that tells you bedtime stories on the fly? Or is it stuck where it is now, never getting any better, with people always considering it a party trick? We're just not sure.

[01:03:50.649] Matthew Niederhauser: I think I ranted a little bit about this the last time we talked, around Tribeca. Artificial intelligence is going to have an impact on almost all the information that we consume now, especially for the people who create media. It's going to work its way into almost every type of creative process, whether that's the written word, podcasts, television, or episodic VR. And I still think we are very much in an infancy with the hardware that we strap to our faces, in terms of weight, complexity, and stability. Five or ten years out, it's going to become more seamless. The implications of that could be major and also scary. But there is a part of me, for better or worse, working with these tools and creating custom narratives for people and reactive environments, that thinks this is really the basis for what a truly complex, large-scale metaverse would look like, and it would most likely largely be built by machine learning tools. It's complex and interesting to think about how that could manifest. Is it isolating? Is it truly social? Is it going to consume all the resources on our planet to run? For me, that's still a 30, 40, 50 year time scale in my brain. But maybe machine learning will finally hit a level where it's making even better machine learning, and then we have this cycle of better and better and better, and it's no longer going to be a slow rollout. I think it could fundamentally change a lot of different aspects of our world: how energy is made, how we work and commute and live, and then eventually how we are entertained and communicate. So, Ready Player One situations? I still don't know. There are so many speculative futures out there to explore, and I hope to take part in them for the time that we have. Tulpamancer would be a little drop in the bucket, but we're going to try to stay engaged and continue to use these tools as they come out. There's no doubt in my mind that we're in the middle of a very accelerated pace of development, so it's exciting to use these tools as artists and hopefully bring new context to them and change the dialogue around them in whatever way we can.

[01:06:57.323] Kent Bye: Awesome. Is there anything else that's left unsaid that you'd like to say to the broader immersive community?

[01:07:02.786] Aaron Santiago: Look, a drop in the bucket, even the smallest drop can cause the biggest waves, Matthew.

[01:07:08.829] Matthew Niederhauser: OK. OK. OK. No, it's great to be here. Thanks to Michelle and Liz for including us. About time. And, you know, it's just been really important for us to be here and get this type of feedback, response, the conversations. And yeah, onwards and upwards. See what happens next.

[01:07:35.809] Kent Bye: Yeah, well, I just published a 17-part series on the future of VR and AI. There were a lot of artists I talked to who were pushing for the potentials of AI, and a couple of episodes where I was thinking about the constraints and limits: how do I make sense of and resolve the ethics, and this technology pacing gap where things move so much faster than we can really understand them. So with this piece, I'm very excited about the technical innovations that you've been able to achieve, but I'm still wrestling with the ethical implications of it all and where it's going in the future. I have this sort of mixed reaction where I'm excited about the artistic achievement that you've been able to pull off, but I also have a little bit more skepticism or hesitation in terms of what this all means for where it's all going. But it's very provocative, and it's certainly starting a lot of different conversations. And yeah, I just wanted to thank each of you for joining me here on the podcast to help tell a bit of the origin story of how it all came about and to help break it all down. So thank you so much.

[01:08:32.135] Matthew Niederhauser: Thank you, Kent. Always a pleasure.

[01:08:34.095] Marc Da Costa: Thanks, Kent. Thank you, Kent.

[01:08:36.756] Kent Bye: Thanks for listening to this interview from Venice Immersive 2023. You can go check out the Critics' Roundtable in episode 1305 to get more of a breakdown of each of these different experiences. And I hope to be posting more information on my Patreon at some point. There's a lot to digest here. I'm going to be giving some presentations over the next couple of months, so tune into my Patreon at patreon.com slash voicesofvr, since there's certainly a lot to digest about the structures and patterns of immersive storytelling, some of the different emerging grammar that we're starting to develop, as well as the underlying patterns of experiential design. So that's all I have for today, and thanks for listening to the Voices of VR podcast. And again, if you enjoyed the podcast, then please do spread the word, tell your friends, and consider becoming a member of the Patreon. This is a listener-supported podcast, and so I do rely upon donations from people like yourself in order to continue to bring you this coverage. So you can become a member and donate today at patreon.com slash voicesofvr. Thanks for listening.
