SpeakSite

I want to build a React app that lets me edit HTML pages using my voice. It should be a progressive web app (PWA), so we need a manifest.json (even though offline support will be limited, since the Whisper API requires a connection).

TECH STACK #
We will use the Whisper API and an OpenAI model (via OpenRouter). Voice recording will be handled with RecordRTC (to ensure recording compatibility with iOS). This initial version is single-user: sessions are saved to Firebase, but no login or multi-user support is required.

OVERVIEW OF APP #
Here's how the app will work: I import a block of HTML, a script segments it into chunks, and the HTML is loaded into an iframe. I select a chunk and talk out the changes; the Whisper API transcribes my voice, and the transcription is wrapped in a prompt along with the selected code chunk and other instructions. The LLM returns the code, edited per my transcribed instructions, and the new code replaces the old code within the iframe. I can then save and export the edited page. This is v1, so we are not adding every feature at once (although some may appear in the UI as placeholders).

OVERVIEW OF PAGES #
There will be 3 pages: import / current / sessions.
IMPORT - user pastes in the HTML to edit; it is imported with a script and saved to Firebase as a new session.
CURRENT - displays the current (most recent) session. The HTML appears in an iframe; the user selects an area to edit, talks out the change, and it is edited.
SESSIONS - a table displaying all sessions. The user can download the final HTML as a zip or continue editing.

NAVBAR #
We will need a navbar with `LOGO.png` on the left and import / current / sessions links on the right.

Here is a detailed overview of each page:

FRONTEND - IMPORT PAGE #
The import page lets users import HTML by pasting it into a big text area and clicking import.
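Before diving into each page, the core loop from the overview (record, transcribe, prompt, replace, save) can be sketched as one async function. This is a minimal sketch, not the final implementation: the three helpers (`transcribe`, `editWithLLM`, `saveDraft`) are illustrative placeholders, injected so the flow can be tested with stubs; in the real app they would wrap the Whisper API, OpenRouter, and Firebase respectively.

```javascript
// Sketch of the core edit loop: record -> transcribe -> prompt -> replace -> save.
// `deps` carries the three hypothetical service wrappers.
async function editSelectedSection(deps, currentHtml, selectedHtml, audioBlob) {
  const { transcribe, editWithLLM, saveDraft } = deps;
  const transcript = await transcribe(audioBlob);             // Whisper API
  const edited = await editWithLLM(selectedHtml, transcript); // LLM via OpenRouter
  const newHtml = currentHtml.replace(selectedHtml, edited);  // swap the chunk in place
  await saveDraft(newHtml);                                   // new timestamped draft
  return newHtml;
}
```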
As well as the `HTML to import` text area we want a few extra fields:
- Title: (text field)
- Type: (radio buttons to select between `Agent X`, `Raw text` and `All HTML`, with `Agent X` the default and the only selectable option to start. Later we will add different rules and use different components for each new type, but for now `Agent X` must be the only checkable option.)
- Folder URL: (text field - just a placeholder for now, not used)
- Tag to divide: (text field - just a placeholder for now, not used)
Then there is a big import button.

BACKEND - IMPORT SCRIPT #
Clicking `import` runs the import-agentx.js script, as follows (later we will add other scripts to import other types).

Firstly, we create a new session with an identifier (e.g. 12345), the title, and all the other fields the user entered. This session will be appended to with SVGs, HTML saves and more, as described next.

Secondly, we search the HTML document for any inline SVGs and, for each one, save it as svg1, svg2, etc., replacing everything from `<svg>` to `</svg>` with [SVG1], [SVG2], etc. Each of these SVGs is then saved to Firebase, associated with this session (we will want to re-add the SVGs later).

Thirdly, we insert separators: before each `<section>` tag, and at the very end of the document (after the last `</section>`), we insert SECTION01, SECTION02, etc. So if the text is

<section>some section content here</section>
<section>more section content here</section>
<section>another section content here</section>

then we would return it like

SECTION01
<section>some section content here</section>
SECTION02
<section>more section content here</section>
SECTION03
<section>another section content here</section>
SECTION04

Finally, we save this HTML, along with a timestamp, associated to this session ID as our first draft. (Every time we save an updated version of the HTML, we add a new draft with a timestamp, so we can always query the most recent draft and easily undo.) We now have a session ID created, with SVGs saved, and the first draft.

CURRENT SESSION #
Once the user imports the HTML and the import-agentx script completes, they are redirected to the `current session` page. (If a user opens the app and visits this page without having imported a page, an 80%-width placeholder.png image should appear instead.) The current session page is the main page, where users view the HTML page they are editing (displayed in an iframe) and request changes to it using their voice. The overall layout: a 70%-width iframe of the HTML on the left and a 25%-width `record/edit` column on the right. At the very bottom is a very thin sticky row which lets the user save or exit (this row is only present on the current session page).

IFRAME ##
The iframe displays the content of the HTML. Users can scroll up and down through the page. The page content should be sized like a normal desktop browser, with no left/right scrolling required (perhaps via the viewport settings). The iframe lets the user click on any part of the HTML, and a cursor.png image then appears at that location to signify they have selected it. At the same time, a listener script listens for clicks (or taps on mobile). In future we will want the ability to listen for clicks on images, but to start with - for this MVP v1 - we only listen for clicks inside `<section>` elements. Essentially, we want to see where the user has clicked by checking which section they clicked within. For example, if the user clicks...

SECTION01
<section>some section content here</section>
SECTION02
<section>more section content here</section>    <-- user clicks somewhere inside here
SECTION03
<section>another section content here</section>
SECTION04

...then we would note that they have clicked between `SECTION02` and `SECTION03`. We store this in the cache as the `current selected HTML`, ready to be used in the LLM prompt and the next script, as discussed later. Important: this process will need to work on mobile too, where we listen for a `tap` instead of a `click`.

RIGHT HAND COLUMN (EDITING AND VOICE RECORDING) ##
To the right of the iframe is the right-hand column (on mobile this could appear responsively below the iframe instead). It contains the following:
- The selected section in h3 text (e.g. `SECTION03`) <-- this changes each time the user clicks/taps a section
- A big RECORD button <-- clicking this starts the voice recording with RecordRTC
- Edit type: users can select between agentx, image, remix, html, text, web - perhaps displayed as a grid of buttons, 3 columns x 2 rows. Initially only the agentx logic is operational, so the other buttons cannot yet be selected. This selector should be a set of buttons, with a color/border change to show which has been selected/active.
- An undo button <-- clicking this undoes the previous change, querying Firebase and loading the previous HTML into the iframe
- A redo button <-- this does the opposite, checking Firebase for the next HTML and loading that into the iframe

BOTTOM FRAME ##
Finally, there is a (vertically small) row stuck to the bottom, divided into three columns:
Session name - on the left
Last saved time - in the middle (we query the timestamp, checking every minute to see when the HTML was last saved for this session, displayed in a format like 59m or 23hr59m or 1d)
Then on the right there are 2 buttons:
SAVE, BUILD & LEAVE <-- clicking this runs the `save-build.js` script described below
SAVE <-- clicking this simply saves the current HTML to Firebase (with the current timestamp)

BACKEND - RECORDING & EDITING #
As mentioned, the core idea is that the user can edit the HTML sections using their voice. Here is how it should work...

Firstly, the user clicks an area of the iframe as described earlier, and the listener spots the section. For example, if the user clicks...

SECTION01
<section>some section content here</section>
SECTION02
<section>more section content here</section>    <-- user clicks somewhere inside here
SECTION03
<section>another section content here</section>
SECTION04

...then we would note that they have clicked between `SECTION02` and `SECTION03`, so that area is saved to the cache as {selected}.

Now the user hits the record button, they are prompted to allow the microphone, and RecordRTC records their audio (as webm or mp4 depending on the platform, e.g. iOS vs Windows). When they hit stop, recording is complete and the audio is sent to the Whisper API to be transcribed.

Next we take this transcription and send it to ChatGPT via OpenRouter, along with the selected text and instructions. This should live in a separate file like agentx-edit-html.js, containing the following initial prompt:

----------------------------------------
I need you to edit some HTML based on some instructions. You are only editing a section of the HTML, so do not add any extra tags like `<html>` or `<body>`. The code you are editing will begin with `<section>`. The edited code you return should also begin with `<section>`. Do not say "here is the code..." or anything like that. Do not write ```html. Just return the edited code, beginning with `<section>`.
The instructions you have been given were transcribed from a voice note, so they may be somewhat rambling or vague. But hopefully you can understand the changes that are required and edit accordingly. Please now return the code based on the instructions.

Here is the code to edit:
HTML CODE TO EDIT:
========================
{selected-html-section appears here}
========================
Here are the instructions to follow:
========================
Please edit the HTML above based on these instructions. Do not add ```html or any comments. Do not replace any images or videos. Do not add extra tags or try to complete the HTML. Simply return the edited code based on the following instructions:
{transcription-from-whisper-api-appears-here}
========================
Now return the edited HTML
----------------------------------------

Note: obviously...
{selected-html-section appears here} --> replaced with the selected text, i.e. everything between SECTION03 and SECTION04, or whichever pair was selected
{transcription-from-whisper-api-appears-here} --> replaced with the transcribed text from the Whisper API

(In future the user will be able to choose different edit types, but in this v1 MVP only the `agentx` option is selectable, which triggers the above prompt.)

Next, after perhaps 5-20 seconds, OpenRouter responds with the updated HTML. We update the page, replacing the original selected area with exactly what OpenRouter returned. We also save this HTML to Firebase, associated with the current session and with a timestamp added. The user can click UNDO if they don't like the change.

ABOUT UPDATING IFRAME - SCROLL RESTORE #
Since we are working with an iframe, there is one thing to note: if we are not careful, the change will cause the frame to reload and jump back to the top. If the user is, say, halfway down the page, this would be annoying. One idea: before reloading, record the iframe's scrollTop (and possibly scrollLeft).
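The replace-and-restore step could start out like the sketch below. Two assumptions that are not specified above: the SECTION markers are written into the HTML as comments (e.g. `<!--SECTION02-->`) so they never render, and the scroll offsets are read via the contentWindow's `scrollX`/`scrollY` (the window-level counterparts of scrollTop/scrollLeft). The window object is passed in as a parameter so the logic can be exercised without a real DOM.

```javascript
// Hypothetical marker format: HTML comments, zero-padded to two digits.
const marker = (n) => `<!--SECTION${String(n).padStart(2, "0")}-->`;

// Replace everything between marker n and marker n+1 with the LLM's edited HTML.
function replaceSectionChunk(html, n, edited) {
  const start = html.indexOf(marker(n));
  const end = html.indexOf(marker(n + 1));
  if (start === -1 || end === -1) return html; // markers missing: leave untouched
  return html.slice(0, start + marker(n).length) + edited + html.slice(end);
}

// Capture the iframe's scroll offsets before the reload...
function captureScroll(win) {
  return { x: win.scrollX, y: win.scrollY };
}

// ...and reapply them once the new content has loaded.
function restoreScroll(win, pos) {
  win.scrollTo(pos.x, pos.y);
}

// Illustrative usage (requires a same-origin iframe):
// const pos = captureScroll(iframe.contentWindow);
// iframe.srcdoc = replaceSectionChunk(currentHtml, 2, editedHtml);
// iframe.addEventListener("load",
//   () => restoreScroll(iframe.contentWindow, pos), { once: true });
```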
After setting the new HTML, reapply that scroll position. This is fairly straightforward as long as you can access iframe.contentWindow and the iframe is same-origin (e.g. a local data URL or blob). This means the iframe loads in the right place.

Obviously the user can keep updating different sections - clicking a section, recording instructions to change it via Whisper > OpenRouter, then UNDO/REDO or SAVE when they are happy.

SAVE-BUILD.JS SCRIPT #
When the user is happy with the page and ready to save, they click `Save, Build & Leave` on the bottom sticky row. This saves the latest HTML for the session. It will then:
a) remove all SECTION01, SECTION02, SECTION03, etc. markers from the HTML
b) replace [SVG1], [SVG2], etc. with the SVGs saved with this particular session
This gives us our updated HTML file. But this file cannot be imported back into the tool, so we need to save it as a separate type from the other HTML drafts. So, finally, the script also saves a second type of HTML (also associated with the session ID), perhaps a `final-build` type.

Finally, the user is redirected back to the `Sessions` page, with their new HTML in the top row and an option to download.

SESSIONS PAGE #
Here all HTML edit sessions are displayed in a table with columns: title / last edited / preview / edit / download page.
Title - simply the name of the session (which the user gave on the import page)
Last edited - the timestamp of the last HTML edit
Preview - a placeholder link which doesn't work (we will add this in v2)
Edit - a link to open the latest HTML in the `current session` window and continue editing
Download Page - a link to download the HTML (the final `save-build` type, with the SVGs restored and the SECTION01/02/etc. markers removed)

V2 #
In future we will add the ability to import a second type (`HTML`) as well as AgentX.
We will also introduce new `edit types`. (Remember, currently only the `agentx` type is operational, triggering the Whisper > OpenRouter sequence above; in future we will add `image`, `remix`, `html`, `text` and `web` types, which will use different APIs and have different logic.) We will also add multi-user login and database storage, and perhaps a settings page which lets the user control prompts, choose models, etc.

Hopefully this is all clear and you can proceed to code the application.
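As one concrete starting point, the build step described under SAVE-BUILD.JS SCRIPT above could be a single pure function. Both formats here are assumptions rather than finalized decisions: the SECTION markers are assumed to be HTML comments, and the SVG placeholders are assumed to be literal `[SVG1]`-style tokens keyed to `svg1`, `svg2`, etc. in the session record.

```javascript
// Build the final (exportable) HTML: strip the SECTION markers inserted at
// import time and swap each [SVGn] placeholder for the SVG markup saved
// with the session.
function buildFinalHtml(html, svgs) {
  // Remove every marker (assumed format: <!--SECTION01-->, <!--SECTION02-->, ...).
  let out = html.replace(/<!--SECTION\d+-->/g, "");
  // Restore [SVG1], [SVG2], ... (case-insensitive); unknown placeholders
  // are left untouched rather than silently dropped.
  out = out.replace(/\[SVG(\d+)\]/gi, (match, n) => svgs[`svg${n}`] ?? match);
  return out;
}
```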



O1 Response


Gemini Response


Claude Response



Final Consensus


Files To Code


API Template


SESSION 1 - APP.JS AND APP.CSS


SESSION 2 - API FILE(S)


SESSION 3 - COMPONENTS PT1


SESSION 4 - COMPONENTS PT2


SESSION 5 - PAGES PT1


SESSION 6 - PAGES PT2


SESSION 7 - EXTRA FILES


SESSION 8 - README


SESSION 9 - DEBUG SUMMARY