Peace, here are three images: the original, a second with "img2img" only, and a third upscaled 4x with "LDSR img2img upscale". I used the "LDSR Stable Diffusion img2img upscale" and then enhanced the face with "GFPGAN", and the results are incredibly good. We can also get insane quality with "img2img" alone, without the upscaler, but that requires raising the "denoise" value, which loses some of the original image's details in exchange for a massive boost in quality. This method works very well on low-res videos. The model does have downsides: it takes a long time to upscale (it's a free model with little optimization), and on some photos the upscale results are poor (limits of the model's training). I hope Topaz will make a model like this; you will be amazed if you try it. It takes a lot of experimenting to get great upscaled photos, but that's because the idea is revolutionary while the execution is not yet refined, so I hope Topaz considers this concept.
first photo: original.
second photo: img2img upscale only.
third photo: LDSR img2img upscale.
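The two-stage order described above (upscale the whole frame first, then enhance the face) can be sketched with stand-in functions. To be clear, `ldsr_upscale` and `gfpgan_restore` here are toy placeholders I made up to show the pipeline shape, not the real LDSR or GFPGAN APIs:

```python
def ldsr_upscale(image, factor=4):
    # Stand-in: nearest-neighbour 4x duplication of rows and pixels.
    # The real step is the LDSR img2img upscaler.
    return [[px for px in row for _ in range(factor)]
            for row in image for _ in range(factor)]

def gfpgan_restore(image):
    # Stand-in for GFPGAN face enhancement (identity here).
    return image

frame = [[10, 20], [30, 40]]          # tiny 2x2 stand-in image
out = gfpgan_restore(ldsr_upscale(frame))
print(len(out), len(out[0]))          # 8 8: a 4x upscaled frame
```

The point is only the ordering: face restoration runs on the already-upscaled frame, which is the workflow the post describes.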
I don't agree with it. Video AI's positioning is video repair and enhancement, and Stable Diffusion goes beyond that category: it essentially makes a similar image based on the input. Like this one of yours, it adds a lot of content that would never appear in the input.
It's been around for a long time amid copyright disputes, and if you look at its effect on video input, the frames it outputs aren't temporally consistent. So all you get is "artwork" based on the original input, which I believe most people using Video AI would not want.
They are not essentially the same: what you cite is just face enhancement plus global oversharpening. It doesn't add anything that wouldn't otherwise be there, so it still falls under the category of restoration and enhancement.
I agree with you.
But judging from the images posted by the OP, it didn't generate anything new in the image.
It seems he only used the upscaling features that come with Stable Diffusion.
The background is upscaled by the "Latent Diffusion Super Resolution (LDSR) upscaler" and the human face by "GFPGAN".
I can do similar upscaling using CodeFormer. (Not as detailed as LDSR but faster.)
Judging by this picture alone, that should be the case.
I checked out the Latent Diffusion model; it has some things in common with the Stable Diffusion model, so it also falls outside the category of restoration. These models (including the face restoration models) all share the same problem: reasonable use can improve the look and feel, but the distortion they generate is hard to control, just like your CodeFormer output, where the lips are repaired incorrectly.
Here is the tutorial I used to get the third photo's result. I'm not sure, but I think you mean the upscale option under the Extras tab, choosing the LDSR upscaler from the available upscale models in the webui, which is not the one I used here. Here is the link for more clarification:
I know you are not using Stable Diffusion but LDSR; they are similar in some respects. These models are not designed for image restoration, so they are prone to distortion, have relatively low versatility, and suffer from the other problems I mentioned above.
I agree with you, but I mentioned some of the model's downsides in my first post. If I'm not misunderstanding, the model developer (or one of the team) was using a single GTX 1080 for this, so no good training or optimization can be expected. Also, the model is for photos, not videos, so it won't look at the previous and next frames and take their details into account when enhancing the current frame, which will cause flickering, AI artifacts, etc. But the model's concept is very promising if trained well enough, which hasn't happened yet.
To refine my suggestion further, let's imagine that "Topaz" creates a model that works similarly to the "LDSR stable diffusion img2img upscale" model. When you import a video into the model, you can click on any object or draw a box around it to mask it and add a description. For example, you could click on an unidentified red object on a table and type "red telephone", click on the floor and type "parquet floor", or put a mask or box around a person and type "Adam" to identify them.
By doing this across multiple frames of the video, the model will be able to recognize that these objects and people are the same throughout the video, and your descriptions will help the AI understand their identity. This process would be similar to using a Stable Diffusion photo generator, where you add a few words to get amazing photos in return. However, in this case, the video frames already exist, so in every frame the text-to-image model would receive a thousand words or more from the video itself, plus the descriptions you added. As the saying goes, "a picture is worth a thousand words."
To make this even more powerful, a sub-model could convert frames to words to help the main model understand the video better. The potential of this idea is mind-blowing, and it could enable incredible models if it were to become a reality.
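The conditioning idea in this suggestion can be sketched as a toy: merge a per-frame caption (what the "frame-to-words" sub-model would produce) with the user's persistent object labels into one prompt. All the names and the example caption below are made up for illustration; this is not a real API:

```python
def build_frame_prompt(auto_caption, labeled_objects):
    # Merge an automatic per-frame caption with the user's persistent
    # object labels (drawn once, reused across frames) into one
    # conditioning prompt for a hypothetical text-guided video model.
    labels = ", ".join(f"{name} ({desc})"
                       for name, desc in labeled_objects.items())
    return f"{auto_caption}; contains: {labels}"

# Hypothetical labels the user drew boxes for once:
objects = {"Adam": "man in blue jacket",
           "phone": "red telephone on the table"}

prompt = build_frame_prompt("a man standing in a living room", objects)
print(prompt)
# a man standing in a living room; contains: Adam (man in blue jacket), phone (red telephone on the table)
```

Because the same labels are reused for every frame, the objects keep a consistent identity across the whole video, which is the point of the suggestion.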
I know you are looking forward to its future. What I want to express is that, because of this fundamental difference, it will never be able to avoid distortion in every scene (so its versatility is low); the more likely reality is that you will need multiple attempts, adjusted to the input content, to get a usable result. You might think the current models also need multiple adjustments, so it makes no difference, but it would take more than you think, especially when dealing with video.
In the "img2img" tab of the Stable Diffusion webui, increasing the denoising strength can significantly improve the quality of the photo. However, this can also change image details, such as clothing. On the other hand, if you set the denoising to zero, you may not see a noticeable difference in the quality or original details of the image. This is because the model is not specifically designed for upscaling.
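The detail-vs-quality trade-off above can be illustrated with a toy sketch: img2img blends noise into the input in proportion to the denoising strength before regenerating it, so strength 0 keeps the input intact and strength 1 starts from (almost) pure noise. The linear blend below is a deliberate simplification of the actual diffusion forward process:

```python
import random

def noise_init(pixels, strength, seed=0):
    # Toy model of img2img initialization: blend the input with
    # Gaussian noise by `strength`. Higher strength = the denoiser
    # starts further from the original, so more detail is replaced.
    rng = random.Random(seed)
    return [(1.0 - strength) * p + strength * rng.gauss(0.0, 1.0)
            for p in pixels]

frame = [1.0] * 64                    # stand-in for an input frame

low  = noise_init(frame, 0.2)         # keeps most original detail
high = noise_init(frame, 0.8)         # frees the model to reinvent detail

def dev(a, b):
    # mean absolute deviation from the original frame
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

print(dev(low, frame) < dev(high, frame))   # True
```

This is why raising the denoise value both boosts apparent quality and loses original details: the two effects come from the same knob.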
However, if Topaz were to create a similar model for videos, many of these problems could potentially be resolved. This new model could be optimized to handle the unique challenges of video processing, in a way that preserves important details and maintains the overall quality of the video.
The biggest advantage of this type of model is that it introduces content from elsewhere to expand the details of the input, but that is also its biggest disadvantage. If no external content is introduced, it cannot boost the input; if it is introduced, distortion follows. This is not a problem that Topaz can solve.
If you read a lot of posts, you will find that almost every update has people reporting errors in the output results. With a model like this there would be even more such feedback, and it would raise the user's cost of use considerably. Of course, there is a way to eliminate that: fully accept whatever result it gives. But that will not be everyone's choice.
We're talking about low-quality SD videos, where few original details remain anyway, so changing the original details a little in exchange for a massive quality improvement will be very acceptable to a lot of people. The other solution would be to take a high-end camera, go back in time, and reshoot the video.
Whenever there is a lot of "guesswork", any trained model is only as good as its training material. So you are basically limited to the database of images or video used to train it. In common situations this can be enough, but when it runs into unique situations outside the scope of its training data, it has to do more "guesswork", and that leads to all kinds of issues.
For example, I have to use Topaz Gigapixel to upscale a sand texture. Guess what: it does not have enough images of sand in its dataset, so it does not understand what sand is and has to substitute something similar. Hence when I upscale a sand texture it ends up as something else. This is not a problem of the model per se, or of the original image, but of the dataset used to train it. And it's impossible to cover all possibilities. As time goes on, more data is added, sure, but there will always be unique situations. So either there has to be a way to train the model to do something minimally destructive when it cannot replicate something at a greater resolution, or we will keep seeing problems that look like artifacts, glitches, etc.