Frame-by-frame detailed video analysis


I use Affectiva’s valence measure in real time, but after a game I’d like to analyze the player’s face in maximum detail. To do this, I record a video as they play, and I’d like to use something like a slow-motion replay to extract all the Affectiva emotion and expression measures from every frame of video. It’s ok if this takes a long time. I’m getting close by using Unity’s VideoPlayer instead of a MovieTexture and setting the play speed to be slower. Now I’m stuck on the Affectiva side.

Is there a way for me to use Affectiva’s Unity SDK to feed a video to Affectiva frame by frame, making sure that every frame is analyzed exactly once?


Hi, the Affectiva Developer Portal describes processing a recorded video, or processing individual frames one at a time.


Thanks for the reply, andy_dennie! Those are the instructions I worked off of. The VideoFileInput.cs example passes individual frames, but it still takes a sample rate as input and starts a coroutine at that rate rather than detecting new frames as they come in. In practice, setting that sample rate to be equal to the movie frame rate has resulted in more Affectiva outputs than frames in the movie, and I can’t tell which frames were analyzed more than once.

I think the ideal setup would be to advance the movie one frame, send that frame to Affectiva, then advance another frame. Do you know if there’s a way to do that?


Hi, yes, just decode the video yourself, then feed the Frames individually to the FrameDetector using the ProcessFrame method.
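
A minimal sketch of that idea (here `detector` is any Affdex Detector component in the scene, and the pixels, dimensions, and timestamp come from your own decoding step — treat the names as placeholders):

```csharp
using Affdex;
using UnityEngine;

public class DecodedFrameSender : MonoBehaviour
{
    public Detector detector;   // assumed: an Affdex detector component in the scene

    // Wrap one decoded video frame and hand it to the detector.
    // 'timestampSeconds' should increase monotonically across frames.
    public void SendDecodedFrame(Color32[] pixels, int width, int height, float timestampSeconds)
    {
        Frame frame = new Frame(pixels, width, height, Frame.Orientation.Upright, timestampSeconds);
        detector.ProcessFrame(frame);
    }
}
```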


Hi andy_dennie - I think I’ve got it working! Unity’s VideoPlayer object has a StepForward method that I had missed earlier. Now the problem I’m encountering is that Affectiva’s original measure of Valence from the webcam stream is different from its measure of Valence from the same stream when it’s read back in as a video.
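
For reference, the step-advance loop looks roughly like this (a sketch; `movie` is the VideoPlayer and `AnalyzeCurrentFrame` is a placeholder for the pixel-grab-and-ProcessFrame code shown further down):

```csharp
// Drive the paused VideoPlayer one frame per analysis pass.
// With sendFrameReadyEvents on, frameReady fires once per decoded frame,
// so each frame is analyzed exactly once.
void StartReplay()
{
    movie.sendFrameReadyEvents = true;
    movie.frameReady += (vp, frameIdx) =>
    {
        AnalyzeCurrentFrame();   // grab pixels, build a Frame, call ProcessFrame
        vp.StepForward();        // request the next frame; frameReady fires again once it's decoded
    };
    movie.Play();
    movie.Pause();   // paused + StepForward = manual frame-by-frame advance
}
```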

Any idea why this might be happening? I’m working off of Unity’s webcam and video examples. Since the 2nd frame above looks positive but the orig line is at neutral, I’m suspicious of the original webcam-processing code. Here’s my ProcessFrame() code for sending the original webcam stream to Affectiva and saving it to a video:

  Frame.Orientation orientation = Frame.Orientation.Upright;

  //dj[c] get frame
  var framePixels = cameraTexture.GetPixels32();
  var tFrame = Time.realtimeSinceStartup;

  //dj[c] Send frame to Affectiva
  Frame frame = new Frame(framePixels, cameraTexture.width, cameraTexture.height, orientation, tFrame);
  detector.ProcessFrame(frame);

  //dj[c] Log the frame time and index (the video frames themselves are written elsewhere)
  if (saveToFile)
      writerScript.write("t=" + tFrame + ", Frame=" + iFrame);

  // increment frame index
  iFrame = iFrame + 1;

Here’s my ProcessFrame() code for reading the movie back in and sending it to Affectiva during the replay:

  //A render texture is required to copy the pixels from the movie clip
  RenderTexture rt = RenderTexture.GetTemporary((int)movie.clip.width, (int)movie.clip.height, 0, RenderTextureFormat.ARGB32, RenderTextureReadWrite.Default, 1);

  //Copy the movie texture to the render texture
  Graphics.Blit(movie.texture, rt);

  //Read the render texture to our temporary texture
  RenderTexture.active = rt;
  t2d.ReadPixels(new Rect(0, 0, rt.width, rt.height), 0, 0);

  //apply the bytes and release the temporary render texture
  t2d.Apply();
  RenderTexture.active = null;
  RenderTexture.ReleaseTemporary(rt);

  //Send to the detector; the timestamp in seconds is the frame index divided by the frame rate
  //Frame frame = new Frame(t2d.GetPixels32(), t2d.width, t2d.height, Frame.Orientation.Upright, Time.realtimeSinceStartup * movie.playbackSpeed);
  Frame frame = new Frame(t2d.GetPixels32(), t2d.width, t2d.height, Frame.Orientation.Upright, (float)(movie.frame / movie.clip.frameRate));
  detector.ProcessFrame(frame);
  print("Sent frame " + movie.frame);



Hmm. Well, tough to say what’s causing this, but I’ll mention a few possibilities…

  • are you sure that the timestamps from the camera and the timestamps in the video file match up?
  • the detector may drop frames to keep up with the rate at which frames are being fed to it, so it could be that some frames processed from the camera feed are not processed from the video, or vice versa (however, if the timestamps for the processed frames line up, then that’s not an issue).
  • if the video resolution is different from the resolution of the frames from the camera, that might have some effect
  • the process of encoding to the video file format and then decoding back may have been “lossy”, such that the frames processed from the video are not identical to the frames from the camera
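
To check the first two items, one option is to log the timestamp of every frame actually handed to the detector in both runs and diff the two logs afterward; a minimal sketch (file paths are placeholders):

```csharp
using System.IO;

public static class FrameLog
{
    // Append one CSV line per processed frame. Run with "live.csv" during
    // the webcam session and "replay.csv" during the video replay, then
    // diff the files to spot dropped or duplicated frames.
    public static void LogProcessedFrame(string path, long frameIdx, float timestampSeconds)
    {
        File.AppendAllText(path, frameIdx + "," + timestampSeconds + "\n");
    }
}
```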


Thanks for these suggestions. I’ve checked the resolution, which does match. We’re using 1280x720 video with a large face - could the lossy compression make a big difference? For now, I’m focusing on timestamps. I’ve been trying to get the timestamps as reliable as possible by recording both the times when frames were sent to Affectiva and the times when results were received. Is there a way to get the frame time from within onImageResults (that is, the timestamp sent to detector.processFrame() along with that frame)? This would help me determine which frames were dropped when aligning with the replay.

By calling ProcessFrame() from within FixedUpdate, I’ve gotten sampling to a pretty even 20Hz (see middle row below), so I was hopeful I could just use iFrame / frameRate as the timestamp sent to Affectiva in the replay. Do you think I need to read in the exact frame times from the original run’s log instead? (Note that “movie(reported)” is the timestamps sent with webcam frames and “orig” is the times when image results were received.)


The onImageResults callback includes a Frame parameter, and you can call getTimestamp() on that Frame. That may help you “line up” the frame results processed from the recorded video with the corresponding frame results processed from the camera stream.


That would be helpful! I don’t see the frame parameter of onImageResults, though, just a dictionary with ints and faces. Is there a way to get the frame & timestamp from that?


Arg, sorry, my bad - I was looking at the C++ onImageResults signature. OK, that’s not going to work, then. Looking at your graphs above, that ultimately isn’t the cause of your issue anyway, but it might have been helpful.

Looking at your code above a little more, I notice that you’re not using tFrame when writing the frame to the video file, nor are you reading and using the frame timestamp from the recorded video file. Is there a way you could do that?


The MediaEncoder I’m using won’t write frame times to the video (it assumes a constant sampling rate). But I do write out the frame times to a log, and I could have the replay read those frame times back in.
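
Reading those logged times back in for the replay could be as simple as this sketch (assuming the log is written as one timestamp per line; the file name and one-per-line format are assumptions):

```csharp
using System.IO;
using System.Linq;

// Load the per-frame timestamps recorded during the live run, so the
// replay can send Affectiva the original timestamps instead of a
// constant-rate approximation. frameTimes[n] is the time of frame n.
float[] LoadFrameTimes(string path)
{
    return File.ReadAllLines(path)
               .Select(float.Parse)
               .ToArray();
}
```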

As an alternative, I’m considering just trying to write out all Affectiva expressions & emotions in real-time, which we could maybe accomplish by analyzing/recording fewer frames per second, and doing away with the replay. Does this sound like an ok approach to you?
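
A sketch of that real-time logging inside an image-results listener (I’m assuming the Unity SDK’s onImageResults(Dictionary&lt;int, Face&gt;) shape mentioned above, and that Face exposes its emotion and expression scores as key/value collections - worth verifying the exact property names against the SDK; `writerScript` is the same log writer as before):

```csharp
using System.Collections.Generic;
using Affdex;

public class MetricLogger : ImageResultsListener
{
    public WriterScript writerScript;   // placeholder: whatever writes your log lines

    public override void onFaceFound(float timestamp, int faceId) { }
    public override void onFaceLost(float timestamp, int faceId) { }

    public override void onImageResults(Dictionary<int, Face> faces)
    {
        foreach (var entry in faces)
        {
            Face face = entry.Value;
            // assumed shape: metric-name -> score collections; verify in the SDK
            foreach (var emotion in face.Emotions)
                writerScript.write("emotion," + emotion.Key + "," + emotion.Value);
            foreach (var expression in face.Expressions)
                writerScript.write("expression," + expression.Key + "," + expression.Value);
        }
    }
}
```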


Yes, that sounds simpler.