Beta

Intro to Statistics - Part 3: Correlation Coefficients

Statistics
Algorithms
Data Science
  • Voile

    If you actually want to test our knowledge of is_correlation_causal, the initial code can't ship with a return value that already passes the tests: it doesn't teach anything if we don't even have to touch it in the first place. It should at least be something like pass.

  • Voile

    Even the linked Wikipedia article contains multiple definitions of correlation coefficient, so it's unclear which one is expected.

  • jolaf

    In general, what would really make the description and examples clearer is making the DIMENSIONS of the intermediate values clearly visible. Some things are arrays equal in dimension to the original data, others are just scalars. Also, some values are calculated separately for each source data sequence, and some combine data from both sequences. Making these things clear for m, d, v, cd, cv, pd, cc would really help.

    • ChristianECooper

      I've tried to add more detail without actually supplying the algorithm.

      Any better?

    • jolaf

      Better, but still strange. It seems that either I'm reading your words wrong, or there's a flaw in the description.

      Here's a sample calculation that, as far as I can tell, follows your description, and yet the results are clearly wrong:

      data = ((1, 5), (2, 4), (3, 3), (4, 2), (5, 1))
      a = (1, 2, 3, 4, 5)
      b = (5, 4, 3, 2, 1)
      m(a) = m(b) = 3
      d(a) = (-2, -1, 0, 1, 2)
      d(b) = (2, 1, 0, -1, -2)
      cd(a, b) = (-4, -1, 0, -1, -4)
      v(a) = v(b) = (4, 1, 0, 1, 4)
      cv(a, b) = (16, 1, 0, 1, 16)
      pd = sqrt(sum(cv)) = sqrt(34) ~= 5.83
      cc = sum(cd) / pd = -10 / pd ~= -1.71
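
A correlation coefficient always lies between -1 and 1, which is why -1.71 cannot be right. As a point of comparison, here is a minimal Python sketch of the standard Pearson coefficient for the same data. It assumes, as the later replies in this thread suggest, that v is one sum of squared deviations per variable and that pd is the square root of the product of the two variances. The function and variable names are illustrative and not taken from the kata.

```python
from math import sqrt


def correlation_coefficient(a, b):
    """Pearson correlation of two equal-length sequences (illustrative sketch)."""
    n = len(a)
    mean_a = sum(a) / n                      # m(a): scalar
    mean_b = sum(b) / n                      # m(b): scalar
    dev_a = [x - mean_a for x in a]          # d(a): one value per data point
    dev_b = [y - mean_b for y in b]          # d(b): one value per data point
    co_dev = [da * db for da, db in zip(dev_a, dev_b)]  # cd: one per pair
    var_a = sum(da * da for da in dev_a)     # v(a): scalar (assumed per-variable)
    var_b = sum(db * db for db in dev_b)     # v(b): scalar (assumed per-variable)
    pooled = sqrt(var_a * var_b)             # pd: scalar
    return sum(co_dev) / pooled              # cc: scalar


a = (1, 2, 3, 4, 5)
b = (5, 4, 3, 2, 1)
print(correlation_coefficient(a, b))  # -1.0
```

Under this reading the perfectly anti-correlated sample gives exactly -1. The sketch does not guard against a constant sequence, where the denominator would be zero.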
    • ChristianECooper

      This comment has been hidden.

    • jolaf

      Hmm, the trouble is I can't see your previous 'spoiler' comment, which is probably the answer to my 'spoiler' comment. :(

      Could you please 'unspoiler' it for some time for me to read?

      Update: oh, now I see it. It's pretty clear, thanks.

    • jolaf

      Well, in the kata description you say: "Variance [v]: A squared deviation. One per value". If in fact it should be per-variable, could you please correct it?

    • ChristianECooper

      Oh, I thought I had, my mistake, wait a moment!

    • ChristianECooper

      OK, all done now!

    • jolaf

      And therefore the Co-variance definition also seems wrong: "The product of the variance of two variables. One per pair"

    • jolaf

      In examples there's also a couple of strange lines:

      v(x[0], xs) = 4
      cd(x[0], xs, y[0], ys) = -4
      cv(x[0], xs, y[0], ys) = 16


      What do all those x[0] and y[0] mean?

    • jolaf

      And also

      Pooled deviation [pd]: The square root of the sum of co-variances. One.

      If co-variance is a product of variances, it's a single value. So what is a SUM of co-varianceS?

    • ChristianECooper

      OK, I think we are there now! :)

    • jolaf

      This comment has been hidden.

  • jolaf

    This line in the example is really unclear:

    cd(1, [1, 2, 3, 4, 5], 5, [1, 2, 3, 4, 5]) = -4

    In the description it is said:

    Co-deviation [cd]: The product of the deviation of two variables

    Which seems to mean that if the deviations for two variables are (da1, da2, ..., dan) and (db1, db2, ..., dbn) respectively, then the co-deviation is (da1 * db1, da2 * db2, ..., dan * dbn). I don't see how this relates to the cd example above.

    What are the parameters of the cd function? Why is the same [1, 2, 3, 4, 5] passed there twice? What are the 1 and 5 that are passed additionally?
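
One reading that reproduces the quoted value, offered as a guess rather than the kata author's definition: the function takes a single value together with its whole sequence for each variable, and returns the product of the two deviations from the respective means. The Python sketch below assumes that signature; the helper `mean` is illustrative.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def cd(x, xs, y, ys):
    """Co-deviation of one (x, y) pair: (x - mean(xs)) * (y - mean(ys)).

    Assumed signature: x is a value from sequence xs, y a value from ys.
    """
    return (x - mean(xs)) * (y - mean(ys))


# The example from the description, under this assumed reading.
print(cd(1, [1, 2, 3, 4, 5], 5, [1, 2, 3, 4, 5]))  # -4.0

# Applying it pair by pair gives the element-wise product of deviations.
xs = [1, 2, 3, 4, 5]
ys = [5, 4, 3, 2, 1]
print([cd(x, xs, y, ys) for x, y in zip(xs, ys)])  # [-4.0, -1.0, 0.0, -1.0, -4.0]
```

Because [1, 2, 3, 4, 5] and [5, 4, 3, 2, 1] share the mean 3, passing the first list twice yields the same -4 under this reading, which may explain the repeated argument in the example.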